unique identifiers as a separate library

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
18 messages Options
Reply | Threaded
Open this post in threaded view
|

unique identifiers as a separate library

Sebastian Fischer
Hello,

for a project I am using the libraries Unique, UniqSupply, and UniqFM  
from the package ghc. It seems odd to have a dependency on a whole  
compiler only in order to use its (highly efficient and, hence,  
preferred) implementation of unique identifiers.

Would it be possible to put everything concerned with unique  
identifiers in GHC into a separate package on Hackage?

This may also allow me to get a fix for <http://hackage.haskell.org/trac/ghc/ticket/2880 
 > without reinstalling GHC.

Cheers,
Sebastian

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Simon Marlow-7
Sebastian Fischer wrote:

> for a project I am using the libraries Unique, UniqSupply, and UniqFM
> from the package ghc. It seems odd to have a dependency on a whole
> compiler only in order to use its (highly efficient and, hence,
> preferred) implementation of unique identifiers.
>
> Would it be possible to put everything concerned with unique identifiers
> in GHC into a separate package on Hackage?
>
> This may also allow me to get a fix for
> <http://hackage.haskell.org/trac/ghc/ticket/2880> without reinstalling GHC.

Sure, that would be a useful chunk to separate out from GHC.  However,
looking at the code I see that our unique supply monad is really not a lazy
  monad at all:

thenUs :: UniqSM a -> (a -> UniqSM b) -> UniqSM b
thenUs expr cont us
   = case (expr us) of { (result, us') -> cont result us' }

which is strict, and even the lazy version:

lazyThenUs :: UniqSM a -> (a -> UniqSM b) -> UniqSM b
lazyThenUs (USM expr) cont
   = USM (\us -> let (result, us') = expr us in unUSM (cont result) us')

doesn't really split the supply, because it will force the left side as
soon as the unique on the right side is demanded.

Given that our monad is strict, there's no need for it to use
mkSplitUniqueSupply, it could just call genSym to create new uniques.  I
notice there are other parts of the compiler that do make use of the lazy
splittable unique supply in their own monads, but I'm not sure if they
really need it.

Cheers,
        Simon
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Sebastian Fischer
In reply to this post by Sebastian Fischer
On Dec 17, 2008, at 10:54 AM, Sebastian Fischer wrote:

> Would it be possible to put everything concerned with unique  
> identifiers in GHC into a separate package on Hackage?


I have wrapped up (a tiny subset of) GHC's uniques into the package  
`uniqueid` and put it on Hackage:

     http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uniqueid

It only provides

     type Id
     hashedId :: Id -> Int

     type IdSupply
     initIdSupply  :: Char -> IO IdSupply
     splitIdSupply :: IdSupply -> (IdSupply,IdSupply)
     idFromSupply  :: IdSupply -> Id

     instance Eq Id
     instance Ord Id
     instance Show Id

The main difference is due to my fear of depending on the foreign  
function `genSymZh` which I replaced by a global counting IORef.

The other difference is that the Show instance does not rely on GHC's  
static flags and can hence be used outside of GHC sessions.

The code is on github:

     http://github.com/sebfisch/uniqueid

Extensions welcome!

Cheers,
Sebastian
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Isaac Dupree
Sebastian Fischer wrote:
> On Dec 17, 2008, at 10:54 AM, Sebastian Fischer wrote:
>
>> Would it be possible to put everything concerned with unique
>> identifiers in GHC into a separate package on Hackage?
>
>
> I have wrapped up (a tiny subset of) GHC's uniques into the package
> `uniqueid` and put it on Hackage:

thanks!

> The main difference is due to my fear of depending on the foreign
> function `genSymZh` which I replaced by a global counting IORef.

which is its own risk.  maybe you should NOINLINE it?

Potential code criticisms / suggestions for it as a library:

Unboxed: so it only works on GHC, even though others have
unsafe IO too.  In theory, strictness annotations should be
able to achieve the same efficiency.

"Char" is supposed to represent a Unicode character -- but
this code behaves oddly:
For 64-bit Int#, it does so.
For 32-bit Int#, it assumes Char is within the first 8 bits
(ASCII and a little more).
If Int# (or Int) can be 30-bit (like Haskell98 permission),
its correctness suffers even worse.
Is it really even a necessary part of the design?  The only
way you provide to extract it or depend on its value is
indirectly via the "Show" instance.  Its presence there is,
in any case, at the cost of max. 2^24 (16 million) IDs
before problems happen, whereas billions is still not a
great limit but at least is somewhat larger. (applications
that are long-running or deal with huge amounts of data
could be affected)

unsafeDupableInterleaveIO: this "Dupable" was safe for GHC
to use because GHC is single-threaded.  Is it safe in a
library setting?  I guess likewise, the IORef global
variable wouldn't be thread-safe... but this one isn't even
safe between separate runs of initIdSupply.  On the other
hand, thread-safety probably makes it much less efficient
(if you can find a way to use atomic int CPU instructions,
it might not be too bad, or else per-thread counters... or
just declare how unsafe it is)

unsafePerformIO: it's not totally necessary here.  Its only
function is to make IDs generated by different runs of
initIdSupply be distinct.  So it could, anyway, probably be
refactored to only use unsafePerformIO global-ness once per
initIdSupply and just use unsafeInterleaveIO within (where
currently nextInt is called).

-Isaac
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

ajb@spamcop.net
In reply to this post by Sebastian Fischer
G'day all.

Quoting Sebastian Fischer <[hidden email]>:

> I have wrapped up (a tiny subset of) GHC's uniques into the package
> `uniqueid` and put it on Hackage:
>
>     http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uniqueid

First off, thanks for this.

> The main difference is due to my fear of depending on the foreign
> function `genSymZh` which I replaced by a global counting IORef.

Why not depend on this instead?

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/value-supply

Cheers,
Andrew Bromage
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Simon Marlow-7
[hidden email] wrote:

> G'day all.
>
> Quoting Sebastian Fischer <[hidden email]>:
>
>> I have wrapped up (a tiny subset of) GHC's uniques into the package
>> `uniqueid` and put it on Hackage:
>>
>>     http://hackage.haskell.org/cgi-bin/hackage-scripts/package/uniqueid
>
> First off, thanks for this.
>
>> The main difference is due to my fear of depending on the foreign
>> function `genSymZh` which I replaced by a global counting IORef.
>
> Why not depend on this instead?
>
> http://hackage.haskell.org/cgi-bin/hackage-scripts/package/value-supply

Looking at the code for this, I'm somewhat suspicious that it actually
works with GHC:

   -- The extra argument to ``gen'' is passed because without
   -- it Hugs spots that the recursive calls are the same but does
   -- not know that unsafePerformIO is unsafe.
   where gen _ r = Node { supplyValue  = unsafePerformIO (genSym r),
                          supplyLeft   = gen False r,
                          supplyRight  = gen True r }

even if that extra Bool argument is enough to fool Hugs, I wouldn't count
on it being enough to fool GHC -O2!  You probably want to use
unsafeInterleaveIO like we do in GHC's UniqSupply library.

Also, I'd replace the MVar with an IORef and use atomicModifyIORef for speed.

Cheers,
        Simon
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Simon Marlow-7
In reply to this post by Isaac Dupree
Isaac Dupree wrote:

> Sebastian Fischer wrote:
>> On Dec 17, 2008, at 10:54 AM, Sebastian Fischer wrote:
>>
>>> Would it be possible to put everything concerned with unique
>>> identifiers in GHC into a separate package on Hackage?
>>
>>
>> I have wrapped up (a tiny subset of) GHC's uniques into the package
>> `uniqueid` and put it on Hackage:
>
> thanks!
>
>> The main difference is due to my fear of depending on the foreign
>> function `genSymZh` which I replaced by a global counting IORef.
>
> which is its own risk.  maybe you should NOINLINE it?
>
> Potential code criticisms / suggestions for it as a library:
>
> Unboxed: so it only works on GHC, even though others have unsafe IO
> too.  In theory, strictness annotations should be able to achieve the
> same efficiency.
>
> "Char" is supposed to represent a Unicode character -- but this code
> behaves oddly:
> For 64-bit Int#, it does so.
> For 32-bit Int#, it assumes Char is within the first 8 bits (ASCII and a
> little more).
> If Int# (or Int) can be 30-bit (like Haskell98 permission), its
> correctness suffers even worse.
> Is it really even a necessary part of the design?  The only way you
> provide to extract it or depend on its value is indirectly via the
> "Show" instance.  Its presence there is, in any case, at the cost of
> max. 2^24 (16 million) IDs before problems happen, whereas billions is
> still not a great limit but at least is somewhat larger. (applications
> that are long-running or deal with huge amounts of data could be affected)
>
> unsafeDupableInterleaveIO: this "Dupable" was safe for GHC to use
> because GHC is single-threaded.  Is it safe in a library setting?

It would be safe if the genSym was atomic - so I recommend using
atomicModifyIORef.

Cheers,
        Simon

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Iavor Diatchki
In reply to this post by Simon Marlow-7
Hello,
I have made the two changes that Simon suggested and uploaded a new
version of the library.   By the way, GHC seemed to work correctly
even without the extra boolean parameter, perhaps it treats
unsafePerformIO specially somehow?  A somewhat related question:  I
ended up using three calls to unsafeInterleaveIO which seems like a
bit much.  Could I have done it in a different way somehow?  This is
what I did:

gen r = do v <- unsafeInterleaveIO (genSym r)
                   ls <- unsafeInterleaveIO (gen r)
                   rs <- unsafeInterleaveIO (gen r)
                   return (Node v ls rs)

Note that a single unsafeInterleaveIO around the whole do block is not
quite right (this is what the code in the other package does) because
this will increment the name as soon as the generator object is
forced, and we want the name to be increment when the name is forced.

-Iavor




On Fri, Dec 19, 2008 at 1:24 AM, Simon Marlow <[hidden email]> wrote:

>> Why not depend on this instead?
>>
>> http://hackage.haskell.org/cgi-bin/hackage-scripts/package/value-supply
>
> Looking at the code for this, I'm somewhat suspicious that it actually works
> with GHC:
>
>  -- The extra argument to ``gen'' is passed because without
>  -- it Hugs spots that the recursive calls are the same but does
>  -- not know that unsafePerformIO is unsafe.
>  where gen _ r = Node { supplyValue  = unsafePerformIO (genSym r),
>                         supplyLeft   = gen False r,
>                         supplyRight  = gen True r }
>
> even if that extra Bool argument is enough to fool Hugs, I wouldn't count on
> it being enough to fool GHC -O2!  You probably want to use
> unsafeInterleaveIO like we do in GHC's UniqSupply library.
>
> Also, I'd replace the MVar with an IORef and use atomicModifyIORef for
> speed.
>
> Cheers,
>        Simon
>
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Simon Marlow-7
Iavor Diatchki wrote:

> Hello,
> I have made the two changes that Simon suggested and uploaded a new
> version of the library.   By the way, GHC seemed to work correctly
> even without the extra boolean parameter, perhaps it treats
> unsafePerformIO specially somehow?  A somewhat related question:  I
> ended up using three calls to unsafeInterleaveIO which seems like a
> bit much.  Could I have done it in a different way somehow?  This is
> what I did:
>
> gen r = do v <- unsafeInterleaveIO (genSym r)
>                    ls <- unsafeInterleaveIO (gen r)
>                    rs <- unsafeInterleaveIO (gen r)
>                    return (Node v ls rs)

I'd probably do it like this:

 > gen r = unsafeDupableInterleaveIO $ do
 >            v <- unsafeDupableInterleaveIO (genSym r)
 >            ls <- gen r
 >            rs <- gen r
 >            return (Node v ls rs)

which is close to the way GHC does it, except that we do indeed call genSym
for every node.  Calling genSym is cheaper than building the thunk for
unsafeDupableInterleaveIO, although if there's an atomicModifyIORef
involved that will probably tip the balance the other way.

Cheers,
        Simon
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Sebastian Fischer
Thanks for pointing me at Iavors package. We should not have two  
different libraries for the same purpose.

value-supply-0.2 is about 6-7 times slower than GHC's UniqSupply, but  
porting Simons suggestion (to use unsafeDupableInterleaveIO) into  
Iavors code, GHC is "only" about twice as fast.

Can we do better? I tried to use SPECIALIZE INLINE pragmas (at  
new[Enum|Num]Supply for Int) and/or strictness annotations (for the  
unique value) without significant benefit.

I like Iavors more general interface (polymorphic values) and that  
only demanded identifiers are created. The latter saves half of the  
names when evaluating

 > map supplyValue . split

I will replace my dependency to ghc with one to value-supply if  
unsafeDupableInterleaveIO is used instead of unsafeInterleaveIO where  
it is available.

Cheers,
Sebastian
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Iavor Diatchki
Hi,
Thanks for the feedback!  I have uploaded a new version of value
supply with the following changes that should improve performance.
  - It uses unsafeDupableInterleaveIO to avoid double locking,
  - Do not create an intermediate list for Enum and Num supples
  - Specialize code for Int, as this is the most commonly used type
for new name generation.

These together should make things much faster but I have not had a
chance to test that.  I looked at the core and it seemed that the
specializations were kicking in, so hopefully all is OK :)
It would be great if you could let me know if things are still slower
than the GHC version, and if so by how much.
Also, I made a git-hub repo for the code, in case anyone is interested
in contributing:
http://github.com/yav/value-supply/tree/master
-Iavor


On Mon, Dec 22, 2008 at 3:27 AM, Sebastian Fischer
<[hidden email]> wrote:

> Thanks for pointing me at Iavors package. We should not have two different
> libraries for the same purpose.
>
> value-supply-0.2 is about 6-7 times slower than GHC's UniqSupply, but
> porting Simons suggestion (to use unsafeDupableInterleaveIO) into Iavors
> code, GHC is "only" about twice as fast.
>
> Can we do better? I tried to use SPECIALIZE INLINE pragmas (at
> new[Enum|Num]Supply for Int) and/or strictness annotations (for the unique
> value) without significant benefit.
>
> I like Iavors more general interface (polymorphic values) and that only
> demanded identifiers are created. The latter saves half of the names when
> evaluating
>
>> map supplyValue . split
>
> I will replace my dependency to ghc with one to value-supply if
> unsafeDupableInterleaveIO is used instead of unsafeInterleaveIO where it is
> available.
>
> Cheers,
> Sebastian
> _______________________________________________
> Glasgow-haskell-users mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
>
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Isaac Dupree
Iavor Diatchki wrote:
>   - It uses unsafeDupableInterleaveIO to avoid double locking,

in particular,

> gen r = unsafeDupableInterleaveIO
>               $ do v <- unsafeDupableInterleaveIO (genSym r)
>                    ls <- gen r
>                    rs <- gen r
>                    return (Node v ls rs)

where is the double locking?  We want referential
transparency...

e.g. suppose you use newNumSupply to create a thunk for a
Gen; when evaluated, it will run unsafeDupableInterleaveIO.
  You send that thunk off to two different other threads.
Then those threads decide to evaluate it (say, enough to get
the first genSym'd value) and happen to make a race
condition, so it becomes two separate IO computations.
Therefore one of them runs atomicModifyIORef, and the other
one runs atomicModifyIORef, and so they get two different
values out.

Node 0 (...) (...)
Node 1 (...) (...)

when it's suppose to be the very same Gen data structure.

so, am I right or wrong about what the perils of
unsafeDupableInterleaveIO are?

I could see changing (unsafe[Dupable]InterleaveIO (genSym
r)) to (genSym r), to halve the number of
unsafeInterleaveIOs needed if we assume that most of the
time a node is evaluated in order to get a value... but it's
hard to see a good way to make *fewer* InterleaveIOs than
there are genSym'd values.  (possible, but hard, and really
depends on the relative expenses/risks of locking, of
computing the next number, and of using up the "address
space" of all possible Ints for example).  Maybe the outer
InterleaveIO could "strictly" make a few levels of Nodes
(with lazy genSym'd values this time) before "interleaving"
again, to reduce the amount of interleaving from the
non-semantics-changing side.

-Isaac
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Iavor Diatchki
Hi,
Sigh.  I think that Isaac is quite right.  Even though I think that it
would be quite rare for multiple threads to share the same name supply
in practise, I would really rather have safe code and not have to
think about it.  So... I have reverted to the non-dupable versions by
default.  I also added an "unsafeNewIntSupply" that uses that dupable
primitives, for those who like living on the edge :)   Thanks to
Sebastian for the bench marking program!   The performance numbers for
the current version on hackage (0.4) are as follows:

With safe primitives value-supply is about 7 times slower, with the
unsafe ones it is about 2 times slower (a bit less actually).  Even
though this seems like a big difference, the actual time it takes to
generate names is very small:  I have to generate about 10 million
names to get reliable measurements, so I am not sure if the difference
matters in practise.

If anyone has further ideas, please chime in.

-Iavor


On Tue, Dec 23, 2008 at 8:30 AM, Isaac Dupree <[hidden email]> wrote:

> Iavor Diatchki wrote:
>>
>>  - It uses unsafeDupableInterleaveIO to avoid double locking,
>
> in particular,
>
>> gen r = unsafeDupableInterleaveIO
>>              $ do v <- unsafeDupableInterleaveIO (genSym r)
>>                   ls <- gen r
>>                   rs <- gen r
>>                   return (Node v ls rs)
>
> where is the double locking?  We want referential transparency...
>
> e.g. suppose you use newNumSupply to create a thunk for a Gen; when
> evaluated, it will run unsafeDupableInterleaveIO.  You send that thunk off
> to two different other threads. Then those threads decide to evaluate it
> (say, enough to get the first genSym'd value) and happen to make a race
> condition, so it becomes two separate IO computations. Therefore one of them
> runs atomicModifyIORef, and the other one runs atomicModifyIORef, and so
> they get two different values out.
>
> Node 0 (...) (...)
> Node 1 (...) (...)
>
> when it's suppose to be the very same Gen data structure.
>
> so, am I right or wrong about what the perils of unsafeDupableInterleaveIO
> are?
>
> I could see changing (unsafe[Dupable]InterleaveIO (genSym r)) to (genSym r),
> to halve the number of unsafeInterleaveIOs needed if we assume that most of
> the time a node is evaluated in order to get a value... but it's hard to see
> a good way to make *fewer* InterleaveIOs than there are genSym'd values.
>  (possible, but hard, and really depends on the relative expenses/risks of
> locking, of computing the next number, and of using up the "address space"
> of all possible Ints for example).  Maybe the outer InterleaveIO could
> "strictly" make a few levels of Nodes (with lazy genSym'd values this time)
> before "interleaving" again, to reduce the amount of interleaving from the
> non-semantics-changing side.
>
> -Isaac
>
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Isaac Dupree
Hi again Iavor,

A couple performance ideas if you want to test them:

unsafeInterleaveIO is cheap until you need to evaluate its
result.  So how about this, I think it makes there be 1/3 as
many "structural" unsafeInterleaveIO's, so if it took "2"
amount of time on unsafeInterleaveIO:ing before, it should
take "1.33" time on it after this: and just a bit more
time/memory to construct Nodes that might not be used.

gen r = unsafeInterleaveIO $ do
                   v <- unsafeInterleaveIO (genSym r)
                   n1 <- gen r; n2 <- gen r; n3 <- gen r; n4
<- gen r
                   return (Node v1 (Node v2 n1 n2) (Node v3
n3 n4))

I also feel tempted to apply the
static-argument-transformation manually,
where
    gen r = gen'
      where
        gen' = unsafeInterleaveIO $ do
           v <- unsafeInterleaveIO (genSym r)
           n1 <- gen'; n2 <- gen' --etc.
           return (Node ...)

or similar

which I guess is safe because this is only
unsafeInterleaveIO, not unsafePerformIO?  Dunno if it'd be
speed-beneficial though.

version 0.4:
> genericNewSupply :: b -> (IORef b -> IO a) -> IO (Supply a)
> genericNewSupply start genSym = gen =<< newIORef start
>   where gen r = unsafeInterleaveIO
>               $ do ls <- gen r
>                    rs <- gen r
>                    return (Node (unsafePerformIO (genSym r)) ls rs)

Why unsafePerformIO, was it faster?(i'd guess slower
actually, as unsafePerformIO is NOINLINE..)  It's
considerably less safe than unsafeInterleaveIO!  For
example, do the static-argument-transformation above, then
float out the unsafePerformIO because it's the same
expression each time through gen', and suddenly the all the
"unique" values are all the same!

we can make this value-supply very good ultimately :-)

also, I might call "unsafeNewIntSupply" something more
specific, like "unthreadsafeNew...", or the more obscure but
conventional "dupable" description-word.  Did it help
specializing that to Int, i.e. why not
"unsafeGenericNewSupply"? because I can imagine a simple
data that's not an Int, where you'd still want to avoid the
thread-safety overhead.  Also, your implementation of it
could be more efficient: it doesn't need to do locking, so I
suggest modifyIORef rather than atomicModifyIORef (Actually
you'll have to use readIORef >>= writeIORef >> return,
instead, because modifyIORef has a different type than
atomicModifyIORef).  Possible refactor: All the functions
***GenSym r = atomicModifyIORef r (some expression that
doesn't mention r); doing the "[atomic]ModifyIORef r" could
be the caller's responsibility instead, and e.g. listGenSym
(a:as) = (as,a).

in fact, for lists (as you get a incomplete-pattern-match
warning there, but you know the list is always infinite,
because you made it with "iterate"), you could instead use
an infinite-list type, Data.Stream from package "Stream"[*];
as Stream is not a sum type (it only has one possible
constructor: Cons), it might even be a bit more efficient!
[*]
http://hackage.haskell.org/packages/archive/Stream/0.2.6/doc/html/Data-Stream.html

thanks for your effort! and especially for measuring the
performance timing!
-Isaac
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Iavor Diatchki
Hello,

Thanks for your comments!


On Tue, Dec 23, 2008 at 3:22 PM, Isaac Dupree <[hidden email]> wrote:

> unsafeInterleaveIO is cheap until you need to evaluate its result.  So how
> about this, I think it makes there be 1/3 as many "structural"
> unsafeInterleaveIO's, so if it took "2" amount of time on
> unsafeInterleaveIO:ing before, it should take "1.33" time on it after this:
> and just a bit more time/memory to construct Nodes that might not be used.
>
> gen r = unsafeInterleaveIO $ do
>                  v <- unsafeInterleaveIO (genSym r)
>                  n1 <- gen r; n2 <- gen r; n3 <- gen r; n4 <- gen r
>                  return (Node v1 (Node v2 n1 n2) (Node v3 n3 n4))

I played around with that but, for some reason, it did not seem to
speed things up, at least for the benchmark that I run.  Will have to
experiment some more.


> version 0.4:
>>
>> genericNewSupply :: b -> (IORef b -> IO a) -> IO (Supply a)
>> genericNewSupply start genSym = gen =<< newIORef start
>>  where gen r = unsafeInterleaveIO
>>              $ do ls <- gen r
>>                   rs <- gen r
>>                   return (Node (unsafePerformIO (genSym r)) ls rs)
>
> Why unsafePerformIO, was it faster?(i'd guess slower actually, as
> unsafePerformIO is NOINLINE..)

I just tested to see what would happen and forgot to change it back
(now I have switched it back).  There does not seem to be a
significant difference between the two, speed-wise.

> Did it help specializing that to Int, i.e. why not
> "unsafeGenericNewSupply"? because I can imagine a simple data that's not an
> Int, where you'd still want to avoid the thread-safety overhead.

Nope, it was just the simplest thing to do.  I can add more general
"unsafe" versions.

>  Also, your
> implementation of it could be more efficient: it doesn't need to do locking,
> so I suggest modifyIORef rather than atomicModifyIORef (Actually you'll have
> to use readIORef >>= writeIORef >> return, instead, because modifyIORef has
> a different type than atomicModifyIORef).

I don't think that that's quite right.  I was thinking that it should
be OK to use different supplies that share a reference in different
threads.  So, for example, split the supply and pass each version to a
new thread.  This should be OK with the dupable primitives because the
thunks would be evaluated in different threads.  However, we still
need the synchronize access to the reference, or we'll get incorrect
values.

Possible refactor: All the
> functions ***GenSym r = atomicModifyIORef r (some expression that doesn't
> mention r); doing the "[atomic]ModifyIORef r" could be the caller's
> responsibility instead, and e.g. listGenSym (a:as) = (as,a).

This is a good idea, the atomicUpdates are duplicated everywhere.  I
did this, and for some reason it seems to have made things a little
slower, but I like the code better, so I think that I will keep it.  I
am surprised that it affected the speed of execution, perhaps it is
just my benchmark that is being unreliable.

> in fact, for lists (as you get a incomplete-pattern-match warning there, but
> you know the list is always infinite, because you made it with "iterate"),
> you could instead use an infinite-list type, Data.Stream from package
> "Stream"[*]; as Stream is not a sum type (it only has one possible
> constructor: Cons), it might even be a bit more efficient!

You are right but the benefit is small enough that I don't think it
warrants a dependency on another package.

bye,
-Iavor
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Isaac Dupree
Iavor Diatchki wrote:

>>  Also, your
>> implementation of it could be more efficient: it doesn't need to do locking,
>> so I suggest modifyIORef rather than atomicModifyIORef (Actually you'll have
>> to use readIORef >>= writeIORef >> return, instead, because modifyIORef has
>> a different type than atomicModifyIORef).
>
> I don't think that that's quite right.  I was thinking that it should
> be OK to use different supplies that share a reference in different
> threads.  So, for example, split the supply and pass each version to a
> new thread.  This should be OK with the dupable primitives because the
> thunks would be evaluated in different threads.  However, we still
> need the synchronize access to the reference, or we'll get incorrect
> values.

That's true (if you're careful enough about which thread is
evaluating the splitting -- the existing split* functions
are actually *not* strict enough for this, and they probably
shouldn't be either, so an additional warning that the user
may have to add some `seq`s or `evaluate`s might be needed).
  Anyway, if atomicModifyIORef isn't that big of an
overhead, then no problem :-)

>> in fact, for lists (as you get a incomplete-pattern-match warning there, but
>> you know the list is always infinite, because you made it with "iterate"),
>> you could instead use an infinite-list type, Data.Stream from package
>> "Stream"[*]; as Stream is not a sum type (it only has one possible
>> constructor: Cons), it might even be a bit more efficient!
>
> You are right but the benefit is small enough that I don't think it
> warrants a dependency on another package.

I'm going to try to argue to convince you that it's entirely
appropriate to use Data.Stream :-)

 From my somewhat mathematical point of view, that is.

1. Data.Stream is a small module that implements a very
well-known, practically standard (though under-used), and
simple data type.  All you really need from it is import
Data.Stream(Stream(Cons)), plus (iterate, streamToList) if
you're keeping your current interface.  It's nothing you
should be afraid to depend on, if you're using its concept
(which you are).  It's much smaller than the 'containers'
package, which similarly you don't use but if you needed a
Map or something you obviously should.

2. The more compelling argument, that it took me a good
night's sleep to think of:

Supply is an infinite binary tree.  Stream is an infinite
unary tree (er, more commonly called infinite "list").
They're both codata.  They're both comonads.  (although they
don't go so far as to depend on category-extras to provide
an instance.)  The primary function/purpose of Supply is an
*efficient* way to turn a Stream into a Supply.  I daresay
it would, even, be more fundamental to provide interface
newSupply      :: Stream a -> IO (Supply a)
than
newSupply      :: a -> (a -> a) -> IO (Supply a)
(although it might be worth keeping both interfaces? mainly
for compatibility, since one's just a Data.Stream.iterate
away from the other and your haddocks could say so)
In fact, it's terribly annoying to turn a (Stream a) into a
(a, a -> a) -- in fact I don't think it can be done in
general (you can turn it into a (b, b -> (a, b)) though a-la
unfold, with b = Stream a) -- so you should definitely
provide the above (Stream a -> IO (Supply a)) interface.
(Although if you're cowardly enough about the extra
dependency, I guess you could make it the riskier ([a] -> IO
(Supply a)), risk being if the user provides a finite
list... Sorry for conditionally insulting you; it seems a
horribly underhanded thing for me to do :-)

Likewise, it would be nice for
split          :: Supply a -> Stream (Supply a)
...and then you would not even depend on Data.List anymore!

(I don't happen to think the arguments that Data.List is
better than Stream for definitely-infinite lists are very
convincing; except possibly that List will be more
up-to-date with respect to stream-fusion optimizations, and
even then, value-supply doesn't actually rely on any of
those optimizations; it actually does keep around the Stream
(or List), or in the case of Num/Enum, it doesn't use one in
the first place.)

On the other hand, I still want the Stream-based interfaces,
but my initial argument isn't even that necessary: consider
implementing the current signature:
newSupply      :: a -> (a -> a) -> IO (Supply a)
currently by:
newSupply x f   = genericNewSupply (iterate f x) listGenSym
but we don't need to use a list at all, it could be like:
newSupply x f   = genericNewSupply x (\a -> (f a, a))
(with atomicallyModifyIORef added as appropriate depending
on your refactorings)


P.S. more code cleanup, if you didn't notice it:

0.4:
> -- XXX: Is the atomic necessary?
> import Data.IORef(IORef,newIORef,atomicModifyIORef)

yes the atomic is necessary, so you can update the comment
to say why? (because multiple threads might be reading from
a Supply-structure created from the same newSupply run, and
we don't want memory corruption etc.)

> import System.IO.Unsafe(unsafePerformIO,unsafeInterleaveIO)
>
> #if __GLASGOW_HASKELL__ >= 608
> import GHC.IOBase(unsafeDupableInterleaveIO,unsafeDupablePerformIO)
> #else
> ...
> unsafeDupablePerformIO :: IO a -> a
> unsafeDupablePerformIO = unsafePerformIO

various "PerformIO" variants to delete (I'm absurdly pleased
that we implemented this without needing any)

-Isaac
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Iavor Diatchki
Hi,
Thanks for the feedback!  Suggestions implemented in your daily
value-supply release :)
Happy Holidays!
-Iavor

On Wed, Dec 24, 2008 at 7:46 AM, Isaac Dupree <[hidden email]> wrote:

> Iavor Diatchki wrote:
>>>
>>>  Also, your
>>> implementation of it could be more efficient: it doesn't need to do
>>> locking,
>>> so I suggest modifyIORef rather than atomicModifyIORef (Actually you'll
>>> have
>>> to use readIORef >>= writeIORef >> return, instead, because modifyIORef
>>> has
>>> a different type than atomicModifyIORef).
>>
>> I don't think that that's quite right.  I was thinking that it should
>> be OK to use different supplies that share a reference in different
>> threads.  So, for example, split the supply and pass each version to a
>> new thread.  This should be OK with the dupable primitives because the
>> thunks would be evaluated in different threads.  However, we still
>> need the synchronize access to the reference, or we'll get incorrect
>> values.
>
> That's true (if you're careful enough about which thread is evaluating the
> splitting -- the existing split* functions are actually *not* strict enough
> for this, and they probably shouldn't be either, so an additional warning
> that the user may have to add some `seq`s or `evaluate`s might be needed).
>  Anyway, if atomicModifyIORef isn't that big of an overhead, then no problem
> :-)
>
>>> in fact, for lists (as you get a incomplete-pattern-match warning there,
>>> but
>>> you know the list is always infinite, because you made it with
>>> "iterate"),
>>> you could instead use an infinite-list type, Data.Stream from package
>>> "Stream"[*]; as Stream is not a sum type (it only has one possible
>>> constructor: Cons), it might even be a bit more efficient!
>>
>> You are right but the benefit is small enough that I don't think it
>> warrants a dependency on another package.
>
> I'm going to try to argue to convince you that it's entirely appropriate to
> use Data.Stream :-)
>
> From my somewhat mathematical point of view, that is.
>
> 1. Data.Stream is a small module that implements a very well-known,
> practically standard (though under-used), and simple data type.  All you
> really need from it is import Data.Stream(Stream(Cons)), plus (iterate,
> streamToList) if you're keeping your current interface.  It's nothing you
> should be afraid to depend on, if you're using its concept (which you are).
>  It's much smaller than the 'containers' package, which similarly you don't
> use but if you needed a Map or something you obviously should.
>
> 2. The more compelling argument, that it took me a good night's sleep to
> think of:
>
> Supply is an infinite binary tree.  Stream is an infinite unary tree (er,
> more commonly called infinite "list"). They're both codata.  They're both
> comonads.  (although they don't go so far as to depend on category-extras to
> provide an instance.)  The primary function/purpose of Supply is an
> *efficient* way to turn a Stream into a Supply.  I daresay it would, even,
> be more fundamental to provide interface
> newSupply      :: Stream a -> IO (Supply a)
> than
> newSupply      :: a -> (a -> a) -> IO (Supply a)
> (although it might be worth keeping both interfaces? mainly for
> compatibility, since one's just a Data.Stream.iterate away from the other
> and your haddocks could say so)
> In fact, it's terribly annoying to turn a (Stream a) into a (a, a -> a) --
> in fact I don't think it can be done in general (you can turn it into a (b,
> b -> (a, b)) though a-la unfold, with b = Stream a) -- so you should
> definitely provide the above (Stream a -> IO (Supply a)) interface.
> (Although if you're cowardly enough about the extra dependency, I guess you
> could make it the riskier ([a] -> IO (Supply a)), risk being if the user
> provides a finite list... Sorry for conditionally insulting you; it seems a
> horribly underhanded thing for me to do :-)
>
> Likewise, it would be nice for
> split          :: Supply a -> Stream (Supply a)
> ...and then you would not even depend on Data.List anymore!
>
> (I don't happen to think the arguments that Data.List is better than Stream
> for definitely-infinite lists are very convincing; except possibly that List
> will be more up-to-date with respect to stream-fusion optimizations, and
> even then, value-supply doesn't actually rely on any of those optimizations;
> it actually does keep around the Stream (or List), or in the case of
> Num/Enum, it doesn't use one in the first place.)
>
> On the other hand, I still want the Stream-based interfaces, but my initial
> argument isn't even that necessary: consider implementing the current
> signature:
> newSupply      :: a -> (a -> a) -> IO (Supply a)
> currently by:
> newSupply x f   = genericNewSupply (iterate f x) listGenSym
> but we don't need to use a list at all, it could be like:
> newSupply x f   = genericNewSupply x (\a -> (f a, a))
> (with atomicallyModifyIORef added as appropriate depending on your
> refactorings)
>
>
> P.S. more code cleanup, if you didn't notice it:
>
> 0.4:
>>
>> -- XXX: Is the atomic necessary?
>> import Data.IORef(IORef,newIORef,atomicModifyIORef)
>
> yes the atomic is necessary, so you can update the comment to say why?
> (because multiple threads might be reading from a Supply-structure created
> from the same newSupply run, and we don't want memory corruption etc.)
>
>> import System.IO.Unsafe(unsafePerformIO,unsafeInterleaveIO)
>>
>> #if __GLASGOW_HASKELL__ >= 608
>> import GHC.IOBase(unsafeDupableInterleaveIO,unsafeDupablePerformIO)
>> #else
>> ...
>> unsafeDupablePerformIO :: IO a -> a
>> unsafeDupablePerformIO = unsafePerformIO
>
> various "PerformIO" variants to delete (I'm absurdly pleased that we
> implemented this without needing any)
>
> -Isaac
>
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|

Re: unique identifiers as a separate library

Simon Marlow-7
In reply to this post by Isaac Dupree
Isaac Dupree wrote:

> Iavor Diatchki wrote:
>>   - It uses unsafeDupableInterleaveIO to avoid double locking,
>
> in particular,
>
>> gen r = unsafeDupableInterleaveIO
>>               $ do v <- unsafeDupableInterleaveIO (genSym r)
>>                    ls <- gen r
>>                    rs <- gen r
>>                    return (Node v ls rs)
>
> where is the double locking?  We want referential transparency...
>
> e.g. suppose you use newNumSupply to create a thunk for a Gen; when
> evaluated, it will run unsafeDupableInterleaveIO.  You send that thunk
> off to two different other threads. Then those threads decide to
> evaluate it (say, enough to get the first genSym'd value) and happen to
> make a race condition, so it becomes two separate IO computations.
> Therefore one of them runs atomicModifyIORef, and the other one runs
> atomicModifyIORef, and so they get two different values out.
>
> Node 0 (...) (...)
> Node 1 (...) (...)
>
> when it's suppose to be the very same Gen data structure.
>
> so, am I right or wrong about what the perils of
> unsafeDupableInterleaveIO are?

Oops! Yes, you're quite right.

Cheers,
        Simon
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users