Where is Data.Atom ?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
15 messages Options
Reply | Threaded
Open this post in threaded view
|

Where is Data.Atom ?

Brian Hulley
Hi -
I seem to remember from some discussion long ago that there is a Data.Atom
module somewhere (hopefully with a BSD licence), but a search on Hoogle
turned up no results so maybe I'm mistaken.
What I'm looking for is an Atom type and the following operations:

      fromString :: String -> Atom
      toString :: Atom -> String

      instance Eq Atom
      instance Ord Atom -- this is where things get difficult!

It is particularly important that the ordering comparison is extremely fast,
preferably independent of the length of the original String, and that it
preserves the lexicographic ordering between the original Strings.

I can see that an "unsafe" global ref to a Trie of Char with Unique as the
"value" of a node would allow me to implement fromString, toString, and
instance Eq Atom, but I've got no idea how to implement instance Ord Atom so
that the order is independent of the order in which Atoms are created and
exactly the same as the lexicographic ordering of the String without being
O(n) where n is the min of the lengths of the Atoms being compared.

I'm also hoping that Atoms which are no longer in use would manage to
magically vanish by themselves. In a Trie implementation, this would mean
maintaining an invariant that leaf nodes are always held by a weak link and
internal nodes by a strong link.

Any ideas (or pointers to a nice downloadable module that already does all
this :-) )?

Thanks,
Brian.

--
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Bulat Ziganshin-2
Hello Brian,

Sunday, July 2, 2006, 10:58:29 PM, you wrote:

>       fromString :: String -> Atom
>       toString :: Atom -> String

>       instance Eq Atom
>       instance Ord Atom -- this is where things get difficult!

i think that ByteString is a very strong candidate to Atom. `memicmp`
is very fast operation, unless you plan to use really large strings
with the same beginnings



--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

David House
In reply to this post by Brian Hulley
On 02/07/06, Brian Hulley <[hidden email]> wrote:
> I can see that an "unsafe" global ref to a Trie of Char with Unique as the
> "value" of a node would allow me to implement fromString, toString, and
> instance Eq Atom, but I've got no idea how to implement instance Ord Atom so
> that the order is independent of the order in which Atoms are created and
> exactly the same as the lexicographic ordering of the String without being
> O(n) where n is the min of the lengths of the Atoms being compared.

Isn't compare on strings O(n) anyway? I suppose it would actually be
O(d), where d is the index of the first difference, but that's O(n)
worst-case. Even equality (a special case of compare) on strings
(well, [Char], to be more precise) involves traversing the list and so
is O(n) worst-case, no?

--
-David House, [hidden email]
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Brian Hulley
David House wrote:

> On 02/07/06, Brian Hulley <[hidden email]> wrote:
>> I can see that an "unsafe" global ref to a Trie of Char with Unique
>> as the "value" of a node would allow me to implement fromString,
>> toString, and instance Eq Atom, but I've got no idea how to
>> implement instance Ord Atom so that the order is independent of the
>> order in which Atoms are created and exactly the same as the
>> lexicographic ordering of the String without being O(n) where n is
>> the min of the lengths of the Atoms being compared.
>
> Isn't compare on strings O(n) anyway? I suppose it would actually be
> O(d), where d is the index of the first difference, but that's O(n)
> worst-case. Even equality (a special case of compare) on strings
> (well, [Char], to be more precise) involves traversing the list and so
> is O(n) worst-case, no?

If an Atom was just represented by an Int, then construction would be O(n)
where n is the length of the String (using a Trie with Int as the data
associated with each node), and == and /= would be O(1).

If the Int's could somehow be chosen clairvoyantly then < would also be
O(1). However since we can't see into the future (to know which new Atoms
will be created) afaics the problem of correctly choosing these Int's is
completely impossible!

Of course an Integer could be used (since it can be of any length) but it's
likely that any systematic way of assigning them would simply encode the
string in the Integer leading back to O(n) so in terms of speed for < there
would be no advantage over ByteString's.

So perhaps my original spec is impossible to implement, though it is an open
question whether some very clever encoding (with corresponding
implementation of <) could be found which would lead to a better average
performance (whatever that means).

An alternative design for an atom module could be:

      create :: MonadIO m => String -> m Atom
      toString :: Atom -> String

      instance Eq Atom          -- O(1)
      instance Ord Atom        -- O(1) but depends on creation order

but here the < would not be lexicographic, so although it would be useful
for implementing symbol tables, environments etc it's not ideal for GUI use
(eg when displaying a tree of modules where everything should be listed
alphabetically).

As an aside, if the monad was removed then the result of atom "a" < atom "b"
(atom :: String -> Atom) could not be determined by analysis of the program.
It would depend on the evaluation order chosen by the compiler, but in a
sense this doesn't matter because whatever the result is, it would be the
same at any future time during the same run of the program so the use of
Atoms as keys would still be safe. But is this still "functional"?

Regards, Brian.
--
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Duncan Coutts
In reply to this post by Bulat Ziganshin-2
On Sun, 2006-07-02 at 23:08 +0400, Bulat Ziganshin wrote:

> Hello Brian,
>
> Sunday, July 2, 2006, 10:58:29 PM, you wrote:
>
> >       fromString :: String -> Atom
> >       toString :: Atom -> String
>
> >       instance Eq Atom
> >       instance Ord Atom -- this is where things get difficult!
>
> i think that ByteString is a very strong candidate to Atom. `memicmp`
> is very fast operation, unless you plan to use really large strings
> with the same beginnings

and also equal lengths.

The nice thing about the ByteString representation is that we can use
the length to short-cut inequality and use equality of the pointers to
short-cut equality:

eq :: ByteString -> ByteString -> Bool
eq a@(PS p s l) b@(PS p' s' l')
    | l /= l'            = False    -- short cut on length
    | p == p' && s == s' = True     -- short cut for the same string
    | otherwise          = compareBytes a b == EQ

So the worst case is long strings that compare equal but have different
memory blocks.

Duncan

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Brian Hulley
In reply to this post by Bulat Ziganshin-2
Bulat Ziganshin wrote:

> Hello Brian,
>
> Sunday, July 2, 2006, 10:58:29 PM, you wrote:
>
>>       fromString :: String -> Atom
>>       toString :: Atom -> String
>
>>       instance Eq Atom
>>       instance Ord Atom -- this is where things get difficult!
>
> i think that ByteString is a very strong candidate to Atom. `memicmp`
> is very fast operation, unless you plan to use really large strings
> with the same beginnings

Thanks - I think I'll use ByteString as the implementation of Atom for the
moment,
Brian.

--
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Neil Mitchell
In reply to this post by Brian Hulley
Hi

> module somewhere (hopefully with a BSD licence), but a search on Hoogle
> turned up no results so maybe I'm mistaken.

Just so people know, not finding a result in Hoogle is not a good
indicator for something not existing. Hoogle doesn't index things for
various reasons:

* Not in the standard GHC libraries
* I got a parse error when trying to produce the HTML generated by
Haddock (this is about 5% of libraries)
* They have a large interface which overlaps with other names, and I
decided to delete them (OpenGL, Win32 - I think, I may have put these
back)
* Not on Windows, i.e. Posix
* A bug in Hoogle - more common that you'd think :)

Of course, I'm working on fixing all these issues and more, but for
the moment just use Hoogle to find things, not find the absence of
things. I saw this done on Haskell IRC the other day as well.
Remember, Hoogle is still unreleased ever :)

Thanks

Neil
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

Brian Hulley
In reply to this post by Brian Hulley
Brian Hulley wrote:

> So perhaps my original spec is impossible to implement, though it is
> an open question whether some very clever encoding (with corresponding
> implementation of <) could be found which would lead to a better
> average performance (whatever that means).
>
> An alternative design for an atom module could be:
>
>      create :: MonadIO m => String -> m Atom
>      toString :: Atom -> String
>
>      instance Eq Atom          -- O(1)
>      instance Ord Atom        -- O(1) but depends on creation order
>
> but here the < would not be lexicographic, so although it would be
> useful for implementing symbol tables, environments etc it's not
> ideal for GUI use (eg when displaying a tree of modules where
> everything should be listed alphabetically).

Eureka! There *is* a way to get O(1) lexicographic comparisons of Atoms :-)
although at the expense of O(n + k) but possibly amortized O(n + ?) creation
time where n is the length of the string and k is the number of Atoms in
existence at any given time. The previous reason for needing to be
clairvoyant in the choice of Unique (or Integer or Int) to represent an Atom
was to allow expressions such as

     atom "a" < atom "b"

because when the code has determined the representation for the first atom
the creation of the second atom mustn't be allowed to change it.

However with monadic creation, such expressions can't be written, and
therefore it would be possible to represent atoms as follows:

   newtype Atom = Atom (IORef Int) -- or IORef Integer

and when new atoms are created, we could simply update the Int's referenced
by other atoms to ensure that the relative ordering is still lexicographic.
If we assume that it's very unlikely we'd need more than 2^31 atoms to exist
at any given time (if this assumption is wrong we could use Integer instead
of Int) and that usually we'd have a lot less, the Int's could be allocated
with gaps between them so that we'd only occasionally need to adjust the
Int's of the atoms to the left or right when a new atom is inserted in the
table, and in all cases it's extremely unlikely that we'd need to adjust the
representation of all the other Atoms (hence amortized O(n) hopefully).

The comparison would then be:

      compare (Atom l) (Atom r) = unsafePerformIO $ do
            li <- readIORef l
            ri <- readIORef r
            return (compare li ri)

which is safe as long as creation of atoms is not allowed inside
unsafePerformIO (it would be nice if there was a way to tell the typechecker
that a specific action is not allowed in unsafe IO)

Still there would be some tricky details to work out eg how to ensure that
the average number of Integer's that need to be updated on creation of each
new atom is as small as possible, so I'll leave this as an exercise for the
reader!!! :-)

Essentially it depends on how well the following can be implemented:

      data OrderingToken = ???
      instance Eq OrderingToken
      instance Ord OrderingToken

      create :: MonadIO m => m OrderingToken

      -- x < y |- createBetween x y == z where x < z and z < y
      createBetween :: MonadIO m => OrderingToken -> OrderingToken -> m
(OrderingToken)

It would be interesting to investigate what the theoretical tradeoffs would
be between the complexities of the above functions (assuming we want (<) to
be O(1)) and whether or not they require to be monadic.

Anyway apologies for rambling on,

Brian.

--
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

haskell-2
Brian Hulley wrote:

>
> The comparison would then be:
>
>      compare (Atom l) (Atom r) = unsafePerformIO $ do
>            li <- readIORef l
>            ri <- readIORef r
>            return (compare li ri)
>
> which is safe as long as creation of atoms is not allowed inside
> unsafePerformIO (it would be nice if there was a way to tell the
> typechecker that a specific action is not allowed in unsafe IO)
>

That is highly unsafe with respect to multiple threads of evaluation.  You could
be changing the li and ri value in between reading them.  Locking will be
needed, assuming there is any sane way to manage locks inside unsafePerformIO.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

ajb@spamcop.net
In reply to this post by Brian Hulley
G'day all.

Quoting Brian Hulley <[hidden email]>:

> I can see that an "unsafe" global ref to a Trie of Char with Unique as the
> "value" of a node would allow me to implement fromString, toString, and
> instance Eq Atom, but I've got no idea how to implement instance Ord Atom so
> that the order is independent of the order in which Atoms are created and
> exactly the same as the lexicographic ordering of the String without being
> O(n) where n is the min of the lengths of the Atoms being compared.

Unique would give you an instance of Ord Atom, it just wouldn't give
you lexicographic ordering.  Whether you need lexicographic ordering
or not largely depends on what you want atoms for.

In X11, for example, atoms basically form a user-extensible enumeration
type.  X11 atoms do support ordering, but the ordering is arbitrary by
design.  The reason is that it is very important that atoms are fixed-
size to make manipulation and incorporation into network protocol
messages easier; that's their whole reason for existing.

Lisp/Prolog atoms, on the other hand, do need lexicographic ordering.

The point is that atoms which use the standard ordering that comes with
Uniques isn't wrong, though it may not quite be what you want.

> I'm also hoping that Atoms which are no longer in use would manage to
> magically vanish by themselves. In a Trie implementation, this would mean
> maintaining an invariant that leaf nodes are always held by a weak link and
> internal nodes by a strong link.

This would, of course, be inappropriate behaviour if you used arbitrary
ordering for your atoms.

Cheers,
Andrew Bromage
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom ?

ajb@spamcop.net
In reply to this post by Brian Hulley
G'day all.

Quoting Brian Hulley <[hidden email]>:

> So perhaps my original spec is impossible to implement, though it is an open
> question whether some very clever encoding (with corresponding
> implementation of <) could be found which would lead to a better average
> performance (whatever that means).

For what it's worth:

    - It's not such a dumb idea to separate equality testing from
      less-than testing.  A (Unique,ByteString) pair makes the former
      fast and the latter pretty fast when it's needed.

    - Arithmetic coding preserves lexicographic ordering.  If you have
      a good statistical model for what sort of strings might be used
      as atoms, it may be possible to construct a representation of atoms
      which fit in machine 32- or 64-bit word most of the time.  (This is
      similar to the way that GHC represents Integers; use a machine word
      when it fits.)

    - Hash table implementations rely on using a cheap test first
      (comparing hash codes), going to a full comparison only if the
      cheap test doesn't rule out what you're looking for.

> As an aside, if the monad was removed then the result of atom "a" < atom "b"
> (atom :: String -> Atom) could not be determined by analysis of the program.
> It would depend on the evaluation order chosen by the compiler, but in a
> sense this doesn't matter because whatever the result is, it would be the
> same at any future time during the same run of the program so the use of
> Atoms as keys would still be safe. But is this still "functional"?

Questions of this nature produce much disagreement.  Once upon a time,
program arguments and environment variables were passed to a Haskell
program using a similar mechanism.  I think we got rid of this because
it's not "functional" if you consider programs that fork.

Cheers,
Andrew Bromage
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re[2]: Where is Data.Atom ?

Bulat Ziganshin-2
In reply to this post by Duncan Coutts
Hello Duncan,

Monday, July 3, 2006, 12:50:14 AM, you wrote:

>> >       instance Ord Atom -- this is where things get difficult!
> and also equal lengths.

> The nice thing about the ByteString representation is that we can use
> the length to short-cut inequality and use equality of the pointers to
> short-cut equality:

see first line of quote :)



--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Where is Data.Atom?

Iain Alexander
In reply to this post by Brian Hulley
Another suggestion:

Put your strings in an ordered binary tree (other data structures might also
work here).

Make your Atom an encoding of the structure of the tree (resp. other
structure).  This is logically a sequence of bits, 0 for left (less than), 1 for
right (greater than) - if you terminate this variable-length sequence with
1, I think it all works out OK.  You then encode this sequence of bits in a
Word32 (or some other convenient item), with implicit trailing zeros.

root ~> 1[000...]
root.left ~> 01[000...]
root.right ~> 11[000...]
...

There are clearly some questions such as how unbalanced is the tree
likely to get (since you can't conveniently rebalance it), and what to do if
your Atom "overflows", but this might give you something to work with.
--
Iain Alexander      [hidden email]


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Where is Data.Atom?

Brian Hulley
Iain Alexander wrote:

> Another suggestion:
>
> Put your strings in an ordered binary tree (other data structures
> might also work here).
>
> Make your Atom an encoding of the structure of the tree (resp. other
> structure).  This is logically a sequence of bits, 0 for left (less
> than), 1 for right (greater than) - if you terminate this
> variable-length sequence with 1, I think it all works out OK.  You
> then encode this sequence of bits in a Word32 (or some other
> convenient item), with implicit trailing zeros.
>
> root ~> 1[000...]
> root.left ~> 01[000...]
> root.right ~> 11[000...]
> ...
>
> There are clearly some questions such as how unbalanced is the tree
> likely to get (since you can't conveniently rebalance it), and what
> to do if your Atom "overflows", but this might give you something to
> work with.

Thanks for the suggestion. However I think that just using a binary tree
would be equivalent to just using the representation of the original string
as a sequence of bits (except for "early out" eg representing 0b00001111 as
just 1111 etc), and any other systematic encoding of strings that preserved
all the info necessary for lexicographic ordering (that still works when new
atoms are created later) would also end up using at least the same number of
bit comparisons on average (since a new information preserving
representation is just a permutation of the original at the bit level - if
some strings are represented with fewer bits others would need more bits so
it would all balance out).

The only way I can see of getting truly O(1) lexicographic comparisons (by
this I mean O(log P) where P is the maximum number of atoms that can exist
simultaneously at any point during program execution) would be to have a
"floating" representation (accessed via an IORef) where the concrete values
of existing atoms are changed as necessary behind the scenes as new atoms
are created, but as Chris pointed out this would require a locking operation
if the atoms were used in a multi-threaded program, so all in all, it may
well be much faster to just use ByteStrings especially if the Atoms are just
to be used to represent fairly short strings such as module names or natural
language words etc.

Still I think this would be an interesting project - Prolog style Atoms with
O(1) lexicographic ordering and O(n + f k) creation time where k is the size
of the existing population and f is some function which minimizes the
amortized time...

Regards, Brian.
--
Logic empowers us and Love gives us purpose.
Yet still phantoms restless for eras long past,
congealed in the present in unthought forms,
strive mightily unseen to destroy us.

http://www.metamilk.com 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: Where is Data.Atom?

haskell-2
Brian Hulley wrote:
 > Iain Alexander wrote:
 >> Another suggestion:
 >>
 >> Put your strings in an ordered binary tree (other data structures
 >> might also work here).
 >>
 >> Make your Atom an encoding of the structure of the tree (resp. other
 >> structure).  This is logically a sequence of bits, 0 for left (less
 >> than), 1 for right (greater than) - if you terminate this
 >> variable-length sequence with 1, I think it all works out OK.  You
 >> then encode this sequence of bits in a Word32 (or some other
 >> convenient item), with implicit trailing zeros.
 >>
 >> root ~> 1[000...]
 >> root.left ~> 01[000...]
 >> root.right ~> 11[000...]
 >> ...
 >>
 >> There are clearly some questions such as how unbalanced is the tree
 >> likely to get (since you can't conveniently rebalance it), and what
 >> to do if your Atom "overflows", but this might give you something to
 >> work with.
 >
 > Thanks for the suggestion. However I think that just using a binary tree
 > would be equivalent to just using the representation of the original
 > string as a sequence of bits (except for "early out" eg representing
 > 0b00001111 as just 1111 etc), and any other systematic encoding of
 > strings that preserved all the info necessary for lexicographic ordering
 > (that still works when new atoms are created later) would also end up
 > using at least the same number of bit comparisons on average (since a
 > new information preserving representation is just a permutation of the
 > original at the bit level - if some strings are represented with fewer
 > bits others would need more bits so it would all balance out).


The main thing is that speeding up Ord is a "memo" problem, not a "clever
encoding of atoms" problem.

Your fromString function seems like an IO action.  It cannot be a pure function,
except for identity or a hash function.  (Though hashing to Int will
at least speed up equality testing.).  If you use unsafePerformIO to make it
seem like a pure function then you need locking even if you think you have a
single CPU and thread, since the runtime may assume it can evaluate expressions
in weird ways.  This is the "unsafe at any speed" issue in Haskell.

You might turn strings s1 and s2 into atoms a1 and a2 cheaply by returning the
next highest unused Word32.  Keep an array of Atom->String, (see
http://haskell.org/haskellwiki/Library/ArrayRef#Using_dynamic_.28resizable.29_arrays 
  for dynamic arrays).  Call the # of atoms "N_A".  If you need fromString to
return the same atom if given the same string then you have to keep a tree and
search down through log(N_A) strings comparisons just to assign a new atom.
Otherwise this is O(1).

To compare atoms a1 and a2 you have to compare the original strings s1 and s2.
This original compare is O(min(length(s1),length(s2))).  If you cache this then
you make future comparisons faster, either O(1) or O(log(N_A)).

Perhaps: (compare a1 a2) could look up the answer in an "immutable" 2D array
with index (a1,a2) for which the lazy value will then be computed once (and then
cached). The actual comparison would lookup up the original s1 and s2 via the
global array (or map) and do the comparison once.   This memo table is the
explicit space/time trade off.  The Atom->String array and/or the 2D memo-array
could be replaced with something else, for the usual trade offs. And the only
thread & locking issues come from generating the next atom without risking
double-assignment.

--
Chris
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe