Array heap objects

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

Array heap objects

David Feuer
It appears that the RTS has more flexibility in its array representations than users can access through primops. In particular, there are "pointers-first" and "bitmapped" arrays. Are these used for the evaluation stack or something? I'd be very interested in getting user access to some of this power.

Lots of data structures can be represented efficiently using internal nodes consisting of some (unboxed) structural information, some pointers to other internal nodes, and some (boxed or unboxed) leaves. A hashmap, for example, could have internal nodes laid out something like this (pointers-first would probably work well):

bitmap1: which children are inhabited?
bitmap2: which children are internal nodes?
pointers: pointers to internal nodes and leaf nodes. (If we change some of the traditional arithmetic, we can probably even store keys and values directly in their parent nodes rather than having separate Leaf nodes.)

As it is today, we need extra indirections between each array and the next one to accommodate the tagging: node -> array -> node -> array -> ...

How feasible would it be to offer things like this?

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

Ben Gamari-2
David Feuer <[hidden email]> writes:

> It appears that the RTS has more flexibility in its array representations
> than users can access through primops. In particular, there are
> "pointers-first" and "bitmapped" arrays. Are these used for the evaluation
> stack or something? I'd be very interested in getting user access to some
> of this power.
>
I'm a bit unclear on what in particular you are referring to. GHC supports
four types of arrays:

 * ByteArray#: Just a pile of bytes

 * Array#: An array of pointers to lifted things with a card table used
   to record which regions of the array are "dirty" (and therefore may
   need to be scavenged for young-generation references.

 * ArrayArray#: An array of pointers to Array#s. These also have a card
   table. This really should be
   generalized and called UnliftedArray#, but oh well.

 * SmallArray#: An array of pointers to lifted things without a card
   table. The lack of card table means that the entire array must be
   scavenged if any part of it has changed.

We don't support arrays that mix pointers and non-pointers.

Perhaps you are thinking of the bitmap structures used to encode the
structure of stack frames?

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

signature.asc (497 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

David Feuer
Yes, the bitmap structures used to encode the structure of stack
frames. These *look* like they might be reusable for mix-and-match
arrays in Haskell-land. If so, that could be a pretty cheap new
feature.

On Wed, Oct 7, 2020 at 1:37 PM Ben Gamari <[hidden email]> wrote:

>
> David Feuer <[hidden email]> writes:
>
> > It appears that the RTS has more flexibility in its array representations
> > than users can access through primops. In particular, there are
> > "pointers-first" and "bitmapped" arrays. Are these used for the evaluation
> > stack or something? I'd be very interested in getting user access to some
> > of this power.
> >
> I'm a bit unclear on what in particular you are referring to. GHC supports
> four types of arrays:
>
>  * ByteArray#: Just a pile of bytes
>
>  * Array#: An array of pointers to lifted things with a card table used
>    to record which regions of the array are "dirty" (and therefore may
>    need to be scavenged for young-generation references.
>
>  * ArrayArray#: An array of pointers to Array#s. These also have a card
>    table. This really should be
>    generalized and called UnliftedArray#, but oh well.
>
>  * SmallArray#: An array of pointers to lifted things without a card
>    table. The lack of card table means that the entire array must be
>    scavenged if any part of it has changed.
>
> We don't support arrays that mix pointers and non-pointers.
>
> Perhaps you are thinking of the bitmap structures used to encode the
> structure of stack frames?
>
> Cheers,
>
> - Ben
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

Ben Gamari-2
On October 7, 2020 2:02:53 PM EDT, David Feuer <[hidden email]> wrote:
>Yes, the bitmap structures used to encode the structure of stack
>frames. These *look* like they might be reusable for mix-and-match
>arrays in Haskell-land. If so, that could be a pretty cheap new
>feature.

Ahh, yes. In principle we could expose a new array type allowing the user to specify the pointer-ness of each entry. This wouldn't reuse any of the large bitmap machinery since these must be specified in the info table rather than the closure but the idea is similar. Implementation would require a new set of primops along with GC support but I suspect that none of this is too hard.

All these being said, this may not be terribly efficient for garbage collection since the GC will need to walk over both pointers and non-pointers, trashing a good amount of cache in the process. Moreover, the array itself would be larger, due to the two bitmaps.

I suspect most applications would be on the whole better served by a structure of arrays approach to achieve the same idea.  If you have multiple pointers and locality is a concern then you can even pack all of your pointers into a single array with some indexing magic and an unsafeCoerce.

Cheers,

- Ben
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

David Feuer
Ah, too bad about reuse. What do you mean about walking over both pointers and non-pointers? The extra word (for pointers-first layout) or few words (for bitmapped) will be more than made up for in most cases by not needing extra heap objects between a node and its children.

Simon has explained that these unsafe coercions between lifted and unlifted are really *too* unsafe (so we'd probably have to NOINLINE like mad). Once we have BoxedRep, we can fix that part by offering unsafe primops that allow lifted things to be read from/written to arrays of unlifted things and vice versa, but I'm not sure that's so useful if we can't get a little bit of unboxed (non-pointer) stuff in there too.

On Thu, Oct 8, 2020, 8:32 AM Ben Gamari <[hidden email]> wrote:
On October 7, 2020 2:02:53 PM EDT, David Feuer <[hidden email]> wrote:
>Yes, the bitmap structures used to encode the structure of stack
>frames. These *look* like they might be reusable for mix-and-match
>arrays in Haskell-land. If so, that could be a pretty cheap new
>feature.

Ahh, yes. In principle we could expose a new array type allowing the user to specify the pointer-ness of each entry. This wouldn't reuse any of the large bitmap machinery since these must be specified in the info table rather than the closure but the idea is similar. Implementation would require a new set of primops along with GC support but I suspect that none of this is too hard.

All these being said, this may not be terribly efficient for garbage collection since the GC will need to walk over both pointers and non-pointers, trashing a good amount of cache in the process. Moreover, the array itself would be larger, due to the two bitmaps.

I suspect most applications would be on the whole better served by a structure of arrays approach to achieve the same idea.  If you have multiple pointers and locality is a concern then you can even pack all of your pointers into a single array with some indexing magic and an unsafeCoerce.

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

Ben Gamari-2
David Feuer <[hidden email]> writes:

> Ah, too bad about reuse. What do you mean about walking over both pointers
> and non-pointers? The extra word (for pointers-first layout) or few words
> (for bitmapped) will be more than made up for in most cases by not needing
> extra heap objects between a node and its children.
>
Let's consider the case that you want an array of (String, Int#) pairs.
Today you have a choice of two representations:

 * structure-of-arrays:

       type Array = (Array# String, ByteArray#)

   where the ByteArray# contains packed `Int#`s.

 * array-of-structures:

       data Entry = Entry String !Int#

       type Array = Array# Entry

In the latter case you indeed incur an extra indirection, but not in the
former. It is true that locality may be in the former (especially
if your entry is a wide product rather than the pair used here) but on
the whole it's often a good choice.

I think you are asking for a very general MixedArray#, which would allow
you to specify dynamically whether each word of the array is a pointer
or not. This seems like a lot of power and may complicate garbage
collection. Apologies if I have misunderstood.

I think you can get most of the power of this idea with a much simpler
idea: rather than allow configuration of each word dynamically, specify
the array as a bunch of packed unboxed tuples. For instance, in Haskell
this might look like:

     -- This class would need to be magic since the compiler would need
     -- to generate a info table describing the entry layout for each
     -- instance.
     class HasArrayRep (a :: k)

     data PackedArray# (a :: k)

     newPackedArray# :: HasArrayRep a
                     => Int#
                     -> State# s -> (# State# s, PackedArray# a)
     readPackedArray# :: HasArrayRep a
                      => Int -> PackedArray# a ->
                      -> State# s -> (# State# s, a)
     -- ... etc.

Implementation would involve the introduction of a new array closure
type and suitable scavenging logic. Each HasArrayRep instance would emit
an info table; the layout section of this info table would contain a
bitmap (of the same sort used for stack frames) which would describe the
layout of a single array element. To avoid unaligned accesses you would
want to round up the element size to the nearest word size, potentially
incurring some slop (e.g. each element of a `PackedArray# (# Word#,
Word8# #)` would require two words). However, sub-word-size fields could
in principle be packed into a single word (finally allowing us to get
some mileage out of the sub-word-size primitive types we now have).

Perhaps this helps in your use-case?

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

signature.asc (497 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Array heap objects

David Feuer
No, I don't think that gets me all the way to what I want, although it might be a reasonable approach to one *part* of it. The main idea I'm after is efficient representations of things like HashMap, where nodes in a tree have variable numbers of children. First, let's look at how HashMap is actually defined (with inconsequential simplifications & clarifications):

data HashMap k v
    = Empty
    | BitmapIndexed !Word !(SmallArray (HashMap k v))
    | Leaf !Word !(Leaf k v)
    | Collision !Word !(SmallArray (Leaf k v))

The internal nodes are represented by BitmapIndexed constructors. The biggest goal is to flatten those BitmapIndexed values out, so let's start with just that. Conceptually, imagine we have

data HashMap# :: Type -> Type -> TYPE 'UnliftedRep -- hopefully soon ('BoxedRep 'Unlifted)
pattern BitmapIndexed# :: Word# -> SmallArray# (HashMap k v) -> HashMap# k v

I want a value produced by `BitmapIndexed#` to be a single heap object with:

1. Enough information to tell that it was produced by `BitmapIndexed#`.
2. The included `Word#` (representing the hash).
3. Pointers to all the children.

One way to think of this is as unpacking arrays into (unlifted) user-defined types. Something like

unlifted data HashMap#
  = Empty#
  | BitmapIndexed# !Word {-# UnpackArray #-} {-# Unpack #-} !(SmallArray (HashMap k v))
  | ...

Where these fake pragmas indicate a sort of double unpacking, where the array size and elements
are unpacked into the `BitmapIndexed#` constructor.

Now let's get into the leaves. We may well want to unpack the leaves into their parents. This gets
a little messy. If we have a totally wild bitmap approach, this is pretty straightforward, because we
can indicate which elements are pointers (to child nodes, keys, or values) and which elements are
unboxed (hashes), and do some fancy arithmetic to figure out where each thing is. Is there a
cleaner, more general approach? I think that's going to be very hard. From a "pure" standpoint,
we could imagine some sort of succinct data structure that lets us locate the nth element and figure
out which alternative of the sum it is. But then we'd need some *very* principled operations for
constructing arrays; the usual unrestricted mutation approach is not going to work at all.

On Thu, Oct 8, 2020, 3:12 PM Ben Gamari <[hidden email]> wrote:
David Feuer <[hidden email]> writes:

> Ah, too bad about reuse. What do you mean about walking over both pointers
> and non-pointers? The extra word (for pointers-first layout) or few words
> (for bitmapped) will be more than made up for in most cases by not needing
> extra heap objects between a node and its children.
>
Let's consider the case that you want an array of (String, Int#) pairs.
Today you have a choice of two representations:

 * structure-of-arrays:

       type Array = (Array# String, ByteArray#)

   where the ByteArray# contains packed `Int#`s.

 * array-of-structures:

       data Entry = Entry String !Int#

       type Array = Array# Entry

In the latter case you indeed incur an extra indirection, but not in the
former. It is true that locality may be in the former (especially
if your entry is a wide product rather than the pair used here) but on
the whole it's often a good choice.

I think you are asking for a very general MixedArray#, which would allow
you to specify dynamically whether each word of the array is a pointer
or not. This seems like a lot of power and may complicate garbage
collection. Apologies if I have misunderstood.

I think you can get most of the power of this idea with a much simpler
idea: rather than allow configuration of each word dynamically, specify
the array as a bunch of packed unboxed tuples. For instance, in Haskell
this might look like:

     -- This class would need to be magic since the compiler would need
     -- to generate a info table describing the entry layout for each
     -- instance.
     class HasArrayRep (a :: k)

     data PackedArray# (a :: k)

     newPackedArray# :: HasArrayRep a
                     => Int#
                     -> State# s -> (# State# s, PackedArray# a)
     readPackedArray# :: HasArrayRep a
                      => Int -> PackedArray# a ->
                      -> State# s -> (# State# s, a)
     -- ... etc.

Implementation would involve the introduction of a new array closure
type and suitable scavenging logic. Each HasArrayRep instance would emit
an info table; the layout section of this info table would contain a
bitmap (of the same sort used for stack frames) which would describe the
layout of a single array element. To avoid unaligned accesses you would
want to round up the element size to the nearest word size, potentially
incurring some slop (e.g. each element of a `PackedArray# (# Word#,
Word8# #)` would require two words). However, sub-word-size fields could
in principle be packed into a single word (finally allowing us to get
some mileage out of the sub-word-size primitive types we now have).

Perhaps this helps in your use-case?

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs