GC-Managed ByteArray Slicing

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
2 messages Options
Reply | Threaded
Open this post in threaded view
|

GC-Managed ByteArray Slicing

Andrew Martin
In GHC Haskell, there are two common options for working with bytes in memory:

    data ByteArray = ByteArray ByteArray#
    data ByteSlice = ByteSlice ByteArray# Int# Int#

The second one, ByteSlice, is roughly what the bytestring library does (but with a ForeignPtr instead of a ByteArray# inside of it). What's unfortunate is that it's difficult to know which of these two types is going to provide better performance for a given application. These heuristics are helpful for guiding the process:
  1. The longer the data is going to be around, the better the unsliced variant will be. This point is about space efficiency. The sliced variant needs two extra machine words to hold those indices, plus it holds on to extra bytes from the array that it is slicing into.
  2. The shorter the data is going to be around, the better the sliced variant is. This point is about speed of execution. The cost of the unsliced variant is typically a memcpy. If the data is getting GCed quickly anyway, the unsliced copy doesn't save you any space.
  3. The shorter the data is, the better the unsliced variant is.
  4. The more sharing of the data there is, the better the sliced variant is.
But, users have to commit to which one they want to use in their data type definitions, and the data type may be used in different contexts where one or the other is preferable. For example, consider the following:

    data Student = Student
      { name :: {-# UNPACK #-} !BytesType
      , grade :: {-# UNPACK #-} !Int
      }

Which of the two types should we use for BytesType? (For this example, ignore the incorrectness of using bytes to represent text. With either of the two bytes types, it's easy to put a newtype on top that communicates that what's inside is guaranteed to be UTF-8 encoded text). It depends on the usage. For example, consider that we are parsing students from this xml file:

    <students>
      <student>
        <name>Erica Chang</name>
        <grade>87</grade>
      </student>
      <student>
        <name>Lizbeth Cruz</name>
        <grade>91</grade>
      </student>
      ...
    </students>

Let's say we've got about 1GB of this, and we run a parser on 64KB chunks as they are read from disk. There are two contasting scenarios to consider.

If the file is parsed at startup and the resulting array of students is kept in memory for a long time afterward, then the unsliced variant is better. It allows us to throw away all of the bytes corresponding to the xml nodes, which are no longer relevant. The sliced variant, by constrast, would keep the entire contents of the original file in memory.

However, if the file is parsed by a SAX-style streaming parser, the students being parsed, processed, and then discarded as they encountered, the sliced variant is better. Slicing wins here because the chunks from the file are discarded shortly after they are encountered. Read "shortly" as "before the next minor GC".

In this example, Student was an application-specific data type, so the application author could figure out what they were actually doing with the data and define the data type accordingly. However, this isn't always a solution. What if they needed to use the same type differently in different contexts? What if the type was provided by a library and the library author doesn't know how it will be used?

So far, this has just been an explanation of a problem that I've noticed. I understand the problem well, but now I'm going to propose a solution. There are a lot of holes in my understanding of GHC's runtime, so I don't understand if what I'm proposing is actually plausible.

What if the GHC's runtime handled the copy decisions for us? What if we had these primitives:

    data Bytes#
    sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
    lengthBytes# :: Bytes# -> Int#

Slicing causes the runtime to track that there is an additional reference to the byte array but that the offset and length are different. I'm not sure where this information would be stored though. When GC runs, it would copy slices from bytes if the there were no longer any references to the entire array. So a single type could serve the both purposes discussed in the scenario above. I'm not sure what kind of impact this might have on GC time.

-- 
-Andrew Thaddeus Martin

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: GC-Managed ByteArray Slicing

Edward Kmett-2
I wouldn't recommend doing this to the existing ByteArray#, but for a separate type it could be nice with some caveats. 

1.) It's probably not that you want to copy out slices when there are no references to the whole array but rather that you can keep the smallest containing sub-interval of the Bytes# that contain all references. In your story, if I slice to take its tail and then take that things tail. I'd double my storage requirement after GC for large Bytes# when I drop the reference to the original. This is a bit of an orthogonal strategy to the one you propose, but the one you propose has problems in that I probably should carefully memoize the results per slice somehow to maximize sharing in the result, which complicates the implementation a lot. If on the other hand, these slices are not allowed to nest then you don't have either problem, but that sounds like a nightmare waiting to happen to debug.

2.) You'd probably have a hard time doing the move during GC because you don't really get to know the space of all references in before you have to move to to-space. This is further complicated by multiple generations even if you conspire some way to procrastinate and delay the move til after you move everything else and have found the largest sub-interval for the current generation's references. One way to this would be to do something like track a min/max bound per generation in the object itself of the largest containing intervals that are wanted, and during a gc when you copy the thing from from space to to space you copy over the interval from the min bound from of all generations to the max bound from all generations, then reset the current generations bounds to be something like (maxPos,minPos) and as you find more references to sub-slices it'll grow by continually mutating these secret mutable fields inside the object to tell the next GC how much it can clean things up based solely on the contents referenced from this generation.

-Edward

On Sat, Sep 8, 2018 at 6:35 AM Andrew Martin <[hidden email]> wrote:
In GHC Haskell, there are two common options for working with bytes in memory:

    data ByteArray = ByteArray ByteArray#
    data ByteSlice = ByteSlice ByteArray# Int# Int#

The second one, ByteSlice, is roughly what the bytestring library does (but with a ForeignPtr instead of a ByteArray# inside of it). What's unfortunate is that it's difficult to know which of these two types is going to provide better performance for a given application. These heuristics are helpful for guiding the process:
  1. The longer the data is going to be around, the better the unsliced variant will be. This point is about space efficiency. The sliced variant needs two extra machine words to hold those indices, plus it holds on to extra bytes from the array that it is slicing into.
  2. The shorter the data is going to be around, the better the sliced variant is. This point is about speed of execution. The cost of the unsliced variant is typically a memcpy. If the data is getting GCed quickly anyway, the unsliced copy doesn't save you any space.
  3. The shorter the data is, the better the unsliced variant is.
  4. The more sharing of the data there is, the better the sliced variant is.
But, users have to commit to which one they want to use in their data type definitions, and the data type may be used in different contexts where one or the other is preferable. For example, consider the following:

    data Student = Student
      { name :: {-# UNPACK #-} !BytesType
      , grade :: {-# UNPACK #-} !Int
      }

Which of the two types should we use for BytesType? (For this example, ignore the incorrectness of using bytes to represent text. With either of the two bytes types, it's easy to put a newtype on top that communicates that what's inside is guaranteed to be UTF-8 encoded text). It depends on the usage. For example, consider that we are parsing students from this xml file:

    <students>
      <student>
        <name>Erica Chang</name>
        <grade>87</grade>
      </student>
      <student>
        <name>Lizbeth Cruz</name>
        <grade>91</grade>
      </student>
      ...
    </students>

Let's say we've got about 1GB of this, and we run a parser on 64KB chunks as they are read from disk. There are two contasting scenarios to consider.

If the file is parsed at startup and the resulting array of students is kept in memory for a long time afterward, then the unsliced variant is better. It allows us to throw away all of the bytes corresponding to the xml nodes, which are no longer relevant. The sliced variant, by constrast, would keep the entire contents of the original file in memory.

However, if the file is parsed by a SAX-style streaming parser, the students being parsed, processed, and then discarded as they encountered, the sliced variant is better. Slicing wins here because the chunks from the file are discarded shortly after they are encountered. Read "shortly" as "before the next minor GC".

In this example, Student was an application-specific data type, so the application author could figure out what they were actually doing with the data and define the data type accordingly. However, this isn't always a solution. What if they needed to use the same type differently in different contexts? What if the type was provided by a library and the library author doesn't know how it will be used?

So far, this has just been an explanation of a problem that I've noticed. I understand the problem well, but now I'm going to propose a solution. There are a lot of holes in my understanding of GHC's runtime, so I don't understand if what I'm proposing is actually plausible.

What if the GHC's runtime handled the copy decisions for us? What if we had these primitives:

    data Bytes#
    sliceBytes# :: Bytes# -> Int# -> Int# -> Bytes#
    lengthBytes# :: Bytes# -> Int#

Slicing causes the runtime to track that there is an additional reference to the byte array but that the offset and length are different. I'm not sure where this information would be stored though. When GC runs, it would copy slices from bytes if the there were no longer any references to the entire array. So a single type could serve the both purposes discussed in the scenario above. I'm not sure what kind of impact this might have on GC time.

-- 
-Andrew Thaddeus Martin
_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries