Compressed Data.Map for more efficient RAM usage?

Compressed Data.Map for more efficient RAM usage?

Jeremy Shaw-3
Sometimes we want to store very large collection types in RAM -- such as a Data.Map or Data.IxSet.

It seems like we could trade off some speed for space savings by compressing the values in RAM.

Lemmih has previously created compact-map:

  http://hackage.haskell.org/package/compact-map

which mightybyte used to create:

  https://github.com/mightybyte/compact-ixset
compact-map converts the Haskell values to a more compact binary representation. But could we do even better by compressing the bytestrings?

Here is a useless implementation:

  http://hpaste.org/63836
That version compresses each value by itself. But, that is probably going to make RAM usage worse, not better. I think what we really need to do is store a shared dictionary in the CMap that is used to compress all the ByteString values. That way, if the same string appears in 10,000 entries, they can all be compressed against the same dictionary code.
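
For the curious, here is a rough guess at what such a per-value-compression wrapper might look like -- this is only a minimal sketch (not the hpaste code above), assuming the cereal and zlib packages; the names CMap, insertC and lookupC are made up:

  -- A minimal sketch of a Map wrapper that serializes each value with
  -- cereal and compresses it individually with zlib.
  import qualified Codec.Compression.Zlib as Zlib
  import           Data.ByteString.Lazy   (ByteString)
  import qualified Data.Map               as Map
  import           Data.Serialize         (Serialize, decodeLazy, encodeLazy)

  newtype CMap k v = CMap (Map.Map k ByteString)

  emptyC :: CMap k v
  emptyC = CMap Map.empty

  -- Compress each value on its own; as noted above, this is probably a
  -- net loss, since small compressed payloads still pay a fixed
  -- per-value cost.
  insertC :: (Ord k, Serialize v) => k -> v -> CMap k v -> CMap k v
  insertC k v (CMap m) = CMap (Map.insert k (Zlib.compress (encodeLazy v)) m)

  lookupC :: (Ord k, Serialize v) => k -> CMap k v -> Maybe v
  lookupC k (CMap m) = do
    bs <- Map.lookup k m
    either (const Nothing) Just (decodeLazy (Zlib.decompress bs))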

One question is, how much would this benefit us?

As an experiment I tried compressing one checkpoint file from a real-world acid-state app that I am involved with. A checkpoint file contains binary data serialized by the SafeCopy library, and it is the same data that is currently stored in an IxSet in RAM.

The uncompressed checkpoint file is 115MB. The compressed checkpoint file is 26MB -- roughly a 4.4x reduction.

A checkpoint file probably contains more redundancy than an equivalent compressed IxSet because in addition to the values it also includes version tags and other easily compressed data. But, those numbers do look promising. At the very least, they do not rule out the possibility of some big savings.

I do not have time to explore this anymore. But I think it could be a fun project for someone to work on. ;)

I imagine a prototype with a shared dictionary could be hacked up in a few days. A bit more work would be required to document it, memory profile it, etc.

If done well, we should be able to reuse the incremental compression code in Data.Map, Data.Set, Data.IxSet, HiggsSet, and other places. But, just focusing on Data.Map is a good start. I will create a bug in this bug tracker that links back to this message.


- jeremy




Re: Compressed Data.Map for more efficient RAM usage?

Antoine Latter-2
On Thu, Feb 16, 2012 at 3:51 PM, Jeremy Shaw <[hidden email]> wrote:

> Sometimes we  want to store very large collection types in RAM -- such as a
> Data.Map or Data.IxSet.
>
> It seems like we could trade-off some speed for space savings by compressing
> the values in RAM.
>
> Lemmih has previously created compact-map:
>
>  http://hackage.haskell.org/package/compact-map
>
> which mightybyte used to create:
>
> https://github.com/mightybyte/compact-ixset
>
> compact-map converts the Haskell values to a more compact binary
> representation. But could we do even better by compressing the bytestrings?
>
> Here is a useless implementation:
>
> http://hpaste.org/63836
>

You could have a reimplemented HashMap which unpacks the fields of
the payload's ByteString constructor into the leaves of the HashMap
type itself.

Then you would save on both the keys and the values.

This assumes you're not using the ordering provided by Data.Map.
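
To make the idea concrete, here is a minimal sketch of such a leaf
(not an actual unordered-containers fork; the Leaf name is made up).
Because ByteString has a single constructor, UNPACK pulls its fields
straight into the leaf, saving one indirection per key and per value:

  import Data.ByteString (ByteString)

  data Leaf = Leaf
      {-# UNPACK #-} !Word        -- hash of the key
      {-# UNPACK #-} !ByteString  -- key bytes, PS fields unpacked into the leaf
      {-# UNPACK #-} !ByteString  -- serialized value bytes, likewise unpacked

  -- The interior of the tree (bitmap-indexed nodes, collision buckets)
  -- would mirror Data.HashMap; only the leaf representation changes.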

Antoine


Re: Compressed Data.Map for more efficient RAM usage?

Johan Tibell-2
In reply to this post by Jeremy Shaw-3
On Thu, Feb 16, 2012 at 1:51 PM, Jeremy Shaw <[hidden email]> wrote:
> Sometimes we  want to store very large collection types in RAM -- such as a
> Data.Map or Data.IxSet.
>
> It seems like we could trade-off some speed for space savings by compressing
> the values in RAM.

Not knowing the actual data you want to store or the usage pattern,
I cannot quite say if these suggestions will be of use:

* If the data is used in a read-only fashion (e.g. as a big lookup
table), consider using an SSTable
(http://en.wikipedia.org/wiki/SSTable). The wiki page doesn't contain
a lot of documentation, but I think you can find one implemented in
Cassandra. An SSTable is a sorted table from byte keys to byte values.
It's highly memory efficient, as the keys can be prefix-encoded (as
they are stored sorted) and whole blocks of the table can be
compressed using e.g. Zippy. (A sketch of the prefix-encoding trick
follows this list.)

* If you need mutability you could consider using LevelDB.

* If you need a pure data structure, consider using a type-specialized
version of the HAMT data structure, which I use in the experimental
hamt branch of unordered-containers
(https://github.com/tibbe/unordered-containers). The HAMT has quite
low overhead as is, and if you specialize the key and value types (at
the cost of lots of code duplication) you can decrease the per
key/value overhead by another 4 words per such pair.
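
As promised above, a minimal sketch of the prefix-encoding trick used
by SSTables (not an actual SSTable implementation; encodeKeys and
decodeKeys are made-up names). Because the keys are sorted, each key
only needs to store how many leading bytes it shares with its
predecessor plus the remaining suffix:

  import qualified Data.ByteString as B

  -- (length of prefix shared with the previous key, remaining suffix)
  type PrefixEncoded = [(Int, B.ByteString)]

  encodeKeys :: [B.ByteString] -> PrefixEncoded
  encodeKeys keys = zipWith step (B.empty : keys) keys
    where
      step prev k = let n = sharedPrefix prev k in (n, B.drop n k)
      sharedPrefix a b = length (takeWhile id (B.zipWith (==) a b))

  decodeKeys :: PrefixEncoded -> [B.ByteString]
  decodeKeys = go B.empty
    where
      go _    []              = []
      go prev ((n, sfx) : xs) =
        let k = B.take n prev `B.append` sfx
        in  k : go k xs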

-- Johan


Re: Compressed Data.Map for more efficient RAM usage?

Johan Tibell-2
In reply to this post by Antoine Latter-2
On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter <[hidden email]> wrote:
> You could have a re-implemented HashMap which would un-pack the
> payload's ByteString constructor into the leaves of the HashMap type
> itself.
>
> Then you would save on both the keys and the values.

Note that ByteString has a high per-value overhead due to the internal
use of ForeignPtrs and indices that track offset/size:
http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
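
For reference, the representation this overhead comes from looks
roughly like the following (paraphrasing Data.ByteString.Internal of
that era; see the linked post for the exact word counts):

  import Data.Word (Word8)
  import Foreign.ForeignPtr (ForeignPtr)

  -- Every ByteString carries a pointer into a pinned buffer plus an
  -- offset and a length, so there is a fixed per-value cost before the
  -- bytes themselves.
  data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8)  -- the buffer
                       {-# UNPACK #-} !Int                 -- offset
                       {-# UNPACK #-} !Int                 -- length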

-- Johan


Re: Compressed Data.Map for more efficient RAM usage?

Antoine Latter-2
On Thu, Feb 16, 2012 at 4:29 PM, Johan Tibell <[hidden email]> wrote:
> On Thu, Feb 16, 2012 at 2:03 PM, Antoine Latter <[hidden email]> wrote:
>> You could have a re-implemented HashMap which would un-pack the
>> payload's ByteString constructor into the leaves of the HashMap type
>> itself.
>>
>> Then you would save on both the keys and the values.

Hah hah - forget what I said. HashMap may be smaller in terms of the
structure of the tree it builds, but it can't save on the size of its
keys, since the hash is lossy and the full keys still have to be
stored :-/

>
> Note that ByteString has a high per-value overhead due to the internal
> use of ForeignPtrs and indicies that track offset/size:
> http://blog.johantibell.com/2011/06/memory-footprints-of-some-common-data.html
>

You could store just the raw GHC ByteArray#, which drops the
ForeignPtr and the indices. You could then promote it to a ByteString
when you need to call into standard deserialization libraries. I
don't know at what point we descend too far into voodoo-magic.
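
A minimal sketch of that "store raw bytes, promote on demand" idea,
using the Data.ByteString.Short module that was later added to the
bytestring package (its ShortByteString is exactly a bare ByteArray#
with no ForeignPtr or offset/length fields); CompactMap, insertBytes
and lookupBytes are made-up names:

  import qualified Data.ByteString       as B
  import qualified Data.ByteString.Short as BS
  import qualified Data.Map              as Map

  type CompactMap k = Map.Map k BS.ShortByteString

  insertBytes :: Ord k => k -> B.ByteString -> CompactMap k -> CompactMap k
  insertBytes k v = Map.insert k (BS.toShort v)  -- copy into an unpinned ByteArray#

  lookupBytes :: Ord k => k -> CompactMap k -> Maybe B.ByteString
  lookupBytes k m = fmap BS.fromShort (Map.lookup k m)  -- promote back for parsing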

> -- Johan
