Fingerprinting Haskell Objects

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
5 messages Options
Reply | Threaded
Open this post in threaded view
|

Fingerprinting Haskell Objects

ozataman
Hello everybody,

I have a little question I wanted to run by the folks here. I've run into it several times over the past few years and would love to lock down a good answer.

What's the best way to "fingerprint" a Haskell object into, say, ByteString, so that this fingerprint can be used as the "lookup key" in a database (for example) and be trusted that it will remain constant over time even as the underlying libraries evolve?

Here's a simple example:
  • Say I'm building a manual index on top of a key-value store (redis, dynamodb, etc.)

  • I want my keys to be arbitrary tuples (or similar records) that may contain various fields in them

  • I would like to avoid ad-hoc, hand-written MyTuple -> ByteString and ByteString -> MyTuple conversions. However, Generic derivations, template-haskell, etc. are acceptable

  • Notice how your fingerprint, which is used as a lookup key in the database, has to remain stationary. If it changes even by a single bit over time for the same MyTuple, the key-value store will NOT be able to find the index associated with MyTuple at this later time

Here are some ideas (and related concepts) I've considered and used over the years:
  • Hand-write a "Prism' MyTuple ByteString". This works, but is tedious and error-prone.

  • Use Serialize/Binary and trust that the encode/decode pair will produce results consistently in 5 years (dangerous territory!)

  • Use SafeCopy, which is great for ensuring timeless decoding of the *value* in the index, but can we be sure that fingerprint (MyTuple -> ByteString) conversion is persistent? What if SafeCopy authors one day decide to encode tuples differently? They would write the migrations to transparently handle legacy code for *values*, but not for *keys*. Also notice here how migrations help with the ByteString -> MyTuple leg, but do not ensure MyTuple -> ByteString produces the same ByteString over time.

  • Hashable would've been nice, but there is NO guarantee of persistent results, even across multiple runs of the same code
What would be your preferred solution?

Thank you,
Oz






_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Fingerprinting Haskell Objects

Alexander Kjeldaas

Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).


This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.


Alexander


On Tue, Oct 7, 2014 at 10:30 PM, Ozgun Ataman <[hidden email]> wrote:
Hello everybody,

I have a little question I wanted to run by the folks here. I've run into it several times over the past few years and would love to lock down a good answer.

What's the best way to "fingerprint" a Haskell object into, say, ByteString, so that this fingerprint can be used as the "lookup key" in a database (for example) and be trusted that it will remain constant over time even as the underlying libraries evolve?

Here's a simple example:
  • Say I'm building a manual index on top of a key-value store (redis, dynamodb, etc.)

  • I want my keys to be arbitrary tuples (or similar records) that may contain various fields in them

  • I would like to avoid ad-hoc, hand-written MyTuple -> ByteString and ByteString -> MyTuple conversions. However, Generic derivations, template-haskell, etc. are acceptable

  • Notice how your fingerprint, which is used as a lookup key in the database, has to remain stationary. If it changes even by a single bit over time for the same MyTuple, the key-value store will NOT be able to find the index associated with MyTuple at this later time

Here are some ideas (and related concepts) I've considered and used over the years:
  • Hand-write a "Prism' MyTuple ByteString". This works, but is tedious and error-prone.

  • Use Serialize/Binary and trust that the encode/decode pair will produce results consistently in 5 years (dangerous territory!)

  • Use SafeCopy, which is great for ensuring timeless decoding of the *value* in the index, but can we be sure that fingerprint (MyTuple -> ByteString) conversion is persistent? What if SafeCopy authors one day decide to encode tuples differently? They would write the migrations to transparently handle legacy code for *values*, but not for *keys*. Also notice here how migrations help with the ByteString -> MyTuple leg, but do not ensure MyTuple -> ByteString produces the same ByteString over time.

  • Hashable would've been nice, but there is NO guarantee of persistent results, even across multiple runs of the same code
What would be your preferred solution?

Thank you,
Oz






_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Fingerprinting Haskell Objects

Kyle Marek-Spartz
I’m not sure hashing is what is desired due to the ByteString -> MyTuple conversion that was mentioned.


Kyle Marek-Spartz
 
 
 
 
 
On Oct 7, 2014, 4:15:55 PM, Alexander Kjeldaas <[hidden email]> wrote:

Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).


This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.


Alexander




_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Fingerprinting Haskell Objects

Alexander Kjeldaas

I think the two problems should be separated.  MyTuple -> ByteString should have the properties that the encoding is stable and 1:1.  ByteString -> MyTuple should have the property that old data is readable, but it does not need to be 1:1.

So any library that supports backwards compatibility can be used for the ByteString -> MyTuple conversion, including SafeCopy.  If SafeCopy implements a new, faster encoding, that does not affect this conversion.

For a key/value store, the hash of the object can be used as key, and the SafeCopy serialization can be stored in the value together with whatever else is required.

Alexander


On Wed, Oct 8, 2014 at 12:21 AM, Kyle Marek-Spartz <[hidden email]> wrote:
I’m not sure hashing is what is desired due to the ByteString -> MyTuple conversion that was mentioned.


Kyle Marek-Spartz
 
 
 
 
 
On Oct 7, 2014, 4:15:55 PM, Alexander Kjeldaas <[hidden email]> wrote:

Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).


This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.


Alexander





_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Fingerprinting Haskell Objects

ozataman
In reply to this post by Alexander Kjeldaas


On Tue, Oct 7, 2014 at 5:15 PM, Alexander Kjeldaas <[hidden email]> wrote:

Assuming the Generic instance is a stable interface, I would create a traversal of that, feeding directly into a Blake2b-implementation (a fast SHA3 finalist, tweaked).


This gives you a cryptographically strong fingerprint, space usage is flexible (extract as many bytes as you want), is fast (~1GB/s), and with low complexity/external dependencies.


Alexander

Thank you for the reply. My own thinking has been along similar lines of finding a stable serialization for the record. Hashing is certainly a nice way to compress the result - one I've used myself in the past. However, can we really consider Generic a stable interface?

 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe