Parsec to parse tree structures?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

Parsec to parse tree structures?

David Fries-2
Hello Café

Some time ago I wrote a parser for a project of one our customers. The
format was proprietary and binary. The data was structured as a tree
with tables pointing to sub tables farther in the file. (Well actually
there was one or two cases where branches joined together, so I guess it
was a directed graph.) Also it had no defined endianess, some tables
were bigendian others were little endian and some were in their very own
in-house-endianess.
All in all, the typical binary data structure that has been in use and
continuously extended for the last 15 years. You know the kind.

Oddly enough, our customer never bothered to write a parser of their
own. I wonder why.

My parser was written in C# and wasn't particularly elegant, but it
worked reliably. I was wondering how you would parse tree-like
structures with Parsec (or other functional parsers)? Up to know, all
examples I've seen were of sequential data. To parse trees, you'd
essentially need to be able to follow pointers, parse whatever is there
and then jump back. I guess I'd have to mess around with the internal
state of the parser to do that, which is something I'd rather avoid.


regards,
dave

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

Stephen Tetley-2
On 14 March 2010 16:03, david fries <[hidden email]> wrote:
[SNIP]

> Oddly enough, our customer never bothered to write a parser of their
> own. I wonder why.

Hi David

If the binary structure was previously used only with C programs its
quite common just to use casting to unpack the data into a struct -
though your example seems to suggest this wasn't being done as the
format had both big and little endian tables.

In Haskell or other modern functional languages like SML, parse trees
are generally represented as algebraic types - so there are no
pointers. If you're familiar with ANTLR from the OO world, its rather
like working with the tree definition automatically as opposed to
generating classes from the data description language.


Best wishes

Stephen
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

David Fries-2
Hi Stephen

Perhaps my description of the format was a bit unclear. When I said
pointer I simply meant a number which is the position in the stream.

Imagine the tables looking something like this:

RootTable
   HeaderMagicNumber (1Byte): 0x50
   VersionNumber (2 Bytes): 1234
   SubTablePointer (4 Bytes): 200 ------------.
...Some more fields of the root table         ¦
                                              ¦
                                              ¦
SubTable (positioned at byte 200 of the file)<'
   SomeFlags (1 Byte): 0x00
   Comment (Variable String): "hello, world!\0"

AnotherTable
...

Our parser produced object instances representing each table. And I used
normal references between tables instead of pointers.

You had to start parsing at the beginning of the file (where the root
table is), otherwise you have no clue where in the structure you are.
Because all tables after the root table are dynamically positioned when
the whole thing is serialized.
So, to parse the root, you read the first byte, check that it's 0x50,
read the next two bytes (which are the version number), then you read
four bytes (pointer to the sub table). The sub table starts at byte 200
of the file. So now I would jump to that position in the file and start
parsing SubTable. After that I'd jump back and parse the remaining
fields of the root table.

On Sun, 2010-03-14 at 16:23 +0000, Stephen Tetley wrote:

> On 14 March 2010 16:03, david fries <[hidden email]> wrote:
> [SNIP]
>
> > Oddly enough, our customer never bothered to write a parser of their
> > own. I wonder why.
>
> Hi David
>
> If the binary structure was previously used only with C programs its
> quite common just to use casting to unpack the data into a struct -
> though your example seems to suggest this wasn't being done as the
> format had both big and little endian tables.
>
> In Haskell or other modern functional languages like SML, parse trees
> are generally represented as algebraic types - so there are no
> pointers. If you're familiar with ANTLR from the OO world, its rather
> like working with the tree definition automatically as opposed to
> generating classes from the data description language.
>
>
> Best wishes
>
> Stephen
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

Stephen Tetley-2
Hi David

Ah ha - this form of binary file layout is quite common (e.g. PECOFF
object files and OpenType / TrueType fonts).

Parsec and other parsing libraries are perhaps not ideal for the task,
as they consume input as they parse. I have my own alternative to
Parsec - Kangaroo [1] - for parsing binary files. It moves a cursor
around inside the file (strictly speaking an array in memory from
reading the file), so you can parse within a sub-region of the file
and jump back out again.

Although its on Hackage, I wouldn't really recommend its use - its now
fairly well documented but the API is not stable and I only work on it
sporadically. Because I didn't want any dependencies, the package is
quite a bit larger than it need be - if someone were interested in
technique they might be better off using it as a start point. The most
important bits are the 'intraparse' function and the monadic machinery
inside the Kangaroo.ParseMonad module.

Even when a binary format has a published standard, unfortunately the
standard might not be detailed enough to actually produce a parser.
This is the case for True Type and PECOFF which I wrote Kangaroo for,
and as I don't have much enthusiasm for deriving a parser from another
open-source implementation, its rather stalling any continued
development of Kangaroo.


[1] http://hackage.haskell.org/package/kangaroo

Best wishes

Stephen
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

David Fries-2
Yep, That's what I had in mind. Thanks for the link.

On Sun, 2010-03-14 at 19:09 +0000, Stephen Tetley wrote:

> Hi David
>
> Ah ha - this form of binary file layout is quite common (e.g. PECOFF
> object files and OpenType / TrueType fonts).
>
> Parsec and other parsing libraries are perhaps not ideal for the task,
> as they consume input as they parse. I have my own alternative to
> Parsec - Kangaroo [1] - for parsing binary files. It moves a cursor
> around inside the file (strictly speaking an array in memory from
> reading the file), so you can parse within a sub-region of the file
> and jump back out again.
>
> Although its on Hackage, I wouldn't really recommend its use - its now
> fairly well documented but the API is not stable and I only work on it
> sporadically. Because I didn't want any dependencies, the package is
> quite a bit larger than it need be - if someone were interested in
> technique they might be better off using it as a start point. The most
> important bits are the 'intraparse' function and the monadic machinery
> inside the Kangaroo.ParseMonad module.
>
> Even when a binary format has a published standard, unfortunately the
> standard might not be detailed enough to actually produce a parser.
> This is the case for True Type and PECOFF which I wrote Kangaroo for,
> and as I don't have much enthusiasm for deriving a parser from another
> open-source implementation, its rather stalling any continued
> development of Kangaroo.
>
>
> [1] http://hackage.haskell.org/package/kangaroo
>
> Best wishes
>
> Stephen
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

John Lato-2
In reply to this post by David Fries-2
Hi Dave,

> From: david fries <[hidden email]>
>
> Hello Café
>
> Some time ago I wrote a parser for a project of one our customers. The
> format was proprietary and binary. The data was structured as a tree
> with tables pointing to sub tables farther in the file. (Well actually
> there was one or two cases where branches joined together, so I guess it
> was a directed graph.) Also it had no defined endianess, some tables
> were bigendian others were little endian and some were in their very own
> in-house-endianess.
> All in all, the typical binary data structure that has been in use and
> continuously extended for the last 15 years. You know the kind.
>
> My parser was written in C# and wasn't particularly elegant, but it
> worked reliably. I was wondering how you would parse tree-like
> structures with Parsec (or other functional parsers)? Up to know, all
> examples I've seen were of sequential data. To parse trees, you'd
> essentially need to be able to follow pointers, parse whatever is there
> and then jump back. I guess I'd have to mess around with the internal
> state of the parser to do that, which is something I'd rather avoid.

Have you seen Edward Kmett's Parsing Trifecta talk/slides, available
at http://comonad.com/reader/2009/iteratees-parsec-and-monoid/ ?  I
think a similar approach would allow you to parse this data structure.
 The trick is to create a data source that can read from arbitrary
locations.

Also see http://okmij.org/ftp/Haskell/Iteratee/Tiff.hs for an
iteratee-based TIFF reader.  TIFF files have a similar structure, so
this code could apply pretty directly.  The parsing is quite
low-level, but you could use iteratee-parsec [1] or
attoparsec-iteratee [2] to use parser combinators with this technique.

[1] http://hackage.haskell.org/package/iteratee-parsec
[2] http://hackage.haskell.org/package/attoparsec-iteratee
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

Ryan Ingram
In reply to this post by David Fries-2
Here are some questions you have not answered that are quite important
for your design:

1) How important is retaining shared references?  In particular, is
the structure mutable after being read-in to memory?  If it is, and
you mutate an object, do you expect other references to that object to
be mutated as well?

2) If the answer to (1) is "not important", then how important is
parsing performance?  If you parse the same object more than once
(turning the directed graph into a tree), is that a problem?

This is an interesting problem and of course there are a lot of ways
to tackle it, but I think you need to elaborate a bit more on what is
needed.  If you want to maintain the "object graph" structure, you'll
treat the problem quite a bit differently than if you are happy to
convert it to a pure data structure.

  -- ryan

On Sun, Mar 14, 2010 at 9:03 AM, david fries <[hidden email]> wrote:

> Hello Café
>
> Some time ago I wrote a parser for a project of one our customers. The
> format was proprietary and binary. The data was structured as a tree
> with tables pointing to sub tables farther in the file. (Well actually
> there was one or two cases where branches joined together, so I guess it
> was a directed graph.) Also it had no defined endianess, some tables
> were bigendian others were little endian and some were in their very own
> in-house-endianess.
> All in all, the typical binary data structure that has been in use and
> continuously extended for the last 15 years. You know the kind.
>
> Oddly enough, our customer never bothered to write a parser of their
> own. I wonder why.
>
> My parser was written in C# and wasn't particularly elegant, but it
> worked reliably. I was wondering how you would parse tree-like
> structures with Parsec (or other functional parsers)? Up to know, all
> examples I've seen were of sequential data. To parse trees, you'd
> essentially need to be able to follow pointers, parse whatever is there
> and then jump back. I guess I'd have to mess around with the internal
> state of the parser to do that, which is something I'd rather avoid.
>
>
> regards,
> dave
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe
>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

David Fries-2
In reply to this post by John Lato-2
Thanks for the links!

I didn't realize that Iteratee can also do random access. I looked at
Oleg's slides a while back, but I haven't wrapped my head around it.

On Mon, 2010-03-15 at 08:57 -0500, John Lato wrote:

> Hi Dave,
>
> > From: david fries <[hidden email]>
> >
> > Hello Café
> >
> > Some time ago I wrote a parser for a project of one our customers. The
> > format was proprietary and binary. The data was structured as a tree
> > with tables pointing to sub tables farther in the file. (Well actually
> > there was one or two cases where branches joined together, so I guess it
> > was a directed graph.) Also it had no defined endianess, some tables
> > were bigendian others were little endian and some were in their very own
> > in-house-endianess.
> > All in all, the typical binary data structure that has been in use and
> > continuously extended for the last 15 years. You know the kind.
> >
> > My parser was written in C# and wasn't particularly elegant, but it
> > worked reliably. I was wondering how you would parse tree-like
> > structures with Parsec (or other functional parsers)? Up to know, all
> > examples I've seen were of sequential data. To parse trees, you'd
> > essentially need to be able to follow pointers, parse whatever is there
> > and then jump back. I guess I'd have to mess around with the internal
> > state of the parser to do that, which is something I'd rather avoid.
>
> Have you seen Edward Kmett's Parsing Trifecta talk/slides, available
> at http://comonad.com/reader/2009/iteratees-parsec-and-monoid/ ?  I
> think a similar approach would allow you to parse this data structure.
>  The trick is to create a data source that can read from arbitrary
> locations.
>
> Also see http://okmij.org/ftp/Haskell/Iteratee/Tiff.hs for an
> iteratee-based TIFF reader.  TIFF files have a similar structure, so
> this code could apply pretty directly.  The parsing is quite
> low-level, but you could use iteratee-parsec [1] or
> attoparsec-iteratee [2] to use parser combinators with this technique.
>
> [1] http://hackage.haskell.org/package/iteratee-parsec
> [2] http://hackage.haskell.org/package/attoparsec-iteratee


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

David Fries-2
In reply to this post by Ryan Ingram
Hi Ryan,

1) Retaining shared references was not important to us at the time. We
were only interested in the correctness of the values that were written.

2) Performance wasn't a big issue either. The parser was part of a
debugging tool we wrote for internal use.

My my concern was how you would perform random access in a functional
parser. You're points are interesting too. I guess if we really had
wanted to work with parsed objects, retaining the shared references
would have been a must.

On Mon, 2010-03-15 at 14:57 -0700, Ryan Ingram wrote:

> Here are some questions you have not answered that are quite important
> for your design:
>
> 1) How important is retaining shared references?  In particular, is
> the structure mutable after being read-in to memory?  If it is, and
> you mutate an object, do you expect other references to that object to
> be mutated as well?
>
> 2) If the answer to (1) is "not important", then how important is
> parsing performance?  If you parse the same object more than once
> (turning the directed graph into a tree), is that a problem?
>
> This is an interesting problem and of course there are a lot of ways
> to tackle it, but I think you need to elaborate a bit more on what is
> needed.  If you want to maintain the "object graph" structure, you'll
> treat the problem quite a bit differently than if you are happy to
> convert it to a pure data structure.
>
>   -- ryan
>
> On Sun, Mar 14, 2010 at 9:03 AM, david fries <[hidden email]> wrote:
> > Hello Café
> >
> > Some time ago I wrote a parser for a project of one our customers. The
> > format was proprietary and binary. The data was structured as a tree
> > with tables pointing to sub tables farther in the file. (Well actually
> > there was one or two cases where branches joined together, so I guess it
> > was a directed graph.) Also it had no defined endianess, some tables
> > were bigendian others were little endian and some were in their very own
> > in-house-endianess.
> > All in all, the typical binary data structure that has been in use and
> > continuously extended for the last 15 years. You know the kind.
> >
> > Oddly enough, our customer never bothered to write a parser of their
> > own. I wonder why.
> >
> > My parser was written in C# and wasn't particularly elegant, but it
> > worked reliably. I was wondering how you would parse tree-like
> > structures with Parsec (or other functional parsers)? Up to know, all
> > examples I've seen were of sequential data. To parse trees, you'd
> > essentially need to be able to follow pointers, parse whatever is there
> > and then jump back. I guess I'd have to mess around with the internal
> > state of the parser to do that, which is something I'd rather avoid.
> >
> >
> > regards,
> > dave
> >
> > _______________________________________________
> > Haskell-Cafe mailing list
> > [hidden email]
> > http://www.haskell.org/mailman/listinfo/haskell-cafe
> >


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

wren ng thornton
david fries wrote:
> My my concern was how you would perform random access in a functional
> parser. You're points are interesting too. I guess if we really had
> wanted to work with parsed objects, retaining the shared references
> would have been a must.

Out of curiosity, was there *really* a need for random access? It sounds
like you could've written things to do it in one forward pass, keeping a
table on the side of the targets for dangling pointers, and then
back-patching once you've reached the object being pointed at. You may
even be able to use laziness to do the back-patching instead of needing
STRefs. The only problem I could see with that approach is if there was
funny bit-packing going on so that objects were overlapped in the same
memory region.

--
Live well,
~wren
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

Stephen Tetley-2
hi wren

Where I've used it, random access does seem conceptual more
satisfactory than trying to avoid it.

Well designed binary formats are deterministic - so wherever you are
inside the file you should know what you are parsing. One example of
this determinism is that parsing "local" alternatives are generally
encoded on a single tag, whereas in a parser for a programming
language parsing alternatives might require lookahead and possibly
other disambiguation. Another example is that formats will often have
a "index table" table at the start of the file giving start position
and length for all the sub tables - this saves you from having to
start each table with a tag and length.

For complex formats e.g. PECOFF or TrueType, you might only want to
parse certain tables [*]. After parsing the index table you could
build a list of parsers to run sequentially on the body of the file
(including parsers that just drop unwanted tables), but this seems too
much work (and too much to go wrong), when I can use a function almost
as simple as parseAt for the tables I'm interested in.

parseAt :: Start -> End -> Parser a -> Parser a

In practice, I made parseAt slightly more complicated so it could
encode where the cursor is moved to at the end of the parse.

Best wishes

Stephen

[*] Certain tables in TrueType / OpenType are propriety - it might be
unwise for an open source parser to even include parsing operations
for those tables.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

wren ng thornton
Stephen Tetley wrote:
> hi wren
>
> Where I've used it, random access does seem conceptual more
> satisfactory than trying to avoid it.

I was thinking more about performance issues (avoiding disk seeks) which
would also alleviate the problem of needing random access when it's not
available.


> For complex formats e.g. PECOFF or TrueType, you might only want to
> parse certain tables [*]. After parsing the index table you could
> build a list of parsers to run sequentially on the body of the file
> (including parsers that just drop unwanted tables), but this seems too
> much work (and too much to go wrong)

Even still, you could linearize the access in order to minimize seeking
back and forth. The linearization could even be done automatically by
having the table of what needs backpatching also serve as a work queue
(when done with a block, seek to the closest next block; when the queue
is empty, you're done). The backpatching queue also preserves structure
sharing.

Perhaps I've just been parsing too many multigigabyte files lately... :)

--
Live well,
~wren
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Parsec to parse tree structures?

Jason Dagit-2
In reply to this post by David Fries-2


On Sun, Mar 14, 2010 at 9:03 AM, david fries <[hidden email]> wrote:
Hello Café

Some time ago I wrote a parser for a project of one our customers. The
format was proprietary and binary. The data was structured as a tree
with tables pointing to sub tables farther in the file. (Well actually
there was one or two cases where branches joined together, so I guess it
was a directed graph.) Also it had no defined endianess, some tables
were bigendian others were little endian and some were in their very own
in-house-endianess.
All in all, the typical binary data structure that has been in use and
continuously extended for the last 15 years. You know the kind.

Oddly enough, our customer never bothered to write a parser of their
own. I wonder why.

My parser was written in C# and wasn't particularly elegant, but it
worked reliably. I was wondering how you would parse tree-like
structures with Parsec (or other functional parsers)? Up to know, all
examples I've seen were of sequential data. To parse trees, you'd
essentially need to be able to follow pointers, parse whatever is there
and then jump back. I guess I'd have to mess around with the internal
state of the parser to do that, which is something I'd rather avoid.

Would binary or cereal be right for this task?

My understanding is they provide low-level ways to get at bytes in a chunk of memory.  Using them, I believe you'd read the data into a bytestring and then load in the bytes and seek/jump as needed.  I don't know if they support the backwards jumping though.

And, I think ideally you use it by defining a type class instance for each object in the binary format and then use the provided type classes to just get/put your format.

If the representation is very much like a C struct, you might consider Storable.  Although, I've never heard of anyone using it as a parser.

Jason

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe