Is it possible to have constant-space JSON decoding?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Is it possible to have constant-space JSON decoding?

oleg-30

I am doing, for several months, constant-space processing of large XML
files using iteratees. The file contains many XML elements (which are
a bit complex than a number). An element can be processed
independently. After the parser finished with one element, and dumped
the related data, the processing of the next element starts anew, so
to speak. No significant state is accumulated for the overall parsing
sans the counters of processed and bad elements, for statistics. XML
is somewhat like JSON, only more complex: an XML parser has to deal
with namespaces, parsed entities, CDATA sections and the other
interesting stuff. Therefore, I'm quite sure there should not be
fundamental problems in constant-space parsing of JSON.

BTW, the parser itself is described there
        http://okmij.org/ftp/Streams.html#xml



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Is it possible to have constant-space JSON decoding?

Johan Tibell-2
Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM,  <[hidden email]> wrote:

> I am doing, for several months, constant-space processing of large XML
> files using iteratees. The file contains many XML elements (which are
> a bit complex than a number). An element can be processed
> independently. After the parser finished with one element, and dumped
> the related data, the processing of the next element starts anew, so
> to speak. No significant state is accumulated for the overall parsing
> sans the counters of processed and bad elements, for statistics. XML
> is somewhat like JSON, only more complex: an XML parser has to deal
> with namespaces, parsed entities, CDATA sections and the other
> interesting stuff. Therefore, I'm quite sure there should not be
> fundamental problems in constant-space parsing of JSON.
>
> BTW, the parser itself is described there
>         http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't
have (I think) is a function:

    decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here
says that we will return Nothing if parsing fails. We can only do so
after looking at the whole input (otherwise how would we know if it's
malformed).

The use cases aeson was designed for (which I bet is the majority use
case) is parsing smaller messages sent over the network (i.e. in web
service APIs) so this is the only mode of parsing it supplies.

-- Johan

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Is it possible to have constant-space JSON decoding?

Jason Dagit-3


On Tue, Dec 4, 2012 at 9:37 PM, Johan Tibell <[hidden email]> wrote:
Hi Oleg,

On Tue, Dec 4, 2012 at 9:13 PM,  <[hidden email]> wrote:
> I am doing, for several months, constant-space processing of large XML
> files using iteratees. The file contains many XML elements (which are
> a bit complex than a number). An element can be processed
> independently. After the parser finished with one element, and dumped
> the related data, the processing of the next element starts anew, so
> to speak. No significant state is accumulated for the overall parsing
> sans the counters of processed and bad elements, for statistics. XML
> is somewhat like JSON, only more complex: an XML parser has to deal
> with namespaces, parsed entities, CDATA sections and the other
> interesting stuff. Therefore, I'm quite sure there should not be
> fundamental problems in constant-space parsing of JSON.
>
> BTW, the parser itself is described there
>         http://okmij.org/ftp/Streams.html#xml

It certainly is possible (using a SAX style parser). What you can't
have (I think) is a function:

    decode :: FromJSON a => ByteString -> Maybe a

and constant-memory parsing at the same time. The return type here
says that we will return Nothing if parsing fails. We can only do so
after looking at the whole input (otherwise how would we know if it's
malformed).

I thought it was possible to get around this with lazy patterns such "Wadler's force function" [1]?

(untested code)

force y =
  let Just x = y
  in Just x

lazyDecode :: FromJSON a => ByteString -> Maybe a
lazyDecode = force . decode




_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Is it possible to have constant-space JSON decoding?

Malcolm Wallace-2
In reply to this post by Johan Tibell-2
See also the incremental XML parser in HaXml, described in "Partial parsing: combining choice with commitment", IFL 2006.  It has constant space usage (for some patterns of usage), even with extremely large inputs.

http://www.google.co.uk/url?sa=t&rct=j&q=malcolm+wallace+partial+parsing&source=web&cd=2&ved=0CEEQFjAB&url=http%3A%2F%2Fciteseerx.ist.psu.edu%2Fviewdoc%2Fdownload%3Fdoi%3D10.1.1.135.7512%26rep%3Drep1%26type%3Dpdf&ei=Db3BUNmiOsfS4QTAkoDYAw&usg=AFQjCNHHywUCvaFv8eBoQ-x9jj4GOMHo2w

On 5 Dec 2012, at 05:37, Johan Tibell wrote:

> Hi Oleg,
>
> On Tue, Dec 4, 2012 at 9:13 PM,  <[hidden email]> wrote:
>> I am doing, for several months, constant-space processing of large XML
>> files using iteratees. The file contains many XML elements (which are
>> a bit complex than a number). An element can be processed
>> independently. After the parser finished with one element, and dumped
>> the related data, the processing of the next element starts anew, so
>> to speak. No significant state is accumulated for the overall parsing
>> sans the counters of processed and bad elements, for statistics. XML
>> is somewhat like JSON, only more complex: an XML parser has to deal
>> with namespaces, parsed entities, CDATA sections and the other
>> interesting stuff. Therefore, I'm quite sure there should not be
>> fundamental problems in constant-space parsing of JSON.
>>
>> BTW, the parser itself is described there
>>        http://okmij.org/ftp/Streams.html#xml
>
> It certainly is possible (using a SAX style parser). What you can't
> have (I think) is a function:
>
>    decode :: FromJSON a => ByteString -> Maybe a
>
> and constant-memory parsing at the same time. The return type here
> says that we will return Nothing if parsing fails. We can only do so
> after looking at the whole input (otherwise how would we know if it's
> malformed).
>
> The use cases aeson was designed for (which I bet is the majority use
> case) is parsing smaller messages sent over the network (i.e. in web
> service APIs) so this is the only mode of parsing it supplies.
>
> -- Johan
>
> _______________________________________________
> Haskell-Cafe mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Is it possible to have constant-space JSON decoding?

Albert Y. C. Lai
In reply to this post by Jason Dagit-3
On 12-12-05 12:48 AM, Jason Dagit wrote:

> I thought it was possible to get around this with lazy patterns such
> "Wadler's force function" [1]?
>
> (untested code)
>
> force y =
>    let Just x = y
>    in Just x
>
> lazyDecode :: FromJSON a => ByteString -> Maybe a
> lazyDecode = force . decode

This says, the type is Maybe, but the value is always Just. If there
will be a parse error, you will get an async imprecise exception, not
Nothing, at the later time when you look closer.

This respects the letter, but not the point, of the Maybe type. If you
are to do this (I have no moral, I hold nothing against doing this, the
async imprecise exception will give you enough grief), you may as well
skip the Just wrapping and directly go ByteString -> a.

The point of the Maybe type is to communicate parse errors by a less
async, less imprecise way.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Is it possible to have constant-space JSON decoding?

oleg-30
In reply to this post by Johan Tibell-2

Johan Tibell posed an interesting problem of incremental XML parsing
while still detecting and reporting ill-formedness errors.
> What you can't have (I think) is a function:
>
>     decode :: FromJSON a => ByteString -> Maybe a
>
> and constant-memory parsing at the same time. The return type here
> says that we will return Nothing if parsing fails. We can only do so
> after looking at the whole input (otherwise how would we know if it's
> malformed).

The problem is very common: suppose we receive an XML document over a
network (e.g., in an HTTP stream). Network connections are inherently
unreliable and can be dropped at any time (e.g., because someone
tripped over a power cord). The XML document therefore can come
truncated, for example, missing the end tag for the root
element. According to the XML Recommendations, such document is
ill-formed, and hence does not have an Infoset (In contrast, invalid
but well-formed documents do have the Infoset). Strictly speaking, we
should not be processing an XML document until we verified that it is
well-formed, that is, until we parsed it at all and have checked that
all end tags are in place. It seems we can't do the incremental XML
processing in principle.

I should mention first that sometimes people avoid such a strict
interpretation. If we process telemetry data encoded in XML, we may
consider a document meaningful even if the root end tag is missing. We
process as far as we can.

Even if we take the strict interpretation, it is still possible
to handle a document incrementally so long as the processing is
functional or the side effects can be backed out (e.g., in a
transaction). This message illustrates exactly such an incremental
processing that always detects ill-formed XML, and, optionally,
invalid XML. It is possible after all to detect ill-formedness
errors and process the document without loading it all in memory
first -- using as little memory as needed to hold the state of the
processor -- just a short string in our example.

Our running example is an XML document representing a finite map:
a collection of key-value pairs where key is an integer:

  <map>
   <kv><key>1</key><value>v1</value></kv>
   <kv><key>2</key><value>v2</value></kv>
   <kv><key>bad</key><value>v3</value></kv>

The above document is both ill-formed (missing the end tag)
and invalid (one key is bad: non-read-able). We would like to write
a lookup function for a key in such an encoded map. We should report
an ill-formedness error always. The reporting of validation
errors may vary. The function

xml_lookup :: Monad m => Key -> Iteratee Char m (Maybe Value)
xml_lookup key = id .| xml_enum default_handlers .| xml_normalize
                 .| kv_enum (lkup key)

always reports the validation errors. The function is built
by composition from smaller, separately developed and tested
pipeline components: parsing of a
document to the XMLStream, normalization, converting the XMLStream to
a stream of (Key,Value) pairs and finally searching the stream.
A similar function that replaces kv_enum with kv_enum_pterm
terminates the (Key,Value) conversion as soon as its client iteratee
finished. In that case if the lookup succeeds before we encounter the bad
key, no error is reported. Ill-formedness errors are raised always.

The code
        http://okmij.org/ftp/Haskell/Iteratee/XMLookup.hs

shows the examples of both methods of validation error reporting.
This code also illustrates the library of parsing combinators, which
represent the element structure (`DTD').


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe