Efficient way to take n bytes from lazy bytestring

Efficient way to take n bytes from lazy bytestring

Sal
This question is regarding the Streaming library. What is an efficient way to take n bytes from a lazy bytestring?

What I have is an Aeson decoder that decodes a lazy bytestring read from the network. Since it will read the whole thing into memory (which is fine, btw), one might want to place a limit on the maximum bytes consumed, to prevent malicious OOM attacks involving large payloads. Here is my attempt at limiting the number of bytes; in the example below, we limit the maximum to 4:

Prelude> import Data.ByteString.Streaming as SBS
Prelude SBS> import Streaming as S -- to import Of constructor
Prelude SBS S> :set -XBangPatterns
Prelude SBS S> :set -XOverloadedStrings
Prelude SBS S>  (\(!x :> y) -> return x) =<< (SBS.toLazy . (SBS.take 4) . SBS.fromLazy $ "hello")
"hell"



Just want to check whether the above code can be improved in any way.


Currently, I have the Aeson `decode` function consuming websocket network bytes without any limit on message size. With a `limitBytes` function like the above, I can do something like this:


-- before: decode x
-- after: if the bytes exceed the limit, the rest of the bytes are dropped,
-- Aeson decode will fail, and we return Nothing
return . decode =<< limitBytes x
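Wrapped up as a function, the GHCi one-liner above might look like this. This is only a sketch: `limitBytes` is the hypothetical name used above, and it assumes the streaming-bytestring package's `Data.ByteString.Streaming` module:

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as LBS
import qualified Data.ByteString.Streaming as SBS
import Streaming (Of ((:>)))

-- Cap a lazy bytestring at n bytes by round-tripping through the
-- streaming representation, as in the GHCi session above.
limitBytes :: Monad m => Int64 -> LBS.ByteString -> m LBS.ByteString
limitBytes n = fmap (\(x :> _) -> x) . SBS.toLazy . SBS.take n . SBS.fromLazy
```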



Re: Efficient way to take n bytes from lazy bytestring

Sal
Hmm, after posting this, I realized that my solution might be over-engineered! We can just use the `take` function of lazy bytestring, since we don't really use anything from Streaming.
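In code, that simplification might look like this (a sketch, keeping the hypothetical `limitBytes` name from the earlier post):

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as LBS

-- With lazy bytestrings, capping the length needs no streaming
-- machinery at all: take is lazy and only forces the chunks it keeps.
limitBytes :: Int64 -> LBS.ByteString -> LBS.ByteString
limitBytes = LBS.take
```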

On Friday, June 10, 2016 at 11:14:10 AM UTC-4, Sal wrote:


Re: Efficient way to take n bytes from lazy bytestring

Michael Thompson
You can if you like apply aeson/attoparsec parsers directly to a 'streaming bytestring'. 

      import Streaming
      import qualified Data.ByteString.Streaming.Char8 as Q
      import qualified Data.ByteString.Streaming.Aeson as Aeson
      import Data.Aeson 
      import qualified Data.Attoparsec.ByteString.Streaming as A


I can write things like this (note there is the obvious IsString instance for `ByteString m r`):

    >>> (a,b) <- A.parse json' $ void $ Q.splitAt 27  "{\"a\":[1,2],\"b\":[3,4] }      xxxxxxxxxxxxxxxxx"    

    >>> a
    Left (Object (fromList [("a",Array [Number 1.0,Number 2.0]),("b",Array [Number 3.0,Number 4.0])]))

    >>> Q.length b
    5 :> ()

Obviously one should take account of where, in general, the bytes one is not reading were supposed to be coming from, and similar questions.


Re: Efficient way to take n bytes from lazy bytestring

Michael Thompson
Again, the above should be exactly the same as the pipes-bytestring/attoparsec/aeson version.
The only difference is that `splitAt` is a direct function, not a lens, so a pipes version would use `view (PB.splitAt n)`.
The lenses mostly subserve StateT parsing, and I was thinking of this as pipes-specific, like piping.
The 'Streaming' material is supposed to isolate a more direct, limited subsystem of the pipes approach.


Re: Efficient way to take n bytes from lazy bytestring

Sal
In reply to this post by Michael Thompson
Thanks, Michael. When would your suggested approach be more useful than something like the following (LBS is Data.ByteString.Lazy, decode is the Aeson decoder):

decode . LBS.take 27

That assumes all of the information stored in the JSON is used. If streaming JSON saves memory by looking at one object at a time, that could be more useful; likewise if the streaming approach fuses better than the solution above.
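Spelled out, the lazy-bytestring version amounts to something like this, using the 27-byte limit from the session above (the `decodeLimited` name is illustrative, not from the thread):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as LBS
import Data.Aeson (Value, decode)

-- Truncate before decoding: if the payload exceeds the limit, the JSON
-- is cut off mid-value and decode returns Nothing.
decodeLimited :: LBS.ByteString -> Maybe Value
decodeLimited = decode . LBS.take 27
```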

On Friday, June 10, 2016 at 5:38:38 PM UTC-4, Michael Thompson wrote:


Re: Efficient way to take n bytes from lazy bytestring

Michael Thompson
Byte stream types shouldn't have much advantage over a regular
lazy bytestring here, since aeson's `decode`, which is defined
using attoparsec's `parse`, will be the more expensive calculation,
and of course it accumulates whatever it needs, validating the input
before returning an answer. So there's not much point in giving it
a proper streaming IO input.

The json-stream library does permit you to stream values as they
arise, /if/ all the pieces fit together. 

http://hackage.haskell.org/package/json-stream-0.4.1.0/docs/Data-JsonStream-Parser.html

This is really only convenient if, for example, the top-level JSON value is
an array. I wrote a little helper for this, which you may have seen:

https://hackage.haskell.org/package/streaming-utils-0.1.4.3/docs/Data-ByteString-Streaming-Aeson.html#v:streamParse

The program at the top of the file uses it. But it all depends on the structure of the JSON
and what you are hoping to get from it. It is also simple to stream the elements of
one subordinate array of a top-level object, or to stream all the elements of
all the subordinate arrays of a top-level array, forgetting anything else
about the array element these elements came from (I think that's the
case in my example, if I remember).

But trying to get at more complex structure would require designing an
appropriate free monad for each case. (Maybe that's not so crazy, though.)

It isn't clear that something like `json-stream` doesn't violate the whole
idea of JSON, which envisages validation of the whole file as a condition of
parsing. The decision that named fields in an 'object' can
come in any order, can be missing, and so on, is a mess
from a number of points of view, one of them being proper streaming.
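For the top-level-array case, a minimal json-stream sketch might look like the following (assuming the json-stream package; `parseLazyByteString`, `arrayOf` and `value` are combinators from its `Data.JsonStream.Parser` module):

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.JsonStream.Parser (arrayOf, parseLazyByteString, value)

-- Elements of a top-level array are yielded incrementally as a lazy
-- list, without first validating the whole document.
streamInts :: [Int]
streamInts = parseLazyByteString (arrayOf value) "[1,2,3]"
```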
