Quantcast

List x ByteString x Lazy Bytestring

classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

List x ByteString x Lazy Bytestring

John Sneer
Hello Café,

  I've used Haskell and GHC to solve particular real life application. 4
  tools were developed and their function is almost the same - they
  modify textual input according to patterns found in the text. Thus, it
  is something like a compiler, the result is also a text and it is not
  parsed to tokens as patterns appear on a different level.

  The tools differ in tasks and number of modifications performed,
  otherwise, in principal, they are very much similar.

  I used lists (Prelude, Data.List) to develop the tools. After
  successfully completing the development, I've started to optimize the
  code to make the tools faster. After modification of some algorithms
  (which dropped the processing time notably), I started to change data
  structures. I swapped lists with lazy bytestrings. Nevertheless, what
  an unpleasant surprise, the processing speed dropped down,
  significantly / more then 30% time needed).

  So my questions follow:
- What kind of application is lazy bytestring suitable for?
- Would it be worth using strict bytestring even if input files may be
large? (They would fit in memory, but may consume whole)
- If bytestring is not suitable for text manipulation, is there
something faster than lists?
- It would be nice to have native sort for lazy bytestring - would it be
slower than  pack $ Data.List.sort $ unpack ?
- If bytestring is suitable for text manipulation could we have some
hGetTextualContents which translates Windows EOL (CR+LF) to LF?

I'm sorry if the answers are obvious.
Could anyone point me to some article/pages I could read about?

  Thanks and regards,

    John

--
http://www.fastmail.fm - A fast, anti-spam email service.


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Simon Hengel
Hi,

> - If bytestring is not suitable for text manipulation, is there
> something faster than lists?
Have a look at the text package[1].                                                                                                                            
                                                                                                                                                               
Cheers,                                                                                                                                                        
Simon                                                                                                                                                          
                                                                                                                                                               
[1] http://hackage.haskell.org/package/text   

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Yves Parès
However the performance issues seem odd: text is based on bytestring.

2011/12/5 Simon Hengel <[hidden email]>
Hi,

> - If bytestring is not suitable for text manipulation, is there
> something faster than lists?
Have a look at the text package[1].

Cheers,
Simon

[1] http://hackage.haskell.org/package/text

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Johan Tibell-2
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès <[hidden email]> wrote:
However the performance issues seem odd: text is based on bytestring.

This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.

-- Johan


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Daniel Fischer
In reply to this post by John Sneer
On Monday 05 December 2011, 14:14:56, John Sneer wrote:
> I've used Haskell and GHC to solve particular real life application. 4
>   tools were developed and their function is almost the same - they
>   modify textual input according to patterns found in the text. Thus, it

Hmm, modification can be a problem for ByteStrings, since it entails
copying. That could be worse for strict BytStrings than lazy, if in the
lazy ByteString you can reuse many chunks.

>   is something like a compiler, the result is also a text and it is not
>   parsed to tokens as patterns appear on a different level.
>
>   The tools differ in tasks and number of modifications performed,
>   otherwise, in principal, they are very much similar.
>
>   I used lists (Prelude, Data.List) to develop the tools. After
>   successfully completing the development, I've started to optimize the
>   code to make the tools faster. After modification of some algorithms
>   (which dropped the processing time notably), I started to change data
>   structures. I swapped lists with lazy bytestrings. Nevertheless, what
>   an unpleasant surprise, the processing speed dropped down,
>   significantly / more then 30% time needed).

Two main possibilities:
1. your algorithm isn't suited for ByteStrings
2. you're doing it wrong

The above indicates 1., but without a more detailed description and/or
code, it's impossible to tell.

>
>   So my questions follow:
> - What kind of application is lazy bytestring suitable for?

Anything that involves examining large sequences of bytes (or ASCII
[latin1/other single-byte encoding] text) basically sequentially (it's not
good if you have to jump forwards and backwards a lot and far).
Also some types of modification of such data.

> - Would it be worth using strict bytestring even if input files may be
> large? (They would fit in memory, but may consume whole)

Probably not, see above. But see above.

> - If bytestring is not suitable for text manipulation, is there
> something faster than lists?

text has already been mentioned, but again, there are types of manipulation
it's not well-suited for and where a linked list may be superior.

> - It would be nice to have native sort for lazy bytestring - would it be
> slower than  pack $ Data.List.sort $ unpack ?

The natural sort for ByteStrings would be a counting sort,
O(alphabet size + length), so for long ByteStrings, it should be
significantly faster than pack . sort . unpack, but for short ones, it
would be significantly slower.

> - If bytestring is suitable for text manipulation could we have some
> hGetTextualContents which translates Windows EOL (CR+LF) to LF?

Doing such a transformation would be kind of against the purpose of
ByteStrings, I think.  Isn't the point of ByteStrings to get the raw bytes
as efficiently as possible?


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Yves Parès
In reply to this post by Johan Tibell-2
Oh, sorry, my bad.
I misunderstood the dependency.

2011/12/5 Johan Tibell <[hidden email]>
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès <[hidden email]> wrote:
However the performance issues seem odd: text is based on bytestring.

This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.

-- Johan



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Axman
I wonder what profiling tells you, you should identify where your performance problems actually are before trying to optimise.

Some things that might help are using something like Blaze-Builder[1] to construct your bytestrings for output. I'm hoping that they're sufficiently lazy that you can lazily read in the input, and write output as you've made it available. if you use the blaze-builder-enumerator package, you should be able to get exactly what you want (but probably requires some minor knowledge of iteratees).

Anyway, without seeing your code, we can't easily tell you what's wrong.

Cheers,

On 6 December 2011 02:11, Yves Parès <[hidden email]> wrote:
Oh, sorry, my bad.
I misunderstood the dependency.


2011/12/5 Johan Tibell <[hidden email]>
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès <[hidden email]> wrote:
However the performance issues seem odd: text is based on bytestring.

This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.

-- Johan



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe




--
-- Alex Mason

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

John Sneer
Thank you for your replies and provided links.
 
Unfortunately, I cannot provide the code as my boss thinks it could provide some extra information - definitely not Haskellish, but other.
 
Nevetheless, your contribution was helpful and I was able to tune the code to get even better perofmance.
 
Moreover, the performance is not big issue - I can process almost one million of files from 40 to 120 seconds, where sizes of files vary from several hundreds of bytes to several hundred of megabytes. I just thought I could improve overall tool-chain performance (several such tools are used in a chain).
 
Thank you very much, I'l try the blaze-builder on the smallest tool and if any interesting outcome apperas I'll let you know.
 
Best regards,
 
  John
 
 
On Tue, Dec 6, 2011, at 03:27 PM, Axman wrote:
I wonder what profiling tells you, you should identify where your performance problems actually are before trying to optimise.
 
Some things that might help are using something like Blaze-Builder[1] to construct your bytestrings for output. I'm hoping that they're sufficiently lazy that you can lazily read in the input, and write output as you've made it available. if you use the blaze-builder-enumerator package, you should be able to get exactly what you want (but probably requires some minor knowledge of iteratees).

Anyway, without seeing your code, we can't easily tell you what's wrong.
 
Cheers,
 
On 6 December 2011 02:11, Yves Parès <[hidden email]> wrote:
Oh, sorry, my bad.
I misunderstood the dependency.


2011/12/5 Johan Tibell <[hidden email]>
On Mon, Dec 5, 2011 at 6:09 AM, Yves Parès <[hidden email]> wrote:
However the performance issues seem odd: text is based on bytestring.
 
This is not the case. Text is based on ByteArray#, GHC internal type for blocks of bytes. The text package depends on the bytestring package because it allows you to encode/decode Text<->ByteString.
 
-- Johan
 

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
 


 
--
-- Alex Mason
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

 
-- 
http://www.fastmail.fm - The professional email service

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

John Sneer
In reply to this post by Daniel Fischer

Hello!


> > I've used Haskell and GHC to solve particular real life application. 4
> >   tools were developed and their function is almost the same - they
> >   modify textual input according to patterns found in the text. Thus, it
>
> Hmm, modification can be a problem for ByteStrings, since it entails
> copying. That could be worse for strict BytStrings than lazy, if in the
> lazy ByteString you can reuse many chunks.

I understand now, that is probably the point.


>
> Two main possibilities:
> 1. your algorithm isn't suited for ByteStrings
> 2. you're doing it wrong
>
> The above indicates 1., but without a more detailed description and/or
> code, it's impossible to tell.


Yes, it seems that the (1) is the point, because I split and re-build
the bytestream many times during processing.


> >   So my questions follow:
> > - What kind of application is lazy bytestring suitable for?
>
> Anything that involves examining large sequences of bytes (or ASCII
> [latin1/other single-byte encoding] text) basically sequentially (it's
> not
> good if you have to jump forwards and backwards a lot and far).
> Also some types of modification of such data.
>
> > - Would it be worth using strict bytestring even if input files may be
> > large? (They would fit in memory, but may consume whole)
>
> Probably not, see above. But see above.
>
> > - If bytestring is not suitable for text manipulation, is there
> > something faster than lists?
>
> text has already been mentioned, but again, there are types of
> manipulation
> it's not well-suited for and where a linked list may be superior.
>
> > - It would be nice to have native sort for lazy bytestring - would it be
> > slower than  pack $ Data.List.sort $ unpack ?
>
> The natural sort for ByteStrings would be a counting sort,
> O(alphabet size + length), so for long ByteStrings, it should be
> significantly faster than pack . sort . unpack, but for short ones, it
> would be significantly slower.
>
> > - If bytestring is suitable for text manipulation could we have some
> > hGetTextualContents which translates Windows EOL (CR+LF) to LF?
>
> Doing such a transformation would be kind of against the purpose of
> ByteStrings, I think.  Isn't the point of ByteStrings to get the raw
> bytes
> as efficiently as possible?


OK, thank you very much for explanation.

Best regards,

  John


--
http://www.fastmail.fm - Email service worth paying for. Try it for free


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate
star

Re: List x ByteString x Lazy Bytestring

Simon Meier-2
Hi John,

>> > I've used Haskell and GHC to solve particular real life application. 4
>> >   tools were developed and their function is almost the same - they
>> >   modify textual input according to patterns found in the text. Thus, it
>>
>> Hmm, modification can be a problem for ByteStrings, since it entails
>> copying. That could be worse for strict BytStrings than lazy, if in the
>> lazy ByteString you can reuse many chunks.
>
> I understand now, that is probably the point.
>
>
>>
>> Two main possibilities:
>> 1. your algorithm isn't suited for ByteStrings
>> 2. you're doing it wrong
>>
>> The above indicates 1., but without a more detailed description and/or
>> code, it's impossible to tell.
>
>
> Yes, it seems that the (1) is the point, because I split and re-build
> the bytestream many times during processing.

For splitting it might be interesting to have a look at 'attoparsec'
(http://hackage.haskell.org/package/attoparsec), which are parser
combinators specialized to bytestring's. If the splitting still leaves
large enough chunks of the original input intact (large enough being
around > 128 bytes), then you might be able to achieve a benefit.

To use strict and lazy bytestrings efficiently, it helps a lot to know
their internal representation (a slice of a char[] array and lists of
slices of char[] arrays) and to think about what the different
operations cost on this datastructure. (The source code of the library
is actually not that hard to understand.) An obviously very expensive
operation is concatenation (guaranteed copying for strict bytestrings
at least a list traversal for lazy bytestrings). The blaze-builder
library provides a type that supports O(1) concatenation of sequences
of bytes and an efficient conversion to a lazy bytestring that has a
large average chunk size. Large chunks are important to amortize the
work spent on the boundary with the speedup gained due to the compact
and cache efficient representation.

If you keep using lists, but need a lot of concatenation using
difference lists (http://hackage.haskell.org/package/dlist) also
allows a O(1) concatenation. Note that the "cost" of having this O(1)
concatenation is that the resulting list cannot be inspected, as long
as it is in a form that supports O(1) concatenation. The same holds
for the lazy bytestring builder provided by blaze-builder.

Note that I'd be very curious, if you could achieve any further
performance improvement using blaze-builder...which is no surprise, as
I'm its author :-) Note also that the API has been cleaned up and will
be published with the next release of the bytestring library. So just
ignore the whole 'write' stuff. I should have separated it.

good luck and best regards,
Simon

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Loading...