String vs ByteString

String vs ByteString

Erik de Castro Lopo-34
Hi all,

I'm using Tagsoup to strip data out of some rather large XML files.

Since the files are large I'm using ByteString, but that leads me
to wonder what is the best way to handle clashes between Prelude
functions like putStrLn and the ByteString versions?

Anyone have any suggestions for doing this as neatly as possible?

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: String vs ByteString

Johan Tibell-2
Hi Erik,

On Fri, Aug 13, 2010 at 1:32 PM, Erik de Castro Lopo <[hidden email]> wrote:
> Since the files are large I'm using ByteString, but that leads me
> to wonder what is the best way to handle clashes between Prelude
> functions like putStrLn and the ByteString versions?
>
> Anyone have any suggestions for doing this as neatly as possible?

Use qualified imports, like so:

import qualified Data.ByteString as B

main = B.putStrLn $ B.pack "test"

Cheers,
Johan



Re: String vs ByteString

Michael Snoyman
In reply to this post by Erik de Castro Lopo-34
Just import the ByteString module qualified. In other words:

import qualified Data.ByteString as S

or for lazy bytestrings:

import qualified Data.ByteString.Lazy as L

Cheers,
Michael


Re: String vs ByteString

Michael Snoyman
In reply to this post by Johan Tibell-2


On Fri, Aug 13, 2010 at 2:42 PM, Johan Tibell <[hidden email]> wrote:
> Hi Erik,
>
> On Fri, Aug 13, 2010 at 1:32 PM, Erik de Castro Lopo <[hidden email]> wrote:
>> Since the files are large I'm using ByteString, but that leads me
>> to wonder what is the best way to handle clashes between Prelude
>> functions like putStrLn and the ByteString versions?
>>
>> Anyone have any suggestions for doing this as neatly as possible?
>
> Use qualified imports, like so:
>
> import qualified Data.ByteString as B
>
> main = B.putStrLn $ B.pack "test"

If you want to pack a String into a ByteString, you'll need to import Data.ByteString.Char8 instead.

Michael 


Re: String vs ByteString

Johan Tibell-2
On Fri, Aug 13, 2010 at 1:47 PM, Michael Snoyman <[hidden email]> wrote:
>> Use qualified imports, like so:
>>
>> import qualified Data.ByteString as B
>>
>> main = B.putStrLn $ B.pack "test"
>
> If you want to pack a String into a ByteString, you'll need to import Data.ByteString.Char8 instead.


Very true. That's what I get for using a random example without testing it first.

-- Johan
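
For reference, a minimal sketch of the corrected example, assuming the bytestring package is available. Data.ByteString.Char8 exports a pack that accepts a String and its own putStrLn, and the qualified prefix keeps both apart from the Prelude versions. The name message is only for illustration.

```haskell
-- Qualified import: every ByteString function is reached through the B. prefix,
-- so nothing clashes with Prelude names such as putStrLn.
import qualified Data.ByteString.Char8 as B

-- Char8.pack truncates each Char to 8 bits, so it is only safe for
-- ASCII/Latin-1 input.
message :: B.ByteString
message = B.pack "test"

main :: IO ()
main = B.putStrLn message
```

The same pattern works for lazy ByteStrings by importing Data.ByteString.Lazy.Char8 under a different alias.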



Re: String vs ByteString

Pierre-Etienne Meunier-3
In reply to this post by Erik de Castro Lopo-34
Hi,

Why don't you use the Data.Rope library?
The asymptotic complexities are way better than those of the ByteString functions.

PE


Re: String vs ByteString

Johan Tibell-2
On Fri, Aug 13, 2010 at 4:03 PM, Pierre-Etienne Meunier <[hidden email]> wrote:
> Hi,
>
> Why don't you use the Data.Rope library?
> The asymptotic complexities are way better than those of the ByteString functions.
>
> PE

For some operations. I'd expect it to be a constant factor slower on average though.

-- Johan



Re: String vs ByteString

Kevin Jardine-3
I'm interested to see this kind of open debate on performance,
especially about libraries that provide widely used data structures
such as strings.

One of the more puzzling aspects of Haskell for newbies is the large
number of libraries that appear to provide similar/duplicate
functionality.

The Haskell Platform deals with this to some extent, but it seems to
me that if there are new libraries that appear to provide performance
boosts over more widely used libraries, it would be best if the new
code gets incorporated into the existing more widely used libraries
rather than creating more code to maintain / choose from.

I think that open debate about performance trade-offs could help
consolidate the libraries.

Kevin


Re: Re: String vs ByteString

Johan Tibell-2
On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine <[hidden email]> wrote:
> I'm interested to see this kind of open debate on performance,
> especially about libraries that provide widely used data structures
> such as strings.
>
> One of the more puzzling aspects of Haskell for newbies is the large
> number of libraries that appear to provide similar/duplicate
> functionality.
>
> The Haskell Platform deals with this to some extent, but it seems to
> me that if there are new libraries that appear to provide performance
> boosts over more widely used libraries, it would be best if the new
> code gets incorporated into the existing more widely used libraries
> rather than creating more code to maintain / choose from.
>
> I think that open debate about performance trade-offs could help
> consolidate the libraries.
>
> Kevin

I agree.

Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskellers and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak.

Cheers,
Johan



Re: Re: String vs ByteString

Gábor Lehel
On Fri, Aug 13, 2010 at 4:43 PM, Johan Tibell <[hidden email]> wrote:

> On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine <[hidden email]>
> wrote:
>>
>> I'm interested to see this kind of open debate on performance,
>> especially about libraries that provide widely used data structures
>> such as strings.
>>
>> One of the more puzzling aspects of Haskell for newbies is the large
>> number of libraries that appear to provide similar/duplicate
>> functionality.
>>
>> The Haskell Platform deals with this to some extent, but it seems to
>> me that if there are new libraries that appear to provide performance
>> boosts over more widely used libraries, it would be best if the new
>> code gets incorporated into the existing more widely used libraries
>> rather than creating more code to maintain / choose from.
>>
>> I think that open debate about performance trade-offs could help
>> consolidate the libraries.
>>
>> Kevin
>
> I agree.
>
> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> have text, use Data.Text. Those libraries have benchmarks and have been well
> tuned by experienced Haskellers and should be the fastest and most memory
> compact in most cases. There are still a few cases where String beats Text
> but they are being worked on as we speak.

How about the case for text which is guaranteed to be in ascii/latin1?
ByteString again?

--
Work is punishment for failing to procrastinate effectively.

Re: Re: String vs ByteString

Sean Leather
In reply to this post by Johan Tibell-2
 
Johan Tibell wrote:
> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you have text, use Data.Text. Those libraries have benchmarks and have been well tuned by experienced Haskellers and should be the fastest and most memory-compact in most cases. There are still a few cases where String beats Text but they are being worked on as we speak.

Which one do you use for strings in HTML or XML in which UTF-8 has become the commonly accepted standard encoding? It's text, not binary, so I should choose Data.Text. But isn't there a performance penalty for translating from Data.Text's internal 16-bit encoding to UTF-8?

http://tools.ietf.org/html/rfc3629
http://www.utf8.com/

Regards,
Sean


Re: Re: String vs ByteString

Daniel Fischer-4
In reply to this post by Gábor Lehel
On Friday 13 August 2010 17:25:58, Gábor Lehel wrote:
> How about the case for text which is guaranteed to be in ascii/latin1?
> ByteString again?

If you can be sure that that won't change anytime soon, definitely.
Bonus points if you can write the code so that later changing to e.g.
Data.Text requires only a change of imports.
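
One way to arrange that, as a rough sketch: import the string module under a single alias and restrict the code to names that both Data.ByteString.Char8 and Data.Text export. The alias S, the type synonym Str and the helper countWords are hypothetical names chosen for this illustration, and the type synonym has to change along with the import.

```haskell
-- Current choice: strict ByteString with the Char8 interface.
import qualified Data.ByteString.Char8 as S

-- To switch to Text later, replace the import above with
--   import qualified Data.Text as S
-- and change the synonym below to S.Text.
type Str = S.ByteString

-- Hypothetical helper: relies only on names (words, pack) that both
-- Data.ByteString.Char8 and Data.Text export.
countWords :: Str -> Int
countWords = length . S.words

main :: IO ()
main = print (countWords (S.pack "one two three"))
```

This only works as long as the code stays within the overlap of the two APIs, so it is worth checking each function against both modules before relying on it.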

Re: Re: String vs ByteString

Daniel Fischer-4
In reply to this post by Sean Leather
On Friday 13 August 2010 17:27:32, Sean Leather wrote:
> Which one do you use for strings in HTML or XML in which UTF-8 has
> become the commonly accepted standard encoding? It's text, not binary,
> so I should choose Data.Text. But isn't there a performance penalty for
> translating from Data.Text's internal 16-bit encoding to UTF-8?

Yes there is.
Whether using String, Data.Text or Data.ByteString + Data.ByteString.UTF8
is the best choice depends on what you do. Test and then decide.

Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Gábor Lehel
2010/8/13 Gábor Lehel <[hidden email]>
> How about the case for text which is guaranteed to be in ascii/latin1?
> ByteString again?
 
If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
  1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
  2. In many cases, the API is easier to use, because it's oriented towards using text data, instead of being a port of the list API.
  3. Some commonly used functions, such as substring searching, are way faster than their ByteString counterparts.
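
A small sketch of the first point, assuming the text and bytestring packages are installed. The names upperText and upperBytes are made up for this example, and the escape \223 is the Latin-1 code point of "ß".

```haskell
import qualified Data.ByteString.Char8 as C
import qualified Data.Char as Char
import qualified Data.Text as T

-- Text.toUpper does full case conversion, so one character can become two:
-- the sharp s is expanded to "SS".
upperText :: T.Text
upperText = T.toUpper (T.pack "stra\223e")

-- Mapping Char.toUpper over the bytes works one character at a time.
-- The sharp s has no single-character uppercase form, so it is left as is.
upperBytes :: C.ByteString
upperBytes = C.map Char.toUpper (C.pack "stra\223e")

main :: IO ()
main = do
  print upperText
  print upperBytes
```

The Text result is one character longer than its input, which a per-character map over a ByteString cannot express.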


Re: Re: String vs ByteString

Donn Cave-4
In reply to this post by Sean Leather
Quoth Sean Leather <[hidden email]>,

> Which one do you use for strings in HTML or XML in which UTF-8 has become
> the commonly accepted standard encoding? It's text, not binary, so I should
> choose Data.Text. But isn't there a performance penalty for translating from
> Data.Text's internal 16-bit encoding to UTF-8?

Use both?

I am not familiar with Text, but UTF-8 is pretty awkward, and I will
sure look into Text before wasting any time trying to fine-tune my
ByteString handling for UTF-8.

But in practice only a fraction of my data input will be manipulated
in an encoding-sensitive context.  I'm thinking _all_ data is binary,
and accordingly all inputs are ByteString;  conversion to Text will
happen as needed for ... uh, wait, is there a conversion from
ByteString to Text?  Well, if not, no doubt that's coming.

        Donn Cave, [hidden email]
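
On the conversion question: the text package ships Data.Text.Encoding, which converts between ByteString and Text for several encodings. A minimal round-trip sketch for UTF-8 follows; the names bytes and text are illustrative only.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Text -> UTF-8 encoded bytes. \239 and \233 are i-diaeresis and e-acute,
-- each of which takes two bytes in UTF-8.
bytes :: B.ByteString
bytes = TE.encodeUtf8 (T.pack "na\239ve caf\233")

-- UTF-8 encoded bytes -> Text. decodeUtf8 throws an exception on invalid
-- input, so it suits data that is already known to be well formed.
text :: T.Text
text = TE.decodeUtf8 bytes

main :: IO ()
main = do
  print (B.length bytes) -- number of bytes
  print (T.length text)  -- number of characters
```

Keeping input as ByteString and decoding only the parts that need text handling, as suggested above, fits this API well.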

Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by Bryan O'Sullivan
2010/8/13 Bryan O'Sullivan <[hidden email]>
> 2010/8/13 Gábor Lehel <[hidden email]>
>> How about the case for text which is guaranteed to be in ascii/latin1?
>> ByteString again?
>
> If you know it's text and not binary data you are working with, you should still use Data.Text. There are a few good reasons.
>   1. The API is more correct. For instance, if you use Text.toUpper on a string containing latin1 "ß" (eszett, sharp S), you'll get the two-character sequence "SS", which is correct. Using Char8.map Char.toUpper here gives the wrong answer.
>   2. In many cases, the API is easier to use, because it's oriented towards using text data, instead of being a port of the list API.
>   3. Some commonly used functions, such as substring searching, are way faster than their ByteString counterparts.

These are all good reasons. An even more important reason is type safety:

A function that receives a Text argument has the guarantee that the input is valid Unicode. A function that receives a ByteString doesn't have that guarantee, and if validity is important the function must perform a validity check before operating on the data. If the function does not validate the input, it might crash or, even worse, write invalid data to disk or some other data store, corrupting the application data.

This is a bit of a subtle point that you really only see once systems get large. Even though you might pay for the conversion from ByteString to Text you might make up for that by avoiding several validity checks down the road.

Cheers,
Johan
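
A sketch of what validating once at the boundary can look like, assuming a version of the text package that provides decodeUtf8'. The functions fromWire and shout are hypothetical names, not part of any library.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Hypothetical boundary function: validate the bytes once, on the way in.
-- decodeUtf8' reports invalid input as a Left instead of throwing.
fromWire :: B.ByteString -> Either String T.Text
fromWire bytes = case TE.decodeUtf8' bytes of
  Left err -> Left ("invalid UTF-8: " ++ show err)
  Right t  -> Right t

-- Downstream code takes Text, so it needs no validity check of its own.
shout :: T.Text -> T.Text
shout = T.toUpper

main :: IO ()
main = do
  print (fmap shout (fromWire (B.pack [104, 105])))   -- the UTF-8 bytes of "hi"
  print (fmap shout (fromWire (B.pack [0xC3, 0x28]))) -- bad continuation byte
```

Every function that accepts Text after this point inherits the check from the type, which is the saving described above.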



Re: Re: String vs ByteString

Daniel Fischer-4
In reply to this post by Bryan O'Sullivan
On Friday 13 August 2010 17:57:36, Bryan O'Sullivan wrote:
>    3. Some commonly used functions, such as substring searching, are
> *way* faster than their ByteString counterparts.

That's an unfortunate example. Using the stringsearch package, substring
searching in ByteStrings was considerably faster than in Data.Text in my
tests.
Replacing substrings blew Data.Text to pieces even, with a factor of 10-65
between ByteString and Text (and much smaller memory footprint).

stringsearch (Data.ByteString.Lazy.Search):

$ ./bmLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null                                  
./bmLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
      92,045,816 bytes allocated in the heap
          31,908 bytes copied during GC
         103,368 bytes maximum residency (1 sample(s))
          39,992 bytes maximum slop
               2 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   158 collections,     0 parallel,  0.01s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    0.07s  (  0.17s elapsed)
  GC    time    0.01s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    0.08s  (  0.17s elapsed)

  %GC time      10.5%  (2.1% elapsed)

  Alloc rate    1,353,535,321 bytes per MUT second

  Productivity  89.5% of total user, 40.1% of total elapsed

Data.Text.Lazy:

$ ./textLazy +RTS -s -RTS ../../bigfile Gutenberg Hutzenzwerg > /dev/null                                
./textLazy ../../bigfile Gutenberg Hutzenzwerg +RTS -s
   4,916,133,652 bytes allocated in the heap
       6,721,496 bytes copied during GC
      12,961,776 bytes maximum residency (58 sample(s))
      12,788,968 bytes maximum slop
              39 MB total memory in use (1 MB lost due to fragmentation)

  Generation 0:  8774 collections,     0 parallel,  0.70s,  0.73s elapsed
  Generation 1:    58 collections,     0 parallel,  0.03s,  0.03s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    9.87s  ( 10.23s elapsed)
  GC    time    0.73s  (  0.75s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   10.60s  ( 10.99s elapsed)

  %GC time       6.9%  (6.9% elapsed)

  Alloc rate    497,956,181 bytes per MUT second

bigfile is a ~75M file.


The point of the more adequate API for text manipulation stands, of course.

Cheers,
Daniel

Re: Re: String vs ByteString

Bryan O'Sullivan
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <[hidden email]> wrote:

> That's an unfortunate example. Using the stringsearch package, substring
> searching in ByteStrings was considerably faster than in Data.Text in my
> tests.

Interesting. Got a test case so I can repro and fix? :-) 



Re: String vs ByteString

Kevin Jardine-3
This back and forth on performance is great!

I often see ByteString used where Text is theoretically more
appropriate (e.g. the Snap web framework) and it would be good to get
these performance issues ironed out so people feel more comfortable
using the right tool for the job based upon API rather than
performance.

Many other languages have two major formats for strings (binary and
text) and it would be great if performance improvements for ByteString
and Text allowed the same kind of convergence for Haskell.

Kevin


Re: Re: String vs ByteString

Daniel Fischer-4
In reply to this post by Bryan O'Sullivan
On Friday 13 August 2010 19:53:37, Bryan O'Sullivan wrote:
> On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <[hidden email]> wrote:
> > That's an unfortunate example. Using the stringsearch package,
> > substring searching in ByteStrings was considerably faster than in
> > Data.Text in my tests.
>
> Interesting. Got a test case so I can repro and fix? :-)

Sure, use http://norvig.com/big.txt (~6.2M), cat it together a few times to
test on larger files.

ByteString code (bmLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.ByteString.Char8 as C
import qualified Data.ByteString.Lazy as L
import Data.ByteString.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = C.pack pat
        work = indices spat
    L.readFile file >>= print . length . work
----------------------------------------------------------------
Data.Text.Lazy (textLazy.hs):
----------------------------------------------------------------
{-# LANGUAGE BangPatterns #-}
module Main (main) where

import System.Environment (getArgs)
import qualified Data.Text.Lazy as T
import qualified Data.Text.Lazy.IO as TIO
import Data.Text.Lazy.Search

main :: IO ()
main = do
    (file : pat : _) <- getArgs
    let !spat = T.pack pat
        work = indices spat
    TIO.readFile file >>= print . length . work
----------------------------------------------------------------
(Data.Text.Lazy.Search is of course not exposed by default ;), I use
text-0.7.2.1)

Some local timings:

1. real words in a real text file:

$ time ./textLazy big.txt the
92805
0.59user 0.00system 0:00.61elapsed 97%CPU
$ time ./bmLazy big.txt the
92805
0.02user 0.01system 0:00.04elapsed 104%CPU

$ time ./textLazy big.txt and
43587                                                                                  
0.56user 0.01system 0:00.58elapsed 100%CPU
$ time ./bmLazy big.txt and
43587                                                                                
0.02user 0.01system 0:00.03elapsed 88%CPU


$ time ./textLazy big.txt mother
317
0.44user 0.01system 0:00.46elapsed 99%CPU
$ time ./bmLazy big.txt mother
317
0.00user 0.01system 0:00.02elapsed 69%CPU


$ time ./textLazy big.txt deteriorate
2
0.37user 0.00system 0:00.38elapsed 98%CPU
$ time ./bmLazy big.txt deteriorate
2
0.01user 0.01system 0:00.02elapsed 114%CPU

$ time ./textLazy big.txt "Project Gutenberg"
177
0.37user 0.00system 0:00.38elapsed 97%CPU
$ time ./bmLazy big.txt "Project Gutenberg"
177
0.00user 0.01system 0:00.01elapsed 100%CPU

2. periodic pattern in a file of 33.4M of aaaaa:

$ time ./bmLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
34999942
1.22user 0.04system 0:01.30elapsed 97%CPU
$ time ./textLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
3.07user 0.03system 0:03.14elapsed 98%CPU

Oh, that's closer, but text doesn't find overlapping matches, well, we can
do that too (replace indices with nonOverlappingIndices):

$ time ./noBMLazy ../AAA
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
593220
0.18user 0.04system 0:00.23elapsed 97%CPU

Yeah, that's more like it :D
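
The two counts above differ (34999942 versus 593220) because one search reports overlapping matches and the other does not. A plain-list sketch of the difference follows. It is only an illustration on String, not how stringsearch or text implement their searches, and the function names overlapping and nonOverlapping are made up here.

```haskell
import Data.List (isPrefixOf, tails)

-- Every position at which the pattern starts, including positions that
-- fall inside an earlier match.
overlapping :: String -> String -> [Int]
overlapping pat s = [i | (i, t) <- zip [0 ..] (tails s), pat `isPrefixOf` t]

-- After a match, skip the whole pattern before searching again.
nonOverlapping :: String -> String -> [Int]
nonOverlapping pat
  | null pat = const [] -- an empty pattern would never advance
  | otherwise = go 0
  where
    n = length pat
    go _ [] = []
    go i s@(_ : rest)
      | pat `isPrefixOf` s = i : go (i + n) (drop n s)
      | otherwise = go (i + 1) rest

main :: IO ()
main = do
  print (overlapping "aa" "aaaaa")    -- [0,1,2,3]
  print (nonOverlapping "aa" "aaaaa") -- [0,2]
```

For a periodic input of length N and a pattern of length m, the first count is N - m + 1 and the second is N divided by m, rounded down, which is why a fair comparison has to use the same matching rule on both sides.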
