String vs ByteString

Re: Re: String vs ByteString

Daniel Fischer-4
On Friday 13 August 2010 19:53:37 you wrote:
> On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <[hidden email]> wrote:
> > That's an unfortunate example. Using the stringsearch package,
> > substring searching in ByteStrings was considerably faster than in
> > Data.Text in my tests.
>
> Interesting. Got a test case so I can repro and fix? :-)

It just occurred to me that a lot of the difference is due to the fact that
text has to convert a ByteString to Text on reading the file. I timed that by
reading the file and counting the chunks: that took text 0.21s for big.txt
vs. Data.ByteString.Lazy's 0.01s.
So for searching in-memory strings, subtract about 0.032s/MB from the
difference - it's still large.
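
A minimal sketch of the kind of measurement described above: read a file lazily and count its chunks, once as raw bytes and once as decoded text. This is not the benchmark code used in the post (which was not shown); the function names are made up for illustration, and the Text version decodes using the locale encoding.

```haskell
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.IO as TLIO

-- Hypothetical helper: read the file as raw bytes and count the lazy chunks.
-- No decoding happens here.
countByteChunks :: FilePath -> IO Int
countByteChunks path = do
  bytes <- BL.readFile path
  return $! length (BL.toChunks bytes)

-- Hypothetical helper: read the file as lazy Text and count the chunks.
-- Every chunk has to be decoded, which is the extra cost discussed above.
countTextChunks :: FilePath -> IO Int
countTextChunks path = do
  txt <- TLIO.readFile path
  return $! length (TL.toChunks txt)
```

Timing each function on the same large file (for example with the shell's time command) gives a rough idea of the decoding overhead, separate from any searching.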
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: String vs ByteString

Kevin Jardine-3
Surely a lot of real world text processing programs are IO intensive?
So if there is no native Text IO and everything needs to be read in /
written out as ByteString data converted to/from Text this strikes me
as a major performance sink.

Or is there native Text IO but just not in your example?

Kevin

On Aug 13, 8:57 pm, Daniel Fischer <[hidden email]> wrote:
> Just occurred to me, a lot of the difference is due to the fact that text
> has to convert a ByteString to Text on reading the file, so I timed that by
> reading the file and counting the chunks, that took text 0.21s for big.txt
> vs. Data.ByteString.Lazy's 0.01s.
> So for searching in-memory strings, subtract about 0.032s/MB from the
> difference - it's still large.

Re: Re: String vs ByteString

Daniel Fischer-4
On Friday 13 August 2010 21:32:12, Kevin Jardine wrote:
> Surely a lot of real world text processing programs are IO intensive?
> So if there is no native Text IO and everything needs to be read in /
> written out as ByteString data converted to/from Text this strikes me
> as a major performance sink.
>
> Or is there native Text IO but just not in your example?

Outdated information, sorry.
Up to ghc-6.10, text's IO went via ByteString; that is no longer the case.
However, the native Text IO is (of course) much slower than ByteString IO
because of the need for encoding and decoding.

>
> Kevin
>
> On Aug 13, 8:57 pm, Daniel Fischer <[hidden email]> wrote:
> > Just occurred to me, a lot of the difference is due to the fact that
> > text has to convert a ByteString to Text on reading the file, so I
> > timed that by reading the file and counting the chunks, that took text
> > 0.21s for big.txt vs. Data.ByteString.Lazy's 0.01s.
> > So for searching in-memory strings, subtract about 0.032s/MB from the
> > difference - it's still large.


Re: Re: String vs ByteString

Ketil Malde-5
In reply to this post by Johan Tibell-2
Johan Tibell <[hidden email]> writes:

> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> have text, use Data.Text.

If you have a large amount of mostly ASCII text, use ByteString, since
Data.Text uses twice the storage.  Also, ByteString might make more
sense if the data is in a byte-oriented encoding, and the cost of
encoding and decoding utf-16 would be significant.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Re: String vs ByteString

Kevin Jardine-3
I find it disturbing that a modern programming language like Haskell
still apparently forces you to choose between a representation for
"mostly ASCII text" and Unicode.

Surely efficient Unicode text should always be the default? And if the
Unicode format used by the Text library is not efficient enough then
can't that be fixed?

Cheers,
Kevin

On Aug 13, 10:28 pm, Ketil Malde <[hidden email]> wrote:

> Johan Tibell <[hidden email]> writes:
> > Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> > have text, use Data.Text.
>
> If you have a large amount of mostly ASCII text, use ByteString, since
> Data.Text uses twice the storage.  Also, ByteString might make more
> sense if the data is in a byte-oriented encoding, and the cost of
> encoding and decoding utf-16 would be significant.
>
> -k
> --
> If I haven't seen further, it is by standing in the footprints of giants

Re: Re: String vs ByteString

Don Stewart-2
There are many libraries for many purposes.

    How to pick your string library in Haskell
    http://blog.ezyang.com/2010/08/strings-in-haskell/

kevinjardine:

> I find it disturbing that a modern programming language like Haskell
> still apparently forces you to choose between a representation for
> "mostly ASCII text" and Unicode.
>
> Surely efficient Unicode text should always be the default? And if the
> Unicode format used by the Text library is not efficient enough then
> can't that be fixed?
>
> Cheers,
> Kevin
>
> On Aug 13, 10:28 pm, Ketil Malde <[hidden email]> wrote:
> > Johan Tibell <[hidden email]> writes:
> > > Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> > > have text, use Data.Text.
> >
> > If you have a large amount of mostly ASCII text, use ByteString, since
> > Data.Text uses twice the storage.  Also, ByteString might make more
> > sense if the data is in a byte-oriented encoding, and the cost of
> > encoding and decoding utf-16 would be significant.
> >
> > -k
> > --
> > If I haven't seen further, it is by standing in the footprints of giants

Re: Re: String vs ByteString

Edward Z. Yang
In reply to this post by Kevin Jardine-3
Excerpts from Kevin Jardine's message of Fri Aug 13 16:37:14 -0400 2010:
> I find it disturbing that a modern programming language like Haskell
> still apparently forces you to choose between a representation for
> "mostly ASCII text" and Unicode.
>
> Surely efficient Unicode text should always be the default? And if the
> Unicode format used by the Text library is not efficient enough then
> can't that be fixed?

For what it's worth, Java uses UTF-16 representation internally for
strings, and thus also wastes space.

There is something to be said for UTF-8 in-memory representation, but
it takes a lot of care.  A newtype for dirty and clean UTF-8 may come
in handy.

Cheers,
Edward
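
One way to read the "dirty and clean UTF-8" suggestion is a pair of newtypes with a validating conversion between them. The sketch below is only an illustration of that idea: the type and function names are invented here, and validation borrows decodeUtf8' from the text package rather than a dedicated validator.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text.Encoding as TE

-- Bytes that claim to be UTF-8 but have not been checked yet.
newtype DirtyUtf8 = DirtyUtf8 B.ByteString

-- Bytes that have been checked and are known to be well-formed UTF-8.
-- In a real module the constructor would not be exported.
newtype CleanUtf8 = CleanUtf8 B.ByteString

-- Validate by attempting a strict decode; the bytes themselves are kept.
validate :: DirtyUtf8 -> Maybe CleanUtf8
validate (DirtyUtf8 bs) =
  case TE.decodeUtf8' bs of
    Left _  -> Nothing
    Right _ -> Just (CleanUtf8 bs)
```

Functions that need valid input would then accept only CleanUtf8, so the check happens once at the boundary.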

Re: String vs ByteString

Kevin Jardine-3
In reply to this post by Don Stewart-2
Hi Don,

With respect, I disagree with that approach.

Almost every modern programming language has one or at most two
standard representations for strings.

That includes PHP, Python, Ruby, Perl and many others. The lack of a
standard text representation in Haskell has created a crazy patchwork
of incompatible libraries requiring explicit and often inefficient
conversions to connect them together.

I expect Haskell to be higher level than those other languages so that
I can ignore the lower level details and focus on the algorithms. But
in fact the string issue forces me to deal with lower level details
than even PHP requires. I end up with a program littered with ugly
pack, unpack, toString, fromString and similar calls.

That just doesn't feel right to me.

Kevin

On Aug 13, 10:39 pm, Don Stewart <[hidden email]> wrote:

> There are many libraries for many purposes.
>
>     How to pick your string library in Haskell
>    http://blog.ezyang.com/2010/08/strings-in-haskell/
>
> kevinjardine:
>
> > I find it disturbing that a modern programming language like Haskell
> > still apparently forces you to choose between a representation for
> > "mostly ASCII text" and Unicode.
>
> > Surely efficient Unicode text should always be the default? And if the
> > Unicode format used by the Text library is not efficient enough then
> > can't that be fixed?
>
> > Cheers,
> > Kevin
>
> > > On Aug 13, 10:28 pm, Ketil Malde <[hidden email]> wrote:
> > > Johan Tibell <[hidden email]> writes:
> > > > Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> > > > have text, use Data.Text.
>
> > > If you have a large amount of mostly ASCII text, use ByteString, since
> > > Data.Text uses twice the storage.  Also, ByteString might make more
> > > sense if the data is in a byte-oriented encoding, and the cost of
> > > encoding and decoding utf-16 would be significant.
>
> > > -k
> > > --
> > > If I haven't seen further, it is by standing in the footprints of giants

Re: Re: String vs ByteString

Ketil Malde-5
Kevin Jardine <[hidden email]> writes:

> Almost every modern programming language has one or at most two
> standard representations for strings.

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.

Haskell does have a standard representation for strings, namely [Char].
Unfortunately, this sacrifices efficiency for elegance, which gives rise
to the plethora of libraries.

> I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.

Some of this can be avoided using a language extension that lets you
overload string constants.
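
For illustration, a small sketch of overloaded string constants using GHC's OverloadedStrings extension: the same literal is given three different types without an explicit pack call. The binding names are made up for the example.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- The literal is converted through the IsString class, so no T.pack is needed.
greetingText :: T.Text
greetingText = "hello"

-- Same literal, now a strict ByteString (only safe for Latin-1 characters).
greetingBytes :: BC.ByteString
greetingBytes = "hello"

-- And still usable as an ordinary String.
greetingString :: String
greetingString = "hello"
```

This only removes conversions at literals; values that arrive at run time still need explicit conversion between the types.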

There are always trade-offs, and no one solution will fit all: UTF-8 is
space efficient while UTF-16 is time efficient (at least for certain
classes of problems and data).  It does seem that it should be possible
to unify the various libraries wrapping bytestrings (CompactString,
ByteString.UTF8 etc), however.

-k
--
If I haven't seen further, it is by standing in the footprints of giants

Re: String vs ByteString

Erik de Castro Lopo-34
In reply to this post by Pierre-Etienne Meunier-3
Pierre-Etienne Meunier wrote:

> Hi,
>
> Why don't you use the Data.Rope library ?
> The asymptotic complexities are way better than those of the
> ByteString functions.

What I see as my current problem is that there is already
a problem in having two things, String and ByteString, which
represent strings. Adding Text and Data.Rope makes that
problem worse, not better.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Re: Re: String vs ByteString

Erik de Castro Lopo-34
In reply to this post by Kevin Jardine-3
Kevin Jardine wrote:

> With respect, I disagree with that approach.
>
> Almost every modern programming language has one or at most two
> standard representations for strings.

I think having two makes sense, one for arrays of arbitrary
binary bytes and one for some unicode data format, preferably
UTF-8.
 

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.
>
> I expect Haskell to be higher level than those other languages so that
> I can ignore the lower level details and focus on the algorithms. But
> in fact the string issue forces me to deal with lower level details
> than even PHP requires. I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.
>
> That just doesn't feel right to me.

That is what I was trying to say when I started this thread. Thank
you.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Re: Re: String vs ByteString

Erik de Castro Lopo-34
In reply to this post by Ketil Malde-5
Ketil Malde wrote:

> Haskell does have a standard representation for strings, namely [Char].
> Unfortunately, this sacrifices efficiency for elegance, which gives rise
> to the plethora of libraries.

To have the default standard representation be one that works
so poorly for many common everyday tasks such as mangling
large chunks of XML is a large part of the problem.

Erik
--
----------------------------------------------------------------------
Erik de Castro Lopo
http://www.mega-nerd.com/

Re: Re: String vs ByteString

Ivan Lazar Miljenovic
In reply to this post by Kevin Jardine-3
Kevin Jardine <[hidden email]> writes:

> Hi Don,
>
> With respect, I disagree with that approach.
>
> Almost every modern programming language has one or at most two
> standard representations for strings.

Almost every modern programming language thinks you can whack a print
statement wherever you like... ;-)

> That includes PHP, Python, Ruby, Perl and many others. The lack of a
> standard text representation in Haskell has created a crazy patchwork
> of incompatible libraries requiring explicit and often inefficient
> conversions to connect them together.
>
> I expect Haskell to be higher level than those other languages so that
> I can ignore the lower level details and focus on the algorithms. But
> in fact the string issue forces me to deal with lower level details
> than even PHP requires. I end up with a program littered with ugly
> pack, unpack, toString, fromString and similar calls.

So, the real issue here is that there is not yet a good abstraction over
what we consider to be textual data, and instead people have to code to
a specific data type.

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Jason Dagit-2


On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic <[hidden email]> wrote:
> Kevin Jardine <[hidden email]> writes:
>
> > Hi Don,
> >
> > With respect, I disagree with that approach.
> >
> > Almost every modern programming language has one or at most two
> > standard representations for strings.
>
> Almost every modern programming language thinks you can whack a print
> statement wherever you like... ;-)
>
> > That includes PHP, Python, Ruby, Perl and many others. The lack of a
> > standard text representation in Haskell has created a crazy patchwork
> > of incompatible libraries requiring explicit and often inefficient
> > conversions to connect them together.
> >
> > I expect Haskell to be higher level than those other languages so that
> > I can ignore the lower level details and focus on the algorithms. But
> > in fact the string issue forces me to deal with lower level details
> > than even PHP requires. I end up with a program littered with ugly
> > pack, unpack, toString, fromString and similar calls.
>
> So, the real issue here is that there is not yet a good abstraction over
> what we consider to be textual data, and instead people have to code to
> a specific data type.

Isn't this the same problem we have with numeric literals?  I might even go so far as to suggest it's going to be a problem with all types of literals.

Isn't it also a problem which is partially solved with the OverloadedStrings extension?

It seems like the interface exposed by ByteString could be in a type class.  At that point, would the problem be solved?

Jason
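
A rough sketch of what putting part of such an interface in a type class could look like. The class name TextLike and its two methods are hypothetical, chosen only for this example; a realistic class would need far more operations, and the ByteString instance is only meaningful for Latin-1 data.

```haskell
{-# LANGUAGE FlexibleInstances #-}
import Data.Char (toUpper)
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T

-- Hypothetical class covering a tiny slice of a string API.
class TextLike t where
  mapChars   :: (Char -> Char) -> t -> t
  textLength :: t -> Int

instance TextLike String where
  mapChars   = map
  textLength = length

instance TextLike T.Text where
  mapChars   = T.map
  textLength = T.length

-- Char8 treats each byte as one character.
instance TextLike BC.ByteString where
  mapChars   = BC.map
  textLength = BC.length

-- Code written against the class works for all three representations.
shout :: TextLike t => t -> t
shout = mapChars toUpper
```

With this in place, shout can be applied to a String, a Text or a ByteString without any conversion at the call site.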


Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Daniel Fischer-4
On Fri, Aug 13, 2010 at 9:55 AM, Daniel Fischer <[hidden email]> wrote:

> That's an unfortunate example. Using the stringsearch package, substring
> searching in ByteStrings was considerably faster than in Data.Text in my
> tests.

Daniel, thanks again for bringing up this example! It turned out that quite a lot of the difference in performance was due to an inadvertent space leak in the text search code. With a single added bang pattern, the execution time and space usage both improved markedly.

There is of course still lots of room for improvement, but having test cases like this helps immensely.
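
To illustrate the kind of fix mentioned above, here is a generic example of a bang pattern on a loop accumulator. It is not the actual change made in the text library, which is not shown in this thread; countChar is a made-up function used only to show the technique.

```haskell
{-# LANGUAGE BangPatterns #-}
import qualified Data.Text as T

-- Count occurrences of a character with an explicit loop.
-- The bang on the accumulator forces it at every step, so no chain of
-- unevaluated (n + 1) thunks can build up while the loop walks the input.
countChar :: Char -> T.Text -> Int
countChar c = go 0
  where
    go !n t =
      case T.uncons t of
        Nothing        -> n
        Just (x, rest) -> go (if x == c then n + 1 else n) rest
```

Without the bang, whether the accumulator is evaluated eagerly depends on the optimiser's strictness analysis, which is how a leak of this sort can slip in unnoticed.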


Re: Re: String vs ByteString

Ivan Lazar Miljenovic
In reply to this post by Jason Dagit-2
Jason Dagit <[hidden email]> writes:

> On Fri, Aug 13, 2010 at 4:03 PM, Ivan Lazar Miljenovic <
>>
>> So, the real issue here is that there is not yet a good abstraction over
>> what we consider to be textual data, and instead people have to code to
>> a specific data type.
>>
>
> Isn't this the same problem we have with numeric literals?  I might even go
> so far as to suggest it's going to be a problem with all types of
> literals.

Not just literals; there is no common way of doing a character
replacement (e.g. map toUpper) in a textual type for example.

> Isn't it also a problem which is partially solved with the OverloadedStrings
> extension?
> http://haskell.cs.yale.edu/ghc/docs/6.12.2/html/users_guide/type-class-extensions.html#overloaded-strings

That just converts literals; it doesn't provide a common API.

> It seems like the interface exposed by ByteString could be in a type class.
>  At that point, would the problem be solved?

To a certain extent, yes.

There is no one typeclass that could cover everything (especially since
something as simple as toUpper won't work if I understand Bryan's ß ->
SS example), but it would help in the majority of cases.
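
The ß -> SS point can be seen by comparing a per-character map with Data.Text's whole-string case conversion. The two function names below are made up for the example; the behaviour relied on is that Data.Char.toUpper maps one Char to one Char, while Data.Text.toUpper may change the length of the text.

```haskell
import Data.Char (toUpper)
import qualified Data.Text as T

-- Per-character conversion: one Char in, one Char out, so 'ß' cannot grow.
perChar :: String -> String
perChar = map toUpper

-- Whole-text conversion: may change the length of the string.
fullMapping :: T.Text -> T.Text
fullMapping = T.toUpper
```

For the input "straße", perChar leaves the ß in place, while fullMapping yields "STRASSE". A class method with the type (Char -> Char) -> t -> t therefore cannot express the second behaviour.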

There has been one attempt, but it doesn't seem very popular (tagsoup
has another, but it's meant to be internal only):
http://hackage.haskell.org/packages/archive/ListLike/latest/doc/html/Data-ListLike.html#39

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Brandon S Allbery KF8NH
In reply to this post by Kevin Jardine-3
On 8/13/10 16:37 , Kevin Jardine wrote:
> Surely efficient Unicode text should always be the default? And if the

Efficient for what?  The most efficient Unicode representation for
Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.
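
The size difference can be checked directly by encoding the same text both ways and counting bytes. This is a small sketch using the encoders from the text package; the helper names are made up for the example.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes needed to store the text as UTF-8.
utf8Bytes :: T.Text -> Int
utf8Bytes = B.length . TE.encodeUtf8

-- Number of bytes needed to store the text as UTF-16 (no byte order mark).
utf16Bytes :: T.Text -> Int
utf16Bytes = B.length . TE.encodeUtf16LE
```

For the ASCII string "hello" this gives 5 bytes in UTF-8 and 10 in UTF-16; for the three-character string "日本語" it gives 9 bytes in UTF-8 and 6 in UTF-16.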

--
brandon s. allbery     [linux,solaris,freebsd,perl]      [hidden email]
system administrator  [openafs,heimdal,too many hats]  [hidden email]
electrical and computer engineering, carnegie mellon university      KF8NH

Re: Re: String vs ByteString

Evan Laforge
On Fri, Aug 13, 2010 at 6:41 PM, Brandon S Allbery KF8NH
<[hidden email]> wrote:
> On 8/13/10 16:37 , Kevin Jardine wrote:
>> Surely efficient Unicode text should always be the default? And if the
>
> Efficient for what?  The most efficient Unicode representation for
> Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.

I have an app that is using Data.Text, however I'm thinking of
switching to UTF8 bytestrings.  The reasons are that there are two
main things I do with text: pass it to a C API to display, and parse
it.  The C API expects UTF8, and the parser libraries with a
reputation for being fast all seem to have bytestring inputs, but not
Data.Text (I'm using unpack -> parsec, which is not optimal).
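
For the C API half of this, a Text value can be encoded to UTF-8 at the boundary and handed over as a NUL-terminated C string. The sketch below is only an illustration: the display API from the post is not shown, so the C library's strlen stands in for it, and the helper names are made up.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Foreign.C.String (CString)
import Foreign.C.Types (CSize(..))

-- Stand-in for a C function that expects a NUL-terminated UTF-8 string.
foreign import ccall unsafe "string.h strlen"
  c_strlen :: CString -> IO CSize

-- Encode to UTF-8 and pass a temporary NUL-terminated copy to the action.
withUtf8CString :: T.Text -> (CString -> IO a) -> IO a
withUtf8CString = B.useAsCString . TE.encodeUtf8

-- Length in bytes as seen by the C side.
utf8Strlen :: T.Text -> IO Int
utf8Strlen t = fromIntegral <$> withUtf8CString t c_strlen
```

This pays for one encode and one copy per call, which is the cost being weighed against keeping the data as UTF-8 bytes throughout. Text containing a NUL character would be truncated on the C side.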

Re: Re: String vs ByteString

Dan Doel
On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
> I have an app that is using Data.Text, however I'm thinking of
> switching to UTF8 bytestrings.  The reasons are that there are two
> main things I do with text: pass it to a C API to display, and parse
> it.  The C API expects UTF8, and the parser libraries with a
> reputation for being fast all seem to have bytestring inputs, but not
> Data.Text (I'm using unpack -> parsec, which is not optimal).

You should be able to use parsec with text. All you need to do is write a
Stream instance:

  instance Monad m => Stream Text m Char where
    uncons = return . Text.uncons

-- Dan
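
A self-contained sketch of parsing Text with parsec. It assumes a parsec version that already ships a Stream instance for Text together with the Text.Parsec.Text module, so the instance above is not redefined here (doing so would be a duplicate). The parser itself is a made-up example.

```haskell
import qualified Data.Text as T
import Text.Parsec (ParseError, char, digit, many1, parse, sepBy1)
import Text.Parsec.Text (Parser)

-- A comma-separated list of non-negative integers, read directly from Text.
numbers :: Parser [Int]
numbers = (read <$> many1 digit) `sepBy1` char ','

-- Run the parser on a Text value without unpacking it to String first.
parseNumbers :: T.Text -> Either ParseError [Int]
parseNumbers = parse numbers "<input>"
```

Applied to the Text "1,22,333", parseNumbers returns Right [1,22,333]. Trailing input is ignored unless the parser is followed by eof.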

Re: Re: String vs ByteString

Felipe Lessa
On Fri, Aug 13, 2010 at 10:01 PM, Dan Doel <[hidden email]> wrote:

> On Friday 13 August 2010 8:51:46 pm Evan Laforge wrote:
>> I have an app that is using Data.Text, however I'm thinking of
>> switching to UTF8 bytestrings.  The reasons are that there are two
>> main things I do with text: pass it to a C API to display, and parse
>> it.  The C API expects UTF8, and the parser libraries with a
>> reputation for being fast all seem to have bytestring inputs, but not
>> Data.Text (I'm using unpack -> parsec, which is not optimal).
>
> You should be able to use parsec with text. All you need to do is write a
> Stream instance:
>
>  instance Monad m => Stream Text m Char where
>    uncons = return . Text.uncons

Then this should be in a 'parsec-text' package.  Instances are always
implicitly imported.

Suppose packages A and B define this instance separately.  If
package C imports A and B, then it can't use any of those
instances nor define its own.

Cheers! =)

--
Felipe.