String vs ByteString

Re: Re: String vs ByteString

Donn Cave-4
Quoth "Bryan O'Sullivan" <[hidden email]>,

> In the case of the text library, it is often (but not always) competitive
> with bytestring, and I improve it when I can, especially when given test
> cases. My goal is for it to be the obvious choice on several fronts:
>
>    - Cleanliness of API, where it's already better, but could still improve
>    - Performance, which is not quite where I want it (target: parity with,
>    or better than, bytestring)
>    - Quality, where text has slightly more test coverage than bytestring

That sounds great, and I'm looking forward to using Text in my
application - at least, where I think it would help with respect
to correctness.  I can't imagine I would unpack all my data right
off the socket, or disk, and use Text throughout my application,
because I'm skeptical that unpacking megabytes of data from 8 to
16 bits can be done without noticeable impact on resources.  I
wouldn't imagine I would be filing a bug report on that, because
it's a given - if I have a big data load, obviously I should be
using ByteString.

Am I confused about this?  It's why I can't see Text ever being
simply the obvious choice.  [Char] will continue to be the obvious
choice if you want a functional data type that supports pattern
matching etc.  ByteString will continue to be the obvious choice
for big data loads.  We'll have a three way choice between programming
elegance, correctness and efficiency.  If Haskell were more than
just a research language, this might be its most prominent open
sore, don't you think?

        Donn Cave, [hidden email]
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Re: String vs ByteString

John Millikin
On Sat, Aug 14, 2010 at 22:07, Donn Cave <[hidden email]> wrote:
> Am I confused about this?  It's why I can't see Text ever being
> simply the obvious choice.  [Char] will continue to be the obvious
> choice if you want a functional data type that supports pattern
> matching etc.  ByteString will continue to be the obvious choice
> for big data loads.  We'll have a three way choice between programming
> elegance, correctness and efficiency.  If Haskell were more than
> just a research language, this might be its most prominent open
> sore, don't you think?

I don't see why [Char] is "obvious" -- you'd never use [Word8] for
storing binary data, right? [Char] is popular because it's the default
type for string literals, and due to simple inertia, but when there's
a type based on packed arrays there's no reason to use the list
representation.

Also, despite the name, ByteString and Text are for separate purposes.
ByteString is an efficient [Word8], Text is an efficient [Char] -- use
ByteString for binary data, and Text for...text. Most mature languages
have both types, though the choice of UTF-16 for Text is unusual.
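
The split described above can be seen in a few lines. This is only a minimal sketch, assuming the `text` and `bytestring` packages; the names `greeting`, `greetingBytes`, `charCount` and `byteCount` are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Text is a sequence of Unicode characters: "naïve" has five of them.
greeting :: T.Text
greeting = T.pack "na\239ve"

-- ByteString is a sequence of bytes. Going from Text to bytes means
-- picking an encoding explicitly (UTF-8 here).
greetingBytes :: B.ByteString
greetingBytes = TE.encodeUtf8 greeting

-- Counts characters.
charCount :: Int
charCount = T.length greeting

-- Counts bytes; the "ï" takes two bytes in UTF-8.
byteCount :: Int
byteCount = B.length greetingBytes
```

Loaded in GHCi, `charCount` is 5 and `byteCount` is 6, which is the practical difference between the two types.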

Re: Re: String vs ByteString

Edward Z. Yang
Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
> Also, despite the name, ByteString and Text are for separate purposes.
> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
> ByteString for binary data, and Text for...text. Most mature languages
> have both types, though the choice of UTF-16 for Text is unusual.

Given that Python, .NET, Java and Windows all use UTF-16 for their Unicode
text representations, I cannot really agree with "unusual". :-)

Cheers,
Edward

Re: Re: String vs ByteString

Michael Snoyman


On Sun, Aug 15, 2010 at 8:39 AM, Edward Z. Yang <[hidden email]> wrote:
> Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
>> Also, despite the name, ByteString and Text are for separate purposes.
>> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
>> ByteString for binary data, and Text for...text. Most mature languages
>> have both types, though the choice of UTF-16 for Text is unusual.
>
> Given that Python, .NET, Java and Windows all use UTF-16 for their Unicode
> text representations, I cannot really agree with "unusual". :-)

When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.

Remember, Python, .NET and Java are all imperative languages without referential transparency. I doubt saying they do something some way will influence most Haskell coders much ;).

Michael


Re: Re: String vs ByteString

John Millikin
In reply to this post by Edward Z. Yang
On Sat, Aug 14, 2010 at 22:39, Edward Z. Yang <[hidden email]> wrote:
> Excerpts from John Millikin's message of Sun Aug 15 01:32:51 -0400 2010:
>> Also, despite the name, ByteString and Text are for separate purposes.
>> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
>> ByteString for binary data, and Text for...text. Most mature languages
>> have both types, though the choice of UTF-16 for Text is unusual.
>
> Given that both Python, .NET, Java and Windows use UTF-16 for their Unicode
> text representations, I cannot really agree with "unusual". :-)

Python doesn't use UTF-16; on UNIX systems it uses UCS-4, and on
Windows it uses UCS-2. The difference is important because:

Python (UCS-2 build): len("\U0001dd1e") == 2
Haskell: length (pack "\x0001dd1e")

Java, .NET, Windows, JavaScript, and some other languages use UTF-16
because when Unicode support was added to these systems, the astral
characters had not been invented yet, and 16 bits was enough for the
entire Unicode character set. They originally used UCS-2, but then
moved to UTF-16 to minimize incompatibilities.

Anything based on UNIX generally uses UTF-8, because Unicode support
was added later after the problems of UCS-2/UTF-16 had been
discovered. C libraries written by UNIX users use UTF-8 almost
exclusively -- this includes most language bindings available on
Hackage.

I don't mean that UTF-16 is itself unusual, but it's a legacy encoding
-- there's no reason to use it in new projects. If "text" had been
started 15 years ago, I could understand, but since it's still in
active development the use of UTF-16 simply adds baggage.

Re: Re: String vs ByteString

John Millikin
On Sat, Aug 14, 2010 at 22:54, John Millikin <[hidden email]> wrote:
> Haskell: length (pack "\x0001dd1e")

Apologies -- this line ought to be:

Haskell: Data.Text.length (Data.Text.pack "\x0001dd1e") == 1
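
To make the corrected line concrete, here is a small sketch that compares the character count with the size of the same value in two encodings. It assumes the `text` and `bytestring` packages; the names `astral`, `codePoints`, `utf16Units` and `utf8Bytes` are illustrative only.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- A single code point outside the Basic Multilingual Plane.
astral :: T.Text
astral = T.pack "\x1dd1e"

-- Data.Text.length counts code points, not storage units.
codePoints :: Int
codePoints = T.length astral

-- Encoded as UTF-16 the same character needs a surrogate pair,
-- i.e. two 16-bit units (four bytes).
utf16Units :: Int
utf16Units = B.length (TE.encodeUtf16LE astral) `div` 2

-- Encoded as UTF-8 it needs four bytes.
utf8Bytes :: Int
utf8Bytes = B.length (TE.encodeUtf8 astral)
```

So `codePoints` is 1 while `utf16Units` is 2: a length function that counts 16-bit units would report 2 for this string.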

Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Michael Snoyman
On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman <[hidden email]> wrote:

> When I'm writing a web app, my code is sitting on a Linux system where the default encoding is UTF-8, communicating with a database speaking UTF-8, receiving request bodies in UTF-8 and sending response bodies in UTF-8. So converting all of that data to UTF-16, just to be converted right back to UTF-8, does seem strange for that purpose.

Bear in mind that much of the data you're working with can't be readily trusted. UTF-8 coming from the filesystem, the network, and often the database may not be valid. The cost of validating it isn't all that different from the cost of converting it to UTF-16.
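
As an illustration of the validation point, the sketch below decodes untrusted bytes and reports a failure instead of throwing. It assumes a version of the `text` package that provides `decodeUtf8'` (added after this thread was written); `decodeUntrusted`, `validInput` and `invalidInput` are hypothetical names.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- Decode bytes that may not be valid UTF-8. A Left carries a
-- description of the first decoding error.
decodeUntrusted :: B.ByteString -> Either String T.Text
decodeUntrusted bytes =
  case decodeUtf8' bytes of
    Left err   -> Left (show err)
    Right text -> Right text

validInput, invalidInput :: B.ByteString
validInput   = B.pack [0x68, 0x69]        -- the ASCII bytes of "hi"
invalidInput = B.pack [0x68, 0xff, 0x69]  -- 0xff never occurs in valid UTF-8
```

Whatever the internal representation is, every byte has to be inspected once to reject input like `invalidInput`, which is the cost being compared with conversion here.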

And of course the internals of Data.Text are all fusion-based, so much of the time you're not going to be allocating UTF-16 arrays at all, but instead creating a pipeline of characters that are manipulated in a tight loop. This eliminates a lot of the additional copying that bytestring has to do, for instance.

To give you an idea of how competitive Data.Text can be compared to C code, this is the system's wc command counting UTF-8 characters in a modestly large file:

$ time wc -m huge.txt 
32443330
real 0.728s

This is Data.Text performing the same task:

$ time ./FileRead text huge.txt 
32443330
real 0.697s
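
The source of the FileRead program is not shown in this message. Purely as a sketch of how such a character count could be written with lazy Text (not the benchmark program itself), and with `countChars` and `countCharsInFile` as made-up names:

```haskell
import Data.Int (Int64)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Text.Lazy as TL
import qualified Data.Text.Lazy.Encoding as TLE

-- Decode a lazy stream of UTF-8 bytes and count the characters.
-- Invalid UTF-8 raises an exception in this simple version.
countChars :: BL.ByteString -> Int64
countChars = TL.length . TLE.decodeUtf8

-- Read the file lazily, so it is processed chunk by chunk.
countCharsInFile :: FilePath -> IO Int64
countCharsInFile path = fmap countChars (BL.readFile path)
```

Calling `countCharsInFile "huge.txt"` would then play the role of `wc -m` in the comparison above; no timing claims are made for this sketch.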




Re: Re: String vs ByteString

Donn Cave-4
In reply to this post by John Millikin
Quoth John Millikin <[hidden email]>,

> I don't see why [Char] is "obvious" -- you'd never use [Word8] for
> storing binary data, right? [Char] is popular because it's the default
> type for string literals, and due to simple inertia, but when there's
> a type based on packed arrays there's no reason to use the list
> representation.

Well, yes, string literals - and pattern matching support, maybe
that's the same thing.  And I think it's fair to say that [Char]
is a natural, elegant match for the language, I mean it leverages
your basic Haskell skills if for example you want to parse something
fairly simple.  So even if ByteString weren't the monumental hassle
it is today for simple stuff, String would have at least a little appeal.
And if packed arrays really always mattered, [Char] would be long gone.
They don't, you can do a lot of stuff with [Char] before it turns into
a problem.

> Also, despite the name, ByteString and Text are for separate purposes.
> ByteString is an efficient [Word8], Text is an efficient [Char] -- use
> ByteString for binary data, and Text for...text. Most mature languages
> have both types, though the choice of UTF-16 for Text is unusual.

Maybe most mature languages have one or more extra string types
hacked on to support wide characters.  I don't think it's necessarily
a virtue.  ByteString vs. ByteString.Char8, where you can choose
more or less indiscriminately to treat the data as Char or Word8,
seems to me like a more useful way to approach the problem.  (Of
course, ByteString.Char8 isn't a good way to deal with wide characters
correctly, I'm just saying that's where I'd like to find the answer,
not in some internal character encoding into which all "text" data
must be converted.)
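
The caveat about ByteString.Char8 and wide characters can be shown directly. This is a minimal sketch assuming the `bytestring` and `text` packages; `lambda`, `truncatedBytes` and `utf8Bytes` are illustrative names.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- GREEK SMALL LETTER LAMDA, U+03BB.
lambda :: String
lambda = "\955"

-- Char8.pack keeps only the low 8 bits of each Char, so the
-- character is silently turned into the single byte 0xBB.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack lambda)

-- Encoding through Text produces the real UTF-8 bytes 0xCE 0xBB.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack lambda))
```

`truncatedBytes` is `[187]` and `utf8Bytes` is `[206,187]`, so the Char8 view is only safe for data known to be Latin-1 or ASCII.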

        Donn Cave, [hidden email]

Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Donn Cave-4
On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave <[hidden email]> wrote:
 
> Am I confused about this?  It's why I can't see Text ever being
> simply the obvious choice.  [Char] will continue to be the obvious
> choice if you want a functional data type that supports pattern
> matching etc.

Actually, with view patterns, Text is pretty nice to pattern match against:

foo (uncons -> Just (c,cs)) = "whee"

despam (prefixed "spam" -> Just suffix) = "whee" `mappend` suffix
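
A self-contained version of the two fragments above might look like the following sketch. It assumes a recent `text` package, where the prefix-stripping function is called `stripPrefix`, and it needs the ViewPatterns extension; `startsUpper` is a made-up example function.

```haskell
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE ViewPatterns #-}

import Data.Char (isUpper)
import Data.Text (Text)
import qualified Data.Text as T

-- Match on the first character, much like (c:cs) on a list.
startsUpper :: Text -> Bool
startsUpper (T.uncons -> Just (c, _)) = isUpper c
startsUpper _                         = False

-- Match on a literal prefix and bind the remainder.
despam :: Text -> Text
despam (T.stripPrefix "spam" -> Just suffix) = T.append "whee" suffix
despam other                                 = other
```

For example, `despam "spam and eggs"` gives `"whee and eggs"`, while input without the prefix is returned unchanged.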

> ByteString will continue to be the obvious choice
> for big data loads.

Don't confuse "I have big data" with "I need bytes". If you are working with bytes, use bytestring. If you are working with text, outside of a few narrow domains you should use text.

> We'll have a three way choice between programming
> elegance, correctness and efficiency.  If Haskell were more than
> just a research language, this might be its most prominent open
> sore, don't you think?

No, that's just FUD. 


Re: Re: String vs ByteString

Colin Paul Adams
In reply to this post by Bryan O'Sullivan
>>>>> "Bryan" == Bryan O'Sullivan <[hidden email]> writes:

    Bryan> On Sat, Aug 14, 2010 at 10:46 PM, Michael Snoyman <[hidden email]> wrote:
    Bryan>     When I'm writing a web app, my code is sitting on a Linux
    Bryan> system where the default encoding is UTF-8, communicating
    Bryan> with a database speaking UTF-8, receiving request bodies in
    Bryan> UTF-8 and sending response bodies in UTF-8. So converting all
    Bryan> of that data to UTF-16, just to be converted right back to
    Bryan> UTF-8, does seem strange for that purpose.


    Bryan> Bear in mind that much of the data you're working with can't
    Bryan> be readily trusted. UTF-8 coming from the filesystem, the
    Bryan> network, and often the database may not be valid. The cost of
    Bryan> validating it isn't all that different from the cost of
    Bryan> converting it to UTF-16.

But UTF-16 (apart from being an abomination for creating a hole in the
codepoint space and making it impossible to ever extend it) is slow to
process compared with UTF-32 - you can't get the nth character in
constant time, so it seems an odd choice to me.
--
Colin Adams
Preston Lancashire
()  ascii ribbon campaign - against html e-mail
/\  www.asciiribbon.org   - against proprietary attachments

Re: Re: String vs ByteString

Johan Tibell-2
Hi Colin,

On Sun, Aug 15, 2010 at 9:34 AM, Colin Paul Adams <[hidden email]> wrote:
> But UTF-16 (apart from being an abomination for creating a hole in the
> codepoint space and making it impossible to ever extend it) is slow to
> process compared with UTF-32 - you can't get the nth character in
> constant time, so it seems an odd choice to me.

Aside: Getting the nth character isn't very useful when working with Unicode text:

* Most text processing is linear.
* What we consider a character and what Unicode considers a character differs a bit e.g. since Unicode uses combining characters.
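
The second point can be illustrated with a combining accent. A minimal sketch, assuming the `text` package; `precomposed` and `decomposed` are illustrative names.

```haskell
import qualified Data.Text as T

-- "é" as the single code point U+00E9.
precomposed :: T.Text
precomposed = T.pack "\233"

-- "é" as "e" (U+0065) followed by COMBINING ACUTE ACCENT (U+0301).
decomposed :: T.Text
decomposed = T.pack "e\769"
```

Both values normally render as one visible character, yet `T.length precomposed` is 1 and `T.length decomposed` is 2, and the two values are not equal without normalization. Indexing by code point therefore does not give "the nth character" in the user's sense, whatever the encoding.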

Cheers,
Johan



Re: Re: String vs ByteString

Ivan Lazar Miljenovic
In reply to this post by Don Stewart-2
Don Stewart <[hidden email]> writes:

>     * Pay attention to Haskell Cafe announcements
>     * Follow the Reddit Haskell news.
>     * Read the quarterly reports on Hackage
>     * Follow Planet Haskell

And yet there are still many packages that fall under the radar with no
announcements of any kind on initial release or even new versions :(

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Vo Minh Thu
2010/8/15 Ivan Lazar Miljenovic <[hidden email]>:
> Don Stewart <[hidden email]> writes:
>
>>     * Pay attention to Haskell Cafe announcements
>>     * Follow the Reddit Haskell news.
>>     * Read the quarterly reports on Hackage
>>     * Follow Planet Haskell
>
> And yet there are still many packages that fall under the radar with no
> announcements of any kind on initial release or even new versions :(

If you're interested in a comprehensive update list, you can follow
Hackage on Twitter, or the news feed.

Cheers,
Thu

Re: Re: String vs ByteString

Andrew Coppin
In reply to this post by Don Stewart-2
Don Stewart wrote:
> So, to stay up to date, but without drowning in data. Do one of:
>
>     * Pay attention to Haskell Cafe announcements
>     * Follow the Reddit Haskell news.
>     * Read the quarterly reports on Hackage
>     * Follow Planet Haskell
>  

Interesting. Obviously I look at Haskell Cafe from time to time
(although there's usually far too much traffic to follow it all). I
wasn't aware of *any* of the other resources listed.


Re: Re: String vs ByteString

Ivan Lazar Miljenovic
In reply to this post by Vo Minh Thu
Vo Minh Thu <[hidden email]> writes:

> 2010/8/15 Ivan Lazar Miljenovic <[hidden email]>:
>> Don Stewart <[hidden email]> writes:
>>
>>>     * Pay attention to Haskell Cafe announcements
>>>     * Follow the Reddit Haskell news.
>>>     * Read the quarterly reports on Hackage
>>>     * Follow Planet Haskell
>>
>> And yet there are still many packages that fall under the radar with no
>> announcements of any kind on initial release or even new versions :(
>
> If you're interested in a comprehensive update list, you can follow
> Hackage on Twitter, or the news feed.

Except that that doesn't tell you:

* The purpose of the library
* How a release differs from a previous one
* Why you should use it, etc.

Furthermore, several interesting discussions have arisen out of
announcement emails.

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Vo Minh Thu
2010/8/15 Ivan Lazar Miljenovic <[hidden email]>:

> Vo Minh Thu <[hidden email]> writes:
>
>> 2010/8/15 Ivan Lazar Miljenovic <[hidden email]>:
>>> Don Stewart <[hidden email]> writes:
>>>
>>>>     * Pay attention to Haskell Cafe announcements
>>>>     * Follow the Reddit Haskell news.
>>>>     * Read the quarterly reports on Hackage
>>>>     * Follow Planet Haskell
>>>
>>> And yet there are still many packages that fall under the radar with no
>>> announcements of any kind on initial release or even new versions :(
>>
>> If you're interested in a comprehensive update list, you can follow
>> Hackage on Twitter, or the news feed.
>
> Except that that doesn't tell you:
>
> * The purpose of the library
> * How a release differs from a previous one
> * Why you should use it, etc.
>
> Furthermore, several interesting discussions have arisen out of
> announcement emails.

Sure, nor does it write a book chapter about some practical usage. I
mean (tongue in cheek) that neither the other resources, nor even a
proper announcement, provide all that.

I still remember the UHC announcement (a (nearly) complete Haskell 98
compiler) thread, where most of it was about the lack of support for
n+k patterns.

But the bullet list above was meant to point Andrew to a few places
where he could have learned about Text.

Cheers,
Thu

Re: Re: String vs ByteString

Brandon S Allbery KF8NH
In reply to this post by Bryan O'Sullivan
On 8/15/10 03:01 , Bryan O'Sullivan wrote:
> On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave <[hidden email]
> <mailto:[hidden email]>> wrote:
>      We'll have a three way choice between programming
>     elegance, correctness and efficiency.  If Haskell were more than
>     just a research language, this might be its most prominent open
>     sore, don't you think?
>
> No, that's just FUD.

More to the point, there's nothing elegant about [Char] --- its sole
"advantage" is requiring no thought.

--
brandon s. allbery     [linux,solaris,freebsd,perl]      [hidden email]
system administrator  [openafs,heimdal,too many hats]  [hidden email]
electrical and computer engineering, carnegie mellon university      KF8NH

Re: Re: String vs ByteString

Bill Atkins-6
No, not really.  Linked lists are very easy to deal with recursively and Strings automatically work with any already-defined list functions.
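
Both halves of that claim can be shown with a short sketch that uses only the Prelude and `Data.Char`; `capitalizeWords` and `splitKeyValue` are made-up example functions.

```haskell
import Data.Char (isSpace, toUpper)

-- Recursion with pattern matching directly on the list structure:
-- upper-case the first letter of every whitespace-separated word.
capitalizeWords :: String -> String
capitalizeWords = go True
  where
    go _ [] = []
    go atStart (c:cs)
      | isSpace c = c : go True cs
      | atStart   = toUpper c : go False cs
      | otherwise = c : go False cs

-- Ordinary list functions (break, drop) work on String unchanged.
splitKeyValue :: String -> (String, String)
splitKeyValue s =
  let (key, rest) = break (== '=') s
  in (key, drop 1 rest)
```

For example, `capitalizeWords "hello big world"` gives `"Hello Big World"` and `splitKeyValue "name=text"` gives `("name","text")`.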

On Sun, Aug 15, 2010 at 11:17 AM, Brandon S Allbery KF8NH <[hidden email]> wrote:
> More to the point, there's nothing elegant about [Char] --- its sole
> "advantage" is requiring no thought.



Re: Re: String vs ByteString

Donn Cave-4
In reply to this post by Bryan O'Sullivan
Quoth "Bryan O'Sullivan" <[hidden email]>,
> On Sat, Aug 14, 2010 at 10:07 PM, Donn Cave <[hidden email]> wrote:
...
>> ByteString will continue to be the obvious choice
>> for big data loads.
>
> Don't confuse "I have big data" with "I need bytes". If you are working with
> bytes, use bytestring. If you are working with text, outside of a few narrow
> domains you should use text.

I wonder how many ByteString users are `working with bytes', in the
sense you apparently mean where the bytes are not text characters.
My impression is that in practice, there is a sizeable contingent
out here using ByteString.Char8 and relatively few applications for
the Word8 type.  Some of it should no doubt move to Text, but the
ability to work with native packed data - minimal processing and
space requirements, interoperability with foreign code, mmap, etc. -
is attractive enough that the choice can be less than obvious.

        Donn Cave, [hidden email]

Re: Re: String vs ByteString

Donn Cave-4
In reply to this post by Bill Atkins-6
Quoth Bill Atkins <[hidden email]>,

> No, not really.  Linked lists are very easy to deal with recursively and
> Strings automatically work with any already-defined list functions.

Yes, they're great - a terrible mistake, for a practical programming
language, but if you fail to recognize the attraction, you miss some of
the historical lesson on emphasizing elegance and correctness over
practical performance.

        Donn Cave, [hidden email]