String vs ByteString


Re: Re: String vs ByteString

wren ng thornton
Ivan Lazar Miljenovic wrote:

> On 18 August 2010 12:12, wren ng thornton <[hidden email]> wrote:
>> Johan Tibell wrote:
>>> To my knowledge the data we have about prevalence of encoding on the web
>>> is
>>> accurate. We crawl all pages we can get our hands on, by starting at some
>>> set of seeds and then following all the links. You cannot be sure that
>>> you've reached all web sites as there might be cliques in the web graph
>>> but
>>> we try our best to get them all. You're unlikely to get a better estimate
>>> anywhere else. I doubt many organizations have the machinery required to
>>> crawl most of the web.
>> There was a study recently on this. They found that there are four main
>> parts of the Internet:
>>
>> * a densely connected core, where from any site you can get to any other
>> * an "in cone", from which you can reach the core (but not other in-cone
>> members, since then you'd both be in the core)
>> * an "out cone", which can be reached from the core (but which cannot reach
>> each other)
>> * and, unconnected islands
>
> I'm guessing here that you're referring to what I've heard called the
> "hidden web": databases, etc. that require sign-ins, etc. (as stuff
> that isn't in the core, to differing degrees: some of these databases
> are indexed by google but you can't actually read them without an
> account, etc.) ?

Not so far as I recall. I'd have to find a copy of the paper to be sure
though. Because the metric used was graph connectivity, if those hidden
pages have links out into non-hidden pages (e.g., the login page), then
they'd be counted in the same way as the non-hidden pages reachable from
them.

--
Live well,
~wren

Re: Re: String vs ByteString

Michael Snoyman
In reply to this post by andrew
Well, I'm not certain if it counts as a typical Chinese website, but here are the stats:

UTF8: 64,198
UTF16: 113,160

And just for fun, after gzipping:

UTF8: 17,708
UTF16: 19,367
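
For reference, a minimal sketch of how numbers like these can be reproduced, assuming the page has already been saved locally as valid UTF-8 and that the text and zlib packages are installed (the file name is a placeholder):

    import qualified Codec.Compression.GZip as GZip
    import qualified Data.ByteString as B
    import qualified Data.ByteString.Lazy as BL
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      raw <- B.readFile "page.html"        -- saved page, assumed UTF-8
      let t     = TE.decodeUtf8 raw        -- decode once
          utf8  = TE.encodeUtf8 t          -- re-encode both ways
          utf16 = TE.encodeUtf16LE t
          gzLen = BL.length . GZip.compress . BL.fromChunks . (:[])
      putStrLn ("UTF8: "          ++ show (B.length utf8))
      putStrLn ("UTF16: "         ++ show (B.length utf16))
      putStrLn ("UTF8 gzipped: "  ++ show (gzLen utf8))
      putStrLn ("UTF16 gzipped: " ++ show (gzLen utf16))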

On Wed, Aug 18, 2010 at 2:59 AM, anderson leo <[hidden email]> wrote:
Hi Michael, here is a web site http://zh.wikipedia.org/zh-cn/. It is the Chinese Wikipedia.

-Andrew

On Tue, Aug 17, 2010 at 7:00 PM, Michael Snoyman <[hidden email]> wrote:


On Tue, Aug 17, 2010 at 1:50 PM, Yitzchak Gale <[hidden email]> wrote:
Ketil Malde wrote:
> I haven't benchmarked it, but I'm fairly sure that, if you try to fit a
> 3Gbyte file (the Human genome, say¹), into a computer with 4Gbytes of
> RAM, UTF-16 will be slower than UTF-8...

I don't think the genome is typical text. And
I doubt that is true if that text is in a CJK language.

> I think that *IF* we are aiming for a single, grand, unified text
> library to Rule Them All, it needs to use UTF-8.

Given the growth rate of China's economy, if CJK isn't
already the majority of text being processed in the world,
it will be soon. I have seen media reports claiming CJK is
now a majority of text data going over the wire on the web,
though I haven't seen anything scientific backing up those claims.
It certainly seems reasonable. I believe Google's measurements
based on their own web index showing wide adoption of UTF-8
are very badly skewed due to a strong Western bias.

In that case, if we have to pick one encoding for Data.Text,
UTF-16 is likely to be a better choice than UTF-8, especially
if the cost is fairly low even for the special case of Western
languages. Also, UTF-16 has become by far the dominant internal
text format for most software and for most user platforms.
Except on desktop Linux - and whether we like it or not, Linux
desktops will remain a tiny minority for the foreseeable future.

I think you are conflating two points here, and ignoring some important data. Regarding the data: you haven't actually quoted any statistics about the prevalence of CJK data, but even if the majority of web pages served are in those three languages, a fairly high percentage of the content will *still* be ASCII, due simply to the HTML, CSS, and JavaScript overhead. I'd hate to make up statistics on the spot, especially when I don't have any numbers from you to compare them with.
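
The ASCII share of a saved page is easy to measure directly, by counting the bytes below 0x80; a small sketch, with the file name as a placeholder:

    import qualified Data.ByteString as B

    -- Fraction of bytes below 0x80, i.e. the ASCII share of the raw page.
    asciiFraction :: B.ByteString -> Double
    asciiFraction bs
      | B.null bs = 0
      | otherwise =
          fromIntegral (B.length (B.filter (< 0x80) bs))
            / fromIntegral (B.length bs)

    main :: IO ()
    main = print . asciiFraction =<< B.readFile "page.html"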

As far as the conflation, there are two questions with regard to the encoding choice: encoding/decoding time and space usage. I don't think *anyone* is asserting that UTF-16 is a common encoding for files anywhere, so by using UTF-16 we are simply incurring an overhead in every case. We can't consider a CJK encoding for text, so its prevalence is irrelevant to this topic. What *is* relevant is that a very large percentage of web pages *are*, in fact, standardizing on UTF-8, and that all 7-bit text files are by default UTF-8.

As far as space usage, you are correct that CJK data will take up more memory in UTF-8 than UTF-16. The question still remains whether the overall document size will be larger: I'd be interested in taking a random sampling of CJK-encoded pages and comparing their UTF-8 and UTF-16 file sizes. I think simply talking about this in the vacuum of data is pointless. If anyone can recommend a CJK website which would be considered representative (or a few), I'll do the test myself.

Michael


Re: Re: String vs ByteString

Jinjing Wang
In reply to this post by wren ng thornton
> John Millikin wrote:
>>
>> The reason many Japanese and Chinese users reject UTF-8 isn't due to
>> space constraints (UTF-8 and UTF-16 are roughly equal), it's because
>> they reject Unicode itself.
>
> +1.
>
> This is the thing Unicode advocates don't want to admit. Until Unicode has
> code points for _all_ Chinese and Japanese characters, there will be active
> resistance to adoption.
>
> --
> Live well,
> ~wren

For mainland chinese websites:

Most that became popular during web 1.0 (5-10 years ago) are using
utf-8-incompatible formats, e.g. gb2312.

for example:

* www.sina.com.cn
* www.sohu.com

They didn't switch to utf-8 probably just because they never had to.

However, many of the popular websites started during web 2.0 are adopting utf-8

for example:

* renren.com (China's largest Facebook clone)
* www.kaixin001.com (China's second-largest Facebook clone)
* t.sina.com.cn (a Twitter clone)

These websites adopted utf-8 because (I think) most web development
tools have already standardized on utf-8, and there's little reason
to change it.

I'm not aware of any (at least common) Chinese characters that can be
represented in gb2312 but not in Unicode: the range of gb2312 is a
subset of the range of gbk, which is a subset of the range of gb18030,
and gb18030 is just another encoding of Unicode.

ref:

* http://en.wikipedia.org/wiki/GB_18030
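
One way to sanity-check the subset claim above is to round-trip GB2312 bytes through Unicode; a sketch assuming the text-icu package, which exposes ICU's converters by charset name:

    import qualified Data.ByteString as B
    import qualified Data.Text.ICU.Convert as ICU

    -- Decode GB2312 bytes to Unicode and re-encode them; if the result
    -- matches the input, nothing was lost going through Unicode.
    roundTripsGB2312 :: B.ByteString -> IO Bool
    roundTripsGB2312 bytes = do
      conv <- ICU.open "GB2312" Nothing
      let t = ICU.toUnicode conv bytes
      return (ICU.fromUnicode conv t == bytes)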

--
jinjing

Re: Re: String vs ByteString

andrew
In reply to this post by Michael Snoyman
More typical Chinese web sites:
    www.ifeng.com         (web site like nytimes)
    dzh.mop.com           (community for fun)
    www.csdn.net          (web site for IT)
    www.sohu.com        (web site like yahoo)
    www.sina.com         (web site like yahoo)

-- Andrew

On Wed, Aug 18, 2010 at 11:40 AM, Michael Snoyman <[hidden email]> wrote:
[...]

Re: Re: String vs ByteString

Ketil Malde-5
In reply to this post by John Millikin
John Millikin <[hidden email]> writes:

> The reason many Japanese and Chinese users reject UTF-8 isn't due to
> space constraints (UTF-8 and UTF-16 are roughly equal), it's because
> they reject Unicode itself.

Probably because they don't think it's complicated enough¹?

> Shift-JIS and the various Chinese encodings both contain Han
> characters which are missing from Unicode, either due to the Han
> unification or simply were not considered important enough to include

Surely there's enough space left?  I seem to remember some Han
characters outside of the BMP, so I would have guessed this is an
argument from back in the UCS-2 days.
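
They are indeed outside the BMP now: CJK Unified Ideographs Extension B starts at U+20000, and both encodings handle such code points. A quick check with the text package:

    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import Data.Text.Encoding (encodeUtf8, encodeUtf16LE)

    main :: IO ()
    main = do
      let t = T.singleton '\x20000'  -- first CJK Extension B code point
      print (B.length (encodeUtf8 t))     -- 4 bytes
      print (B.length (encodeUtf16LE t))  -- 4 bytes (a surrogate pair)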

(BTW, on a long train ride, I brought along the Linear B alphabet and
practiced writing notes to my kids.  So Linear B isn't entirely useless
:-)

From casual browsing of Wikipedia, the current status in CJK-land seems
to be something like this:

China: GB2312 and its successor GB18030
Taiwan, Macao, and Hong Kong: Big5
Japan: Shift-JIS
Korea: EUC-KR

It is interesting that some of these provide a lot fewer characters than
Unicode.  Another feature of several of them is that ASCII and e.g. kana
scripts take up one byte, and ideograms take up two, which correlates
with the expected width of the glyphs.

Several of the pages indicate that Unicode, and mainly UTF-8, is
gradually taking over.

-k

¹ Those who remember Emacs in the MULE days will know what I mean.
--
If I haven't seen further, it is by standing in the footprints of giants

Re: Re: String vs ByteString

wren ng thornton
In reply to this post by Jinjing Wang
Jinjing Wang wrote:

>> John Millikin wrote:
>>> The reason many Japanese and Chinese users reject UTF-8 isn't due to
>>> space constraints (UTF-8 and UTF-16 are roughly equal), it's because
>>> they reject Unicode itself.
>> +1.
>>
>> This is the thing Unicode advocates don't want to admit. Until Unicode has
>> code points for _all_ Chinese and Japanese characters, there will be active
>> resistance to adoption.
>
> [...]
> However, many of the popular websites started during web 2.0 are adopting utf-8
>
> for example:
>
> * renren.com (China's largest Facebook clone)
> * www.kaixin001.com (China's second-largest Facebook clone)
> * t.sina.com.cn (a Twitter clone)
>
> These websites adopted utf-8 because (I think) most web development
> tools have already standardized on utf-8, and there's little reason
> to change it.

Interesting. I don't know much about the politics of Chinese encodings,
other than that the GB formats are/were dominant.

As for the politics of Japanese encodings, last time I did web work
(just at the beginning of web2.0, before they started calling it that)
there was still a lot of active resistance among the Japanese. Given
some of the characters folks were complaining about, I think it's more
an issue of principle than practicality. Then again, the Japanese do
love their language games, so obscure and archaic characters are used
far more often than would be expected... Whether web2.0 has caused the
Japanese to change too, I can't say. I got out of that line of work ^_^


> I'm not aware of any (at least common) Chinese characters that can be
> represented in gb2312 but not in Unicode: the range of gb2312 is a
> subset of the range of gbk, which is a subset of the range of gb18030,
> and gb18030 is just another encoding of Unicode.

All the specific characters I've seen folks complain about were very
uncommon or even archaic. All the common characters are there for
Japanese too. The only time I've run into issues it was for an archaic
character used in a manga title. I was working on a library catalog, and
was too pedantic to spell it "wrong".

--
Live well,
~wren

Re: Re: String vs ByteString

Michael Snoyman
In reply to this post by andrew
Alright, here are the results for the first three in the list (please forgive me for being lazy; I am a Haskell programmer, after all):

UTF8: 299,949
UTF16: 566,610

GBK: 1,866
UTF8: 1,891
UTF16: 3,684

UTF8: 122,870
UTF16: 217,420

Seems like UTF8 is a consistent winner versus UTF16, and not much of a loser to the native formats.

Michael

On Wed, Aug 18, 2010 at 11:01 AM, anderson leo <[hidden email]> wrote:
[...]

Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by John Meacham
On Wed, Aug 18, 2010 at 2:12 AM, John Meacham <[hidden email]> wrote:
<ranty thing to follow>
That said, there is never a reason to use UTF-16; it is a vestigial
remnant from the brief period when it was thought 16 bits would be
enough for the Unicode standard, and any defense of it nowadays is
after-the-fact justification for having accidentally standardized on it
back in the day.

This is false. Text uses UTF-16 internally as early benchmarks indicated that it was faster. See Tom Harper's response to the other thread that was spawned off this thread by Ketil.

Text continues to be UTF-16 today because

    * no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
    * no one has written a patch that converts Text to use UTF-8 internally.

I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.
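
For anyone who wants to pick this up, the skeleton of such a benchmark is small. A sketch using criterion, where the corpus file and the chosen operations are placeholders:

    import Criterion.Main
    import qualified Data.ByteString as B
    import qualified Data.Text as T
    import qualified Data.Text.Encoding as TE

    main :: IO ()
    main = do
      bytes <- B.readFile "sample-utf8.txt"  -- hypothetical corpus
      let t = TE.decodeUtf8 bytes
      defaultMain
        [ bench "decode"        (nf TE.decodeUtf8 bytes)
        , bench "decode+length" (nf (T.length . TE.decodeUtf8) bytes)
        , bench "toUpper"       (nf T.toUpper t)
        , bench "words"         (nf T.words t)
        ]

Compiled with -O2, the same program also gives something concrete to inspect with -ddump-simpl when hunting for the culprits.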

Cheers,
Johan



Re: Re: String vs ByteString

Ivan Lazar Miljenovic
Johan Tibell <[hidden email]> writes:

> Text continues to be UTF-16 today because
>
>     * no one has written a benchmark that shows that UTF-8 would be faster
> *for use in Data.Text*, and
>     * no one has written a patch that converts Text to use UTF-8 internally.
>
> I'm quite frustrated by this whole discussion; there's lots of talking, no
> coding, and only a little benchmarking (of web sites, not code). This will
> get us nowhere.

This was my impression as well.  If someone desperately wants Text to
use UTF-8 internally, why not help code such a change rather than just
waving the suggestion around in the air?

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Michael Snoyman
In reply to this post by Johan Tibell-2


On Wed, Aug 18, 2010 at 2:39 PM, Johan Tibell <[hidden email]> wrote:
[...]

Text continues to be UTF-16 today because

    * no one has written a benchmark that shows that UTF-8 would be faster *for use in Data.Text*, and
    * no one has written a patch that converts Text to use UTF-8 internally.

I'm quite frustrated by this whole discussion; there's lots of talking, no coding, and only a little benchmarking (of web sites, not code). This will get us nowhere.

Here's my response to the two points:

* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.

* Since the prevailing attitude has been such a disregard for any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder, which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

Michael


Re: Re: String vs ByteString

Duncan Coutts-4
On 18 August 2010 15:04, Michael Snoyman <[hidden email]> wrote:

> For me, the whole point of this discussion was to
> determine whether we should attempt porting to UTF-8, which as I understand
> it would be a rather large undertaking.

And the answer to that is: yes, but only if we have good reason to
believe it will actually be faster, and that's where we're most
interested in benchmarks rather than hand-waving.

As Johan and others have said, the original choice to use UTF-16 was
based on benchmarks showing it was faster (than UTF-8 or UTF-32). So if
we want to counter that, we need either to argue that these were the
wrong choice of benchmarks and do not reflect real usage, or that with
better implementations the balance would shift.

Now there is an interesting argument to claim that we spend more time
shoveling strings about than we do actually processing them in any
interesting way, and therefore that we should pick benchmarks that
reflect that. This would then shift the balance to favour the internal
representation being identical to some particular popular external
representation --- even if that internal representation is slower for
many processing tasks.

Duncan

Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by Michael Snoyman
Hi Michael,

On Wed, Aug 18, 2010 at 4:04 PM, Michael Snoyman <[hidden email]> wrote:
Here's my response to the two points:

* I haven't written a patch showing that Data.Text would be faster using UTF-8 because that would require fulfilling the second point (I'll get to in a second). I *have* shown where there are huge performance differences between text and ByteString/String. Unfortunately, the response has been "don't use bytestring, it's the wrong datatype, text will get fixed," which is quite underwhelming.

I went through all the emails you sent with the subject "String vs ByteString" and "Re: String vs ByteString" and I can't find a single benchmark. I do agree with you that

    * UTF-8 is more compact than UTF-16, and
    * UTF-8 is by far the most used encoding on the web.

and that establishes a reasonable *theoretical* argument for why switching to UTF-8 might be faster.

What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.
 
* Since the prevailing attitude has been such a disregard for any facts shown thus far, it seems that the effort required to learn the internals of the text package and attempt a patch would be wasted. In the meanwhile, Jasper has released blaze-builder, which does an amazing job at producing UTF-8 encoded data, which for the moment is my main need. As much as I'll be chastised by the community, I'll stick with this approach for the moment.

I'm not sure this discussion has surfaced that many facts. What we do have is plenty of theories. I can easily add some more:

    * GHC is not doing a good job laying out the branches in the validation code that does arithmetic on the input byte sequence, to validate the input and compute the Unicode code point that should be streamed using fusion.

    * The differences in text and bytestring's fusion frameworks get in the way of some optimizations in GHC (text uses a more sophisticated fusion framework that handles some cases bytestring's can't, according to Bryan).

    * Lingering space leaks are hurting performance (Bryan plugged one already).

    * The use of a polymorphic loop state in the fusion framework gets in the way of unboxing.

    * Extraneous copying in the Handle implementation slows down I/O.

All these are plausible reasons why Text might perform worse than ByteString. We need to find out which ones are true by benchmarking and looking at the generated Core.
 
Now if you tell me that text would consider applying a UTF-8 patch, that would be a different story. But I don't have the time to maintain a separate UTF-8 version of text. For me, the whole point of this discussion was to determine whether we should attempt porting to UTF-8, which as I understand it would be a rather large undertaking.

I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.

Cheers,
Johan



Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by wren ng thornton
On Wed, Aug 18, 2010 at 4:12 AM, wren ng thornton <[hidden email]> wrote:
There was a study recently on this. They found that there are four main parts of the Internet:

* a densely connected core, where from any site you can get to any other
* an "in cone", from which you can reach the core (but not other in-cone members, since then you'd both be in the core)
* an "out cone", which can be reached from the core (but which cannot reach each other)
* and, unconnected islands

The surprising part is they found that all four parts are approximately the same size. I forget the exact numbers, but they're all 25+/-5%.

This implies that an exhaustive crawl of the web would require having about 50% of all websites as seeds (the in-cone plus the islands). If we're only interested in a representative sample, then we could get by with fewer. However, that depends a lot on the definition of "representative". And we can't have an accurate definition of representative without doing the entire crawl at some point in order to discover the appropriate distributions. Then again, distributions change over time...

Thus, I would guess that Google only has 50~75% of the net: the core, the out-cone, and a fraction of the islands and in-cone.

That's an interesting result.

However, if you weight each page by its page views, you'll probably find that Google (and other search engines) cover much more than that, since page views on sites tend to follow a power-law distribution.

-- Johan



Re: Re: String vs ByteString

Michael Snoyman
In reply to this post by Johan Tibell-2


On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <[hidden email]> wrote:
Hi Michael,


[...]
What I'm looking for is a program that shows a big difference so we can validate the hypothesis. As Duncan mentioned, we already ran some benchmarks early on that showed the opposite. Someone posted a benchmark earlier in this thread and Bryan addressed the issue raised by that poster. We want more of those.
 
Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:


Originally, Hamlet had been based on the text package; the huge slow-down introduced by text convinced me to migrate to bytestrings, and ultimately blaze-html/blaze-builder. It could be that these were flaws in text that are correctable and have nothing to do with UTF-16; however, it will be difficult to produce a benchmark purely on the UTF-8/UTF-16 divide. Using UTF-16 bytestrings would probably overstate the impact since it wouldn't be using Bryan's fusion logic.
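
For context, the BigTable benchmark boils down to rendering a 1000-row, 10-column table of small integers as markup. A stand-in sketch of that shape using plain Data.Text (not the actual Hamlet code):

    import qualified Data.Text as T

    bigTable :: [[Int]] -> T.Text
    bigTable rows =
        T.concat [T.pack "<table>", T.concat (map row rows), T.pack "</table>"]
      where
        row cs = T.concat [T.pack "<tr>", T.concat (map cell cs), T.pack "</tr>"]
        cell n = T.concat [T.pack "<td>", T.pack (show n), T.pack "</td>"]

    main :: IO ()
    main = print (T.length (bigTable (replicate 1000 [1 .. 10])))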

[...]
I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.

I think that's the main issue, and one that Duncan nailed on the head: we have to think about what the important benchmarks are. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for split texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster since almost all characters are represented as 16 bits, while the big benefit for UTF-8 is in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.

Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases. As is, I'm quite happy using blaze-builder for Hamlet.

Michael


Re: Re: String vs ByteString

Bryan O'Sullivan
On Wed, Aug 18, 2010 at 10:12 AM, Michael Snoyman <[hidden email]> wrote:
 
While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:

Even though your benchmark didn't explicitly come up in this thread, Johan and I spent some time improving the performance of Text for it. As a result, in darcs HEAD, Text is faster than String, but slower than ByteString. I'd certainly like to close that gap more aggressively.

If the other contributors to this thread took just one minute to craft a benchmark they cared about for every ten minutes they spend producing hot air, we'd be a lot better off.
 
It could be that these were flaws in text that are correctable and have nothing to do with UTF-16;

Since the internal representation used by text is completely opaque, we could of course change it if necessary, with no user-visible consequences. I've yet to see any data that suggests that it's specifically UTF-16 that is related to any performance shortfalls, however.

Some people have been floating the idea of multiple text packages. I personally would *not* want to go down that road, but it might be the only approach that allows top performance for all use cases.

I'd be surprised if that proves necessary.


Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by Michael Snoyman
On Wed, Aug 18, 2010 at 7:12 PM, Michael Snoyman <[hidden email]> wrote:
On Wed, Aug 18, 2010 at 6:24 PM, Johan Tibell <[hidden email]> wrote:
 
Sorry, I thought I'd sent these out. While working on optimizing Hamlet I started playing around with the BigTable benchmark. I wrote two blog posts on the topic:


[...]

Those are great. As Bryan mentioned we've already improved performance and I think I know how to improve it further.

I appreciate that it's difficult to show the UTF-8/UTF-16 divide. I think the approach we're trying at the moment is looking at benchmarks, improving performance, and repeating until we can't improve anymore. It could be the case that we get a benchmark where the performance difference between bytestring and text cannot be explained/fixed by factors other than changing the internal encoding. That would be strong evidence that we should try to switch the internal encoding. We haven't seen any such benchmarks yet.

As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find anywhere that input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.
 
I don't see any reason why Bryan wouldn't accept an UTF-8 patch if it was faster on some set of benchmarks (starting with the ones already in the library) that we agree on.

I think that's the main issue, and one that Duncan nailed on the head: we have to think about what the important benchmarks are. For Hamlet, I need fast UTF-8 bytestring generation. I don't care at all about algorithmic speed for split texts, as an example. My (probably uneducated) guess is that UTF-16 tends to perform many operations in memory faster since almost all characters are represented as 16 bits, while the big benefit for UTF-8 is in reading UTF-8 data, rendering UTF-8 data and decreased memory usage. But as I said, that's an (uneducated) guess.

I agree. Let's create some more benchmarks.

For example, lately I've been working on a benchmark, inspired by a real-world problem, where I iterate over the lines in a ~500 MB UTF-8 file, inserting each line into a Data.Map and doing a bunch of further processing on it (such as splitting the strings into words). This tests text I/O throughput, memory overhead, performance of string comparison, etc.
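
Roughly, something of this shape (a sketch: lazy I/O, a hypothetical file name, and word counting standing in for the real processing):

    import qualified Data.ByteString.Lazy as BL
    import Data.List (foldl')
    import qualified Data.Map as M
    import qualified Data.Text as T
    import qualified Data.Text.Lazy as TL
    import qualified Data.Text.Lazy.Encoding as TLE

    -- Fold over the lines of a large UTF-8 file, counting word
    -- occurrences in a Data.Map.
    main :: IO ()
    main = do
      bytes <- BL.readFile "big-utf8.txt"
      let bump m w    = M.insertWith (+) w (1 :: Int) m
          step m line = foldl' bump m (T.words (TL.toStrict line))
          counts      = foldl' step M.empty (TL.lines (TLE.decodeUtf8 bytes))
      print (M.size counts)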

We already have benchmarks for reading files (in UTF-8) in several different ways (lazy I/O and iteratee style folds).

Boil down the things you care about into a self-contained benchmark and send it to this list or put it somewhere where we can retrieve it.

Cheers,
Johan



Re: Re: String vs ByteString

Michael Snoyman


On Wed, Aug 18, 2010 at 11:58 PM, Johan Tibell <[hidden email]> wrote:
As for blaze, I'm not sure exactly how it deals with UTF-8 input. I tried to browse through the repo but couldn't find anywhere that input ByteStrings are actually validated. If they're not, it's a bit generous to say that it deals with UTF-8 data, as it would really just be concatenating byte sequences without validating them. We should ask Jasper about the current state.
 
As far as I can tell, Blaze *never* validates input ByteStrings. The "proper" approach to inserting data into blaze is either via String or Text. I requested that Jasper provide an unsafeByteString function in Blaze for Hamlet's usage: Hamlet does the UTF-8 encoding at compile time and is thus able to gain a little extra performance.

If you want to properly validate bytestrings before inputting them, I believe the best approach would be to use utf8-string or text to read in the bytestrings, but Jasper may have a better approach.
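
To make the division of labour concrete, here is a small sketch of that pattern with blaze-builder (greeting is a made-up example, not Hamlet output):

    import Blaze.ByteString.Builder (Builder, fromByteString, toLazyByteString)
    import Blaze.ByteString.Builder.Char.Utf8 (fromText)
    import qualified Data.ByteString.Char8 as S8
    import qualified Data.ByteString.Lazy as BL
    import Data.Monoid (mconcat)
    import qualified Data.Text as T

    -- Static fragments enter as raw bytes trusted to be UTF-8 (the part
    -- Hamlet prepares at compile time); dynamic values go through Text,
    -- so they are valid UTF-8 by construction.
    greeting :: T.Text -> Builder
    greeting name = mconcat
      [ fromByteString (S8.pack "<p>Hello, ")
      , fromText name
      , fromByteString (S8.pack "</p>")
      ]

    main :: IO ()
    main = BL.putStr (toLazyByteString (greeting (T.pack "world")))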

Michael
