String vs ByteString

Re: String vs ByteString

Kevin Jardine-3
On Aug 14, 2:41 am, Brandon S Allbery KF8NH <[hidden email]>
wrote:

> Efficient for what?  The most efficient Unicode representation for
> Latin-derived strings is UTF-8, but the most efficient for CJK is UTF-16.

I think that this kind of programming detail should be handled
internally (even if necessary by switching automatically from UTF-8 to
UTF-16 depending upon the language).

I'm using Haskell so that I can write high level code. In my view I
should not have to care if the people using my application write in
Farsi, Quechua or Tamil.

Kevin
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Re: String vs ByteString

Andrew Coppin
In reply to this post by Johan Tibell-2
Johan Tibell wrote:

> On Fri, Aug 13, 2010 at 4:24 PM, Kevin Jardine <[hidden email]> wrote:
>
>     One of the more puzzling aspects of Haskell for newbies is the large
>     number of libraries that appear to provide similar/duplicate
>     functionality.
>
>
> I agree.
>
> Here's a rule of thumb: If you have binary data, use Data.ByteString.
> If you have text, use Data.Text. Those libraries have benchmarks and
> have been well tuned by experienced Haskellers and should be the
> fastest and most memory compact in most cases. There are still a few
> cases where String beats Text but they are being worked on as we speak.

Interesting. I've never even heard of Data.Text. When did that come into
existence?

More importantly: How does the average random Haskeller discover that a
package has become available that might be relevant to their work?


Re: Re: String vs ByteString

Florian Weimer
In reply to this post by Bryan O'Sullivan
* Bryan O'Sullivan:

> If you know it's text and not binary data you are working with, you should
> still use Data.Text. There are a few good reasons.
>
>    1. The API is more correct. For instance, if you use Text.toUpper on a
>    string containing latin1 "ß" (eszett, sharp S), you'll get the
>    two-character sequence "SS", which is correct. Using Char8.map Char.toUpper
>    here gives the wrong answer.

Data.Text is still incorrect for some scripts:

$ LANG=tr_TR.UTF-8 ghci
GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Prelude> import Data.Text
Prelude Data.Text> toUpper $ pack "i"
Loading package array-0.3.0.0 ... linking ... done.
Loading package containers-0.3.0.0 ... linking ... done.
Loading package deepseq-1.1.0.0 ... linking ... done.
Loading package bytestring-0.9.1.5 ... linking ... done.
Loading package text-0.7.2.1 ... linking ... done.
"I"
Prelude Data.Text>
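
The two behaviours discussed above can be reproduced in a few lines. This is only a sketch: the function names are made up for illustration, and `map Data.Char.toUpper` over a String stands in for the `Char8.map Char.toUpper` call mentioned in the quote. It assumes a reasonably recent `text` package.

```haskell
import qualified Data.Char as C
import qualified Data.Text as T

-- Full Unicode case mapping via Data.Text: one character may expand
-- to several (sharp s becomes "SS").
upperViaText :: String -> String
upperViaText = T.unpack . T.toUpper . T.pack

-- Per-character mapping: every Char maps to exactly one Char, so a
-- character with no single-character uppercase form is left unchanged.
upperPerChar :: String -> String
upperPerChar = map C.toUpper
```

In GHCi, `upperViaText "stra\223e"` gives "STRASSE", while `upperPerChar "stra\223e"` leaves the sharp s (`\223`) in place. `upperViaText "i"` gives "I" whatever LANG is set to, because `T.toUpper` takes no locale argument; a Turkish reader would expect a dotted capital I (U+0130) there.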


Re: Re: String vs ByteString

Ivan Lazar Miljenovic
In reply to this post by Andrew Coppin
Andrew Coppin <[hidden email]> writes:

> Interesting. I've never even heard of Data.Text. When did that come
> into existence?

The first version hit Hackage in February last year...

> More importantly: How does the average random Haskeller discover that
> a package has become available that might be relevant to their work?

Look on Hackage; subscribe to mailing lists (where package maintainers
should really write announcement emails), etc.

It's rather surprising you haven't heard of text: it is for benchmarking
this that Bryan wrote criterion; there are emails on -cafe and blog posts
that mention it on a semi-regular basis, etc.

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Johan Tibell-2
In reply to this post by Florian Weimer
On Sat, Aug 14, 2010 at 12:15 PM, Florian Weimer <[hidden email]> wrote:
> * Bryan O'Sullivan:
>
>> If you know it's text and not binary data you are working with, you should
>> still use Data.Text. There are a few good reasons.
>>
>>    1. The API is more correct. For instance, if you use Text.toUpper on a
>>    string containing latin1 "ß" (eszett, sharp S), you'll get the
>>    two-character sequence "SS", which is correct. Using Char8.map Char.toUpper
>>    here gives the wrong answer.
>
> Data.Text is still incorrect for some scripts:
>
> $ LANG=tr_TR.UTF-8 ghci
> GHCi, version 6.12.1: http://www.haskell.org/ghc/  :? for help
> Loading package ghc-prim ... linking ... done.
> Loading package integer-gmp ... linking ... done.
> Loading package base ... linking ... done.
> Prelude> import Data.Text
> Prelude Data.Text> toUpper $ pack "i"
> Loading package array-0.3.0.0 ... linking ... done.
> Loading package containers-0.3.0.0 ... linking ... done.
> Loading package deepseq-1.1.0.0 ... linking ... done.
> Loading package bytestring-0.9.1.5 ... linking ... done.
> Loading package text-0.7.2.1 ... linking ... done.
> "I"
> Prelude Data.Text>

Yes. We need locale support for that one. I think Bryan is planning to add it.

-- Johan
 


Re: Re: String vs ByteString

Andrew Coppin
In reply to this post by Ivan Lazar Miljenovic
Ivan Lazar Miljenovic wrote:

> Andrew Coppin <[hidden email]> writes:
>  
>
>> More importantly: How does the average random Haskeller discover that
>> a package has become available that might be relevant to their work?
>>    
>
> Look on Hackage; subscribe to mailing lists (where package maintainers
> should really write announcement emails), etc.
>  

OK. I guess I must have missed that one...

> It's rather surprising you haven't heard of text: it is for benchmarking
> this that Bryan wrote criterion; there's emails on -cafe and blog posts
> that mention it on a semi-regular basis, etc.
>  

Well, I suppose I don't do a lot of text processing work... If all
you're trying to do is parse commands from an interactive terminal
prompt, [Char] is probably good enough.

(What I do do is process big chunks of binary data - which is what
ByteString is intended for.)


Re: Re: String vs ByteString

Ivan Lazar Miljenovic
Andrew Coppin <[hidden email]> writes:

> Well, I suppose I don't do a lot of text processing work... If all
> you're trying to do is parse commands from an interactive terminal
> prompt, [Char] is probably good enough.

Neither do I, yet I've heard of it... ;-)

--
Ivan Lazar Miljenovic
[hidden email]
IvanMiljenovic.wordpress.com

Re: Re: String vs ByteString

Brandon S Allbery KF8NH
In reply to this post by Kevin Jardine-3
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 8/14/10 01:29 , Kevin Jardine wrote:
> I think that this kind of programming detail should be handled
> internally (even if necessary by switching automatically from UTF-8 to
> UTF-16 depending upon the language).

This is going to carry a heavy speed penalty.

> I'm using Haskell so that I can write high level code. In my view I
> should not have to care if the people using my application write in
> Farsi, Quechua or Tamil.

Ideally yes, but arguably the existing Unicode representations don't allow
this to be done nicely.  (Of course, arguably there is no "nice" way to do
it; UTF-16 is the best you can do as a workable generic setting.)
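
The size trade-off behind this argument is easy to measure. The sketch below only compares encoded byte counts for two sample strings; the helper and sample names are illustrative, and it says nothing about processing speed.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Number of bytes a Text value occupies once encoded as UTF-8.
utf8Size :: T.Text -> Int
utf8Size = B.length . TE.encodeUtf8

-- Number of bytes the same value occupies as UTF-16 (no byte-order mark).
utf16Size :: T.Text -> Int
utf16Size = B.length . TE.encodeUtf16LE

-- Seven ASCII characters.
latinSample :: T.Text
latinSample = T.pack "Haskell"

-- Three CJK characters (U+65E5, U+672C, U+8A9E), written as escapes.
cjkSample :: T.Text
cjkSample = T.pack "\26085\26412\35486"
```

For these samples, `latinSample` takes 7 bytes in UTF-8 and 14 in UTF-16, while `cjkSample` takes 9 bytes in UTF-8 and 6 in UTF-16. Real documents mix scripts (markup, digits, spaces), so the ratio on actual data is worth measuring rather than assuming.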

- --
brandon s. allbery     [linux,solaris,freebsd,perl]      [hidden email]
system administrator  [openafs,heimdal,too many hats]  [hidden email]
electrical and computer engineering, carnegie mellon university      KF8NH
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2.0.10 (Darwin)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAkxmqaEACgkQIn7hlCsL25WmOQCfYEjkem99o5IpwxnD7bNaDYyG
768AoK17I605DqDxIdnFUE7MK2ktMtrN
=lOPK
-----END PGP SIGNATURE-----

Re: Re: String vs ByteString

Donn Cave-4
Quoth Brandon S Allbery KF8NH <[hidden email]>,
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 8/14/10 01:29 , Kevin Jardine wrote:
>> I think that this kind of programming detail should be handled
>> internally (even if necessary by switching automatically from UTF-8 to
>> UTF-16 depending upon the language).

It seems like the right thing, described in the wrong words - wouldn't
it be a more sensible ideal, to simply `switch' depending on the
character encoding?

I mean, to start with, you'd surely wish for some standardization,
so that the difference between UTF-8 and UTF-16 is essentially internal,
while you use the same API indifferently.

Second, a key requirement to effectively work with external data is
support for multiple character encodings.  E.g., if Text is internally
UTF-16, it still must be able to input and output UTF-8, and presumably
also UTF-16 where appropriate.

So given full support for _both_ encodings (for example, Text
implementation for `native' UTF-8), and support for input data of
_either_ encoding as encountered at run time ... then the internal
implementation choice should simply follow the external data.  For
Chinese inputs you'd be running UTF-16 functions, for French UTF-8.
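
The boundary half of this idea, one API that accepts and produces either encoding, can already be sketched with Data.Text.Encoding. The `WireEncoding` type and both function names are hypothetical. This does not make the internal representation follow the input, which is the part proposed above; it only shows the same Text API sitting behind both encodings.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical tag for the encoding of external data.
data WireEncoding = WireUtf8 | WireUtf16LE

-- Decode external bytes into Text. Invalid input is replaced with
-- U+FFFD instead of raising an exception.
decodeWire :: WireEncoding -> B.ByteString -> T.Text
decodeWire WireUtf8    = TE.decodeUtf8With lenientDecode
decodeWire WireUtf16LE = TE.decodeUtf16LEWith lenientDecode

-- Encode Text back into the requested external encoding.
encodeWire :: WireEncoding -> T.Text -> B.ByteString
encodeWire WireUtf8    = TE.encodeUtf8
encodeWire WireUtf16LE = TE.encodeUtf16LE
```

Code between `decodeWire` and `encodeWire` works on Text and never needs to know which encoding the data arrived in.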

        Donn Cave, [hidden email]

Re: Re: String vs ByteString

Don Stewart-2
In reply to this post by Andrew Coppin
andrewcoppin:
> Interesting. I've never even heard of Data.Text. When did that come into  
> existence?
>
> More importantly: How does the average random Haskeller discover that a  
> package has become available that might be relevant to their work?

In this case, Data.Text has been announced on this very list several
times:
   
   Text 0.7 announcement
    http://www.haskell.org/pipermail/haskell-cafe/2009-December/070866.html

   Text 0.5 announcement
    http://www.haskell.org/pipermail/haskell-cafe/2009-October/067517.html

   Text 0.2 announcement
    http://www.haskell.org/pipermail/haskell-cafe/2009-May/061800.html

   Text 0.1 announcement
    http://www.haskell.org/pipermail/haskell-cafe/2009-February/056723.html

As well as on Planet Haskell several times:

   Finally! Fast Unicode support for Haskell
     http://www.serpentine.com/blog/2009/02/27/finally-fast-unicode-support-for-haskell/

   Streaming Unicode support for Haskell: text 0.2
     http://www.serpentine.com/blog/2009/05/22/streaming-unicode-support-for-haskell-text-02/

   Case conversion and text 0.3
     http://www.serpentine.com/blog/2009/06/07/case-conversion-and-text-03/

As well as being presented at Anglo Haskell

    http://www.wellquite.org/non-blog/AngloHaskell2008/tom%20harper.pdf

It is mentioned repeatedly in the quarterly Hackage status posts:

    "vector and text are quickly rising as the preferred arrays and unicode libraries"
      http://donsbot.wordpress.com/2010/04/03/the-haskell-platform-q1-2010-report/

    "text has made it into the top 30 libraries"
      http://donsbot.wordpress.com/2010/06/30/popular-haskell-packages-q2-2010-report/

    Ranked 31st most popular package by June 2010.
      http://code.haskell.org/~dons/hackage/Jun-2010/popular.txt

    Ranked 41st most popular package by April 2010.
      http://www.galois.com/~dons/hackage/april-2010/popularity.csv

    Ranked 345th by August 2009
      http://www.galois.com/~dons/hackage/august-2009/popularity-august-2009.html

And discussed on Reddit Haskell many times:

    http://www.reddit.com/r/haskell/comments/8qfvw/doing_unicode_case_conversion_and_error_recovery/

    http://www.reddit.com/r/haskell/comments/80smp/datatext_fast_unicode_bytestrings_with_stream/

    http://www.reddit.com/r/haskell/comments/ade08/the_performance_of_datatext/

So, to stay up to date without drowning in data, do one of:

    * Pay attention to Haskell Cafe announcements
    * Follow the Reddit Haskell news.
    * Read the quarterly reports on Hackage
    * Follow Planet Haskell

-- Don

Re: Re: String vs ByteString

David Menendez-2
In reply to this post by Johan Tibell-2
On Fri, Aug 13, 2010 at 10:43 AM, Johan Tibell <[hidden email]> wrote:
>
> Here's a rule of thumb: If you have binary data, use Data.ByteString. If you
> have text, use Data.Text. Those libraries have benchmarks and have been well
> tuned by experienced Haskellers and should be the fastest and most memory
> compact in most cases. There are still a few cases where String beats Text
> but they are being worked on as we speak.

It's a good rule, but I don't know how helpful it is to someone doing
XML processing. From what I can tell, the only XML library that uses
Data.Text is libxml-sax, although tagsoup can probably be easily
extended to use it. HXT, HaXml, and xml all use [Char] internally.
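
In practice this means converting at the library boundary. A minimal sketch, in which `escapeXmlString` is a hypothetical stand-in for a function exported by a [Char]-based XML library and is not taken from any of the packages named above:

```haskell
import qualified Data.Text as T

-- Hypothetical stand-in for a String-based library function.
escapeXmlString :: String -> String
escapeXmlString = concatMap escapeChar
  where
    escapeChar '<' = "&lt;"
    escapeChar '>' = "&gt;"
    escapeChar '&' = "&amp;"
    escapeChar c   = [c]

-- Adapter for callers that hold Text: unpack, call the String API,
-- pack the result again.
escapeXmlText :: T.Text -> T.Text
escapeXmlText = T.pack . escapeXmlString . T.unpack
```

Each call walks the data as a list of Char in both directions, which gives up much of the compactness that motivated using Text in the first place.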

--
Dave Menendez <[hidden email]>
<http://www.eyrie.org/~zednenem/>

Re: Re: String vs ByteString

Yitzchak Gale
In reply to this post by Sean Leather
Sean Leather wrote:
> Which one do you use for strings in HTML or XML in which UTF-8 has become
> the commonly accepted standard encoding?

UTF-8 is only becoming the standard for non-CJK languages.
We are told by members of our community in CJK countries
that UTF-8 is not widely adopted there, and there is no sign that
it ever will be. And one should be aware that the proportion of
CJK in global Internet traffic is growing quickly.

But of course, that is still a legitimate question for some
situations in which full internationalization will not be needed.

Regards,
Yitz

Re: Re: String vs ByteString

Sean Leather

Yitzchak Gale wrote:
> Sean Leather wrote:
>> Which one do you use for strings in HTML or XML in which UTF-8 has become
>> the commonly accepted standard encoding?
>
> UTF-8 is only becoming the standard for non-CJK languages.
> We are told by members of our community in CJK countries
> that UTF-8 is not widely adopted there, and there is no sign that
> it ever will be. And one should be aware that the proportion of
> CJK in global Internet traffic is growing quickly.

So then, what is the standard? Not being familiar with this area, I googled a bit, and I don't see a consensus. I also, notably, don't see UTF-16. If that is the case, then a similar question still arises for CJK text: what format/library should one use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)? It appears that there are no ideal answers to such questions.

Regards,
Sean


Re: Re: String vs ByteString

Yitzchak Gale
Sean Leather wrote:
> So then, what is the standard?
> ...I also noticeably don't see UTF-16.

Right, there are a handful of language-specific 16-bit encodings
that are popular, from what I understand.

> So, if this is the case, then a similar question still arises for CJK text:
> What format/library to use for it (assuming one doesn't want a performance
> penalty for translating between Data.Text's internal format and the target
> format)? It appears that there are no ideal answers to such questions.

Right. If you know you'll be in a specific encoding - whether UTF-8,
Latin1, one of the CJK encodings, or whatever, it might sometimes
make sense to skip Data.Text and do the IO as raw bytes using
ByteString and then encode/decode manually only when needed.
Otherwise, Data.Text is probably the way to go.
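
A minimal sketch of that approach, assuming the input is known to be UTF-8 (function names are illustrative). The bulk work stays on raw bytes, and only the piece that has to be treated as text is decoded:

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Text.Encoding.Error (lenientDecode)

-- Count lines on the raw bytes. This is safe for UTF-8 because the
-- byte 0x0A never occurs inside a multi-byte sequence.
lineCount :: B.ByteString -> Int
lineCount = BC.count '\n'

-- Decode only the first line, and only when it is needed as text.
-- Invalid bytes become U+FFFD instead of raising an exception.
firstLine :: B.ByteString -> T.Text
firstLine = TE.decodeUtf8With lenientDecode . BC.takeWhile (/= '\n')
```

With real input, the bytes would come from something like `B.readFile`. Splitting on a single byte this way does not carry over to UTF-16, where a newline is two bytes and 0x0A can appear as half of an unrelated code unit.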

Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Sean Leather
On Sat, Aug 14, 2010 at 3:46 PM, Sean Leather <[hidden email]> wrote:

> So then, what is the standard?

There isn't one. There are many national standards:
  • China: GB-2312, GBK and GB18030
  • Taiwan: Big5
  • Japan: JIS and Shift-JIS (0208 and 0213 variants) and EUC-JP
  • Korea: KS X 1001, EUC-KR, and ISO-2022-KR
In general, Unicode uptake is increasing rapidly: http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

> Not being familiar with this area, I googled a bit, and I don't see a consensus. I also, notably, don't see UTF-16. If that is the case, then a similar question still arises for CJK text: what format/library should one use for it (assuming one doesn't want a performance penalty for translating between Data.Text's internal format and the target format)?

In my opinion, this "performance penalty" hand-wringing is mostly silly. We're talking a pretty small factor of performance difference in most of these cases. Even the biggest difference, between ByteString and String, is usually much less than a factor of 100.

Your absolute first concern should be correctness, for which you should (a) use text and (b) assume that any performance issues are being actively worked on, especially if you report concrete problems and how to reproduce them. In the unlikely event that you need to support non-Unicode encodings, they are readily available via text-icu.

The only significant change to the text API that lies ahead is an introduction of locale support in a few critical places, so that we can do the right thing for languages like Turkish.


Re: Re: String vs ByteString

John Millikin
On Sat, Aug 14, 2010 at 16:38, Bryan O'Sullivan <[hidden email]> wrote:
> In my opinion, this "performance penalty" hand-wringing is mostly silly.
> We're talking a pretty small factor of performance difference in most of these
> cases. Even the biggest difference, between ByteString and String, is usually
> much less than a factor of 100.

This attitude towards performance, that it doesn't really matter as
long as something happens *eventually*, is what pushed me away from
Python and towards more performant languages like Haskell in the first
place. Sure, you might not notice a few extra seconds when parsing
some file on your quad-core developer desktop, but those seconds turn
into 20 minutes of lost battery power when running on smaller systems.
Having to convert the internal data structure between [Char], (Ptr
Word16), and (Ptr Word8) can quickly cause user-visible problems.

Libraries which will (by their nature) see heavy use, such as
"bytestring" and "text", ought to have much attention paid to their
performance characteristics. A factor of 2-3x might be the difference
between being able to use a library, and having to rewrite its
functionality to be more efficient.

> In the unlikely event that you need to support non-Unicode encodings,
> they are readily available via text-icu.

Unfortunately, text-icu is hardcoded to use libicu 4.0, which was
released well over a year ago and is no longer available in many
distributions. I sent you a patch to support newer versions a few
months ago, but never received a response. Meanwhile, libicu is up to
4.4 by now.

Re: Re: String vs ByteString

Yitzchak Gale
In reply to this post by Bryan O'Sullivan
Bryan O'Sullivan wrote:
> In general, Unicode uptake is increasing rapidly:
> http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

These Google graphs are the oft-quoted source of
Unicode's growing dominance. But the data for those graphs
is taken from Google's own web indexing. Google is a
U.S. company that appears to have a strong Western
culture bias - viz. their recent high-profile struggles with
China. Google is far from being the dominant market
leader in CJK countries that they are in Western countries.
Their level of understanding of those markets is clearly not
the same.

It could be this really is true for CJK countries as well,
or it could be that the data is skewed by Google's web
indexing methods. I won't believe that source until it is
highly corroborated with data and opinions that are native
to CJK countries, from sources that do not have a vested
interest in Unicode adoption.

What we have heard in the past from members of our own
community in CJK countries does not agree at all with
Google's claims, but that may be changing. It would be
great to hear more from them.

Regards,
Yitz

Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by John Millikin
On Sat, Aug 14, 2010 at 5:11 PM, John Millikin <[hidden email]> wrote:

> This attitude towards performance, that it doesn't really matter as
> long as something happens *eventually*, is what pushed me away from
> Python and towards more performant languages like Haskell in the first
> place.

But wait, wait - I'm not at all contending that performance doesn't matter! In fact, I spent a couple of months working on criterion precisely because I want to base my own performance work on extremely solid data, and to afford the same opportunity to other people. So far in this thread, there's been exactly one performance number posted, by Daniel. Not only have I already thanked him for it, I immediately used (and continue to use) it to improve the performance of the text library in that instance.

More broadly, what I am recommending is simple:
  • Use a good library.
  • Expect good performance out of it.
  • Measure the performance you get out of your application.
  • If it's not good enough, and the fault lies in a library you chose, report a bug and provide a test case.

In the case of the text library, it is often (but not always) competitive with bytestring, and I improve it when I can, especially when given test cases. My goal is for it to be the obvious choice on several fronts:
  • Cleanliness of API, where it's already better, but could still improve
  • Performance, which is not quite where I want it (target: parity with, or better than, bytestring)
  • Quality, where text has slightly more test coverage than bytestring

However, just text alone is a big project, and I could get a lot more done if I was both coding and integrating patches than if coding alone :-) So patches are very welcome.

>> In the unlikely event that you need to support non-Unicode encodings,
>> they are readily available via text-icu.
>
> Unfortunately, text-icu is hardcoded to use libicu 4.0, which was
> released well over a year ago and is no longer available in many
> distributions. I sent you a patch to support newer versions a few
> months ago, but never received a response.

Yes, that's quite embarrassing, and I am quite apologetic about it, especially since I just asked for help in the preceding paragraph. If it's any help, there's a story behind my apparent sloth: I overenthusiastically accepted a patch from another contributor a few months before yours, and his changes left the text-icu darcs repo in a mess from which I have yet to rescue it. I do still have your patch, and I'll probably abandon my attempts to clean up the other one, as cleaning it up has turned out to be more work than I care to do.


Re: Re: String vs ByteString

Bryan O'Sullivan
In reply to this post by Yitzchak Gale
On Sat, Aug 14, 2010 at 5:39 PM, Yitzchak Gale <[hidden email]> wrote:

> It could be this really is true for CJK countries as well,
> or it could be that the data is skewed by Google's web
> indexing methods.

I also wouldn't be surprised if the picture for web-based text is quite different from that for other textual data.


Re: Re: String vs ByteString

wren ng thornton
In reply to this post by Yitzchak Gale
Yitzchak Gale wrote:
> Bryan O'Sullivan wrote:
>> In general, Unicode uptake is increasing rapidly:
>>  http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html
>
> These Google graphs are the oft-quoted source of
> Unicode's growing dominance. But the data for those graphs
> is taken from Google's own web indexing.

Note also that all those encodings near the bottom are remaining
relatively constant. UTF-8 is taking its market share from ASCII and
Western European encodings, not so much from other encodings (as yet).

As Bryan mentioned, Unicode doesn't have wide acceptance in CJK
countries. These days, Japanese websites seem to have finally started to
standardize, in that they use HTTP/HTML headers to say which encoding
the pages are in (and generally use JIS or Shift-JIS). This is a big
step up from a decade ago when non-commercial sites pretty invariably
required fiddling with the browser to get rid of mojibake. Japan hasn't
been bitten by the i18n/l10n bug and they don't have a strong F/OSS
community to drive adoption either.

--
Live well,
~wren