PROPOSAL: New efficient Unicode string library.


PROPOSAL: New efficient Unicode string library.

Johan Tibell-2
Dear haskell-cafe,

I would like to propose a new, ByteString-like Unicode string library
which can be used where both efficiency (currently offered by
ByteString) and i18n support (currently offered by vanilla Strings)
are needed. I wrote a skeleton draft today but I'm a bit tired, so I
didn't get down all the details. Nevertheless, I think it's fleshed out
enough for some initial feedback. If I can get the important parts
nailed down before the Hackathon I could hack on it there.

Apologies for not getting everything we discussed on #haskell down in
the first draft. It'll get in there eventually.

Bring out your Unicode kung-fu!

http://haskell.org/haskellwiki/UnicodeByteString

Cheers,

Johan Tibell
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: PROPOSAL: New efficient Unicode string library.

Twan van Laarhoven
Johan Tibell wrote:

> [...]
> http://haskell.org/haskellwiki/UnicodeByteString

Have you looked at my CompactString library[1]? It essentially does
exactly this, with one extension: the type is parameterized over the
encoding. From the discussion on #haskell it would seem that some people
consider this unforgivable, while others consider it essential.

In my opinion flexibility is more important; you can always restrict
things later. For the common case where encoding doesn't matter
there is Data.CompactString.UTF8, which provides an un-parameterized
type. I called this type 'CompactString' as well, which might be a bit
unfortunate. I don't like the name UnicodeString, since it suggests that
the normal string somehow doesn't support unicode. This module could be
made more prominent. Maybe Data.CompactString could be the specialized
type, while Data.CompactString.Parameterized supports different encodings.
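Twan's encoding-parameterised design can be sketched with a phantom type
parameter. This is an illustrative sketch only, not the actual
CompactString API; the names `Encoding`, `pack`, and `asciiUtf8` are
made up here, and the UTF-8 instance handles only ASCII for brevity:

```haskell
import Data.Word (Word8)
import Data.Char (ord)

-- Phantom tags: never inhabited, they only select an
-- encoding at the type level.
data UTF8
data UTF16

-- A string is a packed buffer tagged with its encoding
-- ([Word8] stands in here for a real packed byte array).
newtype CompactString enc = CS [Word8] deriving (Eq, Show)

-- Encoding-specific behaviour lives in a class...
class Encoding enc where
  pack :: String -> CompactString enc

-- ...with one instance per encoding. For brevity this sketch
-- only encodes code points below U+0080 (ASCII) correctly.
instance Encoding UTF8 where
  pack = CS . concatMap encodeChar
    where encodeChar c = [fromIntegral (ord c)]  -- ASCII only!

-- Code that cares about the encoding picks a concrete type:
asciiUtf8 :: String -> CompactString UTF8
asciiUtf8 = pack
```

The un-parameterised Data.CompactString.UTF8 type then falls out as
simply `CompactString UTF8`.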

A word of warning: The library is still in the alpha stage of
development. I don't fully trust it myself yet :)

[1] http://twan.home.fmf.nl/compact-string/

Twan


Re: PROPOSAL: New efficient Unicode string library.

Vitaliy Akimov
In reply to this post by Johan Tibell-2
Hi, thanks for the proposal.
Why are only conversion questions considered? An i18n library should
provide a number of other services, such as normalization, comparison,
sorting, etc. Furthermore, it's not easy to keep such a library up to
date. Why not simply make bindings to the IBM ICU library
(http://www-306.ibm.com/software/globalization/icu/index.jsp), which is
an up-to-date Unicode implementation?

Vitaliy.

2007/9/25, Johan Tibell <[hidden email]>:

> [...]

Re: PROPOSAL: New efficient Unicode string library.

Deborah Goldsmith-2
In reply to this post by Johan Tibell-2
I'll look over the proposal more carefully when I get time, but the  
most important issue is to not let the storage type leak into the  
interface.

From an implementation point of view, UTF-16 is the most efficient
representation for processing Unicode. It's the native Unicode  
representation for Windows, Mac OS X, and the ICU open source i18n  
library. UTF-8 is not very efficient for anything except English. Its  
most valuable property is compatibility with software that thinks of  
character strings as byte arrays, and in fact that's why it was  
invented.

UTF-32 is conceptually cleaner, but characters outside the BMP (Basic  
Multilingual Plane) are rare in actual text, so UTF-16 turns out to  
be the best combination of space and time efficiency.

Deborah

On Sep 24, 2007, at 3:52 PM, Johan Tibell wrote:

> [...]


Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
On 2007-09-26, Deborah Goldsmith <[hidden email]> wrote:
>  From an implementation point of view, UTF-16 is the most efficient  
> representation for processing Unicode.

This depends on the characteristics of the text being processed.
Space-wise, English stays at 1 byte/char in UTF-8.  Most European
languages go up to at most 2, and on average only a bit above 1.  Greek
and Cyrillic are 2 bytes/char.  It's really only Asian, African, Arabic,
etc., scripts that lose out space-wise.

It's true that time-wise there are definite issues in finding character
boundaries.

--
Aaron Denney
-><-


Re: PROPOSAL: New efficient Unicode string library.

Johan Tibell-2
In reply to this post by Deborah Goldsmith-2
> I'll look over the proposal more carefully when I get time, but the
> most important issue is to not let the storage type leak into the
> interface.

Agreed,

>  From an implementation point of view, UTF-16 is the most efficient
> representation for processing Unicode. It's the native Unicode
> representation for Windows, Mac OS X, and the ICU open source i18n
> library. UTF-8 is not very efficient for anything except English. Its
> most valuable property is compatibility with software that thinks of
> character strings as byte arrays, and in fact that's why it was
> invented.

If UTF-16 is what's used by everyone else (how about Java? Python?) I
think that's a strong reason to use it. I don't know Unicode well
enough to say otherwise.

Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
On 2007-09-26, Johan Tibell <[hidden email]> wrote:
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

The internal representations don't matter except in the case of making
FFI linkages.  The external representations do, and UTF-8 has won on
that front.

--
Aaron Denney
-><-


Re: Re: PROPOSAL: New efficient Unicode string library.

Johan Tibell-2
On 9/26/07, Aaron Denney <[hidden email]> wrote:
> On 2007-09-26, Johan Tibell <[hidden email]> wrote:
> > If UTF-16 is what's used by everyone else (how about Java? Python?) I
> > think that's a strong reason to use it. I don't know Unicode well
> > enough to say otherwise.
>
> The internal representations don't matter except in the case of making
> FFI linkages.  The external representations do, and UTF-8 has won on
> that front.

It could matter for performance. However, you can encode your
UnicodeString into any external representation you want for your I/O
needs, including UTF-8.

Re: Re: PROPOSAL: New efficient Unicode string library.

Tony Finch
In reply to this post by Aaron Denney
On Wed, 26 Sep 2007, Aaron Denney wrote:
>
> It's true that time-wise there are definite issues in finding character
> boundaries.

UTF-16 has no advantage over UTF-8 in this respect, because of surrogate
pairs and combining characters. Code points, characters, and glyphs are
all different things, and it's very difficult to represent the latter two
as anything other than a string of code points.
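A minimal sketch of the surrogate mechanism Tony refers to, following
the standard UTF-16 encoding formula (code points above U+FFFF are
split into a high/low surrogate pair; the helper name is made up here):

```haskell
import Data.Word (Word16)
import Data.Char (ord)
import Data.Bits (shiftR, (.&.))

-- Encode one code point as one or two UTF-16 code units.
utf16Units :: Char -> [Word16]
utf16Units ch
  | n < 0x10000 = [fromIntegral n]                          -- BMP: one unit
  | otherwise   = [ fromIntegral (0xD800 + (m `shiftR` 10)) -- high surrogate
                  , fromIntegral (0xDC00 + (m .&. 0x3FF)) ] -- low surrogate
  where
    n = ord ch
    m = n - 0x10000
```

For example, U+1D11E (musical G clef) encodes as the pair
0xD834 0xDD1E, so even in UTF-16 the unit index of the n-th code point
is not n in general.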

Tony.
--
f.a.n.finch  <[hidden email]>  http://dotat.at/
IRISH SEA: SOUTHERLY, BACKING NORTHEASTERLY FOR A TIME, 3 OR 4. SLIGHT OR
MODERATE. SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR.

Re: PROPOSAL: New efficient Unicode string library.

Jonathan Cast
In reply to this post by Johan Tibell-2
On Wed, 2007-09-26 at 09:05 +0200, Johan Tibell wrote:

> > [...]
>
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

I disagree.  I realize I'm a dissenter in this regard, but my position
is: excellent Unix support first, portability second, excellent support
for Win32/MacOS a distant third.  That seems to be the opposite of every
language's position.  Unix absolutely needs UTF-8 for backward
compatibility.

jcc



Re: PROPOSAL: New efficient Unicode string library.

Duncan Coutts
In message <1190825044.9435.1.camel@jcchost> Jonathan Cast <[hidden email]> writes:
> [...]

I think you're talking about different things, internal vs external representations.

Certainly we must support UTF-8 as an external representation. The choice of
internal representation is independent of that. It could be [Char] or some
memory efficient packed format in a standard encoding like UTF-8,16,32. The
choice depends mostly on ease of implementation and performance. Some formats
are easier/faster to process but there are also conversion costs so in some use
cases there is a performance benefit to the internal representation being the
same as the external representation.

So, the obvious choices of internal representation are UTF-8 and UTF-16. UTF-8
has the advantage of being the same as a common external representation so
conversion is cheap (only need to validate rather than copy). UTF-8 is more
compact for western languages but less compact for eastern languages compared to
UTF-16. UTF-8 is a more complex encoding in the common cases than UTF-16. In the
common case UTF-16 is effectively fixed width. According to the ICU implementors
this has speed advantages (probably due to branch prediction and smaller code size).

One solution is to do both and benchmark them.
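Duncan's space trade-off follows directly from the encodings'
definitions, and can be sketched as per-code-point size functions
(illustrative only; the function names are made up here):

```haskell
import Data.Char (ord)

-- Bytes needed per code point in each encoding.
utf8Len, utf16Len :: Char -> Int
utf8Len c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2   -- most Latin supplements, Greek, Cyrillic
  | n < 0x10000 = 3   -- rest of the BMP, incl. most CJK
  | otherwise   = 4   -- supplementary planes
  where n = ord c

utf16Len c
  | ord c < 0x10000 = 2  -- one 16-bit unit
  | otherwise       = 4  -- surrogate pair

-- Total size of a string under each encoding.
utf8Size, utf16Size :: String -> Int
utf8Size  = sum . map utf8Len
utf16Size = sum . map utf16Len
```

For English text UTF-8 wins (5 bytes vs 10 for "hello"); for BMP CJK
text it loses (3 bytes per character vs 2).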

Duncan

Re: PROPOSAL: New efficient Unicode string library.

Jonathan Cast
On Wed, 2007-09-26 at 18:46 +0100, Duncan Coutts wrote:

> [...]
> One solution is to do both and benchmark them.

OK, right.

jcc


Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Johan Tibell-2
On 2007-09-26, Johan Tibell <[hidden email]> wrote:

> It could matter for performance. However, you can encode your
> UnicodeString into any external representation you want for your I/O
> needs, including UTF-8.

Right.  I was trying to say "other languages internal representations
shouldn't affect the choice of those doing a Haskell implementation."

--
Aaron Denney
-><-


Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Tony Finch
On 2007-09-26, Tony Finch <[hidden email]> wrote:
> On Wed, 26 Sep 2007, Aaron Denney wrote:
>>
>> It's true that time-wise there are definite issues in finding character
>> boundaries.
>
> UTF-16 has no advantage over UTF-8 in this respect, because of surrogate
> pairs and combining characters.

Good point.

--
Aaron Denney
-><-


Re: Re: PROPOSAL: New efficient Unicode string library.

Deborah Goldsmith-2
On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
>> UTF-16 has no advantage over UTF-8 in this respect, because of  
>> surrogate
>> pairs and combining characters.
>
> Good point.

Well, not so much. As Duncan mentioned, it's a matter of what the most  
common case is. UTF-16 is effectively fixed-width for the majority of  
text in the majority of languages. Combining sequences and surrogate  
pairs are relatively infrequent.

Speaking as someone who has done a lot of Unicode implementation, I  
would say UTF-16 represents the best time/space tradeoff for an  
internal representation. As I mentioned, it's what's used in Windows,  
Mac OS X, ICU, and Java.

Deborah


'data' syntax - a suggestion

Richard A. O'Keefe
In reply to this post by Aaron Denney
I have often found myself wishing for a small extension to the syntax of
Haskell 'data' declarations.  It goes like this:

        data <as usual>
           = <as usual>
           | ...
           | <as usual>
+++   where type <tvar> = <type>
                 type <tvar> = <type>
                 ...
           deriving <as usual>

Even something like binary search trees would, to me, be clearer as

        data BST key val
           = Empty
           | Fork key val bst bst
           where type bst = BST key val

because this establishes an *essential* identity between the 3rd and 4th
Fork argument types, rather than an "accidental" identity.  I can't set
this up using 'type' outside the 'data' declaration, because it has to
refer to the type arguments of BST.

Semantically, this is just an abbreviation mechanism with no  
consequences
of any kind outside the 'data' declaration itself.  The only point would
be to make reading and writing 'data' declarations easier, especially
large ones.
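For comparison, the same tree in standard Haskell must repeat the
applied type at each recursive position; this is the "accidental"
identity the proposal wants to avoid (the `size` function is added
just to exercise the type):

```haskell
-- Standard Haskell: the recursive type is spelled out in full.
data BST key val
   = Empty
   | Fork key val (BST key val) (BST key val)

-- A top-level synonym cannot help, because it cannot mention the
-- type variables bound by the 'data' declaration:
--   type Bst = BST key val   -- invalid: key, val are unbound here

size :: BST key val -> Int
size Empty          = 0
size (Fork _ _ l r) = 1 + size l + size r
```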


Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Deborah Goldsmith-2
On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote:

> Well, not so much. As Duncan mentioned, it's a matter of what the most  
> common case is. UTF-16 is effectively fixed-width for the majority of  
> text in the majority of languages. Combining sequences and surrogate  
> pairs are relatively infrequent.

Infrequent, but they exist, which means you can't simply seek 2x bytes
ahead to move x characters ahead.  All such seeking must be linear for
both UTF-16 *and* UTF-8.
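Aaron's point can be sketched as a linear scan: with surrogates
present, finding the unit offset of the n-th code point requires
walking the buffer. This is a hypothetical helper over a list of
16-bit units, not any real library's API:

```haskell
import Data.Word (Word16)

-- Unit offset of the n-th code point (0-based) in a UTF-16
-- unit sequence, found by linear scan: O(n), not O(1).
codePointIndex :: Int -> [Word16] -> Maybe Int
codePointIndex n units = go n 0 units
  where
    go 0 i _  = Just i
    go _ _ [] = Nothing
    go k i (u:rest)
      | isHigh u  = go (k - 1) (i + 2) (drop 1 rest)  -- surrogate pair: 2 units
      | otherwise = go (k - 1) (i + 1) rest           -- BMP char: 1 unit
    isHigh u = u >= 0xD800 && u <= 0xDBFF
```

For the units of "A𝄞B" ([0x41, 0xD834, 0xDD1E, 0x42]), code point 2
('B') starts at unit 3, not unit 2.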

--
Aaron Denney
-><-


Re: PROPOSAL: New efficient Unicode string library.

Richard A. O'Keefe
In reply to this post by Johan Tibell-2
On 26 Sep 2007, at 7:05 pm, Johan Tibell wrote:
> If UTF-16 is what's used by everyone else (how about Java? Python?) I
> think that's a strong reason to use it. I don't know Unicode well
> enough to say otherwise.

Java uses 16-bit variables to hold characters.
This is SOLELY for historical reasons, not because it is a good choice.
The history is a bit funny:  the ISO 10646 group were working away
defining a 31-bit character set, and the industry screamed blue murder
about how this was going to ruin the economy, bring back the Dark Ages,
&c, and promptly set up the Unicode consortium to define a 16-bit
character set that could do the same job.  Early versions of Unicode
had only about 30 000 characters, after heroic (and not entirely
appreciated) efforts at unifying Chinese characters as used in China
with those used in Japan and those used in Korea.  They also lumbered
themselves (so that they would have a fighting chance of getting
Unicode adopted) with a "round trip conversion" policy, namely that it
should be possible to take characters using ANY current encoding
standard, convert them to Unicode, and then convert back to the original
encoding with no loss of information.  This led to failure of
unification: there are two versions of Å (one for ordinary use, one for
Angstroms), two versions of mu (one for Greek, one for micron), and
three complete copies of ASCII, &c.  However, 16 bits really is not
enough.

Here's a table from http://www.unicode.org/versions/Unicode5.0.0/

Graphic   98,884
Format    140
Control     65
Private Use 137,468
Surrogate  2,048
Noncharacter     66
Reserved 875,441

Excluding Private Use and Reserved, I make that 101,203 currently
defined codes.  That's nearly 1.5 times the number that would fit in
16 bits.
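The arithmetic checks out against the quoted table; a trivial check:

```haskell
-- Code point counts from the Unicode 5.0 table above
-- (graphic, format, control, surrogate, noncharacter),
-- excluding Private Use and Reserved.
definedCodes :: Int
definedCodes = sum [98884, 140, 65, 2048, 66]  -- 101,203

-- versus the 65,536 values a 16-bit code can hold:
ratio :: Double
ratio = fromIntegral definedCodes / 65536  -- roughly 1.54
```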

Java has had to deal with this, don't think it hasn't.  For example,
where Java had one set of functions referring to characters in strings
by position, it now has two complete sets:  one to use *which 16-bit
code* (which is fast) and one to use *which actual Unicode character*
(which is slow).  The key point is that the second set is *always*
slow even when there are no characters outside the basic multilingual
plane.

One Smalltalk system I sometimes use has three complete string
implementations (all characters fit in a byte, all characters fit
in 16 bits, some characters require more) and dynamically switches
from narrow strings to wide strings behind your back.  In a language
with read-only strings, that makes a lot of sense; it's just a pity
Smalltalk isn't one.

If you want to minimize conversion effort when talking to the operating
system, files, and other programs, UTF-8 is probably the way to go.
(That's on Unix.  For Windows it might be different.)

If you want to minimize the effort of recognising character boundaries
while processing strings, 32-bit characters are the way to go.  If you
want to be able to index into a string efficiently, they are the *only*
way to go.  Solaris bit the bullet many years ago; Sun C compilers
jumped straight from 8-bit wchar_t to 32-bit without ever stopping
at 16.

16-bit characters *used* to be a reasonable compromise, but aren't any
longer.  Unicode keeps on growing.  There were 1,349 new characters
from Unicode 4.1 to Unicode 5.0 (IIRC).  There are lots more scripts
in the pipeline.  (What the heck _is_ Tangut, anyway?)





Re: Re: PROPOSAL: New efficient Unicode string library.

Ross Paterson
In reply to this post by Tony Finch
On Wed, Sep 26, 2007 at 11:25:30AM +0100, Tony Finch wrote:
> On Wed, 26 Sep 2007, Aaron Denney wrote:
> > It's true that time-wise there are definite issues in finding character
> > boundaries.
>
> UTF-16 has no advantage over UTF-8 in this respect, because of surrogate
> pairs and combining characters.

Combining characters are not an issue here, just the surrogate pairs,
because we're discussing representations of sequences of Chars (Unicode
code points).

Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
On 2007-09-27, Ross Paterson <[hidden email]> wrote:
> Combining characters are not an issue here, just the surrogate pairs,
> because we're discussing representations of sequences of Chars (Unicode
> code points).

You'll never want to compose or decompose combining characters?  Never
want to figure out how much screen space a sequence will take?  It _is_
an issue.

--
Aaron Denney
-><-
