FPS/Data.ByteString candidate

FPS/Data.ByteString candidate

Donald Bruce Stewart
Following discussion, I've tagged FPS 0.4, a candidate for the base
library. Changes:

    * Renamed to Data.ByteString(ByteString)
    * Improved documentation
    * Tweaks to build under ghc 6.6
    * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
    * Much faster: elemIndices, lineIndices, split, replicate
    * More automagic benchmarks and QuickCheck tests.

As usual, code is here: http://www.cse.unsw.edu.au/~dons/fps.html

-- Don
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: FPS/Data.ByteString candidate

John Meacham
On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
> Following discussion, I've tagged FPS 0.4, a candidate for the base
> library. Changes:
>
>     * Renamed to Data.ByteString(ByteString)
>     * Improved documentation
>     * Tweaks to build under ghc 6.6
>     * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
>     * Much faster: elemIndices, lineIndices, split, replicate
>     * More automagic benchmarks and QuickCheck tests.

Can we get rid of every reference to 'Char' in the interface? A search
and replace setting them to 'Word8' should do it. Casting between Word8
and Char is just very wrong. A Char-based FastString can be built on top
of it, but we want to be typesafe in any interface.

        John

--
John Meacham - ⑆repetae.net⑆john⑈

Re: FPS/Data.ByteString candidate

Donald Bruce Stewart
john:

> On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
> > Following discussion, I've tagged FPS 0.4, a candidate for the base
> > library. Changes:
> >
> >     * Renamed to Data.ByteString(ByteString)
> >     * Improved documentation
> >     * Tweaks to build under ghc 6.6
> >     * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
> >     * Much faster: elemIndices, lineIndices, split, replicate
> >     * More automagic benchmarks and QuickCheck tests.
>
> Can we get rid of every reference to 'Char' in the interface? a search
> and replace setting them to 'Word8' should do it. Casting between Word8
> and Char is just very wrong. a Char based FastString can be built on top
> of it, but we want to be typesafe in any interface.

Ok. I appreciate this concern.

I'll follow Simon Marlow's library here and partition it into something
like:

    Data.ByteString             -- the core ByteString and Word8 operations
    Data.PackedString.Latin1    -- Char level packed string functions

John (and Ashley?) would this be ok?

-- Don

Re: FPS/Data.ByteString candidate

Einar Karttunen
In reply to this post by John Meacham
On 24.04 16:31, John Meacham wrote:

> On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
> > Following discussion, I've tagged FPS 0.4, a candidate for the base
> > library. Changes:
> >
> >     * Renamed to Data.ByteString(ByteString)
> >     * Improved documentation
> >     * Tweaks to build under ghc 6.6
> >     * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
> >     * Much faster: elemIndices, lineIndices, split, replicate
> >     * More automagic benchmarks and QuickCheck tests.
>
> Can we get rid of every reference to 'Char' in the interface? a search
> and replace setting them to 'Word8' should do it. Casting between Word8
> and Char is just very wrong. a Char based FastString can be built on top
> of it, but we want to be typesafe in any interface.

The Chars in the interface make it much easier to use in production
code. Should we also change the type of putStrLn to:
putStrLn :: [Word8] -> IO () ?

I think the name ByteString implies that it uses bytes and removing
the Char functions does not help much. I am mostly using them to
handle UTF8 data at the moment and it works quite well.

In effect this would mean sprinkling all my code with fromIntegral.
The name Latin1 is particularly bad since there are many other
single byte encodings around.
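
To make the fromIntegral concern concrete, here is a minimal sketch using plain lists of Word8 as a stand-in for ByteString. The names countByte and countNewlines are hypothetical and are not part of FPS; the point is only that a Word8-only interface forces a conversion at every character literal.

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- Hypothetical stand-in for a Word8-only API: count occurrences of a byte.
countByte :: Word8 -> [Word8] -> Int
countByte b = length . filter (== b)

-- With only Word8 in the interface, a character literal such as '\n'
-- has to be converted by hand before it can be passed in.
countNewlines :: [Word8] -> Int
countNewlines = countByte (fromIntegral (ord '\n'))
```

A Char-level wrapper would hide that conversion in one place instead of at every call site.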

We could have a Data.ByteString with a Word8+Char interface and the
module name telling us it is about bytes. Then we can have a
Data.Encode.String.{UTF8,UTF16BE,UTF16LE,UTF32BE,UTF32LE,Latin{1,2,3...}}

- Einar Karttunen

Data.ByteString candidate 3

Donald Bruce Stewart
In reply to this post by Donald Bruce Stewart
dons:

> john:
> > On Sun, Apr 23, 2006 at 05:27:43PM +1000, Donald Bruce Stewart wrote:
> > > Following discussion, I've tagged FPS 0.4, a candidate for the base
> > > library. Changes:
> > >
> > >     * Renamed to Data.ByteString(ByteString)
> > >     * Improved documentation
> > >     * Tweaks to build under ghc 6.6
> > >     * Added: getLine, getContents, putStr, putStrLn, zip, unzip, zipWith
> > >     * Much faster: elemIndices, lineIndices, split, replicate
> > >     * More automagic benchmarks and QuickCheck tests.
> >
> > Can we get rid of every reference to 'Char' in the interface? a search
> > and replace setting them to 'Word8' should do it. Casting between Word8
> > and Char is just very wrong. a Char based FastString can be built on top
> > of it, but we want to be typesafe in any interface.

Ok, here's what I've done:
    http://www.cse.unsw.edu.au/~dons/fps/new/

The code is in the darcs repo:
    http://www.cse.unsw.edu.au/~dons/fps.html

The code has been partitioned into:
    Data.ByteString         a Word8 only layer. All functions are in terms of Word8
    Data.ByteString.Char    provides an ascii/byte-Char layer over the Word8 layer.

So essentially this is the Data.ByteString that John and Ashley were
looking for, I think, and with a new explicit Char layer, for people
like Simon, Einar and me, who need the convenience of literal Chars.

This separation means there is now an encoding-agnostic Word8 level of
high-performance code, which should be generally useful.
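
As a rough illustration of the layering described above (not the actual FPS source), the Char layer can be a thin wrapper that converts at the boundary and delegates to the Word8 layer. Lists of Word8 stand in for ByteString here, and the names Bytes, elemIndexW, elemIndexC, packC, unpackC, c2w and w2c are assumptions made for this sketch.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Word8 layer: encoding-agnostic, works on raw bytes only.
type Bytes = [Word8]

-- Index of the first occurrence of a byte, if any.
elemIndexW :: Word8 -> Bytes -> Maybe Int
elemIndexW b bs = lookup b (zip bs [0 ..])

-- Conversions used by the Char layer. c2w keeps only the low 8 bits.
c2w :: Char -> Word8
c2w = fromIntegral . ord

w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- Char layer: the same operations, phrased in terms of Char.
elemIndexC :: Char -> Bytes -> Maybe Int
elemIndexC = elemIndexW . c2w

packC :: String -> Bytes
packC = map c2w

unpackC :: Bytes -> String
unpackC = map w2c
```

With this split, all the real work lives in the Word8 functions, and the Char functions add nothing but the conversion.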

I'm quite happy with this now, the code is a lot cleaner, and hopefully
this will appease the people who disliked the intermingling of Char and
Word8 code, which I agree was unsatisfactory.

Opinions?

-- Don

Re: Data.ByteString candidate 3

Ketil Malde-3
Donald Bruce Stewart writes:

I already voiced this on IRC, but if you'll forgive me, I'll sum up my
small minority report.

> This separation means there is now an encoding-agnostic Word8 level of
> high-performance code, which should be generally useful.

I'm very happy with the separation, and I think using the Latin-1
charset as the default is the right choice.  The only thing I am
unhappy about, is the last minute name change, which means that the
interpretation as Latin-1 is no longer explicit to the user.

A naive user may think that the anonymously named Char module
interprets the locale, for instance, or might disregard the character
set issues entirely, and be confused when a string literal using
characters with code points > 255 doesn't work as expected.
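
A small sketch of the surprise in question, assuming the Char layer converts with a truncating c2w (the helper names c2w, w2c and roundTrip are assumptions for this example, not confirmed FPS names). A character above code point 255 silently turns into a different character.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Keep only the low 8 bits of the code point.
c2w :: Char -> Word8
c2w = fromIntegral . ord

-- Reinterpret a byte as the code point with the same number.
w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- What a pack followed by an unpack does to each character.
roundTrip :: String -> String
roundTrip = map (w2c . c2w)
```

For example, the Greek small lambda is code point 955; 955 modulo 256 is 187, so roundTrip maps it to code point 187 (a right-pointing double angle quotation mark), while ASCII characters come back unchanged.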

In addition, it is natural to extend this with other character sets,
but it is no longer obvious where to put sibling modules implementing
the same Char functionality with a different (single byte) encoding.

Quite frankly, I don't see any advantage in selecting one particular
encoding and then disguising the fact from the user.

> Opinions?

Well, you did ask.  Thanks for the good work, I'm currently
benchmarking my programs to check what the current Char IO costs are,
and if my suspicions are corroborated, I'll spend some time this week
to switch.

-k
--
If I haven't seen further, it is by standing in the footprints of giants


Re: Data.ByteString candidate 3

Duncan Coutts
In reply to this post by Donald Bruce Stewart
On Tue, 2006-04-25 at 21:12 +1000, Donald Bruce Stewart wrote:

> Ok, here's what I've done:
>     http://www.cse.unsw.edu.au/~dons/fps/new/


> I'm quite happy with this now, the code is a lot cleaner, and hopefully
> this will appease the people who disliked the intermingling of Char and
> Word8 code, which I agree was unsatisfactory.
>
> Opinions?

It's looking good, Don. I think the Word8/Char split was the right way to
go.

On that theme, there are a few more functions for which we should
consider which of the Word8/Char modules would be their best home. If we
are saying that the base module is encoding agnostic then I think these
functions should probably move to the .Char module since they make
assumptions about the encoding of things like '\n'.

breakSpace
dropSpace
dropSpaceEnd

lines
words
unlines
unwords
lines'
unlines'
linesCRLF'
unlinesCRLF'
unwords'
lineIndices

not sure about:
betweenLines
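
To illustrate why line splitting carries an encoding assumption, here is a sketch over lists of Word8. The function splitOnLF is a hypothetical stand-in for a byte-level lines, not the FPS implementation. Splitting on byte 0x0A is right for ASCII-compatible 8-bit encodings, but in UTF-16BE the byte 0x0A can also be the second half of a character such as U+010A.

```haskell
import Data.Word (Word8)

-- Split at every 0x0A byte, with no knowledge of the encoding.
splitOnLF :: [Word8] -> [[Word8]]
splitOnLF bs = case break (== 0x0A) bs of
  (line, [])       -> [line]
  (line, _ : rest) -> line : splitOnLF rest
```

On the ASCII bytes for "a\nb" this gives the two expected lines. On the UTF-16BE bytes 0x01 0x0A, which encode the single character U+010A, it cuts the character in half.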


Duncan


Re: Data.ByteString candidate 3

Simon Marlow-5
In reply to this post by Donald Bruce Stewart
Donald Bruce Stewart wrote:

> The code has been partitioned into:
>     Data.ByteString         a Word8 only layer. All functions are in terms of Word8
>     Data.ByteString.Char    provides an ascii/byte-Char layer over the Word8 layer.

Ok, but where would we put a UTF8 version of the Char layer?  I'm
thinking that "Latin1" would be more correct than "Char", and leaves
room for adding UTF8 and other encodings later.

Cheers,
        Simon



Re: FPS/Data.ByteString candidate

Ross Paterson
In reply to this post by Einar Karttunen
On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> The name Latin1 is particularly bad since there are many other
> single byte encodings around.

The name is quite appropriate, since that is the particular encoding of
Char that is exposed by the interface.  What's bad is that there's no
choice.  Calling it Latin1 is just being honest about that, and leaving
room for modules with other encodings or an interface parameterized
by encoding.


Re: Data.ByteString candidate 3

Duncan Coutts
In reply to this post by Simon Marlow-5
On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
> Donald Bruce Stewart wrote:
>
> > The code has been partitioned into:
> >     Data.ByteString         a Word8 only layer. All functions are in terms of Word8
> >     Data.ByteString.Char    provides an ascii/byte-Char layer over the Word8 layer.
>
> Ok, but where would we put a UTF8 version of the Char layer?  I'm
> thinking that "Latin1" would be more correct than "Char", and leaves
> room for adding UTF8 and other encodings later.

As others have pointed out, it's not strictly Latin1. Don and I reckon
it's probably safe to say that the current Data.ByteString.Char layer is
ok for any 8-bit fixed-width encoding with ASCII as a subset, so that
means it's probably ok for many of the Latin* encodings.

How would we distinguish a full fixed-width 4-byte Unicode version? A
purist might say that this should be Data.ByteString.Char, since a Char
really is a 4-byte Unicode value, and then change the current
Data.ByteString.Char to be Data.ByteString.Char8 or something like that.

Duncan


Re: Data.ByteString candidate 3

Donald Bruce Stewart
In reply to this post by Simon Marlow-5
simonmarhaskell:

> Donald Bruce Stewart wrote:
>
> >The code has been partitioned into:
> >    Data.ByteString         a Word8 only layer. All functions are in terms
> >    of Word8
> >    Data.ByteString.Char    provides an ascii/byte-Char layer over the
> >    Word8 layer.
>
> Ok, but where would we put a UTF8 version of the Char layer?  I'm
> thinking that "Latin1" would be more correct than "Char", and leaves
> room for adding UTF8 and other encodings later.

Ok. Einar had some concerns that Latin1 wasn't the most accurate, as he
uses the Char ops for more general purposes. But Data.ByteString.Latin1
would probably be ok for me. Or Data.ByteString.Char8 perhaps.

-- Don


Re: FPS/Data.ByteString candidate

Donald Bruce Stewart
In reply to this post by Ross Paterson
ross:
> On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> > The name Latin1 is particularly bad since there are many other
> > single byte encodings around.
>
> The name is quite appropriate, since that is the particular encoding of
> Char that is exposed by the interface.  What's bad is that there's no
> choice.  Calling it Latin1 is just being honest about that, and leaving
> room for modules with other encodings or an interface parameterized
> by encoding.

Ok. Duncan, Ketil, Ross and Simon make good points here.
I'll move Data.ByteString.Char -> Data.ByteString.Latin1

-- Don

Re: Data.ByteString candidate 3

Duncan Coutts
In reply to this post by Duncan Coutts
On Tue, 2006-04-25 at 13:13 +0100, Duncan Coutts wrote:

> On Tue, 2006-04-25 at 13:08 +0100, Simon Marlow wrote:
> > Donald Bruce Stewart wrote:
> >
> > > The code has been partitioned into:
> > >     Data.ByteString         a Word8 only layer. All functions are in terms of Word8
> > >     Data.ByteString.Char    provides an ascii/byte-Char layer over the Word8 layer.
> >
> > Ok, but where would we put a UTF8 version of the Char layer?  I'm
> > thinking that "Latin1" would be more correct than "Char", and leaves
> > room for adding UTF8 and other encodings later.
>
> As others have pointed out, it's not strictly Latin1. Don and I reckon
> it's probably safe to say that the current Data.ByteString.Char layer is
> ok for any 8-bit fixed-width encoding with ASCII as a subset, so that
> means it's probably ok for many of the Latin* encodings.
>
> How would we distinguish a full fixed-width 4-byte Unicode version? A
> purist might say that this should be Data.ByteString.Char since a Char
> really is a 4-byte Unicode value and then change the current
> Data.ByteString.Char to be Data.ByteString.Char8 or something like that.

Actually, after further discussion we think that strictly
Data.ByteString.Char will only fully work with Latin1, because only for
Latin1 will the Chars we get back be genuine Unicode code points (since
the first 256 code points of Unicode are the same as Latin1 - or so I am
told).

For other Latin encodings what you get back will only be a Unicode code
point for chars < 128. So for other Latin encodings you'd need different
implementations of w2c & c2w that map the 256 chars to/from the correct
Unicode code points.

So that suggests that we might want to call it Data.ByteString.Latin1.
At this point we wish we had parameterisable modules so we could have
various other encodings just by parameterising on the w2c/c2w mappings.

Most of the time you could use Data.ByteString.Latin1 for other Latin
encodings and get away with it (so long as you don't want to use things
like isUpper for chars >127) which is both a blessing and a curse.
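
Lacking parameterisable modules, one way to sketch the idea is a record holding the two mappings. Everything below is an assumption made for illustration: the Charset type, its field names, and unpackWith are hypothetical, and latin2Partial covers only ASCII plus three ISO 8859-2 entries (0xA1, 0xA3, 0xB1), returning Nothing for every other byte rather than guessing.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

-- An encoding as a pair of byte <-> Char mappings (hypothetical type).
data Charset = Charset
  { decodeByte :: Word8 -> Maybe Char
  , encodeChar :: Char -> Maybe Word8
  }

-- Latin-1: byte n is code point n.
latin1 :: Charset
latin1 = Charset
  { decodeByte = Just . chr . fromIntegral
  , encodeChar = \c ->
      if ord c < 256 then Just (fromIntegral (ord c)) else Nothing
  }

-- A deliberately incomplete sample of the ISO 8859-2 upper half.
latin2Sample :: [(Word8, Char)]
latin2Sample = [(0xA1, '\x0104'), (0xA3, '\x0141'), (0xB1, '\x0105')]

latin2Partial :: Charset
latin2Partial = Charset
  { decodeByte = \b ->
      if b < 0x80 then Just (chr (fromIntegral b)) else lookup b latin2Sample
  , encodeChar = \c ->
      if ord c < 0x80
        then Just (fromIntegral (ord c))
        else lookup c [(ch, b) | (b, ch) <- latin2Sample]
  }

-- Decode a whole byte string under a chosen encoding.
unpackWith :: Charset -> [Word8] -> Maybe String
unpackWith cs = mapM (decodeByte cs)
```

The same byte 0xA1 decodes to U+00A1 under latin1 and to U+0104 under latin2Partial, which is exactly why a single fixed w2c only gives genuine code points for one encoding.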

Duncan


Re: Data.ByteString candidate 3

John Meacham
In reply to this post by Donald Bruce Stewart
On Tue, Apr 25, 2006 at 09:12:55PM +1000, Donald Bruce Stewart wrote:
> I'm quite happy with this now, the code is a lot cleaner, and hopefully
> this will appease the people who disliked the intermingling of Char and
> Word8 code, which I agree was unsatisfactory.
>
> Opinions?

Well, it is all great except it doesn't provide what we really want, a
drop in fast replacement for haskell strings :)

I'd like to see the Char and String names reserved for things that
actually can represent Chars and Strings. The internal representation
can be completely abstract, based on the ByteString data type (though I
am partial to utf8).

Not that a 'Latin1' module couldn't be provided too. But I don't think
the module should be called 'Char'.

        John


--
John Meacham - ⑆repetae.net⑆john⑈

Re: Data.ByteString candidate 3

Simon Marlow-5
In reply to this post by Duncan Coutts
Duncan Coutts wrote:

> How would we distinguish a full fixed-width 4-byte Unicode version?

Good point, and that's why using the Data.PackedString hierarchy was
nice, because it accommodated various different character widths.  I
quite like

   Data.ByteString
   Data.PackedString.Latin1
   Data.PackedString.UTF8
   Data.PackedString.UCS4
   etc.

(but this is getting a bit bikeshedish, so I'll try to resist the
temptation to comment any further :-)

Cheers,
        Simon

Re: FPS/Data.ByteString candidate

Duncan Coutts
In reply to this post by Donald Bruce Stewart
On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:

> ross:
> > On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> > > The name Latin1 is particularly bad since there are many other
> > > single byte encodings around.
> >
> > The name is quite appropriate, since that is the particular encoding of
> > Char that is exposed by the interface.  What's bad is that there's no
> > choice.  Calling it Latin1 is just being honest about that, and leaving
> > room for modules with other encodings or an interface parameterized
> > by encoding.
>
> Ok. Duncan, Ketil, Ross and Simon make good points here.
> I'll move Data.ByteString.Char -> Data.ByteString.Latin1

If you want to justify that and provide some concrete spec you can add
something like the following to the Data.ByteString.Latin1 docs:

        Manipulate ByteStrings using Char operations. All Chars will be
        truncated to 8 bits.
       
        More specifically these byte strings are taken to be in the
        subset of Unicode covered by code points 0-255. This covers
        Unicode Basic Latin, Latin-1 Supplement and C0+C1 Controls.
       
        See: http://www.unicode.org/charts/
        http://www.unicode.org/charts/PDF/U0000.pdf
        http://www.unicode.org/charts/PDF/U0080.pdf


One reason to be so specific is that other definitions of character sets
commonly called "Latin-1" omit the control characters and so do not
cover all bytes 0-255.

I think this allows us to justify reinterpreting Word8s as Chars and
getting valid Unicode code points.
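
The claim in the proposed doc text can be checked mechanically. The sketch below assumes w2c and c2w are the plain numeric conversions (an assumption about their definitions) and verifies two properties over all 256 bytes: converting a byte to a Char and back is the identity, and the resulting Char has the same code point number as the byte.

```haskell
import Data.Char (chr, ord)
import Data.Word (Word8)

c2w :: Char -> Word8
c2w = fromIntegral . ord

w2c :: Word8 -> Char
w2c = chr . fromIntegral

allBytes :: [Word8]
allBytes = [minBound .. maxBound]

-- c2w . w2c is the identity on every byte.
bytesRoundTrip :: Bool
bytesRoundTrip = all (\b -> c2w (w2c b) == b) allBytes

-- Each byte n becomes exactly code point n, i.e. stays within 0-255.
codePointsMatch :: Bool
codePointsMatch = all (\b -> ord (w2c b) == fromIntegral b) allBytes
```

Note that the other direction, w2c . c2w, is only the identity for Chars with code points 0-255, which is the truncation the doc text mentions.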

Duncan


Re: FPS/Data.ByteString candidate

Duncan Coutts
In reply to this post by Donald Bruce Stewart
On Tue, 2006-04-25 at 22:34 +1000, Donald Bruce Stewart wrote:

> ross:
> > On Tue, Apr 25, 2006 at 12:08:45PM +0300, Einar Karttunen wrote:
> > > The name Latin1 is particularly bad since there are many other
> > > single byte encodings around.
> >
> > The name is quite appropriate, since that is the particular encoding of
> > Char that is exposed by the interface.  What's bad is that there's no
> > choice.  Calling it Latin1 is just being honest about that, and leaving
> > room for modules with other encodings or an interface parameterized
> > by encoding.
>
> Ok. Duncan, Ketil, Ross and Simon make good points here.
> I'll move Data.ByteString.Char -> Data.ByteString.Latin1

Ok one final point from a discussion between me and Einar Karttunen...

(I'm mindful of Simon's comment about sheds... :-) )


There are two different common uses of an 8-bit string library with
different assumptions and guarantees. (As it happens they have the same
implementation.)

In one use case, we want to be able to guarantee that we can get Chars
out of our string and guarantee that they really are Haskell Chars. That
is that they are valid Unicode code points which we could pass to
functions like isUpper and get valid answers. As an example consider
Char 'Â' (chr 0xC2, Latin capital A with circumflex). This is not ASCII
but it is clearly upper case. If we don't know that we're working with
an 8-bit subset of Unicode then we can't use Unicode properties like
isUpper etc.
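
A short sketch of the difference, using Data.Char's isUpper (Unicode-aware) and isAsciiUpper (ASCII only). The wrapper names isUpperLatin1 and isUpperAscii are made up for this example, and w2c is assumed to be the plain numeric conversion.

```haskell
import Data.Char (chr, isAsciiUpper, isUpper)
import Data.Word (Word8)

w2c :: Word8 -> Char
w2c = chr . fromIntegral

-- Valid only if the bytes really are Latin-1, so that w2c yields
-- genuine Unicode code points.
isUpperLatin1 :: Word8 -> Bool
isUpperLatin1 = isUpper . w2c

-- Safe for any ASCII-compatible encoding, but blind above 127.
isUpperAscii :: Word8 -> Bool
isUpperAscii = isAsciiUpper . w2c
```

For byte 0xC2, isUpperLatin1 answers True and isUpperAscii answers False; both agree on 0x41 ('A').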

Then the other common use case is where we have some character string
encoding which contains ASCII as a subset. That is, we don't know the
encoding exactly (it may be Latin1, LatinN, UTF8, etc) but we do know
that ASCII chars 0-127 are represented by those same numbers in our byte
stream. One example where this is useful is parsing network protocols.
Several of these use 8-bit extensions of ASCII,
but the protocol only gives semantics to chars in the ASCII subset. For
this case it would be very inconvenient to have to use an API based just
on Word8, but on the other hand we can't give a proper guarantee on being
able to turn bytes into Haskell Chars (only for bytes < 128).
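
As a sketch of this second use case, here is a header-line splitter in the style of a text-based network protocol. It is written against lists of Word8 and gives meaning only to ASCII bytes; the names toLowerAscii and splitHeader are hypothetical and the "Name: value" format is just an example.

```haskell
import Data.Word (Word8)

-- Lower-case ASCII letters only; every other byte passes through.
toLowerAscii :: Word8 -> Word8
toLowerAscii b
  | b >= 0x41 && b <= 0x5A = b + 0x20
  | otherwise              = b

-- Split "Name: value" at the first colon (0x3A), normalising the name
-- and skipping spaces (0x20) before the value. The value bytes are
-- returned untouched, whatever 8-bit encoding they are in.
splitHeader :: [Word8] -> Maybe ([Word8], [Word8])
splitHeader bs = case break (== 0x3A) bs of
  (_, [])          -> Nothing
  (name, _ : rest) -> Just (map toLowerAscii name, dropWhile (== 0x20) rest)
```

This works, but the hex constants show why a Char-flavoured interface over the same bytes is so much more pleasant to write against.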

So what do we do about this?

Einar was thinking about an API that might look like this:
Data.ByteString.{Char8, Latin1, Latin2, ..., UTF8, ...}

Char8 should provide:
* little overhead
* For ascii characters the right translation
* c2w . w2c = id
* toUpper and toLower on Ascii
* Ord with raw byte values

Latin1 should guarantee:
* Correct translation for Latin1, C0 and C1 characters
* Really just a subset of unicode for character handling
* Predicates like isUpper and isLower
* toUpper and toLower per Unicode definition
  (there is no common latin1 definition afaik)
* Ord per UCA (unicode collation algorithm)
* Or use locale for toUpper/toLower and Ord.

So basically the .Char8 module is for the ASCII extension case and
the .Latin1 is for the 8-bit Unicode subset case.
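
To show how the two flavours would differ in practice, here is a sketch of upper-casing under each set of rules. The function names are hypothetical. One wrinkle worth noticing: Unicode case mapping is not closed over code points 0-255 (the uppercase form of U+00FF is U+0178), so the Latin1 version below returns a Maybe rather than pretending every result fits in a byte.

```haskell
import Data.Char (chr, ord, toUpper)
import Data.Word (Word8)

-- Char8 flavour: only ASCII letters change, all other bytes are kept.
toUpperChar8 :: Word8 -> Word8
toUpperChar8 b
  | b >= 0x61 && b <= 0x7A = b - 0x20
  | otherwise              = b

-- Latin1 flavour: use the Unicode definition, and report Nothing when
-- the result no longer fits in a single byte.
toUpperLatin1 :: Word8 -> Maybe Word8
toUpperLatin1 b
  | u < 256   = Just (fromIntegral u)
  | otherwise = Nothing
  where
    u = ord (toUpper (chr (fromIntegral b)))
```

For 0xE9 (e with acute accent) the Char8 version leaves the byte alone, while the Latin1 version gives 0xC9.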

I think in fact that darcs would want the .Char8 version but I expect
that many other users will want a library that can guarantee conversions
to ordinary Haskell Chars (which involves an assumption on the character
encoding).

Duncan


Re[2]: Data.ByteString candidate 3

Bulat Ziganshin-2
In reply to this post by Simon Marlow-5
Hello Simon,

Tuesday, April 25, 2006, 5:34:20 PM, you wrote:

> Good point, and that's why using the Data.PackedString hierarchy was
> nice, because it accommodated various different character widths.  I
> quite like

>    Data.ByteString
>    Data.PackedString.Latin1
>    Data.PackedString.UTF8
>    Data.PackedString.UCS4
>    etc.

I think these module names are great: the first works with just Word8,
while the Data.PackedString.* modules work with different Char
representations.


--
Best regards,
 Bulat


Re: Data.ByteString candidate 3

John Meacham
In reply to this post by Simon Marlow-5
On Tue, Apr 25, 2006 at 02:34:20PM +0100, Simon Marlow wrote:

> Duncan Coutts wrote:
>
> >How would we distinguish a full fixed-width 4-byte Unicode version?
>
> Good point, and that's why using the Data.PackedString hierarchy was
> nice, because it accommodated various different character widths.  I
> quite like
>
>   Data.ByteString
>   Data.PackedString.Latin1
>   Data.PackedString.UTF8
>   Data.PackedString.UCS4
>   etc.

Do we really need all of these? UCS4BE? UTF16? If you care intimately
about the underlying binary representation, then you should be using
ByteString directly, since you are working with binary data. If you just
want a fast string replacement, then you don't care about the internal
representation, you just want it to be fast.

We don't want issues where someone's library takes UTF8 strings but
someone else's takes UCS4 strings and you want them to play nice
together.

I think all we really need are

Data.ByteString
Data.PackedString

(Though, I suppose Latin1 could be useful)

but note, do the people that want latin1 just need ASCII? because it should be
noted that if we have a UTF8 PackedString, then we can make
ASCII-specific access routines that are just as fast as the ones in the
Latin1 variety without giving up the ability to store full unicode
values in the string.
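
The property behind this claim is that in UTF-8 every byte below 0x80 is an ASCII character in its own right, and every byte of a multi-byte sequence is 0x80 or above. So looking for an ASCII character in UTF-8 data can be a plain byte scan. The sketch below uses a small hand-written encoder (encodeCharUtf8 and encodeUtf8 are names invented here, not a library API, and surrogate code points are not rejected) to make that visible.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (ord)
import Data.Word (Word8)

-- Encode one Char as UTF-8 bytes.
encodeCharUtf8 :: Char -> [Word8]
encodeCharUtf8 c
  | n < 0x80    = [fromIntegral n]
  | n < 0x800   = [0xC0 .|. hi 6, cont 0]
  | n < 0x10000 = [0xE0 .|. hi 12, cont 6, cont 0]
  | otherwise   = [0xF0 .|. hi 18, cont 12, cont 6, cont 0]
  where
    n = ord c
    -- Leading payload bits for the first byte.
    hi s = fromIntegral (n `shiftR` s)
    -- A continuation byte: 10xxxxxx with six payload bits.
    cont s = 0x80 .|. (fromIntegral (n `shiftR` s) .&. 0x3F)

encodeUtf8 :: String -> [Word8]
encodeUtf8 = concatMap encodeCharUtf8

-- Searching for an ASCII byte needs no decoding at all.
hasAsciiByte :: Word8 -> [Word8] -> Bool
hasAsciiByte = elem
```

For instance, U+00E9 followed by '/' encodes to 0xC3 0xA9 0x2F, and scanning for 0x2F finds the slash without any risk of matching inside the two-byte sequence.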

        John



--
John Meacham - ⑆repetae.net⑆john⑈

Re: Data.ByteString candidate 3

Einar Karttunen
On 25.04 13:46, John Meacham wrote:
> I think all we really need are
>
> Data.ByteString
> Data.PackedString
>
> (Though, I suppose Latin1 could be useful)

Using the Word8 API is not very pleasant, because character
constants etc. are not Word8.

As for Latin1 - what semantics do we use for toUpper/toLower and Ord?
Using the unicode ones or locale seems the sensible thing if the data
really is Latin1.

Thus a simple wrapper to the Word8 api is desirable. Make it follow
a few simple rules:
* c2w . w2c = id  (conversion is a bijection)
* ascii characters translated correctly
* toLower/toUpper for ascii
* Ord by byte values.

This is very useful for many purposes and does not mean that there
should not be a fancy UTF8 module. Rather than arguing about killing
this, wouldn't it be more productive to create the UTF8 module?

> but note, do the people that want latin1 just need ASCII? because it should be
> noted that if we have a UTF8 PackedString, then we can make
> ASCII-specific access routines that are just as fast as the ones in the
> Latin1 variety without giving up the ability to store full unicode
> values in the string.

Case conversions and ordering need to be different. Thus we need to newtype
things to avoid having two conflicting Ord instances. The UTF8 layer
should provide:

* Unicode toUpper/toLower
* Unicode collation (UCA) for Ord
* Graphemes (see Perl6 for good ways to do this)
* Normalisation
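
The newtype point can be sketched as follows. A type can have only one Ord instance, so two orderings over the same bytes need two types. A real UTF8 layer would use Unicode collation; implementing UCA is far beyond a sketch, so ASCII case-insensitive ordering is used here purely as a stand-in for "some ordering other than raw bytes". The type names RawBytes and CaselessAscii are hypothetical.

```haskell
import Data.Word (Word8)

-- Ordered by raw byte values, as the simple wrapper would be.
newtype RawBytes = RawBytes [Word8]
  deriving (Eq, Ord)

-- Same representation, different ordering (stand-in for collation).
newtype CaselessAscii = CaselessAscii [Word8]

-- Comparison key: ASCII letters folded to lower case.
foldKey :: CaselessAscii -> [Word8]
foldKey (CaselessAscii bs) = map fold bs
  where
    fold b
      | b >= 0x41 && b <= 0x5A = b + 0x20
      | otherwise              = b

instance Eq CaselessAscii where
  a == b = foldKey a == foldKey b

instance Ord CaselessAscii where
  compare a b = compare (foldKey a) (foldKey b)
```

Under RawBytes, "B" sorts before "a" because 0x42 is less than 0x61; under CaselessAscii it sorts after. Keeping the two as distinct types stops the instances from conflicting.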

- Einar Karttunen