On 9/28/07, ok <[hidden email]> wrote:
> Now there's a paper that was mentioned about a month ago in this
> mailing list which basically dealt with that by splitting each type
> into two: roughly speaking a bit that expresses the recursion and
> a bit that expresses the choice structure.

Would you like to give a link to that paper?

(the following is a bit off-topic)

In their 1995 paper "Bananas in Space: Extending Fold and Unfold to
Exponential Types" [1], Erik Meijer and Graham Hutton showed an
interesting technique. Your ADT:

  data Expr env = Variable (Var env)
                | Constant Int
                | Unary String (Expr env)
                | Binary String (Expr env) (Expr env)

can be written without recursion by using a fixpoint newtype
combinator (not sure if this is the right name for it):

  newtype Rec f = In { out :: f (Rec f) }

  data Var env = Var env String

  data E env e = Variable (Var env)
               | Constant Int
               | Unary String e
               | Binary String e e

  type Expr env = Rec (E env)

  example = In (Binary "+" (In (Constant 1)) (In (Constant 2)))

You can see that you don't have to name the recursive 'Expr env'
explicitly. However, constructing an 'Expr' is a bit verbose because
of the 'In' newtype constructors.

regards,

Bas van Dijk

[1] http://citeseer.ist.psu.edu/293490.html
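The payoff of factoring the recursion out is that recursion schemes
can then be defined once, generically, for any such functor. A minimal
sketch of the generic fold; the Functor instance and the 'eval'
algebra below are illustrative additions, not from the paper:

  cata :: Functor f => (f a -> a) -> Rec f -> a
  cata phi = phi . fmap (cata phi) . out

  instance Functor (E env) where
    fmap _ (Variable v)    = Variable v
    fmap _ (Constant i)    = Constant i
    fmap f (Unary op e)    = Unary op (f e)
    fmap f (Binary op l r) = Binary op (f l) (f r)

  -- A hypothetical evaluator for constant expressions,
  -- written as an algebra for 'cata'.
  eval :: Expr env -> Int
  eval = cata phi
    where
      phi (Constant i)     = i
      phi (Unary "-" x)    = negate x
      phi (Binary "+" x y) = x + y
      phi (Binary "*" x y) = x * y
      phi _                = error "eval: unsupported expression"

  -- eval example == 3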
Sorry for the long delay, work has been really busy...
On Sep 27, 2007, at 12:25 PM, Aaron Denney wrote:

> On 2007-09-27, Aaron Denney <[hidden email]> wrote:
>>> Well, not so much. As Duncan mentioned, it's a matter of what the
>>> most common case is. UTF-16 is effectively fixed-width for the
>>> majority of text in the majority of languages. Combining sequences
>>> and surrogate pairs are relatively infrequent.
>>
>> Infrequent, but they exist, which means you can't seek x/2 bytes
>> ahead to seek x characters ahead. All such seeking must be linear
>> for both UTF-16 *and* UTF-8.
>>
>>> Speaking as someone who has done a lot of Unicode implementation,
>>> I would say UTF-16 represents the best time/space tradeoff for an
>>> internal representation. As I mentioned, it's what's used in
>>> Windows, Mac OS X, ICU, and Java.
>
> I guess why I'm being something of a pain-in-the-ass here, is that
> I want to use your Unicode implementation expertise to know what
> these time/space tradeoffs are.
>
> Are there any algorithmic asymptotic complexity differences, or are
> these all constant factors? The constant factors depend on projected
> workload. And are these actually tradeoffs, except between UTF-32
> (which uses native word sizes on 32-bit platforms) and the other
> two? Smaller space means smaller cache footprint, which can
> dominate.

Yes, cache footprint is one reason to use UTF-16 rather than UTF-32.
Having no surrogate pairs also doesn't save you anything, because you
need to handle sequences anyway, such as combining marks and clusters.

The best reference for all of this is:

http://www.unicode.org/faq/utf_bom.html

See especially:

http://www.unicode.org/faq/utf_bom.html#10
http://www.unicode.org/faq/utf_bom.html#12

Which data type is best depends on what the purpose is. If the data
will primarily be ASCII with occasional non-ASCII characters, UTF-8
may be best. If the data is general Unicode text, UTF-16 is best. I
would think a Unicode string type would be intended for processing
natural language text, not just ASCII data.

> Simplicity of algorithms is also a concern. Validating a byte
> sequence as UTF-8 is harder than validating a sequence of 16-bit
> values as UTF-16.
>
> (I'd also like to see a reference to the Mac OS X encoding. I know
> that the filesystem interface is UTF-8 (decomposed a certain way).
> Is it just that UTF-16 is a common application choice, or is there
> some common framework or library that uses that?)

UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
and is what appears in the APIs for all of them. UTF-16 is also what's
stored in the volume catalog on Mac disks. UTF-8 is only used in BSD
APIs for backward compatibility. It's also used in plain text files
(or XML or HTML), again for compatibility.

Deborah
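To make "effectively fixed-width" concrete: a UTF-16 reader takes a
one-unit fast path for every code unit outside the surrogate range
0xD800-0xDFFF, and pairs up only the units inside it. A minimal
decoding sketch (the names and list-based types are illustrative, not
from any particular library):

  import Data.Bits (shiftL, (.&.))
  import Data.Char (chr)
  import Data.Word (Word16)

  -- Decode one code point from a stream of UTF-16 code units,
  -- returning it together with the remaining units.
  decodeUtf16 :: [Word16] -> Maybe (Char, [Word16])
  decodeUtf16 [] = Nothing
  decodeUtf16 (u:us)
    | u < 0xD800 || u > 0xDFFF =                 -- common case: one unit
        Just (chr (fromIntegral u), us)
    | u <= 0xDBFF, (l:us') <- us,
      l >= 0xDC00 && l <= 0xDFFF =               -- surrogate pair
        let hi = fromIntegral (u .&. 0x3FF)
            lo = fromIntegral (l .&. 0x3FF)
        in Just (chr (0x10000 + (hi `shiftL` 10) + lo), us')
    | otherwise = Nothing                        -- unpaired surrogate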
Deborah Goldsmith wrote:
> UTF-16 is the native encoding used for Cocoa, Java, ICU, and Carbon,
> and is what appears in the APIs for all of them. UTF-16 is also
> what's stored in the volume catalog on Mac disks. UTF-8 is only used
> in BSD APIs for backward compatibility. It's also used in plain text
> files (or XML or HTML), again for compatibility.
>
> Deborah

On OS X, Cocoa and Carbon use Core Foundation, whose API does not have
a one-true-encoding internally. Follow the rather long URL for
details:

http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179

I would vote for an API that not only hides the internal store, but
allows different internal stores to be used in a mostly compatible
way.

However, there is a UniChar typedef on OS X, which is the same
unsigned 16-bit integer as Java's JNI would use.

--
Chris
On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
> Deborah Goldsmith wrote:
>> UTF-16 is the native encoding used for Cocoa, Java, ICU, and
>> Carbon, and is what appears in the APIs for all of them. UTF-16 is
>> also what's stored in the volume catalog on Mac disks. UTF-8 is
>> only used in BSD APIs for backward compatibility. It's also used in
>> plain text files (or XML or HTML), again for compatibility.
>>
>> Deborah
>
> On OS X, Cocoa and Carbon use Core Foundation, whose API does not
> have a one-true-encoding internally. Follow the rather long URL for
> details:
>
> http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
>
> I would vote for an API that not only hides the internal store, but
> allows different internal stores to be used in a mostly compatible
> way.
>
> However, there is a UniChar typedef on OS X, which is the same
> unsigned 16-bit integer as Java's JNI would use.

UTF-16 is the type used in all the APIs. Everything else is considered
an encoding conversion.

CoreFoundation uses UTF-16 internally except when the string fits
entirely in a single-byte legacy encoding like MacRoman or
MacCyrillic. If any kind of Unicode processing needs to be done to the
string, it is first coerced to UTF-16. If it weren't for backwards
compatibility issues, I think we'd use UTF-16 all the time, as the
machinery for switching encodings adds complexity. I wouldn't advise
it for a new library.

Deborah
On Tue, 2007-10-02 at 08:02 -0700, Deborah Goldsmith wrote:
> On Oct 2, 2007, at 5:11 AM, ChrisK wrote:
>> Deborah Goldsmith wrote:
>>> UTF-16 is the native encoding used for Cocoa, Java, ICU, and
>>> Carbon, and is what appears in the APIs for all of them. UTF-16 is
>>> also what's stored in the volume catalog on Mac disks. UTF-8 is
>>> only used in BSD APIs for backward compatibility. It's also used
>>> in plain text files (or XML or HTML), again for compatibility.
>>>
>>> Deborah
>>
>> On OS X, Cocoa and Carbon use Core Foundation, whose API does not
>> have a one-true-encoding internally. Follow the rather long URL for
>> details:
>>
>> http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/index.html?http://developer.apple.com/documentation/CoreFoundation/Conceptual/CFStrings/Articles/StringStorage.html#//apple_ref/doc/uid/20001179
>>
>> I would vote for an API that not only hides the internal store, but
>> allows different internal stores to be used in a mostly compatible
>> way.
>>
>> However, there is a UniChar typedef on OS X, which is the same
>> unsigned 16-bit integer as Java's JNI would use.
>
> UTF-16 is the type used in all the APIs. Everything else is
> considered an encoding conversion.
>
> CoreFoundation uses UTF-16 internally except when the string fits
> entirely in a single-byte legacy encoding like MacRoman or
> MacCyrillic. If any kind of Unicode processing needs to be done to
> the string, it is first coerced to UTF-16. If it weren't for
> backwards compatibility issues, I think we'd use UTF-16 all the
> time, as the machinery for switching encodings adds complexity. I
> wouldn't advise it for a new library.

I would like to, again, strongly argue against sacrificing
compatibility with Linux/BSD/etc. for the sake of compatibility with
OS X or Windows. FFI bindings have to convert data formats in any
case; Haskell shouldn't gratuitously break Linux support (or make life
harder on Linux) just to support proprietary operating systems better.

Now, if /independent of the details of MacOS X/, UTF-16 is better
(objectively), it can be converted to anything by the FFI. But doing
it the way Java or MacOS X or Win32 or anyone else does it, at the
expense of Linux, I am strongly opposed to.

jcc
> I would like to, again, strongly argue against sacrificing
> compatibility with Linux/BSD/etc. for the sake of compatibility with
> OS X or Windows.

Ehm? I used to think MacOS is a sort of BSD...
On Tue, 2007-10-02 at 22:05 +0400, Miguel Mitrofanov wrote:
>> I would like to, again, strongly argue against sacrificing
>> compatibility with Linux/BSD/etc. for the sake of compatibility
>> with OS X or Windows.
>
> Ehm? I used to think MacOS is a sort of BSD...

Cocoa, then.

jcc
On Tue, Oct 02, 2007 at 08:02:30AM -0700, Deborah Goldsmith wrote:
> UTF-16 is the type used in all the APIs. Everything else is
> considered an encoding conversion.
>
> CoreFoundation uses UTF-16 internally except when the string fits
> entirely in a single-byte legacy encoding like MacRoman or
> MacCyrillic. If any kind of Unicode processing needs to be done to
> the string, it is first coerced to UTF-16. If it weren't for
> backwards compatibility issues, I think we'd use UTF-16 all the
> time, as the machinery for switching encodings adds complexity. I
> wouldn't advise it for a new library.

I do not believe that anyone was seriously advocating multiple blessed
encodings. The main question is *which* encoding to bless. 99+% of
text I encounter is in US-ASCII, so I would favor UTF-8. Why is UTF-16
better for me?

Stefan
> I do not believe that anyone was seriously advocating multiple
> blessed encodings. The main question is *which* encoding to bless.
> 99+% of text I encounter is in US-ASCII, so I would favor UTF-8. Why
> is UTF-16 better for me?

All the software I write professionally has to support 40 languages
(including CJK ones), so I would prefer UTF-16 in case I could use
Haskell at work some day in the future. I'm not sure that who uses
which encoding the most is good grounds for picking an encoding,
though. Ease of implementation and speed on some representative sample
set of text may be.

-- Johan
On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
>> I do not believe that anyone was seriously advocating multiple
>> blessed encodings. The main question is *which* encoding to bless.
>> 99+% of text I encounter is in US-ASCII, so I would favor UTF-8.
>> Why is UTF-16 better for me?
>
> All the software I write professionally has to support 40 languages
> (including CJK ones), so I would prefer UTF-16 in case I could use
> Haskell at work some day in the future. I'm not sure that who uses
> which encoding the most is good grounds for picking an encoding,
> though. Ease of implementation and speed on some representative
> sample set of text may be.

UTF-8 supports CJK languages too. The only question is efficiency, and
I believe CJK is still a relatively uncommon case compared to English
and other Latin-alphabet languages. (That said, I live in a country
all of whose dominant languages use the Latin alphabet.)

Stefan
Lots of people wrote:
> I want a UTF-8 bikeshed!
> No, I want a UTF-16 bikeshed!

What the heck does it matter what encoding the library uses
internally? I expect the interface to be something like (from my own
CompactString library):

> fromByteString :: Encoding -> ByteString -> UnicodeString
> toByteString :: Encoding -> UnicodeString -> ByteString

The only thing that matters is efficiency for a particular encoding.

I would suggest that we get a working library first. Either UTF-8 or
UTF-16 will do, as long as it works. Even better would be to implement
both (and perhaps more encodings), and then benchmark them to get a
sensible default. Then the choice can be made available to the user as
well, in case someone has specific needs. But again: get it working
first!

Twan
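To make the encoding-parameterized interface concrete, here is a toy
version with only a single, easy encoding implemented. This is purely
a sketch, not the CompactString API; a real library would use a packed
store rather than String, and more encodings than UTF32BE:

  import qualified Data.ByteString as B
  import Data.Bits (shiftL, shiftR, (.&.), (.|.))
  import Data.Char (chr, ord)

  data Encoding = UTF32BE            -- a real library would offer more

  newtype UnicodeString = US String  -- internal store, hidden

  fromByteString :: Encoding -> B.ByteString -> UnicodeString
  fromByteString UTF32BE bs = US (go (B.unpack bs))
    where
      go (a:b:c:d:rest) =
        chr (    fromIntegral a `shiftL` 24
             .|. fromIntegral b `shiftL` 16
             .|. fromIntegral c `shiftL` 8
             .|. fromIntegral d ) : go rest
      go _ = []

  toByteString :: Encoding -> UnicodeString -> B.ByteString
  toByteString UTF32BE (US cs) = B.pack (concatMap quad cs)
    where
      -- four big-endian bytes per code point
      quad ch = [ fromIntegral ((ord ch `shiftR` s) .&. 0xFF)
                | s <- [24, 16, 8, 0] ]

Because the store is hidden behind the newtype, it could later be
swapped for packed UTF-8 or UTF-16 arrays without changing either
signature.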
On Oct 2, 2007, at 8:44 AM, Jonathan Cast wrote:
> I would like to, again, strongly argue against sacrificing
> compatibility with Linux/BSD/etc. for the sake of compatibility with
> OS X or Windows. FFI bindings have to convert data formats in any
> case; Haskell shouldn't gratuitously break Linux support (or make
> life harder on Linux) just to support proprietary operating systems
> better.
>
> Now, if /independent of the details of MacOS X/, UTF-16 is better
> (objectively), it can be converted to anything by the FFI. But doing
> it the way Java or MacOS X or Win32 or anyone else does it, at the
> expense of Linux, I am strongly opposed to.

No one is advocating that. Any Unicode support library needs to
support exporting text as UTF-8 since it's so widely used. It's used
on Mac OS X, too, in exactly the same contexts it would be used on
Linux. However, UTF-8 is a poor choice for internal representation.

On Oct 2, 2007, at 2:32 PM, Stefan O'Rear wrote:

> UTF-8 supports CJK languages too. The only question is efficiency,
> and I believe CJK is still a relatively uncommon case compared to
> English and other Latin-alphabet languages. (That said, I live in a
> country all of whose dominant languages use the Latin alphabet.)

First of all, non-Latin countries already represent a large fraction
of computer usage and the computer market. It is not at all
"relatively uncommon." Japan alone is a huge market. China is a huge
market. Second, it's not just CJK, but anything that's not mostly
ASCII: Russian, Greek, Thai, Arabic, Hebrew, etc.

UTF-8 is intended for compatibility with existing software that
expects multibyte encodings. It doesn't work well as an internal
representation. Again, no one is saying a Unicode library shouldn't
have full support for input and output of UTF-8 (and other encodings).

If you want to process ASCII text and squeeze out every last ounce of
performance, use byte strings. Unicode strings should be optimized for
representing and processing human language text, a large share of
which is not in the Latin alphabet. Remember, speakers of English and
other Latin-alphabet languages are a minority in the world, though not
in the computer-using world. Yet.

Deborah
On Oct 2, 2007, at 3:01 PM, Twan van Laarhoven wrote:
> Lots of people wrote:
>> I want a UTF-8 bikeshed!
>> No, I want a UTF-16 bikeshed!
>
> What the heck does it matter what encoding the library uses
> internally? I expect the interface to be something like (from my own
> CompactString library):
>
>> fromByteString :: Encoding -> ByteString -> UnicodeString
>> toByteString :: Encoding -> UnicodeString -> ByteString

I agree, from an API perspective the internal encoding doesn't matter.

> The only thing that matters is efficiency for a particular encoding.

This matters a lot.

> I would suggest that we get a working library first. Either UTF-8 or
> UTF-16 will do, as long as it works.
>
> Even better would be to implement both (and perhaps more encodings),
> and then benchmark them to get a sensible default. Then the choice
> can be made available to the user as well, in case someone has
> specific needs. But again: get it working first!

The problem is that the internal encoding can have a big effect on the
implementation of the library. It's better not to have to do it over
again if the first choice is not optimal.

I'm just trying to share the experience of the Unicode Consortium, the
ICU library contributors, and Apple, with the Haskell community. They,
and I personally, have many years of experience implementing support
for Unicode.

Anyway, I think we're starting to repeat ourselves...

Deborah
On Wed, 2007-10-03 at 00:01 +0200, Twan van Laarhoven wrote:
> Lots of people wrote:
>> I want a UTF-8 bikeshed!
>> No, I want a UTF-16 bikeshed!
>
> What the heck does it matter what encoding the library uses
> internally?

+1

jcc
Stefan O'Rear wrote:
> On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
>>> I do not believe that anyone was seriously advocating multiple
>>> blessed encodings. The main question is *which* encoding to bless.
>>> 99+% of text I encounter is in US-ASCII, so I would favor UTF-8.
>>> Why is UTF-16 better for me?
>> All the software I write professionally has to support 40 languages
>> (including CJK ones), so I would prefer UTF-16 in case I could use
>> Haskell at work some day in the future. I'm not sure that who uses
>> which encoding the most is good grounds for picking an encoding,
>> though. Ease of implementation and speed on some representative
>> sample set of text may be.
>
> UTF-8 supports CJK languages too. The only question is efficiency

Due to the additional complexity of handling UTF-8 -- EVEN IF the
actual text processed happens all to be US-ASCII -- will UTF-8 perhaps
be less efficient than UTF-16, or only as fast?

Isaac
On Oct 2, 2007, at 21:12 , Isaac Dupree wrote:

> Stefan O'Rear wrote:
>> On Tue, Oct 02, 2007 at 11:05:38PM +0200, Johan Tibell wrote:
>>>> I do not believe that anyone was seriously advocating multiple
>>>> blessed encodings. The main question is *which* encoding to
>>>> bless. 99+% of text I encounter is in US-ASCII, so I would favor
>>>> UTF-8. Why is UTF-16 better for me?
>>> All the software I write professionally has to support 40
>>> languages (including CJK ones), so I would prefer UTF-16 in case I
>>> could use Haskell at work some day in the future. I'm not sure
>>> that who uses which encoding the most is good grounds for picking
>>> an encoding, though. Ease of implementation and speed on some
>>> representative sample set of text may be.
>> UTF-8 supports CJK languages too. The only question is efficiency
>
> Due to the additional complexity of handling UTF-8 -- EVEN IF the
> actual text processed happens all to be US-ASCII -- will UTF-8
> perhaps be less efficient than UTF-16, or only as fast?

UTF8 will be very slightly faster in the all-ASCII case, but quickly
blows chunks if you have *any* characters that require multibyte
encoding. Given the way UTF8 encoding works, this includes even
Latin-1 non-ASCII, never mind CJK. (I think people have been missing
that point. UTF8 is only cheap for 00-7F, *nothing else*.)

--
brandon s. allbery [solaris,freebsd,perl,pugs,haskell] [hidden email]
system administrator [openafs,heimdal,too many hats] [hidden email]
electrical and computer engineering, carnegie mellon university KF8NH
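For concreteness, the 00-7F cutoff follows directly from the standard
UTF-8 ranges; a quick sketch (the function name is made up for
illustration):

  import Data.Char (ord)

  -- Bytes UTF-8 spends on one code point.
  utf8Width :: Char -> Int
  utf8Width c
    | n <= 0x7F   = 1  -- US-ASCII only
    | n <= 0x7FF  = 2  -- Latin-1 non-ASCII, Greek, Cyrillic, Hebrew, Arabic
    | n <= 0xFFFF = 3  -- most CJK, among much else
    | otherwise   = 4  -- supplementary planes
    where n = ord c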
On Tue, 2007-10-02 at 14:32 -0700, Stefan O'Rear wrote:
> UTF-8 supports CJK languages too. The only question is efficiency,
> and I believe CJK is still a relatively uncommon case compared to
> English and other Latin-alphabet languages. (That said, I live in a
> country all of whose dominant languages use the Latin alphabet.)

As for space efficiency, I guess the argument could be made that since
an ideogram typically conveys a whole word, it is reasonable to spend
more bits on it.

Anyway, I am unsure if I should take part in this discussion, as I'm
not really dealing with text as such in multiple languages. Most of my
data is in ASCII, and when it is not, I'm happy to treat it ("treat"
here meaning "mostly ignore") as Latin-1 bytes (current ByteString) or
UTF-8. The only thing I miss is the ability to use String syntactic
sugar -- but IIUC, that's coming?

However, increased space usage is not acceptable, and I also don't
want any conversion layer which could conceivably modify my data (e.g.
by normalizing or error handling).

-k
On Tue, 2007-10-02 at 21:45 -0400, Brandon S. Allbery KF8NH wrote:
>> Due to the additional complexity of handling UTF-8 -- EVEN IF the
>> actual text processed happens all to be US-ASCII -- will UTF-8
>> perhaps be less efficient than UTF-16, or only as fast?
>
> UTF8 will be very slightly faster in the all-ASCII case, but quickly
> blows chunks if you have *any* characters that require multibyte
> encoding.

What benchmarks are you basing this on? Doubling your data size is
going to cost you if you are doing simple operations (searching, say),
but I don't see UTF-8 being particularly expensive -- somebody (I
forget who) implemented UTF-8 on top of ByteString, and IIRC, the
benchmark numbers didn't change all that much from the regular Char8.

-k
On Wed, Oct 03, 2007 at 12:01:50AM +0200,
Twan van Laarhoven <[hidden email]> wrote a message of 24 lines which
said:

> Lots of people wrote:
>> I want a UTF-8 bikeshed!
>> No, I want a UTF-16 bikeshed!

Personally, I want a UTF-32 bikeshed. UTF-16 is as lousy as UTF-8 (for
both of them, characters have different sizes, unlike what happens in
UTF-32).

> What the heck does it matter what encoding the library uses
> internally?

+1 It can even use a non-standard encoding scheme if it wants.
>> What the heck does it matter what encoding the library uses
>> internally?
>
> +1 It can even use a non-standard encoding scheme if it wants.

Sounds good to me. I think one of my initial questions was whether the
encoding should be visible in the type of UnicodeString or not. My gut
feeling is that having the encoding visible in the type might make it
hard to change the internal representation, but I haven't yet got a
good example to prove this.

-- Johan
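For what it's worth, the worry can be made concrete with a phantom
type parameter. A hypothetical sketch, not any existing library:

  {-# LANGUAGE EmptyDataDecls #-}

  import qualified Data.ByteString as B

  -- Hypothetical encoding tags.
  data UTF8
  data UTF16

  -- Encoding visible in the type: every consumer's signature mentions
  -- the tag, so switching the library's internal representation from
  -- UTF8 to UTF16 changes the types of all its callers.
  newtype TaggedString enc = TaggedString B.ByteString

  -- Encoding hidden: the internal store can change without touching
  -- any consumer's type.
  newtype UnicodeString = UnicodeString B.ByteString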