On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote:
> On 2007-09-27, Ross Paterson <[hidden email]> wrote: > > Combining characters are not an issue here, just the surrogate pairs, > > because we're discussing representations of sequences of Chars (Unicode > > code points). > > You'll never want to combine combining characters or vice-versa? Never > want to figure out how much screen space a sequence will take? It _is_ > an issue. It's an issue for a higher layer, not for a compact String representation. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Aaron Denney
On Thu, Sep 27, 2007 at 06:39:24AM +0000, Aaron Denney wrote:
> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote: > > Well, not so much. As Duncan mentioned, it's a matter of what the most > > common case is. UTF-16 is effectively fixed-width for the majority of > > text in the majority of languages. Combining sequences and surrogate > > pairs are relatively infrequent. > > Infrequent, but they exist, which means you can't seek x/2 bytes ahead > to seek x characters ahead. All such seeking must be linear for both > UTF-16 *and* UTF-8. You could get rapid seeks by ignoring the UTFs and representing strings as sequences of chunks, where each chunk is uniformly 8-bit, 16-bit or 32-bit as required to cover the characters it contains. Hardly anyone would need 32-bit chunks (and some of us would need only the 8-bit ones). _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Richard A. O'Keefe
On 9/27/07, ok <[hidden email]> wrote:
> (What the heck _is_ Tangut, anyway?) http://en.wikipedia.org/wiki/Tangut_language Juanma _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Ross Paterson
On 2007-09-27, Ross Paterson <[hidden email]> wrote:
> On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote: >> On 2007-09-27, Ross Paterson <[hidden email]> wrote: >> > Combining characters are not an issue here, just the surrogate pairs, >> > because we're discussing representations of sequences of Chars (Unicode >> > code points). >> >> You'll never want to combine combining characters or vice-versa? Never >> want to figure out how much screen space a sequence will take? It _is_ >> an issue. > > It's an issue for a higher layer, not for a compact String representation. Yes, and no. It's not something the lower layer should be doing, but enabling the higher layers to do so efficiently is a concern. -- Aaron Denney -><- _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Richard A. O'Keefe
On 9/27/07, ok <[hidden email]> wrote:
> I have often found myself wishing for a small extension to the syntax of > Haskell 'data' declarations. It goes like this: ['where' clause to allow locally defined names in type declarations] Nice. Quite a few times I've found myself declaring type synonyms for this reason, but you end up polluting the global namespace. +1 vote. -- Thomas Conway [hidden email] Silence is the perfectest herald of joy: I were but little happy, if I could say how much. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Thomas Conway writes: > On 9/27/07, ok <[hidden email]> wrote: >> I have often found myself wishing for a small extension to the syntax of >> Haskell 'data' declarations. It goes like this: > ['where' clause to allow locally defined names in type declarations] > > Nice. > > Quite a few times I've found myself declaring type synonyms for this > reason, but you end up polluting the global namespace. > > +1 vote. Data with where? You haven't heard about GADTs? http://en.wikibooks.org/wiki/Haskell/GADT http://www.haskell.org/haskellwiki/Generalised_algebraic_datatype Jerzy Karczmarczuk _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Aaron Denney
In message <[hidden email]> [hidden email] writes:
> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote: > > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote: > >>> UTF-16 has no advantage over UTF-8 in this respect, because of > >>> surrogate > >>> pairs and combining characters. > >> > >> Good point. > > > > Well, not so much. As Duncan mentioned, it's a matter of what the most > > common case is. UTF-16 is effectively fixed-width for the majority of > > text in the majority of languages. Combining sequences and surrogate > > pairs are relatively infrequent. > > Infrequent, but they exist, which means you can't seek x/2 bytes ahead > to seek x characters ahead. All such seeking must be linear for both > UTF-16 *and* UTF-8. And in [Char] for all these years, yet I don't hear people complaining. Most string processing is linear and does not need random access to characters. Duncan _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
2007/9/27, Duncan Coutts <[hidden email]>:
> > Infrequent, but they exist, which means you can't seek x/2 bytes ahead > > to seek x characters ahead. All such seeking must be linear for both > > UTF-16 *and* UTF-8. > > And in [Char] for all these years, yet I don't hear people complaining. Most > string processing is linear and does not need random access to characters. Well, if you never heard anyone complaining about [Char] and never had any problem with it's slowness, you're probably not in a field where the efficiency of a Unicode library is really a concern, that's for sure. (I know that the _main_ problem with [Char] wasn't random access, but you must admit [Char] isn't really a good example to speak about efficiency problems) -- Jedaï _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by jerzy.karczmarczuk
On 9/27/07, [hidden email]
<[hidden email]> wrote: > > Thomas Conway writes: > > > On 9/27/07, ok <[hidden email]> wrote: > >> I have often found myself wishing for a small extension to the syntax of > >> Haskell 'data' declarations. It goes like this: > > ['where' clause to allow locally defined names in type declarations] > > > > Nice. > > > > Quite a few times I've found myself declaring type synonyms for this > > reason, but you end up polluting the global namespace. > > > > +1 vote. > > Data with where? > You haven't heard about GADTs? I think that you haven't read the question carefully, because "where" in GADTs is simply a syntactic sugar. However, this seems to be available already with GADTs and type equality constraints: data BST key val where Empty :: BST key val Fork :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val It's a pity you can't use bst (or a type synonym) instead of the last "BST key val". Best regards Tomasz _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Chaddaï Fouché
> Well, if you never heard anyone complaining about [Char] and never had
> any problem with it's slowness, you're probably not in a field where > the efficiency of a Unicode library is really a concern, that's for > sure. (I know that the _main_ problem with [Char] wasn't random > access, but you must admit [Char] isn't really a good example to speak > about efficiency problems) I have problems with [Char] and use ByteString instead but that forces me to keep track of the encoding myself and hence UnicodeString. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Tomasz Zielonka
Tomasz Zielonka wrote:
> On 9/27/07, [hidden email] > <[hidden email]> wrote: >> Thomas Conway writes: >> >>> On 9/27/07, ok <[hidden email]> wrote: >>>> I have often found myself wishing for a small extension to the syntax of >>>> Haskell 'data' declarations. It goes like this: >>> ['where' clause to allow locally defined names in type declarations] >>> >>> Nice. >>> >>> Quite a few times I've found myself declaring type synonyms for this >>> reason, but you end up polluting the global namespace. >>> >>> +1 vote. >> Data with where? >> You haven't heard about GADTs? > > I think that you haven't read the question carefully, because "where" > in GADTs is simply a syntactic sugar. However, this seems to be > available already with GADTs and type equality constraints: > > data BST key val where > Empty :: BST key val > Fork :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val > > It's a pity you can't use bst (or a type synonym) instead of the last > "BST key val". Indeed. GADT syntax looks like a type signature (except for strictness annotations, which presently aren't part of function syntax!) but apparently the (->)s and result-type aren't type-signature, because type-synonyms can't be used for them. I tried. (because there were several GADT constructors with slightly different signatures, so I made a type-synonym with an argument to try to shorten them). It seems a pity to me too. Isaac _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Ross Paterson
On Thu, 27 Sep 2007, Ross Paterson wrote:
> > Combining characters are not an issue here, just the surrogate pairs, > because we're discussing representations of sequences of Chars (Unicode > code points). I dislike referring to unicode code points as "characters" because that tends to imply a lot of invalid simplifications. Tony. -- f.a.n.finch <[hidden email]> http://dotat.at/ IRISH SEA: SOUTHERLY, BACKING NORTHEASTERLY FOR A TIME, 3 OR 4. SLIGHT OR MODERATE. SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In message <[hidden email]> Tony Finch
<[hidden email]> writes: > On Thu, 27 Sep 2007, Ross Paterson wrote: > > > > Combining characters are not an issue here, just the surrogate pairs, > > because we're discussing representations of sequences of Chars (Unicode > > code points). > > I dislike referring to unicode code points as "characters" because that > tends to imply a lot of invalid simplifications. Just to be pedantic, Ross did say Char not character. A Char is defined in the Haskell report as a Unicode code point. As you say, that does not directly correspond to what many people think of as a character due to combining characters etc. Duncan _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Aaron Denney
On 2007-09-27, Aaron Denney <[hidden email]> wrote:
> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote: >> On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote: >>>> UTF-16 has no advantage over UTF-8 in this respect, because of >>>> surrogate >>>> pairs and combining characters. >>> >>> Good point. >> >> Well, not so much. As Duncan mentioned, it's a matter of what the most >> common case is. UTF-16 is effectively fixed-width for the majority of >> text in the majority of languages. Combining sequences and surrogate >> pairs are relatively infrequent. > > Infrequent, but they exist, which means you can't seek x/2 bytes ahead > to seek x characters ahead. All such seeking must be linear for both > UTF-16 *and* UTF-8. > >> Speaking as someone who has done a lot of Unicode implementation, I >> would say UTF-16 represents the best time/space tradeoff for an >> internal representation. As I mentioned, it's what's used in Windows, >> Mac OS X, ICU, and Java. I guess why I'm being something of a pain-in-the-ass here, is that I want to use your Unicode implementation expertise to know what these time/space tradeoffs are. Are there any algorithmic asymptotic complexity differences, or all these all constant factors? The constant factors depend on projected workload. And are these actually tradeoffs, except between UTF-32 (which uses native wordsizes on 32-bit platforms) and the other two? Smaller space means smaller cache footprint, which can dominate. Simplicity of algorithms is also a concern. Validating a byte sequence as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16. (I'd also like to see a reference to the Mac OS X encoding. I know that the filesystem interface is UTF-8 (decomposed a certain a way). Is it just that UTF-16 is a common application choice, or is there some common framework or library that uses that?) -- Aaron Denney -><- _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by jerzy.karczmarczuk
[hidden email] wrote:
> Data with where? > You haven't heard about GADTs? To avoid clashing with GADT's "where", I propose to rename ok's keyword to "wherein", or "wheretype", or something data B k v = E | F b b wherein type b = B k v data B k v = E | F b b wheretype b = B k v (I also propose that ok should not just take an existing unrelated thread like "Unicode string library", click "reply", and herein talk about a new topic; but rather, should take the necessary extra effort to start a new thread altogether.) _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Duncan Coutts
On 2007-09-27, Duncan Coutts <[hidden email]> wrote:
> In message <[hidden email]> [hidden email] writes: >> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote: >> > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote: >> >>> UTF-16 has no advantage over UTF-8 in this respect, because of >> >>> surrogate >> >>> pairs and combining characters. >> >> >> >> Good point. >> > >> > Well, not so much. As Duncan mentioned, it's a matter of what the most >> > common case is. UTF-16 is effectively fixed-width for the majority of >> > text in the majority of languages. Combining sequences and surrogate >> > pairs are relatively infrequent. >> >> Infrequent, but they exist, which means you can't seek x/2 bytes ahead >> to seek x characters ahead. All such seeking must be linear for both >> UTF-16 *and* UTF-8. > > And in [Char] for all these years, yet I don't hear people complaining. Most > string processing is linear and does not need random access to characters. Yeah. I'm saying the differences between them are going to be in the constant factors, and that these constant factors will differ between workloads. -- Aaron Denney -><- _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Albert Y. C. Lai
On 9/27/07, Albert Y. C. Lai <[hidden email]> wrote:
[hidden email] wrote: I'm not sure there is a clash. data B k v where ... is easily distinguished from data B k v = ... where ... -- Dave Menendez <[hidden email]> < http://www.eyrie.org/~zednenem/> _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
On 9/28/07, David Menendez <[hidden email]> wrote:
> I'm not sure there is a clash. > > data B k v where ... > > is easily distinguished from > > data B k v = ... where ... Indeed. Although Richard's proposal was simpler, I reckon it's worth discussing whether the where clause should allow normal type/data/newtype declarations, effectively introducing a new scope. There are obviously some type variable quantification and name resolution issues that should yield several conference papers. Here are a couple of examples: data Tree key val = Leaf key val | Node BST key val BST where type BST = Tree key val data RelaxedTree key val = Leaf Bal [(key,val)] | Node Bal [(key,RelaxedTree key val)] where data Bal = Balanced | Unbalanced -- Thomas Conway [hidden email] Silence is the perfectest herald of joy: I were but little happy, if I could say how much. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Thomas Conway wrote:
> Although Richard's proposal was simpler, I reckon it's worth > discussing whether the where clause should allow normal > type/data/newtype declarations, effectively introducing a new scope. > There are obviously some type variable quantification and name > resolution issues that should yield several conference papers. > > data RelaxedTree key val > = Leaf Bal [(key,val)] > | Node Bal [(key,RelaxedTree key val)] > where > data Bal = Balanced | Unbalanced Is Bal visible outside data RelaxedTree? If so, why not put it at the top level. If not, are Balanced and Unbalanced visible? If not, then there is no way to construct a RelaxedTree. If so, then you could not give a type annotation to x = Balanced. > data Tree key val > = Leaf key val > | Node BST key val BST > where > type BST = Tree key val The type synonym example is much easier because it is effectively syntactic sugar, and although BST is not visible, Tree key val is. But is let allowed as well, if we want to restrict the visibility of BST to just the Node constructor? Type synomym of a type variable OK? data Tree key val = let BST = key in Leaf BST val -- perversely called BST | let BST = Tree key val in Node BST key val BST _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
In reply to this post by Thomas Conway
On 28 Sep 2007, at 10:01 am, Thomas Conway wrote:
> data Tree key val > = Leaf key val > | Node BST key val BST > where > type BST = Tree key val > > > data RelaxedTree key val > = Leaf Bal [(key,val)] > | Node Bal [(key,RelaxedTree key val)] > where > data Bal = Balanced | Unbalanced My proposal was deliberately rather limited. My feeling was that if there is a constructor (like Balanced, Unbalanced) then I want it to belong to a module-scope type name. What I'm looking for is something that provides (1) an easily understood way of abbreviating repeated types in a data, type, or newtype declaration (2) and using them *uniformly* throughout such a declaration (which is why GADTs don't help) (3) to reduce the incidence of errors (4) and clarify the programmer's intent in much the same way as field names do (but as a complementary, not a rival technique) (5) and above all, to simplify maintenance. The thing that got me thinking about this is my continuing attempt to write a compiler (to C) for a legacy language in Haskell. I start out with a simple AST data type, adequate for testing the grammar. And then I start adding semantic information to the nodes, and suddenly I find myself adding extra fields all over the place. Now there's a paper that was mentioned about a month ago in this mailing list which basically dealt with that by splitting each type into two: roughly speaking a bit that expresses the recursion and a bit that expresses the choice structure. My feeling about that was that while it is a much more powerful and general technique, it isn't as easy to get your head around as a single level solution. Here's a trivial example. Parser-only version: newtype Var = Var String data Expr = Variable Var | Constant Int | Unary String Expr | Binary String Expr Revised version: data Var env = Var env String data Expr env = Variable (Var env) | Constant Int | Unary String (Expr env) | Binary String (Expr env) (Expr env) Now let's do Expr using my proposal: data Expr = Variable var | Constant Int | Unary String expr | Binary String expr expr where type var = Var type expr = Expr (obtained from the first parser-only version by lower-casing the type names) becoming * data Expr env = Variable var | Constant Int | Unary String expr | Binary String expr expr * where type var = Var env * type expr = Expr env To my mind it's clearer to see 'expr' repeated than '(Expr env)' repeated, as you don't have to keep checking that the argument is the same. I'm not wedded to this scheme. It's the simplest thing I can think of that will do the job. But the Haskell spirit is, if I may say so, seems to be to look for the simplest thing that can do the job at hand and a whole lot more in a principled way. What I'm looking for in a better counter-proposal is something that makes it this easy or easier to revise and extend a type. Perhaps a variation on GADTs would be the way to go. I don't know. _______________________________________________ Haskell-Cafe mailing list [hidden email] http://www.haskell.org/mailman/listinfo/haskell-cafe |
Free forum by Nabble | Edit this page |