PROPOSAL: New efficient Unicode string library.

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
61 messages Options
1234
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Ross Paterson
On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote:
> On 2007-09-27, Ross Paterson <[hidden email]> wrote:
> > Combining characters are not an issue here, just the surrogate pairs,
> > because we're discussing representations of sequences of Chars (Unicode
> > code points).
>
> You'll never want to combine combining characters or vice-versa?  Never
> want to figure out how much screen space a sequence will take?  It _is_
> an issue.

It's an issue for a higher layer, not for a compact String representation.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Ross Paterson
In reply to this post by Aaron Denney
On Thu, Sep 27, 2007 at 06:39:24AM +0000, Aaron Denney wrote:
> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote:
> > Well, not so much. As Duncan mentioned, it's a matter of what the most  
> > common case is. UTF-16 is effectively fixed-width for the majority of  
> > text in the majority of languages. Combining sequences and surrogate  
> > pairs are relatively infrequent.
>
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.

You could get rapid seeks by ignoring the UTFs and representing strings
as sequences of chunks, where each chunk is uniformly 8-bit, 16-bit or
32-bit as required to cover the characters it contains.  Hardly anyone
would need 32-bit chunks (and some of us would need only the 8-bit ones).
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: PROPOSAL: New efficient Unicode string library.

Juanma Barranquero
In reply to this post by Richard A. O'Keefe
On 9/27/07, ok <[hidden email]> wrote:

> (What the heck _is_ Tangut, anyway?)

http://en.wikipedia.org/wiki/Tangut_language

             Juanma
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Ross Paterson
On 2007-09-27, Ross Paterson <[hidden email]> wrote:

> On Thu, Sep 27, 2007 at 07:26:07AM +0000, Aaron Denney wrote:
>> On 2007-09-27, Ross Paterson <[hidden email]> wrote:
>> > Combining characters are not an issue here, just the surrogate pairs,
>> > because we're discussing representations of sequences of Chars (Unicode
>> > code points).
>>
>> You'll never want to combine combining characters or vice-versa?  Never
>> want to figure out how much screen space a sequence will take?  It _is_
>> an issue.
>
> It's an issue for a higher layer, not for a compact String representation.

Yes, and no.  It's not something the lower layer should be doing, but
enabling the higher layers to do so efficiently is a concern.


--
Aaron Denney
-><-

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Thomas Conway
In reply to this post by Richard A. O'Keefe
On 9/27/07, ok <[hidden email]> wrote:
> I have often found myself wishing for a small extension to the syntax of
> Haskell 'data' declarations.  It goes like this:
    ['where' clause to allow locally defined names in type declarations]

Nice.

Quite a few times I've found myself declaring type synonyms for this
reason, but you end up  polluting the global namespace.

+1 vote.

--
Thomas Conway
[hidden email]

Silence is the perfectest herald of joy:
I were but little happy, if I could say how much.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

jerzy.karczmarczuk

Thomas Conway writes:

> On 9/27/07, ok <[hidden email]> wrote:
>> I have often found myself wishing for a small extension to the syntax of
>> Haskell 'data' declarations.  It goes like this:
>     ['where' clause to allow locally defined names in type declarations]
>
> Nice.
>
> Quite a few times I've found myself declaring type synonyms for this
> reason, but you end up  polluting the global namespace.
>
> +1 vote.

Data with where?
You haven't heard about GADTs?

http://en.wikibooks.org/wiki/Haskell/GADT
http://www.haskell.org/haskellwiki/Generalised_algebraic_datatype 


Jerzy Karczmarczuk


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Duncan Coutts
In reply to this post by Aaron Denney
In message <[hidden email]> [hidden email] writes:

> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote:
> > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
> >>> UTF-16 has no advantage over UTF-8 in this respect, because of  
> >>> surrogate
> >>> pairs and combining characters.
> >>
> >> Good point.
> >
> > Well, not so much. As Duncan mentioned, it's a matter of what the most  
> > common case is. UTF-16 is effectively fixed-width for the majority of  
> > text in the majority of languages. Combining sequences and surrogate  
> > pairs are relatively infrequent.
>
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.

And in [Char] for all these years, yet I don't hear people complaining. Most
string processing is linear and does not need random access to characters.

Duncan
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Chaddaï Fouché
2007/9/27, Duncan Coutts <[hidden email]>:
> > Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> > to seek x characters ahead.  All such seeking must be linear for both
> > UTF-16 *and* UTF-8.
>
> And in [Char] for all these years, yet I don't hear people complaining. Most
> string processing is linear and does not need random access to characters.

Well, if you never heard anyone complaining about [Char] and never had
any problem with it's slowness, you're probably not in a field where
the efficiency of a Unicode library is really a concern, that's for
sure. (I know that the _main_ problem with [Char] wasn't random
access, but you must admit [Char] isn't really a good example to speak
about efficiency problems)

--
Jedaï
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Tomasz Zielonka
In reply to this post by jerzy.karczmarczuk
On 9/27/07, [hidden email]
<[hidden email]> wrote:

>
> Thomas Conway writes:
>
> > On 9/27/07, ok <[hidden email]> wrote:
> >> I have often found myself wishing for a small extension to the syntax of
> >> Haskell 'data' declarations.  It goes like this:
> >     ['where' clause to allow locally defined names in type declarations]
> >
> > Nice.
> >
> > Quite a few times I've found myself declaring type synonyms for this
> > reason, but you end up  polluting the global namespace.
> >
> > +1 vote.
>
> Data with where?
> You haven't heard about GADTs?

I think that you haven't read the question carefully, because "where"
in GADTs is simply a syntactic sugar. However, this seems to be
available already with GADTs and type equality constraints:

data BST key val where
    Empty   :: BST key val
    Fork    :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val

It's a pity you can't use bst (or a type synonym) instead of the last
"BST key val".

Best regards
Tomasz
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Johan Tibell-2
In reply to this post by Chaddaï Fouché
> Well, if you never heard anyone complaining about [Char] and never had
> any problem with it's slowness, you're probably not in a field where
> the efficiency of a Unicode library is really a concern, that's for
> sure. (I know that the _main_ problem with [Char] wasn't random
> access, but you must admit [Char] isn't really a good example to speak
> about efficiency problems)

I have problems with [Char] and use ByteString instead but that forces
me to keep track of the encoding myself and hence UnicodeString.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Isaac Dupree
In reply to this post by Tomasz Zielonka
Tomasz Zielonka wrote:

> On 9/27/07, [hidden email]
> <[hidden email]> wrote:
>> Thomas Conway writes:
>>
>>> On 9/27/07, ok <[hidden email]> wrote:
>>>> I have often found myself wishing for a small extension to the syntax of
>>>> Haskell 'data' declarations.  It goes like this:
>>>     ['where' clause to allow locally defined names in type declarations]
>>>
>>> Nice.
>>>
>>> Quite a few times I've found myself declaring type synonyms for this
>>> reason, but you end up  polluting the global namespace.
>>>
>>> +1 vote.
>> Data with where?
>> You haven't heard about GADTs?
>
> I think that you haven't read the question carefully, because "where"
> in GADTs is simply a syntactic sugar. However, this seems to be
> available already with GADTs and type equality constraints:
>
> data BST key val where
>     Empty   :: BST key val
>     Fork    :: (bst ~ BST key val) => key -> val -> bst -> bst -> BST key val
>
> It's a pity you can't use bst (or a type synonym) instead of the last
> "BST key val".

Indeed.  GADT syntax looks like a type signature (except for strictness
annotations, which presently aren't part of function syntax!) but
apparently the (->)s and result-type aren't type-signature, because
type-synonyms can't be used for them.  I tried.  (because there were
several GADT constructors with slightly different signatures, so I made
a type-synonym with an argument to try to shorten them).  It seems a
pity to me too.

Isaac

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Tony Finch
In reply to this post by Ross Paterson
On Thu, 27 Sep 2007, Ross Paterson wrote:
>
> Combining characters are not an issue here, just the surrogate pairs,
> because we're discussing representations of sequences of Chars (Unicode
> code points).

I dislike referring to unicode code points as "characters" because that
tends to imply a lot of invalid simplifications.

Tony.
--
f.a.n.finch  <[hidden email]>  http://dotat.at/
IRISH SEA: SOUTHERLY, BACKING NORTHEASTERLY FOR A TIME, 3 OR 4. SLIGHT OR
MODERATE. SHOWERS. MODERATE OR GOOD, OCCASIONALLY POOR.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Re: PROPOSAL: New efficient Unicode string library.

Duncan Coutts
In message <[hidden email]> Tony Finch
<[hidden email]> writes:
> On Thu, 27 Sep 2007, Ross Paterson wrote:
> >
> > Combining characters are not an issue here, just the surrogate pairs,
> > because we're discussing representations of sequences of Chars (Unicode
> > code points).
>
> I dislike referring to unicode code points as "characters" because that
> tends to imply a lot of invalid simplifications.

Just to be pedantic, Ross did say Char not character. A Char is defined in the
Haskell report as a Unicode code point. As you say, that does not directly
correspond to what many people think of as a character due to combining
characters etc.

Duncan
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Aaron Denney
On 2007-09-27, Aaron Denney <[hidden email]> wrote:

> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote:
>> On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
>>>> UTF-16 has no advantage over UTF-8 in this respect, because of  
>>>> surrogate
>>>> pairs and combining characters.
>>>
>>> Good point.
>>
>> Well, not so much. As Duncan mentioned, it's a matter of what the most  
>> common case is. UTF-16 is effectively fixed-width for the majority of  
>> text in the majority of languages. Combining sequences and surrogate  
>> pairs are relatively infrequent.
>
> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
> to seek x characters ahead.  All such seeking must be linear for both
> UTF-16 *and* UTF-8.
>
>> Speaking as someone who has done a lot of Unicode implementation, I
>> would say UTF-16 represents the best time/space tradeoff for an
>> internal representation. As I mentioned, it's what's used in Windows,
>> Mac OS X, ICU, and Java.

I guess why I'm being something of a pain-in-the-ass here, is that
I want to use your Unicode implementation expertise to know what
these time/space tradeoffs are.

Are there any algorithmic asymptotic complexity differences, or all
these all constant factors?  The constant factors depend on projected
workload.  And are these actually tradeoffs, except between UTF-32
(which uses native wordsizes on 32-bit platforms) and the other two?
Smaller space means smaller cache footprint, which can dominate.

Simplicity of algorithms is also a concern.  Validating a byte sequence
as UTF-8 is harder than validating a sequence of 16-bit values as UTF-16.  

(I'd also like to see a reference to the Mac OS X encoding.  I know that
the filesystem interface is UTF-8 (decomposed a certain a way).  Is it
just that UTF-16 is a common application choice, or is there some
common framework or library that uses that?)

--
Aaron Denney
-><-

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Albert Y. C. Lai
In reply to this post by jerzy.karczmarczuk
[hidden email] wrote:
> Data with where?
> You haven't heard about GADTs?

To avoid clashing with GADT's "where", I propose to rename ok's keyword
to "wherein", or "wheretype", or something

data B k v = E | F b b wherein type b = B k v

data B k v = E | F b b wheretype b = B k v

(I also propose that ok should not just take an existing unrelated
thread like "Unicode string library", click "reply", and herein talk
about a new topic; but rather, should take the necessary extra effort to
start a new thread altogether.)
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: PROPOSAL: New efficient Unicode string library.

Aaron Denney
In reply to this post by Duncan Coutts
On 2007-09-27, Duncan Coutts <[hidden email]> wrote:

> In message <[hidden email]> [hidden email] writes:
>> On 2007-09-27, Deborah Goldsmith <[hidden email]> wrote:
>> > On Sep 26, 2007, at 11:06 AM, Aaron Denney wrote:
>> >>> UTF-16 has no advantage over UTF-8 in this respect, because of  
>> >>> surrogate
>> >>> pairs and combining characters.
>> >>
>> >> Good point.
>> >
>> > Well, not so much. As Duncan mentioned, it's a matter of what the most  
>> > common case is. UTF-16 is effectively fixed-width for the majority of  
>> > text in the majority of languages. Combining sequences and surrogate  
>> > pairs are relatively infrequent.
>>
>> Infrequent, but they exist, which means you can't seek x/2 bytes ahead
>> to seek x characters ahead.  All such seeking must be linear for both
>> UTF-16 *and* UTF-8.
>
> And in [Char] for all these years, yet I don't hear people complaining. Most
> string processing is linear and does not need random access to characters.

Yeah.  I'm saying the differences between them are going to be in the
constant factors, and that these constant factors will differ between
workloads.  

--
Aaron Denney
-><-

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

David Menendez-2
In reply to this post by Albert Y. C. Lai
On 9/27/07, Albert Y. C. Lai <[hidden email]> wrote:
[hidden email] wrote:
> Data with where?
> You haven't heard about GADTs?

To avoid clashing with GADT's "where", I propose to rename ok's keyword
to "wherein", or "wheretype", or something

data B k v = E | F b b wherein type b = B k v

data B k v = E | F b b wheretype b = B k v

I'm not sure there is a clash.

data B k v where ...

is easily distinguished from

data B k v = ... where ...

--
Dave Menendez <[hidden email]>
< http://www.eyrie.org/~zednenem/>
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Thomas Conway
On 9/28/07, David Menendez <[hidden email]> wrote:
> I'm not sure there is a clash.
>
> data B k v where ...
>
> is easily distinguished from
>
> data B k v = ... where ...

Indeed.

Although Richard's proposal was simpler, I reckon it's worth
discussing whether the where clause should allow normal
type/data/newtype declarations, effectively introducing a new scope.
There are obviously some type variable quantification and name
resolution issues that should yield several conference papers.

Here are a couple of examples:


data Tree key val
    = Leaf key val
    | Node BST key val BST
    where
    type BST = Tree key val


data RelaxedTree key val
    = Leaf Bal [(key,val)]
    | Node Bal [(key,RelaxedTree key val)]
    where
    data Bal = Balanced | Unbalanced

--
Thomas Conway
[hidden email]

Silence is the perfectest herald of joy:
I were but little happy, if I could say how much.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Dan Weston
Thomas Conway wrote:

> Although Richard's proposal was simpler, I reckon it's worth
> discussing whether the where clause should allow normal
> type/data/newtype declarations, effectively introducing a new scope.
> There are obviously some type variable quantification and name
> resolution issues that should yield several conference papers.
>
> data RelaxedTree key val
>     = Leaf Bal [(key,val)]
>     | Node Bal [(key,RelaxedTree key val)]
>     where
>     data Bal = Balanced | Unbalanced

Is Bal visible outside data RelaxedTree? If so, why not put it at the
top level. If not, are Balanced and Unbalanced visible? If not, then
there is no way to construct a RelaxedTree. If so, then you could not
give a type annotation to x = Balanced.

 > data Tree key val
 >     = Leaf key val
 >     | Node BST key val BST
 >     where
 >     type BST = Tree key val

The type synonym example is much easier because it is effectively
syntactic sugar, and although BST is not visible, Tree key val is. But
is let allowed as well, if we want to restrict the visibility of BST to
just the Node constructor? Type synomym of a type variable OK?

data Tree key val
     = let BST = key in Leaf BST val  -- perversely called BST
     | let BST = Tree key val in Node BST key val BST


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: 'data' syntax - a suggestion

Richard A. O'Keefe
In reply to this post by Thomas Conway
On 28 Sep 2007, at 10:01 am, Thomas Conway wrote:

> data Tree key val
>     = Leaf key val
>     | Node BST key val BST
>     where
>     type BST = Tree key val
>
>
> data RelaxedTree key val
>     = Leaf Bal [(key,val)]
>     | Node Bal [(key,RelaxedTree key val)]
>     where
>     data Bal = Balanced | Unbalanced


My proposal was deliberately rather limited.
My feeling was that if there is a constructor
(like Balanced, Unbalanced)
then I want it to belong to a module-scope type name.

What I'm looking for is something that provides
(1) an easily understood way of abbreviating repeated types in a
     data, type, or newtype declaration
(2) and using them *uniformly* throughout such a declaration
     (which is why GADTs don't help)
(3) to reduce the incidence of errors
(4) and clarify the programmer's intent in much the same way as
     field names do (but as a complementary,  not a rival technique)
(5) and above all, to simplify maintenance.

The thing that got me thinking about this is my continuing attempt
to write a compiler (to C) for a legacy language in Haskell.
I start out with a simple AST data type, adequate for testing
the grammar.  And then I start adding semantic information to the
nodes, and suddenly I find myself adding extra fields all over the
place.

Now there's a paper that was mentioned about a month ago in this
mailing list which basically dealt with that by splitting each type
into two:  roughly speaking a bit that expresses the recursion and
a bit that expresses the choice structure.  My feeling about that
was that while it is a much more powerful and general technique, it
isn't as easy to get your head around as a single level solution.

Here's a trivial example.

Parser-only version:

        newtype Var
            =   Var String

        data Expr
           = Variable Var
           | Constant Int
           | Unary String Expr
           | Binary String Expr

Revised version:

        data Var env
           = Var env String

        data Expr env
           = Variable (Var env)
           | Constant Int
           | Unary String (Expr env)
           | Binary String (Expr env) (Expr env)

Now let's do Expr using my proposal:

        data Expr
           = Variable var
           | Constant Int
           | Unary String expr
           | Binary String expr expr
           where type var = Var
                 type expr = Expr

(obtained from the first parser-only version by lower-casing the type  
names)
becoming

* data Expr env
           = Variable var
           | Constant Int
           | Unary String expr
           | Binary String expr expr
*   where type var = Var env
* type expr = Expr env

To my mind it's clearer to see 'expr' repeated than '(Expr env)'  
repeated,
as you don't have to keep checking that the argument is the same.

I'm not wedded to this scheme.  It's the simplest thing I can think of
that will do the job.  But the Haskell spirit is, if I may say so,
seems to be to look for the simplest thing that can do the job at hand
and a whole lot more in a principled way.

What I'm looking for in a better counter-proposal is something that
makes it this easy or easier to revise and extend a type.  Perhaps
a variation on GADTs would be the way to go.  I don't know.


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
1234