Rewrite of Data.Char library?

Ahn, Ki Yung
In the #haskell IRC channel, we just had a discussion about the Data.Char
predicates such as isAlpha, isUpper, and isLower.  The implementation of
Data.Char is not Haskell 98, since the Char specification in Haskell 98 only
covers latin1.  However, the current predicates are confusing, and intuitive
properties do not hold.  One example is this:

[17:53:32] <newsham> > let cs = [minBound..maxBound]; us = filter
isUpper cs; ls = filter isLower cs in take 5 $ (map toUpper ls) \\ us
[17:53:33] <lambdabot>   "\170\186\223I\312"

isLower '\170' == True, but you can't turn that into an uppercase
letter: toUpper '\170' == '\170'.

I know that the GHC team is working on a rewrite of the IO library for
better Unicode support (which I hope also includes better locale and
charset support).  Alongside the new IO library work, it would also be
good to have some cleanup in Data.Char.

Thanks,

Ahn, Ki Yung
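
The lambdabot query above can be reproduced as a small module. This is only a sketch: the names stubbornLower and noUpperForm are made up for illustration, and the exact results depend on the Unicode tables shipped with the installed base library, so individual characters such as '\170' may be classified differently than in the log. Note that (\\) removes one occurrence per element, so the 'I' in the output is left over because both 'i' and '\305' (dotless i) map to 'I'.

```haskell
import Data.Char (isLower, isUpper, toUpper)
import Data.List ((\\))

-- The expression from the IRC log, without the `take 5`.
-- (\\) is a multiset difference: it deletes one occurrence per element of us.
stubbornLower :: String
stubbornLower = map toUpper ls \\ us
  where
    cs = [minBound .. maxBound] :: [Char]
    us = filter isUpper cs
    ls = filter isLower cs

-- A more direct formulation (hypothetical helper): lowercase characters
-- whose toUpper image is not classified as uppercase.
noUpperForm :: String
noUpperForm =
  [c | c <- [minBound .. maxBound], isLower c, not (isUpper (toUpper c))]
```

The second list avoids the duplicate-mapping artifact, so it contains only characters such as '\223' that toUpper leaves unchanged.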

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: Rewrite of Data.Char library?

Ahn, Ki Yung
Just a follow-up with my suggestions.  The lowercase/uppercase problem
seems not to be solvable, since in some languages a letter such as the
German sz (ß) doesn't have a well-defined uppercase form.  So the issue
in my previous posting isn't really a big problem.

Another problem is that, in the Haskell 98 Report, isAlpha is defined as
isLower or isUpper.  This is different from the current implementation.
What the current isAlpha accepts is every character in the Unicode
"Letter" categories.

So, wouldn't it be better to make isAlpha follow the definition in
the Haskell 98 Report, and define a new predicate called isLetter
if needed?  That at least sounds more proper, and the programmer can
easily guess that it corresponds to the Letter categories in
Unicode.
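
The gap between the two definitions can be made concrete with a short sketch. The names isAlpha98 and disagreements are hypothetical; isAlpha98 simply restates the Report's equation on top of whatever isUpper and isLower the installed base library provides.

```haskell
import Data.Char (isAlpha, isLower, isUpper)

-- The Haskell 98 Report's equation, under a made-up name.
isAlpha98 :: Char -> Bool
isAlpha98 c = isUpper c || isLower c

-- Characters on which Data.Char.isAlpha and the Report's equation disagree.
disagreements :: [Char]
disagreements =
  filter (\c -> isAlpha c /= isAlpha98 c) [minBound .. maxBound]
```

Caseless letters, for example Hebrew alef ('\1488'), should show up in disagreements: they are letters but neither upper nor lower case.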

Re: Rewrite of Data.Char library?

Colin Paul Adams
>>>>> "Ahn" == Ahn, Ki Yung <[hidden email]> writes:


    Ahn> Just a follow-up to add, and my suggestions.  Lowercase and
    Ahn> Uppercase problem seems not to be solvable, since in some
    Ahn> languages like German sz doesn't have a good definition for
    Ahn> an uppercase letter.

You just follow the Unicode default foldings.

Note that these foldings are not inverses of each other - intuition is
misplaced here.
--
Colin Adams
Preston Lancashire
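
A small sketch of the "not inverses" point, using Data.Char's toUpper and toLower. These are single-character case mappings rather than full Unicode case folding, so this only approximates what is described above; the name brokenRoundTrips is made up for illustration.

```haskell
import Data.Char (isLower, toLower, toUpper)

-- Lowercase characters that do not survive an upper-then-lower round trip.
brokenRoundTrips :: [Char]
brokenRoundTrips =
  [ c
  | c <- [minBound .. maxBound]
  , isLower c
  , toLower (toUpper c) /= c
  ]
```

Dotless i ('\305') is one such character: it uppercases to 'I', which lowercases to 'i'. The micro sign ('\181') is another, since it uppercases to Greek capital mu, which lowercases to Greek small mu.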

Re: Rewrite of Data.Char library?

Ahn, Ki Yung
Colin Paul Adams wrote:

>>>>>> "Ahn" == Ahn, Ki Yung <[hidden email]> writes:
>
>
>     Ahn> Just a follow-up to add, and my suggestions.  Lowercase and
>     Ahn> Uppercase problem seems not to be solvable, since in some
>     Ahn> languages like German sz doesn't have a good definition for
>     Ahn> an uppercase letter.
>
> You just follow the Unicode default foldings.
>
> Note that these foldings are not inverses of each other - intuition is
> misplaced here.

Agreed.

The real problem, I think, as I mentioned in my second posting, is
that isAlpha is neither backward compatible nor conformant to the Haskell
98 Report.  I suggest that we revert isAlpha to the definition in the
Haskell 98 Report and add a new function called isLetter for what the
current implementation of Data.Char.isAlpha does.  If this is
controversial, I think there would be no objection to at least making
isAlpha in Char, the good old Haskell 98 standard library module,
conform to the Haskell 98 definition, which is isUpper or isLower.

Re: Rewrite of Data.Char library?

Ian Lynagh
On Thu, Oct 22, 2009 at 07:56:56PM -0700, Ahn, Ki Yung wrote:
> Ahn, Ki Yung wrote:
> > In the #haskell IRC channel, we just had a discussion on Data.Char
> > predicates such as isAlpha, isUpper, isLower.  The implementation of
> > Data.Char is not Haskell 98 since Char specification in Haskell 98 only
> > covers latin1.

Char in Haskell98 covers Unicode too;
http://haskell.org/onlinereport/char.html says:

    Function toUpper converts a letter to the corresponding upper-case
    letter, leaving any other character unchanged. Any Unicode letter
    which has an upper-case equivalent is transformed. Similarly,
    toLower converts a letter to the corresponding lower-case letter,
    leaving any other character unchanged.

> > However, the current predicates are confusing, and intuitive
> > properties do not hold.  One example is this:
> >
> > [17:53:32] <newsham> > let cs = [minBound..maxBound]; us = filter
> > isUpper cs; ls = filter isLower cs in take 5 $ (map toUpper ls) \\ us
> > [17:53:33] <lambdabot>   "\170\186\223I\312"
> >
> > isLower '\170' == True, but you can't turn that into an uppercase
> > letter: toUpper '\170' == '\170'.

What behaviour would you expect?

> Another problem is that, in the Haskell 98 Report, isAlpha is defined as
> isLower or isUpper.  This is different from the current implementation.
> What isAlpha is categorizing is all the "Letter" categories.

Right, we have:

isLower = "Letter, Lowercase"

isUpper = "Letter, Uppercase" or "Letter, Titlecase"

isAlpha = "Letter, Lowercase" or
          "Letter, Uppercase" or "Letter, Titlecase" or
          "Letter, Modifier" or "Letter, Other"

The report says:
    any alphabetic character which is not lower case is treated as upper
    case (Unicode actually has three cases: upper, lower, and title)
and defines:
    isAlpha c =  isUpper c || isLower c
so the implementation is not consistent with the language definition. I
wouldn't like to say which is "wrong", though (but I would guess "both"
:-).  I think it would be great if someone were to design a new interface
that provided something closer to the Unicode spec, perhaps in
Data.Char.Unicode; we could make the current interface a layer on top.

> So, wouldn't it be better to keep isAlpha to follow the definition of
> the Haskell 98 report, and just define a new predicate called isLetter
> if needed?

If your idea is to improve the handling of '\170' then this won't help.
'\170' is "Letter, Lowercase".


Thanks
Ian
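
The category table above can be written down directly with Data.Char.generalCategory. This is a sketch that restates the table, not the library's actual source; the names isLowerCat, isUpperCat and isAlphaCat are hypothetical.

```haskell
import Data.Char (GeneralCategory (..), generalCategory)

-- "Letter, Lowercase"
isLowerCat :: Char -> Bool
isLowerCat c = generalCategory c == LowercaseLetter

-- "Letter, Uppercase" or "Letter, Titlecase"
isUpperCat :: Char -> Bool
isUpperCat c = generalCategory c `elem` [UppercaseLetter, TitlecaseLetter]

-- Any of the five "Letter" categories
isAlphaCat :: Char -> Bool
isAlphaCat c =
  generalCategory c
    `elem` [ LowercaseLetter
           , UppercaseLetter
           , TitlecaseLetter
           , ModifierLetter
           , OtherLetter
           ]
```

Comparing these against Data.Char's isLower, isUpper and isAlpha over [minBound .. maxBound] is one way to check how closely a given base version follows the table.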
