Better casing functions (German ß, etc.)

박신환

Haskell currently has only the 'simple' `Char`-to-`Char` casing functions (as specified by Unicode), namely `toUpper`, `toLower` and `toTitle`.

So the intended way to convert the case of a `String` is `fmap toUpper`, etc. But this gives wrong results in several cases.
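A minimal sketch of the limitation, using only `Data.Char` from base. The names `simpleUpper` and `example` are made up for illustration. Because the mapping works one `Char` at a time, the result always has the same length as the input:

```haskell
import Data.Char (toUpper)

-- Simple (Char-to-Char) uppercasing: the output has exactly as many
-- characters as the input, so one-to-many mappings cannot be expressed.
simpleUpper :: String -> String
simpleUpper = map toUpper

-- 'ß' (U+00DF) has no simple uppercase mapping, so toUpper leaves it alone.
example :: String
example = simpleUpper "straße"   -- "STRAßE" rather than "STRASSE"
```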

 

Case 1. German ß (Eszett) 

 

'ß' (U+00DF), Latin Small Letter Sharp S, is itself a lowercase letter, but Unicode doesn't specify a 'simple' uppercase counterpart for it.

That is because its uppercase counterpart is not a single character but two characters, "SS".

 

Case 2. Turkish İ and ı

Instead of the common 'I' and 'i' case pair, Turkish has two pairs: 'İ' (U+0130) and 'i', and 'I' and 'ı' (U+0131). These are the dotted I pair and the dotless I pair, respectively, so the correct mapping depends on the language of the text.
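One way this could look, as a rough sketch only: the `Lang` type and the functions `upperIn` and `lowerIn` are hypothetical names, not an existing API, and only the two I pairs are special-cased. Everything else falls back to the simple mappings from `Data.Char`.

```haskell
import Data.Char (toLower, toUpper)

-- Hypothetical language tag; it only distinguishes what this example needs.
data Lang = Turkic | Other
  deriving (Eq, Show)

-- Uppercase with the dotted/dotless I pairs respected for Turkic text.
upperIn :: Lang -> String -> String
upperIn Turkic = map up
  where
    up 'i' = '\x0130'   -- i -> İ (dotted capital I)
    up c   = toUpper c  -- includes ı (U+0131) -> I
upperIn Other = map toUpper

-- Lowercase with the dotted/dotless I pairs respected for Turkic text.
lowerIn :: Lang -> String -> String
lowerIn Turkic = map lo
  where
    lo 'I' = '\x0131'   -- I -> ı (dotless small i)
    lo c   = toLower c  -- includes İ (U+0130) -> i
lowerIn Other = map toLower
```

The point of the sketch is only that the language has to be an explicit argument; a function of type `Char -> Char` alone cannot know which pair applies.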

 

Case 3. Greek Σ (Sigma) 

Greek 'Σ' (U+03A3) must be lowercased to the final form 'ς' (U+03C2) when it ends a word (for example, when followed by whitespace), rather than to the normal 'σ' (U+03C3).
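A context-sensitive lowercase could be sketched as below. The name `lowerSigma` is made up, and the word-end test is a simplification that I am assuming is good enough for illustration: sigma counts as final when it is preceded by a letter and not followed by one. The condition in the Unicode Standard is more careful than this.

```haskell
import Data.Char (isLetter, toLower)

-- Lowercase a string, choosing final sigma (U+03C2) at the end of a word.
-- Simplified rule: capital sigma is "final" when the previous character is
-- a letter and the next character (if any) is not a letter.
lowerSigma :: String -> String
lowerSigma = go False
  where
    -- The Bool records whether the previous character was a letter.
    go _ [] = []
    go prevLetter (c : rest)
      | c == '\x03A3' && prevLetter && not (nextIsLetter rest) =
          '\x03C2' : go True rest
      | otherwise = toLower c : go (isLetter c) rest

    nextIsLetter (d : _) = isLetter d
    nextIsLetter []      = False
```

Here the mapping of one character depends on its neighbours, which is again something `fmap toLower` cannot express.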

 

Case 4. Greek iota subscript (Ypogegrammeni)

Greek 'capital' letters with iota subscript (for example, 'ᾈ' (U+1F88)) are the 'simple' uppercase counterparts of their lowercase letters, but they are actually treated as titlecase characters. For example, the actual uppercase counterpart of 'ᾀ' (U+1F80) is "ἈΙ" (U+1F08 U+0399), that is, with an actual capital iota instead of the iota subscript.

 

Case 5. Precomposed letters without upper/lowercase counterpart 

For example, 'ΐ' (U+0390) doesn't have a precomposed uppercase counterpart. It must effectively be mapped to "Ϊ́" (U+03AA U+0301).


In summary, we need more elaborate casing functions that are `String`-to-`String`.
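As a rough sketch of the shape such a function might take (the names `specialUpper` and `fullUpper` are hypothetical, and the table holds only three entries taken from the cases above; a complete table would have to come from Unicode's special casing data):

```haskell
import Data.Char (toUpper)
import Data.Maybe (fromMaybe)

-- A few one-to-many uppercase mappings, for illustration only.
specialUpper :: Char -> Maybe String
specialUpper '\x00DF' = Just "SS"                  -- ß
specialUpper '\x1F80' = Just "\x1F08\x0399"        -- ᾀ -> ἈΙ
specialUpper '\x0390' = Just "\x0399\x0308\x0301"  -- ΐ -> Ι + diaeresis + acute
specialUpper _        = Nothing

-- String-to-String uppercasing: each character may expand to several
-- characters, with the simple mapping as the fallback.
fullUpper :: String -> String
fullUpper = concatMap (\c -> fromMaybe [toUpper c] (specialUpper c))
```

With this, `fullUpper "straße"` gives `"STRASSE"`. The sketch does not handle the language-dependent or context-dependent cases (Turkish I, final sigma); those would need extra arguments or lookahead.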


Bibliography:

    The Unicode Standard Version 11.0 – Core Specification, Section 5.18.


_______________________________________________
Libraries mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/libraries

Re: Better casing functions (German ß, etc.)

Francesco Ariis
Hello 박신환,

On Wed, Jul 11, 2018 at 03:59:37PM +0900, 박신환 wrote:
> Case 4. Greek iota subscript (Ypogegrammeni)

I think not even Data.Text handles this correctly!

Re: Better casing functions (German ß, etc.)

Mario Blažević-3
In reply to this post by 박신환
On 2018-07-11 02:59 AM, 박신환 wrote:
>
> Current Haskell has 'simple' `Char`-to-`Char` casing functions (as
> specified by Unicode), namely `toUpper`, `toLower` and `toTitle`.
>
> So to convert cases of a `String`, Haskell intends `fmap toUpper`,
> etc. But this has some bugs.
>

I've never tested the cases you list, but I believe the text-icu library
covers them. See
http://hackage.haskell.org/package/text-icu-0.7.0.1/docs/Data-Text-ICU.html#g:4




Re: Better casing functions (German ß, etc.)

Mikhail Glushenkov
In reply to this post by 박신환
Hi,

On Wed, 11 Jul 2018, 08:00 박신환, <[hidden email]> wrote:
>
> [...]
> Case 1. German ß (Eszett)
>
>
>
> 'ß' (U+00DF), Latin Small Letter Sharp S, is a lowercase letter itself, but Unicode doesn't specify its 'simple' uppercase counterpart.
>
> It's because its uppercase counterpart is not a single character, but two characters, "SS".

Capital sharp s is now also considered valid:
https://medium.com/@typefacts/the-german-capital-letter-eszett-e0936c1388f8