Haskell future and UTF8 vs UTF-16

Alan & Kim Zimmerman
Hi all

What is the current and future status of UTF8 vs UTF-16 in the Haskell world?

I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.

The question arises as I ponder a pull request on haskell-lsp to switch to a UTF-16 based library[1]

Alan

[1] https://github.com/alanz/haskell-lsp/pull/70

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

Re: Haskell future and UTF8 vs UTF-16

Merijn Verstraaten
On 11 Feb 2018, at 10:39, Alan & Kim Zimmerman <[hidden email]> wrote:
> What is the current and future status of UTF8 vs UTF-16 in the haskell world?
>
> I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.

As far as I know, a UTF-8 fork of Text was made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF-16 version in practice. So as far as I know, there's no real plan to adopt UTF-8, especially since the internal encoding used by Text is pretty much irrelevant to most users of Text.

Cheers,
Merijn

Re: Haskell future and UTF8 vs UTF-16

Moritz Kiefer
In reply to this post by Alan & Kim Zimmerman
Hi Alan,

On 02/11/2018 10:39 AM, Alan & Kim Zimmerman wrote:
> What is the current and future status of UTF8 vs UTF-16 in the haskell
> world?

The only somewhat active effort to move towards UTF-8 in `text` that I’m
aware of is https://github.com/text-utf8. I’m not personally involved
with that project so I can’t tell you much more but you might want to
contact the authors.

Cheers,
Moritz

Re: Haskell future and UTF8 vs UTF-16

Adam Bergmark-2
There is also Foundation.String, which I have heard people speak enthusiastically about: https://hackage.haskell.org/package/foundation-0.0.19/docs/Foundation-String.html

Cheers,
Adam


Re: Haskell future and UTF8 vs UTF-16

Joachim Durchholz
In reply to this post by Merijn Verstraaten
On 11.02.2018 at 12:29, Merijn Verstraaten wrote:
> On 11 Feb 2018, at 10:39, Alan & Kim Zimmerman <[hidden email]> wrote:
>> What is the current and future status of UTF8 vs UTF-16 in the haskell world?
>>
>> I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.
>
> As far as I know there was a UTF-8 fork of Text made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF16 version in practice.

Mmm... correctness is another relevant point here.
Does Text handle characters beyond the Basic Multilingual Plane (U+0000
to U+FFFF) properly, or does one have to deal with "surrogate pairs" there?

I'm curious because I am seeing this kind of trouble in the Java world.
The standard libraries there have pretty weak support for characters
beyond U+FFFF, so most Java programmers pretend that these don't exist.
I'm pretty sure Chinese users hate Java for that reason...
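
For anyone unfamiliar with surrogate pairs, here is a minimal sketch of how UTF-16 splits a character beyond the BMP into two 16-bit units. It uses only `base`; the function name `toUtf16` is made up for this illustration and is not part of any library.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Char (ord)
import Data.Word (Word16)

-- | Encode a single character as UTF-16 code units (illustrative helper).
-- Code points up to U+FFFF fit in one unit; anything above needs a
-- high surrogate (0xD800..0xDBFF) followed by a low one (0xDC00..0xDFFF).
toUtf16 :: Char -> [Word16]
toUtf16 c
  | n < 0x10000 = [fromIntegral n]
  | otherwise   = [hi, lo]
  where
    n  = ord c
    m  = n - 0x10000                              -- 20-bit offset past the BMP
    hi = fromIntegral (0xD800 + (m `shiftR` 10))  -- top 10 bits
    lo = fromIntegral (0xDC00 + (m .&. 0x3FF))    -- bottom 10 bits

main :: IO ()
main = do
  print (toUtf16 'A')          -- [65]
  print (toUtf16 '\x20BB7')    -- [55362,57271], i.e. 0xD842 0xDFB7
```

Code that counts or indexes 16-bit units directly sees the second character as two items, which is the kind of trouble I mean.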

Regards,
Jo

Re: Haskell future and UTF8 vs UTF-16

Chris Wong-2
On Feb 12, 2018 10:57 AM, "Joachim Durchholz" <[hidden email]> wrote:
> On 11.02.2018 at 12:29, Merijn Verstraaten wrote:
>> On 11 Feb 2018, at 10:39, Alan & Kim Zimmerman <[hidden email]> wrote:
>>> What is the current and future status of UTF8 vs UTF-16 in the haskell world?
>>>
>>> I understand that currently Text uses UTF-16, and it is used generally because of compatibility requirements in the Microsoft ecosystem, but that there are movements afoot to move to a UTF8 only environment at some unspecified future point.
>>
>> As far as I know there was a UTF-8 fork of Text made as part of the Summer of Code a year or so ago, but it got ditched because it turned out to be slower than the UTF16 version in practice.
>
> Mmm... correctness is another relevant point here.
> Does Text handle characters beyond the Basic Multilingual Plane (U+0000 to U+FFFF) properly, or does one have to deal with "surrogate pairs" there?
>
> I'm curious because I am seeing this kind of trouble in the Java world. The standard libraries there have pretty weak support for characters beyond U+FFFF, so most Java programmers pretend that these don't exist. I'm pretty sure Chinese users hate Java for that reason...

IIRC, the public Text interface works with code points, not 16-bit units. Length and indexing are O(n) for this reason.

So there should be no issues from a correctness point of view.
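
A quick way to check this is a small sketch like the following, assuming the `text` and `bytestring` packages that ship with GHC. The sample string and the name `sample` are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE

-- Three characters; the middle one (U+20BB7) lies outside the BMP.
sample :: T.Text
sample = T.pack ['a', '\x20BB7', 'z']

main :: IO ()
main = do
  -- The public API counts code points, so the surrogate pair is invisible.
  print (T.length sample)                              -- 3
  -- Indexing is by code point as well, and returns a full Char.
  print (T.index sample 1 == '\x20BB7')                -- True
  -- Explicitly encoding shows the underlying unit counts.
  print (B.length (TE.encodeUtf16LE sample) `div` 2)   -- 4 UTF-16 code units
  print (B.length (TE.encodeUtf8 sample))              -- 6 UTF-8 bytes
```

Surrogates only show up once you explicitly encode to UTF-16; `length` and `index` have to walk the text to count characters, hence the O(n) cost.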

Chris

Re: Haskell future and UTF8 vs UTF-16

Zemyla
I'd actually been thinking about whether it'd be worth it to include a
fingertree of character lengths in order to make length O(1) and
indexing, take, and drop O(log n). However, a Text is currently three
unpacked values, and putting something that can't be unboxed in there
may not be such a good idea.

Re: Haskell future and UTF8 vs UTF-16

M Farkas-Dyck-2
On 13/02/2018, Zemyla <[hidden email]> wrote:
> I'd actually been thinking about whether it'd be worth it to include a
> fingertree of character lengths in order to make length O(1) and
> indexing, take, and drop O(log n). However, a Text is currently three
> unpacked values, and putting something that can't be unboxed in there
> may not be such a good idea.

Yeah, whoever needs these operations should probably use
`Vector Char` or similar, or define a wrapper type that includes the
character length information, lest we penalize all users for it.
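
A minimal sketch of such a wrapper, assuming the `text` package; the names `SizedText`, `fromText` and `append` are hypothetical and only illustrate the idea. It makes length O(1) but does nothing for indexing, take, or drop.

```haskell
import qualified Data.Text as T

-- | Hypothetical wrapper: a Text together with its cached length in characters.
data SizedText = SizedText
  { sizedLength :: !Int      -- number of code points, computed once
  , sizedText   :: !T.Text
  }

-- | Pay the O(n) length computation once, at construction time.
fromText :: T.Text -> SizedText
fromText t = SizedText (T.length t) t

-- | Appending can reuse the cached lengths instead of recounting.
append :: SizedText -> SizedText -> SizedText
append (SizedText n a) (SizedText m b) = SizedText (n + m) (T.append a b)

main :: IO ()
main = do
  let s = fromText (T.pack "hello ") `append` fromText (T.pack ['\x20BB7'])
  print (sizedLength s)                          -- 7
  print (sizedLength s == T.length (sizedText s)) -- True
```

Only code that opts into the wrapper pays for the extra field; plain Text stays as it is.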