DBCS encoding support on Windows

Max Bolingbroke-2
Hi GHCers,

I've implemented support in GHC for extra Windows code pages on the branch
"dbcs" of the base library.

The problem this solves is that currently users of Haskell on a Windows
machine running in a locale which uses a double-byte code page such as
CP936 (GBK) or CP950 (Big5) cannot properly interact with the Windows
console in their native language. Unfortunately code page support is a
prerequisite for getting this to work correctly because for all Microsoft's
fine talk about Unicode being the future, the Windows console does not seem
to support it properly - code pages are the only way to go for console
input and output.

As the standard Windows locale encodings in many regions, these code pages
are also the predominant method of encoding text files in many countries,
so they are useful outside the console.

The solution is along the lines suggested in
http://hackage.haskell.org/trac/ghc/ticket/3977, i.e. we create an
iconv-like interface to the Windows MultiByteToWideChar and
WideCharToMultiByte APIs by the judicious use of binary search. In my
branch, these APIs will be used whenever we don't have a built-in native
Haskell TextEncoding for the code page (we used to fall back on using
latin1 for such code pages).
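
To illustrate where a binary search can come in (a toy sketch only, not the
code on the branch): an iconv-like interface has to report how much input it
consumed when the output buffer fills up, whereas the Windows APIs can only
say how much output a given amount of input needs. One way to bridge the gap
is to search for the longest input prefix whose output still fits. The names
largestFitting, toyUnits and sample below are hypothetical, and toyUnits is a
pure stand-in for the real size query, assuming a code page in which bytes
>= 0x81 are lead bytes of two-byte characters.

```haskell
import Data.Word (Word8)

-- | Largest n in [0, total] with cost n <= capacity.
-- Assumes cost is non-decreasing and cost 0 <= capacity.
largestFitting :: (Int -> Int) -> Int -> Int -> Int
largestFitting cost total capacity = go 0 total
  where
    go lo hi
      | lo >= hi             = lo
      | cost mid <= capacity = go mid hi        -- prefix fits: try longer
      | otherwise            = go lo (mid - 1)  -- too big: try shorter
      where
        mid = (lo + hi + 1) `div` 2  -- round up so the range always shrinks

-- | Toy stand-in (an assumption, not a Windows call) for asking how many
-- output characters the first n input bytes would produce. A dangling
-- lead byte at the end of the prefix is counted as one character.
toyUnits :: [Word8] -> Int -> Int
toyUnits bytes n = count (take n bytes)
  where
    count []       = 0
    count (b:rest)
      | b >= 0x81  = 1 + count (drop 1 rest)  -- lead byte plus trail byte
      | otherwise  = 1 + count rest           -- single-byte character

-- 'A', one two-byte character, 'B', another two-byte character.
sample :: [Word8]
sample = [0x41, 0xB0, 0xA1, 0x42, 0xC4, 0xE3]

main :: IO ()
main = print (largestFitting (toyUnits sample) (length sample) 2)
-- prints 3: with room for two characters, three input bytes are consumed
```

Each probe in a real implementation would presumably be a call to the
conversion API, so the search costs a logarithmic number of calls per buffer
rather than one call per input byte.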

Unless there are any objections I'll merge this into the base library main
branch next week.

Cheers,
Max

Simon Peyton Jones
Great stuff.

One thing: have you left enough documentation in the code that, when someone comes along in 3 years' time, they can understand the problem and how you have dealt with it?  Lots of "Note [Blah]" stuff?  Or something.

Thanks

Simon

From: ghc-devs-bounces at haskell.org [mailto:ghc-devs-bounces at haskell.org] On Behalf Of Max Bolingbroke
Sent: 23 April 2013 21:29
To: ghc-devs at haskell.org
Subject: DBCS encoding support on Windows

Max Bolingbroke-2
The algorithm in the new module (GHC.IO.Encoding.CodePage.API) is rather
intricate, so I've commented it quite thoroughly. The changes to other
modules are minimal: we simply now use a real code page encoding instead of
brokenly using latin1 when GHC doesn't have the code page built in, so
there isn't much of a change to document.
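
From a user's point of view, asking for such an encoding by name might look
like the sketch below. It assumes the name "CP936" is accepted by
mkTextEncoding on the platform where it runs, which is not guaranteed
everywhere; describeEncoding is a hypothetical helper, not part of base.

```haskell
import Control.Exception (IOException, try)
import System.IO (TextEncoding, mkTextEncoding)

-- | Hypothetical helper: look up an encoding by name and report either
-- the encoding that was found or the error raised by the lookup.
describeEncoding :: String -> IO String
describeEncoding name = do
  r <- try (mkTextEncoding name) :: IO (Either IOException TextEncoding)
  return $ case r of
    Left err  -> name ++ " is not available here: " ++ show err
    Right enc -> "got encoding " ++ show enc

main :: IO ()
main = describeEncoding "CP936" >>= putStrLn
```

An encoding obtained this way can then be passed to hSetEncoding for a
console or file handle.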

Max


On 24 April 2013 08:12, Simon Peyton-Jones <simonpj at microsoft.com> wrote:

> Great stuff.
>
> One thing: have you left enough documentation in the code that, when
> someone comes along in 3 years' time, they can understand the problem and
> how you have dealt with it?  Lots of "Note [Blah]" stuff?  Or something.
>
> Thanks
>
> Simon

Simon Peyton Jones
I was thinking of people who don't know what DBCS or a code page is.  But maybe they are going to be too clueless for comments to help!

S

From: omega.theta at gmail.com [mailto:omega.theta at gmail.com] On Behalf Of Max Bolingbroke
Sent: 24 April 2013 21:04
To: Simon Peyton-Jones
Cc: ghc-devs at haskell.org
Subject: Re: DBCS encoding support on Windows

Max Bolingbroke-2
In reply to this post by Max Bolingbroke-2
This is now in HEAD.

Enjoy!
Max

