UTF-16

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

UTF-16

Andrew Coppin
I don't know if anybody cares, but... Today a wrote some trivial code to
decode (not encode) UTF-16.

I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as
it happens. (I didn't bother decoding code points outside the BMP - I'm
sure you can figure out why.)

If anybody is interested, I can share the code - but it's pretty minimal...

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: UTF-16

Donald Bruce Stewart
andrewcoppin:
> I don't know if anybody cares, but... Today a wrote some trivial code to
> decode (not encode) UTF-16.
>
> I believe somebody out there has a UTF-8 decoder, but I needed UTF-16 as
> it happens. (I didn't bother decoding code points outside the BMP - I'm
> sure you can figure out why.)
>
> If anybody is interested, I can share the code - but it's pretty minimal...

Perhaps you could polish it up, and provide it in a form suitable for
use as a patch to:

    http://code.haskell.org/utf8-string/

that is, put it in a module:

    Codec.Binary.UTF16.String

and provide the functions:

    encode :: String -> [Word8]
    decode :: [Word8] -> String

? And then submit that as a patch to Eric, the utf8 maintainer.

-- Don
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: UTF-16

Colin Paul Adams
In reply to this post by Andrew Coppin
>>>>> "Andrew" == Andrew Coppin <[hidden email]> writes:

    Andrew> I believe somebody out there has a UTF-8 decoder, but I
    Andrew> needed UTF-16 as it happens. (I didn't bother decoding
    Andrew> code points outside the BMP - I'm sure you can figure out
    Andrew> why.)

Well I can't.

And if you don't, then its not a UTF-16 decoder.

The whole point of UTF-16 is to blight everybodies lives by
introducing a gap in the codepoint space.
--
Colin Adams
Preston Lancashire
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

RE: UTF-16

Bayley, Alistair-2
In reply to this post by Donald Bruce Stewart
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Donald
> Bruce Stewart
>
> andrewcoppin:
> > I don't know if anybody cares, but... Today a wrote some
> trivial code to
> > decode (not encode) UTF-16.
> >
> > I believe somebody out there has a UTF-8 decoder, but I
> needed UTF-16 as it happens.
>
> Perhaps you could polish it up, and provide it in a form suitable for
> use as a patch to:
>
>     http://code.haskell.org/utf8-string/
>
> that is, put it in a module:
>
>     Codec.Binary.UTF16.String
>
> and provide the functions:
>
>     encode :: String -> [Word8]
>     decode :: [Word8] -> String
>
> ? And then submit that as a patch to Eric, the utf8 maintainer.
>
> -- Don

There is a UTF16 en/decoder in Foreign.C.String (see cWcharsToChars &
charsToCWchars):
  http://darcs.haskell.org/libraries/base/Foreign/C/String.hs

but it only seems to be available for Windows users, via the CWSTring
functions.

In Takusen we also have a UTF8 module (it's about the fourth or fifth
out there, after HXML and John Meacham's, and someone else's - Graham
Klyne?, and one hidden away in GHC's internals). It has pure en/decode
functions (String <-> [Word8]), naturally (which we ripped off from John
Meacham), but we were more interested in efficient marshalling from
CStrings (or data buffers, if you like), so we wrote specific code to
marshall CString -> String fairly quickly, and space efficiently (see
fromUTF8Ptr, which is wrapped by peekUTF8String{Len}):
  http://darcs.haskell.org/takusen/Foreign/C/UTF8.hs

We stuck it in the Foreign.C namespace, rather than Codec, because we're
doing more FFI related stuff. I'm not sure what the best location is;
perhaps there should be a split, with FFI functions (withUTF8String,
peekUTF8String) in Foreign.C, and pure functions in Codec.

(Also, is there a wiki page somewhere which gives advice as to how to
locate/name library modules, and what the currently occupied namespace
is, including user libs like those on Hackage? It's sometimes a bit
tricky to try to figure out where to put a new module.)

Obviously a proliferation of UTF8 modules isn't great for code re-use.
Is there a plan to consolidate and expose UTF8 and UTF16 de- and
encoders in the libraries? I note that the various UTF8 modules have
fairly similar implementations, and differ mainly w.r.t. how much of the
UTF8 codepoint space they handle (for example, HXML's decodes up to 6
bytes, which isn't strictly standards compliant). Also, some choice as
how to handle errors in the byte stream might be nice i.e. the user
could choose between functions which raise errors, or introduce
substition chars.

Alistair
*****************************************************************
Confidentiality Note: The information contained in this message,
and any attachments, may contain confidential and/or privileged
material. It is intended solely for the person(s) or entity to
which it is addressed. Any review, retransmission, dissemination,
or taking of any action in reliance upon this information by
persons or entities other than the intended recipient(s) is
prohibited. If you received this in error, please contact the
sender and delete the material from any computer.
*****************************************************************
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re[2]: UTF-16

Bulat Ziganshin-2
In reply to this post by Donald Bruce Stewart
Hello Donald,

Thursday, July 26, 2007, 8:13:37 AM, you wrote:
>> I don't know if anybody cares, but... Today a wrote some trivial code to
>> decode (not encode) UTF-16.

These functions already exist in win-specific part of base:

cWcharsToChars :: [CWchar] -> [Char]
charsToCWchars :: [Char] -> [CWchar]

#ifdef mingw32_HOST_OS

-- On Windows, wchar_t is 16 bits wide and CWString uses the UTF-16 encoding.

-- coding errors generate Chars in the surrogate range
cWcharsToChars = map chr . fromUTF16 . map fromIntegral
 where
  fromUTF16 (c1:c2:wcs)
    | 0xd800 <= c1 && c1 <= 0xdbff && 0xdc00 <= c2 && c2 <= 0xdfff =
      ((c1 - 0xd800)*0x400 + (c2 - 0xdc00) + 0x10000) : fromUTF16 wcs
  fromUTF16 (c:wcs) = c : fromUTF16 wcs
  fromUTF16 [] = []

charsToCWchars = foldr utf16Char [] . map ord
 where
  utf16Char c wcs
    | c < 0x10000 = fromIntegral c : wcs
    | otherwise   = let c' = c - 0x10000 in
                    fromIntegral (c' `div` 0x400 + 0xd800) :
                    fromIntegral (c' `mod` 0x400 + 0xdc00) : wcs




--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re[2]: UTF-16

Bulat Ziganshin-2
In reply to this post by Bayley, Alistair-2
Hello Alistair,

Thursday, July 26, 2007, 12:29:06 PM, you wrote:

> Obviously a proliferation of UTF8 modules isn't great for code re-use.
> Is there a plan to consolidate and expose UTF8 and UTF16 de- and
> encoders in the libraries?

afair there is utf-string module, which provides utf-8 functionality.
may be we should just add utf-16 support there?

--
Best regards,
 Bulat                            mailto:[hidden email]

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

RE: Re[2]: UTF-16

Bayley, Alistair-2
> From: Bulat Ziganshin [mailto:[hidden email]]
>
> > Obviously a proliferation of UTF8 modules isn't great for
> code re-use.
> > Is there a plan to consolidate and expose UTF8 and UTF16 de- and
> > encoders in the libraries?
>
> afair there is utf-string module, which provides utf-8 functionality.
> may be we should just add utf-16 support there?

Probably. Is Codec.Binary.UTF8.String the best namespace for it? (why
Binary?) Where should FFI functions that marshal to-and-from UTF8
encoded CStrings sit?

Also, the UTF8 decoder decodes up to 6 bytes per char, which AFAIUI is
not standards compliant.

It'd be nice if these ended up somewhere in the base libs, or (if we're
looking to reduce the size of base) in the standard distributions, so
that (as a UTF8 library user) I don't feel compelled to either:
 - add another dependency to my project, or
 - copy the code into my project, or
 - implement my own de/encoder to fill a gap in this one

Alistair
*****************************************************************
Confidentiality Note: The information contained in this message,
and any attachments, may contain confidential and/or privileged
material. It is intended solely for the person(s) or entity to
which it is addressed. Any review, retransmission, dissemination,
or taking of any action in reliance upon this information by
persons or entities other than the intended recipient(s) is
prohibited. If you received this in error, please contact the
sender and delete the material from any computer.
*****************************************************************
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe