How to use Unicode strings?

dokondr
Please advise how to write a Unicode string so that this example works:

main = do
  putStrLn "Les signes orthographiques inclus les accents (aigus, grâve, circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la majuscule."

I get the following error:
hello.hs:4:68:
    lexical error in string/character literal (UTF-8 decoding error)
Failed, modules loaded: none.
Prelude>

Also, how do I read Unicode characters from standard input?

Thanks!

--
Dmitri O. Kondratiev
[hidden email]
http://www.geocities.com/dkondr

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: How to use Unicode strings?

Luke Palmer-2
2008/11/22 Dmitri O.Kondratiev <[hidden email]>:
> Please advise how to write a Unicode string so that this example works:
>
> main = do
>   putStrLn "Les signes orthographiques inclus les accents (aigus, grâve,
> circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la
> majuscule."

That really ought to work.  Is the file encoded in UTF-8 (rather than,
e.g. Latin-1)?
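
If the source file's encoding turns out to be the culprit, one way to
sidestep it is to keep the file pure ASCII and write the accented characters
as numeric escapes. A minimal sketch (the shortened sentence and the name
`accents` are only for illustration; whether the accents then display
correctly still depends on how putStrLn and the terminal handle non-ASCII
output):

```haskell
-- The source stays pure ASCII, so the lexer never has to decode UTF-8.
-- Decimal escapes: \226 is 'â' (U+00E2), \233 is 'é' (U+00E9).
accents :: String
accents = "gr\226ve, tr\233ma, c\233dille"

main :: IO ()
main = putStrLn accents
```

The literal is the same String value as the one typed with accented letters;
only the bytes stored in the .hs file differ.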

Luke

Re: How to use Unicode strings?

Alexey Khudyakov
>
> That really ought to work.  Is the file encoded in UTF-8 (rather than,
> e.g. Latin-1)?
>
This only pretends to work. The plain print functions garble Unicode characters.
For example:

> putStrLn "Ну и где этот ваш хвалёный уникод?"

prints the following output:

C 8 345 MB>B 20H E20;Q=K9 C=8:>4?

Not pretty, is it? Although Dmitri's variant seems to work fine.
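
The garbled output is consistent with each character being cut down to the
low 8 bits of its code point before being written. The sketch below only
reproduces that effect in pure code; `truncateToByte` is a made-up name, and
this is an illustration rather than the actual I/O library code:

```haskell
import Data.Bits ((.&.))
import Data.Char (chr, ord)

-- Keep only the low 8 bits of the code point, e.g. 'у' (U+0443) -> 'C' (0x43).
truncateToByte :: Char -> Char
truncateToByte = chr . (.&. 0xFF) . ord

main :: IO ()
main = print (map truncateToByte "у и где")  -- prints "C 8 345"
```

It would also explain why the French sentence survives: its accented letters
all have code points below 256, so nothing is lost by the truncation.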


Re: How to use Unicode strings?

Austin Seipp
In reply to this post by dokondr
Excerpts from Dmitri O.Kondratiev's message of Sat Nov 22 05:40:41 -0600 2008:

> Please advise how to write a Unicode string so that this example works:
>
> main = do
>   putStrLn "Les signes orthographiques inclus les accents (aigus, grâve,
> circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la
> majuscule."
>
> I get the following error:
> hello.hs:4:68:
>     lexical error in string/character literal (UTF-8 decoding error)
> Failed, modules loaded: none.
> Prelude>
>
> Also, how do I read Unicode characters from standard input?
>
> Thanks!
>

Hi,

Check out the utf8-string package on Hackage:

http://hackage.haskell.org/cgi-bin/hackage-scripts/package/utf8-string

In particular, you probably want the System.IO.UTF8 functions, which
are identical to their non-UTF-8 counterparts in System.IO except that,
well, they handle Unicode properly.

More specifically, you will probably want to look mainly at
Codec.Binary.UTF8.String.encodeString and decodeString
(in fact, most of the System.IO.UTF8 functions are defined in terms of
these, e.g. 'putStrLn x = IO.putStrLn (encodeString x)' and 'getLine =
liftM decodeString IO.getLine').
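
To make the idea concrete, here is a hand-rolled sketch of what an
encodeString-style function does: it turns each Unicode character into its
UTF-8 bytes and stores every byte in its own Char, so that a byte-oriented
putStrLn writes valid UTF-8. The names `encodeCharSketch` and
`encodeStringSketch` are hypothetical; the real functions live in utf8-string
and may be implemented differently.

```haskell
import Data.Bits (shiftR, (.&.), (.|.))
import Data.Char (chr, ord)

-- Encode one code point as UTF-8, producing one Char (< 256) per byte.
-- Surrogate code points are not treated specially in this sketch.
encodeCharSketch :: Char -> String
encodeCharSketch c
  | n < 0x80    = [chr n]
  | n < 0x800   = map chr [0xC0 .|. (n `shiftR` 6), cont n]
  | n < 0x10000 = map chr [0xE0 .|. (n `shiftR` 12), cont (n `shiftR` 6), cont n]
  | otherwise   = map chr [ 0xF0 .|. (n `shiftR` 18), cont (n `shiftR` 12)
                          , cont (n `shiftR` 6), cont n ]
  where
    n = ord c
    -- Continuation byte: 10xxxxxx carrying the low six bits of x.
    cont x = 0x80 .|. (x .&. 0x3F)

encodeStringSketch :: String -> String
encodeStringSketch = concatMap encodeCharSketch

main :: IO ()
main = print (map ord (encodeStringSketch "é"))  -- prints [195,169]
```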

Austin

Re: How to use Unicode strings?

Janis Voigtlaender
In reply to this post by Alexey Khudyakov
Alexey Khudyakov wrote:
>>putStrLn "Ну и где этот ваш хвалёный уникод?"

:-)

--
Dr. Janis Voigtlaender
http://wwwtcs.inf.tu-dresden.de/~voigt/
mailto:[hidden email]


Re: How to use Unicode strings?

Don Stewart-2
In reply to this post by Alexey Khudyakov
alexey.skladnoy:

> >
> > That really ought to work.  Is the file encoded in UTF-8 (rather than,
> > e.g. Latin-1)?
> >
> This only pretends to work. The plain print functions garble Unicode characters.
> For example:
>
> > putStrLn "Ну и где этот ваш хвалёный уникод?"
>
> prints the following output:
>
> C 8 345 MB>B 20H E20;Q=K9 C=8:>4?
>
> Not pretty, is it? Although Dmitri's variant seems to work fine.

Use the UTF8 printing functions,

    import qualified System.IO.UTF8 as U

    main = U.putStrLn "Ну и где этот ваш хвалёный уникод?"

Running this,

    *Main> main
    Ну и где этот ваш хвалёный уникод?

-- Don

Re: How to use Unicode strings?

Duncan Coutts
On Sat, 2008-11-22 at 10:02 -0800, Don Stewart wrote:

> Use the UTF8 printing functions,
>
>     import qualified System.IO.UTF8 as U
>
>     main = U.putStrLn "Ну и где этот ваш хвалёный уникод?"
>
> Running this,
>
>     *Main> main
>     Ну и где этот ваш хвалёный уникод?


This upsets me. We need to get on with doing this properly. The
System.IO.UTF8 module is a useful interim workaround but we're not using
it properly most of the time.

It is right when you're working with a text file that you know to be in
the UTF-8 format. For example, .cabal files are UTF-8, irrespective of
the platform or the system locale.

It is not right when working with the terminal. The encoding of the
terminal is given by the locale. We cannot statically declare that it is
UTF-8.

The right thing to do is to make Prelude.putStrLn do the right thing. We
had a long discussion on how to fix the H98 IO functions to do this
better. We just need to get on with it, or we'll end up with too many
cases of people using System.IO.UTF8 inappropriately.

For the case where System.IO.UTF8 is right we probably still want a more
general solution, like a handle setting for the encoding.
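
Later GHC releases provide this kind of per-handle setting as hSetEncoding
in System.IO. A minimal sketch of the distinction drawn above, assuming such
a GHC: a file known to be UTF-8 gets its encoding set explicitly, while
stdout keeps whatever the locale dictates. The helper names and the file name
unicode-demo.txt are made up for the example.

```haskell
import System.IO

-- Write a file that is UTF-8 by definition, whatever the locale says.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

-- Read it back as UTF-8.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8
  s <- hGetContents h
  length s `seq` return s  -- force the lazy read before the handle closes

main :: IO ()
main = do
  writeFileUtf8 "unicode-demo.txt" "Ну и где этот ваш хвалёный уникод?\n"
  s <- readFileUtf8 "unicode-demo.txt"
  putStr s  -- stdout is left alone, so the locale's encoding applies here
```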

Duncan


Re: How to use Unicode strings?

Maurício
In reply to this post by dokondr
> Please advise how to write a Unicode string so that this example works:
>
> main = do
>   putStrLn "Les signes orthographiques inclus les accents (aigus, grâve,
> circonflexe), le tréma, l'apostrophe, la cédille, le trait d'union et la
> majuscule."
 > (...)

Besides the Haskell stuff, you probably want to
check whether your terminal outputs UTF-8.

I use a nice X terminal named 'mlterm'. Its main
goal is to support Unicode. But I don't know enough
to tell you how to check your terminal, or even
whether just changing to mlterm will always work.
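
From the Haskell side, a GHC recent enough to have hGetEncoding and
localeEncoding in System.IO can at least report which encoding the runtime
believes it should use. This is only a sketch: it shows what the locale
settings claim, not what the terminal emulator really renders, and
`describeEncoding` is a made-up helper.

```haskell
import System.IO

-- Nothing means the handle is in binary mode and does no text encoding.
describeEncoding :: Maybe TextEncoding -> String
describeEncoding Nothing    = "binary mode (no text encoding)"
describeEncoding (Just enc) = show enc

main :: IO ()
main = do
  out <- hGetEncoding stdout
  inp <- hGetEncoding stdin
  putStrLn ("stdout: " ++ describeEncoding out)
  putStrLn ("stdin:  " ++ describeEncoding inp)
  putStrLn ("locale: " ++ show localeEncoding)
```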

Sometimes, I wonder why distributions don't just
agree on considering support for anything but UTF-8
a bug (except in 'iconv', of course). Well, there's
probably someone out there who would have problems
with that, and I don't want problems for anyone.
But I hope their problems would be worse than mine
trying to deal with different encodings.

Best,
Maurício


Re: How to use Unicode strings?

Alexey Khudyakov
In reply to this post by Duncan Coutts
>

> This upsets me. We need to get on with doing this properly. The
> System.IO.UTF8 module is a useful interim workaround but we're not using
> it properly most of the time.
>
> ... skipped ...
>
> The right thing to do is to make Prelude.putStrLn do the right thing. We
> had a long discussion on how to fix the H98 IO functions to do this
> better. We just need to get on with it, or we'll end up with too many
> cases of people using System.IO.UTF8 inappropriately.
>
But this raises the question of what "the right thing" is. If the locale is
UTF-8, or the system supports Unicode in some other way, there is no problem:
just encode the string properly. The problem is how to deal with
untranslatable characters. Skip them? Replace them with question marks?
Something else? We should probably look at how this is solved in other
languages (or not solved).

And this problem is not limited to IO. It arises whenever strings cross the
border between the Haskell world and the outside world: opening files with
Unicode names, exec'ing, etc.

For example:
Prelude> readFile "файл"
*** Exception: D09;: openFile: does not exist (No such file or directory)
Prelude> executeFile "echo" True ["Сейчас сломается"] Nothing
!59G0A A;><05BAO

Although it's possible to work around this using encodeString/decodeString
from Codec.Binary.UTF8.String, that won't work on non-UTF-8 systems. And it's
not only Neanderthal systems with one-byte locales: Windows, AFAIK, uses a
different Unicode encoding.


Re: How to use Unicode strings?

Don Stewart-2
alexey.skladnoy:

> >
> > This upsets me. We need to get on with doing this properly. The
> > System.IO.UTF8 module is a useful interim workaround but we're not using
> > it properly most of the time.
> >
> > ... skipped ...
> >
> > The right thing to do is to make Prelude.putStrLn do the right thing. We
> > had a long discussion on how to fix the H98 IO functions to do this
> > better. We just need to get on with it, or we'll end up with too many
> > cases of people using System.IO.UTF8 inappropriately.
> >
> But this raises the question of what "the right thing" is. If the locale is
> UTF-8, or the system supports Unicode in some other way, there is no problem:
> just encode the string properly. The problem is how to deal with
> untranslatable characters. Skip them? Replace them with question marks?
> Something else? We should probably look at how this is solved in other
> languages (or not solved).
>
> And this problem is not limited to IO. It arises whenever strings cross the
> border between the Haskell world and the outside world: opening files with
> Unicode names, exec'ing, etc.
>
> For example:
> Prelude> readFile "файл"
> *** Exception: D09;: openFile: does not exist (No such file or directory)
> Prelude> executeFile "echo" True ["Сейчас сломается"] Nothing
> !59G0A A;><05BAO
>
> Although it's possible to work around this using encodeString/decodeString
> from Codec.Binary.UTF8.String, that won't work on non-UTF-8 systems. And it's
> not only Neanderthal systems with one-byte locales: Windows, AFAIK, uses a
> different Unicode encoding.

For just decoding / encoding in other locales, there are codec
libraries. Hunt around on Hackage.

    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/encoding
    http://hackage.haskell.org/cgi-bin/hackage-scripts/package/Encode


-- Don

Re[2]: How to use Unicode strings?

Bulat Ziganshin-2
In reply to this post by Alexey Khudyakov
Hello Alexey,

Sunday, November 23, 2008, 10:20:47 AM, you wrote:

> And this problem is not limited to IO. It arises whenever strings cross the
> border between the Haskell world and the outside world: opening files with
> Unicode names, exec'ing, etc.

This depends entirely on the libraries, and the I/O libraries bundled with
GHC don't support Unicode filenames. The FreeArc project contains its own
simple I/O library that doesn't have this problem (and also supports files
>4 GB on Windows). Unfortunately, this library doesn't include any buffering.

--
Best regards,
 Bulat                            mailto:[hidden email]


Re: How to use Unicode strings?

wren ng thornton
In reply to this post by Alexey Khudyakov
Alexey Khudyakov wrote:
> But this raises the question of what "the right thing" is. If the locale is
> UTF-8, or the system supports Unicode in some other way, there is no problem:
> just encode the string properly. The problem is how to deal with
> untranslatable characters. Skip them? Replace them with question marks?
> Something else? We should probably look at how this is solved in other
> languages (or not solved).

Regarding untranslatable characters, I think the only correct thing to
do is to consider it exceptional behavior and have the conversion function
accept a handler function which takes the character as input and
produces a string for it. That way programs can define their own
behavior, since this is something that doesn't have a "right" way to
recover in all cases. Canonical handlers which skip, replace with
question marks (or another arbitrary character), throw actual exceptions,
etc., could be provided for convenience.
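
A rough sketch of that interface, using Latin-1 as the target encoding only
because "representable" is then easy to test (code point below 256). All the
names here are hypothetical, and this simple [Char] version makes no attempt
at the efficiency concerns of a real converter:

```haskell
import Data.Char (ord)
import Data.Word (Word8)

-- A handler turns an unencodable character into a replacement string.
type Handler = Char -> String

-- Encode to Latin-1, asking the handler about anything that does not fit.
-- Replacement characters that do not fit either are dropped, so a handler
-- can never send the encoder into a loop.
encodeLatin1With :: Handler -> String -> [Word8]
encodeLatin1With handler = concatMap encodeChar
  where
    encodeChar c
      | fits c    = [toByte c]
      | otherwise = [toByte r | r <- handler c, fits r]
    fits c   = ord c < 256
    toByte c = fromIntegral (ord c)

-- Canonical handlers: skip, replace with '?', or fail loudly.
skipHandler :: Handler
skipHandler _ = ""

questionMarkHandler :: Handler
questionMarkHandler _ = "?"

throwHandler :: Handler
throwHandler c = error ("cannot encode " ++ show c)

main :: IO ()
main = do
  print (encodeLatin1With questionMarkHandler "aЖé")  -- prints [97,63,233]
  print (encodeLatin1With skipHandler "aЖé")          -- prints [97,233]
```

Because the handler is an ordinary argument, each call site picks its own
recovery policy instead of inheriting one global choice.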

For stream-based "strings" à la ByteString, dealing with this sort of
handler in an efficient manner is fairly straightforward (though some
CPS tricks may be needed to get rid of the Maybe in the result of the
basic converter). For [Char] strings efficiency is harder, but the
implementation should still be easy (given the basic converter).

Most extant languages I've seen tend to pick a single solution for all
cases, but I don't think we should follow along that path.

--
Live well,
~wren