implementation of UTF-8 conversion for text I/O: iconv vs hand-made


implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Bulat Ziganshin-2
Hello all

This letter describes why I think that using a hand-made (de)coder to
support UTF-8 encoded files is better than using iconv. For readers who
don't know it, iconv is a widespread C library that performs
buffer-to-buffer conversion between text encodings (UTF-8, UTF-16,
Latin-1, UCS-2, UCS-4 and more). The hand-made (en)coder implemented
by me is just a "converter", i.e. a higher-order function between the
getByte/putByte and getChar/putChar operations, so it can be used in
any monad and for any purpose, not only for text I/O.
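To make the shape of such a converter concrete, here is a minimal sketch of mine (it is not the actual code from "Data\CharEncoding.hs", and it omits the 4-byte UTF-8 form and all error checking): a decoder and an encoder, each parameterized over a monadic byte operation:

```haskell
import Data.Bits ((.&.), (.|.), shiftL, shiftR)
import Data.Char (chr, ord)
import Data.Word (Word8)

-- Decoder: turns any monadic getByte into a getChar for UTF-8
-- (1- to 3-byte forms; the 4-byte form is omitted for brevity).
getCharUtf8 :: Monad m => m Word8 -> m Char
getCharUtf8 getByte = do
  b0 <- byte
  if b0 < 0x80
    then return (chr b0)                                  -- plain ASCII
    else if b0 < 0xE0
      then do b1 <- cont                                  -- 2-byte sequence
              return (chr ((b0 .&. 0x1F) `shiftL` 6 .|. b1))
      else do b1 <- cont                                  -- 3-byte sequence
              b2 <- cont
              return (chr ((b0 .&. 0x0F) `shiftL` 12 .|. b1 `shiftL` 6 .|. b2))
  where
    byte = fmap fromIntegral getByte
    cont = fmap (.&. 0x3F) byte                           -- strip the 10xxxxxx tag

-- Encoder: the dual direction, over any monadic putByte.
putCharUtf8 :: Monad m => (Word8 -> m ()) -> Char -> m ()
putCharUtf8 putByte c
  | n < 0x80    = put n
  | n < 0x800   = do put (0xC0 .|. n `shiftR` 6);  cont n
  | n < 0x10000 = do put (0xE0 .|. n `shiftR` 12); cont (n `shiftR` 6); cont n
  | otherwise   = do put (0xF0 .|. n `shiftR` 18); cont (n `shiftR` 12)
                     cont (n `shiftR` 6); cont n
  where
    n      = ord c
    put    = putByte . fromIntegral
    cont x = put (0x80 .|. x .&. 0x3F)                    -- continuation byte
```

Plugging these into a file's getByte/putByte gives text I/O; plugging them into an in-memory byte sink gives string serialization, which is how the same routines can also serve "instance Binary String".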

One can find an example of a library that uses iconv in the "System\IO\Text.hs"
module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a
hand-made encoder in the module "Data\CharEncoding.hs",
with its usage in "System\Stream\Transformer\CharEncoding.hs",
from http://freearc.narod.ru/Streams.tar.gz

I crossposted this letter to Marcin and Simon because you have
discussed this question with me, and to Einar because he once asked
me about one specific feature in this area.


Why iconv is better:

1) it's lightning fast, adding virtually zero speed overhead
2) it's robust
3) it contains already-implemented and debugged algorithms for all
the encodings we can encounter
4) it has highly developed error-processing facilities
(I mean signalling errors in the input data and/or masking them)

Why a hand-made conversion is better:

1) I don't know whether iconv will be available on every Hugs and GHC
installation.

2) Einar once asked me about changing the encoding on the fly, which
is needed for some HTML processing. It is also possible that
some program will need to intersperse text I/O with
buffer/array/byte/bit I/O. This is the sort of thing that is absolutely
impossible with iconv.

3) My library supports Streams that work in ANY monad (not only IO, ST
and their derivatives). It's impossible to implement iconv conversion
for such stream types.

As you can see, while the last arguments concern very specific
situations, these situations absolutely can't be handled by iconv, so
we need to implement hand-made conversions anyway. On the other side,
iconv's strong points are not decisive: the speed of hand-made
routines will be enough, about several MB/s; all possible
encodings can be implemented and debugged sooner or later; and only
the processing of errors in the input data is a weak point of the
current design itself.

Moreover, there are implementation issues that make me more enthusiastic
about the hand-made solution. It is simply already implemented and really works.
The implementation of CharEncoding for streams is in the module
"System\Stream\Transformer\CharEncoding.hs", and it is very trivial.
The implementation of the different encoders in "Data\CharEncoding.hs"
is slightly more complex, but these routines are also used in
"instance Binary String", i.e. to serialize strings. Also, I think
that the "Data\CharEncoding.hs" module should be a part of the standard
Haskell library, so the implementation of the CharEncoding stream
transformer comes almost for "free".

On the other side, the implementation of text encoding in the "new I/O"
library is about 1000 lines long. While I don't need to copy it all,
using iconv will in any case be much more complex than using hand-made
routines. This includes the complexity of interacting with iconv itself
and the complexity of implementing various I/O operations over a buffer
that contains 4-byte characters. I have already implemented 3 buffering
transformers, and adding one more buffering scheme is the last thing I
want to do. On the contrary, I'm now searching for ways to avoid code
repetition by joining them all into one; it's very boring to have 3 or 4
similar things and to replicate every change across all of them.

At the same time, the library design is open, and it's entirely
possible to have two alternative char encoding transformers. Anyone
can develop additional transformers even without interacting with me;
such a transformer just has to implement the vGetChar/vPutChar operations
via the vGetBuf/vPutBuf ones. I simply propose to leave things as
they are, and to go on to implementing an iconv-based transformer only
when we are actually bothered by its restrictions.
 

--
Best regards,
 Bulat                          mailto:[hidden email]

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

Re: implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Einar Karttunen
On 20.04 17:38, Bulat Ziganshin wrote:
> One can find an example of a library that uses iconv in the "System\IO\Text.hs"
> module from http://haskell.org/~simonmar/new-io.tar.gz, and an example of a
> hand-made encoder in the module "Data\CharEncoding.hs",
> with its usage in "System\Stream\Transformer\CharEncoding.hs",
> from http://freearc.narod.ru/Streams.tar.gz

Does Data.CharEncoding work with encodings that have state associated
with them? One example is ISO-2022-JP. Maybe by using a suitable
monad transformer?

> 2) Einar once asked me about changing the encoding on the fly, which
> is needed for some HTML processing. It is also possible that some
> program will need to intersperse text I/O with buffer/array/byte/bit
> I/O. This is the sort of thing that is absolutely impossible with iconv.

The example goes like this:
1) the HTTP client reads the response from the server using ASCII
2) when reading the headers is complete, it either:
   * decodes the body (binary data) and, after decompressing, converts it to text, or
   * decodes the body (text in some encoding) straight from the Handle.

Is there a reason this is impossible with iconv if the character conversion
sits on top of the buffering?

- Einar Karttunen

Re[2]: implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Bulat Ziganshin-2
Hello Einar,

Thursday, April 20, 2006, 6:24:14 PM, you wrote:

> Does Data.CharEncoding work with encodings that have state associated
> with them? One example is ISO-2022-JP.

No. So the list of things that are impossible in principle with the
current design of Data.CharEncoding comes down to error processing/masking
and the handling of stateful encodings.

> Maybe by using a suitable monad transformer?

How do you imagine that? We have the following classes:

class ByteStream m h where
  vGetByte :: h -> m Word8
  vPutByte :: h -> Word8 -> m ()

class TextStream m h where
  vGetChar :: h -> m Char
  vPutChar :: h -> Char -> m ()

and the char encoding transformer should implement the latter via the former:

instance ByteStream m h => TextStream m (CharEncoding h) where ...
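To show that the whole transformer really can be this small, here is a self-contained sketch (my own illustration with invented toy instances, not the library's code; Latin-1 is used so that each method body stays one line — a UTF-8 instance would differ only in those two bodies):

```haskell
{-# LANGUAGE MultiParamTypeClasses, FlexibleInstances #-}
import Data.Char (chr, ord)
import Data.IORef
import Data.Word (Word8)

class ByteStream m h where
  vGetByte :: h -> m Word8
  vPutByte :: h -> Word8 -> m ()

class TextStream m h where
  vGetChar :: h -> m Char
  vPutChar :: h -> Char -> m ()

-- the transformer just wraps the underlying byte stream
newtype CharEncoding h = CharEncoding h

-- Latin-1: one byte per character, in both directions
instance (Monad m, ByteStream m h) => TextStream m (CharEncoding h) where
  vGetChar (CharEncoding h) = fmap (chr . fromIntegral) (vGetByte h)
  vPutChar (CharEncoding h) = vPutByte h . fromIntegral . ord

-- toy byte stream for testing: an IORef holding the pending bytes
instance ByteStream IO (IORef [Word8]) where
  vGetByte r   = do (b:bs) <- readIORef r; writeIORef r bs; return b
  vPutByte r b = modifyIORef r (++ [b])

main :: IO ()
main = do
  r <- newIORef []
  let t = CharEncoding r
  vPutChar t 'A'          -- goes through the byte stream as one byte
  c <- vGetChar t         -- and comes back out as a Char
  print c                 -- prints 'A'
```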

It seems that we should just refine the types of the (vGetByte->vGetChar)
and (vPutByte->vPutChar) converters so that they accept the old state
and an error-processing mode, and return an error code and the new state.
Something like this:

type PutByte m h = h -> Word8 -> m ()
type EncodeConverter m h state = PutByte m h -> ErrMode -> h -> state
                                 -> m (Either Char ErrCode, state)

where `state` holds the current processing state, ErrMode is the
error-processing mode and ErrCode is an error code. Of course, this
would make the implementation even slower :(
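As an illustration of why threading the state through is enough, here is a toy stateful encoding of my own invention (not ISO-2022-JP, but state-dependent in the same spirit): a shift byte changes how every following byte is decoded, and the state is passed along exactly as in the type above:

```haskell
import Data.Char (chr, toUpper)
import Data.Word (Word8)

-- toy stateful encoding: 0x0E shifts "up", 0x0F shifts back,
-- and the same byte decodes differently depending on the mode
data Mode = Plain | Shifted

decodeShift :: Mode -> Word8 -> (Maybe Char, Mode)
decodeShift _       0x0E = (Nothing, Shifted)                   -- shift-in, emits nothing
decodeShift _       0x0F = (Nothing, Plain)                     -- shift-out
decodeShift Plain   b    = (Just (chr (fromIntegral b)), Plain)
decodeShift Shifted b    = (Just (toUpper (chr (fromIntegral b))), Shifted)

-- threading the state through a whole byte sequence
decodeAll :: Mode -> [Word8] -> String
decodeAll _ []     = []
decodeAll m (b:bs) = case decodeShift m b of
  (Just c,  m') -> c : decodeAll m' bs
  (Nothing, m') ->     decodeAll m' bs
```

For example, decodeAll Plain [97,0x0E,98,0x0F,99] yields "aBc": the byte 98 decodes to 'B' rather than 'b' only while the decoder is in the shifted state.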


>> 2) Einar once asked me about changing the encoding on the fly, which
>> is needed for some HTML processing. It is also possible that some
>> program will need to intersperse text I/O with buffer/array/byte/bit
>> I/O. This is the sort of thing that is absolutely impossible with iconv.

> The example goes like this:
> 1) the HTTP client reads the response from the server using ASCII
> 2) when reading the headers is complete, it either:
>    * decodes the body (binary data) and, after decompressing, converts it to text, or
>    * decodes the body (text in some encoding) straight from the Handle.

> Is there a reason this is impossible with iconv if the character conversion
> sits on top of the buffering?

Let them answer that :)  I just want to mention to Simon that some apps
want to use binary and text I/O on the same stream. If you think that
HTTP has a bad design, you know where to complain ;)


--
Best regards,
 Bulat                            mailto:[hidden email]


Re: implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Marcin 'Qrczak' Kowalczyk
In reply to this post by Bulat Ziganshin-2
Bulat Ziganshin <[hidden email]> writes:

> This letter describes why I think that using a hand-made (de)coder to
> support UTF-8 encoded files is better than using iconv.

A Haskell recoder is fine, and probably a good idea for important
encodings, provided that wrapping a block recoder implemented in C
is not ruled out. The two approaches should coexist.

> 2) Einar once asked me about changing the encoding on the fly, which
> is needed for some HTML processing.

HTML can be parsed by treating it as ISO-8859-1 first, looking only for
the headers that specify the encoding, and then converting the whole
stream to the right encoding.
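A sketch of that first pass (my own toy scanner, nothing like production HTML parsing): treat the bytes as Latin-1 text and look only for a charset= declaration:

```haskell
import Data.Char (isAlphaNum, toLower)
import Data.List (isPrefixOf, tails)

-- scan a Latin-1-decoded prefix of the document for "charset=NAME";
-- real HTML parsing is messier, but the two-pass idea is the same
findCharset :: String -> Maybe String
findCharset s =
  case filter ("charset=" `isPrefixOf`) (tails (map toLower s)) of
    (t:_) -> Just (takeWhile (\c -> isAlphaNum c || c == '-') (drop 8 t))
    []    -> Nothing
```

For example, findCharset "<meta content=\"text/html; charset=UTF-8\">" gives Just "utf-8"; the stream is then re-decoded with that encoding in the second pass.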

> it is also possible that some program will need to intersperse text
> I/O with buffer/array/byte/bit I/O. This is the sort of thing that
> is absolutely impossible with iconv.

Of course it's possible.

HTTP specifies that headers end with an empty line. The boundary can
be found without decoding the text at all. Then the part before the
boundary is treated as ASCII text and converted to strings, and the
rest is binary.
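A sketch of this in code (my own toy version over plain byte lists, not tied to any particular stream API): find the CRLF CRLF boundary bytewise, decode the prefix as ASCII, and leave the remainder as raw bytes:

```haskell
import Data.Char (chr)
import Data.List (isPrefixOf)
import Data.Word (Word8)

-- Find the CRLF CRLF boundary without decoding anything, then treat
-- the prefix as ASCII headers and keep the remainder as raw bytes.
splitHttp :: [Word8] -> (String, [Word8])
splitHttp = go []
  where
    crlfcrlf = [13,10,13,10] :: [Word8]
    ascii    = map (chr . fromIntegral) . reverse
    go acc bs
      | crlfcrlf `isPrefixOf` bs = (ascii acc, drop 4 bs)   -- boundary found
      | (b:rest) <- bs           = go (b:acc) rest          -- still in headers
      | otherwise                = (ascii acc, [])          -- no boundary at all
```

The returned String holds the headers (everything before the blank line) and the [Word8] holds the untouched binary body, ready for whatever decoding the headers prescribe.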

Alternatively, the text can be read by decoding one character at
a time, and after the boundary is found, the rest is read from the
underlying binary stream. Even iconv can be used one character at a
time; it will only be inefficient, but here ASCII can be implemented
by hand.

Emitting HTTP is analogous.

> 3) My library supports Streams that work in ANY monad (not only IO,
> ST and their derivatives). It's impossible to implement iconv
> conversion for such stream types.

Which is good. It's impossible to implement a stateful encoding in a
monad which doesn't carry state.

> Moreover, there are implementation issues that make me more
> enthusiastic about the hand-made solution. It is simply already
> implemented and really works.

Your implementation doesn't detect unencodable or malformed input.

And I've already implemented both an IConv wrapper and some
hand-written encodings (but not for Haskell). They work too :-)

> using iconv will in any case be much more complex than using
> hand-made routines.

iconv is done once and tens of encodings become available at once.
Each would have to be hand-implemented separately.

--
   __("<         Marcin Kowalczyk
   \__/       [hidden email]
    ^^     http://qrnik.knm.org.pl/~qrczak/

Re: implementation of UTF-8 conversion for text I/O: iconv vs hand-made

Graham Klyne-2
In reply to this post by Bulat Ziganshin-2
FWIW, there's a fairly complete pure-Haskell UTF-8 converter implementation
in the HXML toolbox, which I "stole" and adapted for a version of HaXml; e.g.:

http://www.ninebynine.org/Software/HaskellUtils/HaXml-1.12/src/Text/XML/HaXml/Unicode.hs

(Please ignore me if I miss your point.)

#g
--


--
Graham Klyne
For email:
http://www.ninebynine.org/#Contact
