Hexdump

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
8 messages Options
Reply | Threaded
Open this post in threaded view
|

Hexdump

Dominic Steinitz
All,

I finally got around to putting a "hexdump" function into a module. Any
suggestions as to where it should live in the hierarchy?

Dominic.

module Hex(
   ppHex
   ) where

import Data.List
import Codec.Utils
import Numeric
import Text.PrettyPrint

split :: Int -> [a] -> [[a]]
split n xs = unfoldr (g n) xs

g :: Int -> [a] -> Maybe ([a],[a])
g n [] = Nothing
g n y  = Just (splitAt n y)

sh x | x < 16    = '0':(showHex x "")
     | otherwise = showHex x ""

type OctetsPerLine = Int

ppHex :: OctetsPerLine -> [Octet] -> Doc
ppHex n =
   vcat .
   map hcat .
   map (intersperse colon) .
   map (map (text . sh)) .
   split n

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

UTF8 (was Re: Hexdump)

Malcolm Wallace
Dominic Steinitz <[hidden email]> wrote:

> I finally got around to putting a "hexdump" function into a module.
> Any  suggestions as to where it should live in the hierarchy?

How about
    Codec.Hexdump
    Codec.Text.Hexdump
    Text.Codec.Hexdump

I was also thinking it would be nice to have pure Haskell
implementations of the various Unicode encodings.  Here is my attempt at
the UTF-8 codec.  Like Dominic, I am wondering where such a module
should live in the hierarchy?

    Text.Codec.UTF8
    Codec.Text.UTF8
    Text.UTF8
    Codec.UTF8

Regards,
    Malcolm

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

UTF8.hs (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

RE: UTF8 (was Re: Hexdump)

Bayley, Alistair
> From: [hidden email]
> [mailto:[hidden email]] On Behalf Of Malcolm Wallace
>
> I was also thinking it would be nice to have pure Haskell
> implementations of the various Unicode encodings.  Here is my
> attempt at
> the UTF-8 codec.  Like Dominic, I am wondering where such a module
> should live in the hierarchy?
>
>     Text.Codec.UTF8
>     Codec.Text.UTF8
>     Text.UTF8
>     Codec.UTF8
>
> Regards,
>     Malcolm


My initial instinct is to put the UTF codecs in Foreign.C
(Foreign.C.Unicode, say, or Foreign.C.Codec), because I've only used
ever used them to marshall between Haskell Strings and CStrings. Also, a
bunch o' functions like withUTF8CString{Len}, peekUTF8CString, etc are
also useful.

OTOH, I can think of uses for non-CString oriented codecs (e.g.
converting an HTTP request body from a specific charset to Unicode), so
maybe this is a dumb idea.

Alistair
*****************************************************************
Confidentiality Note: The information contained in this message,
and any attachments, may contain confidential and/or privileged
material. It is intended solely for the person(s) or entity to
which it is addressed. Any review, retransmission, dissemination,
or taking of any action in reliance upon this information by
persons or entities other than the intended recipient(s) is
prohibited. If you received this in error, please contact the
sender and delete the material from any computer.
*****************************************************************
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 (was Re: Hexdump)

Malcolm Wallace
In reply to this post by Malcolm Wallace
Oops, I wrote:

fromUTF8 (w:ws)
    | w <  0x80  {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
    | w >= 0xc0  {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
    | w >= 0xe0  {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
    | w >= 0xf0  {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
    | w >= 0xf8  {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
    | w >= 0xfc  {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws

which should of course have been

fromUTF8 (w:ws)
    | w <  0x80  {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
    | w >= 0xfc  {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
    | w >= 0xf8  {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
    | w >= 0xf0  {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
    | w >= 0xe0  {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
    | w >= 0xc0  {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws

Regards,
    Malcolm
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 (was Re: Hexdump)

Simon Marlow-5
In reply to this post by Malcolm Wallace
Malcolm Wallace wrote:

> Dominic Steinitz <[hidden email]> wrote:
>
>
>>I finally got around to putting a "hexdump" function into a module.
>>Any  suggestions as to where it should live in the hierarchy?
>
>
> How about
>     Codec.Hexdump
>     Codec.Text.Hexdump
>     Text.Codec.Hexdump
>
> I was also thinking it would be nice to have pure Haskell
> implementations of the various Unicode encodings.  Here is my attempt at
> the UTF-8 codec.  Like Dominic, I am wondering where such a module
> should live in the hierarchy?
>
>     Text.Codec.UTF8
>     Codec.Text.UTF8
>     Text.UTF8
>     Codec.UTF8

FWIW, our current draft hierarchy has this in Codec.Text.UTF8.

(see http://darcs.haskell.org/ghc/libraries/doc/lib-hierarchy.html).

Cheers,
        Simon
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 (was Re: Hexdump)

Simon Marlow-5
In reply to this post by Malcolm Wallace
Malcolm Wallace wrote:

> Oops, I wrote:
>
> fromUTF8 (w:ws)
>     | w <  0x80  {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
>     | w >= 0xc0  {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
>     | w >= 0xe0  {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
>     | w >= 0xf0  {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
>     | w >= 0xf8  {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
>     | w >= 0xfc  {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws
>
> which should of course have been
>
> fromUTF8 (w:ws)
>     | w <  0x80  {- 0xxxxxxx -} = toEnum (fromEnum w) : fromUTF8 ws
>     | w >= 0xfc  {- 1111110x -} = bytes 5 (fromEnum (w`mask`0x01)) ws
>     | w >= 0xf8  {- 111110xx -} = bytes 4 (fromEnum (w`mask`0x03)) ws
>     | w >= 0xf0  {- 11110xxx -} = bytes 3 (fromEnum (w`mask`0x07)) ws
>     | w >= 0xe0  {- 1110xxxx -} = bytes 2 (fromEnum (w`mask`0x0f)) ws
>     | w >= 0xc0  {- 110xxxxx -} = bytes 1 (fromEnum (w`mask`0x1f)) ws

Getting a UTF-8 decoder right is quite non-trivial.  Take a look at this:

   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

I made a half-hearted attempt to get most of this right in GHC's UTF-8
decoder, but by no means all of it is implemented.  I do think it would
be nice if the Haskell implementation was correct, for some value of
correct, though.

Cheers,
        Simon
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 (was Re: Hexdump)

Ashley Yakeley
Simon Marlow wrote:

> Getting a UTF-8 decoder right is quite non-trivial.  Take a look at this:
>
>   http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
>
> I made a half-hearted attempt to get most of this right in GHC's UTF-8
> decoder, but by no means all of it is implemented.  I do think it would
> be nice if the Haskell implementation was correct, for some value of
> correct, though.

There are a couple of implementations here:
<http://cvs.sourceforge.net/viewcvs.py/haskell-i18n/Source/Text/Encoding/>

SF project:
<http://sourceforge.net/projects/haskell-i18n>

It hasn't been touched for a couple of years.

--
Ashley Yakeley, Seattle WA
WWED? http://www.cs.utexas.edu/users/EWD/

_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries
Reply | Threaded
Open this post in threaded view
|

Re: UTF8 (was Re: Hexdump)

Bulat Ziganshin-2
In reply to this post by Malcolm Wallace
Hello Malcolm,

Tuesday, March 21, 2006, 7:07:53 PM, you wrote:

> I was also thinking it would be nice to have pure Haskell
> implementations of the various Unicode encodings.  Here is my attempt at
> the UTF-8 codec.

UTF-8 codecs are migrating from app to app, you can find such code in
the ghc, jhc, darcs... all these codecs use the ([Char] <-> [Word8])
conversion that is both slow (because lists are lazy) and can't be
used in non-list environment (how, for example, we can read enough
bytes to decode just one Char?). in my Streams library, i used
higher-order monadic functions to implement encodings.

In my model, encoder is just a higher-order function that accepts as
parameter function (putByte :: (Monad m) => Word8 -> m ()) and uses it to
implement (putChar :: (Monad m) => Char -> m ()) operation, so each
encoder has type:
utf8Encode :: (Monad m) => (Word8 -> m ()) -> Char -> m ()

In the same fashion, each decoder accepts parameter of functional type
(getByte :: (Monad m) => m Word8), and uses it to implement
(getChar :: (Monad m) => m Char) operation, so the whole decoder has
type:
utf8Decode :: (Monad m) => m Word8 -> m Char

Using these higher-order functions allows me to implement both UTF8
(and any other) encoding for text streams and UTF8 encoding for
serializing strings/chars in binary i/o module. i attached this
module.

--
Best regards,
 Bulat                            mailto:[hidden email]
_______________________________________________
Libraries mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/libraries

CharEncoding.hs (5K) Download Attachment