Reading files efficiently

Bugzilla from 1@234.cx
I've got another n00b question, thanks for all the help you have been
giving me!

I want to read a text file.  As an example, let's use
/usr/share/dict/words and try to print out the last line of the file.
First of all I came up with this program:

import System.IO
main = readFile "/usr/share/dict/words" >>= putStrLn.last.lines

This program gives the following error, presumably because there is an
ISO-8859-1 character in the dictionary:
"Program error: <handle>: IO.getContents: protocol error (invalid
character encoding)"

How can I tell the Haskell system that it is to read ISO-8859-1 text
rather than UTF-8?

I then used iconv to convert the file to UTF-8 and tried again.  This
time it worked, but it seems horribly inefficient -- Hugs took 2.8
seconds to read a 96,000 line file.  By contrast the equivalent Python
program:

print open("words", "r").readlines()[-1]

took 0.05 seconds.  I assume I must be doing something wrong here, and
somehow causing Haskell to use a particularly inefficient algorithm.
Can anyone give me any clues what I should be doing instead?

Thanks again,
Pete
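
On the encoding question above: the thread itself does not answer it for Hugs. As a minimal sketch, assuming a later GHC whose System.IO exports hSetEncoding and latin1 (an assumption beyond this thread, which predates them), the handle can be told to decode ISO-8859-1 before the contents are read. The names lastLineOf and lastLineLatin1 are made up for illustration.

```haskell
import System.IO

-- | Last line of a text, or "" for empty input (hypothetical helper).
lastLineOf :: String -> String
lastLineOf s = case lines s of
  [] -> ""
  ls -> last ls

-- | Read a file as ISO-8859-1 and return its last line.
-- Assumes a GHC new enough to provide hSetEncoding and latin1.
lastLineLatin1 :: FilePath -> IO String
lastLineLatin1 path =
  withFile path ReadMode $ \h -> do
    hSetEncoding h latin1
    contents <- hGetContents h
    let result = lastLineOf contents
    -- hGetContents is lazy: force the result before withFile closes the handle.
    length result `seq` return result

main :: IO ()
main = lastLineLatin1 "/usr/share/dict/words" >>= putStrLn
```

Setting the encoding on the handle keeps the decoding choice next to the file it applies to, rather than relying on the process locale.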

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Reading files efficiently

Donald Bruce Stewart

a) Compile your code with GHC instead of interpreting it. GHC is blazing fast.

    $ ghc -O A.hs
    $ time ./a.out
    Zyzzogeton
    ./a.out  0.23s user 0.01s system 91% cpu 0.257 total

b) If not satisfied with the result, use packed strings (as Python does).

http://www.cse.unsw.edu.au/~dons/fps.html

    import qualified Data.FastPackedString as P
    import IO
    main = P.readFile "/usr/share/dict/words" >>= P.hPut stdout . last . P.lines

    $ ghc -O2 -package fps B.hs
    $ time ./a.out
    Zyzzogeton./a.out  0.04s user 0.02s system 86% cpu 0.063 total

0.06s is ok with me  :)

-- Don
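
The fps library used above was, as far as we know, later released as the bytestring package, so with a current GHC the same idea would be written against Data.ByteString.Char8. The following is a sketch under that assumption, not the code from this post; lastLine is a hypothetical helper name.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Last line of a packed string, or empty for empty input
-- (hypothetical helper, not part of the library).
lastLine :: B.ByteString -> B.ByteString
lastLine contents = case B.lines contents of
  [] -> B.empty
  ls -> last ls

main :: IO ()
main = B.readFile "/usr/share/dict/words" >>= B.putStrLn . lastLine
```

Because the file is read as raw bytes, no character decoding takes place, so ISO-8859-1 bytes in the dictionary pass through unchanged instead of raising an encoding error.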

Re: Reading files efficiently

Donald Bruce Stewart

Faster still: don't split the file into lines at all. Here we're following the
"How to optimise Haskell code by posting to haskell-cafe@" law:

import qualified Data.FastPackedString as P
import IO

main = do P.readFile "/usr/share/dict/words" >>= P.hPut stdout . snd .  P.spanEnd (/='\n') . P.init
          putChar '\n'

$ time ./a.out
Zyzzogeton
./a.out  0.00s user 0.01s system 60% cpu 0.013 total
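
The same trick can be sketched with Data.ByteString.Char8, assuming the bytestring package is the successor to fps. This variant also guards the two cases the one-liner above does not handle: an empty file (where init would fail) and a file with no trailing newline. The name lastLineNoSplit is invented for this example.

```haskell
import qualified Data.ByteString.Char8 as B

-- | Take the last line by scanning backwards from the end of the
-- buffer, without building a list of all lines (hypothetical helper).
lastLineNoSplit :: B.ByteString -> B.ByteString
lastLineNoSplit contents = snd (B.spanEnd (/= '\n') trimmed)
  where
    -- Drop one trailing newline, if there is one, so the scan starts
    -- inside the last line rather than after it.
    trimmed
      | not (B.null contents) && B.last contents == '\n' = B.init contents
      | otherwise = contents

main :: IO ()
main = B.readFile "/usr/share/dict/words" >>= B.putStrLn . lastLineNoSplit
```

Only the bytes of the final line are examined after the file is loaded, which is why this version does less work than splitting the whole file.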

Re: Reading files efficiently

Bugzilla from 1@234.cx
Donald Bruce Stewart wrote:

> a) Compile your code with GHC instead of interpreting it. GHC is blazing fast.

That's one answer I suppose!  I quite liked using Hugs for that
particular program because it's a script that I didn't want to spend
time compiling.  Oh well, it's not that important.

I did notice that the script runs much quicker with runghc rather than
runhugs.  Is there any way of making runghc work with a script whose
name doesn't end ".hs"?

> b) If not satisfied with the result, use packed strings (as Python does).

Good suggestion, thanks.

Pete

Re: Re: Reading files efficiently

Donald Bruce Stewart
> Donald Bruce Stewart wrote:
>
> >a) Compile your code with GHC instead of interpreting it. GHC is blazing fast.
>
> That's one answer I suppose!  I quite liked using Hugs for that
> particular program because it's a script that I didn't want to spend
> time compiling.  Oh well, it's not that important.

You can use 'ghci' as well -- it's much like Hugs.
 
> I did notice that the script runs much quicker with runghc rather than
> runhugs.  Is there any way of making runghc work with a script whose
> name doesn't end ".hs"?

Well, I know this works:

    $ cat A.lhs
    #!/usr/bin/env runhaskell
    > main = putStrLn "gotcha!"

    $ ./A.lhs
    gotcha!

But for files with no .hs or .lhs extension? Anyone know of a trick?

-- Don

Re: Reading files efficiently

Simon Marlow-5
Donald Bruce Stewart wrote:

> Well, I know this works:
>
>     $ cat A.lhs
>     #!/usr/bin/env runhaskell
>     > main = putStrLn "gotcha!"
>
>     $ ./A.lhs
>     gotcha!
>
> But for files with no .hs or .lhs extension? Anyone know of a trick?

GHC 6.6 will allow this, because we added the -x flag (works just like
gcc's -x flag), e.g. "ghc -x hs foo.wibble" will interpret foo.wibble as
a .hs file.  I have an uncommitted patch for runghc that uses -x; I need
to test & commit it.

Cheers,
        Simon

Re: Reading files efficiently

Bugzilla from 1@234.cx
Simon Marlow wrote:

> GHC 6.6 will allow this, because we added the -x flag (works just like
> gcc's -x flag).  eg. "ghc -x hs foo.wibble" will interpret foo.wibble as
> a .hs file.  I have an uncommitted patch for runghc that uses -x, I need
> to test & commit it.

Ah, that will be very useful, thanks!

You may already know this, but there is an oddity with shell scripts that
can make it difficult to pass flags like -x in a useful way.  It's
easiest to show this by an example.  First create a shell script called
bar, containing one line:

#!./foo -x -y

Now create foo from foo.c:

#include <stdio.h>

int
main (int argc, char *argv[])
{
   int i;

   /* Print each argument on its own line to show how argv was split. */
   for (i = 0; i < argc; i++)
     printf ("argv[%d] = %s\n", i, argv[i]);

   return 0;
}

Now run bar:

$ ./bar -a -b
argv[0] = ./foo
argv[1] = -x -y
argv[2] = ./bar
argv[3] = -a
argv[4] = -b

Notice how -x and -y have ended up in the same element of argv even
though they were meant to be separate arguments.  -a and -b were fine
because that line was processed by the shell, which saw the space
between them and split them up.  The #! line is not processed by the
shell but by the kernel, which does no word splitting: everything
after the command name ends up in a single argument.

Of course this isn't a disaster, it just means that programs which
accept arguments in the #! line have to be careful how they parse them.

Pete
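
One way a program could be "careful" in the sense described above is to re-split its first argument when it contains spaces. This is only a sketch of that idea: splitShebangArgs is a hypothetical helper, and the heuristic would wrongly split a genuine first argument that contains a space.

```haskell
import System.Environment (getArgs)

-- | Re-split a first argument that arrived as one string from a #! line.
-- Heuristic only: any space in the first argument triggers the split.
splitShebangArgs :: [String] -> [String]
splitShebangArgs (first : rest)
  | ' ' `elem` first = words first ++ rest
splitShebangArgs args = args

main :: IO ()
main = getArgs >>= mapM_ putStrLn . splitShebangArgs
```

Note that getArgs does not include the program name, so for the bar example the input list would start with "-x -y" and the result would start with "-x" and "-y" as separate elements.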
