Lazy file IO & Space leaks/waste


Lazy file IO & Space leaks/waste

Aleksandar Dimitrov
Hello list,

I'm currently writing a small linguistic corpus analyzer. My input file is only
25 MB, but profiling shows that the total allocation over the program's runtime
is several GB. That's a little too much, and on top of that the program is
abysmally slow, so I'm suspecting a space leak somewhere.

I'm using the Data.ByteString.Lazy.Char8 module in order to work efficiently
with lazy IO, and I must admit that I'm very inexperienced at predicting the
runtime and space behaviour of lazy IO :-( It worked well in the past, but I'm
stuck now.

The program can be found here:
http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863

The important bits are as follows:

> mf :: [C.ByteString] -> StdWord
> mf [] = Word [] C.empty
> mf s = Word (tail s) (head s)
>
> f' = mf . reverse . C.words

> main :: IO ()
> main = do
>     corpus_name <- liftM head getArgs
>     corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
>     print $ length (content corpus)
>     let interesting = filterForInterestingTags interestingTags corpus
>     print $ show (freqMap interesting)

freqMap also uses foldr' to fold the corpus it is given into a Data.Map, but
according to the profiling output, that's not where the bad stuff happens.
Here are the interesting parts of the profiling output:

>    analyse +RTS -p -RTS corpus/bnc-samp.tt
>
> total time  =       30.46 secs   (1523 ticks @ 20 ms)
> total alloc = 14,871,619,904 bytes  (excludes profiling overheads)
>
> COST CENTRE                 MODULE no. entries  %time %alloc %time %alloc
>
> MAIN                        MAIN     1        0   0.0    0.0 100.0  100.0
>   main                      Main   221        0   0.0    0.0   0.0    0.0
>  CAF                        Main   206       14   0.1    0.1 100.0  100.0
>   showsPrec_aSF             Main   223        1   0.0    0.0   0.0    0.0
>   interestingTags           Main   218        1   0.0    0.0   0.0    0.0
>   f'                        Main   216        1   0.0    0.0   0.0    0.0
>    mf                       Main   217        1   0.0    0.0   0.0    0.0
>   main                      Main   212        1   2.4    4.4  99.9   99.9
>    f'                       Main   219        0  89.0   91.5  90.2   92.7
>     mf                      Main   220  2427450   1.2    1.2   1.2    1.2
>    filterForInterestingTags Main   215        1   5.4    0.1   5.4    0.1
>    freqMap                  Main   214        1   0.6    0.5   2.0    2.7
>     compare_aRL             Main   222  1141996   1.4    2.2   1.4    2.2

The main cost centre is clearly f' (which I factored out in order to view it as
a separate cost centre). It's not the call to reverse that makes it slow
(removing it doesn't affect the run time). Interestingly, the program prints the
first statement (the length) very quickly, then takes a long time. When I
removed the call to freqMap, GHC was apparently smart enough to drop the call to
f' completely, because only the length ever gets evaluated. But freqMap also
needs the processing done by f', and that's where the bad stuff starts
happening.

Should I look into DeepSeq? I've tried adding strictness to the functions by
hand, and also to the data structures, but that hasn't helped so far. So I'm
looking for ways to make this faster. I'm guessing that a smart `seq` somewhere
might help, but I don't know where :-\
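[Editor's note: for reference, a minimal, self-contained sketch of what adding
strictness to the data type might look like. The real StdWord/Word definitions
live in the linked paste, so the types below are assumptions; note that a bang
on a field only forces it to weak head normal form, not deeply.]

```haskell
-- Hypothetical strict variant of the poster's type; the real definitions
-- are in the linked paste, so treat these as assumptions.
import qualified Data.ByteString.Char8 as C

data StdWord = Word ![C.ByteString] !C.ByteString  -- strict fields (WHNF only)
  deriving Show

-- Pattern matching avoids the separate head/tail calls in the original mf.
mf :: [C.ByteString] -> StdWord
mf []     = Word [] C.empty
mf (w:ws) = Word ws w

f' :: C.ByteString -> StdWord
f' = mf . reverse . C.words

main :: IO ()
main = print (f' (C.pack "dog NN"))  -- Word ["dog"] "NN"
```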

As a side note: how can I find out what the run-time data structures look like
in memory? I.e. I want to know *what* exactly stays around as thunks so that I
can focus on where to add strictness or redirect the program flow. Ideally,
only a very small part of the file should ever be in memory, with processing
happening incrementally!
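[Editor's note: as an illustration of the incremental shape aimed for here, a
strict left fold over the lines keeps only the accumulator live. This is a
base-only sketch; countTokens is a made-up stand-in for the real per-line
analysis.]

```haskell
import Data.List (foldl')

-- Stream over the lines with a strict fold: only the running total needs
-- to stay in memory, not the lines already consumed.
-- (countTokens is a hypothetical stand-in for the real analysis.)
countTokens :: String -> Int
countTokens = foldl' (\acc line -> acc + length (words line)) 0 . lines

main :: IO ()
main = print (countTokens "the cat sat\non the mat\n")  -- prints 6
```

Because foldl' forces the accumulator at each step and the Int is fully
evaluated at WHNF, no chain of suspended additions builds up.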

Thanks for any pointers!
Aleks

Lazy file IO & Space leaks/waste

Nicolas Pouillard
Excerpts from Aleksandar Dimitrov's message of Thu Nov 05 12:01:11 +0100 2009:

> Hello list,
>
> I'm currently writing a small linguistic corpus analyzer. [...]
>
> The program can be found here:
> http://hpaste.org/fastcgi/hpaste.fcgi/view?id=11863#a11863

I've added a revision of your code with some highlights. The main suggestion is
to use foldl' instead of foldr'. Without the input text, however, it's hard to
help improve things further.
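[Editor's note: to illustrate the foldl' suggestion, a minimal sketch of a
frequency map built with a strict left fold. freqMap here is an assumed
simplification of the poster's function, using Data.Map.Strict so counts are
forced as they are inserted.]

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- foldl' forces the accumulating Map at each step, so no chain of
-- suspended insertions builds up while traversing the word list.
freqMap :: Ord a => [a] -> M.Map a Int
freqMap = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty

main :: IO ()
main = print (freqMap (words "a b a c b a"))  -- fromList [("a",3),("b",2),("c",1)]
```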

--
Nicolas Pouillard
http://nicolaspouillard.fr

Re: Lazy file IO & Space leaks/waste

Heinrich Apfelmus
In reply to this post by Aleksandar Dimitrov
Aleksandar Dimitrov wrote:

> The important bits are as follows:
>
>> mf :: [C.ByteString] -> StdWord
>> mf [] = Word [] C.empty
>> mf s = Word (tail s) (head s)
>>
>> f' = mf . reverse . C.words
>
>> main :: IO ()
>> main = do
>>     corpus_name <- liftM head getArgs
>>     corpus <- liftM (Corpus . (map f') . C.lines) $ C.readFile corpus_name
>>     print $ length (content corpus)
>>     let interesting = filterForInterestingTags interestingTags corpus
>>     print $ show (freqMap interesting)
>
>  [...]
>
> Ideally, only a very smart part of the file should ever be in memory, with
> processing happening incrementally!

The

    print $ length (content corpus)

statement seems to contradict your goal? After all, the whole file is read into
the  corpus  variable just to calculate its  length .
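[Editor's note: one way around that, sketched here with plain Strings and
assumed simplifications rather than the poster's actual types, is to compute
the length and the frequency map in a single strict pass, so neither traversal
has to retain the whole input for the other.]

```haskell
{-# LANGUAGE BangPatterns #-}
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- One strict pass computing both the line count and a word-frequency map.
-- (A simplified stand-in for the poster's corpus/freqMap machinery.)
analyse :: String -> (Int, M.Map String Int)
analyse = foldl' step (0, M.empty) . lines
  where
    step (!n, !m) line =
      (n + 1, foldl' (\mm w -> M.insertWith (+) w 1 mm) m (words line))

main :: IO ()
main = print (analyse "a b\na\n")  -- (2,fromList [("a",2),("b",1)])
```

The bang patterns on the pair's fields matter: foldl' alone only forces the
pair constructor to WHNF, which would still leave the count and the map as
growing thunks.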


Regards,
apfelmus

--
http://apfelmus.nfshost.com