Excess mem consumption in file IO task

Moritz Tacke
Hi!

I am running into resource problems when extracting data from a file.
The task is as follows: I have a huge (500 MB) binary file containing
some interesting parts and lots of rubbish. There is also a directory
that gives the first and last byte index of each part of the file that
contains a substring I need. My approach is to open the file and pass
the handle, along with the list of addresses, to a function that
processes the list step by step. For each entry it calls a subfunction
that uses the handle to seek to the start of the interesting block,
reads the block into a ByteString (lazy or strict made no difference
here) and calls the function that scans this ByteString for the
interesting part. The resulting data structure has an approximate size
of 10 MB, yet the program uses hundreds of megabytes of RAM, which
forces my computer to swap (with the obvious results...).
I currently have two main suspects. First, the recursive function is
tail-recursive, but I don't know whether the usual way of writing such
functions (with an accumulator etc.) works in monadic code. (The stage
is, of course, the IO monad, and I am using do-notation because I don't
like the only other way I know: writing lambda after lambda into the
function body.) Second, the file handle is passed around and byte
strings are then read from it. Are those strings somehow attached to
the handle? Does the handle work differently than I expect, i.e. is it
copied when it is used as an argument to another function, and is there
something like a registry of handles that keeps the connection alive
and therefore excludes the (handle, string) chunk from garbage
collection?
I have, of course, been experimenting with the "seq" function but,
honestly, I am not sure whether I got it right. Does a call to
"identity $! (function arguments ...)" force the full evaluation of
the function?
Greetings!

        Moritz

Re: Excess mem consumption in file IO task

Ertugrul Söylemez
"Moritz Tacke" <[hidden email]> wrote:

> I have some resource problems when extracting data from a file. The
> task is as follows: I have a huge (500MB) binary file, containing some
> interesting parts and lots of rubbish. Furthermore, there is a
> directory that tells me the parts of the file (first- and last byte
> index) that contain the substrings I need. My approach to do this is
> to open the file and to pass the list of addresses along with the
> handle to a function that processes the list step-by-step and calls a
> subfunction which uses the handle to seek the start position of the
> interesting block, reads the block into a bytestring (lazy or not,
> didn't make any difference here) and calls the function that scans
> this byte string for the interesting part. Using this approach - which
> results in a data structure with an approximate size of 10 MB - the
> program uses hundreds of megabytes of RAM, which forces my computer to
> swap (with the obvious results...).

You may want to post the relevant parts of your source code on
hpaste.org for reference.
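
Until the real code is available, here is a minimal sketch of what such
a loop might look like, only to have something concrete to discuss. All
names (Range, readBlock, processAll, extractFromFile) and the scanner
argument are assumptions, not the original program. It reads each block
strictly with hSeek and B.hGet and copies the extracted piece. The copy
matters because a slice of a strict ByteString shares the buffer of the
block it was cut from, so keeping a few small slices can keep every
block alive. That is one possible cause of the memory use, not a
diagnosis.

```haskell
import qualified Data.ByteString as B
import System.IO
  ( Handle, IOMode (ReadMode), SeekMode (AbsoluteSeek), hSeek, withBinaryFile )

-- Hypothetical directory entry: first and last byte index, both inclusive.
type Range = (Integer, Integer)

-- Seek to the start of the block and read it as a strict ByteString.
readBlock :: Handle -> Range -> IO B.ByteString
readBlock h (first, final) = do
  hSeek h AbsoluteSeek first
  B.hGet h (fromIntegral (final - first + 1))

-- Tail-recursive loop over the directory. `scan` stands in for the
-- (unknown) function that picks the interesting part out of a block.
processAll :: (B.ByteString -> B.ByteString) -> Handle -> [Range] -> IO [B.ByteString]
processAll scan h = go []
  where
    go acc []       = return (reverse acc)
    go acc (r : rs) = do
      block <- readBlock h r
      -- B.copy gives the piece its own buffer, so `block` can be collected.
      let piece = B.copy (scan block)
      -- Force the copy now; otherwise the unevaluated thunk retains `block`.
      piece `seq` go (piece : acc) rs

-- Convenience wrapper that opens the file in binary mode.
extractFromFile :: (B.ByteString -> B.ByteString) -> FilePath -> [Range] -> IO [B.ByteString]
extractFromFile scan path ranges =
  withBinaryFile path ReadMode (\h -> processAll scan h ranges)
```

With a scanner such as B.take 2, extractFromFile would return the first
two bytes of every listed block, and each result holds only its own two
bytes rather than the whole block it came from.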


> I have right now two main suspects: The recursive function is
> tail-recursive, but I don't know whether the usual way to write these
> functions (with an accumulator etc) works in monadic code (the stage
> is, of course, the IO monad, and I am using the do-notation as I don't
> like the only other way I know, writing lambdas and lambdas and
> lambdas into the function body). The other problem I can imagine is
> the passing-around of the file handle, and the subsequent reading of
> byte strings: Are those strings somehow attached to the handle, and
> does the handle work in a different way than I expected, i.e. is the
> handle copied while using it as an argument for another function, and
> exists something like a register of handles that keeps the connection
> upright and, therefore, excludes the (handle, string)-chunk from
> garbage collection?

Usually no, unless you read the file with a lazy read function like
hGetContents.  And do-notation is equivalent to writing the binds out by
hand with (>>=) and lambdas.  When compiling, do-notation is simply
translated into that form, so choosing one over the other does not
change how your recursion behaves.
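
To illustrate the equivalence, here is a small sketch of a tail-recursive
accumulator loop in IO, written once with do-notation and once with the
binds spelled out. The function names and the [IO Int] argument are made
up for the example. The `seq` on the new accumulator is there so that
the running total is evaluated at every step instead of piling up as a
chain of unevaluated additions.

```haskell
-- Accumulator loop in do-notation. Each action yields a number
-- (for example the size of one block) that is added to the total.
sumBlocksDo :: Int -> [IO Int] -> IO Int
sumBlocksDo acc []            = return acc
sumBlocksDo acc (step : rest) = do
  n <- step
  let acc' = acc + n
  acc' `seq` sumBlocksDo acc' rest

-- The same loop with the do-block translated by hand:
-- `n <- step` becomes `step >>= \n -> ...`, `let` becomes `let ... in`.
sumBlocksBind :: Int -> [IO Int] -> IO Int
sumBlocksBind acc []            = return acc
sumBlocksBind acc (step : rest) =
  step >>= \n ->
    let acc' = acc + n
    in  acc' `seq` sumBlocksBind acc' rest
```

Both versions run the actions in the same order and return the same
total, so the accumulator style works in monadic code just as it does in
pure code.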


> I have, of course, been experimenting with the "seq" - function, but,
> honestly, I am not sure whether I got it right. Does a call to
> "identity $! (function arguments ...)" force the full evaluation of
> the function?

No,

  a `seq` b

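says that 'a' is forced before 'b' is returned, but only to weak head
normal form, i.e. up to its outermost constructor.  Likewise, f $! x
forces x only that far before applying f.  The function itself may
still treat its arguments lazily, which makes a difference when it's
recursive.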
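
A small sketch of what this means in practice. The helper names
(tryError, whnfOk, sumStrict) are invented for the example. whnfOk
reports whether a value can be evaluated to its outermost constructor
without hitting an `error` call, and sumStrict shows ($!) used on an
accumulator in a recursive function.

```haskell
import Control.Exception (ErrorCall, evaluate, try)

-- `try` specialised to the exception that `error` throws.
tryError :: IO a -> IO (Either ErrorCall a)
tryError = try

-- True if the value can be evaluated to its outermost constructor
-- without running into a call to `error`.
whnfOk :: a -> IO Bool
whnfOk x = do
  r <- tryError (evaluate x)
  return (either (const False) (const True) r)

-- Strict accumulator: ($!) forces each new total before the recursive
-- call. For an Int, the outermost constructor is the whole value.
sumStrict :: Int -> [Int] -> Int
sumStrict acc []       = acc
sumStrict acc (x : xs) = (sumStrict $! acc + x) xs
```

Here whnfOk (error "boom" :: Int) gives False, while
whnfOk (1 :: Int, error "boom" :: Int) gives True: the pair constructor
is already there, so the broken second component is never touched. An
accumulator that is a tuple or a record therefore needs its fields
forced separately; seq or ($!) on the whole value is not enough.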


Greets,
Ertugrul.


--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://blog.ertes.de/