[newbie] processing large logs

classic Classic list List threaded Threaded
11 messages Options
Reply | Threaded
Open this post in threaded view
|

[newbie] processing large logs

Eugene Crosser
Hello gentlemen,

I am exposed to functional programming for less than a month, and just
trying to understand the concepts, so please bear with me.

I tried to use Haskell for a simple task on my dayjob, that involved
parsing mail system logs and counting distinct addresses (I work for a
relatively big ISP).

Reducing unnecessary details, let's say that we have a file of N lines,
some of them repeating, so that there are only M distinct lines, where M
<< N.  The task is to count the number of times each distinct line
appears in the file, and print the most frequent one with its count.

This is my program:
========
module Main where
import Data.Map
main = printMax . (foldr processLine empty) . lines =<< getContents
processLine line map = insertWith (\new old -> new + old) line 1 map
printMax map = putStrLn $ show $ foldWithKey
   (\key val accum -> if val > (snd accum) then (key,val) else accum)
       ("",0) map
========
The thing kinda works on small data sets, but if you feed it with
250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
500,000 lines I get "*** Exception: stack overflow" (using runhaskell
from ghc 6.2.4).  (For comparison, a perl script does the same job
(using global hash) an order of magnitude faster, consumes 3 Mb RAM and
can process billion lines without a problem).

The question is: what I am doing wrong?

Thanks
Eugene

crosser@pccross:~/src$ perl genlist.pl 500000 1000|time runhaskell distinct.hs *** Exception: stack overflow
Command exited with non-zero status 1
27.46user 0.49system 0:29.75elapsed 93%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (0major+34676minor)pagefaults 0swaps

crosser@pccross:~/src$ perl genlist.pl 250000 1000|time runhaskell distinct.hs
("a0000000531",300)
36.66user 0.72system 0:54.82elapsed 68%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (522major+50361minor)pagefaults 0swaps

crosser@pccross:~/src$ perl genlist.pl 250000 1000|time perl distinct.pl
a0000000531: 300
0.22user 0.00system 0:01.14elapsed 20%CPU (0avgtext+0avgdata 0maxresident)k
0inputs+0outputs (15major+472minor)pagefaults 0swaps


module Main where
import Data.Map
main = printMax . (foldr processLine empty) . lines =<< getContents
processLine line map = insertWith (\new old -> new + old) line 1 map
printMax map = putStrLn $ show $ foldWithKey
        (\key val accum -> if val > (snd accum) then (key,val) else accum)
                ("",0) map

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

distinct.pl (146 bytes) Download Attachment
genlist.pl (133 bytes) Download Attachment
signature.asc (262 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Udo Stenzel
Eugene Crosser wrote:

> This is my program:
> ========
> module Main where
> import Data.Map
> main = printMax . (foldr processLine empty) . lines =<< getContents
> processLine line map = insertWith (\new old -> new + old) line 1 map
> printMax map = putStrLn $ show $ foldWithKey
>    (\key val accum -> if val > (snd accum) then (key,val) else accum)
>        ("",0) map
> ========
> The thing kinda works on small data sets, but if you feed it with
> 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> 500,000 lines I get "*** Exception: stack overflow"
Your program isn't strict enough.  While you expect it to keep a
"running total" in the map which is updated with each new line, it
really only creates lots of thunks that are only evaluated when the
result is demanded.  These thunks are as large as the input plus
overhead.

You have to force the evaluation of intermediate results.  To do so, you
have to replace foldr by foldl (foldr is just recursion, foldl is
accumulator recursion), then use the strict variant of that, and then
evaluate all values before putting them into the map.  In summary, this
should work (untested code, note the use of foldl'):

main = printMax . (foldl' processLine empty) . lines =<< getContents
processLine map line =
    let total = findWithDefault 0 line map + 1
    in total `seq` insert line total map
   

Yes, this is all terribly non-obvious.  It takes time until you see
where lazyness is going to hurt you, and you'll easily overlook some
such situations.  I also think, it's an unfortunate oversight that
insertWith is lazy and that there's no way to make it strict as a mere
user of Data.Map.


Udo.
--
"Guy Steele leads a small team of researchers in Burlington,
Massachusetts, who are taking on an _enormous_challenge_ -- create a
programming language better than Java."
        -- Sun.Com (emphasis by Paul Graham)

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Evan Martin
In reply to this post by Eugene Crosser
On 5/14/06, Eugene Crosser <[hidden email]> wrote:
> main = printMax . (foldr processLine empty) . lines =<< getContents
> [snip]
> The thing kinda works on small data sets, but if you feed it with
> 250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> 500,000 lines I get "*** Exception: stack overflow" (using runhaskell
> from ghc 6.2.4).

To elaborate on Udo's point:
If you look at the definition of foldr you'll see where the stack
overflow is coming from:  foldr recurses all the way down to the end
of the list, so your stack gets 250k (or attempts 500k) entries deep
so it can process the last line in the file first, then unwinds.
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Donald Bruce Stewart
martine:

> On 5/14/06, Eugene Crosser <[hidden email]> wrote:
> >main = printMax . (foldr processLine empty) . lines =<< getContents
> >[snip]
> >The thing kinda works on small data sets, but if you feed it with
> >250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> >500,000 lines I get "*** Exception: stack overflow" (using runhaskell
> >from ghc 6.2.4).
>
> To elaborate on Udo's point:
> If you look at the definition of foldr you'll see where the stack
> overflow is coming from:  foldr recurses all the way down to the end
> of the list, so your stack gets 250k (or attempts 500k) entries deep
> so it can process the last line in the file first, then unwinds.

Also, don't use runhaskell! Compile the code with -O :)

-- Don
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Donald Bruce Stewart
dons:

> martine:
> > On 5/14/06, Eugene Crosser <[hidden email]> wrote:
> > >main = printMax . (foldr processLine empty) . lines =<< getContents
> > >[snip]
> > >The thing kinda works on small data sets, but if you feed it with
> > >250,000 lines (1000 distinct), the process size grows to 200 Mb, and on
> > >500,000 lines I get "*** Exception: stack overflow" (using runhaskell
> > >from ghc 6.2.4).
> >
> > To elaborate on Udo's point:
> > If you look at the definition of foldr you'll see where the stack
> > overflow is coming from:  foldr recurses all the way down to the end
> > of the list, so your stack gets 250k (or attempts 500k) entries deep
> > so it can process the last line in the file first, then unwinds.
>
> Also, don't use runhaskell! Compile the code with -O :)

Not sure what processLine does, but just trying out Data.ByteString on
this as a test:

> import qualified Data.ByteString.Char8 as B
> import Data.List
>
> main = print . foldl' processLine 0 . B.lines =<< B.getContents
>     where processLine acc l = if B.length l > 10 then acc+1 else acc

Just count the long lines. Probably you do something fancier.

Anyway, 32M runs through this in:

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.31s user 0.28s system 28% cpu
    2.082 total

with 32M heap (these are strict byte arrays).

Using Data.ByteString.Lazy:

> import qualified Data.ByteString.Lazy as B
> import Data.List
>
> main = print . foldl' processLine 0 . B.split 10  =<< B.getContents
>     where processLine acc l = if B.length l > 10 then acc+1 else acc

    $ time ./a.out < /home/dons/fps/tests/32M
    470400
    ./a.out < /home/dons/fps/tests/32M  0.32s user 0.11s system 26% cpu
    1.592 total

With only 3M heap used.

-- Don
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Eugene Crosser
In reply to this post by Udo Stenzel
Udo Stenzel wrote:

> Eugene Crosser wrote:
>> This is my program:
>> ========
>> module Main where
>> import Data.Map
>> main = printMax . (foldr processLine empty) . lines =<< getContents
>> processLine line map = insertWith (\new old -> new + old) line 1 map
>> printMax map = putStrLn $ show $ foldWithKey
>>    (\key val accum -> if val > (snd accum) then (key,val) else accum)
>>        ("",0) map
>> ========

> You have to force the evaluation of intermediate results.  To do so, you
> have to replace foldr by foldl (foldr is just recursion, foldl is
> accumulator recursion),

Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl
have to read the complete list before it can start processing it
(beginning from the last element)?  As opposed to foldr that can fetch
elements one by one as they are needed?

Otherwise, point on strictness taken...  Well, apparently the whole deal
is even more weird than it happened at the first glance...

Eugene


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (262 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Antti-Juhani Kaijanaho-3
Eugene Crosser wrote:
> Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl
> have to read the complete list before it can start processing it
> (beginning from the last element)?  As opposed to foldr that can fetch
> elements one by one as they are needed?

They're complementary.

If the result is of a type where partial evaluation is possible (say, a
list: between "not evaluated" and "fully evaluated", there are as many
intermediate stages of evaluation as there are elements in the list),
then foldr is the better choice, as it constructs the output list (or
whatever) lazily.  (You also need to make sure that the fold parameter
function is lazy in the "rest of output" parameter.)

If the result is of a type that doesn't allow partial evaluation (an
integer, for example: there is no intermediate stage between "not
evaluated" and "fully evaluated"), or used in a context where laziness
is not a virtue, then it pays to avoid laziness in its evaluation: hence
foldl' is the better choice. (You also need to make sure that the fold
parameter function is strict in the accumulator parameter.)

In elementary (nth-language) Haskell, one is generally trying to learn
the stuff about Haskell that is *different* from conventional languages,
so in elementary tutorials the rule of thumb "foldr is better" works.
It's just one of the half-lies that people get told in elementary
courses that one needs to augment in later stages of learning :)
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Udo Stenzel
In reply to this post by Eugene Crosser
Eugene Crosser wrote:
> Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl
> have to read the complete list before it can start processing it
> (beginning from the last element)?  As opposed to foldr that can fetch
> elements one by one as they are needed?

Both foldl and foldr start from the left of the list; dictated by the
structure of the list datatype nothing else is possible.  The actual
difference is that foldl passes an accumulator along and returns the
final value of the accumulator.  This also means that foldl is tail
recursive and foldr isn't.

Depending on what you want to do, both combinators can start processing
right away.  foldr does so only if the folded function is lazy in its
second argument (the list constructor is an example of such a function,
but Map.insert isn't), foldl' always does so.  You cannot get a result
from foldl' before the complete input is consumed.

If it's still unclear, you should take the definitions of foldr, foldl
and foldl' and simulate their reductions by hand on paper.  You should
see how foldr cannot apply a strict function (like (+)) before the
complete(!) list is transformed into a gargantuan thunk, how foldl just
plain refuses to apply the obviously needed reduction step and cannot be
persuaded to do so and how foldl' is what you want.  You'll also see how
everything is different for a lazy funktion (like (:)).


Udo.
--
It is easier to get forgiveness than permission.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Eugene Crosser
Udo Stenzel wrote:

> Eugene Crosser wrote:
>> Having read "Yet another Haskell tutorial" (note on p.20), doesn't foldl
>> have to read the complete list before it can start processing it
>> (beginning from the last element)?  As opposed to foldr that can fetch
>> elements one by one as they are needed?
>
> Both foldl and foldr start from the left of the list; dictated by the
> structure of the list datatype nothing else is possible.  The actual
> difference is that foldl passes an accumulator along and returns the
> final value of the accumulator.  This also means that foldl is tail
> recursive and foldr isn't.
I think that I get it now.  foldl will actually yield any result when it
hits the end of the list, while foldr will give you partial result (if
partial result makes any sense, that is) after each iteration.  And to
get any advantage of the latter, you need to be able to consume that
"partial result" element-by-element.  Right?

Anyway, I understand that you used 'seq' in your example as a way to
"strictify" the function that updates accumulator.  Could you (or
anyone) explain (in plain English, preferably:) the reason why 'seq' is
the way it is.  In the first place, why does it have the first argument
at all, and what should you put there?

Eugene

P.S. just FYI: after the changes, my benchmark program stops growing
with the growth of data set, and in compiled form it has the same RAM
footprint as the equivalent (interpreted) perl script.  Still, it
consumes 20 times more CPU...

P.P.S. Thanks people, you are really helpful!


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (262 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Antti-Juhani Kaijanaho-3
Eugene Crosser wrote:
> Anyway, I understand that you used 'seq' in your example as a way to
> "strictify" the function that updates accumulator.  Could you (or
> anyone) explain (in plain English, preferably:) the reason why 'seq' is
> the way it is.  In the first place, why does it have the first argument
> at all, and what should you put there?

seq returns its second argument without doing anything to it.  As a
side-effect, it also evaluates (shallowly) its first argument.

So, first argument should be what you want to be evaluated, second is
what you want seq to return.

Note that e `seq` e is useless; it does *not* force the evaluation of e
before it would be evaluated in any case.


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: [newbie] processing large logs

Udo Stenzel
In reply to this post by Eugene Crosser
Eugene Crosser wrote:
> Anyway, I understand that you used 'seq' in your example as a way to
> "strictify" the function that updates accumulator.  Could you (or
> anyone) explain (in plain English, preferably:) the reason why 'seq' is
> the way it is.  In the first place, why does it have the first argument
> at all

If you write 'seq a b' it means: "Should you need to evaluate b,
evaluate (the top constructor of) a first."

The example at hand was something like

update' key value map =
        let value' = lookupWithDefault 0 key map
        in value' `seq` insert key value' map

We assume that your program will somehow demand the final value of the
map involved, so the 'insert ...' expression will be evaluated at some
point.  Due to lazy semantics that doesn't mean that value' is
evaluated, instead an unevaluated thunk is put into the map to be
evaluated once you look it up.  Since it is this thunk which takes up all the
space, we have to make sure it is evaluated eagerly.  That's what the
'seq' does: if evaluation of the map is demanded, value' has to be
evaluated before.

Notice that there is an application of seq inside of foldl', too.  Foldl
would build an expression like this:

( insert kn vn ( ... ( insert k2 v2 ( insert k1 v1 empty ) ) ... ) )

Nothing demands the evaluation of the deeply nested part.  Foldl' places
seq at the appropriate places, so evaluation progresses from the inside
out, which is exactly what you need.  If you mistakenly used foldl, the
'seq' in the update function would never be triggered.  (A single
forgotten 'seq' can sometimes ruin everything.  This makes "sprinkling
seqs until it works" quite frustrating.)


> and what should you put there?

I wish I had a good rule of thumb here.  Accumulators are a good
candidate, the things deep in data structures are good, too, and heap
profiling might point you at the right place.


> Still, it consumes 20 times more CPU...

Well, that's probably the result of strings being represented as linked
lists of unicode characters and Data.Map not being tailored to
structured keys.  You can make your code faster if you don't care that
it gets uglier.


Udo.
--
If you're not making waves, you're not underway - Adm. Nimitz

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

signature.asc (196 bytes) Download Attachment