How on Earth Do You Reason about Space?


How on Earth Do You Reason about Space?

Aleksandar Dimitrov
Dear Cafe,

(Excuse the probably very ranty email; I am, unfortunately, at the end of my
wits, and I hope that as fellow programmers, you will understand that this is
among the most dreadful situations for our kind to be in.)

Say, we have an input file that contains a word per line. I want to find all
unigrams (unique words) in that file, and associate with them the amount of
times they occurred in the file. This would allow me, for example, to make a
list of word frequencies in a given text.

Simple enough task. Here's an implementation using iteratees (lazy IO is evil)
and unordered-containers' Data.HashMap.Strict, which enforces WHNF in values
and keys:

> import qualified Data.ByteString.Char8 as S
> import qualified Data.Iteratee as I
> import Data.Iteratee.IO
>
> import qualified Data.HashMap.Strict as T
>
> import Data.Iteratee.Char
> import Data.List (foldl')
> import Data.Char (toLower)
>
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
>
> type Wordcounts = T.HashMap S.ByteString Int
>
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ enumLinesBS (I.liftI $ step T.empty)
>     where step t (I.Chunk str) = I.liftI (step $! foldl' maybeIncrement t str)
>           step t stream = I.idone t stream
>           maybeIncrement t s
>               | s == S.empty = t
>               | otherwise    = {-# SCC "m-I-other" #-} T.insertWith (+) s 1 t
>
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f'.head >>= print.prettyList
>     where prettyList = -- sortBy (comparing snd) . T.toList
>                        T.keys
Some lines are empty, and I don't want them to be recorded, so that's why
maybeIncrement is necessary.
hpaste of this code: http://hpaste.org/47300/spaceleak (ignore convert, that's
yet another issue.)

Now, here's some observations: on a 75M input file (minuscule, compared to what
I actually need) this program will eat 30M of heap space (says profiling) and
return in 14 secs.

I have two problems with that: a) that's too much heap space, b) the actual memory
residency is *much* worse.

ad b) average memory residency is at 38MB (this is OK, given heap consumption)
but max residency is at 130MB, which is unacceptable to me (remember that I need
to run this on files *much* bigger than just 75M.)

<<ghc: 9013546200 bytes, 16876 GCs, 38232524/130572448 avg/max bytes residency
(58 samples), 306M in use, 0.00 INIT (0.00 elapsed), 6.12 MUT (6.28 elapsed),
7.45 GC (7.50 elapsed) :ghc>>

In fact, htop reports total memory residency of the program at around 320 MB at
the end of its life-cycle (compiled and run without profiling options.) I tried
running this program on a 1.4GB file, and had to kill it when, at 3.5GB of memory
consumption, my computer started paging. The actual hash-map, however, shouldn't
be much bigger than in the 75MB case (I expect about twice the size,) since the
input is natural language. I redirected the output of the program (showing a
list of assoc pairs that were in the hash map) to a file, and that file measured
11MB in the case of a 75MB corpus, and 40MB when I ran the program on a 1.4GB
corpus (I had to use a machine with 24GB of RAM to be able to run this.)

ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much less than there's data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJ&XN1mW&ltYNR.

- The first image shows 50% of the heap space being gobbled up with data that
  shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and
  HashMap.  So, it seems I'm keeping around hash maps?
- The third image blames the cost centre m-i-other for the LAG.
- Then there's the retainer profile: I have no idea how to read it. I thought it
  had something to do with iteratee reading in chunks of data, but after setting
  the chunk size to 4096 bytes and actually getting *fewer* spikes than before, I
  doubt that really is the case.

It seems like half of the data this program puts on the heap is not cleaned up
properly, and the otherwise-branch of maybeIncrement (i.e. the only really
important line in this program) is to blame. I'm thinking there's too little
strictness somewhere in the T.insertWith clause, but I have no idea how to make
it better.  Honestly, sprinkling around magic seq-dust every time something goes
wrong doesn't feel good. I cannot reason about Haskell's space-performance even
a little bit, and I have no idea how to change that.

Some more remarks:
- I tried using lazy IO first, but that was a total disaster. Iteratee is nice,
  and works very well, but maybe I don't know how to use it?
- In terms of map implementations, I observed that unordered-containers <
  Data.Map < bytestring-trie in terms of running time. It surprised me to see
  that Data.Map would perform better than bytestring-trie for this task (usually,
  Data.Map would be around twice as fast.)
- I should probably look into using a mutable data structure as a map. I would
  find it a bit disenchanting to have to use a mutable data structure this way,
  but maybe that's just reality for you.

RANT

I have tried and tried again to avoid writing programs in Haskell that would
leak space like BP likes to leak oil. However, I have yet to produce a single
instance of a program that would do anything at all and at the same time consume
less memory than there is actual data in the input file.

It is very disconcerting to me that I seem to be unable, even after quite some
practice, to identify space leaks in trivial programs like the above. I know of
no good resource to educate myself in that regard. I have read the GHC manual,
RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell
heap, but no dice. Even if I can clearly see the exact line where at least some
of the leaking happens (as I can in this case,) it seems impossible for me to
prevent it.

*thank you very much* for reading this far. This is probably a mostly useless
email anyhow, I just had to get it off my chest. Maybe, just maybe, someone
among you will have a crucial insight that will save Haskell for me :-) But
currently, I see no justification to not start my next project in Lua, Python or
Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually
*run* it, why bother?  (Yes, this isn't the first time I've run into this
problem …)

Thanks, and kindest regards to you all,
Aleks


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
On Tue, May 31, 2011 at 06:10:00PM +0200, Aleksandar Dimitrov wrote:
> ad a) heap consumption is too high for two reasons: firstly, the actual data I
> care about is much less than there's data on the heap. Secondly, about half the
> heap space is in LAG state. Here are profiles that will illustrate this:
> http://imgur.com/wBWmJ&XN1mW&ltYNR.

Pardon me, I forgot to actually add the retainer profile:
- http://i.imgur.com/xLO13
- http://imgur.com/wBWmJ&XN1mW&ltYNR&xLO13l?full is the full gallery.


Re: How on Earth Do You Reason about Space?

Malcolm Wallace-2
In reply to this post by Aleksandar Dimitrov
ad a) heap consumption is too high for two reasons: firstly, the actual data I
care about is much less than there's data on the heap. Secondly, about half the
heap space is in LAG state. Here are profiles that will illustrate this:
http://imgur.com/wBWmJ&XN1mW&ltYNR.

- The first image shows 50% of the heap space being gobbled up with data that
shouldn't be there anymore (LAG)
- The second image shows the types that are in LAG state: ByteString and
HashMap. So, it seems I'm keeping around hash maps?
 

In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created too _early_.  (Drag are those that are kept for longer than necessary.)  Lots of Lag generally means your program is too strict - it is forcing structure long before it needs to.  To fix it, you need to make things lazier.  My first suspicion would fall on ByteString.

Regards,
    Malcolm


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
> In Lag/Drag/Void/Use profiling, Lag is actually heap cells that are created
> too _early_.  (Drag are those that are kept for longer than necessary.)  Lots
> of Lag generally means your program is too strict - it is forcing structure
> long before it needs to.  To fix it, you need to make things lazier.  My first
> suspicion would fall on ByteString.

Indeed, thank you, I mixed those up. I cannot use lazy byte strings here,
because of the way Data.Iteratee.Char's enumLinesBS works (it takes strict
byte strings.)

The only other strictness in there is ($!) and foldl'. The latter is necessary
for the program to even run (i.e. not run out of stack space.)

The strict application in step's argument seems necessary, since without it, the
program consumes 1200 MB of RAM (on my 75MB test data,) and takes very, very long.
The hb profile indicates that a lot of data is allocated up front, and then
gradually eliminated. Interestingly, removing ($!) here seemed to *introduce*
unnecessary strictness. Here's the hb profile without ($!): http://imgur.com/Ex7Pd
I don't understand what is happening here :-\ I only just started using iteratees.

Regards,
Aleks


Re: How on Earth Do You Reason about Space?

Johan Tibell-2
In reply to this post by Aleksandar Dimitrov
Hi Aleksandar,

On Tue, May 31, 2011 at 6:10 PM, Aleksandar Dimitrov
<[hidden email]> wrote:
> Say, we have an input file that contains a word per line. I want to find all
> unigrams (unique words) in that file, and associate with them the amount of
> times they occurred in the file. This would allow me, for example, to make a
> list of word frequencies in a given text.

Here's how I would do it:

{-# LANGUAGE BangPatterns #-}
module Ngram (countUnigrams) where

import qualified Data.ByteString as S
import qualified Data.HashMap.Strict as M
import System.IO

foldLines :: (a -> S.ByteString -> a) -> a -> Handle -> IO a
foldLines f z0 h = go z0
  where
    go !z = do
        eof <- hIsEOF h
        if eof
            then return z
            else do
                line <- S.hGetLine h
                go $ f z line
{-# INLINE foldLines #-}

-- Example use
countUnigrams :: IO (M.HashMap S.ByteString Int)
countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty stdin

> RANT
>
> I have tried and tried again to avoid writing programs in Haskell that would
> leak space like BP likes to leak oil. However, I have yet to produce a single
> instance of a program that would do anything at all and at the same time consume
> less memory than there is actual data in the input file.
>
> It is very disconcerting to me that I seem to be unable, even after quite some
> practice, to identify space leaks in trivial programs like the above. I know of
> no good resource to educate myself in that regard. I have read the GHC manual,
> RWH's chapter on profiling, also "Inside T5"'s recent series on the Haskell
> heap, but no dice. Even if I can clearly see the exact line where at least some
> of the leaking happens (as I can in this case,) it seems impossible for me to
> prevent it.
>
> *thank you very much* for reading this far. This is probably a mostly useless
> email anyhow, I just had to get it off my chest. Maybe, just maybe, someone
> among you will have a crucial insight that will save Haskell for me :-) But
> currently, I see no justification to not start my next project in Lua, Python or
> Java. Sure, Haskell's code is pretty, and it's fun, but if I can't actually
> *run* it, why bother?  (Yes, this isn't the first time I've ran into this
> problem …)

We definitely need more accessible material on how to reliably write
fast Haskell code. There are those among us who can, but it shouldn't
be necessary to learn it in the way they did (i.e. by lots of
tinkering, learning from the elders, etc). I'd like to write a 60-page (or
so) tutorial on the subject, but haven't found the time.

In addition to RWH, perhaps the slides from the talk on
high-performance Haskell I gave could be useful:

    http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Cheers,
Johan


Re: How on Earth Do You Reason about Space?

Edward Z. Yang
In reply to this post by Aleksandar Dimitrov
Hello Aleksandar,

It is possible that the iteratees library is space leaking; I recall some
recent discussion to this effect.  Your example seems simple enough that
you might recompile with a version of iteratees that has -auto-all enabled.
Unfortunately, it's not really a safe bet to assume your libraries are
leak free, and if you've pinpointed it down to a single line, and there
doesn't seem a way to squash the leak, I'd bet it's the library's fault.

Edward


Re: How on Earth Do You Reason about Space?

Brandon Moore-2
In reply to this post by Aleksandar Dimitrov
I can't reproduce heap usage growing with the
size of the input file.

I made a word list from Project Gutenberg's
copy of "War and Peace" by

tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt

Using 1, 25, or 1000 repetitions of this ~3MB wordlist
shows about 100MB of address space used according
to top, and no more than 5MB or so of haskell heap
used according to the memory profile, with a flat
memory profile.


Is your memory usage growing with the size of the input
file, or the size of the histogram?

I was worried data sharing might mean your keys
retain entire 64K chunks of the input. However, it
seems enumLines depends on the StringLike ByteString
instance, which just converts to and from String.
That can't be efficient, but I suppose it avoids excessive sharing.

The other thing that occurs to me is that the total size of
your keys would also be approximately the size of the input
file if you were using plain text without each word split onto
a separate line.

Brandon


Re: How on Earth Do You Reason about Space?

Brandon Moore-2
In reply to this post by Aleksandar Dimitrov
Wait, do ByteStrings show up on a heap profile, if the space is
allocated with malloc?

Anyway, I think my tests still show that the memory used by the
process doesn't grow simply by adding more data, if you
are no longer adding keys to the map.



----- Original Message -----

> From: Brandon Moore <[hidden email]>
> To: Aleksandar Dimitrov <[hidden email]>; "[hidden email]" <[hidden email]>
> Cc:
> Sent: Tuesday, May 31, 2011 1:43 PM
> Subject: Re: [Haskell-cafe] How on Earth Do You Reason about Space?
>
> I can't reproduce heap usage growing with the
> size of the input file.
>
> I made a word list from Project Gutenberg's
> copy of "War and Peace" by
>
> tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
>
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist
> shows about 100MB of address space used according
> to top, and no more than 5MB or so of haskell heap
> used according to the memory profile, with a flat
> memory profile.
>
>
> Is your memory usage growing with the size of the input
> file, or the size of the histogram?
>
> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.
> That can't be efficient, but I suppose it avoids excessive sharing.
>
> The other thing that occurs to me is that the total size of
> your keys would also be approximately the size of the input
> file if you were using plain text without each word split onto
> a separate line.
>
> Brandon
>


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
In reply to this post by Johan Tibell-2
Hi Johan,

> Here's how I would do it:

I implemented your method, with these minimal changes (i.e. just using a main
driver in the same file.)

> countUnigrams :: Handle -> IO (M.HashMap S.ByteString Int)
> countUnigrams = foldLines (\ m s -> M.insertWith (+) s 1 m) M.empty
>
> main :: IO ()
> main = do (f:_) <- getArgs
>           openFile f ReadMode >>= countUnigrams >>= print . M.toList

It seems to perform about 3x worse than the iteratee method in terms of time,
and worse in terms of space :-( On Brandon's War & Peace example, hGetLine uses
1.565 seconds for the small file, whereas my iteratee method uses 1.085s for the
small file, and around 2 minutes for the large file.

For the large file, the code above starts consuming around 2.5GB of RAM,
so it clearly has a space leak somewhere. Where, I don't know.

If you want to try it out, here's a short command line to make a test corpus the
way Brandon made one:

+++

wget 'http://www.gutenberg.org/files/2600/2600.zip';
unzip 2600.zip;
touch wnp100.txt;
for i in {1..100}; do echo -n "$i "; cat 2600.txt >> wnp100.txt; done;
echo "Done.

+++

Note, that, as I detailed in my prior email to Brandon, even if you do end up
with a (supposedly) non-leaking program for this example corpus, that doesn't
mean it'll scale well to real world data.

I also tried sprinkling strictness annotations throughout your above code, but I
failed to produce good results :-(

> We definitely need more accessible material on how to reliably write
> fast Haskell code. There are those among us who can, but it shouldn't
> be necessary to learn it in the way they did (i.e. by lots of
> tinkering, learning from the elders, etc). I'd like to write a 60 (or
> so) pages tutorial on the subject, but haven't found the time.

I'd be an eager reader :-) Please do announce it on -cafe or the "usual places"
should you ever come around to writing it!

I, unfortunately, don't really have any contact to "the elders," apart from what
I read on their respective blogs…

> In addition to RWH, perhaps the slides from the talk on
> high-performance Haskell I gave could be useful:
>
>     http://blog.johantibell.com/2010/09/slides-from-my-high-performance-haskell.html

Thanks, I'll give it a look later tomorrow!

Regards,
Aleks

PS: Sorry I didn't answer you in #haskell, I ended up having to go afk for a
short while. Thanks for all your help!


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
In reply to this post by Brandon Moore-2
On Tue, May 31, 2011 at 11:43:27AM -0700, Brandon Moore wrote:

> I can't reproduce heap usage growing with the
> size of the input file.
>
> I made a word list from Project Gutenberg's
> copy of "War and Peace" by
>
> tr -sc '[[:alpha:]]' '\n' < pg2600.txt > words.txt
>
> Using 1, 25, or 1000 repetitions of this ~3MB wordlist
> shows about 100MB of address space used according
> to top, and no more than 5MB or so of haskell heap
> used according to the memory profile, with a flat
> memory profile.

This will lead to very small variance in your data. The effect I'm experiencing
is so small in this case that it's barely observable (but it is, see below.)

> Is your memory usage growing with the size of the input
> file, or the size of the histogram?

While the histogram is naturally growing with the size of the input file, memory
usage seems to be proportional mainly to the histogram. It is clear that, due to
the effect of the Long Tail, the histogram is going to constantly grow in a real
setting, as opposed to just replicating the same data.  In your test case, the
histogram is *not* growing with the size of the input file.

The memory usage is proportional to the histogram, which is proportional to the
file size. That is not the problem. The problem is, that, compared to the size
of the histogram, the memory consumption is *inadequately* high. Here's some
more data, using your Tolstoi example:

du file.txt
344M file.txt

<<ghc: 86596536784 bytes, 163747 GCs, 8798605/11210760 avg/max bytes residency
(2348 samples), 30M in use, 0.00 INIT (0.00 elapsed), 59.20 MUT (60.44 elapsed),
60.28 GC (60.20 elapsed) :ghc>>

As you can see, memory residency is at 8 MB avg, 10 MB max. This is the
War&Peace file, replicated 100 times. Let's look at the result for the file
*without* first replicating it 100 times:

du words.txt
3.0M words.txt

<<ghc: 805440768 bytes, 1519 GCs, 5106482/9552824 avg/max bytes residency (35
samples), 25M in use, 0.00 INIT (0.00 elapsed), 0.56 MUT (0.59 elapsed), 0.50 GC
(0.53 elapsed) :ghc>>

4.8MB avg, 9.1 MB max. It seems input file size *does* have an effect of sorts,
but it's negligible. What is more interesting is this: the file is a whopping
3MB big. How on earth does the program consume almost 5 MB *on average*? This is
*not* constant memory usage. This is memory usage trivial enough to not be worth
a fuss for small inputs, but unfortunately, it gets larger as soon as you
increase the file size (in a realistic fashion: i.e. you'll also increase the
unigram count.)

> I was worried data sharing might mean your keys
> retain entire 64K chunks of the input. However, it
> seems enumLines depends on the StringLike ByteString
> instance, which just converts to and from String.

Ouch, that sounds like something worth fixing.

> The other thing that occurs to me is that the total size of
> your keys would also be approximately the size of the input
> file if you were using plain text without each word split onto
> a separate line.

Well, I am not. The corpus is a word-per-line corpus, I'm reading a word per
line, and adding that to my hash map. This should never result in a data
structure even close to the size of the original corpus.

It could be, in a very unrealistic worst case scenario. But even a corpus of
30GB of Poe and Heidegger isn't going to make that happen. Furthermore, mine is not
such a scenario at all. As I said, if you reduce the corpus to a set of words
(i.e. a set of unigrams) you get a 40MB file from a 1.4GB corpus. Why is it,
that in order to create that 40MB file from a 1.4GB corpus, my trivial little
program needs somewhere north of 6-8 GB of RAM?

In this trivial example for War and Peace, why is it that in order to create the
unigram table for War and Peace, which ends up being a mere 201KB big, we're
chomping through 5MB on average, and nearly 10MB max? That's at least 25 times
more than we actually *should* have (yes, I know that the RTS is somewhere
there, too, but I think it's not a problem to ignore that for now.)

Regards,
Aleks


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
In reply to this post by Edward Z. Yang
On Tue, May 31, 2011 at 02:13:14PM -0400, Edward Z. Yang wrote:
> It is possible that the iteratees library is space leaking; I recall some
> recent discussion to this effect.  Your example seems simple enough that
> you might recompile with a version of iteratees that has -auto-all enabled.

If I understand you correctly, you imply that I should try compiling iteratee
with profiling, no? I did install the iteratee library with profiling support (I
have the cabal profiling flag globally set in my cabal config,) but my profiles
so far seem to be blaming LAGging ByteStrings and HashMaps.

I, unfortunately, do not know how I would test iteratee itself for a space leak
here.

> Unfortunately, it's not really a safe bet to assume your libraries are
> leak free, and if you've pinpointed it down to a single line, and there
> doesn't seem a way to squash the leak, I'd bet it's the library's fault.

Since my knowledge of Haskell, and, in particular, high-performance Haskell, is
very lacking, my current m.o. is to blame myself :-) It might be iteratee, but
unfortunately, I have not found something that gives me better performance than
iteratee yet.

Regards,
Aleks


Re: How on Earth Do You Reason about Space?

John Lato-2
In reply to this post by Aleksandar Dimitrov
From: "Edward Z. Yang" <[hidden email]>

Hello Aleksandar,

It is possible that the iteratees library is space leaking; I recall some
recent discussion to this effect.  Your example seems simple enough that
you might recompile with a version of iteratees that has -auto-all enabled.
Unfortunately, it's not really a safe bet to assume your libraries are
leak free, and if you've pinpointed it down to a single line, and there
doesn't seem a way to squash the leak, I'd bet it's the library's fault.

Edward

I can't reproduce the space leak here.  I tried Aleksander's original code, my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy bytestring version.

my iteratee version (only f' has changed from Aleksander's code):

f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s -> T.insertWith (+) s 1 t) T.empty

my lazy bytestring version

> import Data.Iteratee.Char
> import Data.List (foldl')
> import Data.Char (toLower)
>
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
> import qualified Data.ByteString.Lazy.Char8 as L
> import qualified Data.HashMap.Strict as T
>
> f'2 = foldl' (\t s -> T.insertWith (+) s 1 t) T.empty . filter (not . L.null) . L.lines
>
> main2 :: IO ()
> main2 = getArgs >>= L.readFile .head >>= print . T.keys . f'2
 
None of these leak space for me (all compiled with ghc-7.0.3 -O2).  Performance was pretty comparable for every version, although Aleksander's original did seem to have a very small edge.

As someone already pointed out, keep in mind that this will use a lot of memory anyway, unless there's a lot of repetition of words.

I'd be happy to help track down a space leak in iteratee, but for now I'm not seeing one.

Best,
John Lato


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:
> I can't reproduce the space leak here.  I tried Aleksander's original code,
> my iteratee version, the Ngrams version posted by Johan Tibell, and a lazy
> bytestring version.

I unfortunately can't post the actual corpus here, because it's copyrighted. But
there's plenty of ways to retrieve large amounts of data from the Internet. See
below.


> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $ I.foldl' (\t s
> -> T.insertWith (+) s 1 t) T.empty

Neat, folding the specialised function into the enumeratee is nifty! One can
tell that my experience with iteratee/enumerator has been only a day's worth for
now :-\

> None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> Performance was pretty comparable for every version, although Aleksander's
> original did seem to have a very small edge.

How big were your input corpora?

Here's an absolutely evil shell script that is going to make me donate money to
project Gutenberg. It will gather a corpus that is still very small, but at
least realistic in its distributional properties.

+++ scrape.sh

#!/bin/sh

textfile=all_text.txt

touch $textfile

text=0
size=0
for i in `seq 10 300`; do
        wget -q "http://www.gutenberg.org/files/$i/$i.zip"
        if [ -f $i.zip ]; then
                unzip -qq $i.zip
                tr -sc '[[:alpha:]]' '\n' < $i.txt >> $textfile
                text=`dc -e "$text 1 + p"`
                size=`du -h $textfile | cut -f 1`
                rm -f $i.zip $i.txt
        fi
        echo -n "\rFound $text of $i texts, total $size."
done
echo "\rFound $text texts, total $size"

+++

It'll take a while to run, and the resulting corpus is just a fraction of what
I'm using, but it serves well to illustrate the problem. If you want, you can
increase the amount of data gathered by simply tweaking the numerical range in
the seq statement. (If you make it gobble up more bandwidth, it might be polite
to put a sleep somewhere in the inner loop. I'd host the resulting file myself,
but I don't know if there aren't any conditions on that.)

If you did not tweak the script, it should've gathered some 100MB of data.

Running my unmodified original program, htop records 320MB of RAM usage towards
the end (*without* profiling being enabled.)

So it seems that I can't get rid of a factor of around 3x the input file size.
Luckily, the dependency seems to be linear. Here's some profiling:

<<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s system 99% cpu 52.064 total

ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)

Redirecting the program's output to a file will yield a mere 2.2M for the data
gathered by the above script. Since those 2.2M of data are all I care about, why
do I need so much more RAM to compute them?

Are my demands unreasonable?

> I'd be happy to help track down a space leak in iteratee, but for now I'm
> not seeing one.

Thank you for your offer! Maybe I'm just seeing ghosts, and there is no space
leak. But I really do think that the program is eating too much RAM.

Regards,
Aleks


Re: How on Earth Do You Reason about Space?

John Lato-2
In reply to this post by Aleksandar Dimitrov
From: Brandon Moore <[hidden email]>


I was worried data sharing might mean your keys
retain entire 64K chunks of the input. However, it
seems enumLines depends on the StringLike ByteString
instance, which just converts to and from String.
That can't be efficient, but I suppose it avoids excessive sharing.

That's true for 'enumLines', however the OP is using 'enumLinesBS', which operates on bytestrings directly.

Data sharing certainly could be an issue here.  I tried performing Data.ByteString.copy before inserting the key into the map, but that used more memory.  I don't have an explanation for this; it's not what I would expect.

The other parameter which affects sharing is the chunk size.  I got a much better memory profile when using a chunksize of 1024 instead of  65536.

Oddly enough, when using the large chunksize I saw lower memory usage from Data.Map, but with the small chunksize Data.HashMap has a significant advantage.

John Lato


Re: How on Earth Do You Reason about Space?

John Lato-2
In reply to this post by Aleksandar Dimitrov
On Wed, Jun 1, 2011 at 12:55 AM, Aleksandar Dimitrov <[hidden email]> wrote:
On Tue, May 31, 2011 at 11:30:06PM +0100, John Lato wrote:

> None of these leak space for me (all compiled with ghc-7.0.3 -O2).
> Performance was pretty comparable for every version, although Aleksander's
> original did seem to have a very small edge.

How big were your input corpora?

Today I was using multiple copies of War & Peace, as Brandon specified.  Total size is about 90M.
 

So it seems that I can't get rid of a factor of around 3x the input file size.
Luckily, the dependency seems to be linear. Here's some profiling:

<<ghc: 30478883712 bytes, 57638 GCs, 41925397/143688744 avg/max bytes residency (189 samples), 322M in use, 0.00 INIT (0.00 elapsed), 23.73 MUT (24.94 elapsed), 26.71 GC (27.10 elapsed) :ghc>>
../src/cafe/tools/iterTable 106M_text.txt +RTS -tstderr  50.44s user 1.50s system 99% cpu 52.064 total

ghc itself reports 38MB avg (can live with that,) and 140MB max (too much.)

Redirecting the program's output to a file will yield a mere 2.2M for the data
gathered by the above script. Since those 2.2M of data are all I care about, why
do I need so much more RAM to compute them?

Are my demands unreasonable?

I think the issue is data sharing, as Brandon mentioned.  A bytestring consists of an offset, length, and a pointer.  You're using a chunk size of 64k, which means the generated bytestrings all have a pointer to that 64k of data.  Suppose there's one new word in that 64k, and it's near the beginning of the chunk.  Even though the word may only be a few characters long, it'll reference the entire chunk and keep it from being GC'd.

There are a few solutions to this.  The first is to make a copy of the bytestring so only the required data is retained.  In my experiments this wasn't helpful, but it would depend on your corpus.  The second is to start with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.  If your corpus is similar to natural language, I think it'll probably work better for you as well.

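To make those two options concrete, here is a sketch of the earlier iteratee fold with the key copied before insertion and a much smaller chunk size (assuming the same imports and aliases as Aleksandar's original program: S = Data.ByteString.Char8, T = Data.HashMap.Strict, I = Data.Iteratee):

f'' :: Monad m => I.Iteratee S.ByteString m Wordcounts
f'' = I.joinI $ (enumLinesBS I.><> I.filter (not . S.null)) $
      I.foldl' (\t s -> T.insertWith (+) (S.copy s) 1 t) T.empty
      -- S.copy allocates a fresh, word-sized key, so the map no longer
      -- pins the 64k input chunk the word was sliced from

main :: IO ()
main = getArgs >>= fileDriverVBuf 1024 f'' . head >>= print . T.keys
       -- 1024-byte chunks instead of 65536, so any chunk a key still
       -- shares is small
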
Note that Johan's Ngram code also only keeps the minimum required data, giving it a good memory profile.   I didn't notice this last night because I was testing with different data, and unfortunately the peculiar distribution of that data masked this problem.

John L


Re: How on Earth Do You Reason about Space?

Daniel Fischer
In reply to this post by John Lato-2
On Wednesday 01 June 2011 12:13:54, John Lato wrote:

> > From: Brandon Moore <[hidden email]>
> >
> >
> > I was worried data sharing might mean your keys
> > retain entire 64K chunks of the input. However, it
> > seems enumLines depends on the StringLike ByteString
> > instance, which just converts to and from String.
> > That can't be efficient, but I suppose it avoids excessive sharing.
>
> That's true for 'enumLines', however the OP is using 'enumLinesBS',
> which operates on bytestrings directly.
>
> Data sharing certainly could be an issue here.  I tried performing
> Data.ByteString.copy before inserting the key into the map, but that
> used more memory.  I don't have an explanation for this; it's not what
> I would expect.

If you don't copy, the small ByteStrings (the words) point into the large
chunk, which peacefully rests in memory. A few ripples come from the chunk
boundaries, where more often than not a word will have parts in both
chunks, a new ByteString is allocated for those.
A bit of space is wasted for the constructor of the large chunk and for the
word-separators ('\n' here), and at the boundaries. Not a big deal: the
newlines cost one byte per text-word, but the constructors cost five or so
machine words per text-word.

If you copy each small ByteString from the chunk, a) the chunk stays in
memory until it's completely copied [doesn't matter much with chunk sizes
of a few k] and b) you get memory fragmentation because of alignment
requirements.

>
> The other parameter which affects sharing is the chunk size.  I got a
> much better memory profile when using a chunksize of 1024 instead of
> 65536.
>
> Oddly enough, when using the large chunksize I saw lower memory usage
> from Data.Map, but with the small chunksize Data.HashMap has a
> significant advantage.
>
> John Lato


Re: How on Earth Do You Reason about Space?

Daniel Fischer
In reply to this post by John Lato-2
On Wednesday 01 June 2011 12:28:28, John Lato wrote:
> There are a few solutions to this.  The first is to make a copy of the
> bytestring so only the required data is retained.  In my experiments
> this wasn't helpful, but it would depend on your corpus.  The second is
> to start with smaller chunks.

The third, check whether the word is already known, and *make a copy if
not*. That should only keep the required parts (including the currently
processed chunk) in memory.


Re: How on Earth Do You Reason about Space?

Edward Z. Yang
That sounds like a plausible reason why naive copying explodes space.
Something like string interning would be good here... and since you're
hashing already...
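
A hypothetical sketch of such interning, again with the S/T aliases from earlier
in the thread (the pool and the names here are made up, not from anyone's code):
keep one canonical copy of each distinct word and always insert that copy, so no
map key ever points into a transient input chunk.

intern :: T.HashMap S.ByteString S.ByteString -> S.ByteString
       -> (S.ByteString, T.HashMap S.ByteString S.ByteString)
intern pool w = case T.lookup w pool of
    Just w' -> (w', pool)             -- already interned: reuse the canonical copy
    Nothing -> let w' = S.copy w      -- first sighting: make a compact copy
               in (w', T.insert w' w' pool)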

Edward

Excerpts from Daniel Fischer's message of Wed Jun 01 06:46:24 -0400 2011:

> On Wednesday 01 June 2011 12:28:28, John Lato wrote:
> > There are a few solutions to this.  The first is to make a copy of the
> > bytestring so only the required data is retained.  In my experiments
> > this wasn't helpful, but it would depend on your corpus.  The second is
> > to start with smaller chunks.
>
> The third, check whether the word is already known, and *make a copy if
> not*. That should only keep the required parts (including the currently
> processed chunk) in memory.
>


Re: How on Earth Do You Reason about Space?

Aleksandar Dimitrov
In reply to this post by John Lato-2
Hi John,
> I think the issue is data sharing, as Brandon mentioned.  A bytestring
> consists of an offset, length, and a pointer.  You're using a chunk size of
> 64k, which means the generated bytestrings all have a pointer to that 64k of
> data.  Suppose there's one new word in that 64k, and it's near the beginning
> of the chunk.  Even though the word may only be a few characters long, it'll
> reference the entire chunk and keep it from being GC'd.

This seems to be the issue indeed! If the bytestrings on the hash map are
holding references to the chunks, it is clear that we're going to consume memory
scaling with the size of the input file, in case there are *any* new chunks
generated.

As I understand it, this was not a problem with the multiple copies of War and
Peace, because all byte strings were already found on the hash table! On reading
new input, old entries were found on the hash table, so only old chunks were
kept in memory, the new ones could be gc'ed.

In *realistic* data, however, the Long Tail is the reason that, after a while,
some chunks of input would only be kept because there were a few byte strings
referencing them. New words are rare, but they need not occur more frequently
than once per chunk, in order to keep the whole chunk in memory, even though the
chunk was mostly useless (most other data in this chunk would already be on the
hash map.)

> There are a few solutions to this.  The first is to make a copy of the
> bytestring so only the required data is retained.  In my experiments this
> wasn't helpful, but it would depend on your corpus.  The second is to start
> with smaller chunks.  Using a chunk size of 1024 worked fairly well for me.
>  If your corpus is similar to natural language, I think it'll probably work
> better for you as well.

I think I solved this problem elegantly: I used Data.Text as hash map keys,
instead of Data.ByteString. See the modified program below:

> import qualified Data.Iteratee as I
> import Data.Iteratee.Char
> import Data.Iteratee.IO
>
> import qualified Data.HashMap.Strict as T
>
> import Data.Ord (comparing)
> import Data.List (sortBy)
> import System.Environment (getArgs)
>
> import qualified Data.ByteString as S
> import qualified Data.Text as X
> import Data.Text.Encoding (decodeUtf8)
>
> type Wordcounts = T.HashMap X.Text Int
>
> f' :: Monad m => I.Iteratee S.ByteString m Wordcounts
> f' = I.joinI $ (enumLinesBS I.><> I.filter (not.S.null)) $ I.foldl'
>     (\t s -> T.insertWith (+) (convert s) 1 t) T.empty
>     where convert = decodeUtf8
>
> main :: IO ()
> main = getArgs >>= fileDriverVBuf 65536 f'.head
>        >>= mapM_ (\(w,c)-> putStrLn $ X.unpack w ++ "\t" ++ show c).sortM
>     where sortM = sortBy (comparing snd) . T.toList
Initial benchmarks on the realistic 100MB Gutenberg corpus I downloaded with my
script yesterday report the following: htop says 120M memory residency towards
the end of the life-cycle.

<<ghc: 40928992256 bytes, 77984 GCs, 20939886/46665600 avg/max bytes residency (455 samples), 133M in use, 0.00 INIT (0.00 elapsed), 26.01 MUT (26.20 elapsed), 32.96 GC (33.09 elapsed) :ghc>>

19MB avg, 44MB max residency, 133M in use (which is similar to what htop told
me) and the heap profile clearly shows allocation and deallocation of the
bytestrings: http://imgur.com/U1nyw (I'm attributing the ruggedness of the
profile with the frequent little spikes to the iteratee chunk allocation and
deallocation.) It seems I can't get rid of 50% of the heap being lag state. I
don't quite understand yet what that is. I also don't know what INHERENT_USE is.

But in any case, I now have a program that I can reasonably expect to run if I
could fit the input file into memory. I might try to implement an analogous
program in C++ or Java, just to see whether that would do better or similarly in
terms of memory consumption.

> Note that Johan's Ngram code also only keeps the minimum required data,
> giving it a good memory profile.   I didn't notice this last night because I
> was testing with different data, and unfortunately the peculiar distribution
> of that data masked this problem.

This is kind of the big problem here: whether or not you'll see the particular
behaviour I was so upset about seemed to depend on the corpus' distributional
properties.

In any case, I would like to thank you all for helping me understand and address
the issue. I probably still have a long way to go in terms of understanding
space-behaviour of Haskell programs, but at least I don't see ByteStrings as
this black box of efficiency anymore, but actually understand how they're
structured, and what they're good at, and what they aren't good at.

(take home lesson: Data.Text is really nice. Also: if iteratee has a space leak,
I probably didn't hit it, really. But: if reading byte-strings, one should mind
the pointers that byte-strings actually are.)

Regards,
Aleks


Re: How on Earth Do You Reason about Space?

Ketil Malde-5
In reply to this post by Aleksandar Dimitrov
Aleksandar Dimitrov <[hidden email]> writes:

> Now, here's some observations: on a 75M input file (minuscule, compared to what
> I actually need) this program will eat 30M of heap space (says profiling) and
> return in 14 secs.
>
> I have two problems with that: a) that's too much heap space, b) the actual memory
> residency is *much* worse.

30M isn't a lot these days.  How does it scale?

> ad b) average memory residency is at 38MB (this is OK, given heap consumption)
> but max residency is at 130MB, which is unacceptable to me (remember that I need
> to run this on files *much* bigger than just 75M.)

I think max residency (depending on how you measure) can be twice the
heap size due to using copying GC.  If you run short of memory, the
runtime will switch to compacting GC which will be slower but use less
memory.

> I have tried and tried again to avoid writing programs in Haskell that would
> leak space like BP likes to leak oil.

I know the feeling.  I think making a frequency table ought to be as
simple as

   ... M.fromListWith (+) . map (,1) . words =<< readFile ...

or at worst

  ... foldl' (\m w -> M.insertWith' (+) w 1 m) M.empty . words =<< readFile...

(which evaluates things strictly, and at least in my small tests, seems
to use quite a bit less heap space).
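
For reference, a self-contained version of the second one-liner might look like
this (a sketch only: the input path is hypothetical, and Data.Map.Strict's
insertWith stands in for the older insertWith'):

import qualified Data.Map.Strict as M
import Data.List (foldl')

main :: IO ()
main = do
    contents <- readFile "words.txt"   -- hypothetical word-per-line input
    let freqs = foldl' (\m w -> M.insertWith (+) w 1 m) M.empty (words contents)
    mapM_ (\(w, c) -> putStrLn (w ++ "\t" ++ show c)) (M.toList freqs)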

-k
--
If I haven't seen further, it is by standing in the footprints of giants
