mapM is supralinear?


mapM is supralinear?

Travis Erdman


The performance of mapM appears to be supralinear in the length of the list it is mapping on.  Does it need to be this way?  As a comparison, both mapM_ and map are linear in the length of the list.


To wit:

travis@PW:~/Documents/insurer$ ghci
GHCi, version 7.0.3: http://www.haskell.org/ghc/  :? for help
Loading package ghc-prim ... linking ... done.
Loading package integer-gmp ... linking ... done.
Loading package base ... linking ... done.
Loading package ffi-1.0 ... linking ... done.
Prelude> :set +s
Prelude> :set +t
Prelude> :m Data.List Data.Maybe
Prelude Data.List Data.Maybe> foldl' (+) 0 $ fromJust $ mapM (\x -> Just x) [1..500000]
125000250000
it :: Integer
(2.23 secs, 112875864 bytes)
Prelude Data.List Data.Maybe> foldl' (+) 0 $ fromJust $ mapM (\x -> Just x) [1..1000000]
500000500000
it :: Integer
(6.86 secs, 214002204 bytes)
Prelude Data.List Data.Maybe> foldl' (+) 0 $ fromJust $ mapM (\x -> Just x) [1..2000000]
2000001000000
it :: Integer
(24.39 secs, 429299748 bytes)
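(Note the ratios in the timings above: doubling the list length multiplies the run time by roughly 3-3.5 rather than 2, i.e. distinctly worse than linear.)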


Prelude Data.List Data.Maybe> foldl' (+) 0 $ map (\x -> x - 1) [1..1000000]
499999500000
it :: Integer
(0.77 secs, 171213436 bytes)
Prelude Data.List Data.Maybe> foldl' (+) 0 $ map (\x -> x - 1) [1..10000000]
49999995000000
it :: Integer
(7.42 secs, 1723399472 bytes)
Prelude Data.List Data.Maybe> foldl' (+) 0 $ map (\x -> x - 1) [1..40000000]
799999980000000
it :: Integer
(30.46 secs, 6894835952 bytes)



Prelude Data.List Data.Maybe> mapM_ (\x -> Just x) [1..1000000]
Just ()
it :: Maybe ()
(0.42 secs, 82761248 bytes)
Prelude Data.List Data.Maybe> mapM_ (\x -> Just x) [1..10000000]
Just ()
it :: Maybe ()
(3.87 secs, 808012660 bytes)
Prelude Data.List Data.Maybe> mapM_ (\x -> Just x) [1..100000000]
Just ()
it :: Maybe ()
(38.40 secs, 8054769564 bytes)
Prelude Data.List Data.Maybe>



Re: mapM is supralinear?

Daniel Fischer
On Wednesday 07 September 2011, 01:01:08, Travis Erdman wrote:
> The performance of mapM appears to be supralinear in the length of the
> list it is mapping on.

Hmm. Could reproduce with 6.12.3 and 7.0.4, but not with 7.2.1.

> Does it need to be this way?  

Apparently it doesn't, and it seems to be fixed now.


Re: mapM is supralinear?

Ertugrul Söylemez
In reply to this post by Travis Erdman
Travis Erdman <[hidden email]> wrote:

> The performance of mapM appears to be supralinear in the length of the
> list it is mapping on.  Does it need to be this way?  As a comparison,
> both mapM_ and map are linear in the length of the list.

It needs to be this way in most monads.  It's not a problem of mapM
itself, but of its definition in the particular monad.  In general it's
a bad idea to use mapM over IO.  For [] it will eat lots of memory
quickly and by its mere definition there is nothing you can do about
that.

mapM_ is linear, because it can throw away the results, so no
complicated accumulation occurs.  map is usually linear, because used
properly it will be optimized away leaving just a loop, which doesn't
produce any data structures in memory and is just run element by
element.
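For concreteness, here is a sketch of where the accumulation comes from: essentially the Haskell 98 Report definitions (GHC's actual code differs in detail), with primed names to avoid clashing with the Prelude.

mapM' :: Monad m => (a -> m b) -> [a] -> m [b]
mapM' f = sequence' . map f

sequence' :: Monad m => [m a] -> m [a]
sequence' = foldr k (return [])
  where
    k m rest = do
        x  <- m       -- run one action
        xs <- rest    -- run all the remaining ones; x and every
                      -- previous result stay live until this returns
        return (x : xs)

mapM_' :: Monad m => (a -> m b) -> [a] -> m ()
mapM_' f = foldr ((>>) . f) (return ())   -- each result is dropped
                                          -- as soon as it is produced

Every step of sequence' has to hold its x until the whole rest of the list has been processed, which is the accumulation described above; mapM_' keeps nothing.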


Greets,
Ertugrul


--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://ertes.de/




Re: mapM is supralinear?

Roman Cheplyaka-2
* Ertugrul Soeylemez <[hidden email]> [2011-09-07 16:20:03+0200]
> In general it's a bad idea to use mapM over IO.

Could you explain why?

Thanks.

--
Roman I. Cheplyaka :: http://ro-che.info/


Re: mapM is supralinear?

Maciej Piechotka
On Fri, 2011-09-09 at 00:41 +0200, Roman Cheplyaka wrote:
> * Ertugrul Soeylemez <[hidden email]> [2011-09-07 16:20:03+0200]
> > In general it's a bad idea to use mapM over IO.
>
> Could you explain why?
>
> Thanks.
>

Hmm. Isn't it explained by the next sentence ("For [] it will eat lots of
memory quickly and by its mere definition there is nothing you can do
about that.")?

Regards


Re: mapM is supralinear?

Daniel Fischer
In reply to this post by Roman Cheplyaka-2
On Friday 09 September 2011, 00:41:11, Roman Cheplyaka wrote:
> * Ertugrul Soeylemez <[hidden email]> [2011-09-07 16:20:03+0200]
>
> > In general it's a bad idea to use mapM over IO.
>
> Could you explain why?

Take it with a grain of salt; there's nothing necessarily wrong with using
mapM over IO on short lists.
The problem is that IO's semantics imply that nothing can be made available
before the entire list has been consumed, and a large thunk is built on the
way. Thus for longish lists there's a serious risk of stack overflows (or
even heap exhaustion if you mapM the right [wrong] functions).
The same applies to replicateM, and to other monads whose (>>=) isn't
lazy enough.
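A minimal way to observe this (a sketch; the exact length at which it dies depends on the GHC version and the RTS stack limit):

main :: IO ()
main = do
    -- The whole result list must be built before anything downstream
    -- can run; with the default stack limit of this era, large n
    -- tends to end in a stack overflow.
    xs <- mapM return [1 .. 10000000 :: Int]
    print (length xs)

-- By contrast, mapM_ (\_ -> return ()) [1 .. 10000000 :: Int]
-- runs in constant space, since no result list is retained.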


Re: mapM is supralinear?

Ertugrul Söylemez
In reply to this post by Roman Cheplyaka-2
Roman Cheplyaka <[hidden email]> wrote:

> > In general it's a bad idea to use mapM over IO.
>
> Could you explain why?

Most applications don't require loading the entire result into memory,
so a combinator like foldM is more appropriate.  You should use mapM
over IO only when the list is short, or when there is really no way
around loading everything into memory.
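For example, a running reduction with foldM instead of collecting everything with mapM (a sketch; getSize here is a hypothetical stand-in for any per-element IO action):

import Control.Monad (foldM)

-- Sum values produced by IO actions without ever holding the full
-- list of results in memory.
sumSizes :: (FilePath -> IO Integer) -> [FilePath] -> IO Integer
sumSizes getSize = foldM step 0
  where
    step acc fp = do
        n <- getSize fp
        return $! acc + n   -- force the accumulator to avoid a
                            -- thunk chain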


Greets,
Ertugrul


--
nightmare = unsafePerformIO (getWrongWife >>= sex)
http://ertes.de/




Re: mapM is supralinear?

John Lato-2
In reply to this post by Travis Erdman
> From: Daniel Fischer <[hidden email]>
>
> On Friday 09 September 2011, 00:41:11, Roman Cheplyaka wrote:
>> * Ertugrul Soeylemez <[hidden email]> [2011-09-07 16:20:03+0200]
>>
>> > In general it's a bad idea to use mapM over IO.
>>
>> Could you explain why?
>
> Take it with a grain of salt, there's nothing necessarily wrong with using
> mapM over IO on short lists.

Agreed.  Whenever I'd like to use mapM (or any other function for
which a *M_ is available), I've found the following rules helpful:

1.  If I can guarantee the list is short (~ n<=20), go ahead and use mapM
2.  Otherwise use mapM_, foldM_, or foldM if a real reduction is
possible (i.e. not "foldM snocM []").

Step 2 sometimes requires changing my design, but it's always been for
the better.  `mapM_` tends to require more pipeline composition, so
it's leveraging the language's strengths.

This has served me well, especially in IO, but in other monads as well.
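To make rule 2 concrete, here is a sketch of the "foldM snocM []" anti-pattern it excludes: a fold that merely accumulates every result just re-creates mapM, with the same memory behaviour.

import Control.Monad (foldM, liftM)

badMapM :: Monad m => (a -> m b) -> [a] -> m [b]
badMapM f = liftM reverse . foldM step []
  where
    step acc x = do
        y <- f x
        return (y : acc)   -- keeps every result live: a snoc in
                           -- disguise, so nothing is gained over mapM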

John L.


Re: mapM is supralinear?

Tim Docker-4

On 09/09/2011, at 8:19 PM, John Lato wrote:

> Agreed.  Whenever I'd like to use mapM (or any other function for
> which a *M_ is available), I've found the following rules helpful:
>
> 1.  If I can guarantee the list is short (~ n<=20), go ahead and use  
> mapM
> 2.  Otherwise use mapM_, foldM_, or foldM if a real reduction is
> possible (i.e. not "foldM snocM []").
>
> Step 2 sometimes requires changing my design, but it's always been for
> the better.  `mapM_` tends to require more pipeline composition, so
> it's leveraging the language's strengths.

This thread is really interesting - it relates directly to problems I am
currently having with mapM over large lists (see the thread "stack
overflow pain").

Can you explain what you mean by "mapM_ tends to require more pipeline
composition"? In what way is it leveraging the language's strengths?

Thanks,

Tim


Re: mapM is supralinear?

John Lato-2
On Wed, Sep 21, 2011 at 1:57 PM, Tim Docker <[hidden email]> wrote:

>
> On 09/09/2011, at 8:19 PM, John Lato wrote:
>
>> Agreed.  Whenever I'd like to use mapM (or any other function for
>> which a *M_ is available), I've found the following rules helpful:
>>
>> 1.  If I can guarantee the list is short (~ n<=20), go ahead and use mapM
>> 2.  Otherwise use mapM_, foldM_, or foldM if a real reduction is
>> possible (i.e. not "foldM snocM []").
>>
>> Step 2 sometimes requires changing my design, but it's always been for
>> the better.  `mapM_` tends to require more pipeline composition, so
>> it's leveraging the language's strengths.
>
> This thread is really interesting - it relates directly to problems I am
> currently
> having with mapM over large lists (see the thread "stack overflow pain").
>
> Can you explain what you mean by "mapM_ tends to require more pipeline
> composition"?
> In what way is it leveraging the language strengths?

Hmm, that is suitably cryptic.  One way to think of it is an inversion
of control.  Instead of operating on whole collections of things in a
monad, you specify monadic actions (pipelines) which are applied
sequentially to each input.

Here's a simple example.  Suppose you have a bunch of data serialized
to files, and you want to read each file into a data structure, apply
some process based upon the last file's data, and write out the output
to new files.  One way to do that would look like:

do
    dats <- mapM readMyData files
    let pairs = zip (mempty : dats) dats
    zipWithM_ (\(last, this) fname -> writeMyData (update last this) fname)
              pairs
              newFiles

However, you could also put everything into a single monadic
operation, like this

do
    foldM_ (\last (infile, outfile) -> do
                this <- readMyData infile
                writeMyData (update last this) outfile
                return this)
           mempty
           (zip files newFiles)

The first interleaves control (mapM, zipWithM_) with monadic actions
(file IO), whereas the second has only one control function (foldM_)
which completely processes one input.  I say this is more pipeline
composition because you have to create an entire pipeline from input
to output, which is then sequentially fed inputs by the control
function.

I say this leverages Haskell's strengths because it's quite easy to
compose functions and monadic actions in Haskell.  It also tends to be
garbage-collector friendly.  I also find it much easier to reason
about space usage.  You don't need to worry if part of a list is being
retained, because the full list of data doesn't appear anywhere.  If
you need to access prior elements they're specified explicitly so you
know exactly how much data you're holding on to.

My perspective might be warped by my work on iteratees, but I find
this a very natural approach.

John L.


Re: mapM is supralinear?

Arseniy Alekseyev
> Apparently it doesn't, and it seems to be fixed now.

Does anyone know what exactly the bug was? It seems like a serious bug
to me; I ran into it myself today and wasn't happy. Linear algorithms
should run in linear time however much memory they allocate (modulo
cache thrashing, of course). The existence of people claiming otherwise
surprises me!


On 22 September 2011 01:05, John Lato <[hidden email]> wrote:

> [...]


Re: mapM is supralinear?

Lennart Augustsson
You seem to ignore garbage collection.

On Sat, Sep 24, 2011 at 6:40 AM, Arseniy Alekseyev <[hidden email]> wrote:
> [...]


Re: mapM is supralinear?

Arseniy Alekseyev
Garbage collection takes amortized O(1) per allocation, doesn't it?

On 26 September 2011 18:00, Lennart Augustsson <[hidden email]> wrote:
> You seem to ignore garbage collection.
> [...]
>


Re: mapM is supralinear?

Malcolm Wallace-2

On 26 Sep 2011, at 23:14, Arseniy Alekseyev wrote:

> Garbage collection takes amortized O(1) per allocation, doesn't it?

No.  For mark-sweep GC, the cost is proportional to

    (H + R) / (H - R)

where H is the total heap size and R is the reachable (i.e. live) heap.
This formula amortises the cost of a collection over the amount of free
space recovered.  For two-space copying collection, the cost is
proportional to

    R / (H/2 - R)

In both cases, as R approaches H (or H/2), the cost of GC becomes rather large.  So in essence, the more live data you have, the more GC will cost.
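For concreteness (my arithmetic, plugging numbers into the mark-sweep formula above): with H = 100 units and R = 50, the amortised cost is 150/50 = 3 units of GC work per unit allocated; at R = 90 it is 190/10 = 19; at R = 99 it is 199/1 = 199. The per-allocation cost blows up as the live data approaches the heap size.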

Regards,
    Malcolm



Re: mapM is supralinear?

Arseniy Alekseyev
Malcolm, one should amortize the cost of the collection over the
amount of free space allocated rather than recovered (there are cases
when no space is recovered; would you call the GC cost infinite
then?).

If one does that, and also runs the expensive collection not too often
[1], the time amortizes to O(1) easily (notice that the amount
allocated between GCs can be kept proportional to H).

I don't know whether the GC used in GHC does indeed have amortized O(1)
cost per allocation, but if it doesn't, it should.

[1] - a sufficient condition would be that there exists some real
number q such that q > 1 and the next GC runs not sooner than when H
reaches H_0*q where H_0 is the heap size remaining after the last
collection.

On 27 September 2011 10:02, Malcolm Wallace <[hidden email]> wrote:

> [...]
>


Re: mapM is supralinear?

Malcolm Wallace-2

On 27 Sep 2011, at 11:23, Arseniy Alekseyev wrote:

> Malcolm, one should amortize the cost of the collection over the
> amount of free space allocated rather than recovered

They are the same thing.  You can only allocate from the space that has been recovered.  It is true that generational GC has a nursery area of largely constant size, which is always used for fresh allocation, but that is usually considered an optimisation (albeit a considerable one), which does not fundamentally change the underlying asymptotic costs of the major collections.  When you have large heap residency, the proportion of time spent in GC increases.

> (there are cases
> when no space is recovered, would you call the GC cost infinite
> then?).

Indeed I would.  When that happens, usually the program aborts without completing its computation, so the computation is infinitely delayed.

Regards,
    Malcolm

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: mapM is supralinear?

Ketil Malde-5
In reply to this post by Arseniy Alekseyev

Here's my feeble understanding of GC:

1. GC runs after some specified amount of allocation
2. GC runs in time proportional to the live heap, which it needs to
   traverse.

This means that for a long running mapM, like any other computation
generating a long list, (1) GC will run a number of times proportional to
the length of the list, and (2) each run will have a cost
proportional to the length of the list.  I.e. a linear algorithm is now
quadratic.
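For concreteness (my arithmetic, under assumptions 1 and 2): if a collection fires every c allocations and the i-th collection traverses a live list of about i*c cells, then building a list of length N costs about c + 2c + ... + (N/c)*c = c*(N/c)*(N/c + 1)/2, roughly N^2/(2c) words of GC work: quadratic, as observed in the timings at the top of the thread.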

A lazy mapM (or mapM_), consuming the list as fast as it is generated,
will of course keep the list short/heap small, and thus the cost of each
GC is constant (for some value of "constant").

I suppose generational GC will help in practice, since the young
generation gets to start near the end of the list, but it will still be
linear in generated length, and you still need major GCs too,
occasionally.

Also, I guess mapM is more vulnerable to this, since the operations (IO,
say) involved in building the list likely do short-lived allocations,
triggering more GCs than simpler, pure computations would.

Do let me know if this is a terribly naive view.

-k
--
If I haven't seen further, it is by standing in the footprints of giants


Re: mapM is supralinear?

Arseniy Alekseyev
In reply to this post by Malcolm Wallace-2
Ketil, I suppose your argument is correct for some implementations of
GC (hopefully not the ones I use). However, a trivial optimisation
(running the expensive collection only when the heap has grown by a
constant factor since the last one) makes almost any kind of GC
amortized O(1) while keeping the total heap bounded within a constant
factor of the reachable heap size. So we optimize step (1) in your
algorithm, and in the case of mapM we should get a logarithmic (instead
of linear) number of GCs, with exponentially (instead of linearly)
increasing costs reaching O(N) at the end and totalling O(N) too!
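For concreteness (my arithmetic, assuming the growth-factor scheme just described): with factor q > 1, the k-th major collection traverses a heap of about H0*q^k, so m collections cost on the order of H0*(q + q^2 + ... + q^m) = H0*q*(q^m - 1)/(q - 1), which is within a constant factor of the final heap size H0*q^m, i.e. O(N) in total: a geometric series dominated by its last term.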

Does anyone know if such worst-case complexity precautions are taken
in GHC? If not, why?

Malcolm, I fail to see how "They are the same thing" is compatible
with "Indeed I would". They together imply that O(1) (GC amortized
over the amount allocated) and O(+inf) (GC amortized over the amount
reclaimed) are the same thing. Also, I don't see how OOM condition is
relevant here. I may have enough memory for a lot of useful things
even without GC.

On 27 September 2011 12:42, Ketil Malde <[hidden email]> wrote:

> [...]


On 27 September 2011 12:35, Malcolm Wallace <[hidden email]> wrote:

> [...]
