Is P.fold lazy or strict?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Is P.fold lazy or strict?

Alexey Raga
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.

--
Reply | Threaded
Open this post in threaded view
|

Re: Is P.fold lazy or strict?

Gabriel Gonzalez
Yes.  It's definitely strict in the accumulator and your `Map` fold looks correct to me.  As a side, note you should probably be building a `Set` instead of a `Map`.

Are you sure that it's the map that is leaking space?  You should profile your heap using the instructions here:

https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html

Also, how  are `bslines` and `toVector` implemented?

On 8/21/2015 9:33 PM, Alexey Raga wrote:
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
send an email to [hidden email].

--
Reply | Threaded
Open this post in threaded view
|

Re: Is P.fold lazy or strict?

Alexey Raga
I was trimming everything else from my program and so far I think that it is the Map thing, since if I don't do any "M.insert" in this code it works fine and don't consume any memory at all.

In "real life" I use "pipes-csv", but here are the implementations of "bslines" and "toVector":

    import Data.ByteString.Char8 as BS
    import Pipes.ByteString as PBS
    import Control.Foldl as F (Fold(..), purely, mconcat)

    bslines :: (MonadIO m) => Producer ByteString m ()
    bslines = purely folds F.mconcat . view PBS.lines $ PBS.stdin

    toVector a = V.fromList $ BS.splitWith (==',') a

I will soon try to profile the code as you suggested.
  
Cheers,
Alexey.

On Saturday, August 22, 2015 at 3:15:07 PM UTC+10, Gabriel Gonzalez wrote:
Yes.  It's definitely strict in the accumulator and your `Map` fold looks correct to me.  As a side, note you should probably be building a `Set` instead of a `Map`.

Are you sure that it's the map that is leaking space?  You should profile your heap using the instructions here:

<a href="https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html" target="_blank" rel="nofollow" onmousedown="this.href=&#39;https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q&#39;;return true;">https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html

Also, how  are `bslines` and `toVector` implemented?

On 8/21/2015 9:33 PM, Alexey Raga wrote:
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
send an email to <a href="javascript:" target="_blank" gdf-obfuscated-mailto="3z3qA04yBAAJ" rel="nofollow" onmousedown="this.href=&#39;javascript:&#39;;return true;" onclick="this.href=&#39;javascript:&#39;;return true;">haskell-pipe...@googlegroups.com.

--
Reply | Threaded
Open this post in threaded view
|

Re: Is P.fold lazy or strict?

Alexey Raga
Oh, I have found what was the problem.
It wasn't in fold, it was in a bytestring. Even if I use only a part of it in a map, the whole bytestring is hanging in memory, probably because BS.split doesn't really split it but returns portions of the original one.

I assume something similar happens when using "pipes-csv" library because that's how I started this investigation. 
Creating a copy before pushing it to the map helps.

Cheers,
Alexey.

On Saturday, August 22, 2015 at 6:27:17 PM UTC+10, Alexey Raga wrote:
I was trimming everything else from my program and so far I think that it is the Map thing, since if I don't do any "M.insert" in this code it works fine and don't consume any memory at all.

In "real life" I use "pipes-csv", but here are the implementations of "bslines" and "toVector":

    import Data.ByteString.Char8 as BS
    import Pipes.ByteString as PBS
    import Control.Foldl as F (Fold(..), purely, mconcat)

    bslines :: (MonadIO m) => Producer ByteString m ()
    bslines = purely folds F.mconcat . view PBS.lines $ PBS.stdin

    toVector a = V.fromList $ BS.splitWith (==',') a

I will soon try to profile the code as you suggested.
  
Cheers,
Alexey.

On Saturday, August 22, 2015 at 3:15:07 PM UTC+10, Gabriel Gonzalez wrote:
Yes.  It's definitely strict in the accumulator and your `Map` fold looks correct to me.  As a side, note you should probably be building a `Set` instead of a `Map`.

Are you sure that it's the map that is leaking space?  You should profile your heap using the instructions here:

<a href="https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html" rel="nofollow" target="_blank" onmousedown="this.href=&#39;https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q&#39;;return true;" onclick="this.href=&#39;https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q&#39;;return true;">https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html

Also, how  are `bslines` and `toVector` implemented?

On 8/21/2015 9:33 PM, Alexey Raga wrote:
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
send an email to haskell-pipe...@googlegroups.com.

--
Reply | Threaded
Open this post in threaded view
|

Re: Is P.fold lazy or strict?

Gabriel Gonzalez
In reply to this post by Alexey Raga
What did you expect the in-memory size of the `Map` to be?  For example, what is the size of each `a` you are inserting into the `Map`?

On 8/22/2015 1:27 AM, Alexey Raga wrote:
I was trimming everything else from my program and so far I think that it is the Map thing, since if I don't do any "M.insert" in this code it works fine and don't consume any memory at all.

In "real life" I use "pipes-csv", but here are the implementations of "bslines" and "toVector":

    import Data.ByteString.Char8 as BS
    import Pipes.ByteString as PBS
    import Control.Foldl as F (Fold(..), purely, mconcat)

    bslines :: (MonadIO m) => Producer ByteString m ()
    bslines = purely folds F.mconcat . view PBS.lines $ PBS.stdin

    toVector a = V.fromList $ BS.splitWith (==',') a

I will soon try to profile the code as you suggested.
  
Cheers,
Alexey.

On Saturday, August 22, 2015 at 3:15:07 PM UTC+10, Gabriel Gonzalez wrote:
Yes.  It's definitely strict in the accumulator and your `Map` fold looks correct to me.  As a side, note you should probably be building a `Set` instead of a `Map`.

Are you sure that it's the map that is leaking space?  You should profile your heap using the instructions here:

<a moz-do-not-send="true" href="https://downloads.haskell.org/%7Eghc/latest/docs/html/users_guide/prof-heap.html" target="_blank" rel="nofollow" onmousedown="this.href='https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q';return true;" onclick="this.href='https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q';return true;">https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html

Also, how  are `bslines` and `toVector` implemented?

On 8/21/2015 9:33 PM, Alexey Raga wrote:
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
from it, send an email to <a moz-do-not-send="true" href="javascript:" target="_blank" gdf-obfuscated-mailto="3z3qA04yBAAJ" rel="nofollow" onmousedown="this.href='javascript:';return true;" onclick="this.href='javascript:';return true;">haskell-pipe...@googlegroups.com.
moz-do-not-send="true" href="javascript:" target="_blank" gdf-obfuscated-mailto="3z3qA04yBAAJ" rel="nofollow" onmousedown="this.href='javascript:';return true;" onclick="this.href='javascript:';return true;">haskel...@....

--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
send an email to [hidden email].

--
Reply | Threaded
Open this post in threaded view
|

Re: Is P.fold lazy or strict?

Gabriel Gonzalez
In reply to this post by Alexey Raga
Yeah, that's right.  `bytestring` tries to minimize copies as much as possible so when you slice into a `ByteString` it just updates an offset and length field.  It's internal representation is:

    data ByteString = PS {-# UNPACK #-} !(ForeignPtr Word8) -- payload
                         {-# UNPACK #-} !Int                -- offset
                         {-# UNPACK #-} !Int                -- length

On 8/22/2015 6:49 AM, Alexey Raga wrote:
Oh, I have found what was the problem.
It wasn't in fold, it was in a bytestring. Even if I use only a part of it in a map, the whole bytestring is hanging in memory, probably because BS.split doesn't really split it but returns portions of the original one.

I assume something similar happens when using "pipes-csv" library because that's how I started this investigation. 
Creating a copy before pushing it to the map helps.

Cheers,
Alexey.

On Saturday, August 22, 2015 at 6:27:17 PM UTC+10, Alexey Raga wrote:
I was trimming everything else from my program and so far I think that it is the Map thing, since if I don't do any "M.insert" in this code it works fine and don't consume any memory at all.

In "real life" I use "pipes-csv", but here are the implementations of "bslines" and "toVector":

    import Data.ByteString.Char8 as BS
    import Pipes.ByteString as PBS
    import Control.Foldl as F (Fold(..), purely, mconcat)

    bslines :: (MonadIO m) => Producer ByteString m ()
    bslines = purely folds F.mconcat . view PBS.lines $ PBS.stdin

    toVector a = V.fromList $ BS.splitWith (==',') a

I will soon try to profile the code as you suggested.
  
Cheers,
Alexey.

On Saturday, August 22, 2015 at 3:15:07 PM UTC+10, Gabriel Gonzalez wrote:
Yes.  It's definitely strict in the accumulator and your `Map` fold looks correct to me.  As a side, note you should probably be building a `Set` instead of a `Map`.

Are you sure that it's the map that is leaking space?  You should profile your heap using the instructions here:

<a moz-do-not-send="true" href="https://downloads.haskell.org/%7Eghc/latest/docs/html/users_guide/prof-heap.html" rel="nofollow" target="_blank" onmousedown="this.href='https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q';return true;" onclick="this.href='https://www.google.com/url?q\75https%3A%2F%2Fdownloads.haskell.org%2F~ghc%2Flatest%2Fdocs%2Fhtml%2Fusers_guide%2Fprof-heap.html\46sa\75D\46sntz\0751\46usg\75AFQjCNHU_pCIHbU02oJD44msmrhDZCZg9Q';return true;">https://downloads.haskell.org/~ghc/latest/docs/html/users_guide/prof-heap.html

Also, how  are `bslines` and `toVector` implemented?

On 8/21/2015 9:33 PM, Alexey Raga wrote:
Sorry for another beginner's question, but is P.fold strict or lazy on its accumulator?

In the documentation it says that it is "Strict fold of the elements of a 'Producer'", but here is what I see:  

I have a big CSV file (50M rows), and one of the columns contains about 4.5M unique values. I fold these values into a Data.Map.Strict:

    names :: Producer ByteString IO ()
    names = bslines
            >-> P.map toVector 
            >-> P.map (V.! 25)
    
    main :: IO ()
    main = do
      m <- P.fold (\s a -> if M.member a s then s else M.insert a 1 s) M.empty id names
      Prelude.print $ M.size m
      Prelude.print $ M.findMax m
      Prelude.print "ok"

The output suggests that the map is of the right size, and the max element is correct. This takes ~6.4GB of RAM.

Now I extract these 4.5M unique values into a separate file and run the same code again (only the column index is changing, nothing else).
The output is the same (same size, same max element), except that now it only takes 1.2GB of RAM to run.
 
Am I right suspecting that laziness causes this issue? But where and how can I fix it?

Cheers,
Alexey.
--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
To unsubscribe from this group and stop receiving emails from it, send an email to haskell-pipe...@googlegroups.com.
moz-do-not-send="true" rel="nofollow" target="_top" link="external">[hidden email].

--
You received this message because you are subscribed to the Google Groups "Haskell Pipes" group.
send an email to [hidden email].

--