Quantcast

GHC Performance / Replacement for R?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
4 messages Options
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

GHC Performance / Replacement for R?

Dominic Steinitz-2
I am trying to use Haskell as a replacement for R but running into two
problems which I describe below. Are there any plans to address the
performance issues I have encountered?

 1. I seem to have to jump through a lot of hoops just to be able to
    select the data I am interested in.

{-# LANGUAGE ScopedTypeVariables #-}

{-# OPTIONS_GHC -Wall #-}

import Data.Csv hiding ( decodeByName )
import qualified Data.Vector as V

import Data.ByteString ( ByteString )
import qualified Data.ByteString.Char8 as B

import qualified Pipes.Prelude as P
import qualified Pipes.ByteString as Bytes
import Pipes
import qualified Pipes.Csv as Csv
import System.IO

import qualified Control.Foldl as L

main :: IO ()
main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
  let csvs :: Producer (V.Vector ByteString) IO ()
      csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
      uvectors :: Producer (V.Vector ByteString) IO ()
      uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
  vec_vec <- L.impurely P.foldM  L.vector uvectors
  print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
  print $ V.length vec_vec
  let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
  print $ V.length rockspring

Here's the equivalent R:

df <- read.csv("787338586_T_ONTIME.csv")
rockspring <- df[df$ORIGIN == "RKS",]

 2. Now I think I could improve the above to make an environment that
    is more similar to the one my colleagues are used to in R but more
    problematical is the memory usage.

 * 112.5M file
 * Just loading the source into ghci takes 142.7M
 * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
   takes me up to 4.75G. But we probably don't want to do this!
 * Let's try again.
 * > :set -XScopedTypeVariables
 * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
 * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
 * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >-> P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
 * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM  L.vector uvectors
 * Now I am up at 3.17G. In R I am under 221.3M.
 * > V.length rockspring takes a long time to return 155 and now I am
   at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
   instantaneous and now uses only 379.5M.
 * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
   37 columns and 471,949 rows.

Running this as an executable gives

~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s
["2014-01-01","EV","20366","N904EV","2512","10747","1074702","30747",
 "BRO","Brownsville, TX","Texas","11298","1129803","30194",
  "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
  "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
  "","0.00","482.00","","","","","",""]
471949
155
  14,179,764,240 bytes allocated in the heap
   3,378,342,072 bytes copied during GC
     786,333,512 bytes maximum residency (13 sample(s))
      36,933,976 bytes maximum slop
            1434 MB total memory in use (0 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0     26989 colls,     0 par    1.423s   1.483s     0.0001s    0.0039s
  Gen  1        13 colls,     0 par    1.005s   1.499s     0.1153s    0.6730s

  INIT    time    0.000s  (  0.003s elapsed)
  MUT     time    3.195s  (  3.193s elapsed)
  GC      time    2.428s  (  2.982s elapsed)
  EXIT    time    0.016s  (  0.138s elapsed)
  Total   time    5.642s  (  6.315s elapsed)

  %GC     time      43.0%  (47.2% elapsed)

  Alloc rate    4,437,740,019 bytes per MUT second

  Productivity  57.0% of total user, 50.9% of total elapsed

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

RE: GHC Performance / Replacement for R?

Haskell - Glasgow-haskell-users mailing list
Sounds bad.  But it'll need someone with bytestring expertise to debug.  Maybe there's a GHC problem underlying; or maybe it's shortcoming of bytestring.  

Simon

|  -----Original Message-----
|  From: Glasgow-haskell-users [mailto:glasgow-haskell-users-
|  [hidden email]] On Behalf Of Dominic Steinitz
|  Sent: 25 August 2016 10:11
|  To: GHC users <[hidden email]>
|  Subject: GHC Performance / Replacement for R?
|  
|  I am trying to use Haskell as a replacement for R but running into two
|  problems which I describe below. Are there any plans to address the
|  performance issues I have encountered?
|  
|   1. I seem to have to jump through a lot of hoops just to be able to
|      select the data I am interested in.
|  
|  {-# LANGUAGE ScopedTypeVariables #-}
|  
|  {-# OPTIONS_GHC -Wall #-}
|  
|  import Data.Csv hiding ( decodeByName )
|  import qualified Data.Vector as V
|  
|  import Data.ByteString ( ByteString )
|  import qualified Data.ByteString.Char8 as B
|  
|  import qualified Pipes.Prelude as P
|  import qualified Pipes.ByteString as Bytes import Pipes import
|  qualified Pipes.Csv as Csv import System.IO
|  
|  import qualified Control.Foldl as L
|  
|  main :: IO ()
|  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
|    let csvs :: Producer (V.Vector ByteString) IO ()
|        csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|        uvectors :: Producer (V.Vector ByteString) IO ()
|        uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
|    vec_vec <- L.impurely P.foldM  L.vector uvectors
|    print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
|    print $ V.length vec_vec
|    let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
|    print $ V.length rockspring
|  
|  Here's the equivalent R:
|  
|  df <- read.csv("787338586_T_ONTIME.csv")
|  rockspring <- df[df$ORIGIN == "RKS",]
|  
|   2. Now I think I could improve the above to make an environment that
|      is more similar to the one my colleagues are used to in R but more
|      problematical is the memory usage.
|  
|   * 112.5M file
|   * Just loading the source into ghci takes 142.7M
|   * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
|     takes me up to 4.75G. But we probably don't want to do this!
|   * Let's try again.
|   * > :set -XScopedTypeVariables
|   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
|   * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
|  HasHeader (Bytes.fromHandle h) >-> P.concat
|   * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >->
|  P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
|   * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
|  L.vector uvectors
|   * Now I am up at 3.17G. In R I am under 221.3M.
|   * > V.length rockspring takes a long time to return 155 and now I am
|     at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
|     instantaneous and now uses only 379.5M.
|   * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
|     37 columns and 471,949 rows.
|  
|  Running this as an executable gives
|  
|  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
|  01","EV","20366","N904EV","2512","10747","1074702","30747",
|   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
|    "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
|    "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
|    "","0.00","482.00","","","","","",""]
|  471949
|  155
|    14,179,764,240 bytes allocated in the heap
|     3,378,342,072 bytes copied during GC
|       786,333,512 bytes maximum residency (13 sample(s))
|        36,933,976 bytes maximum slop
|              1434 MB total memory in use (0 MB lost due to
|  fragmentation)
|  
|                                       Tot time (elapsed)  Avg pause
|  Max pause
|    Gen  0     26989 colls,     0 par    1.423s   1.483s     0.0001s
|  0.0039s
|    Gen  1        13 colls,     0 par    1.005s   1.499s     0.1153s
|  0.6730s
|  
|    INIT    time    0.000s  (  0.003s elapsed)
|    MUT     time    3.195s  (  3.193s elapsed)
|    GC      time    2.428s  (  2.982s elapsed)
|    EXIT    time    0.016s  (  0.138s elapsed)
|    Total   time    5.642s  (  6.315s elapsed)
|  
|    %GC     time      43.0%  (47.2% elapsed)
|  
|    Alloc rate    4,437,740,019 bytes per MUT second
|  
|    Productivity  57.0% of total user, 50.9% of total elapsed
|  
|  _______________________________________________
|  Glasgow-haskell-users mailing list
|  [hidden email]
|  https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
|  askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
|  users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
|  3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
|  NSOJREZrt2Fsqhi63iJOpxmku68E%3d
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: GHC Performance / Replacement for R?

Iavor Diatchki
Hello,

when you parse the CSV fully, you end up creating a lot of small bytestring objects, and each of these adds some overhead.   The vectors themselves add up some additional overhead.  All of this adds up when you have as many fields as you do.   An alternative would be to use a different representation for the data, which recomputes things when needed.   While this might be a bit slower in some cases, it could have significant saving in terms of memory use.   I wrote up a small example to illustrate what I have in mind, which should be attached to this e-mail.

Basically, instead of parsing the CSV file fully, I just indexed where the lines are (ref. the "rows" field of "CSV").  This allows me to access each row quickly, and the when I need to get a specific field, I simply parse the bytes of the row.
One could play all kinds of games like that, and I imagine R does something similar, although I have never looked at how it works.   To test the approach I generated ~200Mb of sample data (generator is also in the attached file), and I was able to filter it using ~240Mb, which is comparable to what you reported about R.  One could probably package all this up in library that supports "R like" operations. 

These are the stats I get from -s:

   4,137,632,432 bytes allocated in the heap
         925,200 bytes copied during GC
     200,104,224 bytes maximum residency (2 sample(s))
       6,217,864 bytes maximum slop
             246 MB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7564 colls,     0 par    0.024s   0.011s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.364s  (  0.451s elapsed)
  GC      time    0.024s  (  0.011s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.388s  (  0.463s elapsed)

  %GC     time       6.2%  (2.5% elapsed)

  Alloc rate    11,367,122,065 bytes per MUT second

  Productivity  93.8% of total user, 78.6% of total elapsed

-Iavor




On Thu, Aug 25, 2016 at 3:31 AM, Simon Peyton Jones via Glasgow-haskell-users <[hidden email]> wrote:
Sounds bad.  But it'll need someone with bytestring expertise to debug.  Maybe there's a GHC problem underlying; or maybe it's shortcoming of bytestring.

Simon

|  -----Original Message-----
|  From: Glasgow-haskell-users [mailto:[hidden email]
[hidden email]] On Behalf Of Dominic Steinitz
|  Sent: 25 August 2016 10:11
|  To: GHC users <[hidden email]>
|  Subject: GHC Performance / Replacement for R?
|
|  I am trying to use Haskell as a replacement for R but running into two
|  problems which I describe below. Are there any plans to address the
|  performance issues I have encountered?
|
|   1. I seem to have to jump through a lot of hoops just to be able to
|      select the data I am interested in.
|
|  {-# LANGUAGE ScopedTypeVariables #-}
|
|  {-# OPTIONS_GHC -Wall #-}
|
|  import Data.Csv hiding ( decodeByName )
|  import qualified Data.Vector as V
|
|  import Data.ByteString ( ByteString )
|  import qualified Data.ByteString.Char8 as B
|
|  import qualified Pipes.Prelude as P
|  import qualified Pipes.ByteString as Bytes import Pipes import
|  qualified Pipes.Csv as Csv import System.IO
|
|  import qualified Control.Foldl as L
|
|  main :: IO ()
|  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
|    let csvs :: Producer (V.Vector ByteString) IO ()
|        csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|        uvectors :: Producer (V.Vector ByteString) IO ()
|        uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
|    vec_vec <- L.impurely P.foldM  L.vector uvectors
|    print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
|    print $ V.length vec_vec
|    let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
|    print $ V.length rockspring
|
|  Here's the equivalent R:
|
|  df <- read.csv("787338586_T_ONTIME.csv")
|  rockspring <- df[df$ORIGIN == "RKS",]
|
|   2. Now I think I could improve the above to make an environment that
|      is more similar to the one my colleagues are used to in R but more
|      problematical is the memory usage.
|
|   * 112.5M file
|   * Just loading the source into ghci takes 142.7M
|   * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
|     takes me up to 4.75G. But we probably don't want to do this!
|   * Let's try again.
|   * > :set -XScopedTypeVariables
|   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
|   * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
|  HasHeader (Bytes.fromHandle h) >-> P.concat
|   * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >->
|  P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
|   * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
|  L.vector uvectors
|   * Now I am up at 3.17G. In R I am under 221.3M.
|   * > V.length rockspring takes a long time to return 155 and now I am
|     at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
|     instantaneous and now uses only 379.5M.
|   * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
|     37 columns and 471,949 rows.
|
|  Running this as an executable gives
|
|  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
|  01","EV","20366","N904EV","2512","10747","1074702","30747",
|   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
|    "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
|    "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
|    "","0.00","482.00","","","","","",""]
|  471949
|  155
|    14,179,764,240 bytes allocated in the heap
|     3,378,342,072 bytes copied during GC
|       786,333,512 bytes maximum residency (13 sample(s))
|        36,933,976 bytes maximum slop
|              1434 MB total memory in use (0 MB lost due to
|  fragmentation)
|
|                                       Tot time (elapsed)  Avg pause
|  Max pause
|    Gen  0     26989 colls,     0 par    1.423s   1.483s     0.0001s
|  0.0039s
|    Gen  1        13 colls,     0 par    1.005s   1.499s     0.1153s
|  0.6730s
|
|    INIT    time    0.000s  (  0.003s elapsed)
|    MUT     time    3.195s  (  3.193s elapsed)
|    GC      time    2.428s  (  2.982s elapsed)
|    EXIT    time    0.016s  (  0.138s elapsed)
|    Total   time    5.642s  (  6.315s elapsed)
|
|    %GC     time      43.0%  (47.2% elapsed)
|
|    Alloc rate    4,437,740,019 bytes per MUT second
|
|    Productivity  57.0% of total user, 50.9% of total elapsed
|
|  _______________________________________________
|  Glasgow-haskell-users mailing list
[hidden email]
https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
|  users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
|  3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
|  NSOJREZrt2Fsqhi63iJOpxmku68E%3d
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users


_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

test.hs (2K) Download Attachment
Reply | Threaded
Open this post in threaded view
|  
Report Content as Inappropriate

Re: GHC Performance / Replacement for R?

Dominic Steinitz-2

Hi Iavor,

Thank you very much for this. It's nice to know that we have the ability in Haskell to be as frugal (or profligate) with memory as R when working with data frames. I should say this number of fields is quite low in the data science world. Data sets with 500 columns are not uncommon and I did have one with 10,000 columns!

I know other folks have worked on producing something like data frames e.g. https://github.com/acowley/Frames and http://stla.github.io/stlapblog/posts/HaskellFrames.html for example but I wanted to remain in the world of relatively simple types and I haven't looked at its performance in terms of memory. On the plus side it did manage to read in the 10,000 column data set although ghc took about 5 minutes to do the typechecking (I should say within ghci).

Just to mention that R is not the only language that has nice facilities for data exploration; python has a package called pandas: http://pandas.pydata.org.

I feel we still have a way to go to make Haskell provide as easy an environment for data exploration as R or Python but I shall continue on my crusade.

Many thanks once again, Dominic.


On 30/08/2016 22:05, Iavor Diatchki wrote:
Hello,

when you parse the CSV fully, you end up creating a lot of small bytestring objects, and each of these adds some overhead.   The vectors themselves add up some additional overhead.  All of this adds up when you have as many fields as you do.   An alternative would be to use a different representation for the data, which recomputes things when needed.   While this might be a bit slower in some cases, it could have significant saving in terms of memory use.   I wrote up a small example to illustrate what I have in mind, which should be attached to this e-mail.

Basically, instead of parsing the CSV file fully, I just indexed where the lines are (ref. the "rows" field of "CSV").  This allows me to access each row quickly, and the when I need to get a specific field, I simply parse the bytes of the row.
One could play all kinds of games like that, and I imagine R does something similar, although I have never looked at how it works.   To test the approach I generated ~200Mb of sample data (generator is also in the attached file), and I was able to filter it using ~240Mb, which is comparable to what you reported about R.  One could probably package all this up in library that supports "R like" operations. 

These are the stats I get from -s:

   4,137,632,432 bytes allocated in the heap
         925,200 bytes copied during GC
     200,104,224 bytes maximum residency (2 sample(s))
       6,217,864 bytes maximum slop
             246 MB total memory in use (1 MB lost due to fragmentation)

                                     Tot time (elapsed)  Avg pause  Max pause
  Gen  0      7564 colls,     0 par    0.024s   0.011s     0.0000s    0.0001s
  Gen  1         2 colls,     0 par    0.000s   0.001s     0.0003s    0.0006s

  INIT    time    0.000s  (  0.000s elapsed)
  MUT     time    0.364s  (  0.451s elapsed)
  GC      time    0.024s  (  0.011s elapsed)
  EXIT    time    0.000s  (  0.001s elapsed)
  Total   time    0.388s  (  0.463s elapsed)

  %GC     time       6.2%  (2.5% elapsed)

  Alloc rate    11,367,122,065 bytes per MUT second

  Productivity  93.8% of total user, 78.6% of total elapsed

-Iavor




On Thu, Aug 25, 2016 at 3:31 AM, Simon Peyton Jones via Glasgow-haskell-users <[hidden email]> wrote:
Sounds bad.  But it'll need someone with bytestring expertise to debug.  Maybe there's a GHC problem underlying; or maybe it's shortcoming of bytestring.

Simon

|  -----Original Message-----
|  From: Glasgow-haskell-users [mailto:[hidden email]
[hidden email]] On Behalf Of Dominic Steinitz
|  Sent: 25 August 2016 10:11
|  To: GHC users <[hidden email]>
|  Subject: GHC Performance / Replacement for R?
|
|  I am trying to use Haskell as a replacement for R but running into two
|  problems which I describe below. Are there any plans to address the
|  performance issues I have encountered?
|
|   1. I seem to have to jump through a lot of hoops just to be able to
|      select the data I am interested in.
|
|  {-# LANGUAGE ScopedTypeVariables #-}
|
|  {-# OPTIONS_GHC -Wall #-}
|
|  import Data.Csv hiding ( decodeByName )
|  import qualified Data.Vector as V
|
|  import Data.ByteString ( ByteString )
|  import qualified Data.ByteString.Char8 as B
|
|  import qualified Pipes.Prelude as P
|  import qualified Pipes.ByteString as Bytes import Pipes import
|  qualified Pipes.Csv as Csv import System.IO
|
|  import qualified Control.Foldl as L
|
|  main :: IO ()
|  main = withFile "examples/787338586_T_ONTIME.csv" ReadMode $ \h -> do
|    let csvs :: Producer (V.Vector ByteString) IO ()
|        csvs = Csv.decode HasHeader (Bytes.fromHandle h) >-> P.concat
|        uvectors :: Producer (V.Vector ByteString) IO ()
|        uvectors = csvs  >-> P.map (V.foldr V.cons V.empty)
|    vec_vec <- L.impurely P.foldM  L.vector uvectors
|    print $ (vec_vec :: V.Vector (V.Vector ByteString)) V.! 17
|    print $ V.length vec_vec
|    let rockspring = V.filter (\x -> x V.! 8 == B.pack "RKS") vec_vec
|    print $ V.length rockspring
|
|  Here's the equivalent R:
|
|  df <- read.csv("787338586_T_ONTIME.csv")
|  rockspring <- df[df$ORIGIN == "RKS",]
|
|   2. Now I think I could improve the above to make an environment that
|      is more similar to the one my colleagues are used to in R but more
|      problematical is the memory usage.
|
|   * 112.5M file
|   * Just loading the source into ghci takes 142.7M
|   * > foo <- readFile "examples/787338586_T_ONTIME.csv" > length foo
|     takes me up to 4.75G. But we probably don't want to do this!
|   * Let's try again.
|   * > :set -XScopedTypeVariables
|   * > h <- openFile "examples/787338586_T_ONTIME.csv" ReadMode
|   * > let csvs :: Producer (V.Vector ByteString) IO () = Csv.decode
|  HasHeader (Bytes.fromHandle h) >-> P.concat
|   * > let uvectors :: Producer (V.Vector ByteString) IO () = csvs  >->
|  P.map (V.map id) >-> P.map (V.foldr V.cons V.empty)
|   * > vec_vec :: V.Vector (V.Vector ByteString) <- L.impurely P.foldM
|  L.vector uvectors
|   * Now I am up at 3.17G. In R I am under 221.3M.
|   * > V.length rockspring takes a long time to return 155 and now I am
|     at 3.5G!!! In R > rockspring <- df[df$ORIGIN == "RKS",] seems
|     instantaneous and now uses only 379.5M.
|   * > length(rockspring) 37 > length(df$ORIGIN) 471949 i.e. there are
|     37 columns and 471,949 rows.
|
|  Running this as an executable gives
|
|  ~/Dropbox/Private/labels $ ./examples/BugReport +RTS -s ["2014-01-
|  01","EV","20366","N904EV","2512","10747","1074702","30747",
|   "BRO","Brownsville, TX","Texas","11298","1129803","30194",
|    "DFW","Dallas/Fort Worth, TX","Texas","0720","0718",
|    "-2.00","8.00","0726","0837","7.00","0855","0844","-11.00","0.00",
|    "","0.00","482.00","","","","","",""]
|  471949
|  155
|    14,179,764,240 bytes allocated in the heap
|     3,378,342,072 bytes copied during GC
|       786,333,512 bytes maximum residency (13 sample(s))
|        36,933,976 bytes maximum slop
|              1434 MB total memory in use (0 MB lost due to
|  fragmentation)
|
|                                       Tot time (elapsed)  Avg pause
|  Max pause
|    Gen  0     26989 colls,     0 par    1.423s   1.483s     0.0001s
|  0.0039s
|    Gen  1        13 colls,     0 par    1.005s   1.499s     0.1153s
|  0.6730s
|
|    INIT    time    0.000s  (  0.003s elapsed)
|    MUT     time    3.195s  (  3.193s elapsed)
|    GC      time    2.428s  (  2.982s elapsed)
|    EXIT    time    0.016s  (  0.138s elapsed)
|    Total   time    5.642s  (  6.315s elapsed)
|
|    %GC     time      43.0%  (47.2% elapsed)
|
|    Alloc rate    4,437,740,019 bytes per MUT second
|
|    Productivity  57.0% of total user, 50.9% of total elapsed
|
|  _______________________________________________
|  Glasgow-haskell-users mailing list
[hidden email]
https://na01.safelinks.protection.outlook.com/?url=http%3a%2f%2fmail.h
askell.org%2fcgi-bin%2fmailman%2flistinfo%2fglasgow-haskell-
|  users&data=01%7c01%7csimonpj%40microsoft.com%7c5017a5fe26cb4df9c41d08d
|  3ccc7b5bd%7c72f988bf86f141af91ab2d7cd011db47%7c1&sdata=2Ku1Fr5QttHRoj5
|  NSOJREZrt2Fsqhi63iJOpxmku68E%3d
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users



_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users
Loading...