# Performance of Idiomatic lazy Haskell

24 messages
12
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 As a beginner fascinated about Haskell's lazy, idiomatic style I tried a small example: Calculate a numerical approximation of PI based on the Leibniz formula. Variant 1: My lazy, idiomatic approach (as I understand it so far). It takes on my machine with eps = 0.00000001 approx. 25 sec. import Data.List main :: IO () main = do putStrLn "EPS: "           eps <- readLn :: IO Double           let pi14 = calcpi14 eps           putStrLn \$ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) calcpi14:: Double->Double calcpi14 eps = foldl' (+) 0 \$ takeWhile (\x -> 4 * abs x > eps) ev      where e = map (\x->1.0/x) [1.0,3.0..]            ev = zipWith (*) e \$ cycle [1.0,-1.0] Variant 2: My "C-style" approach achieving state via additonal function parameters. It takes on my machine with eps = 0.00000001 approx. 5 sec. import Data.List main :: IO () main = do putStrLn "EPS: "           eps <- readLn :: IO Double           let pi14 = pisum 1 3 False eps           putStrLn \$ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum s i _     eps | 4/i < eps = s pisum s i True  eps =     let h = s + 1/i     in h `seq` pisum h (i+2) False eps pisum s i False eps =     let h = s - 1/i     in h `seq` pisum h (i+2) True eps I'd like to do idiomatic, lazy Haskell w/o such a performance penalty? How can I learn/achive that? -- Markus
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 On Sun, Jan 31, 2010 at 11:14:36AM +0100, Markus B?hm wrote: > Variant 1: My lazy, idiomatic approach (as I understand it so far). > It takes on my machine with eps = 0.00000001 approx. 25 sec. > > Variant 2: My "C-style" approach achieving state via additonal > function parameters. > It takes on my machine with eps = 0.00000001 approx. 5 sec. You could try the stream-fusion package on Hackage:   http://hackage.haskell.org/package/stream-fusionIt should give better performance on this kind of code since it can see the loops through the lists.  Don't forget to compile with optimizations! :) HTH, -- Felipe.
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 In reply to this post by Markus Böhm Hi Markus I haven't tested the efficiency, but an unfold seems natural / idiomatic in this case. The Leibniz formula generates a list in the first step then sums it. Generating a list is often a perfect fit for an unfold. The code below seems quite close to the idea of the Leibniz formula (add 2 to the denominator, flip the sign at each step), sum and multiple by 4. > import Data.List (unfoldr) > leibniz n = (4 *) \$ sum \$ take n step -- Note this unfoldr generates an infinite list (both cases produce Just values) ... > step :: [Double] > step = unfoldr phi (True,1) where >   phi (sig,d) | sig == True = Just (1/d, (False,d+2)) >               | otherwise   = Just (negate (1/d), (True,d+2)) Alternatively, an unfoldr with a stop condition might seem more natural, creating a bounded list. However, here it is a bit ugly as there are now 3 elements in the unfold state: > leibniz' n = (4 *) \$ sum \$ step' n > step' :: Int -> [Double] > step' n = unfoldr phi (0,True,1) where >   phi (i,_,_)   | i   == n    = Nothing >   phi (i,sig,d) | sig == True = Just (1/d, (i+1,False,d+2)) >                 | otherwise   = Just (negate (1/d), (i+1,True,d+2)) Best wishes Stephen
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Hi Markus Whoops, I hadn't read your email properly and wasn't accounting for the epsilon. Here's a version that does, although it is perhaps a little slow... > import Data.List (unfoldr) > leibniz eps = converge eps ser > ser :: [Double] > ser = unfoldr phi (True,1) where >   phi (sig,d) | sig == True = Just (1/d, (False,d+2)) >               | otherwise   = Just (negate (1/d), (True,d+2)) > converge :: Double -> [Double] -> Double > converge eps xs = step 0 0 xs where >   step a b (x:xs) = let a' = a + (4*x) in >                     if abs (a'-b) < eps then a' else step a' a xs > demo1 =  leibniz 0.00000000025 Best wishes Stephen
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Sonntag 31 Januar 2010 13:23:33 schrieb Stephen Tetley: > Hi Markus > > Whoops, I hadn't read your email properly and wasn't accounting for > the epsilon. Here's a version that does, although it is perhaps a > little slow... > Better use the previous and calculate how many terms you need: ============================================ module Main (main) where import Data.List (unfoldr) main :: IO () main = do     putStrLn "EPS: "     eps <- readLn :: IO Double     let mx = floor (4/eps)         !k = (mx+1) `quot` 2     putStrLn \$ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) \$ sum \$ take n step step :: [Double] step = unfoldr phi (True,1) where  ? phi (sig,d) | sig         = Just (1/d, (False,d+2))  ? ? ? ? ? ? ? | otherwise ? = Just (negate (1/d), (True,d+2)) giving \$ echo '0.00000001' | ./unfoldPi +RTS -sstderr -RTS ./unfoldPi +RTS -sstderr                                                                       EPS:                                                                                         PI mit EPS 1.0e-8 = 3.141592648589476                                                         27,305,969,616 bytes allocated in the heap                                                        2,523,788 bytes copied during GC                                                               61,660 bytes maximum residency (1 sample(s))                                                 38,864 bytes maximum slop                                                                        1 MB total memory in use (0 MB lost due to fragmentation)                       Generation 0: 52083 collections,     0 parallel,  0.29s,  0.44s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.00s  (  0.00s elapsed)   MUT   time   23.10s  ( 23.14s elapsed)   GC    time    0.29s  (  0.44s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time   23.39s  ( 23.57s elapsed)   %GC time       1.2%  (1.9% elapsed)   Alloc rate    1,182,002,769 bytes per MUT second   Productivity  98.8% of total user, 98.0% of total elapsed =============================================== a little better if we do the unfolding ourselves, not creating any intermediate pairs: leibniz n = (4*) . sum \$ unf n True 1 unf :: Int -> Bool -> Double -> [Double] unf 0 _ _ = [] unf k True n = 1/n : unf (k-1) False (n+2) unf k False n = negate (1/n) : unf (k-1) True (n+2) which gives \$ echo '0.00000001' | ./unfPi +RTS -sstderr -RTS ./unfPi +RTS -sstderr                                                                       EPS:                                                                                         PI mit EPS 1.0e-8 = 3.141592648589476                                                         15,250,801,168 bytes allocated in the heap                                                        1,284,632 bytes copied during GC                                                               61,592 bytes maximum residency (1 sample(s))                                                 38,864 bytes maximum slop                                                                        1 MB total memory in use (0 MB lost due to fragmentation)                       Generation 0: 29089 collections,     0 parallel,  0.22s,  0.32s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.00s  (  0.00s elapsed)   MUT   time   18.05s  ( 18.05s elapsed)   GC    time    0.22s  (  0.32s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time   18.28s  ( 18.37s elapsed)   %GC time       1.2%  (1.7% elapsed)   Alloc rate    844,773,242 bytes per MUT second   Productivity  98.8% of total user, 98.2% of total elapsed ============================================= Still not competitive with the loop version \$ echo '0.00000001' | ./loopPi +RTS -sstderr -RTS ./loopPi +RTS -sstderr                                                                       EPS:                                                                                         PI mit EPS 1.0e-8 = 3.1415926526069526                                                                136,248 bytes allocated in the heap                                                            2,024 bytes copied during GC                                                               51,720 bytes maximum residency (1 sample(s))                                                 38,392 bytes maximum slop                                                                        1 MB total memory in use (0 MB lost due to fragmentation)                       Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.00s  (  0.00s elapsed)   MUT   time    6.64s  (  6.64s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    6.64s  (  6.64s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    20,517 bytes per MUT second   Productivity 100.0% of total user, 100.0% of total elapsed which can also be improved: {-# LANGUAGE BangPatterns #-} module Main (main) where main :: IO () main = do     putStrLn "EPS: "     eps <- readLn :: IO Double     let mx = floor (4/eps)         k = (mx-1) `quot` 2         !pi14 = pisum (even k) (fromInteger (2*k+1))     putStrLn \$ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) -- sum from small numbers to large, to reduce cancellation -- although, in this particular case, for eps = 1e-8, the result is -- farther off than summing the other way pisum :: Bool -> Double -> Double pisum bl start = go bl start 0       where         go _ n s | n < 1 = s         go True n !s = go False (n-2) (s+recip n)         go False n !s = go True (n-2) (s-recip n) \$ echo '0.00000001' | ./mloopPi +RTS -sstderr -RTS ./mloopPi +RTS -sstderr                                                                       EPS:                                                                                         PI mit EPS 1.0e-8 = 3.141592648589793                                                                  136,416 bytes allocated in the heap            2,024 bytes copied during GC           51,720 bytes maximum residency (1 sample(s))           38,392 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.00s  (  0.00s elapsed)   MUT   time    4.55s  (  4.56s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.55s  (  4.56s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    29,966 bytes per MUT second   Productivity  99.9% of total user, 99.8% of total elapsed Most of the gain is due to the much simpler loop-break test, a small bit may be due to bang-patterns vs. seq. Bottom line: GHC isn't very good at fusing away intermediate lists in strict algorithms (it's better at it for lazy algorithms). Allocating so many cons-cells just to be immediately garbage-collected costs a lot of time. You can code the loop directly, which gives reasonable results, or, as Felipe suggested, use the fusion framework created by experts: module Main (main) where import qualified Data.List.Stream as S main :: IO () main = do     putStrLn "EPS: "     eps <- readLn :: IO Double     let mx = floor (4/eps)         !k = (mx+1) `quot` 2     putStrLn \$ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) \$ S.sum \$ S.take n step step :: [Double] step = S.unfoldr phi (True,1) where  ? phi (sig,d) | sig         = Just (1/d, (False,d+2))  ? ? ? ? ? ? ? | otherwise ? = Just (negate (1/d), (True,d+2)) which beats the hand-coded loop: \$ echo '0.00000001' | ./sunfPi +RTS -sstderr -RTS ./sunfPi +RTS -sstderr                                                                       EPS:                                                                                         PI mit EPS 1.0e-8 = 3.1415926445727678                                                                136,560 bytes allocated in the heap            2,024 bytes copied during GC           51,720 bytes maximum residency (1 sample(s))           38,392 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.00s  (  0.00s elapsed)   MUT   time    4.27s  (  4.27s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.27s  (  4.27s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    31,994 bytes per MUT second   Productivity 100.0% of total user, 100.0% of total elapsed > > import Data.List (unfoldr) > > > > > > leibniz eps = converge eps ser > > > > > > ser :: [Double] > > ser = unfoldr phi (True,1) where > >   phi (sig,d) | sig == True = Just (1/d, (False,d+2)) > > > >               | otherwise   = Just (negate (1/d), (True,d+2)) > > > > converge :: Double -> [Double] -> Double > > converge eps xs = step 0 0 xs where > >   step a b (x:xs) = let a' = a + (4*x) in > >                     if abs (a'-b) < eps then a' else step a' a xs > > > > demo1 =  leibniz 0.00000000025 > > Best wishes > > Stephen
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Hi Daniel Thanks - the figures are very impressive for the stream fusion library. I knew the paper, but I hadn't looked at it the implementation. Making a stricter unfoldr by using a strictness annotation on the state and getting rid of the tuple is nowhere near stream fusion-lib, i.e: data Maybe2 a st = Nothing2 | Just2 a !st   deriving (Eq,Show) Performance wise stream fusion even beats a monoidal unfoldr. A monoidal unfoldr seems reasonable to me for this problem as there is no need to generate a list. As a monoidal unfoldr is not in the 'standard' libraries some people might not consider it idiomatic though. I put in a fixed value for epsilon in all three version rather than used echo at the command line and compiled all with ghc --make -O2 -------------- module Main (main) where import Data.Monoid data Maybe2 a st = Nothing2 | Just2 a !st   deriving (Eq,Show) dummy_eps :: Double dummy_eps = 0.00000001 main :: IO () main = do     putStrLn "EPS: "     eps <- return dummy_eps     let mx = floor (4/eps)         !k = (mx+1) `quot` 2     putStrLn \$ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz n = (4 *) \$ getSum \$ step n step :: Integer -> Sum Double step times = unfoldrMon phi (0,True,1) where    phi (i,_,_)   | i == times  = Nothing2    phi (i,sig,d) | sig         = Just2 (Sum (1/d)) (i+1,False,d+2)                  | otherwise   = Just2 (Sum (negate (1/d))) (i+1,True,d+2) unfoldrMon      :: Monoid a => (b -> Maybe2 a b) -> b -> a unfoldrMon f b  =   case f b of    Just2 a new_b -> a `mappend` unfoldrMon f new_b    Nothing2      -> mempty -- --------------------------------------------------- Leibniz1 (stream fusion) --------------------------------------------------- \$ ./Leibniz1 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926445727678           24,404 bytes allocated in the heap              892 bytes copied during GC            3,068 bytes maximum residency (1 sample(s))           13,316 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.03s  (  0.00s elapsed)   MUT   time    4.59s  (  4.59s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.63s  (  4.59s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    5,276 bytes per MUT second   Productivity  99.3% of total user, 100.0% of total elapsed --------------------------------------------------- Leibniz3 (monoidal unfoldr) --------------------------------------------------- \$ ./Leibniz3 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr EPS: Stack space overflow: current size 8388608 bytes. Use `+RTS -Ksize -RTS' to increase it.       92,448,984 bytes allocated in the heap           15,368 bytes copied during GC        8,382,800 bytes maximum residency (5 sample(s))        4,198,684 bytes maximum slop               17 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:   125 collections,     0 parallel,  1.66s,  1.66s elapsed   Generation 1:     5 collections,     0 parallel,  0.03s,  0.03s elapsed   INIT  time    0.03s  (  0.00s elapsed)   MUT   time    3.50s  (  3.55s elapsed)   GC    time    1.69s  (  1.69s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    5.22s  (  5.23s elapsed)   %GC time      32.3%  (32.2% elapsed)   Alloc rate    26,180,243 bytes per MUT second   Productivity  67.1% of total user, 66.9% of total elapsed --------------------------------------------------- Leibniz2 - no stream fusion, unfoldr with strictness annotation on the state --------------------------------------------------- \$ ./Leibniz2 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz2.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.141592648589476   25,600,024,064 bytes allocated in the heap        2,152,224 bytes copied during GC            3,336 bytes maximum residency (1 sample(s))           11,908 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0: 48828 collections,     0 parallel,  2.19s,  2.23s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.02s  (  0.00s elapsed)   MUT   time   75.16s  ( 76.22s elapsed)   GC    time    2.19s  (  2.23s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time   77.36s  ( 78.45s elapsed)   %GC time       2.8%  (2.8% elapsed)   Alloc rate    340,553,219 bytes per MUT second   Productivity  97.2% of total user, 95.8% of total elapsed
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Arghh, an error in that post that I didn't notice, I send a revision soon... Apologies Stephen > > --------------------------------------------------- > Leibniz3 (monoidal unfoldr) > --------------------------------------------------- > > \$ ./Leibniz3 +RTS -sstderr -RTS > d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr > EPS: > Stack space overflow: current size 8388608 bytes.
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 In reply to this post by Stephen Tetley-2 Am Sonntag 31 Januar 2010 15:52:41 schrieb Stephen Tetley: > Hi Daniel > > Thanks - the figures are very impressive for the stream fusion > library. I knew the paper, but I hadn't looked at it the > implementation. > Just for the record, the loops (my variant) coded in C take 2.7 seconds (icc) resp. 3.0 seconds (gcc) with -O3. Markus' loop takes 5.32s (icc) resp. 6.04s (gcc) with -O3. Well, that's largely due to the fact that that takes two divisions per iteration (and divisions are very slow on my box). Rewriting it so that only one division per iteration is necessary: ================================================ module Main (main) where main :: IO () main = do putStrLn "EPS: " ? ? ? ? ? eps <- readLn :: IO Double ? ? ? ? ? let pi14 = pisum 1 3 False (eps*0.25) ? ? ? ? ? putStrLn \$ "PI mit EPS "++(show eps)++" = "++ show(4*pi14) pisum s i b eps     | d < eps   = s     | b         = let h = s+d in h `seq` pisum h (i+2) False eps     | otherwise = let h = s-d in h `seq` pisum h (i+2) True eps       where         d = 1/i ================================================ it takes 5.42 seconds (eps = 1e-8) when compiled natively, that loop in C is practically indistinguishable from my loop variant. So stream-fusion or hand-coded loops in Haskell are clearly slower than C loops, but don't stink when compiled with the native code generator. However, I get identical to gcc performance with ghc -O2 -fexcess-precision -fvia-C -optc-O3 [-optc-ffast-math doesn't make any difference then] for the hand-coded loops (all, my variant, Markus' original, Markus' rewritten - of course, each identical to the corresponding C-loop). Also for stream-fusion. Some of the list creating variants also profit from compilation via C, others not. However, none come near the NCG loop performance, let alone C loop performance. To sum up: Although GHC's native code generator does a not too bad job with arithmetic-heavy loops, gcc is better (even without optimisations). But if you compile via C, you can get identical performance - for these loops. In general, things are more complicated, often the native code generator produces better code than the via-C route. It's usually worth a test which produces the better binary, native or via-C.
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 In reply to this post by Stephen Tetley-2 Hello Daniel I can get close with a tail recursive + accumulator monoidal unfoldr and strict triple for the state (leibniz4). But it does seem to be loosing the point of the initial exercise "to be idiomatic", also the answers are divergent too... Best wishes Stephen ------ Leibniz4 - monodial unfoldr strict triple ------ \$ ./Leibniz4 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz4.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926526069526           24,424 bytes allocated in the heap              892 bytes copied during GC            3,068 bytes maximum residency (1 sample(s))           13,316 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.03s  (  0.00s elapsed)   MUT   time    4.34s  (  4.45s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.38s  (  4.45s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    5,582 bytes per MUT second   Productivity  99.3% of total user, 97.5% of total elapsed ------ Leibniz1 - stream fusion ------ \$ ./Leibniz1 +RTS -sstderr -RTS d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr EPS: PI mit EPS 1.0e-8 = 3.1415926445727678           24,404 bytes allocated in the heap              892 bytes copied during GC            3,068 bytes maximum residency (1 sample(s))           13,316 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.02s  (  0.00s elapsed)   MUT   time    4.61s  (  4.63s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.63s  (  4.63s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    5,276 bytes per MUT second   Productivity  99.7% of total user, 99.7% of total elapsed ---------- module Main (main) where import Data.Monoid data Maybe2 a st = Nothing2 | Just2 a !st   deriving (Eq,Show) dummy_eps :: Double dummy_eps = 0.00000001 main :: IO () main = do     putStrLn "EPS: "     eps <- return dummy_eps     let mx = floor (4/eps)         !k = (mx+1) `quot` 2     putStrLn \$ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k) leibniz :: Integer -> Double leibniz n = (4 *) \$ getSum \$ step (fromIntegral n) data Trip = T3 !Int !Bool !Double step :: Int -> Sum Double step times = unfoldrMon phi (T3 0 True 1) where    phi :: Trip -> Maybe2 (Sum Double) Trip    phi (T3 i _   _) | i == times  = Nothing2    phi (T3 i sig d) | sig         = Just2 (Sum (1/d)) (T3 (i+1) False (d+2))                     | otherwise   = Just2 (Sum (negate (1/d))) (T3 (i+1) True (d+2)) unfoldrMon      :: Monoid a => (b -> Maybe2 a b) -> b -> a unfoldrMon f b0  = step b0 mempty where    step b acc = case f b of               Nothing2       -> acc               Just2 a new_b  -> step new_b  (a `mappend` acc)
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Sonntag 31 Januar 2010 19:21:10 schrieb Stephen Tetley: > Hello Daniel > > I can get close with a tail recursive + accumulator monoidal unfoldr > and strict triple for the state (leibniz4). Beats stream fusion and direct loops comfortably here (3.55s) when compiled with the NCG, very narrowly when compiled via C (2.99s). Congrats. > > But it does seem to be loosing the point of the initial exercise "to > be idiomatic", Yes. Returning to that, a) for tasks where it's appropriate/sufficiently natural, hand-coded loops are idiomatic, too. b) use stream-fusion to get rid of intermediate structures. > also the answers are divergent too... In how far? > > > Best wishes > > Stephen
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 On 31 January 2010 18:50, Daniel Fischer <[hidden email]> wrote: > > Yes. Returning to that, > > a) for tasks where it's appropriate/sufficiently natural, hand-coded loops > are idiomatic, too. > Indeed - originally I looked at Markus's code and only saw that an unfoldr would be a natural implementation for the Leibniz generating function. I overlooked the episilon used to control the accuracy which pollutes the unfold and makes it no clearer (too much haste / speed on my part). >> also the answers are divergent too... > > In how far? On my machine, they diverged at the eighth decimal place - Leibniz4 3.1415926526069526 Leibniz1 3.1415926445727678 Best wishes Stephen
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Sonntag 31 Januar 2010 20:22:56 schrieb Stephen Tetley: > On my machine, they diverged at the eighth decimal place - > > Leibniz4 3.1415926526069526 > Leibniz1 3.1415926445727678 Ah, that. I thought you were talking about timings. There are several things that can lead to differing results here, 1) order of summation (I think only my loop had a different order) 2) due to different tests, it might happen that one algorithm evaluates one term more than the other 3) are intermediate results truncated to 64 bits or kept in an 80-bit register? I'm astonished by the size of the difference, though. But if you sum from small to large, the results are very accurately the theoretically expected results, so I think it's 1) together with 3) which make the difference.
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Montag 01 Februar 2010 10:37:41 schrieb Markus B?hm: > Daniel, Stephen, Felipe thanks for Your answers/advice, which I > studied this morning. Helps me a lot on my learning curve. I guess it > takes some time to get a feeling for the performance/runtime behaviour > of abstractions in Haskell. > > Just for fun and w/o knowing Lua: I translated my Haskell > "loops-while"-version into Lua That's not quite a translation of the loop you posted. Not really important, though, small difference. > and ran on LuaJit Beta 2.0.0. Seems to > beat our best Haskell version. So LuaJit Beta 2.0.0 beats gcc -O3 on such a simple loop? I'm not going to install Lua to test, but if that's really the case, I'm sure the folks at gnu.org would like to hear about it. Or don't you consider -fvia-C compiled loops to be true Haskell versions? > > print("EPS: ") > EPS = io.read("*number") > pi2 = 1 > i = 3 > s = false > repeat >     pi1 = pi2 >     if s then >         pi2 = pi1 + 1.0/i >         s = false >     else >         pi2 = pi1 - 1.0/i >         s = true >     end >     i = i+2 > > until EPS > 4.0 * math.abs(pi2-pi1) > > print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2) > > > -- Markus
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm: > Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with > attached variant. That's pretty fast. > I compiled all our Haskell variants with ghc --make -O2. Well, GHC isn't as good at loop-optimising as gcc is. Depending on the loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than the NCG compiled. > I don't know about -fvia-C, have to find out. I hope I didn't > distract You with my Lua variant. No sweat. > In any case thank You very much for > your advice. Markus. > You're welcome. Can you try the below with ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs and run with echo '0.00000001' | time ./luaLoop ? It's a fairly direct translation of the Lua code, and it runs here more or less equally fast as (gcc compiled) C-loops. ================================= module Main (main) where main :: IO () main = do     putStrLn "EPS:"     eps <- readLn :: IO Double     print \$ 4*calcPi eps calcPi :: Double -> Double calcPi eps = go False 1 3       where         go bl p1 i             | 4*abs(p2-p1) < eps    = p1             | otherwise             = go (not bl) p2 (i+2)               where                 p2  | bl        = p1+1/i                     | otherwise = p1-1/i ==================================
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem to work. It says in German: specified time can't be read. Give a new time. -- Markus On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer <[hidden email]> wrote: > Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm: >> Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with >> attached variant. > > That's pretty fast. > >> I compiled all our Haskell variants with ghc --make -O2. > > Well, GHC isn't as good at loop-optimising as gcc is. Depending on the > loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than > the NCG compiled. > >> I don't know about -fvia-C, have to find out. I hope I didn't >> distract You with my Lua variant. > > No sweat. > >> In any case thank You very much for >> your advice. Markus. >> > > You're welcome. > > Can you try the below with > > ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs > > and run with > > echo '0.00000001' | time ./luaLoop > > ? > It's a fairly direct translation of the Lua code, and it runs here more or > less equally fast as (gcc compiled) C-loops. > > ================================= > module Main (main) where > > main :: IO () > main = do > ? ?putStrLn "EPS:" > ? ?eps <- readLn :: IO Double > ? ?print \$ 4*calcPi eps > > calcPi :: Double -> Double > calcPi eps = go False 1 3 > ? ? ?where > ? ? ? ?go bl p1 i > ? ? ? ? ? ?| 4*abs(p2-p1) < eps ? ?= p1 > ? ? ? ? ? ?| otherwise ? ? ? ? ? ? = go (not bl) p2 (i+2) > ? ? ? ? ? ? ?where > ? ? ? ? ? ? ? ?p2 ?| bl ? ? ? ?= p1+1/i > ? ? ? ? ? ? ? ? ? ?| otherwise = p1-1/i > > ================================== > >
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Daniel, here another feedback: F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p 085-pi_lualoop.hs [1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking luaLoop.exe ... F:\MeineUebungen>echo '0.00000001' | time ./luaLoop Eingegebene Zeit kann nicht ?bernommen werden. Geben Sie die neue Zeit ein: F:\MeineUebungen>luaLoop +RTS -sstderr -RTS luaLoop +RTS -sstderr EPS: 0.00000001 3.1415926485894725           61,428 bytes allocated in the heap            1,316 bytes copied during GC            4,564 bytes maximum residency (1 sample(s))           11,820 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.02s  (  0.00s elapsed)   MUT   time    8.06s  ( 16.83s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    8.08s  ( 16.83s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    7,604 bytes per MUT second   Productivity  99.8% of total user, 47.9% of total elapsed -- Markus On Mon, Feb 1, 2010 at 3:08 PM, Markus B?hm <[hidden email]> wrote: > Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem > to work. It says in German: specified time can't be read. Give a new > time. > > -- Markus > > > > > On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer <[hidden email]> wrote: >> Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm: >>> Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with >>> attached variant. >> >> That's pretty fast. >> >>> I compiled all our Haskell variants with ghc --make -O2. >> >> Well, GHC isn't as good at loop-optimising as gcc is. Depending on the >> loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than >> the NCG compiled. >> >>> I don't know about -fvia-C, have to find out. I hope I didn't >>> distract You with my Lua variant. >> >> No sweat. >> >>> In any case thank You very much for >>> your advice. Markus. >>> >> >> You're welcome. >> >> Can you try the below with >> >> ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs >> >> and run with >> >> echo '0.00000001' | time ./luaLoop >> >> ? >> It's a fairly direct translation of the Lua code, and it runs here more or >> less equally fast as (gcc compiled) C-loops. >> >> ================================= >> module Main (main) where >> >> main :: IO () >> main = do >> ? ?putStrLn "EPS:" >> ? ?eps <- readLn :: IO Double >> ? ?print \$ 4*calcPi eps >> >> calcPi :: Double -> Double >> calcPi eps = go False 1 3 >> ? ? ?where >> ? ? ? ?go bl p1 i >> ? ? ? ? ? ?| 4*abs(p2-p1) < eps ? ?= p1 >> ? ? ? ? ? ?| otherwise ? ? ? ? ? ? = go (not bl) p2 (i+2) >> ? ? ? ? ? ? ?where >> ? ? ? ? ? ? ? ?p2 ?| bl ? ? ? ?= p1+1/i >> ? ? ? ? ? ? ? ? ? ?| otherwise = p1-1/i >> >> ================================== >> >> >
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 Am Montag 01 Februar 2010 15:22:45 schrieb Markus B?hm: > Daniel, I use GHC 6.12.1 and Windows XP. *sigh* Have you a CygWin installed? It should run as given from a CygWin shell, I believe. > The time command doesn't seem > to work. It says in German: specified time can't be read. Give a new > time. Replace "time" with however you get the CPU time used by a process on Windows. But I think Windows doesn't allow piping, and I'm not sure whether it has echo, so see below. > Daniel, here another feedback: > > F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop > --make p 085-pi_lualoop.hs > [1 of 1] Compiling Main             ( p085-pi_lualoop.hs, > p085-pi_lualoop.o ) Linking luaLoop.exe ... > > F:\MeineUebungen>echo '0.00000001' | time ./luaLoop > Eingegebene Zeit kann nicht ?bernommen werden. > Geben Sie die neue Zeit ein: > > F:\MeineUebungen>luaLoop +RTS -sstderr -RTS > luaLoop +RTS -sstderr > EPS: > 0.00000001 > 3.1415926485894725 >           61,428 bytes allocated in the heap >            1,316 bytes copied during GC >            4,564 bytes maximum residency (1 sample(s)) >           11,820 bytes maximum slop >                1 MB total memory in use (0 MB lost due to fragmentation) > >   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s > elapsed Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s > elapsed > >   INIT  time    0.02s  (  0.00s elapsed) >   MUT   time    8.06s  ( 16.83s elapsed) >   GC    time    0.00s  (  0.00s elapsed) >   EXIT  time    0.00s  (  0.00s elapsed) >   Total time    8.08s  ( 16.83s elapsed) > >   %GC time       0.0%  (0.0% elapsed) > >   Alloc rate    7,604 bytes per MUT second > >   Productivity  99.8% of total user, 47.9% of total elapsed What????? Please try the following: 1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as a command line argument) 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- make Source.hs 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs 4) cloop +RTS -sstderr 5) nloop +RTS -sstderr > > -- Markus >
Open this post in threaded view
|

## Performance of Idiomatic lazy Haskell

 1. I used Your lualoop file with content: module Main (main) where main :: IO () main = do    --putStrLn "EPS:"    -- eps <- readLn :: IO Double    print \$ 4*calcPi 0.00000001 calcPi :: Double -> Double calcPi eps = go False 1 3      where        go bl p1 i            | 4*abs(p2-p1) < eps    = p1            | otherwise             = go (not bl) p2 (i+2)              where                p2  | bl        = p1+1/i                    | otherwise = p1-1/i 2. F:\MeineUebungen>ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o c loop --make p085-pi_lualoop.hs [1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking cloop.exe ... F:\MeineUebungen>ghc -O2 -fforce-recomp -o nloop --make p085-pi_lualoop.hs [1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o ) Linking nloop.exe ... F:\MeineUebungen>cloop +RTS -sstderr cloop +RTS -sstderr 3.1415926485894725           20,860 bytes allocated in the heap              892 bytes copied during GC            3,068 bytes maximum residency (1 sample(s))           13,316 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.02s  (  0.02s elapsed)   MUT   time    7.39s  (  7.27s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    7.41s  (  7.28s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    2,816 bytes per MUT second   Productivity  99.8% of total user, 101.5% of total elapsed F:\MeineUebungen>nloop +RTS -sstderr nloop +RTS -sstderr 3.1415926485894725           20,860 bytes allocated in the heap              892 bytes copied during GC            3,068 bytes maximum residency (1 sample(s))           13,316 bytes maximum slop                1 MB total memory in use (0 MB lost due to fragmentation)   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed   Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed   INIT  time    0.02s  (  0.00s elapsed)   MUT   time    4.77s  (  4.81s elapsed)   GC    time    0.00s  (  0.00s elapsed)   EXIT  time    0.00s  (  0.00s elapsed)   Total time    4.78s  (  4.81s elapsed)   %GC time       0.0%  (0.0% elapsed)   Alloc rate    4,362 bytes per MUT second   Productivity  99.7% of total user, 99.0% of total elapsed -- Markus On Mon, Feb 1, 2010 at 3:40 PM, Daniel Fischer <[hidden email]> wrote: > Am Montag 01 Februar 2010 15:22:45 schrieb Markus B?hm: > >> Daniel, I use GHC 6.12.1 and Windows XP. > > *sigh* > > Have you a CygWin installed? It should run as given from a CygWin shell, I > believe. > >> The time command doesn't seem >> to work. It says in German: specified time can't be read. Give a new >> time. > > Replace "time" with however you get the CPU time used by a process on > Windows. > But I think Windows doesn't allow piping, and I'm not sure whether it has > echo, so see below. > >> Daniel, here another feedback: >> >> F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop >> --make p 085-pi_lualoop.hs >> [1 of 1] Compiling Main ? ? ? ? ? ? ( p085-pi_lualoop.hs, >> p085-pi_lualoop.o ) Linking luaLoop.exe ... >> >> F:\MeineUebungen>echo '0.00000001' | time ./luaLoop >> Eingegebene Zeit kann nicht ?bernommen werden. >> Geben Sie die neue Zeit ein: >> >> F:\MeineUebungen>luaLoop +RTS -sstderr -RTS >> luaLoop +RTS -sstderr >> EPS: >> 0.00000001 >> 3.1415926485894725 >> ? ? ? ? ? 61,428 bytes allocated in the heap >> ? ? ? ? ? ?1,316 bytes copied during GC >> ? ? ? ? ? ?4,564 bytes maximum residency (1 sample(s)) >> ? ? ? ? ? 11,820 bytes maximum slop >> ? ? ? ? ? ? ? ?1 MB total memory in use (0 MB lost due to fragmentation) >> >> ? Generation 0: ? ? 0 collections, ? ? 0 parallel, ?0.00s, ?0.00s >> elapsed Generation 1: ? ? 1 collections, ? ? 0 parallel, ?0.00s, ?0.00s >> elapsed >> >> ? INIT ?time ? ?0.02s ?( ?0.00s elapsed) >> ? MUT ? time ? ?8.06s ?( 16.83s elapsed) >> ? GC ? ?time ? ?0.00s ?( ?0.00s elapsed) >> ? EXIT ?time ? ?0.00s ?( ?0.00s elapsed) >> ? Total time ? ?8.08s ?( 16.83s elapsed) >> >> ? %GC time ? ? ? 0.0% ?(0.0% elapsed) >> >> ? Alloc rate ? ?7,604 bytes per MUT second >> >> ? Productivity ?99.8% of total user, 47.9% of total elapsed > > What????? > > Please try the following: > > 1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as > a command line argument) > 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop -- > make Source.hs > 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs > 4) cloop +RTS -sstderr > 5) nloop +RTS -sstderr > >> >> -- Markus >> >
12