Performance of Idiomatic lazy Haskell

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
24 messages Options
12
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
As a beginner fascinated about Haskell's lazy, idiomatic style I tried
a small example:
Calculate a numerical approximation of PI based on the Leibniz formula.

Variant 1: My lazy, idiomatic approach (as I understand it so far).
It takes on my machine with eps = 0.00000001 approx. 25 sec.

import Data.List

main :: IO ()
main = do putStrLn "EPS: "
          eps <- readLn :: IO Double
          let pi14 = calcpi14 eps
          putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

calcpi14:: Double->Double
calcpi14 eps = foldl' (+) 0 $ takeWhile (\x -> 4 * abs x > eps) ev
     where e = map (\x->1.0/x) [1.0,3.0..]
           ev = zipWith (*) e $ cycle [1.0,-1.0]


Variant 2: My "C-style" approach achieving state via additonal
function parameters.
It takes on my machine with eps = 0.00000001 approx. 5 sec.

import Data.List

main :: IO ()
main = do putStrLn "EPS: "
          eps <- readLn :: IO Double
          let pi14 = pisum 1 3 False eps
          putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum s i _     eps | 4/i < eps = s
pisum s i True  eps =
    let h = s + 1/i
    in h `seq` pisum h (i+2) False eps
pisum s i False eps =
    let h = s - 1/i
    in h `seq` pisum h (i+2) True eps

I'd like to do idiomatic, lazy Haskell w/o such a performance penalty?
How can I learn/achive that?


-- Markus
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Felipe Lessa
On Sun, Jan 31, 2010 at 11:14:36AM +0100, Markus B?hm wrote:
> Variant 1: My lazy, idiomatic approach (as I understand it so far).
> It takes on my machine with eps = 0.00000001 approx. 25 sec.
>
> Variant 2: My "C-style" approach achieving state via additonal
> function parameters.
> It takes on my machine with eps = 0.00000001 approx. 5 sec.

You could try the stream-fusion package on Hackage:

  http://hackage.haskell.org/package/stream-fusion

It should give better performance on this kind of code since it
can see the loops through the lists.  Don't forget to compile
with optimizations! :)

HTH,

--
Felipe.
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
In reply to this post by Markus Böhm
Hi Markus

I haven't tested the efficiency, but an unfold seems natural /
idiomatic in this case. The Leibniz formula generates a list in the
first step then sums it. Generating a list is often a perfect fit for
an unfold. The code below seems quite close to the idea of the Leibniz
formula (add 2 to the denominator, flip the sign at each step), sum
and multiple by 4.

> import Data.List (unfoldr)

> leibniz n = (4 *) $ sum $ take n step

-- Note this unfoldr generates an infinite list (both cases produce
Just values) ...

> step :: [Double]
> step = unfoldr phi (True,1) where
>   phi (sig,d) | sig == True = Just (1/d, (False,d+2))
>               | otherwise   = Just (negate (1/d), (True,d+2))



Alternatively, an unfoldr with a stop condition might seem more
natural, creating a bounded list. However, here it is a bit ugly
as there are now 3 elements in the unfold state:

> leibniz' n = (4 *) $ sum $ step' n

> step' :: Int -> [Double]
> step' n = unfoldr phi (0,True,1) where
>   phi (i,_,_)   | i   == n    = Nothing
>   phi (i,sig,d) | sig == True = Just (1/d, (i+1,False,d+2))
>                 | otherwise   = Just (negate (1/d), (i+1,True,d+2))


Best wishes

Stephen
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
Hi Markus

Whoops, I hadn't read your email properly and wasn't accounting for
the epsilon. Here's a version that does, although it is perhaps a
little slow...

> import Data.List (unfoldr)


> leibniz eps = converge eps ser


> ser :: [Double]
> ser = unfoldr phi (True,1) where
>   phi (sig,d) | sig == True = Just (1/d, (False,d+2))
>               | otherwise   = Just (negate (1/d), (True,d+2))


> converge :: Double -> [Double] -> Double
> converge eps xs = step 0 0 xs where
>   step a b (x:xs) = let a' = a + (4*x) in
>                     if abs (a'-b) < eps then a' else step a' a xs

> demo1 =  leibniz 0.00000000025

Best wishes

Stephen
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Sonntag 31 Januar 2010 13:23:33 schrieb Stephen Tetley:
> Hi Markus
>
> Whoops, I hadn't read your email properly and wasn't accounting for
> the epsilon. Here's a version that does, although it is perhaps a
> little slow...
>

Better use the previous and calculate how many terms you need:

============================================
module Main (main) where

import Data.List (unfoldr)

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ sum $ take n step

step :: [Double]
step = unfoldr phi (True,1) where
 ? phi (sig,d) | sig         = Just (1/d, (False,d+2))
 ? ? ? ? ? ? ? | otherwise ? = Just (negate (1/d), (True,d+2))


giving
$ echo '0.00000001' | ./unfoldPi +RTS -sstderr -RTS
./unfoldPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.141592648589476                                                      
  27,305,969,616 bytes allocated in the heap                                                
       2,523,788 bytes copied during GC                                                    
          61,660 bytes maximum residency (1 sample(s))                                      
          38,864 bytes maximum slop                                                        
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0: 52083 collections,     0 parallel,  0.29s,  0.44s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   23.10s  ( 23.14s elapsed)
  GC    time    0.29s  (  0.44s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   23.39s  ( 23.57s elapsed)

  %GC time       1.2%  (1.9% elapsed)

  Alloc rate    1,182,002,769 bytes per MUT second

  Productivity  98.8% of total user, 98.0% of total elapsed
===============================================

a little better if we do the unfolding ourselves, not creating any
intermediate pairs:


leibniz n = (4*) . sum $ unf n True 1

unf :: Int -> Bool -> Double -> [Double]
unf 0 _ _ = []
unf k True n = 1/n : unf (k-1) False (n+2)
unf k False n = negate (1/n) : unf (k-1) True (n+2)


which gives
$ echo '0.00000001' | ./unfPi +RTS -sstderr -RTS
./unfPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.141592648589476                                                      
  15,250,801,168 bytes allocated in the heap                                                
       1,284,632 bytes copied during GC                                                    
          61,592 bytes maximum residency (1 sample(s))                                      
          38,864 bytes maximum slop                                                        
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0: 29089 collections,     0 parallel,  0.22s,  0.32s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time   18.05s  ( 18.05s elapsed)
  GC    time    0.22s  (  0.32s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   18.28s  ( 18.37s elapsed)

  %GC time       1.2%  (1.7% elapsed)

  Alloc rate    844,773,242 bytes per MUT second

  Productivity  98.8% of total user, 98.2% of total elapsed
=============================================

Still not competitive with the loop version
$ echo '0.00000001' | ./loopPi +RTS -sstderr -RTS
./loopPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.1415926526069526                                                      
         136,248 bytes allocated in the heap                                                
           2,024 bytes copied during GC                                                    
          51,720 bytes maximum residency (1 sample(s))                                      
          38,392 bytes maximum slop                                                        
               1 MB total memory in use (0 MB lost due to fragmentation)                    

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    6.64s  (  6.64s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    6.64s  (  6.64s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    20,517 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

which can also be improved:

{-# LANGUAGE BangPatterns #-}
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        k = (mx-1) `quot` 2
        !pi14 = pisum (even k) (fromInteger (2*k+1))
    putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

-- sum from small numbers to large, to reduce cancellation
-- although, in this particular case, for eps = 1e-8, the result is
-- farther off than summing the other way
pisum :: Bool -> Double -> Double
pisum bl start = go bl start 0
      where
        go _ n s | n < 1 = s
        go True n !s = go False (n-2) (s+recip n)
        go False n !s = go True (n-2) (s-recip n)



$ echo '0.00000001' | ./mloopPi +RTS -sstderr -RTS
./mloopPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.141592648589793                                                        
         136,416 bytes allocated in the heap
           2,024 bytes copied during GC
          51,720 bytes maximum residency (1 sample(s))
          38,392 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.55s  (  4.56s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.55s  (  4.56s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    29,966 bytes per MUT second

  Productivity  99.9% of total user, 99.8% of total elapsed

Most of the gain is due to the much simpler loop-break test, a small bit
may be due to bang-patterns vs. seq.


Bottom line:
GHC isn't very good at fusing away intermediate lists in strict algorithms
(it's better at it for lazy algorithms). Allocating so many cons-cells just
to be immediately garbage-collected costs a lot of time.

You can code the loop directly, which gives reasonable results, or, as
Felipe suggested, use the fusion framework created by experts:

module Main (main) where

import qualified Data.List.Stream as S

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- readLn :: IO Double
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ S.sum $ S.take n step

step :: [Double]
step = S.unfoldr phi (True,1) where
 ? phi (sig,d) | sig         = Just (1/d, (False,d+2))
 ? ? ? ? ? ? ? | otherwise ? = Just (negate (1/d), (True,d+2))


which beats the hand-coded loop:
$ echo '0.00000001' | ./sunfPi +RTS -sstderr -RTS
./sunfPi +RTS -sstderr                                                                      
EPS:                                                                                        
PI mit EPS 1.0e-8 = 3.1415926445727678                                                      
         136,560 bytes allocated in the heap
           2,024 bytes copied during GC
          51,720 bytes maximum residency (1 sample(s))
          38,392 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.00s  (  0.00s elapsed)
  MUT   time    4.27s  (  4.27s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.27s  (  4.27s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    31,994 bytes per MUT second

  Productivity 100.0% of total user, 100.0% of total elapsed

> > import Data.List (unfoldr)
> >
> >
> > leibniz eps = converge eps ser
> >
> >
> > ser :: [Double]
> > ser = unfoldr phi (True,1) where
> >   phi (sig,d) | sig == True = Just (1/d, (False,d+2))
> >
> >               | otherwise   = Just (negate (1/d), (True,d+2))
> >
> > converge :: Double -> [Double] -> Double
> > converge eps xs = step 0 0 xs where
> >   step a b (x:xs) = let a' = a + (4*x) in
> >                     if abs (a'-b) < eps then a' else step a' a xs
> >
> > demo1 =  leibniz 0.00000000025
>
> Best wishes
>
> Stephen


Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
Hi Daniel

Thanks - the figures are very impressive for the stream fusion
library. I knew the paper, but I hadn't looked at it the
implementation.

Making a stricter unfoldr by using a strictness annotation on the
state and getting rid of the tuple is nowhere near stream fusion-lib,
i.e:

data Maybe2 a st = Nothing2 | Just2 a !st
  deriving (Eq,Show)


Performance wise stream fusion even beats a monoidal unfoldr. A
monoidal unfoldr seems reasonable to me for this problem as there is
no need to generate a list. As a monoidal unfoldr is not in the
'standard' libraries some people might not consider it idiomatic
though.

I put in a fixed value for epsilon in all three version rather than
used echo at the command line and compiled all with ghc --make -O2

--------------

module Main (main) where

import Data.Monoid

data Maybe2 a st = Nothing2 | Just2 a !st
  deriving (Eq,Show)


dummy_eps :: Double
dummy_eps = 0.00000001

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- return dummy_eps
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz n = (4 *) $ getSum $ step n

step :: Integer -> Sum Double
step times = unfoldrMon phi (0,True,1) where
   phi (i,_,_)   | i == times  = Nothing2
   phi (i,sig,d) | sig         = Just2 (Sum (1/d)) (i+1,False,d+2)
                 | otherwise   = Just2 (Sum (negate (1/d))) (i+1,True,d+2)


unfoldrMon      :: Monoid a => (b -> Maybe2 a b) -> b -> a
unfoldrMon f b  =
  case f b of
   Just2 a new_b -> a `mappend` unfoldrMon f new_b
   Nothing2      -> mempty


--

---------------------------------------------------
Leibniz1 (stream fusion)
---------------------------------------------------

$ ./Leibniz1 +RTS -sstderr -RTS
d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr
EPS:
PI mit EPS 1.0e-8 = 3.1415926445727678
          24,404 bytes allocated in the heap
             892 bytes copied during GC
           3,068 bytes maximum residency (1 sample(s))
          13,316 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.03s  (  0.00s elapsed)
  MUT   time    4.59s  (  4.59s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.63s  (  4.59s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,276 bytes per MUT second

  Productivity  99.3% of total user, 100.0% of total elapsed

---------------------------------------------------
Leibniz3 (monoidal unfoldr)
---------------------------------------------------

$ ./Leibniz3 +RTS -sstderr -RTS
d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr
EPS:
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.
      92,448,984 bytes allocated in the heap
          15,368 bytes copied during GC
       8,382,800 bytes maximum residency (5 sample(s))
       4,198,684 bytes maximum slop
              17 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:   125 collections,     0 parallel,  1.66s,  1.66s elapsed
  Generation 1:     5 collections,     0 parallel,  0.03s,  0.03s elapsed

  INIT  time    0.03s  (  0.00s elapsed)
  MUT   time    3.50s  (  3.55s elapsed)
  GC    time    1.69s  (  1.69s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    5.22s  (  5.23s elapsed)

  %GC time      32.3%  (32.2% elapsed)

  Alloc rate    26,180,243 bytes per MUT second

  Productivity  67.1% of total user, 66.9% of total elapsed

---------------------------------------------------
Leibniz2 - no stream fusion, unfoldr with strictness annotation on the state
---------------------------------------------------

$ ./Leibniz2 +RTS -sstderr -RTS
d:\coding\haskell\cafe\Leibniz2.exe +RTS -sstderr
EPS:
PI mit EPS 1.0e-8 = 3.141592648589476
  25,600,024,064 bytes allocated in the heap
       2,152,224 bytes copied during GC
           3,336 bytes maximum residency (1 sample(s))
          11,908 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0: 48828 collections,     0 parallel,  2.19s,  2.23s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time   75.16s  ( 76.22s elapsed)
  GC    time    2.19s  (  2.23s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time   77.36s  ( 78.45s elapsed)

  %GC time       2.8%  (2.8% elapsed)

  Alloc rate    340,553,219 bytes per MUT second

  Productivity  97.2% of total user, 95.8% of total elapsed
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
Arghh, an error in that post that I didn't notice, I send a revision soon...

Apologies Stephen

>
> ---------------------------------------------------
> Leibniz3 (monoidal unfoldr)
> ---------------------------------------------------
>
> $ ./Leibniz3 +RTS -sstderr -RTS
> d:\coding\haskell\cafe\Leibniz3.exe +RTS -sstderr
> EPS:
> Stack space overflow: current size 8388608 bytes.
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
In reply to this post by Stephen Tetley-2
Am Sonntag 31 Januar 2010 15:52:41 schrieb Stephen Tetley:
> Hi Daniel
>
> Thanks - the figures are very impressive for the stream fusion
> library. I knew the paper, but I hadn't looked at it the
> implementation.
>

Just for the record, the loops (my variant) coded in C take 2.7 seconds
(icc) resp. 3.0 seconds (gcc) with -O3.
Markus' loop takes 5.32s (icc) resp. 6.04s (gcc) with -O3.
Well, that's largely due to the fact that that takes two divisions per
iteration (and divisions are very slow on my box).
Rewriting it so that only one division per iteration is necessary:

================================================
module Main (main) where

main :: IO ()
main = do putStrLn "EPS: "
? ? ? ? ? eps <- readLn :: IO Double
? ? ? ? ? let pi14 = pisum 1 3 False (eps*0.25)
? ? ? ? ? putStrLn $ "PI mit EPS "++(show eps)++" = "++ show(4*pi14)

pisum s i b eps
    | d < eps   = s
    | b         = let h = s+d in h `seq` pisum h (i+2) False eps
    | otherwise = let h = s-d in h `seq` pisum h (i+2) True eps
      where
        d = 1/i
================================================

it takes 5.42 seconds (eps = 1e-8) when compiled natively, that loop in C
is practically indistinguishable from my loop variant.


So stream-fusion or hand-coded loops in Haskell are clearly slower than C
loops, but don't stink when compiled with the native code generator.

However, I get identical to gcc performance with
ghc -O2 -fexcess-precision -fvia-C -optc-O3
[-optc-ffast-math doesn't make any difference then] for the hand-coded
loops (all, my variant, Markus' original, Markus' rewritten - of course,
each identical to the corresponding C-loop).

Also for stream-fusion.

Some of the list creating variants also profit from compilation via C,
others not. However, none come near the NCG loop performance, let alone C
loop performance.

To sum up:
Although GHC's native code generator does a not too bad job with
arithmetic-heavy loops, gcc is better (even without optimisations).
But if you compile via C, you can get identical performance -
for these loops.
In general, things are more complicated, often the native code generator
produces better code than the via-C route. It's usually worth a test which
produces the better binary, native or via-C.

Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
In reply to this post by Stephen Tetley-2
Hello Daniel

I can get close with a tail recursive + accumulator monoidal unfoldr
and strict triple for the state (leibniz4).

But it does seem to be loosing the point of the initial exercise "to
be idiomatic", also the answers are divergent too...


Best wishes

Stephen

------
Leibniz4 - monodial unfoldr strict triple
------

$ ./Leibniz4 +RTS -sstderr -RTS
d:\coding\haskell\cafe\Leibniz4.exe +RTS -sstderr
EPS:
PI mit EPS 1.0e-8 = 3.1415926526069526
          24,424 bytes allocated in the heap
             892 bytes copied during GC
           3,068 bytes maximum residency (1 sample(s))
          13,316 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.03s  (  0.00s elapsed)
  MUT   time    4.34s  (  4.45s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.38s  (  4.45s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,582 bytes per MUT second

  Productivity  99.3% of total user, 97.5% of total elapsed

------
Leibniz1 - stream fusion
------

$ ./Leibniz1 +RTS -sstderr -RTS
d:\coding\haskell\cafe\Leibniz1.exe +RTS -sstderr
EPS:
PI mit EPS 1.0e-8 = 3.1415926445727678
          24,404 bytes allocated in the heap
             892 bytes copied during GC
           3,068 bytes maximum residency (1 sample(s))
          13,316 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    4.61s  (  4.63s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.63s  (  4.63s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    5,276 bytes per MUT second

  Productivity  99.7% of total user, 99.7% of total elapsed


----------


module Main (main) where

import Data.Monoid


data Maybe2 a st = Nothing2 | Just2 a !st
  deriving (Eq,Show)

dummy_eps :: Double
dummy_eps = 0.00000001

main :: IO ()
main = do
    putStrLn "EPS: "
    eps <- return dummy_eps
    let mx = floor (4/eps)
        !k = (mx+1) `quot` 2
    putStrLn $ "PI mit EPS " ++ (show eps) ++ " = " ++ show (leibniz k)

leibniz :: Integer -> Double
leibniz n = (4 *) $ getSum $ step (fromIntegral n)

data Trip = T3 !Int !Bool !Double

step :: Int -> Sum Double
step times = unfoldrMon phi (T3 0 True 1) where
   phi :: Trip -> Maybe2 (Sum Double) Trip
   phi (T3 i _   _) | i == times  = Nothing2
   phi (T3 i sig d) | sig         = Just2 (Sum (1/d)) (T3 (i+1) False (d+2))
                    | otherwise   = Just2 (Sum (negate (1/d))) (T3
(i+1) True (d+2))



unfoldrMon      :: Monoid a => (b -> Maybe2 a b) -> b -> a
unfoldrMon f b0  = step b0 mempty where
   step b acc = case f b of
              Nothing2       -> acc
              Just2 a new_b  -> step new_b  (a `mappend` acc)
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Sonntag 31 Januar 2010 19:21:10 schrieb Stephen Tetley:
> Hello Daniel
>
> I can get close with a tail recursive + accumulator monoidal unfoldr
> and strict triple for the state (leibniz4).

Beats stream fusion and direct loops comfortably here (3.55s) when compiled
with the NCG, very narrowly when compiled via C (2.99s).
Congrats.

>
> But it does seem to be loosing the point of the initial exercise "to
> be idiomatic",

Yes. Returning to that,

a) for tasks where it's appropriate/sufficiently natural, hand-coded loops
are idiomatic, too.

b) use stream-fusion to get rid of intermediate structures.

> also the answers are divergent too...

In how far?

>
>
> Best wishes
>
> Stephen

Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Stephen Tetley-2
On 31 January 2010 18:50, Daniel Fischer <[hidden email]> wrote:


>
> Yes. Returning to that,
>
> a) for tasks where it's appropriate/sufficiently natural, hand-coded loops
> are idiomatic, too.
>

Indeed - originally I looked at Markus's code and only saw that an
unfoldr would be a natural implementation for the Leibniz generating
function. I overlooked the episilon used to control the accuracy which
pollutes the unfold and makes it no clearer (too much haste / speed on
my part).



>> also the answers are divergent too...
>
> In how far?

On my machine, they diverged at the eighth decimal place -

Leibniz4 3.1415926526069526
Leibniz1 3.1415926445727678


Best wishes

Stephen
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Sonntag 31 Januar 2010 20:22:56 schrieb Stephen Tetley:
> On my machine, they diverged at the eighth decimal place -
>
> Leibniz4 3.1415926526069526
> Leibniz1 3.1415926445727678

Ah, that. I thought you were talking about timings.

There are several things that can lead to differing results here,

1) order of summation (I think only my loop had a different order)
2) due to different tests, it might happen that one algorithm evaluates one
term more than the other
3) are intermediate results truncated to 64 bits or kept in an 80-bit
register?

I'm astonished by the size of the difference, though.

But if you sum from small to large, the results are very accurately the
theoretically expected results, so I think it's 1) together with 3) which
make the difference.
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
Daniel, Stephen, Felipe thanks for Your answers/advice, which I
studied this morning. Helps me a lot on my learning curve. I guess it
takes some time to get a feeling for the performance/runtime behaviour
of abstractions in Haskell.

Just for fun and w/o knowing Lua: I translated my Haskell
"loops-while"-version into Lua and ran on LuaJit Beta 2.0.0. Seems to
beat our best Haskell version.

print("EPS: ")
EPS = io.read("*number")
pi2 = 1
i = 3
s = false
repeat
    pi1 = pi2
    if s then
        pi2 = pi1 + 1.0/i
        s = false
    else
        pi2 = pi1 - 1.0/i
        s = true
    end
    i = i+2

until EPS > 4.0 * math.abs(pi2-pi1)

print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)


-- Markus




On Sun, Jan 31, 2010 at 9:43 PM, Daniel Fischer
<[hidden email]> wrote:

> Am Sonntag 31 Januar 2010 20:22:56 schrieb Stephen Tetley:
>> On my machine, they diverged at the eighth decimal place -
>>
>> Leibniz4 3.1415926526069526
>> Leibniz1 3.1415926445727678
>
> Ah, that. I thought you were talking about timings.
>
> There are several things that can lead to differing results here,
>
> 1) order of summation (I think only my loop had a different order)
> 2) due to different tests, it might happen that one algorithm evaluates one
> term more than the other
> 3) are intermediate results truncated to 64 bits or kept in an 80-bit
> register?
>
> I'm astonished by the size of the difference, though.
>
> But if you sum from small to large, the results are very accurately the
> theoretically expected results, so I think it's 1) together with 3) which
> make the difference.
> _______________________________________________
> Beginners mailing list
> [hidden email]
> http://www.haskell.org/mailman/listinfo/beginners
>
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Montag 01 Februar 2010 10:37:41 schrieb Markus B?hm:
> Daniel, Stephen, Felipe thanks for Your answers/advice, which I
> studied this morning. Helps me a lot on my learning curve. I guess it
> takes some time to get a feeling for the performance/runtime behaviour
> of abstractions in Haskell.
>
> Just for fun and w/o knowing Lua: I translated my Haskell
> "loops-while"-version into Lua

That's not quite a translation of the loop you posted. Not really
important, though, small difference.

> and ran on LuaJit Beta 2.0.0. Seems to
> beat our best Haskell version.

So LuaJit Beta 2.0.0 beats gcc -O3 on such a simple loop?
I'm not going to install Lua to test, but if that's really the case, I'm
sure the folks at gnu.org would like to hear about it.
Or don't you consider -fvia-C compiled loops to be true Haskell versions?

>
> print("EPS: ")
> EPS = io.read("*number")
> pi2 = 1
> i = 3
> s = false
> repeat
>     pi1 = pi2
>     if s then
>         pi2 = pi1 + 1.0/i
>         s = false
>     else
>         pi2 = pi1 - 1.0/i
>         s = true
>     end
>     i = i+2
>
> until EPS > 4.0 * math.abs(pi2-pi1)
>
> print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)
>
>
> -- Markus
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with
attached variant. I compiled all our Haskell variants with ghc --make
-O2. I don't know about -fvia-C, have to find out. I hope I didn't
distract You with my Lua variant. In any case thank You very much for
your advice. Markus.


print("EPS: ")
EPS = io.read("*number")
local x = os.clock()
pi2 = 1
i = 3
s = false
repeat
    pi1 = pi2
    if s then
        pi2 = pi1 + 1.0/i
        s = false
    else
        pi2 = pi1 - 1.0/i
        s = true
    end
    i = i+2

until EPS > 4.0 * math.abs(pi2-pi1)

print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)
print(string.format("elapsed time: %.2f\n", os.clock() -x))

-- Markus




On Mon, Feb 1, 2010 at 2:16 PM, Daniel Fischer <[hidden email]> wrote:

> Am Montag 01 Februar 2010 10:37:41 schrieb Markus B?hm:
>> Daniel, Stephen, Felipe thanks for Your answers/advice, which I
>> studied this morning. Helps me a lot on my learning curve. I guess it
>> takes some time to get a feeling for the performance/runtime behaviour
>> of abstractions in Haskell.
>>
>> Just for fun and w/o knowing Lua: I translated my Haskell
>> "loops-while"-version into Lua
>
> That's not quite a translation of the loop you posted. Not really
> important, though, small difference.
>
>> and ran on LuaJit Beta 2.0.0. Seems to
>> beat our best Haskell version.
>
> So LuaJit Beta 2.0.0 beats gcc -O3 on such a simple loop?
> I'm not going to install Lua to test, but if that's really the case, I'm
> sure the folks at gnu.org would like to hear about it.
> Or don't you consider -fvia-C compiled loops to be true Haskell versions?
>
>>
>> print("EPS: ")
>> EPS = io.read("*number")
>> pi2 = 1
>> i = 3
>> s = false
>> repeat
>> ? ? pi1 = pi2
>> ? ? if s then
>> ? ? ? ? pi2 = pi1 + 1.0/i
>> ? ? ? ? s = false
>> ? ? else
>> ? ? ? ? pi2 = pi1 - 1.0/i
>> ? ? ? ? s = true
>> ? ? end
>> ? ? i = i+2
>>
>> until EPS > 4.0 * math.abs(pi2-pi1)
>>
>> print("PI mit EPS", EPS," = ",4*pi1," n= ",(i-3)/2)
>>
>>
>> -- Markus
>
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm:
> Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with
> attached variant.

That's pretty fast.

> I compiled all our Haskell variants with ghc --make -O2.

Well, GHC isn't as good at loop-optimising as gcc is. Depending on the
loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than
the NCG compiled.

> I don't know about -fvia-C, have to find out. I hope I didn't
> distract You with my Lua variant.

No sweat.

> In any case thank You very much for
> your advice. Markus.
>

You're welcome.

Can you try the below with

ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs

and run with

echo '0.00000001' | time ./luaLoop

?
It's a fairly direct translation of the Lua code, and it runs here more or
less equally fast as (gcc compiled) C-loops.

=================================
module Main (main) where

main :: IO ()
main = do
    putStrLn "EPS:"
    eps <- readLn :: IO Double
    print $ 4*calcPi eps

calcPi :: Double -> Double
calcPi eps = go False 1 3
      where
        go bl p1 i
            | 4*abs(p2-p1) < eps    = p1
            | otherwise             = go (not bl) p2 (i+2)
              where
                p2  | bl        = p1+1/i
                    | otherwise = p1-1/i

==================================

Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem
to work. It says in German: specified time can't be read. Give a new
time.

-- Markus




On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer <[hidden email]> wrote:

> Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm:
>> Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with
>> attached variant.
>
> That's pretty fast.
>
>> I compiled all our Haskell variants with ghc --make -O2.
>
> Well, GHC isn't as good at loop-optimising as gcc is. Depending on the
> loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than
> the NCG compiled.
>
>> I don't know about -fvia-C, have to find out. I hope I didn't
>> distract You with my Lua variant.
>
> No sweat.
>
>> In any case thank You very much for
>> your advice. Markus.
>>
>
> You're welcome.
>
> Can you try the below with
>
> ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs
>
> and run with
>
> echo '0.00000001' | time ./luaLoop
>
> ?
> It's a fairly direct translation of the Lua code, and it runs here more or
> less equally fast as (gcc compiled) C-loops.
>
> =================================
> module Main (main) where
>
> main :: IO ()
> main = do
> ? ?putStrLn "EPS:"
> ? ?eps <- readLn :: IO Double
> ? ?print $ 4*calcPi eps
>
> calcPi :: Double -> Double
> calcPi eps = go False 1 3
> ? ? ?where
> ? ? ? ?go bl p1 i
> ? ? ? ? ? ?| 4*abs(p2-p1) < eps ? ?= p1
> ? ? ? ? ? ?| otherwise ? ? ? ? ? ? = go (not bl) p2 (i+2)
> ? ? ? ? ? ? ?where
> ? ? ? ? ? ? ? ?p2 ?| bl ? ? ? ?= p1+1/i
> ? ? ? ? ? ? ? ? ? ?| otherwise = p1-1/i
>
> ==================================
>
>
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
Daniel, here another feedback:

F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make p
085-pi_lualoop.hs
[1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o )
Linking luaLoop.exe ...

F:\MeineUebungen>echo '0.00000001' | time ./luaLoop
Eingegebene Zeit kann nicht ?bernommen werden.
Geben Sie die neue Zeit ein:

F:\MeineUebungen>luaLoop +RTS -sstderr -RTS
luaLoop +RTS -sstderr
EPS:
0.00000001
3.1415926485894725
          61,428 bytes allocated in the heap
           1,316 bytes copied during GC
           4,564 bytes maximum residency (1 sample(s))
          11,820 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    8.06s  ( 16.83s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    8.08s  ( 16.83s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    7,604 bytes per MUT second

  Productivity  99.8% of total user, 47.9% of total elapsed

-- Markus




On Mon, Feb 1, 2010 at 3:08 PM, Markus B?hm <[hidden email]> wrote:

> Daniel, I use GHC 6.12.1 and Windows XP. The time command doesn't seem
> to work. It says in German: specified time can't be read. Give a new
> time.
>
> -- Markus
>
>
>
>
> On Mon, Feb 1, 2010 at 2:52 PM, Daniel Fischer <[hidden email]> wrote:
>> Am Montag 01 Februar 2010 14:31:40 schrieb Markus B?hm:
>>> Daniel, with LuaJIT it needs 1.45 sec cpu-time on my machine with
>>> attached variant.
>>
>> That's pretty fast.
>>
>>> I compiled all our Haskell variants with ghc --make -O2.
>>
>> Well, GHC isn't as good at loop-optimising as gcc is. Depending on the
>> loop, the via-C compiled binaries are between 1.4 and 2.3 times faster than
>> the NCG compiled.
>>
>>> I don't know about -fvia-C, have to find out. I hope I didn't
>>> distract You with my Lua variant.
>>
>> No sweat.
>>
>>> In any case thank You very much for
>>> your advice. Markus.
>>>
>>
>> You're welcome.
>>
>> Can you try the below with
>>
>> ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop --make Whatever.hs
>>
>> and run with
>>
>> echo '0.00000001' | time ./luaLoop
>>
>> ?
>> It's a fairly direct translation of the Lua code, and it runs here more or
>> less equally fast as (gcc compiled) C-loops.
>>
>> =================================
>> module Main (main) where
>>
>> main :: IO ()
>> main = do
>> ? ?putStrLn "EPS:"
>> ? ?eps <- readLn :: IO Double
>> ? ?print $ 4*calcPi eps
>>
>> calcPi :: Double -> Double
>> calcPi eps = go False 1 3
>> ? ? ?where
>> ? ? ? ?go bl p1 i
>> ? ? ? ? ? ?| 4*abs(p2-p1) < eps ? ?= p1
>> ? ? ? ? ? ?| otherwise ? ? ? ? ? ? = go (not bl) p2 (i+2)
>> ? ? ? ? ? ? ?where
>> ? ? ? ? ? ? ? ?p2 ?| bl ? ? ? ?= p1+1/i
>> ? ? ? ? ? ? ? ? ? ?| otherwise = p1-1/i
>>
>> ==================================
>>
>>
>
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Daniel Fischer-4
Am Montag 01 Februar 2010 15:22:45 schrieb Markus B?hm:

> Daniel, I use GHC 6.12.1 and Windows XP.

*sigh*

Have you a CygWin installed? It should run as given from a CygWin shell, I
believe.

> The time command doesn't seem
> to work. It says in German: specified time can't be read. Give a new
> time.

Replace "time" with however you get the CPU time used by a process on
Windows.
But I think Windows doesn't allow piping, and I'm not sure whether it has
echo, so see below.

> Daniel, here another feedback:
>
> F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop
> --make p 085-pi_lualoop.hs
> [1 of 1] Compiling Main             ( p085-pi_lualoop.hs,
> p085-pi_lualoop.o ) Linking luaLoop.exe ...
>
> F:\MeineUebungen>echo '0.00000001' | time ./luaLoop
> Eingegebene Zeit kann nicht ?bernommen werden.
> Geben Sie die neue Zeit ein:
>
> F:\MeineUebungen>luaLoop +RTS -sstderr -RTS
> luaLoop +RTS -sstderr
> EPS:
> 0.00000001
> 3.1415926485894725
>           61,428 bytes allocated in the heap
>            1,316 bytes copied during GC
>            4,564 bytes maximum residency (1 sample(s))
>           11,820 bytes maximum slop
>                1 MB total memory in use (0 MB lost due to fragmentation)
>
>   Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s
> elapsed Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s
> elapsed
>
>   INIT  time    0.02s  (  0.00s elapsed)
>   MUT   time    8.06s  ( 16.83s elapsed)
>   GC    time    0.00s  (  0.00s elapsed)
>   EXIT  time    0.00s  (  0.00s elapsed)
>   Total time    8.08s  ( 16.83s elapsed)
>
>   %GC time       0.0%  (0.0% elapsed)
>
>   Alloc rate    7,604 bytes per MUT second
>
>   Productivity  99.8% of total user, 47.9% of total elapsed

What?????

Please try the following:

1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as
a command line argument)
2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop --
make Source.hs
3) ghc -O2 -fforce-recomp -o nloop --make Source.hs
4) cloop +RTS -sstderr
5) nloop +RTS -sstderr

>
> -- Markus
>
Reply | Threaded
Open this post in threaded view
|

Performance of Idiomatic lazy Haskell

Markus Böhm
1. I used Your lualoop file with content:
module Main (main) where

main :: IO ()
main = do
   --putStrLn "EPS:"
   -- eps <- readLn :: IO Double
   print $ 4*calcPi 0.00000001

calcPi :: Double -> Double
calcPi eps = go False 1 3
     where
       go bl p1 i
           | 4*abs(p2-p1) < eps    = p1
           | otherwise             = go (not bl) p2 (i+2)
             where
               p2  | bl        = p1+1/i
                   | otherwise = p1-1/i

2.
F:\MeineUebungen>ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o c
loop --make p085-pi_lualoop.hs
[1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o )
Linking cloop.exe ...


F:\MeineUebungen>ghc -O2 -fforce-recomp -o nloop --make p085-pi_lualoop.hs
[1 of 1] Compiling Main             ( p085-pi_lualoop.hs, p085-pi_lualoop.o )
Linking nloop.exe ...

F:\MeineUebungen>cloop +RTS -sstderr
cloop +RTS -sstderr
3.1415926485894725
          20,860 bytes allocated in the heap
             892 bytes copied during GC
           3,068 bytes maximum residency (1 sample(s))
          13,316 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.02s elapsed)
  MUT   time    7.39s  (  7.27s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    7.41s  (  7.28s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    2,816 bytes per MUT second

  Productivity  99.8% of total user, 101.5% of total elapsed


F:\MeineUebungen>nloop +RTS -sstderr
nloop +RTS -sstderr
3.1415926485894725
          20,860 bytes allocated in the heap
             892 bytes copied during GC
           3,068 bytes maximum residency (1 sample(s))
          13,316 bytes maximum slop
               1 MB total memory in use (0 MB lost due to fragmentation)

  Generation 0:     0 collections,     0 parallel,  0.00s,  0.00s elapsed
  Generation 1:     1 collections,     0 parallel,  0.00s,  0.00s elapsed

  INIT  time    0.02s  (  0.00s elapsed)
  MUT   time    4.77s  (  4.81s elapsed)
  GC    time    0.00s  (  0.00s elapsed)
  EXIT  time    0.00s  (  0.00s elapsed)
  Total time    4.78s  (  4.81s elapsed)

  %GC time       0.0%  (0.0% elapsed)

  Alloc rate    4,362 bytes per MUT second

  Productivity  99.7% of total user, 99.0% of total elapsed


-- Markus



On Mon, Feb 1, 2010 at 3:40 PM, Daniel Fischer <[hidden email]> wrote:

> Am Montag 01 Februar 2010 15:22:45 schrieb Markus B?hm:
>
>> Daniel, I use GHC 6.12.1 and Windows XP.
>
> *sigh*
>
> Have you a CygWin installed? It should run as given from a CygWin shell, I
> believe.
>
>> The time command doesn't seem
>> to work. It says in German: specified time can't be read. Give a new
>> time.
>
> Replace "time" with however you get the CPU time used by a process on
> Windows.
> But I think Windows doesn't allow piping, and I'm not sure whether it has
> echo, so see below.
>
>> Daniel, here another feedback:
>>
>> F:\MeineUebungen>ghc -O2 -fexcess-precision -fvia-C -optc-O3 -o luaLoop
>> --make p 085-pi_lualoop.hs
>> [1 of 1] Compiling Main ? ? ? ? ? ? ( p085-pi_lualoop.hs,
>> p085-pi_lualoop.o ) Linking luaLoop.exe ...
>>
>> F:\MeineUebungen>echo '0.00000001' | time ./luaLoop
>> Eingegebene Zeit kann nicht ?bernommen werden.
>> Geben Sie die neue Zeit ein:
>>
>> F:\MeineUebungen>luaLoop +RTS -sstderr -RTS
>> luaLoop +RTS -sstderr
>> EPS:
>> 0.00000001
>> 3.1415926485894725
>> ? ? ? ? ? 61,428 bytes allocated in the heap
>> ? ? ? ? ? ?1,316 bytes copied during GC
>> ? ? ? ? ? ?4,564 bytes maximum residency (1 sample(s))
>> ? ? ? ? ? 11,820 bytes maximum slop
>> ? ? ? ? ? ? ? ?1 MB total memory in use (0 MB lost due to fragmentation)
>>
>> ? Generation 0: ? ? 0 collections, ? ? 0 parallel, ?0.00s, ?0.00s
>> elapsed Generation 1: ? ? 1 collections, ? ? 0 parallel, ?0.00s, ?0.00s
>> elapsed
>>
>> ? INIT ?time ? ?0.02s ?( ?0.00s elapsed)
>> ? MUT ? time ? ?8.06s ?( 16.83s elapsed)
>> ? GC ? ?time ? ?0.00s ?( ?0.00s elapsed)
>> ? EXIT ?time ? ?0.00s ?( ?0.00s elapsed)
>> ? Total time ? ?8.08s ?( 16.83s elapsed)
>>
>> ? %GC time ? ? ? 0.0% ?(0.0% elapsed)
>>
>> ? Alloc rate ? ?7,604 bytes per MUT second
>>
>> ? Productivity ?99.8% of total user, 47.9% of total elapsed
>
> What?????
>
> Please try the following:
>
> 1) hardcode eps = 1e-8 in the source (or get it via getArgs and pass it as
> a command line argument)
> 2) ghc -O2 -fforce-recomp -fexcess-precision -fvia-C -optc-O3 -o cloop --
> make Source.hs
> 3) ghc -O2 -fforce-recomp -o nloop --make Source.hs
> 4) cloop +RTS -sstderr
> 5) nloop +RTS -sstderr
>
>>
>> -- Markus
>>
>
12