Optimization demonstration

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

Optimization demonstration

Dušan Kolář

Dear Café,

 

I'm trying to do very small, but impressive example about optimizations possible during Haskell compilation. So far, I can demonstrate that the following two programs (if compiled) perform computation in the same time:

 

1)

 

main =

putStrLn $ show $ sum $ map (*(2::Int)) [(1::Int)..(100000000::Int)]

 

 

2)

 

main =

putStrLn $! show $! sumup 1 0

 

sumup :: Int -> Int -> Int

sumup n total =

if n<=(100000000::Int) then sumup (n+1) $! total+(2*n)

else total

 

 

Nevertheless, I expect a question on comparison with C:

 

3)

 

#include <stdio.h>

 

int main(void) {

long sum, i;

sum = 0;

for (i=1; i <= 100000000L; ++i) {

sum += 2*i;

}

printf("%ld\n",sum);

return 0;

}

 

 

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?

 

I'm using -O2 -dynamic --make ghc compiler flags. For gcc for C compilation just -O2, running Arch Linux.

 

Is there any option, how to force compile time evaluation? The reason, why I think it works this way is the fact, that when running out of long type values in C a code is generated that computes the values regularly (providing misleading value as a result) taking its time.

 

Best regards,

 

Dušan

 

 

 


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Shao Cheng
You can use Template Haskell to perform arbitrary computation at compile-time (even if it requires IO!), and then `lift` the result into a Haskell literal. This works for any type with a `Lift` instance (or with a bit of trick, any serializable type).

Coming back to your use case, you may try avoid using raw lists and switch to unboxed vectors, turn on -O2 and rely on stream fusion of the vector package. That will result in a considerable speedup.

On Tue, Feb 27, 2018, 11:09 PM Dušan Kolář <[hidden email]> wrote:

Dear Café,

 

I'm trying to do very small, but impressive example about optimizations possible during Haskell compilation. So far, I can demonstrate that the following two programs (if compiled) perform computation in the same time:

 

1)

 

main =

putStrLn $ show $ sum $ map (*(2::Int)) [(1::Int)..(100000000::Int)]

 

 

2)

 

main =

putStrLn $! show $! sumup 1 0

 

sumup :: Int -> Int -> Int

sumup n total =

if n<=(100000000::Int) then sumup (n+1) $! total+(2*n)

else total

 

 

Nevertheless, I expect a question on comparison with C:

 

3)

 

#include <stdio.h>

 

int main(void) {

long sum, i;

sum = 0;

for (i=1; i <= 100000000L; ++i) {

sum += 2*i;

}

printf("%ld\n",sum);

return 0;

}

 

 

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?

 

I'm using -O2 -dynamic --make ghc compiler flags. For gcc for C compilation just -O2, running Arch Linux.

 

Is there any option, how to force compile time evaluation? The reason, why I think it works this way is the fact, that when running out of long type values in C a code is generated that computes the values regularly (providing misleading value as a result) taking its time.

 

Best regards,

 

Dušan

 

 

 

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Ryan Reich
I thought fusion might be the answer, but don't the standard list functions have rewrite rules for that too? Build/consume; this should apply directly to this example (version 1).

On Feb 27, 2018 07:30, "Shao Cheng" <[hidden email]> wrote:
You can use Template Haskell to perform arbitrary computation at compile-time (even if it requires IO!), and then `lift` the result into a Haskell literal. This works for any type with a `Lift` instance (or with a bit of trick, any serializable type).

Coming back to your use case, you may try avoid using raw lists and switch to unboxed vectors, turn on -O2 and rely on stream fusion of the vector package. That will result in a considerable speedup.

On Tue, Feb 27, 2018, 11:09 PM Dušan Kolář <[hidden email]> wrote:

Dear Café,

 

I'm trying to do very small, but impressive example about optimizations possible during Haskell compilation. So far, I can demonstrate that the following two programs (if compiled) perform computation in the same time:

 

1)

 

main =

putStrLn $ show $ sum $ map (*(2::Int)) [(1::Int)..(100000000::Int)]

 

 

2)

 

main =

putStrLn $! show $! sumup 1 0

 

sumup :: Int -> Int -> Int

sumup n total =

if n<=(100000000::Int) then sumup (n+1) $! total+(2*n)

else total

 

 

Nevertheless, I expect a question on comparison with C:

 

3)

 

#include <stdio.h>

 

int main(void) {

long sum, i;

sum = 0;

for (i=1; i <= 100000000L; ++i) {

sum += 2*i;

}

printf("%ld\n",sum);

return 0;

}

 

 

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?

 

I'm using -O2 -dynamic --make ghc compiler flags. For gcc for C compilation just -O2, running Arch Linux.

 

Is there any option, how to force compile time evaluation? The reason, why I think it works this way is the fact, that when running out of long type values in C a code is generated that computes the values regularly (providing misleading value as a result) taking its time.

 

Best regards,

 

Dušan

 

 

 

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Brandon Allbery
In reply to this post by Dušan Kolář
On Tue, Feb 27, 2018 at 10:06 AM, Dušan Kolář <[hidden email]> wrote:

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?


ghc is less prone to invoke that kind of optimization, but sometimes can do so. And yes, gcc is decidedly "brutal" with -O2: inspect the generated assembler and you'll find that it just prints a constant.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Ryan Reich
In other words, it's not competition with the language C but with its popular compiler. Choose an example that doesn't simplify and you'll get a fairer contest.

On Feb 27, 2018 07:52, "Brandon Allbery" <[hidden email]> wrote:
On Tue, Feb 27, 2018 at 10:06 AM, Dušan Kolář <[hidden email]> wrote:

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?


ghc is less prone to invoke that kind of optimization, but sometimes can do so. And yes, gcc is decidedly "brutal" with -O2: inspect the generated assembler and you'll find that it just prints a constant.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Vale Cofer-Shabica
You can prevent this particular optimization-to-constant by declaring sum 'volatile' as below:

#include <stdio.h>

int main(void) {
   volatile long sum = 0;

   for (long i = 1; i <= 100000000L; ++i) {
       sum += i;
   }
   printf("%ld\n",sum);
   return 0;
}

I don't know if this provides the comparison you're looking for, but it does force the compiler to emit assembly to do the summation.

--
vale cofer-shabica
401.267.8253

On Tue, Feb 27, 2018 at 12:26 PM, Ryan Reich <[hidden email]> wrote:
In other words, it's not competition with the language C but with its popular compiler. Choose an example that doesn't simplify and you'll get a fairer contest.

On Feb 27, 2018 07:52, "Brandon Allbery" <[hidden email]> wrote:
On Tue, Feb 27, 2018 at 10:06 AM, Dušan Kolář <[hidden email]> wrote:

Unfortunately, in this case the C is much more faster (it prints the result immediately), at least on my machine. Is it due to a fact that C compiler does a brutal optimization leading to compile-time evaluation, while ghc is not able to do that?


ghc is less prone to invoke that kind of optimization, but sometimes can do so. And yes, gcc is decidedly "brutal" with -O2: inspect the generated assembler and you'll find that it just prints a constant.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Neil Mayhew
In reply to this post by Shao Cheng

On 2018-02-27 08:19 AM, Shao Cheng wrote:

Coming back to your use case, you may try avoid using raw lists and switch to unboxed vectors, turn on -O2 and rely on stream fusion of the vector package. That will result in a considerable speedup.

I looked at the core that’s generated, and there’s no need for vectors. Fusion happens, there’s no use of lists at all and unboxed types are used. The code boils down to a single recursive function:

let go i sum = case i of
        100000000 -> sum + 200000000
        _ -> go (i + 1) (sum + i * 2)
in go 1 0

except that the types are unboxed. The following complete program compiles down to almost identical core when compiled without optimization:

{-# LANGUAGE MagicHash #-}

import GHC.Exts

main = print $ I# value
  where
    value =
        let go :: Int# -> Int# -> Int#
            go i sum = case i of
                100000000# -> sum +# 200000000#
                _ -> go (i +# 1#) (sum +# i *# 2#)
        in go 1# 0#

I think that’s impressive even if it’s not a single number. Execution time on my lowly i5 is only 50ms.

BTW, GHC 8 seems to have removed the option for exporting core (-fext-core) but there’s a wonderful plugin package called dump-core that produces HTML output with colouring and interactivity. You just install it from Hackage and use the extra options it provides.

It seems to me that gcc’s compile-time evaluation of this loop is a special-case that matches the kind of thing that often crops up in C. I assume it’s not capable of doing that for every expression that could be evaluated at compile time, so a more complicated and realistic example would probably defeat it. After all, ghc could in theory evaluate any pure value (CAF) at compile time if it chose to, but that’s usually not what you want.

Also, it’s worth noting that due to Haskell’s lazy evaluation, a pure value (CAF) will never be evaluated more than once at runtime, which isn’t something you get with C.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Brandon Allbery
-fext-core wasn't about exporting it, but about accepting core as *source* ("external core"). Which was always tricky and was broken for years before the option was removed.

On Tue, Feb 27, 2018 at 1:51 PM, Neil Mayhew <[hidden email]> wrote:

On 2018-02-27 08:19 AM, Shao Cheng wrote:

Coming back to your use case, you may try avoid using raw lists and switch to unboxed vectors, turn on -O2 and rely on stream fusion of the vector package. That will result in a considerable speedup.

I looked at the core that’s generated, and there’s no need for vectors. Fusion happens, there’s no use of lists at all and unboxed types are used. The code boils down to a single recursive function:

let go i sum = case i of
        100000000 -> sum + 200000000
        _ -> go (i + 1) (sum + i * 2)
in go 1 0

except that the types are unboxed. The following complete program compiles down to almost identical core when compiled without optimization:

{-# LANGUAGE MagicHash #-}

import GHC.Exts

main = print $ I# value
  where
    value =
        let go :: Int# -> Int# -> Int#
            go i sum = case i of
                100000000# -> sum +# 200000000#
                _ -> go (i +# 1#) (sum +# i *# 2#)
        in go 1# 0#

I think that’s impressive even if it’s not a single number. Execution time on my lowly i5 is only 50ms.

BTW, GHC 8 seems to have removed the option for exporting core (-fext-core) but there’s a wonderful plugin package called dump-core that produces HTML output with colouring and interactivity. You just install it from Hackage and use the extra options it provides.

It seems to me that gcc’s compile-time evaluation of this loop is a special-case that matches the kind of thing that often crops up in C. I assume it’s not capable of doing that for every expression that could be evaluated at compile time, so a more complicated and realistic example would probably defeat it. After all, ghc could in theory evaluate any pure value (CAF) at compile time if it chose to, but that’s usually not what you want.

Also, it’s worth noting that due to Haskell’s lazy evaluation, a pure value (CAF) will never be evaluated more than once at runtime, which isn’t something you get with C.


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.



--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Dušan Kolář
In reply to this post by Dušan Kolář

Many thanks all for replies, yes, that's what I assumed/wrote. C compiler does compile-time evaluation, while ghc does another transformation. And, well, yes, declaring upper bound in loop in extra file breaks the optimization for C compiler and, thus, both evaluation times for C and Haskell binaries are the same thus.

 

Thank you once, again,

 

Dušan

 

P.S.

If anyone can point me to resource, where examples on demonstration of Haskell optimization are, I'll be very glad.

D.

 

On úterý 27. února 2018 23:06:38 CET Richard O'Keefe wrote:

> In order to optimise array indexing and to automatically vectorise

> your code, high performance compilers for imperative languages

> like C and Fortran analyse counted loops like you wouldn't believe.

> You can find out what the C compiler does with your code by using

> the -S command line option. gcc 7.2 optimised the whole loop away.

>

> Split your C program into two files.

> One contains

> long bound = 100000000L, incr = 1L;

> The other is a version of your existing code with

> extern long bound, incr;

> ...

> for (i = 1L; i <= bound; i += incr) { ... }

> ...

> gcc 7.2 isn't quite smart enough to optimise this away. Yet.

>

> The polyhedral optimisations going on are described in the

> open literature and are valuable for imperative array-

> crunching code. You are not likely to see them in GHC any

> time soon, just as you're not likely to see deforestation

> in gcc any time soon.

>

> On 28 February 2018 at 04:06, Dušan Kolář <[hidden email]> wrote:

> > Dear Café,

> >

> >

> >

> > I'm trying to do very small, but impressive example about optimizations

> > possible during Haskell compilation. So far, I can demonstrate that the

> > following two programs (if compiled) perform computation in the same time:

> >

> >

> >

> > 1)

> >

> >

> >

> > main =

> >

> > putStrLn $ show $ sum $ map (*(2::Int)) [(1::Int)..(100000000::Int)]

> >

> >

> >

> >

> >

> > 2)

> >

> >

> >

> > main =

> >

> > putStrLn $! show $! sumup 1 0

> >

> >

> >

> > sumup :: Int -> Int -> Int

> >

> > sumup n total =

> >

> > if n<=(100000000::Int) then sumup (n+1) $! total+(2*n)

> >

> > else total

> >

> >

> >

> >

> >

> > Nevertheless, I expect a question on comparison with C:

> >

> >

> >

> > 3)

> >

> >

> >

> > #include <stdio.h>

> >

> >

> >

> > int main(void) {

> >

> > long sum, i;

> >

> > sum = 0;

> >

> > for (i=1; i <= 100000000L; ++i) {

> >

> > sum += 2*i;

> >

> > }

> >

> > printf("%ld\n",sum);

> >

> > return 0;

> >

> > }

> >

> >

> >

> >

> >

> > Unfortunately, in this case the C is much more faster (it prints the

> > result immediately), at least on my machine. Is it due to a fact that C

> > compiler does a brutal optimization leading to compile-time evaluation,

> > while ghc is not able to do that?

> >

> >

> >

> > I'm using -O2 -dynamic --make ghc compiler flags. For gcc for C

> > compilation just -O2, running Arch Linux.

> >

> >

> >

> > Is there any option, how to force compile time evaluation? The reason, why

> > I think it works this way is the fact, that when running out of long type

> > values in C a code is generated that computes the values regularly

> > (providing misleading value as a result) taking its time.

> >

> >

> >

> > Best regards,

> >

> >

> >

> > Dušan

> >

> >

> >

> >

> >

> >

> >

> > _______________________________________________

> > Haskell-Cafe mailing list

> > To (un)subscribe, modify options or view archives go to:

> > http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

> > Only members subscribed via the mailman list are allowed to post.

 

--

 

Dusan Kolar tel: +420 54 114 1238

UIFS FIT VUT Brno fax: +420 54 114 1270

Bozetechova 2 e-mail: [hidden email]

Brno 612 66

Czech Republic

 

--

 


_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.
Reply | Threaded
Open this post in threaded view
|

Re: Optimization demonstration

Neil Mayhew
In reply to this post by Brandon Allbery
On 2018-02-27 09:55 PM, Brandon Allbery wrote:
-fext-core wasn't about exporting it, but about accepting core as *source* ("external core"). Which was always tricky and was broken for years before the option was removed.


Thanks. I see I should have been using `-ddump-simpl` instead.

_______________________________________________
Haskell-Cafe mailing list
To (un)subscribe, modify options or view archives go to:
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Only members subscribed via the mailman list are allowed to post.