Where do I start if I would like help improve GHC compilation times?


Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey folks,

maybe I’m setting myself up for something too ambitious, but I would like to take an active stance on the everlasting “GHC compilation times are terrible” matter, instead of simply staring at the screen with despair whenever GHC compiles a sufficiently large Haskell program ;)

To make this even more interesting, I have never contributed to GHC either! The furthest I have pushed myself was 2 years ago, when I successfully built GHC HEAD from source and tried to fix a Haddock “easy” ticket I don’t even recall (full disclosure, eventually I didn’t :D ).

Specifically, I would love community recommendations & guidance about:

1. Is this simply too daunting for somebody like me? Maybe it is better to first start contributing more regularly, gain confidence with the code base, AND then move forward?

2. Are compilation times largely dependent on the target platform (I’m on Darwin), or is there something which can be done “globally” so that the benefits can be experienced by everybody?

3. Is there any recommended workflow to profile GHC compilation times? Is there any build flavour one should prefer when doing so? (Maybe the full, slowest one?)

Thanks in advance for taking the time to read this mail, and have a nice weekend!

Alfredo

Re: Where do I start if I would like help improve GHC compilation times?

Ben Gamari
Alfredo Di Napoli <[hidden email]> writes:

> Hey folks,
>
Hi Alfredo!

First, thanks for writing. More eyes looking at GHC's compiler
performance are badly needed.

> maybe I’m setting myself up for something too ambitious, but I would like
> to take an active stance on the everlasting “GHC compilation times are
> terrible” matter, instead of simply staring at the screen with despair
> whenever GHC compiles a sufficiently large Haskell program ;)
>
> To make this even more interesting, I have never contributed to GHC either!
> The furthest I have pushed myself was 2 years ago, when I successfully built
> GHC HEAD from source and tried to fix a Haddock “easy” ticket I don’t even
> recall (full disclosure, eventually I didn’t :D ).
>
> Specifically, I would love community recommendations & guidance about:
>
> 1. Is this simply too daunting for somebody like me? Maybe it is better to
> first start contributing more regularly, gain confidence with the code base,
> AND then move forward?
>
As with any software project, it is possible to treat the compiler as a
black box, throw a profiler at it and see what hotspots show up. This
gives you a place to focus your effort, allowing you to learn a small
area and broaden your knowledge as necessary.

However, I think it's fair to say that you will be significantly more
productive if you first develop a basic understanding of the compilation
pipeline. I'd recommend having a look at the GHC Commentary [1] for a
start.

I think it also helps to have a rough idea of what "slow" means to you.
I find it is quite helpful if you have a particular program which you
feel compiles more slowly than you would like (especially if it even
compiles slowly with -O0, since then much less of the compiler is
involved in compilation). Another approach is to look for programs whose
compilation time has regressed over the course of GHC releases. It is
not hard to find these examples and it is often possible to bisect your
way back to the regressing commit.

Also, note that I have collected some notes pertaining to compiler
performance on the Wiki [2]. Here you will find a number of tickets of
interest (as well as some rough themes which I've noticed), some nofib
results which might guide your efforts, as well as a list of some
fixes which have been committed in the past.

[1] https://ghc.haskell.org/trac/ghc/wiki/Commentary/Compiler 
[2] https://ghc.haskell.org/trac/ghc/wiki/Performance/Compiler

> 2. Are compilation times largely dependent on the target platform (I’m on
> Darwin), or is there something which can be done “globally” so that the
> benefits can be experienced by everybody?
>
There are some external considerations (e.g. the platform's compiler and
linking toolchain) which contribute to GHC's runtime. For instance, it
is known that the BFD ld linker implementation that many Linux
distributions use by default is a great deal slower than it could be.
This particular issue has come up recently and I'm currently working on
allowing us to use the more performant gold linker when available.

However, I think it's fair to say that for most programs GHC's runtime
is largely independent of platform. I would invite you to try compiling
a package which you consider GHC to compile "slowly" with GHC's -v flag
(and GHC 8.0.1 or newer). This will give you a rough breakdown of where
time is spent. For many packages you will find that the simplifier
and/or typechecker dominate, followed (often distantly) by native code
generation. Of these steps native code generation is the only one with a
strong platform dependence.

> 3. Is there any recommended workflow to profile GHC compilation times? Is
> there any build flavour one should prefer when doing so? (Maybe the full,
> slowest one?)
>
There are a few options here:

 * As of GHC 8.0 the compiler will output timing and allocation
   information for its various stages if run with -v. This can be
   extremely helpful to get a high-level picture of where the compiler
   is spending its time while compiling your program. This is almost
   always the right place to start.

 * As with any Haskell program, the cost centre profiler can be used to
   characterize the memory and CPU behavior of various parts of the
   compiler.

   GHC's source tree includes a "prof" build flavour which builds the
   compiler with profiling enabled. However it only includes a handful
   of cost-centres and is best used when you already have a rough idea
   where you are looking and can add further cost-centres to drill down
   to your hotspot.

   Simply enabling -fprof-exported across the entire tree just doesn't
   work in my experience: not only is the resulting compiler quite slow,
   but the profile you get is far too unwieldy to learn from.

 * Occasionally the ticky-ticky profiler can be helpful in identifying
   allocation hotspots without the full overhead of the cost-centre
   profiler.

 * In principle our newly-stable DWARF debug information can be used for
   profiling, although this is still a work in progress and requires a
   patched GHC for best results. It's probably best to stick to the more
   traditional profiling mechanisms for now.
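
As a minimal stand-alone illustration of the cost-centre workflow mentioned in the list above (a toy module, not GHC code; the module and function names are made up), one can place an `SCC` pragma around a suspected hotspot and read it back out of the `.prof` report:

```haskell
module Main where

import Data.List (sort)

-- A hand-placed cost centre: after building with `ghc -prof -rtsopts Main.hs`
-- and running with `./Main +RTS -p`, "expensiveSort" shows up as its own
-- entry in Main.prof, with its share of time and allocation.
expensiveSort :: [Int] -> [Int]
expensiveSort xs = {-# SCC "expensiveSort" #-} sort xs

main :: IO ()
main = print (sum (expensiveSort [100000, 99999 .. 1]))
```

The GHC "prof" flavour plus extra hand-written cost-centres relies on the same mechanism, just applied to the compiler's own modules.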

Anyways, I hope this helps. Always feel free to get in touch with me
personally (IRC and email are both great) if you would like to discuss
particular issues. Thanks again for your interest!

Cheers,

- Ben



Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey Ben,

thanks for the quite exhaustive reply! I’m on the go right now, but I promise to get back to you with a meaningful reply later this weekend ;)

Alfredo


Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey Ben,

as promised I’m back to you with something more articulated and hopefully meaningful. I do hear you perfectly — probably trying to dive head-first into this without at least a rough understanding of the performance hotspots or the GHC overall architecture is going to do me more harm than good (I get the overall picture and I’m aware of the different stages of the GHC compilation pipeline, but that’s far from saying I’m proficient with the architecture as a whole). A couple of years ago I also read the GHC chapter of the “Architecture of Open Source Applications” book, but I don’t know how much of that is still relevant. If it is, I guess I should refresh my memory.

I’m currently trying to move on 2 fronts — please advise if I’m a fool flogging a dead horse or if I have any hope of getting anything done ;)

1. I’m indeed trying to treat the compiler as a black box (as you advised), looking for a sufficiently large program where GHC is not “as fast as I would like” (I know that’s a very lame definition of “slow”, hehe). In particular, I have built the stage2 compiler with the “prof” flavour as you suggested, and I have chosen 2 examples as a reference “benchmark” for performance: DynFlags.hs (which seems to have been mentioned multiple times as a GHC perf killer) and the highlighting-kate package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 . The idea would be to compile those with -v +RTS -p -hc -RTS enabled, look at the output from the .prof file AND the `-v` flag, find any hotspot, try to change something, recompile, observe the diff, rinse and repeat. Do you think I have any hope of making progress this way? In particular, I think compiling DynFlags.hs is a bit of a dead end; I whipped up this buggy script, which escalated into a Behemoth which is compiling pretty much half of the compiler once again :D

```
#!/usr/bin/env bash

../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
-RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \
-I../ghc/compiler/stage2/build \
-i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \
-i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \
-i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \
-i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compiler/specialise \
-fforce-recomp -c $@
```

I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs`, but it’s taking a long time to compile (20+ mins on my 2014 Mac Pro) because it’s pulling in half of the compiler anyway :D I tried to reuse the .hi files from my stage2 compilation but I failed (GHC was complaining about an interface file mismatch). Long story short, I don’t think this will be a very agile way to proceed. Am I right? Do you have any recommendations here? Do I have any hope of compiling DynFlags.hs in a way which would make this perf investigation feasible?

The second example (the highlighting-kate package) seems much more promising. It takes maybe 1-2 mins on my machine, which is enough to take a look at the perf output. Do you think I should follow this second lead? In principle I think any 50+ module package would do (better if it uses a lot of TH ;) ), but this seems like a low-barrier-to-entry start.

2. The second path I’m exploring is simply to take a less holistic approach and try to dive into a performance ticket like the ones listed here: https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
Maybe some are very specific, but it seems like fixing small things and moving forward could help give me an understanding of different sub-parts of GHC, which seems less intimidating than the black-box approach.

In conclusion, what do you think is the best approach, 1 or 2, both or none? ;)

Thank you!

Alfredo


Re: Where do I start if I would like help improve GHC compilation times?

Reid Barton
Building modules from GHC itself is a little tricky and DynFlags is
extra tricky since it is involved in import cycles. Here is what I do:

* Copy DynFlags.hs somewhere outside the tree (for your present
purposes, it is no longer part of the compiler, but just some module
to be provided as input).
* Get rid of all the {-# SOURCE #-} pragmas on imports to turn them
into ordinary, non-boot file imports.
* Build with ".../ghc/inplace/bin/ghc-stage2 DynFlags -package ghc
-I.../ghc/compiler/stage2" plus whatever other options you want (e.g.,
probably "-fforce-recomp -O +RTS -s" at a minimum). By using "-package
ghc" you compile DynFlags against the version of ghc that you have
just built.
* This will result in some type errors, because DynFlags imports some
functions that expect arguments of type DynFlags. (This relates to the
import cycles that we broke earlier.) Since you are building against
the version of those functions from the ghc package, they expect the
type ghc:DynFlags.DynFlags, but they are now receiving a value of type
DynFlags from the main package. This is no big deal, just insert an
unsafeCoerce wherever necessary (mostly in front of occurrences of
"dflags") to get the compiler to stop complaining.

This is not 100% faithful to the way DynFlags would actually be
compiled during a GHC build, but the advantage of this method is that
you don't have to worry about GHC doing any recompilation checking
between the copy of DynFlags that you are testing on and the
compiler's own modules.

Regards,
Reid Barton



Re: Where do I start if I would like help improve GHC compilation times?

David Feuer
In reply to this post by Alfredo Di Napoli
Be aware that some of the biggest performance problems with TH simply can't be fixed without changes to the TH language. For details, see Edward Yang's blog post: http://blog.ezyang.com/2016/07/what-template-haskell-gets-wrong-and-racket-gets-right/

There was a Reddit thread discussing that post at https://www.reddit.com/r/haskell/comments/4tfzah/what_template_haskell_gets_wrong_and_racket_gets/


David Feuer
Well-Typed, LLP


Re: Where do I start if I would like help improve GHC compilation times?

Niklas Hambüchen
In reply to this post by Alfredo Di Napoli
I have some suggestions for low-hanging fruit in this effort.

1. Make ghc print more statistics on what it is spending time on

When I did the linking investigation recently
(https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_with_gold_link_10x_faster/)
I noticed (with strace) that there are lots of interesting syscalls
being made that you might not expect. For example, each time TH is used,
shared libraries are loaded, and to determine the shared library paths,
ghc shells out to `gcc --print-file-name`. Each such invocation takes 20
ms on my system, and I have 1000 invocations in my build. That's 20
seconds (out of 2 minutes build time) just asking gcc for paths.

I recommend that for every call to an external program, GHC measure how long
that call took, so that it can be asked to print a summary when it's done.

That might give us lots of interesting things to optimize. For example,
this would have made the long linker times totally obvious.

At the end, I would love to know for each compilation (both one-shot as
used in ghc's build system, and `ghc --make`):

* What programs did it invoke and how long did they take
* What files did it read and how long did that take
* How long did it take to read all the `.hi` files in `ghc --make`
* High level time summary (parsing, typechecking, codegen, .hi files, etc)

That way we'll know at least what is slow, and don't have to resort to
strace every time in order to obtain this basic answer.
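
A minimal sketch of the kind of measurement being suggested (not GHC's actual implementation; it assumes the `process` and `time` packages, and uses `gcc --print-file-name` purely as an example of an external call):

```haskell
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Process (readProcess)

-- Run an external program and return its stdout together with the
-- wall-clock time the call took. A GHC-internal version would also
-- record the program name and arguments so that a per-program summary
-- could be printed at the end of the compilation.
timedReadProcess :: FilePath -> [String] -> IO (String, Double)
timedReadProcess prog args = do
  start <- getCurrentTime
  out   <- readProcess prog args ""
  end   <- getCurrentTime
  pure (out, realToFrac (diffUTCTime end start))

main :: IO ()
main = do
  (out, secs) <- timedReadProcess "gcc" ["--print-file-name", "libgmp.so"]
  putStrLn (takeWhile (/= '\n') out ++ " (" ++ show secs ++ " s)")
```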

2. Investigate if idiotic syscalls are being done and how much

There's this concept I call "idiotic syscalls": syscalls which you know in
advance won't contribute anything
productive. For example, if you give a linker N many `-L` flags (library
dirs) and M many `-l` flags (library names to link), it will try to
`stat()` or `open()` N*M many files, out of which most are total
rubbish, because we typically know what library is in what dir.
Example: You pass `-L/usr/lib/opencv -L/usr/lib/imagemagick
-L/usr/lib/blas -lopencv -limagemagick -lblas`. Then you will get
things like `open("/usr/lib/opencv/libimagemagick.so") = ENOENT` which
makes no sense and obviously that file doesn't exist. This is a problem
with the general "search path" concept; same happens for running
executables searching through $PATH. Yes, nonexistent file opens fail
fast, but in my typical ghc invocation I get millions of them (and we
should at least measure how much time is wasted on them), and they
clutter the strace output and make the real problems harder to investigate.
We should check if we can find ways to pass only those files that do exist.

3. Add pure TemplateHaskell

It is well known that TH is a problem for incremental compilation
because it can have side effects and we must therefore be more
conservative about when to recompile; when you see a `[TH]` in your `ghc
--make` output, it's likely that time again.

I believe this could be avoided by adding a variant of TH that forbids
the use of the `runIO` function, and can thus not have side effects.

Most TH does not need side effects, for example any form of code
generation based on other data types (lenses, instances for whatever).
If that was made "pure TH", we would not have to recompile when inputs
to our TH functions change.

Potentially this could even be determined automatically instead of
adding a new variant of TH, as was done for typed TH `$$()`, simply by
inspecting what's in the TH and if we can decide there's no `runIO` in
there, mark it as clean, otherwise as tainted.
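
A small, purely illustrative example of the distinction (toy splices, not a proposed API; the file name and splice names are made up):

```haskell
{-# LANGUAGE TemplateHaskell #-}
module Splices where

import Language.Haskell.TH

-- A splice with no side effects: the declarations it produces depend only
-- on its arguments, so it could in principle be marked "clean" and not
-- force conservative recompilation of modules that use it.
pureSplice :: String -> Q [Dec]
pureSplice name =
  (: []) <$> funD (mkName name) [clause [] (normalB [| 42 :: Int |]) []]

-- A splice using runIO: its result can depend on the outside world (a file,
-- an environment variable, ...), so GHC would have to keep treating it as
-- "tainted" and recompile conservatively.
impureSplice :: Q [Dec]
impureSplice = do
  contents <- runIO (readFile "version.txt")
  (: []) <$> funD (mkName "version") [clause [] (normalB (stringE contents)) []]
```

A using module would then say e.g. `$(pureSplice "answer")`; under the proposal only splices of the second kind would trigger the conservative `[TH]` recompilation behaviour.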

4. Build ghc with `ghc --make` if possible

This one might be controversial or impossible (others can likely tell
us). Most Haskell code is built with `ghc --make`, not with the one-shot
compilation system + make or Hadrian, as is done in GHC's build system.
Weirdly, often `ghc --make` scales much worse and has much worse
incremental recompilation times than the one-shot mode, which doesn't
make sense given that it has no process creation overhead, can do much
better caching etc. I believe that if ghc or large parts of it (e.g.
stage2) itself was built with `--make`, we would magically see --make
become very good, simply we make the right people (GHC devs) suffer
through it daily :D. I expect from this the solution of the `-j`
slowness, GHC overhead reduction, faster interface file loads and so on.

These are some ideas.

Niklas

Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey all,

thanks a ton for the invaluable pointers. I’m now in “I-dunno-what-I-am-doing” mode, banging out SCC annotations like there is no tomorrow, trying to spot any chance for some low-hanging-fruit algorithmic improvement (like using a sequence instead of a list, etc.), and I will come back to your suggestions when I face the inevitable dead-end wall :D

Alfredo


Re: Where do I start if I would like help improve GHC compilation times?

Ben Gamari
In reply to this post by Niklas Hambüchen
Niklas Hambüchen <[hidden email]> writes:

> I have some suggestions for low-hanging fruit in this effort.

Hi Niklas,

Thanks for your suggestions, they are quite reasonable.

> 1. Make ghc print more statistics on what it is spending time on
>
> When I did the linking investigation recently
> (https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_with_gold_link_10x_faster/)
> I noticed (with strace) that there are lots of interesting syscalls
> being made that you might not expect. For example, each time TH is used,
> shared libraries are loaded, and to determine the shared library paths,
> ghc shells out to `gcc --print-file-name`. Each such invocation takes 20
> ms on my system, and I have 1000 invocations in my build. That's 20
> seconds (out of 2 minutes build time) just asking gcc for paths.
>
> I recommend that for every call to an external program GHC measures how long
> that call took, so that it can be asked to print a summary when it's done.
Indeed. Since 8.0 GHC has had a scheme for tracking CPU time and
allocations of various compiler phases. It would be great to implement a
similar scheme for external tools. You can't necessarily reuse the
existing infrastructure since you will want to use monotonic time
instead of CPU time and allocations aren't relevant.
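
Roughly, I would imagine a helper along these lines (just a sketch, assuming a base new enough to provide GHC.Clock; the real thing would of course hook into the code paths that already shell out to external programs):

```haskell
import GHC.Clock (getMonotonicTime)  -- assumes a recent base
import System.Process (readProcess)

-- Run an external program and report how long the call took, using the
-- monotonic clock rather than CPU time.
timedCall :: FilePath -> [String] -> IO (String, Double)
timedCall prog args = do
  start <- getMonotonicTime
  out   <- readProcess prog args ""
  end   <- getMonotonicTime
  pure (out, end - start)

main :: IO ()
main = do
  (_, dt) <- timedCall "gcc" ["--print-file-name", "libgcc.a"]
  putStrLn ("gcc --print-file-name took " ++ show dt ++ " s")
```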

> That might give us lots of interesting things to optimize. For example,
> this would have made the long linker times totally obvious.

I have a number of thoughts on this which I'll share in another message
as they are largely orthogonal to the matter at hand.

> At the end, I would love to know for each compilation (both one-shot as
> used in ghc's build system, and `ghc --make`):
>
> * What programs did it invoke and how long did they take
> * What files did it read and how long did that take
> * How long did it take to read all the `.hi` files in `ghc --make`
> * High level time summary (parsing, typechecking, codegen, .hi files, etc)

We already have much of this. See D1959 for the relevant change. It
would be great to extend this with more (optional) detail (e.g. measure
individual interface file deserialization times).

> That way we'll know at least what is slow, and don't have to resort to
> strace every time in order to obtain this basic answer.

Indeed. On the whole this would be a very nice, more-or-less
self-contained project.

> 2. Investigate if idiotic syscalls are being done and how much
>
> There's this concept I call "idiotic syscalls", which are syscalls of
> which you know from before that they won't contribute anything
> productive. For example, if you give a linker N many `-L` flags (library
> dirs) and M many `-l` flags (library names to link), it will try to
> `stat()` or `open()` N*M many files, out of which most are total
> rubbish, because we typically know what library is in what dir.
> Example: You pass `-L/usr/lib/opencv -L/usr/lib/imagemagick
> -L/usr/lib/blas -lopencv -limagemagick -lblas`. Then you will get
> things like `open("/usr/lib/opencv/libimagemagick.so") = ENOENT`, which
> makes no sense because that file obviously doesn't exist. This is a problem
> with the general "search path" concept; same happens for running
> executables searching through $PATH. Yes, nonexistent file opens fail
> fast, but in my typical ghc invocation I get millions of them (and we
> should at least measure how much time is wasted on them), and they
> clutter the strace output and make the real problems harder to investigate.
> We should check if we can find ways to pass on only those files that
> actually exist.
Indeed, the matter of library search paths is actually a pretty bad one
although largely a Cabal issue (see GHC #11587).

> 3. Add pure TemplateHaskell
>
> It is well known that TH is a problem for incremental compilation
> because it can have side effects and we must therefore be more
> conservative about when to recompile; when you see a `[TH]` in your `ghc
> --make` output, it's likely that time again.
>
> I believe this could be avoided by adding a variant of TH that forbids
> the use of the `runIO` function, and can thus not have side effects.
>
> Most TH does not need side effects, for example any form of code
> generation based on other data types (lenses, instances for whatever).
> If that was made "pure TH", we would not have to recompile when inputs
> to our TH functions change.
>
> Potentially this could even be determined automatically instead of
> adding a new variant of TH, as was done for typed TH `$$()`, simply by
> inspecting what's in the TH and if we can decide there's no `runIO` in
> there, mark it as clean, otherwise as tainted.
Cross-compiling users would also greatly appreciate this.

> 4. Build ghc with `ghc --make` if possible
>
> This one might be controversial or impossible (others can likely tell
> us). Most Haskell code is built with `ghc --make`, not with the one-shot
> compilation system + make or Hadrian, as done in GHC's build system.
> Weirdly, often `ghc --make` scales much worse and has much worse
> incremental recompilation times than the one-shot mode, which doesn't
> make sense given that it has no process creation overhead, can do much
> better caching etc. I believe that if ghc or large parts of it (e.g.
> stage2) itself was built with `--make`, we would magically see --make
> become very good, simply because we would make the right people (GHC devs)
> suffer through it daily :D. I would expect this to lead to solutions for the
> `-j` slowness, reduced GHC overhead, faster interface file loads, and so on.
>
One thing I've wondered about is how we can make use of
compact regions to try to minimize GC costs in long-lived GHC sessions
(e.g. a --make invocation on a large project). This would likely help
parallel performance as well. N.B. #9221 is essentially the catch-all
ticket for -j scalability issues.

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Reply | Threaded
Open this post in threaded view
|

GHC and linker choice

Ben Gamari-3
In reply to this post by Niklas Hambüchen
Niklas Hambüchen <[hidden email]> writes:

> I have some suggestions for low hanging fruits in this effort.
>
> 1. Make ghc print more statistics on what it is spending time on
>
> When I did the linking investigation recently
> (https://www.reddit.com/r/haskell/comments/63y43y/liked_linking_3x_faster_with_gold_link_10x_faster/)
> I noticed (with strace) that there are lots of interesting syscalls
> being made that you might not expect. For example, each time TH is used,
> shared libraries are loaded, and to determine the shared library paths,
> ghc shells out to `gcc --print-file-name`. Each such invocation takes 20
> ms on my system, and I have 1000 invocations in my build. That's 20
> seconds (out of 2 minutes build time) just asking gcc for paths.
>
> I recommend that for every call to an external program GHC measures how long
> that call took, so that it can be asked to print a summary when it's done.
>

> That might give us lots of interesting things to optimize. For example,
> this would have made the long linker times totally obvious.
>
For what it's worth, some of us have recognized for quite some time that
BFD ld is a known slow spot in the compilation pipeline. However, up
until now we have considered this to be more of an issue to be handled
at the packaging level than a GHC issue. Currently, GHC relies on `gcc`
to link and gcc has its own idea of the default linker. Many users (e.g.
Debian packagers, users who are cross compiling) are quite unhappy when
GHC coerces the system toolchain into changing its default behavior.

Consequently we have maintained the status quo and silently wished that
Linux distributions would start moving to gold by default. Sadly there
continues to be little progress on this matter.

However, given that so many users are now relying on GHC binary
distributions and that BFD ld is an increasingly significant issue as
dependency counts increase, I think the status quo is increasingly
untenable. I have proposed (#13541) that we introduce configure
logic to use gold when available. There are outstanding questions, however:

 * What interface (i.e. configure flags) do we expose to the user to
   allow them to turn on or off this logic?

 * Do we want to enable the logic by default in binary
   distributions to ensure that users see the benefits of gold if it is
   available?

 * Do we want to also enable it in source distributions as well? If yes
   then developers may be caught by surprise; if no then there will be
   an inconsistency between the default configuration of the source tree
   and binary distributions (although there is plenty of precedent for this).

Anyways, let me know if you have thoughts on any of this. Hopefully
we'll be able to get this done for 8.4.

Cheers,

- Ben


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Ben Gamari-2
In reply to this post by Alfredo Di Napoli
Alfredo Di Napoli <[hidden email]> writes:

> Hey Ben,
>
Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit
longer than I would like.

> as promised I’m back to you with something more articulated and hopefully
> meaningful. I do hear you perfectly — probably trying to dive head-first
> into this without at least a rough understanding of the performance
> hotspots or the GHC overall architecture is going to do me more harm than
> good (I get the overall picture and I’m aware of the different stages of
> the GHC compilation pipeline, but it’s far from saying I’m proficient with
> the architecture as whole). I have also read a couple of years ago the GHC
> chapter on the “Architecture of Open Source Applications” book, but I don’t
> know how much that is still relevant. If it is, I guess I should refresh my
> memory.
>
It sounds like you have done a good amount of reading. That's great.
Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise
it's likely worthwhile diving in.

> I’m currently trying to move on 2 fronts — please advise if I’m a fool
> flogging a dead horse or if I have any hope of getting anything done ;)
>
> 1. I’m indeed trying to treat the compiler as a black box (as you
> advised), trying to build a sufficiently large program where GHC is not “as
> fast as I would like” (I know that’s a very lame definition of “slow”,
> hehe). In particular, I have built the stage2 compiler with the “prof”
> flavour as you suggested, and I have chosen 2 examples as a reference
> “benchmark” for performance; DynFlags.hs (which seems to have been
> mentioned multiple times as a GHC perf killer) and the highlighting-kate
> package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .
Indeed, #9221 would be a very interesting ticket to look at. The
highlighting-kate package is interesting in the context of that ticket
as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may
not provide the whole story. In particular, it has been speculated that
the scaling issues may be due to either,

 * threads hitting a blackhole, resulting in blocking

 * the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

> The idea would be to compile those with -v +RTS -p -hc -RTS enabled,
> look at the output from the .prof file AND the `-v` flag, find any
> hotspot, try to change something, recompile, observe diff, rinse and
> repeat. Do you think I have any hope of making progress this way? In
> particular, I think compiling DynFlags.hs is a bit of a dead-end; I
> whipped up this buggy script which
> escalated into a Behemoth which is compiling pretty much half of the
> compiler once again :D
>
> ```
> #!/usr/bin/env bash
>
> ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
> -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2
> \
> -I../ghc/compiler/stage2/build \
> -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes
> \
> -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude
> \
> -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn
> \
> -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compiler/specialise
> \
> -fforce-recomp -c $@
> ```
>
> I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but
> it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s
> pulling in half of the compiler anyway :D I tried to reuse the .hi files
> from my stage2 compilation but I failed (GHC was complaining about
> interface file mismatch). Short story short, I don’t think it will be a
> very agile way to proceed. Am I right? Do you have any recommendation in
> such sense? Do I have any hope to compile DynFlags.hs in a way which would
> make this perf investigation feasible?
>
What I usually do in this case is just take the relevant `ghc` command
line directly from the `make` output and execute it manually. I would
imagine your debug cycle would look something like,

 * instrument the compiler
 * build stage1
 * use the stage1 compiler to rebuild DynFlags (reusing the saved command line from the stage2 build)
 * think
 * repeat

This should only take a few minutes per iteration.

> The second example (the highlighting-kate package) seems much more
> promising. It takes maybe 1-2 mins on my machine, which is enough to take a
> look at the perf output. Do you think I should follow this second lead? In
> principle any 50+ modules package I think would do (better if with a lot of
> TH ;) ) but this seems like a low-entry barrier start.
>
> 2. The second path I’m exploring is simply to take a less holistic approach
> and try to dive in into a performance ticket like the ones listed here:
> https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
> Maybe some are very specific, but it seems like fixing small things and
> move forward could help giving me understanding of different sub-parts of
> GHC, which seems less intimidating than the black-box approach.
>
Do you have any specific tickets from these lists that you found
interesting?

> In conclusion, what do you think is the best approach, 1 or 2, both or
> none? ;)

I would say that it largely depends upon what you feel most comfortable
with. If you feel up for it, I think #9221 would be a nice, fairly
self-contained, yet high-impact ticket which would be worth spending a
few days diving further into.

Cheers,

- Ben


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Reply | Threaded
Open this post in threaded view
|

Re: GHC and linker choice

Niklas Hambüchen
In reply to this post by Ben Gamari-3
Hi Ben,

> For what it's worth, some of us have recognized for quite some time that
> BFD ld is a known slow spot in the compilation pipeline. However, up
> until now we have considered this to be more of an issue to be handled
> at the packaging level than a GHC issue. Currently, GHC relies on `gcc`
> to link and gcc has its own idea of the default linker. Many users (e.g.
> Debian packagers, users who are cross compiling) are quite unhappy when
> GHC coerces the system toolchain into changing its default behavior.
>
> Consequently we have maintained the status quo and silently wished that
> Linux distributions would start moving to gold by default. Sadly there
> continues to be little progress on this matter.
This is a great point.

I don't know whether switching from BFD ld to gold is a controversial move, given that
they are both GNU projects. I would imagine if lld was in the game, that
might be more controversial.

I have the theory that both Haskell packagers and Linux distributions
are simply conservative about changing the linker not because they don't
want it, but because they are scared of breaking people's code, and
especially in the packagers' case, that they don't feel expert enough to
judge whether making this move would be safe.

I can imagine that they would welcome it if the GHC devs made this
decision, since after all they generate the linker invocations and have
the knowledge to judge whether they are supposed to work.

If a packager wants to *override* GHC's default behaviour, that should
certainly be possible, but as a packager I would find it unfair if GHC
pushed the burden of setting up the optimal defaults for my users on me.

For a comparison with other programming languages:

From tickets like https://github.com/golang/go/issues/1588 it seems that Go
also uses gold by default when available.

Rust switched to gold by default with
https://github.com/rust-lang/rust/pull/29974, but reverted the commit
with https://github.com/rust-lang/rust/pull/30913, with reasons being
the linked tickets.

The Rust tickets in particular have already discussed the various issues
we may be running into regarding autodetection, cross compilation and so
on; it may be good to draw inspiration from there.

>  * Do we want to also enable it in source distributions as well? If yes
>    then developers may be caught by surprise

I would recommend doing the same in source and binary distributions. The
surprise should be minimised by properly advertising it as a key thing
in the release notes. Packagers are supposed to read those, and in fact
most users of GHC love to read them, for they contain many goodies.

Another option is to not change the default or enable autodetection, but
make it trivial for the user to use gold, e.g. by at least fixing the
Cabal library (currently you need to put your linker options into a lot
of places to get the desired results, see
https://github.com/haskell/cabal/issues/4435, and even if you do, for
some invocations Cabal just doesn't pass on what you tell it to do, see
https://github.com/haskell/cabal/issues/4439).
We could run this for a GHC release term, and also have cabal-install
and stack recommend that people try setting some easy option
to use the gold linker if available. If that works well, we could switch
to autodetection the release after that.
I haven't thought much about whether this is actually a better idea than
just enabling it where we can.

Niklas


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
In reply to this post by Ben Gamari-2
Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of trying to optimise some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seem to have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:


Now, it’s totally possible that this has been already tried (with no success) but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) of leveraging the ideas of that paper and switching to a different implementation for `Doc`, coupled with the use of lazy dequeues, which *might* improve the performance of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rhetorical figure :D ) or is there even a tiny chance of this being actually useful?
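
To make the idea a little more concrete, here is a toy sketch of the representation I have in mind (my own simplification, not the paper's actual definitions): a Doc becomes a function over a token stream, so concatenation is just function composition and therefore O(1):

```haskell
-- A toy Doc: a function over a token stream.
data Token = TText String | TLine

newtype Doc = Doc ([Token] -> [Token])

text :: String -> Doc
text s = Doc (TText s :)

line :: Doc
line = Doc (TLine :)

-- Concatenation is function composition, so it is O(1).
(<+>) :: Doc -> Doc -> Doc
Doc f <+> Doc g = Doc (f . g)

render :: Doc -> String
render (Doc d) = concatMap go (d [])
  where
    go (TText s) = s
    go TLine     = "\n"
```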

Have a nice evening,

Alfredo

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Simon Marlow-7
Pretty-printing the asm is a likely contender for optimisation; however, the problem is not the pretty-printing per se.  We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines.  The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes.  Using a ByteString Builder instead might yield some improvement.
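
Something in this direction, say (illustrative only, with a made-up instruction type rather than the real asm representation):

```haskell
import qualified Data.ByteString.Builder as B
import Data.Monoid ((<>))
import System.IO (stdout)

-- A stand-in for the real asm instruction type.
data Instr = Mov String String | Ret

-- Flat, append-only rendering straight into a Builder: no intermediate
-- Doc structure, just cheap concatenation of byte chunks.
renderInstr :: Instr -> B.Builder
renderInstr (Mov dst src) =
  B.string7 "\tmovq " <> B.string7 src <> B.string7 ", "
    <> B.string7 dst <> B.char7 '\n'
renderInstr Ret = B.string7 "\tret\n"

main :: IO ()
main = B.hPutBuilder stdout (foldMap renderInstr [Mov "%rax" "%rbx", Ret])
```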

Cheers
Simon

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey Simon,

thanks for chiming in. I had the same suspicion as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is to avoid costly appends). I eventually bailed out because that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1); forcing them back to a list was necessary. Am I right in thinking that using a Builder here would suffer from the same malady? Ideally I would like constant time for appends, left concatenation and inspection (as in pattern-matching inspection), but I don’t think what I’m asking for exists, at least not in a purely functional incarnation anyway ;)

That’s why I was thinking of giving Deques (or more generally, Sequences) a go: they don’t have O(1) appends, but they can at least be inspected in O(1). The question to answer is whether this would yield an improvement, which I guess depends on the ratio of inspection to appending we do in the pretty printing; if appending dominates, using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering whether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.
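
For reference, this is the kind of trade-off I mean with Sequences (a toy example using Data.Sequence from containers, nothing GHC-specific):

```haskell
import Data.Sequence (Seq, ViewL (..), viewl, (|>), (><))
import qualified Data.Sequence as Seq

-- Appends are O(log(min(n1,n2))) rather than O(1), but the left end can
-- be inspected in (amortised) O(1), which a plain difference list cannot do.
firstElem :: Seq a -> Maybe a
firstElem s = case viewl s of
  EmptyL -> Nothing
  x :< _ -> Just x

example :: Seq Int
example = (Seq.fromList [1, 2, 3] >< Seq.fromList [4, 5]) |> 6
-- firstElem example == Just 1
```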

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the source tree and I’m now planning to produce a criterion benchmark and compare different implementations, although it’s a bit hard to predict real-world usage as I don’t have access to a representative Doc as produced by GHC, so my benchmarks could be ill-founded.

Alfredo 


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Bartosz Nitka
Hey Alfredo,

Thanks for taking a look at this. I've taken a look at this before and collected some notes here: https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems
Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic.
If I remember correctly the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic.
I've tried implementing a "just append bytes" mode without `reduceDoc`, but what stopped me was my lack of understanding of the different semantics of the 3 kinds of empty Docs.
I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.
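
As a general illustration of the "accidentally quadratic" flavour (not the actual `reduceDoc` code): repeatedly appending onto a left-nested result retraverses the whole prefix every time, whereas building with a difference list keeps each append cheap:

```haskell
-- Each step retraverses the accumulated prefix: O(n^2) overall.
slowConcat :: [String] -> String
slowConcat = foldl (\acc s -> acc ++ s) ""

-- A difference list of Strings: each append is O(1) and the result is
-- materialised only once at the end.
fastConcat :: [String] -> String
fastConcat chunks = foldr (\s k -> (s ++) . k) id chunks ""
```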

Cheers,
Bartosz


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey Bartosz,

thanks for that, looks like I won the jackpot today ;)

Sounds like your notes are an excellent starting point. I will try to get a better understanding of that module myself, to see if we can reap some perf win here, but the fact that you have “been there before” suggests this is a promising path to take.

Cheers!

Alfredo

On 18 April 2017 at 13:01, Bartosz Nitka <[hidden email]> wrote:
Hey Alfredo,

Thanks for taking a look at this. I've taken a look at this before and collected some notes here: https://github.com/niteria/notes/blob/master/PrettyPrintingGHC.md#problems
Based on my investigations I concluded that there are places where pretty-printing Asm and Cmm gets accidentally quadratic.
If I remember correctly the core of the problem was that even in the "fast" mode the pretty printer was computing the lengths of subtrees in `reduceDoc`, which in some cases made the operations quadratic.
I've tried implementing a "just append bytes"-mode without `reduceDoc`, but what stopped me was my lack of understanding of different semantics for 3 kinds of empty Doc's.
I think if you took the time to understand it, you could implement it in a way that's not quadratic, reaping a substantial performance win.

Cheers,
Bartosz

2017-04-18 11:17 GMT+01:00 Alfredo Di Napoli <[hidden email]>:
Hey Simon,

thanks for chiming in. I had the same suspicion as well, and in fact my first temptation was to use dlists to avoid costly concatenation (I guess a Builder shares pretty much the same idea, which is to avoid costly appends). I eventually bailed out because that Pretty module had some functions like sepX or fillNBE (I might be wrong, going by memory only here) which needed to *inspect* the current [Doc] we were carrying around, and thus dlists couldn’t accomplish this in O(1); forcing them back to a list was necessary. Am I right in thinking that using a Builder here would suffer from the same malady? Ideally I would like constant time for appends, left concatenation and inspection (as in pattern-matching inspection), but I don’t think what I’m asking for exists, at least not in a purely functional incarnation anyway ;)

That’s why I was thinking of giving Deques (or more generally, Sequences) a go: they don’t have O(1) concatenation, but at least they can be inspected in O(1). The question to answer is whether this would yield an improvement or not, which I guess depends upon the ratio of inspection to appending we do in the pretty printing; if appending dominates, using dlists or builders might be better. Last but not least, although the current implementation doesn’t backtrack anyway, I’m wondering whether switching to a different representation for a Doc (namely a huge function composition of Token(s), as described in the paper) could be beneficial as well.

Do you guys have any opinions? Yesterday I extracted Pretty.hs from the source tree and I’m now planning to produce a criterion benchmark and compare different implementations, although it’s a bit hard to predict real-world usage: I don’t have access to a representative Doc as produced by GHC, so my benchmarks could all be ill-founded.
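
For illustration, here is a tiny sketch of the trade-off using only standard-library types (it is not GHC’s `Pretty` representation): a difference list gives O(1) append but no pattern matching without first rebuilding an ordinary list, while `Data.Sequence` offers O(1) views of either end at the price of O(log) concatenation.

```
import           Data.Sequence (Seq, ViewL (..), viewl, (|>))
import qualified Data.Sequence as Seq

-- Difference lists: O(1) snoc/append.
newtype DList a = DList ([a] -> [a])

dEmpty :: DList a
dEmpty = DList id

dSnoc :: DList a -> a -> DList a
dSnoc (DList f) x = DList (f . (x :))     -- O(1)

dToList :: DList a -> [a]
dToList (DList f) = f []                  -- rebuild before any pattern match

dUncons :: DList a -> Maybe (a, [a])
dUncons d = case dToList d of             -- inspection goes via the list
  []       -> Nothing
  (x : xs) -> Just (x, xs)

-- Data.Sequence: O(1) snoc and O(1) views at both ends,
-- O(log(min(m, n))) concatenation.
sUncons :: Seq a -> Maybe (a, Seq a)
sUncons s = case viewl s of
  EmptyL  -> Nothing
  x :< xs -> Just (x, xs)

main :: IO ()
main = do
  let d = foldl dSnoc dEmpty "doc"
      s = foldl (|>) Seq.empty "doc"
  print (fst <$> dUncons d, fst <$> sUncons s)   -- (Just 'd', Just 'd')
```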

Alfredo 

On 18 April 2017 at 12:01, Simon Marlow <[hidden email]> wrote:
Pretty-printing the asm is a likely contender for optimisation, however the problem is not the pretty-printing per se.  We don't actually use any of the backtracking stuff when printing asm, since there's no point nicely indenting things or wrapping lines.  The overhead is probably all in the layer of data structure that we generate in Pretty before it gets dumped into raw bytes.  Using a ByteString Builder instead might yield some improvement.
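
As a rough sketch of that idea (the instruction type below is made up for illustration and is not GHC’s Cmm/NCG representation), the point is simply that a Builder appends in O(1) and is flushed straight to a handle, with no intermediate Doc:

```
import Data.ByteString.Builder (Builder, charUtf8, hPutBuilder, stringUtf8)
import System.IO (stdout)

-- Made-up instruction type, purely for illustration.
data Instr = Label String | Op String [String]

instr :: Instr -> Builder
instr (Label l)    = stringUtf8 l <> charUtf8 ':' <> charUtf8 '\n'
instr (Op op args) =
  charUtf8 '\t' <> stringUtf8 op <> charUtf8 ' ' <> commaSep args <> charUtf8 '\n'
  where
    commaSep []       = mempty
    commaSep [x]      = stringUtf8 x
    commaSep (x : xs) = stringUtf8 x <> stringUtf8 ", " <> commaSep xs

-- Builders append in O(1) and are written straight to the handle;
-- there is no intermediate Doc and no layout pass.
render :: [Instr] -> Builder
render = foldMap instr

main :: IO ()
main = hPutBuilder stdout $
  render [Label "main", Op "movq" ["$1", "%rax"], Op "ret" []]
```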

Cheers
Simon

On 17 April 2017 at 18:44, Alfredo Di Napoli <[hidden email]> wrote:
Dear all,

after sprinkling (ehm, littering) GHC source code with cost centres, I was not surprised to see that roughly 20% of the compilation time (as in .prof) was spent in the core gen/simplification process (10% of the total time) and on the asm code gen (another 10%).

I have almost immediately abandoned the idea of try optimising some modules in simplCore (considering my 0-knowledge of GHC internals anyway..) but I have been dwelling on the following: Outputable.hs and Pretty.hs seems to be have been implemented making deliberate use of lists and concatenations between them, which left me wondering if there was room for optimisation there. I have found this interesting paper on the topic:


Now, it’s totally possible that this has already been tried (with no success), but judging from the original copyright of Pretty.hs (dated 2001), it seems it was written prior to the work of Olaf Chitil (the author of the paper).

TL;DR I was thinking (even just as a fun exercise to learn more about GHC internals) of leveraging the ideas of that paper and switching to a different implementation of `Doc`, coupled with the use of lazy dequeues, which *might* improve the performance of the codegen and thus of the compiler overall. Am I fighting a strawman (or flogging a dead horse, pick your rhetorical figure :D ), or is there even a tiny chance of this being actually useful?
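
Very roughly, and only for the degenerate infinite-width case with no grouping or fitting, the token idea looks something like the following sketch (illustrative only, not a drop-in replacement for GHC’s `Doc`):

```
-- A document is a function over a token stream; append is function
-- composition, so it is O(1) and fully lazy.
newtype TDoc = TDoc ([Token] -> [Token])

data Token = TText String | TLine

text :: String -> TDoc
text s = TDoc (TText s :)

line :: TDoc
line = TDoc (TLine :)

(<+>) :: TDoc -> TDoc -> TDoc
TDoc f <+> TDoc g = TDoc (f . g)

-- With an infinite band there is nothing to fit: rendering is a single
-- left-to-right pass over the tokens.
renderInfinite :: TDoc -> String
renderInfinite (TDoc f) = concatMap tok (f [])
  where
    tok (TText s) = s
    tok TLine     = "\n"

main :: IO ()
main = putStr $ renderInfinite $
  text "main:" <+> line <+> text "\tmovq $1, %rax" <+> line <+> text "\tret" <+> line
```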

Have a nice evening,

Alfredo

On 11 April 2017 at 00:47, Ben Gamari <[hidden email]> wrote:
Alfredo Di Napoli <[hidden email]> writes:

> Hey Ben,
>
Hi Alfredo,

Sorry for the late response! The email queue from the weekend was a bit
longer than I would like.

> as promised I’m back to you with something more articulated and hopefully
> meaningful. I do hear you perfectly — probably trying to dive head-first
> into this without at least a rough understanding of the performance
> hotspots or the GHC overall architecture is going to do me more harm than
> good (I get the overall picture and I’m aware of the different stages of
> the GHC compilation pipeline, but it’s far from saying I’m proficient with
> the architecture as whole). I have also read a couple of years ago the GHC
> chapter in the “Architecture of Open Source Applications” book, but I don’t
> know how much that is still relevant. If it is, I guess I should refresh my
> memory.
>
It sounds like you have done a good amount of reading. That's great.
Perhaps skimming the AOSA chapter again wouldn't hurt, but otherwise
it's likely worthwhile diving in.

> I’m currently trying to move on 2 fronts — please advise if I’m a fool
> flogging a dead horse or if I have any hope of getting anything done ;)
>
> 1. I’m trying to treat indeed the compiler as a black block (as you
> adviced) trying to build a sufficiently large program where GHC is not “as
> fast as I would like” (I know that’s a very lame definition of “slow”,
> hehe). In particular, I have built the stage2 compiler with the “prof”
> flavour as you suggested, and I have chosen 2 examples as a reference
> “benchmark” for performance; DynFlags.hs (which seems to have been
> mentioned multiple times as a GHC perf killer) and the highlighting-kate
> package as posted here: https://ghc.haskell.org/trac/ghc/ticket/9221 .

Indeed, #9221 would be a very interesting ticket to look at. The
highlighting-kate package is interesting in the context of that ticket
as it has a very large amount of parallelism available.

If you do want to look at #9221, note that the cost centre profiler may
not provide the whole story. In particular, it has been speculated that
the scaling issues may be due to either,

 * threads hitting a blackhole, resulting in blocking

 * the usual scaling limitations of GHC's stop-the-world GC

The eventlog may be quite useful for characterising these.

> The idea would be to compile those with -v +RTS -p -hc -RTS enabled,
> look at the output from the .prof file AND the `-v` flag, find any
> hotspot, try to change something, recompile, observe diff, rinse and
> repeat. Do you think I have any hope of making progress this way? In
> particular, I think compiling DynFlags.hs is a bit of a dead-end; I
> whipped up this buggy script which
> escalated into a Behemoth which is compiling pretty much half of the
> compiler once again :D
>
> ```
> #!/usr/bin/env bash
>
> ../ghc/inplace/bin/ghc-stage2 --make -j8 -v +RTS -A256M -qb0 -p -h \
>   -RTS -DSTAGE=2 -I../ghc/includes -I../ghc/compiler -I../ghc/compiler/stage2 \
>   -I../ghc/compiler/stage2/build \
>   -i../ghc/compiler/utils:../ghc/compiler/types:../ghc/compiler/typecheck:../ghc/compiler/basicTypes \
>   -i../ghc/compiler/main:../ghc/compiler/profiling:../ghc/compiler/coreSyn:../ghc/compiler/iface:../ghc/compiler/prelude \
>   -i../ghc/compiler/stage2/build:../ghc/compiler/simplStg:../ghc/compiler/cmm:../ghc/compiler/parser:../ghc/compiler/hsSyn \
>   -i../ghc/compiler/ghci:../ghc/compiler/deSugar:../ghc/compiler/simplCore:../ghc/compiler/specialise \
>   -fforce-recomp -c $@
> ```
>
> I’m running it with `./dynflags.sh ../ghc/compiler/main/DynFlags.hs` but
> it’s taking a lot to compile (20+ mins on my 2014 mac Pro) because it’s
> pulling in half of the compiler anyway :D I tried to reuse the .hi files
> from my stage2 compilation but I failed (GHC was complaining about
> interface file mismatch). Short story short, I don’t think it will be a
> very agile way to proceed. Am I right? Do you have any recommendation in
> such sense? Do I have any hope to compile DynFlags.hs in a way which would
> make this perf investigation feasible?
>
What I usually do in this case is just take the relevant `ghc` command
line directly from the `make` output and execute it manually. I would
imagine your debug cycle would look something like,

 * instrument the compiler
 * build stage1
 * build the stage2 DynFlags with the stage1 compiler (reusing the command line saved from the make output)
 * think
 * repeat

This should only take a few minutes per iteration.

> The second example (the highlighting-kate package) seems much more
> promising. It takes maybe 1-2 mins on my machine, which is enough to take a
> look at the perf output. Do you think I should follow this second lead? In
> principle any 50+ modules package I think would do (better if with a lot of
> TH ;) ) but this seems like a low-entry barrier start.
>
> 2. The second path I’m exploring is simply to take a less holistic approach
> and try to dive into a performance ticket like the ones listed here:
> https://www.reddit.com/r/haskell/comments/45q90s/is_anything_being_done_to_remedy_the_soul/czzq6an/
> Maybe some are very specific, but it seems like fixing small things and
> moving forward could help give me an understanding of different sub-parts of
> GHC, which seems less intimidating than the black-box approach.
>
Do you have any specific tickets from these lists that you found
interesting?

> In conclusion, what do you think is the best approach, 1 or 2, both or
> none? ;)

I would say that it largely depends upon what you feel most comfortable
with. If you feel up for it, I think #9221 would be a nice, fairly
self-contained, yet high-impact ticket which would be worth spending a
few days diving further into.

Cheers,

- Ben



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Ben Gamari-2
In reply to this post by Alfredo Di Napoli
Alfredo Di Napoli <[hidden email]> writes:

> Hey Simon,
>
> thanks for chiming in. I had the same suspicion as well, and in fact my first
> temptation was to use dlists to avoid costly concatenation (I guess a
> Builder shares pretty much the same idea, which is to avoid right-appends). I
> eventually bailed out as that Pretty module had some functions like sepX or
> fillNBE (I might be wrong, going by memory only here) which needed to
> *inspect* the current [Doc] we were carrying around, and thus dlists
> couldn’t accomplish this in O(1), but forcing it back to a list was
> necessary. Am I right in thinking that using a Builder here will suffer the
> same malady? Ideally I would like constant time for both appends, left
> concat & inspection (as in pattern-matching inspection) but I don’t think
> what I’m asking exists, at least not in its functional incarnation anyway ;)
>
> That’s why I was thinking to give Deques (or more generally, Sequences) a
> go: they don’t have O(1) appends but at least can be inspected in O(1).
> Question has to be answered whether this will yield improvement or not,
> which I guess depends upon the ratio of inspection / appending we do in the
> pretty printing. In that case using dlists or builders might be better.
> Last but not least, although the current implementation doesn’t backtrack
> anyway, I’m wondering whether switching to a different representation for a
> Doc (namely a huge function composition of Token(s), as described in the
> paper) could be beneficial as well.
>
> Do you guys have any opinions? Yesterday I extracted Pretty.hs from the
> sourcetree and I’m now planning to produce a criterion benchmark and
> compare different implementations, although it’s a bit hard to predict the
> real world usage as I don’t have access to a representative Doc document as
> produced by GHC, so my benchmarks could all be ill-founded.
>
Note that GHC's `Pretty` module is just a slightly modified version of
the `pretty` package. The differences are fairly minimal (although
important for performance):

 * It uses FastString in place of String, giving us fast `length` (in
   https://ghc.haskell.org/trac/ghc/ticket/8809#comment:60
   I propose that we extend `pretty` with a typeclass for string types)

 * GHC's variant still has a known stack overflow bug that was fixed
   upstream. Unfortunately, compiler performance regressed when we
   attempted to port upstream's patch (#10735)

Ideally we would fix these and just use the `pretty` library itself.
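
As an aside, here is a minimal sketch of what the string-type class mentioned in the first bullet might look like; this is only one possible reading of the idea, not the actual proposal in #8809:

```
{-# LANGUAGE FlexibleInstances #-}

-- One possible shape for a class abstracting over the string type a
-- pretty-printer carries around (illustrative only).
class DocString s where
  fromStr   :: String -> s
  strLength :: s -> Int

-- Plain String: length is O(n), the cost GHC's FastString avoids.
instance DocString String where
  fromStr   = id
  strLength = length

-- A toy string that caches its length at construction, mimicking what
-- makes FastString's length O(1).
data SizedStr = SizedStr !Int String

instance DocString SizedStr where
  fromStr s                = SizedStr (length s) s
  strLength (SizedStr n _) = n

main :: IO ()
main = print (strLength (fromStr "pretty" :: SizedStr))
```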

In addition to these issues, it would be quite helpful if `pretty`
gained a special-case for the infinite band-width case (which is what we
use in the code generator). The point here is that we should need to do
no layout computation in the infinite band case: merely place line
breaks precisely where the user asks. This should result in a noticeable
improvement in code generation performance (IIRC Bartosz noted rather
significant amounts of time spent pretty-printing assembler).
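
To illustrate, a fast path for the infinite band could look roughly like the sketch below; the `Doc` type is a stand-in (pretty's real constructors are not exported), so this is a shape, not a patch:

```
-- Stand-in Doc type, purely for illustration of the fast path.
data Doc
  = Empty
  | Txt String
  | Line
  | Doc :<> Doc
  | Nest Int Doc

-- With an infinite band there is no fitting to do: one pass, no width
-- bookkeeping. (Indentation is simply dropped here, roughly what
-- pretty's LeftMode does; the asm printer does not rely on it.)
renderInfiniteBand :: Doc -> ShowS
renderInfiniteBand Empty      = id
renderInfiniteBand (Txt s)    = showString s
renderInfiniteBand Line       = showString "\n"
renderInfiniteBand (a :<> b)  = renderInfiniteBand a . renderInfiniteBand b
renderInfiniteBand (Nest _ d) = renderInfiniteBand d

main :: IO ()
main = putStr $
  renderInfiniteBand (Txt "pushq %rbp" :<> Line :<> Txt "ret" :<> Line) ""
```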

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Alfredo Di Napoli
Hey Ben,

thanks for that. I didn’t realise I had opened pretty much a Pandora’s box, hehe. I have now found (whilst reading Bartosz’s notes) the numerous tickets & discussions around the library, and what you say in this email neatly summarises the crux of the problem, as far as I can see. Rather than saying silly things, I would rather spend a bit of time reading everything that has been written by folks way smarter than me on the subject and get back to you later ;)

Alfredo

On 18 April 2017 at 14:21, Ben Gamari <[hidden email]> wrote:
Alfredo Di Napoli <[hidden email]> writes:

> Hey Simon,
>
> thanks for chiming in. I had the same suspicion as well, and in fact my first
> temptation was to use dlists to avoid costly concatenation (I guess a
> Builder shares pretty much the same idea, which is to avoid right-appends). I
> eventually bailed out as that Pretty module had some functions like sepX or
> fillNBE (I might be wrong, going by memory only here) which needed to
> *inspect* the current [Doc] we were carrying around, and thus dlists
> couldn’t accomplish this in O(1), but forcing it back to a list was
> necessary. Am I right in thinking that using a Builder here will suffer the
> same malady? Ideally I would like constant time for both appends, left
> concat & inspection (as in pattern-matching inspection) but I don’t think
> what I’m asking exists, at least not in its functional incarnation anyway ;)
>
> That’s why I was thinking to give Deques (or more generally, Sequences) a
> go: they don’t have O(1) appends but at least can be inspected in O(1).
> Question has to be answered whether this will yield improvement or not,
> which I guess depends upon the ratio of inspection / appending we do in the
> pretty printing. In that case using dlists or builders might be better.
> Last but not least, although the current implementation doesn’t backtrack
> anyway, I’m wondering whether switching to a different representation for a
> Doc (namely a huge function composition of Token(s), as described in the
> paper) could be beneficial as well.
>
> Do you guys have any opinions? Yesterday I extracted Pretty.hs from the
> sourcetree and I’m now planning to produce a criterion benchmark and
> compare different implementations, although it’s a bit hard to predict the
> real world usage as I don’t have access to a representative Doc document as
> produced by GHC, so my benchmarks could all be ill-founded.
>
Note that GHC's `Pretty` module is just a slightly modified version of
the `pretty` package. The differences are fairly minimal (although
important for performance):

 * It uses FastString in place of String, giving us fast `length` (in
   https://ghc.haskell.org/trac/ghc/ticket/8809#comment:60
   I propose that we extend `pretty` with a typeclass for string types)

 * GHC's variant still has a known stack overflow bug that was fixed
   upstream. Unfortunately, compiler performance regressed when we
   attempted to port upstream's patch (#10735)

Ideally we would fix these and just use the `pretty` library itself.

In addition to these issues, it would be quite helpful if `pretty`
gained a special-case for the infinite band-width case (which is what we
use in the code generator). The point here is that we should need to do
no layout computation in the infinite band case: merely place line
breaks precisely where the user asks. This should result in a noticeable
improvement in code generation performance (IIRC Bartosz noted rather
significant amounts of time spent pretty-printing assembler).

Cheers,

- Ben


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Where do I start if I would like help improve GHC compilation times?

Johannes Waldmann-2
In reply to this post by Alfredo Di Napoli
> it would be quite helpful if `pretty`
> gained a special-case for the infinite band-width case

Absolutely +1 this.

Not "pretty" but similar:
https://github.com/jwaldmann/wl-pprint-text/commit/7843d96db57ab01c0be336a5be0d7bca61902005

- J.W.
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs