performance regressions

performance regressions

Herbert Valerio Riedel-3
On 2014-12-16 at 22:45:36 +0100, Richard Eisenberg wrote:
> I've learned several very interesting things in this analysis.
>
> - Inlining polymorphic methods is very important.

otoh, there are cases where marking methods INLINE has catastrophic
effects; the following

  https://github.com/kolmodin/binary/commit/48c02966512a67b018fcdf093fab8d34bddf9a69

was necessary a few months ago, as otherwise GHC HEAD's compile-time
memory usage would explode:

  https://ghc.haskell.org/trac/ghc/ticket/9630
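
The failure mode Herbert describes can be sketched in miniature (the class, instance, and method below are hypothetical stand-ins, not the actual binary code):

```haskell
module Main where

-- A class method whose body is large in real code. With the INLINE
-- pragma, GHC copies the body into every call site in every client
-- module; for a big body that can multiply compile-time memory, which
-- is why deleting such pragmas was the fix in the commit above.
class Serialize a where
  put :: a -> [Int]

instance Serialize Int where
  put n = [n, n + 1, n + 2]   -- stands in for a large method body
  {-# INLINE put #-}          -- the kind of pragma that had to go

main :: IO ()
main = print (put (1 :: Int))
```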



performance regressions

Richard Eisenberg-2
In reply to this post by Simon Peyton Jones
I've gained a little more insight here, but only a little.

INLINE doesn't work because zipWithAndUnzipM is recursive. Oddly, using worker/wrapper made allocations go *up*, so that's somehow not the answer.

When I use `inline`, GHC will inline the function once; recursive calls remain. This seems to be the local minimum in allocations.
Using `inline` gives roughly a 7% change in allocations on this pathological case.
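
The call-site trick can be sketched as follows. The definition mirrors zipWithAndUnzipM roughly as it appears in GHC's MonadUtils, and `inline` comes from GHC.Exts; the example monad and arguments are made up for illustration:

```haskell
module Main where

import GHC.Exts (inline)

-- zipWithAndUnzipM, roughly as in GHC's MonadUtils. An INLINE pragma
-- on it is ignored: GHC will not inline a self-recursive binding.
zipWithAndUnzipM :: Monad m
                 => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c], [d])
zipWithAndUnzipM f (x:xs) (y:ys) = do
  (c, d)   <- f x y
  (cs, ds) <- zipWithAndUnzipM f xs ys
  return (c : cs, d : ds)
zipWithAndUnzipM _ _ _ = return ([], [])

main :: IO ()
main = do
  -- `inline` forces one unfolding at this call site; the recursive
  -- calls inside the unfolded body remain ordinary calls.
  r <- inline zipWithAndUnzipM (\a b -> return (a + b, a * b)) [1, 2] [3, 4]
  print (r :: ([Int], [Int]))
```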

To reproduce, just compile HEAD's TcFlatten with and without the call to `inline`. You can see, using -ddump-simpl, what GHC is doing. Then, you can check the allocation numbers by compiling perf/compiler/T9872a.hs. My tests were all using the default build settings, with an unmodified build.mk, as that seemed to be the most performant setting.

In other news, a slight change to the algorithm (see #9872, comment 23) makes a 50% improvement. Will hopefully commit tonight, after a full local validation run.

Richard

On Dec 17, 2014, at 10:59 AM, Simon Peyton Jones <simonpj at microsoft.com> wrote:

> I still would like to understand why INLINE does not make it inline. That's weird.
>
> E.g., a way to reproduce.
>
> Simon
>
> |  -----Original Message-----
> |  From: Richard Eisenberg [mailto:eir at cis.upenn.edu]
> |  Sent: 17 December 2014 15:56
> |  To: Simon Peyton Jones
> |  Cc: Joachim Breitner; ghc-devs at haskell.org
> |  Subject: Re: performance regressions
> |  
> |  My unsubstantiated guess is that INLINEABLE would have the same effect
> |  as INLINE here, as GHC doesn't see fit to actually inline the
> |  function, even with INLINE -- the big improvement seen between (1) and
> |  (2) is actually specialization, not inlining. The jump from (2) to (3)
> |  is actual inlining. Thus, it seems that GHC's heuristics for inlining
> |  aren't working out for the best here.
> |  
> |  I've pushed my changes, though I agree with Simon that more research
> |  may uncover even more improvements here. I didn't focus on the number
> |  of calls because that number didn't regress. Will look into this soon.
> |  
> |  Richard
> |  
> |  On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones
> |  <simonpj at microsoft.com> wrote:
> |  
> |  > If you use INLINEABLE, that should make the function specialisable
> |  to a particular monad, even if it's in a different module. You
> |  shouldn't need INLINE for that.
> |  >
> |  > I don't understand the difference between cases (2) and (3).
> |  >
> |  > I am still suspicious of why there are so many calls to this one
> |  function that it, alone, is allocating a significant proportion of
> |  compilation of the entire run of GHC.  Are you sure there isn't an
> |  algorithmic improvement to be had, to simply reduce the number of
> |  calls?
> |  >
> |  > Simon
> |  >
> |  > |  -----Original Message-----
> |  > |  From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On Behalf Of
> |  > | Richard Eisenberg
> |  > |  Sent: 16 December 2014 21:46
> |  > |  To: Joachim Breitner
> |  > |  Cc: ghc-devs at haskell.org
> |  > |  Subject: Re: performance regressions
> |  > |
> |  > |  I've learned several very interesting things in this analysis.
> |  > |
> |  > |  - Inlining polymorphic methods is very important. Here are some
> |  > |  data points to back up that claim:
> |  > |     * Original implementation using zipWithAndUnzipM:
> |  > |       8,472,613,440 bytes allocated in the heap
> |  > |     * Adding {-# INLINE #-} to the definition thereof:
> |  > |       6,639,253,488 bytes allocated in the heap
> |  > |     * Using `inline` at call site to force inlining:
> |  > |       6,281,539,792 bytes allocated in the heap
> |  > |
> |  > |  The middle step above allowed GHC to specialize zipWithAndUnzipM
> |  > |  to my particular monad, but GHC didn't see fit to actually inline
> |  > |  the function. Using `inline` forced it, to good effect. (I did not
> |  > |  collect data on code sizes, but it wouldn't be hard to.)
> |  > |
> |  > |  By comparison:
> |  > |     * Hand-written recursion: 6,587,809,112 bytes allocated in
> |  > |       the heap
> |  > |  Interestingly, this is *not* the best result!
> |  > |
> |  > |  Conclusion: We should probably add INLINE pragmas to Util and
> |  > |  MonadUtils.
> |  > |
> |  > |
> |  > |  - I then looked at rejiggering the algorithm to keep the common
> |  > |  case fast. This had a side effect of changing the zipWithAndUnzipM
> |  > |  to mapAndUnzipM, from Control.Monad. To my surprise, this brought
> |  > |  disaster!
> |  > |     * Using `inline` and mapAndUnzipM: 7,463,047,432 bytes
> |  > |       allocated in the heap
> |  > |     * Hand-written recursion: 5,848,602,848 bytes
> |  > |       allocated in the heap
> |  > |
> |  > |  That last number is better than the numbers above because of the
> |  > |  algorithm streamlining. But the inadequacy of mapAndUnzipM
> |  > |  surprised me -- it already has an INLINE pragma in Control.Monad,
> |  > |  of course. Looking at -ddump-simpl, it seems that mapAndUnzipM was
> |  > |  indeed getting inlined, but a call to `map` remained, perhaps
> |  > |  causing extra allocation.
> |  > |
> |  > |  Conclusion: We should examine the implementation of mapAndUnzipM
> |  > |  (and similar functions) in Control.Monad. Is it as fast as
> |  > |  possible?
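
A hand-written recursion of the kind compared against mapAndUnzipM above can be sketched like this (the primed name is hypothetical, not code from GHC or base):

```haskell
module Main where

import Control.Monad (mapAndUnzipM)

-- A hand-written mapAndUnzipM, recursing directly and building both
-- result lists as it goes, rather than going through `map`/`unzip`
-- with an intermediate list of pairs.
mapAndUnzipM' :: Monad m => (a -> m (b, c)) -> [a] -> m ([b], [c])
mapAndUnzipM' _ []       = return ([], [])
mapAndUnzipM' f (x : xs) = do
  (b, c)   <- f x
  (bs, cs) <- mapAndUnzipM' f xs
  return (b : bs, c : cs)

main :: IO ()
main = do
  r1 <- mapAndUnzipM  (\x -> return (x, x * 2)) [1, 2, 3 :: Int]
  r2 <- mapAndUnzipM' (\x -> return (x, x * 2)) [1, 2, 3 :: Int]
  print (r1 == r2)   -- both give ([1,2,3],[2,4,6])
```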
> |  > |
> |  > |
> |  > |
> |  > |  In the end, I was unable to bring the allocation numbers down to
> |  > |  where they were before my work. This is because the flattener now
> |  > |  deals in roles. Most of its behavior is the same between nominal
> |  > |  and representational roles, so it seems silly (though very
> |  > |  possible) to specialize the code to nominal to keep that path
> |  > |  fast. Instead, I identified one key spot and made that go fast.
> |  > |
> |  > |  Thus, there is a 7% bump to memory usage on very-type-family-heavy
> |  > |  code, compared to before my commit on Friday. (On more ordinary
> |  > |  code, there is no noticeable change.)
> |  > |
> |  > |  Validating my patch locally now; will push when that's done.
> |  > |
> |  > |  Thanks,
> |  > |  Richard
> |  > |
> |  > |  On Dec 16, 2014, at 10:41 AM, Joachim Breitner
> |  > |  <mail at joachim-breitner.de> wrote:
> |  > |
> |  > |  > Hi,
> |  > |  >
> |  > |  >
> |  > |  > On Tuesday, 16 Dec 2014 at 09:59 -0500, Richard Eisenberg wrote:
> |  > |  >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner
> |  > |  >> <mail at joachim-breitner.de> wrote:
> |  > |  >>
> |  > |  >>> another guess (without looking at the code, sorry): Are they
> |  > |  >>> in the same module? I.e., can GHC specialize the code to your
> |  > |  >>> particular Monad?
> |  > |  >
> |  > |  >> No, they're not in the same module. I could also try moving
> |  > |  >> the zipWithAndUnzipM function to the same module, and even
> |  > |  >> specializing it by hand to the right monad.
> |  > |  >
> |  > |  > I did mean zipWithAndUnzipM, so maybe yes: Try that.
> |  > |  >
> |  > |  > (I find it hard to believe that any polymorphic monadic code
> |  > |  > should perform well, with those many calls to an unknown (>>=)
> |  > |  > with a function parameter, but maybe I'm too pessimistic here.)
> |  > |  >
> |  > |  >> Could that be preventing the fusing?
> |  > |  >
> |  > |  > There is not going to be any fusing here, at least not list
> |  > |  > fusion; that would require your code to be written in terms of
> |  > |  > functions with fusion rules.
> |  > |  >
> |  > |  > Greetings,
> |  > |  > Joachim
> |  > |  >
> |  > |  > --
> |  > |  > Joachim "nomeata" Breitner
> |  > |  >   mail at joachim-breitner.de * http://www.joachim-breitner.de/
> |  > |  >   Jabber: nomeata at joachim-breitner.de * GPG-Key: 0xF0FBF51F
> |  > |  >   Debian Developer: nomeata at debian.org
> |  > |  >
> |  > |  > _______________________________________________
> |  > |  > ghc-devs mailing list
> |  > |  > ghc-devs at haskell.org
> |  > |  > http://www.haskell.org/mailman/listinfo/ghc-devs
> |  > |
> |  >
>
>
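
Simon's point above about INLINEABLE enabling cross-module specialisation without full inlining can be sketched with a hypothetical helper (not the GHC function under discussion):

```haskell
module Main where

import Data.IORef

-- INLINABLE records the unfolding in the interface file, so a caller
-- in *another* module can have GHC specialise this function to its
-- concrete monad (here IO) without inlining the whole recursive body.
{-# INLINABLE countdownM #-}
countdownM :: Monad m => (Int -> m ()) -> Int -> m ()
countdownM step n
  | n <= 0    = return ()
  | otherwise = step n >> countdownM step (n - 1)

main :: IO ()
main = do
  ref <- newIORef []
  countdownM (\n -> modifyIORef ref (n :)) 3  -- runs step for 3, 2, 1
  readIORef ref >>= print
```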

