performance regressions


performance regressions

Richard Eisenberg-2
Hi devs,

Phab has shown up some performance regressions in my recent commits. See https://phabricator.haskell.org/harbormaster/build/2607/. The failures except for haddock.base are new, and evidently my fault. They didn't show up on Travis. Will look into it shortly, but I doubt over the weekend.

I don't think this should hold up the 7.10 fork, however. I have a suspicion as to what caused the problem -- the reimplementation of TcInteract.matchFam. That reimplementation was solely to reduce code repetition between TcInteract.matchFam and FamInstEnv.reduceTyFamApp_maybe. It can safely be rolled back -- just use the implementation of matchFam that existed before my commits.

I'll look into this Monday at the latest, but if a remedy is needed sooner, try the technique above.

Why would these failures show up on Phab but not on Travis?

Sorry about the bother!

Thanks,
Richard


performance regressions

Joachim Breitner-2
Hi,


On Friday, 12.12.2014, 21:51 -0500, Richard Eisenberg wrote:
>
> Phab has shown up some performance regressions in my recent commits.
> See https://phabricator.haskell.org/harbormaster/build/2607/. The
> failures except for haddock.base are new, and evidently my fault. They
> didn't show up on Travis. Will look into it shortly, but I doubt over
> the weekend.


ghcspeed also observes this:
http://ghcspeed-nomeata.rhcloud.com/changes/?rev=7256213843b80d75a86f033be77516a62d56044a&exe=2&env=johan%27s%20buildbot

Especially the T9872 benchmarks have a huge increase in allocations. But
you seem to be aware of this, so that's fine.

Greetings,
Joachim

--
Joachim 'nomeata' Breitner
  mail at joachim-breitner.de • http://www.joachim-breitner.de/
  Jabber: nomeata at joachim-breitner.de  • GPG-Key: 0xF0FBF51F
  Debian Developer: nomeata at debian.org



performance regressions

Richard Eisenberg-2
I think I've fixed this. I've pushed the fix to wip/rae, and waiting for validation results before pushing to master.

My hunch below was right -- it was the change to matchFam, which essentially evaluated type-level functions more strictly. I've now made it lazier again. I'd like to better understand the tradeoff here, and to see if there's a principled sweet spot. But that will happen in a few days.

Expect a push to master soon.

Again, sorry for the bother.

Richard

On Dec 13, 2014, at 8:32 AM, Joachim Breitner <mail at joachim-breitner.de> wrote:

> [...]



performance regressions

Richard Eisenberg-2
Fixed, hopefully!

On Dec 13, 2014, at 10:03 AM, Richard Eisenberg <eir at cis.upenn.edu> wrote:

> I think I've fixed this. I've pushed the fix to wip/rae, and waiting for validation results before pushing to master.
>
> My hunch below was right -- it was the change to matchFam, which essentially evaluated type-level functions more strictly. I've now made it lazier again. I'd like to better understand the tradeoff here, and to see if there's a principled sweet spot. But that will happen in a few days.
>
> Expect a push to master soon.
>
> Again, sorry for the bother.
>
> Richard
>
> On Dec 13, 2014, at 8:32 AM, Joachim Breitner <mail at joachim-breitner.de> wrote:
>
>> [...]



performance regressions

Herbert Valerio Riedel
Hello Richard,

Can you please push the fix ASAP to master? These performance failures
are causing distracting false alarms (in the form of validation failures)
on every pushed commit as well as on submitted code revisions.

Thanks,
  HVR
 
On 2014-12-13 at 16:55:40 +0100, Richard Eisenberg wrote:

> Fixed, hopefully!
>
> [...]



performance regressions

Joachim Breitner-2
In reply to this post by Richard Eisenberg-2
Hi,

On Saturday, 13.12.2014, 10:55 -0500, Richard Eisenberg wrote:
> Fixed, hopefully!

Mitigated, but still a regression:

http://ghcspeed-nomeata.rhcloud.com/timeline/?exe=2&base=2%2B68&ben=tests%2Falloc%2FT9872a&env=1&revs=50&equid=on#

Is that now a level that we'll have to live with, or is it still
unexpectedly high?

Greetings,
Joachim

--
Joachim 'nomeata' Breitner
  mail at joachim-breitner.de • http://www.joachim-breitner.de/
  Jabber: nomeata at joachim-breitner.de  • GPG-Key: 0xF0FBF51F
  Debian Developer: nomeata at debian.org



performance regressions

Richard Eisenberg-2
I pushed my supposed fix yesterday morning, as I emailed out the "Fixed, hopefully" note.

Of course, I now see that it wasn't a full fix.

This is all most assuredly my fault. However, I also feel that elements of the infrastructure failed me somewhat, making this error easier for me to commit:
- Travis has not picked up on these errors.
- Harbormaster has seemed unreliable, sometimes finding spurious compile-time errors and sometimes just failing to test code that it sees. For example, when I pushed Diff 1901 to D546, Harbormaster never even attempted a build. Likewise, when I pushed my "fix", commit 3ec9391711, Harbormaster skipped it, as can be seen here: https://phabricator.haskell.org/diffusion/GHC/

So, after pushing yesterday morning, I didn't get any email from Harbormaster saying that it failed, so I thought my fix was indeed a fix. Because my local build was a devel2 build (and I had only about 20 minutes to work), I was unable to test locally -- all I could tell was that my fix lowered the numbers (as verified by ghcspeed).

Having a weekend full of plans, there wasn't really any opportunity for me to do the work necessary to figure out what's going on. It will be first on my docket tomorrow.

I suppose one lesson here is that I shouldn't push anything at all non-trivial on a Friday afternoon. But I also hope we can either improve the infrastructure (of course, it's *much* better than it was months ago!) or have realistic expectations of what we can expect from the infrastructure (e.g., trust Harbormaster/Travis when seeking feedback, but always validate locally before actually pushing to master).

More tomorrow,
Richard

On Dec 14, 2014, at 3:47 PM, Joachim Breitner <mail at joachim-breitner.de> wrote:

> Mitigated, but still a regression:
>
> [...]


performance regressions

Joachim Breitner-2
Hi,

On Sunday, 14.12.2014, 22:37 -0500, Richard Eisenberg wrote:
> Of course, I now see that it wasn't a full fix.
>
> This is all most assuredly my fault.

I wouldn't call it a fault. Where wood is chopped, splinters must fall.
(Hopefully a correct translation of a German idiom.)

We don't _have_ to catch everything before it hits master; it is already
pretty good if we manage to catch and fix regressions later.

I guess we could use the known_broken feature of the test suite more
quickly, to signal that an issue is being worked on and to keep others
from stumbling over test suite failures.

> - Travis has not picked up on these errors.

Unfortunately, Travis has been slightly less useful for a few weeks due to
T5681 failing (possibly due to the use of LLVM 3.4), but I'm still
waiting for a reply on that issue.

But it wouldn't have helped you: Travis generally skips all performance
tests. These used to be far less reliable when I set up Travis (this got
better with the removal of some of the max_bytes_allocated tests, and
also thanks to Harbormaster and ghcspeed monitoring), and also because we
still barely make the 50-minute deadline on Travis.

So do not rely on Travis for performance measurements.
(This is documented on https://ghc.haskell.org/trac/ghc/wiki/Travis, but
it needs to become common knowledge.)


Greetings,
Joachim

--
Joachim 'nomeata' Breitner
  mail at joachim-breitner.de • http://www.joachim-breitner.de/
  Jabber: nomeata at joachim-breitner.de  • GPG-Key: 0xF0FBF51F
  Debian Developer: nomeata at debian.org



performance regressions

Ben Gamari-2
Joachim Breitner <nomeata at debian.org> writes:

> Hi,
>
> On Monday, 15.12.2014, 10:58 -0500, Ben Gamari wrote:
>> >> - Travis has not picked up on these errors.
>> >
>> > unfortunately, Travis has been slightly less useful for a few weeks due to
>> > T5681 failing (possibly due to the use of LLVM 3.4), but I'm still
>> > waiting for a reply on that issue.
>> >
>> You aren't looking for a response from me on this, are you? I just
>> checked and I don't seem to have any outstanding messages from you but
>> it's entirely possible I overlooked something.
>
> this is independent of our ARM issues, and I think a tad older; I did
> not direct it to anyone specific.
>
> But I guess you are likely a person who can tell what's wrong here:
>
> On Sunday, 30.11.2014, 20:01 +0100, Joachim Breitner wrote:
>> Compile failed (status 256) errors were:
>> /tmp/ghc16123_0/ghc16123_5.s: Assembler messages:
>>
>> /tmp/ghc16123_0/ghc16123_5.s:26:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `Main_zdwwork_info$def' {.text section}
>>
>> /tmp/ghc16123_0/ghc16123_5.s:46:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `Main_work_info$def' {.text section}
>>
>> /tmp/ghc16123_0/ghc16123_5.s:66:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `Main_main1_info$def' {.text section}
>>
>> /tmp/ghc16123_0/ghc16123_5.s:86:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `Main_main_info$def' {.text section}
>>
>> /tmp/ghc16123_0/ghc16123_5.s:106:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `Main_main2_info$def' {.text section}
>>
>> /tmp/ghc16123_0/ghc16123_5.s:126:0:
>>      Error: can't resolve `.rodata' {.rodata section} - `ZCMain_main_info$def' {.text section}
>>
>> *** unexpected failure for T5681(optllvm)
>>
>> https://s3.amazonaws.com/archive.travis-ci.org/jobs/42557559/log.txt
>>
>> Any ideas?
>
> Is it possible that this is due to the LLVM version used? Do we support 3.4
> in GHC HEAD?
>
>    Using LLVM tools
>       llc   : /usr/local/clang-3.4/bin/llc
>       opt   : /usr/local/clang-3.4/bin/opt
>
> (http://smart-cactus.org/~ben/posts/2014-11-28-state-of-llvm-backend.html
> does not talk about GHC HEAD explicitly. Should I look at the 7.10
> row? Does that mean that 3.4 is not supported? Shouldn't the build
> system, or at least the compiler, fail harder and more helpfully in
> this case?)
>
LLVM 3.4 appears to have an unfortunate behavior whereby it loses track
of which section symbols with Internal linkage belong to.  I haven't had a
chance to delve into this too deeply; however, given that both 3.3 and
3.5 behave as expected, I'm pretty sure this is a bug. There are a few
options here,
options here,

  a. Mark the `$def` symbols as ExternallyVisible, working around the
  issue at the expense of exposing internal symbols to the outside
  world.

  b. Mark LLVM 3.4 as unsupported

At the moment I'm leaning towards (b) since I haven't had a chance to
think through the implications of (a); if nothing else I suspect this
wouldn't help the DLL symbol table size issues on Windows. Giving up on
LLVM 3.4 might be unfortunate for a good number of users, however.

Ultimately this underlines the need to package LLVM with GHC.

Cheers,

- Ben




performance regressions

Jan Stolarek
> Using this combinator instead of writing the algorithm directly cost me 30%
> allocation overhead!
What does your algorithm look like when you write it directly? Something like this:

flatten_many fmode roles tys
 = unzip `liftM` mapM go (zip roles tys)
 where
   go (Nominal,ty)          = flatten_one (fmode { fe_eq_rel = NomEq }) ty
   go (Representational,ty) = flatten_one (fmode { fe_eq_rel = ReprEq }) ty
   go (Phantom, ty)         = -- See Note [Phantoms in the flattener]
                              return (ty, mkTcPhantomCo ty ty)

?

Maybe this has something to do with `zipWithAndUnzipM` not being tail-recursive, vs. the direct
version being able to fuse intermediate lists?

Janek


performance regressions

Joachim Breitner-2
In reply to this post by Ben Gamari-2
Hi,

On Monday, 15.12.2014, 23:48 -0500, Richard Eisenberg wrote:
> Can anyone tell me: why? Have I made some horrible mistake in the implementation?
> And, relatedly: how can I fix this? I want to learn from this experience how to avoid this problem next time...

another guess (without looking at the code, sorry): Are they in the same
module? I.e., can GHC specialize the code to your particular Monad?

Greetings,
Joachim

--
Joachim Breitner
  e-Mail: mail at joachim-breitner.de
  Homepage: http://www.joachim-breitner.de
  Jabber-ID: nomeata at joachim-breitner.de



performance regressions

Simon Peyton Jones
In reply to this post by Ben Gamari-2
|  Using this combinator instead of writing the algorithm directly cost
|  me 30% allocation overhead!

That seems surprising.  I'd build a profiled compiler (GhcProfiled=YES) and see what it says.  

If it increases allocation by 30% overall, there must be a LOT of calls to this function.  Should there be so many?

S
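For reference, the profiled-compiler workflow suggested above looks roughly like this. This is a sketch assuming the make-based GHC build tree of that era; the exact paths and the test file name are illustrative, not taken from the thread:

```shell
# In a GHC source checkout: ask the build system for a profiled stage-2
# compiler by setting GhcProfiled in the build configuration.
echo 'GhcProfiled = YES' >> mk/build.mk
./configure
make

# Run the profiled compiler on the offending test with cost-centre
# profiling enabled; inspect the resulting .prof file for the functions
# with the largest allocation counts.
inplace/bin/ghc-stage2 -O -fforce-recomp T9872a.hs +RTS -p -RTS
```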

|  -----Original Message-----
|  From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On Behalf Of
|  Richard Eisenberg
|  Sent: 16 December 2014 04:49
|  To: Ben Gamari
|  Cc: Joachim Breitner; ghc-devs at haskell.org
|  Subject: Re: performance regressions
|  
|  I've made progress, but still need some help.
|  
|  It turns out that a monadic combinator (that I wrote) is mostly
|  responsible:
|  
|  > zipWithAndUnzipM :: Monad m
|  >                  => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c],
|  [d])
|  > zipWithAndUnzipM f (x:xs) (y:ys)
|  >   = do { (c, d) <- f x y
|  >        ; (cs, ds) <- zipWithAndUnzipM f xs ys
|  >        ; return (c:cs, d:ds) }
|  > zipWithAndUnzipM _ _ _ = return ([], [])
|  >
|  
|  Using this combinator instead of writing the algorithm directly cost
|  me 30% allocation overhead!
|  
|  Can anyone tell me: why? Have I made some horrible mistake in the
|  implementation?
|  And, relatedly: how can I fix this? I want to learn from this
|  experience how to avoid this problem next time...
|  
|  Unfortunately, my commit causes 50% overhead, not 30%, so I'm not out
|  of the woods yet. Hopefully, another 20% of good news tomorrow.
|  
|  Thanks!
|  Richard
|  
|  On Dec 15, 2014, at 11:33 AM, Ben Gamari <ben at smart-cactus.org> wrote:
|  
|  > [...]


performance regressions

Richard Eisenberg-2
In reply to this post by Jan Stolarek

On Dec 16, 2014, at 2:59 AM, Jan Stolarek <jan.stolarek at p.lodz.pl> wrote:

> What does your algorithm look like when you write it directly? Something like this:
>
> flatten_many fmode roles tys
> = unzip `liftM` mapM go (zip roles tys)
> where
>   go (Nominal,ty)          = flatten_one (fmode { fe_eq_rel = NomEq }) ty
>   go (Representational,ty) = flatten_one (fmode { fe_eq_rel = ReprEq }) ty
>   go (Phantom, ty)         = -- See Note [Phantoms in the flattener]
>                              return (ty, mkTcPhantomCo ty ty)
>
> ?
>
> Maybe this has something to do with `zipWithAndUnzipM` not being tail-recursive vs. direct version
> being able to fuse intermediate lists?


My direct version is even uglier:

> flatten_many fmode roles tys
>   = go roles tys
>   where
>     go (Nominal:rs)          (ty:tys)
>       = do { (xi, co) <- flatten_one (setFEEqRel fmode NomEq)  ty
>            ; (xis, cos) <- go rs tys
>            ; return (xi:xis, co:cos) }
>     go (Representational:rs) (ty:tys)
>       = do { (xi, co) <- flatten_one (setFEEqRel fmode ReprEq) ty
>            ; (xis, cos) <- go rs tys
>            ; return (xi:xis, co:cos) }
>     go (Phantom:rs)          (ty:tys)
>       = do { (xis, cos) <- go rs tys
>            ; -- See Note [Phantoms in the flattener]
>              return (ty:xis, mkTcPhantomCo ty ty:cos) }
>     go _ _ = return ([], [])

I could refactor to make it better, but I would be worried that the version you wrote would suffer from the same problems as zipWithAndUnzipM. Will check to see, though.

On Dec 16, 2014, at 4:01 AM, Joachim Breitner <mail at joachim-breitner.de> wrote:

> another guess (without looking at the code, sorry): Are they in the same
> module? I.e., can GHC specialize the code to your particular Monad?

No, they're not in the same module. I could also try moving the zipWithAndUnzipM function to the same module, and even specializing it by hand to the right monad. Could that be preventing the fusing?


On Dec 16, 2014, at 8:49 AM, Simon Peyton Jones <simonpj at microsoft.com> wrote:

> That seems surprising.  I'd build a profiled compiler (GhcProfiled=YES) and see what it says.  
>
> If it increases allocation by 30% overall, there must be a LOT of calls to this function.  Should there be so many?

I've been working from a profiled compiler. That's how I found that this function was the culprit -- it certainly wasn't my first guess!

Yes, there are A LOT of calls: 7,106,808 to be exact. (The test case is perf/compiler/T9872a; the function is flatten_many). The number of calls doesn't vary from before my commit, though, so the raw number isn't the problem -- it's the allocation.


I'll try turning some of these knobs to see where the difference is.

Thanks,
Richard



performance regressions

Joachim Breitner-2
Hi,


On Tuesday, 16.12.2014, 09:59 -0500, Richard Eisenberg wrote:
> On Dec 16, 2014, at 4:01 AM, Joachim Breitner <mail at joachim-breitner.de> wrote:
>
> > another guess (without looking at the code, sorry): Are they in the same
> > module? I.e., can GHC specialize the code to your particular Monad?

> No, they're not in the same module. I could also try moving the
> zipWithAndUnzipM function to the same module, and even specializing it
> by hand to the right monad.

I did mean zipWithAndUnzipM, so maybe yes: try that.

(I find it hard to believe that any polymorphic monadic code can
perform well, with that many calls to an unknown (>>=) with a function
parameter, but maybe I'm too pessimistic here.)


>  Could that be preventing the fusing?

There is not going to be any fusing here, at least not list fusion; that
would require your code to be written in terms of functions with fusion
rules.

Greetings,
Joachim

--
Joachim 'nomeata' Breitner
  mail at joachim-breitner.de • http://www.joachim-breitner.de/
  Jabber: nomeata at joachim-breitner.de  • GPG-Key: 0xF0FBF51F
  Debian Developer: nomeata at debian.org



performance regressions

Richard Eisenberg-2
I've learned several very interesting things in this analysis.

- Inlining polymorphic methods is very important. Here are some data points to back up that claim:
   * Original implementation using zipWithAndUnzipM:    8,472,613,440 bytes allocated in the heap
   * Adding {-# INLINE #-} to the definition thereof:   6,639,253,488 bytes allocated in the heap
   * Using `inline` at call site to force inlining:     6,281,539,792 bytes allocated in the heap

The middle step above allowed GHC to specialize zipWithAndUnzipM to my particular monad, but GHC didn't see fit to actually inline the function. Using `inline` forced it, to good effect. (I did not collect data on code sizes, but it wouldn't be hard to.)

By comparison:
   * Hand-written recursion:    6,587,809,112 bytes allocated in the heap
Interestingly, this is *not* the best result!

Conclusion: We should probably add INLINE pragmas to Util and MonadUtils.
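As a sketch of that first conclusion (not the exact GHC patch), the combinator from earlier in the thread would carry an INLINE pragma like this; the pragma makes the unfolding available so GHC can specialize it to the calling monad, even when it chooses not to inline the recursive definition outright:

```haskell
-- Sketch: zipWithAndUnzipM as discussed in this thread, with the INLINE
-- pragma that the allocation measurements above argue for.  The pragma
-- exposes the unfolding for specialization; GHC will not fully inline a
-- self-recursive function on its own.
zipWithAndUnzipM :: Monad m
                 => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c], [d])
{-# INLINE zipWithAndUnzipM #-}
zipWithAndUnzipM f (x:xs) (y:ys)
  = do { (c, d)   <- f x y
       ; (cs, ds) <- zipWithAndUnzipM f xs ys
       ; return (c:cs, d:ds) }
zipWithAndUnzipM _ _ _ = return ([], [])
```

At a call site, `GHC.Exts.inline zipWithAndUnzipM f xs ys` additionally forces GHC to inline (the outer layer of) the definition even when its heuristics decline to, which is the "`inline` at call site" row in the numbers above.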


- I then looked at rejiggering the algorithm to keep the common case fast. This had a side effect of changing the zipWithAndUnzipM to mapAndUnzipM, from Control.Monad. To my surprise, this brought disaster!
   * Using `inline` and mapAndUnzipM:        7,463,047,432 bytes allocated in the heap
   * Hand-written recursion:                 5,848,602,848 bytes allocated in the heap

That last number is better than the numbers above because of the algorithm streamlining. But, the inadequacy of mapAndUnzipM surprised me -- it already has an INLINE pragma in Control.Monad of course. Looking at -ddump-simpl, it seems that mapAndUnzipM was indeed getting inlined, but a call to `map` remained, perhaps causing extra allocation.

Conclusion: We should examine the implementation of mapAndUnzipM (and similar functions) in Control.Monad. Is it as fast as possible?
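A direct-recursion variant of mapAndUnzipM, sketched here under a hypothetical name, avoids the residual `map` noted above by consuming the list in one pass; its behavior matches Control.Monad's mapAndUnzipM:

```haskell
-- Hypothetical sketch (the primed name is made up): mapAndUnzipM written
-- with direct recursion, so no intermediate list from `map` is built.
mapAndUnzipM' :: Monad m => (a -> m (b, c)) -> [a] -> m ([b], [c])
{-# INLINE mapAndUnzipM' #-}
mapAndUnzipM' f (x:xs)
  = do { (b, c)   <- f x
       ; (bs, cs) <- mapAndUnzipM' f xs
       ; return (b:bs, c:cs) }
mapAndUnzipM' _ [] = return ([], [])
```

Whether this actually beats the library definition would of course need the same allocation measurements as above.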



In the end, I was unable to bring the allocation numbers down to where they were before my work. This is because the flattener now deals in roles. Most of its behavior is the same between nominal and representational roles, so it seems silly (though very possible) to specialize the code to nominal to keep that path fast. Instead, I identified one key spot and made that go fast.

Thus, there is a 7% bump to memory usage on very-type-family-heavy code, compared to before my commit on Friday. (On more ordinary code, there is no noticeable change.)

Validating my patch locally now; will push when that's done.

Thanks,
Richard

On Dec 16, 2014, at 10:41 AM, Joachim Breitner <mail at joachim-breitner.de> wrote:

> [...]



performance regressions

Carter Schonwald
Would specialize pragmas also have these performance improvements, or is
inline needed?


On Tue, Dec 16, 2014 at 4:45 PM, Richard Eisenberg <eir at cis.upenn.edu>
wrote:

>
> I've learned several very interesting things in this analysis.
>
> - Inlining polymorphic methods is very important. Here are some data
> points to back up that claim:
>    * Original implementation using zipWithAndUnzipM:    8,472,613,440
> bytes allocated in the heap
>    * Adding {-# INLINE #-} to the definition thereof:   6,639,253,488
> bytes allocated in the heap
>    * Using `inline` at call site to force inlining:     6,281,539,792
> bytes allocated in the heap
>
> The middle step above allowed GHC to specialize zipWithAndUnzipM to my
> particular monad, but GHC didn't see fit to actually inline the function.
> Using `inline` forced it, to good effect. (I did not collect data on code
> sizes, but it wouldn't be hard to.)
>
> By comparison:
>    * Hand-written recursion:    6,587,809,112 bytes allocated in the heap
> Interestingly, this is *not* the best result!
>
> Conclusion: We should probably add INLINE pragmas to Util and MonadUtils.
>
>
> - I then looked at rejiggering the algorithm to keep the common case fast.
> This had a side effect of changing the zipWithAndUnzipM to mapAndUnzipM,
> from Control.Monad. To my surprise, this brought disaster!
>    * Using `inline` and mapAndUnzipM:        7,463,047,432 bytes allocated
> in the heap
>    * Hand-written recursion:                 5,848,602,848 bytes allocated
> in the heap
>
> That last number is better than the numbers above because of the algorithm
> streamlining. But, the inadequacy of mapAndUnzipM surprised me -- it
> already has an INLINE pragma in Control.Monad of course. Looking at
> -ddump-simpl, it seems that mapAndUnzipM was indeed getting inlined, but a
> call to `map` remained, perhaps causing extra allocation.
>
> Conclusion: We should examine the implementation of mapAndUnzipM (and
> similar functions) in Control.Monad. Is it as fast as possible?
>
>
>
> In the end, I was unable to bring the allocation numbers down to where
> they were before my work. This is because the flattener now deals in roles.
> Most of its behavior is the same between nominal and representational
> roles, so it seems silly (though very possible) to specialize the code to
> nominal to keep that path fast. Instead, I identified one key spot and made
> that go fast.
>
> Thus, there is a 7% bump to memory usage on very-type-family-heavy code,
> compared to before my commit on Friday. (On more ordinary code, there is no
> noticeable change.)
>
> Validating my patch locally now; will push when that's done.
>
> Thanks,
> Richard
>
> On Dec 16, 2014, at 10:41 AM, Joachim Breitner <mail at joachim-breitner.de>
> wrote:
>
> > Hi,
> >
> >
> > Am Dienstag, den 16.12.2014, 09:59 -0500 schrieb Richard Eisenberg:
> >> On Dec 16, 2014, at 4:01 AM, Joachim Breitner <mail at joachim-breitner.de>
> wrote:
> >>
> >>> another guess (without looking at the code, sorry): Are they in the
> same
> >>> module? I.e., can GHC specialize the code to your particular Monad?
> >
> >> No, they're not in the same module. I could also try moving the
> >> zipWithAndUnzipM function to the same module, and even specializing it
> >> by hand to the right monad.
> >
> > I did mean zipWithAndUnzipM, so maybe yes: Try that.
> >
> > (I find it hard to believe that any polymorphic monadic code should
> > perform well, with those many calls to an unknown (>>=) with a function
> > parameter, but maybe I'm too pessimistic here.)
> >
> >
> >> Could that be preventing the fusing?
> >
> > There is not going to be any fusing here, at least not list fusion; that
> > would require your code to be written in terms of functions with fusion
> > rules.
> >
> > Greetings,
> > Joachim
> >
> > --
> > Joachim "nomeata" Breitner
> >  mail at joachim-breitner.de * http://www.joachim-breitner.de/
> >  Jabber: nomeata at joachim-breitner.de  * GPG-Key: 0xF0FBF51F
> >  Debian Developer: nomeata at debian.org
> >
> > _______________________________________________
> > ghc-devs mailing list
> > ghc-devs at haskell.org
> > http://www.haskell.org/mailman/listinfo/ghc-devs
>
> _______________________________________________
> ghc-devs mailing list
> ghc-devs at haskell.org
> http://www.haskell.org/mailman/listinfo/ghc-devs
>
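The function behind these measurements is GHC's `zipWithAndUnzipM`. As a point of reference, here is a minimal sketch of it with the INLINEABLE pragma under discussion; this is a plausible reconstruction, not GHC's exact source, which lives in its utility modules.

```haskell
-- Sketch of zipWithAndUnzipM (a reconstruction; GHC's real definition
-- may differ in detail).
zipWithAndUnzipM :: Monad m
                 => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c], [d])
{-# INLINEABLE zipWithAndUnzipM #-}
-- INLINEABLE exposes the unfolding so call sites in other modules can
-- specialise the function to their concrete monad; {-# INLINE #-}
-- additionally asks GHC to inline it, which is what measurement (2)
-- in the quoted message corresponds to.
zipWithAndUnzipM f (x:xs) (y:ys) = do
  (c,  d)  <- f x y
  (cs, ds) <- zipWithAndUnzipM f xs ys
  return (c:cs, d:ds)
zipWithAndUnzipM _ _ _ = return ([], [])
```

With the pragma in place, a call in another module at a known monad (say IO) can get a monomorphic copy, avoiding unknown (>>=) calls and dictionary passing on the hot path.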


performance regressions

Simon Peyton Jones
In reply to this post by Richard Eisenberg-2
If you use INLINEABLE, that should make the function specialisable to a particular monad, even if it's in a different module. You shouldn't need INLINE for that.

I don't understand the difference between cases (2) and (3).

I am still suspicious of why there are so many calls to this one function that it, alone, accounts for a significant proportion of the allocation of an entire run of GHC.  Are you sure there isn't an algorithmic improvement to be had, to simply reduce the number of calls?

Simon

|  -----Original Message-----
|  From: ghc-devs [mailto:ghc-devs-bounces at haskell.org] On Behalf Of
|  Richard Eisenberg
|  Sent: 16 December 2014 21:46
|  To: Joachim Breitner
|  Cc: ghc-devs at haskell.org
|  Subject: Re: performance regressions
|  
|  [Richard's message of 16 December, quoted in full earlier in the thread; snipped]
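For comparison with INLINEABLE (and touching on Carter's question), an explicit SPECIALISE pragma generates the monomorphic copy up front rather than leaving it to each call site. A hedged sketch on a stand-in function; `mapAccumM` here is an illustrative example, not a function from this thread.

```haskell
-- Stand-in recursive monadic helper, for illustration only.
mapAccumM :: Monad m => (s -> a -> m (s, b)) -> s -> [a] -> m (s, [b])
{-# INLINEABLE mapAccumM #-}
-- The INLINEABLE unfolding lets importing modules specialise on
-- demand; the SPECIALISE pragma below forces a specialisation to IO
-- in this module, so IO callers never pay for the Monad dictionary.
{-# SPECIALISE mapAccumM :: (s -> a -> IO (s, b)) -> s -> [a] -> IO (s, [b]) #-}
mapAccumM _ s []     = return (s, [])
mapAccumM f s (x:xs) = do
  (s',  y)  <- f s x
  (s'', ys) <- mapAccumM f s' xs
  return (s'', y:ys)
```

A SPECIALISE type may stay polymorphic in the other variables; only the monad needs to be pinned down for the dictionary to disappear.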


performance regressions

Richard Eisenberg-2
My unsubstantiated guess is that INLINEABLE would have the same effect as INLINE here, as GHC doesn't see fit to actually inline the function, even with INLINE -- the big improvement seen between (1) and (2) is actually specialization, not inlining. The jump from (2) to (3) is actual inlining. Thus, it seems that GHC's heuristics for inlining aren't working out for the best here.

I've pushed my changes, though I agree with Simon that more research may uncover even more improvements here. I didn't focus on the number of calls because that number didn't regress. Will look into this soon.

Richard

On Dec 17, 2014, at 4:15 AM, Simon Peyton Jones <simonpj at microsoft.com> wrote:

> If you use INLINEABLE, that should make the function specialisable to a particular monad, even if it's in a different module. You shouldn't need INLINE for that.
>
> I don't understand the difference between cases (2) and (3).
>
> I am still suspicious of why there are so many calls to this one function that it, alone, accounts for a significant proportion of the allocation of an entire run of GHC.  Are you sure there isn't an algorithmic improvement to be had, to simply reduce the number of calls?
>
> Simon
>
> [earlier messages in the thread, quoted in full; snipped]
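The (2)-to-(3) jump described above came from `GHC.Exts.inline`, which overrides the inlining heuristics at a single call site. A small self-contained sketch; the helper and call site here are stand-ins in the spirit of the code under discussion, not GHC's actual source.

```haskell
import GHC.Exts (inline)

-- Stand-in recursive monadic helper; INLINEABLE guarantees the
-- unfolding is available for `inline` to use.
zipWithM' :: Monad m => (a -> b -> m c) -> [a] -> [b] -> m [c]
{-# INLINEABLE zipWithM' #-}
zipWithM' f (x:xs) (y:ys) = do
  z  <- f x y
  zs <- zipWithM' f xs ys
  return (z:zs)
zipWithM' _ _ _ = return []

-- Case (3): `inline` forces the unfolding of zipWithM' at this call
-- site, regardless of GHC's own heuristics; the exposed body is then
-- specialised to the concrete monad (IO) and the known worker.
sumPairs :: [Int] -> [Int] -> IO [Int]
sumPairs = inline zipWithM' (\a b -> return (a + b))
```

Note that `inline` only guarantees one unfolding step of a recursive function; the win comes from the subsequent specialisation of the recursive call to the concrete monad.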



performance regressions

Simon Peyton Jones
I still would like to understand why INLINE does not make it inline. That's weird.

E.g., a way to reproduce it would help.

Simon

|  -----Original Message-----
|  From: Richard Eisenberg [mailto:eir at cis.upenn.edu]
|  Sent: 17 December 2014 15:56
|  To: Simon Peyton Jones
|  Cc: Joachim Breitner; ghc-devs at haskell.org
|  Subject: Re: performance regressions
|  
|  My unsubstantiated guess is that INLINEABLE would have the same effect
|  as INLINE here, as GHC doesn't see fit to actually inline the
|  function, even with INLINE -- the big improvement seen between (1) and
|  (2) is actually specialization, not inlining. The jump from (2) to (3)
|  is actual inlining. Thus, it seems that GHC's heuristics for inlining
|  aren't working out for the best here.
|  
|  I've pushed my changes, though I agree with Simon that more research
|  may uncover even more improvements here. I didn't focus on the number
|  of calls because that number didn't regress. Will look into this soon.
|  
|  Richard
|  
|  [earlier messages in the thread, quoted in full; snipped]



performance regressions

John Lato-2
Is it possible INLINE didn't inline the function because it's recursive? If
it were my function, I'd probably try a manual worker/wrapper.

On 07:59, Wed, Dec 17, 2014 Simon Peyton Jones <simonpj at microsoft.com>
wrote:

> I still would like to understand why INLINE does not make it inline.
> That's weird.
>
> E.g., a way to reproduce it would help.
>
> Simon
>
> [earlier messages in the thread, quoted in full; snipped]
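John's worker/wrapper suggestion, sketched as a hypothetical rewrite (not GHC's actual code): the wrapper is non-recursive, so GHC can honour its INLINE pragma, and the recursive worker is then compiled against the concrete monad at each call site.

```haskell
zipWithAndUnzipM :: Monad m
                 => (a -> b -> m (c, d)) -> [a] -> [b] -> m ([c], [d])
{-# INLINE zipWithAndUnzipM #-}
-- Non-recursive wrapper: the INLINE pragma can actually fire here,
-- which it cannot reliably do on a directly recursive definition.
zipWithAndUnzipM f xs0 ys0 = go xs0 ys0
  where
    -- Recursive worker: once the wrapper is inlined, `go` is
    -- specialised to the concrete monad and the known `f`.
    go (x:xs) (y:ys) = do
      (c,  d)  <- f x y
      (cs, ds) <- go xs ys
      return (c:cs, d:ds)
    go _ _ = return ([], [])
```

This is the standard trick for making INLINE effective on recursive functions: move the recursion into a local worker so the outer definition has no self-call.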
