RTS changes affect runtime when they shouldn’t


RTS changes affect runtime when they shouldn’t

Joachim Breitner-2
Hi,

while keeping an eye on the performance numbers, I notice a pattern
where basically any change to the RTS makes some benchmarks go up or
down by a significant percentage. A recent example:
https://git.haskell.org/ghc.git/commitdiff/0aba999f60babe6878a1fd2cc8410139358cad16
which exposes an additional secure modular power function in integer
(and should really not affect any of our test cases), causes these
changes:

Benchmark name    prev     change    now
nofib/time/FS     0.434    -4.61%    0.414 seconds
nofib/time/VS     0.369    +15.45%   0.426 seconds
nofib/time/scs    0.411    -4.62%    0.392 seconds

https://perf.haskell.org/ghc/#revision/0aba999f60babe6878a1fd2cc8410139358cad16
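(For reference, the change column is just (now - prev) / prev; a quick sketch recomputing it from the timings above:)

```shell
# Recompute the reported changes as (now - prev) / prev * 100
# from the FS, VS and scs timings quoted above.
awk 'BEGIN {
  printf "nofib/time/FS  %+.2f%%\n", (0.414 - 0.434) / 0.434 * 100
  printf "nofib/time/VS  %+.2f%%\n", (0.426 - 0.369) / 0.369 * 100
  printf "nofib/time/scs %+.2f%%\n", (0.392 - 0.411) / 0.411 * 100
}'
# -> nofib/time/FS  -4.61%
#    nofib/time/VS  +15.45%
#    nofib/time/scs -4.62%
```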

The new effBench benchmarks (FS, VS) are particularly often affected,
but so are old friends like scs, lambda, integer…


In a case like this I can see that the effect is spurious, but it
really limits our ability to properly evaluate changes to the compiler
– in some cases it makes us cheer about improvements that are not
really there, in other cases it makes us hunt for ghosts.


Does anyone have a solid idea what is causing these differences? Are
they specific to the builder for perf.haskell.org, or do you observe
them as well? And what can we do here?


For the measurements in my thesis I switched to measuring instruction
counts (using valgrind) instead. These are much more stable, require
only a single NoFibRun, and the machine does not have to be otherwise
quiet. Should I start using these on perf.haskell.org? Or would we lose
too much by not tracking actual running times any more?
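For illustration, a minimal sketch of what such a valgrind-based instruction-count measurement looks like; the benchmark name below is hypothetical, and the sample line merely mimics cachegrind's summary format:

```shell
# A single run under cachegrind (commented out: requires valgrind and a
# real benchmark binary; both names here are hypothetical):
#   valgrind --tool=cachegrind ./some-nofib-benchmark
# The stable metric is the total instruction count, the "I refs" line of
# cachegrind's summary. Extracting it from a saved log can look like
# this; the sample line mimics cachegrind's output format.
sample='==12345== I   refs:      1,234,567,890'
printf '%s\n' "$sample" | awk '/I +refs/ { gsub(/,/, "", $NF); print $NF }'
# -> 1234567890
```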


Greetings,
Joachim



--
Joachim Breitner
  [hidden email]
  http://www.joachim-breitner.de/

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs


Re: RTS changes affect runtime when they shouldn’t

Ben Gamari-2
Joachim Breitner <[hidden email]> writes:

[snip]

>
> Does anyone have a solid idea what is causing these differences? Are
> they specific to the builder for perf.haskell.org, or do you observe
> them as well? And what can we do here?
>
There is certainly no shortage of possible causes:

    https://www.youtube.com/watch?v=IX16gcX4vDQ

It would be interesting to take a few days to really try to build an
understanding of a few of these performance jumps with perf. At the
moment we can only speculate.

> For the measurements in my thesis I switched to measuring instruction
> counts (using valgrind) instead. These are much more stable, requires
> only a single NoFibRun, and the machine does not have to be otherwise
> quiet. Should I start using these on perf.haskell.org? Or would we lose
> too much by not tracking actual running times any more?
>
Note that valgrind can also do cache modelling so I suspect it can give
you a reasonably good picture of execution; certainly better than
runtime. However, the trade-off is that (last I checked) it's incredibly
slow. Don't think I mean just a bit less peppy than usual. I mean
soul-crushingly, molasses-on-a-cold-December-morning slow.

If we think that we can bear the cost of running valgrind then I think
it would be a great improvement. As you point out, the current run times
are essentially useless.

Cheers,

- Ben


Re: RTS changes affect runtime when they shouldn’t

Joachim Breitner-2
Hi

On Wednesday, 20.09.2017, 14:33 -0400, Ben Gamari wrote:
> Note that valgrind can also do cache modelling so I suspect it can give
> you a reasonably good picture of execution; certainly better than
> runtime. However, the trade-off is that (last I checked) it's incredibly
> slow. Don't think I mean just a bit less peppy than usual. I mean
> soul-crushingly, molasses-on-a-cold-December-morning slow.
>
> If we think that we can bear the cost of running valgrind then I think
> it would be a great improvement. As you point out, the current run times
> are essentially useless.

I did it for my thesis and I found it ok. I mean I always sent it off
to some machine and looked at the results later, so I did not really
care whether it took 30mins or 2h.

I think I’ll try it in perf.haskell.org and see what happens.

Joachim
--
Joachim “nomeata” Breitner
  [hidden email]
  https://www.joachim-breitner.de/


Re: RTS changes affect runtime when they shouldn’t

Sebastian Graf
Hi,

> I did it for my thesis and I found it ok. I mean I always sent it off
> to some machine and looked at the results later, so I did not really
> care whether it took 30mins or 2h.

I did the same for my thesis (the setup of which basically was a rip-off of Joachim's) and it was really quite bearable. I think it was even faster than doing NoFibRuns=30 without counting instructions.

The only real drawback I see is that instruction count might skew results, because AFAIK it doesn't properly take the architecture (pipeline, latencies, etc.) into account. It might be just OK for the average program, though.


Re: RTS changes affect runtime when they shouldn’t

Joachim Breitner-2
Hi,

On Thursday, 21.09.2017, 00:34 +0200, Sebastian Graf wrote:

> Hi,
>
> > I did it for my thesis and I found it ok. I mean I always sent it off
> > to some machine and looked at the results later, so I did not really
> > care whether it took 30mins or 2h.
>
> I did the same for my thesis (the setup of which basically was a rip-off
> of Joachim's) and it was really quite bearable. I think it was
> even faster than doing NoFibRuns=30 without counting instructions.
>
> The only real drawback I see is that instruction count might skew
> results, because AFAIK it doesn't properly take the architecture
> (pipeline, latencies, etc.) into account. It might be just OK for the
> average program, though.
I’ll try that now, and see if I like the results better. It might take
a few iterations to find the right settings, so perf.haskell.org might
not update quickly for now.

Greetings,
Joachim

--
Joachim Breitner
  [hidden email]
  http://www.joachim-breitner.de/


perf.haskell.org update: Now using cachegrind

Joachim Breitner-2
Hi,

I have switched perf.haskell.org to run nofib with
$ make -C nofib EXTRA_RUNTEST_OPTS=-cachegrind NoFibRuns=1 mode=slow -j8

Right now, a complete build with testsuite and nofib takes ~2½h, but
that was before I added the "-j8" to the line above, so let’s see if
that helps.

Even with cachegrind, most nofib tests finish in under one minute, seven
take between one and ten minutes, k-nucleotide needs 20 minutes, and
fannkuch-redux needs 27 minutes. All in all, not too bad.

I reset the perf.haskell.org database and started re-measuring commits
from 055d73c6576bed2affaf96ef6a6b89aeb2cd2e9f on. It will take a while
(a week maybe?) for it to catch up with master.


I hope that this allows us to spot performance regressions with greater
precision.

Greetings,
Joachim

--
Joachim “nomeata” Breitner
  [hidden email]
  https://www.joachim-breitner.de/


Re: RTS changes affect runtime when they shouldn’t

Sven Panne-2
In reply to this post by Sebastian Graf
2017-09-21 0:34 GMT+02:00 Sebastian Graf <[hidden email]>:
> [...] The only real drawback I see is that instruction count might skew results, because AFAIK it doesn't properly take the architecture (pipeline, latencies, etc.) into account. It might be just OK for the average program, though.

It really depends on what you're trying to measure: The raw instruction count is basically useless if you want to have a number which has any connection to the real time taken by the program. The average number of cycles per CPU instruction varies by 2 orders of magnitude on modern architectures, see e.g. the Skylake section in http://www.agner.org/optimize/instruction_tables.pdf (IMHO a must-read for anyone doing serious optimizations/measurements on the assembly level). And these numbers don't even include the effects of the caches, pipeline stalls, branch prediction, execution units/ports, etc. etc. which can easily add another 1 or 2 orders of magnitude.

So what can one do? It basically boils down to a choice:

   * Use a stable number like the instruction count (the "Instructions Read" (Ir) events), which has no real connection to the speed of a program.

   * Use a relatively volatile number like real time and/or cycles used, which is what your users will care about. If you put a non-trivial amount of work into your compiler, you can make these numbers a bit more stable (e.g. by making the code layout/alignment more stable), but you will still get quite different numbers if you switch to another CPU generation/manufacturer.

A bit tragic, but that's life in 2017... :-}


Re: RTS changes affect runtime when they shouldn’t

Joachim Breitner-2
Hi,

On Saturday, 23.09.2017, 20:45 +0200, Sven Panne wrote:

> 2017-09-21 0:34 GMT+02:00 Sebastian Graf <[hidden email]>:
> > [...] The only real drawback I see is that instruction count might
> > skew results, because AFAIK it doesn't properly take the
> > architecture (pipeline, latencies, etc.) into account. It might be
> > just OK for the average program, though.
> >
>
> It really depends on what you're trying to measure: The raw
> instruction count is basically useless if you want to have a number
> which has any connection to the real time taken by the program. The
> average number of cycles per CPU instruction varies by 2 orders of
> magnitude on modern architectures, see e.g. the Skylake section in
> http://www.agner.org/optimize/instruction_tables.pdf (IMHO a must-read
> for anyone doing serious optimizations/measurements on the assembly
> level). And these numbers don't even include the effects of the
> caches, pipeline stalls, branch prediction, execution units/ports,
> etc. etc. which can easily add another 1 or 2 orders of magnitude.
>
> So what can one do? It basically boils down to a choice:
>
>    * Use a stable number like the instruction count (the
> "Instructions Read" (Ir) events), which has no real connection to the
> speed of a program.
>
>    * Use a relatively volatile number like real time and/or cycles
> used, which is what your users will care about. If you put a non-
> trivial amount of work into your compiler, you can make these numbers
> a bit more stable (e.g. by making the code layout/alignment more
> stable), but you will still get quite different numbers if you switch
> to another CPU generation/manufacturer.
>
> A bit tragic, but that's life in 2017... :-}
What I want to do is to reliably catch regressions. What are the odds
that a change to the Haskell compiler (in particular to Core2Core
transformations) will cause a significant increase in runtime without a
significant increase in instruction count?
(Honest question, not rhetorical.)

Greetings,
Joachim

--
Joachim Breitner
  [hidden email]
  http://www.joachim-breitner.de/


Re: RTS changes affect runtime when they shouldn’t

Bardur Arantsson-2
In reply to this post by Sven Panne-2
On 2017-09-23 20:45, Sven Panne wrote:

[snip]

I may be missing something since I have only quickly skimmed the thread,
but...: Why not track all of these things and correlate them with
individual runs? The Linux 'perf' tool can retrieve a *lot* of
interesting numbers, esp. around cache hit rates, branch prediction hit
rates, etc.

Regards,


Re: RTS changes affect runtime when they shouldn’t

Sven Panne-2
In reply to this post by Joachim Breitner-2
2017-09-23 21:06 GMT+02:00 Joachim Breitner <[hidden email]>:
> what I want to do is to reliably catch regressions.

The main question is: Which kind of regressions do you want to catch? Do you care about runtime as experienced by the user? Measure the runtime. Do you care about code size? Measure the code size. And so on. Measuring things like the number of fetched instructions as an indicator for the experienced runtime is basically a useless exercise, unless you do this on ancient RISC processors, where each instruction takes a fixed number of cycles.
 
> What are the odds that a change to the Haskell compiler (in particular
> to Core2Core transformations) will cause a significant increase in
> runtime without a significant increase in instruction count?
> (Honest question, not rhetoric.)

The odds are actually quite high, especially when you define "significant" as "changing a few percent" (which we do!). Just a few examples from current CPUs:

   * If branch prediction does not have enough information to do better, it assumes that backward branches are taken (think: loops) and forward branches are not taken (so you should put "exceptional" code out of the common, straight-line code). If by some innocent-looking change the code layout changes, you can easily get a very measurable difference in runtime even if the number of executed instructions stays exactly the same.

   * Even if the number of instructions changes only a tiny bit, it could be the case that it is just enough to make caching much worse and/or make the loop stream detector fail to detect a loop.

There are lots of other scenarios, so in a nutshell: Measure what you really care about, not something you think might be related to that.

As already mentioned in another reply, "perf" can give you very detailed hints about how good your program uses the pipeline, caches, branch prediction etc. Perhaps the performance dashboard should really collect these, too, this would remove a lot of guesswork.
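To illustrate, a hedged sketch of what collecting such counters could look like (the event names are standard perf events, but the benchmark binary is hypothetical, and the sample lines only mimic perf's output format):

```shell
# Collect pipeline/cache/branch statistics for one run (commented out:
# requires Linux perf and a real benchmark binary):
#   perf stat -e cycles,instructions,cache-misses,branch-misses ./some-benchmark
# perf stat prints one counter per line; pulling the numbers out of a
# saved log for the dashboard could look like this (the sample lines
# mimic perf's format).
sample='     1,234,567      branch-misses
        98,765      cache-misses'
printf '%s\n' "$sample" | awk 'NF >= 2 { gsub(/,/, "", $1); print $2, $1 }'
# -> branch-misses 1234567
#    cache-misses 98765
```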


Re: RTS changes affect runtime when they shouldn’t

David Feuer-2
In reply to this post by Joachim Breitner-2
I think changes to the RTS, code generator, and general heap layout are exactly where we *do* want to worry about these very low-level details. Changes in type checking, desugaring, core-to-core, etc., probably are not, because it's just too hard to tease out the relationship between what they do and what instructions are emitted in the end.



David Feuer
Well-Typed, LLP


Re: RTS changes affect runtime when they shouldn’t

Ben Gamari-2
In reply to this post by Bardur Arantsson-2
Bardur Arantsson <[hidden email]> writes:

> I may be missing something since I have only quickly skimmed the thread,
> but...: Why not track all of these things and correlate them with
> individual runs? The Linux 'perf' tool can retrieve a *lot* of
> interesting numbers, esp. around cache hit rates, branch prediction hit
> rates, etc.
>
While it's not a bad idea, I think it's easy to drown in information. Of
course, it's also fairly easy to hide information that we don't care
about, so perhaps this is worth doing regardless.

Cheers,

- Ben


Re: RTS changes affect runtime when they shouldn’t

Sven Panne-2
2017-09-26 18:35 GMT+02:00 Ben Gamari <[hidden email]>:
> While it's not a bad idea, I think it's easy to drown in information. Of
> course, it's also fairly easy to hide information that we don't care
> about, so perhaps this is worth doing regardless.

The point is: You don't know in advance which of the many performance characteristics "perf" spits out is relevant. If e.g. you see a regression in runtime although you really didn't expect one (tiny RTS change etc.), a quick look at the diffs of all perf values can often give a hint (e.g. branch prediction was screwed up by different code layout etc.).

So I think it's best to collect all data, but make the user-relevant data (runtime, code size) more prominent than the technical/internal data (cache hit ratio, branch prediction hit ratio, etc.), which is for analysis only. Although the latter is a cause for the former, from a compiler user's perspective it's irrelevant. So there is no actual risk in drowning in data, because you primarily care only for a small subset of it.


Re: perf.haskell.org update: Now using cachegrind

Joachim Breitner-2
In reply to this post by Joachim Breitner-2
Hi,

an update on this:

On Friday, 22.09.2017, 09:06 -0400, Joachim Breitner wrote:
> I have switched perf.haskell.org to run nofib with
> $ make -C nofib EXTRA_RUNTEST_OPTS=-cachegrind NoFibRuns=1 mode=slow -j8

it looks like this will not work out with the current setup:
perf.haskell.org has failed to catch up with the number of commits
coming in.

I’ll see if there are some low-hanging fruits to speed this up, such as
not building haddocks.

I also wonder whether, when using cachegrind, the results from
different machines are actually comparable. Maybe I’ll try that some
day.


Greetings,
Joachim

--
Joachim “nomeata” Breitner
  [hidden email]
  https://www.joachim-breitner.de/


Re: perf.haskell.org update: Now using cachegrind

Sven Panne-2
2017-09-30 17:56 GMT+02:00 Joachim Breitner <[hidden email]>:
> [...] I also wonder whether, when using cachegrind, the results from
> different machines are actually comparable. [...]

In general, they are not really comparable: cachegrind doesn't collect *actual* cache statistics, it emulates a simplified version of the caching machinery, trying to auto-detect parameters, see e.g.:


This doesn't mean that the numbers are useless, but they are only (good) general hints, and in rare extreme cases, they can be totally off from the real numbers. For the "real stuff", one has to use "perf", but then you can only compare numbers from the same CPU models.
