potential for GHC benchmarks w.r.t. optimisations being incorrect


potential for GHC benchmarks w.r.t. optimisations being incorrect

Daniel Cartwright
I am admittedly unsure of how GHC's optimisation benchmarks are currently implemented/carried out, but I feel as though this paper and its findings could be relevant to GHC devs:

http://cis.upenn.edu/~cis501/papers/producing-wrong-data.pdf

In short, according to this paper, the cache effects of shifting where the stack starts, which depends on the size of the environment variables, are huge for many compiler benchmarks, and adjusting for this effect shows that gcc -O3 is in actuality only about 1% faster than gcc -O2.


"The question they looked at was the following: does the compiler’s -O3 optimization flag result in speedups over -O2? This question is investigated in the light of measurement biases caused by two sources: Unix environment size, and linking order.
[The first source of bias refers] to the total size of the representation of Unix environment variables (such as PATH, HOME, etc.). Typically, these variables are part of the memory image of each process. The call stack begins where the environment ends. This gives rise to the following hypothesis: changing the sizes of (unused!) environment variables can change the alignment of variables on the stack and thus the performance of the program under test due to different behavior of hardware buffers such as caches or TLBs. (This is the source of the hypothetical example in the first paragraph, which I made up. On the machine where I am typing this, my user name appears in 12 of the environment variables that are set by default. All other things being equal, another user with a user name of a different length will have an environment size that differs by a multiple of 12 bytes.)"

"So does this hypothesis hold? Yes. Using a simple computational kernel the authors observe that changing the size of the environment can often cause a slowdown of 33% and, in one particular case, by 300%. On larger benchmarks the effects are less pronounced but still present. Using the C programs from the standard SPEC CPU2006 benchmark suite, the effects of -O2 and -O3 optimizations were compared across a wide range of environment sizes. For several of the programs a wide range of variations was observed, and the results often included both positive and negative observations. The effects were not correlated with the environment size. All this means that for some benchmarks, a compiler engineer might by accident test a purported optimization in a lucky environment and observe a 10% speedup, while users of the same optimization in an unlucky environment may have a 10% slowdown on the same workload."

I write this out of curiosity, as well as concern, over how this may affect GHC.

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

Joachim Breitner-2
Hi,

On Saturday, 05.05.2018 at 12:33 -0400, Daniel Cartwright wrote:
> I write this out of curiosity, as well as concern, over how this may affect GHC.

our performance measurements are pretty non-scientific. For many
years, developers just ran our benchmark suite (nofib) before and
after their change, hopefully on a cleanly built working copy, and
pasted the most interesting numbers in the commit logs. Maybe some went
for coffee to have an otherwise relatively quiet machine (or have some
remote setup), maybe not.

In the end, the run-time performance numbers are often ignored and we
focus on comparing the effects of *dynamic heap allocations*, which
are much more stable across different environments, and which we
believe are a good proxy for actual performance, at least for the kind
of high-level optimizations that we work on in the core-to-core
pipeline. But this assumption is folklore, and not scientifically
investigated.
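The allocation figure Joachim refers to is what GHC's RTS reports when a program is run with `+RTS -t --machine-readable`, which (as I understand the format — treat this as an assumption) prints a list of ("key", "value") string pairs. A minimal sketch of extracting it:

```python
import ast

def bytes_allocated(rts_machine_readable):
    """Pull the 'bytes allocated' figure out of the output a GHC-compiled
    program prints when run with '+RTS -t --machine-readable'.  The output
    is assumed to be a list of ("key", "value") string pairs, e.g.
    ' [("bytes allocated", "489168"), ("num_GCs", "2")]'."""
    stats = dict(ast.literal_eval(rts_machine_readable.strip()))
    return int(stats["bytes allocated"])
```

Comparing this number before and after a patch is what makes the allocation-based workflow reproducible: unlike wall-clock time, it is deterministic for a deterministic program.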

About two years ago we started collecting performance numbers for
every commit to the GHC repository, and I wrote a tool to print
comparisons: https://perf.haskell.org/ghc/

This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvement or regressions due to certain low-level stuff, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
transformations.
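Counting instructions with valgrind boils down to reading the "I refs" total that `valgrind --tool=cachegrind` prints in its summary. A sketch of parsing that figure (the exact line format shown is an assumption about cachegrind's output, not something stated in this thread):

```python
import re

def parse_instr_count(cachegrind_stderr):
    """Extract the total instruction count ('I refs') from the summary
    printed by valgrind --tool=cachegrind, e.g. a line like
    '==1234== I   refs:      1,234,567'."""
    m = re.search(r"I\s+refs:\s+([\d,]+)", cachegrind_stderr)
    if m is None:
        raise ValueError("no instruction count in valgrind output")
    return int(m.group(1).replace(",", ""))
```

Because valgrind runs the program on a virtual CPU, two runs of the same binary yield (nearly) identical counts, which is exactly the stability property the wall-clock numbers lacked.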

I wish there were a better way of getting a reliable, stable number
that reflects the actual performance.

Cheers,
Joachim


--
Joachim Breitner
  [hidden email]
  http://www.joachim-breitner.de/


Re: Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

Andreas Klebinger
Joachim Breitner wrote:
This runs on a dedicated physical machine, and still the run-time
numbers were varying too widely and gave us many false warnings (and
probably reported many false improvements which we of course were happy
to believe). I have since switched to measuring only dynamic
instruction counts with valgrind. This means that we cannot detect
improvement or regressions due to certain low-level stuff, but we gain
the ability to reliably measure *something* that we expect to change
when we improve (or accidentally worsen) the high-level
transformations.
While this matches my experience with the default settings, I had good results by tuning the number of measurements nofib does.
With a high number of NoFibRuns (30+), frequency scaling disabled, background tasks stopped, and after walking away from the computer
until it was done, I got the noise down to about +/-0.2% between subsequent runs.

This doesn't eliminate alignment bias and the like but at least it gives fairly reproducible results.
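The "+/-0.2%" figure above is just the relative spread of the repeated measurements; a small sketch of computing it from a list of run times:

```python
from statistics import mean, stdev

def relative_noise(run_times):
    """Relative spread (sample stddev / mean) of repeated wall-clock
    measurements; with 30+ quiet runs this can get down to ~0.2% (0.002)."""
    return stdev(run_times) / mean(run_times)
```

A change smaller than this spread between two nofib runs cannot be distinguished from noise, which is why the number of runs matters so much.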

Sven Panne wrote:

4% is far from being "big"; look e.g. at https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues where changing just the alignment of the code led to a 10% difference. :-/ The code itself or its layout wasn't changed at all. The "Producing Wrong Data Without Doing Anything Obviously Wrong!" paper gives more funny examples.

I'm not saying that code layout has no impact, quite the opposite. The main point is: Do we really have a benchmarking machinery in place which can tell you if you've improved the real run time or made it worse? I doubt that, at least at the scale of a few percent. To reach just that simple yes/no conclusion, you would need quite a heavy machinery involving randomized linking order, varying environments (in the sense of "number and contents of environment variables"), various CPU models etc. If you do not do that, modern HW will leave you with a lot of "WTF?!" moments and wrong conclusions.
You raise good points. While the example in the blog seems a bit contrived, with the whole loop fitting into a cache line, the principle is a real concern.
I've hit alignment issues and WTF moments plenty of times in the past when looking at micro benchmarks.

However, at the scale of nofib I haven't really seen this happen so far. It's still good to be aware that a whole suite can give
wrong results, though.
I wonder if this effect is limited by GHC's tendency to use 8-byte alignment for all code (at least with tables-next-to-code)?
If we only consider 16-byte (DSB buffer) and 32-byte (cache line) boundaries relevant, this reduces the possibilities by a lot after all.
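The combinatorial point can be made concrete: a block that is always placed on an `align`-byte boundary can only land on a handful of distinct offsets within a hardware window, so the number of layout "lotteries" is small. A sketch (the window sizes passed in are the ones from the discussion, not a claim about any particular CPU):

```python
def distinct_offsets(align, window):
    """How many distinct byte offsets an `align`-byte-aligned code block
    can occupy within a `window`-byte hardware window (fetch buffer,
    cache line, ...)."""
    return len({addr % window for addr in range(0, 1024, align)})
```

With 8-byte alignment there are only 2 possible offsets within a 16-byte window, 4 within 32 bytes, and 8 within a 64-byte cache line.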

In the particular example I hit, however, it's pretty obvious that alignment is not the issue (and I verified that anyway).
In the end, how big the impact of a better layout would be in general is hard to quantify. Hence the question whether anyone has
pointers to good literature that looks into this.

Cheers
Andreas




Re: Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

Joachim Breitner-2
Hi,

On Sunday, 06.05.2018 at 16:41 +0200, Andreas Klebinger wrote:
> With a high number of NoFibRuns (30+) , disabling frequency scaling,
> stopping background tasks and walking away from the computer
> till it was done I got noise down to differences of about +/-0.2% for
> subsequent runs.
>
> This doesn't eliminate alignment bias and the like but at least it
> gives fairly reproducible results.

That’s true, but it leaves alignment bias. This bit me in my work on
Call Arity, as I write in my thesis:

   Initially, I attempted to use the actual run time measurements, but it
   turned out to be a mostly pointless endeavour. For example the knights
   benchmark would become 9% slower when enabling Call Arity (i.e. when
   comparing (A) to (B)), a completely unexpected result, given that the
   changes to the GHC Core code were reasonable. Further investigation
   using performance data obtained from the CPU indicated that with the
   changed code, the CPU’s instruction decoder was idling for more cycles,
   hinting at cache effects and/or bad program layout.
   Indeed: When I compiled the code with the compiler flag -g, which
   includes debugging information in the resulting binary, but should otherwise
   not affect the relative performance characteristics much, the unexpected
   difference vanished. I conclude that non-local changes to the
   Haskell or Core code will change the layout of the generated program
   code in unpredictable ways and render such run time measurements
   mostly meaningless.

   This conclusion has been drawn before [MDHS09], and recently, tools
   to mitigate this effect, e.g. by randomising the code layout [CB13], were
   created. Unfortunately, these currently target specific C compilers, so I
   could not use them here.

   In the following measurements, I avoid this problem by not measuring
   program execution time, but simply by counting the number of instructions performed.
   This way, the variability in execution time due to code
   layout does not affect the results. To obtain the instruction counts I employ
   valgrind [NS07], which runs the benchmarks on a virtual CPU and
   thus produces more reliable and reproducible measurements.

Unpleasant experience.

Cheers,
Joachim
--
Joachim Breitner
  [hidden email]
  http://www.joachim-breitner.de/


Re: Re: potential for GHC benchmarks w.r.t. optimisations being incorrect

Sven Panne-2
In reply to this post by Andreas Klebinger
2018-05-06 16:41 GMT+02:00 Andreas Klebinger <[hidden email]>:
[...] If we only consider 16-byte (DSB buffer) and 32-byte (cache line) boundaries relevant, this reduces the possibilities by a lot after all. [...]

Nitpick: Cache lines on basically all Intel/AMD processors contain 64 bytes, see e.g. http://www.agner.org/optimize/microarchitecture.pdf 
