Basic Block Layout in the NCG

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
6 messages Options
Reply | Threaded
Open this post in threaded view
|

Basic Block Layout in the NCG

Andreas Klebinger
Does anyone have good hints for literature on basic block layout algorithms?
I've run into a few examples where the current algorithm falls apart
while working on Cmm.

There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#ticket
where I tracked some of the issues I ran into.

As it stands some cmm optimizations are far out weighted by
accidental changes they cause in the layout of basic blocks.

The main problem seems to be that the current codegen only considers the
last jump
in a basic block as relevant for code layout.

This works well for linear chains of control flow but behaves badly and
somewhat
unpredictable when dealing with branch heavy code where blocks have more
than
one successor or calls.

In particular if we have a loop

A jmp B call C call D

which we enter into at block B from Block E
we would like something like:

E,B,C,D,A

Which means with some luck C/D might be still in cache if we return from
the call.

However we can currently get:

E,B,A,X,D,X,C

where X are other unrelated blocks. This happens since call edges are
invisible to the layout algorithm.
It even happens when we have (conditional) jumps from B  to C and C to D
since these are invisible as well!

I came across cases where inverting conditions lead to big performance
losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders).

So I'm looking for solutions there.

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Basic Block Layout in the NCG

Sven Panne-2
2018-05-05 21:23 GMT+02:00 Andreas Klebinger <[hidden email]>:
[...] I came across cases where inverting conditions lead to big performance losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders). [...]
 
4% is far from being "big", look e.g. at https://dendibakh.github.io/blog/2018/01/18/Code_alignment_issues where changing just the alignment of the code lead to a 10% difference. :-/ The code itself or its layout wasn't changed at all. The "Producing Wrong Data Without Doing Anything Obviously Wrong!" paper gives more funny examples.

I'm not saying that code layout has no impact, quite the opposite. The main point is: Do we really have a benchmarking machinery in place which can tell you if you've improved the real run time or made it worse? I doubt that, at least at the scale of a few percent. To reach just that simple yes/no conclusion, you would need quite a heavy machinery involving randomized linking order, varying environments (in the sense of "number and contents of environment variables"), various CPU models etc. If you do not do that, modern HW will leave you with a lot of "WTF?!" moments and wrong conclusions.


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Basic Block Layout in the NCG

Kavon Farvardin
> 4% is far from being "big", look e.g. at https://dendibakh.github.io/
> blog/2018/01/18/Code_alignment_issues where changing just the
> alignment of the code lead to a 10% difference. :-/ The code itself
> or its layout wasn't changed at all. The "Producing Wrong Data
> Without Doing Anything Obviously Wrong!" paper gives more funny
> examples.

This particular instance of performance difference due to aligning
blocks/functions is not very surprising (it is mentioned in section
3.4.1.5 of the Intel Optimization Manual). I also think 4% on large,
non-trivial programs is "bigger" than 10% on the tiny example in that
blog post (especially since the alignment issues disappear above 1024
iterations of the loop).

~kavon

On Sun, 2018-05-06 at 14:42 +0200, Sven Panne wrote:

> 2018-05-05 21:23 GMT+02:00 Andreas Klebinger <[hidden email]
> t>:
> > [...] I came across cases where inverting conditions lead to big
> > performance losses since suddenly block layout
> > got all messed up. (~4% slowdown for the worst offenders). [...]
>
>  
> 4% is far from being "big", look e.g. at https://dendibakh.github.io/
> blog/2018/01/18/Code_alignment_issues where changing just the
> alignment of the code lead to a 10% difference. :-/ The code itself
> or its layout wasn't changed at all. The "Producing Wrong Data
> Without Doing Anything Obviously Wrong!" paper gives more funny
> examples.
>
> I'm not saying that code layout has no impact, quite the opposite.
> The main point is: Do we really have a benchmarking machinery in
> place which can tell you if you've improved the real run time or made
> it worse? I doubt that, at least at the scale of a few percent. To
> reach just that simple yes/no conclusion, you would need quite a
> heavy machinery involving randomized linking order, varying
> environments (in the sense of "number and contents of environment
> variables"), various CPU models etc. If you do not do that, modern HW
> will leave you with a lot of "WTF?!" moments and wrong conclusions.
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

signature.asc (673 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Basic Block Layout in the NCG

Kavon Farvardin
In reply to this post by Andreas Klebinger
> Does anyone have good hints for literature on basic block layout
> algorithms?

Here are some thoughts:

* Branch Probability *

Any good code layout algorithm should take branch probabilities into
account.  From what I've seen, we already have a few likely-branch
heuristics baked into the generation/transformation of Cmm, though
perhaps it's worth doing more to add probabilities as in [1,3].  The
richer type information in STG could come in handy.

I think the first step in leveraging branch proability information is
to use a Float to represent the magnitude of likeliness instead of a
Bool.

Target probabilities on CmmSwitches could also help create smarter
SwitchPlans. Slides 20-21 in [2] demonstrate a lower-cost decision tree
based on these probabilities.


* Code Layout *

The best all-in-one source for static code positioning I've seen is in
[5], and might be a good starting point for exploring that space.  More
importantly, [5] talks about function positioning, which is something I
think we're missing.  A more sophisticated extension to [5]'s function
positioning can be found in [6].

Keeping in mind that LLVM is tuned to optimize loops within functions,
at at high-level LLVM does the following [4]:

    The algorithm works from the inner-most loop within a
    function outward, and at each stage walks through the
    basic blocks, trying to coalesce them into sequential
    chains where allowed by the CFG (or demanded by heavy
    probabilities). Finally, it walks the blocks in
    topological order, and the first time it reaches a
    chain of basic blocks, it schedules them in the
    function in-order.

There are also plenty of heuristics such as "tail duplication" to deal
with diamonds and other odd cases in the CFG that are harder to layout.
  Unfortunately, there don't seem to be any sources cited.  We may want
to develop our own heuristics to modify the CFG for better layout as
well.



[1] Thomas Ball, James R. Larus. Branch Prediction for Free (https://do
i.org/10.1145/173262.155119)

[2] Hans Wennborg. The recent switch lowering improvements. (http://llv
m.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf) See also: http
s://www.youtube.com/watch?v=gMqSinyL8uk

[3] James E. Smith. A study of branch prediction strategies (https://dl
.acm.org/citation.cfm?id=801871)

[4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html

[5] Karl Pettis, Robert C. Hansen. Profile guided code positioning. (ht
tps://doi.org/10.1145/93542.93550)

[6] Hashemi et al. Efficient procedure mapping using cache line
coloring (https://doi.org/10.1145/258915.258931)


~kavon


On Sat, 2018-05-05 at 21:23 +0200, Andreas Klebinger wrote:

> Does anyone have good hints for literature on basic block layout
> algorithms?
> I've run into a few examples where the current algorithm falls apart
> while working on Cmm.
>
> There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#
> ticket
> where I tracked some of the issues I ran into.
>
> As it stands some cmm optimizations are far out weighted by
> accidental changes they cause in the layout of basic blocks.
>
> The main problem seems to be that the current codegen only considers
> the
> last jump
> in a basic block as relevant for code layout.
>
> This works well for linear chains of control flow but behaves badly
> and
> somewhat
> unpredictable when dealing with branch heavy code where blocks have
> more
> than
> one successor or calls.
>
> In particular if we have a loop
>
> A jmp B call C call D
>
> which we enter into at block B from Block E
> we would like something like:
>
> E,B,C,D,A
>
> Which means with some luck C/D might be still in cache if we return
> from
> the call.
>
> However we can currently get:
>
> E,B,A,X,D,X,C
>
> where X are other unrelated blocks. This happens since call edges
> are
> invisible to the layout algorithm.
> It even happens when we have (conditional) jumps from B  to C and C
> to D
> since these are invisible as well!
>
> I came across cases where inverting conditions lead to big
> performance
> losses since suddenly block layout
> got all messed up. (~4% slowdown for the worst offenders).
>
> So I'm looking for solutions there.
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

signature.asc (673 bytes) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: Basic Block Layout in the NCG

Andreas Klebinger
On branch probability:
I've actually created a patch to add more probabilities in the recent past which included
probabilities on CmmSwitches. Although it made little difference when only tagging error
branches.

Partially that also run into issues with code layout which is why I put that on ice for now.

The full patch is here:
https://phabricator.haskell.org/D4327

I think this really has to be done back to front as GHC currently throws
away all likelyhood information before we get to block layout.
Which makes it very hard to take advantage of this information.

Code Layout:

That seems exactly like the kind of pointers I was looking for!
I do wonder how well some of them aged. For example [5] and [6] use at most two way associative cache.
But as you said it should be at least a good starting point.

I will put your links into the ticket so they are easily found once I (or someone else!) has time to look
deeper into this.

Cheers
Andreas


Sonntag, 6. Mai 2018 20:17
Does anyone have good hints for literature on basic block layout
algorithms?

Here are some thoughts:

* Branch Probability *

Any good code layout algorithm should take branch probabilities into
account.  From what I've seen, we already have a few likely-branch
heuristics baked into the generation/transformation of Cmm, though
perhaps it's worth doing more to add probabilities as in [1,3].  The
richer type information in STG could come in handy.

I think the first step in leveraging branch proability information is
to use a Float to represent the magnitude of likeliness instead of a
Bool.

Target probabilities on CmmSwitches could also help create smarter
SwitchPlans. Slides 20-21 in [2] demonstrate a lower-cost decision tree
based on these probabilities.


* Code Layout *

The best all-in-one source for static code positioning I've seen is in
[5], and might be a good starting point for exploring that space.  More
importantly, [5] talks about function positioning, which is something I
think we're missing.  A more sophisticated extension to [5]'s function
positioning can be found in [6].

Keeping in mind that LLVM is tuned to optimize loops within functions,
at at high-level LLVM does the following [4]:

    The algorithm works from the inner-most loop within a 
    function outward, and at each stage walks through the
    basic blocks, trying to coalesce them into sequential
    chains where allowed by the CFG (or demanded by heavy
    probabilities). Finally, it walks the blocks in 
    topological order, and the first time it reaches a 
    chain of basic blocks, it schedules them in the 
    function in-order.

There are also plenty of heuristics such as "tail duplication" to deal
with diamonds and other odd cases in the CFG that are harder to layout.
  Unfortunately, there don't seem to be any sources cited.  We may want
to develop our own heuristics to modify the CFG for better layout as
well.



[1] Thomas Ball, James R. Larus. Branch Prediction for Free (https://do
i.org/10.1145/173262.155119)

[2] Hans Wennborg. The recent switch lowering improvements. (http://llv
m.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf) See also: http
<a class="moz-txt-link-freetext" href="s://www.youtube.com/watch?v=gMqSinyL8uk">s://www.youtube.com/watch?v=gMqSinyL8uk

[3] James E. Smith. A study of branch prediction strategies (https://dl
.acm.org/citation.cfm?id=801871)

[4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html

[5] Karl Pettis, Robert C. Hansen. Profile guided code positioning. (ht
tps://doi.org/10.1145/93542.93550)

[6] Hashemi et al. Efficient procedure mapping using cache line
coloring (https://doi.org/10.1145/258915.258931)


~kavon


On Sat, 2018-05-05 at 21:23 +0200, Andreas Klebinger wrote:
Does anyone have good hints for literature on basic block layout
algorithms?
I've run into a few examples where the current algorithm falls apart 
while working on Cmm.

There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#
ticket
where I tracked some of the issues I ran into.

As it stands some cmm optimizations are far out weighted by
accidental changes they cause in the layout of basic blocks.

The main problem seems to be that the current codegen only considers
the 
last jump
in a basic block as relevant for code layout.

This works well for linear chains of control flow but behaves badly
and 
somewhat
unpredictable when dealing with branch heavy code where blocks have
more 
than
one successor or calls.

In particular if we have a loop

A jmp B call C call D

which we enter into at block B from Block E
we would like something like:

E,B,C,D,A

Which means with some luck C/D might be still in cache if we return
from 
the call.

However we can currently get:

E,B,A,X,D,X,C

where X are other unrelated blocks. This happens since call edges
are 
invisible to the layout algorithm.
It even happens when we have (conditional) jumps from B  to C and C
to D 
since these are invisible as well!

I came across cases where inverting conditions lead to big
performance 
losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders).

So I'm looking for solutions there.

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Samstag, 5. Mai 2018 21:23
Does anyone have good hints for literature on basic block layout algorithms?
I've run into a few examples where the current algorithm falls apart while working on Cmm.

There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#ticket
where I tracked some of the issues I ran into.

As it stands some cmm optimizations are far out weighted by
accidental changes they cause in the layout of basic blocks.

The main problem seems to be that the current codegen only considers the last jump
in a basic block as relevant for code layout.

This works well for linear chains of control flow but behaves badly and somewhat
unpredictable when dealing with branch heavy code where blocks have more than
one successor or calls.

In particular if we have a loop

A jmp B call C call D

which we enter into at block B from Block E
we would like something like:

E,B,C,D,A

Which means with some luck C/D might be still in cache if we return from the call.

However we can currently get:

E,B,A,X,D,X,C

where X are other unrelated blocks. This happens since call edges are invisible to the layout algorithm.
It even happens when we have (conditional) jumps from B  to C and C to D since these are invisible as well!

I came across cases where inverting conditions lead to big performance losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders).

So I'm looking for solutions there.



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

RE: Basic Block Layout in the NCG

GHC - devs mailing list

There is good info in this thread … do add it to the ticket #15124

 

Simon

 

From: ghc-devs <[hidden email]> On Behalf Of Andreas Klebinger
Sent: 06 May 2018 20:59
To: Kavon Farvardin <[hidden email]>
Cc: [hidden email]
Subject: Re: Basic Block Layout in the NCG

 

On branch probability:

I've actually created a patch to add more probabilities in the recent past which included
probabilities on CmmSwitches. Although it made little difference when only tagging error
branches.

Partially that also run into issues with code layout which is why I put that on ice for now.

The full patch is here:
https://phabricator.haskell.org/D4327

I think this really has to be done back to front as GHC currently throws
away all likelyhood information before we get to block layout.
Which makes it very hard to take advantage of this information.

Code Layout:

That seems exactly like the kind of pointers I was looking for!
I do wonder how well some of them aged. For example [5] and [6] use at most two way associative cache.
But as you said it should be at least a good starting point.

I will put your links into the ticket so they are easily found once I (or someone else!) has time to look
deeper into this.

Cheers
Andreas



Sonntag, 6. Mai 2018 20:17

Does anyone have good hints for literature on basic block layout
algorithms?
 
Here are some thoughts:
 
* Branch Probability *
 
Any good code layout algorithm should take branch probabilities into
account.  From what I've seen, we already have a few likely-branch
heuristics baked into the generation/transformation of Cmm, though
perhaps it's worth doing more to add probabilities as in [1,3].  The
richer type information in STG could come in handy.
 
I think the first step in leveraging branch proability information is
to use a Float to represent the magnitude of likeliness instead of a
Bool.
 
Target probabilities on CmmSwitches could also help create smarter
SwitchPlans. Slides 20-21 in [2] demonstrate a lower-cost decision tree
based on these probabilities.
 
 
* Code Layout *
 
The best all-in-one source for static code positioning I've seen is in
[5], and might be a good starting point for exploring that space.  More
importantly, [5] talks about function positioning, which is something I
think we're missing.  A more sophisticated extension to [5]'s function
positioning can be found in [6].
 
Keeping in mind that LLVM is tuned to optimize loops within functions,
at at high-level LLVM does the following [4]:
 
    The algorithm works from the inner-most loop within a 
    function outward, and at each stage walks through the
    basic blocks, trying to coalesce them into sequential
    chains where allowed by the CFG (or demanded by heavy
    probabilities). Finally, it walks the blocks in 
    topological order, and the first time it reaches a 
    chain of basic blocks, it schedules them in the 
    function in-order.
 
There are also plenty of heuristics such as "tail duplication" to deal
with diamonds and other odd cases in the CFG that are harder to layout.
  Unfortunately, there don't seem to be any sources cited.  We may want
to develop our own heuristics to modify the CFG for better layout as
well.
 
 
 
[1] Thomas Ball, James R. Larus. Branch Prediction for Free (https://do
i.org/10.1145/173262.155119)
 
[2] Hans Wennborg. The recent switch lowering improvements. (http://llv
m.org/devmtg/2015-10/slides/Wennborg-SwitchLowering.pdf) See also: http
s://www.youtube.com/watch?v=gMqSinyL8uk
 
[3] James E. Smith. A study of branch prediction strategies (https://dl
.acm.org/citation.cfm?id=801871)
 
[4] http://llvm.org/doxygen/MachineBlockPlacement_8cpp_source.html
 
[5] Karl Pettis, Robert C. Hansen. Profile guided code positioning. (ht
tps://doi.org/10.1145/93542.93550)
 
[6] Hashemi et al. Efficient procedure mapping using cache line
coloring (https://doi.org/10.1145/258915.258931)
 
 
~kavon
 
 
On Sat, 2018-05-05 at 21:23 +0200, Andreas Klebinger wrote:
Does anyone have good hints for literature on basic block layout
algorithms?
I've run into a few examples where the current algorithm falls apart 
while working on Cmm.
 
There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#
ticket
where I tracked some of the issues I ran into.
 
As it stands some cmm optimizations are far out weighted by
accidental changes they cause in the layout of basic blocks.
 
The main problem seems to be that the current codegen only considers
the 
last jump
in a basic block as relevant for code layout.
 
This works well for linear chains of control flow but behaves badly
and 
somewhat
unpredictable when dealing with branch heavy code where blocks have
more 
than
one successor or calls.
 
In particular if we have a loop
 
A jmp B call C call D
 
which we enter into at block B from Block E
we would like something like:
 
E,B,C,D,A
 
Which means with some luck C/D might be still in cache if we return
from 
the call.
 
However we can currently get:
 
E,B,A,X,D,X,C
 
where X are other unrelated blocks. This happens since call edges
are 
invisible to the layout algorithm.
It even happens when we have (conditional) jumps from B  to C and C
to D 
since these are invisible as well!
 
I came across cases where inverting conditions lead to big
performance 
losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders).
 
So I'm looking for solutions there.
 
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Samstag, 5. Mai 2018 21:23

Does anyone have good hints for literature on basic block layout algorithms?
I've run into a few examples where the current algorithm falls apart while working on Cmm.

There is a trac ticket https://ghc.haskell.org/trac/ghc/ticket/15124#ticket
where I tracked some of the issues I ran into.

As it stands some cmm optimizations are far out weighted by
accidental changes they cause in the layout of basic blocks.

The main problem seems to be that the current codegen only considers the last jump
in a basic block as relevant for code layout.

This works well for linear chains of control flow but behaves badly and somewhat
unpredictable when dealing with branch heavy code where blocks have more than
one successor or calls.

In particular if we have a loop

A jmp B call C call D

which we enter into at block B from Block E
we would like something like:

E,B,C,D,A

Which means with some luck C/D might be still in cache if we return from the call.

However we can currently get:

E,B,A,X,D,X,C

where X are other unrelated blocks. This happens since call edges are invisible to the layout algorithm.
It even happens when we have (conditional) jumps from B  to C and C to D since these are invisible as well!

I came across cases where inverting conditions lead to big performance losses since suddenly block layout
got all messed up. (~4% slowdown for the worst offenders).

So I'm looking for solutions there.

 


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs