CMM-to-ASM: Register allocation weirdness


CMM-to-ASM: Register allocation weirdness

Harendra Kumar
Hi,

I am implementing Unicode normalization in Haskell. I challenged myself to match the performance of the best C/C++ implementation, the best being the ICU library. I am almost there: I beat it in one of the benchmarks and am within 30% on the others. I have run out of all the application-level tricks I could think of, and now need help from the compiler.

I started with a bare-minimum loop and added functionality incrementally, watching where the performance tripped. At one point I saw that adding just one 'if' condition cut the performance in half. I looked at what's going on at the assembly level. Here is a GitHub gist of the assembly instructions executed in the fast path of the loop, the corresponding CMM snippets, and also the full CMM for the loop:


I have annotated the assembly code with labels matching the corresponding CMM.
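For illustration only, the general shape of the loop under discussion can be sketched in plain Haskell. The function and its names here are hypothetical stand-ins, not the actual normalization code; the point is that adding one more guard to a strict accumulator loop like this is the kind of small source change that perturbed the generated code:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Hypothetical sketch of a tight scanning loop of the kind being
-- benchmarked; names and logic are illustrative, not the real code.
module Main where

import Data.Char (ord)

-- Count characters below a threshold with a strict accumulator loop.
countBelow :: Int -> String -> Int
countBelow !limit = go 0
  where
    go !acc [] = acc
    go !acc (c:cs)
      -- This guard stands in for the single extra 'if' whose addition
      -- changed register allocation across the whole loop body.
      | ord c < limit = go (acc + 1) cs
      | otherwise     = go acc cs

main :: IO ()
main = print (countBelow 128 "hello world")  -- prints 11
```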

With the addition of another "if" condition, the loop, which had been quite simple until now, suddenly became bloated with a lot of register reassignments. Here is a snippet of the added register movements:

# _n4se:
# swap r14 <-> r11
=> 0x408d6b: mov    %r11,0x98(%rsp)
=> 0x408d73: mov    %r14,%r11
=> 0x408d76: mov    0x98(%rsp),%r14

# reassignments
# rbx -> r10 -> r9 -> r8 -> rdi -> rsi -> rdx -> rcx -> rbx
=> 0x408d7e: mov    %rbx,0x90(%rsp)
=> 0x408d86: mov    %rcx,%rbx
=> 0x408d89: mov    %rdx,%rcx
=> 0x408d8c: mov    %rsi,%rdx
=> 0x408d8f: mov    %rdi,%rsi
=> 0x408d92: mov    %r8,%rdi
=> 0x408d95: mov    %r9,%r8
=> 0x408d98: mov    %r10,%r9
=> 0x408d9b: mov    0x90(%rsp),%r10
.
.
.
# loop logic here, which uses only %rax, %r10 and %r9
.
.
.
# _n4s8:
# shuffle back to original assignments
=> 0x4090dc: mov    %r14,%r11
=> 0x4090df: mov    %r9,%r10
=> 0x4090e2: mov    %r8,%r9
=> 0x4090e5: mov    %rdi,%r8
=> 0x4090e8: mov    %rsi,%rdi
=> 0x4090eb: mov    %rdx,%rsi
=> 0x4090ee: mov    %rcx,%rdx
=> 0x4090f1: mov    %rbx,%rcx
=> 0x4090f4: mov    %rax,%rbx
=> 0x4090f7: mov    0x88(%rsp),%rax

=> 0x4090ff: jmpq   0x408d2a


The registers seem to be getting reassigned here, with data flowing from one register to the next. On this particular path many of these register movements appear unnecessary; they are only undone at the end, without being used.

Maybe this is because these are reusable blocks, and the movement is necessary when they are entered from some other path?

Can this be avoided, or at least avoided on a known fast path by hinting the compiler? Any pointers to the relevant GHC code would be appreciated. I am not yet very familiar with the GHC code base, but I can dig deeper pretty quickly. Before that, though, I hope someone knowledgeable in this area can shed some light at a conceptual level on what is happening and whether it can be improved at all. I can provide more details and run more experiments if needed.

Thanks,
Harendra

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

Re: CMM-to-ASM: Register allocation weirdness

Harendra Kumar
My earlier experiment was on GHC 7.10.3. I repeated it on GHC 8.0.1, and the assembly traced was exactly the same except for a marginal improvement: the 8.0.1 code generator removed the r14/r11 swap, but the rest of the register ring shift remains the same. I have updated the GitHub gist with the 8.0.1 trace:


thanks,
harendra

On 13 June 2016 at 00:08, Harendra Kumar <[hidden email]> wrote:

Re: CMM-to-ASM: Register allocation weirdness

Dominic Steinitz-2
In reply to this post by Harendra Kumar
>  Hi, I am implementing unicode normalization in Haskell. I
> challenged myself to match the performance with the best C/C++
> implementation, the best being the ICU library. I am almost there,
> beating it in one of the benchmarks and within 30% for others. I am
> out of all application level tricks that I could think of and now
> need help from the compiler.

I can't answer your question but I am very happy that someone is
looking at performance issues.

I am sad that no-one has responded.

More generally, is there a story around what is happening to improve
performance?

Dominic.


Re: CMM-to-ASM: Register allocation weirdness

Ben Gamari-2
In reply to this post by Harendra Kumar
Harendra Kumar <[hidden email]> writes:

> My earlier experiment was on GHC-7.10.3. I repeated this on GHC-8.0.1 and
> the assembly traced was exactly the same except for a marginal improvement.
> The 8.0.1 code generator removed the r14/r11 swap but the rest of the
> register ring shift remains the same. I have updated the github gist with
> the 8.0.1 trace:
>

Have you tried compiling with -fregs-graph [1] (the graph-coloring
allocator)?

By default GHC uses a very naive linear register allocator which I'd
imagine may produce these sorts of results. At some point there was an
effort to make -fregs-graph the default (see #2790) but it is
unfortunately quite slow despite having a relatively small impact on
produced-code quality in most cases. However, in your case it may be
worth enabling. Note, however, that the graph coloring allocator has a
few quirks of its own (see #8657 and #7697).

It actually came to my attention while researching this that the
-fregs-graph flag is currently silently ignored [2]. Unfortunately this
means you'll need to build a new compiler if you want to try using it.

Simon Marlow: if we really want to disable this option we should at the very least issue an error when the user requests it. However, it really seems to me that we shouldn't disable it at all: why not just allow the user to use it, and add a note to the documentation stating that the graph coloring allocator may fail with some programs, and that if it breaks, the user gets to keep both pieces?

All-in-all, the graph coloring allocator is in great need of some love;
Harendra, perhaps you'd like to have a try at dusting it off and perhaps
look into why it regresses in compiler performance? It would be great if
we could use it by default.

Cheers,

- Ben


[1] http://downloads.haskell.org/~ghc/master/users-guide//using-optimisation.html?highlight=register%20graph#ghc-flag--fregs-graph
[2] https://git.haskell.org/ghc.git/commitdiff/f0a7261a39bd1a8c5217fecba56c593c353f198c


Re: CMM-to-ASM: Register allocation weirdness

Harendra Kumar
On 16 June 2016 at 13:59, Ben Gamari <[hidden email]> wrote:
>
> It actually came to my attention while researching this that the
> -fregs-graph flag is currently silently ignored [2]. Unfortunately this
> means you'll need to build a new compiler if you want to try using it.

Yes, I did try both -fregs-graph and -fregs-iterative. To debug why nothing changed, I compared the executables produced with and without the flags and found them identical. A note in the manual could have saved me some time, since that is the first place one goes for help. I was wondering whether I was making a mistake in the build and it was not being rebuilt properly. Your note confirms my observation: the flag indeed does not change anything.
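For reference, on a compiler that actually honors the flag, it can also be requested per module via an OPTIONS_GHC pragma rather than on every command line. This is only a sketch; the module and function are made up for illustration:

```haskell
{-# OPTIONS_GHC -fregs-graph -fregs-iterative #-}
{-# LANGUAGE BangPatterns #-}
-- Sketch: request the graph-coloring register allocator (and its
-- iterative-coalescing variant) for just this module. Equivalent to
-- passing the flags on the ghc command line for this file.
module HotLoop (sumTo) where

-- A stand-in hot loop; the pragma applies to the code generated
-- for this module.
sumTo :: Int -> Int
sumTo n = go 0 1
  where
    go !acc !i
      | i > n     = acc
      | otherwise = go (acc + i) (i + 1)
```

As noted above, on the GHC versions discussed here the flag was silently ignored, so this has no effect without a rebuilt compiler.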

> All-in-all, the graph coloring allocator is in great need of some love;
> Harendra, perhaps you'd like to have a try at dusting it off and perhaps
> look into why it regresses in compiler performance? It would be great if
> we could use it by default.

Yes, I can try that. In fact I was going in that direction, and then stopped to look at what LLVM does. LLVM gave me impressive results in some cases, though not so great in others. I compared the code generated by LLVM; in theory it perhaps did a better job (it used fewer instructions), but due to more spilling the end result was pretty similar.

But I found a few interesting optimizations that LLVM did. For example, there was a heap adjustment and check on the looping path which was redundant: the heap pointer was readjusted in the loop itself without being used. LLVM either removed the redundant _adjustments_ in the loop or moved them out of the loop, but it did not remove the corresponding heap _checks_. That makes me wonder whether the redundant heap checks can also be moved or removed. We could do some sort of loop analysis at the CMM level itself and avoid or remove the redundant heap adjustments as well as the checks, or at least float them out of the cycle wherever possible. That sort of optimization would make a significant difference in my case at least. Since we are explicitly aware of the heap at the CMM level, there may be an opportunity to do better than LLVM by optimizing the generated CMM, or the generation of CMM itself.
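To illustrate the kind of difference this is about at the source level (a hypothetical sketch, not the actual benchmark code): an inner loop whose state stays in strict accumulators that GHC can unbox needs no per-iteration allocation, so its heap check can sit outside the loop, whereas a loop that allocates on every iteration carries the check inside:

```haskell
{-# LANGUAGE BangPatterns #-}
-- Illustrative only: two loops computing the same result with
-- different allocation behavior. The boxed version allocates a fresh
-- pair each iteration, keeping a heap check inside the loop; the
-- strict version keeps its state in accumulators that worker/wrapper
-- can unbox, so the hot path need not allocate.
module Loops (sumLenBoxed, sumLenStrict) where

-- Allocates a tuple per iteration (and thunks, with lazy foldl).
sumLenBoxed :: [Int] -> (Int, Int)
sumLenBoxed = foldl (\(s, n) x -> (s + x, n + 1)) (0, 0)

-- Strict accumulators; only the final result pair is allocated.
sumLenStrict :: [Int] -> (Int, Int)
sumLenStrict = go 0 0
  where
    go !s !n []     = (s, n)
    go !s !n (x:xs) = go (s + x) (n + 1) xs
```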

A thought that came to my mind was whether we should focus on getting better code out of the LLVM backend or out of the native code generator. LLVM seems pretty good at the specialized task of code generation and low-level optimization; it is well funded, widely used, and has big community support. That would allow us to leverage that huge effort and take advantage of new developments. Does it make sense to outsource the code generation and low-level optimization tasks to LLVM, with GHC focussing on higher-level optimizations which are harder to do at the LLVM level? What are the downsides of using LLVM exclusively in the future?

-harendra


Re: CMM-to-ASM: Register allocation weirdness

Karel Gardas
On 06/16/16 12:53 PM, Harendra Kumar wrote:
> A thought that came to my mind was whether we should focus on getting
> better code out of the llvm backend or the native code generator. LLVM
> seems pretty good at the specialized task of code generation and low
> level optimization, it is well funded, widely used and has a big
> community support. That allows us to leverage that huge effort and take
> advantage of the new developments. Does it make sense to outsource the
> code generation and low level optimization tasks to llvm and ghc
> focussing on higher level optimizations which are harder to do at the
> llvm level? What are the downsides of using llvm exclusively in future?

Good reading IMHO about the topic is here:
https://ghc.haskell.org/trac/ghc/wiki/ImprovedLLVMBackend

Cheers,
Karel


RE: CMM-to-ASM: Register allocation weirdness

Simon Peyton Jones
In reply to this post by Ben Gamari-2
| All-in-all, the graph coloring allocator is in great need of some love;
| Harendra, perhaps you'd like to have a try at dusting it off and perhaps
| look into why it regresses in compiler performance? It would be great if
| we could use it by default.

I second this.   Volunteers are sorely needed.

Simon

Re: CMM-to-ASM: Register allocation weirdness

Harendra Kumar
In reply to this post by Karel Gardas
That's a nice read, thanks for the pointer. I agree with the solution presented there; if we can do that, it will be awesome. If help is needed, I can spend some time on it.

One of the things I noticed is that the code can be optimized significantly if we know the common case, so that we can optimize that path at the expense of the less common paths. At times I saw wild differences in performance from a very small change in the source. I could attribute the differences to code blocks having moved, to differently placed jump instructions, or to changes in register allocation impacting the common case more. This could be avoided if we knew the common case.
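One of the few source-level levers available today is guard ordering: GHC tends to lay out the first-tested alternative as the fall-through path, so putting the common case first can help. This is a hypothetical sketch of that heuristic, with made-up names, and there is no guarantee of the effect on any given GHC version:

```haskell
-- Illustrative sketch: bias code layout toward the common case by
-- testing it first. Names are hypothetical.
module CommonCase (Class(..), classify) where

data Class = Ascii | Other deriving (Eq, Show)

classify :: Char -> Class
classify c
  | c < '\x80' = Ascii   -- common case first: most input is ASCII
  | otherwise  = Other   -- rare case takes the branch
```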

The common case is not visible or obvious to low-level tools. In a low-level language like C it is easier to write code that is closer to how it will run on the processor, and we can also easily influence gcc from the source level. It is harder to do the same in a high-level language like Haskell, and perhaps there is no point in doing so. What we can do instead is use the LLVM toolchain to perform feedback-directed optimization: it will adjust the low-level code based on your feedback runs. That comes essentially for free, since it can be done at the LLVM level.

My point is that investing in better LLVM integration will pay off in things like this.

-harendra

On 16 June 2016 at 16:48, Karel Gardas <[hidden email]> wrote:

Re: CMM-to-ASM: Register allocation weirdness

Ben Gamari-2
In reply to this post by Harendra Kumar

Cc'ing David Spitzenberg, who has thought about proc-point splitting, which
is relevant for reasons that we will see below.


Harendra Kumar <[hidden email]> writes:

> On 16 June 2016 at 13:59, Ben Gamari <[hidden email]> wrote:
>>
>> It actually came to my attention while researching this that the
>> -fregs-graph flag is currently silently ignored [2]. Unfortunately this
>> means you'll need to build a new compiler if you want to try using it.
>
> Yes I did try -fregs-graph and -fregs-iterative both. To debug why nothing
> changed I had to compare the executables produced with and without the
> flags and found them identical.  A note in the manual could have saved me
> some time since that's the first place to go for help. I was wondering if I
> am making a mistake in the build and if it is not being rebuilt
> properly. Your note confirms my observation, it indeed does not change
> anything.
>
Indeed; I've opened D2335 [1] to reenable -fregs-graph and add an
appropriate note to the users guide.

>> All-in-all, the graph coloring allocator is in great need of some love;
>> Harendra, perhaps you'd like to have a try at dusting it off and perhaps
>> look into why it regresses in compiler performance? It would be great if
>> we could use it by default.
>
> Yes, I can try that. In fact I was going in that direction and then stopped
> to look at what llvm does. llvm gave me impressive results in some cases
> though not so great in others. I compared the code generated by llvm and it
> perhaps did a better job in theory (used fewer instructions) but due to
> more spilling the end result was pretty similar.
>
For the record, I have also struggled with register spilling issues in
the past. See, for instance, #10012, which describes a behavior which
arises from the C-- sinking pass's unwillingness to duplicate code
across branches. While in general it's good to avoid the code bloat that
this duplication implies, in the case shown in that ticket duplicating
the computation would be significantly less code than the bloat from
spilling the needed results.

> But I found a few interesting optimizations that llvm did. For example,
> there was a heap adjustment and check in the looping path which was
> redundant and was readjusted in the loop itself without use. LLVM either
> removed the redundant  _adjustments_ in the loop or moved them out of the
> loop. But it did not remove the corresponding heap _checks_. That makes me
> wonder if the redundant heap checks can also be moved or removed. If we can
> do some sort of loop analysis at the CMM level itself and avoid or remove
> the redundant heap adjustments as well as checks or at least float them out
> of the cycle wherever possible. That sort of optimization can make a
> significant difference to my case at least. Since we are explicitly aware
> of the heap at the CMM level there may be an opportunity to do better than
> llvm if we optimize the generated CMM or the generation of CMM itself.
>
Very interesting, thanks for writing this down! Indeed if these checks
really are redundant then we should try to avoid them. Do you have any
code you could share that demonstrates this?

It would be great to open Trac tickets to track some of the optimization
opportunities that you noted we may be missing. Trac tickets are far
easier to track over longer durations than mailing list conversations,
which tend to get lost in the noise after a few weeks pass.

> A thought that came to my mind was whether we should focus on getting
> better code out of the llvm backend or the native code generator. LLVM
> seems pretty good at the specialized task of code generation and low level
> optimization, it is well funded, widely used and has a big community
> support. That allows us to leverage that huge effort and take advantage of
> the new developments. Does it make sense to outsource the code generation
> and low level optimization tasks to llvm and ghc focussing on higher level
> optimizations which are harder to do at the llvm level? What are the
> downsides of using llvm exclusively in future?
>
There is indeed a question of where we wish to focus our optimization
efforts. However, I think using LLVM exclusively would be a mistake.
LLVM is a rather large dependency that has in the past been rather
difficult to track (this is why we now only target one LLVM release in a
given GHC release). Moreover, it's significantly slower than our
existing native code generator. There are a number of reasons for this,
some of which are fixable. For instance, we currently make no effort to tell
LLVM which passes are worth running and which we've handled; this is
something which should be fixed but will require a rather significant
investment by someone to determine how GHC's and LLVM's passes overlap,
how they interact, and generally which are helpful (see GHC #11295).

Furthermore, there are a few annoying impedance mismatches between Cmm
and LLVM's representation. This can be seen in our treatment of proc
points: when we need to take the address of a block within a function
LLVM requires that we break the block into a separate procedure, hiding
many potential optimizations from the optimizer. This was discussed
further on this list earlier this year [2]. It would be great to
eliminate proc-point splitting but doing so will almost certainly
require cooperation from LLVM.

Cheers,

- Ben


[1] https://phabricator.haskell.org/D2335
[2] https://mail.haskell.org/pipermail/ghc-devs/2015-November/010535.html


Re: CMM-to-ASM: Register allocation weirdness

Harendra Kumar

Thanks Ben! I have my responses inline below.

On 16 June 2016 at 18:07, Ben Gamari <[hidden email]> wrote:

Indeed; I've opened D2335 [1] to reenable -fregs-graph and add an
appropriate note to the users guide.
 
Thanks! That was quick.
 
For the record, I have also struggled with register spilling issues in
the past. See, for instance, #10012, which describes a behavior which
arises from the C-- sinking pass's unwillingness to duplicate code
across branches. While in general it's good to avoid the code bloat that
this duplication implies, in the case shown in that ticket duplicating
the computation would be significantly less code than the bloat from
spilling the needed results.

Not sure if this is possible, but when unsure we could try both, compare whether the duplication results in significantly more code than no duplication, and decide based on that. That would slow down compilation, though; maybe we could bundle the slower passes into something like -O3, meaning it will be slow and may or may not give better results?


> But I found a few interesting optimizations that llvm did. For example,
> there was a heap adjustment and check in the looping path which was
> redundant and was readjusted in the loop itself without use. LLVM either
> removed the redundant  _adjustments_ in the loop or moved them out of the
> loop. But it did not remove the corresponding heap _checks_. That makes me
> wonder if the redundant heap checks can also be moved or removed. If we can
> do some sort of loop analysis at the CMM level itself and avoid or remove
> the redundant heap adjustments as well as checks or at least float them out
> of the cycle wherever possible. That sort of optimization can make a
> significant difference to my case at least. Since we are explicitly aware
> of the heap at the CMM level there may be an opportunity to do better than
> llvm if we optimize the generated CMM or the generation of CMM itself.
>
Very interesting, thanks for writing this down! Indeed if these checks
really are redundant then we should try to avoid them. Do you have any
code you could share that demonstrates this?

The gist that I provided earlier in this thread demonstrates it. Here it is again:


If you look at the CMM trace in the gist, start at label c4ic, where we allocate space on the heap (+48). From this point on there are many possible paths, some of which use the heap and some of which don't. I have marked those which use the heap with curly braces; the rest do not use it at all.
1) c4ic (allocate) -> c4mw -> {c4pv} -> ...
2) c4ic (allocate) -> c4mw -> c4pw -> ((c4pr -> ({c4pe} -> ... | c4ph -> ...)) | cp4ps -> ...)

If we can place this allocation at both c4pv and c4pe instead of at the common parent, then we can keep this check off the fast path. The same applies to the allocation at label c4jd.

I have the code to produce this CMM; I can commit it on a branch in the GitHub repository so that we can use it while fixing this.


It would be great to open Trac tickets to track some of the optimization opportunities that you noted we may be missing.

Will do.
 
There is indeed a question of where we wish to focus our optimization
efforts. However, I think using LLVM exclusively would be a mistake.
LLVM is a rather large dependency that has in the past been rather
difficult to track (this is why we now only target one LLVM release in a
given GHC release). Moreover, it's significantly slower than our
existing native code generator. There are a number of reasons for this,
some of which are fixable. For instance, we currently make no effort to tell
LLVM which passes are worth running and which we've handled; this is
something which should be fixed but will require a rather significant
investment by someone to determine how GHC's and LLVM's passes overlap,
how they interact, and generally which are helpful (see GHC #11295).

Furthermore, there are a few annoying impedance mismatches between Cmm
and LLVM's representation. This can be seen in our treatment of proc
points: when we need to take the address of a block within a function
LLVM requires that we break the block into a separate procedure, hiding
many potential optimizations from the optimizer. This was discussed
further on this list earlier this year [2]. It would be great to
eliminate proc-point splitting but doing so will almost certainly
require cooperation from LLVM.

It sounds like we need to continue with both for now and see how the LLVM option pans out. There is clearly no reason for a decisive tilt toward LLVM in the near future.

-harendra 



RE: CMM-to-ASM: Register allocation weirdness

Simon Peyton Jones
In reply to this post by Simon Peyton Jones

Takenobu, and others

Thanks.  Several people have mentioned that email from me is ending up in spam.

It turns out to be a fault in the Haskell.org mailman setup, which was mis-forwarding my email.

Apparently it is fixed now.  If THIS message ends up in spam with a complaint like that below, could you let me know?

Thanks

Simon

From: Takenobu Tani [mailto:[hidden email]]
Sent: 18 June 2016 08:18
To: Simon Peyton Jones <[hidden email]>
Subject: Re: CMM-to-ASM: Register allocation weirdness

 

Hi Simon,

I'm writing to report a problem with your mails: they may not be reaching Gmail users. I don't know why, but they have been delivered to the "Spam" folder in Gmail.

Gmail displays the following message:

  "Why is this message in Spam? It has a from address in microsoft.com but has failed microsoft.com's required tests for authentication."

For reference, I have attached a screenshot of my Gmail spam folder; your recent mails have been detected as spam. Please check your mail settings.

Regards,

Takenobu

 

2016-06-16 21:10 GMT+09:00 Simon Peyton Jones <[hidden email]>:



Re: CMM-to-ASM: Register allocation weirdness

Takenobu Tani
Hi Simon,

I received this email in the "Inbox" folder of Gmail; it is all right now.

Thank you,
Takenobu


2016-06-18 22:01 GMT+09:00 Simon Peyton Jones <[hidden email]>:

Takenobu, and others

Thanks.  Several people have mentioned that email from me is ending up in spam.

It turns out to be a fault in the Haskell.org mailman setup, which was mis-forwarding my email.

Apparently it is fixed now.  If THIS message ends up in spam with a complaint like that below, could you let me know?

Thanks

Simon

                                                                                                                                                                                                                    

From: Takenobu Tani [mailto:[hidden email]]
Sent: 18 June 2016 08:18
To: Simon Peyton Jones <[hidden email]>
Subject: Re: CMM-to-ASM: Register allocation wierdness

 

Hi Simon,

 

I report to you about your mails.

 

Maybe, your mails don't reach to Gmail users.

I don't know why, but your mails have been distributed to "Spam" folder in Gmail.

 

Gmail displays following message:

  "Why is this message in Spam? It has a from address in microsoft.com but has failed microsoft.com's required tests for authentication."

 

 

For reference, I attach the screen of my Gmail of spam folder.

Recent your mails have been detected as spam.

Please check your mail settings.

 

Regards,

Takenobu

 

2016-06-16 21:10 GMT+09:00 Simon Peyton Jones <[hidden email]>:

| All-in-all, the graph coloring allocator is in great need of some love;
| Harendra, perhaps you'd like to have a try at dusting it off and perhaps
| look into why it regresses in compiler performance? It would be great if
| we could use it by default.

I second this.   Volunteers are sorely needed.

Simon

_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/glasgow-haskell-users

 




Re: CMM-to-ASM: Register allocation weirdness

Carter Schonwald
In reply to this post by Simon Peyton Jones
This email would have been marked spam had I not unmarked all your emails as spam.  Also a gmail user :/


Re: CMM-to-ASM: Register allocation weirdness

Brandon Allbery
On Sun, Jun 19, 2016 at 12:30 AM, Carter Schonwald <[hidden email]> wrote:
This email would have been marked spam had I not unmarked all your emails as spam.  Also a gmail user :/

Same. I forwarded my received headers to the infra folks; they made some more adjustments to Mailman, which need to be tested.

--
brandon s allbery kf8nh                               sine nomine associates
[hidden email]                                  [hidden email]
unix, openafs, kerberos, infrastructure, xmonad        http://sinenomine.net


Re: CMM-to-ASM: Register allocation weirdness

Ben Gamari-2
In reply to this post by Harendra Kumar
Harendra Kumar <[hidden email]> writes:

> Thanks Ben! I have my responses inline below.
>
No worries!

> On 16 June 2016 at 18:07, Ben Gamari <[hidden email]> wrote:
>
>> For the record, I have also struggled with register spilling issues in
>> the past. See, for instance, #10012, which describes a behavior which
>> arises from the C-- sinking pass's unwillingness to duplicate code
>> across branches. While in general it's good to avoid the code bloat that
>> this duplication implies, in the case shown in that ticket duplicating
>> the computation would be significantly less code than the bloat from
>> spilling the needed results.
>>
>
> Not sure if this is possible but when unsure we can try both and compare if
> the duplication results in significantly more code than no duplication and
> make a decision based on that. Though that will slow down the compilation.
> Maybe we can bundle slower passes in something like -O3, meaning it will be
> slow and may or may not provide better results?
>
Indeed this would be one option although I suspect we can do better.
I have discussed the problem with a few people and have some ideas on
how to proceed. Unfortunately I've been suffering from a chronic lack of
time recently.

snip

>> Very interesting, thanks for writing this down! Indeed if these checks
>> really are redundant then we should try to avoid them. Do you have any
>> code you could share that demonstrates this?
>>
>
snip
>
> I have the code to produce this CMM, I can commit it on a branch and leave
> it in the github repository so that we can use it for fixing.
>
Indeed it would be great if you could provide the program that produced
this code.

>> It would be great to open Trac tickets to track some of the optimization
>>
>
> Will do.
>
Thanks!

>>
>> Furthermore, there are a few annoying impedance mismatches between Cmm
>> and LLVM's representation. This can be seen in our treatment of proc
>> points: when we need to take the address of a block within a function
>> LLVM requires that we break the block into a separate procedure, hiding
>> many potential optimizations from the optimizer. This was discussed
>> further on this list earlier this year [2]. It would be great to
>> eliminate proc-point splitting but doing so will almost certainly
>> require cooperation from LLVM.
>>
>
> It sounds like we need to continue with both for now and see how the llvm
> option pans out. There is clearly no reason for a decisive tilt towards
> llvm in near future.
>
I agree.

Cheers,

- Ben


CMM-to-ASM: Register allocation weirdness

Carter Schonwald
In reply to this post by Carter Schonwald


Agreed. There are also some other mismatches between GHC and LLVM in a few fun / interesting ways!



There's a lot of room for improvement in both code generators, but there's also a lot of room to improve the ease of experimenting with improvements. E.g. we don't have a per-target peephole pass, so those optimizations get hacked into the pretty-printing code, last time I checked.

On Thursday, June 16, 2016, Ben Gamari <[hidden email]> wrote:

Ccing David Spitzenberg, who has thought about proc-point splitting, which
is relevant for reasons that we will see below.


Harendra Kumar <[hidden email]> writes:

> On 16 June 2016 at 13:59, Ben Gamari <[hidden email]> wrote:
>>
>> It actually came to my attention while researching this that the
>> -fregs-graph flag is currently silently ignored [2]. Unfortunately this
>> means you'll need to build a new compiler if you want to try using it.
>
> Yes I did try -fregs-graph and -fregs-iterative both. To debug why nothing
> changed I had to compare the executables produced with and without the
> flags and found them identical.  A note in the manual could have saved me
> some time since that's the first place to go for help. I was wondering if I
> am making a mistake in the build and if it is not being rebuilt
> properly. Your note confirms my observation, it indeed does not change
> anything.
>
Indeed; I've opened D2335 [1] to reenable -fregs-graph and add an
appropriate note to the users guide.
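[Editorial sketch: once the flag is honored again, the alternative allocators can be requested per module as well as on the command line. Module and function names below are made up for illustration; this is not code from the thread.]

```haskell
{-# OPTIONS_GHC -fregs-graph #-}
-- Request the graph-coloring register allocator for this module
-- (-fregs-iterative selects the iterative coalescing variant).
-- These flags only affect the native code generator, and only at -O
-- and above; on compilers where the flag is silently ignored (the bug
-- discussed above), the generated code is unchanged.
module Dot (dot) where

-- A small, register-pressure-heavy inner loop of the kind whose code
-- quality the register allocator influences.
dot :: [Int] -> [Int] -> Int
dot xs ys = sum (zipWith (*) xs ys)
```

Equivalently, the flag can be passed on the command line, e.g. `ghc -O2 -fregs-graph Dot.hs`.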

>> All-in-all, the graph coloring allocator is in great need of some love;
>> Harendra, perhaps you'd like to have a try at dusting it off and perhaps
>> look into why it regresses in compiler performance? It would be great if
>> we could use it by default.
>
> Yes, I can try that. In fact I was going in that direction and then stopped
> to look at what llvm does. llvm gave me impressive results in some cases
> though not so great in others. I compared the code generated by llvm and it
> perhaps did a better job in theory (used fewer instructions) but due to
> more spilling the end result was pretty similar.
>
For the record, I have also struggled with register spilling issues in
the past. See, for instance, #10012, which describes a behavior which
arises from the C-- sinking pass's unwillingness to duplicate code
across branches. While in general it's good to avoid the code bloat that
this duplication implies, in the case shown in that ticket duplicating
the computation would be significantly less code than the bloat from
spilling the needed results.

> But I found a few interesting optimizations that llvm did. For example,
> there was a heap adjustment and check in the looping path which was
> redundant and was readjusted in the loop itself without use. LLVM either
> removed the redundant  _adjustments_ in the loop or moved them out of the
> loop. But it did not remove the corresponding heap _checks_. That makes me
> wonder if the redundant heap checks can also be moved or removed. If we can
> do some sort of loop analysis at the CMM level itself and avoid or remove
> the redundant heap adjustments as well as checks or at least float them out
> of the cycle wherever possible. That sort of optimization can make a
> significant difference to my case at least. Since we are explicitly aware
> of the heap at the CMM level there may be an opportunity to do better than
> llvm if we optimize the generated CMM or the generation of CMM itself.
>
Very interesting, thanks for writing this down! Indeed if these checks
really are redundant then we should try to avoid them. Do you have any
code you could share that demonstrates this?
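[Editorial sketch: to make the shape of the problem concrete, here is a hypothetical strict loop in the spirit of the code under discussion (not Harendra's actual code, which lives in the linked repository). Only the cold branch allocates, yet the heap check it requires is associated with the loop body, so a naive translation re-checks (and re-adjusts) the heap pointer on every iteration of the allocation-free hot path.]

```haskell
{-# LANGUAGE BangPatterns #-}
module Loop (sumEvens) where

-- Hypothetical inner loop: the hot path ('even'/'otherwise' branches)
-- allocates nothing, but the cold branch allocates, forcing a heap
-- check that can end up inside the loop in the generated Cmm rather
-- than being floated out of it or confined to the cold branch.
sumEvens :: Int -> [Int] -> Int
sumEvens !limit = go 0
  where
    go !acc []      = acc
    go !acc (y:ys)
      | acc > limit = length (show acc)   -- cold path: allocates
      | even y      = go (acc + y) ys     -- hot path: no allocation
      | otherwise   = go acc ys
```

Floating that check out of the loop, or sinking it into the cold branch, is exactly the kind of Cmm-level loop analysis being proposed here.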

It would be great to open Trac tickets to track some of the optimization
opportunities that you noted we may be missing. Trac tickets are far
easier to track over longer durations than mailing list conversations,
which tend to get lost in the noise after a few weeks pass.

> A thought that came to my mind was whether we should focus on getting
> better code out of the llvm backend or the native code generator. LLVM
> seems pretty good at the specialized task of code generation and low level
> optimization, it is well funded, widely used and has a big community
> support. That allows us to leverage that huge effort and take advantage of
> the new developments. Does it make sense to outsource the code generation
> and low level optimization tasks to llvm and ghc focussing on higher level
> optimizations which are harder to do at the llvm level? What are the
> downsides of using llvm exclusively in future?
>

There is indeed a question of where we wish to focus our optimization
efforts. However, I think using LLVM exclusively would be a mistake.
LLVM is a rather large dependency that has in the past been rather
difficult to track (this is why we now only target one LLVM release in a
given GHC release). Moreover, it's significantly slower than our
existing native code generator. There are a number of reasons for this,
some of which are fixable. For instance, we currently make no effort to tell
LLVM which passes are worth running and which we've handled; this is
something which should be fixed but will require a rather significant
investment by someone to determine how GHC's and LLVM's passes overlap,
how they interact, and generally which are helpful (see GHC #11295).

Furthermore, there are a few annoying impedance mismatches between Cmm
and LLVM's representation. This can be seen in our treatment of proc
points: when we need to take the address of a block within a function
LLVM requires that we break the block into a separate procedure, hiding
many potential optimizations from the optimizer. This was discussed
further on this list earlier this year [2]. It would be great to
eliminate proc-point splitting but doing so will almost certainly
require cooperation from LLVM.

Cheers,

- Ben


[1] https://phabricator.haskell.org/D2335
[2] https://mail.haskell.org/pipermail/ghc-devs/2015-November/010535.html


Re: CMM-to-ASM: Register allocation weirdness

Harendra Kumar
In reply to this post by Ben Gamari-2


On 19 June 2016 at 14:03, Ben Gamari <[hidden email]> wrote:

Indeed it would be great if you could provide the program that produced
this code.

>> It would be great to open Trac tickets to track some of the optimization

Ok, I created an account on ghc trac and raised two tickets: #12231 & #12232. Yay!  I also added the code branch to reproduce this on github (https://github.com/harendra-kumar/unicode-transforms/tree/ghc-trac-12231).

-harendra


Re: CMM-to-ASM: Register allocation weirdness

Ben Gamari-2
Harendra Kumar <[hidden email]> writes:

> On 19 June 2016 at 14:03, Ben Gamari <[hidden email]> wrote:
>
>>
>> Indeed it would be great if you could provide the program that produced
>> this code.
>>
>> >> It would be great to open Trac tickets to track some of the optimization
>
>
> Ok, I created an account on ghc trac and raised two tickets: #12231 &
> #12232. Yay!  I also added the code branch to reproduce this on github (
> https://github.com/harendra-kumar/unicode-transforms/tree/ghc-trac-12231).
>
Great job summarizing the issue. Thanks!

Cheers,

 - Ben
