Performance degradation when factoring out common code

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Performance degradation when factoring out common code

Harendra Kumar
Hi,

I have this code snippet for the bind implementation of a Monad:

    AsyncT m >>= f = AsyncT $ \_ stp yld ->
        let run x = (runAsyncT x) Nothing stp yld
            yield a _ Nothing  = run $ f a
            yield a _ (Just r) = run $ f a <> (r >>= f)
        in m Nothing stp yield

I want to have multiple versions of this implementation parameterized by a function, like this:

bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
    let run x = (runAsyncT x) Nothing stp yld
        yield a _ Nothing  = run $ f a
        yield a _ (Just r) = run $ f a `k` (bindWith k r f)
    in m Nothing stp yield

And then the bind function becomes:

(>>=) = bindWith (<>)

But this leads to a performance degradation of more than 10%. inlining does not help, I tried INLINE pragma as well as the "inline" GHC builtin. I thought this should be a more or less straightforward replacement making the second version equivalent to the first one. But apparently there is something going on here that makes it perform worse. 

I did not look at the core, stg or asm yet. Hoping someone can quickly comment on it. Any ideas why is it so? Can this be worked around somehow?

Thanks,
Harendra

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Mikolaj Konarski-2
Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj



On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:

> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
In reply to this post by Harendra Kumar
Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively, in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in, perhaps this is an area where some improvement looks possible.

-harendra


On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:
Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj



On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
I should also point out that I saw performance improvements by manually factoring out and propagating some common expressions to outer loops in performance sensitive paths. Now I have made this a habit to do this manually. Not sure if something like this has also been fixed with that ticket or some other ticket.

-harendra

On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]> wrote:
Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively, in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in, perhaps this is an area where some improvement looks possible.

-harendra


On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:
Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj



On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
In reply to this post by Harendra Kumar
I tested my code on GHC 8.2.1, there is a good news and a bad news. The good news is that with 8.2.1 performance of my code has improved by a huge margin of around 20%, I had reported a similar perf improvement earlier in another one of my package with 8.2. 

The bad news is that the problem that I reported today is not fixed. I am still seeing a 15% difference between automatic and manual inlining. I guess I should raise a ticket.

-harendra

On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:
Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj



On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

RE: Performance degradation when factoring out common code

GHC - devs mailing list
In reply to this post by Harendra Kumar

I know that this is not an easy request, but can either of you produce a small example that demonstrates your problem?   If so, please open a ticket.

 

I don’t like hearing about people having to use trial and error  with INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem unless I can reproduce it.

 

Simon

 

From: ghc-devs [mailto:[hidden email]] On Behalf Of Harendra Kumar
Sent: 08 September 2017 13:50
To: Mikolaj Konarski <[hidden email]>
Cc: [hidden email]
Subject: Re: Performance degradation when factoring out common code

 

I should also point out that I saw performance improvements by manually factoring out and propagating some common expressions to outer loops in performance sensitive paths. Now I have made this a habit to do this manually. Not sure if something like this has also been fixed with that ticket or some other ticket.

 

-harendra

 

On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]> wrote:

Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively, in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in, perhaps this is an area where some improvement looks possible.

 

-harendra

 

 

On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:

Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj




On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>

> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>

 

 


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
I will try creating a minimal example and open a ticket for the inlining problem, the one I am sure about. 

-harendra

On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]> wrote:

I know that this is not an easy request, but can either of you produce a small example that demonstrates your problem?   If so, please open a ticket.

 

I don’t like hearing about people having to use trial and error  with INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem unless I can reproduce it.

 

Simon

 

From: ghc-devs [mailto:[hidden email]] On Behalf Of Harendra Kumar
Sent: 08 September 2017 13:50
To: Mikolaj Konarski <[hidden email]>
Cc: [hidden email]
Subject: Re: Performance degradation when factoring out common code

 

I should also point out that I saw performance improvements by manually factoring out and propagating some common expressions to outer loops in performance sensitive paths. Now I have made this a habit to do this manually. Not sure if something like this has also been fixed with that ticket or some other ticket.

 

-harendra

 

On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]> wrote:

Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively, in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in, perhaps this is an area where some improvement looks possible.

 

-harendra

 

 

On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:

Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj




On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>

> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>

 

 



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
While trying to come up with a minimal example I discovered one more puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is slowest. This is completely reverse of the expected order.

ghc -O1 (-O2 is similar): 

time                 15.23 ms   (14.72 ms .. 15.73 ms)

ghc -O0:

time                 3.612 ms   (3.548 ms .. 3.728 ms)

runghc:

time                 2.250 ms   (2.156 ms .. 2.348 ms)


I am grokking it further. Any pointers will be helpful. I understand that -O2 can sometimes be slower e.g. aggressive inlining can sometimes be counterproductive. But 4x variation is a lot and this is the case with -O1 as well which should be relatively safer than -O2 in general. Worst of all runghc is significantly faster than ghc. What's going on?

-harendra


On 8 September 2017 at 18:49, Harendra Kumar <[hidden email]> wrote:
I will try creating a minimal example and open a ticket for the inlining problem, the one I am sure about. 

-harendra

On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]> wrote:

I know that this is not an easy request, but can either of you produce a small example that demonstrates your problem?   If so, please open a ticket.

 

I don’t like hearing about people having to use trial and error  with INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem unless I can reproduce it.

 

Simon

 

From: ghc-devs [mailto:[hidden email]] On Behalf Of Harendra Kumar
Sent: 08 September 2017 13:50
To: Mikolaj Konarski <[hidden email]>
Cc: [hidden email]
Subject: Re: Performance degradation when factoring out common code

 

I should also point out that I saw performance improvements by manually factoring out and propagating some common expressions to outer loops in performance sensitive paths. Now I have made this a habit to do this manually. Not sure if something like this has also been fixed with that ticket or some other ticket.

 

-harendra

 

On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]> wrote:

Thanks Mikolaj! I have seen some surprising behavior quite a few times recently and I was wondering whether GHC should do better. In one case I had to use SPECIALIZE very aggressively, in another version of the same code it worked well without that. I have been doing a lot of trial and error with the INLINE/NOINLINE pragmas to figure out what the right combination is. Sometimes it just feels like black magic, because I cannot find a rationale to explain the behavior. I am not sure if there are any more such problems lurking in, perhaps this is an area where some improvement looks possible.

 

-harendra

 

 

On 8 September 2017 at 17:10, Mikolaj Konarski <[hidden email]> wrote:

Hello,

I've had a similar problem that's been fixed in 8.2.1:

https://ghc.haskell.org/trac/ghc/ticket/12603

You can also use some extreme global flags, such as

ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively

to get most the GHC subtlety and shyness out of the way
when experimenting.

Good luck
Mikolaj




On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
<[hidden email]> wrote:
> Hi,
>
> I have this code snippet for the bind implementation of a Monad:
>
>     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>         let run x = (runAsyncT x) Nothing stp yld
>             yield a _ Nothing  = run $ f a
>             yield a _ (Just r) = run $ f a <> (r >>= f)
>         in m Nothing stp yield
>
> I want to have multiple versions of this implementation parameterized by a
> function, like this:
>
> bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>     let run x = (runAsyncT x) Nothing stp yld
>         yield a _ Nothing  = run $ f a
>         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>     in m Nothing stp yield
>
> And then the bind function becomes:
>
> (>>=) = bindWith (<>)
>
> But this leads to a performance degradation of more than 10%. inlining does
> not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
> thought this should be a more or less straightforward replacement making the
> second version equivalent to the first one. But apparently there is
> something going on here that makes it perform worse.
>
> I did not look at the core, stg or asm yet. Hoping someone can quickly
> comment on it. Any ideas why is it so? Can this be worked around somehow?
>
> Thanks,
> Harendra
>

> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>

 

 




_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Matthew Pickering
Do you have the code?

On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <[hidden email]> wrote:

> While trying to come up with a minimal example I discovered one more
> puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
> slowest. This is completely reverse of the expected order.
>
> ghc -O1 (-O2 is similar):
>
> time                 15.23 ms   (14.72 ms .. 15.73 ms)
>
> ghc -O0:
>
> time                 3.612 ms   (3.548 ms .. 3.728 ms)
>
> runghc:
>
> time                 2.250 ms   (2.156 ms .. 2.348 ms)
>
>
> I am grokking it further. Any pointers will be helpful. I understand that
> -O2 can sometimes be slower e.g. aggressive inlining can sometimes be
> counterproductive. But 4x variation is a lot and this is the case with -O1
> as well which should be relatively safer than -O2 in general. Worst of all
> runghc is significantly faster than ghc. What's going on?
>
> -harendra
>
>
> On 8 September 2017 at 18:49, Harendra Kumar <[hidden email]>
> wrote:
>>
>> I will try creating a minimal example and open a ticket for the inlining
>> problem, the one I am sure about.
>>
>> -harendra
>>
>> On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]>
>> wrote:
>>>
>>> I know that this is not an easy request, but can either of you produce a
>>> small example that demonstrates your problem?   If so, please open a ticket.
>>>
>>>
>>>
>>> I don’t like hearing about people having to use trial and error  with
>>> INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem
>>> unless I can reproduce it.
>>>
>>>
>>>
>>> Simon
>>>
>>>
>>>
>>> From: ghc-devs [mailto:[hidden email]] On Behalf Of
>>> Harendra Kumar
>>> Sent: 08 September 2017 13:50
>>> To: Mikolaj Konarski <[hidden email]>
>>> Cc: [hidden email]
>>> Subject: Re: Performance degradation when factoring out common code
>>>
>>>
>>>
>>> I should also point out that I saw performance improvements by manually
>>> factoring out and propagating some common expressions to outer loops in
>>> performance sensitive paths. Now I have made this a habit to do this
>>> manually. Not sure if something like this has also been fixed with that
>>> ticket or some other ticket.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>> On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]>
>>> wrote:
>>>
>>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>>> recently and I was wondering whether GHC should do better. In one case I had
>>> to use SPECIALIZE very aggressively, in another version of the same code it
>>> worked well without that. I have been doing a lot of trial and error with
>>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>>> Sometimes it just feels like black magic, because I cannot find a rationale
>>> to explain the behavior. I am not sure if there are any more such problems
>>> lurking in, perhaps this is an area where some improvement looks possible.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>>
>>>
>>> On 8 September 2017 at 17:10, Mikolaj Konarski
>>> <[hidden email]> wrote:
>>>
>>> Hello,
>>>
>>> I've had a similar problem that's been fixed in 8.2.1:
>>>
>>> https://ghc.haskell.org/trac/ghc/ticket/12603
>>>
>>> You can also use some extreme global flags, such as
>>>
>>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>>>
>>> to get most the GHC subtlety and shyness out of the way
>>> when experimenting.
>>>
>>> Good luck
>>> Mikolaj
>>>
>>>
>>>
>>>
>>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>>> <[hidden email]> wrote:
>>> > Hi,
>>> >
>>> > I have this code snippet for the bind implementation of a Monad:
>>> >
>>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>>> >         let run x = (runAsyncT x) Nothing stp yld
>>> >             yield a _ Nothing  = run $ f a
>>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>>> >         in m Nothing stp yield
>>> >
>>> > I want to have multiple versions of this implementation parameterized
>>> > by a
>>> > function, like this:
>>> >
>>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>>> >     let run x = (runAsyncT x) Nothing stp yld
>>> >         yield a _ Nothing  = run $ f a
>>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>>> >     in m Nothing stp yield
>>> >
>>> > And then the bind function becomes:
>>> >
>>> > (>>=) = bindWith (<>)
>>> >
>>> > But this leads to a performance degradation of more than 10%. inlining
>>> > does
>>> > not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
>>> > thought this should be a more or less straightforward replacement
>>> > making the
>>> > second version equivalent to the first one. But apparently there is
>>> > something going on here that makes it perform worse.
>>> >
>>> > I did not look at the core, stg or asm yet. Hoping someone can quickly
>>> > comment on it. Any ideas why is it so? Can this be worked around
>>> > somehow?
>>> >
>>> > Thanks,
>>> > Harendra
>>> >
>>>
>>> > _______________________________________________
>>> > ghc-devs mailing list
>>> > [hidden email]
>>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
The code is at: https://github.com/harendra-kumar/asyncly. The benchmark code is in "benchmark/Main.hs".  The relevant function is "asyncly_basic".

If you want to run it, you can use the following steps to reproduce the behavior I reported below:

1) Run "stack build"
2) Run "stack runghc benchmark/Main.hs" for runghc figures
3) Run "stack ghc benchmark/Main.hs && benchmark/Main" to compile and run normally
4) Run "stack ghc -- -O2 benchmark/Main.hs && benchmark/Main" to compile and run with -O2 flag

Just look at the first benchmark (asyncly-serial), you can comment out all others if you want to. Note that the library gets compiled without any optimization flags (see the ghc options in the cabal file). So what we are seeing here is just the effect of -O2 on compiling benchmarks/Main.hs.

I am also trying to isolate the problem to a minimal case. I tried removing all the INLINE pragmas in the library to make sure that I am not screwing it up by asking the compiler to inline aggressively, but that does not seem to make any difference to the situation. Let me know if you need any information from me or help in running it.

There are three issues that I am trying to get answers for:

1) Why runghc is faster? It means that there is a possibility for the program to run as fast as runghc runs it. How do I get that performance or an explanation of it?

2) Why -O1/O2 degrades performance so much by 4-5x.

3) The third one is the original problem that I posted in this thread, compiler is unable to match manual inlining. It is possible that this is an issue only when -O1/O2 is used and not when -O0 is used.

Thanks for the help.

-harendra

On 9 September 2017 at 13:30, Matthew Pickering <[hidden email]> wrote:
Do you have the code?

On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <[hidden email]> wrote:
> While trying to come up with a minimal example I discovered one more
> puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
> slowest. This is completely reverse of the expected order.
>
> ghc -O1 (-O2 is similar):
>
> time                 15.23 ms   (14.72 ms .. 15.73 ms)
>
> ghc -O0:
>
> time                 3.612 ms   (3.548 ms .. 3.728 ms)
>
> runghc:
>
> time                 2.250 ms   (2.156 ms .. 2.348 ms)
>
>
> I am grokking it further. Any pointers will be helpful. I understand that
> -O2 can sometimes be slower e.g. aggressive inlining can sometimes be
> counterproductive. But 4x variation is a lot and this is the case with -O1
> as well which should be relatively safer than -O2 in general. Worst of all
> runghc is significantly faster than ghc. What's going on?
>
> -harendra
>
>
> On 8 September 2017 at 18:49, Harendra Kumar <[hidden email]>
> wrote:
>>
>> I will try creating a minimal example and open a ticket for the inlining
>> problem, the one I am sure about.
>>
>> -harendra
>>
>> On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]>
>> wrote:
>>>
>>> I know that this is not an easy request, but can either of you produce a
>>> small example that demonstrates your problem?   If so, please open a ticket.
>>>
>>>
>>>
>>> I don’t like hearing about people having to use trial and error  with
>>> INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem
>>> unless I can reproduce it.
>>>
>>>
>>>
>>> Simon
>>>
>>>
>>>
>>> From: ghc-devs [mailto:[hidden email]] On Behalf Of
>>> Harendra Kumar
>>> Sent: 08 September 2017 13:50
>>> To: Mikolaj Konarski <[hidden email]>
>>> Cc: [hidden email]
>>> Subject: Re: Performance degradation when factoring out common code
>>>
>>>
>>>
>>> I should also point out that I saw performance improvements by manually
>>> factoring out and propagating some common expressions to outer loops in
>>> performance sensitive paths. Now I have made this a habit to do this
>>> manually. Not sure if something like this has also been fixed with that
>>> ticket or some other ticket.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>> On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]>
>>> wrote:
>>>
>>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>>> recently and I was wondering whether GHC should do better. In one case I had
>>> to use SPECIALIZE very aggressively, in another version of the same code it
>>> worked well without that. I have been doing a lot of trial and error with
>>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>>> Sometimes it just feels like black magic, because I cannot find a rationale
>>> to explain the behavior. I am not sure if there are any more such problems
>>> lurking in, perhaps this is an area where some improvement looks possible.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>>
>>>
>>> On 8 September 2017 at 17:10, Mikolaj Konarski
>>> <[hidden email]> wrote:
>>>
>>> Hello,
>>>
>>> I've had a similar problem that's been fixed in 8.2.1:
>>>
>>> https://ghc.haskell.org/trac/ghc/ticket/12603
>>>
>>> You can also use some extreme global flags, such as
>>>
>>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>>>
>>> to get most the GHC subtlety and shyness out of the way
>>> when experimenting.
>>>
>>> Good luck
>>> Mikolaj
>>>
>>>
>>>
>>>
>>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>>> <[hidden email]> wrote:
>>> > Hi,
>>> >
>>> > I have this code snippet for the bind implementation of a Monad:
>>> >
>>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>>> >         let run x = (runAsyncT x) Nothing stp yld
>>> >             yield a _ Nothing  = run $ f a
>>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>>> >         in m Nothing stp yield
>>> >
>>> > I want to have multiple versions of this implementation parameterized
>>> > by a
>>> > function, like this:
>>> >
>>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>>> >     let run x = (runAsyncT x) Nothing stp yld
>>> >         yield a _ Nothing  = run $ f a
>>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>>> >     in m Nothing stp yield
>>> >
>>> > And then the bind function becomes:
>>> >
>>> > (>>=) = bindWith (<>)
>>> >
>>> > But this leads to a performance degradation of more than 10%. inlining
>>> > does
>>> > not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
>>> > thought this should be a more or less straightforward replacement
>>> > making the
>>> > second version equivalent to the first one. But apparently there is
>>> > something going on here that makes it perform worse.
>>> >
>>> > I did not look at the core, stg or asm yet. Hoping someone can quickly
>>> > comment on it. Any ideas why is it so? Can this be worked around
>>> > somehow?
>>> >
>>> > Thanks,
>>> > Harendra
>>> >
>>>
>>> > _______________________________________________
>>> > ghc-devs mailing list
>>> > [hidden email]
>>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>


_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
I could pinpoint one part of the problem. Please see the ticket: https://ghc.haskell.org/trac/ghc/ticket/14208. Here is the description that I wrote in the ticket:

In this particular case -O2 is 2x slower than -O0 and -O0 is 2x slower than runghc. Please see the github repo: https://github.com/harendra-kumar/ghc-perf to reproduce the issue. Readme file in the repo has instructions to reproduce.

The issue seems to occur when the code is placed in a different module. When all the code is in the same module the problem does not occur. In that case -O2 is faster than -O0. However, when the code is split into two modules the performance gets inverted.

Also, it does not occur always, when I tried to change the code to make it simpler for repro the problem did not occur.



-harendra

On 9 September 2017 at 14:08, Harendra Kumar <[hidden email]> wrote:
The code is at: https://github.com/harendra-kumar/asyncly. The benchmark code is in "benchmark/Main.hs".  The relevant function is "asyncly_basic".

If you want to run it, you can use the following steps to reproduce the behavior I reported below:

1) Run "stack build"
2) Run "stack runghc benchmark/Main.hs" for runghc figures
3) Run "stack ghc benchmark/Main.hs && benchmark/Main" to compile and run normally
4) Run "stack ghc -- -O2 benchmark/Main.hs && benchmark/Main" to compile and run with -O2 flag

Just look at the first benchmark (asyncly-serial), you can comment out all others if you want to. Note that the library gets compiled without any optimization flags (see the ghc options in the cabal file). So what we are seeing here is just the effect of -O2 on compiling benchmarks/Main.hs.

I am also trying to isolate the problem to a minimal case. I tried removing all the INLINE pragmas in the library to make sure that I am not screwing it up by asking the compiler to inline aggressively, but that does not seem to make any difference to the situation. Let me know if you need any information from me or help in running it.

There are three issues that I am trying to get answers for:

1) Why runghc is faster? It means that there is a possibility for the program to run as fast as runghc runs it. How do I get that performance or an explanation of it?

2) Why -O1/O2 degrades performance so much by 4-5x.

3) The third one is the original problem that I posted in this thread, compiler is unable to match manual inlining. It is possible that this is an issue only when -O1/O2 is used and not when -O0 is used.

Thanks for the help.

-harendra


On 9 September 2017 at 13:30, Matthew Pickering <[hidden email]> wrote:
Do you have the code?

On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <[hidden email]> wrote:
> While trying to come up with a minimal example I discovered one more
> puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
> slowest. This is completely reverse of the expected order.
>
> ghc -O1 (-O2 is similar):
>
> time                 15.23 ms   (14.72 ms .. 15.73 ms)
>
> ghc -O0:
>
> time                 3.612 ms   (3.548 ms .. 3.728 ms)
>
> runghc:
>
> time                 2.250 ms   (2.156 ms .. 2.348 ms)
>
>
> I am grokking it further. Any pointers will be helpful. I understand that
> -O2 can sometimes be slower e.g. aggressive inlining can sometimes be
> counterproductive. But 4x variation is a lot and this is the case with -O1
> as well which should be relatively safer than -O2 in general. Worst of all
> runghc is significantly faster than ghc. What's going on?
>
> -harendra
>
>
> On 8 September 2017 at 18:49, Harendra Kumar <[hidden email]>
> wrote:
>>
>> I will try creating a minimal example and open a ticket for the inlining
>> problem, the one I am sure about.
>>
>> -harendra
>>
>> On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]>
>> wrote:
>>>
>>> I know that this is not an easy request, but can either of you produce a
>>> small example that demonstrates your problem?   If so, please open a ticket.
>>>
>>>
>>>
>>> I don’t like hearing about people having to use trial and error  with
>>> INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem
>>> unless I can reproduce it.
>>>
>>>
>>>
>>> Simon
>>>
>>>
>>>
>>> From: ghc-devs [mailto:[hidden email]] On Behalf Of
>>> Harendra Kumar
>>> Sent: 08 September 2017 13:50
>>> To: Mikolaj Konarski <[hidden email]>
>>> Cc: [hidden email]
>>> Subject: Re: Performance degradation when factoring out common code
>>>
>>>
>>>
>>> I should also point out that I saw performance improvements by manually
>>> factoring out and propagating some common expressions to outer loops in
>>> performance sensitive paths. Now I have made this a habit to do this
>>> manually. Not sure if something like this has also been fixed with that
>>> ticket or some other ticket.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>> On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]>
>>> wrote:
>>>
>>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>>> recently and I was wondering whether GHC should do better. In one case I had
>>> to use SPECIALIZE very aggressively, in another version of the same code it
>>> worked well without that. I have been doing a lot of trial and error with
>>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>>> Sometimes it just feels like black magic, because I cannot find a rationale
>>> to explain the behavior. I am not sure if there are any more such problems
>>> lurking in, perhaps this is an area where some improvement looks possible.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>>
>>>
>>> On 8 September 2017 at 17:10, Mikolaj Konarski
>>> <[hidden email]> wrote:
>>>
>>> Hello,
>>>
>>> I've had a similar problem that's been fixed in 8.2.1:
>>>
>>> https://ghc.haskell.org/trac/ghc/ticket/12603
>>>
>>> You can also use some extreme global flags, such as
>>>
>>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>>>
>>> to get most the GHC subtlety and shyness out of the way
>>> when experimenting.
>>>
>>> Good luck
>>> Mikolaj
>>>
>>>
>>>
>>>
>>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>>> <[hidden email]> wrote:
>>> > Hi,
>>> >
>>> > I have this code snippet for the bind implementation of a Monad:
>>> >
>>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>>> >         let run x = (runAsyncT x) Nothing stp yld
>>> >             yield a _ Nothing  = run $ f a
>>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>>> >         in m Nothing stp yield
>>> >
>>> > I want to have multiple versions of this implementation parameterized
>>> > by a
>>> > function, like this:
>>> >
>>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>>> >     let run x = (runAsyncT x) Nothing stp yld
>>> >         yield a _ Nothing  = run $ f a
>>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>>> >     in m Nothing stp yield
>>> >
>>> > And then the bind function becomes:
>>> >
>>> > (>>=) = bindWith (<>)
>>> >
>>> > But this leads to a performance degradation of more than 10%. inlining
>>> > does
>>> > not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
>>> > thought this should be a more or less straightforward replacement
>>> > making the
>>> > second version equivalent to the first one. But apparently there is
>>> > something going on here that makes it perform worse.
>>> >
>>> > I did not look at the core, stg or asm yet. Hoping someone can quickly
>>> > comment on it. Any ideas why is it so? Can this be worked around
>>> > somehow?
>>> >
>>> > Thanks,
>>> > Harendra
>>> >
>>>
>>> > _______________________________________________
>>> > ghc-devs mailing list
>>> > [hidden email]
>>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Reply | Threaded
Open this post in threaded view
|

Re: Performance degradation when factoring out common code

Harendra Kumar
In reply to this post by Harendra Kumar
Ok, I filed a ticket for the inlining issue as well - https://ghc.haskell.org/trac/ghc/ticket/14211. The reproduction test case is in the same repo on the "inlining-issue" branch, here - https://github.com/harendra-kumar/ghc-perf/tree/inlining-issue .

Performance with manually inlining a function is more than 10% faster compared to factoring out code and using INLINE pragma.

stack bench for compiler inlined code

time                 46.71 ms   (45.53 ms .. 47.79 ms)

stack bench --flag ghc-perf:manual for manually inlined code

time                 39.46 ms   (38.92 ms .. 39.94 ms)

-harendra

On 9 September 2017 at 14:08, Harendra Kumar <[hidden email]> wrote:
The code is at: https://github.com/harendra-kumar/asyncly. The benchmark code is in "benchmark/Main.hs".  The relevant function is "asyncly_basic".

If you want to run it, you can use the following steps to reproduce the behavior I reported below:

1) Run "stack build"
2) Run "stack runghc benchmark/Main.hs" for runghc figures
3) Run "stack ghc benchmark/Main.hs && benchmark/Main" to compile and run normally
4) Run "stack ghc -- -O2 benchmark/Main.hs && benchmark/Main" to compile and run with -O2 flag

Just look at the first benchmark (asyncly-serial), you can comment out all others if you want to. Note that the library gets compiled without any optimization flags (see the ghc options in the cabal file). So what we are seeing here is just the effect of -O2 on compiling benchmarks/Main.hs.

I am also trying to isolate the problem to a minimal case. I tried removing all the INLINE pragmas in the library to make sure that I am not screwing it up by asking the compiler to inline aggressively, but that does not seem to make any difference to the situation. Let me know if you need any information from me or help in running it.

There are three issues that I am trying to get answers for:

1) Why runghc is faster? It means that there is a possibility for the program to run as fast as runghc runs it. How do I get that performance or an explanation of it?

2) Why -O1/O2 degrades performance so much by 4-5x.

3) The third one is the original problem that I posted in this thread, compiler is unable to match manual inlining. It is possible that this is an issue only when -O1/O2 is used and not when -O0 is used.

Thanks for the help.

-harendra


On 9 September 2017 at 13:30, Matthew Pickering <[hidden email]> wrote:
Do you have the code?

On Sat, Sep 9, 2017 at 6:05 AM, Harendra Kumar <[hidden email]> wrote:
> While trying to come up with a minimal example I discovered one more
> puzzling thing. runghc is fastest, ghc is slower, ghc with optimization is
> slowest. This is completely reverse of the expected order.
>
> ghc -O1 (-O2 is similar):
>
> time                 15.23 ms   (14.72 ms .. 15.73 ms)
>
> ghc -O0:
>
> time                 3.612 ms   (3.548 ms .. 3.728 ms)
>
> runghc:
>
> time                 2.250 ms   (2.156 ms .. 2.348 ms)
>
>
> I am grokking it further. Any pointers will be helpful. I understand that
> -O2 can sometimes be slower e.g. aggressive inlining can sometimes be
> counterproductive. But 4x variation is a lot and this is the case with -O1
> as well which should be relatively safer than -O2 in general. Worst of all
> runghc is significantly faster than ghc. What's going on?
>
> -harendra
>
>
> On 8 September 2017 at 18:49, Harendra Kumar <[hidden email]>
> wrote:
>>
>> I will try creating a minimal example and open a ticket for the inlining
>> problem, the one I am sure about.
>>
>> -harendra
>>
>> On 8 September 2017 at 18:35, Simon Peyton Jones <[hidden email]>
>> wrote:
>>>
>>> I know that this is not an easy request, but can either of you produce a
>>> small example that demonstrates your problem?   If so, please open a ticket.
>>>
>>>
>>>
>>> I don’t like hearing about people having to use trial and error  with
>>> INLINE or SPECIALISE pragmas.  But I can’t even begin to solve the problem
>>> unless I can reproduce it.
>>>
>>>
>>>
>>> Simon
>>>
>>>
>>>
>>> From: ghc-devs [mailto:[hidden email]] On Behalf Of
>>> Harendra Kumar
>>> Sent: 08 September 2017 13:50
>>> To: Mikolaj Konarski <[hidden email]>
>>> Cc: [hidden email]
>>> Subject: Re: Performance degradation when factoring out common code
>>>
>>>
>>>
>>> I should also point out that I saw performance improvements by manually
>>> factoring out and propagating some common expressions to outer loops in
>>> performance sensitive paths. Now I have made this a habit to do this
>>> manually. Not sure if something like this has also been fixed with that
>>> ticket or some other ticket.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>> On 8 September 2017 at 17:34, Harendra Kumar <[hidden email]>
>>> wrote:
>>>
>>> Thanks Mikolaj! I have seen some surprising behavior quite a few times
>>> recently and I was wondering whether GHC should do better. In one case I had
>>> to use SPECIALIZE very aggressively, in another version of the same code it
>>> worked well without that. I have been doing a lot of trial and error with
>>> the INLINE/NOINLINE pragmas to figure out what the right combination is.
>>> Sometimes it just feels like black magic, because I cannot find a rationale
>>> to explain the behavior. I am not sure if there are any more such problems
>>> lurking in, perhaps this is an area where some improvement looks possible.
>>>
>>>
>>>
>>> -harendra
>>>
>>>
>>>
>>>
>>>
>>> On 8 September 2017 at 17:10, Mikolaj Konarski
>>> <[hidden email]> wrote:
>>>
>>> Hello,
>>>
>>> I've had a similar problem that's been fixed in 8.2.1:
>>>
>>> https://ghc.haskell.org/trac/ghc/ticket/12603
>>>
>>> You can also use some extreme global flags, such as
>>>
>>> ghc-options: -fexpose-all-unfoldings -fspecialise-aggressively
>>>
>>> to get most the GHC subtlety and shyness out of the way
>>> when experimenting.
>>>
>>> Good luck
>>> Mikolaj
>>>
>>>
>>>
>>>
>>> On Fri, Sep 8, 2017 at 11:21 AM, Harendra Kumar
>>> <[hidden email]> wrote:
>>> > Hi,
>>> >
>>> > I have this code snippet for the bind implementation of a Monad:
>>> >
>>> >     AsyncT m >>= f = AsyncT $ \_ stp yld ->
>>> >         let run x = (runAsyncT x) Nothing stp yld
>>> >             yield a _ Nothing  = run $ f a
>>> >             yield a _ (Just r) = run $ f a <> (r >>= f)
>>> >         in m Nothing stp yield
>>> >
>>> > I want to have multiple versions of this implementation parameterized
>>> > by a
>>> > function, like this:
>>> >
>>> > bindWith k (AsyncT m) f = AsyncT $ \_ stp yld ->
>>> >     let run x = (runAsyncT x) Nothing stp yld
>>> >         yield a _ Nothing  = run $ f a
>>> >         yield a _ (Just r) = run $ f a `k` (bindWith k r f)
>>> >     in m Nothing stp yield
>>> >
>>> > And then the bind function becomes:
>>> >
>>> > (>>=) = bindWith (<>)
>>> >
>>> > But this leads to a performance degradation of more than 10%. inlining
>>> > does
>>> > not help, I tried INLINE pragma as well as the "inline" GHC builtin. I
>>> > thought this should be a more or less straightforward replacement
>>> > making the
>>> > second version equivalent to the first one. But apparently there is
>>> > something going on here that makes it perform worse.
>>> >
>>> > I did not look at the core, stg or asm yet. Hoping someone can quickly
>>> > comment on it. Any ideas why is it so? Can this be worked around
>>> > somehow?
>>> >
>>> > Thanks,
>>> > Harendra
>>> >
>>>
>>> > _______________________________________________
>>> > ghc-devs mailing list
>>> > [hidden email]
>>> > http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>>> >
>>>
>>>
>>>
>>>
>>
>>
>
>
> _______________________________________________
> ghc-devs mailing list
> [hidden email]
> http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
>



_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs