cross module optimization issues

cross module optimization issues

John Lato-2
Hello,

I have a problem with a package I'm working on, and I don't have any
idea how to sort out the current problem.

One part of my package is in one monolithic module, without an export
list, which works fine.  However, when I started to separate out
certain functions into another module and added an export list to one
of the modules, performance decreased dramatically.  The memory
behavior (as shown by -hT) is also quite different, with substantial
memory usage by "FUN_2_0".  Are there any suggestions as to how I
could improve this?

Thanks,
John
_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users

Re: cross module optimization issues

Don Stewart-2
jwlato:

> Hello,
>
> I have a problem with a package I'm working on, and I don't have any
> idea how to sort out the current problem.
>
> One part of my package is in one monolithic module, without an export
> list, which works fine.  However, when I've started to separate out
> certain functions into another module, and added an export list to one
> of the modules, which dramatically decreases performance.  The memory
> behavior (as shown by -hT) is also quite different, with substantial
> memory usage by "FUN_2_0".  Are there any suggestions as to how I
> could improve this?
>

Are you compiling with aggressive cross-module optimisations on (e.g.
-O2)? You may have to add explicit inlining pragmas (check the Core
output), to ensure key functions are exported in their entirety.

-- Don
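
For illustration, here is a minimal sketch of the kind of explicit inlining pragma Don mentions. The function name `sumSquares` is hypothetical and does not come from John's package. The Core can be inspected by compiling with `-O2 -ddump-simpl`.

```haskell
-- Hypothetical example; 'sumSquares' is an illustrative name only.
-- To inspect the optimised Core, compile with: ghc -O2 -ddump-simpl
import Data.List (foldl')

-- INLINE asks GHC to keep the definition available in the interface
-- file, so that modules importing this function can inline it at
-- their call sites.
{-# INLINE sumSquares #-}
sumSquares :: Int -> Int
sumSquares n = foldl' (\acc x -> acc + x * x) 0 [1 .. n]
```

Whether the pragma helps depends on the call sites, so comparing the Core output before and after is the safest check.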

Re: cross module optimization issues

John Lato-2
On Sat, Nov 15, 2008 at 10:09 PM, Don Stewart <[hidden email]> wrote:

> jwlato:
>> Hello,
>>
>> I have a problem with a package I'm working on, and I don't have any
>> idea how to sort out the current problem.
>>
>> One part of my package is in one monolithic module, without an export
>> list, which works fine.  However, when I've started to separate out
>> certain functions into another module, and added an export list to one
>> of the modules, which dramatically decreases performance.  The memory
>> behavior (as shown by -hT) is also quite different, with substantial
>> memory usage by "FUN_2_0".  Are there any suggestions as to how I
>> could improve this?
>>
>
> Are you compiling with aggressive cross-module optimisations on (e.g.
> -O2)? You may have to add explicit inlining pragmas (check the Core
> output), to ensure key functions are exported in their entirety.
>

Thanks for the reply.

I'm compiling with -O2 -Wall.  After looking at the Core output, I
think I've found the key difference.  A function that is bound in a
"where" statement is different between the monolithic and split
sources.  I have no idea why, though.  I'll experiment with a few
different things to see if I can get this resolved.

John

RE: cross module optimization issues

Simon Peyton Jones
| I'm compiling with -O2 -Wall.  After looking at the Core output, I
| think I've found the key difference.  A function that is bound in a
| "where" statement is different between the monolithic and split
| sources.  I have no idea why, though.  I'll experiment with a few
| different things to see if I can get this resolved.

In general, splitting code across modules should not make programs less efficient -- as Don says, GHC does quite aggressive cross-module inlining.

There is one exception, though.  If a non-exported non-recursive function is called exactly once, then it is inlined *regardless of size*, because doing so does not cause code duplication.  But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times.  You can change the threshold for (a) exposing and (b) using an inlining, with flags -funfolding-creation-threshold and -funfolding-use-threshold respectively.

If you find there's something else going on then I'm all ears.

Simon
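
The distinction Simon draws can be sketched as follows. The module name `Pipeline` and its functions are hypothetical. The threshold values are the ones tried later in the thread, written in the `-fflag=n` form; they are not recommended settings.

```haskell
{-# OPTIONS_GHC -funfolding-creation-threshold=200 -funfolding-use-threshold=100 #-}
-- Hypothetical module; all names are illustrative only.
module Pipeline (process) where

-- 'process' is exported. GHC decides from its size whether to expose
-- its unfolding to importing modules (creation threshold) and whether
-- to inline it at a given call site (use threshold).
process :: [Int] -> Int
process xs = normalise (sum (map step xs))

-- 'step' is not exported and is small, so it is cheap to inline.
step :: Int -> Int
step x = x * x + 1

-- 'normalise' is not exported, not recursive, and used exactly once,
-- so it can be inlined into 'process' regardless of its size without
-- duplicating code.
normalise :: Int -> Int
normalise total = total `mod` 1000003
```

Adding `normalise` to the export list would remove that guarantee, because GHC can no longer see every use of the function.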

Re: cross module optimization issues

John Lato-2
On Wed, Nov 19, 2008 at 4:17 PM, Simon Peyton-Jones
<[hidden email]> wrote:

> | I'm compiling with -O2 -Wall.  After looking at the Core output, I
> | think I've found the key difference.  A function that is bound in a
> | "where" statement is different between the monolithic and split
> | sources.  I have no idea why, though.  I'll experiment with a few
> | different things to see if I can get this resolved.
>
> In general, splitting code across modules should not make programs less efficient -- as Don says, GHC does quite aggressive cross-module inlining.
>
> There is one exception, though.  If a non-exported non-recursive function is called exactly once, then it is inlined *regardless of size*, because doing so does not cause code duplication.  But if it's exported and is large, then its inlining is not exposed -- and even if it were it might not be inlined, because doing so duplicates its code an unknown number of times.  You can change the threshold for (a) exposing and (b) using an inlining, with flags -funfolding-creation-threshold and -funfolding-use-threshold respectively.
>
> If you find there's something else going on then I'm all ears.
>
> Simon
>

I did finally find the changes that make a difference.  I think it's
safe to say that I have no idea what's actually going on, so I'll just
report my results and let others try to figure it out.

I tried upping the thresholds mentioned, up to
-funfolding-creation-threshold 200 -funfolding-use-threshold 100.
This didn't seem to make any performance difference (I didn't check
the core output).

This project is based on Oleg's Iteratee code; I started using his
IterateeM.hs and Enumerator.hs files and added my own stuff to
Enumerator.hs (thanks Oleg, great work as always).  When I started
cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my
minimal test case increased from 19s to 43s.

I've found two factors that contributed.  When I was cleaning up, I
also removed a bunch of unused functions from IterateeM.hs (some of
the test functions and functions specific to his running example of
HTTP encoding).  When I added those functions back in, and added
INLINE pragmas to the exported functions in MyEnum.hs, I got the
performance back.

In general I hadn't added export lists to the modules yet, so all
functions should have been exported.

So it seems that somehow the unused functions in IterateeM.hs are
affecting how the functions I care about get implemented (or
exported).  I did not expect that.  Next step for me is to see what
happens if I INLINE the functions I'm exporting and remove the others,
I suppose.

Thank you Simon and Don for your advice, especially since I'm pretty
far over my head at this point.

John

RE: cross module optimization issues

Simon Peyton Jones
| This project is based on Oleg's Iteratee code; I started using his
| IterateeM.hs and Enumerator.hs files and added my own stuff to
| Enumerator.hs (thanks Oleg, great work as always).  When I started
| cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my
| minimal test case increased from 19s to 43s.
|
| I've found two factors that contributed.  When I was cleaning up, I
| also removed a bunch of unused functions from IterateeM.hs (some of
| the test functions and functions specific to his running example of
| HTTP encoding).  When I added those functions back in, and added
| INLINE pragmas to the exported functions in MyEnum.hs, I got the
| performance back.
|
| In general I hadn't added export lists to the modules yet, so all
| functions should have been exported.

I'm totally snowed under with backlog from my recent absence, so I can't look at this myself, but if anyone else wants to I'd be happy to support with advice and suggestions.

In general, having an explicit export list is good for performance. I typed an extra section in the GHC performance resource http://haskell.org/haskellwiki/Performance/GHC to explain why.  In general that page is where we should document user advice for performance in GHC.

I can't explain why *adding* unused functions would change performance though!

Simon



RE: cross module optimization issues

Mitchell, Neil
Hi John,

I'm vaguely curious, and have next week off, so if you can provide the
code, and directions for running in both variants and the test case,
I'll take a look. Please email me at ndmitchell -AT- gmail.com though,
as I lose this email address at 11pm tonight :-)

Thanks

Neil


> -----Original Message-----
> From: [hidden email]
> [mailto:[hidden email]] On Behalf
> Of Simon Peyton-Jones
> Sent: 21 November 2008 10:34 am
> To: John Lato
> Cc: [hidden email]; Don Stewart
> Subject: RE: cross module optimization issues


Re: cross module optimization issues

Don Stewart-2
In reply to this post by John Lato-2

Is this, since it is in IO code, a -fno-state-hack scenario?
Simon wrote recently about when and why -fno-state-hack would be
needed, if you want to follow that up.

-- Don
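
As background to Don's question, here is a hedged sketch of the situation that -fno-state-hack is meant to address. All names are hypothetical, and whether GHC actually loses sharing here depends on the compiler version and optimisation level.

```haskell
{-# OPTIONS_GHC -fno-state-hack #-}
-- Hypothetical example; names are illustrative only.
import Control.Monad (replicateM)

-- A pure, potentially expensive computation.
sumTo :: Int -> Int
sumTo n = sum [1 .. n]

-- The intent is that 'total' is computed once per call to 'mkAction'
-- and then shared by every run of the resulting IO action.
mkAction :: Int -> IO Int
mkAction n =
  let total = sumTo n
  in  return total

-- The same action is run three times. With the state hack enabled,
-- GHC may assume an IO action is entered only once and move 'total'
-- inside it, so the sum could be recomputed on each run.
runThrice :: Int -> IO [Int]
runThrice n =
  let act = mkAction n
  in  replicateM 3 act
```

The flag trades some optimisation of IO code for preserved sharing, so it is worth measuring both ways.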

Re: cross module optimization issues

John Lato-2
On Sat, Nov 22, 2008 at 6:55 PM, Don Stewart <[hidden email]> wrote:
> jwlato:
>
> Is this , since it is in IO code, a -fno-state-hack scenario?
> Simon  wrote recently about when and why -fno-state-hack would be
> needed, if you want to follow that up.
>
> -- Don
>

Unfortunately, -fno-state-hack doesn't seem to make much difference.
In any case, only the functions that actually do file IO are in the IO
monad; otherwise the functions use a generic Monad constraint.
Although you have reminded me that I should make a non-IO test case.

For Neil, and anyone else interested in looking at this, I'll put the
code and build instructions up later today.  I've just been cleaning
up some test cases to make it easier to run.

John

Re: cross module optimization issues

Neil Mitchell
In reply to this post by Simon Peyton Jones
Hi

I've talked to John a bit, and discussed test cases etc. I've tracked
this down a little way.

Given the attached file, compiling with SHORT_EXPORT_LIST makes the
code go _slower_. By exporting the "print_lines" function the code
doubles in speed. This runs against everything I was expecting, and
that Simon has described.

Taking a look at the .hi files for the two alternatives, there are two
differences:

1) In the faster .hi file, the body of print_lines is exported. This
is reasonable and expected.

2) In the faster .hi file, there are additional specialisations, which
seemingly have little/nothing to do with print_lines, but are omitted
if it is not exported:

"SPEC >>= [GHC.IOBase.IO]" ALWAYS forall @ el
                                         $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.>>= @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.a
      `cast`
    (forall el1 a b.
     Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO a
     -> (a -> Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO b)
     -> trans
            (sym ((GHC.IOBase.:CoIO)
                      (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO b)))
            (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO b)))
      @ el
"SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
                                                         $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.$s$f2 @ el
"SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
                                                         $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.$s$f21 @ el
"SPEC Sound.IterateeM.liftI [GHC.IOBase.IO]" ALWAYS forall @ el
                                                           @ a
                                                           $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.liftI @ GHC.IOBase.IO @ el @ a $dMonad
  = Sound.IterateeM.$sliftI @ el @ a
"SPEC return [GHC.IOBase.IO]" ALWAYS forall @ el
                                            $dMonad :: GHC.Base.Monad GHC.IOBase.IO
  Sound.IterateeM.return @ GHC.IOBase.IO @ el $dMonad
  = Sound.IterateeM.a7
      `cast`
    (forall el1 a.
     a
     -> trans
            (sym ((GHC.IOBase.:CoIO)
                      (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO a)))
            (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO a)))
      @ el

My guess is that these cause the slowdown - but is there any reason
that print_lines not being exported should cause them to be omitted?

All these tests were run on GHC 6.10.1 with -O2.

Thanks

Neil
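
The setup Neil describes can be sketched roughly as follows. This assumes SHORT_EXPORT_LIST is a C preprocessor macro that switches between two export lists. The module `Demo` and its functions are hypothetical stand-ins, not the contents of the attached IterateeM.hs.

```haskell
{-# LANGUAGE CPP #-}
-- Hypothetical reduction; names are illustrative only.
-- Build the two variants with, for example:
--   ghc -O2 -c Demo.hs                       (long export list)
--   ghc -O2 -DSHORT_EXPORT_LIST -c Demo.hs   (short export list)
-- and compare the interface files with: ghc --show-iface Demo.hi
module Demo
  ( countLines
#ifndef SHORT_EXPORT_LIST
  , printLines
#endif
  ) where

-- Exported in both variants.
countLines :: String -> Int
countLines = length . lines

-- Exported only in the long variant. In the short variant nothing
-- refers to it, so it is dead code and GHC can drop it, together
-- with anything generated only on its behalf.
printLines :: String -> IO ()
printLines = mapM_ putStrLn . lines
```

Comparing the two `--show-iface` dumps shows which unfoldings and rules each variant exposes to importing modules.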




IterateeM.hs (35K) Download Attachment

RE: cross module optimization issues

Simon Peyton Jones
The specialisations are indeed caused (indirectly) by the presence of print_lines.  If print_lines is dead code (as it is when print_lines is not exported), then there are no calls to the overloaded functions at these specialised types, and so you don't get the specialised versions.  You can get specialised versions with a SPECIALISE pragma, or with SPECIALISE INSTANCE.

Does that make sense?

Simon
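
A minimal sketch of the first option, a SPECIALISE pragma on an overloaded function. The function `sumM` is hypothetical and stands in for any function with a `Monad m` constraint.

```haskell
-- Hypothetical example; 'sumM' is an illustrative name only.
import Control.Monad (foldM)

-- Overloaded in the monad: without specialisation, calls go through
-- a Monad dictionary passed at run time.
sumM :: Monad m => [Int] -> m Int
sumM = foldM (\acc x -> return (acc + x)) 0

-- Ask GHC to generate a copy of 'sumM' at the IO monad, even if this
-- module itself contains no call at that type.
{-# SPECIALISE sumM :: [Int] -> IO Int #-}
```

The pragma makes the specialised copy independent of which callers happen to survive dead-code elimination.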

| -----Original Message-----
| From: Neil Mitchell [mailto:[hidden email]]
| Sent: 28 November 2008 09:48
| To: Simon Peyton-Jones
| Cc: John Lato; [hidden email]; Don Stewart
| Subject: Re: cross module optimization issues

Re: cross module optimization issues

John Lato-2
Neil, thank you very much for taking the time to look at this; I
greatly appreciate it.

One thing I don't understand is why the specializations are caused by
print_lines.  I suppose the optimizer can infer something which it
couldn't otherwise.

If I read this properly, the functions being specialized are liftI,
(>>=), return, and $f2.  One thing I'm not sure about is when INLINE
provides the desired optimal behavior, as opposed to SPECIALIZE.  The
monad functions are defined in the Monad instance, and thus aren't
currently INLINE'd or SPECIALIZE'd.  However, if they were separate
functions, would INLINE be sufficient?  Would that give the optimizer
enough to work with to derive the specializations on its own?  I'll
have some time to experiment with this myself tomorrow, but I'd
appreciate some direction (rather than guessing blindly).

What is "$f2"?  I've seen that appear before, but I'm not sure where
it comes from.

Thanks,
John

On Fri, Nov 28, 2008 at 10:31 AM, Simon Peyton-Jones
<[hidden email]> wrote:

> The specialisations are indeed caused (indirectly) by the presence of print_lines.  If print_lines is dead code (as it is when print_lines is not exported), then there are no calls to the overloaded functions at these specialised types, and so you don't get the specialised versions.  You can get specialised versions by a SPECIALISE pragma, or SPECIALISE INSTANCE
>
> Does that make sense?
>
> Simon
>
> | -----Original Message-----
> | From: Neil Mitchell [mailto:[hidden email]]
> | Sent: 28 November 2008 09:48
> | To: Simon Peyton-Jones
> | Cc: John Lato; [hidden email]; Don Stewart
> | Subject: Re: cross module optimization issues
> |
> | Hi
> |
> | I've talked to John a bit, and discussed test cases etc. I've tracked
> | this down a little way.
> |
> | Given the attached file, compiling witih SHORT_EXPORT_LIST makes the
> | code go _slower_. By exporting the "print_lines" function the code
> | doubles in speed. This runs against everything I was expecting, and
> | that Simon has described.
> |
> | Taking a look at the .hi files for the two alternatives, there are two
> | differences:
> |
> | 1) In the faster .hi file, the body of print_lines is exported. This
> | is reasonable and expected.
> |
> | 2) In the faster .hi file, there are additional specialisations, which
> | seemingly have little/nothing to do with print_lines, but are omitted
> | if it is not exported:
> |
> | "SPEC >>= [GHC.IOBase.IO]" ALWAYS forall @ el
> |                                          $dMonad :: GHC.Base.Monad GHC.IOBase.IO
> |   Sound.IterateeM.>>= @ GHC.IOBase.IO @ el $dMonad
> |   = Sound.IterateeM.a
> |       `cast`
> |     (forall el1 a b.
> |      Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO a
> |      -> (a -> Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO b)
> |      -> trans
> |             (sym ((GHC.IOBase.:CoIO)
> |                       (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO b)))
> |             (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO b)))
> |       @ el
> | "SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
> |                                                          $dMonad ::
> | GHC.Base.Monad GHC.IOBase.IO
> |   Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
> |   = Sound.IterateeM.$s$f2 @ el
> | "SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
> |                                                          $dMonad ::
> | GHC.Base.Monad GHC.IOBase.IO
> |   Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
> |   = Sound.IterateeM.$s$f21 @ el
> | "SPEC Sound.IterateeM.liftI [GHC.IOBase.IO]" ALWAYS forall @ el
> |                                                            @ a
> |                                                            $dMonad ::
> | GHC.Base.Monad GHC.IOBase.IO
> |   Sound.IterateeM.liftI @ GHC.IOBase.IO @ el @ a $dMonad
> |   = Sound.IterateeM.$sliftI @ el @ a
> | "SPEC return [GHC.IOBase.IO]" ALWAYS forall @ el
> |                                             $dMonad :: GHC.Base.Monad
> | GHC.IOBase.IO
> |   Sound.IterateeM.return @ GHC.IOBase.IO @ el $dMonad
> |   = Sound.IterateeM.a7
> |       `cast`
> |     (forall el1 a.
> |      a
> |      -> trans
> |             (sym ((GHC.IOBase.:CoIO)
> |                       (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO a)))
> |             (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO a)))
> |       @ el
> |
> | My guess is that these cause the slowdown - but is there any reason
> | that print_lines not being exported should cause them to be omitted?
> |
> | All these tests were run on GHC 6.10.1 with -O2.
> |
> | Thanks
> |
> | Neil
> |
> |
> | On Fri, Nov 21, 2008 at 10:33 AM, Simon Peyton-Jones
> | <[hidden email]> wrote:
> | > | This project is based on Oleg's Iteratee code; I started using his
> | > | IterateeM.hs and Enumerator.hs files and added my own stuff to
> | > | Enumerator.hs (thanks Oleg, great work as always).  When I started
> | > | cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my
> | > | minimal test case increased from 19s to 43s.
> | > |
> | > | I've found two factors that contributed.  When I was cleaning up, I
> | > | also removed a bunch of unused functions from IterateeM.hs (some of
> | > | the test functions and functions specific to his running example of
> | > | HTTP encoding).  When I added those functions back in, and added
> | > | INLINE pragmas to the exported functions in MyEnum.hs, I got the
> | > | performance back.
> | > |
> | > | In general I hadn't added export lists to the modules yet, so all
> | > | functions should have been exported.
> | >
> | > I'm totally snowed under with backlog from my recent absence, so I can't look at this
> | > myself, but if anyone else wants to I'd be happy to support with advice and suggestions.
> | >
> | > In general, having an explicit export list is good for performance. I typed an extra section
> | > in the GHC performance resource http://haskell.org/haskellwiki/Performance/GHC to explain why.
> | > In general that page is where we should document user advice for performance in GHC.
> | >
> | > I can't explain why *adding* unused functions would change performance though!
> | >
> | > Simon

RE: cross module optimization issues

Simon Peyton Jones
The $f2 comes from the instance Monad (IterateeGM ...).
print_lines uses a specialised version of that instance, namely
        Monad (IterateeGM el IO)
The fact that print_lines uses it makes GHC generate a specialised version of the instance decl.

Even in the absence of print_lines you can generate the specialised instance thus:

instance Monad m => Monad (IterateeGM el m) where
    {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
    ... methods...

Does that help?

Simon
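
As a self-contained illustration of the pragma above, here is a minimal sketch. CountT, tick, countTo and runCount are hypothetical names standing in for IterateeGM and its clients; they are not from the package under discussion. The Functor and Applicative instances are only there because current versions of base require them as superclasses of Monad (GHC 6.10 did not).

```haskell
-- Toy stand-in for IterateeGM: threads an Int counter through a monad m.
-- (Hypothetical type, used only to show where the pragma goes.)
newtype CountT m a = CountT { runCountT :: Int -> m (a, Int) }

instance Monad m => Functor (CountT m) where
  fmap f (CountT g) = CountT $ \n -> do
    (a, n') <- g n
    return (f a, n')

instance Monad m => Applicative (CountT m) where
  pure a = CountT $ \n -> return (a, n)
  CountT f <*> CountT g = CountT $ \n -> do
    (h, n')  <- f n
    (a, n'') <- g n'
    return (h a, n'')

instance Monad m => Monad (CountT m) where
  -- Ask GHC for a copy of this instance with m fixed to IO, whether or
  -- not any code in this module happens to use it at that type.
  {-# SPECIALISE instance Monad (CountT IO) #-}
  CountT g >>= k = CountT $ \n -> do
    (a, n') <- g n
    runCountT (k a) n'

-- Increment the counter by one.
tick :: Monad m => CountT m ()
tick = CountT $ \n -> return ((), n + 1)

-- Overloaded client code: runs tick k times in any underlying monad.
countTo :: Monad m => Int -> CountT m ()
countTo k = mapM_ (const tick) [1 .. k]

-- Use the overloaded code at IO and return the final counter value.
runCount :: Int -> IO Int
runCount k = snd <$> runCountT (countTo k) 0

main :: IO ()
main = runCount 1000 >>= print  -- prints 1000
```

The pragma does not change what the program computes; it only affects which specialised code GHC generates. Whether it pays off in a given program is best checked by compiling with -O2 and inspecting the Core or the interface file, as was done earlier in this thread.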

| -----Original Message-----
| From: John Lato [mailto:[hidden email]]
| Sent: 28 November 2008 12:07
| To: Simon Peyton-Jones
| Cc: Neil Mitchell; [hidden email]; Don Stewart
| Subject: Re: cross module optimization issues
|
| Neil, thank you very much for taking the time to look at this; I
| greatly appreciate it.
|
| One thing I don't understand is why the specializations are caused by
| print_lines.  I suppose the optimizer can infer something which it
| couldn't otherwise.
|
| If I read this properly, the functions being specialized are liftI,
| (>>=), return, and $f2.  One thing I'm not sure about is when INLINE
| provides the desired optimal behavior, as opposed to SPECIALIZE.  The
| monad functions are defined in the Monad instance, and thus aren't
| currently INLINE'd or SPECIALIZE'd.  However, if they are separate
| functions, would INLINE be sufficient?  Would that give the optimizer
| enough to work with to derive the specializations on its own?  I'll
| have some time to experiment with this myself tomorrow, but I'd
| appreciate some direction (rather than guessing blindly).
|
| What is "$f2"?  I've seen that appear before, but I'm not sure where
| it comes from.
|
| Thanks,
| John
|
| On Fri, Nov 28, 2008 at 10:31 AM, Simon Peyton-Jones
| <[hidden email]> wrote:
| > The specialisations are indeed caused (indirectly) by the presence of print_lines.  If
| > print_lines is dead code (as it is when print_lines is not exported), then there are no calls
| > to the overloaded functions at these specialised types, and so you don't get the specialised
| > versions.  You can get specialised versions by a SPECIALISE pragma, or SPECIALISE INSTANCE.
| >
| > Does that make sense?
| >
| > Simon
| >
| > | -----Original Message-----
| > | From: Neil Mitchell [mailto:[hidden email]]
| > | Sent: 28 November 2008 09:48
| > | To: Simon Peyton-Jones
| > | Cc: John Lato; [hidden email]; Don Stewart
| > | Subject: Re: cross module optimization issues
| > |
| > | Hi
| > |
| > | I've talked to John a bit, and discussed test cases etc. I've tracked
| > | this down a little way.
| > |
| > | Given the attached file, compiling witih SHORT_EXPORT_LIST makes the
| > | code go _slower_. By exporting the "print_lines" function the code
| > | doubles in speed. This runs against everything I was expecting, and
| > | that Simon has described.
| > |
| > | Taking a look at the .hi files for the two alternatives, there are two
| > | differences:
| > |
| > | 1) In the faster .hi file, the body of print_lines is exported. This
| > | is reasonable and expected.
| > |
| > | 2) In the faster .hi file, there are additional specialisations, which
| > | seemingly have little/nothing to do with print_lines, but are omitted
| > | if it is not exported:
| > |
| > | "SPEC >>= [GHC.IOBase.IO]" ALWAYS forall @ el
| > |                                          $dMonad :: GHC.Base.Monad GHC.IOBase.IO
| > |   Sound.IterateeM.>>= @ GHC.IOBase.IO @ el $dMonad
| > |   = Sound.IterateeM.a
| > |       `cast`
| > |     (forall el1 a b.
| > |      Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO a
| > |      -> (a -> Sound.IterateeM.IterateeGM el1 GHC.IOBase.IO b)
| > |      -> trans
| > |             (sym ((GHC.IOBase.:CoIO)
| > |                       (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO b)))
| > |             (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO b)))
| > |       @ el
| > | "SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
| > |                                                          $dMonad ::
| > | GHC.Base.Monad GHC.IOBase.IO
| > |   Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
| > |   = Sound.IterateeM.$s$f2 @ el
| > | "SPEC Sound.IterateeM.$f2 [GHC.IOBase.IO]" ALWAYS forall @ el
| > |                                                          $dMonad ::
| > | GHC.Base.Monad GHC.IOBase.IO
| > |   Sound.IterateeM.$f2 @ GHC.IOBase.IO @ el $dMonad
| > |   = Sound.IterateeM.$s$f21 @ el
| > | "SPEC Sound.IterateeM.liftI [GHC.IOBase.IO]" ALWAYS forall @ el
| > |                                                            @ a
| > |                                                            $dMonad ::
| > | GHC.Base.Monad GHC.IOBase.IO
| > |   Sound.IterateeM.liftI @ GHC.IOBase.IO @ el @ a $dMonad
| > |   = Sound.IterateeM.$sliftI @ el @ a
| > | "SPEC return [GHC.IOBase.IO]" ALWAYS forall @ el
| > |                                             $dMonad :: GHC.Base.Monad
| > | GHC.IOBase.IO
| > |   Sound.IterateeM.return @ GHC.IOBase.IO @ el $dMonad
| > |   = Sound.IterateeM.a7
| > |       `cast`
| > |     (forall el1 a.
| > |      a
| > |      -> trans
| > |             (sym ((GHC.IOBase.:CoIO)
| > |                       (Sound.IterateeM.IterateeG el1 GHC.IOBase.IO a)))
| > |             (sym ((Sound.IterateeM.:CoIterateeGM) el1 GHC.IOBase.IO a)))
| > |       @ el
| > |
| > | My guess is that these cause the slowdown - but is there any reason
| > | that print_lines not being exported should cause them to be omitted?
| > |
| > | All these tests were run on GHC 6.10.1 with -O2.
| > |
| > | Thanks
| > |
| > | Neil
| > |
| > |
| > | On Fri, Nov 21, 2008 at 10:33 AM, Simon Peyton-Jones
| > | <[hidden email]> wrote:
| > | > | This project is based on Oleg's Iteratee code; I started using his
| > | > | IterateeM.hs and Enumerator.hs files and added my own stuff to
| > | > | Enumerator.hs (thanks Oleg, great work as always).  When I started
| > | > | cleaning up by moving my functions from Enumerator.hs to MyEnum.hs, my
| > | > | minimal test case increased from 19s to 43s.
| > | > |
| > | > | I've found two factors that contributed.  When I was cleaning up, I
| > | > | also removed a bunch of unused functions from IterateeM.hs (some of
| > | > | the test functions and functions specific to his running example of
| > | > | HTTP encoding).  When I added those functions back in, and added
| > | > | INLINE pragmas to the exported functions in MyEnum.hs, I got the
| > | > | performance back.
| > | > |
| > | > | In general I hadn't added export lists to the modules yet, so all
| > | > | functions should have been exported.
| > | >
| > | > I'm totally snowed under with backlog from my recent absence, so I can't look at this
| > | myself, but if anyone else wants to I'd be happy to support with advice and suggestions.
| > | >
| > | > In general, having an explicit export list is good for performance. I typed an extra
| section
| > | in the GHC performance resource http://haskell.org/haskellwiki/Performance/GHC to explain
| why.
| > | In general that page is where we should document user advice for performance in GHC.
| > | >
| > | > I can't explain why *adding* unused functions would change performance though!
| > | >
| > | > Simon
| > | >
| > | >
| > | > _______________________________________________
| > | > Glasgow-haskell-users mailing list
| > | > [hidden email]
| > | > http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
| > | >
| >


Re: cross module optimization issues

Pepe Iborra-3

On 28/11/2008, at 15:46, Simon Peyton-Jones wrote:

> The $f2 comes from the instance Monad (IterateeGM ...).
> print_lines uses a specialised version of that instance, namely
>        Monad (IterateeGM el IO)
> The fact that print_lines uses it makes GHC generate a specialised  
> version of the instance decl.
>
> Even in the absence of print_lines you can generate the specialised  
> instance thus
>
> instance Monad m => Monad (IterateeGM el m) where
>    {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
>        ... methods...
>
> does that help?


Once Simon and Neil had dug into the issue and analysed it, the reason
seems evident.
But this thread is a reminder of why writing high-performance Haskell code
is regarded as a black art outside the community (well, and sometimes
inside too).

Wouldn't a JIT version of GHC be a great thing to have?
Or would a backend for LLVM be already beneficial enough?


Cheers
pepe

Re: cross module optimization issues

John Lato-2
In reply to this post by Simon Peyton Jones
Yes, this does help, thank you.  I didn't know you could generate
specialized instances.  In fact, I was so sure that this was some
arcane feature I immediately went to the GHC User Guide because I
didn't believe it was documented.

I immediately stumbled upon Section 8.13.9.

Thanks to everyone who helped me with this.  I think I've achieved a
small bit of enlightenment.

Cheers,
John


Re: cross module optimization issues

Neil Mitchell
In reply to this post by Pepe Iborra-3
Hi

>> instance Monad m => Monad (IterateeGM el m) where
>>   {-# SPECIALISE instance Monad (IterateeGM el IO) #-}
>>
>> does that help?

Yes. With that specialise line in, we get identical performance
between the two results.

So, in summary:

The print_lines function uses the IterateeGM with IO as the underlying
monad, which causes GHC to specialise IterateeGM with IO. If
print_lines is not exported, then it is deleted as dead code, and the
specialisation is never generated. The specialisation is crucial for
performance later on. So, counterintuitively, keeping otherwise unused
code reachable lets GHC optimise better.
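
The same effect can be requested per function rather than per instance, using the plain SPECIALISE pragma Simon mentioned. The following is a minimal sketch; forEachM and sumIO are hypothetical names chosen for illustration, not functions from the iteratee code.

```haskell
import Data.IORef (newIORef, modifyIORef', readIORef)

-- Overloaded in the monad m, much like the iteratee combinators in this
-- thread. Without a specialised copy, calls pass a Monad dictionary at
-- run time.
forEachM :: Monad m => (Int -> m ()) -> [Int] -> m ()
forEachM _ []       = return ()
forEachM f (x : xs) = f x >> forEachM f xs
-- Request the IO copy explicitly, so that it does not depend on some
-- reachable caller in this module happening to use forEachM at IO.
{-# SPECIALISE forEachM :: (Int -> IO ()) -> [Int] -> IO () #-}

-- A caller at type IO: sums the list through a mutable reference.
sumIO :: [Int] -> IO Int
sumIO xs = do
  ref <- newIORef 0
  forEachM (\x -> modifyIORef' ref (+ x)) xs
  readIORef ref

main :: IO ()
main = sumIO [1 .. 100] >>= print  -- prints 5050
```

In this single-module sketch GHC would most likely create the IO specialisation anyway when optimising, because sumIO is reachable and calls forEachM at IO. The pragma matters in the situation summarised above, where the only such caller is dead code and the specialisation is needed by another module.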

> Once Simon and Neil had dug into the issue and analysed it, the reason seems
> evident. But this thread is a reminder of why writing high-performance Haskell
> code is regarded as a black art outside the community (well, and sometimes
> inside too).
>
> Wouldn't a JIT version of GHC be a great thing to have?
> Or would a backend for LLVM be already beneficial enough?

I don't think either would have the benefits offered by
specialisation. If GHC exported more information about instances, it
could do more specialisations later, but it is a trade-off. If you ran
GHC in some whole-program mode, then you wouldn't have the problem,
but would gain additional problems.

I always hoped Supero (http://www-users.cs.york.ac.uk/~ndm/supero/)
would remove some of the black art associated with program
optimisation - there are no specialise pragmas, and I'm pretty sure in
the above example it would have done the correct thing. In some ways,
whole-program and fewer special cases gives a much better mental model
of how optimisation might affect a program. Of course, it's still a
research prototype, but perhaps one day...

Thanks

Neil