Floating through small case analyses

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
1 message Options
Reply | Threaded
Open this post in threaded view
|

Floating through small case analyses

Ben Gamari-2

Hello Simon,

Today I've been working on trying to whip bytestring's Builder into
shape [1,2]. The last remaining issue is a performance problem that
appears to be due to over-zealous floating of some small literals in
every GHC version I've tested (7.6, 7.8, and 7.10). The test case can be
found here [3] and is characterized by repeated monoidal appends
(reproduced roughly here in half-Core/half-Haskell),

    loop s n | s `seq` n `seq` False = undefined
    loop _ 0 = mempty
    loop s n =
          singleton (s + (fromInteger @Word8 (__integer 0))) <>
          singleton (s + (fromInteger @Word8 (__integer 1))) <>
          singleton (s + (fromInteger @Word8 (__integer 2))) <>
          singleton (s + (fromInteger @Word8 (__integer 3))) <>
          ...
          singleton (s + (fromInteger @Word8 (__integer 15))) <>
          loop (s+16) (s-16)

The initial float-out stage floats all of the (fromInteger ...)
expressions out to the top level, resulting in definitions of the form,

    lvl_s1b3j :: Integer
    lvl_s1b3j = __integer 0

    lvl_s1b3k :: Word8
    lvl_s1b3k = $fBitsWord8_$cfromInteger lvl_s1b3j

Simplifier phase 2 then does some inlining and in so doing introduces a
boolean case analysis. Both alternatives of the case require the
now-top-level literals.

Unfortunately, the float-in pass is quite strict about when it will
float expressions inside case analyses. Specifically, it requires the
float to be both small [4] (which I believe these are) and used in more than
one but not all alternatives [5]. Given that there are only two
alternatives here this seems a bit too restrictive.

The end result is that we end up producing a bunch of bindings outside
of a boolean case analysis,

    case narrow8Word# sc_s1c39 of a_s1c3g { __DEFAULT ->
    case plusWord# sc_s1c39 (__word 1) of sat_s1c3i { __DEFAULT ->

    case narrow8Word# sat_s1c3i of a1_s1c3h { __DEFAULT ->
    case plusWord# sc_s1c39 (__word 2) of sat_s1c3k { __DEFAULT ->
   
    ...

    case plusWord# sc_s1c39 (__word 16) of sat_s1c3M { __DEFAULT ->
    case narrow8Word# sat_s1c3M of a16_s1c3L { __DEFAULT ->

    case tagToEnum# @ Bool sat_s1c3P of _ {
      False ->
        {-# write each of the values above to memory #-}

      True  ->
        {-# write each of the values above to memory #-}
    }

This ends up producing extremely poor assembly, with the compiler first
computing all 16 store addresses, placing them on the stack, and then
performing all 16 stores.

I'm a bit unsure of what to blame this on. Perhaps it would make sense
to always float small dupable expressions in to sufficiently small
case expressions (say, three or fewer alternatives)? Perhaps these
near-literals should never have been floated out at all? Perhaps there
another option I've not thought of?

Thoughts?

Cheers,

- Ben


[1]: https://github.com/kolmodin/binary/pull/65
[2]: https://github.com/haskell/bytestring/pull/40
[3]: https://gist.github.com/bgamari/22d127779a48113c1153#file-test-hs
[4]: https://github.com/ghc/ghc/blob/master/compiler/coreSyn/CoreUtils.hs#L710
[5]: https://github.com/ghc/ghc/blob/master/compiler/simplCore/FloatIn.hs#L527
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 472 bytes
Desc: not available
URL: <http://www.haskell.org/pipermail/ghc-devs/attachments/20150115/861aad55/attachment.sig>