ST Vector / STRef -- Performance and allocation


ST Vector / STRef -- Performance and allocation

Guillaume Bouchard
Hello.

I read a reddit post [0] which ends in a debate about Haskell
performance. Someone gave a C implementation which runs in 15ms on
some dataset. I wanted to learn about mutable vectors and ST, and got
something in Haskell which runs in 50ms. The code is here [1].

[0] https://www.reddit.com/r/haskell/comments/4ojsn0/counting_inversions_haskell/
[1] https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs

When I profile the code, I'm surprised to see more than 50% of the
%alloc in functions which, from my point of view, should not allocate.

For example, this function:

inc :: forall s. STRef s Int -> ST s ()
inc v = modifySTRef' v (+1)

is responsible for 10% of the allocations in this program. I was hoping
that GHC's "unboxing magic" might have replaced my Int with an unboxed
one somewhere, but I guess the allocation means that the STRef is
pointing to a heap-allocated Int.

Do you know a way to remove these allocations?

As a second question, if you see any other way to improve this
implementation, I'll be happy to learn something new ;)

Finally, if you read my code, you'll see that I implemented a small
DSL around ST / STRef / MVector to make my main function more
readable. How do people write readable code in real life?

In particular, I was annoyed when I tried to convert this piece of C code:

if (x1 < mid && (x2 == sz || tmp[x1] <= p[x2])) {
    a;
} else {
    b;
}

In my Haskell code, x1 and x2 are STRefs, and tmp and p are MVectors.
This was highly painful to write in Haskell, and I had to write
something like:

x1 <- readSTRef x1'
x2 <- readSTRef x2'

cond <- if x1 < mid
          then if x2 == sz
                 then return True
                 else do
                   tmp_x1 <- Vector.read tmp x1
                   p_x2 <- Vector.read p x2
                   return (tmp_x1 <= p_x2)
          else return False
if cond
  then a
  else b

This is painful, complex, and error-prone, so is there another solution?

Thank you.

Re: ST Vector / STRef -- Performance and allocation

Carter Schonwald
STRef is a boxed heap value.

If you use a single-element unboxed mutable vector, you may find there's less boxing :)
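
Something along these lines (just a sketch, untested; the names are
only illustrative):

import Control.Monad.ST (ST)
import qualified Data.Vector.Unboxed.Mutable as MV

-- A one-element unboxed mutable vector standing in for STRef s Int.
newCounter :: ST s (MV.MVector s Int)
newCounter = MV.replicate 1 0

inc :: MV.MVector s Int -> ST s ()
inc v = MV.unsafeRead v 0 >>= MV.unsafeWrite v 0 . (+1)

The Int then lives in a ByteArray# rather than behind a pointer, so
bumping it should not allocate a fresh box on every increment.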


Re: ST Vector / STRef -- Performance and allocation

Amos Robinson
In reply to this post by Guillaume Bouchard
Hi,
You might want to add INLINE pragmas to "inc" and so on.
STRef is boxed, so it will allocate. The 'URef' type here might be useful: http://hackage.haskell.org/package/mutable-containers-0.3.2.1/docs/Data-Mutable.html
It is just a wrapper around an unboxed mutable vector, as Carter suggests.
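
A rough sketch of what that could look like (untested; newRef, readRef
and modifyRef' are the Data.Mutable class methods, the rest of the
names are just for illustration):

import Control.Monad.ST (ST)
import Data.Mutable (URef, asURef, newRef, readRef, modifyRef')

-- Your inc, but on an unboxed reference.
inc :: URef s Int -> ST s ()
inc v = modifyRef' v (+1)

countUp :: Int -> ST s Int
countUp n = do
  ref <- fmap asURef (newRef 0)   -- asURef pins the reference type
  mapM_ (\_ -> inc ref) [1 .. n]
  readRef ref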

The sad story is that in my experience, if you really want decent performance you will have to dump the Core and inspect it by hand. You'd then add inline and bang patterns based on the Core. I do not recommend it as a relaxing weekend exercise. I usually use an invocation something like:
> ghc -ddump-prep -dppr-case-as-let -dsuppress-all -fforce-recomp inversions.hs > inversions.hscore



Re: ST Vector / STRef -- Performance and allocation

Guillaume Bouchard
Thank you for this answer.

On Sun, Jun 19, 2016 at 1:08 AM, Amos Robinson <[hidden email]> wrote:
> Hi,
> You might want to add INLINE pragmas to "inc" and so on.
> STRef is boxed, so it will allocate. The 'URef' type here might be useful:
> http://hackage.haskell.org/package/mutable-containers-0.3.2.1/docs/Data-Mutable.html
> It is just a wrapper around an unboxed mutable vector, as Carter suggests.

I added an INLINE pragma to all my functions and it does not change
anything, unfortunately.

I tried the mutable-containers package and the API is really great,
but I'm surprised by the results:

- PRef (Unboxed bytearray of size 1, the closest thing to an unboxed
stack allocation) : 86ms
- URef (Unboxed vector of size 1) : 130ms
- SRef (Storable vector of size 1) : 43ms (This is an improvement !)
- BRef (Boxed ref) : 137ms
- STRef : 54ms (this was my baseline)

This is based on creating the Ref using a "mutable" function such as:

mutable v = asF <$> newRef v

(with asF == asPRef, asURef, ...)

However, if I provide a type signature for mutable, I get totally
different results:

mutable :: forall (f :: * -> *) a.
                 (Prim a, PrimMonad f) =>
                 a -> f (MyRef (PrimState f) a)
mutable v = asF <$> newRef v

(with type MyRef = PRef, URef, ...)

- PRef: 39ms (huge improvement, and winner as expected)
- URef: 45ms
- SRef: 43ms
- BRef: 84ms
- STRef: 54ms

I don't understand the speed difference, which is affected only by
whether mutable has a type signature.

Now much of the allocation due to STRef has disappeared. There is
still some left, which annoys me.

Most of the allocations come from this line:

https://github.com/guibou/inversionsCount/blob/syntax/inversions.hs#L35

which is nothing more than a boolean expression combining different ST
actions that read from an unboxed Vector or a PRef, so I was expecting
no allocation. Perhaps the fact that it uses "high level" combinators
such as "<$>", "<*>" and "=<<" is responsible?

> The sad story is that in my experience, if you really want decent
> performance you will have to dump the Core and inspect it by hand. You'd
> then add inline and bang patterns based on the Core. I do not recommend it
> as a relaxing weekend exercise. I usually use an invocation something like:
>> ghc -ddump-prep -dppr-case-as-let -dsuppress-all -fforce-recomp
>> inversions.hs > inversions.hscore

Thank you ;) For now I'd rather read Intel x86 assembly than GHC
Core, but today is a rainy day, so I'll use it to start learning GHC
Core ;)

--
G.

Re: ST Vector / STRef -- Performance and allocation

Bertram Felgenhauer
Guillaume Bouchard wrote:
> - PRef (Unboxed bytearray of size 1, the closest thing to an unboxed
> stack allocation) : 86ms
> - URef (Unboxed vector of size 1) : 130ms
> - SRef (Storable vector of size 1) : 43ms (This is an improvement !)
> - BRef (Boxed ref) : 137ms
> - STRef : 54ms (this was my baseline)

You really shouldn't use any mutable variable at all for this, but pass
the values around as function arguments instead:

count_inv' :: V2.MVector s Int32 -> V2.MVector s Int32 -> ST s Int
count_inv' a buf
  | V.length a <= 1 = return 0
  | otherwise = do
      let len = V.length a
          mid = len `div` 2

      counta <- count_inv' (V.slice 0 mid a) buf
      countb <- count_inv' (V.slice mid (len - mid) a) buf

      V.unsafeCopy (V.slice 0 mid buf) (V.slice 0 mid a)

      let go idx1 idx2 count i = if i == len then return count else do
              cond <- return (idx1 < mid) .&&. (return (idx2 == len) .||. (V.unsafeRead buf idx1 .<=. V.unsafeRead a idx2))
              if cond then do
                  V.unsafeRead buf idx1 >>= V.unsafeWrite a i
                  go (idx1 + 1) idx2 (count + idx2 - mid) (i+1)
               else do
                  V.unsafeRead a idx2 >>= V.unsafeWrite a i
                  go idx1 (idx2 + 1) count (i+1)
      go 0 mid (counta + countb) (0 :: Int)
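
(The .&&., .||. and .<=. operators above are presumably from your DSL;
they are not in base, so I'm assuming definitions along these lines,
with .&&. and .||. short-circuiting:)

import Control.Monad (liftM2)

infixr 3 .&&.
infixr 2 .||.
infix  4 .<=.

(.&&.), (.||.) :: Monad m => m Bool -> m Bool -> m Bool
a .&&. b = a >>= \x -> if x then b else return False
a .||. b = a >>= \x -> if x then return True else b

(.<=.) :: (Monad m, Ord a) => m a -> m a -> m Bool
(.<=.) = liftM2 (<=)

The short-circuiting matters here: when idx2 == len, the unsafeRead on
a must not be executed at all.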

Besides, the code spends most of its time on parsing the input. The
following more low-level code does the job far more quickly:

import qualified Data.ByteString.Char8 as B
import Data.Char (ord)
import Data.Int (Int32)
import Data.List (unfoldr)

parse' :: IO [Int32]
parse' = do
    content <- B.getContents
    return $ map fromIntegral $ unfoldr (B.readInt . B.dropWhile (=='\n')) $ content

It's possible to improve this slightly by implementing the code from scratch:

parse'' :: IO [Int32]
parse'' = do
  content <- B.getContents
  return $ go content
 where
  go b = case B.uncons b of
      Nothing -> []
      Just ('\n',b) -> go b
      Just ('-',b) -> go'' 0 b
      Just (d,b) -> go' (fromIntegral (ord d - 48)) b
  go' v b = case B.uncons b of
      Nothing -> [v]
      Just ('\n',b) -> v : go b
      Just (d,b) -> go' (v*10 + fromIntegral (ord d - 48)) b
  go'' v b = case B.uncons b of
      Nothing -> [v]
      Just ('\n',b) -> v : go b
      Just (d,b) -> go'' (v*10 - fromIntegral (ord d - 48)) b

Taken together these changes improve the runtime from 79ms to 21ms here.

Cheers,

Bertram

Re: ST Vector / STRef -- Performance and allocation

Bertram Felgenhauer
Bertram Felgenhauer via Haskell-Cafe wrote:

> You really shouldn't use any mutable variable at all for this, but pass
> the values around as function arguments instead [...]

As to why, the reason is that to get good performance, these variables
should end up in registers. But the mutable variables in Haskell are
all heap-allocated objects, and afaik the compiler has no way of
allocating them elsewhere (on the stack, or in a register). So the only
way to get good code is to not use Haskell's mutable variables at all.

All this is specific to ghc, obviously.
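
As a toy illustration (a sketch, not taken from the inversion code):
both functions below compute the same sum, but the first updates a
heap cell on every step, while in the second GHC should be able to
keep the accumulator in a register as an unboxed Int#.

{-# LANGUAGE BangPatterns #-}
import Control.Monad.ST (ST)
import Data.STRef (newSTRef, readSTRef, modifySTRef')

-- Counter kept in a mutable heap cell.
sumRef :: [Int] -> ST s Int
sumRef xs = do
  r <- newSTRef 0
  mapM_ (\x -> modifySTRef' r (+ x)) xs
  readSTRef r

-- Counter threaded as a strict accumulator argument.
sumAcc :: [Int] -> Int
sumAcc = go 0
  where
    go !acc []       = acc
    go !acc (x : xs) = go (acc + x) xs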

Cheers,

Bertram


Re: ST Vector / STRef -- Performance and allocation

Guillaume Bouchard
In reply to this post by Bertram Felgenhauer
Thank you! I'm now on par with the C code ;)

On Sun, Jun 19, 2016 at 3:12 PM, Bertram Felgenhauer via Haskell-Cafe
<[hidden email]> wrote:

> You really shouldn't use any mutable variable at all for this, but pass
> the values around as function arguments instead:

Nice! This is so obvious that I did not think about it. Thank you.
(Thanks also for the follow-up mail; it is clear that it may be easier
for GHC to unbox / keep in a register a "normal variable" than a more
complex type such as the one involved in PRef.)

> Besides, the code spends most of its time on parsing the input. The
> following more low-level code does the job far more quickly:

Impressive. In the profile, the parsing accounted for only 10% of the
running time. Perhaps the profiler introduces more overhead on the
vector part than on the parsing.

> import Data.ByteString.Char8 as B
>
> parse' :: IO [Int32]
> parse' = do
>     content <- B.getContents
>     return $ map fromIntegral $ unfoldr (B.readInt . B.dropWhile (=='\n')) $ content

This is on par with the C code: ~16ms.

> It's possible to improve this slightly by implementing the code from scratch:
>
> parse'' :: IO [Int32]
> parse'' = do
>   content <- B.getContents
>   return $ go content
>  where
>   go b = case B.uncons b of
>       Nothing -> []
>       Just ('\n',b) -> go b
>       Just ('-',b) -> go'' 0 b
>       Just (d,b) -> go' (fromIntegral (ord d - 48)) b
>   go' v b = case B.uncons b of
>       Nothing -> [v]
>       Just ('\n',b) -> v : go b
>       Just (d,b) -> go' (v*10 + fromIntegral (ord d - 48)) b
>   go'' v b = case B.uncons b of
>       Nothing -> [v]
>       Just ('\n',b) -> v : go b
>       Just (d,b) -> go'' (v*10 - fromIntegral (ord d - 48)) b
>
> Taken together these changes improve the runtime from 79ms to 21ms here.

This is better than the C code ;)

Thank you. I learned a few things today.

--
Guillaume

Re: ST Vector / STRef -- Performance and allocation

Bertram Felgenhauer
Guillaume Bouchard wrote:

> Thank you! I'm now on par with the C code ;)
>
> On Sun, Jun 19, 2016 at 3:12 PM, Bertram Felgenhauer via Haskell-Cafe
> <[hidden email]> wrote:
> > You really shouldn't use any mutable variable at all for this, but pass
> > the values around as function arguments instead:
>
> Nice! This is so obvious that I did not think about it. Thank you.
> (Thanks also for the follow-up mail; it is clear that it may be easier
> for GHC to unbox / keep in a register a "normal variable" than a more
> complex type such as the one involved in PRef.)

I'd say that the complexity of the type isn't the main obstacle here,
but mutability. The point is that when a value is unboxed, one
essentially creates a copy of it (on the stack, and possibly on the heap
if one ends up re-boxing the value later on). For ordinary Haskell values
that is not a problem; they are immutable and a copy is identical, for
all purposes, to the original. The only potential downside to unboxing
is that some memory may be wasted. With mutable data, however, copies
cannot track updates to the original, so unboxing is no longer safe
unconditionally. It still could be done if the mutable structure is
never shared, but it requires a precise non-sharing analysis. I suspect
that this kind of analysis does not (yet?) exist in ghc, because it
would not benefit pure code.

Cheers,

Bertram

Re: ST Vector / STRef -- Performance and allocation

Stefan Monnier
> unconditionally. It still could be done if the mutable structure is
> never shared, but it requires a precise non-sharing analysis. I suspect
> that this kind of analysis does not (yet?) exist in ghc, because it
> would not benefit pure code.

Since these mutable boxes seem to mimic plain C variables, it's indeed
better to map them to immutable Haskell variables.  After all, that's
exactly what the C compiler will do via SSA.

GHC could try to do the same with Refs, but it would only be beneficial
on "C-style Haskell code": usually when Refs are used, it's precisely
because SSA doesn't work for that variable, such as when the equivalent
C code would use & on it.
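
A sketch of such a case (illustrative only): the same counter is
shared by two separately returned actions, so it cannot simply be
threaded through one loop as an argument -- the C analogue would take
the variable's address.

import Control.Monad.ST (ST, runST)
import Data.STRef (newSTRef, readSTRef, modifySTRef')

sharedCounter :: ST s (ST s (), ST s Int)
sharedCounter = do
  r <- newSTRef (0 :: Int)
  return (modifySTRef' r (+ 1), readSTRef r)

example :: Int
example = runST $ do
  (tick, peek) <- sharedCounter
  tick
  tick
  peek   -- yields 2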


        Stefan
