When are MVars better than STM?

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
10 messages Options
Reply | Threaded
Open this post in threaded view
|

When are MVars better than STM?

Thomas Koster
Hi friends,

Using Criterion, I have been running benchmarks to measure the
relative performance of STM and MVars for some simple transactions
that I expect will be typical in my application. I am using GHC 7.10.2
and libraries as at Stackage LTS 3.2.

I have found that STM is faster than MVars in all my benchmarks,
without exception. This seems to go against accepted wisdom [1][2][3].
I have not included my source code here to save space, but if you
suspect that I am using MVars incorrectly, just say so and I will post
my source code separately.

I have two questions:

1. When are MVars faster than STM? If the answer is "never", then when
are MVars "better" than STM? (Choose your own definition of "better".)

2. When given two capabilities (+RTS -N2), MVars are suddenly an order
of magnitude slower than with just one capability. Why?


For those who want details:

My benchmark forks four Haskell threads. Each thread repeats a
transaction that increments a shared counter many, many times. These
transactions must be serialized. The counter is therefore highly
contended. One version uses an MVar to store the counter in the
obvious way. The other version uses a TVar instead.

By the way, simply using "atomic-primops" to increment the counter
won't do because the increment operation is actually a mock substitute
for a more complex operation. I use the counter for my benchmarks
because the real operation needs much more memory and I don't want the
additional, unpredictable GC cost to affect my measurements.

Typical measurements are:

1 capability, using MVar: 37.30 ms
1 capability, using TVar: 24.88 ms
2 capabilities, using MVar: 1.564 s
2 capabilities, using TVar: 80.09 ms
4 capabilities, using MVar: 2.890 s
4 capabilities, using TVar: 207.8 ms

Notice that the MVar version suddenly slows by an order of magnitude
when run with more than one capability. Why is this so? (This is
question 2.)

Despite the absolute time elapsed, I realize that the CPU usage
characteristics of the two versions are also quite different. I
realize that the MVar version interlocks the four threads so that only
one capability is ever busy at a time, irrespective of the number of
capabilities available, whereas the STM version allows up to four
capabilities to be busy at once. However, I believe that the
additional parallel transactions in the STM version would be mostly
wasted, destined to be retried. Unless I am mistaken, this assumption
appears to be consistent with the observation that the STM version
with -N1 is the fastest of all. Despite all this wasted work by the
thundering herd, the total CPU time (i.e. my power bill) for the STM
version is still less than for the MVar version, because the MVar
version so dramatically slow.

Paradoxically, MVars seem to be the wrong tool for this job. So when
are MVars faster than STM? (This is question 1.)

[1] https://stackoverflow.com/questions/15439966/when-why-use-an-mvar-over-a-tvar
[2] https://www.reddit.com/r/haskell/comments/39ef3y/ioref_vs_mvar_vs_tvar_vs_tmvar/
[3] https://mail.haskell.org/pipermail/haskell-cafe/2014-January/112158.html

Thanks,
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Christopher Allen
Could you post the code please?

On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster <[hidden email]> wrote:
Hi friends,

Using Criterion, I have been running benchmarks to measure the
relative performance of STM and MVars for some simple transactions
that I expect will be typical in my application. I am using GHC 7.10.2
and libraries as at Stackage LTS 3.2.

I have found that STM is faster than MVars in all my benchmarks,
without exception. This seems to go against accepted wisdom [1][2][3].
I have not included my source code here to save space, but if you
suspect that I am using MVars incorrectly, just say so and I will post
my source code separately.

I have two questions:

1. When are MVars faster than STM? If the answer is "never", then when
are MVars "better" than STM? (Choose your own definition of "better".)

2. When given two capabilities (+RTS -N2), MVars are suddenly an order
of magnitude slower than with just one capability. Why?


For those who want details:

My benchmark forks four Haskell threads. Each thread repeats a
transaction that increments a shared counter many, many times. These
transactions must be serialized. The counter is therefore highly
contended. One version uses an MVar to store the counter in the
obvious way. The other version uses a TVar instead.

By the way, simply using "atomic-primops" to increment the counter
won't do because the increment operation is actually a mock substitute
for a more complex operation. I use the counter for my benchmarks
because the real operation needs much more memory and I don't want the
additional, unpredictable GC cost to affect my measurements.

Typical measurements are:

1 capability, using MVar: 37.30 ms
1 capability, using TVar: 24.88 ms
2 capabilities, using MVar: 1.564 s
2 capabilities, using TVar: 80.09 ms
4 capabilities, using MVar: 2.890 s
4 capabilities, using TVar: 207.8 ms

Notice that the MVar version suddenly slows by an order of magnitude
when run with more than one capability. Why is this so? (This is
question 2.)

Despite the absolute time elapsed, I realize that the CPU usage
characteristics of the two versions are also quite different. I
realize that the MVar version interlocks the four threads so that only
one capability is ever busy at a time, irrespective of the number of
capabilities available, whereas the STM version allows up to four
capabilities to be busy at once. However, I believe that the
additional parallel transactions in the STM version would be mostly
wasted, destined to be retried. Unless I am mistaken, this assumption
appears to be consistent with the observation that the STM version
with -N1 is the fastest of all. Despite all this wasted work by the
thundering herd, the total CPU time (i.e. my power bill) for the STM
version is still less than for the MVar version, because the MVar
version so dramatically slow.

Paradoxically, MVars seem to be the wrong tool for this job. So when
are MVars faster than STM? (This is question 1.)

[1] https://stackoverflow.com/questions/15439966/when-why-use-an-mvar-over-a-tvar
[2] https://www.reddit.com/r/haskell/comments/39ef3y/ioref_vs_mvar_vs_tvar_vs_tmvar/
[3] https://mail.haskell.org/pipermail/haskell-cafe/2014-January/112158.html

Thanks,
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe



--
Chris Allen
Currently working on http://haskellbook.com

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Thomas Koster
On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster <[hidden email]> wrote:

> Using Criterion, I have been running benchmarks to measure the
> relative performance of STM and MVars for some simple transactions
> that I expect will be typical in my application. I am using GHC 7.10.2
> and libraries as at Stackage LTS 3.2.
>
> I have found that STM is faster than MVars in all my benchmarks,
> without exception. This seems to go against accepted wisdom [1][2][3].
> I have not included my source code here to save space, but if you
> suspect that I am using MVars incorrectly, just say so and I will post
> my source code separately.
>
> I have two questions:
>
> 1. When are MVars faster than STM? If the answer is "never", then when
> are MVars "better" than STM? (Choose your own definition of "better".)
>
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an order
> of magnitude slower than with just one capability. Why?


On 24 January 2016 at 17:55, Christopher Allen <[hidden email]> wrote:
> Could you post the code please?


module Main (main) where

import Control.Concurrent.Async
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad
import Criterion.Main

main =
  defaultMain
    [
      bgroup "thrash"
        [
          bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet),
          bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet)
        ]
    ]

thrashTest :: IO a
           -> (a -> IO ())
           -> (a -> IO b)
           -> IO b
thrashTest new inc get = do
  var <- new
  threads <- replicateM 4 (async (replicateM_ 100000 $ inc var))
  forM_ threads wait
  get var

mvarNew :: IO (MVar Int)
mvarNew = newMVar 0

mvarInc :: MVar Int -> IO ()
mvarInc var =
  modifyMVar_ var $ \ i ->
    return $! succ i

mvarGet :: MVar Int -> IO Int
mvarGet = readMVar

tvarNew :: IO (TVar Int)
tvarNew = newTVarIO 0

tvarInc :: TVar Int -> IO ()
tvarInc var =
  atomically $ do
    i <- readTVar var
    writeTVar var $! succ i

tvarGet :: TVar Int -> IO Int
tvarGet = readTVarIO

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Ryan Yates
Hi Thomas,

I'm sorry I don't have time right now for a proper response (buried under paper deadlines).  There are certainly times when one will be faster then the other and the reasons are quite complicated.  To complicate matters further it is very difficult to get benchmarks that don't lie about performance in this space.  There are also alternative implementations that change the balance drastically.  The only broad advice I can give is to benchmark the target application with both implementations to see how all the implications fall out.  A broad description of the differences in implementation would be that MVars have a fairness guarantee (that does not come for free) for waking waiting threads.  STM does not have this fairness which can lead to problems for programs that have quick transactions that always win over occasional long transactions (there are ways to avoid with a different implementation or with the cost of shifted to the programmer).  My guess is in your particular benchmark the unfairness of STM works to your advantage and all the work is happening sequentially while the MVar version's fairness incurs frequent cache misses.


Ryan

On Sun, Jan 24, 2016 at 2:13 AM, Thomas Koster <[hidden email]> wrote:
On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster <[hidden email]> wrote:
> Using Criterion, I have been running benchmarks to measure the
> relative performance of STM and MVars for some simple transactions
> that I expect will be typical in my application. I am using GHC 7.10.2
> and libraries as at Stackage LTS 3.2.
>
> I have found that STM is faster than MVars in all my benchmarks,
> without exception. This seems to go against accepted wisdom [1][2][3].
> I have not included my source code here to save space, but if you
> suspect that I am using MVars incorrectly, just say so and I will post
> my source code separately.
>
> I have two questions:
>
> 1. When are MVars faster than STM? If the answer is "never", then when
> are MVars "better" than STM? (Choose your own definition of "better".)
>
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an order
> of magnitude slower than with just one capability. Why?


On 24 January 2016 at 17:55, Christopher Allen <[hidden email]> wrote:
> Could you post the code please?


module Main (main) where

import Control.Concurrent.Async
import Control.Concurrent.MVar
import Control.Concurrent.STM
import Control.Monad
import Criterion.Main

main =
  defaultMain
    [
      bgroup "thrash"
        [
          bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet),
          bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet)
        ]
    ]

thrashTest :: IO a
           -> (a -> IO ())
           -> (a -> IO b)
           -> IO b
thrashTest new inc get = do
  var <- new
  threads <- replicateM 4 (async (replicateM_ 100000 $ inc var))
  forM_ threads wait
  get var

mvarNew :: IO (MVar Int)
mvarNew = newMVar 0

mvarInc :: MVar Int -> IO ()
mvarInc var =
  modifyMVar_ var $ \ i ->
    return $! succ i

mvarGet :: MVar Int -> IO Int
mvarGet = readMVar

tvarNew :: IO (TVar Int)
tvarNew = newTVarIO 0

tvarInc :: TVar Int -> IO ()
tvarInc var =
  atomically $ do
    i <- readTVar var
    writeTVar var $! succ i

tvarGet :: TVar Int -> IO Int
tvarGet = readTVarIO

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Yuras Shumovich
In reply to this post by Thomas Koster
On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:

> 2. When given two capabilities (+RTS -N2), MVars are suddenly an
> order
> of magnitude slower than with just one capability. Why?

One possible explanation is closure locking which is not performed when
there is only one capability. In my quick measurements it gives 40%
speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9


_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Thomas Koster
In reply to this post by Ryan Yates
Ryan,

On Sun, Jan 24, 2016 at 12:46 AM, Thomas Koster <[hidden email]> wrote:

> Using Criterion, I have been running benchmarks to measure the
> relative performance of STM and MVars for some simple transactions
> that I expect will be typical in my application. I am using GHC 7.10.2
> and libraries as at Stackage LTS 3.2.
>
> I have found that STM is faster than MVars in all my benchmarks,
> without exception. This seems to go against accepted wisdom [1][2][3].
> I have not included my source code here to save space, but if you
> suspect that I am using MVars incorrectly, just say so and I will post
> my source code separately.
>
> I have two questions:
>
> 1. When are MVars faster than STM? If the answer is "never", then when
> are MVars "better" than STM? (Choose your own definition of "better".)
>
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an order
> of magnitude slower than with just one capability. Why?

On 25 January 2016 at 01:04, Ryan Yates <[hidden email]> wrote:
> I'm sorry I don't have time right now for a proper response (buried under
> paper deadlines).  There are certainly times when one will be faster then
> the other and the reasons are quite complicated.  To complicate matters
> further it is very difficult to get benchmarks that don't lie about
> performance in this space.  There are also alternative implementations that
> change the balance drastically.  The only broad advice I can give is to
> benchmark the target application with both implementations to see how all
> the implications fall out.

That is fair. From what I can tell, the time spent in the runtime
dominates my user time, so I am basically benchmarking the GHC
runtime, which I am not qualified to do :)

I had only hoped to be able to decide on MVar vs STM before getting
into the nitty gritty.

> A broad description of the differences in
> implementation would be that MVars have a fairness guarantee (that does not
> come for free) for waking waiting threads.  STM does not have this fairness
> which can lead to problems for programs that have quick transactions that
> always win over occasional long transactions (there are ways to avoid with a
> different implementation or with the cost of shifted to the programmer).  My
> guess is in your particular benchmark the unfairness of STM works to your
> advantage and all the work is happening sequentially while the MVar
> version's fairness incurs frequent cache misses.

Fairness may actually be very important to my application. Unlike my
benchmark, the complexity of real transactions can vary enormously.
Let me think about this.

Thanks for your response.

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Thomas Koster
In reply to this post by Yuras Shumovich
Yuras,

On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an
> order
> of magnitude slower than with just one capability. Why?

On 25 January 2016 at 02:09, Yuras Shumovich <[hidden email]> wrote:
> One possible explanation is closure locking which is not performed when
> there is only one capability. In my quick measurements it gives 40%
> speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9

This makes sense. After all, why bother with locks and barriers when
the process is single-threaded anyway?

Thanks for your response.

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Christopher Allen
Well, there are cases where even with single-threading you want a memory barrier to prevent the CPU reordering instructions, but the shift to a single-threaded runtime should elide _some_ locks expressly designed to cope with multithreading.

On Sun, Jan 24, 2016 at 5:04 PM, Thomas Koster <[hidden email]> wrote:
Yuras,

On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an
> order
> of magnitude slower than with just one capability. Why?

On 25 January 2016 at 02:09, Yuras Shumovich <[hidden email]> wrote:
> One possible explanation is closure locking which is not performed when
> there is only one capability. In my quick measurements it gives 40%
> speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9

This makes sense. After all, why bother with locks and barriers when
the process is single-threaded anyway?

Thanks for your response.

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe



--
Chris Allen
Currently working on http://haskellbook.com

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

John Lato-2

This has nothing to do with your questions, but are you sure that mvarInc is sufficiently strict?


On 15:17, Sun, Jan 24, 2016 Christopher Allen <[hidden email]> wrote:
Well, there are cases where even with single-threading you want a memory barrier to prevent the CPU reordering instructions, but the shift to a single-threaded runtime should elide _some_ locks expressly designed to cope with multithreading.

On Sun, Jan 24, 2016 at 5:04 PM, Thomas Koster <[hidden email]> wrote:
Yuras,

On Sun, 2016-01-24 at 17:46 +1100, Thomas Koster wrote:
> 2. When given two capabilities (+RTS -N2), MVars are suddenly an
> order
> of magnitude slower than with just one capability. Why?

On 25 January 2016 at 02:09, Yuras Shumovich <[hidden email]> wrote:
> One possible explanation is closure locking which is not performed when
> there is only one capability. In my quick measurements it gives 40%
> speedup: https://ghc.haskell.org/trac/ghc/ticket/693#comment:9

This makes sense. After all, why bother with locks and barriers when
the process is single-threaded anyway?

Thanks for your response.

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe



--
Chris Allen
Currently working on http://haskellbook.com
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: When are MVars better than STM?

Thomas Koster
On 24 January 2016 at 17:46, Thomas Koster <[hidden email]> wrote:
> I have found that STM is faster than MVars in all my benchmarks,
> without exception. This seems to go against accepted wisdom.

On 24 January 2016 at 18:13, Thomas Koster <[hidden email]> wrote:

> module Main (main) where
>
> import Control.Concurrent.Async
> import Control.Concurrent.MVar
> import Control.Concurrent.STM
> import Control.Monad
> import Criterion.Main
>
> main =
>   defaultMain
>     [
>       bgroup "thrash"
>         [
>           bench "MVar" $ whnfIO (thrashTest mvarNew mvarInc mvarGet),
>           bench "TVar" $ whnfIO (thrashTest tvarNew tvarInc tvarGet)
>         ]
>     ]
>
> thrashTest :: IO a
>            -> (a -> IO ())
>            -> (a -> IO b)
>            -> IO b
> thrashTest new inc get = do
>   var <- new
>   threads <- replicateM 4 (async (replicateM_ 100000 $ inc var))
>   forM_ threads wait
>   get var
>
> mvarNew :: IO (MVar Int)
> mvarNew = newMVar 0
>
> mvarInc :: MVar Int -> IO ()
> mvarInc var =
>   modifyMVar_ var $ \ i ->
>     return $! succ i
>
> mvarGet :: MVar Int -> IO Int
> mvarGet = readMVar
>
> tvarNew :: IO (TVar Int)
> tvarNew = newTVarIO 0
>
> tvarInc :: TVar Int -> IO ()
> tvarInc var =
>   atomically $ do
>     i <- readTVar var
>     writeTVar var $! succ i
>
> tvarGet :: TVar Int -> IO Int
> tvarGet = readTVarIO

On 28 January 2016 at 16:26, John Lato <[hidden email]> wrote:
> This has nothing to do with your questions, but are you sure that mvarInc is
> sufficiently strict?

I think so. If you think it isn't, I would love to know why, since
strictness and correct use of seq are still a bit of a black art for
me.

The strictness characteristics of the MVar version and the STM version
as written ought to be identical. If not, I would love to know why as
well.

--
Thomas Koster
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/haskell-cafe