Help with "shootout"

Previous Topic Next Topic
 
classic Classic list List threaded Threaded
12 messages Options
Reply | Threaded
Open this post in threaded view
|

Help with "shootout"

haskell-2
Hello all,

I decided to put together an entry for one of the shooutout categories:

http://haskell.org/hawiki/ChameneosEntry

It involves 4 threads. There is no current Haskell entry.

Anyone have comments? Alterations?  I can't have created the best code yet.

I don't have the compilers for the other entries, so I can't compare
speed on my powerbook (which is the wrong kind of CPU anyway -- they
test on an AMD system).

--
Chris
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

Isaac Gouy-2
--- Chris Kuklewicz <[hidden email]>
wrote:
-snip-
> which is the wrong kind of CPU anyway -- they
> test on an AMD system

"What machine are you running the programs on?"
http://shootout.alioth.debian.org/gp4/faq.php#machine


       
               
__________________________________
Yahoo! for Good - Make a difference this year.
http://brand.yahoo.com/cybergivingweek2005/
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

haskell-2
In reply to this post by haskell-2
Hello,

  I have posted two faster versions to
http://haskell.org/hawiki/ChameneosEntry

The faster of the two uses an extra manager thread, the slower uses STM.

--
Chris

Chris Kuklewicz wrote:
> Hello all,
>
> I decided to put together an entry for one of the shooutout categories:
>
> http://haskell.org/hawiki/ChameneosEntry
>
> It involves 4 threads. There is no current Haskell entry.
>
> Anyone have comments? Alterations?  I can't have created the best code
yet.
>
> I don't have the compilers for the other entries, so I can't compare
> speed on my powerbook (which is the wrong kind of CPU anyway -- they
> test on an AMD system).
>

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

haskell-2
Hello,

  Einar Kartunen sped up the code using a custom channel implementation.
   This increased speed by a factor of over 2.  The wiki at
http://haskell.org/hawiki/ChameneosEntry has the latest version.

  This makes me ponder one of the things that Joel was trying to do:
efficiently pass data to a logging thread.  It may be that a custom
channel would be helpful for that as well.

--
Chris Kuklewicz

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re[2]: Help with "shootout"

Bulat Ziganshin
Hello Chris,

Tuesday, January 03, 2006, 12:20:26 AM, you wrote:

CK>   Einar Kartunen sped up the code using a custom channel implementation.
CK>    This increased speed by a factor of over 2.  The wiki at
CK> http://haskell.org/hawiki/ChameneosEntry has the latest version.

can these channels be used in general-purpose code?

CK>   This makes me ponder one of the things that Joel was trying to do:
CK> efficiently pass data to a logging thread.  It may be that a custom
CK> channel would be helpful for that as well.

last variant of his code used just MVar-protected direct hPutStr
operations


--
Best regards,
 Bulat                            mailto:[hidden email]



_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

Joel Reymont
In reply to this post by haskell-2

On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:

>   This makes me ponder one of the things that Joel was trying to do:
> efficiently pass data to a logging thread.  It may be that a custom
> channel would be helpful for that as well.

I have not taken the time to analyze the Chameneos code but need to  
point out that my problem was not with efficiently passing data to  
the logging thread. The issue was with data accumulating in the  
channel and the logger thread not reading it out fast enough.

The TChan implementation is a single-linked list implemented on top  
of TVar's. That would seem pretty efficient to me.

        Thanks, Joel

--
http://wagerlabs.com/





_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

Matt Harden
On Tue, Jan 03, 2006 at 12:07:43AM +0000, Joel Reymont wrote:

>
> On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:
>
> >  This makes me ponder one of the things that Joel was trying to do:
> >efficiently pass data to a logging thread.  It may be that a custom
> >channel would be helpful for that as well.
>
> I have not taken the time to analyze the Chameneos code but need to  
> point out that my problem was not with efficiently passing data to  
> the logging thread. The issue was with data accumulating in the  
> channel and the logger thread not reading it out fast enough.
>
> The TChan implementation is a single-linked list implemented on top  
> of TVar's. That would seem pretty efficient to me.

It's simple and efficient but does nothing to prevent the channel from
growing out of control.  A slightly modified (custom) channel based on
TChan, but enforcing a maximum size (blocking on insert if the channel
is too full), probably would have solved the problem.

I assume that Erlang either does that or increases the priority of
threads with large event queues, or both.

Thanks,
Matt Harden
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

haskell-2
In reply to this post by Bulat Ziganshin
Bulat Ziganshin wrote:
> Hello Chris,
>
> Tuesday, January 03, 2006, 12:20:26 AM, you wrote:
>
> CK>   Einar Kartunen sped up the code using a custom channel implementation.
> CK>    This increased speed by a factor of over 2.  The wiki at
> CK> http://haskell.org/hawiki/ChameneosEntry has the latest version.
>
> can these channels be used in general-purpose code?

The latest Ch code is very very short:

> {- Ch : fast unordered channel implementation -}
> newtype Ch a = Ch (MVar [a], MVar a)
>
> newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
>
> readCh (Ch (w,r)) = takeMVar w >>= \lst ->
>   case lst of (x:xs) -> putMVar w xs >> return x
>               []     -> putMVar w [] >> takeMVar r
>
> writeCh (Ch (w,r)) x = do
>   ok <- tryPutMVar r x -- opportunistic, helps for this problem
>   unless ok $ takeMVar w >>= \lst -> do
>     ok <- tryPutMVar r x  -- safe inside take/put
>     putMVar w $ if ok then lst else (x:lst)
>

It could be used in general purpose code, note the parametric type "a"
in "Ch a".  It makes absolutely no guarantees about the order of values.
 That means that the order they are written is unlikely to be the order
in which they are read.  Writes to the channel are non-blocking and the
"MVar [a]" holds some items waiting to be read (in LIFO order).  The
"MVar a" allows a reader to block and wait for an empty channel to get
an item.  A small amount of extra speed comes from the writer's
"opportunistic" attempt to not take the w MVar unless it needs to.  But
note that readCh always takes the w MVar, and can ignore the r MVar.
This choice was determined by benchmarking.

Alternative, but slower for this case, functions for readCh and writeCh are

> readCh' (Ch (w,r)) = do
>   mx <- tryTakeMVar r
>   case mx of
>     Just x -> return x
>     Nothing -> takeMVar w >>= \lst -> case lst of (x:xs) -> putMVar w xs >> return x
>                                                   []     -> putMVar w [] >> takeMVar r
>
> writeCh' (Ch (w,r)) x = takeMVar w >>= \lst -> do
>     ok <- tryPutMVar r x  -- safe inside take/put
>     putMVar w $ if ok then lst else (x:lst)
>

But in this instance, using either of these would be slower.  The
balance between readers (one here) and writers (four here) and their
average speed is what determines the optimum readCh/writeCh code.
Different usage would benefit from different choices.

>
> CK>   This makes me ponder one of the things that Joel was trying to do:
> CK> efficiently pass data to a logging thread.  It may be that a custom
> CK> channel would be helpful for that as well.
>
> last variant of his code used just MVar-protected direct hPutStr
> operations

My point was more that Haskell allows you to make your own channels and
that it is possible to do better than the provided ones.

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

Joel Reymont
It seems like the real difference between TChan and the Ch code below  
is that TChan is, basically, [TVar a] whereas Ch is MVar [a], plus  
the order is guaranteed for a TChan.

Now why would it matter so much speed-wise?

This is the CVS code. newTChanIO is exported but undocumented in GHC  
6.4.1. I'm not sure what purpose it serves.

-- | 'TChan' is an abstract type representing an unbounded FIFO channel.
data TChan a = TChan (TVar (TVarList a)) (TVar (TVarList a))

type TVarList a = TVar (TList a)
data TList a = TNil | TCons a (TVarList a)

newTChan :: STM (TChan a)
newTChan = do
   hole <- newTVar TNil
   read <- newTVar hole
   write <- newTVar hole
   return (TChan read write)

newTChanIO :: IO (TChan a)
newTChanIO = do
   hole <- newTVarIO TNil
   read <- newTVarIO hole
   write <- newTVarIO hole
   return (TChan read write)

writeTChan :: TChan a -> a -> STM ()
writeTChan (TChan _read write) a = do
   listend <- readTVar write -- listend == TVar pointing to TNil
   new_listend <- newTVar TNil
   writeTVar listend (TCons a new_listend)
   writeTVar write new_listend

readTChan :: TChan a -> STM a
readTChan (TChan read _write) = do
   listhead <- readTVar read
   head <- readTVar listhead
   case head of
     TNil -> retry
     TCons a tail -> do
         writeTVar read tail
         return a

On Jan 3, 2006, at 11:25 AM, Chris Kuklewicz wrote:

> The latest Ch code is very very short:
>
>> {- Ch : fast unordered channel implementation -}
>> newtype Ch a = Ch (MVar [a], MVar a)
>>
>> newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
>>
>> readCh (Ch (w,r)) = takeMVar w >>= \lst ->
>>   case lst of (x:xs) -> putMVar w xs >> return x
>>               []     -> putMVar w [] >> takeMVar r
>>
>> writeCh (Ch (w,r)) x = do
>>   ok <- tryPutMVar r x -- opportunistic, helps for this problem
>>   unless ok $ takeMVar w >>= \lst -> do
>>     ok <- tryPutMVar r x  -- safe inside take/put
>>     putMVar w $ if ok then lst else (x:lst)
>>
>
> It could be used in general purpose code, note the parametric type "a"
> in "Ch a".  It makes absolutely no guarantees about the order of  
> values.
>  That means that the order they are written is unlikely to be the  
> order
> in which they are read.  Writes to the channel are non-blocking and  
> the
> "MVar [a]" holds some items waiting to be read (in LIFO order).

--
http://wagerlabs.com/





_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

haskell-2
In reply to this post by Matt Harden
[ Deeply nested replies are starting to look similar to runListT $
runStateT $ runWriter .... ]

[hidden email] wrote:

> On Tue, Jan 03, 2006 at 12:07:43AM +0000, Joel Reymont wrote:
>
>>On Jan 2, 2006, at 9:20 PM, Chris Kuklewicz wrote:
>>
>>
>>> This makes me ponder one of the things that Joel was trying to do:
>>>efficiently pass data to a logging thread.  It may be that a custom
>>>channel would be helpful for that as well.
>>
>>I have not taken the time to analyze the Chameneos code but need to  
>>point out that my problem was not with efficiently passing data to  
>>the logging thread. The issue was with data accumulating in the  
>>channel and the logger thread not reading it out fast enough.
>>
>>The TChan implementation is a single-linked list implemented on top  
>>of TVar's. That would seem pretty efficient to me.
>
>
> It's simple and efficient but does nothing to prevent the channel from
> growing out of control.  A slightly modified (custom) channel based on
> TChan, but enforcing a maximum size (blocking on insert if the channel
> is too full), probably would have solved the problem.
>
> I assume that Erlang either does that or increases the priority of
> threads with large event queues, or both.
>
> Thanks,
> Matt Harden
Given that actually controlling priorities is not an option, adding
blocking like that makes sense.  One can make a ring buffer instead of a
singly linked list very easily.  In fact, I have that code lying around
(now attached).  It has not been speed optimized, but I did like being
able to express:

> type Node a = [TMVar a]
>
> make :: (Integral k) => k -> STM (Node a)
> make k = liftM cycle $ sequence $ genericReplicate k newEmptyTMVar

It has the usual operations, but you need to pass a fixed size to
new/newEmpty and you also have an isFull test.  It has no operations to
resize the ring buffer created by "make".

module ProdCons (PC,new,newEmpty, put,ProdCons.take,ProdCons.read,
                   tryPut,tryTake,tryRead, isEmpty,isFull) where

{-  Fixed bounded-buffer size solution of producer/consumer problem.

    Acts like a FIFO TMVar, blocking when capacity is reached.  So a
    capacity of 1 behaves like a TMVar.

    For arbitrary capacity just use a TChan. -}

import Control.Concurrent.STM
import Control.Concurrent
import Control.Monad.Fix
import Control.Monad
import Data.List(cycle,genericReplicate)

type Node a = [TMVar a]
newtype PC a = PC (TVar (Node a),TVar (Node a))

newEmpty :: (Integral k) => k -> IO (PC a)
newEmpty k | k <=0 = error "Need capacity > 0"
           | otherwise = do
  node <- atomically $ make k
  atomically $ do
    tv1 <- newTVar node
    tv2 <- newTVar node
    return (PC (tv1,tv2))

new :: (Integral k) => k -> a -> IO (PC a)
new k v | k <=0 = error "Need capacity > 0"
      v | otherwise = do
  pc <- newEmpty k
  atomically $ put pc v
  return pc

put ::PC a -> a -> STM ()
put (PC (tvar,_)) value = do
  (tmvar:next) <- readTVar tvar
  putTMVar tmvar value
  writeTVar tvar next

take :: PC a -> STM a
take (PC (_,tvar)) = do
  (tmvar:next) <- readTVar tvar
  value <- takeTMVar tmvar
  writeTVar tvar next
  return value

read :: PC a -> STM a
read (PC (_,tvar)) = do
  (tmvar:_) <- readTVar tvar
  readTMVar tmvar

tryTake :: PC a -> STM (Maybe a)
tryTake pc = (ProdCons.take pc >>= return.Just) `orElse` (return Nothing)

tryRead :: PC a -> STM (Maybe a)
tryRead pc = (ProdCons.read pc >>= return.Just) `orElse` (return Nothing)

tryPut :: PC a -> a -> STM Bool
tryPut pc v = (put pc v >> return True) `orElse` (return False)

isEmpty :: PC a -> STM Bool
isEmpty (PC (_,tvar)) = do
   (tmvar:_) <- readTVar tvar
   isEmptyTMVar tmvar

isFull (PC (tvar,_)) = do
   (tmvar:_) <- readTVar tvar
   empty <- isEmptyTMVar tmvar
   return (not empty)

-- -- -- Internal -- -- --

make :: (Integral k) => k -> STM (Node a)
make k = liftM cycle $ sequence $ genericReplicate k newEmptyTMVar

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

Re: Help with "shootout"

haskell-2
In reply to this post by Joel Reymont
Joel Reymont wrote:
> It seems like the real difference between TChan and the Ch code below
> is that TChan is, basically, [TVar a] whereas Ch is MVar [a], plus  the
> order is guaranteed for a TChan.
>
> Now why would it matter so much speed-wise?

STM* is usually slower than IO/MVar.  STM has to do the transactional
record keeping and throws away work (i.e. CPU cycles and speed) when it
aborts.  The Chameneos benchmark has 4 writers working *very* quickly,
so the contention is high.  Taking the MVar acts like a mutex to
serialize access without throwing away work.

The [a] is a LIFO stack; pushing and popping the front element is fast.

Also, I suspect the following may be true:

If 4 threads block trying to take an MVar, and a 5th thread puts a value
in the MVar, then *exactly one* of the 4 blocked threads is woken up and
scheduled to run.

If 4 threads retry while in STM (e.g. takeTMVar), and a 5th thread
commits a change to that TMVar, then *all 4 threads* are woken up and
rescheduled.  This is what the Apache httpd server folks called the
"thundering herd" problem when many processes are blocked waiting on
socket 80.

If 4000 threads were getting woken up when only 1 was needed, then
performance would be poor.  Certainly I found 4 writer thread and STM to
be much slower for this shootout problem than the Einar's custom MVar
channel.

Could someone who knows the STM implementation comment on this?

>
> This is the CVS code. newTChanIO is exported but undocumented in GHC
> 6.4.1. I'm not sure what purpose it serves.

I get tired of writing "do tv <- atomically $ newTVar foo"

I bet this is just shorthand: "do tv <- newTVarIO foo"

Same for newTChanIO.

[ Of course, a type class could be used instead.  But that seems tohave
been judged overkill ]

>
> -- | 'TChan' is an abstract type representing an unbounded FIFO channel.
> data TChan a = TChan (TVar (TVarList a)) (TVar (TVarList a))
>
> type TVarList a = TVar (TList a)
> data TList a = TNil | TCons a (TVarList a)
>
> newTChan :: STM (TChan a)
> newTChan = do
>   hole <- newTVar TNil
>   read <- newTVar hole
>   write <- newTVar hole
>   return (TChan read write)
>
> newTChanIO :: IO (TChan a)
> newTChanIO = do
>   hole <- newTVarIO TNil
>   read <- newTVarIO hole
>   write <- newTVarIO hole
>   return (TChan read write)
>
> writeTChan :: TChan a -> a -> STM ()
> writeTChan (TChan _read write) a = do
>   listend <- readTVar write -- listend == TVar pointing to TNil
>   new_listend <- newTVar TNil
>   writeTVar listend (TCons a new_listend)
>   writeTVar write new_listend
>
> readTChan :: TChan a -> STM a
> readTChan (TChan read _write) = do
>   listhead <- readTVar read
>   head <- readTVar listhead
>   case head of
>     TNil -> retry
>     TCons a tail -> do
>         writeTVar read tail
>         return a
>
> On Jan 3, 2006, at 11:25 AM, Chris Kuklewicz wrote:
>
>> The latest Ch code is very very short:
>>
>>> {- Ch : fast unordered channel implementation -}
>>> newtype Ch a = Ch (MVar [a], MVar a)
>>>
>>> newCh = liftM2 (,) (newMVar []) newEmptyMVar >>= return.Ch
>>>
>>> readCh (Ch (w,r)) = takeMVar w >>= \lst ->
>>>   case lst of (x:xs) -> putMVar w xs >> return x
>>>               []     -> putMVar w [] >> takeMVar r
>>>
>>> writeCh (Ch (w,r)) x = do
>>>   ok <- tryPutMVar r x -- opportunistic, helps for this problem
>>>   unless ok $ takeMVar w >>= \lst -> do
>>>     ok <- tryPutMVar r x  -- safe inside take/put
>>>     putMVar w $ if ok then lst else (x:lst)
>>>
>>
>> It could be used in general purpose code, note the parametric type "a"
>> in "Ch a".  It makes absolutely no guarantees about the order of  values.
>>  That means that the order they are written is unlikely to be the  order
>> in which they are read.  Writes to the channel are non-blocking and  the
>> "MVar [a]" holds some items waiting to be read (in LIFO order).
>
>
> --
> http://wagerlabs.com/
>
>
>
>
>
>

_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe
Reply | Threaded
Open this post in threaded view
|

RE: Help with "shootout"

Simon Marlow
In reply to this post by haskell-2
On 03 January 2006 12:03, Chris Kuklewicz wrote:

> STM* is usually slower than IO/MVar.  STM has to do the transactional
> record keeping and throws away work (i.e. CPU cycles and speed) when
> it aborts.  The Chameneos benchmark has 4 writers working *very*
> quickly, so the contention is high.  Taking the MVar acts like a
> mutex to serialize access without throwing away work.

We found that once you add exception safety to your MVar code, i.e.
using withMVar and modifyMVar rather than raw takeMVar/putMVar, then
MVar code is comparable in speed to STM.  For example, Chan and TChan
perform about the same.

Of course, for a benchmark, you can forget about the exception safety,
as the Ch implementation does (plus it is tweaked in various other
ways).

> Also, I suspect the following may be true:
>
> If 4 threads block trying to take an MVar, and a 5th thread puts a
> value in the MVar, then *exactly one* of the 4 blocked threads is
> woken up and scheduled to run.

Yes, that's true.  We haven't documented this as a property of MVars,
but we don't intend to change this behaviour, so I think we probably
should document it.

Furthermore, the thread to be woken up is always the first thread to
block on the MVar.  That is, there's a kind of fairness built into MVars
which isn't present in STM and TMVars.

> If 4 threads retry while in STM (e.g. takeTMVar), and a 5th thread
> commits a change to that TMVar, then *all 4 threads* are woken up and
> rescheduled.  This is what the Apache httpd server folks called the
> "thundering herd" problem when many processes are blocked waiting on
> socket 80.
>
> If 4000 threads were getting woken up when only 1 was needed, then
> performance would be poor.  Certainly I found 4 writer thread and STM
> to be much slower for this shootout problem than the Einar's custom
> MVar channel.
>
> Could someone who knows the STM implementation comment on this?

Hope this helps...

Simon PJ and I talked about this briefly this morning, and agreed that
it might be possible to provide more primitive MVar-like objects in STM
to recover single-wakeup and some of the speed that MVars currently
have.  We haven't worked through the details though.

>> This is the CVS code. newTChanIO is exported but undocumented in GHC
>> 6.4.1. I'm not sure what purpose it serves.
>
> I get tired of writing "do tv <- atomically $ newTVar foo"
>
> I bet this is just shorthand: "do tv <- newTVarIO foo"
>
> Same for newTChanIO.

It's not just shorthand: you can use newTVarIO and friends inside
unsafePerformIO, which doesn't always work for the longhand atomically
version.

Cheers,
        Simon
_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe