Runtime performance degradation for multi-threaded C FFI callback

Sanket Agrawal
I posted this issue on StackOverflow today. A brief recap:

 When C code calls back into a Haskell function through the FFI, I have observed a sharp increase in total time once multi-threading is enabled in the C code, even though the total number of calls into Haskell stays the same. In my test, I called a Haskell function 5M times in two scenarios (GHC 7.0.4, RHEL5, 12-core box):

  • Single-threaded C function: calls back the Haskell function 5M times. Total time: 1.32s.
  • 5 threads in the C function: each thread calls back the Haskell function 1M times, so the total is still 5M. Total time: 7.79s. I verified that pthreads did not contribute much to the overhead by having the same code call a C function instead and comparing with the single-threaded version. So almost all of the increase in overhead seems to come from the GHC runtime.

What I want to ask is whether this is a known issue in the GHC runtime. If not, I will file a bug report for the GHC team with code to reproduce it. I don't want to file a duplicate bug report if this is already a known issue. I searched through the GHC Trac using some keywords but didn't see any bugs related to it.

StackOverflow post link (has code and details on how to reproduce the issue):
http://stackoverflow.com/questions/8902568/runtime-performance-degradation-for-c-ffi-callback-when-pthreads-are-enabled


_______________________________________________
Glasgow-haskell-users mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/glasgow-haskell-users
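
The two scenarios in the message above can be sketched on the C side roughly as follows. This is a minimal sketch, not the code from the StackOverflow post: the names `run_callbacks`, `worker`, `c_callback` and `calls_made` are hypothetical, the callback here is a plain C function standing in for the Haskell function pointer, and even the "single-threaded" case is run on one spawned pthread for simplicity.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdlib.h>
#include <time.h>

/* Callback type. In the real test this would be a function pointer to a
   Haskell function; using a C function here is an assumption. */
typedef void (*callback_t)(void);

static atomic_long calls_made;   /* total number of callback invocations */

/* Stand-in callback: just counts how often it was called. */
static void c_callback(void) {
    atomic_fetch_add(&calls_made, 1);
}

struct worker_args {
    callback_t cb;
    long iterations;
};

/* Each worker thread invokes the callback `iterations` times. */
static void *worker(void *p) {
    const struct worker_args *a = p;
    for (long i = 0; i < a->iterations; i++)
        a->cb();
    return NULL;
}

/* Invoke `cb` `total` times, split evenly across `nthreads` pthreads.
   Returns the elapsed wall-clock time in seconds, or -1.0 on failure.
   `total` must be divisible by `nthreads`. */
double run_callbacks(int nthreads, long total, callback_t cb) {
    assert(nthreads > 0 && total % nthreads == 0);
    struct worker_args args = { cb, total / nthreads };
    pthread_t *tids = malloc(sizeof *tids * (size_t)nthreads);
    if (tids == NULL)
        return -1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    int started = 0;
    while (started < nthreads &&
           pthread_create(&tids[started], NULL, worker, &args) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(tids);

    if (started != nthreads)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Under these assumptions, `run_callbacks(1, 5000000, c_callback)` and `run_callbacks(5, 5000000, c_callback)` correspond to the two scenarios; passing the Haskell function pointer instead of `c_callback` is what would bring the GHC runtime into the measurement.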

Re: Runtime performance degradation for multi-threaded C FFI callback

Edward Z. Yang
Hmm, this kind of sounds like GHC is assuming that it has control over
all of the threads, and when this assumption fails bad things happen.
(We use lightweight threads, and use the operating system threads that
map to pthreads sparingly.)  I'm sure Simon Marlow could give a more accurate
assessment, however.

Edward

Excerpts from Sanket Agrawal's message of Tue Jan 17 23:31:38 -0500 2012:

> I posted this issue on StackOverflow today. A brief recap:
>
>  In the case when C FFI calls back a Haskell function, I have observed
> sharp increase in total time when multi-threading is enabled in C code
> (even when total number of function calls to Haskell remain same). In my
> test, I called a Haskell function 5M times using two scenarios (GHC 7.0.4,
> RHEL5, 12-core box):
>
>
>    - Single-threaded C function: call back Haskell function 5M times -
>    Total time 1.32s
>    - 5 threads in C function: each thread calls back the Haskell function 1M
>    times - so, total is still 5M - Total time 7.79s - Verified that pthread
>    didn't contribute much to the overhead by having the same code call a C
>    function instead, and compared with single-threaded version. So, almost all
>    of the increase in overhead seems to come from GHC runtime.
>
> What I want to ask is if this is a known issue for GHC runtime? If not,  I
> will file a bug report for GHC team with code to reproduce it. I don't want
> to file a duplicate bug report if this is already known issue. I searched
> through GHC trac using some keywords but didn't see any bugs related to it.
>
> StackOverflow post link (has code and details on how to reproduce the
> issue):
> http://stackoverflow.com/questions/8902568/runtime-performance-degradation-for-c-ffi-callback-when-pthreads-are-enabled

Re: Runtime performance degradation for multi-threaded C FFI callback

Edward Z. Yang
In reply to this post by Sanket Agrawal
Hello Sanket,

What happens if you run this experiment with 5 threads in the C function,
and have GHC run RTS with -N7? (e.g. five C threads + seven GHC threads = 12
threads on your 12-core box.)

Edward

Re: Runtime performance degradation for multi-threaded C FFI callback

Sanket Agrawal
Hi Edward,

I was just going to get back to you about it. I did find out that the issue was indeed one GHC thread dealing with 5 C threads for callback (1:5 mapping) - so, the C threads were blocking on callback waiting for the only GHC thread to be available. I updated the code to do 1:1 mapping - 5 GHC threads for 5 C threads. That proved to be almost linearly scalable.

John Lato suggested the above approach two days back, but I didn't get to test the idea until now.

It doesn't seem to matter whether the number of GHC threads is increased if the mapping between GHC threads and C threads is not 1:1. I got a 1:1 mapping by doing forkIO for each C thread. Is it really possible to do a 7:5 mapping (that is, 7 GHC threads to choose from for 5 C threads during callback)? I can't think of a way to do it. Not that I need it. I am just curious whether that is possible.

Thanks,
Sanket

Re: Runtime performance degradation for multi-threaded C FFI callback

Simon Marlow-7
On 21/01/2012 15:35, Sanket Agrawal wrote:
> Hi Edward,
>
> I was just going to get back to you about it. I did find out that the
> issue was indeed one GHC thread dealing with 5 C threads for callback
> (1:5 mapping) - so, the C threads were blocking on callback waiting for
> the only GHC thread to be available. I updated the code to do 1:1
> mapping - 5 GHC threads for 5 C threads. That proved to be almost
> linearly scalable.

This is almost right, except that your callbacks are not waiting for a
GHC *thread*, but what we call a "capability", which is roughly speaking
"permission to execute Haskell code".  The +RTS -N option chooses the
number of capabilities.

I expect that with -N1, your program is spending a lot of time just
switching between the different OS threads.

It's possible that we could make the runtime more flexible here.  I
recently made it possible to modify the number of capabilities at
runtime, so it's conceivable that the runtime could automatically add
capabilities if it is being called from multiple OS threads.

> John Lato suggested the above approach two days back, but I didn't get
> to test the idea until now.
>
> It doesn't seem to matter whether number of GHC threads are increased,
> if the mapping between GHC threads and C threads is not 1:1. I got 1:1
> mapping by doing forkIO for each C thread. Is it really possible to do
> 7:5 mapping (that is 7 GHC threads to choose from, for 5 C threads
> during callback)? I can't think of a way to do it. Not that I need it. I
> am just curious if that is possible.

Just think of +RTS -N7 as being 7 *locks*, not 7 threads.  Then it makes
perfect sense to have 7 locks available for 5 threads.

Cheers,
        Simon
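
The "N locks, not N threads" picture from the message above can be modelled in plain C. The sketch below is only an analogy under stated assumptions: `cap_pool`, `cap_acquire`, `cap_release`, `caller` and `max_concurrent` are hypothetical names, and the pool is an ordinary counting semaphore built from a mutex and a condition variable. It is not how the GHC runtime implements capabilities, whose scheduler does considerably more.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A pool of interchangeable permits standing in for capabilities.
   This is a model for illustration, not GHC's implementation. */
struct cap_pool {
    pthread_mutex_t mu;
    pthread_cond_t  freed;
    int free_slots;    /* permits currently available */
    int running;       /* callers currently holding a permit */
    int max_running;   /* high-water mark of `running` */
};

/* Block until a permit is free, then take it. */
static void cap_acquire(struct cap_pool *p) {
    pthread_mutex_lock(&p->mu);
    while (p->free_slots == 0)
        pthread_cond_wait(&p->freed, &p->mu);
    p->free_slots--;
    if (++p->running > p->max_running)
        p->max_running = p->running;
    pthread_mutex_unlock(&p->mu);
}

/* Give the permit back and wake one waiter, if any. */
static void cap_release(struct cap_pool *p) {
    pthread_mutex_lock(&p->mu);
    p->free_slots++;
    p->running--;
    pthread_cond_signal(&p->freed);
    pthread_mutex_unlock(&p->mu);
}

/* One C thread making 1000 "callbacks", each needing a permit. */
static void *caller(void *arg) {
    struct cap_pool *p = arg;
    for (int i = 0; i < 1000; i++) {
        cap_acquire(p);   /* wait for "permission to execute Haskell code" */
        /* the callback body would run here */
        cap_release(p);
    }
    return NULL;
}

/* Run `nthreads` callers against `ncaps` permits. Returns the largest
   number of callers that held a permit at the same time, or -1 on error. */
int max_concurrent(int ncaps, int nthreads) {
    assert(ncaps > 0 && nthreads > 0);
    struct cap_pool p = { .free_slots = ncaps };
    pthread_mutex_init(&p.mu, NULL);
    pthread_cond_init(&p.freed, NULL);

    pthread_t *tids = malloc(sizeof *tids * (size_t)nthreads);
    int started = 0;
    while (tids != NULL && started < nthreads &&
           pthread_create(&tids[started], NULL, caller, &p) == 0)
        started++;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);
    free(tids);

    pthread_cond_destroy(&p.freed);
    pthread_mutex_destroy(&p.mu);
    return started == nthreads ? p.max_running : -1;
}
```

In this model, `max_concurrent(1, 5)` can never exceed 1, so the five callers take turns much as the five C threads do with a single capability, while `max_concurrent(7, 5)` is bounded only by the number of callers: seven permits for five threads is perfectly meaningful, and two permits simply go unused.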



Re: Runtime performance degradation for multi-threaded C FFI callback

John Lato-2
In reply to this post by Sanket Agrawal
Hi Simon,

I'm not certain that your explanation matches what I observed.

All of my tests were done on a 4-core machine, executing with "+RTS
-N", which should be the same as "+RTS -N4" I believe.

With 1 Haskell thread (the main thread) and 4 process threads (via
pthreads), I saw a significant performance degradation compared to 5
Haskell threads (main + 4 via forkIO) and 4 process threads.  As I
understand your explanation, if C callbacks are scheduled according to
available capabilities, there should be no difference between these
situations.

I observed this with GHC-7.2.1; however, Daniel Fischer reported that,
with ghc-7.2.2, he observed different behavior (which matches your
explanation AFAICT).  Is it possible that the scheduling of callbacks
into Haskell changed between those versions?

Thanks,
John L.

Re: Runtime performance degradation for multi-threaded C FFI callback

Simon Marlow-7
I'll need to analyse the program to see what's going on.  There was a
small change to the scheduler between 7.2.1 and 7.2.2 that could
conceivably have made a difference in this scenario, but it was aimed at
fixing a bug rather than improving performance.

Another possibility is a difference in OS scheduling behaviour between
yours and Daniel Fischer's setup.  In microbenchmarks like this, it's
easy for a difference in OS scheduling behaviour to make a large
difference in performance if it happens consistently.

Cheers,
        Simon

Re: Runtime performance degradation for multi-threaded C FFI callback

Daniel Fischer
On Monday 23 January 2012, 14:26:13, Simon Marlow wrote:
> Another possibility is a difference in OS scheduling behaviour between
> yours and Daniel Fischer's setup.  In microbenchmarks like this, it's
> easy for a difference in OS scheduling behaviour to make a large
> difference in performance if it happens consistently.

That seems likely, since I get pretty much the same times and relations
with 7.0.4 and 7.2.1 as with 7.2.2.

Cheers,
Daniel

Re: Runtime performance degradation for multi-threaded C FFI callback

John Lato-2
In reply to this post by Simon Marlow-7
I agree the OS scheduler is likely to contribute to our different
observations.  I'll try to test with ghc-7.4-rc1 tonight to see if I
get similar results to 7.2.1.

If you want to see some code I'll post it, although I doubt it's
necessary.  I would appreciate it if you (or someone else in the know)
could answer a question for me: does the GHC runtime handle scheduling
of code from Haskell threads (forkIO) and foreign callbacks (via
FunPtr's) in the same way, or are there restrictions on which
capability may handle one or the other (ignoring bound threads and the
like)?

Thank you,
John L.

Re: Runtime performance degradation for multi-threaded C FFI callback

Simon Marlow-7
On 23/01/2012 14:54, John Lato wrote:

> I agree the OS scheduler is likely to contribute to our different
> observations.  I'll try to test with ghc-7.4-rc1 tonight to see if I
> get similar results to 7.2.1.
>
> If you want to see some code I'll post it, although I doubt it's
> necessary.  I would appreciate it if you (or someone else in the know)
> could answer a question for me: does the GHC runtime handle scheduling
> of code from Haskell threads (forkIO) and foreign callbacks (via
> FunPtr's) in the same way, or are there restrictions on which
> capability may handle one or the other (ignoring bound threads and the
> like)?

Callbacks always create bound threads.  There are no restrictions on
which capabilities can handle either forkIO or bound threads.

Cheers,
        Simon

