Project postmortem

Project postmortem

Joel Reymont
Folks,

I have done a lot of experiments over the past few weeks and came to  
a few interesting conclusions. First some background, then issues,  
solutions and conclusions.

I wrote a test harness for a poker server that understands the  
different binary packets and can send and receive them. The harness  
launches each "script" in a separate unbound thread that connects to  
the server via TCP and does its work.
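
A minimal sketch of the "one unbound thread per script" structure described above. This is not the harness code; the name runScripts and the choice to collect each result through an MVar are assumptions made for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Exception (SomeException, try)

-- | Run every script in its own lightweight (unbound) thread and wait
-- for all of them. A script that throws is reported as 'Left' instead
-- of silently killing its thread.
runScripts :: [IO a] -> IO [Either SomeException a]
runScripts scripts = do
  slots <- mapM launch scripts
  mapM takeMVar slots
  where
    -- Fork the script and hand back the MVar its outcome will land in.
    launch script = do
      slot <- newEmptyMVar
      _ <- forkIO (try script >>= putMVar slot)
      return slot
```

Threads created with forkIO are cheap, so launching a few thousand this way is plausible; whether the sockets and the OpenSSL state behind them scale as well is a separate matter.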

The main goals of the project were: easy scripting, very high number  
of connections from the harness (a few thousand) and running on  
Windows. I develop on Mac OSX but have a Windows machine for testing  
and to run the poker server.

Another key goal was to support the server encryption. SSL encryption
is done in a weird way that requires attaching read/write OpenSSL
BIOs to the SSL descriptor so that SSL encrypts to/from memory.
Encrypted chunks are then taken from the BIOs and sent as payload in
server packets.

Overall, I probably spent about 4 weeks writing the server and about
2 more weeks grappling with the various issues. The issues centered
around 1) the program thrashing memory like no tomorrow, 2)
intermittent crashes on Windows and 3) not being able to launch a
high number of connections on Windows before crashing.

I significantly reduced the memory thrashing by switching to plain
Haskell structures from nested lists of wxHaskell-style properties
(attr := value). Intermittent crashes were harder to troubleshoot,
especially given that things were running smoothly on Mac OSX.

Stack traces pointed into libcrypto (part of OpenSSL) and thus to the
BIOs that I was allocating. I guessed that OpenSSL was maxing out
some resources and closed the leak by explicitly freeing the SSL
descriptor, which freed the associated BIO structures. Then things got
weirder as my program started crashing in a different place entirely
with stack traces like this:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0  0x0027c174 in s8j1_info ()
#1  0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2  0x0021cdc4 in schedule (mainThread=0x1100360,  
initialCapability=0x308548) at Schedule.c:932
#3  0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at  
Schedule.c:2156
#4  0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,  
initialCapability=0x0) at Schedule.c:2050
#5  0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6  0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104

I took waitThread_ as a clue and started digging deeper.

Whenever I connect to the server or send a command I wait for X  
seconds and if not connected or desired command is not received I  
throw an exception which fails the script. I implemented the timeout  
combinator a couple of different ways, including that in the  
Asynchronous Exceptions paper but it did not help. I think the issue  
has to do with killing threads that are using FFI. Although I'm  
killing threads that call the Haskell connectTo, hGetBuf, etc. I  
think it's still FFI.
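
For illustration, here is a sketch of a timeout combinator that fails a script by throwing an exception. It assumes a current base library, which provides System.Timeout.timeout; the names ScriptTimeout, within and demo are made up for this example and are not from the harness.

```haskell
import Control.Concurrent (threadDelay)
import Control.Exception (Exception, throwIO, try)
import System.Timeout (timeout)

-- | Hypothetical exception used to fail a script that waited too long.
data ScriptTimeout = ScriptTimeout String
  deriving Show

instance Exception ScriptTimeout

-- | Run an action, throwing 'ScriptTimeout' if it has not finished
-- within the given number of seconds ('timeout' takes microseconds).
within :: Int -> String -> IO a -> IO a
within seconds what action = do
  result <- timeout (seconds * 1000000) action
  case result of
    Just x  -> return x
    Nothing -> throwIO (ScriptTimeout what)

-- | Demonstration: the slow step is cut off, the fast one completes.
demo :: IO (Either ScriptTimeout String, Either ScriptTimeout String)
demo = do
  slow <- try (within 1 "slow step" (threadDelay 2000000 >> return "done"))
  fast <- try (within 1 "fast step" (return "done"))
  return (slow, fast)
```

timeout works by sending an asynchronous exception to the thread running the action, so a thread sitting inside a blocking foreign call is generally not interrupted until that call returns. That is in line with the suspicion above that FFI is involved.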

I disposed of timeouts entirely, leaving connectTo as it is and using
hWaitForInput on my socket handle to simulate timeouts. This improved
things tremendously and I'm now able to run a few thousand
unbound script threads on Windows with OpenSSL FFI and everything.
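
A rough sketch of simulating a read timeout with hWaitForInput, as described above. The function name readLineWithin and the line-oriented read are assumptions for the sake of a small example; the real harness reads binary packets.

```haskell
import System.IO

-- | Wait up to the given number of milliseconds for input to arrive on
-- the handle, then read one line. 'Nothing' means nothing arrived in
-- time. 'hWaitForInput' raises an EOF error if the other side has
-- closed the connection, and 'hGetLine' can still block if only part
-- of a line has arrived.
readLineWithin :: Int -> Handle -> IO (Maybe String)
readLineWithin millis h = do
  ready <- hWaitForInput h millis
  if ready
    then fmap Just (hGetLine h)
    else return Nothing
```

One caveat from the base documentation: unless the program is linked with -threaded, hWaitForInput with a non-negative timeout blocks all other Haskell threads for the duration of the call.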

Memory usage is still higher than I would have liked and crashes in  
OpenSSL still happen when the number of threads/memory usage is  
really high so there's still room for improvement. I should probably  
go back to using a foreign finalizer (SSL_free) on the SSL  
descriptors rather than freeing them explicitly as the freeing does  
not happen if a script fails mid-way.
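
The two options mentioned above can be sketched as follows. To stay runnable without OpenSSL, a malloc'd block stands in for the SSL descriptor and finalizerFree stands in for SSL_free; the names newDescriptor, withDescriptor and roundTrip are assumptions of this sketch.

```haskell
import Control.Exception (bracket)
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, free, mallocBytes)
import Foreign.Ptr (Ptr)
import Foreign.Storable (peek, poke)

-- With OpenSSL the finalizer would be imported along the lines of
--   foreign import ccall "&SSL_free" sslFree :: FunPtr (Ptr SSL -> IO ())
-- where SSL is an empty data type declared for the binding.

-- | Option 1: attach a foreign finalizer, so the block is released
-- some time after the garbage collector finds it unreachable.
newDescriptor :: IO (ForeignPtr Word8)
newDescriptor = mallocBytes 64 >>= newForeignPtr finalizerFree

-- | Option 2: free deterministically, even if the script throws
-- mid-way through the action.
withDescriptor :: (Ptr Word8 -> IO a) -> IO a
withDescriptor = bracket (mallocBytes 64) free

-- | Tiny use of both styles so the example does something observable.
roundTrip :: Word8 -> IO (Word8, Word8)
roundTrip x = do
  fp <- newDescriptor
  a <- withForeignPtr fp (\p -> poke p x >> peek p)
  b <- withDescriptor (\p -> poke p x >> peek p)
  return (a, b)
```

The finalizer version needs no discipline from the script author but frees at an unpredictable time; the bracket version frees promptly but only covers descriptors whose lifetime fits inside one block.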

I'm quite satisfied with my first Haskell project. I love Haskell and  
will continue hacking away with it. This list is invaluable in the  
depth of offered help whereas #haskell (IRC) is invaluable when speed  
matters. I'm quite amazed at the things I have been able to do, the  
expressiveness of Haskell and the clean looks.

Clean looks can be deceptive, though, as they can hide code of  
amazing complexity. Fundeps, existential types, HList take a while to  
grasp. Also, I feel somewhat like a pioneer and I definitely got more  
than a fair share of arrows in my back.

I had GHC run out of memory during compilation (fixed by SPJ), had it  
quit midway during compilation with an error about generated extents  
being too large in assembler code. I had GHC crash at runtime with an  
error like "fromJust not returning Just, this could not be  
happening!". Yesterday's error topped them all:

internal error: update_fwd: unknown/strange object  0
    Please report this as a bug to [hidden email],
    or http://www.sourceforge.net/projects/ghc/

I think I got this when using +RTS -C0 -c.

Overall, the experience with Haskell has been exhilarating and I'm  
already preparing to use it on my next projects like detecting  
collusion in poker as well as rake optimization (Dazzle paper very  
helpful here!). Still, I think that GHC can be a bit rough around the  
edges and I would think twice about writing high-performance network  
apps with it.

        Thanks, Joel

P.S. The Glasgow Distributed Haskell (GdH) people are supposed to
have a mailing list and I would love to share my findings with them
but I could not find the mailing list itself.

--
http://wagerlabs.com/





_______________________________________________
Haskell-Cafe mailing list
[hidden email]
http://www.haskell.org/mailman/listinfo/haskell-cafe

Re: Project postmortem

Scott Weeks
Hi Joel,

What would your impression be of building an application in Haskell  
versus Erlang from a practical point of view given your experiences  
with this project and the Erlang poker server?

My feelings having developed a little with Erlang and embarking on a  
Haskell project are that the learning curve is far steeper with  
Haskell but it is far more elegant and readable. I'm still climbing  
that curve though (IO makes me want to pull my hair out).

Thanks for writing up that post mortem. There's lots of good info in  
there, especially for a newbie like myself.

Cheers,
Scott


On 18/11/2005, at 12:43 AM, Joel Reymont wrote:

> Folks,
>
> I have done a lot of experiments over the past few weeks and came  
> to a few interesting conclusions. First some background, then  
> issues, solutions and conclusions.
>
> [...]


Re: Project postmortem

Joel Reymont
On Nov 17, 2005, at 10:59 PM, Scotty Weeks wrote:

> What would your impression be of building an application in Haskell  
> versus Erlang from a practical point of view given your experiences  
> with this project and the Erlang poker server?

I would have been done much faster and with far less trouble. The  
scripting would have been a royal pain in the rear for the customer,  
though. But, again, I would have been done much faster as network  
clients/servers is what Erlang excels at. That and concurrency.

Haskell... I'm still trying to figure out why reading from a Chan  
with getChanContents and then printing out the contents works and  
doing the same with readChan and looping blocks. Or why the app now  
crashes violently on Mac OSX but works without a hitch on Windows.  
And I still don't have a good timeout combinator.
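
For reference, a small sketch of a consumer built on an explicit readChan loop. It is not a diagnosis of the blocking described above; the Nothing sentinel and the names drain and demo are assumptions of this example. readChan blocks while the channel is empty, so the loop needs some way to learn that no more items are coming.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)

-- | Drain a channel with an explicit 'readChan' loop. 'Nothing' is a
-- sentinel that tells the consumer to stop and return what it has seen.
drain :: Chan (Maybe a) -> IO [a]
drain chan = go []
  where
    go acc = do
      item <- readChan chan
      case item of
        Nothing -> return (reverse acc)
        Just x  -> go (x : acc)

-- | Producer on the main thread, consumer on a forked thread.
demo :: IO [String]
demo = do
  chan <- newChan
  done <- newEmptyMVar
  _ <- forkIO (drain chan >>= putMVar done)
  mapM_ (writeChan chan . Just) ["connect", "login", "bet"]
  writeChan chan Nothing
  takeMVar done
```

getChanContents, by contrast, returns the channel's contents as a lazy list, so printing that list produces output as items arrive even though the list itself never ends.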

I felt very excited this morning given the newly found love between
my app and Windows but the excitement lasted only until I realized
that hWaitForInput blocks all other threads :-(.

> My feelings having developed a little with Erlang and embarking on  
> a Haskell project are that the learning curve is far steeper with  
> Haskell but it is far more elegant and readable. I'm still climbing  
> that curve though (IO makes me want to pull my hair out).

Unless lightning strikes and tomorrow morning I figure out what's the
deal with the spurious Mac OSX crashes, I think this might be my last
network app in Haskell. I should really be spending time on the
business end of the app instead of figuring out platform differences
and the like.

        Joel

--
http://wagerlabs.com/






RE: Project postmortem

Simon Peyton Jones
In reply to this post by Joel Reymont

| Unless lightning strikes and tomorrow morning I figure out what's the
| deal with the spurious Mac OSX crashes, I think this might be my last
| network app in Haskell. I should really be spending time on the
| business end of the app instead of figuring out platform differences
| and the like.

Joel, I think it's fantastic that you've been pushing on Haskell in the
way you have.  What I learn from your experience is that the *language*
is pretty good for what you wanted to do (esp lightweight concurrency)
but the *libraries* in the area of networking are lacking both
functionality and (more particularly) robustness.

I hope you don't abandon Haskell altogether.  Without steady, friendly
pressure from applications-end folk like you, things won't improve.
It's incredibly valuable feedback.  But I can see that when you have to
deliver something next week you can't wait around for someone to
get around to fixing your problem.  (They aren't paid either!)  Maybe
you can use Haskell for something less mission-critical, so that you can
keep up the pressure?

Meanwhile, let me utter my customary encouragement to the Haskell
community out there: please pitch in and help!  Haskell will only break
into real applications, of the kind Joel has been writing, if we can
offer robust libraries, and that depends utterly on you.  Don't wait for
someone else to do it.

Simon

RE: Project postmortem

Jan Stoklasa (gmail)
Hi,
so sad, so true...
At least Haskell ideas sneak into mainstream languages under disguise (LINQ
anyone?). C-Java-C# syntax that business "developers" and their bosses love
so much is mandatory so the result lacks the beauty we all know and
appreciate, but it is kinda nice to see functional programming going
mainstream at last. Maybe "Lambda" is the IT buzzword of the next decade :-).

Jan

 


Re: Project postmortem

Joel Reymont
In reply to this post by Simon Peyton Jones
On Nov 18, 2005, at 10:17 AM, Simon Peyton-Jones wrote:

> I hope you don't abandon Haskell altogether.  Without steady, friendly
> pressure from applications-end folk like you, things won't improve.

Nah, I'm just having a very frustrating Friday. I think I need some  
direction in which to dig and a bit of patience over the weekend. For  
example,

What does this mean precisely? My take is that the GHC runtime is
trying to call a C function. This much I gathered from the source
code. It also seems that since I do not see another library at #0
then the issue is within GHC. Is that the right take on it?

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_INVALID_ADDRESS at address: 0x3139322e
0x0027c174 in s8j1_info ()
(gdb) where
#0  0x0027c174 in s8j1_info ()
#1  0x0021c9f4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2  0x0021cdc4 in schedule (mainThread=0x1100360,  
initialCapability=0x308548) at Schedule.c:932
#3  0x0021dd6c in waitThread_ (m=0x1100360, initialCapability=0x0) at  
Schedule.c:2156
#4  0x0021dc50 in scheduleWaitThread (tso=0x13c0000, ret=0x0,  
initialCapability=0x0) at Schedule.c:2050
#5  0x00219548 in rts_evalLazyIO (p=0x29b47c, ret=0x0) at RtsAPI.c:459
#6  0x001e4768 in main (argc=2262116, argv=0x308548) at Main.c:104

> It's incredibly valuable feedback.  But I can see that when you have to
> deliver something next week you can't wait around for someone to
> get around to fixing your problem.  (They aren't paid either!)  Maybe
> you can use Haskell for something less mission-critical, so that you can
> keep up the pressure?

I can't change who I am, I just gotta push the envelope. I would not  
have stood the pain of doing this project in Erlang, for example,  
what with all the nested data structures, etc.

I'm not waiting for someone to fix my problem, I would gladly fix it  
myself if I understood where the problem is. It used to be fairly  
clear before when the stack trace pointed to one of the OpenSSL  
libraries. In this particular case I don't even know how to start  
debugging this. Do I set a break point in s8j1_info? But it's  
something else periodically, like s34n_info.

Do I inspect the C code somehow? But how do I do that? How do I debug  
the GHC runtime?

        Thanks, Joel

--
http://wagerlabs.com/






The IT buzzword of the next decade (was Re: Project postmortem)

Joel Reymont
In reply to this post by Jan Stoklasa (gmail)
This would be a good new thread to discuss it ;-)

On Nov 18, 2005, at 10:42 AM, Jan Stoklasa (gmail) wrote:

> Hi,
> so sad, so true...
> At least Haskell ideas sneak into mainstream languages under disguise (LINQ
> anyone?). C-Java-C# syntax that business "developers" and their bosses love
> so much is mandatory so the result lacks the beauty we all know and
> appreciate, but it is kinda nice to see functional programming going
> mainstream at last. Maybe "Lambda" is the IT buzzword of the next decade :-).

--
http://wagerlabs.com/






RE: Project postmortem

Simon Marlow
In reply to this post by Joel Reymont
On 18 November 2005 10:48, Joel Reymont wrote:

> On Nov 18, 2005, at 10:17 AM, Simon Peyton-Jones wrote:
>
>> I hope you don't abandon Haskell altogether.  Without steady,
>> friendly pressure from applications-end folk like you, things won't
>> improve.
>
> Nah, I'm just having a very frustrating Friday. I think I need some
> direction in which to dig and a bit of patience over the weekend. For
> example,
>
> What does this mean precisely? My take is that the GHC runtime is
> trying to call a C function. this much I gathered from the source
> code. It also seems that since I do not see another library at #0
> then the issue is within GHC. Is that the right take on it?

The stack trace doesn't mean much at all I'm afraid - GHC doesn't use
the C stack, so any stack trace generated for a crash inside the Haskell
code is mostly useless.  It does tell you the block in which the crash
happened (s8j1_info), and it tells you that the crash was in Haskell and
not C.  The rest of the frames on the stack are from the GHC runtime,
and you'll pretty much always see these same frames on the stack for any
crash inside Haskell code.

How we normally proceed for a crash like this is as follows: examine
where the crash happened and determine whether it is a result of heap or
stack corruption, and then attempt to trace backwards to find out where
the corruption originated from.  Tracing backwards means running the
program from the beginning again, so it's essential to have a
reproducible example.  Without reproducibility, we have to use a
combination of debugging printfs and staring really hard at the code,
which is much more time consuming (and still requires being able to run
the program to make it crash with debugging output turned on).

You can get debugging output by compiling your program with -debug, and
then running it with some of the -D<something> options (use +RTS -? for
a list, +RTS -Ds is a good one to start with).

Cheers,
        Simon

Re: Project postmortem

Joel Reymont
On Nov 18, 2005, at 1:55 PM, Simon Marlow wrote:

> You can get debugging output by compiling your program with -debug, and
> then running it with some of the -D<something> options (use +RTS -? for
> a list, +RTS -Ds is a good one to start with).

I'm still working on a repro case but here's what I get...

+RTS -Ds
...
scheduler: checking for threads blocked on I/O
sched: -->> running thread 1103 ThreadRunGHC ...
sched: --<< thread 1103 (ThreadRunGHC) stopped: is blocked on an MVar
all threads:
         thread 1225 @ 0x1539000 is not blocked
         thread 1224 @ 0x1506aa4 is not blocked
         thread 1223 @ 0x15066a4 is not blocked
...
scheduler: checking for threads blocked on I/O
sched: -->> running thread 1107 ThreadRunGHC ...
Segmentation fault

1107 is not blocked in the list of all threads. What options should I  
try next?

        Thanks, Joel

--
http://wagerlabs.com/






RE: Project postmortem

Simon Marlow
In reply to this post by Joel Reymont
On 18 November 2005 14:42, Joel Reymont wrote:

> On Nov 18, 2005, at 1:55 PM, Simon Marlow wrote:
>
>> You can get debugging output by compiling your program with -debug,
>> and then running it with some of the -D<something> options (use +RTS
>> -? for a list, +RTS -Ds is a good one to start with).
>
> I'm still working on a repro case but here's what I get...
>
> +RTS -Ds
> ...
> scheduler: checking for threads blocked on I/O
> sched: -->> running thread 1103 ThreadRunGHC ...
> sched: --<< thread 1103 (ThreadRunGHC) stopped: is blocked on an MVar
> all threads:
>          thread 1225 @ 0x1539000 is not blocked
>          thread 1224 @ 0x1506aa4 is not blocked
>          thread 1223 @ 0x15066a4 is not blocked
> ...
> scheduler: checking for threads blocked on I/O
> sched: -->> running thread 1107 ThreadRunGHC ...
> Segmentation fault
>
> 1107 is not blocked in the list of all threads. What options should I
> try next?

That doesn't tell us much unfortunately.  Can you send a disassembly of
the block in which the crash happened?

Is it always the same block, BTW?  Does changing the heap size (+RTS
-H<size>) have any effect?

Cheers,
        Simon

Re: Project postmortem

Joel Reymont

On Nov 18, 2005, at 2:47 PM, Simon Marlow wrote:

> That doesn't tell us much unfortunately.  Can you send a disassembly of
> the block in which the crash happened?
>
> Is it always the same block, BTW?  Does changing the heap size (+RTS
> -H<size>) have any effect?

I don't think changing the heap size has any effect. I tried a run  
with -H512m and the only difference was that it crashed at 0x00000005  
with the same kernel protection failure. The address for s34n_info is  
the same, everything else the same, including stack trace and  
addresses and offsets in it.

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
0x0024ef88 in s34n_info ()
(gdb) where
#0  0x0024ef88 in s34n_info ()
#1  0x00211eb4 in StgRunIsImplementedInAssembler () at StgCRun.c:576
#2  0x0020f048 in schedule (mainThread=0x1100360,  
initialCapability=0x2fd508) at Schedule.c:932
#3  0x0020fff0 in waitThread_ (m=0x1100360, initialCapability=0x0) at  
Schedule.c:2156
#4  0x0020fed4 in scheduleWaitThread (tso=0x13c0000, ret=0x0,  
initialCapability=0x0) at Schedule.c:2050
#5  0x0020cd70 in rts_evalLazyIO (p=0x29216c, ret=0x0) at RtsAPI.c:459
#6  0x001d80fc in main (argc=2212180, argv=0x2fd508) at Main.c:104

(gdb) disas 0x0024ef88
Dump of assembler code for function s34n_info:
0x0024ef70 <s34n_info+0>:       mr      r10,r25
0x0024ef74 <s34n_info+4>:       addi    r9,r25,8
0x0024ef78 <s34n_info+8>:       mr      r25,r9
0x0024ef7c <s34n_info+12>:      cmplw   cr7,r9,r26
0x0024ef80 <s34n_info+16>:      bgt-    cr7,0x24efb4 <s34n_info+68>
0x0024ef84 <s34n_info+20>:      lwz     r2,4(r14)
0x0024ef88 <s34n_info+24>:      lbzx    r0,r2,r15
0x0024ef8c <s34n_info+28>:      cmpwi   cr7,r0,0
0x0024ef90 <s34n_info+32>:      bne-    cr7,0x24efc4 <s34n_info+84>
0x0024ef94 <s34n_info+36>:      lis     r2,42
0x0024ef98 <s34n_info+40>:      lwz     r2,20668(r2)
0x0024ef9c <s34n_info+44>:      stw     r2,4(r10)
0x0024efa0 <s34n_info+48>:      stw     r15,0(r9)
0x0024efa4 <s34n_info+52>:      addi    r14,r9,-4
0x0024efa8 <s34n_info+56>:      lwz     r29,0(r22)
0x0024efac <s34n_info+60>:      mtctr   r29
0x0024efb0 <s34n_info+64>:      bctr
0x0024efb4 <s34n_info+68>:      li      r0,8
0x0024efb8 <s34n_info+72>:      stw     r0,108(r27)
0x0024efbc <s34n_info+76>:      lwz     r29,-4(r27)
0x0024efc0 <s34n_info+80>:      b       0x24efac <s34n_info+60>
0x0024efc4 <s34n_info+84>:      addi    r15,r15,1
0x0024efc8 <s34n_info+88>:      addi    r25,r9,-8
0x0024efcc <s34n_info+92>:      lis     r29,37
0x0024efd0 <s34n_info+96>:      addi    r29,r29,-4240
0x0024efd4 <s34n_info+100>:     b       0x24efac <s34n_info+60>
0x0024efd8 <s34n_info+104>:     .long 0x21
0x0024efdc <s34n_info+108>:     .long 0x240000
End of assembler dump.

--
http://wagerlabs.com/






Re: Project postmortem

Joel Reymont
In reply to this post by Simon Marlow
Folks,

This is not quite the error that I was expecting but they could be  
related, I'm not sure. In any case, you can retrieve the repro  
project thusly:

darcs get http://test.wagerlabs.com/postmortem

You need OpenSSL to build these so don't forget to add -lssl -lcrypto  
to either ghc or ghci.

I would appreciate it if we could all collectively look at this as
things are either weird or I'm missing something obvious. I will
apply any patches sent to me.

I run like this:

ghci -fglasgow-exts -lssl -lcrypto
:l Server
main

ghci -fglasgow-exts -lssl -lcrypto
:l Client
main

I get in the server window:

interactive: unknown exception

14:51:39: ThreadId 1: Accepted new connection: {handle: <socket: 5>}
14:51:39: ThreadId 1: Verify locations: 1
14:51:39: ThreadId 1: sslGetError: 2
14:51:39: ThreadId 4: Starting SSL handshake...
14:51:39: ThreadId 4: Reading from BIO...
14:51:39: ThreadId 4: Waiting for BIO 0x01108670
14:51:39: ThreadId 4: waitForBio: gotta wait a bit...

If you look at SSL.hs you will see that I'm calling threadDelay right  
after this message. No other messages are produced. This tells me  
that threadDelay is throwing an exception.

Why would it, though? And how can I tell what the exception is? If I  
comment out the threadDelay then I get the exception somewhere in the  
expect code after bytes are sent to the other side.
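
One way to find out what the exception is would be to catch it around each step and print it. The sketch below assumes a current base library, where every exception can be caught as SomeException; the function name traced is made up for this example.

```haskell
import Control.Exception (SomeException, try)

-- | Run one step of a script and, if it throws, report which step
-- failed and what the exception was instead of losing it.
traced :: String -> IO a -> IO (Either String a)
traced label action = do
  outcome <- try action
  case outcome of
    Right x -> return (Right x)
    Left e  -> return (Left (label ++ ": " ++ show (e :: SomeException)))
```

Wrapping the call, for example traced "waitForBio" (threadDelay 100000), would then name the failing step in the log. Note that catching SomeException also catches asynchronous exceptions such as a thread being killed, so this is better suited to debugging than to production code.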

Overall, my intent is to get this to work for 1 thread and then try,  
say, 5 or 10 thousand.

        Thanks, Joel


Re: Project postmortem

Jason Dagit
In reply to this post by Simon Peyton Jones

On Nov 18, 2005, at 2:17 AM, Simon Peyton-Jones wrote:

> I hope you don't abandon Haskell altogether.  Without steady, friendly
> pressure from applications-end folk like you, things won't improve.
> It's incredibly valuable feedback.  But I can see that when you have to
> deliver something next week you can't wait around for someone to
> get around to fixing your problem.  (They aren't paid either!)  Maybe
> you can use Haskell for something less mission-critical, so that you can
> keep up the pressure?

Here is some feedback on a negative experience I had with Haskell  
recently (really about the only negative experience :)

I was playing with one of the Haskell OpenGL libraries (actually it's
a refined FFI) over the summer and some things about it rubbed me the
wrong way.  I wanted to try fixing them but I really couldn't figure
out how to get hold of the code and start hacking.  I found some
candidates, but they looked like old CVS repositories or something.  I
was confused, ran out of time and moved on.  Why do I bring it up?
If it had been obvious where to get an official copy of the library I
could have tried sending in some patches to make things work the way
I wanted.  I'm a huge fan of darcs repositories, BTW.

Thanks,
Jason

Re: Project postmortem

Cale Gibbard
In reply to this post by Joel Reymont
test.wagerlabs.com seems really slow for me right now. I've mirrored
the repo on my own machine (might not be 100% reliable, but should
stay up nearly all of the time). The mirror address is
http://vx.hn.org/postmortem/

 - Cale


Re: Project postmortem

Sven Panne
In reply to this post by Jason Dagit
Am Freitag, 18. November 2005 17:16 schrieb Jason Dagit:

> [...]
> I was playing with one of the Haskell OpenGL libraries (actually it's
> a refined FFI) over the summer and some things about it rubbed me the
> wrong way.  I wanted to try fixing them but I really couldn't figure
> out how to get ahold of the code and start hacking.  I found some
> candidates, but it seemed like old cvs repositories or something.  I
> was confused, ran out of time and moved on.  Why do I bring it up?
> If it had been obvious where to get an official copy of the library I
> could have tried sending in some patches to make things work the way
> I wanted.  I'm a huge fan of darcs repositories, BTW.

Hmmm, as the OpenGL/GLUT/OpenAL/ALUT guy I have to admit that I should really,
really update the web pages about those packages. But anyway: Asking on any
Haskell mailing list (there is even one especially for the OpenGL/GLUT
packages) normally gives you fast response times. Without even knowing that
there is a problem, there is nothing I can fix. :-) And don't hesitate to ask
questions about the usage of those packages, because this is valuable
feedback, too. Regarding the repository: The normal fptools repository is the
"official" one for those packages. But IIRC, most GHC binary packages include
OpenGL/GLUT support, so there is normally no urgent need for a home-made
version. All packages are already cabalized, but I have to admit that I have
never tried to build them on their own.

Cheers,
   S.

Re: Project postmortem

Joel Reymont
In reply to this post by Joel Reymont
The exception is actually from withTimeOut. Removing calls to that  
lets the handshake proceed. The server is using a client handshake,  
though, so the handshake of client vs. client goes on indefinitely.

I'm fixing the server side and once that is done will clean up SSL at  
the end of the handshake and launch a few thousand clients. It's not  
a good repro case yet although I would love to know why withTimeOut  
is throwing that exception.
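
The source of withTimeOut is not shown in this thread, so the following is only a sketch of an alternative, not a diagnosis. Assuming a modern `base`, `System.Timeout.timeout` reports expiry as `Nothing` instead of letting an exception reach the caller; `withTimeLimit` is a hypothetical name.

```haskell
import Control.Concurrent (threadDelay)
import System.Timeout (timeout)

-- Hypothetical wrapper: run an action with a limit given in seconds.
-- Expiry yields Nothing; a result within the limit yields Just.
withTimeLimit :: Int -> IO a -> IO (Maybe a)
withTimeLimit seconds = timeout (seconds * 1000000)

main :: IO ()
main = do
  -- Finishes immediately, well inside the limit.
  fast <- withTimeLimit 1 (return "handshake done")
  -- Sleeps for three seconds, so the one-second limit expires first.
  slow <- withTimeLimit 1 (threadDelay 3000000 >> return "never seen")
  print fast
  print slow
```

With this shape the handshake code can pattern-match on the `Maybe` and decide whether to retry or give up, rather than relying on an exception handler further up the stack.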

        Joel

On Nov 18, 2005, at 5:02 PM, Christian Maeder wrote:

> Sorry, I can only show you my output on
> Linux turing 2.6.11.4-21.9, but I don't know what's going on and  
> will not have more time this week.
>
> Cheers Christian
>
> maeder@turing:/local/maeder/haskell/postmortem> ./server
> 17:55:14: ThreadId 1: Accepted new connection: {handle: <socket: 4>}
> 17:55:14: ThreadId 1: Verify locations: 1
> 17:55:14: ThreadId 1: sslGetError: 2
> 17:55:14: ThreadId 4: Starting SSL handshake...
> 17:55:14: ThreadId 4: Reading from BIO...
> 17:55:14: ThreadId 4: Waiting for BIO 0x080d10d8
> 17:55:14: ThreadId 4: waitForBio: gotta wait a bit...
> server: unknown exception

--
http://wagerlabs.com/






Re: Project postmortem

Joel Reymont
In reply to this post by Simon Marlow
I'm happy to report that the problem can be reproduced by running the  
code from my darcs repo at http://test.wagerlabs.com/postmortem. See  
the README file. I'm on Mac OSX 10.4.3.

The server just sits there, goes through the SSL handshake and...  
does nothing else. The clients go through the handshake with the  
server and do nothing else. The handshake goes through X number of  
times and then the client crashes.

On Nov 18, 2005, at 1:55 PM, Simon Marlow wrote:

> How we normally proceed for a crash like this is as follows: examine
> where the crash happened and determine whether it is a result of heap or
> stack corruption, and then attempt to trace backwards to find out where
> the corruption originated from.  Tracing backwards means running the
> program from the beginning again, so it's essential to have a
> reproducible example.  Without reproducibility, we have to use a
> combination of debugging printfs and staring really hard at the code,
> which is much more time consuming (and still requires being able to run
> the program to make it crash with debugging output turned on).

--
http://wagerlabs.com/






RE: Project postmortem

Simon Peyton Jones
In reply to this post by Joel Reymont
If it's MacOS specific, we're not going to be much help at GHC HQ,
because we don't have any (Macs that is).  Wolfgang Thaller is the MacOS
expert, but maybe there are others now?

Simon


Re: Project postmortem

Joel Reymont
Is Wolfgang still around?

Would you guys be willing to guide me through this? I could then  
possibly become the next Mac OSX expert :-).

I have the disassembler dumps, etc. I do not know how to approach  
this problem. I read up a bit on the GHC internals, STG, code  
generation, etc.

        Thanks, Joel

P.S. Please feel free to take the email exchange offline; it could be
too boring for everyone else.

On Nov 21, 2005, at 9:35 AM, Simon Peyton-Jones wrote:

> If it's MacOS specific, we're not going to be much help at GHC HQ,
> because we don't have any (Macs that is).  Wolfgang Thaller is the MacOS
> expert, but maybe there are others now?

--
http://wagerlabs.com/






Re: Project postmortem

Joel Reymont
In reply to this post by Simon Peyton Jones
Simon,

What about the non-OSX issue of using a Chan to collect traces from  
thousands of threads?

It's not working very well for me when I use readChan in a loop (see  
the code). getChanContents works much better but then the logger  
thread is stuck forever and everything else that waits on it is stuck  
as well.

The output from logger (Util.hs) stops after a few lines, and memory
use then starts to grow because the output sent to the chan is no
longer being processed.
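
For reference, here is a minimal sketch of a logger that uses readChan in a loop and can still terminate, by sending an explicit stop message through the same channel. It is not the code from Util.hs; the `LogMsg` type, the `logger` signature and the worker count are all assumptions made for illustration.

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (Chan, newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (forM_, replicateM_)

-- Hypothetical message type: either a trace line or a request to stop.
data LogMsg = Line String | Stop

-- Drain the channel until Stop arrives, passing each line to 'sink'.
-- Returns the number of lines handled.
logger :: Chan LogMsg -> (String -> IO ()) -> IO Int
logger chan sink = go 0
  where
    go n = do
      msg <- readChan chan
      case msg of
        Line s -> sink s >> (go $! n + 1)
        Stop   -> return n

main :: IO ()
main = do
  chan     <- newChan
  done     <- newEmptyMVar
  finished <- newEmptyMVar
  -- The logger runs in its own thread and reports its line count on exit.
  _ <- forkIO (logger chan putStrLn >>= putMVar finished)
  let workers = 100
  -- Each worker writes one trace line, then signals completion.
  forM_ [1 .. workers] $ \i -> forkIO $ do
    writeChan chan (Line ("worker " ++ show (i :: Int) ++ " connected"))
    putMVar done ()
  -- Wait for every worker before asking the logger to stop.
  replicateM_ workers (takeMVar done)
  writeChan chan Stop
  n <- takeMVar finished
  putStrLn ("logged " ++ show n ++ " lines")
```

Because Stop travels through the channel after every worker has written, the logger sees all pending lines first and then exits, so nothing that waits on it is blocked forever. Chan is unbounded, so if the logger falls behind, memory still grows; a bounded queue would be needed to apply back-pressure.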

        Thanks, Joel

On Nov 21, 2005, at 9:35 AM, Simon Peyton-Jones wrote:

> If it's MacOS specific, we're not going to be much help at GHC HQ,
> because we don't have any (Macs that is).  Wolfgang Thaller is the MacOS
> expert, but maybe there are others now?

--
http://wagerlabs.com/




