A bug of multicore IO manager


Kazu Yamamoto (山本和彦)
Hi,

As I said before, I have started running an HTTP server using Mio in the
real world. Unfortunately, the daemon is not stable.

After a day or so, the server stops accepting HTTP requests. The server
prints no error messages.

The server process is still alive. To terminate it (it runs in a "screen"
terminal), a single Ctrl-C is not enough. Typing Ctrl-C a second time
terminates the server.

After several tests, I'm becoming convinced that this occurs only when
+RTS -N<x> is specified (where <x> >= 2). The server runs well if +RTS
-N<x> is not specified.
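
When comparing runs with and without +RTS -N<x>, it can help to have the server log which runtime configuration it is actually using. The following is a minimal sketch, not part of Mighty; the helper name describeRuntime is made up for illustration. It relies only on getNumCapabilities and rtsSupportsBoundThreads from Control.Concurrent.

```haskell
import Control.Concurrent (getNumCapabilities, rtsSupportsBoundThreads)

-- | Describe the runtime configuration in one line.
-- (Hypothetical helper, for illustration only.)
describeRuntime :: Bool -> Int -> String
describeRuntime threaded caps
  | not threaded = "non-threaded RTS"
  | caps >= 2    = "threaded RTS, " ++ show caps ++ " capabilities (multicore)"
  | otherwise    = "threaded RTS, 1 capability"

main :: IO ()
main = do
  -- Number of capabilities, i.e. the <x> given to +RTS -N<x> (1 if absent).
  caps <- getNumCapabilities
  -- True when the program was linked with -threaded.
  putStrLn (describeRuntime rtsSupportsBoundThreads caps)
```

Logging this line at startup makes it easy to confirm afterwards whether a hung process was started in the multicore configuration.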

My question: if a program compiled with GHC needs two Ctrl-Cs to
terminate, what state is the program in?

P.S.

It seems to me that the server is also leaking space. Its memory usage
grows gradually.

--Kazu



A bug of multicore IO manager

Johan Tibell-2
Hi Kazu,

On Tue, Sep 3, 2013 at 2:52 PM, Kazu Yamamoto <kazu at iij.ad.jp> wrote:

> Hi,
>
> As I said before, I started running HTTP server using Mio in the real
> world. Unfortunately, the daemon is not stable.
>
> After one day or so, the server cannot accept any HTTP requests.  No
> error messages from the server.
>
> The server is alive. To terminate the server (running in a "screen"
> terminal), single Ctrl-c is not enough. Typing Ctrl-c again terminates
> the server.
>

Could you run an strace on the process in this state so we can get an idea
what it's doing?


> After several tests, I'm getting convinced that this occurs only when
> +RTS -N<x> is specified (where <x> >= 2). The server runs well if +RTS
> -N<x> is not specified.
>

That indicates that the problem is with the threaded RTS and perhaps with
the IO manager.


> My question: if the program compiled with GHC needs double Ctrl-c to
> terminate, what is the situation of the program?
>

If Ctrl+C generates an exception (does it?) there could be an overzealous
exception catcher somewhere that catches all exceptions, including your
Ctrl+C.
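
To make the "overzealous exception catcher" concrete, here is a small sketch. GHC's default SIGINT handler throws the asynchronous exception UserInterrupt to the main thread; the sketch simulates that with throwIO so that it runs without a terminal. The function names swallowAll and handleSyncOnly are invented for this example and do not come from Mighty or Mio.

```haskell
import Control.Exception

-- | Overzealous: catches every exception, including asynchronous ones
-- such as UserInterrupt, and carries on as if nothing happened.
swallowAll :: IO () -> IO String
swallowAll act =
  (act >> pure "finished") `catch` \e ->
    pure ("swallowed: " ++ show (e :: SomeException))

-- | More careful: handle ordinary exceptions, but rethrow AsyncException
-- values (UserInterrupt, ThreadKilled, ...) so they keep propagating.
handleSyncOnly :: IO () -> IO String
handleSyncOnly act =
  (act >> pure "finished") `catch` \e ->
    case fromException e of
      Just async -> throwIO (async :: AsyncException)
      Nothing    -> pure ("handled: " ++ show (e :: SomeException))

main :: IO ()
main = do
  -- The catch-all handler hides the simulated Ctrl-C.
  swallowAll (throwIO UserInterrupt) >>= putStrLn
  -- Ordinary errors are still handled by the careful version.
  handleSyncOnly (throwIO (userError "boom")) >>= putStrLn
  -- The simulated Ctrl-C escapes the careful version.
  r <- try (handleSyncOnly (throwIO UserInterrupt))
  case r of
    Left UserInterrupt -> putStrLn "UserInterrupt propagated"
    Left other         -> putStrLn ("other async exception: " ++ show other)
    Right s            -> putStrLn s
```

A handler shaped like swallowAll around a server's accept loop would make the first Ctrl-C appear to do nothing, which is one possible explanation for the symptom, though not the only one.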


>
> P.S.
>
> It seems to me that the server also is leaking space. The server is
> getting fatter gradually.


Could you use the profiler to see what type of objects are leaking?
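
As a lightweight complement to heap profiling, a long-running server can also log its own live heap size and watch the trend. The sketch below assumes a GHC that ships GHC.Stats.getRTSStats (base 4.10 or later, which is newer than the GHC used in this thread) and a process started with +RTS -T. The helper name reportLiveBytes is hypothetical.

```haskell
import GHC.Stats (GCDetails (..), RTSStats (..), getRTSStats, getRTSStatsEnabled)

-- | Summarise heap residency as seen by the RTS.
-- (Hypothetical helper; requires the program to be run with +RTS -T.)
reportLiveBytes :: IO String
reportLiveBytes = do
  enabled <- getRTSStatsEnabled
  if not enabled
    then pure "RTS stats disabled; rerun with +RTS -T"
    else do
      s <- getRTSStats
      pure ("live bytes at last GC: " ++ show (gcdetails_live_bytes (gc s))
            ++ ", max live bytes: " ++ show (max_live_bytes s))

main :: IO ()
main = reportLiveBytes >>= putStrLn
```

Calling reportLiveBytes from a periodic logging thread shows whether residency really grows over time. It does not say which objects are retained; that still needs the profiler.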

A bug of multicore IO manager

Andreas Voellmy
Kazu, thanks for noticing this! I will try to recreate it on my server as
well.

-Andi



A bug of multicore IO manager

Andreas Voellmy
Hi Kazu,

What sort of workload was the Mighty server under during those one or two
days before it became unresponsive? That is, was this a production web
server? Or were you generating requests at some frequency, or leaving it
mostly idle?

-Andi



A bug of multicore IO manager

Kazu Yamamoto (山本和彦)
Hi Andi,

> What sort of workload was the Mighty server under during those one or two
> days before it became unresponsive? That is, was this a production web
> server? Or were you generating requests at some frequency, or leaving it
> mostly idle?

I ran Mighty on http://mew.org. This is my private domain, which
hosts my free programs and articles. It is not very busy, but it is not
idle either.

I did not generate requests from measurement tools.

--Kazu



A bug of multicore IO manager

Simon Marlow-7
In reply to this post by Johan Tibell-2
On 03/09/13 22:57, Johan Tibell wrote:

> Hi Kazu,
>
> On Tue, Sep 3, 2013 at 2:52 PM, Kazu Yamamoto <kazu at iij.ad.jp> wrote:
>
>     Hi,
>
>     As I said before, I started running HTTP server using Mio in the real
>     world. Unfortunately, the daemon is not stable.
>
>     After one day or so, the server cannot accept any HTTP requests.  No
>     error messages from the server.
>
>     The server is alive. To terminate the server (running in a "screen"
>     terminal), single Ctrl-c is not enough. Typing Ctrl-c again terminates
>     the server.
>
>
> Could you run an strace on the process in this state so we can get an
> idea what it's doing?
>
>     After several tests, I'm getting convinced that this occurs only when
>     +RTS -N<x> is specified (where <x> >= 2). The server runs well if +RTS
>     -N<x> is not specified.
>
>
> That indicates that the problem is with the threaded RTS and perhaps
> with the IO manager.
>
>     My question: if the program compiled with GHC needs double Ctrl-c to
>     terminate, what is the situation of the program?
>
>
> If Ctrl+C generates an exception (does it?) there could be an
> overzealous exception catcher somewhere that catches all exceptions,
> including your Ctrl+C.

The first Ctrl-C is sent as an Interrupted exception to the main thread.
The second Ctrl-C sends a SIGINT as usual, which tends to kill the
process.

If you need two Ctrl-Cs to kill the program, it probably means that it
deadlocked somewhere, maybe in the RTS.  Kazu: if you can attach to the
deadlocked process with gdb and get stack traces of all the threads,
that might help.
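
The first half of this can be sketched in plain Haskell. In the example below, a helper thread stands in for the runtime's signal handling and throws UserInterrupt (the exception GHC's base library uses for Ctrl-C) at a thread that is blocked on an empty MVar. The function name interruptBlockedThread is invented for this sketch.

```haskell
import Control.Concurrent (forkIO, myThreadId, threadDelay, throwTo)
import Control.Concurrent.MVar (MVar, newEmptyMVar, takeMVar)
import Control.Exception (AsyncException (..), try)

-- | Block the calling thread on an MVar that is never filled, while a
-- helper thread throws UserInterrupt at it after a short delay.
interruptBlockedThread :: IO String
interruptBlockedThread = do
  target <- myThreadId
  never  <- newEmptyMVar :: IO (MVar ())
  _ <- forkIO $ do
    threadDelay 100000              -- 0.1 s, so the target is blocked first
    throwTo target UserInterrupt    -- stands in for the first Ctrl-C
  r <- try (takeMVar never)         -- takeMVar is an interruptible operation
  pure $ case r of
    Left UserInterrupt -> "woken by UserInterrupt"
    Left e             -> "woken by " ++ show e
    Right ()           -> "unexpectedly unblocked"

main :: IO ()
main = interruptBlockedThread >>= putStrLn
```

A thread blocked at the Haskell level is woken by the exception, so one Ctrl-C is normally enough. If the exception never takes effect, the process is presumably stuck somewhere it cannot be delivered, which fits the suggestion of a deadlock inside the RTS.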

Cheers,
        Simon





A bug of multicore IO manager

Kazu Yamamoto (山本和彦)
Hi,

> If you need two Ctrl-Cs to kill the program, it probably means that it
> deadlocked somewhere, maybe in the RTS.  Kazu: if you can attach to
> the deadlocked process with gdb and get stack traces of all the
> threads, that might help.

To debug with GDB, I compiled Mighty with "-debug". This changes the
behavior, and I got the following error:

mighty-20130905: internal error: ASSERTION FAILED: file rts/sm/MarkWeak.c, line 371

    (GHC version 7.7.20130901 for i386_unknown_linux)
    Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug

Simon, can you tell what's going on?

--Kazu



A bug of multicore IO manager

Akio Takano
Hi,

On Thu, Sep 5, 2013 at 9:10 PM, Kazu Yamamoto <kazu at iij.ad.jp> wrote:

> Hi,
>
> > If you need two Ctrl-Cs to kill the program, it probably means that it
> > deadlocked somewhere, maybe in the RTS.  Kazu: if you can attach to
> > the deadlocked process with gdb and get stack traces of all the
> > threads, that might help.
>
> To debug with GDB, I compiled Mighty with "-debug". This changes the
> behavior and I got the following error:
>
> mighty-20130905: internal error: ASSERTION FAILED: file rts/sm/MarkWeak.c, line 371
>
>     (GHC version 7.7.20130901 for i386_unknown_linux)
>     Please report this as a GHC bug:  http://www.haskell.org/ghc/reportabug
>

I wonder if this issue could have been introduced by the commit:

https://github.com/ghc/ghc/commit/6770663f764db76dbb7138ccb3aea0527d194151

It looks like after the commit, addCFinalizerToWeak# can call into the GC
with the closure lock held. This means the info pointer points to
stg_WHITEHOLE_info, breaking the asserted invariant. I haven't done any
testing to confirm this, however.
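
For readers unfamiliar with where addCFinalizerToWeak# comes from: as far as I can tell, user code normally reaches it through the ForeignPtr API in base, when a C function pointer is registered as a finalizer. The sketch below shows that kind of code. It is an illustration of the code path under that assumption, not a reproduction of the assertion failure, and the names newByte and roundTrip are made up.

```haskell
import Data.Word (Word8)
import Foreign.ForeignPtr (ForeignPtr, newForeignPtr, withForeignPtr)
import Foreign.Marshal.Alloc (finalizerFree, mallocBytes)
import Foreign.Storable (peek, poke)

-- | Allocate one byte with C malloc and register C free() as its finalizer.
-- Registering a C finalizer is the step assumed to go through
-- addCFinalizerToWeak# inside base.
newByte :: Word8 -> IO (ForeignPtr Word8)
newByte x = do
  p <- mallocBytes 1
  poke p x
  newForeignPtr finalizerFree p

-- | Write a byte through a finalized pointer and read it back.
roundTrip :: Word8 -> IO Word8
roundTrip x = do
  fp <- newByte x
  withForeignPtr fp peek

main :: IO ()
main = roundTrip 42 >>= print
```

A network server that allocates many such buffers from several capabilities at once would exercise this path heavily, which may be why the problem shows up only under +RTS -N<x>.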

-- Takano Akio

A bug of multicore IO manager

Kazu Yamamoto (山本和彦)
Hi Takano-san,

> It looks like after the commit, addCFinalizerToWeak# can call into the GC
> with the closure lock held. This means the info pointer points to
> stg_WHITEHOLE_info, breaking the asserted invariant. I haven't done any
> testing to confirm this, however.

I can try. Should I revert this patch?

--Kazu



A bug of multicore IO manager

Akio Takano
I'm going to try to make a small test case today (probably after 08:00
UTC), but feel free to try it. If my guess is correct, reverting the patch
should fix the problem.
