Creative ideas on how to debug heap corruption

Moritz Angermann-2
Hi there!

As some of you may know, I've been working on an aarch64 native code
generator.  Now I've hit a situation where my stage2 compiler somehow
corrupts my heap.  Initially I thought this was likely caused by missing memory
barriers; however, they are emitted.  That doesn't rule barriers out, but at
least it's not as simple as "they are just missing".

The crashes I see are non-deterministic; in fact, I sometimes even manage
to compile a Hello World module without crashes.  Other times it crashes
with unknown closure errors, or it just crashes.  But it always crashes
during GC.  Changing the nursery size makes it crash a bit more frequently,
but nothing obvious sticks out yet.

If anyone has some creative ideas, I'd love to hear them.  I've been wondering
if just logging allocations (offset, range, type) would help figure out what we
expected to be there, and then maybe try to break on the allocation (and
subsequent writes).
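
A minimal sketch of the allocation-log idea, assuming the log is reset at every GC so that recorded ranges never overlap. The names `Allocation`, `AllocationLog`, `record`, and `lookup` are hypothetical and are not part of GHC or its RTS; the point is only that a sorted log lets you map a suspicious address back to the object that was expected to live there.

```python
import bisect
from dataclasses import dataclass


@dataclass(frozen=True)
class Allocation:
    offset: int        # start address (or heap offset) of the object
    size: int          # size of the object in bytes
    closure_type: str  # what the allocating code believed it was creating


class AllocationLog:
    """Allocation log searchable by address (hypothetical helper)."""

    def __init__(self):
        self._offsets = []  # sorted start offsets, used for bisect
        self._records = []  # Allocation records, parallel to _offsets

    def record(self, offset, size, closure_type):
        # Keep both lists sorted by start offset.
        i = bisect.bisect_right(self._offsets, offset)
        self._offsets.insert(i, offset)
        self._records.insert(i, Allocation(offset, size, closure_type))

    def lookup(self, address):
        """Return the allocation whose range covers `address`, or None."""
        i = bisect.bisect_right(self._offsets, address) - 1
        if i < 0:
            return None
        rec = self._records[i]
        if rec.offset <= address < rec.offset + rec.size:
            return rec
        return None
```

When the GC trips over an unknown closure at some address, `lookup(address)` gives the type and range that were logged for it, which is the natural place to set a breakpoint or watchpoint on the next run.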

I'm sure some have been down this road before. 

Cheers,
 Moritz

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Re: Creative ideas on how to debug heap corruption

Ben Lippmeier-2


> On 31 Aug 2020, at 5:54 pm, Moritz Angermann <[hidden email]> wrote:
>
> If anyone has some creative ideas, I'd love to hear them.  I've been wondering
> if just logging allocations (offset, range, type) would help figuring out what we
> expected to be there; and then maybe try to break on the allocation, (and
> subsequent writes).
>
> I'm sure some have been down this road before.

Force a GC before every allocation, and make the GC check the validity of the objects before it moves anything. I think this used to be possible by compiling the runtime system in debug mode.

The usual pain of heap corruption is that once the heap is corrupted it may be several GC cycles before you get the actual crash, and in the meantime the objects have all been moved around. The GC walks over all the objects by nature, so get it to validate the heap every time it does, then force it to run as often as you possibly can.
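
To illustrate why validating on every pass narrows things down, here is a toy sketch, not the RTS's actual sanity checker. It assumes a heap modelled as a dict from address to `(closure_type, pointer_fields)`; `KNOWN_TYPES`, `check_heap`, and `run_with_checks` are hypothetical names. Checking after every step pins the corruption to the step that introduced it instead of to a crash several cycles later.

```python
# Closure types this toy model accepts (an illustrative subset).
KNOWN_TYPES = {"CONSTR", "FUN", "THUNK", "IND"}


def check_heap(heap):
    """Validate a toy heap: dict of address -> (closure_type, [pointers]).

    Returns a list of problem descriptions; empty if the heap is consistent.
    """
    problems = []
    for addr in sorted(heap):
        closure_type, pointers = heap[addr]
        if closure_type not in KNOWN_TYPES:
            problems.append(f"{addr:#x}: unknown closure type {closure_type!r}")
        for field, target in enumerate(pointers):
            # Every pointer must refer to the start of a live object.
            if target not in heap:
                problems.append(
                    f"{addr:#x}: field {field} points to {target:#x}, "
                    "which is not the start of an object"
                )
    return problems


def run_with_checks(steps, heap):
    """Run each mutator step and validate the heap immediately afterwards.

    Returns (index of the first step that left the heap invalid, problems),
    or (None, []) if every step kept the heap consistent.
    """
    for n, step in enumerate(steps):
        step(heap)
        problems = check_heap(heap)
        if problems:
            return n, problems
    return None, []
```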

A user space approach is to use a library like vacuum or packman that also walks over the heap objects directly.

http://hackage.haskell.org/package/vacuum-2.2.0.0/docs/GHC-Vacuum.html
https://hackage.haskell.org/package/packman

Ben.

Re: Creative ideas on how to debug heap corruption

Csaba Hruska
Dump the whole heap to a file, either during GC traversal or by taking the whole allocated area. Hmm, maybe this is the same as a core dump.


Re: Creative ideas on how to debug heap corruption

George Colpitts
I assume you're familiar with the following from https://www.aosabook.org/en/ghc.html and that this facility is still there. Just in case you are not: 

So, the debug RTS has an optional mode that we call sanity checking. Sanity checking enables all kinds of expensive assertions, and can make the program run many times more slowly. In particular, sanity checking runs a full scan of the heap to check for dangling pointers (amongst other things), before and after every GC. The first job when investigating a runtime crash is to run the program with sanity checking turned on; sometimes this will catch the invariant violation well before the program actually crashes.


Re: Creative ideas on how to debug heap corruption

George Colpitts
+Moritz


Re: Creative ideas on how to debug heap corruption

Csaba Hruska
Fuzzing:
  1. generate simple random STG programs
  2. compile and run with RTS sanity checking enabled
  3. compare the program result between different backends
The fuzzer should cover all codegen cases and all code in the RTS. Maybe this could be checked by the existing tools.
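
The three steps above can be sketched as a differential-testing loop. This is a toy stand-in under stated assumptions: random arithmetic trees take the place of STG programs, and a tree-walking interpreter and a small stack machine take the place of two code generators. All names here (`gen_expr`, `eval_reference`, `compile_stack`, `run_stack`, `fuzz`) and the `buggy` switch are hypothetical and only illustrate the compare-the-backends step.

```python
import random


def gen_expr(rng, depth):
    """Generate a random expression tree over small integer literals."""
    if depth == 0 or rng.random() < 0.3:
        return ("lit", rng.randint(-9, 9))
    op = rng.choice(["add", "sub", "mul"])
    return (op, gen_expr(rng, depth - 1), gen_expr(rng, depth - 1))


def apply_op(op, a, b):
    if op == "add":
        return a + b
    if op == "sub":
        return a - b
    return a * b


def eval_reference(expr):
    """Backend A: direct tree-walking interpreter (the trusted reference)."""
    if expr[0] == "lit":
        return expr[1]
    return apply_op(expr[0], eval_reference(expr[1]), eval_reference(expr[2]))


def compile_stack(expr, code=None):
    """Backend B, step 1: compile the tree to stack-machine instructions."""
    code = [] if code is None else code
    if expr[0] == "lit":
        code.append(("push", expr[1]))
    else:
        compile_stack(expr[1], code)
        compile_stack(expr[2], code)
        code.append((expr[0],))
    return code


def run_stack(code, buggy=False):
    """Backend B, step 2: execute the instructions."""
    stack = []
    for instr in code:
        if instr[0] == "push":
            stack.append(instr[1])
        else:
            b, a = stack.pop(), stack.pop()
            if buggy and instr[0] == "sub":
                a, b = b, a  # deliberately injected operand-order bug
            stack.append(apply_op(instr[0], a, b))
    return stack.pop()


def fuzz(n_programs, seed=0, buggy=False):
    """Return (program, expected, actual) for every backend disagreement."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    mismatches = []
    for _ in range(n_programs):
        expr = gen_expr(rng, depth=4)
        expected = eval_reference(expr)
        actual = run_stack(compile_stack(expr), buggy=buggy)
        if expected != actual:
            mismatches.append((expr, expected, actual))
    return mismatches
```

Each mismatch carries the generating program, so a failing case can be shrunk by hand and replayed with the same seed.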


Re: Creative ideas on how to debug heap corruption

Moritz Angermann-2
Thanks everyone. I have indeed been trying to get somewhere with sanity checking. That used to help quite a bit with the dead-stripping issues that happened on iOS a long time ago, but those were also much more deterministic.
Maybe I'll try to see if running it through qemu will give me some more determinism. That at least gives somewhat predictable allocations. It could still end up being some annoying memory-ordering issue that the LLVM backend
just happens not to run into, by luck or because of its optimisation passes.


Re: Creative ideas on how to debug heap corruption

Ben Gamari-2
Ben Lippmeier <[hidden email]> writes:

>> On 31 Aug 2020, at 5:54 pm, Moritz Angermann <[hidden email]> wrote:
>>
>> If anyone has some creative ideas, I'd love to hear them.  I've been wondering
>> if just logging allocations (offset, range, type) would help figuring out what we
>> expected to be there; and then maybe try to break on the allocation, (and
>> subsequent writes).
>>
>> I'm sure some have been down this road before.
>
> Force a GC before every allocation, and make the GC check the validity
> of the objects before it moves anything. I think this used to be
> possible by compiling the runtime system in debug mode.
>
> The usual pain of heap corruption is that once the heap is corrupted
> it may be several GC cycles before you get the actual crash, and in
> the meantime the objects have all been moved around. The GC walks over
> all the objects by nature, so get it to validate the heap every time
> it does, then force it to run as often as you possibly can.
>
Indeed.  Small nurseries (using +RTS -A), deterministic GC behavior
(with +RTS -V0 -I0), and sanity checking (with +RTS -DS) are all very
useful for this.
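
As a small sketch of how those flags combine on one command line: the helper name `rts_debug_args`, the `./Hello` binary, and the `64k` nursery size are assumptions chosen for illustration; only the flags themselves come from the text above. Note that `-DS` is only accepted by a binary linked against the debug RTS (GHC's `-debug` link option).

```python
def rts_debug_args(program, program_args=(), nursery="64k"):
    """Build an argv list that runs `program` with the RTS options above.

    -A<size>  small nursery, so minor GCs happen often (size is an assumption)
    -V0 -I0   the settings suggested above for deterministic GC behavior
    -DS       sanity checking (needs a binary linked with the debug RTS)
    """
    return [
        program,
        *program_args,
        "+RTS",
        f"-A{nursery}",
        "-V0",
        "-I0",
        "-DS",
        "-RTS",
    ]
```

The resulting list can be passed straight to `subprocess.run`, for example `subprocess.run(rts_debug_args("./Hello"))`, and the nursery size varied between runs to change when collections happen.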

> A user space approach is to use a library like vacuum or packman that
> also walks over the heap objects directly.
>
> http://hackage.haskell.org/package/vacuum-2.2.0.0/docs/GHC-Vacuum.html
> https://hackage.haskell.org/package/packman
>
For what it's worth, the ghc-debug [1] project which Sven Tennie, Matt
Pickering, and I have been working on over the last year or so was in
part motivated by precisely this use-case. It would allow one Haskell
process's heap to be traversed by another process. This is
useful for both debugging and profiling use-cases.

Cheers,

- Ben


[1] https://github.com/bgamari/ghc-debug
