Async exceptions and delimited continuations


Async exceptions and delimited continuations

Alexis King
Hi all,

As some of you are likely aware, I have an open GHC proposal[1] to add
native support for delimited continuations to the RTS. I also have an
incomplete implementation,[2] and the only major remaining obstacle
concerns async exceptions. The issue is subtle, so I’ve tried to
provide all the necessary context in this email. If you’re already
familiar with the ideas at play, you can skip the context about how
delimited continuations work.

For those unfamiliar, delimited continuations allow capturing slices
of the call stack and restoring them later. For example, the program

    do y <- prompt $ do x <- control0 $ \k -> k (pure 10)
                        pure (x + 5)
       print y

will print 15. To understand what’s happening operationally, we can
imagine an abstract call stack made up of continuation frames:

    ┌──────────┐
    │  ● + 5   │    redex: control0 $ \k -> k (pure 10)
    ├──────────┤
    │ prompt ● │
    ├──────────┤
    │ print ●  │
    ├──────────┤
    │   ...    │
    ├──────────┤

Here, each ● represents the “hole” where the evaluated result of the
redex will be returned. `control0` moves all the frames between the
top of the stack and the first `prompt` into the heap and returns a
reference to them, so after a single reduction step, we have

    ┌──────────┐
    │ print ●  │    redex: k1 (pure 10)
    ├──────────┤    heap:  k1 = ┌──────────┐
    │   ...    │                │  ● + 5   │
    ├──────────┤                └──────────┘

When a continuation is applied, its stored stack frames are copied
back onto the top of the current stack, and the argument becomes the
new redex:

    ┌──────────┐
    │  ● + 5   │    redex: pure 10
    ├──────────┤
    │ print ●  │
    ├──────────┤
    │   ...    │
    ├──────────┤

Now it should hopefully be clear how we end up printing 15.
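For intuition, the same 15 can be produced today with the shift/reset operators from the Cont monad in transformers. Note this is only an approximate model: shift reinstalls the prompt around the captured continuation (unlike control0), and the continuation k here is a pure function rather than an action.

```haskell
import Control.Monad.Trans.Cont (Cont, evalCont, reset, shift)

-- Analogue of the prompt/control0 example above, using shift/reset.
example :: Int
example = evalCont $ do
  y <- reset $ do
    x <- shift $ \k -> pure (k 10)  -- k captures the "● + 5" frame
    pure (x + 5)
  pure y
-- example = 15
```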

With that context, consider the following expression:

    prompt $ mask_ $ do x <- control0 $ \k -> k (pure 10)
                        f x

The call stack at the point of control0 looks very similar in this
program, but now we have a use of `mask_` in there as well:

    ┌──────────┐
    │   f ●    │    redex: control0 $ \k -> k (pure 10)
    ├──────────┤    exns:  masked
    │ mask_ ●  │
    ├──────────┤
    │ prompt ● │
    ├──────────┤
    │   ...    │
    ├──────────┤

When capturing the continuation, we’ll unwind the stack the same way
we did before. Because we’re popping mask_ off the stack, we’ll unmask
async exceptions:

    ┌──────────┐    redex: k1 (pure 10)
    │   ...    │    exns:  not masked
    ├──────────┤    heap:  k1 = ┌──────────┐
                                │   f ●    │
                                ├──────────┤
                                │ mask_ ●  │
                                └──────────┘

Now when we apply `k1`, we’ll copy the `mask_` frame back onto the
stack, and we must re-mask async exceptions. Otherwise, exceptions
will not be masked during the call to `f`, which would be wrong.

Why is this a problem? The RTS applies an optimization: if you call
mask_ (actually maskAsyncExceptions#) while exceptions are already
masked, it doesn’t push a new stack frame at all. So, for example, if
you write

    mask_ $ mask_ $ foo bar

you’ll end up with only one mask_ frame on the call stack, not two.
This tiny optimization actually allows not one but two savings:

    1. The immediate benefit is that we save a stack frame.

    2. The hidden benefit is that we never need to push the old
       exception masking state onto the stack.

       If we had multiple mask_ frames on the stack simultaneously, we
       wouldn’t know what to do when returning: should we unmask them,
       or should they stay masked? We’d need to push that information
       onto the stack in the same way we must push a return address
       when calling a function.

       By skipping these redundant stack frames, we can always be
       certain the right thing to do on return is to unmask async
       exceptions. No need to store anything else.

(This explanation is a slight simplification, especially once
maskUninterruptible comes into play, but it’s close enough.)
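This is observable from Haskell via getMaskingState: the state reported under one mask_ or under two nested ones is identical (a small demonstration, not RTS internals):

```haskell
import Control.Exception (MaskingState (..), getMaskingState, mask_)

-- The inner mask_ pushes no new frame, so the reported state is the
-- same whether we nest once or twice: MaskedInterruptible both times.
nestedState :: IO (MaskingState, MaskingState)
nestedState = do
  once  <- mask_ getMaskingState
  twice <- mask_ (mask_ getMaskingState)
  pure (once, twice)
```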

Now you may see the looming problem: this strategy completely breaks
down in the presence of delimited continuations. With delimited
continuations, we might have a program like this:

    mask_ $ prompt $ mask_ $ ...

If we capture a continuation up to this prompt, we expect the inner
mask_ frame to be captured along with it. But that won’t happen if we
never pushed a mask_ frame at all due to the aforementioned
optimization.
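To make the failure concrete, here is a toy model of the stack (the Frame type and the two stacks are illustrative inventions, not actual RTS structures). With the optimization, the inner mask_ never appears above the prompt, so the capture loses it:

```haskell
-- Toy model of the call stack for: mask_ $ prompt $ mask_ $ ...
-- (top of stack is the head of the list)
data Frame = PromptFrame | MaskFrame
  deriving (Eq, Show)

-- With the optimization, the inner mask_ pushes nothing (already masked):
stackWithOpt :: [Frame]
stackWithOpt = [PromptFrame, MaskFrame]

-- Without it, the inner mask_ frame sits above the prompt:
stackWithoutOpt :: [Frame]
stackWithoutOpt = [MaskFrame, PromptFrame, MaskFrame]

-- control0 captures every frame above the first prompt.
capture :: [Frame] -> [Frame]
capture = takeWhile (/= PromptFrame)
-- capture stackWithOpt = []          -- the mask frame is lost
-- capture stackWithoutOpt = [MaskFrame]
```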

So now I can finally state my question: what is the right solution for
this? I see three obvious possible ways forward:

    1. Keep track of whether or not we’re inside a prompt and skip the
       optimization in that case.

    2. Keep some bookkeeping that tracks the modifications to the
       async exception masking state since the most recently pushed
       prompt.

    3. Just don’t bother with the optimization at all.

Option 3 certainly seems the most appealing from a simplicity point of
view, and I suspect the optimization doesn’t matter much in practice.
Why? Because the real `mask` implementation from Control.Exception
already avoids re-masking exceptions if they’re masked! (And that’s
okay, because prompt# and control0# are not intended to be used
directly in IO, so code that uses them can provide its own version of
`mask`.) However, it is admittedly possible for the restore action
passed to the argument of `mask` to create redundant calls, as the
check is only performed in `mask` itself.
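That check can be sketched against the public API like so (maskUnlessMasked is a hypothetical name; the real mask in GHC.IO also threads a restore action through and handles MaskedUninterruptible, both omitted here):

```haskell
import Control.Exception (MaskingState (..), getMaskingState, mask_)

-- If we're already masked, run the action directly and push nothing;
-- otherwise mask. This is the in-Haskell analogue of the RTS trick.
maskUnlessMasked :: IO a -> IO a
maskUnlessMasked io = do
  st <- getMaskingState
  case st of
    Unmasked -> mask_ io
    _        -> io
```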

Is eliminating this optimization an acceptable compromise? Or is there
reason to believe this is important for performance of real programs?

Thanks,
Alexis

[1]: https://github.com/ghc-proposals/ghc-proposals/pull/313
[2]: https://gitlab.haskell.org/lexi.lambda/ghc/-/commits/first-class-continuations
_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs
Re: Async exceptions and delimited continuations

Iavor Diatchki
Hi,

I am by no means an expert on the GHC RTS, but all three suggestions seem quite reasonable to me. A good way to make a decision might be to collect some data, at least for the things that are easy to measure. In particular, it would be interesting to temporarily disable the optimization and run some benchmarks on IO/exception-heavy code (some sort of server? or maybe a synthetic benchmark to really stress the masking/unmasking) and see what the change in performance is.
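Such a synthetic stress test might look something like this (a sketch; redundantMasks is a made-up name, and the inner mask_ is exactly where the skip-if-already-masked optimization would fire):

```haskell
import Control.Exception (mask_)
import Control.Monad (replicateM_)

-- Worst case for the optimization: many redundant inner mask_ calls
-- under an outer mask_. Time this (e.g. with criterion, or +RTS -s)
-- against an RTS built with and without the optimization.
redundantMasks :: Int -> IO ()
redundantMasks n = mask_ (replicateM_ n (mask_ (pure ())))
```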

-Iavor

Re: Async exceptions and delimited continuations

Ben Gamari
In reply to this post by Alexis King
Alexis King <[hidden email]> writes:

> Hi all,
>
> As some of you are likely aware, I have an open GHC proposal[1] to add
> native support for delimited continuations to the RTS. I also have an
> incomplete implementation,[2] and the only major remaining obstacle
> concerns async exceptions. The issue is subtle, so I’ve tried to
> provide all the necessary context in this email. If you’re already
> familiar with the ideas at play, you can skip the context about how
> delimited continuations work.
>
...

> This tiny optimization actually allows not one but two savings:
>
>     1. The immediate benefit is that we save a stack frame.
>
>     2. The hidden benefit is that we never need to push the old
>        exception masking state onto the stack.
>
>        If we had multiple mask_ frames on the stack simultaneously, we
>        wouldn’t know what to do when returning: should we unmask them,
>        or should they stay masked? We’d need to push that information
>        onto the stack in the same way we must push a return address
>        when calling a function.
>
>        By skipping these redundant stack frames, we can always be
>        certain the right thing to do on return is to unmask async
>        exceptions. No need to store anything else.
>
Indeed. However, I don't think it would be that expensive to store the
masking state with a suitable implementation strategy. Specifically,
don't add a "masking state" field to the return frame. Rather, I think
you want to have `mask` push one of two possible return frame types:

 * in the case that we are already masked push an "already masked"
   frame. The entry code for this would be a no-op.

 * in the case that we aren't currently masked push the usual
   stg_maskAsyncExceptionszh_ret_info return frame.

This incurs minimal overhead while preserving the information that you
need. We avoid adding a word to the stack frame (likely saving an
instruction), and in cases where the optimisation fires today, the user
would only pay for a return to a trivial entry frame. It seems to me
that this is as close to free as we will get.

Under this scheme `control` can simply look for "already masked" frames
when copying the stack.
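As a toy model of this scheme (hypothetical types, not RTS code): never skipping the push, and pushing a cheap no-op frame instead, preserves exactly the information a later stack walk needs:

```haskell
-- Model of the two-frame-type strategy: when already masked, push a
-- no-op AlreadyMaskedFrame rather than nothing.
data Frame = PromptFrame | MaskFrame | AlreadyMaskedFrame
  deriving (Eq, Show)

pushMask :: Bool -> [Frame] -> [Frame]  -- Bool: already masked?
pushMask True  stack = AlreadyMaskedFrame : stack
pushMask False stack = MaskFrame : stack

-- The stack for: mask_ $ prompt $ mask_ $ ...
stack :: [Frame]
stack = pushMask True (PromptFrame : pushMask False [])

-- control0 captures every frame above the first prompt; the inner
-- mask is no longer invisible to it.
capture :: [Frame] -> [Frame]
capture = takeWhile (/= PromptFrame)
-- capture stack = [AlreadyMaskedFrame]
```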

...
> Option 3 certainly seems the most appealing from a simplicity point of
> view, and I suspect the optimization doesn’t matter much in practice.
> Why? Because the real `mask` implementation from Control.Exception
> already avoids re-masking exceptions if they’re masked! (And that’s
> okay, because prompt# and control0# are not intended to be used
> directly in IO, so code that uses them can provide its own version of
> `mask`.) However, it is admittedly possible for the restore action
> passed to the argument of `mask` to create redundant calls, as the
> check is only performed in `mask` itself.
>
> Is eliminating this optimization an acceptable compromise? Or is there
> reason to believe this is important for performance of real programs?
>
It never hurts to measure. Perhaps establish the worst-case performance
impact by looking at a simple program which just masks repeatedly.

For what it's worth, I rather doubt that this optimisation is going to
make or break any program. Exception operations in general are somewhat
rare in my experience and the cost of pushing a stack frame (or, perhaps
more importantly on modern hardware, returning through it) should be
reasonably cheap.

By the way, when you go to implement this do add a few comments to the
masking primitives. It took a bit of staring to work out what was going
on in there.

Cheers,

- Ben

_______________________________________________
ghc-devs mailing list
[hidden email]
http://mail.haskell.org/cgi-bin/mailman/listinfo/ghc-devs

Re: Async exceptions and delimited continuations

Takenobu Tani
In reply to this post by Alexis King
Hi Alexis,

I prepared a framework page on ghc-wiki about this proposal:
  * https://gitlab.haskell.org/ghc/ghc/-/wikis/delimited-continuations

If it helps, please use the page to share ideas with developers and users.

It is a draft page. Please feel free to rewrite all contents as you like.
You could also create sub pages.

Some similar pages for proposals are here:
  * https://gitlab.haskell.org/ghc/ghc/-/wikis/linear-types
  * https://gitlab.haskell.org/ghc/ghc/-/wikis/dependent-haskell

Regards,
Takenobu
