
System.IO.Pipelines.Pipe.ReadAsync fails to cancel on CancellationToken #58909


Closed
AArnott opened this issue Sep 9, 2021 · 25 comments · Fixed by #59090

Comments

@AArnott
Contributor

AArnott commented Sep 9, 2021

Description

I have a hang dump of a process that is stuck in a PipeReader.ReadAsync call that was canceled via CancellationToken but which never canceled the returned ValueTask<ReadResult>. This resulted in the hang.
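For orientation, here is a minimal sketch of the pattern involved (my own illustration, not the actual repro; `reader` stands in for any PipeReader, and the real code path goes through Nerdbank.Streams and MessagePackStreamReader):

using System;
using System.IO.Pipelines;
using System.Threading;
using System.Threading.Tasks;

static class ReadCancellationSketch
{
    // Illustrative only: read from a PipeReader with a timeout-backed token.
    public static async Task<ReadResult> ReadWithTimeoutAsync(PipeReader reader, TimeSpan timeout)
    {
        using var cts = new CancellationTokenSource(timeout);

        // Expected: when cts fires, this await throws OperationCanceledException.
        // Observed in the dump: the token is canceled, yet the ValueTask never
        // completes, because no CancellationTokenRegistration was created for it.
        return await reader.ReadAsync(cts.Token);
    }
}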

I do not have a reliable repro of it but I have a rash of pipelines and repos that all suffer from frequent hangs in test runs and ultimately they all lead into this same function so I believe they are all caused by this one problem.

Dump file. Failed pipeline.

Looking at the dump file in WinDbg I took these notes, which form a kind of continuation chain mixed with some state in each of the objects. Below you can see that the CancellationToken that was passed to PipeReader.ReadAsync is in fact cancelled, but that the Pipe._readerAwaitable field has no CancellationTokenRegistration for it. You can also see that ReadAsync has not completed, so we're blocked waiting for ReadAsync to honor the CancellationToken yet it never will because it wasn't even registered to notice.

0000020b0e0e2048 Pipe
  0000020b0e0e20e8 _readerAwaitable
    10 _awaitableState
    0000020b0e0e2110 _cancellationTokenRegistration
      0 _id
      0 _node
0000020b0e0e2278 DefaultPipeReader.ReadAsync
  0000020b0e0e2048 _pipe
0000020b0dff7d38 System.IO.Pipelines.PipeReaderStream+<ReadAsyncInternal
  0000020b0dfdeeb8 cancellationToken.m_source (_state: 3, cancelled)
  0000020b0e0e2278 _pipeReader
0000020b0dff7df8 MessagePackStreamReader.TryReadMoreDataAsync
  0000020b0dfdeeb8 cancellationToken.m_source (_state: 3, cancelled)
  0000020b0dfde1a8 underlying System.IO.Pipelines.PipeReaderStream 
0000020b0dfde5b8 V2Formatter

Configuration

Windows 10
The hangs show up on .NET Framework and .NET Core 3.1 at least.
x64 process
System.IO.Pipelines v5.00.120.57516 (nuget package version v5.0.1)

Regression?

Unknown.

Other information

This bug was originally filed against Nerdbank.Streams as dotnet/Nerdbank.Streams#387 before I tracked it into System.IO.Pipelines.

@ghost ghost added area-System.IO.Pipelines untriaged New issue has not been triaged by the area owner labels Sep 9, 2021
@AArnott
Contributor Author

AArnott commented Sep 10, 2021

This is seriously impacting our pipelines. We've yet to catch this on a local dev machine, but several of our pipelines fail multiple times daily because of this bug. One in particular hangs in 75% of runs! Please look into this soon.

I wonder if there's some kind of race condition in or around this code:

public void BeginOperation(CancellationToken cancellationToken, Action<object?> callback, object? state)
{
    cancellationToken.ThrowIfCancellationRequested();
    _awaitableState |= AwaitableState.Running;
    // Don't register if already completed, we would immediately unregistered in ObserveCancellation
    if (cancellationToken.CanBeCanceled && !IsCompleted)
    {
#if (NETSTANDARD2_0 || NETFRAMEWORK)
        _cancellationToken = cancellationToken;
#endif
        _cancellationTokenRegistration = cancellationToken.UnsafeRegister(callback, state);
    }
}

@davidfowl
Member

Anything that updates the PipeAwaitable should be in a lock. The only thing that could change state here is the cancellationToken itself (since the caller owns that). The registration is released here

cancellationTokenRegistration = _readerAwaitable.ReleaseCancellationTokenRegistration(out cancellationToken);

It seems you're also using the sync context. I don't know if that's related, but does it repro if you disable it?

@zlatanov
Contributor

This looked interesting, so I did a little digging and here is what I found:

public void BeginOperation(CancellationToken cancellationToken, Action<object?> callback, object? state)
{
    cancellationToken.ThrowIfCancellationRequested();
    _awaitableState |= AwaitableState.Running;
    // Don't register if already completed, we would immediately unregistered in ObserveCancellation
    if (cancellationToken.CanBeCanceled && !IsCompleted)
    {
#if (NETSTANDARD2_0 || NETFRAMEWORK)
        _cancellationToken = cancellationToken;
#endif
        _cancellationTokenRegistration = cancellationToken.UnsafeRegister(callback, state);
    }
}

If the cancellation token gets cancelled just before executing line 58 here:

_cancellationTokenRegistration = cancellationToken.UnsafeRegister(callback, state);

then the callback will be invoked synchronously. The callback is actually:

private void ReaderCancellationRequested()
{
    CompletionData completionData;
    lock (_sync)
    {
        _readerAwaitable.CancellationTokenFired(out completionData);
    }
    TrySchedule(_readerScheduler, completionData);
}

which calls PipeAwaitable

public void CancellationTokenFired(out CompletionData completionData)
{
    // We might be getting stale callbacks that we already unsubscribed from
    if (CancellationToken.IsCancellationRequested)
    {
        Cancel(out completionData);
    }
    else
    {
        completionData = default;
    }
}

However, if we read CancellationTokenFired carefully, we find that Cancel(out completionData) is only called if CancellationToken.IsCancellationRequested. If we go to the definition of CancellationToken in PipeAwaitable, it is defined as:

#if (!NETSTANDARD2_0 && !NETFRAMEWORK)
private CancellationToken CancellationToken => _cancellationTokenRegistration.Token;
#else

It uses the _cancellationTokenRegistration field to extract the cancellation token. However if we go back to the beginning, we will see that the _cancellationTokenRegistration is not yet set:

_cancellationTokenRegistration = cancellationToken.UnsafeRegister(callback, state);

The summary is that the UnsafeRegister method can invoke the callback synchronously, but the callback uses the result of UnsafeRegister before it has been assigned. Writing a test for this particular case would be extremely difficult in my opinion.
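To make the inline-invocation behavior concrete, here is a small standalone sketch (my own illustration, not the library code; it requires a target where CancellationTokenRegistration exposes Token, i.e. the non-NETSTANDARD2_0/NETFRAMEWORK path):

using System;
using System.Threading;

class InlineCallbackDemo
{
    private CancellationTokenRegistration _cancellationTokenRegistration;

    public void Begin(CancellationToken token)
    {
        // If `token` is already canceled here, Register invokes the callback
        // synchronously on this thread, *before* the assignment below completes.
        _cancellationTokenRegistration = token.Register(OnCanceled);
    }

    private void OnCanceled()
    {
        // When invoked inline, _cancellationTokenRegistration is still default, so its
        // Token is default(CancellationToken) and IsCancellationRequested is false --
        // the same check CancellationTokenFired performs, which is how the cancellation is lost.
        Console.WriteLine(_cancellationTokenRegistration.Token.IsCancellationRequested);
    }

    static void Main()
    {
        var cts = new CancellationTokenSource();
        cts.Cancel();
        new InlineCallbackDemo().Begin(cts.Token); // prints False
    }
}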

//cc @davidfowl

@davidfowl
Member

davidfowl commented Sep 13, 2021

@zlatanov amazing analysis! Inline execution of the callbacks is always a pain to deal with. Also, we use the registration to reduce the size of the PipeAwaitable.

@zlatanov
Contributor

@davidfowl In order to keep using the registration, an easy solution would be to modify the callback:

private void ReaderCancellationRequested()
{
    CompletionData completionData;
    lock (SyncObj)
    {
        _readerAwaitable.CancellationTokenFired(out completionData);
    }
    TrySchedule(ReaderScheduler, completionData);
}

to have something like this at the beginning:

if (Monitor.IsEntered(SyncObj))
{
    // We have a synchronous callback from UnsafeRegister, but because we use the result from it
    // we can't process it synchronously.
    ThreadPool.UnsafeQueueUserWorkItem(s_signalReaderAwaitable, this);
    return;
}
// ... the rest of the method as is
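Pulling the two fragments together, the guarded callback would look roughly like this (a sketch of the suggestion above, not the merged fix; member names follow the snippets in this comment and may not match the real Pipe internals exactly):

private void ReaderCancellationRequested()
{
    if (Monitor.IsEntered(SyncObj))
    {
        // Synchronous callback from UnsafeRegister: BeginOperation still holds the lock
        // and the registration it returns has not been assigned yet, so defer the work
        // to the thread pool instead of processing it inline.
        ThreadPool.UnsafeQueueUserWorkItem(s_signalReaderAwaitable, this);
        return;
    }

    CompletionData completionData;
    lock (SyncObj)
    {
        _readerAwaitable.CancellationTokenFired(out completionData);
    }
    TrySchedule(ReaderScheduler, completionData);
}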

@davidfowl
Member

I'm wondering if there's any downside to removing the IsCancellationRequested check in CancellationTokenFired.

@zlatanov
Contributor

Somebody left a pretty telling comment there: "We might be getting stale callbacks that we already unsubscribed from." I don't know how true it is, though.

@davidfowl
Member

Yea, I was looking at that PR as well. @pakrym added that comment.

Actually, maybe we can add a check here to see if the token did fire and mark the awaitable as cancelled synchronously.

@zlatanov
Contributor

I think it's dangerous to add it there, because it only hides the issue. In the future, if the CancellationTokenFired method were refactored, or something else were added that counted on IsCancellationRequested being up to date, it would fail silently later on.

The issue here is that the callback is fired synchronously, and in that code path some method uses the CancellationToken, which is not yet set. If any code is added there in the future, this might cause more hard-to-spot problems.

The best solution would be to be able to set a dummy CancellationTokenRegistration that has the up to date cancellation token, even if not really subscribed, before calling UnsafeRegister, but I guess that isn't possible at the moment.

@davidfowl
Member

davidfowl commented Sep 14, 2021

I'm not concerned about code refactoring. We'll add some heavy comments to explain why it's important. I'd like to keep the fix as localized to PipeAwaitable as possible.

The best solution would be to be able to set a dummy CancellationTokenRegistration that has the up to date cancellation token, even if not really subscribed, before calling UnsafeRegister, but I guess that isn't possible at the moment.

The simplest solution is to always store both the token and the registration and make the PipeAwaitable bigger for both .NET Framework and .NET Core. But I like checking if the token fired after this call, since it would be the moral equivalent of it running inline.
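One possible shape of the "check whether the token fired" idea, purely for illustration (the fix that actually merged, quoted further down in this thread, instead compares the returned registration against default):

// Illustrative sketch at the end of BeginOperation, after UnsafeRegister returns.
_cancellationTokenRegistration = cancellationToken.UnsafeRegister(callback, state);

if (cancellationToken.IsCancellationRequested)
{
    // The callback may already have run inline (and been ignored, per the analysis above),
    // so surface the cancellation here rather than relying on the lost registration callback.
    cancellationToken.ThrowIfCancellationRequested();
}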

@zlatanov
Contributor

But I like checking if the token fired after this call since it would be doing the moral equivalent if it ran inline.

True, but what I don't like about this, is the fact that the callback has already run and could potentially have already changed the state of the awaitable.

@davidfowl
Member

I'm now not sure how this happens on .NET Framework since the cancellation token is assigned before registering for the callback...

@zlatanov
Contributor

I'm now not sure how this happens on .NET Framework since the cancellation token is assigned before registering for the callback...

How sure are we that the pipeline uses the correct dll? I can't open the build pipeline at the moment to try to download the artifacts for .NET Framework and use reflection to make sure we're not chasing ghosts.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Sep 14, 2021
@davidfowl
Member

How sure are we that the pipeline uses the correct dll? I can't open the build pipeline at the moment to try to download the artifacts for .NET Framework and use reflection to make sure we're not chasing ghosts.

@AArnott ?

@AArnott
Contributor Author

AArnott commented Sep 14, 2021

I can confirm from the dump that the System.IO.Pipelines.dll loaded in the process is 5.0.1+b02e13abab3d6f499751af2ad1ae6333fb4e633d.

I'm now not sure how this happens on .NET Framework since the cancellation token is assigned before registering for the callback

The dump file isn't of the .NET Framework. It's a .NET Core process (coreclr.dll 4,700,21,35901 @Commit: 2aee3eb10b6303dff09cbe5ac99b6fae4644f346).

If the bug you found, which can be blamed for the hang in this dump, can't explain a failure on .NET Framework, then I will find a .NET Framework dump as well that we can investigate.

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Sep 15, 2021
@davidfowl davidfowl added this to the 7.0.0 milestone Sep 16, 2021
@davidfowl davidfowl self-assigned this Sep 16, 2021
@davidfowl davidfowl removed the untriaged New issue has not been triaged by the area owner label Sep 16, 2021
@AArnott
Contributor Author

AArnott commented Sep 16, 2021

This still repros after the fix is applied (albeit less frequently).
This time the hang is about the same, but a little different. There is still a cancelled token stuck in a ReadAsyncInternal that hasn't completed. But this time the CTS._state is 1 instead of 3. Both are 'canceled' but 1 signifies the callbacks are still being invoked I think. But I don't see any thread where that is happening.

As evidenced by the testResults-Windows artifact downloaded from this pipeline. If that pipeline is purged, the artifact is also on my OneDrive.

Here are my notes from windbg and VS investigation of the hang dump:

0000021b3bcc1a20 ContextRuleMonitorServiceTests+<RegisterContextRuleAsyncSerializesRuleMetadata>d__3
    0 <>1__state
    0000021b3bc0f358 <>4__this (ContextRuleMonitorServiceTests)
        0000021b3bc0f3d0 <TimeoutTokenSource>k__BackingField
              3 _state (CANCELED)
    0000021b3bc0f4b0 <Service>k__BackingField
        0000021b3bae3040 <LastRuleMetadata>k__BackingField
    0000021b3bcb6538 <ClientProxy>k__BackingField
        0000021b3bc80d48 rpc
            0000021b3bc80810 <MessageHandler>k__BackingField
                  0000021b3bc80120 <Formatter>k__BackingField
                       0000021b3bc6cf10 multiplexingStream
                           0000021b3bc0fe50 formatter (V1Formatter)
                               0000021b3bc0f918 stream
                                    0000021b3bc0f8b8 readableStream


0000021b3bae2ac8 00007ffec92a3148      168          0 Nerdbank.Streams.MultiplexingStream+<ReadToFillAsync>d__55
0000021b3bae2b70 00007ffec92a38f8      240          0 Nerdbank.Streams.MultiplexingStream+V1Formatter+<ReadFrameAsync>d__12
    0000021b3b3ceb80 <>4__this (V1Formatter)
    0000021b3bae2c10 cancellationToken
        0000021b3bc687d0 _source
            1 _state (CANCELED)


0000021b3bcad1e8 00007ffec92a3148      168          0 Nerdbank.Streams.MultiplexingStream+<ReadToFillAsync>d__55
0000021b3bcad290 00007ffec92a38f8      240          0 Nerdbank.Streams.MultiplexingStream+V1Formatter+<ReadFrameAsync>d__12
    0000021b3bc6cf10 <>4__this (V1Formatter)
    0000021b3bcad330 cancellationToken
        0000021b3bc6d188 _source
            1 _state (CANCELED)

The source code for that V1Formatter.ReadFrameAsync at the 0th await is ultimately a call to PipeReaderStream.ReadAsyncInternal as can be seen in the VS visualization of the async state machines:
[screenshot: Visual Studio visualization of the async state machines]

And as you can see from the windbg notes, both of those ReadFrameAsync calls have already been canceled.

@AArnott AArnott reopened this Sep 16, 2021
@zlatanov
Contributor

@AArnott I can't download the artifact from either source. It seems both are internal to Microsoft organization only. Can you share it some other way so I can access it?

@AArnott
Contributor Author

AArnott commented Sep 17, 2021

I'm afraid not in this case. The original dump with which I created this issue came from an open source repo, but this new one comes from a private one. If I can repro on the OSS one, I'll share that publicly.

@zlatanov
Contributor

I double checked the fix by @davidfowl and it seems solid. However, I then decided to look into CancellationTokenSource and its Register method. The fix by David assumes that if the token is cancelled while we are calling the Register method, a default CancellationTokenRegistration will be returned:

// If we get back the default CancellationTokenRegistration then it means the
// callback fired synchronously and we can throw inline. This is safe because we haven't changed
// the state of the awaitable as yet.
if (_cancellationTokenRegistration == default(CancellationTokenRegistration))
{
#if DEBUG
    Debug.Assert(previousAwaitableState == _awaitableState, "The awaitable state changed!");
#endif
    cancellationToken.ThrowIfCancellationRequested();
}

However after reading the source code in CancellationTokenSource, this seems to not always be the case:

// If cancellation hasn't been requested, return the registration.
// if cancellation has been requested, try to undo the registration and run the callback
// ourselves, but if we can't unregister it (e.g. the thread running Cancel snagged
// our callback for execution), return the registration so that the caller can wait
// for callback completion in ctr.Dispose().
Debug.Assert(id != 0, "IDs should never be the reserved value 0.");
if (!IsCancellationRequested || !registrations.Unregister(id, node))
{
    return new CancellationTokenRegistration(id, node);
}

@stephentoub
Member

stephentoub commented Sep 17, 2021

Doesn't it just assume that if default is returned the callback has been invoked? That holds.

@zlatanov
Contributor

Doesn't it just assume that if default is returned the callback has been invoked? That holds.

You're right. Can we consider another case, then? When signaling the CancellationTokenSource, this method is called:

private void NotifyCancellation(bool throwOnFirstException)
{
    // If we're the first to signal cancellation, do the main extra work.
    if (!IsCancellationRequested && Interlocked.CompareExchange(ref _state, NotifyingState, NotCanceledState) == NotCanceledState)

which sets the state to notifying and then

/// <summary>Invoke all registered callbacks.</summary>
/// <remarks>The handlers are invoked synchronously in LIFO order.</remarks>
private void ExecuteCallbackHandlers(bool throwOnFirstException)
{
    Debug.Assert(IsCancellationRequested, "ExecuteCallbackHandlers should only be called after setting IsCancellationRequested->true");
    // If there are no callbacks to run, we can safely exit. Any race conditions to lazy initialize it
    // will see IsCancellationRequested and will then run the callback themselves.
    Registrations? registrations = Interlocked.Exchange(ref _registrations, null);

which resets the _registrations field to null. Let's assume that one thread is doing this, while another thread is halted in the Register method, specifically just before this line:

Registrations? registrations = Volatile.Read(ref _registrations);

Then the registering thread will try to create a new registrations object. Everything is fine until these lines in the Register method:

if (!IsCancellationRequested || !registrations.Unregister(id, node))
{
    return new CancellationTokenRegistration(id, node);
}

and more specifically the IsCancellationRequested property, which is defined as:

public bool IsCancellationRequested => _state != NotCanceledState;

Is it possible for the _state field, even though it's declared as volatile, to have a value (as observed from the other thread, and possibly another CPU node) that is not completely in sync with the Interlocked.CompareExchange(ref _state, NotifyingState, NotCanceledState) operation?

I specifically ask because of this note in the documentation here:

On a multiprocessor system, a volatile read operation does not guarantee to obtain the latest value written to that memory location by any processor.

@stephentoub
Member

stephentoub commented Sep 18, 2021

Is it possible for the _state field, even though it's declared as volatile, to have a value (as observed from the other thread, and possibly another CPU node) that is not completely in sync with the Interlocked.CompareExchange(ref _state, NotifyingState, NotCanceledState) operation?

In your sequence of events, we're only at this point because the thread read a null _registrations. That read is a full fence and can't be reordered with the subsequent read you cite of _state. _state was written by the other thread with a full fence prior to nulling out _registrations, also with a full fence. The reordering described isn't possible then.

@BrennanConroy
Member

There is still a cancelled token stuck in a ReadAsyncInternal that hasn't completed. But this time the CTS._state is 1 instead of 3. Both are 'canceled' but 1 signifies the callbacks are still being invoked I think.

I took a look at this and it looks like you're using the pre-rewrite CancellationTokenSource, where the _state field was 1 by default and meant NotCanceled. In the 6.0 rewrite the state values changed: _state now starts at 0, and 1 means Notifying (i.e. a cancellation is in progress).
I'm assuming when you were debugging this you looked at the newer code and came to the conclusion that the cancellation token was canceled and pipes had a bug.
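For reference, the _state constants being compared are approximately as follows (names recalled from the two versions of the source and may differ slightly between builds):

// Pre-.NET 6 CancellationTokenSource (the layout this dump's runtime uses):
//   CANNOT_BE_CANCELED = 0, NOT_CANCELED = 1, NOTIFYING = 2, NOTIFYINGCOMPLETE = 3
// .NET 6 rewrite (the source quoted earlier in this thread):
//   NotCanceledState = 0, NotifyingState = 1, NotifyingCompleteState = 2
// So _state == 1 in this dump means "not canceled" under the old layout, even though
// it reads as "notifying" against the new source.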

My conclusion: since the read isn't canceled and the pipe isn't completed, the hang is expected and points to a bug in the test or application code.

@AArnott
Contributor Author

AArnott commented Oct 27, 2021

Thank you, @BrennanConroy. I'll take another look at the dump with that insight.

@AArnott
Contributor Author

AArnott commented Nov 2, 2021

Ok, I found a cancellation bug in Nerdbank.Streams this time around (dotnet/Nerdbank.Streams#414). So until evidence suggests otherwise, I will consider this bug fixed with the original PR.
@davidfowl, I will do some more testing of the Pipelines fix combined with the Nerdbank.Streams fix. If I don't see any hangs in test runs, I'll request that we resume the backport of the fix made for this bug.

@AArnott AArnott closed this as completed Nov 2, 2021
@ghost ghost locked as resolved and limited conversation to collaborators Dec 2, 2021