Add setting to allow replaying terminal orchestrators in DTFx.AS #1159
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Problem
In DTFx.AS, we track an orchestration state in the History and Instance tables, which we update independently in a non-transactional way (there are not transactions in Azure Storage tables). Therefore, there exists a race condition where if DTFx.AS crashes after updating the orchestrator state in the History table but before updating the Instance table, the two tables will be out of sync.
In many cases, this is fine: the next orchestrator replay will update the Instance table and bring the two data stores back in sync. However, if the orchestrator is in a terminal state as per the History table (e.g. Completed, Failed, Terminated), then we will not get another opportunity to update the Instances table because we will discard any new events for that instanceID (including "Terminate" requests) based solely on the fact that the History table claims the orchestrator is completed. For proof, see the
IsExecutableInstance
method, where we ensure an orchestrator that is terminal (according to the history table) cannot be replayed again:durabletask/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs
Lines 1049 to 1085 in 4f82f62
In the end, this situation necessary manuals editing of storage resources to recover from, which is dangerous.
Impact
This is rare, but I've seen it a handful of times in servicing cases, including yesterday.
This PR
This PR introduces a new setting for the Azure Storage backend, called
AllowReplayingTerminalInstances
. When set, we will not discard events for orchestrator in terminal instances, like we do today inIsExecutableInstance
(embedded snippet above). This means we can force the orchestrator to replay, causing the instance table to get updated and thus brought back into sync.I imagine the user story to be:
(1) the user identifies the state of instanceID X is inconsistent between the History and Instance table: the former claims the orchestrator is completed/terminal, but the latter says it's running.
(2) They set to
AllowReplayingTerminalInstances
totrue
.(3) They send an arbitrary external event to the orchestrator, which will force a replay and update the instace table.
(4) The orchestrator state in storage is now correct :-)
In step (3) they may also decide to just terminate the orchestrator, either works. What matters is that we can once again update the Instance table in response to an event.
Ideal solution (not this PR), in my opinion. Just some musings.
It would be great if Azure Storage provided transactions across Azure Storage tables. In the absence of that, this race condition will always exist. I think it's worth considering if a new major version of DTFx.AS could work with only a singular Azure Storage table, perhaps by making use of another "sentinel"-type row in the History table to replace the Instance table altogether. This remove the possibility of an incomplete update like this. Just something I've been considering in the back of my mind.