Add setting to allow replaying terminal orchestrators in DTFx.AS #1159

Merged
merged 1 commit into main from dajusto/AllowReplayingTerminalInstances on Sep 11, 2024

Conversation

@davidmrdavid (Collaborator) commented Sep 10, 2024

Problem

In DTFx.AS, we track orchestration state in both the History and Instances tables, which we update independently and non-transactionally (Azure Storage tables do not support cross-table transactions). Therefore, there is a race condition: if DTFx.AS crashes after updating the orchestrator state in the History table but before updating the Instances table, the two tables will end up out of sync.

In many cases, this is fine: the next orchestrator replay will update the Instances table and bring the two data stores back in sync. However, if the orchestrator is in a terminal state according to the History table (e.g. Completed, Failed, Terminated), we never get another opportunity to update the Instances table, because we discard all new events for that instanceID (including "Terminate" requests) based solely on the fact that the History table claims the orchestrator is complete. For proof, see the IsExecutableInstance method, which ensures that an orchestrator that is terminal (according to the History table) cannot be replayed again:

```csharp
bool IsExecutableInstance(OrchestrationRuntimeState runtimeState, IList<TaskMessage> newMessages, out string message)
{
    if (runtimeState.ExecutionStartedEvent == null && !newMessages.Any(msg => msg.Event is ExecutionStartedEvent))
    {
        var instanceId = newMessages[0].OrchestrationInstance.InstanceId;
        if (DurableTask.Core.Common.Entities.AutoStart(instanceId, newMessages))
        {
            message = null;
            return true;
        }
        else
        {
            // A non-zero event count usually happens when an instance's history is overwritten by a
            // new instance or by a ContinueAsNew. When history is overwritten by new instances, we
            // overwrite the old history with new history (with a new execution ID), but this is done
            // gradually as we build up the new history over time. If we haven't yet overwritten *all*
            // the old history and we receive a message from the old instance (this happens frequently
            // with canceled durable timer messages) we'll end up loading just the history that hasn't
            // been fully overwritten. We know it's invalid because it's missing the ExecutionStartedEvent.
            message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
            return false;
        }
    }

    if (runtimeState.ExecutionStartedEvent != null &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Running &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Pending &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Suspended)
    {
        message = $"Instance is {runtimeState.OrchestrationStatus}";
        return false;
    }

    message = null;
    return true;
}
```

In the end, recovering from this situation requires manual editing of storage resources, which is dangerous.

Impact

This is rare, but I've seen it a handful of times in servicing cases, including yesterday.

This PR

This PR introduces a new setting for the Azure Storage backend, called AllowReplayingTerminalInstances. When it is set, we no longer discard events for orchestrators in terminal states, as we do today in IsExecutableInstance (snippet above). This means we can force the orchestrator to replay, causing the Instances table to get updated and thus brought back into sync.
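
As a rough sketch, enabling the setting might look like the following. This assumes the existing AzureStorageOrchestrationServiceSettings class in DurableTask.AzureStorage; the property name comes from this PR, but its exact placement on the settings object is my assumption:

```csharp
using DurableTask.AzureStorage;

// A minimal sketch: enable the new setting on the Azure Storage backend.
// AllowReplayingTerminalInstances is the property added by this PR; everything
// else is the pre-existing DTFx.AS configuration surface.
var settings = new AzureStorageOrchestrationServiceSettings
{
    TaskHubName = "MyTaskHub",
    // ... storage credentials configured as usual for your DTFx.AS version ...

    // New in this PR: do not discard messages sent to instances that the History
    // table considers terminal, so that a replay can resync the Instances table.
    AllowReplayingTerminalInstances = true,
};

var service = new AzureStorageOrchestrationService(settings);
```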

I imagine the user story to be:
(1) The user identifies that the state of instanceID X is inconsistent between the History and Instances tables: the former claims the orchestrator is completed/terminal, but the latter says it's running.
(2) They set AllowReplayingTerminalInstances to true.
(3) They send an arbitrary external event to the orchestrator, which forces a replay and updates the Instances table.
(4) The orchestrator state in storage is now correct :-)

In step (3), they may also decide to just terminate the orchestrator; either works. What matters is that we can once again update the Instances table in response to an event.
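
For example, step (3) could look like this sketch using the core TaskHubClient. The instance ID "X" and event name "Resync" are placeholders; RaiseEventAsync and TerminateInstanceAsync are the standard DurableTask.Core client APIs:

```csharp
using DurableTask.Core;

// Sketch of step (3): send any event to the out-of-sync instance to force a replay.
var client = new TaskHubClient(service); // 'service' from the settings sketch above

var instance = new OrchestrationInstance { InstanceId = "X" }; // placeholder ID
await client.RaiseEventAsync(instance, "Resync", eventData: null);

// Alternatively, terminating the instance works just as well; either way, the
// replay rewrites the Instances table row and brings it back in sync:
//   await client.TerminateInstanceAsync(instance, "resyncing Instances table");
```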

Ideal solution (not this PR), in my opinion. Just some musings.

It would be great if Azure Storage provided transactions across Azure Storage tables. In the absence of that, this race condition will always exist. I think it's worth considering whether a new major version of DTFx.AS could work with only a single Azure Storage table, perhaps by using another "sentinel"-type row in the History table to replace the Instances table altogether. That would remove the possibility of an incomplete update like this one. Just something I've been considering in the back of my mind.

@cgillum (Member) left a comment

:shipit:

davidmrdavid marked this pull request as draft September 10, 2024 19:10
davidmrdavid marked this pull request as ready for review September 10, 2024 19:10
davidmrdavid reopened this Sep 10, 2024
davidmrdavid merged commit ed3ece3 into main Sep 11, 2024
46 checks passed
davidmrdavid deleted the dajusto/AllowReplayingTerminalInstances branch September 11, 2024 16:34
davidmrdavid added a commit that referenced this pull request Sep 12, 2024