Add setting to allow replaying terminal orchestrators in DTFx.AS #1159

Merged
merged 1 commit into main from dajusto/AllowReplayingTerminalInstances on Sep 11, 2024

Conversation

@davidmrdavid (Collaborator) commented Sep 10, 2024

Problem

In DTFx.AS, we track orchestration state in both the History and Instances tables, which we update independently and non-transactionally (Azure Storage tables do not support cross-table transactions). Therefore, there is a race condition: if DTFx.AS crashes after updating the orchestrator state in the History table but before updating the Instances table, the two tables will end up out of sync.

In many cases, this is fine: the next orchestrator replay will update the Instances table and bring the two data stores back in sync. However, if the orchestrator is in a terminal state according to the History table (e.g. Completed, Failed, Terminated), we never get another opportunity to update the Instances table, because we discard all new events for that instanceID (including "Terminate" requests) based solely on the fact that the History table claims the orchestrator is complete. For proof, see the IsExecutableInstance method, which ensures that an orchestrator that is terminal (according to the History table) cannot be replayed again:

```csharp
bool IsExecutableInstance(OrchestrationRuntimeState runtimeState, IList<TaskMessage> newMessages, out string message)
{
    if (runtimeState.ExecutionStartedEvent == null && !newMessages.Any(msg => msg.Event is ExecutionStartedEvent))
    {
        var instanceId = newMessages[0].OrchestrationInstance.InstanceId;
        if (DurableTask.Core.Common.Entities.AutoStart(instanceId, newMessages))
        {
            message = null;
            return true;
        }
        else
        {
            // A non-zero event count usually happens when an instance's history is overwritten by a
            // new instance or by a ContinueAsNew. When history is overwritten by new instances, we
            // overwrite the old history with new history (with a new execution ID), but this is done
            // gradually as we build up the new history over time. If we haven't yet overwritten *all*
            // the old history and we receive a message from the old instance (this happens frequently
            // with canceled durable timer messages) we'll end up loading just the history that hasn't
            // been fully overwritten. We know it's invalid because it's missing the ExecutionStartedEvent.
            message = runtimeState.Events.Count == 0 ? "No such instance" : "Invalid history (may have been overwritten by a newer instance)";
            return false;
        }
    }

    if (runtimeState.ExecutionStartedEvent != null &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Running &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Pending &&
        runtimeState.OrchestrationStatus != OrchestrationStatus.Suspended)
    {
        message = $"Instance is {runtimeState.OrchestrationStatus}";
        return false;
    }

    message = null;
    return true;
}
```

In the end, recovering from this situation requires manual editing of storage resources, which is dangerous.

Impact

This is rare, but I've seen it a handful of times in servicing cases, including yesterday.

This PR

This PR introduces a new setting for the Azure Storage backend, called AllowReplayingTerminalInstances. When it is set, we no longer discard events for orchestrators in terminal states, as we do today in IsExecutableInstance (snippet above). This means we can force the orchestrator to replay, causing the Instances table to get updated and thus brought back into sync.
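
As a rough sketch, enabling the setting might look like the following. This assumes the existing AzureStorageOrchestrationServiceSettings class in DurableTask.AzureStorage; the property name comes from this PR, but its exact placement on the settings object is my assumption:

```csharp
using DurableTask.AzureStorage;

// A minimal sketch: enable the new setting on the Azure Storage backend.
// AllowReplayingTerminalInstances is the property added by this PR; everything
// else is the pre-existing DTFx.AS configuration surface.
var settings = new AzureStorageOrchestrationServiceSettings
{
    TaskHubName = "MyTaskHub",
    // ... storage credentials configured as usual for your DTFx.AS version ...

    // New in this PR: do not discard messages sent to instances that the History
    // table considers terminal, so that a replay can resync the Instances table.
    AllowReplayingTerminalInstances = true,
};

var service = new AzureStorageOrchestrationService(settings);
```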

I imagine the user story to be:
(1) The user identifies that the state of instanceID X is inconsistent between the History and Instances tables: the former claims the orchestrator is completed/terminal, but the latter says it's running.
(2) They set AllowReplayingTerminalInstances to true.
(3) They send an arbitrary external event to the orchestrator, which forces a replay and updates the Instances table.
(4) The orchestrator state in storage is now correct :-)

In step (3), they may also decide to just terminate the orchestrator; either works. What matters is that we can once again update the Instances table in response to an event.
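
For example, step (3) could look like this sketch using the core TaskHubClient. The instance ID "X" and event name "Resync" are placeholders; RaiseEventAsync and TerminateInstanceAsync are the standard DurableTask.Core client APIs:

```csharp
using DurableTask.Core;

// Sketch of step (3): send any event to the out-of-sync instance to force a replay.
var client = new TaskHubClient(service); // 'service' from the settings sketch above

var instance = new OrchestrationInstance { InstanceId = "X" }; // placeholder ID
await client.RaiseEventAsync(instance, "Resync", eventData: null);

// Alternatively, terminating the instance works just as well; either way, the
// replay rewrites the Instances table row and brings it back in sync:
//   await client.TerminateInstanceAsync(instance, "resyncing Instances table");
```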

Ideal solution (not this PR), in my opinion. Just some musings.

It would be great if Azure Storage provided transactions across Azure Storage tables. In the absence of that, this race condition will always exist. I think it's worth considering whether a new major version of DTFx.AS could work with only a single Azure Storage table, perhaps by using another "sentinel"-type row in the History table to replace the Instances table altogether. That would remove the possibility of an incomplete update like this one. Just something I've been considering in the back of my mind.

@cgillum (Member) left a comment

:shipit:

davidmrdavid marked this pull request as draft September 10, 2024 19:10
davidmrdavid marked this pull request as ready for review September 10, 2024 19:10
davidmrdavid reopened this Sep 10, 2024
davidmrdavid merged commit ed3ece3 into main Sep 11, 2024
46 checks passed
davidmrdavid deleted the dajusto/AllowReplayingTerminalInstances branch September 11, 2024 16:34
davidmrdavid added a commit that referenced this pull request Sep 12, 2024