
Orchestrator stuck in running when saving large custom status values #2918

Open
cliedeman opened this issue Sep 25, 2024 · 5 comments

@cliedeman

Description

I have several instances of the same orchestrator that more often than not get stuck in Running.

The orchestrator calls 5 sub-orchestrators and takes about 2 hours in total. The inputs are not large, so nothing suspicious there.
If I check the history table, I can see that an OrchestratorComplete event is fired with a null instanceId, indicating it should be in the Completed state, but it is not.

Expected behavior

The orchestrator leaves the Running state and becomes Completed.

Actual behavior

The orchestrator remains in the Running state.

App Details

Dotnet 8
Isolated Worker

  <ItemGroup Label="Azure Functions Worker">
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.ApplicationInsights" Version="1.4.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker" Version="1.23.0" />
    <!-- Don't upgrade this library because of this issue -> https://github.com/microsoft/durabletask-dotnet/issues/247 -->
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Sdk" Version="1.17.4" />
  </ItemGroup>
  <ItemGroup Label="Azure Functions Worker Extensions">
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Http" Version="3.2.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Http.AspNetCore" Version="1.3.2" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Timer" Version="4.3.1" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Storage" Version="6.6.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.Storage.Queues" Version="5.5.0" />
    <!-- https://github.com/Azure/azure-sdk-for-net/pull/34783 -->
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.DurableTask" Version="1.1.5" />
    <PackageVersion Include="Microsoft.DurableTask.Generators" Version="1.0.0-preview.1" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.CosmosDB" Version="4.11.0" />
    <PackageVersion Include="Microsoft.Azure.Functions.Worker.Extensions.ServiceBus" Version="5.22.0" />
  </ItemGroup>

If deployed to Azure

We have access to a lot of telemetry that can help with investigations. Please provide as much of the following information as you can to help us investigate!

  • Timeframe issue observed: Past Week
  • Function App name: functions-prod-fulcrum-mobilemart
  • Function name(s): DataLakeDataExportOrchestrate
  • Azure region: North Europe
  • Orchestration instance ID(s): a053d5a6566544539670bb04989d7c6b, 999a607c87074b7b9a08d3b825f29622, ec599cef5cd545bcb126ed3e4f94bfc8, 86608ce0275c4b35ae1807f5021bdaa9, f168030031c24d5a9ab31bc8e13d18e1
  • Azure storage account name: fulcrumprodfunctionapp

If you don't want to share your Function App or storage account name on GitHub, please at least share the orchestration instance ID. Otherwise it's extremely difficult to look up information.

@cgillum
Member

cgillum commented Sep 25, 2024

Hi @cliedeman. Are you using Application Insights? If so, can you try enabling the Durable Task Framework logging (warnings and errors as shown in the sample should be fine) and then querying the traces collection in App Insights to see if there are any clues about what might be going on?
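
For reference, a minimal host.json sketch for surfacing those logs (assuming the Azure Storage backend; the category names below are the documented Durable Task Framework logging categories, so adjust if your setup differs):

{
  "version": "2.0",
  "logging": {
    "logLevel": {
      "DurableTask.AzureStorage": "Warning",
      "DurableTask.Core": "Warning"
    }
  }
}

With something like this in place, DTFx warnings and errors should show up in the App Insights traces collection and can be filtered by those category names.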

cgillum added the Needs: Author Feedback label and removed the Needs: Triage 🔍 label on Sep 25, 2024
@cliedeman
Author

@cgillum I do. When I run another batch in the coming days, I will try to get some extra logging output.

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Sep 26, 2024
cgillum added the Needs: Author Feedback label and removed the Needs: Attention 👋 label on Sep 26, 2024
@cliedeman
Author

@cgillum I found this error in the logs:

An unexpected failure occurred while processing instance '36790fe4a8e3465ca3ca68f210483553': DurableTask.AzureStorage.Storage.DurableTaskStorageException: Bad Request
 ---> Microsoft.WindowsAzure.Storage.StorageException: Bad Request
   at Microsoft.WindowsAzure.Storage.Core.Executor.Executor.ExecuteAsyncInternal[T](RESTCommand`1 cmd, IRetryPolicy policy, OperationContext operationContext, CancellationToken token)
   at DurableTask.AzureStorage.TimeoutHandler.ExecuteWithTimeout[T](String operationName, String account, AzureStorageOrchestrationServiceSettings settings, Func`3 operation, AzureStorageOrchestrationServiceStats stats, String clientRequestId)
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force)
Request Information
RequestID:7f4f8aad-3002-001f-59d8-11bb64000000
RequestDate:Sat, 28 Sep 2024 18:59:54 GMT
StatusMessage:Bad Request
ErrorCode:
ErrorMessage:The property value exceeds the maximum allowed size (64KB). If the property value is a string, it is UTF-16 encoded and the maximum number of characters should be 32K or less.
RequestId:7f4f8aad-3002-001f-59d8-11bb64000000
Time:2024-09-28T18:59:54.6725330Z

   --- End of inner exception stack trace ---
   at DurableTask.AzureStorage.Storage.AzureStorageClient.MakeStorageRequest[T](Func`3 storageRequest, String accountName, String operationName, String clientRequestId, Boolean force) in /_/src/DurableTask.AzureStorage/Storage/AzureStorageClient.cs:line 141
   at DurableTask.AzureStorage.Storage.Table.ExecuteAsync(TableOperation operation, String operationType) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 113
   at DurableTask.AzureStorage.Storage.Table.InsertOrMergeAsync(DynamicTableEntity tableEntity) in /_/src/DurableTask.AzureStorage/Storage/Table.cs:line 101
   at DurableTask.AzureStorage.Tracking.AzureTableTrackingStore.UpdateStateAsync(OrchestrationRuntimeState newRuntimeState, OrchestrationRuntimeState oldRuntimeState, String instanceId, String executionId, String eTagValue, Object trackingStoreContext) in /_/src/DurableTask.AzureStorage/Tracking/AzureTableTrackingStore.cs:line 1159
   at DurableTask.AzureStorage.AzureStorageOrchestrationService.CompleteTaskOrchestrationWorkItemAsync(TaskOrchestrationWorkItem workItem, OrchestrationRuntimeState newOrchestrationRuntimeState, IList`1 outboundMessages, IList`1 orchestratorMessages, IList`1 timerMessages, TaskMessage continuedAsNewMessage, OrchestrationState orchestrationState) in /_/src/DurableTask.AzureStorage/AzureStorageOrchestrationService.cs:line 1179

I confirmed that 36790fe4a8e3465ca3ca68f210483553 is my newest instance, so this error does not fail the orchestration.

I suspect that it is my customStatus (which reports on job progress) that is exceeding the limit.
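
For illustration only (the type and names below are hypothetical, not the actual code from this app), a progress-reporting custom status in the .NET Isolated worker can grow without bound if it accumulates per-item details across a long run:

// Hypothetical sketch: SetCustomStatus serializes the object and stores it with the
// orchestration state, so an ever-growing status can push the Azure Table row past 64 KB.
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.Functions.Worker;
using Microsoft.DurableTask;

public static class DataLakeDataExport
{
    public record ExportProgress(int CompletedBatches, int TotalBatches, List<string> ProcessedFiles);

    [Function("DataLakeDataExportOrchestrate")]
    public static async Task Run([OrchestrationTrigger] TaskOrchestrationContext context)
    {
        var progress = new ExportProgress(0, 5, new List<string>());
        for (int i = 0; i < 5; i++)
        {
            // "ExportBatch" is a placeholder sub-orchestrator name.
            var files = await context.CallSubOrchestratorAsync<List<string>>("ExportBatch", i);
            progress.ProcessedFiles.AddRange(files);
            progress = progress with { CompletedBatches = i + 1 };
            context.SetCustomStatus(progress); // unbounded list => serialized payload keeps growing
        }
    }
}

Reporting only counts (or truncating the file list) keeps the serialized status small.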

Ciaran

microsoft-github-policy-service bot added the Needs: Attention 👋 label and removed the Needs: Author Feedback label on Sep 28, 2024
@cgillum
Member

cgillum commented Sep 30, 2024

@cliedeman thanks for this info! I checked the code, and I think you're right that this could be caused by a large custom status value. We have checks to ensure that it doesn't exceed 16 KB, but it looks like there aren't any checks to ensure that the custom status value combined with other semi-large values (like inputs or outputs) doesn't exceed the 64 KB limit imposed by Azure Storage.

I'm labeling this as a bug that needs to be fixed. In the meantime, I recommend reducing the size of your custom status values to avoid this issue in the future. For the current stuck instance, you can terminate it to get it out of the "Running" status.
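
For the terminate step, a minimal sketch of an HTTP-triggered helper for the .NET Isolated worker (the function and parameter names here are illustrative, not from this app):

using System.Net;
using System.Threading.Tasks;
using System.Web;
using Microsoft.Azure.Functions.Worker;
using Microsoft.Azure.Functions.Worker.Http;
using Microsoft.DurableTask.Client;

public static class TerminateStuckInstance
{
    [Function("TerminateStuckInstance")]
    public static async Task<HttpResponseData> Run(
        [HttpTrigger(AuthorizationLevel.Function, "post")] HttpRequestData req,
        [DurableClient] DurableTaskClient client)
    {
        // Pass the stuck instance ID as ?instanceId=... on the request URL.
        string? instanceId = HttpUtility.ParseQueryString(req.Url.Query)["instanceId"];
        if (string.IsNullOrEmpty(instanceId))
        {
            return req.CreateResponse(HttpStatusCode.BadRequest);
        }

        // Moves the instance out of Running and into the Terminated state.
        await client.TerminateInstanceAsync(instanceId, "Stuck due to oversized custom status");

        var response = req.CreateResponse(HttpStatusCode.OK);
        await response.WriteStringAsync($"Terminated {instanceId}");
        return response;
    }
}

The Azure Functions Core Tools ("func durable terminate") or the instance management HTTP API can do the same without deploying new code.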

cgillum added the bug and P1 (Priority 1) labels and removed the Needs: Attention 👋 label on Sep 30, 2024
cgillum changed the title from "Orchestrator Stuck in Running" to "Orchestrator stuck in running when saving large custom status values" on Sep 30, 2024
@cgillum
Member

cgillum commented Sep 30, 2024

We have checks to ensure that it doesn't exceed 16 KB, but it looks like there aren't any checks to ensure that the custom status value combined with other semi-large values (like inputs or outputs) doesn't exceed the 64 KB limit imposed by Azure Storage.

I was wrong: while we do have checks for the custom status size in the .NET in-proc SDK, we don't have any such checks in the .NET Isolated SDK, which otherwise would have caught this kind of issue. We may need to introduce a breaking change to ensure that the serialized custom status payload size matches the in-proc limit: 16 KB.
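
Until such a check exists in the Isolated SDK, an app-side guard is an option. This is only a sketch, not SDK behavior; the 16 KB ceiling is an assumption mirroring the in-proc limit mentioned above:

using System.Text.Json;
using Microsoft.DurableTask;

public static class CustomStatusGuard
{
    // Assumed ceiling on the serialized (UTF-16) payload, matching the in-proc limit.
    private const int MaxBytes = 16 * 1024;

    public static void SetCustomStatusSafely(this TaskOrchestrationContext context, object status)
    {
        string json = JsonSerializer.Serialize(status);
        if (json.Length * sizeof(char) > MaxBytes)
        {
            // Drop the detail rather than risk a failed tracking-store update that
            // leaves the orchestration stranded in the Running state.
            context.SetCustomStatus(new { truncated = true, serializedLength = json.Length });
            return;
        }

        context.SetCustomStatus(status);
    }
}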
