Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail fast if ExtendedSessionsEnabled is requested for a non-.NET worker. #2732

Merged
merged 14 commits into from
Oct 28, 2024

Conversation

gukoff
Copy link
Contributor

@gukoff gukoff commented Jan 26, 2024

Problem

If you set extendedSessionsEnabled to true for Python durable function, it fails to complete orchestrations. Although the control queues are properly processed and the history tables are written accordingly, Python worker only triggers the orchestration once and never replays it.

Excerpt from host.json:

{
  "extensions": {
     "durableTask": {
      "hubName": "TaskHub3",
      "extendedSessionsEnabled": true
      "extendedSessionIdleTimeoutInSeconds": 600
     }
  }

Excerpt from the logs:

[2024-01-26T13:29:09.316Z] Durable Functions does not work with extendedSessions = true for non-.NET languages. This value is being set to false instead. See https://docs.microsoft.com/en-us/azure/azure-functions/durable/durable-functions-perf-and-scale#extended-sessions for more details.. InstanceId: . Function: . HubName: TaskHub2. AppName: . SlotName: . ExtensionVersion: 2.12.0. SequenceNumber: 0.

Related issue about this parameter I've found via github search: https://github.com/briandunnington/briandunnington.github.com/blob/97f5a9cca08b41654c07d3552e8ede184ae1b8e4/_posts/durable_functions_an_eventstarted_event_is_required.md?plain=1#L61

Solution

The current code emits the warning and overwrites the option value to false. However, it seems like the framework reads the "old" option value before that, e.g. in the InitializeForFunctionsV1I() During execution, at least part of the framework logic still assumes that the extended sessions are on.

A more simple and reliable approach is to just disallow this option value for non-dotnet workers and fail when it's provided.

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation PR is ready to merge and referenced in pending_docs.md
  • My changes should not be added to the release notes for the next release
    • Otherwise: I've added my notes to release_notes.md
  • My changes do not need to be backported to a previous version
    • Otherwise: Backport tracked by issue/PR #issue_or_pr
  • I have added all required tests (Unit tests, E2E tests)
  • My changes do not require any extra work to be leveraged by OutOfProc SDKs
    • Otherwise: That work is being tracked here: #issue_or_pr_in_each_sdk
  • My changes do not change the version of the WebJobs.Extensions.DurableTask package
    • Otherwise: major or minor version updates are reflected in /src/Worker.Extensions.DurableTask/AssemblyInfo.cs
  • My changes do not add EventIds to our EventSource logs
    • Otherwise: Ensure the EventIds are within the supported range in our existing Windows infrastructure. You may validate this with a deployed app's telemetry. You may also extend the range by completing a PR such as this one.
  • My changes should be added to v3.x branch.
    • Otherwise: This change only applies to Durable Functions v2.x and will not be merged to branch v3.x.

While the code does emit the warning and overwrites the option value, this is too late. The frameworks reads the "old" option value before the options are validated and actually runs as if the extended sessions are on. This leads to the hard-to-troubleshoot problems, e.g. the python worker never re-triggers the orchestration and never completes them.
@gukoff
Copy link
Contributor Author

gukoff commented Jan 26, 2024

I am surprised all checks have passed, because I didn't edit the tests for the old logic: https://github.com/Azure/azure-functions-durable-extension/pull/1502/files#diff-cc6c49a98ce039f487dcb8326901a4efd087351398aa832615d2041389d99bf2R4502

Is there a way to run all tests on this change? I can't easily run tests for this repo locally on OSX, getting various build errors. Would be easier to just do it on CI.

@cgillum
Copy link
Member

cgillum commented Jan 26, 2024

@gukoff the tests that ran and passed are just the smoke tests - it looks like the integration tests didn't run.

@davidmrdavid what's the proper way to run integration tests for community-contributed PRs for this repo now? Is it /azp run like we do for the durabletask repo?

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor feedback on the error message.

@gukoff gukoff requested a review from cgillum February 26, 2024 09:04
@cgillum
Copy link
Member

cgillum commented Feb 26, 2024

/azp run

Copy link

Azure Pipelines successfully started running 1 pipeline(s), but failed to run 1 pipeline(s).

@cgillum
Copy link
Member

cgillum commented Feb 26, 2024

Hi @gukoff, it looks like there is at least one test that needs to be updated to account for this new behavior:

Microsoft.Azure.WebJobs.Extensions.DurableTask.Tests.DurableTaskEndToEndTests.ExtendedSessions_OutOfProc_SetToFalse

@gukoff
Copy link
Contributor Author

gukoff commented Aug 9, 2024

@cgillum

The CI checks fail for the seemingly unrelated reasons.

  1. Smoke Test - .NET Isolated on Functions V4 / build fails transiently. Here's a fail, and here's a successful run after I replaced docker build in the e2e test with docker build --pull --no-cache, here's another fail after I removed the change, and another success when I brought back --pull. Maybe it's a coincidence, and maybe the CI agents do have stale images in cache. Locally this test runs fine.

  2. Smoke Test - .NET in-proc w/ MSSQL on Functions V4 / build fails to start the mssql server because it tries to set an empty password. I added some code in the e2e test to explain the failure if it happens. Maybe you know why SA_PASSWORD is empty for the CI check of this PR, but not for the CI checks on the other PRs?

gukoff added 2 commits August 9, 2024 14:31
…from the latest base images, no matter where it's built.
@gukoff
Copy link
Contributor Author

gukoff commented Oct 28, 2024

Ping?

Copy link
Member

@cgillum cgillum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @gukoff - I really appreciate this contribution - especially the improvements to the CI! I think this change you've introduced is really important, and we can go ahead and accept it in spite of the MSSQL failure (which has been broken for a little while already and may require some process changes as it involves secret management).

@cgillum cgillum merged commit 8470f3d into Azure:dev Oct 28, 2024
13 of 14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants