Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adds Next Retry Delay for TS SDK #3069

Open
wants to merge 6 commits into
base: main
Choose a base branch
from
Open

Conversation

mjameswh
Copy link
Contributor

@mjameswh mjameswh commented Sep 6, 2024

What does this PR do?

  • Add instructions on how to use Next Retry Delay in TypeScript.

  • Update and restructure the TS Failure Detection dev guide page to be more in line with .Net, which better reflects what the Education team is aiming for.

Notes to reviewers

  • I diverted a bit from the Next Retry Delay doc for the Java SDK introduced in this PR, to try to better align with the style of the .Net's Failure Detection page.

  • You may want to also get a look at diff between my proposed TS Failure Detection page (/docs/develop/typescript/failure-detection.mdx) and the existing .Net's Failure Detection page. I diverged in a few places to fix minor things that were IMHO inconsistencies/errors. You will either want to accept my corrections and transpose them to the .Net page, or revert them to the original .Net text.

@mjameswh mjameswh requested a review from a team as a code owner September 6, 2024 19:24

## Cancel an Activity from a Workflow {#cancel-an-activity}

Canceling an Activity from within a Workflow requires that the Activity Execution sends Heartbeats and sets a Heartbeat Timeout.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Link to Heartbeat recommended, plus a sentence or so along the lines that a Heartbeat is a proof-of-life timeout used for long-running Activities.
This sentence also parallels 391 quite a lot. This is a good place once your thoughts are down to step back and consider wording for clarity. (Oh, I see Angela got there first. Thanks, Angela!)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fairlydurable I only moved that subsection out from the Failure Detection page. I agree it needs rework, but if you don't mind, I would prefer to consider that one out of scope for this PR.

Copy link
Contributor Author

@mjameswh mjameswh Sep 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't looked at what's already written on the .Net Cancellation page; when we get there (either I or somebody else), the .Net content should probably be the reference (i.e. rather than this legacy TS doc versino).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're doing something awesome. We appreciate it.

## Cancel an Activity from a Workflow {#cancel-an-activity}

Canceling an Activity from within a Workflow requires that the Activity Execution sends Heartbeats and sets a Heartbeat Timeout.
If the Heartbeat is not invoked, the Activity cannot receive a cancellation request.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Simpler: Activities without Heartbeats can't be canceled.
Good opportunity to explain why this is necessary. I'm asking in tech-qs to see if I understand it correctly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Activity cancellations are sent by the server to the worker through the RecordActivityTaskHeartbeatResponse. So if there is no heartbeat request, then there is no heartbeat response, and consequently, the worker never learns about the activity cancellation request. (edited)


Canceling an Activity from within a Workflow requires that the Activity Execution sends Heartbeats and sets a Heartbeat Timeout.
If the Heartbeat is not invoked, the Activity cannot receive a cancellation request.
When any non-immediate Activity is executed, the Activity Execution should send Heartbeats and set a [Heartbeat Timeout](/encyclopedia/detecting-activity-failures#heartbeat-timeout) to ensure that the server knows it is still working.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a "non-immediate" Activity? One that is asynchronous? One that doesn't return control and blocks? Long lived? If I were reading this, I'd be confused.

When any non-immediate Activity is executed, the Activity Execution should send Heartbeats and set a [Heartbeat Timeout](/encyclopedia/detecting-activity-failures#heartbeat-timeout) to ensure that the server knows it is still working.

When an Activity is canceled, an error is raised in the Activity at the next available opportunity.
If cleanup logic needs to be performed, it can be done in a `finally` clause or inside a caught cancel error.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd specify the error if there's a parent cancelation error (not sure there is in typescript) and link to the API docs for it. If this doesn't apply, no worries.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, who is catching the error? The Workflow or the Activity if it isn't in the finally clause?


When an Activity is canceled, an error is raised in the Activity at the next available opportunity.
If cleanup logic needs to be performed, it can be done in a `finally` clause or inside a caught cancel error.
However, for the Activity to appear canceled the exception needs to be re-raised.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More normally we "re-throw" exceptions.

I checked and there are 3 uses, but I have never heard re-raised outside of baking.


:::note

Unlike regular Activities, [Local Activities](/activities#local-activity) can be canceled if they don't send Heartbeats.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

without sending Heartbeats.

Current phrasing makes it sound like it's a precondition rather than a choice.
They're used for short-lived operations and run in the same process as the Workflow that spawns them -- at least I believe so, even though you said Worker process on 400. They have some limitations like no routing but I don't think it helps mentioning them here

:::note

Unlike regular Activities, [Local Activities](/activities#local-activity) can be canceled if they don't send Heartbeats.
Local Activities are handled locally, and all the information needed to handle the cancellation logic is available in the same Worker process.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think s/locally/in the same process as the Worker that spawned them/ would be clearer (assuming I'm accurate, which is always something you should be skeptical about)


Each Workflow timeout controls the maximum duration of a different aspect of a Workflow Execution.

Workflow timeouts are set when [starting the Workflow Execution](#workflow-timeouts).
Workflow Timeouts are set when [starting the Workflow Execution](#workflow-timeouts).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider "A Client sets Workflow Timeouts when [...." because it gives this an actor vs passive voice, and also explains how those timeouts are set

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, the phrase was incomplete; clarified that this is possible on both the Client and Workflow APIs, which unfortunately makes the active voice less applicable.


- **[Workflow Execution Timeout](/encyclopedia/detecting-workflow-failures#workflow-execution-timeout)** - restricts the maximum amount of time that a single Workflow Execution can be executed.
- **[Workflow Run Timeout](/encyclopedia/detecting-workflow-failures#workflow-run-timeout):** restricts the maximum amount of time that a single Workflow Run can last.
- **[Workflow Task Timeout](/encyclopedia/detecting-workflow-failures#workflow-task-timeout):** restricts the maximum amount of time that a Worker can execute a Workflow Task.

Create an instance of `WorkflowOptions` from the Client and set your Workflow Timeout.
The following properties can be set on the [`WorkflowOptions`](https://typescript.temporal.io/api/interfaces/client.WorkflowOptions/) when starting a Workflow using either the Client or Workflow API:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Technically, isn't the Workflow API still acting as a client? Or am I missing something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No.

One can't use the client API from a Workflow, as the client API is not replace-aware and doing network requests could introduce non-deterministic behaviors.

Therefore, one has to use the Workflow API, which creates Workflow commands. This goes through a completely different path. And even though some operations are available both through the Client and the Workflow APIs, available options and the call semantic may differ.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I get it now. I wasn't thinking of this from that angle. I thought you meant Workflow API meaning only "creating a new Workflow Execution", not "Using the Workflow type's API surface to create another Workflow" (Unless I've misunderstood again.)


- **[Schedule-To-Close Timeout](/encyclopedia/detecting-activity-failures#schedule-to-close-timeout):** is the maximum amount of time allowed for the overall [Activity Execution](/activities#activity-execution).
- **[Start-To-Close Timeout](/encyclopedia/detecting-activity-failures#start-to-close-timeout):** is the maximum time allowed for a single [Activity Task Execution](/workers#activity-task-execution).
- **[Schedule-To-Close Timeout](/encyclopedia/detecting-activity-failures#schedule-to-close-timeout):** is the maximum amount of time allowed for the overall [Activity Execution](/activities#activity-execution);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think "entire" is better than "overall" because it relates to the duration and is more specific. I'd also use the whole from-to (from scheduling to reaching the Closed state or whatever it actually is). It doesn't take many words and it helps differentiate two things that sound really similar when doing a quick visual scan through docs.

We normally don't use semicolons in lists. We skip punctuation. (I know what you're replacing doesn't skip.)


A Retry Policy works in cooperation with the timeouts to provide fine controls to optimize the execution experience.

Activity Executions are automatically associated with a default [Retry Policy](/encyclopedia/retry-policies) if a custom one is not provided.

To set Activity Retry Policies in TypeScript, pass [`ActivityOptions.retry`](https://typescript.temporal.io/api/interfaces/common.ActivityOptions#retry) to [`proxyActivities`](https://typescript.temporal.io/api/namespaces/workflow/#proxyactivities).
To set an Activity's Retry Policy in TypeScript, use the [`ActivityOptions.retry`](https://typescript.temporal.io/api/interfaces/common.ActivityOptions#retry) property when creating the corresponding Activity proxy function using the [`proxyActivities()`](https://typescript.temporal.io/api/namespaces/workflow#proxyactivities) API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "assign the" vs "use the"

}
}
```
In addition to obtaining cancellation information, Heartbeats also support detail data that persists on the server for retrieval during Activity retry.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does persisting on the server mean that it becomes part of the Event History? If so, I think it would be good to clarify that.

And is it cancellation information or cancellation requests?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does persisting on the server mean that it becomes part of the Event History? If so, I think it would be good to clarify that.

To be exact, it gets persisted to the Workflow's Mutable State, which is a very advanced concept. There's no corresponding encyclopedia entry to which I could link to… I think it would be preferable not to mention that.

}
```
In addition to obtaining cancellation information, Heartbeats also support detail data that persists on the server for retrieval during Activity retry.
If an Activity calls `heartbeat({ progress: 456 })` and then fails and gets retried, then on the next attempt, [`activityInfo().heartbeatDetails`](https://typescript.temporal.io/api/interfaces/activity.Info#heartbeatdetails) will contain the object `{ progress: 456 }`, which can be used to allow the activity to efficiently resume its work.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd make this more of an example, like an activity that is processing 1000 parts, so when it heartbeats it sends a progress payload, ensuring the activity picks up at the right place when re-tried, even if it is not on the same Worker, or similar.


A [Heartbeat Timeout](/encyclopedia/detecting-activity-failures#heartbeat-timeout) works in conjunction with [Activity Heartbeats](/encyclopedia/detecting-activity-failures#activity-heartbeat).

To set a Heartbeat Timeout, use [`ActivityOptions.heartbeatTimeout`](https://typescript.temporal.io/api/interfaces/common.ActivityOptions#heartbeattimeout). If the Activity takes longer than that between heartbeats, the Activity is failed.
To set an Activity's Heartbeat Timeout in TypeScript, set the [`ActivityOptions.heartbeatTimeout`](https://typescript.temporal.io/api/interfaces/common.ActivityOptions#heartbeattimeout) property when creating the corresponding Activity proxy functions using the [`proxyActivities()`](https://typescript.temporal.io/api/namespaces/workflow#proxyactivities) API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this time is exceeded...

### Overriding the retry interval with Next Retry Delay {#next-retry-delay}
## Activity next Retry delay {#activity-next-retry-delay}

**How to override the next Retry delay following an Activity failure using the Temporal Java SDK**
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: "using the Temporal Java SDK" feels tedious and repetitive. Some of the above ones simply say "In Java" which seems better to me.

But docs team's house practices gets final word here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally agree, though I think there's some SEO argument behind that.

I've been told to adhere to the structure and style of our .net's dev guide, as it is the latest and reflects the current standard that the doc team is trying to follow, hence that phrasing.

**How to override the next Retry delay following an Activity failure using the Temporal TypeScript SDK**

The time to wait after a retryable Activity failure until the next retry is attempted is normally determined by that Activity's Retry Policy.
However, an Activity may override that duration when explicitly failing with an [`ApplicationFailure`](https://typescript.temporal.io/api/classes/common.ApplicationFailure) by setting a next Retry delay.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might suggest linking to "Application Failure" and the overview docs instead.

https://docs.temporal.io/references/failures#application-failure

That contains more broadly instructive information than the typescript one, while linking to typescript, and unfortunately AFAICT the typescript docs don't contain an uplink to the reference.

The time to wait after a retryable Activity failure until the next retry is attempted is normally determined by that Activity's Retry Policy.
However, an Activity may override that duration when explicitly failing with an [`ApplicationFailure`](https://typescript.temporal.io/api/classes/common.ApplicationFailure) by setting a next Retry delay.

To override the next Retry delay for an `ApplicationFailure` thrown by an Activity in TypeScript, provide the [`nextRetryDelay`](https://typescript.temporal.io/api/interfaces/common.ApplicationFailureOptions#nextretrydelay) property on the object argument of the [`ApplicationFailure.create()`](https://typescript.temporal.io/api/classes/common.ApplicationFailure#create) factory method.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"in Typescript" is already established by the heading.
Overall, this paragraph is simply restating what the code already says. I can't find any value in it other than the links, which would be better and more succinctly provided IMO in a "see also" links list.

I also prefer how the java example incorporates attempt count via a more real world example, explaining how to do backoff.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants