Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TRON-2208: Add toggle in tron config to disable retries on LOST k8s jobs #988

Merged
merged 15 commits into from
Aug 28, 2024

Conversation

cuza
Copy link
Member

@cuza cuza commented Jul 5, 2024

Summary

This PR addresses TRON-2208 by adding a toggle in the Tron configuration to disable automatic retries for Kubernetes jobs that are marked as "LOST". This is particularly useful for non-idempotent jobs where retries can be dangerous.

Changes

Configuration

  • tron/config/config_parse.py: Added non_retryable_exit_codes to the Kubernetes configuration schema.
  • tron/config/schema.py: Updated the schema to include non_retryable_exit_codes.

Core Logic

  • tron/core/actionrun.py:
    • Modified the _exit_unsuccessful method to check for non_retryable_exit_codes and skip retries if the exit code is in this list.
    • Refactored non_retryable_exit_codes to use a list instead of a tuple.
  • tron/kubernetes.py:
    • Updated the handle_event method to set the exit code to EXIT_KUBERNETES_TASK_LOST for lost tasks.
    • Added non_retryable_exit_codes to the Kubernetes configuration.
  • tron/utils/exitcode.py: Added a new exit code EXIT_KUBERNETES_TASK_LOST with a value of -12.

Tests

  • tests/config/config_parse_test.py: Updated tests to include non_retryable_exit_codes.
  • tests/kubernetes_test.py: Updated tests to check for the new exit code and ensure the correct behavior when a task is marked as lost.

Miscellaneous

  • .gitignore: Added .fleet to the ignore list.
  • requirements-dev-minimal.txt: Added debugpy.
  • requirements-dev.txt: Added debugpy.

Motivation

Given that "LOST" means Tron has lost track of a pod it already thought it had started for a job, attempting to retry/start a replacement can be dangerous for non-idempotent jobs. In the current state, these will consume retries, but with some of our EKS migration methods, LOST tasks are more likely. Therefore, we should have a way to temporarily pause retries on these.

Testing

  • Unit tests have been updated and added to cover the new configuration option and its effects on job retry behavior.

Related Issues

  • TRON-2208: Add toggle in Tron config to disable retries on LOST k8s jobs

Notes

  • Ensure to update your Tron configuration to include the new non_retryable_exit_codes option if you wish to use this feature.
  • This change is backward compatible; if the new configuration option is not set, the default behavior remains unchanged.

@cuza cuza changed the title chore: Add disable_retries_on_lost option to Kubernetes configuration Add toggle in tron config to disable retries on LOST k8s jobs Jul 5, 2024
@cuza cuza changed the title Add toggle in tron config to disable retries on LOST k8s jobs TRON-2208: Add toggle in tron config to disable retries on LOST k8s jobs Jul 5, 2024
The _exit_unsuccessful method in the KubernetesActionRun class has been refactored to include the is_lost_task parameter. This parameter is used to determine whether auto-retries should be skipped when the disable_retries_on_lost option is enabled in the Kubernetes configuration.
tron/kubernetes.py Outdated Show resolved Hide resolved
@cuza cuza requested a review from jfongatyelp July 8, 2024 12:43
cuza added 3 commits July 8, 2024 06:10
…ies_on_lost

* origin/master:
  Released 2.2.6 via make release
  Update scribereader tests (#983)
  Released 2.2.5 via make release
  Update yelp_clog and use datetime range for S3 logs (#979)
  Add clarification around reverting changes that touch DynamoDB data (#982)
…ies_on_lost

* origin/master:
  Adjust path after conf.py was moved
tron/core/actionrun.py Outdated Show resolved Hide resolved
@nemacysts
Copy link
Member

sidenote: can we add some additional context to the PR description? it can still reference the ticket, but having more context in git blame (without having to switch to a browser and go to jira) is always nice :)

@cuza
Copy link
Member Author

cuza commented Jul 10, 2024

sidenote: can we add some additional context to the PR description? it can still reference the ticket, but having more context in git blame (without having to switch to a browser and go to jira) is always nice :)

yeah of course, this is just a draft PR to make sure that we agreed on the implementation 😄

cuza added 3 commits July 31, 2024 15:12
The Kubernetes configuration has been updated to use the `non_retryable_exit_codes` option instead of `disable_retries_on_lost`. This change allows for more flexibility in specifying the exit codes that should not trigger auto-retries for Kubernetes tasks.
…ernetesClusterRepository

The `non_retryable_exit_codes` attribute in the `KubernetesCluster` and `KubernetesClusterRepository` classes has been refactored to use a tuple of integers instead of a single integer. This change allows for more flexibility in specifying multiple exit codes that should not trigger auto-retries for Kubernetes tasks.
@cuza cuza marked this pull request as ready for review August 2, 2024 16:53
…ies_on_lost

* origin/master:
  Minor fix to help text for kubecontext param
  Support running on devbox too
  Support multiple clusters in sync_tron_from_k8s
  Released 2.3.0 via make release
  More type fixes for defaults
  Initialize KubernetesClusterRepository watcher_kubeconfig_paths the same as KubernetsCluster and correct typing
  Released 2.2.7 via make release
  Only show disable warning on tronctl disable (#990)
  Add validation to watcher_kubeconfig_paths too, and a minimal  test
  Update tools/sync_tron_state_from_k8s.py
  Bump task_processing to get watcher_kubeconfig_paths change
  Support old_kubeconfig_paths
  Update sync_tron_from_k8s to match sanitized instance labels, cleanup comments
  Update sync to handle in-progress retries and don't tronctl yet; add unit tests
  TRON-2210: Add a tool that can sync from pod state to tron w/ tronctl

# Conflicts:
#	tron/kubernetes.py
The `debugpy` dependency in the `requirements-dev-minimal.txt` and `requirements-dev.txt` files has been updated to version 1.8.1. This update ensures compatibility and brings in the latest features and bug fixes for debugging purposes.
The .gitignore file has been updated to include the .fleet directory. This ensures that the .fleet directory is ignored by Git and not tracked in the repository.
Copy link
Contributor

@jfongatyelp jfongatyelp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like a nice option, would be nice to see a test where k8s_options.non_retryable_exit_codes was non-empty and checking our code does the right thing tho

@@ -447,7 +447,7 @@ def test_handle_event_lost(mock_kubernetes_task):
)
)

assert mock_kubernetes_task.is_unknown
assert mock_kubernetes_task.exit_status == exitcode.EXIT_KUBERNETES_TASK_LOST
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to keep the old assert as well, since we do still want to verify 'lost' k8s events lead to marking the task as unknown.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sadly the way we're checking unknow is by checking exit code

  def is_unknown(self):
      return self.exit_status is None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ohh interesting; i didn't realize assigning the LOST's an exit code changed this behavior, thanks.

Did you happen to glance through all the other places where we use is_unknown/unknown to check this change won't change any expected behavior there?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should not be a problem since we're only marking lost jobs with the special lost exit code

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_broken() uses the is_unknown property and it looks like there's some jobrun code that uses is_broken to gate starting new action runs - but i'm not sure if this change impacts that in a meaningful way yet

tron/core/actionrun.py Outdated Show resolved Hide resolved
tron/core/actionrun.py Outdated Show resolved Hide resolved
tron/core/actionrun.py Outdated Show resolved Hide resolved
tron/kubernetes.py Outdated Show resolved Hide resolved
tron/kubernetes.py Outdated Show resolved Hide resolved
@cuza cuza requested a review from jfongatyelp August 19, 2024 16:39
Copy link
Member

@nemacysts nemacysts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i don't think there's any major blockers - it's probably fine to run this in non-prod for a bit

i'm not 100% sure about my 2 non-requirements.txt comments, but we should be able to sus any issues out by actually running this since the case we're handling here is pretty rare anyway :)

@@ -1,4 +1,5 @@
asynctest
debugpy
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just curious: what is this for?

Copy link
Member Author

@cuza cuza Aug 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VSCode Python Debugger :)

https://github.com/microsoft/debugpy

tron/core/actionrun.py Show resolved Hide resolved
@@ -447,7 +447,7 @@ def test_handle_event_lost(mock_kubernetes_task):
)
)

assert mock_kubernetes_task.is_unknown
assert mock_kubernetes_task.exit_status == exitcode.EXIT_KUBERNETES_TASK_LOST
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_broken() uses the is_unknown property and it looks like there's some jobrun code that uses is_broken to gate starting new action runs - but i'm not sure if this change impacts that in a meaningful way yet

@cuza cuza requested a review from nemacysts August 27, 2024 01:02
@nemacysts nemacysts merged commit 2d17df3 into master Aug 28, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants