
TRON-2210: Add a tool that can sync from pod state to tron w/ tronctl #985

Merged
merged 4 commits into master from u/jfong/TRON-2210-sync-from-k8s on Jul 11, 2024

Conversation

@jfongatyelp (Contributor) commented Jul 2, 2024

This adds a pretty simple script that will:

  1. search for any completed pods in the tron namespace, given a KUBECONFIG
  2. attempt to match those pods with any existing action runs that are in a state that can be updated
  3. update the state w/ tronctl if the pod's and tron's states do not match
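Step 1 could be sketched roughly like the following. This is an illustrative guess at the shape of the pod query, not the actual script: the function name, kubeconfig path, and the choice of shelling out to kubectl (rather than using the Python kubernetes client) are all assumptions.

```python
# Sketch of step 1: building a query for finished pods in the tron
# namespace. All names/paths here are illustrative, not from the script.
from typing import List


def completed_pods_cmd(kubeconfig: str, namespace: str = "tron") -> List[str]:
    """Build a kubectl invocation listing pods that are no longer running."""
    return [
        "kubectl", "--kubeconfig", kubeconfig,
        "get", "pods", "--namespace", namespace,
        # Succeeded/Failed pods are the "completed" ones worth reconciling.
        "--field-selector", "status.phase!=Running,status.phase!=Pending",
        "--output", "json",
    ]


print(completed_pods_cmd("/tmp/kubeconfig"))
```

The JSON output would then be parsed to pull each pod's final phase and the action run it corresponds to.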

Tron will only allow us to transition from a few known states, so this script only runs tronctl fail or tronctl succeed if the tron action run was either lost/unknown or 'stuck' in running or starting.
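The reconciliation rule described above can be sketched as a small pure function. The state names, mapping, and function signature here are illustrative assumptions, not the script's actual code:

```python
# Sketch of the sync rule: which tronctl command (if any) brings a tron
# action run in line with its pod's final state. Names are illustrative.
from typing import List, Optional

# States tron lets us transition out of: lost/unknown, or "stuck" runs.
MODIFIABLE_STATES = {"unknown", "lost", "running", "starting"}

# Terminal pod phases and the tronctl subcommand that records each one.
POD_PHASE_TO_COMMAND = {"Succeeded": "succeed", "Failed": "fail"}


def tronctl_command(action_state: str, pod_phase: str, run_id: str) -> Optional[List[str]]:
    """Return the tronctl invocation needed to sync, or None for a no-op."""
    subcommand = POD_PHASE_TO_COMMAND.get(pod_phase)
    if subcommand is None:
        return None  # pod hasn't actually completed
    if action_state not in MODIFIABLE_STATES:
        return None  # e.g. scheduled/failed/cancelled: "not modifiable"
    return ["tronctl", subcommand, run_id]


print(tronctl_command("unknown", "Failed", "svc.job.40.sleep_with_retries"))
# → ['tronctl', 'fail', 'svc.job.40.sleep_with_retries']
```

In dry-run mode the script would just print the returned command instead of executing it, which matches the log output shown under Validation below.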

Right now, I allow passing both a tron-url and a tronctl-wrapper so that this could be run from outside the tron server itself, but the defaults assume we'll be running on a local tron server.

Still adding some tests, but I thought I should get feedback now in case I need to shift gears.

Validation

Currently, one reliable way to get an actual tron action run into an unknown state is to restart tron within a short period of the pod having been started.

Forcing a few of these cases in tron-spark-pnw-devc, the script correctly updates the 'unknown' action runs' states:

Action state scheduled for paasta-contract-monitor.k8s_bad.41.sleep_with_retries not modifiable, no action taken
Action state failed for paasta-contract-monitor.k8s_bad.40.sleep not modifiable, no action taken
paasta-contract-monitor.k8s_bad.40.sleep_with_retries state unknown needs updating to fail
Dry-Run: Would run ['tronctl', 'fail', 'paasta-contract-monitor.k8s_bad.40.sleep_with_retries']
Action state failed for paasta-contract-monitor.k8s_bad.39.sleep not modifiable, no action taken
paasta-contract-monitor.k8s_bad.39.sleep_with_retries state unknown needs updating to fail
Dry-Run: Would run ['tronctl', 'fail', 'paasta-contract-monitor.k8s_bad.39.sleep_with_retries']
Action state failed for paasta-contract-monitor.k8s_bad.38.sleep not modifiable, no action taken
paasta-contract-monitor.k8s_bad.38.sleep_with_retries state running needs updating to fail
Dry-Run: Would run ['tronctl', 'fail', 'paasta-contract-monitor.k8s_bad.38.sleep_with_retries']
Action state failed for paasta-contract-monitor.k8s_bad.37.sleep not modifiable, no action taken
paasta-contract-monitor.k8s_bad.37.sleep_with_retries state running needs updating to fail
Dry-Run: Would run ['tronctl', 'fail', 'paasta-contract-monitor.k8s_bad.37.sleep_with_retries']
Action state failed for paasta-contract-monitor.k8s_bad.36.sleep not modifiable, no action taken
paasta-contract-monitor.k8s_bad.36.sleep_with_retries state running needs updating to fail
Dry-Run: Would run ['tronctl', 'fail', 'paasta-contract-monitor.k8s_bad.36.sleep_with_retries']
Action state cancelled for paasta-contract-monitor.k8s_bad.35.sleep not modifiable, no action taken

tools/sync_tron_state_from_k8s.py (outdated review thread, resolved)
job_runs = client.job(
url,
include_action_runs=True,
count=num_runs, # TODO: fetch job run_limit and use that for count ?
Member:

not a bad idea :)

Contributor (Author):

Turns out we don't actually include run_limit in job/job run API responses :/ so this would need to pull from other sources (either the config API or reading the configs some other way), so I think I'm going to leave it as-is for now (we can always run with a higher num_runs if we need to).

Member:

yea, we can probably read from /nail/etc/services - but it's probably fine to leave as-is, i can't think of anything getting anywhere near their run_limit in terms of concurrently running jobs

)

for job in jobs:
# What am I doing wrong here, why do I have to append /api
Member:

not doing anything wrong - i'm not quite sure why this method doesn't automatically add in /api for us

@jfongatyelp force-pushed the u/jfong/TRON-2210-sync-from-k8s branch from 56291a3 to 64db0ab on July 5, 2024 22:21
tools/sync_tron_state_from_k8s.py (outdated review thread, resolved)
@jfongatyelp merged commit ff099ba into master on Jul 11, 2024. 4 checks passed.