Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Set Meta of Restarted Job as Unused #687

Merged
merged 2 commits into from
Sep 12, 2024
Merged

Conversation

mpass99
Copy link
Collaborator

@mpass99 mpass99 commented Sep 11, 2024

because whenever a new allocation is started (for the same runner), it is already handled as fresh in Poseidon. With this PR we persist this information to Nomad, to allow a minimally improved runner recovery behavior when such runners are still counted as idle.

Closes #621

We do not have sufficient knowledge about indicators in the allocation data to identify restarted/rescheduled/migrated allocations.
With this approach, we completely rely on the PreviousAllocation field. If it is set, the allocation must be restarted and is, therefore, unused. However, we disregard that idle runners can also be restarted, leading to an unnecessary request in this approach. Also, the PreviousAllocation might not be a reliable indicator for all restarted/rescheduled/migrated allocations. It might be possible that we do not reset the used field for some restarted/.. runners.
As an alternative, we could invest more time in determining the restart/.. indicators. This might allow resetting the Used-Meta field of more runners, which would minimally improve the timing of recovery scenarios (startup, network outages, nomad outages).

@mpass99 mpass99 self-assigned this Sep 11, 2024
Base automatically changed from linter to main September 11, 2024 17:43
because whenever a new allocation is started (for the same runner), it is already handled as fresh in Poseidon.
With this PR we persist this information to Nomad, to allow a minimal improved runner recovery behavior when such runners are still counted as idle.
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 79.16667% with 5 lines in your changes missing coverage. Please review.

Project coverage is 76.60%. Comparing base (fe6ed1b) to head (b42819b).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
internal/nomad/nomad.go 71.42% 2 Missing and 2 partials ⚠️
internal/runner/nomad_manager.go 90.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #687      +/-   ##
==========================================
+ Coverage   76.17%   76.60%   +0.42%     
==========================================
  Files          43       43              
  Lines        3631     3642      +11     
==========================================
+ Hits         2766     2790      +24     
+ Misses        632      623       -9     
+ Partials      233      229       -4     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

MrSerth
MrSerth previously approved these changes Sep 11, 2024
Copy link
Member

@MrSerth MrSerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for looking into this 🦋

I acknowledge your concerns and agree that there could be further potential to improve. However, it might not be worth the effort right now given the other issues we are looking at.

With this approach, we completely rely on the PreviousAllocation field.

That's true, and is fine for me as an improvement.

However, we disregard that idle runners can also be restarted, leading to an unnecessary request in this approach.

You mean because we are calling setRunnerMetaUsed(r, false, 0) (and setting the value) without checking the previous ConfigMetaUsedKey? Sure, that could be optimized, but shouldn't be too bad, I'd say.

Also, the PreviousAllocation might not be a reliable indicator for all restarted/rescheduled/migrated allocations. It might be possible that we do not reset the used field for some restarted/.. runners.

That's true. For the current knowledge, however, it is the best approach we have to update the meta data. I still feel that the PR improves the situation and if we learn about occasions where it is not sufficient, we can still fix the behavior.

As an alternative, we could invest more time in determining the restart/.. indicators. This might allow resetting the Used-Meta field of more runners, which would minimally improve the timing of recovery scenarios (startup, network outages, nomad outages).

Would you recommend that? I'd say that "minimally improving the timing of recovery scenarios" is nice, but currently not the highest priority.

to send a second request to Nomad only when the meta values change.
@mpass99
Copy link
Collaborator Author

mpass99 commented Sep 12, 2024

You mean because we are calling setRunnerMetaUsed(r, false, 0) (and setting the value) without checking the previous ConfigMetaUsedKey? Sure, that could be optimized, but shouldn't be too bad, I'd say.

Yeah. How you summarized it, it seemed so easy that I've added it with b42819b.

Would you recommend that? I'd say that "minimally improving the timing of recovery scenarios" is nice, but currently not the highest priority.

No. I would agree 🦋

Copy link
Member

@MrSerth MrSerth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome, thanks for the additional improvement -- this really looks great! 🌸

@MrSerth MrSerth merged commit c5718b5 into main Sep 12, 2024
12 checks passed
@MrSerth MrSerth deleted the fix/#621-unset-used-runner branch September 12, 2024 08:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fix Runner-Meta-Tag "Used"
2 participants