Resubmit all jobs in daily twice before giving up in desi_proc_night #2416
Overview
This PR fixes a limitation in desi_proc_night for resubmissions. In current main it tries to resubmit jobs, but refuses to do so if any job has already been resubmitted twice (for a total of three processing attempts). With this PR it will try to process every job 3 times (2 resubmissions) before giving up on that particular job. The code only refuses to proceed if a calibration continues to fail; otherwise it goes on to submit future tiles, since issues on a given tile are sometimes independent and the other tiles are worth attempting.
The logic for desi_resubmit_queue_failures remains effectively unchanged. It resubmits any job that is incomplete, up to a default of 100 attempts. We've never come close to running a job that number of times.
Minor note: I can't reproduce the doc test error at NERSC, and can't find the issue in the GitHub logs.
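The per-job policy described above can be sketched roughly as follows. This is an illustrative sketch only, not the code in this PR: the Job class, the plan_resubmissions function, and the NEEDS_RESUB set are hypothetical names, and the status strings are just the ones mentioned in this description.

```python
from dataclasses import dataclass

# Statuses mentioned in this PR that mark a job as needing resubmission.
# Illustrative only; the real list is defined inside desispec.
NEEDS_RESUB = {"FAILED", "MAX_RESUB", "UNSUBMITTED", "DEP_NOT_SUBD"}


@dataclass
class Job:
    """Hypothetical stand-in for one processing-table row."""
    name: str
    status: str
    n_resubs: int               # how many times the job was already resubmitted
    is_calibration: bool = False


def plan_resubmissions(jobs, max_resubs=2):
    """Decide per job whether to resubmit it or give up on it.

    Returns (to_resubmit, exhausted, halt), where halt is True only if a
    calibration job has used up its resubmissions.
    """
    to_resubmit, exhausted = [], []
    for job in jobs:
        if job.status not in NEEDS_RESUB:
            continue  # finished or still running: nothing to do
        if job.n_resubs < max_resubs:
            to_resubmit.append(job.name)
        else:
            exhausted.append(job.name)
    halt = any(job.is_calibration and job.name in exhausted for job in jobs)
    return to_resubmit, exhausted, halt


jobs = [
    Job("tile A", "FAILED", n_resubs=1),     # gets one more attempt
    Job("tile B", "FAILED", n_resubs=2),     # already had 3 attempts
    Job("tile C", "COMPLETED", n_resubs=0),  # nothing to do
]
to_resubmit, exhausted, halt = plan_resubmissions(jobs)
print(to_resubmit, exhausted, halt)
```

The point of the sketch is that the limit is applied job by job: an exhausted science job is skipped without blocking the others, and only an exhausted calibration job sets the halt flag.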
Tests
Real night of 20241110
This night had a job that failed 3 times and others that failed only once or twice. In main, desi_proc_night wouldn't submit any of them. Now it submits all of them except the jobs that have already had 3 attempts (2 resubmissions).
desi_proc_night outputs:
Resulting entries in the processing table:
But using desi_resubmit_queue_failures -n 20241110 --dry-run-level=3 still submits the jobs that the daily pipeline skips when running normally:
Fabricated issues on night 20241109
I edited the last several exposures of the 20241109 processing table to be the following:
This includes a failed job that was resubmitted once (and should be resubmitted again); a job that was resubmitted twice and FAILED (it shouldn't be resubmitted by desi_proc_night but should be by desi_resubmit_queue_failures); and jobs with MAX_RESUB, UNSUBMITTED, and DEP_NOT_SUBD statuses that have never been resubmitted. The desired behavior is for desi_proc_night to resubmit all of them except the tilenight job for exposure 262126, and for desi_resubmit_queue_failures to resubmit all of them. We see that the desired outcome happens in this PR:
> desi_proc_night --daily -n 20241109 --dry-run-level=4:
> desi_resubmit_queue_failures --dry-run-level=4 -n 20241109
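As a rough cross-check of the expectations for this fabricated night, the five kinds of rows can be run through a simple limit comparison. The would_resubmit helper and the row descriptions are hypothetical. The FAILED status on the first row and the reading of the default 100 as a cap on resubmissions are assumptions made for illustration.

```python
def would_resubmit(n_resubs, max_resubs):
    """True if an unfinished job is still under its resubmission limit."""
    return n_resubs < max_resubs


# (description, status, previous resubmissions) for the fabricated rows.
cases = [
    ("failed job, resubmitted once", "FAILED", 1),   # status assumed
    ("tilenight job, resubmitted twice", "FAILED", 2),
    ("never resubmitted", "MAX_RESUB", 0),
    ("never resubmitted", "UNSUBMITTED", 0),
    ("never resubmitted", "DEP_NOT_SUBD", 0),
]

PROC_NIGHT_LIMIT = 2     # 2 resubmissions (3 attempts), per this PR
RESUB_QUEUE_LIMIT = 100  # default quoted in the Overview (assumed to cap resubmissions)

proc_night = [would_resubmit(n, PROC_NIGHT_LIMIT) for _, _, n in cases]
resub_queue = [would_resubmit(n, RESUB_QUEUE_LIMIT) for _, _, n in cases]

for (desc, status, _), a, b in zip(cases, proc_night, resub_queue):
    print(f"{status:13s} {desc:34s} proc_night={a} resubmit_queue={b}")
```

Under these assumptions only the twice-resubmitted tilenight row is skipped with the lower limit, and every row is resubmitted with the higher one, which matches the desired behavior stated above.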
Real night 20241108 with no issues
Both desi_proc_night --daily -n 20241108 --dry-run-level=4 and desi_resubmit_queue_failures --dry-run-level=4 -n 20241108 run without complaint and without trying to submit anything.