Resubmit all jobs in daily twice before giving up in desi_proc_night #2416

Merged: akremin merged 4 commits into main from daily_resub_all_twice on Nov 26, 2024

@akremin akremin commented Nov 22, 2024

Overview

This PR fixes a limitation in how desi_proc_night handles resubmissions. In current main it tries to resubmit failed jobs, but refuses to resubmit anything if any job has already been resubmitted twice (three processing attempts in total). With this PR it attempts every job up to 3 times (2 resubmissions) before giving up on that job. It only refuses to proceed if a calibration job continues to fail; otherwise it goes on to submit later tiles, since a failure on one tile is often independent of the others and the remaining tiles are worth attempting.
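As a rough illustration of the change described above, here is a minimal sketch contrasting the "one exhausted job blocks everything" behavior with the per-job limit. The function names, the `attempts` mapping, and the job names are hypothetical and are not taken from desispec; the sketch also leaves open what the caller does when a calibration job is exhausted.

```python
# Hypothetical sketch; names and structure are illustrative, not desispec code.
MAX_ATTEMPTS = 3  # 1 original submission + 2 resubmissions, per the PR description


def plan_main(attempts):
    """Behavior described for main: refuse everything if any job hit the limit."""
    if any(n >= MAX_ATTEMPTS for n in attempts.values()):
        return []
    return sorted(attempts)


def plan_pr(attempts, calibration_jobs=()):
    """Behavior described for this PR: skip only the jobs that hit the limit.

    Returns (jobs_to_resubmit, proceed). proceed is False only when a
    calibration job has used up all of its attempts; how the caller reacts
    to that flag is outside this sketch.
    """
    resubmit = sorted(job for job, n in attempts.items() if n < MAX_ATTEMPTS)
    calib_exhausted = any(attempts.get(job, 0) >= MAX_ATTEMPTS
                          for job in calibration_jobs)
    return resubmit, not calib_exhausted


# Made-up failed jobs mapped to the number of times each has been submitted.
failed = {"tile_a": 1, "tile_b": 3, "tile_c": 2}

print(plan_main(failed))  # one job at the limit blocks all resubmissions
print(plan_pr(failed))    # only tile_b is skipped; processing may proceed
```

The per-job version keeps one persistently failing tile from holding back tiles that have only failed once or twice.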

The logic for desi_resubmit_queue_failures remains effectively unchanged. It resubmits any job that is incomplete, up to a default of 100 attempts. We've never come close to running a job that many times.

Minor note -- I can't reproduce the doc test error at NERSC, and I can't find the issue in the GitHub logs.

Tests

Real night of 20241110

This had a job that failed 3 times and others that only failed once or twice. In main, desi_proc_night wouldn't submit any of them. Now it submits all except the jobs that had 3 attempts (2 resubmissions).

desi_proc_night outputs:

INFO:processing.py:1279:recursive_submit_failed: Identified row 241110024 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241110024: Tileid=41207, Expid(s)=[262230], Jobdesc=tilenight
INFO:processing.py:753:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:32768225', '/global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41207.slurm']
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41207.slurm with dependencies --dependency=afterok:32768225 and reservation=None. Returned qid: 31454013
[...]
INFO:processing.py:1279:recursive_submit_failed: Identified row 241110027 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241110027: Tileid=41206, Expid(s)=[262231], Jobdesc=tilenight
WARNING:processing.py:1282:recursive_submit_failed: Tileid=41206, Expid(s)=[262231], Jobdesc=tilenight has already been submitted 3 times. Not resubmitting.
[...]
INFO:processing.py:1279:recursive_submit_failed: Identified row 241110030 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241110030: Tileid=41211, Expid(s)=[262232], Jobdesc=tilenight
INFO:processing.py:753:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:32768225', '/global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41211.slurm']
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41211.slurm with dependencies --dependency=afterok:32768225 and reservation=None. Returned qid: 31454015

Resulting entries in the processing table:

262230|,science,41207,20241110,,all,|,a0123456789,0,241110024,,tilenight,31455399,1731455407,SUBMITTED,,241110019|,32768225|,32771633|32772819|32773898|32794792|31455399|
262230|,science,41207,20241110,,all,|,a0123456789,0,241110025,,cumulative,31455400,1731455407,SUBMITTED,,241110024|230608068|231001036|231023022|240603088|240723062|,10076731|16580942|17342827|26463159|28525585|31455399|,32771636|31455400|
262231|,science,41206,20241110,,all,|,a0123456789,0,241110027,,tilenight,32795050,1731348670,MAX_RESUB,,241110019|,32768225|,32772950|32773999|32795050|
262231|,science,41206,20241110,,all,|,a0123456789,0,241110028,,cumulative,32795051,1731348673,MAX_RESUB,,241110027|,32795050|,32772953|32774001|32795051|
262232|,science,41211,20241110,,all,|,a0123456789,0,241110030,,tilenight,31455401,1731455407,SUBMITTED,,241110019|,32768225|,32774002|32795057|31455401|
262232|,science,41211,20241110,,all,|,a0123456789,0,241110031,,cumulative,31455402,1731455407,SUBMITTED,,241110030|,31455401|,32774003|32795065|31455402|
262233|,science,41204,20241110,,all,|,a0123456789,0,241110033,,tilenight,31455403,1731455408,SUBMITTED,,241110019|,32768225|,32774006|32795066|31455403|
262233|,science,41204,20241110,,all,|,a0123456789,0,241110034,,cumulative,31455404,1731455408,SUBMITTED,,241110033|,31455403|,32774008|32795067|31455404|
262234|,science,41213,20241110,,all,|,a0123456789,0,241110036,,tilenight,31455405,1731455408,SUBMITTED,,241110019|,32768225|,32774878|32795068|31455405|
262234|,science,41213,20241110,,all,|,a0123456789,0,241110037,,cumulative,31455406,1731455408,SUBMITTED,,241110036|,31455405|,32774883|32795069|31455406|
262235|,science,41214,20241110,,skysub,low_sn|,a0123456789,0,241110039,,tilenight,32774888,1731294091,COMPLETED,,241110019|,|,32774888|
262236|,science,43016,20241110,,skysub,low_sn|,a0123456789,0,241110041,,tilenight,32775693,1731295505,COMPLETED,,241110019|,|,32775693|
262237|,science,41203,20241110,,skysub,low_sn|,a0123456789,0,241110043,,tilenight,32776037,1731296459,COMPLETED,,241110019|,|,32776037|
262239|,science,43000,20241110,,skysub,low_sn|,a0123456789,0,241110045,,tilenight,32776385,1731297640,COMPLETED,,241110019|,|,32776385|
262240|,science,43002,20241110,,skysub,low_sn|,a0123456789,0,241110047,,tilenight,32776387,1731297642,COMPLETED,,241110019|,|,32776387|

But using desi_resubmit_queue_failures -n 20241110 --dry-run-level=3 still submits the jobs that the daily pipeline skips when running normally:

INFO:processing.py:1279:recursive_submit_failed: Identified row 241110027 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241110027: Tileid=41206, Expid(s)=[262231], Jobdesc=tilenight
INFO:processing.py:753:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:32768225', '/global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41206.slurm']
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241110/tilenight-20241110-41206.slurm with dependencies --dependency=afterok:32768225 and reservation=None. Returned qid: 31460817
INFO:processing.py:1279:recursive_submit_failed: Identified row 241110028 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241110028: Tileid=41206, Expid(s)=[262231], Jobdesc=cumulative
INFO:processing.py:753:submit_batch_script: ['sbatch', '--parsable', '--dependency=afterok:31460817', '/global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/tiles/cumulative/41206/20241110/ztile-41206-thru20241110.slurm']
INFO:processing.py:754:submit_batch_script: Submitted ztile-41206-thru20241110.slurm with dependencies --dependency=afterok:31460817 and reservation=None. Returned qid: 31460818
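The sbatch calls in the log above chain each resubmitted job to the queue IDs of its dependencies: the cumulative job waits on the qid returned for the resubmitted tilenight job. A minimal sketch of building such an argument list follows. The helper name `build_sbatch_command` is hypothetical and is not the desispec function; `--parsable` and `--dependency=afterok:` are standard Slurm options, and Slurm accepts several job IDs separated by colons. The script name is shortened to its basename here.

```python
def build_sbatch_command(script, dep_qids=()):
    """Return an sbatch argument list that waits on the given Slurm job IDs.

    Hypothetical helper for illustration only.
    """
    cmd = ["sbatch", "--parsable"]
    if dep_qids:
        # afterok: start only after every listed job has finished successfully
        cmd.append("--dependency=afterok:" + ":".join(str(q) for q in dep_qids))
    cmd.append(script)
    return cmd


# The cumulative job depends on the qid returned for the resubmitted tilenight job.
cmd = build_sbatch_command("ztile-41206-thru20241110.slurm", [31460817])
print(cmd)
```

Passing the list to `subprocess.run(cmd, capture_output=True, text=True)` would submit the job; with `--parsable`, Slurm prints the new job ID (optionally followed by `;` and a cluster name) on stdout, which can then serve as the dependency for the next job in the chain.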

Fabricated issues on night 20241109

Edited the last several exposures of the 20241109 processing table to be the following:

262125|,science,7628,20241109,,all,|,a0123456789,0,241109076,,tilenight,32756033,1731243642,COMPLETED,,241109019|,|,32756033|
262125|,science,7628,20241109,,all,|,a0123456789,0,241109077,,cumulative,32756034,1731243643,FAILED,,241109076|241108097|,32756033|,32756034|99999999|
262126|,science,2917,20241109,,all,|,a0123456789,0,241109079,,tilenight,32756135,1731244840,FAILED,,241109019|,|,32756135|99999999|99999999|
262126|,science,2917,20241109,,all,|,a0123456789,0,241109080,,cumulative,32756136,1731244842,MAX_RESUB,,241109079|,32756135|,32756136|
262127|,science,20358,20241109,,all,|,a0123456789,0,241109082,,tilenight,32756137,1731244846,UNSUBMITTED,,241109019|,|,32756137|
262127|,science,20358,20241109,,all,|,a0123456789,0,241109083,,cumulative,32756138,1731244848,DEP_NOT_SUBD,,241109082|241108103|,32756137|,32756138|

This includes a FAILED job that was resubmitted once (and should be resubmitted again); a FAILED job that was resubmitted twice (it should not be resubmitted by desi_proc_night, but should be by desi_resubmit_queue_failures); and jobs with MAX_RESUB, UNSUBMITTED, and DEP_NOT_SUBD statuses that have never been resubmitted. The desired behavior is that desi_proc_night resubmits all of them except the tilenight job for exposure 262126, and that desi_resubmit_queue_failures resubmits all of them. This PR produces the desired outcome:

> desi_proc_night --daily -n 20241109 --dry-run-level=4:

INFO:processing.py:1279:recursive_submit_failed: Identified row 241109077 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109077: Tileid=7628, Expid(s)=[262125], Jobdesc=cumulative
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-7628-thru20241109.slurm with dependencies --dependency=afterok:32735474:32756033 and reservation=None. Returned qid: 32237729
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109079 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109079: Tileid=2917, Expid(s)=[262126], Jobdesc=tilenight
WARNING:processing.py:1282:recursive_submit_failed: Tileid=2917, Expid(s)=[262126], Jobdesc=tilenight has already been submitted 3 times. Not resubmitting.
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109080 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109080: Tileid=2917, Expid(s)=[262126], Jobdesc=cumulative
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109079 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109079: Tileid=2917, Expid(s)=[262126], Jobdesc=tilenight
[...]
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241109/tilenight-20241109-2917.slurm with dependencies --dependency=afterok:32748192 and reservation=None. Returned qid: 32237730
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-2917-thru20241109.slurm with dependencies --dependency=afterok:32237730 and reservation=None. Returned qid: 32237731
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109082 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109082: Tileid=20358, Expid(s)=[262127], Jobdesc=tilenight
[...]
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241109/tilenight-20241109-20358.slurm with dependencies --dependency=afterok:32748192 and reservation=None. Returned qid: 32237732
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109083 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 241109083: Tileid=20358, Expid(s)=[262127], Jobdesc=cumulative
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-20358-thru20241109.slurm with dependencies --dependency=afterok:32735627:32237732 and reservation=None. Returned qid: 32237733

> desi_resubmit_queue_failures --dry-run-level=4 -n 20241109:

INFO:processing.py:1279:recursive_submit_failed: Identified row 241109077 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 	241109077: Tileid=7628, Expid(s)=[262125], Jobdesc=cumulative
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-7628-thru20241109.slurm with dependencies --dependency=afterok:32735474:32756033 and reservation=None. Returned qid: 32238512
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109079 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 	241109079: Tileid=2917, Expid(s)=[262126], Jobdesc=tilenight
[...]
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241109/tilenight-20241109-2917.slurm with dependencies --dependency=afterok:32748192 and reservation=None. Returned qid: 32238513
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109080 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 	241109080: Tileid=2917, Expid(s)=[262126], Jobdesc=cumulative
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-2917-thru20241109.slurm with dependencies --dependency=afterok:32238513 and reservation=None. Returned qid: 32238514
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109082 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 	241109082: Tileid=20358, Expid(s)=[262127], Jobdesc=tilenight
[...]
INFO:processing.py:754:submit_batch_script: Submitted /global/cfs/cdirs/desi/users/kremin/PRs/daily_resub_all_twice/run/scripts/night/20241109/tilenight-20241109-20358.slurm with dependencies --dependency=afterok:32748192 and reservation=None. Returned qid: 32238515
INFO:processing.py:1279:recursive_submit_failed: Identified row 241109083 as needing resubmission.
INFO:processing.py:1280:recursive_submit_failed: 	241109083: Tileid=20358, Expid(s)=[262127], Jobdesc=cumulative
INFO:processing.py:1300:recursive_submit_failed: Internal ID: 241108103 not in id_to_row_map. This is expected since it's from another day. 
[...]
INFO:processing.py:754:submit_batch_script: Submitted ztile-20358-thru20241109.slurm with dependencies --dependency=afterok:32735627:32238515 and reservation=None. Returned qid: 32238516
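The expected per-row decisions for the fabricated 20241109 rows can be checked with a short sketch. It assumes that the last comma-separated column holds one pipe-delimited queue ID per submission, that the internal ID, job description, and status sit at positions 9, 11, and 14, and that the statuses needing resubmission are the four that appear in this PR; none of these assumptions is confirmed against desispec. It models only the direct decision for each row, not the recursion through dependencies that recursive_submit_failed performs.

```python
# Hypothetical sketch; column positions and the status set are assumptions.
RESUBMIT_STATES = {"FAILED", "MAX_RESUB", "UNSUBMITTED", "DEP_NOT_SUBD"}

# The six fabricated processing-table rows from this PR.
ROWS = """\
262125|,science,7628,20241109,,all,|,a0123456789,0,241109076,,tilenight,32756033,1731243642,COMPLETED,,241109019|,|,32756033|
262125|,science,7628,20241109,,all,|,a0123456789,0,241109077,,cumulative,32756034,1731243643,FAILED,,241109076|241108097|,32756033|,32756034|99999999|
262126|,science,2917,20241109,,all,|,a0123456789,0,241109079,,tilenight,32756135,1731244840,FAILED,,241109019|,|,32756135|99999999|99999999|
262126|,science,2917,20241109,,all,|,a0123456789,0,241109080,,cumulative,32756136,1731244842,MAX_RESUB,,241109079|,32756135|,32756136|
262127|,science,20358,20241109,,all,|,a0123456789,0,241109082,,tilenight,32756137,1731244846,UNSUBMITTED,,241109019|,|,32756137|
262127|,science,20358,20241109,,all,|,a0123456789,0,241109083,,cumulative,32756138,1731244848,DEP_NOT_SUBD,,241109082|241108103|,32756137|,32756138|
""".splitlines()


def parse_row(line):
    """Pull out the fields this sketch needs from one table line."""
    fields = line.split(",")
    return {
        "intid": int(fields[9]),
        "jobdesc": fields[11],
        "status": fields[14],
        # One queue ID per submission attempt (assumed meaning of the last column).
        "all_qids": [q for q in fields[-1].split("|") if q],
    }


def should_resubmit(row, max_resubs):
    """Resubmit if the status calls for it and the attempt limit is not exceeded."""
    if row["status"] not in RESUBMIT_STATES:
        return False
    return len(row["all_qids"]) <= max_resubs


table = [parse_row(line) for line in ROWS]
daily = {r["intid"] for r in table if should_resubmit(r, max_resubs=2)}
manual = {r["intid"] for r in table if should_resubmit(r, max_resubs=100)}

print(sorted(manual - daily))  # [241109079]
```

Under these assumptions the only row that the limit of 2 resubmissions excludes is the tilenight job for exposure 262126, while the limit of 100 selects every incomplete row.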

Real night 20241108 with no issues

Both desi_proc_night --daily -n 20241108 --dry-run-level=4 and desi_resubmit_queue_failures --dry-run-level=4 -n 20241108 run without complaint and without trying to submit anything.

@akremin akremin changed the title from "Try to resub all jobs in daily twice before giving up in desi_proc_night" to "Resubmit all jobs in daily twice before giving up in desi_proc_night" Nov 22, 2024

coveralls commented Nov 22, 2024

Coverage Status

coverage: 30.176% (+0.02%) from 30.16%
when pulling 8f4dca9 on daily_resub_all_twice
into 1f75589 on main.

@akremin akremin requested a review from sbailey November 22, 2024 22:00

@sbailey sbailey left a comment

I put a minor style comment inline, but otherwise looks good and useful. I'm trusting your testing which looks complete; I haven't tried running it myself.

Inline comment on py/desispec/workflow/processing.py (outdated, resolved)
@akremin akremin merged commit 255647f into main Nov 26, 2024
25 of 26 checks passed
@akremin akremin deleted the daily_resub_all_twice branch November 26, 2024 19:04