Cori data causing cross-night dependency error in daily processing? #2331
Comments
4 backup tiles from 20240818 are affected by this issue: 42083, 40062, 42946, 42918.
Adding tile 40689 from 20240819 to this list.
41853 failed on 20240819 for potentially related reasons? I purged and tried to rerun the tile data from 20211212 (which failed on standard stars) to see if the issue on 0819 could be fixed. That attempt also failed.
If the Slurm job ID isn't found, the code simply doesn't update the STATUS and falls back to whatever STATUS is recorded in the processing table. So if these jobs had failed in the past it would cause an issue, but if they were successful it shouldn't (unless there is a bug). Another possibility is that the Cori job IDs now overlap with Perlmutter job IDs, and Slurm is returning the status of the Perlmutter job with the same ID, which may have failed. I will get to the bottom of this and report back. @abrodze We should not be purging old processed data unless a new data quality issue has been identified on that old night. Reprocessing with new code might have solved the issue for that tile, but it wouldn't address the underlying issue for any of the other backup tiles, and it makes the daily dataset even more inhomogeneous.
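To make the two scenarios in the comment above concrete, here is a minimal Python sketch of the status fallback as described. It is not the actual pipeline code: the function `resolve_status`, the column names `LATEST_QID` and `STATUS`, the tile IDs, and the job IDs are all assumptions chosen for illustration.

```python
def resolve_status(table_row, slurm_states):
    """Pick the status to use for one processing-table row.

    table_row    : dict with hypothetical keys 'LATEST_QID' and 'STATUS'
    slurm_states : dict mapping job ID -> state string, e.g. as one might
                   parse from scheduler accounting output (assumed shape)
    """
    qid = table_row["LATEST_QID"]
    if qid in slurm_states:
        # Slurm knows this job ID, so its answer overrides the table.
        return slurm_states[qid]
    # Job ID unknown to Slurm: leave the recorded STATUS untouched.
    return table_row["STATUS"]


# Illustrative rows only; these are not real tiles or job IDs.
rows = [
    # Old job that Slurm no longer remembers: the table value is kept.
    {"TILEID": 1, "LATEST_QID": 111, "STATUS": "COMPLETED"},
    # Old job whose ID happens to match a newer, failed job on another
    # system: the lookup "succeeds" and reports the wrong job's state.
    {"TILEID": 2, "LATEST_QID": 222, "STATUS": "COMPLETED"},
]
slurm_states = {222: "FAILED"}

resolved = [resolve_status(row, slurm_states) for row in rows]
for row, status in zip(rows, resolved):
    print(row["TILEID"], row["STATUS"], "->", status)
```

Under these assumptions the first row keeps its recorded status, while the second is reported as failed even though the original job succeeded, which is the job ID overlap failure mode hypothesized above.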
An odd twist in this story. I've been digging into tile 40062. It turns out the first exposure was only processed through sky subtraction and is therefore not an issue for redshift dependencies. I have verified that that is true in the processing table for the latest night -- it is only trying to make dependencies with the two nights in 2023 and the tilenight job on 20240818 itself. That is all good. The bad news is that from what I can tell the 2023 nights were processed using Perlmutter but Slurm still doesn't remember them in sacct. Dumping some outputs below. I'll dig into this more on Monday and see if there is anything we can do here.
Tile 40062 failed during processing. The tile has been observed on 4 different nights (20220418, 20230620, 20230806, 20240818). @sbailey's hypothesis for the failure is that the night from 2022 was processed on Cori and it fails the cross-night dependencies (#2321) because it does not have a QID.