Update job_monitoring.md
ktiits authored Aug 30, 2024
1 parent d5f26c7 commit 448e287
Showing 1 changed file with 19 additions and 0 deletions.
19 changes: 19 additions & 0 deletions materials/job_monitoring.md
Expand Up @@ -41,3 +41,22 @@ More detailed queries can be tailored with `sacct`
- `sacct -S 2022-08-01` will show all jobs started after that date
**Note!** Querying data from the Slurm accounting database with `sacct` can be a very heavy operation. **Don't** query long time intervals or run `sacct` in a loop/using `watch` as this will degrade the performance of the system for all users.
:::

:::{admonition} What to do if a job fails?
:class: warning

Does `sacct` show that your job failed? Or did the job not do what you expected (e.g. it did not write the files it should have)?
Some things to check:

1. Did the job run out of time?
2. Did the job run out of memory?
3. Did the job actually use the resources you specified?
   - Problems in the batch job script can cause parameters to be ignored and default values to be used instead
4. Did it fail immediately or did it run for some time?
   - Jobs that fail immediately often do so because of typos, missing inputs, bad parameters _etc_.
5. Check the error output of the batch job (by default Slurm writes both standard output and standard error to `slurm-<jobid>.out`, where `<jobid>` is the job ID)
6. Check any other error files and logs your program may have produced
7. Error messages can sometimes be long, cryptic and a bit intimidating, but ...
   - Try skimming through them and see if you can spot something "human-readable"
   - If you go through the whole message, you can often spot the actual problem: something like "required input file so-and-so missing" or "parameter X out of range" _etc_.
8. Consult the [FAQ on common Slurm issues](https://docs.csc.fi/support/faq/why-does-my-batch-job-fail/) in the CSC Docs
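
The first checks in the list above can be sketched as a small shell helper. This is only an illustration, not part of the course material: the function names `hint_for_state` and `check_job` and the hint texts are made up here, while the `sacct` options (`-j`, `-X`, `-n`, `-P`, `--format`) and the state names are standard Slurm. It queries a single job once, in line with the note above about not running `sacct` in a loop.

```bash
#!/usr/bin/env bash

# Map a Slurm job state (the State column of sacct) to a short hint.
# The hint texts are illustrative, not official Slurm messages.
hint_for_state() {
    case "$1" in
        TIMEOUT*)       echo "ran out of time: request a longer --time" ;;
        OUT_OF_MEMORY*) echo "ran out of memory: request more memory" ;;
        CANCELLED*)     echo "cancelled: by a user or by the system" ;;
        FAILED*)        echo "failed: check the slurm-<jobid>.out file and program logs" ;;
        COMPLETED*)     echo "completed: if results are missing, check the program logs" ;;
        *)              echo "state $1: see the CSC Docs FAQ" ;;
    esac
}

# Query one job a single time and print its state, run time and a hint.
# Hypothetical helper: do not call it in a loop or with watch.
check_job() {
    local jobid="$1"
    # -X: job allocation only, -n: no header, -P: '|'-separated output
    sacct -j "$jobid" -X -n -P --format=JobID,State,Elapsed,Timelimit |
    while IFS='|' read -r id state elapsed limit; do
        echo "$id: $state (elapsed $elapsed of $limit)"
        hint_for_state "$state"
    done
}
```

After sourcing the file, `check_job 1234567` (with your own job ID in place of the made-up one) prints one summary line and one hint. Comparing `Elapsed` with `Timelimit` also shows whether the job failed immediately or ran for some time.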
