move too many jobs box to htc, small fixes KE
samumantha committed Oct 11, 2023
1 parent a6db427 commit 4413fb5
Showing 3 changed files with 51 additions and 50 deletions.
54 changes: 5 additions & 49 deletions materials/fair_share.md
@@ -40,66 +40,22 @@ SLURM job allocations
- At CSC, the priority is configured to use "fair share"
- The _initial_ priority of a job _decreases_ if the user has recently run lots of jobs
- Over time (while queueing) its priority _increases_ and eventually it will run
- Some queues have a lower priority (e.g. _longrun_ -- use a shorter queue if you can!)
- In general, always use the shortest queue/smallest partition possible!
- See our documentation for more information on [Getting started with running batch jobs on Puhti/Mahti](https://docs.csc.fi/computing/running/getting-started/) and [LUMI](https://docs.lumi-supercomputer.eu/runjobs/).

:::{admonition} How many resources to request?
:class: seealso

* You can use your workstation / laptop as a base measuring stick: If the code runs on your machine, as a first guess you can reserve the same number of CPUs & the same amount of memory as your machine has. Before reserving multiple CPUs, check if your code can make use of them.
* You can also check more closely what resources are used with `top` when running on your machine
* You can also check more closely what resources are used with `top` on Mac and Linux or `task manager` on Windows when running on your machine
* Similarly for running time: if you have run it on your machine, you should reserve a similar amount of time on the cluster.
* If your program does the same thing more than once, you can estimate that the total run time is T ≈ n_steps ⋅ t_step, where t_step is the time taken by each step.
* Likewise, if your program runs over multiple parameter sets, the total time needed is T_total ≈ n_parameters ⋅ T_single, where T_single is the time needed to run the program with one set of parameters.
* If your program does the same thing more than once, you can estimate that the `total run time is the number of steps times the time taken by each step`.
* Likewise, if your program runs over multiple parameter sets, the `total time needed is the number of parameter sets times the time needed to run the program with one set of parameters`.
* You can also run a smaller version of the problem and try to estimate how the program will scale when you make the problem bigger (a sketch of this kind of estimate follows after this box).
* You should always monitor your jobs to find out what resources they actually used compared to what you requested (`seff jobid`).
* You should always monitor your jobs to find out what resources they actually used compared to what you requested.

Adapted from [Aalto Scientific Computing](https://scicomp.aalto.fi/triton/usage/program-size/)
:::
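
A rough sketch of the estimation advice above, using only the relation "total time ≈ number of steps × time per step"; the step time, step count, and safety margin are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope estimate of how much wall time to request,
# based on timing a single step on your own machine.
def estimate_total_hours(n_steps: int, hours_per_step: float, safety_factor: float = 1.5) -> float:
    """Total time ~ number of steps * time per step, padded with a safety margin."""
    return n_steps * hours_per_step * safety_factor

# Example: one step took ~6 minutes (0.1 h) on a laptop, 200 parameter sets are needed.
raw_estimate = estimate_total_hours(200, 0.1, safety_factor=1.0)   # 20 h
request = estimate_total_hours(200, 0.1)                           # 30 h with margin
print(f"raw estimate: {raw_estimate:.0f} h, time to request: {request:.0f} h")
```

Requesting somewhat more than the raw estimate protects the job against hitting the time limit, while requesting far more than needed makes the job harder to schedule.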

:::{admonition} How many jobs is too many?
:class: seealso, dropdown

We mention in documentation and guidelines that users shouldn’t send too many jobs, but how many is too many?
Unfortunately it’s impossible to give any exact numbers because both Slurm and Lustre are shared resources.
* It’s only possible to give firm limits for the global usage of the system as a whole, not for a single user.
* When the total load of the system is low, it may be OK to run something that would be problematic when the system is full.

**How many jobs/steps is too many?**

* SHOULD BE OK to run tens of jobs/steps
* PAY ATTENTION if you run hundreds of jobs/steps
* DON’T RUN several thousands of jobs

**How many file operations is too many?**

* SHOULD BE OK to access hundreds of files
* PAY ATTENTION if you need several thousands of files
* DON’T USE hundreds of thousands of files

Note that these guideline numbers are for all operations across all of your jobs combined.

**I have lots of small files**

* Check the tool that you are using
  * There may be different options for data storage
* Tar/untar and compress your datasets.
* Use local disk (NVMe on Puhti, ramdisk on Mahti).
* Remove intermediate files if possible.
* Use squashfs for read-only datasets and containers.

**I have lots of small tasks for SLURM**

* Check the tool that you are using
  * There may already be support for running multiple tasks in a single job
* Regroup your tasks and execute a larger group of tasks in a single job/step.
  * Manual or automatic (if the feature is present in your tool)
  * Horizontal and vertical packing
  * Tradeoff (redundancy, parallelism, utilization)
* Do a larger job and use another scheduler within it (HyperQueue, Flux).
  * Integration for Nextflow and Snakemake already exists
* CSC has some tools for farming-type jobs
* Not all or nothing

:::

45 changes: 45 additions & 0 deletions materials/htc.md
@@ -118,3 +118,48 @@ Many tools available:
Do you need to run a lot of steps one after another? Or a few steps that need a lot of memory? Do the steps depend on each other? Which steps could be run in parallel? Which steps cannot be run in parallel?

:::

:::{admonition} How many jobs is too many?
:class: seealso, dropdown

We mention in documentation and guidelines that users shouldn’t send too many jobs, but how many is too many?

Unfortunately it’s impossible to give any exact numbers because both Slurm and Lustre are shared resources.
* It’s only possible to give firm limits for the global usage of the system as a whole, not for a single user.
* When the total load of the system is low, it may be OK to run something that would be problematic when the system is full.

**How many jobs is too many?**

* SHOULD BE OK to run tens of jobs
* PAY ATTENTION if you run hundreds of jobs
* DON’T RUN several thousands of jobs

**How many file operations is too many?**

* SHOULD BE OK to access hundreds of files
* PAY ATTENTION if you need several thousands of files
* DON’T USE hundreds of thousands of files

Note that these guideline numbers are for all operations across all of your jobs combined.

**I have lots of small files**

* Check the tool that you are using
  * There may be different options for data storage
* Tar/untar and compress your datasets (a sketch follows after this box).
* Use local disk (NVMe on Puhti, ramdisk on Mahti).
* Remove intermediate files if possible.
* Use squashfs for read-only datasets and containers.

**I have lots of small tasks for Slurm**

* Regroup your tasks and execute a larger group of tasks in a single job (a packing sketch follows after this box).
  * Manual or automatic (if the feature is present in your tool)
  * Horizontal and vertical packing
  * Tradeoff (redundancy, parallelism, utilization)
* Do a larger job and use another scheduler within it (HyperQueue, Flux).
  * Integration for Nextflow and Snakemake already exists
* CSC has some tools for farming-type jobs
* Not all or nothing

:::
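
As a hedged illustration of the "lots of small files" advice in the box above, the sketch below packs a directory of many small files into one compressed archive and later unpacks it onto node-local disk inside a job. The paths, the `my_dataset` directory, and the fallback to `/tmp` are assumptions for illustration; `LOCAL_SCRATCH` is only available when the job has requested local storage, so check the CSC documentation for the exact setup on your system.

```python
# Pack many small files into a single archive so Lustre sees one big file,
# then extract it onto fast node-local disk inside the job.
# All paths below are illustrative assumptions.
import os
import tarfile

def pack(src_dir: str, archive: str) -> None:
    """Create one compressed archive from a directory full of small files."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

def unpack_to_local(archive: str, local_dir: str) -> str:
    """Extract the archive under a node-local directory and return that path."""
    os.makedirs(local_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(local_dir)
    return local_dir

if __name__ == "__main__":
    pack("my_dataset", "my_dataset.tar.gz")
    # Assumption: inside a job that requested local storage, LOCAL_SCRATCH points
    # at node-local NVMe on Puhti; fall back to /tmp when testing elsewhere.
    target = os.environ.get("LOCAL_SCRATCH", "/tmp")
    print("data available under", unpack_to_local("my_dataset.tar.gz", target))
```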
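
The "regroup your tasks" advice can also be applied by hand with a few lines of code: instead of one Slurm job per task, split the task list into a small number of chunks and let each job work through a whole chunk. A minimal horizontal-packing sketch, where the parameter-file names and `run_one()` are hypothetical placeholders:

```python
# Horizontal packing: turn thousands of tiny tasks into tens of larger jobs,
# each job processing one whole chunk of the task list.
import sys

def chunk(tasks: list[str], n_chunks: int) -> list[list[str]]:
    """Split the task list into n_chunks roughly equal groups."""
    return [tasks[i::n_chunks] for i in range(n_chunks)]

def run_one(task: str) -> None:
    """Placeholder for the real per-task work."""
    print("processing", task)

if __name__ == "__main__":
    chunk_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0   # e.g. a Slurm array index
    n_chunks = 20                                             # tens of jobs, not thousands
    tasks = [f"params_{i:04d}.txt" for i in range(5000)]      # hypothetical task list
    for task in chunk(tasks, n_chunks)[chunk_id]:
        run_one(task)
```

With 5000 tasks and 20 chunks, each job handles 250 tasks, which keeps the job count within the "tens of jobs" guideline above.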
2 changes: 1 addition & 1 deletion materials/job_types.md
@@ -97,7 +97,7 @@ Submitting an array job of 100 members counts the same as 100 individual jobs fr

## GPU jobs

A graphics processing unit (GPU, a video card) is capable of doing certain types of simultaneous calculations very efficiently. In order to take advantage of this power, a computer program must be reprogrammed to adapt to how a GPU handles data. For spatial computations on the GPU, check out for example [RAPIDS cuSpatial](https://docs.rapids.ai/api/cuspatial/stable/user_guide/cuspatial_api_examples/). [CSC's GPU resources](https://docs.csc.fi/computing/overview/#gpu-nodes) are relatively scarce and hence should be used with particular care. A GPU uses 60 times more billing units than a single CPU core. In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti.
A GPU is capable of doing certain types of simultaneous calculations very efficiently. In order to take advantage of this power, a computer program must be programmed to adapt to how a GPU handles data. For spatial computations on the GPU, check out for example [RAPIDS cuSpatial](https://docs.rapids.ai/api/cuspatial/stable/user_guide/cuspatial_api_examples/). [CSC's GPU resources](https://docs.csc.fi/computing/overview/#gpu-nodes) are relatively scarce and hence should be used with particular care. A GPU uses 60 times more billing units than a single CPU core. In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti.
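
As a rough worked example of the billing statement above (a sketch assuming a simple additive model in which one CPU core costs 1 billing unit per hour and one GPU 60 units per hour, and ignoring any memory or local-disk billing):

```python
# Rough billing-unit comparison for a hypothetical 10-hour job, assuming
# 1 unit per CPU-core-hour and 60 units per GPU-hour (other factors ignored).
def billing_units(hours: float, cpu_cores: int, gpus: int = 0) -> float:
    return hours * (cpu_cores + 60 * gpus)

print(billing_units(10, cpu_cores=10, gpus=1))  # 700.0 units for 1 GPU + 10 cores
print(billing_units(10, cpu_cores=40))          # 400.0 units for a 40-core CPU job
```

The numbers make the trade-off concrete: the GPU only pays off if the code actually runs several times faster on it than on a few tens of CPU cores.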

:::{admonition} More advanced topics - GPU
:class: dropdown, seealso
