move too many jobs box to htc, small fixes KE
samumantha committed Oct 11, 2023
1 parent a6db427 commit 4413fb5
Showing 3 changed files with 51 additions and 50 deletions.
54 changes: 5 additions & 49 deletions materials/fair_share.md
@@ -40,66 +40,22 @@ SLURM job allocations
- At CSC, the priority is configured to use "fair share"
- The _initial_ priority of a job _decreases_ if the user has recently run lots of jobs
- Over time (while queueing) its priority _increases_ and eventually it will run
- Some queues have a lower priority (e.g. _longrun_ -- use a shorter queue if you can!)
- In general, always use the shortest queue/smallest partition possible!
- See our documentation for more information on [Getting started with running batch jobs on Puhti/Mahti](https://docs.csc.fi/computing/running/getting-started/) and [LUMI](https://docs.lumi-supercomputer.eu/runjobs/).

:::{admonition} How many resources to request?
:class: seealso

* You can use your workstation / laptop as a base measuring stick: If the code runs on your machine, as a first guess you can reserve the same number of CPUs & the same amount of memory as your machine has. Before reserving multiple CPUs, check if your code can make use of them.
* You can also check more closely what resources are used with `top` when running on your machine
* You can also check more closely what resources are used with `top` on Mac and Linux or `task manager` on Windows when running on your machine
* Similarly for running time: if you have run it on your machine, you should reserve a similar amount of time on the cluster.
* If your program does the same thing more than once, you can estimate that the total run time is T ≈ n_steps ⋅ t_step, where t_step is the time taken by each step.
* Likewise, if your program runs over multiple parameter sets, the total time needed is T_total ≈ n_parameters ⋅ T_single, where T_single is the time needed to run the program with one set of parameters.
* If your program does the same thing more than once, you can estimate that the `total run time is the number of steps times the time taken by each step`.
* Likewise, if your program runs over multiple parameter sets, the `total time needed is the number of parameter sets times the time needed to run the program with one set of parameters`.
* You can also run a smaller version of the problem and try to estimate how the program will scale when you make the problem bigger (a sketch of this kind of estimate follows after this box).
* You should always monitor your jobs to find out what resources they actually used compared to what you requested (`seff jobid`).
* You should always monitor your jobs to find out what resources they actually used compared to what you requested.

Adapted from [Aalto Scientific Computing](https://scicomp.aalto.fi/triton/usage/program-size/)
:::
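
A rough sketch of the estimation advice above, using only the relation "total time ≈ number of steps × time per step"; the step time, step count, and safety margin are illustrative assumptions, not measured values:

```python
# Back-of-the-envelope estimate of how much wall time to request,
# based on timing a single step on your own machine.
def estimate_total_hours(n_steps: int, hours_per_step: float, safety_factor: float = 1.5) -> float:
    """Total time ~ number of steps * time per step, padded with a safety margin."""
    return n_steps * hours_per_step * safety_factor

# Example: one step took ~6 minutes (0.1 h) on a laptop, 200 parameter sets are needed.
raw_estimate = estimate_total_hours(200, 0.1, safety_factor=1.0)   # 20 h
request = estimate_total_hours(200, 0.1)                           # 30 h with margin
print(f"raw estimate: {raw_estimate:.0f} h, time to request: {request:.0f} h")
```

Requesting somewhat more than the raw estimate protects the job against hitting the time limit, while requesting far more than needed makes the job harder to schedule.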

:::{admonition} How many jobs is too many?
:class: seealso, dropdown

We mention in documentation and guidelines that users shouldn’t send too many jobs, but how many is too many?
Unfortunately it’s impossible to give any exact numbers because both Slurm and Lustre are shared resources.
* It’s only possible to give firm limits for the global usage of the system as a whole, not for a single user.
* When the total load of the system is low, it may be OK to run something that would be problematic when the system is full.

**How many jobs/steps is too many?**

* SHOULD BE OK to run tens of jobs/steps
* PAY ATTENTION if you run hundreds of jobs/steps
* DON’T RUN several thousands of jobs

**How many file operations is too many?**

* SHOULD BE OK to access hundreds of files
* PAY ATTENTION if you need several thousands of files
* DON’T USE hundreds of thousands of files

Note that these guideline numbers are for all operations across all of your jobs combined.

**I have lots of small files**

* Check the tool that you are using
  * There may be different options for data storage
* Tar/untar and compress your datasets.
* Use local disk (NVMe on Puhti, ramdisk on Mahti).
* Remove intermediate files if possible.
* Use squashfs for read-only datasets and containers.

**I have lots of small tasks for SLURM**

* Check the tool that you are using
  * There may already be support for running multiple tasks in a single job
* Regroup your tasks and execute a larger group of tasks in a single job/step.
  * Manual or automatic (if the feature is present in your tool)
  * Horizontal and vertical packing
  * Tradeoff (redundancy, parallelism, utilization)
* Do a larger job and use another scheduler within it (HyperQueue, Flux).
  * Integration for Nextflow and Snakemake already exists
* CSC has some tools for farming-type jobs
* Not all or nothing

:::

45 changes: 45 additions & 0 deletions materials/htc.md
@@ -118,3 +118,48 @@ Many tools available:
Do you need to run a lot of steps one after another? Or a few steps that need a lot of memory? Do the steps depend on each other? Which steps could be run in parallel? Which steps cannot be run in parallel?

:::

:::{admonition} How many jobs is too many?
:class: seealso, dropdown

We mention in documentation and guidelines that users shouldn’t send too many jobs, but how many is too many?

Unfortunately it’s impossible to give any exact numbers because both Slurm and Lustre are shared resources.
* It’s only possible to give firm limits for the global usage of the system as a whole, not for a single user.
* When the total load of the system is low, it may be OK to run something that would be problematic when the system is full.

**How many jobs is too many?**

* SHOULD BE OK to run tens of jobs
* PAY ATTENTION if you run hundreds of jobs
* DON’T RUN several thousands of jobs

**How many file operations is too many?**

* SHOULD BE OK to access hundreds of files
* PAY ATTENTION if you need several thousands of files
* DON’T USE hundreds of thousands of files

Note that these guideline numbers are for all operations across all of your jobs combined.

**I have lots of small files**

* Check the tool that you are using
  * There may be different options for data storage
* Tar/untar and compress your datasets (a sketch follows after this box).
* Use local disk (NVMe on Puhti, ramdisk on Mahti).
* Remove intermediate files if possible.
* Use squashfs for read-only datasets and containers.

**I have lots of small tasks for Slurm**

* Regroup your tasks and execute a larger group of tasks in a single job (a packing sketch follows after this box).
  * Manual or automatic (if the feature is present in your tool)
  * Horizontal and vertical packing
  * Tradeoff (redundancy, parallelism, utilization)
* Do a larger job and use another scheduler within it (HyperQueue, Flux).
  * Integration for Nextflow and Snakemake already exists
* CSC has some tools for farming-type jobs
* Not all or nothing

:::
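
As a hedged illustration of the "lots of small files" advice in the box above, the sketch below packs a directory of many small files into one compressed archive and later unpacks it onto node-local disk inside a job. The paths, the `my_dataset` directory, and the fallback to `/tmp` are assumptions for illustration; `LOCAL_SCRATCH` is only available when the job has requested local storage, so check the CSC documentation for the exact setup on your system.

```python
# Pack many small files into a single archive so Lustre sees one big file,
# then extract it onto fast node-local disk inside the job.
# All paths below are illustrative assumptions.
import os
import tarfile

def pack(src_dir: str, archive: str) -> None:
    """Create one compressed archive from a directory full of small files."""
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(src_dir, arcname=os.path.basename(src_dir))

def unpack_to_local(archive: str, local_dir: str) -> str:
    """Extract the archive under a node-local directory and return that path."""
    os.makedirs(local_dir, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(local_dir)
    return local_dir

if __name__ == "__main__":
    pack("my_dataset", "my_dataset.tar.gz")
    # Assumption: inside a job that requested local storage, LOCAL_SCRATCH points
    # at node-local NVMe on Puhti; fall back to /tmp when testing elsewhere.
    target = os.environ.get("LOCAL_SCRATCH", "/tmp")
    print("data available under", unpack_to_local("my_dataset.tar.gz", target))
```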
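
The "regroup your tasks" advice can also be applied by hand with a few lines of code: instead of one Slurm job per task, split the task list into a small number of chunks and let each job work through a whole chunk. A minimal horizontal-packing sketch, where the parameter-file names and `run_one()` are hypothetical placeholders:

```python
# Horizontal packing: turn thousands of tiny tasks into tens of larger jobs,
# each job processing one whole chunk of the task list.
import sys

def chunk(tasks: list[str], n_chunks: int) -> list[list[str]]:
    """Split the task list into n_chunks roughly equal groups."""
    return [tasks[i::n_chunks] for i in range(n_chunks)]

def run_one(task: str) -> None:
    """Placeholder for the real per-task work."""
    print("processing", task)

if __name__ == "__main__":
    chunk_id = int(sys.argv[1]) if len(sys.argv) > 1 else 0   # e.g. a Slurm array index
    n_chunks = 20                                             # tens of jobs, not thousands
    tasks = [f"params_{i:04d}.txt" for i in range(5000)]      # hypothetical task list
    for task in chunk(tasks, n_chunks)[chunk_id]:
        run_one(task)
```

With 5000 tasks and 20 chunks, each job handles 250 tasks, which keeps the job count within the "tens of jobs" guideline above.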
2 changes: 1 addition & 1 deletion materials/job_types.md
@@ -97,7 +97,7 @@ Submitting an array job of 100 members counts the same as 100 individual jobs fr

## GPU jobs

A graphics processing unit (GPU, a video card) is capable of doing certain types of simultaneous calculations very efficiently. In order to take advantage of this power, a computer program must be reprogrammed to adapt to how a GPU handles data. For spatial computations on the GPU, check out for example [RAPIDS cuSpatial](https://docs.rapids.ai/api/cuspatial/stable/user_guide/cuspatial_api_examples/). [CSC's GPU resources](https://docs.csc.fi/computing/overview/#gpu-nodes) are relatively scarce and hence should be used with particular care. A GPU uses 60 times more billing units than a single CPU core. In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti.
A GPU is capable of doing certain types of simultaneous calculations very efficiently. In order to take advantage of this power, a computer program must be programmed to adapt to how a GPU handles data. For spatial computations on the GPU, check out for example [RAPIDS cuSpatial](https://docs.rapids.ai/api/cuspatial/stable/user_guide/cuspatial_api_examples/). [CSC's GPU resources](https://docs.csc.fi/computing/overview/#gpu-nodes) are relatively scarce and hence should be used with particular care. A GPU uses 60 times more billing units than a single CPU core. In practice, 1-10 CPU cores (but not more) should be allocated per GPU on Puhti.
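
As a rough worked example of the billing statement above (a sketch assuming a simple additive model in which one CPU core costs 1 billing unit per hour and one GPU 60 units per hour, and ignoring any memory or local-disk billing):

```python
# Rough billing-unit comparison for a hypothetical 10-hour job, assuming
# 1 unit per CPU-core-hour and 60 units per GPU-hour (other factors ignored).
def billing_units(hours: float, cpu_cores: int, gpus: int = 0) -> float:
    return hours * (cpu_cores + 60 * gpus)

print(billing_units(10, cpu_cores=10, gpus=1))  # 700.0 units for 1 GPU + 10 cores
print(billing_units(10, cpu_cores=40))          # 400.0 units for a 40-core CPU job
```

The numbers make the trade-off concrete: the GPU only pays off if the code actually runs several times faster on it than on a few tens of CPU cores.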

:::{admonition} More advanced topics - GPU
:class: dropdown, seealso
