Upload hands-on activity for distributed compute and multiprocessing (#16)
* Initial commit

* scripts

* HPC script

* HPC script

* HPC troubleshooting

* HPC script working

* combining files

* combining files

* alpine tested

* almost final

* parallel compute graph

* upload cheatsheet

* update the README

* Update lectures/10.HPC_and_parallel_compute/README.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/scripts/analyze_sequences.py

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/SLURM_cheatsheet.md

Co-authored-by: Gregory Way <[email protected]>

* Update lectures/10.HPC_and_parallel_compute/scripts/analyze_sequences.py

Co-authored-by: Gregory Way <[email protected]>

* updates lecture 10 files

* course files update

* add python script to utils

* HPC run

* add the data

* run all analyses

* update activity

---------

Co-authored-by: Gregory Way <[email protected]>
MikeLippincott and gwaybio authored Nov 7, 2024
1 parent 4eacc26 commit 84e6f3b
Showing 22 changed files with 865 additions and 0 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -5,4 +5,5 @@
*.Rproj
.Rhistory
.RData
*__pycache__/
*.snakemake
49 changes: 49 additions & 0 deletions .pre-commit-config.yaml
@@ -0,0 +1,49 @@
repos:
- repo: https://gitlab.com/vojko.pribudic.foss/pre-commit-update
rev: v0.6.0post1 # Insert the latest tag here
hooks:
- id: pre-commit-update
args: [--exclude, black, --keep, isort]
# Formats import order
- repo: https://github.com/pycqa/isort
rev: 5.12.0
hooks:
- id: isort
name: isort (python)
args: ["--profile", "black", "--filter-files"]

# Code formatter for both Python files and Jupyter notebooks
- repo: https://github.com/psf/black
rev: 22.10.0
hooks:
- id: black-jupyter
- id: black
language_version: python3.10

- repo: https://github.com/nbQA-dev/nbQA
rev: 1.9.0
hooks:
- id: nbqa-isort
additional_dependencies: [isort==5.6.4]
args: [--profile=black]


# remove unused imports
- repo: https://github.com/hadialqattan/pycln.git
rev: v2.4.0
hooks:
- id: pycln

# additional hooks found within the pre-commit library
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v5.0.0
hooks:
- id: trailing-whitespace # removes trailing whitespace
- id: mixed-line-ending # fixes mixed line endings
args:
- --fix=lf
- id: pretty-format-json # JSON Formatter
args:
- --autofix
- --indent=4
- --no-sort-keys
26 changes: 26 additions & 0 deletions lectures/10.hpc_and_parallel_compute/README.md
@@ -0,0 +1,26 @@
# Lecture 10: High performance computing and parallel computing

This lecture will cover parallel computing and high performance computing.
We have the following learning objectives:
1. Familiarize yourself with the concept of parallel computing
2. Understand how to leverage parallel computing
3. Learn about high performance computing
4. Understand how to leverage high performance computing
5. Learn how to use HPC resources and best practices

We will be using some pre-written scripts to explore parallel computing and high performance computing.
The following scripts are available in the [scripts](./scripts) directory:
* [analyze_sequences](scripts/analyze_sequences.py)
* This script contains the core sequence analysis function that we use to analyze sequences.
Note that this script can be run on a single sequence in a serial fashion, but we will also call it from the parallelized workflows.
* [multiprocessing_run](scripts/multiprocessing_run.sh)
* This shell script runs the analysis in parallel; it calls the `multiprocessing_sequence_analysis.py` script below, which uses Python's `multiprocessing` module (a minimal sketch of this pattern follows the list).
* [multiprocessing_sequence_analysis](scripts/multiprocessing_sequence_analysis.py)
* The script is called by the `multiprocessing_run.sh` script.
* [plot_parallel_compute_analysis](scripts/plot_parallel_compute_analysis.py)
* This script plots the results of the parallel computing analysis.
* [serial_run](scripts/serial_run.sh)
* This script runs the `analyze_sequences.py` script in serial.
* [submit_jobs_HPC](scripts/submit_jobs_HPC.sh)
* This script submits jobs to the HPC cluster in an array job.
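
The multiprocessing scripts follow a common pattern: a per-sequence analysis function is mapped across a pool of worker processes. The sketch below illustrates that pattern with Python's `multiprocessing` module; it is a minimal, hypothetical example (the `analyze_sequence` function and the hard-coded sequence list are placeholders), not the exact code in `multiprocessing_sequence_analysis.py`.

```python
import multiprocessing as mp


def analyze_sequence(sequence: str) -> int:
    """Placeholder per-sequence analysis: count 5mC bases (marked 'X')."""
    return sequence.count("X")


if __name__ == "__main__":
    # Placeholder input list for illustration
    sequences = ["GCXCCXAGGGTTGCAGTCAAATGTCC", "ACTTCTATGTCTXTGGATTACAAACA"]
    # Map the analysis function across 4 worker processes
    with mp.Pool(processes=4) as pool:
        counts = pool.map(analyze_sequence, sequences)
    for sequence, count in zip(sequences, counts):
        print(f"{sequence}: {count} 5mC bases")
```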
79 changes: 79 additions & 0 deletions lectures/10.hpc_and_parallel_compute/SLURM_cheatsheet.md
@@ -0,0 +1,79 @@
# Slurm Guide

For bash scripts, this line should be the first line of every script:
```
#!/bin/bash # Shebang slash bin slash bash
```

Next are the SBATCH directives that tell the Slurm scheduler how to handle your job.
These directives should be at the top of your script, but under the shebang line.

### Frequent SLURM directives
```
#SBATCH --job-name=parallel_job # job name
#SBATCH -t 1-23:59:59 # D-HH:MM:SS
#SBATCH -t 59 # MM
#SBATCH -t 59:59 # MM:SS
#SBATCH -t 59:59:59 # HH:MM:SS
#SBATCH -t 1-23 # D-HH
#SBATCH -t 1-23:59 # D-HH:MM
#SBATCH --mem=16G # 16 Gigabytes
#SBATCH --output=out_%j.log
#SBATCH --ntasks=1 # number of tasks
#SBATCH --mail-type=ALL # email events: NONE, BEGIN, END, FAIL, or ALL
#SBATCH [email protected]
```
### Slurm Commands
#### Environment modules
```
module purge # removes all modules
module avail # lists all modules available for loading
module list # list all currently loaded modules
module load {module} # loads a module (hint: use the tab key to autocomplete)
```
#### Submitting a job
```
sbatch script.sh # submit script.sh
```
#### Checking job status
```
squeue -u {User} # check submitted jobs in queue
```
#### Canceling a job or all jobs
```
scancel {jobid} # Cancel job
scancel -u {User} # Cancel all jobs for user
```
#### Check job details
```
jobstats $USER {days} # Check job stats for the user over the last {days} days
```
#### Check job efficiency
```
seff {jobid}
```
#### Check fairshare
```
module load slurmtools
levelfs $USER
```
#### Check user and institution account billings
```
suuser $USER
suacct amc-general
```

#### Example SBATCH
```
#!/bin/bash
#SBATCH --job-name=Slurm_job # job name "Slurm_job"
#SBATCH -t 1-23 # Time 1 day, 23 hours
#SBATCH --mem=16G # 16 Gigabytes of RAM
#SBATCH --output=out_%j.log # std output/error file
#SBATCH --mail-type=END,FAIL # send email on job end/fail
#SBATCH [email protected] # send email to this address
module load python/3.9.6
module list
```
@@ -0,0 +1,111 @@
sequences,time_per_sequence(s),core_count
10,0.5,1
100,0.5,1
1000,0.5,1
10000,0.5,1
100000,0.5,1
1000000,0.5,1
10000000,0.5,1
100000000,0.5,1
1000000000,0.5,1
10000000000,0.5,1
10,0.5,2
100,0.5,2
1000,0.5,2
10000,0.5,2
100000,0.5,2
1000000,0.5,2
10000000,0.5,2
100000000,0.5,2
1000000000,0.5,2
10000000000,0.5,2
10,0.5,4
100,0.5,4
1000,0.5,4
10000,0.5,4
100000,0.5,4
1000000,0.5,4
10000000,0.5,4
100000000,0.5,4
1000000000,0.5,4
10000000000,0.5,4
10,0.5,8
100,0.5,8
1000,0.5,8
10000,0.5,8
100000,0.5,8
1000000,0.5,8
10000000,0.5,8
100000000,0.5,8
1000000000,0.5,8
10000000000,0.5,8
10,0.5,16
100,0.5,16
1000,0.5,16
10000,0.5,16
100000,0.5,16
1000000,0.5,16
10000000,0.5,16
100000000,0.5,16
1000000000,0.5,16
10000000000,0.5,16
10,0.5,32
100,0.5,32
1000,0.5,32
10000,0.5,32
100000,0.5,32
1000000,0.5,32
10000000,0.5,32
100000000,0.5,32
1000000000,0.5,32
10000000000,0.5,32
10,0.5,64
100,0.5,64
1000,0.5,64
10000,0.5,64
100000,0.5,64
1000000,0.5,64
10000000,0.5,64
100000000,0.5,64
1000000000,0.5,64
10000000000,0.5,64
10,0.5,128
100,0.5,128
1000,0.5,128
10000,0.5,128
100000,0.5,128
1000000,0.5,128
10000000,0.5,128
100000000,0.5,128
1000000000,0.5,128
10000000000,0.5,128
10,0.5,256
100,0.5,256
1000,0.5,256
10000,0.5,256
100000,0.5,256
1000000,0.5,256
10000000,0.5,256
100000000,0.5,256
1000000000,0.5,256
10000000000,0.5,256
10,0.5,512
100,0.5,512
1000,0.5,512
10000,0.5,512
100000,0.5,512
1000000,0.5,512
10000000,0.5,512
100000000,0.5,512
1000000000,0.5,512
10000000000,0.5,512
10,0.5,1024
100,0.5,1024
1000,0.5,1024
10000,0.5,1024
100000,0.5,1024
1000000,0.5,1024
10000000,0.5,1024
100000000,0.5,1024
1000000000,0.5,1024
10000000000,0.5,1024
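
The table above is a parameter grid: the number of sequences, a fixed per-sequence processing time, and the core count. Under an idealized, embarrassingly parallel assumption, the total runtime is roughly `sequences * time_per_sequence / core_count`. The sketch below shows how such a grid could be turned into a runtime-versus-cores plot with pandas and matplotlib (both in the lecture environment); it is a hypothetical illustration rather than the repository's `plot_parallel_compute_analysis.py`, and the CSV path is a placeholder.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder path: point this at the parameter grid CSV shown above
df = pd.read_csv("parallel_compute_grid.csv")

# Idealized estimate: perfect scaling with no parallel overhead
df["estimated_runtime_s"] = (
    df["sequences"] * df["time_per_sequence(s)"] / df["core_count"]
)

fig, ax = plt.subplots()
for n_sequences, group in df.groupby("sequences"):
    ax.plot(
        group["core_count"],
        group["estimated_runtime_s"],
        marker="o",
        label=f"{int(n_sequences):,} sequences",
    )
ax.set_xscale("log", base=2)
ax.set_yscale("log")
ax.set_xlabel("Core count")
ax.set_ylabel("Estimated runtime (s)")
ax.legend(fontsize="small")
fig.savefig("estimated_runtime_vs_cores.png", dpi=150)
```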
10 changes: 10 additions & 0 deletions lectures/10.hpc_and_parallel_compute/data/sequences_to_analyze.txt
@@ -0,0 +1,10 @@
GCXCCXAGGGTTGCAGTCAAATGTCCA
CGGCCAATGAGGGXCGCXTAGGTCAT
TAGGTGGATACCXCTXATATATGATT
CCXATATTAAGACATATAATTGGAGG
TATTACACGCCCAAATAATTTGGCXA
TCAGCXGCXGGGAAGCGGGCGCXATACT
CGGATGATCATCXGGGATGATGTCTA
GCGCCXGGAAGACGAATCTTAATTA
TTAGGAACXTXXCAATATGTTTCGGT
ACTTCTATGTCTXTGGATTACAAACA
10 changes: 10 additions & 0 deletions lectures/10.hpc_and_parallel_compute/environments/README.md
@@ -0,0 +1,10 @@
# Environment creation
We need to create the conda environment needed for this lecture and hands-on activity.
To do so, run one of the following commands from this directory:
```bash
conda env create -f parallel_and_hpc_compute_env.yaml
```
OR
```bash
mamba env create -f parallel_and_hpc_compute_env.yaml
```
@@ -0,0 +1,16 @@
name: parallel_and_hpc_compute_env
channels:
- conda-forge
- defaults
dependencies:
- python=3.11
- conda-forge::pandas
- conda-forge::jupyter
- conda-forge::ipykernel
- conda-forge::nbconvert
- conda-forge::pip
- conda-forge::matplotlib
- conda-forge::seaborn
- pip:
- argparse

@@ -0,0 +1,24 @@
# Hands-on: 5mC sequence analysis activity

You want to identify the 5mC (5-methylcytosine) content in each of 10 sequences, where X represents 5mC and C represents unmethylated cytosine.
The goal is to count the number of 5mC bases in each sequence using multiple compute approaches.
For the sequences below, count the number of 5mC bases using each of the following approaches (a minimal serial sketch follows the sequence list):
* Serial approach
* Parallel approach
* Python multiprocessing approach
* GNU parallel approach
* HPC approach

Sequences:
0. GCXCCXAGGGTTGCAGTCAAATGTCC
1. ACTTCTATGTCTXTGGATTACAAACA
2. CGGCCAATGAGGGXCGCXTAGGTCAT
3. TAGGTGGATACCXCTXATATATGATT
4. CCXATATTAAGACATATAATTGGAGG
5. TATTACACGCCCAAATAATTTGGCXA
6. TCAGCXGCXGGGAAGCGGGCGCXATA
7. CGGATGATCATCXGGGATGATGTCTA
8. GCGCCXGGAAGACGAATCTTAATTAX
9. TTAGGAACXTXXCAATATGTTTCGGT
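
As a minimal starting point for the serial approach, the sketch below counts the 5mC bases (the `X` characters) in each sequence one at a time. It is a hypothetical illustration rather than the provided `analyze_sequences.py`, and the hard-coded list stands in for reading `data/sequences_to_analyze.txt`.

```python
# Serial baseline: count 5mC ('X') in each sequence, one sequence at a time
sequences = [
    "GCXCCXAGGGTTGCAGTCAAATGTCC",
    "ACTTCTATGTCTXTGGATTACAAACA",
    "CGGCCAATGAGGGXCGCXTAGGTCAT",
]

for index, sequence in enumerate(sequences):
    five_mc_count = sequence.count("X")
    print(f"Sequence {index}: {five_mc_count} 5mC bases")
```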
