From d7e97714f1064087edf5dc1751737a3f39ecee4e Mon Sep 17 00:00:00 2001
From: linsalrob
Date: Sat, 9 Nov 2024 11:54:57 +1030
Subject: [PATCH] slurmifying

---
 Workshops/COMBINE_WA_2024.md | 303 ++++++++++++++++++++++++++++-------
 1 file changed, 244 insertions(+), 59 deletions(-)

diff --git a/Workshops/COMBINE_WA_2024.md b/Workshops/COMBINE_WA_2024.md
index 8f606f4..c934f4a 100644
--- a/Workshops/COMBINE_WA_2024.md
+++ b/Workshops/COMBINE_WA_2024.md
@@ -22,8 +22,8 @@ Time | Topic
1100-1200 | Downloading data and filtering human genomes using minimap2
1200-1300 | Lunch
1300-1330 | Introduction to methods for identifying species GTDB, SingleM
-1330-1400 | Hands-on with SingleM
-1400-1500 | Introduction to binning
+1330-1430 | Hands-on with SingleM
+1430-1500 | Introduction to binning
1500-1530 | Afternoon tea
1530-1600 | Microbial binning
1600-1700 | Hands on with binning
@@ -56,20 +56,8 @@ If you are using a MS Windows machine, please download and install [MobaXterm](h

We are going to jump right in with metagenomics, but [here is a brief introduction](https://linsalrob.github.io/ComputationalGenomicsManual/Metagenomics/) if you want to read something while Rob is talking.

-We have created servers for you with all the software and data that you will need for these excercises.
+We have created accounts for you on Pawsey, and we will share the usernames and passwords with you at the workshop. *These are temporary accounts and will be deleted at the end of the workshop.*

-Here are some machines that you can use, if you don't have access to a server:
-
-```
-IP Addresses:
-1:
-2:
-3:
-4:
-5:
-6:
-7:
-```

# Learning BASH

Now, type `./Ponylinux.sh` and press `enter` (or `return`).

Our first exercise is installing software using mamba.

+Before we begin, we are going to make life slightly easier for ourselves by making a symbolic link (a shortcut, or alias, for a path):
+
+```
+ln -s /software/projects/courses01/$USER software
+```
+
+This creates a link called `software` in your home directory that points to `/software/projects/courses01/$USER`.
+
+*Important*: When you install mamba, it will ask you for a location. Use `/home/$USER/software/miniforge3` as the location.

Install `conda`, `fastp`, `minimap2`, and `samtools` using [conda](../Conda/).

We will use all of these programs today.

+You can check that they installed by using the command:
+
+```
+which fastp
+```
+
+If that works, it will tell you!
+

# Downloading Data

-All the data we are going to use in the workshop is present on the servers in `/storage/data/cf_data`
+We are going to use the CF data that Rob talked about. To start, we are just going to download two files, an R1 and an R2 file, to work with:
+
+```
+mkdir fastq
+cd fastq
+curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/refs/heads/master/Datasets/CF/788707_20180129_S_R1.fastq.gz
+curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/refs/heads/master/Datasets/CF/788707_20180129_S_R2.fastq.gz
+cd
+ls
+```

# Use `fastp` to trim bad sequences and remove the adapters.

@@ -110,11 +124,23 @@ We are going to use the [Illumina Adapters](https://github.com/linsalrob/Computa

Do you remember how to download the Illumina Adapters?
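If you need a reminder, the same `curl -LO` pattern we used for the fastq files above will work here too — for example, with the adapter URL given just below:

```
curl -LO https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/SequenceQC/IlluminaAdapters.fa
```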
The URL is `https://github.com/linsalrob/ComputationalGenomicsManual/raw/master/SequenceQC/IlluminaAdapters.fa`

-Once you have downloaded the adapters, we can use this command:
+Once you have downloaded the adapters, we are going to make a slurm script to run the command on the cluster.
+
+Use `nano` (or `vi` or `emacs`) to edit a file, and copy this text:

```bash
+#!/bin/bash
+#SBATCH --job-name=fastp
+#SBATCH -o fastp-%j.out
+#SBATCH -e fastp-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=1
+#SBATCH --mem=8GB
+
mkdir fastp
-fastp -n 1 -l 100 -i /storage/data/cf_data/reads/788707_20180129_S_R1.fastq.gz -I /storage/data/cf_data/reads/788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
+fastp -n 1 -l 100 -i fastq/788707_20180129_S_R1.fastq.gz -I fastq/788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
```

+Save the script (for example, as `fastp.slurm`) and submit it to the queue with `sbatch fastp.slurm`; you can check on its progress with `squeue -u $USER`.
+
When `fastp` runs, you will get an HTML output file called [fastp.html](fastp_788707_20180129.html). This shows some statistics about the run.

@@ -169,21 +195,32 @@ For this work, we are going to use [GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_a

-We have made this data available for you in `/storage/data/human`
+We have made this data available for you in `/scratch/courses01`

-## Install minimap and samtools
-This also installs the conda channels for you in the right order!
+
+## Use minimap2 and samtools to filter the human sequences
+
+We are going to make another slurm script, called `mapping.slurm`:
+
```
-conda config --add channels bioconda
-conda config --add channels conda-forge
-mamba create -n minimap2 minimap2 samtools
+nano mapping.slurm
```

-## Use minimap2 and samtools to filter the human sequences
-
+And copy and paste these contents:

```
+#!/bin/bash
+#SBATCH --job-name=mapping
+#SBATCH -o mapping-%j.out
+#SBATCH -e mapping-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
+
mkdir -p bam/
-minimap2 --split-prefix=tmp$$ -a -xsr /storage/data/human/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz | samtools view -bh | samtools sort -o bam/788707_20180129.bam
+minimap2 --split-prefix=tmp$$ -t 16 -a -xsr /scratch/courses01/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz \
+    | samtools view -bh | samtools sort -o bam/788707_20180129.bam
samtools index bam/788707_20180129.bam
```

Here is the [samtools specification](https://samtools.github.io/hts-specs/SAMv1.

Now, we use `samtools` flags to filter out the human and not human sequences.
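You can also decode flag values on the command line with `samtools flags` (a quick check; the exact output layout varies a little between samtools versions):

```
samtools flags 3588   # UNMAP,QCFAIL,DUP,SUPPLEMENTARY  (what -F 3588 excludes)
samtools flags 77     # PAIRED,UNMAP,MUNMAP,READ1       (what -f 77 requires)
```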
You can find out what the flags mean using the [samtools flag explainer](https://broadinstitute.github.io/picard/explain-flags.html)

-### human only sequences
+
+Again, edit a file, this time called `filtering.slurm`:

```
+nano filtering.slurm
+```
+
+And paste these contents:
+
+```
+#!/bin/bash
+#SBATCH --job-name=filtering
+#SBATCH -o filtering-%j.out
+#SBATCH -e filtering-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
mkdir human not_human
samtools fastq -F 3588 -f 65 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R1.fastq.gz
-echo "R2 matching human genome:"
samtools fastq -F 3588 -f 129 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R2.fastq.gz
-```

-### sequences that are not human
+# sequences that are not human

-```
samtools fastq -F 3584 -f 77 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R1.fastq.gz
samtools fastq -F 3584 -f 141 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R2.fastq.gz
-samtools fastq -f 4 -F 1 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_Singletons.fastq.gz
```

@@ -219,14 +269,174 @@ We can combine all of that into a single `snakemake` file, and it will do all of

See the [Snakemake](../Snakemake) section for details on how to run these two commands in a single pipeline.

+# Read-based annotations
+
+In metagenomics, there are two fundamental approaches: read-based annotation and assembly-based approaches. We are going to start with read-based annotation.
+
+
+
+# Predicting the species that are in the sample
+
+## SingleM
+
+Take a [look at the manual](https://wwood.github.io/singlem/) for detailed `singlem` instructions.
+
+
+Install SingleM with mamba. _Note:_ Here, we introduce named `mamba` environments. What is the advantage of creating a named environment?
+
+```
+mamba create -n singlem -c bioconda singlem
+mamba activate singlem
+```
+
+After installing SingleM, you will get a warning from krona. DO NOT run the `ktUpdate.sh` script. Instead, create a new symlink like so:
+
+```
+rm -rf /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy
+ln -s /scratch/courses01/krona/taxonomy /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy
+```
+
+Next, before you use SingleM, make sure you set this environment variable:
+
+```
+export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'
+```
+
+
+Now run SingleM on the CF data:
+
+```
+#!/bin/bash
+#SBATCH --job-name=singlem
+#SBATCH -o singlem-%j.out
+#SBATCH -e singlem-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
+eval "$(conda shell.bash hook)"
+conda activate singlem
+export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'
+singlem pipe -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -p output_profile.tsv --taxonomic-profile-krona krona.html --threads 16
+```
+
+## Unzip the data
+
+Note: Before we carry on, both `focus` and `super-focus` require that we unzip the data.
+
+```
+cd not_human
+find . -name \*gz -exec gunzip {} \;
+cd ..
+```
+
+
+## Focus
+
+Another way to identify the species present is to use [FOCUS](https://github.com/metageni/FOCUS).
+
+We create a mamba environment just for focus:
+
+```
+mamba create -n focus -c bioconda focus
+```
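+As an aside, this is one advantage of the named environments we asked about earlier: you can list them all and hop between them whenever you like (just an illustration, not a step the pipeline needs):
+
+```
+mamba env list          # shows base, singlem, focus, and any other environments you have made
+mamba activate focus    # switch into the focus environment
+```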
+
+Now, we need to unpack the database. Here is a trick, since we don't know exactly where the database is:
+
+```
+FOCUS=$(find software/miniforge3/envs/ -name db.zip -printf "%h\n")
+unzip $FOCUS/db.zip -d $FOCUS
+```
+
+This should create the directory `software/miniforge3/envs/focus/lib/python3.13/site-packages/focus_app/db/` with two files inside it.
+
+Now we can run focus on our data:
+
+```
+#!/bin/bash
+#SBATCH --job-name=focus
+#SBATCH -o focus-%j.out
+#SBATCH -e focus-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
+eval "$(conda shell.bash hook)"
+conda activate focus
+
+focus -q not_human/ -o focus -t 16
+```
+
+
+# SUPER-FOCUS
+
+We are going to assess the functions using [SUPER-FOCUS](https://github.com/metageni/SUPER-FOCUS).
+
+We are going to make _another_ mamba environment for super-focus:
+
+```
+mamba create -n superfocus -c bioconda super-focus mmseqs2
+```
+
+Now we can run super-focus on our data. _Note_: SUPER-FOCUS creates a _lot_ of data, and you will likely get an error if you just output the results to your home directory. In this command, we put the results on `/scratch` instead!
+
+```
+#!/bin/bash
+#SBATCH --job-name=superfocus
+#SBATCH -o superfocus-%j.out
+#SBATCH -e superfocus-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
+eval "$(conda shell.bash hook)"
+conda activate superfocus
+
+export SUPERFOCUS_DB=/scratch/courses01/superfocus/
+superfocus -q not_human/ -dir /scratch/courses01/$USER/superfocus -a mmseqs -t 16 -db DB_95
+```
+
+## Recompressing the files
+
+Now that we are done with `focus` and `superfocus`, we can recompress the files. _Question:_ Why should we compress the files (or not compress them)?
+
+```
+cd not_human
+find . -type f -exec gzip {} \;
+cd ..
+```
+
## Assembling the sequences

*Note:* Assembling _may_ take a while, and for the workshops, Rob has already assembled the sequences. We may, however, assemble some of them depending on computational resources!

-We will assemble with megahit:
+We will assemble with megahit.
+
+We need to create a mamba environment for megahit ... how are you going to do that?
+
```
-megahit -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o megahit_assembled/788707_20180129_S -t 8
+#!/bin/bash
+#SBATCH --job-name=megahit
+#SBATCH -o megahit-%j.out
+#SBATCH -e megahit-%j.err
+#SBATCH --account=courses01
+#SBATCH --time=1-0
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=16
+#SBATCH --mem=64GB
+
+eval "$(conda shell.bash hook)"
+conda activate megahit
+
+mkdir -p megahit_assembled/
+megahit -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o megahit_assembled/788707_20180129_S -t 16
```

This generates a contig file called `final.contigs.fa`.

@@ -298,28 +508,3 @@ We have created an [example Jupyter notebook](Workshop_MAG_demo.ipynb) so you ca

We are going to move the data to [Google Colab](https://colab.research.google.com/) to analyse the data and identify contigs that co-occur across multiple samples.

-# Predicting the species that are in the sample
-
-## SingleM
-
-take a [look at the manual](https://wwood.github.io/singlem/) for detailed `singlem` instructions.
-
-
-Install singleM with mamba:
-
-```
-mamba create -n singlem -c bioconda singlem
-mamba activate singlem
-```
-
-Before you use it, make sure you add this command:
-
-`export SINGLEM_METAPACKAGE_PATH=/storage/data/metapackage`
-
-
-Now run singleM on the CF data:
-
-```
-singlem pipe -1 /storage/data/cf_data/CF_Data_R1.fastq.gz -2 /storage/data/cf_data/CF_Data_R2.fastq.gz -p output_profile.tsv
-```
-