linsalrob committed Nov 9, 2024
1 parent e70df33 commit d7e9771
Showing 1 changed file with 244 additions and 59 deletions.
303 changes: 244 additions & 59 deletions Workshops/
@@ -22,8 +22,8 @@ Time | Topic
1100-1200 | Downloading data and filtering human genomes using minimap2
1200-1300 | Lunch
1300-1330 | Introduction to methods for identifying species GTDB, SingleM
1330-1400 | Hands-on with SingleM
1400-1500 | Introduction to binning
1330-1430 | Hands-on with SingleM
1430-1500 | Introduction to binning
1500-1530 | Afternoon tea
1530-1600 | Microbial binning
1600-1700 | Hands on with binning
@@ -56,20 +56,8 @@ If you are using a MS Windows machine, please download and install [MobaXterm](h
We are going to jump right in with metagenomics, but [here is a brief introduction]( if you want to read something while Rob is talking.

We have created servers for you with all the software and data that you will need for these exercises.
We have created accounts for you on pawsey, and we will share the usernames and passwords with you at the workshop. *These are temporary accounts and will be deleted at the end of the workshop*

Here are some machines that you can use, if you don't have access to a server:

IP Addresses:

# Learning BASH
@@ -90,15 +78,41 @@ Now, type `./` and press `enter` (or `return`).

Our first exercise is installing software using mamba.

Before we begin, we are going to make our lives slightly easier by making a symbolic link (a shortcut that points to another directory):

ln -s /software/projects/courses01/$USER software

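This will create a symbolic link called `software` in your current directory. It behaves like a directory, and points to `/software/projects/courses01/$USER`.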
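If symbolic links are new to you, the following minimal sketch shows what `ln -s` does, using a throwaway directory so that nothing in your account is touched. The names `real_software` and `software` here are illustrative only.

```shell
# Sketch: create a symbolic link in a temporary directory and inspect it.
# "real_software" and "software" are illustrative names, not workshop paths.
workdir=$(mktemp -d)
mkdir "$workdir/real_software"

# ln -s TARGET LINK_NAME: LINK_NAME becomes a pointer to TARGET
ln -s "$workdir/real_software" "$workdir/software"

# A file created through the link ends up in the real directory
touch "$workdir/software/hello.txt"
ls "$workdir/real_software"

# readlink shows where the link points
link_target=$(readlink "$workdir/software")
echo "software -> $link_target"
```

On the cluster, `ls -l software` should show the same kind of arrow, pointing at the project directory used in the `ln -s` command above.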

*Important*: When you install mamba, it will ask you for a location. Use `/home/$USER/software/miniforge3` as the location

Install `conda`, `fastp`, `minimap2`, `samtools` using [conda](../Conda/)

We will use all of these programs today.

You can check that they installed by using the command:

which fastp

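If the installation worked, this prints the full path to the `fastp` program. If it prints nothing (or an error), `fastp` is not installed or not on your `PATH`.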
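To check several programs at once, a small loop can do the same job. This is only a sketch: `check_tools` is a hypothetical helper name, and it uses the shell builtin `command -v`, which reports the path in much the same way as `which`.

```shell
# Sketch: report which of the required tools are on the PATH.
# check_tools is a hypothetical helper, not part of the workshop material.
check_tools() {
    missing=0
    for tool in "$@"; do
        if path=$(command -v "$tool"); then
            echo "found   $tool at $path"
        else
            echo "MISSING $tool"
            missing=$((missing + 1))
        fi
    done
    # the exit status is the number of missing tools (0 means all found)
    return "$missing"
}

check_tools fastp minimap2 samtools || echo "some tools are not installed yet"
```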

# Downloading Data

All the data we are going to use in the workshop is present on the servers in `/storage/data/cf_data`
We are going to use the CF data that Rob talked about. To start we are just going to download two files, an R1 and an R2 file to work with:

mkdir fastq
cd fastq
curl -LO
curl -LO

# Use `fastp` to trim bad sequences and remove the adapters.

@@ -110,11 +124,23 @@ We are going to use the [Illumina Adapters](

Do you remember how to download the Illumina Adapters? The URL is ``

Once you have downloaded the adapters, we can use this command:
Once you have downloaded the adapters, we are going to make a slurm script to run the command on the cluster

Use `nano` (or `vi` or `emacs`) to edit a file, and copy this text:

#!/bin/bash
#SBATCH --job-name=fastp
#SBATCH -o fastp-%j.out
#SBATCH -e fastp-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=8GB

mkdir fastp
fastp -n 1 -l 100 -i /storage/data/cf_data/reads/788707_20180129_S_R1.fastq.gz -I /storage/data/cf_data/reads/788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
fastp -n 1 -l 100 -i fastq/788707_20180129_S_R1.fastq.gz -I fastq/788707_20180129_S_R2.fastq.gz -o fastp/788707_20180129_S_R1.fastq.gz -O fastp/788707_20180129_S_R2.fastq.gz --adapter_fasta IlluminaAdapters.fa
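Once the script is saved, it has to be submitted to the queue. The sketch below assumes that you saved it as `fastp.slurm` (the file name is an assumption, use whatever name you chose) and that the standard Slurm commands `sbatch` and `squeue` are available on the cluster. `submit_job` is a hypothetical helper name.

```shell
# Sketch: submit a Slurm script and look at its place in the queue.
# submit_job is a hypothetical helper; fastp.slurm is an assumed file name.
submit_job() {
    script=$1
    if [ ! -f "$script" ]; then
        echo "no such script: $script" >&2
        return 1
    fi
    if ! command -v sbatch >/dev/null 2>&1; then
        echo "sbatch not found: are you logged in to the cluster?" >&2
        return 2
    fi
    # --parsable makes sbatch print just the job ID (possibly "id;cluster")
    jobid=$(sbatch --parsable "$script") || return 3
    jobid=${jobid%%;*}
    echo "submitted $script as job $jobid"
    # show this job in the queue (PD = pending, R = running)
    squeue -j "$jobid"
}

submit_job fastp.slurm || echo "job was not submitted"
```

Because of the `-o` and `-e` lines in the script, anything the job prints ends up in `fastp-<jobid>.out` and `fastp-<jobid>.err` in the directory you submitted from, so those are the first files to read if something goes wrong.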

When `fastp` runs, you will get an HTML output file called [fastp.html](fastp_788707_20180129.html). This shows some statistics about the run.
@@ -169,43 +195,67 @@ For this work, we are going to use [GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_a

We have made this data available for you in `/storage/data/human`

## Install minimap and samtools

This also installs the conda channels for you in the right order!
## Use minimap2 and samtools to filter the human sequences

We are going to make another slurm script, called `mapping.slurm`:

conda config --add channels bioconda
conda config --add channels conda-forge
mamba create -n minimap2 minimap2 samtools
nano mapping.slurm

## Use minimap2 and samtools to filter the human sequences

And copy and paste these contents:

#!/bin/bash
#SBATCH --job-name=mapping
#SBATCH -o mapping-%j.out
#SBATCH -e mapping-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
mkdir -p bam/
minimap2 --split-prefix=tmp$$ -a -xsr /storage/data/human/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz | samtools view -bh | samtools sort -o bam/788707_20180129.bam
minimap2 --split-prefix=tmp$$ -t 16 -a -xsr /scratch/courses01/GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna.gz fastp/788707_20180129_S_R1.fastq.gz fastp/788707_20180129_S_R2.fastq.gz \
| samtools view -bh | samtools sort -o bam/788707_20180129.bam
samtools index bam/788707_20180129.bam

Here is the [samtools specification](, and the description of the columns is on page 6.

Now, we use `samtools` flags to filter out the human and not human sequences. You can find out what the flags mean using the [samtools flag explainer](
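The numbers passed to `-f` and `-F` are sums of individual flag bits. As a rough illustration, the sketch below breaks a combined value into its bits. `explain_flag` is a hypothetical helper, and the short bit names are our own labels for the bits defined in the SAM specification.

```shell
# Sketch: list the SAM flag bits that make up a combined value.
# explain_flag is a hypothetical helper; the names are informal labels.
explain_flag() {
    value=$1
    names=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:read1 128:read2 \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}
        name=${pair#*:}
        # keep the name if this bit is set in the value
        if [ $((value & bit)) -ne 0 ]; then
            names="${names:+$names,}$name"
        fi
    done
    echo "$names"
}

explain_flag 65     # paired,read1
explain_flag 3588   # unmapped,qc_fail,duplicate,supplementary
explain_flag 77     # paired,unmapped,mate_unmapped,read1
explain_flag 141    # paired,unmapped,mate_unmapped,read2
```

Read this way, `-f 65 -F 3588` means "keep paired first-in-pair reads, and drop anything unmapped, QC-failed, duplicated, or supplementary", which is how the mapped (human) R1 reads are selected. If `samtools` is installed, `samtools flags 3588` gives a similar breakdown.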

### human only sequences

Again, edit a file, this time called `filtering.slurm`

nano filtering.slurm

And paste these contents:

#!/bin/bash
#SBATCH --job-name=filtering
#SBATCH -o filtering-%j.out
#SBATCH -e filtering-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
mkdir human not_human
samtools fastq -F 3588 -f 65 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R1.fastq.gz
echo "R2 matching human genome:"
samtools fastq -F 3588 -f 129 bam/788707_20180129.bam | gzip -c > human/788707_20180129_S_R2.fastq.gz
### sequences that are not human
# sequences that are not human
samtools fastq -F 3584 -f 77 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R1.fastq.gz
samtools fastq -F 3584 -f 141 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_R2.fastq.gz
samtools fastq -f 4 -F 1 bam/788707_20180129.bam | gzip -c > not_human/788707_20180129_S_Singletons.fastq.gz
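After the filtering job finishes, it is worth checking how many reads ended up in each file. One way to do that, sketched below, is to count the lines and divide by four. This assumes the usual four-line FASTQ records, and `count_reads` is a hypothetical helper name. The tiny demo file is made up so that the example runs anywhere.

```shell
# Sketch: count reads in a gzipped FASTQ file (assumes four lines per read).
# count_reads is a hypothetical helper name.
count_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}

# Tiny made-up FASTQ file with two reads, so the example is self-contained
demo=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nTTGA\n+\nIIII\n' | gzip -c > "$demo/demo.fastq.gz"
count_reads "$demo/demo.fastq.gz"   # prints 2
```

On the real data you would run, for example, `count_reads not_human/788707_20180129_S_R1.fastq.gz`. For the `not_human` files the R1 and R2 counts should match, because both commands require the read and its mate to be unmapped.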

@@ -219,14 +269,174 @@ We can combine all of that into a single `snakemake` file, and it will do all of
See the [Snakemake](../Snakemake) section for details on how to run these two commands in a single pipeline.

# Read based annotations

In metagenomics, there are two fundamental approaches: read-based annotations and assembly-based approaches. We are going to start with read-based annotations.

# Predicting the species that are in the sample

## SingleM

Take a [look at the manual]( for detailed `singlem` instructions.

Install SingleM with mamba. _Note:_ Here, we introduce named `mamba` environments. What is the advantage of creating a named environment?

mamba create -n singlem -c bioconda singlem
mamba activate singlem

After installing singlem, you will get a warning from krona. DO NOT run the `` script. Instead, create a new symlink like so:

rm -rf /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy
ln -s /scratch/courses01/krona/taxonomy /software/projects/courses01/$USER/miniforge3/envs/singlem/opt/krona/taxonomy

Next, before you use singlem, make sure you add this command:

export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'

Now run singleM on the CF data:

#!/bin/bash
#SBATCH --job-name=singlem
#SBATCH -o singlem-%j.out
#SBATCH -e singlem-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate singlem
export SINGLEM_METAPACKAGE_PATH='/scratch/courses01/singlem/S4.3.0.GTDB_r220.metapackage_20240523.smpkg.zb'
singlem pipe -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -p output_profile.tsv --taxonomic-profile-krona krona.html --threads 16

## Unzip the data

Note: Before we carry on, both `focus` and `super-focus` require that we unzip the data.

cd not_human
find . -name \*gz -exec gunzip {} \;
cd ..

## Focus

Another way to identify the species present is to use [FOCUS](

We create a mamba environment just for focus:

mamba create -n focus -c bioconda focus

Now, we need to unpack the database. Here is a trick, since we don't know exactly where the database is:

FOCUS=$(find software/miniforge3/envs/ -name -printf "%h\n")
unzip $FOCUS/ -d $FOCUS

This should create the directory `software/miniforge3/envs/focus/lib/python3.13/site-packages/focus_app/db/` with two files inside of it.

Now we can run focus on our data:

#!/bin/bash
#SBATCH --job-name=focus
#SBATCH -o focus-%j.out
#SBATCH -e focus-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate focus
focus -q not_human/ -o focus -t 16


We are going to assess the functions using [SUPER-FOCUS](

We are going to make _another_ mamba environment for super-focus:

mamba create -n superfocus -c bioconda super-focus mmseqs2

Now we can run super-focus on our data. _Note_: Superfocus creates a _lot_ of data, and you will likely get an error if you just output the results to your home directory. In this command, we put the results somewhere else!

#!/bin/bash
#SBATCH --job-name=superfocus
#SBATCH -o superfocus-%j.out
#SBATCH -e superfocus-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate superfocus
export SUPERFOCUS_DB=/scratch/courses01/superfocus/
superfocus -q not_human/ -dir /scratch/courses01/$USER/superfocus -a mmseqs -t 16 -db DB_95

## Recompressing the files.

Now that we are done with `focus` and `superfocus`, we can recompress the files. _Question:_ Why should we compress the files (or not compress them)?

cd not_human
find . -type f -exec gzip {} \;
cd ..

## Assembling the sequences

*Note:* Assembling _may_ take a while, and for the workshops, Rob has already assembled the sequences. We may, however, assemble some of them depending on computational resources!

We will assemble with megahit:
We will assemble with megahit.

We need to create a mamba environment for megahit ... how are you going to do that?

megahit -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o megahit_assembled/788707_20180129_S -t 8
#!/bin/bash
#SBATCH --job-name=megahit
#SBATCH -o megahit-%j.out
#SBATCH -e megahit-%j.err
#SBATCH --account=courses01
#SBATCH --time=1-0
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64GB
eval "$(conda shell.bash hook)"
conda activate megahit
mkdir -p megahit_assembled/
megahit -1 not_human/788707_20180129_S_R1.fastq.gz -2 not_human/788707_20180129_S_R2.fastq.gz -o megahit_assembled/788707_20180129_S -t 16

This generates a contig file called `final.contigs.fa`.
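A quick sanity check on an assembly is to count the contigs and the total number of bases. The sketch below does this with `awk` on a tiny made-up FASTA file. `contig_stats` is a hypothetical helper name, not part of megahit.

```shell
# Sketch: number of contigs and total bases in a FASTA file.
# contig_stats is a hypothetical helper name.
contig_stats() {
    awk '/^>/ { n++; next }
         { bases += length($0) }
         END { printf "%d contigs, %d bp\n", n, bases }' "$1"
}

# Tiny made-up FASTA file: two contigs, 8 + 4 bases
demo=$(mktemp -d)
printf '>c1\nACGTAC\nGT\n>c2\nTTTT\n' > "$demo/demo.fa"
contig_stats "$demo/demo.fa"   # prints: 2 contigs, 12 bp
```

With the output directory used in the megahit command above, the real call would be `contig_stats megahit_assembled/788707_20180129_S/final.contigs.fa`.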
@@ -298,28 +508,3 @@ We have created an [example Jupyter notebook](Workshop_MAG_demo.ipynb) so you ca

We are going to move the data to [Google Colab]( to analyse the data and identify contigs that co-occur across multiple samples.

# Predicting the species that are in the sample

## SingleM

take a [look at the manual]( for detailed `singlem` instructions.

Install singleM with mamba:

mamba create -n singlem -c bioconda singlem
mamba activate singlem

Before you use it, make sure you add this command:

`export SINGLEM_METAPACKAGE_PATH=/storage/data/metapackage`

Now run singleM on the CF data:

singlem pipe -1 /storage/data/cf_data/CF_Data_R1.fastq.gz -2 /storage/data/cf_data/CF_Data_R2.fastq.gz -p output_profile.tsv
