diff --git a/docs/NextFlow/scRNAseq.md b/docs/NextFlow/scRNAseq.md
index 96f0f3d..018a906 100644
--- a/docs/NextFlow/scRNAseq.md
+++ b/docs/NextFlow/scRNAseq.md
@@ -88,53 +88,55 @@ Create a new folder somewhere to store your genome files. Enter the new folder,
 
 STAR should be loaded already via the conda environment for the genome indexing step. We will set `--sjdbOverhang` to 79 to be suitable for use with the longer `R2` FASTQ data resulting from BD Rhapsody single cell sequencing. This may require alteration for other platforms. **Essentially, you just need to set `--sjdbOverhang` to the length of your R2 sequences minus 1.**
 
-#### Human genome files 👨👩
-
-```bash title="01_retrieve_human_genome.sh"
-#!/bin/bash
-VERSION=111
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
-gunzip *
-```
-
-Then use STAR to prepare the genome index.
-
-```bash title="02_index_human_genome.sh"
-#!/bin/bash
-VERSION=111
-STAR \
-    --runThreadN 16 \
-    --genomeDir "STARgenomeIndex79/" \
-    --runMode genomeGenerate \
-    --genomeFastaFiles "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa" \
-    --sjdbGTFfile "Homo_sapiens.GRCh38.$VERSION.gtf" \
-    --sjdbOverhang 79
-```
-
-#### Mouse genome files 🐁
-
-```bash title="01_retrieve_mouse_genome.sh"
-#!/bin/bash
-VERSION=111
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
-gunzip *
-```
+=== "Human genome files 👨👩"
+
+    ```bash title="01_retrieve_human_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
+    gunzip *
+    ```
+
+=== "Mouse genome files 🐁"
+
+    ```bash title="01_retrieve_mouse_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
+    gunzip *
+    ```
 
 Then use STAR to prepare the genome index.
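+Before running the indexing scripts below, you can confirm your R2 read length directly from a FASTQ file. A minimal sketch (the file name `sample_R2.fastq.gz` is a placeholder for one of your own R2 files):
+
+```bash title="check_r2_length.sh"
+#!/bin/bash
+# Print the sequence length of the first read in a gzipped R2 FASTQ file;
+# subtract 1 from this value to get your --sjdbOverhang setting
+zcat sample_R2.fastq.gz | awk 'NR==2 {print length($0); exit}'
+```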
-```bash title="02_index_mouse_genome.sh"
-#!/bin/bash
-VERSION=111
-STAR \
-    --runThreadN 16 \
-    --genomeDir "STARgenomeIndex79/" \
-    --runMode genomeGenerate \
-    --genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \
-    --sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \
-    --sjdbOverhang 79
-```
+=== "Human genome files 👨👩"
+
+    ```bash title="02_index_human_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    STAR \
+        --runThreadN 16 \
+        --genomeDir "STARgenomeIndex79/" \
+        --runMode genomeGenerate \
+        --genomeFastaFiles "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa" \
+        --sjdbGTFfile "Homo_sapiens.GRCh38.$VERSION.gtf" \
+        --sjdbOverhang 79
+    ```
+
+=== "Mouse genome files 🐁"
+
+    ```bash title="02_index_mouse_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    STAR \
+        --runThreadN 16 \
+        --genomeDir "STARgenomeIndex79/" \
+        --runMode genomeGenerate \
+        --genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \
+        --sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \
+        --sjdbOverhang 79
+    ```
 
 ### Prepare your sample sheet ✏️
diff --git a/docs/RNAseq/rnaseq-nfcore.md b/docs/RNAseq/rnaseq-nfcore.md
index f1e26d7..e45d11e 100644
--- a/docs/RNAseq/rnaseq-nfcore.md
+++ b/docs/RNAseq/rnaseq-nfcore.md
@@ -99,92 +99,95 @@ nextflow run nf-core/rnaseq -r 3.14.0 \
     --skip_markduplicates
 ```
 
-* We skip the `dupradar` step, because to install `bioconductor-dupradar`, mamba wants to downgrade `salmon` to a very early version, which is not ideal :facepalm:
-* We also skip the `markduplicates` step because it is not recommended to remove duplicates anyway due to normal biological duplicates (i.e. there won't just be 1 copy of a given gene in a complete sample) :mag::sparkles:
+!!! info "Um... why are we skipping things?"
 
-### Download genome files
+    * We skip the `dupradar` step, because to install `bioconductor-dupradar`, mamba wants to downgrade `salmon` to a very early version, which is not ideal :facepalm:
+    * We also skip the `markduplicates` step because it is not recommended to remove duplicates anyway due to normal biological duplicates (i.e. there won't just be 1 copy of a given gene in a complete sample) :mag::sparkles:
+
+### Download genome files 🧬
 
 To avoid issues with genome incompatibility with the version of STAR you are running, it is recommended to simply download the relevant genome fasta and GTF files using the following scripts, and then supply them directly to the function call.
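+The retrieval scripts are given in the tabs below. If you also want to confirm that a download completed intact, Ensembl publishes a `CHECKSUMS` file alongside each FTP directory; a sketch for the human FASTA is shown here (the listing uses BSD-style `sum` output, so compare the two values by eye, and adjust the paths for the GTF or the mouse files):
+
+```bash title="verify_downloads.sh"
+#!/bin/bash
+VERSION=111
+# Fetch the checksum listing published next to the FASTA files
+wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/CHECKSUMS
+# BSD-style checksum of the downloaded archive...
+sum Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
+# ...should match the corresponding line in the listing
+grep "primary_assembly.fa.gz" CHECKSUMS
+```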
-#### Human genome files 👨👩
+=== "Human genome files 👨👩"
 
-```bash title="01_retrieve_human_genome.sh"
-#!/bin/bash
-VERSION=111
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
-```
+    ```bash title="01_retrieve_human_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
+    ```
 
-#### Mouse genome files 🐁
+=== "Mouse genome files 🐁"
 
-```bash title="01_retrieve_mouse_genome.sh"
-#!/bin/bash
-VERSION=111
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
-wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
-```
+    ```bash title="01_retrieve_mouse_genome.sh"
+    #!/bin/bash
+    VERSION=111
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
+    wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
+    ```
 
-### Run your RNA sequencing reads 🐁
+### Run your RNA sequencing reads 🏃
 
 To avoid typing the whole command out (and in case the pipeline crashes), create a script that will handle the process. Two examples are given here, with one for human samples, and one for mouse samples.
 
 * You will need to replace the RSEM folder location with your own path from above.
 * Using the `save_reference` option stores the formatted genome files to save time if you need to resume or restart the pipeline.
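+Either run script can also be submitted as a batch job rather than launched from an interactive session. A minimal Slurm wrapper sketch (resource requests mirror the interactive session used elsewhere in this guide; the wrapped script name assumes the human example below):
+
+```bash title="submit_rnaseq.sh"
+#!/bin/bash
+#SBATCH --job-name=nf-rnaseq
+#SBATCH --time=2-00:00:00
+#SBATCH --mem=64GB
+#SBATCH --ntasks=1
+#SBATCH --cpus-per-task=12
+
+# The run script loads the Java module and starts Nextflow itself;
+# swap in 02_run_rnaseq_mouse.sh for mouse samples
+bash 02_run_rnaseq_human.sh
+```
+
+Submit it with `sbatch submit_rnaseq.sh`; because the run scripts pass `-resume`, a resubmission after a crash picks up from the last completed task.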
-#### Human run script 👨👩
-
-```bash title="02_run_rnaseq_human.sh"
-#!/bin/bash
-module load java/openjdk-17.0.2
-export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/
-
-nextflow run nf-core/rnaseq -r 3.14.0 \
-    --input samplesheet.csv \
-    --outdir rnaseq_output \
-    --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
-    --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \
-    --skip_dupradar \
-    --skip_markduplicates \
-    --save_reference \
-    -resume
-
-```
-
-#### Mouse run script 🐁
-
-```bash title="02_run_rnaseq_mouse.sh"
-#!/bin/bash
-module load java/openjdk-17.0.2
-export PATH=$PATH:”/home/mmacowan/mf33/tools/RSEM/”
-
-nextflow run nf-core/rnaseq -r 3.14.0 \
-    --input samplesheet.csv \
-    --outdir rnaseq_output \
-    --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \
-    --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \
-    --skip_dupradar \
-    --skip_markduplicates \
-    --save_reference \
-    -resume
-```
-
-## Import data into R
+=== "Human run script 👨👩"
+
+    ```bash title="02_run_rnaseq_human.sh"
+    #!/bin/bash
+    module load java/openjdk-17.0.2
+    export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/
+
+    nextflow run nf-core/rnaseq -r 3.14.0 \
+        --input samplesheet.csv \
+        --outdir rnaseq_output \
+        --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
+        --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \
+        --skip_dupradar \
+        --skip_markduplicates \
+        --save_reference \
+        -resume
+    ```
+
+=== "Mouse run script 🐁"
+
+    ```bash title="02_run_rnaseq_mouse.sh"
+    #!/bin/bash
+    module load java/openjdk-17.0.2
+    export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/
+
+    nextflow run nf-core/rnaseq -r 3.14.0 \
+        --input samplesheet.csv \
+        --outdir rnaseq_output \
+        --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \
+        --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \
+        --skip_dupradar \
+        --skip_markduplicates \
+        --save_reference \
+        -resume
+    ```
+
+## Import data into R 📥
 
 We have a standardised method for importing data into R. Luckily for us, the NF-CORE/rnaseq pipeline outputs are provided in `.rds` format as `SummarizedExperiment` objects, with bias-corrected gene counts without an offset.
 
 * `salmon.merged.gene_counts_length_scaled.rds`
 
-There are two matrices provided to us: `counts` and `abundance`.
+??? info "Tell me more!"
+
+    * There are two matrices provided to us: `counts` and `abundance`.
+    * The `counts` matrix is a re-estimated counts table that aims to provide count-level data to be compatible with downstream tools such as DESeq2.
+    * The `abundance` matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity — we want to assign a fraction of the total expression (whatever that may be) to each transcript, regardless of whether our library is 10M fragments or 100M fragments.
+    * The `tximport` package has a single function for importing transcript-level estimates. The `type` argument is used to specify what software was used for estimation.
+      A simple list with matrices, `"abundance"`, `"counts"`, and `"length"`, is returned, where the transcript-level information is summarised to the gene level. Typically, abundance is provided by the quantification tools as TPM (transcripts per million), while the counts are estimated counts (possibly fractional), and the `"length"` matrix contains the effective gene lengths. The `"length"` matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices.
-- The `abundance` matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity — we want to assign a fraction of the total expression (whatever that may be) to transcript, regardless of whether our library is 10M fragments or 100M fragments.
-- The `counts` matrix is a re-estimated counts table that aims to provide count-level data to be compatible with downstream tools such as DESeq2.
-- The `tximport` package has a single function for importing transcript-level estimates. The type argument is used to specify what software was used for estimation. A simple list with matrices, `"abundance"`, `"counts"`, and `"length"`, is returned, where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the `"length"` matrix contains the effective gene lengths. The `"length"` matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices.
 
 ### R code for import and voom-normalisation
 
 Here we show our standard process for preparing RNAseq data for downstream analysis.
 
-```r
+```r title="Prepare Voom-normalised DGE List"
 # Load R packages
 pkgs <- c('knitr', 'here', 'SummarizedExperiment', 'biomaRt', 'edgeR', 'limma')
 pacman::p_load(char = pkgs)
diff --git a/docs/Utilities/convert-raw-novaseq-outputs.md b/docs/Utilities/convert-raw-novaseq-outputs.md
index 1c66d41..64af662 100644
--- a/docs/Utilities/convert-raw-novaseq-outputs.md
+++ b/docs/Utilities/convert-raw-novaseq-outputs.md
@@ -30,7 +30,9 @@ The document should be in the following format, where `index` is the `i7 adapter
 
 For the indexes, **both** sequences used on the sample sheet should be the reverse complement of the actual sequences.
 
-If you make this on a Windows system, ensure you save your output encoded by `UTF-8` and not `UTF-8 with BOM`.
+!!! warning "Ensure correct file encoding 🪟👀"
+
+    If you make this on a Windows system, ensure you save your output encoded by `UTF-8` and not `UTF-8 with BOM`.
 
 ```bash
 [Header]
 FileFormatVersion,2
@@ -72,9 +74,11 @@ The most up-to-date bcl-convert will be inside the output `usr/bin/` folder, and
 
 With the `raw_data` folder and `samplesheet.txt` both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session.
 
-You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves.
+!!! warning "Open file limit error"
+
+    You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves.
 
-You can request additional open file limit from the M3 MASSIVE help desk.
+    You can request an increased open file limit from the M3 MASSIVE help desk.
 
 !!! question "Can I run this on my local machine?"
@@ -84,11 +88,11 @@ You can request additional open file limit from the M3 MASSIVE help desk.
 
     The minimum requirements (as of BCL Convert v4.0) are:
 
     - **Hardware requirements**
-      - Single multiprocessor or multicore computer
-      - Minimum 64 GB RAM
+        - Single multiprocessor or multicore computer
+        - Minimum 64 GB RAM
     - **Software requirements**
-      - Root access to your computer
-      - File system access to adjust ulimit
+        - Root access to your computer
+        - File system access to adjust ulimit
 
     You can start an interactive bash session and increase the open file limit as follows:
diff --git a/mkdocs.yml b/mkdocs.yml
index 134910f..d8cc513 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -45,6 +45,8 @@ markdown_extensions:
   - pymdownx.tasklist
   - pymdownx.tilde
   - pymdownx.emoji
+  - pymdownx.tabbed:
+      alternate_style: true
 
 extra:
   analytics:
The most up-to-date bcl-convert will be inside the output usr/bin/
folder, and can be called from that location.
"},{"location":"Utilities/convert-raw-novaseq-outputs/#run","title":"Run \ud83c\udfc3","text":"With the raw_data
folder and samplesheet.txt
both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session.
You will need a very high limit for open files \u2013 BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves.
You can request additional open file limit from the M3 MASSIVE help desk.
Can I run this on my local machine?
Please note that the node m3k010
has been decommissioned due to system upgrades.
However, it is more than possible to run this process quickly on a local machine if you have the raw BCL files available. The minimum requirements (as of BCL Convert v4.0) are:
- Hardware requirements
- Single multiprocessor or multicore computer
- Minimum 64 GB RAM
- Software requirements
- Root access to your computer
- File system access to adjust ulimit
You can start an interactive bash session and increase the open file limit as follows:
# Begin a new interactive bash session on the designated node\nsrun --pty --partition=genomics --qos=genomics --nodelist=m3k010 --mem=320GB --ntasks=1 --cpus-per-task=48 bash -i\n\n# Increase the open file limit to 65,535\nulimit -n 65535\n
# Run bcl-convert\nbcl-convert \\\n --bcl-input-directory raw_data \\\n --output-directory fastq_files \\\n --sample-sheet samplesheet.txt\n
This will create a new output folder called fastq_files
that contains your demultiplexed samples.
"},{"location":"Utilities/convert-raw-novaseq-outputs/#merge-lanes","title":"Merge lanes \u26d9","text":"If you ran your samples without lane splitting, then you can merge the two lanes together using the following code, saved in the main project folder as merge_lanes.sh
, and run using the command: bash merge_lanes.sh
.
merge_lanes.sh#!/bin/bash\n\n# Merge lanes 1 and 2\ncd fastq_files\nfor f in *.fastq.gz\n do\n Basename=${f%_L00*}\n ## merge R1\n ls ${Basename}_L00*_R1_001.fastq.gz | xargs cat > ${Basename}_R1.fastq.gz\n ## merge R2\n ls ${Basename}_L00*_R2_001.fastq.gz | xargs cat > ${Basename}_R2.fastq.gz\n done\n\n# Remove individual files to make space\nrm -rf *L00*\n
"},{"location":"Utilities/sra-data-submission/","title":"SRA sequencing data submission","text":"A guide to submitting sequencing data to the National Center for Biotechnology Information (NCBI) sequencing read archive (SRA) database. Includes information on uploading data to the SRA using the high-speed Aspera Connect tool.
Patient-derived sequencing files
If your samples are derived from humans, ensure that your file names include no reference to patient identifiers. Once uploaded to the SRA database, it is very difficult to change the names of files, and requires directly contacting the database to arrange for removal of files and for you to reupload the data. It also involves a difficult process of them re-mapping the new uploads to your existing SRA metadata files.
Also ensure that you only include the absolute minimum amount of metadata, in a manner that protects patient confidentiality. Absolutely no information should be unique to one single patient in your cohort, even an age (if you have a patient with a unique age, this should be replaced with NA
for the purposes of SRA submission). For manuscripts, you can include a phrase indicating that further metadata is available upon reasonable request. The important thing here is to not infringe on patient privacy and confidentiality.
Things you could potentially include: - Modified and anonymised patient ID - Sampling group - Timepoint (not exact days or months) - Sex - Collection year (no exact dates) - Tissue
"},{"location":"Utilities/sra-data-submission/#process-overview","title":"Process overview","text":" - Register a BioProject
- Register BioSamples for the related BioProject
- Submit data to SRA
"},{"location":"Utilities/sra-data-submission/#register-a-bioproject","title":"Register a BioProject \ud83d\udcd4","text":"The BioProject is an important element that can link together different types of sequencing data, and represents all the sequencing data for a given experiment.
Go to the SRA submission website to register a new BioProject.
- Sample scope: Multispecies (if you have microbiome data)
- Target description: Bacterial 16S metagenomics (change if you have shotgun metagenomics and/or host transcriptomics)
- Organism name: Human (change if using mouse or rat data)
- Project type: Metagenome (add transcriptome if you also have host transcriptomics)
"},{"location":"Utilities/sra-data-submission/#register-biosamples-test_tube","title":"Register BioSamples :test_tube:","text":""},{"location":"Utilities/sra-data-submission/#microbiome-data","title":"Microbiome data \ud83e\udda0","text":"Microbiome samples will be registered as MIMARKS Specimen samples. On the BioSample Attributes tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the EMBL-EBI Ontology Lookup Service.
- Use the BioProject accession number previously generated
- Organism:
human metagenome
(or as appropriate) - Env broad scale:
host-associated
- Env local scale:
mammalia-associated habitat
- Env medium: (as appropriate)
- Strain, isolate, cultivar, ecotype:
NA
- Add any other relevant host information in the table, as well as the host tissue samples
- Any other column which is not relevant can be set to
NA
The SRA Metadata tab is what will join everything together. Once again, download the provided Excel template, and fill everything in carefully.
- Sample name: the base name of your samples
- Library ID: you may have named your files differently than your sample names \u2013 provide this if so, otherwise you can repeat the sample name
- Title: a short description of the sample in the form \"
{methodology}
of {organism}
: {sample_info}
\" \u2013 e.g. \"Shotgun metagenomics of Homo sapiens: childhood bronchial brushing\". - Library strategy:
WGS
- Library source:
METAGENOMIC
- Library selection:
RANDOM
- Library layout:
paired
- Platform:
ILLUMINA
- Instrument model:
Illumina NovaSeq 6000
- Design description:
NA
- Filetype:
fastq
- Filename: the file name of the forward reads
- Filename2: the file name of the reverse reads
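As a rough sketch, a completed set of SRA metadata rows might look like the following (sample and file names are hypothetical; use the exact column headers from the downloaded template):
sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2\nStool_01,Stool_01_lib,Shotgun metagenomics of Homo sapiens: adult stool,WGS,METAGENOMIC,RANDOM,paired,ILLUMINA,Illumina NovaSeq 6000,NA,fastq,Stool_01_R1.fastq.gz,Stool_01_R2.fastq.gz\nStool_02,Stool_02_lib,Shotgun metagenomics of Homo sapiens: adult stool,WGS,METAGENOMIC,RANDOM,paired,ILLUMINA,Illumina NovaSeq 6000,NA,fastq,Stool_02_R1.fastq.gz,Stool_02_R2.fastq.gz\n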
"},{"location":"Utilities/sra-data-submission/#transcriptomics-data","title":"Transcriptomics data \ud83d\udc68\ud83d\udc2d","text":"Host transcriptomics samples will be registered as either HUMAN or Model organism or animal samples. On the BioSample Attributes tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the EMBL-EBI Ontology Lookup Service.
- Use the BioProject accession number previously generated
- Organism:
Homo sapiens
(or Mus musculus
/Rattus norvegicus
as appropriate) - Isolate: NA
- Age: fill this in, but leave
NA
for human samples if it would result in a unique combination of metadata variables with potential to allow identification of any individual. - Biomaterial provider: enter the lab, organisation etc. that provided the samples
- Collection date: do not enter any exact dates for human samples
- Geo loc name: country in which samples were collected
- Sex: provide sex of host
- Tissue: specify tissue origin of samples
- Add any other relevant data, such as sampling group
As above, the SRA Metadata tab is where the magic will happen :magic_wand:. Once again, download the provided Excel template, and fill everything in carefully.
- Sample name: the base name of your samples
- Library ID: you may have named your files differently than your sample names \u2013 provide this if so, otherwise you can repeat the sample name
- Title: a short description of the sample in the form \"
{methodology}
of {organism}
: {sample_info}
\" \u2013 e.g. \"RNA-Seq of Homo sapiens: childhood bronchial brushing\". - Library strategy:
RNA-Seq
- Library source:
TRANSCRIPTOMIC
- Library selection:
RANDOM
- Library layout:
paired
- Platform:
ILLUMINA
- Instrument model:
Illumina NovaSeq 6000
- Design description:
NA
- Filetype:
fastq
- Filename: the file name of the forward reads
- Filename2: the file name of the reverse reads
"},{"location":"Utilities/sra-data-submission/#submit-data-to-sra","title":"Submit data to SRA \ud83d\udce4","text":"Which upload option should I choose?
You can choose either of the following upload options, and each has pros and cons.
- Filezilla allows parallel uploads according to your settings, but upload speed is typically slower.
- Aspera Connect (at least with NCBI) only allows sequential uploads, but the upload speed is significantly faster.
"},{"location":"Utilities/sra-data-submission/#filezilla","title":"FileZilla \ud83e\udd96","text":"Using FileZilla is more effective when you have large files and/or a large number of files.
In FileZilla, open the sites manager and connect to NCBI as follows: - Protocol: FTP
- Host: ftp-private.ncbi.nlm.nih.gov
- Username: subftp
- Password: this is your user-specific NCBI password given when you submit your data
In the Advanced
tab next to the General
tab, set the Default remote directory
field to the directory specified by NCBI. This will looks something like: /uploads/{username}_{uniqueID}
.
Select connect, and gain access to your account folder on the NCBI FTP server.
Create a new project folder within the main upload folder, and enter the folder. Add your files to the upload queue, and begin the upload process.
"},{"location":"Utilities/sra-data-submission/#aspera-connect","title":"Aspera Connect","text":"The IBM Aspera Connect tool allows for much faster uploads than FileZilla, and is a good alternative for large files.
"},{"location":"Utilities/sra-data-submission/#linux-process","title":"Linux process \ud83d\udc27","text":"The process described here is for Linux, but is similar for Windows and MacOS operating systems. More information is provided on the IBM website.
- Download the Aspera Connect software.
- Open a new terminal window (
Ctrl+Alt+T
) - Navigate to downloads, extract the
tar.gz
file. - Run the install script.
# Extract the file\ntar -zxvf ibm-aspera-connect-version+platform.tar.gz\n# Run the install script\n./ibm-aspera-connect-version+platform.sh\n
- Add the Aspera Connect bin folder to your PATH variable (reopen terminal to apply changes).
# Add folder to PATH\necho 'export PATH=$PATH:/home/{user}/.aspera/connect/bin' >> ~/.bashrc\n
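Afterwards, you can confirm the binary is discoverable in a fresh terminal:
# Confirm ascp is on your PATH\nwhich ascp\n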
- Download the NCBI Aspera Connect key file.
- Navigate to the parent folder of the folder containing the files you want to upload to the SRA database, and create a new bash script.
# Create a new bash script file\ntouch upload_seq_data.sh\n
- Add the following code to the bash script file.
- The
-i
argument is the path to the key file, and must be given as a full path (not a relative one). - The
-d
argument specifies that the directory will be created if it doesn't exist. - You can adjust the maximum upload speed using the
-l500m
argument, where 500
is the speed in Mbps. You could increase or decrease as desired. - Add the folder containing the data to upload, which can be relative to the folder containing the bash script.
- Next provide the upload folder provided by NCBI, which will be user-specific, and ensure you provide a project folder at the end of this. Data will not be available if it is uploaded into the main uploads folder.
upload_seq_data.sh#!/bin/bash\nascp -i {/full/path/to/key-file/aspera.openssh} -QT -l500m -k1 -d {./name-of-seq-data-folder} subasp@upload.ncbi.nlm.nih.gov:uploads/{user-specific-ID}/{name-of-project}\n
- Run the bash script, and upload all files. The default settings will allow you to resume uploads if they are interrupted, and it will not overwrite files that are identical in the destination folder.
# Run script\nbash upload_seq_data.sh\n
"}]}
\ No newline at end of file
+{"config":{"lang":["en"],"separator":"[\\s\\-]+","pipeline":["stopWordFilter"]},"docs":[{"location":"","title":"Welcome to the Mucosal Immunology Lab Bioinformatics Hub","text":""},{"location":"#overview","title":"Overview","text":"Over the years, we have refined several workflows for processing of the various omic modalities we utilise within our group. As with any workflows in the bioinformatics field, these are constantly evolving as new tools and best practices emerge. As such, this hub is very much a work-in-progress, and will remain so as we continue to add to and update it.
"},{"location":"#contributors","title":"Contributors","text":"These sort of tasks are never accomplished alone! Massive thanks to the people who have contributed to this.
-
Matthew Macowan
Bioinformatician \u2013 Mucosal Immunology Group
-
C\u00e9line Pattaroni
Group Leader \u2013 Computational Immunology Group
-
Giulia Iacono
Post Doc \u2013 Mucosal Immunology Group
- Alana Butler: Bioinformatician
- Bailey Cardwell: PhD student \u2013 Mucosal Immunology Group
"},{"location":"#group-research-overview","title":"Group Research Overview","text":"The Mucosal Immunology Research Group, led by Professors Marsland, Harris and Westall, is focused on understanding the fundamental principles in health and disease of the gut, lung and nervous system. Projects span the space from discovery using preclinical models of host-microbe interactions and inflammation through to high performance computational approaches to identify clinical biomarkers and development of novel drug candidates.
"},{"location":"#group-heads","title":"Group Heads","text":" -
Ben Marsland
-
Nicola Harris
-
Glen Westall
"},{"location":"NextFlow/nf-mucimmuno/","title":"Nextflow Workflows","text":"As they are built and published, this repository will contain Nextflow workflows for processing of different data omic modalities.
Additionally, we may provide additional tools and code for further downstream processing, with the goal of standardising data analytic approaches within the Mucosal Immunology Lab.
"},{"location":"NextFlow/nf-mucimmuno/#single-cell-rnaseq-fastq-pre-processing","title":"Single-cell RNAseq FASTQ pre-processing","text":"nf-mucimmuno/scRNAseq is a bioinformatics pipeline for single-cell RNA sequencing data that can be used to run quality control steps and alignment to a host genome using STARsolo. Currently only configured for use with data resulting from BD Rhapsody library preparation.
"},{"location":"NextFlow/scRNAseq/","title":"Single-cell RNAseq FASTQ pre-processing","text":""},{"location":"NextFlow/scRNAseq/#introduction","title":"Introduction","text":"nf-mucimmuno/scRNAseq is a bioinformatics pipeline that can be used to run quality control steps and alignment to a host genome using STARsolo. It takes a samplesheet and FASTQ files as input, performs FastQC, trimming and alignment, and produces an output .tar.gz
archive containing the collected outputs from STARsolo, ready for further processing downstream in R. MultiQC is run on the FastQC outputs both before and after TrimGalore! for visual inspection of sample quality \u2013 output .html
files are collected in the results.
"},{"location":"NextFlow/scRNAseq/#usage","title":"Usage","text":""},{"location":"NextFlow/scRNAseq/#download-the-repository","title":"Download the repository \ud83d\udcc1","text":"This repository contains the relevant Nextflow workflow components, including a conda environment and submodules, to run the pipeline. To retrieve this repository alone, run the retrieve_me.sh
script above.
Git sparse-checkout
is required to retrieve just the nf-mucimmuno/scRNAseq pipeline. It was only introduced to Git in version 2.27.0, so ensure that the loaded version is high enough (or that there is a version loaded on the cluster at all). As of July 2024, the M3 MASSIVE cluster has version 2.38.1 available.
# Check git version\ngit --version\n\n# Load git module if not loaded or insufficient version\nmodule load git/2.38.1\n
First, create a new bash script file.
# Create and edit a new file with nano\nnano retrieve_me.sh\n
Add the contents to the file, save, and close.
retrieve_me.sh#!/bin/bash\n\n# Define variables\nREPO_URL=\"https://github.com/mucosal-immunology-lab/nf-mucimmuno\"\nREPO_DIR=\"nf-mucimmuno\"\nSUBFOLDER=\"scRNAseq\"\n\n# Clone the repository with sparse checkout\ngit clone --no-checkout $REPO_URL\ncd $REPO_DIR\n\n# Initialize sparse-checkout and set the desired subfolder\ngit sparse-checkout init --cone\ngit sparse-checkout set $SUBFOLDER\n\n# Checkout the files in the subfolder\ngit checkout main\n\n# Move the folder into the main folder and delete the parent\nmv $SUBFOLDER ../\ncd ..\nrm -rf $REPO_DIR\n\n# Extract the larger gzipped CLS files\ngunzip -r \"$SUBFOLDER/modules/starsolo/CLS\"\n\necho \"Subfolder '$SUBFOLDER' has been downloaded successfully.\"\n
Then run the script to retrieve the repository into a new folder called scRNAseq
, which will house your workflow files and results.
# Run the script\nbash retrieve_me.sh\n
"},{"location":"NextFlow/scRNAseq/#create-the-conda-environment","title":"Create the conda environment \ud83d\udc0d","text":"To create the conda environment, use the provided environment .yaml
file. Then activate it to access required functions.
# Create the environment\nmamba env create -f environment.yaml\n\n# Activate the environment\nmamba activate nextflow-scrnaseq\n
"},{"location":"NextFlow/scRNAseq/#prepare-the-genome","title":"Prepare the genome \ud83e\uddec","text":"Create a new folder somewhere to store your genome files. Enter the new folder, and run the relevant code depending on your host organism. Run these steps in an interactive session with ~48GB RAM and 16 cores, or submit them as an sbatch job.
Please check if these are already available somewhere before regenerating them yourself!
STAR should be loaded already via the conda environment for the genome indexing step. We will set --sjdbOverhang
to 79 to be suitable for use with the longer R2
FASTQ data resulting from BD Rhapsody single cell sequencing. This may require alteration for other platforms. Essentially, you just need to set --sjdbOverhang
to the length of your R2 sequences minus 1.
Human genome files \ud83d\udc68\ud83d\udc69Mouse genome files \ud83d\udc01 01_retrieve_human_genome.sh#!/bin/bash\nVERSION=111\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz\ngunzip *\n
01_retrieve_mouse_genome.sh#!/bin/bash\nVERSION=111\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz\ngunzip *\n
Then use STAR to prepare the genome index.
Human genome files \ud83d\udc68\ud83d\udc69Mouse genome files \ud83d\udc01 02_index_human_genome.sh#!/bin/bash\nVERSION=111\nSTAR \\\n --runThreadN 16 \\\n --genomeDir \"STARgenomeIndex79/\" \\\n --runMode genomeGenerate \\\n --genomeFastaFiles \"Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa\" \\\n --sjdbGTFfile \"Homo_sapiens.GRCh38.$VERSION.gtf\" \\\n --sjdbOverhang 79\n
02_index_mouse_genome.sh#!/bin/bash\nVERSION=111\nSTAR \\\n --runThreadN 16 \\\n --genomeDir \"STARgenomeIndex79/\" \\\n --runMode genomeGenerate \\\n --genomeFastaFiles \"Mus_musculus.GRCm39.dna_sm.primary_assembly.fa\" \\\n --sjdbGTFfile \"Mus_musculus.GRCm39.$VERSION.gtf\" \\\n --sjdbOverhang 79\n
"},{"location":"NextFlow/scRNAseq/#prepare-your-sample-sheet","title":"Prepare your sample sheet \u270f\ufe0f","text":"This pipeline requires a sample sheet to identify where your FASTQ files are located, and which cell label sequences (CLS) are being utilised.
More information about the CLS tags used with BD Rhapsody single-cell RNAseq library preparation can be found here:
- BD Rhapsody Sequence Analysis Pipeline \u2013 User's Guide
- BD Rhapsody Cell Label Structure \u2013 Python Script
More information about the CLS tags used with 10X Chromium single-cell RNAseq library preparation can be found here:
- 10X Chromium Single Cell 3' Solution V2 and V3 guide (Teich Lab)
- 10X Chromium V2 CLS sequences are 26bp long.
- 10X Chromium V3 CLS sequences are 28bp long.
The benefit of providing the name of the CLS bead versions in the sample sheet is that you can combine runs that utilise different beads together in the same workflow. Keep in mind that if you do this though, there may be some bead-related batch effects to address and correct downstream \u2013 it is always important to check for these effects when combining sequencing runs in any case. The current options are:
 - BD_Original: The original BD Rhapsody beads and linker sequences
 - BD_Enhanced_V1: First version of enhanced beads with polyT and 5prime capture oligo types, shorter linker sequences, longer polyT, and 0-3 diversity insert bases at the beginning of the sequence
 - BD_Enhanced_V2: Same structure as the enhanced (V1) beads, but with increased CLS diversity (384 vs. 96)
 - 10X_Chromium_V2: Feature a 16 bp cell barcode and a 10 bp unique molecular identifier (UMI)
 - 10X_Chromium_V3: Enhanced sequencing accuracy and resolution with a 16 bp cell barcode and a 12 bp UMI
Further, we also need to provide the path to the STAR genome index folder for each sample \u2013 while in many cases this value will remain constant, the benefit of providing this information is that you can process runs with different R2 sequence lengths at the same time. Recall from above that the genome index you use should be built with an --sjdbOverhang
equal to the length of your R2 sequences minus 1.
Your sample sheet should look as follows, ensuring you use the exact column names as below. Remember that on the M3 MASSIVE cluster, you need to use the full file path \u2013 relative file paths don't usually work.
sample,fastq_1,fastq_2,CLS,GenomeIndex\nCONTROL_S1,CONTROL_S1_R1.fastq.gz,CONTROL_S1_R2.fastq.gz,BD_Enhanced_V2,mf33/Databases/ensembl/human/STARgenomeIndex79\nCONTROL_S2,CONTROL_S2_R1.fastq.gz,CONTROL_S2_R2.fastq.gz,BD_Enhanced_V2,mf33/Databases/ensembl/human/STARgenomeIndex79\nTREATMENT_S1,TREATMENT_S1_R1.fastq.gz,TREATMENT_S1_R2.fastq.gz,BD_Enhanced_V2,mf33/Databases/ensembl/human/STARgenomeIndex79\n
An example is provided in data/samplesheet_test
.
"},{"location":"NextFlow/scRNAseq/#running-the-pipeline","title":"Running the pipeline \ud83c\udfc3","text":"Now you can run the pipeline. You will need to set up a parent job to run each of the individual jobs \u2013 this can be either an interactive session, or an sbatch job. For example:
# Start an interactive session with minimal resources\nsmux n --time=3-00:00:00 --mem=16GB --ntasks=1 --cpuspertask=2 -J nf-STARsolo\n
Make sure you alter the nextflow.config
file to provide the path to your sample sheet, unless it is ./data/samplesheet.csv
which is the default for the cluster profile. Stay within the top cluster
profile section to alter settings for Slurm-submitted jobs.
Inside your interactive session, be sure to activate your nextflow-scrnaseq
environment from above. Then, inside the scRNAseq folder, begin the pipeline using the following command (ensuring you use the cluster
profile to make use of the Slurm workflow manager).
# Activate conda environment\nmamba activate nextflow-scrnaseq\n\n# Begin running the pipeline\nnextflow run process_raw_reads.nf -resume -profile cluster\n
"},{"location":"NextFlow/scRNAseq/#customisation","title":"Customisation \u2699\ufe0f","text":"There are several customisation options that are available within the nextflow.config
file. While the defaults should be suitable for those with access to the M3 MASSIVE cluster genomics partition, for those without access, of for those who require different amounts of resources, there are ways to change these.
In order to work with different technologies, and accommodate for differences in cell label structure (CLS), the STAR parameters --soloType
and --soloCBmatchWLtype
are set in a CLS-dependent manner. This is required, because the BD Rhapsody system has a complex barcode structure. The 10X Chromium system on the other hand has a simple barcode structure with a single barcode and single UMI. Additionally, the --soloCBmatchWLtype = EditDist2
only works with --soloType = CB_UMI_Complex
, and therefore --soloCBmatchWLtype = 1MM multi Nbase pseudocounts
is used for 10X Chromium runs.
- For BD Rhapsody sequencing:
--soloType = CB_UMI_Complex
and --soloCBmatchWLtype = EditDist2
. - For 10X Chromium sequencing:
--soloType = CB_UMI_Simple
and --soloCBmatchWLtype = 1MM multi Nbase pseudocounts
. - Additionally, 10X Chromium runs use
--clipAdapterType = CellRanger4
.
To adjust the cluster
profile settings, stay within the appropriate section at the top of the file.
Parameters
Visit STAR documentation for explanations of all available options for STARsolo.
 - samples_csv: The file path to your sample sheet
 - outdir: A new folder name to be created for your results
 - trimgalore.quality: The minimum quality before a sequence is truncated (default: 20)
 - trimgalore.adapter: A custom adapter sequence for the R1 sequences (default: 'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC')
 - trimgalore.adapter2: A custom adapter sequence for the R2 sequences (default: 'AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGT')
 - starsolo.soloUMIdedup: The type of UMI deduplication (default: '1MM_CR')
 - starsolo.soloUMIfiltering: The type of UMI filtering for reads uniquely mapping to genes (default: 'MultiGeneUMI_CR')
 - starsolo.soloCellFilter: The method type and parameters for cell filtering (default: 'EmptyDrops_CR')
 - starsolo.soloMultiMappers: The counting method for reads mapping to multiple genes (default: 'EM')
Process
These settings relate to resource allocation and cluster settings. FASTQC and TRIMGALORE steps can take longer than 4 hours for typical single-cell RNAseq files, and therefore the default option is to run these steps on the comp
partition.
 - executor: The workload manager (default: 'slurm')
 - conda: The conda environment to use (default: './environment.yaml')
 - queueSize: The maximum number of jobs to be submitted at any time (default: 12)
 - submitRateLimit: The rate allowed for job submission \u2013 either a number of jobs per second (e.g. 20sec) or a number of jobs per time period (e.g. 20/5min) (default: '1/2sec')
 - memory: The maximum global memory allowed for Nextflow to use (default: '320 GB')
 - FASTQC.memory: Memory for FASTQC step to use (default: '80 GB')
 - FASTQC.cpus: Number of CPUs for FASTQC step to use (default: 8)
 - FASTQC.clusterOptions: Specific cluster options for FASTQC step (default: '--time=8:00:00')
 - TRIMGALORE.memory: Memory for TRIMGALORE step to use (default: '80 GB')
 - TRIMGALORE.cpus: Number of CPUs for TRIMGALORE step to use (default: 8)
 - TRIMGALORE.clusterOptions: Specific cluster options for TRIMGALORE step (default: '--time=8:00:00')
 - STARSOLO.memory: Memory for STARSOLO step to use (default: '80 GB')
 - STARSOLO.cpus: Number of CPUs for STARSOLO step to use (default: 12)
 - STARSOLO.clusterOptions: Specific cluster options for STARSOLO step (default: '--time=4:00:00 --partition=genomics --qos=genomics')
 - COLLECT_EXPORT_FILES.memory: Memory for COLLECT_EXPORT_FILES step to use (default: '32 GB')
 - COLLECT_EXPORT_FILES.cpus: Number of CPUs for COLLECT_EXPORT_FILES step to use (default: 8)
 - COLLECT_EXPORT_FILES.clusterOptions: Specific cluster options for COLLECT_EXPORT_FILES step (default: '--time=4:00:00 --partition=genomics --qos=genomics')
"},{"location":"NextFlow/scRNAseq/#outputs","title":"Outputs","text":"Several outputs will be copied from their respective Nextflow work
directories to the output folder of your choice (default: results
).
Alignment summary utility script
There is also a utility script in the main scRNAseq
directory called collect_alignment_summaries.sh
. This will navigate into each of the sample folders inside results/STARsolo
, and retrieve some key information for you to validate that the alignment worked successfully (from the GeneFull_Ex50pAS
subfolder). This can otherwise take quite some time to go through each folder if you have a lot of samples.
- After running this, a new file called
AlignmentSummary.txt
will be generated in the scRNAseq
directory. Each sample will be listed by name, with the number of reads, percentage of reads with valid barcodes, and estimated number of cells. - It will be immediately obvious that something has gone wrong if you see that the percentage of reads with valid barcodes is very low (e.g.
0.02
= 2% valid barcodes) \u2013 this is usually paired with a very low estimated cell number. - This could indicate that you have used the wrong barcode version for your runs, and therefore the associated barcode whitelist used by the pipeline was incorrect.
A successful example is shown below
Sample: Healthy1\nNumber of Reads,353152389\nReads With Valid Barcodes,0.950799\nEstimated Number of Cells,6623\n\nSample: Healthy2\nNumber of Reads,344989615\nReads With Valid Barcodes,0.948577\nEstimated Number of Cells,6631\n# etc...\n
"},{"location":"NextFlow/scRNAseq/#collected-export-files","title":"Collected export files \ud83d\udce6","text":"The main output will be a single archive file called export_files.tar.gz
that you will take for further downstream pre-processing. It contains STARsolo outputs for each sample, with the respective subfolders described below.
"},{"location":"NextFlow/scRNAseq/#reports","title":"Reports \ud83d\udcc4","text":"Within the reports
folder, you will find the MultiQC outputs from pre- and post-trimming.
"},{"location":"NextFlow/scRNAseq/#starsolo","title":"STARsolo \u2b50","text":"Contains the outputs for each sample from STARsolo, including various log files and package version information.
The main output of interest here is a folder called {sample}.Solo.out
, which houses subfolders called Gene
, GeneFull_Ex50pAS
, and Velocyto
. It is this main folder for each sample that is added to export_files.tar.gz
. * As you will use the gene count data from GeneFull_Ex50pAS
downstream, it is a good idea to check the Summary.csv
within this folder for each sample to ensure mapping was successful (or use the utility script above). * One of the key values to inspect is Reads With Valid Barcodes
, which should be >0.8 (indicating at least 80% of reads had valid barcodes). * If you note that this value is closer to 0.02 (i.e. ~2% had valid barcodes), you should double-check to make sure you specified the correct BD Rhapsody beads version. For instance, if you specified BD_Enhanced_V1
but actually required BD_Enhanced_V2
, the majority of your reads will not match the whitelist, and therefore the reads will be considered invalid.
Folder structure
Below is an example of the output structure for running one sample. The STARsolo folder would contain additional samples as required.
scRNAseq\n\u2514\u2500\u2500 results/\n \u251c\u2500\u2500 export_files.tar.gz\n \u251c\u2500\u2500 reports/\n \u2502 \u251c\u2500\u2500 pretrim_multiqc_report.html\n \u2502 \u2514\u2500\u2500 posttrim_multiqc_report.html\n \u2514\u2500\u2500 STARsolo/\n \u2514\u2500\u2500 sample1/\n \u251c\u2500\u2500 sample1.Solo.out/\n \u2502 \u251c\u2500\u2500 Gene/\n \u2502 \u2502 \u251c\u2500\u2500 filtered/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 matrix.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 raw/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 matrix.mtx.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 UniqueAndMult-EM.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 Features.stats\n \u2502 \u2502 \u251c\u2500\u2500 Summary.csv\n \u2502 \u2502 \u2514\u2500\u2500 UMIperCellSorted.txt\n \u2502 \u251c\u2500\u2500 GeneFull_Ex50pAS/\n \u2502 \u2502 \u251c\u2500\u2500 filtered/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 matrix.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 raw/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 matrix.mtx.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 UniqueAndMult-EM.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 Features.stats\n \u2502 \u2502 \u251c\u2500\u2500 Summary.csv\n \u2502 \u2502 \u2514\u2500\u2500 UMIperCellSorted.txt\n \u2502 \u251c\u2500\u2500 Velocyto/\n \u2502 \u2502 \u251c\u2500\u2500 filtered/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 ambiguous.mtx.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 spliced.mtx.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 unspliced.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 raw/\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 ambiguous.mtx.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 barcodes.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 features.tsv.gz\n \u2502 \u2502 \u2502 \u251c\u2500\u2500 spliced.mtx.gz\n \u2502 \u2502 \u2502 \u2514\u2500\u2500 unspliced.mtx.gz\n \u2502 \u2502 \u251c\u2500\u2500 Features.stats\n \u2502 \u2502 \u2514\u2500\u2500 Summary.csv\n \u2502 \u2514\u2500\u2500 Barcodes.stats\n \u251c\u2500\u2500 sample1.Log.final.out\n \u251c\u2500\u2500 sample1.Log.out\n \u251c\u2500\u2500 sample1.Log.progress.out\n \u2514\u2500\u2500 versions.yml\n
"},{"location":"PublicDatasets/public-datasets/","title":"Public datasets","text":"Here we provide a list of publicly-available datasets that we have generated and uploaded to repositories. Some of the data is yet to be released, and will be available following publication.
"},{"location":"PublicDatasets/public-datasets/#ncbi-sequencing-read-archive","title":"NCBI Sequencing Read Archive","text":"The following datasets have been uploaded to the NCBI Sequencing Read Archive (SRA) database in their original FASTQ data format.
"},{"location":"PublicDatasets/public-datasets/#summary","title":"Summary","text":"Sequencing type Sequencing runs (uploaded) Bulk transcriptomics 425 Single-cell transcriptomics 2 Shotgun metagenomics 310 16S amplicon 1,146 ITS amplicon 373"},{"location":"PublicDatasets/public-datasets/#datasets","title":"Datasets","text":"Host organism Context BioProject Availability Bulk transcriptomics Single-cell transcriptomics Shotgun metagenomics 16S amplicon ITS amplicon Mouse SHIP-deficient model of Crohn's-like ileitis and chronic lung inflammation PRJNA1086166 \u2013 2024 Released 24 stool samples Human Paediatric severe wheeze + asthma PRJNA1080233 \u2013 2024 Released 55 bronchial brushes 28 bronchial brushes Human Paediatric healthy + infant wheeze PRJNA1076275 \u2013 2024 Released 188 nasal swabs + 73 blood samples 320 nasal swabs 135 nasal swabs Human Infant cystic fibrosis PRJNA978345 \u2013 2024 Released 96 stool samples 75 BAL samples Rat Early life stress + mild traumatic brain injury PRJNA940177 \u2013 2024 Released 76 stool samples Mouse OTII cells Germinal centre expansion + IL-21 role PRJNA776662 \u2013 2021 Released 8 culture samples Human Early life + airways PRJNA694493 \u2013 2021 Released 85 nasal swabs 118 nasal swabs + 119 oropharyngeal swabs 119 nasal swabs + 119 oropharyngeal swabs Mouse Allergic airway inflammation PRJNA641984 \u2013 2020 Released 20 stool samples 127 stool samples Human Male-associated infertility PRJNA509076 \u2013 2018 Released 94 seminal fluid samples Human Early life + immune development PRJNA475630 \u2013 2018 Released 16 tracheal aspirates 45 tracheal aspirates Mouse High fat diet PRJNA1131116 To be released 24 ileum luminal samples + 24 ileum mucosal samples + 22 colon luminal samples 77 stool samples Mouse Early life antibiotic treatment PRJNA1112091 To be released 2 lung structural cell digests 96 stool samples 41 lung tissue samples + 30 BAL samples"},{"location":"PublicDatasets/public-datasets/#european-nucleotide-archive","title":"European Nucleotide Archive","text":"The following datasets have been uploaded to the European Nucleotide Archive (ENA) database in their original FASTQ data format.
"},{"location":"PublicDatasets/public-datasets/#summary_1","title":"Summary","text":"Sequencing type Sequencing runs (uploaded) 16S amplicon 1,179"},{"location":"PublicDatasets/public-datasets/#datasets_1","title":"Datasets","text":"Host organism Context Project ID 16S amplicon Availability Human Early life + atopic dermatitis PRJEB42268 \u2013 2022 Released 1,179 lateral upper arm swabs"},{"location":"RNAseq/rnaseq-nfcore/","title":"Processing RNA sequencing data with nf-core","text":""},{"location":"RNAseq/rnaseq-nfcore/#overview","title":"Overview","text":"Here we will describe the process for processing RNA sequencing data using the nf-core/rnaseq pipeline. This document was written as of version 3.14.0
nf-core/rnaseq is a bioinformatics pipeline that can be used to analyse RNA sequencing data obtained from organisms with a reference genome and annotation. It takes a samplesheet and FASTQ files as input, performs quality control (QC), trimming and (pseudo-)alignment, and produces a gene expression matrix and extensive QC report.
Full details of the pipeline and the many customisable options can be view on the pipeline website.
"},{"location":"RNAseq/rnaseq-nfcore/#installation","title":"Installation","text":"In this section, we discuss the installation process on the M3 MASSIVE cluster.
"},{"location":"RNAseq/rnaseq-nfcore/#create-nextflow-environment","title":"Create nextflow environment \ud83d\udc0d","text":"To begin with, we need to create a new environment using mamba. Mamba is recommended here over conda due to its massively improved dependency solving speeds and parallel package downloading (among other reasons).
# Create environment\nmamba create -n nextflow nextflow \\\n salmon=1.10.0 fq fastqc umi_tools \\\n trim-galore bbmap sortmerna samtools \\\n picard stringtie bedtools rseqc \\\n qualimap preseq multiqc subread \\\n ucsc-bedgraphtobigwig ucsc-bedclip \\\n bioconductor-deseq2\n\n# Activate environment\nmamba activate nextflow\n
"},{"location":"RNAseq/rnaseq-nfcore/#download-and-compile-rsem","title":"Download and compile RSEM","text":"RSEM is a software package for estimating gene and isoform expression levels from RNA-Seq data.
# Download RSEM\ngit clone https://github.com/deweylab/RSEM\n\n# Enter the directory (RSEM) and compile\ncd RSEM; make\n
Make note of this directory for your run script so you can add this to your PATH variable.
"},{"location":"RNAseq/rnaseq-nfcore/#prepare-your-sample-sheet","title":"Prepare your sample sheet \u270f\ufe0f","text":"You will need to have a sample sheet prepared that contains a sample name, the fastq.gz
file paths, and the strandedness of the read files.
If you are working with a single-ended sequencing run, leave the fastq_2
column empty, but the header still needs to be included.
For example, samplesheet.csv
:
sample,fastq_1,fastq_2,strandedness\nCONTROL_REP1,AEG588A1_S1_L002_R1_001.fastq.gz,AEG588A1_S1_L002_R2_001.fastq.gz,auto\nCONTROL_REP1,AEG588A1_S1_L003_R1_001.fastq.gz,AEG588A1_S1_L003_R2_001.fastq.gz,auto\nCONTROL_REP1,AEG588A1_S1_L004_R1_001.fastq.gz,AEG588A1_S1_L004_R2_001.fastq.gz,auto\n
Each row represents a fastq file (single-end) or a pair of fastq files (paired end). Rows with the same sample identifier are considered technical replicates and merged automatically. The strandedness refers to the library preparation and will be automatically inferred if set to auto.
"},{"location":"RNAseq/rnaseq-nfcore/#run-the-pipeline","title":"Run the pipeline \ud83c\udf4f","text":""},{"location":"RNAseq/rnaseq-nfcore/#start-a-new-interactive-session","title":"Start a new interactive session","text":"Firstly, we will start a new interactive session on the M3 MASSIVE cluster.
smux n --time=2-00:00:00 --mem=64GB --ntasks=1 --cpuspertask=12 -J nf-core/rnaseq\n
Once we are inside the interactive session, we need to select an appropriate version of the Java JDK to use. For the Nextflow pipeline we will be running, we need at least version 17+.
# View available java JDK modules\nmodule avail java\n\n# Load an appropriate one (over version 17)\nmodule load java/openjdk-17.0.2\n\n# Can double-check the correct version is loaded\njava --version\n
"},{"location":"RNAseq/rnaseq-nfcore/#test-your-set-up-optional","title":"Test your set-up (optional) \ud83e\uddba","text":"This step is optional, but highly advisable for a first-time setup or when re-installing.
nextflow run nf-core/rnaseq -r 3.14.0 \\\n -profile test \\\n --outdir test \\\n -resume \\\n --skip-dupradar \\\n --skip_markduplicates\n
Um... why are we skipping things?
- We skip the
dupradar
step, because to install bioconductor-dupradar
, mamba wants to downgrade salmon
to a very early version, which is not ideal - We also skip the
markduplicates
step because it is not recommended to remove duplicates anyway due to normal biological duplicates (i.e. there won't just be 1 copy of a given gene in a complete sample)
"},{"location":"RNAseq/rnaseq-nfcore/#download-genome-files","title":"Download genome files \ud83e\uddec","text":"To avoid issues with genome incompatibility with the version of STAR you are running, it is recommended to simply download the relevant genome fasta and GTF files using the following scripts, and then supply them directly to the function call.
Human genome files \ud83d\udc68\ud83d\udc69Mouse genome files \ud83d\udc01 01_retrieve_human_genome.sh#!/bin/bash\nVERSION=111\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz\n
01_retrieve_mouse_genome.sh#!/bin/bash\nVERSION=111\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz\nwget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz\n
"},{"location":"RNAseq/rnaseq-nfcore/#run-your-rna-sequencing-reads","title":"Run your RNA sequencing reads \ud83c\udfc3","text":"To avoid typing the whole command out (and in case the pipeline crashes), create a script that will handle the process. Two examples are given here, with one for human samples, and one for mouse samples.
- You will need to replace the RSEM folder location with your own path from above.
- Using the
save_reference
option stores the formatted genome files to save time if you need to resume or restart the pipeline.
Human run script \ud83d\udc68\ud83d\udc69Mouse genome files \ud83d\udc01 02_run_rnaseq_human.sh#!/bin/bash\nmodule load java/openjdk-17.0.2\nexport PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/\n\nnextflow run nf-core/rnaseq -r 3.14.0 \\\n --input samplesheet.csv \\\n --outdir rnaseq_output \\\n --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \\\n --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \\\n --skip_dupradar \\\n --skip_markduplicates \\\n --save_reference \\\n -resume\n
02_run_rnaseq_mouse.sh#!/bin/bash\nmodule load java/openjdk-17.0.2\nexport PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/\n\nnextflow run nf-core/rnaseq -r 3.14.0 \\\n --input samplesheet.csv \\\n --outdir rnaseq_output \\\n --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \\\n --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \\\n --skip_dupradar \\\n --skip_markduplicates \\\n --save_reference \\\n -resume\n
"},{"location":"RNAseq/rnaseq-nfcore/#import-data-into-r","title":"Import data into R \ud83d\udce5","text":"We have a standardised method for importing data into R. Luckily for us, the NF-CORE/rnaseq pipeline outputs are provided in .rds
format as SummarizedExperiment
objects, with bias-corrected gene counts without an offset.
salmon.merged.gene_counts_length_scaled.rds
Tell me more! - There are two matrices provided to us:
counts
and abundance
. - The
counts
matrix is a re-estimated counts table that aims to provide count-level data to be compatible with downstream tools such as DESeq2. - The
abundance
matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity \u2014 we want to assign a fraction of the total expression (whatever that may be) to each transcript, regardless of whether our library is 10M fragments or 100M fragments.
- The
tximport
package has a single function for importing transcript-level estimates. The type argument is used to specify what software was used for estimation. A simple list with matrices, \"abundance\"
, \"counts\"
, and \"length\"
, is returned, where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the \"length\"
matrix contains the effective gene lengths. The \"length\"
matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices.
"},{"location":"RNAseq/rnaseq-nfcore/#r-code-for-import-and-voom-normalisation","title":"R code for import and voom-normalisation","text":"Here we show our standard process for preparing RNAseq data for downstream analysis.
Prepare Voom-normalised DGE List# Load R packages\npkgs <- c('knitr', 'here', 'SummarizedExperiment', 'biomaRt', 'edgeR', 'limma')\npacman::p_load(char = pkgs)\n\n# Import the bias-corrected counts from STAR Salmon\nrna_data <- readRDS(here('input', 'salmon.merged.gene_counts_length_scaled.rds'))\n\n# Get Ensembl annotations\nensembl <- useMart('ensembl', dataset = 'hsapiens_gene_ensembl')\n\nensemblIDsBronch <- rownames(rna_bronch)\n\ngene_list <- getBM(attributes = c('ensembl_gene_id', 'hgnc_symbol', 'gene_biotype'),\n filters = 'ensembl_gene_id', values = ensemblIDsBronch, mart = ensembl)\ncolnames(gene_list) <- c(\"gene_id\", \"hgnc_symbol\", \"gene_biotype\")\ngene_list <- filter(gene_list, !duplicated(gene_id))\n\n# Ensure that only genes in the STAR Salmon outputs are kept for the gene list\nrna_data <- rna_data[rownames(rna_data) %in% gene_list$gene_id, ]\n\n# Add the ENSEMBL data to the rowData element\nrowData(rna_data) <- merge(gene_list, rowData(rna_data), by = \"gene_id\", all = FALSE)\n\n# Load the RNA metadata\nmetadata_rna <- read_csv(here('input', 'metadata_rna.csv'))\n\n# Sort the metadata rows to match the order of the abundance data\nrownames(metadata_rna) <- metadata_rna$RNA_barcode\nmetadata_rna <- metadata_rna[colnames(rna_data),]\n\n# Create a DGEList from the SummarizedExperiment object\nrna_data_dge <- DGEList(assay(rna_data, 'counts'), \n samples = metadata_rna, \n group = metadata_rna$group,\n genes = rowData(rna_data),\n remove.zeros = TRUE)\n\n# Filter the DGEList based on the group information\ndesign <- model.matrix(~ group, data = rna_data_dge$samples)\nkeep_min10 <- filterByExpr(rna_data_dge, design, min.count = 10)\nrna_data_dge_min10 <- rna_data_dge[keep_min10, ]\n\n# Calculate norm factors and perform voom normalisation\nrna_data_dge_min10 <- calcNormFactors(rna_data_dge_min10)\nrna_data_dge_min10 <- voom(rna_data_dge_min10, design, plot = TRUE)\n\n# Add the normalised abundance data from STAR Salmon and filter to match the counts data\nrna_data_dge_min10$abundance <- as.matrix(assay(rna_bronch, 'abundance'))[keep_min10, ]\n\n# Select protein coding defined genes only\nrna_data_dge_min10 <- rna_data_dge_min10[rna_data_dge_min10$genes$gene_biotype == \"protein_coding\" & rna_data_dge_min10$genes$hgnc_symbol != \"\", ]\n\n# Add symbol as rowname\nrownames(rna_data_dge_min10) <- rna_data_dge_min10$genes$gene_name\n\n# Save the DGEList\nsaveRDS(rna_data_dge_min10, here('input', 'rna_data_dge_min10.rds'))\n
"},{"location":"RNAseq/rnaseq-nfcore/#rights","title":"Rights","text":"NF-CORE/rnaseq
There are many people to thank here for writing and maintaining the NF-CORE/rnaseq pipeline (see here). If you use this pipeline for your analysis, please cite it using the following doi: 10.5281/zenodo.1400710
This document
- Copyright \u00a9 2024 \u2013 Mucosal Immunology Lab, Melbourne VIC, Australia
- Licence: These tools are provided under the MIT licence (see LICENSE file for details)
- Authors: M. Macowan
"},{"location":"Utilities/convert-raw-novaseq-outputs/","title":"Handling NovaSeq sequencing outputs","text":"Here we discuss how to process the raw sequencing reads directly from the Illumina NovaSeq sequencer.
"},{"location":"Utilities/convert-raw-novaseq-outputs/#what-you-should-have-out-of-the-box","title":"What you should have \"out of the box\" \ud83d\uddc3\ufe0f","text":"Our runs are stored in Vault storage, and need to be transferred to the M3 MASSIVE cluster for processing. To inspect your files, the simplest way is to use FileZilla by setting up an SFTP connection as below. You need to ensure you have file access to the Vault prior to this.
The basic file structure on the Vault should look something like below, with a main folder (long name) that contains the relevant files you need, and generally some sort of metadata file. You need to ensure that you have given all permissions to every file so that you can transfer them to the cluster \u2013 you can do this by right clicking the NovaSeq parent folder, selecting File Attributes...
, and then adding all of the Read
, Write
, and Execute
permissions, ensuring you select Recurse into subdirectories
.
"},{"location":"Utilities/convert-raw-novaseq-outputs/#transfer-files-to-the-cluster","title":"Transfer files to the cluster","text":""},{"location":"Utilities/convert-raw-novaseq-outputs/#sequencing-data-transfer","title":"Sequencing data transfer \ud83d\ude9b","text":"Navigate to an appropriate project folder on the cluster. An example command is shown below for transferring the data folder into a new folder called raw_data
using rsync
. If it doesn't exist, the folder you name will be created for you (just make sure you put a /
after the new folder name).
rsync -aHWv --stats --progress MONASH\\\\mmac0026@vault-v2.erc.monash.edu:Marsland-CCS-RAW-Sequencing-Archive/vault/03_NovaSeq/NovaSeq25_Olaf_Shotgun/231025_A00611_0223_AHGMNNDRX2/ raw_data/\n
"},{"location":"Utilities/convert-raw-novaseq-outputs/#bcl-convert-sample-sheet-preparation","title":"BCL Convert sample sheet preparation \ud83d\uddd2\ufe0f","text":"Create a sample sheet document for BCL Convert (the tool that will demultiplex and prepare out FASTQ files from the raw data). The full documentation can be viewed here.
The document should be in the following format, where index
is the i7 adapter sequence
and index2
is the i5 adapter sequence
. An additional first column called Lane
can be provided to specify a particular lane number only for FASTQ file generation. We will call this file samplesheet.txt
.
For the indexes, both sequences used on the sample sheet should be the reverse complement of the actual sequences.
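If you want to sanity-check a reverse complement before filling in the sheet, a quick sketch using standard Unix tools:
# Reverse complement of the example i7 index TAAGGCGA\necho 'TAAGGCGA' | rev | tr 'ACGT' 'TGCA'\n# TCGCCTTA\n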
Ensure correct file encoding \ud83e\ude9f\ud83d\udc40
If you make this on a Windows system, ensure you save your output encoded by UTF-8
and not UTF-8 with BOM
.
[Header]\nFileFormatVersion,2\n\n[BCLConvert_Settings]\nCreateFastqForIndexReads,0\n\n[BCLConvert_Data]\nSample_ID,i7_adapter,index,i5_adapter,index2\nAbx1_d21,N701,TAAGGCGA,S502,ATAGAGAG\nAbx2_d21,N702,CGTACTAG,S502,ATAGAGAG\nAbx3_d21,N703,AGGCAGAA,S502,ATAGAGAG\nAbx4_d21,N704,TCCTGAGC,S502,ATAGAGAG\nAbx5_d21,N705,GGACTCCT,S502,ATAGAGAG\n#etc.\n
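Once saved, you can check the encoding from the command line and strip a UTF-8 BOM in place if one has crept in (GNU sed assumed):
# Report the file encoding (will mention a BOM if present)\nfile samplesheet.txt\n\n# Remove a UTF-8 BOM in place\nsed -i '1s/^\\xEF\\xBB\\xBF//' samplesheet.txt\n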
"},{"location":"Utilities/convert-raw-novaseq-outputs/#bcl-convert","title":"BCL Convert \ud83d\udd04","text":""},{"location":"Utilities/convert-raw-novaseq-outputs/#install","title":"Install \u2b07\ufe0f","text":"If you feel the need to have the latest version, visit the Illumina support website and copy the link for the latest CentOS version of the BCL Convert tool.
Otherwise use the version that is available on the M3 MASSIVE cluster, and skip to the run section.
# Download from the support website in the main folder\nwget https://webdata.illumina.com/downloads/software/bcl-convert/bcl-convert-4.2.4-2.el7.x86_64.rpm\n\n# Install using rpm2cpio (change file name as required)\nmodule load rpm2cpio\nrpm2cpio bcl-convert-4.2.4-2.el7.x86_64.rpm | cpio -idv\n
The most up-to-date `bcl-convert` will be inside the output `usr/bin/` folder, and can be called from that location.

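You can quickly check that the extracted binary runs before using it (assuming your shell is still in the folder where you ran `rpm2cpio`, and that your BCL Convert build supports the `-V` version flag):

```bash
# Print the version to confirm the extracted binary works
./usr/bin/bcl-convert -V
```
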
"},{"location":"Utilities/convert-raw-novaseq-outputs/#run","title":"Run \ud83c\udfc3","text":"With the raw_data
folder and samplesheet.txt
both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session.
!!! warning "Open file limit error"

    You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves. You can request an increased open file limit from the M3 MASSIVE help desk.

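    You can check the current limits for your session before launching a run:

    ```bash
    # Current soft limit (applies to new processes)
    ulimit -Sn

    # Hard limit (the ceiling a user could raise the soft limit to)
    ulimit -Hn
    ```
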
!!! question "Can I run this on my local machine?"

    Please note that the node `m3k010` has been decommissioned due to system upgrades. However, it is more than possible to run this process quickly on a local machine if you have the raw BCL files available. The minimum requirements (as of BCL Convert v4.0) are:

    - Hardware requirements
        - Single multiprocessor or multicore computer
        - Minimum 64 GB RAM
    - Software requirements
        - Root access to your computer
        - File system access to adjust `ulimit`

    On the cluster, you could previously start an interactive bash session on the designated node and increase the open file limit as follows:

    ```bash
    # Begin a new interactive bash session on the designated node
    # (m3k010 has since been decommissioned - substitute a current genomics node)
    srun --pty --partition=genomics --qos=genomics --nodelist=m3k010 --mem=320GB --ntasks=1 --cpus-per-task=48 bash -i

    # Increase the open file limit to 65,535
    ulimit -n 65535
    ```

```bash
# Run bcl-convert
bcl-convert \
    --bcl-input-directory raw_data \
    --output-directory fastq_files \
    --sample-sheet samplesheet.txt
```

This will create a new output folder called `fastq_files` that contains your demultiplexed samples.

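It is worth sanity-checking the output before moving on; a minimal sketch (the file name here is hypothetical – yours will follow your `Sample_ID` values):

```bash
# List the demultiplexed FASTQ files
ls fastq_files/*.fastq.gz

# Count the reads in one file (FASTQ records are 4 lines each)
echo $(( $(zcat fastq_files/Abx1_d21_S1_L001_R1_001.fastq.gz | wc -l) / 4 ))
```
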
"},{"location":"Utilities/convert-raw-novaseq-outputs/#merge-lanes","title":"Merge lanes \u26d9","text":"If you ran your samples without lane splitting, then you can merge the two lanes together using the following code, saved in the main project folder as merge_lanes.sh
, and run using the command: bash merge_lanes.sh
.
```bash title="merge_lanes.sh"
#!/bin/bash

# Merge lanes 1 and 2, looping over one lane-1 R1 file per sample
cd fastq_files
for f in *_L001_R1_001.fastq.gz
do
    Basename=${f%_L00*}
    ## merge R1
    cat ${Basename}_L00*_R1_001.fastq.gz > ${Basename}_R1.fastq.gz
    ## merge R2
    cat ${Basename}_L00*_R2_001.fastq.gz > ${Basename}_R2.fastq.gz
done

# Remove individual per-lane files to make space
rm -f *L00*
```

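If you want to verify the merge before the script's final `rm` step removes the per-lane files, comment that line out and compare line counts (sample name hypothetical); concatenated gzip files decompress as one stream, so the two counts should match:

```bash
# Total lines across the per-lane R1 files
zcat fastq_files/Abx1_d21_S1_L00*_R1_001.fastq.gz | wc -l

# Lines in the merged R1 file - should be identical
zcat fastq_files/Abx1_d21_S1_R1.fastq.gz | wc -l
```
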
"},{"location":"Utilities/sra-data-submission/","title":"SRA sequencing data submission","text":"A guide to submitting sequencing data to the National Center for Biotechnology Information (NCBI) sequencing read archive (SRA) database. Includes information on uploading data to the SRA using the high-speed Aspera Connect tool.
!!! warning "Patient-derived sequencing files"

    If your samples are derived from humans, ensure that your file names include no reference to patient identifiers. Once uploaded to the SRA database, it is very difficult to change the names of files: you must directly contact the database to arrange for removal of the files so that you can reupload the data, followed by a difficult process of re-mapping the new uploads to your existing SRA metadata files.

    Also ensure that you only include the absolute minimum amount of metadata, in a manner that protects patient confidentiality. Absolutely no information should be unique to one single patient in your cohort, even an age (if you have a patient with a unique age, this should be replaced with `NA` for the purposes of SRA submission). For manuscripts, you can include a phrase indicating that further metadata is available upon reasonable request. The important thing here is to not infringe on patient privacy and confidentiality.

    Things you could potentially include:

    - Modified and anonymised patient ID
    - Sampling group
    - Timepoint (not exact days or months)
    - Sex
    - Collection year (no exact dates)
    - Tissue

"},{"location":"Utilities/sra-data-submission/#process-overview","title":"Process overview","text":" - Register a BioProject
- Register BioSamples for the related BioProject
- Submit data to SRA
"},{"location":"Utilities/sra-data-submission/#register-a-bioproject","title":"Register a BioProject \ud83d\udcd4","text":"The BioProject is an important element that can link together different types of sequencing data, and represents all the sequencing data for a given experiment.
Go to the SRA submission website to register a new BioProject.
- Sample scope: Multispecies (if you have microbiome data)
- Target description: Bacterial 16S metagenomics (change if you have shotgun metagenomics and/or host transcriptomics)
- Organism name: Human (change if using mouse or rat data)
- Project type: Metagenome (add transcriptome if you also have host transcriptomics)
"},{"location":"Utilities/sra-data-submission/#register-biosamples-test_tube","title":"Register BioSamples :test_tube:","text":""},{"location":"Utilities/sra-data-submission/#microbiome-data","title":"Microbiome data \ud83e\udda0","text":"Microbiome samples will be registered as MIMARKS Specimen samples. On the BioSample Attributes tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the EMBL-EBI Ontology Lookup Service.
- Use the BioProject accession number previously generated
- Organism:
human metagenome
(or as appropriate) - Env broad scale:
host-associated
- Env local scale:
mammalia-associated habitat
- Env medium: (as appropriate)
- Strain, isolate, cultivar, ecotype:
NA
- Add any other relevant host information in the table, as well as the host tissue samples
- Any other column which is not relevant can be set to
NA
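As an illustration only – the real template has more required columns, so use the exact headers from the downloaded template and treat this as a sketch with placeholder values (`faeces` stands in for your own sample type):

```text
sample_name   organism           env_broad_scale   env_local_scale               env_medium   strain
Sample01      human metagenome   host-associated   mammalia-associated habitat   faeces       NA
```
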
The SRA Metadata tab is what will join everything together. Once again, download the provided Excel template, and fill everything in carefully.

- Sample name: the base name of your samples
- Library ID: you may have named your files differently than your sample names – provide this if so, otherwise you can repeat the sample name
- Title: a short description of the sample in the form "`{methodology}` of `{organism}`: `{sample_info}`" – e.g. "Shotgun metagenomics of Homo sapiens: childhood bronchial brushing"
- Library strategy: `WGS`
- Library source: `METAGENOMIC`
- Library selection: `RANDOM`
- Library layout: `paired`
- Platform: `ILLUMINA`
- Instrument model: `Illumina NovaSeq 6000`
- Design description: `NA`
- Filetype: `fastq`
- Filename: the file name of the forward reads
- Filename2: the file name of the reverse reads

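Putting the fields together, a single (entirely hypothetical) row of the SRA metadata sheet would look like this in CSV form:

```text
sample_name,library_ID,title,library_strategy,library_source,library_selection,library_layout,platform,instrument_model,design_description,filetype,filename,filename2
Sample01,Sample01,Shotgun metagenomics of Homo sapiens: childhood bronchial brushing,WGS,METAGENOMIC,RANDOM,paired,ILLUMINA,Illumina NovaSeq 6000,NA,fastq,Sample01_R1.fastq.gz,Sample01_R2.fastq.gz
```
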
"},{"location":"Utilities/sra-data-submission/#transcriptomics-data","title":"Transcriptomics data \ud83d\udc68\ud83d\udc2d","text":"Host transcriptomics samples will be registered as either HUMAN or Model organism or animal samples. On the BioSample Attributes tab, download the BioSample metadata Excel template, and complete it accordingly before uploading. Be very careful with the required field formats. You can double check ontology using the EMBL-EBI Ontology Lookup Service.
- Use the BioProject accession number previously generated
- Organism:
Homo sapiens
(or Mus musculus
/Rattus norvegicus
as appropriate) - Isolate: NA
- Age: fill this in, but leave
NA
for human samples if it would result in a unique combination of metadata variables with potential to allow identification of any individual. - Biomaterial provider: enter the lab, organisation etc. that provided the samples
- Collection date: do not enter any exact dates for human samples
- Geo loc name: country in which samples were collected
- Sex: provide sex of host
- Tissue: specify tissue origin of samples
- Add any other relevant data, such as sampling group
As above, the SRA Metadata tab is where the magic will happen :magic_wand:. Once again, download the provided Excel template, and fill everything in carefully.

- Sample name: the base name of your samples
- Library ID: you may have named your files differently than your sample names – provide this if so, otherwise you can repeat the sample name
- Title: a short description of the sample in the form "`{methodology}` of `{organism}`: `{sample_info}`" – e.g. "RNA-Seq of Homo sapiens: childhood bronchial brushing"
- Library strategy: `RNA-Seq`
- Library source: `TRANSCRIPTOMIC`
- Library selection: `RANDOM`
- Library layout: `paired`
- Platform: `ILLUMINA`
- Instrument model: `Illumina NovaSeq 6000`
- Design description: `NA`
- Filetype: `fastq`
- Filename: the file name of the forward reads
- Filename2: the file name of the reverse reads

"},{"location":"Utilities/sra-data-submission/#submit-data-to-sra","title":"Submit data to SRA \ud83d\udce4","text":"Which upload option should I choose?
You can choose either of the following upload options, and each has pros and cons.
- Filezilla allows parallel uploads according to your settings, but upload speed is typically slower.
- Aspera Connect (at least with NCBI) only allows sequential uploads, but the upload speed is significantly faster.
"},{"location":"Utilities/sra-data-submission/#filezilla","title":"FileZilla \ud83e\udd96","text":"Using FileZilla is more effective when you have large files and/or a large number of files.
In FileZilla, open the sites manager and connect to NCBI as follows: - Protocol: FTP
- Host: ftp-private.ncbi.nlm.nih.gov
- Username: subftp
- Password: this is your user-specific NCBI password given when you submit your data
In the Advanced
tab next to the General
tab, set the Default remote directory
field to the directory specified by NCBI. This will looks something like: /uploads/{username}_{uniqueID}
.
Select connect, and gain access to your account folder on the NCBI FTP server.
Create a new project folder within the main upload folder, and enter the folder. Add your files to the upload queue, and begin the upload process.
"},{"location":"Utilities/sra-data-submission/#aspera-connect","title":"Aspera Connect","text":"The IBM Aspera Connect tool allows for much faster uploads than FileZilla, and is a good alternative for large files.
"},{"location":"Utilities/sra-data-submission/#linux-process","title":"Linux process \ud83d\udc27","text":"The process described here is for Linux, but is similar for Windows and MacOS operating systems. More information is provided on the IBM website.
- Download the Aspera Connect software.
- Open a new terminal window (
Ctrl+Alt+T
) - Navigate to downloads, extract the
tar.gz
file. - Run the install script.
    ```bash
    # Extract the file (substitute the actual version and platform in the file name)
    tar -zxvf ibm-aspera-connect-version+platform.tar.gz

    # Run the install script
    ./ibm-aspera-connect-version+platform.sh
    ```

5. Add the Aspera Connect `bin` folder to your PATH variable (reopen the terminal to apply the changes).

    ```bash
    # Add the folder to PATH (note the closing quote sits before the redirect)
    echo 'export PATH=$PATH:/home/{user}/.aspera/connect/bin/' >> ~/.bashrc
    ```

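    After reopening the terminal, you can confirm that the shell now finds the tool:

    ```bash
    # Confirm ascp is on the PATH
    which ascp
    ```
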
6. Download the NCBI Aspera Connect key file.
7. Navigate to the parent folder of the folder containing the files you want to upload to the SRA database, and create a new bash script.

    ```bash
    # Create a new bash script file
    touch upload_seq_data.sh
    ```

8. Add the following code to the bash script file.
    - The `-i` argument is the path to the key file, and must be given as a full path (not a relative one).
    - The `-d` argument specifies that the directory will be created if it doesn't exist.
    - You can adjust the maximum upload speed using the `-l500m` argument, where `500` is the speed in Mbps. You could increase or decrease this as desired.
    - Add the folder containing the data to upload, which can be relative to the folder containing the bash script.
    - Next, provide the upload folder given to you by NCBI, which will be user-specific, and ensure you add a project folder at the end of this path. Data will not be available if it is uploaded into the main uploads folder.

    ```bash title="upload_seq_data.sh"
    #!/bin/bash
    ascp -i {/full/path/to/key-file/aspera.openssh} -QT -l500m -k1 -d {./name-of-seq-data-folder} subasp@upload.ncbi.nlm.nih.gov:uploads/{user-specific-ID}/{name-of-project}
    ```

9. Run the bash script, and upload all files. The default settings will allow you to resume uploads if they are interrupted (`-k1`), and files that are identical in the destination folder will not be overwritten.

    ```bash
    # Run the script
    bash upload_seq_data.sh
    ```
"}]}
\ No newline at end of file
diff --git a/site/sitemap.xml.gz b/site/sitemap.xml.gz
index 0db6900..cdf8b45 100644
Binary files a/site/sitemap.xml.gz and b/site/sitemap.xml.gz differ