From 9e637e7f593f4fa5643e9180c98c6cf4847de623 Mon Sep 17 00:00:00 2001 From: Matthew Macowan Date: Sun, 8 Dec 2024 13:08:18 +1100 Subject: [PATCH] Editing RNAseq and scRNAseq pages --- docs/NextFlow/scRNAseq.md | 90 +++++----- docs/RNAseq/rnaseq-nfcore.md | 125 +++++++------- docs/Utilities/convert-raw-novaseq-outputs.md | 18 +- mkdocs.yml | 2 + site/RNAseq/rnaseq-nfcore/index.html | 154 +++++------------- site/search/search_index.json | 2 +- site/sitemap.xml.gz | Bin 127 -> 127 bytes 7 files changed, 163 insertions(+), 228 deletions(-) diff --git a/docs/NextFlow/scRNAseq.md b/docs/NextFlow/scRNAseq.md index 96f0f3d..018a906 100644 --- a/docs/NextFlow/scRNAseq.md +++ b/docs/NextFlow/scRNAseq.md @@ -88,53 +88,55 @@ Create a new folder somewhere to store your genome files. Enter the new folder, STAR should be loaded already via the conda environment for the genome indexing step. We will set `--sjdbOverhang` to 79 to be suitable for use with the longer `R2` FASTQ data resulting from BD Rhapsody single cell sequencing. This may require alteration for other platforms. **Essentially, you just need to set `--sjdbOverhang` to the length of your R2 sequences minus 1.** -#### Human genome files 👨👩 - -```bash title="01_retrieve_human_genome.sh" -#!/bin/bash -VERSION=111 -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz -gunzip * -``` - -Then use STAR to prepare the genome index. 
- -```bash title="02_index_human_genome.sh" -#!/bin/bash -VERSION=111 -STAR \ - --runThreadN 16 \ - --genomeDir "STARgenomeIndex79/" \ - --runMode genomeGenerate \ - --genomeFastaFiles "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa" \ - --sjdbGTFfile "Homo_sapiens.GRCh38.$VERSION.gtf" \ - --sjdbOverhang 79 -``` - -#### Mouse genome files 🐁 - -```bash title="01_retrieve_mouse_genome.sh" -#!/bin/bash -VERSION=111 -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz -gunzip * -``` +=== "Human genome files 👨👩" + + ```bash title="01_retrieve_human_genome.sh" + #!/bin/bash + VERSION=111 + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz + gunzip * + ``` + +=== "Mouse genome files 🐁" + + ```bash title="01_retrieve_mouse_genome.sh" + #!/bin/bash + VERSION=111 + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz + gunzip * + ``` Then use STAR to prepare the genome index. 
-```bash title="02_index_mouse_genome.sh" -#!/bin/bash -VERSION=111 -STAR \ - --runThreadN 16 \ - --genomeDir "STARgenomeIndex79/" \ - --runMode genomeGenerate \ - --genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \ - --sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \ - --sjdbOverhang 79 -``` +=== "Human genome files 👨👩" + + ```bash title="02_index_human_genome.sh" + #!/bin/bash + VERSION=111 + STAR \ + --runThreadN 16 \ + --genomeDir "STARgenomeIndex79/" \ + --runMode genomeGenerate \ + --genomeFastaFiles "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa" \ + --sjdbGTFfile "Homo_sapiens.GRCh38.$VERSION.gtf" \ + --sjdbOverhang 79 + ``` + +=== "Mouse genome files 🐁" + + ```bash title="02_index_mouse_genome.sh" + #!/bin/bash + VERSION=111 + STAR \ + --runThreadN 16 \ + --genomeDir "STARgenomeIndex79/" \ + --runMode genomeGenerate \ + --genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \ + --sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \ + --sjdbOverhang 79 + ``` ### Prepare your sample sheet ✏️ diff --git a/docs/RNAseq/rnaseq-nfcore.md b/docs/RNAseq/rnaseq-nfcore.md index f1e26d7..e45d11e 100644 --- a/docs/RNAseq/rnaseq-nfcore.md +++ b/docs/RNAseq/rnaseq-nfcore.md @@ -99,92 +99,95 @@ nextflow run nf-core/rnaseq -r 3.14.0 \ --skip_markduplicates ``` -* We skip the `dupradar` step, because to install `bioconductor-dupradar`, mamba wants to downgrade `salmon` to a very early version, which is not ideal :facepalm: -* We also skip the `markduplicates` step because it is not recommended to remove duplicates anyway due to normal biological duplicates (i.e. there won't just be 1 copy of a given gene in a complete sample) :mag::sparkles: +!!! info "Um... why are we skipping things?" 
-### Download genome files + * We skip the `dupradar` step, because to install `bioconductor-dupradar`, mamba wants to downgrade `salmon` to a very early version, which is not ideal :facepalm: + * We also skip the `markduplicates` step because it is not recommended to remove duplicates anyway due to normal biological duplicates (i.e. there won't just be 1 copy of a given gene in a complete sample) :mag::sparkles: + +### Download genome files 🧬 To avoid issues with genome incompatibility with the version of STAR you are running, it is recommended to simply download the relevant genome fasta and GTF files using the following scripts, and then supply them directly to the function call. -#### Human genome files 👨👩 +=== "Human genome files 👨👩" -```bash title="01_retrieve_human_genome.sh" -#!/bin/bash -VERSION=111 -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz -``` + ```bash title="01_retrieve_human_genome.sh" + #!/bin/bash + VERSION=111 + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz + ``` -#### Mouse genome files 🐁 +=== "Mouse genome files 🐁" -```bash title="01_retrieve_mouse_genome.sh" -#!/bin/bash -VERSION=111 -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz -wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz -``` + ```bash title="01_retrieve_mouse_genome.sh" + #!/bin/bash + VERSION=111 + wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz + wget -L 
ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz + ``` -### Run your RNA sequencing reads 🐁 +### Run your RNA sequencing reads 🏃 To avoid typing the whole command out (and in case the pipeline crashes), create a script that will handle the process. Two examples are given here, with one for human samples, and one for mouse samples. * You will need to replace the RSEM folder location with your own path from above. * Using the `save_reference` option stores the formatted genome files to save time if you need to resume or restart the pipeline. -#### Human run script 👨👩 - -```bash title="02_run_rnaseq_human.sh" -#!/bin/bash -module load java/openjdk-17.0.2 -export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/ - -nextflow run nf-core/rnaseq -r 3.14.0 \ - --input samplesheet.csv \ - --outdir rnaseq_output \ - --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \ - --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \ - --skip_dupradar \ - --skip_markduplicates \ - --save_reference \ - -resume - -``` - -#### Mouse run script 🐁 - -```bash title="02_run_rnaseq_mouse.sh" -#!/bin/bash -module load java/openjdk-17.0.2 -export PATH=$PATH:”/home/mmacowan/mf33/tools/RSEM/” - -nextflow run nf-core/rnaseq -r 3.14.0 \ - --input samplesheet.csv \ - --outdir rnaseq_output \ - --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \ - --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \ - --skip_dupradar \ - --skip_markduplicates \ - --save_reference \ - -resume -``` - -## Import data into R +=== "Human run script 👨👩" + + ```bash title="02_run_rnaseq_human.sh" + #!/bin/bash + module load java/openjdk-17.0.2 + export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/ + + nextflow run nf-core/rnaseq -r 3.14.0 \ + --input samplesheet.csv \ + --outdir rnaseq_output \ + --fasta 
/home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \ + --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \ + --skip_dupradar \ + --skip_markduplicates \ + --save_reference \ + -resume + ``` + +=== "Mouse run script 🐁" + + ```bash title="02_run_rnaseq_mouse.sh" + #!/bin/bash + module load java/openjdk-17.0.2 + export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/ + + nextflow run nf-core/rnaseq -r 3.14.0 \ + --input samplesheet.csv \ + --outdir rnaseq_output \ + --fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \ + --gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \ + --skip_dupradar \ + --skip_markduplicates \ + --save_reference \ + -resume + ``` + +## Import data into R 📥 We have a standardised method for importing data into R. Luckily for us, the NF-CORE/rnaseq pipeline outputs are provided in `.rds` format as `SummarizedExperiment` objects, with bias-corrected gene counts without an offset. * `salmon.merged.gene_counts_length_scaled.rds` -There are two matrices provided to us: `counts` and `abundance`. +??? info "Tell me more!" + + * There are two matrices provided to us: `counts` and `abundance`. + * The `counts` matrix is a re-estimated counts table that aims to provide count-level data to be compatible with downstream tools such as DESeq2. + * The `abundance` matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity — we want to assign a fraction of the total expression (whatever that may be) to each transcript, regardless of whether our library is 10M fragments or 100M fragments.
+ * The `tximport` package has a single function for importing transcript-level estimates. The type argument is used to specify what software was used for estimation. A simple list with matrices, `"abundance"`, `"counts"`, and `"length"`, is returned, where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the `"length"` matrix contains the effective gene lengths. The `"length"` matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices. -- The `abundance` matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity — we want to assign a fraction of the total expression (whatever that may be) to transcript, regardless of whether our library is 10M fragments or 100M fragments. -- The `counts` matrix is a re-estimated counts table that aims to provide count-level data to be compatible with downstream tools such as DESeq2. -- The `tximport` package has a single function for importing transcript-level estimates. The type argument is used to specify what software was used for estimation. A simple list with matrices, `"abundance"`, `"counts"`, and `"length"`, is returned, where the transcript level information is summarized to the gene-level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the `"length"` matrix contains the effective gene lengths. 
The `"length"` matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices. ### R code for import and voom-normalisation Here we show our standard process for preparing RNAseq data for downstream analysis. -```r +```r title="Prepare Voom-normalised DGE List" # Load R packages pkgs <- c('knitr', 'here', 'SummarizedExperiment', 'biomaRt', 'edgeR', 'limma') pacman::p_load(char = pkgs) diff --git a/docs/Utilities/convert-raw-novaseq-outputs.md b/docs/Utilities/convert-raw-novaseq-outputs.md index 1c66d41..64af662 100644 --- a/docs/Utilities/convert-raw-novaseq-outputs.md +++ b/docs/Utilities/convert-raw-novaseq-outputs.md @@ -30,7 +30,9 @@ The document should be in the following format, where `index` is the `i7 adapter For the indexes, **both** sequences used on the sample sheet should be the reverse complement of the actual sequences. -If you make this on a Windows system, ensure you save your output encoded by `UTF-8` and not `UTF-8 with BOM`. +!!! warning "Ensure correct file encoding 🪟👀" + + If you make this on a Windows system, ensure you save your output encoded by `UTF-8` and not `UTF-8 with BOM`. ```bash [Header] @@ -72,9 +74,11 @@ The most up-to-date bcl-convert will be inside the output `usr/bin/` folder, and With the `raw_data` folder and `samplesheet.txt` both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session. -You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves. +!!! warning "Open file limit error" + + You will need a very high limit for open files – BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024 and cannot be increased by users themselves. 
-You can request additional open file limit from the M3 MASSIVE help desk. + You can request additional open file limit from the M3 MASSIVE help desk. !!! question "Can I run this on my local machine?" @@ -84,11 +88,11 @@ You can request additional open file limit from the M3 MASSIVE help desk. The minimum requirements (as of BCL Convert v4.0) are: - **Hardware requirements** - - Single multiprocessor or multicore computer - - Minimum 64 GB RAM + - Single multiprocessor or multicore computer + - Minimum 64 GB RAM - **Software requirements** - - Root access to your computer - - File system access to adjust ulimit + - Root access to your computer + - File system access to adjust ulimit You can start an interactive bash session and increase the open file limit as follows: diff --git a/mkdocs.yml b/mkdocs.yml index 134910f..d8cc513 100644 --- a/mkdocs.yml +++ b/mkdocs.yml @@ -45,6 +45,8 @@ markdown_extensions: - pymdownx.tasklist - pymdownx.tilde - pymdownx.emoji + - pymdownx.tabbed: + alternate_style: true extra: analytics: diff --git a/site/RNAseq/rnaseq-nfcore/index.html b/site/RNAseq/rnaseq-nfcore/index.html index 68c2244..cefa6f2 100644 --- a/site/RNAseq/rnaseq-nfcore/index.html +++ b/site/RNAseq/rnaseq-nfcore/index.html @@ -372,67 +372,19 @@
  • - Download genome files - - - - -
  • - Run your RNA sequencing reads 🐁 + Run your RNA sequencing reads 🏃 - -
  • @@ -443,11 +395,11 @@
  • - Import data into R + Import data into R 📥 -
  • - Import data into R + Import data into R 📥 -