Commit 9e637e7

Editing RNAseq and scRNAseq pages

mmac0026 committed Dec 8, 2024
1 parent 45729eb commit 9e637e7
Showing 7 changed files with 163 additions and 228 deletions.
90 changes: 46 additions & 44 deletions docs/NextFlow/scRNAseq.md
@@ -88,53 +88,55 @@ Create a new folder somewhere to store your genome files. Enter the new folder,

STAR should be loaded already via the conda environment for the genome indexing step. We will set `--sjdbOverhang` to 79 to be suitable for use with the longer `R2` FASTQ data resulting from BD Rhapsody single cell sequencing. This may require alteration for other platforms. **Essentially, you just need to set `--sjdbOverhang` to the length of your R2 sequences minus 1.**
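
For example, a quick way to check your R2 read length directly from the FASTQ data (the file name here is hypothetical):

```bash
# Print the length of the first R2 read minus 1; use this value for --sjdbOverhang
zcat sample_R2.fastq.gz | head -n 2 | tail -n 1 | awk '{ print length($0) - 1 }'
```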

=== "Human genome files 👨👩"

```bash title="01_retrieve_human_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
gunzip *
```

=== "Mouse genome files 🐁"

```bash title="01_retrieve_mouse_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
gunzip *
```

Then use STAR to prepare the genome index.

```bash title="02_index_mouse_genome.sh"
#!/bin/bash
VERSION=111
STAR \
--runThreadN 16 \
--genomeDir "STARgenomeIndex79/" \
--runMode genomeGenerate \
--genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \
--sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \
--sjdbOverhang 79
```
=== "Human genome files 👨👩"

```bash title="02_index_human_genome.sh"
#!/bin/bash
VERSION=111
STAR \
--runThreadN 16 \
--genomeDir "STARgenomeIndex79/" \
--runMode genomeGenerate \
--genomeFastaFiles "Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa" \
--sjdbGTFfile "Homo_sapiens.GRCh38.$VERSION.gtf" \
--sjdbOverhang 79
```

=== "Mouse genome files 🐁"

```bash title="02_index_mouse_genome.sh"
#!/bin/bash
VERSION=111
STAR \
--runThreadN 16 \
--genomeDir "STARgenomeIndex79/" \
--runMode genomeGenerate \
--genomeFastaFiles "Mus_musculus.GRCm39.dna_sm.primary_assembly.fa" \
--sjdbGTFfile "Mus_musculus.GRCm39.$VERSION.gtf" \
--sjdbOverhang 79
```
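
If indexing completes successfully, the index directory should be populated; a quick sanity check (directory name taken from the scripts above):

```bash
# A complete STAR index contains files such as Genome, SA, and SAindex
ls -lh STARgenomeIndex79/
```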

### Prepare your sample sheet ✏️

125 changes: 64 additions & 61 deletions docs/RNAseq/rnaseq-nfcore.md
@@ -99,92 +99,95 @@ nextflow run nf-core/rnaseq -r 3.14.0 \
--skip_markduplicates
```

!!! info "Um... why are we skipping things?"

    * We skip the `dupradar` step, because to install `bioconductor-dupradar`, mamba wants to downgrade `salmon` to a very early version, which is not ideal :facepalm:
    * We also skip the `markduplicates` step because removing duplicates is not recommended anyway: biological duplicates are expected (i.e. there won't just be 1 copy of a given gene in a complete sample) :mag::sparkles:

### Download genome files 🧬

To avoid genome incompatibility issues with the version of STAR you are running, it is recommended to simply download the relevant genome FASTA and GTF files using the following scripts, and then supply them directly to the pipeline call.

=== "Human genome files 👨👩"

```bash title="01_retrieve_human_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
```
```bash title="01_retrieve_human_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/homo_sapiens/Homo_sapiens.GRCh38.$VERSION.gtf.gz
```

=== "Mouse genome files 🐁"

```bash title="01_retrieve_mouse_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
```
```bash title="01_retrieve_mouse_genome.sh"
#!/bin/bash
VERSION=111
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/fasta/mus_musculus/dna/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz
wget -L ftp://ftp.ensembl.org/pub/release-$VERSION/gtf/mus_musculus/Mus_musculus.GRCm39.$VERSION.gtf.gz
```

### Run your RNA sequencing reads 🏃

To avoid typing the whole command out (and in case the pipeline crashes), create a script that will handle the process. Two examples are given here: one for human samples and one for mouse samples.

* You will need to replace the RSEM folder location with your own path from above.
* Using the `save_reference` option stores the formatted genome files to save time if you need to resume or restart the pipeline.

=== "Human run script 👨👩"

```bash title="02_run_rnaseq_human.sh"
#!/bin/bash
module load java/openjdk-17.0.2
export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/

nextflow run nf-core/rnaseq -r 3.14.0 \
--input samplesheet.csv \
--outdir rnaseq_output \
--fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa.gz \
--gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Homo_sapiens.GRCh38.111.gtf.gz \
--skip_dupradar \
--skip_markduplicates \
--save_reference \
-resume
```

=== "Mouse genome files 🐁"

```bash title="02_run_rnaseq_mouse.sh"
#!/bin/bash
module load java/openjdk-17.0.2
export PATH=$PATH:/home/mmacowan/mf33/tools/RSEM/

nextflow run nf-core/rnaseq -r 3.14.0 \
--input samplesheet.csv \
--outdir rnaseq_output \
--fasta /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.dna_sm.primary_assembly.fa.gz \
--gtf /home/mmacowan/mf33/scratch_nobackup/RNA/Mus_musculus.GRCm39.111.gtf.gz \
--skip_dupradar \
--skip_markduplicates \
--save_reference \
-resume
```

## Import data into R 📥

We have a standardised method for importing data into R. Luckily for us, the nf-core/rnaseq pipeline outputs are provided in `.rds` format as `SummarizedExperiment` objects, with bias-corrected gene counts without an offset.

* `salmon.merged.gene_counts_length_scaled.rds`

??? info "Tell me more!"

    * There are two matrices provided to us: `counts` and `abundance`.
    * The `counts` matrix is a re-estimated counts table that aims to provide count-level data compatible with downstream tools such as DESeq2.
    * The `abundance` matrix is the scaled and normalised transcripts per million (TPM) abundance. TPM explicitly erases information about library size. That is, it estimates the relative abundance of each transcript proportional to the total population of transcripts sampled in the experiment. Thus, you can imagine TPM, in a way, as a partition of unity — we want to assign a fraction of the total expression (whatever that may be) to each transcript, regardless of whether our library is 10M fragments or 100M fragments.
    * The `tximport` package has a single function for importing transcript-level estimates. The `type` argument is used to specify what software was used for estimation. A simple list with matrices, `"abundance"`, `"counts"`, and `"length"`, is returned, where the transcript-level information is summarised to the gene level. Typically, abundance is provided by the quantification tools as TPM (transcripts-per-million), while the counts are estimated counts (possibly fractional), and the `"length"` matrix contains the effective gene lengths. The `"length"` matrix can be used to generate an offset matrix for downstream gene-level differential analysis of count matrices.

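Before running the full import code below, a quick sanity check of the object can be useful. This is a minimal sketch; the path assumes the default nf-core/rnaseq STAR + Salmon output layout (`<outdir>/star_salmon/`), so adjust it to your own `--outdir`:

```r
# Load the SummarizedExperiment and inspect its assays (path is an assumption)
library(SummarizedExperiment)

se <- readRDS('rnaseq_output/star_salmon/salmon.merged.gene_counts_length_scaled.rds')
assayNames(se)                # expect "counts" and "abundance"
counts <- assay(se, 'counts')
dim(counts)                   # genes x samples
```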

### R code for import and voom-normalisation

Here we show our standard process for preparing RNAseq data for downstream analysis.

```r title="Prepare Voom-normalised DGE List"
# Load R packages
pkgs <- c('knitr', 'here', 'SummarizedExperiment', 'biomaRt', 'edgeR', 'limma')
pacman::p_load(char = pkgs)
18 changes: 11 additions & 7 deletions docs/Utilities/convert-raw-novaseq-outputs.md
@@ -30,7 +30,9 @@ The document should be in the following format, where `index` is the `i7 adapter`

For the indexes, **both** sequences used on the sample sheet should be the reverse complement of the actual sequences.
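
If you need to generate the reverse complements by hand, a minimal sketch (the sequence shown is a hypothetical example):

```bash
# Reverse the sequence, then swap each base for its complement
echo "ATTACTCG" | rev | tr 'ACGT' 'TGCA'   # prints CGAGTAAT
```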

!!! warning "Ensure correct file encoding 🪟👀"

    If you make this on a Windows system, ensure you save your output encoded as `UTF-8` and not `UTF-8 with BOM`.
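
    A quick way to check for (and strip) a BOM on Linux; a sketch assuming your sample sheet is named `samplesheet.txt`:

    ```bash
    # A UTF-8 BOM shows up as the bytes "ef bb bf" at the start of the file
    head -c 3 samplesheet.txt | xxd
    # Strip a BOM in place if present (GNU sed)
    sed -i '1s/^\xEF\xBB\xBF//' samplesheet.txt
    ```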

```bash
[Header]
@@ -72,9 +74,11 @@ The most up-to-date bcl-convert will be inside the output `usr/bin/` folder, and

With the `raw_data` folder and `samplesheet.txt` both in the same directory, we can now run BCL Convert to generate our demultiplexed FASTQ files. Ensure you have at least 64GB of RAM in your interactive smux session.
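
The invocation looks roughly like the sketch below; `--bcl-input-directory` and `--sample-sheet` are standard BCL Convert options, while the output directory name is an assumption to adjust as needed:

```bash
# Demultiplex the raw run folder using the prepared sample sheet
bcl-convert \
    --bcl-input-directory raw_data \
    --sample-sheet samplesheet.txt \
    --output-directory fastq_output
```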

!!! warning "Open file limit error"

    You will need a very high limit for open files &ndash; BCL Convert will attempt to set this limit to 65,535. However, by default, the limit on the M3 MASSIVE cluster is only 1,024, and it cannot be increased by users themselves.

    You can request an increased open file limit from the M3 MASSIVE help desk.

!!! question "Can I run this on my local machine?"

@@ -84,11 +88,11 @@
    The minimum requirements (as of BCL Convert v4.0) are:

    - **Hardware requirements**
        - Single multiprocessor or multicore computer
        - Minimum 64 GB RAM
    - **Software requirements**
        - Root access to your computer
        - File system access to adjust ulimit

    You can start an interactive bash session and increase the open file limit as follows:

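    A minimal sketch, assuming a local machine where you have root access (raising the limit beyond the hard cap typically requires root):

    ```bash
    sudo bash          # start an interactive root shell
    ulimit -n 65535    # raise the open file limit for this session
    ulimit -n          # verify: should print 65535
    ```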
2 changes: 2 additions & 0 deletions mkdocs.yml
@@ -45,6 +45,8 @@ markdown_extensions:
- pymdownx.tasklist
- pymdownx.tilde
- pymdownx.emoji
- pymdownx.tabbed:
    alternate_style: true

extra:
  analytics: