RNASeq_GST_tutorial

Materials, guides, manuals and files for RNAseq analysis - tutorial under LFSC540 GST. Check the Wiki page for easy access to the class content: https://github.com/t-keertana/RNASeq_GST_tutorial/wiki

Here are the slides we used in the class for lecture: https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/GST_RNASeq_tutorial_2024.pdf

The general pipeline we'll be using for the tutorial is illustrated here. In short, we'll attempt to cover the following topics, which are broadly applicable to most RNAseq analyses. We'll be using a simplified, cut-down version of the data from this paper: https://www.cell.com/current-biology/pdf/S0960-9822(22)01691-8.pdf

1. Introduction to RNA-seq
   1.1 - RNA-seq Overview
   1.2 - Experimental Design
2. Pre-processing
   2.1 - Quality Control (QC)
   2.2 - Trimming and Filtering
3. Alignment and mapping
   3.1 - Alignment Algorithms
   3.2 - Visualizing alignment
4. Quantification of Gene Expression
   4.1 - Quantification Methods
   4.2 - Normalization methods
5. Differential expression analysis
   5.1 - Introduction to Differential Expression
   5.2 - Visualization of Differential Expression Results
------------------------------------------------------------
6. BONUS: Functional Analysis:
   6.1 - Gene Ontology (GO) Enrichment Analysis
   6.2 - Pathway Analysis
7. BONUS: Advanced Topics:
   7.1 - Single-Cell RNA-seq


  1. Introduction

Here we are using a subset of the data from https://www.cell.com/current-biology/pdf/S0960-9822(22)01691-8.pdf. For this tutorial, you can find all the necessary files on this page. To keep the analysis fast, we use two reduced fastq files. However, to gain a better understanding of the study's results, we recommend that you download the original raw data after this class and repeat the same steps with the original fastq files instead of the reduced ones.


  2. Pre-processing

Raw read quality assessment

A. Quality Control (QC) of FASTQ files

The first step in the RNA-Seq workflow is to take the FASTQ files and assess the quality of the sequence reads.

Download the following fastq files to run FastQC:

https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.1.p.fq.gz

https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.2.p1.fq.gz

Assessing the quality of raw sequencing reads using FastQC:

#fastqc PATH_TO_YOUR_FASTQ_FOLDER/forward.fq.gz PATH_TO_YOUR_FASTQ_FOLDER/reverse.fq.gz -o PATH_TO_OUTPUT_FOLDER

For more information read this file: https://mugenomicscore.missouri.edu/PDF/FastQC_Manual.pdf

fastqc /home/sobhan/Desktop/test/Analysis/analysis/Fastq/RNA_youngAdult_rep2.1.p.fq.gz /home/sobhan/Desktop/test/Analysis/analysis/Fastq/RNA_youngAdult_rep2.2.p1.fq.gz -o /home/sobhan/Desktop/test/Analysis/analysis/Fastq/

For each input file, FastQC writes an .html report and a .zip archive to the output folder. The archive contains the icons, images, and text summaries that make up the report.
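Each FastQC archive includes a tab-delimited summary.txt with one line per module (status, module name, file name). The following is a minimal sketch, assuming the archive has already been unzipped, that lists the modules flagged WARN or FAIL. The function name fastqc_flagged_modules and the example paths are hypothetical.

```shell
# List FastQC modules whose status is WARN or FAIL.
# $1 = path to a summary.txt extracted from a FastQC .zip archive
#      (columns: status <TAB> module <TAB> file name).
fastqc_flagged_modules() {
  awk -F'\t' '$1 == "WARN" || $1 == "FAIL" { print $1 ": " $2 }' "$1"
}

# Example (hypothetical archive and folder names):
#   unzip -o RNA_youngAdult_rep2.1.p_fastqc.zip
#   fastqc_flagged_modules RNA_youngAdult_rep2.1.p_fastqc/summary.txt
```

A WARN or FAIL is not necessarily a problem for RNA-seq data, so treat the list as a pointer to the sections of the HTML report worth reading first.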

Read trimming for removal of low-quality sequences, adapters etc.

To trim adapters and low-quality bases from the sequencing data, we use a tool called Trimmomatic. You can download Trimmomatic (version 0.39) using the following link:

http://www.usadellab.org/cms/uploads/supplementary/Trimmomatic/Trimmomatic-Src-0.39.zip

For more information read this page: http://www.usadellab.org/cms/?page=trimmomatic

#java -jar <path to trimmomatic.jar> PE [-threads <threads>] [-phred33 | -phred64] [-trimlog <logFile>] <input 1> <input 2> <paired output 1> <unpaired output 1> <paired output 2> <unpaired output 2> <step 1> ...


java -jar /home/sobhan/tools/Trimmomatic-0.39/trimmomatic-0.39.jar PE -phred33 \
/home/sobhan/Desktop/test/Analysis/analysis/Fastq/RNA_youngAdult_rep2.1.p.fq.gz \
/home/sobhan/Desktop/test/Analysis/analysis/Fastq/RNA_youngAdult_rep2.2.p1.fq.gz \
/home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.1_forward_paired.fq.gz \
/home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.1_forward_unpaired.fq.gz \
/home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.2_reverse_paired.fq.gz \
/home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.2_reverse_unpaired.fq.gz \
ILLUMINACLIP:/home/sobhan/Desktop/test/Analysis/analysis/adapters.fa:2:30:10:2:False \
LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

This will perform the following:

Remove adapters (ILLUMINACLIP:adapters.fa:2:30:10:2:False). The fields after the adapter file are: seed mismatches (2), palindrome clip threshold (30), simple clip threshold (10), minimum adapter length (2), and keepBothReads (False).

Remove leading low quality or N bases (below quality 3) (LEADING:3)

Remove trailing low quality or N bases (below quality 3) (TRAILING:3)

Scan the read with a 4-base wide sliding window, cutting when the average quality per base drops below 15 (SLIDINGWINDOW:4:15)

Drop reads shorter than 36 bases (MINLEN:36).
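To make the sliding-window step more concrete, here is a simplified illustration of the idea, not Trimmomatic's actual implementation (which may differ in detail). It decodes a Phred+33 quality string, slides a window along it, and reports how many bases would be kept if the read were cut at the start of the first window whose mean quality falls below the threshold. The function name sliding_window_keep_length is hypothetical.

```shell
# Simplified sliding-window quality trimming (illustration only).
# $1 = quality string (Phred+33), $2 = window size, $3 = required mean quality
# Prints the number of leading bases that would be kept.
sliding_window_keep_length() {
  printf '%s\n' "$1" | awk -v win="$2" -v minq="$3" '
    # Build a character -> ASCII code lookup table (awk has no ord()).
    BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
    {
      n = length($0)
      keep = n
      for (start = 1; start + win - 1 <= n; start++) {
        sum = 0
        for (k = start; k < start + win; k++)
          sum += ord[substr($0, k, 1)] - 33     # Phred+33 -> quality score
        if (sum / win < minq) { keep = start - 1; break }
      }
      print keep
    }'
}

# Example: eight high-quality bases (I = Q40) followed by four poor ones (# = Q2)
#   sliding_window_keep_length 'IIIIIIII####' 4 15
```

With a window of 4 and a threshold of 15, the example read is cut once the window contains three of the low-quality bases, so most of the poor tail is removed while the good bases are kept.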

To check the quality of the reads after trimming, run FastQC again on the paired output files.

fastqc /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.1_forward_paired.fq.gz /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.2_reverse_paired.fq.gz -o /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/

The trimmed, paired fastq files (not the FastQC reports) will be used in the next step to map and align reads to the reference genome.
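Before mapping, it can be worth confirming that the forward and reverse paired files still contain the same number of reads, since paired-end aligners expect the two files to stay in sync. A minimal sketch, assuming gzip-compressed FASTQ files with four lines per read; the function names are hypothetical.

```shell
# Count reads in a gzip-compressed FASTQ file (4 lines per read).
count_fastq_reads() {
  gzip -cd -- "$1" | awk 'END { print NR / 4 }'
}

# Succeeds only if both paired files contain the same number of reads.
# $1 = forward paired file, $2 = reverse paired file
check_pairs_in_sync() {
  n1=$(count_fastq_reads "$1")
  n2=$(count_fastq_reads "$2")
  echo "forward: $n1 reads, reverse: $n2 reads"
  [ "$n1" = "$n2" ]
}

# Example (file names as produced by the Trimmomatic command in this tutorial):
#   check_pairs_in_sync RNA_youngAdult_rep2.1_forward_paired.fq.gz \
#                       RNA_youngAdult_rep2.2_reverse_paired.fq.gz
```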


  3. Mapping

After pre-processing the fastq sequences, for datasets with a reference genome available, the next step is to align the RNASeq reads to this reference. We'll use the STAR alignment program to map our reads to the O. tipulae reference genome.

First, we'll create an index for our genome file, which tells STAR how to navigate the genome. Then, we'll ask it to align our reads to the genome and write out the alignments. Because the index from the first step is built with the gene annotation (GTF), STAR knows the annotated splice junctions and can align reads that span them. Assigning reads to genes happens later, in the counting step.

Required inputs:

Trimmed or pre-processed RNASeq reads as fasta or fastq files:

https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.1_forward_paired.fq.gz

https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.2_reverse_paired.fq.gz

Reference genome file (.fna, .fa or .fasta): https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/CEW1.fa.gz

Genome feature file (.gff, .gtf): https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/CEW1.gtf.gz

Outputs:

SAM or BAM files (including sorted/indexed files): https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.Aligned.sortedByCoord.out.bam

To install STAR and obtain more information, please refer to the following file: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf.

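The first step is to unzip the trimmed fastq files using the following command. The reference genome (CEW1.fa.gz) and the annotation (CEW1.gtf.gz) need to be unzipped the same way, because the STAR commands below use CEW1.fa and CEW1.gtf: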

#gzip -dk file.gz
gzip -dk /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.1_forward_paired.fq.gz /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.2_reverse_paired.fq.gz

Indexing the reference genome

Then we need to build the genome index file using the following command:

#STAR --runThreadN 6 \
#--runMode genomeGenerate \
#--genomeDir PATH_To_OUTPUT_FOLDER \
#--genomeFastaFiles PATH_TO_THE_REFERENCE_GENOME_FILE_FASTA_FORMAT \
#--genomeSAindexNbases 11 \
#--sjdbGTFfile PATH_TO_THE_GTF_FILE \
#--sjdbOverhang 149

For more information read this file: https://physiology.med.cornell.edu/faculty/skrabanek/lab/angsd/lecture_notes/STARmanual.pdf

For more information read this page: https://hbctraining.github.io/Intro-to-rnaseq-hpc-O2/lessons/03_alignment.html

STAR --runThreadN 6 \
--runMode genomeGenerate \
--genomeDir /home/sobhan/Desktop/test/Analysis/analysis/Ref/ \
--genomeFastaFiles /home/sobhan/Desktop/test/Analysis/analysis/Ref/CEW1.fa \
--genomeSAindexNbases 11 \
--sjdbGTFfile /home/sobhan/Desktop/test/Analysis/analysis/Ref/CEW1.gtf \
--sjdbOverhang 149
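Two values in the indexing command depend on the data. The STAR manual recommends scaling --genomeSAindexNbases down for small genomes to min(14, log2(GenomeLength)/2 - 1), and setting --sjdbOverhang to the read length minus one. The sketch below computes both from an uncompressed FASTA file and a read length; the function names are hypothetical helpers, not part of STAR.

```shell
# Total number of bases in an uncompressed FASTA file (header lines skipped).
genome_length() {
  awk '!/^>/ { total += length($0) } END { printf "%.0f\n", total }' "$1"
}

# Suggested --genomeSAindexNbases: min(14, floor(log2(GenomeLength)/2 - 1)).
# $1 = genome length in bases
suggest_sa_index_nbases() {
  awk -v len="$1" 'BEGIN {
    n = int(log(len) / log(2) / 2 - 1 + 1e-9)   # small epsilon guards rounding
    if (n > 14) n = 14
    print n
  }'
}

# Suggested --sjdbOverhang: read length minus one.
# $1 = read length in bases
suggest_sjdb_overhang() {
  echo $(( $1 - 1 ))
}

# Example (hypothetical usage):
#   suggest_sa_index_nbases "$(genome_length CEW1.fa)"
#   suggest_sjdb_overhang 150
```

For instance, a genome of 60,000,000 bases gives 11, and a read length of 150 gives an overhang of 149, which are the values used in the command above.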

Aligning reads

Then we map the reads against the genome, using the following command:

#For paired-end data, the forward and reverse files are separated by a space
#(a comma would tell STAR to treat them as two files of the same mate)
#STAR --genomeDir PATH_TO_THE_OUTPUT_FOLDER_OF_THE_PREVIOUS_STEP \
#--runThreadN 6 \
#--readFilesIn PATH_TO_THE_TRIMMED_FASTQ_FILES/forward.fq PATH_TO_THE_TRIMMED_FASTQ_FILES/reverse.fq \
#--outFileNamePrefix PATH_To_OUTPUT_FOLDER/RNA_youngAdult_rep2. \
#--outSAMstrandField intronMotif \
#--outMultimapperOrder Random \
#--outSAMattributes Standard \
#--limitBAMsortRAM 62572128765 \
#--outSAMtype BAM SortedByCoordinate \
#--limitOutSJcollapsed 2000000

STAR --genomeDir /home/sobhan/Desktop/test/Analysis/analysis/Ref/ \
--runThreadN 6 \
--readFilesIn /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.1_forward_paired.fq /home/sobhan/Desktop/test/Analysis/analysis/Trimmed_fastq/RNA_youngAdult_rep2.2_reverse_paired.fq \
--outFileNamePrefix /home/sobhan/Desktop/test/Analysis/analysis/Mapping/RNA_youngAdult_rep2. \
--outSAMstrandField intronMotif \
--outMultimapperOrder Random \
--outSAMattributes Standard \
--limitBAMsortRAM 62572128765 \
--outSAMtype BAM SortedByCoordinate \
--limitOutSJcollapsed 2000000
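Alongside the BAM file, STAR writes a Log.final.out file (here it would be named with the same prefix, RNA_youngAdult_rep2.Log.final.out) that summarizes the mapping. A quick way to judge the run is to look at the percentage of uniquely mapped reads. The helper below is a small sketch for pulling that number out; the function name star_unique_rate is hypothetical.

```shell
# Print the percentage of uniquely mapped reads from a STAR Log.final.out file.
# Lines in that file look like:  "Uniquely mapped reads % |<TAB>87.50%"
star_unique_rate() {
  awk -F'|' '/Uniquely mapped reads %/ { gsub(/[ \t%]/, "", $2); print $2 }' "$1"
}

# Example (hypothetical path):
#   star_unique_rate Mapping/RNA_youngAdult_rep2.Log.final.out
```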

Further reading: https://pubmed.ncbi.nlm.nih.gov/23104886/, https://github.com/alexdobin/STAR


  4. Counting Reads

In this step we count the number of reads aligning to each gene or transcript. The number of reads (counts) associated with each feature (gene, exon, transcript, etc.) is obtained using the BAM files as input. For RNA-seq, the features are typically genes, where each gene is the union of all its exons. Several tools can perform this task, including featureCounts, HTSeq, and Salmon. Here, we'll use HTSeq (htseq-count).

Input for counting:
Mapped reads (.bam file produced by STAR): https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2.Aligned.sortedByCoord.out.bam
General Feature Format (GFF3) or Gene transfer format (GTF) file: https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/CEW1.gtf.gz

Output of counting = a table of read counts per gene for this sample: https://github.com/t-keertana/RNASeq_GST_tutorial/blob/main/RNA_youngAdult_rep2_htseq.txt. Combining these tables across samples gives a count matrix, with genes as rows and samples as columns.

#htseq-count [options] alignment_file gtf_file > output_file

For more information read this page: https://htseq.readthedocs.io/en/master/htseqcount.html

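# -r pos tells htseq-count that the BAM file is sorted by coordinate, as produced
# by STAR above (for paired-end data htseq-count assumes name-sorted input by default)
htseq-count \
--stranded=no \
-f bam \
-r pos \
/home/sobhan/Desktop/test/Analysis/analysis/Mapping/RNA_youngAdult_rep2.Aligned.sortedByCoord.out.bam \
/home/sobhan/Desktop/test/Analysis/analysis/Ref/CEW1.gtf > \
/home/sobhan/Desktop/test/Analysis/analysis/Counts/RNA_youngAdult_rep2_htseq.txt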
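The htseq-count output is a two-column, tab-delimited table (feature ID, count) that ends with a few special counters whose names start with two underscores, such as __no_feature and __ambiguous. A rough sanity check is to compare the reads assigned to genes with those special categories. The sketch below does this; the function name htseq_summary is hypothetical.

```shell
# Summarise an htseq-count output file: reads assigned to genes versus
# reads in the special "__" categories (e.g. __no_feature, __ambiguous).
# $1 = htseq-count output file (feature ID <TAB> count)
htseq_summary() {
  awk -F'\t' '
    /^__/ { special += $2; print $1 "\t" $2; next }   # echo each special counter
    { assigned += $2 }                                # everything else is a gene
    END {
      print "assigned\t" (assigned + 0)
      print "special_total\t" (special + 0)
    }
  ' "$1"
}

# Example (hypothetical path):
#   htseq_summary Counts/RNA_youngAdult_rep2_htseq.txt
```

If most reads end up in __no_feature, it is worth re-checking the strandedness setting and that the GTF matches the reference genome.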

  5. Differential expression analysis
  • DESeq2 with R to perform differential expression on read counts.

After counting the number of reads assigned to each contig/gene we will store the read counts as a matrix and use this matrix to identify differentially expressed genes using DESeq2.

Input data

The first step is to merge the read counts of all samples into a single count matrix. The count matrix we are going to use in this tutorial can be downloaded here.

The file has one row per gene (identified by its gene ID) and one column per sample. The value in the i-th row and j-th column of the matrix is the number of reads assigned to gene i in sample j. The matrix must contain raw counts: do not use transformed or normalized values, such as counts scaled by library size, as input, because the DESeq2 model corrects for library size internally.
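If you produce your own per-sample htseq-count files, they need to be merged into such a matrix. One possible way to do this in the shell is sketched below. It assumes each file is named SAMPLE_htseq.txt (as in the counting step), uses the file name as the sample name, skips the special "__" counters, and writes a gene_id header column to match the import code used later. The function name merge_htseq_counts is hypothetical, and this is not necessarily how the tutorial's matrix was built.

```shell
# Merge several htseq-count files into one tab-delimited count matrix.
# Usage: merge_htseq_counts A_htseq.txt B_htseq.txt ... > matrix.txt
merge_htseq_counts() {
  awk -F'\t' -v OFS='\t' '
    FNR == 1 {                       # first line of each input file
      nfile++
      name = FILENAME
      sub(/.*\//, "", name)          # drop the directory part
      sub(/_htseq\.txt$/, "", name)  # drop the assumed file suffix
      sample[nfile] = name
    }
    /^__/ { next }                   # skip special counters such as __no_feature
    {
      if (!($1 in seen)) { seen[$1] = 1; order[++ngene] = $1 }
      count[$1, nfile] = $2
    }
    END {
      printf "gene_id"
      for (j = 1; j <= nfile; j++) printf "%s%s", OFS, sample[j]
      printf "\n"
      for (i = 1; i <= ngene; i++) {
        g = order[i]
        printf "%s", g
        for (j = 1; j <= nfile; j++)
          printf "%s%s", OFS, (((g, j) in count) ? count[g, j] : 0)
        printf "\n"
      }
    }
  ' "$@"
}
```

Genes keep the order in which they first appear, and a gene missing from one file is given a count of 0 for that sample.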


In addition to this matrix we also need a table of sample information. This information typically includes metadata about each sample, such as sample identifiers, experimental conditions, treatment groups, phenotypic characteristics, or any other relevant information. Annotation files are typically formatted as tab-delimited text files or comma-separated values (CSV) files, where each row corresponds to a sample and each column corresponds to a metadata attribute. The annotation file we are going to use in this tutorial can be downloaded here.

Install packages and load libraries

#install.packages("tidyverse")
#if (!require("BiocManager", quietly = TRUE))
#    install.packages("BiocManager")

#BiocManager::install("DESeq2")

library("DESeq2")
library(ggplot2)

Import data

To import the data we use the following code. The count matrix is stored as cts and the sample annotation as coldata.

pasCts <- ".../Table_rawdata.txt"
pasAnno <- ".../annotation.csv"
cts <- as.matrix(read.csv(pasCts,sep="\t",row.names="gene_id"))
coldata <- read.csv(pasAnno, row.names=1)
coldata <- coldata[,c("condition","type")]
coldata$condition <- factor(coldata$condition)
coldata$type <- factor(coldata$type)
head(cts,2)
coldata


The design formula describes how the gene expression levels are expected to vary with respect to different experimental factors or covariates. It typically includes terms for the experimental conditions or treatment groups, as well as any other relevant covariates that may affect gene expression (e.g., batch effects, sample-specific variables).

It's crucial for the columns of the count matrix and the rows of the column data (which contains information about samples) to be in the same order. DESeq2 cannot determine which column of the count matrix corresponds to which row of the column data, so we need to make sure they're arranged in a consistent order before providing them to DESeq2. If the order is inconsistent, later functions will produce an error, so we must rearrange one or the other to ensure that they are consistent in terms of sample order.

all(rownames(coldata) %in% colnames(cts))
all(rownames(coldata) == colnames(cts))
cts <- cts[, rownames(coldata)]
all(rownames(coldata) == colnames(cts))
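The same consistency check can also be done on the raw files before opening R, which helps catch a misnamed sample early. This is only a sketch under stated assumptions: the count matrix is tab-delimited with gene_id as its first header field, the annotation is a simple CSV with a header row and the sample names in its first column, and no field contains embedded commas. The function names are hypothetical.

```shell
# Sample names from the header of a tab-delimited count matrix
# (first column is assumed to be gene_id), one per line.
matrix_samples() {
  head -n 1 "$1" | tr -d '"\r' | tr '\t' '\n' | tail -n +2
}

# Sample names from the first column of a CSV annotation file
# (first row is assumed to be a header), one per line.
annotation_samples() {
  tail -n +2 "$1" | tr -d '"\r' | cut -d, -f1
}

# Succeeds only if both files list the same samples in the same order.
# $1 = count matrix, $2 = annotation CSV
samples_in_same_order() {
  [ "$(matrix_samples "$1")" = "$(annotation_samples "$2")" ]
}

# Example (hypothetical file locations):
#   samples_in_same_order Table_rawdata.txt annotation.csv && echo "order OK"
```

Note that R may alter column names on import, so the check inside R shown above remains the authoritative one.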


Construct the DESeqDataSet object

Now we can use our data to construct a DESeqDataSet using the following code.

dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds


Pre-filtering

Although it is not strictly required, pre-filtering low count genes before using the DESeq2 functions can be beneficial. Firstly, it reduces the memory size of the dds data object by removing rows with very few reads. Secondly, it speeds up the count modeling process within DESeq2. Additionally, pre-filtering can improve visualizations since features with no information for differential expression are not plotted in dispersion plots or MA-plots.

The following code keeps only the rows (genes) whose total count across all samples is at least 10.

keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep,]

By default, R orders factor levels alphabetically and uses the first level as the reference. To make sure the comparison is made against the intended group, we set the reference level explicitly.

dds$condition <- relevel(dds$condition, ref = "Adult")

Now we’re ready to run the DESeq function

dds <- DESeq(dds)
res <- results(dds)
res


And now we can save our results as a file.

res <- results(dds, name="condition_Male_vs_Adult")
resOrdered <- res[order(res$pvalue),]
r<-".../Adultvs.Male.txt"
write.table(resOrdered, file = r, row.names=TRUE)

Data visualization

Assessing the quality of data and removing low-quality data is a crucial part of any data analysis. It is recommended to perform these steps early in the analysis of a new dataset, before or simultaneously with the differential expression testing.

  • Principal component plot of the samples

The PCA plot displays samples in a 2D plane based on their first two principal components. This plot is helpful to visualize the impact of experimental covariates and batch effects.

When performing downstream analyses such as visualization or clustering, it can be helpful to utilize transformed versions of the count data. This step is aimed at eliminating the dependence of the variance on the mean, especially the high variance of the logarithm of count data when the mean is low. To achieve this, we will use the regularized logarithm or rlog approach which produces transformed data on the log2 scale. This transformed data has been normalized with respect to library size or other normalization factors, making it suitable for downstream analyses.

rld <- rlog(dds, blind=FALSE)
head(assay(rld), 3)

plotPCA(rld, intgroup="condition",ntop = 500)


  • Volcano plot

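library(ggplot2)

# Create volcano plot
# ggplot2 expects a data frame, so convert the DESeqResults object first
# and drop genes without a p-value (pvalue is NA)
res_df <- as.data.frame(res)
res_df <- res_df[!is.na(res_df$pvalue), ]
p<-ggplot(res_df, aes(x = log2FoldChange, y = -log10(pvalue))) +
  geom_point(aes(color = ifelse(pvalue < 0.05, "Significant", "Not Significant")), alpha = 0.6) +
  geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "red") +
  labs(x = "log2 Fold Change", y = "-log10(p-value)", title = "Volcano Plot",color='') +
  theme_minimal()
p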


For more information: https://www.bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.html
