diff --git a/vignettes/FRASER.Rnw b/vignettes/FRASER.Rnw index 70e80f21..5e5fe1f6 100644 --- a/vignettes/FRASER.Rnw +++ b/vignettes/FRASER.Rnw @@ -40,7 +40,7 @@ opts_chunk$set( \author{ Christian Mertes$^{1}$, Ines Scheller$^{1}$, Julien Gagneur$^{1}$ \\ - \small{$^{1}$ Technische Universit\"at M\"unchen, Department of + \small{$^{1}$ Technical University of Munich, Department of Informatics, Garching, Germany} } @@ -77,7 +77,14 @@ diseases. \begin{center} \begin{tabular}{ | l | } \hline -If you use \fraser{} in published research, please cite: \\ +If you use \fraser{} version >= 1.99.0 in published research, please cite: \\ +\\ +Scheller I, Lutz K, Mertes C, \emph{et al.} +\textbf{Improved detection of aberrant splicing with FRASER 2.0} \\ +\textbf{using the Intron Jaccard Index}, medRxiv, 2023, \\ +\emph{\url{https://doi.org/10.1101/2023.03.31.23287997}} \\ +\hline +For previous versions of \fraser{}, please cite: \\ \\ Mertes C, Scheller I, Yepez V, \emph{et al.} \textbf{Detection of aberrant splicing events} \\ @@ -190,7 +197,7 @@ previously captured with either of the metrics $\psi_5$, $\psi_3$, $\theta$. } The Intron Jaccard Index considers both split and nonsplit reads and is -defined as the jaccard index of the set of donor reads (reads sharing a donor +defined as the Jaccard index of the set of donor reads (reads sharing a donor site with the intron of interest and nonsplit reads at that donor site) and acceptor reads (reads sharing an acceptor site with the intron of interest and nonsplit reads at that acceptor site): @@ -203,11 +210,11 @@ nonsplit reads at that acceptor site): \section{Quick guide to \fraser{}} -Here we quickly show how to do an analysis with \fraser{}, starting from a -sample annotation table and the corresponding bam files. First, we create an -\fds{} from the sample annotation and count the relevant reads in the bam files. 
+Here we show how to do an analysis with \fraser{}, starting from a +sample annotation table and raw data (RNA-seq BAM files). First, we create a +\fds{} object from the sample annotation and count the relevant reads in the BAM files. Then, we compute the $\psi/\theta$ values and -filter out introns that are just noise. Secondly, we run the full +filter out introns that are lowly expressed. Secondly, we run the full pipeline using the command \Rfunction{FRASER}. In the last step, we extract the results table from the \fds{} using the \Rfunction{results} function. Additionally, the user can create several analysis plots directly from the @@ -218,7 +225,7 @@ fitted \fds{} object. These plotting functions are described in section # load FRASER library library(FRASER) -# count data +# count raw data fds <- createTestFraserSettings() fds <- countRNAData(fds) fds @@ -226,11 +233,11 @@ fds # compute stats fds <- calculatePSIValues(fds) -# filtering junction with low expression +# filter junctions with low expression fds <- filterExpressionAndVariability(fds, minExpressionInOneSample=20, minDeltaPsi=0.0, filter=TRUE) -# we provide two ways to anntoate introns with the corresponding gene symbols: +# we provide two ways to annotate introns with the corresponding gene symbols: # the first way uses TxDb-objects provided by the user as shown here library(TxDb.Hsapiens.UCSC.hg19.knownGene) library(org.Hs.eg.db) @@ -242,16 +249,16 @@ fds <- annotateRangesWithTxDb(fds, txdb=txdb, orgDb=orgDb) # with a specific latentspace dimension fds <- FRASER(fds, q=c(jaccard=2)) -# alternatively, we also provide a way to use biomart for the annotation: +# Alternatively, we also provide a way to use BioMart for the annotation: # fds <- annotateRanges(fds) -# get results: we recommend to use an FDR cutoff 0.05, but due to the small -# dataset size we extract all events and their associated values +# get results: we recommend to use an FDR cutoff of 0.05, but due to the small +# dataset size, we 
extract all events and their associated values # eg: res <- results(fds, padjCutoff=0.05, deltaPsiCutoff=0.1) res <- results(fds, all=TRUE) res -# result visualization +# result visualization, aggregate=TRUE means that results are aggregated at the gene level plotVolcano(fds, sampleID="sample1", type="jaccard", aggregate=TRUE) @ @@ -261,22 +268,21 @@ plotVolcano(fds, sampleID="sample1", type="jaccard", aggregate=TRUE) The analysis workflow of \fraser{} for detecting rare aberrant splicing events in RNA-seq data can be divided into the following steps: \begin{enumerate} - \item Data import or Counting reads \ref{sec:dataPreparation} + \item Data import or counting reads \ref{sec:dataPreparation} \item Data preprocessing and QC \ref{sec:DataPreprocessing} \item Correcting for confounders \ref{sec:correction} - \item Calculate P-values \ref{sec:P-value-calculation} - \item Calculate Z-scores \ref{sec:Z-score-calculation} - \item Visualize the results \ref{sec:result-vis} + \item Calculating P-values \ref{sec:P-value-calculation} + \item Visualizing the results \ref{sec:result-vis} \end{enumerate} -Step 3-5 are wrapped up in one function \Rfunction{FRASER}, but each step can -be called individually and parametrizied. Either way, data preprocessing should +Steps 3 and 4 are wrapped up in one function \Rfunction{FRASER}, but each step can +be called individually and parametrized. Either way, data preprocessing should be done before starting the analysis, so that samples failing quality measurements or introns stemming from background noise are discarded. Detailed explanations of each step are given in the following subsections. -For this tutorial we will use the a small example dataset that is contained +For this tutorial, we will use a small example dataset that is contained in the package. \subsection{Data preparation} @@ -285,7 +291,7 @@ in the package. 
\subsubsection{Creating a \fds{} and Counting reads} \label{sec:CountingReads} -To start a RNA-seq data analysis with \fraser{} some preparation steps are +To start an RNA-seq data analysis with \fraser{}, some preparation steps are needed. The first step is the creation of a \fds{} which derives from a RangedSummarizedExperiment object. To create the \fds, sample annotation and two count matrices are needed: one containing counts for the splice junctions, @@ -296,10 +302,10 @@ splice junctions. You can first create the \fds{} with only the sample annotation and subsequently count the reads as described in \ref{sec:CountingReads}. For this, we need a table with basic informations which then can be transformed into a -\Rclass{FraserSettings} object. The minimum of information per sample is an -unique sample name, the path to the aligned bam file. -Additionally groups can be specified for the P-value calculations later. -If a \textbf{NA} is assigned no P-values will be calculated. An example sample +\Rclass{FraserSettings} object. The minimum of information per sample is a +unique sample name and the path to the BAM file. +Additionally, groups can be specified for the P-value calculations. +If \textbf{NA} is assigned, no P-values will be calculated. An example sample table is given within the package: <>= @@ -308,7 +314,7 @@ sampleTable <- fread(system.file( head(sampleTable) @ -To create a settings object for \fraser{} the constructor +To create a settings object for \fraser{}, the constructor \Rfunction{FraserSettings} should be called with at least a sampleData table. For an example have a look into the \Rfunction{createTestFraserSettings}. In addition to the sampleData you can specify further parameters. 
@@ -326,7 +332,7 @@ options from the sample annotation above: <>= # convert it to a bamFile list bamFiles <- system.file(sampleTable[,bamFile], package="FRASER", mustWork=TRUE) -sampleTable[,bamFile:=bamFiles] +sampleTable[, bamFile := bamFiles] # create FRASER object settings <- FraserDataSet(colData=sampleTable, workingDir="FRASER_output") @@ -343,15 +349,15 @@ settings <- createTestFraserSettings() settings @ -Counting of the reads are straight forward and is done through the +Counting the reads is straightforward and is done through the \Rfunction{countRNAData} function. The only required parameter is the -FraserSettings object. First all split reads are extracted from each individual -sample and cached if enabled. Then a dataset wide junction map is created +FraserSettings object. First, all split reads are extracted from each individual +sample and cached if enabled. Then a dataset-wide junction map is created (all visible junctions over all samples). After that for each sample the -non-spliced reads at each given donor and acceptor site is counted. The +non-spliced reads at each given donor and acceptor site are counted. The resulting \Rclass{FraserDataSet} object contains two -\Rclass{SummarizedExperiment} objects for each the junctions and the splice -sites. +\Rclass{SummarizedExperiment} objects, one for the junctions and one for the +splice sites. <>= # example of how to use parallelization: use 10 cores or the maximal number of @@ -375,7 +381,7 @@ If the count matrices already exist, you can use these matrices directly together with the sample annotation from above to create the \fds: <>= -# example sample annoation for precalculated count matrices +# example sample annotation for precalculated count matrices sampleTable <- fread(system.file("extdata", "sampleTable_countTable.tsv", package="FRASER", mustWork=TRUE)) head(sampleTable) @@ -405,13 +411,13 @@ slides\footnote{\url{http://tinyurl.com/RNA-ASHG-presentation}}. 
At the time of writing this vignette, we recommend that the RNA-seq data should be aligned with a splice-aware aligner like STAR\cite{Dobin2013} or GEM\cite{MarcoSola2012}. -To gain better results, at least 20 samples should be sequenced and they should -be processed with the same protocol and origin from the same tissue. +To obtain better results, at least 50 samples should be sequenced and they should +be processed with the same protocol and originate from the same tissue. \subsubsection{Filtering} \label{sec:filtering} -Before we can filter the data, we have to compute the main splicing metric: +Before filtering the data, we have to compute the main splicing metrics: the $\psi$-value (Percent Spliced In) and the Intron Jaccard Index. <>= @@ -419,14 +425,14 @@ fds <- calculatePSIValues(fds) fds @ -Now we can have some cut-offs to filter down the number of junctions we want to +Now we can filter down the number of junctions we want to test later on. -Currently, we keep only junctions which support the following: +Currently, we suggest keeping only junctions that satisfy the following: \begin{itemize} - \item At least one sample has 20 reads - \item 5\% of the samples have at least 1 read + \item At least one sample has 20 (or more) reads + \item 25\% (or more) of the samples have at least 10 reads \end{itemize} Furthemore one could filter for: @@ -437,7 +443,7 @@ Furthemore one could filter for: <>= -fds <- filterExpressionAndVariability(fds, minDeltaPsi=0.0, filter=FALSE) +fds <- filterExpressionAndVariability(fds, minDeltaPsi=0, filter=FALSE) plotFilterExpression(fds, bins=100) @ @@ -458,8 +464,8 @@ Since $\psi$ values are ratios within a sample, one might think that there should not be as much correlation structure as observed in gene expression data within the splicing data. -This is not true as we do see strong sample co-variation across different -tissues and cohorts. 
Let's have a look into our data to see if we do have +However, we do see strong sample co-variation across different +tissues and cohorts. Let's have a look into our demo data to see if it has correlation structure or not. To have a better estimate, we use the logit transformed $\psi$ values to compute the correlation. @@ -479,7 +485,7 @@ plotCountCorHeatmap(fds, type="jaccard", logit=TRUE, normalized=FALSE, \subsection{Detection of aberrant splicing events} -After preprocessing the raw data and visualizing it, we can start our analysis. +After preprocessing the raw data and visualizing it, we can start with our analysis. Let's start with the first step in the aberrant splicing detection: the model fitting. @@ -491,11 +497,11 @@ latent space with a dimension $q=10$ . Using the correct dimension is crucial to have the best performance (see \ref{sec:encDim}). Alternatively, one can also use a PCA to correct the data. The wrapper function \Rfunction{FRASER} both fits the model and calculates the -p-values and z-scores for all $\psi$ types. For more details see section +p-values for all $\psi$ types. For more details see section \ref{sec:details}. <>= -# This is computational heavy on real size datasets and can take awhile +# This is computationally heavy on real datasets and can take some hours fds <- FRASER(fds, q=c(jaccard=3)) @ @@ -508,15 +514,15 @@ plotCountCorHeatmap(fds, type="jaccard", normalized=TRUE, logit=TRUE) \subsubsection{Calling splicing outliers} -Before we extract the results, we should add the human readable HGNC symbols. +Before we extract the results, we should add HGNC symbols to the junctions. \fraser{} comes already with an annotation function. The function uses \Biocpkg{biomaRt} in the background to overlap the genomic ranges with the known HGNC symbols. To have more flexibilty on the annotation, one can also provide a custom `txdb` object to annotate the HGNC symbols. 
Here we assume a beta binomial distribution and call outliers based on the -significance level. The user can choose between a p value cutoff, a Z score -cutoff or a cutoff on the $\Delta\psi$ values between the observed and expected +significance level. The user can choose between a p value cutoff, +a cutoff on the $\Delta\psi$ values between the observed and expected $\psi$ values or both. <>= @@ -534,7 +540,7 @@ fds <- annotateRangesWithTxDb(fds, txdb=txdb, orgDb=orgDb) res <- results(fds) @ -\subsubsection{Interpreting the result table} +\subsubsection{Interpreting the results table} The function \Rfunction{results} retrieves significant events based on the specified cutoffs as a \Rclass{GRanges} object which contains the genomic @@ -543,7 +549,7 @@ the following additional information: \begin{itemize} \item sampleID: the sampleID in which this aberrant event occurred \item hgncSymbol: the gene symbol of the gene that contains the splice - junction or site if available + junction or site, if available \item type: the metric for which the aberrant event was detected (either jaccard for Intron Jaccard Index or psi5 for $\psi_5$, psi3 for $\psi_3$ or theta for $\theta$) @@ -551,9 +557,9 @@ the following additional information: this event (at intron or splice site level depending on metric) \item pValueGene, padjustGene: only present in the gene-level results table, gives the p-value and FDR adjusted p-value at gene-level - \item psiValue: the value of $\psi_5$, $\psi_3$ or $\theta$ metric - (depending on the type column) of this junction or splice site for the - sample in which it is detected as aberrant + \item psiValue: the value of the splice metric (see 'type' column for the + name of the metric) of this junction or splice site for the sample in which + it is detected as aberrant \item deltaPsi: the $\Delta\psi$-value of the event in this sample, which is the difference between the actual observed $\psi$ and the expected $\psi$ \item counts, totalCounts: the 
count (k) and total count (n) of the splice @@ -578,19 +584,18 @@ junction where the event is detected; an aberrant $\psi_3$ value might indicate aberrant donor site usage of the junction where the event is detected; and an aberrant $\theta$ value might indicate partial or full intron retention, or exon truncation or elongation. As the Intron Jaccard Index combines the -previously described metrics, an aberrant Intron Jaccard value can indicate any -of the above described cases. We recommend using a genome browser to -investigate interesting detected events in more detail. \fraser{}2 also -provides the function \Rfunction{plotBamCoverageFromResultTable} to create a -sashimi plot for an outlier in the results table directly in R (if paths to +three metrics, an aberrant Intron Jaccard value can indicate any +of the above described cases. We recommend inspecting the outliers using IGV. +\fraser{}2 also provides the function \Rfunction{plotBamCoverageFromResultTable} +to create a sashimi plot for an outlier in the results table directly in R (if paths to bam files are available in the \fds{} object). <>= -# to show result visualization functions for this tutorial, no cutoff used +# for visualization purposes for this tutorial, no cutoffs were used res <- results(fds, all=TRUE) res -# for the gene level pvalues, gene symbols need to be annotated the fds object +# for the gene level pvalues, gene symbols need to be added to the fds object # before calling the calculatePadjValues function (part of FRASER() function) # as we previously called FRASER() before annotating genes, we run it again here fds <- calculatePadjValues(fds, type="jaccard", geneLevel=TRUE) @@ -601,7 +606,7 @@ res_gene \subsection{Finding splicing candidates in patients} -Let's hava a look at sample 10 and check if we got some splicing +Let's have a look at sample 10 and check if we got some splicing candidates for this sample. 
<>= @@ -618,14 +623,14 @@ sampleRes To have a closer look at the junction level, use the following functions: <>= -plotExpression(fds, type="jaccard", result=sampleRes[9]) +plotExpression(fds, type="jaccard", result=sampleRes[9]) # plots the 9th row plotSpliceMetricRank(fds, type="jaccard", result=sampleRes[9]) plotExpectedVsObservedPsi(fds, result=sampleRes[9]) @ \subsection{Saving and loading a \fds{}} -A \fds{} object can be easily saved and reloaded at any time as follows: +A \fds{} object can be easily saved and reloaded as follows: <>= # saving a fds @@ -646,7 +651,7 @@ fds <- loadFraserDataSet(file=file.path(workingDir(fds), The function \Rfunction{FRASER} is a convenient wrapper function that takes care of correcting for confounders, fitting the beta binomial distribution and -calculating p-values and z-scores for all $\psi$ types. To have more control +calculating p-values for all $\psi$ types. To have more control over the individual steps, the different functions can also be called separately. The following sections give a short explanation of these steps. @@ -668,8 +673,8 @@ confounders in the data. Currently the following methods are implemented: <>= # Using an alternative way to correct splicing ratios -# here: only 2 iteration to speed the calculation up -# for the vignette, the default is 15 iterations +# here: only 2 iterations to speed the calculation up for the vignette +# the default is 15 iterations fds <- fit(fds, q=3, type="jaccard", implementation="PCA-BB-Decoder", iterations=2) @ @@ -677,8 +682,8 @@ fds <- fit(fds, q=3, type="jaccard", implementation="PCA-BB-Decoder", \subsubsection{Finding the dimension of the latent space} \label{sec:encDim} -For the previous call, the dimension $q$ of the latent space has been fixed to -$q=10$. Since working with the correct $q$ is very important, the \fraser{} +For the previous call, the dimension $q$ of the latent space has been fixed. 
+Since working with the correct $q$ is very important, the \fraser{} package also provides the function \Rfunction{optimHyperParams} that can be used to estimate the dimension $q$ of the latent space of the data. It works by artificially injecting outliers into the data and then comparing the AUC of @@ -751,15 +756,15 @@ head(padjVals(fds, type="jaccard", subsetName="exampleSubset")) \subsection{Result visualization} \label{sec:result-vis} -In addition to the plotting methods \Rfunction{plotVolcano}, +Besides the plotting methods \Rfunction{plotVolcano}, \Rfunction{plotExpression}, \Rfunction{plotExpectedVsObservedPsi}, \Rfunction{plotSpliceMetricRank}, \Rfunction{plotFilterExpression} and \Rfunction{plotEncDimSearch} used above, -the \fraser{} package provides two additional functions to visualize the +the \fraser{} package provides additional functions to visualize the results: \Rfunction{plotAberrantPerSample} displays the number of aberrant events per -sample based on the given cutoff values and \Rfunction{plotQQ} gives a +sample of the whole cohort based on the given cutoff values and \Rfunction{plotQQ} gives a quantile-quantile plot either for a single junction/splice site or globally. <>=