-
Notifications
You must be signed in to change notification settings - Fork 6
/
Copy pathindex.Rmd
530 lines (386 loc) · 19.4 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
---
title: "An Introduction to Pathway Analysis with R and Bioconductor"
author: "Alex Sanchez"
date: "`r Sys.Date()`"
output:
html_document:
toc: true
toc_float: true
toc_depth: 4
code_folding: hide
theme: cerulean
highlight: textmate
embed-resources: true
editor_options:
chunk_output_type: console
---
```{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, cache=FALSE,
echo=TRUE, warning=FALSE, message=FALSE, results ='markup')
options(warn=-1)
```
```{r installPackages, eval=FALSE}
installifnot <- function (packageName){
if (!(require(packageName, character.only=TRUE))) {
install.packages(packageName)
}else{
detach(paste ("package", packageName, sep=":"), character.only=TRUE)
}
}
bioCifnot <- function (packageName){
if (!(require(packageName, character.only=TRUE))) {
BiocManager::install(packageName)
}else{
detach(paste ("package", packageName, sep=":"), character.only=TRUE)
}
}
installifnot("knitr")
installifnot("xml2") # May yield problems if some libraries (xml2-config) not available in linux
installifnot("ggnewscale")
bioCifnot ("org.Hs.eg.db")
bioCifnot ("hgu133a.db")
bioCifnot ("GO.db")
bioCifnot ("annotate")
bioCifnot ("Rgraphviz")
bioCifnot ("GOstats")
bioCifnot ("clusterProfiler")
```
# Introduction
This document provides *some examples* on the analyses that can be perfomed on one or more gene lists to help gain biological insight on the results of a differential expression analysis. Overall these analyses are known as _Pathway Analysis_ or, also, _Functional Analysis_.
Functional analysis can be performed in many different ways that lead to similar (or not-so-similar) results. Because there is not a universal acceptance of what is a *complete, well done functional analysis* some different approaches will be shown.
## Input Data for Functional Analysis
Functional analysis can be made, on a first approach on:
- One or more lists of genes __selected__ for being differentially expressed in a given experimental setting. In this case we usually work with gene identifiers.
- One or more list of values measuring the difference between groups (i.e. Fold Changes, p-values, or t-statistics) for all genes being compared.
Most tools require that gene list consist of gene identifiers in some standard notation such as `Entrez`, `ENSEMBL` or other related to these.
These gene lists can be usually extracted from output tables provided by microarrays or RNA-seq data analysis tools.
The examples shown in the document use several _gene lists_ obtained from a maicroarray analysi performed on data from a breast cancer study, but it can be easily extended to more lists or other studies.
## Read data
We start by reading two files that contain the expression values (`expres_AvsB.csv2`) and the results (`Top_AvsB.csv2`) of a differential expression analysis performed using microarrays.
The code and text for the analysis that, using these data, generated these tables, can be found at: [https://github.com/ASPteaching/Ejemplo_de_Analisis_de_Microarrays_con_Bioconductor](https://github.com/ASPteaching/Ejemplo_de_Analisis_de_Microarrays_con_Bioconductor)
The code below assumes the files have been stored in a subdirectory of the current folder named `datasets`.
```{r readData1}
inputDir="datasets"
topTabAvsB <- read.table (file.path(inputDir, "Top_AvsB.csv2"), head=T, sep=";", dec=",", row.names=1)
expresAvsB <- read.table (file.path(inputDir, "expres_AvsB.csv2"), head=T, sep=";", dec=",", row.names=1)
comparisonName <- "AvsB"
dim(topTabAvsB); head(topTabAvsB)
dim(expresAvsB); head(expresAvsB)
```
# Exploring gene lists
A given gene list contains useful information that can be extracted by querying databases.
Let us see how we can obtain information fom the _probesets_ in table (comparison) `AvsB`.
```{r probes}
myProbes <- rownames(expresAvsB)
head(myProbes)
```
We need to load the library ("package") that contains specific annotations for the microarray type that was used in this study.
It has to be noticed also that each row _does not represent a gene, but a probeset_, a sequence that has been designed to detect if a given gene is expressed. Microarrays contain multiple probesets for many genes and this is something that ha to be dealt with.
## ID conversion
In order to do many analyses it is convenient to use a universally accepted identifier such as `Entrez` or `ENSEMBL`.
For instance Bioconductor organism annotation packages rely on `Entrez` identifiers as main key for most mappings.
It is possible to easily find out which mappings are available for each ID.
```{r mappings0}
library(hgu133a.db)
keytypes(hgu133a.db)
```
Annotation packages make it possible to annotate genes and in a similar manner other omics features. For example, we can obtain gene symbol, entrez ID and gene name with a single SQL instruction.
```{r}
geneAnots <- AnnotationDbi::select(hgu133a.db, myProbes, c("SYMBOL", "ENTREZID", "GENENAME"))
head(geneAnots)
```
Now we can provide a more informative list of differentially expressed genes in topTable
```{r}
selected<- topTabAvsB[,"adj.P.Val"]<0.05 & topTabAvsB[,"logFC"] > 1
sum(selected)
selectedTopTab <- topTabAvsB[selected,]
selectedProbes <- rownames(selectedTopTab)
selectedAnots <- AnnotationDbi::select(hgu133a.db, selectedProbes, c("SYMBOL", "ENTREZID", "GENENAME"))
selectedTopTab2 <- cbind(PROBEID=rownames(selectedTopTab), selectedTopTab)
# selectedInfo <- cbind(selectedAnots, selectedTopTab)
selectedInfo = merge(selectedAnots, selectedTopTab2, by="PROBEID")
write.csv2(selectedInfo, file="selectedTopTab_AvsB.csv2")
```
## From gene lists to pathway analysis
Pathway Analysis is an extensive field of research and application and the aim of this document is not to summarize it but simply to illustrate some applications with R.
See (https://github.com/ASPteaching/An-Introduction-to-Pathway-Analysis-with-R-and-Bioconductor/raw/main/Slides/Intro2PathwayEnrichmentAnalysis-SHORT.pdf) for an introduction to the topic.
The most used database in Pathway Analysis is the Gene Ontology, which, as suggested by its name, is not a dabase but an ontology. The `GO.db` can be accessed using the same syntaxis
as with gene identifiers.
# Basic GO Annotation
<!-- ALTRES OPCIONS
Es poden trobar informacions de com consultar la GO a altres llocs:
Per exemple
- El meu vell document OntologyAnalysis.Rnw
- Al workflow d'anotació de Bioconductor: http://bioconductor.org/help/workflows/annotation/annotation/#OrgDb
- https://www.biostars.org/p/81174/ i els enllaços derivats
-->
Bioconductor libraries allow for both:
- Exploration of functional information on genes based on the Gene Ontology
- Different types of Gene Enrichment and Pathway Analysis based on the GO or other pathway databases such as the KEGG, Reactome etc.
The code below shows some easy ways to retrieve GO information associated with genes
Start by loading the appropriate packages
```{r}
require(GOstats)
require(GO.db)
require(hgu133a.db);
require(annotate) # Loads the required libraries.
```
Select the "top 5" genes from the list
```{r top25}
probes <- rownames(expresAvsB)[1:5]
```
In the first versions of Bioconductor, identifiers were managed using environments and functions such as `get` or `mget`. While it still works, nowadays it has been substituted by the use of the `select` function which allows for a muach easier retrieving of identifiers.
`Select` returns a data.frame. If we need a character vector we can obtain it using an ad-hoc function sucha as the simple`charFromSelect`below.
```{r}
charFromSelect <- function(df,posNames=1, posValues=2){
res <- df[,posValues]
names(res) <- df[,posNames]
return(res)
}
require(annotate)
geneIDs <- AnnotationDbi::select(hgu133a.db, probes, c("ENTREZID", "SYMBOL"))
entrezs <- charFromSelect(geneIDs, 1, 2)
simbols <- charFromSelect(geneIDs, 1, 3)
```
Given a set of gene identifiers its _associated_ GO identifiers can be extracted from the microarrays (for probesets ids) or from the organism (for Entrez IDS) annotation package.
```{r GOtable}
# % WANING
# The previous chunk can be substituted by the followink code chunk, shorter and more efficient.
library(hgu133a.db)
keytypes(hgu133a.db)
res <-AnnotationDbi::select(hgu133a.db, keys=probes, keytype = "PROBEID", columns = c("ENTREZID", "SYMBOL","ONTOLOGY"))
res1 <- AnnotationDbi::select(hgu133a.db, keys=probes, keytype = "PROBEID", columns = c("ENTREZID", "SYMBOL","GO"))
```
The resulting tables can be printed or saved.
```{r echo=TRUE}
print(head(res, n=10))
print(head(res1, n=10))
```
This process yield GO identifiers but its meaning is not clear. For this we need to query the `GO.db package`
## What do these GO:XXXXXXX mean?
The `We can query the `GO.db` package can be queried as any anotation package.
```{r mappings}
library(GO.db)
keytypes(GO.db)
columns(GO.db)
```
Assume we are only interested in the first three GO identifiers. We can obtain its name directly using the appropriate `select` call.
First we remove duplicates then we proceed to extract information.
```{r}
uniqueGOTerms <- unique(res1$GO)
selectedGOTerms <- uniqueGOTerms[1:5]
select(GO.db, selectedGOTerms, c("GOID", "ONTOLOGY","TERM" ))
```
## Navigating the GO Graph
The Gene Ontology has a hierarchichal structure. At some moment we can be interested in getting information about the ancestors or the descendants from one term.
The GO.db database contains a series of mappings that provide the _ancestors_ or _childrens_ of any given GOTerm
Take the top GO Term of the previous analysis:
```{r}
oneTerm <- selectedGOTerms[1]
oneParent <- get(oneTerm, GOBPPARENTS) # the vector of its parent terms in the BP ontology.
oneParent
oneChildren<-get(oneTerm, GOBPCHILDREN) # the vector of its children terms in the BP ontolog
oneChildren
oneOffspring<-get(oneTerm, GOBPOFFSPRING) # the vector of its offspring terms in the BP ontology.
oneOffspring
oneChildren %in% oneOffspring
```
While this is interesting and can be used to produce nice plots it is not directly related with Gene Enrichment Analysis and will be ignored from here on.
<!-- Similar analyses can be done on gene lists -->
<!-- ```{r} -->
<!-- require(org.Hs.eg.db) # loads the library -->
<!-- myEIDs3 <-entrezs[1:3] # Create vecotor of input Entrez IDs -->
<!-- myGO <- unlist(org.Hs.egGO[[as.character(myEIDs3[1])]]) -->
<!-- myGO_All <- mget(myEIDs3, org.Hs.egGO) -->
<!-- GOgenes <- org.Hs.egGO2ALLEGS[[myGO[1]]] -->
<!-- GOgenes_All <- mget(myGO[1], org.Hs.egGO2ALLEGS) -->
<!-- ``` -->
# Gene Enrichment Analysis
There are two main types of enrichment analysis:
- _Over-Representation Analysis_ takes a list of differentially expressed genes and it searches for biological categories in which a number of genes appear with "unusually" high frequencies. That is it looks for genes appearing more often than they would be expected by chance in any category.
- _Gene Set Expression Analyses_ works with __all__ genes and looks for differentially expressed gene sets (categories). That is it searches for categories that, without containing an unusually high number of differentially expressed genes, are associated with genes that are in the upper or lower parts of the list of genes ordered by some measure of intensity of difference, such as the "log-Fold Change".
## Over-Representation Analysis
Over-Representation Analysis is applied on a "truncated" list of genes that one considers to be differentially expressed.
These are checked for enrichment versus a "Universe" gene list, usually, all the genes that have entered in the analysis
```{r}
require(hgu133a.db)
topTab <- topTabAvsB
probesUniverse <- rownames(topTab)
# entrezUniverse = unlist(mget(rownames(topTab), hgu133aENTREZID, ifnotfound=NA))
entrezUniverse<- select(hgu133a.db, probesUniverse, "ENTREZID")
entrezUniverse <- entrezUniverse$ENTREZID
whichGenes<- topTab["adj.P.Val"]<0.05 & topTab["logFC"] > 1
sum(whichGenes)
topGenes <- entrezUniverse[whichGenes]
# Remove possible duplicates
topGenes <- topGenes[!duplicated(topGenes)]
entrezUniverse <- entrezUniverse[!duplicated(entrezUniverse)]
```
Many packages can be used to do a Gene Enrichment Analysis. Each of them perfoms slightly different analysis but the underlying ideas are the same. Some of these packages are:
- `GOstats`
- `topGO`
- `gprofiler`
- `clusterProfiler`
We show examples on how to do it using `GOstats` and `clusterProfiler`
### Enrichment Analysis using `GOstats`
In this section we use `GOstats` which was one of the first packages available in Bioconduictor to do Enrichment Analysis.
The idea for using it is pretty simple: We need to create special type of object called "Hyperparameter" which may be of class either:
- `GOHyperGParams` if we want to do the enrichment analysis using the Gene Ontology as reference databases or
- `KEGGHyperGParams` if we wish to do the analysis using the KEGG database, or
- `PFAMHyperGParams` if we wish to do the analysis using the PFAM database.
A good strategy is to use all three.
First, we create the hyper-parameters.
```{r}
library(GOstats)
# This parameter has an "ontology" argument. It may be "BP", "MF" or "CC"
# Other arguments are taken by default. Check the help for more information.
GOparams = new("GOHyperGParams",
geneIds=topGenes, universeGeneIds=entrezUniverse,
annotation="hgu133a.db", ontology="BP",
pvalueCutoff=0.001)
# These hyperparameters do not have an "ontology" argument.
KEGGparams = new("KEGGHyperGParams",
geneIds=topGenes, universeGeneIds=entrezUniverse,
annotation="hgu133a.db",
pvalueCutoff=0.001)
PFAMparams = new("PFAMHyperGParams",
geneIds=topGenes, universeGeneIds=entrezUniverse,
annotation="hgu133a.db",
pvalueCutoff=0.001)
```
Next, we run the analyses:
```{r}
GOhyper = hyperGTest(GOparams)
KEGGhyper = hyperGTest(KEGGparams)
# PFAMhyper = hyperGTest(PFAMparams)
```
We can extract information directly from the analysis results using `summary`:
```{r}
dim(summary(GOhyper))
head(summary(GOhyper))
```
Apart of this we can create an automated report per each result, or, with a litle more effort we can combine all outputs into a single one.
```{r}
# Creamos un informe html con los resultados
comparison = "AvsB"
GOfilename =file.path(paste("GOResults.",comparison,".html", sep=""))
htmlReport(GOhyper, file = GOfilename, summary.args=list("htmlLinks"=TRUE))
```
The resulting file has the contents provided by the call to `summary`but with additional links.
### Enrichment Analysis using `clusterProfiler`
GO enrichment analysis can be performed using `enrichGO` function from `clusterProfiler` package. We will perform the analysis over the Biological Process (BP) GO category.
The Gene Universe, the set of all genes, can be specified as in `GOstats`or simply assumed to be all the human genes.
An aside comment about the format of identifiers:
- `GOstats` expects entrez identifiers to be integer values stored as _characters_.
- `clusterProfiler` requires the same identifiers, but they have to be stored as _integers_.
This means that, in order to reuse identifierswe need to recast them into one or another direction. In this case, because ` topGenes`is a character vector, to use them in Bioconductor we need to convert them into integers using `as.integer(topGenes)`.
```{r}
# In GOstats use the following code
#
# GOparams = new("GOHyperGParams",
# geneIds=topGenes, universeGeneIds=entrezUniverse,
# annotation="hgu133a.db", ontology="BP",
# pvalueCutoff=0.001)
library(clusterProfiler)
## Run GO enrichment analysis
ego <- enrichGO(gene = as.integer(topGenes), # selgenes,
# universe = entrezUniverse, # universe = all_genes,
keyType = "ENTREZID",
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "BH",
qvalueCutoff = 0.25,
readable = TRUE)
head(ego, n=10)
# Output results from GO analysis to a table
ego_results <- data.frame(ego)
write.csv(ego_results, "clusterProfiler_ORAresults_UpGO.csv")
```
### Visualization of enrichment results
**Dotplot of top 10 enriched terms**
```{r}
dotplot(ego, showCategory=10)
```
**Visualization of GO terms in hierarchy**
Enriched GO terms can be visualized as a directed acyclic graph (only for GO):
```{r}
goplot(ego, showCategory=10)
```
**Gene network for the top terms**
```{r}
cnetplot(ego)
```
**Enrichment Map**
Enriched terms can be grouped by some similarity measure (eg. overlap of genes between terms) to summarize the results.
```{r}
## Enrichmap clusters the 50 most significant (by adj.P.Va) GO terms to visualize relationships between terms
library(enrichplot)
ego_sim <- pairwise_termsim(ego)
emapplot(ego_sim, cex_label_category=0.5)
```
**Note:**
For overrepresentation analysis based on **Reactome Pathways** database one can use function `enrichPathway` from `ReactomePA` package.
**Challenge**: perform an overrepresentation analysis of the top down-regulated genes with a logFC < -2 and adjusted p-value < 0.05 over the GO-BP ontology.
## Gene Set Enrichment Analysis
If, instead of relying on the gene lists we decided to use all the genes on the array and confront them to _selected sets of genes_ we may use the *Gene Set Enrichment Analysis* approach.
### Classical GSEA
The `clusterProfiler` package implements the classical GSEA method as introduced by Subramanian et alt (2005).
```{r}
entrezIDs <- AnnotationDbi::select(hgu133a.db, rownames(topTabAvsB), c("ENTREZID"))
# entrezIDs <- charFromSelect(entrezIDs)
topTabAvsB2<- cbind( PROBEID= rownames(topTabAvsB), topTabAvsB)
geneList <- merge(topTabAvsB2, entrezIDs, by="PROBEID")
# sort by absolute logFC to remove duplicates with smallest absolute logFC
geneList <- geneList[order(abs(geneList$logFC), decreasing=T),]
geneList <- geneList[ !duplicated(geneList$ENTREZ), ] ### Keep highest
# re-order based on logFC to be GSEA ready
geneList <- geneList[order(geneList$logFC, decreasing=T),]
genesVector <- geneList$logFC
names(genesVector) <- geneList$ENTREZ
set.seed(123)
library(clusterProfiler)
gseResulti <- gseKEGG(geneList = genesVector,
organism = "hsa",
keyType = "kegg",
exponent = 1,
minGSSize = 10,maxGSSize = 500,
pvalueCutoff = 0.05,pAdjustMethod = "BH",
# nPerm = 10000, #augmentem permutacions a 10000
verbose = TRUE,
use_internal_data = FALSE,
seed = TRUE,
eps=0,
by = "fgsea"
)
# keggResultsList[[i]] <- gseResulti
```
```{r results='asis'}
library(kableExtra)
gsea.result <- setReadable(gseResulti, OrgDb = org.Hs.eg.db, keyType ="ENTREZID" )
gsea.result.df <- as.data.frame(gsea.result)
print(kable(gsea.result.df[,c("Description","setSize","NES","p.adjust")])%>% scroll_box(height = "500px"))
```
```{r eval=FALSE}
library(ggplot2)
# for (i in 1:length(files)){
# cat("\nComparison: ", namesC[i],"\n")
cat("DOTPLOT\n")
# if(nrow(keggResultsList[[i]]) > 0){
if(nrow(gseResulti) > 0){
p<- dotplot(gseResulti, showCategory = 20, font.size = 15,
title =paste("Enriched Pathways\n", comparisonName ,
split=".sign") + facet_grid(.~.sign))
plot(p)
cat("\nENRICHMENT MAP\n")
em<- emapplot(gseResulti)
plot(em)
#guardem en pdf
pdf(file = paste0("KEGGplots.",comparisonName,".pdf"),
width = 14, height = 14)
print(p)
print(em)
dev.off()
}else{
cat("\nNo enriched terms found\n")
}
```