-
Notifications
You must be signed in to change notification settings - Fork 0
/
signature-analysis.Rmd
208 lines (152 loc) · 9.15 KB
/
signature-analysis.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
# **Signature Score Calculation**
The **`calculate_sig_score`** function integrates three gene set scoring methods: **ssGSEA**, **PCA**, and **z-score**.By inputting preprocessed transcriptomic data, the function can batch-calculate signature scores for each sample. This streamlined approach enables efficient and flexible scoring of gene signatures, supporting diverse research needs in transcriptomic data analysis.
## Loading packages
Load the IOBR package in your R session after the installation is complete:
```{r, eval=TRUE, warning=FALSE, message=FALSE}
library(IOBR)
library(survminer)
library(tidyverse)
```
## Downloading data for example
Obtaining data set from GEO [Gastric cancer: GSE62254](https://pubmed.ncbi.nlm.nih.gov/25894828/) using `GEOquery` R package.
```{r,message=FALSE,warning=FALSE}
if (!requireNamespace("GEOquery", quietly = TRUE)) BiocManager::install("GEOquery")
library("GEOquery")
# NOTE: This process may take a few minutes which depends on the internet connection speed. Please wait for its completion.
eset_geo <- getGEO(GEO = "GSE62254", getGPL = F, destdir = "./")
eset <-eset_geo[[1]]
eset <-exprs(eset)
eset[1:5,1:5]
```
Annotation of genes in the expression matrix and removal of duplicate genes.
```{r,message=FALSE,warning=FALSE}
# Load the annotation file `anno_hug133plus2` in IOBR.
head(anno_hug133plus2)
# Conduct gene annotation using `anno_hug133plus2` file; If identical gene symbols exists, these genes would be ordered by the mean expression levels. The gene symbol with highest mean expression level is selected and remove others.
eset<-anno_eset(eset = eset,
annotation = anno_hug133plus2,
symbol = "symbol",
probe = "probe_id",
method = "mean")
eset[1:5, 1:3]
```
## Signature score estimation
### Signature collection of IOBR
**322 reported gene signatures** are collected in IOBR, including those related to the TME, metabolism, gene signatures derived from single-cell RNA-seq data, and others. The extensive signature collection is categorized into three distinct groups: **TME-associated**, **tumour-metabolism**, and **tumour-intrinsic signatures**.
```{r}
# Return available parameter options of signature estimation.
signature_score_calculation_methods
#TME associated signatures
names(signature_tme)[1:20]
```
```{r}
#Metabolism related signatures
names(signature_metabolism)[1:20]
```
Signatures associated with basic biomedical research, such as m6A, TLS, ferroptosis and exosomes.
```{r}
names(signature_tumor)
```
`signature_collection` including all aforementioned signatures
```{r}
names(signature_collection)[1:20]
```
```{r}
#citation of signatures
signature_collection_citation[1:20, ]
```
The evaluation of signature scores involved three methodologies: Single-sample Gene Set Enrichment Analysis (ssGSEA), Principal Component Analysis (PCA), and Z-score.
## Estimation of signature using PCA method
The PCA method is ideal for gene sets with co-expression. Heatmaps and correlation matrices can be used to determine if co-expression is present in the applicable gene set.
```{r, message = F, warning = F}
sig_tme<-calculate_sig_score(pdata = NULL,
eset = eset,
signature = signature_collection,
method = "pca",
mini_gene_count = 2)
sig_tme <- t(column_to_rownames(sig_tme, var = "ID"))
sig_tme[1:5, 1:3]
```
## Estimated using the ssGSEA methodology
This method is appropriate for gene sets that contain a large number of genes (> 30 genes), such as those of [GO, KEGG, REACTOME gene sets](https://www.gsea-msigdb.org/gsea/msigdb).
```{r echo=FALSE, fig.cap='Gene sets of MSigDb', out.width='95%', fig.asp=.85, fig.align='center'}
knitr::include_graphics(rep("./fig/gsea.png", 1))
```
```{r, message = F, warning = F, eval=FALSE}
sig_tme<-calculate_sig_score(pdata = NULL,
eset = eset,
signature = go_bp,
method = "ssgsea",
mini_gene_count = 2)
```
## Calculated using the z-score function.
```{r, message = F, warning = F, eval=FALSE}
sig_tme<-calculate_sig_score(pdata = NULL,
eset = eset,
signature = signature_collection,
method = "zscore",
mini_gene_count = 2)
```
## Calculated using all three methods at the same time
```{r, message = F, warning = F, eval=FALSE}
sig_tme<-calculate_sig_score(pdata = NULL,
eset = eset,
signature = signature_collection,
method = "integration",
mini_gene_count = 2)
```
The same signature in this case will be scored using all three methods simultaneously.
```{r, message = F, warning = F, eval=FALSE}
colnames(sig_tme)[grep(colnames(sig_tme), pattern = "CD_8_T_effector")]
```
The `select_method()` function allows the user to extract data using various methods.
```{r, message = F, warning = F, eval=FALSE}
sig_tme_pca <- select_method(data = sig_tme, method = "pca")
colnames(sig_tme_pca)[grep(colnames(sig_tme_pca), pattern = "CD_8_T_effector")]
```
## Classification of signatures
As more signatures related to the tumour microenvironment were collected in IOBR, and we may continue to add gene signatures related to the tumour microenvironment in the future, we have made a basic classification of these signatures by combining them with our analysis experience. Users can compare the signatures in the same group during the analysis process to improve the reliability and consistency of the conclusions.
```{r, message = F, warning = F, eval=FALSE}
sig_group[1:8]
```
## How to customise the signature gene list for `calculate_signature_score`
To overcome the limitations of fixed signatures in IOBR, users are now allowed to create customized gene signature lists for the **`calculate_sig_score`** function, enabling more flexible transcriptomic data analysis tailored to specific research goals.
### Method-1: Use excel for storage and construction
Users can collect gene signatures using either an `Excel` or `CSV` file. The format should have the name of the signature in the first row, followed by the genes contained in each signature from the second row onwards. Once imported, the function `format_signature` can be used to transform the data into a gene list of signatures required for `calculate_signature_score`. To import the file into R, users can use the functions read.csv or read_excel. It is important to note here that the user needs to use the longest signature as a criterion and then replace all the vacant grids using NA, otherwise an error may be reported when reading into R.
Here we provide a sample data `sig_excel`, please refer to this format to construct the required csv or excel files.
```{r, message = F, warning = F}
data("sig_excel", package = "IOBR")
sig <- format_signatures(sig_excel)
print(sig[1:5])
```
For simple structures or when the number of signatures to be added is relatively small, the following two methods can also be used.
### Method-2: Build the list structure directly
```{r, message = F, warning = F}
sig <- list("CD8" = c("CD8A", "CXCL10", "CXCL9", "GZMA", "GZMB", "IFNG", "PRF1", "TBX21"),
"ICB" = c("CD274", "PDCD1LG2", "CTLA4", "PDCD1", "LAG3", "HAVCR2", "TIGIT" ))
sig
```
### Method-3: Add the new signature to the existing gene list
```{r, warning = F}
sig<- signature_tumor
sig$CD8 <- c("CD8A", "CXCL10", "CXCL9", "GZMA", "GZMB", "IFNG", "PRF1", "TBX21")
sig
```
### Method-4: Construct cell-specific gene signatures from single-cell differential analysis results
```{r, message = F, warning = F}
data('deg', package = "IOBR")
sig <- get_sig_sc(deg, cluster = "cluster", gene = "gene", avg_log2FC = "avg_log2FC", n = 100)
sig
```
## How to export gene signature
Using the `output_sig` function, user can export the signatures of the list structure to a csv file for other purposes. This step is exactly the reverse of `format_signatures`.
```{r, warning = F}
sig <- output_sig(signatures = signature_sc, format = "csv", file.name = "sc_signature")
sig[1:8, 1:5]
```
## References
**ssgsea**: Barbie, D.A. et al (2009). Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1. Nature, 462(5):108-112.
**gsva**: Hänzelmann, S., Castelo, R. and Guinney, J. (2013). GSVA: Gene set variation analysis for microarray and RNA-Seq data. BMC Bioinformatics, 14(1):7.
**zscore**: Lee, E. et al (2008). Inferring pathway activity toward precise disease classification. PLoS Comp Biol, 4(11):e1000217.
**PCA method**: Mariathasan S, Turley SJ, Nickles D, et al. TGFβ attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells. Nature. 2018 Feb 22;554(7693):544-548.
**MSigDB**:Dolgalev I (2022). msigdbr: MSigDB Gene Sets for Multiple Organisms in a Tidy Data Format. R package version 7.5.1. (https://www.gsea-msigdb.org/gsea/msigdb/)