# Analysis of integration site distributions and relative clonal abundance for subject `r sanitize(patient)`
`r format(Sys.Date(), "%b %d %Y")`
```{r setup,echo=FALSE}
opts_chunk$set(fig.path='figureByPatient/', fig.align='left', comment="",
echo=FALSE, warning=FALSE, error=TRUE, message=FALSE, cache=F, results="asis")
options(knitr.table.format = 'html')
```
## Introduction
This report describes the analysis of integration site distributions and relative clonal abundance for samples from gene therapy trials. When gene correction is carried out in hematopoietic stem cells, blood cells can be harvested and their integration site distributions analyzed. The frequency with which each site is isolated provides information on the clonal structure of the population. This report summarizes results for subject `r sanitize(patient)` over time points `r sanitize(paste(timepoint,collapse=", "))` in UCSC genome draft `r sanitize(freeze)`.
The samples studied in this report, the numbers of sequence reads, and unique integration sites available for this subject are shown below.
```{r summaryTable,results="asis"}
kable(summaryTable, caption="Sample Summary Table", row.names=FALSE, format="html", digits = 2)
```
## Population Size
Under most circumstances only a subset of sites will be sampled. We therefore include an estimate of population size based on frequency-of-isolation information from the SonicLength method [(Berry, 2012)](http://www.ncbi.nlm.nih.gov/pubmed/22238265).
The 'S.chao1' column denotes the estimated population size derived using the Chao estimator [(Chao, 1987)](http://www.ncbi.nlm.nih.gov/pubmed/3427163). If sample replicates were present, the estimates were subjected to jackknife bias correction.
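As a rough illustration of the idea behind the 'S.chao1' column, the sketch below computes the classic Chao1 lower-bound estimate from per-site abundance counts. The function name `chao1_estimate` and the example counts are assumptions made for illustration only; the values in this report come from the analysis pipeline, which may use a different variant of the estimator and applies jackknife correction when replicates exist.

```r
# Illustrative sketch only: classic Chao1 richness estimate.
# `abund` holds one estimated abundance (cell count) per observed integration site.
chao1_estimate <- function(abund) {
  abund <- abund[abund > 0]
  s_obs <- length(abund)        # number of sites actually observed
  f1 <- sum(abund == 1)         # sites observed exactly once (singletons)
  f2 <- sum(abund == 2)         # sites observed exactly twice (doubletons)
  if (f2 > 0) {
    s_obs + f1^2 / (2 * f2)
  } else {
    # bias-corrected form, used here to avoid dividing by zero
    s_obs + f1 * (f1 - 1) / 2
  }
}

example_abund <- c(1, 1, 1, 2, 2, 5, 10)   # hypothetical counts
chao1_estimate(example_abund)
```

Many singletons relative to doubletons push the estimate well above the observed count, which is the signal that much of the population remains unsampled.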
We also quantify population clone structure using Gini Coefficients. The Gini coefficient provides a measure of inequality in clonal abundance in each sample. The coefficient equals zero when all sites are equally abundant (polyclonal) and increases as fewer sites account for more of the total (oligoclonal).
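The behavior of the Gini coefficient can be checked with a small sketch. The function `gini_coefficient` below is a hypothetical helper written for illustration, using the standard rank-based formula on a vector of per-site abundances; it is not necessarily the code used to produce the table in this report.

```r
# Illustrative sketch only: Gini coefficient of per-site abundances.
gini_coefficient <- function(abund) {
  x <- sort(abund)              # ascending order
  n <- length(x)
  # rank-weighted sum, equivalent to the mean absolute difference form
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini_coefficient(c(5, 5, 5, 5))   # all sites equally abundant (polyclonal)
gini_coefficient(c(0, 0, 0, 4))   # a single site dominates (oligoclonal)
```

With `n` sites, this form reaches at most `(n - 1) / n`, so samples with very few sites cannot reach values close to one.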
The table below summarizes sample population for each timepoint & celltype combination.
```{r ChaoGiniTable,results="asis"}
kable(popSummaryTable, caption="Sample Population Summary", format="html",
digits = 2, row.names=FALSE)
```
The graph below shows the population-based summaries as a function of time.
```{r pop_graphs, results='asis', fig.width=8, fig.height=6}
if(length(unique(timepointPopulationInfo$group)) >1 ){
ggplot(data=timepointPopulationInfo, aes(group, value)) + geom_bar(stat="identity") + facet_wrap(~variable, scales="free")
}else{cat(paste0("**Only one timepoint, ", unique(levels(timepointPopulationInfo$group)), ", present. Insufficient data available to plot population estimators across timepoints.**"))}
```
## Relative abundance of cell clones
The relative abundance of cell clones is summarized in the attached stacked bar graphs. The cell fraction studied is named at the top, and the time points are marked at the bottom. The different bars in each panel show the major cell clones, as marked by integration sites. A key to the sites is shown at the right. Each integration site is named by the nearest gene. The '*' indicates that the integration site is within the transcription unit for that gene, the '~' indicates a cancer-related gene, and the '!' indicates a gene of interest from previous gene therapy trials (these include genes involved in adverse events, and genes at clustered integration sites from the first SCID trial).
Integration sites were recovered using ligation mediated PCR after random fragmentation of genomic DNA, which reduces recovery biases compared with restriction enzyme cleavage. Relative abundance was not measured from read counts, which are known to be inaccurate, but from marks introduced into DNA specimens prior to PCR amplification using the SonicLength method [PMID:22238265](http://www.ncbi.nlm.nih.gov/pubmed/22238265).
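To make the distinction between read counts and fragment-based abundance concrete, the sketch below counts distinct fragment lengths per integration site, since random fragmentation gives each sampled cell its own breakpoint. This is a deliberately naive stand-in: the SonicLength method fits a statistical model that also accounts for different cells producing the same fragment length. The data frame, column names, and `naive_abundance` function are all hypothetical.

```r
# Illustrative sketch only: one row per sequenced read (hypothetical data).
fragments <- data.frame(
  posid  = c("chr1+1000", "chr1+1000", "chr1+1000", "chr1+1000",
             "chr2-500", "chr2-500", "chr2-500"),
  length = c(120, 120, 120, 120, 88, 150, 301),
  stringsAsFactors = FALSE
)

naive_abundance <- function(fragments) {
  # PCR duplicates share a site and a length, so keep one row per pair
  distinct <- unique(fragments[, c("posid", "length")])
  counts <- table(distinct$posid)
  data.frame(posid = names(counts),
             nLengths = as.integer(counts),
             prop = as.numeric(counts) / sum(counts),
             stringsAsFactors = FALSE)
}

naive_abundance(fragments)
```

In this toy input the first site has more reads but only one distinct fragment length, so it is ranked below the second site, which has three.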
In the barplots below, any sites with Estimated Relative Abundance below `r percent(abundCutoff.barplots)` are binned as LowAbund.
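A minimal sketch of this kind of binning is shown here, assuming a data frame with one row per site and hypothetical columns `gene` and `prop`. The helper `bin_low_abundance` is illustrative and may differ from how the pipeline prepares `barplotAbunds`.

```r
# Illustrative sketch only: pool sites below a cutoff into one "LowAbund" bin.
bin_low_abundance <- function(df, cutoff) {
  df$label <- ifelse(df$prop < cutoff, "LowAbund", as.character(df$gene))
  # sum proportions within each label so every bar still totals 100%
  aggregate(prop ~ label, data = df, FUN = sum)
}

sites <- data.frame(gene = c("GENE_A", "GENE_B", "GENE_C", "GENE_D"),
                    prop = c(0.50, 0.30, 0.15, 0.05),
                    stringsAsFactors = FALSE)
bin_low_abundance(sites, cutoff = 0.20)
```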
```{r barPlots, fig.height=12, fig.width=12}
siteColors = structure(gg_color_hue(length(unique(barplotAbunds$maskedRefGeneName))), names=unique(barplotAbunds$maskedRefGeneName))
siteColors["LowAbund"] <- "#E0E0E0"
ggplot(data=barplotAbunds, aes(Timepoint, estAbundProp, fill=maskedRefGeneName)) +
geom_bar(stat="identity") + facet_wrap(~CellType, scales="free") + scale_fill_manual(values=siteColors) +
labs(y="Relative Sonic Abundance", x="Timepoint") +
scale_y_continuous(labels=percent)
```
The heatmap below offers another view of the top-ranking integration sites, named by nearest gene, within each cell type. Any sites with Estimated Relative Abundance below `r percent(abundCutoff.detailed)` are binned as LowAbund.
```{r sitetype_heatmap, fig.width=12, fig.height=11}
ggplot(data=detailedAbunds, aes(Timepoint, maskedRefGeneName, fill=estAbundProp)) + geom_tile() +
scale_fill_continuous(name='Relative\nAbundance', labels=percent, low="#E5F5E0", high="#2B8CBE") +
facet_grid(.~CellType, scales="free", space="free") + labs(y="SiteType", x="Timepoint") +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
```
## Longitudinal behavior of major clones
When multiple time points are available, it is of interest to track the behavior of the most abundant clones. A plot of the relative abundances of major clones, based on output from SonicLength, is shown below. When only a single time point is available, no plot is drawn and a note is shown instead.
```{r ParallelLines, fig.width=10, fig.height=10}
if (has_longitudinal_data) {
ggplot(longitudinal, aes(x=Timepoint, y=estAbundProp)) +
geom_point(size=.5) +
geom_line(aes(colour=posid, group=posid), alpha=.5, show.legend=FALSE) +
facet_wrap(~CellType, scales="free") +
ggtitle(paste("Patient:", patient, "Trial:", trial)) + xlab("Timepoint") +
scale_y_continuous(name="Relative Sonic Abundance",
labels=percent,expand=c(0,0)) +
theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1))
}else{cat(paste0("**Only one timepoint, ", unique(levels(timepointPopulationInfo$group)), ", present. Insufficient data available to plot changes of clone densities across timepoints.**"))}
```
## Integration sites near particular genes of interest
Integration sites near genes that have been associated with adverse events are of particular interest. Thus, we have cataloged all integration sites for which a gene of interest is the nearest cancer-related gene.
Results are summarized below as a scatter plot in which the y-axis shows the relative abundance of sites and the x-axis shows the distance to the 5' end of the nearest oncogene.
Negative distances indicate that the integration site is downstream from (i.e. after) the TSS. Positive distances indicate that the integration site is upstream from (i.e. before) the TSS. Note that all RefSeq splicing isoforms are used for this analysis, so the reference TSS may not be the same for each listed integration site.
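The sign convention above can be written as a short strand-aware calculation. The function `signed_tss_distance` and its arguments are hypothetical names used only to illustrate the convention; the report's own distances are computed by the analysis pipeline.

```r
# Illustrative sketch only: signed distance from an integration site to a TSS.
# Positive = site is upstream of (before) the TSS; negative = downstream (after).
signed_tss_distance <- function(site_pos, tss_pos, gene_strand) {
  # on the minus strand, "upstream" means a larger genomic coordinate
  direction <- ifelse(gene_strand == "+", 1, -1)
  (tss_pos - site_pos) * direction
}

signed_tss_distance(site_pos = 900,  tss_pos = 1000, gene_strand = "+")  # upstream
signed_tss_distance(site_pos = 900,  tss_pos = 1000, gene_strand = "-")  # downstream
```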
```{r badActors, include=FALSE}
#this chunk has to have include=FALSE otherwise it inexplicably displays a verbatim
#copy of the longitudinal data graph... I have absolutely no idea why...
#a bit hackey but it works
badActorOut <- NULL #clear it out
badActorOut <- lapply(badActors, function(badActor){
sites <- as.data.frame(badActorData[[badActor]])
if(nrow(sites)>1){
knit_child("badActorPartial.Rmd", quiet=T, envir=environment())
}else{
knit_expand(text=paste0("### ", badActor, "\n **No sites within 100kb of any ", badActor," TSS for this patient.**\n***"))
}
})
```
`r knit(text = unlist(badActorOut))`
## Do any clones account for greater than 20% of the total?
For some trials, a reporting criterion is whether any cell clone expands to account for greater than 20% of all clones. This is summarized below for subject `r patient`. Abundance is estimated using the SonicLength method. Data such as this must, of course, be interpreted in the context of results from other assays.
```{r TwentyPercSites, results="asis"}
### Sites >20% of data by Alias ###
knit_exit()
rows <- sites.qc$estAbundance1Prop >= .2
if(any(rows)) {
test2 <- arrange(unique(sites.qc[rows,c("patient", "timepoint", "celltype", "posID",
"estAbundance1", "estAbundance1Prop",
"estAbundance1Rank", "geneType")]),
patient,timepoint,celltype,plyr::desc(estAbundance1Prop),posID)
tps <- sortTimePoints(test2$timepoint)
test2$timepoint <- factor(test2$timepoint, levels=names(tps))
test2 <- arrange(test2, patient,timepoint,celltype,estAbundance1Rank)
names(test2) <- col.keys[names(test2)]
test2$RelativeAbundance <- percent(test2$RelativeAbundance)
kable(test2, caption="Sites >20% of the Total", format="html", row.names=FALSE)
} else {
cat("<strong>No sites found in this patient which are greater than 20% of the total data.</strong>")
}
```
### Do any multihits account for greater than 20% of the total?
Up to this point, the analysis has considered only unique integration sites. It is also helpful to examine reads that align equally well to multiple places in the genome, referred to here as 'multihits'. If an integration site lies within a repeat element (e.g. Alu, LINE, or SINE elements), it may still be worth assessing that site for potentially detrimental effects. Because of their ambiguity, these sequences are binned and analyzed separately. To make sense of the multihits, we group any sequences that share one or more genomic locations, forming pseudo-collections referred to as OTUs (operational taxonomic units). Once the OTUs are formed, we compare the breakpoints of unique sites and multihits to see whether any multihit is more abundant than a unique site in a given sample. The table below is similar to the previous one, except that it lists any site, unique or multihit, that accounts for more than 20\% of all clones in the data.
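Grouping sequences that share at least one genomic location amounts to finding connected components. The sketch below does this with a simple union-find over a hypothetical table of alignments (columns `readID` and `location`); `assign_otus` is an illustrative name, and the pipeline's actual OTU construction may differ.

```r
# Illustrative sketch only: group multihit reads into OTUs.
# Two reads end up in the same OTU if they are linked, directly or through
# other reads, by at least one shared genomic location.
assign_otus <- function(hits) {
  read_ids <- as.character(hits$readID)
  reads <- unique(read_ids)
  parent <- seq_along(reads)          # each read starts as its own root

  find_root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  # merge all reads that align to the same location
  for (members in split(match(read_ids, reads), as.character(hits$location))) {
    roots <- vapply(unique(members), find_root, integer(1))
    parent[roots] <- min(roots)
  }

  roots <- vapply(seq_along(reads), find_root, integer(1))
  data.frame(readID = reads,
             otuID = match(roots, unique(roots)),
             stringsAsFactors = FALSE)
}

hits <- data.frame(
  readID   = c("readA", "readA", "readB", "readB", "readC"),
  location = c("chr1+100", "chr5+900", "chr5+900", "chr7-40", "chrX+12"),
  stringsAsFactors = FALSE
)
assign_otus(hits)
```

Here `readA` and `readB` share `chr5+900` and fall into one OTU, while `readC` forms its own.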
```{r Top10All, results="asis"}
rows <- sites.all$estAbundance1Prop >= .2
if(any(rows)) {
toprint <- unique(subset(sites.all, estAbundance1Prop >= .2)
[,setdiff(names(sites.all), c('posID','Chr','strand','Position',
'Sequence','otuID','Alias',
'AliasOTUid','estAbundance1PropRank',
'timepointDay'))
])
toprint$Aliasposid <- NULL
tps <- sortTimePoints(toprint$timepoint)
toprint$timepoint <- factor(toprint$timepoint, levels=names(tps))
toprint <- arrange(toprint, patient,timepoint,celltype,estAbundance1Rank)
toprint$isMultiHit <- NULL
names(toprint) <- col.keys[names(toprint)]
toprint$RelativeAbundance <- percent(toprint$RelativeAbundance)
kable(toprint, caption="All Sites >20% of the Total",
format="html", row.names=FALSE)
} else {
cat("<strong>No sites found in this patient which are >20% of the total data after combining multihits.</strong>")
}
```
### SiteTypes
The plot in the previous section summarizes overlapping sites at the genomic coordinate level. However, integration sites are often represented by the gene they are in or nearby (SiteType). The plot below summarizes which 'SiteTypes' are frequently abundant across samples relative to the entire landscape. SiteTypes with abundance of at least 5% that also rank within the top two in their sample are colored and labelled.
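The highlighting rule can be sketched in base R as follows. The data frame, its columns (`sample`, `siteType`, `prop`), and the helper `label_top_site_types` are hypothetical; the report itself relies on pipeline helpers such as `aggregateSiteTypes` and `getRanks`, whose internals are not shown here.

```r
# Illustrative sketch only: keep a label for SiteTypes that are both abundant
# (at least `min_prop`) and among the `max_rank` most abundant in their sample.
label_top_site_types <- function(df, min_prop = 0.05, max_rank = 2) {
  # rank within each sample, 1 = most abundant
  rank_in_sample <- ave(-df$prop, df$sample,
                        FUN = function(x) rank(x, ties.method = "min"))
  df$label <- ifelse(df$prop >= min_prop & rank_in_sample <= max_rank,
                     as.character(df$siteType), "")
  df
}

example <- data.frame(
  sample   = c("d30:PBMC", "d30:PBMC", "d30:PBMC", "d90:PBMC", "d90:PBMC"),
  siteType = c("GENE_A", "GENE_B", "GENE_C", "GENE_D", "GENE_E"),
  prop     = c(0.60, 0.30, 0.10, 0.97, 0.03),
  stringsAsFactors = FALSE
)
label_top_site_types(example)
```

In the second sample, `GENE_E` ranks second but falls below the 5% threshold, so it stays unlabelled.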
```{r global_siteType, fig.width=10, fig.height=9}
sums <- aggregateSiteTypes(sites.qc, siteTypeVar="geneType")
sums$timepoint <- sub("(.+):(.+)","\\1",sums$Alias)
sums$celltype <- sub("(.+):(.+)","\\2",sums$Alias)
sums <- merge(sums, with(sums, getRanks(Props, siteType, Alias, "Ranks2")),
by.x=c("siteType", "Alias"),by.y=c("posID","grouping"))
sums$SiteRank <- factor(with(sums, ifelse(Ranks2<4, Ranks2,">3")), levels=c(1:3,">3"))
sums$siteType2 <- with(sums, ifelse(Props>=0.05 & Ranks2<3, siteType, ""))
tps <- names(sortTimePoints(as.character(sums$timepoint)))
sums$timepoint <- factor(as.character(sums$timepoint), levels=tps)
counts <- count(sums, c("timepoint","celltype"))
sums$celltype <- factor(sums$celltype, levels=unique(counts$celltype))
# set custom color scale #
siteTypeCols <- structure(gg_color_hue(length(unique(sums$siteType2))),
names=unique(sums$siteType2))
siteTypeCols[names(siteTypeCols)==""] <- "grey70"
p <- qplot(data=sums, x=timepoint, y=Props, colour=siteType2, xlab="Timepoint",
geom="jitter", position = position_jitter(h = 0.001)) +
geom_hline(yintercept=0.05,linetype='dotted') +
scale_colour_manual(values=siteTypeCols) +
scale_y_continuous(name="Relative Abundance", labels=percent, expand=c(0,0.01)) +
facet_grid(.~celltype, scales="free_x", space="free_x") +
theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
p <- direct.label(p,"smart.grid")
print(p)
```
What is the most frequently occurring SiteType in subject `r patient`?
```{r wordle, fig.width=8, fig.height=8}
counts <- count(sums,"siteType"); names(counts)[1] <- "word"
suppressWarnings(plotWordCloud(counts, scale=c(3,0.5), min.freq=1, max.words=500,
rot.per = 0,
colors=c(colSets("Set1")[-6],colSets("Paired"))))
```
```{r overlaps_REvsFrag, results='hide', eval=FALSE}
### Overlap Analysis Restriction Enzyme Vs Fragmentase
## Often it is of interest to investigate whether integration sites recovered using restriction enzyme(s) are also recovered using the Fragmentase method. In the analysis that follows, we divide each sample's data by isolation method and test how many integration sites overlap using a window size of 5 bp. For comparability, samples labelled with general cell types such as WB/Blood were relabelled as PBMC.
test <- sites.qc
test$isFrag <- grepl('FRAG',as.character(test$enzyme))
## change celltype WB/Blood to PBMC for comparability ##
test$Alias <- gsub("WB|Blood","PBMC",test$Alias,ignore.case=T)
toCheck <- pmin(xtabs(~Alias+isFrag,test),1)
toCheck <- toCheck[rowSums(toCheck)>1,]
if(is.null(dim(toCheck))) {
cat("<strong>No samples found which used both isolation methods.</strong>")
} else {
toCheck.aliases <- rownames(toCheck)
sites.gr <- with(unique(test[test$Alias %in% toCheck.aliases,
c('Chr','strand','Alias','Position','isFrag')]),
GRanges(seqnames=Chr, strand=strand, Alias=Alias, isFrag=isFrag,
IRanges(start=Position, width=1)))
test.gr <- split(sites.gr, paste(mcols(sites.gr)$Alias))
overlap.res <- sapply(test.gr, findOverlaps, maxgap=5,
ignoreSelf=T, ignoreRedundant=T)
## find overlap & get union of sites per isolation method for percent total ##
overlap.res <- lapply(test.gr,
function(x) {
res <- as.data.frame(findOverlaps(x, maxgap=5,
ignoreSelf=TRUE,
ignoreRedundant=TRUE))
res$isFrag1 <- mcols(x)$isFrag[res$queryHits]
res$isFrag2 <- mcols(x)$isFrag[res$subjectHits]
union.res <- length(union(subset(x,mcols(x)$isFrag),
subset(x,!mcols(x)$isFrag)))
cbind(Sample=as.character(x$Alias[1]),
count(res,c("isFrag1","isFrag2")),
UnionSites=union.res)
})
rm("sites.gr","test.gr")
cleanit <- gc()
overlap.res <- do.call(rbind, overlap.res)
names(overlap.res)[grepl("freq",names(overlap.res))] <- "TotalOverlap"
overlap.res$PercentOverlap <- percent(with(overlap.res,TotalOverlap/UnionSites))
overlap.res$Tp <- sub("(.+):.+","\\1",overlap.res$Sample)
overlap.res$Cell <- sub(".+:(.+)","\\1",overlap.res$Sample)
tps <- sortTimePoints(overlap.res$Tp)
overlap.res$Tp <- factor(overlap.res$Tp, levels=names(tps))
overlap.res <- arrange(overlap.res, Tp, Cell)
wanted.cols <- c("Tp", "Cell", "TotalOverlap", "UnionSites", "PercentOverlap")
kable(overlap.res[,wanted.cols], row.names=FALSE, format="html", digits = 0,
caption="Sites Overlapping between Isolation Methods")
}
```