GTSPreport.Rmd

# Analysis of integration site distributions and relative clonal abundance for subject `r sanitize(patient)`
`r format(Sys.Date(), "%b %d %Y")`

```{r setup,echo=FALSE}
opts_chunk$set(fig.path='figureByPatient/', fig.align='left', comment="",
               echo=FALSE, warning=FALSE, error=TRUE, message=FALSE, cache=F, results="asis")
options(knitr.table.format = 'html')
```

## Introduction

The attached report describes results of analysis of integration site distributions and relative abundance for samples from gene therapy trials. For cases of gene correction in hematopoietic stem cells, it is possible to harvest blood cells and analyze the distributions of integration sites. Frequency of isolation information can provide information on the clonal structure of the population. This report summarizes results for subject `r sanitize(patient)` over time points `r sanitize(paste(timepoint,collapse=", "))` in UCSC genome draft `r sanitize(freeze)`. 

The samples studied in this report, the numbers of sequence reads, and unique integration sites available for this subject are shown below. 

```{r summaryTable,results="asis"}
kable(summaryTable, caption="Sample Summary Table", row.names=FALSE, format="html", digits = 2)
```

## Population Size

Under most circumstances only a subset of sites will be sampled. We thus include an estimate of sample size based on frequency of isolation information from the SonicLength method [(Berry, 2012)](http://www.ncbi.nlm.nih.gov/pubmed/22238265). 
The 'S.chao1' column denotes the estimated population size derived using Chao estimate [(Chao, 1987)](http://www.ncbi.nlm.nih.gov/pubmed/3427163). If sample replicates were present then estimates were subjected to jackknife bias correction.

We also quantify population clone structure using Gini Coefficients. The Gini coefficient provides a measure of inequality in clonal abundance in each sample. The coefficient equals zero when all sites are equally abundant (polyclonal) and increases as fewer sites account for more of the total (oligoclonal). 

The table below summarizes sample population for each timepoint & celltype combination. 

```{r ChaoGiniTable,results="asis"}
kable(popSummaryTable, caption="Sample Population Summary", format="html",
      digits = 2, row.names=FALSE)
```

The graph below visualizes population based summaries as a function of time.

```{r pop_graphs, results='asis', fig.width=8, fig.height=6}
if(length(unique(timepointPopulationInfo$group)) >1 ){
  ggplot(data=timepointPopulationInfo, aes(group, value)) + geom_bar(stat="identity") + facet_wrap(~variable, scales="free")
}else{cat(paste0("**Only one timepoint, ", unique(levels(timepointPopulationInfo$group)), ", present.  Insufficient data available to plot population estimators across timepoints.**"))}
```

## Relative abundance of cell clones

The relative abundance of cell clones is summarized in the attached stacked bar graphs.  The cell fraction studied is named at the top, the time points are marked at the bottom. The different bars in each panel show the major cell clones, as marked by integration sites.  A key to the sites is shown at the right.  Each integration site is named by the nearest gene. The '*' indicates if the integration site is with the transcription unit for that gene, the '~' indicates a cancer related gene, and the '!' indicates a gene of interest from previous gene therapy trials (these include genes involved in adverse events, and genes at clustered integration sites from the first SCID trial).

Integration sites were recovered using ligation mediated PCR after random fragmentation of genomic DNA, which reduces recovery biases compared with restriction enzyme cleavage. Relative abundance was not measured from read counts, which are known to be inaccurate, but from marks introduced into DNA specimens prior to PCR amplification using the SonicLength method [PMID:22238265](http://www.ncbi.nlm.nih.gov/pubmed/22238265).

In the barplots below, any sites with Estimated Relative Abundance below `r percent(abundCutoff.barplots)` are binned as LowAbund.

```{r barPlots, fig.height=12, fig.width=12}
siteColors = structure(gg_color_hue(length(unique(barplotAbunds$maskedRefGeneName))), names=unique(barplotAbunds$maskedRefGeneName))
siteColors["LowAbund"] <- "#E0E0E0"

ggplot(data=barplotAbunds, aes(Timepoint, estAbundProp, fill=maskedRefGeneName)) +
  geom_bar(stat="identity") + facet_wrap(~CellType, scales="free") + scale_fill_manual(values=siteColors) +
  labs(y="Relative Sonic Abundance", x="Timepoint") +
  scale_y_continuous(labels=percent)
```

Here is another way to perceive top ranking integration sites by genes within each celltype.  Any sites with Estimated Relative Abundance below `r percent(abundCutoff.detailed)` are binned as LowAbund.

```{r sitetype_heatmap, fig.width=12, fig.height=11}
ggplot(data=detailedAbunds, aes(Timepoint, maskedRefGeneName, fill=estAbundProp)) + geom_tile() +
  scale_fill_continuous(name='Relative\nAbundance', labels=percent, low="#E5F5E0", high="#2B8CBE") + 
  facet_grid(.~CellType, scales="free", space="free") + labs(y="SiteType", x="Timepoint") +
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
```

## Longitudinal behavior of major clones

When multiple time points are available, it is of interest to track the behavior of the most abundant clones.  A plot of the relative abundances of major clones, based on output from SonicLength, is shown below. For cases where only a single time point is available, the data is just plotted as unlinked points. 

```{r ParallelLines, fig.width=10, fig.height=10}
if (has_longitudinal_data) {
  ggplot(longitudinal, aes(x=Timepoint, y=estAbundProp)) +
    geom_point(size=.5) +
    geom_line(aes(colour=posid, group=posid), alpha=.5, show_guide=FALSE) +
    facet_wrap(~CellType, scales="free") +
    ggtitle(paste("Patient:", patient, "Trial:", trial)) + xlab("Timepoint") + 
    scale_y_continuous(name="Relative Sonic Abundance",
                       labels=percent,expand=c(0,0)) +
    theme(axis.text.x=element_text(angle=45,hjust=1,vjust=1))
  }else{cat(paste0("**Only one timepoint, ", unique(levels(timepointPopulationInfo$group)), ", present.  Insufficient data available to plot changes of clone densities across timepoints.**"))}
```

## Integration sites near particular genes of interest

Integration sites near genes that have been associated with adverse events are of particular interest. Thus, we have cataloged all integration sites for which a gene of interest is the nearest cancer-related gene.
Results are summarized below as a scatter plot where the y-axis shows relative abundance of sites and x-axis is distance to the nearest onconogene 5' end.

Negative distances indicate that the integration site is downstream from (i.e. after) the TSS.  Positive distances indicate that the integration site is upstream from (i.e. before) the TSS.  Note that all RefSeq splicing isoforms are used for this analysis, so the reference TSS may not be the same for each listed integration site.


```{r badActors, include=FALSE}
#this chunk has to have include=FALSE otherwise it inexplicably displays a verbatim
#copy of the longitudinal data graph... I have absolutely no idea why...
#a bit hackey but it works
badActorOut <- NULL #clear it out
badActorOut <- lapply(badActors, function(badActor){
  sites <- as.data.frame(badActorData[[badActor]])
  if(nrow(sites)>1){
    knit_child("badActorPartial.Rmd", quiet=T, envir=environment())
    }else{
      knit_expand(text=paste0("### ", badActor, "\n **No sites within 100kb of any ", badActor," TSS for this patient.**\n***"))
    }
  })
```
`r knit(text = unlist(badActorOut))`

## Do any clones account for greater than 20% of the total?
  
For some trials, a reporting criteria is whether any cell clones expand to account for greater than 20% of all clones. This is summarized below for subject `r patient`. Abundance is estimated using the SonicLength method. Data such as this must, of course, be interpreted in the context of results from other assays.

```{r TwentyPercSites, results="asis"}
### Sites >20% of data by Alias ###
knit_exit()
rows <- sites.qc$estAbundance1Prop >= .2

if(any(rows)) {
  test2 <- arrange(unique(sites.qc[rows,c("patient", "timepoint", "celltype", "posID",
                                          "estAbundance1", "estAbundance1Prop",
                                          "estAbundance1Rank", "geneType")]),
                   patient,timepoint,celltype,plyr::desc(estAbundance1Prop),posID)
  tps <- sortTimePoints(test2$timepoint)  
  test2$timepoint <- factor(test2$timepoint, levels=names(tps))
  test2 <- arrange(test2, patient,timepoint,celltype,estAbundance1Rank)  
  names(test2) <- col.keys[names(test2)]
  test2$RelativeAbundance <- percent(test2$RelativeAbundance)
  kable(test2, caption="Sites >20% of the Total", format="html", row.names=FALSE)  
} else { 
  cat("<strong>No sites found in this patient which are greater than 20% of the total data.</strong>")        
}
```

### Do any multihit account for greater than 20% of the total?

Up until now, all the analysis has been looking at unique integration sites. But it is also helpful to look at reads finding multiple equally good scoring hits/places in the genome which can be reffered to as 'Multihits'. If an integration site occurred within a repeat element (i.e. Alus, LINE, SINE, etc), then it might be helpful to access those sites for potential detrimental effects. These collection of sequences are binned and analyzed separately due to their ambiguity. To make some sense of these multihits, we bin any sequence(s) which share 1 or more genomic locations hence forming psuedo-collections which can be reffered to as OTUs (operation taxonomic units). Once the OTUs are formed, we compare breakpoints of unique sites and multihits. The idea is to see if there are any multihits which higher in abundance than a unique site in a given sample. Below is a table similar to the one shown previously except we show any site which might be greater than 20\% of all clones in the data.

```{r Top10All, results="asis"}
rows <- sites.all$estAbundance1Prop >= .2

if(any(rows)) {
  toprint <- unique(subset(sites.all, estAbundance1Prop >= .2)
                    [,setdiff(names(sites.all), c('posID','Chr','strand','Position',
                                                 'Sequence','otuID','Alias',
                                                 'AliasOTUid','estAbundance1PropRank',
                                                 'timepointDay'))
                     ])
  toprint$Aliasposid <- NULL
  tps <- sortTimePoints(toprint$timepoint)  
  toprint$timepoint <- factor(toprint$timepoint, levels=names(tps))
  toprint <- arrange(toprint, patient,timepoint,celltype,estAbundance1Rank)	
  toprint$isMultiHit <- NULL
  names(toprint) <- col.keys[names(toprint)]
  toprint$RelativeAbundance <- percent(toprint$RelativeAbundance)
  kable(toprint, caption="All Sites >20% of the Total", 
        format="html", row.names=FALSE)  
} else { 
  cat("<strong>No sites found in this patient which are >20% of the total data after combining multihits.</strong>")  
}	
```

### SiteTypes

The plot in previous section summarizes overlapping sites at the genomic coordinate level. However, integration sites are often represented by the gene they are in or nearby (SiteType). The plot below summarizes which 'SiteTypes' are often found to be abundant across samples relative to the entire landscape. The sites with abundance greater than 5% and rank within top two are colored.

```{r global_siteType, fig.width=10, fig.height=9}
sums <- aggregateSiteTypes(sites.qc, siteTypeVar="geneType")
sums$timepoint <- sub("(.+):(.+)","\\1",sums$Alias)
sums$celltype <- sub("(.+):(.+)","\\2",sums$Alias)
sums <- merge(sums, with(sums, getRanks(Props, siteType, Alias, "Ranks2")),
              by.x=c("siteType", "Alias"),by.y=c("posID","grouping"))
sums$SiteRank <- factor(with(sums, ifelse(Ranks2<4, Ranks2,">3")), levels=c(1:3,">3"))
sums$siteType2 <- with(sums, ifelse(Props>=0.05 & Ranks2<3, siteType, ""))

tps <- names(sortTimePoints(as.character(sums$timepoint)))
sums$timepoint <- factor(as.character(sums$timepoint), levels=tps)
counts <- count(sums, c("timepoint","celltype"))
sums$celltype <- factor(sums$celltype, levels=unique(counts$celltype))

# set custom color scale #
siteTypeCols <- structure(gg_color_hue(length(unique(sums$siteType2))),
                          names=unique(sums$siteType2))
siteTypeCols[names(siteTypeCols)==""] <- "grey70"

p <- qplot(data=sums, x=timepoint, y=Props, colour=siteType2, xlab="Timepoint",
           geom="jitter", position = position_jitter(h = 0.001)) +
  geom_hline(y=0.05,linetype='dotted') +
  scale_colour_manual(values=siteTypeCols) +
  scale_y_continuous(name="Relative Abundance", labels=percent, expand=c(0,0.01)) + 
  facet_grid(.~celltype, scales="free_x", space="free_x") + 
  theme(axis.text.x=element_text(angle=45, hjust=1, vjust=1))
p <- direct.label(p,"smart.grid")
print(p)
```

What is the most frequently occuring SiteType in subject `r patient`?

```{r wordle, fig.width=8, fig.height=8}
counts <- count(sums,"siteType"); names(counts)[1] <- "word"
suppressWarnings(plotWordCloud(counts, scale=c(3,0.5), min.freq=1, max.words=500, 
                               rot.per = 0, 
                               colors=c(colSets("Set1")[-6],colSets("Paired"))))
```

```{r overlaps_REvsFrag, results='hide', eval=FALSE}
### Overlap Analysis Restriction Enzyme Vs Fragmentase

Often it is of interest to investigate whether integration sites recovered using Restriction Enzyme(s) are seen again using Fragmentase method or not. In the analysis to follow, we divide each sample data by the isolation method and test how many integration sites overlap using window size of 5bp. For comparability, samples labelled with general celltypes such as WB/Blood were replaced with PBMC.

test <- sites.qc
test$isFrag <- grepl('FRAG',as.character(test$enzyme))

## change celltype WB/Blood to PBMC for comparability ##
test$Alias <- gsub("WB|Blood","PBMC",test$Alias,ignore.case=T)

toCheck <- pmin(xtabs(~Alias+isFrag,test),1)
toCheck <- toCheck[rowSums(toCheck)>1,]
if(is.null(dim(toCheck))) {
  cat("<strong>No samples found which used both isolation methods.</strong>")
} else {
  toCheck.aliases <- rownames(toCheck)
  sites.gr <- with(unique(test[test$Alias %in% toCheck.aliases,
                               c('Chr','strand','Alias','Position','isFrag')]), 
                   GRanges(seqnames=Chr, strand=strand, Alias=Alias, isFrag=isFrag, 
                           IRanges(start=Position, width=1)))
  
  
  test.gr <- split(sites.gr, paste(mcols(sites.gr)$Alias))
  
  overlap.res <- sapply(test.gr, findOverlaps, maxgap=5, 
                        ignoreSelf=T, ignoreRedundant=T)
  
  ## find overlap & get union of sites per isolation method for percent total ##
  overlap.res <- lapply(test.gr, 
                        function(x) {
                          res <- as.data.frame(findOverlaps(x, maxgap=5, 
                                                            ignoreSelf=TRUE, 
                                                            ignoreRedundant=TRUE))
                          res$isFrag1 <- mcols(x)$isFrag[res$queryHits]
                          res$isFrag2 <- mcols(x)$isFrag[res$subjectHits]
                          union.res <- length(union(subset(x,mcols(x)$isFrag),
                                                    subset(x,!mcols(x)$isFrag)))
                          cbind(Sample=as.character(x$Alias[1]),
                                count(res,c("isFrag1","isFrag2")),                                
                                UnionSites=union.res)
                        })
  
  rm("sites.gr","test.gr")
  cleanit <- gc()  
  
  overlap.res <- do.call(rbind, overlap.res)
  names(overlap.res)[grepl("freq",names(overlap.res))] <- "TotalOverlap"
  overlap.res$PercentOverlap <- percent(with(overlap.res,TotalOverlap/UnionSites))
  
  overlap.res$Tp <- sub("(.+):.+","\\1",overlap.res$Sample)
  overlap.res$Cell <- sub(".+:(.+)","\\1",overlap.res$Sample)
  tps <- sortTimePoints(overlap.res$Tp)  
  overlap.res$Tp <- factor(overlap.res$Tp, levels=names(tps))
  overlap.res <- arrange(overlap.res, Tp, Cell)
  wanted.cols <- c("Tp", "Cell", "TotalOverlap", "UnionSites", "PercentOverlap")

  kable(overlap.res[,wanted.cols], row.names=FALSE, format="html", digits = 0,
        caption="Sites Overlaping between Isolation Methods")
}
```