---
title: 'BINF 6110 Project 4: Direct Principal Component Analysis of Sequence Matrix
Tutorial'
author: "Abinaya Yogasekaram"
date: "09/04/2021"
output:
html_document: default
word_document: default
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, tidy.opts = list(width.cutoff=60), tidy=TRUE, warning = FALSE, message = FALSE)
knitr::opts_knit$set(root.dir = '~/Desktop')
knitr::opts_chunk$set(fig.width=10, fig.height=8)
```
## Introduction
Principal Component Analysis (PCA) is a workhorse of statistics (Lever *et al*., 2017). The results gleaned from PCA can drive downstream analysis and unearth potential structures or motifs within the data. Extending this mathematical approach to genomics can prove fruitful in understanding patterns of variation (Lever *et al*., 2017). Leveraging these differences to understand population variants may serve as an important stepping stone in analysis.
However, current methods for conducting PCA on genomes rest on many assumptions about the genome and are well-tailored to human or model-organism data (Konishi *et al*., 2019). These assumptions include chromosome delineation and rates of evolution (Gauch *et al*., 2018) and are therefore not transferable to the study of, for example, viral genomes, whose rates of evolution and variation are unlike the mutation rates seen in other organisms. Moreover, there are no chromosomes: viral genomes can vary by strand and genetic material (Baron, 1996). Available PCA methods also fail to capture unique motifs through the distance matrix. Summarizing differences between genomes via distance measures often masks the genetic motifs and features that help us understand *how* samples differ from each other (Konishi *et al*., 2019).
To circumvent this, Konishi *et al*. propose a direct PCA approach that encodes the sequences as boolean vectors - akin to one-hot encoding in machine learning. Each position of a sequence alignment contributes five variables (one each for A, C, G, T, and -/N). With each row representing a sample, a column is populated with 1 if the given nucleotide occupies that position and 0 if not. The resulting matrix is subject to singular value decomposition (rotation) to identify the principal components. This approach sheds invalid assumptions about rates of mutation accumulation. Samples such as viral genomes can be assessed at face value through a direct comparison of their sequences, and distances between genomes can be traced directly to the sites and bases that gave rise to each distance score.
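To make the encoding concrete, here is a minimal sketch (my own illustration, not code from Konishi *et al*.) that one-hot encodes two short aligned sequences using the same base-major column layout the tutorial builds below:
```{r eval=FALSE, include=TRUE}
# Toy illustration of the boolean encoding: two aligned 4-base sequences
# become rows of a 2 x 20 matrix (5 possible characters x 4 positions).
toy <- c(s1 = "acgt", s2 = "acga")
bases <- c("a", "t", "g", "c", "n")
encode <- function(seq) {
  chars <- strsplit(tolower(seq), "")[[1]]
  # base-major layout as in the tutorial: all A columns, then T, G, C, N/-
  # (the full loop later in the tutorial also routes any non-ACGT
  # character into the N/- block)
  unlist(lapply(bases, function(b) as.integer(chars == b)))
}
toy.matrix <- t(sapply(toy, encode))
dim(toy.matrix)       # 2 samples x 20 columns
rowSums(toy.matrix)   # each row sums to the sequence length (4)
```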
The objective of this analysis is to apply the direct PCA approach to 120 complete coronavirus genomes and see whether it mirrors the clade separation and differentiation reported by Li *et al*. (2020), who explored the supertree method of phylogenetic analysis for SARS-CoV2 evolution. Concordance between the two methods would illustrate the value of Konishi's approach and may highlight clustering patterns not captured in the supertree phylogeny. The direct PCA approach will be used to determine the distances between samples and the clustering of samples and sites based on the leading principal components. The goals of the analysis are to visualize the sample- and nucleotide-position-based principal components for motifs and to discern whether the method is robust enough to use in place of existing PCA methods.
# Tutorial
This guide introduces the concept of direct PCA and its workflow. I encourage you to read the original paper describing this method by Konishi *et al*. I would also suggest running the R code on a high-performance computing cluster for faster computation (the steps that populate the boolean matrix can be computationally expensive).
For this method, you will need 1) a text file containing the accession IDs of the sequences you want to compare (if using NCBI GenBank sequences) and 2) R plus some Bioconductor packages for alignment.
A few points of note before getting started:
a) This method uses the sequence matrix and alignment object directly. Clean sequences (little to no sequencing error, a completed genome assembly, no misplaced contigs) are required so that errors do not affect the boolean coding and distance measures.
b) If you have a large genome (human, other animals, etc.), it may be better to try this out with a gene or protein sequence first. If the results appear promising, you can scale the code up to the full genome, but this must be done on a high-performance computing platform.
### Procuring Data
(Where to find the datasets used)
Recall that the direct PCA method requires a sequence alignment. This tutorial was initially run with high-quality coronavirus genomes from the GISAID database. Unfortunately, posting that dataset would violate the database's data-sharing policy. However, the database is freely accessible through a registered account, and I encourage you to run this analysis with their data as well. As an alternative, accession IDs of the 120 complete genomes used by Li et al in [this paper](https://www.nature.com/articles/s41598-020-79484-8#Sec150) were extracted from NCBI.
In that paper, Li et al explore the evolution of the SARS-CoV2 virus (COVID-19 pandemic; Severe Acute Respiratory Syndrome) along with its relationship to precursor coronavirus genomes in the coronaviridae family, such as bat-host SARS, MERS (Middle East Respiratory Syndrome), and older human-host SARS genomes. Details from the NCBI records, such as date of collection and location, can be found in the [supplemental material](https://static-content.springer.com/esm/art%3A10.1038%2Fs41598-020-79484-8/MediaObjects/41598_2020_79484_MOESM1_ESM.xlsx) of the paper. I extracted the accession IDs from this table and collected the raw sequence information from NCBI using the rentrez package.
A txt file with a list of these accessions as well as the resulting fasta file can be found in [this github repository](https://github.com/ayogasekaram/BINF6110-Project4).
Tip 1: If a list of accession IDs is available, a batch download of the sequences can be done here: https://www.ncbi.nlm.nih.gov/sites/batchentrez. Simply upload the file of accession IDs and save the fasta file (Send to > Format = FASTA > Download).
Tip 2: If using sparse fasta files (such as a download from GISAID), place the files in a separate directory and run the command below to combine.
```{bash eval=FALSE, include=TRUE}
# go to your directory
cd <<Path/To/Your/Directory>>
# combine the files into a single fasta.
cat *.fasta > <<name_of_combined_file>>.fasta
```
If you already have your sequences in the fasta format, you can skip ahead to the Alignment steps.
Required packages (please have these installed before running the analysis):
```{r, echo=TRUE, include=TRUE, eval=FALSE}
# Data Collection
library(rentrez)
# Alignment
# BiocManager::install("DECIPHER")
library(DECIPHER)
# Visualizations:
library(ggplot2)
library(gridExtra)
library(scatterplot3d)
# Data formatting
library(stringr)
library(reshape2)
library(dplyr)
```
For this tutorial, we can use the rentrez package in R to get the sequences directly from NCBI. Using the entrez_fetch function, the supplied accession IDs will be matched to their respective records and returned in the format specified by "rettype".
```{r eval=FALSE, include=TRUE}
# read in your list of accession numbers. Note the version number (X___.1) must be intact.
accessions <-read.table("supertree_full_accessions.txt")
# extracting the sequences from the nucleotide database and returning the format type "fasta"
seqs <- entrez_fetch(db="nuccore", rettype="fasta", id=accessions$V1)
# write this fasta file to your directory
write(seqs, "full_supertree_accessions.fasta")
```
We can now align these sequences for PCA.
### Alignment
Sequence data can be read in from a fasta file using the Bioconductor DECIPHER package. The sequences can then be aligned using the AlignSeqs function in the package (similar to the methods described by Konishi et al., 2020). (Note that this can also be done with amino acid sequences for the alignment of proteins.)
Tip: alignment and the PCA calculation steps are very computationally taxing (especially if using whole genomes). If available, run these lines of R code on an HPC.
```{bash eval=FALSE, include=TRUE}
# copy the fasta file from your computer to the HPC
scp <<name_of_combined_file>>.fasta <<location_on_HPC>>
# load the following modules: (bioconductor packages used for alignment)
module load StdEnv/2020 gcc/9.3.0 r-bundle-bioconductor/3.12 r
# enter R session
R
```
Here, we use AlignSeqs from the Bioconductor DECIPHER package to construct a multiple sequence alignment. Note that if this analysis is being run on amino acid sequences, the seqs object should be read in with readAAStringSet. Different alignment algorithms (ClustalOmega, MUSCLE, etc.) can be used as well and are available in the "msa" package from Bioconductor. Alignment parameters can also be optimized to suit your genome of interest. The documentation for the DECIPHER package can be found [here](https://bioconductor.org/packages/release/bioc/vignettes/DECIPHER/inst/doc/DECIPHERing.pdf).
```{r eval=FALSE, include=TRUE}
# Loading in sequence file (if this is in a different directory than your working directory, include your path before the file name).
fas <- "full_supertree_accessions.fasta"
# read in the fasta file as a DNAStringSet object (it can distinguish between header lines and sequences)
seqs <- readDNAStringSet(fas)
# Alignseqs will produce an alignment object
aligned <- AlignSeqs(seqs)
```
Once the alignment is complete, the aligned sequences can be written to a fasta file in the format required by the direct PCA computation. For this method, the header line must be followed by a tab and then the sequence. In order to store the file in that format, a custom function is defined below. Its arguments are the alignment object and the name of the file to write to your present working directory.
```{r eval=FALSE, include=TRUE}
# input arguments are the alignment object and the desired output file name.
alignment2Fasta <- function(alignment, filename) {
sink(filename) # create a file in the pwd with desired output name
n <- length(names(alignment))
for(i in seq(1, n)) { # for each sequence in the alignment
cat(paste0('>', names(alignment)[i], "\t")) # paste the header line with the tab
the.sequence <- as.character(alignment[[i]]) # paste the sequence
cat(the.sequence)
cat('\n') # write a new line and repeat
}
sink(NULL)
}
# output the alignment as the PCA_formatted.fasta
alignment2Fasta(aligned, 'PCA_formatted.fasta')
# this will save the file in your pwd
```
### Direct PCA set-up
The alignment fasta saved in the previous step can be read in as a table, since the header and sequence information is tab-delimited (column 1 contains the header, column 2 contains the sequence). The dimensions of the table should be the number of genomes (120) by 2 columns.
```{r eval=FALSE, include=TRUE}
# read in table
sites <- read.table(file="PCA_formatted.fasta", header=F, sep="\t")
# the dimension of this table should be number of samples as rows and 2 columns.
dim(sites)
```
The number of samples and the length of the sequences are required for setting up the sequence matrix to be populated with the one-hot encoding (boolean vectors). This empty matrix will be populated with the presence or absence of a given nucleotide at each position of the alignment. Remember, there are 5 possible character states for each position (A, C, T, G, -/N), so the matrix columns are set up such that there is an "A" column for every nucleotide position, then a "T" column, a "C" column, and so on. The dimensions of this matrix should be the number of genomes in the alignment (rows) by 5 times the length of the aligned sequence (columns).
Unique identifiers are required for downstream visualization, so we will populate the row names with the fasta header information.
```{r eval=FALSE, include=TRUE}
# these dimensions are required to setup the matrix.
n.sample <- dim(sites)[1] # number of samples --> number of rows
seq.len <- nchar(sites[2,2]) # length of the sequences after alignment (should be uniform among all samples) --> number of possible positions.
# using dimensions to set up a boolean array.
boolean.matrix <- array(0, dim=c(n.sample, 5*seq.len)) # 5*sequence length as there are 5 possible characters at each position: A,C,G,T, N/-.
# setting the column names to represent A,C,T,G or N and the sequence position.
colnames(boolean.matrix) <- c(paste("A_", 1:seq.len, sep=""),paste("T_", 1:seq.len, sep=""),paste("G_", 1:seq.len, sep=""),paste("C_", 1:seq.len, sep=""),paste("N_", 1:seq.len, sep=""))
# setting the row names of the matrix to hold the respective header information
rownames(boolean.matrix) <- sites[ ,1]
```
### Populating Boolean Values
This is the most computationally expensive step.
1. For each sample in the matrix (in this case 120 genomes), we loop through the sequence letter by letter.
2. If the letter matches the base and position denoted by the column name (e.g. the base "A" at position 1), the A_1 column is populated with a 1, else 0. The same repeats for each base of the sequence.
3. Once the matrix has been populated with 1s and 0s for a sample, the loop moves on to the next sample.
4. !! Note the order of the column names set in the previous step and the order of bases in this for loop. It is essential that the same order (A, T, G, C) is maintained in both steps, as the arithmetic used to fill in the matrix is based on their relative positions (T_ columns always follow A_ columns, etc.). If this is confusing, take a look at the column names of the matrix.
By the end of the numerical conversion of the sequences, the sum of each row of the matrix should equal the length of the sequence. (I recommend doing small sanity checks like this to make sure the code is working as expected; all downstream processes, such as the calculation of PCs, rely on these steps.)
```{r eval=FALSE, include=TRUE}
# uncomment and run the line below to see the column names
# colnames(boolean.matrix)
for (samp. in 1:n.sample){ # for each sample in the matrix
  se <- sites[samp., 2] # the second column contains the sequence
  se <- tolower(se) # convert to lower-case letters
  for (letter in 1:seq.len){ # for each position in this sequence ...
    base <- substr(se, letter, letter) # the character to match
    if (base == "a") {
      boolean.matrix[samp., letter] <- 1 # A_ columns occupy the first seq.len columns
    } else if (base == "t") {
      boolean.matrix[samp., letter + seq.len] <- 1 # T_ columns follow the A_ block
    } else if (base == "g") {
      boolean.matrix[samp., letter + seq.len*2] <- 1
    } else if (base == "c") {
      boolean.matrix[samp., letter + seq.len*3] <- 1
    } else {
      boolean.matrix[samp., letter + seq.len*4] <- 1 # anything other than A,T,G,C populates the N/- block
    }
  }
}
# checking that all nucleotides are accounted for. the sum of each row in this matrix should be the length of the sequence.
apply(boolean.matrix, 1, sum)
```
At the end of this step, the sum of the values in each row should 1) be equal for all samples (since they were aligned) and 2) equal the length of the aligned sequence. We have effectively converted a character string into a numerical matrix for computation. Now on to the PCA processing.
### Principal Component Analysis: Data pre-processing
Before computing the PCs, the data need to be centered in order to define the rotation that will be applied. Distances between sequences are calculated around this center, so samples that are far from the defined center have a larger influence on the direction of rotation.
There is also a step to correct for double-counting the differences between samples. Why is this necessary? The distance between two samples is the Euclidean distance, the square root of the summed squared differences. In the boolean encoding, however, a single substitution flips two cells (a 1 becomes 0 in one base column and a 0 becomes 1 in another), so each substitution contributes twice to the squared distance. To compensate, each difference is divided by the square root of 2.
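As a quick sanity check of this correction (a toy example of my own, not part of the workflow), two one-hot rows that differ by a single substitution differ in exactly two cells:
```{r eval=FALSE, include=TRUE}
# One substitution flips two boolean cells, so the raw squared distance is 2.
x <- c(1, 0, 0, 0, 0)  # site 1 holds an A
y <- c(0, 1, 0, 0, 0)  # site 1 holds a T
sqrt(sum((x - y)^2))              # sqrt(2): the substitution is counted twice
sqrt(sum(((x - y) / sqrt(2))^2))  # 1: after the correction, counted once
```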
The differences of each sample from the center sequence at each position are stored in the differences matrix.
What's interesting about this method is that the center can be manipulated. Typically, you would take the mean over all sequences and calculate distances against that center. However, this also means that if you include very different sequences, using the mean as the center will lose resolution on the smaller differences between very similar sequences. Another approach is to make your reference sequence the center and calculate the distances of the sequences from that reference instead of the mean. We'll calculate both ways here to show the difference: one version of the PCA with the mean of all the coronavirus genomes as the center, and another with the SARS-CoV2 Wuhan reference genome as the center.
The sweep function below subtracts the center vector (the third argument) from each row of the matrix, matched along the columns. The result is a matrix of differences between each sample's position-by-nucleotide values and the center value.
```{r eval=FALSE, include=TRUE}
## finding the center (or mean) for each column of the boolean matrix. This will define the center of the rotation
center_mean <- apply(boolean.matrix, 2, mean) # extracting the mean of all columns
# the Wuhan reference genome is the first sample in our boolean matrix
center_wuhan <- boolean.matrix[1,]
# applying this center to all columns of the matrix to find the differences between sequences to this center value.
differences_mean <-sweep(boolean.matrix, 2, center_mean)
differences_wuhan <- sweep(boolean.matrix, 2, center_wuhan)
# compensating for the doubled counts in Euclidean distance metrics.
differences_mean <- differences_mean/(2^0.5)
differences_wuhan <- differences_wuhan/(2^0.5)
```
Checking the distribution of the distances between sequences gives some indication of the variation we'd see in the PCs. If there are many small distances, the points will likely sit close together in PC space; if the distances are widely spread, the groups may be much more differentiated. Check whether the distances are normally distributed with a normal Q-Q plot.
```{r eval=FALSE, include=TRUE}
# checking distribution of the distances
distances_mean<- (apply(differences_mean^2, 1, sum))^0.5
distances_wuhan<- (apply(differences_wuhan^2, 1, sum))^0.5
# checking for a normal distribution
par(mfrow=c(1,2))
qqnorm(distances_mean, main="Normal Q-Q Plot of Distance Against Mean")
qqnorm(distances_wuhan, main="Normal Q-Q Plot of Distance Against Reference")
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("qq_norm_plot.png")
```
<br/>**Figure 1** Illustrated above are the normal quantile-quantile plots for the distance measures against the mean value (left) and against the Wuhan reference genome (right).
The plots suggest that the distance measures do not follow a normal distribution. We would expect the PCA to have many points clustered together with very small distances and a few genomes much further away. We can check how the distances are distributed with a histogram as well.
```{r eval=FALSE, include=TRUE}
# histogram of distances
p1 <- ggplot(data=as.data.frame(distances_mean), aes(x=distances_mean))+
geom_histogram(color="darkblue", fill="lightblue", bins = 5) + ggtitle("Histogram of Sequence distances against the mean distance value") + xlab("Distance") + theme(plot.title = element_text(size=8))
p2 <- ggplot(data=as.data.frame(distances_wuhan), aes(x=distances_wuhan))+
geom_histogram(color="darkblue", fill="lightblue", bins = 5) + ggtitle("Histogram of Sequence distances against the Wuhan Reference Genome") + xlab("Distance") + theme(plot.title = element_text(size=8))
grid.arrange(p1,p2, ncol=2)
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("distance_hist_plot.png")
```
<br/>**Figure 2** Illustrated above are histograms of the calculated sequence distances against the mean value (left) and against the Wuhan reference genome (right).
The majority of the samples have very small distances while a few have very large distances. This suggests that most of the samples have very similar sequences and will likely cluster together in principal component space. The samples with very large distances are good indicators that the method differentiates between the SARS-CoV2 sequences and their precursors (MERS, SARS, SARS-like bat viruses, etc.). Note, however, that with the same boolean matrix, the distribution changes when the center of the distance calculation changes. Against the mean there are many more small distances, showing that including highly differentiated genomes skews the mean and therefore changes the resolution between closely related sequences. In this case, we suspect that including the bat and MERS genomes in the center calculation will pull the SARS-CoV2 genomes closer together. With the Wuhan reference genome as the center, the SARS-CoV2 clades will instead sit much further from precursor genomes such as the bat-host SARS and MERS.
### Principal Component Analysis: Calculating PCs
Now that the data have been centered and the differences calculated, they are subject to the PC calculations. The matrix is set up such that applying singular value decomposition (SVD) to the matrix of differences identifies the linear combinations that form the PCs. If you are familiar with eigendecomposition, note that it is limited to square matrices; SVD is a generalized version that can be applied to any n x p matrix. The mathematics behind the calculation of the PCs is very well explained in this [YouTube tutorial](https://www.youtube.com/watch?v=HAJey9-Q8js) and I suggest watching it before running the steps below.
Essentially, SVD factors the difference matrix into a left singular matrix (U), a diagonal matrix of singular values (sigma), and a right singular matrix (V), which is equivalent to decomposing it into a sum of rank-1 matrices; the rank is the number of linearly independent columns needed to describe the variation. Multiplying the left matrix by sigma (with %*%) gives the rotated coordinates of the samples, and multiplying the right matrix by sigma gives the rotated coordinates of the nucleotide positions. The columns of these products play the same role as the eigenvector-based scores of an eigendecomposition PCA.
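If you would like to convince yourself of this relationship, here is a small self-contained check (illustrative only, using random data rather than the sequence matrix):
```{r eval=FALSE, include=TRUE}
# svd() factors a centered matrix X into U %*% Sigma %*% t(V), and
# U %*% Sigma reproduces the PC scores reported by prcomp() (up to the
# arbitrary sign of each component).
set.seed(1)
X <- scale(matrix(rnorm(40), nrow = 8), center = TRUE, scale = FALSE)
s <- svd(X)
max(abs(X - s$u %*% diag(s$d) %*% t(s$v)))   # ~0: exact reconstruction
scores_svd <- s$u %*% diag(s$d)              # sample scores from SVD
scores_pca <- prcomp(X, center = FALSE)$x    # sample scores from prcomp
max(abs(abs(scores_svd) - abs(scores_pca)))  # ~0: identical up to sign
```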
The benefit of this method is that it identifies variation not only in the samples but also in the nucleotide positions. In other words, seeing the same bases and positions cluster together away from the rest of the data suggests that those positions may form a discernible motif; we can save this information in a data frame to plot later.
```{r eval=FALSE, include=TRUE}
# core PCA
# singular value decomposition of a matrix (similar to identifying the linear combinations through eigen decomposition)
res_svd <- svd(differences_mean)
# d: vector of singular values, u: matrix with columns of left singular values, v: matrix with columns of right singular values.
str(res_svd)
LeftMatrix <- res_svd$u # the left singular vector
RightMatrix <- res_svd$v # the right singular vector
sigma <- diag(res_svd$d) # diagonal matrix of the singular values (identifying the variance from the principal components)
### calculation of principal components
# for the nucleotide positions
sPC_nucleotide_mean <- RightMatrix %*% sigma / (n.sample^0.5)
# for the samples
sPC_sample_mean <- LeftMatrix %*% sigma/ (seq.len^0.5)
# each column of the matrices above is a principal component describing the data in a lower dimension (note the sPC prefix, since these principal components are calculated via singular value decomposition).
# set the row names of the respective tables.
rownames(sPC_nucleotide_mean)<- colnames(boolean.matrix)
rownames(sPC_sample_mean)<- rownames(boolean.matrix)
```
This block repeats the above steps for the PCs of the distances centered on the SARS-CoV2 reference genome.
```{r eval=FALSE, include=TRUE}
# repeat for the wuhan differences
res_svd_w <- svd(differences_wuhan)
str(res_svd_w)
LeftMatrix_w <- res_svd_w$u
RightMatrix_w <- res_svd_w$v
sigma_w <- diag(res_svd_w$d)
sPC_nucleotide_w <- RightMatrix_w %*% sigma_w / (n.sample^0.5)
sPC_sample_w <- LeftMatrix_w %*% sigma_w/ (seq.len^0.5)
# and for the Wuhan reference
rownames(sPC_nucleotide_w)<- colnames(boolean.matrix)
rownames(sPC_sample_w)<- rownames(boolean.matrix)
```
### Visualizing Results
The bulk of the principal component analysis is best visualized in a biplot or 3D plot, but first we need to see which PCs contribute most towards describing the variation. The values contained in each column of the sPC matrices can be viewed as the loadings (vector coefficients) for each of the bases/positions.
#### Contribution Plot
By assessing the contribution of each principal component, the amount of variation described by the PCs can be viewed. The contribution of each principal component is divided by the total variation, giving the percentage of the overall variance that each PC explains. If an "elbow" is present, it typically suggests that the leading principal components describe the majority of the variation while the rest contribute little towards describing the data.
```{r eval=FALSE, include=TRUE}
# scree plot (contribution of each principal component towards explaining the overall variation in the data)
index <- 1:20
contribution_scores_mean <- round((res_svd$d/sum(res_svd$d)*100)[1:20],2)
contributions_mean <- data.frame(index,contribution_scores_mean)
# for reference center
contribution_scores_w <- round((res_svd_w$d/sum(res_svd_w$d)*100)[1:20],2)
contributions_w <- data.frame(index,contribution_scores_w)
p3 <- ggplot(data=contributions_mean, aes(x=`index`, y=`contribution_scores_mean`, group=1)) +
geom_line(linetype = "twodash") +
geom_text(aes(label=`contribution_scores_mean`),hjust=-0.54, vjust=-0.6, size=3) +
geom_point() + labs(y="PC Contribution (%)", x = "PC") + ggtitle("Cumulative Proportion of Variance Described by Leading 20 PCs (Mean Center)") + theme(plot.title = element_text(size=8))
p4 <- ggplot(data=contributions_w, aes(x=`index`, y=`contribution_scores_w`, group=1)) +
geom_line(linetype = "twodash")+
geom_text(aes(label=`contribution_scores_w`),hjust=-0.54, vjust=-0.6, size=3)+
geom_point() + labs(y="PC Contribution (%)", x = "PC") + ggtitle("Cumulative Proportion of Variance Described by Leading 20 PCs (Reference Center)")+
theme(plot.title = element_text(size=8))
grid.arrange(p3,p4, ncol=2)
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("contribution_plots.png")
```
<br/>**Figure 3** Illustrated above are the contribution scores of the leading 20 principal components calculated through the direct PCA approach on the Li *et al* coronavirus data set using the mean distance as the center (left) and reference center (right).
There appears to be a distinct elbow in the contribution scores, suggesting that the first 3 PCs best describe the variation in the samples. Notice how the line reaches a plateau where the contributions are 3% or less. Thus, the variation between the samples is best visualized with a 3D plot of the top 3 PCs. Note, however, that the leading 3 PCs describe only ~39% of the variation in the data. This suggests that most of the samples are very similar, with minute differences between them (recall, this was also seen in the histogram of the distances). It can be hypothesized that the leading PCs mostly capture the samples that had very large distances.
#### Principal Component Plot of Samples
The data pulled from the Li *et al* paper included the clades for each SARS-CoV2 sequence and the grouping of the precursors. These can be included in the data frame as a factor variable to differentiate between the points. Here we will:
1) extract the accession IDs from the row names; these will be used to identify the clade each genome belongs to.
2) pull the clade information from a csv file and merge it with the principal components data frame on matching accessions.
3) plot the first three principal components (as we saw in Figure 3, the elbow occurred around the 3rd PC).
4) color-code the points in the scatterplot by their respective clades to see if the clustering matches.
```{r eval=FALSE, include=TRUE}
# converting the sPC matrix into a dataframe for plotting
df_mean_center <- as.data.frame(sPC_sample_mean)
# extract the accession IDs from the matrix rownames
labels <- rownames(sPC_sample_mean)
# using string extractions to only get the accession IDs (isolate from the rest of the header info)
df_mean_center$accessions <- str_extract(labels,"[A-Z]+[0-9]+")
print(df_mean_center$accessions)
# write these accessions to a file so we retain the order and open as an excel file. You can add the clade memberships in the next column and save as a csv.
write(df_mean_center$accessions, "accession_numbers_mean.txt")
# read in the accession numbers with the clade membership. In your csv file, ensure the headers are "accessions" and "type" for the two columns.
# the accession columns must match in order to merge the two data frames.
accessions_with_clade <- read.csv("accession_with_clades.csv", header = T)
# merge these two dataframes using the accession ID as the key
df_mean_center <- merge(df_mean_center, accessions_with_clade, by="accessions")
# setting the clade as a factor variable
df_mean_center$type <- as.factor(df_mean_center$type)
# find the number of unique clades
length(levels(df_mean_center$type)) # we need 14 different colour indicators
# vector of colours to use
colours <- c("#0c98cc", "#3cacd6", "#3d59ab", "#8e7fc7", "#f0a334", "#0a3b60","#4e426d", "#91dfb6", "#aae5a4", "#daf9bc","#e19696", "#a54b4b", "#7c2222", "#fd7300")
# scatterplot3d requires the factor variable to be coded by the colours.
colours.num <- colours[as.numeric(df_mean_center$type)]
# plot
scatterplot3d(df_mean_center[,1:3], pch=16, main="3D Principal Component Plot of Coronavirus genomes (Mean Reference Center)", color=colours.num, xlab="sPC1", ylab="sPC2", zlab="sPC3", angle = 70, cex.symbols = 1.5)
legend("topright", legend = levels(df_mean_center$type),col = colours, pch = 16)
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("pca_plot_mean.png")
```
<br/>**Figure 4** 3D principal component plot of the coronavirus genomes used in the phylogenetic supertree paper by Li et al (2020), with the mean sequence difference used as the center. NC refers to SARS-CoV2 viruses without a definitive clade membership (NoClade). Bat-host coronaviruses and human-host precursors to SARS-CoV2 are identified. Single-letter labels reference the clades identified by Li et al in their paper.
The clustering of the genomes closely resembles the clades and groupings seen in the Li *et al* supertree figure! With the distances centered on the mean, the MERS genomes and the bat-host coronavirus genomes sit much further apart. In addition, when more divergent genomes are included, the benefit of using the mean is that the clustering within each clade is tighter. The differentiation seen in the Li *et al* figure appears well explained by the first principal component (the A clade closest to the B clade, then C, etc.).
We can see how the clustering and spatial patterns differ when using the SARS-CoV2 Reference Genome as the center. It might also be helpful to de-noise the figure by removing the SARS-CoV2 samples that do not belong to a clade. We're repeating the same steps as done above.
```{r eval=FALSE, include=TRUE}
# Repeating the same process as above, but for the reference genome as the center
# add svd vectors into a dataframe.
df_ref_center <- as.data.frame(sPC_sample_w)
# extract the accession IDs from the matrix rownames
labels <- rownames(sPC_sample_w)
# using string extractions to only get the accession IDs (isolate from the rest of the header info)
df_ref_center$accessions <- str_extract(labels,"[A-Z]+[0-9]+")
print(df_ref_center$accessions)
# recall we have the csv file that was read in with the accession IDs and their respective clades. We merge these dataframes using the accession ID as the key.
# merge these two dataframes using the accession ID as the key
df_ref_center <- merge(df_ref_center, accessions_with_clade, by="accessions")
# setting the clade as a factor variable
df_ref_center$type <- as.factor(df_ref_center$type)
length(levels(df_ref_center$type)) # we need 14 different colour indicators
# subset the observations that belong to a clade (not NC)
df_noclade <- df_ref_center %>%
filter(df_ref_center$type != "NC")
# set a vector of colours
colours <- c("#0c98cc", "#3cacd6", "#3d59ab", "#8e7fc7", "#f0a334", "#0a3b60","#4e426d", "#91dfb6", "#aae5a4", "#daf9bc","#e19696", "#a54b4b", "#7c2222", "#fd7300")
colours.num <- colours[as.numeric(df_noclade$type)]
# plot
scatterplot3d(df_noclade[,1:3], pch=16, main="3D Principal Component Plot of Coronavirus genomes (Wuhan Reference Center)", color=colours.num, xlab="sPC1", ylab="sPC2", zlab="sPC3", angle = 70, cex.symbols = 1.5)
legend("topright", legend = levels(df_noclade$type),col = colours, pch = 16)
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("pca_plot_ref.png")
```
<br/>**Figure 5** 3D principal component plot of the coronavirus genomes used in the phylogenetic supertree paper by Li et al (2020), with the Wuhan reference sequence used as the center. SARS-CoV2 strains without definitive clade membership are removed. Bat-host coronaviruses and human-host precursors to SARS-CoV2 are identified. Single-letter labels reference the clades identified by Li *et al* in their paper.
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("Li_paper.png")
```
<br/>**Figure 6** Figure 2 from the paper (2020) [Phylogenetic supertree reveals detailed evolution of SARS-CoV-2](https://www.nature.com/articles/s41598-020-79484-8/figures/2) for comparison.
Notice that the clustering within the SARS-CoV2 clades is not as tight as when the mean was the center. Distances between the center and the individual SARS-CoV2 genomes are now larger, so the points are more spread out. However, the bat-host SARS, MERS, and precursor human-host SARS genomes cluster closely together and sit much further from the SARS-CoV2 genomes. Even though the grouping is still intact, this diagram appears to mirror the distances seen in Figure 1 of the Li *et al* paper. Also note the bat-host points that approach the SARS-CoV2 clades; these are likely the points that appear in the Li *et al* figure as the bat SARS precursors branching off right before the SARS-CoV2 division.
The direct PCA method appears to generate similar results to what was seen in peer-reviewed literature, lending more credibility to the method and its application with other data sets.
#### Principal Component Plot of Nucleotide Sites
Recall that a benefit of this approach to PCA and distance measures is that the differences can be traced back to the actual positions that gave rise to them.
Konishi et al highlight that this plot may be beneficial to assess since it can unearth distinct motifs. High difference scores are not always randomly dispersed: they gather in hot spots, which can suggest that mutations occur at specific sequence positions. This information can supply clues for understanding the relationships behind the differences between samples.
Using the sPC_nucleotide_mean matrix created earlier, we can extract the first-PC loadings for each base and plot them against sequence position.
```{r eval=FALSE, include=TRUE}
# PCA plot of the positions
# set an index for the position
position.index <- 1:seq.len
# subset the rows of the sPC matrix corresponding to each base (for example, A_1 to A_39210 are all the positions with A as the base)
A. <-sPC_nucleotide_mean[1:seq.len,1]
T. <-sPC_nucleotide_mean[1:seq.len+seq.len,1]
G. <- sPC_nucleotide_mean[1:seq.len+seq.len+seq.len,1]
C. <- sPC_nucleotide_mean[1:seq.len+seq.len+seq.len+seq.len,1]
# merge these matrices into a dataframe
sites.plotting <- data.frame(position.index, A., G., C., T.)
# melt the data so we can plot all this information in one plot
molten.data <- melt(sites.plotting,
                    measure.vars=c("A.", "G.", "C.", "T."),
                    variable.name = "variable")
molten.data$label <- paste0(molten.data$variable, "_", molten.data$position.index)
ggplot(molten.data, aes(x=`position.index`, y=`value`)) + geom_point(aes(color = `variable`, shape = `variable`), size = 2) + scale_colour_manual(values=c("#999999", "#E69F00", "#56B4E9", "#009E73")) + theme(legend.position="bottom")+
  labs(y="sPC1", x = "Sites") + ggtitle("Site Specific Variation")
# or use label names to see the exact position numbers.
ggplot(molten.data, aes(x=`position.index`, y=`value`)) + geom_text(aes(color = `variable`, label=`label`), size = 1) + scale_colour_manual(values=c("#999999", "#E69F00", "#56B4E9", "#009E73")) + theme(legend.position="bottom")+
  labs(y="sPC1", x = "Sites") + ggtitle("Site Specific Variation")
```
```{r echo=FALSE, out.width = '75%'}
knitr::include_graphics("site_variation_labels.png")
```
<br/>**Figure 7** Illustrated above are the nucleotide positions plotted against the first principal component calculated with the direct PCA method, with the respective base character represented by color.
The figure above shows that there are regions of the genomes with stable, consistent nucleotide composition that do not vary significantly. One section (around position 23,000) is T-rich and appears consistent across all 120 genomes. However, the number of genomes used in this analysis appears to inject noise rather than generate a clear picture of potential motifs. This plot may be much more informative for a smaller sequence region, such as a gene or protein, where differences between bases are more evident and may appear as hot spots. If assessing the nucleotide differences in a whole genome, subsetting the data into smaller groups can also render a more informative plot.
## Conclusion
The direct PCA method developed by Konishi et al proves to be a great avenue for assessing sequence variation. It not only showed clustering that mirrored the groupings seen in a paper using these genomes for phylogenetic analyses, but also offers customization that is lacking in other PCA-based modules for genomes. The center can be modified to measure distances and variation relative to a reference genome, the mean of all samples, or another sample of interest.
Much of the heavy lifting for the computation is done with base R functions, making the method widely accessible and transparent for the user. The dimension-reduction vectors themselves are also readily available through the sPC_nucleotide and sPC_sample matrices. It would be interesting to see whether the same cluster membership appears when subjecting variant spike protein data to the amino acid version of this script.
## Reflection
Being able to reproduce analyses seen in peer-reviewed research is a cornerstone of bioinformatic studies. Conducting analyses with similar data sets lends credence to the different methods proposed by researchers. However, the implementation of these methods is not always easily transferable and requires manipulation. The impetus for researching this method of PC analysis was the lack of assumption-free analyses available, especially for viral genomes. As stated, applications such as Plink are well-suited to chromosome-containing organisms and perform the analysis based on mutation rates. Konishi et al successfully used this method not only to separate coronavirus genomes through the principal components but also to compare the outputs of direct PCA and established PCA methods on human and lion genomes.
The results showed that this method was better able to discriminate between sequences and their distances, with a clear separation of clusters under the direct PCA method. The implementation via an R script also made the analysis more accessible (the original R script used by Konishi can be found in [this github repository](https://github.com/TomokazuKonishi/direct-PCA-for-sequences/blob/master/aminoacid.txt)). The accompanying paper also set the stage well for understanding the methodology and rationale behind this workflow (**Principal Component Analysis Applied Directly to Sequence Matrix**, 2019). This added to the accessibility of the workflow, since it is easily translated to a high-performance computing context and does not have many dependencies (the original plots were made in base R). DECIPHER was used in this tutorial, but any alignment package can deliver the same output. Additionally, this method is not restricted to nucleotide sequences and can also work with amino acid (AA) sequences; however, the number of column vectors per position changes and requires different values in the numerical construction. A similar script for AA sequences can be found in the same github repository.
It was interesting to learn and visualize how complex analyses seen in the literature can be described using sequences and mathematics alone. As seen in Figure 5, when the center of the differences is set to the reference genome, the variation and spread of the clusters mirror what was seen in the Li et al paper, highlighting the benefits of this method for viral genomes. As described by Morel *et al*. (2020), understanding the phylogenetic relationships of viruses is hard. They often mutate at a rate unseen in other organisms, and many of the mutations are spurious and do not affect the function of the viral genome. Thus, establishing patterns in the mutation rate or defining important differences for clustering is difficult, especially when detrimental and spurious mutations cannot be differentiated. Despite all of these impediments, the fact that this direct PCA method was able to extract clustering patterns from the sequences alone is remarkable. This, in addition to how closely the patterns match the supertree phylogeny, suggests that the method can be used to further evidence such phylogenetic studies. Free from assumptions in its calculation, it confers the advantage of minimal expectations: there is no "cherry-picking" or fine-tuning of parameters to render phylogenetic hypotheses (Konishi *et al*., 2019).
Another very interesting benefit of this method is that the differences can be traced back to nucleotide positions. With other distance-calculation methods, the software usually produces a distance matrix, and if clustering differences appear in the resulting PCA it is difficult to find out which points in the sequence gave rise to that distance measure. In this method, each point relates directly to a position and base character (e.g. A_3232). This opens doors for analyses of sequence motifs driven by variation.
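For example (a small sketch assuming the sPC_nucleotide_mean matrix from the tutorial is still in memory), the base-position combinations driving the first component can be ranked directly from their loadings:
```{r eval=FALSE, include=TRUE}
# Rank the base_position columns by the magnitude of their sPC1 loading;
# the names of the top entries (e.g. "A_3232") identify the exact base and
# alignment position contributing most to the separation along sPC1.
pc1.loadings <- sPC_nucleotide_mean[, 1]
head(sort(abs(pc1.loadings), decreasing = TRUE), 10)
```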
As seamless and accessible as this method of PC analysis was, there were some challenges. The greatest was data procurement. Identifying the size and scope of the genomes to pull, so as to see a reasonable difference between the samples, required thorough research. Coronavirus variants that affect humans are extremely similar (especially in regions that code for necessary proteins). Typically, studies broaden the scope by assessing precursor versions of the coronavirus that infected other hosts (bat, pangolin, etc.), which, as mentioned, can affect the level of clustering seen: sequences that are very dissimilar or distantly related can skew the mean or center value used to compute the distances between samples. Another challenge is finding clean data. Because this method calculates principal components from the sequences directly, anything that modifies the alignment, such as sequence artifacts or errors, persists downstream of the alignment step. The GISAID database was helpful for obtaining the very high-quality data used in publications and many epidemiological efforts, and is a great resource.
As someone with a biology background, I found the mathematics behind the construction via singular value decomposition difficult to comprehend. However, this is a worthy challenge, as the method has room for customization. Applications and modules often prioritize ease of use by providing a single PCA function that forces a black-box approach on the user (putting in any data will churn out a result; whether it is biologically meaningful is an analysis of its own). Constructing this tutorial required familiarity with one-hot encoding (the boolean matrix), singular value decomposition, Euclidean distances, etc., which allowed for the comparison between different reference centers and for identifying the concordance between the results of this tutorial and the phylogenetic supertree constructed by Li *et al*.
Disadvantages of the direct PCA method also became apparent through this analysis. One shortfall is that the dimensionality of the analysis increases substantially with sequence length. Direct PCA on an entire viral genome was feasible because such genomes are relatively small (~39k bases in this alignment). It is not feasible for the complete genome of larger species (fish, humans, etc.): 5 times the length of the human genome is roughly 15 billion columns in the matrix to decompose. Thus, applying this analysis to such species will likely require scaling down to the gene or protein level, or access to a high-performance computing cluster.
This method is also likely not employable for understudied organisms where good-quality genomes are rare. Since the sequence matrix depends on a multiple sequence alignment, unplaced contigs and repeat regions in scaffolds are difficult to discern and therefore difficult to align; these problems would again persist in downstream analyses.
Another caveat, which the authors also highlight, is that this method is not robust enough to stand alone in research. Rather, it is better taken as a means of assessing concordance between methods (are you getting the same clustering? Could your data be influenced by noise?). This serves the purpose of scientific discovery: questioning results and seeing how they differ when subjected to alternative methods.
Despite the challenges, Dr. Konishi presents an assumption-free approach to PC analysis that can govern downstream analysis steps for phylogenetic studies, population structure, and other population- and distance-related studies. The method opens the door for "species" that do not abide by typical evolutionary assumptions to generate biologically relevant insights.
## References
Baron, S. (1996). Structure and Classification of Viruses. In Medical Microbiology, 4th edition, Chapter 41. https://www.ncbi.nlm.nih.gov/books/NBK8174/
Gauch, H. G., Qian, S., Piepho, H. P., Zhou, L., & Chen, R. (2018). Consequences of PCA graphs, SNP codings, and PCA variants for elucidating population structure. BioRxiv. https://doi.org/10.1101/393611
Konishi, T. (2020). Principal component analysis of coronaviruses reveals their diversity and seasonal and pandemic potential. PLoS ONE, 15(12 December). https://doi.org/10.1371/journal.pone.0242954
Konishi, T., Matsukuma, S., Fuji, H., Nakamura, D., Satou, N., & Okano, K. (2019). Principal Component Analysis applied directly to Sequence Matrix. Scientific Reports, 9(1). https://doi.org/10.1038/s41598-019-55253-0
Lever, J., Krzywinski, M., & Altman, N. (2017). Principal component analysis. Nature Methods, 14(7), 641–642. https://doi.org/10.1038/nmeth.4346
Li, T., Liu, D., Yang, Y., Guo, J., Feng, Y., Zhang, X., Cheng, S., & Feng, J. (2020). Phylogenetic supertree reveals detailed evolution of SARS-CoV-2. https://doi.org/10.21203/rs.3.rs-33194/v1
Morel, B., Barbera, P., Czech, L., Bettisworth, B., Hübner, L., Lutteropp, S., Serdari, D., Kostaki, E.-G., Mamais, I., Kozlov, A. M., Pavlidis, P., Paraskevis, D., & Stamatakis, A. (2020). Phylogenetic Analysis of SARS-CoV-2 Data Is Difficult. Molecular Biology and Evolution. https://doi.org/10.1093/molbev/msaa314