20241111_jurkat_mpra_big_table.Rmd

---
title: "Jurkat MPRA table"
output: html_document
author: Max Dippel
date: "2024-10-21"
---


This code creates a large table of Jurkat MPRA data with extra functional annotations. Once you have created this big table, you can use it to make plots relevant to the MPRA data. These data are mostly relevant to supplementary tables 1, 5 and 6, and are used (in another piece of code) to make the plots for figure 1, figure 2, supp. figure 4, supp. figure 5, and supp. figure 6.

# MPRA merge

Load the packages
```{r packages, message=FALSE}
library(stringr)
library(IRanges)
library(data.table)
library(readxl)
require(biomaRt)
library(GenomicRanges)
library(dplyr)
```

Establish directories
```{r}
# This is the main directory for the analysis
main.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2"
# This is where the data will go
data.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/"

# This is the directory where you put the original MPRA analysis code created by Mike Guo. It is located here.  https://zenodo.org/records/6302248
mhguo.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/mhguo1-T_cell_MPRA-5c36361/"

# This is the directory where you have stored the DHS data. Use hg19 for this analysis. You can change the genome build at the end. The DHS data can be found at https://zenodo.org/records/3838751
# Download these three things:
# DHS_Index_and_Vocabulary_metadata.tsv
# DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz
# dat_bin_FDR01_hg19.txt.gz
dhs.data.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/DHS_hg19/"
```
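
Before running the rest of the document, it can help to confirm that the downloaded files are where the code expects them. The block below is a minimal sketch and not part of the original analysis: `check_inputs` is a hypothetical helper, and the demonstration uses a temporary directory as a stand-in for `dhs.data.dir`.

```r
# Hypothetical helper: report which expected files are missing from a directory.
check_inputs <- function(dir, files) {
  paths <- file.path(dir, files)
  missing <- files[!file.exists(paths)]
  if (length(missing) > 0) {
    warning("Missing from ", dir, ": ", paste(missing, collapse = ", "))
  }
  invisible(missing)
}

# Demonstration in a temporary directory (a stand-in for dhs.data.dir)
demo.dir <- tempfile("dhs_demo_")
dir.create(demo.dir)
invisible(file.create(file.path(demo.dir, "DHS_Index_and_Vocabulary_metadata.tsv")))

dhs.files <- c("DHS_Index_and_Vocabulary_metadata.tsv",
               "DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz",
               "dat_bin_FDR01_hg19.txt.gz")
# Only the first file exists in the demo directory, so the other two are reported
missing.files <- suppressWarnings(check_inputs(demo.dir, dhs.files))
print(missing.files)
```

In real use the call would be `check_inputs(dhs.data.dir, dhs.files)`, and the warning would list any file that still needs to be downloaded.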

Import MPRA, inspect data, and set directories for specific cell types

############################

## STOP!

STOP! You must look below to see the MPRA for which you are about to make a big table. This Rmd makes the Jurkat big table.

If you are replicating the plots from Ho et al., you need all the data available. You must run this code for both MPRAs to generate big tables for the T-cell and Jurkat MPRAs, commenting the proper lines in and out to get the correct code for each run. I thought this would be better, and at least more compact, than writing this code twice.

############################

```{r}
# Load Ho et al. Supp. table 1 or 2, depending on whether you want to make this table with primary T-cell (sheet 1) or Jurkat (sheet 2) data
mpra_out<-read.table(paste0(data.dir,"OLJR.A_Jurkat_emVAR_glm_20240310.out"), header=T, stringsAsFactors = F, sep="")
head(mpra_out)
# How many rows are in that file?
nrow(mpra_out)
# How many columns are in that file?
ncol(mpra_out)
# What are the column names in the mpra?
names(mpra_out)
# Name of MPRA file 
mpra.name <- "20241111_jurkat"

# This decides what option to choose for the variant category filter. There are currently two options "original" and "expression". The "original" filter is for the jurkat MPRA as it is similar (not the exact same, but similar to Mouri & Guo et al.). The "expression" filter is for the Primary T-cell MPRA. 
filter <- "original"

# What is the stimulation status of the cells? We are going to eliminate either stimulated (S) or unstimulated (U) cells. For Ho et al., primary T-cells are stimulated and Jurkats are unstimulated.
eliminate_stim_status <- "S" # Unstimulated, so eliminate stimulated cells with S

# List out the cells which are relevant to this T-cell cell type for DHS columns. Change this to the B-cell list if you are using B-cells or make your own! The large list of cell types can be found in the DHS section sample.dat in the Biosample.name column. 
cells<-c( "hTH1", "hTH17", "hTH2","CD4", "CD4pos_N", "hTR", "Jurkat", "CD8")


# Please be aware that delta SVM, eqtl data and ATAC-QTL data are all T-cell specific 
```

This code standardizes the column names in the analysis. 
```{r}
# Change the name of the columns to match the functions
mpra_out$LogSkew <- mpra_out$Log2Skew
# Change the name of the column to match the functions
mpra_out$A.Ctrl.Mean <- mpra_out$A_Ctrl_Mean
mpra_out$A.Exp.Mean <- mpra_out$A_Exp_Mean
mpra_out$A.log2FC <- mpra_out$A_log2FC
mpra_out$A.log2FC_SE <- mpra_out$A_log2FC_SE
mpra_out$A.logP <- mpra_out$A_logP
mpra_out$A.logPadj_BH <- mpra_out$A_logPadj_BH
mpra_out$A.logPadj_BF <- mpra_out$A_logPadj_BF

mpra_out$B.Ctrl.Mean <- mpra_out$B_Ctrl_Mean
mpra_out$B.Exp.Mean <- mpra_out$B_Exp_Mean
mpra_out$B.log2FC <- mpra_out$B_log2FC
mpra_out$B.log2FC_SE <- mpra_out$B_log2FC_SE
mpra_out$B.logP <- mpra_out$B_logP
mpra_out$B.logPadj_BH <- mpra_out$B_logPadj_BH
mpra_out$B.logPadj_BF <- mpra_out$B_logPadj_BF
mpra_out$Skew.logP <- mpra_out$Skew_logP
mpra_out$Skew.logFDR <- mpra_out$Skew_logFDR
# This column creates the project column
mpra_out$project <- "TGWAS"
# eliminate all of the columns that have the names we don't want
mpra_out <- subset(mpra_out, select = -c(Log2Skew,A_Ctrl_Mean,A_Exp_Mean,A_log2FC,A_log2FC_SE,A_logP,A_logPadj_BH, A_logPadj_BF,B_Ctrl_Mean,B_Exp_Mean,B_log2FC,B_log2FC_SE,B_logP,B_logPadj_BH,B_logPadj_BF,Skew_logP,Skew_logFDR))

mpra <- mpra_out
```
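
The chunk above renames columns by copying each one to a new name and then dropping the old ones. As an illustration only, the same renaming can be sketched with a look-up vector on a toy data frame. The names `toy.cols` and `rename.map` are hypothetical and the values are invented; only two of the columns are shown.

```r
# Toy data frame with two of the underscore-style column names
toy.cols <- data.frame(SNP = "1:1000:A:G", A_Ctrl_Mean = 25, Skew_logFDR = 1.3,
                       stringsAsFactors = FALSE)

# Look-up of old name -> new name (only a subset, for illustration)
rename.map <- c(A_Ctrl_Mean = "A.Ctrl.Mean", Skew_logFDR = "Skew.logFDR")

# Find where each old name sits and overwrite it with the new name
hits <- match(names(rename.map), names(toy.cols))
names(toy.cols)[hits] <- unname(rename.map)
names(toy.cols)
```

Renaming in place keeps the original column order, whereas the copy-and-drop approach moves the renamed columns to the end of the table. Either works for the downstream merges, which match on column names.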

```{r}
nrow(mpra_out)
```


LD columns from the jurkat MPRA. You can also calculate the ld data by hand by running the plink script (ld.sh) yourself, but I have included it here for ease of use. 
```{r}
#Merge LD files across chromosomes
# Create an empty data frame
dat.ld.all<-data.frame(lead_snp=character(), ld_snp=character(), r2=numeric(), stringsAsFactors = F)
# A loop that says "for every chromosome..."
for(c in c(1:22, "X")){
  # read in Mike's linkage files for each chromosome
  dat.ld.chr<-read.delim(paste(mhguo.dir,"annotate_mpra/ld/ld/mpra.chr", c, ".ld", sep=""), header=T, stringsAsFactors = F, sep="")
  # Select the columns you need from this data set
  dat.ld.chr<-subset(dat.ld.chr, select=c(SNP_A, SNP_B, R2))
  # Rename those columns 
  names(dat.ld.chr)<-c("lead_snp","ld_snp", "r2")
  # combine these values with the empty data frame you created earlier
  dat.ld.all<-rbind(dat.ld.all, dat.ld.chr)
  # Remove dat.ld.chr
  rm(dat.ld.chr)
}

#remove duplicates. If proxy SNP has multiple entries, select the one with the highest r2
dat.ld.all<-dat.ld.all[order(dat.ld.all$ld_snp, -dat.ld.all$r2),]
dat.ld.all<-dat.ld.all[!duplicated(dat.ld.all$ld_snp),]

#Merge in LD information
mpra$ld_snp <- mpra$SNP
mpra<-merge(mpra, dat.ld.all,  by="ld_snp", all.x=T, all.y=F)
```
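
To see what the de-duplication step does, here is a small sketch on invented data (the SNP IDs and r2 values in `toy.ld` are made up). Sorting by `ld_snp` and then by decreasing `r2` puts the best lead SNP first within each proxy, so `!duplicated()` keeps that row.

```r
# Toy LD table: the proxy 1:1000:A:G is linked to two lead SNPs
toy.ld <- data.frame(lead_snp = c("1:500:C:T", "1:900:G:A", "2:300:A:C"),
                     ld_snp   = c("1:1000:A:G", "1:1000:A:G", "2:350:T:C"),
                     r2       = c(0.85, 0.97, 1.00),
                     stringsAsFactors = FALSE)

# Sort so the highest r2 comes first within each proxy SNP
toy.ld <- toy.ld[order(toy.ld$ld_snp, -toy.ld$r2), ]
# Keep the first (highest r2) row per proxy SNP
toy.ld <- toy.ld[!duplicated(toy.ld$ld_snp), ]
toy.ld
```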

Creating bed-file like columns
```{r}
# Make the MPRA file into a bed-like file for downstream annotations
# Creating the start snp column by modifying the SNP column. Start snp is calculated by taking the position, subtracting 1, and then subtracting the absolute value of the difference between the lengths of the reference and alternate alleles.
mpra$snp_start <- as.numeric(str_split_fixed(mpra$SNP, "\\:", 4)[,2])-1-abs(nchar(str_split_fixed(mpra$SNP, "\\:", 4)[,3])-nchar(str_split_fixed(mpra$SNP, "\\:", 4)[,4]))
# Create a end_snp column which modifies the SNP column and the values after the first colon and before the second colon
mpra$snp_end<-as.numeric(str_split_fixed(mpra$SNP, "\\:", 4)[,2])
# Create a chromosome column by modifying the SNP column. Taking everything before the first colon and adding it to the letter "chr"
mpra$chr<-paste0("chr", str_split_fixed(mpra$SNP, "\\:", 4)[,1])
# Create a strand column
mpra$strand<-"+"
# Create position column
mpra$pos <- sub("^[^:]+:([0-9]+):.*", "\\1", mpra$SNP)
# Reference allele
mpra$ref_allele <- sub("^[^:]+:[^:]+:([^:]+):.*$", "\\1", mpra$SNP)
# Alternate allele
mpra$alt_allele <- sub("^[^:]+:[^:]+:[^:]+:([^:]+)$", "\\1", mpra$SNP)
# Create ID column
mpra$ID <- paste0(mpra$SNP,":R:wC")
```
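
As a worked example of the coordinate arithmetic, the sketch below applies the same formula to one SNV and one deletion. The two IDs are invented, and base `strsplit()` is used here in place of `str_split_fixed()` only so that the example runs without extra packages.

```r
# Two toy variant IDs in chr:pos:ref:alt format (one SNV, one 1-bp deletion)
toy.snp <- c("1:1000:A:G", "2:5000:AT:A")

# Split each ID into its four fields
parts <- do.call(rbind, strsplit(toy.snp, ":", fixed = TRUE))

# End is the reported position; start is position - 1 - |len(ref) - len(alt)|
toy.end   <- as.numeric(parts[, 2])
toy.start <- toy.end - 1 - abs(nchar(parts[, 3]) - nchar(parts[, 4]))

data.frame(SNP = toy.snp, chr = paste0("chr", parts[, 1]),
           snp_start = toy.start, snp_end = toy.end)
```

For the SNV the interval is 999 to 1000. For the deletion the allele lengths differ by one, so the start moves back one more base to 4998.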

```{r}
nrow(mpra)
```

This big loop integrates the PICS (probabilistic identification of causal SNPs) data into the mpra data set
```{r PICS data}
# Order of GWAS for plotting
gwas.order<- c("Crohns","MS","Psoriasis", "RA","T1D","UC", "IBD") 
# Merge in GWAS and PICS fine-mapping data
for(d in gwas.order){
  # Read in PICS results
  gwas<-read.delim(paste0(mhguo.dir,"pics/run/", d, ".ld_0.8.pics"), header=F, stringsAsFactors = F, sep="\t")
  # Select certain columns in the gwas data
  gwas<-gwas[,c(1,2,4,6,7)]
  # Name the columns of the gwas data
  names(gwas)<-c("lead_snp", "ld_snp", "r2", "pics","pval")
  # eliminate the letters "chr" from lead and ld snp
  gwas$lead_snp<-gsub("chr", "", gwas$lead_snp)
  gwas$ld_snp<-gsub("chr", "", gwas$ld_snp)
  
  #Merge PICS results with MPRA to subset for SNPs tested in the MPRA
  gwas<-merge(gwas, mpra[,c("ld_snp", "lead_snp")], by=c("ld_snp", "lead_snp"), all.x=F, all.y=F)

  #Remove SNPs seen multiple times, picking the SNP with the most significant p-value in catalog, then the highest r2 (the sort below uses pval and r2, not the PICS posterior probability)
  gwas<-gwas[order(-gwas$pval, -gwas$r2),]
  gwas<-gwas[!duplicated(gwas[,c("ld_snp", "lead_snp")]),]
  
# Generate credible sets
  # Order the data by lead SNP, then by decreasing PICS probability
  gwas <- gwas[order(gwas$lead_snp, -gwas$pics),]
  gwas$PP_running<-0 #Create variable with a running sum of PICS probabilities
  gwas$PP_sum <- 0 #Create variable with sum of PICS probabilities
  #Note: need this second variable to help break ties (if two SNPs have the same PP, they automatically will have the same credible set annotation)

  prev_var <- "dummy_variable" #Create variable to test whether we've moved onto a new locus
  for(i in seq_len(nrow(gwas))){ # seq_len() skips the loop when gwas has no rows
    # Creating a new variable from lead snp
    new_var<-gwas[i,]$lead_snp
    # if the pics value is greater than zero
    if(gwas[i,]$pics>0){
      if(new_var==prev_var){ #Test if next variant is still in same locus
        gwas[i,]$PP_sum<-gwas[i-1,]$PP_sum+gwas[i,]$pics #Add pics to running PP sum
        
        if(gwas[i,]$pics==gwas[i-1,]$pics){ #If two variants have the same PP, force them to have the same PP_running
          gwas[i,]$PP_running<-gwas[i-1,]$PP_running
        }else{
          gwas[i,]$PP_running<-gwas[i,]$pics+gwas[i-1,]$PP_sum #If not Add pics and PP_sum to running PP sum
        } 
      }else{ #If this variant is in a new locus, assign the PICS PP as their PP_sum and PP_running
        gwas[i,]$PP_sum<-gwas[i,]$pics 
        gwas[i,]$PP_running<-gwas[i,]$pics
      }
    }
    prev_var<-new_var #If lead SNP is different, we've now moved on to new locus
  }
  
# Merge with MPRA data
  # subset the gwas data to a few columns
  gwas<-subset(gwas, select=c(ld_snp, pval,pics,PP_running))
  # Round the gwas pvalue to two digits
  gwas$pval<-round(gwas$pval,2)
  # Creating a disease specific name with underscores for the pics statistics
  names(gwas)[2:4]<-paste0(d, "_", names(gwas)[2:4])
  # merge the gwas data and mpra by ld_snp
  mpra<-merge(mpra, gwas, by="ld_snp", all.x=T, all.y=F)
  # remove the gwas object
  rm(gwas)
}
```
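
The running-sum logic in the inner loop can be hard to follow, so here is a sketch of the same idea on invented numbers. It is a vectorized illustration using `ave()`, not the original code, and it only covers rows with a PICS probability above zero (the loop leaves rows with a probability of zero at 0). `toy.gwas` and its values are hypothetical.

```r
# Toy PICS table: two loci, with a tie (0.2, 0.2) in the first locus
toy.gwas <- data.frame(lead_snp = c(rep("1:500:C:T", 4), rep("2:300:A:C", 2)),
                       pics = c(0.5, 0.2, 0.2, 0.1, 0.9, 0.1),
                       stringsAsFactors = FALSE)

# Sort by locus, then by decreasing PICS probability
toy.gwas <- toy.gwas[order(toy.gwas$lead_snp, -toy.gwas$pics), ]

# PP_sum: plain cumulative sum of PICS probabilities within each locus
toy.gwas$PP_sum <- ave(toy.gwas$pics, toy.gwas$lead_snp, FUN = cumsum)

# PP_running: tied SNPs share the value of the first SNP in the tie,
# i.e. the smallest cumulative sum among rows with the same locus and PICS value
tie.key <- paste(toy.gwas$lead_snp, toy.gwas$pics)
toy.gwas$PP_running <- ave(toy.gwas$PP_sum, tie.key, FUN = min)
toy.gwas
```

In the first locus the two tied SNPs both get a `PP_running` of 0.7, so they fall into the same credible set, while `PP_sum` keeps increasing (0.7, then 0.9) so that the next SNP is still added on top of the full total.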

```{r}
nrow(mpra)
```

DHS (DNase I hypersensitive sites) data

As stated above, the DHS data can be found at https://zenodo.org/records/3838751

Download these three things:

1. DHS_Index_and_Vocabulary_metadata.tsv
2. DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz
3. dat_bin_FDR01_hg19.txt.gz

```{r}
### DHS data
# Eliminating all the columns that have "dhs_" in them which happen to be none of the columns
mpra<-mpra[,!(grepl("dhs_", names(mpra)))]
# Make a granges object from the mpra spreadsheet
mpra.se<-makeGRangesFromDataFrame(mpra,  seqnames = "chr", start.field = "snp_start", end.field = "snp_end",keep.extra.columns=TRUE)

# Read in DHS position file
dhs.pos<- read.delim(gzfile(paste0(dhs.data.dir,"DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz")), header=T, stringsAsFactors = F)
# Only keep the first three columns of the DHS pos data
dhs.pos<-dhs.pos[,c(1:3)]
# Make peak column which puts together the chromosome, the start and the end
dhs.pos$peak<-paste0(dhs.pos$seqname, ":", dhs.pos$start, "_", dhs.pos$end)
# make a granges from the dhs pos data
dhs.pos.se<-makeGRangesFromDataFrame(dhs.pos, seqnames = "seqname", start.field = "start", end.field = "end",keep.extra.columns=TRUE)

#Find DHS peaks that overlap with MPRA
overlap_peaks<-data.frame(subsetByOverlaps(dhs.pos.se, mpra.se, type="any"), stringsAsFactors = F)$peak
overlap_index<-which(dhs.pos$peak%in%overlap_peaks)

# bring in dhs data
dhs.dat<-fread(paste0(dhs.data.dir,"dat_bin_FDR01_hg19.txt.gz")) 
#Subset for only DHS peaks that overlap with MPRA in the dhs data
dhs.dat<-dhs.dat[overlap_index,]
#Subset for only DHS peaks that overlap with MPRA in the dhs pos data
dhs.pos<-dhs.pos[overlap_index,]
# Now combine the dhs and dhs pos data
dhs.dat<-cbind(dhs.pos[,c(1:3)],dhs.dat)

#Read in sample information
sample.dat<-read.delim(paste0(dhs.data.dir,"DHS_Index_and_Vocabulary_metadata.tsv"), sep="\t", header=T, stringsAsFactors = F)
# Select certain columns in the sample data
sample.dat<-subset(sample.dat, !is.na(library.order),select=c(library.order,Biosample.name, System,Subsystem,  Organ, Biosample.type, Biological.state))

#Add peak index into MPRA for faster calculation
# Create three empty columns in mpra
mpra$peak_index1<-NA
mpra$peak_index2<-NA
mpra$peak_index3<-NA
# Create a row number column for the dhs data sets
dhs.pos$index<-seq_len(nrow(dhs.pos))
dhs.dat$index<-seq_len(nrow(dhs.dat))

# These hashed out portions are hashed out in the original code. I am going to leave them there.
# For every row in mpra, create a subsetted dhs pos dataset with the same chromosome which encompasses the position of the SNP in that row, and extract the index number from that. Then put the indices into the empty mpra columns. You need the three columns for variants with multiple indices
#num_index<-vector()
for(i in 1:nrow(mpra)){
  #mpra[i,]$peak_index<-subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index[1]  
  indices<-subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index
  mpra[i,c("peak_index1","peak_index2","peak_index3")]<-c(indices, rep(NA, 3-length(indices)))
  #num_index<-c(num_index,length(subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index))
}

# list out the cells which are relevant to this T-cell cell type. Change this to the B-cell list if you are using B-cells or make your own! The large list of cell types can be found in sample.dat in the Biosample.name column. 
#cells<-c( "hTH1", "hTH17", "hTH2","CD4", "CD4pos_N", "hTR", "Jurkat", "CD8")
#cells<-c( "GM12878", "GM12865", "GM12864","CD20", "CD19")

# Second loop
# For each cell type,
for(c in cells){
  # get the numbers of the cell types in the sample data
  cell_col<-as.vector((subset(sample.dat, Biosample.name==c)$library.order)+3)
  # If the length of these columns are greater than 1, then use the max dhs values from the dhs data 
  if(length(cell_col)>1){
    dhs.dat$max_dhs<-apply(dhs.dat[,cell_col], 1, max)
  }else{
  # Or just make the only cell type column's dhs data the max dhs
    dhs.dat$max_dhs<-dhs.dat[,cell_col]
  }
  # Create a peak index of all the values that are greater than 0 
  peak_index<-subset(dhs.dat, max_dhs>0)$index

 # mpra$dhs_peak<-ifelse(mpra$peak_index%in%peak_index, 1, 0)
  # Make a column in the mpra that contains a 1 if that variant has a peak and a 0 if it does not
  mpra$dhs_peak<-ifelse(mpra$peak_index1%in%peak_index | mpra$peak_index2%in%peak_index| mpra$peak_index3%in%peak_index, 1, 0)
  # Make the name of that column specific to the specific cell type tested
  names(mpra)<-gsub("dhs_peak", paste0("dhs_", c), names(mpra))

}
# Create a column of dhs peaks that merges all of the DHS columns 
mpra$dhs_Tcell_merged<-apply(mpra[,which(grepl("dhs_",names(mpra)))],1, max)
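# Create a column flagging variants that fall in any DHS peak from the full index (dhs.pos). Unlike dhs_Tcell_merged, which only counts peaks open in the cell types listed in cells, dhs_all counts a peak regardless of which biosample it is open in.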
mpra$dhs_all<-ifelse(mpra$peak_index1%in%dhs.pos$index | mpra$peak_index2%in%dhs.pos$index | mpra$peak_index3%in%dhs.pos$index, 1, 0)
# eliminate the peak columns
mpra<-subset(mpra, select=-c( peak_index1, peak_index2, peak_index3))
# Remove the objects you do not need anymore 
rm(mpra.se)
rm(dhs.pos.se)
```
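
The per-cell-type step in the second loop is sketched below on a toy binary matrix. Everything here is invented for illustration (`toy.dhs`, `toy.samples`, and the 0/1 values), and the toy matrix has no coordinate columns, so the `+3` offset used in the real code is not needed.

```r
# Toy binary peak matrix: rows are DHS peaks, columns are biosamples
# in library order 1..4
toy.dhs <- data.frame(lib1 = c(1, 0, 0), lib2 = c(0, 0, 1),
                      lib3 = c(0, 0, 0), lib4 = c(0, 1, 0))
toy.samples <- data.frame(library.order = 1:4,
                          Biosample.name = c("CD4", "CD4", "Jurkat", "GM12878"),
                          stringsAsFactors = FALSE)

# Columns that belong to one cell type (here two CD4 libraries)
cd4.cols <- subset(toy.samples, Biosample.name == "CD4")$library.order

# Row-wise maximum across those libraries: 1 if the peak is open in any of them.
# drop = FALSE keeps a data frame even when only one library matches.
cd4.open <- apply(toy.dhs[, cd4.cols, drop = FALSE], 1, max)

# Indices of peaks that are open in this cell type
cd4.peak.index <- unname(which(cd4.open > 0))
cd4.peak.index
```

A variant whose peak index is in `cd4.peak.index` would receive a 1 in the corresponding `dhs_` column. Using `drop = FALSE` is one way to avoid the separate single-column branch.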

```{r}
nrow(mpra)
```

Just like the LD columns, I am just going to use the deltaSVM file already calculated from Mike's code. You can calculate the deltaSVM scores yourself using the file delta_svm_workflow.txt, but I am trying to make this code as user friendly as possible by already including it here.
```{r}
#Merge in deltaSVM
# Load the Jurkat deltaSVM data
dat.deltasvm<-read.delim(paste0(mhguo.dir,"delta_svm/mpra_snps_E2_Jurkat_deltaSVM.txt"), header=F, stringsAsFactors = F, sep="\t")
# Label the names of the Jurkat deltaSVM data
names(dat.deltasvm)<-c("SNP", "delta_svm_Jurkat")
# merge the jurkat deltaSVM data with mpra
mpra<-merge(mpra, dat.deltasvm, by="SNP", all.x=T, all.y=F)
# Load the CD4 deltaSVM data
dat.deltasvm<-read.delim(paste0(mhguo.dir,"delta_svm/mpra_snps_E2_naive_CD4_deltaSVM.txt"), header=F, stringsAsFactors = F, sep="\t")
# Rename the columns
names(dat.deltasvm)<-c("SNP", "delta_svm_nCD4")
# merge with MPRA
mpra<-merge(mpra, dat.deltasvm, by="SNP", all.x=T, all.y=F)
```

```{r}
nrow(mpra)
```

Add allelic skew annotations ASC 
```{r}
#Merge in ASC from Calderon
#Read in allelic skew data from Calderon et al. (PMID: 31570894), Supplementary Table 1, sheet 10, "significant_ASCs" tab
# https://www.nature.com/articles/s41588-019-0505-9#Sec30
# Import the spreadsheet into R
asc<-data.frame(read_excel(paste0(data.dir,"41588_2019_505_MOESM3_ESM.xlsx"), sheet=10), stringsAsFactors = F)
# remove cell types which have the eliminated stimulation status in them (e.g. -S, aka stimulated cell types)
asc<-asc[!grepl(paste0("\\-",eliminate_stim_status), asc$cell_type),]
# Extracting the chromosome from the ID and making it a column
asc$chr<-str_split_fixed(asc$het_id, "\\_", 5)[,2]
# # Extracting the position from the ID and making it a column
asc$pos<-as.numeric(str_split_fixed(asc$het_id, "\\_", 5)[,3])
# Empty column for this data in the MPRA (full of zeros)
mpra$asc<-0
# For each row in the mpra data, if the asc has a matching chromosome and position then the mpra row gets a 1
for(i in 1:nrow(mpra)){
  if(nrow(subset(asc, chr==mpra[i,]$chr & pos==mpra[i,]$snp_end))>0){
    mpra[i,]$asc<-1
  }
}
```
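
The row-by-row matching above can also be expressed as a single key comparison. The sketch below is an alternative illustration on invented data, not the original method; `toy.mpra` and `toy.asc` are hypothetical and assume both tables use the same chromosome naming.

```r
# Toy MPRA variants and toy allelic-skew sites
toy.mpra <- data.frame(chr = c("chr1", "chr1", "chr2"),
                       snp_end = c(1000, 2000, 1000),
                       stringsAsFactors = FALSE)
toy.asc <- data.frame(chr = c("chr1", "chr2"),
                      pos = c(1000, 3000),
                      stringsAsFactors = FALSE)

# Build a "chr position" key in each table and flag MPRA rows whose key
# appears among the allelic-skew sites
mpra.key <- paste(toy.mpra$chr, toy.mpra$snp_end)
asc.key  <- paste(toy.asc$chr, toy.asc$pos)
toy.mpra$asc <- as.numeric(mpra.key %in% asc.key)
toy.mpra
```

Only the first variant matches on both chromosome and position. The third shares a position with an ASC site but sits on a different chromosome, so it stays 0.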

```{r}
nrow(mpra)
```

ATAC-QTL and eQTL
```{r}
# Read in Gate ATAC-QTL data
# Read in Gate et al ATAC-QTL data (PMID: 29988122), Supplemental Table 6
# https://www.nature.com/articles/s41588-018-0156-2#Sec38
# Read in supplementary table
qtl<-data.frame(read_excel(paste0(data.dir,"41588_2018_156_MOESM8_ESM.xlsx"), sheet=2), stringsAsFactors = F)
# Create a chromosome column from the SNP column
qtl$chr<-str_split_fixed(qtl$SNP, "\\:", 2)[,1]
# Create a position column from the SNP column
qtl$snp_end<-str_split_fixed(str_split_fixed(qtl$SNP, "\\_", 2)[,1], "\\:", 2)[,2]
# Subsetting to only the information you want to put in  the mpra
qtl<-subset(qtl, select=c(chr, snp_end, beta, p.value))
# Changing the column names to the column names you want to put in the mpra
names(qtl)[3:4]<-c("atac_qtl_beta", "atac_qtl_pval")
# Merge the atac qtl data with the mpra
mpra<-merge(mpra, qtl, by=c("chr", "snp_end"), all.x=T, all.y=F)

# Read in Gate eQTL data
# Read in the second supplementary table (Supplemental Table 8). This one is eQTLs, not ATAC-QTLs.
qtl<-data.frame(read_excel(paste0(data.dir,"41588_2018_156_MOESM10_ESM.xlsx"), sheet=2), stringsAsFactors = F)
# Create a chromosome column from the SNP column
qtl$chr<-str_split_fixed(qtl$SNP, "\\:", 2)[,1]
# Create a position column from the SNP column
qtl$snp_end<-str_split_fixed(str_split_fixed(qtl$SNP, "\\_", 2)[,1], "\\:", 2)[,2]
#qtl$chr_pos<-paste(qtl$chr, qtl$pos, sep="_")
# Subsetting to only the information you want to put in  the mpra
qtl<-subset(qtl, select=c(chr, snp_end, beta, p.value, gene))
# Changing the column names to the column names you want to put in the mpra
names(qtl)[3:5]<-c("eqtl_beta", "eqtl_pval", "eqtl_gene")
# Merge the eqtl data with the mpra
mpra<-merge(mpra, qtl, by=c("chr", "snp_end"), all.x=T, all.y=F)
# Eliminate the duplicated SNPs created by this merge
mpra <- subset(mpra, duplicated(SNP)!=TRUE)
```

```{r}
nrow(mpra)
```

MotifbreakR
```{r motifbreakr}
#Read in motifbreakr data
dat.motifbreakr<-read.delim(paste0(mhguo.dir,"/tf/motifbreakr/mpra.motifbreakr.method_log.p1e5.results.txt"),header=T, stringsAsFactors = F,sep="\t")
# Create a SNP column from the SNP_id column in the motifbreakr data
dat.motifbreakr$SNP<-gsub("chr", "", dat.motifbreakr$SNP_id)
# Subset the motifbreakr to only the SNP and the gene
dat.motifbreakr<-subset(dat.motifbreakr,  select=c(SNP,  geneSymbol))
# Create an empty column in the mpra for the motifbreakr data
mpra$tf_motifbreakr<-NA
# For each row in mpra, if the row's SNP is in the motifbreakr data set, subset the motifbreakr data to only that SNP, paste the gene symbols together with commas in between, and store the result in the tf_motifbreakr column of the mpra
for(i in 1:nrow(mpra)){
  if(mpra[i,]$SNP%in%dat.motifbreakr$SNP){
    mpra[i,]$tf_motifbreakr<-paste(subset(dat.motifbreakr, SNP==mpra[i,]$SNP)$geneSymbol, collapse=",")
  }
}
```
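
To make the collapsing step concrete, here is a small sketch on invented data. It uses `split()` and `vapply()` rather than the loop above, and the SNP IDs and transcription factor assignments in `toy.motif` are made up.

```r
# Toy motifbreakR results: one SNP disrupts two motifs, another disrupts one
toy.motif <- data.frame(SNP = c("1:1000:A:G", "1:1000:A:G", "2:350:T:C"),
                        geneSymbol = c("GATA3", "RUNX1", "ETS1"),
                        stringsAsFactors = FALSE)

# Collapse the gene symbols into one comma-separated string per SNP
tf.by.snp <- vapply(split(toy.motif$geneSymbol, toy.motif$SNP),
                    paste, character(1), collapse = ",")

# Look up each MPRA variant; variants without a motif hit become NA
toy.variants <- data.frame(SNP = c("1:1000:A:G", "3:10:C:T"),
                           stringsAsFactors = FALSE)
toy.variants$tf_motifbreakr <- unname(tf.by.snp[toy.variants$SNP])
toy.variants
```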

```{r}
nrow(mpra)
```


Create an Ananastra column.
```{r}
ananastra.data.for.mpra <- read.table(paste0(data.dir, "tf/ananastra_data_for_mpra_hg19.txt"), sep="\t", header=T)
ananastra.data.for.mpra$SNP <- ananastra.data.for.mpra$SNP19
ananastra.data.for.mpra <- subset(ananastra.data.for.mpra, select=-c(SNP19))
mpra <- merge(mpra,ananastra.data.for.mpra, by="SNP",all.x=TRUE)
```

```{r}
nrow(mpra)
```

Motifbreakr column for mpra based on my run of motifbreakR with HOCOMOCO v11 and a lower p-value threshold
```{r}
motifbreakr.data.for.mpra <- read.table(paste0(data.dir,"tf/motifbreakr_data_for_mpra_hg19.txt"), sep="\t", header=T)

mpra <- merge(mpra,motifbreakr.data.for.mpra, by="SNP",all.x=TRUE)
```

```{r}
nrow(mpra)
```


Nearest TSS
```{r genes from online}
#Add in genes
# Load the dataset. This takes a while. This dataset is in hg19
options(timeout=150)
dat.genes<-fread("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz", header=F,data.table=F, skip=5)
# Subset the data to only genes and pick the correct columns you need
dat.genes<-subset(dat.genes, V3=="gene", select=c(V1, V4, V5, V7,V9))
# Give names to the columns
names(dat.genes)<-c("chr", "start", "end","strand", "annotation")
# Select only protein coding genes from the annotation
dat.genes<-dat.genes[grepl("protein_coding", dat.genes$annotation),]
# extract the gene name from the annotation
dat.genes$gene<-str_split_fixed(dat.genes$annotation, "\\;",10)[,5]
# eliminate the words gene_name from the gene column
dat.genes$gene<-gsub("gene_name ", "",dat.genes$gene)
# Eliminate the quotes from the gene name
dat.genes$gene<-gsub('"', "", dat.genes$gene, fixed=TRUE)
# Eliminate the spaces from the gene name
dat.genes$gene<-gsub(' ', "", dat.genes$gene, fixed=TRUE)
# Create a new column of tss: the start of the gene on the positive strand, or the end of the gene on the negative strand (because that is where transcription starts)
dat.genes$tss<-ifelse(dat.genes$strand=="-", dat.genes$end, dat.genes$start)
# Eliminate the annotation and strand columns
dat.genes<-subset(dat.genes, select=-c(annotation, strand))

#Add nearest tss
# Create an empty column in the mpra for this data
mpra$tss<-NA
# For each row in mpra, subset the tss data to the same chromosome as the row, then create a new column which is the absolute value of the difference between each tss on that chromosome and the variant in that row. Then order the data by distance and pick the row with the shortest distance to put in that row's tss column, because that is the gene with the closest tss
for(i in 1:nrow(mpra)){
   temp<-subset(dat.genes, chr==mpra[i,]$chr)
   temp$distance<-abs(temp$tss-mpra[i,]$snp_end)
   temp<-temp[order(temp$distance),]
  mpra[i,]$tss<-temp[1,]$gene
 }

```
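
The nearest-TSS rule can be checked on a toy gene table. In the sketch below the gene names, coordinates, and the helper `nearest_gene` are all hypothetical. The point is only that the TSS is the gene start on the `+` strand and the gene end on the `-` strand, and that the closest TSS on the same chromosome wins.

```r
# Toy gene table with one minus-strand gene
toy.genes <- data.frame(chr = c("chr1", "chr1", "chr2"),
                        start = c(100, 5000, 200),
                        end = c(900, 7000, 800),
                        strand = c("+", "-", "+"),
                        gene = c("GENE_A", "GENE_B", "GENE_C"),
                        stringsAsFactors = FALSE)

# TSS is the start on the plus strand and the end on the minus strand
toy.genes$tss <- ifelse(toy.genes$strand == "-", toy.genes$end, toy.genes$start)

# Hypothetical helper: gene with the closest TSS on the same chromosome
nearest_gene <- function(chrom, pos, genes) {
  on.chr <- genes[genes$chr == chrom, ]
  if (nrow(on.chr) == 0) return(NA_character_)
  on.chr$gene[which.min(abs(on.chr$tss - pos))]
}

nearest_gene("chr1", 6500, toy.genes)  # closest TSS is GENE_B's end at 7000
nearest_gene("chr1", 3000, toy.genes)  # closest TSS is GENE_A's start at 100
```

Position 3000 is closer to the body of GENE_B (which starts at 5000) than to GENE_A, but the TSS of GENE_B is at 7000, so GENE_A is reported. This is the behavior the strand correction is meant to produce.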

```{r}
nrow(mpra)
```


This adds the rsID to the data set. 
```{r}
# Subset to only the rsids found in the Mouri and Guo et al. library, which are the ones which have rsids. The list of rsids can be found here: https://www.nature.com/articles/s41588-022-01056-5 (supplementary table 1, sheet 1, GWAS loci).

# This adds the rsID to the data set. 
# Read Mouri and Guo et al. supplementary table 1 GWAS loci. It can be found here: https://www.nature.com/articles/s41588-022-01056-5
rsid.dat<-read_excel(paste0(data.dir, "41588_2022_1056_MOESM4_ESM.xlsx"))
# renaming a column to merge it with the mpra data
rsid.dat$SNP <- rsid.dat$ld_snp 
# pick just the SNP and rsid column
rsid.dat<-unique(subset(rsid.dat, select=c(SNP, rsid)))
# merge the rsid data with the mpra
mpra<-merge(mpra, rsid.dat, by="SNP", all.x=T, all.y=F)

missing_rsid_snps <- subset(mpra, is.na(rsid)==TRUE)
mpra <- subset(mpra, is.na(rsid)==FALSE)
```

```{r}
nrow(mpra)
```

Classifying MPRA variants as emVars, pCREs and no activity variants. This is the unstimulated Jurkat filter because it does not have a log2FC cut-off. It is slightly different from the filter in Mouri et al. 2022 to be consistent with the primary T-cell filter. You will have the opportunity to customize this and change it based on the results of the grid search analysis in the mpra_DHS_grid_search code.
```{r}
if(filter=="original"){
mpra$mpra_sig<- NA
# For every row in the mpra, if the BF-adjusted p-value column (logPadj_BF, not logPadj_BH) is above 2 for either allele and the LogSkew FDR column (Skew.logFDR) is at least 1, then the variant is labeled as an emVar (Enhancer_Skew). Else, if logPadj_BF is above 2 for either allele, then it is labeled as a pCRE (Enhancer_nSkew). If it is neither of those it is a no activity variant (nEnhancer_nSkew).
for(i in 1:nrow(mpra)){
  mpra[i,]$mpra_sig<-ifelse(((mpra[i,]$A.logPadj_BF>2) | (mpra[i,]$B.logPadj_BF>2)) & mpra[i,]$Skew.logFDR>=1, "Enhancer_Skew", ifelse((mpra[i,]$A.logPadj_BF>2) | (mpra[i,]$B.logPadj_BF>2), "Enhancer_nSkew", "nEnhancer_nSkew"))
}
# 
}
```
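
The three-way classification is easier to check on a few invented rows. The sketch below applies the same thresholds as the chunk above, written without the row loop; `toy.calls` and its values are hypothetical.

```r
# Toy MPRA statistics for four variants
toy.calls <- data.frame(A.logPadj_BF = c(3.0, 2.5, 0.5, 0.2),
                        B.logPadj_BF = c(0.1, 0.3, 0.4, 4.0),
                        Skew.logFDR  = c(1.5, 0.2, 2.0, 1.0))

# A variant is active if either allele passes the activity threshold
active <- toy.calls$A.logPadj_BF > 2 | toy.calls$B.logPadj_BF > 2

# Active with allelic skew -> emVar; active without skew -> pCRE; otherwise no activity
toy.calls$mpra_sig <- ifelse(active & toy.calls$Skew.logFDR >= 1, "Enhancer_Skew",
                      ifelse(active, "Enhancer_nSkew", "nEnhancer_nSkew"))
toy.calls
```

The third variant has a strong skew value but no activity in either allele, so it is labeled `nEnhancer_nSkew`. Skew only counts when the element is active.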


Final checks before writing the mpra merge file 
```{r}
# names(mpra)
subset_mpra <- subset(mpra, mpra_sig == "Enhancer_Skew")
nrow(subset_mpra)
# emVars (tcell glm): 545
# emVars  (unstim_jurkat glm) 187
subset_mpra2 <- subset(mpra, mpra_sig == "Enhancer_nSkew")
nrow(subset_mpra2)
# pCREs (tcell glm): 1125
# pCREs  (unstim_jurkat glm) 2728
subset_mpra3 <- subset(mpra, mpra_sig == "nEnhancer_nSkew")
nrow(subset_mpra3)
# no activity (tcell glm): 16471
# no activity  (unstim_jurkat glm) 15114

emvars_in_dhs <- subset(mpra, mpra_sig == "Enhancer_Skew" & dhs_Tcell_merged==1)
nrow(emvars_in_dhs)
# Primary tcell 95
# unstim jurkat 39
```

Add top_pics and top_disease
```{r, eval=FALSE}
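# Pull out the PICS probability, GWAS p-value and running posterior probability columns for every disease. The PP_running columns are needed for top_PP_running below.
mpra.pics <- subset(mpra, select=c(MS_pics,RA_pics,UC_pics,T1D_pics,IBD_pics,Crohns_pics,Psoriasis_pics,RA_pval, MS_pval, UC_pval, T1D_pval, IBD_pval,Crohns_pval,Psoriasis_pval,RA_PP_running, MS_PP_running, UC_PP_running, T1D_PP_running, IBD_PP_running,Crohns_PP_running,Psoriasis_PP_running))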

# For each mpra variant, find the disease with the strongest association and its associated PICS data
  mpra.pics$top_pval<-NA #Top GWAS p-value for the MPRA variant
  mpra.pics$top_disease<-NA #Disease corresponding to top GWAS p-value
  mpra.pics$top_PP_running<-NA #Cumulative sum of posterior probabilities for that variant
  mpra.pics$top_pics<-NA #PICS probability for that variant in the top GWAS

  for(i in 1:nrow(mpra.pics)){ #Run through each MPRA variant
  
  top_pval<-max(mpra.pics[i,grepl("_pval",names(mpra.pics))], na.rm=T) #Find the top GWAS p-value
  top_disease<-str_split_fixed(names(mpra.pics)[which(mpra.pics[i,]==top_pval)][1], "\\_", 2)[1] #Find the disease corresponding to the top GWAS p-value
  
  #Write out GWAS and PICS data for top GWAS p-value
  mpra.pics[i,]$top_pval<-top_pval
  mpra.pics[i,]$top_disease<-top_disease
  mpra.pics[i,]$top_PP_running<-mpra.pics[i,paste0(top_disease, "_PP_running")]
  mpra.pics[i,]$top_pics<-mpra.pics[i,paste0(top_disease, "_pics")]
}
  mpra.pics$top_pics<-as.numeric(mpra.pics$top_pics)
  mpra.pics$top_PP_running<-as.numeric(mpra.pics$top_PP_running)

```
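
The idea of picking the top disease per variant can be sketched on a toy table. This assumes, as the use of `max()` in the chunk above suggests, that the `_pval` columns hold values where larger means more significant. The numbers in `toy.pval` are invented, and every toy row has at least one non-missing value (a row of all `NA` would need separate handling).

```r
# Toy GWAS significance values for two variants across three diseases
toy.pval <- data.frame(MS_pval = c(8.2, NA),
                       RA_pval = c(12.5, NA),
                       T1D_pval = c(NA, 9.1))

# For each row, the name of the column holding the largest value (NA is ignored)
top.col <- apply(toy.pval, 1, function(p) names(p)[which.max(p)])

# Strip the "_pval" suffix to leave the disease name
top.disease <- unname(sub("_pval$", "", top.col))
top.disease
```

With the disease name in hand, the matching PICS column can be looked up as `paste0(top.disease, "_pics")`, which is what the loop above does row by row.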

Putting top_disease and top_pics back into mpra
```{r, eval=FALSE}
mpra$top_disease <- mpra.pics$top_disease
mpra$top_pics <- mpra.pics$top_pics
```

```{r}
nrow(mpra)
```


Write table 
```{r,eval=FALSE}
write.table(mpra, (paste0(data.dir,mpra.name,"_mpra_merge_all.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```

Bring in the table again
```{r, eval=FALSE}
mpra <- read.table(paste0(data.dir,mpra.name,"_mpra_merge_all.txt"), sep="\t", header=T)
```

Filter for at least 20 plasmids for both the A and B allele
```{r}
nrow(mpra)
mpra_filtered <- subset(mpra, A.Ctrl.Mean>=20 & B.Ctrl.Mean>=20)
nrow(mpra_filtered) # You will lose some variants here. Out of the 18285 variants in the study, we lose 144 variants in the primary t-cells (for a total of 18141). This will be different for Jurkat (18029). 
```

Write table 
```{r,eval=FALSE}
write.table(mpra_filtered, (paste0(data.dir,mpra.name,"_mpra_merge_filtered.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```

Read table 
```{r,eval=FALSE}
mpra <- read.table((paste0(data.dir,mpra.name,"_mpra_merge_all.txt")), sep="\t", header=T)
```

MPRA with hg38 columns
```{r}
# Import hg conversion data
hg38_liftover <- read.table(paste0(data.dir,"mpra_snps.hg38_liftover.txt"), sep="\t", header=T)
# Get rid of the word "chr" in the chromosome column, so only the number remains
hg38_liftover_chr_mod <- gsub("chr", "", hg38_liftover$chr)
# Strip the chromosome and hg19 position from snpid_hg19 to be left with ":ref:alt"
# Anchoring the pattern keeps non-numeric chromosome names such as X out of the result
hg38_liftover_id_mod <- sub("^[^:]+:[0-9]+", "", hg38_liftover$snpid_hg19)
# create the SNP ID out of all the pieces you just created
hg38_id <- paste0(hg38_liftover_chr_mod,":",hg38_liftover$pos_hg38,hg38_liftover_id_mod)
# Make the SNPID into a column
hg38_liftover$hg38_id <- hg38_id
# Give the data frame better column names
colnames(hg38_liftover) <- c("chr38","pos38","SNP19","SNP38")

# Make other hg38 specific columns
hg38_liftover$pos38 <- sub("^[^:]+:([0-9]+):.*", "\\1", hg38_liftover$SNP38)
hg38_liftover$snp_end38 <- sub("^[^:]+:([0-9]+):.*", "\\1", hg38_liftover$SNP38)
hg38_liftover$snp_start38<-as.numeric(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,2])-1-abs(nchar(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,3])-nchar(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,4]))
hg38_liftover$ID38 <- paste0(hg38_liftover$SNP38,":R:wC")
hg38_liftover$comb38 <- paste0(hg38_liftover$SNP38, "_center_fwd_ref")
hg38_liftover$ld_snp38 <- hg38_liftover$SNP38

# Make a SNP19 column in the MPRA
mpra$SNP19 <- mpra$SNP

mpra <- merge(hg38_liftover,mpra, by="SNP19")

# Import hg38 conversion data
hg38_conversion <- read.table(paste0(data.dir,"231011_Mouri_hg19_to_hg38_conversion.txt"), sep="\t", header=T)
# Get rid of columns that are bad
hg38_conversion <- hg38_conversion[,1:11]
hg38_conversion <- subset(hg38_conversion, select=-(hg38_ld_snp))
# Change the names of the columns 
names(hg38_conversion) <- c("chr19", "pos19", "rsid", "ref", "alt", "ld_snp19", "lead_snp19","r2", "pos38","lead_snp38")

# Make a SNP column so that the conversion table can be merged with the MPRA by SNP19
hg38_conversion$SNP19 <- paste0(hg38_conversion$chr19,":",hg38_conversion$pos19,":",hg38_conversion$ref,":",hg38_conversion$alt)

# Eliminate duplicated SNPs
hg38_conversion <- hg38_conversion[!duplicated(hg38_conversion[,c('SNP19')]),]

# Rename the columns
names(hg38_conversion) <- c("chr19", "pos19", "rsid_38", "ref_38", "alt_38", "ld_snp19", "lead_snp19","r2_38", "pos38", "lead_snp38","SNP19")

# Eliminate the columns which are hg19 specific except for SNP19
mpra38 <- subset(mpra, select = -c(SNP,snp_end,ld_snp,lead_snp,comb))

# Get the hg38 specific columns
hg_38_conversion <- subset(hg38_conversion, select = c(lead_snp38,SNP19))

# Merge the MPRA with the hg38 info
mpra_hg38 <- merge(hg_38_conversion,mpra38, by="SNP19")

# Change all of the hg38 columns to just regular names
mpra_hg38$SNP <- mpra_hg38$SNP38
mpra_hg38$ld_snp <- mpra_hg38$ld_snp38
mpra_hg38$lead_snp <- mpra_hg38$lead_snp38
mpra_hg38$pos <- mpra_hg38$pos38
mpra_hg38$ID <- mpra_hg38$ID38
mpra_hg38$comb <- mpra_hg38$comb38
mpra_hg38$snp_start <- mpra_hg38$snp_start38  
mpra_hg38$snp_end <- mpra_hg38$snp_end38
mpra_hg38$chr <- mpra_hg38$chr38
# Get rid of the columns which say hg38
mpra38 <- subset(mpra_hg38, select=-c(ld_snp38,lead_snp38,pos38,snp_end38,snp_start38,ID38,comb38,chr38))
```
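
The construction of the hg38 SNP ID can be checked on two invented rows. This sketch assumes `snpid_hg19` is in `chr:pos:ref:alt` format (as the merge with `mpra$SNP` implies) and strips the chromosome and position with an anchored regular expression, which is one way to handle non-numeric chromosome names such as X. The positions in `toy.lift` are made up.

```r
# Toy liftover table using the column names read in above
toy.lift <- data.frame(chr = c("chr1", "chrX"),
                       pos_hg38 = c(1064, 2050),
                       snpid_hg19 = c("1:1000:A:G", "X:2000:AT:A"),
                       stringsAsFactors = FALSE)

# Keep only the ":ref:alt" tail of the hg19 ID
alleles <- sub("^[^:]+:[0-9]+", "", toy.lift$snpid_hg19)

# Chromosome without "chr", then the hg38 position, then the alleles
toy.lift$SNP38 <- paste0(gsub("chr", "", toy.lift$chr), ":", toy.lift$pos_hg38, alleles)
toy.lift$SNP38
```

The alleles are carried over unchanged and only the position is swapped, so the hg38 ID keeps the same `chr:pos:ref:alt` layout that the later `sub()` and `str_split_fixed()` calls expect.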

Write the table
```{r}
write.table(mpra38, (paste0(data.dir,mpra.name,"_mpra_merge_hg38_all.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```

Filter for at least 20 plasmids for both the A and B allele
```{r}
mpra38 <- read.table((paste0(data.dir,mpra.name,"_mpra_merge_hg38_all.txt")), sep="\t", header=T)

nrow(mpra38)
mpra38_filtered <- subset(mpra38, A.Ctrl.Mean>=20 & B.Ctrl.Mean>=20)
nrow(mpra38_filtered) # You will lose some variants here. Out of the 18285 variants in the study, we usually lose ~300. 

write.table(mpra38_filtered, (paste0(data.dir,mpra.name,"_mpra_merge_hg38_filtered.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```