-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy path20241111_jurkat_mpra_big_table.Rmd
734 lines (622 loc) · 32.4 KB
/
20241111_jurkat_mpra_big_table.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
---
title: "Jurkat MPRA table"
output: html_document
author: Max Dippel
date: "2024-10-21"
---
In this code, I will use to create a really large table of jurkat mpra data with lots of extra data on functional annotations. After you have created this big table, you can use it to create plots which are relevant to MPRA data. Mostly, these data are relevant supplementary tables 1, 5 and 6 and can make the plots (in another piece of code) for figure 1, figure 2, supp. figure 4, supp. figure 5, and supp. figure 6.
# MPRA merge
Load the packages
```{r packages, message=FALSE}
library(stringr)
library(IRanges)
library(data.table)
library(readxl)
require(biomaRt)
library(GenomicRanges)
library(dplyr)
library(readxl)
library(stringr)
```
Establish directories
```{r}
# This is the main directory for the analysis
main.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2"
# This is where the data will go
data.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/"
# This is the directory where you put the original MPRA analysis code created by Mike Guo. It is located here. https://zenodo.org/records/6302248
mhguo.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/mhguo1-T_cell_MPRA-5c36361/"
# This is the directory where you have stored the DHS data. Use hg19 for this analysis. You can change the genome buildat the end. The DHS data can be found at https://zenodo.org/records/3838751
# Download these three things:
# DHS_Index_and_Vocabulary_metadata.tsv
# DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz
# dat_bin_FDR01_hg19.txt.gz
dhs.data.dir <- "~/Desktop/Ho et al. writings/Code_from_scratch2/data/DHS_hg19/"
```
Import MPRA, inspect data, and set directories for specific cell types
############################
## STOP!
STOP! You must look below to see the MPRA for which you are about to make a big table. This Rmd makes the jurkat big table.
If you are replicating the plots from Ho et al., you need all the data available. You must run both to generate big tables for the T-cell and Jurkat MPRAs hashing in and out the proper lines to get the correct code for each run. I thought this would be better and at least more compact than making this code twice.
############################
```{r}
# Load Ho et al. Supp. table 1 or two depending if you want to make this table with primary t-cell (sheet 1) and jurkat data (sheet 2)
mpra_out<-read.table(paste0(data.dir,"OLJR.A_Jurkat_emVAR_glm_20240310.out"), header=T, stringsAsFactors = F, sep="")
head(mpra_out)
# How many rows are in that file?
nrow(mpra_out)
# How many columns are in that file?
ncol(mpra_out)
# What are the column names in the mpra?
names(mpra_out)
# Name of MPRA file
mpra.name <- "20241111_jurkat"
# This decides what option to choose for the variant category filter. There are currently two options "original" and "expression". The "original" filter is for the jurkat MPRA as it is similar (not the exact same, but similar to Mouri & Guo et al.). The "expression" filter is for the Primary T-cell MPRA.
filter <- "original"
# What is the stimulation status of the cells? We going to eliminate stimulated (S) or unstimulated (U) cells. For Ho et al. Primary T-cells are stimulated and Jurkats are unstimulated.
eliminate_stim_status <- "S" <- # Unstimulated so eliminate stimulated cells with S
# List out the cells which are relevant to this T-cell cell type for DHS columns. Change this to the B-cell list if you are using B-cells or make your own! The large list of cell types can be found in the DHS section sample.dat in the Biosample.name column.
cells<-c( "hTH1", "hTH17", "hTH2","CD4", "CD4pos_N", "hTR", "Jurkat", "CD8")
# Please be aware that delta SVM, eqtl data and ATAC-QTL data are all T-cell specific
```
This code standardizes the column names in the analysis.
```{r}
# Change the name of the columns to match the functions
mpra_out$LogSkew <- mpra_out$Log2Skew
# Change the name of the column to match the functions
mpra_out$A.Ctrl.Mean <- mpra_out$A_Ctrl_Mean
mpra_out$A.Exp.Mean <- mpra_out$A_Exp_Mean
mpra_out$A.log2FC <- mpra_out$A_log2FC
mpra_out$A.log2FC_SE <- mpra_out$A_log2FC_SE
mpra_out$A.logP <- mpra_out$A_logP
mpra_out$A.logPadj_BH <- mpra_out$A_logPadj_BH
mpra_out$A.logPadj_BF <- mpra_out$A_logPadj_BF
mpra_out$B.Ctrl.Mean <- mpra_out$B_Ctrl_Mean
mpra_out$B.Exp.Mean <- mpra_out$B_Exp_Mean
mpra_out$B.log2FC <- mpra_out$B_log2FC
mpra_out$B.log2FC_SE <- mpra_out$B_log2FC_SE
mpra_out$B.logP <- mpra_out$B_logP
mpra_out$B.logPadj_BH <- mpra_out$B_logPadj_BH
mpra_out$B.logPadj_BF <- mpra_out$B_logPadj_BF
mpra_out$Skew.logP <- mpra_out$Skew_logP
mpra_out$Skew.logFDR <- mpra_out$Skew_logFDR
# This column creates the project column
mpra_out$project <- "TGWAS"
# eliminate all of the columns that have the names we don't want
mpra_out <- subset(mpra_out, select = -c(Log2Skew,A_Ctrl_Mean,A_Exp_Mean,A_log2FC,A_log2FC_SE,A_logP,A_logPadj_BH, A_logPadj_BF,B_Ctrl_Mean,B_Exp_Mean,B_log2FC,B_log2FC_SE,B_logP,B_logPadj_BH,B_logPadj_BF,Skew_logP,Skew_logFDR))
mpra <- mpra_out
```
```{r}
nrow(mpra_out)
```
LD columns from the jurkat MPRA. You can also calculate the ld data by hand by running the plink script (ld.sh) yourself, but I have included it here for ease of use.
```{r}
#Merge LD files across chromosomes
# Create an empty data frame
dat.ld.all<-data.frame(lead_snp=character(), ld_snp=character(), r2=numeric(), stringsAsFactors = F)
# A loop that says "for every chromosome..."
for(c in c(1:22, "X")){
# read in Mike's linkage files for each chromosome
dat.ld.chr<-read.delim(paste(mhguo.dir,"annotate_mpra/ld/ld/mpra.chr", c, ".ld", sep=""), header=T, stringsAsFactors = F, sep="")
# Select the columns you need from this data set
dat.ld.chr<-subset(dat.ld.chr, select=c(SNP_A, SNP_B, R2))
# Rename those columns
names(dat.ld.chr)<-c("lead_snp","ld_snp", "r2")
# combine these values with the empty data frame you created earlier
dat.ld.all<-rbind(dat.ld.all, dat.ld.chr)
# Remove dat.ld.chr
rm(dat.ld.chr)
}
#remove duplicates. If proxy SNP has multiple entries, select the one with the highest r2
dat.ld.all<-dat.ld.all[order(dat.ld.all$ld_snp, -dat.ld.all$r2),]
dat.ld.all<-dat.ld.all[!duplicated(dat.ld.all$ld_snp),]
#Merge in LD information
mpra$ld_snp <- mpra$SNP
mpra<-merge(mpra, dat.ld.all, by="ld_snp", all.x=T, all.y=F)
```
Creating bed-file like columns
```{r}
# Make the MPRA file into a bed-like file for downstream annotations
# Creating the start snp column by modifying the SNP column. Start snp is calculated by getting the postion then subtracting 1. Then subtracting the absolute value of the difference between the number of reference and alternate alleles.
mpra$snp_start <- as.numeric(str_split_fixed(mpra$SNP, "\\:", 4)[,2])-1-abs(nchar(str_split_fixed(mpra$SNP, "\\:", 4)[,3])-nchar(str_split_fixed(mpra$SNP, "\\:", 4)[,4]))
# Create a end_snp column which modifies the SNP column and the values after the first colon and before the second colon
mpra$snp_end<-as.numeric(str_split_fixed(mpra$SNP, "\\:", 4)[,2])
# Create a chromosome column by modifying the SNP column. Taking everything before the first colon and adding it to the letter "chr"
mpra$chr<-paste0("chr", str_split_fixed(mpra$SNP, "\\:", 4)[,1])
# Create a strand column
mpra$strand<-"+"
# Create position column
mpra$pos <- sub("^[^:]+:([0-9]+):.*", "\\1", mpra$SNP)
# Reference allele
mpra$ref_allele <- sub("^[^:]+:[^:]+:([^:]+):.*$", "\\1", mpra$SNP)
# Alternate allele
mpra$alt_allele <- sub("^[^:]+:[^:]+:[^:]+:([^:]+)$", "\\1", mpra$SNP)
# Create ID column
mpra$ID <- paste0(mpra$SNP,":R:wC")
```
```{r}
nrow(mpra)
```
This big loop to integrate the PICS (probabilistic identification of causal SNPs) data into the mpra.merge data set
```{r PICS data}
# Order of GWAS for plotting
gwas.order<- c("Crohns","MS","Psoriasis", "RA","T1D","UC", "IBD")
# Merge in GWAS and PICS fine-mapping data
for(d in gwas.order){
# Read in PICS results
gwas<-read.delim(paste0(mhguo.dir,"pics/run/", d, ".ld_0.8.pics"), header=F, stringsAsFactors = F, sep="\t")
# Select certain columns in the gwas data
gwas<-gwas[,c(1,2,4,6,7)]
# Name the columns of the gwas data
names(gwas)<-c("lead_snp", "ld_snp", "r2", "pics","pval")
# eliminate the letters "chr" from lead and lp snp
gwas$lead_snp<-gsub("chr", "", gwas$lead_snp)
gwas$ld_snp<-gsub("chr", "", gwas$ld_snp)
#Merge PICS results with MPRA to subset for SNPs tested in the MPRA
gwas<-merge(gwas, mpra[,c("ld_snp", "lead_snp")], by=c("ld_snp", "lead_snp"), all.x=F, all.y=F)
#Remove SNPs seen multiple times, picking the SNP with the most significant p-value in catalog, then the highest PICS posterior probability
gwas<-gwas[order(-gwas$pval, -gwas$r2),]
gwas<-gwas[!duplicated(gwas[,c("ld_snp", "lead_snp")]),]
# Generate credible sets
# Order the data by SNPs
gwas <- gwas[order(gwas$lead_snp, -gwas$pics),]
gwas$PP_running<-0 #Create variable with a running sum of PICS probabilities
gwas$PP_sum <- 0 #Create variable with sum of PICS probablities
#Note: need this second variable to help break ties (if two SNPs have the same PP, the automatically will have the same credible set annotation)
prev_var <- "dummy_variable" #Create variable to test whether we've moved onto a new locus
for(i in c(1:nrow(gwas))){
# Creating a new variable from lead snp
new_var<-gwas[i,]$lead_snp
# if the pics value is greater than zero
if(gwas[i,]$pics>0){
if(new_var==prev_var){ #Test if next variant is still in same locus
gwas[i,]$PP_sum<-gwas[i-1,]$PP_sum+gwas[i,]$pics #Add pics to running PP sum
if(gwas[i,]$pics==gwas[i-1,]$pics){ #If two variants have the same PP, force them to have the same PP_running
gwas[i,]$PP_running<-gwas[i-1,]$PP_running
}else{
gwas[i,]$PP_running<-gwas[i,]$pics+gwas[i-1,]$PP_sum #If not Add pics and PP_sum to running PP sum
}
}else{ #If this variant is in a new locus, assign the PICS PP as their PP_sum and PP_running
gwas[i,]$PP_sum<-gwas[i,]$pics
gwas[i,]$PP_running<-gwas[i,]$pics
}
}
prev_var<-new_var #If lead SNP is different, we've now moved on to new locus
}
# Merge with MPRA data
# subseet the gwax data to a few variants
gwas<-subset(gwas, select=c(ld_snp, pval,pics,PP_running))
# Round the gwas pvalue to two digits
gwas$pval<-round(gwas$pval,2)
# Creating a disease specific name with underscores for the pics statistics
names(gwas)[2:4]<-paste0(d, "_", names(gwas)[2:4])
# merge he gwas data and mpra by ld_snp
mpra<-merge(mpra, gwas, by="ld_snp", all.x=T, all.y=F)
# remove the gwas object
rm(gwas)
}
```
```{r}
nrow(mpra)
```
DHS (DNase 1 hypersensitive sites) data
As stated above, the DHS data can be found at https://zenodo.org/records/3838751
Download these three things:
1. DHS_Index_and_Vocabulary_metadata.tsv
2. DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz
3. dat_bin_FDR01_hg19.txt.gz
```{r}
### DHS data
# Eliminating all the columns that have "dhs_" in them which happen to be none of the columns
mpra<-mpra[,!(grepl("dhs_", names(mpra)))]
# Make a granges object from the mpra spreadsheet
mpra.se<-makeGRangesFromDataFrame(mpra, seqnames = "chr", start.field = "snp_start", end.field = "snp_end",keep.extra.columns=TRUE)
# Read in DHS position file
dhs.pos<- read.delim(gzfile(paste0(dhs.data.dir,"DHS_Index_and_Vocabulary_hg19_WM20190703.txt.gz")), header=T, stringsAsFactors = F)
# Only keep the first three columns of the DHS pos data
dhs.pos<-dhs.pos[,c(1:3)]
# Make peak column which puts together the chromosome, the start and the end
dhs.pos$peak<-paste0(dhs.pos$seqname, ":", dhs.pos$start, "_", dhs.pos$end)
# make a granges from the dhs pos data
dhs.pos.se<-makeGRangesFromDataFrame(dhs.pos, seqnames = "seqname", start.field = "start", end.field = "end",keep.extra.columns=TRUE)
#Find DHS peaks that overlap with MPRA
overlap_peaks<-data.frame(subsetByOverlaps(dhs.pos.se, mpra.se, type="any"), stringsAsFactors = F)$peak
overlap_index<-which(dhs.pos$peak%in%overlap_peaks)
# bring in dhs data
dhs.dat<-fread(paste0(dhs.data.dir,"dat_bin_FDR01_hg19.txt.gz"))
#Subset for only DHS peaks that overlap with MPRA in the dhs data
dhs.dat<-dhs.dat[overlap_index,]
#Subset for only DHS peaks that overlap with MPRA in the dhs pos data
dhs.pos<-dhs.pos[overlap_index,]
# Now combin ehte dhs and dhs pos data
dhs.dat<-cbind(dhs.pos[,c(1:3)],dhs.dat)
#Read in sample information
sample.dat<-read.delim(paste0(dhs.data.dir,"DHS_Index_and_Vocabulary_metadata.tsv"), sep="\t", header=T, stringsAsFactors = F)
# Select certain columns in the sample data
sample.dat<-subset(sample.dat, !is.na(library.order),select=c(library.order,Biosample.name, System,Subsystem, Organ, Biosample.type, Biological.state))
#Add peak index into MPRA for faster calculation
# Create three empty columns in mpra
mpra$peak_index1<-NA
mpra$peak_index2<-NA
mpra$peak_index3<-NA
# Create a row number column for the dhs data sets
dhs.pos$index<-seq(1:nrow(dhs.pos))
dhs.dat$index<-seq(1:nrow(dhs.dat))
# These hashed out portions are hashed out in the original code. I am going to leave them there.
# For every row in mpra, create a subseted dhs pos dataset with the same chromosome and which emcompases the position of the SNP in that row and extract the index number from that. Then put the indices into the empty mpra columns. You need the three columns for varaints with multiple indicies
#num_index<-vector()
for(i in 1:nrow(mpra)){
#mpra[i,]$peak_index<-subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index[1]
indices<-subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index
mpra[i,c("peak_index1","peak_index2","peak_index3")]<-c(indices, rep(NA, 3-length(indices)))
#num_index<-c(num_index,length(subset(dhs.pos, seqname==mpra[i,]$chr & start<=mpra[i,]$snp_end & end >=mpra[i,]$snp_end)$index))
}
# list out the cells which are relevant to this T-cell cell type. Change this to the B-cell list if you are using B-cells or make your own! The large list of cell types can be found in sample.dat in the Biosample.name column.
#cells<-c( "hTH1", "hTH17", "hTH2","CD4", "CD4pos_N", "hTR", "Jurkat", "CD8")
#cells<-c( "GM12878", "GM12865", "GM12864","CD20", "CD19")
# Second loop
# For each cell type,
for(c in cells){
# get the numbers of the cell types in the sample data
cell_col<-as.vector((subset(sample.dat, Biosample.name==c)$library.order)+3)
# If the length of these columns are greater than 1, then use the max dhs values from the dhs data
if(length(cell_col)>1){
dhs.dat$max_dhs<-apply(dhs.dat[,cell_col], 1, max)
}else{
# Or just make the only cell type columns's dhs data the max dhs
dhs.dat$max_dhs<-dhs.dat[,cell_col]
}
# Create a peak index of all the values that are greater than 0
peak_index<-subset(dhs.dat, max_dhs>0)$index
# mpra$dhs_peak<-ifelse(mpra$peak_index%in%peak_index, 1, 0)
# Make a column in the mpra that that contains a 1 if that that variant has a peak and a 0 if it does not
mpra$dhs_peak<-ifelse(mpra$peak_index1%in%peak_index | mpra$peak_index2%in%peak_index| mpra$peak_index3%in%peak_index, 1, 0)
# Make the name of that column specific to the specific cell type tested
names(mpra)<-gsub("dhs_peak", paste0("dhs_", c), names(mpra))
}
# Create a column of dhs peaks that merges all of the DHS columns
mpra$dhs_Tcell_merged<-apply(mpra[,which(grepl("dhs_",names(mpra)))],1, max)
# Create a column of dhs peaks from the original dhs pos column. How is this different from the column we just made?
mpra$dhs_all<-ifelse(mpra$peak_index1%in%dhs.pos$index | mpra$peak_index2%in%dhs.pos$index | mpra$peak_index3%in%dhs.pos$index, 1, 0)
# eliminate the peak columns
mpra<-subset(mpra, select=-c( peak_index1, peak_index2, peak_index3))
# Remove the objects you do not need anymore
rm(mpra.se)
rm(dhs.pos.se)
```
```{r}
nrow(mpra)
```
Just like the LD columns, I am just going to use the delatSVM file already calculated from Mike's code. You can calculate the delta SVM scores yourself using this file delta_svm_workflow.txt, but I am trying to make this code as user friendly as possible by already including it here.
```{r}
#Merge in deltaSVM
# Load the Jurkat deltaSVM data
dat.deltasvm<-read.delim(paste0(mhguo.dir,"delta_svm/mpra_snps_E2_Jurkat_deltaSVM.txt"), header=F, stringsAsFactors = F, sep="\t")
# Label the names of the Jurkat deltaSVM data
names(dat.deltasvm)<-c("SNP", "delta_svm_Jurkat")
# merge the jurkat deltaSVM data with mpra
mpra<-merge(mpra, dat.deltasvm, by="SNP", all.x=T, all.y=F)
# Load the CD4 deltaSVM data
dat.deltasvm<-read.delim(paste0(mhguo.dir,"delta_svm/mpra_snps_E2_naive_CD4_deltaSVM.txt"), header=F, stringsAsFactors = F, sep="\t")
# Rename the columns
names(dat.deltasvm)<-c("SNP", "delta_svm_nCD4")
# merge with MPRA
mpra<-merge(mpra, dat.deltasvm, by="SNP", all.x=T, all.y=F)
```
```{r}
nrow(mpra)
```
Add allelic skew annotations ASC
```{r}
#Merge in ASC from Calderon
#Read in allelic skew data from Calderon et al. (PMID: 31570894), Supplementary Table 1, sheet 10, "significant_ASCs" tab
# https://www.nature.com/articles/s41588-019-0505-9#Sec30
# Import the spreadsheet into R
asc<-data.frame(read_excel(paste0(data.dir,"41588_2019_505_MOESM3_ESM.xlsx"), sheet=10), stringsAsFactors = F)
# remove cell tpyes with have -S in them (aka stimulated cell types)
asc<-asc[!grepl(paste0("\\-",eliminate_stim_status), asc$cell_type),]
# Extracting the chromosome from the ID and making it a column
asc$chr<-str_split_fixed(asc$het_id, "\\_", 5)[,2]
# # Extracting the position from the ID and making it a column
asc$pos<-as.numeric(str_split_fixed(asc$het_id, "\\_", 5)[,3])
# Empty column for this data in the MPRA (full of zeros)
mpra$asc<-0
# For each row in the mpra data, if the asc has a matching chromosome and position then the mpra row gets a 1
for(i in 1:nrow(mpra)){
if(nrow(subset(asc, chr==mpra[i,]$chr & pos==mpra[i,]$snp_end))>0){
mpra[i,]$asc<-1
}
}
```
```{r}
nrow(mpra)
```
ATAC-QTL and eQTL
```{r}
# Read in Gate ATAC-QTL data
# Read in Gate et al ATAC-QTL data (PMID: 29988122), Supplemental Table 6
# https://www.nature.com/articles/s41588-018-0156-2#Sec38
# Read in supplementary table
qtl<-data.frame(read_excel(paste0(data.dir,"41588_2018_156_MOESM8_ESM.xlsx"), sheet=2), stringsAsFactors = F)
# Create a chromosome column from the SNP column
qtl$chr<-str_split_fixed(qtl$SNP, "\\:", 2)[,1]
# Create a position column from the SNP column
qtl$snp_end<-str_split_fixed(str_split_fixed(qtl$SNP, "\\_", 2)[,1], "\\:", 2)[,2]
# Subsetting to only the information you want to put in the mpra
qtl<-subset(qtl, select=c(chr, snp_end, beta, p.value))
# Changing the column names to the column names you want to put in the mpra
names(qtl)[3:4]<-c("atac_qtl_beta", "atac_qtl_pval")
# Merge the atac qtl data with the mpra
mpra<-merge(mpra, qtl, by=c("chr", "snp_end"), all.x=T, all.y=F)
# Read in Gate eQTL data
# Read in the second supplementary table (Supplemental Table 8) This one is eqtls not atac-qtls. Supplemental Table 8
qtl<-data.frame(read_excel(paste0(data.dir,"41588_2018_156_MOESM10_ESM.xlsx"), sheet=2), stringsAsFactors = F)
# Create a chromosome column from the SNP column
qtl$chr<-str_split_fixed(qtl$SNP, "\\:", 2)[,1]
# Create a position column from the SNP column
qtl$snp_end<-str_split_fixed(str_split_fixed(qtl$SNP, "\\_", 2)[,1], "\\:", 2)[,2]
#qtl$chr_pos<-paste(qtl$chr, qtl$pos, sep="_")
# Subsetting to only the information you want to put in the mpra
qtl<-subset(qtl, select=c(chr, snp_end, beta, p.value, gene))
# Changing the column names to the column names you want to put in the mpra
names(qtl)[3:5]<-c("eqtl_beta", "eqtl_pval", "eqtl_gene")
# Merge the eqtl data with the mpra
mpra<-merge(mpra, qtl, by=c("chr", "snp_end"), all.x=T, all.y=F)
# Eliminate the duplicated SNPs created by this merge
mpra <- subset(mpra, duplicated(SNP)!=TRUE)
```
```{r}
nrow(mpra)
```
MotifbreakR
```{r motifbreakr}
#Read in motifbreakr data
dat.motifbreakr<-read.delim(paste0(mhguo.dir,"/tf/motifbreakr/mpra.motifbreakr.method_log.p1e5.results.txt"),header=T, stringsAsFactors = F,sep="\t")
# Create a SNP column from the SNP_id column in the motifbreakr data
dat.motifbreakr$SNP<-gsub("chr", "", dat.motifbreakr$SNP_id)
# Subset the motifbreakr to only the SNP and the gene
dat.motifbreakr<-subset(dat.motifbreakr, select=c(SNP, geneSymbol))
# Create an empty column in the mpra for the motifbreakr data
mpra$tf_motifbreakr<-NA
# For each row in mpra, if the row's SNP is in the MPRA data set, subset the motifbreakr data to only that SNP and paste the genes as characters in the column with a comma in between and make that the tf_motifbreakr column in the mpra
for(i in 1:nrow(mpra)){
if(mpra[i,]$SNP%in%dat.motifbreakr$SNP){
mpra[i,]$tf_motifbreakr<-paste(subset(dat.motifbreakr, SNP==mpra[i,]$SNP)$geneSymbol, collapse=",")
}
}
```
```{r}
nrow(mpra)
```
Create an Ananastra column.
```{r}
ananastra.data.for.mpra <- read.table(paste0(data.dir, "tf/ananastra_data_for_mpra_hg19.txt"), sep="\t", header=T)
ananastra.data.for.mpra$SNP <- ananastra.data.for.mpra$SNP19
ananastra.data.for.mpra <- subset(ananastra.data.for.mpra, select=-c(SNP19))
mpra <- merge(mpra,ananastra.data.for.mpra, by="SNP",all.x=TRUE)
```
```{r}
nrow(mpra)
```
Motifbreakr column for mpra based on my run of motifbreak r with hocomoco v. 11 and a lower p-value threshold
```{r}
motifbreakr.data.for.mpra <- read.table(paste0(data.dir,"tf/motifbreakr_data_for_mpra_hg19.txt"), sep="\t", header=T)
mpra <- merge(mpra,motifbreakr.data.for.mpra, by="SNP",all.x=TRUE)
```
```{r}
nrow(mpra)
```
Nearest TSS
```{r genes from online}
#Add in genes
# Load the dataset. This takes a while. This dataset is in hg19
options(timeout=150)
dat.genes<-fread("ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_19/gencode.v19.annotation.gtf.gz", header=F,data.table=F, skip=5)
# Subset the data to only genes and pick the correct columns you need
dat.genes<-subset(dat.genes, V3=="gene", select=c(V1, V4, V5, V7,V9))
# Give names to the columns
names(dat.genes)<-c("chr", "start", "end","strand", "annotation")
# Select only protein coding genes from the annotation
dat.genes<-dat.genes[grepl("protein_coding", dat.genes$annotation),]
# extract the gene name from the annotation
dat.genes$gene<-str_split_fixed(dat.genes$annotation, "\\;",10)[,5]
# eliminate the words gene_name from the gene column
dat.genes$gene<-gsub("gene_name ", "",dat.genes$gene)
# Eliminate the quotes from the gene name
dat.genes$gene<-gsub('"', "", dat.genes$gene, fixed=TRUE)
# Eliminate the spaces from the gene name
dat.genes$gene<-gsub(' ', "", dat.genes$gene, fixed=TRUE)
# Create a new column of tss which is just the start of the gene
dat.genes$tss<-dat.genes$start
# For every row in this ebi dataset, if the gene is on the negative strand, then the value for the tss column is actually the value for the end column because that would be the tss
for(i in 1:nrow(dat.genes)){
if(dat.genes[i,]$strand=="-"){
dat.genes[i,]$tss<-dat.genes[i,]$end
}
}
# Eliminate the annotation and strand columns
dat.genes<-subset(dat.genes, select=-c(annotation, strand))
#Add nearest tss
# Create an empty column in the mora for this data
mpra$tss<-NA
# For each row in mpra, subset the tss data so that it is the same chromosome as the row, then create a new column which is the absolute value of the difference of all the tss of that chromosome and and the variant in that row. Then order the data by the distance and pick the row with the shortest ditance to put in that row's tss column because that is the gene with the closest tss
for(i in 1:nrow(mpra)){
temp<-subset(dat.genes, chr==mpra[i,]$chr)
temp$distance<-abs(temp$tss-mpra[i,]$snp_end)
temp<-temp[order(temp$distance),]
mpra[i,]$tss<-temp[1,]$gene
}
```
```{r}
nrow(mpra)
```
This adds the rsID to the data set.
```{r}
# Subset to only the rsids found in Mouri and Guo et al. library which are the ones which have rsids. The list of risds can be found here: https://www.nature.com/articles/s41588-022-01056-5 (supplementary table 1, sheet 1, GWAS loci).
# This adds the rsID to the data set.
# Read Mouri and Guo et al. supplementary table 1 GWAS loci. It can be found here: https://www.nature.com/articles/s41588-022-01056-5
rsid.dat<-read_excel(paste0(data.dir, "41588_2022_1056_MOESM4_ESM.xlsx"))
# renaming a column to merge it with the mpra data
rsid.dat$SNP <- rsid.dat$ld_snp
# pick just the SNP and rsid column
rsid.dat<-unique(subset(rsid.dat, select=c(SNP, rsid)))
# merge the rsid data with the mpra
mpra<-merge(mpra, rsid.dat, by="SNP", all.x=T, all.y=F)
missing_rsid_snps <- subset(mpra, is.na(rsid)==TRUE)
mpra <- subset(mpra, is.na(rsid)==FALSE)
```
```{r}
nrow(mpra)
```
Creating classifying mpra varaints as emVars, pCREs and no activity variants. This is the unstimulated jurkat filter because it does not have a log2FC cut-off. It is slightly different to the filter in Mouri et al. 2022 to be consistant with the primary T-cell filter. You will have the opportunity to customize this and change it based on the results of the grid search analysis in the mpra_DHS_grid_search code.
```{r}
if(filter=="original"){
mpra$mpra_sig<- NA
# For every row in the mpra, if the BH p-vlaue is above 2 for either allele and the LogSkew FDR is above 1, then the variant is labeled as an emVar (Enhancer_Skew). Else, if the BH p-vlaue is above 2 for either allele, then it is labeled as a pCRE (Enhancer_nSkew). If it is neither of those it is a no activity variant (nEnhancer_nSkew).
for(i in 1:nrow(mpra)){
mpra[i,]$mpra_sig<-ifelse(((mpra[i,]$A.logPadj_BF>2) | (mpra[i,]$B.logPadj_BF>2)) & mpra[i,]$Skew.logFDR>=1, "Enhancer_Skew", ifelse((mpra[i,]$A.logPadj_BF>2) | (mpra[i,]$B.logPadj_BF>2), "Enhancer_nSkew", "nEnhancer_nSkew"))
}
#
}
```
Final checks before writing the mpra merge file
```{r}
# names(mpra)
subset_mpra <- subset(mpra, mpra_sig == "Enhancer_Skew")
nrow(subset_mpra)
# emVars (tcell glm): 545
# emVars (unstim_jurkat glm) 187
subset_mpra2 <- subset(mpra, mpra_sig == "Enhancer_nSkew")
nrow(subset_mpra2)
# pCREs (tcell glm): 1125
# pCREs (unstim_jurkat glm) 2728
subset_mpra3 <- subset(mpra, mpra_sig == "nEnhancer_nSkew")
nrow(subset_mpra3)
# no activity (tcell glm): 16471
# no activity (unstim_jurkat glm) 15114
emvars_in_dhs <- subset(mpra, mpra_sig == "Enhancer_Skew" & dhs_Tcell_merged==1)
nrow(emvars_in_dhs)
# Primary tcell 95
# unstim jurkat 39
```
Add top_pics and top_disease
```{r, eval=FALSE}
#For each MPRA variant, find the pics for all the diseases and make max_pics the maximum of all of those values
mpra.pics <- subset(mpra, select=c(MS_pics,RA_pics,UC_pics,T1D_pics,IBD_pics,Crohns_pics,Psoriasis_pics,RA_pval, MS_pval, UC_pval, T1D_pval, IBD_pval,Crohns_pval,Psoriasis_pval))
# For each mpra variant, find the disease with the strongest association and its associated PICS data
mpra.pics$top_pval<-NA #Top GWAS p-value for the MPRA variant
mpra.pics$top_disease<-NA #Disease corresponding to top GWAS p-value
mpra.pics$top_PP_running<-NA #Cumulative sum of posterior probabilities for that variant
mpra.pics$top_pics<-NA #PICS probability for that variant in the top GWAS
for(i in 1:nrow(mpra.pics)){ #Run through each MPRA variant
top_pval<-max(mpra.pics[i,grepl("_pval",names(mpra.pics))], na.rm=T) #Find the top GWAS p-value
top_disease<-str_split_fixed(names(mpra.pics)[which(mpra.pics[i,]==top_pval)][1], "\\_", 2)[1] #Find the disease corresponding to the top GWAS p-value
#Write out GWAS and PICS data for top GWAS p-value
mpra.pics[i,]$top_pval<-top_pval
mpra.pics[i,]$top_disease<-top_disease
mpra.pics[i,]$top_PP_running<-mpra.pics[i,paste0(top_disease, "_PP_running")]
mpra.pics[i,]$top_pics<-mpra.pics[i,paste0(top_disease, "_pics")]
}
mpra.pics$top_pics<-as.numeric(mpra.pics$top_pics)
mpra.pics$top_PP_running<-as.numeric(mpra.pics$top_PP_running)
```
Putting the max_pics and win_pics back into mpra
```{r, eval=FALSE}
mpra$top_disease <- mpra.pics$top_disease
mpra$top_pics <- mpra.pics$top_pics
```
```{r}
nrow(mpra)
```
Write table
```{r,eval=FALSE}
write.table(mpra, (paste0(data.dir,mpra.name,"_mpra_merge_all.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```
Bring in the table again
```{r, eval=FALSE}
mpra <- read.table(paste0(data.dir,mpra.name,"_mpra_merge_all.txt"), sep="\t", header=T)
```
Filter for greater than 20 plasmids for both the A and B allele
```{r}
nrow(mpra)
mpra_filtered <- subset(mpra, A.Ctrl.Mean>=20 & B.Ctrl.Mean>=20)
nrow(mpra_filtered) # You will lose some variants here. Out of the 18285 variants in the study, we lose 144 variants in the primary t-cells (for a total of 18141). This will be different for Jurkat (18029).
```
Write table
```{r,eval=FALSE}
write.table(mpra_filtered, (paste0(data.dir,mpra.name,"_mpra_merge_filtered.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```
Read table
```{r,eval=FALSE}
mpra <- read.table((paste0(data.dir,mpra.name,"_mpra_merge_all.txt")), sep="\t", header=T)
```
MPRA with hg38 columns
```{r}
# Import hg conversion data
hg38_liftover <- read.table(paste0(data.dir,"mpra_snps.hg38_liftover.txt"), sep="\t", header=T)
# Get rid of the word "chr" in the chromosome column, so only the number remains
hg38_liftover_chr_mod <- gsub("chr", "", hg38_liftover$chr)
# Get rid of the numbers on the snpid_hg19 to be left with the alleles
hg38_liftover_id_mod <- gsub("[0-9]+", "", hg38_liftover$snpid_hg19)
# Get rid of double colon
hg38_liftover_id_mod <- gsub("::", ":", hg38_liftover_id_mod)
# create the SNP ID out of all the pieces you just created
hg38_id <- paste0(hg38_liftover_chr_mod,":",hg38_liftover$pos_hg38,hg38_liftover_id_mod)
# Make the SNPID into a column
hg38_liftover$hg38_id <- hg38_id
# Give the data frame better column names
colnames(hg38_liftover) <- c("chr38","pos38","SNP19","SNP38")
# Make other hg38 specific columns
hg38_liftover$pos38 <- sub("^[^:]+:([0-9]+):.*", "\\1", hg38_liftover$SNP38)
hg38_liftover$snp_end38 <- sub("^[^:]+:([0-9]+):.*", "\\1", hg38_liftover$SNP38)
hg38_liftover$snp_start38<-as.numeric(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,2])-1-abs(nchar(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,3])-nchar(str_split_fixed(hg38_liftover$SNP38, "\\:", 4)[,4]))
hg38_liftover$ID38 <- paste0(hg38_liftover$SNP38,":R:wC")
hg38_liftover$comb38 <- paste0(hg38_liftover$SNP38, "_center_fwd_ref")
hg38_liftover$ld_snp38 <- hg38_liftover$SNP38
# Make a SNP19 column in the MPRA
mpra$SNP19 <- mpra$SNP
mpra <- merge(hg38_liftover,mpra, by="SNP19")
# Import hg38 conversion data
hg38_conversion <- read.table(paste0(data.dir,"231011_Mouri_hg19_to_hg38_conversion.txt"), sep="\t", header=T)
# Get rid of columns that are bad
hg38_conversion <- hg38_conversion[,1:11]
hg38_conversion <- subset(hg38_conversion, select=-(hg38_ld_snp))
# Change the names of the columns
names(hg38_conversion) <- c("chr19", "pos19", "rsid", "ref", "alt", "ld_snp19", "lead_snp19","r2", "pos38","lead_snp38")
# Make a SNP column so that I can merge the
hg38_conversion$SNP19 <- paste0(hg38_conversion$chr19,":",hg38_conversion$pos19,":",hg38_conversion$ref,":",hg38_conversion$alt)
# Eliminate duplicated SNPs
hg38_conversion <- hg38_conversion[!duplicated(hg38_conversion[,c('SNP19')]),]
# Rename the columns
names(hg38_conversion) <- c("chr19", "pos19", "rsid_38", "ref_38", "alt_38", "ld_snp19", "lead_snp19","r2_38", "pos38", "lead_snp38","SNP19")
# Eliminate the columns which are hg19 specific except for SNP19
mpra38 <- subset(mpra, select = -c(SNP,snp_end,ld_snp,lead_snp,comb))
# Get the hg38 specific columns
hg_38_conversion <- subset(hg38_conversion, select = c(lead_snp38,SNP19))
# Merge the MPRA with the hg38 info
mpra_hg38 <- merge(hg_38_conversion,mpra38, by="SNP19")
# Change all of the hg38 columns to just regular names
mpra_hg38$SNP <- mpra_hg38$SNP38
mpra_hg38$ld_snp <- mpra_hg38$ld_snp38
mpra_hg38$lead_snp <- mpra_hg38$lead_snp38
mpra_hg38$pos <- mpra_hg38$pos38
mpra_hg38$ID <- mpra_hg38$ID38
mpra_hg38$comb <- mpra_hg38$comb38
mpra_hg38$snp_start <- mpra_hg38$snp_start38
mpra_hg38$snp_end <- mpra_hg38$snp_end38
mpra_hg38$chr <- mpra_hg38$chr38
# Get rid of the columns which say hg38
mpra38 <- subset(mpra_hg38, select=-c(ld_snp38,lead_snp38,pos38,snp_end38,snp_start38,ID38,comb38,chr38))
```
Write the table
```{r}
write.table(mpra38, (paste0(data.dir,mpra.name,"_mpra_merge_hg38_all.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```
Filter for greater than 20 plasmids for both the A and B allele
```{r}
mpra38 <- read.table((paste0(data.dir,mpra.name,"_mpra_merge_hg38_all.txt")), sep="\t", header=T)
nrow(mpra38)
mpra38_filtered <- subset(mpra38, A.Ctrl.Mean>=20 & B.Ctrl.Mean>=20)
nrow(mpra38_filtered) # You will lose some variants here. Out of the 18285 variants in the study, we usually lose ~300.
write.table(mpra38_filtered, (paste0(data.dir,mpra.name,"_mpra_merge_hg38_filtered.txt")), row.names=F, col.names=T, sep="\t", quote=F)
```