Calculating ArchR Gene Score Matrix for bulk epigenetic data #2175

Al-Murphy · 2024-06-16T13:04:32Z

I'm trying to calculate a gene score matrix for bulk ChIP-Seq histone mark data (e.g. H3K27ac from Roadmap/ENCODE). I believe these gene scores could be good predictors of expression more generally, rather than only being used to call marker genes. However, I'm having trouble formatting the bulk ChIP-Seq data so that ArchR will accept it.

I start with bedGraph data containing the mapped count of reads:

              V1       V2       V3     V4
          <char>    <int>    <int>  <int>
       1:   chr1        0    10147     0
       2:   chr1    10147    10183     1
       3:   chr1    10183    13055     0
       4:   chr1    13055    13091     1
       5:   chr1    13091    16219     0

This looks pretty similar to the single-cell fragments file that ArchR happily accepts:

              V1       V2       V3                 V4    V5
          <char>    <int>    <int>             <char> <int>
       1:   chr1    10079    10209 TAGGAGGGTTAACCGT-1     1
       2:   chr1    10090    10278 CCTTGGTAGTTCTCCC-1     5
       3:   chr1    10157    10204 GCCAGCAGTTCTGAGT-1     1
       4:   chr1    10222    10333 TAGCATGGTGACCAGA-1     1
       5:   chr1    10297    10614 GCCCAGAAGTGTCACT-1     1

The only difference is that the cell barcode is missing. To get around this, I created random barcodes to make 500 cells for my 50 million reads (column V5):

              V1       V2       V3        V5    V4
          <char>    <int>    <int>    <char> <int>
       1:   chr1        0    10147     82008     0
       2:   chr1    10147    10183     82008     1
       3:   chr1    10183    13055     82008     0
       4:   chr1    13055    13091     82008     1
       5:   chr1    13091    16219     82008     0
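
For readers who want to reproduce the barcode step, here is a minimal sketch in Python (standard library only). It is not the code I actually ran: the function name `add_pseudo_barcodes`, the `cellNNN` barcode names, the fixed seed, and the uniform per-row assignment are all assumptions made for illustration, since the post does not say how the barcodes were distributed across rows.

```python
import random

def add_pseudo_barcodes(rows, n_cells=500, seed=0):
    """Attach a pseudo cell barcode to each bedGraph row.

    rows: iterable of (chrom, start, end, count) tuples.
    Returns a list of (chrom, start, end, barcode, count) tuples,
    i.e. the column order of a single-cell fragments file.
    """
    rng = random.Random(seed)
    # Hypothetical barcode names; any set of unique strings would do.
    barcodes = [f"cell{i:03d}" for i in range(n_cells)]
    out = []
    for chrom, start, end, count in rows:
        # Assumption: each row gets a barcode drawn uniformly at random.
        out.append((chrom, start, end, rng.choice(barcodes), count))
    return out

# The five bedGraph rows shown at the top of the post.
bedgraph = [
    ("chr1", 0, 10147, 0),
    ("chr1", 10147, 10183, 1),
    ("chr1", 10183, 13055, 0),
    ("chr1", 13055, 13091, 1),
    ("chr1", 13091, 16219, 0),
]

fragments = add_pseudo_barcodes(bedgraph)
for row in fragments:
    print("\t".join(str(field) for field in row))
```

The coordinates and counts pass through unchanged; only the barcode column is new.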

With fewer than 500 cells ArchR fails, I believe with errors about the number of cells. With this approach, I can create an Arrow file (createArrowFiles) and get the gene score matrix. I then pseudobulk the scores across the 500 'cells' (I tested both mean and sum) to get one score per gene.
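
The pseudobulk step can be sketched like this. It is an illustration in plain Python under assumed inputs, not ArchR code: the gene score matrix is represented as a dict from gene name to a list of per-cell scores, and `pseudobulk` and the toy values are hypothetical.

```python
def pseudobulk(gene_scores, how="mean"):
    """Collapse a genes x cells score matrix to one score per gene.

    gene_scores: dict mapping gene name -> list of per-cell scores.
    how: "mean" or "sum".
    """
    if how not in ("mean", "sum"):
        raise ValueError("how must be 'mean' or 'sum'")
    result = {}
    for gene, scores in gene_scores.items():
        total = sum(scores)
        result[gene] = total / len(scores) if how == "mean" else total
    return result

# Toy values for illustration only, not real gene scores.
toy_scores = {
    "GENE_A": [0.0, 2.0, 4.0],
    "GENE_B": [1.0, 1.0, 1.0],
}

by_mean = pseudobulk(toy_scores, "mean")
by_sum = pseudobulk(toy_scores, "sum")
print(by_mean)
print(by_sum)
```

When every gene is scored over the same number of cells, the sum is just the mean times a constant, so the two aggregations give the same correlation with expression.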

The issue is that the correlation between these scores and the true expression for the same cell type (bulk RNA-Seq) is essentially random: -0.050094. I know there are issues with my approach, but I would have expected it to be considerably higher even so.
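
For reference, this is roughly how such a correlation can be computed once the gene scores and expression values are matched gene by gene. It is a sketch in Python (standard library only) with made-up toy values; the post does not state whether a Pearson or Spearman coefficient was used, so both are shown, and the helper names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy values for illustration only; both lists must be in the same gene order.
gene_score = [0.0, 1.0, 2.0, 10.0]
expression = [0.5, 1.0, 8.0, 1000.0]

r_pearson = pearson(gene_score, expression)
r_spearman = spearman(gene_score, expression)
print(r_pearson, r_spearman)
```

Expression values are typically heavy-tailed, so a rank-based coefficient (or a log transform before Pearson) is less dominated by a handful of highly expressed genes.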

So my question is: how can I improve this? I also tried duplicating the data, with the 500 'cells' above plus one more cell containing all 50 million reads. I had to increase the maxFrags parameter, but createArrowFiles did run. I then filtered the gene score matrix to just the cell with all of the reads. However, the correlation between these scores and the true expression was equally poor: -0.054114.

Any advice or help on this would be great!

Al-Murphy · 2024-06-18T16:13:34Z

Just as an update on this: I ran the same approach (the second one above) on many more cell types and histone marks and found equally poor correlations:

cell	histone_mark	correlation
E003	H3K4me1	0.008739
E003	H3K4me3	0.007959
E003	H3K27ac	-0.054114
E003	H3K27me3	0.008236
E003	H3K9me3	0.007846
E006	H3K4me3	-0.028055
E006	H3K9ac	-0.028527
E007	H3K9ac	-0.036493
E016	H3K4me3	-0.061422
E087	H3K4me1	-0.008787
E087	H3K4me3	-0.008532
E087	H3K27ac	-0.008095
E087	H3K9ac	-0.007843
E087	H3K27me3	-0.007070
E087	H3K36me3	-0.010298
E087	H3K9me3	-0.009428
E114	H3K27ac	-0.032556
E114	H3K9ac	-0.034724
E116	H3K27ac	-0.022326
E116	H3K9ac	-0.044485
E116	H3K36me3	-0.041636
E118	H3K4me3	-0.015269
E118	H3K27ac	-0.001395
E118	H3K9ac	-0.027112
E118	H3K36me3	-0.022781

Note that some of these marks are repressive, so I would also take a strong negative correlation as a good result.
