Improbably high peak call counts with low IgG read counts -- any way around this? #92

Open
SolKatzman opened this issue Jan 11, 2023 · 0 comments

SEACR v1.3
SEACR v1.4-beta.2

In some CutRun experiments that I am analyzing, there were, unfortunately, relatively low counts of IgG control mappings (compared to target mappings). So when I use bedgraph files derived from merged target replicates, in order to increase the accuracy of the called peaks, the net result seems to be a substantial overabundance of called peaks.

Here are the details. I ran the 3 replicates (sampA, sampB, sampC) separately and also merged into one (3samp). The 3 IgG replicates were merged into one control that was used for all SEACR runs. I ran with "norm" in both "relaxed" and "stringent" modes, for each of the two versions of SEACR noted above. I tried v1.4 based on the suggestion in Issue #76.

The experimental samples were H3K4me1 from mouse (mapped to mm10). The controls were IgG. The indicated read counts were taken from the bam files before generating the bedgraph files. The numPeaks, median, and average lengths were derived from the SEACR output files. The empirical FDR values were taken from the SEACR run logs. (Note my Issue #91 -- more information in the log might be helpful to understand these results.)

seacr_run:       #Sample  ctrlReads  exptReads  numPeaks  medLen  avgLen  empFDR
v1.3_relaxed:    sampA    2.56M      3.45M      81422     172     187     0.054
v1.3_relaxed:    sampB    2.56M      3.91M      65394     195     214     0.058
v1.3_relaxed:    sampC    2.56M      3.70M      83225     169     181     0.055
v1.3_relaxed:    3samp    2.56M      11.06M     502204    185     234     0.011

v1.4_relaxed:    sampA    2.56M      3.45M      70063     179     193     0.058
v1.4_relaxed:    sampB    2.56M      3.91M      100790    175     193     0.065
v1.4_relaxed:    sampC    2.56M      3.70M      82307     169     181     0.096
v1.4_relaxed:    3samp    2.56M      11.06M     421670    197     249     0.014

v1.3_stringent:  sampA    2.56M      3.45M      32275     213     228     0.024
v1.3_stringent:  sampB    2.56M      3.91M      26703     241     260     0.029
v1.3_stringent:  sampC    2.56M      3.70M      28739     211     223     0.028
v1.3_stringent:  3samp    2.56M      11.06M     213091    254     319     0.003

v1.4_stringent:  sampA    2.56M      3.45M      27407     220     236     0.028
v1.4_stringent:  sampB    2.56M      3.91M      33200     229     248     0.022
v1.4_stringent:  sampC    2.56M      3.70M      20847     225     237     0.038
v1.4_stringent:  3samp    2.56M      11.06M     182323    269     337     0.003

It is mildly interesting that v1.4 calls fewer peaks than v1.3 for sampA and sampC, but more for sampB, apparently due to less merging of the peaks (based on the median length, which is more consistent across the 3 replicates in v1.4).
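The replicate-level comparison between versions can be checked with a short Python sketch. The numbers are transcribed from the table above; the dictionary layout and the helper names (`med_len_spread`, `peak_change`) are hypothetical and are not part of SEACR.

```python
# Peak counts and median lengths transcribed from the table above,
# listed in replicate order: sampA, sampB, sampC.
runs = {
    "v1.3_relaxed":   {"peaks": [81422, 65394, 83225],  "med_len": [172, 195, 169]},
    "v1.4_relaxed":   {"peaks": [70063, 100790, 82307], "med_len": [179, 175, 169]},
    "v1.3_stringent": {"peaks": [32275, 26703, 28739],  "med_len": [213, 241, 211]},
    "v1.4_stringent": {"peaks": [27407, 33200, 20847],  "med_len": [220, 229, 225]},
}


def med_len_spread(run):
    """Range (max - min) of the median peak length across the 3 replicates."""
    vals = runs[run]["med_len"]
    return max(vals) - min(vals)


def peak_change(mode):
    """Per-replicate change in peak count going from v1.3 to v1.4."""
    old = runs[f"v1.3_{mode}"]["peaks"]
    new = runs[f"v1.4_{mode}"]["peaks"]
    return [n - o for o, n in zip(old, new)]


for mode in ("relaxed", "stringent"):
    print(mode, "medLen spread v1.3:", med_len_spread(f"v1.3_{mode}"),
          "v1.4:", med_len_spread(f"v1.4_{mode}"))
    print(mode, "peak change (sampA, sampB, sampC):", peak_change(mode))
```

A smaller spread for v1.4 would be consistent with the more uniform median lengths noted above, although it does not by itself show that peak merging is the cause.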

Although v1.4 has reduced the count of peaks called for 3samp, the net output is still improbably high. Note that the median length of the 3samp peaks is (only) 10% to 20% higher than the sampA,B,C peaks. Average length is more like 50% higher. I am not sure how to interpret the lower empirical FDR numbers for the 3samp runs compared to sampA,B,C.
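To put numbers on the 3samp comparison, here is a minimal Python sketch that divides each 3samp value by the mean of the three replicate values for the same run. The data are transcribed from the table above; the `table` layout and the `merged_vs_replicates` helper are hypothetical names used only for illustration.

```python
from statistics import mean

# (numPeaks, medLen, avgLen) transcribed from the table above.
# "reps" holds sampA, sampB, sampC; "merged" holds 3samp.
table = {
    "v1.3_relaxed": {
        "reps": [(81422, 172, 187), (65394, 195, 214), (83225, 169, 181)],
        "merged": (502204, 185, 234),
    },
    "v1.4_relaxed": {
        "reps": [(70063, 179, 193), (100790, 175, 193), (82307, 169, 181)],
        "merged": (421670, 197, 249),
    },
    "v1.3_stringent": {
        "reps": [(32275, 213, 228), (26703, 241, 260), (28739, 211, 223)],
        "merged": (213091, 254, 319),
    },
    "v1.4_stringent": {
        "reps": [(27407, 220, 236), (33200, 229, 248), (20847, 225, 237)],
        "merged": (182323, 269, 337),
    },
}


def merged_vs_replicates(run):
    """Return (peak, medLen, avgLen) ratios of 3samp to the replicate mean."""
    reps = table[run]["reps"]
    merged = table[run]["merged"]
    return tuple(merged[i] / mean(r[i] for r in reps) for i in range(3))


for run in table:
    peaks, med, avg = merged_vs_replicates(run)
    print(f"{run}: peaks x{peaks:.2f}, medLen x{med:.2f}, avgLen x{avg:.2f}")

# For scale: the merged sample has about 3x the reads of one replicate.
print("3samp reads / mean replicate reads:", 11.06 / mean([3.45, 3.91, 3.70]))
```

Comparing the peak-count ratio against the read-count ratio shows how much faster the number of called peaks grows than the sequencing depth when the control stays fixed at 2.56M reads.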

Note that I am presenting this (perhaps extreme) case to see if you have any other suggestions for dealing with relatively low control mapping counts. I have other experiments in which even a single replicate has perhaps 2X the read count of a merged set of control replicates.

I truly would appreciate any insight that you might have.

Sol Katzman
UC Santa Cruz Genomics Institute
