evaluation and clarification for supplementary data 1 #26
Comments
Let me know if you have any issue replicating the task. |
Sorry, I think I misunderstood what you meant by question 1. The notebook does not process the intermediate results; it compares the CSVs to get the value asked about in your 2nd question. To run whatshap compare you need to run the |
Thanks for the reply! For question 1, I did run the
where haplotagged.bam and phased.vcf.* are the outputs from whatshap or hapcut2. Yet the output vcf has the same number of unique phase block IDs as the input vcf (from whatshap). This was HG002 chr6, and methphaser definitely joined many blocks in both the csv file and the bam output. Is there anything I should sanity check? Am I understanding the phase block IDs wrong? I assume |
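For anyone repeating this sanity check, counting the unique phase block IDs before and after can be sketched as below. This is a minimal sketch, assuming the block ID is stored in the PS FORMAT field of a single-sample VCF; the function names `phase_set_ids` and `count_phase_sets` and the demo records are illustrative only and are not part of whatshap or MethPhaser.

```python
import gzip
from typing import Iterable, Set


def phase_set_ids(vcf_lines: Iterable[str], sample_index: int = 0) -> Set[str]:
    """Collect the distinct PS (phase set) values of one sample column."""
    ids: Set[str] = set()
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # header lines carry no genotypes
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10 + sample_index:
            continue  # record has no genotype column for this sample
        keys = fields[8].split(":")
        if "PS" not in keys:
            continue  # variant is not assigned to a phase set
        values = fields[9 + sample_index].split(":")
        ps_pos = keys.index("PS")
        # Trailing FORMAT values may be dropped, and "." means missing.
        if ps_pos < len(values) and values[ps_pos] != ".":
            ids.add(values[ps_pos])
    return ids


def count_phase_sets(path: str) -> int:
    """Count distinct PS values in a plain or gzip/bgzip-compressed VCF."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return len(phase_set_ids(handle))


# Hypothetical records, for illustration only.
demo = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG002\n",
    "chr6\t100\t.\tA\tG\t30\tPASS\t.\tGT:PS\t0|1:100\n",
    "chr6\t200\t.\tC\tT\t30\tPASS\t.\tGT:PS\t1|0:100\n",
    "chr6\t900\t.\tG\tA\t30\tPASS\t.\tGT:PS\t0|1:900\n",
    "chr6\t950\t.\tT\tC\t30\tPASS\t.\tGT\t0/1\n",
]
print(sorted(phase_set_ids(demo)))
```

Running `count_phase_sets` on the input vcf and on the MethPhaser output vcf and comparing the two numbers would show whether any block IDs were actually rewritten.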
Yes this is weird because the script does merge the blocks when it reads the csv files from the intermediate output. |
edit: after a while I realized the csv's "same" and "not same" are relative to the whatshap vcf, not absolute. Some of the statements below about correct/wrong phasing in the intermediate csv might be incorrect, please disregard them. The vcf and the bam are being modified, but seemingly not in all places that need modification, in terms of consistency with the intermediate csv...? I wonder what I am doing wrong. The following are from these files:
HG002 with hg38 as reference and 1-index. This joining is all correct: chr6:23,954,211-24,015,472
But for the following gap: chr6:11,092,382-11,147,866
Here the intermediate csv has the correct decision, but vcf and bam did not flip the haplotags of the reads...? The left and the right variants in whatshap vcf:
For chr6:42318581-42403518, the decision does not seem right and the phase group ID is not updated:
In IGV: (^ top track is 1Mb subsequences of T2T HG002 aligned to ref with
Another, possibly unrelated, issue is seen at chr6:43,291,981-43,347,903, which is correctly assigned and modified as a not-same joint. The intermediate csv's line is
I did not retain the stdout or stderr for this run, but I think the only warnings were:
|
Did you use the postprocessing script in your pull request version? I think that actually ignored the block merge I did. |
I did a clean conda installation, but still got the same symptoms somehow. On an r10.4.1 e8.2 HG002 chr6, whatshap's phasing had 62 unique phase block IDs; after methphaser phasing it became 58. Here are the steps of this run: Variant calling and SNP phasing files are reused from older runs of the pipeline via softlinks. It was epi2me-labs/wf-human-variation (calling clair3) and whatshap. Remove conda env of previous methphaser, then Run the following bash script:
Upon finishing, the log file is: log.txt. Checksums from the log file are:
which should be identical with commit 0e44165, tag V0.0.3. Outputs: Parsing of phase block IDs:
Do you by any chance still have the hapcut2 phased vcf files of the runs described in the paper? I'm using the intermediate csv files at the moment and would like to make sure my parsing is consistent with what would be obtained in the correct vcf. |
https://zenodo.org/records/11195009 |
I was aware of the zenodo link, but it did not contain the vcf produced by hapcut2 prior to methphaser. I would like to see how many gaps there were for chr6 and run
Putting that aside, I am trying to understand the block relationships using the R9_60X chr6 methphaser-tagged vcf and the intermediate csv files:
My understanding is that here SNPs with ID 382537 and ID 4615421 should be reassigned to have ID 157644. But this does not seem to be the case. What did I get wrong? Thanks! |
There are some other filters that decide whether two blocks need to be connected. For example, these two blocks actually don't have a clear majority of reads that support the relationship (247 vs 143 and 21 vs 11). In this case, to safeguard MethPhaser's accuracy, we did not take the block assignment into account in our final vcf output. See these 2 parameters:
|
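The two parameters themselves are not reproduced in this thread. As a rough illustration of the majority check described above, here is a minimal sketch that computes the fraction of reads backing the winning relationship for the two quoted cases. The argument names follow the csv columns "same_hap_num" and "diff_hap_num" mentioned in this thread; the function names and the `min_fraction` cutoff are hypothetical and are not MethPhaser's actual parameters or defaults.

```python
def majority_fraction(same_hap_num: int, diff_hap_num: int) -> float:
    """Fraction of reads supporting whichever relationship has more support."""
    total = same_hap_num + diff_hap_num
    if total == 0:
        return 0.0  # no informative reads, so no support either way
    return max(same_hap_num, diff_hap_num) / total


def has_clear_majority(same_hap_num: int, diff_hap_num: int,
                       min_fraction: float) -> bool:
    """True when the winning side reaches the (hypothetical) cutoff."""
    return majority_fraction(same_hap_num, diff_hap_num) >= min_fraction


# The two read-count pairs quoted in the comment above.
for counts in [(247, 143), (21, 11)]:
    fraction = majority_fraction(*counts)
    # 0.75 is an arbitrary illustrative cutoff, not a MethPhaser default.
    print(counts, round(fraction, 3), has_clear_majority(*counts, min_fraction=0.75))
```

Both pairs give a winning fraction below two thirds, which is consistent with the description of them as lacking a clear majority.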
I assume you mean the ratio between the columns "same_hap_num" and "diff_hap_num". Still using R9_60X chr6, pairs of phase blocks that appear to be clear-cut also do not have all of their SNPs' group IDs merged. Examples of clear-cut cases:
Parsing code:
|
I vaguely recall that there might be some filter based on CpG numbers in each block, but I need to check further. |
Could you check whether the vcf is updated correctly, rather than what filtering criteria were used? I would just like to make sure the comment above is an expected result, and that I'm free to take the vcf output as the final result and run whatshap compare on it for evaluation. |
yes please |
Just to clarify: do you mean that, given the linked comment above, what I thought was a vcf writing error is actually the expected behavior, i.e. when the intermediate csv and the vcf/bam disagree, I should disregard the csv? |
Yes, because the postprocessing script basically applies filters to the csv files and alters the VCFs. If you don't see the change in the vcf, it means the line in that csv file didn't pass some filter. |
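To make that behavior concrete, here is a minimal sketch of how block IDs could be relabeled once only the csv rows that pass the filters are kept. It is not the actual postprocessing script: the union-find approach, the function name `merged_labels`, the choice of the smallest ID as the merged label, and the example IDs are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def merged_labels(block_ids: List[int],
                  accepted_pairs: Iterable[Tuple[int, int]]) -> Dict[int, int]:
    """Map every block ID to the label of the merged block it ends up in.

    Only pairs that passed filtering should be given in accepted_pairs.
    Assumption: a merged block keeps the smallest ID among its members.
    """
    parent = {block: block for block in block_ids}

    def find(block: int) -> int:
        parent.setdefault(block, block)
        while parent[block] != block:
            parent[block] = parent[parent[block]]  # path halving
            block = parent[block]
        return block

    for left, right in accepted_pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            low, high = sorted((root_left, root_right))
            parent[high] = low  # the smaller ID becomes the shared label
    return {block: find(block) for block in block_ids}


# Hypothetical block IDs; the pair (2000, 5000) is imagined to have
# failed a filter, so it is simply not listed and 5000 stays separate.
blocks = [100, 900, 2000, 5000]
labels = merged_labels(blocks, [(100, 900), (900, 2000)])
print(labels)
print(len(set(labels.values())), "blocks after merging")
```

Under this sketch, a csv row that fails a filter leaves both of its blocks with their original IDs, which matches the explanation that an unchanged vcf entry means the row was filtered out.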
Got it, thank you so much! Closing this. I will test that PR later this week and update over there. |
Sorry, reopening for an additional question: In supplementary table 1, sample R9 60x, "SNV Phasing Unconnected Phaseblock Number" is 3179, "MethPhaser Connected Phaseblock Number" is 1457. I think this means in the vcf produced by methphaser, there should be 3178-1457=1721 phase blocks. In the zenodo data, My local runs are like this, including the clean installation one mentioned above. Are you sure this is correct...? edit: to be clear, I am not questioning methphaser's performance. Judging by parsing the intermediate csv files and evaluating them with custom scripts, methphaser performed very well on a local R9 HG002 chr6 run. I just wonder whether phase group IDs are not updated in the final vcf, either accidentally or intentionally, or whether I misinterpreted something. Sanity check: the vcf matches its md5 on zenodo (this file):
Here is the script to reproduce the stats:
|
@Fu-Yilei Would you offer any comments? (Pinging in case github did not send notification upon reopening.) |
Sorry, I didn't get the notification. The group ID does look weird; I need to check the vcf generation scripts further. |
Thanks! Appreciate it :) |
Hi,
I'm trying to replicate supplementary data 1 on a single chromosome. I have a few questions regarding the post processing of methphaser results.
1. I believe the phase blocks in the output vcf are kept untouched, and in order to run whatshap compare, one needs to refer to the csv files in the work directory and create an intermediate vcf file. Is this correct? I did a quick search in MethPhaser_paper_scripts/post_processing_script.ipynb but parsing seems not to be included there.
2. Could you clarify the difference between "Methylation connected Phase Block Accuracy" and "Methylation connected Phase Block Success Rate" in supplementary data 1?
3. Is the HG002 R10 data used in the manuscript this one?
Thank you!