Document of QV file #148

ZexuanZhao · 2024-12-02T18:31:20Z

Hi!

I'm using Merqury v1.3 to compute a QV score for each HiFi read, so that I can identify contaminant reads by comparing them against an Illumina dataset that is not contaminated. Here's a glimpse of the data I got from the .qv file.

The first 5 columns are named according to this doc, where the shared column, according to the doc, is k-mers found in both assembly and the read set. The last column is the size of the reads calculated using samtools faidx.

What drew my attention is that sometimes uniq + shared > size + 20, which should not happen given the limited number of k-mers a sequence can contain. But I also noticed that shared = size - 20, which makes me wonder if shared is actually the total number of k-mers rather than the number of shared k-mers.
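
For reference, a linear sequence of length L has at most L - k + 1 k-mer positions, so shared = size - 20 is what a total k-mer count would look like if k = 21. A minimal sketch of that bound follows; k = 21 is an assumption inferred from the offset of 20, and the function name `max_kmers` is purely illustrative (it is not part of Merqury).

```python
def max_kmers(seq_len: int, k: int = 21) -> int:
    """Upper bound on the number of k-mers in a linear sequence.

    k=21 is an assumption inferred from the 'size - 20' offset in the issue.
    """
    # A sequence shorter than k contains no k-mers at all.
    return max(seq_len - k + 1, 0)
```

Under this assumption, any row where uniq + shared exceeds this bound cannot be explained if shared really counted only the k-mers found in both sets.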

To confirm this, I recalculated the QV score using the formula from your paper, but assuming that the shared column is the total number of k-mers and that the actual number of shared k-mers is the third column minus the second column.

This confirmed my suspicion: my QV calculation matches the one generated by Merqury.
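
The recalculation can be sketched roughly as below. This is a minimal illustration, not Merqury's own code: it uses the QV definition from the Merqury paper (P = (K_shared / K_total)^(1/k), E = 1 - P, QV = -10 log10 E), treats the second column as assembly-only k-mers and the third column as the total k-mer count, and assumes k = 21. The function name `merqury_qv` and the handling of the zero-error case are choices made for this sketch.

```python
import math


def merqury_qv(asm_only: int, total: int, k: int = 21) -> tuple[float, float]:
    """Return (QV, error_rate) from two columns of a .qv row.

    asm_only: k-mers found only in the sequence (second column).
    total:    total k-mers in the sequence (third column, per this issue).
    k:        k-mer size; 21 is an assumption, not taken from the source.
    """
    if total <= 0:
        raise ValueError("total k-mer count must be positive")
    # Shared k-mers are derived as total minus sequence-only k-mers.
    shared = total - asm_only
    # Per-base probability of being correct, then per-base error rate.
    p_correct = (shared / total) ** (1.0 / k)
    error = 1.0 - p_correct
    # With no sequence-only k-mers the error is 0; this sketch returns inf.
    qv = math.inf if error == 0 else -10.0 * math.log10(error)
    return qv, error
```

If the values returned for a row agree with the fourth and fifth columns of the .qv file, that supports reading the third column as the total k-mer count.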

Could you check the documentation and confirm whether the description of the third column of the .qv file is correct?

Best,
Zexuan

arangrhie · 2024-12-11T18:15:10Z

Hi @ZexuanZhao!

Thanks for pointing this out - I corrected the wiki description.
The shared column is indeed the total number of k-mers in the assembly, which in your case is the read sequence.

Sorry for the confusion!

Best,
Arang
