Toy example
For this page, I'll use the following 100-bp reference genome (no errors) with a 20-bp two-copy repeat:
Here is the draft genome which contains errors (that we aim to fix):
And here are the short reads (non-paired, forward-strand-only and error free for this simple example) which we will use to polish the draft genome:
Aligning reads in the normal way involves putting each read in its single best position. If there are multiple equally good positions, then one is chosen at random. This strategy would result in alignments that look like this:
This illustrates the problem with that alignment strategy: the error in the first copy of the repeat has caused reads from that part of the genome to align to the second copy of the repeat instead. This has left no reads aligning over the error.
If we instead align each read to all possible locations, we get alignments that look like this:
I've coloured the reads which align to multiple locations in red. Now we have reads aligning over the error in the repeat, and this is the kind of alignment that Polypolish was built to take.
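To make the difference between the two alignment strategies concrete, here is a minimal sketch in Python. It is not how a real aligner or Polypolish works: it only tries ungapped placements and counts mismatches, and the sequences, function names and the `max_mismatches` parameter are all made up for illustration.

```python
import random

def all_alignments(draft, read, max_mismatches=1):
    """Return (start, mismatches) for every ungapped placement of the read
    on the draft that has at most max_mismatches mismatches."""
    hits = []
    for start in range(len(draft) - len(read) + 1):
        window = draft[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(window, read))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits

def best_alignment(draft, read, max_mismatches=1, rng=random):
    """Return the single best placement, breaking ties at random."""
    hits = all_alignments(draft, read, max_mismatches)
    if not hits:
        return None
    fewest = min(m for _, m in hits)
    return rng.choice([h for h in hits if h[1] == fewest])

# Hypothetical draft with two copies of the repeat ACGTTGCA.
# The first copy carries an error (ACGTAGCA), the second copy is correct.
draft = "GG" + "ACGTAGCA" + "CCC" + "ACGTTGCA" + "TT"
read = "ACGTTGCA"  # an error-free read from the repeat

single = best_alignment(draft, read)   # placed only at the correct copy
multiple = all_alignments(draft, read) # placed at both copies
```

With single-best placement the read lands only on the second (correct) copy, so nothing covers the error. With all placements kept, the read also covers the first copy, which is what allows the error there to be seen.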
Polypolish now tallies up the read bases at each position of the draft to make a pileup like this:
Some things to note:
- Each base of the draft usually corresponds to a single base from the reads. However, a deletion in the reads relative to the draft results in no base (shown as `-`) and an insertion in the reads relative to the draft results in multiple bases at one position (a squished `CT` in this example).
- The shaded green area represents the depth at each position of the draft. It doesn't always correspond to the size of the pileup because of repeats. E.g. when a read aligns to two possible places, it adds 0.5 depth to each of those places.
- The dotted red line represents the threshold depth. Polypolish sets this at each position as either 5 (adjustable with `--min_depth`) or half the depth (adjustable with `--min_fraction`), whichever is larger. Since this toy example has low read depth, the threshold is often set at 5. In a more realistic case of deeper reads, most locations in the draft will have a threshold set at half the depth.
- You might notice that the pileup seems to be missing some bases which were in the alignment. E.g. a few reads aligned all the way to the end of the draft, but in the pileup there are two bases at the end with no reads. This is due to Polypolish's alignment trimming logic.
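The depth and threshold rules described above can be sketched in a few lines of Python. This is an illustration, not Polypolish's actual code: the function names are hypothetical, and splitting depth as 1/n for a read with n placements is my generalisation of the 0.5-per-place example given for two placements.

```python
def threshold_at(depth, min_depth=5, min_fraction=0.5):
    """Threshold depth at one position: the larger of the fixed minimum
    and the given fraction of the depth at that position."""
    return max(min_depth, min_fraction * depth)

def add_read(depth, starts, read_len):
    """Add one read's contribution to the per-position depth list.
    A read placed at n locations adds 1/n to each (assumption for n > 2)."""
    share = 1.0 / len(starts)
    for start in starts:
        for i in range(start, start + read_len):
            depth[i] += share

depth = [0.0] * 12
add_read(depth, [0], 4)     # uniquely placed read covering positions 0-3
add_read(depth, [2, 8], 4)  # read placed twice: positions 2-5 and 8-11

thresholds = [threshold_at(d) for d in depth]
```

At this toy depth every threshold is 5, matching the note that low-depth examples mostly sit at the minimum. At a depth of 40, `threshold_at(40)` gives 20.0, i.e. half the depth.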
At each position of the draft, a valid sequence is one which occurs more times than the threshold depth. Polypolish will change positions in the draft genome where both of the following are true:
- there is one and only one valid sequence
- that valid sequence differs from the draft sequence
For positions with no valid read sequences (e.g. due to low read depth), Polypolish has no information with which to change things and will therefore make no changes. In our example, this has happened at a few positions, mostly at the start/end of the draft. For positions with multiple valid read sequences, Polypolish will make no changes because it doesn't know which of the alternative read sequences to use. This hasn't occurred in our example here; it would be most common when the sequence has an inexact repeat (which the example sequence doesn't have).
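The decision rule for a single draft position can be written as a short Python sketch. This is a simplified illustration under my own assumptions rather than Polypolish's implementation: `polish_position` is a hypothetical name, the pileup is just a list of read sequences at that position, a deletion is represented as `-` and an insertion as a multi-base string such as `CT`.

```python
from collections import Counter

def polish_position(draft_seq, read_seqs, threshold):
    """Return the sequence to use at one draft position.

    A read sequence is valid if it occurs more times than the threshold.
    The draft is changed only when exactly one valid sequence exists and
    it differs from the draft; otherwise the draft is kept as-is."""
    counts = Counter(read_seqs)
    valid = [seq for seq, n in counts.items() if n > threshold]
    if len(valid) == 1 and valid[0] != draft_seq:
        return valid[0]
    return draft_seq

# One valid sequence that differs from the draft: change it.
fixed = polish_position("A", ["G"] * 7 + ["A"], threshold=5)

# No valid sequence (too few reads): leave the draft alone.
too_shallow = polish_position("A", ["G"] * 3, threshold=5)

# Two valid sequences (ambiguous): leave the draft alone.
ambiguous = polish_position("A", ["G"] * 6 + ["T"] * 6, threshold=5)

# An insertion in the reads relative to the draft.
inserted = polish_position("C", ["CT"] * 6, threshold=5)
```

Here `fixed` is `"G"`, `too_shallow` and `ambiguous` both stay `"A"`, and `inserted` becomes `"CT"`. Every path that lacks a single clear alternative returns the draft unchanged.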
In our example, changes occur at three positions:
These changes correspond to the three errors in the draft, so the resulting genome is error free!
Since Polypolish only makes a change when that change is strongly indicated, it is unlikely to introduce an error into the draft genome. So you can be reasonably sure that the output of Polypolish is no worse than the input. It is a 'do no harm' polishing strategy (a term I learned from this paper).