-
Notifications
You must be signed in to change notification settings - Fork 10
Toy example
This page will show how Polypolish works using a small and simple example. The genome in this example is only 100-bp long and has a 20-bp two-copy repeat:
Here is the assembly which contains three errors (that we aim to fix):
And here are the error-free short reads which we will use to polish the assembly:
While this example is tiny, it can still be run in Polypolish, so download the files here if you want to try for yourself: assembly.fasta
, reads.fastq
, alignments.sam
Aligning reads in the normal way (i.e. using default aligner settings) involves putting each read in its single best position. If there are multiple equally good positions, then one is chosen at random. This would result in alignments that look like this:
This illustrates the problem with that alignment strategy: the error in the first copy of the repeat has caused reads from that part of the genome to align to the second copy of the repeat. This has left no reads aligning over the error.
If we instead align each read to all possible locations, we get alignments that look like this:
I've coloured the reads which align to multiple locations in red. Now we have reads aligning over the error in the repeat, and these are the kind of alignments that Polypolish was built to take.
Using an all-alignments input, Polypolish tallies up the read sequences at each position of the assembly to make a pileup. You can visualise the process as taking the alignment (drawn above) and squashing down each column to remove the empty space:
Some things to note:
- Each base of the assembly usually corresponds to a single base from the reads. However, a deletion in the reads relative to the assembly results in no base (shown as
-
) and an insertion in the reads relative to the assembly results in multiple bases at one position (a squishedCT
in this example). - The shaded green area represents the read depth at each position of the assembly. In non-repetitive regions (where reads align to only one position), the read depth is equivalent to the number of alignments. But for repetitive regions, the read depth is less than the number of alignments. This is because reads with multiple alignments only contribute fractional depth to each of their locations. E.g. when a read aligns to two possible places, it adds 0.5 depth to each of those places.
- The dotted red lines represent the threshold depths (used in the next step and explained more there). For the invalid threshold depth (bottom dotted red line), Polypolish sets this to 20% of the read depth (adjustable with
--fraction_invalid
). For the valid threshold depth (top dotted red line), Polypolish sets this to either 5 (adjustable with--min_depth
) or half the read depth (adjustable with--fraction_valid
), whichever is larger. This toy example has low read depth, so the valid threshold depth is the minimum (5) for many positions of the assembly. In a real situation (with higher read depth), the valid threshold depth would be half the read depth for most positions of the assembly. - You might notice that the pileup seems to be missing some sequences which were in the alignment. E.g. a few reads aligned all the way to the end of the assembly, but in the pileup there are two bases at the end with no sequence. This is due to Polypolish's alignment trimming logic.
Polypolish will assess these conditions for each position in the pileup:
- There is one and only valid sequence. A valid sequence is defined as as sequence in the pileup which occurs more times than the valid threshold depth (controlled by
--min_depth
and--fraction_valid
). - Any other sequences are invalid. An invalid sequence occurs fewer times than the invalid threshold depth (controlled by
--fraction_invalid
). - The read depth is sufficiently high (controlled by
--min_depth
). - The one valid sequence differs from the assembly base.
The possible outcomes are:
- If all four conditions are true, that indicates an error in need of repair. Polypolish will change that position of the assembly to the pileup's valid sequence.
- If conditions 1–3 are true but condition 4 is false, that simply means that the pileup agrees with the assembly and no change is needed. This should be the outcome for most positions in the assembly.
- If any of conditions 1–3 are false, then there something unclear about this position: no valid sequence, multiple valid sequences, one or more sequences falling between validity and invalidity, or low read depth. In this case, Polypolish plays it safe and does not make a change.
In our example, repairs occur at three positions:
These changes correspond to the three errors in the assembly, so the resulting genome is error free!
Since Polypolish only makes a change in unambiguous cases, it is unlikely to introduce an error into the assembly genome. So you can be reasonably sure that the output of Polypolish is at least no worse than the input. It is a 'do no harm' polishing strategy (a term I'm borrowing from this paper).
If you are interested in per-base polishing information, you can use Polypolish's --debug
option to save a tab-delimited table to file. For each position of the assembly, you can see the assembly base, depth, threshold depth, read sequence pileup, status and new base.
The status can be:
-
low_depth
: depth is below--min_depth
-
none
: no valid options -
multiple
: more than one valid option -
too_close
: one or more options are neither valid nor invalid -
kept
: one valid option which matches the assembly base -
changed
: one valid option which differs from the assembly base
Here is what this file looks like for our toy example:
name pos base depth invalid valid pileup status new_base
draft 0 G 1.0 0 5 Gx1 low_depth G
draft 1 A 2.0 0 5 Ax2 low_depth A
draft 2 C 2.0 0 5 Cx2 low_depth C
draft 3 T 2.0 0 5 Tx2 low_depth T
draft 4 G 3.0 1 5 Gx3 low_depth G
draft 5 T 4.0 1 5 Tx4 low_depth T
draft 6 T 4.0 1 5 Tx4 low_depth T
draft 7 C 4.0 1 5 Cx4 low_depth C
draft 8 A 5.0 1 5 Ax5 kept A
draft 9 A 5.0 1 5 Ax5 kept A
draft 10 C 6.0 1 5 Cx6 kept C
draft 11 G 9.0 2 5 Gx9 kept G
draft 12 C 11.0 2 6 Cx11 kept C
draft 13 G 11.0 2 6 Gx11 kept G
draft 14 A 12.0 2 6 Ax13 kept A
draft 15 T 11.5 2 6 Tx13 kept T
draft 16 C 10.5 2 5 -x2,Cx10 too_close C
draft 17 G 11.5 2 6 -x2,Gx11,Tx2 too_close G
draft 18 C 12.5 2 6 Cx18 kept C
draft 19 T 10.0 2 5 Tx16 kept T
draft 20 G 7.5 2 5 Gx14 kept G
draft 21 T 7.5 2 5 Tx15 kept T
draft 22 A 7.0 1 5 Ax14 kept A
draft 23 T 8.0 2 5 Tx16 kept T
draft 24 T 8.5 2 5 Tx17 kept T
draft 25 C 7.5 2 5 Cx15 kept C
draft 26 A 10.5 2 5 Ax21 kept A
draft 27 G 9.5 2 5 Cx19 changed C
draft 28 C 9.0 2 5 Cx18 kept C
draft 29 A 9.5 2 5 Ax19 kept A
draft 30 A 10.0 2 5 Ax20 kept A
draft 31 G 11.0 2 6 Gx22 kept G
draft 32 G 9.5 2 5 Gx18 kept G
draft 33 T 12.5 2 6 Tx21 kept T
draft 34 T 9.0 2 5 Tx14 kept T
draft 35 C 7.0 1 5 Cx10 kept C
draft 36 T 9.0 2 5 Tx12 kept T
draft 37 T 9.5 2 5 Tx11 kept T
draft 38 C 10.0 2 5 Cx10 kept C
draft 39 G 11.0 2 6 Gx11 kept G
draft 40 G 10.0 2 5 Gx10 kept G
draft 41 C 11.0 2 6 Cx11 kept C
draft 42 C 14.0 3 7 Cx14 kept C
draft 43 T 14.0 3 7 Tx14 kept T
draft 44 T 12.0 2 6 Tx12 kept T
draft 45 C 12.0 2 6 Cx12 kept C
draft 46 A 10.0 2 5 Ax10 kept A
draft 47 C 9.0 2 5 Cx9 kept C
draft 48 G 6.0 1 5 Gx6 kept G
draft 49 T 7.0 1 5 Tx7 kept T
draft 50 A 4.0 1 5 Ax4 low_depth A
draft 51 C 6.0 1 5 CTx6 changed CT
draft 52 C 5.0 1 5 Cx5 kept C
draft 53 T 5.0 1 5 Tx5 kept T
draft 54 G 6.0 1 5 Gx6 kept G
draft 55 C 8.0 2 5 Cx8 kept C
draft 56 G 7.0 1 5 Gx7 kept G
draft 57 C 7.0 1 5 Cx7 kept C
draft 58 T 5.5 1 5 Tx6 kept T
draft 59 A 6.5 1 5 Ax7,Cx1 too_close A
draft 60 T 7.5 2 5 Gx3,Tx8 too_close T
draft 61 C 9.5 2 5 Cx15 kept C
draft 62 T 9.0 2 5 Tx15 kept T
draft 63 G 7.5 2 5 Gx14 kept G
draft 64 T 7.5 2 5 Tx15 kept T
draft 65 A 8.0 2 5 Ax15 kept A
draft 66 T 9.0 2 5 Tx17 kept T
draft 67 T 9.5 2 5 Tx18 kept T
draft 68 C 8.5 2 5 Cx16 kept C
draft 69 A 9.5 2 5 Ax18 kept A
draft 70 C 10.5 2 5 Cx20 kept C
draft 71 C 10.0 2 5 Cx19 kept C
draft 72 A 10.5 2 5 Ax20 kept A
draft 73 A 10.0 2 5 Ax20 kept A
draft 74 G 11.0 2 6 Gx22 kept G
draft 75 G 9.5 2 5 Gx18 kept G
draft 76 T 11.5 2 6 Tx20 kept T
draft 77 T 11.0 2 6 Tx16 kept T
draft 78 C 10.0 2 5 Cx13 kept C
draft 79 T 11.0 2 6 Tx14 kept T
draft 80 T 11.0 2 6 Tx13 kept T
draft 81 A 9.5 2 5 Ax10 kept A
draft 82 A 9.0 2 5 Ax9 kept A
draft 83 C 9.0 2 5 Cx9 kept C
draft 84 T 10.0 2 5 Tx10 kept T
draft 85 A 8.0 2 5 -x8 changed -
draft 86 C 5.0 1 5 Cx5 kept C
draft 87 G 5.0 1 5 Gx5 kept G
draft 88 T 4.0 1 5 Tx4 low_depth T
draft 89 G 3.0 1 5 Gx3 low_depth G
draft 90 T 7.0 1 5 Tx7 kept T
draft 91 G 7.0 1 5 Gx7 kept G
draft 92 G 6.0 1 5 Gx6 kept G
draft 93 A 5.0 1 5 Ax5 kept A
draft 94 G 5.0 1 5 Gx5 kept G
draft 95 C 4.0 1 5 Cx4 low_depth C
draft 96 T 4.0 1 5 Tx4 low_depth T
draft 97 G 4.0 1 5 Gx4 low_depth G
draft 98 C 0.0 0 5 low_depth C
draft 99 G 0.0 0 5 low_depth G