Skip to content

Toy example

Ryan Wick edited this page Jul 26, 2021 · 20 revisions

This page will show how Polypolish works using a small and simple example. The genome in this example is only 100-bp long and has a 20-bp two-copy repeat:

reference

Here is the assembly genome which contains three errors (that we aim to fix):

assembly

And here are the error-free short reads which we will use to polish the assembly:

reads

While this example is tiny, it can still be run in Polypolish, so download the files here if you want to try for yourself: assembly.fasta, reads.fastq, alignments.sam

Read alignment

One alignment per read (the wrong way)

Aligning reads in the normal way (i.e. using default aligner settings) involves putting each read in its single best position. If there are multiple equally good positions, then one is chosen at random. This would result in alignments that look like this:

read alignments (one per read)

This illustrates the problem with that alignment strategy: the error in the first copy of the repeat has caused reads from that part of the genome to align to the second copy of the repeat. This has left no reads aligning over the error.

All alignments (the right way)

If we instead align each read to all possible locations, we get alignments that look like this:

read alignments (all per read)

I've coloured the reads which align to multiple locations in red. Now we have reads aligning over the error in the repeat, and these are the kind of alignments that Polypolish was built to take.

Building a pileup

Using an all-alignments input, Polypolish tallies up the read sequences at each position of the assembly to make a pileup. You can visualise the process as taking the alignment (drawn above) and squashing down each column to remove the empty space:

pileup

Some things to note:

  • Each base of the assembly usually corresponds to a single base from the reads. However, a deletion in the reads relative to the assembly results in no base (shown as -) and an insertion in the reads relative to the assembly results in multiple bases at one position (a squished CT in this example).
  • The shaded green area represents the read depth at each position of the assembly. In non-repetitive regions (where reads align to only one position), the read depth is equivalent to the number of alignments. But for repetitive regions, the read depth is less than the number of alignments. This is because reads with multiple alignments only contribute fractional depth to each of their locations. E.g. when a read aligns to two possible places, it adds 0.5 depth to each of those places.
  • The dotted red line represents the threshold depth (used in the next step). Polypolish sets this at each position to either 5 (adjustable with --min_depth) or half the read depth (adjustable with --min_fraction), whichever is larger. This toy example has low read depth, so the threshold depth is the minimum (5) for many positions of the assembly. In a real situation (with higher read depth), the threshold depth would be half the read depth for most positions of the assembly.
  • You might notice that the pileup seems to be missing some sequences which were in the alignment. E.g. a few reads aligned all the way to the end of the assembly, but in the pileup there are two bases at the end with no sequence. This is due to Polypolish's alignment trimming logic.

Fixing errors

At each position of the assembly, a valid sequence is one which occurs more times than the threshold depth. For any position in the assembly with one and only one valid sequence, Polypolish will set that assembly position to the valid sequence. If the valid sequence and the assembly base differ, this will result in a change to the assembly (the repair of an error).

In our example, changes occur at three positions:

fixes

These changes correspond to the three errors in the assembly, so the resulting genome is error free!

For positions with no valid sequences (e.g. due to low read depth), Polypolish will make no changes. In our example, this has happened at a few positions, mostly at the start/end of the assembly. For positions with multiple valid read sequences, then Polypolish will make no changes because it doesn't know which of the alternatives to use. This hasn't occurred in our toy example here – it would be most common when the sequence has an inexact repeat. Since Polypolish only makes a change in unambiguous cases, it is unlikely to introduce an error into the assembly genome. So you can be reasonably sure that the output of Polypolish is at least no worse than the input. It is a 'do no harm' polishing strategy (a term I'm borrowing from this paper).

Debug output

If you are interested in per-base polishing information, you can use Polypolish's --debug option to save a tab-delimited table to file. For each position of the assembly, you can see the assembly base, depth, threshold depth, read sequence pileup, status and new base. The status can be none (no valid options), multiple (more than one valid option), kept (one valid option which matches the assembly base) and changed (one valid option which differs from the assembly base).

Here is what this file looks like for our toy example:

name   pos  base  depth  threshold  pileup        status   new_base
draft  0    G     1.0    5          Gx1           none     G
draft  1    A     2.0    5          Ax2           none     A
draft  2    C     2.0    5          Cx2           none     C
draft  3    T     2.0    5          Tx2           none     T
draft  4    G     3.0    5          Gx3           none     G
draft  5    T     4.0    5          Tx4           none     T
draft  6    T     4.0    5          Tx4           none     T
draft  7    C     4.0    5          Cx4           none     C
draft  8    A     5.0    5          Ax5           kept     A
draft  9    A     5.0    5          Ax5           kept     A
draft  10   C     6.0    5          Cx6           kept     C
draft  11   G     9.0    5          Gx9           kept     G
draft  12   C     11.0   6          Cx11          kept     C
draft  13   G     11.0   6          Gx11          kept     G
draft  14   A     12.0   6          Ax13          kept     A
draft  15   T     11.5   6          Tx13          kept     T
draft  16   C     10.5   5          Cx10,-x2      kept     C
draft  17   G     11.5   6          Tx2,Gx11,-x2  kept     G
draft  18   C     12.5   6          Cx18          kept     C
draft  19   T     10.0   5          Tx16          kept     T
draft  20   G     7.5    5          Gx14          kept     G
draft  21   T     7.5    5          Tx15          kept     T
draft  22   A     7.0    5          Ax14          kept     A
draft  23   T     8.0    5          Tx16          kept     T
draft  24   T     8.5    5          Tx17          kept     T
draft  25   C     7.5    5          Cx15          kept     C
draft  26   A     10.5   5          Ax21          kept     A
draft  27   G     9.5    5          Cx19          changed  C
draft  28   C     9.0    5          Cx18          kept     C
draft  29   A     9.5    5          Ax19          kept     A
draft  30   A     10.0   5          Ax20          kept     A
draft  31   G     11.0   6          Gx22          kept     G
draft  32   G     9.5    5          Gx18          kept     G
draft  33   T     12.5   6          Tx21          kept     T
draft  34   T     9.0    5          Tx14          kept     T
draft  35   C     7.0    5          Cx10          kept     C
draft  36   T     9.0    5          Tx12          kept     T
draft  37   T     9.5    5          Tx11          kept     T
draft  38   C     10.0   5          Cx10          kept     C
draft  39   G     11.0   6          Gx11          kept     G
draft  40   G     10.0   5          Gx10          kept     G
draft  41   C     11.0   6          Cx11          kept     C
draft  42   C     14.0   7          Cx14          kept     C
draft  43   T     14.0   7          Tx14          kept     T
draft  44   T     12.0   6          Tx12          kept     T
draft  45   C     12.0   6          Cx12          kept     C
draft  46   A     10.0   5          Ax10          kept     A
draft  47   C     9.0    5          Cx9           kept     C
draft  48   G     6.0    5          Gx6           kept     G
draft  49   T     7.0    5          Tx7           kept     T
draft  50   A     4.0    5          Ax4           none     A
draft  51   C     6.0    5          CTx6          changed  CT
draft  52   C     5.0    5          Cx5           kept     C
draft  53   T     5.0    5          Tx5           kept     T
draft  54   G     6.0    5          Gx6           kept     G
draft  55   C     8.0    5          Cx8           kept     C
draft  56   G     7.0    5          Gx7           kept     G
draft  57   C     7.0    5          Cx7           kept     C
draft  58   T     5.5    5          Tx6           kept     T
draft  59   A     6.5    5          Ax7,Cx1       kept     A
draft  60   T     7.5    5          Tx8,Gx3       kept     T
draft  61   C     9.5    5          Cx15          kept     C
draft  62   T     9.0    5          Tx15          kept     T
draft  63   G     7.5    5          Gx14          kept     G
draft  64   T     7.5    5          Tx15          kept     T
draft  65   A     8.0    5          Ax15          kept     A
draft  66   T     9.0    5          Tx17          kept     T
draft  67   T     9.5    5          Tx18          kept     T
draft  68   C     8.5    5          Cx16          kept     C
draft  69   A     9.5    5          Ax18          kept     A
draft  70   C     10.5   5          Cx20          kept     C
draft  71   C     10.0   5          Cx19          kept     C
draft  72   A     10.5   5          Ax20          kept     A
draft  73   A     10.0   5          Ax20          kept     A
draft  74   G     11.0   6          Gx22          kept     G
draft  75   G     9.5    5          Gx18          kept     G
draft  76   T     11.5   6          Tx20          kept     T
draft  77   T     11.0   6          Tx16          kept     T
draft  78   C     10.0   5          Cx13          kept     C
draft  79   T     11.0   6          Tx14          kept     T
draft  80   T     11.0   6          Tx13          kept     T
draft  81   A     9.5    5          Ax10          kept     A
draft  82   A     9.0    5          Ax9           kept     A
draft  83   C     9.0    5          Cx9           kept     C
draft  84   T     10.0   5          Tx10          kept     T
draft  85   A     8.0    5          -x8           changed  -
draft  86   C     5.0    5          Cx5           kept     C
draft  87   G     5.0    5          Gx5           kept     G
draft  88   T     4.0    5          Tx4           none     T
draft  89   G     3.0    5          Gx3           none     G
draft  90   T     7.0    5          Tx7           kept     T
draft  91   G     7.0    5          Gx7           kept     G
draft  92   G     6.0    5          Gx6           kept     G
draft  93   A     5.0    5          Ax5           kept     A
draft  94   G     5.0    5          Gx5           kept     G
draft  95   C     4.0    5          Cx4           none     C
draft  96   T     4.0    5          Tx4           none     T
draft  97   G     4.0    5          Gx4           none     G
draft  98   C     0.0    5                        none     C
draft  99   G     0.0    5                        none     G