Skip to content

FAQ and miscellaneous tips

Ryan Wick edited this page Sep 11, 2024 · 40 revisions

Table of contents


Polypolish performance

Polypolish is quick and efficient, at least on the bacterial genomes I've tested it on:

  • A small and simple bacterial genome should take less than a minute to polish and use less than a gigabyte of RAM.
  • A big or complex (i.e. repeat rich) bacterial genome can take a few minutes to polish and use a few gigabytes of RAM.

Polypolish itself is single-threaded, but BWA-MEM parallelises well, so use as many threads as you have available when preparing the alignments for Polypolish.

I haven't tested Polypolish on eukaryote genomes, so I'm unsure of its performance in that context (see this question).


Adjusting Polypolish options

Polypolish is designed to be conservative in its error correction, i.e. it only corrects errors when there is strong evidence to do so. This means it is more likely to suffer from false negatives (failing to correct an error) than false positions (introducing an error).

Polypolish's default options are:

--min_depth 5 --fraction_invalid 0.2 --fraction_valid 0.5

If you want Polypolish to be less conservative (i.e. more willing to fix errors but with an increased risk of introducing errors), you could use these options:

--min_depth 3 --fraction_invalid 0.3 --fraction_valid 0.4

If you want Polypolish to be more conservative (i.e. only fixing errors when the evidence is very strong), you could use these options:

--min_depth 15 --fraction_invalid 0.1 --fraction_valid 0.6 --careful

See the Toy example page for a deeper explanation of how these options work.


Does Polypolish work on eukaryote genomes?

Polypolish was designed with bacterial genomes in mind, but it should also work on small haploid eukaryote genomes. It's probably not appropriate for large and/or diploid eukaryote genomes. Nothing about Polypolish's algorithm is in principle tuned to bacterial genomes. However, Polypolish requires that you align each short read to all possible locations, and for a repeat-rich eukaryote genome, this could result in a lot of alignments. So there may be practical limitations.

To illustrate the problem, consider two bacterial genomes I have used with Polypolish: Bacillus subtilis NC_000964.3 and Bordetella pertussis NC_002929.2. About 2% of the Bacillus genome is repetitive, and my 1.4 million reads for that genome result in 1.6 million alignments. Most reads only align to a single place, so there aren't that many more alignments than reads. The Bordetella genome is a similar size, but it is 9% repetitive due to hundreds of copies of IS481. For that genome, 1.4 million reads result in a whopping 13.6 million alignments! Eukaryote genomes can be 50% or more repetitive, so I shudder to think how many alignments they might generate.

I've successfully tried Polypolish on a Drosophila genome, but it was pretty slow (took ~6 hours). If you try it on a bigger genome, let me know how well (or not well) it worked.

Also, Richard Wheeler wrote a tool named polyalign which may help with some of Polypolish's memory usage problems with eukaryote genomes (see issue #25).


Does Polypolish work on metagenomes?

Short answer: probably! While designed with isolates in mind, Polypolish is conservative (unlikely to introduce errors) and should for the most part work well on long-read metagenome assemblies too.

I can, however, think of one particular case where Polypolish could introduce an error into a metagenome: when you've got a very similar sequence shared between a high-depth genome and a low-depth genome. E.g. genome A has 2000× read depth and genome B has 20× read depth, and both share some sequence at high identity. In that case, there's a risk that Polypolish could change the shared sequence in genome B to look like the shared sequence in genome A. For this reason, I recommend using the --careful option when polishing a metagenome.

It's also worth pointing out that Polypolish was made with completed genome assemblies in mind: its input should ideally be one-contig-per-replicon. Some metagenome assemblies get pretty messy, especially when you have a mixture of closely-related genomes. I don't know how well Polypolish would perform on a highly-fragmented metagenome assembly, so interpret any results with caution.


Can I align my paired-end reads in a single BWA-MEM command?

The How to run Polypolish page instructs you to align paired-end reads into separate SAM files like this:

bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam

You might be tempted to combine those into a single command:

bwa mem -t 16 -a draft.fasta reads_1.fastq.gz reads_2.fastq.gz > alignments.sam

DON'T DO THIS! While that BWA-MEM command will run successfully, it will not make a SAM file appropriate for use with Polypolish. BWA-MEM's -a option (which Polypolish relies on to polish repeat regions) has no effect when used on paired read files, so the combined command will only have a single alignment for each read.


Does the order of alignments in the SAM file matter?

The order of the SAM alignments will not change Polypolish's output (the polished genome sequence) but it can affect whether or not Polypolish will successfully run.

Polypolish assumes that all of the alignments for each read are grouped together on adjacent lines in the SAM file. This is how BWA-MEM outputs its SAM files, so it shouldn't be a problem. But if you've sorted your alignments using samtools sort, they may not work with Polypolish.

When BWA-MEM is run in all-alignments mode (using the -a option, as you should do for Polypolish alignments), it does not include the read sequence on every line. The primary alignment for each read will contain the sequence, but secondary alignments will only contain a * to save space. If the alignments for each read are not grouped together, Polypolish will be unable to get the read sequence for secondary alignments and will quit with an error like this:

Error: no alignments for read NS500764:85:H3J5TBGXF:2:12111:13314:18114 contain sequence

If your SAM files have gotten out of order, you can use the samtools 'sort by read name' option (-n) to make them compatible with Polypolish:

samtools sort -n -O sam alignments.sam > alignments_sorted.sam

Assuming your SAM files meet the above requirement (all alignments for each read are grouped on adjacent lines), then the order of the SAM file does not matter. E.g. if you reverse the order of lines in your SAM file with tac, Polypolish will still run and it will produce identical output.


Can I use Bwa-mem2 instead of BWA-MEM?

Yes! Bwa-mem2 is a faster implementation of BWA-MEM. It produces nearly identical alignments to BWA-MEM, so its alignments are definitely appropriate for use with Polypolish.


Can I use Bowtie2 instead of BWA-MEM?

Bowtie2 is another popular short-read aligner, and like BWA-MEM, it has an option (-a) to align each read to all possible locations. So yes, you can use it to generate alignments for Polypolish. However, I used BWA-MEM when developing and testing Polypolish, and I've only briefly tried using Bowtie2. So BWA-MEM is probably the safer choice.

Example alignment commands with Bowtie2 might look something like this:

bowtie2-build draft.fasta draft.fasta
bowtie2 -a -p 16 -x draft.fasta -U reads_1.fastq.gz > alignments_1.sam
bowtie2 -a -p 16 -x draft.fasta -U reads_2.fastq.gz > alignments_2.sam

Paired-end reads usually have suffixes on read names (/1 and /2). BWA-MEM removes these when making the SAM file (so first-in-pair and second-in-pair reads have the same name) but Bowtie2 does not. So in order to run Polypolish's insert filter, you may have to remove these suffixes like this:

sed -i 's|/1\t|\t|' alignments_1.sam
sed -i 's|/2\t|\t|' alignments_2.sam

Should I run multiple rounds of Polypolish polishing?

You probably don't need to bother – Polypolish doesn't usually make changes after the first polishing round. But since Polypolish is unlikely to introduce an error into your assembly, you're welcome to try! Just don't be surprised if subsequent rounds of Polypolish don't do anything.


Why does Polypolish only use end-to-end alignments?

You might have noticed that when loading alignments, Polypolish discards any which are not end-to-end. I.e. any alignments which are clipped on either end aren't included in the pileup. Here's an example from Polypolish output:

alignments_1.sam: 1,561,434 alignments from 1,405,200 reads
alignments_2.sam: 1,561,193 alignments from 1,405,200 reads

Filtering for high-quality end-to-end alignments:
  3,082,475 alignments kept
  40,152 alignments discarded

Assuming that your long-read assembly and your short reads came from the same genome (as they should), then I can think of three main reasons for clipped alignments. The first would be a significant structural error in the assembly, which Polypolish is not designed to fix (other polishing tools like Pilon can do a better job with this kind of error). The second would be alignments at the start-end of a circular contig.

The third cause of clipped alignments would be from reads which are partially in a repeat. For example, if a read was half in a two-copy repeat (the boundary of the repeat was in the middle of the read), then we might expect two alignments: one where the read fully aligned end-to-end in its true location and one where half the read was clipped in the alignment, like this:

alignments:           TCTTTATTATTA                      ------TTATTA
assembly:   AGAGATTCGATCTTTATTATTATGCGGAATTCTGGTTGCCTCAAGGAAGCTTATTATGCGGAATAGAACCGTCCG
                            |   repeat   |                    |   repeat   |

In cases like this, clipped alignments represent incorrectly placed reads, so Polypolish discards them. This helps to reduce extraneous bases in the pileup and should improve Polypolish's ability to fix errors near the ends of repeats.


Can Polypolish add/remove bases at the start/end of a sequence?

No, it cannot. Polypolish will only fix errors that are in the middle of your sequence, not errors right at the ends. For example, if your sequence was missing a few bases at its end, Polypolish will not add them back in. For a specific example, take a look at this issue where Devon O'Rourke set one up.

Adding/removing bases at the start/end of a contig gets tricky for circular sequences, and most bacterial sequences are circular. So I would recommend that you fix up the ends of your contigs before polishing with Polypolish, e.g. use Trycycler which gives clean circularisation for bacterial genomes.


Does Polypolish change contig names?

As of v0.6.0, Polypolish will not change the names of contigs. It will, however, add polypolish to the contig's description.

For example, given this sequence as input:

>chromosome circular=true
ATGAATATAAAAGATTTTTTACTTGAGTTTAAAACTGAAA...

It will produce this output sequence:

>chromosome circular=true polypolish
ATGAATATAAAAGATTTTTTACTTGAGTTTAAAACTGAAA...

If you don't want polypolish added to the sequence descriptions, you can pipe Polypolish through sed:

polypolish polish draft.fasta filtered_1.sam filtered_2.sam | sed 's/ polypolish//' > polished.fasta

For how Polypolish used to behave prior to v0.6.0, see issue #7.


Why are polypolish filter and polypolish polish separate?

In How to run Polypolish, you can see that two Polypolish commands are needed: polypolish filter and polypolish polish. You might wonder why I didn't combine these together so Polypolish is simpler to run?

This is because the polypolish filter command only applies to paired-end read sets, so it isn't needed for unpaired reads. Also, if you have multiple different paired-end read sets (e.g. you sequenced your isolate multiple times), then you must run polypolish filter separately for each of the paired-end sets before giving all filtered alignments to polypolish polish.


Saving Polypolish's log to file

Polypolish outputs human-readable information to stderr but doesn't create a log file. If you'd like to save the log output to a file while still seeing it in the terminal, you can use the tee command as follows:

polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam 2> >(tee polypolish.log)
polypolish polish draft.fasta filtered_1.sam filtered_2.sam 1> polished.fasta 2> >(tee -a polypolish.log)

The first command creates a new log file named polypolish.log, and the second command appends to this file.

By default, this log file will include ANSI escape codes for terminal formatting (e.g., colors and bold text). If you'd prefer a 'clean' log without formatting, use the following commands to strip out the escape codes:

polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam 2> >(tee >(sed 's/\x1b\[[0-9;]*m//g' > polypolish.log))
polypolish polish draft.fasta filtered_1.sam filtered_2.sam 1> polished.fasta 2> >(tee >(sed 's/\x1b\[[0-9;]*m//g' >> polypolish.log))