FAQ and miscellaneous tips
- Polypolish performance
- Adjusting Polypolish options
- Does Polypolish work on eukaryote genomes?
- Does Polypolish work on metagenomes?
- Can I align my paired-end reads in a single BWA-MEM command?
- Does the order of alignments in the SAM file matter?
- Can I use Bwa-mem2 instead of BWA-MEM?
- Can I use Bowtie2 instead of BWA-MEM?
- Should I run multiple rounds of Polypolish polishing?
- Why does Polypolish only use end-to-end alignments?
- Can Polypolish add/remove bases at the start/end of a sequence?
- Does Polypolish change contig names?
- Why are polypolish filter and polypolish polish separate?
- Saving Polypolish's log to file
Polypolish performance
Polypolish is quick and efficient, at least on the bacterial genomes I've tested it on:
- A small and simple bacterial genome should take less than a minute to polish and use less than a gigabyte of RAM.
- A big or complex (i.e. repeat rich) bacterial genome can take a few minutes to polish and use a few gigabytes of RAM.
Polypolish itself is single-threaded, but BWA-MEM parallelises well, so use as many threads as you have available when preparing the alignments for Polypolish.
I haven't tested Polypolish on eukaryote genomes, so I'm unsure of its performance in that context (see the question "Does Polypolish work on eukaryote genomes?" on this page).
Adjusting Polypolish options
Polypolish is designed to be conservative in its error correction, i.e. it only corrects errors when there is strong evidence to do so. This means it is more likely to suffer from false negatives (failing to correct an error) than false positives (introducing an error).
Polypolish's default options are:
--min_depth 5 --fraction_invalid 0.2 --fraction_valid 0.5
If you want Polypolish to be less conservative (i.e. more willing to fix errors but with an increased risk of introducing errors), you could use these options:
--min_depth 3 --fraction_invalid 0.3 --fraction_valid 0.4
If you want Polypolish to be more conservative (i.e. only fixing errors when the evidence is very strong), you could use these options:
--min_depth 15 --fraction_invalid 0.1 --fraction_valid 0.6 --careful
See the Toy example page for a deeper explanation of how these options work.
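As a convenience, the three option sets above can be wrapped in a small shell helper. This is only a sketch: the function name polypolish_preset and the preset labels (default, relaxed, strict) are made up for illustration and are not part of Polypolish. Only the option values are copied from the text above, and the final line is a dry run that prints a command rather than running it.

```shell
# Hypothetical helper: print one of the option sets listed above.
# The preset names are illustrative; only the option values come from this page.
polypolish_preset() {
    case "$1" in
        default) echo "--min_depth 5 --fraction_invalid 0.2 --fraction_valid 0.5" ;;
        relaxed) echo "--min_depth 3 --fraction_invalid 0.3 --fraction_valid 0.4" ;;
        strict)  echo "--min_depth 15 --fraction_invalid 0.1 --fraction_valid 0.6 --careful" ;;
        *)       echo "unknown preset: $1" >&2; return 1 ;;
    esac
}

# Dry run: show the command that would be used, without running Polypolish.
echo "polypolish polish $(polypolish_preset strict) draft.fasta filtered_1.sam filtered_2.sam"
```

Keeping the presets in one place makes it easier to record which settings were used for a given assembly.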
Does Polypolish work on eukaryote genomes?
Polypolish was designed with bacterial genomes in mind, but it should also work on small haploid eukaryote genomes. It's probably not appropriate for large and/or diploid eukaryote genomes. Nothing about Polypolish's algorithm is in principle tuned to bacterial genomes. However, Polypolish requires that you align each short read to all possible locations, and for a repeat-rich eukaryote genome, this could result in a lot of alignments. So there may be practical limitations.
To illustrate the problem, consider two bacterial genomes I have used with Polypolish: Bacillus subtilis NC_000964.3 and Bordetella pertussis NC_002929.2. About 2% of the Bacillus genome is repetitive, and my 1.4 million reads for that genome result in 1.6 million alignments. Most reads only align to a single place, so there aren't that many more alignments than reads. The Bordetella genome is a similar size, but it is 9% repetitive due to hundreds of copies of IS481. For that genome, 1.4 million reads result in a whopping 13.6 million alignments! Eukaryote genomes can be 50% or more repetitive, so I shudder to think how many alignments they might generate.
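In the numbers above, the Bacillus alignments work out to roughly 1.1 alignments per read (1.6 million / 1.4 million) and the Bordetella alignments to roughly 9.7 (13.6 million / 1.4 million). To get the same ratio for your own data, a rough sketch along these lines may help. It assumes a standard SAM layout (header lines start with @, read name in the first column); the function name and the toy records are made up for illustration.

```shell
# Print "<alignment lines> <distinct reads> <lines per read>" for SAM input.
# Every non-header line is counted, including any unmapped reads.
sam_alignments_per_read() {
    awk -F'\t' '
        /^@/ { next }                                   # skip header lines
        { n++; if (!($1 in seen)) { seen[$1] = 1; reads++ } }
        END { if (reads > 0) printf "%d %d %.2f\n", n, reads, n / reads }
    ' "$@"
}

# Toy SAM: read_a aligns to two places (a repeat), read_b to one.
toy_sam() {
    printf '@SQ\tSN:chr\tLN:100\n'
    printf 'read_a\t0\tchr\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
    printf 'read_a\t256\tchr\t50\t0\t4M\t*\t0\t0\t*\t*\n'
    printf 'read_b\t0\tchr\t30\t60\t4M\t*\t0\t0\tGGCC\tIIII\n'
}

toy_sam | sam_alignments_per_read   # prints: 3 2 1.50
```

On a real file you would call it as sam_alignments_per_read alignments_1.sam. Note that it keeps every read name in memory, so it is only a quick diagnostic.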
I've successfully tried Polypolish on a Drosophila genome, but it was pretty slow (took ~6 hours). If you try it on a bigger genome, let me know how well (or not well) it worked.
Also, Richard Wheeler wrote a tool named polyalign which may help with some of Polypolish's memory usage problems with eukaryote genomes (see issue #25).
Does Polypolish work on metagenomes?
Short answer: probably! While designed with isolates in mind, Polypolish is conservative (unlikely to introduce errors) and should for the most part work well on long-read metagenome assemblies too.
I can, however, think of one particular case where Polypolish could introduce an error into a metagenome: when you've got a very similar sequence shared between a high-depth genome and a low-depth genome. E.g. genome A has 2000× read depth and genome B has 20× read depth, and both share some sequence at high identity. In that case, there's a risk that Polypolish could change the shared sequence in genome B to look like the shared sequence in genome A. For this reason, I recommend using the --careful option when polishing a metagenome.
It's also worth pointing out that Polypolish was made with completed genome assemblies in mind: its input should ideally be one-contig-per-replicon. Some metagenome assemblies get pretty messy, especially when you have a mixture of closely-related genomes. I don't know how well Polypolish would perform on a highly-fragmented metagenome assembly, so interpret any results with caution.
Can I align my paired-end reads in a single BWA-MEM command?
The How to run Polypolish page instructs you to align paired-end reads into separate SAM files like this:
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz > alignments_1.sam
bwa mem -t 16 -a draft.fasta reads_2.fastq.gz > alignments_2.sam
You might be tempted to combine those into a single command:
bwa mem -t 16 -a draft.fasta reads_1.fastq.gz reads_2.fastq.gz > alignments.sam
DON'T DO THIS! While that BWA-MEM command will run successfully, it will not make a SAM file appropriate for use with Polypolish. BWA-MEM's -a option (which Polypolish relies on to polish repeat regions) has no effect when used on paired read files, so the combined command will only have a single alignment for each read.
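If you're unsure how an existing SAM file was made, one possible sanity check is the FLAG column: in the SAM format, bit 0x1 is set on records that were aligned as part of a pair. The sketch below counts such records (the function name and toy records are illustrative). For SAM files made with the two separate commands above, the count would be expected to be zero.

```shell
# Count SAM records whose FLAG (column 2) has the 0x1 "paired" bit set.
# An odd FLAG value means the bit is set.
count_paired_records() {
    awk -F'\t' '!/^@/ && $2 % 2 == 1 { n++ } END { print n + 0 }' "$@"
}

# Reads aligned one file at a time: flags such as 0, 16 and 256.
printf 'r1\t0\tchr\t1\t60\t4M\nr1\t256\tchr\t9\t0\t4M\nr2\t16\tchr\t5\t60\t4M\n' \
    | count_paired_records   # prints: 0

# Reads aligned as pairs: flags such as 99 and 147.
printf 'r1\t99\tchr\t1\t60\t4M\nr1\t147\tchr\t40\t60\t4M\n' \
    | count_paired_records   # prints: 2
```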
Does the order of alignments in the SAM file matter?
The order of the SAM alignments will not change Polypolish's output (the polished genome sequence) but it can affect whether or not Polypolish will successfully run.
Polypolish assumes that all of the alignments for each read are grouped together on adjacent lines in the SAM file. This is how BWA-MEM outputs its SAM files, so it shouldn't be a problem. But if you've sorted your alignments using samtools sort, they may not work with Polypolish.
When BWA-MEM is run in all-alignments mode (using the -a option, as you should do for Polypolish alignments), it does not include the read sequence on every line. The primary alignment for each read will contain the sequence, but secondary alignments will only contain a * to save space. If the alignments for each read are not grouped together, Polypolish will be unable to get the read sequence for secondary alignments and will quit with an error like this:
Error: no alignments for read NS500764:85:H3J5TBGXF:2:12111:13314:18114 contain sequence
If your SAM files have gotten out of order, you can use the samtools 'sort by read name' option (-n) to make them compatible with Polypolish:
samtools sort -n -O sam alignments.sam > alignments_sorted.sam
Assuming your SAM files meet the above requirement (all alignments for each read are grouped on adjacent lines), then the order of the SAM file does not matter. E.g. if you reverse the order of lines in your SAM file with tac, Polypolish will still run and it will produce identical output.
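The grouping requirement described above can be checked before running Polypolish. The following is a minimal sketch, not something Polypolish provides: the function name is made up, and it only assumes that header lines start with @ and that the read name is in the first column.

```shell
# Exit 0 if all alignments for each read sit on adjacent lines.
# Otherwise report the first read name that reappears later and exit 1.
# Every read name is held in memory, so this is a diagnostic, not a pipeline step.
sam_is_grouped() {
    awk -F'\t' '
        /^@/ { next }                       # skip header lines
        $1 != prev {
            if ($1 in seen) { print "not grouped: " $1; bad = 1; exit }
            seen[$1] = 1
            prev = $1
        }
        END { exit bad + 0 }
    ' "$@"
}

# Grouped input passes silently.
printf 'read_a\t0\nread_a\t256\nread_b\t0\n' | sam_is_grouped && echo "grouped"

# Here read_a reappears after read_b, as could happen after a coordinate sort.
printf 'read_a\t0\nread_b\t0\nread_a\t256\n' | sam_is_grouped \
    || echo "try: samtools sort -n -O sam alignments.sam > alignments_sorted.sam"
```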
Can I use Bwa-mem2 instead of BWA-MEM?
Yes! Bwa-mem2 is a faster implementation of BWA-MEM. It produces nearly identical alignments to BWA-MEM, so its alignments are definitely appropriate for use with Polypolish.
Can I use Bowtie2 instead of BWA-MEM?
Bowtie2 is another popular short-read aligner, and like BWA-MEM, it has an option (-a) to align each read to all possible locations. So yes, you can use it to generate alignments for Polypolish. However, I used BWA-MEM when developing and testing Polypolish, and I've only briefly tried using Bowtie2. So BWA-MEM is probably the safer choice.
Example alignment commands with Bowtie2 might look something like this:
bowtie2-build draft.fasta draft.fasta
bowtie2 -a -p 16 -x draft.fasta -U reads_1.fastq.gz > alignments_1.sam
bowtie2 -a -p 16 -x draft.fasta -U reads_2.fastq.gz > alignments_2.sam
Paired-end reads usually have suffixes on read names (/1 and /2). BWA-MEM removes these when making the SAM file (so first-in-pair and second-in-pair reads have the same name) but Bowtie2 does not. So in order to run Polypolish's insert filter, you may have to remove these suffixes like this:
sed -i 's|/1\t|\t|' alignments_1.sam
sed -i 's|/2\t|\t|' alignments_2.sam
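The sed commands above rely on GNU sed (both for reading \t as a tab and for -i without a backup suffix). If that is not available, an awk-based alternative is sketched below. It is an illustration only: the function name is made up, and it assumes the suffix sits at the very end of the read name in the first column.

```shell
# Remove a trailing /1 or /2 from the read name (column 1) of SAM records,
# leaving header lines and all other columns untouched.
strip_pair_suffix() {
    awk 'BEGIN { FS = OFS = "\t" }
         /^@/ { print; next }
         { sub(/\/[12]$/, "", $1); print }' "$@"
}

# Toy record: the name read7/1 becomes read7, the other columns are unchanged.
printf 'read7/1\t0\tchr\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n' | strip_pair_suffix
```

Unlike the in-place sed commands, this writes to standard output, so you would redirect it to a new file, e.g. strip_pair_suffix alignments_1.sam > alignments_1_renamed.sam (the output file name is just an example).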
Should I run multiple rounds of Polypolish polishing?
You probably don't need to bother – Polypolish doesn't usually make changes after the first polishing round. But since Polypolish is unlikely to introduce an error into your assembly, you're welcome to try! Just don't be surprised if subsequent rounds of Polypolish don't do anything.
Why does Polypolish only use end-to-end alignments?
You might have noticed that when loading alignments, Polypolish discards any which are not end-to-end. I.e. any alignments which are clipped on either end aren't included in the pileup. Here's an example from Polypolish output:
alignments_1.sam: 1,561,434 alignments from 1,405,200 reads
alignments_2.sam: 1,561,193 alignments from 1,405,200 reads
Filtering for high-quality end-to-end alignments:
3,082,475 alignments kept
40,152 alignments discarded
Assuming that your long-read assembly and your short reads came from the same genome (as they should), then I can think of three main reasons for clipped alignments. The first would be a significant structural error in the assembly, which Polypolish is not designed to fix (other polishing tools like Pilon can do a better job with this kind of error). The second would be alignments at the start-end of a circular contig.
The third cause of clipped alignments would be from reads which are partially in a repeat. For example, if a read was half in a two-copy repeat (the boundary of the repeat was in the middle of the read), then we might expect two alignments: one where the read fully aligned end-to-end in its true location and one where half the read was clipped in the alignment, like this:
alignments:           TCTTTATTATTA                      ------TTATTA
assembly:   AGAGATTCGATCTTTATTATTATGCGGAATTCTGGTTGCCTCAAGGAAGCTTATTATGCGGAATAGAACCGTCCG
                            |   repeat   |                    |   repeat   |
In cases like this, clipped alignments represent incorrectly placed reads, so Polypolish discards them. This helps to reduce extraneous bases in the pileup and should improve Polypolish's ability to fix errors near the ends of repeats.
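To see which of your own alignments are clipped, you can look at the CIGAR column: in the SAM format, S and H operations mark soft and hard clipping. The sketch below shows only that clipping test. It is not Polypolish's actual filter (which, per its log message, also selects for high-quality alignments), and the function name and toy records are made up.

```shell
# Keep header lines and records whose CIGAR (column 6) has no soft (S) or
# hard (H) clipping. Unmapped records (CIGAR "*") are dropped as well.
keep_end_to_end() {
    awk -F'\t' '/^@/ { print; next } $6 != "*" && $6 !~ /[SH]/ { print }' "$@"
}

# Toy version of the example above: the end-to-end alignment (12M) is kept,
# the half-clipped alignment in the second repeat copy (6S6M) is dropped.
{
    printf 'read_1\t0\tcontig\t11\t60\t12M\t*\t0\t0\tTCTTTATTATTA\tIIIIIIIIIIII\n'
    printf 'read_1\t256\tcontig\t51\t0\t6S6M\t*\t0\t0\t*\t*\n'
} | keep_end_to_end
```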
Can Polypolish add/remove bases at the start/end of a sequence?
No, it cannot. Polypolish will only fix errors that are in the middle of your sequence, not errors right at the ends. For example, if your sequence was missing a few bases at its end, Polypolish will not add them back in. For a specific example, take a look at this issue where Devon O'Rourke set one up.
Adding/removing bases at the start/end of a contig gets tricky for circular sequences, and most bacterial sequences are circular. So I would recommend that you fix up the ends of your contigs before polishing with Polypolish, e.g. use Trycycler which gives clean circularisation for bacterial genomes.
Does Polypolish change contig names?
As of v0.6.0, Polypolish will not change the names of contigs. It will, however, add polypolish to the contig's description.
For example, given this sequence as input:
>chromosome circular=true
ATGAATATAAAAGATTTTTTACTTGAGTTTAAAACTGAAA...
It will produce this output sequence:
>chromosome circular=true polypolish
ATGAATATAAAAGATTTTTTACTTGAGTTTAAAACTGAAA...
If you don't want polypolish added to the sequence descriptions, you can pipe Polypolish through sed:
polypolish polish draft.fasta filtered_1.sam filtered_2.sam | sed 's/ polypolish//' > polished.fasta
For how Polypolish used to behave prior to v0.6.0, see issue #7.
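A slightly stricter variant of the sed command above is sketched here: it only touches FASTA header lines and only removes polypolish when it is the last word of the description. The function name is made up for illustration.

```shell
# Remove a trailing " polypolish" from FASTA header lines only.
strip_polypolish_tag() {
    sed '/^>/ s/ polypolish$//' "$@"
}

# Toy input based on the example above.
printf '>chromosome circular=true polypolish\nATGAATATAAAAGATTTTTTACTTGAG\n' \
    | strip_polypolish_tag
# prints:
# >chromosome circular=true
# ATGAATATAAAAGATTTTTTACTTGAG
```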
Why are polypolish filter and polypolish polish separate?
In How to run Polypolish, you can see that two Polypolish commands are needed: polypolish filter and polypolish polish. You might wonder why I didn't combine these so that Polypolish is simpler to run.
This is because the polypolish filter command only applies to paired-end read sets, so it isn't needed for unpaired reads. Also, if you have multiple different paired-end read sets (e.g. you sequenced your isolate multiple times), then you must run polypolish filter separately for each of the paired-end sets before giving all filtered alignments to polypolish polish.
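For the multiple-read-set case, the workflow described above can be laid out with a small dry-run helper that prints the commands instead of running them. This is a sketch only: the function name, the set names (run_a, run_b) and the file naming scheme are hypothetical, while the polypolish filter and polypolish polish arguments follow the commands shown elsewhere on this page.

```shell
# Print (not run) one "polypolish filter" command per paired-end read set,
# then a single "polypolish polish" command that takes every filtered file.
# Set names and the <set>_1.sam / <set>_filtered_1.sam naming are hypothetical.
print_polypolish_commands() {
    filtered=""
    for name in "$@"; do
        echo "polypolish filter --in1 ${name}_1.sam --in2 ${name}_2.sam --out1 ${name}_filtered_1.sam --out2 ${name}_filtered_2.sam"
        filtered="$filtered ${name}_filtered_1.sam ${name}_filtered_2.sam"
    done
    echo "polypolish polish draft.fasta${filtered} > polished.fasta"
}

print_polypolish_commands run_a run_b
```

Unpaired read sets would skip the filter step, and their SAM files would go straight into the polypolish polish command.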
Saving Polypolish's log to file
Polypolish outputs human-readable information to stderr but doesn't create a log file. If you'd like to save the log output to a file while still seeing it in the terminal, you can use the tee command as follows:
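polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam 2> >(tee polypolish.log >&2)
polypolish polish draft.fasta filtered_1.sam filtered_2.sam 1> polished.fasta 2> >(tee -a polypolish.log >&2)
The first command creates a new log file named polypolish.log, and the second command appends to this file. The >&2 inside each process substitution makes tee echo the log to stderr rather than stdout, so in shells such as Bash, where the process substitution inherits the already-redirected stdout, the log text cannot end up in polished.fasta.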
By default, this log file will include ANSI escape codes for terminal formatting (e.g., colors and bold text). If you'd prefer a 'clean' log without formatting, use the following commands to strip out the escape codes:
polypolish filter --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam 2> >(tee >(sed 's/\x1b\[[0-9;]*m//g' > polypolish.log) >&2)
polypolish polish draft.fasta filtered_1.sam filtered_2.sam 1> polished.fasta 2> >(tee >(sed 's/\x1b\[[0-9;]*m//g' >> polypolish.log) >&2)
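If you already have a log file that contains escape codes, it can also be cleaned afterwards. The sketch below uses the same pattern as the commands above but builds the ESC character with printf, since \x1b inside a sed pattern is a GNU sed extension. The function name and the output file name are made up for illustration.

```shell
# Remove ANSI colour/style sequences (ESC [ ... m) from text.
# The ESC byte is built with printf so no sed-specific escape syntax is needed.
strip_ansi() {
    esc=$(printf '\033')
    sed "s/${esc}\[[0-9;]*m//g" "$@"
}

# Toy input: bold "Polishing" followed by plain text.
printf '\033[1mPolishing\033[0m done\n' | strip_ansi   # prints: Polishing done
```

For an existing log, this could be used as strip_ansi polypolish.log > polypolish_clean.log.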