Incompatibilities between augur tree and augur ancestral #1360
Labels: bug, please take this issue, priority: high
Current Behavior
Augur tree silently performs an s/#/_/g replacement on strain names. This leaves the FASTA alignment and the tree out of sync, as the strain names no longer match, which has downstream consequences. For instance, the following bugs arise when using the data in augur ancestral:
1. If no strain names match (e.g. they all have # in them) then augur ancestral prints a warning and exits code 0. It's not at all obvious that the reason for this mismatch is ultimately augur tree. Additionally, because it exits code 0 (without producing the output JSON), other commands may run, further obscuring the actual error.
2. If only some strains have # in their name then the commands will proceed; warnings will be printed, but these are often lost in the sea of output that a pipeline produces. augur ancestral will (by default) infer the sequences on the missing tips. The end result is that we will end up missing mutations in a dataset, or asserting that a genotype is present in the data when in fact the sequence data says otherwise. This is a much more serious bug than (1).
Here I've focused on augur ancestral, but the underlying strain name manipulation will have consequences for any command which reads the tree + alignment.
Expected behavior
A series of augur commands should work together; we shouldn't cause data to become out of sync for users.
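To make the out-of-sync state concrete, here is a minimal sketch. It assumes the renaming behaves like a plain s/#/_/g applied to each strain name. It does not call augur, and seq2 and seq3 are hypothetical names added for illustration.

```python
import re

# Strain names as they appear in the alignment (seq2 and seq3 are hypothetical).
alignment_names = ["seq#1", "seq2", "seq3"]

# Assumption: the tree ends up with names equivalent to s/#/_/g on each input name.
tree_tip_names = [re.sub("#", "_", name) for name in alignment_names]

# Tips that have no matching sequence, and sequences that have no matching tip.
tips_without_sequence = sorted(set(tree_tip_names) - set(alignment_names))
sequences_without_tip = sorted(set(alignment_names) - set(tree_tip_names))

print(tips_without_sequence)  # ['seq_1']
print(sequences_without_tip)  # ['seq#1']
```

Any command that joins the tree and the alignment by name will see seq_1 as a tip with no sequence, which is the situation described in case (2).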
Additionally, when augur ancestral encounters a problem it should error with a non-zero code. Update: a TreeTime error now causes a non-zero exit code via #1367, but this doesn't address the TreeTime warnings described below, which don't cause an exit.
How to reproduce
Steps to reproduce the behaviour in case (1) above:
a. Create sequences.fasta
b. augur tree --alignment sequences.fasta --output tree.nwk
c. augur refine --keep-root --tree tree.nwk --output-tree tree.refine.nwk
d. augur ancestral --tree tree.refine.nwk --alignment sequences.fasta --output-node-data nt_muts.json
augur ancestral prints a warning. However, the exit code is 0, and nt_muts.json is not produced.
To recreate case (2):
Repeat the above steps, but using a sequences.fasta in which only some strain names contain #. This time augur ancestral will print warnings but still proceed. Inspecting the produced nt_muts.json, we see that a sequence for seq_1 has been inferred, and it does not match the sequence of the input data seq#1.
Possible solution
1. Augur tree exits upon discovery of # in strain names, and we provide a simple command/script to modify the input data appropriately. (There may be other characters augur tree modifies too, but # is the one I've encountered.)
2. augur tree could modify the output tree file to reinstate the replaced characters.
3. Augur ancestral should exit (with a non-zero exit code) if any sequences are missing. Otherwise, how was the tree created with that strain in it?
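As a rough sketch of the first two options, the helper below scans FASTA headers for problematic characters and builds a map from each replaced name back to its original. It is not part of augur. The function names, the choice of exit code 1, and the rule that a strain name is the first whitespace-delimited token of the header are all assumptions made for illustration.

```python
import sys


def read_fasta_names(path):
    """Return strain names from a FASTA file (assumed: first token after '>')."""
    names = []
    with open(path) as handle:
        for line in handle:
            if line.startswith(">"):
                header = line[1:].strip()
                names.append(header.split()[0] if header else "")
    return names


def find_problem_names(names, bad_chars="#"):
    """Return the names containing any character that would be replaced."""
    return [name for name in names if any(ch in name for ch in bad_chars)]


def rename_map(names, bad_chars="#", replacement="_"):
    """Map each replaced name to its original so it can be reinstated later."""
    mapping = {}
    for name in names:
        safe = "".join(replacement if ch in bad_chars else ch for ch in name)
        if safe in mapping and mapping[safe] != name:
            # Two distinct inputs collapse to one name; this cannot be undone.
            raise ValueError(f"{name!r} and {mapping[safe]!r} both become {safe!r}")
        mapping[safe] = name
    return mapping


def check_fasta(path):
    """Return 0 if all names are safe, 1 (after printing them) otherwise."""
    bad = find_problem_names(read_fasta_names(path))
    if bad:
        print("strain names containing '#': " + ", ".join(bad), file=sys.stderr)
        return 1
    return 0
```

A pipeline could call sys.exit(check_fasta("sequences.fasta")) before building the tree, so that the run stops with a non-zero code instead of continuing with mismatched names. The collision check in rename_map matters because an input that already contains both seq#1 and seq_1 cannot be restored unambiguously.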
Your environment: if running Nextstrain locally
augur 23.1.1