Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tree] don't modify strain names #1750

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

jameshadfield
Copy link
Member

This PR increases the characters which augur tree will safely allow and adds a fatal error if single quotes or backslashes are in the strain name. See commit messages for details. As augur tree no longer modifies strain names this PR closes #1084, however ideally we would alert the user to the presence of those disallowed characters somewhere upstream of augur tree.

@jameshadfield
Copy link
Member Author

jameshadfield commented Feb 10, 2025

The CI failures occur because this PR makes augur tree not modify strain names (which I think is good behaviour) which then runs into an augur refine problem because that expects strain names to have been modified!

As an example, the fasta header >KM597072.1 Mumps virus genotype G strain MuV/Gabon/13/2[G], complete genome was previously modified as it passed through augur tree (the newick strain name became KM597072.1) and augur refine would perform the same modification as it read the fasta header.

Copy link

codecov bot commented Feb 10, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 5 lines in your changes missing coverage. Please review.

Project coverage is 73.04%. Comparing base (8920730) to head (7b0a8e5).
Report is 1 commits behind head on master.

Files with missing lines Patch % Lines
augur/tree.py 80.00% 3 Missing and 2 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1750      +/-   ##
==========================================
- Coverage   73.06%   73.04%   -0.02%     
==========================================
  Files          79       79              
  Lines        8335     8348      +13     
  Branches     1700     1703       +3     
==========================================
+ Hits         6090     6098       +8     
- Misses       1958     1961       +3     
- Partials      287      289       +2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

We also replace invalid characters with their unicode hex representation
(rather than a random string) which both guarantees uniqueness and makes
debugging a whole lot easier.
These are ok within Augur / Bio.Phylo code (assuming we replace them
for the tree builder) but are escaped by `Bio.Phylo.Write`, so
"'" becomes "\'" in the newick string. While there may be some way
around this it's going to cause constant headaches for bioinformatics
and so these characters should be replaced prior to running `augur tree`
Depending on the strain name backslashes are fine some of the time and
problematic other times. (Assuming we replace the backslash for the
"-delim" alignment passed to IQ-TREE.) All examples here work fine
within our code (and calls to `Bio.Phylo`) but some of them are escaped
during the `Bio.Phylo.write` call¹.

Examples which are written as-is:

 - "sing\le"
 - "dou\\ble"
 - "three\\\ee"

Some examples which are escaped ("\" → "\\")

 - "v~`,\"
 - "wb)\o"
 - " Yb\y"
 - ";:R@\"
 - ";U,\~"
 - "\[4{h"
 - "\|$[8"
 - "\9]oC"
 - "\ZY)x"
 - "5\j ("
 - "9_n[\"
 - "a_\B;"
 - "CQ9\]"
 - "F-V:\"
 - "Dj \."
 - "F;2C\"
 - "h\)n|"
 - "j(A\v"
 - "l\+)*"
 - "v~`,\"

¹ Code is here, although nothing obvious sticks out to me
<https://github.com/biopython/biopython/blob/master/Bio/Phylo/NewickIO.py>
This test generates random ASCII names. It's disabled for CI as it's
both stochastic and slow but you can easily toggle it back on (by
uncommenting the function call) if you want to better test strain name
handling in `augur tree`
In Augur v28 (and earlier), strain names which included spaces such as the FASTA input
">KM597072.1 Mumps virus genotype G strain MuV/Gabon/13/2[G], complete genome" would be
written to a delimited FASTA file by `augur tree` which didn't replace those spaces.
IQ-TREE would then change those names by dropping all characters after (and
including) the first space. After replacement of any delimeters the resulting `--output`
newick would have "KM597072.1".

The previous commits in this series changed the behaviour to preserve strain names entirely,
which resulted in the full name in the output newick. This caused issues downstream because
`augur refine` does the same name replacement when it parses the alignment file¹ and thus we have
a mismatch of tree names and (parsed) alignment names.

This commit restores the original behaviour with regards to names with spaces.

¹ <https://github.com/neherlab/treetime/blob/531f776bcad3b8b11fd470a403a9c06c78b07e36/treetime/sequence_data.py#L164-L172>
@jameshadfield jameshadfield force-pushed the james/tree-name-changes-1084 branch from c3390c4 to 7b0a8e5 Compare February 10, 2025 22:40
@jameshadfield
Copy link
Member Author

I've updated this PR to once again modify strain names with spaces in them, which was the previous behaviour and is necessary for augur refine to work in its current state. (See commit message for more.)

@jameshadfield jameshadfield requested a review from a team February 10, 2025 23:45
@jameshadfield
Copy link
Member Author

Closes #1360

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

augur tree modifying strain names without warning
2 participants