-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[tree] don't modify strain names #1750
base: master
Are you sure you want to change the base?
Conversation
The CI failures occur because this PR makes As an example, the fasta header |
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #1750 +/- ##
==========================================
- Coverage 73.06% 73.04% -0.02%
==========================================
Files 79 79
Lines 8335 8348 +13
Branches 1700 1703 +3
==========================================
+ Hits 6090 6098 +8
- Misses 1958 1961 +3
- Partials 287 289 +2 ☔ View full report in Codecov by Sentry. |
Includes failing test
We also replace invalid characters with their unicode hex representation (rather than a random string) which both guarantees uniqueness and makes debugging a whole lot easier.
These are ok within Augur / Bio.Phylo code (assuming we replace them for the tree builder) but are escaped by `Bio.Phylo.Write`, so "'" becomes "\'" in the newick string. While there may be some way around this it's going to cause constant headaches for bioinformatics and so these characters should be replaced prior to running `augur tree`
Depending on the strain name backslashes are fine some of the time and problematic other times. (Assuming we replace the backslash for the "-delim" alignment passed to IQ-TREE.) All examples here work fine within our code (and calls to `Bio.Phylo`) but some of them are escaped during the `Bio.Phylo.write` call¹. Examples which are written as-is: - "sing\le" - "dou\\ble" - "three\\\ee" Some examples which are escaped ("\" → "\\") - "v~`,\" - "wb)\o" - " Yb\y" - ";:R@\" - ";U,\~" - "\[4{h" - "\|$[8" - "\9]oC" - "\ZY)x" - "5\j (" - "9_n[\" - "a_\B;" - "CQ9\]" - "F-V:\" - "Dj \." - "F;2C\" - "h\)n|" - "j(A\v" - "l\+)*" - "v~`,\" ¹ Code is here, although nothing obvious sticks out to me <https://github.com/biopython/biopython/blob/master/Bio/Phylo/NewickIO.py>
This test generates random ASCII names. It's disabled for CI as it's both stochastic and slow but you can easily toggle it back on (by uncommenting the function call) if you want to better test strain name handling in `augur tree`
In Augur v28 (and earlier), strain names which included spaces such as the FASTA input ">KM597072.1 Mumps virus genotype G strain MuV/Gabon/13/2[G], complete genome" would be written to a delimited FASTA file by `augur tree` which didn't replace those spaces. IQ-TREE would then change those names by dropping all characters after (and including) the first space. After replacement of any delimeters the resulting `--output` newick would have "KM597072.1". The previous commits in this series changed the behaviour to preserve strain names entirely, which resulted in the full name in the output newick. This caused issues downstream because `augur refine` does the same name replacement when it parses the alignment file¹ and thus we have a mismatch of tree names and (parsed) alignment names. This commit restores the original behaviour with regards to names with spaces. ¹ <https://github.com/neherlab/treetime/blob/531f776bcad3b8b11fd470a403a9c06c78b07e36/treetime/sequence_data.py#L164-L172>
c3390c4
to
7b0a8e5
Compare
I've updated this PR to once again modify strain names with spaces in them, which was the previous behaviour and is necessary for |
Closes #1360 |
This PR increases the characters which
augur tree
will safely allow and adds a fatal error if single quotes or backslashes are in the strain name. See commit messages for details. Asaugur tree
no longer modifies strain names this PR closes #1084, however ideally we would alert the user to the presence of those disallowed characters somewhere upstream ofaugur tree
.