-
Notifications
You must be signed in to change notification settings - Fork 128
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Print warnings if alignment names will be changed by IQtree #1085
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1085 +/- ##
==========================================
+ Coverage 73.06% 73.10% +0.03%
==========================================
Files 79 79
Lines 8335 8347 +12
Branches 1700 1700
==========================================
+ Hits 6090 6102 +12
Misses 1958 1958
Partials 287 287 ☔ View full report in Codecov by Sentry. |
Current IQtree escape characters can be found here: https://github.com/iqtree/iqtree2/blob/3bbc304263cb2f85574a9163e8f2e5c5b597a147/utils/tools.cpp#L585 |
Can you say more about this? |
Would also be great to have tests for this behaviour, e.g. in |
Oh wow its been so long since I worked on this - I will try to add a test and get this ready for review again |
Co-authored-by: Cornelius Roemer <[email protected]>
Co-authored-by: Thomas Sibley <[email protected]>
044781e
to
0f1d633
Compare
Hi @tsibley I now added some tests - I realized that the replacement of the |
@anna-parker it looks like you and @jameshadfield are barking up the same tree -- see #1750 |
Hi @anna-parker - apologies for missing this PR when I started working on #1084 recently in #1750. Both our PRs split the fasta header on whitespace and only use what's before the first whitespace (which is what iq-tree will do anyways) however our escaping approach is quite different.
I think #1750 is more complete, but I'm happy to consider |
@jameshadfield I also think your approach is more complete - so I am happy to go with that! |
We also replace invalid characters with their unicode hex representation (rather than a random string) which both guarantees uniqueness and makes debugging a whole lot easier. See <#1085> for an alternate approach to escaping using a local-specific non-word regex.
Description of proposed changes
IQtree changes "offending characters" in alignment names to a
'_'
character. Offending characters are defined in:https://github.com/Cibiv/IQ-TREE/blob/6776a95f15a2eccda2aa330497291dc246575995/utils/tools.cpp#L503https://github.com/iqtree/iqtree2/blob/3bbc304263cb2f85574a9163e8f2e5c5b597a147/utils/tools.cpp#L585 (i.e. if the character is not alphanumeric and not in '.', '-' or '_') . Currently, augur will make sure that the offending characters'/|()*'
are not modified, however all other offending characters will be modified by IQtree. IQtree writes all modifications to a.log
file. This is not conveyed to the user and alignment information can no longer be used when iterating over the tree.I extend augur's current technique for not modifying
'/|()*
' to not modify any offending character. However, treetime might not be able to handle all offending characters. (For exampletreetime
's alignment module reports that"'"
and"\'"
are not equivalent - I am not quite sure why -and even if an alignment name with a'
is not modified the alignment information still cannot be used downstream, for example byrefine
). For offending characters outside of'/|()*'
print a warning and a recommendation to change the character to'_'
to avoid later issues.Related issue(s)
Resolves #1084
Testing
Modify a sequence name in an alignment file to include an offending character, such as
,@$
, runCheck error warning and node names in
tree_modified.nwk
.Checklist