Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Print warnings if alignment names will be changed by IQtree #1085

Closed
wants to merge 13 commits into from

Conversation

anna-parker
Copy link
Collaborator

@anna-parker anna-parker commented Nov 1, 2022

Description of proposed changes

IQtree changes "offending characters" in alignment names to a '_' character. Offending characters are defined in: https://github.com/Cibiv/IQ-TREE/blob/6776a95f15a2eccda2aa330497291dc246575995/utils/tools.cpp#L503 https://github.com/iqtree/iqtree2/blob/3bbc304263cb2f85574a9163e8f2e5c5b597a147/utils/tools.cpp#L585 (i.e. if the character is not alphanumeric and not in '.', '-' or '_') . Currently, augur will make sure that the offending characters '/|()*' are not modified, however all other offending characters will be modified by IQtree. IQtree writes all modifications to a .log file. This is not conveyed to the user and alignment information can no longer be used when iterating over the tree.

I extend augur's current technique for not modifying '/|()*' to not modify any offending character. However, treetime might not be able to handle all offending characters. (For example treetime's alignment module reports that "'" and "\'" are not equivalent - I am not quite sure why -and even if an alignment name with a ' is not modified the alignment information still cannot be used downstream, for example by refine). For offending characters outside of '/|()*' print a warning and a recommendation to change the character to '_' to avoid later issues.

image

Related issue(s)

Resolves #1084

Testing

Modify a sequence name in an alignment file to include an offending character, such as ,@$, run

augur tree --alignment modified.fasta --tree-builder-args '-ninit 10 -n 4' --output tree_modified.nwk --nthreads 1

Check error warning and node names in tree_modified.nwk.

Checklist

  • Add a message in CHANGES.md summarizing the changes in this PR. Keep headers and formatting consistent with the rest of the file.

@codecov
Copy link

codecov bot commented Nov 1, 2022

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 73.10%. Comparing base (eb8c8ac) to head (6545f16).
Report is 52 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1085      +/-   ##
==========================================
+ Coverage   73.06%   73.10%   +0.03%     
==========================================
  Files          79       79              
  Lines        8335     8347      +12     
  Branches     1700     1700              
==========================================
+ Hits         6090     6102      +12     
  Misses       1958     1958              
  Partials      287      287              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

augur/tree.py Outdated Show resolved Hide resolved
@corneliusroemer
Copy link
Member

Current IQtree escape characters can be found here: https://github.com/iqtree/iqtree2/blob/3bbc304263cb2f85574a9163e8f2e5c5b597a147/utils/tools.cpp#L585

@tsibley
Copy link
Member

tsibley commented Jan 4, 2023

For example treetime's alignment module reports that "'" and "\'" are not equivalent

Can you say more about this?

augur/tree.py Outdated Show resolved Hide resolved
augur/tree.py Outdated Show resolved Hide resolved
augur/tree.py Outdated Show resolved Hide resolved
@tsibley
Copy link
Member

tsibley commented Jan 5, 2023

Would also be great to have tests for this behaviour, e.g. in tests/functional/tree.t or similar.

@anna-parker
Copy link
Collaborator Author

Oh wow its been so long since I worked on this - I will try to add a test and get this ready for review again

@anna-parker anna-parker force-pushed the fix/augur_tree_offending_character branch from 044781e to 0f1d633 Compare February 11, 2025 08:21
@anna-parker
Copy link
Collaborator Author

Hi @tsibley I now added some tests - I realized that the replacement of the ' character still has some unstable behavior - sometimes it gets replaced correctly, other times it gets replaced with ''. So potentially we should replace these characters with _ instead (as currently) and just add a warning.

@genehack
Copy link
Contributor

@anna-parker it looks like you and @jameshadfield are barking up the same tree -- see #1750

@jameshadfield
Copy link
Member

Hi @anna-parker - apologies for missing this PR when I started working on #1084 recently in #1750.

Both our PRs split the fasta header on whitespace and only use what's before the first whitespace (which is what iq-tree will do anyways) however our escaping approach is quite different.

I think #1750 is more complete, but I'm happy to consider unsafe_chars = re.compile(rb'[^\w.-]', re.LOCALE) if you think that's the best approach?

@anna-parker
Copy link
Collaborator Author

@jameshadfield I also think your approach is more complete - so I am happy to go with that!

jameshadfield added a commit that referenced this pull request Feb 13, 2025
We also replace invalid characters with their unicode hex representation
(rather than a random string) which both guarantees uniqueness and makes
debugging a whole lot easier.

See <#1085> for an alternate
approach to escaping using a local-specific non-word regex.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

augur tree modifying strain names without warning
5 participants