Removal of hypervariable sequences #187

Open
BenKuhnhaeuser opened this issue Oct 24, 2024 · 2 comments

Comments

@BenKuhnhaeuser

Hi there,
I love how the phyx family is expanding to include so many useful capabilities. I would like to suggest a program that removes highly divergent / hypervariable sequences from multiple sequence alignments. Such highly divergent sequences can point to various issues, such as chimeric or paralogous sequences being mixed up in an alignment. The tool would calculate the number of sites divergent from the consensus sequence (similar to calculating missing sites), and remove sequences above the specified threshold.

A "sliding window" option would be even cooler to clean short stretches of highly divergent sequences in otherwise perfectly aligned and well-behaved sequences without discarding the entire sequence.

Many thanks for the consideration!
Ben

@josephwb
Member

This indeed sounds useful. I guess what would be necessary is a definition (or definitions) of "highly divergent / hypervariable". I can think of various ways to pick the "most different" sequence(s), but what would the threshold be? If you have concrete ideas on what you'd like to see, then please share!

@BenKuhnhaeuser
Author

BenKuhnhaeuser commented Oct 25, 2024

My idea would be to calculate, for each sequence, the percentage of non-ambiguous sites that differ from the corresponding non-ambiguous sites in the consensus sequence. The user would then set a divergence threshold p, and sequences whose percentage of divergent sites exceeds that threshold would be discarded. CIAlign has such a functionality but is a bit clunky to use. A 10% divergence threshold has worked quite well for me so far, but being able to adjust this threshold is important, as the expected "normal" divergence between sequences depends on how closely related the taxa in the analysed lineage are.
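To make the whole-sequence filter concrete, here is a rough Python sketch of what I have in mind. It is not how phyx or CIAlign implement anything; the function names (`consensus`, `divergence`, `filter_divergent`) are made up for illustration, and I am assuming a DNA alignment held as a dict of equal-length strings, a simple majority-rule consensus, and that only A/C/G/T count as non-ambiguous.

```python
from collections import Counter

# Assumption: only these four bases count as "non-ambiguous" sites.
UNAMBIGUOUS = set("ACGT")


def consensus(seqs):
    """Majority-rule consensus per column, counting only A/C/G/T.

    Columns without any unambiguous base get 'N'.
    """
    cons = []
    for column in zip(*seqs):
        counts = Counter(b for b in column if b in UNAMBIGUOUS)
        cons.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(cons)


def divergence(seq, cons):
    """Fraction of sites differing from the consensus.

    Only sites that are unambiguous in both the sequence and the
    consensus are compared; gaps, N and IUPAC codes are skipped.
    """
    compared = differing = 0
    for a, b in zip(seq.upper(), cons.upper()):
        if a in UNAMBIGUOUS and b in UNAMBIGUOUS:
            compared += 1
            differing += a != b
    return differing / compared if compared else 0.0


def filter_divergent(alignment, p=0.10):
    """Keep sequences whose divergence from the consensus is at most p."""
    cons = consensus([s.upper() for s in alignment.values()])
    return {name: s for name, s in alignment.items()
            if divergence(s, cons) <= p}
```

Calling `filter_divergent(alignment, p=0.10)` would mirror the 10% threshold mentioned above. Note that in this sketch the consensus is computed once from all sequences, including the divergent ones, so a variant that recomputes it after each removal might behave differently.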

For the sliding window idea, the program would identify parts of sequences above the set divergence threshold p within a specified window size w (maybe using 50 or 100 bp as a default), and mask / delete only the part of the sequence that is above the divergence threshold, not the entire sequence.
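The sliding-window variant could look roughly like the sketch below. Again, this is only an illustration under my own assumptions: the function name `mask_divergent_windows` is hypothetical, the window is measured in alignment columns and moved one column at a time, the consensus is supplied by the caller, and masking replaces residues with `N` while leaving gaps untouched.

```python
# Assumption: only these four bases count as "non-ambiguous" sites.
UNAMBIGUOUS = set("ACGT")


def mask_divergent_windows(seq, cons, w=50, p=0.10, mask_char="N"):
    """Mask every window of w columns whose divergence from cons exceeds p.

    Divergence within a window is the fraction of differing sites among
    sites that are unambiguous in both seq and cons. Gaps are kept as-is.
    """
    if len(seq) != len(cons):
        raise ValueError("sequence and consensus must have equal length")
    n = len(seq)
    seq_u, cons_u = seq.upper(), cons.upper()
    comparable = [seq_u[i] in UNAMBIGUOUS and cons_u[i] in UNAMBIGUOUS
                  for i in range(n)]
    differs = [comparable[i] and seq_u[i] != cons_u[i] for i in range(n)]

    masked = [False] * n
    # Slide the window one column at a time across the alignment.
    for start in range(max(n - w, 0) + 1):
        end = min(start + w, n)
        compared = sum(comparable[start:end])
        if compared and sum(differs[start:end]) / compared > p:
            for i in range(start, end):
                masked[i] = True

    return "".join(mask_char if masked[i] and seq[i] != "-" else seq[i]
                   for i in range(n))
```

Because a flagged window is masked in full, a few matching sites flanking a divergent stretch get masked as well. Whether to trim the mask back to the outermost mismatches, or to delete rather than mask, would be a design decision for the actual tool.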

I'm attaching a screenshot of a real-world case for illustration, where disagreements from the consensus are highlighted. If editing manually, I would probably get rid of samples 146 and 154 altogether (or remove large parts of them), but in sample 147 I would only want to exclude sites > 400 bp, and in sample 152 only sites between 80 and 120 bp. Happy to share the .fasta file if you need something for testing.

[Screenshot: alignment with disagreements from the consensus highlighted]
