Protein sequence validation #498

Open
simarilion opened this issue Dec 11, 2024 · 6 comments

@simarilion

Hi seqkit team,

I'd like to validate some protein sequences, but seqkit does not behave as I would intuitively expect.

seqkit version -u
seqkit v2.9.0
Checking new version...
You are using the latest version of seqkit

echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD" | seqkit seq --seq-type protein -v

>seq
MFKXXXXXQLRTNKZZZZZDRTFPAD

I would expect a warning about a protein sequence containing "X" or "Z", rather than the input sequence simply being returned unchanged. Am I missing something?

Any advice would be gratefully received!

@shenwei356
Owner

How about listing the IDs of sequences that contain Z or X?

$ echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD"  \
   | seqkit grep -s -P -i -p z -p x | seqkit seq -ni
seq

@simarilion
Author

Thanks for your fast reply, Shen Wei!

Yes, that works, thanks! But I guess that, as a naive user, I would expect a warning about such protein sequences, since "Z" and "X" are not standard amino acids.

@shenwei356
Owner

They are allowed in protein sequences 🥲 Someone asked to add Z a long time ago.

https://github.com/shenwei356/bio/blob/master/seq/alphabet.go#L34-L73

	Z	Glx	Glutamine or Glutamic acid [2]
	X	unknown amino acid

 2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html
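
To make the distinction above concrete, here is a minimal Python sketch (not seqkit code) that separates the 20 standard one-letter amino acid codes from the two ambiguity codes quoted from the alphabet table. The names `STANDARD_AA`, `AMBIGUOUS_AA` and `classify_residues` are made up for illustration, and the ambiguity table deliberately lists only Z and X, the two codes discussed here.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Only the two ambiguity codes quoted in this thread (illustrative, not exhaustive).
AMBIGUOUS_AA = {
    "Z": "Glx: glutamine or glutamic acid",
    "X": "unknown amino acid",
}


def classify_residues(sequence):
    """Return (ambiguous, unrecognised) sets of residue letters in a sequence."""
    residues = set(sequence.upper())
    ambiguous = residues & set(AMBIGUOUS_AA)
    unrecognised = residues - STANDARD_AA - set(AMBIGUOUS_AA)
    return ambiguous, unrecognised


ambiguous, unrecognised = classify_residues("MFKXXXXXQLRTNKZZZZZDRTFPAD")
print(sorted(ambiguous), sorted(unrecognised))
```

For the example sequence, every letter is either standard or one of the two ambiguity codes, so a permissive alphabet accepts it while a strict 20-letter alphabet would not.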

@simarilion
Author

Hmmmm, I see the conundrum. The problem is that many sequence alignment and phylogenetic analysis programs reject "X" and "Z" residues outright. Maybe some kind of 'soft warning' would be good?

Nonetheless I greatly appreciate your helpful and fast responses :-)!
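
As a rough illustration of what such a 'soft warning' could look like, here is a small Python sketch. It is not an existing seqkit feature; the function name `soft_warnings` and the message format are assumptions made up for this example. It reports non-standard residues on stderr but leaves the records themselves untouched.

```python
import sys
from collections import Counter

# The 20 standard amino acids, one-letter codes.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def soft_warnings(records, stream=sys.stderr):
    """Warn about non-standard residues without rejecting any record.

    `records` is an iterable of (record_id, sequence) pairs.
    Returns the list of warning messages that were printed.
    """
    messages = []
    for record_id, sequence in records:
        # Count every residue letter that is not one of the 20 standard codes.
        extras = Counter(r for r in sequence.upper() if r not in STANDARD_AA)
        if extras:
            detail = ", ".join(f"{r}={n}" for r, n in sorted(extras.items()))
            message = f"{record_id}: non-standard residues {detail}"
            print(f"[WARN] {message}", file=stream)
            messages.append(message)
    return messages


soft_warnings([("seq", "MFKXXXXXQLRTNKZZZZZDRTFPAD")])
```

Note that this strict check would also flag gap characters such as "-" in an alignment, so the allowed set would need extending for aligned input.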

@shenwei356
Owner

How about replacing Z with Q or E, and replacing X with any residue?

$ echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD" \
    | seqkit replace -s -i -p Z -r E \
    | seqkit replace -s -i -p X -r M
>seq
MFKMMMMMQLRTNKEEEEEDRTFPAD

@simarilion
Author

simarilion commented Dec 11, 2024

In my case I can't replace residues with something else, as these are supposed to be explicit data points, i.e. every position in the protein alignment should be explicitly defined (rather than, as with "X", being whatever you want!).

My plan is to use seqkit for this very purpose: to check that no sequence in an alignment contains undefined residues, and probably to remove any that do from the analysis.
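
For reference, the plan described above can be sketched in plain Python, independent of seqkit. This is only a minimal sketch under stated assumptions: the helper names `read_fasta` and `split_by_validity` are hypothetical, the FASTA parser is deliberately simple, and by default only the 20 standard one-letter codes count as defined residues.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AA = frozenset("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(lines):
    """Yield (record_id, sequence) pairs from an iterable of FASTA lines."""
    record_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record_id is not None:
                yield record_id, "".join(chunks)
            # Use the header text up to the first whitespace as the ID.
            fields = line[1:].split()
            record_id, chunks = (fields[0] if fields else ""), []
        elif record_id is not None:
            chunks.append(line)
    if record_id is not None:
        yield record_id, "".join(chunks)


def split_by_validity(records, allowed=STANDARD_AA):
    """Split records into (kept, removed) by whether all residues are allowed."""
    kept, removed = [], []
    for record_id, sequence in records:
        if set(sequence.upper()) <= allowed:
            kept.append((record_id, sequence))
        else:
            removed.append((record_id, sequence))
    return kept, removed


fasta = ">good\nMFKQLRT\n>bad\nMFKXXZZ\n>gapped\nMF-KQ\n"
kept, removed = split_by_validity(read_fasta(fasta.splitlines()))
print([rid for rid, _ in kept], [rid for rid, _ in removed])
```

Since alignments normally contain gap characters, passing `allowed=STANDARD_AA | {"-"}` keeps gapped sequences while still removing those with "X" or "Z".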
