Protein sequence validation #498

simarilion · 2024-12-11T10:45:54Z

Hi seqkit team,

I'd like to simply validate some protein sequences, but seqkit's behaviour is not what I'd intuitively expect.

$ seqkit version -u
seqkit v2.9.0
Checking new version...
You are using the latest version of seqkit

$ echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD" | seqkit seq --seq-type protein -v
>seq
MFKXXXXXQLRTNKZZZZZDRTFPAD

I would expect some warnings about a protein sequence with "X" or "Z" rather than just returning the input sequence. Am I missing something?

Any advice would be gratefully received!

shenwei356 · 2024-12-11T11:18:50Z

How about showing the IDs of sequences that contain Z or X?

$ echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD"  \
   | seqkit grep -s -P -i -p z -p x | seqkit seq -ni
seq

simarilion · 2024-12-11T11:23:26Z

Thanks for your fast reply, Shen Wei!

Yes, that works - thanks! But I guess as a naive user I would expect a warning about such protein sequences, as "Z" and "X" are not standard amino acids.

shenwei356 · 2024-12-11T11:37:32Z

They are allowed in protein sequences 🥲 Someone asked to add Z a long time ago.

https://github.com/shenwei356/bio/blob/master/seq/alphabet.go#L34-L73

	Z	Glx	Glutamine or Glutamic acid [2]
	X	unknown amino acid

 2. http://www.dnabaser.com/articles/IUPAC%20ambiguity%20codes.html

simarilion · 2024-12-11T11:41:38Z

Hmmmm, I see the conundrum. The problem is that many sequence alignment and phylogenetic analysis programs reject "X" and "Z" residues outright. Maybe some kind of 'soft warning' would be good?

Nonetheless I greatly appreciate your helpful and fast responses :-)!
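
As a rough illustration of what such a 'soft warning' could look like outside seqkit, here is a minimal sketch in plain shell and awk. The function name `report_nonstandard` is hypothetical and not part of seqkit. The allowed alphabet (the 20 standard amino acids plus the `-` gap character) is an assumption and may need adjusting for a given downstream tool.

```shell
# Hypothetical helper, NOT a seqkit feature: list FASTA records whose sequence
# contains anything other than the 20 standard amino acids or "-" gaps
# (the allowed alphabet is an assumption).
# Reads FASTA on stdin; prints "<id><TAB><offending residues>" per flagged record.
report_nonstandard() {
    awk '
        # Report the record collected so far if any residue is left over
        # after deleting every allowed character.
        function flush_record(   bad) {
            if (!seen) return
            bad = toupper(seq)
            gsub(/[ACDEFGHIKLMNPQRSTVWY-]/, "", bad)
            if (bad != "") printf "%s\t%s\n", id, bad
        }
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        /^>/ { flush_record(); seen = 1; id = substr($1, 2); seq = ""; next }
        { seq = seq $0 }                        # join wrapped sequence lines
        END { flush_record() }
    '
}

# Example with the sequence from this thread:
printf '>seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD\n' | report_nonstandard
# prints: seq<TAB>XXXXXZZZZZ
```

An empty report means every record passed, so the output can serve as a warning list without altering the input.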

shenwei356 · 2024-12-11T12:06:24Z

How about replacing Z with Q or E, and replacing X with any residue?

$ echo -e ">seq\nMFKXXXXXQLRTNKZZZZZDRTFPAD" \
    | seqkit replace -s -i -p Z -r E \
    | seqkit replace -s -i -p X -r M
>seq
MFKMMMMMQLRTNKEEEEEDRTFPAD

simarilion · 2024-12-11T12:20:35Z

In my case I can't replace residues with something else, as these are supposed to be explicit data points, i.e. every position in the protein alignment should be explicitly defined (and not, as with "X", whatever you want!).

My plan is to use seqkit for this very purpose: check that no sequence in an alignment has undefined residues. If any do, I will probably remove them from the analysis!
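
The removal step described above could be sketched as follows, again in plain shell and awk rather than seqkit itself. The function name `keep_fully_defined` is hypothetical, and treating only the 20 standard amino acids plus `-` gaps as "defined" is an assumption. With seqkit, inverting the earlier `seqkit grep` match with its `-v` (invert match) flag would presumably give a similar result for X and Z specifically.

```shell
# Hypothetical helper, NOT a seqkit feature: keep only FASTA records whose
# sequence consists solely of the 20 standard amino acids or "-" gaps
# (the allowed alphabet is an assumption).
# Reads FASTA on stdin, writes the surviving records to stdout.
# Note: wrapped sequences are written back as a single line.
keep_fully_defined() {
    awk '
        # Print the record collected so far only if nothing remains
        # after deleting every allowed character.
        function flush_record(   rest) {
            if (!seen) return
            rest = toupper(seq)
            gsub(/[ACDEFGHIKLMNPQRSTVWY-]/, "", rest)
            if (rest == "") printf "%s\n%s\n", header, seq
        }
        { sub(/\r$/, "") }                      # tolerate Windows line endings
        /^>/ { flush_record(); seen = 1; header = $0; seq = ""; next }
        { seq = seq $0 }                        # join wrapped sequence lines
        END { flush_record() }
    '
}

# Example: the second record contains X and Z, so only the first survives.
printf '>good\nMFKQLRT\n>bad\nMFKXZ\n' | keep_fully_defined
# prints:
# >good
# MFKQLRT
```

Dropping whole records rather than masking positions keeps every remaining alignment column explicitly defined, which matches the requirement stated above.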
