Generalized property-based encoders and digital signal processing facilitate predictive tasks in protein engineering

Computational methods in protein engineering often require encoding amino acid sequences, i.e., converting them into numeric arrays. Physicochemical properties are a typical choice for encoding. However, which property (or group of properties) is best for a given predictive task remains an open problem. In this work, we generalize property-based encoding strategies to maximize the performance of predictive models in protein engineering. First, combining text mining and unsupervised learning, we partitioned the AAIndex database into eight semantically consistent groups of properties. We then applied a non-linear PCA within each group to define a single encoder that represents it. Next, in several case studies, we assessed the performance of predictive models trained using classical encoders (One Hot Encoder and TAPE embeddings) and the proposed encoders for predicting protein and peptide function, folding, and biological activity. We confirm that, in most cases, models trained using our encoders outperform classical approaches in both precision and generality. Furthermore, when the Fast Fourier Transform (FFT) is applied to sequences encoded with the proposed encoders, the increase in performance and the reduction in overfitting are considerably more pronounced. Finally, we propose a preliminary and straightforward methodology to create de novo sequences with desirable properties. Together, these results offer simple ways to increase the performance of general and complex predictive tasks in protein engineering.
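The encode-then-transform idea described above can be illustrated with a minimal sketch. It is not the code in this repository: the Kyte-Doolittle hydropathy scale is an assumed stand-in for the proposed encoders (which are derived from groups of AAIndex properties), the function names `encode`, `fft`, and `spectrum` are hypothetical, and zero-padding to a power of two is an assumption made so that a plain radix-2 FFT can be used.

```python
import cmath

# Kyte-Doolittle hydropathy scale, used here ONLY as a stand-in property.
# The encoders proposed in this work are different (see the results directory).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode(sequence, scale=HYDROPATHY, length=None):
    """Map each residue to its property value, zero-padding up to `length`."""
    values = [scale[residue] for residue in sequence.upper()]
    if length is not None:
        if length < len(values):
            raise ValueError("length is shorter than the sequence")
        values += [0.0] * (length - len(values))
    return values


def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    power = 1
    while power < n:
        power *= 2
    return power


def fft(signal):
    """Recursive radix-2 Cooley-Tukey FFT; len(signal) must be a power of two."""
    n = len(signal)
    if n == 1:
        return [complex(signal[0])]
    if n % 2:
        raise ValueError("signal length must be a power of two")
    even = fft(signal[0::2])
    odd = fft(signal[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def spectrum(sequence, scale=HYDROPATHY):
    """Encode, zero-pad to a power of two, and return the magnitudes of the
    first half of the FFT output (the second half mirrors it for real input)."""
    padded = encode(sequence, scale, next_power_of_two(len(sequence)))
    return [abs(c) for c in fft(padded)[: len(padded) // 2]]
```

Calling `spectrum("MKTAYIAKQR")` would return eight magnitudes (the sequence is padded to length 16) that could serve as input features for a predictive model. In practice, all sequences in a dataset would be padded to a common length so that the feature vectors have the same size.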

Summary of directories

aaindexdb: Contains the files associated with the AAIndex database, including the original source and the processed datasets.
dataset_testing: Contains the datasets built to evaluate the proposed methodology.
results: Contains the encoders proposed using the methodology developed in this work.
sourcecode: Contains the Python scripts implemented in this work.

Contact us

Sebastián Contreras: [email protected]
Álvaro Olivera-Nappa: [email protected]
David Medina-Ortiz: [email protected]

License

All source code, environment configurations, datasets, and models are available for open, non-commercial use under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License (CC-BY-NC-SA 4.0).

Attribution: You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

NonCommercial: You may not use the material for commercial purposes.

ShareAlike: If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

Repository: ProteinEngineering-PESB2/numerical_representations_protein_seqs
