This project evaluates whether protein sequences belong to humans or pathogens. It is built on the collaborative framework provided by DeepChain apps. The main deepchain-apps package is available on PyPI. To make the most of the apps' capabilities, take a look at the bio-transformers and bio-datasets packages.
A linear classifier trained with SGD (stochastic gradient descent), sklearn.linear_model.SGDClassifier, is applied to two types of features:
- ProtBert embeddings (the probert_embedding feature): computed by deepchain-apps using bio-transformers
- One-hot encoding: each amino acid (a categorical variable) is represented as a binary vector using OneHotEncoder.
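To make the one-hot feature concrete, here is a minimal plain-Python sketch of the idea. The project itself uses scikit-learn's OneHotEncoder; the 20-letter alphabet, the fixed `max_len`, and the handling of unknown residues below are assumptions for illustration, not values taken from the project.

```python
# The 20 standard amino acids. This alphabet and the padding scheme are
# assumptions for illustration, not taken from the project code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a flat binary vector of length max_len * 20.

    Sequences longer than max_len are truncated, shorter ones are
    zero-padded, and unknown residues (e.g. 'X') stay all zeros.
    """
    vector = [0] * (max_len * len(AMINO_ACIDS))
    for pos, aa in enumerate(sequence[:max_len]):
        idx = AA_INDEX.get(aa)
        if idx is not None:
            vector[pos * len(AMINO_ACIDS) + idx] = 1
    return vector


encoded = one_hot_encode("MKV", max_len=4)
print(len(encoded), sum(encoded))
```

Each position contributes one block of 20 entries with at most a single 1, so a linear classifier can learn one weight per (position, amino acid) pair.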
The bio-datasets package provides more than 96k human and pathogen protein sequences. Before jumping in, a global analysis of the data is always crucial! You can check protein length information, with or without histograms, via src/exploratory_data_analysis.py:
python src/exploratory_data_analysis.py
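As a rough idea of what such a length analysis involves, the sketch below summarizes sequence lengths with the standard `statistics` module and builds a coarse histogram. It is not the content of src/exploratory_data_analysis.py; the function name `summarize_lengths`, the `bin_width` parameter, and the toy sequences are hypothetical.

```python
import statistics
from collections import Counter


def summarize_lengths(sequences, bin_width=100):
    """Return summary statistics and a binned histogram of sequence lengths."""
    lengths = [len(s) for s in sequences]
    summary = {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
    }
    # Bucket each length into the bin starting at a multiple of bin_width.
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    return summary, dict(sorted(bins.items()))


# Toy input; the real data comes from the bio-datasets package.
toy = ["MKV", "MKVLAAGI", "M" * 120]
summary, histogram = summarize_lengths(toy)
print(summary)
print(histogram)
```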
You can train, validate, and test on the data and save the classifiers as follows:
python src/classifier.py -f probert_embedding # using probert embedding features
python src/classifier.py -f one_hot_encoding # using one-hot encoding features
Training with one-hot encoding takes a few minutes the first time. Because the computed features are saved, subsequent runs are faster.
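The speed-up on later runs follows the usual compute-once, load-afterwards pattern. The sketch below shows that pattern in general terms; the helper `load_or_compute`, the file name, and the use of `pickle` are assumptions, and the project may store its features differently.

```python
import os
import pickle
import tempfile


def load_or_compute(cache_path, compute_fn):
    """Load features from cache_path if present, else compute and save them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            return pickle.load(fh)
    features = compute_fn()
    with open(cache_path, "wb") as fh:
        pickle.dump(features, fh)
    return features


calls = []


def slow_features():
    # Stand-in for an expensive feature computation.
    calls.append(1)
    return [[0, 1], [1, 0]]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "one_hot_features.pkl")  # hypothetical file name
    first = load_or_compute(path, slow_features)   # computes and writes the cache
    second = load_or_compute(path, slow_features)  # reads the cache instead

print(len(calls))
```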
You can check the information at any time with the help command:
python src/classifier.py -h # help
The classifiers will be saved in checkpoint/
The main class is named App, in src/app.py.
You can add or modify the protein sequences that you want to evaluate (at the bottom of the code), then just run it:
python src/app.py
The output shows the score for each protein and each feature type as a list of dictionaries:
[
{
'SGD_probert_embedding':score_of_prot1,
'SGD_one_hot_encoding':score_of_prot1
},
{
'SGD_probert_embedding':score_of_prot2,
'SGD_one_hot_encoding':score_of_prot2
}
]
The score, in the range [0, 1], corresponds to the probability that the protein belongs to the human class.
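If class labels are needed rather than probabilities, the scores can be thresholded. The sketch below assumes a cut-off of 0.5, which is a common default but is not specified by the project; the helper `label_scores` and the example scores are made up for illustration.

```python
def label_scores(results, threshold=0.5):
    """Map each score to 'human' or 'pathogen' using a probability cut-off."""
    labeled = []
    for scores in results:
        labeled.append({
            name: ("human" if score >= threshold else "pathogen")
            for name, score in scores.items()
        })
    return labeled


# Illustrative values only, in the same shape as the output of src/app.py.
example = [
    {"SGD_probert_embedding": 0.91, "SGD_one_hot_encoding": 0.78},
    {"SGD_probert_embedding": 0.12, "SGD_one_hot_encoding": 0.55},
]
labels = label_scores(example)
print(labels)
```

The two classifiers can disagree on the same protein, as in the second entry here, so it may be worth inspecting both scores before settling on a label.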
python >= 3.7
numpy
scipy
sklearn
biodatasets
biotransformers
deepchain.components
torch
joblib
loguru
tqdm
statistics
matplotlib