Building and Tuning a Machine Learning Model to Predict Protein Subcellular Localization from Amino Acid Sequence

by MAS16, 2019

Introduction

We now have access to an unprecedented amount of genomic information, but harnessing its full potential still requires slow and expensive experiments. Ideally, supervised machine learning could make predictions about biology from data that have already been collected and from readily available genomic information. Here, I build and tune a machine learning model to predict the subcellular localization of proteins from their amino acid sequences. The end result is a support vector machine (SVM) model that classifies proteins as soluble or membrane with an out-of-sample accuracy of 0.84, precision of 0.85, and recall of 0.84.
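For readers less familiar with the three metrics quoted above, the following is a minimal sketch of how accuracy, precision, and recall can be computed for a two-class problem. It is not taken from the repository: the function name `classification_metrics`, the label strings, and the choice of "membrane" as the positive class are assumptions made for illustration, and the README does not say whether the reported precision and recall are per-class or averaged across classes.

```python
def classification_metrics(y_true, y_pred, positive="membrane"):
    """Return accuracy, precision, and recall for a binary labeling.

    The positive class ("membrane") is an assumption for this sketch.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    # Count the cells of the confusion matrix.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    # Guard against division by zero when a class is never predicted or present.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall}


# Made-up labels for illustration only; these are not the project's data.
y_true = ["membrane", "membrane", "soluble", "soluble", "membrane", "soluble"]
y_pred = ["membrane", "soluble", "soluble", "membrane", "membrane", "soluble"]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Out-of-sample means these quantities are computed on proteins that were held out from training, so they estimate how the model behaves on unseen sequences.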

Preprocessing Scripts

The data come from the UniProt proteome database for the bacterium E. coli. The scripts for scraping, preprocessing, and feature engineering are:

01_scrape_uniprot.py

02_extract_features.py

03_evaluate_features.py
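To give a sense of what feature engineering from an amino acid sequence can look like, here is a small sketch that turns a sequence into numeric features. It is an assumption-laden illustration, not the contents of 02_extract_features.py: the function name `sequence_features`, the feature names, and the choice of features (length, amino acid composition, and mean Kyte-Doolittle hydropathy) are hypothetical, and the README does not state which features the scripts actually compute.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}
AMINO_ACIDS = sorted(KYTE_DOOLITTLE)


def sequence_features(sequence):
    """Map an amino acid sequence to a dict of simple numeric features.

    Hypothetical feature set: length, per-residue fractions, mean hydropathy.
    Non-standard residue codes (for example X or U) are ignored.
    """
    residues = [aa for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    n = len(residues)
    features = {"length": n}
    # Fraction of the sequence made up by each standard amino acid.
    for aa in AMINO_ACIDS:
        features[f"frac_{aa}"] = residues.count(aa) / n
    # Average hydropathy; membrane proteins tend to be more hydrophobic.
    features["mean_hydropathy"] = sum(KYTE_DOOLITTLE[aa] for aa in residues) / n
    return features


# Made-up short peptide for illustration only.
example = sequence_features("MKLLVVAAIL")
print(example["length"], example["frac_L"], example["mean_hydropathy"])
```

Features of this kind are a plausible starting point because membrane-spanning segments are rich in hydrophobic residues, but whether they match the features evaluated in 03_evaluate_features.py is not something the README confirms.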

Model Building Script

The script used for model building is: 04_model_build_tune.py

Model Building and Prediction Jupyter Notebook

To see how I built the model, tuned hyperparameters, and used it for prediction, see the 05_model_build_tune_describe.ipynb notebook.
