predict-gene-expression-using-graph-word2vec

This repository presents an example of predicting protein expression using a graph and the known expression levels on a training set (80% of the nodes).

From Wikipedia (https://en.wikipedia.org/wiki/Gene_expression): "Gene expression is the most fundamental level at which the genotype gives rise to the phenotype, i.e. observable trait."

About the dataset. The dataset is a graph of interactions between proteins. The nodes represent gene-protein pairs (both concepts are considered to play the same role in this setting), and the edges represent interactions between proteins.

Specifically, the graph is given by the list of its edges (edges). The data on the target variable (gene expression) is split into train and test sets.

The dataset has been simplified to meet the following conditions:

the graph is connected (checked in the notebook);
the most important hubs are removed;
the graph density is reduced;
the problem can be solved by classical ML approaches.
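
The connectivity condition above can be checked directly from an edge list. The following is a minimal sketch in plain Python, not the code from the notebook: the function names and the toy edge list are hypothetical, and the notebook may well use a graph library instead.

```python
from collections import defaultdict, deque


def build_adjacency(edges):
    """Build an undirected adjacency map from an iterable of (u, v) pairs."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def is_connected(edges):
    """Return True if all nodes appearing in `edges` form one connected component."""
    adjacency = build_adjacency(edges)
    if not adjacency:
        return True  # convention used here: an empty graph counts as connected
    start = next(iter(adjacency))
    seen = {start}
    queue = deque([start])
    # Breadth-first search from an arbitrary start node.
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return len(seen) == len(adjacency)


# Toy edge list, made up for illustration (not taken from the dataset).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D")]
print(is_connected(toy_edges))
print(is_connected(toy_edges + [("X", "Y")]))
```

Nodes that appear in no edge are invisible to this check, so it only applies when the graph is fully described by its edge list, as is the case here.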

About the problem. Model accuracy is evaluated by the mean squared error (MSE) on the test set.
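
For reference, the MSE is the average of the squared differences between predicted and true values. Below is a minimal sketch in plain Python. The function name and the toy values are chosen here for illustration only, and the notebook may rely on a library implementation instead.

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute the MSE of empty sequences")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values, made up for illustration (not results from this repository).
y_true = [1.0, 2.0, 3.0]
y_pred = [1.0, 2.0, 5.0]
print(mean_squared_error(y_true, y_pred))
```

Because the errors are squared, a single large miss (the last value above) dominates the score.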

Conclusions. The best error on the test set is of the order of 0.2, which is less than one third of the mean of the target variable.

The results show some random behavior due to the 'word2vec' embedding; see the gensim FAQ:

https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vecdoc2vecetc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism

"So, it is to be expected that models vary from run to run, even trained on the same data."
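
One source of run-to-run variation that can be controlled is the corpus fed to word2vec. The sketch below assumes that the node embedding is built DeepWalk-style from random walks over the graph. This is an assumption: this README does not describe how the walks are generated, and the function name, parameters and toy edge list are hypothetical. With a fixed seed, the walks are identical on every run.

```python
import random
from collections import defaultdict


def random_walks(edges, walks_per_node=2, walk_length=5, seed=0):
    """Generate uniform random walks as lists of string tokens (hypothetical helper)."""
    rng = random.Random(seed)  # private generator: no dependence on global state
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    nodes = sorted(adjacency)  # fixed starting order, so the seed determines everything
    walks = []
    for _ in range(walks_per_node):
        rng.shuffle(nodes)
        for start in nodes:
            walk = [start]
            while len(walk) < walk_length:
                # Every node listed in `adjacency` has at least one neighbour.
                walk.append(rng.choice(adjacency[walk[-1]]))
            walks.append([str(node) for node in walk])
    return walks


# Toy edge list, made up for illustration (not taken from the dataset).
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
walks = random_walks(toy_edges, seed=42)
print(walks == random_walks(toy_edges, seed=42))
```

Fixing the walks does not make the embedding itself deterministic. According to the gensim documentation, a fully reproducible Word2Vec run additionally requires a fixed `seed`, a single worker thread (`workers=1`) and, between interpreter launches, a fixed `PYTHONHASHSEED` environment variable.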

The optimal embedding size seems to be 30: for larger sizes, the error on the test set remains stable or increases. The error on the test set is at least 4 times the error on the training set. This property of the model seems inherent and independent of the embedding size.

Feedback and additional questions. All questions about the dataset or the problem should be addressed to Alexander Milenkin:

Telegram: Alerin75infskin
Email: [email protected]

All questions about the source code should be addressed to its author, Alexandre Aksenov:

GitHub: Alexandre-aksenov
Email: [email protected]
