Skip to content

Prediction of gene expression using a Graph data and word2vec for generating features. Different dimensions of the embedding space are compared.

License

Notifications You must be signed in to change notification settings

Alexandre-aksenov/predict-gene-expression-using-graph-word2vec

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

predict-gene-expression-using-graph-word2vec

This repository presents an example of prediction of protein expression. using a graph and known levels of expression on training set (80% of nodes).

From Wikipedia (https://en.wikipedia.org/wiki/Gene_expression) Gene expression is the most fundamental level at which the genotype gives rise to the phenotype, i.e. observable trait.

About the dataset. The dataset is a graph of interactions between proteins. The nodes represent pairs gene-protein (it is considered that both concepts play the same role in our setting) and the edges represent interaction between proteins.

Specifically, the graph is given by the list of its edges (edges). The data about target variable (gene expression) is subdivided into train и test sets.

The dataset has been simplified to meet the following conditions

  • the graph is connected (tested in the notebook);
  • most important hubs are removed;
  • the graph density is reduced;
  • the problem can be solved by classical ML approaches.

About the problem. The model accuracy is evaluated using its MSE on the test set.

Conclusions. The best error on test set is of order of 0.2, which is less than one third of the mean of the target variable.

The results show some random behavior due to the embedding 'word2vec', see:

https://github.com/piskvorky/gensim/wiki/Recipes-&-FAQ#q11-ive-trained-my-word2vecdoc2vecetc-model-repeatedly-using-the-exact-same-text-corpus-but-the-vectors-are-different-each-time-is-there-a-bug-or-have-i-made-a-mistake-2vec-training-non-determinism " So, it is to be expected that models vary from run to run, even trained on the same data."

The optimal embedding size seems to be 30. Indeed, the error on test set remains stable or increases afterwards. The error on test set is at least 4 times the error on training set. This property of the model seems inherent and independent of the the embedding size.

Feedback and additional questions. All questions about the dataset or the problem should be adressed to Alexander Milenkin:

All questions about the source code should be adressed to its author Alexandre Aksenov:

About

Prediction of gene expression using a Graph data and word2vec for generating features. Different dimensions of the embedding space are compared.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published