clei2block: deep-neural protein translation

What is clei2block?

clei2block is a method to predict protein abundance across samples from RNA-seq data. Due to the complexity of gene regulation, the discrepancy between mRNA and protein levels has been a standing question since both genome-wide measurements became available. There are two types of mRNA-vs-protein relationships: (i) across genes within a sample (ii) across samples for each gene. Our method models the latter relationship using deep-neural networks.

Installation

Requirements

A Linux system equipped with Nvidia GPU
Docker
R version 3.6.2 or later

Download GitHub repository

git clone https://github.com/stasaki/clei2block.git

Working on a demo data

R notebook for this tutorial is available at code/demo-code.html and the expected outcome is code/demo-code-expected-output.nb.html

Training model

Change current working directory to clei2block

cd clei2block

Command-line arguments for clei2block are as follows

python clei2block.py <path to input data> <path to output> <dimensionality of the latent space> <dimensionality of hidden layer> <if you want to apply the model to other data specify the path to the data (Optional)> <input features for latent embedding> <input features for linear regression>

Inputs

latent embedding
- train_vae_x_mRNA.txt.gz: input feature matrix for latent embedding (gene x sample)
- train_vae_x_mRNA_col.txt.gz: sample id
- train_vae_x_mRNA_row.txt.gz: gene id
linear regression module
- train_lr_x_mRNA.txt.gz: input feature matrix for linear regression (gene x sample)
- train_lr_x_mRNA_col.txt.gz: sample id
- train_lr_x_mRNA_row.txt.gz: gene id
protein outcome
- train_y.txt.gz: actual protein matrix (gene x sample)
- train_y_col.txt.gz: sample id
- train_y_row.txt.gz: gene id

Sample orders must be identical for all inputs. Gene orders must be identical for linear regression inputs and protein outcomes. Test data should follow the same format, but the file name begins with test instead of train.

Run clei2blck model with the demo data through docker

time docker run --rm --gpus all -v $(pwd)/../:/var  pytorch/pytorch:1.2-cuda10.0-cudnn7-runtime  /bin/bash -c "pip install pandas; python /var/functions/clei2block.py /var/demo-data/ /var/demo-out/ 100 800 none mRNA mRNA"

The training of the model takes about 3 minutes. This script runs clei2block 5 times and the results will be in subdirectories from 0 to 4. In each subdirectory, you will see the following outputs.

trained_model.pt: trained pytorch model
train_loss.txt: loss for training data
valid_loss.txt: loss for validation data
val_index.txt: sample index used for validation data
test_prediction.npy: predicted protein values for testing data stored as NumPy array

Check if the loss is decreased as expected

Launch R

Install required R libraries

install.packages("ggplot2")
install.packages("dplyr")
install.packages("tidyr")
install.packages("reticulate")
np=import("numpy")

Run the following code in R

library(ggplot2)
library(dplyr)
library(tidyr)

# collect training loss
list.files("../demo-out/",pattern = "train_loss.txt",full.names = TRUE,recursive = TRUE)%>%
  lapply(., function(x){
    read.delim(x,sep = ",",header = F)%>%
      mutate(trial=dirname(x)%>%basename(),
             step=1:n())%>%
      dplyr::select(trial,step,train_loss = V1)%>%
      return()
  })%>%
  bind_rows() -> loss_ds

# collect validation loss
list.files("../demo-out/",pattern = "valid_loss.txt",full.names = TRUE,recursive = TRUE)%>%
  lapply(., function(x){
    read.delim(x,sep = ",",header = F)%>%
      mutate(trial=dirname(x)%>%basename(),
             step=1:n())%>%
      dplyr::select(trial,step,valid_loss = V1)%>%
      return()
  })%>%
  bind_rows()%>%
  inner_join(loss_ds,.,by=c("trial","step"))%>%
  group_by(trial)%>%
  filter(1:n() <= which(valid_loss==min(valid_loss)))-> loss_ds

loss_ds%>%gather(loss_type,loss,-trial,-step)%>%
    ggplot(.,aes(as.factor(step),loss,color=loss_type))+
  geom_boxplot()+
  xlab("Training epochs")

Check the model performance for testing data

Run the following code in R

library(reticulate)
np=import("numpy")

# collect test prediction
list.files("../demo-out/",pattern = "test_prediction.npy",full.names = TRUE,recursive = TRUE)%>%
  lapply(., function(x){
    np$load(x)%>%
      as.matrix()%>%
      t()%>%
      return()
  }) -> pred_protein

# consensus prediction
for(i in 2:length(pred_protein)){
  pred_protein[1][[1]]=pred_protein[1][[1]]+pred_protein[i][[1]]
}
pred_protein = pred_protein[1][[1]] # this is a protein by sample matrix predicted by the clei2block model


# read actual measurements
acutal_protein = read.delim("../demo-data/test_y.txt.gz",header = F)%>%as.matrix()
acutal_mRNA = read.delim("../demo-data/test_lr_x_mRNA.txt.gz",header = F)%>%as.matrix()

# calculate correlation between predicted protein and acutal protein measurement for each gene across samples
lapply(1:nrow(acutal_protein), function(i){
  cor(pred_protein[i,],acutal_protein[i,])
})%>%unlist() -> pre_cor

# calculate correlation between acutal protein measurement and thier corresponding mRNA for each gene across samples
lapply(1:nrow(acutal_protein), function(i){
  cor(acutal_mRNA[i,],acutal_protein[i,])
})%>%unlist() -> mRNA_cor

# visualize correlation
ds = data.frame(cor=pre_cor,
           data = "predicted protein")
ds = bind_rows(data.frame(cor=mRNA_cor,
           data = "actual mRNA"),ds)
ds%>%
  ggplot(.,aes(data,cor,fill=data))+
  geom_boxplot()+ylab("Correlation with actual protein")

How to use a pre-trained model

You can use a pre-trained model with the following command.

python clei2block-predict.py <path to input data> <path to output> <path to pretrained model> <input features for latent embedding> <input features for linear regression>

The order of input variables must be identical to the input used to train the model. Also if you use multiple data files for the input of latent embedding or linear regression, the order of files in the command must be identical with the training.

To apply a pre-trained model to the demo data through docker, run the following command.

time docker run --rm --gpus all -v $(pwd)/../:/var  pytorch/pytorch:1.2-cuda10.0-cudnn7-runtime  /bin/bash -c "pip install pandas; python /var/functions/clei2block-predict.py /var/demo-data/test_ /var/demo-out-predict/ /var/demo-out/0/trained_model.pt mRNA mRNA"

The training of the model takes about 3 minutes. You will see the following outputs.

prediction.npy: predicted protein values stored as NumPy array

How to use a pre-trained model with ROSMAP brain data

Download models trained with ROSMAP brain data

The clei2block models for ROSMAP brain data and input variable names are available at http://dx.doi.org/10.7303/syn23624037.

You can programmatically download data and models via Synapse API client

pip install synapseclient

mkdir ROSMAP_model
cd ROSMAP_model
synapse get -r syn23624087 
cd ..

mkdir Variable_IDs
cd Variable_IDs
synapse get -r syn26454410
cd ..

Access to ROSMAP brain data

The RNA-seq and protein data used for training models, the predicted proteome data, and the estimated pseudotimes are distributed under the controlled data restrictions with a requirement of the Data Use Agreement. The data can be requested at the RADC Resource Sharing Hub at www.radc.rush.edu.

Compute principal components (PCs) of transcriptome features

Files nessesary to compute PCs are stored at ./pca/ and ./functions/compute_pc.R.

Run ROSMAP clei2block model

You can build or use ROSMAP clei2block model using code/train.R and code/predict.R.

Name		Name	Last commit message	Last commit date
Latest commit History 17 Commits
code		code
demo-data		demo-data
demo-out-predict		demo-out-predict
demo-out		demo-out
functions		functions
pca		pca
LICENSE		LICENSE
README.html		README.html
README.md		README.md
logo1.png		logo1.png
overview.png		overview.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

clei2block: deep-neural protein translation

What is clei2block?

Installation

Requirements

Download GitHub repository

Working on a demo data

Training model

Check if the loss is decreased as expected

Check the model performance for testing data

How to use a pre-trained model

How to use a pre-trained model with ROSMAP brain data

Download models trained with ROSMAP brain data

Access to ROSMAP brain data

Compute principal components (PCs) of transcriptome features

Run ROSMAP clei2block model

About

Releases 2

Packages

Contributors 2

Languages

License

stasaki/clei2block

Folders and files

Latest commit

History

Repository files navigation

clei2block: deep-neural protein translation

What is clei2block?

Installation

Requirements

Download GitHub repository

Working on a demo data

Training model

Check if the loss is decreased as expected

Check the model performance for testing data

How to use a pre-trained model

How to use a pre-trained model with ROSMAP brain data

Download models trained with ROSMAP brain data

Access to ROSMAP brain data

Compute principal components (PCs) of transcriptome features

Run ROSMAP clei2block model

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 2

Packages 0

Contributors 2

Languages

Packages