RVAgene models gene expression dynamics in single-cell or bulk data. Read the paper here.
- Python 3
- numpy, matplotlib, pytorch, scikit-learn, tqdm
- GPU (optional)
- Jupyter notebook demo of RVAgene here
- Jupyter notebook demo of processing bulk Single cell data and running RVAgene here
- Or on the command line
- python gen_synthetic_data.py <dataset_name> e.g.
python gen_synthetic_data.py demosim
- python train_and_gen.py <dataset_name> e.g.
python train_and_gen.py demosim
- python gen_synthetic_data.py <dataset_name> e.g.
data
: contains example synthetic gene expression time series data with 6 inherent clusters
rvagene
: contains code for recurrent variational autoencoder
train_and_gen.py
: code demonstrating training RVAgene, unsupervised clustering on latent space using K-means and Generating new gene expression data by sampling and decoding points from latent space.
gen_synthetic_data.py
: code to generate synthetic data with cluster structure as described in the paper.
figs
: contains figures generated by the demo code.
demo.ipynb
: Demonstration on the whole synthetic data generation, RVAgene training, clustering on latent space and new cluster specific data generation process
single_cell_demo.ipynb
:demo of processing bulk Single cell data and running RVAgene
sequence_length
: length of the input sequence
number_of_features
: number of features per timepoint per gene i.e. 1
hidden_size
: hidden size of the RNN
hidden_layer_depth
: number of layers in RNN (1 is enough)
latent_length
: latent vector length
batch_size
: number of genes in a single batch. IMPORTANT: last batch will be dropped is it is not a divisor of number of training genes.
learning_rate
: the learning rate of the module
n_epochs
: Number of iterations/epochs
dropout_rate
: The probability of a node being dropped-out
optimizer
: ADAM/ SGD optimizer to reduce the loss function
loss
: SmoothL1Loss / MSELoss / ReconLoss / any custom loss which inherits from _Loss
class
boolean cuda
: to be run on GPU or not
print_every
: The number of iterations after which loss should be printed for each epoch
boolean clip
: Gradient clipping to overcome explosion
max_grad_norm
: The grad-norm to be clipped if using clipping
dload
: Download directory where models are to be dumped
log_file
: File to log training loss , default None i.e. STDOUT
scRNA seq Data used in the paper from : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE65525 (Klein et. al. 2015) Bulk RNA seq Data used in the paper from : https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE98622 (Liu. et. al. , 2017)
- Thanks to open source implementation of recurrent VAE at https://github.com/tejaslodaya/timeseries-clustering-vae
- Relevant research works as cited in the work.