Variational autoencoders learn universal latent representations of metabolomics data (supplementary code)

krumsieklab/mtVAE

mtVAE

This repository contains the scripts needed to replicate the findings of the paper Gomari, Schweickart et al., "Variational autoencoders learn universal latent representations of metabolomics data".


Requirements:


1. git repository

Clone a local copy of the git repository:

git clone https://github.com/krumsieklab/mtVAE

(or use a git GUI client of your choice)


2. Environment setup

Set up the Python environment

In a terminal, switch to the directory of the local git repository.

conda env create --force --file environment.yml

R setup
  1. Open mtvae.Rproj
  2. Run R_setup.R

Note: the R setup has been tested under R version 4.0.0 and RStudio version 1.3.1073.


3. Activating the conda environment to access jupyter notebooks

conda activate mtvae_env

jupyter notebook



4. Instructions for running the scripts

  1. Place datasets into data/.
  2. Once you have access to the TwinsUK, type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data, run the scripts in increasing order of their file prefixes, starting from 01_train_VAE.ipynb.
  • Notes:
    • Pre-trained models from 01_train_VAE.ipynb can be found under models/
    • All R scripts should be run from within RStudio
    • 00_optimize_VAE_hyperparameters.ipynb can be skipped and should only be used as a guide to select hyperparameters.
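
The run order described above can be sketched in Python. This is only an illustration and not part of the repository: the helper names `ordered_scripts` and `scripts_in` are hypothetical, and skipping every `00_` script is an assumption based on the instruction to start from 01_train_VAE.ipynb.

```python
import re
from pathlib import Path

# Matches the numeric prefix of names such as "01_train_VAE.ipynb".
PREFIX = re.compile(r"^(\d+)_")


def ordered_scripts(names, skip_prefixes=("00",)):
    """Return numbered script names sorted by prefix, dropping optional steps."""
    selected = []
    for name in names:
        match = PREFIX.match(name)
        if match is None or match.group(1) in skip_prefixes:
            continue  # not a numbered script, or an optional 00_ step
        selected.append((int(match.group(1)), name))
    # Ties (for example the three 04_ notebooks) are ordered alphabetically.
    return [name for _, name in sorted(selected)]


def scripts_in(directory="."):
    """List numbered .ipynb and .R files in a local clone (path is an assumption)."""
    names = [p.name for p in Path(directory).iterdir() if p.suffix in {".ipynb", ".R"}]
    return ordered_scripts(names)
```

Calling `scripts_in(".")` from the root of a local clone would print nothing by itself; it returns the list of scripts in the order they would be run, which can then be opened in Jupyter or RStudio one by one.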

| Name | Description |
| --- | --- |
| 00_optimize_KPCA_hyperparameters.ipynb | Optimize for KPCA hyperparameters using TwinsUK train data. (Runtime: ~2h on a MacBook Pro) |
| 00_optimize_VAE_hyperparameters.ipynb | Optimize for VAE hyperparameters using TwinsUK train data. (Runtime: 1h15m on a MacBook Pro) |
| 01_train_VAE.ipynb | Train VAE model on TwinsUK data and calculate evaluation metrics. Note that this requires access to TwinsUK, which should be requested separately from https://twinsuk.ac.uk/. |
| 02_reconstruct_data.ipynb | Generate TwinsUK data reconstructions using trained VAE, PCA, and KPCA models. Used for model performance assessments. |
| 03_assess_model_performance.R | Compute mean squared error (MSE) and correlation matrix MSE (CM-MSE) for VAE, PCA, and KPCA. This includes the calculation of MSE and CM-MSE for varying latent space dimensionality d. |
| 04_calculate_SAGE_values_VAE.ipynb | Calculate VAE SAGE values using TwinsUK test data. This script should be parallelized, due to its long runtime. Pre-computed VAE SAGE values can be found under results/sage_values. (Runtime: if all instances are parallelized ~7.5h) |
| 04_calculate_SAGE_values_PCA.ipynb | Calculate PCA SAGE values using TwinsUK test data. Pre-computed PCA SAGE values can be found under results/sage_values. (Runtime: if all instances are parallelized ~1.5h) |
| 04_calculate_SAGE_values_KPCA.ipynb | Calculate KPCA SAGE values using TwinsUK test data. Pre-computed KPCA SAGE values can be found under results/sage_values. (Runtime: if all instances are parallelized ~6h) |
| 05_interpret_latent_space.R | Create SAGE value heatmaps and alluvial plots for VAE, PCA, and KPCA. |
| 06_encode_data.ipynb | Generate type 2 diabetes (T2D), schizophrenia, and acute myeloid leukemia (AML) data encodings using VAE, PCA, and KPCA models. |
| 07_associate_dimensions_with_diseases.R | Associate VAE, PCA, and KPCA encodings with patient groups from T2D, schizophrenia, and AML data. This includes T2D clinical variables (e.g. HbA1c %) and AML mutations. |
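
To make the two reconstruction metrics named for 03_assess_model_performance.R concrete, here is a minimal Python sketch. It is not the repository's implementation (that script is written in R). It assumes that CM-MSE is the mean squared error between the Pearson correlation matrix of the original data and that of the reconstructed data, taken over the full matrix. The function names are hypothetical, and data are plain lists of rows (samples) with one column per metabolite.

```python
from math import sqrt


def mse(a, b):
    """Mean squared error over all entries of two equally shaped matrices."""
    diffs = [(x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def pearson(u, v):
    """Pearson correlation of two equally long sequences (fails on constant input)."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((x - mean_u) * (y - mean_v) for x, y in zip(u, v))
    norm_u = sqrt(sum((x - mean_u) ** 2 for x in u))
    norm_v = sqrt(sum((y - mean_v) ** 2 for y in v))
    return cov / (norm_u * norm_v)


def correlation_matrix(data):
    """Feature-by-feature correlation matrix of a samples-by-features table."""
    columns = list(zip(*data))
    return [[pearson(ci, cj) for cj in columns] for ci in columns]


def cm_mse(original, reconstructed):
    """Assumed CM-MSE: MSE between the two correlation matrices."""
    return mse(correlation_matrix(original), correlation_matrix(reconstructed))
```

Under this reading, MSE measures how closely individual values are reconstructed, while CM-MSE measures how well the correlation structure between metabolites is preserved, so a model can score well on one and poorly on the other.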

Other files

| Name | Description |
| --- | --- |
| models.py | Contains VAE, PCA, and KPCA model classes. |
| metric_functions.py | Functions used for model assessment in python can be found here. |
| helper_functions.R | R functions that are required for the calculation of evaluation results and the construction of plots can be found here. |
