This repository contains the source code and results for iterative clustering of multiomics data, with interpretation using the Biomedical Data Translator. The project began at the 2022 Bio-IT FAIR Data Hackathon. We use data from the Gabriella Miller Kids First Data Resource Center, which is supported by the NIH Common Fund. This resource contains data from over 11,000 samples, including DNA and RNA as well as clinical information.
In this project, we initially focused on clustering gene expression profiles from RNA-Seq data collected from pediatric tumor samples. We then created a simple, interpretable predictive model to determine the gene expression signatures that differentiate the clusters from one another. To gain additional translational insight into the clusters, we sought to annotate the important genes from each cluster with data from the NCATS Biomedical Data Translator (github org). Our analyses were executed on Cavatica, a cloud-based data analysis and sharing platform.
Our workflow consisted of the following core steps:
- Wrangle the NICHD Kids First data
- See the following notebooks
- Perform unsupervised clustering of gene expression using pvclust, a hierarchical clustering approach that uses bootstrap resampling to assess the statistical significance of clusters
- Develop a classification model using xgboost to predict the cluster assignments from the gene expression data. The xgboost model provides feature importance metrics that we use to identify the genes that differentiate the clusters
- Annotate results by querying the NCATS Biomedical Data Translator
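The annotation step can be sketched as building a query for the Translator. This is a minimal sketch, not the code used in this repository: it assumes the one-hop query-graph shape of the Translator Reasoner API (TRAPI), the function name `build_gene_query` is hypothetical, the gene identifiers are placeholders rather than results from our analysis, and the target category and predicate are illustrative choices. It only builds the JSON payload and does not contact any Translator service.

```python
import json


def build_gene_query(gene_curies, target_category="biolink:Disease",
                     predicate="biolink:related_to"):
    """Build a one-hop TRAPI-style query: given genes -> nodes of one category.

    Hypothetical helper for illustration; the category and predicate
    defaults are assumptions, not choices documented by this project.
    """
    return {
        "message": {
            "query_graph": {
                "nodes": {
                    # n0 is pinned to the genes we want to annotate
                    "n0": {"ids": list(gene_curies),
                           "categories": ["biolink:Gene"]},
                    # n1 is left open so the service can fill it in
                    "n1": {"categories": [target_category]},
                },
                "edges": {
                    "e0": {"subject": "n0", "object": "n1",
                           "predicates": [predicate]},
                },
            }
        }
    }


# Placeholder CURIEs (TP53 and MYC), NOT genes reported by this project
top_genes = ["NCBIGene:7157", "NCBIGene:4609"]
query = build_gene_query(top_genes)
print(json.dumps(query, indent=2))
```

The resulting dictionary could then be sent as the JSON body of a POST request to a Translator query endpoint, with one query per cluster built from that cluster's most important genes.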
In the future we hope to integrate additional omics data modalities available through the Kids First Data Resource, such as somatic mutation calls from tumor sequencing, HPO phenotypes, and patient clinical characteristics. We also hope to extend the analysis to additional disease states, such as Down syndrome, which is the focus of the INCLUDE project.
We would also like to explore feature selection methods, such as recursive feature elimination, to reduce the number of genes required to make cluster predictions. We would then iterate on the clustering process to see whether pvclust performs better on the reduced feature set.
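Recursive feature elimination repeatedly refits a model and drops the least important feature until the desired number remain. The sketch below illustrates the loop in plain Python under several assumptions: a small logistic regression fitted by gradient descent stands in for xgboost, the absolute weight stands in for the xgboost feature importance, the problem is reduced to two clusters, and the gene names and expression values are made-up toy data.

```python
import math


def fit_logistic_weights(X, y, lr=0.1, epochs=200):
    """Fit logistic regression by batch gradient descent.

    Stand-in for the xgboost model; returns one weight per feature.
    """
    n_features = len(X[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for row, label in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, row))
            residual = 1.0 / (1.0 + math.exp(-z)) - label
            grad_b += residual
            for j, xj in enumerate(row):
                grad_w[j] += residual * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w


def recursive_feature_elimination(X, y, names, n_keep,
                                  fit_weights=fit_logistic_weights):
    """Refit on the surviving features and drop the weakest one each round."""
    kept = list(range(len(names)))
    while len(kept) > n_keep:
        X_sub = [[row[j] for j in kept] for row in X]
        weights = fit_weights(X_sub, y)
        # Importance is |weight|; ties are broken by position
        worst = min(range(len(kept)), key=lambda k: abs(weights[k]))
        del kept[worst]
    return [names[j] for j in kept]


# Toy data: hypothetical gene names; only GENE_A tracks the cluster label
genes = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
X = [
    [-1, 1, 1, 1],
    [-1, -1, 1, -1],
    [-1, 1, -1, -1],
    [-1, -1, -1, 1],
    [1, 1, 1, 1],
    [1, -1, 1, -1],
    [1, 1, -1, -1],
    [1, -1, -1, 1],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]  # two clusters

selected = recursive_feature_elimination(X, y, genes, n_keep=1)
print(selected)
```

In practice the `fit_weights` argument would wrap the real classifier and its importance metric, and the reduced gene set would be passed back to the clustering step.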
- Cloud platform: Cavatica
- Biomedical Data Translator: github org
- Python
- R