Iterative cluster analysis using multi-omics modalities and interpretation with the data translator

This repository contains the source code and results of an iterative clustering analysis of multi-omics data, with interpretation through the Biomedical Data Translator project. This project began at the 2022 Bio-IT FAIR Data Hackathon. We use data from the Gabriella Miller Kids First Data Resource Center, which is supported by the NIH Common Fund. This resource contains data from over 11,000 samples, including DNA and RNA as well as clinical information.

In this project, we initially focused on clustering gene expression profiles from RNA-Seq data collected from pediatric tumor samples. We then created a simple, interpretable predictive model to determine the gene expression signatures that differentiate the clusters from one another. To gain additional translational insight into the clusters, we sought to annotate the important genes from each cluster with data from the NCATS Biomedical Data Translator. Our analyses were executed on the Cavatica cloud-based data analysis and sharing platform.

Summary

Our workflow consisted of the following core steps:

1. Wrangle the NICHD Kids First data
   - See the notebooks in the notebooks directory
2. Perform unsupervised clustering of gene expression using pvclust, a hierarchical clustering approach that implements a bootstrapping method for assessing the statistical significance of clusters
3. Develop a classification model using xgboost to predict the cluster assignments from the gene expression data. The xgboost model provides feature importance metrics that we use to identify the genes that differentiate the clusters
4. Annotate results by querying the NCATS Biomedical Data Translator
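
To illustrate the idea behind the bootstrapped clustering step, here is a minimal sketch in plain Python. It is not the pvclust implementation and not the code in our notebooks: it clusters samples with average linkage on correlation distance, resamples genes with replacement, and reports how often each cluster reappears (a plain bootstrap probability, without pvclust's multiscale "approximately unbiased" correction). The function names and the toy expression matrix are assumptions made for illustration only.

```python
import random
from itertools import combinations


def correlation_distance(x, y):
    """Return 1 - Pearson correlation of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 1.0  # treat a constant profile as uncorrelated
    return 1.0 - cov / (vx * vy) ** 0.5


def hierarchical_clusters(rows):
    """Average-linkage clustering of the columns (samples) of `rows`.

    Returns every cluster formed by a merge, as frozensets of sample
    indices, excluding the root cluster that contains all samples.
    """
    n_samples = len(rows[0])
    cols = [[r[j] for r in rows] for j in range(n_samples)]
    dist = {frozenset(p): correlation_distance(cols[p[0]], cols[p[1]])
            for p in combinations(range(n_samples), 2)}

    def linkage(a, b):
        return sum(dist[frozenset((i, j))] for i in a for j in b) / (len(a) * len(b))

    active = [frozenset([j]) for j in range(n_samples)]
    found = set()
    while len(active) > 2:  # the final merge would only produce the root
        a, b = min(combinations(active, 2), key=lambda pair: linkage(*pair))
        active = [c for c in active if c not in (a, b)] + [a | b]
        found.add(a | b)
    return found


def bootstrap_support(rows, n_boot=200, seed=0):
    """Fraction of gene-level bootstrap replicates that recover each cluster."""
    rng = random.Random(seed)
    reference = hierarchical_clusters(rows)
    counts = {c: 0 for c in reference}
    for _ in range(n_boot):
        replicate = [rng.choice(rows) for _ in rows]  # resample genes
        for c in hierarchical_clusters(replicate) & reference:
            counts[c] += 1
    return {c: counts[c] / n_boot for c in reference}


def toy_expression(n_genes=40, seed=1):
    """Hypothetical matrix: genes x 6 samples, two groups of 3 samples."""
    rng = random.Random(seed)
    rows = []
    for g in range(n_genes):
        shift = 3.0 if g % 2 == 0 else -3.0
        rows.append([(shift if j < 3 else -shift) + rng.gauss(0, 1)
                     for j in range(6)])
    return rows


support = bootstrap_support(toy_expression())
for cluster, bp in sorted(support.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(cluster), round(bp, 2))
```

Clusters with support close to 1 are the ones that persist when the gene set is perturbed; in the workflow above, the cluster labels chosen from such an analysis become the targets for the classification model.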

Future directions

Data types

In the future we hope to integrate additional omics data modalities available through the Kids First Data Resource, such as somatic mutation calls from tumor sequencing, HPO phenotypes, and patient clinical characteristics. We also hope to extend the analysis to additional disease areas, such as the INCLUDE project, which focuses on Down syndrome.

Methodology

In the future we would like to explore the use of feature selection methods, such as recursive feature elimination, to reduce the number of genes required to make cluster predictions. We would then iterate on the clustering process to see if pvclust performs better on the reduced feature set.
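
As a rough illustration of recursive feature elimination, the sketch below repeatedly fits a model, drops the feature with the smallest importance, and refits on what remains. It is only a sketch under stated assumptions: a small logistic regression written in plain Python stands in for xgboost, absolute weight stands in for the feature importance metric, the problem is reduced to two clusters, and the data are synthetic. None of the names below come from our notebooks.

```python
import math
import random


def fit_logistic(X, y, lr=0.5, n_iter=200):
    """Fit logistic regression by batch gradient descent; return the weights."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(n_iter):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            err = 1.0 / (1.0 + math.exp(-z)) - yi  # predicted prob - label
            grad_b += err
            for j in range(d):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w


def recursive_feature_elimination(X, y, n_keep):
    """Refit on the surviving features and drop the weakest one each round."""
    kept = list(range(len(X[0])))
    while len(kept) > n_keep:
        X_sub = [[row[j] for j in kept] for row in X]
        w = fit_logistic(X_sub, y)
        weakest = min(range(len(kept)), key=lambda k: abs(w[k]))
        del kept[weakest]
    return kept


def toy_data(n=200, n_noise=4, seed=0):
    """Hypothetical data: features 0 and 1 track the label, the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        sign = 1.0 if label else -1.0
        row = [sign + rng.gauss(0, 0.3), -sign + rng.gauss(0, 0.3)]
        row += [rng.gauss(0, 1) for _ in range(n_noise)]
        X.append(row)
        y.append(label)
    return X, y


X, y = toy_data()
selected = recursive_feature_elimination(X, y, n_keep=2)
print("features kept:", selected)
```

In the setting described above, the columns would be genes, the labels would be cluster assignments, and the reduced gene set would then be passed back to the clustering step for another iteration.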

