This project was part of the March 2025 CMU Hackathon in partnership with DNAnexus.
Hackathon Team: Samuel Blechman, Nicholas P. Cooley, Aung Myat Phyo, Ciara O'Donoghue, Glenn Ross-Dolan, Rebecca Satterwhite, Rishika Gupta
With the increasing availability of multimodal patient data, non-specialists, including health care professionals, receive an abundance of transdisciplinary information without a corresponding ability to analyze and interpret it. Traditional statistical methods focus primarily on correlation-based associations, which makes it difficult to infer causal mechanisms in complex patient trajectories, and raw EHR data presents further challenges that must be addressed before causal discovery is effective. To address these challenges, we present a causal discovery pipeline that automates the preprocessing, causal inference, and visualization steps required to analyze longitudinal data, using the MIMIC-III dataset [Johnson et al., 2022] as an example. Our approach leverages DNAnexus for scalable computation and employs Tetrad [Ramsey et al., 2018], a well-established causal discovery tool, to identify relationships between clinical features, laboratory results, microbiological events, and patient outcomes. Raw CSVs are converted into an SQL database to support efficient querying and data retrieval.
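The CSV-to-SQL step can be illustrated with a minimal sketch. It assumes SQLite as the database engine and a CSV with a header row; the function name `load_csv_into_sqlite` and the file paths are hypothetical and are not taken from the pipeline's own code, which may use a different engine or schema.

```python
import csv
import sqlite3


def load_csv_into_sqlite(csv_path, db_path, table_name):
    """Copy one CSV file (with a header row) into a SQLite table.

    Returns the number of rows loaded. All values are stored as text;
    a real pipeline would likely declare column types.
    """
    conn = sqlite3.connect(db_path)
    try:
        with open(csv_path, newline="") as handle:
            reader = csv.reader(handle)
            header = next(reader)  # first row holds the column names
            columns = ", ".join(f'"{name}"' for name in header)
            placeholders = ", ".join("?" for _ in header)
            conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
            conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
            conn.executemany(
                f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
            )
        conn.commit()
        count = conn.execute(f'SELECT COUNT(*) FROM "{table_name}"').fetchone()[0]
    finally:
        conn.close()
    return count
```

A call such as `load_csv_into_sqlite("LABEVENTS.csv", "mimic.db", "LABEVENTS")` (paths hypothetical) would then allow the table to be queried with ordinary SQL instead of re-reading the CSV for every analysis.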
Tetrad
Tetrad is a software suite for simulating, estimating, and searching for graphical causal models of statistical data. The aim of the program is to provide sophisticated methods in a friendly interface requiring very little statistical sophistication of the user and no programming knowledge. Tetrad is open-source, free software that performs many of the functions in commercial programs.
See here for Tetrad User Manual
What is DNAnexus
sudo apt install openjdk-17-jdk
More information regarding Java installation can be found HERE
sudo apt install r-base
More information regarding R installation can be found HERE
wget https://s01.oss.sonatype.org/content/repositories/releases/io/github/cmu-phil/causal-cmd/1.12.0/causal-cmd-1.12.0-jar-with-dependencies.jar
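Once the jar is downloaded, a run can be assembled roughly as follows. This is a sketch under stated assumptions, not the pipeline's own wrapper: the helper `build_causal_cmd` is hypothetical, and the choice of the `fci` algorithm, the `fisher-z-test` independence test, continuous data, and a comma delimiter are assumptions that should be checked against the causal-cmd documentation for the installed version.

```python
def build_causal_cmd(jar_path, dataset_path, prefix, knowledge_path=None):
    """Return the argument list for one causal-cmd run (nothing is executed here)."""
    command = [
        "java", "-jar", jar_path,
        "--algorithm", "fci",        # assumed; FCI-family searches emit o-> and <-> edges
        "--test", "fisher-z-test",   # assumed independence test for continuous data
        "--data-type", "continuous", # assumed data type
        "--dataset", dataset_path,
        "--delimiter", "comma",      # assumed input delimiter
        "--prefix", prefix,          # prefix for the output file name
    ]
    if knowledge_path is not None:
        # Optional background knowledge file
        command += ["--knowledge", knowledge_path]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`. Building the arguments as a list rather than a single string avoids shell quoting problems with file paths.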
See the Example User Input folder for an example R script, GenerateExampleUserInput.R, that writes this file, and user_input_yaml.txt for an example of the file format. NOTE: examples of specific arguments and options can be found in the R script.
To add knowledge parameters, the user must create a knowledge.txt file. This option lets the user supply background knowledge about the data, for example information about the time order of the measured variables. See 'knowledge.txt' for reference.
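As an illustration of what such a file can contain, the sketch below renders temporal tiers plus optional forbidden and required edges. The layout (`/knowledge`, `addtemporal`, `forbiddirect`, `requiredirect`) follows the knowledge-file format described in the Tetrad manual as we understand it; the helper `format_knowledge` and the tier assignment in the usage note are hypothetical, so compare the result with the project's own 'knowledge.txt' before relying on it.

```python
def format_knowledge(tiers, forbidden=(), required=()):
    """Render background knowledge as text in the assumed Tetrad knowledge layout.

    tiers: list of lists of variable names, earliest tier first.
    forbidden / required: iterables of (cause, effect) name pairs.
    """
    lines = ["/knowledge", "addtemporal"]
    for index, variables in enumerate(tiers, start=1):
        # One line per tier: the tier number followed by its variables
        lines.append(f"{index} " + " ".join(variables))
    lines.append("")
    lines.append("forbiddirect")
    lines.extend(f"{cause} {effect}" for cause, effect in forbidden)
    lines.append("")
    lines.append("requiredirect")
    lines.extend(f"{cause} {effect}" for cause, effect in required)
    return "\n".join(lines) + "\n"
```

For example, `format_knowledge([["LAB_Lactate", "LAB_Glucose"], ["mortality_in_hospital"]])` places the two laboratory variables in an earlier tier than the outcome. In Tetrad, variables in a later tier cannot be causes of variables in an earlier tier, so a tiering like this would rule out edges pointing from the outcome back to a laboratory value.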
We tested our pipeline on data from the MIMIC-III dataset [Johnson et al., 2022].
Results are written to example_output_name_out.txt. 87 causal relationships were identified.
================================================================================ Graph Edges:
- "LAB_AlkalinePhosphatase" o-> "LAB_Bilirubin.Total"
- "LAB_AlkalinePhosphatase" --> "LAB_Calcium.Total"
- "LAB_AlkalinePhosphatase" --> "LAB_Sodium"
- "LAB_AnionGap" <-> "LAB_Bicarbonate"
- "LAB_AnionGap" <-> "LAB_Creatinine"
- "LAB_AnionGap" <-> "LAB_Lactate"
- "LAB_AnionGap" <-> "LAB_Phosphate"
- "LAB_AsparateAminotransferase.AST." --> "LAB_AlanineAminotransferase.ALT."
- "LAB_Basophils" o-> "LAB_Eosinophils"
- "LAB_Bicarbonate" <-> "LAB_Chloride"
- "LAB_Bicarbonate" <-> "mortality_in_hospital"
- "LAB_Calcium.Total" <-> "LAB_CreatineKinase.CK."
- "LAB_Calcium.Total" --> "LAB_Lymphocytes"
- "LAB_Calcium.Total" <-> "LAB_Magnesium"
- "LAB_Calcium.Total" --> "LAB_Monocytes"
- "LAB_CalculatedTotalCO2" o-> "LAB_BaseExcess"
- "LAB_CalculatedTotalCO2" o-> "LAB_Bicarbonate"
- "LAB_CalculatedTotalCO2" o-> "LAB_pCO2"
- "LAB_Creatinine" <-> "LAB_UreaNitrogen"
- "LAB_Eosinophils" <-> "LAB_Neutrophils"
- "LAB_Eosinophils" <-> "LAB_PlateletCount"
- "LAB_Eosinophils" <-> "mortality_in_hospital"
- "LAB_Hematocrit" o-> "LAB_Hemoglobin"
- "LAB_Hematocrit" o-> "LAB_Phosphate"
- "LAB_Hemoglobin" --> "LAB_Lymphocytes"
- "LAB_Hemoglobin" --> "LAB_UreaNitrogen"
- "LAB_INR.PT." --> "LAB_PT"
- "LAB_INR.PT." --> "LAB_PTT"
- "LAB_Lactate" --> "LAB_AsparateAminotransferase.AST."
- "LAB_Lactate" --> "LAB_CreatineKinase.CK."
- "LAB_Lactate" --> "LAB_Glucose"
- "LAB_Lactate" --> "LAB_INR.PT."
- "LAB_Lactate" --> "LAB_Oxygen"
- "LAB_Lactate" --> "mortality_in_hospital"
- "LAB_Lymphocytes" <-> "LAB_Monocytes"
- "LAB_Lymphocytes" <-> "LAB_Neutrophils"
- "LAB_Lymphocytes" <-> "mortality_in_hospital"
- "LAB_MCH" --> "LAB_Hemoglobin"
- "LAB_MCHC" --> "LAB_CreatineKinase.CK."
- "LAB_MCHC" --> "LAB_MCH"
- "LAB_MCV" o-> "LAB_MCH"
- "LAB_MCV" o-> "LAB_Potassium"
- "LAB_MCV" o-o "LAB_pO2"
- "LAB_Magnesium" --> "LAB_PT"
- "LAB_Neutrophils" --> "LAB_Bicarbonate"
- "LAB_Neutrophils" --> "LAB_Monocytes"
- "LAB_Neutrophils" <-> "LAB_PlateletCount"
- "LAB_Neutrophils" <-> "LAB_WhiteBloodCells"
- "LAB_Oxygen" --> "mortality_in_hospital"
- "LAB_PT" --> "LAB_PTT"
- "LAB_Phosphate" --> "LAB_AsparateAminotransferase.AST."
- "LAB_Phosphate" --> "LAB_Creatinine"
- "LAB_Phosphate" --> "LAB_UreaNitrogen"
- "LAB_PlateletCount" <-> "LAB_Lactate"
- "LAB_PlateletCount" --> "LAB_MCHC"
- "LAB_PlateletCount" --> "LAB_RDW"
- "LAB_PlateletCount" <-> "mortality_in_hospital"
- "LAB_Potassium" --> "LAB_AnionGap"
- "LAB_Potassium" <-> "LAB_MCHC"
- "LAB_Potassium" <-> "LAB_Magnesium"
- "LAB_Potassium" --> "LAB_Phosphate"
- "LAB_RDW" --> "LAB_AlkalinePhosphatase"
- "LAB_RDW" --> "LAB_Bilirubin.Total"
- "LAB_RDW" --> "LAB_MCHC"
- "LAB_RDW" --> "LAB_PT"
- "LAB_RDW" --> "LAB_Temperature"
- "LAB_RedBloodCells" o-o "LAB_Hematocrit"
- "LAB_Sodium" --> "LAB_Chloride"
- "LAB_Sodium" --> "LAB_Magnesium"
- "LAB_Sodium" --> "LAB_Oxygen"
- "LAB_Sodium" --> "LAB_PTT"
- "LAB_UreaNitrogen" --> "LAB_Glucose"
- "LAB_UreaNitrogen" <-> "LAB_Magnesium"
- "LAB_UreaNitrogen" <-> "LAB_PlateletCount"
- "LAB_UreaNitrogen" --> "LAB_Temperature"
- "LAB_UreaNitrogen" <-> "mortality_in_hospital"
- "LAB_WhiteBloodCells" --> "LAB_AnionGap"
- "LAB_WhiteBloodCells" o-> "LAB_PlateletCount"
- "LAB_WhiteBloodCells" <-> "mortality_in_hospital"
- "LAB_pH" o-> "LAB_BaseExcess"
- "LAB_pH" o-o "LAB_CalculatedTotalCO2"
- "LAB_pH" o-> "LAB_MCHC"
- "LAB_pH" o-> "LAB_pCO2"
- "LAB_pO2" o-> "LAB_BaseExcess"
- "LAB_pO2" o-> "LAB_RDW"
- "mortality_in_hospital" --> "LAB_Glucose"
- "mortality_in_hospital" --> "LAB_INR.PT."
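An edge list in the form shown above can be post-processed with a short script, for instance to tally edge types or to pull out every edge that involves the outcome variable. The sketch below assumes each edge appears on its own line as `- "A" <edge> "B"`, exactly as printed here; the function names are hypothetical, and the raw causal-cmd output file may number or quote edges differently.

```python
import re
from collections import Counter

# One edge per line: optional leading "- ", two quoted names, one of four edge symbols.
EDGE_PATTERN = re.compile(
    r'^\s*-?\s*"(?P<source>[^"]+)"\s+(?P<edge>-->|<->|o->|o-o)\s+"(?P<target>[^"]+)"\s*$'
)


def parse_edges(text):
    """Return a list of (source, edge_symbol, target) tuples found in the text."""
    edges = []
    for line in text.splitlines():
        match = EDGE_PATTERN.match(line)
        if match:  # lines that are not edges (headers, separators) are skipped
            edges.append((match["source"], match["edge"], match["target"]))
    return edges


def count_edge_types(edges):
    """Tally how many edges of each symbol were parsed."""
    return Counter(symbol for _, symbol, _ in edges)


def edges_touching(edges, variable):
    """Return the edges that have the given variable at either end."""
    return [edge for edge in edges if variable in (edge[0], edge[2])]
```

Calling `edges_touching(parse_edges(text), "mortality_in_hospital")` on the listing above would collect every edge involving in-hospital mortality, which is a convenient starting point for reviewing the result with the interpretation guide for each edge type.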
| Edge type | Relationships that are present | Relationships that are absent |
| --- | --- | --- |
| A --> B | A is a cause of B. It may be a direct or indirect cause that may include other measured variables. Also, there may be an unmeasured confounder of A and B. | B is not a cause of A. |
| A <-> B | There is an unmeasured variable (call it L) that is a cause of A and B. There may be measured variables along the causal pathway from L to A or from L to B. | A is not a cause of B. B is not a cause of A. |
| A o-> B | Either A is a cause of B, or there is an unmeasured variable that is a cause of A and B, or both. | B is not a cause of A. |
This pipeline could be extended to biological data (e.g. exploring causal relationships in a dataset with SNPs and gene expression). Furthermore, principal component analysis could reduce the dimensionality of large datasets, which may make the analysis more efficient and require less user input.