Rapid Longitudinal Analysis of Public Health Data

This project was part of the March 2025 CMU Hackathon in partnership with DNAnexus.

Hackathon Team: Samuel Blechman, Nicholas P. Cooley, Aung Myat Phyo, Ciara O'Donoghue, Glenn Ross-Dolan, Rebecca Satterwhite, Rishika Gupta

Link to Slides

Problem: Increased availability of large EHRs with limited accessible causal discovery methods

With the increasing availability of multimodal patient data, non-specialists, including health care professionals, now receive an abundance of transdisciplinary information without a corresponding ability to analyze and interpret it. Traditional statistical methods focus primarily on correlation-based associations, which makes it difficult to infer causal mechanisms in complex patient trajectories. Working with raw EHR data also presents challenges that must be addressed before causal discovery can be effective. To address these challenges, we present a causal discovery pipeline that automates the preprocessing, causal inference, and visualization steps required to analyze longitudinal data, using the MIMIC-III dataset [Johnson et al., 2022] as an example. Our approach uses DNAnexus for scalable computation and Tetrad [Ramsey et al., 2018], a well-established causal discovery tool, to identify relationships between clinical features, laboratory results, microbiological events, and patient outcomes. Raw CSV files are converted into an SQL database for efficient querying and data retrieval.
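The CSV-to-SQL step described above can be illustrated with a minimal Python sketch. This is not the pipeline's own code (the repository's scripts are the reference); it assumes SQLite as the database engine, stores every column as text, and uses a made-up toy table. The function name `load_csv_into_sqlite`, the table name, and the rows are all hypothetical.

```python
import csv
import sqlite3
import tempfile
from pathlib import Path


def load_csv_into_sqlite(csv_path, db_path, table_name):
    """Copy one CSV file into a SQLite table and return the number of rows loaded.

    Every column is stored as TEXT; type inference is left out of this sketch.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)  # first row holds the column names
        columns = ", ".join(f'"{name}" TEXT' for name in header)
        placeholders = ", ".join("?" for _ in header)
        conn = sqlite3.connect(str(db_path))
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(f'DROP TABLE IF EXISTS "{table_name}"')
                conn.execute(f'CREATE TABLE "{table_name}" ({columns})')
                conn.executemany(
                    f'INSERT INTO "{table_name}" VALUES ({placeholders})', reader
                )
            count = conn.execute(
                f'SELECT COUNT(*) FROM "{table_name}"'
            ).fetchone()[0]
        finally:
            conn.close()
    return count


def demo():
    """Load a toy CSV (made-up rows, not MIMIC-III data) and query it back."""
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = Path(tmp) / "labevents_toy.csv"
        # Hypothetical columns and placeholder item IDs, for illustration only.
        csv_path.write_text(
            "SUBJECT_ID,ITEMID,VALUENUM\n1,1001,2.1\n1,1002,140\n2,1001,4.7\n"
        )
        db_path = Path(tmp) / "toy.db"
        loaded = load_csv_into_sqlite(csv_path, db_path, "labevents")
        conn = sqlite3.connect(str(db_path))
        try:
            rows = conn.execute(
                "SELECT SUBJECT_ID, VALUENUM FROM labevents "
                "WHERE ITEMID = ? ORDER BY SUBJECT_ID",
                ("1001",),
            ).fetchall()
        finally:
            conn.close()
        return loaded, rows


if __name__ == "__main__":
    print(demo())
```

Once the data is in a table, selecting one lab measurement across patients is a single indexed query rather than a scan over a large CSV file, which is the efficiency gain the paragraph refers to.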

What is Tetrad

Tetrad is a software suite for simulating, estimating, and searching for graphical causal models of statistical data. The aim of the program is to provide sophisticated methods in a friendly interface requiring very little statistical sophistication of the user and no programming knowledge. Tetrad is open-source, free software that performs many of the functions in commercial programs.

See here for Tetrad User Manual

What is DNAnexus

Pipeline Workflow:

Installation Prior to Pipeline

Latest Java Version

sudo apt install openjdk-17-jdk

More information regarding Java installation can be found HERE

Latest R Version

sudo apt install r-base

More information regarding R installation can be found HERE

.jar file for running causal-cmd on terminal (Tetrad command line option)

wget https://s01.oss.sonatype.org/content/repositories/releases/io/github/cmu-phil/causal-cmd/1.12.0/causal-cmd-1.12.0-jar-with-dependencies.jar
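As a rough illustration of how the downloaded JAR might be invoked, the sketch below assembles a causal-cmd command line in Python without running it. The flag names (`--algorithm`, `--data-type`, `--dataset`, `--delimiter`, `--prefix`, `--knowledge`) are recalled from the causal-cmd documentation and should be treated as assumptions; confirm them with `java -jar causal-cmd-1.12.0-jar-with-dependencies.jar --help`. The helper names, the file names, and the choice of algorithm are hypothetical, since this README does not state which algorithm or test the pipeline passes to Tetrad.

```python
import shutil
from pathlib import Path

JAR_NAME = "causal-cmd-1.12.0-jar-with-dependencies.jar"


def missing_prerequisites(jar_path):
    """Return a list of human-readable problems; an empty list means ready to run."""
    problems = []
    if shutil.which("java") is None:
        problems.append("java executable not found on PATH")
    if not Path(jar_path).is_file():
        problems.append(f"JAR file not found: {jar_path}")
    return problems


def build_causal_cmd(jar_path, dataset_path, algorithm, prefix,
                     knowledge_path=None, extra_args=()):
    """Assemble (but do not run) a causal-cmd command line.

    Flag names are assumptions to verify against `--help`. Most algorithms
    also need a test or score option; pass those through `extra_args`.
    """
    cmd = [
        "java", "-jar", str(jar_path),
        "--algorithm", algorithm,
        "--data-type", "continuous",
        "--dataset", str(dataset_path),
        "--delimiter", "comma",
        "--prefix", prefix,
    ]
    if knowledge_path is not None:
        cmd += ["--knowledge", str(knowledge_path)]
    cmd += list(extra_args)
    return cmd


if __name__ == "__main__":
    # Hypothetical inputs: the dataset name and algorithm are placeholders.
    command = build_causal_cmd(
        JAR_NAME, "example_input.csv", "fci", "example_output_name",
        knowledge_path="knowledge.txt",
    )
    print(" ".join(command))
    for problem in missing_prerequisites(JAR_NAME):
        print("warning:", problem)
```

Building the command as a list keeps it safe to hand to `subprocess.run` later, and the prerequisite check reports a missing Java installation or JAR before a long job is submitted.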

Files Needed Prior to Pipeline

YAML file to specify variables (e.g. specific columns) and specific arguments to input into Tetrad

Please see the Example User Input folder for an example R script, GenerateExampleUserInput.R, that generates this file, and user_input_yaml.txt for an example of the file format. Note: examples of specific arguments and options can be found in the R script file.

Knowledge File

To add knowledge parameters, the user must create a knowledge.txt file. This option lets the user supply background knowledge about the data, for example information about the time order of the measured variables. Please see 'knowledge.txt' for reference.
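For orientation, here is a minimal Python sketch that writes a knowledge file in the layout Tetrad documents for its knowledge format (a `/knowledge` header followed by `addtemporal`, `forbiddirect`, and `requiredirect` sections). The repository's own knowledge.txt remains the reference. The function name `format_knowledge` and the tier assignment below are hypothetical, chosen only to show how a time order could be encoded.

```python
def format_knowledge(tiers, forbidden=(), required=()):
    """Render background knowledge as text in Tetrad's knowledge-file layout.

    tiers:     list of lists of variable names, earliest tier first
    forbidden: (cause, effect) pairs that must not appear as direct edges
    required:  (cause, effect) pairs that must appear as direct edges
    """
    lines = ["/knowledge", "addtemporal"]
    for index, names in enumerate(tiers, start=1):
        # Variables in a later tier cannot be causes of variables in an earlier tier.
        lines.append(f"{index} " + " ".join(names))
    lines += ["", "forbiddirect"]
    lines += [f"{cause} {effect}" for cause, effect in forbidden]
    lines += ["", "requiredirect"]
    lines += [f"{cause} {effect}" for cause, effect in required]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical time order: lab values are measured before the outcome is known.
    text = format_knowledge(
        tiers=[["LAB_Lactate", "LAB_Glucose"], ["mortality_in_hospital"]],
    )
    with open("knowledge_example.txt", "w") as handle:
        handle.write(text)
    print(text)
```

Placing the outcome in the last tier tells the search that it cannot be a cause of any earlier measurement, which is the kind of time-order information the knowledge file is meant to carry.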

Testing

We tested the pipeline on data from the MIMIC-III dataset [Johnson et al., 2022].

Results

Results are written to example_output_name_out.txt. In this example run, 87 causal relationships (graph edges) were identified.

================================================================================
Graph Edges:

"LAB_AlkalinePhosphatase" o-> "LAB_Bilirubin.Total"
"LAB_AlkalinePhosphatase" --> "LAB_Calcium.Total"
"LAB_AlkalinePhosphatase" --> "LAB_Sodium"
"LAB_AnionGap" <-> "LAB_Bicarbonate"
"LAB_AnionGap" <-> "LAB_Creatinine"
"LAB_AnionGap" <-> "LAB_Lactate"
"LAB_AnionGap" <-> "LAB_Phosphate"
"LAB_AsparateAminotransferase.AST." --> "LAB_AlanineAminotransferase.ALT."
"LAB_Basophils" o-> "LAB_Eosinophils"
"LAB_Bicarbonate" <-> "LAB_Chloride"
"LAB_Bicarbonate" <-> "mortality_in_hospital"
"LAB_Calcium.Total" <-> "LAB_CreatineKinase.CK."
"LAB_Calcium.Total" --> "LAB_Lymphocytes"
"LAB_Calcium.Total" <-> "LAB_Magnesium"
"LAB_Calcium.Total" --> "LAB_Monocytes"
"LAB_CalculatedTotalCO2" o-> "LAB_BaseExcess"
"LAB_CalculatedTotalCO2" o-> "LAB_Bicarbonate"
"LAB_CalculatedTotalCO2" o-> "LAB_pCO2"
"LAB_Creatinine" <-> "LAB_UreaNitrogen"
"LAB_Eosinophils" <-> "LAB_Neutrophils"
"LAB_Eosinophils" <-> "LAB_PlateletCount"
"LAB_Eosinophils" <-> "mortality_in_hospital"
"LAB_Hematocrit" o-> "LAB_Hemoglobin"
"LAB_Hematocrit" o-> "LAB_Phosphate"
"LAB_Hemoglobin" --> "LAB_Lymphocytes"
"LAB_Hemoglobin" --> "LAB_UreaNitrogen"
"LAB_INR.PT." --> "LAB_PT"
"LAB_INR.PT." --> "LAB_PTT"
"LAB_Lactate" --> "LAB_AsparateAminotransferase.AST."
"LAB_Lactate" --> "LAB_CreatineKinase.CK."
"LAB_Lactate" --> "LAB_Glucose"
"LAB_Lactate" --> "LAB_INR.PT."
"LAB_Lactate" --> "LAB_Oxygen"
"LAB_Lactate" --> "mortality_in_hospital"
"LAB_Lymphocytes" <-> "LAB_Monocytes"
"LAB_Lymphocytes" <-> "LAB_Neutrophils"
"LAB_Lymphocytes" <-> "mortality_in_hospital"
"LAB_MCH" --> "LAB_Hemoglobin"
"LAB_MCHC" --> "LAB_CreatineKinase.CK."
"LAB_MCHC" --> "LAB_MCH"
"LAB_MCV" o-> "LAB_MCH"
"LAB_MCV" o-> "LAB_Potassium"
"LAB_MCV" o-o "LAB_pO2"
"LAB_Magnesium" --> "LAB_PT"
"LAB_Neutrophils" --> "LAB_Bicarbonate"
"LAB_Neutrophils" --> "LAB_Monocytes"
"LAB_Neutrophils" <-> "LAB_PlateletCount"
"LAB_Neutrophils" <-> "LAB_WhiteBloodCells"
"LAB_Oxygen" --> "mortality_in_hospital"
"LAB_PT" --> "LAB_PTT"
"LAB_Phosphate" --> "LAB_AsparateAminotransferase.AST."
"LAB_Phosphate" --> "LAB_Creatinine"
"LAB_Phosphate" --> "LAB_UreaNitrogen"
"LAB_PlateletCount" <-> "LAB_Lactate"
"LAB_PlateletCount" --> "LAB_MCHC"
"LAB_PlateletCount" --> "LAB_RDW"
"LAB_PlateletCount" <-> "mortality_in_hospital"
"LAB_Potassium" --> "LAB_AnionGap"
"LAB_Potassium" <-> "LAB_MCHC"
"LAB_Potassium" <-> "LAB_Magnesium"
"LAB_Potassium" --> "LAB_Phosphate"
"LAB_RDW" --> "LAB_AlkalinePhosphatase"
"LAB_RDW" --> "LAB_Bilirubin.Total"
"LAB_RDW" --> "LAB_MCHC"
"LAB_RDW" --> "LAB_PT"
"LAB_RDW" --> "LAB_Temperature"
"LAB_RedBloodCells" o-o "LAB_Hematocrit"
"LAB_Sodium" --> "LAB_Chloride"
"LAB_Sodium" --> "LAB_Magnesium"
"LAB_Sodium" --> "LAB_Oxygen"
"LAB_Sodium" --> "LAB_PTT"
"LAB_UreaNitrogen" --> "LAB_Glucose"
"LAB_UreaNitrogen" <-> "LAB_Magnesium"
"LAB_UreaNitrogen" <-> "LAB_PlateletCount"
"LAB_UreaNitrogen" --> "LAB_Temperature"
"LAB_UreaNitrogen" <-> "mortality_in_hospital"
"LAB_WhiteBloodCells" --> "LAB_AnionGap"
"LAB_WhiteBloodCells" o-> "LAB_PlateletCount"
"LAB_WhiteBloodCells" <-> "mortality_in_hospital"
"LAB_pH" o-> "LAB_BaseExcess"
"LAB_pH" o-o "LAB_CalculatedTotalCO2"
"LAB_pH" o-> "LAB_MCHC"
"LAB_pH" o-> "LAB_pCO2"
"LAB_pO2" o-> "LAB_BaseExcess"
"LAB_pO2" o-> "LAB_RDW"
"mortality_in_hospital" --> "LAB_Glucose"
"mortality_in_hospital" --> "LAB_INR.PT."
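To work with an edge list like the one above programmatically, a short Python sketch can parse each line into a (source, edge type, target) triple, tally the edge types, and pull out the edges that involve the outcome variable. It assumes the quoted one-edge-per-line layout shown in this README; the raw Tetrad output file may be formatted differently, and the function names are hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: "LAB_Lactate" --> "mortality_in_hospital"
EDGE_PATTERN = re.compile(r'^"([^"]+)"\s+(-->|<->|o->|o-o)\s+"([^"]+)"$')

# A few lines copied from the results above, used as sample input.
SAMPLE = '''
"LAB_AlkalinePhosphatase" o-> "LAB_Bilirubin.Total"
"LAB_AnionGap" <-> "LAB_Bicarbonate"
"LAB_Lactate" --> "mortality_in_hospital"
"LAB_MCV" o-o "LAB_pO2"
"LAB_Oxygen" --> "mortality_in_hospital"
'''


def parse_edges(text):
    """Return a list of (source, edge_type, target) tuples; other lines are skipped."""
    edges = []
    for line in text.splitlines():
        match = EDGE_PATTERN.match(line.strip())
        if match:
            edges.append(match.groups())
    return edges


def count_edge_types(edges):
    """Count how many edges of each type (-->, <->, o->, o-o) are present."""
    return Counter(kind for _, kind, _ in edges)


def edges_touching(edges, variable):
    """Return the edges in which `variable` is either endpoint."""
    return [edge for edge in edges if variable in (edge[0], edge[2])]


if __name__ == "__main__":
    edges = parse_edges(SAMPLE)
    print(count_edge_types(edges))
    for source, kind, target in edges_touching(edges, "mortality_in_hospital"):
        print(source, kind, target)
```

Filtering on `mortality_in_hospital` gives a compact view of which variables the search connects to the patient outcome, and by which edge type.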

Interpretation of Results

A --> B

Present: A is a cause of B. It may be a direct or indirect cause that may include other measured variables. Also, there may be an unmeasured confounder of A and B.

Absent: B is not a cause of A.

A <-> B

Present: There is an unmeasured variable (call it L) that is a cause of A and B. There may be measured variables along the causal pathway from L to A or from L to B.

Absent: A is not a cause of B. B is not a cause of A.

A o-> B

Present: Either A is a cause of B, or there is an unmeasured variable that is a cause of A and B, or both.

Absent: B is not a cause of A.
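The interpretations above can be turned into a small lookup, sketched here in Python, that converts a parsed edge into a plain-language sentence. The wording paraphrases the "present" descriptions in this section; the function name `explain_edge` is hypothetical, and edge types not described in this section (such as o-o) are reported as such rather than guessed.

```python
# "Present" meaning of each edge type, paraphrased from the interpretation section.
PRESENT_MEANING = {
    "-->": "{a} is a cause of {b} (direct or indirect); "
           "an unmeasured confounder of both may also exist.",
    "<->": "An unmeasured variable is a cause of both {a} and {b}.",
    "o->": "Either {a} is a cause of {b}, or an unmeasured variable "
           "is a cause of both, or both.",
}


def explain_edge(a, kind, b):
    """Return a one-sentence reading of the edge `a kind b`."""
    template = PRESENT_MEANING.get(kind)
    if template is None:
        return f"No interpretation is listed in this section for edge type {kind!r}."
    return template.format(a=a, b=b)


if __name__ == "__main__":
    print(explain_edge("LAB_Lactate", "-->", "mortality_in_hospital"))
    print(explain_edge("LAB_AnionGap", "<->", "LAB_Bicarbonate"))
    print(explain_edge("LAB_MCV", "o-o", "LAB_pO2"))
```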

Reproducibility and Future Directions

This pipeline could be extended to biological data (e.g. exploring causal relationships in a dataset with SNPs and gene expression). Furthermore, applying principal component analysis to large datasets may make the analysis more efficient and reduce the amount of user input required.

collaborativebioinformatics/Longitudinal_emr_accleRation
