This repository is part of a larger project to study age prediction and survival prediction from the NHANES dataset. The code of this project is split into 3 repositories:
- 📦NHANES_preprocessing to scrape the NHANES website and preprocess the data.
- 📦TrainingCenter to train the algorithms from the dataset created in the previous repository.
- 📦CorrelationCenter to study the outputs of the models trained in the previous repository.
The main categories that we are going to leverage are:
- demographics
- laboratory
- examination
- questionnaire
Feel free to start a discussion here if you have any questions.
After extracting the data, and provided you have access to a cluster of computers, about 35 minutes are needed to run the remaining steps (Fusion, Cleaning, Casting, Merge), which lead to ready-to-use datasets containing only floats and no missing values.
Shell scripts are available to launch the jobs on Slurm clusters. They always follow the same path pattern: ./step/shell_script/run_step.sh
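For example, assuming the fusion stage follows this pattern (the exact script name is shown here for illustration only), the job would be submitted with:

sbatch ./fusion/shell_script/run_fusion.sh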
[Code in R]
Start by restoring the R environment from the renv.lock file:
renv::restore()
If you don't already have the renv package installed, please install it by following the docs.
The file extraction.Rmd scrapes the NHANES website and stores the files in ./extraction/data/.
Some files are only available in the .sas7bdat format, so you need to convert them to the .csv format with the file sas7bdat_to_csv.Rmd.
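The repository performs this conversion in R (sas7bdat_to_csv.Rmd); purely for illustration, an equivalent conversion in Python with pandas would look like this (file paths are placeholders):

```python
# Illustration only: convert a .sas7bdat file to .csv with pandas.
# The repository does this in R; the paths below are placeholders.
import pandas as pd

df = pd.read_sas("./extraction/data/example.sas7bdat", format="sas7bdat")
df.to_csv("./extraction/data/example.csv", index=False)
```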
For the mortality dataset, the files have been downloaded from the FTP server of the Centers for Disease Control and Prevention. They have then been processed using the file mortality.Rmd.
[Code in Python]
Start by installing the Python package from the setup.py file:
pip install -e .[dev]
The goal of this stage is to fuse the files produced by the previous step into a single file per category. The way the categories are formed is shown in the Google Sheet variable_categorizer. This sheet lists the name of each variable together with its description, the category it has been assigned to, and other information, including the correlation between each variable and the age of the participants.
You need to set the environment variable GOOGLE_SPLIT_SHEET_ID in order to interact with the Google Sheet variable_categorizer. A good way of doing this is to define the variable in the activate file of your Python virtual environment by adding the following line:
export GOOGLE_SPLIT_SHEET_ID="1wyfNAD_SgmIlKXK-2QFcBu7eH4xPJKbWe4PLOIIlriI"
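To give a concrete picture of this stage, here is a minimal sketch of what the fusion conceptually does. It is not the repository's actual code: it assumes the sheet is publicly readable (so its CSV export URL works), that the sheet exposes variable and category columns, and that files are joined on SEQN, the NHANES participant identifier.

```python
# Illustration only: a conceptual sketch of the fusion stage.
# Assumptions: the sheet has "variable" and "category" columns, the extracted
# files are CSVs containing a SEQN column, and the sheet is publicly readable.
import os
from pathlib import Path

import pandas as pd

sheet_id = os.environ["GOOGLE_SPLIT_SHEET_ID"]
mapping = pd.read_csv(
    f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
)

def fuse_category(category: str, data_dir: Path) -> pd.DataFrame:
    """Merge every extracted file containing variables of `category` on SEQN."""
    variables = set(mapping.loc[mapping["category"] == category, "variable"])
    fused = None
    for csv_file in sorted(data_dir.glob("*.csv")):
        df = pd.read_csv(csv_file)
        kept = ["SEQN"] + [c for c in df.columns if c in variables]
        if len(kept) == 1:  # this file holds no variable of the category
            continue
        fused = df[kept] if fused is None else fused.merge(df[kept], on="SEQN", how="outer")
    return fused

laboratory = fuse_category("laboratory", Path("./extraction/data"))
```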
[Code in Python]
This stage cleans the files obtained in the previous step by removing the NaNs. After removing the NaNs, the columns with a zero standard deviation are dropped. It also computes the age of the participants for demographics. Finally, it creates a variable called survival_type for mortality, which indicates the cause of death (cvd, cancer, other).
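A minimal sketch of that cleaning logic (the repository's actual strategy may differ):

```python
# Illustration only: remove missing values, then drop constant columns
# (columns whose standard deviation is zero).
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    stds = df.std(numeric_only=True)
    constant_cols = stds[stds == 0].index
    return df.drop(columns=constant_cols)
```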
[Code in Python]
This stage casts the variables of the files obtained in the previous step to float32. When a categorical variable is encountered, it is converted into dummy vectors.
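A minimal sketch of the casting step, assuming categorical variables are simply the non-numeric columns (the repository may detect them differently):

```python
# Illustration only: dummy-encode the categorical columns, then cast everything to float32.
import pandas as pd

def cast(df: pd.DataFrame) -> pd.DataFrame:
    categorical = list(df.select_dtypes(exclude="number").columns)
    df = pd.get_dummies(df, columns=categorical)
    return df.astype("float32")
```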
[Code in Python]
This stage merges the files obtained in the previous step with the demographics files and the mortality data. It also appends the description of each variable to its name.
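A sketch of what this merge conceptually does, assuming SEQN is the join key and that a descriptions mapping from variable names to their NHANES descriptions is available (both are assumptions, not the repository's actual code):

```python
# Illustration only: join a category file with demographics and mortality on SEQN,
# then append each variable's description to its column name.
import pandas as pd

def merge(category_df, demographics_df, mortality_df, descriptions: dict) -> pd.DataFrame:
    merged = category_df.merge(demographics_df, on="SEQN").merge(mortality_df, on="SEQN")
    renamed = {
        col: f"{col} - {descriptions[col]}" for col in merged.columns if col in descriptions
    }
    return merged.rename(columns=renamed)
```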
[Code in Python]
This stage is independent from the rest of the stages. It computes the correlation between age and the variables, and updates the columns age correlation, p-value and sample size in the Google Sheet variable_categorizer.
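A sketch of how the age correlation, p-value and sample size can be computed for one variable (a Pearson correlation and an age column named "age" are assumptions here; the repository may differ):

```python
# Illustration only: correlation between age and one variable, with p-value and sample size.
import pandas as pd
from scipy.stats import pearsonr

def age_correlation(df: pd.DataFrame, variable: str, age_col: str = "age"):
    sub = df[[age_col, variable]].dropna()
    r, p_value = pearsonr(sub[age_col], sub[variable])
    return r, p_value, len(sub)
```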
[Code in Python]
This stage is also independent from the rest of the stages. It displays a scatter plot of a desired NHANES variable against age in months on the x-axis. To use it:
scatter_plot --main_category examination --variable BMXWT
Here is what you get for the weight of the NHANES participants: