This repository is part of a larger project to study age prediction and survival prediction from the NHANES dataset. The code of this project is split into 3 repositories:
- 📦NHANES_preprocessing to scrape the NHANES website and preprocess the data.
- 📦TrainingCenter to train the algorithms from the dataset created in the previous repository.
- 📦CorrelationCenter to study the outputs of the models trained in the previous repository.
The main categories that we are going to leverage are:
- demographics
- laboratory
- examination
- questionnaire
Feel free to start a discussion here if you have any questions.
After extracting the data, and provided you have access to a cluster of computers, about 35 minutes are needed to run the remaining steps (Fusion, Cleaning, Casting, Merge), which lead to ready-to-use datasets containing only floats and no missing values.
Shell scripts are available to launch the jobs on Slurm clusters. They always follow the same path pattern: ./step/shell_script/run_step.sh
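For example, assuming the fusion stage follows this pattern (the exact script name is shown here for illustration only), the job would be submitted with:

sbatch ./fusion/shell_script/run_fusion.sh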
[Code in R]
Start by restoring the R environment from the renv.lock file:
renv::restore()
If you don't already have the renv package installed, please install it by following the docs.
The file extraction.Rmd scrapes the NHANES website and stores the files in ./extraction/data/.
Some files are only available in the .sas7bdat format, so you need to convert them to the .csv format with the file sas7bdat_to_csv.Rmd.
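The repository performs this conversion in R (sas7bdat_to_csv.Rmd); purely for illustration, an equivalent conversion in Python with pandas would look like this (file paths are placeholders):

```python
# Illustration only: convert a .sas7bdat file to .csv with pandas.
# The repository does this in R; the paths below are placeholders.
import pandas as pd

df = pd.read_sas("./extraction/data/example.sas7bdat", format="sas7bdat")
df.to_csv("./extraction/data/example.csv", index=False)
```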
For the mortality dataset, the files have been downloaded from the FTP server of the Centers for Disease Control and Prevention. They have then been processed using the file mortality.Rmd.
[Code in Python]
Start by installing the Python package from the setup.py file:
pip install -e .[dev]
The goal of this stage is to fuse the files produced by the previous step into a single file per category. The way the categories are formed is shown in the Google Sheet variable_categorizer. This sheet lists the name of each variable together with its description, the category it has been assigned to, and other information, including the correlation between each variable and the age of the participants.
You need to set the environment variable GOOGLE_SPLIT_SHEET_ID in order to interact with the Google Sheet variable_categorizer. A good way of doing this is to define the variable in the activate file of your Python virtual environment by adding the following line:
export GOOGLE_SPLIT_SHEET_ID="1wyfNAD_SgmIlKXK-2QFcBu7eH4xPJKbWe4PLOIIlriI"
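To give a concrete picture of this stage, here is a minimal sketch of what the fusion conceptually does. It is not the repository's actual code: it assumes the sheet is publicly readable (so its CSV export URL works), that the sheet exposes variable and category columns, and that files are joined on SEQN, the NHANES participant identifier.

```python
# Illustration only: a conceptual sketch of the fusion stage.
# Assumptions: the sheet has "variable" and "category" columns, the extracted
# files are CSVs containing a SEQN column, and the sheet is publicly readable.
import os
from pathlib import Path

import pandas as pd

sheet_id = os.environ["GOOGLE_SPLIT_SHEET_ID"]
mapping = pd.read_csv(
    f"https://docs.google.com/spreadsheets/d/{sheet_id}/export?format=csv"
)

def fuse_category(category: str, data_dir: Path) -> pd.DataFrame:
    """Merge every extracted file containing variables of `category` on SEQN."""
    variables = set(mapping.loc[mapping["category"] == category, "variable"])
    fused = None
    for csv_file in sorted(data_dir.glob("*.csv")):
        df = pd.read_csv(csv_file)
        kept = ["SEQN"] + [c for c in df.columns if c in variables]
        if len(kept) == 1:  # this file holds no variable of the category
            continue
        fused = df[kept] if fused is None else fused.merge(df[kept], on="SEQN", how="outer")
    return fused

laboratory = fuse_category("laboratory", Path("./extraction/data"))
```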
[Code in Python]
This stage cleans the files obtained in the previous step by removing the NaNs. After removing the NaNs, the columns with a zero standard deviation are dropped. It also computes the age of the participants for demographics. Finally, it creates a variable called survival_type for mortality, which indicates the cause of death (cvd, cancer, other).
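A minimal sketch of that cleaning logic (the repository's actual strategy may differ):

```python
# Illustration only: remove missing values, then drop constant columns
# (columns whose standard deviation is zero).
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    df = df.dropna()
    stds = df.std(numeric_only=True)
    constant_cols = stds[stds == 0].index
    return df.drop(columns=constant_cols)
```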
[Code in Python]
This stage casts the variables of the files obtained in the previous step to float32. When a categorical variable is encountered, it is converted into dummy vectors.
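A minimal sketch of the casting step, assuming categorical variables are simply the non-numeric columns (the repository may detect them differently):

```python
# Illustration only: dummy-encode the categorical columns, then cast everything to float32.
import pandas as pd

def cast(df: pd.DataFrame) -> pd.DataFrame:
    categorical = list(df.select_dtypes(exclude="number").columns)
    df = pd.get_dummies(df, columns=categorical)
    return df.astype("float32")
```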
[Code in Python]
This stage merges the files obtained in the previous step with the demographics files and the mortality data. It also appends the description of each variable to its name.
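A sketch of what this merge conceptually does, assuming SEQN is the join key and that a descriptions mapping from variable names to their NHANES descriptions is available (both are assumptions, not the repository's actual code):

```python
# Illustration only: join a category file with demographics and mortality on SEQN,
# then append each variable's description to its column name.
import pandas as pd

def merge(category_df, demographics_df, mortality_df, descriptions: dict) -> pd.DataFrame:
    merged = category_df.merge(demographics_df, on="SEQN").merge(mortality_df, on="SEQN")
    renamed = {
        col: f"{col} - {descriptions[col]}" for col in merged.columns if col in descriptions
    }
    return merged.rename(columns=renamed)
```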
[Code in Python]
This stage is independent from the rest of the stages. It computes the correlation between age and the variables, and updates the columns age correlation, p-value and sample size in the Google Sheet variable_categorizer.
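A sketch of how the age correlation, p-value and sample size can be computed for one variable (a Pearson correlation and an age column named "age" are assumptions here; the repository may differ):

```python
# Illustration only: correlation between age and one variable, with p-value and sample size.
import pandas as pd
from scipy.stats import pearsonr

def age_correlation(df: pd.DataFrame, variable: str, age_col: str = "age"):
    sub = df[[age_col, variable]].dropna()
    r, p_value = pearsonr(sub[age_col], sub[variable])
    return r, p_value, len(sub)
```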
[Code in Python]
This stage is also independent from the rest of the stages. It displays a scatter plot of a desired NHANES variable against age in months on the x-axis. To use it:
scatter_plot --main_category examination --variable BMXWT
Here is what you get for the weight of the NHANES participants: