User_Guide.Rmd

---
title: "NHANES Scientific Data User Guide"
author: "Chirag J Patel"
date: "May 19, 2016"
output: html_document
---

This presents how to use to how to use the NHANES datasets and to make inferences that account for the non-random and stratified nature of the survey.

Comments to: Chirag Patel (chirag@hms.harvard.edu)

Load the `.Rdata` object:
```{r}
load('nh_99-06.Rdata')
```

How do we figure out what is a exposure and what is a phenotype in NHANES? Hint: Use the `VarDescription` `data.frame`:
```{r}
head(VarDescription) ## this gives the variable name and description and broad category for each variable (called 'var_desc_ewas')

as.data.frame(table(VarDescription$category)) ##  the types of variables
```

Next, how does survey-weighted regression work?
Suppose we want to look at the association between fasting glucose and BMI (adjusted by age and sex) in the 2003-2004 survey.

Under a normal study sample, we would simply use `lm`:
```{r}
dat <- subset(MainTable, SDDSRVYR == 3) # subset for 2003-2004
mod <- lm(LBXGLU ~ BMXBMI + RIDAGEYR + female, dat)
summary(mod)
```

But with NHANES, this is technically not correct. We need to use survey-weighting to accomodate the survey sampling of the data:
```{r}
library(survey)
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T,data=subset(dat, WTMEC2YR > 0)) # first cret a survey design object, specififying the sampling units (SDMVPSU), the strata (SDMVSTRA), and probability weight of being selected WTMEC2YR
mod.svy <- svyglm(LBXGLU  ~ BMXBMI + RIDAGEYR + female, design=dsn) ## now use SVYGLM; 
summary(mod.svy)  #slightly different estimates
```

Let's try logistic regression, looking at the clinical diagnosis of diabetes (`LBXGLU >= 126`) using logistic regression:
```{r}
mod.svy.t2d <- svyglm(I(LBXGLU >=125)  ~ BMXBMI + RIDAGEYR + female, design=dsn, family=quasibinomial()) #depending on the family= parameter, you can use this for logistic regression, as well.
summary(mod.svy.t2d) #t2d increases by 10% per 1 unit increase in BMI.
```

What about survival analysis? Different beast! In survival analyses, we require whether the person died at the time of querying survival (0 or 1) and time to querying (e.g., 1 month, 5 months). These are coded as `MORTSTAT` and `PERMTH_EXM` (time to death from the the examination survey) in the `MainTable`, respectively.
For a 1 unit glucose increase, what is the hazard of death adjusting for age and sex for participants surveyed in 1999-2000?
```{r}
suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 1) ## get all data from 1999-2000
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, probs=~WTMEC2YR, nest=T,data=suvdat)
mod.cox.svy <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXGLU, dsn)
summary(mod.cox.svy)
```

With those basics in hand, now we can tackle executing exposome-like associations in NHANES.

For example, we recently found an association between serum cadmium and mortality ([Patel CJ, *et al.* 2013](https://www.ncbi.nlm.nih.gov/pubmed/24345851)). Serum cadmium was found to be signficantly associated with all-cause mortality.

What is the variable name for "cadmium"?
```{r}
VarDescription[grep('cadmium', VarDescription$var_desc, ignore.case = T), ] # looks like it is LBXBCD

suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 1) ## get all data from 1999-2000
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T,data=suvdat)
mod.cox.cadmium.1 <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXBCD, dsn)
summary(mod.cox.cadmium.1)
```
In the above, we see that females have a decreased risk for death (hazard ratio of 0.6 for females vs males) and individuals with a 1 unit higher of cadmium (1 ng/mL) have a 2 fold increased risk for death (HR = 2) versus the no increase in cadmium levels. Similar results are seen in 2001-2002...

##Does it replicate in the 2001-2002 survey?
```{r}
suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 2) ## 2003-2004 survey
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=T,data=suvdat)
mod.cox.cadmium.2 <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXBCD, dsn) # yes, strong association in 2001-2002 as well
summary(mod.cox.cadmium.2)
```

In the above, we see that females have a decreased risk for death (hazard ratio of 0.6 for females vs males) and individuals with a 1 unit higher of cadmium (1 ng/mL) have a 30% increased risk for death (HR = 1.3). Similar results are seen in 1999-2000.