---
title: "NHANES Scientific Data User Guide"
author: "Chirag J Patel"
date: "May 19, 2016"
output: html_document
---
This guide shows how to use the NHANES datasets and how to make inferences that account for the non-random, stratified design of the survey.
Comments to: Chirag Patel ([email protected])
Load the `.Rdata` object:
```{r}
load('nh_99-06.Rdata')
```
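If it is unclear which objects the `.Rdata` file provides, note that `load()` returns their names. A minimal sketch, using a throwaway file with toy stand-ins (the toy contents are invented; only the object names `MainTable` and `VarDescription` come from this guide):

```{r}
# Build toy stand-ins and save them to a temporary .Rdata file (illustration only).
toy_main <- data.frame(SEQN = 1:3, BMXBMI = c(22.1, 30.4, 27.8))
toy_desc <- data.frame(var = "BMXBMI", var_desc = "Body Mass Index (kg/m**2)")
tmp_file <- tempfile(fileext = ".Rdata")
local({
  MainTable <- toy_main
  VarDescription <- toy_desc
  save(MainTable, VarDescription, file = tmp_file)
})

# Load into a separate environment so nothing in the workspace is overwritten.
toy_env <- new.env()
loaded_names <- load(tmp_file, envir = toy_env)
loaded_names  # names of the objects stored in the file

# Class of each loaded object
sapply(loaded_names, function(nm) class(get(nm, envir = toy_env)))
```

The same pattern should apply to the real file by passing `'nh_99-06.Rdata'` in place of `tmp_file`.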
How do we figure out what is an exposure and what is a phenotype in NHANES? Hint: use the `VarDescription` `data.frame`:
```{r}
head(VarDescription) ## this gives the variable name and description and broad category for each variable (called 'var_desc_ewas')
as.data.frame(table(VarDescription$category)) ## the types of variables
```
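Once the categories are known, a natural next step is to list the variables in one of them. The sketch below uses a toy stand-in for `VarDescription`: the columns `var_desc` and `category` appear in this guide, but the column name `var`, the category labels, and the helper `vars_in_category()` are assumptions made for illustration.

```{r}
# Toy stand-in for VarDescription; the column name `var` and the category
# labels are assumptions, not the contents of the real table.
toy_var_desc <- data.frame(
  var      = c("LBXBCD", "LBXBPB", "LBXGLU", "BMXBMI"),
  var_desc = c("Cadmium", "Lead", "Glucose, plasma", "Body Mass Index"),
  category = c("heavy metals", "heavy metals", "biochemistry", "body measures"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique variable names that belong to one category.
vars_in_category <- function(var_table, category_name) {
  unique(var_table$var[var_table$category == category_name])
}

vars_in_category(toy_var_desc, "heavy metals")
```

With the real table, check `names(VarDescription)` first to find the column that actually holds the variable names.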
Next, how does survey-weighted regression work?
Suppose we want to look at the association between fasting glucose and BMI (adjusted for age and sex) in the 2003-2004 survey.
With a simple random sample, we would simply use `lm`:
```{r}
dat <- subset(MainTable, SDDSRVYR == 3) # subset for 2003-2004
mod <- lm(LBXGLU ~ BMXBMI + RIDAGEYR + female, dat)
summary(mod)
```
But with NHANES, this is technically not correct. We need survey weighting to accommodate the complex sampling design of the data:
```{r}
library(survey)
# first create a survey design object, specifying the sampling units (SDMVPSU), the strata (SDMVSTRA), and the sampling weights (WTMEC2YR)
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=subset(dat, WTMEC2YR > 0))
mod.svy <- svyglm(LBXGLU ~ BMXBMI + RIDAGEYR + female, design=dsn) ## now use svyglm
summary(mod.svy) #slightly different estimates
```
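To see why the weights matter, consider a toy population in which one group is deliberately oversampled. This base-R sketch (all numbers invented) compares the naive sample mean with the weighted mean, where each weight is the number of population members that a sampled person represents:

```{r}
# Toy population: 900 people in group A (glucose 90), 100 in group B (glucose 130).
pop_size   <- c(A = 900, B = 100)
group_mean <- c(A = 90, B = 130)

# Design: sample 50 people from each group, so group B is heavily oversampled.
n_sampled <- c(A = 50, B = 50)
toy_sample <- data.frame(
  group   = rep(c("A", "B"), times = n_sampled),
  glucose = rep(unname(group_mean), times = n_sampled),
  stringsAsFactors = FALSE
)

# Sampling weight = population size / sample size for the person's group.
toy_sample$weight <- unname((pop_size / n_sampled)[toy_sample$group])

true_mean       <- sum(pop_size * group_mean) / sum(pop_size)  # population mean
unweighted_mean <- mean(toy_sample$glucose)                    # ignores the design
weighted_mean   <- weighted.mean(toy_sample$glucose, toy_sample$weight)

c(true = true_mean, unweighted = unweighted_mean, weighted = weighted_mean)
```

Here the unweighted mean is 110, while the weighted mean recovers the population value of 94. The design object built with `svydesign` goes further than this sketch: it also uses the strata and sampling units to compute standard errors.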
Let's try logistic regression, modeling the clinical diagnosis of diabetes (`LBXGLU >= 126`):
```{r}
mod.svy.t2d <- svyglm(I(LBXGLU >= 126) ~ BMXBMI + RIDAGEYR + female, design=dsn, family=quasibinomial()) # the family= parameter makes this a logistic regression
summary(mod.svy.t2d) # the odds of t2d increase by about 10% per 1 unit increase in BMI
```
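Coefficients from a logistic model are on the log-odds scale, so exponentiating them gives odds ratios. A minimal sketch with a hypothetical helper, `odds_ratio_table()`, demonstrated on simulated data with plain `glm` (the data and effect size are invented). It relies only on `coef(summary(...))`, so it should also accept a `svyglm` fit, though that is an assumption to verify on real output:

```{r}
# Hypothetical helper: odds ratios with Wald 95% confidence intervals.
odds_ratio_table <- function(model) {
  est    <- coef(summary(model))  # matrix with "Estimate" and "Std. Error" columns
  log_or <- est[, "Estimate"]
  se     <- est[, "Std. Error"]
  z      <- qnorm(0.975)
  data.frame(
    odds_ratio = exp(log_or),
    lower_95   = exp(log_or - z * se),
    upper_95   = exp(log_or + z * se)
  )
}

# Simulated example: each extra BMI unit adds log(1.1) to the log-odds.
set.seed(1)
toy_bmi <- rnorm(5000, mean = 28, sd = 6)
toy_t2d <- rbinom(5000, size = 1, prob = plogis(-5 + log(1.1) * toy_bmi))
toy_fit <- glm(toy_t2d ~ toy_bmi, family = binomial())

odds_ratio_table(toy_fit)
```

An odds ratio of 1.10 corresponds to a 10% increase in the odds (not the probability) of the outcome per unit of the predictor.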
What about survival analysis? Different beast! Survival analyses require two things for each person: whether the person had died by the time survival was queried (0 or 1) and the time elapsed until that query (e.g., 1 month, 5 months). These are coded as `MORTSTAT` and `PERMTH_EXM` (time to death, measured from the examination survey), respectively, in the `MainTable`.
For a 1 unit increase in glucose, what is the hazard ratio for death, adjusting for age and sex, among participants surveyed in 1999-2000?
```{r}
suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 1) ## get all data from 1999-2000
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=suvdat) ## WTMEC2YR is a sampling weight, not a selection probability
mod.cox.svy <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXGLU, dsn)
summary(mod.cox.svy)
```
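Cox model coefficients are log hazard ratios, so they are usually exponentiated before being reported. The sketch below uses the `lung` data set that ships with the `survival` package as a stand-in for NHANES, with an unweighted `coxph` fit; the helper `hazard_ratio_table()` is hypothetical. It uses only `coef()` and `vcov()`, so it should carry over to a `svycoxph` fit, though that is an assumption worth checking on real output:

```{r}
library(survival)

# Hypothetical helper: hazard ratios with Wald 95% confidence intervals.
hazard_ratio_table <- function(fit) {
  log_hr <- coef(fit)
  se     <- sqrt(diag(vcov(fit)))
  z      <- qnorm(0.975)
  data.frame(
    hazard_ratio = exp(log_hr),
    lower_95     = exp(log_hr - z * se),
    upper_95     = exp(log_hr + z * se)
  )
}

# `lung` codes status as 1 = censored, 2 = dead; `status == 2` turns it into
# an event indicator comparable to a 0/1 mortality flag.
toy_cox <- coxph(Surv(time, status == 2) ~ age + sex, data = lung)
hazard_ratio_table(toy_cox)
```

A hazard ratio below 1 indicates a lower hazard of death per unit increase in the covariate, and a value above 1 indicates a higher hazard.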
With those basics in hand, we can now tackle exposome-like associations in NHANES.
For example, we recently found that serum cadmium was significantly associated with all-cause mortality ([Patel CJ, *et al.* 2013](https://www.ncbi.nlm.nih.gov/pubmed/24345851)).
What is the variable name for "cadmium"?
```{r}
VarDescription[grep('cadmium', VarDescription$var_desc, ignore.case = TRUE), ] # looks like it is LBXBCD
suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 1) ## get all data from 1999-2000
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=suvdat)
mod.cox.cadmium.1 <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXBCD, dsn)
summary(mod.cox.cadmium.1)
```
In the above, we see that females have a lower risk of death (hazard ratio of 0.6 for females vs. males) and that a 1 unit (1 ng/mL) higher cadmium level is associated with a 2-fold higher risk of death (HR = 2). Similar results are seen in 2001-2002.

## Does it replicate in the 2001-2002 survey?
```{r}
suvdat <- subset(MainTable, !is.na(MORTSTAT) & !is.na(PERMTH_EXM) & SDDSRVYR == 2) ## 2001-2002 survey
dsn <- svydesign(id=~SDMVPSU, strata=~SDMVSTRA, weights=~WTMEC2YR, nest=TRUE, data=suvdat)
mod.cox.cadmium.2 <- svycoxph(Surv(PERMTH_EXM, MORTSTAT) ~ RIDAGEYR + female + LBXBCD, dsn) # yes, strong association in 2001-2002 as well
summary(mod.cox.cadmium.2)
```
In the above, we see that females again have a lower risk of death (hazard ratio of 0.6 for females vs. males) and that a 1 unit (1 ng/mL) higher cadmium level is associated with a 30% higher risk of death (HR = 1.3). The association points in the same direction as in 1999-2000, although the estimate is smaller.
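
A hazard ratio is tied to the unit of the exposure. Because the Cox model is log-linear in the covariates, the ratio for a difference of `delta` units is the per-unit ratio raised to the power `delta`. A small worked sketch using the rounded hazard ratios quoted in this guide (the helper name `rescale_hr()` is hypothetical):

```{r}
# Hypothetical helper: hazard ratio for a `delta`-unit difference in exposure,
# given the hazard ratio per 1 unit. Equivalent to exp(delta * beta).
rescale_hr <- function(hr_per_unit, delta) {
  hr_per_unit^delta
}

rescale_hr(1.3, 0.5)  # per 0.5 ng/mL, from HR = 1.3 per 1 ng/mL (about 1.14)
rescale_hr(2.0, 0.5)  # per 0.5 ng/mL, from HR = 2 per 1 ng/mL (about 1.41)
```

Rescaling like this can make estimates easier to compare when a 1 unit change is large relative to the observed range of the exposure.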