# Labelled data
One of the best (and somewhat accidental) innovations of the haven package is that it brings value labels and other metadata tags, familiar from other statistical software, into R, primarily via the labelled vector.
Labelled vectors were created as an R equivalent to categorical-esque types. Originally this was only intended as a pass-through class to get to factors. As the [`labelled()`](https://haven.tidyverse.org/reference/labelled.html) documentation from haven says:
> This class provides few methods, as I expect you'll coerce to a standard R class (e.g. a factor()) soon after importing.
It turns out that the labelled class is immensely useful in its own right. Fortunately, R lives in the open source world, and the [labelled](https://larmarange.github.io/labelled/) package was created. It provides a set of helper functions that make labelled datasets easier to work with, particularly for editing and manipulating labels.
We'll be going through some brief examples of working with labels, but for a more detailed general introduction see the [Introduction to labelled](https://larmarange.github.io/labelled/articles/intro_labelled.html) vignette. The [labelled cheat sheet](https://raw.githubusercontent.com/larmarange/labelled/master/cheatsheet/labelled_cheatsheet.pdf) is a fantastic quick function reference.
## What is labelled data?
### The basics
When reading a dataset using haven, variables have labels and other metadata attached as attributes.
Standard attributes included regardless of variable type are:
* A `label` attribute with the variable label
* A `format.stata`, `format.spss`, or `format.sas` attribute, depending on the input type, storing the variable format for the specified file type (e.g. `"F1.0"`)
```{r}
library(haven)
library(labelled)
library(dplyr, warn.conflicts = FALSE)
gss <- read_sav("data/gss/GSS2018.sav", user_na = TRUE)
gss_dta <- read_dta("data/gss/GSS2018.dta")
# A standard numeric variable, with additional attributes
class(gss$YEAR)
str(gss$YEAR)
attributes(gss$YEAR)
class(gss_dta$year)
str(gss_dta$year)
attributes(gss_dta$year)
```
If a variable contains labelled values it will be imported as a [`haven_labelled`](https://haven.tidyverse.org/reference/labelled.html) vector, which stores the value labels in the `labels` attribute.
If we're reading an SPSS file and the variable contains user-defined missing values it will be imported as a [`haven_labelled_spss`](https://haven.tidyverse.org/reference/labelled_spss.html) vector. This is an extension of the `haven_labelled` class that also records user-defined missing values in the `na_values` or `na_range` attribute as appropriate.
```{r}
# A "labelled" categorical variable
class(gss$HEALTH)
str(gss$HEALTH)
attributes(gss$HEALTH)
class(gss_dta$health)
str(gss_dta$health)
attributes(gss_dta$health)
```
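The same structure can be reproduced without the GSS files. The following is a minimal sketch using a hand-made vector built with haven's `labelled_spss()` constructor; the values, labels and the name `health_demo` are invented for illustration and are not taken from the GSS.

```{r}
library(haven)

# Hypothetical toy variable: codes 1-3 are valid, 8 and 9 are user-defined missing
health_demo <- labelled_spss(
  c(1, 2, 3, 8, 9, 2),
  labels = c(Good = 1, Fair = 2, Poor = 3, DK = 8, IAP = 9),
  na_values = c(8, 9),
  label = "Condition of health"
)

class(health_demo)              # includes "haven_labelled_spss" and "haven_labelled"
attr(health_demo, "label")      # the variable label
attr(health_demo, "labels")     # the value labels, a named numeric vector
attr(health_demo, "na_values")  # the user-defined missing values
```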
One immediate advantage of labelled vectors is that value labels are used in data frame printing when using [tibble](https://tibble.tidyverse.org/) (and by extension the wider tidyverse) and other packages using the [pillar](https://cran.r-project.org/web/packages/pillar/index.html) printing methods.
```{r}
# Print helpers
gss %>% count(HEALTH)
gss %>% count(HELPSICK)
```
Using `head()` on a variable will print a nicely formatted summary of the attached metadata, excluding formats.
```{r}
head(gss$HEALTH)
head(gss_dta$health)
```
### Missing values
#### User-defined missing values (SPSS)
SPSS allows for user-defined missing values, where the user can tag a discrete set or a range of values to be treated as missing.
These are relatively simple to deal with in haven, and allow for easy differential treatment of missing values in formatting and recoding methods, as we'll see later. They get a handy `(NA)` marker when printed in a tibble and return `TRUE` from `is.na()`.
```{r}
# Missing values 0, 8 and 9
head(gss$HEALTH)
gss %>% count(HEALTH, is.na(HEALTH))
```
One gotcha in our experience is that although they return `TRUE` from `is.na()` they are not considered equivalent to `NA` in other contexts.
```{r}
# These are not equivalent!
gss %>% count(HEALTH, is.na(HEALTH), HEALTH %in% NA)
```
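The difference is easier to see on a tiny vector. This is a minimal sketch with invented values (the name `x_demo` is hypothetical): `is.na()` consults the `na_values` attribute, whereas `%in%` compares the stored values, so the user-defined missing value is still seen as an `8`.

```{r}
library(haven)

# Hypothetical vector: 8 is a user-defined missing value, the last element is a true NA
x_demo <- labelled_spss(c(1, 8, NA), labels = c(Good = 1, DK = 8), na_values = 8)

is.na(x_demo)   # FALSE TRUE TRUE: the user-defined missing value counts as missing
x_demo %in% NA  # expected to flag only the true NA, since the 8 is still stored as 8
```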
Ranges work similarly to discrete values, but every value that falls within the range is treated as missing, as you would expect.
```{r}
# Missing value range 13 - 99, plus discrete value 0
head(gss$RINCOME)
gss %>% count(RINCOME, is.na(RINCOME))
```
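As a minimal sketch of the same idea on made-up data, the vector below combines one discrete missing value with a missing range using the `na_values` and `na_range` arguments of `labelled_spss()`. The codes and the name `income_demo` are hypothetical and only loosely mirror the `RINCOME` setup.

```{r}
library(haven)

# Hypothetical income codes: 0 is a discrete missing value, 13 to 99 is a missing range
income_demo <- labelled_spss(
  c(0, 5, 12, 13, 50, 99),
  na_values = 0,
  na_range = c(13, 99)
)

is.na(income_demo)  # TRUE FALSE FALSE TRUE TRUE TRUE
```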
#### Tagged missing values (SAS, Stata)
SAS and Stata take the opposite approach to SPSS - rather than tagging a value as missing, they tag missing data with a "type". This is also supported by haven, albeit in a slightly different way.
Tagged missing values appear in the label set as an `NA` with an attached letter flagging the type.
```{r}
head(gss_dta$health)
```
Treatment of tagged missing values can be a bit funny compared to user-defined missing values. Note that, in the example below, doing a straight count for "IAP" does not match the SPSS example and is actually combining the "IAP" and "DK" values.
```{r}
gss_dta %>% count(health, is.na(health))
```
In many circumstances tagged `NA` values will be grouped together like this, which can be misleading, and need to be treated a bit differently.
You can use `na_tag()` to extract the tagged type of the `NA` values, or `is_tagged_na()` to check for values with a particular tag.
```{r}
gss_dta %>%
  count(
    health,
    is.na(health),
    na_tag(health)
  )
gss_dta %>%
  count(
    health,
    na_tag(health),
    is_tagged_na(health),
    is_tagged_na(health, "d")
  )
```
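The tagged `NA` helpers can also be tried on a plain numeric vector. This is a minimal sketch with invented values (the name `tagged_demo` is hypothetical), using haven's `tagged_na()` to create the tagged values by hand.

```{r}
library(haven)

# Hypothetical vector with two tagged missing values and one ordinary NA
tagged_demo <- c(1, 2, tagged_na("i"), tagged_na("d"), NA)

is.na(tagged_demo)              # FALSE FALSE TRUE TRUE TRUE
na_tag(tagged_demo)             # NA NA "i" "d" NA
is_tagged_na(tagged_demo)       # FALSE FALSE TRUE TRUE FALSE
is_tagged_na(tagged_demo, "d")  # FALSE FALSE FALSE TRUE FALSE
```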
#### Zapping
To convert tagged or user-defined missing values to a standard R `NA`, you can use the `zap_missing()` function on either a vector or a data frame.
```{r}
gss %>% count(HEALTH, zap_missing(HEALTH))
gss_dta %>% count(health, na_tag(health), zap_missing(health))
```
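A minimal sketch of zapping on hand-made vectors, one in the SPSS style and one in the Stata style. All values, labels and object names here are invented for illustration.

```{r}
library(haven)

# Hypothetical SPSS-style vector: 9 is a user-defined missing value
spss_demo <- labelled_spss(
  c(1, 2, 9),
  labels = c(Good = 1, Fair = 2, IAP = 9),
  na_values = 9
)
# Hypothetical Stata-style vector: the last element is a tagged missing value
stata_demo <- labelled(
  c(1, 2, tagged_na("i")),
  labels = c(Good = 1, Fair = 2, IAP = tagged_na("i"))
)

zapped_spss <- zap_missing(spss_demo)
zapped_stata <- zap_missing(stata_demo)

is.na(unclass(zapped_spss))  # FALSE FALSE TRUE: the stored 9 is now a real NA
na_tag(zapped_stata)         # NA NA NA: the tag has been removed
```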
We passed `user_na = TRUE` to `read_sav()` when loading the data earlier. If you use `user_na = FALSE` (the default) instead, user-defined missing values are converted to `NA` on the way in.
```{r}
read_sav("data/gss/GSS2018.sav", user_na = TRUE) %>%
  zap_missing() %>%
  count(HEALTH)
read_sav("data/gss/GSS2018.sav", user_na = FALSE) %>%
  count(HEALTH)
```
## Converting labelled vectors
Labelled datasets are great for accessing metadata in the R console, but many functions need base R data types.
### Factors
The labelled package has a couple of helper functions for converting labelled vectors to factors and character vectors. The `to_factor()` function is versatile, and can manipulate labels in various ways on the way to factor levels.
The `levels` argument controls how levels are derived from the value labels.
```{r}
# Convert to factors, using the labels as levels
gss %>% count(HEALTH = to_factor(HEALTH))
# Include the category code in the label
gss %>% count(HEALTH = to_factor(HEALTH, levels = "prefixed"))
# Use the category code instead of the label
gss %>% count(HEALTH = to_factor(HEALTH, levels = "values"))
```
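The three `levels` options are easiest to compare side by side on a small vector. This is a minimal sketch; the codes, labels and the name `rating_demo` are invented for illustration.

```{r}
library(haven)
library(labelled)

# Hypothetical toy variable with three labelled codes
rating_demo <- labelled(c(1, 2, 2, 3), labels = c(Good = 1, Fair = 2, Poor = 3))

levels(to_factor(rating_demo))                       # "Good" "Fair" "Poor"
levels(to_factor(rating_demo, levels = "prefixed"))  # "[1] Good" "[2] Fair" "[3] Poor"
levels(to_factor(rating_demo, levels = "values"))    # "1" "2" "3"
```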
User-defined missing values can be removed from the levels and converted to `NA` using `user_na_to_na = TRUE`.
```{r}
# Remove user-defined NA values
gss %>% count(HEALTH = to_factor(HEALTH, user_na_to_na = TRUE))
```
Labels that don't exist in the data can be dropped from the levels using `drop_unused_labels = TRUE`.
```{r}
# Drop unused labels
table(to_factor(gss$HEALTH))
table(to_factor(gss$HEALTH, drop_unused_labels = TRUE))
```
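A minimal sketch of `user_na_to_na` and `drop_unused_labels` on a hand-made vector. The codes, labels and the name `unused_demo` are hypothetical; the "Poor" label is deliberately defined but never used.

```{r}
library(haven)
library(labelled)

# Hypothetical vector: 9 is user-defined missing, and the "Poor" label (3) never occurs
unused_demo <- labelled_spss(
  c(1, 2, 2, 9),
  labels = c(Good = 1, Fair = 2, Poor = 3, IAP = 9),
  na_values = 9
)

# The user-defined missing value becomes NA in the factor
is.na(to_factor(unused_demo, user_na_to_na = TRUE))       # FALSE FALSE FALSE TRUE

# Unused labels are kept as empty levels unless dropped
levels(to_factor(unused_demo))                            # "Good" "Fair" "Poor" "IAP"
levels(to_factor(unused_demo, drop_unused_labels = TRUE)) # "Good" "Fair" "IAP"
```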
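Factor levels can easily be sorted by either value or label using the `sort_levels` argument. With the default `sort_levels = "auto"`, levels keep the order in which the labels are defined, unless some values have no label, in which case they are sorted by value.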
```{r}
# Sort by value
levels(to_factor(gss$HEALTH, levels = "prefixed", sort_levels = "values"))
# Sort by label
levels(to_factor(gss$HEALTH, levels = "prefixed", sort_levels = "labels"))
# Sort descending
levels(to_factor(gss$HEALTH, levels = "prefixed", sort_levels = "values", decreasing = TRUE))
```
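The effect of each sort option is easier to see when the labels do not happen to be in alphabetical order. This is a minimal sketch on invented data (the name `sort_demo` is hypothetical).

```{r}
library(haven)
library(labelled)

# Hypothetical toy variable: label order (Good, Fair, Poor) differs from alphabetical order
sort_demo <- labelled(c(3, 1, 2), labels = c(Good = 1, Fair = 2, Poor = 3))

levels(to_factor(sort_demo, sort_levels = "values"))                     # "Good" "Fair" "Poor"
levels(to_factor(sort_demo, sort_levels = "labels"))                     # "Fair" "Good" "Poor"
levels(to_factor(sort_demo, sort_levels = "values", decreasing = TRUE))  # "Poor" "Fair" "Good"
```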
By default, unlabelled values are included, with the value itself used as the factor level. They can be converted to `NA` instead with `nolabel_to_na = TRUE`.
```{r}
gss %>% count(HELPSICK = to_factor(HELPSICK))
# Convert unlabelled levels to NA
gss %>% count(HELPSICK = to_factor(HELPSICK, nolabel_to_na = TRUE))
```
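A minimal sketch of the same behaviour on a hand-made vector, where the code `7` has no label. The values and the name `help_demo` are invented for illustration.

```{r}
library(haven)
library(labelled)

# Hypothetical toy variable: 7 appears in the data but has no value label
help_demo <- labelled(c(1, 2, 7), labels = c(Yes = 1, No = 2))

levels(to_factor(help_demo))                       # "Yes" "No" "7"
is.na(to_factor(help_demo, nolabel_to_na = TRUE))  # FALSE FALSE TRUE
```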
And all labelled vectors in the data frame can be converted to factors in one go.
```{r}
# Convert all labelled vectors to factors
to_factor(gss)
```
### Character vectors
The `to_character()` function allows you to convert to a character vector instead of a factor, using the same general conversion arguments as `to_factor()`.
```{r}
# Convert to a character variable
gss %>% count(HEALTH = to_character(HEALTH, levels = "prefixed"))
# Convert user-defined NA values to NA
gss %>% count(HEALTH = to_character(HEALTH, user_na_to_na = TRUE))
```
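As a quick self-contained check, here is a minimal sketch of `to_character()` on an invented vector (the name `char_demo` is hypothetical).

```{r}
library(haven)
library(labelled)

# Hypothetical toy variable with two labelled codes
char_demo <- labelled(c(1, 2, 2), labels = c(Yes = 1, No = 2))

to_character(char_demo)                       # "Yes" "No" "No"
to_character(char_demo, levels = "prefixed")  # "[1] Yes" "[2] No" "[2] No"
```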
## Exploring datasets
The labelled package provides a simple helper function `look_for()` for finding variables with either variable or value labels matching a search term in your dataset.
Some simple examples are included below. For a more detailed rundown of the `look_for()` function see the [vignette](https://larmarange.github.io/labelled/articles/look_for.html).
```{r}
# Find variables with "medical" in the label
look_for(gss, "medical")
# Only provide basic details
look_for(gss, "income", details = FALSE)
# Search using a regular expression
look_for(gss, "medic(al|ation)", details = FALSE)
# Provide a variable summary as a tibble
gss %>%
  look_for("medic(al|ation)") %>%
  as_tibble()
# Provide a variable summary as a tibble with one row per value
gss %>%
  look_for("medic(al|ation)") %>%
  lookfor_to_long_format()
```
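`look_for()` also works on any data frame that carries variable labels, not only on imported files. This is a minimal sketch on a made-up data frame; the column names, labels and the name `survey_demo` are hypothetical, and the labels are attached by hand with `var_label()`.

```{r}
library(labelled)

# Hypothetical two-column data frame with variable labels attached by hand
survey_demo <- data.frame(health = c(1, 2, 1), income = c(10, 20, 30))
var_label(survey_demo$health) <- "Condition of health"
var_label(survey_demo$income) <- "Respondent income"

# Search variable names and labels for a keyword
found <- look_for(survey_demo, "health")
found$variable  # "health"
```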
## Labelled data in other packages
Although labelled datasets are relatively new and somewhat niche, a few packages are starting to make use of the additional metadata they provide.
### Frequency tables with [questionr](https://juba.github.io/questionr)
The questionr package provides a set of convenient helper functions for survey processing tasks. Some of these use label and missing value metadata for display purposes.
Among others, the `freq()` function provides an equivalent to the frequency tables produced in SPSS, and the `ltabs()` function provides a wrapper for `stats::xtabs()` that uses labels by default.
```{r}
library(questionr)
freq(gss$HEALTH)
ltabs(~ HELPSICK + HEALTH, gss)
```
### Tabling with [gtsummary](https://www.danieldsjoberg.com/gtsummary/)
gtsummary was originally developed as a complement to the [gt](https://gt.rstudio.com/) table presentation package, for easily producing summary tables of common indicators for datasets, regression models and so on.
Variable labels will be used for labelling tables by default, where they exist. Value labels are not used by default, but can easily be included by converting the variables to factors as demonstrated in the previous section.
```{r}
library(gtsummary)
```
```{r}
gss %>%
  select(HEALTH, HELPSICK, HELPPOOR) %>%
  to_factor(drop_unused_labels = TRUE, user_na_to_na = TRUE) %>%
  tbl_summary(by = HEALTH)
```
```{r}
gss %>%
  transmute(RINCOME, REALINC = unclass(REALINC), FINRELA) %>%
  to_factor(drop_unused_labels = TRUE, user_na_to_na = TRUE) %>%
  tbl_summary(by = FINRELA, percent = "row")
```
```{r}
gss %>%
  to_factor(drop_unused_labels = TRUE, user_na_to_na = TRUE) %>%
  tbl_cross(HELPSICK, HEALTH, percent = "row")
```
<!-- ## Editing variable metadata -->
<!-- ### Variable labels -->
<!-- ### Value labels -->
<!-- ### Missing values -->