```{r engineering-setup, include = FALSE}
knitr::opts_chunk$set(fig.path = "figures/")
library(tidymodels)
library(kableExtra)
tidymodels_prefer()
val_list <- function(x) {
  x <- format(table(x), big.mark = ",")
  x <- paste0("`", names(x), "` ($n = ", unname(x), "$)")
  knitr::combine_words(x)
}
source("ames_snippets.R")
lm_wflow <-
  lm_wflow %>%
  remove_recipe() %>%
  add_variables(outcome = Sale_Price, predictors = c(Longitude, Latitude))
```
# Feature Engineering with recipes {#recipes}
Feature engineering entails reformatting predictor values to make them easier for a model to use effectively. This includes transformations and encodings of the data to best represent their important characteristics. Imagine that you have two predictors in a data set that can be more effectively represented in your model as a ratio; creating a new predictor from the ratio of the original two is a simple example of feature engineering.
Take the location of a house in Ames as a more involved example. There are a variety of ways that this spatial information can be exposed to a model, including neighborhood (a qualitative measure), longitude/latitude, distance to the nearest school or Iowa State University, and so on. When choosing how to encode these data in modeling, we might choose an option we believe is most associated with the outcome. The original format of the data, for example numeric (e.g., distance) versus categorical (e.g., neighborhood), is also a driving factor in feature engineering choices.
Other examples of preprocessing to build better features for modeling include:
* Correlation between predictors can be reduced via feature extraction or the removal of some predictors.
* When some predictors have missing values, they can be imputed using a sub-model.
* Models that use variance-type measures may benefit from coercing the distribution of some skewed predictors to be symmetric by estimating a transformation.
Feature engineering and data preprocessing can also involve reformatting that may be required by the model. Some models use geometric distance metrics and, consequently, numeric predictors should be centered and scaled so that they are all in the same units. Otherwise, the distance values would be biased by the scale of each column.
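As a small illustration of the scaling issue, the base R sketch below uses a made-up data set (the column names and values are hypothetical, not taken from the Ames data) to show how one large-scale column can dominate Euclidean distances until the columns are centered and scaled.

```r
# Hypothetical toy data: two predictors on very different scales.
homes <- data.frame(
  lot_area = c(8000, 8100, 12000),  # square feet (large numbers)
  bedrooms = c(2, 5, 2)             # counts (small numbers)
)

# Raw Euclidean distances are driven almost entirely by lot_area.
raw_dist <- as.matrix(dist(homes))

# After centering and scaling, both columns contribute comparably.
scaled_dist <- as.matrix(dist(scale(homes)))

# Distances from the first home to the other two, before and after scaling.
raw_dist[1, 2:3]
scaled_dist[1, 2:3]
```

On the raw scale, the second home looks far closer to the first than the third does, even though it has three more bedrooms; after scaling, the two distances are of similar size.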
:::rmdnote
Different models have different preprocessing requirements and some, such as tree-based models, require very little preprocessing at all. Appendix \@ref(pre-proc-table) contains a small table of recommended preprocessing techniques for different models.
:::
In this chapter, we introduce the [`r pkg(recipes)`](https://recipes.tidymodels.org/) package that you can use to combine different feature engineering and preprocessing tasks into a single object and then apply these transformations to different data sets. The `r pkg(recipes)` package is, like `r pkg(parsnip)` for models, one of the core tidymodels packages.
This chapter uses the Ames housing data and the R objects created in the book so far, as summarized in Section \@ref(workflows-summary).
## A Simple `recipe()` for the Ames Housing Data
In this section, we will focus on a small subset of the predictors available in the Ames housing data:
* The neighborhood (qualitative, with `r length(levels(ames_train$Neighborhood))` neighborhoods in the training set)
* The gross above-grade living area (continuous, named `Gr_Liv_Area`)
* The year built (`Year_Built`)
* The type of building (`Bldg_Type` with values `r val_list(ames_train$Bldg_Type)`)
Suppose that an initial ordinary linear regression model were fit to these data. Recalling that, in Chapter \@ref(ames), the sale prices were pre-logged, a standard call to `lm()` might look like:
```{r engineering-ames-simple-formula, eval = FALSE}
lm(Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Year_Built + Bldg_Type, data = ames)
```
When this function is executed, the data are converted from a data frame to a numeric _design matrix_ (also called a _model matrix_) and then the least squares method is used to estimate parameters. In Section \@ref(formula) we listed the multiple purposes of the R model formula; let's focus only on the data manipulation aspects for now. What this formula does can be decomposed into a series of steps:
1. Sale price is defined as the outcome while neighborhood, gross living area, the year built, and building type variables are all defined as predictors.
1. A log transformation is applied to the gross living area predictor.
1. The neighborhood and building type columns are converted from a non-numeric format to a numeric format (since least squares requires numeric predictors).
As mentioned in Chapter \@ref(base-r), the formula method will apply these data manipulations to any data, including new data, that are passed to the `predict()` function.
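To make these steps concrete, here is a minimal base R sketch that uses `model.matrix()` on a tiny, made-up data frame (the `toy` object and its values are assumptions for illustration only) to show the design matrix that such a formula produces.

```r
# Hypothetical toy data standing in for a few Ames-like columns.
toy <- data.frame(
  price = c(5.1, 5.3, 5.0, 5.4),
  area  = c(1000, 2000, 900, 2500),
  type  = factor(c("OneFam", "Duplex", "OneFam", "Twnhs"),
                 levels = c("OneFam", "Duplex", "Twnhs"))
)

# The formula logs `area` and expands `type` into indicator columns.
design <- model.matrix(price ~ log10(area) + type, data = toy)
colnames(design)

# The same manipulations are applied to new data with the same factor levels.
new_rows <- data.frame(
  area = 1500,
  type = factor("Duplex", levels = levels(toy$type))
)
model.matrix(~ log10(area) + type, data = new_rows)
```

The column names of `design` show the intercept, the logged area, and one indicator column for each non-reference level of `type`.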
A recipe is also an object that defines a series of steps for data processing. Unlike the formula method inside a modeling function, the recipe defines the steps via `step_*()` functions without immediately executing them; it is only a specification of what should be done. Here is a recipe equivalent to the previous formula that builds on the code summary in Section \@ref(splitting-summary):
```{r engineering-ames-simple-recipe}
library(tidymodels) # Includes the recipes package
tidymodels_prefer()
simple_ames <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_dummy(all_nominal_predictors())
simple_ames
```
Let's break this down:
1. The call to `recipe()` with a formula tells the recipe the _roles_ of the "ingredients" or variables (e.g., predictor, outcome). It only uses the data `ames_train` to determine the data types for the columns.
1. `step_log()` declares that `Gr_Liv_Area` should be log transformed.
1. `step_dummy()` specifies which variables should be converted from a qualitative format to a quantitative format, in this case, using dummy or indicator variables. An indicator or dummy variable is a binary numeric variable (a column of ones and zeroes) that encodes qualitative information; we will dig deeper into these kinds of variables in Section \@ref(dummies).
The function `all_nominal_predictors()` captures the names of any predictor columns that are currently factor or character (i.e., nominal) in nature. This is a `r pkg(dplyr)`-like selector function similar to `starts_with()` or `matches()` but that can only be used inside of a recipe.
:::rmdnote
Other selectors specific to the `r pkg(recipes)` package are: `all_numeric_predictors()`, `all_numeric()`, `all_predictors()`, and `all_outcomes()`. As with `r pkg(dplyr)`, one or more unquoted expressions, separated by commas, can be used to select which columns are affected by each step.
:::
What are the advantages of using a recipe over a formula or raw predictors? There are several, including:
* These computations can be recycled across models since they are not tightly coupled to the modeling function.
* A recipe enables a broader set of data processing choices than formulas can offer.
* The syntax can be very compact. For example, `all_nominal_predictors()` can be used to capture many variables for specific types of processing while a formula would require each to be explicitly listed.
* All data processing can be captured in a single R object instead of in scripts that are repeated, or even spread across different files.
## Using Recipes
As we discussed in Chapter \@ref(workflows), preprocessing choices and feature engineering should typically be considered part of a modeling workflow, not a separate task. The `r pkg(workflows)` package contains high-level functions to handle different types of preprocessors. Our previous workflow (`lm_wflow`) used a simple set of `r pkg(dplyr)` selectors. To improve on that approach with more complex feature engineering, let's use the `simple_ames` recipe to preprocess data for modeling.
This object can be attached to the workflow:
```{r workflows-fail, error = TRUE}
lm_wflow %>%
  add_recipe(simple_ames)
```
That did not work! We can have only one preprocessing method at a time, so we need to remove the existing preprocessor before adding the recipe.
```{r workflows-add-recipe}
lm_wflow <-
  lm_wflow %>%
  remove_variables() %>%
  add_recipe(simple_ames)
lm_wflow
```
Let's estimate both the recipe and model using a simple call to `fit()`:
```{r workflows-recipe-fit}
lm_fit <- fit(lm_wflow, ames_train)
```
The `predict()` method applies the same preprocessing that was used on the training set to the new data before passing them along to the model's `predict()` method:
```{r workflows-recipe-pred, message = FALSE, warning = FALSE}
predict(lm_fit, ames_test %>% slice(1:3))
```
If we need the bare model object or recipe, there are `extract_*` functions that can retrieve them:
```{r workflows-pull}
# Get the recipe after it has been estimated:
lm_fit %>%
  extract_recipe(estimated = TRUE)
# To tidy the model fit:
lm_fit %>%
  # This returns the parsnip object:
  extract_fit_parsnip() %>%
  # Now tidy the linear model object:
  tidy() %>%
  slice(1:5)
```
:::rmdnote
Tools for using (and debugging) recipes outside of workflow objects are described in Section \@ref(recipe-functions).
:::
## How Data Are Used by the `recipe()`
Data are passed to recipes at different stages.
First, when calling `recipe(..., data)`, the data set is used to determine the data types of each column so that selectors such as `all_numeric()` or `all_numeric_predictors()` can be used.
Second, when preparing the data using `fit(workflow, data)`, the training data are used for all estimation operations including a recipe that may be part of the `workflow`, from determining factor levels to computing PCA components and everything in between.
:::rmdwarning
All preprocessing and feature engineering steps use *only* the training data. Otherwise, information leakage can negatively impact the model's performance when used with new data.
:::
Finally, when using `predict(workflow, new_data)`, no model or preprocessor parameters like those from recipes are re-estimated using the values in `new_data`. Take centering and scaling using `step_normalize()` as an example. Using this step, the means and standard deviations from the appropriate columns are determined from the training set; new samples at prediction time are standardized using these values from training when `predict()` is invoked.
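The idea can be sketched in a few lines of base R. This is only a conceptual illustration of what a normalization step does, not the internal code of `step_normalize()`; the vectors `train_x` and `new_x` are hypothetical.

```r
# Hypothetical training and new-data values for one numeric predictor.
train_x <- c(10, 12, 14, 16, 18)
new_x   <- c(11, 30)

# "Estimate" the step using the training set only.
train_mean <- mean(train_x)
train_sd   <- sd(train_x)

# Apply the stored training statistics to both data sets.
train_scaled <- (train_x - train_mean) / train_sd
new_scaled   <- (new_x - train_mean) / train_sd

new_scaled
```

The new values are never used to recompute the mean or standard deviation, so an unusual new sample (such as 30 here) simply receives a large standardized value.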
## Examples of Recipe Steps {#example-steps}
Before proceeding, let's take an extended tour of the capabilities of `r pkg(recipes)` and explore some of the most important `step_*()` functions. These recipe step functions each specify a specific possible step in a feature engineering process, and different recipe steps can have different effects on columns of data.
### Encoding qualitative data in a numeric format {#dummies}
One of the most common feature engineering tasks is transforming nominal or qualitative data (factors or characters) so that they can be encoded or represented numerically. Sometimes we can alter the factor levels of a qualitative column in helpful ways prior to such a transformation. For example, `step_unknown()` can be used to change missing values to a dedicated factor level. Similarly, if we anticipate that a new factor level may be encountered in future data, `step_novel()` can allot a new level for this purpose.
Additionally, `step_other()` can be used to analyze the frequencies of the factor levels in the training set and convert infrequently occurring values to a catch-all level of "other," with a threshold that can be specified. A good example is the `Neighborhood` predictor in our data, shown in Figure \@ref(fig:ames-neighborhoods).
```{r ames-neighborhoods, echo = FALSE}
#| fig.cap = "Frequencies of neighborhoods in the Ames training set",
#| fig.alt = "A bar chart of the frequencies of neighborhoods in the Ames training set. The most homes are in North Ames while the Greens, Green Hills, and Landmark neighborhood have very few instances."
ggplot(ames_train, aes(y = Neighborhood)) +
  geom_bar() +
  labs(y = NULL)
```
Here we see that two neighborhoods have fewer than five properties in the training data (Landmark and Green Hills); in this case, no houses at all in the Landmark neighborhood were included in the testing set. For some models, it may be problematic to have dummy variables with a single nonzero entry in the column. At a minimum, it is highly improbable that these features would be important to a model. If we add `step_other(Neighborhood, threshold = 0.01)` to our recipe, the neighborhoods that each account for less than 1% of the training set will be lumped into a new level called "other." In this training set, this will catch `r xfun::numbers_to_words(sum(table(ames_train$Neighborhood)/nrow(ames_train) <= .01))` neighborhoods.
For the Ames data, we can amend the recipe to use:
```{r engineering-ames-recipe-other}
simple_ames <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors())
```
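The pooling logic behind a step such as `step_other()` can be sketched in base R. The code below is a conceptual approximation on made-up labels (the `nbhd` vector is hypothetical), and the exact handling of levels that sit right at the threshold in `r pkg(recipes)` may differ.

```r
# Hypothetical neighborhood labels: two common levels and two rare ones.
nbhd <- factor(c(rep("North_Ames", 120), rep("College_Creek", 78),
                 rep("Landmark", 1), rep("Green_Hills", 1)))

# Proportion of rows in each level.
props <- prop.table(table(nbhd))

# Levels below the threshold are pooled into a single "other" level.
threshold <- 0.01
rare <- names(props)[props < threshold]
lumped <- as.character(nbhd)
lumped[lumped %in% rare] <- "other"
lumped <- factor(lumped)

table(lumped)
```

In a real analysis, the set of rare levels would be determined from the training set and then reused for any new data.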
:::rmdnote
Many, but not all, underlying model calculations require predictor values to be encoded as numbers. Notable exceptions include tree-based models, rule-based models, and naive Bayes models.
:::
The most common method for converting a factor predictor to a numeric format is to create dummy or indicator variables. Let's take the predictor in the Ames data for the building type, which is a factor variable with five levels (see Table \@ref(tab:dummy-vars)). For dummy variables, the single `Bldg_Type` column would be replaced with four numeric columns whose values are either zero or one. These binary variables represent specific factor level values. In R, the convention is to exclude a column for the first factor level (`OneFam`, in this case). The `Bldg_Type` column would be replaced with a column called `TwoFmCon` that is one when the row has that value and zero otherwise. Three other columns are similarly created:
```{r engineering-all-dummies, echo = FALSE, results = 'asis'}
show_rows <-
  ames_train %>%
  mutate(.row = row_number()) %>%
  group_by(Bldg_Type) %>%
  dplyr::select(Bldg_Type, .row) %>%
  slice(1) %>%
  pull(.row)
recipe(~Bldg_Type, data = ames_train) %>%
  step_mutate(`Raw Data` = Bldg_Type) %>%
  step_dummy(Bldg_Type, naming = function(var, lvl, ordinal = FALSE, sep = "_") lvl) %>%
  prep() %>%
  bake(ames_train) %>%
  slice(show_rows) %>%
  arrange(`Raw Data`) %>%
  kable(
    caption = 'Illustration of binary encodings (i.e., dummy variables) for a qualitative predictor.',
    label = "dummy-vars"
  ) %>%
  kable_styling(full_width = FALSE)
```
Why not all five? The most basic reason is simplicity; if you know the value for these four columns, you can determine the last value because these are mutually exclusive categories. More technically, the classical justification is that a number of models, including ordinary linear regression, have numerical issues when there are linear dependencies between columns. If all five building type indicator columns are included, they would add up to the intercept column (if there is one). This would cause an issue, or perhaps an outright error, in the underlying matrix algebra.
The full set of encodings can be used for some models. This is traditionally called the one-hot encoding and can be achieved using the `one_hot` argument of `step_dummy()`.
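The linear dependency described above can be checked directly in base R. This sketch uses a short, made-up factor with five levels (the values are hypothetical) and compares the rank of the design matrix under the two encodings.

```r
# A factor with five levels, mirroring the structure of a building type column.
bldg <- factor(c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE", "OneFam"),
               levels = c("OneFam", "TwoFmCon", "Duplex", "Twnhs", "TwnhsE"))

# Reference-cell dummies: an intercept plus four indicator columns.
dummies <- model.matrix(~ bldg)

# All five indicators (one-hot) placed alongside an intercept column.
one_hot <- cbind(`(Intercept)` = 1, model.matrix(~ bldg - 1))

# The five indicators sum to the intercept, so one column is redundant.
qr(dummies)$rank  # equals ncol(dummies)
qr(one_hot)$rank  # one less than ncol(one_hot)
```

The rank-deficient `one_hot` matrix is the situation that can cause trouble for ordinary least squares; models that do not rely on inverting this matrix are typically unaffected.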
One helpful feature of `step_dummy()` is that there is more control over how the resulting dummy variables are named. In base R, dummy variable names mash the variable name with the level, resulting in names like `NeighborhoodVeenker`. Recipes, by default, use an underscore as the separator between the name and level (e.g., `Neighborhood_Veenker`) and there is an option to use custom formatting for the names. The default naming convention in `r pkg(recipes)` makes it easier to capture those new columns in future steps using a selector, such as `starts_with("Neighborhood_")`.
Traditional dummy variables require that all of the possible categories be known to create a full set of numeric features. There are other methods for doing this transformation to a numeric format. _Feature hashing_ methods only consider the value of the category to assign it to a predefined pool of dummy variables. _Effect_ or _likelihood encodings_ replace the original data with a single numeric column that measures the _effect_ of those data. Both feature hashing and effect encoding can seamlessly handle situations where a novel factor level is encountered in the data. Chapter \@ref(categorical) explores these and other methods for encoding categorical data, beyond straightforward dummy or indicator variables.
:::rmdnote
Different recipe steps behave differently when applied to variables in the data. For example, `step_log()` modifies a column in place without changing the name. Other steps, such as `step_dummy()`, eliminate the original data column and replace it with one or more columns with different names. The effect of a recipe step depends on the type of feature engineering transformation being done.
:::
### Interaction terms
Interaction effects involve two or more predictors. Such an effect occurs when one predictor has an effect on the outcome that is contingent on one or more other predictors. For example, if you were trying to predict how much traffic there will be during your commute, two potential predictors could be the specific time of day you commute and the weather. However, the relationship between the amount of traffic and bad weather is different for different times of day. In this case, you could add an interaction term between the two predictors to the model along with the original two predictors (which are called the main effects). Numerically, an interaction term between predictors is encoded as their product. Interactions are defined in terms of their effect on the outcome and can be combinations of different types of data (e.g., numeric or categorical). [Chapter 7](https://bookdown.org/max/FES/detecting-interaction-effects.html) of @fes discusses interactions and how to detect them in greater detail.
After exploring the Ames training set, we might find that the regression slopes for the gross living area differ for different building types, as shown in Figure \@ref(fig:building-type-interactions).
```{r engineering-ames-feature-plots, eval=FALSE}
ggplot(ames_train, aes(x = Gr_Liv_Area, y = 10^Sale_Price)) +
  geom_point(alpha = .2) +
  facet_wrap(~ Bldg_Type) +
  geom_smooth(method = lm, formula = y ~ x, se = FALSE, color = "lightblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Gross Living Area", y = "Sale Price (USD)")
```
```{r building-type-interactions, ref.label = "engineering-ames-feature-plots"}
#| echo = FALSE,
#| fig.cap = "Gross living area (in log-10 units) versus sale price (also in log-10 units) for five different building types",
#| fig.alt = "Scatter plots of gross living area (in log-10 units) versus sale price (also in log-10 units) for five different building types. All trends are linear but appear to have different slopes and intercepts for the different building types."
```
How are interactions specified in a recipe? A base R formula would take an interaction using a `:`, so we would use:
```r
Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) + Bldg_Type +
  log10(Gr_Liv_Area):Bldg_Type
# or
Sale_Price ~ Neighborhood + log10(Gr_Liv_Area) * Bldg_Type
```
where `*` expands those columns to the main effects and interaction term. Again, the formula method does many things simultaneously and understands that a factor variable (such as `Bldg_Type`) should be expanded into dummy variables first and that the interaction should involve all of the resulting binary columns.
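A quick base R check, on a small made-up data frame (the `toy` object is hypothetical), shows what the formula method produces: the factor is expanded into a dummy column first, and the interaction column is the product of the numeric column and that dummy column.

```r
# Hypothetical toy data: one numeric predictor and one two-level factor.
toy <- data.frame(
  area = c(2.9, 3.1, 3.3, 3.0),
  type = factor(c("OneFam", "Duplex", "OneFam", "Duplex"),
                levels = c("OneFam", "Duplex"))
)

# `*` expands to both main effects plus the interaction column.
mm <- model.matrix(~ area * type, data = toy)
colnames(mm)

# The interaction column is the product of the numeric and dummy columns.
all.equal(unname(mm[, "area:typeDuplex"]),
          unname(mm[, "area"] * mm[, "typeDuplex"]))
```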
Recipes are more explicit and sequential, and they give you more control. With the current recipe, `step_dummy()` has already created dummy variables. How would we combine these for an interaction? The additional step would look like `step_interact(~ interaction terms)` where the terms on the right-hand side of the tilde are the interactions. These can include selectors, so it would be appropriate to use:
```{r engineering-ames-interact-recipe}
simple_ames <-
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type,
         data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>%
  # Gr_Liv_Area is on the log scale from a previous step
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") )
```
Additional interactions can be specified in this formula by separating them by `+`. Also note that the recipe will only use interactions between different variables; if the formula uses `var_1:var_1`, this term will be ignored.
Suppose that, in a recipe, we had not yet made dummy variables for building types. It would be inappropriate to include a factor column in this step, such as:
```r
step_interact( ~ Gr_Liv_Area:Bldg_Type )
```
This is telling the underlying (base R) code used by `step_interact()` to make dummy variables and then form the interactions. In fact, if this occurs, a warning states that this might generate unexpected results.
:::rmdwarning
This behavior gives you more control, but it is different from R's standard model formula.
:::
As with naming dummy variables, `r pkg(recipes)` provides more coherent names for interaction terms. In this case, the interaction is named `Gr_Liv_Area_x_Bldg_Type_Duplex` instead of `Gr_Liv_Area:Bldg_TypeDuplex` (which is not a valid column name for a data frame).
:::rmdnote
_Remember that order matters_. The gross living area is log transformed prior to the interaction term. Subsequent interactions with this variable will also use the log scale.
:::
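The naming and ordering behavior can be tried on a built-in data set. The sketch below substitutes `iris` for the Ames data so that it runs on its own; the choice of columns is purely illustrative, and it assumes the `r pkg(recipes)` package is installed.

```r
library(recipes)

# iris stands in for the Ames data so that the sketch is self-contained.
rec <-
  recipe(Sepal.Length ~ Sepal.Width + Species, data = iris) %>%
  # The log transformation comes first, so later steps see the log scale.
  step_log(Sepal.Width, base = 10) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ Sepal.Width:starts_with("Species_"))

# Estimate the recipe and return the processed training set.
processed <- bake(prep(rec), new_data = NULL)
names(processed)
```

The interaction columns carry the `_x_` separator in their names, and their values are products of the log-scale `Sepal.Width` and the corresponding dummy variable.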
### Spline functions
When a predictor has a nonlinear relationship with the outcome, some types of predictive models can adaptively approximate this relationship during training. However, simpler is usually better and it is not uncommon to try to use a simple model, such as a linear fit, and add in specific nonlinear features for predictors that may need them, such as longitude and latitude for the Ames housing data. One common method for doing this is to use _spline_ functions to represent the data. Splines replace the existing numeric predictor with a set of columns that allow a model to emulate a flexible, nonlinear relationship. As more spline terms are added to the data, the capacity to nonlinearly represent the relationship increases. Unfortunately, it may also increase the likelihood of picking up on data trends that occur by chance (i.e., overfitting).
If you have ever used `geom_smooth()` within a `ggplot`, you have probably used a spline representation of the data. For example, each panel in Figure \@ref(fig:ames-latitude-splines) uses a different number of smooth splines for the latitude predictor:
```{r engineering-ames-splines, eval=FALSE}
library(patchwork)
library(splines)
plot_smoother <- function(deg_free) {
  ggplot(ames_train, aes(x = Latitude, y = 10^Sale_Price)) +
    geom_point(alpha = .2) +
    scale_y_log10() +
    geom_smooth(
      method = lm,
      formula = y ~ ns(x, df = deg_free),
      color = "lightblue",
      se = FALSE
    ) +
    labs(title = paste(deg_free, "Spline Terms"),
         y = "Sale Price (USD)")
}
( plot_smoother(2) + plot_smoother(5) ) / ( plot_smoother(20) + plot_smoother(100) )
```
```{r ames-latitude-splines, ref.label = "engineering-ames-splines"}
#| echo = FALSE,
#| fig.cap = "Sale price versus latitude, with trend lines using natural splines with different degrees of freedom",
#| fig.alt = "Scatter plots of sale price versus latitude with trend lines using natural splines with different degrees of freedom. As the degrees of freedom increase, the lines are more responsive to trends in the data but begin to become excessively complex with 100 spline terms."
```
The `ns()` function in the `r pkg(splines)` package generates feature columns using functions called _natural splines_.
Some panels in Figure \@ref(fig:ames-latitude-splines) clearly fit poorly; two terms _underfit_ the data while 100 terms _overfit_. The panels with five and twenty terms seem like reasonably smooth fits that catch the main patterns of the data. This indicates that the proper amount of "nonlinear-ness" matters. The number of spline terms could then be considered a _tuning parameter_ for this model. These types of parameters are explored in Chapter \@ref(tuning).
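To see what a spline expansion looks like as data, the following sketch applies `ns()` to a simulated predictor (the data are made up for illustration) and compares a straight-line fit with a fit on the spline basis columns.

```r
library(splines)

# Simulated predictor with a nonlinear relationship to the outcome.
set.seed(1)
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.2)

# ns() replaces the single column with `df` basis columns.
basis <- ns(x, df = 5)
dim(basis)

# A linear model on the basis columns can follow the curve.
linear_fit <- lm(y ~ x)
spline_fit <- lm(y ~ ns(x, df = 5))
c(linear = summary(linear_fit)$r.squared,
  spline = summary(spline_fit)$r.squared)
```

The model is still fit by ordinary least squares; the flexibility comes entirely from the added columns.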
In `r pkg(recipes)`, multiple steps can create these types of terms. To add a natural spline representation for this predictor:
```{r engineering-spline-rec, eval = FALSE}
recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + Latitude,
       data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>%
  step_other(Neighborhood, threshold = 0.01) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>%
  step_ns(Latitude, deg_free = 20)
```
The user would need to determine if both neighborhood and latitude should be in the model since they both represent the same underlying data in different ways.
### Feature extraction
Another common method for representing multiple features at once is called _feature extraction_. Most of these techniques create new features from the predictors that capture the information in the broader set as a whole. For example, principal component analysis (PCA) tries to extract as much of the original information in the predictor set as possible using a smaller number of features. PCA is a linear extraction method, meaning that each new feature is a linear combination of the original predictors. One nice aspect of PCA is that the new features, called the principal components or PCA scores, are uncorrelated with one another. Because of this, PCA can be very effective at reducing the correlation between predictors. Note that PCA is only aware of the predictors; the new PCA features might not be associated with the outcome.
In the Ames data, several predictors measure the size of the property, such as the total basement size (`Total_Bsmt_SF`), size of the first floor (`First_Flr_SF`), the gross living area (`Gr_Liv_Area`), and so on. PCA might be an option to represent these potentially redundant variables as a smaller feature set. Apart from the gross living area, these predictors have the suffix `SF` in their names (for square feet) so a recipe step for PCA might look like:
```r
# Use a regular expression to capture house size predictors:
step_pca(matches("(SF$)|(Gr_Liv)"))
```
Note that all of these columns are measured in square feet. PCA assumes that all of the predictors are on the same scale. That's true in this case; when it is not, this step can be preceded by `step_normalize()`, which will center and scale each column.
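The decorrelating property of PCA can be illustrated with base R's `prcomp()`. The data below are simulated stand-ins for correlated size measurements (the column names and values are hypothetical), and the sketch is not the internal code of `step_pca()`.

```r
# Simulated, correlated "size" predictors (hypothetical values).
set.seed(2)
base_size <- rnorm(100, mean = 1500, sd = 300)
sizes <- data.frame(
  first_flr = base_size + rnorm(100, sd = 50),
  basement  = 0.8 * base_size + rnorm(100, sd = 50),
  living    = 1.2 * base_size + rnorm(100, sd = 50)
)

# The original columns are strongly correlated.
round(cor(sizes), 2)

# Center and scale, then extract principal components.
pca <- prcomp(sizes, center = TRUE, scale. = TRUE)

# The component scores are uncorrelated with one another.
round(cor(pca$x), 2)
```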
There are existing recipe steps for other extraction methods, such as: independent component analysis (ICA), non-negative matrix factorization (NNMF), multidimensional scaling (MDS), uniform manifold approximation and projection (UMAP), and others.
### Row sampling steps
Recipe steps can affect the rows of a data set as well. For example, _subsampling_ techniques for class imbalances change the class proportions in the data being given to the model; these techniques often don't improve overall performance but can generate better-behaved distributions of the predicted class probabilities. These are approaches to try when subsampling your data with class imbalance:
* _Downsampling_ the data keeps the minority class and takes a random sample of the majority class so that class frequencies are balanced.
* _Upsampling_ replicates samples from the minority class to balance the classes. Some techniques do this by synthesizing new samples that resemble the minority class data while other methods simply add the same minority samples repeatedly.
* _Hybrid methods_ do a combination of both.
The [`r pkg(themis)`](https://themis.tidymodels.org/) package has recipe steps that can be used to address class imbalance via subsampling. For simple downsampling, we would use:
```r
step_downsample(outcome_column_name)
```
:::rmdwarning
Only the training set should be affected by these techniques. The test set or other holdout samples should be left as-is when processed using the recipe. For this reason, all of the subsampling steps default the `skip` argument to have a value of `TRUE` (Section \@ref(skip-equals-true)).
:::
Other step functions are row-based as well: `step_filter()`, `step_sample()`, `step_slice()`, and `step_arrange()`. In almost all uses of these steps, the `skip` argument should be set to `TRUE`.
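As a conceptual sketch of downsampling, the base R code below balances a made-up two-class data set by keeping every minority-class row and sampling the majority class down to the same size. It illustrates the idea only and is not the implementation used by `r pkg(themis)`; the `dat` object is hypothetical.

```r
# Hypothetical imbalanced outcome: 10% events, 90% nonevents.
set.seed(3)
dat <- data.frame(
  class = factor(rep(c("event", "nonevent"), times = c(50, 450))),
  x = rnorm(500)
)

# Keep every minority-class row; sample the majority class down to match.
minority_n <- min(table(dat$class))
keep <- unlist(lapply(
  split(seq_len(nrow(dat)), dat$class),
  function(rows) rows[sample.int(length(rows), minority_n)]
))
downsampled <- dat[sort(keep), ]

table(dat$class)
table(downsampled$class)
```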
### General transformations
Mirroring the original `r pkg(dplyr)` operation, `step_mutate()` can be used to conduct a variety of basic operations on the data. It is best used for straightforward transformations like computing a ratio of two variables, such as `Bedroom_AbvGr / Full_Bath`, the ratio of bedrooms to bathrooms for the Ames housing data.
:::rmdwarning
When using this flexible step, use extra care to avoid data leakage in your preprocessing. Consider, for example, the transformation `x = w > mean(w)`. When applied to new data or testing data, this transformation would use the mean of `w` from the _new_ data, not the mean of `w` from the training data.
:::
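The leakage in that transformation is easy to reproduce in base R. The vectors below are hypothetical; the point is only that recomputing the mean on new data changes the meaning of the resulting feature.

```r
# Hypothetical training and new-data values of a predictor `w`.
train_w <- c(1, 2, 3, 4, 10)
new_w   <- c(5, 6, 7)

# Leaky: the cutoff is recomputed from whatever data are supplied.
leaky_flag <- new_w > mean(new_w)

# Safer: the cutoff is estimated once from the training set and reused.
cutoff <- mean(train_w)
safe_flag <- new_w > cutoff

leaky_flag
safe_flag
```

All three new values exceed the training mean of 4, yet the leaky version flags only one of them because it compares each value with the mean of the new data.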
### Natural language processing
Recipes can also handle data that are not in the traditional structure where the columns are features. For example, the [`r pkg(textrecipes)`](https://textrecipes.tidymodels.org/) package can apply natural language processing methods to the data. The input column is typically a string of text, and different steps can be used to tokenize the data (e.g., split the text into separate words), filter out tokens, and create new features appropriate for modeling.
## Skipping Steps for New Data {#skip-equals-true}
The sale price data are already log-transformed in the `ames` data frame. Why not use:
```r
step_log(Sale_Price, base = 10)
```
This will cause a failure when the recipe is applied to new properties with an unknown sale price. Since price is what we are trying to predict, there probably won't be a column in the data for this variable. In fact, to avoid information leakage, many tidymodels packages isolate the data being used when making any predictions. This means that the training set and any outcome columns are not available for use at prediction time.
:::rmdnote
For simple transformations of the outcome column(s), we strongly suggest that those operations be _conducted outside of the recipe_.
:::
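The sketch below illustrates both points with a small made-up data set (the `houses_raw` and `new_houses` data frames are hypothetical). The first recipe transforms the outcome inside the recipe and fails when baked on new data that lack a price column; the second assumes the outcome was transformed beforehand and only touches the predictor.

```r
library(recipes)

# Hypothetical toy data with the outcome on its original scale
houses_raw <- data.frame(
  price = c(100000, 150000, 200000, 300000),
  area  = c(1000, 1400, 1800, 2600)
)

# New properties to predict: no price column is available
new_houses <- data.frame(area = c(1200, 2000))

# Transforming the outcome inside the recipe: baking the new data fails
bad_rec <-
  recipe(price ~ area, data = houses_raw) %>%
  step_log(price, base = 10) %>%
  prep()
bad_result <- try(bake(bad_rec, new_data = new_houses), silent = TRUE)
failed <- inherits(bad_result, "try-error")

# Transforming the outcome outside the recipe: only predictors are processed
houses_log <- transform(houses_raw, price = log10(price))
good_rec <-
  recipe(price ~ area, data = houses_log) %>%
  step_log(area, base = 10) %>%
  prep()
good_result <- bake(good_rec, new_data = new_houses)
```

With this arrangement, predictions are on the log-10 scale and can be back-transformed afterward (e.g., with `10^`) if original units are needed.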
However, there are other circumstances where this is not an adequate solution. For example, in classification models where there is a severe class imbalance, it is common to conduct _subsampling_ of the data that are given to the modeling function. Suppose that there were two classes and a 10% event rate. A simple, albeit controversial, approach would be to _downsample_ the data so that the model is provided with all of the events and a random 10% of the nonevent samples.
The problem is that the same subsampling process should not be applied to the data being predicted. As a result, when using a recipe, we need a mechanism to ensure that some operations are applied only to the data that are given to the model. Each step function has an option called `skip`; when it is set to `TRUE`, the step is ignored by the `predict()` function. In this way, you can isolate the steps that affect the modeling data without causing errors when applied to new samples. However, all steps are applied when using `fit()`.
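A minimal sketch of this behavior at the recipe level is shown below, using simulated two-class data with a 10% event rate (the `imbal` data frame is made up for illustration). Here `bake()` with `new_data = NULL` returns the processed training set, while `bake()` on any other data stands in for what happens at prediction time.

```r
library(recipes)
library(themis)

set.seed(1)
# Hypothetical imbalanced data: 10 events and 90 nonevents
imbal <- data.frame(
  class = factor(rep(c("event", "nonevent"), times = c(10, 90))),
  x     = rnorm(100)
)

down_rec <-
  recipe(class ~ x, data = imbal) %>%
  step_downsample(class) %>%   # skip = TRUE by default
  prep()

# The retained training set has been downsampled...
train_processed <- bake(down_rec, new_data = NULL)
table(train_processed$class)

# ...but the step is skipped when the recipe is applied to other data
new_processed <- bake(down_rec, new_data = imbal)
nrow(new_processed)
```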
```{r engineering-skips, include = FALSE}
library(recipes)
library(themis)
preps <- as.character(methods("prep"))
steps <- gsub("prep\\.", "", preps)
steps <- grep("^step", steps, value = TRUE)
skip <- rep(rlang::na_lgl, length(steps))
for (i in seq_along(skip)) {
  x_code <- try(getFromNamespace(steps[i], "recipes"), silent = TRUE)
  if (inherits(x_code, "try-error")) {
    x_code <- try(getFromNamespace(steps[i], "themis"), silent = TRUE)
  }
  if (!inherits(x_code, "try-error")) {
    skip[i] <- formals(x_code)$skip
  }
}
# which() drops steps whose default could not be determined (NA)
skip_list <- paste0("`", steps[which(skip)], "()`")
```
At the time of this writing, the step functions in the `r pkg(recipes)` and `r pkg(themis)` packages that are only applied to the training data are: `r knitr::combine_words(skip_list)`.
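To check the default for a particular step yourself, one option is to inspect the function's formal arguments. The short sketch below does this for two steps; it relies only on base R's `formals()`.

```r
library(recipes)
library(themis)

# Default value of `skip` for a subsampling step and for an ordinary step
downsample_skip <- formals(step_downsample)$skip
log_skip        <- formals(step_log)$skip

downsample_skip
log_skip
```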
## Tidy a `recipe()`
In Section \@ref(tidiness-modeling), we introduced the `tidy()` verb for statistical objects. There is also a `tidy()` method for recipes, as well as individual recipe steps. Before proceeding, let's create an extended recipe for the Ames data using some of the new steps we've discussed in this chapter:
```{r engineering-lm-extended-recipe}
ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)
```
The `tidy()` method, when called with the recipe object, gives a summary of the recipe steps:
```{r engineering-ames-tidy-rec}
tidy(ames_rec)
```
This result can be helpful for identifying individual steps, perhaps to then be able to execute the `tidy()` method on one specific step.
We can specify the `id` argument in any step function call; otherwise it is generated using a random suffix. Setting this value can be helpful if the same type of step is added to the recipe more than once. Let's specify the `id` ahead of time for `step_other()`, since we'll want to `tidy()` it:
```{r engineering-lm-recipe-id}
ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01, id = "my_id") %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)
```
We'll refit the workflow with this new recipe:
```{r engineering-lm-extended-recipe-fit}
lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)

lm_fit <- fit(lm_wflow, ames_train)
```
The `tidy()` method can be called again along with the `id` identifier we specified to get our results for applying `step_other()`:
```{r engineering-lm-tidy-other}
estimated_recipe <- 
  lm_fit %>% 
  extract_recipe(estimated = TRUE)

tidy(estimated_recipe, id = "my_id")
```
The `tidy()` results we see here for using `step_other()` show which factor levels were retained, i.e., not added to the new "other" category.
The `tidy()` method can be called with the `number` identifier as well, if we know which step in the recipe we need:
```{r engineering-ames-tidy-other}
tidy(estimated_recipe, number = 2)
```
Each `tidy()` method returns the relevant information about that step. For example, the `tidy()` method for `step_dummy()` returns a column with the variables that were converted to dummy variables and another column with all of the known levels for each column.
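As a small, self-contained sketch of this, the code below tidies a trained `step_dummy()` on made-up data (the `toy` data frame and the `"dummies"` id are hypothetical). The exact columns returned can differ across versions of `r pkg(recipes)`, so it is worth inspecting the result directly.

```r
library(recipes)

# Hypothetical toy data with one factor predictor
toy <- data.frame(
  y    = c(1, 2, 3, 4),
  type = factor(c("a", "b", "c", "a"))
)

dummy_rec <-
  recipe(y ~ type, data = toy) %>%
  step_dummy(type, id = "dummies") %>%
  prep()

# Information specific to the dummy variable step
dummy_info <- tidy(dummy_rec, id = "dummies")
dummy_info
```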
## Column Roles
When a formula is used with the initial call to `recipe()`, it assigns _roles_ to each of the columns, depending on which side of the tilde they are on. Those roles are either `"predictor"` or `"outcome"`. However, other roles can be assigned as needed.
For example, in our Ames data set, the original raw data contained a column for address.^[Our version of these data does not contain that column.] It may be useful to keep that column in the data so that, after predictions are made, problematic results can be investigated in detail. In other words, the column could be important even when it isn't a predictor or outcome.
To solve this, the `add_role()`, `remove_role()`, and `update_role()` functions can be helpful. For example, for the house price data, the role of the street address column could be modified using:
```r
ames_rec %>% update_role(address, new_role = "street address")
```
After this change, the `address` column in the data frame will no longer be a predictor but instead will be a `"street address"` according to the recipe. Any character string can be used as a role. Also, columns can have multiple roles (additional roles are added via `add_role()`) so that they can be selected under more than one context.
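Since our version of the Ames data has no address column, the sketch below uses a small made-up data frame to show the effect. The `houses` data and the `"size"` role are hypothetical; `summary()` on a recipe lists each column alongside its current role(s).

```r
library(recipes)

# Hypothetical toy data that includes an address column
houses <- data.frame(
  price   = c(5.0, 5.2, 5.3),
  area    = c(1000, 1400, 1800),
  address = c("1 Elm St", "2 Oak Ave", "3 Pine Rd")
)

role_rec <-
  recipe(price ~ ., data = houses) %>%
  update_role(address, new_role = "street address") %>%  # replaces "predictor"
  add_role(area, new_role = "size")                      # adds a second role

roles <- summary(role_rec)
roles
```

In the summary, `address` appears once with its new role, while `area` appears twice: once as a predictor and once with the additional role.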
This can be helpful when the data are _resampled_. It helps to keep the columns that are not involved with the model fit in the same data frame (rather than in an external vector). Resampling, described in Chapter \@ref(resampling), creates alternate versions of the data mostly by row subsampling. If the street address were stored in a separate object, additional subsampling would be required and might lead to more complex code and a higher likelihood of errors.
Finally, all step functions have a `role` argument that can assign roles to the results of the step. In many cases, columns affected by a step retain their existing role. For example, the `step_log()` call in our `ames_rec` object affected the `Gr_Liv_Area` column. For that step, the default behavior is to keep the existing role for this column since no new column is created. As a counter-example, the step to produce splines defaults new columns to have a role of `"predictor"` since that is usually how spline columns are used in a model. Most steps have sensible defaults but, since the defaults can be different, be sure to check the documentation page to understand which role(s) will be assigned.
## Chapter Summary {#recipes-summary}
In this chapter, you learned about using `r pkg(recipes)` for flexible feature engineering and data preprocessing, from creating dummy variables to handling class imbalance and more. Feature engineering is an important part of the modeling process where information leakage can easily occur and good practices must be adopted. Between the `r pkg(recipes)` package and other packages that extend recipes, there are over 100 available steps. All possible recipe steps are enumerated at [`tidymodels.org/find`](https://www.tidymodels.org/find/). The `r pkg(recipes)` framework provides a rich data manipulation environment for preprocessing and transforming data prior to modeling.
Additionally, [`tidymodels.org/learn/develop/recipes/`](https://www.tidymodels.org/learn/develop/recipes/) shows how custom steps can be created.
Our work here has used recipes solely inside of a workflow object. For modeling, that is the recommended use because feature engineering should be estimated together with a model. However, for visualization and other activities, a workflow may not be appropriate; more recipe-specific functions may be required. Chapter \@ref(dimensionality) discusses lower-level APIs for fitting, using, and troubleshooting recipes.
The code that we will use in later chapters is:
```{r engineering-summary, eval = FALSE}
library(tidymodels)
data(ames)
ames <- mutate(ames, Sale_Price = log10(Sale_Price))
set.seed(502)
ames_split <- initial_split(ames, prop = 0.80, strata = Sale_Price)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
ames_rec <- 
  recipe(Sale_Price ~ Neighborhood + Gr_Liv_Area + Year_Built + Bldg_Type + 
           Latitude + Longitude, data = ames_train) %>%
  step_log(Gr_Liv_Area, base = 10) %>% 
  step_other(Neighborhood, threshold = 0.01) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_interact( ~ Gr_Liv_Area:starts_with("Bldg_Type_") ) %>% 
  step_ns(Latitude, Longitude, deg_free = 20)

lm_model <- linear_reg() %>% set_engine("lm")

lm_wflow <- 
  workflow() %>% 
  add_model(lm_model) %>% 
  add_recipe(ames_rec)
lm_fit <- fit(lm_wflow, ames_train)
```
```{r engineering-save, include = FALSE}
if (is_new_version(lm_fit, "RData/lm_fit.RData")) {
  save(lm_fit, file = "RData/lm_fit.RData", version = 2, compress = "xz")
}
```