
Commit 0d2e649

New interface for validation splits (#653)
* draft post on new validation split interface
* Apply suggestions from code review
  Co-authored-by: Emil Hvitfeldt <[email protected]>
* reviewer feedback
* update date
* re-render without dev versions
---------
Co-authored-by: Emil Hvitfeldt <[email protected]>
1 parent f4a7d68 commit 0d2e649


4 files changed

+469
-0
lines changed

Lines changed: 151 additions & 0 deletions
@@ -0,0 +1,151 @@
---
output: hugodown::hugo_document

slug: validation-split-as-3-way-split
title: New interface to validation splits
date: 2023-08-25
author: Hannah Frick
description: >
  The latest releases of rsample and tune provide a new interface to
  validation sets as a three-way split.

photo:
  url: https://unsplash.com/photos/68GdK1Aoc8g
  author: Scott Webb

# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
categories: [package]
tags: [tidymodels, rsample, tune]
---

<!--
TODO:
* [x] Look over / edit the post's title in the yaml
* [x] Edit (or delete) the description; note this appears in the Twitter card
* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`)
* [x] Find photo & update yaml metadata
* [x] Create `thumbnail-sq.jpg`; height and width should be equal
* [x] Create `thumbnail-wd.jpg`; width should be >5x height
* [x] `hugodown::use_tidy_thumbnails()`
* [x] Add intro sentence, e.g. the standard tagline for the package
* [x] `usethis::use_tidy_thanks()`
-->
We're chuffed to announce the release of a new interface to validation splits in [rsample](https://rsample.tidymodels.org/) 1.2.0 and [tune](https://tune.tidymodels.org/) 1.1.2. The rsample package makes it easy to create resamples for assessing model performance. The tune package facilitates hyperparameter tuning for the tidymodels packages.

You can install the new versions from CRAN with:

```{r, eval = FALSE}
install.packages(c("rsample", "tune"))
```

This blog post will walk you through how to make a validation split and use it for tuning.

You can see a full list of changes in the release notes for [rsample](https://github.com/tidymodels/rsample/releases/tag/v1.2.0) and [tune](https://github.com/tidymodels/tune/releases/tag/v1.1.2).

Let's start by loading the tidymodels package, which loads both rsample and tune, among others.

```{r setup}
library(tidymodels)
```
## The new functions

You can now make a three-way split of your data instead of doing a sequence of two binary splits.

- `initial_validation_split()` with variants `initial_validation_time_split()` and `group_initial_validation_split()` for the initial three-way split
- `validation_set()` to create the `rset` for tuning containing the analysis (= training) and assessment (= validation) set
- `training()`, `validation()`, and `testing()` for access to the separate subsets
- `last_fit()` (and `fit_best()`) now also work on the initial three-way split

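
The two variants in the list above are not used later in this post, so here is a minimal sketch of how they might be called. The `toy` data frame and its columns `day`, `site`, and `value` are made up for illustration. The sketch assumes the time-based variant takes rows in their existing order and that the grouped variant keeps all rows of a group in the same subset.

```r
library(rsample)

set.seed(1)
# Hypothetical toy data: 100 daily measurements from 20 sites
toy <- data.frame(
  day = 1:100,
  site = rep(paste0("site_", 1:20), each = 5),
  value = rnorm(100)
)

# Time-based variant: rows are taken in order, so the data should
# already be sorted by time
time_split <- initial_validation_time_split(toy, prop = c(0.6, 0.2))
range(training(time_split)$day)
range(validation(time_split)$day)
range(testing(time_split)$day)

# Grouped variant: all rows of a site end up in the same subset,
# so this intersection should be empty
group_split <- group_initial_validation_split(toy, group = site)
intersect(training(group_split)$site, testing(group_split)$site)
```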
## The new functions in action

To illustrate how to use the new functions, we'll replicate an analysis of [childcare cost](https://github.com/rfordatascience/tidytuesday/blob/master/data/2023/2023-05-09/readme.md) from a [Tidy Tuesday](https://github.com/rfordatascience/tidytuesday) done by Julia Silge in one of her [screencasts](https://juliasilge.com/blog/childcare-costs/).

We are modeling the median weekly price for school-aged kids in childcare centers (`mcsa`) and are thus removing the other variables containing different variants of median prices (e.g., for different age groups). We are also removing the FIPS code identifying the county, as we are including various characteristics of the counties instead of their ID.

```{r}
library(readr)

childcare_costs <- read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2023/2023-05-09/childcare_costs.csv')

childcare_costs <- childcare_costs |>
  # drop the other median price variables; `mcsa` matches neither pattern
  select(-matches("^mc_|^mfc")) |>
  # drop the county identifier
  select(-county_fips_code) |>
  drop_na()

glimpse(childcare_costs)
```
Even after omitting rows with missing values, we are left with `r nrow(childcare_costs)` observations. That is plenty to work with! We are likely to get a reliable estimate of the model performance from a validation set without having to fit and evaluate the model multiple times, as with, for example, v-fold cross-validation.

We are creating a three-way split of the data into a training, a validation, and a test set with the new `initial_validation_split()` function. We are stratifying based on our outcome `mcsa`. The default of `prop = c(0.6, 0.2)` means that 60% of the data is allocated to the training set and 20% to the validation set, while the remaining 20% goes into the test set.

```{r initial-validation-split}
set.seed(123)
childcare_split <- childcare_costs |>
  initial_validation_split(strata = mcsa)
childcare_split
```
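
If the default proportions don't suit your data, you can set `prop` yourself. The following is a small sketch on a made-up data frame (`toy`, with columns `x` and `y`, is not part of the childcare analysis) that uses an 80/10/10 split and counts the rows in each subset. With stratification, the counts may differ from the nominal proportions by a few rows.

```r
library(rsample)

set.seed(2)
# Hypothetical toy data with a numeric outcome to stratify on
toy <- data.frame(x = rnorm(1000), y = rnorm(1000))

# 80% training, 10% validation, and the remaining 10% for testing
toy_split <- initial_validation_split(toy, prop = c(0.8, 0.1), strata = y)

# Rows per subset
n_rows <- c(
  training = nrow(training(toy_split)),
  validation = nrow(validation(toy_split)),
  testing = nrow(testing(toy_split))
)
n_rows
```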
You can access the subsets of the data with the familiar `training()` and `testing()` as well as the new `validation()`:

```{r validation}
validation(childcare_split)
```
You may want to extract the training data to do some exploratory data analysis, but here we are going to rely on xgboost to figure out patterns in the data so we can breeze straight to tuning a model.

```{r xgb-wflow}
xgb_spec <-
  boost_tree(
    trees = 500,
    min_n = tune(),
    mtry = tune(),
    stop_iter = tune(),
    learn_rate = 0.01
  ) |>
  # xgboost holds out 20% of the data it is given to monitor early stopping
  set_engine("xgboost", validation = 0.2) |>
  set_mode("regression")

xgb_wf <- workflow(mcsa ~ ., xgb_spec)
xgb_wf
```
We pass this workflow, which contains the model specification, to `tune_grid()` to try multiple combinations of the hyperparameters we tagged for tuning (`min_n`, `mtry`, and `stop_iter`).

During tuning, the model should not have access to the test data, only to the data used to fit the model (the analysis set) and the data used to assess the model (the assessment set). Each pair of analysis and assessment set forms a resample. For 10-fold cross-validation, we'd have 10 resamples. With a validation split, we have just one resample, with the training set functioning as the analysis set and the validation set as the assessment set. The tidymodels tuning functions all expect a _set_ of resamples (which can be of size one), and the corresponding objects are of class `rset`.

To remove the test data from the initial three-way split and create such an `rset` object for tuning, use `validation_set()`.

```{r validation-set}
set.seed(234)
childcare_set <- validation_set(childcare_split)
childcare_set
```
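
To make the "set of resamples of size one" idea concrete, here is a small sketch on made-up data (the `toy` data frame is hypothetical). It checks that the result is an `rset` with a single row and that its analysis and assessment sets have the same sizes as the training and validation sets of the three-way split.

```r
library(rsample)

set.seed(3)
toy <- data.frame(x = rnorm(200), y = rnorm(200))  # hypothetical toy data
toy_split <- initial_validation_split(toy)
toy_set <- validation_set(toy_split)

# An rset with exactly one resample
inherits(toy_set, "rset")
nrow(toy_set)

# Analysis set = training set, assessment set = validation set
resample <- toy_set$splits[[1]]
nrow(analysis(resample)) == nrow(training(toy_split))
nrow(assessment(resample)) == nrow(validation(toy_split))

# The test set is not part of the rset and stays untouched during tuning
nrow(testing(toy_split))
```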
We are going to try 15 different parameter combinations and pick the one with the smallest RMSE.

```{r tune-grid}
set.seed(234)
xgb_res <- tune_grid(xgb_wf, childcare_set, grid = 15)
# name the `metric` argument; newer versions of tune no longer accept it by position
best_parameters <- select_best(xgb_res, metric = "rmse")
childcare_wflow <- finalize_workflow(xgb_wf, best_parameters)
```
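
Before settling on the best candidate, it can help to look at the tuning results themselves. The sketch below swaps in a small decision tree (rpart engine) on made-up data instead of the xgboost model above, purely to keep it quick to run. The `toy` data and the `tree_*` names are illustrative assumptions. Because a validation set is a single resample, each metric is based on `n = 1` estimate, so no standard error is available.

```r
library(tidymodels)

set.seed(4)
# Hypothetical toy data
toy <- tibble(x1 = rnorm(300), x2 = rnorm(300), y = 2 * x1 - x2 + rnorm(300))

toy_split <- initial_validation_split(toy)
toy_set <- validation_set(toy_split)

tree_spec <- decision_tree(min_n = tune(), tree_depth = tune()) |>
  set_engine("rpart") |>
  set_mode("regression")

tree_res <- tune_grid(workflow(y ~ ., tree_spec), toy_set, grid = 5)

# One row per candidate and metric; `n` is 1 because there is one resample
tree_metrics <- collect_metrics(tree_res)
tree_metrics

# The candidates with the smallest RMSE on the validation set
tree_best <- show_best(tree_res, metric = "rmse", n = 3)
tree_best
```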
`last_fit()` then lets you fit your model on the training data and calculate performance on the test data. If you provide it with a three-way split, you can choose if you want your model to be fitted on the training data only or on the combination of training and validation set. You can specify this with the `add_validation_set` argument.

```{r last-fit}
childcare_fit <- last_fit(childcare_wflow, childcare_split, add_validation_set = TRUE)
collect_metrics(childcare_fit)
```
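
To see what `add_validation_set` changes, here is a sketch that fits a plain linear model both ways on made-up data and counts the rows each final fit was trained on. The `toy` data and the helper `n_fitted()` are hypothetical and only serve the illustration. Either way, performance is computed on the same test set.

```r
library(tidymodels)

set.seed(5)
toy <- tibble(x = rnorm(500), y = 3 * x + rnorm(500))  # hypothetical toy data
toy_split <- initial_validation_split(toy)
lm_wf <- workflow(y ~ x, linear_reg())

# Illustrative helper: number of rows the final model was fitted on
n_fitted <- function(res) nobs(extract_fit_engine(extract_workflow(res)))

fit_train <- last_fit(lm_wf, toy_split, add_validation_set = FALSE)
fit_both <- last_fit(lm_wf, toy_split, add_validation_set = TRUE)

n_fitted(fit_train)  # rows in the training set only
n_fitted(fit_both)   # training plus validation rows

# Test set performance of the model fitted on training and validation data
collect_metrics(fit_both)
```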
This takes you through the important changes for validation sets in the tidymodels framework!

## Acknowledgements

Many thanks to the people who contributed since the last releases!

For rsample: [&#x0040;afrogri37](https://github.com/afrogri37), [&#x0040;AngelFelizR](https://github.com/AngelFelizR), [&#x0040;bschneidr](https://github.com/bschneidr), [&#x0040;erictleung](https://github.com/erictleung), [&#x0040;exsell-jc](https://github.com/exsell-jc), [&#x0040;hfrick](https://github.com/hfrick), [&#x0040;jrosell](https://github.com/jrosell), [&#x0040;MasterLuke84](https://github.com/MasterLuke84), [&#x0040;MichaelChirico](https://github.com/MichaelChirico), [&#x0040;mikemahoney218](https://github.com/mikemahoney218), [&#x0040;rdavis120](https://github.com/rdavis120), [&#x0040;sametsoekel](https://github.com/sametsoekel), [&#x0040;Shafi2016](https://github.com/Shafi2016), [&#x0040;simonpcouch](https://github.com/simonpcouch), [&#x0040;topepo](https://github.com/topepo), and [&#x0040;trevorcampbell](https://github.com/trevorcampbell).

For tune: [&#x0040;blechturm](https://github.com/blechturm), [&#x0040;cphaarmeyer](https://github.com/cphaarmeyer), [&#x0040;EmilHvitfeldt](https://github.com/EmilHvitfeldt), [&#x0040;forecastingEDs](https://github.com/forecastingEDs), [&#x0040;hfrick](https://github.com/hfrick), [&#x0040;kjbeath](https://github.com/kjbeath), [&#x0040;mikemahoney218](https://github.com/mikemahoney218), [&#x0040;rdavis120](https://github.com/rdavis120), [&#x0040;simonpcouch](https://github.com/simonpcouch), and [&#x0040;topepo](https://github.com/topepo).
