# Miscellaneous Terminology
We saw several ways to refer to features/inputs and targets/outputs in @tbl-feature-target-names. Here are some more to consider.
### Observations
When we refer to the rows of data, we often call them observations. Here are some common terms for observations.
```{r}
#| echo: false
#| label: tbl-sample-names
#| tbl-cap: Common Terms for Observations
tbl_observations = tibble(
    Name = c('data', 'observation', 'sample', 'instance', 'example')
) |>
    gt()

# Display the table; the assignment alone prints nothing
tbl_observations
```
### Data Splits
When we want to use data to assess performance, we use splits of the data. Here are some common terms for these splits. For each, we note whether its primary use is initial model assessment (training) or not (testing). To be clear, sets used in the (cross-)validation process are still used in training whenever they aren't being used for validation.
```{r}
#| echo: false
#| label: tbl-split-names
#| tbl-cap: Common Terms for Data Splits
tbl_splits = tibble(
    Name = c('train', 'validation', 'test', 'holdout', 'out-of-sample'),
    Type = c('training', 'training', 'testing', 'testing', 'either')
) |>
    gt()

# Display the table; the assignment alone prints nothing
tbl_splits
```
Some refer to out-of-sample data as holdout data, because it is held out of the original training sample or was never included in it. However, out-of-sample data can also be used for validation. For example, in a random forest, the observations left out of a given tree's bootstrap sample, typically called the out-of-bag data, serve this purpose.
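As a minimal sketch of how such splits might be created, the following base R code randomly assigns simulated rows to train, validation, and test sets. The data, the object names (`df`, `split`), and the 60/20/20 proportions are illustrative assumptions, not a recommendation.

```r
# Hypothetical data: 100 observations with one feature and one target
set.seed(123)
n = 100
df = data.frame(x = rnorm(n), y = rnorm(n))

# Shuffle a vector of split labels so each row gets exactly one label
# (60 train, 20 validation, 20 test; proportions are illustrative only)
split = sample(rep(c('train', 'validation', 'test'), times = c(60, 20, 20)))

df_train      = df[split == 'train', ]
df_validation = df[split == 'validation', ]
df_test       = df[split == 'test', ]

# Count of observations in each split
table(split)
```

In this sketch the test set is set aside once and not touched during model development, while the train and validation sets could be reshuffled, as in cross-validation.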
### Regularization
Regularization is a technique used to prevent overfitting in models. Here are some common terms for regularization.
```{r}
#| echo: false
#| label: tbl-regularization-names
#| tbl-cap: Common Terms for Regularization
tbl_regularization = tibble(
    Name = c('penalty', 'shrinkage', 'weight decay', 'L1', 'L2', 'elastic net')
) |>
    gt()

# Display the table; the assignment alone prints nothing
tbl_regularization
```
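To illustrate why 'shrinkage' appears among these terms, here is a small sketch in base R that compares unpenalized least squares coefficients with coefficients estimated under an L2 penalty. The simulated data, the helper `ridge_coef()`, and the penalty value of 50 are all hypothetical choices made for illustration.

```r
# Simulated data; all names and values here are illustrative assumptions
set.seed(42)
n = 200
X = scale(matrix(rnorm(n * 3), ncol = 3))  # standardized features
y = drop(X %*% c(2, -1, 0.5) + rnorm(n))
y = y - mean(y)                            # center the target so no intercept is needed

# Closed-form L2-penalized estimate: solve (X'X + lambda * I) b = X'y
ridge_coef = function(X, y, lambda) {
    drop(solve(crossprod(X) + lambda * diag(ncol(X)), crossprod(X, y)))
}

coef_ols   = ridge_coef(X, y, lambda = 0)   # no penalty: ordinary least squares
coef_ridge = ridge_coef(X, y, lambda = 50)  # penalty pulls coefficients toward zero

rbind(ols = coef_ols, ridge = coef_ridge)
```

The penalized coefficients are smaller in overall magnitude than the unpenalized ones, which is the sense in which the estimates are 'shrunk'.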
Common models that use regularization include LASSO (L1), ridge regression (L2), and elastic net, which combines the two. 'Penalty' is the general name for the regularization term added to a model's objective function.
```{r}
#| echo: false
#| label: tbl-reg-reg-terms
tbl_reg_models = tibble(
    Model = c('LASSO', 'ridge regression', 'elastic net'),
    Regularization = c('L1', 'L2', 'L1 and L2'),
    # `Bayesian Counterpart` = c('Laplace', 'Gaussian', '')
) |>
    gt()

# Display the table; the assignment alone prints nothing
tbl_reg_models
```
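For reference, the penalties behind these names can be written as follows. This is one common parameterization, not the only one. The symbols are assumptions introduced here: $\beta_j$ are the $p$ model coefficients (excluding the intercept), $\lambda \ge 0$ controls the strength of the penalty, and $\alpha \in [0, 1]$ sets the mix between the two. Some software scales these terms by constant factors.

$$
\begin{aligned}
\text{L1 (LASSO):} \quad & \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{L2 (ridge):} \quad & \lambda \sum_{j=1}^{p} \beta_j^2 \\
\text{Elastic net:} \quad & \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + (1 - \alpha) \sum_{j=1}^{p} \beta_j^2 \right)
\end{aligned}
$$

In each case the penalty is added to the model's loss, so setting $\lambda = 0$ recovers the unpenalized model, while $\alpha = 1$ and $\alpha = 0$ reduce the elastic net penalty to the L1 and L2 penalties, respectively.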
## Problematic Terminology
Some terms used in data science can be problematic. Here are a few to consider.
- Using 'truth' as a synonym for the target variable
- Using 'significant' without a clear definition of what it means
- Using (feature) 'importance' without a clear definition of what it means
- Saying a model is *better* without any understanding of the uncertainty in the comparison