terminology.qmd

# Miscellaneous Terminology

We saw several ways to refer to features/inputs and targets/outputs in @tbl-feature-target-names. Here are some more to consider.


### Observations

When we refer to the rows of data, we often call them observations. Here are some common terms for observations.
```{r}
#| echo: false
#| label: tbl-sample-names
#| tbl-cap: Common Terms for Observations

tbl_observations = tibble(
    Name = c('data', 'observation', 'sample', 'instance', 'example'),
) |>
    gt()
```


### Data Splits

When we want to use data to assess performance, we will use splits of the data. Here are some common terms for these splits. For each, we'll note their primary use initial model assessment (training) or not (testing). To be clear, sets used in the (cross-) validation process are still used in training whenever they aren't being used for validation.

```{r}
#| echo: false
#| label: tbl-split-names
#| tbl-cap: Common Terms for Data Splits

tbl_splits = tibble(
    Name = c('train', 'validation', 'test', 'holdout', 'out-of-sample'),
    Type = c('training', 'training', 'testing', 'testing', 'either')
) |>
    gt()
```

Some may refer to out-of-sample data as holdout data. This is because it is held out of the original training sample of data and/or was never included. However, it can also be used for validation, for example, within a random forest model, typically called the out-of-bag data.


### Regularization

Regularization is a technique used to prevent overfitting in models. Here are some common terms for regularization.

```{r}
#| echo: false
#| label: tbl-regularization-names
#| tbl-cap: Common Terms for Regularization

tbl_regularization = tibble(
    Name = c('penalty', 'shrinkage', 'weight decay', 'L1', 'L2', 'elastic net'),
) |>
    gt()
```

Common models that use regularization include LASSO (L1), ridge regression (L2), and elastic net, which is a combination of the two. The penalty term is a general term for the regularization term in a model.


```{r}
#| echo: false
#| label: tbl-reg-reg-terms

tbl_reg_models = tibble(
    Model = c('LASSO', 'ridge regression', 'elastic net'),
    Regularization = c('L1', 'L2', 'L1 and L2'),
    # `Bayesian Counterpart` = ('Laplace', 'Gaussian', '')
) |>
    gt()
```

## Problematic Terminology

There are some terms that are used in data science that can be problematic. Here are a few to consider.

- Using 'truth' as a synonym for the target variable
- significant without a clear definition of what it means
- (feature) importance without a clear definition of what it means
- Saying something like a model is *better* without any understanding of the uncertainty in the comparison.