addresses #636 tidyverse
rempsyc committed Oct 4, 2023
1 parent 1726625 commit ebeaafd
Showing 7 changed files with 46 additions and 37 deletions.
3 changes: 3 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -93,7 +93,9 @@ Suggests:
dbscan,
estimatr,
fixest,
flextable,
forecast,
ftExtra,
gamm4,
ggplot2,
glmmTMB,
@@ -128,6 +130,7 @@ Suggests:
psych,
qqplotr (>= 0.0.6),
randomForest,
rempsyc,
rmarkdown,
rstanarm,
rstantools,
5 changes: 2 additions & 3 deletions papers/JOSE/paper.Rmd
@@ -187,7 +187,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.
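The pipe support mentioned above can be sketched as follows (a minimal illustration, assuming the {performance} package is installed; the `mtcars` formula is purely an example, not the paper's model):

```r
# Minimal sketch: piping a fitted model straight into check_outliers().
# Assumes {performance} is installed; the formula here is illustrative.
model <- lm(disp ~ mpg * hp, data = mtcars)

# Native pipe (|>); the magrittr pipe (%>%) works the same way.
outliers <- model |> performance::check_outliers(method = "cook")
outliers
```

Because the model object is the first argument of `check_outliers()`, both pipe operators forward it without any placeholder syntax.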

```{r model}
model <- lm(disp ~ mpg * disp, data = data)
@@ -228,7 +228,6 @@ _Summary of Statistical Outlier Detection Methods Recommendations_
```{r table1_print, echo=FALSE, message=FALSE}
x <- flextable::flextable(df, cwidth = 1.25)
x <- flextable::theme_apa(x)
# x <- flextable::align(x, align = "left", part = "all")
x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all")
x <- flextable::fontsize(x, size = 10, part = "all")
ftExtra::colformat_md(x)
@@ -238,7 +237,7 @@ ftExtra::colformat_md(x)
All `check_outliers()` output objects possess a `plot()` method, meaning it is also possible to visualize the outliers using the generic `plot()` function on the resulting outlier object after loading the {see} package (Figure 1).

```{r model_fig, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
plot(outliers)
plot(outliers) # Figure 1 above
```

## Cook's Distance vs. MCD
20 changes: 8 additions & 12 deletions papers/JOSE/paper.log
@@ -1,4 +1,4 @@
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 16:57
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 17:36
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -1177,10 +1177,6 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[4]
Underfull \hbox (badness 1331) in paragraph at lines 628--636
[][]$[][][][][] [] [] [] [][][][][][][][][] [] [][][][][][] [] [][] [] [][][][][][][] [] [][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]$[][]\TU/lmr/m/n/10 ). We show a
[]


Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1222,7 +1218,7 @@ Package fontspec Info: Font family 'LatinModernRoman(0)' created for font
(fontspec) - 'bold italic small caps' (b/scit) with NFSS spec.:

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/n' will be
(Font) scaled to size 10.0pt on input line 680.
(Font) scaled to size 10.0pt on input line 683.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1263,7 +1259,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/b/n' will be
(Font) scaled to size 9.70718pt on input line 692.
(Font) scaled to size 9.70718pt on input line 695.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1307,7 +1303,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/it' will be
(Font) scaled to size 10.0pt on input line 692.
(Font) scaled to size 10.0pt on input line 695.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1387,7 +1383,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.


Overfull \hbox (4.64252pt too wide) in paragraph at lines 692--692
Overfull \hbox (4.64252pt too wide) in paragraph at lines 695--695
[]|[]\TU/LatinModernRoman(0)/m/it/10 check_outliers(model,[][]
[]

@@ -1745,17 +1741,17 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[8]
Underfull \hbox (badness 1584) in paragraph at lines 995--1001
Underfull \hbox (badness 1584) in paragraph at lines 998--1004
[]\TU/lmr/m/n/10 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psy-
[]


Underfull \hbox (badness 3049) in paragraph at lines 995--1001
Underfull \hbox (badness 3049) in paragraph at lines 998--1004
\TU/lmr/m/n/10 chology: Undisclosed flexibility in data collection and analysis allows pre-
[]


Underfull \hbox (badness 3735) in paragraph at lines 995--1001
Underfull \hbox (badness 3735) in paragraph at lines 998--1004
\TU/lmr/m/n/10 senting anything as significant. \TU/lmr/m/it/10 Psychological Science\TU/lmr/m/n/10 , \TU/lmr/m/it/10 22\TU/lmr/m/n/10 (11), 1359–1366.
[]

4 changes: 2 additions & 2 deletions papers/JOSE/paper.md
@@ -218,7 +218,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.


```r
@@ -310,7 +310,7 @@ All `check_outliers()` output objects possess a `plot()` method, meaning it is a


```r
plot(outliers)
plot(outliers) # Figure 1 above
```

\begin{figure}
Binary file modified papers/JOSE/paper.pdf
Binary file modified papers/JOSE/paper_files/figure-latex/model_fig-1.pdf
51 changes: 31 additions & 20 deletions vignettes/check_outliers.Rmd
@@ -27,7 +27,8 @@ knitr::opts_chunk$set(
)
options(digits = 2)
pkgs <- c("see", "performance", "datawizard", "rempsyc")
pkgs <- c("see", "performance", "datawizard", "rempsyc",
"ggplot2", "flextable", "ftExtra")
successfully_loaded <- vapply(pkgs, requireNamespace, FUN.VALUE = logical(1L), quietly = TRUE)
can_evaluate <- all(successfully_loaded)
@@ -154,7 +155,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.

```{r model, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
model <- lm(disp ~ mpg * disp, data = data)
@@ -168,31 +169,41 @@ Using the model-based outlier detection method, we identified a single outlier.

Table 1 below summarizes which methods to use in which cases, and with what threshold. The recommended thresholds are the default thresholds.

```{r, echo=FALSE}
```{r table1_prep, echo=FALSE}
df <- data.frame(
`Statistical Test` = c(
"Supported regression model",
"Structural Equation Modeling (or other unsupported model)",
"Simple test with few variables (*t* test, correlation, etc.)"
),
"Supported regression model",
"Structural Equation Modeling (or other unsupported model)",
"Simple test with few variables (*t* test, correlation, etc.)"),
`Diagnosis Method` = c(
"**Model-based**: Cook (or Pareto for Bayesian models)",
"**Multivariate**: Minimum Covariance Determinant (MCD)",
"**Univariate**: robust *z* scores (MAD)"
),
"**Model-based**: Cook (or Pareto for Bayesian models)",
"**Multivariate**: Minimum Covariance Determinant (MCD)",
"**Univariate**: robust *z* scores (MAD)"),
`Recommended Threshold` = c(
"`qf(0.5, ncol(x), nrow(x) - ncol(x))` (or 0.7 for Pareto)",
"`qchisq(p = 1 - 0.001, df = ncol(x))`",
"`qnorm(p = 1 - 0.001 / 2)`, ~ 3.29"
)
)
knitr::kable(
df,
col.names = gsub("[.]", " ", names(df)),
caption = "Summary of Statistical Outlier Detection Methods Recommendations.", longtable = TRUE
"_qf(0.5, ncol(x), nrow(x) - ncol(x))_ (or 0.7 for Pareto)",
"_qchisq(p = 1 - 0.001, df = ncol(x))_",
"_qnorm(p = 1 - 0.001 / 2)_, ~ 3.29"),
`Function Usage` = c(
'_check_outliers(model, method = "cook")_',
'_check_outliers(data, method = "mcd")_',
'_check_outliers(data, method = "zscore_robust")_'),
check.names = FALSE
)
```

### Table 1

_Summary of Statistical Outlier Detection Methods Recommendations_

```{r table1_print, echo=FALSE, message=FALSE}
x <- flextable::flextable(df, cwidth = 2.25)
x <- flextable::theme_apa(x)
x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all")
# x <- flextable::fontsize(x, size = 10, part = "all")
ftExtra::colformat_md(x)
```
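The recommended thresholds listed in Table 1 are plain base-R quantile calls, so they can be recomputed directly. A quick sketch (the values of `n` and `p` stand in for `nrow(x)` and `ncol(x)` of a hypothetical data frame `x` and are purely illustrative):

```r
# Recompute Table 1's recommended thresholds with base R quantile functions.
# n and p are illustrative stand-ins for nrow(x) and ncol(x).
n <- 100
p <- 3

qf(0.5, p, n - p)             # model-based: Cook's distance threshold
qchisq(p = 1 - 0.001, df = p) # multivariate: MCD threshold
qnorm(p = 1 - 0.001 / 2)      # univariate robust z threshold, ~3.29
```

The last call makes the table's "~ 3.29" explicit: it is the two-tailed critical value of the standard normal distribution at *p* < .001.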

## Cook's Distance vs. MCD

@leys2018outliers report a preference for the MCD method over Cook's distance. This is because Cook's distance removes one observation at a time and checks its corresponding influence on the model each time [@cook1977detection], and flags any observation that has a large influence. In the view of these authors, when there are several outliers, the process of removing a single outlier at a time is problematic as the model remains "contaminated" or influenced by other possible outliers in the model, rendering this method suboptimal in the presence of multiple outliers.
