diff --git a/DESCRIPTION b/DESCRIPTION index 933df479a..99ec823a7 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -93,7 +93,9 @@ Suggests: dbscan, estimatr, fixest, + flextable, forecast, + ftExtra, gamm4, ggplot2, glmmTMB, @@ -128,6 +130,7 @@ Suggests: psych, qqplotr (>= 0.0.6), randomForest, + rempsyc, rmarkdown, rstanarm, rstantools, diff --git a/papers/JOSE/paper.Rmd b/papers/JOSE/paper.Rmd index 5e8409178..e8403d389 100644 --- a/papers/JOSE/paper.Rmd +++ b/papers/JOSE/paper.Rmd @@ -187,7 +187,7 @@ Working with regression models creates the possibility of using model-based SOD In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.] -Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below. +Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). 
Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below. ```{r model} model <- lm(disp ~ mpg * hp, data = data) @@ -228,7 +228,6 @@ _Summary of Statistical Outlier Detection Methods Recommendations_ ```{r table1_print, echo=FALSE, message=FALSE} x <- flextable::flextable(df, cwidth = 1.25) x <- flextable::theme_apa(x) -# x <- flextable::align(x, align = "left", part = "all") x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all") x <- flextable::fontsize(x, size = 10, part = "all") ftExtra::colformat_md(x) @@ -238,7 +237,7 @@ ftExtra::colformat_md(x) All `check_outliers()` output objects possess a `plot()` method, meaning it is also possible to visualize the outliers using the generic `plot()` function on the resulting outlier object after loading the {see} package (Figure 1). ```{r model_fig, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."} -plot(outliers) +plot(outliers) # Figure 1 above ``` ## Cook's Distance vs. MCD diff --git a/papers/JOSE/paper.log b/papers/JOSE/paper.log index 231a93157..c09bc9e1c 100644 --- a/papers/JOSE/paper.log +++ b/papers/JOSE/paper.log @@ -1,4 +1,4 @@ -This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 16:57 +This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 17:36 entering extended mode restricted \write18 enabled. %&-line parsing enabled. @@ -1177,10 +1177,6 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) \addtolength{\topmargin}{-1.71957pt}.
[4] -Underfull \hbox (badness 1331) in paragraph at lines 628--636 -[][]$[][][][][] [] [] [] [][][][][][][][][] [] [][][][][][] [] [][] [] [][][][][][][] [] [][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]$[][]\TU/lmr/m/n/10 ). We show a - [] - Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. @@ -1222,7 +1218,7 @@ Package fontspec Info: Font family 'LatinModernRoman(0)' created for font (fontspec) - 'bold italic small caps' (b/scit) with NFSS spec.: LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/n' will be -(Font) scaled to size 10.0pt on input line 680. +(Font) scaled to size 10.0pt on input line 683. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. @@ -1263,7 +1259,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/b/n' will be -(Font) scaled to size 9.70718pt on input line 692. +(Font) scaled to size 9.70718pt on input line 695. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. @@ -1307,7 +1303,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/it' will be -(Font) scaled to size 10.0pt on input line 692. +(Font) scaled to size 10.0pt on input line 695. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. @@ -1387,7 +1383,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403. 
-Overfull \hbox (4.64252pt too wide) in paragraph at lines 692--692 +Overfull \hbox (4.64252pt too wide) in paragraph at lines 695--695 []|[]\TU/LatinModernRoman(0)/m/it/10 check_outliers(model,[][] [] @@ -1745,17 +1741,17 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt): (fancyhdr) \addtolength{\topmargin}{-1.71957pt}. [8] -Underfull \hbox (badness 1584) in paragraph at lines 995--1001 +Underfull \hbox (badness 1584) in paragraph at lines 998--1004 []\TU/lmr/m/n/10 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psy- [] -Underfull \hbox (badness 3049) in paragraph at lines 995--1001 +Underfull \hbox (badness 3049) in paragraph at lines 998--1004 \TU/lmr/m/n/10 chology: Undisclosed flexibility in data collection and analysis allows pre- [] -Underfull \hbox (badness 3735) in paragraph at lines 995--1001 +Underfull \hbox (badness 3735) in paragraph at lines 998--1004 \TU/lmr/m/n/10 senting anything as significant. \TU/lmr/m/it/10 Psychological Science\TU/lmr/m/n/10 , \TU/lmr/m/it/10 22\TU/lmr/m/n/10 (11), 1359–1366. [] diff --git a/papers/JOSE/paper.md b/papers/JOSE/paper.md index dbc12fd66..b4d317156 100644 --- a/papers/JOSE/paper.md +++ b/papers/JOSE/paper.md @@ -218,7 +218,7 @@ Working with regression models creates the possibility of using model-based SOD In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.] 
-Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below. +Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below. ```r @@ -310,7 +310,7 @@ All `check_outliers()` output objects possess a `plot()` method, meaning it is a ```r -plot(outliers) +plot(outliers) # Figure 1 above ``` \begin{figure} diff --git a/papers/JOSE/paper.pdf b/papers/JOSE/paper.pdf index dbd0311a2..00101a8f5 100644 Binary files a/papers/JOSE/paper.pdf and b/papers/JOSE/paper.pdf differ diff --git a/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf b/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf index f382fd528..50cb86db5 100644 Binary files a/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf and b/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf differ diff --git a/vignettes/check_outliers.Rmd b/vignettes/check_outliers.Rmd index cd3ff33e0..65f6e092e 100644 --- a/vignettes/check_outliers.Rmd +++ b/vignettes/check_outliers.Rmd @@ -27,7 +27,8 @@ knitr::opts_chunk$set( ) options(digits = 2) -pkgs <- c("see", "performance", "datawizard", "rempsyc") +pkgs <- c("see", "performance", "datawizard", "rempsyc", + "ggplot2", 
"flextable", "ftExtra") successfully_loaded <- vapply(pkgs, requireNamespace, FUN.VALUE = logical(1L), quietly = TRUE) can_evaluate <- all(successfully_loaded) @@ -154,7 +155,7 @@ Working with regression models creates the possibility of using model-based SOD In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.] -Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below. +Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below. 
```{r model, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."} model <- lm(disp ~ mpg * hp, data = data) @@ -168,31 +169,41 @@ Using the model-based outlier detection method, we identified a single outlier. Table 1 below summarizes which methods to use in which cases, and with what threshold. The recommended thresholds are the default thresholds. -```{r, echo=FALSE} +```{r table1_prep, echo=FALSE} df <- data.frame( `Statistical Test` = c( - "Supported regression model", - "Structural Equation Modeling (or other unsupported model)", - "Simple test with few variables (*t* test, correlation, etc.)" - ), + "Supported regression model", + "Structural Equation Modeling (or other unsupported model)", + "Simple test with few variables (*t* test, correlation, etc.)"), `Diagnosis Method` = c( - "**Model-based**: Cook (or Pareto for Bayesian models)", - "**Multivariate**: Minimum Covariance Determinant (MCD)", - "**Univariate**: robust *z* scores (MAD)" - ), + "**Model-based**: Cook (or Pareto for Bayesian models)", + "**Multivariate**: Minimum Covariance Determinant (MCD)", + "**Univariate**: robust *z* scores (MAD)"), `Recommended Threshold` = c( - "`qf(0.5, ncol(x), nrow(x) - ncol(x))` (or 0.7 for Pareto)", - "`qchisq(p = 1 - 0.001, df = ncol(x))`", - "`qnorm(p = 1 - 0.001 / 2)`, ~ 3.29" - ) -) -knitr::kable( - df, - col.names = gsub("[.]", " ", names(df)), - caption = "Summary of Statistical Outlier Detection Methods Recommendations.", longtable = TRUE + "_qf(0.5, ncol(x), nrow(x) - ncol(x))_ (or 0.7 for Pareto)", + "_qchisq(p = 1 - 0.001, df = ncol(x))_", + "_qnorm(p = 1 - 0.001 / 2)_, ~ 3.29"), + `Function Usage` = c( + '_check_outliers(model, method = "cook")_', + '_check_outliers(data, method = "mcd")_', + '_check_outliers(data, method = "zscore_robust")_'), + check.names = FALSE ) ``` +### Table 1 + +_Summary of Statistical Outlier Detection Methods Recommendations_ + +```{r 
table1_print, echo=FALSE, message=FALSE} +x <- flextable::flextable(df, cwidth = 2.25) +x <- flextable::theme_apa(x) +x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all") +# x <- flextable::fontsize(x, size = 10, part = "all") +ftExtra::colformat_md(x) + +``` + ## Cook's Distance vs. MCD @leys2018outliers report a preference for the MCD method over Cook's distance. This is because Cook's distance removes one observation at a time and checks its corresponding influence on the model each time [@cook1977detection], and flags any observation that has a large influence. In the view of these authors, when there are several outliers, the process of removing a single outlier at a time is problematic as the model remains "contaminated" or influenced by other possible outliers in the model, rendering this method suboptimal in the presence of multiple outliers.
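As a sketch of the recommendations summarized in the patched Table 1, the three recommended `check_outliers()` calls can be tried on `mtcars` (the data set used throughout the vignette; the variable selection below is illustrative, not part of the patch):

```r
library(performance)

data <- mtcars # example data, as used earlier in the vignette

# Model-based: Cook's distance on a supported regression model
model <- lm(disp ~ mpg * hp, data = data)
check_outliers(model, method = "cook")

# Multivariate: Minimum Covariance Determinant on the raw data
check_outliers(data[, c("mpg", "hp", "disp")], method = "mcd")

# Univariate: robust z scores (MAD-based)
check_outliers(data[, "mpg", drop = FALSE], method = "zscore_robust")
```

Each call returns an object with `print()` and `plot()` methods, the latter requiring the {see} package as noted in the paper.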