addresses #636 tidyverse
rempsyc committed Oct 4, 2023
1 parent 1726625 commit ebeaafd
Showing 7 changed files with 46 additions and 37 deletions.
3 changes: 3 additions & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -93,7 +93,9 @@ Suggests:
dbscan,
estimatr,
fixest,
flextable,
forecast,
ftExtra,
gamm4,
ggplot2,
glmmTMB,
@@ -128,6 +130,7 @@ Suggests:
psych,
qqplotr (>= 0.0.6),
randomForest,
rempsyc,
rmarkdown,
rstanarm,
rstantools,
5 changes: 2 additions & 3 deletions papers/JOSE/paper.Rmd
@@ -187,7 +187,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.
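The pipe support mentioned above can be sketched as follows (a minimal illustration, assuming the {performance} package is installed; the `mtcars` formula is purely an example, not the paper's model):

```r
# Minimal sketch: piping a fitted model straight into check_outliers().
# Assumes {performance} is installed; the formula here is illustrative.
model <- lm(disp ~ mpg * hp, data = mtcars)

# Native pipe (|>); the magrittr pipe (%>%) works the same way.
outliers <- model |> performance::check_outliers(method = "cook")
outliers
```

Because the model object is the first argument of `check_outliers()`, both pipe operators forward it without any placeholder syntax.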

```{r model}
model <- lm(disp ~ mpg * disp, data = data)
@@ -228,7 +228,6 @@ _Summary of Statistical Outlier Detection Methods Recommendations_
```{r table1_print, echo=FALSE, message=FALSE}
x <- flextable::flextable(df, cwidth = 1.25)
x <- flextable::theme_apa(x)
# x <- flextable::align(x, align = "left", part = "all")
x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all")
x <- flextable::fontsize(x, size = 10, part = "all")
ftExtra::colformat_md(x)
@@ -238,7 +237,7 @@ ftExtra::colformat_md(x)
All `check_outliers()` output objects possess a `plot()` method, meaning it is also possible to visualize the outliers using the generic `plot()` function on the resulting outlier object after loading the {see} package (Figure 1).

```{r model_fig, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
plot(outliers)
plot(outliers) # Figure 1 above
```

## Cook's Distance vs. MCD
20 changes: 8 additions & 12 deletions papers/JOSE/paper.log
@@ -1,4 +1,4 @@
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 16:57
This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.10.4) 4 OCT 2023 17:36
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -1177,10 +1177,6 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[4]
Underfull \hbox (badness 1331) in paragraph at lines 628--636
[][]$[][][][][] [] [] [] [][][][][][][][][] [] [][][][][][] [] [][] [] [][][][][][][] [] [][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]$[][]\TU/lmr/m/n/10 ). We show a
[]


Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1222,7 +1218,7 @@ Package fontspec Info: Font family 'LatinModernRoman(0)' created for font
(fontspec) - 'bold italic small caps' (b/scit) with NFSS spec.:

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/n' will be
(Font) scaled to size 10.0pt on input line 680.
(Font) scaled to size 10.0pt on input line 683.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1263,7 +1259,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/b/n' will be
(Font) scaled to size 9.70718pt on input line 692.
(Font) scaled to size 9.70718pt on input line 695.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1307,7 +1303,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

LaTeX Font Info: Font shape `TU/LatinModernRoman(0)/m/it' will be
(Font) scaled to size 10.0pt on input line 692.
(Font) scaled to size 10.0pt on input line 695.

Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.

@@ -1387,7 +1383,7 @@ Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.
Package fontspec Info: Latin Modern Roman scale = 0.9999964596882403.


Overfull \hbox (4.64252pt too wide) in paragraph at lines 692--692
Overfull \hbox (4.64252pt too wide) in paragraph at lines 695--695
[]|[]\TU/LatinModernRoman(0)/m/it/10 check_outliers(model,[][]
[]

@@ -1745,17 +1741,17 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[8]
Underfull \hbox (badness 1584) in paragraph at lines 995--1001
Underfull \hbox (badness 1584) in paragraph at lines 998--1004
[]\TU/lmr/m/n/10 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psy-
[]


Underfull \hbox (badness 3049) in paragraph at lines 995--1001
Underfull \hbox (badness 3049) in paragraph at lines 998--1004
\TU/lmr/m/n/10 chology: Undisclosed flexibility in data collection and analysis allows pre-
[]


Underfull \hbox (badness 3735) in paragraph at lines 995--1001
Underfull \hbox (badness 3735) in paragraph at lines 998--1004
\TU/lmr/m/n/10 senting anything as significant. \TU/lmr/m/it/10 Psychological Science\TU/lmr/m/n/10 , \TU/lmr/m/it/10 22\TU/lmr/m/n/10 (11), 1359–1366.
[]

4 changes: 2 additions & 2 deletions papers/JOSE/paper.md
@@ -218,7 +218,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.


```r
@@ -310,7 +310,7 @@ All `check_outliers()` output objects possess a `plot()` method, meaning it is a


```r
plot(outliers)
plot(outliers) # Figure 1 above
```

\begin{figure}
Binary file modified papers/JOSE/paper.pdf
Binary file modified papers/JOSE/paper_files/figure-latex/model_fig-1.pdf
51 changes: 31 additions & 20 deletions vignettes/check_outliers.Rmd
@@ -27,7 +27,8 @@ knitr::opts_chunk$set(
)
options(digits = 2)
pkgs <- c("see", "performance", "datawizard", "rempsyc")
pkgs <- c("see", "performance", "datawizard", "rempsyc",
"ggplot2", "flextable", "ftExtra")
successfully_loaded <- vapply(pkgs, requireNamespace, FUN.VALUE = logical(1L), quietly = TRUE)
can_evaluate <- all(successfully_loaded)
@@ -154,7 +155,7 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). Also note that although `check_outliers()` supports the pipe operators (`|>` or `%>%`), it does not support `tidymodels` at this time. We show a demo below.

```{r model, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
model <- lm(disp ~ mpg * disp, data = data)
@@ -168,31 +169,41 @@ Using the model-based outlier detection method, we identified a single outlier.

Table 1 below summarizes which methods to use in which cases, and with what threshold. The recommended thresholds are the default thresholds.

```{r, echo=FALSE}
```{r table1_prep, echo=FALSE}
df <- data.frame(
`Statistical Test` = c(
"Supported regression model",
"Structural Equation Modeling (or other unsupported model)",
"Simple test with few variables (*t* test, correlation, etc.)"
),
"Supported regression model",
"Structural Equation Modeling (or other unsupported model)",
"Simple test with few variables (*t* test, correlation, etc.)"),
`Diagnosis Method` = c(
"**Model-based**: Cook (or Pareto for Bayesian models)",
"**Multivariate**: Minimum Covariance Determinant (MCD)",
"**Univariate**: robust *z* scores (MAD)"
),
"**Model-based**: Cook (or Pareto for Bayesian models)",
"**Multivariate**: Minimum Covariance Determinant (MCD)",
"**Univariate**: robust *z* scores (MAD)"),
`Recommended Threshold` = c(
"`qf(0.5, ncol(x), nrow(x) - ncol(x))` (or 0.7 for Pareto)",
"`qchisq(p = 1 - 0.001, df = ncol(x))`",
"`qnorm(p = 1 - 0.001 / 2)`, ~ 3.29"
)
)
knitr::kable(
df,
col.names = gsub("[.]", " ", names(df)),
caption = "Summary of Statistical Outlier Detection Methods Recommendations.", longtable = TRUE
"_qf(0.5, ncol(x), nrow(x) - ncol(x))_ (or 0.7 for Pareto)",
"_qchisq(p = 1 - 0.001, df = ncol(x))_",
"_qnorm(p = 1 - 0.001 / 2)_, ~ 3.29"),
`Function Usage` = c(
'_check_outliers(model, method = "cook")_',
'_check_outliers(data, method = "mcd")_',
'_check_outliers(data, method = "zscore_robust")_'),
check.names = FALSE
)
```

### Table 1

_Summary of Statistical Outlier Detection Methods Recommendations_

```{r table1_print, echo=FALSE, message=FALSE}
x <- flextable::flextable(df, cwidth = 2.25)
x <- flextable::theme_apa(x)
x <- flextable::font(x, fontname = "Latin Modern Roman", part = "all")
# x <- flextable::fontsize(x, size = 10, part = "all")
ftExtra::colformat_md(x)
```
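The recommended thresholds listed in Table 1 are plain base-R quantile calls, so they can be recomputed directly. A quick sketch (the values of `n` and `p` stand in for `nrow(x)` and `ncol(x)` of a hypothetical data frame `x` and are purely illustrative):

```r
# Recompute Table 1's recommended thresholds with base R quantile functions.
# n and p are illustrative stand-ins for nrow(x) and ncol(x).
n <- 100
p <- 3

qf(0.5, p, n - p)             # model-based: Cook's distance threshold
qchisq(p = 1 - 0.001, df = p) # multivariate: MCD threshold
qnorm(p = 1 - 0.001 / 2)      # univariate robust z threshold, ~3.29
```

The last call makes the table's "~ 3.29" explicit: it is the two-tailed critical value of the standard normal distribution at *p* < .001.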

## Cook's Distance vs. MCD

@leys2018outliers report a preference for the MCD method over Cook's distance. This is because Cook's distance removes one observation at a time and checks its corresponding influence on the model each time [@cook1977detection], and flags any observation that has a large influence. In the view of these authors, when there are several outliers, the process of removing a single outlier at a time is problematic as the model remains "contaminated" or influenced by other possible outliers in the model, rendering this method suboptimal in the presence of multiple outliers.
