diff --git a/papers/JOSE/paper.Rmd b/papers/JOSE/paper.Rmd
index bbd9bcaad..c72fefc15 100644
--- a/papers/JOSE/paper.Rmd
+++ b/papers/JOSE/paper.Rmd
@@ -29,7 +29,7 @@ affiliations:
 - index: 1
   name: Department of Psychology, Université du Québec à Montréal, Montréal, Québec, Canada
 - index: 2
-  name: Independent Researcher
+  name: Independent Researcher, Ramat Gan, Israel
 - index: 3
   name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
 - index: 4
@@ -115,7 +115,7 @@ Nonetheless, the improper handling of these outliers can substantially affect st
 One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software.
 In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).
 
-The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
 # Identifying Outliers
 
@@ -123,7 +123,7 @@ Although many researchers attempt to identify outliers with measures based on th
 Nonetheless, which exact outlier method to use depends on many factors. In some cases, eye-gauging odd observations can be an appropriate solution, though many researchers will favour algorithmic solutions to detect potential outliers, for example, based on a continuous value expressing the observation stands out from the others.
 
-One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. When using a regression model, relevant information can be found by identifying observations that do not fit well with the model. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
+One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. Identifying observations the regression model does not fit well can help find information relevant to our specific research context. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
 When no method is readily available to detect model-based outliers, such as for structural equation modelling (SEM), looking for multivariate outliers may be of relevance. For simple tests (_t_ tests or correlations) that compare values of the same variable, it can be appropriate to check for univariate outliers. However, univariate methods can give false positives since _t_ tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.
 
@@ -187,6 +187,8 @@ Working with regression models creates the possibility of using model-based SOD
 In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]
 
+Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+
 ```{r model}
 model <- lm(disp ~ mpg * disp, data = data)
 outliers <- check_outliers(model, method = "cook")
diff --git a/papers/JOSE/paper.log b/papers/JOSE/paper.log
index b1a43ab5f..33daaa7f8 100644
--- a/papers/JOSE/paper.log
+++ b/papers/JOSE/paper.log
@@ -1,4 +1,4 @@
-This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.9.14) 4 OCT 2023 11:22
+This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.9.14) 4 OCT 2023 15:06
 entering extended mode
 restricted \write18 enabled.
 %&-line parsing enabled.
@@ -1097,6 +1097,10 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
 (fancyhdr) \addtolength{\topmargin}{-1.71957pt}.
 [4]
+Underfull \hbox (badness 1331) in paragraph at lines 627--635
+[][]$[][][][][] [] [] [] [][][][][][][][][] [] [][][][][][] [] [][] [] [][][][][][][] [] [][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]$[][]\TU/lmr/m/n/10 ). We show a
+ []
+
 File: paper_files/figure-latex/model_fig-1.pdf Graphic file (type pdf)
 File: D:/Rpackages/rticles/rmarkdown/templates/joss/resources/JOSE-logo.png Graphic file (type bmp)
@@ -1139,17 +1143,17 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
 (fancyhdr) \addtolength{\topmargin}{-1.71957pt}.
 [8]
-Underfull \hbox (badness 1584) in paragraph at lines 959--965
+Underfull \hbox (badness 1584) in paragraph at lines 968--974
 []\TU/lmr/m/n/10 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psy-
 []
-Underfull \hbox (badness 3049) in paragraph at lines 959--965
+Underfull \hbox (badness 3049) in paragraph at lines 968--974
 \TU/lmr/m/n/10 chology: Undisclosed flexibility in data collection and analysis allows pre-
 []
-Underfull \hbox (badness 3735) in paragraph at lines 959--965
+Underfull \hbox (badness 3735) in paragraph at lines 968--974
 \TU/lmr/m/n/10 senting anything as significant. \TU/lmr/m/it/10 Psychological Science\TU/lmr/m/n/10 , \TU/lmr/m/it/10 22\TU/lmr/m/n/10 (11), 1359–1366.
 []
@@ -1180,6 +1184,6 @@ Here is how much of TeX's memory you used:
 57602 multiletter control sequences out of 15000+600000
 564981 words of font info for 89 fonts, out of 8000000 for 9000
 14 hyphenation exceptions out of 8191
- 84i,12n,87p,1194b,850s stack positions out of 10000i,1000n,20000p,200000b,200000s
+ 84i,13n,87p,1194b,850s stack positions out of 10000i,1000n,20000p,200000b,200000s
 Output written on paper.pdf (9 pages).
diff --git a/papers/JOSE/paper.md b/papers/JOSE/paper.md
index 223b0056b..ef0dd7007 100644
--- a/papers/JOSE/paper.md
+++ b/papers/JOSE/paper.md
@@ -29,7 +29,7 @@ affiliations:
 - index: 1
   name: Department of Psychology, Université du Québec à Montréal, Montréal, Québec, Canada
 - index: 2
-  name: Independent Researcher
+  name: Independent Researcher, Ramat Gan, Israel
 - index: 3
   name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
 - index: 4
@@ -103,7 +103,7 @@ Nonetheless, the improper handling of these outliers can substantially affect st
 One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software.
 In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).
 
-The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
 
 # Identifying Outliers
 
@@ -111,7 +111,7 @@ Although many researchers attempt to identify outliers with measures based on th
 Nonetheless, which exact outlier method to use depends on many factors. In some cases, eye-gauging odd observations can be an appropriate solution, though many researchers will favour algorithmic solutions to detect potential outliers, for example, based on a continuous value expressing the observation stands out from the others.
 
-One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. When using a regression model, relevant information can be found by identifying observations that do not fit well with the model. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
+One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. Identifying observations the regression model does not fit well can help find information relevant to our specific research context. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
 
 When no method is readily available to detect model-based outliers, such as for structural equation modelling (SEM), looking for multivariate outliers may be of relevance. For simple tests (_t_ tests or correlations) that compare values of the same variable, it can be appropriate to check for univariate outliers. However, univariate methods can give false positives since _t_ tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.
@@ -218,6 +218,8 @@ Working with regression models creates the possibility of using model-based SOD
 In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]
 
+Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+
 ```r
 model <- lm(disp ~ mpg * disp, data = data)
diff --git a/papers/JOSE/paper.pdf b/papers/JOSE/paper.pdf
index 96f13be44..f6cbe4b67 100644
Binary files a/papers/JOSE/paper.pdf and b/papers/JOSE/paper.pdf differ
diff --git a/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf b/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf
index 7669f46ff..1c4feeb94 100644
Binary files a/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf and b/papers/JOSE/paper_files/figure-latex/model_fig-1.pdf differ
diff --git a/vignettes/check_outliers.Rmd b/vignettes/check_outliers.Rmd
index 3c2199350..e700cfb16 100644
--- a/vignettes/check_outliers.Rmd
+++ b/vignettes/check_outliers.Rmd
@@ -56,7 +56,7 @@ Nonetheless, the improper handling of these outliers can substantially affect st
 One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software.
 In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).
 
-The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
 
 # Identifying Outliers
 
@@ -154,6 +154,8 @@ Working with regression models creates the possibility of using model-based SOD
 In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]
 
+Currently, most lm models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+
 ```{r model, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
 model <- lm(disp ~ mpg * disp, data = data)
 outliers <- check_outliers(model, method = "cook")
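The model-based workflow that the patched paragraphs document can be run directly against a release of {performance}. The sketch below is a minimal, self-contained version of the demo the patch refers to, with two assumptions: the package is installed, and the built-in `mtcars` data set with an ordinary formula stands in for the paper's `data` object.

```r
# Minimal sketch of model-based outlier detection with {performance}.
# Assumes the package is installed; mtcars stands in for the paper's data.
library(performance)

model <- lm(mpg ~ disp * hp, data = mtcars)

# Cook's distance is the model-based method for regular (frequentist)
# regression models; Bayesian models would use method = "pareto" instead.
outliers <- check_outliers(model, method = "cook")
outliers

# Indices of flagged observations, usable for inspection or exclusion
which(outliers)
```

As the footnote in the diff notes, the flagging threshold defaults to `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`; a different cutoff can be passed via the `threshold` argument of `check_outliers()`.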