addresses #636 major points

rempsyc committed Oct 4, 2023
1 parent 46e8575 commit 44d11e1
Showing 6 changed files with 22 additions and 12 deletions.
8 changes: 5 additions & 3 deletions papers/JOSE/paper.Rmd
@@ -29,7 +29,7 @@ affiliations:
- index: 1
name: Department of Psychology, Université du Québec à Montréal, Montréal, Québec, Canada
- index: 2
- name: Independent Researcher
+ name: Independent Researcher, Ramat Gan, Israel
- index: 3
name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
- index: 4
@@ -115,15 +115,15 @@ Nonetheless, the improper handling of these outliers can substantially affect st

One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software. In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).

- The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+ The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.

# Identifying Outliers

Although many researchers attempt to identify outliers with measures based on the mean (e.g., _z_ scores), those methods are problematic because the mean and standard deviation themselves are not robust to the influence of outliers and those methods also assume normally distributed data (i.e., a Gaussian distribution). Therefore, current guidelines recommend using robust methods to identify outliers, such as those relying on the median as opposed to the mean [@leys2019outliers; @leys2013outliers; @leys2018outliers].
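To make the contrast concrete, a median-based robust _z_ score (centered on the median and scaled by the median absolute deviation, MAD, rather than the mean and standard deviation) can be computed by hand or requested from `check_outliers()` via its `zscore_robust` method. The sketch below is illustrative only; the simulated variable `x` and the data frame wrapper are our own, and the default threshold (approximately |_z_| > 3.29, i.e., _p_ < .001) is the package's documented default:

```r
library(performance)

# Simulated variable with one artificial extreme observation
set.seed(42)
x <- c(rnorm(100), 15)

# Robust z scores by hand: median and MAD instead of the
# outlier-sensitive mean and standard deviation
z_robust <- (x - median(x)) / mad(x)

# Equivalent check with {performance}
check_outliers(data.frame(x = x), method = "zscore_robust")
```

Note that R's `mad()` already applies the usual consistency constant (1.4826), so these robust _z_ scores are on the same scale as ordinary _z_ scores under normality.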

Nonetheless, which exact outlier method to use depends on many factors. In some cases, eye-gauging odd observations can be an appropriate solution, though many researchers will favour algorithmic solutions to detect potential outliers, for example, based on a continuous value expressing how much an observation stands out from the others.

- One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. When using a regression model, relevant information can be found by identifying observations that do not fit well with the model. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
+ One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. Identifying observations that the regression model does not fit well can help find information relevant to our specific research context. This approach, known as model-based outlier detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outlier detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) and multivariate (involving multiple variables).

When no method is readily available to detect model-based outliers, such as for structural equation modelling (SEM), looking for multivariate outliers may be of relevance. For simple tests (_t_ tests or correlations) that compare values of the same variable, it can be appropriate to check for univariate outliers. However, univariate methods can give false positives since _t_ tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.
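As a brief illustration of the univariate/multivariate distinction, both kinds of checks can be requested from `check_outliers()` by name. This is a minimal sketch using the built-in `mtcars` data (our choice of variables is arbitrary, and the robust Mahalanobis method may require additional suggested dependencies):

```r
library(performance)

data <- mtcars[, c("mpg", "disp", "hp")]

# Univariate: interquartile-range-based detection, one variable at a time
check_outliers(data, method = "iqr")

# Multivariate: robust Mahalanobis distance, all variables jointly
check_outliers(data, method = "mahalanobis_robust")
```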

@@ -187,6 +187,8 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

+ Currently, most `lm` models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+
```{r model}
model <- lm(disp ~ mpg * hp, data = data)
outliers <- check_outliers(model, method = "cook")
14 changes: 9 additions & 5 deletions papers/JOSE/paper.log
@@ -1,4 +1,4 @@
- This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.9.14) 4 OCT 2023 11:22
+ This is XeTeX, Version 3.141592653-2.6-0.999995 (TeX Live 2023) (preloaded format=xelatex 2023.9.14) 4 OCT 2023 15:06
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -1097,6 +1097,10 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[4]
+ Underfull \hbox (badness 1331) in paragraph at lines 627--635
+ [][]$[][][][][] [] [] [] [][][][][][][][][] [] [][][][][][] [] [][] [] [][][][][][][] [] [][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][][]$[][]\TU/lmr/m/n/10 ). We show a
+ []

File: paper_files/figure-latex/model_fig-1.pdf Graphic file (type pdf)
<use paper_files/figure-latex/model_fig-1.pdf>
File: D:/Rpackages/rticles/rmarkdown/templates/joss/resources/JOSE-logo.png Graphic file (type bmp)
@@ -1139,17 +1143,17 @@ Package fancyhdr Warning: \headheight is too small (62.59596pt):
(fancyhdr) \addtolength{\topmargin}{-1.71957pt}.

[8]
- Underfull \hbox (badness 1584) in paragraph at lines 959--965
+ Underfull \hbox (badness 1584) in paragraph at lines 968--974
[]\TU/lmr/m/n/10 Simmons, J. P., Nelson, L. D., & Simonsohn, U. (2011). False-positive psy-
[]


- Underfull \hbox (badness 3049) in paragraph at lines 959--965
+ Underfull \hbox (badness 3049) in paragraph at lines 968--974
\TU/lmr/m/n/10 chology: Undisclosed flexibility in data collection and analysis allows pre-
[]


- Underfull \hbox (badness 3735) in paragraph at lines 959--965
+ Underfull \hbox (badness 3735) in paragraph at lines 968--974
\TU/lmr/m/n/10 senting anything as significant. \TU/lmr/m/it/10 Psychological Science\TU/lmr/m/n/10 , \TU/lmr/m/it/10 22\TU/lmr/m/n/10 (11), 1359–1366.
[]

@@ -1180,6 +1184,6 @@ Here is how much of TeX's memory you used:
57602 multiletter control sequences out of 15000+600000
564981 words of font info for 89 fonts, out of 8000000 for 9000
14 hyphenation exceptions out of 8191
- 84i,12n,87p,1194b,850s stack positions out of 10000i,1000n,20000p,200000b,200000s
+ 84i,13n,87p,1194b,850s stack positions out of 10000i,1000n,20000p,200000b,200000s

Output written on paper.pdf (9 pages).
8 changes: 5 additions & 3 deletions papers/JOSE/paper.md
@@ -29,7 +29,7 @@ affiliations:
- index: 1
name: Department of Psychology, Université du Québec à Montréal, Montréal, Québec, Canada
- index: 2
- name: Independent Researcher
+ name: Independent Researcher, Ramat Gan, Israel
- index: 3
name: Center for Humans and Machines, Max Planck Institute for Human Development, Berlin, Germany
- index: 4
@@ -103,15 +103,15 @@ Nonetheless, the improper handling of these outliers can substantially affect st

One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software. In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).

- The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+ The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.

# Identifying Outliers

Although many researchers attempt to identify outliers with measures based on the mean (e.g., _z_ scores), those methods are problematic because the mean and standard deviation themselves are not robust to the influence of outliers and those methods also assume normally distributed data (i.e., a Gaussian distribution). Therefore, current guidelines recommend using robust methods to identify outliers, such as those relying on the median as opposed to the mean [@leys2019outliers; @leys2013outliers; @leys2018outliers].

Nonetheless, which exact outlier method to use depends on many factors. In some cases, eye-gauging odd observations can be an appropriate solution, though many researchers will favour algorithmic solutions to detect potential outliers, for example, based on a continuous value expressing how much an observation stands out from the others.

- One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. When using a regression model, relevant information can be found by identifying observations that do not fit well with the model. This approach, known as model-based outliers detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outliers detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) or multivariate (involving multiple variables).
+ One of the factors to consider when selecting an algorithmic outlier detection method is the statistical test of interest. Identifying observations that the regression model does not fit well can help find information relevant to our specific research context. This approach, known as model-based outlier detection (as outliers are extracted after the statistical model has been fit), can be contrasted with distribution-based outlier detection, which is based on the distance between an observation and the "center" of its population. Various quantification strategies of this distance exist for the latter, both univariate (involving only one variable at a time) and multivariate (involving multiple variables).

When no method is readily available to detect model-based outliers, such as for structural equation modelling (SEM), looking for multivariate outliers may be of relevance. For simple tests (_t_ tests or correlations) that compare values of the same variable, it can be appropriate to check for univariate outliers. However, univariate methods can give false positives since _t_ tests and correlations, ultimately, are also models/multivariable statistics. They are in this sense more limited, but we show them nonetheless for educational purposes.

@@ -218,6 +218,8 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

+ Currently, most `lm` models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+

```r
model <- lm(disp ~ mpg * hp, data = data)
Binary file modified papers/JOSE/paper.pdf
Binary file not shown.
Binary file modified papers/JOSE/paper_files/figure-latex/model_fig-1.pdf
Binary file not shown.
4 changes: 3 additions & 1 deletion vignettes/check_outliers.Rmd
@@ -56,7 +56,7 @@ Nonetheless, the improper handling of these outliers can substantially affect st

One possible reason is that researchers are not aware of the existing recommendations, or do not know how to implement them using their analysis software. In this paper, we show how to follow current best practices for automatic and reproducible statistical outlier detection (SOD) using R and the *{performance}* package [@ludecke2021performance], which is part of the *easystats* ecosystem of packages that build an R framework for easy statistical modeling, visualization, and reporting [@easystatspackage]. Installation instructions can be found on [GitHub](https://github.com/easystats/performance) or its [website](https://easystats.github.io/performance/), and its list of dependencies on [CRAN](https://cran.r-project.org/package=performance).

- The instructional materials that follow is aimed at an audience of researchers who want to follow good practices, and is appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.
+ The instructional materials that follow are aimed at an audience of researchers who want to follow good practices, and are appropriate for advanced undergraduate students, graduate students, professors, or professionals having to deal with the nuances of outlier treatment.

# Identifying Outliers

@@ -154,6 +154,8 @@ Working with regression models creates the possibility of using model-based SOD

In {performance}, two such model-based SOD methods are currently available: Cook's distance, for regular regression models, and Pareto, for Bayesian models. As such, `check_outliers()` can be applied directly on regression model objects, by simply specifying `method = "cook"` (or `method = "pareto"` for Bayesian models).^[Our default threshold for the Cook method is defined by `stats::qf(0.5, ncol(x), nrow(x) - ncol(x))`, which again is an approximation of the critical value for _p_ < .001 consistent with the thresholds of our other methods.]

+ Currently, most `lm` models are supported (with the exception of `glmmTMB`, `lmrob`, and `glmrob` models), as long as they are supported by the underlying functions `stats::cooks.distance()` (or `loo::pareto_k_values()`) and `insight::get_data()` (for a full list of the 225 models currently supported by the `insight` package, see https://easystats.github.io/insight/#list-of-supported-models-by-class). We show a demo below.
+
```{r model, fig.cap = "Visual depiction of outliers based on Cook's distance (leverage and standardized residuals), based on the fitted model."}
model <- lm(disp ~ mpg * hp, data = data)
outliers <- check_outliers(model, method = "cook")
