---
title: "Effect size"
output:
  html_notebook:
    toc: yes
    toc_float:
      collapsed: false
bibliography: references.bib
---
<div id="FAQ">
[[Jump to Exemplar]](#exemplar)
# FAQ
## What is an effect size?
Broadly speaking, an effect size is *"anything that might be of interest"* [@Cumming2013a]; it is some quantity that captures the magnitude of the effect studied.
In HCI, common examples of effect size include the mean difference (e.g., in seconds) in task completion times between two techniques, or the mean difference in error rates (e.g., in percent). These are called *simple effect sizes* (or *unstandardized effect sizes*).
More complex measures of effect size exist, called *standardized effect sizes* (see [What is a standardized effect size?](#standardized)). Although the term "effect size" is often used to refer to standardized effect sizes only, using the term in its broader sense can avoid unnecessary confusion [@Cumming2013a; @Wilkinson1999a]. In this document, "effect size" refers to both simple and standardized effect sizes.
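To make this concrete, below is a minimal sketch of computing a simple effect size in R. The data, variable names, and numbers are simulated purely for illustration and do not come from any real study:

```{r}
# Simulated completion times (in seconds) for two hypothetical techniques;
# these data are made up purely to illustrate the computation.
set.seed(42)
completion_baseline <- rnorm(30, mean = 12, sd = 2)
completion_new      <- rnorm(30, mean = 10, sd = 2)

# Simple (unstandardized) effect size: the mean difference, in seconds.
# Reported as baseline - new, so a positive value means the new technique is faster.
mean(completion_baseline) - mean(completion_new)
```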
## <a name="whenwhy"></a>Why and when should effect sizes be reported?
In quantitative experiments, effect sizes are among the most elementary and the most essential summary statistics that can be reported. Identifying the effect size(s) of interest also allows the researcher to turn a vague research question into a precise, quantitative question [@Cumming2014a]. For example, if a researcher is interested in showing that her technique is faster than a baseline technique, an appropriate choice of effect size is the mean difference in completion times. The observed effect size will indicate not only the likely direction of the effect (e.g., whether the technique is faster), but also whether the effect is large enough to care about.
For the sake of transparency, effect sizes should always be reported in quantitative research, unless there are good reasons not to do so. According to the American Psychological Association:
> For the reader to appreciate the magnitude or importance of a study's findings, it is almost always necessary to include some measure of effect size in the Results section. [@APA2001]
Sometimes, effect sizes can be hard to compute or to interpret. When this is the case, and if the main focus of the study is on the direction (rather than magnitude) of the effect, reporting the results of statistical significance tests without reporting effect sizes (see the [inferential statistics FAQ]()) may be an acceptable option.
## How should effect sizes be reported?
The first choice is whether to report simple or standardized effect sizes. For this question, see [Should simple or standardized effect sizes be reported?](#simple_v_standardized)
It is rarely sufficient to report an effect size as a single quantity. This is because a single quantity like a difference in means or a Cohen's *d* is typically only a *point estimate*, i.e., it is merely a “best guess” of the true effect size. It is crucial to also assess and report the statistical uncertainty about this point estimate.
For more on assessing and reporting statistical uncertainty, see the [inferential statistics FAQ]().
Ideally, an effect size report should include:
- The direction of the effect if applicable (e.g., given a difference between two treatments `A` and `B`, indicate if the measured effect is `A - B` or `B - A`).
- The type of point estimate reported (e.g., a sample mean difference).
- The type of uncertainty information reported (e.g., a 95% CI).
- The units of the effect size if applicable, or the type of standardized effect size if it is a unitless effect size.
This information can be reported either numerically or graphically. Both formats are acceptable, although plots tend to be easier to comprehend than numbers when more than one effect size needs to be conveyed [@loftus1993picture; @kastellec2007using]. Unless precise numerical values are important, it is sufficient (and often preferable) to report all effect sizes graphically. Researchers should avoid plotting point estimates without also plotting uncertainty information (using, e.g., error bars).
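As an illustration of such a report, the sketch below derives a point estimate and a 95% CI for the mean difference from the simulated data above, and plots them as a point with an error bar. The use of `t.test()` and `ggplot2` here is one possible choice for illustration, not a prescribed method:

```{r}
# Point estimate and 95% CI for the mean difference (baseline - new), in seconds,
# reusing the simulated data from the earlier chunk (illustrative only).
fit <- t.test(completion_baseline, completion_new)
unname(fit$estimate[1] - fit$estimate[2])  # point estimate of the mean difference (s)
fit$conf.int                               # 95% CI for the mean difference (s)

# A simple graphical report: the point estimate with its 95% CI as an error bar.
library(ggplot2)
report <- data.frame(
  effect   = "baseline - new",
  estimate = unname(fit$estimate[1] - fit$estimate[2]),
  lower    = fit$conf.int[1],
  upper    = fit$conf.int[2]
)
ggplot(report, aes(x = effect, y = estimate)) +
  geom_pointrange(aes(ymin = lower, ymax = upper)) +
  labs(x = NULL, y = "Mean difference in completion time (s)")
```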
## <a name="standardized"></a>What is a standardized effect size?
A standardized effect size is a unitless measure of effect size. The most common measure of standardized effect size is Cohen's *d*, in which the mean difference is divided by the pooled standard deviation of the observations [@Cohen1988a]. [Other approaches](http://stats.idre.ucla.edu/other/mult-pkg/faq/general/effect-size-power/faqhow-is-effect-size-used-in-power-analysis/) to standardization exist [prefer citations]. To some extent, standardized effect sizes make it possible to compare different studies in terms of how "impressive" their results are (see [How do I know my effect is large enough?](#whenlarge)).
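For concreteness, here is a minimal sketch of Cohen's *d* with a pooled standard deviation, again using the simulated data from the earlier chunk; packages such as `effsize` provide ready-made implementations.

```{r}
# Cohen's d: mean difference divided by the pooled standard deviation.
# Computed on the simulated data from the earlier chunk (illustrative only).
n_a <- length(completion_baseline)
n_b <- length(completion_new)
pooled_sd <- sqrt(((n_a - 1) * var(completion_baseline) +
                   (n_b - 1) * var(completion_new)) / (n_a + n_b - 2))
(mean(completion_baseline) - mean(completion_new)) / pooled_sd
```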
## <a name="simple_v_standardized"></a>Should simple or standardized effect sizes be reported?
While the term *effect size* may conjure up the image of arcane statistical formulas, the most useful effect sizes are often so simple and intuitive that they hardly seem to warrant a specialized term. An effect size is essentially any way of quantifying the practical size of an effect.
Standardized effect sizes are useful in some situations, for example when effects obtained from different experiments and/or expressed in different units need to be combined or compared [@Cumming2014a]. However, even this practice is controversial, as it can rely on assumptions about the effects being measured that are difficult to verify [@Cummings2011].
In most cases, simple effect sizes should be preferred over standardized effect sizes:
> Only rarely will uncorrected standardized effect size be more useful than simple effect size. It is usually far better to report simple effect size. [@baguley2009standardized]
Simple effect sizes are often easier to interpret and justify [@Cumming2014a; @Cummings2011]. When the units of the data are meaningful (e.g., seconds), reporting effect sizes expressed in their original units is more informative and makes it easier to judge whether the effect has practical significance [@Wilkinson1999a; @Cummings2011].
Barring a strong, domain- or problem-specific argument for reporting a standardized effect size instead of a simple one, simple effect sizes should be preferred as being more transparent and easier to interpret.
If a standardized effect size is reported, it should be accompanied by an argument for its applicability to the domain. If there is no inherent reasoning to argue for a particular interpretation of the practical significance of the standardized effect size, it should be accompanied by another assessment of the practical significance of the effect.
## <a name="whenlarge"></a>How do I know my effect is large enough?
Although there exist rules of thumb to help interpret standardized effect sizes, these are not universally accepted. See [What about Cohen's small, medium, and large effect sizes?](#small-medium-large)
It is generally advisable to avoid arbitrary thresholds when deciding whether an effect is large enough, and instead to consider whether the effect is of practical importance. This requires domain knowledge and often a fair degree of subjective judgment. Ideally, researchers should decide in advance what effect size they would consider large enough, and plan the experiment, hypotheses, and analyses accordingly (see the [experiment and analysis planning FAQ]()).
Nevertheless, more often than not in HCI, it is difficult to determine whether a certain effect is of practical importance. For example, a difference in pointing time of 100 ms between two pointing techniques can be large or small depending on the application, how often it is used, its context of use, etc. In such cases, forcing artificial interpretations of practical importance can hurt transparency. In many cases, it is sufficient to present effect sizes in a clear manner and leave the judgment of practical importance to the reader.
Simple effect sizes are often a better choice because they provide the information an expert in the area needs to judge the practical impact of an effect. For example, a difference in reaction time of 100 ms is above the threshold of human perception and therefore likely of practical impact, whereas a difference of 100 ms in receiving a message in an asynchronous chat application is likely less impactful, being small relative to how long a chat message is generally expected to take. As in the pointing example above, the same 100 ms can be large or small depending on the application and its context of use. Presenting simple effect sizes in a clear way---with units---allows the expert author to argue why the effect size may or may not have practical importance *and* allows the expert reader to make their own judgment.
## <a name="small-medium-large"></a>What about Cohen's small, medium, and large effect sizes?
Conventional thresholds are sometimes used to label standardized effect sizes such as Cohen's *d* as "small", "medium", or "large". These thresholds are, however, largely arbitrary [@Cummings2011]. They were originally proposed by Cohen based on human heights and intelligence quotients [@Cohen1977], but Cohen, in the very text where he first introduced them, noted that they may not be directly applicable to other domains:
> The terms "small", "medium", and "large" are relative, not only to
> each other, but to the area of behavioral science or even more particularly
> to the specific content and research method being employed in any given
> investigation... In the face of this relativity, there is
> a certain risk inherent in offering conventional operational definitions for
> these terms for use in power analysis in as diverse a field of inquiry as behavioral
> science. This risk is nevertheless accepted in the belief that more
> is to be gained than lost by supplying a common conventional frame of
> reference which is recommended for use only when no better basis for estimating
> the ES index is available. [@Cohen1977]
Cohen recommended the use of these thresholds only when no better frame of reference for assessing practical importance was available. However, hindsight has demonstrated that if such thresholds are offered, they will be adopted as a convenience, often without much thought to how they apply to the domain at hand [citation needed]. Once adopted, these thresholds make reports more opaque, by standardizing away units of measurement and categorizing results into arbitrary classes. Like Cummings [@Cummings2011], we recommend against assessing the importance of effects by labeling them using Cohen's thresholds.