Document new experiments methodology #10217

Merged: 49 commits into master from experiments/new-stats-methodology on Jan 7, 2025

Commits:
7043ff1
Move Methodology below Features and rename
danielbachhuber Dec 24, 2024
ef6cf80
First pass at statistics primer
danielbachhuber Dec 24, 2024
b1f2789
Active voice
danielbachhuber Dec 24, 2024
2aa7a71
First pass at funnel statistics doc
danielbachhuber Dec 24, 2024
98f58c4
First pass at Trends count statistics
danielbachhuber Dec 24, 2024
04d05e0
Add "What the heck?" sections
danielbachhuber Dec 24, 2024
3f9f429
Edits
danielbachhuber Dec 24, 2024
37d0d71
Edits
danielbachhuber Dec 24, 2024
be3fac7
First pass at continuous trends
danielbachhuber Dec 24, 2024
6eb49a2
Link to all overviews
danielbachhuber Dec 24, 2024
1575c5b
Link in sidebar
danielbachhuber Dec 24, 2024
c5bc948
Edits
danielbachhuber Dec 24, 2024
3898e7d
Edits
danielbachhuber Dec 24, 2024
1d8fe54
Formatting
danielbachhuber Jan 6, 2025
edffcff
Edit
danielbachhuber Jan 6, 2025
bf802a7
Formatting
danielbachhuber Jan 6, 2025
cccb2cb
Formatting
danielbachhuber Jan 6, 2025
60b432f
Formatting
danielbachhuber Jan 6, 2025
179ff15
Formatting
danielbachhuber Jan 6, 2025
1b834e8
Edit
danielbachhuber Jan 6, 2025
ca235f7
Drop 'what the heck'
danielbachhuber Jan 6, 2025
68da0ad
Add an introduction
danielbachhuber Jan 6, 2025
420deee
s/the/our/
danielbachhuber Jan 6, 2025
5eb1f8f
Improve Bayesian example
danielbachhuber Jan 6, 2025
07dce3a
Explain why different models
danielbachhuber Jan 6, 2025
5c0e225
Update credible interval definition
danielbachhuber Jan 6, 2025
5f893f9
Merge branch 'master' into experiments/new-stats-methodology
danielbachhuber Jan 6, 2025
5132470
Explain why it might be useful
danielbachhuber Jan 6, 2025
bb2d23c
Make the example more specific to pizza
danielbachhuber Jan 6, 2025
b20beae
Tie the example to the model explanation
danielbachhuber Jan 6, 2025
27e5c5c
Explain "minimally informative priors"
danielbachhuber Jan 6, 2025
c2ba4c2
We're evaluating property values, not just revenue
danielbachhuber Jan 6, 2025
2293f14
Clarify "Beta model" vs. "Beta distribution"
danielbachhuber Jan 6, 2025
045c810
Formatting
danielbachhuber Jan 6, 2025
cd96971
Deprecate the legacy methodology
danielbachhuber Jan 6, 2025
c7aba65
Explain true probability
danielbachhuber Jan 6, 2025
dc666e4
Missing word
danielbachhuber Jan 6, 2025
2c0464b
Missing word
danielbachhuber Jan 6, 2025
6f15396
Missing word
danielbachhuber Jan 6, 2025
db5a215
Edits
danielbachhuber Jan 6, 2025
3acbd21
Two grafs
danielbachhuber Jan 6, 2025
45057a3
Two grafs
danielbachhuber Jan 6, 2025
b5baa94
Merge branch 'master' into experiments/new-stats-methodology
danielbachhuber Jan 7, 2025
a282d20
Replace "experiments" with "metrics"
danielbachhuber Jan 7, 2025
6634857
Edit
danielbachhuber Jan 7, 2025
1aafe97
"Funnel metrics", not "Funnel experiments"
danielbachhuber Jan 7, 2025
1ef9b83
Rewrite as "Supported metric types"
danielbachhuber Jan 7, 2025
d7da622
Fix casing
danielbachhuber Jan 7, 2025
3f8e672
Rename to "Statisics overview"
danielbachhuber Jan 7, 2025
71 changes: 71 additions & 0 deletions contents/docs/experiments/funnels-statistics.mdx

Contributor:
We generally use sentence case for feature names, but you flip back and forth between title case for "beta model." First sentence is in sentence case, but the first title has it capitalized. Should all be "beta model" if possible.

Contributor (PR author):
@andehen Which makes more sense, "gamma-poisson model" or "Gamma-Poisson model"?

Contributor:
I think capital letters only if it is a title. But in the middle of a sentence it should be "a gamma-poisson model". The exception is if one refers to a specific distribution, like "we use a Beta(1, 1) distribution as prior ..."

@@ -0,0 +1,71 @@
---
title: Statistical methodology for funnel metrics
---

Funnel metrics use Bayesian statistics with a beta model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a beta model?

Imagine you run a pizza shop and want to know if customers say "yes" to adding pineapple. Some customers will say yes, others will say no. Knowing what percentage of customers want pineapple on their pizza helps you decide how much to order and what options to offer.

The beta model is a statistical approach that's great for analyzing proportions or probabilities. It uses a **beta distribution** to model the uncertainty in conversion rates and helps us understand:

1. Our best estimate of the _true_ probability that a customer will say "yes" to adding pineapple (vs. the probability we observe).
2. How certain we are about that estimate.

For example, if:

- Only 2 out of 4 customers (50%) say yes, the beta distribution will be wide, indicating high uncertainty.
- 150 out of 300 customers (50%) say yes, the beta distribution will be narrow, showing we're more confident about that 50% rate.
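
To see this width difference in numbers, here's a minimal scipy sketch (it adds in the Beta(1, 1) prior explained below):

```python
from scipy import stats

# Same 50% observed rate, very different certainty (Beta(1, 1) prior added in)
small = stats.beta(2 + 1, 2 + 1)      # 2 yes, 2 no
large = stats.beta(150 + 1, 150 + 1)  # 150 yes, 150 no

print(small.std(), large.std())  # ≈ 0.19 vs. ≈ 0.03: wide vs. narrow
```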

So when we say we're using a beta model for funnel metrics, we're:
1. Using the beta distribution to model conversion rates between 0% and 100%.
2. Getting more confident in our estimates as we collect more data.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new way of presenting pineapple on the website and have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the win probabilities, our methodology will (see the code sketch below):

1. Model each variant's conversion rate using a beta distribution:
- Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
- Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Take 10,000 random samples from each distribution.

3. Check which variant had the higher conversion rate for each sample.

4. Calculate the final win probabilities:
- Control wins in 40 out of 10,000 samples = 0.4% probability
- Test wins in 9,960 out of 10,000 samples = 99.6% probability
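
Here's a minimal numpy sketch of those four steps; the names and structure are illustrative rather than our exact implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
ALPHA_PRIOR, BETA_PRIOR = 1, 1  # minimally informative prior

# Step 1: posterior for each variant is Beta(successes + prior, failures + prior)
control = rng.beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR, size=10_000)  # step 2: 10,000 samples
test = rng.beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR, size=10_000)

# Steps 3-4: fraction of samples where the test variant has the higher rate
print(f"Test win probability: {(test > control).mean():.1%}")  # ≈ 99.6%
```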

These results tell us we can be 99.6% confident that showing photos of pineapple pizza performs better than the current design.

## Credible intervals

A **credible interval** tells you the range where the true conversion rate lies with 95% probability. This is different from a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the credible intervals, our methodology will (see the code sketch below):

1. Create a beta distribution for each variant:
- Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
- Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
- Control: [8.3%, 12.0%] = "You can be 95% confident the true conversion rate is between 8.3% and 12.0%"
- Test: [12.9%, 17.3%] = "You can be 95% confident the true conversion rate is between 12.9% and 17.3%"
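
A scipy sketch of the same two steps (illustrative, not the exact implementation):

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
control = stats.beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
test = stats.beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

# 95% credible interval = 2.5th and 97.5th percentiles of each posterior
print(control.ppf([0.025, 0.975]))  # ≈ [0.083, 0.120]
print(test.ppf([0.025, 0.975]))     # ≈ [0.129, 0.173]
```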

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true conversion rates.
@@ -1,5 +1,5 @@
---
-title: Experiment significance
+title: Legacy statistics methodology
---

import { FormulaScreenshot } from 'components/FormulaScreenshot'
@@ -10,7 +10,7 @@ export const FunnelExperimentCalculationDark = "https://res.cloudinary.com/dmuku
export const FunnelSignificanceLight = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-light.png"
export const FunnelSignificanceDark = "https://res.cloudinary.com/dmukukwp6/image/upload/posthog.com/contents/images/docs/user-guides/experimentation/funnel-significance-dark.png"

-Below are all the formulas and calculations we use to determine the significance of an experiment.
+> **Note:** This document describes the methodology used to evaluate experiments created prior to January 2025. Check out our [statistics overview](/docs/experiments/statistics) for an introduction to our updated methodology.

## Bayesian experimentation

53 changes: 53 additions & 0 deletions contents/docs/experiments/statistics.mdx
@@ -0,0 +1,53 @@
---
title: Experiment statistics overview
---

A working understanding of statistical methodology helps you feel confident interpreting experiment results. For those without prior experience, this overview explains everything you need to know in layperson's terms. For those with some statistics experience, this overview documents our methodology and assumptions.

Experiments use [Bayesian statistics](https://en.wikipedia.org/wiki/Bayesian_statistics) to determine whether a given **variant** performs better than the **control**. This approach quantifies **win probabilities** and **credible intervals**, helps determine whether the experiment shows a **statistically significant** effect, and enables you to:

- Check results at any time without statistical penalties.
- Get direct probability statements about which variant is winning.
- Make confident decisions earlier with accumulating evidence.

This contrasts with [frequentist statistics](https://en.wikipedia.org/wiki/Frequentist_inference), which requires you to predefine sample sizes and prevents you from updating probabilities as new data arrives.

## Example Bayesian analysis

Say you started an experiment a few hours ago and see these results:

- 1 in 10 people in the control group complete the funnel = 10% success rate.
- 1 in 9 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 46.7% probability of being better and the test variant has a 53.3% probability of being better.
- The control variant shows a credible interval of [2.3%, 41.3%] and the test variant shows a credible interval of [2.5%, 44.5%].

The first two values are pure math: dividing the number of successes by the total number of users gives us the raw success rates. It's not enough to just compare these conversion rates, however.

The last two values are derived using Bayesian statistics and describe our confidence in the results. The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants in the experiment. The **credible interval** tells you the range where the true conversion rate lies with 95% probability.

Importantly, even though the test variant is winning, it doesn't clear our threshold of a 90% or greater win probability to be a statistically significant conclusion. This uncertainty is also demonstrated by the amount of overlap in the credible intervals.

As such, you decide to let the experiment run a bit longer and see these results:

- 100 in 1000 people in the control group complete the funnel = 10% success rate.
- 100 in 900 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 21.5% probability of being better and the test variant has a 78.5% probability of being better.
- The control variant shows a credible interval of [8.3%, 12%] and the test variant shows a credible interval of [9.2%, 13.3%].

Et voilà! The additional data increased the win probability and narrowed the credible intervals. With 1,900 total users instead of just 19, random chance becomes a less likely explanation for the difference in conversion rates. Even though both variants maintained the same conversion rates (10% vs 11%), the larger sample size gives us more confidence in the experiment results.

At this point, you could either declare the test variant as the winner (78.5% probability), or continue collecting data to reach the 90% statistical significance threshold. Bayesian statistics let you check your results whenever you'd like without worrying about increasing the chance of false positives from checking too frequently.
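
If you want to reproduce these funnel numbers yourself, here's a minimal numpy sketch; the helper name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def win_probability(control_conv, control_total, test_conv, test_total, n=100_000):
    # Beta(successes + 1, failures + 1) posterior, i.e. a minimally informative Beta(1, 1) prior
    control = rng.beta(control_conv + 1, control_total - control_conv + 1, n)
    test = rng.beta(test_conv + 1, test_total - test_conv + 1, n)
    return (test > control).mean()

print(win_probability(1, 10, 1, 9))          # ≈ 0.53 a few hours in
print(win_probability(100, 1000, 100, 900))  # ≈ 0.79 with more data
```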

## Supported metric types

Experiments support a few different types of metrics, and each metric type uses a model appropriate to the shape of its data.

For example, funnel conversions are always between 0% and 100%, pageview counts can be any non-negative integer (0, 50, 280), and property values can vary widely and tend to be right-skewed.

The following pages explain how Bayesian statistics is applied to each type of metric:

- [Beta model for funnel metrics](/docs/experiments/funnels-statistics) to analyze conversion rates through multi-step funnels.
- [Gamma-poisson model for trends metrics with count-based data](/docs/experiments/trends-count-statistics) like pageviews or interaction events.
- [Lognormal model with a normal-inverse-gamma prior for trends metrics with property values](/docs/experiments/trends-property-value-statistics) like revenue.

If your experiment was created prior to January 2025, it is [evaluated using the legacy methodology](/docs/experiments/legacy-methodology).
82 changes: 82 additions & 0 deletions contents/docs/experiments/trends-continuous-statistics.mdx
@@ -0,0 +1,82 @@
---
title: Statistical methodology for property value trend metrics
---

Trends metrics for property values use Bayesian statistics with a lognormal model and normal-inverse-gamma prior to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a lognormal model with normal-inverse-gamma prior?

A lognormal model with normal-inverse-gamma prior sounds like something you'd learn about in a quantum physics class, but it's a lot less intimidating than it seems. The model is great for analyzing metrics like revenue or other property values that are always positive and often have a "long tail" of high values.

Imagine you're looking at daily revenue from your customers:

- Most customers might spend $20-100.
- Some customers spend $200-500.
- A few customers spend $1000+.

This creates what we call a "right-skewed" distribution - lots of smaller values, with a long tail stretching to the right. This is where the lognormal model shines:

- When we take the logarithm of these values, they follow a nice bell curve (normal distribution).
- This makes it much easier to analyze the data mathematically.
- We can transform back to regular dollars for our final results.
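
A quick sketch of that log transformation, using synthetic revenue data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
revenue = rng.lognormal(mean=np.log(60), sigma=0.75, size=1_000)  # synthetic, right-skewed spend

print(stats.skew(revenue))          # strongly positive: a long tail to the right
print(stats.skew(np.log(revenue)))  # ≈ 0: log-space follows a bell curve
```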

The "normal-inverse-gamma prior" part helps us handle uncertainty:

- When we have very little data, it keeps our estimates reasonable.
- As we collect more data, it lets the actual data drive our conclusions.
- It accounts for uncertainty in both the average value AND how spread out the values are.
- We use a fixed log-space variance (`LOG_VARIANCE = 0.75`) based on typical patterns in property value data.

For example:

- Day 1: 5 customers spend an average of $50, but we're very uncertain about whether this represents the true average spending.
- Day 30: 500 customers spend an average of $50, and we're much more confident about this average value.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `MU_0 = 0.0`, `KAPPA_0 = 1.0`, `ALPHA_0 = 1.0`, and `BETA_0 = 1.0`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest value compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new pricing page and have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the win probabilities, our methodology (see the code sketch below):

1. Models each variant's value using a lognormal distribution (which works well for metrics like revenue that are always positive and often right-skewed):
- We transform the data to log-space where it follows a normal distribution.
- We use a normal-inverse-gamma prior to handle uncertainty about both the mean and variance.

2. Takes 10,000 random samples from each variant's posterior distribution.

3. Checks which variant had the higher value for each sample.

4. Calculates the final win probabilities:
- Control wins in 5 out of 10,000 samples = 0.5% probability.
- Test wins in 9,995 out of 10,000 samples = 99.5% probability.
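
Here's a simplified, self-contained sketch of those steps, using standard conjugate updates for the normal-inverse-gamma prior and synthetic data. It illustrates the approach rather than reproducing our exact implementation (for one, it estimates the log-space variance from the data instead of using the fixed `LOG_VARIANCE`):

```python
import numpy as np

rng = np.random.default_rng(42)
MU_0, KAPPA_0, ALPHA_0, BETA_0 = 0.0, 1.0, 1.0, 1.0  # minimally informative prior

def posterior_mean_samples(values, n_samples=10_000):
    """Sample the posterior mean under a lognormal model with a normal-inverse-gamma prior."""
    log_x = np.log(values)
    n, m = len(log_x), log_x.mean()
    ss = ((log_x - m) ** 2).sum()
    # Standard conjugate updates for the normal-inverse-gamma prior
    kappa_n = KAPPA_0 + n
    mu_n = (KAPPA_0 * MU_0 + n * m) / kappa_n
    alpha_n = ALPHA_0 + n / 2
    beta_n = BETA_0 + 0.5 * ss + KAPPA_0 * n * (m - MU_0) ** 2 / (2 * kappa_n)
    sigma2 = 1 / rng.gamma(alpha_n, 1 / beta_n, n_samples)  # inverse-gamma draws of the variance
    mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))        # mean draws, given each variance
    return np.exp(mu + sigma2 / 2)                          # back to dollars (lognormal mean)

# Synthetic right-skewed revenue data, roughly shaped like the example above
control = posterior_mean_samples(rng.lognormal(np.log(50), 0.75, 500))
test = posterior_mean_samples(rng.lognormal(np.log(60), 0.75, 500))

print(f"Test win probability: {(test > control).mean():.1%}")
```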

These results tell us we can be 99.5% confident that the test variant performs better than the control.

## Credible intervals

A **credible interval** tells you the range where the true value lies with 95% probability. This is different from a confidence interval, which describes how often such intervals would contain the true value if you repeated the experiment many times (not a direct probability statement about where the value lies).

For example, if you have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the credible intervals, our methodology will (see the code sketch below):

1. Transform the data to log-space and model each variant using a t-distribution:
- We use log transformation because metrics like revenue are often right-skewed.
- The t-distribution parameters come from our normal-inverse-gamma model.
- This handles uncertainty about both the mean and variance.

2. Find the 2.5th and 97.5th percentiles of each distribution:
- Control: [45.98, 55.10] = "You can be 95% confident the true average revenue is between $45.98 and $55.10"
- Test: [55.15, 64.22] = "You can be 95% confident the true average revenue is between $55.15 and $64.22"
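
A sketch of the same steps, reusing the conjugate updates from the win-probability sketch above. For simplicity it back-transforms the log-space interval directly, which gives an interval on the median (geometric-mean) scale rather than the mean:

```python
import numpy as np
from scipy import stats

MU_0, KAPPA_0, ALPHA_0, BETA_0 = 0.0, 1.0, 1.0, 1.0  # minimally informative prior

def credible_interval(values):
    log_x = np.log(values)
    n, m = len(log_x), log_x.mean()
    ss = ((log_x - m) ** 2).sum()
    kappa_n = KAPPA_0 + n
    mu_n = (KAPPA_0 * MU_0 + n * m) / kappa_n
    alpha_n = ALPHA_0 + n / 2
    beta_n = BETA_0 + 0.5 * ss + KAPPA_0 * n * (m - MU_0) ** 2 / (2 * kappa_n)
    # The marginal posterior of the log-space mean is a Student t-distribution
    t = stats.t(df=2 * alpha_n, loc=mu_n, scale=np.sqrt(beta_n / (alpha_n * kappa_n)))
    return np.exp(t.ppf([0.025, 0.975]))  # 2.5th and 97.5th percentiles, back to dollars

rng = np.random.default_rng(0)
print(credible_interval(rng.lognormal(np.log(50), 0.75, 500)))  # synthetic control data
```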

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true values.
70 changes: 70 additions & 0 deletions contents/docs/experiments/trends-count-statistics.mdx
@@ -0,0 +1,70 @@
---
title: Statistical methodology for count trend metrics
---

Trends metrics for count-based data use Bayesian statistics with a gamma-poisson model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a gamma-poisson model?

Imagine you run a pizza shop and want to know how many slices a customer typically orders. Some days customers might order 1 slice, others 3 slices, and occasionally someone might order 6 slices! This kind of count data (1, 2, 3, etc.) follows what's called a **poisson distribution**.

The poisson distribution has one key number: the average rate. In our pizza example, maybe it's 2.5 slices per customer. But here's the catch - we don't know the true rate for sure. We only have our observations to guess from.

This is where the **gamma distribution** comes in. It helps us model our uncertainty about the true rate:

- When we have very little data, the gamma distribution is wide, saying "hey, the true rate could be anywhere in this broad range".
- As we collect more data, the gamma distribution gets narrower, saying "we're getting more confident about what the true rate is".
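
A small scipy sketch of that narrowing (it adds in the Gamma(1, 1) prior described below):

```python
from scipy import stats

# Same 2.5 slices/customer rate, but more data later on (Gamma(1, 1) prior added in)
early = stats.gamma(a=5 + 1, scale=1 / (2 + 1))      # 5 slices from 2 customers
later = stats.gamma(a=250 + 1, scale=1 / (100 + 1))  # 250 slices from 100 customers

print(early.std(), later.std())  # ≈ 0.82 vs. ≈ 0.16: the distribution narrows
```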

So when we say we're using a gamma-poisson model for count metrics, we're:

1. Using the poisson distribution to model how count data naturally varies.
2. Using the gamma distribution to express our uncertainty about the true rate.
3. Getting more confident in our estimates over time.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new menu design and have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the win probabilities, our methodology (see the code sketch below):

1. Models each variant's rate using a gamma distribution:
- Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
- Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Takes 10,000 random samples from each distribution.

3. Checks which variant had the higher rate for each sample.

4. Calculates the final win probabilities:
- Control wins in 154 out of 10,000 samples = 1.54% probability
- Test wins in 9,846 out of 10,000 samples = 98.46% probability
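
A minimal numpy sketch of those four steps (illustrative, not the exact implementation):

```python
import numpy as np

rng = np.random.default_rng(42)
ALPHA_PRIOR, BETA_PRIOR = 1, 1  # minimally informative prior

# Step 1: posterior is Gamma(slices + prior, customers + prior); numpy's gamma takes scale = 1/rate
control = rng.gamma(250 + ALPHA_PRIOR, 1 / (100 + BETA_PRIOR), size=10_000)  # step 2: 10,000 samples
test = rng.gamma(300 + ALPHA_PRIOR, 1 / (100 + BETA_PRIOR), size=10_000)

# Steps 3-4: fraction of samples where the test variant has the higher rate
print(f"Test win probability: {(test > control).mean():.2%}")  # ≈ 98.5%
```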

These results tell us we can be 98.46% confident that the new menu design leads to more slice orders per customer than the old menu.

## Credible intervals

A **credible interval** tells you the range where the true rate lies with 95% probability. This is different from a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the credible intervals, our methodology will (see the code sketch below):

1. Create a gamma distribution for each variant:
- Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
- Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
- Control: [2.2, 2.8] = "You can be 95% confident customers order between 2.2 and 2.8 slices on average with the old menu"
- Test: [2.7, 3.3] = "You can be 95% confident customers order between 2.7 and 3.3 slices on average with the new menu"
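
A scipy sketch of the same two steps (illustrative, not the exact implementation):

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
control = stats.gamma(a=250 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))
test = stats.gamma(a=300 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))

# 95% credible interval = 2.5th and 97.5th percentiles of each posterior
print(control.ppf([0.025, 0.975]))  # ≈ [2.2, 2.8]
print(test.ppf([0.025, 0.975]))     # ≈ [2.7, 3.3]
```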

Since these intervals barely overlap, you can be quite confident that the new menu design results in more slice orders per customer. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true rates.