Document new experiments methodology #10217
Merged
Changes from all commits (49 commits, all by danielbachhuber):
- 7043ff1 Move Methodology below Features and rename
- ef6cf80 First pass at statistics primer
- b1f2789 Active voice
- 2aa7a71 First pass at funnel statistics doc
- 98f58c4 First pass at Trends count statistics
- 04d05e0 Add "What the heck?" sections
- 3f9f429 Edits
- 37d0d71 Edits
- be3fac7 First pass at continuous trends
- 6eb49a2 Link to all overviews
- 1575c5b Link in sidebar
- c5bc948 Edits
- 3898e7d Edits
- 1d8fe54 Formatting
- edffcff Edit
- bf802a7 Formatting
- cccb2cb Formatting
- 60b432f Formatting
- 179ff15 Formatting
- 1b834e8 Edit
- ca235f7 Drop 'what the heck'
- 68da0ad Add an introduction
- 420deee s/the/our/
- 5eb1f8f Improve Bayesian example
- 07dce3a Explain why different models
- 5c0e225 Update credible interval definition
- 5f893f9 Merge branch 'master' into experiments/new-stats-methodology
- 5132470 Explain why it might be useful
- bb2d23c Make the example more specific to pizza
- b20beae Tie the example to the model explanation
- 27e5c5c Explain "minimally informative priors"
- c2ba4c2 We're evaluating property values, not just revenue
- 2293f14 Clarify "Beta model" vs. "Beta distribution"
- 045c810 Formatting
- cd96971 Deprecate the legacy methodology
- c7aba65 Explain true probability
- dc666e4 Missing word
- 2c0464b Missing word
- 6f15396 Missing word
- db5a215 Edits
- 3acbd21 Two grafs
- 45057a3 Two grafs
- b5baa94 Merge branch 'master' into experiments/new-stats-methodology
- a282d20 Replace "experiments" with "metrics"
- 6634857 Edit
- 1aafe97 "Funnel metrics", not "Funnel experiments"
- 1ef9b83 Rewrite as "Supported metric types"
- d7da622 Fix casing
- 3f8e672 Rename to "Statisics overview"

@@ -0,0 +1,71 @@

---
title: Statistical methodology for funnel metrics
---

Funnel metrics use Bayesian statistics with a beta model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a beta model?

Imagine you run a pizza shop and want to know if customers say "yes" to adding pineapple. Some customers will say yes, others will say no. Knowing what percentage of customers want pineapple on their pizza helps you decide how much to order and what options to offer.

The beta model is a statistical approach that's great for analyzing proportions or probabilities. It uses a **beta distribution** to model the uncertainty in conversion rates and helps us understand:

1. Our best estimate of the _true_ probability that a customer will say "yes" to adding pineapple (vs. the probability we observe).
2. How certain we are about that estimate.

For example, if:

- Only 2 out of 4 customers (50%) say yes, the beta distribution will be wide, indicating high uncertainty.
- 150 out of 300 customers (50%) say yes, the beta distribution will be narrow, showing we're more confident about that 50% rate.
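
You can check that width difference directly. Here's a quick scipy sketch (illustrative only, not our production code):

```python
from scipy import stats

# Beta(yeses + 1, nos + 1); the + 1s are the minimally informative prior discussed below
wide = stats.beta(2 + 1, 2 + 1)        # 2 of 4 customers said yes
narrow = stats.beta(150 + 1, 150 + 1)  # 150 of 300 customers said yes

print(f"4 customers:   sd = {wide.std():.3f}")    # about 0.19, very uncertain
print(f"300 customers: sd = {narrow.std():.3f}")  # about 0.029, much tighter
```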

So when we say we're using a beta model for funnel metrics, we're:

1. Using the beta distribution to model conversion rates between 0% and 100%.
2. Getting more confident in our estimates as we collect more data.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new way of presenting pineapple on the website and have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the win probabilities, our methodology will:

1. Model each variant's conversion rate using a beta distribution:
   - Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
   - Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Take 10,000 random samples from each distribution.

3. Check which variant had the higher conversion rate for each sample.

4. Calculate the final win probabilities:
   - Control wins in 40 out of 10,000 samples = 0.4% probability
   - Test wins in 9,960 out of 10,000 samples = 99.6% probability

These results tell us we can be 99.6% confident that showing photos of pineapple pizza performs better than the current design.
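
If you want to see these four steps in action, here's a minimal Python sketch using scipy. Only the counts and priors come from the example above; the variable names are illustrative, not our production code:

```python
import numpy as np
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
rng = np.random.default_rng(42)
n_samples = 10_000

# Posterior: Beta(conversions + alpha, non-conversions + beta)
control = stats.beta.rvs(100 + ALPHA_PRIOR, 900 + BETA_PRIOR,
                         size=n_samples, random_state=rng)
test = stats.beta.rvs(150 + ALPHA_PRIOR, 850 + BETA_PRIOR,
                      size=n_samples, random_state=rng)

# Win probability: fraction of samples where each variant has the higher rate
print(f"Control win probability: {np.mean(control > test):.1%}")  # roughly 0.4%
print(f"Test win probability: {np.mean(test > control):.1%}")     # roughly 99.6%
```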

## Credible intervals

A **credible interval** tells you the range where the true conversion rate lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the credible intervals, our methodology will:

1. Create a beta distribution for each variant:
   - Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
   - Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [8.3%, 12%] = "You can be 95% confident the true conversion rate is between 8.3% and 12.0%"
   - Test: [12.9%, 17.3%] = "You can be 95% confident the true conversion rate is between 12.9% and 17.3%"

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true conversion rates.
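
The percentile lookup is a one-liner per variant with scipy (again a sketch, not our production code):

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1

# 95% credible interval = 2.5th and 97.5th percentiles of the posterior
control = stats.beta.ppf([0.025, 0.975], 100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
test = stats.beta.ppf([0.025, 0.975], 150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

print(f"Control: [{control[0]:.1%}, {control[1]:.1%}]")  # roughly [8.3%, 12.0%]
print(f"Test: [{test[0]:.1%}, {test[1]:.1%}]")           # roughly [12.9%, 17.3%]
```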

@@ -0,0 +1,53 @@

---
title: Experiment statistics overview
---

A working understanding of statistical methodology helps you feel confident about interpreting experiment results. For those without prior experience, this overview explains everything you need to know in layperson's terms. For those with some statistics experience, this overview documents our methodology and assumptions.

Experiments use [Bayesian statistics](https://en.wikipedia.org/wiki/Bayesian_statistics) to determine whether a given **variant** performs better than the **control**. This approach quantifies **win probabilities** and **credible intervals**, helps determine whether the experiment shows a **statistically significant** effect, and enables you to:

- Check results at any time without statistical penalties.
- Get direct probability statements about which variant is winning.
- Make confident decisions earlier with accumulating evidence.

This contrasts with [Frequentist statistics](https://en.wikipedia.org/wiki/Frequentist_inference), which requires you to predefine sample sizes and prevents you from updating probabilities as new data arrives.

## Example Bayesian analysis

Say you started an experiment a few hours ago and see these results:

- 1 in 10 people in the control group complete the funnel = 10% success rate.
- 1 in 9 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 46.7% probability of being better and the test variant has a 53.3% probability of being better.
- The control variant shows a credible interval of [2.3%, 41.3%] and the test variant shows a credible interval of [2.5%, 44.5%].

The first two values are pure math: dividing the number of successes by the total number of users gives us the raw success rates. It's not enough to just compare these conversion rates, however.

The last two values are derived using Bayesian statistics and describe our confidence in the results. The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants in the experiment. The **credible interval** tells you the range where the true conversion rate lies with 95% probability.

Importantly, even though the test variant is winning, it doesn't clear our threshold of 90% or greater win probability to be a statistically significant conclusion. This uncertainty is also demonstrated by the amount of overlap in the credible intervals.

As such, you decide to let the experiment run a bit longer and see these results:

- 100 in 1000 people in the control group complete the funnel = 10% success rate.
- 100 in 900 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 21.5% probability of being better and the test variant has a 78.5% probability of being better.
- The control variant shows a credible interval of [8.3%, 12%] and the test variant shows a credible interval of [9.2%, 13.3%].

Et voilà! The additional data increased the win probability and narrowed the credible intervals. With 1,900 total users instead of just 19, random chance becomes a less likely explanation for the difference in conversion rates. Even though both variants maintained the same conversion rates (10% vs 11%), the larger sample size gives us more confidence in the experiment results.

At this point, you could either declare the test variant as the winner (78.5% probability), or continue collecting data to reach the 90% statistical significance threshold. Bayesian statistics let you check your results whenever you'd like without worrying about increasing the chance of false positives from checking too frequently.
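
This overview doesn't spell out the underlying model (the per-metric pages below do that), but for a funnel metric the beta model applies. Assuming the Beta(1, 1) priors described on the funnel methodology page, a short scipy sketch roughly reproduces both checkpoints above. The `summarize` helper is our own illustration, not a PostHog API:

```python
import numpy as np
from scipy import stats

def summarize(control_hits, control_n, test_hits, test_n, samples=100_000, seed=0):
    """Win probability for the test variant plus 95% credible intervals,
    assuming Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    control = stats.beta(control_hits + 1, control_n - control_hits + 1)
    test = stats.beta(test_hits + 1, test_n - test_hits + 1)
    win = np.mean(test.rvs(samples, random_state=rng) >
                  control.rvs(samples, random_state=rng))
    return win, control.ppf([0.025, 0.975]), test.ppf([0.025, 0.975])

win, ci_control, ci_test = summarize(1, 10, 1, 9)
print(f"Early: {win:.1%}, {ci_control.round(3)}, {ci_test.round(3)}")
# roughly 53.3%, [0.023, 0.413], [0.025, 0.445]

win, ci_control, ci_test = summarize(100, 1000, 100, 900)
print(f"Later: {win:.1%}, {ci_control.round(3)}, {ci_test.round(3)}")
# roughly 78.5%, [0.083, 0.120], [0.092, 0.133]
```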

## Supported metric types

Experiments support a few different types of metrics, and each metric type uses a model appropriate to the shape of its data.

For example, funnel conversions are always between 0% and 100%, pageview counts can be any non-negative number (0, 50, 280), and property values can vary widely and tend to be right-skewed.

The following pages explain how Bayesian statistics is applied to each type of metric:

- [Beta model for funnel metrics](/docs/experiments/funnels-statistics) to analyze conversion rates through multi-step funnels.
- [Gamma-poisson model for trends metrics with count-based data](/docs/experiments/trends-count-statistics) like pageviews or interaction events.
- [Lognormal model with a normal-inverse-gamma prior for trends metrics with property values](/docs/experiments/trends-property-value-statistics) like revenue.

If your experiment was created prior to January 2025, it is [evaluated using the legacy methodology](/docs/experiments/legacy-methodology).

contents/docs/experiments/trends-continuous-statistics.mdx (82 additions, 0 deletions)

@@ -0,0 +1,82 @@

---
title: Statistical methodology for property value trend metrics
---

Trends metrics for property values use Bayesian statistics with a lognormal model and normal-inverse-gamma prior to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a lognormal model with normal-inverse-gamma prior?

A lognormal model with normal-inverse-gamma prior sounds like something you'd learn about in a quantum physics class, but it's a lot less intimidating than it seems. The model is great for analyzing metrics like revenue or other property values that are always positive and often have a "long tail" of high values.

Imagine you're looking at daily revenue from your customers:

- Most customers might spend $20-100.
- Some customers spend $200-500.
- A few customers spend $1000+.

This creates what we call a "right-skewed" distribution: lots of smaller values, with a long tail stretching to the right. This is where the lognormal model shines:

- When we take the logarithm of these values, they follow a nice bell curve (normal distribution).
- This makes it much easier to analyze the data mathematically.
- We can transform back to regular dollars for our final results.

The "normal-inverse-gamma prior" part helps us handle uncertainty:

- When we have very little data, it keeps our estimates reasonable.
- As we collect more data, it lets the actual data drive our conclusions.
- It accounts for uncertainty in both the average value AND how spread out the values are.
- We use a fixed log-space variance (`LOG_VARIANCE = 0.75`) based on typical patterns in property value data.

For example:

- Day 1: 5 customers spend an average of $50, but we're very uncertain about whether this represents the true average spending.
- Day 30: 500 customers spend an average of $50, and we're much more confident about this average value.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `MU_0 = 0.0`, `KAPPA_0 = 1.0`, `ALPHA_0 = 1.0`, and `BETA_0 = 1.0`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest value compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new pricing page and have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the win probabilities, our methodology:

1. Models each variant's value using a lognormal distribution (which works well for metrics like revenue that are always positive and often right-skewed):
   - We transform the data to log-space where it follows a normal distribution.
   - We use a normal-inverse-gamma prior to handle uncertainty about both the mean and variance.

2. Takes 10,000 random samples from each variant's posterior distribution.

3. Checks which variant had the higher value for each sample.

4. Calculates the final win probabilities:
   - Control wins in 5 out of 10,000 samples = 0.05% probability.
   - Test wins in 9,995 out of 10,000 samples = 99.95% probability.

These results tell us we can be 99.95% confident that the test variant performs better than the control.
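
The doc above names the model but not the update rules, so here's an illustrative Python sketch of the textbook normal-inverse-gamma conjugate update and sampling loop. The synthetic revenue data and function name are our own inventions for this example, and unlike the production model described above, this sketch estimates the log-space variance from the data instead of fixing `LOG_VARIANCE = 0.75`:

```python
import numpy as np
from scipy import stats

# Priors from the doc above
MU_0, KAPPA_0, ALPHA_0, BETA_0 = 0.0, 1.0, 1.0, 1.0

def posterior_mean_samples(values, n_samples=10_000, rng=None):
    """Sample plausible per-user averages via the standard
    normal-inverse-gamma conjugate update on log-transformed data."""
    rng = rng if rng is not None else np.random.default_rng()
    log_x = np.log(values)
    n, m = len(log_x), log_x.mean()
    sum_sq = ((log_x - m) ** 2).sum()

    # Standard conjugate updates for a normal likelihood with unknown mean and variance
    kappa_n = KAPPA_0 + n
    mu_n = (KAPPA_0 * MU_0 + n * m) / kappa_n
    alpha_n = ALPHA_0 + n / 2
    beta_n = BETA_0 + 0.5 * sum_sq + KAPPA_0 * n * (m - MU_0) ** 2 / (2 * kappa_n)

    # Draw a variance, then a mean, in log-space; map back to dollars.
    # The mean of a lognormal is exp(mu + sigma^2 / 2).
    var = stats.invgamma.rvs(alpha_n, scale=beta_n, size=n_samples, random_state=rng)
    mu = rng.normal(mu_n, np.sqrt(var / kappa_n))
    return np.exp(mu + var / 2)

rng = np.random.default_rng(1)
# Hypothetical right-skewed revenue data (medians near $50 and $60 respectively)
control_revenue = rng.lognormal(mean=np.log(50), sigma=0.75, size=500)
test_revenue = rng.lognormal(mean=np.log(60), sigma=0.75, size=500)

control = posterior_mean_samples(control_revenue, rng=rng)
test = posterior_mean_samples(test_revenue, rng=rng)
print(f"Test win probability: {np.mean(test > control):.2%}")  # close to 100% here
```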

## Credible intervals

A **credible interval** tells you the range where the true value lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true value if you repeated the experiment many times (not a direct probability statement about where the value lies).

For example, if you have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the credible intervals, our methodology will:

1. Transform the data to log-space and model each variant using a t-distribution:
   - We use log transformation because metrics like revenue are often right-skewed.
   - The t-distribution parameters come from our normal-inverse-gamma model.
   - This handles uncertainty about both the mean and variance.

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [$45.98, $53.53] = "You can be 95% confident the true average revenue is between $45.98 and $53.53"
   - Test: [$55.15, $64.22] = "You can be 95% confident the true average revenue is between $55.15 and $64.22"

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true values.
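
If you prefer a closed form over sampling, the log-space interval comes from a standard result: under the normal-inverse-gamma posterior, the marginal distribution of the log-space mean is a Student-t. A sketch (ours, not production code; exponentiating returns a dollar-scale interval for the central value, and the actual computation may apply further corrections):

```python
import numpy as np
from scipy import stats

def credible_interval(mu_n, kappa_n, alpha_n, beta_n):
    """95% interval for the log-space mean, mapped back to dollars."""
    # Marginal posterior of the mean is Student-t with 2 * alpha_n degrees of freedom
    posterior = stats.t(df=2 * alpha_n, loc=mu_n,
                        scale=np.sqrt(beta_n / (alpha_n * kappa_n)))
    low, high = posterior.ppf([0.025, 0.975])
    return np.exp([low, high])
```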

@@ -0,0 +1,70 @@

---
title: Statistical methodology for count trend metrics
---

Trends metrics for count-based data use Bayesian statistics with a gamma-poisson model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a gamma-poisson model?

Imagine you run a pizza shop and want to know how many slices a customer typically orders. Some days customers might order 1 slice, others 3 slices, and occasionally someone might order 6 slices! This kind of count data (1, 2, 3, etc.) follows what's called a **poisson distribution**.

The poisson distribution has one key number: the average rate. In our pizza example, maybe it's 2.5 slices per customer. But here's the catch: we don't know the true rate for sure. We only have our observations to guess from.

This is where the **gamma distribution** comes in. It helps us model our uncertainty about the true rate:

- When we have very little data, the gamma distribution is wide, saying "hey, the true rate could be anywhere in this broad range".
- As we collect more data, the gamma distribution gets narrower, saying "we're getting more confident about what the true rate is".

So when we say we're using a gamma-poisson model for count metrics, we're:

1. Using the poisson distribution to model how count data naturally varies.
2. Using the gamma distribution to express our uncertainty about the true rate.
3. Getting more confident in our estimates over time.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new menu design and have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the win probabilities, our methodology:

1. Models each variant's rate using a gamma distribution:
   - Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
   - Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Takes 10,000 random samples from each distribution.

3. Checks which variant had the higher rate for each sample.

4. Calculates the final win probabilities:
   - Control wins in 154 out of 10,000 samples = 1.54% probability
   - Test wins in 9,846 out of 10,000 samples = 98.46% probability

These results tell us we can be 98.46% confident that the new menu design leads to more slice orders per customer than the old menu.
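
Here's an illustrative scipy version of those four steps. Only the counts and priors come from the example; note that scipy parameterizes the gamma distribution by scale, which is 1 over the rate in the Gamma(alpha, beta) notation above:

```python
import numpy as np
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
rng = np.random.default_rng(7)
n_samples = 10_000

# Posterior: Gamma(total events + alpha, total exposure + beta)
control = stats.gamma.rvs(250 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR),
                          size=n_samples, random_state=rng)
test = stats.gamma.rvs(300 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR),
                       size=n_samples, random_state=rng)

print(f"Test win probability: {np.mean(test > control):.2%}")  # roughly 98%
```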

## Credible intervals

A **credible interval** tells you the range where the true rate lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the credible intervals, our methodology will:

1. Create a gamma distribution for each variant:
   - Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
   - Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [2.2, 2.8] = "You can be 95% confident customers order between 2.2 and 2.8 slices on average with the old menu"
   - Test: [2.7, 3.3] = "You can be 95% confident customers order between 2.7 and 3.3 slices on average with the new menu"

Since these intervals barely overlap, you can be quite confident that the new menu design results in more slice orders per customer. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true rates.
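
And the percentile lookup as a sketch:

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1

# 95% credible interval = 2.5th and 97.5th percentiles of the posterior
control = stats.gamma.ppf([0.025, 0.975], 250 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))
test = stats.gamma.ppf([0.025, 0.975], 300 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))

print("Control:", control.round(1))  # roughly [2.2, 2.8]
print("Test:", test.round(1))        # roughly [2.7, 3.3]
```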

Review comment:
We generally use sentence case for feature names, but you flip back and forth between title case for "beta model." First sentence is in sentence case, but the first title has it capitalized. Should all be "beta model" if possible.

Reply:
@andehen Which makes more sense, "gamma-poisson model" or "Gamma-Poisson model"?

Reply:
I think only capital letters if it is a title. But in the middle of a sentence it should be "a gamma-poisson model". The exception is if one refers to a specific distribution, like "we use a Beta(1, 1) distribution as prior ..."