Document new experiments methodology #10217
Merged
Changes from all commits (49 commits, all by danielbachhuber):
- 7043ff1 Move Methodology below Features and rename
- ef6cf80 First pass at statistics primer
- b1f2789 Active voice
- 2aa7a71 First pass at funnel statistics doc
- 98f58c4 First pass at Trends count statistics
- 04d05e0 Add "What the heck?" sections
- 3f9f429 Edits
- 37d0d71 Edits
- be3fac7 First pass at continuous trends
- 6eb49a2 Link to all overviews
- 1575c5b Link in sidebar
- c5bc948 Edits
- 3898e7d Edits
- 1d8fe54 Formatting
- edffcff Edit
- bf802a7 Formatting
- cccb2cb Formatting
- 60b432f Formatting
- 179ff15 Formatting
- 1b834e8 Edit
- ca235f7 Drop 'what the heck'
- 68da0ad Add an introduction
- 420deee s/the/our/
- 5eb1f8f Improve Bayesian example
- 07dce3a Explain why different models
- 5c0e225 Update credible interval definition
- 5f893f9 Merge branch 'master' into experiments/new-stats-methodology
- 5132470 Explain why it might be useful
- bb2d23c Make the example more specific to pizza
- b20beae Tie the example to the model explanation
- 27e5c5c Explain "minimally informative priors"
- c2ba4c2 We're evaluating property values, not just revenue
- 2293f14 Clarify "Beta model" vs. "Beta distribution"
- 045c810 Formatting
- cd96971 Deprecate the legacy methodology
- c7aba65 Explain true probability
- dc666e4 Missing word
- 2c0464b Missing word
- 6f15396 Missing word
- db5a215 Edits
- 3acbd21 Two grafs
- 45057a3 Two grafs
- b5baa94 Merge branch 'master' into experiments/new-stats-methodology
- a282d20 Replace "experiments" with "metrics"
- 6634857 Edit
- 1aafe97 "Funnel metrics", not "Funnel experiments"
- 1ef9b83 Rewrite as "Supported metric types"
- d7da622 Fix casing
- 3f8e672 Rename to "Statisics overview"

@@ -0,0 +1,71 @@

---
title: Statistical methodology for funnel metrics
---

Funnel metrics use Bayesian statistics with a beta model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a beta model?

Imagine you run a pizza shop and want to know if customers say "yes" to adding pineapple. Some customers will say yes, others will say no. Knowing what percentage of customers want pineapple on their pizza helps you decide how much to order and what options to offer.

The beta model is a statistical approach that's great for analyzing proportions or probabilities. It uses a **beta distribution** to model the uncertainty in conversion rates and helps us understand:

1. Our best estimate of the _true_ probability that a customer will say "yes" to adding pineapple (vs. the probability we observe).
2. How certain we are about that estimate.

For example, if:

- Only 2 out of 4 customers (50%) say yes, the beta distribution will be wide, indicating high uncertainty.
- 150 out of 300 customers (50%) say yes, the beta distribution will be narrow, showing we're more confident about that 50% rate.
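
You can check that width difference directly. Here's a quick scipy sketch (illustrative only, not our production code):

```python
from scipy import stats

# Beta(yeses + 1, nos + 1); the + 1s are the minimally informative prior discussed below
wide = stats.beta(2 + 1, 2 + 1)        # 2 of 4 customers said yes
narrow = stats.beta(150 + 1, 150 + 1)  # 150 of 300 customers said yes

print(f"4 customers:   sd = {wide.std():.3f}")    # about 0.19, very uncertain
print(f"300 customers: sd = {narrow.std():.3f}")  # about 0.029, much tighter
```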

So when we say we're using a beta model for funnel metrics, we're:

1. Using the beta distribution to model conversion rates between 0% and 100%.
2. Getting more confident in our estimates as we collect more data.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new way of presenting pineapple on the website and have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the win probabilities, our methodology will:

1. Model each variant's conversion rate using a beta distribution:
   - Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
   - Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Take 10,000 random samples from each distribution.

3. Check which variant had the higher conversion rate for each sample.

4. Calculate the final win probabilities:
   - Control wins in 40 out of 10,000 samples = 0.4% probability
   - Test wins in 9,960 out of 10,000 samples = 99.6% probability

These results tell us we can be 99.6% confident that showing photos of pineapple pizza performs better than the current design.
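
If you want to see these four steps in action, here's a minimal Python sketch using scipy. Only the counts and priors come from the example above; the variable names are illustrative, not our production code:

```python
import numpy as np
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
rng = np.random.default_rng(42)
n_samples = 10_000

# Posterior: Beta(conversions + alpha, non-conversions + beta)
control = stats.beta.rvs(100 + ALPHA_PRIOR, 900 + BETA_PRIOR,
                         size=n_samples, random_state=rng)
test = stats.beta.rvs(150 + ALPHA_PRIOR, 850 + BETA_PRIOR,
                      size=n_samples, random_state=rng)

# Win probability: fraction of samples where each variant has the higher rate
print(f"Control win probability: {np.mean(control > test):.1%}")  # roughly 0.4%
print(f"Test win probability: {np.mean(test > control):.1%}")     # roughly 99.6%
```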

## Credible intervals

A **credible interval** tells you the range where the true conversion rate lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (current design): 100 pineapple orders from 1000 customers (10% acceptance)
- Test (suggesting pineapple with a photo): 150 pineapple orders from 1000 customers (15% acceptance)

To calculate the credible intervals, our methodology will:

1. Create a beta distribution for each variant:
   - Control: Beta(100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
   - Test: Beta(150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [8.3%, 12%] = "You can be 95% confident the true conversion rate is between 8.3% and 12.0%"
   - Test: [12.9%, 17.3%] = "You can be 95% confident the true conversion rate is between 12.9% and 17.3%"

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true conversion rates.
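
The percentile lookup is a one-liner per variant with scipy (again a sketch, not our production code):

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1

# 95% credible interval = 2.5th and 97.5th percentiles of the posterior
control = stats.beta.ppf([0.025, 0.975], 100 + ALPHA_PRIOR, 900 + BETA_PRIOR)
test = stats.beta.ppf([0.025, 0.975], 150 + ALPHA_PRIOR, 850 + BETA_PRIOR)

print(f"Control: [{control[0]:.1%}, {control[1]:.1%}]")  # roughly [8.3%, 12.0%]
print(f"Test: [{test[0]:.1%}, {test[1]:.1%}]")           # roughly [12.9%, 17.3%]
```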

@@ -0,0 +1,53 @@

---
title: Experiment statistics overview
---

A working understanding of statistical methodology helps you feel confident about interpreting experiment results. For those without prior experience, this overview explains everything you need to know in layperson's terms. For those with some statistics experience, this overview documents our methodology and assumptions.

Experiments use [Bayesian statistics](https://en.wikipedia.org/wiki/Bayesian_statistics) to determine whether a given **variant** performs better than the **control**. This approach quantifies **win probabilities** and **credible intervals**, helps determine whether the experiment shows a **statistically significant** effect, and enables you to:

- Check results at any time without statistical penalties.
- Get direct probability statements about which variant is winning.
- Make confident decisions earlier with accumulating evidence.

This contrasts with [Frequentist statistics](https://en.wikipedia.org/wiki/Frequentist_inference), which requires you to predefine sample sizes and prevents you from updating probabilities as new data arrives.

## Example Bayesian analysis

Say you started an experiment a few hours ago and see these results:

- 1 in 10 people in the control group complete the funnel = 10% success rate.
- 1 in 9 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 46.7% probability of being better and the test variant has a 53.3% probability of being better.
- The control variant shows a credible interval of [2.3%, 41.3%] and the test variant shows a credible interval of [2.5%, 44.5%].

The first two values are pure math: dividing the number of successes by the total number of users gives us the raw success rates. It's not enough to just compare these conversion rates, however.

The last two values are derived using Bayesian statistics and describe our confidence in the results. The **win probability** tells you how likely it is that a given variant has the highest conversion rate compared to all other variants in the experiment. The **credible interval** tells you the range where the true conversion rate lies with 95% probability.

Importantly, even though the test variant is winning, it doesn't clear our threshold of 90% or greater win probability to be a statistically significant conclusion. This uncertainty is also demonstrated by the amount of overlap in the credible intervals.

As such, you decide to let the experiment run a bit longer and see these results:

- 100 in 1000 people in the control group complete the funnel = 10% success rate.
- 100 in 900 people in the test variant group complete the funnel = 11% success rate.
- The control variant has a 21.5% probability of being better and the test variant has a 78.5% probability of being better.
- The control variant shows a credible interval of [8.3%, 12%] and the test variant shows a credible interval of [9.2%, 13.3%].

Et voilà! The additional data increased the win probability and narrowed the credible intervals. With 1,900 total users instead of just 19, random chance becomes a less likely explanation for the difference in conversion rates. Even though both variants maintained the same conversion rates (10% vs 11%), the larger sample size gives us more confidence in the experiment results.

At this point, you could either declare the test variant as the winner (78.5% probability), or continue collecting data to reach the 90% statistical significance threshold. Bayesian statistics let you check your results whenever you'd like without worrying about increasing the chance of false positives from checking too frequently.
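
This overview doesn't spell out the underlying model (the per-metric pages below do that), but for a funnel metric the beta model applies. Assuming the Beta(1, 1) priors described on the funnel methodology page, a short scipy sketch roughly reproduces both checkpoints above. The `summarize` helper is our own illustration, not a PostHog API:

```python
import numpy as np
from scipy import stats

def summarize(control_hits, control_n, test_hits, test_n, samples=100_000, seed=0):
    """Win probability for the test variant plus 95% credible intervals,
    assuming Beta(1, 1) priors."""
    rng = np.random.default_rng(seed)
    control = stats.beta(control_hits + 1, control_n - control_hits + 1)
    test = stats.beta(test_hits + 1, test_n - test_hits + 1)
    win = np.mean(test.rvs(samples, random_state=rng) >
                  control.rvs(samples, random_state=rng))
    return win, control.ppf([0.025, 0.975]), test.ppf([0.025, 0.975])

win, ci_control, ci_test = summarize(1, 10, 1, 9)
print(f"Early: {win:.1%}, {ci_control.round(3)}, {ci_test.round(3)}")
# roughly 53.3%, [0.023, 0.413], [0.025, 0.445]

win, ci_control, ci_test = summarize(100, 1000, 100, 900)
print(f"Later: {win:.1%}, {ci_control.round(3)}, {ci_test.round(3)}")
# roughly 78.5%, [0.083, 0.120], [0.092, 0.133]
```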

## Supported metric types

Experiments support a few different types of metrics, and each metric type uses a model appropriate to the shape of its data.

For example, funnel conversions are always between 0% and 100%, pageview counts can be any non-negative number (0, 50, 280), and property values can vary widely and tend to be right-skewed.

The following pages explain how Bayesian statistics is applied to each type of metric:

- [Beta model for funnel metrics](/docs/experiments/funnels-statistics) to analyze conversion rates through multi-step funnels.
- [Gamma-poisson model for trends metrics with count-based data](/docs/experiments/trends-count-statistics) like pageviews or interaction events.
- [Lognormal model with a normal-inverse-gamma prior for trends metrics with property values](/docs/experiments/trends-property-value-statistics) like revenue.

If your experiment was created prior to January 2025, it is [evaluated using the legacy methodology](/docs/experiments/legacy-methodology).

contents/docs/experiments/trends-continuous-statistics.mdx (82 additions, 0 deletions)

@@ -0,0 +1,82 @@

---
title: Statistical methodology for property value trend metrics
---

Trends metrics for property values use Bayesian statistics with a lognormal model and normal-inverse-gamma prior to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a lognormal model with normal-inverse-gamma prior?

A lognormal model with normal-inverse-gamma prior sounds like something you'd learn about in a quantum physics class, but it's a lot less intimidating than it seems. The model is great for analyzing metrics like revenue or other property values that are always positive and often have a "long tail" of high values.

Imagine you're looking at daily revenue from your customers:

- Most customers might spend $20-100.
- Some customers spend $200-500.
- A few customers spend $1000+.

This creates what we call a "right-skewed" distribution: lots of smaller values, with a long tail stretching to the right. This is where the lognormal model shines:

- When we take the logarithm of these values, they follow a nice bell curve (normal distribution).
- This makes it much easier to analyze the data mathematically.
- We can transform back to regular dollars for our final results.

The "normal-inverse-gamma prior" part helps us handle uncertainty:

- When we have very little data, it keeps our estimates reasonable.
- As we collect more data, it lets the actual data drive our conclusions.
- It accounts for uncertainty in both the average value AND how spread out the values are.
- We use a fixed log-space variance (`LOG_VARIANCE = 0.75`) based on typical patterns in property value data.

For example:

- Day 1: 5 customers spend an average of $50, but we're very uncertain about whether this represents the true average spending.
- Day 30: 500 customers spend an average of $50, and we're much more confident about this average value.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `MU_0 = 0.0`, `KAPPA_0 = 1.0`, `ALPHA_0 = 1.0`, and `BETA_0 = 1.0`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest value compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new pricing page and have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the win probabilities, our methodology:

1. Models each variant's value using a lognormal distribution (which works well for metrics like revenue that are always positive and often right-skewed):
   - We transform the data to log-space where it follows a normal distribution.
   - We use a normal-inverse-gamma prior to handle uncertainty about both the mean and variance.

2. Takes 10,000 random samples from each variant's posterior distribution.

3. Checks which variant had the higher value for each sample.

4. Calculates the final win probabilities:
   - Control wins in 5 out of 10,000 samples = 0.05% probability.
   - Test wins in 9,995 out of 10,000 samples = 99.95% probability.

These results tell us we can be 99.95% confident that the test variant performs better than the control.
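
The doc above names the model but not the update rules, so here's an illustrative Python sketch of the textbook normal-inverse-gamma conjugate update and sampling loop. The synthetic revenue data and function name are our own inventions for this example, and unlike the production model described above, this sketch estimates the log-space variance from the data instead of fixing `LOG_VARIANCE = 0.75`:

```python
import numpy as np
from scipy import stats

# Priors from the doc above
MU_0, KAPPA_0, ALPHA_0, BETA_0 = 0.0, 1.0, 1.0, 1.0

def posterior_mean_samples(values, n_samples=10_000, rng=None):
    """Sample plausible per-user averages via the standard
    normal-inverse-gamma conjugate update on log-transformed data."""
    rng = rng if rng is not None else np.random.default_rng()
    log_x = np.log(values)
    n, m = len(log_x), log_x.mean()
    sum_sq = ((log_x - m) ** 2).sum()

    # Standard conjugate updates for a normal likelihood with unknown mean and variance
    kappa_n = KAPPA_0 + n
    mu_n = (KAPPA_0 * MU_0 + n * m) / kappa_n
    alpha_n = ALPHA_0 + n / 2
    beta_n = BETA_0 + 0.5 * sum_sq + KAPPA_0 * n * (m - MU_0) ** 2 / (2 * kappa_n)

    # Draw a variance, then a mean, in log-space; map back to dollars.
    # The mean of a lognormal is exp(mu + sigma^2 / 2).
    var = stats.invgamma.rvs(alpha_n, scale=beta_n, size=n_samples, random_state=rng)
    mu = rng.normal(mu_n, np.sqrt(var / kappa_n))
    return np.exp(mu + var / 2)

rng = np.random.default_rng(1)
# Hypothetical right-skewed revenue data (medians near $50 and $60 respectively)
control_revenue = rng.lognormal(mean=np.log(50), sigma=0.75, size=500)
test_revenue = rng.lognormal(mean=np.log(60), sigma=0.75, size=500)

control = posterior_mean_samples(control_revenue, rng=rng)
test = posterior_mean_samples(test_revenue, rng=rng)
print(f"Test win probability: {np.mean(test > control):.2%}")  # close to 100% here
```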

## Credible intervals

A **credible interval** tells you the range where the true value lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true value if you repeated the experiment many times (not a direct probability statement about where the value lies).

For example, if you have these results:

- Control: $50 average revenue per user (500 users)
- Test: $60 average revenue per user (500 users)

To calculate the credible intervals, our methodology will:

1. Transform the data to log-space and model each variant using a t-distribution:
   - We use log transformation because metrics like revenue are often right-skewed.
   - The t-distribution parameters come from our normal-inverse-gamma model.
   - This handles uncertainty about both the mean and variance.

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [$45.98, $53.53] = "You can be 95% confident the true average revenue is between $45.98 and $53.53"
   - Test: [$55.15, $64.22] = "You can be 95% confident the true average revenue is between $55.15 and $64.22"

Since these intervals don't overlap, you can be quite confident that the test variant performs better than the control. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true values.
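
If you prefer a closed form over sampling, the log-space interval comes from a standard result: under the normal-inverse-gamma posterior, the marginal distribution of the log-space mean is a Student-t. A sketch (ours, not production code; exponentiating returns a dollar-scale interval for the central value, and the actual computation may apply further corrections):

```python
import numpy as np
from scipy import stats

def credible_interval(mu_n, kappa_n, alpha_n, beta_n):
    """95% interval for the log-space mean, mapped back to dollars."""
    # Marginal posterior of the mean is Student-t with 2 * alpha_n degrees of freedom
    posterior = stats.t(df=2 * alpha_n, loc=mu_n,
                        scale=np.sqrt(beta_n / (alpha_n * kappa_n)))
    low, high = posterior.ppf([0.025, 0.975])
    return np.exp([low, high])
```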

@@ -0,0 +1,70 @@

---
title: Statistical methodology for count trend metrics
---

Trends metrics for count-based data use Bayesian statistics with a gamma-poisson model to evaluate the **win probabilities** and **credible intervals**. [Read the statistics overview](/docs/experiments/statistics) if you haven't already.

## What is a gamma-poisson model?

Imagine you run a pizza shop and want to know how many slices a customer typically orders. Some days customers might order 1 slice, others 3 slices, and occasionally someone might order 6 slices! This kind of count data (1, 2, 3, etc.) follows what's called a **poisson distribution**.

The poisson distribution has one key number: the average rate. In our pizza example, maybe it's 2.5 slices per customer. But here's the catch: we don't know the true rate for sure. We only have our observations to guess from.

This is where the **gamma distribution** comes in. It helps us model our uncertainty about the true rate:

- When we have very little data, the gamma distribution is wide, saying "hey, the true rate could be anywhere in this broad range".
- As we collect more data, the gamma distribution gets narrower, saying "we're getting more confident about what the true rate is".

So when we say we're using a gamma-poisson model for count metrics, we're:

1. Using the poisson distribution to model how count data naturally varies.
2. Using the gamma distribution to express our uncertainty about the true rate.
3. Getting more confident in our estimates over time.

One more thing worth noting: Bayesian inference starts with an initial guess that then gets updated as more data comes in. Our model uses a "minimally informative prior" of `ALPHA_PRIOR = 1` and `BETA_PRIOR = 1`, which is like starting with a blank slate instead of making an upfront assumption about the results.

## Win probabilities

The **win probability** tells you how likely it is that a given variant has the highest rate compared to all other variants. It helps you determine whether the metric shows a **statistically significant** real effect vs. simply random chance.

Let's say you're testing a new menu design and have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the win probabilities, our methodology:

1. Models each variant's rate using a gamma distribution:
   - Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
   - Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Takes 10,000 random samples from each distribution.

3. Checks which variant had the higher rate for each sample.

4. Calculates the final win probabilities:
   - Control wins in 154 out of 10,000 samples = 1.54% probability
   - Test wins in 9,846 out of 10,000 samples = 98.46% probability

These results tell us we can be 98.46% confident that the new menu design leads to more slice orders per customer than the old menu.
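
Here's an illustrative scipy version of those four steps. Only the counts and priors come from the example; note that scipy parameterizes the gamma distribution by scale, which is 1 over the rate in the Gamma(alpha, beta) notation above:

```python
import numpy as np
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1
rng = np.random.default_rng(7)
n_samples = 10_000

# Posterior: Gamma(total events + alpha, total exposure + beta)
control = stats.gamma.rvs(250 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR),
                          size=n_samples, random_state=rng)
test = stats.gamma.rvs(300 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR),
                       size=n_samples, random_state=rng)

print(f"Test win probability: {np.mean(test > control):.2%}")  # roughly 98%
```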

## Credible intervals

A **credible interval** tells you the range where the true rate lies with 95% probability. This is different than a confidence interval, which describes how often such intervals would contain the true rate if you repeated the experiment many times (not a direct probability statement about where the rate lies).

For example, if you have these results:

- Control (old menu): 250 slices ordered by 100 customers (rate of 2.5 slices per customer)
- Test (new menu): 300 slices ordered by 100 customers (rate of 3.0 slices per customer)

To calculate the credible intervals, our methodology will:

1. Create a gamma distribution for each variant:
   - Control: Gamma(250 + ALPHA_PRIOR, 100 + BETA_PRIOR)
   - Test: Gamma(300 + ALPHA_PRIOR, 100 + BETA_PRIOR)

2. Find the 2.5th and 97.5th percentiles of each distribution:
   - Control: [2.2, 2.8] = "You can be 95% confident customers order between 2.2 and 2.8 slices on average with the old menu"
   - Test: [2.7, 3.3] = "You can be 95% confident customers order between 2.7 and 3.3 slices on average with the new menu"

Since these intervals barely overlap, you can be quite confident that the new menu design results in more slice orders per customer. The intervals will become narrower as you collect more data, reflecting your increasing certainty about the true rates.
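
And the percentile lookup as a sketch:

```python
from scipy import stats

ALPHA_PRIOR, BETA_PRIOR = 1, 1

# 95% credible interval = 2.5th and 97.5th percentiles of the posterior
control = stats.gamma.ppf([0.025, 0.975], 250 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))
test = stats.gamma.ppf([0.025, 0.975], 300 + ALPHA_PRIOR, scale=1 / (100 + BETA_PRIOR))

print("Control:", control.round(1))  # roughly [2.2, 2.8]
print("Test:", test.round(1))        # roughly [2.7, 3.3]
```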

Review comment:
We generally use sentence case for feature names, but you flip back and forth between title case for "beta model." First sentence is in sentence case, but the first title has it capitalized. Should all be "beta model" if possible.

Reply:
@andehen Which makes more sense, "gamma-poisson model" or "Gamma-Poisson model"?

Reply:
I think only capital letters if it is a title. But in the middle of a sentence it should be "a gamma-poisson model". The exception is if one refers to a specific distribution, like "we use a Beta(1, 1) distribution as prior ..."