
feat: Add Variance Reduction to Delta Method and couple it with CUPAC #230


Open
wants to merge 9 commits into main

Conversation

LGonzalezGomez (Collaborator) commented Mar 2, 2025

This PR couples CUPED with the Delta Method.
It is intended to be used directly with the Power Analysis, so the user is expected to rely on CUPAC. This keeps the usage simple; it is a slight approximation, but it should work well for our implementation.
The alternative would be to change how the interface works so that both ratio metrics and regular metrics can be passed, which would imply a lot of refactoring and would change how users normally interact with the library.

"Delta Method Isolated" refers to doing the Delta Method properly with CUPAC: instead of predicting the ratio, we predict both the numerator and the denominator and then apply the Delta Method to them, so the ratio structure is accounted for in the CUPED adjustment. As the screenshots show, predicting the ratio is a good approximation and tends to overestimate the variance.
Recall that we apply the Delta Method because otherwise we would underestimate the variance. I prefer a very small type I error over changing how users interact with the library.

[Screenshots: comparison of the ratio-prediction (CUPAC) approach with Delta Method Isolated, 2025-03-02]

Please feel free to check it and we can discuss implementation and align on the best way going forward!
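For readers skimming the thread, here is a minimal, self-contained sketch of the general technique being discussed (not the code in this PR): linearize the cluster-level ratio with the delta method, then apply a CUPED adjustment to the linearized metric using a single covariate such as the CUPAC prediction. The helper and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def delta_method_se_with_cuped(
    df: pd.DataFrame, target_col: str, scale_col: str, covariate_col: str
) -> float:
    """Standard error of sum(target)/sum(scale) across clusters via the
    delta method, with a CUPED adjustment of the linearized metric on a
    single covariate (e.g. a CUPAC prediction). Assumes one row per cluster."""
    y = df[target_col].to_numpy(dtype=float)
    s = df[scale_col].to_numpy(dtype=float)
    x = df[covariate_col].to_numpy(dtype=float)
    n = len(df)

    ratio = y.sum() / s.sum()
    # Delta-method linearization: per-cluster influence values of the ratio.
    lin = (y - ratio * s) / s.mean()

    # CUPED on the linearized metric: theta = Cov(lin, x) / Var(x).
    theta = np.cov(lin, x, ddof=1)[0, 1] / x.var(ddof=1)
    adjusted = lin - theta * (x - x.mean())

    # SE of the mean of the adjusted linearized metric.
    return adjusted.std(ddof=1) / np.sqrt(n)
```

In the "Delta Method Isolated" variant described above, the numerator and denominator would be predicted separately and the delta method applied to both, rather than adjusting on a single predicted ratio.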

Remove the possibility of passing ratio covariates. Ideally this should always be done through the CUPAC adjustment, since only a very small deviation was observed and it is a very good approximation
LGonzalezGomez linked an issue Mar 2, 2025 that may be closed by this pull request
codecov-commenter commented Mar 2, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

Attention: Patch coverage is 96.82540% with 2 lines in your changes missing coverage. Please review.

Project coverage is 96.38%. Comparing base (9570cf6) to head (e65774e).

Files with missing lines                     Patch %   Lines
cluster_experiments/cupac.py                  90.00%   1 Missing ⚠️
cluster_experiments/experiment_analysis.py    98.11%   1 Missing ⚠️

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #230      +/-   ##
==========================================
+ Coverage   96.28%   96.38%   +0.10%     
==========================================
  Files          17       17              
  Lines        1668     1716      +48     
==========================================
+ Hits         1606     1654      +48     
  Misses         62       62              


LGonzalezGomez (Collaborator, Author):
Codecov is not updating. The only uncovered line is in CUPAC. I am not sure a dedicated test is really needed (it seems a bit much IMO), but we could add a test in the power config to check that CUPAC with the Delta Method is indeed working.

@@ -136,7 +140,12 @@ def _prep_data_cupac(
df_predict = df.drop(columns=[self.target_col])
# Split data into X and y
pre_experiment_x = pre_experiment_df.drop(columns=[self.target_col])
pre_experiment_y = pre_experiment_df[self.target_col]
if self.scale_col:
Owner:
perhaps adding a property in the class called is_delta_method would make this more explicit
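A hypothetical sketch of that suggestion, as a drop-in for the check quoted above (property name taken from the comment):

```python
@property
def is_delta_method(self) -> bool:
    # Explicit flag instead of testing scale_col at each call site.
    return self.scale_col is not None
```

The branch above would then read `if self.is_delta_method:`.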

@@ -1260,7 +1259,8 @@ def __init__(
DeltaMethodAnalysis(
cluster_cols=['cluster'],
target_col='x',
scale_col='y'
scale_col='y',
covariates=['x']
Owner:
same covariate as target?

target_col: name of the column containing the variable to measure (the numerator of the ratio).
scale_col: name of the column containing the scale variable (the denominator of the ratio).
treatment_col: name of the column containing the treatment variable.
treatment: name of the treatment to use as the treated group.
covariates: list of columns to use as covariates.
ratio_covariates: list of tuples of columns to use as covariates for ratio metrics. First element is the numerator column, second element is the denominator column.
covariates: list of columns to use as covariates. Have to be previously aggregated at the cluster level.
Owner:
so this doesn't work with raw data? only aggregated?
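For illustration, a hypothetical example of what "previously aggregated at the cluster level" means in practice (column names are made up): the covariate is collapsed to one row per cluster before being passed to the analysis.

```python
# One row per cluster: sum the ratio columns, average the covariate.
cluster_df = raw_df.groupby(["cluster", "treatment"], as_index=False).agg(
    x=("x", "sum"),
    y=("y", "sum"),
    pre_metric=("pre_metric", "mean"),
)
```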

Y_var = df[self.target_metric].var()
for k, covariate in enumerate(self.covariates):
Y_var += (
theta[k] ** 2 * df[covariate].var()
Owner:
wdyt about splitting in a couple of steps?
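A sketch of the split being suggested, mirroring only the terms visible in the hunk above (any remaining covariance terms would be named the same way):

```python
Y_var = df[self.target_metric].var()
for k, covariate in enumerate(self.covariates):
    # Name each covariate's contribution before accumulating it.
    covariate_term = theta[k] ** 2 * df[covariate].var()
    Y_var += covariate_term
```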

target_scale_cov = df.loc[:, self.target_col].cov(df.loc[:, self.scale_col])
df = df.copy()
df = self._transform_metrics(df)
if self.covariates:
Owner:
wondering if we could avoid this if else in any way

@@ -1333,6 +1390,14 @@ def _get_mean_standard_error(self, df: pd.DataFrame) -> tuple[float, float]:
Variance reduction is used if covariates are given.
"""

if (self._get_num_clusters(df) < 1000).any():
Owner:
could the 1000 be a class variable?
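A hypothetical drop-in for the lines above, with the threshold lifted to a class attribute on DeltaMethodAnalysis (the attribute name is an assumption):

```python
MIN_CLUSTERS_WARNING: int = 1000  # class attribute, documented and overridable

if (self._get_num_clusters(df) < self.MIN_CLUSTERS_WARNING).any():
    self.__warn_small_group_size()
```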

if (self._get_num_clusters(df) < 1000).any():
self.__warn_small_group_size()

if self.covariates:
Owner:
why are covariates relevant to aggregation?

Check if there are enough clusters to run the analysis.
"""
return df.groupby(self.treatment_col).apply(
lambda x: self._get_cluster_column(x).nunique()
Owner:
why do we groupby treatment col?

@@ -324,8 +325,124 @@ def test_stats_delta_vs_ols(analysis_ratio_df, experiment_dates):
SE_delta = analyser_delta.get_standard_error(df)

assert point_estimate_delta == pytest.approx(
point_estimate_ols, rel=1e-2
point_estimate_ols, rel=1e-1
Owner:
could you add a test where the DeltaMethod is passed to an AnalysisPlan?

david26694 self-requested a review March 24, 2025 07:47
Development

Successfully merging this pull request may close these issues.

Add Variance Reduction to Delta Method
3 participants