-
Notifications
You must be signed in to change notification settings - Fork 4
feat: Add Variance Reduction to Delta Method and couple it with CUPAC #230
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Remove possibility to pass ratio covariates. Ideally this should always be done through CUPAC adjustment as it was observed a very small deviation and it was a very good approximation
Codecov ReportAttention: Patch coverage is
❗ Your organization needs to install the Codecov GitHub app to enable full functionality. Additional details and impacted files@@ Coverage Diff @@
## main #230 +/- ##
==========================================
+ Coverage 96.28% 96.38% +0.10%
==========================================
Files 17 17
Lines 1668 1716 +48
==========================================
+ Hits 1606 1654 +48
Misses 62 62 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Codecov is not updating. The only line that is not covered is present in CUPAC. Not sure if it is really needed, seems a bit too much IMO, but we could add a test in power config to test that CUPAC with Delta Method is indeed working. |
@@ -136,7 +140,12 @@ def _prep_data_cupac( | |||
df_predict = df.drop(columns=[self.target_col]) | |||
# Split data into X and y | |||
pre_experiment_x = pre_experiment_df.drop(columns=[self.target_col]) | |||
pre_experiment_y = pre_experiment_df[self.target_col] | |||
if self.scale_col: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps adding a property in the class called is_delta_method would make this more explicit
@@ -1260,7 +1259,8 @@ def __init__( | |||
DeltaMethodAnalysis( | |||
cluster_cols=['cluster'], | |||
target_col='x', | |||
scale_col='y' | |||
scale_col='y', | |||
covariates=['x'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same covariate as target?
target_col: name of the column containing the variable to measure (the numerator of the ratio). | ||
scale_col: name of the column containing the scale variable (the denominator of the ratio). | ||
treatment_col: name of the column containing the treatment variable. | ||
treatment: name of the treatment to use as the treated group. | ||
covariates: list of columns to use as covariates. | ||
ratio_covariates: list of tuples of columns to use as covariates for ratio metrics. First element is the numerator column, second element is the denominator column. | ||
covariates: list of columns to use as covariates. Have to be previously aggregated at the cluster level. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this doesn't work with raw data? only aggregated?
Y_var = df[self.target_metric].var() | ||
for k, covariate in enumerate(self.covariates): | ||
Y_var += ( | ||
theta[k] ** 2 * df[covariate].var() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wdyt about splitting in a couple of steps?
target_scale_cov = df.loc[:, self.target_col].cov(df.loc[:, self.scale_col]) | ||
df = df.copy() | ||
df = self._transform_metrics(df) | ||
if self.covariates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wondering if we could avoid this if else in any way
@@ -1333,6 +1390,14 @@ def _get_mean_standard_error(self, df: pd.DataFrame) -> tuple[float, float]: | |||
Variance reduction is used if covariates are given. | |||
""" | |||
|
|||
if (self._get_num_clusters(df) < 1000).any(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could the 1000 be a class variable?
if (self._get_num_clusters(df) < 1000).any(): | ||
self.__warn_small_group_size() | ||
|
||
if self.covariates: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why are covariates relevant to aggregation?
Check if there are enough clusters to run the analysis. | ||
""" | ||
return df.groupby(self.treatment_col).apply( | ||
lambda x: self._get_cluster_column(x).nunique() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we groupby treatment col?
@@ -324,8 +325,124 @@ def test_stats_delta_vs_ols(analysis_ratio_df, experiment_dates): | |||
SE_delta = analyser_delta.get_standard_error(df) | |||
|
|||
assert point_estimate_delta == pytest.approx( | |||
point_estimate_ols, rel=1e-2 | |||
point_estimate_ols, rel=1e-1 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add a test where the Deltamethod is passed to an analysisPlan?
CUPED is used with the Delta Method.
This is intended to be used directly with the Power Analysis so the user is expected to use CUPAC. This simplifies a bit the usage although it is a slight approximation but should work for our implementation.
Otherwise we have to change how the interface works so we can pass ratio_metrics and normal metrics, which might imply a ton of refactoring and changing how users usually interact with the library.
Delta Method Isolated would be doing the Delta Method Properly with CUPAC, were we don't predict the ratio but both numerator and denominator and then apply the Delta Method on that to consider it when using CUPED. As can be seen predicting the ratio is a good approximate and tends to overestimate variance.
Recall that we apply Delta Method as otherwise we underestimate variance. I prefer having a very small type I error rather than changing how users interact.
Please feel free to check it and we can discuss implementation and align on the best way going forward!