Bootstrap for A/B testing #31
I think this should be possible with the `multi='independent'` option that's in master and an appropriate statistic function. Edit: I now realize that, while this will work, it won't work with Numba. So in that case, yes, I can see how this might be useful, and I'll have to think about how to implement it.
@cgevans If I'm not mistaken, a new statistic function alone is not enough. According to Efron and Tibshirani's suggested algorithm, the datasets should be preprocessed first. Say we have two data sets z and y. We should create two new data sets z' and y', where z'_i = z_i - mean(z) + mean(x) and y'_i = y_i - mean(y) + mean(x), and mean(x) is the mean of the combined sample x = (z, y). Sampling with replacement should then be drawn from z' and y'. IMO it is possible to implement hypothesis testing using Numba. Unfortunately, similarly to
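To make that preprocessing concrete, here is a minimal NumPy sketch of the mean shift; the `z` and `y` arrays are made-up treatment/control samples, purely for illustration:

```python
import numpy as np

# Hypothetical treatment and control samples (made-up numbers for illustration)
z = np.array([94.0, 38.0, 23.0, 197.0, 99.0, 16.0, 141.0])
y = np.array([52.0, 10.0, 40.0, 104.0, 51.0, 27.0, 146.0, 30.0, 46.0])

x_mean = np.concatenate((z, y)).mean()  # mean of the combined sample

# Shift each sample so that both share the combined mean:
# this enforces the null hypothesis of equal means before resampling.
z_shifted = z - z.mean() + x_mean
y_shifted = y - y.mean() + x_mean

assert np.isclose(z_shifted.mean(), x_mean)
assert np.isclose(y_shifted.mean(), x_mean)
```

After the shift, bootstrap resamples drawn from `z_shifted` and `y_shifted` simulate the world in which the null hypothesis holds, which is what the achieved significance level is measured against.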
Here is a quick prototype that implements distribution comparison (see Algorithm 16.1 in Efron's book). I've tested it against the Mann-Whitney rank test:

```python
from typing import Tuple

import numba
import numpy as np


@numba.njit(parallel=True, fastmath=True, nogil=True)
def compare_dist(z: np.ndarray, y: np.ndarray, n_samples: int = 10_000) -> Tuple[np.ndarray, float]:
    n = z.shape[0]
    m = y.shape[0]
    x = np.concatenate((z, y))  # pooled sample
    t_obs = z.mean() - y.mean()  # observed statistic
    t_ = np.zeros(n_samples)
    for i in numba.prange(n_samples):
        # Resample n + m values with replacement from the pooled sample.
        x_ = np.random.choice(x, n + m)
        # The first n values play the role of z, the remaining m the role of y.
        t_[i] = x_[:n].mean() - x_[n:].mean()
    # Achieved significance level: fraction of bootstrap statistics >= observed.
    return t_, float(np.sum(np.greater_equal(t_, t_obs)) / n_samples)
```

Testing script: https://gist.github.com/dbalabka/439a76cd182338f64ad4269d77a4f693
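For readers without Numba, the same algorithm can be sketched in plain NumPy by drawing all resamples at once; the name `compare_dist_np` is mine, not part of any library:

```python
import numpy as np

def compare_dist_np(z: np.ndarray, y: np.ndarray, n_samples: int = 10_000, seed: int = 0):
    """NumPy-only sketch of Algorithm 16.1: bootstrap from the pooled sample."""
    rng = np.random.default_rng(seed)
    n, m = z.shape[0], y.shape[0]
    x = np.concatenate((z, y))
    t_obs = z.mean() - y.mean()
    # Each row is one bootstrap resample of size n + m from the pooled sample.
    x_ = rng.choice(x, size=(n_samples, n + m), replace=True)
    # First n columns stand in for z, the remaining m columns for y.
    t = x_[:, :n].mean(axis=1) - x_[:, n:].mean(axis=1)
    # One-sided achieved significance level.
    return t, float(np.mean(t >= t_obs))

rng = np.random.default_rng(42)
z = rng.normal(1.0, 1.0, 100)  # sample shifted by +1
y = rng.normal(0.0, 1.0, 100)
t, p = compare_dist_np(z, y)  # p should be very small for clearly separated means
```

The vectorised resampling trades memory (an `n_samples x (n + m)` matrix) for speed, which is usually a good deal at these sizes.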
Here is a prototype for mean-difference hypothesis testing (see Algorithm 16.2 in Efron's book). Tested against the t-test:

```python
from math import sqrt
from typing import Tuple

import numba
import numpy as np


@numba.njit(parallel=True, fastmath=True, nogil=True)
def compare_mean(z: np.ndarray, y: np.ndarray, n_samples: int = 10_000) -> Tuple[np.ndarray, float, float]:
    t_obs = _calculate_mean_diff_statistics(z, y)
    z_mean = z.mean()
    y_mean = y.mean()
    x_mean = np.concatenate((z, y)).mean()
    # Shift both samples to share the combined mean, enforcing the null hypothesis.
    z_ = np.add(np.subtract(z, z_mean), x_mean)
    y_ = np.add(np.subtract(y, y_mean), x_mean)
    t = np.zeros(n_samples)
    for i in numba.prange(n_samples):
        zz_ = np.random.choice(z_, z_.shape[0])
        yy_ = np.random.choice(y_, y_.shape[0])
        t[i] = _calculate_mean_diff_statistics(zz_, yy_)
    return t, t_obs, float(np.sum(np.greater_equal(t, t_obs)) / n_samples)


@numba.njit(parallel=True, fastmath=True, nogil=True)
def _calculate_mean_diff_statistics(z_: np.ndarray, y_: np.ndarray) -> float:
    # Welch-style t statistic: mean difference over its estimated standard error.
    z_mean_ = z_.mean()
    y_mean_ = y_.mean()
    return (z_mean_ - y_mean_) / sqrt(
        (np.sum(np.power(np.subtract(z_, z_mean_), 2)) / (z_.shape[0] - 1)) / z_.shape[0] +
        (np.sum(np.power(np.subtract(y_, y_mean_), 2)) / (y_.shape[0] - 1)) / y_.shape[0]
    )
```

Testing script: https://gist.github.com/dbalabka/706dd9dcb5fd7e97a0136321c7b87364
It would be very helpful to have a bootstrap method that calculates the difference between two distributions' means:
https://github.com/facebookarchive/bootstrapped/blob/master/bootstrapped/bootstrap.py#L360
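For reference, what the linked function computes is a percentile-bootstrap confidence interval for the difference in means. A minimal NumPy sketch of that idea (the function name and signature here are mine, not the library's API):

```python
import numpy as np

def bootstrap_mean_diff(z: np.ndarray, y: np.ndarray,
                        n_samples: int = 10_000, alpha: float = 0.05, seed: int = 0):
    """Percentile-bootstrap point estimate and CI for mean(z) - mean(y)."""
    rng = np.random.default_rng(seed)
    # Resample each group independently and recompute the mean difference.
    z_ = rng.choice(z, size=(n_samples, z.shape[0]), replace=True)
    y_ = rng.choice(y, size=(n_samples, y.shape[0]), replace=True)
    diffs = z_.mean(axis=1) - y_.mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return z.mean() - y.mean(), (lo, hi)

rng = np.random.default_rng(1)
z = rng.normal(1.0, 1.0, 200)  # hypothetical variant with a real +1 effect
y = rng.normal(0.0, 1.0, 200)
diff, (lo, hi) = bootstrap_mean_diff(z, y)
# If the interval excludes 0, the difference is significant at level alpha.
```

Unlike the hypothesis tests above, this gives an effect-size estimate with uncertainty, which is often what A/B test stakeholders actually want.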