From 990ebfd98b7fc9b2ce84fdcaf7b2413475724a73 Mon Sep 17 00:00:00 2001
From: Mwangi Wambugu <mwangiwambugu@gmail.com>
Date: Thu, 2 Nov 2023 16:12:37 +0300
Subject: [PATCH] AB testing

---
 index.ipynb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/index.ipynb b/index.ipynb
index 21737c9..b52883e 100644
--- a/index.ipynb
+++ b/index.ipynb
@@ -1 +1 @@
-{"cells": [{"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["# Website A/B Testing - Lab\n", "\n", "## Introduction\n", "\n", "In this lab, you'll get another chance to practice your skills at conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The scenario you'll be investigating is data collected from the homepage of a music app page for audacity.\n", "\n", "## Objectives\n", "\n", "You will be able to:\n", "* Analyze the data from a website A/B test to draw relevant conclusions\n", "* Explore and analyze web action data"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["## Exploratory Analysis\n", "\n", "Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data."]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["> Hints:\n", "    * Start investigating the id column:\n", "        * How many viewers also clicked?\n", "        * Are there any anomalies with the data; did anyone click who didn't view?\n", "        * Is there any overlap between the control and experiment groups? 
\n", "            * If so, how do you plan to account for this in your experimental design?"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#Your code here"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["## Conduct a Statistical Test\n", "\n", "Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#Your code here"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["## Verifying Results\n", "\n", "One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. \n", "\n", "The variance for the number of successes in a sample of a binomial variable with n observations is given by:\n", "\n", "## $n\\bullet p (1-p)$\n", "\n", "Given this, perform 3 steps to verify the results of your statistical test:\n", "1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. \n", "2. Calculate the number of standard deviations that the actual number of clicks was from this estimate. \n", "3. Finally, calculate a p-value using the normal distribution based on this z-score."]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### Step 1:\n", "Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. 
"]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#Your code here"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### Step 2:\n", "Calculate the number of standard deviations that the actual number of clicks was from this estimate."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#Your code here"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### Step 3: \n", "Finally, calculate a p-value using the normal distribution based on this z-score."]}, {"cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": ["#Your code here"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["### Analysis:\n", "\n", "Does this result roughly match that of the previous statistical test?\n", "\n", "> Comment: **Your analysis here**"]}, {"attachments": {}, "cell_type": "markdown", "metadata": {}, "source": ["## Summary\n", "\n", "In this lab, you continued to get more practice designing and conducting AB tests. This required additional work preprocessing and formulating the initial problem in a suitable manner. Additionally, you also saw how to verify results, strengthening your knowledge of binomial variables, and reviewing initial statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values."]}], "metadata": {"kernelspec": {"display_name": "Python 3", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.6"}}, "nbformat": 4, "nbformat_minor": 2}
\ No newline at end of file
+{"cells":[{"attachments":{},"cell_type":"markdown","metadata":{},"source":["# Website A/B Testing - Lab\n","\n","## Introduction\n","\n","In this lab, you'll get another chance to practice your skills at conducting a full A/B test analysis. It will also be a chance to practice your data exploration and processing skills! The scenario you'll be investigating is data collected from the homepage of a music app page for audacity.\n","\n","## Objectives\n","\n","You will be able to:\n","* Analyze the data from a website A/B test to draw relevant conclusions\n","* Explore and analyze web action data"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## Exploratory Analysis\n","\n","Start by loading in the dataset stored in the file 'homepage_actions.csv'. Then conduct an exploratory analysis to get familiar with the data."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["> Hints:\n","    * Start investigating the id column:\n","        * How many viewers also clicked?\n","        * Are there any anomalies with the data; did anyone click who didn't view?\n","        * Is there any overlap between the control and experiment groups? 
\n","            * If so, how do you plan to account for this in your experimental design?"]},{"cell_type":"code","execution_count":9,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["8188\n"]},{"data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>id</th>\n","      <th>group</th>\n","      <th>action</th>\n","    </tr>\n","    <tr>\n","      <th>timestamp</th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>2016-09-24 17:42:27.839496</th>\n","      <td>804196</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:19:03.542569</th>\n","      <td>434745</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:36:00.944135</th>\n","      <td>507599</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:59:02.646620</th>\n","      <td>671993</td>\n","      <td>control</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 20:26:14.466886</th>\n","      <td>536734</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>...</th>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 09:11:41.984113</th>\n","      <td>192060</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 09:42:12.844575</th>\n","      
<td>755912</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:01:09.026482</th>\n","      <td>458115</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:08:51.588469</th>\n","      <td>505451</td>\n","      <td>control</td>\n","      <td>view</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:24:08.629327</th>\n","      <td>461199</td>\n","      <td>control</td>\n","      <td>view</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>8188 rows × 3 columns</p>\n","</div>"],"text/plain":["                                id       group action\n","timestamp                                            \n","2016-09-24 17:42:27.839496  804196  experiment   view\n","2016-09-24 19:19:03.542569  434745  experiment   view\n","2016-09-24 19:36:00.944135  507599  experiment   view\n","2016-09-24 19:59:02.646620  671993     control   view\n","2016-09-24 20:26:14.466886  536734  experiment   view\n","...                            ...         ...    
...\n","2017-01-18 09:11:41.984113  192060  experiment   view\n","2017-01-18 09:42:12.844575  755912  experiment   view\n","2017-01-18 10:01:09.026482  458115  experiment   view\n","2017-01-18 10:08:51.588469  505451     control   view\n","2017-01-18 10:24:08.629327  461199     control   view\n","\n","[8188 rows x 3 columns]"]},"execution_count":9,"metadata":{},"output_type":"execute_result"}],"source":["#Your code here\n","import numpy as np\n","import matplotlib.pyplot as plt\n","import seaborn as sns\n","sns.set_style('darkgrid')\n","%matplotlib inline\n","import pandas as pd\n","df = pd.read_csv('homepage_actions.csv', index_col = 0)\n","print(len(df))\n","df"]},{"cell_type":"code","execution_count":10,"metadata":{},"outputs":[{"data":{"text/plain":["view     6328\n","click    1860\n","Name: action, dtype: int64"]},"execution_count":10,"metadata":{},"output_type":"execute_result"}],"source":["df.action.value_counts()"]},{"cell_type":"code","execution_count":18,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Number of viewers: 6328 \tNumber of clickers: 1860\n","Number of Viewers who didn't click: 4468\n","Number of Clickers who didn't view: 0\n"]}],"source":["cids = set(df[df.action=='click']['id'].unique())\n","vids = set(df[df.action=='view']['id'].unique())\n","print(\"Number of viewers: {} \\tNumber of clickers: {}\".format(len(vids), len(cids)))\n","print(\"Number of Viewers who didn't click: {}\".format(len(vids-cids)))\n","print(\"Number of Clickers who didn't view: {}\".format(len(cids-vids)))\n"]},{"cell_type":"code","execution_count":19,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Overlap of experiment and control groups: 0\n"]}],"source":["eids = set(df[df.group=='experiment']['id'].unique())\n","cids = set(df[df.group=='control']['id'].unique())\n","print('Overlap of experiment and control groups: {}'.format(len(eids&cids)))"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## 
Conduct a Statistical Test\n","\n","Conduct a statistical test to determine whether the experimental homepage was more effective than that of the control group."]},{"cell_type":"code","execution_count":20,"metadata":{},"outputs":[{"data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th></th>\n","      <th>id</th>\n","      <th>group</th>\n","      <th>action</th>\n","      <th>count</th>\n","    </tr>\n","    <tr>\n","      <th>timestamp</th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>2016-09-24 17:42:27.839496</th>\n","      <td>804196</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:19:03.542569</th>\n","      <td>434745</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:36:00.944135</th>\n","      <td>507599</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 19:59:02.646620</th>\n","      <td>671993</td>\n","      <td>control</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2016-09-24 20:26:14.466886</th>\n","      <td>536734</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>...</th>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","      <td>...</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 
09:11:41.984113</th>\n","      <td>192060</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 09:42:12.844575</th>\n","      <td>755912</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:01:09.026482</th>\n","      <td>458115</td>\n","      <td>experiment</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:08:51.588469</th>\n","      <td>505451</td>\n","      <td>control</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","    <tr>\n","      <th>2017-01-18 10:24:08.629327</th>\n","      <td>461199</td>\n","      <td>control</td>\n","      <td>view</td>\n","      <td>1</td>\n","    </tr>\n","  </tbody>\n","</table>\n","<p>8188 rows × 4 columns</p>\n","</div>"],"text/plain":["                                id       group action  count\n","timestamp                                                   \n","2016-09-24 17:42:27.839496  804196  experiment   view      1\n","2016-09-24 19:19:03.542569  434745  experiment   view      1\n","2016-09-24 19:36:00.944135  507599  experiment   view      1\n","2016-09-24 19:59:02.646620  671993     control   view      1\n","2016-09-24 20:26:14.466886  536734  experiment   view      1\n","...                            ...         ...    ...    
...\n","2017-01-18 09:11:41.984113  192060  experiment   view      1\n","2017-01-18 09:42:12.844575  755912  experiment   view      1\n","2017-01-18 10:01:09.026482  458115  experiment   view      1\n","2017-01-18 10:08:51.588469  505451     control   view      1\n","2017-01-18 10:24:08.629327  461199     control   view      1\n","\n","[8188 rows x 4 columns]"]},"execution_count":20,"metadata":{},"output_type":"execute_result"}],"source":["#Your code here\n","df[\"count\"]= 1\n","df"]},{"cell_type":"code","execution_count":21,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["Sample sizes:\tControl: 3332\tExperiment: 2996\n","Total Clicks:\tControl: 932.0\tExperiment: 928.0\n","Average click rate:\tControl: 0.2797118847539016\tExperiment: 0.3097463284379172\n"]},{"data":{"text/html":["<div>\n","<style scoped>\n","    .dataframe tbody tr th:only-of-type {\n","        vertical-align: middle;\n","    }\n","\n","    .dataframe tbody tr th {\n","        vertical-align: top;\n","    }\n","\n","    .dataframe thead th {\n","        text-align: right;\n","    }\n","</style>\n","<table border=\"1\" class=\"dataframe\">\n","  <thead>\n","    <tr style=\"text-align: right;\">\n","      <th>action</th>\n","      <th>click</th>\n","      <th>view</th>\n","    </tr>\n","    <tr>\n","      <th>id</th>\n","      <th></th>\n","      <th></th>\n","    </tr>\n","  </thead>\n","  <tbody>\n","    <tr>\n","      <th>182994</th>\n","      <td>1.0</td>\n","      <td>1.0</td>\n","    </tr>\n","    <tr>\n","      <th>183089</th>\n","      <td>0.0</td>\n","      <td>1.0</td>\n","    </tr>\n","    <tr>\n","      <th>183248</th>\n","      <td>1.0</td>\n","      <td>1.0</td>\n","    </tr>\n","    <tr>\n","      <th>183515</th>\n","      <td>0.0</td>\n","      <td>1.0</td>\n","    </tr>\n","    <tr>\n","      <th>183524</th>\n","      <td>0.0</td>\n","      <td>1.0</td>\n","    </tr>\n","  </tbody>\n","</table>\n","</div>"],"text/plain":["action  click  view\n","id         
        \n","182994    1.0   1.0\n","183089    0.0   1.0\n","183248    1.0   1.0\n","183515    0.0   1.0\n","183524    0.0   1.0"]},"execution_count":21,"metadata":{},"output_type":"execute_result"}],"source":["#Convert clicks into a binary variable on a user-by-user-basis\n","control = df[df.group=='control'].pivot(index='id', columns='action', values='count')\n","control = control.fillna(value=0)\n","\n","experiment = df[df.group=='experiment'].pivot(index='id', columns='action', values='count')\n","experiment = experiment.fillna(value=0)\n","\n","\n","\n","print(\"Sample sizes:\\tControl: {}\\tExperiment: {}\".format(len(control), len(experiment)))\n","print(\"Total Clicks:\\tControl: {}\\tExperiment: {}\".format(control.click.sum(), experiment.click.sum()))\n","print(\"Average click rate:\\tControl: {}\\tExperiment: {}\".format(control.click.mean(), experiment.click.mean()))\n","control.head()"]},{"cell_type":"code","execution_count":22,"metadata":{},"outputs":[],"source":["import scipy.stats as stats\n","\n","def welch_t(a, b):\n","    \n","    \"\"\" Calculate Welch's t statistic for two samples. \"\"\"\n","\n","    numerator = a.mean() - b.mean()\n","    \n","    # “ddof = Delta Degrees of Freedom”: the divisor used in the calculation is N - ddof, \n","    #  where N represents the number of elements. By default ddof is zero.\n","    \n","    denominator = np.sqrt(a.var(ddof=1)/a.size + b.var(ddof=1)/b.size)\n","    \n","    return np.abs(numerator/denominator)\n","\n","def welch_df(a, b):\n","    \n","    \"\"\" Calculate the effective degrees of freedom for two samples. 
This function returns the degrees of freedom \"\"\"\n","    \n","    s1 = a.var(ddof=1) \n","    s2 = b.var(ddof=1)\n","    n1 = a.size\n","    n2 = b.size\n","    \n","    numerator = (s1/n1 + s2/n2)**2\n","    denominator = (s1/ n1)**2/(n1 - 1) + (s2/ n2)**2/(n2 - 1)\n","    \n","    return numerator/denominator"]},{"cell_type":"code","execution_count":23,"metadata":{},"outputs":[],"source":["\n","def p_value_welch_ttest(a, b, two_sided=False):\n","    \"\"\"Calculates the p-value for Welch's t-test given two samples.\n","    By default, the returned p-value is for a one-sided t-test. \n","    Set the two-sided parameter to True if you wish to perform a two-sided t-test instead.\n","    \"\"\"\n","    t = welch_t(a, b)\n","    df = welch_df(a, b)\n","    \n","    p = 1-stats.t.cdf(np.abs(t), df)\n","    \n","    if two_sided:\n","        return 2*p\n","    else:\n","        return p"]},{"cell_type":"code","execution_count":25,"metadata":{},"outputs":[{"data":{"text/plain":["0.004466402814337078"]},"execution_count":25,"metadata":{},"output_type":"execute_result"}],"source":["#Your code here\n","p_value_welch_ttest(control.click, experiment.click)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## Verifying Results\n","\n","One sensible formulation of the data to answer the hypothesis test above would be to create a binary variable representing each individual in the experiment and control group. This binary variable would represent whether or not that individual clicked on the homepage; 1 for they did and 0 if they did not. \n","\n","The variance for the number of successes in a sample of a binomial variable with n observations is given by:\n","\n","## $n\\bullet p (1-p)$\n","\n","Given this, perform 3 steps to verify the results of your statistical test:\n","1. Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. \n","2. 
Calculate the number of standard deviations that the actual number of clicks was from this estimate. \n","3. Finally, calculate a p-value using the normal distribution based on this z-score."]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Step 1:\n","Calculate the expected number of clicks for the experiment group, if it had the same click-through rate as that of the control group. "]},{"cell_type":"code","execution_count":26,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["838.0168067226891\n"]}],"source":["control_rate = control.click.mean()\n","expected_experiment_clicks_under_null = control_rate * len(experiment)\n","print(expected_experiment_clicks_under_null)\n"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Step 2:\n","Calculate the number of standard deviations that the actual number of clicks was from this estimate."]},{"cell_type":"code","execution_count":28,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["24.568547907005815\n"]}],"source":["#Your code here\n","n = len(experiment)\n","p = control_rate\n","var = n * p * (1-p)\n","std = np.sqrt(var)\n","print(std)\n"]},{"cell_type":"code","execution_count":29,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["3.6625360854823588\n"]}],"source":["actual_experiment_clicks = experiment.click.sum()\n","z_score = (actual_experiment_clicks - expected_experiment_clicks_under_null)/std\n","print(z_score)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Step 3: \n","Finally, calculate a p-value using the normal distribution based on this z-score."]},{"cell_type":"code","execution_count":30,"metadata":{},"outputs":[{"name":"stdout","output_type":"stream","text":["0.00012486528006951198\n"]}],"source":["#Your code here\n","import scipy.stats as stats\n","p_val = stats.norm.sf(z_score) #or 1 - 
stats.norm.cdf(z_score)\n","print(p_val)"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["### Analysis:\n","\n","Does this result roughly match that of the previous statistical test?\n","\n","> Comment: **Your analysis here**"]},{"attachments":{},"cell_type":"markdown","metadata":{},"source":["## Summary\n","\n","In this lab, you continued to get more practice designing and conducting AB tests. This required additional work preprocessing and formulating the initial problem in a suitable manner. Additionally, you also saw how to verify results, strengthening your knowledge of binomial variables, and reviewing initial statistical concepts of the central limit theorem, standard deviation, z-scores, and their accompanying p-values."]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.5"}},"nbformat":4,"nbformat_minor":2}