kaggler-tv · May 8, 2021
diff --git a/_notebooks/2021-04-29-dae-with-2-lines-of-code-with-kaggler.ipynb b/_notebooks/2021-04-29-dae-with-2-lines-of-code-with-kaggler.ipynb
@@ -1,8 +1,13 @@
 {
  "cells": [
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "1"
+    }
+   },
    "source": [
     "# DAE with 2 Lines of Code with Kaggler\n",
     "> A tutorial on Kaggler's new DAE feature transformation\n",
@@ -14,8 +19,60 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "2"
+    }
+   },
+   "source": [
+    "# **UPDATE on 5/1/2021**\n",
+    "\n",
+    "Today, [`Kaggler`](https://github.com/jeongyoonlee/Kaggler) v0.9.4 was released with the following additional DAE features:\n",
+    "* In addition to swap noise (`swap_prob`), Gaussian noise (`noise_std`) and zero masking (`mask_prob`) have been added to DAE to reduce overfitting.\n",
+    "* Stacked DAE is available through the `n_layer` input argument (see Figure 3 in [Vincent et al. (2010), \"Stacked Denoising Autoencoders\"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf) for reference).\n",
+    "\n",
+    "For example, to build a stacked DAE with 3 encoder/decoder pairs and all three types of noise:\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "from kaggler.preprocessing import DAE\n",
+    "\n",
+    "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)\n",
+    "X = dae.fit_transform(pd.concat([trn, tst], axis=0))\n",
+    "```\n",
+    "\n",
+    "If you're using an earlier version, upgrade `Kaggler` with `pip install -U kaggler`.\n",
+    "\n",
+    "---\n",
+    "\n",
+    "Today I released a new version (v0.9.0) of the `Kaggler` package that adds a Denoising AutoEncoder (DAE) with swap noise.\n",
+    "\n",
+    "Now you can train a DAE with only 2 lines of code as follows:\n",
+    "\n",
+    "```python\n",
+    "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
+    "X = dae.fit_transform(df[feature_cols])\n",
+    "```\n",
+    "\n",
+    "In addition to the new DAE feature encoder, `Kaggler` supports many of the feature transformations commonly used on Kaggle, including:\n",
+    "* `TargetEncoder`: uses smoothing and cross-validation to avoid overfitting\n",
+    "* `FrequencyEncoder`\n",
+    "* `LabelEncoder`: imputes missing values and groups rare categories\n",
+    "* `OneHotEncoder`: imputes missing values and groups rare categories\n",
+    "* `EmbeddingEncoder`: transforms categorical features into embeddings\n",
+    "* `QuantileEncoder`: transforms numerical features into quantiles\n",
+    "\n",
+    "In the notebook below, I will show how to use `Kaggler`'s `LabelEncoder`, `TargetEncoder`, and `DAE` for feature engineering, then use `Kaggler`'s `AutoLGB` to do feature selection and hyperparameter optimization."
+   ]
+  },
+  {
+   "attachments": {},
+   "cell_type": "markdown",
+   "metadata": {
+    "nterop": {
+     "id": "29"
+    }
+   },
    "source": [
     "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler) at Kaggle.\n",
     "\n",
@@ -42,8 +99,13 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "3"
+    }
+   },
    "source": [
     "# Part 1: Data Loading & Feature Engineering"
    ]
@@ -52,7 +114,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_kg_hide-input": true
+    "_kg_hide-input": true,
+    "nterop": {
+     "id": "4"
+    }
    },
    "outputs": [],
    "source": [
@@ -72,7 +137,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_kg_hide-output": true
+    "_kg_hide-output": true,
+    "nterop": {
+     "id": "5"
+    }
    },
    "outputs": [],
    "source": [
@@ -82,7 +150,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "6"
+    }
+   },
    "outputs": [],
    "source": [
     "import kaggler\n",
@@ -96,7 +168,10 @@
    "cell_type": "code",
    "execution_count": null,
    "metadata": {
-    "_kg_hide-input": true
+    "_kg_hide-input": true,
+    "nterop": {
+     "id": "7"
+    }
    },
    "outputs": [],
    "source": [
@@ -107,7 +182,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "8"
+    }
+   },
    "outputs": [],
    "source": [
     "feature_name = 'dae'\n",
@@ -132,7 +211,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "9"
+    }
+   },
    "outputs": [],
    "source": [
     "n_fold = 5\n",
@@ -143,7 +226,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "10"
+    }
+   },
    "outputs": [],
    "source": [
     "trn = pd.read_csv(trn_file, index_col=id_col)\n",
@@ -156,7 +243,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "11"
+    }
+   },
    "outputs": [],
    "source": [
     "tst[target_col] = pseudo_label[target_col]\n",
@@ -168,7 +259,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "12"
+    }
+   },
    "outputs": [],
    "source": [
     "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n",
@@ -212,7 +307,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "13"
+    }
+   },
    "outputs": [],
    "source": [
     "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n",
@@ -223,33 +322,51 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "14"
+    }
+   },
    "source": [
     "## Label encoding with rare category grouping and missing value imputation"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "15"
+    }
+   },
    "outputs": [],
    "source": [
     "lbe = LabelEncoder(min_obs=50)\n",
     "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "16"
+    }
+   },
    "source": [
     "## Target encoding with smoothing and 5-fold cross-validation"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "17"
+    }
+   },
    "outputs": [],
    "source": [
     "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n",
@@ -260,16 +377,25 @@
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "18"
+    }
+   },
    "source": [
     "## DAE"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "19"
+    }
+   },
    "outputs": [],
    "source": [
     "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
@@ -279,31 +405,49 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "20"
+    }
+   },
    "outputs": [],
    "source": [
     "df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)])\n",
     "print(df_dae.shape)"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "21"
+    }
+   },
    "source": [
     "# Part 2: Model Training"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "22"
+    }
+   },
    "source": [
     "## AutoLGB for Feature Selection and Hyperparameter Optimization"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "23"
+    }
+   },
    "outputs": [],
    "source": [
     "X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)\n",
@@ -335,24 +479,37 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "24"
+    }
+   },
    "outputs": [],
    "source": [
     "print(f'  CV AUC: {roc_auc_score(y, p):.6f}')\n",
     "print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst)}')"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "25"
+    }
+   },
    "source": [
     "## Submission"
    ]
   },
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "26"
+    }
+   },
    "outputs": [],
    "source": [
     "n_pos = int(0.34911 * tst.shape[0])\n",
@@ -364,16 +521,25 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "27"
+    }
+   },
    "outputs": [],
    "source": [
     "sub[target_col] = (p_tst > th).astype(int)\n",
     "sub.to_csv(submission_file)"
    ]
   },
   {
+   "attachments": {},
    "cell_type": "markdown",
-   "metadata": {},
+   "metadata": {
+    "nterop": {
+     "id": "28"
+    }
+   },
    "source": [
     "If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!\n",
     "\n",
@@ -385,6 +551,7 @@
   }
  ],
  "metadata": {
+  "hide_input": false,
   "kernelspec": {
    "display_name": "Python 3",
    "language": "python",
@@ -400,7 +567,10 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.8.5"
+   "version": "3.7.10"
+  },
+  "nterop": {
+   "seedId": "29"
   },
   "toc": {
    "base_numbering": 1,