|
1 | 1 | {
|
2 | 2 | "cells": [
|
3 | 3 | {
|
| 4 | + "attachments": {}, |
4 | 5 | "cell_type": "markdown",
|
5 |
| - "metadata": {}, |
| 6 | + "metadata": { |
| 7 | + "nterop": { |
| 8 | + "id": "1" |
| 9 | + } |
| 10 | + }, |
6 | 11 | "source": [
|
7 | 12 | "# DAE with 2 Lines of Code with Kaggler\n",
|
8 | 13 | "> A tutorial on Kaggler's new DAE feature transformation\n",
|
|
14 | 19 | ]
|
15 | 20 | },
|
16 | 21 | {
|
| 22 | + "attachments": {}, |
17 | 23 | "cell_type": "markdown",
|
18 |
| - "metadata": {}, |
| 24 | + "metadata": { |
| 25 | + "nterop": { |
| 26 | + "id": "2" |
| 27 | + } |
| 28 | + }, |
| 29 | + "source": [ |
| 30 | + "# **UPDATE on 5/1/2021**\n", |
| 31 | + "\n", |
| 32 | + "Today, [`Kaggler`](https://github.com/jeongyoonlee/Kaggler) v0.9.4 is released with additional features for DAE as follows:\n", |
| 33 | + "* In addition to swap noise (`swap_prob`), Gaussian noise (`noise_std`) and zero masking (`mask_prob`) have been added to DAE to reduce overfitting.\n", |
| 34 | + "* Stacked DAE is available through the `n_layer` input argument (see Figure 3. in [Vincent et al. (2010), \"Stacked Denoising Autoencoders\"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf) for reference).\n", |
| 35 | + "\n", |
| 36 | + "For example, to build a stacked DAE with 3 encoder/decoder pairs and all three types of noise, you can do:\n", |
| 37 | + "```python\n", |
| 38 | + "from kaggler.preprocessing import DAE\n", |
| 39 | + "\n", |
| 40 | + "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)\n", |
| 41 | + "X = dae.fit_transform(pd.concat([trn, tst], axis=0))\n", |
| 42 | + "```\n", |
| 43 | + "\n", |
| 44 | + "If you're using previous versions, please upgrade `Kaggler` using `pip install -U kaggler`.\n", |
| 45 | + "\n", |
| 46 | + "---\n", |
| 47 | + "\n", |
| 48 | + "Today I released a new version (v0.9.0) of the `Kaggler` package, which adds a Denoising AutoEncoder (DAE) with swap noise.\n", |
| 49 | + "\n", |
| 50 | + "Now you can train a DAE with only 2 lines of code as follows:\n", |
| 51 | + "\n", |
| 52 | + "```python\n", |
| 53 | + "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n", |
| 54 | + "X = dae.fit_transform(df[feature_cols])\n", |
| 55 | + "```\n", |
| 56 | + "\n", |
| 57 | + "In addition to the new DAE feature encoder, `Kaggler` supports many of the feature transformations used in Kaggle competitions, including:\n", |
| 58 | + "* `TargetEncoder`: with smoothing and cross-validation to avoid overfitting\n", |
| 59 | + "* `FrequencyEncoder`\n", |
| 60 | + "* `LabelEncoder`: that imputes missing values and groups rare categories\n", |
| 61 | + "* `OneHotEncoder`: that imputes missing values and groups rare categories\n", |
| 62 | + "* `EmbeddingEncoder`: that transforms categorical features into embeddings\n", |
| 63 | + "* `QuantileEncoder`: that transforms numerical features into quantiles\n", |
| 64 | + "\n", |
| 65 | + "In the notebook below, I will show how to use `Kaggler`'s `LabelEncoder`, `TargetEncoder`, and `DAE` for feature engineering, then use `Kaggler`'s `AutoLGB` to do feature selection and hyperparameter optimization." |
| 66 | + ] |
| 67 | + }, |
| 68 | + { |
| 69 | + "attachments": {}, |
| 70 | + "cell_type": "markdown", |
| 71 | + "metadata": { |
| 72 | + "nterop": { |
| 73 | + "id": "29" |
| 74 | + } |
| 75 | + }, |
19 | 76 | "source": [
|
20 | 77 | "This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler) at Kaggle.\n",
|
21 | 78 | "\n",
|
|
42 | 99 | ]
|
43 | 100 | },
|
44 | 101 | {
|
| 102 | + "attachments": {}, |
45 | 103 | "cell_type": "markdown",
|
46 |
| - "metadata": {}, |
| 104 | + "metadata": { |
| 105 | + "nterop": { |
| 106 | + "id": "3" |
| 107 | + } |
| 108 | + }, |
47 | 109 | "source": [
|
48 | 110 | "# Part 1: Data Loading & Feature Engineering"
|
49 | 111 | ]
|
|
52 | 114 | "cell_type": "code",
|
53 | 115 | "execution_count": null,
|
54 | 116 | "metadata": {
|
55 |
| - "_kg_hide-input": true |
| 117 | + "_kg_hide-input": true, |
| 118 | + "nterop": { |
| 119 | + "id": "4" |
| 120 | + } |
56 | 121 | },
|
57 | 122 | "outputs": [],
|
58 | 123 | "source": [
|
|
72 | 137 | "cell_type": "code",
|
73 | 138 | "execution_count": null,
|
74 | 139 | "metadata": {
|
75 |
| - "_kg_hide-output": true |
| 140 | + "_kg_hide-output": true, |
| 141 | + "nterop": { |
| 142 | + "id": "5" |
| 143 | + } |
76 | 144 | },
|
77 | 145 | "outputs": [],
|
78 | 146 | "source": [
|
|
82 | 150 | {
|
83 | 151 | "cell_type": "code",
|
84 | 152 | "execution_count": null,
|
85 |
| - "metadata": {}, |
| 153 | + "metadata": { |
| 154 | + "nterop": { |
| 155 | + "id": "6" |
| 156 | + } |
| 157 | + }, |
86 | 158 | "outputs": [],
|
87 | 159 | "source": [
|
88 | 160 | "import kaggler\n",
|
|
96 | 168 | "cell_type": "code",
|
97 | 169 | "execution_count": null,
|
98 | 170 | "metadata": {
|
99 |
| - "_kg_hide-input": true |
| 171 | + "_kg_hide-input": true, |
| 172 | + "nterop": { |
| 173 | + "id": "7" |
| 174 | + } |
100 | 175 | },
|
101 | 176 | "outputs": [],
|
102 | 177 | "source": [
|
|
107 | 182 | {
|
108 | 183 | "cell_type": "code",
|
109 | 184 | "execution_count": null,
|
110 |
| - "metadata": {}, |
| 185 | + "metadata": { |
| 186 | + "nterop": { |
| 187 | + "id": "8" |
| 188 | + } |
| 189 | + }, |
111 | 190 | "outputs": [],
|
112 | 191 | "source": [
|
113 | 192 | "feature_name = 'dae'\n",
|
|
132 | 211 | {
|
133 | 212 | "cell_type": "code",
|
134 | 213 | "execution_count": null,
|
135 |
| - "metadata": {}, |
| 214 | + "metadata": { |
| 215 | + "nterop": { |
| 216 | + "id": "9" |
| 217 | + } |
| 218 | + }, |
136 | 219 | "outputs": [],
|
137 | 220 | "source": [
|
138 | 221 | "n_fold = 5\n",
|
|
143 | 226 | {
|
144 | 227 | "cell_type": "code",
|
145 | 228 | "execution_count": null,
|
146 |
| - "metadata": {}, |
| 229 | + "metadata": { |
| 230 | + "nterop": { |
| 231 | + "id": "10" |
| 232 | + } |
| 233 | + }, |
147 | 234 | "outputs": [],
|
148 | 235 | "source": [
|
149 | 236 | "trn = pd.read_csv(trn_file, index_col=id_col)\n",
|
|
156 | 243 | {
|
157 | 244 | "cell_type": "code",
|
158 | 245 | "execution_count": null,
|
159 |
| - "metadata": {}, |
| 246 | + "metadata": { |
| 247 | + "nterop": { |
| 248 | + "id": "11" |
| 249 | + } |
| 250 | + }, |
160 | 251 | "outputs": [],
|
161 | 252 | "source": [
|
162 | 253 | "tst[target_col] = pseudo_label[target_col]\n",
|
|
168 | 259 | {
|
169 | 260 | "cell_type": "code",
|
170 | 261 | "execution_count": null,
|
171 |
| - "metadata": {}, |
| 262 | + "metadata": { |
| 263 | + "nterop": { |
| 264 | + "id": "12" |
| 265 | + } |
| 266 | + }, |
172 | 267 | "outputs": [],
|
173 | 268 | "source": [
|
174 | 269 | "# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n",
|
|
212 | 307 | {
|
213 | 308 | "cell_type": "code",
|
214 | 309 | "execution_count": null,
|
215 |
| - "metadata": {}, |
| 310 | + "metadata": { |
| 311 | + "nterop": { |
| 312 | + "id": "13" |
| 313 | + } |
| 314 | + }, |
216 | 315 | "outputs": [],
|
217 | 316 | "source": [
|
218 | 317 | "for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n",
|
|
223 | 322 | ]
|
224 | 323 | },
|
225 | 324 | {
|
| 325 | + "attachments": {}, |
226 | 326 | "cell_type": "markdown",
|
227 |
| - "metadata": {}, |
| 327 | + "metadata": { |
| 328 | + "nterop": { |
| 329 | + "id": "14" |
| 330 | + } |
| 331 | + }, |
228 | 332 | "source": [
|
229 | 333 | "## Label encoding with rare category grouping and missing value imputation"
|
230 | 334 | ]
|
231 | 335 | },
|
232 | 336 | {
|
233 | 337 | "cell_type": "code",
|
234 | 338 | "execution_count": null,
|
235 |
| - "metadata": {}, |
| 339 | + "metadata": { |
| 340 | + "nterop": { |
| 341 | + "id": "15" |
| 342 | + } |
| 343 | + }, |
236 | 344 | "outputs": [],
|
237 | 345 | "source": [
|
238 | 346 | "lbe = LabelEncoder(min_obs=50)\n",
|
239 | 347 | "df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)"
|
240 | 348 | ]
|
241 | 349 | },
|
242 | 350 | {
|
| 351 | + "attachments": {}, |
243 | 352 | "cell_type": "markdown",
|
244 |
| - "metadata": {}, |
| 353 | + "metadata": { |
| 354 | + "nterop": { |
| 355 | + "id": "16" |
| 356 | + } |
| 357 | + }, |
245 | 358 | "source": [
|
246 | 359 | "## Target encoding with smoothing and 5-fold cross-validation"
|
247 | 360 | ]
|
248 | 361 | },
|
249 | 362 | {
|
250 | 363 | "cell_type": "code",
|
251 | 364 | "execution_count": null,
|
252 |
| - "metadata": {}, |
| 365 | + "metadata": { |
| 366 | + "nterop": { |
| 367 | + "id": "17" |
| 368 | + } |
| 369 | + }, |
253 | 370 | "outputs": [],
|
254 | 371 | "source": [
|
255 | 372 | "cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n",
|
|
260 | 377 | ]
|
261 | 378 | },
|
262 | 379 | {
|
| 380 | + "attachments": {}, |
263 | 381 | "cell_type": "markdown",
|
264 |
| - "metadata": {}, |
| 382 | + "metadata": { |
| 383 | + "nterop": { |
| 384 | + "id": "18" |
| 385 | + } |
| 386 | + }, |
265 | 387 | "source": [
|
266 | 388 | "## DAE"
|
267 | 389 | ]
|
268 | 390 | },
|
269 | 391 | {
|
270 | 392 | "cell_type": "code",
|
271 | 393 | "execution_count": null,
|
272 |
| - "metadata": {}, |
| 394 | + "metadata": { |
| 395 | + "nterop": { |
| 396 | + "id": "19" |
| 397 | + } |
| 398 | + }, |
273 | 399 | "outputs": [],
|
274 | 400 | "source": [
|
275 | 401 | "dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
|
|
279 | 405 | {
|
280 | 406 | "cell_type": "code",
|
281 | 407 | "execution_count": null,
|
282 |
| - "metadata": {}, |
| 408 | + "metadata": { |
| 409 | + "nterop": { |
| 410 | + "id": "20" |
| 411 | + } |
| 412 | + }, |
283 | 413 | "outputs": [],
|
284 | 414 | "source": [
|
285 | 415 | "df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)])\n",
|
286 | 416 | "print(df_dae.shape)"
|
287 | 417 | ]
|
288 | 418 | },
|
289 | 419 | {
|
| 420 | + "attachments": {}, |
290 | 421 | "cell_type": "markdown",
|
291 |
| - "metadata": {}, |
| 422 | + "metadata": { |
| 423 | + "nterop": { |
| 424 | + "id": "21" |
| 425 | + } |
| 426 | + }, |
292 | 427 | "source": [
|
293 | 428 | "# Part 2: Model Training"
|
294 | 429 | ]
|
295 | 430 | },
|
296 | 431 | {
|
| 432 | + "attachments": {}, |
297 | 433 | "cell_type": "markdown",
|
298 |
| - "metadata": {}, |
| 434 | + "metadata": { |
| 435 | + "nterop": { |
| 436 | + "id": "22" |
| 437 | + } |
| 438 | + }, |
299 | 439 | "source": [
|
300 | 440 | "## AutoLGB for Feature Selection and Hyperparameter Optimization"
|
301 | 441 | ]
|
302 | 442 | },
|
303 | 443 | {
|
304 | 444 | "cell_type": "code",
|
305 | 445 | "execution_count": null,
|
306 |
| - "metadata": {}, |
| 446 | + "metadata": { |
| 447 | + "nterop": { |
| 448 | + "id": "23" |
| 449 | + } |
| 450 | + }, |
307 | 451 | "outputs": [],
|
308 | 452 | "source": [
|
309 | 453 | "X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)\n",
|
|
335 | 479 | {
|
336 | 480 | "cell_type": "code",
|
337 | 481 | "execution_count": null,
|
338 |
| - "metadata": {}, |
| 482 | + "metadata": { |
| 483 | + "nterop": { |
| 484 | + "id": "24" |
| 485 | + } |
| 486 | + }, |
339 | 487 | "outputs": [],
|
340 | 488 | "source": [
|
341 | 489 | "print(f' CV AUC: {roc_auc_score(y, p):.6f}')\n",
|
342 | 490 | "print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst)}')"
|
343 | 491 | ]
|
344 | 492 | },
|
345 | 493 | {
|
| 494 | + "attachments": {}, |
346 | 495 | "cell_type": "markdown",
|
347 |
| - "metadata": {}, |
| 496 | + "metadata": { |
| 497 | + "nterop": { |
| 498 | + "id": "25" |
| 499 | + } |
| 500 | + }, |
348 | 501 | "source": [
|
349 | 502 | "## Submission"
|
350 | 503 | ]
|
351 | 504 | },
|
352 | 505 | {
|
353 | 506 | "cell_type": "code",
|
354 | 507 | "execution_count": null,
|
355 |
| - "metadata": {}, |
| 508 | + "metadata": { |
| 509 | + "nterop": { |
| 510 | + "id": "26" |
| 511 | + } |
| 512 | + }, |
356 | 513 | "outputs": [],
|
357 | 514 | "source": [
|
358 | 515 | "n_pos = int(0.34911 * tst.shape[0])\n",
|
|
364 | 521 | {
|
365 | 522 | "cell_type": "code",
|
366 | 523 | "execution_count": null,
|
367 |
| - "metadata": {}, |
| 524 | + "metadata": { |
| 525 | + "nterop": { |
| 526 | + "id": "27" |
| 527 | + } |
| 528 | + }, |
368 | 529 | "outputs": [],
|
369 | 530 | "source": [
|
370 | 531 | "sub[target_col] = (p_tst > th).astype(int)\n",
|
371 | 532 | "sub.to_csv(submission_file)"
|
372 | 533 | ]
|
373 | 534 | },
|
374 | 535 | {
|
| 536 | + "attachments": {}, |
375 | 537 | "cell_type": "markdown",
|
376 |
| - "metadata": {}, |
| 538 | + "metadata": { |
| 539 | + "nterop": { |
| 540 | + "id": "28" |
| 541 | + } |
| 542 | + }, |
377 | 543 | "source": [
|
378 | 544 | "If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!\n",
|
379 | 545 | "\n",
|
|
385 | 551 | }
|
386 | 552 | ],
|
387 | 553 | "metadata": {
|
| 554 | + "hide_input": false, |
388 | 555 | "kernelspec": {
|
389 | 556 | "display_name": "Python 3",
|
390 | 557 | "language": "python",
|
|
400 | 567 | "name": "python",
|
401 | 568 | "nbconvert_exporter": "python",
|
402 | 569 | "pygments_lexer": "ipython3",
|
403 |
| - "version": "3.8.5" |
| 570 | + "version": "3.7.10" |
| 571 | + }, |
| 572 | + "nterop": { |
| 573 | + "seedId": "29" |
404 | 574 | },
|
405 | 575 | "toc": {
|
406 | 576 | "base_numbering": 1,
|
|