
Commit 3c9f2e5

committed May 8, 2021
update dae notebook with info
1 parent 18e31a3 commit 3c9f2e5

1 file changed: +199 -29 lines changed


‎_notebooks/2021-04-29-dae-with-2-lines-of-code-with-kaggler.ipynb

+199-29
Original file line number | Diff line number | Diff line change
@@ -1,8 +1,13 @@
11
{
22
"cells": [
33
{
4+
"attachments": {},
45
"cell_type": "markdown",
5-
"metadata": {},
6+
"metadata": {
7+
"nterop": {
8+
"id": "1"
9+
}
10+
},
611
"source": [
712
"# DAE with 2 Lines of Code with Kaggler\n",
813
"> A tutorial on Kaggler's new DAE feature transformation\n",
@@ -14,8 +19,60 @@
1419
]
1520
},
1621
{
22+
"attachments": {},
1723
"cell_type": "markdown",
18-
"metadata": {},
24+
"metadata": {
25+
"nterop": {
26+
"id": "2"
27+
}
28+
},
29+
"source": [
30+
"# **UPDATE on 5/1/2021**\n",
31+
"\n",
32+
"Today, [`Kaggler`](https://github.com/jeongyoonlee/Kaggler) v0.9.4 is released with additional features for DAE as follows:\n",
33+
"* In addition to swap noise (`swap_prob`), Gaussian noise (`noise_std`) and zero masking (`mask_prob`) have been added to DAE to reduce overfitting.\n",
34+
"* Stacked DAE is available through the `n_layer` input argument (see Figure 3 in [Vincent et al. (2010), \"Stacked Denoising Autoencoders\"](https://www.jmlr.org/papers/volume11/vincent10a/vincent10a.pdf) for reference).\n",
35+
"\n",
36+
"For example, to build a stacked DAE with 3 encoder/decoder pairs and all three types of noise, you can do:\n",
37+
"```python\n",
38+
"from kaggler.preprocessing import DAE\n",
39+
"\n",
40+
"dae = DAE(cat_cols=cat_cols, num_cols=num_cols, n_layer=3, noise_std=.05, swap_prob=.2, mask_prob=.1)\n",
41+
"X = dae.fit_transform(pd.concat([trn, tst], axis=0))\n",
42+
"```\n",
43+
"\n",
44+
"If you're using previous versions, please upgrade `Kaggler` using `pip install -U kaggler`.\n",
45+
"\n",
46+
"---\n",
47+
"\n",
48+
"Today I released a new version (v0.9.0) of the `Kaggler` package, which adds a Denoising AutoEncoder (DAE) with swap noise.\n",
49+
"\n",
50+
"Now you can train a DAE with only 2 lines of code as follows:\n",
51+
"\n",
52+
"```python\n",
53+
"dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
54+
"X = dae.fit_transform(df[feature_cols])\n",
55+
"```\n",
56+
"\n",
57+
"In addition to the new DAE feature encoder, `Kaggler` supports many of the feature transformations used in Kaggle competitions, including:\n",
58+
"* `TargetEncoder`: with smoothing and cross-validation to avoid overfitting\n",
59+
"* `FrequencyEncoder`\n",
60+
"* `LabelEncoder`: that imputes missing values and groups rare categories\n",
61+
"* `OneHotEncoder`: that imputes missing values and groups rare categories\n",
62+
"* `EmbeddingEncoder`: that transforms categorical features into embeddings\n",
63+
"* `QuantileEncoder`: that transforms numerical features into quantiles\n",
64+
"\n",
65+
"In the notebook below, I will show how to use `Kaggler`'s `LabelEncoder`, `TargetEncoder`, and `DAE` for feature engineering, then use `Kaggler`'s `AutoLGB` to do feature selection and hyperparameter optimization."
66+
]
67+
},
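Note on the markdown cell added above: it names three kinds of input corruption (`swap_prob`, `noise_std`, `mask_prob`). The sketch below shows what each one typically does to a numeric feature matrix. It is a NumPy illustration only, not Kaggler's implementation; the function names `swap_noise`, `gaussian_noise`, and `zero_mask` are hypothetical.

```python
import numpy as np


def swap_noise(x, swap_prob, rng):
    """With probability swap_prob, replace a cell by the same column's value from a random row."""
    n_rows, n_cols = x.shape
    mask = rng.random(x.shape) < swap_prob
    # One random donor row per cell; the column index is kept unchanged.
    donor_rows = rng.integers(0, n_rows, size=x.shape)
    swapped = x[donor_rows, np.arange(n_cols)]
    return np.where(mask, swapped, x)


def gaussian_noise(x, noise_std, rng):
    """Add zero-mean Gaussian noise with standard deviation noise_std to every cell."""
    return x + rng.normal(0.0, noise_std, size=x.shape)


def zero_mask(x, mask_prob, rng):
    """With probability mask_prob, set a cell to zero."""
    mask = rng.random(x.shape) < mask_prob
    return np.where(mask, 0.0, x)


rng = np.random.default_rng(42)
x = rng.normal(size=(1000, 4))  # stand-in for a numeric feature matrix

# Apply all three corruptions in sequence, using the values from the example above.
x_noisy = zero_mask(gaussian_noise(swap_noise(x, 0.2, rng), 0.05, rng), 0.1, rng)
```

A denoising autoencoder is then trained to reconstruct `x` from `x_noisy`. The order in which the corruptions are applied here is an arbitrary choice for the sketch.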
68+
{
69+
"attachments": {},
70+
"cell_type": "markdown",
71+
"metadata": {
72+
"nterop": {
73+
"id": "29"
74+
}
75+
},
1976
"source": [
2077
"This notebook was originally published [here](https://www.kaggle.com/jeongyoonlee/dae-with-2-lines-of-code-with-kaggler) at Kaggle.\n",
2178
"\n",
@@ -42,8 +99,13 @@
4299
]
43100
},
44101
{
102+
"attachments": {},
45103
"cell_type": "markdown",
46-
"metadata": {},
104+
"metadata": {
105+
"nterop": {
106+
"id": "3"
107+
}
108+
},
47109
"source": [
48110
"# Part 1: Data Loading & Feature Engineering"
49111
]
@@ -52,7 +114,10 @@
52114
"cell_type": "code",
53115
"execution_count": null,
54116
"metadata": {
55-
"_kg_hide-input": true
117+
"_kg_hide-input": true,
118+
"nterop": {
119+
"id": "4"
120+
}
56121
},
57122
"outputs": [],
58123
"source": [
@@ -72,7 +137,10 @@
72137
"cell_type": "code",
73138
"execution_count": null,
74139
"metadata": {
75-
"_kg_hide-output": true
140+
"_kg_hide-output": true,
141+
"nterop": {
142+
"id": "5"
143+
}
76144
},
77145
"outputs": [],
78146
"source": [
@@ -82,7 +150,11 @@
82150
{
83151
"cell_type": "code",
84152
"execution_count": null,
85-
"metadata": {},
153+
"metadata": {
154+
"nterop": {
155+
"id": "6"
156+
}
157+
},
86158
"outputs": [],
87159
"source": [
88160
"import kaggler\n",
@@ -96,7 +168,10 @@
96168
"cell_type": "code",
97169
"execution_count": null,
98170
"metadata": {
99-
"_kg_hide-input": true
171+
"_kg_hide-input": true,
172+
"nterop": {
173+
"id": "7"
174+
}
100175
},
101176
"outputs": [],
102177
"source": [
@@ -107,7 +182,11 @@
107182
{
108183
"cell_type": "code",
109184
"execution_count": null,
110-
"metadata": {},
185+
"metadata": {
186+
"nterop": {
187+
"id": "8"
188+
}
189+
},
111190
"outputs": [],
112191
"source": [
113192
"feature_name = 'dae'\n",
@@ -132,7 +211,11 @@
132211
{
133212
"cell_type": "code",
134213
"execution_count": null,
135-
"metadata": {},
214+
"metadata": {
215+
"nterop": {
216+
"id": "9"
217+
}
218+
},
136219
"outputs": [],
137220
"source": [
138221
"n_fold = 5\n",
@@ -143,7 +226,11 @@
143226
{
144227
"cell_type": "code",
145228
"execution_count": null,
146-
"metadata": {},
229+
"metadata": {
230+
"nterop": {
231+
"id": "10"
232+
}
233+
},
147234
"outputs": [],
148235
"source": [
149236
"trn = pd.read_csv(trn_file, index_col=id_col)\n",
@@ -156,7 +243,11 @@
156243
{
157244
"cell_type": "code",
158245
"execution_count": null,
159-
"metadata": {},
246+
"metadata": {
247+
"nterop": {
248+
"id": "11"
249+
}
250+
},
160251
"outputs": [],
161252
"source": [
162253
"tst[target_col] = pseudo_label[target_col]\n",
@@ -168,7 +259,11 @@
168259
{
169260
"cell_type": "code",
170261
"execution_count": null,
171-
"metadata": {},
262+
"metadata": {
263+
"nterop": {
264+
"id": "12"
265+
}
266+
},
172267
"outputs": [],
173268
"source": [
174269
"# Feature engineering code from https://www.kaggle.com/udbhavpangotra/tps-apr21-eda-model\n",
@@ -212,7 +307,11 @@
212307
{
213308
"cell_type": "code",
214309
"execution_count": null,
215-
"metadata": {},
310+
"metadata": {
311+
"nterop": {
312+
"id": "13"
313+
}
314+
},
216315
"outputs": [],
217316
"source": [
218317
"for col in ['SameFirstName', 'SameSecondName', 'Fare', 'FamilySize', 'Parch', 'SibSp']:\n",
@@ -223,33 +322,51 @@
223322
]
224323
},
225324
{
325+
"attachments": {},
226326
"cell_type": "markdown",
227-
"metadata": {},
327+
"metadata": {
328+
"nterop": {
329+
"id": "14"
330+
}
331+
},
228332
"source": [
229333
"## Label encoding with rare category grouping and missing value imputation"
230334
]
231335
},
232336
{
233337
"cell_type": "code",
234338
"execution_count": null,
235-
"metadata": {},
339+
"metadata": {
340+
"nterop": {
341+
"id": "15"
342+
}
343+
},
236344
"outputs": [],
237345
"source": [
238346
"lbe = LabelEncoder(min_obs=50)\n",
239347
"df[cat_cols] = lbe.fit_transform(df[cat_cols]).astype(int)"
240348
]
241349
},
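Note on the cell above: `LabelEncoder(min_obs=50)` is described in the notebook as grouping rare categories and imputing missing values. The sketch below illustrates that idea with the standard library only. It assumes that rare and missing values share a single fallback code, which may differ from how Kaggler assigns codes; `fit_label_map` and `transform` are hypothetical helpers.

```python
from collections import Counter


def fit_label_map(values, min_obs):
    """Give categories seen at least min_obs times the codes 1..K; all others get no entry."""
    counts = Counter(v for v in values if v is not None)
    frequent = sorted(c for c, n in counts.items() if n >= min_obs)
    return {c: i + 1 for i, c in enumerate(frequent)}


def transform(values, label_map):
    # Missing values (None) and rare or unseen categories all fall back to code 0.
    return [label_map.get(v, 0) for v in values]


# Toy column: two frequent categories, one rare category, and a few missing values.
cabin = ['A'] * 60 + ['B'] * 55 + ['C'] * 3 + [None] * 5
label_map = fit_label_map(cabin, min_obs=50)
codes = transform(cabin, label_map)
print(label_map)  # {'A': 1, 'B': 2}
```

Grouping rare categories this way keeps a downstream model from fitting codes that are backed by only a handful of rows.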
242350
{
351+
"attachments": {},
243352
"cell_type": "markdown",
244-
"metadata": {},
353+
"metadata": {
354+
"nterop": {
355+
"id": "16"
356+
}
357+
},
245358
"source": [
246359
"## Target encoding with smoothing and 5-fold cross-validation"
247360
]
248361
},
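Note on the heading above: target encoding "with smoothing and cross-validation" usually means that each row is encoded with a category mean computed from the other folds and shrunk toward the global mean. The sketch below shows one common additive-smoothing formula, with plain random folds in place of the notebook's `StratifiedKFold`. Both choices are assumptions for illustration and may not match Kaggler's `TargetEncoder`; the helper names are hypothetical.

```python
import numpy as np


def smoothed_means(cat, y, smoothing):
    """Per-category target mean, shrunk toward the global mean by `smoothing` pseudo-rows."""
    prior = y.mean()
    means = {}
    for c in np.unique(cat):
        y_c = y[cat == c]
        means[c] = (y_c.sum() + smoothing * prior) / (len(y_c) + smoothing)
    return means, prior


def target_encode_oof(cat, y, n_fold, smoothing, seed):
    """Encode each row using statistics computed only from the other folds."""
    rng = np.random.default_rng(seed)
    fold = rng.permutation(len(y)) % n_fold  # random, roughly equal-sized folds
    encoded = np.empty(len(y), dtype=float)
    for k in range(n_fold):
        val = fold == k
        means, prior = smoothed_means(cat[~val], y[~val], smoothing)
        # Categories absent from the training folds fall back to the prior.
        encoded[val] = [means.get(c, prior) for c in cat[val]]
    return encoded


cat = np.array([0, 0, 0, 1, 1, 2, 0, 1, 0, 1])
y = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 1])
te = target_encode_oof(cat, y, n_fold=5, smoothing=1.0, seed=42)
```

Because a row's own target never enters its encoding, the encoded feature leaks less label information than a plain per-category mean.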
249362
{
250363
"cell_type": "code",
251364
"execution_count": null,
252-
"metadata": {},
365+
"metadata": {
366+
"nterop": {
367+
"id": "17"
368+
}
369+
},
253370
"outputs": [],
254371
"source": [
255372
"cv = StratifiedKFold(n_splits=n_fold, shuffle=True, random_state=seed)\n",
@@ -260,16 +377,25 @@
260377
]
261378
},
262379
{
380+
"attachments": {},
263381
"cell_type": "markdown",
264-
"metadata": {},
382+
"metadata": {
383+
"nterop": {
384+
"id": "18"
385+
}
386+
},
265387
"source": [
266388
"## DAE"
267389
]
268390
},
269391
{
270392
"cell_type": "code",
271393
"execution_count": null,
272-
"metadata": {},
394+
"metadata": {
395+
"nterop": {
396+
"id": "19"
397+
}
398+
},
273399
"outputs": [],
274400
"source": [
275401
"dae = DAE(cat_cols=cat_cols, num_cols=num_cols, encoding_dim=encoding_dim)\n",
@@ -279,31 +405,49 @@
279405
{
280406
"cell_type": "code",
281407
"execution_count": null,
282-
"metadata": {},
408+
"metadata": {
409+
"nterop": {
410+
"id": "20"
411+
}
412+
},
283413
"outputs": [],
284414
"source": [
285415
"df_dae = pd.DataFrame(X, columns=[f'dae_{i}' for i in range(encoding_dim)])\n",
286416
"print(df_dae.shape)"
287417
]
288418
},
289419
{
420+
"attachments": {},
290421
"cell_type": "markdown",
291-
"metadata": {},
422+
"metadata": {
423+
"nterop": {
424+
"id": "21"
425+
}
426+
},
292427
"source": [
293428
"# Part 2: Model Training"
294429
]
295430
},
296431
{
432+
"attachments": {},
297433
"cell_type": "markdown",
298-
"metadata": {},
434+
"metadata": {
435+
"nterop": {
436+
"id": "22"
437+
}
438+
},
299439
"source": [
300440
"## AutoLGB for Feature Selection and Hyperparameter Optimization"
301441
]
302442
},
303443
{
304444
"cell_type": "code",
305445
"execution_count": null,
306-
"metadata": {},
446+
"metadata": {
447+
"nterop": {
448+
"id": "23"
449+
}
450+
},
307451
"outputs": [],
308452
"source": [
309453
"X = pd.concat([df[feature_cols], df_te, df_dae], axis=1)\n",
@@ -335,24 +479,37 @@
335479
{
336480
"cell_type": "code",
337481
"execution_count": null,
338-
"metadata": {},
482+
"metadata": {
483+
"nterop": {
484+
"id": "24"
485+
}
486+
},
339487
"outputs": [],
340488
"source": [
341489
"print(f' CV AUC: {roc_auc_score(y, p):.6f}')\n",
342490
"print(f'Test AUC: {roc_auc_score(pseudo_label[target_col], p_tst)}')"
343491
]
344492
},
345493
{
494+
"attachments": {},
346495
"cell_type": "markdown",
347-
"metadata": {},
496+
"metadata": {
497+
"nterop": {
498+
"id": "25"
499+
}
500+
},
348501
"source": [
349502
"## Submission"
350503
]
351504
},
352505
{
353506
"cell_type": "code",
354507
"execution_count": null,
355-
"metadata": {},
508+
"metadata": {
509+
"nterop": {
510+
"id": "26"
511+
}
512+
},
356513
"outputs": [],
357514
"source": [
358515
"n_pos = int(0.34911 * tst.shape[0])\n",
@@ -364,16 +521,25 @@
364521
{
365522
"cell_type": "code",
366523
"execution_count": null,
367-
"metadata": {},
524+
"metadata": {
525+
"nterop": {
526+
"id": "27"
527+
}
528+
},
368529
"outputs": [],
369530
"source": [
370531
"sub[target_col] = (p_tst > th).astype(int)\n",
371532
"sub.to_csv(submission_file)"
372533
]
373534
},
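Note on the two submission cells above: they fix the number of positive predictions at `n_pos` and then binarize with `p_tst > th`. The lines that compute `th` are not visible in this diff, so the sketch below shows one possible way to derive such a threshold. `threshold_for_n_pos` is a hypothetical helper, and the prediction values are made up for illustration.

```python
import numpy as np


def threshold_for_n_pos(p, n_pos):
    """Return th such that (p > th) flags the n_pos highest scores, assuming no ties at the cut."""
    if n_pos <= 0:
        return np.inf
    if n_pos >= len(p):
        return -np.inf
    p_sorted = np.sort(p)[::-1]
    # Midpoint between the last score kept and the first score dropped.
    return (p_sorted[n_pos - 1] + p_sorted[n_pos]) / 2


p_tst = np.array([0.91, 0.15, 0.62, 0.40, 0.77, 0.05, 0.33, 0.58])  # illustrative values
n_pos = int(0.34911 * len(p_tst))  # same positive rate as in the notebook
th = threshold_for_n_pos(p_tst, n_pos)
pred = (p_tst > th).astype(int)
print(n_pos, pred.sum())  # 2 2
```

Choosing the threshold from a target positive rate, rather than a fixed 0.5, keeps the share of predicted positives close to the rate assumed for the test set.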
374535
{
536+
"attachments": {},
375537
"cell_type": "markdown",
376-
"metadata": {},
538+
"metadata": {
539+
"nterop": {
540+
"id": "28"
541+
}
542+
},
377543
"source": [
378544
"If you find it useful, please upvote the notebook and leave your feedback. It will be greatly appreciated!\n",
379545
"\n",
@@ -385,6 +551,7 @@
385551
}
386552
],
387553
"metadata": {
554+
"hide_input": false,
388555
"kernelspec": {
389556
"display_name": "Python 3",
390557
"language": "python",
@@ -400,7 +567,10 @@
400567
"name": "python",
401568
"nbconvert_exporter": "python",
402569
"pygments_lexer": "ipython3",
403-
"version": "3.8.5"
570+
"version": "3.7.10"
571+
},
572+
"nterop": {
573+
"seedId": "29"
404574
},
405575
"toc": {
406576
"base_numbering": 1,
