Commit 0e7b19d

Committed on Apr 25, 2021
migrate from fast_template to fastpages
1 parent aac0b52 commit 0e7b19d

40 files changed: +2593 -7 lines
 

_notebooks/2021-04-18-autoencoder-pseudo-label-autolgb.ipynb

+548
Large diffs are not rendered by default.

_notebooks/2021-04-21-supervised-emphasized-denoising-autoencoder.ipynb

+757
Large diffs are not rendered by default.

_pages/about.md

+49-7
@@ -1,11 +1,53 @@
----
-layout: page
-title: About Me
-permalink: /about/
----
+# About Kaggler TV

-This website is powered by **[fastpages](https://github.com/fastai/fastpages)** [^1].
+We are Data Scientists and Kagglers.

+We enjoy participating in data science competitions. In every competition, we learn something new: new algorithms (Factorization Machine, Follow-the-Regularized-Leader), new tools (Vowpal Wabbit, XGBoost, LightGBM, Keras), and new domain knowledge. It helps us keep our skills up to date in the fast-evolving fields of Machine Learning and Data Science.

+With Kaggler TV, we'd like to share our learning and experiences with others. We'll also announce new releases of the Kaggler Python package, which we wrote for Kaggle competitions, here.

-[^1]:a blogging platform that natively supports Jupyter notebooks in addition to other formats.
+If you're interested in contributing to Kaggler.com, please email us at kagglertv@gmail.com.
+
+Enjoy!
+
+
+# OTHER LINKS
+
+## Kaggler TV YouTube Channel
+
+* [Kaggler TV YouTube Channel](https://www.youtube.com/c/KagglerTV)
+* [GitHub Repo for Schedule & Content Request](https://github.com/kaggler-tv/kaggler-tv-schedule)
+
+## Social Networks
+
+* [Kaggler TV Twitter](https://twitter.com/kagglertv)
+* [Kaggler Facebook Page](https://www.facebook.com/Kaggler/)
+
+## Kaggler / Kaggler TV Code Repositories
+
+* [Kaggler Python Package](https://github.com/jeongyoonlee/Kaggler)
+* [Kaggler Competition Pipeline Template](https://github.com/kaggler-tv/kaggler-template)
+* [Pipeline Example with `cat-in-the-dat-ii`](https://github.com/kaggler-tv/cat-in-the-dat-ii)
+
+
+# CONTRIBUTORS
+
+<img src="images/jeong.png" style="float:left; background:none; border:none; box-shadow:none;">
+
+**Jeong-Yoon Lee**: Jeong is a Competition Master at Kaggle. He has participated in over 100 competitions, won 6 times including KDD Cup 2012 and 2015, and was ranked in the top 10 at Kaggle in 2015. He served as co-chair of KDD Cup 2018. He earned his Ph.D. in Computer Science from the University of Southern California. He's originally from South Korea.
+
+<img src="images/erkut.png" style="float:left; background:none; border:none; box-shadow:none;">
+
+**Erkut Aykutlug**: Erkut is a Competitions Expert at Kaggle. He has participated in more than a dozen competitions and won first place at Kaggle Days SF in 2019. He earned his Ph.D. in Mechanical Engineering from UC Irvine. He's originally from Turkey.
+
+<img src="images/youhan.png" style="float:left; background:none; border:none; box-shadow:none;">
+
+**Youhan Lee**: Youhan is a Competition Master at Kaggle. He has participated in over 30 competitions and won 3 gold medals (two 3rd, one 11th) at Kaggle. He has a deep interest in using machine learning techniques to solve industrial problems. He earned his Ph.D. in Chemical Engineering from the Korea Advanced Institute of Science and Technology. He's originally from South Korea.
+
+<img src="images/tam.png" style="float:left; background:none; border:none; box-shadow:none;">
+
+**Tam T. Nguyen**: Tam is a Competition Grandmaster at Kaggle. He won first prizes at KDD Cup 2015, the IJCAI-15 repeat buyer competition, and the Springleaf marketing response competition. He is a Postdoctoral Research Fellow at Ryerson University in Toronto, Canada. Prior to that, he was a Data Analytics Project Lead at I2R A\*STAR. He earned his Ph.D. in Computer Science from NTU, Singapore. He's originally from Vietnam.
+
+<img src="images/hang.png" style="float:left; background:none; border:none; box-shadow:none;">
+
+**Hang Li**: Hang is a Competition Master at Kaggle. He has participated in over 30 data science competitions. He has a strong passion for using machine learning techniques to solve real-world problems. He earned his Ph.D. in Information and Communication Engineering from Tsinghua University. He's originally from China.
@@ -0,0 +1,44 @@
# Data Science Career for Neuroscientists + Tips for Kaggle Competitions

![](/images/20131007-brain-circuit.jpg)

Recently Prof. Konrad Koerding at Northwestern University asked on Facebook for advice for one of his Ph.D. students, who studies Computational Neuroscience but wants to pursue a career in Data Science. It reminded me of the time when I was looking for such opportunities myself, and I shared my thoughts (now posted on his lab's webpage [here](http://kordinglab.com/2016/01/05/leave-neuroscience.html)). I decided to post them here with a few fixes so that they can help others.

1. TOC
{:toc}

---

# Introduction

First, I'd like to say that Data Science is a relatively new field (like Computational Neuroscience), and you don't need to feel bad about making the transition after your Ph.D. When I went out to the job market, I didn't have any analytics background at all either.

I started my industry career at an analytics consulting company, Opera Solutions in San Diego, where one of Nicolas' friends, Jacob, runs the company's R&D team. Jacob also did his Ph.D. in Computational Neuroscience, under the supervision of Prof. Michael Arbib at the University of Southern California. During the interview, I was tested on my thought process, basic knowledge of statistics and Machine Learning, and programming, all of which I had practiced every day throughout my Ph.D.

So if he has a good Machine Learning background and programming skills (I'm sure he does, given that he's your student), he is well equipped to pursue a career in Data Science.

# Tools in Data Science

Back in graduate school, I mostly used MATLAB with some SPSS and C. In Data Science, Python and R are the most popular languages, and SQL is a necessary evil.

R is similar to MATLAB except that it's free. It is not a hardcore programming language and doesn't take much time to learn. It comes with the latest statistical libraries and provides powerful plotting functions. There are many IDEs that make R easy to use, but my favorite is RStudio. If you run R on a server with RStudio Server, you can access it from anywhere via your web browser, which is really cool. Although the native R plotting functions are excellent by themselves, the ggplot2 library provides more eye-catching visualizations.

For Python, the NumPy and SciPy packages provide vector-matrix computation functionality similar to MATLAB's. For Machine Learning algorithms, you need scikit-learn, and for data handling, pandas will make your life easy. For debugging and prototyping, the IPython Notebook is really handy and useful.
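To make that concrete, here is a minimal sketch of how pandas, NumPy, and scikit-learn typically fit together (the file and column names are made up for illustration):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical dataset: "train.csv" with a binary "target" column.
df = pd.read_csv("train.csv")
X = df.drop(columns=["target"]).values  # pandas DataFrame -> NumPy array
y = df["target"].values

# Hold out 20% of the data for validation.
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_trn, y_trn)
print("Validation accuracy:", clf.score(X_val, y_val))
```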
SQL is an old technology but still widely used. Most data are stored in data warehouses that can be accessed only via SQL or SQL equivalents (Oracle, Teradata, Netezza, etc.). Postgres and MySQL are powerful yet free, so they are perfect to practice with.
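As a toy illustration of the kind of query you end up writing against a warehouse, here is a sketch that uses Python's built-in sqlite3 module as a stand-in for Postgres/MySQL (the table and column names are made up):

```python
import sqlite3

# In-memory SQLite database standing in for a real data warehouse.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (customer_id INTEGER, amount REAL)")
con.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (1, 25.0), (2, 5.0)],
)

# A typical aggregate: total spend per customer, highest first.
query = """
SELECT customer_id, COUNT(*) AS n_orders, SUM(amount) AS total
FROM orders
GROUP BY customer_id
ORDER BY total DESC
"""
for row in con.execute(query):
    print(row)
```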
# Hints for Kaggle Data Mining Competitions

Fortunately, I have had the chance to work with many top competitors, such as the 1st and 2nd place teams of the Netflix competitions, and to learn how they approach competitions. Here are some tips I found helpful.

## Don't jump into algorithms too fast.

Spend enough time to understand the data. Algorithms are important, but no matter how good an algorithm you use, garbage in only leads to garbage out. Many classification/regression algorithms assume Gaussian-distributed variables and fail to make good predictions when you feed them non-Gaussian-distributed variables. So standardization, normalization, non-linear transformation, discretization, and binning are very important.
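For example, here is a minimal preprocessing sketch with scikit-learn; the data is synthetic and purely illustrative:

```python
import numpy as np
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

rng = np.random.default_rng(42)
# Two synthetic features: one heavily skewed (log-normal), one roughly Gaussian.
X = np.column_stack([rng.lognormal(size=1000), rng.normal(size=1000)])

# Non-linear transformation: log1p pulls the skewed feature toward Gaussian.
X[:, 0] = np.log1p(X[:, 0])

# Standardization: zero mean, unit variance per feature.
X_std = StandardScaler().fit_transform(X)

# Discretization/binning: quantile bins as an alternative representation.
X_binned = KBinsDiscretizer(n_bins=10, encode="ordinal", strategy="quantile").fit_transform(X)
```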
## Try different algorithms and blend.

There is no universally optimal algorithm. Most of the time (if not always), the winning solutions are ensembles of many individual models built with tens of different algorithms. Combining different kinds of models can improve prediction performance a lot. For individual models, I found Random Forest, Gradient Boosting Machine, Factorization Machine, Neural Network, Support Vector Machine, logistic/linear regression, Naive Bayes, and collaborative filtering most useful. Gradient Boosting Machine and Factorization Machine are often the best individual models.
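A blend can be as simple as averaging the predicted probabilities of a few diverse models. The sketch below, on synthetic data, just shows the mechanics:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_trn, X_val, y_trn, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

models = [
    GradientBoostingClassifier(random_state=42),
    RandomForestClassifier(n_estimators=200, random_state=42),
    LogisticRegression(max_iter=1000),
]

# Average the predicted probabilities of the individual models.
blend = np.mean(
    [m.fit(X_trn, y_trn).predict_proba(X_val)[:, 1] for m in models], axis=0
)
print("Blended AUC:", roc_auc_score(y_val, blend))
```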
## Optimize at last.

Each competition has a different evaluation metric, and optimizing your algorithms for that metric can improve your chance of winning. The two most popular metrics are RMSE and AUC (area under the ROC curve). An algorithm optimized for one metric is not optimal for the other. Many open-source implementations provide only RMSE optimization, so for AUC (or other metric) optimization, you may need to implement it yourself.
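As a toy example of why the metric matters, the model with the better RMSE is not necessarily the one with the better AUC (the numbers below are made up purely to illustrate this):

```python
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

y_true = np.array([0, 0, 1, 1])

# Model A: well-calibrated probabilities, but one ranking mistake.
p_a = np.array([0.1, 0.6, 0.4, 0.9])
# Model B: poorly calibrated, but ranks every positive above every negative.
p_b = np.array([0.45, 0.48, 0.51, 0.55])

for name, p in [("A", p_a), ("B", p_b)]:
    rmse = mean_squared_error(y_true, p) ** 0.5
    print(name, "RMSE:", round(rmse, 3), "AUC:", roc_auc_score(y_true, p))
# Model A wins on RMSE, while Model B wins on AUC.
```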
+42
@@ -0,0 +1,42 @@
# 60 Day Journey of Deloitte Churn Prediction Competition

![](/images/20140103-deloitte-competition.png)

1. TOC
{:toc}

# Competition

Last December, I teamed up with Michael once again to participate in the Deloitte Churn Prediction competition at Kaggle, where the goal was to predict which customers will leave an insurance company in the next 12 months.

It was a masters competition, open only to Master-level Kagglers (the top 0.2% of 138K competitors), with \$70,000 in cash prizes for the top 3 finishers.

# Result

We managed to do well and finished in 4th place out of 37 teams, even though we did not have much time due to projects at work and family events (especially for Michael, who became a dad during the competition).

Although we fell a little short of the prize, it was a fun experience working together with Michael, competing with other top competitors across the world, and climbing the leaderboard day by day.

# Visualization

I visualized our 60-day journey through the competition below; here are some highlights (for us):

* Day 22-35: Dived into the competition, set up the GitHub repo and S3 for collaboration, and climbed up the leaderboard quickly.
* Day 41-45: Second spurt. Dug into GBM and NN models. Michael's baby girl was born on Day 48.
* Day 53-60: Last spurt. Ensembled all models. Improved our score every day, but didn't have time to train the best models.

[Motion Chart - Deloitte Churn Prediction Leaderboard](/deloitte-leaderboard.html)

Once you click the link above, it will show a motion chart where:

* X-axis: Competition day, from day 0 to day 60.
* Y-axis: AUC score.
* Colored circles: Each circle is a team. Clicking a circle shows which team it represents.
* Rightmost legend: Competition day. You can drag the number up and down to see the chart on a specific day.
* The initial positions of the circles show the scores of their first submissions.

For the chart, I reused the rCharts code published by Tony Hirst on GitHub: https://github.com/psychemedia (he also wrote a tutorial on his blog about creating a motion chart with rCharts).

# Closing

We took a rain check on this one, but we will win next time! 🙂
