This repository contains my solution for the Kaggle competition TalkingData AdTracking Fraud Detection. It ranked 197 out of 3967 (Top 5%).
The goal of this competition was to predict if mobile users will install an app they have clicked (Click-Through prediction). The biggest challenge in this competition was to handle the huge amount of data (about 250 millions rows).
This solution is based on 2 different models:
- Field-Aware Factorization Machine
- Gradient Boosted Decision Tree
The Field-Aware Factorization Machine was combined with an unsupervised gradient boosted decision tree (with 30 trees) for feature engineering. The gradient boosted decision tree is regularly trained in a supervised way but instead of using its target predictions the leaf index predictions are used as features for the Field-Aware Factorization machine. This approach was proposed by Xinran He et al. Practical Lessons from Predicting Clicks on Ads at Facebook and used in the winning solution of the previous Click-Trough prediction competition Display Advertising Challenge. Field-Aware Factorization machines proved to be a very strong powerful model in past Click-Trough prediction competitions. They work well when used with categorical features. The used library is xLearn.
The Gradient Boosted Decision Tree is trained using various Groupby and Aggregating features including aggregate functions count
, var
, mean
, nunique
and cumcount
in addition to time-to-next-click features. Those features Those features were used in almost all top solutions and many kernels. The used library was LightGBM because its impressived speed given the huge amount of data.
These 2 models were ensembled using weighted blending.