Here I describe my approach to solving the Kaggle competition task. My solution landed in 2nd place.
I decided to build a number of simple (basic) classifiers and combine them into a final ensemble. As basic classifiers I mostly used logistic regression over words/stemmed words/POS tags/etc. The outputs of the basic classifiers were stacked using a Random Forest regressor, which produced the final score.
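The write-up doesn't detail how the stacker was trained without leaking the basic classifiers' training data; a common scheme is to stack on out-of-fold predictions. Here is a minimal sketch of that architecture with toy data and two hypothetical feature views (word unigrams and word bigrams), not the actual 110-classifier setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline

# Toy data standing in for the competition corpus.
texts = ["you are an idiot", "thanks, that was helpful",
         "what a moron", "interesting point, well argued"] * 10
labels = np.array([1, 0, 1, 0] * 10)

# Two stand-in "basic classifiers": logistic regression over word
# unigrams and over word bigrams.
basic = [
    make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
    make_pipeline(CountVectorizer(ngram_range=(2, 2)),
                  LogisticRegression(max_iter=1000)),
]

# Out-of-fold predictions of the basic classifiers become the
# meta-features on which the Random Forest regressor is stacked.
meta = np.column_stack([
    cross_val_predict(clf, texts, labels, cv=5, method="predict_proba")[:, 1]
    for clf in basic
])
stacker = RandomForestRegressor(n_estimators=200, random_state=0).fit(meta, labels)

# At test time: refit the basics on all training data, predict on the
# test texts, and feed those predictions to the stacker.
for clf in basic:
    clf.fit(texts, labels)
```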
The most obvious NLP features are words and stems. The Stanford POS tagger was used to retrieve each word's POS tag, which was then used the same way as ordinary word/stem features.
```
           0    1       2    3    4           5     6      7
words:     I    really  do   n't  understand  your  point  .
stems:     I    realli  do   n't  understand  your  point  .
POS tags:  PRP  RB      VBP  RB   VB          PRP$  NN     .
```
Using the word/stem/POS tag sequences one can build more complicated features: bigrams, trigrams, unordered bigrams/trigrams, subsequences within a sliding window of length N, and so on.
For example, POS tag based features:
```
                   0       1       2       3      4        5        6
sequence:          PRP     RB      VBP     RB     VB       PRP$     NN
bigrams:           PRP-RB  RB-VBP  VBP-RB  RB-VB  VB-PRP$  PRP$-NN
unordered bigrams: PRP-RB  RB-VBP  RB-VBP  RB-VB  PRP$-VB  NN-PRP$
2-subseq of 3:     [PRP,RB,VBP] -> PRP-RB, PRP-VBP, RB-VBP
                   [RB,VBP,RB]  -> RB-VBP, RB-RB, VBP-RB
                   ...
```
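For illustration, here is how such features might be extracted (the function names are mine, not from the original code):

```python
from itertools import combinations

def bigrams(seq):
    return [f"{a}-{b}" for a, b in zip(seq, seq[1:])]

def unordered_bigrams(seq):
    # Sort each pair so that order inside the bigram is ignored.
    return ["-".join(sorted(p)) for p in zip(seq, seq[1:])]

def subseq(seq, k, window):
    # All k-element subsequences inside a sliding window of length `window`;
    # counting/deduplication is left to the downstream vectorizer.
    feats = []
    for i in range(len(seq) - window + 1):
        for combo in combinations(seq[i:i + window], k):
            feats.append("-".join(combo))
    return feats

tags = ["PRP", "RB", "VBP", "RB", "VB", "PRP$", "NN"]
print(bigrams(tags))            # PRP-RB, RB-VBP, VBP-RB, ...
print(unordered_bigrams(tags))  # PRP-RB, RB-VBP, RB-VBP, RB-VB, PRP$-VB, NN-PRP$
print(subseq(tags, 2, 3))       # PRP-RB, PRP-VBP, RB-VBP, RB-VBP, RB-RB, ...
```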
After extraction, the features need to be scored. There are a number of scoring approaches: 0/1, TF, TF*IDF, probability score, mutual information and so on. Briefly:
- 0/1 - whether the word is present in the document
- TF - term frequency - how many times the word occurs in the document
- TF*IDF - uses IDF (inverse document frequency) to down-weight popular words
- Probability score - the probability of seeing the word in a document of a given class
- Mutual information - how much the word's presence tells us about the document's class
I used the probability score for most features.
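The exact formula isn't given above; one plausible reading of the "probability score" is the smoothed probability of the "insult" class given that the word occurs. A sketch under that assumption:

```python
from collections import Counter

def probability_scores(docs, labels, alpha=1.0):
    """P(insult | word) with additive smoothing -- one plausible
    reading of the 'probability score', not the confirmed formula."""
    insult, total = Counter(), Counter()
    for words, y in zip(docs, labels):
        for w in set(words):      # 0/1 presence per document
            total[w] += 1
            insult[w] += y
    return {w: (insult[w] + alpha) / (total[w] + 2 * alpha) for w in total}

docs = [["you", "idiot"], ["nice", "point"], ["idiot"]]
scores = probability_scores(docs, [1, 0, 1])
print(scores["idiot"], scores["nice"])  # 0.75 vs. 0.33
```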
Using the positive/negative samples one can build a language model to estimate the probability of a document being an "insult" or "not insult". I used Kneser-Ney language models over words/stems/POS tags up to 7-grams (however, the 7-gram language models were not useful at all).
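The post doesn't say which LM toolkit was used; as a sketch, NLTK's `KneserNeyInterpolated` can play the same role, with the log-likelihood ratio between the two class-conditional models used as a feature:

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

ORDER = 3  # the post goes up to 7-grams; 3 keeps the toy example small

def train_lm(docs):
    # docs: tokenized documents of one class ("insult" or "not insult")
    train, vocab = padded_everygram_pipeline(ORDER, docs)
    lm = KneserNeyInterpolated(ORDER)
    lm.fit(train, vocab)
    return lm

def logprob(lm, tokens):
    grams = ngrams(pad_both_ends(tokens, n=ORDER), ORDER)
    return sum(lm.logscore(g[-1], g[:-1]) for g in grams)

# Toy corpora; with data this small, words unseen in one class can get
# zero probability, which realistically sized corpora largely avoid.
insult_lm = train_lm([["you", "are", "an", "idiot"], ["what", "a", "moron"]])
normal_lm = train_lm([["you", "are", "right"], ["what", "a", "nice", "point"]])

tokens = ["you", "are"]
# Log-likelihood ratio as a feature: higher means "more insult-like".
feature = logprob(insult_lm, tokens) - logprob(normal_lm, tokens)
```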
To add smoothing, I tried building mixed stem-POS bigrams and trigrams.
```
          0           1           2        3              4                5           6
words:    I           really      do       n't            understand       your        point
tags:     PRP         RB          VBP      RB             VB               PRP$        NN
tag-word: PRP-really  RB-do       VBP-n't  RB-understand  VB-your          PRP$-point
word-tag: I-RB        really-VBP  do-RB    n't-VB         understand-PRP$  your-NN
```
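These mixed streams are easy to produce from the two aligned sequences; a quick sketch:

```python
words = ["I", "really", "do", "n't", "understand", "your", "point"]
tags = ["PRP", "RB", "VBP", "RB", "VB", "PRP$", "NN"]

# Pair the tag of token i with the word of token i+1, and vice versa.
tag_word = [f"{t}-{w}" for t, w in zip(tags, words[1:])]
word_tag = [f"{w}-{t}" for w, t in zip(words, tags[1:])]
print(tag_word)  # ['PRP-really', 'RB-do', 'VBP-n't', ...]
```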
Character-ngram-based classifiers were used as well, with n-grams of length 2, 3 and 4 only.
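A minimal scikit-learn version of one such classifier (toy data; the original feature scoring may have differed):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["you are an idiot", "thanks, that was helpful",
         "what a moron", "interesting point"]
labels = [1, 0, 1, 0]

# Character 2-4 grams over the raw string, logistic regression on top.
char_clf = make_pipeline(
    CountVectorizer(analyzer="char", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
char_clf.fit(texts, labels)
print(char_clf.predict_proba(["you total idiot"])[:, 1])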
The Stanford Parser was used to obtain dependency parses, which were used as additional features. The dependency parser produces a set of triples (dependency type, word1, word2). From these triples I built "syntax bigrams" and "syntax trigrams". There were also extractors for (dependency type, word2) and (word1, dependency type) features.
```
dependency                      syntax bigram        (dep, w2)        (w1, dep)
nsubj(understand-5, I-1)        "understand I"       NSUBJ-I          understand-NSUBJ
advmod(understand-5, really-2)  "understand really"  ADVMOD-really    understand-ADVMOD
aux(understand-5, do-3)         "understand do"      AUX-do           understand-AUX
neg(understand-5, n't-4)        "understand n't"     NEG-n't          understand-NEG
root(ROOT-0, understand-5)      "ROOT understand"    ROOT-understand  understand-ROOT
poss(point-7, your-6)           "point your"         POSS-your        point-POSS
dobj(understand-5, point-7)     "understand point"   DOBJ-point       understand-DOBJ
```
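These feature kinds fall out of the parser's triples directly; for illustration (triples hard-coded here instead of calling the parser):

```python
# (dependency type, head word, dependent word) as returned by the parser
deps = [
    ("nsubj", "understand", "I"),
    ("advmod", "understand", "really"),
    ("aux", "understand", "do"),
    ("neg", "understand", "n't"),
    ("poss", "point", "your"),
    ("dobj", "understand", "point"),
]

syntax_bigrams = [f"{w1} {w2}" for _, w1, w2 in deps]  # "understand I", ...
dep_w2 = [f"{d.upper()}-{w2}" for d, _, w2 in deps]    # NSUBJ-I, ...
w1_dep = [f"{w1}-{d.upper()}" for d, w1, _ in deps]    # understand-NSUBJ, ...
```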
In one of my experiments I used only syntax features and got a pretty good AUC; in the final ensemble, however, they had very low feature importance.
Two metamodels were implemented (a metamodel uses an underlying basic model to build another model): "sentence-level metamodels" and "ranking models".
The intuition is simple: in an insulting message, usually one sentence is insulting while the others can be perfectly normal.
So only examples that contain a single sentence were selected, and the underlying model was trained on those examples alone. To predict, a message is split into sentences and the maximum "insult" score over sentences (given by the underlying model) is taken, as sketched below.
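A minimal sketch of this metamodel (the actual sentence splitter and method names are not given in the post; the regex splitter and `predict_score` here are my stand-ins):

```python
import re

class SentenceLevelModel:
    def __init__(self, base):
        self.base = base  # any basic classifier with fit/predict_proba

    @staticmethod
    def _sentences(text):
        return [s for s in re.split(r"(?<=[.!?])\s+", text) if s]

    def fit(self, texts, labels):
        # Train the underlying model only on single-sentence examples.
        pairs = [(t, y) for t, y in zip(texts, labels)
                 if len(self._sentences(t)) == 1]
        xs, ys = zip(*pairs)
        self.base.fit(list(xs), list(ys))
        return self

    def predict_score(self, text):
        # Score each sentence; the most insulting one decides.
        sents = self._sentences(text)
        return max(self.base.predict_proba(sents)[:, 1])
```

The base model can be any of the basic classifiers above, for instance the char-ngram pipeline.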
For AUC we only care about the relative ordering of the predictions, not their absolute values, so the classification problem can be turned into a ranking problem.
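The post doesn't say which ranking formulation was used; one standard transformation trains a classifier on differences of positive/negative feature vectors. A sketch under that assumption, with a dense feature matrix `X` and numpy label array `y`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_ranker(X, y, n_pairs=10000, seed=0):
    rng = np.random.default_rng(seed)
    pos = np.where(y == 1)[0]
    neg = np.where(y == 0)[0]
    i = rng.choice(pos, n_pairs)   # sampled insult examples
    j = rng.choice(neg, n_pairs)   # sampled non-insult examples
    # A pair (insult, non-insult) becomes a positive difference example,
    # the reversed pair a negative one.
    diffs = np.vstack([X[i] - X[j], X[j] - X[i]])
    targets = np.r_[np.ones(n_pairs), np.zeros(n_pairs)]
    return LogisticRegression(max_iter=1000).fit(diffs, targets)

# Rank new documents with clf.decision_function(X_new);
# for AUC only this ordering matters.
```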
I used the Stanford POS tagger and the Stanford Parser for feature extraction and the scikit-learn package for machine learning. The final model contained 110 basic classifiers and took about 10 hours to train (simply because I didn't optimize it for speed at all).
Here are the most important basic classifiers according to the Random Forest feature importances in the final submission:
```
stemsSubseq3Logistic_6        0.22
stemsSubseq2SortedLogistic_5  0.16
stemsSubseq3Logistic_5        0.12
stem12Rank                    0.08
stemsSubseq2Logistic_6        0.05
```
Here is an example of using the sentence-level model with a language model inside (feature importances and AUC for a model without and with the sentence-level features):
```
              without sen.  with sen.
stemLm_2      0.8866        0.7902
stemLm_3      0.0410        0.0262
stemLm_4      0.0123        0.0043
stemLm_5      0.0602        0.0344
stemLmSen_2                 0.1254
stemLmSen_3                 0.0124
stemLmSen_4                 0.0038
stemLmSen_5                 0.0034
AUC           0.72          0.77
```
So we can see that the sentence-level features add information to the final model.
Here are the feature group importances in the final submission:
```
stem subsequence based          0.66
stem based (unigrams, bigrams)  0.18
char ngrams based (sentence)    0.07
char ngrams based               0.04
all syntax                      0.006
all language models             0.004
all mixed                       0.002
```