I've decided to participate in the following Kaggle competition: https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge
I've created a C# application that cleans the training and validation data sets, producing a cleaner representation of the initial file. It does the following:
- Merges all lines related to single id into one line of text
- Removes non-English characters
- Removes all numbers
- Removes the following words: "the", "a", "an", "thanks", "jan", "feb", "mar", "apr", "jun", "jul", "aug", "sep", "oct", "nov", "dec", "january", "february", "march", "april", "june", "july", "august", "september", "october", "november", "december", "utc", "pm", "ok", "please", "be", "hi", "gmt", because IMHO it's impossible to be abusive with those words
- Truncates words that are longer than 50 characters
- Lowercases all characters
- Removes all duplicate words. For example, a line like this:
212313, some text some text continuation abc some text loooooooooooooooooooooooongstringloooooooooooooooooooooooongstring, 0,0,0,0,0,0
will become this:
212313, some text continuation loooooooooooooooooooooooongstringlooooooooooo, 0,0,0,0,0,0
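
The text-cleaning steps above could be sketched roughly as follows. This is a minimal illustration, not the application's actual code: the `CommentCleaner` class and `CleanText` method are hypothetical names, and several details are assumptions (removed characters are replaced by a space rather than deleted, lowercasing happens first, and duplicates are dropped after truncation, keeping the first occurrence). Parsing the CSV and merging the lines that belong to one id are not covered here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper: the names and the order of the steps are assumptions.
public static class CommentCleaner
{
    private const int MaxWordLength = 50;

    // Stop words copied from the list above.
    private static readonly HashSet<string> StopWords = new HashSet<string>
    {
        "the", "a", "an", "thanks", "jan", "feb", "mar", "apr", "jun", "jul", "aug",
        "sep", "oct", "nov", "dec", "january", "february", "march", "april", "june",
        "july", "august", "september", "october", "november", "december", "utc",
        "pm", "ok", "please", "be", "hi", "gmt"
    };

    public static string CleanText(string text)
    {
        // Lowercase first so the later checks only have to deal with a-z.
        string lowered = text.ToLowerInvariant();

        // Every run of characters outside a-z (digits, punctuation,
        // non-English letters, line breaks) becomes a single space.
        string lettersOnly = Regex.Replace(lowered, "[^a-z]+", " ");

        var seen = new HashSet<string>();
        var kept = new List<string>();
        foreach (string word in lettersOnly.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            // Cut overly long words down to the first 50 characters.
            string truncated = word.Length > MaxWordLength
                ? word.Substring(0, MaxWordLength)
                : word;

            if (StopWords.Contains(truncated))
            {
                continue;
            }

            // HashSet.Add returns false for a word that was already seen,
            // so only the first occurrence of each word is kept.
            if (seen.Add(truncated))
            {
                kept.Add(truncated);
            }
        }

        return string.Join(" ", kept);
    }
}
```

With this sketch, `CommentCleaner.CleanText("Some text some text continuation abc 123")` returns `some text continuation abc`. Replacing removed characters with a space keeps words separated by punctuation from being glued together, at the cost of splitting contractions such as "don't" into two tokens.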