After writing some web scraping and feature extraction code, a few command-line utilities, and some Cython extensions, I tried the approach of a train script followed by an evaluate script. This turned out not to be a good idea. For one thing, the computing resources the project required were more than my own machines could handle. For another, loading any of the datasets into RAM in full was too wasteful. So the work became more about reliably getting the data into the database and extracting all the NLP features; only then could I worry about how to run the learning experiments. Once that was done, I had a database of nearly 230 GiB holding some 55k reviews and all the NLP and non-NLP features. I decided the best way forward was to use the partial_fit methods of certain learners in scikit-learn, i.e., to do online learning, in which a model is updated one batch at a time rather than fit on the whole dataset at once. My focus then shifted to developing a utility, learn, for running incremental learning experiments with a small set of learners and with grid search. The project has now reached the point where this utility works, so I thought it would be a good time to release a version of the project.
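
To make the online-learning idea concrete, here is a minimal sketch of training with partial_fit one batch at a time. It is not the code of the learn utility: the helper names iter_batches and train_incrementally are hypothetical, the synthetic two-class data stands in for batches that would really be pulled from the database, and SGDClassifier is simply one scikit-learn learner that supports partial_fit.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_batches(n_batches, batch_size, n_features, seed=0):
    """Yield (X, y) mini-batches of synthetic data.

    Hypothetical stand-in for a database cursor that returns
    feature vectors and labels a few documents at a time.
    """
    rng = np.random.default_rng(seed)
    for _ in range(n_batches):
        y = rng.integers(0, 2, size=batch_size)
        # Class 0 is centered at -2 and class 1 at +2 in every feature.
        centers = (2 * y - 1)[:, None] * 2.0
        X = rng.normal(loc=centers, scale=1.0, size=(batch_size, n_features))
        yield X, y


def train_incrementally(batches, classes):
    """Update a linear model batch by batch instead of fitting all at once."""
    model = SGDClassifier(random_state=0)
    for X, y in batches:
        # The full set of labels must be given on the first call because
        # a single batch may not contain every class.
        model.partial_fit(X, y, classes=classes)
    return model


classes = np.array([0, 1])
model = train_incrementally(iter_batches(10, 100, 5), classes)

# Evaluate on a separate batch drawn with a different seed.
X_test, y_test = next(iter_batches(1, 500, 5, seed=1))
accuracy = model.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")
```

Because only one batch is in memory at a time, peak RAM use depends on the batch size rather than on the size of the database, which is the property that makes this approach workable for a dataset of this scale.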