The goal of this IPython Notebook is to introduce some tools in Python for data processing and machine learning. It is by no means exhaustive in any aspect of Python, machine learning, data processing in Python, or any combination of these. The classifier used in this Notebook is Logistic Regression, one of the simplest machine learning classifiers. The data set on which the examples are run is taken from the Kaggle Titanic Challenge. This example was chosen specifically because there are many tutorials, IPython Notebooks, articles, blogs and other resources online on the Titanic Challenge that can help one get started.
To view this notebook online, click on this link. The notebook assumes some facility in working with Python and requires the following:
1. Python
2. IPython (optional, since you can run the Python commands from the IPython
notebook in your native Python interpreter)
3. Numpy
4. Scipy
5. Pandas
6. Scikit-Learn
Installation methods vary depending on the Operating System. Here is a great link on completing a setup in Python for scientific purposes.
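Once the setup is done, a quick check can confirm that the packages listed above are visible to your Python interpreter. The snippet below is a minimal sketch using only the standard library; the helper name `check_packages` is made up for this illustration, and the import names (`sklearn` for Scikit-Learn, `IPython` for IPython) are the usual ones for these packages.

```python
import importlib.util

# Import name -> display name for the packages listed above.
REQUIRED = {
    "IPython": "IPython (optional)",
    "numpy": "Numpy",
    "scipy": "Scipy",
    "pandas": "Pandas",
    "sklearn": "Scikit-Learn",
}


def check_packages(modules):
    """Return {import_name: True/False} depending on whether each module can be found.

    find_spec only locates the module; it does not import it, so the check
    is fast and has no side effects.
    """
    return {name: importlib.util.find_spec(name) is not None for name in modules}


status = check_packages(REQUIRED)
for name, found in status.items():
    label = "found" if found else "MISSING"
    print(f"{REQUIRED[name]:<20} {label}")
```

Anything reported as missing can then be installed with whichever method suits your operating system.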
Below are pointers to some resources that might help one get started.
Read about it here
A Simple Explanation from Duke Medicine
Logistic Regression for Classification
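To make the idea behind these resources concrete, here is a minimal sketch of logistic regression with a single feature, written in plain Python. It is an illustration only, not how Scikit-Learn implements its classifier: the function names, the learning rate, the number of epochs and the toy data are all assumptions chosen for this example. The model passes a weighted input through the logistic (sigmoid) function to get a probability, and the weights are fitted by batch gradient descent on the log loss.

```python
import math


def sigmoid(z):
    """Logistic function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, lr=0.1, epochs=2000):
    """Fit a weight w and bias b for one feature by batch gradient descent.

    The gradient of the mean log loss with respect to w is
    mean((p - y) * x) and with respect to b is mean(p - y),
    where p = sigmoid(w * x + b).
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = 0.0
        grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def predict(x, w, b, threshold=0.5):
    """Classify x as 1 if the predicted probability reaches the threshold."""
    return 1 if sigmoid(w * x + b) >= threshold else 0


# Made-up toy data: small x values belong to class 0, large ones to class 1.
xs = [0.5, 1.0, 1.5, 2.0, 3.0, 3.5, 4.0, 4.5]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = fit_logistic(xs, ys)
print("prediction for x=1.0:", predict(1.0, w, b))
print("prediction for x=4.0:", predict(4.0, w, b))
```

In practice one would use a library implementation such as the one in Scikit-Learn, which handles multiple features, regularization and numerical stability.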
Course from Coursera. This does not require one to download and install Python. They have a version for the course that runs off the browser interactively.
The best intro, I think, from the Python docs
The Tentative Numpy Tutorial is a good place to start.
Here is a great introduction to Machine Learning with Scikit-Learn. It's a tutorial from PyCon 2014.
The Python Pandas Cookbook Lecture Series on YouTube by Alfred Essa is a good place to start. Alfred Essa covers loading our Titanic data set specifically here in Lesson 1.2.
A tutorial from Kaggle on Python
A tutorial from Kaggle on Pandas
A tutorial from Kaggle on SKLearn
A Fancy Notebook showing off many aspects of the Titanic data problem