Welcome, RAILer! The purpose of this guide is to get you up to speed as fast as possible with the tools that RAIL tends to use. The folders in the repo contain tutorials, mostly in Jupyter Notebook format (formerly known as IPython Notebook). If you're familiar with Jupyter Notebooks, then we recommend going through the interactive Notebooks for the lessons you need. If you're not familiar with them, then first install and get to know them.
We tend to use the standard Python data science & machine learning stack:
- Development: Jupyter Notebooks – an interactive, easy-to-share interface for creating "Notebooks" of code and results that anyone else can download and interact with. We recommend this over working on flat files.
- Data: Pandas – a Python library that makes loading, cleaning, exploring, and analyzing data really easy.
- Machine Learning: Scikit-learn and others – Scikit-learn contains a lot of pre-made, well-tested machine learning algorithms. Most of the key ones can be called with the same methods: `model.fit(X, y)` and `model.predict(X_test)`.
- Other libraries we've used: TensorFlow (mostly for deep learning), Keras and PyTorch (deep learning), Edward (probabilistic modeling, advanced level).
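To make the shared `fit`/`predict` convention concrete, here is a minimal sketch. The dataset (scikit-learn's bundled iris data), the two model choices, and the split settings are assumptions picked for illustration, not something this guide prescribes.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Illustrative data: the small iris dataset that ships with scikit-learn.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Two unrelated algorithms, driven through the same two method calls.
models = [
    LogisticRegression(max_iter=1000),
    DecisionTreeClassifier(random_state=0),
]

predictions = {}
for model in models:
    model.fit(X_train, y_train)             # learn from the training split
    y_pred = model.predict(X_test)          # predict labels for unseen rows
    predictions[type(model).__name__] = y_pred
    print(type(model).__name__, "accuracy:", model.score(X_test, y_test))
```

Because both estimators expose the same methods, swapping one algorithm for another usually means changing only the line that constructs the model.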
We've also included some extra reading on machine learning, to build your intuition about:
- What types of machine learning are there?
- When is each type used?
Check out the Jupyter Notebooks Introduction folder to learn how to install and use Jupyter Notebooks.
Check out the Pandas Cookbook folder for Julia Evans' phenomenal sequence of 9 Pandas tutorial notebooks (taken from this repo). If you can do these, you'll be moving pretty fast. Highly recommended.
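For a quick taste of the load, clean, and explore workflow that Pandas is used for, here is a small sketch. The inline CSV text and its column names (`city`, `temperature_c`) are made up for illustration and are not taken from the cookbook notebooks.

```python
import io

import pandas as pd

# Hypothetical data, kept inline so the example runs without any files.
csv_text = """city,temperature_c
Oslo,4.0
Oslo,6.0
Lima,19.0
Lima,
"""

# Load: read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

# Clean: drop rows where the temperature reading is missing.
clean = df.dropna(subset=["temperature_c"])

# Explore: average temperature per city.
mean_by_city = clean.groupby("city")["temperature_c"].mean()
print(mean_by_city)
```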
- Basic ML Concepts – a high-level overview of Week 1 of Pedro Domingos' great Coursera course.
- Types of Machine Learning – a high-level overview of what supervised, unsupervised, semi-supervised, and reinforcement learning are.
- Types of Algorithms – a clustering (ha) of types of common machine learning algorithms.
- Flowchart: Choosing algorithms – while not comprehensive, this offers useful intuition about when to choose a type of algorithm.
- Visualized: Decision Trees – a beautiful introduction to basic concepts, decision trees, and the learning process.
1. Concepts - Machine Learning – a set of notebooks to introduce you to applying a subset of ML concepts. I recommend looking at the PDF guides for conceptual learning and the notebooks for implementation. (Credit to John Wittenauer)
2. Example - Titanic Survival – an exercise where you'll predict the likelihood of survival for people onboard the Titanic using real data. This is a famous introductory example! (Credit to Andrew Conti)
3. Tools - Scikit-learn Tutorial – a set of notebooks to introduce you to various tools within scikit-learn. (Credit to Jake Vanderplas)
- The Stanford/Andrew Ng Machine Learning Course – a number of RAILers (strategists and engineers) have done this course during their RAIL project and found it both fascinating and useful.
- Python/Numpy Tutorial – if you've never used NumPy before, or want to understand Python, I recommend this tutorial. (Built for the Stanford Convolutional Neural Network class.)
- Python is the common programming language that we use. It supports both functional and object-oriented styles. You will probably find that object-oriented code is cleaner and easier to debug, while functional code is faster to write.
- NumPy, or Numerical Python, is the fundamental Python library for scientific computing. It lets you easily manipulate data, analyze matrix-style data, and do linear algebra.
- SciPy, or Scientific Python, is a collection of libraries (including NumPy) that contains sophisticated scientific computing functions. For example, `scipy.stats` contains some advanced statistical functions that NumPy doesn't have.
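As a rough illustration of how the two libraries divide the work, the sketch below solves a small linear system with NumPy and then uses `scipy.stats` for a normal-distribution calculation. The matrix, vector, and probability values are arbitrary numbers chosen for the example.

```python
import numpy as np
from scipy import stats

# NumPy: matrix-style data and linear algebra.
# Solve the system  2x + y = 3,  x + 3y = 5.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([3.0, 5.0])
x = np.linalg.solve(A, b)
print("solution:", x)

# SciPy: statistical distributions, here the standard normal.
p_below_zero = stats.norm.cdf(0.0)   # probability of a value below the mean
z_975 = stats.norm.ppf(0.975)        # z-score at the 97.5th percentile
print("P(Z < 0):", p_below_zero)
print("97.5th percentile z-score:", z_975)
```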