This primer will take you through some of the tools Python has for data science: mathematical operations, statistics, visualization, machine learning, etc.
I will assume knowledge of Python and some basic knowledge of the topics. I won't be delving into the mathematical details of how the tools work. Instead, I will focus on what they do, why you might use them, and how to use them.
All the content will be in the form of Jupyter notebooks. You can view it all directly on GitHub without installing anything, but I'd recommend playing along to get the most out of it. If you're completely new to all of the tools being introduced, I'd recommend going in the order outlined in this README because I will be building on the content as I go along.
You need to have Python 3 installed. If you're on Windows, I highly recommend using Anaconda.
After that, open a command line and run:
pip install -r requirements.txt
Now from the root of this repository run:
jupyter notebook
to launch Jupyter, which will open a browser window where you can navigate through the files of the repo.
First up is NumPy. NumPy, short for Numerical Python, is the foundation of pretty much every mathematical Python library. Its primary function is doing matrix operations. It does a lot more, but I will be focusing on the essentials.
The contents of the NumPy notebook are:
- Arrays
- Matrices
- Array Creation Functions
- Generating Random Arrays
- Reshape
- Mathematical Operations
- Statistics
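To give a rough taste of those topics, here is a minimal sketch. The array values and variable names are made up for illustration and are not taken from the notebook.

```python
import numpy as np

# Array creation and reshape: six integers arranged as 2 rows x 3 columns
a = np.arange(6).reshape(2, 3)        # [[0, 1, 2], [3, 4, 5]]

# Mathematical operations: element-wise scaling and matrix multiplication
doubled = a * 2
product = a @ a.T                     # (2, 3) @ (3, 2) -> (2, 2)

# Random arrays: a seeded generator keeps results reproducible
rng = np.random.default_rng(0)
noise = rng.normal(size=(2, 3))

# Statistics: mean over everything, then per column
total_mean = a.mean()
column_means = a.mean(axis=0)

print(product)
print(total_mean, column_means)
```

The `axis` argument is worth getting used to early: `axis=0` collapses the rows and gives one value per column.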
Matplotlib is the most commonly used Python library for creating 2D plots. Its API is inspired by MATLAB.
The contents of the Matplotlib notebook are:
- Line Graphs
- Scatter Plots
- Combining Plots and Creating Legends
- Histograms
- Styling
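As a small sketch of combining a line graph and a scatter plot with a legend (the data and the output filename `example_plot.png` are my own placeholders, not something from the notebook):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for scripts; not needed inside Jupyter
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 50)

fig, ax = plt.subplots()
ax.plot(x, np.sin(x), label="sin(x)")                        # line graph
ax.scatter(x[::5], np.cos(x[::5]), label="cos(x) samples")   # scatter plot
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend()                                                  # uses the label= values above

fig.savefig("example_plot.png")  # hypothetical filename
```

In a notebook you can drop the `matplotlib.use("Agg")` line and the `savefig` call, since the figure is rendered inline.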
Pandas is a library that provides data structures for doing data analysis. It is similar to having access to an Excel spreadsheet in Python.
The contents of the Pandas notebook are:
- DataFrames
- Operations and Filtering
- Merging DataFrames
- Grouping Rows by Value
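Here is a minimal sketch of filtering, merging, and grouping. The two small tables and their column names are invented for this example.

```python
import pandas as pd

# Two small DataFrames that share a "team" column (made-up data)
people = pd.DataFrame({
    "name": ["Ann", "Bob", "Cy"],
    "team": ["a", "b", "a"],
    "score": [90, 72, 85],
})
teams = pd.DataFrame({"team": ["a", "b"], "city": ["Oslo", "Lima"]})

# Filtering: keep only the rows where the condition holds
high = people[people["score"] > 80]

# Merging: join the two tables on their shared column
merged = people.merge(teams, on="team")

# Grouping rows by value: mean score per team
mean_by_team = merged.groupby("team")["score"].mean()

print(high)
print(mean_by_team)
```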
StatsModels is a library for running statistical models.
The Regression notebook includes:
- OLS Linear Regression
- Using OLS Linear Regression to Do Polynomial Regression
- Categorical Variables in OLS Linear Regression
scikit-learn is a Python library for machine learning.
The classification notebook includes:
- Naive Bayes
- K-Nearest Neighbors
- Support Vector Machines
- Decision Trees
- Random Forest
- Evaluating Model Results
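All of these classifiers share the same fit/predict pattern, so a sketch with one of them carries over to the rest. The example below uses K-Nearest Neighbors on scikit-learn's bundled Iris dataset; the dataset and the parameter values are my choices for illustration and may differ from what the notebook uses.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Iris is assumed here only because it ships with scikit-learn
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the data so the model is scored on unseen samples
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)

# Evaluating model results: fraction of test samples predicted correctly
accuracy = accuracy_score(y_test, model.predict(X_test))
print(accuracy)
```

Swapping in another classifier, for example `GaussianNB` or `RandomForestClassifier`, only changes the line that builds `model`.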
The dimensionality reduction notebook includes:
- Principal Component Analysis (PCA)
- PCA + Classification
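As a sketch of both ideas, the code below reduces the Iris data to two principal components and then chains PCA with a classifier in a pipeline. The dataset, the choice of Gaussian Naive Bayes, and the number of components are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

X, y = load_iris(return_X_y=True)

# PCA: project the 4 original features onto the 2 directions of highest variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
explained = pca.explained_variance_ratio_.sum()  # share of variance that is kept

# PCA + classification: the pipeline refits PCA inside each cross-validation fold
pipeline = make_pipeline(PCA(n_components=2), GaussianNB())
scores = cross_val_score(pipeline, X, y, cv=5)

print(X_reduced.shape, explained)
print(scores.mean())
```

Putting PCA inside the pipeline, rather than transforming the whole dataset up front, keeps information from the test folds out of the fitted projection.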
PyBrain is another machine learning library. It has some overlap with scikit-learn, but its major focus is on neural networks.
The PyBrain neural network notebook includes:
- Function Approximation
- Classification
Contributions are more than welcome - from additional functionality I skipped over to whole new packages I didn't include. Here's a list of things I've already identified that I'd like to add.
Code released under the MIT license.