Materials are written by Adrien Couturier. The sessions are taught by Ivan Sayapin, Jiong Wei Lua and Adrien Couturier.
This will be a great opportunity for all students interested in getting involved with ML@LSE to meet the committee and hear about the slew of events we have lined up for the upcoming Academic Year.
As we only have one hour, we will try to focus on:
- A gentle introduction to machine learning, with interactive visualisations and interesting use cases
- Sharing the events we have lined up for the term ahead, such as our pioneering Industry Mentorship Programme with Datatonic, as well as what our bootcamp expects to cover
- Getting to know what people are interested in so that we can improve our activities
- If there is time and/or demand, a short interactive activity on building a highly accurate image classifier for the MNIST digits dataset in under 500 lines of code (a minimal sketch follows this list)
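To give a flavour of how little code this takes, here is a minimal sketch of such a classifier. It uses scikit-learn's small bundled digits dataset as a stand-in for full MNIST and a plain logistic regression as the model; the actual activity may use different data loading and a different model.

```python
# Minimal digit classifier: scikit-learn's bundled 8x8 digits
# dataset stands in for full MNIST here.
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

clf = LogisticRegression(max_iter=1000)  # a simple baseline classifier
clf.fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.3f}")
```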
The machine learning wave is coming, so join us and ride it!
This first workshop aims to cover the general principles of machine learning theories and techniques.
We also aim to help with installing Jupyter Notebook, the development environment we will be using for all subsequent bootcamps.
If time permits, we may also introduce some of the packages that we will be using frequently for subsequent bootcamps, such as pandas, numpy and sklearn.
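For a first taste of what these packages do, the sketch below builds a toy dataset with numpy, wraps it in a pandas DataFrame, and fits a line with sklearn; the numbers are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# numpy: fast numerical arrays
x = np.linspace(0, 10, 50)
y = 3 * x + np.random.normal(scale=2, size=50)  # noisy linear data

# pandas: labelled, tabular data with handy summaries
df = pd.DataFrame({"x": x, "y": y})
print(df.describe())

# sklearn: a consistent fit/predict interface across models
model = LinearRegression().fit(df[["x"]], df["y"])
print(model.coef_, model.intercept_)  # should be close to 3 and 0
```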
Objectives: Understand the basic concepts and notions underpinning machine learning theories and techniques.
Requirements: Basic definitions of random variables, expectation and variance. Basic knowledge of linear algebra may be useful.
Keywords: Dataset, Number of Observations, Dimensionality, Machine learning techniques, High dimensional statistics, Statistical pattern, Supervised Learning, Unsupervised learning, Learning Function, Inputs and Outputs, Training Data, Test Data, Irreducible Error, Regression, Classification, Loss, Risk, Empirical Risk, MSE, MER, Testing Errors, Overfitting, Generalization, Training vs. Testing Errors, Bias-Variance Trade-off.
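Several of these keywords fit into one small experiment. The sketch below (our own illustrative setup, not the bootcamp's actual example) fits polynomials of increasing degree to noisy data and compares training MSE with testing MSE; the high-degree fit overfits, driving training error down while testing error rises.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=100)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```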
What is Anaconda? Anaconda is a Python and R distribution. It aims to provide everything you need (Python-wise) for data science "out of the box".
It includes:
- The core Python language
- 100+ Python "packages" (libraries) such as scikit-learn, numpy and pandas
- Spyder (an IDE/editor, like PyCharm) and Jupyter
Guide to Installing Anaconda Distribution
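Once Anaconda is installed, a quick way to confirm the core packages came with it is to run the following in Spyder or a Jupyter notebook (your version numbers will differ):

```python
import sys
import numpy
import pandas
import sklearn

print(sys.version)                          # the Python shipped with Anaconda
print("numpy:", numpy.__version__)
print("pandas:", pandas.__version__)
print("scikit-learn:", sklearn.__version__)
```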
This bootcamp aims to introduce members to a category of commonly used machine learning models known as tree-based models.
We will assume that all attending members have installed the Anaconda Distribution on their computer. If you haven't, you may follow the instructions here to download it.
Objectives: Understand what tree-based methods are and how they are built. Understand how Cross-Validation can be applied to decision trees. Understand how ensemble methods can improve the power of our techniques.
Requirements: Introductory Bootcamp (you can read the slides if you didn’t attend). Familiarity with the notions of independence and correlation may be useful.
Keywords: Decision tree, Nodes, Recursive binary splitting, Pruning, Cost complexity, Bootstrapping, Bagging, Random forest, Boosting.
We will build a Random Forest model to predict survivorship in the Titanic dataset.
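As a preview, here is a minimal sketch of that exercise. It assumes a local file named "titanic.csv" with the usual Kaggle columns (the file name and feature choices are ours); the bootcamp version will go into more depth.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("titanic.csv")                   # assumed local copy of the data
df["Sex"] = (df["Sex"] == "female").astype(int)   # encode sex as 0/1
df["Age"] = df["Age"].fillna(df["Age"].median())  # fill missing ages

features = ["Pclass", "Sex", "Age", "Fare"]
X_tr, X_te, y_tr, y_te = train_test_split(
    df[features], df["Survived"], random_state=0)

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_tr, y_tr)
print(f"Test accuracy: {forest.score(X_te, y_te):.3f}")
```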
Objectives: Understand linear classification methods. Understand their generalization to non-linear classification. Get a sense of the Kernel idea and Support Vector Machines.
Requirements: Introductory Bootcamp (you can read the slides if you didn’t attend). Although not necessary, familiarity with notions of linear algebra may greatly help: hyperplanes, dot and inner products. Some familiarity with constrained optimization may help.
Keywords: Hyperplane, Margin, Maximal Margin Classifier, Soft Margin Classifier, Non-linear Boundaries, Inner Product, Kernels, Support Vector Machines.
We will build a Support Vector Machine to classify bank customers' risk of defaulting on their credit card.
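To preview the mechanics, the sketch below trains an SVM with an RBF kernel on synthetic data standing in for the credit-default dataset (the data and parameters are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the credit-default data
X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# SVMs are sensitive to feature scale, so standardise first;
# the RBF kernel yields a non-linear decision boundary.
svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
svm.fit(X_tr, y_tr)
print(f"Test accuracy: {svm.score(X_te, y_te):.3f}")
```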
Objectives: Understand two important unsupervised learning methods: Principal Component Analysis and Clustering. Understand the difference between k-means clustering and Hierarchical Clustering.
Requirements: Introductory Bootcamp (you can read the slides if you didn't attend). Although not necessary, familiarity with notions of linear algebra may help: inner product and orthogonal projections. Strong understanding of the summation operator (∑_{i=1}^{n}, ∑_{j∈C}) may help.
Keywords: PCA, Loading vector, Principal Components, Proportion of variance explained, Clustering, k-means Clustering, Within-cluster Variation, Hierarchical Clustering, Minimal Intercluster Dissimilarity, Dendrogram.
We will use PCA and Clustering Techniques to segment an unlabelled customer dataset into meaningful sub-groups.
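The pipeline looks roughly like the sketch below, which runs PCA and k-means on synthetic "customers" (the real exercise will use an actual unlabelled customer dataset, and the number of clusters is a choice, not a given):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an unlabelled customer dataset
X, _ = make_blobs(n_samples=500, n_features=10, centers=4, random_state=0)
X = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                 # project onto 2 principal components
X_2d = pca.fit_transform(X)
print("Proportion of variance explained:", pca.explained_variance_ratio_)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = kmeans.fit_predict(X_2d)         # assign each "customer" to a segment
print(labels[:20])
```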
Objectives: Understand the structure of feedforward neural networks. Understand how feedforward neural networks are trained using backpropagation.
Requirements: Introductory Bootcamp (you can read the slides if you didn’t attend). Although not necessary, familiarity with partial derivatives, the chain rule and the gradient of a multivariate function may help.
Keywords: Perceptron, Weights, Biases, Activation function, Sigmoid activation function, Neural Network, Gradient descent, Backpropagation.
We will use the Keras package to implement a simple neural network on the canonical MNIST dataset.
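A minimal version of that network might look like the sketch below (layer sizes, activations and epoch count are illustrative; the bootcamp version may differ):

```python
from tensorflow import keras

# MNIST ships with Keras; scale pixel values to [0, 1]
(x_tr, y_tr), (x_te, y_te) = keras.datasets.mnist.load_data()
x_tr, x_te = x_tr / 255.0, x_te / 255.0

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),     # 28x28 image -> 784 inputs
    keras.layers.Dense(128, activation="sigmoid"),  # one hidden layer
    keras.layers.Dense(10, activation="softmax"),   # one output per digit
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_tr, y_tr, epochs=5, validation_data=(x_te, y_te))
```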
Objectives: Understand how to extract features from text, how machine learning can be applied to text data, and how to scrape text data from the web.
Requirements: Introductory Bootcamp (you can read the slides if you didn’t attend), and basic Python programming skills
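For a taste of the feature-extraction step, the sketch below turns a toy corpus (standing in for scraped text) into a TF-IDF document-term matrix with scikit-learn:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "machine learning at LSE",
    "learning to scrape text from the web",
    "text data needs numeric features",
]
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)          # sparse document-term matrix
print(vec.get_feature_names_out())     # the learned vocabulary
print(X.toarray().round(2))            # TF-IDF weights per document
```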
Objectives: Understand the intuition behind the LIME technique, and be able to leverage the LIME package in Python for your own uses.
Requirements: Introductory Bootcamp (you can read the slides if you didn’t attend), and basic Python programming skills
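As a preview of the LIME package in action, the sketch below explains a single prediction of a random forest on the Iris dataset (the model and data are illustrative; install the package with pip install lime):

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
clf = RandomForestClassifier(random_state=0).fit(iris.data, iris.target)

explainer = LimeTabularExplainer(
    iris.data,
    feature_names=iris.feature_names,
    class_names=list(iris.target_names),
    mode="classification",
)
exp = explainer.explain_instance(iris.data[0], clf.predict_proba,
                                 num_features=4)
print(exp.as_list())   # per-feature contributions to this one prediction
```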