Welcome to this document! Let me guess... You are here because you want to get started with Machine Learning, but don't really know where to begin. Like I always say, recapping a bit of the Math needed to go ahead with ML is a good idea before actually delving deep into the algorithms. I have created this extensive guide document for you to get started with ML, specifically the Python environment needed for ML, some basic numpy, matplotlib and pandas operations and also some of the Math involved in Linear Regression. I have tried to cover everything that I know and things that I feel you need to know to get the gears moving. I hope this helps you out.
First off, let's start with installing and setting-up Jupyter Notebook on your local system, for future use. Then, we'll move on to the Math part.
Before going to the installation part, let's talk a bit about what is Jupyter Notebook. Also, what is the problem with other IDEs that we felt the need to use Jupyter Notebook?
Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. It is one of the most popular IDEs for Data Science and Machine Learning. In Jupyter Notebook, you have blocks of code that we call cells, that can be run without running the whole program, i.e., the notebook. This is very useful when you are working with large datasets and you want to see the output of some particular blocks of code. Besides adding comments like in a typical Python program, you can also add Markdown texts, to better explain what's going on in your notebook using links and images and texts, of course.
But why Jupyter? Why not PyCharm or VSCode? Well, in Jupyter we use the notebook style of programming which is very helpful when you want to run particular cells. In other scripting platforms, you'll have to type in the program, save it, then run it as a whole. As you'll see later in this document, running the whole program each time to get outputs isn't very practical in the field of ML. So, we use Jupyter Notebook.
Open your cmd/terminal and type in the following command:
pip install notebook
: This will install Jupyter Notebook on your system, via pip, the official package manager for Python.jupyter notebook
: This will open up Jupyter Notebook, in a new tab in your browser. Here, you can create new notebooks and run them, like you'll see later.
See! It's that simple to install and setup Jupyter Notebook. Now, let's get started with the basics of the Math required, after which we'll see how we can implement the learned Math using various Python libraries like numpy, matplotlib and statistics.
Below are some formulas that you have already learned in high school, but have most likely forgotten. So, let's recap them, shall we?
-
Mean (Average):
-
Median:
-
Standard Deviation:
-
z-Score (Standard Score):
-
Variance:
Given below is a code snippet to try out the further below mentioned plots using the very famous matplotlib library of Python, that you can install using pip install matplotlib
from your cmd.
import numpy as np
import matplotlib.pyplot as plt
# Let's create a list of numbers for x and y axes.
x = [1, 2, 3, 4, 5]
y = [2, 5, 6, 0, 9, 4]
# Let's also get data for a random normal distribution here.
# The following data has mean = 0 and standard deviation = 1.
normal = np.random.normal(0, 1, 100)
# Insert the code for plotting a graph here
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.title('Plot Example')
plt.show()
Type of Plot | Example Code (goes in the comment line in the above snippet) | Why are they used? |
---|---|---|
Continuous Line Plot | plt.plot(x, y) |
Used to get a simple plot of any data, and perhaps compare the output of a function or dataset with that of some other. Also used to clearly visualize the relationships between different data. |
Discrete Plot | plt.stem(x, y) |
Used to plot data that is not continuous in nature, i.e., discrete. For example, the number of students in a class. |
Scatter Plot | plt.scatter(x, y) |
Used to visualize the relationship between two variables, containing data, by displaying individual data points as dots on a plane. |
Histogram | plt.hist(normal, bins=20) |
Used to better understand the underlying distribution of a data. This divides the data into bins and plots the frequency of each bin. |
Boxplot | plt.boxplot(normal, vert=False) |
Used to better visualize the medians, quartiles and outliers in given data. |
Violinplot | plt.violinplot(normal, showmedians=True, vert=False) |
More types of plots and their examples can be found here.
Now that you are aware of the various basic types of graphs, let's dive into the world of matrices and linear algebra. We start off with some basic numpy functions that you need to know. But before, you will need to install numpy on your system using pip install numpy
, if you haven't installed it already.
With numpy installed, let's import numpy into the notebook, using the following line:
import numpy as np
These were some basic numpy functions that you need to know to get started. For further reading, you can check out the numpy documentation.
This was all for now, but I'll keep updating this document as and when I get time. I plan to add the basics of pandas here as well in the future. But for now, I hope this helps you out.
#####~ Soumyadeep Bose, 2k23 😅