In this section, we'll prepare the environment.

You need:
- Python 3.10 (note that the videos use 3.8)
- NumPy, Pandas and Scikit-Learn (latest available versions)
- Matplotlib and Seaborn
- Jupyter notebooks
- This video shows a complete end-to-end environment configuration for an AWS EC2 instance
- The video was created for another course (MLOps Zoomcamp), so you'll need to adjust it slightly: clone this repo instead of the MLOps one
- You can use these instructions for setting up your local Ubuntu
Note for WSL
- Most of the instructions from the previous video apply to WSL too
- For setting up Docker, install Docker Desktop on Windows; it will be used automatically in WSL. You don't need to install docker.io
The easiest way to set up the environment is to use Anaconda or Miniconda.
Anaconda comes with everything we need (and much more). Miniconda is a smaller version of Anaconda that contains only Python.
Follow the instructions on the download page to install the correct package for your system. The site will automatically detect your operating system and suggest the correct package.
If you are using Windows, you can use WSL, but the plain Windows version should work too.
Anaconda is recommended.
It is a good idea to set up a dedicated environment for the course.

In your terminal, run this command to create the environment:

```bash
conda create -n ml-zoomcamp python=3.10
```

Activate it:

```bash
conda activate ml-zoomcamp
```

Installing libraries:

```bash
conda install numpy pandas scikit-learn seaborn jupyter
```
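After the install finishes, you can quickly check that nothing is missing. Here is a minimal sketch using only the standard library; the `missing_packages` helper is my own name for illustration, not part of the course materials:

```python
import importlib.util

def missing_packages(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Note: the conda package is "scikit-learn", but the import name is "sklearn".
required = ["numpy", "pandas", "sklearn", "matplotlib", "seaborn"]
print(missing_packages(required))  # an empty list means everything is installed
```

Run this inside the activated `ml-zoomcamp` environment; any names it prints still need to be installed.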
Later in the course you will also need to install XGBoost and TensorFlow, but you can skip this part for now.
Instead of running things locally, you can use online services or rent a server.

You can rent an instance on AWS.
Google Cloud Platform offers $300 in free credits when you sign up, which you can use for taking the course.
There are also services that let you host and run notebooks. Note that notebooks alone are not sufficient for the course: for the deployment modules you will need access to a command line interface with Docker, Python and the other libraries installed.
To use Kaggle to open and run the Jupyter notebooks provided as part of this course, do the following:

Pre-requisites: you need a Kaggle account (it's free) and to be logged into Kaggle.

1. Find the URL of the notebook.
2. To open the notebook in Kaggle, paste the URL into your web browser with the prefix https://kaggle.com/kernels/welcome?src= added before the URL of the notebook.
3. Check if the notebook reads data from a datafile: look for pd.read_csv("somefilename.csv") in the code and note the file name.
4. You need to download the file into Kaggle. For this:
   a. Find the URL of the datafile in GitHub.
   b. Suppose the URL is https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/chapter-02-car-price/data.csv — you need the URL of the raw file, which will look something like https://raw.githubusercontent.com/alexeygrigorev/mlbookcamp-code/master/chapter-02-car-price/data.csv
5. In the notebook opened in Kaggle, add a code cell with the command to download the file: !wget your-datafile-url

This way you can start with the exercise using Kaggle.
To use Google Colab to open and run the Jupyter notebooks provided as part of this course, do the following:

Pre-requisites: you need a Google account (any Gmail account) and to be logged into that account.

The steps for Google Colab are the same as for Kaggle, except for step 2:

2. To open the notebook in Google Colab, paste the URL into your web browser with https://github.com/ in the notebook URL replaced by https://colab.research.google.com/github/
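The Colab substitution is also a simple string replacement and can be sketched the same way; the helper name is my own for illustration:

```python
def to_colab_url(github_url):
    """Rewrite a github.com notebook URL so it opens in Google Colab."""
    return github_url.replace(
        "https://github.com/", "https://colab.research.google.com/github/", 1
    )

nb = "https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/some-notebook.ipynb"
print(to_colab_url(nb))
```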