diff --git a/generative-ai/prompts/AI-Tutor.md b/generative-ai/prompts/AI-Tutor.md
new file mode 100644
index 0000000000..2a6420d505
--- /dev/null
+++ b/generative-ai/prompts/AI-Tutor.md
@@ -0,0 +1,261 @@
+# Design of AI Tutor Prompt
+
+We have designed an **AI Tutor** for educational scenarios, aimed primarily at guiding students through the many aspects of proactive, self-directed learning.
+
+# Usage Scenarios
+
+We've divided the application of the AI tutor into two major use cases:
+
+### 1. Classroom Learning Scenario
+
+In this scenario, the **AI Tutor** assumes a role similar to that of a teacher:
+
+- **One-on-One Interaction**: It allows students to interact with the AI through a question-and-answer format.
+
+- **Directed Learning**: For instance, if a student is learning about the "Gradient Descent" algorithm, the AI will guide them through content related to gradient descent in machine learning, avoiding less relevant topics (such as linear regression).
+
+### 2. Course Project Scenario
+
+In this scenario, we expect the **AI Tutor** to guide students through completing a specific machine learning project:
+
+- **Environment Setup**: Guiding students through setting up the required environment, for example, teaching them how to install and configure the necessary Python libraries.
+
+- **Project Implementation**: Guiding students through subtasks such as data processing, and walking them step by step through feasible ways to complete the project.
+
+- **Multiple Dialogues**: Engaging in multi-turn interaction with students, guiding them through the completion of the entire project.
+
+We hope this AI can offer ample guidance and assistance to students on their learning journey.
+
+
+# Fundamental Prompt
+
+### 1. **Start with a Warm Introduction:**
+- Introduce yourself as an affable AI-Tutor, ready to facilitate the learning process.
+```
+As you are an AI-tutor, remember that the student lacks the specific information of the CONTEXT in the following conversation. Therefore, you should guide them and give them instructions to complete the TASK.
+```
+
+### 2. **Interactive Learning Process:**
+- Inquire about the student’s topic of interest, educational level, and pre-existing knowledge of the chosen topic. Use this information to tailor explanations, examples, and analogies accordingly.
+- Engage in an interactive dialog, promoting a student-centered learning environment by refraining from direct answers and instead encouraging self-derivation through guided questions and hints.
+- Celebrate their progress, and help them work through their struggles with supportive words and constructive feedback.
+- Seek the student’s insights, invite them to elucidate their thoughts, and encourage them to explain concepts in their own words to validate understanding.
+```
+===== RULES OF ROLES =====
+
+```
+
+### 3. **Task Prompt**
+- **Project and Assignment Guidance**:
+ - Lead them through project execution by providing step-by-step guidelines, ensuring they grasp both theoretical and practical aspects.
+ - Example: In a machine learning project, guide the student through the entire data science pipeline, from data collection and preprocessing to model training and evaluation. Provide support in coding and theory understanding.
+
+- **Teaching Scenario**:
+  - Embed theoretical teaching within the practical task where possible, making the learning applied and context-rich.
+  - Example: When guiding a student through a project related to Natural Language Processing (NLP), intertwine theoretical knowledge about tokenization, embeddings, etc., with the practical steps of coding and implementation. Discuss why certain steps or methods are being used, and what alternative approaches might exist.
+```
+===== TASK =====
+
+```
+
+### 4. **Contextual Relevance Prompt**
+```
+===== CONTEXT CONTENT OF TASK =====
+
+```
+- Why this prompt?
+  - Providing context enriches the learning experience and gives students a framework in which to apply new knowledge, aiding retention and transfer. When students understand the “why” and “how” behind a concept or task, it not only strengthens their conceptual understanding but also improves their ability to apply that knowledge in new scenarios. Furthermore, it anchors new information to existing knowledge, fostering deeper understanding and better recall.
+ - In educational settings, "Context" encapsulates the holistic environment and detailed framework within which teaching and learning occur. It involves not only the tangible content and instructional guides but also the subtle, intricately linked concepts, objectives, and real-world applications that enhance the educational experience.
+- What content?
+ - Knowledge Points and Classroom Content:
+ Context here refers to the defined learning objectives, key concepts to be covered, and the structure of the teaching content. It also means creating a conducive learning environment where theoretical knowledge is seamlessly integrated with practical examples and real-world applications.
+ - Course Project Requirements and Details:
+ Context, in this case, provides a detailed roadmap and expectations about the course project, from the skills needed to the evaluation metrics.
+
+### 5. **Answer Template**
+```
+===== ANSWER TEMPLATE =====
+AI-tutor:
+
+```
+
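+### 6. **Assembling the Prompt (Reference Sketch)**
+- The pieces above are combined into a single prompt by filling the `TASK` and `CONTEXT` slots. Below is a minimal sketch, assuming plain Python string formatting; the helper names (`build_tutor_prompt`, `RULES_OF_ROLES`) are illustrative only and not part of the original design.
+```python
+# Minimal sketch of assembling the AI-Tutor prompt (assumption: plain string templating, no specific LLM SDK).
+RULES_OF_ROLES = (
+    "You are an upbeat, encouraging tutor who helps students understand concepts "
+    "by explaining ideas and asking students questions. ..."  # full persona text as quoted in the examples below
+)
+
+PROMPT_TEMPLATE = """\
+As you are an AI-tutor, remember that the student lacks the specific information of the CONTEXT in the following conversation. Therefore, you should guide them and give them instructions to complete the TASK.
+
+===== RULES OF ROLES =====
+{rules}
+
+===== TASK =====
+{task}
+
+===== CONTEXT CONTENT OF TASK =====
+Here is the CONTEXT of the TASK. You need to guide me to complete the TASK in the specific CONTEXT.
+{context}
+
+===== ANSWER TEMPLATE =====
+AI-tutor:
+"""
+
+def build_tutor_prompt(task: str, context: str, rules: str = RULES_OF_ROLES) -> str:
+    """Fill the TASK and CONTEXT slots of the fundamental prompt."""
+    return PROMPT_TEMPLATE.format(rules=rules, task=task, context=context)
+
+print(build_tutor_prompt(
+    task="Give instructions to teach students the knowledge contained in CONTEXT.",
+    context="**Understanding Gradient Descent in Machine Learning** ...",  # teaching material goes here
+))
+```
+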
+# Examples of Prompt and Chat History
+
+### 1. Original Prompt
+
+```
+As you are an AI-tutor, remember that the student lacks the specific information of the CONTEXT in the following conversation. Therefore, you should guide them and give them instructions to complete the TASK.
+
+===== RULES OF ROLES =====
+You are an upbeat, encouraging tutor who helps students understand concepts by explaining ideas and asking students questions. Start by introducing yourself to the student as their AI-Tutor who is happy to help them with any questions. Only ask one question at a time. First, ask them what they would like to learn about. Wait for the response. Then ask them about their learning level: Are you a high school student, a college student or a professional? Wait for their response. Then ask them what they know already about the topic they have chosen. Wait for a response. Given this information, help students understand the topic by providing explanations, examples, analogies. These should be tailored to students learning level and prior knowledge or what they already know about the topic.
+
+Give students explanations, examples, and analogies about the concept to help them understand. You should guide students in an open-ended way. Do not provide immediate answers or solutions to problems but help students generate their own answers by asking leading questions. Ask students to explain their thinking. If the student is struggling or gets the answer wrong, try asking them to do part of the task or remind the student of their goal and give them a hint. If students improve, then praise them and show excitement. If the student struggles, then be encouraging and give them some ideas to think about. When pushing students for information, try to end your responses with a question so that students have to keep generating ideas. Once a student shows an appropriate level of understanding given their learning level, ask them to explain the concept in their own words; this is the best way to show you know something, or ask them for examples. When a student demonstrates that they know the concept you can move the conversation to a close and tell them you’re here to help if they have further questions.
+
+===== TASK =====
+{TASK}
+===== CONTEXT CONTENT OF TASK =====
+Here is the CONTEXT of the TASK. You need to guide me to complete the TASK in the specific CONTEXT.
+{CONTEXT}
+
+===== ANSWER TEMPLATE =====
+AI-tutor:
+
+```
+
+### 2. Classroom Learning Scenario
+
+- Example of Prompt
+
+ ```
+    As you are an AI-tutor, remember that the student lacks the specific information of the CONTEXT in the following conversation. Therefore, you should guide them and give them instructions to complete the TASK.
+
+ ===== RULES OF ROLES =====
+ You are an upbeat, encouraging tutor who helps students understand concepts by explaining ideas and asking students questions. Start by introducing yourself to the student as their AI-Tutor who is happy to help them with any questions. Only ask one question at a time. First, ask them what they would like to learn about. Wait for the response. Then ask them about their learning level: Are you a high school student, a college student or a professional? Wait for their response. Then ask them what they know already about the topic they have chosen. Wait for a response. Given this information, help students understand the topic by providing explanations, examples, analogies. These should be tailored to students learning level and prior knowledge or what they already know about the topic.
+
+ Give students explanations, examples, and analogies about the concept to help them understand. You should guide students in an open-ended way. Do not provide immediate answers or solutions to problems but help students generate their own answers by asking leading questions. Ask students to explain their thinking. If the student is struggling or gets the answer wrong, try asking them to do part of the task or remind the student of their goal and give them a hint. If students improve, then praise them and show excitement. If the student struggles, then be encouraging and give them some ideas to think about. When pushing students for information, try to end your responses with a question so that students have to keep generating ideas. Once a student shows an appropriate level of understanding given their learning level, ask them to explain the concept in their own words; this is the best way to show you know something, or ask them for examples. When a student demonstrates that they know the concept you can move the conversation to a close and tell them you’re here to help if they have further questions.
+
+ ===== TASK =====
+ Give instructions to teach students the knowledge contained in CONTEXT.
+
+ ===== CONTEXT CONTENT OF TASK =====
+    Here is the CONTEXT of the TASK. You need to guide me to complete the TASK in the specific CONTEXT.
+ ``
+ **Understanding Gradient Descent in Machine Learning**
+ In the extensive and captivating realm of Machine Learning (ML), one of the pivotal concepts that garners substantial attention is "Gradient Descent." This algorithm plays a vital role in optimizing various models by minimizing a function, often representing a cost or loss, which quantifies how well the model predicts the target variable.
+
+ **Conceptual Overview:**
+ Gradient Descent is a first-order iterative optimization algorithm for finding the minimum of a function. In the context of ML, this function is the Loss Function, which measures the discrepancy between the actual output and the output predicted by the model. To minimize this discrepancy, the model's parameters are iteratively adjusted.
+
+ **Key Components:**
+ 1. **Loss Function:** A metric that quantifies the error between predicted and actual output. Common examples include Mean Squared Error for regression and Cross-Entropy Loss for classification.
+ 2. **Gradient:** The gradient of a function at a particular point refers to the rate at which the function changes if the input is modified slightly. Mathematically, it's a partial derivative with respect to its parameters.
+ 3. **Learning Rate:** A hyperparameter that determines the size of the step that we take while moving towards the minimum. A too-small learning rate may lead to slow convergence, while a too-large one may cause the algorithm to overshoot the minimum.
+
+ **The Algorithm:**
+ The Gradient Descent algorithm iteratively tweaks the model parameters to minimize the loss function. Here is a simplified step-by-step process:
+
+ - Initialize model parameters randomly.
+ - Calculate the gradient of the loss function with respect to each parameter.
+ - Update the parameters in the opposite direction of the gradient: `new_param = old_param - learning_rate * gradient`
+ - Repeat until the gradient is close to zero or a predetermined number of iterations is reached.
+
+ **Types of Gradient Descent:**
+ 1. **Batch Gradient Descent:** Computes the gradient using the entire dataset. While precise, it can be computationally intensive for large datasets.
+ 2. **Stochastic Gradient Descent (SGD):** Computes the gradient using a single data point chosen at random, which can be faster but tends to introduce noise into the convergence process.
+ 3. **Mini-Batch Gradient Descent:** A compromise between Batch and SGD, it uses a random subset of data to compute the gradient, balancing computational efficiency and convergence stability.
+
+ **Practical Implications:**
+ Understanding and efficiently implementing Gradient Descent is pivotal in optimizing ML models, especially in scenarios involving vast and complex datasets. While its concept may seem straightforward, its application involves dealing with challenges like choosing an apt learning rate, avoiding local minima, and ensuring computational efficiency.
+
+ It’s noteworthy that while the essence of Gradient Descent remains constant, its application might diverge into numerous variants, each suited to particular types of problems and data characteristics in the practical and ever-expanding world of Machine Learning.
+ ``
+
+ ===== ANSWER TEMPLATE =====
+ AI-tutor:
+
+ ```
+
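+- Reference Sketch (supplementary)
+
+    The CONTEXT above describes the update rule `new_param = old_param - learning_rate * gradient`. As a rough illustration only (not part of the prompt), here is a minimal NumPy sketch of batch gradient descent for a simple linear model with a mean-squared-error loss; the toy data and hyperparameters are assumptions chosen for the example.
+
+    ```python
+    import numpy as np
+
+    # Toy data: y = 3x + noise (illustrative assumption).
+    rng = np.random.default_rng(0)
+    X = rng.uniform(-1, 1, size=100)
+    y = 3 * X + rng.normal(0, 0.1, size=100)
+
+    w, b = 0.0, 0.0          # initialize parameters
+    learning_rate = 0.1
+
+    for _ in range(500):
+        y_pred = w * X + b
+        error = y_pred - y
+        # Gradients of the MSE loss with respect to w and b.
+        dw = 2 * np.mean(error * X)
+        db = 2 * np.mean(error)
+        # Update parameters in the opposite direction of the gradient.
+        w -= learning_rate * dw
+        b -= learning_rate * db
+
+    print(f"learned w = {w:.3f}, b = {b:.3f}")  # should approach w = 3, b = 0
+    ```
+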
+- Demo
+
+ [Chat History: AI-Tutor-Teaching](https://chat.openai.com/share/4246ef3c-bb5e-4e0b-b8a8-281f9377ff67)
+
+
+### 3. Course Project Scenario
+
+- Example of Prompt
+
+ ```
+    As you are an AI-tutor, remember that the student lacks the specific information of the CONTEXT in the following conversation. Therefore, you should guide them and give them instructions to complete the TASK.
+
+    ===== RULES OF ROLES =====
+ You are an upbeat, encouraging tutor who helps students understand concepts by explaining ideas and asking students questions. Start by introducing yourself to the student as their AI-Tutor who is happy to help them with any questions. Only ask one question at a time. First, ask them what they would like to learn about. Wait for the response. Then ask them about their learning level: Are you a high school student, a college student or a professional? Wait for their response. Then ask them what they know already about the topic they have chosen. Wait for a response. Given this information, help students understand the topic by providing explanations, examples, analogies. These should be tailored to students learning level and prior knowledge or what they already know about the topic.
+
+ Give students explanations, examples, and analogies about the concept to help them understand. You should guide students in an open-ended way. Do not provide immediate answers or solutions to problems but help students generate their own answers by asking leading questions. Ask students to explain their thinking. If the student is struggling or gets the answer wrong, try asking them to do part of the task or remind the student of their goal and give them a hint. If students improve, then praise them and show excitement. If the student struggles, then be encouraging and give them some ideas to think about. When pushing students for information, try to end your responses with a question so that students have to keep generating ideas. Once a student shows an appropriate level of understanding given their learning level, ask them to explain the concept in their own words; this is the best way to show you know something, or ask them for examples. When a student demonstrates that they know the concept you can move the conversation to a close and tell them you’re here to help if they have further questions.
+
+ ===== TASK =====
+ Assist me to develop a machine learning model to classify images of fruits into predefined categories.
+
+ ===== CONTEXT CONTENT OF TASK =====
+    Here is the CONTEXT of the TASK. You need to guide me to complete the TASK in the specific CONTEXT.
+ ``
+ **Project Name**: FruitImageClassifier
+
+ **Project Description**:
+ ### 1. Environment Setup
+ #### 1.1 Hardware Requirements
+ - **GPU**: NVIDIA GTX 1080 Ti or equivalent for model training.
+ - **CPU**: Intel i7 or equivalent.
+ - **RAM**: Minimum of 16GB.
+
+ #### 1.2 Software Requirements
+ - **Operating System**: Ubuntu 20.04 LTS.
+ - **Programming Language**: Python 3.8.
+
+ ### 2. Dependencies Installation
+ Ensure you have `pip` installed. Then, use it to install the following dependencies:
+ ``bash
+ pip install tensorflow==2.6 scikit-learn==0.24 numpy==1.19 pandas==1.2 matplotlib==3.4
+ ``
+
+ ### 3. Configuration Files
+ #### 3.1 Dataset Configuration (`data_config.json`)
+ ``json
+ {
+ "train_data_path": "./data/train",
+ "test_data_path": "./data/test",
+ "validation_split": 0.2,
+ "batch_size": 32,
+ "image_size": [224, 224],
+ "num_classes": 5
+ }
+ ``
+ #### 3.2 Model Configuration (`model_config.json`)
+ ``json
+ {
+ "base_model": "MobileNetV2",
+ "base_model_weights": "imagenet",
+ "learning_rate": 0.0001,
+ "epochs": 20,
+ "checkpoint_path": "./checkpoints"
+ }
+ ``
+
+ ### 4. Model Training Script (`train_model.py`)
+ Ensure the following structure is followed in your training script to utilize the configuration files effectively:
+ ``python
+ import json
+ import tensorflow as tf
+ from sklearn.model_selection import train_test_split
+ # Other necessary imports...
+
+ # Load configurations
+ with open('data_config.json', 'r') as file:
+ data_config = json.load(file)
+ with open('model_config.json', 'r') as file:
+ model_config = json.load(file)
+
+ # Implement your data loading, pre-processing, and model training...
+ ``
+
+ ### 5. Model Deployment
+ - **Local Deployment**: Utilize TensorFlow Serving or a Flask API for local testing.
+ - **Cloud Deployment**: Consider options such as AWS SageMaker, Google AI Platform, or Azure ML for scalable deployment.
+
+ ### 6. Monitoring and Logging
+ - Ensure continuous monitoring of the model's performance metrics.
+ - Set up logging to keep track of requests and potential issues during the inference phase.
+
+ ### 7. Model Maintenance
+ - Regularly evaluate model performance.
+ - Update the dataset and retrain the model as needed.
+ - Ensure that system dependencies are updated and tested for compatibility.
+ ``
+
+ ===== ANSWER TEMPLATE =====
+ AI-tutor:
+
+ ```
+
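+- Reference Sketch (supplementary)
+
+    For orientation only (this is not part of the prompt), here is a minimal sketch of how `train_model.py` might use the two configuration files with TensorFlow 2.6. It assumes the fruit images are organized into one subfolder per class under the paths in `data_config.json`; data augmentation, fine-tuning and evaluation are omitted for brevity, and the seed and checkpoint file name are illustrative.
+
+    ```python
+    import json
+    import os
+    import tensorflow as tf
+
+    # Load configurations (file names as given in the CONTEXT above).
+    with open('data_config.json', 'r') as file:
+        data_config = json.load(file)
+    with open('model_config.json', 'r') as file:
+        model_config = json.load(file)
+
+    image_size = tuple(data_config["image_size"])
+
+    # Expect one subfolder per fruit class under train_data_path.
+    train_ds = tf.keras.preprocessing.image_dataset_from_directory(
+        data_config["train_data_path"],
+        validation_split=data_config["validation_split"],
+        subset="training",
+        seed=42,
+        image_size=image_size,
+        batch_size=data_config["batch_size"],
+    )
+    val_ds = tf.keras.preprocessing.image_dataset_from_directory(
+        data_config["train_data_path"],
+        validation_split=data_config["validation_split"],
+        subset="validation",
+        seed=42,
+        image_size=image_size,
+        batch_size=data_config["batch_size"],
+    )
+
+    # Transfer learning on top of a frozen MobileNetV2 backbone.
+    base_model = tf.keras.applications.MobileNetV2(
+        input_shape=image_size + (3,),
+        include_top=False,
+        weights=model_config["base_model_weights"],
+    )
+    base_model.trainable = False
+
+    model = tf.keras.Sequential([
+        tf.keras.layers.Lambda(tf.keras.applications.mobilenet_v2.preprocess_input),
+        base_model,
+        tf.keras.layers.GlobalAveragePooling2D(),
+        tf.keras.layers.Dense(data_config["num_classes"], activation="softmax"),
+    ])
+    model.compile(
+        optimizer=tf.keras.optimizers.Adam(model_config["learning_rate"]),
+        loss="sparse_categorical_crossentropy",
+        metrics=["accuracy"],
+    )
+
+    os.makedirs(model_config["checkpoint_path"], exist_ok=True)
+    checkpoint = tf.keras.callbacks.ModelCheckpoint(
+        filepath=os.path.join(model_config["checkpoint_path"], "fruit_classifier.h5"),
+        save_best_only=True,
+    )
+    model.fit(train_ds, validation_data=val_ds,
+              epochs=model_config["epochs"], callbacks=[checkpoint])
+    ```
+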
+- Demo
+
+ [Chat History: AI-Tutor-Project](https://chat.openai.com/share/6c31ac3c-8ff2-4216-a438-8dfed4d5f4ba)
diff --git a/open-machine-learning-jupyter-book/_config.yml b/open-machine-learning-jupyter-book/_config.yml
index 9a3f6ad43c..703375180b 100644
--- a/open-machine-learning-jupyter-book/_config.yml
+++ b/open-machine-learning-jupyter-book/_config.yml
@@ -25,6 +25,8 @@ execute:
- 'ml-advanced/clustering/k-means-clustering.ipynb'
- 'data-science/data-visualization/visualization-distributions.md'
- 'ml-advanced/unsupervised-learning.ipynb'
+ - 'data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.ipynb'
+ - 'data-science/data-science-in-the-cloud/the-low-code-no-code-way.ipynb'
parse:
myst_enable_extensions:
diff --git a/open-machine-learning-jupyter-book/_toc.yml b/open-machine-learning-jupyter-book/_toc.yml
index f72fd9f83f..5923dcc0aa 100644
--- a/open-machine-learning-jupyter-book/_toc.yml
+++ b/open-machine-learning-jupyter-book/_toc.yml
@@ -52,7 +52,6 @@ parts:
- file: ml-fundamentals/regression/tools-of-the-trade
- file: ml-fundamentals/regression/managing-data
- file: ml-fundamentals/regression/linear-and-polynomial-regression
- - file: ml-fundamentals/regression/loss-function
- file: ml-fundamentals/regression/logistic-regression
- file: ml-fundamentals/build-a-web-app-to-use-a-machine-learning-model
- file: ml-fundamentals/classification/getting-started-with-classification
@@ -152,6 +151,7 @@ parts:
- file: assignments/ml-fundamentals/ml-linear-regression-1
- file: assignments/ml-fundamentals/ml-linear-regression-2
- file: assignments/ml-fundamentals/ml-logistic-regression-1
+ - file: assignments/ml-fundamentals/ml-logistic-regression-2
- file: assignments/ml-fundamentals/ml-neural-network-1
# - file: assignments/ml-fundamentals/build-an-ml-web-app-1
# - file: assignments/ml-fundamentals/build-an-ml-web-app-2
@@ -223,6 +223,7 @@ parts:
- file: slides/ml-fundamentals/ml-overview
- file: slides/ml-fundamentals/linear-regression
- file: slides/ml-fundamentals/logistic-regression
+ - file: slides/ml-fundamentals/logistic-regression-condensed
- file: slides/ml-fundamentals/neural-network
- file: slides/ml-fundamentals/build-an-ml-web-app
- file: slides/ml-advanced/unsupervised-learning
diff --git a/open-machine-learning-jupyter-book/assignments/ml-advanced/unsupervised-learning/k-means-clustering-with-python.ipynb b/open-machine-learning-jupyter-book/assignments/ml-advanced/unsupervised-learning/k-means-clustering-with-python.ipynb
new file mode 100644
index 0000000000..8214d66d3b
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/ml-advanced/unsupervised-learning/k-means-clustering-with-python.ipynb
@@ -0,0 +1 @@
+{"cells":[{"cell_type":"markdown","metadata":{},"source":["\n","# K-Means Clustering with Python\n"]},{"cell_type":"markdown","metadata":{},"source":["K-Means clustering is the most popular unsupervised machine learning algorithm. K-Means clustering is used to find intrinsic groups within the unlabelled dataset and draw inferences from them. In this kernel, I implement K-Means clustering to find intrinsic groups within the dataset that display the same `status_type` behaviour. The `status_type` behaviour variable consists of posts of a different nature (video, photos, statuses and links).\n","\n","\n","So, let's get started."]},{"cell_type":"markdown","metadata":{},"source":["## Introduction to K-Means Clustering\n","\n","\n","Machine learning algorithms can be broadly classified into two categories - supervised and unsupervised learning. There are other categories also like semi-supervised learning and reinforcement learning. But, most of the algorithms are classified as supervised or unsupervised learning. The difference between them happens because of presence of target variable. In unsupervised learning, there is no target variable. The dataset only has input variables which describe the data. This is called unsupervised learning.\n","\n","K-Means clustering is the most popular unsupervised learning algorithm. It is used when we have unlabelled data which is data without defined categories or groups. The algorithm follows an easy or simple way to classify a given data set through a certain number of clusters, fixed apriori. K-Means algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity.\n"]},{"cell_type":"markdown","metadata":{},"source":["## K-Means\n","\n","![K-Means](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/k-means/k-means-visualization.png)"]},{"cell_type":"markdown","metadata":{},"source":["## K-Means Clustering intuition\n","\n","\n","K-Means clustering is used to find intrinsic groups within the unlabelled dataset and draw inferences from them. It is based on centroid-based clustering.\n","\n","\n","Centroid - A centroid is a data point at the centre of a cluster. In centroid-based clustering, clusters are represented by a centroid. It is an iterative algorithm in which the notion of similarity is derived by how close a data point is to the centroid of the cluster.\n","K-Means clustering works as follows:-\n","The K-Means clustering algorithm uses an iterative procedure to deliver a final result. The algorithm requires number of clusters K and the data set as input. The data set is a collection of features for each data point. The algorithm starts with initial estimates for the K centroids. The algorithm then iterates between two steps:-\n","\n","\n","## Data assignment step\n","\n","\n","Each centroid defines one of the clusters. In this step, each data point is assigned to its nearest centroid, which is based on the squared Euclidean distance. So, if ci is the collection of centroids in set C, then each data point is assigned to a cluster based on minimum Euclidean distance. \n","\n","\n","\n","## Centroid update step\n","\n","\n","In this step, the centroids are recomputed and updated. This is done by taking the mean of all data points assigned to that centroid’s cluster. \n","\n","\n","The algorithm then iterates between step 1 and step 2 until a stopping criteria is met. 
Stopping criteria means no data points change the clusters, the sum of the distances is minimized or some maximum number of iterations is reached.\n","This algorithm is guaranteed to converge to a result. The result may be a local optimum meaning that assessing more than one run of the algorithm with randomized starting centroids may give a better outcome.\n","\n","The K-Means intuition can be represented with the help of following diagram:-\n"]},{"cell_type":"markdown","metadata":{},"source":["## K-Means intuition\n","![K-Means intuition](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-advanced/k-means/k-means-clustering-example.jpg)"]},{"cell_type":"markdown","metadata":{},"source":["## Choosing the value of K\n","\n","\n","The K-Means algorithm depends upon finding the number of clusters and data labels for a pre-defined value of K. To find the number of clusters in the data, we need to run the K-Means clustering algorithm for different values of K and compare the results. So, the performance of K-Means algorithm depends upon the value of K. We should choose the optimal value of K that gives us best performance. There are different techniques available to find the optimal value of K. The most common technique is the elbow method which is described below.\n"]},{"cell_type":"markdown","metadata":{},"source":["## The elbow method\n","\n","\n","The elbow method is used to determine the optimal number of clusters in K-means clustering. The elbow method plots the value of the cost function produced by different values of K. "]},{"cell_type":"markdown","metadata":{},"source":["## Import libraries\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# This Python 3 environment comes with many helpful analytics libraries installed\n","# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python\n","# For example, here's several helpful packages to load in \n","\n","import numpy as np # linear algebra\n","import pandas as pd # data processing, CSV file I/O (e.g. 
pd.read_csv)\n","import matplotlib.pyplot as plt # for data visualization\n","import seaborn as sns # for statistical data visualization\n","%matplotlib inline\n","\n","# Input data files are available in the \"../input/\" directory.\n","# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory\n","\n","import os\n","for dirname, _, filenames in os.walk('/kaggle/input'):\n"," for filename in filenames:\n"," print(os.path.join(dirname, filename))\n","\n","# Any results you write to the current directory are saved as output.\n"]},{"cell_type":"markdown","metadata":{},"source":["### Ignore warnings\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["import warnings\n","\n","warnings.filterwarnings('ignore')"]},{"cell_type":"markdown","metadata":{},"source":["## Import dataset\n","\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["data = 'https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/Live.csv'\n","\n","df = pd.read_csv(data)\n"]},{"cell_type":"markdown","metadata":{},"source":["## Exploratory data analysis\n"]},{"cell_type":"markdown","metadata":{},"source":["### Check shape of the dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.shape"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are 7050 instances and 16 attributes in the dataset. In the dataset description, it is given that there are 7051 instances and 12 attributes in the dataset.\n","\n","So, we can infer that the first instance is the row header and there are 4 extra attributes in the dataset. Next, we should take a look at the dataset to gain more insight about it."]},{"cell_type":"markdown","metadata":{},"source":["### Preview the dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.head()"]},{"cell_type":"markdown","metadata":{},"source":["### View summary of dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.info()"]},{"cell_type":"markdown","metadata":{},"source":["### Check for missing values in dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.isnull().sum()"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are 4 redundant columns in the dataset. We should drop them before proceeding further."]},{"cell_type":"markdown","metadata":{},"source":["### Drop redundant columns"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.drop(['Column1', 'Column2', 'Column3', 'Column4'], axis=1, inplace=True)"]},{"cell_type":"markdown","metadata":{},"source":["### Again view summary of dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.info()"]},{"cell_type":"markdown","metadata":{},"source":["Now, we can see that redundant columns have been removed from the dataset. 
\n","\n","We can see that, there are 3 character variables (data type = object) and remaining 9 numerical variables (data type = int64).\n"]},{"cell_type":"markdown","metadata":{},"source":["### View the statistical summary of numerical variables"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.describe()"]},{"cell_type":"markdown","metadata":{},"source":["There are 3 categorical variables in the dataset. I will explore them one by one."]},{"cell_type":"markdown","metadata":{},"source":["### Explore `status_id` variable"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the labels in the variable\n","\n","df['status_id'].unique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view how many different types of variables are there\n","\n","len(df['status_id'].unique())"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are 6997 unique labels in the `status_id` variable. The total number of instances in the dataset is 7050. So, it is approximately a unique identifier for each of the instances. Thus this is not a variable that we can use. Hence, I will drop it."]},{"cell_type":"markdown","metadata":{},"source":["### Explore `status_published` variable"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the labels in the variable\n","\n","df['status_published'].unique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view how many different types of variables are there\n","\n","len(df['status_published'].unique())"]},{"cell_type":"markdown","metadata":{},"source":["Again, we can see that there are 6913 unique labels in the `status_published` variable. The total number of instances in the dataset is 7050. So, it is also a approximately a unique identifier for each of the instances. Thus this is not a variable that we can use. Hence, I will drop it also."]},{"cell_type":"markdown","metadata":{},"source":["### Explore `status_type` variable"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view the labels in the variable\n","\n","df['status_type'].unique()"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["# view how many different types of variables are there\n","\n","len(df['status_type'].unique())"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there are 4 categories of labels in the `status_type` variable."]},{"cell_type":"markdown","metadata":{},"source":["### Drop `status_id` and `status_published` variable from the dataset"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.drop(['status_id', 'status_published'], axis=1, inplace=True)"]},{"cell_type":"markdown","metadata":{},"source":["### View the summary of dataset again"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.info()"]},{"cell_type":"markdown","metadata":{},"source":["### Preview the dataset again"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["df.head()"]},{"cell_type":"markdown","metadata":{},"source":["We can see that there is 1 non-numeric column `status_type` in the dataset. 
I will convert it into integer equivalents."]},{"cell_type":"markdown","metadata":{},"source":["## Declare feature vector and target variable\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X = df\n","\n","y = df['status_type']"]},{"cell_type":"markdown","metadata":{},"source":["## Convert categorical variable into integers\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.preprocessing import LabelEncoder\n","\n","le = LabelEncoder()\n","\n","X['status_type'] = le.fit_transform(X['status_type'])\n","\n","y = le.transform(y)"]},{"cell_type":"markdown","metadata":{},"source":["### View the summary of X"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X.info()"]},{"cell_type":"markdown","metadata":{},"source":["### Preview the dataset X"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X.head()"]},{"cell_type":"markdown","metadata":{},"source":["## Feature Scaling"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["cols = X.columns"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.preprocessing import MinMaxScaler\n","\n","ms = MinMaxScaler()\n","\n","X = ms.fit_transform(X)"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X = pd.DataFrame(X, columns=[cols])"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["X.head()"]},{"cell_type":"markdown","metadata":{},"source":["## K-Means model with two clusters"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.cluster import KMeans\n","\n","kmeans = KMeans(n_clusters=2, random_state=0) \n","\n","kmeans.fit(X)"]},{"cell_type":"markdown","metadata":{},"source":["## K-Means model parameters study"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["kmeans.cluster_centers_"]},{"cell_type":"markdown","metadata":{},"source":["- The KMeans algorithm clusters data by trying to separate samples in n groups of equal variances, minimizing a criterion known as `inertia`, or within-cluster sum-of-squares Inertia, or the within-cluster sum of squares criterion, can be recognized as a measure of how internally coherent clusters are.\n","\n","\n","- The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean j of the samples in the cluster. The means are commonly called the cluster centroids.\n","\n","\n","- The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum of squared criterion."]},{"cell_type":"markdown","metadata":{},"source":["### Inertia\n","\n","\n","- `Inertia` is not a normalized metric. \n","\n","- The lower values of inertia are better and zero is optimal. \n","\n","- But in very high-dimensional spaces, euclidean distances tend to become inflated (this is an instance of `curse of dimensionality`). 
\n","\n","- Running a dimensionality reduction algorithm such as PCA prior to k-means clustering can alleviate this problem and speed up the computations.\n","\n","- We can calculate model inertia as follows:-"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["kmeans.inertia_"]},{"cell_type":"markdown","metadata":{},"source":["- The lesser the model inertia, the better the model fit.\n","\n","- We can see that the model has very high inertia. So, this is not a good model fit to the data."]},{"cell_type":"markdown","metadata":{},"source":[" ## Check quality of weak classification by the model"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["labels = kmeans.labels_\n","\n","# check how many of the samples were correctly labeled\n","correct_labels = sum(y == labels)\n","\n","print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"]},{"cell_type":"markdown","metadata":{},"source":["We have achieved a weak classification accuracy of 1% by our unsupervised model."]},{"cell_type":"markdown","metadata":{},"source":["## Use elbow method to find optimal number of clusters\n"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.cluster import KMeans\n","cs = []\n","for i in range(1, 11):\n"," kmeans = KMeans(n_clusters = i, init = 'k-means++', max_iter = 300, n_init = 10, random_state = 0)\n"," kmeans.fit(X)\n"," cs.append(kmeans.inertia_)\n","plt.plot(range(1, 11), cs)\n","plt.title('The Elbow Method')\n","plt.xlabel('Number of clusters')\n","plt.ylabel('CS')\n","plt.show()\n"]},{"cell_type":"markdown","metadata":{},"source":["- By the above plot, we can see that there is a kink at k=2. \n","\n","- Hence k=2 can be considered a good number of the cluster to cluster this data.\n","\n","- But, we have seen that I have achieved a weak classification accuracy of 1% with k=2.\n","\n","- I will write the required code with k=2 again for convinience."]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["from sklearn.cluster import KMeans\n","\n","kmeans = KMeans(n_clusters=2,random_state=0)\n","\n","kmeans.fit(X)\n","\n","labels = kmeans.labels_\n","\n","# check how many of the samples were correctly labeled\n","\n","correct_labels = sum(y == labels)\n","\n","print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n","\n","print('Accuracy score: {0:0.2f}'. 
format(correct_labels/float(y.size)))"]},{"cell_type":"markdown","metadata":{},"source":["So, our weak unsupervised classification model achieved a very weak classification accuracy of 1%."]},{"cell_type":"markdown","metadata":{},"source":["I will check the model accuracy with different numbers of clusters."]},{"cell_type":"markdown","metadata":{},"source":["## K-Means model with different clusters"]},{"cell_type":"markdown","metadata":{},"source":["### K-Means model with 3 clusters"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["kmeans = KMeans(n_clusters=3, random_state=0)\n","\n","kmeans.fit(X)\n","\n","# check how many of the samples were correctly labeled\n","labels = kmeans.labels_\n","\n","correct_labels = sum(y == labels)\n","print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n","print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"]},{"cell_type":"markdown","metadata":{},"source":["### K-Means model with 4 clusters"]},{"cell_type":"code","execution_count":null,"metadata":{"trusted":true},"outputs":[],"source":["kmeans = KMeans(n_clusters=4, random_state=0)\n","\n","kmeans.fit(X)\n","\n","# check how many of the samples were correctly labeled\n","labels = kmeans.labels_\n","\n","correct_labels = sum(y == labels)\n","print(\"Result: %d out of %d samples were correctly labeled.\" % (correct_labels, y.size))\n","print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))"]},{"cell_type":"markdown","metadata":{},"source":["We have achieved a relatively high accuracy of 62% with k=4."]},{"cell_type":"markdown","metadata":{},"source":["## Results and conclusion\n","\n","\n","In this project, I have implemented the most popular unsupervised clustering technique called K-Means Clustering.\n","\n","I have applied the elbow method and found that k=2 (where k is the number of clusters) can be considered a good number of clusters for this data.\n","\n","I have found that the model has a very high inertia of 237.7572. So, this is not a good model fit to the data.\n","\n","I have achieved a weak classification accuracy of 1% with k=2 by our unsupervised model.\n","\n","So, I have changed the value of k and found a relatively higher classification accuracy of 62% with k=4.\n","\n","Hence, we can conclude that k=4 is the optimal number of clusters.\n"]},{"cell_type":"markdown","metadata":{},"source":["## Acknowledgments\n","Thanks to [Prashant Banerjee](https://www.kaggle.com/prashant111) for creating the open-source project [K-Means Clustering with Python](https://www.kaggle.com/code/prashant111/k-means-clustering-with-python). It inspires the majority of the content in this chapter."]}],"metadata":{"kernelspec":{"display_name":"Python 3","language":"python","name":"python3"},"language_info":{"codemirror_mode":{"name":"ipython","version":3},"file_extension":".py","mimetype":"text/x-python","name":"python","nbconvert_exporter":"python","pygments_lexer":"ipython3","version":"3.8.17"}},"nbformat":4,"nbformat_minor":4}
diff --git a/open-machine-learning-jupyter-book/assignments/ml-fundamentals/ml-logistic-regression-2.ipynb b/open-machine-learning-jupyter-book/assignments/ml-fundamentals/ml-logistic-regression-2.ipynb
new file mode 100644
index 0000000000..a6344b19d2
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/ml-fundamentals/ml-logistic-regression-2.ipynb
@@ -0,0 +1,88 @@
+{
+ "cells": [
+ {
+ "attachments": {},
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# ML logistic regression - assignment 2\n",
+ "\n",
+ "## Logistic Regression from scratch"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "vscode": {
+ "languageId": "plaintext"
+ }
+ },
+ "outputs": [],
+ "source": [
+    "import numpy as np\n",
+    "\n",
+    "class MyOwnLogisticRegression:\n",
+ " def __init__(self, learning_rate=0.001, n_iters=1000):\n",
+ " self.lr = learning_rate\n",
+ " self.n_iters = n_iters\n",
+ " self.weights = None\n",
+ " self.bias = None\n",
+ "\n",
+ " def fit(self, X, y):\n",
+ " n_samples, n_features = X.shape\n",
+ "\n",
+ " # init parameters\n",
+ " self.weights = np.zeros(n_features)\n",
+ " self.bias = 0\n",
+ "\n",
+ " # gradient descent\n",
+ " for _ in range(self.n_iters):\n",
+ " # approximate y with linear combination of weights and x, plus bias\n",
+ " linear_model = np.dot(X, self.weights) + self.bias\n",
+ " # apply sigmoid function\n",
+ " y_predicted = self._sigmoid(linear_model)\n",
+ "\n",
+ " # compute gradients\n",
+ " dw = (1 / n_samples) * np.dot(X.T, (y_predicted - y))\n",
+ " db = (1 / n_samples) * np.sum(y_predicted - y)\n",
+ " # update parameters\n",
+ " self.weights -= self.lr * dw\n",
+ " self.bias -= self.lr * db\n",
+ "\n",
+ " def predict(self, X):\n",
+ " linear_model = np.dot(X, self.weights) + self.bias\n",
+ " y_predicted = self._sigmoid(linear_model)\n",
+ " y_predicted_cls = [1 if i > 0.5 else 0 for i in y_predicted]\n",
+ " return np.array(y_predicted_cls)\n",
+ "\n",
+ " def _sigmoid(self, x):\n",
+    "        return 1 / (1 + np.exp(-x))\n",
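+    "\n",
+    "\n",
+    "# Optional sanity check (illustrative sketch, not part of the required implementation):\n",
+    "# it assumes scikit-learn is available and uses its breast cancer dataset.\n",
+    "# Overflow warnings from np.exp are expected with unscaled features.\n",
+    "from sklearn.datasets import load_breast_cancer\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "X, y = load_breast_cancer(return_X_y=True)\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1234)\n",
+    "\n",
+    "clf = MyOwnLogisticRegression(learning_rate=0.0001, n_iters=1000)\n",
+    "clf.fit(X_train, y_train)\n",
+    "predictions = clf.predict(X_test)\n",
+    "print(\"Test accuracy:\", np.mean(predictions == y_test))"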
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ },
+ "vscode": {
+ "interpreter": {
+ "hash": "aee8b7b246df8f9039afb4144a1f6fd8d2ca17a180786b69acc140d282b71a49"
+ }
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.ipynb b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.ipynb
new file mode 100644
index 0000000000..dd6cac58e6
--- /dev/null
+++ b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.ipynb
@@ -0,0 +1,81 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "33887561-a556-495e-8744-a192d65a947d",
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "53182f7d-455f-4fd6-b9e8-75bbe9d32ed0",
+ "metadata": {
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d51de614",
+ "metadata": {},
+ "source": [
+ "# Data Science in the cloud\n",
+ "\n",
+ "![cloud-picture](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/cloud-picture.jpeg)\n",
+ "\n",
+ "> Photo by [Jelleke Vanooteghem](https://unsplash.com/@ilumire) from [Unsplash](https://unsplash.com/s/photos/cloud?orientation=landscape)\n",
+ "\n",
+    "When it comes to doing data science with big data, the cloud can be a game changer. In the next three sections, we are going to see what the cloud is and why it can be very helpful. We are also going to explore a heart failure dataset and build a model to help assess the probability of someone having heart failure. We will use the power of the cloud to train, deploy and consume a model in two different ways. One way uses only the user interface in a Low code/No code fashion, and the other way uses the Azure Machine Learning Software Development Kit (Azure ML SDK).\n",
+ "\n",
+ "![project-schema](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/project-schema.png)\n",
+ "\n",
+ "---\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.md b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.md
deleted file mode 100644
index 428ba3e91a..0000000000
--- a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/data-science-in-the-cloud.md
+++ /dev/null
@@ -1,29 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Data Science in the cloud
-
-![cloud-picture](../../../images/cloud-picture.jpeg)
-
-> Photo by [Jelleke Vanooteghem](https://unsplash.com/@ilumire) from [Unsplash](https://unsplash.com/s/photos/cloud?orientation=landscape)
-
-When it comes to doing data science with big data, the cloud can be a game changer. In the next three sections, we are going to see what the cloud is and why it can be very helpful. We are also going to explore a heart failure dataset and build a model to help assess the probability of someone having heart failure. We will use the power of the cloud to train, deploy and consume a model in two different ways. One way uses only the user interface in a Low code/No code fashion, and the other way uses the Azure Machine Learning Software Developer Kit (Azure ML SDK).
-
-![project-schema](../../../images/project-schema.png)
-
----
-
-```{tableofcontents}
-```
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.ipynb b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.ipynb
new file mode 100644
index 0000000000..35ae0af251
--- /dev/null
+++ b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.ipynb
@@ -0,0 +1,165 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "7e3566be-b99b-41c1-baa4-a5605e5a9896",
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "id": "99bdfaf7-62b7-4a12-8312-6ab34a7bb7e1",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6e3ccba0-778b-4fc0-9093-437024a2bda4",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "# Introduction\n",
+ "\n",
+ "In this section, you will learn the fundamental principles of the cloud, then you will see why it can be interesting for you to use cloud services to run your data science projects and we'll look at some examples of data science projects run in the cloud.\n",
+ "\n",
+ "## What is the cloud?\n",
+ "\n",
+ "The Cloud, or Cloud Computing, is the delivery of a wide range of pay-as-you-go computing services hosted on infrastructure over the internet. Services include solutions such as storage, databases, networking, software, analytics, and intelligent services.\n",
+ "\n",
+ "We usually differentiate the Public, Private and Hybrid clouds as follows:\n",
+ "\n",
+ "* Public cloud: a public cloud is owned and operated by a third-party cloud service provider which delivers its computing resources over the Internet to the public.\n",
+ "* Private cloud: refers to cloud computing resources used exclusively by a single business or organization, with services and infrastructure maintained on a private network.\n",
+ "* Hybrid cloud: the hybrid cloud is a system that combines public and private clouds. Users opt for an on-premises data center while allowing data and applications to be run on one or more public clouds.\n",
+ "\n",
+ "Most cloud computing services fall into three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).\n",
+ "\n",
+    "* Infrastructure as a Service (IaaS): users rent an IT infrastructure such as servers and virtual machines (VMs), storage, networks, and operating systems.\n",
+ "* Platform as a Service (PaaS): users rent an environment for developing, testing, delivering, and managing software applications. Users don't need to worry about setting up or managing the underlying infrastructure of servers, storage, network, and databases needed for development.\n",
+ "* Software as a Service (SaaS): users get access to software applications over the Internet, on-demand and typically on a subscription basis. Users don't need to worry about hosting and managing the software application, the underlying infrastructure or the maintenance, like software upgrades and security patching.\n",
+ "\n",
+ "Some of the largest Cloud providers are Amazon Web Services, Google Cloud Platform and Microsoft Azure.\n",
+ "\n",
+ "## Why choose the cloud for Data Science?\n",
+ "\n",
+    "Developers and IT professionals choose to work with the Cloud for many reasons, including the following:\n",
+ "\n",
+ "* Innovation: you can power your applications by integrating innovative services created by Cloud providers directly into your apps.\n",
+ "* Flexibility: you only pay for the services that you need and can choose from a wide range of services. You typically pay as you go and adapt your services according to your evolving needs.\n",
+ "* Budget: you don't need to make initial investments to purchase hardware and software, set up and run on-site data centers and you can just pay for what you use.\n",
+ "* Scalability: your resources can scale according to the needs of your project, which means that your apps can use more or less computing power, storage and bandwidth, by adapting to external factors at any given time.\n",
+ "* Productivity: you can focus on your business rather than spending time on tasks that can be managed by someone else, such as managing data centers.\n",
+ "* Reliability: Cloud Computing offers several ways to continuously back up your data and you can set up disaster recovery plans to keep your business and services going, even in times of crisis.\n",
+ "* Security: you can benefit from policies, technologies and controls that strengthen the security of your project.\n",
+ "\n",
+ "These are some of the most common reasons why people choose to use Cloud services. Now that we have a better understanding of what the Cloud is and what its main benefits are, let's look more specifically into the jobs of Data scientists and developers working with data, and how the Cloud can help them with several challenges they might face:\n",
+ "\n",
+ "* Storing large amounts of data: instead of buying, managing and protecting big servers, you can store your data directly in the cloud, with solutions such as Azure Cosmos DB, Azure SQL Database and Azure Data Lake Storage.\n",
+ "* Performing Data Integration: data integration is an essential part of Data Science, that lets you make a transition from data collection to taking action. With data integration services offered in the cloud, you can collect, transform and integrate data from various sources into a single data warehouse, with Data Factory.\n",
+ "* Processing data: processing vast amounts of data requires a lot of computing power, and not everyone has access to machines powerful enough for that, which is why many people choose to directly harness the cloud's huge computing power to run and deploy their solutions.\n",
+    "* Using data analytics services: cloud services like Azure Synapse Analytics, Azure Stream Analytics and Azure Databricks help you turn your data into actionable insights.\n",
+ "* Using Machine Learning and data intelligence services: Instead of starting from scratch, you can use machine learning algorithms offered by the cloud provider, with services such as AzureML. You can also use cognitive services such as speech-to-text, text-to-speech, computer vision and more.\n",
+ "\n",
+ "## Examples of Data Science in the cloud\n",
+ "\n",
+ "Let's make this more tangible by looking at a couple of scenarios.\n",
+ "\n",
+ "### Real-time social media sentiment analysis\n",
+ "\n",
+ "We'll start with a scenario commonly studied by people who start with machine learning: social media sentiment analysis in real time.\n",
+ "\n",
+ "Let's say you run a news media website and you want to leverage live data to understand what content your readers could be interested in. To know more about that, you can build a program that performs real-time sentiment analysis of data from Twitter publications, on topics that are relevant to your readers.\n",
+ "\n",
+ "The key indicators you will look at are the volume of tweets on specific topics (hashtags) and sentiment, which is established using analytics tools that perform sentiment analysis around the specified topics.\n",
+ "\n",
+ "The steps necessary to create this project are as follows:\n",
+ "\n",
+ "* Create an event hub for streaming input, which will collect data from Twitter.\n",
+ "* Configure and start a Twitter client application, which will call the Twitter Streaming APIs.\n",
+ "* Create a Stream Analytics job.\n",
+ "* Specify the job input and query.\n",
+ "* Create an output sink and specify the job output.\n",
+ "* Start the job.\n",
+ "\n",
+ "To view the full process, check out the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-77958-bethanycheum&ocid=AID30411099).\n",
+ "\n",
+ "### Scientific papers analysis\n",
+ "\n",
+ "Let's take another example of a project created by [Dmitry Soshnikov](http://soshnikov.com), one of the authors of this curriculum.\n",
+ "\n",
+ "Dmitry created a tool that analyses COVID papers. By reviewing this project, you will see how you can create a tool that extracts knowledge from scientific papers, gains insights and helps researchers navigate through large collections of papers in an efficient way.\n",
+ "\n",
+ "Let's see the different steps used for this:\n",
+ "\n",
+ "* Extracting and pre-processing information with [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).\n",
+ "* Using [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) to parallelize the processing.\n",
+ "* Storing and querying information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).\n",
+ "* Create an interactive dashboard for data exploration and visualization using Power BI.\n",
+ "\n",
+ "To see the full process, visit [Dmitry's blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).\n",
+ "\n",
+ "As you can see, we can leverage Cloud services in many ways to perform Data Science.\n",
+ "\n",
+ "## Self study\n",
+ "\n",
+ "* [What Is Cloud Computing? A Beginner's Guide | Microsoft Azure](https://azure.microsoft.com/overview/what-is-cloud-computing?ocid=AID3041109)\n",
+ "* [Social media analysis with Azure Stream Analytics | Microsoft Learn](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?ocid=AID3041109)\n",
+ "* [Analyzing COVID Medical Papers with Azure and Text Analytics for Health](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/)\n",
+ "\n",
+ "## Your turn! 🚀\n",
+ "\n",
+ "[Market research](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/data-science/market-research.md)\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.md b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.md
deleted file mode 100644
index 0390be13c3..0000000000
--- a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/introduction.md
+++ /dev/null
@@ -1,110 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Introduction
-
-In this section, you will learn the fundamental principles of the cloud, then you will see why it can be interesting for you to use cloud services to run your data science projects and we'll look at some examples of data science projects run in the cloud.
-
-## What is the cloud?
-
-The Cloud, or Cloud Computing, is the delivery of a wide range of pay-as-you-go computing services hosted on infrastructure over the internet. Services include solutions such as storage, databases, networking, software, analytics, and intelligent services.
-
-We usually differentiate the Public, Private and Hybrid clouds as follows:
-
-* Public cloud: a public cloud is owned and operated by a third-party cloud service provider which delivers its computing resources over the Internet to the public.
-* Private cloud: refers to cloud computing resources used exclusively by a single business or organization, with services and infrastructure maintained on a private network.
-* Hybrid cloud: the hybrid cloud is a system that combines public and private clouds. Users opt for an on-premises data center while allowing data and applications to be run on one or more public clouds.
-
-Most cloud computing services fall into three categories: Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS).
-
-* Infrastructure as a Service (IaaS): users rent an IT infrastructure such as servers and virtual machines (VMs), storage, networks, operating systems
-* Platform as a Service (PaaS): users rent an environment for developing, testing, delivering, and managing software applications. Users don’t need to worry about setting up or managing the underlying infrastructure of servers, storage, network, and databases needed for development.
-* Software as a Service (SaaS): users get access to software applications over the Internet, on-demand and typically on a subscription basis. Users don’t need to worry about hosting and managing the software application, the underlying infrastructure or the maintenance, like software upgrades and security patching.
-
-Some of the largest Cloud providers are Amazon Web Services, Google Cloud Platform and Microsoft Azure.
-
-## Why choose the cloud for Data Science?
-
-Developers and IT professionals chose to work with the Cloud for many reasons, including the following:
-
-* Innovation: you can power your applications by integrating innovative services created by Cloud providers directly into your apps.
-* Flexibility: you only pay for the services that you need and can choose from a wide range of services. You typically pay as you go and adapt your services according to your evolving needs.
-* Budget: you don’t need to make initial investments to purchase hardware and software, set up and run on-site data centers and you can just pay for what you use.
-* Scalability: your resources can scale according to the needs of your project, which means that your apps can use more or less computing power, storage and bandwidth, by adapting to external factors at any given time.
-* Productivity: you can focus on your business rather than spending time on tasks that can be managed by someone else, such as managing data centers.
-* Reliability: Cloud Computing offers several ways to continuously back up your data and you can set up disaster recovery plans to keep your business and services going, even in times of crisis.
-* Security: you can benefit from policies, technologies and controls that strengthen the security of your project.
-
-These are some of the most common reasons why people choose to use Cloud services. Now that we have a better understanding of what the Cloud is and what its main benefits are, let's look more specifically into the jobs of Data scientists and developers working with data, and how the Cloud can help them with several challenges they might face:
-
-* Storing large amounts of data: instead of buying, managing and protecting big servers, you can store your data directly in the cloud, with solutions such as Azure Cosmos DB, Azure SQL Database and Azure Data Lake Storage.
-* Performing Data Integration: data integration is an essential part of Data Science, that lets you make a transition from data collection to taking action. With data integration services offered in the cloud, you can collect, transform and integrate data from various sources into a single data warehouse, with Data Factory.
-* Processing data: processing vast amounts of data requires a lot of computing power, and not everyone has access to machines powerful enough for that, which is why many people choose to directly harness the cloud’s huge computing power to run and deploy their solutions.
-* Using data analytics services: cloud services like Azure Synapse Analytics, Azure Stream Analytics and Azure Databricks to help you turn your data into actionable insights.
-* Using Machine Learning and data intelligence services: Instead of starting from scratch, you can use machine learning algorithms offered by the cloud provider, with services such as AzureML. You can also use cognitive services such as speech-to-text, text-to-speech, computer vision and more.
-
-## Examples of Data Science in the cloud
-
-Let’s make this more tangible by looking at a couple of scenarios.
-
-### Real-time social media sentiment analysis
-
-We’ll start with a scenario commonly studied by people who start with machine learning: social media sentiment analysis in real time.
-
-Let's say you run a news media website and you want to leverage live data to understand what content your readers could be interested in. To know more about that, you can build a program that performs real-time sentiment analysis of data from Twitter publications, on topics that are relevant to your readers.
-
-The key indicators you will look at are the volume of tweets on specific topics (hashtags) and sentiment, which is established using analytics tools that perform sentiment analysis around the specified topics.
-
-The steps necessary to create this project are as follows:
-
-* Create an event hub for streaming input, which will collect data from Twitter.
-* Configure and start a Twitter client application, which will call the Twitter Streaming APIs.
-* Create a Stream Analytics job.
-* Specify the job input and query.
-* Create an output sink and specify the job output.
-* Start the job.
-
-To view the full process, check out the [documentation](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?WT.mc_id=academic-77958-bethanycheum&ocid=AID30411099).
-
-### Scientific papers analysis
-
-Let’s take another example of a project created by [Dmitry Soshnikov](http://soshnikov.com), one of the authors of this curriculum.
-
-Dmitry created a tool that analyses COVID papers. By reviewing this project, you will see how you can create a tool that extracts knowledge from scientific papers, gains insights and helps researchers navigate through large collections of papers in an efficient way.
-
-Let's see the different steps used for this:
-
-* Extracting and pre-processing information with [Text Analytics for Health](https://docs.microsoft.com/azure/cognitive-services/text-analytics/how-tos/text-analytics-for-health?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
-* Using [Azure ML](https://azure.microsoft.com/services/machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) to parallelize the processing.
-* Storing and querying information with [Cosmos DB](https://azure.microsoft.com/services/cosmos-db?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
-* Create an interactive dashboard for data exploration and visualization using Power BI.
-
-To see the full process, visit [Dmitry’s blog](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/).
-
-As you can see, we can leverage Cloud services in many ways to perform Data Science.
-
-## Self study
-
-* [What Is Cloud Computing? A Beginner’s Guide | Microsoft Azure](https://azure.microsoft.com/overview/what-is-cloud-computing?ocid=AID3041109)
-* [Social media analysis with Azure Stream Analytics | Microsoft Learn](https://docs.microsoft.com/azure/stream-analytics/stream-analytics-twitter-sentiment-analysis-trends?ocid=AID3041109)
-* [Analyzing COVID Medical Papers with Azure and Text Analytics for Health](https://soshnikov.com/science/analyzing-medical-papers-with-azure-and-text-analytics-for-health/)
-
-## Your turn! 🚀
-
-[Market research](../../assignments/data-science/market-research.md)
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter.
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.ipynb b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.ipynb
new file mode 100644
index 0000000000..71891b7800
--- /dev/null
+++ b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.ipynb
@@ -0,0 +1,613 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "10efe4e8-7b3c-4532-b851-66332bc328de",
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6d328b78-aec2-4495-a449-b16032dd9615",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython azureml"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ce897deb",
+ "metadata": {
+ "tags": []
+ },
+ "source": [
+ "\n",
+ "\n",
+ "# Data Science in the cloud: The \"Azure ML SDK\" way\n",
+ "\n",
+ "## Introduction\n",
+ "\n",
+ "### What is Azure ML SDK?\n",
+ "\n",
+ "Data scientists and AI developers use the Azure Machine Learning SDK to build and run Machine Learning workflows with the Azure Machine Learning service. You can interact with the service in any Python environment, including Jupyter Notebooks, Visual Studio Code, or your favorite Python IDE.\n",
+ "\n",
+ "Key areas of the SDK include:\n",
+ "\n",
+ "- Explore, prepare and manage the lifecycle of your datasets used in Machine Learning experiments.\n",
+ "- Manage cloud resources for monitoring, logging, and organizing your Machine Learning experiments.\n",
+ "- Train models either locally or by using cloud resources, including GPU-accelerated model training.\n",
+ "- Use automated Machine Learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.\n",
+ "- Deploy web services to convert your trained models into RESTful services that can be consumed in any application.\n",
+ "\n",
+ "[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)\n",
+ "\n",
+ "In the [previous section](./the-low-code-no-code-way.md), we saw how to train, deploy and consume a model in a Low code/No code fashion. We used the Heart Failure dataset to generate and Heart failure prediction model. In this section, we are going to do the exact same thing but using the Azure Machine Learning SDK.\n",
+ "\n",
+ "![project-schema](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/project-schema.png)\n",
+ "\n",
+ "### Heart failure prediction project and dataset introduction\n",
+ "\n",
+ "Check [here](./the-low-code-no-code-way.md) the Heart failure prediction project and dataset introduction.\n",
+ "\n",
+ "## Training a model with the Azure ML SDK\n",
+ "\n",
+ "### Create an Azure ML workspace\n",
+ "\n",
+ "For simplicity, we are going to work on a Jupyter Notebook. This implies that you already have a Workspace and a compute instance. If you already have a Workspace, you can directly jump to section 2.3 Notebook creation.\n",
+ "\n",
+ "If not, please follow the instructions in section **2.1 Create an Azure ML workspace** in the [previous section](./the-low-code-no-code-way.md) to create a workspace.\n",
+ "\n",
+ "### Create a compute instance\n",
+ "\n",
+ "In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to the compute menu and you will see the different compute resources available\n",
+ "\n",
+ "![compute-instance-1](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/compute-instance-1.PNG)\n",
+ "\n",
+ "Let's create a compute instance to provision a Jupyter Notebook.\n",
+ "\n",
+ "1. Click on the + New button. \n",
+ "2. Give a name to your compute instance.\n",
+ "3. Choose your options: CPU or GPU, VM size and core number.\n",
+ "4. Click in the Create button.\n",
+ "\n",
+ "Congratulations, you have just created a compute instance! We will use this compute instance to create a Notebook in the [Creating Notebooks section](#23-creating-notebooks).\n",
+ "\n",
+ "### Loading the dataset\n",
+ "\n",
+ "Refer to the [previous section](./the-low-code-no-code-way.md) in the section [Loading the dataset](#loading-the-dataset) if you have not uploaded the dataset yet.\n",
+ "\n",
+ "### Creating Notebooks\n",
+ "\n",
+ ":::{note}\n",
+ "For the next step you can either create a new notebook from scratch, or you can upload the [notebook we created](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/data-science/data-science-in-the-cloud-the-azure-ml-sdk-way.ipynb) in you Azure ML Studio. To upload it, simply click on the \"Notebook\" menu and upload the notebook.\n",
+ ":::\n",
+ "\n",
+ "Notebooks are a really important part of the data science process. They can be used to Conduct Exploratory Data Analysis (EDA), call out to a computer cluster to train a model, and call out to an inference cluster to deploy an endpoint.\n",
+ "\n",
+ "To create a Notebook, we need a compute node that is serving out the Jupyter Notebook instance. Go back to the [Azure ML workspace](https://ml.azure.com/) and click on Compute instances. In the list of compute instances, you should see the [compute instance we created earlier](#create-a-compute-instance).\n",
+ "\n",
+ "1. In the Applications section, click on the Jupyter option.\n",
+ "2. Tick the \"Yes, I understand\" box and click on the Continue button.\n",
+ "![notebook-1](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/notebook-1.PNG)\n",
+ "1. This should open a new browser tab with your Jupyter Notebook instance as follow. Click on the \"New\" button to create a notebook.\n",
+ "\n",
+ "![notebook-2](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/notebook-2.PNG)\n",
+ "\n",
+ "Now that we have a Notebook, we can start training the model with Azure ML SDK.\n",
+ "\n",
+ "### Training a model\n",
+ "\n",
+ "First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). It contains all the necessary information to understand the modules we are going to see in this section.\n",
+ "\n",
+ "#### Setup Workspace, experiment, compute cluster and dataset\n",
+ "\n",
+ "You need to load the `workspace` from the configuration file using the following code:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6764ba31",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.core import Workspace\n",
+ "ws = Workspace.from_config()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1350d570",
+ "metadata": {},
+ "source": [
+ "This returns an object of type `Workspace` that represents the workspace. You need to create an `experiment` using the following code:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "6dca29c5",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.core import Experiment\n",
+ "experiment_name = 'aml-experiment'\n",
+ "experiment = Experiment(ws, experiment_name)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "8b80dad1",
+ "metadata": {},
+ "source": [
+ "To get or create an experiment from a workspace, you request the experiment using the experiment name. Experiment name must be 3-36 characters, start with a letter or a number, and can only contain letters, numbers, underscores, and dashes. If the experiment is not found in the workspace, a new experiment is created.\n",
+ "\n",
+ "Now you need to create a compute cluster for the training using the following code. Note that this step can take a few minutes."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "76a2a067",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.core.compute import AmlCompute\n",
+ "\n",
+ "aml_name = \"heart-f-cluster\"\n",
+ "try:\n",
+ " aml_compute = AmlCompute(ws, aml_name)\n",
+ " print('Found existing AML compute context.')\n",
+ "except:\n",
+ " print('Creating new AML compute context.')\n",
+ " aml_config = AmlCompute.provisioning_configuration(vm_size=\"Standard_D2_v2\", min_nodes=1, max_nodes=3)\n",
+ " aml_compute = AmlCompute.create(ws, name=aml_name, provisioning_configuration=aml_config)\n",
+ " aml_compute.wait_for_completion(show_output=True)\n",
+ "\n",
+ "cts = ws.compute_targets\n",
+ "compute_target = cts[aml_name]"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5f9a5364",
+ "metadata": {},
+ "source": [
+ "You can get the dataset from the workspace using the dataset name in the following way:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f673ea26",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset = ws.datasets['heart-failure-records']\n",
+ "df = dataset.to_pandas_dataframe()\n",
+ "df.describe()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a5466359",
+ "metadata": {},
+ "source": [
+ "#### AutoML configuration and training\n",
+ "\n",
+ "To set the AutoML configuration, use the [AutoMLConfig class](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).\n",
+ "\n",
+ "As described in the doc, there are a lot of parameters with which you can play with. For this project, we will use the following parameters:\n",
+ "\n",
+ "- `experiment_timeout_minutes`: The maximum amount of time (in minutes) that the experiment is allowed to run before it is automatically stopped and results are automatically made available\n",
+ "- `max_concurrent_iterations`: The maximum number of concurrent training iterations allowed for the experiment.\n",
+ "- `primary_metric`: The primary metric used to determine the experiment's status.\n",
+ "- `compute_target`: The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.\n",
+ "- `task`: The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.\n",
+ "- `training_data`: The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column).\n",
+ "- `label_column_name`: The name of the label column.\n",
+ "- `path`: The full path to the Azure Machine Learning project folder.\n",
+ "- `enable_early_stopping`: Whether to enable early termination if the score is not improving in the short term.\n",
+ "- `featurization`: Indicator for whether the featurization step should be done automatically or not, or whether customized featurization should be used.\n",
+ "- `debug_log`: The log file to write debug information to."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "7e2d6adb",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.train.automl import AutoMLConfig\n",
+ "\n",
+ "project_folder = './aml-project'\n",
+ "\n",
+ "automl_settings = {\n",
+ " \"experiment_timeout_minutes\": 20,\n",
+ " \"max_concurrent_iterations\": 3,\n",
+ " \"primary_metric\" : 'AUC_weighted'\n",
+ "}\n",
+ "\n",
+ "automl_config = AutoMLConfig(compute_target=compute_target,\n",
+ " task = \"classification\",\n",
+ " training_data=dataset,\n",
+ " label_column_name=\"DEATH_EVENT\",\n",
+ " path = project_folder, \n",
+ " enable_early_stopping= True,\n",
+ " featurization= 'auto',\n",
+ " debug_log = \"automl_errors.log\",\n",
+ " **automl_settings\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "416b882b",
+ "metadata": {},
+ "source": [
+ "Now that you have your configuration set, you can train the model using the following code. This step can take up to an hour depending on your cluster size."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "27235651",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "remote_run = experiment.submit(automl_config)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c1259456",
+ "metadata": {},
+ "source": [
+ "You can run the RunDetails widget to show the different experiments."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "de0410e3",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.widgets import RunDetails\n",
+ "RunDetails(remote_run).show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "5698ca64",
+ "metadata": {},
+ "source": [
+ "## Model deployment and endpoint consumption with the Azure ML SDK\n",
+ "\n",
+ "### Saving the best model\n",
+ "\n",
+ "The `remote_run` is an object of type [AutoMLRun](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). This object contains the method `get_output()` which returns the best run and the corresponding fitted model."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "f5902263",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "best_run, fitted_model = remote_run.get_output()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6aafe503",
+ "metadata": {},
+ "source": [
+ "You can see the parameters used for the best model by just printing the fitted_model and see the properties of the best model by using the [get_properties()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "374d3f21",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "best_run.get_properties()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "07b9aac7",
+ "metadata": {},
+ "source": [
+ "Now register the model with the [register_model](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?view=azure-ml-py#register-model-model-name-none--description-none--tags-none--iteration-none--metric-none-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "86b307f0",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "model_name = best_run.properties['model_name']\n",
+ "script_file_name = 'inference/score.py'\n",
+ "best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/score.py')\n",
+ "description = \"aml heart failure project sdk\"\n",
+ "model = best_run.register_model(\n",
+ " model_name = model_name,\n",
+ " model_path = './outputs/',\n",
+ " description = description,\n",
+ " tags = None\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "6b760b6a",
+ "metadata": {},
+ "source": [
+ "### Model deployment\n",
+ "\n",
+ "Once the best model is saved, we can deploy it with the [InferenceConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py?ocid=AID3041109) class. InferenceConfig represents the configuration settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a Machine Learning model deployed as a web service endpoint on Azure Container Instances. A deployed service is created from a model, script, and associated files. The resulting web service is a load-balanced, HTTP endpoint with a REST API. You can send data to this API and receive the prediction returned by the model.\n",
+ "\n",
+ "The model is deployed using the [deploy](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "63ca096a",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from azureml.core.model import InferenceConfig, Model\n",
+ "from azureml.core.webservice import AciWebservice\n",
+ "\n",
+ "inference_config = InferenceConfig(entry_script=script_file_name, environment=best_run.get_environment())\n",
+ "\n",
+ "aciconfig = AciWebservice.deploy_configuration(\n",
+ " cpu_cores = 1,\n",
+ " memory_gb = 1,\n",
+ " tags = {'type': \"automl-heart-failure-prediction\"},\n",
+ " description = 'Sample service for AutoML Heart Failure Prediction'\n",
+ ")\n",
+ "\n",
+ "aci_service_name = 'automl-hf-sdk'\n",
+ "aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)\n",
+ "aci_service.wait_for_deployment(True)\n",
+ "print(aci_service.state)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "042a5d97",
+ "metadata": {},
+ "source": [
+ "This step should take a few minutes.\n",
+ "\n",
+ "### Endpoint consumption\n",
+ "\n",
+ "You consume your endpoint by creating a sample input:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "8501e73d",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data = {\n",
+ " \"data\":\n",
+ " [\n",
+ " {\n",
+ " 'age': \"60\",\n",
+ " 'anaemia': \"false\",\n",
+ " 'creatinine_phosphokinase': \"500\",\n",
+ " 'diabetes': \"false\",\n",
+ " 'ejection_fraction': \"38\",\n",
+ " 'high_blood_pressure': \"false\",\n",
+ " 'platelets': \"260000\",\n",
+ " 'serum_creatinine': \"1.40\",\n",
+ " 'serum_sodium': \"137\",\n",
+ " 'sex': \"false\",\n",
+ " 'smoking': \"false\",\n",
+ " 'time': \"130\",\n",
+ " },\n",
+ " ],\n",
+ "}\n",
+ "\n",
+ "test_sample = str.encode(json.dumps(data))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b138d435",
+ "metadata": {},
+ "source": [
+ "And then you can send this input to your model for prediction :"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "406240b7",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "response = aci_service.run(input_data=test_sample)\n",
+ "response"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ee723fae",
+ "metadata": {},
+ "source": [
+ "This should output `'{\"result\": [false]}'`. This means that the patient input we sent to the endpoint generated the prediction `false` which means this person is not likely to have a heart attack.\n",
+ "\n",
+ "Congratulations! You just consumed the model deployed and trained on Azure ML with the Azure ML SDK!\n",
+ "\n",
+ ":::{note}\n",
+ "Once you are done with the project, don't forget to delete all the resources.\n",
+ ":::\n",
+ "\n",
+ "## Your turn! 🚀\n",
+ "\n",
+ " There are many other things you can do through the SDK, unfortunately, we can not view them all in this section. But good news, learning how to skim through the SDK documentation can take you a long way on your own. Have a look at the Azure ML SDK documentation and find the `Pipeline` class that allows you to create pipelines. A Pipeline is a collection of steps which can be executed as a workflow.\n",
+ "\n",
+ ":::{note}\n",
+ "**HINT:** Go to the [SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) and type keywords in the search bar like \"Pipeline\". You should have the `azureml.pipeline.core.Pipeline` class in the search results.\n",
+ ":::\n",
+ "\n",
+ "Assignment - [Data Science project using Azure ML SDK](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/data-science/data-science-project-using-azure-ml-sdk.md)\n",
+ "\n",
+ "## Self study\n",
+ "\n",
+ "In this section, you learned how to train, deploy and consume a model to predict heart failure risk with the Azure ML SDK in the cloud. Check this [documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) for further information about the Azure ML SDK. Try to create your own model with the Azure ML SDK.\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.md b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.md
deleted file mode 100644
index c0c9d35c1f..0000000000
--- a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-azure-ml-sdk-way.md
+++ /dev/null
@@ -1,309 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Data Science in the cloud: The "Azure ML SDK" way
-
-## Introduction
-
-### What is Azure ML SDK?
-
-Data scientists and AI developers use the Azure Machine Learning SDK to build and run Machine Learning workflows with the Azure Machine Learning service. You can interact with the service in any Python environment, including Jupyter Notebooks, Visual Studio Code, or your favorite Python IDE.
-
-Key areas of the SDK include:
-
-- Explore, prepare and manage the lifecycle of your datasets used in Machine Learning experiments.
-- Manage cloud resources for monitoring, logging, and organizing your Machine Learning experiments.
-- Train models either locally or by using cloud resources, including GPU-accelerated model training.
-- Use automated Machine Learning, which accepts configuration parameters and training data. It automatically iterates through algorithms and hyperparameter settings to find the best model for running predictions.
-- Deploy web services to convert your trained models into RESTful services that can be consumed in any application.
-
-[Learn more about the Azure Machine Learning SDK](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
-
-In the [previous section](./the-low-code-no-code-way.md), we saw how to train, deploy and consume a model in a Low code/No code fashion. We used the Heart Failure dataset to generate and Heart failure prediction model. In this section, we are going to do the exact same thing but using the Azure Machine Learning SDK.
-
-![project-schema](../../../images/project-schema.png)
-
-### Heart failure prediction project and dataset introduction
-
-Check [here](./the-low-code-no-code-way.md) the Heart failure prediction project and dataset introduction.
-
-## Training a model with the Azure ML SDK
-
-### Create an Azure ML workspace
-
-For simplicity, we are going to work on a Jupyter Notebook. This implies that you already have a Workspace and a compute instance. If you already have a Workspace, you can directly jump to section 2.3 Notebook creation.
-
-If not, please follow the instructions in section **2.1 Create an Azure ML workspace** in the [previous section](./the-low-code-no-code-way.md) to create a workspace.
-
-### Create a compute instance
-
-In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to the compute menu and you will see the different compute resources available
-
-![compute-instance-1](../../../images/compute-instance-1.PNG)
-
-Let's create a compute instance to provision a Jupyter Notebook.
-
-1. Click on the + New button.
-2. Give a name to your compute instance.
-3. Choose your options: CPU or GPU, VM size and core number.
-4. Click in the Create button.
-
-Congratulations, you have just created a compute instance! We will use this compute instance to create a Notebook in the [Creating Notebooks section](#23-creating-notebooks).
-
-### Loading the dataset
-
-Refer to the [previous section](./the-low-code-no-code-way.md) in the section [Loading the dataset](#loading-the-dataset) if you have not uploaded the dataset yet.
-
-### Creating Notebooks
-
-```{note}
-For the next step you can either create a new notebook from scratch, or you can upload the [notebook we created](../../assignments/data-science/data-science-in-the-cloud-the-azure-ml-sdk-way.ipynb) in you Azure ML Studio. To upload it, simply click on the "Notebook" menu and upload the notebook.
-```
-
-Notebooks are a really important part of the data science process. They can be used to Conduct Exploratory Data Analysis (EDA), call out to a computer cluster to train a model, and call out to an inference cluster to deploy an endpoint.
-
-To create a Notebook, we need a compute node that is serving out the Jupyter Notebook instance. Go back to the [Azure ML workspace](https://ml.azure.com/) and click on Compute instances. In the list of compute instances, you should see the [compute instance we created earlier](#create-a-compute-instance).
-
-1. In the Applications section, click on the Jupyter option.
-2. Tick the "Yes, I understand" box and click on the Continue button.
-![notebook-1](../../../images/notebook-1.PNG)
-3. This should open a new browser tab with your Jupyter Notebook instance as follow. Click on the "New" button to create a notebook.
-
-![notebook-2](../../../images/notebook-2.PNG)
-
-Now that we have a Notebook, we can start training the model with Azure ML SDK.
-
-### Training a model
-
-First of all, if you ever have a doubt, refer to the [Azure ML SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). It contains all the necessary information to understand the modules we are going to see in this section.
-
-#### Setup Workspace, experiment, compute cluster and dataset
-
-You need to load the `workspace` from the configuration file using the following code:
-
-```python
-from azureml.core import Workspace
-ws = Workspace.from_config()
-```
-
-This returns an object of type `Workspace` that represents the workspace. You need to create an `experiment` using the following code:
-
-```python
-from azureml.core import Experiment
-experiment_name = 'aml-experiment'
-experiment = Experiment(ws, experiment_name)
-```
-
-To get or create an experiment from a workspace, you request the experiment using the experiment name. Experiment name must be 3-36 characters, start with a letter or a number, and can only contain letters, numbers, underscores, and dashes. If the experiment is not found in the workspace, a new experiment is created.
-
-Now you need to create a compute cluster for the training using the following code. Note that this step can take a few minutes.
-
-```python
-from azureml.core.compute import AmlCompute
-
-aml_name = "heart-f-cluster"
-try:
- aml_compute = AmlCompute(ws, aml_name)
- print('Found existing AML compute context.')
-except:
- print('Creating new AML compute context.')
- aml_config = AmlCompute.provisioning_configuration(vm_size="Standard_D2_v2", min_nodes=1, max_nodes=3)
- aml_compute = AmlCompute.create(ws, name=aml_name, provisioning_configuration=aml_config)
- aml_compute.wait_for_completion(show_output=True)
-
-cts = ws.compute_targets
-compute_target = cts[aml_name]
-```
-
-You can get the dataset from the workspace using the dataset name in the following way:
-
-```python
-dataset = ws.datasets['heart-failure-records']
-df = dataset.to_pandas_dataframe()
-df.describe()
-```
-
-#### AutoML configuration and training
-
-To set the AutoML configuration, use the [AutoMLConfig class](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.automlconfig(class)?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
-
-As described in the doc, there are a lot of parameters with which you can play with. For this project, we will use the following parameters:
-
-- `experiment_timeout_minutes`: The maximum amount of time (in minutes) that the experiment is allowed to run before it is automatically stopped and results are automatically made available
-- `max_concurrent_iterations`: The maximum number of concurrent training iterations allowed for the experiment.
-- `primary_metric`: The primary metric used to determine the experiment's status.
-- `compute_target`: The Azure Machine Learning compute target to run the Automated Machine Learning experiment on.
-- `task`: The type of task to run. Values can be 'classification', 'regression', or 'forecasting' depending on the type of automated ML problem to solve.
-- `training_data`: The training data to be used within the experiment. It should contain both training features and a label column (optionally a sample weights column).
-- `label_column_name`: The name of the label column.
-- `path`: The full path to the Azure Machine Learning project folder.
-- `enable_early_stopping`: Whether to enable early termination if the score is not improving in the short term.
-- `featurization`: Indicator for whether the featurization step should be done automatically or not, or whether customized featurization should be used.
-- `debug_log`: The log file to write debug information to.
-
-```python
-from azureml.train.automl import AutoMLConfig
-
-project_folder = './aml-project'
-
-automl_settings = {
- "experiment_timeout_minutes": 20,
- "max_concurrent_iterations": 3,
- "primary_metric" : 'AUC_weighted'
-}
-
-automl_config = AutoMLConfig(compute_target=compute_target,
- task = "classification",
- training_data=dataset,
- label_column_name="DEATH_EVENT",
- path = project_folder,
- enable_early_stopping= True,
- featurization= 'auto',
- debug_log = "automl_errors.log",
- **automl_settings
- )
-```
-
-Now that you have your configuration set, you can train the model using the following code. This step can take up to an hour depending on your cluster size.
-
-```python
-remote_run = experiment.submit(automl_config)
-```
-
-You can run the RunDetails widget to show the different experiments.
-
-```python
-from azureml.widgets import RunDetails
-RunDetails(remote_run).show()
-```
-
-## Model deployment and endpoint consumption with the Azure ML SDK
-
-### Saving the best model
-
-The `remote_run` is an object of type [AutoMLRun](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109). This object contains the method `get_output()` which returns the best run and the corresponding fitted model.
-
-```python
-best_run, fitted_model = remote_run.get_output()
-```
-
-You can see the parameters used for the best model by just printing the fitted_model and see the properties of the best model by using the [get_properties()](https://docs.microsoft.com/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py#azureml_core_Run_get_properties?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
-
-```python
-best_run.get_properties()
-```
-
-Now register the model with the [register_model](https://docs.microsoft.com/python/api/azureml-train-automl-client/azureml.train.automl.run.automlrun?view=azure-ml-py#register-model-model-name-none--description-none--tags-none--iteration-none--metric-none-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
-
-```python
-model_name = best_run.properties['model_name']
-script_file_name = 'inference/score.py'
-best_run.download_file('outputs/scoring_file_v_1_0_0.py', 'inference/score.py')
-description = "aml heart failure project sdk"
-model = best_run.register_model(
- model_name = model_name,
- model_path = './outputs/',
- description = description,
- tags = None
-)
-```
-
-### Model deployment
-
-Once the best model is saved, we can deploy it with the [InferenceConfig](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model.inferenceconfig?view=azure-ml-py?ocid=AID3041109) class. InferenceConfig represents the configuration settings for a custom environment used for deployment. The [AciWebservice](https://docs.microsoft.com/python/api/azureml-core/azureml.core.webservice.aciwebservice?view=azure-ml-py) class represents a Machine Learning model deployed as a web service endpoint on Azure Container Instances. A deployed service is created from a model, script, and associated files. The resulting web service is a load-balanced, HTTP endpoint with a REST API. You can send data to this API and receive the prediction returned by the model.
-
-The model is deployed using the [deploy](https://docs.microsoft.com/python/api/azureml-core/azureml.core.model(class)?view=azure-ml-py#deploy-workspace--name--models--inference-config-none--deployment-config-none--deployment-target-none--overwrite-false--show-output-false-?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) method.
-
-```python
-from azureml.core.model import InferenceConfig, Model
-from azureml.core.webservice import AciWebservice
-
-inference_config = InferenceConfig(entry_script=script_file_name, environment=best_run.get_environment())
-
-aciconfig = AciWebservice.deploy_configuration(
- cpu_cores = 1,
- memory_gb = 1,
- tags = {'type': "automl-heart-failure-prediction"},
- description = 'Sample service for AutoML Heart Failure Prediction'
-)
-
-aci_service_name = 'automl-hf-sdk'
-aci_service = Model.deploy(ws, aci_service_name, [model], inference_config, aciconfig)
-aci_service.wait_for_deployment(True)
-print(aci_service.state)
-```
-
-This step should take a few minutes.
-
-### Endpoint consumption
-
-You consume your endpoint by creating a sample input:
-
-```python
-data = {
- "data":
- [
- {
- 'age': "60",
- 'anaemia': "false",
- 'creatinine_phosphokinase': "500",
- 'diabetes': "false",
- 'ejection_fraction': "38",
- 'high_blood_pressure': "false",
- 'platelets': "260000",
- 'serum_creatinine': "1.40",
- 'serum_sodium': "137",
- 'sex': "false",
- 'smoking': "false",
- 'time': "130",
- },
- ],
-}
-
-test_sample = str.encode(json.dumps(data))
-```
-
-And then you can send this input to your model for prediction :
-
-```python
-response = aci_service.run(input_data=test_sample)
-response
-```
-
-This should output `'{"result": [false]}'`. This means that the patient input we sent to the endpoint generated the prediction `false` which means this person is not likely to have a heart attack.
-
-Congratulations! You just consumed the model deployed and trained on Azure ML with the Azure ML SDK!
-
-```{note}
-Once you are done with the project, don't forget to delete all the resources.
-```
-
-## Your turn! 🚀
-
- There are many other things you can do through the SDK, unfortunately, we can not view them all in this section. But good news, learning how to skim through the SDK documentation can take you a long way on your own. Have a look at the Azure ML SDK documentation and find the `Pipeline` class that allows you to create pipelines. A Pipeline is a collection of steps which can be executed as a workflow.
-
-```{note}
-**HINT:** Go to the [SDK documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) and type keywords in the search bar like "Pipeline". You should have the `azureml.pipeline.core.Pipeline` class in the search results.
-```
-
-Assignment - [Data Science project using Azure ML SDK](../../assignments/data-science/data-science-project-using-azure-ml-sdk.md)
-
-## Self study
-
-In this section, you learned how to train, deploy and consume a model to predict heart failure risk with the Azure ML SDK in the cloud. Check this [documentation](https://docs.microsoft.com/python/api/overview/azure/ml/?view=azure-ml-py?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) for further information about the Azure ML SDK. Try to create your own model with the Azure ML SDK.
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter.
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.ipynb b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.ipynb
new file mode 100644
index 0000000000..bb7978f63c
--- /dev/null
+++ b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.ipynb
@@ -0,0 +1,462 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "49bb1540-b0f6-4e28-9c34-b9d86bfe4f17",
+ "metadata": {
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "0a80de56-2278-4fd1-941e-c11d67db53e9",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "079cbda6",
+ "metadata": {},
+ "source": [
+ "# The \"low code/no code\" way\n",
+ "\n",
+ "## What is Azure Machine Learning(ML)?\n",
+ "\n",
+ "The Azure cloud platform is more than 200 products and cloud services designed to help you bring new solutions to life. Data scientists expend a lot of effort exploring and pre-processing data and trying various types of model-training algorithms to produce accurate models. These tasks are time-consuming and often make inefficient use of expensive compute hardware.\n",
+ "\n",
+ "[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) is a cloud-based platform for building and operating Machine Learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps them to increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively, to handle large volumes of data while incurring costs only when actually used.\n",
+ "\n",
+ "Azure ML provides all the tools developers and data scientists need for their Machine Learning workflows. These include:\n",
+ "\n",
+ "- **Azure Machine Learning Studio**: it is a web portal in Azure Machine Learning for low-code and no-code options for model training, deployment, automation, tracking and asset management. The studio integrates with the Azure Machine Learning SDK for a seamless experience.\n",
+ "- **Jupyter Notebooks**: quickly prototype and test ML models.\n",
+ "- **Azure Machine Learning Designer**: allows to drag-n-drop modules to build experiments and then deploy pipelines in a low-code environment.\n",
+ "- **Automated Machine Learning UI (AutoML)** : automates iterative tasks of Machine Learning model development, allowing to build Machine Learning models with high scale, efficiency, and productivity, all while sustaining model quality.\n",
+ "- **Data Labelling**: an assisted ML tool to automatically label data.\n",
+ "- **Machine Learning extension for Visual Studio Code**: provides a full-featured development environment for building and managing Machine Learning projects.\n",
+ "- **Machine Learning CLI**: provides commands for managing Azure ML resources from the command line.\n",
+ "- **Integration with open-source frameworks** such as PyTorch, TensorFlow, Scikit-learn and many more for training, deploying and managing the end-to-end Machine Learning process.\n",
+ "- **MLflow**: It is an open-source library for managing the life cycle of your Lachine Learning experiments. **MLflow Tracking** is a component of MLflow that logs and tracks your training run metrics and model artifacts, irrespective of your experiment's environment.\n",
+ "\n",
+ "## The heart failure prediction project\n",
+ "\n",
+ "There is no doubt that making and building projects are the best way to put your skills and knowledge to the test. In this section, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema:\n",
+ "\n",
+ "![project-schema](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/project-schema.png)\n",
+ "\n",
+ "Each way has its own pros and cons. The Low code/No code way is easier to start with as it involves interacting with a GUI (Graphical User Interface), with no prior knowledge of code required. This method enables quick testing of the project's viability and to create POC (Proof Of Concept). However, as the project grows and things need to be production ready, it is not feasible to create resources through GUI. We need to programmatically automate everything, from the creation of resources to the deployment of a model. This is where knowing how to use the Azure ML SDK becomes crucial.\n",
+ "\n",
+ "| | Low code/no code | Azure ML SDK |\n",
+ "|-------------------|------------------|---------------------------|\n",
+ "| Expertise in code | Not required | Required |\n",
+ "| Time to develop | Fast and easy | Depends on code expertise |\n",
+ "| Production ready | No | Yes |\n",
+ "\n",
+ "## The heart failure dataset\n",
+ "\n",
+ "Cardiovascular diseases (CVDs) are the number 1 cause of death globally, accounting for 31% of all deaths worldwide. Environmental and behavioral risk factors such as use of tobacco, unhealthy diet and obesity, physical inactivity and harmful use of alcohol could be used as features for estimation models. Being able to estimate the probability of the development of a CVD could be of great use to prevent attacks in high-risk people.\n",
+ "\n",
+    "Kaggle has made a [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) publicly available, which we are going to use for this project. You can download the dataset now. This is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.\n",
+ "\n",
+ "| | Variable name | Type | Description | Example |\n",
+ "|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|\n",
+ "| 1 | age | numerical | age of the patient | 25 |\n",
+ "| 2 | anaemia | boolean | Decrease of red blood cells or haemoglobin | 0 or 1 |\n",
+ "| 3 | creatinine_phosphokinase | numerical | Level of CPK enzyme in the blood | 542 |\n",
+ "| 4 | diabetes | boolean | If the patient has diabetes | 0 or 1 |\n",
+ "| 5 | ejection_fraction | numerical | Percentage of blood leaving the heart on each contraction | 45 |\n",
+ "| 6 | high_blood_pressure | boolean | If the patient has hypertension | 0 or 1 |\n",
+ "| 7 | platelets | numerical | Platelets in the blood | 149000 |\n",
+ "| 8 | serum_creatinine | numerical | Level of serum creatinine in the blood | 0.5 |\n",
+    "| 9  | serum_sodium               | numerical       | Level of serum sodium in the blood                         | 137               |\n",
+ "| 10 | sex | boolean | woman or man | 0 or 1 |\n",
+ "| 11 | smoking | boolean | If the patient smokes | 0 or 1 |\n",
+ "| 12 | time | numerical | follow-up period (days) | 4 |\n",
+    "| 13 | DEATH_EVENT [Target]       | boolean         | If the patient died during the follow-up period            | 0 or 1            |\n",
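+    "\n",
+    "Before uploading anything to Azure, you can optionally sanity-check the downloaded file locally with pandas. This is a small sketch; the file name below is the one Kaggle usually ships, so adjust it if yours differs:\n",
+    "\n",
+    "```python\n",
+    "import pandas as pd\n",
+    "\n",
+    "# Load the Kaggle CSV and confirm the shape described above (299 rows, 13 columns)\n",
+    "df = pd.read_csv('heart_failure_clinical_records_dataset.csv')\n",
+    "print(df.shape)\n",
+    "print(df['DEATH_EVENT'].value_counts())\n",
+    "```\n",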
+ "\n",
+ "Once you have the dataset, we can start the project in Azure.\n",
+ "\n",
+ "## Low code/no code training of a model in Azure ML Studio\n",
+ "\n",
+ "### Create an Azure ML workspace\n",
+ "\n",
+ "To train a model in Azure ML you first need to create an Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).\n",
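+    "\n",
+    "The steps below create the workspace through the portal. For comparison only, the same resource can also be created programmatically; the following is a hedged sketch using the Azure ML Python SDK (v1, `azureml-core`), where the workspace name, resource group, region and subscription ID are placeholders you would replace with your own:\n",
+    "\n",
+    "```python\n",
+    "from azureml.core import Workspace\n",
+    "\n",
+    "# Create (or retrieve) a workspace; all names and IDs here are placeholders\n",
+    "ws = Workspace.create(\n",
+    "    name='heart-failure-ws',\n",
+    "    subscription_id='<your-subscription-id>',\n",
+    "    resource_group='heart-failure-rg',\n",
+    "    create_resource_group=True,\n",
+    "    location='westeurope',\n",
+    ")\n",
+    "ws.write_config()  # saves config.json so later scripts can call Workspace.from_config()\n",
+    "```\n",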
+ "\n",
+ "It is recommended to use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:\n",
+ "\n",
+ "- Microsoft Edge (The new Microsoft Edge, the latest version. Not Microsoft Edge legacy)\n",
+ "- Safari (latest version, Mac only)\n",
+ "- Chrome (latest version)\n",
+ "- Firefox (latest version)\n",
+ "\n",
+ "To use Azure Machine Learning, create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your Machine Learning workloads.\n",
+ "\n",
+ ":::{note}\n",
+    "Your Azure subscription will be charged a small amount for data storage as long as the Azure Machine Learning workspace exists in your subscription, so we recommend that you delete the Azure Machine Learning workspace when you are no longer using it.\n",
+ ":::\n",
+ "\n",
+ "1\\. Sign in to the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.\n",
+ "\n",
+ "2\\. Select **+Create a resource**.\n",
+ "\n",
+ "![workspace-1](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-1.PNG)\n",
+ "\n",
+ "Search for Machine Learning and select the Machine Learning tile.\n",
+ "\n",
+ "![workspace-2](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-2.PNG)\n",
+ "\n",
+ "Click the create button.\n",
+ "\n",
+ "![workspace-3](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-3.PNG)\n",
+ "\n",
+ "Fill in the settings as follows:\n",
+ "\n",
+ "- Subscription: Your Azure subscription.\n",
+ "- Resource group: Create or select a resource group.\n",
+ "- Workspace name: Enter a unique name for your workspace.\n",
+ "- Region: Select the geographical region closest to you.\n",
+ "- Storage account: Note the default new storage account that will be created for your workspace.\n",
+ "- Key vault: Note the default new key vault that will be created for your workspace.\n",
+ "- Application insights: Note the default new application insights resource that will be created for your workspace.\n",
+ "- Container registry: None (one will be created automatically the first time you deploy a model to a container)\n",
+ " ![workspace-4](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-4.PNG).\n",
+    "- Click the Review + create button, and then the Create button.\n",
+ " \n",
+ "3\\. Wait for your workspace to be created (this can take a few minutes). Then go to it in the portal. You can find it through the Machine Learning Azure service.\n",
+ "\n",
+    "4\\. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to [Azure ML](https://ml.azure.com)), and sign into Azure Machine Learning studio using your Microsoft account. If prompted, select your Azure directory and subscription, and your Azure Machine Learning workspace.\n",
+ "\n",
+ "![workspace-5](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-5.PNG)\n",
+ "\n",
+    "5\\. In Azure Machine Learning Studio, toggle the ☰ icon at the top left to view the various pages in the interface. You can use these pages to manage the resources in your workspace.\n",
+ "\n",
+ "![workspace-6](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/workspace-6.PNG)\n",
+ "\n",
+ "You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning Studio provides a more focused user interface for managing workspace resources.\n",
+ "\n",
+ "### Compute resources\n",
+ "\n",
+ "Compute Resources are cloud-based resources on which you can run model training and data exploration processes. There are four kinds of compute resource you can create:\n",
+ "\n",
+    "- **Compute Instances**: Development workstations that data scientists can use to work with data and models. This involves the creation of a Virtual Machine (VM) and launching a notebook instance. You can then train a model by calling a compute cluster from the notebook.\n",
+ "- **Compute Clusters**: Scalable clusters of VMs for on-demand processing of experiment code. You will need it when training a model. Compute clusters can also employ specialized GPU or CPU resources.\n",
+ "- **Inference Clusters**: Deployment targets for predictive services that use your trained models.\n",
+ "- **Attached Compute**: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.\n",
+ "\n",
+ "### Choosing the right options for your compute resources\n",
+ "\n",
+    "There are some key factors to consider when creating a compute resource, and those choices can be critical decisions.\n",
+ "\n",
+ "**Do you need CPU or GPU?**\n",
+ "\n",
+ "A CPU (Central Processing Unit) is the electronic circuitry that executes instructions comprising a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit that can execute graphics-related code at a very high rate. \n",
+ "\n",
+    "The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide range of tasks quickly (as measured by CPU clock speed), but is limited in how many tasks it can run concurrently. GPUs are designed for parallel computing and are therefore much better at deep learning tasks.\n",
+ "\n",
+ "| CPU | GPU |\n",
+ "|-----------------------------------------|-----------------------------|\n",
+ "| Less expensive | More expensive |\n",
+ "| Lower level of concurrency | Higher level of concurrency |\n",
+ "| Slower in training deep learning models | Optimal for deep learning |\n",
+ "\n",
+ "**Cluster size**\n",
+ "\n",
+ "Larger clusters are more expensive but will result in better responsiveness. Therefore, if you have time but not enough money, you should start with a small cluster. Conversely, if you have money but not much time, you should start with a larger cluster.\n",
+ "\n",
+ "**VM size**\n",
+ "\n",
+ "Depending on your time and budgetary constraints, you can vary the size of your RAM, disk, number of cores and clock speed. Increasing all those parameters will be costlier, but will result in better performance.\n",
+ "\n",
+ "**Dedicated or low-priority instances?**\n",
+ "\n",
+ "A low-priority instance means that it is interruptible: essentially, Microsoft Azure can take those resources and assign them to another task, thus interrupting a job. A dedicated instance, or non-interruptible, means that the job will never be terminated without your permission.\n",
+ "This is another consideration of time vs money, since interruptible instances are less expensive than dedicated ones.\n",
+ "\n",
+ "### Creating a compute cluster\n",
+ "\n",
+    "In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to Compute and you will be able to see the different compute resources we just discussed (i.e. compute instances, compute clusters, inference clusters and attached compute). For this project, we are going to need a compute cluster for model training. In the Studio, click on the \"Compute\" menu, then the \"Compute cluster\" tab and click on the \"+ New\" button to create a compute cluster.\n",
+ "\n",
+ "![22](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/cluster-1.PNG)\n",
+ "\n",
+ "6\\. Choose your options: Dedicated vs Low priority, CPU or GPU, VM size and core number (you can keep the default settings for this project).\n",
+ "\n",
+ "7\\. Click on the Next button.\n",
+ "\n",
+ "![23](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/cluster-2.PNG)\n",
+ "\n",
+    "8\\. Give the cluster a compute name.\n",
+ "\n",
+    "9\\. Choose your options: Minimum/Maximum number of nodes, Idle seconds before scale down, SSH access. Note that if the minimum number of nodes is 0, you will save money when the cluster is idle, and that the higher the maximum number of nodes, the shorter the training will be. The recommended maximum number of nodes is 3.\n",
+ "\n",
+ "10\\. Click on the \"Create\" button. This step may take a few minutes.\n",
+ "![29](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/cluster-3.PNG)\n",
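+    "\n",
+    "For reference, the choices made in steps 6-10 map to parameters of the Azure ML Python SDK (v1). The following is a hedged sketch that assumes `azureml-core` is installed and a `config.json` is available; the cluster name and VM size are placeholders:\n",
+    "\n",
+    "```python\n",
+    "from azureml.core import Workspace\n",
+    "from azureml.core.compute import AmlCompute, ComputeTarget\n",
+    "\n",
+    "ws = Workspace.from_config()\n",
+    "\n",
+    "# A dedicated CPU cluster that scales between 0 and 3 nodes and scales down when idle\n",
+    "compute_config = AmlCompute.provisioning_configuration(\n",
+    "    vm_size='STANDARD_DS3_V2',\n",
+    "    vm_priority='dedicated',\n",
+    "    min_nodes=0,\n",
+    "    max_nodes=3,\n",
+    "    idle_seconds_before_scaledown=2400,\n",
+    ")\n",
+    "cluster = ComputeTarget.create(ws, 'heart-failure-cluster', compute_config)\n",
+    "cluster.wait_for_completion(show_output=True)\n",
+    "```\n",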
+ "\n",
+ "Awesome! Now that we have a Compute cluster, we need to load the data to Azure ML Studio.\n",
+ "\n",
+ "### Loading the dataset\n",
+ "\n",
+ "11\\. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on \"Datasets\" in the left menu and click on the \"+ Create dataset\" button to create a dataset. Choose the \"From local files\" option and select the Kaggle dataset we downloaded earlier.\n",
+ "![24](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/dataset-1.PNG)\n",
+ "\n",
+ "12\\. Give your dataset a name, a type and a description. Click Next. Upload the data from files. Click Next.\n",
+ "![25](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/dataset-2.PNG)\n",
+ "\n",
+ "13\\. In the Schema, change the data type to Boolean for the following features: anemia, diabetes, high blood pressure, sex, smoking, and DEATH_EVENT. Click Next and Click Create.\n",
+ "![26](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/dataset-3.PNG)\n",
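+    "\n",
+    "Again for reference only, the same dataset could be uploaded and registered from code. This is a hedged sketch with the SDK (v1); the CSV file name and the dataset name are assumptions:\n",
+    "\n",
+    "```python\n",
+    "from azureml.core import Workspace, Dataset\n",
+    "\n",
+    "ws = Workspace.from_config()\n",
+    "datastore = ws.get_default_datastore()\n",
+    "\n",
+    "# Upload the local CSV to the default datastore and register it as a tabular dataset\n",
+    "datastore.upload_files(\n",
+    "    ['heart_failure_clinical_records_dataset.csv'],\n",
+    "    target_path='heart-failure/',\n",
+    "    overwrite=True,\n",
+    ")\n",
+    "dataset = Dataset.Tabular.from_delimited_files(\n",
+    "    path=(datastore, 'heart-failure/heart_failure_clinical_records_dataset.csv')\n",
+    ")\n",
+    "dataset = dataset.register(workspace=ws, name='heart-failure-records',\n",
+    "                           description='Heart failure clinical records from Kaggle')\n",
+    "```\n",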
+ "\n",
+ "Great! Now that the dataset is in place and the compute cluster is created, we can start the training of the model!\n",
+ "\n",
+ "## Low code/no code training with AutoML\n",
+ "\n",
+ "Traditional Machine Learning model development is resource-intensive, requires significant domain knowledge and time to produce and compare dozens of models. \n",
+ "Automated Machine Learning (AutoML), is the process of automating the time-consuming, iterative tasks of Machine Learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. It reduces the time it takes to get production-ready ML models, with great ease and efficiency. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)\n",
+ "\n",
+ "14\\. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier click on \"Automated ML\" in the left menu and select the dataset you just uploaded. Click Next.\n",
+ "![27](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/aml-1.PNG)\n",
+ "\n",
+ "15\\. Enter a new experiment name, the target column (DEATH_EVENT) and the compute cluster we created. Click Next.\n",
+ "![28](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/aml-2.PNG)\n",
+ "\n",
+    "16\\. Choose \"Classification\" and click Finish. This step might take between 30 minutes and 1 hour, depending on your compute cluster size.\n",
+ "![30](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/aml-3.PNG)\n",
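+    "\n",
+    "The same experiment could also be submitted from code. This hedged SDK (v1) sketch mirrors the choices above (classification task, DEATH_EVENT target, the compute cluster we created); the experiment name, primary metric and timeout are assumptions:\n",
+    "\n",
+    "```python\n",
+    "from azureml.core import Workspace, Dataset, Experiment\n",
+    "from azureml.core.compute import ComputeTarget\n",
+    "from azureml.train.automl import AutoMLConfig\n",
+    "\n",
+    "ws = Workspace.from_config()\n",
+    "dataset = Dataset.get_by_name(ws, 'heart-failure-records')\n",
+    "compute = ComputeTarget(workspace=ws, name='heart-failure-cluster')\n",
+    "\n",
+    "automl_config = AutoMLConfig(\n",
+    "    task='classification',\n",
+    "    training_data=dataset,\n",
+    "    label_column_name='DEATH_EVENT',\n",
+    "    compute_target=compute,\n",
+    "    primary_metric='AUC_weighted',\n",
+    "    experiment_timeout_hours=1,\n",
+    ")\n",
+    "run = Experiment(ws, 'heart-failure-automl').submit(automl_config, show_output=True)\n",
+    "best_run, best_model = run.get_output()  # best child run and its fitted model\n",
+    "```\n",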
+ "\n",
+ "17\\. Once the run is complete, click on the \"Automated ML\" tab, click on your run, and click on the Algorithm in the \"Best model summary\" card.\n",
+ "![31](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/aml-4.PNG)\n",
+ "\n",
+    "Here you can see a detailed description of the best model that AutoML generated. You can also explore the other models generated in the Models tab. Take a few minutes to explore them with the Explanations (preview) feature. Once you have chosen the model you want to use (here we will choose the best model selected by AutoML), we will see how we can deploy it.\n",
+ "\n",
+ "## Low code/no code model deployment and endpoint consumption\n",
+ "\n",
+ "### Model deployment\n",
+ "\n",
+    "The automated Machine Learning interface allows you to deploy the best model as a web service in a few steps. Deployment is the integration of the model so that it can make predictions based on new data and identify potential areas of opportunity. For this project, deployment to a web service means that medical applications will be able to consume the model to make live predictions of their patients' risk of having a heart attack.\n",
+ "\n",
+ "In the best model description, click on the \"Deploy\" button.\n",
+ "\n",
+ "![deploy-1](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deploy-1.PNG)\n",
+ "\n",
+    "18\\. Give it a name and a description, choose the compute type (Azure Container Instance), enable authentication and click on Deploy. This step might take about 20 minutes to complete. The deployment process entails several steps, including registering the model, generating resources, and configuring them for the web service. A status message appears under Deploy status. Select Refresh periodically to check the deployment status. It is deployed and running when the status is \"Healthy\".\n",
+ "\n",
+ "![deploy-2](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deploy-2.PNG)\n",
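+    "\n",
+    "For completeness, a deployment like the one configured in step 18 can also be scripted. This is a rough, hedged sketch with the SDK (v1): `ws` and `best_run` are assumed to come from the AutoML sketch earlier, and `score.py` is a placeholder entry script you would have to write yourself (the Studio generates the scoring code for you in the no-code flow):\n",
+    "\n",
+    "```python\n",
+    "from azureml.core.model import InferenceConfig, Model\n",
+    "from azureml.core.webservice import AciWebservice\n",
+    "\n",
+    "# Register the best AutoML model, then deploy it to an Azure Container Instance\n",
+    "# with authentication enabled, as in step 18.\n",
+    "model = best_run.register_model(model_name='heart-failure-model')\n",
+    "inference_config = InferenceConfig(entry_script='score.py', environment=best_run.get_environment())\n",
+    "deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1, auth_enabled=True)\n",
+    "service = Model.deploy(ws, 'heart-failure-service', [model], inference_config, deployment_config)\n",
+    "service.wait_for_deployment(show_output=True)\n",
+    "```\n",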
+ "\n",
+ "19\\. Once it has been deployed, click on the Endpoint tab and click on the endpoint you just deployed. You can find here all the details you need to know about the endpoint. \n",
+ "\n",
+ "![deploy-3](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/deploy-3.PNG)\n",
+ "\n",
+ "Amazing! Now that we have a model deployed, we can start the consumption of the endpoint.\n",
+ "\n",
+ "### Endpoint consumption\n",
+ "\n",
+    "Click on the \"Consume\" tab. Here you can find the REST endpoint and a Python consumption script. Take some time to read the Python code.\n",
+ "\n",
+ "This script can be run directly from your local machine and will consume your endpoint.\n",
+ "\n",
+ "![35](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/consumption-1.PNG)\n",
+ "\n",
+    "Take a moment to check these two lines of code:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "88ad0a9c",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "url = 'http://98e3715f-xxxx-xxxx-xxxx-9ec22d57b796.centralus.azurecontainer.io/score'\n",
+ "api_key = '' # Replace this with the API key for the web service"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "72761ffd",
+ "metadata": {},
+ "source": [
+    "The `url` variable is the REST endpoint found in the Consume tab, and the `api_key` variable is the primary key, also found in the Consume tab (only if you have enabled authentication). This is how the script can consume the endpoint.\n",
+ "\n",
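+    "In essence, the generated script builds a JSON payload and POSTs it to that URL, passing the key in an Authorization header. Here is a condensed, hedged sketch of that pattern (the exact generated code may differ slightly):\n",
+    "\n",
+    "```python\n",
+    "import json\n",
+    "import urllib.request\n",
+    "\n",
+    "# `data`, `url` and `api_key` are the variables defined in the generated script\n",
+    "body = json.dumps(data).encode('utf-8')\n",
+    "headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + api_key}\n",
+    "\n",
+    "req = urllib.request.Request(url, body, headers)\n",
+    "with urllib.request.urlopen(req) as response:\n",
+    "    print(response.read())\n",
+    "```\n",
+    "\n",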
+ "20\\. Running the script, you should see the following output:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "806ae22a",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "b'\"{\\\\\"result\\\\\": [true]}\"'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d244da87",
+ "metadata": {},
+ "source": [
+ "This means that the prediction of heart failure for the data given is true. This makes sense because if you look more closely at the data automatically generated in the script, everything is at 0 and false by default. You can change the data with the following input sample:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "4e10016c",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "data = {\n",
+ " \"data\":\n",
+ " [\n",
+ " {\n",
+ " 'age': \"0\",\n",
+ " 'anaemia': \"false\",\n",
+ " 'creatinine_phosphokinase': \"0\",\n",
+ " 'diabetes': \"false\",\n",
+ " 'ejection_fraction': \"0\",\n",
+ " 'high_blood_pressure': \"false\",\n",
+ " 'platelets': \"0\",\n",
+ " 'serum_creatinine': \"0\",\n",
+ " 'serum_sodium': \"0\",\n",
+ " 'sex': \"false\",\n",
+ " 'smoking': \"false\",\n",
+ " 'time': \"0\",\n",
+ " },\n",
+ " {\n",
+ " 'age': \"60\",\n",
+ " 'anaemia': \"false\",\n",
+ " 'creatinine_phosphokinase': \"500\",\n",
+ " 'diabetes': \"false\",\n",
+ " 'ejection_fraction': \"38\",\n",
+ " 'high_blood_pressure': \"false\",\n",
+ " 'platelets': \"260000\",\n",
+ " 'serum_creatinine': \"1.40\",\n",
+ " 'serum_sodium': \"137\",\n",
+ " 'sex': \"false\",\n",
+ " 'smoking': \"false\",\n",
+ " 'time': \"130\",\n",
+ " },\n",
+ " ],\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "cfd2909c",
+ "metadata": {},
+ "source": [
+    "The script should return:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "231ab12d",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "b'\"{\\\\\"result\\\\\": [true, false]}\"'"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "467172a8",
+ "metadata": {},
+ "source": [
+    "Congratulations! You just consumed a model that was trained and deployed on Azure ML!\n",
+ "\n",
+ ":::{note}\n",
+ "Once you are done with the project, don't forget to delete all the resources.\n",
+ ":::\n",
+ "\n",
+ "## Your turn! 🚀\n",
+ "\n",
+ "Look closely at the model explanations and details that AutoML generated for the top models. Try to understand why the best model is better than the other ones. What algorithms were compared? What are the differences between them? Why is the best one performing better in this case?\n",
+ "\n",
+ "Assignment - [Low code/no code Data Science project on Azure ML](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/data-science/low-code-no-code-data-science-project-on-azure-ml.md)\n",
+ "\n",
+ "## Self Study\n",
+ "\n",
+ "In this section, you learned how to train, deploy and consume a model to predict heart failure risk in a low code/no code fashion in the cloud. If you have not done it yet, dive deeper into the model explanations that AutoML generated for the top models and try to understand why the best model is better than others.\n",
+ "\n",
+ "You can go further into Low code/No code AutoML by reading this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter.\n",
+ "\n",
+ "Data for the Heart Failure Prediction project is sourced from [Larxel](https://www.kaggle.com/andrewmvd) on [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). It is licensed under the [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.5"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.md b/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.md
deleted file mode 100644
index ae8b659dfd..0000000000
--- a/open-machine-learning-jupyter-book/data-science/data-science-in-the-cloud/the-low-code-no-code-way.md
+++ /dev/null
@@ -1,333 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: Python
- name: Python3
----
-
-# The "low code/no code" way
-
-## What is Azure Machine Learning(ML)?
-
-The Azure cloud platform is more than 200 products and cloud services designed to help you bring new solutions to life. Data scientists expend a lot of effort exploring and pre-processing data and trying various types of model-training algorithms to produce accurate models. These tasks are time-consuming and often make inefficient use of expensive compute hardware.
-
-[Azure ML](https://docs.microsoft.com/azure/machine-learning/overview-what-is-azure-machine-learning?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109) is a cloud-based platform for building and operating Machine Learning solutions in Azure. It includes a wide range of features and capabilities that help data scientists prepare data, train models, publish predictive services, and monitor their usage. Most importantly, it helps them to increase their efficiency by automating many of the time-consuming tasks associated with training models; and it enables them to use cloud-based compute resources that scale effectively, to handle large volumes of data while incurring costs only when actually used.
-
-Azure ML provides all the tools developers and data scientists need for their Machine Learning workflows. These include:
-
-- **Azure Machine Learning Studio**: it is a web portal in Azure Machine Learning for low-code and no-code options for model training, deployment, automation, tracking and asset management. The studio integrates with the Azure Machine Learning SDK for a seamless experience.
-- **Jupyter Notebooks**: quickly prototype and test ML models.
-- **Azure Machine Learning Designer**: allows to drag-n-drop modules to build experiments and then deploy pipelines in a low-code environment.
-- **Automated Machine Learning UI (AutoML)** : automates iterative tasks of Machine Learning model development, allowing to build Machine Learning models with high scale, efficiency, and productivity, all while sustaining model quality.
-- **Data Labelling**: an assisted ML tool to automatically label data.
-- **Machine Learning extension for Visual Studio Code**: provides a full-featured development environment for building and managing Machine Learning projects.
-- **Machine Learning CLI**: provides commands for managing Azure ML resources from the command line.
-- **Integration with open-source frameworks** such as PyTorch, TensorFlow, Scikit-learn and many more for training, deploying and managing the end-to-end Machine Learning process.
-- **MLflow**: It is an open-source library for managing the life cycle of your Lachine Learning experiments. **MLflow Tracking** is a component of MLflow that logs and tracks your training run metrics and model artifacts, irrespective of your experiment's environment.
-
-## The heart failure prediction project
-
-There is no doubt that making and building projects are the best way to put your skills and knowledge to the test. In this section, we are going to explore two different ways of building a data science project for the prediction of heart failure attacks in Azure ML Studio, through Low code/No code and through the Azure ML SDK as shown in the following schema:
-
-![project-schema](../../../images/project-schema.png)
-
-Each way has its own pros and cons. The Low code/No code way is easier to start with as it involves interacting with a GUI (Graphical User Interface), with no prior knowledge of code required. This method enables quick testing of the project's viability and to create POC (Proof Of Concept). However, as the project grows and things need to be production ready, it is not feasible to create resources through GUI. We need to programmatically automate everything, from the creation of resources to the deployment of a model. This is where knowing how to use the Azure ML SDK becomes crucial.
-
-| | Low code/no code | Azure ML SDK |
-|-------------------|------------------|---------------------------|
-| Expertise in code | Not required | Required |
-| Time to develop | Fast and easy | Depends on code expertise |
-| Production ready | No | Yes |
-
-## The heart failure dataset
-
-Cardiovascular diseases (CVDs) are the number 1 cause of death globally, accounting for 31% of all deaths worldwide. Environmental and behavioral risk factors such as use of tobacco, unhealthy diet and obesity, physical inactivity and harmful use of alcohol could be used as features for estimation models. Being able to estimate the probability of the development of a CVD could be of great use to prevent attacks in high-risk people.
-
-Kaggle has made a [Heart Failure dataset](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data) publicly available, that we are going to use for this project. You can download the dataset now. This is a tabular dataset with 13 columns (12 features and 1 target variable) and 299 rows.
-
-| | Variable name | Type | Description | Example |
-|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|
-| 1 | age | numerical | age of the patient | 25 |
-| 2 | anaemia | boolean | Decrease of red blood cells or haemoglobin | 0 or 1 |
-| 3 | creatinine_phosphokinase | numerical | Level of CPK enzyme in the blood | 542 |
-| 4 | diabetes | boolean | If the patient has diabetes | 0 or 1 |
-| 5 | ejection_fraction | numerical | Percentage of blood leaving the heart on each contraction | 45 |
-| 6 | high_blood_pressure | boolean | If the patient has hypertension | 0 or 1 |
-| 7 | platelets | numerical | Platelets in the blood | 149000 |
-| 8 | serum_creatinine | numerical | Level of serum creatinine in the blood | 0.5 |
-| 9 | serum_sodium | numerical | Level of serum sodium in the blood | jun |
-| 10 | sex | boolean | woman or man | 0 or 1 |
-| 11 | smoking | boolean | If the patient smokes | 0 or 1 |
-| 12 | time | numerical | follow-up period (days) | 4 |
-|----|---------------------------|-----------------|-----------------------------------------------------------|-------------------|
-| 21 | DEATH_EVENT [Target] | boolean | if the patient dies during the follow-up period | 0 or 1 |
-
-Once you have the dataset, we can start the project in Azure.
-
-## Low code/no code training of a model in Azure ML Studio
-
-### Create an Azure ML workspace
-
-To train a model in Azure ML you first need to create an Azure ML workspace. The workspace is the top-level resource for Azure Machine Learning, providing a centralized place to work with all the artifacts you create when you use Azure Machine Learning. The workspace keeps a history of all training runs, including logs, metrics, output, and a snapshot of your scripts. You use this information to determine which training run produces the best model. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-workspace?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
-
-It is recommended to use the most up-to-date browser that's compatible with your operating system. The following browsers are supported:
-
-- Microsoft Edge (The new Microsoft Edge, the latest version. Not Microsoft Edge legacy)
-- Safari (latest version, Mac only)
-- Chrome (latest version)
-- Firefox (latest version)
-
-To use Azure Machine Learning, create a workspace in your Azure subscription. You can then use this workspace to manage data, compute resources, code, models, and other artifacts related to your Machine Learning workloads.
-
-```{note}
-Your Azure subscription will be charged a small amount for data storage as long as the Azure Machine Learning workspace exists in your subscription, so we recommend you to delete the Azure Machine Learning workspace when you are no longer using it.
-```
-
-1\. Sign in to the [Azure portal](https://ms.portal.azure.com/) using the Microsoft credentials associated with your Azure subscription.
-
-2\. Select **+Create a resource**.
-
-![workspace-1](../../../images/workspace-1.PNG)
-
-Search for Machine Learning and select the Machine Learning tile.
-
-![workspace-2](../../../images/workspace-2.PNG)
-
-Click the create button.
-
-![workspace-3](../../../images/workspace-3.PNG)
-
-Fill in the settings as follows:
-
-- Subscription: Your Azure subscription.
-- Resource group: Create or select a resource group.
-- Workspace name: Enter a unique name for your workspace.
-- Region: Select the geographical region closest to you.
-- Storage account: Note the default new storage account that will be created for your workspace.
-- Key vault: Note the default new key vault that will be created for your workspace.
-- Application insights: Note the default new application insights resource that will be created for your workspace.
-- Container registry: None (one will be created automatically the first time you deploy a model to a container)
- ![workspace-4](../../../images/workspace-4.PNG).
-- Click the create + review and then on the create button.
-
-3\. Wait for your workspace to be created (this can take a few minutes). Then go to it in the portal. You can find it through the Machine Learning Azure service.
-
-4\. On the Overview page for your workspace, launch Azure Machine Learning studio (or open a new browser tab and navigate to [Azure ML](https://ml.azure.com), and sign into Azure Machine Learning studio using your Microsoft account. If prompted, select your Azure directory and subscription, and your Azure Machine Learning workspace.
-
-![workspace-5](../../../images/workspace-5.PNG)
-
-5\. In Azure Machine Learning Studio, toggle the ☰ icon at the top left to view the various pages in the interface. You can use these pages to manage the resources in your workspace.
-
-![workspace-6](../../../images/workspace-6.PNG)
-
-You can manage your workspace using the Azure portal, but for data scientists and Machine Learning operations engineers, Azure Machine Learning Studio provides a more focused user interface for managing workspace resources.
-
-### Compute resources
-
-Compute Resources are cloud-based resources on which you can run model training and data exploration processes. There are four kinds of compute resource you can create:
-
-- **Compute Instances**: Development workstations that data scientists can use to work with data and models. This involves the creation of a Virtual Machine (VM) and launching a notebook instance. You can then train a model by calling a computer cluster from the notebook.
-- **Compute Clusters**: Scalable clusters of VMs for on-demand processing of experiment code. You will need it when training a model. Compute clusters can also employ specialized GPU or CPU resources.
-- **Inference Clusters**: Deployment targets for predictive services that use your trained models.
-- **Attached Compute**: Links to existing Azure compute resources, such as Virtual Machines or Azure Databricks clusters.
-
-### Choosing the right options for your compute resources
-
-Some key factors are to consider when creating a compute resource and those choices can be critical decisions to make.
-
-**Do you need CPU or GPU?**
-
-A CPU (Central Processing Unit) is the electronic circuitry that executes instructions comprising a computer program. A GPU (Graphics Processing Unit) is a specialized electronic circuit that can execute graphics-related code at a very high rate.
-
-The main difference between CPU and GPU architecture is that a CPU is designed to handle a wide-range of tasks quickly (as measured by CPU clock speed), but are limited in the concurrency of tasks that can be running. GPUs are designed for parallel computing and therefore are much better at deep learning tasks.
-
-| CPU | GPU |
-|-----------------------------------------|-----------------------------|
-| Less expensive | More expensive |
-| Lower level of concurrency | Higher level of concurrency |
-| Slower in training deep learning models | Optimal for deep learning |
-
-**Cluster size**
-
-Larger clusters are more expensive but will result in better responsiveness. Therefore, if you have time but not enough money, you should start with a small cluster. Conversely, if you have money but not much time, you should start with a larger cluster.
-
-**VM size**
-
-Depending on your time and budgetary constraints, you can vary the size of your RAM, disk, number of cores and clock speed. Increasing all those parameters will be costlier, but will result in better performance.
-
-**Dedicated or low-priority instances?**
-
-A low-priority instance means that it is interruptible: essentially, Microsoft Azure can take those resources and assign them to another task, thus interrupting a job. A dedicated instance, or non-interruptible, means that the job will never be terminated without your permission.
-This is another consideration of time vs money, since interruptible instances are less expensive than dedicated ones.
-
-### Creating a compute cluster
-
-In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, go to compute and you will be able to see the different compute resources we just discussed (i.e compute instances, compute clusters, inference clusters and attached compute). For this project, we are going to need a compute cluster for model training. In the Studio, Click on the "Compute" menu, then the "Compute cluster" tab and click on the "+ New" button to create a compute cluster.
-
-![22](../../../images/cluster-1.PNG)
-
-6\. Choose your options: Dedicated vs Low priority, CPU or GPU, VM size and core number (you can keep the default settings for this project).
-
-7\. Click on the Next button.
-
-![23](../../../images/cluster-2.PNG)
-
-8\. Give the cluster a compute name
-
-9\. Choose your options: Minimum/Maximum number of nodes, Idle seconds before scale down, SSH access. Note that if the minimum number of nodes is 0, you will save money when the cluster is idle. Note that the higher the number of maximum nodes, the shorter the training will be. The maximum number of nodes recommended is 3.
-
-10\. Click on the "Create" button. This step may take a few minutes.
-![29](../../../images/cluster-3.PNG)
-
-Awesome! Now that we have a Compute cluster, we need to load the data to Azure ML Studio.
-
-### Loading the dataset
-
-11\. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier, click on "Datasets" in the left menu and click on the "+ Create dataset" button to create a dataset. Choose the "From local files" option and select the Kaggle dataset we downloaded earlier.
-![24](../../../images/dataset-1.PNG)
-
-12\. Give your dataset a name, a type and a description. Click Next. Upload the data from files. Click Next.
-![25](../../../images/dataset-2.PNG)
-
-13\. In the Schema, change the data type to Boolean for the following features: anemia, diabetes, high blood pressure, sex, smoking, and DEATH_EVENT. Click Next and Click Create.
-![26](../../../images/dataset-3.PNG)
-
-Great! Now that the dataset is in place and the compute cluster is created, we can start the training of the model!
-
-## Low code/no code training with AutoML
-
-Traditional Machine Learning model development is resource-intensive, requires significant domain knowledge and time to produce and compare dozens of models.
-Automated Machine Learning (AutoML), is the process of automating the time-consuming, iterative tasks of Machine Learning model development. It allows data scientists, analysts, and developers to build ML models with high scale, efficiency, and productivity, all while sustaining model quality. It reduces the time it takes to get production-ready ML models, with great ease and efficiency. [Learn more](https://docs.microsoft.com/azure/machine-learning/concept-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109)
-
-14\. In the [Azure ML workspace](https://ml.azure.com/) that we created earlier click on "Automated ML" in the left menu and select the dataset you just uploaded. Click Next.
-![27](../../../images/aml-1.PNG)
-
-15\. Enter a new experiment name, the target column (DEATH_EVENT) and the compute cluster we created. Click Next.
-![28](../../../images/aml-2.PNG)
-
-16\. Choose "Classification" and Click Finish. This step might take between 30 minutes to 1 hour, depending upon your compute cluster size.
-![30](../../../images/aml-3.PNG)
-
-17\. Once the run is complete, click on the "Automated ML" tab, click on your run, and click on the Algorithm in the "Best model summary" card.
-![31](../../../images/aml-4.PNG)
-
-Here you can see a detailed description of the best model that AutoML generated. You can also explore other modes generated in the Models tab. Take a few minutes to explore the models in the Explanations (preview button). Once you have chosen the model you want to use (here we will choose the best model selected by autoML), we will see how we can deploy it.
-
-## Low code/no code model deployment and endpoint consumption
-
-### Model deployment
-
-The automated Machine Learning interface allows you to deploy the best model as a web service in a few steps. Deployment is the integration of the model so that it can make predictions based on new data and identify potential areas of opportunity. For this project, deployment to a web service means that medical applications will be able to consume the model to be able to make live predictions of their patient's risk to get a heart attack.
-
-In the best model description, click on the "Deploy" button.
-
-![deploy-1](../../../images/deploy-1.PNG)
-
-18\. Give it a name, a description, compute type (Azure Container Instance), enable authentication and click on Deploy. This step might take about 20 minutes to complete. The deployment process entails several steps including registering the model, generating resources, and configuring them for the web service. A status message appears under Deploy status. Select Refresh periodically to check the deployment status. It is deployed and running when the status is "Healthy".
-
-![deploy-2](../../../images/deploy-2.PNG)
-
-19\. Once it has been deployed, click on the Endpoint tab and click on the endpoint you just deployed. You can find here all the details you need to know about the endpoint.
-
-![deploy-3](../../../images/deploy-3.PNG)
-
-Amazing! Now that we have a model deployed, we can start the consumption of the endpoint.
-
-### Endpoint consumption
-
-Click on the "Consume" tab. Here you can find the REST endpoint and a python script in the consumption option. Take some time to read the python code.
-
-This script can be run directly from your local machine and will consume your endpoint.
-
-![35](../../../images/consumption-1.PNG)
-
-Take a moment to check those 2 lines of code:
-
-```python
-url = 'http://98e3715f-xxxx-xxxx-xxxx-9ec22d57b796.centralus.azurecontainer.io/score'
-api_key = '' # Replace this with the API key for the web service
-```
-
-The `url` variable is the REST endpoint found in the consume tab and the `api_key` variable is the primary key also found in the consume tab (only in the case you have enabled authentication). This is how the script can consume the endpoint.
-
-20\. Running the script, you should see the following output:
-
-```python
-b'"{\\"result\\": [true]}"'
-```
-
-This means that the prediction of heart failure for the data given is true. This makes sense because if you look more closely at the data automatically generated in the script, everything is at 0 and false by default. You can change the data with the following input sample:
-
-```python
-data = {
- "data":
- [
- {
- 'age': "0",
- 'anaemia': "false",
- 'creatinine_phosphokinase': "0",
- 'diabetes': "false",
- 'ejection_fraction': "0",
- 'high_blood_pressure': "false",
- 'platelets': "0",
- 'serum_creatinine': "0",
- 'serum_sodium': "0",
- 'sex': "false",
- 'smoking': "false",
- 'time': "0",
- },
- {
- 'age': "60",
- 'anaemia': "false",
- 'creatinine_phosphokinase': "500",
- 'diabetes': "false",
- 'ejection_fraction': "38",
- 'high_blood_pressure': "false",
- 'platelets': "260000",
- 'serum_creatinine': "1.40",
- 'serum_sodium': "137",
- 'sex': "false",
- 'smoking': "false",
- 'time': "130",
- },
- ],
-}
-```
-
-The script should return :
-
-```python
-b'"{\\"result\\": [true, false]}"'
-```
-
-Congratulations! You just consumed the model deployed and trained it on Azure ML!
-
-```{note}
-Once you are done with the project, don't forget to delete all the resources.
-```
-
-## Your turn! 🚀
-
-Look closely at the model explanations and details that AutoML generated for the top models. Try to understand why the best model is better than the other ones. What algorithms were compared? What are the differences between them? Why is the best one performing better in this case?
-
-Assignment - [Low code/no code Data Science project on Azure ML](../../assignments/data-science/low-code-no-code-data-science-project-on-azure-ml.md)
-
-## Self Study
-
-In this section, you learned how to train, deploy and consume a model to predict heart failure risk in a low code/no code fashion in the cloud. If you have not done it yet, dive deeper into the model explanations that AutoML generated for the top models and try to understand why the best model is better than others.
-
-You can go further into Low code/No code AutoML by reading this [documentation](https://docs.microsoft.com/azure/machine-learning/tutorial-first-experiment-automated-ml?WT.mc_id=academic-77958-bethanycheum&ocid=AID3041109).
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [Data Science for Beginners](https://github.com/microsoft/Data-Science-For-Beginners). It inspires the majority of the content in this chapter.
-
-Data for the Heart Failure Prediction project is sourced from [Larxel](https://www.kaggle.com/andrewmvd) on [Kaggle](https://www.kaggle.com/andrewmvd/heart-failure-clinical-data). It is licensed under the [Attribution 4.0 International (CC BY 4.0)](https://creativecommons.org/licenses/by/4.0/)
diff --git a/open-machine-learning-jupyter-book/ml-advanced/kernel-method.md b/open-machine-learning-jupyter-book/ml-advanced/kernel-method.md
index 040fbf6995..f487b1d70b 100644
--- a/open-machine-learning-jupyter-book/ml-advanced/kernel-method.md
+++ b/open-machine-learning-jupyter-book/ml-advanced/kernel-method.md
@@ -340,4 +340,9 @@ A demo of SVM. [source]<
A demo of SVM. [source]
-
\ No newline at end of file
+
+
+
+
+## Your turn! 🚀
+You can follow this [assignment](../assignments/ml-advanced/kernel-method/kernel-method-assignment-1.ipynb) to practise Support Vector Machines with examples.
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.ipynb
new file mode 100644
index 0000000000..d06c4c06a0
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.ipynb
@@ -0,0 +1,88 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys \n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "\n",
+ "\n",
+ "# Applied Machine Learning : build a web app\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.md b/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.md
deleted file mode 100644
index 97cf6c7306..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/classification/applied-ml-build-a-web-app.md
+++ /dev/null
@@ -1,16 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Applied Machine Learning : build a web app
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.ipynb
new file mode 100644
index 0000000000..5c1153e478
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.ipynb
@@ -0,0 +1,94 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys \n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Getting started with classification\n",
+ "\n",
+ "In Asia and India, food traditions are extremely diverse, and very delicious! Let's look at data about regional cuisines to try to understand their ingredients.\n",
+ "\n",
+ "![Thai food seller](https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/thai-food.jpg)\n",
+ "> Photo by Lisheng Chang on Unsplash\n",
+ "\n",
+ "In this section, you will build on your earlier study of Regression and learn about other classifiers that you can use to better understand the data.\n",
+ "\n",
+ ":::{seealso}\n",
+ "There are useful low-code tools that can help you learn about working with classification models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-classification-model-azure-machine-learning-designer/?WT.mc_id=academic-77952-leestott)\n",
+ ":::\n",
+ "\n",
+ "\n",
+ "---"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.md b/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.md
deleted file mode 100644
index 91c2110cd1..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/classification/getting-started-with-classification.md
+++ /dev/null
@@ -1,37 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Getting started with classification
-
-In Asia and India, food traditions are extremely diverse, and very delicious! Let's look at data about regional cuisines to try to understand their ingredients.
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/thai-food.jpg
----
-name: 'Thai food seller'
-width: 90%
----
-Photo by Lisheng Chang on Unsplash
-```
-
-In this section, you will build on your earlier study of Regression and learn about other classifiers that you can use to better understand the data.
-
-```{seealso}
-There are useful low-code tools that can help you learn about working with classification models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-classification-model-azure-machine-learning-designer/?WT.mc_id=academic-77952-leestott)
-```
-
----
-
-```{tableofcontents}
-```
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.ipynb
new file mode 100644
index 0000000000..0ba8cbc3b9
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.ipynb
@@ -0,0 +1,1447 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "id": "f1464d3f",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3\n",
+ "\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "848fcc94-3480-439c-b565-b8dc6072268a",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys \n",
+ "!{sys.executable} -m pip install --quiet pandas numpy matplotlib jupyterlab_myst ipython imblearn\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b0926c24",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "\n",
+ "# Introduction to classification\n",
+ "\n",
+    "In these four sections, you will explore a fundamental focus of classic machine learning: _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4cc6fb13",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/pinch.png\n",
+ "---\n",
+ "name: 'Celebrate pan-Asian cuisines in these lessons!'\n",
+ "width: 90%\n",
+ "---\n",
+ "Image by [Jen Looper](https://twitter.com/jenlooper)\n",
+ ":::"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "21bdf1d7",
+ "metadata": {},
+ "source": [
+    "Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that has a lot in common with regression techniques. If machine learning is all about predicting values or assigning names to things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "id": "4b39b77c",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input",
+ "output-scoll"
+ ]
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n",
+ "\"\"\"\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "ebd13d4a",
+ "metadata": {},
+ "source": [
+ "Click the video above for a quick introduction to classification."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "a446aae1",
+ "metadata": {},
+ "source": [
+ ":::{note}\n",
+ "- **Linear regression** helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict _what price a pumpkin would be in September vs. December_, for example.\n",
+ "- **Logistic regression** helped you discover \"binary categories\": at this price point, _is this pumpkin orange or not-orange_?\n",
+ "\n",
+ "Classification uses various algorithms to determine other ways of determining a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.\n",
+ ":::"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "id": "dad60c56",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input",
+ "output-scoll"
+ ]
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n",
+ "\n",
+ "A demo of Neural Network Playground. [source]\n",
+ "
\n",
+ "\n",
+ "A demo of Neural Network Playground. [source]\n",
+ "
\n",
+ "\"\"\"\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "44c95f32",
+ "metadata": {},
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value (\"is this email spam or not?\"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.\n",
+ "\n",
+ "To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.\n",
+ "\n",
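+    "To make that input-to-output mapping concrete before we turn to the cuisine data, here is a minimal sketch using scikit-learn's built-in iris dataset (scikit-learn is pulled in by the imblearn package installed above); the model choice is purely illustrative:\n",
+    "\n",
+    "```python\n",
+    "from sklearn.datasets import load_iris\n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "\n",
+    "# A tiny multiclass example: map 4 input variables (features) to 1 of 3 output classes\n",
+    "X, y = load_iris(return_X_y=True)\n",
+    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)\n",
+    "clf = LogisticRegression(max_iter=200).fit(X_train, y_train)\n",
+    "print(clf.predict(X_test[:5]), clf.score(X_test, y_test))\n",
+    "```\n",
+    "\n",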
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/binary-multiclass.png\n",
+ "---\n",
+ "name: 'binary vs. multiclass classification'\n",
+ "width: 90%\n",
+ "---\n",
+ "Infographic by [Jen Looper](https://twitter.com/jenlooper)\n",
+ ":::\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "752b6fbd",
+ "metadata": {},
+ "source": [
+ "Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.\n",
+ "\n",
+ "Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age` to determine _likelihood of developing X disease_. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome.\n",
+ "\n",
+ ":::{note}\n",
+ "Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see if, given a present of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?\n",
+ ":::"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "f7f31899",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input",
+ "output-scoll"
+ ]
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "\n",
+ "
\n",
+ "\n",
+ "\"\"\"\n",
+ " )\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "4084b626",
+ "metadata": {},
+ "source": [
+ "Click the video above. The whole premise of the show 'Chopped' is the 'mystery basket' where chefs have to make some dish out of a random choice of ingredients. Surely a ML model would have helped!"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d5416a55",
+ "metadata": {},
+ "source": [
+ "## Hello 'classifier'\n",
+ "\n",
+ "The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?\n",
+ "\n",
+ "Scikit-learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two sections, you'll learn about several of these algorithms.\n",
+ "\n",
+ "## Exercise - clean and balance your data\n",
+ "\n",
+ "The first task at hand, before starting this project, is to clean and **balance** your data to get better results. Start with the blank [delicious-asian-and-indian-cuisines.ipynb](../../assignments/ml-fundamentals/delicious-asian-and-indian-cuisines.ipynb) file.\n",
+ "\n",
+ "The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you will learn more about this task in a minute).\n",
+ "\n",
+ "1\\. Import the packages you need to import your data and visualize it, also import `SMOTE` from `imblearn`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "id": "1c741afb",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "import pandas as pd\n",
+ "import matplotlib.pyplot as plt\n",
+ "import matplotlib as mpl\n",
+ "import numpy as np\n",
+ "from imblearn.over_sampling import SMOTE"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "2f85a608",
+ "metadata": {},
+ "source": [
+ "Now you are set up to read import the data next.\n",
+ "\n",
+ "2\\. The next task will be to import the data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "id": "0bafb815",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "df = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/classification/cuisines.csv')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "fc4a461f",
+ "metadata": {},
+ "source": [
+ "Using `read_csv()` will read the content of the csv file _cusines.csv_ and place it in the variable `df`.\n",
+ "\n",
+ "3\\. Check the data's shape:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "id": "daaea537",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ],
+ "text/plain": [
+ " almond angelica anise anise_seed apple apple_brandy apricot \\\n",
+ "0 0 0 0 0 0 0 0 \n",
+ "1 1 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 0 0 \n",
+ "\n",
+ " armagnac artemisia artichoke ... whiskey white_bread white_wine \\\n",
+ "0 0 0 0 ... 0 0 0 \n",
+ "1 0 0 0 ... 0 0 0 \n",
+ "2 0 0 0 ... 0 0 0 \n",
+ "3 0 0 0 ... 0 0 0 \n",
+ "4 0 0 0 ... 0 0 0 \n",
+ "\n",
+ " whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
+ "0 0 0 0 0 0 0 0 \n",
+ "1 0 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 1 0 \n",
+ "\n",
+ "[5 rows x 380 columns]"
+ ]
+ },
+ "execution_count": 17,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "feature_df= df.drop(['cuisine' ,'Unnamed: 0' ,'rice' ,'garlic' ,'ginger'] , axis=1)\n",
+ "labels_df = df.cuisine #.unique()\n",
+ "feature_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "873e853e",
+ "metadata": {},
+ "source": [
+ "## Balance the dataset\n",
+ "\n",
+ "Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - \"Synthetic Minority Over-sampling Technique\" - to balance it.\n",
+ "\n",
+ "1\\. Call `fit_resample()`, this strategy generates new samples by interpolation."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "id": "c2b45ece",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "oversample = SMOTE()\n",
+ "transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d945acff",
+ "metadata": {},
+ "source": [
+ "By balancing your data, you'll have better results when classifying it. Think about a binary classification. If most of your data is one class, a ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance. \n",
+ "\n",
+ "2\\. Now you can check the numbers of labels per ingredient:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "id": "77e0437e",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "new label count: indian 799\n",
+ "thai 799\n",
+ "chinese 799\n",
+ "japanese 799\n",
+ "korean 799\n",
+ "Name: cuisine, dtype: int64\n",
+ "old label count: korean 799\n",
+ "indian 598\n",
+ "chinese 442\n",
+ "japanese 320\n",
+ "thai 289\n",
+ "Name: cuisine, dtype: int64\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f'new label count: {transformed_label_df.value_counts()}')\n",
+ "print(f'old label count: {df.cuisine.value_counts()}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "1cce2396",
+ "metadata": {},
+ "source": [
+ "The data is nice and clean, balanced, and very delicious!\n",
+ "\n",
+ "3\\. The last step is to save your balanced data, including labels and features, into a new dataframe that can be exported into a file:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "id": "cd8e6186",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ }
+ },
+ "outputs": [],
+ "source": [
+ "transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "b55e246f",
+ "metadata": {},
+ "source": [
+ "4\\. You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future sections:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 21,
+ "id": "eafdd7e8",
+ "metadata": {
+ "attributes": {
+ "classes": [
+ "code-cell"
+ ],
+ "id": ""
+ },
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "\n",
+ "RangeIndex: 3995 entries, 0 to 3994\n",
+ "Columns: 381 entries, cuisine to zucchini\n",
+ "dtypes: int64(380), object(1)\n",
+ "memory usage: 11.6+ MB\n"
+ ]
+ }
+ ],
+ "source": [
+ "transformed_df.head()\n",
+ "transformed_df.info()\n",
+ "transformed_df.to_csv(\"https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/cleaned_cuisines.csv \")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c83784b1",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "---\n",
+ "\n",
+ "## Self study\n",
+ "\n",
+ "This curriculum contains several interesting datasets. Dig through the [/data/classification](https://github.com/YinYi000/machine-learning/tree/main/open-machine-learning-jupyter-book/assets/data) folders and see if any contain datasets that would be appropriate for binary or multi-class classification? What questions would you ask of this dataset?\n",
+ "\n",
+ "## Your turn! 🚀\n",
+ "\n",
+ "Explore SMOTE's API. What use cases is it best used for? What problems does it solve?\n",
+ "\n",
+ "Assignment - [Explore classification methods](../../assignments/ml-fundamentals/explore-classification-methods.md)\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.md b/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.md
deleted file mode 100644
index 58f3c972b1..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/classification/introduction-to-classification.md
+++ /dev/null
@@ -1,272 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Introduction to classification
-
-In these four sections, you will explore a fundamental focus of classic machine learning _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/pinch.png
----
-name: 'Celebrate pan-Asian cuisines in these lessons!'
-width: 90%
----
-Image by [Jen Looper](https://twitter.com/jenlooper)
-```
-
-Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that bears a lot in common with regression techniques. If machine learning is all about predicting values or names to things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_.
-
-```{seealso}
-
-
-
-
-
-Click the video above for a quick introduction to classification.
-```
-
-```{note}
-- **Linear regression** helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict _what price a pumpkin would be in September vs. December_, for example.
-- **Logistic regression** helped you discover "binary categories": at this price point, _is this pumpkin orange or not-orange_?
-
-Classification uses various algorithms to determine other ways of determining a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
-```
-
-
-
-A demo of Neural Network Playground. [source]
-
-
-## Introduction
-
-Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.
-
-To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables to output variables.
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/binary-multiclass.png
----
-name: 'binary vs. multiclass classification'
-width: 90%
----
-Infographic by [Jen Looper](https://twitter.com/jenlooper)
-```
-
-Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
-
-Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age` to determine _likelihood of developing X disease_. As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome.
-
-```{note}
-Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see if, given a present of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?
-```
-
-```{seealso}
-
-
-
-
-
-Click the video above. The whole premise of the show 'Chopped' is the 'mystery basket' where chefs have to make some dish out of a random choice of ingredients. Surely a ML model would have helped!
-```
-
-## Hello 'classifier'
-
-The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
-
-Scikit-learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two sections, you'll learn about several of these algorithms.
-
-## Exercise - clean and balance your data
-
-The first task at hand, before starting this project, is to clean and **balance** your data to get better results. Start with the blank [delicious-asian-and-indian-cuisines.ipynb](../../assignments/ml-fundamentals/delicious-asian-and-indian-cuisines.ipynb) file.
-
-The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you will learn more about this task in a minute).
-
-1\. Import the packages you need to import your data and visualize it, also import `SMOTE` from `imblearn`.
-
-```{code-cell}
-import pandas as pd
-import matplotlib.pyplot as plt
-import matplotlib as mpl
-import numpy as np
-from imblearn.over_sampling import SMOTE
-```
-
-Now you are set up to read import the data next.
-
-2\. The next task will be to import the data:
-
-```{code-cell}
-df = pd.read_csv('../../assets/data/classification/cuisines.csv')
-```
-
-Using `read_csv()` will read the content of the csv file _cusines.csv_ and place it in the variable `df`.
-
-3\. Check the data's shape:
-
-```{code-cell}
-:tags: [output_scroll]
-
-df.head()
-```
-
-The first five rows look like this.
-
-4\. Get info about this data by calling `info()`:
-
-```{code-cell}
-df.info()
-```
-
-## Exercise - learning about cuisines
-
-Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine
-
-1\. Plot the data as bars by calling `barh()`:
-
-```{code-cell}
-df.cuisine.value_counts().plot.barh()
-```
-
-There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.
-
-2\. Find out how much data is available per cuisine and print it out:
-
-```{code-cell}
-thai_df = df[(df.cuisine == "thai")]
-japanese_df = df[(df.cuisine == "japanese")]
-chinese_df = df[(df.cuisine == "chinese")]
-indian_df = df[(df.cuisine == "indian")]
-korean_df = df[(df.cuisine == "korean")]
-
-print(f'thai df: {thai_df.shape}')
-print(f'japanese df: {japanese_df.shape}')
-print(f'chinese df: {chinese_df.shape}')
-print(f'indian df: {indian_df.shape}')
-print(f'korean df: {korean_df.shape}')
-```
-
-## Discovering ingredients
-
-Now you can dig deeper into the data and learn what are the typical ingredients per cuisine. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
-
-1\. Create a function `create_ingredient()` in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sort through ingredients by their count:
-
-```{code-cell}
-def create_ingredient_df(df):
- ingredient_df = df.T.drop(['cuisine' ,'Unnamed: 0']).sum(axis=1).to_frame('value')
- ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
- ingredient_df = ingredient_df.sort_values(by='value' , ascending=False,
- inplace=False)
- return ingredient_df
-```
-
-Now you can use that function to get an idea of top ten most popular ingredients by cuisine.
-
-2\. Call `create_ingredient()` and plot it calling `barh()`:
-
-```{code-cell}
-thai_ingredient_df = create_ingredient_df(thai_df)
-thai_ingredient_df.head(10).plot.barh()
-```
-
-3\. Do the same for the japanese data:
-
-```{code-cell}
-japanese_ingredient_df = create_ingredient_df(japanese_df)
-japanese_ingredient_df.head(10).plot.barh()
-```
-
-4\. Now for the chinese ingredients:
-
-```{code-cell}
-chinese_ingredient_df = create_ingredient_df(chinese_df)
-chinese_ingredient_df.head(10).plot.barh()
-```
-
-5\. Plot the indian ingredients:
-
-```{code-cell}
-indian_ingredient_df = create_ingredient_df(indian_df)
-indian_ingredient_df.head(10).plot.barh()
-```
-
-6\. Finally, plot the korean ingredients:
-
-```{code-cell}
-korean_ingredient_df = create_ingredient_df(korean_df)
-korean_ingredient_df.head(10).plot.barh()
-```
-
-7\. Now, drop the most common ingredients that create confusion between distinct cuisines, by calling `drop()`:
-
-Everyone loves rice, garlic and ginger!
-
-```{code-cell}
-:tags: [output_scroll]
-
-feature_df= df.drop(['cuisine' ,'Unnamed: 0' ,'rice' ,'garlic' ,'ginger'] , axis=1)
-labels_df = df.cuisine #.unique()
-feature_df.head()
-```
-
-## Balance the dataset
-
-Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
-
-1\. Call `fit_resample()`, this strategy generates new samples by interpolation.
-
-```{code-cell}
-oversample = SMOTE()
-transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
-```
-
-By balancing your data, you'll have better results when classifying it. Think about a binary classification. If most of your data is one class, a ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance.
-
-2\. Now you can check the numbers of labels per ingredient:
-
-```{code-cell}
-print(f'new label count: {transformed_label_df.value_counts()}')
-print(f'old label count: {df.cuisine.value_counts()}')
-```
-
-The data is nice and clean, balanced, and very delicious!
-
-3\. The last step is to save your balanced data, including labels and features, into a new dataframe that can be exported into a file:
-
-```{code-cell}
-transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
-```
-
-4\. You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future sections:
-
-```{code-cell}
-transformed_df.head()
-transformed_df.info()
-transformed_df.to_csv("../../assets/data/cleaned_cuisines.csv ")
-```
-
----
-
-## Self study
-
-This curriculum contains several interesting datasets. Dig through the [/data/classification](https://github.com/YinYi000/machine-learning/tree/main/open-machine-learning-jupyter-book/assets/data) folders and see if any contain datasets that would be appropriate for binary or multi-class classification? What questions would you ask of this dataset?
-
-## Your turn! 🚀
-
-Explore SMOTE's API. What use cases is it best used for? What problems does it solve?
-
-Assignment - [Explore classification methods](../../assignments/ml-fundamentals/explore-classification-methods.md)
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter.
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.ipynb
new file mode 100644
index 0000000000..65f8b5f346
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.ipynb
@@ -0,0 +1,908 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# More classifiers\n",
+ "\n",
+ "In this section, you will use the dataset you saved from the last section full of balanced, clean data all about cuisines.\n",
+ "\n",
+ "You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.\n",
+ "\n",
+ "## Exercise - predict a national cuisine\n",
+ "\n",
+ "1\\. Working in this section's [build-classification-models](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/ml-fundamentals/build-classification-models.ipynb) file, import that file along with the Pandas library:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
"
+ ],
+ "text/plain": [
+ " almond angelica anise anise_seed apple apple_brandy apricot \\\n",
+ "0 0 0 0 0 0 0 0 \n",
+ "1 1 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 0 0 \n",
+ "\n",
+ " armagnac artemisia artichoke ... whiskey white_bread white_wine \\\n",
+ "0 0 0 0 ... 0 0 0 \n",
+ "1 0 0 0 ... 0 0 0 \n",
+ "2 0 0 0 ... 0 0 0 \n",
+ "3 0 0 0 ... 0 0 0 \n",
+ "4 0 0 0 ... 0 0 0 \n",
+ "\n",
+ " whole_grain_wheat_flour wine wood yam yeast yogurt zucchini \n",
+ "0 0 0 0 0 0 0 0 \n",
+ "1 0 0 0 0 0 0 0 \n",
+ "2 0 0 0 0 0 0 0 \n",
+ "3 0 0 0 0 0 0 0 \n",
+ "4 0 0 0 0 0 1 0 \n",
+ "\n",
+ "[5 rows x 380 columns]"
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)\n",
+ "cuisines_feature_df.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now you are ready to train your model!\n",
+ "\n",
+ "## Choosing your classifier\n",
+ "\n",
+ "Now that your data is clean and ready for training, you have to decide which algorithm to use for the job. \n",
+ "\n",
+ "Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods all include classification techniques:\n",
+ "\n",
+ "- Linear Models\n",
+ "- Support Vector Machines\n",
+ "- Stochastic Gradient Descent\n",
+ "- Nearest Neighbors\n",
+ "- Gaussian Processes\n",
+ "- Decision Trees\n",
+ "- Ensemble methods (voting Classifier)\n",
+ "- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)\n",
+ "\n",
+ ":::{seealso}\n",
+ "You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this section.\n",
+ ":::\n",
+ "\n",
+ "### What classifier to go with?\n",
+ "\n",
+ "So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscrinationAnalysis, showing the results visualized:\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/comparison.png\n",
+ "---\n",
+ "name: 'comparison of classifiers'\n",
+ "width: 90%\n",
+ "---\n",
+ "Comparison of classifiers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/comparison.png)\n",
+ ":::\n",
+ "\n",
+ ":::{seealso}\n",
+ "Plots generated on Scikit-learn's documentation.\n",
+ "\n",
+ "AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-77952-leestott).\n",
+ ":::\n",
+ "\n",
+ "### A better approach\n",
+ "\n",
+ "A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-77952-leestott). Here, we discover that, for our multiclass problem, we have some choices:\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/cheatsheet.png\n",
+ "---\n",
+ "name: 'cheatsheet for multiclass problems'\n",
+ "width: 90%\n",
+ "---\n",
+ "Cheatsheet for multiclass problems [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/cheatsheet.png)\n",
+ ":::\n",
+ "\n",
+ ":::{note}\n",
+ "A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options.\n",
+ ":::\n",
+ "\n",
+ ":::{seealso}\n",
+ "Download this cheat sheet, print it out, and hang it on your wall!\n",
+ ":::\n",
+ "\n",
+ "### Reasoning\n",
+ "\n",
+ "Let's see if we can reason our way through different approaches given the constraints we have:\n",
+ "\n",
+ "- **Neural networks are too heavy**. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.\n",
+ "- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all. \n",
+ "- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data. \n",
+ "- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.\n",
+ "\n",
+ "### Using Scikit-learn \n",
+ "\n",
+ "We will be using Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression). \n",
+ "\n",
+ "Essentially there are two important parameters - `multi_class` and `solver` - that we need to specify, when we ask Scikit-learn to perform a logistic regression. The `multi_class` value applies a certain behavior. The value of the solver is what algorithm to use. Not all solvers can be paired with all `multi_class` values.\n",
+ "\n",
+ "According to the docs, in the multiclass case, the training algorithm:\n",
+ "\n",
+ "- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`.\n",
+ "- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)\n",
+ "\n",
+ ":::{seealso}\n",
+ "The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [🔗source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)\n",
+ "\n",
+ "The 'solver' is defined as \"the algorithm to use in the optimization problem\". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).\n",
+ ":::\n",
+ "\n",
+ "Scikit-learn offers this table to explain how solvers handle different challenges presented by different kinds of data structures:\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/solvers.png\n",
+ "---\n",
+ "name: 'solvers'\n",
+ "width: 90%\n",
+ "---\n",
+ "Solvers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/solvers.png)\n",
+ ":::\n",
+ "\n",
+ "## Exercise - split the data\n",
+ "\n",
+ "We can focus on logistic regression for our first training trial since you recently learned about the latter in a previous section.\n",
+ "Split your data into training and testing groups by calling `train_test_split()`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Exercise - apply logistic regression\n",
+ "\n",
+ "Since you are using the multiclass case, you need to choose what _scheme_ to use and what _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.\n",
+ "\n",
+ "1\\. Create a logistic regression with multi_class set to `ovr` and the solver set to `liblinear`:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 37,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy is 0.8256880733944955\n"
+ ]
+ }
+ ],
+ "source": [
+ "lr = LogisticRegression(multi_class='ovr',solver='liblinear')\n",
+ "model = lr.fit(X_train, np.ravel(y_train))\n",
+ "\n",
+ "accuracy = model.score(X_test, y_test)\n",
+ "print (\"Accuracy is {}\".format(accuracy))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::{seealso}\n",
+ "Try a different solver like `lbfgs`, which is often set as default.\n",
+ ":::\n",
+ "\n",
+ ":::{note}\n",
+ "Use Pandas [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.\n",
+ ":::\n",
+ "\n",
+ "The accuracy is good at over **80%**!\n",
+ "\n",
+ "2\\. You can see this model in action by testing one row of data (#50):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 38,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "ingredients: Index(['cinnamon', 'cream', 'egg', 'milk', 'milk_fat'], dtype='object')\n",
+ "cuisine: indian\n"
+ ]
+ }
+ ],
+ "source": [
+ "print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')\n",
+ "print(f'cuisine: {y_test.iloc[50]}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::{seealso}\n",
+ "Try a different row number and check the results.\n",
+ ":::\n",
+ "\n",
+ "3\\. Digging deeper, you can check for the accuracy of this prediction:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 39,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "
\n",
+ "\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
\n",
+ "
0
\n",
+ "
\n",
+ " \n",
+ " \n",
+ "
\n",
+ "
indian
\n",
+ "
0.583259
\n",
+ "
\n",
+ "
\n",
+ "
japanese
\n",
+ "
0.177337
\n",
+ "
\n",
+ "
\n",
+ "
chinese
\n",
+ "
0.130770
\n",
+ "
\n",
+ "
\n",
+ "
korean
\n",
+ "
0.090274
\n",
+ "
\n",
+ "
\n",
+ "
thai
\n",
+ "
0.018360
\n",
+ "
\n",
+ " \n",
+ "
\n",
+ "
"
+ ],
+ "text/plain": [
+ " 0\n",
+ "indian 0.583259\n",
+ "japanese 0.177337\n",
+ "chinese 0.130770\n",
+ "korean 0.090274\n",
+ "thai 0.018360"
+ ]
+ },
+ "execution_count": 39,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "test= X_test.iloc[50].values.reshape(-1, 1).T\n",
+ "proba = model.predict_proba(test)\n",
+ "classes = model.classes_\n",
+ "resultdf = pd.DataFrame(data=proba, columns=classes)\n",
+ "\n",
+ "topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])\n",
+ "topPrediction.head()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ ":::{seealso}\n",
+ "Can you explain why the model is pretty sure this is an Indian cuisine?\n",
+ ":::\n",
+ "\n",
+ "4\\. Get more detail by printing a classification report, as you did in the regression sections:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.74 0.76 0.75 243\n",
+ " indian 0.92 0.92 0.92 213\n",
+ " japanese 0.81 0.77 0.79 251\n",
+ " korean 0.84 0.82 0.83 253\n",
+ " thai 0.83 0.87 0.85 239\n",
+ "\n",
+ " accuracy 0.83 1199\n",
+ " macro avg 0.83 0.83 0.83 1199\n",
+ "weighted avg 0.83 0.83 0.83 1199\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "y_pred = model.predict(X_test)\n",
+ "print(classification_report(y_test,y_pred))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "## Your turn! 🚀\n",
+ "\n",
+ "In this section, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.\n",
+ "\n",
+ "Assignment - [Study the solvers](../../assignments/ml-fundamentals/study-the-solvers.md).\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.md b/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.md
deleted file mode 100644
index 8466bca18f..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/classification/more-classifiers.md
+++ /dev/null
@@ -1,234 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# More classifiers
-
-In this section, you will use the dataset you saved from the last section full of balanced, clean data all about cuisines.
-
-You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
-
-## Exercise - predict a national cuisine
-
-1\. Working in this section's [build-classification-models](../../assignments/ml-fundamentals/build-classification-models.ipynb) file, import that file along with the Pandas library:
-
-```{code-cell}
-:tags: [output_scroll]
-
-import pandas as pd
-cuisines_df = pd.read_csv("../../assets/data/classification/cleaned_cuisines.csv")
-cuisines_df.head()
-```
-
-2\. Now, import several more libraries:
-
-```{code-cell}
-from sklearn.linear_model import LogisticRegression
-from sklearn.model_selection import train_test_split, cross_val_score
-from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
-from sklearn.svm import SVC
-import numpy as np
-```
-
-3\. Divide the x and y coordinates into two dataframes for training. `cuisine` can be the labels dataframe:
-
-```{code-cell}
-cuisines_label_df = cuisines_df['cuisine']
-cuisines_label_df.head()
-```
-
-4\. Drop that `Unnamed: 0` column and the `cuisine` column, calling `drop()`. Save the rest of the data as trainable features:
-
-```{code-cell}
-:tags: [output_scroll]
-
-cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
-cuisines_feature_df.head()
-```
-
-Now you are ready to train your model!
-
-## Choosing your classifier
-
-Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.
-
-Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods all include classification techniques:
-
-- Linear Models
-- Support Vector Machines
-- Stochastic Gradient Descent
-- Nearest Neighbors
-- Gaussian Processes
-- Decision Trees
-- Ensemble methods (voting Classifier)
-- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
-
-```{seealso}
-You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this section.
-```
-
-### What classifier to go with?
-
-So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscrinationAnalysis, showing the results visualized:
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/comparison.png
----
-name: 'comparison of classifiers'
-width: 90%
----
-Comparison of classifiers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/comparison.png)
-```
-
-```{seealso}
-Plots generated on Scikit-learn's documentation.
-
-AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-77952-leestott).
-```
-
-### A better approach
-
-A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-77952-leestott). Here, we discover that, for our multiclass problem, we have some choices:
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/cheatsheet.png
----
-name: 'cheatsheet for multiclass problems'
-width: 90%
----
-Cheatsheet for multiclass problems [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/cheatsheet.png)
-```
-
-```{note}
-A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options.
-```
-
-```{seealso}
-Download this cheat sheet, print it out, and hang it on your wall!
-```
-
-### Reasoning
-
-Let's see if we can reason our way through different approaches given the constraints we have:
-
-- **Neural networks are too heavy**. Given our clean, but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
-- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
-- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data.
-- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
-
-### Using Scikit-learn
-
-We will be using Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
-
-Essentially there are two important parameters - `multi_class` and `solver` - that we need to specify, when we ask Scikit-learn to perform a logistic regression. The `multi_class` value applies a certain behavior. The value of the solver is what algorithm to use. Not all solvers can be paired with all `multi_class` values.
-
-According to the docs, in the multiclass case, the training algorithm:
-
-- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`.
-- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the ‘lbfgs’, ‘sag’, ‘saga’ and ‘newton-cg’ solvers.)
-
-```{seealso}
-The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [🔗source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
-
-The 'solver' is defined as "the algorithm to use in the optimization problem". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
-```
-
-Scikit-learn offers this table to explain how solvers handle different challenges presented by different kinds of data structures:
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/solvers.png
----
-name: 'solvers'
-width: 90%
----
-Solvers [🔗source](https://github.com/microsoft/ML-For-Beginners/blob/main/4-Classification/2-Classifiers-1/images/solvers.png)
-```
-
-## Exercise - split the data
-
-We can focus on logistic regression for our first training trial since you recently learned about the latter in a previous section.
-Split your data into training and testing groups by calling `train_test_split()`:
-
-```{code-cell}
-X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
-```
-
-## Exercise - apply logistic regression
-
-Since you are using the multiclass case, you need to choose what _scheme_ to use and what _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.
-
-1\. Create a logistic regression with multi_class set to `ovr` and the solver set to `liblinear`:
-
-```{code-cell}
-lr = LogisticRegression(multi_class='ovr',solver='liblinear')
-model = lr.fit(X_train, np.ravel(y_train))
-
-accuracy = model.score(X_test, y_test)
-print ("Accuracy is {}".format(accuracy))
-```
-
-```{seealso}
-Try a different solver like `lbfgs`, which is often set as default.
-```
-
-```{note}
-Use Pandas [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.
-```
-
-The accuracy is good at over **80%**!
-
-2\. You can see this model in action by testing one row of data (#50):
-
-```{code-cell}
-print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
-print(f'cuisine: {y_test.iloc[50]}')
-```
-
-```{seealso}
-Try a different row number and check the results.
-```
-
-3\. Digging deeper, you can check for the accuracy of this prediction:
-
-```{code-cell}
-test= X_test.iloc[50].values.reshape(-1, 1).T
-proba = model.predict_proba(test)
-classes = model.classes_
-resultdf = pd.DataFrame(data=proba, columns=classes)
-
-topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
-topPrediction.head()
-```
-
-```{seealso}
-Can you explain why the model is pretty sure this is an Indian cuisine?
-```
-
-4\. Get more detail by printing a classification report, as you did in the regression sections:
-
-```{code-cell}
-y_pred = model.predict(X_test)
-print(classification_report(y_test,y_pred))
-```
-
-## Self Study
-
-Dig a little more into the math behind logistic regression in [this section](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf).
-
-## Your turn! 🚀
-
-In this section, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
-
-Assignment - [Study the solvers](../../assignments/ml-fundamentals/study-the-solvers.md).
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter.
\ No newline at end of file
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.ipynb
new file mode 100644
index 0000000000..cb0ccb62d8
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.ipynb
@@ -0,0 +1,513 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-cell"
+ ]
+ },
+ "source": [
+ "---\n",
+ "jupytext:\n",
+ " cell_metadata_filter: -all\n",
+ " formats: md:myst\n",
+ " text_representation:\n",
+ " extension: .md\n",
+ " format_name: myst\n",
+ " format_version: 0.13\n",
+ " jupytext_version: 1.11.5\n",
+ "kernelspec:\n",
+ " display_name: Python 3\n",
+ " language: python\n",
+ " name: python3"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": [
+ "hide-input"
+ ]
+ },
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys \n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "source": [
+ "# Yet other classifiers\n",
+ "\n",
+ "In this second classification section, you will explore more ways to classify numeric data. You will also learn about the ramifications for choosing one classifier over the other.\n",
+ "\n",
+ "## Preparation\n",
+ "\n",
+ "We have loaded your [build-classification-model.ipynb](https://static-1300131294.cos.ap-shanghai.myqcloud.com/assignments/ml-fundamentals/build-classification-model.ipynb) file with the cleaned dataset and have divided it into x and y dataframes, ready for the model building process.\n",
+ "\n",
+ "## A classification map\n",
+ "\n",
+ "Previously, you learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-learn offers a similar, but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):\n",
+ "\n",
+ ":::{figure} https://static-1300131294.cos.ap-shanghai.myqcloud.com/images/ml-fundamentals/ml-classification/map.png\n",
+ "---\n",
+ "name: 'ML Map from Scikit-learn'\n",
+ "width: 90%\n",
+ "---\n",
+ "ML Map from Scikit-learn. [Ref](https://scikit-learn.org/stable/tutorial/machine_learning_map/)\n",
+ ":::\n",
+ "\n",
+ "### The plan\n",
+ "\n",
+ "This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:\n",
+ "\n",
+ "- We have >50 samples\n",
+ "- We want to predict a category\n",
+ "- We have labeled data\n",
+ "- We have fewer than 100K samples\n",
+ "- ✨ We can choose a Linear SVC\n",
+ "- If that doesn't work, since we have numeric data\n",
+ " - We can try a ✨ KNeighbors Classifier \n",
+ " - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers\n",
+ "\n",
+ "This is a very helpful trail to follow.\n",
+ "\n",
+ "## Exercise - split the data\n",
+ "\n",
+ "Following this path, we should start by importing some libraries to use.\n",
+ "\n",
+ "1\\. Import the needed libraries:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {
+ "editable": true,
+ "slideshow": {
+ "slide_type": ""
+ },
+ "tags": []
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.neighbors import KNeighborsClassifier\n",
+ "from sklearn.linear_model import LogisticRegression\n",
+ "from sklearn.svm import SVC\n",
+ "from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier\n",
+ "from sklearn.model_selection import train_test_split, cross_val_score\n",
+ "from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "cuisines_df = pd.read_csv(\"https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/classification/cleaned_cuisines.csv\")\n",
+ "cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)\n",
+ "cuisines_label_df = cuisines_df['cuisine']"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2\\. Split your training and test data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Linear SVC classifier\n",
+ "\n",
+ "Support-Vector clustering (SVC) is a child of the Support-Vector machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to cluster the labels. The 'C' parameter refers to 'regularization' which regulates the influence of parameters. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' to shuffle the data to get probabilities.\n",
+ "\n",
+ "### Exercise - apply a linear SVC\n",
+ "\n",
+ "Start by creating an array of classifiers. You will add progressively to this array as we test. \n",
+ "\n",
+ "1\\. Start with a Linear SVC:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "C = 10\n",
+ "# Create different classifiers.\n",
+ "classifiers = {\n",
+ " 'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)\n",
+ "}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2\\. Train your model using the Linear SVC and print out a report:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy (train) for Linear SVC: 79.4% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.66 0.72 0.69 223\n",
+ " indian 0.91 0.89 0.90 255\n",
+ " japanese 0.76 0.75 0.76 244\n",
+ " korean 0.90 0.73 0.81 225\n",
+ " thai 0.77 0.85 0.81 252\n",
+ "\n",
+ " accuracy 0.79 1199\n",
+ " macro avg 0.80 0.79 0.79 1199\n",
+ "weighted avg 0.80 0.79 0.80 1199\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "n_classifiers = len(classifiers)\n",
+ "\n",
+ "def classify():\n",
+ " for index, (name, classifier) in enumerate(classifiers.items()):\n",
+ " classifier.fit(X_train, np.ravel(y_train))\n",
+ "\n",
+ " y_pred = classifier.predict(X_test)\n",
+ " accuracy = accuracy_score(y_test, y_pred)\n",
+ " print(\"Accuracy (train) for %s: %0.1f%% \" % (name, accuracy * 100))\n",
+ " print(classification_report(y_test,y_pred))\n",
+ "\n",
+ "classify()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The result is pretty good.\n",
+ "\n",
+ "## K-Neighbors classifier\n",
+ "\n",
+ "K-Neighbors is part of the \"neighbors\" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created and data are gathered around these points such that generalized labels can be predicted for the data.\n",
+ "\n",
+ "### Exercise - apply the K-Neighbors classifier\n",
+ "\n",
+ "The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors classifier.\n",
+ "\n",
+ "1\\. Add a line to your classifier array (add a comma after the Linear SVC item):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy (train) for Linear SVC: 79.4% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.66 0.72 0.69 223\n",
+ " indian 0.91 0.89 0.90 255\n",
+ " japanese 0.76 0.75 0.76 244\n",
+ " korean 0.90 0.73 0.81 225\n",
+ " thai 0.77 0.85 0.81 252\n",
+ "\n",
+ " accuracy 0.79 1199\n",
+ " macro avg 0.80 0.79 0.79 1199\n",
+ "weighted avg 0.80 0.79 0.80 1199\n",
+ "\n",
+ "Accuracy (train) for KNN classifier: 73.1% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.62 0.66 0.64 223\n",
+ " indian 0.89 0.78 0.83 255\n",
+ " japanese 0.65 0.85 0.73 244\n",
+ " korean 0.91 0.55 0.68 225\n",
+ " thai 0.71 0.79 0.75 252\n",
+ "\n",
+ " accuracy 0.73 1199\n",
+ " macro avg 0.76 0.73 0.73 1199\n",
+ "weighted avg 0.76 0.73 0.73 1199\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "classifiers['KNN classifier'] = KNeighborsClassifier(C)\n",
+ "\n",
+ "classify()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The result is a little worse.\n",
+ "\n",
+ ":::{seealso}\n",
+ "Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)\n",
+ ":::\n",
+ "\n",
+ "## Support Vector Classifier\n",
+ "\n",
+ "Support-Vector classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs \"map training examples to points in space\" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.\n",
+ "\n",
+ "### Exercise - apply a Support Vector Classifier\n",
+ "\n",
+ "Let's try for a little better accuracy with a Support Vector Classifier.\n",
+ "\n",
+ "1\\. Add a comma after the K-Neighbors item, and then add this line:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy (train) for Linear SVC: 79.4% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.66 0.72 0.69 223\n",
+ " indian 0.91 0.89 0.90 255\n",
+ " japanese 0.76 0.75 0.76 244\n",
+ " korean 0.90 0.73 0.81 225\n",
+ " thai 0.77 0.85 0.81 252\n",
+ "\n",
+ " accuracy 0.79 1199\n",
+ " macro avg 0.80 0.79 0.79 1199\n",
+ "weighted avg 0.80 0.79 0.80 1199\n",
+ "\n",
+ "Accuracy (train) for KNN classifier: 73.1% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.62 0.66 0.64 223\n",
+ " indian 0.89 0.78 0.83 255\n",
+ " japanese 0.65 0.85 0.73 244\n",
+ " korean 0.91 0.55 0.68 225\n",
+ " thai 0.71 0.79 0.75 252\n",
+ "\n",
+ " accuracy 0.73 1199\n",
+ " macro avg 0.76 0.73 0.73 1199\n",
+ "weighted avg 0.76 0.73 0.73 1199\n",
+ "\n",
+ "Accuracy (train) for SVC: 82.0% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.73 0.73 0.73 223\n",
+ " indian 0.90 0.89 0.90 255\n",
+ " japanese 0.80 0.80 0.80 244\n",
+ " korean 0.92 0.80 0.86 225\n",
+ " thai 0.77 0.87 0.82 252\n",
+ "\n",
+ " accuracy 0.82 1199\n",
+ " macro avg 0.82 0.82 0.82 1199\n",
+ "weighted avg 0.82 0.82 0.82 1199\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "classifiers['SVC'] = SVC()\n",
+ "\n",
+ "classify()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The result is quite good!\n",
+ "\n",
+ ":::{seealso}\n",
+ "Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)\n",
+ ":::\n",
+ "\n",
+ "## Ensemble Classifiers\n",
+ "\n",
+ "Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers, specifically Random Forest and AdaBoost:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Accuracy (train) for Linear SVC: 79.4% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.66 0.72 0.69 223\n",
+ " indian 0.91 0.89 0.90 255\n",
+ " japanese 0.76 0.75 0.76 244\n",
+ " korean 0.90 0.73 0.81 225\n",
+ " thai 0.77 0.85 0.81 252\n",
+ "\n",
+ " accuracy 0.79 1199\n",
+ " macro avg 0.80 0.79 0.79 1199\n",
+ "weighted avg 0.80 0.79 0.80 1199\n",
+ "\n",
+ "Accuracy (train) for KNN classifier: 73.1% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.62 0.66 0.64 223\n",
+ " indian 0.89 0.78 0.83 255\n",
+ " japanese 0.65 0.85 0.73 244\n",
+ " korean 0.91 0.55 0.68 225\n",
+ " thai 0.71 0.79 0.75 252\n",
+ "\n",
+ " accuracy 0.73 1199\n",
+ " macro avg 0.76 0.73 0.73 1199\n",
+ "weighted avg 0.76 0.73 0.73 1199\n",
+ "\n",
+ "Accuracy (train) for SVC: 82.0% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.73 0.73 0.73 223\n",
+ " indian 0.90 0.89 0.90 255\n",
+ " japanese 0.80 0.80 0.80 244\n",
+ " korean 0.92 0.80 0.86 225\n",
+ " thai 0.77 0.87 0.82 252\n",
+ "\n",
+ " accuracy 0.82 1199\n",
+ " macro avg 0.82 0.82 0.82 1199\n",
+ "weighted avg 0.82 0.82 0.82 1199\n",
+ "\n",
+ "Accuracy (train) for RFST: 84.7% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.78 0.80 0.79 223\n",
+ " indian 0.94 0.91 0.93 255\n",
+ " japanese 0.85 0.79 0.82 244\n",
+ " korean 0.89 0.81 0.85 225\n",
+ " thai 0.79 0.92 0.85 252\n",
+ "\n",
+ " accuracy 0.85 1199\n",
+ " macro avg 0.85 0.84 0.85 1199\n",
+ "weighted avg 0.85 0.85 0.85 1199\n",
+ "\n",
+ "Accuracy (train) for ADA: 68.6% \n",
+ " precision recall f1-score support\n",
+ "\n",
+ " chinese 0.55 0.49 0.52 223\n",
+ " indian 0.87 0.84 0.86 255\n",
+ " japanese 0.63 0.60 0.62 244\n",
+ " korean 0.66 0.75 0.70 225\n",
+ " thai 0.69 0.73 0.71 252\n",
+ "\n",
+ " accuracy 0.69 1199\n",
+ " macro avg 0.68 0.68 0.68 1199\n",
+ "weighted avg 0.68 0.69 0.68 1199\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "classifiers['RFST'] = RandomForestClassifier(n_estimators=100)\n",
+ "classifiers['ADA'] = AdaBoostClassifier(n_estimators=100)\n",
+ "\n",
+ "classify()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The result is very good, especially for Random Forest.\n",
+ "\n",
+ ":::{seealso}\n",
+ "Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)\n",
+ ":::\n",
+ "\n",
+ "This method of Machine Learning \"combines the predictions of several base estimators\" to improve the model's quality. In our example, we used Random Trees and AdaBoost.\n",
+ "\n",
+ "- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter is set to the number of trees.\n",
+ "\n",
+ "- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.\n",
+ "\n",
+ "---\n",
+ "\n",
+ "## Self Study\n",
+ "\n",
+ "There's a lot of jargon in these sections, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!\n",
+ "\n",
+ "## Your turn! 🚀\n",
+ "\n",
+ "Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.\n",
+ "\n",
+ "Assignment - [Parameter play](../../assignments/ml-fundamentals/parameter-play.md)\n",
+ "\n",
+ "## Acknowledgments\n",
+ "\n",
+ "Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter.\n"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.18"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.md b/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.md
deleted file mode 100644
index 979cb1591a..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/classification/yet-other-classifiers.md
+++ /dev/null
@@ -1,196 +0,0 @@
----
-jupytext:
- cell_metadata_filter: -all
- formats: md:myst
- text_representation:
- extension: .md
- format_name: myst
- format_version: 0.13
- jupytext_version: 1.11.5
-kernelspec:
- display_name: Python 3
- language: python
- name: python3
----
-
-# Yet other classifiers
-
-In this second classification section, you will explore more ways to classify numeric data. You will also learn about the ramifications for choosing one classifier over the other.
-
-## Preparation
-
-We have loaded your [build-classification-model.ipynb](../../assignments/ml-fundamentals/build-classification-model.ipynb) file with the cleaned dataset and have divided it into x and y dataframes, ready for the model building process.
-
-## A classification map
-
-Previously, you learned about the various options you have when classifying data using Microsoft's cheat sheet. Scikit-learn offers a similar, but more granular cheat sheet that can further help narrow down your estimators (another term for classifiers):
-
-```{figure} ../../../images/ml-fundamentals/ml-classification/map.png
----
-name: 'ML Map from Scikit-learn'
-width: 90%
----
-ML Map from Scikit-learn. [Ref](https://scikit-learn.org/stable/tutorial/machine_learning_map/)
-```
-
-### The plan
-
-This map is very helpful once you have a clear grasp of your data, as you can 'walk' along its paths to a decision:
-
-- We have >50 samples
-- We want to predict a category
-- We have labeled data
-- We have fewer than 100K samples
-- ✨ We can choose a Linear SVC
-- If that doesn't work, since we have numeric data
- - We can try a ✨ KNeighbors Classifier
- - If that doesn't work, try ✨ SVC and ✨ Ensemble Classifiers
-
-This is a very helpful trail to follow.
-
-## Exercise - split the data
-
-Following this path, we should start by importing some libraries to use.
-
-1\. Import the needed libraries:
-
-```{code-cell}
-from sklearn.neighbors import KNeighborsClassifier
-from sklearn.linear_model import LogisticRegression
-from sklearn.svm import SVC
-from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
-from sklearn.model_selection import train_test_split, cross_val_score
-from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
-import numpy as np
-import pandas as pd
-
-cuisines_df = pd.read_csv("../../assets/data/classification/cleaned_cuisines.csv")
-cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
-cuisines_label_df = cuisines_df['cuisine']
-```
-
-2\. Split your training and test data:
-
-```{code-cell}
-X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
-```
-
-## Linear SVC classifier
-
-Support-Vector clustering (SVC) is a child of the Support-Vector machines family of ML techniques (learn more about these below). In this method, you can choose a 'kernel' to decide how to cluster the labels. The 'C' parameter refers to 'regularization' which regulates the influence of parameters. The kernel can be one of [several](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC); here we set it to 'linear' to ensure that we leverage linear SVC. Probability defaults to 'false'; here we set it to 'true' to gather probability estimates. We set the random state to '0' to shuffle the data to get probabilities.
-
-### Exercise - apply a linear SVC
-
-Start by creating an array of classifiers. You will add progressively to this array as we test.
-
-1\. Start with a Linear SVC:
-
-```{code-cell}
-C = 10
-# Create different classifiers.
-classifiers = {
- 'Linear SVC': SVC(kernel='linear', C=C, probability=True,random_state=0)
-}
-```
-
-2\. Train your model using the Linear SVC and print out a report:
-
-```{code-cell}
-n_classifiers = len(classifiers)
-
-def classify():
- for index, (name, classifier) in enumerate(classifiers.items()):
- classifier.fit(X_train, np.ravel(y_train))
-
- y_pred = classifier.predict(X_test)
- accuracy = accuracy_score(y_test, y_pred)
- print("Accuracy (train) for %s: %0.1f%% " % (name, accuracy * 100))
- print(classification_report(y_test,y_pred))
-
-classify()
-```
-
-The result is pretty good.
-
-## K-Neighbors classifier
-
-K-Neighbors is part of the "neighbors" family of ML methods, which can be used for both supervised and unsupervised learning. In this method, a predefined number of points is created and data are gathered around these points such that generalized labels can be predicted for the data.
-
-### Exercise - apply the K-Neighbors classifier
-
-The previous classifier was good, and worked well with the data, but maybe we can get better accuracy. Try a K-Neighbors classifier.
-
-1\. Add a line to your classifier array (add a comma after the Linear SVC item):
-
-```{code-cell}
-classifiers['KNN classifier'] = KNeighborsClassifier(C)
-
-classify()
-```
-
-The result is a little worse.
-
-```{seealso}
-Learn about [K-Neighbors](https://scikit-learn.org/stable/modules/neighbors.html#neighbors)
-```
-
-## Support Vector Classifier
-
-Support-Vector classifiers are part of the [Support-Vector Machine](https://wikipedia.org/wiki/Support-vector_machine) family of ML methods that are used for classification and regression tasks. SVMs "map training examples to points in space" to maximize the distance between two categories. Subsequent data is mapped into this space so their category can be predicted.
-
-### Exercise - apply a Support Vector Classifier
-
-Let's try for a little better accuracy with a Support Vector Classifier.
-
-1\. Add a comma after the K-Neighbors item, and then add this line:
-
-```{code-cell}
-classifiers['SVC'] = SVC()
-
-classify()
-```
-
-The result is quite good!
-
-```{seealso}
-Learn about [Support-Vectors](https://scikit-learn.org/stable/modules/svm.html#svm)
-```
-
-## Ensemble Classifiers
-
-Let's follow the path to the very end, even though the previous test was quite good. Let's try some 'Ensemble Classifiers, specifically Random Forest and AdaBoost:
-
-```{code-cell}
-classifiers['RFST'] = RandomForestClassifier(n_estimators=100)
-classifiers['ADA'] = AdaBoostClassifier(n_estimators=100)
-
-classify()
-```
-
-The result is very good, especially for Random Forest.
-
-```{seealso}
-Learn about [Ensemble Classifiers](https://scikit-learn.org/stable/modules/ensemble.html)
-```
-
-This method of Machine Learning "combines the predictions of several base estimators" to improve the model's quality. In our example, we used Random Trees and AdaBoost.
-
-- [Random Forest](https://scikit-learn.org/stable/modules/ensemble.html#forest), an averaging method, builds a 'forest' of 'decision trees' infused with randomness to avoid overfitting. The n_estimators parameter is set to the number of trees.
-
-- [AdaBoost](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html) fits a classifier to a dataset and then fits copies of that classifier to the same dataset. It focuses on the weights of incorrectly classified items and adjusts the fit for the next classifier to correct.
-
----
-
-## Self Study
-
-There's a lot of jargon in these sections, so take a minute to review [this list](https://docs.microsoft.com/dotnet/machine-learning/resources/glossary?WT.mc_id=academic-77952-leestott) of useful terminology!
-
-## Your turn! 🚀
-
-Each of these techniques has a large number of parameters that you can tweak. Research each one's default parameters and think about what tweaking these parameters would mean for the model's quality.
-
-Assignment - [Parameter play](../../assignments/ml-fundamentals/parameter-play.md)
-
-## Acknowledgments
-
-Thanks to Microsoft for creating the open-source course [ML-For-Beginners](https://github.com/microsoft/ML-For-Beginners). It inspires the majority of the content in this chapter.
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/regression/linear-regression-metrics.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/regression/linear-regression-metrics.ipynb
new file mode 100644
index 0000000000..ee56256090
--- /dev/null
+++ b/open-machine-learning-jupyter-book/ml-fundamentals/regression/linear-regression-metrics.ipynb
@@ -0,0 +1,194 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "id": "a23a2854-7e54-4a24-9ae4-0f8904f899ee",
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Install the necessary dependencies\n",
+ "\n",
+ "import os\n",
+ "import sys\n",
+ "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "3780e038-4395-44e7-9294-a54ae4bc731d",
+ "metadata": {},
+ "source": [
+ "---\n",
+ "license:\n",
+ " code: MIT\n",
+ " content: CC-BY-4.0\n",
+ "github: https://github.com/ocademy-ai/machine-learning\n",
+ "venue: By Ocademy\n",
+ "open_access: true\n",
+ "bibliography:\n",
+ " - https://raw.githubusercontent.com/ocademy-ai/machine-learning/main/open-machine-learning-jupyter-book/references.bib\n",
+ "---"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "63961ec0-328a-4289-8667-cc86f09db8f1",
+ "metadata": {},
+ "source": [
+ "# Linear Regression Metrics"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "c556cae5-c568-444c-be0c-4a9f54a0af5b",
+ "metadata": {},
+ "source": [
+ "Linear regression is a fundamental and widely used technique in machine learning and statistics for predicting continuous values based on input variables. It finds its application in various domains, from finance and economics to healthcare and engineering. When using linear regression, it's essential to assess the model's performance accurately. This is where linear regression metrics come into play.\n",
+ "\n",
+ "In this tutorial, we will delve into the world of linear regression metrics, exploring the key evaluation measures that allow us to gauge how well a linear regression model fits the data and makes predictions. These metrics provide valuable insights into the model's accuracy, precision, and ability to capture the underlying relationships between variables.\n",
+ "\n",
+ "We will cover essential concepts such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R-squared (R2) score, and Mean Absolute Error (MAE). Understanding these metrics is crucial for data scientists, machine learning practitioners, and anyone looking to harness the power of linear regression for predictive modeling.\n",
+ "\n",
+ "Whether you are building models for price predictions, sales forecasts, or any other regression task, mastering these metrics will empower you to make informed decisions and fine-tune your models for optimal performance. Let's embark on this journey to explore the intricacies of linear regression metrics and enhance our ability to assess and improve regression models."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f39e137f-d413-4d64-97b7-d6500542e8ed",
+ "metadata": {},
+ "source": [
+ "## Mean Squared Error (MSE)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "f893cbed-c0e9-46c5-aea3-8871c7bb9a5d",
+ "metadata": {},
+ "source": [
+ "In the realm of linear regression metrics, one fundamental measure of model performance is the **Mean Squared Error (MSE)**. MSE serves as a valuable indicator of how well your linear regression model aligns its predictions with the actual data points. This metric quantifies the average of the squared differences between predicted values and observed values."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "de348eec-516d-4d86-a02e-ccbe7dba7bf5",
+ "metadata": {},
+ "source": [
+ "### The Formula"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "545bc42e-7ca4-4c9a-91ca-fceeebaa1b83",
+ "metadata": {},
+ "source": [
+ "Mathematically, the MSE is computed using the following formula:"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "768aa918-1f0b-4ae5-b4c7-9e77097050e1",
+ "metadata": {},
+ "source": [
+ "$$ MSE = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2 $$"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "dbd49d92-4228-458e-a865-5d4636bd4ff2",
+ "metadata": {},
+ "source": [
+ "Where:\n",
+ "\n",
+ "- $n$ is the number of data points.\n",
+ "- $y_i$ represents the actual observed value for the $i^{th}$ data point.\n",
+ "- $\\hat{y}_i$ represents the predicted value for the $i^{th}$ data point."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "54f58e2a-8d6e-4cfb-8a9a-ab50cdaaf956",
+ "metadata": {},
+ "source": [
+ "### Interpretation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d09eadf6-21c1-488e-98f1-4e0917be18b6",
+ "metadata": {},
+ "source": [
+ "A lower MSE value indicates that the model's predictions are closer to the actual values, signifying better model performance. Conversely, a higher MSE suggests that the model's predictions deviate more from the true values, indicating poorer performance."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "d2b803a5-b390-407a-886b-ccfcee059233",
+ "metadata": {},
+ "source": [
+ "### Python Implementation"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "id": "e546d69d-7542-4c65-9635-4d5dced7248e",
+ "metadata": {},
+ "source": [
+ "Let's take a look at how to calculate MSE in Python. We'll use a simple example with sample data:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "id": "7b027b00-2205-4475-a600-62059c7fc5c2",
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "Mean Squared Error (MSE): 0.5079999999999996\n"
+ ]
+ }
+ ],
+ "source": [
+ "# Import necessary libraries\n",
+ "import numpy as np\n",
+ "\n",
+ "# Sample data for demonstration (replace with your actual data)\n",
+ "actual_values = np.array([22.1, 19.9, 24.5, 20.1, 18.7])\n",
+ "predicted_values = np.array([23.5, 20.2, 23.9, 19.8, 18.5])\n",
+ "\n",
+ "# Calculate the squared differences between actual and predicted values\n",
+ "squared_errors = (actual_values - predicted_values) ** 2\n",
+ "\n",
+ "# Calculate the mean of squared errors to get MSE\n",
+ "mse = np.mean(squared_errors)\n",
+ "\n",
+ "# Print the MSE\n",
+ "print(\"Mean Squared Error (MSE):\", mse)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3 (ipykernel)",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.10.6"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/open-machine-learning-jupyter-book/ml-fundamentals/regression/loss-function.ipynb b/open-machine-learning-jupyter-book/ml-fundamentals/regression/loss-function.ipynb
deleted file mode 100644
index defa7558a4..0000000000
--- a/open-machine-learning-jupyter-book/ml-fundamentals/regression/loss-function.ipynb
+++ /dev/null
@@ -1,830 +0,0 @@
-{
- "cells": [
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "776fc8b6-42b2-4b27-8596-8fa5a29ab556",
- "metadata": {},
- "outputs": [],
- "source": [
- "# Install the necessary dependencies\n",
- "\n",
- "import os\n",
- "import sys\n",
- "!{sys.executable} -m pip install --quiet pandas scikit-learn numpy matplotlib jupyterlab_myst ipython"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a3e2352b-17ee-4471-b4a8-7c192326abde",
- "metadata": {
- "editable": true,
- "slideshow": {
- "slide_type": ""
- },
- "tags": []
- },
- "source": [
- "# Stock Market Prediction Hands-On: Training a Linear Regression Model (1/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e8dc8e12-f3a4-4c24-bdd7-63fe322c9b52",
- "metadata": {},
- "source": [
- "Can linear regression in machine learning predict the stock market? This real dataset includes stock market data from several major U.S. companies between 2005 and 2020, including daily opening and closing prices, highest and lowest prices, trading volume, turnover rate, and other information. Today, we are going to use it to practice and see if we will make a profit or incur losses."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "41a37356-04f0-45f8-afef-185bd1a25015",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "20185c4e",
- "metadata": {},
- "source": [
- "*You can download the corresponding kaggle dataset [here](https://www.kaggle.com/datasets/nikhilkohli/us-stock-market-data-60-extracted-features)*"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f5ee754a-e5d9-4532-86d5-1433722be122",
- "metadata": {},
- "source": [
- "Let's begin by taking a look at Apple Inc., a company that has shown consistently robust performance over the years."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "595f6e94",
- "metadata": {},
- "outputs": [],
- "source": [
- "%matplotlib inline\n",
- "\n",
- "import pandas as pd\n",
- "import matplotlib.pyplot as plt\n",
- "import numpy as np\n",
- "from sklearn.model_selection import train_test_split\n",
- "from sklearn import metrics"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "f9b66d7a",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_stock = pd.read_csv('https://static-1300131294.cos.ap-shanghai.myqcloud.com/data/ml-fundamental/AAPL.csv', index_col=0)\n",
- "df_stock = df_stock.rename(columns={'Close(t)':'Close'})\n",
- "df_stock.head()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "59c28a72-b9d2-48e7-898e-fe67fb61a80e",
- "metadata": {},
- "source": [
- "Here, we have a total of 3,732 days' worth of stock market data, with each row containing 63 columns. There's one particular column that stands out, known as 'Close_forecast,' which represents the stock's closing price for the next day. It's important to note that this column doesn't exist in the original scraped data; it was added by Kaggle to make the dataset more suitable for machine learning exercises.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "eccfc31d",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_stock.shape"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "86a6d04f",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_stock.columns"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "59e74939-3af1-4967-9cf6-049d946c2bda",
- "metadata": {},
- "source": [
- "We will select the 'Close_forecast' column as the target for our machine learning model, which serves as the label. The remaining 62 columns will be used as features. We will split the data, using 75% for training and 25% for testing."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e80b445f",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.preprocessing import StandardScaler\n",
- "\n",
- "X = df_stock.drop(columns=['Close_forcast'], axis=1)\n",
- "y = df_stock['Close_forcast']\n",
- "\n",
- "scaler_X = StandardScaler()\n",
- "X = scaler_X.fit_transform(X)\n",
- "\n",
- "X_train, X_test, y_train, y_test = train_test_split(\n",
- " X, y, test_size=0.25, random_state=42)\n",
- "\n",
- "print(X_train.shape, X_test.shape)\n",
- "print(y_train.shape, y_test.shape)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "083f151d-c523-4d8b-bd53-b339136f80d6",
- "metadata": {},
- "source": [
- "Finally, with just two simple lines of code, we will call the `LinearRegression.fit` method from sklearn to train our linear regression model."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "054a0309",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import LinearRegression\n",
- "\n",
- "lr = LinearRegression()\n",
- "lr.fit(X_train, y_train)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "22239732-992a-43eb-82af-1977723c545f",
- "metadata": {},
- "source": [
- "Now that we have our model, it's time to put it to the test on our testing dataset. We'll use the model to make predictions on the test set and evaluate its performance."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6d1ebf8f",
- "metadata": {},
- "outputs": [],
- "source": [
- "y_test_pred = lr.predict(X_test)\n",
- "y_test_pred"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f61c02c1-85b2-4f66-baa6-acf6f136c1ce",
- "metadata": {},
- "source": [
- "At first glance, the results might seem a bit surprising, given the significant fluctuations in the predicted stock prices. However, I can offer some reassurance that our linear regression model is functioning correctly, and in fact, it performs quite well. You can confidently use the code provided above. As for the reason behind the seemingly chaotic predictions, we will delve into a more detailed analysis in the upcoming sections."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c24f7565-7e8a-4849-9c97-aa21a04f5a3b",
- "metadata": {},
- "source": [
- "# Stock Market Prediction Hands-On: Model Performance Evaluation (2/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b4f4703e-4c03-40cb-bab1-bafd1a783d14",
- "metadata": {},
- "source": [
- "In the previous segment, we attempted stock price prediction using linear regression on a real stock market dataset. The results seemed chaotic, with significant fluctuations and sharp ups and downs in stock prices. Can linear regression truly predict stock prices? "
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ccaf0610",
- "metadata": {},
- "outputs": [],
- "source": [
- "y_test_pred = lr.predict(X_test)\n",
- "y_test_pred"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "3cb3c136-8525-4c04-a7f1-ec333f280c88",
- "metadata": {},
- "source": [
- "Strange occurrences often have underlying reasons. Let's take a closer look at what y_test in the test set actually looks like. As it turns out, when y_test was created, the order was shuffled. In fact, there's a parameter in sklearn's train_test_split function called 'shuffle,' which is set to 'True' by default. This means that by default, the order is shuffled when splitting the training and test sets.\n",
- "\n",
- "Shuffling the order itself isn't necessarily a problem, but in our daily lives, stock prices generally follow a relatively smooth curve over time. Therefore, the test results may initially appear odd because they don't align with common sense. If we set 'shuffle' to 'False,' we can avoid this situation. You might find it interesting to try this out for yourselves."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "18e13828",
- "metadata": {},
- "outputs": [],
- "source": [
- "y_test"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c0730d30-e1df-47a1-95be-99db83e25c6f",
- "metadata": {},
- "source": [
- "Here, we're taking the real y-label values from the training set and the predicted y-label values, placing them together, and then sorting them. By doing this, we can compare the two and observe that the differences between them are quite small on a daily basis."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e1e7d2c2",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_test_pred = pd.DataFrame(y_test.values, \n",
- " columns=['Actual'], index=y_test.index)\n",
- "df_test_pred['Predicted'] = y_test_pred\n",
- "df_test_pred = df_test_pred.reset_index()\n",
- "sorted_df_test_pred = df_test_pred.sort_values(by='Date')\n",
- "sorted_df_test_pred = sorted_df_test_pred.reset_index()\n",
- "sorted_df_test_pred = sorted_df_test_pred.drop(columns=['index'])\n",
- "sorted_df_test_pred"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "32828743-f31c-4a78-952e-ad5fb630d084",
- "metadata": {},
- "source": [
- "Let's visualize the data using matplotlib to gain a clearer understanding. The results are highly promising, as the blue real values and the green predicted values almost perfectly overlap."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "707598b7",
- "metadata": {},
- "outputs": [],
- "source": [
- "plt.plot(sorted_df_test_pred.index, sorted_df_test_pred['Actual'], color='b')\n",
- "plt.plot(sorted_df_test_pred.index, sorted_df_test_pred['Predicted'], color='g')\n",
- "plt.grid(which=\"major\", color='k', linestyle='-.', linewidth=0.5)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a237101b-df0e-4b3f-89d7-9c0d2c69be50",
- "metadata": {},
- "source": [
- "\n",
- "We calculate the R-squared, MAPE, and other evaluation metrics, and the results are excellent, consistent with the previous analysis. All of this indicates that linear regression performs well when applied to this real stock market dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "061bf353",
- "metadata": {},
- "outputs": [],
- "source": [
- "print(\"Test R-squared: \",metrics.r2_score(\n",
- " y_test,y_test_pred))\n",
- "print(\"Test MAPE: \", metrics.mean_absolute_percentage_error(\n",
- " y_test,y_test_pred),\"%\")\n",
- "print(\n",
- " \"Test Mean Squared Error:\",\n",
- " metrics.mean_squared_error(y_test, y_test_pred)\n",
- ")\n",
- "print(\"Test RMSE: \",np.sgrt(metrics.mean_squared_error(\n",
- " y_test,y_test_pred)))\n",
- "print(\"Test MAE: \", metrics.mean_absolute_error(\n",
- "y_test, y_test_pred))"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a9ba5ae4-7fde-4be6-95d2-3f265e4e4357",
- "metadata": {},
- "source": [
- "It's important to note that evaluation metrics are often calculated on the test dataset, but they can also be computed on both the training and test datasets for comparison. Why do I emphasize this? Because in the next segment, we'll delve into loss functions, and their computation is exclusively for the training dataset."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "51b55a90-63f7-4312-90d5-0afe5ba13f20",
- "metadata": {},
- "source": [
- "# Stock Market Prediction Hands-On: Introduction to Loss Functions (3/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "29e890c3-bf6d-402a-87e3-b3d3cfc8967f",
- "metadata": {},
- "source": [
- "In past segments, we used linear regression to predict stock prices, tested it on the test set, and calculated evaluation metrics, with the model performing exceptionally well. We plot the daily closing prices of Apple Inc. from 2005 to 2022. If you bought Apple stock on the first day shown in the graph and held it until the last day, you would have roughly multiplied your investment many times over. However, achieving this in reality is exceedingly challenging."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "fd27ed12",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_stock['Close'].plot(figsize=(10, 6))\n",
- "plt.title(\"Stock Price\", fontsize=13)\n",
- "plt.ylabel('Price', fontsize=12)\n",
- "plt.xlabel('Time', fontsize=12)\n",
- "plt.grid(which=\"major\", color='k', linestyle='-.', linewidth=0.5)\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e3524a2d",
- "metadata": {},
- "source": [
- "As ordinary investors, we don't possess a time machine, and even if we have a strong belief in Apple's stock, we cannot predict what will happen 15 years into the future. Typically, we do not hold stocks for extended periods. Instead, we engage in short-term or medium-term investments. If the stock price shows substantial growth within a certain timeframe, we may choose to sell at a certain point, seizing the opportunity. Conversely, if the stock price remains stagnant or declines, we may also decide to sell at a specific point, implementing a timely stop-loss strategy."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "b74592d5-2619-48ef-a235-c7cb9a60cb91",
- "metadata": {},
- "source": [
- "Of course, we can't provide stock investment strategies here, but if machine learning can effectively predict stock prices, it can certainly assist in shaping our investment strategies. With model predictions, we can observe that Apple's stock steadily increased over 15 years, indicating that buying in 2005 and selling in 2020 would have been profitable.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "f6411620-3125-4cf9-874a-10c4b19e217e",
- "metadata": {},
- "source": [
- "Furthermore, if the model we've developed provides accurate predictions at finer granularities, we could potentially engage in multiple trades within those 15 years. Selling all stocks at local highs whenever the price is about to drop and buying in at local lows when the price is about to rise can optimize returns even further. However, this scenario assumes that our predictions align perfectly with reality, which, in practice, is unlikely to be the case."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "64c50fc2-dfb3-427c-958e-6044711155ba",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e2736fce-21ae-4ff2-ad64-f654c9362ee5",
- "metadata": {},
- "source": [
- "While the evaluation metrics indicate that our model's performance is good, is it good enough to support the second investment strategy mentioned earlier? Or can it be further optimized to help us earn more from that strategy?\n",
- "\n",
- "The answer is affirmative, and here we introduce a new concept: the Loss Function, also known as the Cost Function. It is used to measure the difference or error between model predictions and real values on the training dataset. In the next segment, we will delve into how to calculate the loss function."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ab4a5347-a35f-4de9-bc57-3cf640424ff7",
- "metadata": {},
- "source": [
- "# Stock Market Prediction Hands-On: Calculating Loss Functions (4/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "168f0d68-d4d5-48d3-8278-683fda9fd58a",
- "metadata": {},
- "source": [
- "In previous segments, we successfully used linear regression for stock market prediction, guiding us to buy low and sell high, resulting in substantial profits. However, we are not content because there are still deviations between the model's predictions and the actual situation. This has caused us to buy at high points and sell at low points on several occasions. Following the principle that there's no harm in having more money, we aim to further optimize the model using a loss function. Today, let's first learn how to calculate the loss function.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c46884ee-0a15-4666-a992-8f6fb7102744",
- "metadata": {},
- "source": [
- "For regression tasks, there are three common types of loss functions. The first one is the Mean Squared Error (MSE), which measures the average of the squared differences between **predicted values** and **actual values** on the **training dataset** . The second one is the Mean Absolute Error (MAE), which measures the average of the absolute differences between **predicted values** and **actual values** on the **training dataset** ."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "d0bac2a8-4505-4b86-a7fa-042f8018a622",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a8d48588-a8ea-4f00-a942-f04dd0598357",
- "metadata": {},
- "source": [
- "So, let's go ahead and calculate the squared error and absolute error for each data point in the training set. The code is quite simple: we extract the labels and predicted results columns from the training set and use NumPy for some basic mathematical operations. The results are labeled as 'AE' and 'SE,'. As you can see, regardless of their magnitude, their values are never zero, meaning that there is always some difference between our predicted values and the actual values."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e9491a22",
- "metadata": {},
- "outputs": [],
- "source": [
- "df_train_pred = pd.DataFrame(y_train.values, \n",
- " columns=['Actual'], index=y_train.index)\n",
- "df_train_pred['Predicted'] = y_train_pred\n",
- "df_train_pred = df_train_pred.reset_index()\n",
- "sorted_df_train_pred = df_train_pred.sort_values(by='Date')\n",
- "sorted_df_train_pred = sorted_df_train_pred.reset_index()\n",
- "sorted_df_train_pred = sorted_df_train_pred.drop(columns=['index'])\n",
- "sorted_df_train_pred['AE'] = \\\n",
- " (sorted_df_train_pred['Predicted'] - \\\n",
- " sorted_df_train_pred['Actual']).abs()\n",
- "sorted_df_train_pred['SE'] = \\\n",
- " np.square((sorted_df_train_pred['Predicted'] - \\\n",
- " sorted_df_train_pred['Actual']))\n",
- "sorted_df_train_pred"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "9be6ebba-3fae-4f21-bba0-aa0e697ed4c9",
- "metadata": {},
- "source": [
- "Furthermore, we can visualize how AE and SE change over time. It's evident that as time progresses, their values tend to increase, indicating that the results tested on the training set become more accurate as they approach 2005."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "ced05c5d",
- "metadata": {},
- "outputs": [],
- "source": [
- "fig, axs = plt.subplots(1, 2, figsize=(7, 3))\n",
- "\n",
- "axs[0].plot(sorted_df_train_pred['AE'], color='blue')\n",
- "axs[0].set_title('AE')\n",
- "\n",
- "axs[1].plot(sorted_df_train_pred['SE'], color='orange')\n",
- "axs[1].set_title('SE')\n",
- "\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "a043402b-5f25-478c-bad3-a713693a3e59",
- "metadata": {},
- "source": [
- "Finally, by taking the mean of the AE and SE columns, we obtain the results for the loss functions, MAE and MSE. With this, we have computed the values of the two most common loss functions for linear regression."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8fc62797",
- "metadata": {},
- "outputs": [],
- "source": [
- "mae = sorted_df_train_pred['AE'].mean()\n",
- "mse = sorted_df_train_pred['SE'].mean()\n",
- "print('mae = ', mae)\n",
- "print('mse = ', mse)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "acea8cc2-7479-4e65-8812-6d143e2f075e",
- "metadata": {},
- "source": [
- "You might have already noticed that these two loss functions seem quite similar to the MAE and MSE metrics we learned earlier. You're absolutely right, there is indeed significant overlap between the concepts of loss functions and evaluation metrics, but there are also key differences. In the next segment, we will thoroughly analyze these distinctions."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "88b5f10b-9566-47b7-a95d-945c240b26fa",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "2604c01f-ebf7-4ccf-ac24-c6aa193d41d2",
- "metadata": {},
- "source": [
- "# Stock Market Prediction Hands-On: Understanding Loss Functions (5/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "68374bcb-917b-4094-96e9-4812c83871d0",
- "metadata": {},
- "source": [
- "\n",
- "Loss functions and evaluation metrics share common ground in that they are both used to assess a model's predictive capabilities. In fact, terms like MAE or MSE are statistical concepts that can serve both as evaluation metrics and as loss functions, with identical mathematical calculations.\n",
- "\n",
- "The code blocks above compute MAE and MSE as loss functions, while the code blocks below calculate MAE and MSE as evaluation metrics. If the input data is the same, the results will be entirely identical."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "6abe54c1",
- "metadata": {},
- "outputs": [],
- "source": [
- "mae = sorted_df_train_pred['AE'].mean()\n",
- "mse = sorted_df_train_pred['SE'].mean()\n",
- "print('mae = ', mae)\n",
- "print('mse = ', mse)"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "3cf189b7",
- "metadata": {},
- "outputs": [],
- "source": [
- "mae2 = metrics.mean_absolute_error(y_train, y_train_pred)\n",
- "mse2 = metrics.mean_squared_error(y_train, y_train_pred)\n",
- "print('mae2 = ', mae2)\n",
- "print('mse2 = ', mse2)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "ba0a0d59-4209-4cca-b372-6fbb76c3585f",
- "metadata": {},
- "source": [
- "So, what are the key differences between loss functions and evaluation metrics? \n",
- "\n",
- "Firstly, evaluation metrics include concepts like R-squared and explained variance, which are not present in loss functions. \n",
- "\n",
- "Secondly, their purposes differ; loss functions are primarily used during model training to help the model gradually adjust its parameters to minimize prediction errors. In contrast, evaluation metrics are used to summarize and compare the performance of a trained model, to understand the overall effectiveness of the model, or to compare the performance differences between different models, guiding model selection.\n",
- "\n",
- "Thirdly, their optimization directions are different. With loss functions, the goal is typically to minimize them because smaller loss values imply that the predicted values are closer to the actual values. In contrast, for evaluation metrics, the goal is often to maximize their values; for example, in the case of R-squared, higher values indicate better model performance. This difference reflects the distinct roles of loss functions and evaluation metrics in machine learning tasks. \n",
- "\n",
- "Finally, as mentioned in the previous segment, loss functions are often calculated on the training set, while evaluation metrics are typically computed on the test set, with fewer instances of calculating them on the training set.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "71aeba1f-dbaf-4fcf-a1b9-161438d7c2f6",
- "metadata": {},
- "source": [
- "\n",
- "\n",
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "16b188b4-5e1e-49e4-87f8-494298335f75",
- "metadata": {},
- "source": [
- "You're absolutely right, these differences might seem a bit overwhelming at first, but don't worry! In regression tasks, the distinctions between loss functions and evaluation metrics might not be as pronounced as in classification tasks. This was just a setup to introduce the concepts of evaluation metrics and loss functions.\n",
- "\n",
- "In classification tasks, we'll revisit the concepts of evaluation metrics and loss functions, and their differences will become clearer. As you gain a more comprehensive understanding of machine learning, these pieces of knowledge will gradually come together and become more straightforward.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "90c4494f-31fa-42bf-a11e-e2ca5ac81178",
- "metadata": {},
- "source": [
- "# Stock Market Prediction Hands-On: Optimizing Models with Gradient Descent (6/6)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "7840fdc1-8da5-4bfc-9aa8-7cc5b83a8e32",
- "metadata": {},
- "source": [
- "In previous segments, we used sklearn's LinearRegression to train on real U.S. stock market data, employed a linear regression model for stock price prediction, and calculated the model's loss functions. Another option for solving linear regression models is to use SGDRegressor. Here, SGD stands for Stochastic Gradient Descent, and you don't need to worry about its details for now; we'll be learning about it soon."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "19462b49-a21c-4f9a-969a-7087d04f4a24",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "0ebe5104-a6b8-49e6-a43f-58789aac57bc",
- "metadata": {},
- "source": [
- "The training process of SGDRegressor is iterative, and we can keep track of the changes in the loss function during training. This allows us to utilize the loss function to optimize the model.\n",
- "\n",
- "We start from the model's initial state and train for 100 epochs, which means 100 rounds of training, recording the loss function after each round in an array. Please note that our loss function is calculated on the training dataset."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "4e92df97",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import SGDRegressor\n",
- "\n",
- "regressor = SGDRegressor(eta0=0.0005)\n",
- "losses = []\n",
- "epochs = 100\n",
- "\n",
- "for epoch in range(epochs):\n",
- " regressor.partial_fit(X_train, y_train)\n",
- " loss = (regressor.predict(X_train) - y_train).abs().mean()\n",
- " losses.append(loss)"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "6dbb08ce-a0e1-4a72-a836-bf9fe7ce3732",
- "metadata": {},
- "source": [
- "We use Matplotlib to plot the results of the first 30 loss functions. As we can see, with an increase in epochs, the loss function gradually decreases. Moreover, the early epochs show a relatively rapid decline, while the later epochs exhibit a slower decrease."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "e07e167d",
- "metadata": {},
- "outputs": [],
- "source": [
- "fig = plt.figure(figsize=(7, 4))\n",
- "plt.plot(losses[:30], marker='o', markersize=10, color='green')\n",
- "plt.xlabel('epoch')\n",
- "plt.ylabel('loss')\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "5b5dc384-e15d-44f8-a0c0-ca8148241a3d",
- "metadata": {},
- "source": [
- "\n",
- "Furthermore, we plot the results of the loss function for all 100 training epochs. It's evident that the loss value keeps decreasing in the early epochs and only starts stabilizing after around 60 epochs.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "1bc2ccd1",
- "metadata": {},
- "outputs": [],
- "source": [
- "fig = plt.figure(figsize=(7, 4))\n",
- "plt.plot(losses, color='green')\n",
- "plt.xlabel('epoch')\n",
- "plt.ylabel('loss')\n",
- "plt.show()"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "4f48e207-c4df-433f-bd02-29629a129f50",
- "metadata": {},
- "source": [
- "You might be curious about what's happening behind the scenes when the loss function of the SGDRegressor model decreases during training. Let's print the model's `coef_` attribute, which represents the coefficients of the linear model. Starting with the model obtained after one training epoch, we get the following results."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0b671069",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import SGDRegressor\n",
- "\n",
- "regressor1 = SGDRegressor(eta0=0.0005)\n",
- "epochs = 1\n",
- "\n",
- "for epoch in range(epochs):\n",
- " regressor1.partial_fit(X_train, y_train)\n",
- " \n",
- "regressor1.coef_"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "1c48f74d-372f-491a-8f63-94628e0ad50b",
- "metadata": {},
- "source": [
- "Next, here are the model parameters after 10 training epochs, and we obtain the following results.\n"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "0e15d9ef",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import SGDRegressor\n",
- "\n",
- "regressor10 = SGDRegressor(eta0=0.0005)\n",
- "epochs = 10\n",
- "\n",
- "for epoch in range(epochs):\n",
- " regressor10.partial_fit(X_train, y_train)\n",
- " \n",
- "regressor10.coef_"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "04403b19-938a-4650-9d75-ef0c03ab3f58",
- "metadata": {},
- "source": [
- "Finally, when we examine the model parameters after 100 training epochs, we observe that the linear model's parameters continue to change with an increase in training epochs."
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "8ff44ace",
- "metadata": {},
- "outputs": [],
- "source": [
- "from sklearn.linear_model import SGDRegressor\n",
- "\n",
- "regressor100 = SGDRegressor(eta0=0.0005)\n",
- "epochs = 100\n",
- "\n",
- "for epoch in range(epochs):\n",
- " regressor100.partial_fit(X_train, y_train)\n",
- " \n",
- "regressor100.coef_"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "cb75974d-4f48-41d5-a7fd-e11734cffaca",
- "metadata": {},
- "source": [
- "In essence, you can think of it this way: during the training process of the SGDRegressor model, the algorithm is continually trying to reduce the loss function. In other words, this is the direction of model optimization. Each training round of SGDRegressor results in a new model, which can yield a new loss function value on the training dataset. If the algorithm consistently finds a smaller loss function value in each iteration compared to the previous one, the model becomes incrementally more optimized with each round. Consequently, as the number of training epochs increases, the loss function tends to decrease until it stabilizes, and the model's parameters change along with it, ultimately achieving the optimal result."
- ]
- },
- {
- "cell_type": "markdown",
- "id": "c2a05c02-239e-4fdc-9f98-cbd963c58dd4",
- "metadata": {},
- "source": [
- ""
- ]
- },
- {
- "cell_type": "markdown",
- "id": "e57858b7-2458-4218-be99-2ea6666b25f2",
- "metadata": {},
- "source": [
- "Of course, the explanation here might be a bit simplified, and we will provide more detailed answers in the upcoming gradient descent series. "
- ]
- }
- ],
- "metadata": {
- "kernelspec": {
- "display_name": "Python 3 (ipykernel)",
- "language": "python",
- "name": "python3"
- },
- "language_info": {
- "codemirror_mode": {
- "name": "ipython",
- "version": 3
- },
- "file_extension": ".py",
- "mimetype": "text/x-python",
- "name": "python",
- "nbconvert_exporter": "python",
- "pygments_lexer": "ipython3",
- "version": "3.10.11"
- }
- },
- "nbformat": 4,
- "nbformat_minor": 5
-}
diff --git a/open-machine-learning-jupyter-book/slides/ml-fundamentals/logistic-regression-condensed.ipynb b/open-machine-learning-jupyter-book/slides/ml-fundamentals/logistic-regression-condensed.ipynb
new file mode 100644
index 0000000000..b547bdbff4
--- /dev/null
+++ b/open-machine-learning-jupyter-book/slides/ml-fundamentals/logistic-regression-condensed.ipynb
@@ -0,0 +1,664 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "skip"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%%html\n",
+ "\n",
+ "\n",
+ "\n",
+ ""
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "0MRC0e0KhQ0S",
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "# Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Introduction\n",
+ "\n",
+ "\n",
+ "* In fact, logistic regression is a classification algorithm, unlike other regression models.\n",
+ "* Logistic Regression is very important for entering deep learning. \n",
+ "* After understanding this topic, you will be able to easily learning to Artificial Neural Network."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "LWd1UlMnhT2s",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Importing the libraries"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "YvGPUQaHhXfL",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "%matplotlib inline\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "import pandas as pd"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Sigmoid function"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "def sigmoid(x):\n",
+ " return 1.0 / (1.0 + np.exp(-x))\n",
+ "\n",
+ "values = np.arange(-10, 10, 0.1)\n",
+ "\n",
+ "plt.plot(values, sigmoid(values))\n",
+ "plt.xlabel('x')\n",
+ "plt.ylabel('sigmoid(x)')\n",
+ "plt.title('Sigmoid Function in Matplotlib')\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "K1VMqkGvhc3-",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Importing the dataset"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "M52QDmyzhh9s",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "dataset = pd.read_csv('../../assets/data/Social_Network_Ads.csv')\n",
+ "X = dataset.iloc[:, :-1].values\n",
+ "y = dataset.iloc[:, -1].values\n",
+ "\n",
+ "dataset"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "YvxIPVyMhmKp",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Splitting the dataset into the Training set and Test set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "AVzJWAXIhxoC",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.model_selection import train_test_split\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)"
+ ]
+ },
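+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "A quick sanity check on the split: with `test_size = 0.25`, roughly a quarter of the rows ends up in the test set."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Roughly 75% / 25% of the rows go to the training / test sets\n",
+    "print(X_train.shape, X_test.shape)\n",
+    "print(y_train.shape, y_test.shape)"
+   ]
+  },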
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "kW3c7UYih0hT",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Feature Scaling"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {},
+ "colab_type": "code",
+ "id": "9fQlDPKCh8sc",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.preprocessing import StandardScaler\n",
+ "sc = StandardScaler()\n",
+ "X_train = sc.fit_transform(X_train)\n",
+ "X_test = sc.transform(X_test)"
+ ]
+ },
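+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "A sanity check on the scaling (a small sketch): after `StandardScaler`, each training column has approximately zero mean and unit variance, while the test set is transformed with the statistics learned from the training set only."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Training columns: mean ~ 0, standard deviation ~ 1\n",
+    "print(X_train.mean(axis=0), X_train.std(axis=0))\n",
+    "\n",
+    "# Test columns: close to, but not exactly, 0 and 1 (scaled with training statistics)\n",
+    "print(X_test.mean(axis=0), X_test.std(axis=0))"
+   ]
+  },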
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "bb6jCOCQiAmP",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Training the Logistic Regression model on the Training set"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 103
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 2125,
+ "status": "ok",
+ "timestamp": 1588265315505,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "e0pFVAmciHQs",
+ "outputId": "67f64468-abdb-4fe7-cce9-de0037119610",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import LogisticRegression\n",
+ "classifier = LogisticRegression(random_state = 0)\n",
+ "classifier.fit(X_train, y_train)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "yyxW5b395mR2",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Predicting a new result"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 34
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 2118,
+ "status": "ok",
+ "timestamp": 1588265315505,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "f8YOXsQy58rP",
+ "outputId": "2e1b0063-548e-4924-cf3a-93a79d97e35e",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "print(classifier.predict(sc.transform([[30, 87000], [65, 990000]])))"
+ ]
+ },
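+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "Behind `predict` is the sigmoid output itself: `predict_proba` returns the estimated probabilities `[P(y = 0), P(y = 1)]` for each row, and `predict` simply thresholds `P(y = 1)` at 0.5 (shown here for the same two example inputs)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "# Probabilities behind the class predictions above\n",
+    "print(classifier.predict_proba(sc.transform([[30, 87000], [65, 990000]])))"
+   ]
+  },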
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "vKYVQH-l5NpE",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Predicting the Test set results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 1000
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 2112,
+ "status": "ok",
+ "timestamp": 1588265315506,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "p6VMTb2O4hwM",
+ "outputId": "a4f03a97-2942-45cd-f735-f4063277a96c",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "y_pred = classifier.predict(X_test)\n",
+ "print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)), 1))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "h4Hwj34ziWQW",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Making the Confusion Matrix"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 68
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 2107,
+ "status": "ok",
+ "timestamp": 1588265315506,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "D6bpZwUiiXic",
+ "outputId": "f202fcb3-5882-4d93-e5df-50791185067e",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix, accuracy_score\n",
+ "cm = confusion_matrix(y_test, y_pred)\n",
+ "print(cm)\n",
+ "accuracy_score(y_test, y_pred)"
+ ]
+ },
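+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "Accuracy alone can hide the kind of mistakes the model makes. As a short sketch, precision and recall summarise the off-diagonal entries of the confusion matrix above (rows are true classes, columns are predicted classes)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.metrics import precision_score, recall_score\n",
+    "\n",
+    "# Of the rows predicted as class 1, how many really are class 1?\n",
+    "print('precision =', precision_score(y_test, y_pred))\n",
+    "\n",
+    "# Of the rows that really are class 1, how many did we catch?\n",
+    "print('recall    =', recall_score(y_test, y_pred))"
+   ]
+  },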
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "6OMC_P0diaoD",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Visualising the Training set results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 349
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 23189,
+ "status": "ok",
+ "timestamp": 1588265336596,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "_NOjKvZRid5l",
+ "outputId": "6fa60701-9aa4-46f2-a6aa-0f9b0aad62b3",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from matplotlib.colors import ListedColormap\n",
+ "X_set, y_set = sc.inverse_transform(X_train), y_train\n",
+ "X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.5),\n",
+ " np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.5))\n",
+ "plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),\n",
+ " alpha = 0.75, cmap = ListedColormap(('red', 'green')))\n",
+ "plt.xlim(X1.min(), X1.max())\n",
+ "plt.ylim(X2.min(), X2.max())\n",
+ "for i, j in enumerate(np.unique(y_set)):\n",
+ " plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)\n",
+ "plt.title('Logistic Regression (Training set)')\n",
+ "plt.xlabel('Age')\n",
+ "plt.ylabel('Estimated Salary')\n",
+ "plt.legend()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "colab_type": "text",
+ "id": "SZ-j28aPihZx",
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "source": [
+ "## Visualising the Test set results"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "colab": {
+ "base_uri": "https://localhost:8080/",
+ "height": 349
+ },
+ "colab_type": "code",
+ "executionInfo": {
+ "elapsed": 43807,
+ "status": "ok",
+ "timestamp": 1588265357223,
+ "user": {
+ "displayName": "Hadelin de Ponteves",
+ "photoUrl": "https://lh3.googleusercontent.com/a-/AOh14GhEuXdT7eQweUmRPW8_laJuPggSK6hfvpl5a6WBaA=s64",
+ "userId": "15047218817161520419"
+ },
+ "user_tz": -240
+ },
+ "id": "qeTjz2vDilAC",
+ "outputId": "00fb10bc-c726-46b8-8eaa-c5c6b584aa54",
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from matplotlib.colors import ListedColormap\n",
+ "X_set, y_set = sc.inverse_transform(X_test), y_test\n",
+ "X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 10, stop = X_set[:, 0].max() + 10, step = 0.5),\n",
+ " np.arange(start = X_set[:, 1].min() - 1000, stop = X_set[:, 1].max() + 1000, step = 0.5))\n",
+ "plt.contourf(X1, X2, classifier.predict(sc.transform(np.array([X1.ravel(), X2.ravel()]).T)).reshape(X1.shape),\n",
+ " alpha = 0.75, cmap = ListedColormap(('red', 'green')))\n",
+ "plt.xlim(X1.min(), X1.max())\n",
+ "plt.ylim(X2.min(), X2.max())\n",
+ "for i, j in enumerate(np.unique(y_set)):\n",
+ " plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1], c = ListedColormap(('red', 'green'))(i), label = j)\n",
+ "plt.title('Logistic Regression (Test set)')\n",
+ "plt.xlabel('Age')\n",
+ "plt.ylabel('Estimated Salary')\n",
+ "plt.legend()\n",
+ "plt.show()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "source": [
+ "## Linear Regression v.s. Logistic Regression"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.datasets import make_classification\n",
+ "\n",
+ "X, y = make_classification(\n",
+ " n_features=2, n_redundant=0, n_informative=2, n_clusters_per_class=1, random_state=12\n",
+ ")\n",
+ "\n",
+ "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)\n",
+ "\n",
+ "plt.scatter(X[:, 0], X[:, 1], c=y)\n",
+ "\n",
+ "plt.plot([-2.0, 0], [1.2, -1.3])"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "subslide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.linear_model import LogisticRegression\n",
+ "\n",
+ "classifier = LogisticRegression(random_state = 0)\n",
+ "classifier.fit(X_train, y_train)\n",
+ "\n",
+ "classifier.__dict__\n",
+ "\n",
+ "print(1.4/2.4)\n",
+ "\n",
+ "print(1.3/2.4)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "from sklearn.metrics import confusion_matrix, accuracy_score\n",
+ "\n",
+ "y_pred = classifier.predict(X_test)\n",
+ "\n",
+ "cm = confusion_matrix(y_test, y_pred)\n",
+ "print(cm)\n",
+ "accuracy_score(y_test, y_pred)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {
+ "slideshow": {
+ "slide_type": "slide"
+ }
+ },
+ "outputs": [],
+ "source": [
+ "classifier.coef_"
+ ]
+ },
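+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "source": [
+    "To connect the coefficients back to the picture (a minimal sketch): with two features, the decision boundary is the line $w_0 x_0 + w_1 x_1 + b = 0$, so it can be drawn directly from `classifier.coef_` and `classifier.intercept_` over the synthetic data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "slideshow": {
+     "slide_type": "fragment"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "w = classifier.coef_[0]\n",
+    "b = classifier.intercept_[0]\n",
+    "\n",
+    "# Solve w0*x0 + w1*x1 + b = 0 for x1 to get the boundary line\n",
+    "xs = np.linspace(X[:, 0].min(), X[:, 0].max(), 100)\n",
+    "boundary = -(w[0] * xs + b) / w[1]\n",
+    "\n",
+    "plt.scatter(X[:, 0], X[:, 1], c=y)\n",
+    "plt.plot(xs, boundary, 'k--')\n",
+    "plt.title('Decision boundary from coef_ and intercept_')\n",
+    "plt.show()\n",
+    "\n",
+    "print('boundary slope =', -w[0] / w[1])"
+   ]
+  },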
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Assignment - 1\n",
+ "\n",
+ "- Build classification models:Predict the price range"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Assignment - 2\n",
+ "\n",
+ "- Logistic Regression from scratch\n"
+ ]
+  }
+ ],
+ "metadata": {
+ "celltoolbar": "Slideshow",
+ "colab": {
+ "authorship_tag": "ABX9TyOsvB/iqEjYj3VN6C/JbvkE",
+ "collapsed_sections": [],
+ "machine_shape": "hm",
+ "name": "logistic_regression.ipynb",
+ "provenance": [],
+ "toc_visible": true
+ },
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.7.4"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
diff --git a/tutorials/code-for-videos/introduction-to-machine-learning.ipynb b/tutorials/introduction-to-machine-learning.ipynb
similarity index 100%
rename from tutorials/code-for-videos/introduction-to-machine-learning.ipynb
rename to tutorials/introduction-to-machine-learning.ipynb
diff --git a/tutorials/code-for-videos/linear-regression-loss-function.ipynb b/tutorials/linear-regression-loss-function.ipynb
similarity index 100%
rename from tutorials/code-for-videos/linear-regression-loss-function.ipynb
rename to tutorials/linear-regression-loss-function.ipynb
diff --git a/tutorials/code-for-videos/metrics-linear-regression-diabetes.ipynb b/tutorials/metrics-linear-regression-diabetes.ipynb
similarity index 99%
rename from tutorials/code-for-videos/metrics-linear-regression-diabetes.ipynb
rename to tutorials/metrics-linear-regression-diabetes.ipynb
index ab59ade1ef..79582e73ef 100644
--- a/tutorials/code-for-videos/metrics-linear-regression-diabetes.ipynb
+++ b/tutorials/metrics-linear-regression-diabetes.ipynb
@@ -9,13 +9,6 @@
"# **Linear Regression - SKLearn Diabetes Dataset**"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Can AI predict diabetes? Test set training set (1/5)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -1547,13 +1540,6 @@
"However, this method is obviously too primitive. Is there a better way to evaluate the quality of the model?"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Can AI predict diabetes? Illustrated evaluation indicators (2/5)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -1588,13 +1574,6 @@
"***"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Can AI predict diabetes? Achieve evaluation metrics (3/5)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {
@@ -1772,13 +1751,6 @@
"print('r_squared_sklearn = ', r_squared_sklearn)"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Can AI predict diabetes? Detailed explanation of MAPE (4/5)\n"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
@@ -2007,13 +1979,6 @@
"print(summary_model)"
]
},
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Can AI predict diabetes? Detailed explanation of R-squared (5/5)"
- ]
- },
{
"cell_type": "markdown",
"metadata": {},
diff --git a/tutorials/code-for-videos/multiple-linear-regression.ipynb b/tutorials/multiple-linear-regression.ipynb
similarity index 100%
rename from tutorials/code-for-videos/multiple-linear-regression.ipynb
rename to tutorials/multiple-linear-regression.ipynb
diff --git a/tutorials/code-for-videos/polynomial_regression.ipynb b/tutorials/polynomial_regression.ipynb
similarity index 100%
rename from tutorials/code-for-videos/polynomial_regression.ipynb
rename to tutorials/polynomial_regression.ipynb
diff --git a/tutorials/code-for-videos/simple-linear-regression.ipynb b/tutorials/simple-linear-regression.ipynb
similarity index 100%
rename from tutorials/code-for-videos/simple-linear-regression.ipynb
rename to tutorials/simple-linear-regression.ipynb