diff --git a/open-machine-learning-jupyter-book/_toc.yml b/open-machine-learning-jupyter-book/_toc.yml
index 6ac77e0e31..5a08945679 100644
--- a/open-machine-learning-jupyter-book/_toc.yml
+++ b/open-machine-learning-jupyter-book/_toc.yml
@@ -151,6 +151,7 @@ parts:
 - file: assignments/ml-fundamentals/ml-linear-regression-1
 - file: assignments/ml-fundamentals/ml-linear-regression-2
 - file: assignments/ml-fundamentals/linear-regression/linear-regression-metrics.ipynb
+ - file: assignments/ml-fundamentals/linear-regression/loss-function.ipynb
 - file: assignments/ml-fundamentals/linear-regression/gradient-descent.ipynb
 - file: assignments/ml-fundamentals/ml-logistic-regression-1
 - file: assignments/ml-fundamentals/ml-logistic-regression-2
diff --git a/open-machine-learning-jupyter-book/assignments/ml-fundamentals/linear-regression/loss-function.ipynb b/open-machine-learning-jupyter-book/assignments/ml-fundamentals/linear-regression/loss-function.ipynb
new file mode 100644
index 0000000000..a517f8e3da
--- /dev/null
+++ b/open-machine-learning-jupyter-book/assignments/ml-fundamentals/linear-regression/loss-function.ipynb
@@ -0,0 +1,553 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Loss Function"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Objective of this section\n",
+ "\n",
+ "We have already learned the math and code for \"Gradient Descent\", as well as other optimization techniques.\n",
+ "\n",
+ "In this section, we will learn more about loss functions.\n",
+ "\n",
+ "As a learner, you can focus on the L1 and L2 losses among the regression losses and on the classification losses in this section, and then read about the other loss functions as needed."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Concept of loss function\n",
+ "\n",
+ "- A loss function gauges the disparity between the model's predictions and the actual values. Simply put, it indicates how \"off\" our model is. By optimizing this function, our objective is to identify parameters that bring the model's predictions as close as possible to the true values.\n",
+ "\n",
+ "- The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function.\n",
+ "— Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Difference between a Loss Function and a Cost Function\n",
+ "\n",
+ "A loss function evaluates the error for a single training example, and it is occasionally referred to as an error function. In contrast, a cost function represents the **average loss** across the entire training dataset. Optimization strategies are designed to minimize this cost function.\n",
+ "\n",
+ "For a simple example: the squared error of a single sample is a loss, and the corresponding cost function is the mean of these squared errors (MSE) over the whole training set. You can see the difference in the [mathematical expressions](#regression-loss-functions) below.\n",
+ "\n",
+ "Although these terms are frequently used interchangeably in practice, they aren't precisely equivalent. From a definitional standpoint, the cost function represents an aggregation or average of the loss function."
+ ]
+ },
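+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To make this distinction concrete, here is a minimal sketch (plain NumPy, with made-up numbers) that computes the squared-error loss of each individual sample and then averages those per-sample losses into the MSE cost."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Hypothetical targets and predictions for three training samples\n",
+ "y_true = np.array([1.0, 2.0, 3.0])\n",
+ "y_pred = np.array([1.5, 1.5, 3.5])\n",
+ "\n",
+ "per_sample_loss = (y_true - y_pred) ** 2  # loss: one value per sample\n",
+ "cost = per_sample_loss.mean()  # cost: the average loss over the dataset\n",
+ "\n",
+ "print(\"Per-sample squared-error losses:\", per_sample_loss)\n",
+ "print(\"MSE cost:\", cost)"
+ ]
+ },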
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Classification of loss functions\n",
+ "\n",
+ "### Regression Losses\n",
+ "\n",
+ "These are employed when the objective is to predict a continuous outcome.\n",
+ "\n",
+ "- Mean Squared Error (MSE): Measures the average squared discrepancies between predictions and actual values, emphasizing larger errors.\n",
+ "- Mean Absolute Error (MAE): Calculates the average of absolute differences between predicted outcomes and actual observations, offering a linear penalty for each deviation.\n",
+ "- Huber Loss: A hybrid loss that's quadratic for small differences and linear for large ones, providing resilience against outliers.\n",
+ "- L1 Loss: Directly reflects the absolute discrepancies between predictions and real values, synonymous with MAE.\n",
+ "- L2 Loss: Highlights squared differences between predictions and actuals, equivalent to MSE.\n",
+ "- Smooth L1 Loss: An amalgamation of L1 and L2 losses, it provides a balance in handling both minor and major deviations (a short code sketch of this loss follows these lists).\n",
+ "\n",
+ "### Classification Losses\n",
+ "\n",
+ "Utilized for tasks requiring the prediction of discrete categories.\n",
+ "\n",
+ "- Cross Entropy Loss: Quantifies the dissimilarity between the predicted probability distribution and the actual class distribution.\n",
+ "- Hinge Loss: A staple for Support Vector Machines (SVMs), it strives to separate classes by maximizing the margin around the decision boundary.\n",
+ "- **Binary Cross Entropy Loss (Log Loss):** It is intended for use with binary classification where the target values are in the set {0, 1}. It is a special case of Cross Entropy Loss, used specifically for binary classification problems.\n",
+ "- **Multi-Class Cross-Entropy Loss:** In this case, it is intended for use with multi-class classification where the target values are in the set {0, 1, 2, …, n}, where each class is assigned a unique integer value. It is an extension of Cross Entropy Loss and is used for multi-class classification problems.\n",
+ "\n",
+ "### Structured Losses\n",
+ "\n",
+ "Tailored for intricate tasks involving structured data patterns.\n",
+ "\n",
+ "- Sequence Generation Loss: Emblematic examples include the CTC (Connectionist Temporal Classification) loss, designed for tasks such as speech and text recognition.\n",
+ "- Image Segmentation Loss: Noteworthy instances encompass the Dice loss and the IoU (Intersection over Union) loss.\n",
+ "\n",
+ "### Regularization Losses\n",
+ "\n",
+ "Rather than directly influencing the model's predictions, these losses are integrated into the objective function to counteract excessive model complexity.\n",
+ "\n",
+ "- L1 Regularization (Lasso): Enforces sparsity by compelling certain model coefficients to be exactly zero.\n",
+ "- L2 Regularization (Ridge): Curbs the unchecked growth of model parameters without nullifying them, ensuring the model remains generalized without undue complexity."
+ ]
+ },
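+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Most of these losses are demonstrated with runnable code later in this notebook. Smooth L1 Loss is the one that does not get its own example below, so here is a minimal sketch of it. It assumes the common convention of using Huber loss with `delta=1.0` (the Keras default) as Smooth L1, since the two formulas coincide at that threshold."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "\n",
+ "y_true = tf.constant([1.0, 2.0, 3.0])\n",
+ "y_pred = tf.constant([1.5, 1.5, 5.0])\n",
+ "\n",
+ "# Huber loss with delta=1.0: quadratic for errors below 1, linear above,\n",
+ "# which is exactly the Smooth L1 behaviour described in the list above.\n",
+ "smooth_l1 = tf.keras.losses.Huber(delta=1.0)\n",
+ "\n",
+ "print(smooth_l1(y_true, y_pred).numpy())"
+ ]
+ },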
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Empirical Risk and Structural Risk\n",
+ "\n",
+ "### Definition\n",
+ "\n",
+ "Perhaps you've heard of these two concepts before. In the realms of machine learning and statistics, the concepts of empirical risk and structural risk are intricately tied to loss functions. **However, these terms aren't directly categories of loss functions per se.** Let's first clarify these concepts:\n",
+ "\n",
+ "1. **Empirical Risk:** Refers to the average loss of a model over a given dataset. Minimizing empirical risk focuses on reducing errors explicitly on the training data.\n",
+ "2. **Structural Risk:** Introduces a regularization term in addition to empirical risk, aiming to prevent overfitting. Minimizing structural risk strikes a balance between the empirical risk and the complexity of the model.\n",
+ "\n",
+ "Given these definitions:\n",
+ "\n",
+ "- **Empirical Risk:** Loss functions directly related to dataset performance fall under this category. Of the losses listed above, the regression losses (e.g., MSE, MAE, Huber Loss, L1 Loss, L2 Loss, Smooth L1 Loss), classification losses (e.g., Cross Entropy Loss, Hinge Loss, Log Loss), and structured losses (e.g., CTC or Image Segmentation Loss) can be seen as manifestations of empirical risk.\n",
+ "\n",
+ "- **Structural Risk:** Regularization losses, like L1 and L2 regularization, form part of structural risk. They don't measure the model's performance on the data directly but rather serve to rein in model complexity.\n",
+ "\n",
+ "### A Metaphor\n",
+ "\n",
+ "Maybe this still feels abstract. So, now imagine you're a tailor trying to make a dress for a client.\n",
+ "\n",
+ "- **Empirical Risk:** This is like ensuring the dress fits the client perfectly based on a single fitting session. You measure every contour and make the dress to match those exact measurements. The dress is a perfect fit for the client on that particular day.\n",
+ "\n",
+ "However, what if the client gains or loses a little weight or wants to move more comfortably? A dress tailored too tightly to the exact measurements might not be very adaptable or comfortable in various situations.\n",
+ "\n",
+ "- **Structural Risk:** Now, consider that you decide to allow a bit more flexibility in the dress. You make it slightly adjustable, perhaps with some elastic portions. This way, even if the client's measurements change a bit, the dress will still fit comfortably. You're sacrificing a tiny bit of the \"perfect\" fit for adaptability and general comfort.\n",
+ "\n",
+ "In the context of machine learning:\n",
+ "\n",
+ "Relying solely on **Empirical Risk** would be like fitting the dress exactly to the client's measurements, risking overfitting. If the data changes slightly, the model might perform poorly.\n",
+ "\n",
+ "Factoring in **Structural Risk** ensures the model isn't overly tailored to the training data and can generalize well to new, unseen data. It's about ensuring a balance between a perfect fit and adaptability.\n",
+ "\n",
+ "### Mathematical Explanation\n",
+ "\n",
+ "Now you have a general understanding of the meaning of empirical risk and structural risk. Let's delve into a more mathematical perspective:\n",
+ "\n",
+ "Given a dataset $\\mathcal{D}$ comprising input-output pairs $(x_1, y_1)$, $(x_2, y_2)$, ..., $(x_n, y_n)$ and a model $f$ parameterized by $\\theta$, the empirical risk and structural risk can be formally defined as follows:\n",
+ "\n",
+ "**Empirical Risk (Cost Function):**\n",
+ "$$\n",
+ "R_{emp}(f) = \\frac{1}{n} \\sum_{i=1}^{n} L(y_i, f(x_i; \\theta))\n",
+ "$$\n",
+ "Where:\n",
+ "\n",
+ "- **$L$ is the loss function**, measuring the discrepancy between the predicted value $f(x_i; \\theta)$ and the actual output $y_i$.\n",
+ "\n",
+ "Empirical risk quantifies how well the model fits the given dataset, representing the average loss of the model on the training data.\n",
+ "\n",
+ "**Structural Risk (Objective Function):**\n",
+ "$$\n",
+ "R_{struc}(f) = R_{emp}(f) + \\lambda R_{reg}(\\theta)\n",
+ "$$\n",
+ "Where:\n",
+ "- $R_{reg}(\\theta)$ is the regularization term, penalizing the complexity of the model.\n",
+ "- $\\lambda$ is a regularization coefficient determining the weight of the regularization term relative to the empirical risk.\n",
+ "\n",
+ "Structural risk is a combination of the empirical risk and a penalty for model complexity. It strikes a balance between fitting the training data (empirical risk) and ensuring the model isn't overly complex (which can lead to overfitting).\n",
+ "\n",
+ "**Differences and Relations:**\n",
+ "\n",
+ "1. **Empirical Risk** focuses solely on minimizing the error on the training data without considering model complexity or how it generalizes to unseen data.\n",
+ "2. **Structural Risk** takes into account both the empirical risk and the complexity of the model. By introducing a regularization term, it ensures that the model doesn't become overly complex and overfit the training data. Thus, it balances performance on the training data with generalization to new data.\n",
+ "\n",
+ "In essence, while empirical risk aims for performance on the current dataset, structural risk aims for good performance on new data by penalizing overly complex models.\n",
+ "\n",
+ "### Cost Function and Objective Function\n",
+ "\n",
+ "The empirical risk and the cost function are in many cases the same thing: both represent the average loss on the training data.\n",
+ "\n",
+ "Structural risk is often used as the objective function, especially when regularization is considered. The term \"objective function\" is broader, though: it is not limited to structural risk and can also include other optimization objectives."
+ ]
+ },
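+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To connect the two formulas above to numbers, here is a minimal sketch with made-up data and a hypothetical one-parameter linear model $f(x) = \\theta x$. It computes the empirical risk as the mean squared error and the structural risk by adding an L2 penalty $\\lambda \\theta^2$ on the parameter."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# Made-up dataset and a hypothetical one-parameter model f(x) = theta * x\n",
+ "x = np.array([1.0, 2.0, 3.0, 4.0])\n",
+ "y = np.array([1.2, 1.9, 3.2, 3.9])\n",
+ "theta = 1.1\n",
+ "lam = 0.1  # regularization coefficient (lambda)\n",
+ "\n",
+ "y_hat = theta * x\n",
+ "empirical_risk = np.mean((y - y_hat) ** 2)  # average loss on the training data\n",
+ "structural_risk = empirical_risk + lam * theta ** 2  # add the L2 penalty on theta\n",
+ "\n",
+ "print(\"Empirical risk:\", empirical_risk)\n",
+ "print(\"Structural risk:\", structural_risk)"
+ ]
+ },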
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Common Loss Functions\n",
+ "\n",
+ "### Regression Loss Functions\n",
+ "\n",
+ "1. **Mean Squared Error, MSE**\n",
+ "\n",
+ "$$\n",
+ "L(y, \\hat{y}) = \\frac{1}{n} \\sum_{i=1}^{n} (y_i - \\hat{y}_i)^2\n",
+ "$$\n",
+ "\n",
+ "Where $y_i$ is the actual value and $\\hat{y}_i$ is the predicted value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import tensorflow as tf\n",
+ "\n",
+ "y_true = tf.constant([1.0, 2.0, 3.0])\n",
+ "y_pred = tf.constant([1.5, 1.5, 3.5])\n",
+ "loss = tf.keras.losses.MSE(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2. **Mean Absolute Error, MAE**\n",
+ "\n",
+ "$$\n",
+ "L(y, \\hat{y}) = \\frac{1}{n} \\sum_{i=1}^{n} |y_i - \\hat{y}_i|\n",
+ "$$"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "loss = tf.keras.losses.MAE(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "3. **Huber Loss**\n",
+ "\n",
+ "$$\n",
+ "L_{\\delta}(y, \\hat{y}) = \\begin{cases} \n",
+ "      \\frac{1}{2}(y - \\hat{y})^2 & \\text{if } |y - \\hat{y}| \\leq \\delta \\\\\n",
+ "      \\delta |y - \\hat{y}| - \\frac{1}{2}\\delta^2 & \\text{otherwise}\n",
+ "   \\end{cases}\n",
+ "$$\n",
+ "\n",
+ "Positioned between MSE and MAE, this loss function offers robustness against outliers."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "loss = tf.keras.losses.Huber()(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "4. **L1 Loss**\n",
+ "\n",
+ "$$\n",
+ "L = | y - f(x) |\n",
+ "$$\n",
+ "\n",
+ "Corresponds to MAE.\n",
+ "\n",
+ "5. **L2 Loss**\n",
+ "\n",
+ "$$\n",
+ "L = ( y - f(x) )^2\n",
+ "$$\n",
+ "\n",
+ "Corresponds to MSE.\n"
+ ]
+ },
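+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As a quick check of this correspondence, the following sketch reuses the `y_true` and `y_pred` tensors defined above: averaging the element-wise L1 and L2 losses over the samples reproduces the MAE and MSE values computed by Keras."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# Element-wise L1 and L2 losses, averaged over the samples\n",
+ "l1_loss = tf.reduce_mean(tf.abs(y_true - y_pred))\n",
+ "l2_loss = tf.reduce_mean(tf.square(y_true - y_pred))\n",
+ "\n",
+ "print(\"Mean L1 loss:\", l1_loss.numpy(), \"vs MAE:\", tf.keras.losses.MAE(y_true, y_pred).numpy())\n",
+ "print(\"Mean L2 loss:\", l2_loss.numpy(), \"vs MSE:\", tf.keras.losses.MSE(y_true, y_pred).numpy())"
+ ]
+ },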
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Classification Loss Functions\n",
+ "\n",
+ "1. **Cross Entropy Loss**\n",
+ "\n",
+ "$$\n",
+ "L(y, p) = - \\sum_{i=1}^{C} y_i \\log(p_i)\n",
+ "$$\n",
+ "\n",
+ "Where $y_i$ is the actual label (0 or 1) and $p_i$ is the predicted probability for the respective class."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_true = tf.constant([[0, 1], [1, 0], [1, 0]])\n",
+ "y_pred = tf.constant([[0.05, 0.95], [0.1, 0.9], [0.8, 0.2]])\n",
+ "loss = tf.keras.losses.CategoricalCrossentropy()(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2. **Hinge Loss**\n",
+ "\n",
+ "$$\n",
+ "L(y, \\hat{y}) = \\max(0, 1 - y \\cdot \\hat{y})\n",
+ "$$\n",
+ "\n",
+ "Primarily used for Support Vector Machines, but it can also be employed for other classification tasks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_true = tf.constant([-1, 1, 1])  # binary class labels in {-1, 1}\n",
+ "y_pred = tf.constant([0.5, 0.3, -0.7])  # raw model outputs\n",
+ "loss = tf.keras.losses.Hinge()(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "3. **Binary Cross Entropy (Log Loss)**\n",
+ "\n",
+ "Mathematically, it is the preferred loss function under the inference framework of maximum likelihood. It is the loss function to be evaluated first and only changed if you have a good reason.\n",
+ "\n",
+ "Cross-entropy calculates a score that summarizes the average difference between the actual and predicted probability distributions for predicting class 1. The score is minimized, and a perfect cross-entropy value is 0.\n",
+ "\n",
+ "This YouTube video by Andrew Ng explains Binary Cross Entropy Loss very well (make sure that you have access to YouTube for this web page to render correctly):"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from IPython.display import HTML\n",
+ "\n",
+ "display(HTML(\n",
+ "    \"\"\"\n",
+ "    \n",
+ "    \"\"\"\n",
+ "))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y_true = tf.constant([0, 1, 0])\n",
+ "y_pred = tf.constant([0.05, 0.95, 0.1])\n",
+ "loss = tf.keras.losses.BinaryCrossentropy()(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "4. **Multi-Class Cross-Entropy Loss**\n",
+ "\n",
+ "Like the binary case, it is the preferred loss function under the inference framework of maximum likelihood, and the default to evaluate first unless you have a good reason to use something else.\n",
+ "\n",
+ "Cross-entropy calculates a score that summarizes the average difference between the actual and predicted probability distributions for all classes in the problem. The score is minimized, and a perfect cross-entropy value is 0."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "# One-hot encoded true labels\n",
+ "y_true = [[1, 0, 0],\n",
+ "          [0, 1, 0],\n",
+ "          [0, 0, 1],\n",
+ "          [1, 0, 0],\n",
+ "          [0, 1, 0]]\n",
+ "\n",
+ "# Mock predicted probabilities from a model\n",
+ "y_pred = [[0.7, 0.2, 0.1],\n",
+ "          [0.2, 0.5, 0.3],\n",
+ "          [0.1, 0.2, 0.7],\n",
+ "          [0.6, 0.3, 0.1],\n",
+ "          [0.1, 0.6, 0.3]]\n",
+ "\n",
+ "y_true = tf.constant(y_true, dtype=tf.float32)\n",
+ "y_pred = tf.constant(y_pred, dtype=tf.float32)\n",
+ "\n",
+ "loss = tf.reduce_mean(-tf.reduce_sum(y_true * tf.math.log(y_pred), axis=1))\n",
+ "\n",
+ "print(\"Multi-Class Cross-Entropy Loss:\", loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Structured Loss Functions\n",
+ "\n",
+ "1. **CTC Loss (Connectionist Temporal Classification)**\n",
+ "\n",
+ "Used for sequence-to-sequence problems, like speech recognition."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import numpy as np\n",
+ "\n",
+ "# With 3 classes, Keras reserves the last index (2) as the CTC blank label\n",
+ "y_true = np.array([[0, 1]])  # (batch, label_length)\n",
+ "y_pred = np.array([[[0.6, 0.3, 0.1], [0.3, 0.6, 0.1]]], dtype=np.float32)  # (batch, timesteps, num_classes)\n",
+ "input_length = np.array([[2]])  # timesteps per sample, shape (batch, 1)\n",
+ "label_length = np.array([[2]])  # label length per sample, shape (batch, 1)\n",
+ "loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2. **Dice Loss, IoU Loss**\n",
+ "\n",
+ "Used for image segmentation tasks."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def dice_loss(y_true, y_pred):\n",
+ "    # The +1 terms smooth the ratio and avoid division by zero\n",
+ "    numerator = 2 * tf.reduce_sum(y_true * y_pred, axis=-1)\n",
+ "    denominator = tf.reduce_sum(y_true + y_pred, axis=-1)\n",
+ "    return 1 - (numerator + 1) / (denominator + 1)\n",
+ "\n",
+ "y_true = tf.constant([[1, 0, 1], [0, 1, 0]], dtype=tf.float32)\n",
+ "y_pred = tf.constant([[0.8, 0.2, 0.6], [0.3, 0.7, 0.1]])\n",
+ "loss = dice_loss(y_true, y_pred)\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def iou_loss(y_true, y_pred):\n",
+ "    # Sum over the mask dimensions (axis=[1, 2] for (batch, height, width) masks;\n",
+ "    # use axis=[1, 2, 3] for (batch, height, width, channels) images)\n",
+ "    intersection = tf.reduce_sum(y_true * y_pred, axis=[1, 2])\n",
+ "    union = tf.reduce_sum(y_true, axis=[1, 2]) + tf.reduce_sum(y_pred, axis=[1, 2]) - intersection\n",
+ "    return 1. - (intersection + 1) / (union + 1)\n",
+ "\n",
+ "# For simplicity, using small 2D masks; typically these are full-size images.\n",
+ "y_true = tf.constant([[1, 0, 1], [0, 1, 0]], dtype=tf.float32)\n",
+ "y_pred = tf.constant([[0.8, 0.2, 0.6], [0.3, 0.7, 0.1]])\n",
+ "loss = iou_loss(y_true[tf.newaxis, ...], y_pred[tf.newaxis, ...])  # Add batch dimension\n",
+ "\n",
+ "print(loss.numpy())"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Regularization\n",
+ "\n",
+ "1. **L1 Regularization (Lasso)**\n",
+ "\n",
+ "Produces sparse model parameters."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from tensorflow.keras.regularizers import l1\n",
+ "\n",
+ "model = tf.keras.models.Sequential([\n",
+ "    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=l1(0.01), input_shape=(10,))\n",
+ "])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "2. **L2 Regularization (Ridge)**\n",
+ "\n",
+ "Prevents model parameters from becoming too large but doesn't force them to become exactly zero."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from tensorflow.keras.regularizers import l2\n",
+ "\n",
+ "model = tf.keras.models.Sequential([\n",
+ "    tf.keras.layers.Dense(64, activation='relu', kernel_regularizer=l2(0.01), input_shape=(10,))\n",
+ "])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Conclusion\n",
+ "\n",
+ "Loss functions hold a pivotal role in machine learning. By minimizing the loss, we enhance the accuracy of our model's predictions. A deep understanding of various loss functions aids in selecting the most appropriate optimization technique for specific challenges."
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "open-machine-learning-jupyter-book",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.9.16"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}