Commit

Merge pull request #402 from neuromatch/W2D1-discord
W2D1 Post-Course Update
glibesyck authored Aug 12, 2024
2 parents 87c554d + e424164 commit 96c15ab
Showing 3 changed files with 78 additions and 6 deletions.
28 changes: 26 additions & 2 deletions tutorials/W2D1_Macrocircuits/W2D1_Tutorial2.ipynb
@@ -661,6 +661,7 @@
"$$\n",
"\n",
"Here, $H^+$ is the [Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse).\n",
"\n",
"</details>\n",
"\n",
"\n",
@@ -1564,7 +1565,12 @@
"execution": {}
},
"source": [
"We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data."
"We observe that the \"peak\" disappears and the test error decreases roughly monotonically, although it remains higher for higher noise levels in the training data.\n",
"\n",
"The word *regularization* is used in statistics/ML parlance in two different senses, both concerned with making overparameterized models generalize well:\n",
"\n",
"- The first sense, emphasized throughout the tutorial, is explicit regularization: the model is not trained to completion (i.e., to zero training error), so that it does not fit the noise. Without explicit regularization we observe double descent, that is, catastrophic overfitting when the number of model parameters is close to the number of training examples, followed by a sharp reduction of this overfitting as the model becomes heavily overparameterized. With well-tuned explicit regularization, the double-descent peak disappears because we no longer risk fitting the noise at all.\n",
"- The second sense is inductive bias: overparameterized models trained with common optimization algorithms such as gradient descent tend to converge to a particularly \"simple\" solution that fits the data perfectly, where \"simple\" usually means that the parameters are small in magnitude. This inductive bias is also a major reason why double descent occurs, and in particular why overparameterization reduces overfitting. A minimal sketch of both senses follows below."
]
},
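{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To make the two senses concrete, below is a brief, self-contained NumPy sketch (it is not the tutorial's own code; the sizes, the seed, and the ridge strength `lam` are arbitrary choices for illustration). It fits a random-feature regression either with the minimum-norm pseudoinverse solution or with an explicitly regularized (ridge) solution, and compares their test errors as the number of features grows past the number of training samples:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"d_in, n_train, n_test, noise = 15, 40, 500, 0.5\n",
"\n",
"# Linear teacher; the training targets are noisy, the test targets are clean.\n",
"w_true = rng.normal(size=d_in)\n",
"x_train = rng.normal(size=(n_train, d_in))\n",
"x_test = rng.normal(size=(n_test, d_in))\n",
"y_train = x_train @ w_true + noise * rng.normal(size=n_train)\n",
"y_test = x_test @ w_true\n",
"\n",
"\n",
"def test_mse(w, h):\n",
"    return np.mean((h @ w - y_test) ** 2)\n",
"\n",
"\n",
"for n_feat in (10, 20, 40, 80, 320, 1280):\n",
"    # Fixed random ReLU features, shared between train and test.\n",
"    proj = rng.normal(size=(d_in, n_feat)) / np.sqrt(d_in)\n",
"    h_train = np.maximum(x_train @ proj, 0.0)\n",
"    h_test = np.maximum(x_test @ proj, 0.0)\n",
"\n",
"    # Inductive bias: the minimum-norm solution among all least-squares solutions.\n",
"    w_min = np.linalg.pinv(h_train) @ y_train\n",
"\n",
"    # Explicit regularization: a ridge penalty keeps us away from zero training error.\n",
"    lam = 0.1\n",
"    w_ridge = np.linalg.solve(h_train.T @ h_train + lam * np.eye(n_feat),\n",
"                              h_train.T @ y_train)\n",
"\n",
"    print(f'features={n_feat:5d}  min-norm MSE={test_mse(w_min, h_test):8.3f}  '\n",
"          f'ridge MSE={test_mse(w_ridge, h_test):8.3f}')\n",
"```\n",
"\n",
"With this setup, the minimum-norm test error typically spikes when the number of features is close to the number of training samples and then falls again as the model is heavily overparameterized, while the ridge error stays comparatively flat; the exact numbers depend on the seed and on `lam`."
]
},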
{
@@ -1881,7 +1887,25 @@
"\n",
"Intuitively, in a large network, there are many ways to achieve zero training error. Gradient descent tends to find solutions that are near the initialization. So simple initializations (here, small variance) yield less complex models that can generalize well.\n",
"\n",
"Therefore, proper initialization is critical for good generalization in large networks. "
"Therefore, proper initialization is critical for good generalization in large networks."
]
},
{
"cell_type": "markdown",
"id": "6da546cf-bd9e-4818-8321-156d118795fa",
"metadata": {
"execution": {}
},
"source": [
"There is one more mathematical trick to highlight, used in `fit_relu_init_scale` to find the appropriate weights. This passage is optional and intended for those who would like to dive into the implementation specifics.\n",
"<details open>\n",
"<summary> Solution in the case of non-zero weight initialization</summary>\n",
" \n",
"The tutorial studies what happens when the network is trained with gradient descent for a very long time. Actually training these networks that way would, indeed, take a very long time, so we picked a case where we know where training will end up and can compute that endpoint directly: it is essentially a linear regression from the hidden-layer activity. The mathematical derivation was covered at the very beginning of the tutorial.\n",
"\n",
"Still, one case is a bit trickier: what if the network is overparametrized, i.e., we have fewer data samples than weights? Then there are many solutions that achieve zero training error. Starting from some particular initialization $w^0$, which of these solutions does gradient descent end at? It turns out that the gradient descent dynamics only change the weights within the subspace spanned by the input training data. The component of the weights that starts in the orthogonal complement of that subspace (the data null space, i.e., the subspace in which we have no data at all) does not change at all during gradient descent learning, so whatever initial values the weights take in that subspace persist throughout training. To see this, note that the gradient update is $\\Delta w \\propto \\text{error} \\, H^T$, so every update lies in the direction of a training example. Therefore, to calculate $w^{\\text{gd}}$, the weights to which gradient descent converges, we add the component of the weight initialization that lies in this data null space to the minimum-norm linear regression solution $\\hat{w}$ given by the earlier equation. That is, $w^{\\text{gd}} = \\hat{w} + (I-\\text{pinv}(H)H)\\,w^0$, where $w^0$ is the initial weight vector and $(I-\\text{pinv}(H)H)$ is one way of calculating the projection of that vector onto the data null space.\n",
"\n",
"To recap the intuition: shallow gradient descent learning in overparameterized models has a *frozen subspace* in which the weight components do not change during learning, so the initial values in this subspace persist forever. To obtain the solution that gradient descent would find, we add the projection of the initialization onto this frozen subspace to the minimum-norm linear regression solution. New test examples can then overlap with this frozen subspace, which makes the network's test performance depend on the initial weights. A small numerical check of this closed form follows below.\n",
"</details>"
]
},
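{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To see this closed form in action, here is a brief, self-contained NumPy sketch (it is not the tutorial's `fit_relu_init_scale`; the array sizes, seed, and variable names are invented for illustration). It sets up an overparameterized least-squares problem, computes $w^{\\text{gd}}$ from the formula above, and checks that the result still interpolates the training data while its test predictions change with the initialization scale:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n_samples, n_hidden = 20, 100  # fewer samples than weights: overparameterized\n",
"\n",
"# Stand-ins for the hidden-layer activity H on training and test inputs.\n",
"h_train = rng.normal(size=(n_samples, n_hidden))\n",
"y_train = rng.normal(size=n_samples)\n",
"h_test = rng.normal(size=(3, n_hidden))\n",
"\n",
"h_pinv = np.linalg.pinv(h_train)\n",
"w_min = h_pinv @ y_train                      # minimum-norm interpolating solution\n",
"p_null = np.eye(n_hidden) - h_pinv @ h_train  # projector onto the data null space\n",
"\n",
"for scale in (0.0, 0.1, 1.0):\n",
"    w0 = scale * rng.normal(size=n_hidden)  # initialization of increasing size\n",
"    w_gd = w_min + p_null @ w0              # closed-form gradient-descent endpoint\n",
"\n",
"    max_train_residual = np.max(np.abs(h_train @ w_gd - y_train))\n",
"    print(f'init scale {scale:3.1f}: max train residual {max_train_residual:.1e}, '\n",
"          f'test predictions {np.round(h_test @ w_gd, 2)}')\n",
"```\n",
"\n",
"The training residual stays at numerical zero for every initialization scale, because the frozen-subspace component $(I-\\text{pinv}(H)H)\\,w^0$ is invisible to the training data, while the test predictions change with the scale of $w^0$; this is the dependence on initialization discussed above."
]
},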
{
28 changes: 26 additions & 2 deletions tutorials/W2D1_Macrocircuits/instructor/W2D1_Tutorial2.ipynb
@@ -661,6 +661,7 @@
"$$\n",
"\n",
"Here, $H^+$ is the [Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse).\n",
"\n",
"</details>\n",
"\n",
"\n",
@@ -1572,7 +1573,12 @@
"execution": {}
},
"source": [
"We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data."
"We observe that the \"peak\" disappears and the test error decreases roughly monotonically, although it remains higher for higher noise levels in the training data.\n",
"\n",
"The word *regularization* is used in statistics/ML parlance in two different senses, both concerned with making overparameterized models generalize well:\n",
"\n",
"- The first sense, emphasized throughout the tutorial, is explicit regularization: the model is not trained to completion (i.e., to zero training error), so that it does not fit the noise. Without explicit regularization we observe double descent, that is, catastrophic overfitting when the number of model parameters is close to the number of training examples, followed by a sharp reduction of this overfitting as the model becomes heavily overparameterized. With well-tuned explicit regularization, the double-descent peak disappears because we no longer risk fitting the noise at all.\n",
"- The second sense is inductive bias: overparameterized models trained with common optimization algorithms such as gradient descent tend to converge to a particularly \"simple\" solution that fits the data perfectly, where \"simple\" usually means that the parameters are small in magnitude. This inductive bias is also a major reason why double descent occurs, and in particular why overparameterization reduces overfitting. A minimal sketch of both senses follows below."
]
},
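{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To make the two senses concrete, below is a brief, self-contained NumPy sketch (it is not the tutorial's own code; the sizes, the seed, and the ridge strength `lam` are arbitrary choices for illustration). It fits a random-feature regression either with the minimum-norm pseudoinverse solution or with an explicitly regularized (ridge) solution, and compares their test errors as the number of features grows past the number of training samples:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"d_in, n_train, n_test, noise = 15, 40, 500, 0.5\n",
"\n",
"# Linear teacher; the training targets are noisy, the test targets are clean.\n",
"w_true = rng.normal(size=d_in)\n",
"x_train = rng.normal(size=(n_train, d_in))\n",
"x_test = rng.normal(size=(n_test, d_in))\n",
"y_train = x_train @ w_true + noise * rng.normal(size=n_train)\n",
"y_test = x_test @ w_true\n",
"\n",
"\n",
"def test_mse(w, h):\n",
"    return np.mean((h @ w - y_test) ** 2)\n",
"\n",
"\n",
"for n_feat in (10, 20, 40, 80, 320, 1280):\n",
"    # Fixed random ReLU features, shared between train and test.\n",
"    proj = rng.normal(size=(d_in, n_feat)) / np.sqrt(d_in)\n",
"    h_train = np.maximum(x_train @ proj, 0.0)\n",
"    h_test = np.maximum(x_test @ proj, 0.0)\n",
"\n",
"    # Inductive bias: the minimum-norm solution among all least-squares solutions.\n",
"    w_min = np.linalg.pinv(h_train) @ y_train\n",
"\n",
"    # Explicit regularization: a ridge penalty keeps us away from zero training error.\n",
"    lam = 0.1\n",
"    w_ridge = np.linalg.solve(h_train.T @ h_train + lam * np.eye(n_feat),\n",
"                              h_train.T @ y_train)\n",
"\n",
"    print(f'features={n_feat:5d}  min-norm MSE={test_mse(w_min, h_test):8.3f}  '\n",
"          f'ridge MSE={test_mse(w_ridge, h_test):8.3f}')\n",
"```\n",
"\n",
"With this setup, the minimum-norm test error typically spikes when the number of features is close to the number of training samples and then falls again as the model is heavily overparameterized, while the ridge error stays comparatively flat; the exact numbers depend on the seed and on `lam`."
]
},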
{
@@ -1893,7 +1899,25 @@
"\n",
"Intuitively, in a large network, there are many ways to achieve zero training error. Gradient descent tends to find solutions that are near the initialization. So simple initializations (here, small variance) yield less complex models that can generalize well.\n",
"\n",
"Therefore, proper initialization is critical for good generalization in large networks. "
"Therefore, proper initialization is critical for good generalization in large networks."
]
},
{
"cell_type": "markdown",
"id": "6da546cf-bd9e-4818-8321-156d118795fa",
"metadata": {
"execution": {}
},
"source": [
"There is one more mathematical trick to highlight, used in `fit_relu_init_scale` to find the appropriate weights. This passage is optional and intended for those who would like to dive into the implementation specifics.\n",
"<details open>\n",
"<summary> Solution in the case of non-zero weight initialization</summary>\n",
" \n",
"The tutorial studies what happens when the network is trained with gradient descent for a very long time. Actually training these networks that way would, indeed, take a very long time, so we picked a case where we know where training will end up and can compute that endpoint directly: it is essentially a linear regression from the hidden-layer activity. The mathematical derivation was covered at the very beginning of the tutorial.\n",
"\n",
"Still, one case is a bit trickier: what if the network is overparametrized, i.e., we have fewer data samples than weights? Then there are many solutions that achieve zero training error. Starting from some particular initialization $w^0$, which of these solutions does gradient descent end at? It turns out that the gradient descent dynamics only change the weights within the subspace spanned by the input training data. The component of the weights that starts in the orthogonal complement of that subspace (the data null space, i.e., the subspace in which we have no data at all) does not change at all during gradient descent learning, so whatever initial values the weights take in that subspace persist throughout training. To see this, note that the gradient update is $\\Delta w \\propto \\text{error} \\, H^T$, so every update lies in the direction of a training example. Therefore, to calculate $w^{\\text{gd}}$, the weights to which gradient descent converges, we add the component of the weight initialization that lies in this data null space to the minimum-norm linear regression solution $\\hat{w}$ given by the earlier equation. That is, $w^{\\text{gd}} = \\hat{w} + (I-\\text{pinv}(H)H)\\,w^0$, where $w^0$ is the initial weight vector and $(I-\\text{pinv}(H)H)$ is one way of calculating the projection of that vector onto the data null space.\n",
"\n",
"To recap the intuition: shallow gradient descent learning in overparameterized models has a *frozen subspace* in which the weight components do not change during learning, so the initial values in this subspace persist forever. To obtain the solution that gradient descent would find, we add the projection of the initialization onto this frozen subspace to the minimum-norm linear regression solution. New test examples can then overlap with this frozen subspace, which makes the network's test performance depend on the initial weights. A small numerical check of this closed form follows below.\n",
"</details>"
]
},
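{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To see this closed form in action, here is a brief, self-contained NumPy sketch (it is not the tutorial's `fit_relu_init_scale`; the array sizes, seed, and variable names are invented for illustration). It sets up an overparameterized least-squares problem, computes $w^{\\text{gd}}$ from the formula above, and checks that the result still interpolates the training data while its test predictions change with the initialization scale:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n_samples, n_hidden = 20, 100  # fewer samples than weights: overparameterized\n",
"\n",
"# Stand-ins for the hidden-layer activity H on training and test inputs.\n",
"h_train = rng.normal(size=(n_samples, n_hidden))\n",
"y_train = rng.normal(size=n_samples)\n",
"h_test = rng.normal(size=(3, n_hidden))\n",
"\n",
"h_pinv = np.linalg.pinv(h_train)\n",
"w_min = h_pinv @ y_train                      # minimum-norm interpolating solution\n",
"p_null = np.eye(n_hidden) - h_pinv @ h_train  # projector onto the data null space\n",
"\n",
"for scale in (0.0, 0.1, 1.0):\n",
"    w0 = scale * rng.normal(size=n_hidden)  # initialization of increasing size\n",
"    w_gd = w_min + p_null @ w0              # closed-form gradient-descent endpoint\n",
"\n",
"    max_train_residual = np.max(np.abs(h_train @ w_gd - y_train))\n",
"    print(f'init scale {scale:3.1f}: max train residual {max_train_residual:.1e}, '\n",
"          f'test predictions {np.round(h_test @ w_gd, 2)}')\n",
"```\n",
"\n",
"The training residual stays at numerical zero for every initialization scale, because the frozen-subspace component $(I-\\text{pinv}(H)H)\\,w^0$ is invisible to the training data, while the test predictions change with the scale of $w^0$; this is the dependence on initialization discussed above."
]
},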
{
28 changes: 26 additions & 2 deletions tutorials/W2D1_Macrocircuits/student/W2D1_Tutorial2.ipynb
@@ -661,6 +661,7 @@
"$$\n",
"\n",
"Here, $H^+$ is the [Moore-Penrose pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse).\n",
"\n",
"</details>\n",
"\n",
"\n",
@@ -1426,7 +1427,12 @@
"execution": {}
},
"source": [
"We observe that the \"peak\" disappears, and the test error roughly monotonically decreases, although it is generally higher for higher noise levels in the training data."
"We observe that the \"peak\" disappears and the test error decreases roughly monotonically, although it remains higher for higher noise levels in the training data.\n",
"\n",
"The word *regularization* is used in statistics/ML parlance in two different senses, both concerned with making overparameterized models generalize well:\n",
"\n",
"- The first sense, emphasized throughout the tutorial, is explicit regularization: the model is not trained to completion (i.e., to zero training error), so that it does not fit the noise. Without explicit regularization we observe double descent, that is, catastrophic overfitting when the number of model parameters is close to the number of training examples, followed by a sharp reduction of this overfitting as the model becomes heavily overparameterized. With well-tuned explicit regularization, the double-descent peak disappears because we no longer risk fitting the noise at all.\n",
"- The second sense is inductive bias: overparameterized models trained with common optimization algorithms such as gradient descent tend to converge to a particularly \"simple\" solution that fits the data perfectly, where \"simple\" usually means that the parameters are small in magnitude. This inductive bias is also a major reason why double descent occurs, and in particular why overparameterization reduces overfitting. A minimal sketch of both senses follows below."
]
},
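{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To make the two senses concrete, below is a brief, self-contained NumPy sketch (it is not the tutorial's own code; the sizes, the seed, and the ridge strength `lam` are arbitrary choices for illustration). It fits a random-feature regression either with the minimum-norm pseudoinverse solution or with an explicitly regularized (ridge) solution, and compares their test errors as the number of features grows past the number of training samples:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(0)\n",
"d_in, n_train, n_test, noise = 15, 40, 500, 0.5\n",
"\n",
"# Linear teacher; the training targets are noisy, the test targets are clean.\n",
"w_true = rng.normal(size=d_in)\n",
"x_train = rng.normal(size=(n_train, d_in))\n",
"x_test = rng.normal(size=(n_test, d_in))\n",
"y_train = x_train @ w_true + noise * rng.normal(size=n_train)\n",
"y_test = x_test @ w_true\n",
"\n",
"\n",
"def test_mse(w, h):\n",
"    return np.mean((h @ w - y_test) ** 2)\n",
"\n",
"\n",
"for n_feat in (10, 20, 40, 80, 320, 1280):\n",
"    # Fixed random ReLU features, shared between train and test.\n",
"    proj = rng.normal(size=(d_in, n_feat)) / np.sqrt(d_in)\n",
"    h_train = np.maximum(x_train @ proj, 0.0)\n",
"    h_test = np.maximum(x_test @ proj, 0.0)\n",
"\n",
"    # Inductive bias: the minimum-norm solution among all least-squares solutions.\n",
"    w_min = np.linalg.pinv(h_train) @ y_train\n",
"\n",
"    # Explicit regularization: a ridge penalty keeps us away from zero training error.\n",
"    lam = 0.1\n",
"    w_ridge = np.linalg.solve(h_train.T @ h_train + lam * np.eye(n_feat),\n",
"                              h_train.T @ y_train)\n",
"\n",
"    print(f'features={n_feat:5d}  min-norm MSE={test_mse(w_min, h_test):8.3f}  '\n",
"          f'ridge MSE={test_mse(w_ridge, h_test):8.3f}')\n",
"```\n",
"\n",
"With this setup, the minimum-norm test error typically spikes when the number of features is close to the number of training samples and then falls again as the model is heavily overparameterized, while the ridge error stays comparatively flat; the exact numbers depend on the seed and on `lam`."
]
},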
{
@@ -1718,7 +1724,25 @@
"\n",
"Intuitively, in a large network, there are many ways to achieve zero training error. Gradient descent tends to find solutions that are near the initialization. So simple initializations (here, small variance) yield less complex models that can generalize well.\n",
"\n",
"Therefore, proper initialization is critical for good generalization in large networks. "
"Therefore, proper initialization is critical for good generalization in large networks."
]
},
{
"cell_type": "markdown",
"id": "6da546cf-bd9e-4818-8321-156d118795fa",
"metadata": {
"execution": {}
},
"source": [
"There is one more mathematical trick to highlight, used in `fit_relu_init_scale` to find the appropriate weights. This passage is optional and intended for those who would like to dive into the implementation specifics.\n",
"<details open>\n",
"<summary> Solution in the case of non-zero weight initialization</summary>\n",
" \n",
"The tutorial studies what happens when the network is trained with gradient descent for a very long time. Actually training these networks that way would, indeed, take a very long time, so we picked a case where we know where training will end up and can compute that endpoint directly: it is essentially a linear regression from the hidden-layer activity. The mathematical derivation was covered at the very beginning of the tutorial.\n",
"\n",
"Still, one case is a bit trickier: what if the network is overparametrized, i.e., we have fewer data samples than weights? Then there are many solutions that achieve zero training error. Starting from some particular initialization $w^0$, which of these solutions does gradient descent end at? It turns out that the gradient descent dynamics only change the weights within the subspace spanned by the input training data. The component of the weights that starts in the orthogonal complement of that subspace (the data null space, i.e., the subspace in which we have no data at all) does not change at all during gradient descent learning, so whatever initial values the weights take in that subspace persist throughout training. To see this, note that the gradient update is $\\Delta w \\propto \\text{error} \\, H^T$, so every update lies in the direction of a training example. Therefore, to calculate $w^{\\text{gd}}$, the weights to which gradient descent converges, we add the component of the weight initialization that lies in this data null space to the minimum-norm linear regression solution $\\hat{w}$ given by the earlier equation. That is, $w^{\\text{gd}} = \\hat{w} + (I-\\text{pinv}(H)H)\\,w^0$, where $w^0$ is the initial weight vector and $(I-\\text{pinv}(H)H)$ is one way of calculating the projection of that vector onto the data null space.\n",
"\n",
"To recap the intuition: shallow gradient descent learning in overparameterized models has a *frozen subspace* in which the weight components do not change during learning, so the initial values in this subspace persist forever. To obtain the solution that gradient descent would find, we add the projection of the initialization onto this frozen subspace to the minimum-norm linear regression solution. New test examples can then overlap with this frozen subspace, which makes the network's test performance depend on the initial weights. A small numerical check of this closed form follows below.\n",
"</details>"
]
},
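{
"cell_type": "markdown",
"metadata": {
"execution": {}
},
"source": [
"To see this closed form in action, here is a brief, self-contained NumPy sketch (it is not the tutorial's `fit_relu_init_scale`; the array sizes, seed, and variable names are invented for illustration). It sets up an overparameterized least-squares problem, computes $w^{\\text{gd}}$ from the formula above, and checks that the result still interpolates the training data while its test predictions change with the initialization scale:\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"rng = np.random.default_rng(1)\n",
"n_samples, n_hidden = 20, 100  # fewer samples than weights: overparameterized\n",
"\n",
"# Stand-ins for the hidden-layer activity H on training and test inputs.\n",
"h_train = rng.normal(size=(n_samples, n_hidden))\n",
"y_train = rng.normal(size=n_samples)\n",
"h_test = rng.normal(size=(3, n_hidden))\n",
"\n",
"h_pinv = np.linalg.pinv(h_train)\n",
"w_min = h_pinv @ y_train                      # minimum-norm interpolating solution\n",
"p_null = np.eye(n_hidden) - h_pinv @ h_train  # projector onto the data null space\n",
"\n",
"for scale in (0.0, 0.1, 1.0):\n",
"    w0 = scale * rng.normal(size=n_hidden)  # initialization of increasing size\n",
"    w_gd = w_min + p_null @ w0              # closed-form gradient-descent endpoint\n",
"\n",
"    max_train_residual = np.max(np.abs(h_train @ w_gd - y_train))\n",
"    print(f'init scale {scale:3.1f}: max train residual {max_train_residual:.1e}, '\n",
"          f'test predictions {np.round(h_test @ w_gd, 2)}')\n",
"```\n",
"\n",
"The training residual stays at numerical zero for every initialization scale, because the frozen-subspace component $(I-\\text{pinv}(H)H)\\,w^0$ is invisible to the training data, while the test predictions change with the scale of $w^0$; this is the dependence on initialization discussed above."
]
},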
{
