graphcore-research · thecharlieblake · Nov 28, 2024 · Nov 28, 2024
diff --git a/analysis/emb_lr_analysis.ipynb b/analysis/emb_lr_analysis.ipynb
@@ -0,0 +1,364 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Analysis of the effect of the embedding LR update on the subsequent matmul\n",
+    "\n",
+    "I wanted to write this out in a notebook to make sure I understood the way in which the embedding update effects the subsequent matmul.\n",
+    "\n",
+    "No revelations unfortunately - it still seems as though our rule can't be justified this way (it is \"unnatural\"!). Under the \"no-alignment\" assumption the standard embedding LR breaks, but unfortunately our fix does nothing to help. Oh well."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import torch\n",
+    "from torch import randn\n",
+    "from typing import Iterable"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def rms(*xs: Iterable[torch.Tensor]) -> Iterable[torch.Tensor]:\n",
+    "    if len(xs) == 1:\n",
+    "        return xs[0].pow(2).mean().sqrt()\n",
+    "    return tuple(rms(x) for x in xs)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setup\n",
+    "\n",
+    "Toggle `full_alignment` and `umup_lr_rule` to see the effect. mup scaling is used by default."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d = 2**11\n",
+    "full_alignment = True\n",
+    "umup_lr_rule = False\n",
+    "\n",
+    "w_lr = d ** -(1 if full_alignment else 0.5)\n",
+    "e_lr = d ** -(0.5 if umup_lr_rule else 0)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Model & update\n",
+    "\n",
+    "Everything can be described in terms of these three tensors (a single embedding vector, weight matrix and a gradient vector). Note that I assume the gradient is unit-scale, and then just use the adam LR rules but under and SGD-like update (I appreciate this is a bit odd, but it's simple and the maths should work out)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(tensor(0.9984), tensor(0.0221), tensor(0.9882))"
+      ]
+     },
+     "execution_count": 4,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "e1 = randn(d, 1)\n",
+    "W1 = randn(d + 1, d) * d**-0.5\n",
+    "g = randn(d + 1, 1)\n",
+    "rms(\n",
+    "    e1, W1, g\n",
+    ")  # all \"well-scaled\", except the weight which is 1/sqrt(d) (this isn't unit scaling!)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Then we just run:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor(0.9953)"
+      ]
+     },
+     "execution_count": 5,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "x1 = W1 @ e1\n",
+    "rms(x1)  # well-scaled"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "((tensor(0.9977), tensor(0.0005)), 0.00048828125)"
+      ]
+     },
+     "execution_count": 6,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "u_e = W1.T @ g * e_lr\n",
+    "u_W = g @ e1.T * w_lr\n",
+    "(\n",
+    "    rms(u_e, u_W),\n",
+    "    1 / d,\n",
+    ")  # the weight update is under-scaled (to be expected I think), though as a rank-1 matrix it has a much higher (O(1)) spectral norm! This means its effect doesn't \"go to zero\" in inf. width, though the rms does."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "(tensor(0.9998), tensor(0.0221))"
+      ]
+     },
+     "execution_count": 7,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "e2 = e1 + u_e\n",
+    "e2_std = e2.std()\n",
+    "e2 /= e2_std  # Why is `/ e2.std()` allowed/justified? Normally we'd have a much smaller weight update (scaled down by small LR constant), and then the original weight would be decayed a bit, keeping this at about rms=1. This re-scaling does something similar, though allows us to see the effect of the weight update scaling more clearly.\n",
+    "W2 = W1 + u_W\n",
+    "rms(\n",
+    "    e2, W2\n",
+    ")  # Update is well-scaled. Weight has barely changed from its 1/sqrt(d) starting point"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "tensor(1.7412)"
+      ]
+     },
+     "execution_count": 8,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "x2 = W2 @ e2\n",
+    "rms(x2)  # ~well-scaled. Certainly doesn't scale with a significant power of d"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Analysis\n",
+    "\n",
+    "Now we break this down into its constituent terms.\n",
+    "\n",
+    "First checking that they combine to the original"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 9,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "torch.allclose(x2, (W1 + u_W) @ (e1 + u_e * e_lr) / e2_std, atol=1e-6)\n",
+    "torch.allclose(x2, (W1 + g @ e1.T * w_lr) @ (e1 + W1.T @ g * e_lr) / e2_std, atol=1e-6)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "True"
+      ]
+     },
+     "execution_count": 10,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "# t1 = W1 @ e1 (== x1)\n",
+    "t2 = W1 @ W1.T @ g * e_lr\n",
+    "t3 = g @ e1.T * w_lr @ e1\n",
+    "t4 = g @ e1.T * w_lr @ W1.T @ g * e_lr\n",
+    "torch.allclose(x2, (x1 + t2 + t3 + t4) / e2_std, atol=1e-5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Weight @ emb_update (t2)\n",
+    "\n",
+    "This is well-scaled under the original emb lr rule, but not under our lr rule - which isn't a great sign for our approach"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rms(W1, g), e_lr=((tensor(0.0221), tensor(0.9882)), 1)\n",
+      "rms(W1 @ W1.T)=tensor(0.0312)\n",
+      "rms(W1.T @ g)=tensor(0.9977)\n",
+      "rms(W1 @ W1.T @ g * e_lr / e2_std)=tensor(0.9857)\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"{rms(W1, g), e_lr=}\")\n",
+    "print(f\"{rms(W1 @ W1.T)=}\")\n",
+    "print(f\"{rms(W1.T @ g)=}\")\n",
+    "print(f\"{rms(W1 @ W1.T @ g * e_lr / e2_std)=}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Weight_update @ emb (t3)\n",
+    "\n",
+    "This is well-scaled under the original emb lr rule and our rule"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rms(g, e1)=(tensor(0.9882), tensor(0.9984))\n",
+      "rms(g @ e1.T)=tensor(0.9866)\n",
+      "rms(e1.T @ e1 * w_lr)=tensor(0.9968)\n",
+      "rms(g @ e1.T * w_lr @ e1)=tensor(0.9850)\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"{rms(g, e1)=}\")\n",
+    "print(f\"{rms(g @ e1.T)=}\")\n",
+    "print(f\"{rms(e1.T @ e1 * w_lr)=}\")\n",
+    "print(f\"{rms(g @ e1.T * w_lr @ e1)=}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Weight_update @ emb_update (t4)\n",
+    "\n",
+    "This vanishes with width under the original emb lr and our rule. Probably a good thing?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "rms(g @ e1.T @ W1.T @ g)=tensor(46.5558)\n",
+      "rms(g @ e1.T * w_lr @ W1.T @ g * e_lr)=tensor(0.0227)\n"
+     ]
+    }
+   ],
+   "source": [
+    "print(f\"{rms(g @ e1.T @ W1.T @ g)=}\")\n",
+    "print(f\"{rms(g @ e1.T * w_lr @ W1.T @ g * e_lr)=}\")"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}