|
13 | 13 | "as quantization (Q). Conversely, de-quantization (DQ) rescales the FP8 data back\n",
|
14 | 14 | "to its original type.\n",
|
15 | 15 | "\n",
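For intuition, here is a minimal, self-contained sketch of what per-tensor Q and DQ look like, assuming a simple scale tied to the tensor's maximum absolute value (an illustration of the idea, not the exact `fp8_ops` implementation; the tutorial defines `e4m3` and `E4M3_MAX` the same way shortly):

```python
import jax.numpy as jnp

e4m3 = jnp.float8_e4m3fn
E4M3_MAX = jnp.finfo(e4m3).max.astype(jnp.float32)

x = jnp.linspace(-3.0, 3.0, 8, dtype=jnp.float32)

# Quantize (Q): choose a per-tensor scale so the largest magnitude maps to the
# FP8 maximum, then cast the rescaled data down to FP8.
scale = jnp.max(jnp.abs(x)) / E4M3_MAX
x_fp8 = (x / scale).astype(e4m3)

# De-quantize (DQ): cast back up and undo the rescaling.
x_dq = x_fp8.astype(jnp.float32) * scale

print(x)
print(x_dq)  # close to x, up to FP8 rounding error
```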
|
16 |
| - "Although jnp.dot supports FP8 inputs, certain limitations make it impractical\n", |
17 |
| - "for real-world applications. Alternatively, XLA, our compiler, can recognize\n", |
18 |
| - "patterns like <FP8>->DQ->Dot and subsequently invoke FP8 backends (e.g.,\n", |
19 |
| - "cublasLt for GPUs). FLAX encapsulates such patterns into the\n", |
20 |
| - "nn.fp8_ops.Fp8DotGeneralOp module, allowing users to easily configure it for\n", |
21 |
| - "existing layers (e.g., nn.Dense).\n", |
| 16 | + "While jnp.dot supports FP8 inputs directly, proper quantization and\n", |
| 17 | + "dequantization is needed for optimal performance. Flax provides\n", |
| 18 | + "nn.fp8_ops.Fp8DotGeneral and nn.fp8_ops.Fp8Einsum modules that handle\n", |
| 19 | + "this automatically and can be used with existing layers like nn.Dense.\n", |
22 | 20 | "\n",
|
23 | 21 | "This tutorial will walk you through the basics of how to use it.\n",
|
24 | 22 | "\n",
|
|
50 | 48 | "from flax.linen import fp8_ops\n",
|
51 | 49 | "\n",
|
52 | 50 | "e4m3 = jnp.float8_e4m3fn\n",
|
53 |
| - "e5m2 = jnp.float8_e5m2\n", |
54 | 51 | "f32 = jnp.float32\n",
|
55 | 52 | "E4M3_MAX = jnp.finfo(e4m3).max.astype(f32)\n",
|
56 | 53 | "\n",
|
|
82 | 79 | "metadata": {},
|
83 | 80 | "outputs": [],
|
84 | 81 | "source": [
|
85 |
| - "key = random.key(0)\n", |
86 |
| - "A = random.uniform(key, (16, 32))\n", |
87 |
| - "B = random.uniform(key, (32, 64))\n", |
| 82 | + "k0, k1 = random.split(random.key(0), 2)\n", |
| 83 | + "a = random.uniform(k0, (16, 32))\n", |
| 84 | + "b = random.uniform(k1, (32, 64))\n", |
88 | 85 | "@jax.jit\n",
|
89 |
| - "def dot_fp8(A, B):\n", |
90 |
| - " return jnp.dot(A.astype(e4m3), B.astype(e4m3), preferred_element_type=f32)\n", |
91 |
| - "check_fp8_call(dot_fp8.lower(A, B))" |
| 86 | + "def dot_fp8(a, b):\n", |
| 87 | + " return jnp.dot(a.astype(e4m3), b.astype(e4m3), preferred_element_type=f32)\n", |
| 88 | + "check_fp8_call(dot_fp8.lower(a, b))" |
92 | 89 | ]
|
93 | 90 | },
|
94 | 91 | {
|
95 | 92 | "cell_type": "markdown",
|
96 | 93 | "id": "adb22878",
|
97 | 94 | "metadata": {},
|
98 | 95 | "source": [
|
99 |
| - "However, there are two main issues with this approach. Firstly, `jnp.dot` does\n", |
100 |
| - "not accept scaling factors for the operands, defaulting to a scaling factor of\n", |
101 |
| - "1.0. Secondly, it does not support operands of mixed FP8 data types. For\n", |
102 |
| - "example, when the operands are E5M2 and E4M3, the dot product is performed using\n", |
103 |
| - "the promoted FP16 data type.\n", |
| 96 | + "However, this approach has two key limitations:\n", |
104 | 97 | "\n",
|
105 |
| - "In real-world scenarios, it is essential to specify scaling factors, either from\n", |
106 |
| - "calibration for inference or a user-defined algorithm during training.\n", |
107 |
| - "Additionally, it is common practice to use E5M2 for gradients and E4M3 for\n", |
108 |
| - "activations and kernels. These limitations make this method less practical for\n", |
109 |
| - "real-world applications.\n", |
| 98 | + "1. `jnp.dot` does not support custom scaling factors for operands, defaulting to\n", |
| 99 | + " a scale of 1.0\n", |
| 100 | + "2. The autodiff does not automatically use E5M2 for gradients and E4M3 for\n", |
| 101 | + " activations/weights during training, which is the recommended practice\n", |
110 | 102 | "\n",
|
111 |
| - "To address these limitations and create a more versatile FP8 dot product, we\n", |
112 |
| - "recommend leveraging XLA-FP8. Let's begin with a simple scaling strategy.\n", |
| 103 | + "To overcome these limitations and implement proper FP8 matrix multiplication, we\n", |
| 104 | + "recommend using the Flax FP8 APIs. Let's start with a basic scaling approach.\n", |
113 | 105 | "\n",
|
114 | 106 | "\n",
|
115 | 107 | "### Current Scaling\n",
|
|
129 | 121 | "outputs": [],
|
130 | 122 | "source": [
|
131 | 123 | "@jax.jit\n",
|
132 |
| - "def dot_fp8(A, B):\n", |
133 |
| - " A_scale = jnp.max(jnp.abs(A)) / E4M3_MAX\n", |
134 |
| - " B_scale = jnp.max(jnp.abs(B)) / E4M3_MAX\n", |
135 |
| - " A = fp8_ops.quantize_dequantize(A, e4m3, A_scale, f32)\n", |
136 |
| - " B = fp8_ops.quantize_dequantize(B, e4m3, B_scale, f32)\n", |
137 |
| - "\n", |
138 |
| - " C = jnp.dot(A, B)\n", |
139 |
| - " return C\n", |
| 124 | + "def dot_fp8(a, b):\n", |
| 125 | + "  a_scale = jnp.max(jnp.abs(a)) / E4M3_MAX\n", |
| 126 | + "  b_scale = jnp.max(jnp.abs(b)) / E4M3_MAX\n", |
| 127 | + " a = fp8_ops.quantize(a, e4m3, a_scale, f32)\n", |
| 128 | + " b = fp8_ops.quantize(b, e4m3, b_scale, f32)\n", |
| 129 | + "\n", |
| 130 | + " c = jnp.dot(a, b, preferred_element_type=f32)\n", |
| 131 | + " c = fp8_ops.dequantize(c, f32, a_scale * b_scale)\n", |
| 132 | + " return c\n", |
140 | 133 | "\n",
|
141 |
| - "C = dot_fp8(A, B)\n", |
142 |
| - "check_fp8_call(dot_fp8.lower(A, B))" |
| 134 | + "c = dot_fp8(a, b)\n", |
| 135 | + "check_fp8_call(dot_fp8.lower(a, b))" |
143 | 136 | ]
|
144 | 137 | },
|
145 | 138 | {
|
146 | 139 | "cell_type": "markdown",
|
147 | 140 | "id": "59aca6fe",
|
148 | 141 | "metadata": {},
|
149 | 142 | "source": [
|
150 |
| - "As shown in the code, we perform fake quantization\n", |
151 |
| - "(`fp8_ops.quantize_dequantize`) on the operands of the dot product. Although the\n", |
152 |
| - "`jnp.dot` still processes higher-precision inputs, XLA detects this pattern and\n", |
153 |
| - "rewrites the dot operation as an FP8 dot call (e.g., cublasLt call for GPUs).\n", |
154 |
| - "This approach effectively mimics the first example but offers greater\n", |
155 |
| - "flexibility. We can control the input dtypes (both are set to E4M3 here, but we\n", |
156 |
| - "could use mixed E4M3 and E5M2) and define scaling factors, which XLA can detect\n", |
157 |
| - "and use in the dot backend.\n", |
158 |
| - "\n", |
159 |
| - "One major issue with the current scaling method is the overhead introduced by\n", |
160 |
| - "computing `A_scale` and `B_scale`, which requires additional loading of the\n", |
161 |
| - "operand tensors. To overcome this issue, we recommend the delayed scaling.\n", |
| 143 | + "As shown in the code, we perform quantization (`fp8_ops.quantize`) on the\n", |
| 144 | + "tensors to get the lower precision operands. The `jnp.dot` processes them and\n", |
| 145 | + "accumulates the output in high precision (i.e., the `preferred_element_type`).\n", |
| 146 | + "After that, we multiply the result by the scaling factors to dequantize back to\n", |
| 147 | + "the original range (`fp8_ops.dequantize`). Note that while this example uses\n", |
| 148 | + "E4M3 for both inputs, it is possible to use different FP8 dtypes like E4M3 and\n", |
| 149 | + "E5M2 for the inputs. The quantization method and the scaling factors can also be\n", |
| 150 | + "customized based on application needs.\n", |
| 151 | + "\n", |
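As a quick, illustrative sanity check (assuming `a`, `b`, and `c` from the cell above; the error magnitude depends on the data and is not a guarantee), you can compare the FP8 result against a plain float32 dot product:

```python
# Reference result in float32, without any FP8 quantization.
c_ref = jnp.dot(a, b)

# The gap comes from FP8 rounding of the operands.
print("max abs error:", jnp.max(jnp.abs(c - c_ref)))
print("max rel error:", jnp.max(jnp.abs(c - c_ref) / jnp.abs(c_ref)))
```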
| 152 | + "One major issue with the current scaling method is the performance overhead\n", |
| 153 | + "introduced by computing `a_scale` and `b_scale`, which requires additional\n", |
| 154 | + "loading of the operand tensors. To overcome this issue, we recommend the delayed\n", |
| 155 | + "scaling.\n", |
162 | 156 | "\n",
|
163 | 157 | "### Delayed Scaling\n",
|
164 | 158 | "\n",
|
|
167 | 161 | "values from recent steps (e.g., 1024 steps). Both tensors are computed from\n",
|
168 | 162 | "previous steps and maintained in the model parameters.\n",
|
169 | 163 | "\n",
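Conceptually, each step derives its scale from the amax history recorded in earlier steps and then appends the current step's amax for future use, roughly as in the sketch below (`delayed_scale_sketch` is a hypothetical helper for illustration only; the exact update rule inside `fp8_ops` may differ, e.g., in margins or how the history is aggregated):

```python
def delayed_scale_sketch(x, amax_history, fp8_max=E4M3_MAX):
  # Derive this step's scale from amax values observed in previous steps,
  # not from the current tensor.
  amax = jnp.max(amax_history)
  scale = jnp.where(amax > 0, amax / fp8_max, 1.0)

  # Record the current tensor's amax into the rolling history for later steps.
  new_history = jnp.roll(amax_history, 1).at[0].set(jnp.max(jnp.abs(x)))
  return scale, new_history
```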
|
170 |
| - "Fake quantization for delayed scaling is provided by `fp8_ops.in_qdq` for the\n", |
171 |
| - "activations and weights, and `fp8_ops.out_qdq` for the gradients." |
| 164 | + "The quantization and dequantization operations for delayed scaling are provided\n", |
| 165 | + "by `fp8_ops.in_q` and `fp8_ops.out_dq` respectively. `fp8_ops.in_q` handles\n", |
| 166 | + "input quantization and update the amax history and scaling factor, while\n", |
| 167 | + "`fp8_ops.out_dq` performs output dequantization." |
172 | 168 | ]
|
173 | 169 | },
|
174 | 170 | {
|
|
180 | 176 | "source": [
|
181 | 177 | "a_scale = jnp.array(1.0)\n",
|
182 | 178 | "b_scale = jnp.array(1.0)\n",
|
183 |
| - "g_scale = jnp.array(1.0)\n", |
184 | 179 | "a_amax_hist = jnp.zeros((1024,))\n",
|
185 | 180 | "b_amax_hist = jnp.zeros((1024,))\n",
|
186 |
| - "g_amax_hist = jnp.zeros((1024,))\n", |
187 | 181 | "\n",
|
188 | 182 | "@jax.jit\n",
|
189 |
| - "def dot_fp8(a, a_scale, a_amax_hist, b, b_scale, b_amax_hist,\n", |
190 |
| - " g_scale, g_amax_hist):\n", |
191 |
| - " a = fp8_ops.in_qdq(f32, e4m3, a, a_scale, a_amax_hist)\n", |
192 |
| - " b = fp8_ops.in_qdq(f32, e4m3, b, b_scale, b_amax_hist)\n", |
| 183 | + "def dot_fp8(a, a_scale, a_amax_hist, b, b_scale, b_amax_hist):\n", |
| 184 | + " a, a_scale = fp8_ops.in_q(f32, e4m3, a, a_scale, a_amax_hist)\n", |
| 185 | + " b, b_scale = fp8_ops.in_q(f32, e4m3, b, b_scale, b_amax_hist)\n", |
193 | 186 | " \n",
|
194 |
| - " c = jnp.dot(a, b)\n", |
195 |
| - " c = fp8_ops.out_qdq(f32, e5m2, c, g_scale, g_amax_hist)\n", |
| 187 | + " c = jnp.dot(a, b, preferred_element_type=f32)\n", |
| 188 | + " c = fp8_ops.out_dq(f32, a_scale, b_scale, c)\n", |
196 | 189 | " return c\n",
|
197 | 190 | "\n",
|
198 |
| - "C = dot_fp8(A, a_scale, a_amax_hist, B, b_scale, b_amax_hist,\n", |
199 |
| - " g_scale, g_amax_hist)\n", |
200 |
| - "check_fp8_call(dot_fp8.lower(A, a_scale, a_amax_hist, B, b_scale, b_amax_hist,\n", |
201 |
| - " g_scale, g_amax_hist))" |
| 191 | + "c = dot_fp8(a, a_scale, a_amax_hist, b, b_scale, b_amax_hist)\n", |
| 192 | + "check_fp8_call(dot_fp8.lower(a, a_scale, a_amax_hist, b, b_scale, b_amax_hist))" |
202 | 193 | ]
|
203 | 194 | },
|
204 | 195 | {
|
|
208 | 199 | "source": [
|
209 | 200 | "In this example, we first prepare three pairs of scaling factors and amax\n",
|
210 | 201 | "histories, treating them as results computed from previous steps. Then, we apply\n",
|
211 |
| - "`fp8_ops.in_qdq` to the input operands of `jnp.dot`, followed by\n", |
212 |
| - "`fp8_ops.out_qdq` to the output of `jnp.dot`. Note the `fp8_ops.out_qdq` will\n", |
213 |
| - "apply fake quantization to the gradient of the output via custom_vjp functions.\n", |
214 |
| - "The new scaling factors and amax histories will be returned through their\n", |
215 |
| - "gradients, which will be covered in the next section.\n", |
| 202 | + "`fp8_ops.in_q` to the input operands of `jnp.dot`, followed by `fp8_ops.out_dq`\n", |
| 203 | + "to the output of `jnp.dot`.\n", |
216 | 204 | "\n",
|
217 | 205 | "\n",
|
218 | 206 | "## FLAX High Level API\n",
|
219 |
| - "With the FLAX library, incorporating FP8 operations into existing FLAX layers\n", |
220 |
| - "is a seamless process. Users don't need to manipulate the low-level APIs for\n", |
221 |
| - "quantization. Instead, they can integrate the provided custom FP8 dot\n", |
222 |
| - "(`fp8_ops.Fp8DotGeneralOp`) into FLAX layers using a straightforward\n", |
223 |
| - "\"code-injection\" approach. This custom operation encapsulates all FP8-related\n", |
224 |
| - "tasks, including the placement of quantization-dequantization ops, algorithms\n", |
225 |
| - "for updating scaling factors, and the selection of FP8 dtype combinations for\n", |
226 |
| - "forward and backward propagation.\n", |
| 207 | + "Flax provides high-level operations to seamlessly integrate FP8 quantization\n", |
| 208 | + "into existing layers. Instead of manually handling quantization of the delayed\n", |
| 209 | + "scaling (e.g., the maintanence of the amax history and scaling factors), users\n", |
| 210 | + "can simply use these drop-in replacements:\n", |
| 211 | + "\n", |
| 212 | + "* `fp8_ops.Fp8DotGeneral` for `lax.dot_general` operations\n", |
| 213 | + "* `fp8_ops.Fp8Einsum` for `jnp.einsum` operations \n", |
| 214 | + "\n", |
| 215 | + "These operations automatically handle all FP8-related functionality, including\n", |
| 216 | + "quantization/dequantization, scale factor updates, and FP8 dtype selection for\n", |
| 217 | + "both forward and backward passes.\n", |
227 | 218 | "\n",
|
228 | 219 | "Consider the following example:"
|
229 | 220 | ]
|
|
235 | 226 | "metadata": {},
|
236 | 227 | "outputs": [],
|
237 | 228 | "source": [
|
238 |
| - "model = nn.Dense(features=64, dot_general_cls=fp8_ops.Fp8DotGeneralOp)\n", |
239 |
| - "params = model.init(key, A)\n", |
| 229 | + "model = nn.Dense(features=64, dot_general_cls=fp8_ops.Fp8DotGeneral)\n", |
| 230 | + "params = model.init(k0, A)\n", |
240 | 231 | "\n",
|
241 | 232 | "@jax.jit\n",
|
242 | 233 | "def train_step(var, a): \n",
|
|
248 | 239 | },
|
249 | 240 | {
|
250 | 241 | "cell_type": "markdown",
|
251 |
| - "id": "a83b0851", |
| 242 | + "id": "ba280e79", |
252 | 243 | "metadata": {},
|
253 | 244 | "source": [
|
254 |
| - "In this example, we simply set `dot_general_cls=fp8_ops.Fp8DotGeneralOp` to\n", |
255 |
| - "enable the Dense layer to utilize the FP8 dot operation. The usage of the model\n", |
256 |
| - "remains almost the same as before. The main difference is the addition of a new\n", |
257 |
| - "category of parameters: the sets of scaling factors and amax history. In the\n", |
258 |
| - "next section, we will explore how to update these parameters.\n", |
| 245 | + "By setting `dot_general_cls=fp8_ops.Fp8DotGeneral`, we replace the\n", |
| 246 | + "default `lax.dot_general` operation in `nn.Dense` with an FP8-enabled version.\n", |
| 247 | + "The model usage remains similar, but now includes additional parameters for FP8\n", |
| 248 | + "quantization: scaling factors and amax history values. The next section explains\n", |
| 249 | + "how to update these FP8-specific parameters.\n", |
| 250 | + "\n", |
| 251 | + "For models that use `jnp.einsum` operations, such as Mixture of Experts (MoE)\n", |
| 252 | + "layers, users can replace them with `fp8_ops.Fp8Einsum` to enable FP8\n", |
| 253 | + "quantization. Here's an example:" |
| 254 | + ] |
| 255 | + }, |
| 256 | + { |
| 257 | + "cell_type": "code", |
| 258 | + "execution_count": null, |
| 259 | + "id": "961b4549", |
| 260 | + "metadata": {}, |
| 261 | + "outputs": [], |
| 262 | + "source": [ |
| 263 | + "from typing import Any\n", |
| 264 | + "class FooModule(nn.Module):\n", |
| 265 | + " einsum: Any = None\n", |
| 266 | + " @nn.compact\n", |
| 267 | + " def __call__(self, a, b):\n", |
| 268 | + " if self.einsum is not None:\n", |
| 269 | + " einsum_fn = self.einsum()\n", |
| 270 | + " elif self.einsum is None:\n", |
| 271 | + " einsum_fn = jnp.einsum\n", |
| 272 | + " c = einsum_fn(\"mk,kn->mn\", a, b)\n", |
| 273 | + " return c\n", |
| 274 | + "\n", |
| 275 | + "model = FooModule(einsum=fp8_ops.Fp8Einsum)\n", |
| 276 | + "params = model.init(k0, a, b)\n", |
259 | 277 | "\n",
|
| 278 | + "@jax.jit\n", |
| 279 | + "def train_step(var, a, b):\n", |
| 280 | + " c = model.apply(var, a, b)\n", |
| 281 | + " return jnp.sum(c)\n", |
| 282 | + "\n", |
| 283 | + "check_fp8_call(train_step.lower(params, a, b))" |
| 284 | + ] |
| 285 | + }, |
| 286 | + { |
| 287 | + "cell_type": "markdown", |
| 288 | + "id": "a83b0851", |
| 289 | + "metadata": {}, |
| 290 | + "source": [ |
260 | 291 | "## Manipulate FP8 params\n",
|
| 292 | + "\n", |
| 293 | + "The following sections explain the internal FP8 parameters managed by\n", |
| 294 | + "`fp8_ops.Fp8DotGeneral` and `fp8_ops.Fp8Einsum`. These parameters\n", |
| 295 | + "include scaling factors and amax history values that control the FP8\n", |
| 296 | + "quantization process. While most users don't need to interact with these\n", |
| 297 | + "directly, understanding them can be valuable for advanced optimization and\n", |
| 298 | + "debugging.\n", |
| 299 | + "\n", |
261 | 300 | "Let's first examine the data structure of `params`. In the code below, we redact\n",
|
262 | 301 | "the parameter values and then display the PyTree structure."
|
263 | 302 | ]
|
|
285 | 324 | "The output is as follows:\n",
|
286 | 325 | "\n",
|
287 | 326 | "```plaintext\n",
|
288 |
| - "{'_overwrite_with_gradient': {'Fp8DotGeneralOp_0': {'input_amax_history': '*',\n", |
289 |
| - " 'input_scale': '*',\n", |
290 |
| - " 'kernel_amax_history': '*',\n", |
291 |
| - " 'kernel_scale': '*',\n", |
292 |
| - " 'output_grad_amax_history': '*',\n", |
293 |
| - " 'output_grad_scale': '*'}},\n", |
294 |
| - " 'params': {'bias': '*', 'kernel': '*'}}\n", |
| 327 | + "{'_overwrite_with_gradient': {'Fp8Einsum_0': {'input_amax_history': '*',\n", |
| 328 | + " 'input_scale': '*',\n", |
| 329 | + " 'kernel_amax_history': '*',\n", |
| 330 | + " 'kernel_scale': '*',\n", |
| 331 | + " 'output_grad_amax_history': '*',\n", |
| 332 | + " 'output_grad_scale': '*'}}}\n", |
295 | 333 | "```\n",
|
296 | 334 | "\n",
|
297 | 335 | "In addition to the expected `params`, there is an additional category called\n",
|
|
400 | 438 | "2.0 [5. 0. 0. ... 0. 0. 0.]\n",
|
401 | 439 | "```\n",
|
402 | 440 | "\n",
|
403 |
| - "This casting is already included if users choose to use the high-level APIs." |
| 441 | + "This casting is already included if users choose to use the high-level APIs.\n", |
| 442 | + "\n", |
| 443 | + "## Deprecated APIs\n", |
| 444 | + "Previously, we provided APIs like `fp8_ops.quantize_dequantize` for current\n", |
| 445 | + "scaling and `fp8_ops.[in|out]_qdq` for delayed scaling. These were used with\n", |
| 446 | + "high precision dot operations, leveraging an XLA-FP8 feature that\n", |
| 447 | + "pattern-matched QDQ->dot sequences to Q->fp8_cublas_gemm. The corresponding\n", |
| 448 | + "high-level API was called `fp8_ops.Fp8DotGeneralOp`. However, this pattern\n", |
| 449 | + "matching-based solution proved brittle, as the patterns could be easily broken\n", |
| 450 | + "by other XLA optimizations. We recommend users migrate from these deprecated\n", |
| 451 | + "APIs to the newer ones described above.\n", |
| 452 | + "\n", |
| 453 | + "For migration, users should replace:\n", |
| 454 | + "* `fp8_ops.quantize_dequantize -> jnp.dot` with `fp8_ops.quantize -> jnp.dot ->\n", |
| 455 | + " fp8_ops.dequantize`\n", |
| 456 | + "* `fp8_ops.in_qdq -> jnp.dot -> fp8_ops.out_qdq` with `fp8_ops.in_q -> jnp.dot\n", |
| 457 | + " -> fp8_ops.out_dq`\n", |
| 458 | + "* `fp8_ops.Fp8DotGeneralOp` with `fp8_ops.Fp8DotGeneral`\n", |
| 459 | + "\n", |
| 460 | + "Additionally, we provide an einsum variant through `fp8_ops.Fp8Einsum`." |
404 | 461 | ]
|
405 | 462 | }
|
406 | 463 | ],
|