From e56422b816bd65b27c55818d6415f90019e39b74 Mon Sep 17 00:00:00 2001 From: Peter Jun Park Date: Fri, 26 Jul 2024 06:59:13 -0400 Subject: [PATCH] add fixes Signed-off-by: Peter Jun Park add metadata and fixes Signed-off-by: Peter Jun Park add fixes bump to 1.6.1 more fixes --- .wordlist.txt | 1 + docs/conceptual/command-processor.rst | 6 +- docs/conceptual/compute-unit.rst | 9 +- docs/conceptual/definitions.rst | 2 +- docs/conceptual/l2-cache.rst | 56 +++++---- docs/conceptual/local-data-share.rst | 30 +++-- docs/conceptual/performance-model.rst | 4 +- docs/conceptual/pipeline-descriptions.rst | 14 ++- docs/conceptual/pipeline-metrics.rst | 79 ++++++------ docs/conceptual/references.rst | 4 + docs/conceptual/shader-engine.rst | 52 ++++---- docs/conceptual/system-speed-of-light.rst | 4 + docs/conceptual/vector-l1-cache.rst | 118 +++++++++--------- docs/how-to/analyze/cli.rst | 20 +-- docs/how-to/analyze/grafana-gui.rst | 6 + docs/how-to/analyze/standalone-gui.rst | 4 + docs/how-to/use.rst | 2 +- docs/install/core-install.rst | 2 +- docs/install/grafana-setup.rst | 4 +- docs/reference/compatible-accelerators.rst | 4 +- docs/sphinx/requirements.in | 2 +- docs/sphinx/requirements.txt | 8 +- .../includes/infinity-fabric-transactions.rst | 7 +- .../valu-arithmetic-instruction-mix.rst | 4 +- docs/tutorial/profiling-by-example.rst | 2 +- docs/what-is-omniperf.rst | 36 +++--- 26 files changed, 273 insertions(+), 207 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index 9f50063c3..b2f8b3c37 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -42,6 +42,7 @@ conf gcn isa latencies +lds lookaside mantor modulefile diff --git a/docs/conceptual/command-processor.rst b/docs/conceptual/command-processor.rst index 6664ab587..f0affd835 100644 --- a/docs/conceptual/command-processor.rst +++ b/docs/conceptual/command-processor.rst @@ -1,3 +1,7 @@ +.. 
meta:: + :description: Omniperf performance model: Command processor (CP) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command, processor, fetcher, packet processor, CPF, CPC + ********************** Command processor (CP) ********************** @@ -23,7 +27,7 @@ The command processor consists of two sub-components: Before scheduling work to the accelerator, the command processor can first acquire a memory fence to ensure system consistency :hsa-runtime-pdf:`Section 2.6.4 <91>`. After the work is complete, the -command processor can apply a memory-release fence. Depending on the AMD CDNA +command processor can apply a memory-release fence. Depending on the AMD CDNA™ accelerator under question, either of these operations *might* initiate a cache write-back or invalidation. diff --git a/docs/conceptual/compute-unit.rst b/docs/conceptual/compute-unit.rst index 7e45df8a0..09ef483ab 100644 --- a/docs/conceptual/compute-unit.rst +++ b/docs/conceptual/compute-unit.rst @@ -1,9 +1,14 @@ +.. meta:: + :description: Omniperf performance model: Compute unit (CU) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GCN, compute, unit, pipeline, workgroup, wavefront, + CDNA + ***************** Compute unit (CU) ***************** The compute unit (CU) is responsible for executing a user's kernels on -CDNA-based accelerators. All :ref:`wavefronts ` of a +CDNA™-based accelerators. All :ref:`wavefronts ` of a :ref:`workgroup ` are scheduled on the same CU. .. image:: ../data/performance-model/gcn_compute_unit.png @@ -44,7 +49,7 @@ presented by Omniperf for these pipelines are described in write-through. The vL1D caches from multiple compute units are kept coherent with one another through software instructions. -* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain +* CDNA accelerators -- that is, AMD Instinct™ MI100 and newer -- contain specialized matrix-multiplication accelerator pipelines known as the :ref:`desc-mfma`. 
diff --git a/docs/conceptual/definitions.rst b/docs/conceptual/definitions.rst index 127ef19f1..7d397730f 100644 --- a/docs/conceptual/definitions.rst +++ b/docs/conceptual/definitions.rst @@ -19,7 +19,7 @@ and in this documentation. Memory spaces ============= -AMD Instinct MI accelerators can access memory through multiple address spaces +AMD Instinct™ MI-series accelerators can access memory through multiple address spaces which may map to different physical memory locations on the system. The following table provides a view into how various types of memory used in HIP map onto these constructs: diff --git a/docs/conceptual/l2-cache.rst b/docs/conceptual/l2-cache.rst index cf30faeda..03c375665 100644 --- a/docs/conceptual/l2-cache.rst +++ b/docs/conceptual/l2-cache.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: L2 cache (TCC) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, L2, cache, infinity fabric, metrics + ************** L2 cache (TCC) ************** @@ -9,7 +13,7 @@ on the device. Besides serving requests from the for servicing requests from the :ref:`L1 instruction caches `, the :ref:`scalar L1 data caches ` and the :doc:`command processor `. The L2 cache is composed of a -number of distinct channels (32 on MI100/:ref:`MI2XX ` series CDNA +number of distinct channels (32 on MI100 and :ref:`MI2XX ` series CDNA accelerators at 256B address interleaving) which can largely operate independently. Mapping of incoming requests to a specific L2 channel is determined by a hashing mechanism that attempts to evenly distribute requests @@ -132,14 +136,14 @@ This section details the incoming requests to the L2 cache from the if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. 
* - Requests - The total number of incoming requests to the L2 from all clients for all request types, per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Read Requests @@ -221,7 +225,7 @@ This section details the incoming requests to the L2 cache from the - The total number of L2 cache lines written back to memory for internal hardware reasons, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Writebacks (vL1D Req) @@ -229,14 +233,14 @@ This section details the incoming requests to the L2 cache from the initiated by the :doc:`vL1D cache `, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Evictions (Normal) - The total number of L2 cache lines evicted from the cache due to capacity limits, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Evictions (vL1D Req) @@ -245,7 +249,7 @@ This section details the incoming requests to the L2 cache from the :doc:`vL1D cache `, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Non-hardware-Coherent Requests @@ -253,25 +257,25 @@ This section details the incoming requests to the L2 cache from the memory allocations, per :ref:`normalization unit `. See the :ref:`memory-type` for more information. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Requests - - The total number of requests to the L2 that to uncached (UC) memory + - The total number of requests to the L2 that go to Uncached (UC) memory allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. 
* - Coherently Cached Requests - - The total number of requests to the L2 that to coherently cacheable (CC) + - The total number of requests to the L2 that go to Coherently Cacheable (CC) memory allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. * - Read/Write Coherent Requests - - The total number of requests to the L2 that to Read-Write coherent memory + - The total number of requests to the L2 that go to Read-Write coherent memory (RW) allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. @@ -396,7 +400,7 @@ Metrics - The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization unit `. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - HBM Read Traffic @@ -446,7 +450,7 @@ Metrics :ref:`uncached memory ` allocations on the MI2XX. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - HBM Write and Atomic Traffic @@ -529,7 +533,7 @@ Metrics * - Read Stall - The ratio of the total number of cycles the L2-Fabric interface was - stalled on a read request to any destination (local HBM, remote PCIe + stalled on a read request to any destination (local HBM, remote PCIe® connected accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. @@ -571,7 +575,7 @@ transaction breakdown table: :ref:`l2-request-flow` for more detail. Typically unused on CDNA accelerators. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Read Requests @@ -581,7 +585,7 @@ transaction breakdown table: uncached data are counted as two 32B uncached data requests. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 64B Read Requests @@ -590,7 +594,7 @@ transaction breakdown table: :ref:`normalization unit `. 
See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - HBM Read Requests @@ -599,7 +603,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Remote Read Requests @@ -608,7 +612,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 32B Write and Atomic Requests @@ -617,7 +621,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Write and Atomic Requests @@ -626,7 +630,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 64B Write and Atomic Requests @@ -635,7 +639,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - HBM Write and Atomic Requests @@ -644,7 +648,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Remote Write and Atomic Requests @@ -654,7 +658,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Atomic Requests @@ -668,7 +672,7 @@ transaction breakdown table: :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations on the MI2XX. 
- - Requests per normalization unit + - Requests per :ref:`normalization unit `. .. _l2-fabric-stalls: @@ -759,7 +763,7 @@ remote accelerators or CPUs. `Infinity Fabric `_ technology can be used to connect multiple accelerators to achieve advanced peer-to-peer connectivity and enhanced bandwidths over traditional PCIe - connections. Some AMD Instinct MI accelerators like the MI250X, + connections. Some AMD Instinct MI-series accelerators like the MI250X `feature coherent CPU↔accelerator connections built using AMD Infinity Fabric `_. .. rubric:: Disclaimer diff --git a/docs/conceptual/local-data-share.rst b/docs/conceptual/local-data-share.rst index c6b9fb9e0..33544edd8 100644 --- a/docs/conceptual/local-data-share.rst +++ b/docs/conceptual/local-data-share.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: Local data share (LDS) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, local, data, share, LDS + ********************** Local data share (LDS) ********************** @@ -38,7 +42,7 @@ the LDS as a comparison with the peak achievable values of those metrics. * - Access Rate - - Indicates the percentage of SIMDs in the :ref:`VALU ` [#1]_ + - Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ actively issuing LDS instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing :ref:`LDS ` @@ -61,13 +65,13 @@ the LDS as a comparison with the peak achievable values of those metrics. - Indicates the percentage of active LDS cycles that were spent servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts over the number of LDS cycles that would have been - required to move the same amount of data in an uncontended access. [#2]_ + required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_ - Percent .. rubric:: Footnotes -.. 
[#1] Here we assume the typical case where the workload evenly distributes +.. [#lds-workload] Here we assume the typical case where the workload evenly distributes LDS operations over all SIMDs in a CU (that is, waves on different SIMDs are executing similar code). For highly unbalanced workloads, where e.g., one SIMD pair in the CU does not issue LDS instructions at all, this metric is @@ -75,7 +79,7 @@ the LDS as a comparison with the peak achievable values of those metrics. :ref:`SIMD pairs ` that are actively using the LDS, averaged over the lifetime of the kernel. -.. [#2] The maximum value of the bank conflict rate is less than 100% +.. [#lds-bank-conflict] The maximum value of the bank conflict rate is less than 100% (specifically: 96.875%), as the first cycle in the :ref:`LDS scheduler ` is never considered contended. @@ -101,7 +105,7 @@ The LDS statistics panel gives a more detailed view of the hardware: read/write/atomics and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - Theoretical Bandwidth @@ -112,7 +116,7 @@ The LDS statistics panel gives a more detailed view of the hardware: executed. See the :ref:`LDS bandwidth example ` for more detail. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - LDS Latency @@ -136,14 +140,14 @@ The LDS statistics panel gives a more detailed view of the hardware: - The total number of cycles spent in the :ref:`LDS scheduler ` over all operations per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Atomic Return Cycles - The total number of cycles spent on LDS atomics with return per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. 
* - Bank Conflicts @@ -151,7 +155,7 @@ The LDS statistics panel gives a more detailed view of the hardware: due to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Address Conflicts @@ -159,7 +163,7 @@ The LDS statistics panel gives a more detailed view of the hardware: due to address conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Unaligned Stall @@ -167,13 +171,13 @@ The LDS statistics panel gives a more detailed view of the hardware: due to stalls from non-dword aligned addresses per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Memory Violations - The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization unit `. This is unused and - expected to be zero in most configurations for modern CDNA accelerators. + expected to be zero in most configurations for modern CDNA™ accelerators. - - Accesses per normalization unit + - Accesses per :ref:`normalization unit `. diff --git a/docs/conceptual/performance-model.rst b/docs/conceptual/performance-model.rst index 4ed821b77..1a94b3ed6 100644 --- a/docs/conceptual/performance-model.rst +++ b/docs/conceptual/performance-model.rst @@ -1,5 +1,5 @@ .. 
meta:: - :description: Omniperf documentation and reference + :description: Omniperf performance model :keywords: Omniperf, ROCm, performance, model, profiler, tool, Instinct, accelerator, AMD @@ -9,7 +9,7 @@ Performance model Omniperf makes available an extensive list of metrics to better understand achieved application performance on AMD Instinct™ MI-series accelerators -including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA +including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA™ accelerators like the MI100, and CDNA2 accelerators such as the MI250X, MI250, and MI210. diff --git a/docs/conceptual/pipeline-descriptions.rst b/docs/conceptual/pipeline-descriptions.rst index ee12e5c80..b781218fe 100644 --- a/docs/conceptual/pipeline-descriptions.rst +++ b/docs/conceptual/pipeline-descriptions.rst @@ -1,3 +1,8 @@ +.. meta:: + :description: Omniperf performance model: Pipeline descriptions + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, pipeline, VALU, SALU, VMEM, SMEM, LDS, branch, + scheduler, MFMA, AGPRs + ********************* Pipeline descriptions ********************* @@ -14,8 +19,8 @@ Vector arithmetic logic unit (VALU) The vector arithmetic logic unit (VALU) executes vector instructions over an entire wavefront, each :ref:`work-item ` (or, -vector-lane) potentially operating on distinct data. The VALU of a CDNA -accelerator or GCN GPU typically consists of: +vector-lane) potentially operating on distinct data. The VALU of a CDNA™ +accelerator or GCN™ GPU typically consists of: * Four 16-wide SIMD processors (see :hip-training-pdf:`24` for more details). @@ -282,13 +287,12 @@ instructions (``v_accvgpr_*``). These data movement instructions may be used by the compiler to implement lower-cost register-spill/fills on architectures with AGPRs. -AGPRs are not available on all AMD Instinct accelerators. GCN GPUs, +AGPRs are not available on all AMD Instinct™ accelerators. 
GCN GPUs, such as the AMD Instinct MI50 had a 256 KiB VGPR file. The AMD Instinct MI100 (CDNA) has a 2x256 KiB register file, where one half is available as general-purpose VGPRs, and the other half is for matrix math accumulation VGPRs (AGPRs). The AMD Instinct :ref:`MI2XX ` (CDNA2) has a 512 KiB VGPR file per CU, where each wave can dynamically request up to 256 KiB of VGPRs and an additional 256 KiB of AGPRs. For more information, -refer to -``__. +refer to `this comment `_. diff --git a/docs/conceptual/pipeline-metrics.rst b/docs/conceptual/pipeline-metrics.rst index 17f27e317..e86132a96 100644 --- a/docs/conceptual/pipeline-metrics.rst +++ b/docs/conceptual/pipeline-metrics.rst @@ -1,3 +1,8 @@ +.. meta:: + :description: Omniperf performance model: Pipeline metrics + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, pipeline, wavefront, metrics, launch, runtime + VALU, MFMA, instruction mix, FLOPs, arithmetic, operations + **************** Pipeline metrics **************** @@ -47,7 +52,7 @@ kernel launch: * - Total Wavefronts - The total number of wavefronts launched as part of the kernel dispatch. - On AMD Instinct CDNA accelerators and GCN GPUs, the wavefront size is + On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is always 64 work-items. Thus, the total number of wavefronts should be equivalent to the ceiling of grid size divided by 64. @@ -212,11 +217,11 @@ execution of wavefronts in a kernel: .. note:: As mentioned earlier, the measurement of kernel cycles and time typically - cannot directly be compared to e.g., Wave Cycles. This is due to two factors: + cannot be directly compared to, for example, wave cycles. This is due to two factors: first, the kernel cycles/timings are measured using a counter that is impacted by scheduling overhead, this is particularly noticeable for "short-running" kernels (less than 1ms) where scheduling overhead forms a - significant portion of the overall kernel runtime. 
Secondly, the Wave Cycles + significant portion of the overall kernel runtime. Secondly, the wave cycles metric is incremented per-wavefront scheduled to a SIMD every cycle whereas the kernel cycles counter is incremented only once per-cycle when *any* wavefront is scheduled. @@ -240,9 +245,9 @@ instructions. change regardless of the execution mask of the wavefront. Note that even if the execution mask is identically zero (meaning that *no lanes are active*) the instruction will still be counted, as CDNA accelerators still consider - these instructions *issued*. See for example - :mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` and - further details. + these instructions *issued*. See + :mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` for + examples and further details. Overall instruction mix ----------------------- @@ -355,14 +360,14 @@ additions executed as part of an MFMA instruction using the same precision. - The total number of instructions operating on 32-bit integer operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - INT64 - The total number of instructions operating on 64-bit integer operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-ADD @@ -370,7 +375,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-MUL @@ -378,7 +383,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. 
* - F16-FMA @@ -386,7 +391,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-TRANS @@ -394,7 +399,7 @@ additions executed as part of an MFMA instruction using the same precision. on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-ADD @@ -402,7 +407,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-MUL @@ -410,7 +415,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-FMA @@ -418,7 +423,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-TRANS @@ -426,7 +431,7 @@ additions executed as part of an MFMA instruction using the same precision. operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-ADD @@ -434,7 +439,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. 
* - F64-MUL @@ -442,7 +447,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-FMA @@ -450,7 +455,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-TRANS @@ -458,7 +463,7 @@ additions executed as part of an MFMA instruction using the same precision. operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - Conversion @@ -466,7 +471,7 @@ additions executed as part of an MFMA instruction using the same precision. to or from F32↔F64) issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. For an example of these counters in action, refer to :ref:`valu-arith-instruction-mix-ex`. @@ -485,7 +490,7 @@ instructions. .. _mfma-instruction-mix: MFMA instruction mix -^^^^^^^^^^^^^^^^^^^^ +-------------------- .. warning:: @@ -512,35 +517,35 @@ MFMA instructions are classified by the type of input data they operate on, and - The total number of 8-bit integer :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F16 Instructions - The total number of 16-bit floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-BF16 Instructions - The total number of 16-bit brain floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. 
- - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F32 Instructions - The total number of 32-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F64 Instructions - The total number of 64-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. Compute pipeline ================ @@ -686,9 +691,9 @@ Pipeline statistics This section reports a number of key performance characteristics of various execution units on the :doc:`CU `. Refer to -:ref:`ipc-example` for a detailed dive into these metrics, and -:ref:`scheduler ` for a high-level overview of execution units -and instruction issue. +:ref:`ipc-example` for a detailed dive into these metrics, and the +:ref:`scheduler ` the for a high-level overview of execution +units and instruction issue. .. list-table:: :header-rows: 1 @@ -850,7 +855,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - IOPs (Total) @@ -858,7 +863,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - IOP per normalization unit + - IOP per :ref:`normalization unit `. * - F16 OPs @@ -866,7 +871,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - BF16 OPs @@ -875,7 +880,7 @@ not. For more detail on how operations are counted see the :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native BF16 instructions. 
- - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - F32 OPs @@ -883,7 +888,7 @@ not. For more detail on how operations are counted see the the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - F64 OPs @@ -891,7 +896,7 @@ not. For more detail on how operations are counted see the the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - INT8 OPs @@ -900,5 +905,5 @@ not. For more detail on how operations are counted see the :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native INT8 instructions. - - IOPs per normalization unit + - IOPs per :ref:`normalization unit `. diff --git a/docs/conceptual/references.rst b/docs/conceptual/references.rst index cc0f36fe3..9f3d32cd8 100644 --- a/docs/conceptual/references.rst +++ b/docs/conceptual/references.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: References + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, HIP, GCN, LLVM, docs, documentation, training + ********** References ********** diff --git a/docs/conceptual/shader-engine.rst b/docs/conceptual/shader-engine.rst index eeb9b6f3b..2ecfb4575 100644 --- a/docs/conceptual/shader-engine.rst +++ b/docs/conceptual/shader-engine.rst @@ -1,8 +1,12 @@ +.. meta:: + :description: Omniperf performance model: Shader engine (SE) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, shader, engine, sL1D, L1I, workgroup manager, SPI + ****************** Shader engine (SE) ****************** -The :doc:`compute units ` on a CDNA accelerator are grouped +The :doc:`compute units ` on a CDNA™ accelerator are grouped together into a higher-level organizational unit called a shader engine (SE): .. 
figure:: ../data/performance-model/selayout.png @@ -14,7 +18,7 @@ together into a higher-level organizational unit called a shader engine (SE): The number of CUs on a SE varies from chip to chip -- see for example :hip-training-pdf:`20`. In addition, newer accelerators such as the AMD -Instinct MI 250X have 8 SEs per accelerator. +Instinct™ MI250X have 8 SEs per accelerator. For the purposes of Omniperf, we consider resources that are shared between multiple CUs on a single SE as part of the SE's metrics. @@ -36,7 +40,7 @@ The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load instructions (and scalar store instructions on architectures where they exist) from wavefronts in the :doc:`CUs `. The sL1D is shared between multiple CUs (:gcn-crash-course:`36`) -- the exact number of CUs depends on the -architecture in question (3 CUs in GCN GPUs and MI100, 2 CUs in +architecture in question (3 CUs in GCN™ GPUs and MI100, 2 CUs in :ref:`MI2XX `) -- and is backed by the :doc:`L2 cache `. In typical usage, the data in the sL1D is comprised of: @@ -123,14 +127,14 @@ and the hit/miss statistics. - The total number of requests, of any size or type, made to the sL1D per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Hits - The total number of sL1D requests that hit on a previously loaded cache line, per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Misses - Non Duplicated @@ -139,7 +143,7 @@ and the hit/miss statistics. :ref:`normalization unit `. See :ref:`desc-sl1d-sol` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Misses - Duplicated @@ -148,7 +152,7 @@ and the hit/miss statistics. :ref:`normalization unit `. See :ref:`desc-sl1d-sol` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. 
   * - Cache Hit Rate
 
@@ -163,7 +167,7 @@ and the hit/miss statistics.
     - The total number of sL1D read requests of any size, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Atomic Requests
 
@@ -171,42 +175,42 @@ and the hit/miss statistics.
       :ref:`normalization unit `. Typically unused on CDNA
       accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (1 DWord)
 
     - The total number of sL1D read requests made for a single dword of data
       (4B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (2 DWord)
 
     - The total number of sL1D read requests made for a two dwords of data
       (8B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (4 DWord)
 
     - The total number of sL1D read requests made for a four dwords of data
       (16B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (8 DWord)
 
     - The total number of sL1D read requests made for a eight dwords of data
       (32B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (16 DWord)
 
     - The total number of sL1D read requests made for a sixteen dwords of data
       (64B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
 .. _desc-sl1d-l2-interface:
 
@@ -233,14 +237,14 @@ sL1D↔:doc:`L2 ` interface.
       and atomics are typically unused on current CDNA accelerators, so in
       the majority of cases this can be interpreted as an sL1D→L2 read
       bandwidth.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - Read Requests
 
     - The total number of read requests from sL1D to the
       :doc:`L2 `, per :ref:`normalization unit `.
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Write Requests
 
@@ -248,7 +252,7 @@ sL1D↔:doc:`L2 ` interface.
       per :ref:`normalization unit `. Typically unused on
       current CDNA accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Atomic Requests
 
@@ -257,14 +261,14 @@ sL1D↔:doc:`L2 ` interface.
       :ref:`normalization unit `. Typically unused on current
       CDNA accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Stall Cycles
 
     - The total number of cycles the sL1D↔:doc:`L2 ` interface
       was stalled, per :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
 .. rubric:: Footnotes
@@ -373,14 +377,14 @@ This panel gives more detail on the hit/miss statistics of the L1I:
 
     - The total number of requests made to the L1I per
       :ref:`normalization-unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hits
 
     - The total number of L1I requests that hit on a previously loaded cache
      line, per :ref:`normalization-unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Misses - Non Duplicated
 
@@ -389,7 +393,7 @@ This panel gives more detail on the hit/miss statistics of the L1I:
       :ref:`normalization-unit `. See note in
       :ref:`desc-l1i-sol` for more detail.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Misses - Duplicated
 
@@ -398,7 +402,7 @@ This panel gives more detail on the hit/miss statistics of the L1I:
       :ref:`normalization-unit `. See note in
       :ref:`desc-l1i-sol` for more detail.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Cache Hit Rate
 
@@ -428,7 +432,7 @@ L1I-:doc:`L2 ` interface.
     - The total number of bytes read across the L1I-:doc:`L2 `
       interface, per :ref:`normalization unit `.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
 .. rubric:: Footnotes
diff --git a/docs/conceptual/system-speed-of-light.rst b/docs/conceptual/system-speed-of-light.rst
index 4c2c462ef..fc758a698 100644
--- a/docs/conceptual/system-speed-of-light.rst
+++ b/docs/conceptual/system-speed-of-light.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf performance model: System Speed-of-Light
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, system, speed of light
+
 *********************
 System Speed-of-Light
 *********************
diff --git a/docs/conceptual/vector-l1-cache.rst b/docs/conceptual/vector-l1-cache.rst
index 78325a7a4..42b740cf2 100644
--- a/docs/conceptual/vector-l1-cache.rst
+++ b/docs/conceptual/vector-l1-cache.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf performance model: Vector L1 cache (vL1D)
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, vector, l1, cache, vl1d
+
 **********************
 Vector L1 cache (vL1D)
 **********************
@@ -37,7 +41,7 @@ operations issued by a wavefront. The vL1D cache consists of several components:
 Together, this complex is known as the vL1D, or Texture Cache per Pipe (TCP).
 A simplified diagram of the vL1D is presented below:
 
-.. figure:: ../data/performance-model/l1perf_model.*
+.. figure:: ../data/performance-model/l1perf_model.png
    :align: center
    :alt: Performance model of the vL1D Cache on AMD Instinct
@@ -89,7 +93,7 @@ as a comparison with the peak achievable values of those metrics.
 
   * - Utilization
 
-    - Indicates how busy the :ref:`vL1D Cache RAM ` was during the
+    - Indicates how busy the :ref:`vL1D Cache RAM ` was during the
       kernel execution. The number of cycles where the vL1D Cache RAM is
       actively processing any request divided by the number of cycles where
       the vL1D is active [#vl1d-activity]_.
@@ -100,7 +104,7 @@ as a comparison with the peak achievable values of those metrics.
     - Indicates how well memory instructions were coalesced by the
       :ref:`address processing unit `, ranging from uncoalesced (25%)
-      to fully coalesced (100%). The average number of
+      to fully coalesced (100%). Calculated as the average number of
       :ref:`thread-requests ` generated per instruction divided
       by the ideal number of thread-requests per instruction.
@@ -221,7 +225,7 @@ kernel. These are broken down into a few major categories:
 
   - Private memory, or "scratch" memory, is only visible to a particular
     :ref:`work-item ` in a particular
-    :ref:`workgroup `. On AMD Instinct MI-series
+    :ref:`workgroup `. On AMD Instinct™ MI-series
     accelerators, private memory is used to implement both register spills
     and stack memory accesses.
 
@@ -242,7 +246,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Read
 
@@ -250,7 +254,7 @@ The address processor counts these instruction types as follows:
       all :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Write
 
@@ -258,7 +262,7 @@ The address processor counts these instruction types as follows:
       on all :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Atomic
 
@@ -266,7 +270,7 @@ The address processor counts these instruction types as follows:
       return) instructions executed on all :doc:`compute units `
       on the accelerator, per :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
   * - Spill/Stack
 
@@ -274,7 +278,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Spill/Stack Read
 
@@ -282,7 +286,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Spill/Stack Write
 
@@ -290,7 +294,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instruction per normalization unit
+    - Instruction per :ref:`normalization unit `.
 
   * - Spill/Stack Atomic
 
@@ -300,7 +304,7 @@ The address processor counts these instruction types as follows:
       Typically unused as these memory operations are typically used to
       implement thread-local storage.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
 .. note::
 
@@ -343,7 +347,7 @@ stage for spill/stack memory, and thus reports:
       spill/stack instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
   * - Spill/Stack Coalesced Read Cycles
 
@@ -351,7 +355,7 @@ stage for spill/stack memory, and thus reports:
       coalesced spill/stack read instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
   * - Spill/Stack Coalesced Write Cycles
 
@@ -359,7 +363,7 @@ stage for spill/stack memory, and thus reports:
       coalesced spill/stack write instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
 .. _desc-utcl1:
 
@@ -389,14 +393,14 @@ Omniperf reports the following L1 TLB metrics:
 
     - The number of translation requests made to the UTCL1 per
       :ref:`normalization unit `.
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hits
 
    - The number of translation requests that hit in the UTCL1, and could be
      reused, per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hit Ratio
 
@@ -411,16 +415,16 @@ Omniperf reports the following L1 TLB metrics:
       translation not being present in the cache, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Permission Misses
 
    - The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit `. This is
      unused and expected to be zero in most configurations for modern
-      CDNA accelerators.
+      CDNA™ accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
 .. note::
 
@@ -527,7 +531,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`address processing unit ` after coalescing per
       :ref:`normalization unit `
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Cache Bandwidth
 
@@ -539,7 +543,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - Cache Hit Rate [#vl1d-hit]_
 
@@ -562,7 +566,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       serviced by the :ref:`vL1D Cache RAM ` per
       :ref:`normalization unit `.
 
-    - Cache lines per normalization unit
+    - Cache lines per :ref:`normalization unit `.
 
   * - Invalidations
 
@@ -571,7 +575,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`normalization unit `. This may be triggered by,
       for instance, the ``buffer_wbinvl1`` instruction.
-    - Invalidations per normalization unit
+    - Invalidations per :ref:`normalization unit `.
 
   * - L1-L2 Bandwidth
 
@@ -583,7 +587,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - L1-L2 Reads
 
@@ -592,7 +596,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :doc:`L2 Cache ` per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1-L2 Writes
 
@@ -600,7 +604,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       the vL1D to the :doc:`L2 cache `, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1-L2 Atomics
 
@@ -609,27 +613,27 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`normalization unit `. This includes requests for
       atomics with, and without return.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1 Access Latency
 
-    - The average number of cycles that a vL1D cache line request spent in the
-      vL1D cache pipeline.
+    - Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
 
    - Cycles
 
   * - L1-L2 Read Access Latency
 
-    - The average number of cycles that the vL1D cache took to issue and
-      receive read requests from the :doc:`L2 Cache `. This number
-      also includes requests for atomics with return values.
+    - Calculated as the average number of cycles that the vL1D cache took to
+      issue and receive read requests from the :doc:`L2 Cache `. This
+      number also includes requests for atomics with return values.
    - Cycles
 
  * - L1-L2 Write Access Latency
 
-    - The average number of cycles that the vL1D cache took to issue and
-      receive acknowledgement of a write request to the
+    - Calculated as the average number of cycles that the vL1D cache took to
+      issue and receive acknowledgement of a write request to the
      :doc:`L2 Cache `. This number also includes requests for
      atomics without return values.
 
@@ -639,7 +643,22 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
   All cache accesses in vL1D are for a single cache line's worth of data.
   The size of a cache line may vary, however on current AMD Instinct MI CDNA
-  accelerators and GCN GPUs the L1 cache line size is 64B.
+  accelerators and GCN™ GPUs the L1 cache line size is 64B.
+
+.. rubric :: Footnotes
+
+.. [#vl1d-hit] The vL1D cache on AMD Instinct MI-series CDNA accelerators
+   uses a "hit-on-miss" approach to reporting cache hits. That is, if while
+   satisfying a miss, another request comes in that would hit on the same
+   pending cache line, the subsequent request will be counted as a "hit".
+   Therefore, it is also important to consider the access latency metric in the
+   :ref:`Cache access metrics ` section when
+   evaluating the vL1D hit rate.
+
+.. [#vl1d-activity] Omniperf considers the vL1D to be active when any part of
+   the vL1D (excluding the :ref:`address processor ` and
+   :ref:`data return ` units) are active, for example, when performing
+   a translation, waiting for data, accessing the Tag or Cache RAMs, etc.
 
 .. _vl1d-l2-transaction-detail:
 
@@ -707,7 +726,7 @@ Omniperf reports the following vL1D data-return path metrics:
       :ref:`address processor ` that were found to be coalescable, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Read Instructions
 
@@ -717,9 +736,9 @@ Omniperf reports the following vL1D data-return path metrics:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
       This is expected to be the sum of global/generic and spill/stack reads in the
-      :ref:`address processor `.
+      :ref:`address processor `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Write Instructions
 
@@ -731,7 +750,7 @@ Omniperf reports the following vL1D data-return path metrics:
       the sum of global/generic and spill/stack stores counted by the
       :ref:`vL1D cache-front-end `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Atomic Instructions
 
@@ -741,22 +760,7 @@ Omniperf reports the following vL1D data-return path metrics:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `. This is expected to be the
       sum of global/generic and spill/stack atomics in the
-      :ref:`address processor `.
+      :ref:`address processor `.
 
-    - Instructions per normalization unit
-
-.. rubric :: Footnotes
-
-.. [#vl1d-hit] The vL1D cache on AMD Instinct MI-series CDNA accelerators
-   uses a "hit-on-miss" approach to reporting cache hits. That is, if while
-   satisfying a miss, another request comes in that would hit on the same
-   pending cache line, the subsequent request will be counted as a "hit".
-   Therefore, it is also important to consider the access latency metric in the
-   :ref:`Cache access metrics ` section when
-   evaluating the vL1D hit rate.
-
-.. [#vl1d-activity] Omniperf considers the vL1D to be active when any part of
-   the vL1D (excluding the :ref:`address processor ` and
-   :ref:`data return ` units) are active, for example, when performing
-   a translation, waiting for data, accessing the Tag or Cache RAMs, etc.
+    - Instructions per :ref:`normalization unit `.
diff --git a/docs/how-to/analyze/cli.rst b/docs/how-to/analyze/cli.rst
index 61b213fab..f76e3970f 100644
--- a/docs/how-to/analyze/cli.rst
+++ b/docs/how-to/analyze/cli.rst
@@ -1,10 +1,14 @@
+.. meta::
+   :description: Omniperf analysis: CLI analysis
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command line, analyze, filtering, metrics, baseline, comparison
+
 ************
 CLI analysis
 ************
 
-The following is a look into Omniperf's CLI analysis features.
+This section provides an overview of Omniperf's CLI analysis features.
 
-* **Derived metrics**: All of Omniperf's built-in metrics.
+* :ref:`Derived metrics `: All of Omniperf's built-in metrics.
 
 * :ref:`Baseline comparison `: Compare multiple runs in a
   side-by-side manner.
@@ -22,7 +26,7 @@ Run ``omniperf analyze -h`` for more details.
 Walkthrough
 ===========
 
-#. To begin, generate a high-level analysis report using Omniperf's ``-b`` (or ``--block``) flag.
+1. To begin, generate a high-level analysis report using Omniperf's ``-b`` (or ``--block``) flag.
 
    .. code-block:: shell
 
 ...
 
@@ -126,7 +130,9 @@ Walkthrough
 
 ...
 
-#. Use ``--list-metrics`` to generate a list of available metrics for inspection.
+.. _cli-list-metrics:
+
+2. Use ``--list-metrics`` to generate a list of available metrics for inspection.
 
    .. code-block:: shell
 
 ...
 
@@ -178,7 +184,7 @@ Walkthrough
          2.1.30 -> L1I Fetch Latency
 ...
 
-#. Choose your own customized subset of metrics with the ``-b`` (or ``--block``)
+3. Choose your own customized subset of metrics with the ``-b`` (or ``--block``)
    option. Or, build your own configuration following
   `config_template `_. The following snippet shows how
   to generate a report containing only metric 2
@@ -271,10 +277,10 @@ Walkthrough
 
   Some cells may be blank indicating a missing or unavailable hardware counter
   or NULL value.
 
-#. Optimize the application, iterate, and re-profile to inspect performance
+4. Optimize the application, iterate, and re-profile to inspect performance
   changes.
 
-#. Redo a comprehensive analysis with Omniperf CLI at any optimization
+5. Redo a comprehensive analysis with Omniperf CLI at any optimization
   milestone.
 
 .. _cli-analysis-options:
diff --git a/docs/how-to/analyze/grafana-gui.rst b/docs/how-to/analyze/grafana-gui.rst
index 80c2a8a1e..403b9f7b1 100644
--- a/docs/how-to/analyze/grafana-gui.rst
+++ b/docs/how-to/analyze/grafana-gui.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf analysis: Grafana GUI
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, Grafana, panels, GUI, import
+
 ********************
 Grafana GUI analysis
 ********************
@@ -845,6 +849,8 @@ Texture Addresser
 instructions) and write/atomic data from the Compute Unit (CU), and coalesces
 them into fewer requests for the cache to process.
 
+.. _grafana-panel-td:
+
 Texture Data
 ++++++++++++
diff --git a/docs/how-to/analyze/standalone-gui.rst b/docs/how-to/analyze/standalone-gui.rst
index 16c0392a0..66f855c8c 100644
--- a/docs/how-to/analyze/standalone-gui.rst
+++ b/docs/how-to/analyze/standalone-gui.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf analysis: Standalone GUI
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GUI, standalone, filter
+
 ***********************
 Standalone GUI analysis
 ***********************
diff --git a/docs/how-to/use.rst b/docs/how-to/use.rst
index 9f838a8f4..7377dd9f9 100644
--- a/docs/how-to/use.rst
+++ b/docs/how-to/use.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: Omniperf basic usage documentation.
+   :description: Omniperf basic usage
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
              basics, usage, operations
diff --git a/docs/install/core-install.rst b/docs/install/core-install.rst
index 5fa203421..07455b04f 100644
--- a/docs/install/core-install.rst
+++ b/docs/install/core-install.rst
@@ -213,7 +213,7 @@ software stack.
 
       .. code-block:: shell
 
-         $ sudo yum install omniperf
+         $ sudo dnf install omniperf
          $ pip install -r /opt/rocm/libexec/omniperf/requirements.txt
 
    .. tab-item:: SUSE Linux Enterprise Server
diff --git a/docs/install/grafana-setup.rst b/docs/install/grafana-setup.rst
index 44e947a5a..ac1436511 100644
--- a/docs/install/grafana-setup.rst
+++ b/docs/install/grafana-setup.rst
@@ -1,7 +1,7 @@
 .. meta::
-   :description: Omniperf client-side installation and deployment
+   :description: Omniperf Grafana server installation and deployment
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
-              install, deploy, Grafana, server, configuration,
+              install, deploy, Grafana, server, configuration, GUI
 
 ****************************************
 Setting up a Grafana server for Omniperf
 ****************************************
diff --git a/docs/reference/compatible-accelerators.rst b/docs/reference/compatible-accelerators.rst
index 30eaf6f6e..b93c72032 100644
--- a/docs/reference/compatible-accelerators.rst
+++ b/docs/reference/compatible-accelerators.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: Omniperf - compatible accelerators and GPUs
+   :description: Omniperf support: compatible accelerators and GPUs
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, GPU
 
 ***********************
@@ -23,7 +23,7 @@ GPU specifications.
   * - Platform
     - Status
 
-  * - AMD Instinct MI300
+  * - AMD Instinct™ MI300
     - Supported ✅
 
   * - AMD Instinct MI200
diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in
index 8b87d40cc..e503806ca 100644
--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1,2 +1,2 @@
-rocm-docs-core==1.6.0
+rocm-docs-core==1.6.1
 sphinxcontrib.datatemplates==0.11.0
diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt
index 794845c76..82d64eb29 100644
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -2,7 +2,7 @@
 # This file is autogenerated by pip-compile with Python 3.10
 # by the following command:
 #
-#    pip-compile docs/sphinx/requirements.in
+#    pip-compile requirements.in
 #
 accessible-pygments==0.0.5
     # via pydata-sphinx-theme
@@ -95,8 +95,8 @@ requests==2.32.3
     # via
     #   pygithub
     #   sphinx
-rocm-docs-core==1.6.0
-    # via -r docs/sphinx/requirements.in
+rocm-docs-core==1.6.1
+    # via -r requirements.in
 smmap==5.0.1
     # via gitdb
 snowballstemmer==2.2.0
@@ -129,7 +129,7 @@ sphinx-notfound-page==1.0.2
 sphinxcontrib-applehelp==1.0.8
     # via sphinx
 sphinxcontrib-datatemplates==0.11.0
-    # via -r docs/sphinx/requirements.in
+    # via -r requirements.in
 sphinxcontrib-devhelp==1.0.6
     # via sphinx
 sphinxcontrib-htmlhelp==2.0.6
diff --git a/docs/tutorial/includes/infinity-fabric-transactions.rst b/docs/tutorial/includes/infinity-fabric-transactions.rst
index 320f0d523..bb198d909 100644
--- a/docs/tutorial/includes/infinity-fabric-transactions.rst
+++ b/docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -157,7 +157,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
 
 .. note::
 
-   On some systems (e.g., those with only PCIe connected accelerators), you need
+   On some systems (e.g., those with only PCIe® connected accelerators), you need
    to set the environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` to enable this
    memory type.
 
@@ -642,6 +642,11 @@ MI250, to e.g., the CPU's DRAM.
 In this light, we see that these requests correspond to *system scope*
 atomics, and specifically in the case of the MI250, to fine-grained memory!
+
+.. rubric:: Disclaimer
+
+PCIe® is a registered trademark of PCI-SIG Corporation.
+
 .. `Leave as possible future experiment to add
diff --git a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
index b3bc63b42..63496b94d 100644
--- a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
+++ b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
@@ -9,7 +9,7 @@ VALU arithmetic instruction mix
 
 .. note::
 
-   The examples in the section are expected to work on all CDNA accelerators.
+   The examples in the section are expected to work on all CDNA™ accelerators.
    However, the actual experiment results in this section were collected on an
    :ref:`MI2XX ` accelerator.
 
@@ -22,7 +22,7 @@ This code uses a number of inline assembly instructions to cleanly identify
 the types of instructions being issued, as well as to avoid optimization /
 dead-code elimination by the compiler. While inline assembly is inherently
 not portable, this example is expected to work on
-all GCN GPUs and CDNA accelerators.
+all GCN™ GPUs and CDNA accelerators.
 
 We reproduce a sample of the kernel as follows:
diff --git a/docs/tutorial/profiling-by-example.rst b/docs/tutorial/profiling-by-example.rst
index ed4df1124..8a9c85c03 100644
--- a/docs/tutorial/profiling-by-example.rst
+++ b/docs/tutorial/profiling-by-example.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: What is Omniperf?
+   :description: Omniperf: Profiling by example
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD
 
 ********************
diff --git a/docs/what-is-omniperf.rst b/docs/what-is-omniperf.rst
index 17d6a6b07..405a050f8 100644
--- a/docs/what-is-omniperf.rst
+++ b/docs/what-is-omniperf.rst
@@ -72,39 +72,41 @@ high level.
   * :doc:`GUI analyzer via Grafana and MongoDB `
 
-    * *System Info* panel
+    * :ref:`System Info panel `
 
-    * *System Speed-of-Light* panel
+    * :ref:`Kernel Statistic panel `
 
-    * *Kernel Statistic* panel
+    * :ref:`System Speed-of-Light panel `
 
-    * *Memory Chart Analysis* panel
+    * :ref:`Memory Chart Analysis panel `
 
-    * *Roofline Analysis* panel (*Supported on MI200 only, Ubuntu 20.04, SLES 15 SP3 or RHEL8*)
+    * :ref:`Roofline Analysis panel `
+      (*Supported on MI200 only, Ubuntu 20.04, SLES 15 SP3 or RHEL8*)
 
-    * *Command Processor (CP)* panel
+    * :ref:`Command Processor (CP) panel `
 
-    * *Workgroup Manager (SPI)* panel
+    * :ref:`Workgroup Manager (SPI) panel `
 
-    * *Wavefront Launch* panel
+    * :ref:`Wavefront Launch panel `
 
-    * *Compute Unit - Instruction Mix* panel
+    * :ref:`Compute Unit - Instruction Mix panel `
 
-    * *Compute Unit - Pipeline* panel
+    * :ref:`Compute Unit - Pipeline panel `
 
-    * *Local Data Share (LDS)* panel
+    * :ref:`Local Data Share (LDS) panel `
 
-    * *Instruction Cache* panel
+    * :ref:`Instruction Cache panel `
 
-    * *Scalar L1D Cache* panel
+    * :ref:`Scalar L1D Cache panel `
 
-    * *L1 Address Processing Unit*, or, *Texture Addresser (TA)* and *L1 Backend Data Processing Unit*, or, *Texture Data (TD)* panels
+    * :ref:`L1 Address Processing Unit, or, Texture Addresser (TA) `
+      and :ref:`L1 Backend Data Processing Unit, or, Texture Data (TD) ` panels
 
-    * *Vector L1D Cache* panel
+    * :ref:`Vector L1D Cache panel `
 
-    * *L2 Cache* panel
+    * :ref:`L2 Cache panel `
 
-    * *L2 Cache (per-channel)* panel
+    * :ref:`L2 Cache (per-channel) panel `
 
   * :ref:`Filtering ` to reduce profiling time