From e56422b816bd65b27c55818d6415f90019e39b74 Mon Sep 17 00:00:00 2001 From: Peter Jun Park Date: Fri, 26 Jul 2024 06:59:13 -0400 Subject: [PATCH] add fixes Signed-off-by: Peter Jun Park add metadata and fixes Signed-off-by: Peter Jun Park add fixes bump to 1.6.1 more fixes --- .wordlist.txt | 1 + docs/conceptual/command-processor.rst | 6 +- docs/conceptual/compute-unit.rst | 9 +- docs/conceptual/definitions.rst | 2 +- docs/conceptual/l2-cache.rst | 56 +++++---- docs/conceptual/local-data-share.rst | 30 +++-- docs/conceptual/performance-model.rst | 4 +- docs/conceptual/pipeline-descriptions.rst | 14 ++- docs/conceptual/pipeline-metrics.rst | 79 ++++++------ docs/conceptual/references.rst | 4 + docs/conceptual/shader-engine.rst | 52 ++++---- docs/conceptual/system-speed-of-light.rst | 4 + docs/conceptual/vector-l1-cache.rst | 118 +++++++++--------- docs/how-to/analyze/cli.rst | 20 +-- docs/how-to/analyze/grafana-gui.rst | 6 + docs/how-to/analyze/standalone-gui.rst | 4 + docs/how-to/use.rst | 2 +- docs/install/core-install.rst | 2 +- docs/install/grafana-setup.rst | 4 +- docs/reference/compatible-accelerators.rst | 4 +- docs/sphinx/requirements.in | 2 +- docs/sphinx/requirements.txt | 8 +- .../includes/infinity-fabric-transactions.rst | 7 +- .../valu-arithmetic-instruction-mix.rst | 4 +- docs/tutorial/profiling-by-example.rst | 2 +- docs/what-is-omniperf.rst | 36 +++--- 26 files changed, 273 insertions(+), 207 deletions(-) diff --git a/.wordlist.txt b/.wordlist.txt index 9f50063c3..b2f8b3c37 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -42,6 +42,7 @@ conf gcn isa latencies +lds lookaside mantor modulefile diff --git a/docs/conceptual/command-processor.rst b/docs/conceptual/command-processor.rst index 6664ab587..f0affd835 100644 --- a/docs/conceptual/command-processor.rst +++ b/docs/conceptual/command-processor.rst @@ -1,3 +1,7 @@ +.. 
meta:: + :description: Omniperf performance model: Command processor (CP) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command, processor, fetcher, packet processor, CPF, CPC + ********************** Command processor (CP) ********************** @@ -23,7 +27,7 @@ The command processor consists of two sub-components: Before scheduling work to the accelerator, the command processor can first acquire a memory fence to ensure system consistency :hsa-runtime-pdf:`Section 2.6.4 <91>`. After the work is complete, the -command processor can apply a memory-release fence. Depending on the AMD CDNA +command processor can apply a memory-release fence. Depending on the AMD CDNA™ accelerator under question, either of these operations *might* initiate a cache write-back or invalidation. diff --git a/docs/conceptual/compute-unit.rst b/docs/conceptual/compute-unit.rst index 7e45df8a0..09ef483ab 100644 --- a/docs/conceptual/compute-unit.rst +++ b/docs/conceptual/compute-unit.rst @@ -1,9 +1,14 @@ +.. meta:: + :description: Omniperf performance model: Compute unit (CU) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GCN, compute, unit, pipeline, workgroup, wavefront, + CDNA + ***************** Compute unit (CU) ***************** The compute unit (CU) is responsible for executing a user's kernels on -CDNA-based accelerators. All :ref:`wavefronts ` of a +CDNA™-based accelerators. All :ref:`wavefronts ` of a :ref:`workgroup ` are scheduled on the same CU. .. image:: ../data/performance-model/gcn_compute_unit.png @@ -44,7 +49,7 @@ presented by Omniperf for these pipelines are described in write-through. The vL1D caches from multiple compute units are kept coherent with one another through software instructions. -* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain +* CDNA accelerators -- that is, AMD Instinct™ MI100 and newer -- contain specialized matrix-multiplication accelerator pipelines known as the :ref:`desc-mfma`. 
diff --git a/docs/conceptual/definitions.rst b/docs/conceptual/definitions.rst index 127ef19f1..7d397730f 100644 --- a/docs/conceptual/definitions.rst +++ b/docs/conceptual/definitions.rst @@ -19,7 +19,7 @@ and in this documentation. Memory spaces ============= -AMD Instinct MI accelerators can access memory through multiple address spaces +AMD Instinct™ MI-series accelerators can access memory through multiple address spaces which may map to different physical memory locations on the system. The following table provides a view into how various types of memory used in HIP map onto these constructs: diff --git a/docs/conceptual/l2-cache.rst b/docs/conceptual/l2-cache.rst index cf30faeda..03c375665 100644 --- a/docs/conceptual/l2-cache.rst +++ b/docs/conceptual/l2-cache.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: L2 cache (TCC) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, L2, cache, infinity fabric, metrics + ************** L2 cache (TCC) ************** @@ -9,7 +13,7 @@ on the device. Besides serving requests from the for servicing requests from the :ref:`L1 instruction caches `, the :ref:`scalar L1 data caches ` and the :doc:`command processor `. The L2 cache is composed of a -number of distinct channels (32 on MI100/:ref:`MI2XX ` series CDNA +number of distinct channels (32 on MI100 and :ref:`MI2XX ` series CDNA accelerators at 256B address interleaving) which can largely operate independently. Mapping of incoming requests to a specific L2 channel is determined by a hashing mechanism that attempts to evenly distribute requests @@ -132,14 +136,14 @@ This section details the incoming requests to the L2 cache from the if only a single value is requested in a cache line, the data movement will still be counted as a full cache line. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. 
* - Requests - The total number of incoming requests to the L2 from all clients for all request types, per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Read Requests @@ -221,7 +225,7 @@ This section details the incoming requests to the L2 cache from the - The total number of L2 cache lines written back to memory for internal hardware reasons, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Writebacks (vL1D Req) @@ -229,14 +233,14 @@ This section details the incoming requests to the L2 cache from the initiated by the :doc:`vL1D cache `, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Evictions (Normal) - The total number of L2 cache lines evicted from the cache due to capacity limits, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Evictions (vL1D Req) @@ -245,7 +249,7 @@ This section details the incoming requests to the L2 cache from the :doc:`vL1D cache `, per :ref:`normalization unit `. - - Cache lines per normalization unit + - Cache lines per :ref:`normalization unit `. * - Non-hardware-Coherent Requests @@ -253,25 +257,25 @@ This section details the incoming requests to the L2 cache from the memory allocations, per :ref:`normalization unit `. See the :ref:`memory-type` for more information. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Requests - - The total number of requests to the L2 that to uncached (UC) memory + - The total number of requests to the L2 that go to Uncached (UC) memory allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. 
* - Coherently Cached Requests - - The total number of requests to the L2 that to coherently cacheable (CC) + - The total number of requests to the L2 that go to Coherently Cacheable (CC) memory allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. * - Read/Write Coherent Requests - - The total number of requests to the L2 that to Read-Write coherent memory + - The total number of requests to the L2 that go to Read-Write coherent memory (RW) allocations. See the :ref:`memory-type` for more information. - Requests per :ref:`normalization unit `. @@ -396,7 +400,7 @@ Metrics - The total number of bytes read by the L2 cache from Infinity Fabric per :ref:`normalization unit `. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - HBM Read Traffic @@ -446,7 +450,7 @@ Metrics :ref:`uncached memory ` allocations on the MI2XX. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - HBM Write and Atomic Traffic @@ -529,7 +533,7 @@ Metrics * - Read Stall - The ratio of the total number of cycles the L2-Fabric interface was - stalled on a read request to any destination (local HBM, remote PCIe + stalled on a read request to any destination (local HBM, remote PCIe® connected accelerator or CPU, or remote Infinity Fabric connected accelerator [#inf]_ or CPU) over the :ref:`total active L2 cycles `. @@ -571,7 +575,7 @@ transaction breakdown table: :ref:`l2-request-flow` for more detail. Typically unused on CDNA accelerators. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Read Requests @@ -581,7 +585,7 @@ transaction breakdown table: uncached data are counted as two 32B uncached data requests. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 64B Read Requests @@ -590,7 +594,7 @@ transaction breakdown table: :ref:`normalization unit `. 
See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - HBM Read Requests @@ -599,7 +603,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Remote Read Requests @@ -608,7 +612,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 32B Write and Atomic Requests @@ -617,7 +621,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Uncached Write and Atomic Requests @@ -626,7 +630,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - 64B Write and Atomic Requests @@ -635,7 +639,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - HBM Write and Atomic Requests @@ -644,7 +648,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Remote Write and Atomic Requests @@ -654,7 +658,7 @@ transaction breakdown table: :ref:`normalization unit `. See :ref:`l2-request-flow` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Atomic Requests @@ -668,7 +672,7 @@ transaction breakdown table: :ref:`fine-grained memory ` allocations or :ref:`uncached memory ` allocations on the MI2XX. 
- - Requests per normalization unit + - Requests per :ref:`normalization unit `. .. _l2-fabric-stalls: @@ -759,7 +763,7 @@ remote accelerators or CPUs. `Infinity Fabric `_ technology can be used to connect multiple accelerators to achieve advanced peer-to-peer connectivity and enhanced bandwidths over traditional PCIe - connections. Some AMD Instinct MI accelerators like the MI250X, + connections. Some AMD Instinct MI-series accelerators like the MI250X `feature coherent CPU↔accelerator connections built using AMD Infinity Fabric `_. .. rubric:: Disclaimer diff --git a/docs/conceptual/local-data-share.rst b/docs/conceptual/local-data-share.rst index c6b9fb9e0..33544edd8 100644 --- a/docs/conceptual/local-data-share.rst +++ b/docs/conceptual/local-data-share.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: Local data share (LDS) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, local, data, share, LDS + ********************** Local data share (LDS) ********************** @@ -38,7 +42,7 @@ the LDS as a comparison with the peak achievable values of those metrics. * - Access Rate - - Indicates the percentage of SIMDs in the :ref:`VALU ` [#1]_ + - Indicates the percentage of SIMDs in the :ref:`VALU ` [#lds-workload]_ actively issuing LDS instructions, averaged over the lifetime of the kernel. Calculated as the ratio of the total number of cycles spent by the :ref:`scheduler ` issuing :ref:`LDS ` @@ -61,13 +65,13 @@ the LDS as a comparison with the peak achievable values of those metrics. - Indicates the percentage of active LDS cycles that were spent servicing bank conflicts. Calculated as the ratio of LDS cycles spent servicing bank conflicts over the number of LDS cycles that would have been - required to move the same amount of data in an uncontended access. [#2]_ + required to move the same amount of data in an uncontended access. [#lds-bank-conflict]_ - Percent .. rubric:: Footnotes -.. 
[#1] Here we assume the typical case where the workload evenly distributes +.. [#lds-workload] Here we assume the typical case where the workload evenly distributes LDS operations over all SIMDs in a CU (that is, waves on different SIMDs are executing similar code). For highly unbalanced workloads, where e.g., one SIMD pair in the CU does not issue LDS instructions at all, this metric is @@ -75,7 +79,7 @@ the LDS as a comparison with the peak achievable values of those metrics. :ref:`SIMD pairs ` that are actively using the LDS, averaged over the lifetime of the kernel. -.. [#2] The maximum value of the bank conflict rate is less than 100% +.. [#lds-bank-conflict] The maximum value of the bank conflict rate is less than 100% (specifically: 96.875%), as the first cycle in the :ref:`LDS scheduler ` is never considered contended. @@ -101,7 +105,7 @@ The LDS statistics panel gives a more detailed view of the hardware: read/write/atomics and HIP's ``__shfl`` instructions) executed per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - Theoretical Bandwidth @@ -112,7 +116,7 @@ The LDS statistics panel gives a more detailed view of the hardware: executed. See the :ref:`LDS bandwidth example ` for more detail. - - Bytes per normalization unit + - Bytes per :ref:`normalization unit `. * - LDS Latency @@ -136,14 +140,14 @@ The LDS statistics panel gives a more detailed view of the hardware: - The total number of cycles spent in the :ref:`LDS scheduler ` over all operations per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Atomic Return Cycles - The total number of cycles spent on LDS atomics with return per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. 
* - Bank Conflicts @@ -151,7 +155,7 @@ The LDS statistics panel gives a more detailed view of the hardware: due to bank conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Address Conflicts @@ -159,7 +163,7 @@ The LDS statistics panel gives a more detailed view of the hardware: due to address conflicts (as determined by the conflict resolution hardware) per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Unaligned Stall @@ -167,13 +171,13 @@ The LDS statistics panel gives a more detailed view of the hardware: due to stalls from non-dword aligned addresses per :ref:`normalization unit `. - - Cycles per normalization unit + - Cycles per :ref:`normalization unit `. * - Memory Violations - The total number of out-of-bounds accesses made to the LDS, per :ref:`normalization unit `. This is unused and - expected to be zero in most configurations for modern CDNA accelerators. + expected to be zero in most configurations for modern CDNA™ accelerators. - - Accesses per normalization unit + - Accesses per :ref:`normalization unit `. diff --git a/docs/conceptual/performance-model.rst b/docs/conceptual/performance-model.rst index 4ed821b77..1a94b3ed6 100644 --- a/docs/conceptual/performance-model.rst +++ b/docs/conceptual/performance-model.rst @@ -1,5 +1,5 @@ .. 
meta:: - :description: Omniperf documentation and reference + :description: Omniperf performance model :keywords: Omniperf, ROCm, performance, model, profiler, tool, Instinct, accelerator, AMD @@ -9,7 +9,7 @@ Performance model Omniperf makes available an extensive list of metrics to better understand achieved application performance on AMD Instinct™ MI-series accelerators -including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA +including Graphics Core Next™ (GCN) GPUs like the AMD Instinct MI50, CDNA™ accelerators like the MI100, and CDNA2 accelerators such as the MI250X, MI250, and MI210. diff --git a/docs/conceptual/pipeline-descriptions.rst b/docs/conceptual/pipeline-descriptions.rst index ee12e5c80..b781218fe 100644 --- a/docs/conceptual/pipeline-descriptions.rst +++ b/docs/conceptual/pipeline-descriptions.rst @@ -1,3 +1,8 @@ +.. meta:: + :description: Omniperf performance model: Pipeline descriptions + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, pipeline, VALU, SALU, VMEM, SMEM, LDS, branch, + scheduler, MFMA, AGPRs + ********************* Pipeline descriptions ********************* @@ -14,8 +19,8 @@ Vector arithmetic logic unit (VALU) The vector arithmetic logic unit (VALU) executes vector instructions over an entire wavefront, each :ref:`work-item ` (or, -vector-lane) potentially operating on distinct data. The VALU of a CDNA -accelerator or GCN GPU typically consists of: +vector-lane) potentially operating on distinct data. The VALU of a CDNA™ +accelerator or GCN™ GPU typically consists of: * Four 16-wide SIMD processors (see :hip-training-pdf:`24` for more details). @@ -282,13 +287,12 @@ instructions (``v_accvgpr_*``). These data movement instructions may be used by the compiler to implement lower-cost register-spill/fills on architectures with AGPRs. -AGPRs are not available on all AMD Instinct accelerators. GCN GPUs, +AGPRs are not available on all AMD Instinct™ accelerators. 
GCN GPUs, such as the AMD Instinct MI50 had a 256 KiB VGPR file. The AMD Instinct MI100 (CDNA) has a 2x256 KiB register file, where one half is available as general-purpose VGPRs, and the other half is for matrix math accumulation VGPRs (AGPRs). The AMD Instinct :ref:`MI2XX ` (CDNA2) has a 512 KiB VGPR file per CU, where each wave can dynamically request up to 256 KiB of VGPRs and an additional 256 KiB of AGPRs. For more information, -refer to -``__. +refer to `this comment `_. diff --git a/docs/conceptual/pipeline-metrics.rst b/docs/conceptual/pipeline-metrics.rst index 17f27e317..e86132a96 100644 --- a/docs/conceptual/pipeline-metrics.rst +++ b/docs/conceptual/pipeline-metrics.rst @@ -1,3 +1,8 @@ +.. meta:: + :description: Omniperf performance model: Pipeline metrics + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, pipeline, wavefront, metrics, launch, runtime + VALU, MFMA, instruction mix, FLOPs, arithmetic, operations + **************** Pipeline metrics **************** @@ -47,7 +52,7 @@ kernel launch: * - Total Wavefronts - The total number of wavefronts launched as part of the kernel dispatch. - On AMD Instinct CDNA accelerators and GCN GPUs, the wavefront size is + On AMD Instinct™ CDNA™ accelerators and GCN™ GPUs, the wavefront size is always 64 work-items. Thus, the total number of wavefronts should be equivalent to the ceiling of grid size divided by 64. @@ -212,11 +217,11 @@ execution of wavefronts in a kernel: .. note:: As mentioned earlier, the measurement of kernel cycles and time typically - cannot directly be compared to e.g., Wave Cycles. This is due to two factors: + cannot be directly compared to, for example, wave cycles. This is due to two factors: first, the kernel cycles/timings are measured using a counter that is impacted by scheduling overhead, this is particularly noticeable for "short-running" kernels (less than 1ms) where scheduling overhead forms a - significant portion of the overall kernel runtime. 
Secondly, the Wave Cycles + significant portion of the overall kernel runtime. Secondly, the wave cycles metric is incremented per-wavefront scheduled to a SIMD every cycle whereas the kernel cycles counter is incremented only once per-cycle when *any* wavefront is scheduled. @@ -240,9 +245,9 @@ instructions. change regardless of the execution mask of the wavefront. Note that even if the execution mask is identically zero (meaning that *no lanes are active*) the instruction will still be counted, as CDNA accelerators still consider - these instructions *issued*. See for example - :mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` and - further details. + these instructions *issued*. See + :mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` for + examples and further details. Overall instruction mix ----------------------- @@ -355,14 +360,14 @@ additions executed as part of an MFMA instruction using the same precision. - The total number of instructions operating on 32-bit integer operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - INT64 - The total number of instructions operating on 64-bit integer operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-ADD @@ -370,7 +375,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-MUL @@ -378,7 +383,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. 
* - F16-FMA @@ -386,7 +391,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F16-TRANS @@ -394,7 +399,7 @@ additions executed as part of an MFMA instruction using the same precision. on 16-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-ADD @@ -402,7 +407,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-MUL @@ -410,7 +415,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-FMA @@ -418,7 +423,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F32-TRANS @@ -426,7 +431,7 @@ additions executed as part of an MFMA instruction using the same precision. operating on 32-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-ADD @@ -434,7 +439,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. 
* - F64-MUL @@ -442,7 +447,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-FMA @@ -450,7 +455,7 @@ additions executed as part of an MFMA instruction using the same precision. floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - F64-TRANS @@ -458,7 +463,7 @@ additions executed as part of an MFMA instruction using the same precision. operating on 64-bit floating-point operands issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - Conversion @@ -466,7 +471,7 @@ additions executed as part of an MFMA instruction using the same precision. to or from F32↔F64) issued to the VALU per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. For an example of these counters in action, refer to :ref:`valu-arith-instruction-mix-ex`. @@ -485,7 +490,7 @@ instructions. .. _mfma-instruction-mix: MFMA instruction mix -^^^^^^^^^^^^^^^^^^^^ +-------------------- .. warning:: @@ -512,35 +517,35 @@ MFMA instructions are classified by the type of input data they operate on, and - The total number of 8-bit integer :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F16 Instructions - The total number of 16-bit floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-BF16 Instructions - The total number of 16-bit brain floating point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. 
- - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F32 Instructions - The total number of 32-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. * - MFMA-F64 Instructions - The total number of 64-bit floating-point :ref:`MFMA ` instructions issued per :ref:`normalization unit `. - - Instructions per normalization unit + - Instructions per :ref:`normalization unit `. Compute pipeline ================ @@ -686,9 +691,9 @@ Pipeline statistics This section reports a number of key performance characteristics of various execution units on the :doc:`CU `. Refer to -:ref:`ipc-example` for a detailed dive into these metrics, and -:ref:`scheduler ` for a high-level overview of execution units -and instruction issue. +:ref:`ipc-example` for a detailed dive into these metrics, and the +:ref:`scheduler ` the for a high-level overview of execution +units and instruction issue. .. list-table:: :header-rows: 1 @@ -850,7 +855,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - IOPs (Total) @@ -858,7 +863,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - IOP per normalization unit + - IOP per :ref:`normalization unit `. * - F16 OPs @@ -866,7 +871,7 @@ not. For more detail on how operations are counted see the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - BF16 OPs @@ -875,7 +880,7 @@ not. For more detail on how operations are counted see the :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native BF16 instructions. 
- - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - F32 OPs @@ -883,7 +888,7 @@ not. For more detail on how operations are counted see the the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - F64 OPs @@ -891,7 +896,7 @@ not. For more detail on how operations are counted see the the :ref:`VALU ` or :ref:`MFMA ` units, per :ref:`normalization unit `. - - FLOP per normalization unit + - FLOP per :ref:`normalization unit `. * - INT8 OPs @@ -900,5 +905,5 @@ not. For more detail on how operations are counted see the :ref:`normalization unit `. Note: on current CDNA accelerators, the VALU has no native INT8 instructions. - - IOPs per normalization unit + - IOPs per :ref:`normalization unit `. diff --git a/docs/conceptual/references.rst b/docs/conceptual/references.rst index cc0f36fe3..9f3d32cd8 100644 --- a/docs/conceptual/references.rst +++ b/docs/conceptual/references.rst @@ -1,3 +1,7 @@ +.. meta:: + :description: Omniperf performance model: References + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, HIP, GCN, LLVM, docs, documentation, training + ********** References ********** diff --git a/docs/conceptual/shader-engine.rst b/docs/conceptual/shader-engine.rst index eeb9b6f3b..2ecfb4575 100644 --- a/docs/conceptual/shader-engine.rst +++ b/docs/conceptual/shader-engine.rst @@ -1,8 +1,12 @@ +.. meta:: + :description: Omniperf performance model: Shader engine (SE) + :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, shader, engine, sL1D, L1I, workgroup manager, SPI + ****************** Shader engine (SE) ****************** -The :doc:`compute units ` on a CDNA accelerator are grouped +The :doc:`compute units ` on a CDNA™ accelerator are grouped together into a higher-level organizational unit called a shader engine (SE): .. 
figure:: ../data/performance-model/selayout.png @@ -14,7 +18,7 @@ together into a higher-level organizational unit called a shader engine (SE): The number of CUs on a SE varies from chip to chip -- see for example :hip-training-pdf:`20`. In addition, newer accelerators such as the AMD -Instinct MI 250X have 8 SEs per accelerator. +Instinct™ MI250X have 8 SEs per accelerator. For the purposes of Omniperf, we consider resources that are shared between multiple CUs on a single SE as part of the SE's metrics. @@ -36,7 +40,7 @@ The Scalar L1 Data cache (sL1D) can cache data accessed from scalar load instructions (and scalar store instructions on architectures where they exist) from wavefronts in the :doc:`CUs `. The sL1D is shared between multiple CUs (:gcn-crash-course:`36`) -- the exact number of CUs depends on the -architecture in question (3 CUs in GCN GPUs and MI100, 2 CUs in +architecture in question (3 CUs in GCN™ GPUs and MI100, 2 CUs in :ref:`MI2XX `) -- and is backed by the :doc:`L2 cache `. In typical usage, the data in the sL1D is comprised of: @@ -123,14 +127,14 @@ and the hit/miss statistics. - The total number of requests, of any size or type, made to the sL1D per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Hits - The total number of sL1D requests that hit on a previously loaded cache line, per :ref:`normalization unit `. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Misses - Non Duplicated @@ -139,7 +143,7 @@ and the hit/miss statistics. :ref:`normalization unit `. See :ref:`desc-sl1d-sol` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. * - Misses - Duplicated @@ -148,7 +152,7 @@ and the hit/miss statistics. :ref:`normalization unit `. See :ref:`desc-sl1d-sol` for more detail. - - Requests per normalization unit + - Requests per :ref:`normalization unit `. 
   * - Cache Hit Rate
 
@@ -163,7 +167,7 @@ and the hit/miss statistics.
     - The total number of sL1D read requests of any size, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Atomic Requests
 
@@ -171,42 +175,42 @@ and the hit/miss statistics.
       :ref:`normalization unit `. Typically unused on CDNA
       accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (1 DWord)
 
     - The total number of sL1D read requests made for a single dword of data
       (4B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (2 DWord)
 
     - The total number of sL1D read requests made for a two dwords of data
       (8B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (4 DWord)
 
     - The total number of sL1D read requests made for a four dwords of data
       (16B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (8 DWord)
 
     - The total number of sL1D read requests made for a eight dwords of data
       (32B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Read Requests (16 DWord)
 
     - The total number of sL1D read requests made for a sixteen dwords of data
       (64B), per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
 .. _desc-sl1d-l2-interface:
 
@@ -233,14 +237,14 @@ sL1D↔:doc:`L2 ` interface.
       and atomics are typically unused on current CDNA accelerators, so in
       the majority of cases this can be interpreted as an sL1D→L2 read
       bandwidth.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - Read Requests
 
     - The total number of read requests from sL1D to the
       :doc:`L2 `, per :ref:`normalization unit `.
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Write Requests
 
@@ -248,7 +252,7 @@ sL1D↔:doc:`L2 ` interface.
       per :ref:`normalization unit `. Typically unused on
       current CDNA accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Atomic Requests
 
@@ -257,14 +261,14 @@ sL1D↔:doc:`L2 ` interface.
       :ref:`normalization unit `. Typically unused on current
       CDNA accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Stall Cycles
 
     - The total number of cycles the sL1D↔:doc:`L2 ` interface
       was stalled, per :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
 .. rubric:: Footnotes
@@ -373,14 +377,14 @@ This panel gives more detail on the hit/miss statistics of the L1I:
 
     - The total number of requests made to the L1I per
       :ref:`normalization-unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hits
 
     - The total number of L1I requests that hit on a previously loaded cache
      line, per :ref:`normalization-unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Misses - Non Duplicated
 
@@ -389,7 +393,7 @@ This panel gives more detail on the hit/miss statistics of the L1I:
       :ref:`normalization-unit `. See note in
       :ref:`desc-l1i-sol` for more detail.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Misses - Duplicated
 
@@ -398,7 +402,7 @@ This panel gives more detail on the hit/miss statistics of the L1I:
       :ref:`normalization-unit `. See note in
       :ref:`desc-l1i-sol` for more detail.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Cache Hit Rate
 
@@ -428,7 +432,7 @@ L1I-:doc:`L2 ` interface.
     - The total number of bytes read across the L1I-:doc:`L2 `
       interface, per :ref:`normalization unit `.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
 .. rubric:: Footnotes
diff --git a/docs/conceptual/system-speed-of-light.rst b/docs/conceptual/system-speed-of-light.rst
index 4c2c462ef..fc758a698 100644
--- a/docs/conceptual/system-speed-of-light.rst
+++ b/docs/conceptual/system-speed-of-light.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf performance model: System Speed-of-Light
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, system, speed of light
+
 *********************
 System Speed-of-Light
 *********************
diff --git a/docs/conceptual/vector-l1-cache.rst b/docs/conceptual/vector-l1-cache.rst
index 78325a7a4..42b740cf2 100644
--- a/docs/conceptual/vector-l1-cache.rst
+++ b/docs/conceptual/vector-l1-cache.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf performance model: Vector L1 cache (vL1D)
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, vector, l1, cache, vl1d
+
 **********************
 Vector L1 cache (vL1D)
 **********************
@@ -37,7 +41,7 @@ operations issued by a wavefront. The vL1D cache consists of several components:
 Together, this complex is known as the vL1D, or Texture Cache per Pipe (TCP).
 A simplified diagram of the vL1D is presented below:
 
-.. figure:: ../data/performance-model/l1perf_model.*
+.. figure:: ../data/performance-model/l1perf_model.png
    :align: center
    :alt: Performance model of the vL1D Cache on AMD Instinct
@@ -89,7 +93,7 @@ as a comparison with the peak achievable values of those metrics.
 
   * - Utilization
 
-    - Indicates how busy the :ref:`vL1D Cache RAM ` was during the
+    - Indicates how busy the :ref:`vL1D Cache RAM ` was during the
       kernel execution. The number of cycles where the vL1D Cache RAM is
       actively processing any request divided by the number of cycles where
       the vL1D is active [#vl1d-activity]_.
@@ -100,7 +104,7 @@ as a comparison with the peak achievable values of those metrics.
     - Indicates how well memory instructions were coalesced by the
       :ref:`address processing unit `, ranging from uncoalesced (25%)
-      to fully coalesced (100%). The average number of
+      to fully coalesced (100%). Calculated as the average number of
       :ref:`thread-requests ` generated per instruction divided
       by the ideal number of thread-requests per instruction.
@@ -221,7 +225,7 @@ kernel. These are broken down into a few major categories:
 
   - Private memory, or "scratch" memory, is only visible to a particular
     :ref:`work-item ` in a particular
-    :ref:`workgroup `. On AMD Instinct MI-series
+    :ref:`workgroup `. On AMD Instinct™ MI-series
     accelerators, private memory is used to implement both register spills
     and stack memory accesses.
 
@@ -242,7 +246,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Read
 
@@ -250,7 +254,7 @@ The address processor counts these instruction types as follows:
       all :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Write
 
@@ -258,7 +262,7 @@ The address processor counts these instruction types as follows:
       on all :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Global/Generic Atomic
 
@@ -266,7 +270,7 @@ The address processor counts these instruction types as follows:
       return) instructions executed on all :doc:`compute units `
       on the accelerator, per :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
   * - Spill/Stack
 
@@ -274,7 +278,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Spill/Stack Read
 
@@ -282,7 +286,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Spill/Stack Write
 
@@ -290,7 +294,7 @@ The address processor counts these instruction types as follows:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
 
-    - Instruction per normalization unit
+    - Instruction per :ref:`normalization unit `.
 
   * - Spill/Stack Atomic
 
@@ -300,7 +304,7 @@ The address processor counts these instruction types as follows:
       Typically unused as these memory operations are typically used to
       implement thread-local storage.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
 .. note::
 
@@ -343,7 +347,7 @@ stage for spill/stack memory, and thus reports:
       spill/stack instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
   * - Spill/Stack Coalesced Read Cycles
 
@@ -351,7 +355,7 @@ stage for spill/stack memory, and thus reports:
       coalesced spill/stack read instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
   * - Spill/Stack Coalesced Write Cycles
 
@@ -359,7 +363,7 @@ stage for spill/stack memory, and thus reports:
       coalesced spill/stack write instructions, per
       :ref:`normalization unit `.
 
-    - Cycles per normalization unit
+    - Cycles per :ref:`normalization unit `.
 
 .. _desc-utcl1:
 
@@ -389,14 +393,14 @@ Omniperf reports the following L1 TLB metrics:
 
     - The number of translation requests made to the UTCL1 per
       :ref:`normalization unit `.
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hits
 
    - The number of translation requests that hit in the UTCL1, and could be
      reused, per :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Hit Ratio
 
@@ -411,16 +415,16 @@ Omniperf reports the following L1 TLB metrics:
       translation not being present in the cache, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Permission Misses
 
    - The total number of translation requests that missed in the UTCL1 due
      to a permission error, per :ref:`normalization unit `. This is
      unused and expected to be zero in most configurations for modern
-      CDNA accelerators.
+      CDNA™ accelerators.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
 .. note::
 
@@ -527,7 +531,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`address processing unit ` after coalescing per
       :ref:`normalization unit `
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - Cache Bandwidth
 
@@ -539,7 +543,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - Cache Hit Rate [#vl1d-hit]_
 
@@ -562,7 +566,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       serviced by the :ref:`vL1D Cache RAM ` per
       :ref:`normalization unit `.
 
-    - Cache lines per normalization unit
+    - Cache lines per :ref:`normalization unit `.
 
   * - Invalidations
 
@@ -571,7 +575,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`normalization unit `. This may be triggered by,
       for instance, the ``buffer_wbinvl1`` instruction.
-    - Invalidations per normalization unit
+    - Invalidations per :ref:`normalization unit `.
 
   * - L1-L2 Bandwidth
 
@@ -583,7 +587,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       instance, if only a single value is requested in a cache line, the data
       movement will still be counted as a full cache line.
 
-    - Bytes per normalization unit
+    - Bytes per :ref:`normalization unit `.
 
   * - L1-L2 Reads
 
@@ -592,7 +596,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :doc:`L2 Cache ` per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1-L2 Writes
 
@@ -600,7 +604,7 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       the vL1D to the :doc:`L2 cache `, per
       :ref:`normalization unit `.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1-L2 Atomics
 
@@ -609,27 +613,27 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
       :ref:`normalization unit `. This includes requests for
       atomics with, and without return.
 
-    - Requests per normalization unit
+    - Requests per :ref:`normalization unit `.
 
   * - L1 Access Latency
 
-    - The average number of cycles that a vL1D cache line request spent in the
-      vL1D cache pipeline.
+    - Calculated as the average number of cycles that a vL1D cache line request
+      spent in the vL1D cache pipeline.
 
    - Cycles
 
   * - L1-L2 Read Access Latency
 
-    - The average number of cycles that the vL1D cache took to issue and
-      receive read requests from the :doc:`L2 Cache `. This number
-      also includes requests for atomics with return values.
+    - Calculated as the average number of cycles that the vL1D cache took to
+      issue and receive read requests from the :doc:`L2 Cache `. This
+      number also includes requests for atomics with return values.
    - Cycles
 
  * - L1-L2 Write Access Latency
 
-    - The average number of cycles that the vL1D cache took to issue and
-      receive acknowledgement of a write request to the
+    - Calculated as the average number of cycles that the vL1D cache took to
+      issue and receive acknowledgement of a write request to the
      :doc:`L2 Cache `. This number also includes requests for
      atomics without return values.
 
@@ -639,7 +643,22 @@ latencies of read/write memory operations to the :doc:`L2 cache `.
   All cache accesses in vL1D are for a single cache line's worth of data.
   The size of a cache line may vary, however on current AMD Instinct MI CDNA
-  accelerators and GCN GPUs the L1 cache line size is 64B.
+  accelerators and GCN™ GPUs the L1 cache line size is 64B.
+
+.. rubric :: Footnotes
+
+.. [#vl1d-hit] The vL1D cache on AMD Instinct MI-series CDNA accelerators
+   uses a "hit-on-miss" approach to reporting cache hits. That is, if while
+   satisfying a miss, another request comes in that would hit on the same
+   pending cache line, the subsequent request will be counted as a "hit".
+   Therefore, it is also important to consider the access latency metric in the
+   :ref:`Cache access metrics ` section when
+   evaluating the vL1D hit rate.
+
+.. [#vl1d-activity] Omniperf considers the vL1D to be active when any part of
+   the vL1D (excluding the :ref:`address processor ` and
+   :ref:`data return ` units) are active, for example, when performing
+   a translation, waiting for data, accessing the Tag or Cache RAMs, etc.
 
 .. _vl1d-l2-transaction-detail:
 
@@ -707,7 +726,7 @@ Omniperf reports the following vL1D data-return path metrics:
       :ref:`address processor ` that were found to be coalescable, per
       :ref:`normalization unit `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Read Instructions
 
@@ -717,9 +736,9 @@ Omniperf reports the following vL1D data-return path metrics:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `.
       This is expected to be the sum of global/generic and spill/stack reads in the
-      :ref:`address processor `.
+      :ref:`address processor `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Write Instructions
 
@@ -731,7 +750,7 @@ Omniperf reports the following vL1D data-return path metrics:
       the sum of global/generic and spill/stack stores counted by the
       :ref:`vL1D cache-front-end `.
 
-    - Instructions per normalization unit
+    - Instructions per :ref:`normalization unit `.
 
   * - Atomic Instructions
 
@@ -741,22 +760,7 @@ Omniperf reports the following vL1D data-return path metrics:
       :doc:`compute units ` on the accelerator, per
       :ref:`normalization unit `. This is expected to be the
       sum of global/generic and spill/stack atomics in the
-      :ref:`address processor `.
+      :ref:`address processor `.
 
-    - Instructions per normalization unit
-
-.. rubric :: Footnotes
-
-.. [#vl1d-hit] The vL1D cache on AMD Instinct MI-series CDNA accelerators
-   uses a "hit-on-miss" approach to reporting cache hits. That is, if while
-   satisfying a miss, another request comes in that would hit on the same
-   pending cache line, the subsequent request will be counted as a "hit".
-   Therefore, it is also important to consider the access latency metric in the
-   :ref:`Cache access metrics ` section when
-   evaluating the vL1D hit rate.
-
-.. [#vl1d-activity] Omniperf considers the vL1D to be active when any part of
-   the vL1D (excluding the :ref:`address processor ` and
-   :ref:`data return ` units) are active, for example, when performing
-   a translation, waiting for data, accessing the Tag or Cache RAMs, etc.
+    - Instructions per :ref:`normalization unit `.
diff --git a/docs/how-to/analyze/cli.rst b/docs/how-to/analyze/cli.rst
index 61b213fab..f76e3970f 100644
--- a/docs/how-to/analyze/cli.rst
+++ b/docs/how-to/analyze/cli.rst
@@ -1,10 +1,14 @@
+.. meta::
+   :description: Omniperf analysis: CLI analysis
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command line, analyze, filtering, metrics, baseline, comparison
+
 ************
 CLI analysis
 ************
 
-The following is a look into Omniperf's CLI analysis features.
+This section provides an overview of Omniperf's CLI analysis features.
 
-* **Derived metrics**: All of Omniperf's built-in metrics.
+* :ref:`Derived metrics `: All of Omniperf's built-in metrics.
 
 * :ref:`Baseline comparison `: Compare multiple runs in a
   side-by-side manner.
@@ -22,7 +26,7 @@ Run ``omniperf analyze -h`` for more details.
 Walkthrough
 ===========
 
-#. To begin, generate a high-level analysis report using Omniperf's ``-b`` (or ``--block``) flag.
+1. To begin, generate a high-level analysis report using Omniperf's ``-b`` (or ``--block``) flag.
 
    .. code-block:: shell
 
 ...
 
@@ -126,7 +130,9 @@ Walkthrough
 
 ...
 
-#. Use ``--list-metrics`` to generate a list of available metrics for inspection.
+.. _cli-list-metrics:
+
+2. Use ``--list-metrics`` to generate a list of available metrics for inspection.
 
    .. code-block:: shell
 
 ...
 
@@ -178,7 +184,7 @@ Walkthrough
          2.1.30 -> L1I Fetch Latency
 ...
 
-#. Choose your own customized subset of metrics with the ``-b`` (or ``--block``)
+3. Choose your own customized subset of metrics with the ``-b`` (or ``--block``)
    option. Or, build your own configuration following
   `config_template `_. The following snippet shows how
   to generate a report containing only metric 2
@@ -271,10 +277,10 @@ Walkthrough
 
   Some cells may be blank indicating a missing or unavailable hardware counter
   or NULL value.
 
-#. Optimize the application, iterate, and re-profile to inspect performance
+4. Optimize the application, iterate, and re-profile to inspect performance
   changes.
 
-#. Redo a comprehensive analysis with Omniperf CLI at any optimization
+5. Redo a comprehensive analysis with Omniperf CLI at any optimization
   milestone.
 
 .. _cli-analysis-options:
diff --git a/docs/how-to/analyze/grafana-gui.rst b/docs/how-to/analyze/grafana-gui.rst
index 80c2a8a1e..403b9f7b1 100644
--- a/docs/how-to/analyze/grafana-gui.rst
+++ b/docs/how-to/analyze/grafana-gui.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf analysis: Grafana GUI
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, Grafana, panels, GUI, import
+
 ********************
 Grafana GUI analysis
 ********************
@@ -845,6 +849,8 @@ Texture Addresser
 instructions) and write/atomic data from the Compute Unit (CU), and coalesces
 them into fewer requests for the cache to process.
 
+.. _grafana-panel-td:
+
 Texture Data
 ++++++++++++
diff --git a/docs/how-to/analyze/standalone-gui.rst b/docs/how-to/analyze/standalone-gui.rst
index 16c0392a0..66f855c8c 100644
--- a/docs/how-to/analyze/standalone-gui.rst
+++ b/docs/how-to/analyze/standalone-gui.rst
@@ -1,3 +1,7 @@
+.. meta::
+   :description: Omniperf analysis: Standalone GUI
+   :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GUI, standalone, filter
+
 ***********************
 Standalone GUI analysis
 ***********************
diff --git a/docs/how-to/use.rst b/docs/how-to/use.rst
index 9f838a8f4..7377dd9f9 100644
--- a/docs/how-to/use.rst
+++ b/docs/how-to/use.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: Omniperf basic usage documentation.
+   :description: Omniperf basic usage
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
              basics, usage, operations
diff --git a/docs/install/core-install.rst b/docs/install/core-install.rst
index 5fa203421..07455b04f 100644
--- a/docs/install/core-install.rst
+++ b/docs/install/core-install.rst
@@ -213,7 +213,7 @@ software stack.
 
       .. code-block:: shell
 
-         $ sudo yum install omniperf
+         $ sudo dnf install omniperf
          $ pip install -r /opt/rocm/libexec/omniperf/requirements.txt
 
    .. tab-item:: SUSE Linux Enterprise Server
diff --git a/docs/install/grafana-setup.rst b/docs/install/grafana-setup.rst
index 44e947a5a..ac1436511 100644
--- a/docs/install/grafana-setup.rst
+++ b/docs/install/grafana-setup.rst
@@ -1,7 +1,7 @@
 .. meta::
-   :description: Omniperf client-side installation and deployment
+   :description: Omniperf Grafana server installation and deployment
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD,
-              install, deploy, Grafana, server, configuration,
+              install, deploy, Grafana, server, configuration, GUI
 
 ****************************************
 Setting up a Grafana server for Omniperf
 ****************************************
diff --git a/docs/reference/compatible-accelerators.rst b/docs/reference/compatible-accelerators.rst
index 30eaf6f6e..b93c72032 100644
--- a/docs/reference/compatible-accelerators.rst
+++ b/docs/reference/compatible-accelerators.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: Omniperf - compatible accelerators and GPUs
+   :description: Omniperf support: compatible accelerators and GPUs
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD, GPU
 
 ***********************
@@ -23,7 +23,7 @@ GPU specifications.
   * - Platform
     - Status
 
-  * - AMD Instinct MI300
+  * - AMD Instinct™ MI300
     - Supported ✅
 
   * - AMD Instinct MI200
diff --git a/docs/sphinx/requirements.in b/docs/sphinx/requirements.in
index 8b87d40cc..e503806ca 100644
--- a/docs/sphinx/requirements.in
+++ b/docs/sphinx/requirements.in
@@ -1,2 +1,2 @@
-rocm-docs-core==1.6.0
+rocm-docs-core==1.6.1
 sphinxcontrib.datatemplates==0.11.0
diff --git a/docs/sphinx/requirements.txt b/docs/sphinx/requirements.txt
index 794845c76..82d64eb29 100644
--- a/docs/sphinx/requirements.txt
+++ b/docs/sphinx/requirements.txt
@@ -2,7 +2,7 @@
 # This file is autogenerated by pip-compile with Python 3.10
 # by the following command:
 #
-#    pip-compile docs/sphinx/requirements.in
+#    pip-compile requirements.in
 #
 accessible-pygments==0.0.5
     # via pydata-sphinx-theme
@@ -95,8 +95,8 @@ requests==2.32.3
     # via
     #   pygithub
     #   sphinx
-rocm-docs-core==1.6.0
-    # via -r docs/sphinx/requirements.in
+rocm-docs-core==1.6.1
+    # via -r requirements.in
 smmap==5.0.1
     # via gitdb
 snowballstemmer==2.2.0
@@ -129,7 +129,7 @@ sphinx-notfound-page==1.0.2
 sphinxcontrib-applehelp==1.0.8
     # via sphinx
 sphinxcontrib-datatemplates==0.11.0
-    # via -r docs/sphinx/requirements.in
+    # via -r requirements.in
 sphinxcontrib-devhelp==1.0.6
     # via sphinx
 sphinxcontrib-htmlhelp==2.0.6
diff --git a/docs/tutorial/includes/infinity-fabric-transactions.rst b/docs/tutorial/includes/infinity-fabric-transactions.rst
index 320f0d523..bb198d909 100644
--- a/docs/tutorial/includes/infinity-fabric-transactions.rst
+++ b/docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -157,7 +157,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
 
 .. note::
 
-   On some systems (e.g., those with only PCIe connected accelerators), you need
+   On some systems (e.g., those with only PCIe® connected accelerators), you need
    to set the environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` to enable this
    memory type.
 
@@ -642,6 +642,11 @@ MI250, to e.g., the CPU's DRAM.
 In this light, we see that these requests correspond to *system scope*
 atomics, and specifically in the case of the MI250, to fine-grained memory!
+
+.. rubric:: Disclaimer
+
+PCIe® is a registered trademark of PCI-SIG Corporation.
+
 .. `Leave as possible future experiment to add
diff --git a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
index b3bc63b42..63496b94d 100644
--- a/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
+++ b/docs/tutorial/includes/valu-arithmetic-instruction-mix.rst
@@ -9,7 +9,7 @@ VALU arithmetic instruction mix
 
 .. note::
 
-   The examples in the section are expected to work on all CDNA accelerators.
+   The examples in the section are expected to work on all CDNA™ accelerators.
    However, the actual experiment results in this section were collected on an
    :ref:`MI2XX ` accelerator.
 
@@ -22,7 +22,7 @@ This code uses a number of inline assembly instructions to cleanly identify
 the types of instructions being issued, as well as to avoid optimization /
 dead-code elimination by the compiler. While inline assembly is inherently
 not portable, this example is expected to work on
-all GCN GPUs and CDNA accelerators.
+all GCN™ GPUs and CDNA accelerators.
 
 We reproduce a sample of the kernel as follows:
diff --git a/docs/tutorial/profiling-by-example.rst b/docs/tutorial/profiling-by-example.rst
index ed4df1124..8a9c85c03 100644
--- a/docs/tutorial/profiling-by-example.rst
+++ b/docs/tutorial/profiling-by-example.rst
@@ -1,5 +1,5 @@
 .. meta::
-   :description: What is Omniperf?
+   :description: Omniperf: Profiling by example
    :keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, AMD
 
 ********************
diff --git a/docs/what-is-omniperf.rst b/docs/what-is-omniperf.rst
index 17d6a6b07..405a050f8 100644
--- a/docs/what-is-omniperf.rst
+++ b/docs/what-is-omniperf.rst
@@ -72,39 +72,41 @@ high level.
   * :doc:`GUI analyzer via Grafana and MongoDB `
 
-    * *System Info* panel
+    * :ref:`System Info panel `
 
-    * *System Speed-of-Light* panel
+    * :ref:`Kernel Statistic panel `
 
-    * *Kernel Statistic* panel
+    * :ref:`System Speed-of-Light panel `
 
-    * *Memory Chart Analysis* panel
+    * :ref:`Memory Chart Analysis panel `
 
-    * *Roofline Analysis* panel (*Supported on MI200 only, Ubuntu 20.04, SLES 15 SP3 or RHEL8*)
+    * :ref:`Roofline Analysis panel `
+      (*Supported on MI200 only, Ubuntu 20.04, SLES 15 SP3 or RHEL8*)
 
-    * *Command Processor (CP)* panel
+    * :ref:`Command Processor (CP) panel `
 
-    * *Workgroup Manager (SPI)* panel
+    * :ref:`Workgroup Manager (SPI) panel `
 
-    * *Wavefront Launch* panel
+    * :ref:`Wavefront Launch panel `
 
-    * *Compute Unit - Instruction Mix* panel
+    * :ref:`Compute Unit - Instruction Mix panel `
 
-    * *Compute Unit - Pipeline* panel
+    * :ref:`Compute Unit - Pipeline panel `
 
-    * *Local Data Share (LDS)* panel
+    * :ref:`Local Data Share (LDS) panel `
 
-    * *Instruction Cache* panel
+    * :ref:`Instruction Cache panel `
 
-    * *Scalar L1D Cache* panel
+    * :ref:`Scalar L1D Cache panel `
 
-    * *L1 Address Processing Unit*, or, *Texture Addresser (TA)* and *L1 Backend Data Processing Unit*, or, *Texture Data (TD)* panels
+    * :ref:`L1 Address Processing Unit, or, Texture Addresser (TA) `
+      and :ref:`L1 Backend Data Processing Unit, or, Texture Data (TD) ` panels
 
-    * *Vector L1D Cache* panel
+    * :ref:`Vector L1D Cache panel `
 
-    * *L2 Cache* panel
+    * :ref:`L2 Cache panel `
 
-    * *L2 Cache (per-channel)* panel
+    * :ref:`L2 Cache (per-channel) panel `
 
   * :ref:`Filtering ` to reduce profiling time