add fixes
Signed-off-by: Peter Jun Park <[email protected]>

add metadata and fixes

Signed-off-by: Peter Jun Park <[email protected]>

add fixes

bump to 1.6.1

more fixes
peterjunpark committed Jul 26, 2024
1 parent bcb858e commit e56422b
Showing 26 changed files with 273 additions and 207 deletions.
1 change: 1 addition & 0 deletions .wordlist.txt
@@ -42,6 +42,7 @@ conf
gcn
isa
latencies
lds
lookaside
mantor
modulefile
6 changes: 5 additions & 1 deletion docs/conceptual/command-processor.rst
@@ -1,3 +1,7 @@
.. meta::
:description: Omniperf performance model: Command processor (CP)
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, command, processor, fetcher, packet processor, CPF, CPC

**********************
Command processor (CP)
**********************
@@ -23,7 +27,7 @@ The command processor consists of two sub-components:
Before scheduling work to the accelerator, the command processor can
first acquire a memory fence to ensure system consistency
:hsa-runtime-pdf:`Section 2.6.4 <91>`. After the work is complete, the
command processor can apply a memory-release fence. Depending on the AMD CDNA
command processor can apply a memory-release fence. Depending on the AMD CDNA
accelerator under question, either of these operations *might* initiate a cache
write-back or invalidation.

9 changes: 7 additions & 2 deletions docs/conceptual/compute-unit.rst
@@ -1,9 +1,14 @@
.. meta::
:description: Omniperf performance model: Compute unit (CU)
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, GCN, compute, unit, pipeline, workgroup, wavefront,
CDNA

*****************
Compute unit (CU)
*****************

The compute unit (CU) is responsible for executing a user's kernels on
CDNA-based accelerators. All :ref:`wavefronts <desc-wavefront>` of a
CDNA-based accelerators. All :ref:`wavefronts <desc-wavefront>` of a
:ref:`workgroup <desc-workgroup>` are scheduled on the same CU.

.. image:: ../data/performance-model/gcn_compute_unit.png
@@ -44,7 +49,7 @@ presented by Omniperf for these pipelines are described in
write-through. The vL1D caches from multiple compute units are kept coherent
with one another through software instructions.

* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
* CDNA accelerators -- that is, AMD Instinct MI100 and newer -- contain
specialized matrix-multiplication accelerator pipelines known as the
:ref:`desc-mfma`.

2 changes: 1 addition & 1 deletion docs/conceptual/definitions.rst
@@ -19,7 +19,7 @@ and in this documentation.
Memory spaces
=============

AMD Instinct MI accelerators can access memory through multiple address spaces
AMD Instinct MI-series accelerators can access memory through multiple address spaces
which may map to different physical memory locations on the system. The
following table provides a view into how various types of memory used
in HIP map onto these constructs:
56 changes: 30 additions & 26 deletions docs/conceptual/l2-cache.rst
@@ -1,3 +1,7 @@
.. meta::
:description: Omniperf performance model: L2 cache (TCC)
:keywords: Omniperf, ROCm, profiler, tool, Instinct, accelerator, L2, cache, infinity fabric, metrics

**************
L2 cache (TCC)
**************
@@ -9,7 +13,7 @@ on the device. Besides serving requests from the
for servicing requests from the :ref:`L1 instruction caches <desc-l1i>`, the
:ref:`scalar L1 data caches <desc-sL1D>` and the
:doc:`command processor <command-processor>`. The L2 cache is composed of a
number of distinct channels (32 on MI100/:ref:`MI2XX <mixxx-note>` series CDNA
number of distinct channels (32 on MI100 and :ref:`MI2XX <mixxx-note>` series CDNA
accelerators at 256B address interleaving) which can largely operate
independently. Mapping of incoming requests to a specific L2 channel is
determined by a hashing mechanism that attempts to evenly distribute requests
@@ -132,14 +136,14 @@ This section details the incoming requests to the L2 cache from the
if only a single value is requested in a cache line, the data movement
will still be counted as a full cache line.

- Bytes per normalization unit
- Bytes per :ref:`normalization unit <normalization-units>`.

* - Requests

- The total number of incoming requests to the L2 from all clients for all
request types, per :ref:`normalization unit <normalization-units>`.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Read Requests

@@ -221,22 +225,22 @@ This section details the incoming requests to the L2 cache from the
- The total number of L2 cache lines written back to memory for internal
hardware reasons, per :ref:`normalization unit <normalization-units>`.

- Cache lines per normalization unit
- Cache lines per :ref:`normalization unit <normalization-units>`.

* - Writebacks (vL1D Req)

- The total number of L2 cache lines written back to memory due to requests
initiated by the :doc:`vL1D cache <vector-l1-cache>`, per
:ref:`normalization unit <normalization-units>`.

- Cache lines per normalization unit
- Cache lines per :ref:`normalization unit <normalization-units>`.

* - Evictions (Normal)

- The total number of L2 cache lines evicted from the cache due to capacity
limits, per :ref:`normalization unit <normalization-units>`.

- Cache lines per normalization unit
- Cache lines per :ref:`normalization unit <normalization-units>`.

* - Evictions (vL1D Req)

@@ -245,33 +249,33 @@ This section details the incoming requests to the L2 cache from the
:doc:`vL1D cache <vector-l1-cache>`, per
:ref:`normalization unit <normalization-units>`.

- Cache lines per normalization unit
- Cache lines per :ref:`normalization unit <normalization-units>`.

* - Non-hardware-Coherent Requests

- The total number of requests to the L2 to Not-hardware-Coherent (NC)
memory allocations, per :ref:`normalization unit <normalization-units>`.
See the :ref:`memory-type` for more information.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Uncached Requests

- The total number of requests to the L2 that to uncached (UC) memory
- The total number of requests to the L2 that go to Uncached (UC) memory
allocations. See the :ref:`memory-type` for more information.

- Requests per :ref:`normalization unit <normalization-units>`.

* - Coherently Cached Requests

- The total number of requests to the L2 that to coherently cacheable (CC)
- The total number of requests to the L2 that go to Coherently Cacheable (CC)
memory allocations. See the :ref:`memory-type` for more information.

- Requests per :ref:`normalization unit <normalization-units>`.

* - Read/Write Coherent Requests

- The total number of requests to the L2 that to Read-Write coherent memory
- The total number of requests to the L2 that go to Read-Write coherent memory
(RW) allocations. See the :ref:`memory-type` for more information.

- Requests per :ref:`normalization unit <normalization-units>`.
@@ -396,7 +400,7 @@ Metrics
- The total number of bytes read by the L2 cache from Infinity Fabric per
:ref:`normalization unit <normalization-units>`.

- Bytes per normalization unit
- Bytes per :ref:`normalization unit <normalization-units>`.

* - HBM Read Traffic

@@ -446,7 +450,7 @@ Metrics
:ref:`uncached memory <memory-type>` allocations on the
MI2XX.

- Bytes per normalization unit
- Bytes per :ref:`normalization unit <normalization-units>`.

* - HBM Write and Atomic Traffic

@@ -529,7 +533,7 @@ Metrics
* - Read Stall

- The ratio of the total number of cycles the L2-Fabric interface was
stalled on a read request to any destination (local HBM, remote PCIe
stalled on a read request to any destination (local HBM, remote PCIe®
connected accelerator or CPU, or remote Infinity Fabric connected
accelerator [#inf]_ or CPU) over the
:ref:`total active L2 cycles <total-active-l2-cycles>`.
@@ -571,7 +575,7 @@ transaction breakdown table:
:ref:`l2-request-flow` for more detail. Typically unused on CDNA
accelerators.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Uncached Read Requests

@@ -581,7 +585,7 @@ transaction breakdown table:
uncached data are counted as two 32B uncached data requests. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - 64B Read Requests

@@ -590,7 +594,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - HBM Read Requests

@@ -599,7 +603,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Remote Read Requests

@@ -608,7 +612,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - 32B Write and Atomic Requests

@@ -617,7 +621,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Uncached Write and Atomic Requests

@@ -626,7 +630,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - 64B Write and Atomic Requests

@@ -635,7 +639,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - HBM Write and Atomic Requests

@@ -644,7 +648,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Remote Write and Atomic Requests

@@ -654,7 +658,7 @@ transaction breakdown table:
:ref:`normalization unit <normalization-units>`. See
:ref:`l2-request-flow` for more detail.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

* - Atomic Requests

@@ -668,7 +672,7 @@ transaction breakdown table:
:ref:`fine-grained memory <memory-type>` allocations or
:ref:`uncached memory <memory-type>` allocations on the MI2XX.

- Requests per normalization unit
- Requests per :ref:`normalization unit <normalization-units>`.

.. _l2-fabric-stalls:

@@ -759,7 +763,7 @@ remote accelerators or CPUs.
`Infinity Fabric <https://www.amd.com/en/technologies/infinity-architecture>`_
technology can be used to connect multiple accelerators to achieve advanced
peer-to-peer connectivity and enhanced bandwidths over traditional PCIe
connections. Some AMD Instinct MI accelerators like the MI250X,
connections. Some AMD Instinct MI-series accelerators like the MI250X
`feature coherent CPU↔accelerator connections built using AMD Infinity Fabric <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_.
.. rubric:: Disclaimer