fix fmt in profiling examples
Signed-off-by: Peter Jun Park <[email protected]>

add missing mem type table

Signed-off-by: Peter Jun Park <[email protected]>

fix formatting
peterjunpark committed Jul 29, 2024
1 parent c0ecfae commit c29ca45
Showing 11 changed files with 211 additions and 152 deletions.
43 changes: 43 additions & 0 deletions docs/conceptual/definitions.rst
@@ -107,3 +107,46 @@ memory allocated local to that device may see the allocation as coherently
cacheable, while a remote accelerator might see the same allocation as
*uncached*.

These memory types include:

.. list-table::
:header-rows: 1

* - Memory type
- Description

* - Uncached Memory (UC)
- Memory that will not be cached in this accelerator. On
:ref:`MI2XX <mixxx-note>` accelerators, this corresponds to “fine-grained”
(or “coherent”) memory allocated on a remote accelerator or the host,
for example, using ``hipHostMalloc`` or ``hipMallocManaged`` with default
allocation flags.

* - Non-hardware-Coherent Memory (NC)
- Memory that will be cached by the accelerator, and is only guaranteed to
be consistent at kernel boundaries / after software-driven
synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
type of memory maps to, for example, “coarse-grained” ``hipHostMalloc``’d
memory -- that is, allocated with the ``hipHostMallocNonCoherent``
flag -- or ``hipMalloc``’d memory allocated on a remote accelerator.

* - Coherently Cacheable (CC)
- Memory for which only reads from the accelerator where the memory was
allocated will be cached. Writes to CC memory are uncached, and trigger
invalidations of any line within this accelerator. On
:ref:`MI2XX <mixxx-note>` accelerators, this type of memory maps to
“fine-grained” memory allocated on the local accelerator using, for
example, the ``hipExtMallocWithFlags`` API using the
``hipDeviceMallocFinegrained`` flag.

* - Read/Write Coherent Memory (RW)
- Memory that will be cached by the accelerator, but may be invalidated by
writes from remote devices at kernel boundaries / after software-driven
synchronization events. On :ref:`MI2XX <mixxx-note>` accelerators, this
corresponds to “coarse-grained” memory allocated locally to the
accelerator using, for example, the default ``hipMalloc`` allocator.

For a good discussion of coarse- and fine-grained memory allocations, and of
the type of memory returned by various combinations of memory allocators,
flags, and arguments, see the
`Crusher quick-start guide <https://docs.olcf.ornl.gov/systems/crusher_quick_start_guide.html#floating-point-fp-atomic-operations-and-coarse-fine-grained-memory-allocations>`_.
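
On MI2XX accelerators, the table above can be read as a two-way lookup: is the allocation fine- or coarse-grained, and is it local to the accelerator performing the access? The following is a minimal Python sketch of that lookup. ``MemType`` and ``classify`` are hypothetical names introduced here for illustration and are not part of any HIP or ROCm API. The mapping covers only the example cases listed in the table.

```python
from enum import Enum


class MemType(Enum):
    """Memory types from the table above (MI2XX accelerators)."""
    UC = "Uncached"
    NC = "Non-hardware-coherent"
    CC = "Coherently cacheable"
    RW = "Read/write coherent"


def classify(fine_grained: bool, local: bool) -> MemType:
    """Return the memory type an accelerator sees for an allocation.

    fine_grained: True for fine-grained (coherent) allocations,
                  False for coarse-grained ones.
    local:        True if the memory was allocated on the accessing
                  accelerator, False if on the host or a remote accelerator.
    """
    if fine_grained:
        return MemType.CC if local else MemType.UC
    return MemType.RW if local else MemType.NC


# hipHostMalloc with default flags: fine-grained, on the host (not local).
print(classify(fine_grained=True, local=False).name)
# hipMalloc on the accessing accelerator: coarse-grained and local.
print(classify(fine_grained=False, local=True).name)
```

The sketch deliberately ignores allocator flags other than the fine/coarse distinction, so treat it as a reading aid for the table rather than a statement of runtime behavior.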
16 changes: 8 additions & 8 deletions docs/conceptual/local-data-share.rst
@@ -105,7 +105,7 @@ The LDS statistics panel gives a more detailed view of the hardware:
read/write/atomics and HIP's ``__shfl`` instructions) executed per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - Theoretical Bandwidth

@@ -116,7 +116,7 @@ The LDS statistics panel gives a more detailed view of the hardware:
executed. See the
:ref:`LDS bandwidth example <lds-bandwidth>` for more detail.

- Bytes per :ref:`normalization unit <normalization-units>`.
- Bytes per :ref:`normalization unit <normalization-units>`

* - LDS Latency

@@ -140,44 +140,44 @@ The LDS statistics panel gives a more detailed view of the hardware:
- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
over all operations per :ref:`normalization unit <normalization-units>`.

- Cycles per :ref:`normalization unit <normalization-units>`.
- Cycles per :ref:`normalization unit <normalization-units>`

* - Atomic Return Cycles

- The total number of cycles spent on LDS atomics with return per
:ref:`normalization unit <normalization-units>`.

- Cycles per :ref:`normalization unit <normalization-units>`.
- Cycles per :ref:`normalization unit <normalization-units>`

* - Bank Conflicts

- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
due to bank conflicts (as determined by the conflict resolution hardware)
per :ref:`normalization unit <normalization-units>`.

- Cycles per :ref:`normalization unit <normalization-units>`.
- Cycles per :ref:`normalization unit <normalization-units>`

* - Address Conflicts

- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
due to address conflicts (as determined by the conflict resolution
hardware) per :ref:`normalization unit <normalization-units>`.

- Cycles per :ref:`normalization unit <normalization-units>`.
- Cycles per :ref:`normalization unit <normalization-units>`

* - Unaligned Stall

- The total number of cycles spent in the :ref:`LDS scheduler <desc-lds>`
due to stalls from non-dword aligned addresses per
:ref:`normalization unit <normalization-units>`.

- Cycles per :ref:`normalization unit <normalization-units>`.
- Cycles per :ref:`normalization unit <normalization-units>`

* - Memory Violations

- The total number of out-of-bounds accesses made to the LDS, per
:ref:`normalization unit <normalization-units>`. This is unused and
expected to be zero in most configurations for modern CDNA™ accelerators.

- Accesses per :ref:`normalization unit <normalization-units>`.
- Accesses per :ref:`normalization unit <normalization-units>`
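
Each LDS metric above is reported "per normalization unit", meaning a raw counter total divided by the number of units chosen for normalization. The following is a minimal Python sketch of that division. The function name, the counter values, and the choice of wavefronts as the normalization unit are all assumptions made for illustration and are not measured data.

```python
def per_normalization_unit(raw_count: float, unit_count: float) -> float:
    """Divide a raw counter total by the number of normalization units."""
    if unit_count <= 0:
        raise ValueError("unit_count must be positive")
    return raw_count / unit_count


# Hypothetical totals for one kernel dispatch (not measured data).
bank_conflict_cycles = 4096  # total cycles lost to LDS bank conflicts
wavefronts = 512             # assumed normalization unit: per wavefront

print(per_normalization_unit(bank_conflict_cycles, wavefronts))
```

The same helper applies to any row in the table; only the raw counter and the selected normalization unit change.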

58 changes: 29 additions & 29 deletions docs/conceptual/pipeline-metrics.rst
@@ -246,7 +246,7 @@ instructions.
the execution mask is identically zero (meaning that *no lanes are active*)
the instruction will still be counted, as CDNA accelerators still consider
these instructions *issued*. See
:mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` for
:mi200-isa-pdf:`EXECute Mask, section 3.3 of the CDNA2 ISA guide<19>` for
examples and further details.
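
The counting rule above can be illustrated with a small Python model: every instruction is counted as issued, regardless of how many lanes its execution mask enables. ``count_issued`` is a hypothetical helper, and the masks are made-up values. The sketch assumes a 64-lane wavefront and does not model any real hardware counter.

```python
WAVEFRONT_SIZE = 64  # lanes per wavefront assumed for this sketch


def count_issued(exec_masks):
    """Return (instructions issued, total active lanes).

    Each entry of exec_masks is the execution mask in effect for one
    instruction, given as an integer with one bit per lane.
    """
    issued = 0
    active_lanes = 0
    for mask in exec_masks:
        if not 0 <= mask < (1 << WAVEFRONT_SIZE):
            raise ValueError("mask does not fit in the wavefront size")
        issued += 1  # counted even when the mask is identically zero
        active_lanes += bin(mask).count("1")
    return issued, active_lanes


# All lanes active, one lane active, and no lanes active.
masks = [(1 << WAVEFRONT_SIZE) - 1, 0x1, 0x0]
print(count_issued(masks))
```

Note that the third instruction contributes to the issued count even though it performs no work on any lane.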

Overall instruction mix
@@ -360,118 +360,118 @@ additions executed as part of an MFMA instruction using the same precision.
- The total number of instructions operating on 32-bit integer operands
issued to the VALU per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - INT64

- The total number of instructions operating on 64-bit integer operands
issued to the VALU per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F16-ADD

- The total number of addition instructions operating on 16-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F16-MUL

- The total number of multiplication instructions operating on 16-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F16-FMA

- The total number of fused multiply-add instructions operating on 16-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F16-TRANS

- The total number of transcendental instructions (such as ``sqrt``) operating
on 16-bit floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F32-ADD

- The total number of addition instructions operating on 32-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F32-MUL

- The total number of multiplication instructions operating on 32-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F32-FMA

- The total number of fused multiply-add instructions operating on 32-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F32-TRANS

- The total number of transcendental instructions (such as ``sqrt``)
operating on 32-bit floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F64-ADD

- The total number of addition instructions operating on 64-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F64-MUL

- The total number of multiplication instructions operating on 64-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F64-FMA

- The total number of fused multiply-add instructions operating on 64-bit
floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - F64-TRANS

- The total number of transcendental instructions (such as ``sqrt``)
operating on 64-bit floating-point operands issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - Conversion

- The total number of type conversion instructions (such as converting data
to or from F32↔F64) issued to the VALU per
:ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

For an example of these counters in action, refer to
:ref:`valu-arith-instruction-mix-ex`.
@@ -517,35 +517,35 @@ MFMA instructions are classified by the type of input data they operate on, and
- The total number of 8-bit integer :ref:`MFMA <desc-mfma>` instructions
issued per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - MFMA-F16 Instructions

- The total number of 16-bit floating-point :ref:`MFMA <desc-mfma>`
instructions issued per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - MFMA-BF16 Instructions

- The total number of 16-bit brain floating-point :ref:`MFMA <desc-mfma>`
instructions issued per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - MFMA-F32 Instructions

- The total number of 32-bit floating-point :ref:`MFMA <desc-mfma>`
instructions issued per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

* - MFMA-F64 Instructions

- The total number of 64-bit floating-point :ref:`MFMA <desc-mfma>`
instructions issued per :ref:`normalization unit <normalization-units>`.

- Instructions per :ref:`normalization unit <normalization-units>`.
- Instructions per :ref:`normalization unit <normalization-units>`

Compute pipeline
================
@@ -761,7 +761,7 @@ units and instruction issue.
- Indicates what percent of the kernel's duration the
:ref:`branch <desc-branch>` unit was busy executing instructions.
Computed as the ratio of the total number of cycles spent by the
:ref:`scheduler <desc-scheduler>` issuing branch instructions over the
:ref:`scheduler <desc-scheduler>` issuing branch instructions over the
:ref:`total CU cycles <total-cu-cycles>`.

- Percent
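
The branch utilization described above is a simple ratio expressed as a percentage. The following is a minimal Python sketch of that calculation. The function name and the cycle counts are hypothetical; real values would come from the profiler's counters.

```python
def branch_utilization(branch_issue_cycles: float,
                       total_cu_cycles: float) -> float:
    """Percent of total CU cycles spent issuing branch instructions."""
    if total_cu_cycles <= 0:
        raise ValueError("total_cu_cycles must be positive")
    return 100.0 * branch_issue_cycles / total_cu_cycles


# Hypothetical cycle counts for one kernel (not measured data).
print(branch_utilization(branch_issue_cycles=1500, total_cu_cycles=60000))
```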
@@ -855,23 +855,23 @@ not. For more detail on how operations are counted see the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
:ref:`normalization unit <normalization-units>`.

- FLOP per :ref:`normalization unit <normalization-units>`.
- FLOP per :ref:`normalization unit <normalization-units>`

* - IOPs (Total)

- The total number of integer operations executed on either the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
:ref:`normalization unit <normalization-units>`.

- IOP per :ref:`normalization unit <normalization-units>`.
- IOP per :ref:`normalization unit <normalization-units>`

* - F16 OPs

- The total number of 16-bit floating-point operations executed on either the
:ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
:ref:`normalization unit <normalization-units>`.

- FLOP per :ref:`normalization unit <normalization-units>`.
- FLOP per :ref:`normalization unit <normalization-units>`

* - BF16 OPs

@@ -880,23 +880,23 @@ not. For more detail on how operations are counted see the
:ref:`normalization unit <normalization-units>`. Note: on current CDNA
accelerators, the VALU has no native BF16 instructions.

- FLOP per :ref:`normalization unit <normalization-units>`.
- FLOP per :ref:`normalization unit <normalization-units>`

* - F32 OPs

- The total number of 32-bit floating-point operations executed on either
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
:ref:`normalization unit <normalization-units>`.

- FLOP per :ref:`normalization unit <normalization-units>`.
- FLOP per :ref:`normalization unit <normalization-units>`

* - F64 OPs

- The total number of 64-bit floating-point operations executed on either
the :ref:`VALU <desc-valu>` or :ref:`MFMA <desc-mfma>` units, per
:ref:`normalization unit <normalization-units>`.

- FLOP per :ref:`normalization unit <normalization-units>`.
- FLOP per :ref:`normalization unit <normalization-units>`

* - INT8 OPs

@@ -905,5 +905,5 @@ not. For more detail on how operations are counted see the
:ref:`normalization unit <normalization-units>`. Note: on current CDNA
accelerators, the VALU has no native INT8 instructions.

- IOP per :ref:`normalization unit <normalization-units>`.
- IOP per :ref:`normalization unit <normalization-units>`
