add spellcheck/linting from rocm-docs-core
Signed-off-by: Peter Jun Park <[email protected]>

fix rst directives

satisfy spellcheck

fix more spelling
peterjunpark committed Jul 18, 2024
1 parent 7e8b254 commit 2da48d7
Showing 11 changed files with 137 additions and 80 deletions.
16 changes: 16 additions & 0 deletions .github/workflows/docs-linting.yml
@@ -0,0 +1,16 @@
name: Documentation

on:
  push:
    branches:
      - dev
      - 'docs/*'
  pull_request:
    branches:
      - dev
      - 'docs/*'

jobs:
  call-workflow-passing-data:
    name: Linting
    uses: ROCm/rocm-docs-core/.github/workflows/linting.yml@develop
46 changes: 41 additions & 5 deletions .wordlist.txt
@@ -1,5 +1,41 @@
Grafana
MobaXterm
Normalizations
Omniperf's
roofline
AGPRs
FLOPs
GPUOpen
GiB
HIP's
HIP’s
IOPs
IPC
KiB
LD
Lmod
Relatedly
SEs
SIG
Transcendentals
UID
Utilizations
VPGRs
amd
architected
ast
atomicAdd
backpressure
backpressuring
broadcasted
cdna
conf
gcn
isa
lookaside
mantor
modulefile
modulefiles
pdf
sl
substring
typename
untar
utilizations
vega
vl
2 changes: 1 addition & 1 deletion docs/conceptual/pipeline-metrics.rst
@@ -478,7 +478,7 @@ VMEM instruction mix

This section breaks down the types of vector memory (VMEM) instructions
that were issued. Refer to the
`Instruction Counts metrics section <desc-ta>` of address-processor frontend of
`Instruction Counts metrics section <desc-ta>` of address-processor front end of
the vL1D cache for a description of these VMEM instructions.

.. _mfma-instruction-mix:
6 changes: 3 additions & 3 deletions docs/conceptual/vector-l1-cache.rst
@@ -136,7 +136,7 @@ Busy / stall metrics

When executing vector memory instructions, the compute unit must send an
address (and in the case of writes/atomics, data) to the address
processing unit. When the frontend cannot accept any more addresses, it
processing unit. When the front-end cannot accept any more addresses, it
must backpressure the wave-issue logic for the VMEM pipe and prevent the
issue of a vector memory instruction until a previously issued memory
operation has been processed.
@@ -500,7 +500,7 @@ vL1D cache access metrics
-------------------------

The vL1D cache access metrics broadly indicate the type of requests
incoming from the :ref:`cache frontend <desc-ta>`, the number of requests that
incoming from the :ref:`cache front-end <desc-ta>`, the number of requests that
were serviced by the vL1D, and the number & type of outgoing requests to
the :doc:`L2 cache <l2-cache>`. In addition, this section includes the
approximate latencies of accesses to the cache itself, along with
@@ -728,7 +728,7 @@ Omniperf reports the following vL1D data-return path metrics:
:doc:`compute units <compute-unit>` on the accelerator, per
:ref:`normalization unit <normalization-units>`. This is expected to be
the sum of global/generic and spill/stack stores counted by the
:ref:`vL1D cache-frontend <ta-instruction-counts>`.
:ref:`vL1D cache-front-end <ta-instruction-counts>`.

- Instructions per normalization unit

2 changes: 1 addition & 1 deletion docs/reference/faq.rst
@@ -17,7 +17,7 @@ backend. You can do this using :ref:`database <modes-database>` mode.

Pass in the directory of your desired workload as follows.

.. code:: shell
.. code-block:: shell
$ omniperf database --import -w <path-to-results> -H <hostname> -u <username> -t <team-name>
104 changes: 55 additions & 49 deletions docs/tutorial/includes/infinity-fabric-transactions.rst
@@ -9,7 +9,7 @@ Infinity Fabric transactions

The following code snippet launches a simple read-only kernel.

.. code:: cpp
.. code-block:: cpp
// the main streaming kernel
__global__ void kernel(int* x, size_t N, int zero) {
@@ -25,12 +25,12 @@ This following code snippet launches a simple read-only kernel.
}
}
This happens twice -- once as a warmup and once for analysis. Note that the
This happens twice -- once as a warm-up and once for analysis. Note that the
buffer ``x`` is initialized to all zeros via a call to ``hipMemcpy`` on the
host before the kernel is ever launched. Therefore, the following conditional
is identically false -- and thus we expect no writes.

.. code:: cpp
.. code-block:: cpp
if (sum != 0) { ...
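For context, the host-side setup described above might look roughly like the
following sketch. It is not part of this diff; the problem size, launch
configuration, and variable names are assumptions, not the sample's actual
source.

.. code-block:: cpp

   #include <hip/hip_runtime.h>
   #include <vector>

   // Declaration of the streaming kernel shown earlier in this file.
   __global__ void kernel(int* x, size_t N, int zero);

   int main() {
     const size_t N = size_t{1} << 26;   // assumed problem size
     int* x = nullptr;
     hipMalloc(reinterpret_cast<void**>(&x), N * sizeof(int));

     // Zero the buffer from the host via hipMemcpy, as described above, so
     // that `sum != 0` is never true and the kernel issues no writes.
     std::vector<int> zeros(N, 0);
     hipMemcpy(x, zeros.data(), N * sizeof(int), hipMemcpyHostToDevice);

     kernel<<<1024, 256>>>(x, N, 0);     // warm-up launch
     kernel<<<1024, 256>>>(x, N, 0);     // launch that is profiled
     hipDeviceSynchronize();

     hipFree(x);
     return 0;
   }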
@@ -62,7 +62,7 @@ Experiment #1: Coarse-grained, accelerator-local HBM reads
In our first experiment, we consider the simplest possible case, a
``hipMalloc``\ ’d buffer that is local to our current accelerator:

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n coarse_grained_local --no-roof -- ./fabric -t 1 -o 0
Using:
@@ -129,7 +129,7 @@ In addition, we see a small amount of :ref:`uncached <memory-type>` reads

* Kernel arguments

* Coordinate parameters (e.g., blockDim.z) that were not initialized by the
* Coordinate parameters (e.g., ``blockDim.z``) that were not initialized by the
hardware, etc. and may account for some of our ‘remote’ read requests
(**17.5.4**), e.g., reading from CPU DRAM.

@@ -161,7 +161,7 @@ accelerator. Our code uses the ``hipExtMallocWithFlag`` API with the
to set the environment variable ``HSA_FORCE_FINE_GRAIN_PCIE=1`` to enable
this memory type.
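As a rough sketch only (the ``hipExtMallocWithFlags`` spelling and the
``hipDeviceMallocFinegrained`` flag are assumptions here, not taken from the
sample's source), the allocation might look like:

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Allocate fine-grained memory local to the current accelerator.
   // HSA_FORCE_FINE_GRAIN_PCIE=1 must be set in the environment, as noted above.
   int* alloc_fine_grained_local(size_t n_elems) {
     int* x = nullptr;
     hipExtMallocWithFlags(reinterpret_cast<void**>(&x),
                           n_elems * sizeof(int),
                           hipDeviceMallocFinegrained);
     return x;
   }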

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n fine_grained_local --no-roof -- ./fabric -t 0 -o 0
Using:
@@ -241,7 +241,7 @@ finally resetting the device back to the default, e.g.,
Although we have not changed our code significantly, we do see a
substantial change in the L2-Fabric metrics:

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n fine_grained_remote --no-roof -- ./fabric -t 0 -o 2
Using:
@@ -331,7 +331,7 @@ In this experiment, we move our :ref:`fine-grained <memory-type>` allocation to
be owned by the CPU’s DRAM. We accomplish this by allocating host-pinned
fine-grained memory using the ``hipHostMalloc`` API:

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host --no-roof -- ./fabric -t 0 -o 1
Using:
@@ -384,8 +384,8 @@ fine-grained memory using the ``hipHostMalloc`` API:
Here we see *almost* the same results as in the
:ref:`previous experiment <infinity-fabric-ex3>`; however, now that we are crossing
a PCIe bus to the CPU, we see that the Infinity Fabric Read stalls (17.4.1)
have shifted to be a PCIe stall (17.4.2). In addition, as (on this
a PCIe bus to the CPU, we see that the Infinity Fabric Read stalls (**17.4.1**)
have shifted to be a PCIe stall (**17.4.2**). In addition, as (on this
system) the PCIe bus has a lower peak bandwidth than the AMD Infinity
Fabric connection between two accelerators, we once again observe an
increase in the percentage of stalls on this interface.
@@ -395,7 +395,7 @@ increase in the percentage of stalls on this interface.
Had we performed this same experiment on an
`MI250X system <https://www.amd.com/system/files/documents/amd-cdna2-white-paper.pdf>`_,
these transactions would again have been marked as Infinity Fabric Read
stalls (17.4.1), as the CPU is connected to the accelerator via AMD Infinity
stalls (**17.4.1**), as the CPU is connected to the accelerator via AMD Infinity
Fabric.

.. _infinity-fabric-ex5:
@@ -408,7 +408,7 @@ In our next fabric experiment, we change our CPU memory allocation to be
``hipHostMalloc`` API the ``hipHostMallocNonCoherent`` flag, to mark the
allocation as coarse-grained:

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n coarse_grained_host --no-roof -- ./fabric -t 1 -o 1
Using:
@@ -459,24 +459,24 @@ allocation as coarse-grained:
│ 17.5.4 │ Remote Read │ 671088645.00 │ 671088645.00 │ 671088645.00 │ Req per kernel │
╘═════════╧═════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Here we see a similar result to our `previous
experiment <Fabric_exp_4>`__, with one key difference: our accesses are
no longer marked as Uncached Read requests (17.2.3, 17.5.1), but instead
are 64B read requests (17.5.2), as observed in our `Coarse-grained,
accelerator-local HBM <Fabric_exp_1>`__ experiment.
Here we see a similar result to our
:ref:`previous experiment <infinity-fabric-ex4>`, with one key difference: our
accesses are no longer marked as Uncached Read requests (**17.2.3, 17.5.1**), but instead
are 64B read requests (**17.5.2**), as observed in our
:ref:`Coarse-grained, accelerator-local HBM <infinity-fabric-ex1>` experiment.
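For reference, the host allocations used in this and the
:ref:`previous experiment <infinity-fabric-ex4>` differ only in the flag passed
to ``hipHostMalloc``. The sketch below is illustrative only; the buffer size and
variable names are assumptions, not the sample's actual source.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   constexpr size_t N = size_t{1} << 26;   // assumed element count

   void allocate_host_buffers(int** fine, int** coarse) {
     // Experiment #4: host-pinned memory; per the text above, the default
     // hipHostMalloc allocation is fine-grained (coherent).
     hipHostMalloc(reinterpret_cast<void**>(fine), N * sizeof(int),
                   hipHostMallocDefault);

     // Experiment #5: the same allocation marked non-coherent, which the
     // accelerator treats as coarse-grained and accesses with ordinary
     // 64B read requests.
     hipHostMalloc(reinterpret_cast<void**>(coarse), N * sizeof(int),
                   hipHostMallocNonCoherent);
   }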

.. _infinity-fabric-ex6:

Experiment #6: Fine-grained, CPU-DRAM writes
--------------------------------------------

Thus far in our exploration of the L2-Fabric interface, we have
primarily focused on read operations. However, in `our request flow
diagram <fabric-fig>`__, we note that writes are counted separately. To
obeserve this, we use the ‘-p’ flag to trigger write operations to
fine-grained memory allocated on the host:
primarily focused on read operations. However, in
:ref:`our request flow diagram <l2-request-flow>`, we note that writes are
counted separately. To observe this, we use the ``-p`` flag to trigger write
operations to fine-grained memory allocated on the host:

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host_write --no-roof -- ./fabric -t 0 -o 1 -p 1
Using:
@@ -533,25 +533,29 @@ fine-grained memory allocated on the host:
│ 17.5.10 │ Atomic │ 0.00 │ 0.00 │ 0.00 │ Req per kernel │
╘═════════╧═════════════════════════╧══════════════╧══════════════╧══════════════╧════════════════╛
Here we notice a few changes in our request pattern: - As expected, the
requests have changed from 64B Reads to 64B Write requests (17.5.7), -
these requests are homed in on a “remote” destination (17.2.6, 17.5.9),
as expected, and, - these are also counted as a single Uncached Write
request (17.5.6).
Here we notice a few changes in our request pattern:

* As expected, the requests have changed from 64B Reads to 64B Write requests
(**17.5.7**),

* these requests are homed in on a “remote” destination (**17.2.6, 17.5.9**), as
expected, and,

* these are also counted as a single Uncached Write request (**17.5.6**).

In addition, there are rather significant changes in the bandwidth values
reported: - the “L2-Fabric Write and Atomic” bandwidth metric (17.2.4)
reports about 40GiB of data written across Infinity Fabric(tm) while, -
the “Remote Write and Traffic” metric (17.2.5) indicates that nearly
reported:

* The “L2-Fabric Write and Atomic” bandwidth metric (**17.2.4**) reports about
  40GiB of data written across Infinity Fabric while,

* the “Remote Write and Traffic” metric (**17.2.5**) indicates that nearly 100%
  of these requests are being directed to a remote source.

The precise meaning of these metrics is explored in the
:ref:`subsequent experiment <infinity-fabric-ex7>`.

Finally, we note that we see no write stalls on the PCIe(r) bus
(17.4.3). This is because writes over a PCIe(r) bus `are
non-posted <https://members.pcisig.com/wg/PCI-SIG/document/10912>`__,
i.e., they do not require acknowledgement.
Finally, we note that we see no write stalls on the PCIe bus
(**17.4.3**). This is because writes over a PCIe bus `are
posted <https://members.pcisig.com/wg/PCI-SIG/document/10912>`_,
that is, they do not require acknowledgement.
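To make the write path concrete, a device-side store into the host-resident,
fine-grained buffer might look like the sketch below. This is illustrative
only; the sample's actual ``-p`` code path is not shown in this diff.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each thread writes one element of a hipHostMalloc'd (fine-grained) host
   // buffer. These stores surface as the 64B and Uncached Write requests
   // described above and cross the PCIe bus to CPU DRAM without requiring
   // an acknowledgement.
   __global__ void write_kernel(int* x, size_t N, int val) {
     const size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
     if (i < N) {
       x[i] = val;
     }
   }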

.. _infinity-fabric-ex7:

@@ -561,7 +565,7 @@ Experiment #7: Fine-grained, CPU-DRAM atomicAdd
Next, we change our experiment to instead target ``atomicAdd``
operations to the CPU’s DRAM.

.. code:: shell-session
.. code-block:: shell-session
$ omniperf profile -n fine_grained_host_add --no-roof -- ./fabric -t 0 -o 1 -p 2
Using:
@@ -618,14 +622,18 @@ operations to the CPU’s DRAM.
│ 17.5.10 │ Atomic │ 13421773.00 │ 13421773.00 │ 13421773.00 │ Req per kernel │
╘═════════╧═════════════════════════╧═════════════╧═════════════╧═════════════╧════════════════╛
In this case, there is quite a lot to unpack: - For the first time, the
32B Write requests (17.5.5) are heavily used. - These correspond to
Atomic requests (17.2.7, 17.5.10), and are counted as Uncached Writes
(17.5.6). - The L2-Fabric Write and Atomic bandwidth metric (17.2.4)
shows about 0.4 GiB of traffic. For convenience, the sample reduces the
default problem size for this case due to the speed of atomics across a
PCIe(r) bus, and finally, - The traffic is directed to a remote device
(17.2.6, 17.5.9)
In this case, there is quite a lot to unpack:

- For the first time, the 32B Write requests (**17.5.5**) are heavily used.

- These correspond to Atomic requests (**17.2.7, 17.5.10**), and are counted as
Uncached Writes (**17.5.6**).

- The L2-Fabric Write and Atomic bandwidth metric (**17.2.4**) shows about 0.4
GiB of traffic. For convenience, the sample reduces the default problem size
for this case due to the speed of atomics across a PCIe bus, and finally,

- The traffic is directed to a remote device (**17.2.6, 17.5.9**)

Let us consider what an “atomic” request means in this context. Recall
that we are discussing memory traffic flowing from the L2 cache, the
@@ -634,9 +642,8 @@ MI250, to e.g., the CPU’s DRAM. In this light, we see that these
requests correspond to *system scope* atomics, and specifically in the
case of the MI250, to fine-grained memory!
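As a concrete illustration of such a system-scope atomic, consider the sketch
below. It is not part of this diff, and the sample's actual kernel may differ;
the counter is assumed to live in host-pinned, fine-grained memory allocated
with ``hipHostMalloc``.

.. code-block:: cpp

   #include <hip/hip_runtime.h>

   // Each thread atomically increments a counter that resides in host-pinned,
   // fine-grained memory. Because the memory is fine-grained and host-resident,
   // the hardware forwards these operations as system-scope atomics across the
   // fabric to the CPU's DRAM, which is what the Atomic counts above reflect.
   __global__ void atomic_kernel(int* counter, size_t N) {
     const size_t i = blockIdx.x * static_cast<size_t>(blockDim.x) + threadIdx.x;
     if (i < N) {
       atomicAdd(counter, 1);
     }
   }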

.. raw:: html

<!-- Leave as possible future experiment to add
..
`Leave as possible future experiment to add

### Experiment #2 - Non-temporal writes
@@ -647,9 +654,8 @@ case of the MI250, to fine-grained memory!
```
template<typename T>
__device__ void store (T* ptr, T val) {
__builtin_nontemporal_store(val, ptr);
__builtin_nontemporal_store(val, ptr);
}
```

On an AMD [MI2XX](2xxnote) accelerator, for FP32 values this will generate a `global_store_dword` instruction, with the `glc` and `slc` bits set, described in [section 10.1](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf) of the CDNA2 ISA guide.
-->
On an AMD MI2XX accelerator, for FP32 values this will generate a `global_store_dword` instruction, with the `glc` and `slc` bits set, described in [section 10.1](https://developer.amd.com/wp-content/resources/CDNA2_Shader_ISA_4February2022.pdf) of the CDNA2 ISA guide.`