Removed acc_bench_trans miniapp (#895)
- It is mostly subsumed by acc_bench_smm.
- Renamed acc_bench_smm to acc_bench.
- Transposes are part of acc_bench.
- Updated documentation.
hfp authored Mar 5, 2025
1 parent 013e905 commit d0f0f3b
Showing 10 changed files with 45 additions and 385 deletions.
30 changes: 16 additions & 14 deletions src/acc/README.md
@@ -6,9 +6,9 @@ The accelerator interface (ACC) consists of ISO_C_BINDING based Fortran code of

The code for both the CUDA and the HIP backend is unified, and can be found in the `cuda` directory. At compile-time either one or the other backend is chosen per macro (`__CUDA` or `__HIP`). Similarly, the code for the OpenCL backend is activated by a build-time macro (`__OPENCL`).

## Drivers
## Miniapp

There are two stand-alone sample codes or drivers exercising the ACC-interface. The driver code (only depending on above mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, OpenCL). For OpenCL, the LIBXSMM library is mandatory and preferred as baseline and for validation in any case. To build LIBXSMM, a folder `libxsmm` in parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt.
There is one stand-alone sample code or driver exercising the ACC-interface. The driver code (only depending on above mentioned interfaces) can be built locally and in a rather self-contained fashion, i.e., no DBCSR library is needed (except runtime libraries such as CUDA, HIP, OpenCL). For OpenCL, the LIBXSMM library is mandatory and preferred as baseline and for validation in any case. To build LIBXSMM, a folder `libxsmm` in parallel to DBCSR's root directory (`dbcsr`) is expected to be present and prebuilt.

```bash
git clone -b main https://github.com/libxsmm/libxsmm.git
@@ -24,34 +24,36 @@ cd dbcsr/src/acc/opencl
make
```

**NOTE**: To activate a certain device, the drivers consider an environment variable called `DEVICE`. For example, `DEVICE=1 ./acc_bench_trans` activates the second device (at least two devices must be discovered). This environment variable is implemented by the driver code and meant to work across backends, i.e., the OpenCL backend also supports `ACC_OPENCL_DEVICE=1` (see Developer Guide for the OpenCL backend).
**NOTE**: To activate a certain device, the driver considers an environment variable called `DEVICE`. For example, `DEVICE=1 ./acc_bench` activates the second device (at least two devices must be discovered). This environment variable is implemented by the driver code and meant to work across backends, i.e., the OpenCL backend also supports `ACC_OPENCL_DEVICE=1` (see Developer Guide for the OpenCL backend).
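
The device selection described in the note can be sketched as follows. This is an illustration only and not the code of `acc_bench`: the helper name `select_device` is hypothetical, and falling back to device 0 for an unset, malformed, or out-of-range value is an assumption rather than documented behavior.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not taken from the DBCSR sources): maps the value of
 * the DEVICE environment variable to a device index. Assumption: device 0 is
 * used if the value is unset, empty, not a number, or not below ndevices. */
int select_device(const char* value, int ndevices) {
  char* end = NULL;
  long id;
  assert(0 < ndevices);
  if (NULL == value || '\0' == *value) return 0;
  id = strtol(value, &end, 10);
  if ('\0' != *end || id < 0 || id >= ndevices) return 0;
  return (int)id;
}
```

A driver could call `select_device(getenv("DEVICE"), ndevices)` once the number of devices is known, so that `DEVICE=1` yields index 1 only if at least two devices were discovered.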

The drivers support command line options (_nrepeat_, _stack_size_, _m_, _n_, ...). Command line arguments are positional but allow `0` as a placeholder to refer to the default value (`acc_bench_smm 0 0 5 13 5` performs the default number of repetitions with the default stack size when running the 5x13x5-kernel). For example, running the transpose benchmark may look like:
The driver supports command line options (_nrepeat_, _stack_size_, _m_, _n_, _k_, ...). Command line arguments are positional but allow `0` as a placeholder to refer to the default value (`acc_bench 0 0 5 13 5` performs the default number of repetitions with the default stack size when running the 5x13x5-kernel). For example, running the transpose benchmark may look like:

```bash
$ OMP_PROC_BIND=TRUE ./acc_bench_trans 5 30000 23 23
./acc_bench_trans 5 30000 23 23
$ OMP_PROC_BIND=TRUE ./acc_bench 3 30000 23 23 23
Activated device0 (ndevices=8)
acc_bench 3 30000 23 23 23 1875 18750 18750
typename (id=3): double
copy-in: 17.2 ms 7.2 GB/s
device: 8.7 ms 14.2 GB/s
host: 8.4 ms 14.6 GB/s
errors: 0
copy-in (2058 MB): 92 ms 21.9 GB/s
transpose: 0.23 ms 3187.9 GFLOPS/s
device: 0.18 ms 4122.8 GFLOPS/s
host: 0.57 ms 1278.1 GFLOPS/s
diff.cur: 3.20547e-15 (|36.6983-36.6983|=5.47118e-13)
```
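
The placeholder convention described above (a positional argument of `0` selects the default value) could be implemented along the following lines. This is a minimal sketch and not the code of `acc_bench`: the helper name `arg_or_default` is hypothetical, and the defaults passed to it are up to the caller.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper: returns the i-th positional argument as an integer.
 * The fallback is returned if the argument is missing or does not parse to a
 * positive number, which includes the placeholder 0. */
int arg_or_default(int argc, char* argv[], int i, int fallback) {
  int value;
  assert(NULL != argv && 0 < i);
  if (argc <= i) return fallback;
  value = atoi(argv[i]);
  return (0 < value) ? value : fallback;
}
```

With this helper, an invocation such as `acc_bench 0 0 5 13 5` would keep the caller's defaults for the first two arguments and take 5, 13, and 5 from the command line.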

For timing, comparison (host code), and validation, LIBXSMM is required. The drivers exercise the respective backend. For example with the CUDA backend:
For timing, comparison (host code), and validation, LIBXSMM is required. The driver exercises the respective backend. For example with the CUDA backend:

```bash
cd src/acc/cuda
make WITH_GPU=P100
../acc_bench_smm
../acc_bench
```

For the OpenCL backend:

```bash
cd src/acc/opencl
make
../acc_bench_smm
../acc_bench
```

In above cases, `acc_bench_trans` and `acc_bench_smm` are built using the respective backend. Both driver codes can be built for double-precision (default) or single-precision using a build-time macro (`make ELEM_TYPE=float` or `-DELEM_TYPE=float` in general).
In above cases, `acc_bench` is built using the respective backend. The driver code can be built for double-precision (default) or single-precision using a build-time macro (`make ELEM_TYPE=float` or `-DELEM_TYPE=float` in general).
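
A build-time element type such as `ELEM_TYPE` is commonly wired up with a guarded default. The sketch below shows that general pattern only; it is not copied from the DBCSR sources, and the function `buffer_size` is a hypothetical example of code that depends on the selected type.

```c
#include <stddef.h>

/* Assumption: the element type defaults to double unless the build overrides
 * it, e.g., by compiling with -DELEM_TYPE=float. */
#if !defined(ELEM_TYPE)
# define ELEM_TYPE double
#endif

/* Hypothetical example: size in bytes of a buffer holding n elements of the
 * selected element type. */
size_t buffer_size(size_t n) {
  return n * sizeof(ELEM_TYPE);
}
```

Compiling the same file with `-DELEM_TYPE=float` switches every use of `ELEM_TYPE` to single-precision without touching the source.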
File renamed without changes.
266 changes: 0 additions & 266 deletions src/acc/acc_bench_trans.c

This file was deleted.
