Releases: ROCm/TransferBench
TransferBench v1.57.00
Modified
- Removing use of the defaulted spaceship operator (operator<=>) / C++20 requirement to enable compilation on more OSes
- Changing how version is reported. Client version is now just the last two digits; it increments only if no changes are made to the backend header-only library file, and resets to 0 when the header is updated
- GFX_SINGLE_TEAM=0 is set by default
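As an illustration of the first item above: a defaulted three-way comparison requires C++20, while hand-written comparison operators compile under older standards. The sketch below is not taken from TransferBench; DeviceKey and its members are hypothetical names used only to show the kind of replacement involved.

```cpp
#include <tuple>

// Hypothetical type for illustration only; not a TransferBench type.
struct DeviceKey {
  int type;
  int index;

  // C++20-only form (defaulted three-way comparison):
  //   auto operator<=>(const DeviceKey&) const = default;

  // Pre-C++20 replacement: explicit member-wise, lexicographic comparisons.
  bool operator==(const DeviceKey& other) const {
    return std::tie(type, index) == std::tie(other.type, other.index);
  }
  bool operator!=(const DeviceKey& other) const { return !(*this == other); }
  bool operator<(const DeviceKey& other) const {
    return std::tie(type, index) < std::tie(other.type, other.index);
  }
};
```

With operator< defined this way, the type can still be used as a key in std::map or std::set without requiring a C++20 compiler.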
TransferBench v1.56
Fixed
- Fixed bug when using interactive mode. Interactive mode now starts prior to all warmup iterations
TransferBench v1.55
Fixed
- Fixed missing header error when compiling on CentOS
- Fixed issues when using multi-stream mode for GFX executor
TransferBench v1.54
Modified
- Refactored TransferBench into a header-only library combined with a thin client to facilitate the use of TransferBench as the backend for other applications
- Optimized how data validation is handled. This should speed up Tests with many parallel transfers as data is only generated once
- Preset benchmarks now no longer take in any extra command line arguments. Preset settings are only accessed via environment variables. Details for each preset are printed
- The a2a preset benchmark now defaults to using fine-grained memory and GFX unroll of 2
- Refactored how Transfers are launched in parallel which has reduced some CPU-side overheads
- CPU and DMA executor timing now use CPU wall clock timing instead of slowest Transfer time
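The difference between the two timing methods in the last item can be sketched as follows. This is not TransferBench code; the Interval struct and both function names are hypothetical, and the sketch only shows the arithmetic: when Transfers do not fully overlap, the wall clock span is longer than the slowest individual Transfer.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical record of when one Transfer started and finished (milliseconds).
struct Interval {
  double startMs;
  double endMs;
};

// Older-style metric: duration of the slowest individual Transfer.
double SlowestTransferMs(const std::vector<Interval>& transfers) {
  double slowest = 0.0;
  for (const Interval& t : transfers) {
    slowest = std::max(slowest, t.endMs - t.startMs);
  }
  return slowest;
}

// Wall clock metric: earliest start to latest finish across all Transfers.
double WallClockMs(const std::vector<Interval>& transfers) {
  if (transfers.empty()) return 0.0;
  double first = transfers.front().startMs;
  double last = transfers.front().endMs;
  for (const Interval& t : transfers) {
    first = std::min(first, t.startMs);
    last = std::max(last, t.endMs);
  }
  return last - first;
}
```

For two Transfers that run back to back, for example 0 to 4 ms and 4 to 8 ms, the slowest-Transfer metric gives 4 ms while the wall clock metric gives 8 ms, so bandwidth derived from the former would be overstated.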
Added
- New one2all preset which sweeps over all subsets of parallel transfers from one GPU to others
- Adding new warnings for DMA execution relating to how HIP will default to using agents from the source memory
Removed
- CU scaling preset has been removed. Similar functionality already exists in the schmoo preset benchmark
- Preparation of source data via GFX kernel has been removed (USE_PREP_KERNEL)
- Removed GFX block-reordering (BLOCK_ORDER)
- Removed NUM_CPU_DEVICES and NUM_GPU_DEVICES from common env vars; they are now available only in the presets they apply to
- Removed SHARED_MEM_BYTES option for GFX executor
- Removed USE_PCIE_INDEX
Fixed
- Fixed a potential timing reporting issue when DMA-executed Transfers end up getting serialized
TransferBench v1.53
Added
- Added ability to specify NULL for sweep preset as source or destination memory type
TransferBench v1.52
Added
- Added USE_HSA_DMA env var to switch to using hsa_amd_memory_async_copy instead of hipMemcpyAsync for DMA execution
- Added ability to set USE_GPU_DMA env var for a2a benchmark
- Adding check for large BAR enablement for GPU devices during topology check
Fixed
- Potential memory leak if HSA reports 0 hops between GPUs and CPUs
TransferBench v1.51
Modified
- CSV output has been modified slightly to match normal terminal output
- Output for non-single-stream mode has been changed to match single-stream mode (results per Executor)
Added
- Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration. If set to 0, this should loop infinitely (which may be useful for some debug purposes)
- Support for variable number of subexecutors (currently for GPU-GFX executor only). Setting subExecutors to 0 will run over a range of CUs to use, and report only the results of the best one found. This can be tuned for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the search space. The number of CUs used will be identical for all variable subExecutor transfers
- Experimental new "healthcheck" preset config which currently only supports MI300 series. This preset runs through CPU to GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares against expected values. Pass criteria limits can be modified (due to platform differences) via the environment variables LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
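The variable subexecutor search described above can be pictured as a simple sweep. The sketch below is a guess at the general shape, not the actual implementation: only the environment variable names MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC come from these notes, while GetEnvInt, FindBestSubExecCount, the fallback defaults, and the assumption that "best" means highest measured bandwidth are all hypothetical.

```cpp
#include <cstdlib>
#include <functional>

// Hypothetical result of the sweep: the chosen CU count and its bandwidth.
struct SweepResult {
  int numSubExecs;
  double bandwidthGBs;
};

// Reads an integer environment variable, falling back to a caller-supplied
// default when it is unset. The defaults themselves are assumptions.
int GetEnvInt(const char* name, int defaultValue) {
  const char* value = std::getenv(name);
  return value ? std::atoi(value) : defaultValue;
}

// Tries every count in [minCount, maxCount] and keeps only the best result.
// 'measure' stands in for running the Transfers with a given CU count.
SweepResult FindBestSubExecCount(int minCount, int maxCount,
                                 const std::function<double(int)>& measure) {
  SweepResult best{0, 0.0};
  for (int n = minCount; n <= maxCount; ++n) {
    const double bandwidth = measure(n);
    if (bandwidth > best.bandwidthGBs) {
      best = {n, bandwidth};
    }
  }
  return best;
}
```

A caller might obtain the bounds with GetEnvInt("MIN_VAR_SUBEXEC", 1) and GetEnvInt("MAX_VAR_SUBEXEC", 32). A narrower range means fewer measurements, which is why restricting the search space shortens the run.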
Fixed
- Fixed out-of-bounds memory access during topology detection that can happen if the number of CPUs is less than the number of NUMA domains
- Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)
TransferBench v1.50
Added
- Adding new parallel copy preset benchmark (pcopy)
- Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>
Fixed
- Removed non-copy DMA Transfers (this had previously been using hipMemset)
- Fixed CPU executor when operating on null destination
TransferBench v1.49
Fixes
- Enumerating previously missed DMA engines used only for CPU traffic in topology display
TransferBench v1.48
Fixes
- Various fixes for TransferBenchCuda
Additions
- Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
- Printing warnings when executors are overcommitted
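To make the subindex notation concrete, the sketch below splits a token such as D0.1 into an executor letter, an executor index, and an optional subindex. Only the example D0.1 comes from these notes; the grammar assumed here (one letter, digits, optional ".digits"), the ExecutorSpec struct, and the function names are hypothetical and may not match TransferBench's real parser.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical parsed form of an executor token such as "D0.1".
struct ExecutorSpec {
  char type;     // executor letter, e.g. 'D' for DMA
  int index;     // executor index, e.g. 0
  int subIndex;  // e.g. 1 to target a specific engine; -1 if not given
};

// Reads one or more decimal digits starting at 'pos' and advances 'pos'.
static bool ParseDigits(const std::string& s, std::size_t& pos, int& value) {
  const std::size_t start = pos;
  value = 0;
  while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos]))) {
    value = value * 10 + (s[pos] - '0');
    ++pos;
  }
  return pos > start;
}

// Returns true and fills 'out' if 'token' matches the assumed grammar.
bool ParseExecutor(const std::string& token, ExecutorSpec& out) {
  if (token.empty() || !std::isalpha(static_cast<unsigned char>(token[0]))) {
    return false;
  }
  std::size_t pos = 0;
  out.type = token[pos++];
  if (!ParseDigits(token, pos, out.index)) return false;
  out.subIndex = -1;
  if (pos < token.size() && token[pos] == '.') {
    ++pos;
    if (!ParseDigits(token, pos, out.subIndex)) return false;
  }
  return pos == token.size();  // reject trailing characters
}
```

Under this assumed grammar, D0.1 yields type 'D', index 0, and subindex 1, while a token without a ".N" suffix leaves the subindex at -1.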
Modifications
- USE_REMOTE_READ supported for rwrite preset benchmark