Releases: ROCm/TransferBench

TransferBench v1.57.00

28 Nov 00:12
062b581

v1.57.00

Modified

  • Removed use of the defaulted spaceship operator (operator<=>) and the resulting C++20 requirement, to
    enable compilation on more OSes
  • Changed how the version is reported. The client version is now just the last two digits; it increments
    only when no changes are made to the backend header-only library file, and resets to 0 when the header
    is updated
  • GFX_SINGLE_TEAM=0 is set by default
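
As a rough illustration of the first item: a defaulted three-way comparison needs C++20, while an explicit operator< built on std::tie gives the same lexicographic ordering under older standards. The struct and field names below are hypothetical and are not taken from the TransferBench sources.

```cpp
#include <cassert>
#include <tuple>

// Hypothetical key type; names are illustrative, not from TransferBench.
struct DeviceKey {
  int exeType;
  int exeIndex;

  // C++20 only:
  //   auto operator<=>(const DeviceKey&) const = default;

  // Pre-C++20 replacement: lexicographic comparison via std::tie.
  bool operator<(const DeviceKey& other) const {
    return std::tie(exeType, exeIndex) < std::tie(other.exeType, other.exeIndex);
  }
  bool operator==(const DeviceKey& other) const {
    return std::tie(exeType, exeIndex) == std::tie(other.exeType, other.exeIndex);
  }
};

// Small self-check that the ordering compares exeType first, then exeIndex.
inline void CheckOrdering() {
  assert((DeviceKey{0, 1} < DeviceKey{1, 0}));
  assert(!(DeviceKey{1, 0} < DeviceKey{0, 1}));
}
```

With operator< defined this way, such a type can still be used as a key in std::map or std::set without requiring a C++20 compiler.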

TransferBench v1.56

26 Nov 20:37
83fc9b3

v1.56

Fixed

  • Fixed bug when using interactive mode. Interactive mode now starts prior to all warmup iterations

TransferBench v1.55

26 Nov 20:37
9f68d14
Pre-release

v1.55

Fixed

  • Fixed missing header error when compiling on CentOS
  • Fixed issues when using multi-stream mode for GFX executor

TransferBench v1.54

21 Nov 23:32
02ce785

v1.54

Modified

  • Refactored TransferBench into a header-only library combined with a thin client to facilitate the
    use of TransferBench as the backend for other applications
  • Optimized how data validation is handled - this should speed up Tests with many parallel transfers as data is only
    generated once
  • Preset benchmarks no longer take any extra command line arguments. Preset settings are only accessed via
    environment variables. Details for each preset are printed
  • The a2a preset benchmark now defaults to using fine-grained memory and GFX unroll of 2
  • Refactored how Transfers are launched in parallel which has reduced some CPU-side overheads
  • CPU and DMA executor timing now use CPU wall clock timing instead of slowest Transfer time
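
The last item above can be pictured with a minimal sketch: instead of taking the maximum of the per-Transfer durations, one wall-clock interval is measured around the whole launch-and-wait step, so launch and synchronization overheads are included. The helper name TimeWallClockMs is hypothetical, and this is not TransferBench's actual implementation.

```cpp
#include <cassert>
#include <chrono>
#include <utility>

// Hypothetical helper: invokes `launchAndWait` (which is assumed to start all
// Transfers and block until they finish) and returns the elapsed wall-clock
// time in milliseconds.
template <typename F>
double TimeWallClockMs(F&& launchAndWait) {
  using Clock = std::chrono::steady_clock;

  const auto start = Clock::now();
  std::forward<F>(launchAndWait)();
  const auto stop = Clock::now();

  const double ms = std::chrono::duration<double, std::milli>(stop - start).count();
  assert(ms >= 0.0);  // steady_clock is monotonic, so elapsed time is never negative
  return ms;
}
```

A monotonic clock is used here so that system clock adjustments cannot distort the measurement.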

Added

  • New one2all preset which sweeps over all subsets of parallel transfers from one GPU to others
  • Adding new warnings for DMA execution relating to how HIP will default to using agents from the source memory

Removed

  • CU scaling preset has been removed. Similar functionality already exists in the schmoo preset benchmark
  • Preparation of source data via GFX kernel has been removed (USE_PREP_KERNEL)
  • Removed GFX block-reordering (BLOCK_ORDER)
  • Removed NUM_CPU_DEVICES and NUM_GPU_DEVICES from the common env vars; they now apply only within the presets that use them.
  • Removed SHARED_MEM_BYTES option for GFX executor
  • Removed USE_PCIE_INDEX

Fixed

  • Fixed a potential timing reporting issue when DMA-executed Transfers end up getting serialized.

TransferBench v1.53

11 Nov 06:59
b56d481

v1.53

Added

  • Added ability to specify NULL for sweep preset as source or destination memory type

TransferBench v1.52

09 Oct 16:49
600cf13

Added

  • Added USE_HSA_DMA env var to switch to using hsa_amd_memory_async_copy instead of hipMemcpyAsync for DMA execution
  • Added ability to set USE_GPU_DMA env var for a2a benchmark
  • Adding check for large BAR enablement for GPU devices during topology check

Fixed

  • Potential memory leak if HSA reports 0 hops between GPUs and CPUs

TransferBench v1.51

15 Aug 17:46
b30aefb

v1.51

Modified

  • CSV output has been modified slightly to match normal terminal output
  • Output for non-single-stream mode has been changed to match single-stream mode (results per Executor)

Added

  • Support for sub-iterations via NUM_SUBITERATIONS. This allows for additional looping during an iteration.
    If set to 0, this should loop infinitely (which may be useful for some debug purposes)
  • Support for variable number of subexecutors (currently for GPU-GFX executor only). Setting subExecutors to
    0 will run over a range of CUs to use, and report only the results of the best one found. This can be tuned
    for performance by setting the MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC environment variables to narrow the
    search space. The number of CUs used will be identical for all variable subExecutor transfers
  • Experimental new "healthcheck" preset config which currently only supports MI300 series. This preset runs
    through CPU to GPU bandwidth tests and all-to-all XGMI bandwidth tests and compares against expected values.
    Pass criteria limits can be modified (due to platform differences) via the environment variables
    LIMIT_UDIR (unidirectional), LIMIT_BDIR (bidirectional), and LIMIT_A2A (per GPU-GPU link bandwidth)
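
The variable-subexecutor behaviour described above amounts to a search over a range of CU counts that keeps only the best result. The following is a minimal sketch of that idea, assuming a hypothetical `measure` callback that stands in for running the Transfer with a given number of subexecutors; the names FindBestSubExecCount and BestResult are illustrative and not part of TransferBench.

```cpp
#include <cassert>
#include <functional>

// Result of the search: the winning subexecutor count and its bandwidth.
struct BestResult {
  int numSubExecs;
  double bandwidthGbs;
};

// Hypothetical search: tries every count in [minSubExec, maxSubExec] and
// returns the one with the highest measured bandwidth. Ties keep the
// smaller count because only strictly better results replace the best.
inline BestResult FindBestSubExecCount(int minSubExec, int maxSubExec,
                                       const std::function<double(int)>& measure) {
  assert(minSubExec >= 1 && minSubExec <= maxSubExec);

  BestResult best{minSubExec, measure(minSubExec)};
  for (int n = minSubExec + 1; n <= maxSubExec; ++n) {
    const double bw = measure(n);
    if (bw > best.bandwidthGbs) best = BestResult{n, bw};
  }
  return best;
}
```

Narrowing the range, as MIN_VAR_SUBEXEC and MAX_VAR_SUBEXEC do, reduces the number of measurements the search has to make.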

Fixed

  • Fixed out-of-bounds memory access during topology detection that can happen if the number of
    CPUs is less than the number of NUMA domains
  • Fixed CU masking functionality on multi-XCD architectures (e.g. MI300)

TransferBench v1.50

03 Apr 16:27
eaf32b4

Added

  • Adding new parallel copy preset benchmark (pcopy)
    • Usage: ./TransferBench pcopy <numBytes=64M> <#CUs=8> <srcGpu=0> <minGpus=1> <maxGpus=#GPU-1>

Fixed

  • Removed non-copy DMA Transfers (these had previously used hipMemset)
  • Fixed CPU executor when operating on null destination

TransferBench v1.49

02 Apr 22:38
97fbbbb

Fixes

  • Enumerating previously missed DMA engines used only for CPU traffic in topology display

TransferBench v1.48

02 Feb 22:46
aa801b9

v1.48

Fixes

  • Various fixes for TransferBenchCuda

Additions

  • Support for targeting specific DMA engines via executor subindex (e.g. D0.1)
  • Printing warnings when executors are overcommitted
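
An executor string with a subindex, such as D0.1, can be read as a type letter, an executor index, and an optional subindex after the dot. The parser below is only a sketch of that reading: ExeSpec and ParseExeSpec are hypothetical names, the type letter is treated as opaque, and integer overflow is not handled.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Parsed form of a string such as "D0.1"; subIndex is -1 when omitted.
struct ExeSpec {
  char type;
  int index;
  int subIndex;
};

// Hypothetical parser for "<letter><index>[.<subindex>]".
// Returns false on malformed input or trailing characters.
inline bool ParseExeSpec(const std::string& s, ExeSpec& out) {
  std::size_t pos = 0;
  if (s.empty() || !std::isalpha(static_cast<unsigned char>(s[0]))) return false;
  out.type = s[pos++];

  // Reads one or more decimal digits starting at pos.
  auto readInt = [&](int& value) {
    if (pos >= s.size() || !std::isdigit(static_cast<unsigned char>(s[pos]))) return false;
    value = 0;
    while (pos < s.size() && std::isdigit(static_cast<unsigned char>(s[pos])))
      value = value * 10 + (s[pos++] - '0');
    return true;
  };

  if (!readInt(out.index)) return false;
  out.subIndex = -1;
  if (pos < s.size()) {
    if (s[pos++] != '.') return false;
    if (!readInt(out.subIndex)) return false;
  }
  return pos == s.size();
}

// Example: "D0.1" parses to type 'D', index 0, subindex 1.
inline void ParseExample() {
  ExeSpec spec{};
  const bool ok = ParseExeSpec("D0.1", spec);
  assert(ok && spec.type == 'D' && spec.index == 0 && spec.subIndex == 1);
  (void)ok;
}
```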

Modifications

  • USE_REMOTE_READ supported for rwrite preset benchmark