Tensile 4.39.0 for ROCm 6.0.0

@rocm-ci rocm-ci released this 15 Dec 18:30
17df881

Added

  • Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
  • Added/updated tuning scripts
  • Added DirectToLds support for larger data types with 32-bit global load (the old DirectToLds parameter is replaced by DirectToLdsA and DirectToLdsB), along with the corresponding test cases
  • Added average frequency, power consumption, and temperature information for the winner kernels to the CSV file
  • Added asmcap check for MFMA + const src
  • Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
  • Added a new parameter to increase miLatencyLeft

Optimizations

  • Enabled InitAccVgprOpt for MatrixInstruction cases
  • Implemented local read related parameter calculations with DirectToVgpr
  • Adjusted miIssueLatency for gfx940
  • Enabled dedicated vgpr allocation for local read + pack
  • Optimized code initialization
  • Optimized sgpr allocation
  • Added support for DGEMM TLUB + RLVW=2 for odd N (edge shift change)
  • Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Changed

  • Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
  • Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
  • Removed unused CustomKernels and ReplacementKernels
  • Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
  • Removed unused code for DirectToLds
  • Updated test cases for DTV + TransposeLDS=False
  • Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
  • Changed how latencyForLR is calculated for miLatency
  • Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
  • Refactored allowLRVWBforTLUandMI and renamed it to VectorWidthB
  • Added multi-GPU support for different architectures in lazy library loading
  • Enabled dtree library for batch > 1
  • Added problem scale feature for dtree selection
  • Enabled ROCm SMI for gfx940/gfx941
  • Modified non-lazy load build to skip experimental logic

Fixed

  • Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
  • Fixed boundary check for mirror dims and re-enabled disabled mirror dims test cases
  • Fixed merge error affecting i8 with wmma
  • Fixed mismatch issue with DTLds + TSGR + TailLoop
  • Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
  • Fixed override for unloaded solutions when lazy loading
  • Fixed some build errors (added missing headers)
  • Fixed Boost link for a clean build on Ubuntu 22
  • Fixed bug in forcestoresc1 arch selection
  • Fixed compiler directive for gfx941 and gfx942
  • Fixed formatting for DecisionTree_test.cpp