Release Tensile 4.39.0 for ROCm 6.0.0 · ROCm/Tensile

Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
Added/updated tuning scripts
Added DirectToLds support for larger data types with 32bit global load (old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
Added asmcap check for MFMA + const src
Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
Added a new parameter to increase miLatencyLeft
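
Since DirectToLds is replaced by DirectToLdsA and DirectToLdsB, existing tuning configurations that still name the old parameter need updating. The following is a minimal sketch of such a migration, not part of Tensile itself. It assumes parameters are held as a list of single-key dictionaries mapping a parameter name to its list of candidate values, and that carrying the old setting over to both A and B as booleans is acceptable; the helper name `migrate_direct_to_lds` is hypothetical.

```python
def migrate_direct_to_lds(params):
    """Return a new parameter list with DirectToLds split into A/B variants.

    Assumption: `params` is a list of {name: [candidate values]} dicts.
    Assumption: the old setting is copied to both DirectToLdsA and
    DirectToLdsB as booleans; adjust if A and B should be tuned separately.
    """
    migrated = []
    for entry in params:
        for name, values in entry.items():
            if name == "DirectToLds":
                flags = [bool(v) for v in values]
                migrated.append({"DirectToLdsA": flags})
                migrated.append({"DirectToLdsB": list(flags)})
            else:
                # Unrelated parameters are copied through unchanged.
                migrated.append({name: list(values)})
    return migrated


# Illustrative input only; the values are not taken from a real config.
old_fork_parameters = [
    {"DirectToLds": [0, 1]},
    {"PrefetchGlobalRead": [1]},
]
new_fork_parameters = migrate_direct_to_lds(old_fork_parameters)
```

The helper returns a new list rather than editing in place, so the original configuration stays available for comparison.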

Enabled InitAccVgprOpt for MatrixInstruction cases
Implemented local read related parameter calculations with DirectToVgpr
Adjusted miIssueLatency for gfx940
Enabled dedicated vgpr allocation for local read + pack
Optimized code initialization
Optimized sgpr allocation
Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
Removed unused CustomKernels and ReplacementKernels
Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
Removed unused code for DirectToLds
Updated test cases for DTV + TransposeLDS=False
Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
Changed how to calculate latencyForLR for miLatency
Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
Supported multi-gpu for different architectures in lazy library loading
Enabled dtree library for batch > 1
Added problem scale feature for dtree selection
Enabled ROCm SMI for gfx940/941
Modified non-lazy load build to skip experimental logic
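
Because MinKForGSU moved from the global parameters to the benchmark common parameters, a configuration that sets it globally needs the entry relocated. The sketch below illustrates that relocation under stated assumptions: global parameters are modelled as a plain dictionary, common parameters as a list of single-key dictionaries, and the value 64 is purely illustrative. The helper name `move_min_k_for_gsu` is hypothetical and not a Tensile API.

```python
def move_min_k_for_gsu(global_params, common_params):
    """Relocate MinKForGSU from the global dict to the common-parameter list.

    Assumption: global parameters are {name: value}; common parameters are
    a list of {name: [candidate values]} dicts. Inputs are not modified.
    """
    global_out = dict(global_params)
    common_out = [dict(entry) for entry in common_params]
    if "MinKForGSU" in global_out:
        value = global_out.pop("MinKForGSU")
        # Keep an existing common-parameter entry if one is already present.
        if not any("MinKForGSU" in entry for entry in common_out):
            common_out.append({"MinKForGSU": [value]})
    return global_out, common_out


# Illustrative values only.
old_global = {"MinKForGSU": 64, "NumElementsToValidate": 0}
old_common = [{"EdgeType": ["ShiftPtr"]}]
new_global, new_common = move_min_k_for_gsu(old_global, old_common)
```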

Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
Fixed boundary check for mirror dims and re-enabled disabled mirror dims test cases
Fixed merge error affecting i8 with wmma
Fixed mismatch issue with DTLds + TSGR + TailLoop
Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
Fixed override for unloaded solutions when lazy loading
Fixed some build errors (added missing headers)
Fixed boost link for a clean build on ubuntu22
Fixed bug in forcestoresc1 arch selection
Fixed compiler directive for gfx941 and gfx942
Fixed formatting for DecisionTree_test.cpp
