Releases: ROCm/Tensile

Tensile 4.39.0 for ROCm 6.0.0

15 Dec 18:30
17df881

Added

  • Added aquavanjaram support: gfx940/gfx941/gfx942, fp8/bf8 datatype, xf32 datatype, and stochastic rounding for various datatypes
  • Added/updated tuning scripts
  • Added DirectToLds support for larger data types with 32-bit global load (the old parameter DirectToLds is replaced with DirectToLdsA and DirectToLdsB), and the corresponding test cases
  • Added the average of frequency, power consumption, and temperature information for the winner kernels to the CSV file
  • Added asmcap check for MFMA + const src
  • Added support for wider local read + pack with v_perm (with VgprForLocalReadPacking=True)
  • Added a new parameter to increase miLatencyLeft

Optimizations

  • Enabled InitAccVgprOpt for MatrixInstruction cases
  • Implemented local read related parameter calculations with DirectToVgpr
  • Adjusted miIssueLatency for gfx940
  • Enabled dedicated vgpr allocation for local read + pack
  • Optimized code initialization
  • Optimized sgpr allocation
  • Supported DGEMM TLUB + RLVW=2 for odd N (edge shift change)
  • Enabled miLatency optimization for (gfx940/gfx941 + MFMA) for specific data types, and fixed instruction scheduling

Changed

  • Removed old code for DTL + (bpe * GlobalReadVectorWidth > 4)
  • Changed/updated failed CI tests for gfx11xx, InitAccVgprOpt, and DTLds
  • Removed unused CustomKernels and ReplacementKernels
  • Added a reject condition for DTVB + TransposeLDS=False (not supported so far)
  • Removed unused code for DirectToLds
  • Updated test cases for DTV + TransposeLDS=False
  • Moved parameter MinKForGSU from globalparameter to BenchmarkCommonParameter to support smaller K
  • Changed how latencyForLR is calculated for miLatency
  • Set minimum value of latencyForLRCount for 1LDSBuffer to avoid getting rejected by overflowedResources=5 (related to miLatency)
  • Refactored allowLRVWBforTLUandMI and renamed it as VectorWidthB
  • Supported multi-gpu for different architectures in lazy library loading
  • Enabled dtree library for batch > 1
  • Added problem scale feature for dtree selection
  • Enabled ROCm SMI for gfx940/941
  • Modified non-lazy load build to skip experimental logic

Fixed

  • Fixed predicate ordering for fp16alt impl round near zero mode to unbreak distance modes
  • Fixed boundary check for mirror dims and re-enabled disabled mirror dims test cases
  • Fixed merge error affecting i8 with wmma
  • Fixed mismatch issue with DTLds + TSGR + TailLoop
  • Fixed a bug with InitAccVgprOpt + GSU>1 and a mismatch issue with PGR=0
  • Fixed override for unloaded solutions when lazy loading
  • Fixed some build errors (added missing headers)
  • Fixed boost link for a clean build on ubuntu22
  • Fixed bug in forcestoresc1 arch selection
  • Fixed compiler directive for gfx941 and gfx942
  • Fixed formatting for DecisionTree_test.cpp

Tensile 4.38.0 for ROCm 5.7.1

13 Oct 18:57
97e0cfc

Tensile code for ROCm 5.7.1 did not change. The library was rebuilt for the updated ROCm 5.7.1 stack.

Tensile 4.38.0 for ROCm 5.7.0

15 Sep 17:29
97e0cfc

Added

  • Added support for FP16 Alt Round Near Zero Mode (this feature allows the generation of alternate kernels with intermediate rounding instead of truncation)
  • Added user-driven solution selection feature

Optimizations

  • Enabled LocalSplitU with MFMA for I8 data type
  • Optimized K mask code in mfmaIter
  • Enabled TailLoop code in NoLoadLoop to prefetch global/local read
  • Enabled DirectToVgpr in TailLoop for NN, TN, and TT matrix orientations
  • Optimized DirectToLds test cases to reduce the test duration

Changed

  • Removed DGEMM NT custom kernels and related test cases
  • Changed noTailLoop logic to apply noTailLoop only for NT
  • Changed the range of AssertFree0ElementMultiple and Free1
  • Unified aStr, bStr generation code in mfmaIter

Fixed

  • Fixed LocalSplitU mismatch issue for SGEMM
  • Fixed BufferStore=0 and Ldc != Ldd case
  • Fixed mismatch issue with TailLoop + MatrixInstB > 1

Tensile 4.37.0 for ROCm 5.6.1

29 Aug 20:11
7d0a9d0

Tensile code for ROCm 5.6.1 did not change. The library was rebuilt for the updated ROCm 5.6.1 stack.

Tensile 4.37.0 for ROCm 5.6.0

28 Jun 23:17
7d0a9d0

Added

  • Added user-driven tuning API
  • Added decision tree fallback feature
  • Added SingleBuffer + AtomicAdd option for GlobalSplitU
  • Added DirectToVgpr support for fp16 and Int8 with TN orientation
  • Added new test cases for various functions
  • Added SingleBuffer algorithm for ZGEMM/CGEMM
  • Added joblib for parallel map calls
  • Added support for MFMA + LocalSplitU + DirectToVgprA+B
  • Added asmcap check for MIArchVgpr
  • Added support for MFMA + LocalSplitU
  • Added frequency, power, and temperature data to the output

Optimizations

  • Improved the performance of GlobalSplitU with SingleBuffer algorithm
  • Reduced the running time of the extended and pre_checkin tests
  • Optimized the TailLoop section of the assembly kernel
  • Optimized complex GEMM (fixed vgpr allocation, unified CGEMM and ZGEMM code in MulMIoutAlphaToArch)
  • Improved the performance of the second kernel of MultipleBuffer algorithm

Changed

  • Updated custom kernels with 64-bit offsets
  • Adapted 64-bit offset arguments for assembly kernels
  • Improved temporary register re-use to reduce max sgpr usage
  • Removed some restrictions on VectorWidth and DirectToVgpr
  • Updated the dependency requirements for Tensile
  • Changed the range of AssertSummationElementMultiple
  • Modified the error messages for more clarity
  • Changed DivideAndRemainder to vectorStaticRemainder in case the quotient is not used
  • Removed dummy vgpr for vectorStaticRemainder
  • Removed tmpVgpr parameter from vectorStaticRemainder/Divide/DivideAndRemainder
  • Removed qReg parameter from vectorStaticRemainder

Fixed

  • Fixed tmp sgpr allocation to avoid over-writing values (alpha)
  • Fixed 64-bit offset parameters for post kernels
  • Fixed gfx908 CI test failures
  • Fixed offset calculation to prevent overflow for large offsets
  • Fixed issues when BufferLoad and BufferStore are equal to zero
  • Fixed StoreCInUnroll + DirectToVgpr + no useInitAccVgprOpt mismatch
  • Fixed DirectToVgpr + LocalSplitU + FractionalLoad mismatch
  • Fixed the memory access error related to StaggerU + large stride
  • Fixed ZGEMM 4x4 MatrixInst mismatch
  • Fixed DGEMM 4x4 MatrixInst mismatch
  • Fixed ASEM + GSU + NoTailLoop opt mismatch
  • Fixed AssertSummationElementMultiple + GlobalSplitU issues
  • Fixed ASEM + GSU + TailLoop inner unroll

Tensile 4.36.0 for ROCm 5.5.1

24 May 19:05
d3bbb8b

Tensile code for ROCm 5.5.1 did not change. The library was rebuilt for the updated ROCm 5.5.1 stack.

Tensile 4.36.0 for ROCm 5.5.0

01 May 21:02
d3bbb8b

Added

  • Add functions for user-driven tuning
  • Add GFX11 support: HostLibraryTests yamls, rearrange FP32(C)/FP64(C) instruction order, archCaps for instruction renaming condition, adjust vgpr bank for A/B/C for optimization, separate vscnt and vmcnt, dual mac
  • Add binary search for Grid-Based algorithm
  • Add reject condition for (StoreCInUnroll + BufferStore=0) and (DirectToVgpr + ScheduleIterAlg<3 + PrefetchGlobalRead==2)
  • Add support for (DirectToLds + hgemm + NN/NT/TT) and (DirectToLds + hgemm + GlobalLoadVectorWidth < 4)
  • Add support for (DirectToLds + hgemm(TLU=True only) or sgemm + NumLoadsCoalesced > 1)
  • Add GSU SingleBuffer algorithm for HSS/BSS
  • Add gfx900:xnack-, gfx1032, gfx1034, gfx1035
  • Enable gfx1031 support

Optimizations

  • Use AssertSizeLessThan for BufferStoreOffsetLimitCheck if it is smaller than MT1
  • Improve InitAccVgprOpt

Changed

  • Use global_atomic for GSU instead of flat and global_store for debug code
  • Replace flat_load/store with global_load/store
  • Use global_load/store for BufferLoad/Store=0 and enable scheduling
  • LocalSplitU support for HGEMM+HPA when MFMA disabled
  • Update Code Object Version
  • Type cast local memory to COMPUTE_DATA_TYPE in LDS to avoid precision loss
  • Update asm cap cache arguments
  • Unify SplitGlobalRead into ThreadSeparateGlobalRead and remove SplitGlobalRead
  • Change checks, error messages, assembly syntax, and coverage for DirectToLds
  • Remove unused cmake file
  • Clean up the LLVM dependency code
  • Update ThreadSeparateGlobalRead test cases for PrefetchGlobalRead=2
  • Update sgemm/hgemm test cases for DirectToLds and ThreadSeparateGlobalRead

Fixed

  • Add build-id to header of compiled source kernels
  • Fix solution index collisions
  • Fix h beta vectorwidth4 correctness issue for WMMA
  • Fix an error with BufferStore=0
  • Fix mismatch issue with (StoreCInUnroll + PrefetchGlobalRead=2)
  • Fix MoveMIoutToArch bug
  • Fix flat load correctness issue on I8 and flat store correctness issue
  • Fix mismatch issue with BufferLoad=0 + TailLoop for large array sizes
  • Fix code generation error with BufferStore=0 and StoreCInUnrollPostLoop
  • Fix issues with DirectToVgpr + ScheduleIterAlg<3
  • Fix mismatch issue with DGEMM TT + LocalReadVectorWidth=2
  • Fix mismatch issue with PrefetchGlobalRead=2
  • Fix mismatch issue with DirectToVgpr + PrefetchGlobalRead=2 + small tile size
  • Fix an error with PersistentKernel=0 + PrefetchAcrossPersistent=1 + PrefetchAcrossPersistentMode=1
  • Fix mismatch issue with DirectToVgpr + DirectToLds + only 1 iteration in unroll loop case
  • Remove duplicate GSU kernels: for GSU = 1, GSUAlgorithm SingleBuffer and MultipleBuffer kernels are identical
  • Fix failing CI tests due to CpuThreads=0
  • Fix mismatch issue with DirectToLds + PrefetchGlobalRead=2
  • Remove the reject condition for ThreadSeparateGlobalRead and DirectToLds (HGEMM, SGEMM only)
  • Modify reject condition for minimum lanes of ThreadSeparateGlobalRead (SGEMM or larger data type only)

Tensile 4.34.0 for ROCm 5.3.3

17 Nov 19:21
006a5d6

Tensile code for ROCm 5.3.3 did not change. The library was rebuilt for the updated ROCm 5.3.3 stack.

Tensile 4.34.0 for ROCm 5.3.2

10 Nov 01:04
006a5d6

Tensile code for ROCm 5.3.2 did not change. The library was rebuilt for the updated ROCm 5.3.2 stack.

Tensile 4.35.0 for ROCm 5.4.4

22 Mar 20:46
5aec089

Tensile code for ROCm 5.4.4 did not change. The library was rebuilt for the updated ROCm 5.4.4 stack.