Releases: ROCm/Tensile
Tensile 4.32.0 for ROCm 5.1.1
Tensile code for ROCm 5.1.1 did not change. The library was rebuilt for the updated ROCm 5.1.1 stack.
Tensile 4.32.0 for ROCm 5.1.0
Added
- Better control of parallelism to limit memory usage
- Support for multiprocessing on Windows for TensileCreateLibrary
- New JSD metric and metric selection functionality
- Initial changes to support two-tier solution selection
Optimized
- Optimized runtime of TensileCreateLibraries by reducing max RAM usage
- StoreCInUnroll additional optimizations plus adaptive K support
- DGEMM NN optimizations with PrefetchGlobalRead(PGR)=2 support
Changed
- Update Googletest to 1.11.0
Removed
- Remove no longer supported benchmarking steps
Tensile 4.31.0 for ROCm 5.0.2
Tensile code for ROCm 5.0.2 is unchanged from Tensile for ROCm 5.0.1. The library was rebuilt for the updated ROCm 5.0.2 stack.
Tensile 4.31.0 for ROCm 5.0.1
Tensile code for ROCm 5.0.1 is unchanged from Tensile for ROCm 5.0.0. The library was rebuilt for the updated ROCm 5.0.1 stack.
Tensile 4.31.0 for ROCm 5.0.0
Added
- DirectToLds support (x2/x4)
- DirectToVgpr support for DGEMM
- Parameter to control the number of files that kernels are merged into, to better parallelize kernel compilation
- FP16 alternate implementation for HPA HGEMM on aldebaran
Optimized
- Add DGEMM NN custom kernel for HPL on aldebaran
Changed
- Update tensile_client executable to std=c++14
Removed
- Remove unused old Tensile client code
Fixed
- Fix hipErrorInvalidHandle during benchmarks
- Fix addrVgpr for atomic GSU
- Fix for Python 3.8: add case for Constant nodeType
- Fix architecture mapping for gfx1011 and gfx1012
- Fix PrintSolutionRejectionReason verbiage in KernelWriter.py
- Fix vgpr alignment problem when enabling flat buffer load
Tensile 4.30.0 for ROCm 4.5.2
Tensile code for ROCm 4.5.2 is unchanged from Tensile for ROCm 4.5.0. The library was rebuilt for the updated ROCm 4.5.2 stack.
Tensile 4.30.0 for ROCm 4.5.0
Added
- Custom Kernel mechanism for adding custom assembly kernels to Tensile
- New assertions for problem sizes, alpha/beta values, and C equals D
- Support setting VectorWidth in M dimension in MFMA SourceSwap configuration
Fixed
- Fix merge.py keeping duplicate solutions
- Fix ScheduleIterAlg 2,3 cases for aldebaran
Tensile 4.28.0 for ROCm 4.3.1
No changes made for ROCm 4.3.1.
Tensile 4.28.0 for ROCm 4.3.0
Added
- TensileRetuneLibrary for updating existing library logic files
- Support GFX1030
- Support NHWC
Fixed
- TensileCreateLibrary crash with relative output and --merge-files
Changed
- Change cmake_minimum_required to VERSION 3.13
Tensile 4.27.0 for ROCm 4.2.0
Added
- Benchmarking and library support for CU efficiency vs. overall speed
- Support general batch GEMM
- Support offset for each input/output buffer in Tensile
- Support ldc != ldd for all GEMM kernels
Optimized
- Refactor ConvolutionVsContraction
Fixed
- Fixed MasterSolutionLibrary having duplicated hardware rows
- Fixed incorrect channel stride when converting a convolution problem into a tensor contraction problem