CubeCL v0.3.0 Release Notes
This release brings major improvements to platform compatibility, language capabilities, and performance. Runtime support now extends to AMD GPUs via ROCm/HIP, and a new SPIR-V compiler boosts wgpu performance on Vulkan. The CubeCL language also sees substantial updates: more Rust syntax, compile-time constants, improved generics, enums, and a reworked macro system.
Language Features
- Added support for numeric constants by @booti386 in #112
- Added for in syntax for immutable arrays, tensors and slices by @wingertge in #119
- Added support for ROCm HIP by @syl20bnr in #183
- Added if as a value expression by @wingertge in #120
- Added select (ternary) operations by @wingertge in #152
- Implemented support for func generics for impl block by @nathanielsimard in #189
- Added support for Enum + Const Match by @nathanielsimard in #145
- Added support for numeric match at runtime by @wingertge in #143
- Added support for comptime arrays available as runtime constants by @wingertge in #147
- Added features for each supported datatype by @wingertge in #193
- Reimplemented macro to make writing kernels more ergonomic by @wingertge in #80
- Clean up macro and optimize branch operations by @wingertge in #118
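As a rough illustration of the syntax forms named in this list (for in iteration, if as a value expression, select, and numeric match), here is a minimal sketch in plain Rust. It is not CubeCL kernel code and does not use the CubeCL macro; the function names are hypothetical, and `select` below is a stand-in helper written for this example, not the CubeCL operation.

```rust
// Plain Rust sketch of the syntax forms listed above.
// Assumption: none of these names come from CubeCL; they are illustrative only.

/// Hypothetical stand-in for a ternary select: returns `a` when `cond` is true, else `b`.
fn select(cond: bool, a: f32, b: f32) -> f32 {
    if cond {
        a
    } else {
        b
    }
}

/// Sums the positive entries of an immutable slice.
fn sum_positive(values: &[f32]) -> f32 {
    let mut acc = 0.0;
    // `for ... in` over an immutable slice
    for v in values {
        // `if` used as a value expression
        let term = if *v > 0.0 { *v } else { 0.0 };
        acc += term;
    }
    acc
}

/// Numeric match on a value known only at runtime.
fn scale_for(mode: u32) -> f32 {
    match mode {
        0 => 1.0,
        1 => 0.5,
        _ => 0.0,
    }
}
```

How these forms are written inside an actual kernel depends on the CubeCL macro, so treat the sketch as a description of the Rust syntax only.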
Runtime Improvements
CUDA
- Improved CUDA compiler by @nathanielsimard in #88
- Fixed CUDA architecture version by @nathanielsimard in #89
- Fixed native vector types by @nathanielsimard in #92
- Fixed CUDA support for different ranks by @nathanielsimard in #124
- Better CMMA configuration by @nathanielsimard in #146
- Support SSA bindings for CUDA by @wingertge in #153
- Fixed various CUDA bugs by @nathanielsimard in #168
WGPU
- Fixed WGPU memory corruption for CubeCount::Dynamic by @ArthurBrussee in #156
- Added support for autotuning on WebGPU, more precise timings by @ArthurBrussee in #167
- Fixed overflow when max page == 4GB on WASM by @ArthurBrussee in #194
- Merged cubecl-wgpu and cubecl-wgpu-spirv by @wingertge in #184
HIP/ROCm
- Added support for ROCm HIP by @syl20bnr in #183
- Added half precision support to HIP by @syl20bnr in #201
- Limited cubecl-hip for Linux targets only by @syl20bnr in #205
SPIR-V
- Added SPIR-V compiler by @wingertge in #155
- Fixed casting, powf and alignment for SPIR-V by @wingertge in #188
Optimization & Performance
- Added value-based partial redundancy elimination by @wingertge in #169
- Added prefetching to into_contiguous by @wingertge in #181
- Added block merging by @wingertge in #163
- Added round and bitwise or operations by @laggui in #99
- Skipped zero initialization of workgroup memory by @ArthurBrussee in #125
- CMMA Optimizations:
  - CMMA: cube dispatch strategy by @louisfd in #126
  - Reuse lhs frag strategy by @louisfd in #132
  - Invert k n loops by @louisfd in #131
  - Continuous warp loading by @louisfd in #138
  - Relative warp IDs by @louisfd in #144
  - Relaxed b_m = b_n by @louisfd in #148
  - New strategy for num compute planes + many refactors by @louisfd in #150
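For readers unfamiliar with value-based partial redundancy elimination, the general idea can be sketched by hand in plain Rust. This is an illustration of the technique in general, not of how the CubeCL optimizer implements it; `before` and `after` are hypothetical names. In `before`, the product is computed on the `flag` path and again after the branch, the second time through an alias, so a purely textual comparison would miss the repetition. In `after`, the value is computed once and reused.

```rust
/// Before: `a * b` is evaluated inside the branch and again afterwards.
/// The second evaluation goes through `c`, which holds the same value as `a`,
/// so the two products are equal by value even though they differ as text.
fn before(a: i32, b: i32, flag: bool) -> i32 {
    let mut acc = 0;
    if flag {
        acc += a * b;
    }
    let c = a;
    acc + c * b
}

/// After: the product is computed once, available on every path, and reused.
/// This is the shape a value-based PRE pass aims to produce.
fn after(a: i32, b: i32, flag: bool) -> i32 {
    let t = a * b;
    let mut acc = 0;
    if flag {
        acc += t;
    }
    acc + t
}
```

Both functions return the same result for the same inputs; the second simply performs one multiplication instead of up to two.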
Infrastructure
- Added profiling support by @nathanielsimard in #137
- Improved compilation arguments by @nathanielsimard in #141
- Added simple benchmarking capabilities by @jbelanich in #190
- Added periodic memory cleanup by @ArthurBrussee in #178
- Reworked & added ExclusivePages as memory management option by @ArthurBrussee in #158
- Fixed concurrency problems with autotune by @nathanielsimard in #200
- Improved timing methods for benchmarking by @jbelanich in #190
- Fixed CI for Rust 1.82 by @nathanielsimard in #182
- Migrated xtask to tracel-xtask by @syl20bnr in #93
- Updated CI workflow and badges by @syl20bnr in #96
Math & Operations
- Implemented dot product by @RianGoossens in #140
- Implemented magnitude by @RianGoossens in #105
- Added Round, Floor, Ceil for Line by @med1844 in #179
- Implemented Vector Normalization by @RianGoossens in #100
- Added round and bitwise operations by @laggui in #99
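As a reference for what the dot product, magnitude, and normalization operations compute, here is a minimal CPU sketch in plain Rust. The function names and slice-based signatures are assumptions made for this example and are not the CubeCL API.

```rust
/// Dot product of two equal-length vectors: the sum of element-wise products.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Magnitude (Euclidean length): the square root of the vector dotted with itself.
fn magnitude(a: &[f32]) -> f32 {
    dot(a, a).sqrt()
}

/// Normalization: divides every element by the magnitude.
/// A zero vector yields NaN entries, since the sketch does not guard against division by zero.
fn normalize(a: &[f32]) -> Vec<f32> {
    let m = magnitude(a);
    a.iter().map(|x| x / m).collect()
}
```

For example, under these definitions `magnitude(&[3.0, 4.0])` is 5.0, and normalizing the same vector gives a result whose magnitude is approximately 1.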
Documentation & Examples
- Added simple fusion example by @nathanielsimard in #142
- Updated README by @nathanielsimard in #192
- Added book by @nathanielsimard in #133
- Format floating point values with maximum precision by @ArthurBrussee in #130
Bug Fixes & Maintenance
- Handle empty tensors by @laggui in #86
- Fixed flaky tests in topology by @nathanielsimard in #109
- Fixed no-std support by @nathanielsimard in #175
- Fixed WASM infinite loop by @nathanielsimard in #176
- Fixed deadlock by @ArthurBrussee in #177
- Fixed legacy kernels by auto-casting unary ops by @wingertge in #187
- Fixed pico support by @BjornTheProgrammer in #198
- Fixed check on macOS and minor refactor by @AsherJingkongChen in #204
- Fixed validate checksum by @nathanielsimard in #202
- Fixed for backends with higher alignments by @ArthurBrussee in #191