v3.4
Performance Optimizations
-
Intel Architecture Processors:
- Improved performance for 4th generation Intel Xeon Scalable processors (formerly Sapphire Rapids).
- Improved performance for the future Intel Xeon Scalable processors (code-named Sierra Forest and Granite Rapids). These optimizations are now included by default on compatible processors.
- Improved RNN primitive performance with LBR_GRU cell.
- Improved softmax performance on processors with Intel AVX2 or Intel AVX-512 instruction set support.
- Improved fp32 inner product performance on processors with Intel AVX2 instruction set support.
- Improved fp32, fp16, bf16 matmul primitive performance on processors with Intel AVX-512 and Intel AMX instruction set support.
- Improved int8 matmul performance with transposed A tensor.
- Improved performance of resampling primitive on processors with Intel AVX2 instruction set support.
- Improved performance of int8 convolution with post-ops.
- Optimized batch matmul with binary post-op and broadcast mask
1
and14
. - Improved the Scaled Dot Product Attention (SDPA) subgraph performance with Graph API.
- Improved performance of subgraphs including
matmul
andadd
operations and mixed int8 and bfloat16 data types with Graph API. - [experimental] Improved performance of
reduction
,softmax
andlayernorm
operations with experimental Graph Compiler backend. - [experimental] Improved performance for llama2 MLP subgraph with experimental Graph Compiler backend.
-
Intel Graphics Products:
- Introduced initial optimizations for Processor Graphics based on Xe2 architecture.
- Improved performance for the Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Improved performance for Intel Arc graphics (formerly Alchemist and DG2) and the Intel Data Center GPU Flex Series (formerly Arctic Sound).
- Improved matmul performance for cases relevant to Large Language Models (LLMs) and Transformer-like models.
- Improved convolution performance for cases relevant to the Stable Diffusion model.
- Improved RNN primitive performance.
- Improved pooling forward propagation performance.
- Improved batched matmul performance for cases with 5 dimensions or more.
-
AArch64-based Processors:
- Added an option to build oneDNN with macOS Accelerate library to improve performance on Apple silicon.
- Improved reorder primitive performance with Compute Library for the Arm architecture (ACL).
- Improved bf16 inner product product primitive performance with ACL.
Functionality
- Introduced GPT-Q support to improve Large Language Models (LLMs) performance with compressed weights. Optimized implementation is available for Intel Graphics Products and support matmul with int8 weight compression.
- Introduced fp8 data type support in primitives and Graph API. Optimized implementation is available for Intel Data Center GPU Max Series (formerly Ponte Vecchio).
- Introduced support for fp16 and bf16 scale and shift arguments for layer normalization. Optimized implementation is available for Intel Graphics Products.
- [experimental] Introduced unstructured sparsity support for processors with Intel AMX support relying on VCOMPRESS/VPEXPAND instructions.
- Intel Graphics Products
- Introduced support for Intel Data Center GPU Max 1550VG
- Introduced PReLU post-op support for inner product and matmul primitives.
Usability
- Added opt-in deterministic mode support. Deterministic mode guarantees that results are bitwise identical between runs in a fixed environment.
- Introduced accumulation mode control.
- Extended oneDNN verbose diagnostics with information on dispatching decisions in convolution and matmul implementations.
- Extended verbose diagnostics for Graph API with information for operation schema check results and pattern matching results.
- Reduced RNN primitive memory consumption on GPUs.
- Added examples demonstrating use of oneDNN Graph API in eager mode use cases.
- Extended tensor constructor in Graph API to support memory allocation and management by the library.
- Introduced new API and environment variable to manage Graph API constant tensor cache capacity.
- Improved the efficiency of pattern matching in Graph API by optimizing pattern registration, reducing pattern numbers, and skipping patterns more wisely.
- Changed default optimization flags for AArch64 builds to
-mcpu=generic
to improve portability.
Validation
- Improved benchdnn performance by optimizing bottlenecks in validation code.
- Introduced
--num-streams
knob in benchdnn to support benchmarking in multi-stream scenarios.
Known Limitations
- Intel Datacenter GPU Flex Series driver for Windows has an issue resulting in program hangs or crashes when oneDNN primitives are created concurrently.
- int8 concat primitive may produce incorrect results on integrated GPUs with current GPU driver.
- fp32 pooling primitive may produce incorrect results in rare conditions on Intel Datacenter GPU Max Series with current GPU driver.
- reorder primitive causes segmentation fault for prime sizes exceeding 2^31 on Intel CPUs.
- fp64 convolution and deconvolution produces incorrect results on integrated graphics in future Intel Core processors (code name Arrow Lake)
- int8 matmul primitive creation with fp32 bias fails on Intel GPU Flex Series and Intel Arc Graphics.
Breaking Changes
- Updated minimal supported ACL version to 23.11 (was 23.02.1).
Thanks to these Contributors
This release contains contributions from the project core team as well as Alexander Grund @Flamefire, David Svantesson @davsva01, Fadi Arafeh @fadara01, Hugh Delaney @hdelan, Ilya Lavrenov @ilya-lavrenov, Jacob Kahn @jacobkahn, Nathan John Sircombe @nSircombe, Renato Barros Arantes @renato-arantes, Sergey Shalnov @shssf, Sunita Nadampalli @snadampal, and Svetlozar Georgiev @sgeor255. We would also like to thank everyone who asked questions and reported issues.