[pull] main from NVIDIA:main #90

Open · wants to merge 13 commits into main from NVIDIA:main

Conversation

@pull[bot] commented Mar 6, 2025

See Commits and Changes for more details.


Created by pull[bot] (v2.0.0-alpha.1)

Can you help keep this open source service alive? 💖 Please sponsor : )

timmoon10 and others added 4 commits March 6, 2025 08:57
* Enable MXFP8 LayerNorm and RMSNorm

Signed-off-by: Tim Moon <[email protected]>
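
As a sketch of what this commit enables: running a TE normalization layer under an MXFP8 recipe. A minimal sketch, assuming TE 2.x with the `MXFP8BlockScaling` recipe class and a Blackwell GPU; names may differ across versions.

```python
# Hedged sketch: exercising MXFP8 LayerNorm via fp8_autocast (assumes TE 2.x
# exposes MXFP8BlockScaling and the hardware supports MXFP8).
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import MXFP8BlockScaling

recipe = MXFP8BlockScaling()  # MXFP8 block-scaled quantization recipe
layer = te.LayerNormLinear(1024, 1024, params_dtype=torch.bfloat16, device="cuda")

x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)  # the fused LayerNorm output is produced in MXFP8
```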

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fix compilation

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Fix envvar

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Nicolas Castet <[email protected]>
* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* fix

Signed-off-by: Pawel Gadzinski <[email protected]>

* test

Signed-off-by: Pawel Gadzinski <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* more sensitive tests

Signed-off-by: Pawel Gadzinski <[email protected]>

* typo fix; fix test skipping on Blackwell

Signed-off-by: Pawel Gadzinski <[email protected]>

---------

Signed-off-by: Pawel Gadzinski <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@pull[bot] added the ⤵️ pull label Mar 6, 2025
vasunvidia and others added 9 commits March 6, 2025 14:56
* Remove cudaStreamSynchronize call

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Use cudaMemsetAsync instead of cudaMemcpyAsync

Signed-off-by: Vasudevan Rengasamy <[email protected]>

* Update transformer_engine/common/transformer_engine.cpp

Co-authored-by: Tim Moon <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

---------

Signed-off-by: Vasudevan Rengasamy <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
Add NVTX ranges

Signed-off-by: Jaemin Choi <[email protected]>
Co-authored-by: Jaemin Choi <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
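
For readers unfamiliar with NVTX: ranges are named begin/end markers that show up as labeled spans in Nsight Systems timelines. A minimal user-side illustration with PyTorch's NVTX bindings (TE's own ranges are added internally; this only shows the mechanism):

```python
# Illustration of the NVTX range mechanism via torch.cuda.nvtx.
import torch

def profiled_forward(module, x):
    torch.cuda.nvtx.range_push("forward")  # open a named range on the timeline
    y = module(x)
    torch.cuda.nvtx.range_pop()            # close the range
    return y
```
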
Don't set data to null

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Fix incorrect docstrings in tensor saving functions

Signed-off-by: Tim Moon <[email protected]>
* fix recompilation of out and lse correction in p2p+bshd/sbhd

Signed-off-by: Xiaowei Ren <[email protected]>

* fix recompilation of get_seq_chunk_ids_for_reordering

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix recompilation of reorder_seq_chunks_for_a2a

Signed-off-by: Xiaowei Ren <[email protected]>

* recover a change

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* minor change to softmax_lse correction

Signed-off-by: Xiaowei Ren <[email protected]>
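
The "out and lse correction" these commits touch is the standard log-sum-exp merge used when context parallelism computes attention over chunks: each rank produces a partial output plus its softmax log-sum-exp, and the partials are combined by rescaling. A plain-torch sketch of that correction (TE's version is fused and layout-aware, so this is illustrative only):

```python
import torch

def merge_attn_partials(out1, lse1, out2, lse2):
    # out*: partial attention outputs [..., seq, head_dim]
    # lse*: log-sum-exp of the attention logits [..., seq]
    lse = torch.logaddexp(lse1, lse2)         # combined normalizer
    w1 = torch.exp(lse1 - lse).unsqueeze(-1)  # rescale weight for partial 1
    w2 = torch.exp(lse2 - lse).unsqueeze(-1)  # rescale weight for partial 2
    return w1 * out1 + w2 * out2, lse
```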

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* cache cu_seqlens for BSHD/SBHD format

Signed-off-by: Xiaowei Ren <[email protected]>
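
cu_seqlens is the vector of cumulative sequence offsets. For fixed-shape BSHD/SBHD batches it is fully determined by batch size and sequence length, so it can be built once and cached instead of being recreated (and recompiled against) every step. A hypothetical cache illustrating the idea, not TE's actual code:

```python
import torch

_cu_seqlens_cache = {}

def get_cu_seqlens(batch_size: int, seqlen: int, device: str = "cuda"):
    # For fixed-length batches, cu_seqlens is just [0, s, 2s, ..., b*s].
    key = (batch_size, seqlen)
    if key not in _cu_seqlens_cache:
        _cu_seqlens_cache[key] = torch.arange(
            0, (batch_size + 1) * seqlen, seqlen,
            dtype=torch.int32, device=device)
    return _cu_seqlens_cache[key]
```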

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* do not need to allocate out buffer for BSHD/SBHD

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* minor fix

Signed-off-by: Xiaowei Ren <[email protected]>

* refactor init out correction

Signed-off-by: Xiaowei Ren <[email protected]>

* fix a docstring

Signed-off-by: Xiaowei Ren <[email protected]>

* typo fix

Signed-off-by: Xiaowei Ren <[email protected]>

* code refactoring

Signed-off-by: Xiaowei Ren <[email protected]>

* fix dtype in init out correction

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to DPA API

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>

* add pad_between_seqs to the API of MHA and transformer layer

Signed-off-by: Xiaowei Ren <[email protected]>
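
A hedged usage sketch of the new flag (the surrounding calls follow the public DotProductAttention API, but the exact placement of pad_between_seqs should be checked against the merged code):

```python
import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)

# Four sequences of 1024 tokens packed in THD format.
q = k = v = torch.randn(4096, 16, 64, device="cuda", dtype=torch.bfloat16)
cu_seqlens = torch.tensor([0, 1024, 2048, 3072, 4096],
                          dtype=torch.int32, device="cuda")

out = attn(q, k, v, qkv_format="thd",
           cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
           pad_between_seqs=True)  # sequences may be separated by padding
```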

---------

Signed-off-by: Xiaowei Ren <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* check in per-tensor current scaling full recipe

Signed-off-by: zhongboz <[email protected]>
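
Current scaling derives each tensor's FP8 scale from its live amax rather than from a history of past amaxes (delayed scaling). A hedged usage sketch; the Float8CurrentScaling name is taken from this PR's recipe naming and may differ in the released API:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

recipe = Float8CurrentScaling()  # scale computed from each tensor's current amax
linear = te.Linear(1024, 1024, params_dtype=torch.bfloat16, device="cuda")

x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = linear(x)
```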

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

setup basics of current scaling quantizer in python level

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

add test case for current scaling dequantize

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

finish linear layer fwd bwd test, determined error with bf16

Signed-off-by: zhongboz <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

Signed-off-by: zhongboz <[email protected]>

achieved zero tolerance for Linear by specifying the gemm use_split_accumulator config

Signed-off-by: zhongboz <[email protected]>
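
use_split_accumulator controls how FP8 GEMM partial products are accumulated: partial sums are periodically promoted into a wider accumulator rather than accumulating everything in reduced precision, which is what lets the test match the reference bitwise. A conceptual sketch of split accumulation (not TE's GEMM code):

```python
import numpy as np

def split_accumulate(products, chunk: int = 16):
    # Accumulate in low precision within a chunk, then flush to float32.
    total = np.float32(0)
    for i in range(0, len(products), chunk):
        partial = np.float16(0)                # narrow in-chunk accumulator
        for p in products[i:i + chunk]:
            partial = np.float16(partial + p)  # low-precision adds
        total += np.float32(partial)           # promote once per chunk
    return total
```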

enable layernormlinear with current scaling, pass bitwise test

Signed-off-by: zhongboz <[email protected]>

refactor test case code

Signed-off-by: zhongboz <[email protected]>

make current scaling quantizers distributed, pass distributed linear & layernormlinear tests

Signed-off-by: zhongboz <[email protected]>

bug fix: use cached fp8 recipe in backward

Signed-off-by: zhongboz <[email protected]>
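
The bug class here: the globally active recipe can change between a module's forward and its backward (e.g. with interleaved pipeline schedules), so backward must reuse the recipe captured at forward time. A hypothetical illustration of the pattern, not TE's module code:

```python
import torch

class RecipeAwareOp(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, recipe):
        ctx.recipe = recipe  # cache the recipe active during forward
        return x.clone()

    @staticmethod
    def backward(ctx, grad_out):
        recipe = ctx.recipe  # reuse the cached recipe, not global state
        return grad_out, None
```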

fix layernorm_mlp with current scaling, fix activation_helper with current scaling

Signed-off-by: zhongboz <[email protected]>

support passing detailed numerical settings from the recipe to the quantization kernel

Signed-off-by: zhongboz <[email protected]>

resolving MR comments

Signed-off-by: zhongboz <[email protected]>

recipe naming

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, remove IS_CURRENT_SCALING template from kernels

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, make current scaling c++ test cases

Signed-off-by: zhongboz <[email protected]>

* add current scaling to test_numerics.py, skip act recomp and grouped linear

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmark for quantizer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmarks for linear layer

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* bug fix, typo

Signed-off-by: zhongboz <[email protected]>

* resolve more mr comments

Signed-off-by: zhongboz <[email protected]>

* avoid potential race condition by not using from_blob to construct amax tensor in C++

Signed-off-by: zhongboz <[email protected]>

* resolve more comments

Signed-off-by: zhongboz <[email protected]>

* Debug linter warnings and license check

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Debug import error in FP8 tensor test

Signed-off-by: Tim Moon <[email protected]>

* Debug compilation error with CUDA 12.1 for Turing

Signed-off-by: Tim Moon <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* resolve mr comments, fix activation cast fusion

Signed-off-by: zhongboz <[email protected]>

* resolve comments, add NVTEQuantizationParams for compute scale

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* remove is_current_scaling check totally from common folder

Signed-off-by: zhongboz <[email protected]>

* remove benchmarks, will contribute in another repo

Signed-off-by: zhongboz <[email protected]>

* adjust current scaling default recipe config

Signed-off-by: zhongboz <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* adjust comments in test

Signed-off-by: zhongboz <[email protected]>

* Remove current scaling mode from core lib

Signed-off-by: Tim Moon <[email protected]>

* Refactor current-scaling-specific logic in core C++ lib

Move the amax and scale update functions out of the casting functions and into a dedicated current-scaling source file. Add a general API for accessing the quantization config object.

Signed-off-by: Tim Moon <[email protected]>
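
For reference, the amax/scale update being factored out is the current-scaling math itself: amax = max|x| and scale = fp8_max / amax. Written out in plain torch (TE performs this in fused CUDA kernels; torch.float8_e4m3fn needs a recent PyTorch):

```python
import torch

FP8_E4M3_MAX = 448.0  # largest representable magnitude in float8_e4m3

def compute_scale(x: torch.Tensor, eps: float = 1e-12):
    amax = x.abs().amax().clamp(min=eps)  # current tensor's amax
    scale = FP8_E4M3_MAX / amax           # map the data onto the FP8 range
    return amax, scale

def quantize(x: torch.Tensor, scale: torch.Tensor):
    data = (x * scale).to(torch.float8_e4m3fn)
    return data, scale.reciprocal()       # FP8 data plus dequantization scale
```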

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Add missing header in C++ tests

Signed-off-by: Tim Moon <[email protected]>

* Disable test config with FP8 transpose on Blackwell

Signed-off-by: Tim Moon <[email protected]>

* Fix compilation error in C++ test

Signed-off-by: Tim Moon <[email protected]>

---------

Signed-off-by: zhongboz <[email protected]>
Signed-off-by: Tim Moon <[email protected]>
Co-authored-by: zhongboz <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <[email protected]>
Co-authored-by: Tim Moon <[email protected]>
* Verified TE2.0 with offloading

Signed-off-by: Selvaraj Anandaraj <[email protected]>
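
A hedged sketch of the offloading path these commits exercise, using TE's CPU offload context (argument names vary across versions; check transformer_engine.pytorch for the current signature):

```python
import torch
import transformer_engine.pytorch as te

offload_ctx, sync_fn = te.get_cpu_offload_context(enabled=True, num_layers=1)
layer = te.Linear(1024, 1024, params_dtype=torch.bfloat16, device="cuda")

x = torch.randn(32, 1024, device="cuda", dtype=torch.bfloat16, requires_grad=True)
with offload_ctx:    # activations saved inside are offloaded to CPU
    y = layer(x)
y = sync_fn(y)       # synchronize offloaded tensors before use
y.sum().backward()   # activations are reloaded to GPU for backward
```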

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Skipping tests for Ampere and removing child class preparation

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* offloading support for MXFP8 dtype

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Changed quantized tensor detection mechanism

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* Fix mxfp8 offload, lint errors, and var name

Signed-off-by: Kirthi Shankar Sivamani <[email protected]>

* Supported disabling offloading for quantized tensors

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* bug fix

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Fixed bugs

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added support for None in the list of quantized data tensors

Signed-off-by: root <[email protected]>

* Hopper backward compatibility cleanup

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Coding style nit

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Added guards

Signed-off-by: Selvaraj Anandaraj <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

---------

Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Kirthi Shankar Sivamani <[email protected]>
Co-authored-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Kirthi Shankar Sivamani <[email protected]>
Internal quantizer for input to the modules

Signed-off-by: Przemek Tredak <[email protected]>