feat: no-cache attention in PyTorch workflow #3085


Merged: 14 commits into NVIDIA:main on Apr 4, 2025

Conversation

qixiang-99 (Collaborator)

No description provided.

@qixiang-99 qixiang-99 force-pushed the feat/no_cache_attn_trtllm branch from 049879f to bd371f8 Compare March 26, 2025 02:28
@qixiang-99 (Collaborator Author)

/bot run

@niukuo (Collaborator) commented Mar 26, 2025

PR_Github #498 [ run ] triggered by Bot

@niukuo (Collaborator) commented Mar 26, 2025

PR_Github #498 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #430 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #617 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #617 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #519 completed with status: 'FAILURE'

@symphonylyh (Collaborator) left a comment


Reviewed internally and approved. Just needs the README added and CI to pass.

@qixiang-99 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #622 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #622 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #524 completed with status: 'FAILURE'

@qixiang-99 qixiang-99 force-pushed the feat/no_cache_attn_trtllm branch 2 times, most recently from 5ab8aba to 78b1917 Compare April 1, 2025 23:04
@qixiang-99 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #932 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #932 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #734 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #977 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #977 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #761 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #1036 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1036 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #800 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run --stage-list "L40S-1"

@tensorrt-cicd (Collaborator)

PR_Github #1059 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1059 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #814 (Partly Tested) completed with status: 'SUCCESS'

@qixiang-99 qixiang-99 force-pushed the feat/no_cache_attn_trtllm branch from 6ac8df5 to c5785a6 Compare April 3, 2025 06:26
@qixiang-99 (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #1079 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1079 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #825 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run --stage-list "A30-CPP-1"

@tensorrt-cicd (Collaborator)

PR_Github #1151 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #866 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run --reuse-test

@tensorrt-cicd (Collaborator)

PR_Github #1164 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1164 [ run ] completed with state FAILURE
/LLM/main/L0_MergeRequest_PR pipeline #873 completed with status: 'FAILURE'

@qixiang-99 (Collaborator Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #1167 [ run ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1167 [ run ] completed with state SUCCESS
/LLM/main/L0_MergeRequest_PR pipeline #875 completed with status: 'SUCCESS'

Commits (14):

1. init trtllm attn no cache
2. fix: fix the seq_len issue and attn metadata prepare for qwen reward model test
   fix: fix minor bugs after rebase
3. refactor: remove unnecessary debug logs and clean up commented code
   refactor: update max_seq_len documentation and remove max_seq_len for the decoder model constructor in PyTorchModelEngine
4. refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks
5. refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine
6. refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling
7. refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function
8. refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment (remove Debug code)
9. refactor: Resolve comments for Python code
   Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata. Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.
10. docs: Add is_dummy_attention field to attention metadata for simulation operations
11. refactor: add KVCacheParams to attention backend interface and import relevant metadata classes
    Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py.
12. fix: fix rebase format issue
13. fix: extend attention mask type handling in MHARunnerFixedParams
    Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType.
14. fix: enhance attention mask type handling in TllmGenFmhaRunnerParams
    Updated the setAttentionMaskType method to use a switch-case structure for proper mapping and error handling of invalid mask types.

Each commit is signed off by Qixiang Lin <[email protected]>. Illustrative sketches of the reference-attention test (item 4), the no-cache metadata preparation (items 8 to 11), and the mask-type mapping (items 13 and 14) follow this list.
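For context on item 4: below is a minimal PyTorch sketch of the kind of reference computation such a test compares a no-cache attention backend against, namely plain scaled dot-product attention over the whole prompt with either a FULL (bidirectional) or a CAUSAL mask. The function is illustrative only; it is not the repository's actual calculate_ref_result.

```python
import math

import torch


def reference_attention(q, k, v, mask_type="CAUSAL"):
    """Reference no-cache attention for a single sequence.

    q, k, v: tensors of shape [num_heads, seq_len, head_dim].
    mask_type: "FULL" (bidirectional) or "CAUSAL".
    """
    head_dim = q.shape[-1]
    # Attention scores: [num_heads, seq_len, seq_len]
    scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(head_dim)

    if mask_type == "CAUSAL":
        seq_len = q.shape[-2]
        # Each position may only attend to itself and earlier positions.
        allowed = torch.tril(
            torch.ones(seq_len, seq_len, dtype=torch.bool, device=q.device))
        scores = scores.masked_fill(~allowed, float("-inf"))
    elif mask_type != "FULL":
        raise ValueError(f"unsupported mask type: {mask_type}")

    probs = torch.softmax(scores, dim=-1)
    return torch.matmul(probs, v)  # [num_heads, seq_len, head_dim]
```

A test along these lines would run the backend on the same q/k/v without a KV cache and compare the outputs with torch.testing.assert_close, once per mask type.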
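Items 8 to 11 describe how KV-cache usage becomes an optional property of the attention metadata rather than a separate use_kv_cache flag, with the no-cache fields filled directly in prepare() and an is_dummy_attention flag for simulation passes. The class below is a simplified, hypothetical stand-in, not TensorRT-LLM's actual TrtllmAttentionMetadata or KVCacheParams; it only illustrates that shape.

```python
from dataclasses import dataclass, field
from typing import Optional

import torch
import torch.nn.functional as F


@dataclass
class KVCacheParamsSketch:
    """Hypothetical stand-in for the backend's KV cache parameters."""
    use_cache: bool = True
    tokens_per_block: int = 64


@dataclass
class AttentionMetadataSketch:
    """Illustrative metadata container; not the real TrtllmAttentionMetadata."""
    seq_lens: torch.Tensor                          # [batch] prompt lengths
    kv_cache_params: Optional[KVCacheParamsSketch] = None
    is_dummy_attention: bool = False                # True for warmup/simulation passes
    # Derived fields filled by prepare():
    num_tokens: int = 0
    cu_seqlens: torch.Tensor = field(
        default_factory=lambda: torch.zeros(1, dtype=torch.int32))

    @property
    def use_kv_cache(self) -> bool:
        # Accessor instead of reading kv_cache_params directly,
        # in the spirit of the useKVCache() change in item 8.
        return self.kv_cache_params is not None and self.kv_cache_params.use_cache

    def prepare(self) -> None:
        # Fields needed by both the cached and the no-cache paths.
        self.num_tokens = int(self.seq_lens.sum())
        self.cu_seqlens = F.pad(
            torch.cumsum(self.seq_lens, dim=0, dtype=torch.int32), (1, 0))
        if self.is_dummy_attention or not self.use_kv_cache:
            # No-cache path: attention runs over the full prompt in a single
            # context pass, so there are no cache blocks to set up.
            return
        # A real cached path would set up block tables, cache indices, etc. here.
```

With this shape, callers never pass a use_kv_cache argument to the attention function; they inspect the metadata instead, which matches the intent of item 6.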
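Items 13 and 14 are C++ changes to MHARunnerFixedParams and TllmGenFmhaRunnerParams. Purely to illustrate the pattern they describe (an explicit, validated mapping between two mask-type enumerations instead of an unchecked cast), here is a small Python sketch; both enums and the mapping are hypothetical mirrors of names mentioned in the commit messages, not the real C++ types.

```python
from enum import Enum, auto


class ContextMaskTypeSketch(Enum):
    """Hypothetical mirror of a context-phase mask-type enum."""
    PADDING = auto()
    CAUSAL = auto()
    BIDIRECTIONAL = auto()
    BIDIRECTIONALGLM = auto()
    BLOCKSPARSE = auto()


class RunnerMaskTypeSketch(Enum):
    """Hypothetical mirror of a runner-side mask-type enum."""
    PADDING = auto()
    CAUSAL = auto()
    BIDIRECTIONAL = auto()
    BIDIRECTIONALGLM = auto()
    BLOCKSPARSE = auto()


# Explicit mapping: if the two enums ever drift apart, lookups fail loudly
# instead of silently producing the wrong mask type.
_MASK_TYPE_MAP = {
    ContextMaskTypeSketch.PADDING: RunnerMaskTypeSketch.PADDING,
    ContextMaskTypeSketch.CAUSAL: RunnerMaskTypeSketch.CAUSAL,
    ContextMaskTypeSketch.BIDIRECTIONAL: RunnerMaskTypeSketch.BIDIRECTIONAL,
    ContextMaskTypeSketch.BIDIRECTIONALGLM: RunnerMaskTypeSketch.BIDIRECTIONALGLM,
    ContextMaskTypeSketch.BLOCKSPARSE: RunnerMaskTypeSketch.BLOCKSPARSE,
}


def set_attention_mask_type(mask: ContextMaskTypeSketch) -> RunnerMaskTypeSketch:
    """Validated conversion, analogous to the switch-case in setAttentionMaskType."""
    try:
        return _MASK_TYPE_MAP[mask]
    except KeyError as err:
        raise ValueError(f"Invalid attention mask type: {mask!r}") from err
```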
@symphonylyh symphonylyh force-pushed the feat/no_cache_attn_trtllm branch from 0b9a699 to c211717 Compare April 4, 2025 17:42
@symphonylyh (Collaborator)

/bot reuse-pipeline

@symphonylyh symphonylyh enabled auto-merge (squash) April 4, 2025 17:45
@tensorrt-cicd (Collaborator)

PR_Github #1182 [ reuse-pipeline ] triggered by Bot

@tensorrt-cicd (Collaborator)

PR_Github #1182 [ reuse-pipeline ] completed with state SUCCESS
Reusing PR_Github #1167 for commit c211717

@symphonylyh symphonylyh merged commit 0d4d50a into NVIDIA:main Apr 4, 2025
2 checks passed
sarattha pushed a commit to sarattha/TensorRT-LLM that referenced this pull request Apr 9, 2025
* init trtllm attn no cache

Signed-off-by: Qixiang Lin <[email protected]>

* fix: fix the seq_len issue and attn metadata prepare for qwen reward model test

fix: fix minor bugs after rebase
Signed-off-by: Qixiang Lin <[email protected]>

* refactor: remove unnecessary debug logs and clean up commented code

refactor: update max_seq_len documentation and remove max_seq_len for the decoder model constructor in PyTorchModelEngine
Signed-off-by: Qixiang Lin <[email protected]>

* refactor: update calculate_ref_result function to accept tensor inputs and mask type, enhance test_attention_no_cache to support FULL and CAUSAL masks

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: remove unused BERT attention metadata conversion method and add type assertion for no cache attention in PyTorchModelEngine

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: remove use_kv_cache parameter from attention function and related classes, update documentation for KV cache handling

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: implement setAttentionMaskType method for better mask type handling and remove unused conversion function

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: streamline KV cache handling by replacing direct member access with useKVCache method and simplify token per block assignment

remove Debug code.

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: Resolve comments for Python code

Simplify no cache attention metadata preparation and streamline related attributes in TrtllmAttentionMetadata

Removed the private method for converting to no cache attention metadata and integrated its logic into the prepare method. Updated the test for BERT sequence classification to reflect these changes and ensure proper handling of attention metadata.

Signed-off-by: Qixiang Lin <[email protected]>

* docs: Add is_dummy_attention field to attention metadata for simulation operations

Signed-off-by: Qixiang Lin <[email protected]>

* refactor: add KVCacheParams to attention backend interface and import relevant metadata classes

Updated the attention backend interface to include KVCacheParams and imported TrtllmAttentionMetadata and VanillaAttentionMetadata in model_engine.py for enhanced functionality.

Signed-off-by: Qixiang Lin <[email protected]>

* fix: fix rebase format issue

Signed-off-by: Qixiang Lin <[email protected]>

* fix: extend attention mask type handling in MHARunnerFixedParams

Added support for additional attention mask types (BIDIRECTIONAL, BIDIRECTIONALGLM, BLOCKSPARSE) in the MHARunnerFixedParams structure to fix the mapping issue between ContextAttentionMaskType and AttentionMaskType

Signed-off-by: Qixiang Lin <[email protected]>

* fix: enhance attention mask type handling in TllmGenFmhaRunnerParams

Updated the setAttentionMaskType method to include a switch-case structure for better handling of attention mask types, ensuring proper mapping and error handling for invalid types.

Signed-off-by: Qixiang Lin <[email protected]>

---------

Signed-off-by: Qixiang Lin <[email protected]>
Signed-off-by: sarattha <[email protected]>
tomeras91 pushed a commit to tomeras91/TensorRT-LLM that referenced this pull request Apr 9, 2025 (same squashed commit message as above)
tomeras91 pushed a commit to tomeras91/TensorRT-LLM that referenced this pull request Apr 9, 2025 (same squashed commit message as above)