[WebGPU EP] Support GroupQueryAttention #22658

satyajandhyala · 2024-10-30T07:09:29Z

Description

Motivation and Context

satyajandhyala · 2024-11-01T22:57:47Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+  const bool feed_past_key = present_key != nullptr && past_key != nullptr && past_key->SizeInBytes() > 0;
+  const bool has_present_key = output_count > 1 && past_key;
+  const bool has_attention_bias = attention_bias != nullptr;
+  const int tile_size = 12;


onnxruntime/contrib_ops/webgpu/bert/attention.cc

github-actions

You can commit the suggested changes from lintrunner.

onnxruntime/contrib_ops/webgpu/bert/attention.cc

skottmckay · 2024-11-04T06:33:45Z

onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc

+                          present_value, parameters, context, seqlen_k, total_seqlen_tensor);
+  }
+  TensorShape k_new_shape(k_new_dims);
+  Tensor K = context.CreateGPUTensor(key->DataType(), k_new_shape);


This line causes a segfault with these models: https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx/tree/main/cpu_and_mobile. They have GroupQueryAttention (GQA) nodes without the optional key and value inputs, so the corresponding Tensor* is a nullptr.

onnxruntime/onnxruntime/core/graph/contrib_ops/bert_defs.cc

Lines 1090 to 1111 in c64459f

    .Input(1,
           "key",
           "Key with shape (batch_size, kv_sequence_length, kv_hidden_size) ",
           "T",
           OpSchema::Optional)
    .Input(2,
           "value",
           "Value with shape (batch_size, kv_sequence_length, kv_hidden_size)",
           "T",
           OpSchema::Optional)
    .Input(3,
           "past_key",
           "past state key with support for format BNSH. When past_key uses same tensor as present_key"
           "(k-v cache), it is of length max_sequence_length... otherwise of length past_sequence_length.",
           "T",
           OpSchema::Optional)
    .Input(4,
           "past_value",
           "past state value with support for format BNSH. When past_value uses same tensor as present_value"
           "(k-v cache), it is of length max_sequence_length... otherwise of length past_sequence_length.",
           "T",
           OpSchema::Optional)
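For illustration, here is a minimal sketch of the kind of guard this comment calls for. It uses a stand-in `Tensor` struct, not the real onnxruntime class, and `InspectKvInputs` and `KvInputs` are hypothetical names. It also assumes that a missing key/value pair means the query input carries packed QKV, so the K shape is derived from query rather than by dereferencing key. The actual fix in the PR may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for onnxruntime's Tensor (hypothetical; only holds a shape).
struct Tensor {
  std::vector<int64_t> dims;
};

// Hypothetical summary of the optional key/value inputs.
struct KvInputs {
  bool is_packed_qkv;           // true when K and V are packed inside `query`
  std::vector<int64_t> k_dims;  // shape to use when allocating K
};

// key and value are OpSchema::Optional, so either pointer may be nullptr.
// Assumption: both are present or both are absent.
KvInputs InspectKvInputs(const Tensor& query, const Tensor* key, const Tensor* value,
                         int64_t kv_num_heads, int64_t head_size) {
  assert(query.dims.size() == 3);
  assert((key == nullptr) == (value == nullptr));
  if (key != nullptr) {
    // Separate K/V inputs: safe to read the key tensor's shape.
    return {false, key->dims};
  }
  // Packed case: never touch key-> or value-> here.
  return {true, {query.dims[0], query.dims[1], kv_num_heads * head_size}};
}
```

The point of the sketch is only that every access to key or value sits behind a null check; how the packed layout is split is left out.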

WebgpuAttentionParameters does not copy the value of is_packed_qkv_ from GroupQueryAttentionParameters.

is_packed_qkv_ needs to be initialized correctly.

skottmckay · 2024-11-04T06:48:13Z

onnxruntime/contrib_ops/webgpu/bert/attention_common.h

+struct WebgpuAttentionParameters {
+  WebgpuAttentionParameters(AttentionParameters parameters) : is_gqa_parameters_(false),
+                                                              batch_size_(parameters.batch_size),


What's the reason WebGPU needs a parameters struct that combines AttentionParameters and GroupQueryAttentionParameters? It feels a little confusing to merge those, and I'm wondering why it's necessary when other EPs that implement these operators don't need it.

I am trying to avoid code duplication. I refactored the shared code into attention, which is used by both GQA and MHA, whereas the CPU version has a separate GQA implementation. group_query_attention_helper::CheckInputs() and AttentionBase::CheckInputs() output different structs, GroupQueryAttentionParameters and AttentionParameters respectively. The WebGPU parameters struct is a union of these two structs.
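A rough sketch of the design described in this reply, with heavily simplified stand-ins for the two source structs. The field subsets are hypothetical, and setting kv_num_heads_ equal to num_heads for the MHA path is an assumption made for the sketch. The real structs in the PR carry many more fields.

```cpp
#include <cassert>

// Simplified stand-ins for the two CheckInputs() result structs (hypothetical field subsets).
struct AttentionParameters {
  int batch_size = 0;
  int num_heads = 0;
  int head_size = 0;
};

struct GroupQueryAttentionParameters {
  int batch_size = 0;
  int num_heads = 0;
  int kv_num_heads = 0;
  int head_size = 0;
  bool is_packed_qkv = false;
};

// One struct the shared attention code can consume, built from either source.
struct WebgpuAttentionParameters {
  explicit WebgpuAttentionParameters(const AttentionParameters& p)
      : is_gqa_parameters_(false),
        batch_size_(p.batch_size),
        num_heads_(p.num_heads),
        kv_num_heads_(p.num_heads),  // assumption: MHA has one KV head per query head
        head_size_(p.head_size),
        is_packed_qkv_(false) {}

  explicit WebgpuAttentionParameters(const GroupQueryAttentionParameters& p)
      : is_gqa_parameters_(true),
        batch_size_(p.batch_size),
        num_heads_(p.num_heads),
        kv_num_heads_(p.kv_num_heads),
        head_size_(p.head_size),
        is_packed_qkv_(p.is_packed_qkv) {}  // the field the review asked to be copied

  bool is_gqa_parameters_;
  int batch_size_;
  int num_heads_;
  int kv_num_heads_;
  int head_size_;
  bool is_packed_qkv_;
};
```

Initializing every member in both constructors keeps a GQA-only field such as is_packed_qkv_ from being left indeterminate on either path.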

onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2024-11-06T18:21:13Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+    shader.MainFunctionBody() << "let kOffset = workgroup_id.z * uniforms.kv_sequence_length * uniforms.K;\n";
+    if ((feed_past_key_  && has_present_key_) || past_present_share_buffer_) {
+      shader.MainFunctionBody() << "let pastKeyOffset = workgroup_id.z * uniforms.past_sequence_length * uniforms.K;\n";


Suggested change

    shader.MainFunctionBody() << "let kOffset = workgroup_id.z * uniforms.kv_sequence_length * uniforms.K;\n";
    if ((feed_past_key_ && has_present_key_) || past_present_share_buffer_) {
      shader.MainFunctionBody() << "let pastKeyOffset = workgroup_id.z * uniforms.past_sequence_length * uniforms.K;\n";

github-actions · 2024-11-06T18:21:14Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+    shader.MainFunctionBody() << "    if (n + local_id.y < past_sequence_length) {\n"
+                                 "      tileK[idx] = " << (past_present_share_buffer_ ? "present_key" : "past_key") << "[pastKeyOffset + (n + local_id.y) * uniforms.K + w + local_id.x];\n"
+                                 "    } else  if (n + local_id.y - past_sequence_length < uniforms.kv_sequence_length) {\n"
+                                 "      tileK[idx] = key[kOffset + (n + local_id.y - past_sequence_length) * uniforms.K + w + local_id.x];\n"
+                                 "    }\n";
+  } else {


Suggested change

    shader.MainFunctionBody() << "    if (n + local_id.y < past_sequence_length) {\n"
                                 "      tileK[idx] = "
                              << (past_present_share_buffer_ ? "present_key" : "past_key") << "[pastKeyOffset + (n + local_id.y) * uniforms.K + w + local_id.x];\n"
                                 "    } else  if (n + local_id.y - past_sequence_length < uniforms.kv_sequence_length) {\n"
                                 "      tileK[idx] = key[kOffset + (n + local_id.y - past_sequence_length) * uniforms.K + w + local_id.x];\n"
                                 "    }\n";
  } else {
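To make the row selection in this shader fragment easier to follow, here is a CPU-side sketch of the same branching for a single head, so the per-head offsets `pastKeyOffset` and `kOffset` are taken as zero. `LoadKeyElement` is a hypothetical helper. Returning 0.0f for rows past the end is an assumption, since the fragment shown does not say what the tile holds in that case.

```cpp
#include <cassert>
#include <vector>

// Row `row` of the logical [past rows | new rows] key matrix comes either from
// the past buffer (past_key, or present_key when the buffers are shared) or
// from the newly supplied `key`. K is the row width, as in uniforms.K.
float LoadKeyElement(const std::vector<float>& past_key, const std::vector<float>& key,
                     int past_sequence_length, int kv_sequence_length,
                     int K, int row, int col) {
  assert(row >= 0 && col >= 0 && col < K);
  if (row < past_sequence_length) {
    // First branch of the shader: read from the cached keys.
    return past_key[row * K + col];
  }
  if (row - past_sequence_length < kv_sequence_length) {
    // Second branch: re-base the row index onto the new keys.
    return key[(row - past_sequence_length) * K + col];
  }
  return 0.0f;  // assumption: out-of-range rows contribute nothing
}
```

With two cached rows and one new row, rows 0 and 1 read from the past buffer and row 2 reads row 0 of `key`.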

github-actions · 2024-11-06T18:21:14Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+                               const Tensor* seqlen_k) {
+  const bool feed_past_value = present_value != nullptr && past_value != nullptr && past_value->SizeInBytes() > 0 &&  !parameters.past_present_share_buffer_;
+  const bool has_present_value = output_count > 1 && past_value != nullptr;


Suggested change

                               const Tensor* seqlen_k) {
  const bool feed_past_value = present_value != nullptr && past_value != nullptr && past_value->SizeInBytes() > 0 && !parameters.past_present_share_buffer_;
  const bool has_present_value = output_count > 1 && past_value != nullptr;

github-actions · 2024-11-06T18:21:14Z

onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc

+        .TypeConstraint("T", WebGpuSupportedFloatTypes())
+          .MayInplace(3, 1)
+          .MayInplace(4, 2)
+          .InputMemoryType(OrtMemTypeCPUInput, 6),
+    GroupQueryAttention);


Suggested change

        .TypeConstraint("T", WebGpuSupportedFloatTypes())
        .MayInplace(3, 1)
        .MayInplace(4, 2)
        .InputMemoryType(OrtMemTypeCPUInput, 6),
    GroupQueryAttention);

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2024-11-06T19:02:03Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+                              << "      tileK[idx] = " << (past_present_share_buffer_ ? "present_key" : "past_key") << "[pastKeyOffset + (n + local_id.y) * uniforms.K + w + local_id.x];\n"
+                                 "    } else  if (n + local_id.y - past_sequence_length < uniforms.kv_sequence_length) {\n"
+                                 "      tileK[idx] = key[kOffset + (n + local_id.y - past_sequence_length) * uniforms.K + w + local_id.x];\n"
+                                 "    }\n";
+  } else {


Suggested change

                              << "      tileK[idx] = " << (past_present_share_buffer_ ? "present_key" : "past_key") << "[pastKeyOffset + (n + local_id.y) * uniforms.K + w + local_id.x];\n"
                                 "    } else  if (n + local_id.y - past_sequence_length < uniforms.kv_sequence_length) {\n"
                                 "      tileK[idx] = key[kOffset + (n + local_id.y - past_sequence_length) * uniforms.K + w + local_id.x];\n"
                                 "    }\n";
  } else {

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2024-11-13T04:06:59Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+  const int past_sequence_length = output_count > 1 ? parameters.past_sequence_length_ : 0;
+const int total_sequence_length = seqlen_k == nullptr ? (past_sequence_length + parameters.kv_sequence_length_) : parameters.seqlen_present_kv_cache_;
+


Suggested change

  const int past_sequence_length = output_count > 1 ? parameters.past_sequence_length_ : 0;
  const int total_sequence_length = seqlen_k == nullptr ? (past_sequence_length + parameters.kv_sequence_length_) : parameters.seqlen_present_kv_cache_;
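As a worked example of the two expressions above, the sketch below evaluates them with plain integers. `ComputeSeqLengths` and `SeqLengths` are hypothetical names, and `has_seqlen_k` stands in for `seqlen_k != nullptr`. The reading that seqlen_k is supplied only on the GQA path is an assumption.

```cpp
#include <cassert>

struct SeqLengths {
  int past;
  int total;
};

// Mirrors the two ternaries in the diff using plain ints.
SeqLengths ComputeSeqLengths(int output_count, bool has_seqlen_k, int past_sequence_length,
                             int kv_sequence_length, int seqlen_present_kv_cache) {
  assert(kv_sequence_length >= 0);
  // Without present key/value outputs the past length is treated as zero.
  const int past = output_count > 1 ? past_sequence_length : 0;
  // With seqlen_k supplied, the total is the length of the present KV cache;
  // otherwise it is past rows plus newly supplied rows.
  const int total = has_seqlen_k ? seqlen_present_kv_cache : past + kv_sequence_length;
  return {past, total};
}
```

For example, with 10 cached positions, 1 new position and no seqlen_k, the total is 11. With seqlen_k present and a cache length of 4096, the total is 4096 regardless of the past length.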

github-actions

You can commit the suggested changes from lintrunner.

github-actions · 2024-11-14T00:47:43Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+
+  const uint32_t vectorized_head_size = (parameters.head_size_  + components - 1) / components;
+  program.SetDispatchGroupSize((total_sequence_length + tile_size - 1) / tile_size,


Suggested change

  const uint32_t vectorized_head_size = (parameters.head_size_ + components - 1) / components;
  program.SetDispatchGroupSize((total_sequence_length + tile_size - 1) / tile_size,
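Both expressions in this hunk are the usual round-up integer division. A small sketch, with `CeilDiv` as a hypothetical helper name that is not part of the PR:

```cpp
#include <cassert>
#include <cstdint>

// Rounds n / d up, so a partial final tile still gets its own workgroup.
uint32_t CeilDiv(uint32_t n, uint32_t d) {
  assert(d > 0);
  return (n + d - 1) / d;
}
```

With tile_size = 12, a total sequence length of 25 needs CeilDiv(25, 12) = 3 workgroups along that dimension, whereas plain integer division would give 2 and leave the last row unprocessed.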

github-actions · 2024-11-14T00:47:43Z

onnxruntime/contrib_ops/webgpu/bert/attention.cc

+  int work_group_size = 64;
+  const int total_sequence_length_comp = (total_sequence_length + components -1) / components;
+  if (total_sequence_length_comp < work_group_size) {


Suggested change

  int work_group_size = 64;
  const int total_sequence_length_comp = (total_sequence_length + components - 1) / components;
  if (total_sequence_length_comp < work_group_size) {

github-advanced-security bot found potential problems Oct 30, 2024

github-actions bot requested changes Oct 31, 2024

onnxruntime/contrib_ops/webgpu/bert/attention.cc (outdated)

github-advanced-security bot found potential problems Oct 31, 2024

onnxruntime/contrib_ops/webgpu/bert/attention.cc (fixed)

satyajandhyala marked this pull request as ready for review November 1, 2024 19:28

skottmckay reviewed Nov 4, 2024

satyajandhyala force-pushed the sajandhy/webgpu-ep-gqa-new branch from 514217f to d49ecb4 Compare November 4, 2024 20:20

guschmue added the ep:WebGPU ort-web webgpu provider label Nov 6, 2024

github-advanced-security bot found potential problems Nov 6, 2024

onnxruntime/contrib_ops/webgpu/bert/group_query_attention.cc (fixed)

github-actions bot reviewed Nov 6, 2024

satyajandhyala added 19 commits November 12, 2024 18:51

Added attention_common.h

0a5d212

wip

5bfa070

Fix compilation errors

e6615e9

lint

449afb4

Modified MultiHeadAttention to not derive from AttentionBase class

8d10472

Uncomment GQA registration

4ea58d1

Moved TransferBSToBNSH and ApplyAttention declaration to attention_common.h from attention.h

4bcf257

Revert "Modified MultiHeadAttention to not derive from AttentionBase class"

5c5c934

This reverts commit ba45303.

Converted CheckInput function to template to fix compiler/linker multiple definitions error.

e716546

lint

aba59e5

Fixed conflicts.

067ecd1

copying errors

53f1c78

Fixed inplacesoftmax dispatch

f4dc9fc

Initialize required parameter data

3d1af1c

Map total_seqlen_tensor input to CPU

2eaeebc

Use uniforms variable name consistently to avoid confusion.

9c828cc

Keep InplaceSoftmax dispatch 3-dim.

26caa06

Formatting changes.

64b093f

Use total_seqlen_tensor input only to determin is_first_prompt.

a8bd38b

satyajandhyala added 5 commits November 12, 2024 18:51

initialize is_packed_qkv_

d613df4

Handle past key/value and present key/value buffer sharing.

0fedb9f

lint

993140b

Added past_present_share_buffer to the hint. typo

7502493

past_present_share_buffer related changes.

5f1fdae

satyajandhyala force-pushed the sajandhy/webgpu-ep-gqa-new branch from 4a072b5 to 5f1fdae Compare November 13, 2024 04:00

github-actions bot reviewed Nov 13, 2024

satyajandhyala added 5 commits November 13, 2024 00:31

lint

6d2bd68

Fix integer division

82a005d

Updated hints

fd9409f

match jsep code

15c96b3

Fixed a minor issue

72601d1

github-actions bot reviewed Nov 14, 2024

satyajandhyala added 3 commits November 13, 2024 16:52

lint

65495b6

Fix a bug using total_sequence_length instead of uniform.total_sequence_length_comp

63f20ed

Revert "match jsep code"

0102206

This reverts commit 15c96b3.
