rfcs: graph: support int4/int8 compression for K/V in fused SDPA #2041

Open
wants to merge 1 commit into
base: rfcs

@wzt1997 wzt1997 commented Aug 16, 2024

Description

This RFC proposes supporting int4/int8 compression for K/V in fused SDPA.
Link to the rendered document.

@wzt1997 wzt1997 added the RFC A design document label Aug 16, 2024
@wzt1997 wzt1997 self-assigned this Aug 16, 2024
@wzt1997 wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 943a6b3 to 83217a2 Compare August 16, 2024 06:57
@vpirogov vpirogov changed the title from "rfcs: graph: support int4/int8 compression for K/V in fuesd SDPA" to "rfcs: graph: support int4/int8 compression for K/V in fused SDPA" Aug 16, 2024
inputs should be `1d` tensors.
2. For `per_group` quantization, all dimensions should match the input,
except for the dimension where grouped quantization applies, which should
be `src_dim / group_size`.
Contributor

I'm not sure how that covers the case where the batch dimension should be broadcast for scales/zero-points:

W = [B, K, N]
pre-scale W: [1, gK, N] x [B, K, N] = W'
matmul: [B, M, K] x W' = [B, M, N]

The use cases I've seen look like this: the batch dimension doesn't get its own dimension of scales.
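
The broadcast in the shapes above can be sketched in plain Python. This is only an illustration, not oneDNN code: `dequantize_grouped`, the nested-list layout, and the sample values are hypothetical, and zero-points are left out for brevity.

```python
def dequantize_grouped(w, scales, group_size):
    """Scale a [B, K, N] tensor by [1 or B, K // group_size, N] scales.

    A leading scale dimension of 1 is broadcast over the batch dimension.
    Hypothetical helper for illustration only.
    """
    B, K, N = len(w), len(w[0]), len(w[0][0])
    if K % group_size != 0:
        raise ValueError("K must be divisible by group_size")
    if len(scales) not in (1, B):
        raise ValueError("scale batch dimension must be 1 or B")
    out = []
    for b in range(B):
        # Reuse the single scale slice for every batch when it is broadcast.
        s_b = scales[0] if len(scales) == 1 else scales[b]
        out.append([[w[b][k][n] * s_b[k // group_size][n] for n in range(N)]
                    for k in range(K)])
    return out


# B=2, K=4, N=2, group_size=2, so the scales have shape [1, 2, 2].
w = [[[1, 2], [3, 4], [5, 6], [7, 8]],
     [[1, 1], [1, 1], [1, 1], [1, 1]]]
scales = [[[0.5, 1.0], [2.0, 4.0]]]
w_deq = dequantize_grouped(w, scales, group_size=2)
print(w_deq[0][3])  # [14.0, 32.0]
print(w_deq[1][2])  # [2.0, 4.0]
```

Both batches read the same `[gK, N]` scale slice, which is the behavior the per-group shape rule would need to allow.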

Contributor Author

@wzt1997 wzt1997 Aug 27, 2024

Thanks for providing the case. Given potential requests from IPEX and the per-token quantization use case, we decided to update the scale/zp input shape requirement for per-group quantization, for scalability and flexibility. One potential option is to allow 1 on dimensions other than the last two, K and N.
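
One possible reading of that option, written as a shape check. The helper name `scale_shape_ok`, its parameters, and the restriction of grouping to one of the last two dimensions are assumptions made for illustration; the RFC may settle on a different rule.

```python
def scale_shape_ok(src_shape, scale_shape, group_size, group_axis=-2):
    """Check a scale/zp shape against a relaxed per-group rule (sketch).

    Assumed rule: the grouped axis must equal src_dim // group_size, the
    other of the last two axes must match the source, and every leading
    axis may either match the source or be 1 (broadcast).
    """
    rank = len(src_shape)
    if rank != len(scale_shape) or rank < 2:
        return False
    group_axis %= rank
    if group_axis < rank - 2:
        return False  # this sketch only groups along one of the last two axes
    for axis, (src, sc) in enumerate(zip(src_shape, scale_shape)):
        if axis == group_axis:
            if src % group_size != 0 or sc != src // group_size:
                return False
        elif axis >= rank - 2:
            if sc != src:
                return False
        elif sc not in (1, src):
            return False
    return True


print(scale_shape_ok((2, 64, 32), (1, 4, 32), group_size=16))  # True
print(scale_shape_ok((2, 64, 32), (3, 4, 32), group_size=16))  # False
```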

rfcs/20240808-graph-api-int-compression-for-sdpa/README.md: two outdated review threads, resolved
@wzt1997 wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 1d5e1d2 to d633cf7 Compare August 27, 2024 08:31
@wzt1997 wzt1997 requested a review from a team as a code owner August 27, 2024 08:31
@wzt1997 wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch 4 times, most recently from 04a94db to 5b84ad0 Compare August 29, 2024 00:28
/// 4-bit signed integer.
dnnl_s4 = 11,
/// 4-bit unsigned integer.
dnnl_u4 = 12,
Contributor

For API completeness, I agree we need to add both s4 and u4. Out of curiosity, do you know whether both s4 and u4 are used for this int4 K/V compression request? If both are used, is there any difference in the quantization recipe between s4 and u4?

Contributor Author

According to the user request, they are likely to use s4 for KV storage. But since KV compression is still a work in progress, this might change in the future. Regarding the difference between the u4 and s4 recipes: since int4 data types always use asymmetric quantization, the parameters will be quite similar for u4 and s4. The difference should mainly be in the de-quantization logic.
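
A minimal sketch of where the s4 and u4 paths differ. It assumes two 4-bit values per byte with the low nibble first (a packing order chosen here for illustration, not taken from the RFC) and uses hypothetical helpers `unpack_int4` and `dequantize`; only the nibble interpretation changes between the two types.

```python
def unpack_int4(packed, signed):
    """Unpack bytes into 4-bit values, low nibble first (assumed order)."""
    values = []
    for byte in packed:
        for nibble in (byte & 0x0F, (byte >> 4) & 0x0F):
            # s4 reads the nibble as two's complement: 8..15 map to -8..-1.
            if signed and nibble >= 8:
                nibble -= 16
            values.append(nibble)
    return values


def dequantize(q_values, scale, zero_point):
    """Asymmetric de-quantization: real = (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in q_values]


packed = bytes([0xF8, 0x17])
u4 = unpack_int4(packed, signed=False)
s4 = unpack_int4(packed, signed=True)
print(u4)  # [8, 15, 7, 1]
print(s4)  # [-8, -1, 7, 1]
print(dequantize(s4, scale=0.5, zero_point=0))  # [-4.0, -0.5, 3.5, 0.5]
```

The `dequantize` step is identical for both types; the same stored bits simply decode to different integers before the zero-point and scale are applied.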

@wzt1997 wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 5b84ad0 to 6f68c51 Compare September 2, 2024 02:49
@wzt1997 wzt1997 force-pushed the zhitao/rfc/graph-int4-support branch from 6f68c51 to f1c5003 Compare September 6, 2024 09:06
@vpirogov vpirogov added this to the v3.7 milestone Sep 9, 2024
5 participants