Unify scaled INT8 matmul #862

Open
gau-nernst opened this issue Sep 10, 2024 · 1 comment
Comments

@gau-nernst
Collaborator

With the new addition of INT8 mixed-precision training, there are now two implementations of scaled INT8 matmul (INT8 matmul + dequant).

I have identified the key differences:

| intmm_triton.py | int8_mm.py |
| --- | --- |
| Only fuses act scale | Fuses both act scale and weight scale |
| Scale step is `acc_i32 x scale` | Scale step casts to fp32: `acc_i32.to(f32) x scale.to(f32)` |
| Different autotune configs | Different autotune configs |
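
For reference, a minimal PyTorch sketch of what the two scale-step variants compute (semantics only, emulated in plain PyTorch rather than Triton; the per-row activation scale and per-column weight scale shapes are my assumption):

```python
import torch

def scaled_int8_mm_ref(a_i8, b_i8, act_scale, weight_scale):
    # Emulate INT8 matmul with INT32 accumulation in plain PyTorch.
    acc_i32 = a_i8.to(torch.int32) @ b_i8.to(torch.int32)  # (M, N) int32

    # intmm_triton.py-style: multiply the INT32 accumulator by the
    # activation scale only (weight scale applied outside the kernel).
    out_act_only = acc_i32 * act_scale.view(-1, 1)

    # int8_mm.py-style: cast accumulator and scales to FP32 and fuse
    # both the activation (per-row) and weight (per-column) scales.
    out_fused = (acc_i32.to(torch.float32)
                 * act_scale.view(-1, 1).to(torch.float32)
                 * weight_scale.view(1, -1).to(torch.float32))
    return out_act_only, out_fused
```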

Ideally we should keep only one. The tedious part is validating that there is no accuracy or speed regression, regardless of which implementation we end up adopting (a rough sketch of the kind of check I have in mind follows the list of call sites below).

Here are the places that use intmm_triton.py:

-> Basically, ensure INT8 dynamic quantization for the Llama and SAM benchmarks doesn't regress

Here are the places that use int8_mm.py:

-> Ensure INT8 mixed-precision training doesn't regress
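
Here is that rough parity-and-speed check sketch. The two kernels are passed in as callables rather than imported, since I don't want to pin module paths here; the common `(A, B, scale_A, scale_B)` signature, shapes, and scale layouts are assumptions:

```python
import torch
from torch.utils.benchmark import Timer

def compare_scaled_int8_mm(fn_a, fn_b, M=1024, K=4096, N=4096, device="cuda"):
    # fn_a / fn_b: callables with an assumed common (A_i8, B_i8, scale_A, scale_B)
    # signature; thin wrappers around the two existing kernels would provide this.
    A = torch.randint(-128, 127, (M, K), dtype=torch.int8, device=device)
    B = torch.randint(-128, 127, (K, N), dtype=torch.int8, device=device)
    scale_A = torch.rand(M, 1, device=device)  # per-row activation scale
    scale_B = torch.rand(1, N, device=device)  # per-column weight scale

    # Accuracy: the two implementations should agree up to fp rounding.
    out_a = fn_a(A, B, scale_A, scale_B)
    out_b = fn_b(A, B, scale_A, scale_B)
    print("max abs diff:", (out_a.float() - out_b.float()).abs().max().item())

    # Speed: time both implementations on the same inputs.
    for name, fn in [("impl A", fn_a), ("impl B", fn_b)]:
        t = Timer("fn(A, B, scale_A, scale_B)",
                  globals=dict(fn=fn, A=A, B=B, scale_A=scale_A, scale_B=scale_B))
        print(name, t.timeit(100))
```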

Another question: is it OK to change the int_scaled_matmul() signature to accept scales for both A and B instead of only for A?
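
Something along these lines is what I have in mind for the extended signature (just a sketch; the parameter names and the optional-scales_b default are my suggestion, not the current API, and the body shows reference semantics rather than the Triton kernel):

```python
from typing import Optional
import torch

def int_scaled_matmul(a: torch.Tensor, b: torch.Tensor,
                      scales_a: torch.Tensor,
                      scales_b: Optional[torch.Tensor] = None) -> torch.Tensor:
    # a: (M, K) int8, b: (K, N) int8
    # scales_a: (M, 1) per-row activation scale
    # scales_b: (1, N) per-column weight scale; None keeps the current
    #           behavior of fusing only the activation scale.
    acc = (a.to(torch.int32) @ b.to(torch.int32)).to(torch.float32)
    out = acc * scales_a.to(torch.float32)
    if scales_b is not None:
        out = out * scales_b.to(torch.float32)
    return out
```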

@jerryzh168
Contributor

By the way, main/torchao/quantization/utils.py contains a lot of util q/dq ops that call the more versatile quant primitive ops (quantize_affine/dequantize_affine/choose_qparams_affine); many of these are convenience functions that hold the configurations for those quant primitive ops (e.g. dtype, block_size, symmetric/asymmetric, quant_min/quant_max, eps, etc.).
