Decouple int4 weight with serialized format #187
base: main
Conversation
Also need to update the readme.
```diff
@@ -404,7 +403,7 @@ def __init__(self, mod, groupsize=128, inner_k_tiles=8, padding=True):

     @torch.no_grad()
     def create_quantized_state_dict(self, use_cuda = True):
-        if use_cuda:
+        if use_cuda and torch.cuda.is_available():
```
It won't be necessary to add `torch.cuda.is_available()`; just let it report an error if `use_cuda` is true and the GPU is not available.
@malfet Hi, we have modified the int4 packed-weight logic from gpt-fast and also from torch (pytorch/pytorch#129940); could you please help review? @yanbing-j could you also help evaluate how much time is spent on the weight prepacking after the model is loaded? Both CPU and GPU numbers will be needed. Also, this won't affect the first-token latency, right?
@mingfeima, the README has been updated. Weight prepacking takes 0.23 s on CPU (total model-loading time is 0.28 s) and 4 ms on GPU (total model-loading time is 1.2 s, mainly in …).
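For context, a rough sketch of how the prepacking portion of model loading might be timed; the `timed` helper is ours, and `load_model` / `convert_int4_weights` in the usage comments are hypothetical names, not functions from this repo:

```python
import time
import torch

def timed(fn, *args, **kwargs):
    """Return (result, elapsed seconds), synchronizing CUDA so GPU work is counted."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.perf_counter() - start

# Example usage (function names hypothetical):
#   model, load_s = timed(load_model, checkpoint_path, device)
#   _, prepack_s = timed(convert_int4_weights, model)
#   print(f"load: {load_s:.3f}s, prepack: {prepack_s:.3f}s")
```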
Hi @yanboliang, could you please help merge this PR? pytorch/pytorch#129940 has now been merged.
Hi @yanboliang, could you please help review this PR? Since the API of …
Hi @yanboliang, could you please help review this PR? Thanks!
generate.py (outdated)
```python
if isinstance(mod, WeightOnlyInt4Linear):
    weight = mod.weight.data
    weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight, mod.inner_k_tiles)
    mod.weight = weight_int4pack
```
Can you put the weight conversion into `WeightOnlyInt4QuantHandler.convert_for_runtime` at L236? More concretely, it can be part of `replace_linear_int4`; I think that's the right place. Also, your change doesn't work well with TP, since the conversion happens after L246. Otherwise, this looks good!
@yanboliang Thanks for the comments!
When quantizing, the weight is generated as [n][k / 2] uint8 (the serialized format). The int4-packed weight can only be produced after the model has been loaded onto a specific device, so moving the conversion into `WeightOnlyInt4QuantHandler.convert_for_runtime` or `replace_linear_int4` is not suitable.
I have refactored the code to wrap the conversion into a new function, and TP is applied after model loading and weight conversion. Is this okay for you? Thanks! (A sketch of what such a helper could look like follows.)
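A minimal sketch of such a post-load conversion helper, based on the generate.py snippet above; the function name `convert_int4_weight_after_load` and the import path are assumptions, not necessarily the exact code in this PR:

```python
import torch
from quantize import WeightOnlyInt4Linear  # assumed import path within gpt-fast

def convert_int4_weight_after_load(model: torch.nn.Module) -> torch.nn.Module:
    """Convert serialized [n][k/2] uint8 weights to the int4-packed layout.

    Called after the checkpoint has been loaded onto its target device,
    so the packing happens in that device's layout. TP sharding would then
    be applied after this conversion.
    """
    for mod in model.modules():
        if isinstance(mod, WeightOnlyInt4Linear):
            weight = mod.weight.data
            mod.weight = torch.ops.aten._convert_weight_to_int4pack(
                weight, mod.inner_k_tiles
            )
    return model
```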
Commits: update int4 weight dim; Add CPU profiling
This PR decouples the int4 weight from its serialized format, so that an int4 model checkpoint can be shared across different test machines or ISAs without being re-generated on one particular platform.
In int4 weight-only quantization, the weight is saved as [n][k / 2] uint8 (the serialized format). The conversion of this weight to the int4-packed layout is moved to model loading in `generate.py`.
This PR is based on pytorch/pytorch#129940, which updates the input `weight` of `_convert_weight_to_int4pack` to [n][k / 2] uint8.
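To make the intended flow concrete, here is a hedged end-to-end sketch of the decoupled scheme: the checkpoint stores the portable serialized weight, and the device-specific packing only happens at load time. File names, tensor sizes, and the single-weight layout are illustrative rather than the PR's exact code, and the call assumes a PyTorch build that includes pytorch/pytorch#129940:

```python
import torch

n, k = 4096, 4096
inner_k_tiles = 8

# --- quantization / export side (platform independent) ---
# The quantized weight is stored in the serialized [n][k/2] uint8 layout,
# so the same checkpoint can be shared across machines and ISAs.
serialized_weight = torch.randint(0, 256, (n, k // 2), dtype=torch.uint8)
torch.save({"layer.weight": serialized_weight}, "int4_checkpoint.pt")  # illustrative path

# --- load side (platform / device specific) ---
state_dict = torch.load("int4_checkpoint.pt")
device = "cuda" if torch.cuda.is_available() else "cpu"
weight = state_dict["layer.weight"].to(device)

# Only now is the weight converted to the device-specific int4-packed layout,
# mirroring what generate.py does after the model is loaded.
weight_int4pack = torch.ops.aten._convert_weight_to_int4pack(weight, inner_k_tiles)
```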