[feature request] lm_head quantization #2550
Labels
- Investigating
- Low Precision: issue about lower-bit quantization, including int8, int4, fp8
- triaged: issue has been triaged by maintainers
Recently, vocabulary sizes have been growing, and the `lm_head` weight exceeds 10 GB in some LLMs.
However, there is currently no way to quantize `lm_head`:
`modelopt.torch.export.postprocess.update_lm_head_quantization` ignores a manually specified `quant_cfg` and disables `lm_head` quantization during export (modelopt==0.19.0).
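For example, even re-enabling `lm_head` in the config gets overridden at export time. A minimal sketch of what I mean (the model name, the `calibrate` loop, and the exact quantizer settings are illustrative, not a verified repro):

```python
import copy

import torch
from transformers import AutoModelForCausalLM

import modelopt.torch.quantization as mtq

# Any large-vocab LLM; name is just an example.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# Start from a stock 4-bit config; its defaults disable lm_head via the
# "*lm_head*" pattern, so try to re-enable it manually.
cfg = copy.deepcopy(mtq.INT4_AWQ_CFG)
cfg["quant_cfg"]["*lm_head*"] = {"enable": True}

def calibrate(m):
    # Placeholder calibration loop; a real run would feed representative data.
    m(torch.randint(0, 32000, (1, 32)))

model = mtq.quantize(model, cfg, forward_loop=calibrate)

# On export, update_lm_head_quantization ignores the manual setting above
# and disables the lm_head quantizers again, so lm_head stays unquantized.
```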
Related issue: #1394