-
I have lots of plans for improved quantization; I just keep getting pulled in different directions. :) The latest distraction is Phi-3, but there are other issues that keep coming up. Quantization-aware finetuning is something I don't think I'd look at until I've experimented enough with all the most recent post-training quantization schemes. It's a very heavy-handed approach, and I don't think it's much of an exaggeration to say it's too slow for the rate at which models are being released right now. It simply can't take two weeks and cost a significant amount of money to quantize a model, because no one will care about that model by the time you're done.
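For contrast with quantization-aware finetuning, the cheapest post-training schemes need only a single gradient-free pass over the weights. Here is a minimal round-to-nearest (RTN) sketch in PyTorch, purely illustrative and not this project's actual scheme; the symmetric scaling and `group_size` value are assumptions:

```python
import torch

def quantize_rtn_4bit(w: torch.Tensor, group_size: int = 128):
    """Round-to-nearest 4-bit quantization with symmetric per-group scales.

    Illustrative only; practical PTQ schemes add calibration data and
    smarter rounding on top of this basic idea.
    """
    out_features, in_features = w.shape
    assert in_features % group_size == 0
    groups = w.reshape(out_features, in_features // group_size, group_size)
    # Scale each group so its largest magnitude lands on 7, then clamp to
    # the signed int4 range [-8, 7].
    scales = (groups.abs().amax(dim=-1, keepdim=True) / 7.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor) -> torch.Tensor:
    # Reverse the scaling and flatten the groups back into a 2-D weight.
    return (q.float() * scales).reshape(q.shape[0], -1)

w = torch.randn(4096, 4096)
q, scales = quantize_rtn_4bit(w)
w_hat = dequantize(q, scales)
print(f"mean abs reconstruction error: {(w - w_hat).abs().mean():.5f}")
```

A pass like this is essentially free compared to finetuning, which is why post-training approaches can keep pace with new model releases in a way gradient-based approaches can't.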
-
Hi all, thank you for your interest in the VPTQ project. I am also actively trying to integrate VPTQ (https://github.com/microsoft/VPTQ) into various inference frameworks. What would be required for me to integrate it into this project? Should I prepare a pull request? Yang
-
Any plans for a variant of post-training quantization to "heal" perplexity, i.e. recover from `1-bit` BitNet to the current `4-bit` perplexity? I mean post-training quantization methods such as `AQLM`, `Smoothquant+`, and `SqueezeLLM`. A variant would be needed, as most methods require significantly more VRAM and time for SFT. Or perhaps my prediction is wrong and you have other v3 plans?
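As a point of reference for the perplexity gap mentioned above, here is a minimal sketch of how that degradation is usually measured, assuming a Hugging Face-style causal LM; the model name and evaluation text are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def perplexity(model, tokenizer, text: str, ctx_len: int = 1024) -> float:
    ids = tokenizer(text, return_tensors="pt").input_ids[:, :ctx_len]
    # Passing labels=input_ids makes the model return the shifted
    # next-token cross-entropy loss; perplexity is just exp(loss).
    loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

tokenizer = AutoTokenizer.from_pretrained("gpt2")             # placeholder model
model = AutoModelForCausalLM.from_pretrained("gpt2")
text = "The quick brown fox jumps over the lazy dog. " * 100  # placeholder text
print(f"ppl: {perplexity(model, tokenizer, text):.3f}")
# Running the same function on a quantized checkpoint shows the degradation
# that "healing" methods would try to recover.
```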