llm weight compression tool #3290
base: develop
Conversation
General comments:
- As far as I understand, the main purpose of this tool is to return the top-k compression parameters, sorted in increasing order of the metric drop between the original and compressed models. Could you explain how to get such a list of compression parameters?
- I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.
- You have divided the task into several steps. The first step is to compress the model with parameters from the grid and save a number of model copies equal to the number of parameter sets. The second step is validation. How do you propose to parallelize it? Do you really need a model copy for each set of compression parameters?
Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a table with the results for all tried configurations. From this table, we can easily understand which parameters are suitable for our criteria. For every configuration, we also save a file with the applied compression parameters:
"task": "text-generation",
"trust_remote_code": true,
"weight_format": "int4",
"ratio": 0.2,
"sym": false,
"group_size": 128,
"backup_precision": null,
"dataset": "auto",
"all_layers": false,
"awq": true,
"scale_estimation": false,
"gptq": false,
"lora_correction": false
}
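For example, a minimal sketch (assuming a hypothetical results.csv with a similarity_drop column; the real file and column names produced by the tool may differ) of how the top-k configurations could be picked from that table:

```python
import pandas as pd

# Hypothetical file and column names.
results = pd.read_csv("results.csv")

# Sort configurations by the metric drop between the original and compressed
# models and keep the k best ones.
top_k = results.sort_values(by="similarity_drop", ascending=True).head(5)
print(top_k[["ratio", "group_size", "awq", "similarity_drop"]])
```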
Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.
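A rough sketch of how statistics caching could look for an NNCF backend, assuming compress_weights accepts AdvancedCompressionParameters(statistics_path=...) as suggested later in this thread (paths and parameter values are placeholders):

```python
import nncf
import openvino as ov

# Placeholder paths; the statistics cache directory can be reused across runs
# with different compression parameters.
ov_model = ov.Core().read_model("model/openvino_model.xml")
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.8,
    group_size=128,
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path="statistics_cache"),
)
ov.save_model(compressed_model, "compressed/openvino_model.xml")
```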
I save the model for only one reason: it is needed for validation. We can combine compression and validation into a single task. In this scenario, we probably don't need to save the models and can save only the compression parameters/metrics. I am open to any suggestions here, and we can discuss and select the best way. Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. Or we can trigger validation only when a compression step is finished (for some configuration). If you have any other suggestions or ideas, please let me know.
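As one possible direction (purely a sketch, with hypothetical compress_model/validate_model helpers), compression and validation could be combined into a single task and several such tasks executed in parallel:

```python
from concurrent.futures import ProcessPoolExecutor

def run_task(config: dict) -> dict:
    # Hypothetical helpers: compress with the given parameters, validate the
    # result, and keep only the parameters and metrics (no model copy is stored).
    model_dir = compress_model(config)    # hypothetical
    metrics = validate_model(model_dir)   # hypothetical
    return {"params": config, "metrics": metrics}

def run_all(configs: list[dict], max_workers: int = 2) -> list[dict]:
    # Each worker handles one full compress+validate task at a time.
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(run_task, configs))
```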
@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.
Great tool!!
One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest variant that still reaches an acceptable accuracy drop.
As a first step, we could add a performance measurement step after compression with llm_bench and add first- and second-token latency measurements as columns in the resulting table.
In the future, I can imagine that we could do some kind of "smart" parameter search based on this, because formally speaking what we have here is a min-max optimization task: we minimize latency while maximizing accuracy. The problem is that different parameters have a different impact on the target metrics, so it's not straightforward to me how exactly this could be done. Maybe some heuristics will have to be added. In any case, this is just an idea for future improvements, not for right now.
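A sketch of what such a selection could look like once latency columns are available (hypothetical file and column names): pick the fastest configuration whose accuracy drop is still within the budget.

```python
import pandas as pd

# Hypothetical file and column names; the real table may differ.
results = pd.read_csv("results.csv")

max_accuracy_drop = 0.01
acceptable = results[results["accuracy_drop"] <= max_accuracy_drop]

# Among acceptable configurations, take the one with the lowest second-token latency.
best = acceptable.sort_values("second_token_latency_ms").iloc[0]
print(best)
```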
:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
    model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
As far as I know, there is no such argument as load_in_4bit.
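A hedged sketch of the same call without that argument (to export a float-precision model, disabling 8-bit compression should be enough; the model id is a placeholder):

```python
from optimum.intel import OVModelForCausalLM

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder

# load_in_8bit=False keeps the exported weights uncompressed; load_in_4bit is
# reportedly not a supported argument of from_pretrained.
model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, load_in_8bit=False, compile=False, trust_remote_code=True
)
```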
:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
    model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
For security reasons, it is not a good idea to always pass trust_remote_code as True. I would suggest adding a CLI argument for this.
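For example, a small sketch (argument name is hypothetical) of exposing it as an opt-in flag instead of hard-coding True:

```python
import argparse

parser = argparse.ArgumentParser()
# Off by default; the user must opt in explicitly to run remote code from the HF Hub.
parser.add_argument("--trust-remote-code", action="store_true")
args = parser.parse_args()

# Later forwarded as from_pretrained(..., trust_remote_code=args.trust_remote_code)
```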
{ROOT_DIR}
|-- {encoded model ID}
    |-- fp32
Strictly speaking, the code below will save weights in either fp16 or bf16 precision, depending on which precision the original PyTorch weights are given in on the HF hub model card.
""" | ||
cmd_line = "optimum-cli" | ||
cmd_line += " export openvino" | ||
cmd_line += f" --model {model_id}" |
To speed things up, it could be beneficial to run export from model_path, not model_id. This avoids an additional model export step and will be especially noticeable for compression configs that are fast to apply.
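A rough sketch of that idea, assuming the float model has already been exported once into base_model_dir and that optimum-cli accepts a local path for --model (directory names are placeholders):

```python
from pathlib import Path

# Placeholder: directory that already contains the exported OpenVINO model.
base_model_dir = Path("models/my-llm/fp32")

cmd_line = "optimum-cli"
cmd_line += " export openvino"
# Re-use the exported model instead of exporting from the HF model id again.
cmd_line += f" --model {base_model_dir / 'model'}"
```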
:param params:
:param log_filename:
"""
cmd_line = "optimum-cli"
Have you considered running compression through the optimum Python API instead of the optimum CLI? Like this: https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit Compression parameters can be given within OVWeightQuantizationConfig. It may be easier this way.
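A minimal sketch of that approach, based on the linked documentation (model id and parameter values are placeholders):

```python
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(bits=4, sym=False, ratio=0.8, group_size=128)
model = OVModelForCausalLM.from_pretrained(
    "TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    export=True,
    quantization_config=quantization_config,
)
model.save_pretrained("compressed_model")
```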
gt_data_filename = f"gt_{language}.csv"

cmd_line = "wwb"
cmd_line += f" --base-model {base_model_dir.joinpath('model')}"
As I understand it, this way we'll compare the OV INT4 model against the OV FP32/FP16/BF16 model. Strictly speaking, OV INT4/INT8 models are compared against PyTorch models inside OV LLM validation. Usually the OV float-precision model and the PT model have very high (99+%) similarity, but still, the results will be a bit off if we do it this way.
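If comparison against the PyTorch model is preferred, one hedged option is to pass the original HF model id as the base model; the snippet below mirrors the cmd_line style above, but --target-model and --gt-data are assumptions about the wwb CLI rather than options confirmed by this diff:

```python
from pathlib import Path

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"   # placeholder HF model id
compressed_model_dir = Path("models/compressed")  # placeholder
gt_data_file = Path("gt_en.csv")                  # placeholder, cf. gt_{language}.csv above

# Using the HF model id as the base model means the ground-truth answers are
# generated by the PyTorch model rather than by the OV float-precision model.
cmd_line = "wwb"
cmd_line += f" --base-model {model_id}"
cmd_line += f" --target-model {compressed_model_dir}"
cmd_line += f" --gt-data {gt_data_file}"
```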
{ |
What do you think about splitting compression and evaluation configs into separate files?
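For illustration, the compression part of the current config could live in its own file (the file name and the idea of a separate evaluation config with task/dataset settings are hypothetical):

```json
{
  "weight_format": "int4",
  "ratio": 0.2,
  "sym": false,
  "group_size": 128,
  "backup_precision": null,
  "all_layers": false,
  "awq": true,
  "scale_estimation": false,
  "gptq": false,
  "lora_correction": false
}
```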
There is a way to make it work without additional changes to optimum-intel:
model = OVModelForCausalLM.from_pretrained(<model_id or model_path>, load_in_8bit=False)
OVQuantizer(model).quantize(
    ov_config=OVConfig(quantization_config=OVWeightQuantizationConfig(bits=4, ...)),
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path=self.statistics_path),
    save_directory=<save_directory>,
)
Also, it would be convenient if it was possible to run the compression / evaluation steps separately if needed.
In certain situations, it may be preferable not to re-compress models, but rather to use the existing ones and test them on different datasets.
I would suggest thinking about adding
I agree that the approach of trying all parameters, without taking performance heuristics into account at the initial stage, will lead to a significant increase in the time needed to search for compression parameters. @andrey-churkin, just to clarify, is the idea to specify the order of experiments in the config?
Thanks for the explanation. If you want to find compression parameters that satisfy a given accuracy drop, you should not check all sets of compression parameters; you should select the compression parameters with the best performance. Thus, I would suggest introducing "max_accuracy_drop" as an early stopper for the search process.
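A sketch of such an early stopper (helpers and the ordering heuristic are hypothetical): iterate over configurations ordered from fastest to slowest expected inference and stop at the first one that fits the accuracy budget.

```python
def find_config(configs: list[dict], max_accuracy_drop: float) -> dict | None:
    # configs are assumed to be pre-sorted from best to worst expected
    # performance (e.g. by decreasing int4 ratio), which is a heuristic.
    for config in configs:
        model_dir = compress_model(config)       # hypothetical helper
        drop = measure_accuracy_drop(model_dir)  # hypothetical helper
        if drop <= max_accuracy_drop:
            return config  # early stop: fastest config within the budget
    return None
```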
I didn't understand: are you suggesting that parallelization should be regulated at the config level, or will it be implemented inside the tool?
Changes
Add a script to automate the enumeration of compression parameters.
Supported backends for compression: optimum-cli, nncf
Supported backends for evaluation: lm_eval, wwb
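For context, the enumeration itself boils down to building a grid of compression parameters, roughly like this (the parameter values below are placeholders, not the exact grid used by the script):

```python
from itertools import product

ratios = [0.2, 0.4, 0.6, 0.8, 1.0]
group_sizes = [64, 128]
awq_options = [False, True]

# Cartesian product of the varied parameters; each entry is one compression config.
grid = [
    {"weight_format": "int4", "ratio": r, "group_size": g, "awq": a}
    for r, g, a in product(ratios, group_sizes, awq_options)
]
```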
Reason for changes
Related tickets
Ref: 160664
Tests