llm weight compression tool #3290

Draft: wants to merge 11 commits into develop
Conversation

@andrey-churkin (Contributor) commented Feb 18, 2025

Changes

Add a script to automate the enumeration of compression parameters.

Supported backends for compression: optimum-cli, nncf
Supported backends for evaluation: lm_eval, wwb

Reason for changes

Related tickets

Ref: 160664

Tests

@andrey-churkin andrey-churkin requested a review from a team as a code owner February 18, 2025 08:10
@github-actions github-actions bot added the documentation label Feb 18, 2025
@andrey-churkin andrey-churkin marked this pull request as draft February 18, 2025 10:56
@alexsu52 (Contributor) left a comment


General comments:

  • As far as I understand, the main purpose of this tool is to return the top-k compression parameter sets, sorted by increasing accuracy drop between the original and compressed models. Could you explain how to obtain such a list of compression parameters?
  • I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.
  • You have divided the task into several steps. The first step is to compress the model with parameters from the grid and save as many copies of the model as there are parameter sets. The second step is validation. How do you propose to parallelize this? Do you really need a model copy for each set of compression parameters?

@andrey-churkin (Contributor, Author)

As far as I understand, the main purpose of this tool is to return the top-k compression parameter sets, sorted by increasing accuracy drop between the original and compressed models. Could you explain how to obtain such a list of compression parameters?

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the following sheet (columns are subject to discussion).

[screenshot: results.xlsx sheet]

From this table, we can easily see which parameters meet our criteria. For every configuration, we save a file called optimum_cli_params.json (for the optimum-cli backend) that contains all the compression parameters that were used. For example, for the int4_r0.2_gs128_auto_awq configuration it contains the following parameters:

{
    "task": "text-generation",
    "trust_remote_code": true,
    "weight_format": "int4",
    "ratio": 0.2,
    "sym": false,
    "group_size": 128,
    "backup_precision": null,
    "dataset": "auto",
    "all_layers": false,
    "awq": true,
    "scale_estimation": false,
    "gptq": false,
    "lora_correction": false
}
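
For illustration, a parameter set like the one above roughly corresponds to an optimum-cli call along these lines (a sketch only; the model ID and output directory are placeholders, and the exact flag mapping used by the script may differ):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += " --model <model_id>"        # placeholder model ID
cmd_line += " --task text-generation"
cmd_line += " --trust-remote-code"
cmd_line += " --weight-format int4"
cmd_line += " --ratio 0.2"
cmd_line += " --group-size 128"
cmd_line += " --dataset auto"
cmd_line += " --awq"
cmd_line += " <output_dir>"              # placeholder output directory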

@andrey-churkin (Contributor, Author)

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.
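
For the NNCF backend, a rough sketch of what statistics caching could look like (assuming nncf.AdvancedCompressionParameters exposes a statistics_path field, as in recent NNCF releases; the model path and calibration data below are placeholders):

import nncf
import openvino as ov

ov_model = ov.Core().read_model("<base_model_dir>/openvino_model.xml")  # placeholder path
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.2,
    group_size=128,
    awq=True,
    dataset=nncf.Dataset(["<calibration samples>"]),  # placeholder calibration data
    # Reuse previously collected activation statistics across runs that share the same dataset.
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path="<statistics_dir>"),
)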

@andrey-churkin (Contributor, Author)

You have divided the task into several steps. The first step is to compress the model with parameters from the grid and save as many copies of the model as there are parameter sets. The second step is validation. How do you propose to parallelize this? Do you really need a model copy for each set of compression parameters?

I save the model for only one reason: it is needed for validation. We can combine compression and validation into a single task. In this scenario, we probably don't need to save the models, and can save only the compression parameters/metrics. I am open to any suggestions here, and we can discuss and select the best way.

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. Alternatively, we can trigger validation as soon as the compression step for a given configuration finishes. If you have any other suggestions or ideas, please let me know.
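
For illustration, one possible scheme for running several tasks at once (a sketch; run_task is a stand-in for whatever a task ends up being, compression only or compression plus validation):

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(param_sets, run_task, max_workers=2):
    """Run one task per parameter set and collect the results in input order."""
    results = [None] * len(param_sets)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, params): idx for idx, params in enumerate(param_sets)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results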

@andrey-churkin (Contributor, Author)

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

@andreyanufr (Collaborator)

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

@andrey-churkin

  1. Is tools/llm_weight_compression/config_optimum_lm_eval.json the default configuration? Maybe it is worth adding scale estimation and a ratio in a more optimal range [0.7, 0.8, 0.9, 1.0]?
  2. Do you have plans to add an example without optimum-cli?

@nikita-savelyevv (Collaborator) left a comment


Great tool!!

One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest variant that stays within the acceptable accuracy drop.

As a first step, we could add a performance-measurement step after compression with llm_bench and add first- and second-token latency measurements as columns in the resulting table.

In the future, I can imagine doing some kind of "smart" parameter search based on this, because formally speaking what we have here is a min-max optimization task: we minimize latency while maximizing accuracy. The problem is that different parameters have a different impact on the target metrics, so it's not straightforward to me how exactly this could be done. Maybe some heuristics will have to be added. In any case, this is just an idea for future improvements, not for right now.

:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
Collaborator:

As far as I know, there is no such argument as load_in_4bit.

:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
Collaborator:

For security reasons, it is not a good idea to always pass trust_remote_code as True. I would suggest adding a CLI argument for this.
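
For example, a minimal sketch of such a switch (the argument name and surrounding code are illustrative):

import argparse

from optimum.intel import OVModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    help="Allow executing custom modeling code downloaded from the Hugging Face Hub.",
)
args = parser.parse_args()

model = OVModelForCausalLM.from_pretrained(
    "<model_id>", export=True, load_in_8bit=False, compile=False, trust_remote_code=args.trust_remote_code
)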


{ROOT_DIR}
|-- {encoded model ID}
|-- fp32
Collaborator:

Strictly speaking, the code below will save weights in either fp16 or bf16 precision, depending on which precision the original PyTorch weights are given in on the HF Hub model card.
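
For example, the base precision could be pinned explicitly at export time instead of relying on the checkpoint's dtype (a sketch in the script's own command-building style; fp16 would work equally well):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {model_id}"
cmd_line += " --weight-format fp32"  # or fp16, so the directory name matches the actual precision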

"""
cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {model_id}"
Collaborator:

To speed things up, it could be beneficial to run the export from model_path rather than model_id. This avoids an additional model export step and will be especially noticeable for compression configs that are fast to apply.
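
For example (a sketch, assuming optimum-cli accepts the already exported local OpenVINO directory here, which is the point of the suggestion):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {base_model_dir.joinpath('model')}"  # reuse the exported base model instead of re-exporting from model_id
cmd_line += " --weight-format int4"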

:param params:
:param log_filename:
"""
cmd_line = "optimum-cli"
Collaborator:

Have you considered running compression through the optimum Python API instead of the optimum CLI? Like this: https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit. Compression parameters can be given within OVWeightQuantizationConfig.

It may be easier this way.
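
A rough sketch of that path (parameter names follow OVWeightQuantizationConfig from optimum-intel; the exact set may vary between versions, and the values and dataset below are only illustrative):

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.2,
    group_size=128,
    dataset="wikitext2",  # illustrative calibration dataset
)
model = OVModelForCausalLM.from_pretrained(
    "<model_id>", export=True, quantization_config=quantization_config
)
model.save_pretrained("<save_directory>")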

gt_data_filename = f"gt_{language}.csv"

cmd_line = "wwb"
cmd_line += f" --base-model {base_model_dir.joinpath('model')}"
Collaborator:

As I understand it, this way we'll compare the OV INT4 model against the OV FP32/FP16/BF16 model. Strictly speaking, OV int4/int8 models are compared against PyTorch models inside OV LLM validation. Usually the OV float-precision model and the PyTorch model have very high (99+%) similarity, but still, the results will be slightly off if we do it this way.

@@ -0,0 +1,23 @@
{
Collaborator:

What do you think about splitting compression and evaluation configs into separate files?

@nikita-savelyevv (Collaborator) commented Feb 20, 2025

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

There is a way to make it work without additional changes to optimum-intel:

import nncf
from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(<model_id or model_path>, load_in_8bit=False)
OVQuantizer(model).quantize(
    ov_config=OVConfig(quantization_config=OVWeightQuantizationConfig(bits=4, ...)),
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path=self.statistics_path),
    save_directory=<save_directory>,
)

@nikita-savelyevv (Collaborator)

Also, it would be convenient if it was possible to run compression / evaluation steps separately if needed.

@ljaljushkin (Contributor)

convenient if it was possible to run compression / evaluation steps separately if

In certain situations, it may be preferable not to re-compress models, but rather to use the existing ones and test them on different datasets.
For local testing, it would be helpful to have the option to visualize the currently compressed models with their respective parameters and versions (nncf, ov, optimum, transformers, torch, ...) and a way to select some of them for validation.
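
For example, a small sketch of recording the relevant package versions next to each compressed model, so they can be shown and filtered later (the file name is illustrative):

import json
from importlib.metadata import version

def dump_versions(output_dir):
    """Save the versions of the packages that influenced the compressed model."""
    packages = ["nncf", "openvino", "optimum", "transformers", "torch"]
    versions = {name: version(name) for name in packages}
    with open(f"{output_dir}/versions.json", "w") as f:
        json.dump(versions, f, indent=4)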

@alexsu52 (Contributor) commented Feb 21, 2025

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

I would suggest thinking about adding statistics caching to optimum-cli. If you make a proposal on this matter, it would be a good first input. cc @nikita-savelyevv, @AlexKoff88

@alexsu52 (Contributor) commented Feb 21, 2025

Great tool!!

One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest variant that stays within the acceptable accuracy drop. […]

I agree that trying all parameters without taking performance heuristics into account at the initial stage will lead to a significant increase in the time needed to search for compression parameters.

@andrey-churkin, just to clarify, is the idea to specify the order of experiments in the config?

@alexsu52 (Contributor) commented Feb 21, 2025

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the sheet shown above (columns are subject to discussion). […]
Thanks for the explanation. If you want to find compression parameters that satisfy a given accuracy drop, you should not check all sets of compression parameters; you should select the compression parameters with the best performance. Thus I would suggest introducing "max_accuracy_drop" as an early stopper for the search process.
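
For illustration, a sketch of what such an early stopper could look like (compress and evaluate stand in for the tool's own steps, and the configs are assumed to be ordered from fastest to slowest expected inference):

def search(configs, compress, evaluate, baseline_score, max_accuracy_drop):
    """Return the first parameter set whose accuracy drop fits within the budget."""
    for params in configs:
        model_dir = compress(params)
        drop = baseline_score - evaluate(model_dir)
        if drop <= max_accuracy_drop:
            return params  # early stop: no need to try slower configurations
    return None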

@alexsu52 (Contributor)

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. […]

I didn't quite understand: are you suggesting that parallelization be controlled at the config level, or will it be implemented inside the tool?
