llm weight compression tool #3290

Draft: wants to merge 11 commits into develop
Conversation

@andrey-churkin (Contributor) commented Feb 18, 2025

Changes

Add a script to automate the enumeration of compression parameters.

Supported backends for compression: optimum-cli, nncf
Supported backends for evaluation: lm_eval, wwb

Reason for changes

Related tickets

Ref: 160664

Tests

@andrey-churkin andrey-churkin requested a review from a team as a code owner February 18, 2025 08:10
@github-actions github-actions bot added the documentation label Feb 18, 2025
@andrey-churkin andrey-churkin marked this pull request as draft February 18, 2025 10:56
@alexsu52 (Contributor) left a comment


General comments:

  • As far as I understand, the main purpose of this tool is to return the top-k compression parameter sets, sorted by increasing accuracy drop between the original and compressed models. Could you explain how to obtain such a list of compression parameters?
  • I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.
  • You have divided the task into several steps. The first step is to compress the model with parameters from the grid and save as many copies of the model as there are parameter sets. The second step is validation. How do you propose to parallelize this? Do you really need a model copy for each set of compression parameters?

@andrey-churkin (Contributor, Author)

As far as I understand, the main purpose of this tool is to return the top-k compression parameter sets, sorted by increasing accuracy drop between the original and compressed models. Could you explain how to obtain such a list of compression parameters?

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the following sheet (columns are subject to discussion).

[screenshot: results.xlsx sheet]

From this table, we can easily see which parameters meet our criteria. For every configuration, we save a file called optimum_cli_params.json (for the optimum-cli backend) that contains all the compression parameters that were used. For example, for the int4_r0.2_gs128_auto_awq configuration it contains the following parameters:

{
    "task": "text-generation",
    "trust_remote_code": true,
    "weight_format": "int4",
    "ratio": 0.2,
    "sym": false,
    "group_size": 128,
    "backup_precision": null,
    "dataset": "auto",
    "all_layers": false,
    "awq": true,
    "scale_estimation": false,
    "gptq": false,
    "lora_correction": false
}
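
For illustration, a parameter set like the one above roughly corresponds to an optimum-cli call along these lines (a sketch only; the model ID and output directory are placeholders, and the exact flag mapping used by the script may differ):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += " --model <model_id>"        # placeholder model ID
cmd_line += " --task text-generation"
cmd_line += " --trust-remote-code"
cmd_line += " --weight-format int4"
cmd_line += " --ratio 0.2"
cmd_line += " --group-size 128"
cmd_line += " --dataset auto"
cmd_line += " --awq"
cmd_line += " <output_dir>"              # placeholder output directory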

@andrey-churkin (Contributor, Author)

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.
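
For the NNCF backend, a rough sketch of what statistics caching could look like (assuming nncf.AdvancedCompressionParameters exposes a statistics_path field, as in recent NNCF releases; the model path and calibration data below are placeholders):

import nncf
import openvino as ov

ov_model = ov.Core().read_model("<base_model_dir>/openvino_model.xml")  # placeholder path
compressed_model = nncf.compress_weights(
    ov_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=0.2,
    group_size=128,
    awq=True,
    dataset=nncf.Dataset(["<calibration samples>"]),  # placeholder calibration data
    # Reuse previously collected activation statistics across runs that share the same dataset.
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path="<statistics_dir>"),
)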

@andrey-churkin (Contributor, Author)

You have divided the task into several steps. The first step is to compress the model with parameters from the grid and save as many copies of the model as there are parameter sets. The second step is validation. How do you propose to parallelize this? Do you really need a model copy for each set of compression parameters?

I save the model for only one reason: it is needed for validation. We can combine compression and validation into a single task. In this scenario, we probably don't need to save the models, and can save only the compression parameters/metrics. I am open to any suggestions here, and we can discuss and select the best way.

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. Alternatively, we can trigger validation as soon as the compression step for a given configuration finishes. If you have any other suggestions or ideas, please let me know.
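
For illustration, one possible scheme for running several tasks at once (a sketch; run_task is a stand-in for whatever a task ends up being, compression only or compression plus validation):

from concurrent.futures import ProcessPoolExecutor, as_completed

def run_all(param_sets, run_task, max_workers=2):
    """Run one task per parameter set and collect the results in input order."""
    results = [None] * len(param_sets)
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(run_task, params): idx for idx, params in enumerate(param_sets)}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results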

@andrey-churkin (Contributor, Author)

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

@andreyanufr (Collaborator)

@nikita-malininn @nikita-savelyevv @ljaljushkin @andreyanufr Guys, please have a look. It is still a raw version, but any feedback will be valuable.

@andrey-churkin

  1. Is tools/llm_weight_compression/config_optimum_lm_eval.json the default configuration? Maybe it is worth adding scale estimation and a ratio in a more optimal range [0.7, 0.8, 0.9, 1.0]?
  2. Do you have plans to add an example without optimum-cli?

@nikita-savelyevv (Collaborator) left a comment


Great tool!!

One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest variant that stays within the acceptable accuracy drop.

As a first step, we could add a performance-measurement step after compression with llm_bench and add first- and second-token latency measurements as columns in the resulting table.

In the future, I can imagine doing some kind of "smart" parameter search based on this, because formally speaking what we have here is a min-max optimization task: we minimize latency while maximizing accuracy. The problem is that different parameters have a different impact on the target metrics, so it's not straightforward to me how exactly this could be done. Maybe some heuristics will have to be added. In any case, this is just an idea for future improvements, not for right now.

:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
Collaborator:

As far as I know, there is no such argument as load_in_4bit.

:param base_model_dir: A directory where the model should be saved.
"""
model = OVModelForCausalLM.from_pretrained(
model_id=model_id, export=True, load_in_8bit=False, load_in_4bit=False, compile=False, trust_remote_code=True
Collaborator:

For security reasons, it is not a good idea to always pass trust_remote_code as True. I would suggest adding a CLI argument for this.
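
For example, a minimal sketch of such a switch (the argument name and surrounding code are illustrative):

import argparse

from optimum.intel import OVModelForCausalLM

parser = argparse.ArgumentParser()
parser.add_argument(
    "--trust-remote-code",
    action="store_true",
    help="Allow executing custom modeling code downloaded from the Hugging Face Hub.",
)
args = parser.parse_args()

model = OVModelForCausalLM.from_pretrained(
    "<model_id>", export=True, load_in_8bit=False, compile=False, trust_remote_code=args.trust_remote_code
)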


{ROOT_DIR}
|-- {encoded model ID}
|-- fp32
Collaborator:

Strictly speaking, the code below will save weights in either fp16 or bf16 precision, depending on which precision the original PyTorch weights are given in on the HF Hub model card.
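
For example, the base precision could be pinned explicitly at export time instead of relying on the checkpoint's dtype (a sketch in the script's own command-building style; fp16 would work equally well):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {model_id}"
cmd_line += " --weight-format fp32"  # or fp16, so the directory name matches the actual precision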

"""
cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {model_id}"
Collaborator:

To speed things up, it could be beneficial to run the export from model_path rather than model_id. This avoids an additional model export step and will be especially noticeable for compression configs that are fast to apply.
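
For example (a sketch, assuming optimum-cli accepts the already exported local OpenVINO directory here, which is the point of the suggestion):

cmd_line = "optimum-cli"
cmd_line += " export openvino"
cmd_line += f" --model {base_model_dir.joinpath('model')}"  # reuse the exported base model instead of re-exporting from model_id
cmd_line += " --weight-format int4"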

:param params:
:param log_filename:
"""
cmd_line = "optimum-cli"
Collaborator:

Have you considered running compression through the optimum Python API instead of the optimum CLI? Like this: https://huggingface.co/docs/optimum/en/intel/openvino/optimization#4-bit. Compression parameters can be given within OVWeightQuantizationConfig.

It may be easier this way.
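
A rough sketch of that path (parameter names follow OVWeightQuantizationConfig from optimum-intel; the exact set may vary between versions, and the values and dataset below are only illustrative):

from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

quantization_config = OVWeightQuantizationConfig(
    bits=4,
    sym=False,
    ratio=0.2,
    group_size=128,
    dataset="wikitext2",  # illustrative calibration dataset
)
model = OVModelForCausalLM.from_pretrained(
    "<model_id>", export=True, quantization_config=quantization_config
)
model.save_pretrained("<save_directory>")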

gt_data_filename = f"gt_{language}.csv"

cmd_line = "wwb"
cmd_line += f" --base-model {base_model_dir.joinpath('model')}"
Collaborator:

As I understand it, this way we'll compare the OV INT4 model against the OV FP32/FP16/BF16 model. Strictly speaking, OV int4/int8 models are compared against PyTorch models inside OV LLM validation. Usually the OV float-precision model and the PyTorch model have very high (99+%) similarity, but still, the results will be slightly off if we do it this way.

@@ -0,0 +1,23 @@
{
Collaborator:

What do you think about splitting compression and evaluation configs into separate files?

@nikita-savelyevv (Collaborator) commented Feb 20, 2025

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

There is a way to make it work without additional changes to optimum-intel:

import nncf
from optimum.intel import OVConfig, OVModelForCausalLM, OVQuantizer, OVWeightQuantizationConfig

model = OVModelForCausalLM.from_pretrained(<model_id or model_path>, load_in_8bit=False)
OVQuantizer(model).quantize(
    ov_config=OVConfig(quantization_config=OVWeightQuantizationConfig(bits=4, ...)),
    advanced_parameters=nncf.AdvancedCompressionParameters(statistics_path=self.statistics_path),
    save_directory=<save_directory>,
)

@nikita-savelyevv (Collaborator)

Also, it would be convenient if it was possible to run compression / evaluation steps separately if needed.

@ljaljushkin (Contributor)

convenient if it was possible to run compression / evaluation steps separately if

In certain situations, it may be preferable not to re-compress models, but rather to use the existing ones and test them on different datasets.
For local testing, it would be helpful to have the option to visualize the currently compressed models with their respective parameters and versions (nncf, ov, optimum, transformers, torch, ...) and a way to select some of them for validation.
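
For example, a small sketch of recording the relevant package versions next to each compressed model, so they can be shown and filtered later (the file name is illustrative):

import json
from importlib.metadata import version

def dump_versions(output_dir):
    """Save the versions of the packages that influenced the compressed model."""
    packages = ["nncf", "openvino", "optimum", "transformers", "torch"]
    versions = {name: version(name) for name in packages}
    with open(f"{output_dir}/versions.json", "w") as f:
        json.dump(versions, f, indent=4)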

@alexsu52 (Contributor) commented Feb 21, 2025

I would suggest thinking about using statistics dumping to speed up model compression under different compression parameters.

Currently, only the optimum-cli backend is supported, and as far as I know, there is no way to dump statistics via it. However, I will take this proposal into account and use statistics caching for the NNCF backend.

I would suggest thinking about adding statistics caching to optimum-cli. If you make a proposal on this matter, it would be a good first input. cc @nikita-savelyevv, @AlexKoff88

@alexsu52 (Contributor) commented Feb 21, 2025

Great tool!!

One idea for improvement I have is to also take the resulting model's inference speed into consideration. There is always a trade-off between accuracy and performance, so for every model the task is to find the fastest variant that stays within the acceptable accuracy drop. […]

I agree that trying all parameters without taking performance heuristics into account at the initial stage will lead to a significant increase in the time needed to search for compression parameters.

@andrey-churkin, just to clarify, is the idea to specify the order of experiments in the config?

@alexsu52 (Contributor) commented Feb 21, 2025

Yes, the main purpose of this tool is to automate the enumeration of compression parameters. The script saves a results.xlsx file that contains the sheet shown above (columns are subject to discussion). […]
Thanks for the explanation. If you want to find compression parameters that satisfy a given accuracy drop, you should not check all sets of compression parameters; you should select the compression parameters with the best performance. Thus I would suggest introducing "max_accuracy_drop" as an early stopper for the search process.
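
For illustration, a sketch of what such an early stopper could look like (compress and evaluate stand in for the tool's own steps, and the configs are assumed to be ordered from fastest to slowest expected inference):

def search(configs, compress, evaluate, baseline_score, max_accuracy_drop):
    """Return the first parameter set whose accuracy drop fits within the budget."""
    for params in configs:
        model_dir = compress(params)
        drop = baseline_score - evaluate(model_dir)
        if drop <= max_accuracy_drop:
            return params  # early stop: no need to try slower configurations
    return None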

@alexsu52 (Contributor)

Regarding parallelization, I think one way is to execute several tasks at the same time. A task can contain either only the compression step or both the compression and validation steps. […]

I didn't quite understand: are you suggesting that parallelization be controlled at the config level, or will it be implemented inside the tool?
