Support user-defined batch size for one shot #1117

Open
kylesayrs wants to merge 8 commits into main from kylesayrs/calibration-batch-size

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Jan 30, 2025

Purpose

  • Enable oneshot flows with calibration_batch_size in order to support accelerated calibration
```
Preparing intermediates cache: 100%|██████████| 16/16 [00:00<00:00, 24.61it/s]
(1/17): Calibrating: 100%|██████████| 16/16 [00:09<00:00,  1.75it/s]
2025-02-04T03:24:08.343218+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-02-04T03:24:24.219222+0000 | compress | METRIC - time 15.88s
2025-02-04T03:24:24.219357+0000 | compress | METRIC - error 1805.50
2025-02-04T03:24:24.219666+0000 | compress | METRIC - GPU 0 | usage: 78.79% | total memory: 17 GB
2025-02-04T03:24:24.219707+0000 | compress | METRIC - GPU 1 | usage: 11.88% | total memory: 17 GB
2025-02-04T03:24:24.219775+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-02-04T03:24:24.219901+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
```
  • Unfortunately, running with larger batch sizes is limited by the peak memory required to run the lm_head. Below is the memory profile of running batch_size 16, seq_len 2048 with llama3.2-1B; a rough back-of-envelope estimate of that peak is sketched after this list.

[Screenshot: memory profile of oneshot calibration, batch_size 16, seq_len 2048, llama3.2-1B]

  • Future work could better support this by integrating with custom kernels like the LigerKernel, using the sequential pipeline to skip calibrating the lm_head, and/or providing a function to reserve memory for large batch sizes.
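
To make the lm_head bottleneck concrete, here is a rough back-of-envelope estimate of the logits tensor produced during calibration. The vocabulary size (128256) and the bf16 activation dtype are assumptions about Llama-3.2-1B, not values taken from this PR.

```python3
# Hedged estimate of the memory needed just to hold the lm_head output (logits)
# for one calibration batch. vocab_size and bf16 activations are assumptions
# about Llama-3.2-1B, not values from this PR.
batch_size = 16
seq_len = 2048
vocab_size = 128256    # assumed Llama-3 tokenizer vocabulary size
bytes_per_element = 2  # assumed bf16 activations

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_element
print(f"logits tensor alone: {logits_bytes / 1e9:.1f} GB")  # ~8.4 GB
```

A single intermediate tensor of roughly this size dwarfs the per-module footprint of GPTQ itself, which is why skipping the lm_head during calibration or fusing it (e.g. with a LigerKernel-style fused head) is attractive future work.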

Changes

  • Add calibration_batch_size argument (this can be aliased to batch_size in the future if we find that to be a better API)
    • Distinguishing the batch size as oneshot-specific allows the user to use different batch sizes for oneshot and training workflows
  • Implement configure_processor, which modifies the processor to support saving and padding
    • Padding was previously done in the TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data
      • format_calibration_data needs a reference to the processor in order to support dynamic padding
    • Some processor definitions, such as phi3's, are defunct and need modification to support saving. This was previously handled in the user script; now it is fixed transparently in llm-compressor
  • Modify format_calibration_data to support using DataCollatorWithPadding (dynamic padding); a sketch of this wiring follows this list
  • Remove smoothquant in phi3 W4A16 example
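
A minimal sketch of the dynamic-padding wiring described above. This is not the PR's exact implementation; the helper name build_calibration_loader is hypothetical, but the collator/sampler pattern mirrors the diff quoted in the review discussion below.

```python3
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import DataCollatorWithPadding, default_data_collator


def build_calibration_loader(tokenized_calibration, tokenizer, batch_size, do_shuffle=True):
    # Pad each batch to the length of its longest sample rather than a fixed max length.
    if hasattr(tokenizer, "pad"):
        collate_fn = DataCollatorWithPadding(tokenizer)
    else:
        collate_fn = default_data_collator  # processor cannot pad; assume pre-padded inputs

    sampler = RandomSampler(tokenized_calibration) if do_shuffle else SequentialSampler(tokenized_calibration)
    return DataLoader(
        tokenized_calibration,
        batch_size=batch_size,
        sampler=sampler,
        collate_fn=collate_fn,
        pin_memory=True,
        drop_last=False,
    )
```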

Testing

  • Added tests in tests/llmcompressor/transformers/finetune/data/test_dataset_helpers.py
<details>
<summary>example script</summary>

```python3
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
BATCH_SIZE = 32

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
#   * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibration_batch_size=BATCH_SIZE,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

```
Preparing intermediates cache: 100%|██████████| 16/16 [00:00<00:00, 24.61it/s]
(1/17): Calibrating: 100%|██████████| 16/16 [00:09<00:00, 1.75it/s]
2025-02-04T03:24:08.343218+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-02-04T03:24:24.219222+0000 | compress | METRIC - time 15.88s
2025-02-04T03:24:24.219357+0000 | compress | METRIC - error 1805.50
2025-02-04T03:24:24.219666+0000 | compress | METRIC - GPU 0 | usage: 78.79% | total memory: 17 GB
2025-02-04T03:24:24.219707+0000 | compress | METRIC - GPU 1 | usage: 11.88% | total memory: 17 GB
2025-02-04T03:24:24.219775+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-02-04T03:24:24.219901+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
```

</details>

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
)
},
)
per_device_oneshot_batch_size: int = field(
Collaborator

Why per device, considering GPTQ's sequential nature / single active execution device?

Collaborator Author

This name is just to match the existing per_device_train_batch_size argument name. We can alias this or resolve per_device_train_batch_size = per_device_oneshot_batch_size = batch_size in the future

Collaborator Author

Given that oneshot is unlikely to support device-parallel computation in the future, I'm fine using a more concise name now

Collaborator Author

Renamed to oneshot_batch_size to allow users to have separate batch sizes for oneshot and train. We can add a batch_size later if we think that's a better interface

Collaborator Author

Renamed to calibration_batch_size

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title [WIP] Support Batch Size for OneShot [WIP] Support user-defined batch size for one shot Feb 3, 2025
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title [WIP] Support user-defined batch size for one shot Support user-defined batch size for one shot Feb 3, 2025
@kylesayrs kylesayrs self-assigned this Feb 3, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator Author

This feature is implemented aside from the additional memory concerns related to large batch sizes. I've proposed a better API here for handling stacked memory requirements.

@kylesayrs kylesayrs marked this pull request as ready for review February 4, 2025 19:52
@kylesayrs kylesayrs requested a review from dsikka February 5, 2025 01:14
@kylesayrs kylesayrs force-pushed the kylesayrs/calibration-batch-size branch from 7a8f569 to fe67d7e Compare February 5, 2025 19:53
@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
)
},
)
calibration_batch_size: int = field(
Collaborator

Oneshot will not depend on training_args in the follow-up PR, so this will be moved once that lands.
For now it's OK since oneshot uses training_args.

Collaborator Author

Which argument set should this exist on @horheynm?

@@ -53,11 +53,7 @@ def __init__(
self.tokenizer = getattr(self.processor, "tokenizer", self.processor)

if self.tokenizer is not None:
# fill in pad token
Collaborator

Do we not use this anymore?

Collaborator

Ok, you moved it to
if hasattr(tokenizer, "pad"):
    collate_fn = DataCollatorWithPadding(tokenizer)

Collaborator Author

@kylesayrs kylesayrs Feb 6, 2025

No, I moved this logic to configure_processor, as indicated in the PR description

  • Implement configure_processor which modifies the processor to support saving and padding
    • Padding was previously done in TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data

"sampler": RandomSampler(tokenized_calibration)
if do_shuffle
else SequentialSampler(tokenized_calibration),
"collate_fn": collate_fn,
"pin_memory": True,
"drop_last": False,
Collaborator

Why not drop the last batch if the number of samples is not divisible by the batch size?

Collaborator Author

"drop_last" is not relevant if the number of samples is divisible by the batch size because there is no remainder. I'm confused what you're referring to here.

@@ -68,8 +68,8 @@ def populate_datasets(self, processor: Processor, add_labels: bool = True):
:param processor: processor or tokenizer to use for dataset tokenization
:param add_labels: if True, add labels column to dataset splits
"""
self.processor = processor # TODO: pass processor into init instead of this fn
Collaborator

if this processor is the same as the model processor, it will be accessible by self.model_args.processor.

Collaborator Author

@kylesayrs kylesayrs Feb 6, 2025

At this location, processor is not the same as self.model_args.processor.

We could change this flow in the future, but for now we separate processor and self.model_args.processor within text_generation.py in order to preserve self.model_args.processor as a string (if the user passed a string).
