Support user-defined batch size for one shot #1117

Open
kylesayrs wants to merge 8 commits into main from kylesayrs/calibration-batch-size

Conversation

kylesayrs
Collaborator

@kylesayrs kylesayrs commented Jan 30, 2025

Purpose

  • Enable oneshot flows with calibration_batch_size in order to support accelerated calibration
```
Preparing intermediates cache: 100%|██████████| 16/16 [00:00<00:00, 24.61it/s]
(1/17): Calibrating: 100%|██████████| 16/16 [00:09<00:00,  1.75it/s]
2025-02-04T03:24:08.343218+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-02-04T03:24:24.219222+0000 | compress | METRIC - time 15.88s
2025-02-04T03:24:24.219357+0000 | compress | METRIC - error 1805.50
2025-02-04T03:24:24.219666+0000 | compress | METRIC - GPU 0 | usage: 78.79% | total memory: 17 GB
2025-02-04T03:24:24.219707+0000 | compress | METRIC - GPU 1 | usage: 11.88% | total memory: 17 GB
2025-02-04T03:24:24.219775+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-02-04T03:24:24.219901+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
```
  • Unfortunately, running with larger batch sizes is limited by the peak memory required to run the lm_head. Below is the memory profile of running batch_size 16, seq_len 2048 with llama3.2-1B; a rough back-of-envelope estimate of that peak is sketched after this list.

[Screenshot: memory profile of oneshot calibration, batch_size 16, seq_len 2048, llama3.2-1B]

  • Future work could better support this by integrating with custom kernels like the LigerKernel, using the sequential pipeline to skip calibrating the lm_head, and/or providing a function to reserve memory for large batch sizes.
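
To make the lm_head bottleneck concrete, here is a rough back-of-envelope estimate of the logits tensor produced during calibration. The vocabulary size (128256) and the bf16 activation dtype are assumptions about Llama-3.2-1B, not values taken from this PR.

```python3
# Hedged estimate of the memory needed just to hold the lm_head output (logits)
# for one calibration batch. vocab_size and bf16 activations are assumptions
# about Llama-3.2-1B, not values from this PR.
batch_size = 16
seq_len = 2048
vocab_size = 128256    # assumed Llama-3 tokenizer vocabulary size
bytes_per_element = 2  # assumed bf16 activations

logits_bytes = batch_size * seq_len * vocab_size * bytes_per_element
print(f"logits tensor alone: {logits_bytes / 1e9:.1f} GB")  # ~8.4 GB
```

A single intermediate tensor of roughly this size dwarfs the per-module footprint of GPTQ itself, which is why skipping the lm_head during calibration or fusing it (e.g. with a LigerKernel-style fused head) is attractive future work.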

Changes

  • Add calibration_batch_size argument (this can be aliased to batch_size in the future if we find that to be a better API)
    • Distinguishing the batch size as oneshot-specific allows the user to use different batch sizes for oneshot and training workflows
  • Implement configure_processor, which modifies the processor to support saving and padding
    • Padding was previously done in the TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data
      • format_calibration_data needs a reference to the processor in order to support dynamic padding
    • Some processor definitions, such as phi3's, are defunct and need modification to support saving. This was previously handled in the user script; now it is fixed transparently in llm-compressor
  • Modify format_calibration_data to support using DataCollatorWithPadding (dynamic padding); a sketch of this wiring follows this list
  • Remove smoothquant in phi3 W4A16 example
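
A minimal sketch of the dynamic-padding wiring described above. This is not the PR's exact implementation; the helper name build_calibration_loader is hypothetical, but the collator/sampler pattern mirrors the diff quoted in the review discussion below.

```python3
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import DataCollatorWithPadding, default_data_collator


def build_calibration_loader(tokenized_calibration, tokenizer, batch_size, do_shuffle=True):
    # Pad each batch to the length of its longest sample rather than a fixed max length.
    if hasattr(tokenizer, "pad"):
        collate_fn = DataCollatorWithPadding(tokenizer)
    else:
        collate_fn = default_data_collator  # processor cannot pad; assume pre-padded inputs

    sampler = RandomSampler(tokenized_calibration) if do_shuffle else SequentialSampler(tokenized_calibration)
    return DataLoader(
        tokenized_calibration,
        batch_size=batch_size,
        sampler=sampler,
        collate_fn=collate_fn,
        pin_memory=True,
        drop_last=False,
    )
```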

Testing

  • Added tests in tests/llmcompressor/transformers/finetune/data/test_dataset_helpers.py
<details>
<summary>example script</summary>

```python3
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
BATCH_SIZE = 32

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
#   * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibration_batch_size=BATCH_SIZE,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```

```
Preparing intermediates cache: 100%|██████████| 16/16 [00:00<00:00, 24.61it/s]
(1/17): Calibrating: 100%|██████████| 16/16 [00:09<00:00, 1.75it/s]
2025-02-04T03:24:08.343218+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-02-04T03:24:24.219222+0000 | compress | METRIC - time 15.88s
2025-02-04T03:24:24.219357+0000 | compress | METRIC - error 1805.50
2025-02-04T03:24:24.219666+0000 | compress | METRIC - GPU 0 | usage: 78.79% | total memory: 17 GB
2025-02-04T03:24:24.219707+0000 | compress | METRIC - GPU 1 | usage: 11.88% | total memory: 17 GB
2025-02-04T03:24:24.219775+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-02-04T03:24:24.219901+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
```

</details>

Signed-off-by: Kyle Sayers <[email protected]>

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
)
},
)
per_device_oneshot_batch_size: int = field(
Collaborator

Why per device, considering GPTQ's sequential nature / single active execution device?

Collaborator Author

This name is just to match the existing per_device_train_batch_size argument name. We can alias this or resolve per_device_train_batch_size = per_device_oneshot_batch_size = batch_size in the future

Collaborator Author

Given that oneshot is unlikely to support device-parallel computation in the future, I'm fine using a more concise name now

Collaborator Author

Renamed to oneshot_batch_size to allow users to have separate batch sizes for oneshot and train. We can add a batch_size later if we think that's a better interface

Collaborator Author

Renamed to calibration_batch_size

Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title [WIP] Support Batch Size for OneShot [WIP] Support user-defined batch size for one shot Feb 3, 2025
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs kylesayrs changed the title [WIP] Support user-defined batch size for one shot Support user-defined batch size for one shot Feb 3, 2025
@kylesayrs kylesayrs self-assigned this Feb 3, 2025
Signed-off-by: Kyle Sayers <[email protected]>
@kylesayrs
Collaborator Author

This feature is implemented aside from the additional memory concerns related to large batch sizes. I've proposed a better API here for handling stacked memory requirements.

@kylesayrs kylesayrs marked this pull request as ready for review February 4, 2025 19:52
@kylesayrs kylesayrs requested a review from dsikka February 5, 2025 01:14
@kylesayrs kylesayrs force-pushed the kylesayrs/calibration-batch-size branch from 7a8f569 to fe67d7e Compare February 5, 2025 19:53
@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
)
},
)
calibration_batch_size: int = field(
Collaborator

Oneshot will not depend on training_args in the follow-up PR, so this will be moved once that lands.
For now it's OK since oneshot uses training_args.

Collaborator Author

Which argument set should this exist on @horheynm?

@@ -53,11 +53,7 @@ def __init__(
self.tokenizer = getattr(self.processor, "tokenizer", self.processor)

if self.tokenizer is not None:
# fill in pad token
Collaborator

Do we not use this anymore?

Collaborator

Ok, you moved it to
if hasattr(tokenizer, "pad"):
    collate_fn = DataCollatorWithPadding(tokenizer)

Collaborator Author

@kylesayrs kylesayrs Feb 6, 2025

No, I moved this logic to configure_processor, as indicated in the PR description

  • Implement configure_processor which modifies the processor to support saving and padding
    • Padding was previously done in TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data

"sampler": RandomSampler(tokenized_calibration)
if do_shuffle
else SequentialSampler(tokenized_calibration),
"collate_fn": collate_fn,
"pin_memory": True,
"drop_last": False,
Collaborator

Why not drop the last batch if the number of samples is not divisible by the batch size?

Collaborator Author

"drop_last" is not relevant if the number of samples is divisible by the batch size because there is no remainder. I'm confused what you're referring to here.

@@ -68,8 +68,8 @@ def populate_datasets(self, processor: Processor, add_labels: bool = True):
:param processor: processor or tokenizer to use for dataset tokenization
:param add_labels: if True, add labels column to dataset splits
"""
self.processor = processor # TODO: pass processor into init instead of this fn
Collaborator

if this processor is the same as the model processor, it will be accessible by self.model_args.processor.

Collaborator Author

@kylesayrs kylesayrs Feb 6, 2025

At this location, processor is not the same as self.model_args.processor.

We could change this flow in the future, but for now we separate processor and self.model_args.processor within text_generation.py in order to preserve self.model_args.processor as a string (if the user passed a string).
