Support user-defined batch size for one shot #1117
base: main
Conversation
Signed-off-by: Kyle Sayers <[email protected]>
👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review. Note: this is required to complete the testing suite, so please only add the label once the PR is code complete and local testing has been performed.
```
@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
            )
        },
    )
    per_device_oneshot_batch_size: int = field(
```
Why per device, considering GPTQ's sequential nature and single active execution device?
This name is just to match the existing per_device_train_batch_size argument name. We can alias this or resolve per_device_train_batch_size = per_device_oneshot_batch_size = batch_size in the future.
Given that oneshot is unlikely to support device-parallel computation in the future, I'm fine with using a more concise name now.
Renamed to oneshot_batch_size to allow users to have separate batch sizes for oneshot and train. We can add a batch_size later if we think that's a better interface.
Renamed to calibration_batch_size
This feature is implemented, aside from the additional memory concerns related to large batch sizes. I've proposed a better API here for handling stacked memory requirements.
Force-pushed from 7a8f569 to fe67d7e
```
@@ -32,6 +32,12 @@ class TrainingArguments(HFTrainingArgs):
            )
        },
    )
    calibration_batch_size: int = field(
```
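For reference, a minimal sketch of how such a dataclass field could be declared (the default value and help text below are assumptions, not taken from this diff):

```python
from dataclasses import dataclass, field


@dataclass
class TrainingArguments:
    # Sketch only; the real class subclasses HFTrainingArgs and carries many more fields.
    calibration_batch_size: int = field(
        default=1,  # assumed default, not shown in the hunk above
        metadata={"help": "Batch size used when running oneshot calibration"},
    )
```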
Oneshot will not depend on training_args in the follow-up PR, so this will be moved once that lands. For now it's ok, since oneshot uses training_args.
Which argument set should this exist on, @horheynm?
```
@@ -53,11 +53,7 @@ def __init__(
        self.tokenizer = getattr(self.processor, "tokenizer", self.processor)

        if self.tokenizer is not None:
            # fill in pad token
```
Do we not use this anymore?
Ok, you moved it to:

```python
if hasattr(tokenizer, "pad"):
    collate_fn = DataCollatorWithPadding(tokenizer)
```
No, I moved this logic to configure_processor, as indicated in the PR description:
- Implement configure_processor, which modifies the processor to support saving and padding
- Padding was previously done in the TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data
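A minimal sketch of what that relocated pad-token logic might look like inside configure_processor (the function body here is an assumption; only the name and intent come from this PR):

```python
def configure_processor(processor):
    """Sketch: prepare the processor for saving and padding before calibration."""
    tokenizer = getattr(processor, "tokenizer", processor)
    if tokenizer is not None and getattr(tokenizer, "pad_token", None) is None:
        # assumed fallback: reuse the EOS token when no pad token is defined,
        # so DataCollatorWithPadding can pad calibration batches
        tokenizer.pad_token = tokenizer.eos_token
    return processor
```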
"sampler": RandomSampler(tokenized_calibration) | ||
if do_shuffle | ||
else SequentialSampler(tokenized_calibration), | ||
"collate_fn": collate_fn, | ||
"pin_memory": True, | ||
"drop_last": False, |
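For context, here is a rough sketch of how format_calibration_data might assemble these kwargs into a DataLoader (simplified signature and body, not the exact implementation in this PR):

```python
from torch.utils.data import DataLoader, RandomSampler, SequentialSampler
from transformers import DataCollatorWithPadding, default_data_collator


def format_calibration_data(tokenized_calibration, tokenizer, batch_size=1, do_shuffle=True):
    # Use dynamic padding when the tokenizer can pad, otherwise fall back to the default collator.
    if hasattr(tokenizer, "pad"):
        collate_fn = DataCollatorWithPadding(tokenizer)
    else:
        collate_fn = default_data_collator

    return DataLoader(
        tokenized_calibration,
        batch_size=batch_size,
        sampler=RandomSampler(tokenized_calibration)
        if do_shuffle
        else SequentialSampler(tokenized_calibration),
        collate_fn=collate_fn,
        pin_memory=True,
        drop_last=False,  # keep the final partial batch so no calibration samples are lost
    )
```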
Why not drop the last batch if the number of samples is not divisible by the batch size?
"drop_last" is not relevant if the number of samples is divisible by the batch size, because there is no remainder. I'm confused about what you're referring to here.
```
@@ -68,8 +68,8 @@ def populate_datasets(self, processor: Processor, add_labels: bool = True):
        :param processor: processor or tokenizer to use for dataset tokenization
        :param add_labels: if True, add labels column to dataset splits
        """
        self.processor = processor  # TODO: pass processor into init instead of this fn
```
If this processor is the same as the model processor, it will be accessible via self.model_args.processor.
At this location, processor is not the same as self.model_args.processor. We could change this flow in the future, but for now we separate processor and self.model_args.processor within text_generation.py in order to preserve self.model_args.processor being a string (if the user passed a string).
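For illustration, one way the string case could be resolved locally without mutating self.model_args.processor (this helper and its use of AutoProcessor are assumptions for the sketch, not code from this PR):

```python
from transformers import AutoProcessor


def resolve_processor(processor_or_name):
    """Hypothetical helper: return a processor object, loading it if a string was passed."""
    if isinstance(processor_or_name, str):
        # load locally so model_args.processor stays a string, as the user passed it
        return AutoProcessor.from_pretrained(processor_or_name)
    return processor_or_name
```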
Purpose
- Add calibration_batch_size in order to support accelerated calibration

Changes
- Add the calibration_batch_size argument (this can be aliased to batch_size in the future if we find that to be a better api)
- Implement configure_processor, which modifies the processor to support saving and padding
  - Padding was previously done in the TextGenerationDataset class. However, the processor needs to be configured earlier because the StageRunner needs a reference to the processor in order to pass it to format_calibration_data
  - format_calibration_dataset needs a reference to the processor in order to support dynamic padding
- Modify format_calibration_data to support using the DataCollatorWithPadding (dynamic padding); see the illustration after this list
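As a hedged illustration of the dynamic padding referenced above, DataCollatorWithPadding pads each batch only to the length of its longest sample rather than to a fixed max_length (the tokenizer below simply mirrors the example script; any tokenizer with a pad token works):

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # assumed fallback when no pad token is defined

collator = DataCollatorWithPadding(tokenizer)

# Two samples of different lengths are padded only up to the longest sample in this batch.
features = [
    {"input_ids": tokenizer("short prompt")["input_ids"]},
    {"input_ids": tokenizer("a noticeably longer prompt used for calibration")["input_ids"]},
]
batch = collator(features)
print(batch["input_ids"].shape)  # (2, length_of_longest_sample_in_batch)
```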
Testing
- tests/llmcompressor/transformers/finetune/data/test_dataset_helpers.py
- Example script (below)
```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot

# Select model and load it.
MODEL_ID = "meta-llama/Llama-3.2-1B-Instruct"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "HuggingFaceH4/ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048
BATCH_SIZE = 32

# Load dataset and preprocess.
ds = load_dataset(DATASET_ID, split=DATASET_SPLIT)
ds = ds.shuffle(seed=42).select(range(NUM_CALIBRATION_SAMPLES))


def preprocess(example):
    return {
        "text": tokenizer.apply_chat_template(
            example["messages"],
            tokenize=False,
        )
    }


ds = ds.map(preprocess)


# Tokenize inputs.
def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )


ds = ds.map(tokenize, remove_columns=ds.column_names)

# Configure the quantization algorithm to run.
#   * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
]

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    calibration_batch_size=BATCH_SIZE,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to("cuda")
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
```
Preparing intermediates cache: 100%|██████████████████████████████████████████████████████| 16/16 [00:00<00:00, 24.61it/s]
(1/17): Calibrating: 100%|████████████████████████████████████████████████████████████████| 16/16 [00:09<00:00, 1.75it/s]
2025-02-04T03:24:08.343218+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.q_proj using 512 samples
2025-02-04T03:24:24.219222+0000 | compress | METRIC - time 15.88s
2025-02-04T03:24:24.219357+0000 | compress | METRIC - error 1805.50
2025-02-04T03:24:24.219666+0000 | compress | METRIC - GPU 0 | usage: 78.79% | total memory: 17 GB
2025-02-04T03:24:24.219707+0000 | compress | METRIC - GPU 1 | usage: 11.88% | total memory: 17 GB
2025-02-04T03:24:24.219775+0000 | compress | METRIC - Compressed module size: 8.486912 MB
2025-02-04T03:24:24.219901+0000 | on_sequential_batch_end | INFO - Quantizing model.layers.0.self_attn.k_proj using 512 samples
```