[Audio] Qwen Audio Example #1082

Draft · wants to merge 24 commits into `main`
1 change: 1 addition & 0 deletions README.md
@@ -39,6 +39,7 @@ Applying quantization with `llmcompressor`:
* [Activation quantization to `fp8`](examples/quantization_w8a8_fp8)
* [Weight only quantization to `int4`](examples/quantization_w4a16)
* [Quantizing MoE LLMs](examples/quantizing_moe)
* [Quantizing Multimodal Audio LLMs](examples/multimodal_audio)

### User Guides
Deep dives into advanced usage of `llmcompressor`:
88 changes: 88 additions & 0 deletions examples/multimodal_audio/README.md
@@ -0,0 +1,88 @@
# Quantizing Multimodal Audio Models #

https://github.com/user-attachments/assets/6732c60b-1ebe-4bed-b409-c16c4415dff5

Audio provided by Daniel Galvez et al. under a Creative Commons Attribution license

```
<|startoftranscript|> <|en|>
...

<|transcribe|> <|notimestamps|>
that's where you have a lot of windows in the south no actually that's passive solar
and passive solar is something that was developed and designed in the 1960s and 70s
and it was a great thing for what it was at the time but it's not a passive house
```

This directory contains example scripts for quantizing a variety of audio language models using GPTQ quantization.

## Compressing Your Own Model ##
To use your own multimodal model, start with an existing example and change the `model_id` to match your own model stub.
```python3
model_id = "path/to/your/model"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
```

## Customizing GPTQModifier Parameters ##
The GPTQModifier is the modifier responsible for performing quantization of the model weights. For more information on quantizing with different weight schemes, see the `quantization_` examples in the [examples folder](/examples/).

```python3
recipe = [
    GPTQModifier(
        targets="Linear",
        scheme="W4A16",
        sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
        ignore=["lm_head"],
    )
]
```

### Sequential Targets ###
Sequential targets are the modules which determine the granularity of error propagation and activation offloading when performing forward passes of the model. These are typically the "transformer blocks" of the model, also referred to as "layers" within llm-compressor.

Choosing sequential targets with higher granularity (for example, "Linear" instead of "LlamaDecoderLayer") results in fewer Hessians being allocated at the same time, decreasing the memory requirements for compression. This may also improve the recovered accuracy of the model, as compression error is propagated at a finer granularity. However, higher-granularity sequential targets can also increase compression time, as more time is spent offloading and onloading activations.
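
For example, a minimal sketch of a finer-grained configuration, which trades longer compression time for lower peak memory (the module names here are illustrative and should be matched to your model):

```python3
# Sketch: propagate compression error after every Linear module rather than
# after each transformer block (lower peak memory, more offloading overhead).
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    sequential_targets=["Linear"],
    ignore=["lm_head"],
)
```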

### Ignore ###
If your model is not traceable for your desired dataset, first consider adding any problematic modules to the ignore list. Doing this prevents the model tracer from tracing the internals of those modules, thereby avoiding the untraceable operations.
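
For example, a sketch of a recipe that skips the audio tower and multimodal projector (the regex patterns below are illustrative; match them to your model's module names):

```python3
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["re:audio_tower.*", "re:multi_modal_projector.*", "lm_head"],
)
```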

## Tracing Errors ##
Because the architectures of audio-language models are often more complex than those of typical decoder-only text models, you may encounter `torch.fx.TraceError`s when attempting to quantize your model. For more information on `torch.fx.TraceError`s, why they occur, and how to resolve them, please see the [Model Tracing Guide](/src/llmcompressor/transformers/tracing/GUIDE.md).
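
As done in the Qwen2-Audio example in this directory, the model can also be loaded through one of the traceable class definitions provided by llm-compressor:

```python3
from llmcompressor.transformers.tracing import (
    TraceableQwen2AudioForConditionalGeneration,
)

model = TraceableQwen2AudioForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2-Audio-7B-Instruct",
    device_map="auto",
    torch_dtype="auto",
)
```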

## Adding Your Own Smoothquant Mappings ##
For a guide on adding SmoothQuant mappings for your model, see the [SmoothQuant Guide](/src/llmcompressor/modifiers/smoothquant/README.md).
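
As a minimal sketch (the regex patterns and smoothing strength below are assumptions, not defaults for any particular model), a mapping pairs the layers to be smoothed with the preceding module whose output feeds them:

```python3
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# Sketch only: adapt these regexes to your model's module names.
mappings = [
    [["re:.*q_proj", "re:.*k_proj", "re:.*v_proj"], "re:.*input_layernorm"],
    [["re:.*gate_proj", "re:.*up_proj"], "re:.*post_attention_layernorm"],
]
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8, mappings=mappings),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]
```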

## Adding Your Own Data Collator ##
Most examples utilize a generic `data_collator` which correctly collates data for most multimodal datasets. If you find that your model needs custom data collation (as is the case with [pixtral](/examples/multimodal_vision/pixtral_example.py)), you can modify this function to reflect these model-specific requirements.
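
For reference, a sketch of the generic collator pattern used by these examples; a model-specific variant would build the dictionary key by key with the dtypes and shapes the model expects:

```python3
import torch

def data_collator(batch):
    # The oneshot calibration loop feeds one sample at a time.
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}
```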

## Sample Audio Provided Under a Creative Commons Attribution License ##
https://creativecommons.org/licenses/by/4.0/legalcode
```
@article{DBLP:journals/corr/abs-2111-09344,
  author     = {Daniel Galvez and
                Greg Diamos and
                Juan Ciro and
                Juan Felipe Cer{\'{o}}n and
                Keith Achorn and
                Anjali Gopi and
                David Kanter and
                Maximilian Lam and
                Mark Mazumder and
                Vijay Janapa Reddi},
  title      = {The People's Speech: {A} Large-Scale Diverse English Speech Recognition Dataset for Commercial Usage},
  journal    = {CoRR},
  volume     = {abs/2111.09344},
  year       = {2021},
  url        = {https://arxiv.org/abs/2111.09344},
  eprinttype = {arXiv},
  eprint     = {2111.09344},
  timestamp  = {Mon, 22 Nov 2021 16:44:07 +0100},
  biburl     = {https://dblp.org/rec/journals/corr/abs-2111-09344.bib},
  bibsource  = {dblp computer science bibliography, https://dblp.org}
}
```
258 changes: 258 additions & 0 deletions examples/multimodal_audio/qwen2_audio_example.py
@@ -0,0 +1,258 @@
import torch
from datasets import load_dataset
from transformers import AutoProcessor

from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.transformers import oneshot
from llmcompressor.transformers.tracing import (
    TraceableQwen2AudioForConditionalGeneration,
)

# Select model and load it.
MODEL_ID = "Qwen/Qwen2-Audio-7B-Instruct"

model = TraceableQwen2AudioForConditionalGeneration.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype="auto",
)
processor = AutoProcessor.from_pretrained(MODEL_ID)

# Select calibration dataset.
DATASET_ID = "MLCommons/peoples_speech"
DATASET_SUBSET = "test"
DATASET_SPLIT = "test"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Load dataset and preprocess.
ds = load_dataset(
    DATASET_ID,
    DATASET_SUBSET,
    split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
    trust_remote_code=True,
)


def preprocess(example):
    # The audio entry is used by the chat template to insert the audio
    # placeholder tokens into the prompt; the calibration audio itself comes
    # from the dataset and is passed to the processor as a raw array below.
    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio",
                    "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/translate_to_chinese.wav",
                }
            ],
        }
    ]

    return {
        "text": processor.apply_chat_template(
            messages, add_generation_prompt=True, tokenize=False
        ),
        "audios": [example["audio"]["array"]],
        "sampling_rate": example["audio"]["sampling_rate"],
    }


ds = ds.map(preprocess, remove_columns=ds.column_names)


# Process inputs: tokenize the prompt and extract audio features.
def process(sample):
    return processor(
        text=sample["text"],
        audios=sample["audios"],
        sampling_rate=sample["sampling_rate"],
        return_tensors="pt",
        padding=True,
    )


ds = ds.map(process, remove_columns=ds.column_names)


# Define a oneshot data collator for multimodal inputs.
def data_collator(batch):
    assert len(batch) == 1
    return {key: torch.tensor(value) for key, value in batch[0].items()}


# Configure the quantization algorithm to run.
#   * quantize the weights to 4 bit with GPTQ with a group size 128
recipe = GPTQModifier(
    targets="Linear",
    scheme="W4A16",
    ignore=["lm_head"],
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    data_collator=data_collator,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
sample_input = data_collator([next(iter(ds))])
sample_input = {k: v.to(model.device) for k, v in sample_input.items()}
output = model.generate(**sample_input, max_new_tokens=256)
print(processor.batch_decode(output, skip_special_tokens=True)[0])
print("==========================================\n\n")
# that's where you have a lot of windows in the south no actually that's passive solar
# and passive solar is something that was developed and designed in the 1960s and 70s
# and it was a great thing for what it was at the time but it's not a passive house

# Save to disk compressed.
SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)