简体中文 | English
Disclaimer: this document was produced by machine translation. Please check the original document here.
OpenAI's open-source project Whisper is claimed to reach human-level speech recognition accuracy in English, and it also supports automatic speech recognition in 98 other languages. Whisper provides automatic speech recognition and translation tasks: it can transcribe speech into text in various languages and translate that text into English. The main purpose of this project is to fine-tune the Whisper model with Lora. It supports training on data without timestamps, data with timestamps, and data without speech. Several models are currently open source; see openai for the full list, and the commonly used ones are listed below. In addition, the project supports accelerated inference with CTranslate2 and GGML. Note that accelerated inference can also be used directly with a converted original Whisper model and does not necessarily require fine-tuning. Windows desktop applications, Android applications, and server deployments are supported.
- openai/whisper-large-v2
- openai/whisper-large-v3
- distil-whisper
Environment:
- Anaconda 3
- Python 3.10
- Pytorch 2.1.0
- GPU A100-PCIE-80GB
- Introduction to the project's main programs
- Test table
- Install
- Prepare data
- Fine-tuning
- Merge model
- Evaluation
- Inference
- Accelerate inference
- GUI inference
- Web deploy
- Android
- Windows Desktop
- `aishell.py`: Create AIShell training data.
- `finetune.py`: Fine-tune the model with peft (Lora).
- `finetune_all.py`: Fine-tune all parameters of the model.
- `merge_lora.py`: Merge the Whisper and Lora models.
- `evaluation.py`: Evaluate the fine-tuned model or the original Whisper model.
- `infer_tfs.py`: Use the transformers library to directly call the fine-tuned model or the original Whisper model for prediction; suitable only for inference on short audio clips.
- `infer_ct2.py`: Use the converted CTranslate2 model for prediction, primarily as a reference for program usage.
- `infer_gui.py`: Provides a GUI interface, using the converted CTranslate2 model for prediction.
- `infer_server.py`: Deploys the converted CTranslate2 model to the server for use by client applications.
- `convert-ggml.py`: Converts the model to GGML format for use in Android or Windows applications.
- `AndroidDemo`: Contains the source code for deploying the model to Android.
- `WhisperDesktop`: Contains the program for the Windows desktop application.
Model | Parameters(M) | Base Model | Data (Re)Sample Rate | Train Datasets | Fine-tuning (full or peft) |
---|---|---|---|---|---|
Belle-whisper-large-v2-zh | 1550 | whisper-large-v2 | 16KHz | AISHELL-1 AISHELL-2 WenetSpeech HKUST | full fine-tuning |
Belle-distil-whisper-large-v2-zh | 756 | distil-whisper-large-v2 | 16KHz | AISHELL-1 AISHELL-2 WenetSpeech HKUST | full fine-tuning |
Belle-whisper-large-v3-zh | 1550 | whisper-large-v3 | 16KHz | AISHELL-1 AISHELL-2 WenetSpeech HKUST | full fine-tuning |
Belle-whisper-large-v3-zh-punct | 1550 | Belle-whisper-large-v3-zh | 16KHz | AISHELL-1 AISHELL-2 WenetSpeech HKUST | lora fine-tuning |
Model | Language Tag | aishell_1 test | aishell_2 test | wenetspeech test_net | wenetspeech test_meeting | HKUST_dev | Model Link |
---|---|---|---|---|---|---|---|
whisper-large-v2 | Chinese | 8.818 | 6.183 | 12.343 | 26.413 | 31.917 | HF |
Belle-whisper-large-v2-zh | Chinese | 2.549 | 3.746 | 8.503 | 14.598 | 16.289 | HF |
whisper-large-v3 | Chinese | 8.085 | 5.475 | 11.72 | 20.15 | 28.597 | HF |
Belle-whisper-large-v3-zh | Chinese | 2.781 | 3.786 | 8.865 | 11.246 | 16.440 | HF |
Belle-whisper-large-v3-zh-punct | Chinese | 2.945 | 3.808 | 8.998 | 10.973 | 17.196 | HF |
distil-whisper-large-v2 | Chinese | - | - | - | - | - | HF |
Belle-distilwhisper-large-v2-zh | Chinese | 5.958 | 6.477 | 12.786 | 17.039 | 20.771 | HF |
Note:
- All punctuation marks are removed during evaluation to compute the CER.
- Compared to whisper-large-v2, Belle-whisper-large-v2-zh demonstrates a 30-70% relative improvement in performance on Chinese ASR benchmarks.
- Belle-whisper-large-v3-zh shows a significant improvement in complex acoustic scenes (such as wenetspeech_meeting).
- Belle-whisper-large-v3-zh-punct even shows a slight improvement in complex acoustic scenes (such as wenetspeech_meeting) while adding punctuation capability.
- The GPU version of Pytorch should be installed first. You can choose one of the two ways below to install Pytorch.
- Here's how to install Pytorch using Anaconda. If you already have it installed, you can skip this step.
conda install pytorch==1.13.1 torchvision==0.14.1 torchaudio==0.13.1 pytorch-cuda=11.6 -c pytorch -c nvidia
- Here's how to pull an image of a Pytorch environment using a Docker image.
sudo docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel
Then enter the container and mount the current path to the container's '/workspace' directory.
sudo nvidia-docker run --name pytorch -it -v $PWD:/workspace pytorch/pytorch:1.13.1-cuda11.6-cudnn8-devel /bin/bash
- Install the required libraries.
python -m pip install -r requirements.txt -i https://pypi.tuna.tsinghua.edu.cn/simple
- Windows requires a separate installation of bitsandbytes.
python -m pip install https://github.com/jllllll/bitsandbytes-windows-webui/releases/download/wheels/bitsandbytes-0.40.1.post1-py3-none-win_amd64.whl
The training dataset is in jsonlines format, meaning each line is a JSON object in the format shown below. This project provides a program, 'aishell.py', to build the AIShell dataset. Running it will automatically download the data and generate training and test sets in the following format. The download step can be skipped by pointing the program at an already-downloaded AIShell archive: if the direct download is very slow, use a download manager such as Xunlei (Thunder) to fetch the dataset, then pass the path of the compressed file through the '--filepath' parameter, e.g. /home/test/data_aishell.tgz.
Note:
- If timestamp training is not used, the `sentences` field can be excluded from the data.
- If data is only available for one language, the `language` field can be excluded from the data.
- If training on empty speech data, the `sentences` field should be `[]`, the `sentence` field should be `""`, and the `language` field can be absent.
- Data may exclude punctuation marks, but the fine-tuned model may then lose the ability to add punctuation marks.
{
"audio": {
"path": "dataset/0.wav"
},
"sentence": "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。",
"language": "Chinese",
"sentences": [
{
"start": 0,
"end": 1.4,
"text": "近几年,"
},
{
"start": 1.42,
"end": 8.4,
"text": "不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。"
}
],
"duration": 7.37
}
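If you prepare your own data instead of using `aishell.py`, a minimal sketch for writing such a jsonlines manifest could look like this (the output path `dataset/train.json` is only an example, not a path required by the project):

```python
import json

# One entry per audio file; the "sentences" and "language" fields are
# optional, as described in the note above.
examples = [
    {
        "audio": {"path": "dataset/0.wav"},
        "sentence": "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。",
        "language": "Chinese",
        "duration": 7.37,
    },
]

# jsonlines: write one JSON object per line.
with open("dataset/train.json", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")
```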
Once the data is ready, we can fine-tune the model. The two most important training arguments are `--base_model`, which specifies the Whisper model to fine-tune; its value must exist on HuggingFace and does not need to be downloaded in advance, since it is downloaded automatically when training starts. Alternatively, download it in advance, set `--base_model` to the local path, and set `--local_files_only` to True. The second is `--output_dir`, the directory where the Lora checkpoints are saved during training, since Lora is used to fine-tune the model. If GPU memory is sufficient, it is best to set `--use_8bit` to False, which makes training much faster. See this program for more parameters.
The single-GPU training command is as follows. On Windows, the `CUDA_VISIBLE_DEVICES` parameter can be omitted.
CUDA_VISIBLE_DEVICES=0 python finetune.py --base_model=openai/whisper-tiny --output_dir=output/
torchrun and accelerate are two different ways to run multi-GPU training; developers can use whichever they prefer.
- To start multi-GPU training with torchrun, use `--nproc_per_node` to specify the number of GPUs to use.
torchrun --nproc_per_node=2 finetune.py --base_model=openai/whisper-tiny --output_dir=output/
- To start multi-GPU training with accelerate, configure the training parameters first if this is your first time using accelerate.
The first step is to configure the training parameters. The process asks the developer to answer a few questions; the defaults are mostly fine, but a few parameters need to be set according to the actual situation.
accelerate config
Here's how it goes:
--------------------------------------------------------------------In which compute environment are you running?
This machine
--------------------------------------------------------------------Which type of machine are you using?
multi-GPU
How many different machines will you use (use more than 1 for multi-node training)? [1]:
Do you wish to optimize your script with torch dynamo?[yes/NO]:
Do you want to use DeepSpeed? [yes/NO]:
Do you want to use FullyShardedDataParallel? [yes/NO]:
Do you want to use Megatron-LM ? [yes/NO]:
How many GPU(s) should be used for distributed training? [1]:2
What GPU(s) (by id) should be used for training on this machine as a comma-seperated list? [all]:
--------------------------------------------------------------------Do you wish to use FP16 or BF16 (mixed precision)?
fp16
accelerate configuration saved at /home/test/.cache/huggingface/accelerate/default_config.yaml
Once the configuration is complete, you can view the configuration using the following command:
accelerate env
Start fine-tune:
accelerate launch finetune.py --base_model=openai/whisper-tiny --output_dir=output/
log:
{'loss': 0.9098, 'learning_rate': 0.000999046843662503, 'epoch': 0.01}
{'loss': 0.5898, 'learning_rate': 0.0009970611012927184, 'epoch': 0.01}
{'loss': 0.5583, 'learning_rate': 0.0009950753589229333, 'epoch': 0.02}
{'loss': 0.5469, 'learning_rate': 0.0009930896165531485, 'epoch': 0.02}
{'loss': 0.5959, 'learning_rate': 0.0009911038741833634, 'epoch': 0.03}
After fine-tuning there will be two models: the first is the Whisper base model and the second is the Lora model. These two models need to be merged before any further steps. This program only needs two arguments: `--lora_model`, the path of the Lora model saved after training (i.e. the checkpoint folder), and `--output_dir`, the directory where the merged model is saved.
python merge_lora.py --lora_model=output/whisper-tiny/checkpoint-best/ --output_dir=models/
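Conceptually, the merge step loads the base Whisper model, applies the Lora weights with `peft`, and saves the merged result. A rough sketch of that idea (the actual `merge_lora.py` may handle additional details):

```python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
from peft import PeftModel

base_model = "openai/whisper-tiny"                  # the model used for fine-tuning
lora_path = "output/whisper-tiny/checkpoint-best/"  # Lora checkpoint from training
output_dir = "models/whisper-tiny-finetune"         # where the merged model is saved

# Load the base model, attach the Lora adapter, then fold the adapter
# weights back into the base model.
model = WhisperForConditionalGeneration.from_pretrained(base_model)
model = PeftModel.from_pretrained(model, lora_path)
model = model.merge_and_unload()

# Save the merged model together with the processor (tokenizer + feature extractor).
model.save_pretrained(output_dir)
WhisperProcessor.from_pretrained(base_model).save_pretrained(output_dir)
```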
Run the following program to evaluate the model. The two most important arguments are `--model_path`, which specifies the path of the merged model and also supports using the original Whisper model directly, such as `openai/whisper-large-v2`, and `--metric`, which specifies the evaluation metric, for example character error rate `cer` or word error rate `wer`. Note: models that have not been fine-tuned may output punctuation, which affects accuracy. See this program for more parameters.
python evaluation.py --model_path=models/whisper-tiny-finetune --metric=cer
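For reference, the `cer` metric can be reproduced with the `jiwer` library, stripping punctuation before scoring as noted in the test table section. This is only a minimal sketch, not the project's `evaluation.py`:

```python
import re
import jiwer  # pip install jiwer

def normalize(text: str) -> str:
    # Drop punctuation and whitespace so only the characters are scored.
    return re.sub(r"[^\w]", "", text)

reference = "近几年,不但我用书给女儿压岁,也劝说亲朋不要给女儿压岁钱,而改送压岁书。"
hypothesis = "近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。"

# Character error rate between the normalized reference and hypothesis.
print(f"CER: {jiwer.cer(normalize(reference), normalize(hypothesis)):.4f}")
```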
Run the following program for speech recognition. It uses transformers to directly call the fine-tuned model or the original Whisper model for prediction and is only suitable for inference on short audio; for long speech, refer to `infer_ct2.py`. The `--audio_path` argument specifies the audio path to predict, and `--model_path` specifies the path of the merged model. The original Whisper model can also be used directly, for example `openai/whisper-large-v2`. See this program for more parameters.
python infer_tfs.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-finetune
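For illustration, a merged model can also be called through the transformers `pipeline` API. This is a minimal sketch assuming the merged model from the previous step, not the same code as `infer_tfs.py`:

```python
from transformers import pipeline

# Load the merged model as an automatic-speech-recognition pipeline.
asr = pipeline(
    task="automatic-speech-recognition",
    model="models/whisper-tiny-finetune",
    device=0,  # GPU 0; use -1 for CPU
)

# Transcribe a short audio clip in Chinese.
result = asr("dataset/test.wav",
             generate_kwargs={"task": "transcribe", "language": "chinese"})
print(result["text"])
```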
Inference with the Whisper model directly is relatively slow, so this project provides a way to accelerate it, mainly using CTranslate2. First convert the merged model into a CTranslate2 model. In the following command, the `--model` parameter is the path of the merged model (the original Whisper model can also be used directly, such as `openai/whisper-large-v2`), the `--output_dir` parameter specifies the path of the converted CTranslate2 model, and the `--quantization` parameter quantizes the model to reduce its size. If you don't want to quantize the model, you can drop this parameter.
ct2-transformers-converter --model models/whisper-tiny-finetune --output_dir models/whisper-tiny-finetune-ct2 --copy_files tokenizer.json --quantization float16
Run the following program for accelerated speech recognition. The `--audio_path` argument specifies the audio path to predict, and `--model_path` specifies the converted CTranslate2 model. See this program for more parameters.
python infer_ct2.py --audio_path=dataset/test.wav --model_path=models/whisper-tiny-finetune-ct2
Output:
----------- Configuration Arguments -----------
audio_path: dataset/test.wav
model_path: models/whisper-tiny-finetune-ct2
language: zh
use_gpu: True
use_int8: False
beam_size: 10
num_workers: 1
vad_filter: False
local_files_only: True
------------------------------------------------
[0.0 - 8.0]:近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。
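The converted model can also be loaded directly with the faster-whisper library, which wraps CTranslate2. A minimal sketch mirroring the configuration shown above:

```python
from faster_whisper import WhisperModel

# Load the converted CTranslate2 model on the GPU with float16 compute.
model = WhisperModel("models/whisper-tiny-finetune-ct2",
                     device="cuda",
                     compute_type="float16",
                     local_files_only=True)

# Segments are generated lazily; iterating over them runs the inference.
segments, info = model.transcribe("dataset/test.wav",
                                  language="zh",
                                  beam_size=10,
                                  vad_filter=False)
for segment in segments:
    print(f"[{segment.start:.1f} - {segment.end:.1f}]:{segment.text}")
```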
CTranslate2 is again used for acceleration here; the model conversion is described in the documentation above. `--model_path` specifies the converted CTranslate2 model. See this program for more parameters.
python infer_gui.py --model_path=models/whisper-tiny-finetune-ct2
After startup, the screen is as follows:
Web deployment also uses CTranslate2 for acceleration; the model conversion is described in the documentation above. `--host` specifies the address where the service is started, here `0.0.0.0`, which means it is accessible from any address. `--port` specifies the port number to use. `--model_path` specifies the converted CTranslate2 model. `--num_workers` specifies how many threads to use for concurrent inference, which matters in web deployments where multiple concurrent requests need to be served at the same time. See this program for more parameters.
python infer_server.py --host=0.0.0.0 --port=5000 --model_path=models/whisper-tiny-finetune-ct2 --num_workers=2
At present, two interfaces are provided: the standard recognition interface `/recognition` and the streaming-result interface `/recognition_stream`. Note that "streaming" here refers to streaming back the recognition results: the complete audio is uploaded first, and the recognition results are then streamed back, which gives a much better experience for long speech recognition. The documentation interface for both is identical, and the interface parameters are as follows.
Field | Required | Type | Default | Description |
---|---|---|---|---|
audio | Yes | File | | Audio file to recognize |
to_simple | No | int | 1 | Whether to convert Traditional Chinese to Simplified Chinese |
remove_pun | No | int | 0 | Whether to remove punctuation |
task | No | String | transcribe | Recognition task type; supports transcribe and translate |
language | No | String | zh | Language code (short form); the language is detected automatically if None |
Return result:
Field | Type | Description |
---|---|---|
results | list | Recognition results separated into individual parts |
+result | str | Text recognition result for each separated part |
+start | int | Start time in seconds for each separated part |
+end | int | End time in seconds for each separated part |
code | int | Error code, 0 indicates successful recognition |
Example:
{
"results": [
{
"result": "近几年,不但我用书给女儿压碎,也全说亲朋不要给女儿压碎钱,而改送压碎书。",
"start": 0,
"end": 8
}
],
"code": 0
}
To make it easier to understand, here is Python code for calling the Web interface. This is how to call `/recognition`:
import requests
response = requests.post(url="http://127.0.0.1:5000/recognition",
files=[("audio", ("test.wav", open("dataset/test.wav", 'rb'), 'audio/wav'))],
json={"to_simple": 1, "remove_pun": 0, "language": "zh", "task": "transcribe"}, timeout=20)
print(response.text)
Here is how `/recognition_stream` is called:
import json
import requests
response = requests.post(url="http://127.0.0.1:5000/recognition_stream",
files=[("audio", ("test.wav", open("dataset/test_long.wav", 'rb'), 'audio/wav'))],
json={"to_simple": 1, "remove_pun": 0, "language": "zh", "task": "transcribe"}, stream=True,
timeout=20)
for chunk in response.iter_lines(decode_unicode=False, delimiter=b"\0"):
if chunk:
result = json.loads(chunk.decode())
text = result["result"]
start = result["start"]
end = result["end"]
print(f"[{start} - {end}]:{text}")
The provided test pages are as follows:
The home page `http://127.0.0.1:5000/` looks like this:
The documentation page `http://127.0.0.1:5000/docs` looks like this:
The source code for deploying to Android can be found in the AndroidDemo directory; see its README.md for documentation.
The Windows desktop program is in the WhisperDesktop directory; see its README.md for documentation.