The differences between OpenAI and whisper.cpp in generating log_mel_spectrograms
#1163
We've wrapped up our analysis comparing the two implementations. To summarize the main issues we found in whisper.cpp:

- Its sample padding differs from OpenAI's: depending on the input length it pads 240,000 to 480,000 samples (15 to 30 seconds of silence), instead of the fixed 480,000 samples OpenAI's Whisper appends.
- It skips the Stage-2 reflect padding (200 samples of `reflect` padding at each end) hidden inside OpenAI's `stft` call, which can cause edge effects.
- Its formula for calculating the number of frames is incorrect, so the final frame count is wrong.
- It does not drop the last frame of each frequency bin the way OpenAI's `stft[..., :-1]` does.
- After the FFT, it adds together the amplitudes of the symmetrical parts of the spectrum.

On top of these, whisper.cpp presents a couple of secondary concerns:

- It converts between stereo and mono itself rather than relying on ffmpeg, and it is unclear whether Whisper is meant to accept stereo input at all.
- The mel filter matrix multiplication processes each frame individually as it is computed, leaving room for optimization.

With these findings in hand, we're set to fix whisper.cpp.
This note will compare the differences between OpenAI's Whisper and whisper.cpp in generating log mel spectrograms, following the sequence of steps used by OpenAI's Whisper to create the `log_mel_spectrogram`.
Part-0: Introducing the Hyperparameters
whisper/audio.py
whisper.cpp/whisper.h
Lines 22 to 26 in a32c4aa
How are these hardcoded hyperparameters obtained? We can find the answers by looking at OpenAI's paper on Whisper. The paper mentions that the signal is 're-sampled to 16,000Hz', so we know that `SAMPLE_RATE` equals `16000`. Similarly, there's an '80-channel log-magnitude Mel spectrogram', which means `N_MELS` must be `80`. The paper also talks about '25-millisecond windows', so `N_FFT` can be calculated as `16,000` multiplied by `0.025`, resulting in `400`. Lastly, it mentions 'a stride of 10 milliseconds', allowing us to calculate `HOP_LENGTH` by multiplying `16,000` by `0.010`, resulting in `160`. The most important description is 'a stride of 10 milliseconds'. That is to say, when a window of 25 milliseconds moves across the signal, it moves 10 milliseconds each time, meaning that adjacent windows will have an overlap of 15 milliseconds. This ensures that no information will be lost, and the captured information will be more complete and continuous.
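As a quick sanity check, we can reproduce these numbers from the paper's quoted values alone:

```python
# Derive the hardcoded hyperparameters from the numbers quoted in the paper.
SAMPLE_RATE = 16000                    # "re-sampled to 16,000Hz"
N_MELS = 80                            # "80-channel log-magnitude Mel spectrogram"
N_FFT = int(SAMPLE_RATE * 0.025)       # 25-millisecond window -> 400 samples
HOP_LENGTH = int(SAMPLE_RATE * 0.010)  # 10-millisecond stride -> 160 samples

print(N_FFT, HOP_LENGTH)  # 400 160
```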
Part-1: Sample Preprocessing
Although OpenAI has not specified whether the Whisper model can support audio with a higher sample rate (greater than 16 kHz), for convenience we will follow OpenAI's standard and assume that Whisper only supports up to 16 kHz. To facilitate subsequent calculations, we first need to decode the compressed audio.
whisper/audio.py
whisper.cpp/examples/common.cpp
Lines 650 to 679 in a32c4aa
Notice that `np.frombuffer(out, np.int16).flatten().astype(np.float32) / 32768.0` reads the raw bytes and converts them into a 16-bit integer array. The `.flatten()` method ensures the array is 1-dimensional. Then, the values are converted to 32-bit floating-point numbers (FP32) and normalized to the range between `-1` and `0.999969`, which corresponds to the range of 16-bit signed integers divided by `32768.0`.
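As a small illustration, here is that conversion applied to a few hand-picked samples:

```python
import numpy as np

# Raw 16-bit PCM bytes -> normalized float32 in [-1, 0.999969], matching
# the np.frombuffer(...) / 32768.0 line in whisper/audio.py.
raw = np.array([-32768, 0, 16384, 32767], dtype=np.int16).tobytes()
audio = np.frombuffer(raw, np.int16).flatten().astype(np.float32) / 32768.0
print(audio)  # [-1.         0.         0.5        0.99996948]
```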
A strange thing is that `whisper.cpp`, based on the settings, will try to convert stereo audio into mono, or turn mono into fake stereo, without using ffmpeg, whereas OpenAI's Whisper directly uses ffmpeg to convert it into mono. Although I can confirm that the method `whisper.cpp` uses to change the channel count is correct (according to Multimedia Programming Interface and Data Specifications 1.0, page 59), I am not clear whether Whisper actually supports stereo audio or not.
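For reference, a minimal sketch of stereo-to-mono downmixing by averaging the two channels (whisper.cpp folds the equivalent arithmetic into its int16-to-float conversion; the float-domain version here is just for illustration):

```python
import numpy as np

# Average left and right channels per sample to reduce stereo to mono.
stereo = np.array([[0.2, 0.4], [0.6, 0.8]], dtype=np.float32)  # [n_samples, 2]
mono = stereo.mean(axis=1)
print(mono)  # [0.3 0.7]
```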
Part-2: Sample Padding
In OpenAI's Whisper, padding of the sample is divided into two stages. The first stage is carried out in the `log_mel_spectrogram` function, and the second stage is in the `stft` function.
whisper/transcribe.py
whisper/audio.py
We can find that the `log_mel_spectrogram` function is called in `whisper/transcribe.py` to generate the log mel spectrogram. This function is located in `whisper/audio.py`, where we can see that as long as `padding > 0`, the audio will be padded. Since we passed the argument `padding=N_SAMPLES` when calling it in `whisper/transcribe.py`, the audio will definitely be padded.

According to Torch's documentation and the code in `whisper/audio.py`, we can conclude that its padding strategy is quite simple: it just appends a lot of zeros (480,000 samples) to the end of the audio, equivalent to 30 seconds of blank audio, since the default values are `value=0` and `mode='constant'`. In fact, we can conduct a small experiment to verify our idea, as sketched below.
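A small experiment confirming the Stage-1 padding behavior (the sizes follow `whisper/audio.py`; the one-second input is arbitrary):

```python
import torch
import torch.nn.functional as F

# Stage-1 padding as in whisper/audio.py: append N_SAMPLES zeros
# (30 s of silence at 16 kHz) to the end of the audio.
N_SAMPLES = 480000
audio = torch.randn(16000)                 # pretend 1 s of audio
padded = F.pad(audio, (0, N_SAMPLES))      # defaults: mode='constant', value=0
print(padded.shape)                        # torch.Size([496000])
print(padded[-5:])                         # tensor([0., 0., 0., 0., 0.])
```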
Let's see how whisper.cpp implements Sample Padding?
whisper.cpp/examples/main/main.cpp
Lines 934 to 937 in a4bb2df
whisper.cpp/whisper.cpp
Lines 4774 to 4829 in b948361
whisper.cpp/whisper.cpp
Lines 4027 to 4049 in b948361
whisper.cpp/whisper.cpp
Lines 2997 to 3004 in b948361
whisper.cpp/whisper.cpp
Lines 2493 to 2582 in b948361
To make it clear for everyone, I have pasted the most critical code for Sample Padding here.
Actually, this code has many problems, but in this chapter we are only discussing Sample Padding. In C/C++, the integer division operator `/` truncates: for positive numbers it rounds down by discarding the fractional part. So `mel.n_len = (mel.n_len/pad + 1)*pad;` will increase `mel.n_len` by at most `1500`, and `mel.n_len += pad;` will further increase `mel.n_len` by `1500` on this basis. Overall, this will increase `mel.n_len` by `1500` to `3000`, padding `240,000` to `480,000` samples, equivalent to `15` to `30` seconds of blank audio. This can be significantly less than the fixed `480,000` samples padded by OpenAI's Whisper.
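To make the rounding behavior concrete, here is a quick numeric sketch using hypothetical frame counts (`pad = 1500` comes from the code above; one frame advances `HOP_LENGTH = 160` samples):

```python
# pad = 1500 frames, as in the whisper.cpp code above; 1 frame = 160 samples.
pad, hop = 1500, 160

for n_len in (3000, 3001, 4499):            # hypothetical frame counts
    rounded = (n_len // pad + 1) * pad      # C's truncating '/' == '//' here
    padded = rounded + pad                  # the extra 'mel.n_len += pad'
    extra = padded - n_len
    print(f"{n_len} -> {padded} frames (+{extra * hop} samples)")

# 3000 -> 6000 frames (+480000 samples)
# 3001 -> 6000 frames (+479840 samples)
# 4499 -> 6000 frames (+240160 samples)
```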
In OpenAI's Whisper, in addition to appending many zeros at the end of the sample, there is also a padding hidden within the `stft` function. However, this padding strategy is more complex, adding 200 samples to both the beginning and the end of the sample, using the padding mode `reflect`. But what is `reflect`? A simple example: if we pad an array of length 8 by 2 samples at each end with mode `reflect`, the result mirrors the signal around the first and last samples as the axes of symmetry. We can conduct an experiment to verify our idea, as sketched below.
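Here is a small experiment verifying the reflect behavior (using 2 samples per side instead of Whisper's 200, so the output stays readable):

```python
import torch
import torch.nn.functional as F

# Reflect-pad a length-8 array by 2 samples on each side (whisper's stft
# does the same with 200 samples per side). unsqueeze is needed because
# F.pad cannot reflect-pad a plain 1-D tensor.
x = torch.arange(8, dtype=torch.float32)            # [0, 1, 2, 3, 4, 5, 6, 7]
padded = F.pad(x.unsqueeze(0), (2, 2), mode="reflect").squeeze(0)
print(padded)  # tensor([2., 1., 0., 1., 2., 3., 4., 5., 6., 7., 6., 5.])
```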
The reason the code performs such a roundabout operation, first increasing the dimension to pad and then reducing it again, is that `torch.nn.functional.pad` currently does not support direct reflect padding on a 1-D tensor. Otherwise, you will receive the following error message: 'NotImplementedError: Only 2D, 3D, 4D, 5D padding with non-constant padding are supported for now.'

How does whisper.cpp achieve this? Unfortunately, I carefully checked the code of whisper.cpp and found that it does not carry out this padding stage at all. If anyone finds that it does, please tell me where it is.
Part-3: Hann Window Generation
whisper/audio.py
OpenAI's Whisper directly uses the built-in `torch.hann_window` in PyTorch to generate a Hann window. Since PyTorch invokes backend code for computation, and the calling process uses automatically generated code that is very complex, we are unable to understand how the internal calculation works. Fortunately, the official Torch documentation provides a detailed explanation. We can conduct an experiment to check the output of `torch.hann_window`.
whisper.cpp/whisper.cpp
Lines 2507 to 2512 in b948361
I had always thought that the implementation of generating the Hann window in whisper.cpp was problematic. But today, after taking a closer look, I realized that it is actually not problematic. This is because `torch.hann_window` is set to `periodic=True` by default, so `N` becomes `window_size + 1`, and therefore the denominator should indeed be `fft_size`. But we still need to compare the output results with the results output by torch, as in the sketch below.
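Here is a comparison sketch; the whisper.cpp side is re-expressed in Python from the formula in the lines above:

```python
import math
import torch

# Compare torch.hann_window (periodic=True by default) against the formula
# whisper.cpp uses: 0.5 * (1 - cos(2*pi*i / fft_size)).
fft_size = 400
ref = torch.hann_window(fft_size)   # periodic window: denominator is fft_size
ours = torch.tensor([0.5 * (1.0 - math.cos(2.0 * math.pi * i / fft_size))
                     for i in range(fft_size)], dtype=torch.float32)
print(torch.allclose(ref, ours, atol=1e-6))  # True up to float32 rounding
```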
I used the results output by torch as the numerator and the results from whisper.cpp as the denominator, dividing the corresponding items, and obtained the following results. I have tried using the PI defined by torch, `3.141592653589793`, but the results did not change.

```
0.9998766183853149
0.9998766324816614
0.9998766324816614
```
However, I think the overall impact shouldn't be too significant, since we are using the float type, which only has 6 to 7 significant decimal digits.
Part-4: Short-Time Fourier Transform
whisper/audio.py
torch/functional.py
OpenAI's Whisper directly uses PyTorch's `torch.stft` to perform STFT calculations. Most of the computations in `torch.stft` are implemented in C++, and although PyTorch provides documentation, the STFT is rather complex, and the documentation does not fully describe how the STFT is calculated, with many details left out. Fortunately, I successfully found the C++ code that PyTorch uses to calculate the STFT.
pytorch/aten/src/ATen/native/SpectralOps.cpp
The STFT C++ code is written very obscurely and is not very easy to understand. Fortunately, PyTorch provides a set of Python APIs that allow you to directly call the tensor's member functions in Python, which is very convenient. The following will use this method to conduct an in-depth analysis of this program.
The code above was written by PyTorch for the sake of backward compatibility. Up until now, the Stage-2 Padding in STFT has always been implemented using Python, so the speed can be relatively slow. PyTorch plans to implement this part of the computation using C++, but it seems that the migration is not yet fully completed. The STFT function first checks the input variables to ensure that they all meet the definition; if not, an error will be reported. Then it handles the default values. After that, it enters the formal calculation phase.
We can clearly see that the Tensor has added a dimension on top of its original basis. Next, the function checks whether it is in `center` mode; if it is, half of `n_fft` will be padded at both the beginning and the end, but since we have already padded it, this step is directly skipped. `batch` will be assigned a value of `1`, and `len` will be assigned the length of the one-dimensional Tensor. The window size is then compared with `n_fft`; if it is inconsistent, padding will be applied. If the window is found to be undefined, a window of all ones will be created, meaning it will have no effect (as if it doesn't exist). Since we have already ensured that the size of the input window is consistent with the size of `n_fft`, the last two steps are directly skipped.

Next, the frames are split using `as_strided`. But what is `as_strided`? Before understanding `as_strided`, we must first understand a concept called `Stride`. In PyTorch, the `Stride` of a Tensor refers to the distance that needs to be moved in memory to access the next element within a dimension. This is because all Tensors, regardless of how many dimensions they have, are stored in memory in a one-dimensional form. `as_strided` can change the `Size` and `Stride` of a Tensor without copying memory, achieving our goal of splitting frames. We can conduct an experiment to verify our idea, as sketched below.
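Here is a sketch of that experiment with toy sizes (a length-200 signal, `n_fft=50`, `hop_length=20`; these numbers are chosen to match the shapes discussed below, not Whisper's real hyperparameters):

```python
import torch

# Reproduce the framing step: a length-200 signal, n_fft=50, hop_length=20.
# Stage-2 padding adds n_fft//2 = 25 samples per side -> length 250.
x = torch.arange(200, dtype=torch.float32)
padded = torch.nn.functional.pad(x.view(1, 1, -1), (25, 25),
                                 mode="reflect").view(-1)
input = padded.unsqueeze(0)                        # shape (1, 250)
frames = input.as_strided((1, 11, 50), (250, 20, 1))
print(frames.shape)    # torch.Size([1, 11, 50])
print(frames[0, 0])    # frame 0: original index 0 sits near its center
print(frames[0, 10])   # frame 10: original index 199 sits near its center
```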
We first generated a one-dimensional Tensor that has already undergone Stage-2 Padding, and then used `unsqueeze` to increase its dimensionality from 1D to 2D, finally using `as_strided` to split it. In `as_strided((1, 11, 50), (250, 20, 1))`, the first `1` corresponds to the batch; `11` corresponds to `n_frames`, which we obtained through calculation; `50` corresponds to `n_fft`, which we arbitrarily set in our experiment; `250` corresponds to `input.stride(0)`, which is the total length of our data; `20` corresponds to `hop_length * input.stride(1)`, which is our moving step size; and the last `1` corresponds to `input.stride(1)`. We can clearly see that the data has been divided into `11` frames, each with a length of `50`. Since we used Stage-2 Padding, the data at `index 0` has been placed near the center of `frame 0`, and the data at `index 199` has been placed near the center of `frame 10`.
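Continuing the same toy setup, the window is applied by broadcasting it across all frames:

```python
import torch

# Apply a Hann window to every frame before the FFT (toy sizes as above).
x = torch.arange(250, dtype=torch.float32)          # already Stage-2 padded
frames = x.unsqueeze(0).as_strided((1, 11, 50), (250, 20, 1))
windowed = frames * torch.hann_window(50)           # broadcasts over frames
print(windowed[0, 0, 0].item())    # 0.0  -- frame edges are tapered to zero
print(windowed[0, 0, 25].item())   # 25.0 -- frame centers pass through
```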
In the experiment, we used a Hann window, and we can clearly see that the Hann window has been applied to each frame.
Because the input signal is real-valued, `onesided` will be set to true, and in the end only `n_fft/2 + 1` bins will be returned. Then it will check whether `normalized` is true. Since the default value of `normalized` is false, and we did not set it to true, the `norm` will end up being `fft_norm_mode::none`, meaning that the result will not be normalized. Next is the actual computation, and finally, through the transpose operation, the same frequency bins from each frame are placed together. We can verify our idea through an experiment, comparing a step-by-step reproduction against `torch.stft` as a reference (see the sketch below).
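Here is a condensed sketch of the whole reproduction, again with the toy sizes from above; the steps (reflect padding, `as_strided` framing, windowing, FFT, transpose) mirror the sequence described in this section:

```python
import torch

n_fft, hop = 50, 20
x = torch.arange(200, dtype=torch.float32)

# Stage-2 reflect padding, framing, windowing, FFT, transpose.
padded = torch.nn.functional.pad(x.view(1, 1, -1), (n_fft // 2, n_fft // 2),
                                 mode="reflect").view(-1)
input = padded.unsqueeze(0)
n_frames = 1 + (padded.numel() - n_fft) // hop                  # 11
frames = input.as_strided((1, n_frames, n_fft), (padded.numel(), hop, 1))
spec = torch.fft.fft(frames * torch.hann_window(n_fft))         # full spectrum
spec = spec.transpose(1, 2).squeeze_(0)                         # (n_fft, n_frames)

# Reference: torch.stft returns only bins 0..Nyquist (onesided).
ref = torch.stft(x, n_fft, hop, window=torch.hann_window(n_fft),
                 center=True, pad_mode="reflect", return_complex=True)
print(torch.allclose(spec[: n_fft // 2 + 1], ref, atol=1e-3))   # True
```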
Comparing the output of `torch.stft` with the output of the code used in this experiment, except for the length of the output, everything else is exactly the same. `torch.stft` will only output from `bin 0` to `bin Nyquist`, while the code used in the experiment outputs the entire spectrum. Recall that we used `unsqueeze` earlier to increase the input's dimension; now we need to use `squeeze_` to restore it back.
to restore it back.whisper.cpp/whisper.cpp
Lines 2514 to 2516 in a4bb2df
whisper.cpp/whisper.cpp
Lines 2455 to 2457 in a4bb2df
whisper.cpp does a rather poor job when calculating the STFT. For example, the formula for calculating the number of frames is incorrect, resulting in an incorrect final count. The lack of Stage-2 padding leads to potential edge effects. Also, after the FFT computation, it adds together the amplitudes of the symmetrical parts of the spectrum...
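For reference, here is a sketch of the frame count `torch.stft` actually produces with `center=True`, using Whisper's `N_FFT` and `HOP_LENGTH`:

```python
# Frame count torch.stft produces with center=True padding.
def n_frames(n_samples: int, n_fft: int = 400, hop: int = 160) -> int:
    padded = n_samples + 2 * (n_fft // 2)   # center mode pads n_fft//2 per side
    return 1 + (padded - n_fft) // hop

print(n_frames(16000))  # 101 frames for one second of 16 kHz audio
```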
Part-5: Magnitudes of Bins
To get the power or magnitude spectrum, we compute the magnitude of each complex number from the FFT. Magnitude represents the amount of the frequency content present in that frame.
whisper/audio.py
OpenAI's Whisper implementation computes mag^2 using just one line of code. The `stft[..., :-1]` removes the last frame from each frequency bin in the STFT computation results (I'm not clear why it's done this way either). Since the result of the STFT is complex, `.abs()` takes the absolute value of each complex number, calculating their magnitude, and `** 2` then squares each magnitude. We can conduct a small experiment to verify our idea, as sketched below.
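A tiny sketch on a toy complex tensor (the values are arbitrary):

```python
import torch

# magnitudes = stft[..., :-1].abs() ** 2, on a toy one-bin, three-frame tensor.
stft = torch.tensor([[3 + 4j, 1 + 0j, 0 + 2j]])
magnitudes = stft[..., :-1].abs() ** 2
print(magnitudes)  # tensor([[25., 1.]]) -- the last frame has been dropped
```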
We can see that the result is as we expected: the last frame of each frequency bin has been removed.
Let's see how whisper.cpp implements this step?
whisper.cpp/whisper.cpp
Lines 2452 to 2454 in a4bb2df
First, whisper.cpp doesn't remove the last frame like OpenAI's Whisper does, which could lead to some potential issues. Second, whisper.cpp doesn't calculate the magnitude first and then square it; instead, it combines these two steps, making the implementation more efficient than OpenAI's.
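The combined step works because squaring cancels the square root inside the magnitude; a one-line check:

```python
import torch

# |z|^2 == re^2 + im^2, so the square root can be skipped entirely.
z = torch.tensor([3 + 4j])
print(z.abs() ** 2)               # tensor([25.])
print(z.real ** 2 + z.imag ** 2)  # tensor([25.])
```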
Part-6: Mel Filters
Mel filters are a collection of triangular filters that are used in signal processing to mimic the non-linear frequency resolution of the human ear.
whisper/audio.py
OpenAI's Whisper uses a very simple method with mel filters. It loads the precomputed mel filters matrix from the hard drive and then performs matrix multiplication to obtain the computed results.
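A toy sketch of that projection (the sizes are illustrative; the real matrix is `80 × 201`, shipped as a precomputed asset in OpenAI's repo):

```python
import torch

# mel_spec = filters @ magnitudes: project FFT bins onto mel bands.
n_mels, n_freqs, n_frames = 80, 201, 10      # n_freqs = N_FFT // 2 + 1
filters = torch.rand(n_mels, n_freqs)        # stand-in for the precomputed matrix
magnitudes = torch.rand(n_freqs, n_frames)
mel_spec = filters @ magnitudes
print(mel_spec.shape)                        # torch.Size([80, 10])
```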
whisper.cpp/whisper.cpp
Lines 2466 to 2490 in a4bb2df
whisper.cpp takes a similar approach by performing matrix multiplication. However, its method of processing each frame individually as it's computed means there's potential for optimization. Additionally, thanks to the filters and model weights being bundled together in the ggml model file, we can skip reading from the hard drive during the actual computation; everything is loaded upfront with the model weights when the program starts.
Part-7: Dynamic Normalization
whisper/audio.py
OpenAI's Whisper first clamps the computed results to ensure no values are less than `1e-10`. Then, it takes the base-10 logarithm of these results using `log10`. After that, it uses `maximum` to make sure no values are less than the maximum value minus `8`. Finally, it adds `4` to all the values and then divides by `4`.
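In code, the chain looks like this (mirroring the lines in `whisper/audio.py`, with a random stand-in as input):

```python
import torch

# Dynamic normalization from log_mel_spectrogram in whisper/audio.py.
mel_spec = torch.rand(80, 10) * 100                       # stand-in mel spectrum
log_spec = torch.clamp(mel_spec, min=1e-10).log10()       # floor at 1e-10, log10
log_spec = torch.maximum(log_spec, log_spec.max() - 8.0)  # keep within 8 of max
log_spec = (log_spec + 4.0) / 4.0                         # shift and scale
print(log_spec.shape)  # torch.Size([80, 10])
```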
.whisper.cpp/whisper.cpp
Line 2485 in a4bb2df
whisper.cpp/whisper.cpp
Lines 2558 to 2575 in a4bb2df
whisper.cpp uses a virtually identical method, with no major issues.
The End