Why is the audio padded with `N_SAMPLES` instead of `HOP_LENGTH` #2422

MahmoudAshraf97 · 2024-11-06T16:21:55Z

Hello, I noticed that padding the audio with HOP_LENGTH instead of N_SAMPLES should produce identical features for the content frames, since the extra padding does not contribute to their STFT calculation:

from whisper import log_mel_spectrogram
import torch

SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
HOP_LENGTH = 160
N_FRAMES = N_SAMPLES // HOP_LENGTH


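for _ in range(100):
    # Random clip length between 1 and 29 whole seconds
    audio_n_samples = torch.randint(low=1, high=30, size=(1,)).item() * SAMPLING_RATE

    audio = torch.rand(audio_n_samples)

    # Pad with a single hop (160 samples) vs. 30 s of zeros (N_SAMPLES)
    features_pad_hop_length = log_mel_spectrogram(audio, padding=HOP_LENGTH)

    features_pad_30s_zeros = log_mel_spectrogram(audio, padding=N_SAMPLES)

    # Drop the frames that come from the padding, then compare the content frames
    assert torch.allclose(
        features_pad_hop_length[:, :-1],
        features_pad_30s_zeros[:, :-N_FRAMES],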
    )
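
As a rough sketch of why this should hold, the arithmetic below checks how far the STFT window of the last content frame reaches. It assumes N_FFT = 400 and a centered STFT whose last frame is dropped, as I understand whisper/audio.py; the helper names are made up for illustration. It only models window extents and frame counts, not the later log-mel normalization.

```python
SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
HOP_LENGTH = 160
N_FRAMES = N_SAMPLES // HOP_LENGTH
N_FFT = 400  # assumption: window size used by whisper/audio.py


def n_output_frames(audio_len: int, padding: int) -> int:
    """Frames returned by a centered STFT after dropping the last frame (assumed)."""
    stft_frames = 1 + (audio_len + padding) // HOP_LENGTH
    return stft_frames - 1


def last_content_frame_reach(audio_len: int) -> int:
    """Index one past the last sample seen by the last content frame."""
    last_content_frame = audio_len // HOP_LENGTH - 1
    return last_content_frame * HOP_LENGTH + N_FFT // 2


def padding_is_sufficient(audio_len: int, padding: int) -> bool:
    """True if no content frame reaches past the zero padding into the reflected edge."""
    return last_content_frame_reach(audio_len) <= audio_len + padding


for seconds in (1, 15, 29):
    n = seconds * SAMPLING_RATE
    # Both paddings leave the same number of content frames after slicing,
    # matching [:, :-1] and [:, :-N_FRAMES] in the check above.
    assert n_output_frames(n, HOP_LENGTH) - 1 == n_output_frames(n, N_SAMPLES) - N_FRAMES
    # The last content frame reaches N_FFT // 2 - HOP_LENGTH = 40 samples into the padding.
    assert last_content_frame_reach(n) - n == N_FFT // 2 - HOP_LENGTH
    assert padding_is_sufficient(n, HOP_LENGTH)
```

Under these assumptions, any padding of at least 40 samples keeps the content frames away from the reflected edge, so HOP_LENGTH (160 samples) leaves some margin.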

The feature frames that correspond to the extra padding are cropped anyway here:

whisper/whisper/transcribe.py

Line 140 in 271445b

content_frames = mel.shape[-1] - N_FRAMES

whisper/whisper/transcribe.py

Lines 281 to 282 in 271445b

segment_size = min(N_FRAMES, content_frames - seek, seek_clip_end - seek)
mel_segment = mel[:, seek : seek + segment_size]
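
To illustrate the cropping, here is a minimal sketch of the slicing logic from the two snippets above. The helper `segment_bounds` is hypothetical, and it advances `seek` by the full segment size, whereas the real loop in transcribe.py advances it based on decoded timestamps. It suggests that the slices taken from `mel` depend only on the content frames, not on how many padding frames follow them.

```python
N_FRAMES = 3000  # frames in one 30 s window (N_SAMPLES // HOP_LENGTH)


def segment_bounds(total_frames, padding_frames, seek_clip_end=None):
    """Hypothetical helper: list the (start, end) slices taken from mel."""
    content_frames = total_frames - padding_frames
    if seek_clip_end is None:
        seek_clip_end = content_frames
    bounds = []
    seek = 0
    while seek < min(content_frames, seek_clip_end):
        segment_size = min(N_FRAMES, content_frames - seek, seek_clip_end - seek)
        bounds.append((seek, seek + segment_size))
        # Simplification: the real loop moves seek according to decoded timestamps.
        seek += segment_size
    return bounds


content = 4500  # a 45 s clip
with_30s_padding = segment_bounds(content + N_FRAMES, padding_frames=N_FRAMES)
with_hop_padding = segment_bounds(content + 1, padding_frames=1)
assert with_30s_padding == with_hop_padding == [(0, 3000), (3000, 4500)]
```

If the padding were changed, the subtraction on line 140 would presumably need to change with it (one frame instead of N_FRAMES).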

I'd like to hear thoughts on whether this is a valid idea. If it is, I will open a PR. This would make feature extraction faster and use fewer resources, especially for short audio.
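
For a sense of scale, the sketch below counts how many STFT frames would be computed under each padding. It assumes a centered STFT that produces 1 + length // HOP_LENGTH frames; the function name is made up, and frame count is only a rough proxy for actual runtime and memory use.

```python
SAMPLING_RATE = 16000
N_SAMPLES = SAMPLING_RATE * 30
HOP_LENGTH = 160


def stft_frame_count(audio_len: int, padding: int) -> int:
    """Frames a centered STFT computes over the padded audio (assumed formula)."""
    return 1 + (audio_len + padding) // HOP_LENGTH


for seconds in (1, 10, 29):
    n = seconds * SAMPLING_RATE
    full = stft_frame_count(n, N_SAMPLES)
    short = stft_frame_count(n, HOP_LENGTH)
    print(f"{seconds:>2} s clip: {full} frames with N_SAMPLES padding, {short} with HOP_LENGTH")
```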


Replies: 0 comments
