Silero VAD support #888

Open
wants to merge 1 commit into
base: main


@3manifold 3manifold commented Sep 26, 2024

Description

Implementation includes:

  • Extension of WhisperX to accept multiple VAD alternatives, which need not come from the pyannote-audio toolkit.
  • Silero VAD as an alternative VAD option.
  • Fix for the imports in whisperx/__init__.py.

The implementation aims to respect the current structure and keep the existing functionality intact. Note that a manually assigned vad_model still works as expected (see load_model for details).
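To make the precedence between a manually assigned vad_model and vad_method concrete, here is a minimal, self-contained sketch. It is not the code from this pull request: select_vad, VAD_LOADERS and the two placeholder loaders are hypothetical names introduced only for illustration, and the "pyannote" default is an assumption based on keeping the existing behaviour intact.

```python
from typing import Any, Callable, Dict, Optional


# Hypothetical placeholder loaders; the real ones would build the
# pyannote or Silero VAD objects.
def _load_pyannote() -> str:
    return "pyannote-vad"


def _load_silero() -> str:
    return "silero-vad"


# Hypothetical registry mapping a vad_method name to its loader.
VAD_LOADERS: Dict[str, Callable[[], Any]] = {
    "pyannote": _load_pyannote,
    "silero": _load_silero,
}


def select_vad(vad_model: Optional[Any] = None, vad_method: str = "pyannote") -> Any:
    """Pick the VAD to use; a manually assigned vad_model wins over vad_method."""
    if vad_model is not None:
        # The caller supplied a ready-made VAD object, so vad_method is ignored.
        return vad_model
    try:
        loader = VAD_LOADERS[vad_method]
    except KeyError:
        raise ValueError(f"Unknown vad_method: {vad_method!r}") from None
    return loader()
```

With a registry like this, supporting another VAD backend would only mean adding one more entry, which is roughly the kind of extension point the description refers to.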

See the relevant issue for further details. Resolves #889.

Tests

  • The pyannote and silero cases were both tested on CPU and GPU setups without issues (the current Silero VAD implementation runs on CPU only).
  • Also tested with a manually assigned vad_model (a manually assigned vad_model takes priority over vad_method; see the load_model function for details).
  • Tests were conducted using .wav files of various lengths (30 s, 15 min, 1 h).

Example command line (applies also for --vad_method pyannote):

  • GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero
  • CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

Example Python script usage:

import whisperx
import gc

device = "cpu"
audio_file = "audio.wav"
batch_size = 16 # reduce if low on GPU mem
compute_type = "int8" # "int8" suits CPU; on GPU use "float16", or keep "int8" if low on GPU mem (may reduce accuracy)

# 1. Transcribe with original whisper (batched)
model = whisperx.load_model("small", device, vad_method="silero", compute_type=compute_type)

# save model to local path (optional)
# model_dir = "/path/"
# model = whisperx.load_model("large-v2", device, compute_type=compute_type, download_root=model_dir)

audio = whisperx.load_audio(audio_file)
result = model.transcribe(audio, batch_size=batch_size)
print(result["segments"]) # before alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(language_code=result["language"], device=device)
result = whisperx.align(result["segments"], model_a, metadata, audio, device, return_char_alignments=False)

print(result["segments"]) # after alignment

# delete model if low on GPU resources
# import gc; gc.collect(); torch.cuda.empty_cache(); del model_a

# 3. Assign speaker labels
diarize_model = whisperx.DiarizationPipeline(use_auth_token="xxx", device=device)

# add min/max number of speakers if known
diarize_segments = diarize_model(audio)
# diarize_model(audio, min_speakers=min_speakers, max_speakers=max_speakers)

result = whisperx.assign_word_speakers(diarize_segments, result)
print(diarize_segments)
print(result["segments"]) # segments are now assigned speaker IDs

output:

python3 whisperx/example.py 
torchvision is not available - cannot save figures
No language specified, language will be first be detected for each audio file (increases inference time).
>>Performing voice activity detection using Silero...
Using cache found in /home/xxx/.cache/torch/hub/snakers4_silero-vad_master
Detected language: en (0.99) in first 30s of audio...
[{'text': ' Birch canoes slid on the smooth planks. Glued the sheet to the dark blue background. It is easy to tell the depth of a well. These days a chicken leg is a rare dish. Rice is often served in round bowls. The juice of lemons makes fine punch. The box was thrown beside the parked truck. The hogs were fed chopped corn and garbage. Four hours of study work faced us.', 'start': 0.674, 'end': 28.83}, {'text': ' A large size in stockings is hard to sell.', 'start': 30.05, 'end': 32.254}]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861}]}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93}]}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 0.88}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933}]}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 
'start': 10.9, 'end': 11.12, 'score': 0.856}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878}]}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848}, {'word': 'in', 'start': 15.422, 'end': 15.502, 'score': 0.85}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647}]}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888}]}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907}, {'word': 'thrown', 'start': 21.106, 
'end': 21.346, 'score': 0.621}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859}]}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 'start': 23.767, 'end': 23.867, 'score': 0.997}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902}]}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99}]}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 'score': 0.923}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816}, {'word': 'hard', 'start': 
31.526, 'end': 31.708, 'score': 0.834}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954}]}]
                             segment label  ... intersection      union
0  [ 00:00:00.486 -->  00:00:03.000]     A  ...   -28.889031  31.605406
1  [ 00:00:04.266 -->  00:00:06.392]     B  ...   -25.497156  27.825406
2  [ 00:00:07.776 -->  00:00:09.683]     C  ...   -22.206531  24.315406
3  [ 00:00:10.847 -->  00:00:12.923]     D  ...   -18.966531  21.244156
4  [ 00:00:14.205 -->  00:00:16.163]     E  ...   -15.726531  17.886031
5  [ 00:00:17.294 -->  00:00:19.319]     F  ...   -12.570906  14.797906
6  [ 00:00:20.399 -->  00:00:22.390]     G  ...    -9.499656  11.692906
7  [ 00:00:23.723 -->  00:00:25.849]     H  ...    -6.040281   8.368531
8  [ 00:00:27.064 -->  00:00:28.769]     I  ...    -3.120906   5.027281
9  [ 00:00:30.017 -->  00:00:32.194]     J  ...     0.202000   2.176875

[10 rows x 7 columns]
[{'start': 0.694, 'end': 2.995, 'text': ' Birch canoes slid on the smooth planks.', 'words': [{'word': 'Birch', 'start': 0.694, 'end': 1.034, 'score': 0.854, 'speaker': 'SPEAKER_00'}, {'word': 'canoes', 'start': 1.114, 'end': 1.555, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'slid', 'start': 1.595, 'end': 1.915, 'score': 0.881, 'speaker': 'SPEAKER_00'}, {'word': 'on', 'start': 2.015, 'end': 2.095, 'score': 0.909, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 2.115, 'end': 2.195, 'score': 0.789, 'speaker': 'SPEAKER_00'}, {'word': 'smooth', 'start': 2.255, 'end': 2.615, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'planks.', 'start': 2.695, 'end': 2.995, 'score': 0.861, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 4.296, 'end': 6.357, 'text': 'Glued the sheet to the dark blue background.', 'words': [{'word': 'Glued', 'start': 4.296, 'end': 4.616, 'score': 0.474, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 4.676, 'end': 4.756, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'sheet', 'start': 4.796, 'end': 5.016, 'score': 0.933, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 5.056, 'end': 5.157, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 5.177, 'end': 5.237, 'score': 0.952, 'speaker': 'SPEAKER_00'}, {'word': 'dark', 'start': 5.277, 'end': 5.517, 'score': 0.99, 'speaker': 'SPEAKER_00'}, {'word': 'blue', 'start': 5.577, 'end': 5.777, 'score': 0.844, 'speaker': 'SPEAKER_00'}, {'word': 'background.', 'start': 5.837, 'end': 6.357, 'score': 0.93, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 7.838, 'end': 9.659, 'text': 'It is easy to tell the depth of a well.', 'words': [{'word': 'It', 'start': 7.838, 'end': 7.918, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 7.978, 'end': 8.058, 'score': 0.724, 'speaker': 'SPEAKER_00'}, {'word': 'easy', 'start': 8.118, 'end': 8.318, 'score': 0.958, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 8.358, 'end': 8.438, 'score': 
0.88, 'speaker': 'SPEAKER_00'}, {'word': 'tell', 'start': 8.498, 'end': 8.699, 'score': 0.712, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 8.739, 'end': 8.819, 'score': 0.828, 'speaker': 'SPEAKER_00'}, {'word': 'depth', 'start': 8.859, 'end': 9.119, 'score': 0.859, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 9.179, 'end': 9.279, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 9.319, 'end': 9.339, 'score': 0.767, 'speaker': 'SPEAKER_00'}, {'word': 'well.', 'start': 9.399, 'end': 9.659, 'score': 0.933, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 10.9, 'end': 12.841, 'text': 'These days a chicken leg is a rare dish.', 'words': [{'word': 'These', 'start': 10.9, 'end': 11.12, 'score': 0.856, 'speaker': 'SPEAKER_00'}, {'word': 'days', 'start': 11.16, 'end': 11.36, 'score': 0.87, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 11.4, 'end': 11.44, 'score': 0.515, 'speaker': 'SPEAKER_00'}, {'word': 'chicken', 'start': 11.48, 'end': 11.78, 'score': 0.932, 'speaker': 'SPEAKER_00'}, {'word': 'leg', 'start': 11.82, 'end': 12.0, 'score': 0.993, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 12.04, 'end': 12.121, 'score': 0.76, 'speaker': 'SPEAKER_00'}, {'word': 'a', 'start': 12.181, 'end': 12.221, 'score': 0.499, 'speaker': 'SPEAKER_00'}, {'word': 'rare', 'start': 12.281, 'end': 12.501, 'score': 0.776, 'speaker': 'SPEAKER_00'}, {'word': 'dish.', 'start': 12.581, 'end': 12.841, 'score': 0.878, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 14.282, 'end': 16.123, 'text': 'Rice is often served in round bowls.', 'words': [{'word': 'Rice', 'start': 14.282, 'end': 14.522, 'score': 0.867, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 14.582, 'end': 14.662, 'score': 0.638, 'speaker': 'SPEAKER_00'}, {'word': 'often', 'start': 14.722, 'end': 15.022, 'score': 0.922, 'speaker': 'SPEAKER_00'}, {'word': 'served', 'start': 15.082, 'end': 15.362, 'score': 0.848, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 15.422, 
'end': 15.502, 'score': 0.85, 'speaker': 'SPEAKER_00'}, {'word': 'round', 'start': 15.562, 'end': 15.783, 'score': 0.912, 'speaker': 'SPEAKER_00'}, {'word': 'bowls.', 'start': 15.823, 'end': 16.123, 'score': 0.647, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 17.343, 'end': 19.265, 'text': 'The juice of lemons makes fine punch.', 'words': [{'word': 'The', 'start': 17.343, 'end': 17.464, 'score': 0.796, 'speaker': 'SPEAKER_00'}, {'word': 'juice', 'start': 17.504, 'end': 17.764, 'score': 0.976, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 17.804, 'end': 17.884, 'score': 0.83, 'speaker': 'SPEAKER_00'}, {'word': 'lemons', 'start': 17.944, 'end': 18.264, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'makes', 'start': 18.344, 'end': 18.564, 'score': 0.866, 'speaker': 'SPEAKER_00'}, {'word': 'fine', 'start': 18.644, 'end': 18.904, 'score': 0.914, 'speaker': 'SPEAKER_00'}, {'word': 'punch.', 'start': 18.964, 'end': 19.265, 'score': 0.888, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 20.445, 'end': 22.406, 'text': 'The box was thrown beside the parked truck.', 'words': [{'word': 'The', 'start': 20.445, 'end': 20.565, 'score': 0.89, 'speaker': 'SPEAKER_00'}, {'word': 'box', 'start': 20.605, 'end': 20.885, 'score': 0.956, 'speaker': 'SPEAKER_00'}, {'word': 'was', 'start': 20.926, 'end': 21.046, 'score': 0.907, 'speaker': 'SPEAKER_00'}, {'word': 'thrown', 'start': 21.106, 'end': 21.346, 'score': 0.621, 'speaker': 'SPEAKER_00'}, {'word': 'beside', 'start': 21.386, 'end': 21.706, 'score': 0.901, 'speaker': 'SPEAKER_00'}, {'word': 'the', 'start': 21.746, 'end': 21.806, 'score': 0.977, 'speaker': 'SPEAKER_00'}, {'word': 'parked', 'start': 21.866, 'end': 22.086, 'score': 0.65, 'speaker': 'SPEAKER_00'}, {'word': 'truck.', 'start': 22.126, 'end': 22.406, 'score': 0.859, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 23.767, 'end': 25.748, 'text': 'The hogs were fed chopped corn and garbage.', 'words': [{'word': 'The', 
'start': 23.767, 'end': 23.867, 'score': 0.997, 'speaker': 'SPEAKER_00'}, {'word': 'hogs', 'start': 23.907, 'end': 24.147, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'were', 'start': 24.167, 'end': 24.287, 'score': 0.874, 'speaker': 'SPEAKER_00'}, {'word': 'fed', 'start': 24.347, 'end': 24.588, 'score': 0.763, 'speaker': 'SPEAKER_00'}, {'word': 'chopped', 'start': 24.628, 'end': 24.928, 'score': 0.671, 'speaker': 'SPEAKER_00'}, {'word': 'corn', 'start': 24.968, 'end': 25.208, 'score': 0.843, 'speaker': 'SPEAKER_00'}, {'word': 'and', 'start': 25.248, 'end': 25.328, 'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'garbage.', 'start': 25.348, 'end': 25.748, 'score': 0.902, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 27.129, 'end': 28.73, 'text': 'Four hours of study work faced us.', 'words': [{'word': 'Four', 'start': 27.129, 'end': 27.329, 'score': 0.819, 'speaker': 'SPEAKER_00'}, {'word': 'hours', 'start': 27.369, 'end': 27.629, 'score': 0.805, 'speaker': 'SPEAKER_00'}, {'word': 'of', 'start': 27.669, 'end': 27.709, 'score': 0.735, 'speaker': 'SPEAKER_00'}, {'word': 'study', 'start': 27.749, 'end': 28.01, 'score': 0.873, 'speaker': 'SPEAKER_00'}, {'word': 'work', 'start': 28.05, 'end': 28.25, 'score': 0.885, 'speaker': 'SPEAKER_00'}, {'word': 'faced', 'start': 28.29, 'end': 28.57, 'score': 0.97, 'speaker': 'SPEAKER_00'}, {'word': 'us.', 'start': 28.67, 'end': 28.73, 'score': 0.99, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}, {'start': 30.111, 'end': 32.092, 'text': ' A large size in stockings is hard to sell.', 'words': [{'word': 'A', 'start': 30.111, 'end': 30.171, 'score': 0.927, 'speaker': 'SPEAKER_00'}, {'word': 'large', 'start': 30.212, 'end': 30.454, 'score': 0.968, 'speaker': 'SPEAKER_00'}, {'word': 'size', 'start': 30.515, 'end': 30.758, 'score': 0.982, 'speaker': 'SPEAKER_00'}, {'word': 'in', 'start': 30.798, 'end': 30.879, 'score': 0.691, 'speaker': 'SPEAKER_00'}, {'word': 'stockings', 'start': 30.919, 'end': 31.344, 
'score': 0.923, 'speaker': 'SPEAKER_00'}, {'word': 'is', 'start': 31.405, 'end': 31.486, 'score': 0.816, 'speaker': 'SPEAKER_00'}, {'word': 'hard', 'start': 31.526, 'end': 31.708, 'score': 0.834, 'speaker': 'SPEAKER_00'}, {'word': 'to', 'start': 31.748, 'end': 31.85, 'score': 0.938, 'speaker': 'SPEAKER_00'}, {'word': 'sell.', 'start': 31.89, 'end': 32.092, 'score': 0.954, 'speaker': 'SPEAKER_00'}], 'speaker': 'SPEAKER_00'}]

Process finished with exit code 0

Future work

  • Silero ONNX model usage (see the silero-vad repo and faster-whisper for inspiration) to enable GPU usage and reap any resulting benefits.
  • Expose additional VAD settings to the user. Some of these settings have a common meaning across the VAD methods, e.g.:
    • min_silence_duration_ms (silero) and min_duration_off (pyannote)
    • min_speech_duration_ms (silero) and min_duration_on (pyannote)
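
One possible shape for such shared settings is sketched below. This is only an illustration of the idea, not part of the pull request: CommonVadOptions and its methods are hypothetical names, the default values are placeholders, and the unit handling assumes that the silero parameters are in milliseconds (as their names suggest) while the pyannote ones are in seconds.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class CommonVadOptions:
    """Hypothetical method-agnostic VAD settings, expressed in seconds."""

    min_speech_duration: float = 0.25   # illustrative default
    min_silence_duration: float = 0.1   # illustrative default

    def to_silero(self) -> Dict[str, int]:
        # Silero-style keyword names; values converted to milliseconds.
        return {
            "min_speech_duration_ms": round(self.min_speech_duration * 1000),
            "min_silence_duration_ms": round(self.min_silence_duration * 1000),
        }

    def to_pyannote(self) -> Dict[str, float]:
        # pyannote-style keyword names; values assumed to be in seconds.
        return {
            "min_duration_on": self.min_speech_duration,
            "min_duration_off": self.min_silence_duration,
        }


opts = CommonVadOptions(min_speech_duration=0.5, min_silence_duration=0.2)
silero_kwargs = opts.to_silero()
pyannote_kwargs = opts.to_pyannote()
```

Keeping a single unit in the shared object and converting at the edges would let the command line expose one flag per setting regardless of the selected vad_method.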

@3manifold 3manifold marked this pull request as ready for review September 26, 2024 08:35
onset: float = 0.5,
offset: Optional[float] = None,
):
assert chunk_size > 0

Keep binarization separate from the parent class function merge_chunks (i.e. Vad.merge_chunks). Other VAD methods (e.g. silero) may binarize at an earlier stage, so keeping the two steps apart makes Vad.merge_chunks easier to reuse. In the case of silero specifically, binarization happens during model invocation.
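
As a rough illustration of why the separation helps, the sketch below merges speech segments that are already binarized, so it does not care which VAD produced them. It is a simplified stand-in and not the actual Vad.merge_chunks from this pull request; the Segment type, the output dictionary layout and the exact merging rule are assumptions.

```python
from typing import Dict, List, Tuple

# (start, end) of a speech region in seconds, already binarized by some VAD.
Segment = Tuple[float, float]


def merge_chunks(segments: List[Segment], chunk_size: float) -> List[Dict[str, object]]:
    """Group consecutive speech segments into chunks spanning at most chunk_size seconds."""
    assert chunk_size > 0
    merged: List[Dict[str, object]] = []
    if not segments:
        return merged

    chunk_start = segments[0][0]
    chunk_end = segments[0][1]
    members: List[Segment] = []

    for start, end in segments:
        # Close the current chunk if adding this segment would exceed chunk_size.
        if end - chunk_start > chunk_size and members:
            merged.append({"start": chunk_start, "end": chunk_end, "segments": members})
            chunk_start = start
            members = []
        chunk_end = end
        members.append((start, end))

    # Flush the last open chunk.
    merged.append({"start": chunk_start, "end": chunk_end, "segments": members})
    return merged


chunks = merge_chunks([(0.0, 10.0), (11.0, 20.0), (21.0, 35.0), (36.0, 40.0)], chunk_size=30.0)
```

Because the function only sees (start, end) pairs, a pyannote subclass could binarize its scores first and then call it, while a silero subclass could pass the timestamps returned by the model directly.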

@sulutian

How do I use Silero VAD with WhisperX!!

@3manifold
Author

3manifold commented Sep 30, 2024

How do I use Silero VAD with WhisperX!!

From the pull request description:

Example command line (applies also for --vad_method pyannote):

  • GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero
  • CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

@sulutian

sulutian commented Oct 1, 2024

How do I use Silero VAD with WhisperX!

From the pull request description:

Example command line (applies also for --vad_method pyannote):

  • GPU: python3 -m whisperx.transcribe audio.wav --language en --device cuda --diarize --hf_token xxx --vad_method silero
  • CPU: python3 -m whisperx.transcribe audio.wav --language en --device cpu --diarize --hf_token xxx --compute_type int8 --vad_method silero

An error occurred: whisperx: error: unrecognized arguments: --vad_method silero

@3manifold
Author

You have to check out the silero-vad branch.

@sulutian

sulutian commented Oct 1, 2024

I have:
* main
  remotes/origin/HEAD -> origin/main
  remotes/origin/main
  remotes/origin/silero-vad

@3manifold
Author

3manifold commented Oct 1, 2024

You can run git checkout -t origin/silero-vad to check out the remote branch.

@sulutian

sulutian commented Oct 1, 2024

This is what showed up:
whisperX-silero-vad>git checkout -t origin/silero-vad
fatal: a branch named 'silero-vad' already exists

Development

Successfully merging this pull request may close these issues.

[Feature] Silero VAD support
2 participants