Showing 11 changed files with 427 additions and 4 deletions.
src/senselab/audio/tasks/speaker_diarization/__init__.py

```diff
@@ -1 +1,3 @@
-"""This module contains the speaker diarization API for senselab."""
+""".. include:: ./doc.md""" # noqa: D415
+
+from .api import diarize_audios # noqa: F401
```
src/senselab/audio/tasks/speaker_diarization/doc.md (new file, 30 additions)
# Speaker diarization

[![Tutorial](https://img.shields.io/badge/Tutorial-Click%20Here-blue?style=for-the-badge)](https://github.com/sensein/senselab/blob/main/tutorials/speaker_diarization.ipynb)

## Task Overview

Speaker diarization is the process of segmenting audio recordings by speaker labels, aiming to answer the question: **"Who spoke when?"**

## Models

In `senselab`, we integrate [pyannote.audio](https://github.com/pyannote/pyannote-audio) models for speaker diarization. These models can be explored on the [Hugging Face Hub](https://huggingface.co/pyannote). We may integrate additional approaches for speaker diarization into the package in the future.
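
Below is a minimal usage sketch based on the tutorial notebook included in this commit; the audio file path is a placeholder, and the device choice is up to you.

```python
from senselab.audio.data_structures.audio import Audio
from senselab.audio.tasks.speaker_diarization import diarize_audios
from senselab.utils.data_structures.device import DeviceType
from senselab.utils.data_structures.model import PyannoteAudioModel

# Load an audio file and run the pyannote-based diarization pipeline on CPU.
audio = Audio.from_filepath("path/to/audio.wav")  # placeholder path
model = PyannoteAudioModel(path_or_uri="pyannote/speaker-diarization-3.1")
results = diarize_audios(audios=[audio], model=model, device=DeviceType.CPU)
print(results[0])  # segments labeled by speaker
```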

## Evaluation

### Metrics

The **Diarization Error Rate (DER)** is the standard metric for evaluating and comparing speaker diarization systems. It is defined as:

```text
DER = (false alarm + missed detection + confusion) / total
```

where
- `false alarm` is the duration of non-speech incorrectly classified as speech,
- `missed detection` is the duration of speech incorrectly classified as non-speech,
- `confusion` is the duration of speaker confusion, and
- `total` is the sum over all speakers of their reference speech duration.

**Note:** DER takes overlapping speech into account. This can lead to increased missed detection rates if the speaker diarization system does not include an overlapping speech detection module.
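
For a concrete feel, DER can be computed with [pyannote.metrics](https://pyannote.github.io/pyannote-metrics/). A minimal sketch with toy annotations; the segment boundaries and labels are invented for illustration.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.diarization import DiarizationErrorRate

# Toy ground truth: two speakers across nine seconds of speech.
reference = Annotation()
reference[Segment(0.0, 5.0)] = "spk1"
reference[Segment(5.0, 9.0)] = "spk2"

# Toy system output: the speaker change is detected 0.5 s early.
hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "A"
hypothesis[Segment(4.5, 9.0)] = "B"

metric = DiarizationErrorRate()
print(f"DER = {metric(reference, hypothesis):.3f}")  # 0.5 s confusion / 9 s total ≈ 0.056
```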

### Benchmark

You can find a benchmark of the latest pyannote.audio model's performance on various time-stamped speech datasets [here](https://github.com/pyannote/pyannote-audio?tab=readme-ov-file#benchmark).
src/senselab/audio/tasks/voice_activity_detection/__init__.py (3 additions & 0 deletions)

```diff
@@ -0,0 +1,3 @@
+""".. include:: ./doc.md""" # noqa: D415
+
+from .api import detect_human_voice_activity_in_audios # noqa: F401
```
src/senselab/audio/tasks/voice_activity_detection/doc.md (new file, 46 additions)
# Voice Activity Detection (VAD)

[![Tutorial](https://img.shields.io/badge/Tutorial-Click%20Here-blue?style=for-the-badge)](https://github.com/sensein/senselab/blob/main/tutorials/voice_activity_detection.ipynb)

## Task Overview

Voice Activity Detection (VAD) is a binary classification task that identifies the presence of human voice in audio. The primary challenge in VAD lies in differentiating between noise and human voice, particularly in environments with significant background noise (e.g., fans, car engines). While VAD performs well in quiet environments, where distinguishing between silence and speech is straightforward, the task becomes more difficult when background noise or non-standard speech patterns are present.

## Models

In `senselab`, we integrate [pyannote.audio](https://github.com/pyannote/pyannote-audio) models for VAD. These models can be explored on the [Hugging Face Hub](https://huggingface.co/pyannote). Additional approaches for VAD may be integrated into the package in the future.
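
A minimal usage sketch of the function this module exports; the signature is assumed to parallel `diarize_audios`, and the model choice and file path are placeholders.

```python
from senselab.audio.data_structures.audio import Audio
from senselab.audio.tasks.voice_activity_detection import detect_human_voice_activity_in_audios
from senselab.utils.data_structures.device import DeviceType
from senselab.utils.data_structures.model import PyannoteAudioModel

audio = Audio.from_filepath("path/to/audio.wav")  # placeholder path
# Assumed model choice: a pyannote pipeline whose speech regions can serve as VAD output.
model = PyannoteAudioModel(path_or_uri="pyannote/speaker-diarization-3.1")
results = detect_human_voice_activity_in_audios(audios=[audio], model=model, device=DeviceType.CPU)
print(results[0])  # speech vs. non-speech segments
```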

## Evaluation

### Metrics

The primary metrics used to evaluate VAD modules are the Detection Error Rate (DER) and the Detection Cost Function (DCF); a computation sketch follows the list.

- **Detection Error Rate (DER):**

  ```text
  DER = (false alarm + missed detection) / total
  ```

  - **False alarm:** Duration of non-speech incorrectly classified as speech.
  - **Missed detection:** Duration of speech incorrectly classified as non-speech.
  - **Total:** Total duration of speech in the reference.

- **Detection Cost Function (DCF):**

  ```text
  DCF = 0.25 * false alarm rate + 0.75 * miss rate
  ```

  - **False alarm rate:** Proportion of non-speech incorrectly classified as speech.
  - **Miss rate:** Proportion of speech incorrectly classified as non-speech.
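
Both metrics can be computed with pyannote.metrics. Below is a sketch with toy annotations; segment times are invented, and `DetectionCostFunction` (with the 0.25/0.75 weights above) is assumed to be available in your pyannote.metrics version.

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionCostFunction, DetectionErrorRate

# Toy reference speech regions (7 s of speech in total).
reference = Annotation()
reference[Segment(0.0, 4.0)] = "speech"
reference[Segment(6.0, 9.0)] = "speech"

# Toy system output: one trailing false alarm and one late start.
hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "speech"  # 0.5 s false alarm
hypothesis[Segment(6.5, 9.0)] = "speech"  # 0.5 s missed detection

der = DetectionErrorRate()(reference, hypothesis)
dcf = DetectionCostFunction(fa_weight=0.25, miss_weight=0.75)(reference, hypothesis)
print(f"DER = {der:.3f}, DCF = {dcf:.3f}")
```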

### Additional Metrics

VAD systems may also be evaluated using the following metrics:

- **Accuracy:** Proportion of the input signal correctly classified.
- **Precision:** Proportion of detected speech that is actually speech.
- **Recall:** Proportion of speech that is correctly detected.

For more detailed information on these metrics, refer to the [pyannote.metrics documentation](https://pyannote.github.io/pyannote-metrics/reference.html).
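
These also have ready-made implementations in pyannote.metrics; a short sketch with toy annotations, assuming the detection metric classes listed in that reference:

```python
from pyannote.core import Annotation, Segment
from pyannote.metrics.detection import DetectionAccuracy, DetectionPrecision, DetectionRecall

reference = Annotation()
reference[Segment(0.0, 4.0)] = "speech"

hypothesis = Annotation()
hypothesis[Segment(0.0, 4.5)] = "speech"

# Each metric is a callable mapping (reference, hypothesis) to a rate in [0, 1].
for metric in (DetectionAccuracy(), DetectionPrecision(), DetectionRecall()):
    print(type(metric).__name__, metric(reference, hypothesis))
```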
tutorials/speaker_diarization.ipynb (new file, 115 additions)

```json
{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Speaker diarization\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sensein/senselab/blob/main/tutorials/speaker_diarization.ipynb)\n",
    "\n",
    "This tutorial demonstrates how to use the `diarize_audios` function to perform speaker diarization on audio files, that is, to segment an audio recording by speaker."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Import the necessary modules\n",
    "from senselab.audio.data_structures.audio import Audio\n",
    "from senselab.audio.tasks.speaker_diarization import diarize_audios\n",
    "from senselab.utils.data_structures.model import PyannoteAudioModel\n",
    "from senselab.utils.data_structures.device import DeviceType\n",
    "from senselab.audio.tasks.plotting.plotting import play_audio\n",
    "from senselab.audio.tasks.preprocessing.preprocessing import resample_audios\n",
    "from senselab.utils.tasks.plotting import plot_segment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Initialize a PyannoteAudioModel for speaker diarization, providing the model's path or URI.\n",
    "model = PyannoteAudioModel(path_or_uri=\"pyannote/speaker-diarization-3.1\")\n",
    "\n",
    "# Specify the device type to be used for processing (CPU in this case).\n",
    "device = DeviceType.CPU"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Load an audio file from the specified file path into an Audio object.\n",
    "audio = Audio.from_filepath(\"../src/tests/data_for_testing/audio_48khz_mono_16bits.wav\")\n",
    "\n",
    "# Resample the audio to 16 kHz, as this is the expected input format for the model.\n",
    "# The resample_audios function returns a list, so we take the first (and only) element.\n",
    "audio = resample_audios([audio], 16000)[0]\n",
    "\n",
    "# Play the resampled audio to verify the preprocessing step was successful.\n",
    "play_audio(audio)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Perform speaker diarization on the audio using the specified model and device.\n",
    "# The function returns a list of results, one element per input audio.\n",
    "results = diarize_audios(audios=[audio], model=model, device=device)\n",
    "\n",
    "# Print the speaker diarization results to the console.\n",
    "print(results)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the detected speaker segments for visualization.\n",
    "plot_segment(results[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Ehm wait**. In the audio, we can hear four speakers, but the speaker diarization results indicate only two. Why is this happening?\n",
    "\n",
    "Unfortunately, the model is not perfect and can make mistakes. We can try adjusting the parameters by setting `num_speakers=4`, `min_speakers=4`, and `max_speakers=4` to force the model to recognize four speakers, as sketched in the next cell. However, this approach doesn't always work as expected."
   ]
  }
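,
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Sketch of the idea from the cell above: constrain the speaker count.\n",
    "# Assumption: diarize_audios forwards num_speakers (and min_/max_speakers)\n",
    "# to the underlying pyannote pipeline; adjust if your senselab version differs.\n",
    "results_forced = diarize_audios(audios=[audio], model=model, device=device, num_speakers=4)\n",
    "plot_segment(results_forced[0])"
   ]
  }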
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "senselab-lOUhtavG-py3.10",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
```