
feat: get ready for pyannote.audio 3.0.0 #1470

Merged · 9 commits · Sep 26, 2023
43 changes: 18 additions & 25 deletions CHANGELOG.md
@@ -1,24 +1,30 @@
# Changelog

## Version 3.0 (xxxx-xx-xx)
## Version 3.0.0 (2023-09-26)

### Highlights
### Features and improvements

- *"Harder"*. Fixed [major reproducibility issue](https://github.com/pyannote/pyannote-audio/issues/1370) with Ampere (A100) NVIDIA GPUs
In case you tried `pyannote.audio` pretrained pipelines in the past on Ampera (A100) NVIDIA GPUs
and were disappointed by the accuracy, please give it another try with this new version.
- "Better".
- "Faster".
- "Stronger".
- feat(pipeline): send pipeline to device with `pipeline.to(device)` (see the first sketch after this list)
- feat(pipeline): add `return_embeddings` option to `SpeakerDiarization` pipeline
- feat(pipeline): make `segmentation_batch_size` and `embedding_batch_size` mutable in `SpeakerDiarization` pipeline (they now default to `1`)
- feat(pipeline): add progress hook to pipelines
- feat(task): add [powerset](https://www.isca-speech.org/archive/interspeech_2023/plaquet23_interspeech.html) support to `SpeakerDiarization` task (see the toy illustration after this list)
- feat(task): add support for multi-task models
- feat(task): add support for label scope in speaker diarization task
- feat(task): add support for missing classes in multi-label segmentation task
- feat(model): add segmentation model based on torchaudio self-supervised representation
- feat(pipeline): check version compatibility at load time
- improve(task): load metadata as tensors rather than pyannote.core instances
- improve(task): improve error message on missing specifications
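
Taken together, these pipeline-level features compose in a few lines. A minimal sketch (assuming a local `audio.wav`, a valid Hugging Face access token, and that `ProgressHook` lives at the path below in the 3.0 API):

```python
import torch
from pyannote.audio import Pipeline
from pyannote.audio.pipelines.utils.hook import ProgressHook

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# pipelines now default to CPU: move to GPU explicitly
pipeline.to(torch.device("cuda"))

# batch sizes are now mutable attributes (both default to 1)
pipeline.segmentation_batch_size = 32
pipeline.embedding_batch_size = 32

# run with a progress hook, returning one embedding per speaker
with ProgressHook() as hook:
    diarization, embeddings = pipeline(
        "audio.wav", hook=hook, return_embeddings=True)
```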
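
The powerset formulation referenced above replaces per-speaker multi-label decisions with a single multi-class decision over subsets of speakers. A toy illustration in plain PyTorch (conceptual only, not the library's internal implementation):

```python
import itertools
import torch

# powerset classes for 3 speakers with at most 2 active at once:
# {}, {0}, {1}, {2}, {0,1}, {0,2}, {1,2}
speakers, max_overlap = 3, 2
classes = [frozenset(subset)
           for size in range(max_overlap + 1)
           for subset in itertools.combinations(range(speakers), size)]

# a frame predicted as class {0, 2} maps back to multilabel [1, 0, 1]
predicted = classes.index(frozenset({0, 2}))
multilabel = torch.tensor(
    [float(speaker in classes[predicted]) for speaker in range(speakers)])
print(multilabel)  # tensor([1., 0., 1.])
```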

### Breaking changes

- BREAKING(task): rename `Segmentation` task to `SpeakerDiarization`
- BREAKING(task): remove support for variable chunk duration for segmentation tasks
- BREAKING(pipeline): pipeline defaults to CPU (use `pipeline.to(device)`)
- BREAKING(pipeline): remove `SpeakerSegmentation` pipeline (use `SpeakerDiarization` pipeline)
- BREAKING(pipeline): remove support for `FINCHClustering` and `HiddenMarkovModelClustering`
- BREAKING(pipeline): remove `segmentation_duration` parameter from `SpeakerDiarization` pipeline (defaults to `duration` of segmentation model)
- BREAKING(setup): drop support for Python 3.7
- BREAKING(io): channels are now 0-indexed (used to be 1-indexed; see the migration sketch after this list)
- BREAKING(io): multi-channel audio is no longer downmixed to mono by default.
@@ -29,21 +35,8 @@
- BREAKING(model): get rid of (flaky) `Model.introspection`
If, for some weird reason, you wrote some custom code based on that,
you should instead rely on `Model.example_output`.
- BREAKING(interactive): remove support for Prodigy recipes
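
For code migrating from 2.x, the I/O and model breaking changes above translate roughly as follows. A hedged sketch (file names are illustrative, and the `Audio` options and `channel` key reflect one reading of the 3.0 I/O API):

```python
from pyannote.audio import Audio, Model

# multi-channel audio is no longer downmixed by default;
# request a mono downmix explicitly when that is what you want
io = Audio(mono="downmix")
waveform, sample_rate = io({"audio": "stereo.wav"})

# channels are now 0-indexed: the first channel is channel=0
io = Audio()  # keep channels as they are
first, sample_rate = io({"audio": "stereo.wav", "channel": 0})

# Model.introspection is gone: rely on Model.example_output instead
model = Model.from_pretrained(
    "pyannote/segmentation-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")
print(model.example_output)
```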


### Fixes and improvements

@@ -54,7 +47,7 @@
- fix(task): fix support for "balance" option
- improve(task): shorten and improve structure of Tensorboard tags

### Dependencies
### Dependencies update

- setup: switch to torch 2.0+, torchaudio 2.0+, soundfile 0.12+, lightning 2.0+, torchmetrics 0.11+
- setup: switch to pyannote.core 5.0+, pyannote.database 5.0+, and pyannote.pipeline 3.0+
99 changes: 47 additions & 52 deletions README.md
@@ -1,30 +1,37 @@
> [!IMPORTANT]
> I propose (paid) scientific [consulting services](https://herve.niderb.fr/consulting.html) to companies willing to make the most of their data and open-source speech processing toolkits (and `pyannote` in particular).
Using the `pyannote.audio` open-source toolkit in production?
Make the most of it thanks to our [consulting services](https://herve.niderb.fr/consulting.html).

# Speaker diarization with `pyannote.audio`
# `pyannote.audio` speaker diarization toolkit

`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it provides a set of trainable end-to-end neural building blocks that can be combined and jointly optimized to build speaker diarization pipelines.
`pyannote.audio` is an open-source toolkit written in Python for speaker diarization. Based on the [PyTorch](https://pytorch.org) machine learning framework, it comes with state-of-the-art [pretrained models and pipelines](https://hf.co/pyannote) that can be further finetuned to your own data for even better performance.

<p align="center">
<a href="https://www.youtube.com/watch?v=37R_R82lfwA"><img src="https://img.youtube.com/vi/37R_R82lfwA/0.jpg"></a>
</p>


## TL;DR [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/pyannote/pyannote-audio/blob/develop/tutorials/intro.ipynb)
## TL;DR

1. Install [`pyannote.audio`](https://github.com/pyannote/pyannote-audio) `3.0` with `pip install pyannote.audio`
2. Accept [`pyannote/segmentation-3.0`](https://hf.co/pyannote/segmentation-3.0) user conditions
3. Accept [`pyannote/speaker-diarization-3.0`](https://hf.co/pyannote/speaker-diarization-3.0) user conditions
4. Create an access token at [`hf.co/settings/tokens`](https://hf.co/settings/tokens)


```python
# 1. visit hf.co/pyannote/speaker-diarization and hf.co/pyannote/segmentation and accept user conditions (only if requested)
# 2. visit hf.co/settings/tokens to create an access token (only if you had to go through 1.)
# 3. instantiate pretrained speaker diarization pipeline
from pyannote.audio import Pipeline
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="ACCESS_TOKEN_GOES_HERE")
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.0",
    use_auth_token="HUGGINGFACE_ACCESS_TOKEN_GOES_HERE")

# send pipeline to GPU (when available)
import torch
pipeline.to(torch.device("cuda"))

# 4. apply pretrained pipeline
# apply pretrained pipeline
diarization = pipeline("audio.wav")

# 5. print the result
# print the result
for turn, _, speaker in diarization.itertracks(yield_label=True):
print(f"start={turn.start:.1f}s stop={turn.end:.1f}s speaker_{speaker}")
# start=0.2s stop=1.5s speaker_0
@@ -39,16 +46,7 @@ for turn, _, speaker in diarization.itertracks(yield_label=True):
- :exploding_head: state-of-the-art performance (see [Benchmark](#benchmark))
- :snake: Python-first API
- :zap: multi-GPU training with [pytorch-lightning](https://pytorchlightning.ai/) (see the training sketch after this list)
- :control_knobs: data augmentation with [torch-audiomentations](https://github.com/asteroid-team/torch-audiomentations)
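
To make the multi-GPU training bullet concrete, here is a minimal training sketch (the protocol name is a placeholder for your own pyannote.database protocol, and the trainer flags are illustrative):

```python
from lightning.pytorch import Trainer

from pyannote.audio.models.segmentation import PyanNet
from pyannote.audio.tasks import SpeakerDiarization
from pyannote.database import registry

# placeholder: any speaker diarization protocol registered
# with pyannote.database works here
protocol = registry.get_protocol("MyDatabase.SpeakerDiarization.MyProtocol")

task = SpeakerDiarization(protocol, duration=10.0, max_speakers_per_chunk=3)
model = PyanNet(task=task)

# pytorch-lightning handles multi-GPU training
trainer = Trainer(accelerator="gpu", devices=2, max_epochs=1)
trainer.fit(model)
```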

## Installation

Only Python 3.8+ is supported.

```bash
# install from develop branch
pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
```

## Documentation

@@ -72,53 +70,50 @@ pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/de
- 2022-12-02 > ["How I reached 1st place at Ego4D 2022, 1st place at Albayzin 2022, and 6th place at VoxSRC 2022 speaker diarization challenges"](tutorials/adapting_pretrained_pipeline.ipynb)
- 2022-10-23 > ["One speaker segmentation model to rule them all"](https://herve.niderb.fr/fastpages/2022/10/23/One-speaker-segmentation-model-to-rule-them-all)
- 2021-08-05 > ["Streaming voice activity detection with pyannote.audio"](https://herve.niderb.fr/fastpages/2021/08/05/Streaming-voice-activity-detection-with-pyannote.html)
- Miscellaneous
- [Training with `pyannote-audio-train` command line tool](tutorials/training_with_cli.md)
- [Speaker verification](tutorials/speaker_verification.ipynb)
- Visualization and debugging
- Videos
- [Introduction to speaker diarization](https://umotion.univ-lemans.fr/video/9513-speech-segmentation-and-speaker-diarization/) / JSALT 2023 summer school / 90 min
- [Speaker segmentation model](https://www.youtube.com/watch?v=wDH2rvkjymY) / Interspeech 2021 / 3 min
- [First release of pyannote.audio](https://www.youtube.com/watch?v=37R_R82lfwA) / ICASSP 2020 / 8 min

## Benchmark

Out of the box, `pyannote.audio` default speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization) is expected to be much better (and faster) in v2.x than in v1.1. Those numbers are diarization error rates (in %)

| Dataset \ Version | v1.1 | v2.0 | v2.1.1 (finetuned) |
| ---------------------- | ---- | ---- | ------------------ |
| AISHELL-4 | - | 14.6 | 14.1 (14.5) |
| AliMeeting (channel 1) | - | - | 27.4 (23.8) |
| AMI (IHM) | 29.7 | 18.2 | 18.9 (18.5) |
| AMI (SDM) | - | 29.0 | 27.1 (22.2) |
| CALLHOME (part2) | - | 30.2 | 32.4 (29.3) |
| DIHARD 3 (full) | 29.2 | 21.0 | 26.9 (21.9) |
| VoxConverse (v0.3) | 21.5 | 12.6 | 11.2 (10.7) |
| REPERE (phase2) | - | 12.6 | 8.2 ( 8.3) |
| This American Life | - | - | 20.8 (15.2) |
Out of the box, `pyannote.audio` speaker diarization [pipeline](https://hf.co/pyannote/speaker-diarization-3.0) v3.0 is expected to be much better (and faster) than v2.x.
Those numbers are diarization error rates (in %):

| Dataset \ Version | v1.1 | v2.0 | [v2.1](https://hf.co/pyannote/speaker-diarization-2.1) | [v3.0](https://hf.co/pyannote/speaker-diarization-3.0) | <a href="mailto:herve-at-niderb-dot-fr?subject=Premium pyannote.audio pipeline&body=Looks like I got your attention! Drop me an email for more details. Hervé.">Premium</a> |
| ---------------------- | ---- | ---- | ------ | ------ | --------- |
| AISHELL-4 | - | 14.6 | 14.1 | 12.3 | 12.3 |
| AliMeeting (channel 1) | - | - | 27.4 | 24.3 | 19.4 |
| AMI (IHM) | 29.7 | 18.2 | 18.9 | 19.0 | 16.7 |
| AMI (SDM) | - | 29.0 | 27.1 | 22.2 | 20.1 |
| AVA-AVD | - | - | - | 49.1 | 42.7 |
| DIHARD 3 (full) | 29.2 | 21.0 | 26.9 | 21.7 | 17.0 |
| MSDWild | - | - | - | 24.6 | 20.4 |
| REPERE (phase2) | - | 12.6 | 8.2 | 7.8 | 7.8 |
| VoxConverse (v0.3) | 21.5 | 12.6 | 11.2 | 11.3 | 9.5 |

## Citations

If you use `pyannote.audio`, please use the following citations:

```bibtex
@inproceedings{Bredin2020,
  Title = {{pyannote.audio: neural building blocks for speaker diarization}},
  Author = {{Bredin}, Herv{\'e} and {Yin}, Ruiqing and {Coria}, Juan Manuel and {Gelly}, Gregory and {Korshunov}, Pavel and {Lavechin}, Marvin and {Fustes}, Diego and {Titeux}, Hadrien and {Bouaziz}, Wassim and {Gill}, Marie-Philippe},
  Booktitle = {ICASSP 2020, IEEE International Conference on Acoustics, Speech, and Signal Processing},
  Year = {2020},
}
@inproceedings{Plaquet23,
  author={Alexis Plaquet and Hervé Bredin},
  title={{Powerset multi-class cross entropy loss for neural speaker diarization}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

```bibtex
@inproceedings{Bredin2021,
  Title = {{End-to-end speaker segmentation for overlap-aware resegmentation}},
  Author = {{Bredin}, Herv{\'e} and {Laurent}, Antoine},
  Booktitle = {Proc. Interspeech 2021},
  Year = {2021},
}
@inproceedings{Bredin23,
  author={Hervé Bredin},
  title={{pyannote.audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe}},
  year=2023,
  booktitle={Proc. INTERSPEECH 2023},
}
```

## Support

For commercial enquiries and scientific consulting, please contact [me](mailto:herve@niderb.fr).

## Development

The commands below will set up pre-commit hooks and packages needed for developing the `pyannote.audio` library.
1 change: 1 addition & 0 deletions requirements.txt
@@ -3,6 +3,7 @@ einops >=0.6.0
huggingface_hub >= 0.13.0
lightning >= 2.0.1
omegaconf >=2.1,<3.0
onnxruntime >= 1.16.0
pyannote.core >= 5.0.0
pyannote.database >= 5.0.1
pyannote.metrics >= 3.2
2 changes: 1 addition & 1 deletion version.txt
@@ -1 +1 @@
2.1.1
3.0.0