ClearerVoice-Studio is an open-source, AI-powered speech processing toolkit designed for researchers, developers, and end-users. It supports speech enhancement, speech separation, target speaker extraction, and more. The toolkit includes state-of-the-art pre-trained models, along with training and inference scripts, all accessible from this repository.
👉🏻ClearVoice Demo👈🏻 | 👉🏻SpeechScore Demo👈🏻
Please support our community project 💖 by starring it on GitHub ⭐ 🙏
- [2024.11] FRCRN speech denoiser has been used over 2.8 million times on ModelScope
- [2024.11] MossFormer speech separator has been used over 2.5 million times on ModelScope
- [2024.11] Release of this repository
- Upcoming: More tasks will be added to ClearVoice.
- Pre-Trained Models: Includes cutting-edge pre-trained models, fine-tuned on extensive, high-quality datasets. No need to start from scratch!
- Ease of Use: Designed for seamless integration with your projects, offering a simple yet flexible interface for inference and training.
- Comprehensive Features: Combines advanced algorithms for multiple speech processing tasks in one platform.
- Community-Driven: Built for researchers, developers, and enthusiasts to collaborate and innovate together.
This repository is organized into three main components: ClearVoice, Train, and SpeechScore.
ClearVoice offers a user-friendly solution for speech processing tasks such as speech denoising, separation, audio-visual target speaker extraction, and more. It is designed as a unified inference platform leveraging pre-trained models (e.g., FRCRN, MossFormer), all trained on extensive datasets. If you're looking for a tool to improve speech quality, ClearVoice is the perfect choice. Simply click on ClearVoice and follow our detailed instructions to get started.
For advanced researchers and developers, we provide model fine-tuning and training scripts for all the tasks offered in ClearVoice and more:
- Task 1: Speech enhancement (16kHz & 48kHz)
- Task 2: Speech separation (8kHz & 16kHz)
- Task 3: Target speaker extraction
  - Sub-Task 1: Audio-only Speaker Extraction Conditioned on a Reference Speech (8kHz)
  - Sub-Task 2: Audio-visual Speaker Extraction Conditioned on Face (Lip) Recording (16kHz)
  - Sub-Task 3: Audio-visual Speaker Extraction Conditioned on Body Gestures (16kHz)
  - Sub-Task 4: Neuro-steered Speaker Extraction Conditioned on EEG Signals (16kHz)
Contributors are welcome to add more model architectures and tasks!
SpeechScore is a speech quality assessment toolkit. We include it here to evaluate the performance of different models. SpeechScore includes many popular speech metrics:
- Signal-to-Noise Ratio (SNR)
- Perceptual Evaluation of Speech Quality (PESQ)
- Short-Time Objective Intelligibility (STOI)
- Deep Noise Suppression Mean Opinion Score (DNSMOS)
- Scale-Invariant Signal-to-Distortion Ratio (SI-SDR)
- and many more quality benchmarks
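To make two of the listed metrics concrete, here is a minimal NumPy sketch of SNR and SI-SDR for a pair of time-aligned, equal-length signals. It is not SpeechScore's implementation: the function names `snr_db` and `si_sdr_db`, the `eps` stabilizer, and the synthetic test signal are assumptions made for illustration only.

```python
import numpy as np


def snr_db(reference, estimate, eps=1e-12):
    """Signal-to-noise ratio in dB: reference energy over residual energy."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    noise = estimate - reference
    return 10.0 * np.log10((np.sum(reference**2) + eps) / (np.sum(noise**2) + eps))


def si_sdr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SDR in dB.

    The estimate is projected onto the reference, so rescaling the
    estimate does not change the score.
    """
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    # Optimal scaling of the reference that best explains the estimate.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    return 10.0 * np.log10((np.sum(target**2) + eps) / (np.sum(residual**2) + eps))


# Hypothetical example: a 440 Hz tone at 16 kHz with additive Gaussian noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.1 * rng.standard_normal(t.size)

print(f"SNR:    {snr_db(clean, noisy):.2f} dB")
print(f"SI-SDR: {si_sdr_db(clean, noisy):.2f} dB")
```

Halving the amplitude of `noisy` lowers the plain SNR but leaves SI-SDR essentially unchanged, which is why SI-SDR is often preferred when a model's output gain is arbitrary.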
If you have any comments or questions about ClearerVoice-Studio, feel free to raise an issue in this repository or contact us directly at:
- email: {shengkui.zhao, zexu.pan}@alibaba-inc.com
Alternatively, you are welcome to join our DingTalk and WeChat groups to share and discuss algorithms, technology, and user experience feedback. You may scan the following QR codes to join our official chat groups.
Check out some awesome GitHub repositories from the Speech Lab of the Institute for Intelligent Computing, Alibaba Group.
ClearerVoice-Studio contains third-party components and code modified from some open-source repos, including:
SpeechBrain, ESPnet, TalkNet-ASD