- Hong Kong University of Science and Technology
- Hong Kong
- @zhenye234
- https://huggingface.co/ZhenYe234
Stars
Zonos-v0.1 is a leading open-weight text-to-speech model trained on more than 200k hours of varied multilingual speech, delivering expressiveness and quality on par with—or even surpassing—top TTS …
Robust recipes to align language models with human and AI preferences
Unified automatic quality assessment for speech, music, and sound.
Fully open reproduction of DeepSeek-R1
LLaSE: Maximizing Acoustic Preservation for LLaMA based Speech Enhancement
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Realtime Video and Audio Streaming with WebRTC and Gradio
Official repo for CoVoMix: Advancing Zero-Shot Speech Generation for Human-like Multi-talker Conversations
Recipes to scale inference-time compute of open models
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
This repository contains demos I made with the Transformers library by HuggingFace.
Reference-aware automatic speech evaluation toolkit
Welcome to the Llama Cookbook! This is your go-to guide for building with Llama: getting started with inference, fine-tuning, and RAG. We also show you how to solve end-to-end problems using Llama mode…
Official repo for Images that Sound: spectrograms generated by diffusion models that can be viewed as images and played as sound
[ACM MM24] Official implementation of the paper "From Speaker to Dubber: Movie Dubbing with Prosody and Duration Consistency Learning"
PyTorch implementation of Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities.
NeMo text processing for ASR and TTS
Text Normalization & Inverse Text Normalization
A quick guide (especially) for trending instruction finetuning datasets
First base model for full-duplex conversational audio
A wrapper around speech quality metrics MOSNet, BSSEval, STOI, PESQ, SRMR, SISDR