🔱 Speech Trident - Awesome Speech LM

In this repository, we survey three crucial areas that contribute to speech/audio large language models: (1) representation learning, (2) neural codecs, and (3) language models.

1.⚡ Speech Representation Models: These models focus on learning structural speech representations, which can then be quantized into discrete speech tokens, often referred to as semantic tokens.

2.⚡ Speech Neural Codec Models: These models are designed to learn speech and audio discrete tokens, often referred to as acoustic tokens, while maintaining reconstruction ability and low bitrate.

3.⚡ Speech Large Language Models: These models are trained on top of speech and acoustic tokens using a language modeling approach. They demonstrate proficiency in both speech understanding and speech generation tasks.
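
As a rough illustration of the terms above (not taken from any of the surveyed models), the sketch below shows how continuous feature frames can be turned into discrete tokens by nearest-centroid lookup, and how the bitrate of a token stream follows from its frame rate, number of codebooks, and codebook size. The toy 2-D features, the three-entry codebook, and the 50 Hz / 8 codebooks / 1024 entries configuration are all assumptions chosen for readability.

```python
import math

def quantize(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        distances = [math.dist(frame, centroid) for centroid in codebook]
        tokens.append(distances.index(min(distances)))
    return tokens

def bitrate_bps(frame_rate_hz, num_codebooks, codebook_size):
    """Bits per second of a token stream: frames/s x codebooks x bits per token."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)

# Hypothetical toy data: real systems use high-dimensional learned features
# and codebooks obtained by clustering or trained end to end.
codebook = [(0.0, 0.0), (1.0, 1.0), (4.0, 0.0)]
frames = [(0.1, -0.2), (0.9, 1.2), (3.7, 0.3), (1.1, 0.8)]

tokens = quantize(frames, codebook)
print(tokens)                    # [0, 1, 2, 1]
print(bitrate_bps(50, 8, 1024))  # 4000.0
```

A language model in the third category would then be trained on sequences like `tokens`, in the same way a text language model is trained on word-piece IDs.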

🔱 Contributors

Kai-Wei Chang	Haibin Wu	Wei-Cheng Tseng
Kehan Lu	Chun-Yi Kuan	Hung-yi Lee

🔱 Speech/Audio Language Models

Date	Model Name	Paper Title	Link
2024-11	hertz-dev	blog	code
2024-11	Freeze-Omni	Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM	paper
2024-11	Align-SLM	Align-SLM: Textless Spoken Language Models with Reinforcement Learning from AI Feedback	paper
2024-10	OmniFlatten	OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation	paper
2024-10	GPT-4o	GPT-4o System Card	paper
2024-10	Baichuan-OMNI	Baichuan-Omni Technical Report	paper
2024-10	GLM-4-Voice	GLM-4-Voice	GitHub
2024-10	--	Roadmap towards Superhuman Speech Understanding using Large Language Models	paper
2024-10	SALMONN-OMNI	SALMONN-OMNI: A SPEECH UNDERSTANDING AND GENERATION LLM IN A CODEC-FREE FULL-DUPLEX FRAMEWORK	paper
2024-10	Mini-Omni 2	Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities	paper
2024-10	HALL-E	HALL-E: Hierarchical Neural Codec Language Model for Minute-Long Zero-Shot Text-to-Speech Synthesis	paper
2024-10	SyllableLM	SyllableLM: Learning Coarse Semantic Units for Speech Language Models	paper
2024-09	Moshi	Moshi: a speech-text foundation model for real-time dialogue	paper
2024-09	Takin AudioLLM	Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models	paper
2024-09	FireRedTTS	FireRedTTS: A Foundation Text-To-Speech Framework for Industry-Level Generative Speech Applications	paper
2024-09	LLaMA-Omni	LLaMA-Omni: Seamless Speech Interaction with Large Language Models	paper
2024-09	MaskGCT	MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer	paper
2024-09	SSR-Speech	SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis	paper
2024-09	MoWE-Audio	MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders	paper
2024-08	Mini-Omni	Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming	paper
2024-08	Make-A-Voice 2	Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learner	paper
2024-08	LSLM	Language Model Can Listen While Speaking	paper
2024-06	SimpleSpeech	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	paper
2024-06	UniAudio 1.5	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	paper
2024-06	VALL-E R	VALL-E R: Robust and Efficient Zero-Shot Text-to-Speech Synthesis via Monotonic Alignment	paper
2024-06	VALL-E 2	VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers	paper
2024-06	GPST	Generative Pre-trained Speech Language Model with Efficient Hierarchical Transformer	paper
2024-04	CLaM-TTS	CLaM-TTS: Improving Neural Codec Language Model for Zero-Shot Text-to-Speech	paper
2024-04	RALL-E	RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis	paper
2024-04	WavLLM	WavLLM: Towards Robust and Adaptive Speech Large Language Model	paper
2024-02	MobileSpeech	MobileSpeech: A Fast and High-Fidelity Framework for Mobile Zero-Shot Text-to-Speech	paper
2024-02	SLAM-ASR	An Embarrassingly Simple Approach for LLM with Strong ASR Capacity	paper
2024-02	AnyGPT	AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling	paper
2024-02	SpiRit-LM	SpiRit-LM: Interleaved Spoken and Written Language Model	paper
2024-02	USDM	Integrating Paralinguistics in Speech-Empowered Large Language Models for Natural Conversation	paper
2024-02	BAT	BAT: Learning to Reason about Spatial Sounds with Large Language Models	paper
2024-02	Audio Flamingo	Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities	paper
2024-02	Text Description to speech	Natural language guidance of high-fidelity text-to-speech with synthetic annotations	paper
2024-02	GenTranslate	GenTranslate: Large Language Models are Generative Multilingual Speech and Machine Translators	paper
2024-02	Base-TTS	BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data	paper
2024-02	--	It's Never Too Late: Fusing Acoustic Information into Large Language Models for Automatic Speech Recognition	paper
2024-01	--	Large Language Models are Efficient Learners of Noise-Robust Speech Recognition	paper
2024-01	ELLA-V	ELLA-V: Stable Neural Codec Language Modeling with Alignment-guided Sequence Reordering	paper
2023-12	Seamless	Seamless: Multilingual Expressive and Streaming Speech Translation	paper
2023-11	Qwen-Audio	Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models	paper
2023-10	LauraGPT	LauraGPT: Listen, Attend, Understand, and Regenerate Audio with GPT	paper
2023-10	SALMONN	SALMONN: Towards Generic Hearing Abilities for Large Language Models	paper
2023-10	UniAudio	UniAudio: An Audio Foundation Model Toward Universal Audio Generation	paper
2023-10	Whispering LLaMA	Whispering LLaMA: A Cross-Modal Generative Error Correction Framework for Speech Recognition	paper
2023-09	VoxtLM	Voxtlm: unified decoder-only models for consolidating speech recognition/synthesis and speech/text continuation tasks	paper
2023-09	LTU-AS	Joint Audio and Speech Understanding	paper
2023-09	SLM	SLM: Bridge the thin gap between speech and text foundation models	paper
2023-09	--	Generative Speech Recognition Error Correction with Large Language Models and Task-Activating Prompting	paper
2023-08	SpeechGen	SpeechGen: Unlocking the Generative Power of Speech Language Models with Prompts	paper
2023-08	SpeechX	SpeechX: Neural Codec Language Model as a Versatile Speech Transformer	paper
2023-08	LLaSM	Large Language and Speech Model	paper
2023-08	SeamlessM4T	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Speech-LLaMA	On decoder-only architecture for speech-to-text and large language model integration	paper
2023-07	LLM-ASR(temp.)	Prompting Large Language Models with Speech Recognition Abilities	paper
2023-06	AudioPaLM	AudioPaLM: A Large Language Model That Can Speak and Listen	paper
2023-05	Make-A-Voice	Make-A-Voice: Unified Voice Synthesis With Discrete Representation	paper
2023-05	Spectron	Spoken Question Answering and Speech Continuation Using Spectrogram-Powered LLM	paper
2023-05	TWIST	Textually Pretrained Speech Language Models	paper
2023-05	Pengi	Pengi: An Audio Language Model for Audio Tasks	paper
2023-05	SoundStorm	Efficient Parallel Audio Generation	paper
2023-05	LTU	Listen, Think, and Understand	paper
2023-05	SpeechGPT	Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities	paper
2023-05	VioLA	Unified Codec Language Models for Speech Recognition, Synthesis, and Translation	paper
2023-05	X-LLM	X-LLM: Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages	paper
2023-03	Google USM	Google USM: Scaling Automatic Speech Recognition Beyond 100 Languages	paper
2023-03	VALL-E X	Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling	paper
2023-02	SPEAR-TTS	Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision	paper
2023-01	VALL-E	Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers	paper
2022-12	Whisper	Robust Speech Recognition via Large-Scale Weak Supervision	paper
2022-10	AudioGen	AudioGen: Textually Guided Audio Generation	paper
2022-09	AudioLM	AudioLM: a Language Modeling Approach to Audio Generation	paper
2022-05	Wav2Seq	Wav2Seq: Pre-training Speech-to-Text Encoder-Decoder Models Using Pseudo Languages	paper
2022-04	Unit mBART	Enhanced Direct Speech-to-Speech Translation Using Self-supervised Pre-training and Data Augmentation	paper
2022-03	d-GSLM	Generative Spoken Dialogue Language Modeling	paper
2021-10	SLAM	SLAM: A Unified Encoder for Speech and Language Modeling via Speech-Text Joint Pre-Training	paper
2021-09	p-GSLM	Text-Free Prosody-Aware Generative Spoken Language Modeling	paper
2021-02	GSLM	Generative Spoken Language Modeling from Raw Audio	paper

🔱 Speech/Audio Codec Models

Date	Model Name	Paper Title	Link
2024-11	UniCodec	Universal Speech Token Learning Via Low-Bitrate Neural Codec and Pretrained Representations	paper
2024-11	SimVQ	Addressing Representation Collapse in Vector Quantized Models with One Linear Layer	paper
2024-11	MDCTCodec	MDCTCodec: A Lightweight MDCT-based Neural Audio Codec towards High Sampling Rate and Low Bitrate Scenarios	paper
2024-10	APCodec+	APCodec+: A Spectrum-Coding-Based High-Fidelity and High-Compression-Rate Neural Audio Codec with Staged Training Paradigm	paper
2024-10	--	A Closer Look at Neural Codec Resynthesis: Bridging the Gap between Codec and Waveform Generation	paper
2024-10	SNAC	SNAC: Multi-Scale Neural Audio Codec	paper
2024-10	LSCodec	LSCodec: Low-Bitrate and Speaker-Decoupled Discrete Speech Codec	paper
2024-10	Co-design for codec and codec-LM	TOWARDS CODEC-LM CO-DESIGN FOR NEURAL CODEC LANGUAGE MODELS	paper
2024-10	VChangeCodec	VChangeCodec: A High-efficiency Neural Speech Codec with Built-in Voice Changer for Real-time Communication	paper
2024-10	DC-Spin	DC-Spin: A Speaker-invariant Speech Tokenizer For Spoken Language Models	paper
2024-10	TAAE	Scaling Transformers for Low-Bitrate High-Quality Speech Coding	paper
2024-10	DM-Codec	DM-Codec: Distilling Multimodal Representations for Speech Tokenization	paper
2024-09	Mimi	Moshi: a speech-text foundation model for real-time dialogue	paper
2024-09	NDVQ	NDVQ: Robust Neural Audio Codec with Normal Distribution-Based Vector Quantization	paper
2024-09	SoCodec	SoCodec: A Semantic-Ordered Multi-Stream Speech Codec for Efficient Language Model Based Text-to-Speech Synthesis	paper
2024-09	BigCodec	BigCodec: Pushing the Limits of Low-Bitrate Neural Speech Codec	paper
2024-08	X-Codec	Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model	paper
2024-08	WavTokenizer	WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling	paper
2024-07	Super-Codec	SuperCodec: A Neural Speech Codec with Selective Back-Projection Network	paper
2024-07	dMel	dMel: Speech Tokenization made Simple	paper
2024-06	CodecFake	CodecFake: Enhancing Anti-Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems	paper
2024-06	Single-Codec	Single-Codec: Single-Codebook Speech Codec towards High-Performance Speech Generation	paper
2024-06	SQ-Codec	SimpleSpeech: Towards Simple and Efficient Text-to-Speech with Scalar Latent Transformer Diffusion Models	paper
2024-06	PQ-VAE	Addressing Index Collapse of Large-Codebook Speech Tokenizer with Dual-Decoding Product-Quantized Variational Auto-Encoder	paper
2024-06	LLM-Codec	UniAudio 1.5: Large Language Model-driven Audio Codec is A Few-shot Audio Task Learner	paper
2024-05	HILCodec	HILCodec: High Fidelity and Lightweight Neural Audio Codec	paper
2024-04	SemantiCodec	SemantiCodec: An Ultra Low Bitrate Semantic Audio Codec for General Sound	paper
2024-04	PromptCodec	PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders	paper
2024-04	ESC	ESC: Efficient Speech Coding with Cross-Scale Residual Vector Quantized Transformers	paper
2024-03	FACodec	NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models	paper
2024-02	AP-Codec	APCodec: A Neural Audio Codec with Parallel Amplitude and Phase Spectrum Encoding and Decoding	paper
2024-02	Language-Codec	Language-Codec: Reducing the Gaps Between Discrete Codec Representation and Speech Language Models	paper
2024-01	ScoreDec	ScoreDec: A Phase-preserving High-Fidelity Audio Codec with A Generalized Score-based Diffusion Post-filter	paper
2023-11	HierSpeech++	HierSpeech++: Bridging the Gap between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-shot Speech Synthesis	paper
2023-10	TiCodec	FEWER-TOKEN NEURAL SPEECH CODEC WITH TIME-INVARIANT CODES	paper
2023-09	RepCodec	RepCodec: A Speech Representation Codec for Speech Tokenization	paper
2023-09	FunCodec	FunCodec: A Fundamental, Reproducible and Integrable Open-source Toolkit for Neural Speech Codec	paper
2023-08	SpeechTokenizer	Speechtokenizer: Unified speech tokenizer for speech large language models	paper
2023-06	VOCOS	VOCOS: CLOSING THE GAP BETWEEN TIME-DOMAIN AND FOURIER-BASED NEURAL VOCODERS FOR HIGH-QUALITY AUDIO SYNTHESIS	paper
2023-06	Descript-audio-codec	High-Fidelity Audio Compression with Improved RVQGAN	paper
2023-05	AudioDec	Audiodec: An open-source streaming highfidelity neural audio codec	paper
2023-05	HiFi-Codec	Hifi-codec: Group-residual vector quantization for high fidelity audio codec	paper
2023-03	LMCodec	LMCodec: A Low Bitrate Speech Codec With Causal Transformer Models	paper
2022-11	Disen-TF-Codec	Disentangled Feature Learning for Real-Time Neural Speech Coding	paper
2022-10	EnCodec	High fidelity neural audio compression	paper
2022-07	S-TFNet	Cross-Scale Vector Quantization for Scalable Neural Speech Coding	paper
2022-01	TFNet	End-to-End Neural Speech Coding for Real-Time Communications	paper
2021-07	SoundStream	SoundStream: An End-to-End Neural Audio Codec	paper

🔱 Speech/Audio Representation Models

Date	Model Name	Paper Title	Link
2024-09	NEST-RQ	NEST-RQ: Next Token Prediction for Speech Self-Supervised Pre-Training	paper
2024-01	EAT	Self-Supervised Pre-Training with Efficient Audio Transformer	paper
2023-10	MR-HuBERT	Multi-resolution HuBERT: Multi-resolution Speech Self-Supervised Learning with Masked Unit Prediction	paper
2023-10	SpeechFlow	Generative Pre-training for Speech with Flow Matching	paper
2023-09	WavLabLM	Joint Prediction and Denoising for Large-scale Multilingual Self-supervised Learning	paper
2023-08	W2v-BERT 2.0	Massively Multilingual & Multimodal Machine Translation	paper
2023-07	Whisper-AT	Noise-Robust Automatic Speech Recognizers are Also Strong General Audio Event Taggers	paper
2023-06	ATST	Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks	paper
2023-05	SPIN	Self-supervised Fine-tuning for Improved Content Representations by Speaker-invariant Clustering	paper
2023-05	DinoSR	Self-Distillation and Online Clustering for Self-supervised Speech Representation Learning	paper
2023-05	NFA	Self-supervised neural factor analysis for disentangling utterance-level speech representations	paper
2022-12	Data2vec 2.0	Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language	paper
2022-12	BEATs	Audio Pre-Training with Acoustic Tokenizers	paper
2022-11	MT4SSL	MT4SSL: Boosting Self-Supervised Speech Representation Learning by Integrating Multiple Targets	paper
2022-08	DINO	Non-contrastive self-supervised learning of utterance-level speech representations	paper
2022-07	Audio-MAE	Masked Autoencoders that Listen	paper
2022-04	MAESTRO	Matched Speech Text Representations through Modality Matching	paper
2022-03	MAE-AST	Masked Autoencoding Audio Spectrogram Transformer	paper
2022-03	LightHuBERT	Lightweight and Configurable Speech Representation Learning with Once-for-All Hidden-Unit BERT	paper
2022-02	Data2vec	A General Framework for Self-supervised Learning in Speech, Vision and Language	paper
2021-10	WavLM	WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing	paper
2021-08	W2v-BERT	Combining Contrastive Learning and Masked Language Modeling for Self-Supervised Speech Pre-Training	paper
2021-07	mHuBERT	Direct speech-to-speech translation with discrete units	paper
2021-06	HuBERT	Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units	paper
2021-03	BYOL-A	Self-Supervised Learning for General-Purpose Audio Representation	paper
2020-12	DeCoAR2.0	DeCoAR 2.0: Deep Contextualized Acoustic Representations with Vector Quantization	paper
2020-07	TERA	TERA: Self-Supervised Learning of Transformer Encoder Representation for Speech	paper
2020-06	Wav2vec2.0	wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations	paper
2019-10	APC	Generative Pre-Training for Speech with Autoregressive Predictive Coding	paper
2018-07	CPC	Representation Learning with Contrastive Predictive Coding	paper

🔱 SLT 2024 Codec-SUPERB challenge (upcoming)

Webpage. The challenge will cover today's neural audio codecs and speech/audio language models. Agenda: To be determined.

🔱 Interspeech 2024 Survey Talk

Professor Hung-yi Lee will give the Interspeech 2024 survey talk, titled Challenges in Developing Spoken Language Models. The talk will cover today's speech/audio large language models.

🔱 ICASSP 2024 Tutorial Information

I (Kai-Wei Chang) will be giving a talk as part of the ICASSP 2024 tutorial titled Parameter-Efficient and Prompt Learning for Speech and Language Foundation Models. The talk will cover today's speech/audio large language models. The slides from my presentation are available at https://kwchang.org/talks/. Please feel free to reach out to me for any discussions.

🔱 Related Repository

Citation

If you find this repository useful, please consider citing the following papers.

@article{wu2024codec,
  title={Codec-SUPERB@ SLT 2024: A lightweight benchmark for neural audio codec models},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kaiwei and Du, Jiawei and Lu, Ke-Han and Liu, Alexander H and Chung, Ho-Lam and Wu, Yuan-Kuei and Yang, Dongchao and others},
  journal={arXiv preprint arXiv:2409.14085},
  year={2024}
}

@inproceedings{wu-etal-2024-codec,
    title = "Codec-{SUPERB}: An In-Depth Analysis of Sound Codec Models",
    author = "Wu, Haibin  and
      Chung, Ho-Lam  and
      Lin, Yi-Cheng  and
      Wu, Yuan-Kuei  and
      Chen, Xuanjun  and
      Pai, Yu-Chi  and
      Wang, Hsiu-Hsuan  and
      Chang, Kai-Wei  and
      Liu, Alexander  and
      Lee, Hung-yi",
    editor = "Ku, Lun-Wei  and
      Martins, Andre  and
      Srikumar, Vivek",
    booktitle = "Findings of the Association for Computational Linguistics: ACL 2024",
    month = aug,
    year = "2024",
    address = "Bangkok, Thailand",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-acl.616",
    doi = "10.18653/v1/2024.findings-acl.616",
    pages = "10330--10348",
}

@article{wu2023speechgen,
  title={Speechgen: Unlocking the generative power of speech language models with prompts},
  author={Wu, Haibin and Chang, Kai-Wei and Wu, Yuan-Kuei and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2306.02207},
  year={2023}
}

@article{wu2024towards,
  title={Towards audio language modeling-an overview},
  author={Wu, Haibin and Chen, Xuanjun and Lin, Yi-Cheng and Chang, Kai-wei and Chung, Ho-Lam and Liu, Alexander H and Lee, Hung-yi},
  journal={arXiv preprint arXiv:2402.13236},
  year={2024}
}
