task: Train and test text2semantic under decoder only framework for ichigo v0.5 #145
Comments
Goal: Be able to handle any arbitrary language
Methodology
|
Need to add more details to this issue:
Please help me to align nomenclature etc. @tikikun's diagram above is very helpful. |
I moved the table to the top for better visualization. cc @bachvudinh |
This task is a hybrid between text-to-speech and speech-to-speech translation. It is quite hard because there is a one-to-many mapping between the input text and the possible output token combinations.
Here are two papers that use the same AR setting, but for slightly different tasks. I think they can be adapted.
AudioPaLM: https://arxiv.org/pdf/2306.12925
Specifically, I think we can use VALL-E's idea of a phoneme conversion layer before sending the text into the AR model. This might bridge the gap to the semantic embeddings a bit, making the AR model's job easier. We also need to somehow provide some auxiliary information about the expected acoustic ground truth that we are using. Otherwise, if we provide text only to the AR model, there are too many possible correct answers, so the loss may conflict across multiple samples.
However, I think it will be hard to make this work. The AR model needs a better constraint.
My proposal
In the WhisperSpeech framework, the text-to-semantic model is the inverse of the Whisper decoder. We need to involve the Whisper decoder in the training.
You will meet a practical challenge: while training this AR decoder model, it is acting like a NAR encoder model to the Whisper decoder. There might be a smart way to solve this, but I can't think of one at the moment; alternatively, you can just use a NAR model.
Another (simpler) idea
If we really want an AR model trained with next-token prediction, must use WhisperVQ tokens in their current format, and don't want to add auxiliary information, we can try a simple intervention: grouping identical WhisperVQ tokens together. This way, the model is not penalized for getting the output length wrong. I.e., this original example: gets mapped to this: This way, the order of the output tokens matters, but the number of consecutively repeated tokens does not. |
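A minimal sketch of the grouping intervention described above, assuming the WhisperVQ output is available as a flat list of integer token IDs. The IDs below are made up for illustration and do not come from a real utterance.

```python
from itertools import groupby


def collapse_repeats(tokens):
    """Collapse each run of identical consecutive tokens into a single token."""
    return [tok for tok, _run in groupby(tokens)]


# Hypothetical WhisperVQ token IDs (illustrative only).
original = [12, 12, 12, 407, 407, 33, 12, 12]
collapsed = collapse_repeats(original)
print(collapsed)  # [12, 407, 33, 12]
```

Note that only consecutive repeats are merged: the token 12 still appears twice because its two runs are separated by other tokens, so the order of the output is preserved.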
Updated from Research sync 2024-12-4:
|
cc @PodsAreAllYouNeed @tikikun
|
What I tried to do:
Result:
|
Idea: Add duration tokens
Observations:
Theories:
Implementation:
Extra Information
Why might some words result in repetition? At first glance it's tempting to think that the same information in a repeated token (or embedding) is redundant. But if we take a closer look at long and short vowels in English, that might not be the case. Example:
Sometimes the only way to tell the difference between "sheep" and "ship" in spoken English is whether the i sound is long or short, i.e. its duration. By de-duplicating, everything becomes a short sound, but the training target is still "sheep" or "ship" (either long or short), which makes it impossible to really converge. Hence, duration is the information that is left out when you de-dup. |
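One way to picture the duration-token idea is run-length encoding: keep each de-duplicated token, but pair it with the length of its run so the duration is not lost. The sketch below is an assumption about how this could look, not the implementation used in this thread. The token IDs and the `<|sound_…|>` / `<|duration_…|>` string formats are hypothetical.

```python
from itertools import groupby


def encode_with_durations(tokens):
    """Run-length encode tokens into (token, run_length) pairs."""
    return [(tok, sum(1 for _ in run)) for tok, run in groupby(tokens)]


def decode_with_durations(pairs):
    """Invert encode_with_durations by repeating each token run_length times."""
    out = []
    for tok, count in pairs:
        out.extend([tok] * count)
    return out


def to_token_strings(pairs):
    """Render pairs as interleaved sound/duration tokens (hypothetical format)."""
    out = []
    for tok, count in pairs:
        out.append(f"<|sound_{tok:04d}|>")
        out.append(f"<|duration_{count:02d}|>")
    return out


# Made-up IDs: the same token sequence with a long vs. short vowel (token 88).
long_vowel = [5, 88, 88, 88, 88, 21]
short_vowel = [5, 88, 21]

print(encode_with_durations(long_vowel))   # [(5, 1), (88, 4), (21, 1)]
print(encode_with_durations(short_vowel))  # [(5, 1), (88, 1), (21, 1)]
```

After plain de-duplication both sequences would be `[5, 88, 21]`. With the run lengths attached they stay distinct, and `decode_with_durations` recovers the original sequence exactly.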
Related to the idea of the token-level duration token, we could potentially have a "global duration token" added as a context token either before or after the provided text input. This "global duration token" tells the t2s model how many semantic tokens it needs to generate. After training, this global duration token can also be used to control the length of the generation, which controls the speaking speed.
This is inspired by the "number of frames" mechanic found in the F5-TTS generation code: https://github.com/SWivid/F5-TTS/blob/8898d05e374bcb8d3fc0b1286037e95df61f491f/src/f5_tts/infer/utils_infer.py#L449C1-L452C96
If TTS models need some global duration information in order to do the generation, then our text2semantic should also use the same kind of global information. We just need to encode it a little differently. |
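A rough sketch of how a global duration token could be attached to the text input, assuming the target length is bucketed into a small set of special tokens. The `<|global_duration_N|>` format, the bucket size, and the `speed` parameter are all hypothetical choices made for illustration. Only the `<|text_to_semantic|>` task token comes from the issue itself.

```python
def global_duration_token(num_semantic_tokens, bucket_size=10, max_bucket=255):
    """Map a target semantic-token count to a coarse duration token (hypothetical format)."""
    if num_semantic_tokens < 0:
        raise ValueError("num_semantic_tokens must be non-negative")
    bucket = min(num_semantic_tokens // bucket_size, max_bucket)
    return f"<|global_duration_{bucket}|>"


def build_prompt(text, num_semantic_tokens, speed=1.0):
    """Prefix the text with the task token and a global duration token.

    During training num_semantic_tokens is the ground-truth length; at
    inference it would have to be estimated, and speed > 1 shortens it.
    """
    target = round(num_semantic_tokens / speed)
    return f"<|text_to_semantic|>{global_duration_token(target)}{text}"


print(build_prompt("hello world", 47))
# <|text_to_semantic|><|global_duration_4|>hello world
print(build_prompt("hello world", 40, speed=2.0))
# <|text_to_semantic|><|global_duration_2|>hello world
```

Bucketing keeps the number of new vocabulary entries small at the cost of coarser length control. Whether coarse buckets are enough is exactly the kind of thing that would need validation on different generation lengths.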
This idea needs further validation on different generation lengths |
@bachvudinh please add validation on longer sequences |
|
|
Text-to-Semantics Training Issue Resolution
Word Error Rate (WER) Comparison Between Real Semantic tokens and Synthetic tokens.
Note: "With prompt" refers to adding a prompt to the Whisper decoder. |
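For reference, WER is the word-level edit distance between the reference transcript and the decoded hypothesis, divided by the number of reference words. The function below is a generic, dependency-free sketch of that metric and is not necessarily the scoring script used for the numbers in this thread. The lowercasing and whitespace tokenization are simplifying assumptions.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed ref prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the sheep is here", "the ship is here"))  # 0.25
```

To compare real and synthetic semantic tokens, the same reference text would be scored against the Whisper transcript obtained from each token source.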
Maybe take a look at https://huggingface.co/FunAudioLLM/SenseVoiceSmall as an alternative? It claims to be multilingual (50 languages) but also not, could be worth a try as it is small and accurate apparently. |
Motivation
Since ichigo v0.5 will support additional languages, the traditional t2s will become obsolete. This is a good chance to introduce a t2s framework that we have full control over.
Goal
Be able to handle any arbitrary language
Methodology
First step: run the base case with English first (before exploring other languages)
<|text_to_semantic|>
task token and added 512 sound tokens + 3 special tokens (start, end, mask) to its vocabulary.[152,192](https://github.com/QwenLM/Qwen/issues/419)
tokens for training speed optimization.
What needs to be done:
Experiments
Test Results:
Benchmarking
Use WhisperVQ to dequantize the tokens back to embeddings, then use the Whisper model to decode these embeddings into text. Benchmarked on the LibriSpeech clean test set.
Use WhisperVQ to dequantize the tokens back to embeddings, then use the Whisper model to decode these embeddings into text. Benchmarked on the Bud500 test set.
Using AudioBench.