research: Flow Matching for synthetic data generation #140
Comments
This could be related: https://github.com/dongzhuoyao/flowseq/tree/main
You can use a continuous flow matching model to train what is essentially a text-based autoencoder. If it works, it means we will be able to train the T2S model on all the languages supported by Whisper, without the need for any audio data. All we need is some multilingual text data.
Also, check out this repo: https://github.com/lucidrains/voicebox-pytorch. It has a good implementation of the CFM model that is relatively easy to read; I've used it before in my work. It also has some links to Spear-TTS, the precursor to WhisperSpeech. E2-TTS, and later F5-TTS, may have built on top of it as well.
Oh, he also has https://github.com/lucidrains/e2-tts-pytorch. Thanks, lucidrains!
Here is more inspiration about how to achieve some of this.
I think we might be able to frame this as a generative adversarial network: the CFM is the generator, and the Whisper decoder is the discriminator.
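To make that framing concrete, here is a minimal, speculative sketch of one generator step. The `cfm.sample` and `whisper_decoder` interfaces are hypothetical placeholders, not actual WhisperSpeech or openai/whisper APIs: the CFM samples semantic-token embeddings from text, and a frozen Whisper decoder scores them by how well they decode back to the original text, which doubles as the generator loss.

```python
# Hedged sketch of the GAN-style framing: CFM as generator, frozen Whisper
# decoder as critic. All interfaces below are assumptions for illustration.
import torch
import torch.nn.functional as F

def generator_step(cfm, whisper_decoder, text_tokens, text_embeds, optimizer):
    # Generator: sample semantic-token embeddings conditioned on the text.
    fake_semantics = cfm.sample(cond=text_embeds)                  # (B, T, D)

    # Critic: a frozen Whisper decoder tries to reproduce the original text
    # from the generated semantics (teacher forcing). Its cross-entropy is the
    # generator loss, pushing the CFM toward semantics that "read back" as the
    # input text -- no ground-truth audio needed.
    logits = whisper_decoder(fake_semantics, text_tokens[:, :-1])  # (B, L-1, V)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        text_tokens[:, 1:].reshape(-1),
    )

    optimizer.zero_grad()
    loss.backward()   # gradients flow through the sampler into the CFM
    optimizer.step()
    return loss.item()
```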
Overall
We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.
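For reference, the conditional flow matching objective with optimal-transport paths (the formulation used by Voicebox- and F5-TTS-style models) can be written as below; here $x_0$ is Gaussian noise, $x_1$ a target semantic-embedding sequence, $c$ the text condition, and $\sigma_{\min}$ a small constant (the notation is mine, not from this issue):

```latex
% OT conditional path between noise x_0 ~ N(0, I) and data x_1:
%   \phi_t(x_0) = (1 - (1 - \sigma_{\min}) t)\, x_0 + t\, x_1
\mathcal{L}_{\mathrm{CFM}}(\theta) =
  \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0,\; (x_1, c)}
  \big\| v_\theta\big(\phi_t(x_0),\, t,\, c\big)
        - \big(x_1 - (1 - \sigma_{\min})\, x_0\big) \big\|^2
```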
Context
Currently, we use the autoregressive model from WhisperSpeech, specifically the T2S model, to generate the synthetic dataset.
Theoretical Details
The T2S model (Text-to-Semantics) predicts semantic (sound) tokens from text tokens to generate synthetic data. This problem can be framed as learning the conditional distribution

$$p_\theta(Y \mid X),$$

where $X$ is the sequence of text tokens and $Y$ is the sequence of semantic tokens.

Alternatively, it can be stated as a sequence-to-sequence embedding generation task. Given a source sequence

$$X = (x_1, x_2, \ldots, x_n),$$

we aim to develop a generative model that produces a target sequence

$$Y = (y_1, y_2, \ldots, y_m)$$

conditioned on the source sequence.
Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models solve this kind of problem efficiently, with high accuracy and modest resource requirements. The approach also avoids issues inherent to autoregressive generation, such as error accumulation over long sequences, when producing synthetic data.
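As an illustration, here is a minimal sketch of one OT conditional flow matching training step for T2S. The `v_theta` velocity network, the embedding shapes, and the `cond=` interface are assumptions for illustration, not the actual WhisperSpeech or F5-TTS code:

```python
# Minimal OT conditional flow matching training step (sketch, assumed shapes).
import torch
import torch.nn.functional as F

def cfm_training_step(v_theta, text_embeds, semantic_embeds, optimizer,
                      sigma_min=1e-4):
    """One OT-CFM step: learn the velocity field from noise to semantics."""
    # text_embeds:     (B, N, D) condition (source) sequence
    # semantic_embeds: (B, T, D) target sequence x_1
    x1 = semantic_embeds
    x0 = torch.randn_like(x1)                      # noise endpoint of the path
    t = torch.rand(x1.size(0), device=x1.device)   # one time per example
    t_ = t.view(-1, 1, 1)

    # Optimal-transport conditional path and its constant target velocity.
    xt = (1.0 - (1.0 - sigma_min) * t_) * x0 + t_ * x1
    ut = x1 - (1.0 - sigma_min) * x0

    # Regress the predicted velocity onto the OT target velocity.
    pred = v_theta(xt, t, cond=text_embeds)
    loss = F.mse_loss(pred, ut)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At inference time, sampling amounts to integrating the learned velocity field from noise to data over a handful of ODE steps, which is where the speed advantage over token-by-token autoregressive decoding comes from.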
There is also a possibility that we can produce novel results with this approach and significantly increase Ichigo's performance.
Next Steps