
research: Flow Matching for synthetic data generation #140

Open · 2 tasks
tikikun opened this issue Nov 28, 2024 · 6 comments
Labels: P1: important (Important feature / fix)

tikikun (Collaborator) commented Nov 28, 2024

Overall

We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.

Context

Currently we use the autoregressive model from WhisperSpeech, specifically the T2S model, to generate the synthetic dataset.

Theoretical Details

The T2S model (Text-to-Semantics) predicts sound tokens from text tokens to generate synthetic data. This problem can be framed as:

"Transforming a distribution of text embeddings into synthetic sound token embeddings."

Alternatively, it can be stated as:

We address a sequence-to-sequence embedding generation task. Given a source sequence:

$$ w_x = \{w_x^1, \ldots, w_x^M\}, \quad \text{of length } M $$

we aim to develop a generative model that produces a target sequence:

$$ w_y = \{w_y^1, \ldots, w_y^N\}, \quad \text{of length } N $$

conditioned on the source sequence.
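
For concreteness, the standard conditional flow matching objective with an optimal-transport (linear) path can be written in this notation as follows. The noise sample $x_0$, the target embedding sequence $x_1$, and the velocity network $v_\theta$ are not defined elsewhere in this issue and are introduced here only to make the objective explicit:

$$ x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\lVert v_\theta(x_t, t, w_x) - (x_1 - x_0) \right\rVert^2 $$

Here $t \sim \mathcal{U}[0,1]$, $x_0 \sim \mathcal{N}(0, I)$, and $x_1$ is drawn from the target sound-token embedding distribution conditioned on $w_x$.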

Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models solve this problem efficiently, with high accuracy and low resource requirements. This approach also sidesteps the usual drawbacks of autoregressive generation for synthetic data, such as error accumulation over long sequences and slow token-by-token sampling.
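
As a rough illustration of how lightweight the training side can be, here is a minimal PyTorch sketch of an OT conditional flow matching training step. The module and function names (`VelocityField`, `cfm_training_step`) and the transformer backbone are placeholders for illustration, not part of WhisperSpeech, F5-TTS, or our codebase.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts the flow velocity for noisy sound-token embeddings,
    conditioned on text embeddings and the interpolation time t."""
    def __init__(self, dim: int = 512, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        self.time_proj = nn.Linear(1, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t, text_emb):
        # Embed the scalar time and add it to the noisy target embeddings.
        t_emb = self.time_proj(t[:, None, None])                # (B, 1, D)
        h = torch.cat([text_emb, x_t + t_emb], dim=1)           # (B, M + N, D)
        h = self.backbone(h)
        return self.out(h[:, text_emb.size(1):])                # velocities for the N target positions

def cfm_training_step(model, optimizer, text_emb, target_emb):
    """One OT conditional flow matching step: regress the constant velocity
    (x1 - x0) along the straight path x_t = (1 - t) * x0 + t * x1."""
    x1 = target_emb                                  # sound-token embeddings, shape (B, N, D)
    x0 = torch.randn_like(x1)                        # Gaussian noise sample
    t = torch.rand(x1.size(0), device=x1.device)     # one uniform time per example
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    pred_v = model(x_t, t, text_emb)
    loss = ((pred_v - (x1 - x0)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, one would integrate the learned velocity field from Gaussian noise at t = 0 to t = 1 with a small number of ODE steps, conditioned on the text embeddings.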

There is a possibility that we can produce novel results with this approach and significantly improve Ichigo's performance.

Next Steps

  • Adapt our dataset to a flow matching framework.
  • Develop a flow matching framework for T2S tasks.
@tikikun tikikun transferred this issue from janhq/jan Nov 28, 2024
@tikikun tikikun added this to the Ichigo v0.5 milestone Nov 28, 2024
@tikikun tikikun self-assigned this Nov 28, 2024
@tikikun tikikun added the P1: important Important feature / fix label Nov 28, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Nov 28, 2024
hahuyhoang411 (Contributor) commented Nov 28, 2024

This could be related: https://github.com/dongzhuoyao/flowseq/tree/main

PodsAreAllYouNeed commented:

You can use a continuous flow matching model to train what is essentially a text-based autoencoder.
The specific architecture should probably be conditional flow matching, with the text as the condition.
The generation length can be set with something as simple as a words-per-second heuristic.
The decoder will be the frozen Whisper decoder.
The goal is a self-supervised text-to-text round trip through the CFM model and the decoder (a rough sketch follows at the end of this comment).
No guarantee that this will work at all, but it would be damn interesting if it works.
I think it has a chance of working because we're distilling the information from the Whisper decoder, which is a strong model.

[Image attachment]

If it works, it means we will be able to train a T2S model on all the languages supported by Whisper without the need for any audio data; all we need is some multilingual text data.
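
A very rough sketch of what that round trip could look like is below. It assumes Python/PyTorch; the words-per-second heuristic values, the Euler sampler, and especially the `whisper_decoder` call signature are illustrative assumptions, not the real Whisper API or an existing implementation.

```python
import torch
import torch.nn.functional as F

WORDS_PER_SECOND = 2.5    # assumed speaking-rate heuristic for choosing the generation length
TOKENS_PER_SECOND = 25    # assumed semantic-token frame rate

def roundtrip_step(cfm_model, whisper_decoder, text_tokens, text_emb, n_steps: int = 16):
    """Generate semantic embeddings with the CFM conditioned on text, then ask a
    frozen Whisper decoder to reconstruct the original text; only the CFM trains."""
    n_words = text_tokens.size(1)                        # crude proxy for the word count
    target_len = int(n_words / WORDS_PER_SECOND * TOKENS_PER_SECOND)

    # Integrate the learned velocity field from noise to a semantic-embedding sequence (Euler solver).
    x = torch.randn(text_emb.size(0), target_len, text_emb.size(-1), device=text_emb.device)
    for i in range(n_steps):
        t = torch.full((x.size(0),), i / n_steps, device=x.device)
        x = x + cfm_model(x, t, text_emb) / n_steps

    # Keep the Whisper decoder frozen; gradients flow only through x into the CFM.
    for p in whisper_decoder.parameters():
        p.requires_grad_(False)

    # Hypothetical decoder interface: cross-attend over x and predict the next text token.
    logits = whisper_decoder(encoder_states=x, input_ids=text_tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           text_tokens[:, 1:].reshape(-1))
    return loss
```

Backpropagating through all the solver steps is expensive; in practice one might cut the number of steps, detach early steps, or lean on an adversarial signal instead, which relates to the GAN framing floated later in this thread.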

PodsAreAllYouNeed commented:

Also, check out this repo

https://github.com/lucidrains/voicebox-pytorch

It has a good implementation of the CFM model that is relatively easy to read; I've used it before in my work.

It also has some links to Spear-TTS, the precursor to WhisperSpeech; E2-TTS and later F5-TTS may have built on top of it.

hahuyhoang411 (Contributor) commented:

Oh he also has: https://github.com/lucidrains/e2-tts-pytorch

thanks lucidrains

PodsAreAllYouNeed commented:

Here is more inspiration on how to achieve some of this:

https://minjekim.com/research-projects/ladiffcodec/

https://github.com/haiciyang/LaDiffCodec

@hahuyhoang411 hahuyhoang411 moved this from In Progress to Investigating in Menlo Dec 1, 2024
@dan-menlo dan-menlo changed the title research: Possibility of a breakthrough in synthetic data generation using Flow Matching research: Flow Matching for synthetic data generation Dec 2, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Dec 2, 2024
PodsAreAllYouNeed commented:

I think we might be able to frame this as a generative adversarial network: the CFM is the generator, and the discriminator is the Whisper decoder.

@dan-menlo dan-menlo removed this from the Ichigo v0.5 milestone Jan 13, 2025
Projects: Menlo (Status: In Progress)
Development: No branches or pull requests
4 participants