
research: Flow Matching for synthetic data generation #140

Open · 2 tasks
tikikun opened this issue Nov 28, 2024 · 6 comments
Labels: P1: important (Important feature / fix)

tikikun (Collaborator) commented Nov 28, 2024

Overall

We can significantly improve the quality of synthetic multi-modal datasets by using Flow Matching with Optimal Transport.

Context

Currently we use the autoregressive model from WhisperSpeech, specifically the T2S model, to generate the synthetic dataset.

Theoretical Details

The T2S model (Text-to-Semantics) predicts sound tokens from text tokens to generate synthetic data. This problem can be framed as:

"Transforming a distribution of text embeddings into synthetic sound token embeddings."

Alternatively, it can be stated as:

We address a sequence-to-sequence embedding generation task. Given a source sequence:

$$ w_x = \{w_x^1, \ldots, w_x^M\}, \quad \text{of length } M $$

we aim to develop a generative model that produces a target sequence:

$$ w_y = \{w_y^1, \ldots, w_y^N\}, \quad \text{of length } N $$

conditioned on the source sequence.
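
For concreteness, the standard conditional flow matching objective with an optimal-transport (linear) path can be written in this notation as follows. The noise sample $x_0$, the target embedding sequence $x_1$, and the velocity network $v_\theta$ are not defined elsewhere in this issue and are introduced here only to make the objective explicit:

$$ x_t = (1 - t)\,x_0 + t\,x_1, \qquad \mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t,\, x_0,\, x_1} \left\lVert v_\theta(x_t, t, w_x) - (x_1 - x_0) \right\rVert^2 $$

Here $t \sim \mathcal{U}[0,1]$, $x_0 \sim \mathcal{N}(0, I)$, and $x_1$ is drawn from the target sound-token embedding distribution conditioned on $w_x$.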

Empirical results, such as those in the F5-TTS paper, demonstrate that flow matching models solve this problem efficiently, with high accuracy and low resource requirements. This approach also sidesteps the usual drawbacks of autoregressive generation for synthetic data, such as error accumulation over long sequences and slow token-by-token sampling.
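
As a rough illustration of how lightweight the training side can be, here is a minimal PyTorch sketch of an OT conditional flow matching training step. The module and function names (`VelocityField`, `cfm_training_step`) and the transformer backbone are placeholders for illustration, not part of WhisperSpeech, F5-TTS, or our codebase.

```python
import torch
import torch.nn as nn

class VelocityField(nn.Module):
    """Predicts the flow velocity for noisy sound-token embeddings,
    conditioned on text embeddings and the interpolation time t."""
    def __init__(self, dim: int = 512, n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        self.time_proj = nn.Linear(1, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(dim, dim)

    def forward(self, x_t, t, text_emb):
        # Embed the scalar time and add it to the noisy target embeddings.
        t_emb = self.time_proj(t[:, None, None])                # (B, 1, D)
        h = torch.cat([text_emb, x_t + t_emb], dim=1)           # (B, M + N, D)
        h = self.backbone(h)
        return self.out(h[:, text_emb.size(1):])                # velocities for the N target positions

def cfm_training_step(model, optimizer, text_emb, target_emb):
    """One OT conditional flow matching step: regress the constant velocity
    (x1 - x0) along the straight path x_t = (1 - t) * x0 + t * x1."""
    x1 = target_emb                                  # sound-token embeddings, shape (B, N, D)
    x0 = torch.randn_like(x1)                        # Gaussian noise sample
    t = torch.rand(x1.size(0), device=x1.device)     # one uniform time per example
    x_t = (1 - t[:, None, None]) * x0 + t[:, None, None] * x1
    pred_v = model(x_t, t, text_emb)
    loss = ((pred_v - (x1 - x0)) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At sampling time, one would integrate the learned velocity field from Gaussian noise at t = 0 to t = 1 with a small number of ODE steps, conditioned on the text embeddings.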

There is a possibility that we can produce novel results with this approach and significantly improve Ichigo's performance.

Next Steps

  • Adapt our dataset to a flow matching framework.
  • Develop a flow matching framework for T2S tasks.
@tikikun tikikun transferred this issue from janhq/jan Nov 28, 2024
@tikikun tikikun added this to the Ichigo v0.5 milestone Nov 28, 2024
@tikikun tikikun self-assigned this Nov 28, 2024
@tikikun tikikun added the P1: important Important feature / fix label Nov 28, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Nov 28, 2024
hahuyhoang411 (Contributor) commented Nov 28, 2024

This could be related: https://github.com/dongzhuoyao/flowseq/tree/main

PodsAreAllYouNeed commented:

You can use a continuous flow matching model to train what is essentially a text-based autoencoder.
The specific architecture should probably be conditional flow matching, with the text as the condition.
The generation length can be set with something as simple as a words-per-second heuristic.
The decoder will be the frozen Whisper decoder.
The goal is a self-supervised text-to-text round trip through the CFM model and the decoder (a rough sketch follows at the end of this comment).
No guarantee that this will work at all, but it would be damn interesting if it works.
I think it has a chance of working because we're distilling the information from the Whisper decoder, which is a strong model.

[Image attachment]

If it works, it means we will be able to train a T2S model on all the languages supported by Whisper without the need for any audio data; all we need is some multilingual text data.
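
A very rough sketch of what that round trip could look like is below. It assumes Python/PyTorch; the words-per-second heuristic values, the Euler sampler, and especially the `whisper_decoder` call signature are illustrative assumptions, not the real Whisper API or an existing implementation.

```python
import torch
import torch.nn.functional as F

WORDS_PER_SECOND = 2.5    # assumed speaking-rate heuristic for choosing the generation length
TOKENS_PER_SECOND = 25    # assumed semantic-token frame rate

def roundtrip_step(cfm_model, whisper_decoder, text_tokens, text_emb, n_steps: int = 16):
    """Generate semantic embeddings with the CFM conditioned on text, then ask a
    frozen Whisper decoder to reconstruct the original text; only the CFM trains."""
    n_words = text_tokens.size(1)                        # crude proxy for the word count
    target_len = int(n_words / WORDS_PER_SECOND * TOKENS_PER_SECOND)

    # Integrate the learned velocity field from noise to a semantic-embedding sequence (Euler solver).
    x = torch.randn(text_emb.size(0), target_len, text_emb.size(-1), device=text_emb.device)
    for i in range(n_steps):
        t = torch.full((x.size(0),), i / n_steps, device=x.device)
        x = x + cfm_model(x, t, text_emb) / n_steps

    # Keep the Whisper decoder frozen; gradients flow only through x into the CFM.
    for p in whisper_decoder.parameters():
        p.requires_grad_(False)

    # Hypothetical decoder interface: cross-attend over x and predict the next text token.
    logits = whisper_decoder(encoder_states=x, input_ids=text_tokens[:, :-1])
    loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           text_tokens[:, 1:].reshape(-1))
    return loss
```

Backpropagating through all the solver steps is expensive; in practice one might cut the number of steps, detach early steps, or lean on an adversarial signal instead, which relates to the GAN framing floated later in this thread.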

PodsAreAllYouNeed commented:

Also, check out this repo

https://github.com/lucidrains/voicebox-pytorch

It has a good implementation of the CFM model that is relatively easy to read; I've used it before in my work.

It also has some links to Spear-TTS, the precursor to WhisperSpeech; E2-TTS and later F5-TTS may have built on top of it.

hahuyhoang411 (Contributor) commented:

Oh he also has: https://github.com/lucidrains/e2-tts-pytorch

thanks lucidrains

PodsAreAllYouNeed commented:

Here is more inspiration on how to achieve some of this:

https://minjekim.com/research-projects/ladiffcodec/

https://github.com/haiciyang/LaDiffCodec

@hahuyhoang411 hahuyhoang411 moved this from In Progress to Investigating in Menlo Dec 1, 2024
@dan-menlo dan-menlo changed the title research: Possibility of a breakthrough in synthetic data generation using Flow Matching research: Flow Matching for synthetic data generation Dec 2, 2024
@tikikun tikikun moved this from Investigating to In Progress in Menlo Dec 2, 2024
PodsAreAllYouNeed commented:

I think we might be able to frame this as a generative adversarial network: the CFM is the generator, and the discriminator is the Whisper decoder.

@dan-menlo dan-menlo removed this from the Ichigo v0.5 milestone Jan 13, 2025
Projects: Menlo (Status: In Progress)
Development: No branches or pull requests
4 participants