🚀 The feature, motivation and pitch
🤗 Hello! Thank you for your work!
I see model configurations in this repo that work with certain modalities, and it is great.
I have a question though: what if I have a pretrained encoder for another modality (e.g. audio) and data for training (audio-text pairs and audio-image pairs)?
How can I train a model that will be able to solve tasks with my new modality?
In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?
Alternatives
No response
Additional context
It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
Hi @averkij, thanks for using the library. Can you share more specifics about the task you're working on? That way we can hopefully give more detailed and informative answers.
How can I train a model that will be able to solve tasks with my new modality?
I guess you are talking about co-learning (or something similar)? But again, if you can provide more specifics, that'll be helpful.
In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?
For fusing different modalities, we provide some generic fusion components, which can be found here.
It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
With the fusion modules above you should hopefully be able to do this without much trouble. They all take in Dict[str, Tensor], so you just need to put each encoder's outputs into a dict and then pass it to the fusion module. You can also look at late_fusion.py, which provides a general way to set up this type of architecture.
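To make that pattern concrete, here is a minimal PyTorch sketch of the Dict[str, Tensor] flow. The ConcatFusion and SimpleLateFusion classes below are illustrative stand-ins written just for this example, not the library's actual modules; in practice you would plug in one of the fusion components linked above (or the architecture in late_fusion.py) in their place.

```python
# Minimal sketch of the Dict[str, Tensor] fusion pattern described above.
# ConcatFusion / SimpleLateFusion are illustrative stand-ins, not library classes.
from typing import Dict

import torch
from torch import nn, Tensor


class ConcatFusion(nn.Module):
    """Fuse per-modality embeddings by concatenating along the feature dim."""

    def forward(self, embeddings: Dict[str, Tensor]) -> Tensor:
        # Sort keys so the concatenation order is deterministic.
        return torch.cat([embeddings[k] for k in sorted(embeddings)], dim=-1)


class SimpleLateFusion(nn.Module):
    """Run each pretrained encoder, collect outputs in a dict, then fuse."""

    def __init__(self, encoders: Dict[str, nn.Module], fusion: nn.Module, head: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        self.fusion = fusion
        self.head = head

    def forward(self, inputs: Dict[str, Tensor]) -> Tensor:
        # Each encoder sees only its own modality; outputs go into a dict
        # keyed by modality name, which is what the fusion module consumes.
        embeddings = {name: enc(inputs[name]) for name, enc in self.encoders.items()}
        return self.head(self.fusion(embeddings))


# Toy encoders standing in for your pretrained text/image/audio encoders.
model = SimpleLateFusion(
    encoders={
        "text": nn.Linear(300, 128),
        "image": nn.Linear(2048, 128),
        "audio": nn.Linear(1024, 128),
    },
    fusion=ConcatFusion(),
    head=nn.Linear(3 * 128, 10),  # e.g. a 10-way classification head
)
out = model({
    "text": torch.randn(4, 300),
    "image": torch.randn(4, 2048),
    "audio": torch.randn(4, 1024),
})
print(out.shape)  # torch.Size([4, 10])
```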
Hello, @ebsmothers. Thank you for the reply. Let me be more specific.
I have three unimodal encoders for different modalities (text, image, audio), each of which maps its input to a sequence of embeddings. I also have datasets for different tasks across these three modalities. I want to build and train a single model that can solve such tasks (image captioning, ASR, audio classification, image generation, etc.).