🚀 The feature, motivation and pitch
🤗 Hello! Thank you for your work!
I see model configurations in this repo that work with certain modalities, and it is great.
I have a question though: what if I have a pretrained encoder for another modality (e.g. audio) and data for training (audio-text pairs and audio-image pairs)?
How can I train a model that will be able to solve tasks with my new modality?
In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?
Alternatives
No response
Additional context
It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
Hi @averkij, thanks for using the library. Can you share more specifics about the task you're working on? That way we can hopefully give more detailed and informative answers.
How can I train a model that will be able to solve tasks with my new modality?
I guess you are talking about co-learning (or something similar)? But again, if you can provide more specifics, that'll be helpful.
In other words, which components should I use to fuse the new modality with the other ones? Should I implement a new model, or can I use existing components as fusers?
For fusing different modalities, we provide some generic fusion components, which can be found here.
It would be great if a user who has N pretrained encoders for arbitrary modalities could pass them to some fuser model and train it to solve cross-modal tasks, or add the new modality to an existing model.
With the fusion modules above you should hopefully be able to do this without much trouble. They all take in Dict[str, Tensor], so you just need to put each encoder's outputs into a dict and then pass it to the fusion module. You can also look at late_fusion.py, which provides a general way to set up this type of architecture.
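To make that pattern concrete, here is a minimal PyTorch sketch of the Dict[str, Tensor] flow. The ConcatFusion and SimpleLateFusion classes below are illustrative stand-ins written just for this example, not the library's actual modules; in practice you would plug in one of the fusion components linked above (or the architecture in late_fusion.py) in their place.

```python
# Minimal sketch of the Dict[str, Tensor] fusion pattern described above.
# ConcatFusion / SimpleLateFusion are illustrative stand-ins, not library classes.
from typing import Dict

import torch
from torch import nn, Tensor


class ConcatFusion(nn.Module):
    """Fuse per-modality embeddings by concatenating along the feature dim."""

    def forward(self, embeddings: Dict[str, Tensor]) -> Tensor:
        # Sort keys so the concatenation order is deterministic.
        return torch.cat([embeddings[k] for k in sorted(embeddings)], dim=-1)


class SimpleLateFusion(nn.Module):
    """Run each pretrained encoder, collect outputs in a dict, then fuse."""

    def __init__(self, encoders: Dict[str, nn.Module], fusion: nn.Module, head: nn.Module):
        super().__init__()
        self.encoders = nn.ModuleDict(encoders)
        self.fusion = fusion
        self.head = head

    def forward(self, inputs: Dict[str, Tensor]) -> Tensor:
        # Each encoder sees only its own modality; outputs go into a dict
        # keyed by modality name, which is what the fusion module consumes.
        embeddings = {name: enc(inputs[name]) for name, enc in self.encoders.items()}
        return self.head(self.fusion(embeddings))


# Toy encoders standing in for your pretrained text/image/audio encoders.
model = SimpleLateFusion(
    encoders={
        "text": nn.Linear(300, 128),
        "image": nn.Linear(2048, 128),
        "audio": nn.Linear(1024, 128),
    },
    fusion=ConcatFusion(),
    head=nn.Linear(3 * 128, 10),  # e.g. a 10-way classification head
)
out = model({
    "text": torch.randn(4, 300),
    "image": torch.randn(4, 2048),
    "audio": torch.randn(4, 1024),
})
print(out.shape)  # torch.Size([4, 10])
```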
Hello, @ebsmothers. Thank you for the reply. Let me be more specific.
I have three unimodal encoders for different modalities (text, image, audio), each of which maps its input to a sequence of embeddings. I also have datasets for different tasks across these three modalities. I want to build and train a single model that can solve such tasks (image captioning, ASR, audio classification, image generation, etc.).