
About multimodal sequence input #38

Open
tulvgengenr opened this issue Oct 5, 2024 · 4 comments

Comments

@tulvgengenr

Hello, I am very interested in your great work. I see in the code that the image generation input sequence basically places text tokens before image tokens. What about reversing the order when generating the image?
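For concreteness, a minimal sketch of the two orderings in question; `text_tokens` and `image_tokens` are hypothetical placeholders, not the repository's actual variables:

```python
import torch

# Hypothetical stand-ins for tokenizer outputs: an encoded text prompt
# and the discrete codes of an image (e.g. from a VQ tokenizer).
text_tokens = torch.tensor([[101, 2054, 2003, 102]])
image_tokens = torch.tensor([[5001, 5002, 5003, 5004]])

# Order described above: text tokens before image tokens.
input_ids = torch.cat([text_tokens, image_tokens], dim=1)

# The reversed order being asked about: image tokens first.
reversed_ids = torch.cat([image_tokens, text_tokens], dim=1)
```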

@Sierkinhane
Collaborator

Hi, we did not try that.

@zc1023

zc1023 commented Oct 8, 2024

Hello, something about the multimodal sequence input in MMU seems strange to me.
In the embedding input, the sequence is [system embedding, image embedding, question embedding]. However, in the token input, the sequence is [question token, image token] (see the sketch below). Does the input order not matter?
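A minimal sketch of the two layouts being compared; the shapes, ids, and variable names here are hypothetical, not taken from the repository:

```python
import torch

hidden = 8  # hypothetical embedding width

# Embedding input: [system embedding, image embedding, question embedding]
system_emb = torch.randn(1, 4, hidden)
image_emb = torch.randn(1, 16, hidden)
question_emb = torch.randn(1, 6, hidden)
emb_input = torch.cat([system_emb, image_emb, question_emb], dim=1)

# Token input: [question token, image token]
question_ids = torch.tensor([[7, 8, 9]])
image_ids = torch.tensor([[501, 502, 503]])
tok_input = torch.cat([question_ids, image_ids], dim=1)
```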

@Sierkinhane
Collaborator

Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, the order does not seem to matter much.
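For readers unfamiliar with that processing, a minimal sketch of LLaVA-style splicing, assuming a hypothetical linear projector and made-up shapes (LLaVA's actual connector and prompt layout may differ):

```python
import torch
import torch.nn as nn

hidden = 8  # hypothetical LLM embedding width

# Hypothetical projector mapping continuous CLIP-ViT patch features into
# the LLM embedding space, in the spirit of LLaVA's vision-language connector.
projector = nn.Linear(768, hidden)  # 768 = CLIP ViT-L/14 feature width

clip_features = torch.randn(1, 256, 768)  # continuous patch features
image_emb = projector(clip_features)      # -> (1, 256, hidden)

# Splice the projected image features between system and question embeddings:
# [system embedding, image embedding, question embedding]
system_emb = torch.randn(1, 4, hidden)
question_emb = torch.randn(1, 6, hidden)
inputs_embeds = torch.cat([system_emb, image_emb, question_emb], dim=1)
```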

@zc1023

zc1023 commented Oct 10, 2024

> Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, the order does not seem to matter much.

This result is quite interesting. I'd like to know which input order was used during training.
