
About multimodal sequence input #38

Open
tulvgengenr opened this issue Oct 5, 2024 · 4 comments

Comments

@tulvgengenr

Hello, I am very interested in your great work. I see in the code that the image generation input sequence basically places text tokens before image tokens. What about reversing the order when generating the image?
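For concreteness, a minimal sketch of the two orderings in question; `text_tokens` and `image_tokens` are hypothetical placeholders, not the repository's actual variables:

```python
import torch

# Hypothetical stand-ins for tokenizer outputs: an encoded text prompt
# and the discrete codes of an image (e.g. from a VQ tokenizer).
text_tokens = torch.tensor([[101, 2054, 2003, 102]])
image_tokens = torch.tensor([[5001, 5002, 5003, 5004]])

# Order described above: text tokens before image tokens.
input_ids = torch.cat([text_tokens, image_tokens], dim=1)

# The reversed order being asked about: image tokens first.
reversed_ids = torch.cat([image_tokens, text_tokens], dim=1)
```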

@Sierkinhane
Collaborator

Hi, we did not try that.

@zc1023

zc1023 commented Oct 8, 2024

Hello, something about the multimodal sequence input in MMU seems strange to me.
In the embedding input, the sequence is [system embedding, image embedding, question embedding]. However, in the token input, the sequence is [question token, image token] (see the sketch below). Does the input order not matter?
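A minimal sketch of the two layouts being compared; the shapes, ids, and variable names here are hypothetical, not taken from the repository:

```python
import torch

hidden = 8  # hypothetical embedding width

# Embedding input: [system embedding, image embedding, question embedding]
system_emb = torch.randn(1, 4, hidden)
image_emb = torch.randn(1, 16, hidden)
question_emb = torch.randn(1, 6, hidden)
emb_input = torch.cat([system_emb, image_emb, question_emb], dim=1)

# Token input: [question token, image token]
question_ids = torch.tensor([[7, 8, 9]])
image_ids = torch.tensor([[501, 502, 503]])
tok_input = torch.cat([question_ids, image_ids], dim=1)
```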

@Sierkinhane
Collaborator

Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, the order does not seem to matter much.
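For readers unfamiliar with that processing, a minimal sketch of LLaVA-style splicing, assuming a hypothetical linear projector and made-up shapes (LLaVA's actual connector and prompt layout may differ):

```python
import torch
import torch.nn as nn

hidden = 8  # hypothetical LLM embedding width

# Hypothetical projector mapping continuous CLIP-ViT patch features into
# the LLM embedding space, in the spirit of LLaVA's vision-language connector.
projector = nn.Linear(768, hidden)  # 768 = CLIP ViT-L/14 feature width

clip_features = torch.randn(1, 256, 768)  # continuous patch features
image_emb = projector(clip_features)      # -> (1, 256, hidden)

# Splice the projected image features between system and question embeddings:
# [system embedding, image embedding, question embedding]
system_emb = torch.randn(1, 4, hidden)
question_emb = torch.randn(1, 6, hidden)
inputs_embeds = torch.cat([system_emb, image_emb, question_emb], dim=1)
```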

@zc1023

zc1023 commented Oct 10, 2024

> Hi, for continuous CLIP-ViT features, we follow LLaVA's processing. In our experiments, the order does not seem to matter much.

This result is quite interesting. I'd like to know which input order was used during training.
