First of all, OmniEdit is a great paper and I really enjoyed reading it.
After going through the paper, I have a question regarding the architecture of EditNet. In your comparison of the three variants—EditNet, ControlNet, and InstructPix2Pix—did you explore the possibility of sequence-wise concatenation in self-attention?
More specifically, by concatenating the noisy latent tokens, text tokens, and condition image tokens sequence-wise (i.e., token-wise), it should still be feasible to inject the conditional information into the base model, while computing self-attention only once rather than twice. From my perspective, EditNet performs a kind of incremental training by adding residual information into the image and text streams, whereas sequence-wise concatenation lets the original text and image tokens share information with the condition directly. Concretely, the attention weights over the text and image tokens originally sum to 1; after concatenating the condition tokens, the weights over the text and image sum to less than 1, with the remaining probability mass allocated to the condition.
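To make the point concrete, here is a minimal, self-contained sketch (plain PyTorch, single-head attention, no learned projections; all names and shapes are my own illustrative assumptions, not OmniEdit's actual code) showing that once condition tokens are appended to the sequence, each image/text query's softmax mass is split between the original tokens and the condition tokens:

```python
import torch

# Illustrative shapes only -- these are assumptions, not OmniEdit's configuration.
B, D = 1, 64                           # batch size, hidden dim
n_img, n_txt, n_cond = 256, 77, 256    # noisy-latent, text, and condition-image token counts

noisy_latent = torch.randn(B, n_img, D)
text_tokens  = torch.randn(B, n_txt, D)
cond_tokens  = torch.randn(B, n_cond, D)

# Sequence-wise (token-wise) concatenation: one joint sequence, so self-attention
# runs once over all three streams instead of twice.
joint = torch.cat([noisy_latent, text_tokens, cond_tokens], dim=1)   # (B, N, D)

# Single-head attention scores for illustration (no learned Q/K/V projections).
attn = torch.softmax(joint @ joint.transpose(1, 2) / D**0.5, dim=-1)  # (B, N, N)

# For every image/text query row, the mass over image+text columns now sums to
# less than 1; the remainder is absorbed by the condition columns.
mass_img_txt = attn[:, :n_img + n_txt, :n_img + n_txt].sum(-1)   # < 1 per row
mass_cond    = attn[:, :n_img + n_txt, n_img + n_txt:].sum(-1)   # the remainder
print(mass_img_txt.mean().item(), mass_cond.mean().item())       # the two sum to 1
```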
I am curious to know if you have tried the method I proposed. If so, could you share any insights or results from your comparisons?
Thanks!