First of all, OmniEdit is a great paper and I really enjoyed reading it.
After going through the paper, I have a question regarding the architecture of EditNet. In your comparison of the three variants—EditNet, ControlNet, and InstructPix2Pix—did you explore the possibility of sequence-wise concatenation in self-attention?
More specifically, by concatenating the noisy latent tokens, text tokens, and condition image tokens sequence-wise (i.e., token-wise), it should still be feasible to inject the conditional information into the base model, while computing self-attention only once rather than twice. From my perspective, EditNet performs a kind of incremental training by adding residual information into the image and text streams, whereas sequence-wise concatenation lets the original text and image tokens share information with the condition directly. Concretely, the attention weights over the text and image tokens originally sum to 1; after concatenating the condition tokens, the weights over the text and image sum to less than 1, with the remaining probability mass allocated to the condition.
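To make the point concrete, here is a minimal, self-contained sketch (plain PyTorch, single-head attention, no learned projections; all names and shapes are my own illustrative assumptions, not OmniEdit's actual code) showing that once condition tokens are appended to the sequence, each image/text query's softmax mass is split between the original tokens and the condition tokens:

```python
import torch

# Illustrative shapes only -- these are assumptions, not OmniEdit's configuration.
B, D = 1, 64                           # batch size, hidden dim
n_img, n_txt, n_cond = 256, 77, 256    # noisy-latent, text, and condition-image token counts

noisy_latent = torch.randn(B, n_img, D)
text_tokens  = torch.randn(B, n_txt, D)
cond_tokens  = torch.randn(B, n_cond, D)

# Sequence-wise (token-wise) concatenation: one joint sequence, so self-attention
# runs once over all three streams instead of twice.
joint = torch.cat([noisy_latent, text_tokens, cond_tokens], dim=1)   # (B, N, D)

# Single-head attention scores for illustration (no learned Q/K/V projections).
attn = torch.softmax(joint @ joint.transpose(1, 2) / D**0.5, dim=-1)  # (B, N, N)

# For every image/text query row, the mass over image+text columns now sums to
# less than 1; the remainder is absorbed by the condition columns.
mass_img_txt = attn[:, :n_img + n_txt, :n_img + n_txt].sum(-1)   # < 1 per row
mass_cond    = attn[:, :n_img + n_txt, n_img + n_txt:].sum(-1)   # the remainder
print(mass_img_txt.mean().item(), mass_cond.mean().item())       # the two sum to 1
```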
I am curious to know if you have tried the method I proposed. If so, could you share any insights or results from your comparisons?
Thanks!