About subtract in pooling #4
Comments
Hi @Dong-Huo , As shown in the paper, since the MetaFormer block already has a residual connection, subtraction of the input |
Duplicate of #1 |
@yuweihao Have you tried removing the residual connection for token mixer? Currently you subtract "normed" x (basically |
Hi @yangcf10 , It is not elegant to remove the residual connection in the block just for the pooling token mixer. It is better to keep the residual connection whatever the token mixer is, so that we can freely swap token mixers in MetaFormer. Instead, I have tried removing subtraction, i.e., replacing |
Thanks for the prompt reply! I understand it's mostly from empirical results. But any insight why we should do the subtraction? The explanation "since the MetaFormer block already has a residual connection so we should add subtraction" seems not to be convincing. If we treat token mixer as an abstracted module, then we shouldn't consider the residual connection when designing it. |
Hi @yangcf10 , Thank you for your feedback and suggestion. We will attempt to further improve the explanation "since the MetaFormer block already has a residual connection, subtraction of the input itself is added in Equation (4)". |
Why don't we just remove the residual connection and the subtraction then? It would save compute and memory. What I'm more concerned about is that the subtraction and the residual connection don't use the same "x" so they don't null each other. Indeed, the residual connection uses a pre-norm x while the subtraction uses a post-norm x. It changes the semantics to something along the lines of a block emphasizing the spatial gradients. What do you think? Does it work as well without the residual connection and the subtraction? |
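The concern in the comment above can be written out as a short sketch. It assumes a pre-norm block and leaves out LayerScale or any other scaling factor (a simplification for illustration, not taken from the paper), with $X$ the block input and $Y = \mathrm{Norm}(X)$:

$$
\begin{aligned}
\text{with residual and subtraction:}\quad & X + \bigl(\mathrm{Pool}(Y) - Y\bigr) \\
\text{with neither:}\quad & \mathrm{Pool}(Y) \\
\text{difference:}\quad & X - Y = X - \mathrm{Norm}(X)
\end{aligned}
$$

Under these assumptions the two variants coincide only when $\mathrm{Norm}(X) = X$, which is the sense in which the residual connection and the subtraction do not cancel each other.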
Okay, I saw your other comments about using DW conv instead of pooling. I understand that your paper is not about PoolFormer but about MetaFormer, and that PoolFormer is just a demonstration. Also, the fact that DW conv brings similar or superior performance shows that there is nothing special about this pooling layer, let alone this subtraction. This is missing the forest for the trees. |
Hi @Vermeille , Many thanks for your attention to this work and your insightful comment. Yes, the goal of this work is to demonstrate that the competence of transformer-like models primarily stems from the general architecture, MetaFormer. Pooling and PoolFormer are just tools to demonstrate MetaFormer. If PoolFormer is considered as a practical model, then, as you point out, it can be further improved in implementation efficiency and other aspects. |
Is there some relation between this pooling operation and graph convolutional networks? Because graphs have no regular structure, GCNs are essentially some kind of pooling followed by an MLP, which seems a lot like PoolFormer, though MetaFormer still has an image pyramid which isn't present in graphs. |
Hi @saulzar , pooling is a basic operator in deep learning. Transformer or MetaFormer can be regarded as a type of Graph Neural Network [1]. From this perspective, attention or pooling in MetaFormer can be regarded as a type of graph attention or graph pooling, respectively. [1] https://graphdeeplearning.github.io/post/transformers-are-gnns/ |
Average pooling combined with subtraction yields a [Laplacian kernel](https://homepages.inf.ed.ac.uk/rbf/HIPR2/log.htm) |
Hi @chuong98 , Yes, it can be regarded as a fixed kernel in image processing (vs. a traditional CNN's learnable kernels). For each token, Laplacian(x) aggregates information from nearby tokens that differs from the token itself, while the residual connection retains the token's own information. The alpha in Normalization or LayerScale can then balance nearby information against own information. Without subtraction, since the MetaFormer block already has a residual connection, alpha would instead balance [nearby information + own information] against [own information], which looks odd. This may be why performance with subtraction is slightly better than without it. Thanks for your continued attention to our work. Happy new year in advance :) |
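To make the "fixed kernel" reading above concrete, here is a minimal pure-Python sketch. It assumes a 3x3 window with stride 1 and averages only over in-bounds neighbours at the borders (an assumption about padding). All function and variable names are hypothetical and are not taken from the PoolFormer code.

```python
def avg_pool3x3(img):
    """3x3 average pooling, stride 1; borders average in-bounds neighbours only."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            vals = [img[a][b]
                    for a in range(max(0, i - 1), min(h, i + 2))
                    for b in range(max(0, j - 1), min(w, j + 2))]
            out[i][j] = sum(vals) / len(vals)
    return out


def pool_minus_identity(img):
    """Token mixer discussed in this thread: pooled value minus the input itself."""
    pooled = avg_pool3x3(img)
    return [[pooled[i][j] - img[i][j] for j in range(len(img[0]))]
            for i in range(len(img))]


# Equivalent fixed kernel for interior positions: box filter minus a unit impulse.
# Its entries sum to zero, like a Laplacian-style kernel.
KERNEL = [[1 / 9, 1 / 9, 1 / 9],
          [1 / 9, 1 / 9 - 1, 1 / 9],
          [1 / 9, 1 / 9, 1 / 9]]


def apply_kernel_at(img, i, j):
    """Apply KERNEL centred on interior position (i, j)."""
    return sum(KERNEL[a][b] * img[i - 1 + a][j - 1 + b]
               for a in range(3) for b in range(3))
```

Because the kernel sums to zero, a constant input maps to zero everywhere, so only local differences pass through this branch; the token's own value reaches the output through the residual connection.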
For anyone wondering, I got the following results on ImageNet-100:
|
That is really helpful, thanks @DonkeyShot21. @yuweihao Can you add these extra experiments to your revised version? |
Hi @chuong98 , sure, we plan to add extra experiments with additional token mixers (e.g., DWConv) on ImageNet-1K in our revised version. |
Hi, thank you for publishing such a nice paper. I just have one question. I do not understand the subtraction of the input in Eqn. 4. Is it necessary? What will happen if we just do the average pooling without subtracting the input?