Competition Results:
- On the CUBES dataset we reached 98.9% test accuracy after 311 epochs (a stable result). The average of the top 5 accuracies within the first 100 epochs is 97%; the result was verified in 2 separate runs.
- On the HUMANS dataset we reached 92.4% test accuracy after 128 epochs and 92.6% after 313 epochs. The average of the top 5 accuracies within the first 100 epochs is 92.2%, verified in 2 separate runs.
We implemented two models: one is a self-attention adaptation of MeshCNN, which we call the "Mesh Transformer", and the other is an LSTM-based mesh walk.
The attention-based model performed very well. We think we are the current SOTA on CUBES (our test accuracy is 98.9% and the current SOTA is 98.6%). It is important to mention that our method is MeshCNN-based and not related to the current SOTA; we obtained the result while going in a different direction, so we see it as evidence that a MeshCNN-based model can still be the SOTA.
The LSTM-based model performed poorly.
p.s. We didn't take any code from the web - we wrote all the code by ourselves.
Request: Because we invested significant time in this, we would like to continue working on it and also consider it part of the final project (we invested much more time than a regular HW requires :) ).
We added a self-attention layer to MeshCNN. Self-attention is known to suffer from high memory consumption; to handle that we used patched self-attention: we divided the mesh edges into disjoint local sets and applied self-attention to each set (with the same weights for all patches). The patch size is denoted by the window size (and we fixed an unchangeable stride = window_size so the patches are disjoint). (The implementation is in /models/layers/mesh_self_attention.py.)
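Roughly, the idea looks like the following sketch (single-head for brevity; names, shapes, and the projections are illustrative assumptions and do not necessarily match /models/layers/mesh_self_attention.py):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WindowedSelfAttention(nn.Module):
    """Self-attention applied independently to disjoint windows of edges.

    Illustrative sketch only; names and shapes are assumptions and do not
    necessarily match models/layers/mesh_self_attention.py.
    """
    def __init__(self, feat_dim, embed_dim, window_size):
        super().__init__()
        self.window_size = window_size
        self.to_q = nn.Linear(feat_dim, embed_dim)
        self.to_k = nn.Linear(feat_dim, embed_dim)
        self.to_v = nn.Linear(feat_dim, feat_dim)
        self.scale = embed_dim ** -0.5

    def forward(self, x):
        # x: (num_edges, feat_dim). Pad so the edges split into full windows.
        n, d = x.shape
        pad = (-n) % self.window_size
        x_pad = torch.cat([x, x.new_zeros(pad, d)]) if pad else x
        w = x_pad.view(-1, self.window_size, d)           # (num_windows, window, feat)
        q, k, v = self.to_q(w), self.to_k(w), self.to_v(w)
        # The windows act as a batch dimension, so every patch shares the same weights.
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        new_x = (attn @ v).reshape(-1, d)[:n]             # transformed edge features
        return new_x, attn  # attn: (num_windows, window_size, window_size)
```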
Let's define the "attention degree" of an edge e as the sum of the attention scores that all edges gave to this edge. We use the attention degree as the criterion for Edge Collapse (i.e., pooling). (Implemented in /models/layers/mesh_pool_sa.py.)
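With the per-window attention matrices from the sketch above (rows index the attending edges, columns the attended ones), the attention degree is just a column sum; the collapse ordering shown here (least-attended first) is an assumption, not necessarily what mesh_pool_sa.py does:

```python
def attention_degree(attn, num_edges):
    """Attention degree per edge: the total attention paid *to* each edge.

    attn: (num_windows, window_size, window_size) as returned by the sketch
    above; summing over the query rows gives each edge's degree.
    """
    return attn.sum(dim=1).reshape(-1)[:num_edges]  # (num_edges,)

# Hypothetical use for pooling: we assume here that the least-attended edges
# are collapsed first (the actual ordering lives in models/layers/mesh_pool_sa.py):
#   collapse_order = torch.argsort(attention_degree(attn, num_edges))
```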
We see that using self-attention only for pooling, without updating the input (i.e., after applying self-attention we ignore its output and take only the attention matrix, which drives the self-attention pooling), performs much better than also changing the input (as in a usual transformer).
Just to clarify:
- Suppose the self-attention layer is a function sa that takes the list of edge features x as input and returns new_x, attention_matrix.
- Using it for pooling only:
  * _, attention_matrix = sa(x)
  * x = pooling_based_self_attention(x, attention_matrix)
- Using it as a "full transformer":
  * x, attention_matrix = sa(x)
  * x = pooling_based_self_attention(x, attention_matrix)
We also use the labels "Full transformer" and "Self attention based pooling only" in the results table, where we see that self-attention-based pooling only outperforms the full transformer.
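For concreteness, here is how the two modes differ in code, reusing the hypothetical WindowedSelfAttention and attention_degree helpers from the sketches above (the pooling stub is purely illustrative):

```python
import torch

x = torch.randn(750, 5)  # dummy edge features, just for illustration
sa = WindowedSelfAttention(feat_dim=5, embed_dim=32, window_size=25)

def pooling_based_self_attention(x, attention_matrix):
    # Placeholder for the attention-degree pooling sketched above:
    # keep the most-attended half of the edges, drop the rest.
    keep = torch.argsort(attention_degree(attention_matrix, x.shape[0]))[x.shape[0] // 2:]
    return x[keep]

# "Self attention based pooling only": ignore the new features, use only the attention matrix.
_, attention_matrix = sa(x)
x_pool_only = pooling_based_self_attention(x, attention_matrix)

# "Full transformer": also replace the features with the attention output.
new_x, attention_matrix = sa(x)
x_full = pooling_based_self_attention(new_x, attention_matrix)
```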
Possible reasons:
- We used patched self attention
- It may have needed more training time.
- Inappropriate initialization.
- It may need bigger datasets.
- Using matrix multiplication to aggregate the heads, as described in the paper. The problem is that it takes a lot of memory (we used a small embedding size to overcome the memory issue, but for the head aggregation that is not possible; there are options such as two smaller layers with fewer parameters, but we saw that a max operation also works well in practice, so we preferred it, and this might be the reason). A small sketch contrasting the two aggregations follows this list.
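The sketch below contrasts the element-wise max we used with the standard concatenate-and-project aggregation; the shapes and names are illustrative assumptions:

```python
import torch
import torch.nn as nn

# per_head: per-head outputs of the attention layer (shapes are illustrative)
num_heads, num_edges, head_dim = 4, 750, 16
per_head = torch.randn(num_heads, num_edges, head_dim)

# (a) What we used: element-wise max over the heads (cheap, no extra parameters).
agg_max = per_head.max(dim=0).values                               # (num_edges, head_dim)

# (b) The standard alternative: concatenate the heads and mix them with a
#     learned projection (more parameters and memory).
mix = nn.Linear(num_heads * head_dim, head_dim)
agg_proj = mix(per_head.permute(1, 0, 2).reshape(num_edges, -1))   # (num_edges, head_dim)
```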
Our mesh transformer supports patching to handle the memory consumption. Problems:
- The self-attention is very local; a lot of improvement can come from globalizing the attention (as in the original form, window_size = num_of_edges). Possible improvements:
- Make the patches overlap.
- Use a better technique to overcome the memory issue, e.g., low-rank approximations of exp(QK^T) (Performers for low-rank approximation, Linformer, Longformer, ...).
- Window size
- Embedding size
- Number of self attention heads
We added those to the command line options.
At the time we tried this, we didn't know that there is existing work that does it (which is also the current SOTA on CUBES).
- Adding an edge embedding layer. We think it can help because the initial 5 features may disturb each other in the aggregation. When we tried it, it didn't improve the results, but we still think it's a good option.
- Changing all the convolutions to fully connected layers (we thought it could increase the network's expressiveness, but it turned out to perform poorly).
- Batch normalization instead of group normalization. We note that we run on a single GPU, and group norm is useful mainly when training on multiple GPUs (because the batch is then split across the GPUs and each one holds only 1 or 2 samples, so accurate mean and variance cannot be computed).
- We also ran with a larger batch size (32 instead of 16), which helped as well (because of the previous change).
There was a tiny bug in the original code in the batch option ('BatchNorm2D' should be used instead of 'BatchNorm' in the norm selection); a small sketch of the corrected norm selection appears after this list.
- Changing the aggregation in the mesh convolution layer to an average instead of symmetric transformations and concatenation. It decreased the results.
- Changing the pooling criterion to the norm of the first feature only; it decreased the results.
- Adding dropout didn't help; we used BN, which is known to make it redundant.
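Regarding the norm-selection bug mentioned in the batch-normalization item above, a minimal sketch of how the corrected helper might look (illustrative; names and defaults may differ from the actual MeshCNN code):

```python
import functools
import torch.nn as nn

def get_norm_layer(norm_type, num_groups=16):
    """Illustrative norm selection; names may differ from the actual repo code."""
    if norm_type == 'batch':
        # the fix: use the 2D variant, nn.BatchNorm2d (there is no nn.BatchNorm)
        return functools.partial(nn.BatchNorm2d, affine=True)
    if norm_type == 'group':
        return functools.partial(nn.GroupNorm, num_groups)
    if norm_type == 'none':
        return None
    raise NotImplementedError(f'normalization layer [{norm_type}] is not found')
```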
An LSTM that goes over the mesh edges in some order and routes information from one edge to another. The original idea of using an LSTM was to globalize the patched (local) self-attention (as far as we know this approach doesn't exist, and we thought it was cool). Then we saw that with this layer alone (and only 0.6M parameters) we got 80% accuracy on CUBES, which suggests there is something in this layer that can work.
We use a circular LSTM, which applies the LSTM several times while keeping the state from the previous pass and using it as the initial state of the next one (different from bidirectional LSTMs, but with a similar motivation). The exact motivation here is that we want the LSTM to first compute global information and then use it as the initial hidden vector, because otherwise the first elements of the sequence suffer from a bad hidden state (which contains no information at the start).
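A minimal sketch of the circular LSTM idea (names and shapes are assumptions, not the exact project code):

```python
import torch.nn as nn

class CircularLSTM(nn.Module):
    """Run an LSTM over the edge sequence several times, feeding the final
    (h, c) of each pass in as the initial state of the next pass.
    Illustrative sketch; names and shapes are assumptions."""

    def __init__(self, feat_dim, hidden_dim, num_passes=2):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.num_passes = num_passes

    def forward(self, x):
        # x: (batch, num_edges, feat_dim) -- edge features in a fixed traversal order
        state = None  # the first pass starts from the default zero state
        for _ in range(self.num_passes):
            out, state = self.lstm(x, state)
        # after the first pass the initial state already summarizes the whole mesh,
        # so the early edges in the sequence no longer see an uninformative state
        return out
```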
Possible improvements:
- Make it run many partial traversals (i.e., not over the whole mesh), as done in the Mesh Walk paper.
- Just as there is an attention-based DeepWalk, we can do something similar here: greedily traverse the graph, each time moving to the nearby edge with the highest attention degree (defined in the MeshTransformer section).
- Use recent advances in Linear Multi-Armed Bandits, since this setting can be adapted to that framework.
Our best model, which performs self-attention-based pooling (not a full transformer), reaches a test accuracy of 98.9% (and stable), which we think makes it the current SOTA. We attach a table with the results:
- Our method surpasses the current SOTA by 0.3%.
- We saw that the full transformer decreased the results compared to using only attention-based pooling. This emphasizes the importance of the pooling layer when the self-attention is dedicated to pooling only.
- The good results appeared around epoch 200, where the results also became stable. The 98.9% appeared at epoch 311.
- We already discussed our thoughts on how to improve the full transformer and the LSTM-based mesh walk results. There is a more specific discussion in the Notes column of the table.
In the training plots, a_b_c means window_size=a, embedding_size=b (of the keys and queries), num_heads=c.
[Plot: our self-attention pooling versus the original MeshCNN]
[Plot: the effect of the attention hyperparameters on CUBES for self-attention-based pooling (not a full transformer); see the table for a more precise analysis]
[Plot: the same comparison for the full transformer]
[Plot: original MeshCNN vs. self-attention pooling vs. LSTM]
We see that this benchmark is harder. We also had only 500 training samples, which we think makes the training process harder. The highest score we got is 92.6% after 313 epochs. We don't know the current SOTA on this benchmark, so we can't compare. Here we see that a higher window size did help (in contrast to CUBES). There is a more specific discussion in the Notes column of the table.
We see that here the augmentation is important. There is a more precise discussion in the Notes column.
Look at the original repository for more info: ranahanocka/MeshCNN
- Clone this repo:
git clone https://github.com/ranahanocka/MeshCNN.git
cd MeshCNN
- Install dependencies: PyTorch version 1.2. Optional: tensorboardX for training plots.
- Via new conda environment
conda env create -f environment.yml
(creates an environment called meshcnn)
Download the dataset
bash ./scripts/cubes/get_data.sh
Run training (if using conda env, first activate it, e.g. source activate meshcnn):
bash ./scripts/cubes/train.sh
To view the training loss plots, in another terminal run tensorboard --logdir runs and click http://localhost:6006.
Run test and export the intermediate pooled meshes:
bash ./scripts/cubes/test.sh
Visualize the network-learned edge collapses:
bash ./scripts/cubes/view.sh
The same as above for the human_seg dataset: download the dataset / run training / get pretrained weights / run test / view:
bash ./scripts/human_seg/get_data.sh
bash ./scripts/human_seg/train.sh
bash ./scripts/human_seg/get_pretrained.sh
bash ./scripts/human_seg/test.sh
bash ./scripts/human_seg/view.sh