Research project of Computer Vision - CSCI-GA.2271-001 Fall 20
Md Salman Rahman ([email protected]) and Wonkwon Lee ([email protected])
This research work provides a fair and in-depth comparison of out-of-distribution robustness among 58 state-of-the-art computer vision models, including vision transformers, convolutional networks, hybrids of convolution and attention, multi-layer perceptrons, sequence-based models, complementary search, and network-based models.
The vision transformer (ViT) has advanced to the cutting edge of visual recognition. Recent research reports that transformers are more robust than convolutional neural networks (CNNs) and attributes this robustness to ViT's self-attention mechanism. We find, however, that these conclusions rest on unfair experimental conditions and on comparisons of only a few models, which do not capture the full picture of robustness performance. In this study, we investigate the performance of 58 state-of-the-art computer vision models in a unified training setup. The models cover not only attention and convolution mechanisms but also neural networks that combine convolution and attention, sequence-based models, complementary search, and network-based methods. Our research demonstrates that robustness depends on the training setup and the model type, and that performance varies with the type of out-of-distribution data. Our research will aid the community in better understanding and benchmarking the robustness of computer vision models.
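As a rough illustration of the kind of comparison described above, the following sketch computes top-1 accuracy for each model on each evaluation split and the accuracy drop relative to an in-distribution split. It is a minimal sketch under stated assumptions, not the project's evaluation code: the function names `top1_accuracy` and `robustness_report`, the split key `in_distribution`, and the use of an absolute accuracy drop as the robustness measure are all hypothetical choices made for this example.

```python
from typing import Dict, List, Sequence


def top1_accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Return the fraction of predictions that equal their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty split")
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def robustness_report(
    predictions: Dict[str, Dict[str, List[int]]],
    labels: Dict[str, List[int]],
    reference_split: str = "in_distribution",  # hypothetical split name
) -> Dict[str, Dict[str, float]]:
    """Accuracy per model and split, plus the drop versus the reference split.

    predictions maps model name -> split name -> predicted class ids.
    labels maps split name -> true class ids.
    """
    report: Dict[str, Dict[str, float]] = {}
    for model_name, preds_by_split in predictions.items():
        reference_acc = top1_accuracy(
            preds_by_split[reference_split], labels[reference_split]
        )
        row: Dict[str, float] = {}
        for split, split_preds in preds_by_split.items():
            acc = top1_accuracy(split_preds, labels[split])
            row[split] = acc
            if split != reference_split:
                # Assumption: robustness is summarized as an absolute drop.
                row[f"{split}_drop"] = reference_acc - acc
        report[model_name] = row
    return report
```

Using the same labels and the same metric for every model mirrors the idea of a unified setup: differences in the report then come from the models and the shift type rather than from the evaluation procedure.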