Visual Attention Analysis with Spatial Transformer Networks for Handwritten Digit Classification on MNIST
- To clone the repository:
git clone https://github.com/biswassanket/STN_FGC.git
cd STN_FGC
- To create the conda environment:
conda env create -f environment.yml
conda activate stn_fgc
- To run the base STN with standard Conv layers:
$ python main.py --stn
- To run the STN with CoordConv layers (a minimal sketch of the STN/CoordConv idea follows after these commands):
$ python main.py --stncoordconv --localization
- To run the Vision Transformer (ViT) variant:
$ python main.py --vit
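The STN variants above warp the input with a learned affine transform before classification, and the CoordConv variants concatenate normalized coordinate channels to the convolution inputs. Below is a minimal, illustrative PyTorch sketch of both ideas; the module names (`CoordConv2d`, `STN`), the layer sizes, and the exact way `--localization` wires CoordConv into the localization network are assumptions for illustration, not the repo's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoordConv2d(nn.Module):
    """Conv2d that appends normalized x/y coordinate channels to its input (illustrative)."""
    def __init__(self, in_channels, out_channels, **kwargs):
        super().__init__()
        self.conv = nn.Conv2d(in_channels + 2, out_channels, **kwargs)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))

class STN(nn.Module):
    """Spatial transformer: a localization net predicts an affine warp applied to the input."""
    def __init__(self, coordconv=False):
        super().__init__()
        conv = CoordConv2d if coordconv else nn.Conv2d
        self.loc = nn.Sequential(
            conv(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(True),
            conv(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(True),
        )
        self.fc_loc = nn.Sequential(
            nn.Linear(10 * 3 * 3, 32), nn.ReLU(True), nn.Linear(32, 6)
        )
        # Initialize the affine parameters to the identity transform ("no warp").
        self.fc_loc[2].weight.data.zero_()
        self.fc_loc[2].bias.data.copy_(torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):  # x: (N, 1, 28, 28) MNIST digits
        theta = self.fc_loc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)
```

The identity initialization of the final localization layer is the usual trick so that, at the start of training, the transformer passes the input through unchanged and the warp is learned gradually.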
Step 6: For a detailed analysis of the visual attention models we experimented with, here is the complete report:
| Model Variant | Accuracy | Best Epoch |
|---|---|---|
| Simple Conv | 0.9879 | 48 |
| Simple STN+Conv | 0.9889 | 44 |
| Simple STN+CoordConv | 0.9850 | 43 |
| Simple STN+CoordConv+localization | 0.9910 | 47 |
| Simple STN+CoordConv+localization+r-channel | 0.9868 | 40 |
| Vision Transformers | 0.9844 | 49 |
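For reference, the accuracy column can be reproduced by evaluating a trained model on the MNIST test set. Here is a minimal evaluation sketch, assuming PyTorch/torchvision, that accuracy means test-set accuracy, and a model whose forward pass returns class logits; this is not the repo's exact evaluation code.

```python
import torch
from torchvision import datasets, transforms

def test_accuracy(model, device="cpu"):
    """Fraction of correctly classified MNIST test digits (as reported in the table above)."""
    loader = torch.utils.data.DataLoader(
        datasets.MNIST("data", train=False, download=True,
                       transform=transforms.Compose([
                           transforms.ToTensor(),
                           transforms.Normalize((0.1307,), (0.3081,)),
                       ])),
        batch_size=1000)
    model.eval()
    correct = 0
    with torch.no_grad():
        for images, labels in loader:
            preds = model(images.to(device)).argmax(dim=1)
            correct += (preds == labels.to(device)).sum().item()
    return correct / len(loader.dataset)
```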
I enjoyed playing with these models. Stay tuned: more implementations of visual attention models on fine-grained image classification tasks are coming soon. Thank you, and sorry for the bugs, as usual.