The official PyTorch implementation of MGP-STR (ECCV 2022).
MGP-STR is a conceptually SIMPLE yet POWERFUL vision STR model built upon the Vision Transformer (ViT). To integrate linguistic knowledge, a Multi-Granularity Prediction (MGP) strategy is proposed to inject information from the language modality into the model in an implicit way. With NO independent language model (LM), MGP-STR outperforms previous state-of-the-art models for scene text recognition, including both pure vision models and language-augmented methods.
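For intuition, here is a minimal sketch of the multi-granularity fusion idea (hypothetical helpers, not the authors' code): each head (character, BPE, WordPiece) decodes its own string together with per-token probabilities, and the most confident string is taken as the final answer.

```python
# A minimal sketch of multi-granularity fusion (hypothetical helpers, not the
# authors' implementation): each head decodes a string with per-token
# probabilities, and the most confident string wins.
from functools import reduce
from operator import mul

def sequence_confidence(token_probs):
    """Confidence of one decoded string: the product of its per-token probabilities."""
    return reduce(mul, token_probs, 1.0)

def fuse_predictions(predictions):
    """predictions: list of (decoded_string, per_token_probs), one per head."""
    return max(predictions, key=lambda p: sequence_confidence(p[1]))[0]

# Example: the three granularity heads disagree; the character head is most confident.
char_pred = ("coffee", [0.99, 0.98, 0.97, 0.99, 0.98, 0.99])  # ~0.90 confidence
bpe_pred  = ("coffee", [0.95, 0.90])                          # ~0.86 confidence
wp_pred   = ("coffe",  [0.60])                                # 0.60 confidence
print(fuse_predictions([char_pred, bpe_pred, wp_pred]))       # -> "coffee"
```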
- This work was tested with PyTorch 1.7.0, CUDA 10.1, Python 3.6, and Ubuntu 16.04.
```
pip3 install -r requirements.txt
```
Download the LMDB dataset from Read Like Humans: Autonomous, Bidirectional and Iterative Language Modeling for Scene Text Recognition.
- Training datasets
  - MJSynth (MJ):
    - Use `tools/create_lmdb_dataset.py` to convert images into an LMDB dataset (the resulting layout is sketched after this list)
    - LMDB dataset: BaiduNetdisk (passwd: n23k)
  - SynthText (ST):
    - Use `tools/crop_by_word_bb.py` to crop images from the original SynthText dataset, then convert the crops into an LMDB dataset with `tools/create_lmdb_dataset.py`
    - LMDB dataset: BaiduNetdisk (passwd: n23k)
  - Real:
    - We use the real dataset from PARSeq. We recommend following the instructions in parseq/Datasets.md. The gdrive links are gdrive-link1 and gdrive-link2 from PARSeq.
  - Union14M-L:
    - We use Union14M-L for training; you can download it from here.
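For reference, here is a hedged sketch of the LMDB layout such tools typically produce, assuming the key scheme of the CLOVA AI benchmark this repo builds on (`num-samples`, plus 1-indexed `image-%09d` / `label-%09d` entries); check `tools/create_lmdb_dataset.py` for the exact behavior.

```python
# Hedged sketch of writing an STR LMDB in the assumed CLOVA-style layout
# (keys: num-samples, image-%09d, label-%09d; indices are 1-based).
import lmdb

def write_str_lmdb(output_path, samples):
    """samples: iterable of (encoded_image_bytes, label_string) pairs."""
    env = lmdb.open(output_path, map_size=1 << 32)  # reserve up to ~4 GB
    with env.begin(write=True) as txn:
        count = 0
        for i, (image_bytes, label) in enumerate(samples, start=1):
            txn.put(b"image-%09d" % i, image_bytes)            # raw encoded image
            txn.put(b"label-%09d" % i, label.encode("utf-8"))  # transcription
            count = i
        txn.put(b"num-samples", str(count).encode("utf-8"))

# Example usage with a single hypothetical crop:
with open("word_1.png", "rb") as f:
    write_str_lmdb("data/training/MJ/MJ_train", [(f.read(), "hello")])
```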
- Evaluation datasets: can be downloaded from BaiduNetdisk (passwd: 1dbv), GoogleDrive, and parseq/Datasets.md.
  - ICDAR 2013 (IC13_857)
  - ICDAR 2013 (IC13_1015)
  - ICDAR 2015 (IC15_1811)
  - ICDAR 2015 (IC15_2077)
  - IIIT5K Words (IIIT)
  - Street View Text (SVT)
  - Street View Text-Perspective (SVTP)
  - CUTE80 (CUTE)
  - ArT (ArT)
  - COCOv1.4 (COCO)
  - Uber (Uber)
The structure of the `data` folder is as follows:
```
data
├── evaluation
│   ├── CUTE80
│   ├── IC13_857
│   ├── IC13_1015
│   ├── IC15_1811
│   ├── IC15_2077
│   ├── IIIT5k_3000
│   ├── SVT
│   ├── SVTP
│   ├── ArT
│   ├── COCOv1.4
│   └── Uber
└── training
    ├── MJ
    │   ├── MJ_test
    │   ├── MJ_train
    │   └── MJ_valid
    ├── Real
    └── Union14M-L
```
At this time, both the training datasets and the evaluation datasets are LMDB datasets.
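Assuming the standard CLOVA-style key layout described above, a quick sanity check that a split under `data/` is readable might look like this (the path and key layout are assumptions):

```python
# Hedged sanity check for one LMDB split (path and key layout are assumptions).
import io
import lmdb
from PIL import Image

env = lmdb.open("data/evaluation/CUTE80", readonly=True, lock=False)
with env.begin() as txn:
    n = int(txn.get(b"num-samples"))
    img = Image.open(io.BytesIO(txn.get(b"image-000000001")))
    label = txn.get(b"label-000000001").decode("utf-8")
print(n, img.size, label)  # sample count, (width, height), transcription
```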
Available model weights:
| Tiny | Small | Base |
|---|---|---|
| MGP-STR-Tiny | MGP-STR-Small | MGP-STR-Base |
The performance of the reproduced pretrained models is summarized as follows:
| Model | Output | IC13_857 | SVT | IIIT | IC15_1811 | SVTP | CUTE | AVG |
|---|---|---|---|---|---|---|---|---|
| MGP-STR-tiny | Char | 94.6 | 91.2 | 94.1 | 82.7 | 84.7 | 81.9 | 89.7 |
| | BPE | 86.3 | 86.4 | 83.6 | 73.2 | 80.0 | 70.1 | 80.7 |
| | WP | 53.7 | 43.1 | 56.8 | 52.0 | 39.2 | 44.1 | 51.9 |
| | Fuse | 95.3 | 92.1 | 94.3 | 83.1 | 85.9 | 81.6 | 90.2 |
| MGP-STR-small | Char | 95.8 | 91.8 | 95.0 | 84.9 | 86.7 | 87.5 | 91.2 |
| | BPE | 97.0 | 94.0 | 88.8 | 80.5 | 87.4 | 84.0 | 87.8 |
| | WP | 79.5 | 76.4 | 77.0 | 70.2 | 72.7 | 64.9 | 74.7 |
| | Fuse | 96.6 | 93.2 | 95.1 | 86.4 | 88.1 | 88.5 | 92.0 |
| MGP-STR-base | Char | 96.3 | 93.0 | 95.9 | 86.0 | 87.4 | 88.5 | 92.2 |
| | BPE | 97.1 | 95.1 | 90.0 | 82.1 | 89.9 | 84.0 | 89.1 |
| | WP | 97.8 | 94.6 | 89.1 | 81.6 | 90.4 | 81.6 | 88.6 |
| | Fuse | 97.6 | 94.9 | 96.2 | 87.9 | 90.2 | 89.2 | 93.4 |
Available model weights:
| Base | Large |
|---|---|
| MGP-STR-Base | MGP-STR-Large |
The performance of the reproduced pretrained models is summarized as follows:
| Model | Output | IIIT | SVT | IC13_1015 | IC15_2077 | SVTP | CUTE | ArT | COCO | Uber |
|---|---|---|---|---|---|---|---|---|---|---|
| MGP-STR-base | Char | 98.4 | 98.0 | 98.2 | 89.3 | 96.7 | 98.6 | 84.5 | 78.8 | 87.9 |
| | BPE | 96.6 | 97.5 | 98.0 | 88.0 | 96.1 | 96.8 | 80.7 | 76.6 | 88.0 |
| | WP | 96.5 | 97.2 | 98.3 | 88.1 | 95.3 | 96.8 | 80.9 | 76.1 | 87.8 |
| | Fuse | 98.5 | 98.5 | 98.6 | 89.9 | 97.2 | 98.3 | 84.5 | 79.9 | 89.6 |
| MGP-STR-large | Char | 98.7 | 98.7 | 97.9 | 90.6 | 97.8 | 98.9 | 85.4 | 80.6 | 89.4 |
| | BPE | 97.2 | 97.5 | 97.9 | 89.4 | 97.6 | 97.5 | 82.7 | 78.4 | 89.9 |
| | WP | 97.3 | 98.1 | 97.8 | 89.4 | 97.2 | 97.2 | 83.3 | 78.6 | 89.8 |
| | Fuse | 98.8 | 98.6 | 98.5 | 90.8 | 98.3 | 99.3 | 85.5 | 81.7 | 91.0 |
- Download a pretrained model.
- Add the image files to test into `demo_imgs/`.
- Run `demo.py`:
```
mkdir demo_imgs/attens
CUDA_VISIBLE_DEVICES=0 python3 demo.py --Transformer mgp-str \
  --TransformerModel=mgp_str_base_patch4_3_32_128 --model_dir mgp_str_base.pth --demo_imgs demo_imgs/
```
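As a rough illustration of the input format the flags imply (32x128 RGB, per `--imgH 32 --imgW 128 --rgb` in the training commands below), preprocessing a single image might look like the sketch below; the exact resize and normalization in `demo.py` may differ.

```python
# Hedged sketch: shape a demo image into the (1, 3, 32, 128) batch the model
# expects. The [-1, 1] normalization is an assumption, not taken from demo.py.
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((32, 128)),                # (height, width)
    transforms.ToTensor(),                       # float in [0, 1], shape (3, 32, 128)
    transforms.Normalize([0.5] * 3, [0.5] * 3),  # map to [-1, 1]
])

img = Image.open("demo_imgs/example.png").convert("RGB")
batch = preprocess(img).unsqueeze(0)             # (1, 3, 32, 128)
print(batch.shape)
```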
MGP-STR-base:
```
CUDA_VISIBLE_DEVICES=0 python3 -m torch.distributed.launch --nproc_per_node=1 --nnodes=1 --master_port 29501 train_final_dist.py --train_data data/training \
  --valid_data data/evaluation --select_data MJ-ST \
  --batch_ratio 0.5-0.5 --Transformer mgp-str \
  --TransformerModel=mgp_str_base_patch4_3_32_128 --imgH 32 --imgW 128 \
  --manualSeed=226 --workers=12 --isrand_aug --scheduler --batch_size=100 --rgb \
  --saved_path <path/to/save/dir> --exp_name mgp_str_patch4_3_32_128 --valInterval 5000 --num_iter 2000000 --lr 1
```
MGP-STR-base on a 2-GPU machine:

It is recommended to train larger networks like MGP-STR-Small and MGP-STR-Base on a multi-GPU machine. To keep the global batch size fixed at 100, divide 100 by the number of GPUs and pass the result via the `--batch_size` option. For example, to train MGP-STR-Small on a 2-GPU machine, this would be `--batch_size=50` (see the tiny helper after this paragraph).
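In other words, the per-GPU `--batch_size` times `--nproc_per_node` should reproduce the global batch of 100. A tiny hypothetical helper (not part of this repo):

```python
# Hypothetical helper: split a fixed global batch evenly across GPUs.
def per_gpu_batch(global_batch=100, num_gpus=1):
    assert global_batch % num_gpus == 0, "global batch must split evenly across GPUs"
    return global_batch // num_gpus

print(per_gpu_batch(num_gpus=2))  # 50 -> use --batch_size=50 with --nproc_per_node=2
```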
```
CUDA_VISIBLE_DEVICES=0,1 python3 -m torch.distributed.launch --nproc_per_node=2 --nnodes=1 --master_port 29501 train_final_dist.py --train_data data/training \
  --valid_data data/evaluation --select_data MJ-ST \
  --batch_ratio 0.5-0.5 --Transformer mgp-str \
  --TransformerModel=mgp_str_base_patch4_3_32_128 --imgH 32 --imgW 128 \
  --manualSeed=226 --workers=12 --isrand_aug --scheduler --batch_size=50 --rgb \
  --saved_path <path/to/save/dir> --exp_name mgp_str_patch4_3_32_128 --valInterval 5000 --num_iter 2000000 --lr 1
```
Find the path to the `best_accuracy.pth` checkpoint file (usually in the `saved_path` folder).
```
CUDA_VISIBLE_DEVICES=0 python3 test_final.py --eval_data data/evaluation --benchmark_all_eval --Transformer mgp-str --data_filtering_off --rgb --fast_acc --TransformerModel=mgp_str_base_patch4_3_32_128 --model_dir <path_to/best_accuracy.pth>
```
Illustration of the spatial attention masks of the Character A^3 module, the BPE A^3 module, and the WordPiece A^3 module, respectively.
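For context, here is a hedged sketch of how one might overlay such a spatial attention mask on an input image; the mask shape and file names below are assumptions, not taken from `demo.py` (which appears to save its own visualizations under `demo_imgs/attens`, hence the `mkdir` above).

```python
# Hedged sketch: upsample a spatial attention mask and overlay it on the input
# image. The 8x32 mask shape is an assumption (a 32x128 input with 4x4 patches).
import numpy as np
from PIL import Image

def overlay_attention(image, mask, alpha=0.5):
    # Normalize the mask to [0, 1], render it as a red heat channel, and blend.
    mask = (mask - mask.min()) / (mask.max() - mask.min() + 1e-8)
    heat = Image.fromarray((mask * 255).astype(np.uint8), mode="L").resize(image.size)
    zero = Image.new("L", image.size)
    heat_rgb = Image.merge("RGB", (heat, zero, zero))
    return Image.blend(image.convert("RGB"), heat_rgb, alpha)

img = Image.open("demo_imgs/example.png").convert("RGB")
attn = np.random.rand(8, 32)  # stand-in for one A^3 module's spatial mask
overlay_attention(img, attn).save("demo_imgs/attens/example_attn.png")
```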
This implementation is based on these repositories: ViTSTR, CLOVA AI Deep Text Recognition Benchmark, and TokenLearner.
If you find this work useful, please cite:
```
@inproceedings{ECCV2022mgp_str,
  title={Multi-Granularity Prediction for Scene Text Recognition},
  author={Peng Wang and Cheng Da and Cong Yao},
  booktitle={ECCV},
  year={2022}
}
```
MGP-STR is released under the terms of the Apache License, Version 2.0.
MGP-STR is an algorithm for scene text recognition; the code and models herein, created by the authors from Alibaba, can only be used for research purposes.
Copyright (C) 1999-2022 Alibaba Group Holding Ltd.
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.