This repo contains a comprehensive list of our research works related to scene text detection, spotting, and hierarchical segmentation. For any related questions, please contact Maoyuan Ye at [email protected] . Note that the official repo for "I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection" has been moved to https://github.com/ViTAE-Transformer/I3CL.
1. Hi-SAM: Marrying Segment Anything Model for Hierarchical Text Segmentation [TPAMI'24]
3. DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting [ArXiv'23]
4. DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting [CVPR'23]
5. DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer [AAAI'23]
Maoyuan Ye, Jing Zhang, Juhua Liu, Chenyu Liu, Baocai Yin, Cong Liu, Bo Du, and Dacheng Tao.
Paper | Github Code | BibTex
We propose the first unified model, Hi-SAM, for segmenting text across stroke, word, text-line, and paragraph hierarchies. Hi-SAM also achieves layout analysis as a by-product. Hi-SAM supports both automatic mask generation and interactive promptable mode. Given a single-point prompt, Hi-SAM provides word, text-line, and paragraph masks.
📘 GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching [NeurIPS'24]
Haibin He, Maoyuan Ye, Jing Zhang, Juhua Liu, and Dacheng Tao.
Paper | Github Code | BibTex
We identify a main bottleneck in the state-of-the-art video text spotter: the limited recognition capability. In response to this issue, we propose to efficiently turn an off-the-shelf query-based image text spotter into a specialist on video and present a simple baseline termed GoMatching. We establish the ArTVideo test set for addressing the absence of curved texts in current video text spotting datasets and evaluating the performance of video text spotters on videos with arbitrary-shaped text. ArTVideo contains 20 video clips, featuring 30% curved text approximately.
📘 DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multilingual Text Spotting [ArXiv'23]
Maoyuan Ye∗, Jing Zhang∗, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, and Dacheng Tao.
Paper | Github Code | BibTex
We propose simple yet effective baseline models named DeepSolo and DeepSolo++ for monolingual and multilingual text spotting, taking the merit of a novel explicit point query form that provides a unified and pivotal representation for different tasks. With a single decoder and several simple prediction heads (a linear layer or MLP), we present a much simpler method compared with previous state-of-the-art (SOTA) text spotters. Our method shows several good properties, including 1) simplicity of model structure and training pipeline, 2) efficiency of training and inference, and 3) extensibility of character class, language, and task. Extensive experiments demonstrate that our method has achieved SOTA performance while enjoying some other distinctions, such as the effectiveness on dense and long text, and compatibility to line annotations.
Maoyuan Ye∗, Jing Zhang∗, Shanshan Zhao, Juhua Liu, Tongliang Liu, Bo Du, and Dacheng Tao.
Paper | Github Code | BibTex
End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. In this paper, we present DeepSolo, a succinct DETR-like baseline that lets a single decoder with explicit points solo for text detection and recognition simultaneously and efficiently. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations.
Maoyuan Ye, Jing Zhang, Shanshan Zhao, Juhua Liu, Bo Du, and Dacheng Tao.
Paper | Github Code | BibTex
Recently, Transformer-based methods, which predict polygon points or Bezier curve control points for localizing texts, are popular in scene text detection. However, these methods built upon detection transformer framework might achieve sub-optimal training efficiency and performance due to coarse positional query modeling. In addition, the point label form exploited in previous works implies the reading order of humans, which impedes the detection robustness from our observation. To address these challenges, this paper proposes DPText-DETR, which directly leverages explicit point coordinates to generate position queries and dynamically updates them in a progressive way. Moreover, we present an Enhanced Factorized Self-Attention module to improve the spatial inductive bias of non-local self-attention in Transformer. Furthermore, we design a simple yet effective positional label form to tackle the side effect of the previous form. To further evaluate the impact of different label forms on the detection robustness in real-world scenario, we establish an Inverse-Text test set containing 500 manually labeled images. Extensive experiments prove the high training efficiency, robustness, and state-of-the-art performance of our method on popular benchmarks
📘 I3CL: Intra- and Inter-Instance Collaborative Learning for Arbitrary-shaped Scene Text Detection [IJCV'22]
Bo Du, Jian Ye, Jing Zhang, Juhua Liu, and Dacheng Tao.
Paper | Github Code | BibTex
Existing methods for arbitrary-shaped text detection in natural scenes face two critical issues, i.e., (1) fracture detections at the gaps in a text instance; and (2) inaccurate detections of arbitrary-shaped text instances with diverse background context. To address these issues, we propose a novel method named Intra- and Inter-Instance Collaborative Learning (I3CL). Specifically, to address the first issue, we design an effective convolutional module with multiple receptive fields, which is able to collaboratively learn better character and gap feature representations at local and long ranges inside a text instance. To address the second issue, we devise an instance-based transformer module to exploit the dependencies between different text instances and a global context module to exploit the semantic context from the shared background, which are able to collaboratively learn more discriminative text feature representation. In this way, I3CL can effectively exploit the intra- and inter-instance dependencies together in a unified end-to-end trainable framework. Besides, to make full use of the unlabeled data, we design an effective semi-supervised learning method to leverage the pseudo labels via an ensemble strategy. Without bells and whistles, experimental results show that the proposed I3CL sets new state-of-the-art results on three challenging public benchmarks, i.e., an F-measure of 77.5% on ArT, 86.9% on Total-Text, and 86.4% on CTW-1500. Notably, our I3CL with the ResNeSt-101 backbone ranked the 1st place on the ArT leaderboard.