Current video text spotting methods achieve strong performance when trained with sufficient labeled data. However, labeling data manually is time-consuming and labor-intensive, so low-cost synthetic data is a promising alternative. This paper introduces FlowText, a novel video text synthesis technique that uses optical flow estimation to synthesize large amounts of video text data at low cost for training robust video text spotters. Unlike existing methods that focus on image-level synthesis, FlowText concentrates on synthesizing the temporal information of text instances across consecutive frames using optical flow. This temporal information, covering text movement, distortion, appearance, disappearance, occlusion, and blur, is crucial for accurately tracking and spotting text in video sequences. Experiments show that combining spotters like TransDETR with the proposed FlowText produces remarkable results on various datasets, such as ICDAR2015video and ICDAR2013video.
FlowText is built on the segmentation model Mask2Former, the depth estimation model Monodepth2, the optical flow estimation model GMA, and the synthesis engine SynthText. We use conda to manage dependencies, and our experiments were run with CUDA 11.1. You can specify the cudatoolkit version appropriate for your machine in the requirements.txt file, and then run the following commands to install FlowText:
conda create -n flowtext python=3.8
conda activate flowtext
pip install torch==1.9.0+cu111 torchvision==0.10.0+cu111 torchaudio==0.9.0 -f https://download.pytorch.org/whl/torch_stable.html
git clone https://github.com/callsys/FlowText
cd FlowText
pip install -r requirements.txt
cd segmentation/mask2former/modeling/pixel_decoder/ops/
sh make.sh
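After installation, a quick sanity check can confirm that PyTorch sees your GPU and that the custom ops compiled by make.sh are importable. This is a minimal sketch, not part of the repository; the op module name comes from the Mask2Former build:

```python
# sanity_check.py -- minimal environment check (not part of the FlowText repo)
import torch

print(f"torch {torch.__version__}, CUDA available: {torch.cuda.is_available()}")
assert torch.cuda.is_available(), "FlowText's models expect a CUDA-capable GPU"

# The Mask2Former pixel decoder ships custom CUDA ops compiled by make.sh;
# importing them here verifies that the build step succeeded.
try:
    import MultiScaleDeformableAttention  # built by segmentation/.../ops/make.sh
    print("Custom deformable attention ops built successfully")
except ImportError:
    print("Custom ops not found -- re-run sh make.sh in the ops directory")
```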
To run FlowText, you need to download some files (Google Drive), which mainly contain the font files for the synthesized text, the text source, and the model weights. Once you have downloaded the files, link them to the FlowText directory:
ln -s path/to/FlowText_data FlowText/data
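You can then confirm that the symlink resolves before generating videos. The snippet below is an illustrative check only; the exact contents of the directory depend on the downloaded archive:

```python
# check_data.py -- confirm the data symlink resolves (illustrative only)
import os

data_dir = "data"  # created by: ln -s path/to/FlowText_data FlowText/data
assert os.path.isdir(data_dir), "data/ is missing -- did you create the symlink?"
print("data/ contains:", sorted(os.listdir(data_dir)))
```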
Generate a synthetic video with the demo video assets/demo.mp4 and write the result to assets:
python gen.py
Generate a synthetic video from a given video video.mp4, with frame range start,end,interval, save path save, and random seed seed:
python gen.py --video video.mp4 --range start,end,interval --save save --seed seed
For example:
python gen.py --video assets/demo.mp4 --range 0,400,5 --save assets/result --seed 16
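Since every option is exposed on the command line, batch synthesis is straightforward. Below is a small illustrative driver (not part of the repository) that invokes gen.py over several source videos with different seeds; the video paths are placeholders:

```python
# batch_gen.py -- illustrative batch driver around gen.py (not part of the repo)
import subprocess

# Placeholder source clips; replace with your own videos.
videos = ["clips/street.mp4", "clips/mall.mp4"]

for seed, video in enumerate(videos, start=16):
    save_dir = f"results/{seed}"
    subprocess.run(
        [
            "python", "gen.py",
            "--video", video,
            "--range", "0,400,5",  # frames 0-400, sampled every 5th frame
            "--save", save_dir,
            "--seed", str(seed),
        ],
        check=True,  # stop on the first failing run
    )
```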
The files output by gen.py are organized as follows:
result
|
└─── 00000000.jpg
└─── 00000001.jpg
└─── 00000002.jpg
└─── ......
└─── 00000079.jpg
└─── ann.json
└─── viz.mp4
└─── viz_ann.mp4
where xxx.jpg denotes the synthetic video frames, ann.json is the annotation file, viz.mp4 is the synthetic video, and viz_ann.mp4 is the synthetic video with visualized annotations.
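A quick way to inspect a finished run is to count the rendered frames and peek at the annotation file. The sketch below makes no assumption about the internal schema of ann.json beyond it being valid JSON:

```python
# inspect_result.py -- peek at a FlowText output directory (illustrative only)
import glob
import json
import os

result_dir = "assets/result"  # the --save directory from the example above

frames = sorted(glob.glob(os.path.join(result_dir, "*.jpg")))
print(f"{len(frames)} synthetic frames, first: {frames[0] if frames else 'none'}")

with open(os.path.join(result_dir, "ann.json")) as f:
    ann = json.load(f)
# Print only the top-level structure; the exact schema is defined by gen.py.
print("ann.json top-level type:", type(ann).__name__)
if isinstance(ann, dict):
    print("keys:", list(ann.keys()))
```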
If you use FlowText in your research or wish to refer to the results, please use the following BibTeX entries.
@inproceedings{zhao2023flowtext,
  title={FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation},
  author={Yuzhong Zhao and Weijia Wu and Zhuang Li and Jiahong Li and Weiqiang Wang},
  booktitle={ICME},
  year={2023}
}
@article{zhao2023flowtext_arxiv,
  title={FlowText: Synthesizing Realistic Scene Text Video with Optical Flow Estimation},
  author={Yuzhong Zhao and Weijia Wu and Zhuang Li and Jiahong Li and Weiqiang Wang},
  journal={arXiv preprint arXiv:2305.03327},
  year={2023}
}
Affiliations: University of Chinese Academy of Sciences, Zhejiang University, MMU of Kuaishou Technology
Authors: Yuzhong Zhao ([email protected]), Weijia Wu ([email protected]), Zhuang Li ([email protected]), Jiahong Li ([email protected]), Weiqiang Wang ([email protected])
The code is largely based on SynthText, and the models are borrowed from Mask2Former, Monodepth2, and GMA.
This work is fully supported by MMU of Kuaishou Technology.