detecting text in natural image with connectionist text ... · detecting text in natural image with...
TRANSCRIPT
DDDeeettteeeccctttiiinnnggg TTTeeexxxttt iiinnn NNNaaatttuuurrraaalll IIImmmaaagggeee wwwiiittthhhCCCooonnnnnneeeccctttiiiooonnniiisssttt TTTeeexxxttt PPPrrrooopppooosssaaalll NNNeeetttwwwooorrrkkk Zhi Tian1, Weilin Huang1,2, Tong He1, Pan He1, and Yu Qiao1,3 1 Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China
2 University of Oxford, UK 3 The Chinese University of Hongkong, China
Insight
Motivation●Current bottom-up approaches are complicated, with weak robustness and
reliability, and accumulated errors.● Stat-of-the-art object detectors are powerful, but not accurate for text localisation.
● Fill the gap between general object detection (e.g., RPN [1]) and text detection.
Connectionist Text Proposal Network (CTPN)(CTPN)
CTPN Architecture
CTPN Proposals
●Recurrently connect sequential proposals by BLSTM
● Jointly predict text scores, y-axis coordinates, and refinement offsets
●Detect text in sequences of fine-scale proposals
Recurrent Connectionist Text Proposals
Top: CTPN without recurrent connection. Bottom: with recurrent connection
●RNN layer connects sequential proposals directly in convolutional layer● In-network recurrent architecture is end-to-end trainable
●Detect highly ambiguous text, and reduce false detections considerably
Red Box: with side-refinement. Yellow Box: without side-refinement
●Predict offsets for side-proposals - horizontal sides rectification● Further improve localisation accuracy● Joint predictions - not a post-precessing step
Side-Refinement
R d B ith id fi t Y ll B ith t id fi t
Detecting Text in Fine-Scale Proposals
RPN Proposals CTPN Proposals
● Slide a 3x3 window through Conv5● Text anchors are used for each window● Output a sequence of 16-pixel width proposals
Details:● Improve localisation accuracy●Generalise to multi scales, aspects,
and languages● Using single-scale image
Advantages:
●Encode rich context information
Experimental Results
[1] S. Ren, K. He, R, Girshick, and J. Sun: Faster R-CNN: Towards real-time object detection with region proposals network, NIPS, 2016. Reference:
[2] K. Simonyan, A. Zisserman: Very deep convolutional networks for large-scale image recognition, ICLR, 2015.[3] P. He, W. Huang, Y. Qiao, C. C. Loy, and X. Tang: Reading scene text in deep convolutional sequences, AAAI, 2016.
Zhi Tian, Weilin Huang, Tong He, Pan He, and Yu Qiao European Conference on Computer Vision (ECCV), 2016
Online demo: textdet.com
Summary:
Red Box: CTPN detection. Yellow Box: ground truth
● Trained on 3K images in English and Chinese, generalise well to others (e,g., Korean)● Fine-scale strategy improves Precision, while using RNN increases Recall and Precision● Obtain 0.88 and 0.61 F-measures on the ICDAR 2013 and 2015, respectively● Computationally efficient, with 0.14s/image GPU time (scale=600)● Strong capability for detecting very small-size text