DEEP STRUCTURED OUTPUT LEARNING FOR UNCONSTRAINED TEXT RECOGNITION
Max Jaderberg, Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
Visual Geometry Group, Department of Engineering Science, University of Oxford, UK
TEXT RECOGNITION

Localized text image as input, character string as output.

Example word images: COSTA, DENIM, DISTRIBUTED, FOCAL
TEXT RECOGNITION

Example word image: APARTMENTS

State of the art — constrained text recognition:
‣ word classification [Jaderberg, NIPS DLW 2014]
‣ static N-gram and word language model [Bissacco, ICCV 2013]

But a constrained model cannot handle a random string or a new, unmodeled word.
TEXT RECOGNITION

Unconstrained text recognition — e.g. for house numbers [Goodfellow, ICLR 2014], business names, phone numbers, emails, etc.

Examples: RGQGAN323 (random string), TWERK (new, unmodeled word)
OVERVIEW

• Two models for text recognition [Jaderberg, NIPS DLW 2014]
‣ Character Sequence Model
‣ Bag-of-N-grams Model

• Joint formulation
‣ CRF to construct graph
‣ Structured output loss
‣ Back-propagation for joint optimization

• Experiments
‣ Generalizes to perform zero-shot recognition
‣ Recovers performance when constrained
CHARACTER SEQUENCE MODEL

Deep CNN to encode image; per-character decoder.

[Architecture diagram: input x at 32⨉100⨉1 → 32⨉100⨉64 → 16⨉50⨉128 → 8⨉25⨉256 → 8⨉25⨉512 → 4⨉13⨉512 → 1⨉1⨉4096 → 1⨉1⨉4096]

5 convolutional layers, 2 FC layers, ReLU, max-pooling.
23 output classifiers over 37 classes (0-9, a-z, null).
Fixed 32⨉100 input size — distorts aspect ratio.
[Figure: the CHAR CNN encodes the 32⨉100⨉1 input x into Φ(x); 23 independent classifiers (char 1 … char 23), each a 1⨉1⨉37 softmax over 0-9, a-z and the null class Ø, output P(c1|Φ(x)) … P(c23|Φ(x)).]
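As a concrete sketch, reading a word off these 23 per-position distributions can be done greedily (hypothetical code, not from the paper; `ALPHABET`, the class ordering, and the argmax-per-position rule are assumptions):

```python
import numpy as np

ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyz"  # 36 classes; index 36 = null (Ø)

def greedy_decode(char_probs):
    """char_probs: array of shape (23, 37), one softmax per character position.
    Take the argmax class at each position and drop the null class."""
    out = []
    for p in char_probs:
        k = int(np.argmax(p))
        if k != 36:  # 36 is the null (no-character) class
            out.append(ALPHABET[k])
    return "".join(out)
```

The null class is what lets a fixed 23-position output represent words of variable length.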
BAG-OF-N-GRAMS MODEL

Represent a string by the character N-grams it contains, e.g. for "spires":

1-grams: s, p, i, r, e
2-grams: sp, pi, ir, re, es
3-grams: spi, pir, ire, res
4-grams: spir, pire, ires
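The decomposition above can be sketched in a few lines of Python (a hypothetical helper; `max_n` caps the N-gram order at 4, as on this slide):

```python
def ngrams(word, max_n=4):
    """Collect the unique character N-grams (N = 1..max_n) of a word,
    mirroring the bag-of-N-grams decomposition of "spires" above."""
    grams = set()
    for n in range(1, max_n + 1):
        for i in range(len(word) - n + 1):
            grams.add(word[i:i + n])
    return grams

# ngrams("spires") yields the 17 unique N-grams listed on this slide.
```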
Deep CNN to encode image; N-gram detection vector output. Limited (10k) set of modeled N-grams.

[Figure: the NGRAM CNN maps the 32⨉100⨉1 input through the same conv/FC stack (two 1⨉1⨉4096 FC layers) to a 1⨉1⨉10000 N-gram detection vector; for the word "rake", entries such as ra, ak, ke and rake are active while ab, aba and raze are not.]
JOINT MODEL

[Figure: the CHAR CNN (23 per-character classifiers, 1⨉1⨉37 each) and the NGRAM CNN (1⨉1⨉10000 detection vector) applied to the same 32⨉100⨉1 input.]

Can we combine these two representations?
[Figure: CRF graph over character positions (up to the maximum number of characters), with candidate characters such as a, e, k, q, r at each node. Unary terms come from the CHAR CNN f(x); higher-order edge terms come from the NGRAM CNN g(x).]

Inference by beam search:

w* = argmax_w S(w, x)
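A minimal sketch of how such beam-search decoding might look, assuming per-position character log-probabilities standing in for the CHAR CNN terms and a hypothetical `ngram_score` function standing in for the NGRAM CNN's contribution to S(w, x); this is an illustration, not the paper's exact inference procedure:

```python
import math

def beam_search(char_logp, ngram_score, beam_width=5):
    """char_logp: list of dicts {char: log-prob}, one per position
    (the empty string "" models the null class).
    ngram_score: bonus a string receives from the N-gram model.
    Keeps the beam_width best prefixes at each step."""
    beams = [("", 0.0)]
    for pos in char_logp:
        cand = {}
        for prefix, s in beams:
            for ch, lp in pos.items():
                w = prefix + ch
                cand[w] = max(cand.get(w, -math.inf), s + lp)
        beams = sorted(cand.items(), key=lambda kv: -kv[1])[:beam_width]
    # rescore the surviving candidates with the N-gram contribution
    return max(beams, key=lambda kv: kv[1] + ngram_score(kv[0]))[0]
```

For simplicity this sketch applies the N-gram bonus only to the final candidates; the CRF in the paper folds the edge terms into the search itself.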
STRUCTURED OUTPUT LOSS

The score of the ground-truth word should be greater than or equal to the highest-scoring incorrect word plus a margin:

S(w_gt, x) ≥ μ + S(w, x)  for all w ≠ w_gt

Enforcing this as a soft constraint leads to a hinge loss:

L(x, w_gt) = max(0, μ + max_{w ≠ w_gt} S(w, x) − S(w_gt, x))
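The hinge loss can be sketched over an explicit candidate set (hypothetical code: in the paper the maximum over incorrect words is found by search, not by enumerating a dictionary of scores):

```python
def structured_hinge_loss(scores, gt_word, margin=1.0):
    """Structured output hinge loss:
    max(0, margin + max_{w != w_gt} S(w, x) - S(w_gt, x)).
    scores: dict mapping candidate word -> score S(w, x)."""
    best_wrong = max(s for w, s in scores.items() if w != gt_word)
    return max(0.0, margin + best_wrong - scores[gt_word])
```

The loss is zero exactly when the ground truth beats every rival by at least the margin, so training pushes the score gap open rather than just the ranking.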
EXPERIMENTS
DATASETS
All models trained purely on synthetic data [Jaderberg, NIPS DLW 2014].

Generation pipeline: font rendering → border/shadow & color → composition → projective distortion → natural image blending.

Realistic enough to transfer to testing on real-world images.
Synth90k
Lexicon of 90k words; 9 million images with training and test splits.
Download from http://www.robots.ox.ac.uk/~vgg/data/text/
ICDAR 2003, 2013
Street View Text
IIIT 5k-word
TRAINING
Pre-train the CHAR and NGRAM models independently. Use them to initialize the joint model and continue training jointly.
EXPERIMENTS - JOINT IMPROVEMENT
CHAR: grahaws → JOINT: grahams (GT: grahams)
CHAR: mediaal → JOINT: medical (GT: medical)
CHAR: chocoma_ → JOINT: chocomel (GT: chocomel)
CHAR: iustralia → JOINT: australia (GT: australia)

| Train Data | Test Data | CHAR | JOINT |
| --- | --- | --- | --- |
| Synth90k | Synth90k | 87.3 | 91.0 |
| Synth90k | IC03 | 85.9 | 89.6 |
| Synth90k | SVT | 68.0 | 71.7 |
| Synth90k | IC13 | 79.5 | 81.8 |

The joint model outperforms the character sequence model alone.
JOINT MODEL CORRECTIONS

[Figure: examples where the joint model corrects the CHAR output; edges for spurious characters are down-weighted and edges supported by detected N-grams are up-weighted in the graph.]
EXPERIMENTS - ZERO-SHOT RECOGNITION

| Train Data | Test Data | CHAR | JOINT |
| --- | --- | --- | --- |
| Synth90k | Synth90k | 87.3 | 91.0 |
| Synth90k | Synth72k-90k | 87.3 | - |
| Synth90k | Synth45k-90k | 87.3 | - |
| Synth90k | IC03 | 85.9 | 89.6 |
| Synth90k | SVT | 68.0 | 71.7 |
| Synth90k | IC13 | 79.5 | 81.8 |
| Synth1-72k | Synth72k-90k | 82.4 | 89.7 |
| Synth1-45k | Synth45k-90k | 80.3 | 89.1 |
| SynthRand | SynthRand | 80.7 | 79.5 |

Large drop for the CHAR model when not trained on the test words; the joint model recovers the performance.
EXPERIMENTS - COMPARISON
IC03, SVT, IC13: no lexicon. IC03-Full, SVT-50, IIIT5k-50, IIIT5k-1k: fixed lexicon.

| Model Type | Model | IC03 | SVT | IC13 | IC03-Full | SVT-50 | IIIT5k-50 | IIIT5k-1k |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline | ABBYY | - | - | - | 55.0 | 35.0 | 24.3 | - |
| Language Constrained | Wang, ICCV '11 | - | - | - | 62.0 | 57.0 | - | - |
| Language Constrained | Bissacco, ICCV '13 | - | 78.0 | 87.6 | - | 90.4 | - | - |
| Language Constrained | Yao, CVPR '14 | - | - | - | 80.3 | 75.9 | 80.2 | 69.3 |
| Language Constrained | Jaderberg, ECCV '14 | - | - | - | 91.5 | 86.1 | - | - |
| Language Constrained | Gordo, arXiv '14 | - | - | - | - | 90.7 | 93.3 | 86.6 |
| Language Constrained | Jaderberg, NIPS DLW '14 | 98.6 | 80.7 | 90.8 | 98.6 | 95.4 | 97.1 | 92.7 |
| Unconstrained | CHAR | 85.9 | 68.0 | 79.5 | 96.7 | 93.5 | 95.0 | 89.3 |
| Unconstrained | JOINT | 89.6 | 71.7 | 81.8 | 97.0 | 93.2 | 95.5 | 89.6 |
SUMMARY
• Two models for text recognition

• Joint formulation
‣ Structured output loss
‣ Back-propagation for joint optimization

• Experiments
‣ Joint model improves accuracy on language-based data.
‣ Degrades gracefully on non-language data (the N-gram model doesn't contribute much).
‣ Sets a benchmark for unconstrained accuracy and competes with purely constrained models.