![Page 1: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/1.jpg)
A New Approach for Video Text Detection and Localization
M. Cai, J. Song and M.R. Lyu
VIEW Technologies
The Chinese University of Hong Kong
![Page 2: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/2.jpg)
Related work
Text Area Detection
– Uncompressed domain methods
  • Texture-based
  • Color-based
  • Edge-based
– Compressed domain methods
  • DCT coefficients
  • Number of intra-coded blocks on P- / B- frames
Text String Localization
– Bottom-up scheme
– Top-down scheme
![Page 3: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/3.jpg)
Language-independent characteristics
Contrast
– An adaptive contrast threshold according to the background complexity
Color
– Color bleeding caused by compression
Orientation
– Well-defined size and orientation make the text easy to read
Stationary location
– Text stays at the same location for a certain length of time
![Page 4: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/4.jpg)
Language-dependent characteristics
| Characteristic | English | Chinese |
|---|---|---|
| Stroke density | Roughly similar | Varies dramatically |
| Min(Font size) | 10-pixel high | 20-pixel high |
| Min(Aspect ratio) | Relatively large | Relatively small |
| Stroke direction statistics | Mainly vertical | Vertical, horizontal, left diagonal, right diagonal |
![Page 5: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/5.jpg)
Workflow
Sampling & color space conversion
Multi-frame comparison
Video text detection and localization on every sampled frame
![Page 6: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/6.jpg)
A sequential multi-resolution paradigm
[Figure: sequential multi-resolution pipeline over levels l = 1, 2, …, n-1, n. At each level the original image is scaled to Size / f(l), edge detection produces an edge map, and text area detection followed by text string localization yields text regions. The regions are then scaled back by f(l) (labeled "Size f(l)" on the slide, presumably Size × f(l)) to give the original coordinates of the text regions. After level n, the output is the final text regions with original coordinates.]
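The coordinate bookkeeping in this paradigm can be sketched as follows. This is a minimal illustration only: the slides do not define f(l), so the doubling factor below is an assumption, and `scale_factor`, `level_size`, and `to_original` are hypothetical helper names.

```python
def scale_factor(level):
    # Assumption: f(l) doubles at each level; the slides do not define f(l).
    return 2 ** (level - 1)


def level_size(width, height, level):
    """Image size at a given level, i.e. Size / f(l)."""
    f = scale_factor(level)
    return width // f, height // f


def to_original(box, level):
    """Map a box (x0, y0, x1, y1) found at `level` back to original coordinates."""
    f = scale_factor(level)
    return tuple(v * f for v in box)
```

For example, a region found at level 3 has every coordinate multiplied by 4 under this assumed f(l) before it is merged into the final result.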
![Page 7: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/7.jpg)
Text detection
Edge detection
– Sobel edge detector
Local thresholding
– Adaptive to background complexity
Text-like area recovery
– Enhance the density of text areas
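As a rough illustration of the first step, a Sobel edge map can be computed in plain Python as below. The use of |Gx| + |Gy| as the edge strength and the zeroed border are assumptions; the slides only name the Sobel detector.

```python
def sobel_edge_map(img):
    """Edge strength |Gx| + |Gy| for a 2-D list of gray levels.

    Border pixels are left at 0 (an assumption made for simplicity).
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]) \
               - (img[y - 1][x - 1] + 2 * img[y][x - 1] + img[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]) \
               - (img[y - 1][x - 1] + 2 * img[y - 1][x] + img[y - 1][x + 1])
            out[y][x] = abs(gx) + abs(gy)
    return out
```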
![Page 8: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/8.jpg)
Local Thresholding
Use a small kernel (gray) to scan the whole edge map row by row.
In the bigger window surrounding the kernel, check the background type: “Clear” or “Noisy”.
For a Clear background, determine the local threshold from the low part of the edge strength histogram in the bigger window; for a Noisy background, determine it from the high part.
Figure panels:
(a) Concentric kernel and window (kernel height h, window height 3h)
(b) A window on the multi-line text area and the horizontal projection in it (values P1 … P3h)
(c) Local threshold selection: histogram of count against edge strength (0 to MAX), divided into a low part and a high part
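The threshold selection in panel (c) might look like the sketch below. The slides do not say how the histogram is split or how the threshold is derived from the selected part, so the midpoint split and the mean rule are assumptions, and the Clear/Noisy decision is passed in as a flag rather than computed.

```python
def local_threshold(strengths, noisy, max_strength=255):
    """Pick a threshold from the low or high part of the edge strengths in a window.

    strengths: edge strengths of the pixels in the bigger window.
    noisy: True for a "Noisy" background, False for a "Clear" one.
    """
    mid = max_strength / 2  # assumed boundary between "low part" and "high part"
    if noisy:
        part = [s for s in strengths if s > mid]
    else:
        part = [s for s in strengths if 0 < s <= mid]
    if not part:
        return mid  # assumed fallback when the selected part is empty
    return sum(part) / len(part)  # assumed rule: mean of the selected part
```

A Noisy window therefore gets a higher threshold than a Clear one, which matches the intent of adapting to background complexity.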
![Page 9: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/9.jpg)
Thresholding result comparison
Figure panels: Video image | Local thresholding results | Global thresholding results
![Page 10: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/10.jpg)
Text-like area recovery
Labeling: Classify the current edge pixels as “TEXT” or “NON_TEXT” based on their local density.
Recovery/Suppression:
– Bring back neighboring lower-strength edge pixels of the TEXT edge pixels.
– The NON_TEXT edge pixels are suppressed.
Figure panels: Before recovery | After recovery
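The labeling and recovery/suppression steps could be sketched as follows. The neighborhood radius, the density cutoff `density_min`, and the lower strength cutoff `low_thresh` are all hypothetical parameters; the slides give no values or exact density measure.

```python
def recover_text_like(strength, mask, density_min=3, low_thresh=1, radius=1):
    """strength: 2-D edge strengths; mask: 2-D 0/1 map after local thresholding."""
    h, w = len(mask), len(mask[0])

    def neighbors(y, x):
        for ny in range(max(0, y - radius), min(h, y + radius + 1)):
            for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                if (ny, nx) != (y, x):
                    yield ny, nx

    # Labeling: an edge pixel is TEXT if enough of its neighbors are edge pixels.
    text = [[mask[y][x] == 1
             and sum(mask[ny][nx] for ny, nx in neighbors(y, x)) >= density_min
             for x in range(w)] for y in range(h)]

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if text[y][x]:
                out[y][x] = 1  # keep TEXT edge pixels
            elif (not mask[y][x] and strength[y][x] >= low_thresh
                  and any(text[ny][nx] for ny, nx in neighbors(y, x))):
                out[y][x] = 1  # recover a weaker edge pixel next to TEXT
            # NON_TEXT edge pixels stay 0, i.e. they are suppressed
    return out
```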
![Page 11: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/11.jpg)
Coarse-to-fine Text localization
Projection-based top-down localization.
To handle complex text layout.
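Flowchart, reconstructed from the extracted labels (the branch order is inferred):
– Initialization: the whole edge map is the only region in the processing array.
– If the array is empty, terminate. Otherwise, pop the first region from the processing array.
– Horizontal projection: if the region is divisible (Y), add each sub-region to the processing array.
– If it is not divisible (N), apply vertical projection: if the region is divisible (Y), add each sub-region to the processing array.
– Indivisible regions (N): check the aspect ratio. Regions that pass (Y) are added to the resulting text regions; the others (N) are discarded as false regions.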
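A minimal sketch of projection-based top-down localization is given below. It assumes a binary edge map, treats any row or column with no edge pixels as a gap, tries horizontal projection before vertical projection, and uses a hypothetical `min_aspect` (width / height) test; none of these details are specified on the slide.

```python
from collections import deque


def runs(profile):
    """Return [start, end) index runs where the projection profile is non-zero."""
    out, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            out.append((start, i))
            start = None
    if start is not None:
        out.append((start, len(profile)))
    return out


def split(edge, box, axis):
    """Split box (x0, y0, x1, y1) along gaps in its projection profile."""
    x0, y0, x1, y1 = box
    if axis == "h":  # horizontal projection: edge count per row
        profile = [sum(edge[y][x0:x1]) for y in range(y0, y1)]
        return [(x0, y0 + a, x1, y0 + b) for a, b in runs(profile)]
    # vertical projection: edge count per column
    profile = [sum(edge[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    return [(x0 + a, y0, x0 + b, y1) for a, b in runs(profile)]


def localize(edge, min_aspect=1.0):
    h, w = len(edge), len(edge[0])
    queue = deque([(0, 0, w, h)])  # the whole edge map is the only region
    results = []
    while queue:  # terminate when the processing array is empty
        box = queue.popleft()
        for axis in ("h", "v"):
            parts = split(edge, box, axis)
            if parts != [box]:  # divisible: queue the sub-regions
                queue.extend(parts)
                break
        else:  # indivisible in both directions: check the aspect ratio
            x0, y0, x1, y1 = box
            if (x1 - x0) / (y1 - y0) >= min_aspect:
                results.append(box)
    return results
```

Regions that contain no edge pixels produce no sub-regions and simply drop out of the queue, so the loop always terminates.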
![Page 12: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/12.jpg)
Localization steps
[Figure: four panels, (1) to (4), showing the successive localization steps]
![Page 13: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/13.jpg)
Experimental results
![Page 14: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/14.jpg)
Experimental results
![Page 15: A New Approach for Video Text Detection and Localization](https://reader036.vdocument.in/reader036/viewer/2022062321/56813432550346895d9b2154/html5/thumbnails/15.jpg)
Performance statistics
Statistics of 10 news videos:
Processing time per frame: 0.25 s (PIII 1G CPU)
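$$\text{Detection rate} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{ground truth text regions})} = 93.6\%$$

$$\text{Detection accuracy} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{all detected text regions})} = 87.2\%$$

$$\text{Localization accuracy} = \frac{\mathrm{Area}(\text{detected text regions}) \cap \mathrm{Area}(\text{ground truth text regions})}{\mathrm{Area}(\text{ground truth text regions})} > 90\%$$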
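The three ratios on this slide can be computed as in the sketch below. The function names are hypothetical, boxes are assumed to be axis-aligned (x0, y0, x1, y1) tuples, and the localization measure assumes the numerator is the overlap between the detected and ground truth areas.

```python
def detection_rate(num_correct, num_ground_truth):
    """Correctly detected regions over ground truth regions."""
    return num_correct / num_ground_truth


def detection_accuracy(num_correct, num_detected):
    """Correctly detected regions over all detected regions."""
    return num_correct / num_detected


def localization_accuracy(detected, truth):
    """Overlap area of one detected box with its ground truth box, over the truth area."""
    dx0, dy0, dx1, dy1 = detected
    tx0, ty0, tx1, ty1 = truth
    overlap_w = max(0, min(dx1, tx1) - max(dx0, tx0))
    overlap_h = max(0, min(dy1, ty1) - max(dy0, ty0))
    return (overlap_w * overlap_h) / ((tx1 - tx0) * (ty1 - ty0))
```

With made-up counts, 9 correct detections out of 10 ground truth regions and 12 detections in total would give a detection rate of 0.9 and a detection accuracy of 0.75.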