detection and extraction of artificial text for semantic indexing

1/25

Detection and Extraction of Artificial Text for Semantic Indexing

Laboratoire Reconnaissance de Formes et Vision

Bât. Jules Verne, INSA de Lyon

69621 Villeurbanne cedex, France

January 9th, 2002

Dagstuhl Seminar on Content-Based Image and Video Retrieval

Christian Wolf and Jean-Michel Jolion

This presentation can be downloaded from:

http://rfv.insa-lyon.fr/~wolf/presentations

2/25

Plan of the presentation

Introduction (6 slides)

Detection and tracking (3 slides)

Enhancement and binarization of the text boxes (4 slides)

Experiments and results (2 slides)

Open problems (9 slides)

Conclusion and Outlook (1 slide)

Slides: 25

This work resulted in a patent submitted by France Télécom on May 23rd, 2001 under the reference FR 01 06776.

Introduction | Detection | Enh/Binarization | Exp. Results | Open problems | Conclusion

3/25

Content based image retrieval

[Diagram labels: Example image, Similarity function, Result, Indexing phase]


4/25

Similarity measures

[Figure: example images labeled "similar", "similar", and "not similar"]


5/25

Indexing using Text

[Diagram labels: Key word ("Patrick Mayhew"), Keyword-based search, Result, Indexing phase]

Indexed text: "Patrick Mayhew", "Min. chargé de l´irlande de Nord", "ISRAEL", "Jerusalem", "montage", "T.Nouel", ...


6/25

Video properties

[Figure annotations: 80 px, 12 px, 8 px]


7/25

Text extraction: general scheme

Processing chain, from input to output:

Video

Detection of the text in single frames

Tracking

Image enhancement - Multiple frame integration

Segmentation/Binarization

OCR

Output: "EVENEMENT", "ACTU", "SPELEOS", "Gouffre Berger (Isére)", "aujourd'hui", "France 3 Alpes", "un spéléologue sauveteur"


8/25

Text detection by accumulation of horizontal gradients (LeBourgeois, 1997).

Justification: Text forms a regular texture containing vertical edges which are aligned horizontally.

Post processing by mathematical morphology.
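The accumulation step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the window width of 15 pixels, and the relative threshold are assumptions, and the morphological post-processing is left out.

```python
import numpy as np

def accumulated_horizontal_gradient(gray, window=15):
    """Sum the absolute horizontal gradient over a horizontal window.

    `window` (in pixels) is an illustrative choice, not a value from the slides.
    """
    gray = np.asarray(gray, dtype=np.float64)
    grad = np.zeros_like(gray)
    # Forward difference along x; the last column stays at zero.
    grad[:, :-1] = np.abs(np.diff(gray, axis=1))
    kernel = np.ones(window)
    # Box filter along each row: text rows accumulate many vertical edges.
    return np.apply_along_axis(
        lambda row: np.convolve(row, kernel, mode="same"), 1, grad
    )

def detect_text_mask(gray, window=15, rel_threshold=0.5):
    """Mark pixels whose accumulated gradient reaches a fraction of the maximum."""
    acc = accumulated_horizontal_gradient(gray, window)
    if acc.max() == 0:
        return np.zeros(acc.shape, dtype=bool)
    return acc >= rel_threshold * acc.max()
```

A region of closely spaced vertical strokes yields a high response along its whole width, while flat regions and isolated edges stay low. The image is assumed to be wider than the window.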


9/25

Detection in video sequences

Detection per single frame: list of rectangles per frame

Tracking: keeping track of text occurrences

Suppression of false alarms

Image enhancement: multiple frame integration

[Diagram axes: text occurrences over frame number (time)]
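One way to turn per-frame rectangle lists into text occurrences is sketched below. The slides do not state the matching criterion, so the overlap measure, the thresholds `min_overlap` and `min_length`, and all names are assumptions made for illustration. Occurrences that are too short are dropped as false alarms.

```python
def overlap_ratio(a, b):
    """Intersection area divided by the smaller rectangle area.

    Rectangles are (x0, y0, x1, y1) tuples.
    """
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    if w <= 0 or h <= 0:
        return 0.0
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return w * h / min(area_a, area_b)

def track(frames, min_overlap=0.5, min_length=3):
    """Group rectangles into occurrences; `frames` holds one rectangle list per frame."""
    finished, active = [], []
    for t, rects in enumerate(frames):
        unmatched = list(rects)
        still_active = []
        for occ in active:
            # Greedy match: take the detection overlapping most with the last rectangle.
            best = max(unmatched,
                       key=lambda r: overlap_ratio(occ["rect"], r),
                       default=None)
            if best is not None and overlap_ratio(occ["rect"], best) >= min_overlap:
                unmatched.remove(best)
                occ["rect"], occ["last"] = best, t
                still_active.append(occ)
            else:
                finished.append(occ)  # occurrence ended in the previous frame
        for r in unmatched:
            still_active.append({"rect": r, "first": t, "last": t})
        active = still_active
    finished.extend(active)
    # Suppress false alarms: keep only occurrences that are stable over time.
    return [o for o in finished if o["last"] - o["first"] + 1 >= min_length]
```

Each returned occurrence records its first and last frame, which is the information needed to select the frames for multiple frame integration.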


10/25

Image enhancement

Super-resolution (interpolation)

Multiple frame integration: averaging

Integration of multiple frames to create a single image of higher quality.

[Figure labels: M1, M2, M3, M4]

An additional weight is included in the interpolation scheme, which decreases the weights of temporal outlier pixels.
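The effect of down-weighting temporal outliers can be sketched with a weighted temporal average. The weight function below is an assumption chosen for illustration (the slides do not give the formula), and the sketch covers only the temporal integration, not the super-resolution interpolation.

```python
import numpy as np

def integrate_frames(frames, eps=1e-6):
    """Weighted temporal average of registered text boxes.

    `frames` has shape (T, H, W). A pixel far from the temporal mean,
    measured in temporal standard deviations, receives a small weight.
    The weight 1 / (1 + |I - mean| / std) is a hypothetical choice.
    """
    stack = np.asarray(frames, dtype=np.float64)
    mean = stack.mean(axis=0)
    std = stack.std(axis=0)
    weights = 1.0 / (1.0 + np.abs(stack - mean) / (std + eps))
    return (weights * stack).sum(axis=0) / weights.sum(axis=0)
```

Compared with a plain average, a single frame in which the background behind the text changes abruptly pulls the result less far away from the stable value.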


11/25

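Binarization

Niblack:

$$T = m + k \cdot s$$

Sauvola et al.:

$$T = m \cdot \left(1 - k \cdot \left(1 - \frac{s}{R}\right)\right)$$

Our method:

$$I : C_L > a \, (C_{max} - C_F)$$

$$T = (1 - a) \, m + a \, M + a \, \frac{s}{R} \, (m - M)$$

$C_L = \frac{m - I}{s}$: contrast in the center of the image

$C_{max} = \frac{m - M}{s}$: the maximum local contrast

$C_F = \frac{m - M}{R}$: the contrast of the window

m: mean of the window

s: standard deviation of the window

k: parameter

R: dynamics of the gray values of the image

M: minimum gray value of the image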
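The three thresholds can be compared numerically with a few scalar functions. This is a sketch under assumptions: the default values k = -0.2 (Niblack), k = 0.5 (Sauvola), and a = 0.5 are common choices rather than values stated on the slide, the function names are hypothetical, and the window statistics m and s are taken as given with s > 0.

```python
def niblack_threshold(m, s, k=-0.2):
    """Niblack: T = m + k * s."""
    return m + k * s

def sauvola_threshold(m, s, k=0.5, R=128.0):
    """Sauvola et al.: T = m * (1 - k * (1 - s / R))."""
    return m * (1.0 - k * (1.0 - s / R))

def contrast_threshold(m, s, R, M, a=0.5):
    """Contrast-based threshold: T = (1 - a) m + a M + a (s / R) (m - M)."""
    return (1.0 - a) * m + a * M + a * (s / R) * (m - M)

def is_text_by_contrast(I, m, s, R, M, a=0.5):
    """Decision rule C_L > a (C_max - C_F), written out with the contrasts."""
    c_local = (m - I) / s    # contrast of the pixel against the window mean
    c_max = (m - M) / s      # maximum local contrast
    c_window = (m - M) / R   # contrast of the window
    return c_local > a * (c_max - c_window)
```

Multiplying the decision rule by s and solving for I gives I < T with the contrast-based threshold, so `is_text_by_contrast` and a comparison against `contrast_threshold` agree for dark text on a bright background.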


12/25

Binarization methods: examples

Original image

Fisher

Fisher (windowed)

Yanowitz B.

Niblack

Sauvola et al.

Our method


13/25

Binarization using a priori knowledge

Bayesian MAP estimation using prior knowledge on the spatial relationships in the image, modeled as a Markov random field.


(In collaboration with David Doermann from the Language and Media Processing Laboratory of the University of Maryland)
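To make the idea of MAP estimation with a Markov random field prior concrete, the sketch below uses a deliberately simple model that is not the one from the presentation: a Gaussian likelihood per class, an Ising-type prior on 4-connected neighbors, and iterated conditional modes (ICM) as an approximate optimizer. All parameter values and names are assumptions.

```python
import numpy as np

def icm_binarize(gray, mu_text=0.0, mu_bg=255.0, sigma=50.0, beta=1.5, n_iter=5):
    """Approximate MAP labeling (True = text) under an Ising prior, via ICM."""
    gray = np.asarray(gray, dtype=np.float64)
    height, width = gray.shape
    means = np.array([mu_bg, mu_text])  # label 0 = background, 1 = text
    # Data term: negative Gaussian log-likelihood per label, shape (2, H, W).
    data = (gray[None, :, :] - means[:, None, None]) ** 2 / (2.0 * sigma ** 2)
    labels = np.argmin(data, axis=0)    # maximum-likelihood initialization
    for _ in range(n_iter):
        for y in range(height):
            for x in range(width):
                costs = data[:, y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < height and 0 <= nx < width:
                        # Prior term: pay beta for disagreeing with a neighbor.
                        costs[1 - labels[ny, nx]] += beta
                labels[y, x] = int(np.argmin(costs))
    return labels.astype(bool)
```

The prior removes isolated pixels that a purely pixel-wise threshold would label as text, while compact dark regions keep their label.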

14/25

5 different MPEG 1 videos of resolution 384x288.

62 minutes

93000 frames

413 text appearances


15/25

Detection and OCR results

Detection results

DETECTION        Count     %
Pred. Text         301  93.5
Pred. Non-Text      21
Total              322

Positives          350
False alarms       947
Logos               75
Scene text          72
Pos+Log+Scene      497  34.4
Total             1444
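The two percentages in the detection table follow directly from the counts. The short check below reproduces them; reading the first figure as a detection rate over ground-truth text and the second as the share of useful detections is an interpretation, not a statement from the slide.

```python
def percentage(part, total):
    """Share of `part` in `total`, in percent."""
    return 100.0 * part / total

# Counts copied from the detection table.
pred_text, pred_non_text = 301, 21
positives, false_alarms, logos, scene_text = 350, 947, 75, 72

text_total = pred_text + pred_non_text          # 322
accepted = positives + logos + scene_text       # 497
all_detections = accepted + false_alarms        # 1444

detection_rate = round(percentage(pred_text, text_total), 1)
useful_share = round(percentage(accepted, all_detections), 1)
print(detection_rate, useful_share)  # 93.5 34.4
```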

OCR results, classified by binarization method

Input  Bin. method     Recall  Precision    Cost
AIM2   Niblack           67.4       87.5   499.0
       Sauvola R=128     53.8       87.6   616.5
       R=ad              75.0       87.8   384.5
       R=ad, shift       78.4       90.4   344.5
AIM3   Niblack           92.5       78.1   196.0
       Sauvola R=128     69.9       89.6   206.0
       R=ad              85.3       92.5   110.0
       R=ad, shift       96.2       95.3    51.0
AIM4   Niblack           78.5       92.0   252.0
       Sauvola R=128     48.6       87.7   490.5
       R=ad              69.8       84.8   360.5
       R=ad, shift       80.1       90.4   211.5
AIM5   Niblack           62.1       71.4   501.5
       Sauvola R=128     66.7       89.3   324.5
       R=ad              64.8       90.1   328.0
       R=ad, shift       69.0       91.0   294.5
Total  Niblack           73.1       82.6  1448.5
       Sauvola R=128     58.4       88.5  1637.5
       R=ad              73.0       88.4  1183.0
       R=ad, shift       79.6       91.5   901.5


[Figure panels: True pos., False pos., True neg., False neg.]

16/25

Open questions

Scene text (general orientations, deformations)

Moving text


17/25

What is scene text?

[Diagram labels: Video frames, Frames containing scene text, Frames containing artificial text]

We do not have enough information about the importance of text in the destination domain. How many frames contain text, and how many of them contain scene text?

18/25

Detection: from artificial text to scene text

Several constraints have to be removed when moving from artificial text to scene text:

The constraints on temporal stability need to be abandoned or at least softened (no initial frame integration).

Text can be aligned in any orientation (creation of an oriented feature in multiple directions, similar to invariant features).

Contrast is possibly lower because scene text is not designed to be read easily (is detection of unreadable text necessary?).


19/25

Text models

Simple models: sets of edges or vertical strokes...

+ Generalize well, respond to many kinds of text

- Many false alarms

Complex models: templates, probabilistic models (MRF)...

+ Powerful, fewer false alarms

- Do not generalize well

Assumptions are necessary (on the font, size, style, contrast, color, length, etc.) but not sufficient.

Main problem: Distinction between characters and structures similar to text according to the chosen model.


20/25


Sven Dickinson: evolution of models

21/25

What is text?

Whatever model we choose, we cannot detect/recognize all kinds of text without solving the general image understanding problem. The best thing we can do is to include richer features into the detection process: a composite model for text.

Structural analysis (e.g. detection and recognition of characters by strokes). Very hard and very unlikely to work in the case of noisy images, low resolutions and difficult fonts.

Statistical modeling of text features (e.g. by learning techniques). Problem: robust detection needs large neighborhood sizes, which lead to a combinatorial explosion.

E.g.: Texture based methods for small text and segmentation + perceptual grouping, structural methods for big text.


22/25

Learning techniques: pros and cons

Bibliography:

Learning directly the gray levels of the input image (Jung 2001)

Learning features, i.e. coefficients of the Haar wavelet (Li and Doermann 2000) or edge strength (Lienhart 2000)

+ Learning is an easy way to handle the complexity of text.

- Text can appear in videos in many different fonts, sizes, styles, colors, orientations etc. Learning all the different forms may not be feasible.


23/25

Color processing for detection?

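[Figure panels: Original image, Sobel on grayscale image, Sobel on L*u*v* image]

Sobel kernel on the grayscale image:

$$\begin{pmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{pmatrix}$$

On the L*u*v* image, the row weights $(1, 2, 1)$ are kept and the signed difference is replaced by the Euclidean distance between the two neighboring color vectors:

$$D = D_{euclid}(I_{x-1,0}, I_{x+1,0})$$

Saturating distance or non-saturating distance? Reflection processing?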
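A color variant of the horizontal Sobel operator can be sketched by replacing the signed gray-level difference with the Euclidean distance between the left and right color vectors, weighted over three rows with 1, 2, 1. This is one reading of the slide, not a confirmed implementation; the function name is hypothetical, and the input is assumed to be already converted to L*u*v* (the conversion itself is not shown).

```python
import numpy as np

def color_sobel_horizontal(img):
    """Horizontal edge strength of a color image of shape (H, W, C).

    Border pixels are left at zero.
    """
    img = np.asarray(img, dtype=np.float64)
    # Euclidean distance between the left and right neighbors of each interior column.
    dist = np.linalg.norm(img[:, 2:, :] - img[:, :-2, :], axis=2)  # shape (H, W - 2)
    out = np.zeros(img.shape[:2])
    # Vertical smoothing with the Sobel weights 1, 2, 1 on interior rows.
    out[1:-1, 1:-1] = dist[:-2] + 2.0 * dist[1:-1] + dist[2:]
    return out
```

An edge between two regions of equal lightness but different chromaticity produces no response on the lightness channel alone, but a clear response when all three channels are used. The distance here does not saturate; a saturating variant would clip `dist` at a maximum value.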


24/25

Tracking of moving scene text

Do we detect the text in single frames (as for artificial text), or do we treat the video stream as a whole?

Single frames: Multiple frame integration of moving text needs robust registration of the text boxes in different frames (e.g. rough segmentation into text and background pixels before the registration of the text pixels only). Robust methods, which are able to track objects in clutter, are needed.

Detection of moving objects, e.g. by optical flow, spatio-temporal methods.

Mosaicing techniques can be employed for image enhancement.


25/25

Conclusion and Outlook

We developed a system for detection, tracking, enhancement and binarization of artificial text in videos.

The total recognition rate for artificial text is surprisingly high, given the quality of the text, but not yet good enough for indexing purposes.

The remaining problems in text extraction seem to be typical for applications in visual information management: we went as far as we could with low-level features, but we cannot take the necessary step to semantic information. What is text? Possible definition: text is what a human or an OCR system can recognize as text.

We have to include as much a priori knowledge as possible into the process.

