Recognition of Multi-Fonts Character in Early-Modern Printed Books
Chisato Ishikawa(1), Naomi Ashida(1)*, Yurie Enomoto(1), Masami Takata(1), Tsukasa Kimesawa(2) and Kazuki Joe(1)
(1) Nara Women’s University, Japan
(2) National Diet Library, Japan
* Currently works for Mitsubishi Electric Co.




Contents

• Introduction
• Multi-font character recognition
  – Feature extraction from character images
  – Learning method for features
• Experiments
  – Improvement of pre-process
• Conclusions and future work


Introduction
• The Digital Library from the Meiji Era (supported by the National Diet Library in Japan)
  – Digital archive: books published in the Meiji and Taisho eras (1868-1926)
  – The digital data are open to the public on the project Web site
[Figure: screenshots of the top page (with search box) and the data viewer]


Introduction
• Main bodies of books: image data
  – Full-text search and text functions: not supported
  – Conversion from image data to text data is needed
• Existing OCRs are not applicable
  – Too many kinds of fonts
  – Existence of old characters
  – Very noisy images
• Our goal: development of an OCR for multi-font characters in early-modern printed books


Flow of OCR
Input image data → Character image data X → Pre-process → Preprocessed image data X’ → Feature extraction → Feature vector v → Recognition → Recognized class no. n
(Figure labels: "Character image" on the input side; "Contents of this presentation" marks the part of the flow covered in this talk.)


Flow of our OCR: Pre-process
• Noise reduction
• Normalization
  – Removing margin
  – Normalizing size
  – Normalizing position
Stage in the flow: Character image data X → Pre-process → Preprocessed image data X’


Flow of our OCR: Feature Extraction
• Extraction of a PDC (Peripheral Direction Contributivity) feature
• The PDC feature reflects four properties of character-lines:
  – Direction
  – Connectivity
  – Relative position
  – Complexity
Stage in the flow: Preprocessed image data X’ → Feature extraction → Feature vector v


PDC Feature
• Scanning the target character image from 8 directions
  – Reflects the position of character-lines
• Scanning the lengths of connected black dots in 4 directions
  – Reflects the direction and the connectivity of character-lines
• Direction contributivity is calculated from the scanned lengths
  – A vector of 4 elements
[Figure: scanning-lines over the target character image]
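As a rough illustration of the last two bullets, the sketch below measures the run of connected black dots through one pixel in four directions and normalises the four lengths into a 4-element vector. The slide only says the value is "calculated from the scanned lengths", so the normalisation used here (each length divided by the Euclidean norm of all four, the form commonly given for direction contributivity) is an assumption. The function names and the choice of the four directions (horizontal, vertical, two diagonals) are likewise hypothetical.

```python
import math

# Four line orientations (assumed): horizontal, vertical, and the two diagonals.
DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def run_length(image, row, col, d_row, d_col):
    """Length of the connected run of black dots (1s) through (row, col)
    along the orientation (d_row, d_col), counted in both senses."""
    if image[row][col] != 1:
        return 0
    length = 1
    for sign in (1, -1):
        r, c = row + sign * d_row, col + sign * d_col
        while 0 <= r < len(image) and 0 <= c < len(image[0]) and image[r][c] == 1:
            length += 1
            r, c = r + sign * d_row, c + sign * d_col
    return length


def direction_contributivity(image, row, col):
    """4-element vector: each run length divided by the norm of all four (assumed form)."""
    lengths = [run_length(image, row, col, dr, dc) for dr, dc in DIRECTIONS]
    norm = math.sqrt(sum(l * l for l in lengths))
    if norm == 0:
        return [0.0, 0.0, 0.0, 0.0]  # white pixel: no contribution
    return [l / norm for l in lengths]


# A horizontal stroke 5 dots long and 1 dot thick.
stroke = [
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
]
vector = direction_contributivity(stroke, 1, 2)
print([round(v, 3) for v in vector])
```

For the horizontal stroke the first element dominates, which is how the vector encodes the direction of a character-line.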


PDC Feature
• Direction contributivity is taken at the 1st, 2nd and 3rd depth along each scanning-line
• Deeper levels' direction contributivities
  – are not 0 → complex character-lines
  – are 0 → simple character-lines
• Reflects the complexity of character-lines
[Figure: base image and the 1st-, 2nd- and 3rd-depth images along a scanning-line; black dot: direction contributivity is not 0]


PDC Feature
• PDC feature vector: the set of direction contributivities
  – Direction: 8
  – Resolution: 16
  – Depth: 3
  – Direction contributivity elements: 4
• Dimension number = Direction (8) × Resolution (16) × Depth (3) × Element (4) = 1536


Flow of our OCR: Recognition
• Recognition by an SVM (Support Vector Machine)
  – High generalization capability
  – Independence from the number of dimensions of the target vectors
  – Low calculation cost
Stage in the flow: Feature vector v → Recognition → Recognized class no. n


Experiments
• Experimental sample data
  – Character images obtained from “The Digital Library from the Meiji Era”
  – Target characters:

Class no.   Character   Number of samples
No.1        行          102
No.2        三          103
No.3        人          134
No.4        生          100
No.5        十          100
No.6        來          135
No.7        小          100
No.8        中          209
No.9        年          153
No.10       彼          100


Examples of Sample Images

[Figure: sample images for No.1 (行), No.2 (三), No.3 (人), No.4 (生), No.5 (十), No.6 (來), No.7 (小), No.8 (中), No.9 (年), No.10 (彼)]
• The images are monochrome or 256-level grayscale


Experiments Description (1/2)
Conversion of character images to feature vectors
– Pre-process
  1. Binarization (threshold: 128)
  2. Noise reduction (median filter, filter size: 3×3)
  3. Normalization (removing margin and scaling to 128×128)
– Extraction of PDC features
  • Vector dimension: 1536
[Figure: character image → pre-process (steps 1 to 3) → extraction of PDC features → PDC feature]
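The three pre-process steps can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the authors' implementation: dark ink on light paper (so levels below the threshold become black dots), pixels outside the image treated as white in the median filter, and nearest-neighbour scaling that stretches the cropped character to 128×128 without preserving its aspect ratio. All function names are hypothetical.

```python
from statistics import median


def binarize(gray, threshold=128):
    """Black dot (1) where the grey level is below the threshold, else white (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in gray]


def median_filter_3x3(binary):
    """3x3 median filter; pixels outside the image count as white (0)."""
    height, width = len(binary), len(binary[0])
    out = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            window = [
                binary[rr][cc] if 0 <= rr < height and 0 <= cc < width else 0
                for rr in range(r - 1, r + 2)
                for cc in range(c - 1, c + 2)
            ]
            out[r][c] = int(median(window))
    return out


def remove_margin(binary):
    """Crop to the bounding box of the black dots."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    if not rows:
        return binary  # blank image: nothing to crop
    return [row[cols[0]:cols[-1] + 1] for row in binary[rows[0]:rows[-1] + 1]]


def scale_nearest(binary, size=128):
    """Nearest-neighbour scaling to size x size."""
    height, width = len(binary), len(binary[0])
    return [
        [binary[r * height // size][c * width // size] for c in range(size)]
        for r in range(size)
    ]


def preprocess(gray):
    """Binarization, noise reduction, then normalization, in the order of the slide."""
    return scale_nearest(remove_margin(median_filter_3x3(binarize(gray))))


# Toy input: white page, a dark 4x4 block, and one isolated dark pixel of noise.
gray = [[255] * 10 for _ in range(10)]
for r in range(3, 7):
    for c in range(2, 6):
        gray[r][c] = 0
gray[0][9] = 0

processed = preprocess(gray)
print(len(processed), len(processed[0]))
```

The isolated pixel disappears in the median filter, so it does not enlarge the bounding box used for margin removal.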


Experiments Description (2/2)
Learning and evaluation of a recognition model
– Learning a recognition model by giving training samples to an SVM
  • Training samples: 50 samples for each character
  • SVM used: LIBSVM
  • Parameters of the SVM: tuned by grid search
– Evaluation of the recognition model by using test samples
[Figure: training samples → PDC feature → SVM (LIBSVM) learning, with parameters tuned by grid search → recognition model; test samples → PDC feature → evaluation]
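A minimal sketch of this learn-then-evaluate loop is given below. It uses scikit-learn's `SVC`, whose implementation is based on LIBSVM, together with `GridSearchCV`; this is a substitute for the LIBSVM tools the slides mention, not the authors' setup. The RBF kernel, the parameter grid, the 5-fold cross-validation and the synthetic stand-in data (3 classes of 8-dimensional vectors instead of 10 classes of 1536-dimensional PDC vectors) are all assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Stand-in data: well-separated Gaussian clusters, one per class (not PDC features).
n_classes, n_train, n_test, dim = 3, 50, 20, 8


def make_samples(per_class):
    features = np.vstack([
        rng.normal(loc=10.0 * k, scale=1.0, size=(per_class, dim))
        for k in range(n_classes)
    ])
    labels = np.repeat(np.arange(n_classes), per_class)
    return features, labels


train_x, train_y = make_samples(n_train)  # 50 training samples per class
test_x, test_y = make_samples(n_test)

# Grid over the two RBF-kernel parameters; the candidate values are illustrative.
param_grid = {"C": [1, 10, 100], "gamma": [0.001, 0.01, 0.1]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(train_x, train_y)  # picks the best pair by cross-validation, then refits

recognition_rate = search.score(test_x, test_y)  # accuracy on held-out test samples
print(search.best_params_, f"{100 * recognition_rate:.1f}%")
```

Keeping the test samples out of the grid search, as here, is what makes the reported rate an estimate for unseen data.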


Result of Recognition Model Evaluation
• Recognition rate: 97.8%
  cf. Recognition rate by a neural network (NN): 77.6%
  Computation time, SVM : NN = 1 : 7.7
※ We presented this result at the 73rd Mathematical Modeling and Problem Solving (MPS) in March 2009.

Class   Character   Number of test samples   Errors   Recognition rate [%]
1       行          52                       0        100.0
2       三          53                       1        98.1
3       人          84                       1        98.8
4       生          50                       0        100.0
5       十          50                       1        98.0
6       来          85                       1        98.8
7       小          50                       0        100.0
8       中          159                      12       92.5
9       年          103                      0        100.0
10      彼          50                       0        100.0
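The headline figure can be checked against the per-class counts. The short calculation below assumes that the overall recognition rate is the fraction of correctly recognized test samples pooled over all ten classes; the variable names are illustrative.

```python
# (class, number of test samples, errors) copied from the table above.
results = [
    (1, 52, 0), (2, 53, 1), (3, 84, 1), (4, 50, 0), (5, 50, 1),
    (6, 85, 1), (7, 50, 0), (8, 159, 12), (9, 103, 0), (10, 50, 0),
]

total_samples = sum(samples for _, samples, _ in results)
total_errors = sum(errors for _, _, errors in results)
overall_rate = 100 * (total_samples - total_errors) / total_samples

per_class_rate = {
    cls: round(100 * (samples - errors) / samples, 1)
    for cls, samples, errors in results
}

print(total_samples, total_errors, round(overall_rate, 1))  # 736 16 97.8
print(per_class_rate[8])  # 92.5
```

Class 8 (中) accounts for 12 of the 16 errors, which is why it is the focus of the later discussion.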


Recognition Error in Result
• Some images are not recognized because of
  – noise, or
  – similarity of character forms
• Errors caused by noise can be diminished by an improvement of the pre-process


Improvement of Pre-process
• Pre-process
  1. Binarization
     • Threshold: t=128 (slide annotation: Discriminant Analysis)
  2. First noise reduction
     • Median filter, filter size: 3×3
  3. Normalization
  4. Second noise reduction
     • Based on the estimated width of character-lines
  5. Normalization
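The slide carries the label "Discriminant Analysis" next to the binarization step without further detail. If it refers to threshold selection by discriminant analysis (widely known as Otsu's method) as an alternative to the fixed t=128, which is an assumption, the idea can be sketched as follows: choose the grey level that maximises the between-class variance of the two pixel groups it separates. The function name and the toy histogram are hypothetical.

```python
def otsu_threshold(histogram):
    """Return the grey level t that maximises the between-class variance
    when pixels with level < t form one class and the rest the other."""
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))
    best_t, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(1, len(histogram)):
        # Move grey level t-1 into the "low" class.
        weight_low += histogram[t - 1]
        sum_low += (t - 1) * histogram[t - 1]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue  # one class is empty: variance undefined
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_t, best_variance = t, variance
    return best_t


# Toy 256-level histogram: dark ink around level 40, light paper around level 200.
histogram = [0] * 256
histogram[40] = 300
histogram[200] = 700
t = otsu_threshold(histogram)
print(t)
```

Unlike a fixed threshold, the selected level adapts to each image, which may help with pages whose paper has darkened unevenly.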


Noise Reduction based on Estimation of Character-line Width
• Estimation of line width by using the largest connected component X
  – $l_{p_n}$: length of the shortest connected line passing through pixel $p_n$ ($p_n \in X$)
  – Estimated width of character-line: $b$ = median value of $l_{p_n}$
• Elimination of connected components whose area is smaller than $2b^2$
[Figure: target image, the largest component X, and pixels $p_i$, $p_j$ with their line lengths $l_{p_i}$, $l_{p_j}$]
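The procedure on this slide can be sketched in plain Python as follows. Several details are assumptions rather than facts from the slides: components are found with 8-connectivity, the "shortest connected line" through a pixel is taken as the shortest run of black dots over four orientations (horizontal, vertical, two diagonals), and the area threshold is read as 2b². Function names are hypothetical.

```python
from collections import deque
from statistics import median

NEIGHBOURS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
LINE_DIRECTIONS = [(0, 1), (1, 0), (1, 1), (1, -1)]


def connected_components(image):
    """List of components, each a list of (row, col) black dots (8-connectivity)."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    components = []
    for r in range(height):
        for c in range(width):
            if image[r][c] != 1 or seen[r][c]:
                continue
            seen[r][c] = True
            queue, component = deque([(r, c)]), []
            while queue:
                cr, cc = queue.popleft()
                component.append((cr, cc))
                for dr, dc in NEIGHBOURS:
                    nr, nc = cr + dr, cc + dc
                    inside = 0 <= nr < height and 0 <= nc < width
                    if inside and image[nr][nc] == 1 and not seen[nr][nc]:
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            components.append(component)
    return components


def shortest_run(image, row, col):
    """Shortest run of black dots through (row, col) over the four orientations."""
    height, width = len(image), len(image[0])
    runs = []
    for dr, dc in LINE_DIRECTIONS:
        length = 1
        for sign in (1, -1):
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < height and 0 <= c < width and image[r][c] == 1:
                length += 1
                r, c = r + sign * dr, c + sign * dc
        runs.append(length)
    return min(runs)


def reduce_noise(image):
    """Estimate line width b on the largest component, drop components with area < 2*b^2."""
    components = connected_components(image)
    if not components:
        return image, 0
    largest = max(components, key=len)
    b = median(shortest_run(image, r, c) for r, c in largest)
    cleaned = [[0] * len(image[0]) for _ in image]
    for component in components:
        if len(component) >= 2 * b ** 2:  # smaller components are treated as noise
            for r, c in component:
                cleaned[r][c] = 1
    return cleaned, b


image = [[0] * 12 for _ in range(12)]
for r in range(2, 5):
    for c in range(1, 11):
        image[r][c] = 1  # stroke: 3 dots thick, 10 dots long
for r, c in [(8, 8), (8, 9), (9, 8), (9, 9)]:
    image[r][c] = 1  # small speck of noise
cleaned, b = reduce_noise(image)
print(b, sum(map(sum, cleaned)))
```

Using the median of the per-pixel lengths keeps the width estimate stable even though pixels near stroke ends report shorter runs.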



Result of Improved Pre-process Adoption
• Recognition rate: 97.8% → 99.0%

                                               Recognition rate [%]
Class   Character   Number of unknown input data   Previous result   New noise reduction   Errors
1       行          52                             100.0             100.0                 0
2       三          53                             98.1              98.1                  1
3       人          84                             98.8              100.0                 0
4       生          50                             100.0             100.0                 0
5       十          50                             98.0              100.0                 0
6       来          85                             98.8              100.0                 0
7       小          50                             100.0             100.0                 0
8       中          159                            92.5              96.9                  5
9       年          103                            100.0             99.0                  1
10      彼          50                             100.0             100.0                 0


Discussion
Case: better recognition (Error → Correct)
• Previous pre-process: Error; improved pre-process: Correct
• Quality of test samples is improved
• Quality of training samples is improved → more efficient recognition model


Discussion
Case: unchanged (Error → Error)
• Residual noise
  – Previous pre-process: Error; improved pre-process: Error
  – The noise is connected to a character-line
• Similar form of character: the sample resembles no.5 (十) → Error
  – Compared with the major form of no.8 (中), the sample is shorter → similar to a form with one horizontal line


Discussion
Case: worse recognition (Correct → Error)
• Previous pre-process: Correct; improved pre-process: Error
• With the improved pre-process, fewer training samples have missing lines
  → The recognition rate for data with missing lines becomes low
[Figure: pre-processed images, previous vs. improved]


Conclusions and Future Work
• Recognition of multi-font characters in early-modern printed books
  – Proposal of our method, which uses the PDC feature and an SVM
  – Experiments applying our method
    • The results show a high recognition rate
    • Improvement of noise reduction leads to a higher recognition rate
      – Recognized 10 kinds of characters at 99% accuracy
• Future work
  – Dealing with many kinds of characters
    • Recognition of similar-form characters → hierarchical recognition method
  – Automation of extracting character areas


Thank you for your attention!