Adaptive method for the digitization of mathematical journals

Masakazu Suzuki
Kyushu University, Professor emeritus
Kyushu Institute of Systems, Informatics and Nanotechnologies (ISIT)
InftyProject (http://www.inftyproject.org)
Science Accessibility Net (http://www.sciaccess.net)

IMU-WDML Workshop, June 2, 2012, Washington DC

Plan of the talk

- About InftyProject
- Making Rich Digital Mathematical Libraries: Process Flow and Technical Components
- Current State of the Art with Demonstration
- Adaptive Method: Character and Symbol Recognition, Logical Structure Analysis
- Future Problems

Section 1 About InftyProject

InftyProject

R&D on Math Information Systems

Main system development:
- InftyReader: Math OCR software
- InftyEditor: Editor of math documents; data conversion (XML, LaTeX, MathML, PDF, etc.)
- ChattyInfty: InftyEditor + speech output, authoring of DAISY

URL:
- Project site: http://www.inftyproject.org/en/
- Release & user support of Infty products: Science Accessibility Net, http://www.sciaccess.net/

“InftyReader” OCR software for math documents

Demonstration. Recognition result samples (YMJ, AJM).

Section 2 Toward Rich DML

Different levels in digitization

- Level 1: Bitmap images of printed materials, e.g. GIF, TIFF
- Level 2: Searchable digitized document, e.g. PDF with hidden text, Bib Link
- Level 3: Structured accessible document, e.g. XML, HTML (+MathML), LaTeX, …
- Level 4: (Partially) executable document, e.g. Mathematica, Maple
- Level 5: Formally presented document, e.g. Mizar, OMDoc

WDML achieved this level.

Infty : Level 1 → Level 3
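
To make the Level 3 target concrete, here is a minimal sketch of what a structured representation of a formula can look like. It builds presentation MathML for x² + 1 with Python's standard `xml.etree.ElementTree`. The helper `power_plus_one` is a hypothetical name chosen for illustration; it is not part of any Infty tool, and the actual output of InftyReader may be organized differently.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"


def power_plus_one(base: str, exponent: str) -> str:
    """Return presentation MathML for base^exponent + 1 (hypothetical helper)."""
    math = ET.Element("math", {"xmlns": MATHML_NS})
    row = ET.SubElement(math, "mrow")
    # <msup> takes two children: the base, then the superscript.
    sup = ET.SubElement(row, "msup")
    ET.SubElement(sup, "mi").text = base
    ET.SubElement(sup, "mn").text = exponent
    ET.SubElement(row, "mo").text = "+"
    ET.SubElement(row, "mn").text = "1"
    return ET.tostring(math, encoding="unicode")


markup = power_plus_one("x", "2")
```

Unlike a Level 1 bitmap, this markup keeps the base and the exponent as separate nodes, so it can be searched, spoken, or converted to Braille.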

Process Flow of Digitization

- Input: PDF, image file (TIF); texts & math symbols
- Layout analysis: segmentation of areas (text, table, figure)
- Recognition per line (character recognition, math/text segmentation, math structure analysis)
- Document structure analysis (chapter, section, itemize, theorem description, references, etc.)
- XML
- Outputs: LaTeX, XHTML+MathML, PDF, Braille codes, etc.
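
The flow above can be sketched as a chain of stages that each enrich a shared document object. This is only an illustration of the ordering: the class `Document`, the stage functions, and the file name `page_001.tif` are hypothetical, and the stage bodies are placeholders rather than real recognition code.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Document:
    """State passed between stages (all fields are illustrative)."""
    source: str                                          # e.g. a TIF or PDF path
    areas: List[str] = field(default_factory=list)       # text / table / figure
    lines: List[str] = field(default_factory=list)       # recognized line content
    structure: List[str] = field(default_factory=list)   # logical labels per line
    log: List[str] = field(default_factory=list)         # names of completed stages


Stage = Callable[[Document], Document]


def layout_analysis(doc: Document) -> Document:
    # Placeholder: a real system would segment the page image here.
    doc.areas = ["text", "table", "figure"]
    return doc


def line_recognition(doc: Document) -> Document:
    # Placeholder: character recognition, math/text segmentation and
    # math structure analysis would run on each line of the text areas.
    doc.lines = [f"<line from {a} area>" for a in doc.areas if a == "text"]
    return doc


def structure_analysis(doc: Document) -> Document:
    # Placeholder: give every recognized line a logical role.
    doc.structure = ["paragraph" for _ in doc.lines]
    return doc


def run_pipeline(doc: Document, stages: List[Stage]) -> Document:
    """Apply the stages in order and record which ones ran."""
    for stage in stages:
        doc = stage(doc)
        doc.log.append(stage.__name__)
    return doc


STAGES = [layout_analysis, line_recognition, structure_analysis]
result = run_pipeline(Document(source="page_001.tif"), STAGES)
```

Keeping each stage as a separate function makes it possible to rerun only the later stages after a manual correction of an earlier result.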

Layout Analysis

- Input: PDF, image file (TIF); (pre-processing)
- Segmentation of areas (text, table, figure)
- Recognition per line (character recognition, math structure analysis)
- Document structure analysis (title, chapter, section, itemize, theorem, bib, etc.)
- XML
- Outputs: LaTeX, HTML, human-readable TeX, Braille codes, speech data, etc.

Layout Analysis

- Input: PDF, image file (TIF)
- Segmentation of areas; table analysis
- Recognition per line (character recognition, math structure analysis)
- Document structure analysis (title, chapter, section, itemize, theorem, bib, etc.)
- XML
- Outputs: LaTeX, HTML, human-readable TeX, Braille codes, speech data, etc.

Document Structure Analysis

Detection of: Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.

- Currently, naïve methods are used: line classification using a combination of features such as character size, font information (bold, italic, small capitals), keywords, indentation, and lines starting with numbers or a special pattern (e.g. “[Num]”).
- A stronger method is required in actual digitization.

Hyperlinks inside the document.
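
A naïve line classifier of the kind described above might look like the following sketch. The feature set (character size, bold flag, leading keyword, leading number, "[Num]" pattern) comes from the slide; the class `Line`, the regular expressions, the rule order, and the 1.4 size ratio are assumptions made for illustration, not the rules InftyReader actually uses.

```python
import re
from dataclasses import dataclass


@dataclass
class Line:
    text: str
    char_size: float      # typical character height on this line
    bold: bool = False


BIB_ITEM = re.compile(r"^\[\d+\]")                # e.g. "[12] M. Suzuki, ..."
NUMBERED = re.compile(r"^\d+(\.\d+)*\.?\s+\S")    # e.g. "2.1 Preliminaries"
KEYWORD = re.compile(r"^(Theorem|Lemma|Proposition|Corollary)\b")


def classify(line: Line, body_size: float) -> str:
    """Assign a logical label to one line using hand-written rules."""
    text = line.text.strip()
    if BIB_ITEM.match(text):
        return "BibItem"
    keyword = KEYWORD.match(text)
    if keyword and line.bold:
        return keyword.group(1)
    if NUMBERED.match(text) and line.bold:
        # "2.1" has an inner dot, "3." does not once the trailing dot is removed.
        number = text.split()[0].rstrip(".")
        return "Subsection" if "." in number else "Section"
    if line.char_size >= 1.4 * body_size:          # assumed threshold
        return "Title"
    return "Text"
```

Rules like these are brittle across journals, which is why the thresholds are better estimated from the target document than fixed in advance.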

Section 3 Current State of the Art with Demonstration

“InftyReader” OCR software for math documents

Demonstration:
- Math recognition (already shown)
- Multilingual recognition ← FineReader OCR plug-in (Czech paper result sample)
- Matrices
- Layout analysis, table recognition
- Logical structure analysis

Section 4 Large Volume Recognition

Large Volume Digitization

The adaptive method is efficient:

- Get information from the target document: character features, math formula parameters, layout parameters, etc.
- Recognition using that information, either directly or after manual checking (semi-automatic)

Large Volume Digitization

Process flow using BatchInfty & InftyReader pro:

1. Noise reduction, centering, etc.
2. Trial recognition
3. Extraction of features:
   - Document style → Logical structure analysis
   - Character cluster images → OCR engine
4. Recognition & verification
5. PDF output
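
As an illustration of the "document style" half of step 3, the sketch below estimates two layout parameters from the lines found during trial recognition: the body character size (median) and the left margin (most frequent indentation). The function `estimate_style`, the choice of median and mode as estimators, and the sample numbers are assumptions for illustration; the slides do not say how BatchInfty derives these parameters.

```python
from collections import Counter
from statistics import median
from typing import Dict, List, Tuple


def estimate_style(lines: List[Tuple[float, float]]) -> Dict[str, float]:
    """Estimate layout parameters from (char_size, indent) pairs.

    The median ignores the few large title or heading lines, and the most
    common indentation ignores indented paragraph starts and displays.
    """
    body_size = median(size for size, _ in lines)
    margins = Counter(round(indent) for _, indent in lines)
    left_margin = margins.most_common(1)[0][0]
    return {"body_size": body_size, "left_margin": left_margin}


# Hypothetical measurements from a trial pass: (character size, indent).
trial_lines = [(10.0, 50.0), (10.0, 50.0), (10.0, 62.0), (14.0, 50.0), (10.0, 50.0)]
style = estimate_style(trial_lines)
```

Parameters estimated this way can then be passed to the logical structure analysis in step 4 in place of fixed, journal-independent thresholds.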

Large Volume Digitization

Generation of a UserDictionary, adapting the OCR engine to the target documents:

- Trial recognition
- Clustering of the character images
- CharDataA: centroids of the clusters of text characters with reliable scores (automatic)
- CharDataB: centroids of the clusters of math symbols and text characters with low scores (manual correction)
- User Dictionary of Character Features

Show CharImageManager
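
The dictionary generation can be sketched as follows: cluster the character samples from trial recognition, take each cluster's centroid, and route clusters to CharDataA or CharDataB by score. The slides do not name the clustering algorithm or the feature space, so this sketch substitutes a simple greedy "leader" clustering on small feature tuples; `Sample`, `Cluster`, `radius`, and `min_score` are hypothetical names and parameters.

```python
from dataclasses import dataclass, field
from math import dist
from typing import List, Tuple

Vector = Tuple[float, ...]


@dataclass
class Sample:
    features: Vector       # feature vector of one character image
    label: str             # label assigned by trial recognition
    score: float           # recognition confidence in [0, 1]
    is_math: bool = False  # True if the sample came from a math area


@dataclass
class Cluster:
    members: List[Sample] = field(default_factory=list)

    def centroid(self) -> Vector:
        n = len(self.members)
        columns = zip(*(m.features for m in self.members))
        return tuple(sum(col) / n for col in columns)


def cluster_samples(samples: List[Sample], radius: float) -> List[Cluster]:
    """Greedy leader clustering: join the first same-label cluster in range."""
    clusters: List[Cluster] = []
    for s in samples:
        for c in clusters:
            same_label = c.members[0].label == s.label
            if same_label and dist(c.centroid(), s.features) <= radius:
                c.members.append(s)
                break
        else:
            clusters.append(Cluster([s]))
    return clusters


def split_dictionary(clusters: List[Cluster], min_score: float):
    """Return (CharDataA, CharDataB) as lists of (label, centroid)."""
    char_data_a, char_data_b = [], []
    for c in clusters:
        reliable = all(m.score >= min_score and not m.is_math for m in c.members)
        target = char_data_a if reliable else char_data_b
        target.append((c.members[0].label, c.centroid()))
    return char_data_a, char_data_b
```

A short usage example with made-up samples:

```python
samples = [
    Sample((0.0, 0.0), "a", 0.95),
    Sample((0.2, 0.0), "a", 0.97),
    Sample((5.0, 5.0), "α", 0.99, is_math=True),
    Sample((9.0, 9.0), "e", 0.40),
]
a, b = split_dictionary(cluster_samples(samples, radius=1.0), min_score=0.9)
# "a" lands in CharDataA; the math symbol and the low-score "e" go to CharDataB.
```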

Section 5 Open Problems

Problems

Further improvement of character/symbol recognition and structure analysis of math expressions:

- Touching characters, broken characters in math areas
- Low-resolution images
- Different typefaces (old books, typewriter prints, etc.)
- Bold character detection in math areas

Problems

Logical structure analysis (automatic detection and manual correction) is still difficult!

- Title, Author, Section, Subsection, Itemization, BibItem, Theorem, Lemma, etc.
- Hyperlinks inside the document.
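
Once BibItems have been detected, internal hyperlinks for citations are the easy part, as this small sketch suggests. It wraps in-text "[n]" citations in HTML links when a bibliography entry with the same number exists. The function `link_citations` and the anchor scheme `#bib<n>` are hypothetical; the hard problem named on the slide, detecting the items reliably in the first place, is not addressed here.

```python
import re
from typing import List

CITATION = re.compile(r"\[(\d+)\]")


def link_citations(body_lines: List[str], bib_lines: List[str]) -> List[str]:
    """Turn "[n]" in the body into links to bibliography entry n."""
    # Numbers that actually start a bibliography line, e.g. "[2] B. Author, ...".
    known = {
        m.group(1)
        for line in bib_lines
        if (m := CITATION.match(line.strip()))
    }

    def to_link(match: re.Match) -> str:
        number = match.group(1)
        if number not in known:
            return match.group(0)   # leave unresolved citations untouched
        return f'<a href="#bib{number}">[{number}]</a>'

    return [CITATION.sub(to_link, line) for line in body_lines]
```

Leaving unresolved citations unchanged keeps the output faithful to the scan and makes missing bibliography entries easy to spot during manual correction.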

Problems

Detection/analysis of figures and tables:

- Detection of characters in figures
- Table structure analysis (sample)
- Diagram recognition
  - Chemical diagrams ← recently being developed worldwide
  - (Commutative diagrams) ← future work

Conclusion

- InftyProject: research group on math information processing.
- Demo (InftyReader) to show the current state of the art.
- Adaptive method to improve character and symbol recognition (CharImageManager).
- Proposed some problems to be attacked.

“INFTY” an integrated OCR for mathematical documents

Thank you!

Masakazu Suzuki [email protected] (current address) [email protected] (permanent address)

InftyProject: http://www.inftyproject.org/en/
Science Accessibility Net: http://www.sciaccess.net/en/