

Page 1: Document image analysis for metadata extraction George R. Thoma, Ph.D. Chief, CEB Lister Hill Center U.S. National Library of Medicine

Document image analysis for metadata extraction

George R. Thoma, Ph.D.
Chief, CEB
Lister Hill Center
U.S. National Library of Medicine

Page 2: National Library of Medicine

• World’s largest medical library
• U.S. govt. agency, part of NIH
• Collects all significant material in biomedicine and health care
• Database producer (MEDLINE, GenBank, ...)
• Research centers
• Extramural grants

Page 3: NLM-Mission

Develop and provide biomedical information to:
– The clinical and research communities (e.g., MEDLINE)
– Public health and public safety agencies (e.g., HSDB)
– The lay public (e.g., MEDLINEplus)

• Develop and provide tools for biomedical research (e.g., WebMIRS, x-ray atlas, genomic data analysis)
• Develop and provide tools for informatics research (e.g., UMLS, vocabulary tools, knowledge representation, medical ontologies)
• Conduct in-house R&D
• Sponsor extramural research (Telemedicine, Visible Human Project, Next Generation Internet, Medical Informatics, ...)
• Provide fellowships for faculty, students

Two important missions:
1. Create citations to the biomedical journal literature for MEDLINE®
2. Preservation

Page 4: R&D: why and how

Aim: to introduce appropriate technologies
– To support NLM’s services and functions
– To create and disseminate information for biomedical communities: research, clinical and informatics
– To provide information for the lay public

How:
– Identifying suitable domains
– Designing/developing prototype systems
– Using these as testbeds to address key questions
– Implementing/deploying operational systems

Page 5: Preservation of Digital Materials

• Technical obsolescence of storage media and supporting hardware and software
• Ever-increasing volume of endangered digital materials
• Critical component: metadata for future access and migration to newer formats
• Avoid labor cost of manual metadata entry

Page 6: Candidates for Digital Preservation (NLM collections)

• Profiles in Science
– Archival collections of leaders in biomedical research and public health
– TIFF, PDF, HTML, audio, video files
• PubMed Central
– Digital archive of life sciences journals
– XML, PDF, TIFF
– Contains about 170 journal titles

Page 7: Goal: System for Preservation of Electronic Resources (SPER)

• Automated metadata extraction
– Technical metadata from file header
– Descriptive metadata (heuristic rules and machine learning techniques)
– Minimum human interaction
• Conform to standards (DC, NISO, METS)
• Intelligent file migration
– Lossy or lossless migration
– When to migrate

[Figure: SPER block diagram. Components: GUIs, Ingest, Metadata extraction, Migration, Storage, Search. Data labels: metadata and files; query results.]

Page 8: Our Problem

• Extracting descriptive metadata (e.g., article title, authors, affiliation, page numbers, journal name, publication date, publisher, databank accession numbers, grant numbers, etc., plus abstract)
• Example: Grubb RL. Hemodynamic factors in the prognosis of symptomatic carotid occlusion. JAMA. 1998. 280 (12) 1055-60…

Page 9: In other words…

Grubb RL. Hemodynamic factors in the prognosis of symptomatic carotid occlusion. JAMA. 1998. 280 (12) 1055-60…

Page 10: Automated Metadata Extraction Methods

• TIFF → OCR → segment → label physical zones (using DIAU techniques)
• Use heuristic rules related to layout (geometric) and context (key words)
– Currently in production (MARS* for citation generation from journal articles)
• Use the learned rules or models
– Experiments

*Medical Article Records System: automatic extraction of article title, author names, affiliations, and abstract from scanned journals, to populate MEDLINE.

Page 11: Why learned rules or models? Diverse layout styles

• Style differs in different journals
• Style varies in different issues of a journal
• Manual rule or model creation expensive
• Automated rule or model learning from previous results
• Use style related features

Page 12: Significant Features (examples)

• Geometric
– Absolute location and size of zones (x1, y1, x2, y2)
– Relative location of zones (top, bottom, left of, right of)
– Page margin and gap between zones
• Contextual
– Font size (12pt, 20pt)
– Font attribute (bold, italic)
– Key words (University, city, department …)
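A minimal sketch of how the geometric and contextual features listed above might feed a heuristic labeling rule. The `Zone` class, the keyword tuple, and every threshold are illustrative assumptions, not the rules used in MARS.

```python
from dataclasses import dataclass

@dataclass
class Zone:
    # Geometric features: absolute location and size of the zone.
    x1: int
    y1: int
    x2: int
    y2: int
    # Contextual features: font size, font attribute, and OCR text.
    font_size: float
    bold: bool
    text: str

# Hypothetical key words hinting at an affiliation zone.
AFFILIATION_KEYWORDS = ("university", "department")

def label_zone(zone: Zone, page_height: int) -> str:
    """Toy rule set: key words first, then font size and position."""
    text = zone.text.lower()
    if any(word in text for word in AFFILIATION_KEYWORDS):
        return "affiliation"
    # Assumed thresholds: a large font in the top 40% of the page.
    if zone.font_size >= 16 and zone.y1 < 0.4 * page_height:
        return "title"
    return "other"

title = Zone(100, 200, 900, 260, 20.0, True,
             "Hemodynamic factors in the prognosis of symptomatic carotid occlusion")
affil = Zone(100, 400, 900, 430, 10.0, False,
             "Department of Neurology, Washington University")
print(label_zone(title, 3300), label_zone(affil, 3300))
```

Hand-written rules of this kind are what the later slides aim to replace with learned rules or models, since the thresholds differ between journals.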

Page 13: MARS

[Figure: MARS system diagram (journal flow). Labeled components: CheckIn, Scanner (two), OCR, Autozone, Autolabel, Autoformat, Lexicons/rules, ConfidenceEdit, PatternMatch, Edit (two), EditDiff, Reconcile, Upload, MEDLINE, Database, DCMS, Admin, Indexing.]

Page 14: Image

[Figure: original bitmap]

Page 15: Image processed by Autozone

[Figure panels: original bitmap; zoned]

Page 16: Features for Autozone

For each text-line:
• Median character height and width
• Average character height
• Maximum character height
• Average height of lower case characters (without ascenders or descenders)
• Average character confidence value
• Number of alphanumeric characters
• Aspect ratio of line (height/width)
• % italics, bold, upper case, digits
• Approximate location on page

Page 17: Image processed by Autozone and Autolabel

[Figure panels: original bitmap; zoned; labeled]

Page 18: Autoreformat

[Figure panels: original bitmap; zoned; labeled; text syntax reformatted]

e.g., John A. Smith → Smith John A
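The reformatting example above (John A. Smith becomes Smith John A) can be sketched as a small function. This is a simplified illustration that assumes the family name is the last token; the function name is hypothetical and it is not the Autoreformat module itself.

```python
def reformat_author(name: str) -> str:
    """Move the last token (assumed family name) to the front and drop periods."""
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        # Nothing to reorder for a single-token name.
        return name.strip()
    *given, family = parts
    return " ".join([family, *given])

print(reformat_author("John A. Smith"))
```

Names with suffixes or multi-word family names would need extra rules, which this sketch does not attempt.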

Page 19: Lexical analysis to overcome OCR errors

[Figure panels: original bitmap; zoned; labeled; text syntax reformatted; lexical analysis]

Page 20: Scan workstation in MARS system

Page 21: Scan workstation operation

Page 22: Edit workstation: colors identify fields labeled automatically

Page 23: Edit workstation

Page 24: Edit workstation

High confidence characters (%)

Page 25: Reconcile workstation in MARS – main screen

Page 26: Pattern matching to correct words for Reconcile operator

Page 27: Reconcile workstation GUI - closeup

[Figure callouts: bitmapped image; incorrect word; operator click selects correct word from pattern matching]
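One way to picture the pattern matching offered to the Reconcile operator is approximate string matching of an OCR word against a lexicon. The sketch below uses Python's standard `difflib`; the lexicon, the cutoff, and the function name are assumptions for illustration and do not describe the matcher used in MARS.

```python
import difflib

def suggest_corrections(ocr_word, lexicon, n=3, cutoff=0.6):
    """Return up to n lexicon entries that resemble ocr_word, best match first."""
    return difflib.get_close_matches(ocr_word, lexicon, n=n, cutoff=cutoff)

# Hypothetical lexicon; "rn" misread for "m" is a common OCR confusion.
lexicon = ["hemodynamic", "hemostatic", "thermodynamic"]
print(suggest_corrections("hemodynarnic", lexicon))
```

The operator would then pick the correct word from the returned candidates rather than retyping it.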

Page 28: Automated Metadata Extraction based on Learning Methods

• Automatically learn layout rules or models from previous (similar) TIFF documents
• Use the learned rules or models to segment and label TIFF document pages of similar layout styles

Page 29: Two Machine Learning Methods

• Learn labeling rules from dynamically generated features based on string matching techniques (DFGS)
– Exploit the MARS system; DFGS is now in the MARS production system
– Three types of features to infer rules
– Provide an unstructured and partial description of a document page
– Good for arbitrary layouts but sensitive to variations in absolute zone locations
– Requires that the physical segmentation (“zoning”) is done accurately
• Learn a 2-D layout model with logical labels based on a Bayesian approach
– Provide a structured, either partial or full 2D description of a document page
– Physical segmentation and logical labeling are performed simultaneously using the models
– Not sensitive to document noise and variations in absolute zone locations
– Use background
– Sensitive to document skew

[Figure: example 2D layout model drawn as state sequences with direction labels X and Y: "lm P rm"; "tm h g1 ti g2 C g3 B bm"; "C1 g4 C2"; "ab g5 K au g6 af"]

[Figure: Example: title field. Two plots: "Title Font Size Distribution" (x-axis: Title Font Size, 0 to 40; y-axis: 0 to 1) and "Title Font Attribute Distribution" (x-axis: Title Font Attribute, 170 to 210; y-axis: 0 to 1)]

Page 30: Learning Labeling Rules: Dynamic Feature Generation System (DFGS)

[Figure: DFGS block diagram. MARS (simplified): scanned journals, OCR, ZoneCzar1, Reformat, Reconcile, Upload, MEDLINE®; stage captions: zoning and labeling, reformatting syntax, text verification. Dynamic Feature Generation System (DFGS): Feature Generation, individual feature sets, candidate combined feature sets, loop of feature combination and matching score, ZoneMatch1, ZoneMatch2, ZoneCzar2, ZMControl, ZR, journal-specific information, matched features, verified text.]

Mao S, Kim J, Thoma GR. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Proc. 1st International Workshop on Document Image Analysis for Libraries, pages 225-232, Palo Alto, CA, January 2004.

Page 31: Learn Document 2D Layout Models based on a Bayesian Approach

• Represent 2D layouts by a set of attributed hidden semi-Markov models (HSMMs)
• A Bayesian method for learning 2D layout models from segmented and labeled, but unstructured training data
• Simultaneous physical segmentation and logical labeling using learned layout models
• Character bounding boxes as basic image units [Liang et al, 1996 and Ha et al, 1995]

Page 32: Attributed Hidden Semi-Markov Models (attributed HSMMs) = (A, B, C, π, ρ)

Markov Models
Hidden Markov Models
Hidden Semi-Markov Models (HSMMs)
Attributed Hidden Semi-Markov Models (attributed HSMMs)

• A: state transition probability matrix that defines a Markov model
• π: initial state probability distribution vector
• B: state observation probability matrix that defines the “hidden” part
• C: state duration probability matrix that defines the “semi” part
• ρ: direction attribute (x or y)

[Figure: example state-transition diagram with six states and transition probabilities 1, 0.4, 0.6, 1, 0.8, 0.2, 1, 1; observation (B) and duration (C) distributions attached to a state; ρ = X]
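To make the five-tuple (A, B, C, π, ρ) concrete, here is a minimal sketch of a container for an attributed HSMM with basic sanity checks. The class, its field layout (rows indexed by state), and the rule that a final state may have an all-zero transition row are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AttributedHSMM:
    A: List[List[float]]   # A[i][j]: transition probability from state i to j
    B: List[List[float]]   # B[j][k]: probability of observation symbol k in state j
    C: List[List[float]]   # C[j][d-1]: probability that state j lasts d units
    pi: List[float]        # pi[j]: initial state probability
    rho: str               # direction attribute, "x" or "y"

    def validate(self, tol: float = 1e-9) -> None:
        n = len(self.pi)
        if self.rho not in ("x", "y"):
            raise ValueError("rho must be 'x' or 'y'")
        if abs(sum(self.pi) - 1.0) > tol:
            raise ValueError("pi must sum to 1")
        for name, rows in (("A", self.A), ("B", self.B), ("C", self.C)):
            if len(rows) != n:
                raise ValueError(f"{name} needs one row per state")
            for row in rows:
                total = sum(row)
                # Assumption: a final state has no outgoing transitions.
                final_state = name == "A" and abs(total) <= tol
                if abs(total - 1.0) > tol and not final_state:
                    raise ValueError(f"rows of {name} must sum to 1")

# Hypothetical three-state model along x: left margin, text, right margin.
example = AttributedHSMM(
    A=[[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [0.0, 0.0, 0.0]],
    B=[[1.0, 0.0], [0.2, 0.8], [1.0, 0.0]],
    C=[[0.5, 0.5], [0.1, 0.9], [0.5, 0.5]],
    pi=[1.0, 0.0, 0.0],
    rho="x",
)
example.validate()
```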

Page 33: Map attributed HSMMs To 2D Document Layout

• States: document regions such as text regions, page margins, gaps between text regions
• State transitions: boundaries and order of document regions
• State observation: features of document regions
• State duration: sizes of document regions

[Figure: page layout example, ρ = Y]

Page 34: States

• Key states
– Title, author, affiliation, abstract
• Marginal states
– Header, footer, section text
– Text from neighboring page
– Noise streak
• Combinatorial states: can be partitioned along another dimension
• Margin and gap states

Page 35: State Observations and Durations

• State observations (contextual features)
– Number of characters
– Majority font size
– Majority key word
– Majority attribute (bold, italics)
• State durations (geometric features)
– The size of zones, page margins, and gaps between zones (width and height)

Page 36: 2D Layout Model: a set of attributed hidden semi-Markov models

[Figure: a 2D layout model drawn as a hierarchy of 1D state sequences with direction attributes X and Y, next to the page regions they describe. State sequences: "lm P rm" (X); "tm h g1 ti g2 C g3 B bm" (Y); "C1 g4 C2" (X); "ab g5 K" and "au g6 af g7 ad" (Y).]

Page 37: Bayesian Learning Method

• Start with an initial model M_0
• Let X be the observation sample associated with M_0
• Merge the states of M_0 until we find a model M* such that

$$M^{*} = \arg\max_{M} P(M \mid X) = \arg\max_{M} \frac{P(X \mid M)\,P(M)}{P(X)} = \arg\max_{M} P(X \mid M)\,P(M)$$

$$P(M_0 \mid X) < P(M_1 \mid X) < P(M_2 \mid X) < P(M^{*} \mid X)$$

Page 38: Model Merging Constraints

• Do not allow loops, since the order of zones is important
• Do not allow a text state to be merged with a gap or margin state
• Two states to be merged should be spatially close
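Two of the constraints above lend themselves to a simple pairwise check, sketched below. The `State` fields, the `max_distance` parameter, and the choice to let gap and margin states merge with each other are assumptions; the no-loop constraint concerns the resulting topology and is not covered here.

```python
from dataclasses import dataclass

@dataclass
class State:
    kind: str      # "text", "gap", or "margin"
    start: float   # where the region begins along the model's direction
    end: float     # where the region ends

def can_merge(a: State, b: State, max_distance: float) -> bool:
    """Check the text/non-text and spatial-closeness constraints."""
    # A text state must not be merged with a gap or margin state.
    if (a.kind == "text") != (b.kind == "text"):
        return False
    # Distance between the two intervals (negative when they overlap).
    distance = max(a.start, b.start) - min(a.end, b.end)
    return distance <= max_distance

print(can_merge(State("text", 0, 10), State("text", 12, 20), max_distance=5))
```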

Page 39: The Recursive Learning Algorithm

1. Start from a set of training pages; let i = 0 and the 2D model M = ∅.
2. Learn 1D models m at level i; let M = M ∪ m.
3. Use M in a recursive Duration Viterbi Algorithm to segment the training pages.
4. Find the segmented regions that can be further split; exit if none exists.
5. i = i+1, go back to step 2.

[Figure: example 1D models. m1 (X): lm P rm. m2 (Y): tm h g1 ti g2 C g3 B bm.]

Page 40: Priors

• Break a model into three components
• Dirichlet distribution for multinomial prior ([Stolcke and Omohundro, 1994] proposed priors for HMMs)
• Multinomial and geometric distributions

$$M = \{M_g, M_t, \theta_M\}$$

$$P(M) = P(M_g, M_t, \theta_M) = P(M_g)\,P(M_t \mid M_g)\,P(\theta_M \mid M_g, M_t) = P(M_g) \prod_{q \in Q} P(M_t^{(q)} \mid M_g)\,P(\theta_M^{(q)} \mid M_g, M_t^{(q)})$$

$$P(M_t^{(q)} \mid M_g) = p_t^{n_t^{(q)}} (1-p_t)^{|Q|-n_t^{(q)}}\; p_e^{n_e^{(q)}} (1-p_e)^{|Q|-n_e^{(q)}}\; p_d\,(1-p_d)^{V^{(q)}}$$

$$P(\theta_M^{(q)} \mid M_g, M_t^{(q)}) = \frac{1}{B(\alpha_t,\dots,\alpha_t)} \prod_{i=1}^{n_t^{(q)}} \theta_{qi}^{\alpha_t-1} \cdot \frac{1}{B(\alpha_e,\dots,\alpha_e)} \prod_{i=1}^{n_e^{(q)}} \theta_{qi}^{\alpha_e-1} \cdot \frac{1}{B(\alpha_d,\dots,\alpha_d)} \prod_{i=1}^{n_d^{(q)}} \theta_{qi}^{\alpha_d-1}$$

Page 41: Likelihood

$$P(X \mid M) = \int P(\theta_M \mid M)\,P(X \mid M, \theta_M)\,d\theta_M \approx \prod_{q \in Q} \int P(\theta_M^{(q)} \mid M^{(q)})\,P(v_1^{(q)},\dots,v_n^{(q)} \mid M^{(q)}, \theta_M^{(q)})\,d\theta_M^{(q)} = \prod_{q \in Q} \frac{B(v_1^{(q)}+\alpha,\dots,v_n^{(q)}+\alpha)}{B(\alpha,\dots,\alpha)}$$

Approximating the likelihood in Bayesian Learning by the Viterbi path:

$$P(X \mid M, \theta_M) = \sum_{V} P(X, V \mid M, \theta_M) \approx P(X, V^{*} \mid M, \theta_M) = \prod_{q \in Q} P(v^{(q)} \mid M, \theta_M)$$
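The closed form that results from integrating a multinomial against a symmetric Dirichlet prior is a ratio of multivariate Beta functions, B(v_1+α, ..., v_n+α) / B(α, ..., α), as in [Stolcke and Omohundro, 1994]. A minimal sketch of that per-state term in log space follows; the function names and the single shared α are assumptions for illustration.

```python
import math

def log_beta(alphas):
    """Log of the multivariate Beta function B(a_1, ..., a_n)."""
    return sum(math.lgamma(a) for a in alphas) - math.lgamma(sum(alphas))

def log_marginal_likelihood(counts, alpha=1.0):
    """log [ B(v_1 + alpha, ..., v_n + alpha) / B(alpha, ..., alpha) ] for Viterbi counts v_i."""
    posterior = [v + alpha for v in counts]
    prior = [alpha] * len(counts)
    return log_beta(posterior) - log_beta(prior)

# One observation of the first of two outcomes under a uniform prior:
# the marginal probability is 1/2.
print(math.exp(log_marginal_likelihood([1, 0])))
```

Summing this quantity over all states q gives the log of the product over q in the likelihood approximation.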

Page 42: Global Weights

• Adjust the contributions of prior and likelihood to the posterior probability
• Control when the model generalization should stop

[Equation: weighted sum of the log prior terms (the structure prior in p_t, p_e and p_d) and the log-likelihood term log(B(v_1^{(q)}+α, …, v_n^{(q)}+α) / B(α, …, α)), each scaled by a global weight]
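A small sketch of how global weights can shift the balance between prior and likelihood when comparing candidate models. The weight names `w_prior` and `w_likelihood`, the candidate names, and all numbers are hypothetical; only the idea of weighting the two log terms comes from the slide.

```python
def weighted_log_posterior(log_prior, log_likelihood, w_prior=1.0, w_likelihood=1.0):
    """Weighted log posterior, up to the constant log P(X)."""
    return w_prior * log_prior + w_likelihood * log_likelihood

def pick_model(candidates, w_prior=1.0, w_likelihood=1.0):
    """candidates maps a model name to (log_prior, log_likelihood)."""
    return max(
        candidates,
        key=lambda name: weighted_log_posterior(*candidates[name], w_prior, w_likelihood),
    )

# Made-up scores: the merged model is simpler (higher prior) but fits worse.
candidates = {"detailed": (-50.0, -100.0), "merged": (-10.0, -120.0)}
print(pick_model(candidates), pick_model(candidates, w_prior=0.1))
```

Lowering the prior weight favors the detailed model, so generalization by merging stops earlier; raising it pushes merging further.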

Page 43: Comparison of four Labeling Methods (Test set: 69 pages)

[Figure: bar chart of zoning and labeling accuracy (%), axis 0 to 100, for four methods: heuristic rules, HMM-based, DFGS-based, HSMM-based]

• 198 title textlines
• 181 author textlines
• 600 affiliation textlines
• 2079 abstract textlines

• Heuristic rules and DFGS:
1. Assume zoning is done.
2. Use font size, font attribute, key words as features.

• HMM- and HSMM-based methods:
1. Simultaneous zoning and labeling.
2. Only use character count as feature (3 others later).

Page 44: Future Work

• For TIFF images, extend feature set to font size, font attributes, key words
• Map the layout model to other document formats, e.g., HTML, PDF
• Use text-line (rather than zone) as basic state unit

Page 45:

George R. Thoma, Ph.D.
Chief, Communications Engineering Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine
8600 Rockville Pike, Bethesda, MD 20894 USA

[email protected] 496 4496

archive.nlm.nih.gov

Page 46: Publications

1. Bayesian Learning of 2D Document Layout models for Automated Preservation Metadata Extraction. Song Mao and George R. Thoma. Submitted to the 4th IASTED International Conference on Visualization, Imaging, and Image Processing.
2. Style-Independent Labeling: Design and Performance Evaluation. Song Mao, Jong Woo Kim and G. R. Thoma. SPIE Conference on Document Recognition and Retrieval, pages 14-22, San Jose, CA, January 2004.
3. A Dynamic Feature Generation System for Automated Metadata Extraction in Preservation of Digital Materials. Song Mao, Jong Woo Kim and G. R. Thoma. The First International Workshop on Document Image Analysis for Libraries, pages 225-232, Palo Alto, CA, January 2004.
4. Stochastic Attributed K-D tree Modeling of Technical Paper Title Pages. Song Mao, Azriel Rosenfeld, Tapas Kanungo. IEEE International Conference on Image Processing, pages 533-536, Barcelona, Spain, September 2003.
5. Stochastic Language Model for Style-Directed Physical Layout Analysis of Documents. Tapas Kanungo and Song Mao. IEEE Transactions on Image Processing, vol. 12, no. 5, pages 583-596, May 2003.

Page 47: References

6. Best-first model merging for hidden Markov model induction. A. Stolcke and S. M. Omohundro. Technical Report TR-94-003, ICSI, Berkeley, CA, 1994.
7. Document layout structure extraction using bounding boxes of different entities. J. Liang, J. Ha, R. M. Haralick. 3rd IEEE Workshop on Applications of Computer Vision (WACV ’96), December 1996.
8. Document page decomposition using bounding boxes of connected components of black pixels. J. Ha, R. M. Haralick, I. T. Phillips. Document Recognition II, SPIE Proceedings, vol. 2422, pp. 140-151, Feb 1995.
9. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, and J. H. Friedman. Springer Series in Statistics, 2001.

Page 48: Examples

[Figure panels: HSMM-based; HMM-based; DFGS-based; heuristic-rule-based]

Page 49: Bayesian Learning

Goal is to find

$$M^{*} = \arg\max_{M} P(M \mid X) = \arg\max_{M} \frac{P(M)\,P(X \mid M)}{P(X)} = \arg\max_{M} P(M)\,P(X \mid M)$$

Need to know the explicit form of the prior P(M) and the likelihood P(X|M), obtained from the training set.

Page 50: Layout Model Learning Results (Training set: 19 journal title pages)

Model component   Number of initial states   Number of final states
1                 95                         5
2                 189                        17
3                 205                        32

Page 51: Model Merging Algorithm

Best-first merging with look ahead [Stolcke and Omohundro, 1994]

Algorithm steps:
– Let M_0 be the empty model. Let i = 0. Loop:
1. Get 5 new samples X and incorporate them into M_i.
2. Find the best merge that maximizes P(M_i|X).
3. Let M_{i+1} be the new model.
4. If P(M_{i+1}|X) < P(M_i|X), perform look-ahead: if the probability does not improve after merging some more states, break from the loop; else let M_{i+1} be the merged model.
5. Let i = i+1.
– If data is exhausted, break from the loop and return M_i as the induced model.
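The merging loop above can be sketched in a simplified form. The version below assumes all samples are already incorporated (so step 1 is omitted), ignores the merging constraints, and uses a toy model (clusters of numbers) with a made-up score standing in for the log posterior. All function names and the look-ahead depth are hypothetical.

```python
from itertools import combinations

def best_first_merge(model, merge, score, lookahead=2):
    """Greedy best-first state merging with a bounded look-ahead.

    model: a sequence of states; merge(model, i, j) returns a new model;
    score(model) plays the role of the log posterior.
    """
    best, best_score = model, score(model)
    current, misses = model, 0
    while len(current) > 1:
        pairs = combinations(range(len(current)), 2)
        # Take the best single merge, even if it lowers the score.
        current = max((merge(current, i, j) for i, j in pairs), key=score)
        current_score = score(current)
        if current_score > best_score:
            best, best_score, misses = current, current_score, 0
        else:
            misses += 1              # look ahead through a few worse models
            if misses > lookahead:
                break
    return best

# Toy stand-ins: states are tuples of numbers, merging concatenates them.
def merge(model, i, j):
    rest = tuple(c for k, c in enumerate(model) if k not in (i, j))
    return rest + (model[i] + model[j],)

def sse(cluster):
    mean = sum(cluster) / len(cluster)
    return sum((v - mean) ** 2 for v in cluster)

def score(model):
    # Fit term (negative squared error) plus a penalty per state.
    return -sum(sse(c) for c in model) - 0.5 * len(model)

result = best_first_merge(((1.0,), (1.1,), (5.0,), (5.2,)), merge, score)
print(sorted(result))
```

With these toy inputs the two nearby pairs are merged and the final merge into a single state is rejected because it lowers the score.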

Page 52: The Recognition Algorithm

• A duration Viterbi algorithm [Rabiner et al, 1985]
• Recursively apply it to a document page using a set of learned attributed hidden semi-Markov models [Mao et al, 2003]

$$q^{*} = \arg\max_{q} P(o, q \mid \lambda)$$

o: feature observation vector
q: states corresponding to o
λ: an attributed hidden semi-Markov model

$$\delta_t(j) = \max_{q_1, q_2, \dots, q_{t-1}} P(o_1 o_2 \cdots o_t,\ \text{the stay in state } j \text{ ends at time } t \mid \hat{\lambda}) = \max_{1 \le d \le \min(D,\,t),\ i \ne j} \delta_{t-d}(i)\,\hat{a}_{ij}\,c_j(d) \prod_{s=t-d+1}^{t} b_j(o_s)$$

where $\hat{a}_{ij} = a_{ij}$ if $d < t$, $\hat{a}_{ij} = \pi_j$ if $d = t$, and $\delta_0(i) = 1$ for all $i$.
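A minimal 1D sketch of an explicit-duration Viterbi recursion of the kind named above, written with plain probabilities for readability. It covers a single direction only; the recursive application over X and Y is not shown. The function signature, the discrete observation symbols, and the toy parameters are assumptions for illustration.

```python
def duration_viterbi(obs, pi, A, B, C):
    """Best segmentation of obs into state runs with explicit durations.

    obs: list of symbol indices; pi[j]: initial probability; A[i][j]: transition;
    B[j][k]: observation probability; C[j][d-1]: probability of duration d.
    Returns (probability, [(state, start, end_exclusive), ...]).
    """
    T, N = len(obs), len(pi)
    # delta[t][j]: best probability of obs[:t] with a run of state j ending at t.
    delta = [[0.0] * N for _ in range(T + 1)]
    back = [[None] * N for _ in range(T + 1)]
    for t in range(1, T + 1):
        for j in range(N):
            emit = 1.0
            for d in range(1, min(len(C[j]), t) + 1):
                emit *= B[j][obs[t - d]]          # extend the run backwards
                run = C[j][d - 1] * emit
                if d == t:                        # run starts the sequence
                    options = [(pi[j] * run, None)]
                else:                             # run follows another state
                    options = [(delta[t - d][i] * A[i][j] * run, i)
                               for i in range(N) if i != j]
                for p, i in options:
                    if p > delta[t][j]:
                        delta[t][j], back[t][j] = p, (d, i)
    j = max(range(N), key=lambda s: delta[T][s])
    best, t, segments = delta[T][j], T, []
    while t > 0 and back[t][j] is not None:       # trace the runs back
        d, i = back[t][j]
        segments.append((j, t - d, t))
        t, j = t - d, i
    return best, segments[::-1]

# Toy model: state 0 mostly emits symbol 0, state 1 mostly emits symbol 1.
pi = [1.0, 0.0]
A = [[0.0, 1.0], [1.0, 0.0]]
B = [[0.9, 0.1], [0.1, 0.9]]
C = [[0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
prob, segments = duration_viterbi([0, 0, 1, 1, 1], pi, A, B, C)
print(segments)
```

In a layout setting the observations would be features of image strips along one direction, and each recovered run would correspond to a region such as a margin, gap, or text zone.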