A Training and Classification System in Support of Automated Metadata Extraction
PhD Proposal
Paul K Flynn
14 May 2009
PhD Proposal – Paul K Flynn
Motivation
• bootstrap
Overview
• Background
  – Metadata Extraction Overview
  – System Description
  – Nonform classification
• Proposal
• Classification
  – Background
  – Potential Methods
  – Complexities
  – Early experiments
  – Avenues of Investigation
• Training System
  – Design Overview
• Testing and Evaluation
• Schedule
Metadata Extraction System
• Large, heterogeneous, evolving collections consisting of documents with diverse layout and structure
  – Defense Technical Information Center (DTIC)
  – U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA)
• Two types of targets
  – Forms: documents containing a standardized form filled out with the metadata of interest
  – Non-forms: all others
• Developing a system which can be transitioned into a production system
Form Document
Nonform Document
Design concepts
• Heterogeneity
  – A new document is classified, assigning it to a group of documents of similar layout, which reduces the problem to multiple homogeneous collections
  – Associated with each class of document layouts is a template, a scripted description of how to associate blocks of text in the layout with metadata fields
• Evolution
  – New classes of documents are accommodated by writing a new template
  – Templates are comparatively simple, and no lengthy retraining is required
  – Potentially rapid response to changes in the collection
• Robustness
  – Use of validation techniques to detect extraction problems and to select templates
Architecture & Implementation
[Figure: system architecture. Labeled components: Input Documents; Input Processing & OCR; XML model of document; Form Processing with Form Templates (sf298_1, sf298_2, ...); Nonform Processing with Nonform Templates (au, eagle, ...); Unresolved Documents; Extracted Metadata; Post Processing; Cleaned Metadata; Validation; trusted outputs; Untrusted Metadata Outputs; Human Review & Correction; corrected metadata; Final Metadata Output]
Validation Scripts
Final Validation Classification
Post-Hoc Classification
[Figure: post hoc classification data flow. Labeled components: Unresolved Document; Clean XML; Nonform Templates (au, eagle, ...); Extract Metadata; Candidate Metadata Sets; Validation Spec. (validation rules); Select Best Metadata; Selected Metadata; Final Nonform Output]
• Apply all templates to document
  – results in multiple candidate sets of metadata
• Score each candidate using the validator
  – Select the best-scoring set
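The two steps above can be sketched as follows. This is a minimal illustration rather than the system's implementation: templates and the validator are modelled as plain Python callables, and `select_best_metadata`, the two toy templates, and `toy_validator` are hypothetical names introduced only for this example.

```python
from typing import Callable, Dict, Optional, Tuple

Metadata = Dict[str, str]
# Assumption: a template maps document text to one candidate metadata set,
# and a validator maps a metadata set to a score (higher is better).
Template = Callable[[str], Metadata]
Validator = Callable[[Metadata], float]


def select_best_metadata(
    document: str, templates: Dict[str, Template], validate: Validator
) -> Tuple[Optional[str], Metadata, float]:
    """Apply every template, score each candidate set, keep the best one."""
    best_name, best_meta, best_score = None, {}, float("-inf")
    for name, template in templates.items():
        candidate = template(document)
        score = validate(candidate)
        if score > best_score:
            best_name, best_meta, best_score = name, candidate, score
    return best_name, best_meta, best_score


def first_line_title(doc: str) -> Metadata:
    """Toy template: title on the first line, date on the last."""
    lines = doc.splitlines()
    return {"title": lines[0], "date": lines[-1]}


def second_line_title(doc: str) -> Metadata:
    """Toy template: title on the second line, date on the first."""
    lines = doc.splitlines()
    return {"title": lines[1], "date": lines[0]}


def toy_validator(meta: Metadata) -> float:
    """Fraction of fields that pass a trivial plausibility rule."""
    checks = [
        len(meta.get("title", "")) > 10,
        meta.get("date", "")[-4:].isdigit(),
    ]
    return sum(checks) / len(checks)


doc = "A Study of Layout Classification\nJ. Smith\nMay 2009"
name, meta, score = select_best_metadata(
    doc, {"au": first_line_title, "eagle": second_line_title}, toy_validator
)
print(name, score)
```

Because every template is applied to every document, the cost of this approach grows with the number of templates, and the decision rests entirely on how well the validator separates good extracts from bad ones.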
Post hoc classification shortcomings
[Figure: example with panels labeled "Correct" and "Selected"]
Classification (a priori)
[Figure: a priori classification data flow. Labeled components: Unresolved Document; Clean XML; Nonform Templates (au, eagle, ...); Classify (select best template); selected template; Extract Metadata; Extracted Metadata; Final Nonform Output]
• Replace the use of post hoc classification alone with a new classification module
• Continue to use the validator to provide semantic verification of extracts
Focus: Non-Form Processing
• Classification – compare document against known document layouts
  – Select template written for closest matching layout
• Apply non-form extraction engine to document and template
• Send to validator for scoring
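The three steps above (classify, extract, validate) might be wired together as in the sketch below. It assumes a layout is reduced to a list of block centre points and uses a simple nearest-block distance; that representation, the distance measure, and all function names are illustrative assumptions, not the extraction engine's actual interface.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # assumed: block centre as (x, y) page fractions
Metadata = Dict[str, str]


def layout_distance(doc: List[Point], known: List[Point]) -> float:
    """Mean distance from each document block to its nearest known block."""
    if not doc or not known:
        return float("inf")
    nearest = [
        min(((x - kx) ** 2 + (y - ky) ** 2) ** 0.5 for kx, ky in known)
        for x, y in doc
    ]
    return sum(nearest) / len(nearest)


def classify(doc: List[Point], layouts: Dict[str, List[Point]]) -> str:
    """Return the name of the closest matching known layout."""
    return min(layouts, key=lambda name: layout_distance(doc, layouts[name]))


def process_nonform(doc_blocks, doc_text, layouts, templates, validate):
    """Classify, extract with the selected template, then score the extract."""
    selected = classify(doc_blocks, layouts)
    metadata = templates[selected](doc_text)
    return selected, metadata, validate(metadata)


# Hypothetical layouts and templates keyed by class name.
layouts = {"au": [(0.5, 0.1), (0.5, 0.2)], "eagle": [(0.2, 0.8), (0.8, 0.8)]}
templates = {
    "au": lambda text: {"title": text.splitlines()[0]},
    "eagle": lambda text: {"title": text.splitlines()[-1]},
}
validate = lambda meta: 1.0 if meta.get("title") else 0.0

selected, metadata, score = process_nonform(
    [(0.5, 0.12), (0.48, 0.22)], "Annual Report\nJ. Smith", layouts, templates, validate
)
print(selected, metadata, score)
```

Only one template is applied per document here, so the validator's role shifts from choosing among candidates to confirming or rejecting a single extract.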
Extraction System
Proposal
• Investigate classification methodologies and implementation
• Create a Training System for managing and creating templates
• Specific questions we will attempt to answer:
  – Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
  – Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
  – Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
  – Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?
Previous Research
• Extensive coverage in literature
  – Model and methodology match purpose
  – No universal classifier
  – Many experiments use limited number of classes
• Use visual similarity, logical structures, or textual features
  – Machine learning
  – Decision trees
  – Distance measures
• Multiple classifiers
Issues and Complexities
• Self-imposed constraints
  – Must be simple to maintain
  – Automated process for deploying from Training System
  – Avoid adding dependency on additional 3rd party packages
Unpredictable OCR Results
Unpredictable Page Segmentation
Second page relevancy
[Figure: example document, with panels labeled "Page 1" and "Page 2"]
Manual Classification
• Documents may appear visually similar at thumbnail scale
• Closer inspection reveals semantic differences
Incorrect Manual Classification
• The position of the date field differs, which is detectable by post hoc classification
Initial experiments
• Methods tested
  – Block distances: tries to match blocks and measure distances
  – MxN Overlap: divide page into bins and count matches
  – Common Vocab: find common words in pages of training class
    • Vocab1: looks at only the 1st page
    • Vocab5: looks at the 1st five pages
  – MXY tree: variant encodes structure as a string and uses edit distance to measure similarity
  – MXY + MxN: sums the two methods
• Used best 4/5 votes to declare match
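To make the MxN Overlap idea and the voting rule concrete, the sketch below divides a page into an m x n grid, marks the bins touched by text blocks, and compares two pages bin by bin. Reading "best 4/5 votes" as "at least four of five class exemplars must match" is one interpretation of the slide, and the grid size, threshold, block format, and function names are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

Block = Tuple[float, float, float, float]  # assumed: (x0, y0, x1, y1) as page fractions


def occupied_bins(blocks: List[Block], m: int, n: int) -> Set[Tuple[int, int]]:
    """Return the (row, col) bins of an m x n grid touched by any block."""
    bins = set()
    for x0, y0, x1, y1 in blocks:
        c0, c1 = min(int(x0 * n), n - 1), min(int(x1 * n), n - 1)
        r0, r1 = min(int(y0 * m), m - 1), min(int(y1 * m), m - 1)
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                bins.add((r, c))
    return bins


def mxn_similarity(a: List[Block], b: List[Block], m: int = 8, n: int = 8) -> float:
    """Fraction of bins whose occupied/empty state agrees between two pages."""
    differing = len(occupied_bins(a, m, n) ^ occupied_bins(b, m, n))
    return 1.0 - differing / (m * n)


def matches_class(doc: List[Block], exemplars: List[List[Block]],
                  threshold: float = 0.9, votes_needed: int = 4) -> bool:
    """Declare a match when enough exemplars are similar to the document."""
    votes = sum(mxn_similarity(doc, ex) >= threshold for ex in exemplars)
    return votes >= votes_needed


title_page = [(0.2, 0.1, 0.8, 0.2)]     # a title block near the top
footer_page = [(0.1, 0.7, 0.9, 0.95)]   # a wide block near the bottom

print(mxn_similarity(title_page, footer_page))
print(matches_class(title_page, [title_page] * 4 + [footer_page]))
```

A coarse grid tolerates small shifts in block position, which matters when OCR and page segmentation are unpredictable, at the price of blurring layouts that differ only in fine detail.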
Sample Results Summary
[Figure: sample results, with panels labeled MXY Tree, MXY Tree - MxN, MxN, and Block Distance]
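For the MXY tree method, the slides say only that the structure is encoded as a string and compared by edit distance. The sketch below shows that comparison step with a standard Levenshtein distance; the encoding itself is not given in the slides, so the example strings of "H" and "V" cut symbols are purely hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def structure_similarity(a: str, b: str) -> float:
    """Normalise edit distance to a similarity in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical structure strings: H = horizontal cut, V = vertical cut.
print(edit_distance("HVVH", "HVH"))
print(structure_similarity("HVVH", "HVH"))
```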
Experimental Results
• Precision = #correct / #total in class
• Recall = #correct / #answers

Method           Precision   Recall
Block Distance   43%         92%
MxN              50%         92%
Vocab 1          43%         62%
Vocab 5          61%         64%
MXY tree         30%         94%
MXY + MxN        36%         96%
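As a worked example of the two ratios as the slide defines them, the sketch below computes both for one class from a list of true labels and a list of classifier answers. It follows the slide's wording literally; conventional information-retrieval usage assigns the names the other way around (precision divides by the number of answers, recall by the class size). Treating "#answers" as the number of documents the classifier assigned to the class is an assumption, and `slide_metrics` and the sample labels are hypothetical.

```python
from typing import List, Optional, Tuple


def slide_metrics(true_labels: List[str], answers: List[Optional[str]],
                  cls: str) -> Tuple[float, float]:
    """Return (precision, recall) for `cls` using the slide's definitions.

    answers[i] is the class assigned to document i, or None if no match
    was declared.
    """
    total_in_class = sum(1 for t in true_labels if t == cls)
    answered = sum(1 for a in answers if a == cls)
    correct = sum(1 for t, a in zip(true_labels, answers) if t == cls and a == cls)
    precision = correct / total_in_class if total_in_class else 0.0  # slide: #correct / #total in class
    recall = correct / answered if answered else 0.0                 # slide: #correct / #answers
    return precision, recall


# Made-up labels for illustration only; not data from the experiments.
true_labels = ["AU", "AU", "AU", "AU", "LOGI"]
answers = ["AU", "AU", None, "LOGI", "AU"]
precision, recall = slide_metrics(true_labels, answers, "AU")
print(precision, recall)
```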
Experimental Results

                Layout Distance     MxN Matching        Vocab Match 5 Page  Vocab Match 1 Page  MXY Tree            MXY Tree Plus MxN
CLASS           Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall    Precision Recall
ABSTRACT1-2COL  0%        0%        0%        0%        100%      100%      100%      100%      0%        0%        0%        0%
ATOM            0%        0%        0%        0%        100%      100%      0%        0%        0%        0%        0%        0%
AU              99%       100%      97%       100%      91%       100%      97%       100%      98%       100%      99%       100%
BOTTOM-BLOCK    13%       100%      0%        0%        50%       80%       0%        0%        0%        0%        0%        0%
CPRC            0%        0%        17%       100%      83%       45%       0%        0%        0%        0%        0%        0%
EAGLE-IMAGE     100%      91%       100%      100%      78%       93%       0%        0%        50%       94%       94%       100%
EAGLE-TEXT      100%      100%      69%       100%      54%       100%      100%      100%      100%      93%       100%      93%
ERDC            69%       95%       92%       100%      65%       81%       54%       93%       19%       100%      46%       100%
HORIZ           80%       100%      80%       100%      100%      15%       0%        0%        20%       100%      80%       100%
LOGI            15%       100%      7%        100%      26%       100%      96%       70%       11%       60%       11%       75%
RAND-ARC        0%        0%        89%       89%       56%       83%       0%        0%        0%        0%        11%       100%
RAND-ARROYO     50%       86%       50%       100%      75%       82%       0%        0%        0%        0%        0%        0%
RAND-ARROYO2    14%       100%      68%       79%       75%       78%       71%       80%       0%        0%        0%        0%
RAND-BRIEF1     33%       100%      67%       100%      100%      100%      67%       100%      33%       100%      33%       100%
RAND-BRIEF2     60%       100%      90%       100%      80%       100%      45%       100%      50%       91%       60%       86%
RAND-LEFT       0%        0%        0%        0%        83%       100%      33%       8%        0%        0%        0%        0%
RAND-NOTE       57%       100%      79%       100%      86%       100%      100%      100%      0%        0%        0%        0%
RANDTECH        50%       73%       13%       67%       81%       57%       0%        0%        0%        0%        0%        0%
RESEARCH        0%        0%        0%        0%        89%       47%       67%       67%       0%        0%        0%        0%
SIGNATUR        0%        0%        0%        0%        100%      91%       100%      100%      0%        0%        0%        0%
TOPLOG-2COL     0%        0%        44%       100%      56%       100%      56%       100%      0%        0%        0%        0%
WARCOLLEGE      0%        0%        0%        0%        100%      71%       100%      38%       0%        0%        0%        0%

• Precision = #correct / #total in class
• Recall = #correct / #answers
Avenues of Investigation
• Implement and test a variety of methods
• Handling multiple page classification
• Multiple classifiers
  – Methods of combining
  – Weighting best for each class
  – Deriving signature for specifying combination rules
• Clustering methods for bootstrapping
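One possible shape for "weighting best for each class" is a per-class weighted sum of the scores produced by several classifiers, sketched below. The proposal leaves the combination method open, so the weighted-sum rule, the function name, and all numbers here are illustrative assumptions rather than results or design decisions from the slides.

```python
from typing import Dict


def combine_scores(scores: Dict[str, Dict[str, float]],
                   weights: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Combine per-method similarity scores using per-class method weights.

    scores[method][cls]  -> similarity of the document to class `cls`
    weights[cls][method] -> how much `method` is trusted for class `cls`
    """
    combined: Dict[str, float] = {}
    for method, per_class in scores.items():
        for cls, score in per_class.items():
            weight = weights.get(cls, {}).get(method, 0.0)
            combined[cls] = combined.get(cls, 0.0) + weight * score
    return combined


# Made-up scores and weights for two methods and two classes.
scores = {
    "mxn": {"au": 0.9, "logi": 0.6},
    "vocab5": {"au": 0.4, "logi": 0.8},
}
weights = {
    "au": {"mxn": 0.7, "vocab5": 0.3},
    "logi": {"mxn": 0.2, "vocab5": 0.8},
}
combined = combine_scores(scores, weights)
best = max(combined, key=combined.get)
print(combined, best)
```

In this toy case the layout method alone would pick "au", but the class-specific weights tip the decision to "logi", which is the kind of effect a per-class weighting scheme is meant to capture.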
Training System
• Manual classification
  – Inaccurate
  – Time consuming
  – Not dynamic
• Need automated verification of extractions
  – Developers open original documents multiple times
  – Correctness open to interpretation
  – Regression testing only compares against previous extraction attempts
• Need to allow multiple template writers to interact
• Need to measure effects of individual templates against the whole
Metadata Training System
[Figure: training system architecture. Labeled components: Production System; Training Docs; Clustered Docs; Candidates; Template Maker; Training Evaluator; BootStrap Classifier; Baseline Data Manager; Trained Pool; Baseline Data Pool; Templates and Classification Signatures]
Persistence Layer
• Manage complete set of training documents
• Track baseline data input by users
  – Allows for independent confirmation
  – Change tracking
• Track subset of documents without templates developed
• Track trained documents
• Allow for multiple access
Baseline Data Manager
• GUI for establishing baseline
  – Highlight and copy
  – Work on OCR to account for errors
• Provide auditing and tracking of changes
Bootstrap Classifier
• Dynamic GUI to allow:
  – Differing classifiers
  – Clustering method
• User can flag documents to ignore
• User can designate single doc as matcher
• Provides output to Template Maker
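The "single doc as matcher" and "flag documents to ignore" behaviours could look roughly like the sketch below, which gathers the documents in a pool that are similar enough to a chosen matcher. Representing documents as feature sets and comparing them with Jaccard similarity are stand-in assumptions; the slides do not specify which classifier or clustering method the GUI would drive.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def candidate_group(matcher_id: str, pool: Dict[str, FrozenSet[str]],
                    threshold: float = 0.6,
                    ignored: FrozenSet[str] = frozenset()) -> List[str]:
    """Return ids of pool documents similar to the matcher, skipping ignored ones."""
    reference = pool[matcher_id]
    return sorted(
        doc_id
        for doc_id, features in pool.items()
        if doc_id != matcher_id
        and doc_id not in ignored
        and jaccard(reference, features) >= threshold
    )


# Hypothetical pool: each document reduced to a set of layout features.
pool = {
    "d1": frozenset({"title", "author", "date", "logo"}),
    "d2": frozenset({"title", "author", "date"}),
    "d3": frozenset({"abstract", "figure"}),
    "d4": frozenset({"title", "author", "date", "logo"}),
}
print(candidate_group("d1", pool))
print(candidate_group("d1", pool, ignored=frozenset({"d4"})))
```

The resulting group is what would be handed to the Template Maker as sample documents for a new template.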
Template Maker
[Figure: Template Maker, with panels labeled "Template being developed" and "Results for document"]
Template Maker
• Final form depends on Engine replacement
• Sample documents come from Bootstrap Classifier
• Also prepares classification spec for export to production system
Training Evaluator
• Verifies operation of template and classification spec
• Identifies other documents in the pool which are correctly extracted
  – Moved to trained pool
  – Removed from further bootstrapping
• Supports robust regression testing
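A minimal sketch of the evaluator's pool bookkeeping follows, assuming that "correctly extracted" means the template's output equals the baseline metadata recorded for that document. The function name, the dictionary-based pools, and the toy template are hypothetical.

```python
from typing import Callable, Dict, Tuple

Metadata = Dict[str, str]


def evaluate_template(
    template: Callable[[str], Metadata],
    untrained_pool: Dict[str, str],
    baseline: Dict[str, Metadata],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split the pool into documents the template extracts correctly and the rest."""
    newly_trained: Dict[str, str] = {}
    still_untrained: Dict[str, str] = {}
    for doc_id, text in untrained_pool.items():
        # A document counts as trained only if baseline data exists and matches.
        if doc_id in baseline and template(text) == baseline[doc_id]:
            newly_trained[doc_id] = text
        else:
            still_untrained[doc_id] = text
    return newly_trained, still_untrained


# Toy template: take the first line as the title.
template = lambda text: {"title": text.splitlines()[0]}
untrained_pool = {"d1": "Report A\nbody text", "d2": "cover page\nReport B"}
baseline = {"d1": {"title": "Report A"}, "d2": {"title": "Report B"}}

trained, remaining = evaluate_template(template, untrained_pool, baseline)
print(sorted(trained), sorted(remaining))
```

Rerunning the same check over the trained pool after any template change gives a regression test against the baseline data rather than against earlier extraction attempts.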
Testing and Evaluation
• Proposed Tests
  – Evaluate effectiveness of pre-classification module
  – Evaluate effectiveness of adding similarity score to validation
  – Evaluate the effectiveness of the bootstrap classification
  – End to End evaluation
• Other tests as needed
Evaluate effectiveness of pre-classification module
• Use simple Baseline classifier to test
  – Select candidate templates
  – Adjust post hoc score
Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
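One way to picture the pre-classification step is as a filter in front of post hoc selection: a layout classifier ranks the templates, only the top few are applied, and the validator chooses among those. The sketch below assumes similarity and validation scores are already available as plain numbers; the function names, the value of k, and the scores are hypothetical.

```python
from typing import Callable, Dict, List


def top_candidates(similarity: Dict[str, float], k: int) -> List[str]:
    """Return the k template names with the highest layout similarity."""
    ranked = sorted(similarity.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]


def classify_with_preselection(similarity: Dict[str, float],
                               validation_score: Callable[[str], float],
                               k: int = 2) -> str:
    """Run post hoc selection only over the pre-classified candidates."""
    candidates = top_candidates(similarity, k)
    return max(candidates, key=validation_score)


# Made-up scores: "logi" validates well but looks nothing like the document.
similarity = {"au": 0.9, "eagle": 0.7, "logi": 0.2}
validation = {"au": 0.6, "eagle": 0.8, "logi": 0.95}

chosen = classify_with_preselection(similarity, validation.get, k=2)
print(chosen)
```

Here post hoc selection alone would pick "logi" on validation score, while the pre-classification filter removes it from consideration because its layout does not resemble the document.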
Evaluate effectiveness of adding similarity score to validation
• Use Baseline classifier
• Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity
• Apply score to final validation
• Remove templates to measure effects
• Assessment: Evaluate the percent of documents which are correctly flagged as resolved.
Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
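The test described above could be prototyped along the lines below: average a document's similarity to the five signature documents of a class, then blend that value with the validator's score before the accept or reject decision. The blending weight, the acceptance threshold, the Jaccard similarity on feature sets, and all names are assumptions chosen for illustration; the proposal does not fix how the two scores are combined.

```python
from typing import Callable, FrozenSet, List, Tuple

Features = FrozenSet[str]


def jaccard(a: Features, b: Features) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def signature_similarity(doc: Features, signature: List[Features],
                         similarity: Callable[[Features, Features], float]) -> float:
    """Mean similarity between a document and the signature documents."""
    return sum(similarity(doc, s) for s in signature) / len(signature)


def accept(validation_score: float, sig_similarity: float,
           weight: float = 0.5, threshold: float = 0.7) -> Tuple[bool, float]:
    """Blend the two scores and compare against an acceptance threshold."""
    combined = (1 - weight) * validation_score + weight * sig_similarity
    return combined >= threshold, combined


doc = frozenset({"report", "title", "author", "date"})
signature = [  # a hypothetical cluster of 5 signature documents
    frozenset({"report", "title", "author", "date"}),
    frozenset({"report", "title", "author"}),
    frozenset({"report", "title"}),
    frozenset({"report", "title", "author", "date"}),
    frozenset({"report", "title", "author", "date", "abstract"}),
]

sim = signature_similarity(doc, signature, jaccard)
resolved, combined = accept(validation_score=0.9, sig_similarity=sim)
print(sim, resolved, combined)
```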
Evaluate the effectiveness of the bootstrap classification
• Measure the time needed to run through the entire training collection
• Isolate template development time by providing appropriate template
• Assessment: Compare the time needed to classify the documents to the manual method.
Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
End to End evaluation
Create a mini-collection by downloading 100 documents from DTIC. Assign two separate teams of trained template writers to create templates to correctly extract metadata from a minimum of 80 documents.
One team will perform the task using manual classification, a version of the Template Maker with the training system enhancements disabled and a production system (with no templates) for extraction. The other team will use the complete training system.
Assessment: The teams will use logs to record work time. We will evaluate logs to assess time usage and conduct interviews to compile observations and impressions of the system.
Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?
Schedule