
Page 1: Automated Metadata Extraction
July 17-20, 2006
Kurt Maly, maly@cs.odu.edu
ICCTA 2006, 5-7 September, Alexandria

Page 2: Outline

• Background and Motivation

• Challenges and Approaches

• Metadata Extraction Experience at ODU CS

• Architecture for Metadata Extraction

• Experiments with DTIC Documents

• Experiments with limited GPO Documents

• Conclusions

Page 3: Digital Libraries

[Diagram: Digital Library Research at ODU (http://dlib.cs.odu.edu/). Content creation: new content via publication tools (Kepler, Compopt; NSF, US Navy) and processing of existing content (DTIC). Content sharing: a centralized model, via OAI-PMH harvesting (Arc/Archon, NSF; Kepler, NSF; TRI, NASA/LANL/Sandia) or real-time LFDL, and a distributed P2P model (NSF). Related projects: DL Grid (Andrew Mellon), Secure DL (NSF, IBM).]

Page 4: Motivation

• Metadata enhances the value of a document collection
  – Using metadata helps resource discovery
    • A company may save about $8,200 per employee by using metadata in its intranet to reduce the time employees spend searching for, verifying, and organizing files (estimate by Mike Doane at the DCMI 2003 workshop)
  – Using metadata helps make collections interoperable via OAI-PMH

• Manual metadata extraction is costly and time-consuming
  – It would take about 60 employee-years to create metadata for 1 million documents (estimate by Lou Rosenfeld at the DCMI 2003 workshop); automatic metadata extraction tools are essential to reduce the cost
  – Automatic extraction tools are essential for rapid dissemination at reasonable cost

• OCR alone is not sufficient for making ‘legacy’ documents searchable

Page 5: Challenges

A successful metadata extraction system must:

• extract metadata accurately
• scale to large document collections
• cope with heterogeneity within a collection
• maintain accuracy, with minimal reprogramming/training cost, as the collection evolves over time
• have a validation/correction process

Page 6: Approaches

• Machine Learning
  – HMM
  – SVM

• Rule-Based
  – Ad hoc
  – Expert systems
  – Template-based (ODU CS)

Page 7: Comparison

• Machine-Learning Approach
  – good adaptability, but it has to be trained from samples, which is very time-consuming
  – performance degrades with increasing heterogeneity
  – difficult to add new fields to be extracted
  – difficult to select the right features for training

• Rule-Based Approach
  – no need for training from samples
  – can extract different metadata from different documents
  – rule writing may require significant technical expertise

Page 8: Metadata Extraction Experience at ODU CS

• DTIC (2004, 2005)
  – developed software to automate the task of extracting metadata and basic structure from DTIC PDF documents
    • explored alternatives including SVM, HMM, and expert systems
    • origin of the ODU template-based engine
• GPO (in progress)
• NASA (in progress)
  – feasibility study to apply the template-based approach to the CASI collection

Page 9: Meeting the Challenges

• All techniques achieved reasonable accuracy for small collections
  – possible to scale to large homogeneous collections
• Heterogeneity remains a problem
  – ad hoc rule-based systems tend toward complex monoliths
  – expert systems tend toward large rule sets with complex, poorly understood interactions
  – machine learning must choose between reduced accuracy/confidence and state explosion
• Evolution is problematic for machine-learning approaches
  – older documents may have a higher rate of OCR errors
  – expensive retraining is required to accommodate changes in the collection
  – potential lag time during which accuracy decays until sufficient training instances are acquired
• Validation: a largely unexplored area
  – machine-learning approaches offer some support via confidence measures

Page 10: Architecture for Metadata Extraction

[Diagram: extraction pipeline. The OCR output of scanned documents passes through lexical analysis, semantic tagging, document modelling, and document classification, followed by classification resolution and validation; on failure, the document enters a human-assisted feedback loop. A document assigned to class 1, 2, ..., M is processed by metadata extraction using the corresponding template 1, 2, ..., M; each result is validated, and failures are routed to human assistance.]
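To make the control flow concrete, here is a minimal Python sketch of the bi-level dispatch: classify first, then extract with the matching class template. All function and variable names here are hypothetical placeholders, not the actual ODU implementation.

def process(document, classify, templates, extract, validate):
    """Route one OCR'd document through classification and template extraction."""
    doc_class = classify(document)              # document classification step
    if doc_class is None:                       # classification failed
        return ("human-assistance", document)   # human-assisted feedback loop
    metadata = extract(document, templates[doc_class])  # per-class template
    if not validate(metadata):                  # per-class validation step
        return ("human-assistance", document)
    return ("ok", metadata)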

Page 11: Our Approach: Meeting the Challenges

• Bi-level architecture

– Classification based upon document similarity

– Simple templates (rule-based) written for each emerging class

Page 12: Our Approach: Meeting the Challenges

• Heterogeneity
  – Classification, in effect, reduces the problem to multiple homogeneous collections
  – Multiple templates required, but each template is comparatively simple
    • only needs to accommodate one class of documents that share a common layout and style

• Evolution
  – New classes of documents accommodated by writing a new template
    • templates are comparatively simple
    • no lengthy retraining required
    • potentially rapid response to changes in the collection
  – Enriching the template engine by introducing new features to reduce the complexity of templates

• Validation
  – Exploring a variety of techniques drawn from automated software testing & validation

Page 13: Metadata Extraction – Template-based

• Template-based approach
  – Classify documents into classes based on similarity
  – For each document class, create a template (a set of rules)
  – Decouple rules from code
    • a template is kept in a separate file

• Advantages
  – Easy to extend: for a new document class, just create a template
  – Rules are simpler
  – Rules can be refined easily

Page 14: Classes of documents

Page 15: Template engine

[Diagram: template engine. Scanned XML docs pass through an XML parser and a data preprocessor into the metadata extraction engine, which applies a template and outputs metadata.]

Page 16: Document features

• Layout features
  – Boldness: whether the text is in a bold font or not
  – Font size: the font size used in the text, e.g. font size 12, font size 14, etc.
  – Alignment: whether the text is left-aligned, right-aligned, centered, or justified
  – Geometric location: for example, a block starting at coordinates (0, 0) and ending at coordinates (100, 200)
  – Geometric relation: for example, a block located below the title block

Page 17: Document features

• Textual features (illustrated in the sketch below)
  – Special words: for example, a string starting with "abstract"
  – Special patterns: for example, a string matching the regular expression "[1-2][0-9][0-9][0-9]"
  – Statistical features: for example, a string with more than 20 words, a string with more than 100 letters, or a string with more than 50% of its letters in upper case
  – Knowledge features: for example, a string containing a last name from a name dictionary
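The textual features above map naturally onto small predicate functions over a line of text. Below is a minimal Python sketch; the thresholds come from the examples on this slide, while the name dictionary contents are hypothetical stand-ins.

import re

SURNAMES = {"maly", "smith", "jones"}  # hypothetical stand-in for a name dictionary

def starts_with_special_word(s, word="abstract"):
    # special word: e.g. a string starting with "abstract"
    return s.lower().startswith(word)

def matches_year_pattern(s):
    # special pattern: the regular expression [1-2][0-9][0-9][0-9]
    return re.search(r"[1-2][0-9][0-9][0-9]", s) is not None

def is_mostly_uppercase(s, threshold=0.5):
    # statistical feature: more than 50% of the letters in upper case
    letters = [c for c in s if c.isalpha()]
    return bool(letters) and sum(c.isupper() for c in letters) / len(letters) > threshold

def contains_known_surname(s):
    # knowledge feature: a last name found in the name dictionary
    return any(w.lower().strip(",.") in SURNAMES for w in s.split())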

Page 18: Template language

• XML based

• Related to document features

• XML schema

• Simple document model (sketched below)
  – Document - Page - Zone - Region - Column - Row - Paragraph - Line - Word - Character
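The document model can be rendered compactly as a single recursive node type whose kind field walks the hierarchy above; a minimal illustrative sketch in Python, not the engine's actual data structures:

from dataclasses import dataclass, field
from typing import Iterator, List

LEVELS = ["document", "page", "zone", "region", "column",
          "row", "paragraph", "line", "word", "character"]

@dataclass
class Node:
    kind: str                                  # one of LEVELS
    text: str = ""                             # populated at the word/character level
    children: List["Node"] = field(default_factory=list)

    def lines(self) -> Iterator["Node"]:
        """Yield all line-level nodes in document order."""
        if self.kind == "line":
            yield self
        for child in self.children:
            yield from child.lines()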

Page 19: Template sample

<?xml version="1.0" ?>
<structdef>
  <title min="0" max="1">
    <begin inclusive="current">largeststrsize(0,0.5)</begin>
    <end inclusive="before">sizechange(1)</end>
  </title>
  <creator min="0" max="1">
    <begin inclusive="after">title</begin>
    <end inclusive="before">!nameformat</end>
  </creator>
  <date min="0" max="1">
    <begin inclusive="current">dateformat</begin>
    <end inclusive="current">onesection</end>
  </date>
</structdef>
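To suggest how such a template might drive extraction, here is a heavily simplified Python sketch. The feature predicates are guesses at the semantics of the names in the sample (dateformat, nameformat, onesection); the real engine's features (largeststrsize, sizechange, cross-field references such as "title") and its inclusive/before/after handling are richer and are omitted here.

import re
import xml.etree.ElementTree as ET

# Hypothetical, simplified feature predicates keyed by name.
FEATURES = {
    "dateformat": lambda line: re.search(r"[1-2][0-9]{3}", line) is not None,
    "nameformat": lambda line: re.match(r"[A-Z][a-z]+ [A-Z]", line) is not None,
    "onesection": lambda line: True,
}

def holds(expr, line):
    """Evaluate a begin/end expression (optionally negated with '!') on one line."""
    if expr.startswith("!"):
        return not holds(expr[1:], line)
    name = expr.split("(")[0]
    return FEATURES.get(name, lambda _line: False)(line)

def apply_template(template_xml, lines):
    """For each field, collect the lines between its begin and end conditions.
    Simplification: begin is treated as inclusive, end as exclusive."""
    metadata = {}
    for elem in ET.fromstring(template_xml):
        begin, end = elem.findtext("begin"), elem.findtext("end")
        collected, active = [], False
        for line in lines:
            if not active and holds(begin, line):
                active, collected = True, [line]
            elif active and holds(end, line):
                break
            elif active:
                collected.append(line)
        if collected:
            metadata[elem.tag] = " ".join(collected)
    return metadata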

Page 20: Sample document PDF

Page 21: Scan OCR output

Page 22: ‘Clean’ XML output

Page 23: Template (part)

Page 24: Metadata extracted

Page 25: Results Summary from DTIC Project

Approach | Test set | Accuracy
SVM | DTIC100 | 85%-100%
SVM | DTIC600 | 90%-100%
Expert system | DTIC10 | 85%
Expert system | DTIC20 | 95%
Template: cover page detection | DTIC30 | 100%
Template: SF298 location | DTIC1000 | 100%
Template: SF298 (4 fields) | DTIC1000 | 95%-97%
Template: SF298 (27 fields) | DTIC1000 | 88%-99%

(The original table distinguished PDFs produced from scanned images from PDFs produced from text.)

Page 26: Experiment with Limited GPO Documents

• 14 GPO Documents having Technical Report Documentation Page

• 57 GPO Documents without Technical Report Documentation Page

• 16 Congressional Reports

• 16 Public Law Documents

Page 27: GPO Report Documentation Page

Page 28: GPO Document

Page 29: Congressional Report

Page 30: Public Law Document

Page 31: Conclusions

• OCR software works very well on current documents

• The template-based approach allows automatic metadata extraction, with a high degree of accuracy, from
  – dynamically changing collections
  – heterogeneous, large collections
  – report documentation pages

• Structural metadata extraction (e.g., table of contents, tables, equations, sections) appears feasible

Page 32: Metadata Extraction Part II: Automatic Categorization

Page 33: Document Categorization

• Problem: given
  – a collection of documents, and
  – a taxonomy of subject areas

• Classification: determine the subject area(s) most pertinent to each document

• Indexing: select a set of keywords / index terms appropriate to each document

Page 34: Classification Techniques

• Manual (a.k.a. Knowledge Engineering)
  – typically, rule-based expert systems

• Machine Learning
  – Probabilistic (e.g., Naïve Bayes)
  – Decision structures (e.g., decision trees)
  – Profile-based
    • compare the document to profile(s) of the subject classes
    • similarity rules similar to those employed in IR
  – Support vector machines (SVM)

Page 35: Classification via Machine Learning

• Usually train-and-test
  – Exploit an existing collection in which documents have already been classified
    • a portion used as the training set
    • another portion used as the test set
  – permits measurement of classifier effectiveness
  – allows tuning of classifier parameters to yield maximum effectiveness

• Single- vs. multi-label
  – can one document be assigned to multiple categories?

Page 36: Automatic Indexing

• Assign to each document up to k terms drawn from a controlled vocabulary

• Typically reduced to a multi-label classification problem (sketched below)
  – each keyword corresponds to a class of documents for which that keyword is an appropriate descriptor
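That reduction can be sketched with off-the-shelf tools. The snippet below is illustrative only: scikit-learn stands in for the Lucene/LibSVM stack used elsewhere in this work, and the tiny corpus and keyword sets are invented.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

docs = ["helicopter rotor aerodynamics", "rotor blade fatigue tests",
        "naive bayes text classification", "svm kernel methods"]
keywords = [["aircraft"], ["aircraft", "materials"],
            ["machine-learning"], ["machine-learning"]]

mlb = MultiLabelBinarizer()                 # one binary class per keyword
Y = mlb.fit_transform(keywords)
X = TfidfVectorizer().fit_transform(docs)   # weighted term vectors

clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
predicted = mlb.inverse_transform(clf.predict(X))  # keyword set per document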

Page 37: Case Study: SVM categorization

• Document collection from DTIC
  – 10,000 documents
    • previously classified manually
  – Taxonomy of
    • 25 broad subject fields, divided into a total of
    • 251 narrower groups
  – Document lengths average 2705 (±1464) words, with 623 (±274) significant unique terms per document
  – The collection has 32,457 significant unique terms

Page 38: Document Size Distribution

[Chart: Document Collection; histogram of document sizes. X-axis: words per document (bins 0-1000 through 8001+); y-axis: number of documents (0-80).]

Page 39: Unique Term Distribution

[Chart: histogram of vocabulary sizes. X-axis: number of unique significant terms per document (bins 0-300 through 1501+); y-axis: number of documents (0-120).]

Page 40: Sample: Broad Subject Fields

01 - Aviation Technology
02 - Agriculture
03 - Astronomy and Astrophysics
04 - Atmospheric Sciences
05 - Behavioral and Social Sciences
06 - Biological and Medical Sciences
07 - Chemistry
08 - Earth Sciences and Oceanography

Page 41: Sample: Narrow Subject Groups

Aviation Technology
  01 Aerodynamics
  02 Military Aircraft Operations
  03 Aircraft
    0301 Helicopters
    0302 Bombers
    0303 Attack and Fighter Aircraft
    0304 Patrol and Reconnaissance Aircraft

Page 42: Distribution among Categories

[Chart: Broad Category Distribution. X-axis: number of documents in a category (bins 1-50 through >500); y-axis: number of categories (0-6).]

Page 43: Detailed Category Distribution

[Chart: Detailed Category Distribution. X-axis: number of documents per category (bins 0-100 through >1000); y-axis: number of categories (0-250).]

Page 44: Baseline

• Establish a baseline for state-of-the-art machine learning techniques
  – classification
  – training an SVM for each subject area

• "Off-the-shelf" document modelling and SVM libraries

Page 45: Why SVM?

• Prior studies have suggested good results with SVM

• Relatively immune to "overfitting" (fitting to coincidental relations encountered during training)

• Few model parameters
  – avoids the problems of optimizing in a high-dimension space

Page 46: Machine Learning: Support Vector Machines

• Binary Classifier
  – finds the plane with the largest margin separating the two classes of training samples
  – subsequently classifies items based on which side of the plane they fall

[Figure: two-dimensional example; axes are font size and line number, showing the separating hyperplane and its margin.]

Page 47: SVM Evaluation

[Diagram: SVM evaluation workflow. Term extraction over the collection builds a collection model; training and test samples each undergo document conversion; the converted training sample drives SVM training, and the trained SVM marks test documents as accepted in class or not.]

Page 48: Baseline SVM Evaluation (Interim Report)

– Training & testing process repeated for multiple subject categories

– Determine accuracy
  • overall
  • positive (ability to recognize new documents that belong in the class the SVM was trained for)
  • negative (ability to reject new documents that belong to other classes)

– Explore training issues

Page 49: SVM "Out of the Box"

• 16 broad categories with 150 or more documents

• Lucene library for extracting terms and forming weighted term vectors

• LibSVM for SVM training & testing
  – no normalization or parameter tuning

• Training set of 100/100 (positive/negative samples)

• Test set of 50/50 (protocol sketched below)
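The evaluation protocol can be sketched as follows. This Python snippet uses scikit-learn's libsvm-backed SVC as an illustrative stand-in for the Lucene + LibSVM toolchain actually used; pos_docs and neg_docs are hypothetical lists of document texts for one category.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC

def evaluate_category(pos_docs, neg_docs):
    """Train on 100/100 positive/negative samples, test on 50/50."""
    train_texts = pos_docs[:100] + neg_docs[:100]
    train_labels = [1] * 100 + [0] * 100
    test_texts = pos_docs[100:150] + neg_docs[100:150]
    test_labels = [1] * 50 + [0] * 50

    vectorizer = TfidfVectorizer()                 # weighted term vectors
    X_train = vectorizer.fit_transform(train_texts)
    X_test = vectorizer.transform(test_texts)

    clf = SVC().fit(X_train, train_labels)         # defaults only: no normalization or tuning
    predictions = clf.predict(X_test)
    correct = sum(p == t for p, t in zip(predictions, test_labels))
    return correct / len(test_labels)              # accuracy = correct decisions / test set size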

Page 50: SVM Accuracy "out of the box"

[Chart: SVM accuracy "out of the box" for the 16 categories; y-axis: accuracy (0-90%); average and standard deviation shown per category.]

Page 51: "OotB" Interpretation

• Reasonable performance on broad categories given modest training set size
  – accuracy measured as (# correct decisions / test set size)

• A related experiment showed that with normalization and optimized parameter selection, accuracy could be improved by as much as an additional 10%

Page 52: Training Set Size

Page 53: Training Set Size

• accuracy plateaus for training set sizes well under the number of terms in the document model

Page 54: Training Issues

• Training Set Size
  – Concern: detailed subject groups may have too few known examples to perform effective SVM training in that subject
  – Possible solution: the collection may have few positive examples, but it has many, many negative examples

• Positive/Negative Training Mixes
  – effects on accuracy

Page 55: Increased Negative Training

Page 56: Training Set Composition

• Experiment performed with 50 positive training examples
  – out-of-the-box SVM training

• Increasing the number of negative training examples has little effect on overall accuracy

• But positive accuracy is reduced

Page 57: Interpretation

• May indicate a weakness in SVM
  – or simply further evidence of the importance of optimizing SVM parameters

• May indicate the unsuitability of treating SVM output as a simple boolean decision
  – might do better as a "best fit" in a multi-label classifier

Page 58: Conclusions

• The state of the art for DTIC-like collections will give on the order of 75% accuracy

• Key problems that need to be addressed
  – establish baselines for other methods
  – validation: recognizing trusted results
    • so the system can fall back on human intervention otherwise
  – improve on the baseline with more sophisticated methods
    • possible application of knowledge bases

Page 59: Additional Slides

Page 60: Metadata Extraction: Machine-Learning Approach

• Learn the relationship between input and output from samples and make predictions for new data

• This approach has good adaptability but it has to be trained from samples.

• HMM (Hidden Markov Model) & SVM (Support Vector Machine)

Page 61: Machine Learning - Hidden Markov Models

• “Hidden Markov Modeling is a probabilistic technique for the study of observed items arranged in discrete-time series” (Alan B. Poritz, "Hidden Markov Models: A Guided Tour", ICASSP 1988)

• An HMM is a probabilistic finite state automaton
  – it transits from state to state
  – it emits a symbol when visiting each state
  – the states are hidden

[Diagram: example HMM state chain with states A, B, C, D.]

Page 62: Hidden Markov Models

• A Hidden Markov Model consists of:
  • a set of hidden states (e.g. coin1, coin2, coin3)
  • a set of observation symbols (e.g. H and T)
  • transition probabilities: the probability of moving from one state to another
  • emission probabilities: the probability of emitting each symbol in each state
  • initial probabilities: the probability of each state being chosen as the first state

Page 63: HMM - Metadata Extraction

– A document is a sequence of words produced by some hidden states (title, author, etc.)

– The parameters of the HMM are learned from samples in advance

– Metadata extraction then finds the most probable sequence of states (title, author, etc.) for a given sequence of words (see the Viterbi sketch below)
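Finding the most probable state sequence is the classic Viterbi decoding problem. Below is a minimal, self-contained Python sketch; the toy states, vocabulary, and probabilities are invented for illustration only.

import math

STATES = ["title", "author", "other"]
START = {"title": 0.6, "author": 0.2, "other": 0.2}
TRANS = {"title": {"title": 0.7, "author": 0.2, "other": 0.1},
         "author": {"title": 0.05, "author": 0.7, "other": 0.25},
         "other": {"title": 0.05, "author": 0.15, "other": 0.8}}
EMIT = {"title": {"metadata": 0.4, "extraction": 0.5, "kurt": 0.1},
        "author": {"metadata": 0.1, "extraction": 0.1, "kurt": 0.8},
        "other": {"metadata": 0.3, "extraction": 0.3, "kurt": 0.4}}

def viterbi(words):
    """Return the most probable hidden-state sequence for a word sequence."""
    # Each trellis cell holds (log probability of the best path, that path).
    trellis = [{s: (math.log(START[s] * EMIT[s][words[0]]), [s]) for s in STATES}]
    for word in words[1:]:
        column = {}
        for s in STATES:
            prev = max(STATES, key=lambda p: trellis[-1][p][0] + math.log(TRANS[p][s]))
            score, path = trellis[-1][prev]
            column[s] = (score + math.log(TRANS[prev][s] * EMIT[s][word]), path + [s])
        trellis.append(column)
    return max(trellis[-1].values(), key=lambda sp: sp[0])[1]

print(viterbi(["metadata", "extraction", "kurt"]))  # -> ['title', 'title', 'author']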

Page 64: Machine Learning: Support Vector Machines

• Binary Classifier (classifies data into two classes)
  – represents data with pre-defined features
  – finds the plane with the largest margin separating the two classes in the training samples
  – classifies new data into the two classes based on which side of the plane it is located

[Figure: two-dimensional SVM example; axes are font size and line number, showing the separating hyperplane and its margin.]

The figure shows an SVM example that classifies a line into two classes (title, not title) using two features: font size and line number (1, 2, 3, etc.). Each dot represents a line: a red dot is a title, a blue dot is not a title. A toy version of this example is sketched below.
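With invented training points, the example takes only a few lines (scikit-learn's SVC as an illustrative stand-in):

from sklearn.svm import SVC

# (font_size, line_number) feature pairs; invented sample data
X = [(20, 1), (18, 2), (22, 1),             # title lines: large font, near the top
     (10, 5), (11, 12), (9, 30), (10, 8)]   # non-title lines
y = [1, 1, 1, 0, 0, 0, 0]                   # 1 = title (red dot), 0 = not title (blue dot)

clf = SVC(kernel="linear").fit(X, y)        # maximum-margin separating line
print(clf.predict([(19, 2), (10, 40)]))     # -> [1 0]: a title line, then a body line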

Page 65: SVM - Metadata Extraction

• Widely used in pattern recognition areas such as face detection, isolated handwritten digit recognition, gene classification, etc.

• Basic idea
  – classes correspond to metadata elements
  – extracting metadata from a document becomes classifying each line (or block) into the appropriate classes
  – For example: to extract the document title, classify each line to see whether it is part of the title or not

Page 66: Metadata Extraction: Rule-based

• Basic idea:
  – use a set of rules, based on human observation, to define how to extract metadata
  – for example, a rule may be "The first line is the title"

• Advantages
  – can be implemented straightforwardly
  – no need for training

• Disadvantages
  – lack of adaptability (works only for similar documents)
  – difficult to work with a large number of features
  – difficult to tune the system when errors occur, because rules are usually fixed

Page 67: Metadata Extraction - Rule-based

• Expert system approach
  – build a large rule base using standard languages such as Prolog
  – use an existing expert system engine (for example, SWI-Prolog)

• Advantages
  – can use an existing engine

• Disadvantages
  – building the rule base is time-consuming

[Diagram: document → parser → facts → expert system engine, consulting the knowledge base → metadata.]

Page 68: Metadata Extraction Experience at ODU CS

• We have a knowledge database obtained from analyzing the Arc and DTIC collections
  – Authors (4 million strings from http://arc.cs.odu.edu)
  – Organizations (79 from DTIC250, 200 from DTIC600)
  – Universities (52 from DTIC250)