
Page 1

Information Extraction with HMM Structures Learned by Stochastic Optimization

Dayne Freitag and Andrew McCallum
Presented by Tal Blum

For the course: Machine Learning Approaches to Information Extraction and Information Integration

Page 2

Outline

• Background on HMM transition-structure selection
• The algorithm for the sparse IE task
• Comparison between their algorithm and the Borkar et al. algorithm
• Discussion
• Results

Page 3

HMMs for IE

• HMMs have been used successfully in many tasks:
– Speech recognition
– Information extraction (Bikel et al., Borkar et al.)
– IE in bioinformatics (Leek)
– POS tagging (Ratnaparkhi)

Page 4

Sparse Extraction Task

• Fields are extracted from a long document
• Most of the document is irrelevant
• Examples:
– NE
– Conference time & location

Page 5

Learning HMM Structure?

[Figure: a generic Bayesian network over variables X, Y, Z, W; an HMM drawn as a BN with state S and observation Obs; and the same HMM unrolled as a dynamic BN with states S1, S2, S3 emitting Obs1, Obs2, Obs3.]

Transition matrix: A = P(S_t | S_{t-1})
Emission matrix: B = P(Obs_t | S_t)
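To fix this notation, here is a minimal sketch of an HMM as the pair (A, B) plus an initial distribution, with the forward algorithm computing sequence likelihood. The matrices are made-up for illustration; nothing here comes from the paper.

```python
import numpy as np

# A minimal 2-state, 2-symbol HMM (values are illustrative):
# A[i, j] = P(S_t = j | S_{t-1} = i), B[i, k] = P(Obs_t = k | S_t = i).
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])   # initial state distribution

def sequence_likelihood(obs):
    """Forward algorithm: P(obs), summed over all hidden state paths."""
    alpha = pi * B[:, obs[0]]            # alpha_1(i) = pi_i * B[i, obs_1]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate, then weight by emission
    return alpha.sum()

print(sequence_likelihood([0, 1, 1, 0]))
```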

Page 6

Constrained Transition

[Figure: a constrained four-state transition graph over X1, X2, X3, X4 and its transition matrix A = P(S_t | S_{t-1}); zero entries correspond to edges absent from the graph.]

        X1    X2    X3    X4
  X1   0.8    0     0    0.2
  X2    0     0    0.5   0.5
  X3    0    0.3   0.5   0.2
  X4   0.2    0    0.8    0
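A constrained structure can be implemented as a boolean mask over A. The sketch below is illustrative only (the mask mirrors the example matrix above): forbidden transitions get zero probability and each row is renormalized over the allowed edges.

```python
import numpy as np

# mask[i, j] = True iff the edge X_i -> X_j exists in the structure.
mask = np.array([[True,  False, False, True ],
                 [False, False, True,  True ],
                 [False, True,  True,  True ],
                 [True,  False, True,  False]])

def constrained_mle(counts, mask):
    """ML transition estimates that respect the structure: forbidden
    transitions are clamped to 0; rows renormalize over allowed edges."""
    allowed = np.where(mask, counts, 0.0)
    row_sums = allowed.sum(axis=1, keepdims=True)
    return np.divide(allowed, row_sums,
                     out=np.zeros_like(allowed), where=row_sums > 0)

counts = np.random.default_rng(0).integers(1, 50, size=(4, 4)).astype(float)
print(constrained_mle(counts, mask).round(2))  # zeros exactly where mask forbids
```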

Page 7

HMM Structure Learning

• Unlike BN structure learning
• Learn the structure of the transition matrix A
• Learn structures with different numbers of states

[Figure: candidate transition structures over address fields (street, St. #, Zip code, country); one variant splits the Zip code field into two states, Zip C1 and Zip C2.]

Page 8

HMM Structure Example

Page 9

Example: Hierarchical HMM

Page 10

Why Learn HMM Structure?

• HMMs are not specifically suited to IE tasks
• Including structural bias reduces the number of parameters to learn, and therefore requires less data
• The parameters will be more accurate
• Structure can constrain the number of times a class appears in a document
• Structure can represent class lengths more accurately
• The emission probability might be multimodal
• Structure can model the left and right context of a class for the sparse IE task

Page 11

Fully Observed vs. Partially Observed

• Structure learning is only required when the data is partially observed
• Partially observed – a field is represented by several states, where the label is the field
• With fully observed data we can let the probabilities “learn” the structure
• Edges that are never observed get zero probability (see the sketch below)
• Learning the transition structure involves incorporating new states
• Naively allowing arbitrary transitions does not generalize well
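A minimal sketch of the fully observed case (assumed, not from the paper): transition probabilities are just normalized counts over labeled state sequences, so any edge never seen in training ends up with probability zero.

```python
from collections import Counter

def learn_transitions(state_sequences):
    """ML transition estimates from fully observed state sequences.
    Transitions never observed simply get no probability mass."""
    counts, totals = Counter(), Counter()
    for seq in state_sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[(prev, cur)] += 1
            totals[prev] += 1
    return {edge: c / totals[edge[0]] for edge, c in counts.items()}

# Toy labeled data: 'bg' = background state, 'tgt' = target-field state.
seqs = [["bg", "bg", "tgt", "tgt", "bg"],
        ["bg", "tgt", "bg", "bg"]]
print(learn_transitions(seqs))
# Only observed edges appear; e.g. ('tgt', 'tgt') has mass, while any
# unobserved transition is absent from the table, i.e. probability 0.
```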

Page 12

The Problem

• How to select the additional states and the state-transition structure?
• Manual selection doesn’t scale well
• Human intuition does not always correspond to the best structures

Page 13

The Solution

• A system that automatically selects an HMM transition structure
• The system starts from a simple initial model and extends it sequentially with a set of operations, searching for a better model
• Model quality is measured by its discrimination on a validation dataset
• The best model found is returned
• The system is competitive with human-constructed HMM structures and on average outperforms them

Page 14

IE with HMMs

• Each extracted field has its own HMM
• Each HMM contains two kinds of states:
– Target states
– Non-target states
• All of the field HMMs are concatenated into a single consistent HMM
• The entire document is used to train the models, with no need for pre-processing (extraction then reduces to decoding; see the sketch below)
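To illustrate how extraction falls out of decoding (a sketch with assumed state labels, not the paper's code): run Viterbi over the document, then emit the token spans aligned with target states.

```python
def extract(tokens, viterbi_states, target_states):
    """Return the token spans whose decoded state is a target state."""
    spans, current = [], []
    for tok, state in zip(tokens, viterbi_states):
        if state in target_states:
            current.append(tok)
        elif current:                      # a target span just ended
            spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = "the talk is at 3 pm in wean 5409 today".split()
states = ["bg"] * 3 + ["pre", "tgt", "tgt", "bg", "bg", "bg", "bg"]
print(extract(tokens, states, {"tgt"}))    # ['3 pm']
```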

Page 15

Parameter Estimation

• Transition probabilities are estimated with maximum likelihood:
– Unique path – ratio of counts
– Non-unique path – use EM
• Emission probabilities require smoothing with priors:
– Shrinkage, with mixture weights estimated by EM (implemented in the sketch below)

P(w|s) = λ1 P_ML(w|s) + λ2 P_ML(w|a(s)) + λ3 (1/K)

where P_ML is the maximum-likelihood estimate, a(s) is the parent of state s in the shrinkage hierarchy, K is the vocabulary size, and λ1 + λ2 + λ3 = 1.
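A minimal sketch of the shrinkage step (function and variable names are assumptions, not the paper's code): each state's sparse emission estimate is interpolated with its parent's pooled estimate and the uniform distribution.

```python
import numpy as np

def shrink_emissions(p_state, p_parent, lambdas):
    """Shrinkage-smoothed emission distribution:
    P(w|s) = l1*P_ml(w|s) + l2*P_ml(w|a(s)) + l3*(1/K).
    lambdas must sum to 1; in the paper they are fit by EM."""
    l1, l2, l3 = lambdas
    K = len(p_state)                        # vocabulary size
    uniform = np.full(K, 1.0 / K)
    return l1 * p_state + l2 * p_parent + l3 * uniform

p_state  = np.array([0.6, 0.4, 0.0, 0.0])   # sparse ML estimate for one state
p_parent = np.array([0.3, 0.3, 0.2, 0.2])   # pooled estimate over sibling states
print(shrink_emissions(p_state, p_parent, (0.5, 0.4, 0.1)))
# Note: words unseen in the state (the zeros) now get nonzero probability.
```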

Page 16

Learning State-Transition Structure

States:

• Target

• Prefix

• Suffix

• Background

Page 17

Model Expansion Choices

States:
– Target
– Prefix
– Suffix
– Background

Model expansion choices (see the sketch after this list):
– Lengthen a prefix
– Split a prefix
– Lengthen a suffix
– Split a suffix
– Lengthen a target string
– Split a target string
– Add a background state
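As an illustration of these search moves, each operation maps one structure to a candidate neighbor. The representation below (chains of state names per part) is invented for the sketch; the paper operates on actual HMM states.

```python
import copy

# A structure as lists of state chains, one entry per part of the model.
structure = {"prefix": [["p1"]], "suffix": [["s1"]],
             "target": [["t1"]], "background": ["b1"]}

def lengthen(structure, part, chain_idx=0):
    """Append one state to a chain (e.g. a longer prefix window)."""
    new = copy.deepcopy(structure)
    chain = new[part][chain_idx]
    chain.append(f"{part[0]}{len(chain) + 1}")
    return new

def split(structure, part, chain_idx=0):
    """Duplicate a chain as a parallel alternative path."""
    new = copy.deepcopy(structure)
    new[part].append(list(new[part][chain_idx]))
    return new

def add_background(structure):
    new = copy.deepcopy(structure)
    new["background"].append(f"b{len(new['background']) + 1}")
    return new

candidates = [lengthen(structure, "prefix"), split(structure, "suffix"),
              lengthen(structure, "target"), add_background(structure)]
```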

Page 18

The Algorithm
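The slide's figure is not recoverable here, but the search loop it depicts can be sketched as greedy hill climbing of the kind the paper describes. This is a hedged reconstruction: `train` and `score` are abstract callbacks standing in for parameter estimation and validation-set scoring, and the paper's actual procedure may differ in restarts and stopping criteria.

```python
def learn_structure(initial_model, expansions, train, score, n_iters=25):
    """Greedy structure search: start from a simple model, apply every
    expansion operation, train each candidate, keep the one that scores
    best on validation data, and return the best model seen overall."""
    best = current = train(initial_model)
    for _ in range(n_iters):
        candidates = [train(op(current)) for op in expansions]
        current = max(candidates, key=score)
        if score(current) > score(best):
            best = current
    return best
```

In the paper, model quality is measured by discrimination on a validation dataset, which is what `score` stands in for here.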

Page 19

Discussion

• Structure learning is similar to rule learning for word or boundary classification
• The search for the best structure is not exhaustive
• There is no attempt to generalize better by sharing emission probabilities across states

Page 20

Comparison with the Borkar et al. Algorithm

Differences:
• Segmentation vs. sparse extraction
• Modeling of the background and of boundaries
• Unique path – no need for EM
• Backward search vs. forward search

Similarity:
• Both assume boundaries, and that position is the most relevant feature for distinguishing different states

Page 21

Experimental Results

• Tested on 8 extraction tasks over 4 datasets:
– Seminar Announcements (485)
– Reuters Corporate Acquisition articles (600)
– Job Announcements (298)
– Calls for Papers (363)
• Training and test sets were of equal size
• Performance is averaged over 10 splits

Page 22

Learned Structure

Page 23

Experimental Results

The learned structure (Grown HMM) was compared to 4 other approaches:
– SRV – rule learning (Freitag 1998)
– Rapier – rule learning (Califf 1998)
– Simple HMM
– Complex HMM

Page 24

Experimental Results

Page 25

Conclusions

• HMMs have proved to be a state-of-the-art method for IE
• Constraining the transition structure has a crucial effect on performance
• Automatic transition-structure learning matches, and on average outperforms, manually crafted HMM structures, which require hard manual labor to construct

Page 26

The End!

Questions?