Information Extraction from the World Wide Web
Discriminative Finite State Models, Feature Induction & Scoped Learning
Andrew McCallum, University of Massachusetts, Amherst
An HR office
Jobs, but not HR jobs
Extracting Job Openings from the Web
foodscience.com-Job2
JobTitle: Ice Cream Guru
Employer: foodscience.com
JobCategory: Travel/Hospitality
JobFunction: Food Services
JobLocation: Upper Midwest
Contact Phone: 800-488-2611
DateExtracted: January 8, 2001
Source: www.foodscience.com/jobs_midwest.html
OtherCompanyJobs: foodscience.com-Job1
This took place in ‘99
Not in Maryland
Courses from all over the world
Why prefer “knowledge base search” over “page search”
• Targeted, restricted universe of hits
– Don't show resumes when I'm looking for job openings.
• Specialized queries
– Topic-specific
– Multi-dimensional
– Based on information spread on multiple pages.
• Get correct granularity
– Site, page, paragraph
• Specialized display
– Super-targeted hit summarization in terms of DB slot values
• Ability to support sophisticated data mining
• Ability to support sophisticated data mining
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
October 14, 2002, 4:00 a.m. PT
For years, Microsoft Corporation CEO Bill Gates railed against the economic philosophy of open-source software with Orwellian fervor, denouncing its communal licensing as a "cancer" that stifled technological innovation.
Today, Microsoft claims to "love" the open-source concept, by which software code is made public to encourage improvement and development by outside programmers. Gates himself says Microsoft will gladly disclose its crown jewels--the coveted code behind the Windows operating system--to select customers.
"We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
| NAME | TITLE | ORGANIZATION |
| --- | --- | --- |
What is “Information Extraction”
As a task:
Filling slots in a database from sub-segments of text.
IE

| NAME | TITLE | ORGANIZATION |
| --- | --- | --- |
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + clustering + association
Microsoft Corporation, CEO, Bill Gates, Microsoft, Gates, Microsoft, Bill Veghte, Microsoft, VP, Richard Stallman, founder, Free Software Foundation
What is “Information Extraction”
As a family of techniques:
Information Extraction = segmentation + classification + association + clustering
| NAME | TITLE | ORGANIZATION |
| --- | --- | --- |
| Bill Gates | CEO | Microsoft |
| Bill Veghte | VP | Microsoft |
| Richard Stallman | founder | Free Soft.. |
IE in Context
Create ontology
Segment, Classify, Associate, Cluster
Load DB
Spider
Query, Search
Data mine
IE
Document collection
Database
Filter by relevance
Label training data
Train extraction models
Hidden Markov Models
Finite state model / graphical model: states $S_{t-1}, S_t, S_{t+1}, \ldots$ emitting observations $O_{t-1}, O_t, O_{t+1}, \ldots$

Parameters: for all states $S = \{s_1, s_2, \ldots\}$
- Start state probabilities: $P(s_t)$
- Transition probabilities: $P(s_t \mid s_{t-1})$
- Observation (emission) probabilities: $P(o_t \mid s_t)$

Training: Maximize probability of training observations (w/ prior)

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$
HMMs are the standard sequence modeling tool in genomics, speech, NLP, …
Generates:
- State sequence (transitions)
- Observation sequence (emissions): $o_1\, o_2\, o_3\, o_4\, o_5\, o_6\, o_7\, o_8$, usually a multinomial over an atomic, fixed alphabet
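As a quick illustration of the generative factorization above, a minimal sketch with a hypothetical two-state model (all probabilities here are invented toy values, not from the talk):

```python
# HMM joint probability P(s, o) = prod_t P(s_t | s_{t-1}) * P(o_t | s_t),
# for a hypothetical two-state ("name"/"other") model; "start" rows hold
# the start-state probabilities.
trans = {("start", "name"): 0.3, ("start", "other"): 0.7,
         ("name", "name"): 0.5, ("name", "other"): 0.5,
         ("other", "name"): 0.2, ("other", "other"): 0.8}
emit = {("name", "Saul"): 0.4, ("name", "spoke"): 0.1,
        ("other", "Saul"): 0.01, ("other", "spoke"): 0.2}

def joint_prob(states, obs):
    """P(s, o) for one aligned state/observation sequence."""
    p, prev = 1.0, "start"
    for s, o in zip(states, obs):
        p *= trans[(prev, s)] * emit[(s, o)]
        prev = s
    return p

p = joint_prob(["name", "other"], ["Saul", "spoke"])  # 0.3 * 0.4 * 0.5 * 0.2 = 0.012
```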
IE with Hidden Markov Models
Given a sequence of observations:

Yesterday Lawrence Saul spoke this example sentence.

and a trained HMM, find the most likely state sequence (Viterbi):

$$s^* = \arg\max_s P(s, o)$$

Any words said to be generated by the designated "person name" state are extracted as a person name:

Person name: Lawrence Saul
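The Viterbi decoding step can be sketched as follows, again with a hypothetical two-state person-name model and invented probabilities:

```python
# Viterbi decoding, s* = argmax_s P(s, o), for a hypothetical toy HMM.
STATES = ["name", "other"]
TRANS = {("start", "name"): 0.3, ("start", "other"): 0.7,
         ("name", "name"): 0.5, ("name", "other"): 0.5,
         ("other", "name"): 0.2, ("other", "other"): 0.8}
EMIT = {("name", "Saul"): 0.4, ("name", "spoke"): 0.1,
        ("other", "Saul"): 0.01, ("other", "spoke"): 0.2}

def viterbi(obs):
    """Return the most likely state sequence for the observations."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (TRANS[("start", s)] * EMIT[(s, obs[0])], [s]) for s in STATES}
    for o in obs[1:]:
        best = {s: max([(p * TRANS[(prev, s)] * EMIT[(s, o)], path + [s])
                        for prev, (p, path) in best.items()],
                       key=lambda x: x[0])
                for s in STATES}
    return max(best.values(), key=lambda x: x[0])[1]

print(viterbi(["Saul", "spoke"]))  # ['name', 'other']
```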
HMM Example: “Nymble”
Task: Named Entity Extraction [Bikel, et al 1998], [BBN "IdentiFinder"]

States: Person, Org, (five other name classes), Other, plus start-of-sentence and end-of-sentence.

Transition probabilities: $P(s_t \mid s_{t-1}, o_{t-1})$, backing off to $P(s_t \mid s_{t-1})$, then $P(s_t)$.

Observation probabilities: $P(o_t \mid s_t, s_{t-1})$ or $P(o_t \mid s_t, o_{t-1})$, backing off to $P(o_t \mid s_t)$, then $P(o_t)$.

Other examples of shrinkage for HMMs in IE: [Freitag and McCallum '99]
Regrets from Atomic View of Tokens
Would like richer representation of text: multiple overlapping features, whole chunks of text.
Example line, sentence, or paragraph features:
– length
– is centered in page
– percent of non-alphabetics
– white-space aligns with next line
– containing sentence has two verbs
– grammatically contains a question
– contains links to "authoritative" pages
– emissions that are uncountable
– features at multiple levels of granularity

Example word features:
– identity of word
– is in all caps
– ends in "-ski"
– is part of a noun phrase
– is in a list of city names
– is under node X in WordNet or Cyc
– is in bold font
– is in hyperlink anchor
– features of past & future
– last person name was female
– next two words are "and Associates"
Problems with Richer Representation and a Generative Model
• These arbitrary features are not independent:
– Overlapping and long-distance dependences
– Multiple levels of granularity (words, characters)
– Multiple modalities (words, formatting, layout)
– Observations from past and future
• HMMs are generative models of the text: they model $P(s, o)$.
• Generative models do not easily handle these non-independent features. Two choices:
– Model the dependencies. Each state would have its own Bayes Net. But we are already starved for training data!
– Ignore the dependencies. This causes "over-counting" of evidence (à la naïve Bayes). Big problem when combining evidence, as in Viterbi!
Conditional Sequence Models
• We would prefer a conditional model: $P(s \mid o)$ instead of $P(s, o)$:
– Can examine features, but is not responsible for generating them.
– Don't have to explicitly model their dependencies.
– Don't "waste modeling effort" trying to generate what we are given at test time anyway.
• This answers the challenge of integrating the ability to handle many arbitrary features with the full power of finite state automata.
Locally Normalized Conditional Sequence Model
Generative (traditional HMM): states generate both transitions and observations.

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

Conditional: transitions are conditioned on the observations.

$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o_t)$$

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.

Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
Locally Normalized Conditional Sequence Model
Generative (traditional HMM):

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

Conditional, or more generally, conditioned on the entire observation sequence:

$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o)$$

Standard belief propagation: forward-backward procedure. Viterbi and Baum-Welch follow naturally.

Maximum Entropy Markov Models [McCallum, Freitag & Pereira, 2000]
MaxEnt POS Tagger [Ratnaparkhi, 1996]
SNoW-based Markov Model [Punyakanok & Roth, 2000]
Exponential Form for “Next State” Function
$$P(s_t \mid s_{t-1}, o_t) = P_{s_{t-1}}(s_t \mid o_t) = \frac{1}{Z(o_t, s_{t-1})} \exp\Big(\sum_k \lambda_k\, f_k(o_t, t, s_{t-1}, s_t)\Big)$$

where each $\lambda_k$ is a weight and each $f_k$ is a feature. $P_{s_{t-1}}(s_t \mid o_t)$ acts as a black-box classifier attached to source state $s_{t-1}$.

Overall Recipe:
- Labeled data is assigned to transitions.
- Train each state's exponential model by maximum likelihood (iterative scaling or conjugate gradient).
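A minimal sketch of this per-state exponential ("maxent") next-state function; the single feature and its weight below are hypothetical illustrations:

```python
import math

# For each source state s_{t-1}, a softmax classifier over next states:
# P(s_t | s_prev, o_t) = exp(sum_k lambda_k * f_k(o, t, s_prev, s_t)) / Z.
def next_state_dist(states, feats, lambdas, o, t, s_prev):
    scores = {s: math.exp(sum(lam * f(o, t, s_prev, s)
                              for lam, f in zip(lambdas, feats)))
              for s in states}
    z = sum(scores.values())  # Z(o_t, s_prev), the per-context normalizer
    return {s: v / z for s, v in scores.items()}

# One hypothetical feature: current token is capitalized AND next state is "name".
feats = [lambda o, t, s_prev, s: 1.0 if o[t - 1][0].isupper() and s == "name" else 0.0]
dist = next_state_dist(["name", "other"], feats, [2.0], ["Lawrence"], 1, "other")
# dist is a proper distribution, and "name" is favored on a capitalized token.
```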
Feature Functions
Example $f_k(o, t, s_{t-1}, s_t)$:

$$f_{\langle\text{Capitalized},\, s_i,\, s_j\rangle}(o, t, s_{t-1}, s_t) = \begin{cases} 1 & \text{if Capitalized}(o_t),\ s_{t-1} = s_i,\ s_t = s_j \\ 0 & \text{otherwise} \end{cases}$$

Yesterday Lawrence Saul spoke this example sentence.

$o = o_1\, o_2\, o_3\, o_4\, o_5\, o_6\, o_7$, with states $s_1, s_2, s_3, s_4$:

$$f_{\langle\text{Capitalized},\, s_1,\, s_3\rangle}(o, 2, s_1, s_3) = 1$$
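The same kind of binary feature can be written as code; this sketch assumes 1-based token positions and reuses the slide's generic state labels:

```python
# A binary feature that fires when o_t is capitalized and the transition
# is s_i -> s_j.
def make_capitalized_feature(s_i, s_j):
    def f(o, t, s_prev, s_t):
        # t is 1-based, matching o = o_1 ... o_7 on the slide
        return 1 if o[t - 1][0].isupper() and (s_prev, s_t) == (s_i, s_j) else 0
    return f

o = "Yesterday Lawrence Saul spoke this example sentence.".split()
f = make_capitalized_feature("s1", "s3")
print(f(o, 2, "s1", "s3"))  # o_2 = "Lawrence" is capitalized -> prints 1
```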
Experimental Data
38 files belonging to 7 UseNet FAQs
Example:
```
<head> X-NNTP-Poster: NewsHound v1.33
<head> Archive-name: acorn/faq/part2
<head> Frequency: monthly
<head>
<question> 2.6) What configuration of serial cable should I use?
<answer>
<answer> Here follows a diagram of the necessary connection
<answer> programs to work properly. They are as far as I know
<answer> agreed upon by commercial comms software developers fo
<answer>
<answer> Pins 1, 4, and 8 must be connected together inside
<answer> is to avoid the well known serial port chip bugs. The
```
Procedure: For each FAQ, train on one file, test on other; average.
Features in Experiments
begins-with-number
begins-with-ordinal
begins-with-punctuation
begins-with-question-word
begins-with-subject
blank
contains-alphanum
contains-bracketed-number
contains-http
contains-non-space
contains-number
contains-pipe
contains-question-mark
contains-question-word
ends-with-question-mark
first-alpha-is-capitalized
indented
indented-1-to-4
indented-5-to-10
more-than-one-third-space
only-punctuation
prev-is-blank
prev-begins-with-ordinal
shorter-than-30
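A few of the features above, sketched as simple line predicates; the exact regex and threshold choices here are assumptions, not taken from the talk:

```python
import re

# A handful of the line features above as boolean predicates.
LINE_FEATURES = {
    "begins-with-number": lambda ln: bool(re.match(r"\s*\d", ln)),
    "contains-question-mark": lambda ln: "?" in ln,
    "ends-with-question-mark": lambda ln: ln.rstrip().endswith("?"),
    "blank": lambda ln: ln.strip() == "",
    "indented-1-to-4": lambda ln: ln.strip() != ""
                                  and 1 <= len(ln) - len(ln.lstrip(" ")) <= 4,
    "shorter-than-30": lambda ln: len(ln.rstrip()) < 30,
}

def line_features(line):
    """Return the set of feature names that fire on this line."""
    return {name for name, pred in LINE_FEATURES.items() if pred(line)}
```

For example, the FAQ question line "2.6) What configuration of serial cable should I use?" fires begins-with-number, contains-question-mark, and ends-with-question-mark.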
Models Tested
• ME-Stateless: A single maximum entropy classifier applied to each line independently.
• TokenHMM: A fully-connected HMM with four states, one for each of the line categories, each of which generates individual tokens (groups of alphanumeric characters and individual punctuation characters).
• FeatureHMM: Identical to TokenHMM, except that the lines in a document are first converted to sequences of features.
• MEMM: The Maximum Entropy Markov Model described in this talk.
Results
| Learner | Segmentation precision | Segmentation recall |
| --- | --- | --- |
| ME-Stateless | 0.038 | 0.362 |
| TokenHMM | 0.276 | 0.140 |
| FeatureHMM | 0.413 | 0.529 |
| MEMM | 0.867 | 0.681 |
From HMMs to MEMMs to CRFs

$s = s_1, s_2, \ldots, s_n$; $o = o_1, o_2, \ldots, o_n$

HMM (a special case of both MEMMs and CRFs):

$$P(s, o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1})\, P(o_t \mid s_t)$$

MEMM:

$$P(s \mid o) = \prod_{t=1}^{|o|} P(s_t \mid s_{t-1}, o) = \prod_{t=1}^{|o|} \frac{1}{Z(o_t, s_{t-1})} \exp\Big(\sum_j \lambda_j f_j(s_{t-1}, s_t) + \sum_k \mu_k g_k(o_t, s_t)\Big)$$

Conditional Random Field (CRF) [Lafferty, McCallum, Pereira 2001]:

$$P(s \mid o) = \frac{1}{Z_o} \exp\Big(\sum_{t=1}^{|o|} \Big[\sum_j \lambda_j f_j(s_{t-1}, s_t) + \sum_k \mu_k g_k(o_t, s_t)\Big]\Big)$$
Conditional Random Fields (CRFs)
Markov on $s$, conditional dependency on $o$: states $S_t, S_{t+1}, S_{t+2}, S_{t+3}, S_{t+4}$ with $O = O_t, O_{t+1}, O_{t+2}, O_{t+3}, O_{t+4}$.

$$P(s \mid o) = \frac{1}{Z_o} \exp\Big(\sum_{t=1}^{|o|} \sum_j \lambda_j f_j(s_{t-1}, s_t, o, t)\Big)$$
Hammersley-Clifford-Besag theorem stipulates that the CRFhas this form—an exponential function of the cliques in the graph.
Assuming that the dependency structure of the states is tree-shaped (linear chain is a trivial tree), inference can be done by dynamic programming in time O(|o| |S|)—just like HMMs.
[Lafferty, McCallum, Pereira ‘2001]
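A toy sketch of this globally normalized distribution, computing $Z_o$ by brute force over all state sequences rather than by dynamic programming (fine only for tiny examples; the feature and weight are hypothetical):

```python
import itertools
import math

# Linear-chain CRF: P(s|o) = exp(sum_t sum_j lambda_j f_j(s_{t-1}, s_t, o, t)) / Z_o,
# with Z_o summed over every possible state sequence (brute force).
def crf_prob(s_seq, o, states, feats, lambdas):
    def score(seq):
        # t is 1-based; s_0 is a distinguished "start" state
        return sum(lam * f(prev, cur, o, t + 1)
                   for t, (prev, cur) in enumerate(zip(("start",) + seq, seq))
                   for lam, f in zip(lambdas, feats))
    z = sum(math.exp(score(seq))
            for seq in itertools.product(states, repeat=len(o)))
    return math.exp(score(tuple(s_seq))) / z

# Hypothetical feature: state is "name" and the current token is capitalized.
feats = [lambda s_prev, s, o, t: 1.0 if s == "name" and o[t - 1][0].isupper() else 0.0]
o = ["Lawrence", "spoke"]
p = crf_prob(("name", "other"), o, ("name", "other"), feats, [1.5])
```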
General CRFs vs. HMMs
• More general and expressive modeling technique
• Comparable computational efficiency
• Features may be arbitrary functions of any or all observations
• Parameters need not fully specify generation of observations; require less training data
• Easy to incorporate domain knowledge
• State means only "state of process", vs. "state of process" and "observational history I'm keeping"
Training CRFs
Maximize log-likelihood of parameters $\Lambda = \{\lambda_k\}$ given training data $D = \{\langle o, s \rangle^{(i)}\}$:

$$L_\Lambda = \sum_i \log P_\Lambda\big(s^{(i)} \mid o^{(i)}\big) - \sum_k \frac{\lambda_k^2}{2\sigma^2}$$

Log-likelihood gradient:

$$\frac{\partial L}{\partial \lambda_k} = \sum_i C_k\big(s^{(i)}, o^{(i)}\big) - \sum_i \sum_s P_\Lambda\big(s \mid o^{(i)}\big)\, C_k\big(s, o^{(i)}\big) - \frac{\lambda_k}{\sigma^2}$$

where $C_k(s, o) = \sum_t f_k(s_{t-1}, s_t, o, t)$; that is, (feature count using correct labels) minus (feature count using labels assigned by current parameters) minus (smoothing penalty).
k
Methods:• iterative scaling (quite slow)• conjugate gradient (much faster)• conjugate gradient with preconditioning (super fast)• limited-memory quasi-Newton methods (also super fast)
Complexity comparable to standard Baum-Welch
[Sha & Pereira 2002] & [Malouf 2002]
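The gradient's "correct counts minus expected counts" structure can be sketched for a single training pair, with the expectation brute-forced over a toy state space (real CRF training gets it from forward-backward) and a hypothetical feature:

```python
import itertools
import math

# dL/dlambda_k = C_k(s_gold, o) - E_{P(s|o)}[C_k(s, o)], ignoring the
# smoothing term, for one training pair <o, s_gold>.
def gradient(lambdas, feats, s_gold, o, states):
    def count(seq, f):
        # C_f(s, o) = sum_t f(s_{t-1}, s_t, o, t), with s_0 = "start"
        return sum(f(prev, cur, o, t + 1)
                   for t, (prev, cur) in enumerate(zip(("start",) + seq, seq)))
    def score(seq):
        return sum(lam * count(seq, f) for lam, f in zip(lambdas, feats))
    seqs = list(itertools.product(states, repeat=len(o)))
    z = sum(math.exp(score(s)) for s in seqs)
    expected = [sum(math.exp(score(s)) / z * count(s, f) for s in seqs)
                for f in feats]
    return [count(tuple(s_gold), f) - e for f, e in zip(feats, expected)]

# Hypothetical feature counting tokens labeled "name". With zero weights the
# model is uniform, so expected and gold counts coincide and the gradient is 0.
feats0 = [lambda s_prev, s, o, t: 1.0 if s == "name" else 0.0]
g = gradient([0.0], feats0, ["name", "other"], ["w1", "w2"], ("name", "other"))
```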
Voted Perceptron Sequence Models
Given training data $\{\langle o, s \rangle^{(i)}\}$:

Initialize parameters to zero: $\lambda_k \leftarrow 0$.

Iterate to convergence; for all training instances $i$:

$$s_{\text{Viterbi}} = \arg\max_s \sum_t \sum_k \lambda_k f_k(s_{t-1}, s_t, o^{(i)}, t)$$

$$\forall k:\quad \lambda_k \leftarrow \lambda_k + \Big(C_k\big(s^{(i)}, o^{(i)}\big) - C_k\big(s_{\text{Viterbi}}, o^{(i)}\big)\Big)$$

where $C_k(s, o) = \sum_t f_k(s_{t-1}, s_t, o, t)$, as before.
[Collins 2002]
Like CRFs with stochastic gradient ascent and a Viterbi approximation.
Avoids calculating the partition function (normalizer), Zo, but gradient ascent, not 2nd-order or conjugate gradient method.
Analogous to the gradient for this one training instance.
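The core perceptron update (without the "voting"/averaging step) can be sketched as follows; the feature and sequences are hypothetical:

```python
# After Viterbi decoding with the current weights, move each weight by
# (gold feature count - predicted feature count).
def perceptron_update(lambdas, feats, s_gold, s_pred, o):
    def count(seq, f):
        # pairs: ("start", s_1), (s_1, s_2), ..., t is 1-based
        return sum(f(prev, cur, o, t + 1)
                   for t, (prev, cur) in enumerate(zip(["start"] + list(seq), seq)))
    return [lam + count(s_gold, f) - count(s_pred, f)
            for lam, f in zip(lambdas, feats)]

# Hypothetical feature counting tokens labeled "name": the gold sequence has
# one, the prediction has none, so that weight is pushed up by 1.
feats = [lambda s_prev, s, o, t: 1.0 if s == "name" else 0.0]
new = perceptron_update([0.0], feats, ["name", "other"], ["other", "other"],
                        ["Lawrence", "spoke"])
```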
MEMM & CRF Related Work

• Maximum entropy for language tasks:
– Language modeling [Rosenfeld '94, Chen & Rosenfeld '99]
– Part-of-speech tagging [Ratnaparkhi '98]
– Segmentation [Beeferman, Berger & Lafferty '99]
– Named entity recognition "MENE" [Borthwick, Grishman, … '98]
• HMMs for similar language tasks:
– Part-of-speech tagging [Kupiec '92]
– Named entity recognition [Bikel et al '99]
– Other information extraction [Leek '97], [Freitag & McCallum '99]
• Serial generative/discriminative approaches:
– Speech recognition [Schwartz & Austin '93]
– Reranking parses [Collins '00]
• Other conditional Markov models:
– Non-probabilistic local decision models [Brill '95], [Roth '98]
– Gradient-descent on state path [LeCun et al '98]
– Markov Processes on Curves (MPCs) [Saul & Rahim '99]
– Voted Perceptron-trained FSMs [Collins '02]
Part-of-speech Tagging
The asbestos fiber , crocidolite, is unusually resilient once
it enters the lungs , with even brief exposures to it causing
symptoms that show up decades later , researchers said .
DT NN NN , NN , VBZ RB JJ IN
PRP VBZ DT NNS , IN RB JJ NNS TO PRP VBG
NNS WDT VBP RP NNS JJ , NNS VBD .
45 tags, 1M words training data, Penn Treebank
| Model | Error | OOV error | Error (w/ spelling features*) | OOV error (w/ spelling features*) |
| --- | --- | --- | --- | --- |
| HMM | 5.69% | 45.99% |  |  |
| CRF | 5.55% | 48.05% | 4.27% (−24%) | 23.76% (−50%) |
* use words, plus overlapping features: capitalized, begins with #, contains hyphen, ends in -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies.
[Pereira 2001]
Chinese Word Segmentation[McCallum & Feng 2003]
~100k words data, Penn Chinese Treebank
Lexicon features:
adjective ending character, adverb ending character, building words, Chinese number characters, Chinese period, cities and regions, countries, dates, department characters, digit characters, foreign name chars, function words, job title, locations, money, negative characters, organization indicator, preposition characters, provinces, punctuation chars, Roman alphabetics, Roman digits, stopwords, surnames, symbol characters, verb chars, wordlist (188k lexicon)
Chinese Word Segmentation Results[McCallum & Feng 2003]
Precision and recall of segments with perfect boundaries:
# training testing segmentationMethod sentences prec. recall F1
[Peng] ~5M 75.1 74.0 74.2[Ponte] ? 84.4 87.8 86.0[Teahan] ~40k ? ? 94.4[Xue] ~10k 95.2 95.1 95.2CRF 2805 97.3 97.8 97.5CRF 140 95.4 96.0 95.7CRF 56 93.9 95.0 94.4CRF+ 2805 99.6 99.7 99.6
Roughly 50% relative error reduction over the previous world's best.
Person Name Extraction [McCallum 2001]
Features in Experiment
• Capitalized: Xxxxx
• Mixed caps: XxXxxx
• All caps: XXXXX
• Initial cap: X….
• Contains digit: xxx5
• All lowercase: xxxx
• Initial: X
• Punctuation: . , : ; ! ( ) etc.
• Period: .
• Comma: ,
• Apostrophe: '
• Dash: -
• Preceded by HTML tag
• Character n-gram classifier says string is a person name (80% accurate)
• In stopword list (the, of, their, etc.)
• In honorific list (Mr, Mrs, Dr, Sen, etc.)
• In person-suffix list (Jr, Sr, PhD, etc.)
• In name-particle list (de, la, van, der, etc.)
• In Census last-name list; segmented by P(name)
• In Census first-name list; segmented by P(name)
• In locations lists (states, cities, countries)
• In company-name list ("J. C. Penny")
• In list of company suffixes (Inc, & Associates, Foundation)
• Conjunctions of all previous feature pairs, evaluated at the current time step.
• Conjunctions of all previous feature pairs, evaluated at the current step and one step ahead.
• All previous features, evaluated two steps ahead.
• All previous features, evaluated one step behind.
Total number of features = ~200k
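The last four template lines above can be sketched as follows (the atomic features shown are a small hypothetical subset of the list, and the names are ours):

```python
from itertools import combinations

def atomic(word):
    """A few atomic features (illustrative subset of the list above)."""
    feats = {"W=" + word.lower()}
    if word[:1].isupper():
        feats.add("CAP")
    if any(ch.isdigit() for ch in word):
        feats.add("DIGIT")
    return feats

def token_features(words, i):
    feats = set(atomic(words[i]))
    # conjunctions of feature pairs at the current time step
    feats |= {a + "&" + b for a, b in combinations(sorted(atomic(words[i])), 2)}
    # conjunctions of pairs across the current step and one step ahead
    if i + 1 < len(words):
        feats |= {a + "&" + b + "@+1"
                  for a in atomic(words[i]) for b in atomic(words[i + 1])}
    # atomic features copied from neighboring steps
    if i + 2 < len(words):
        feats |= {f + "@+2" for f in atomic(words[i + 2])}
    if i >= 1:
        feats |= {f + "@-1" for f in atomic(words[i - 1])}
    return feats
```

Crossing a few dozen atomic features with these offsets and pairings is what pushes the total feature count into the ~200k range.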
Training and Testing
• Trained on 65,469 words from 85 pages, from 30 different companies' web sites.
• Training takes 4 hours on a 1 GHz Pentium.
• Training precision/recall is 96/96.
• Tested on a different set of web pages with similar size characteristics.
• Testing precision is 0.92 to 0.95; recall is 0.89 to 0.91.
Feature Induction
WORD=“interest” & WORD@+1=“rates”
WORD=“put” & WORD@+1=“and” & WORD@+2=“call” (options)
WORD=“to” & POS@+1=VB
# features without conjunctions = ~90k
# features with conjunctions = ~1 million
Feature Gain
$$\mathrm{Gain}(f) \;=\; D(\tilde{p}\,\|\,q) \;-\; D(\tilde{p}\,\|\,q_f)$$

The first term is the current loss; the second is the loss if feature f were added and had its optimal weight.
Feature Induction Procedure
• Partially train the model with current features.
• Find tokens significantly contributing to D(p̃||q).
• "Cluster" these errors by the different cells of the confusion matrix.
• For each cell:
  – Generate ~50k candidate new features.
  – Measure the Gain of each.
  – Add the top N features to the model.
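For intuition, the gain of a candidate feature can be approximated in a simple binary setting by optimizing only that feature's weight while holding the current model fixed. This is a sketch, not the CRF-specific computation; log-loss plays the role of D(p̃||q):

```python
import math

def log_loss(probs, labels):
    return -sum(math.log(p if y else 1.0 - p) for p, y in zip(probs, labels))

def gain(q, labels, active, steps=25):
    """Approximate gain of a candidate binary feature.

    q:      current model's p(y=1|x) per example
    labels: true binary labels
    active: whether the candidate feature fires on each example
    The candidate's weight lam is optimized alone by 1-D Newton steps.
    """
    lam = 0.0
    for _ in range(steps):
        g = h = 0.0
        for qi, y, a in zip(q, labels, active):
            if not a:
                continue
            p = 1.0 / (1.0 + math.exp(-(math.log(qi / (1.0 - qi)) + lam)))
            g += (1.0 if y else 0.0) - p   # gradient of the log-likelihood
            h += p * (1.0 - p)             # curvature
        if h < 1e-12:
            break
        lam += g / h
    new = [1.0 / (1.0 + math.exp(-(math.log(qi / (1.0 - qi)) + (lam if a else 0.0))))
           for qi, a in zip(q, active)]
    return log_loss(q, labels) - log_loss(new, labels)
```

A discriminating candidate gets a large gain; an uninformative one gets roughly zero, which is why only the top-N candidates per confusion-matrix cell are added.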
Feature Induction Discussion
• Decision trees partition the training data.
• Boosting is like feature induction:
  – it generates a new conjunction or tree and gives it a weight,
  – but it typically adds just one feature per round,
  – and it doesn't change the weight once set.
• [Della Pietra, Della Pietra & Lafferty 1997] also add one feature at a time, and would be quite inefficient for CRFs.
• Here we carefully select several approximations to their scheme.
Feature Induction Results (Preliminary)
Noun Phrase Segmentation
| Method | # features | F1 |
|---|---|---|
| [Sha & Pereira 2003] | 3.8M | 94.4 |
| Feature Induction | 140K | 94.4 |
Named Entity Extraction (in Spanish)
| Method | F1 |
|---|---|
| With no conjunctions | ~30 |
| With fixed conjunctions | 51 |
| With induced conjunctions | 65 |

(extra preliminary)
Person Name Extraction
Local and Global Features

Local features, like formatting, exhibit regularity on a particular subset of the data (e.g. a single web site or document); future data will probably not have the same regularities as the training data.

Global features, like word content, exhibit regularity over the entire data set. Traditional classifiers are generally trained on these kinds of features.
Scoped Learning Generative Model
1. For each of the D documents:
   a) Generate the multinomial formatting-feature parameters φ from p(φ | α).
2. For each of the N words in the document:
   a) Generate the nth category cn from p(cn).
   b) Generate the nth word (global feature) from p(wn | cn, θ).
   c) Generate the nth formatting feature (local feature) from p(fn | cn, φ).
(Plate diagram: cn generates wn and fn; plates over the N words and D documents.)
Inference
Given a new web page, we would like to classify each word, resulting in c = {c1, c2, …, cn}.
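The posterior in question can be reconstructed as follows (notation assumed here, matching the generative model above: φ for the local formatting parameters, θ for the global word parameters, α for the prior):

```latex
p(\mathbf{c} \mid \mathbf{w}, \mathbf{f}) =
  \frac{\int p(\phi \mid \alpha) \prod_{n=1}^{N} p(c_n)\, p(w_n \mid c_n, \theta)\, p(f_n \mid c_n, \phi)\, d\phi}
       {\sum_{\mathbf{c}} \int p(\phi \mid \alpha) \prod_{n=1}^{N} p(c_n)\, p(w_n \mid c_n, \theta)\, p(f_n \mid c_n, \phi)\, d\phi}
```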
This is not feasible to compute because of the integral and sum in the denominator. We experimented with two approximations:
- MAP point estimate of φ
- Variational inference
MAP Point Estimate
If we approximate φ with a point estimate φ̂, then the integral disappears and c decouples, and we can label each word with the resulting posterior. A natural point estimate is the posterior mode: a maximum-likelihood estimate of the local parameters given the document in question, obtained by alternating an E-step and an M-step.
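The two steps can be reconstructed as (again with assumed notation, φ̂ for the point estimate of the local parameters):

```latex
\text{E-step:}\quad p(c_n \mid w_n, f_n, \hat{\phi}) \;\propto\; p(c_n)\, p(w_n \mid c_n, \theta)\, p(f_n \mid c_n, \hat{\phi})
\qquad
\text{M-step:}\quad \hat{\phi} \;=\; \arg\max_{\phi} \sum_{n}\sum_{c_n} p(c_n \mid w_n, f_n, \hat{\phi}^{\,\mathrm{old}}) \log p(f_n \mid c_n, \phi)
```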
Global Extractor: Correctly extracts 3 out of 9
Scoped Learning Extractor: Correctly extracts 8 out of 9
(Plot: accuracy vs. coverage.)
Global Extractor: Precision = 46%, Recall = 75%
Scoped Learning Extractor: Precision = 58%, Recall = 75% (error -22%)
Discriminative Model
When there are many non-independent formatting features, we may prefer a non-generative model.

Use a point estimate of φ.
E-step: compute p(c | w, f, φ̂).
M-step: set φ to maximize the conditional likelihood p(c | f), using MaxEnt.
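A toy sketch of this EM loop for binary classification (pure illustration with made-up names: a fixed "global" classifier supplies p(c|w), and the local MaxEnt model is a small logistic regression fit by gradient ascent to the E-step's soft labels):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def scoped_em(p_global, feats, em_iters=20, grad_steps=10, lr=0.5):
    """EM for the discriminative scoped model (toy sketch).

    p_global: fixed global classifier's p(c=1|w) per token
    feats:    binary local-feature vector (tuple) per token
    """
    k = len(feats[0])
    phi, bias = [0.0] * k, 0.0
    post = list(p_global)
    for _ in range(em_iters):
        # E-step: combine global and local evidence, renormalize
        post = []
        for pg, f in zip(p_global, feats):
            pl = sigmoid(bias + sum(w * x for w, x in zip(phi, f)))
            num = pg * pl
            post.append(num / (num + (1.0 - pg) * (1.0 - pl)))
        # M-step: gradient ascent on the expected conditional log-likelihood
        for _ in range(grad_steps):
            gb, g = 0.0, [0.0] * k
            for q, f in zip(post, feats):
                err = q - sigmoid(bias + sum(w * x for w, x in zip(phi, f)))
                gb += err
                for j, x in enumerate(f):
                    g[j] += err * x
            bias += lr * gb / len(feats)
            phi = [w + lr * gj / len(feats) for w, gj in zip(phi, g)]
    return post
```

When a local feature lines up with the global classifier's weakly confident predictions on a page, the loop amplifies them, which is the behavior behind the precision gains reported on the surrounding slides.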
More Experimental Results
• 42,548 web pages from 330 web sites. Each page labeled as PRESS_RELEASE or OTHER.
• Locale: web site.
• Global features: words on the page, e.g. "immediate", "release", "products".
• Local features: character 4-gram window slid across the URL, e.g. "prel", "/pr/", "pres", …
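The local features can be sketched in a couple of lines (the function name is ours):

```python
def url_char_ngrams(url, n=4):
    """All character n-grams from a window slid across the URL."""
    return {url[i:i + n] for i in range(len(url) - n + 1)}
```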
Global Extractor: Precision = 54%, Recall = 90%
Scoped Learning Extractor: Precision = 77%, Recall = 90% (error -50%)
Scoped Learning Related Work
• Co-training [Blum & Mitchell 1998]
  – Although it has no notion of scope, it also makes an independence assumption about two independent views of the data.
• PRMs for classification [Taskar, Segal & Koller 2001]
  – Extends the notion of multiple views to multiple kinds of relationships.
  – This model can be cast as a PRM where each locale is a separate group of nodes with separate parameters; their inference corresponds to our MAP-estimate inference.
• Classification with Labeled & Unlabeled Data [Nigam et al. 1999, Joachims 1999, etc.]
  – Particularly transduction, in which the unlabeled set is the test set.
  – However, we model locales, represent a difference between local and global features, and can use locales at training time to learn hyper-parameters over local features.
• Classification with Hyperlink Structure [Slattery 2001]
  – Adjusts a web page classifier using ILP and a hubs-and-authorities algorithm.
Future Directions

• Further work in feature selection and induction: automatically choose the fk functions (efficiently).
• Factorial CRFs
• Tree-structured Markov random fields for hierarchical parsing.
• Induction of finite state structure.
• Combine CRFs and Scoped Learning.
• Data mine the results of information extraction, and integrate the data mining with extraction.
• Create a text extraction and mining system that can be assembled and trained to a new vertical application by non-technical users.
Papers on MEMMs, CRFs, Scoped Learning and more available at http://www.cs.umass.edu/~mccallum