Plain Text Information Extraction (based on Machine Learning)
Chia-Hui Chang, Department of Computer Science & Information Engineering, National Central University, [email protected], 9/24/2002
![Page 1: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/1.jpg)
![Page 2: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/2.jpg)
Introduction
- Plain Text Information Extraction
  - The task of locating specific pieces of data from a natural language document
  - To obtain useful structured information from unstructured text
  - DARPA’s MUC program
- The extraction rules are based on
  - a syntactic analyzer
  - a semantic tagger
![Page 3: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/3.jpg)
Related Work
- Free-text documents
  - PALKA, MUC-5, 1993
  - AutoSlog, AAAI-1993 (E. Riloff)
  - LIEP, IJCAI-1995 (Huffman)
  - Crystal, IJCAI-1995, KDD-1997 (Soderland)
- On-line documents
  - SRV, AAAI-1998 (D. Freitag)
  - Rapier, ACL-1997, AAAI-1999 (M. E. Califf)
  - WHISK, ML-1999 (Soderland)
![Page 4: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/4.jpg)
SRV
Information Extraction from HTML: Application of a General Machine Learning Approach
Dayne Freitag
AAAI-98
![Page 5: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/5.jpg)
Introduction
- SRV
  - A general-purpose relational learner
  - A top-down relational algorithm for IE
  - Reliance on a set of token-oriented features
- Extraction pattern
  - First-order logic extraction pattern with predicates based on attribute-value tests
![Page 6: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/6.jpg)
Extraction as Text Classification
- Identify the boundaries of field instances
- Treat each fragment as a bag-of-words
- Find the relations from the surrounding context
![Page 7: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/7.jpg)
Relational Learning
- Inductive Logic Programming (ILP)
  - Input: class-labeled instances
  - Output: classifier for unlabeled instances
- Typical covering algorithm
  - Attribute values are added greedily to a rule
  - The number of positive examples is heuristically maximized while the number of negative examples is heuristically minimized
![Page 8: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/8.jpg)
Simple Features
- Features on an individual token
  - Length (e.g. single letter or multiple letters)
  - Character type (e.g. numeric or alphabetic)
  - Orthography (e.g. capitalized)
  - Part of speech (e.g. verb)
  - Lexical meaning (e.g. geographical_place)
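The token-level features listed above can be sketched as simple functions of the token string. This is a minimal illustration, not SRV's implementation: the function name `token_features` and the dictionary keys are assumptions, and the part-of-speech and lexical-meaning features are left out because they would need an external tagger or lexicon.

```python
def token_features(token: str) -> dict:
    """Compute a few simple, token-local features (hypothetical names)."""
    return {
        "length": len(token),                                   # number of characters
        "single_letter": len(token) == 1 and token.isalpha(),   # length: single letter
        "numericp": token.isdigit(),                            # character type: numeric
        "alphabeticp": token.isalpha(),                         # character type: alphabetic
        "capitalizedp": token[:1].isupper(),                    # orthography: capitalized
    }


features = token_features("Atlanta")
```

Each feature is a function of one token only, which is what lets the learner treat it as an attribute-value test.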
![Page 9: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/9.jpg)
Individual Predicates
- Individual predicate:
  - Length(=3): accepts only fragments containing three tokens
  - Some(?A [] capitalizedp true): the fragment contains some token that is capitalized
  - Every(numericp false): every token in the fragment is non-numeric
  - Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment
  - Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B
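The five predicates above can be read as boolean tests on a fragment, here modeled as a list of tokens. The sketch below is a simplification under stated assumptions: all function names are hypothetical, and variables such as ?A and ?B are represented by explicit 0-based token indices rather than by logical variable binding.

```python
from typing import Callable, Sequence

Feature = Callable[[str], bool]


def capitalizedp(token: str) -> bool:
    return token[:1].isupper()


def numericp(token: str) -> bool:
    return token.isdigit()


def length_eq(fragment: Sequence[str], n: int) -> bool:
    """Length(=n): the fragment has exactly n tokens."""
    return len(fragment) == n


def some(fragment: Sequence[str], feature: Feature, value: bool) -> bool:
    """Some(?A [] feature value): at least one token has the feature value."""
    return any(feature(tok) == value for tok in fragment)


def every(fragment: Sequence[str], feature: Feature, value: bool) -> bool:
    """Every(feature value): all tokens have the feature value."""
    return all(feature(tok) == value for tok in fragment)


def position_fromfirst_lt(index: int, limit: int) -> bool:
    """Position(?A fromfirst <limit): index is the 0-based offset of ?A."""
    return index < limit


def relpos_eq(index_a: int, index_b: int, distance: int) -> bool:
    """Relpos(?A ?B =distance): ?B lies `distance` tokens after ?A."""
    return index_b - index_a == distance


fragment = ["Dr", "Jane", "Smith"]
```

With this fragment, `length_eq(fragment, 3)`, `some(fragment, capitalizedp, True)`, and `every(fragment, numericp, False)` all hold.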
![Page 10: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/10.jpg)
Relational Features
- Relational feature types
  - Adjacency (next_token)
  - Linguistic syntax (subject_verb)
![Page 11: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/11.jpg)
Example
![Page 12: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/12.jpg)
Search
Adding predicates greedily, attempting to cover as many positive and as few negative examples as possible.
At every step in rule construction, all documents in the training set are scanned and every text fragment of appropriate size counted.
Every legal predicate is assessed in terms of the number of positive and negative examples it covers.
A position-predicate is not legal unless a some-predicate is already part of the rule.
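The greedy search described on this slide can be sketched as a loop that keeps adding the best-scoring predicate until no negative example is covered. This is an illustrative sketch only: the scoring function (positives covered minus negatives covered) is an assumption chosen for brevity, not the heuristic SRV uses, and `learn_rule`, `covers`, and the toy predicates are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Fragment = Tuple[str, ...]
Predicate = Callable[[Fragment], bool]


def covers(rule: Sequence[Predicate], fragment: Fragment) -> bool:
    """A rule is a conjunction: it covers a fragment if every predicate holds."""
    return all(pred(fragment) for pred in rule)


def learn_rule(positives: Sequence[Fragment],
               negatives: Sequence[Fragment],
               candidates: Sequence[Predicate]) -> List[Predicate]:
    """Greedily add predicates while negative examples are still covered."""
    rule: List[Predicate] = []
    pos, neg = list(positives), list(negatives)
    remaining = list(candidates)
    while neg and remaining:
        # Assumed score: positives kept minus negatives kept.
        def score(pred: Predicate) -> int:
            return sum(pred(f) for f in pos) - sum(pred(f) for f in neg)

        best = max(remaining, key=score)
        new_pos = [f for f in pos if best(f)]
        new_neg = [f for f in neg if best(f)]
        if not new_pos or len(new_neg) == len(neg):
            break  # the best predicate no longer helps
        rule.append(best)
        remaining.remove(best)
        pos, neg = new_pos, new_neg
    return rule


def all_capitalized(frag: Fragment) -> bool:
    return all(tok[:1].isupper() for tok in frag)


def two_tokens(frag: Fragment) -> bool:
    return len(frag) == 2


positives = [("Dr", "Smith"), ("Prof", "Jones")]
negatives = [("the", "course"), ("CS", "101", "meets")]
rule = learn_rule(positives, negatives, [two_tokens, all_capitalized])
```

A full covering algorithm would call `learn_rule` repeatedly, removing the positive examples each new rule covers, until all positives are accounted for.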
![Page 13: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/13.jpg)
Relational Paths
Relational features are used only in the Path argument to the some-predicate.
- Some(?A [prev_token prev_token] capitalized true): the fragment contains some token preceded by a capitalized token two tokens back.
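One way to picture the Path argument is as a sequence of relational steps followed from a token in the fragment to some other token in the document, where the feature test is then applied. The sketch below assumes a document is a list of tokens and a fragment is a half-open index range; `follow_path` and `some_with_path` are hypothetical names, and only the two adjacency relations are modeled.

```python
from typing import Callable, List, Optional, Sequence


def capitalizedp(token: str) -> bool:
    return token[:1].isupper()


def follow_path(doc: Sequence[str], index: int, path: Sequence[str]) -> Optional[int]:
    """Follow relational steps from a token index; None if we leave the document."""
    for relation in path:
        if relation == "prev_token":
            index -= 1
        elif relation == "next_token":
            index += 1
        else:
            raise ValueError(f"unknown relation: {relation}")
        if not 0 <= index < len(doc):
            return None
    return index


def some_with_path(doc: Sequence[str], start: int, end: int,
                   path: Sequence[str],
                   feature: Callable[[str], bool], value: bool) -> bool:
    """Some(?A path feature value) over the fragment doc[start:end]."""
    for i in range(start, end):
        j = follow_path(doc, i, path)
        if j is not None and feature(doc[j]) == value:
            return True
    return False


doc: List[str] = ["Prof", ".", "smith", "teaches"]
path = ["prev_token", "prev_token"]
```

For the one-token fragment `doc[2:3]`, the token two steps back is "Prof", so the predicate holds; for `doc[3:4]` it lands on "." and fails.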
![Page 14: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/14.jpg)
Validation
- Training phase
  - 2/3: learning
  - 1/3: validation
- Testing
  - Bayesian m-estimates: all rules matching a given fragment are used to assign a confidence score.
  - Combined confidence: $1 - \prod_{c \in C} (1 - c)$, where $C$ is the set of confidences of the matching rules
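The two quantities on this slide can be computed in a few lines. The sketch assumes the standard form of the m-estimate, (correct + m * prior) / (matched + m), and combines rule confidences as one minus the product of (1 - c); the function and parameter names are hypothetical, and the choice of `m` and `prior` is left to the caller.

```python
import math
from typing import Iterable


def m_estimate(correct: int, matched: int, prior: float, m: float) -> float:
    """Smoothed accuracy of a rule on the validation set (standard m-estimate)."""
    return (correct + m * prior) / (matched + m)


def combined_confidence(confidences: Iterable[float]) -> float:
    """Combine the confidences of all rules matching a fragment: 1 - prod(1 - c)."""
    return 1.0 - math.prod(1.0 - c for c in confidences)


rule_confidence = m_estimate(correct=8, matched=10, prior=0.5, m=2)
fragment_confidence = combined_confidence([0.5, 0.5])
```

Two matching rules with confidence 0.5 each give a combined confidence of 0.75, so agreement between rules raises the score without ever exceeding 1.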
![Page 15: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/15.jpg)
Adapting SRV for HTML
![Page 16: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/16.jpg)
Experiments
- Data source
  - Four university computer science departments: Cornell, U. of Texas, U. of Washington, U. of Wisconsin
- Data set
  - Course: title, number, instructor
  - Project: title, member
  - 105 course pages
  - 96 project pages
- Two experiments
  - Random: 5 cross-validation
  - LOUO (leave one university out): 4-fold experiments
![Page 17: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/17.jpg)
OPD
Coverage: each rule has its own confidence
![Page 18: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/18.jpg)
MPD
![Page 19: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/19.jpg)
Baseline Strategies
OPD
MPD
Simply memorizes field instances
Random Guesser
![Page 20: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/20.jpg)
Conclusions
- Increased modularity and flexibility
  - Domain-specific information is separate from the underlying learning algorithm
- Top-down induction
  - From general to specific
- Accuracy-coverage trade-off
  - Associate confidence score with predictions
- Critique: single-slot extraction rule
![Page 21: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/21.jpg)
RAPIER
Relational Learning of Pattern-Match Rules for Information Extraction
M.E. Califf and R.J. Mooney
ACL-97, AAAI-1999
![Page 22: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/22.jpg)
Rule Representation
- Single-slot extraction patterns
  - Syntactic information (part-of-speech tagger)
  - Semantic class information (WordNet)
![Page 23: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/23.jpg)
The Learning Algorithm
- A specific-to-general search
  - The pre-filler pattern contains an item for each word
  - The filler pattern has one item for each word in the filler
  - The post-filler has one item for each word
- Compress the rules for each slot
  - Generate the least general generalization (LGG) of each pair of rules
  - When the LGG of two constraints is a disjunction, we create two alternatives: (1) the disjunction, (2) removal of the constraints.
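The generalization of a single pair of word constraints can be sketched as follows. This is a simplified illustration, not RAPIER's code: a constraint is modeled as a frozenset of allowed words (a disjunction), `None` stands for "no constraint", and `generalize_constraint` is a hypothetical name.

```python
from typing import FrozenSet, List, Optional

Constraint = Optional[FrozenSet[str]]  # None means the constraint was removed


def generalize_constraint(a: Constraint, b: Constraint) -> List[Constraint]:
    """Return the candidate generalizations of two word constraints."""
    if a is None or b is None:
        return [None]            # already unconstrained on one side
    if a == b:
        return [a]               # identical constraints generalize to themselves
    # The LGG is a disjunction, so keep two alternatives:
    # (1) the disjunction itself, (2) dropping the constraint.
    return [a | b, None]


located = frozenset({"located"})
offices = frozenset({"offices"})
in_word = frozenset({"in"})

first_word = generalize_constraint(located, offices)
second_word = generalize_constraint(in_word, in_word)
```

Applied to the pre-fillers "Located in" and "Offices in", the first word yields the two alternatives {located, offices} and "no constraint", while the shared word "in" is kept unchanged.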
![Page 24: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/24.jpg)
Example
- Located in Atlanta, Georgia.
- Offices in Kansas City, Missouri.
![Page 25: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/25.jpg)
Example:
Assume there is a semantic class for states, but not one for cities.
- Located in Atlanta, Georgia.
- Offices in Kansas City, Missouri.
![Page 26: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/26.jpg)
![Page 27: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/27.jpg)
Experimental Evaluation
- 300 computer-related jobs
- 17 slots, including employer, location, salary, job requirements, language, and platform.
![Page 28: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/28.jpg)
Experimental Evaluation
- 485 seminar announcements
- 4 slots:
![Page 29: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/29.jpg)
WHISK:
S. Soderland
University of Washington
Journal of Machine Learning 1999
![Page 30: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/30.jpg)
Semi-structured Text
![Page 31: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/31.jpg)
Free Text
[Figure: free-text example; labels: Person name, Position, Verb stem]
![Page 32: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/32.jpg)
WHISK Rule Representation
For Semi-structured IE
![Page 33: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/33.jpg)
WHISK Rule Representation
For Free Text IE
[Figure: free-text rule example; labels: Person name, Position, Verb stem]
Skip only within the same syntactic field
![Page 34: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/34.jpg)
Example – Tagged by Users
![Page 35: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/35.jpg)
The WHISK Algorithm
![Page 36: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/36.jpg)
Creating a Rule from a Seed Instance
- Top-down rule induction
  - Start from an empty rule
  - Add terms within the extraction boundary (Base_1)
  - Add terms just outside the extraction (Base_2)
  - Until the seed is covered
![Page 37: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/37.jpg)
Example
![Page 38: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/38.jpg)
![Page 39: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/39.jpg)
![Page 40: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/40.jpg)
![Page 41: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/41.jpg)
AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks
Ellen Riloff
Dept. of Computer Science, University of Massachusetts, AAAI-93
![Page 42: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/42.jpg)
AutoSlog
- Purpose: automatically constructs a domain-specific dictionary for IE
- Extraction pattern (concept nodes):
  - Conceptual anchor: a trigger word
  - Enabling conditions: constraints
![Page 43: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/43.jpg)
Concept Node Example
Physical target slot of a bombing template
![Page 44: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/44.jpg)
Construction of Concept Nodes
1. Given a target piece of information.
2. AutoSlog finds the first sentence in the text that contains the string.
3. The sentence is handed over to CIRCUS, which generates a conceptual analysis of the sentence.
4. The first clause in the sentence is used.
5. A set of heuristics is applied to suggest a good conceptual anchor point for a concept node.
6. If none of the heuristics is satisfied, AutoSlog searches for the next sentence and goes back to step 3.
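The control flow of the six steps above can be sketched without a real parser. The code below is a rough illustration under strong assumptions: CIRCUS and clause selection are not modeled at all, sentences are split with a regular expression, and the single heuristic shown (a target followed by "was" or "were" and a verb ending in "ed") is a hypothetical stand-in for AutoSlog's linguistic heuristics. All function names are invented for this example.

```python
import re
from typing import Callable, Optional, Sequence

Heuristic = Callable[[str, str], Optional[str]]


def passive_verb_heuristic(sentence: str, target: str) -> Optional[str]:
    """Toy heuristic: '<target> was/were <verb>ed' proposes the verb as anchor."""
    pattern = re.escape(target) + r"\s+(?:was|were)\s+(\w+ed)\b"
    match = re.search(pattern, sentence)
    return match.group(1) if match else None


def propose_anchor(text: str, target: str,
                   heuristics: Sequence[Heuristic]) -> Optional[str]:
    """Scan sentences containing the target until some heuristic fires."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    for sentence in sentences:
        if target not in sentence:
            continue                      # step 2: need a sentence with the string
        for heuristic in heuristics:      # step 5: try each heuristic in turn
            anchor = heuristic(sentence, target)
            if anchor:
                return anchor
        # step 6: no heuristic fired, so move on to the next sentence
    return None


text = "Guerrillas approached the embassy. The embassy was bombed yesterday."
anchor = propose_anchor(text, "embassy", [passive_verb_heuristic])
```

Here the first sentence mentions the target but satisfies no heuristic, so the search continues to the second sentence, where "bombed" is proposed as the conceptual anchor.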
![Page 45: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/45.jpg)
Conceptual Anchor Point Heuristics
![Page 46: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/46.jpg)
Background Knowledge
- Concept node construction
  - Slot: the slot of the answer key
  - Hard and soft constraints
  - Type: use template types such as bombing, kidnapping
  - Enabling condition: heuristic pattern
- Domain specification
  - The type of a template
  - The constraints for each template slot
![Page 47: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/47.jpg)
Another good concept node definition
Perpetrator slot from a perpetrator template
![Page 48: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/48.jpg)
A bad concept node definition
Victim slot from a kidnapping template
![Page 49: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/49.jpg)
Empirical Results
- Input:
  - Annotated corpus of texts in which the targeted information is marked and annotated with semantic tags denoting the type of information (e.g., victim) and type of event (e.g., kidnapping)
  - 1500 texts with 1258 answer keys containing 4780 string fillers
- Output:
  - 1237 concept node definitions
  - Human intervention: 5 user-hours to sift through all generated concept nodes
  - 450 definitions are kept
- Performance:
![Page 50: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/50.jpg)
Conclusion
- In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary
- Each concept node is a single-slot extraction pattern
- Reasons for bad definitions
  - When a sentence contains the targeted string but does not describe the event
  - When a heuristic proposes the wrong conceptual anchor point
  - When CIRCUS incorrectly analyzes the sentence
![Page 51: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/51.jpg)
CRYSTAL: Inducing a Conceptual Dictionary
S. Soderland, D. Fisher, J. Aseltine, W. Lehnert
University of Massachusetts
IJCAI’95
![Page 52: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/52.jpg)
Concept Nodes (CN)
- CN-type
- Subtype
- Extracted syntactic constituents
- Linguistic patterns
- Constraints on syntactic constituents
![Page 53: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/53.jpg)
The CRYSTAL Induction Tool
- Creating initial CN definitions
  - For each instance
- Inducing generalized CN definitions
  - Relaxing constraints for highly similar definitions
    - Word constraints: intersecting strings of words
    - Class constraints: moving up the semantic hierarchy
![Page 54: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/54.jpg)
![Page 55: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/55.jpg)
Inducing Generalized CN Definitions
1. Start from a CN definition, D
2. Assume we have found a second definition D’ which is similar to D:
   a) Create a new definition U
   b) Delete from the dictionary all definitions covered by U, e.g. D and D’
   c) Test if U extracts only marked information
      - If ‘Yes’, then go to Step 2 and set D=U
      - If ‘No’, then start from another definition as D
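The induction loop above can be sketched with word constraints only. This is a toy model, not CRYSTAL itself: a definition is a frozenset of required words, unification is set intersection, similarity is the size of the overlap, and an instance is a pair of (words, marked). Unlike the slide's ordering, this sketch tests U before deleting the definitions it covers, so a rejected U leaves the dictionary untouched. All names and the example data are invented for illustration.

```python
from typing import FrozenSet, Iterable, List, Set, Tuple

Definition = FrozenSet[str]
Instance = Tuple[FrozenSet[str], bool]  # (words in the instance, is it marked?)


def error_rate(definition: Definition, instances: Iterable[Instance]) -> float:
    """Fraction of covered instances that are not marked for extraction."""
    covered = [marked for words, marked in instances if definition <= words]
    if not covered:
        return 0.0
    return covered.count(False) / len(covered)


def generalize(seed: Definition, dictionary: Iterable[Definition],
               instances: List[Instance],
               tolerance: float = 0.0) -> Tuple[Definition, Set[Definition]]:
    """Repeatedly unify D with its most similar definition while U stays accurate."""
    d = seed
    current = set(dictionary)
    while True:
        similar = [x for x in current if x != d and x & d]
        if not similar:
            return d, current
        d2 = max(similar, key=lambda x: len(x & d))  # most similar definition D'
        u = d & d2                                   # unified definition U
        if error_rate(u, instances) > tolerance:
            return d, current                        # U over-generalizes: stop
        # Drop every definition covered by U (those at least as specific), add U.
        current = {x for x in current if not u <= x} | {u}
        d = u


d1 = frozenset("admitted with chest pain".split())
d2 = frozenset("admitted with fever".split())
d3 = frozenset("denies fever".split())
instances = [
    (frozenset("patient admitted with chest pain".split()), True),
    (frozenset("patient admitted with fever".split()), True),
    (frozenset("patient denies fever".split()), False),
]
general, dictionary = generalize(d1, [d1, d2, d3], instances)
```

In this toy run, D and D’ unify to the constraint {admitted, with}, which covers only marked instances, so both originals are replaced by it while the unrelated definition is kept.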
![Page 56: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/56.jpg)
![Page 57: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/57.jpg)
Implementation Issue
- Finding similar definitions
  - Indexing CN definitions by verbs and by extraction buffers
- Similarity metric
  - Intersecting classes or intersecting strings of words
- Testing error rate of a generalized definition
  - A database of instances segmented by the sentence analyzer is constructed
![Page 58: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/58.jpg)
Experimental Results
- 385 annotated hospital discharge reports
- 14719 training instances
- The choice of error tolerance parameter is used to manipulate a tradeoff between precision and recall
- Output: CN definitions
  - 194, coverage=10
  - 527, 2<coverage<10
![Page 59: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/59.jpg)
Comparison
- Bottom-up: from specific to generalized
  - CRYSTAL [Soderland, 1996]
  - RAPIER [Califf & Mooney, 1997]
- Top-down: from general to specific
  - SRV [Freitag, 1998]
  - WHISK [Soderland, 1999]
![Page 60: Plain Text Information Extraction (based on Machine Learning )](https://reader035.vdocument.in/reader035/viewer/2022062518/56814c3e550346895db9426e/html5/thumbnails/60.jpg)
References
I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, The AAAI-99 Workshop on Machine Learning for Information Extraction.
E. Riloff, Automatically Constructing a Dictionary for Information Extraction Tasks, AAAI-93, pp. 811-816, 1993.
S. Soderland, et al., CRYSTAL: Inducing a Conceptual Dictionary, IJCAI-95.
D. Freitag, Information Extraction from HTML: Application of a General Machine Learning Approach, AAAI-98.
M. E. Califf and R. J. Mooney, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99, Orlando, FL, pp. 328-334, July 1999.
S. Soderland, Learning Information Extraction Rules for Semi-structured and Free Text, Machine Learning, 1999.