Information Extraction for Free Text
Plain Text Information Extraction (based on Machine Learning)
Chia-Hui Chang
Department of Computer Science & Information Engineering, National Central University
[email protected]
/24/2002
Introduction
Plain Text Information Extraction
The task of locating specific pieces of data in a natural language document
To obtain useful structured information from unstructured text
DARPA's MUC program
The extraction rules are based on a syntactic analyzer and a semantic tagger
Related Work
On-line documents:
SRV, AAAI-1998, D. Freitag
RAPIER, ACL-1997 / AAAI-1999, M. E. Califf
WHISK, ML-1999, S. Soderland
Free-text documents:
PALKA, MUC-5, 1993
AutoSlog, AAAI-1993, E. Riloff
LIEP, IJCAI-1995, Huffman
CRYSTAL, IJCAI-1995 / KDD-1997, S. Soderland
SRV: Information Extraction from HTML: Application of a General Machine Learning Approach
Dayne Freitag
AAAI-98
Introduction
SRV
A general-purpose relational learner
A top-down relational algorithm for IE
Reliance on a set of token-oriented features
Extraction pattern: a first-order logic extraction pattern with predicates based on attribute-value tests
Extraction as Text Classification
Identify the boundaries of field instances
Treat each fragment as a bag-of-words
Find the relations from the surrounding context
Relational Learning
Inductive Logic Programming (ILP)
Input: class-labeled instances
Output: classifier for unlabeled instances
Typical covering algorithm:
Attribute-value tests are added greedily to a rule
The number of positive examples covered is heuristically maximized while the number of negative examples covered is heuristically minimized
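The covering loop above can be sketched as follows. This is an illustrative toy implementation, not SRV's actual learner: examples are dicts of attribute values, and the scoring heuristic (positives minus negatives) is an assumption.

```python
def covers(rule, example):
    """A rule is a dict of attribute -> required value; it covers an
    example when every test matches."""
    return all(example.get(a) == v for a, v in rule.items())

def grow_rule(pos, neg, attributes):
    """Greedily add attribute-value tests until no negatives are covered."""
    rule = {}
    while any(covers(rule, n) for n in neg):
        best, best_score = None, float("-inf")
        for a in attributes:
            if a in rule:
                continue
            for v in {e[a] for e in pos if a in e}:
                cand = dict(rule, **{a: v})
                p = sum(covers(cand, e) for e in pos)
                n = sum(covers(cand, e) for e in neg)
                if p > 0 and p - n > best_score:
                    best, best_score = cand, p - n
        if best is None:
            break              # no test improves the rule; stop growing
        rule = best
    return rule

def learn_rules(pos, neg, attributes):
    """Covering loop: learn rules until every positive example is covered."""
    rules, remaining = [], list(pos)
    while remaining:
        rule = grow_rule(remaining, neg, attributes)
        if not any(covers(rule, e) for e in remaining):
            break
        rules.append(rule)
        remaining = [e for e in remaining if not covers(rule, e)]
    return rules
```

Each learned rule removes the positives it covers, so later rules specialize to the examples the earlier rules missed.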
Simple Features
Features on individual tokens:
Length (e.g. single letter or multiple letters)
Character type (e.g. numeric or alphabetic)
Orthography (e.g. capitalized)
Part of speech (e.g. verb)
Lexical meaning (e.g. geographical_place)
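A token-feature extractor in this spirit might look as follows; the feature names and the exact inventory here are illustrative, not Freitag's full set (part of speech and lexical meaning would need a tagger and a lexicon).

```python
def token_features(tok):
    """Map a single token to a dict of simple features: length class,
    numeric character type, and capitalization orthography."""
    return {
        "length": "single" if len(tok) == 1 else "multiple",
        "numericp": tok.isdigit(),
        "capitalizedp": tok[:1].isupper(),
    }
```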
Individual Predicates
Individual predicates:
Length (=3): accepts only fragments containing three tokens
Some(?A [] capitalizedp true): the fragment contains some token that is capitalized
Every(numericp false): every token in the fragment is non-numeric
Position(?A fromfirst <2): the token bound to ?A is either first or second in the fragment
Relpos(?A ?B =1): the token bound to ?A immediately precedes the token bound to ?B
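These predicate types can be checked against a candidate fragment (a list of tokens) roughly as follows; the helper names and signatures are illustrative stand-ins, not SRV's internals.

```python
def capitalizedp(tok): return tok[:1].isupper()
def numericp(tok):     return tok.isdigit()

def pred_length(fragment, n):
    """Length (=n): the fragment contains exactly n tokens."""
    return len(fragment) == n

def pred_some(fragment, test):
    """Some: some token in the fragment satisfies the feature test."""
    return any(test(t) for t in fragment)

def pred_every(fragment, test, expected):
    """Every: the feature test yields `expected` on every token."""
    return all(test(t) == expected for t in fragment)

def pred_position(fragment, test, max_from_first):
    """Position (fromfirst < k): a matching token occurs before index k."""
    return any(test(t) for t in fragment[:max_from_first])

def pred_relpos(fragment, test_a, test_b, dist):
    """Relpos (= dist): a token matching test_a occurs exactly `dist`
    positions before one matching test_b."""
    return any(test_a(fragment[i]) and test_b(fragment[i + dist])
               for i in range(len(fragment) - dist))
```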
Relational Features
Relational feature types:
Adjacency (next_token)
Linguistic syntax (subject_verb)
Example
Search
Adding predicates greedily, attempting to cover as many positive and as few negative examples as possible.
At every step in rule construction, all documents in the training set are scanned and every text fragment of appropriate size counted.
Every legal predicate is assessed in terms of the number of positive and negative examples it covers.
A position-predicate is not legal unless a some-predicate is already part of the rule
Relational Paths
Relational features are used only in the path argument to the some-predicate
Some(?A [prev_token prev_token] capitalizedp true): the fragment contains some token preceded by a capitalized token two tokens back.
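Following such a relational path before applying the feature test could be sketched like this; the path is limited to prev_token/next_token here, and the function names are assumptions for illustration.

```python
def follow_path(tokens, idx, path):
    """Follow a list of relation names (prev_token / next_token) from
    token index idx; return the final index, or None if it runs off
    either end of the token sequence."""
    for rel in path:
        idx += -1 if rel == "prev_token" else 1
        if not 0 <= idx < len(tokens):
            return None
    return idx

def some_with_path(tokens, start, end, path, test):
    """True if, for some token in the fragment tokens[start:end],
    the token reached by following `path` satisfies `test`."""
    for i in range(start, end):
        j = follow_path(tokens, i, path)
        if j is not None and test(tokens[j]):
            return True
    return False
```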
Validation
Training phase: 2/3 for learning, 1/3 for validation
Testing: Bayesian m-estimates; all rules matching a given fragment are used to assign a confidence score.
Combined confidence: C = 1 - ∏(1 - c_i)
Adapting SRV for HTML
Experiments
Data source: four university computer science departments: Cornell, U. of Texas, U. of Washington, U. of Wisconsin
Data set:
Course pages (105): title, number, instructor
Project pages (96): title, member
Two experiments:
Random: 5-fold cross-validation
LOUO (leave-one-university-out): 4-fold experiments
OPD coverage: each rule has its own confidence
MPD
Baseline Strategies
OPD
MPD
A baseline that simply memorizes field instances
Random guesser
Conclusions
Increased modularity and flexibility: domain-specific information is separate from the underlying learning algorithm
Top-down induction: from general to specific
Accuracy-coverage trade-off: associate a confidence score with predictions
Critique: single-slot extraction rules
RAPIER: Relational Learning of Pattern-Match Rules for Information Extraction
M.E. Califf and R.J. Mooney
ACL-97, AAAI-1999
Rule Representation
Single-slot extraction patterns Syntactic information (part-of-speech tagger) Semantic class information (WordNet)
The Learning Algorithm
A specific-to-general search:
The pre-filler pattern contains an item for each word
The filler pattern has one item for each word in the filler
The post-filler pattern has one item for each word
Compress the rules for each slot:
Generate the least general generalization (LGG) of each pair of rules
When the LGG of two constraints is a disjunction, create two alternatives: (1) the disjunction, (2) removal of the constraints.
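The constraint-level generalization just described can be sketched as follows. This is a simplification for illustration: RAPIER's constraints also carry part-of-speech and semantic-class information, whereas here a constraint is just a set of allowed words.

```python
def lgg_constraint(a, b):
    """a and b are sets of allowed words; an empty set means
    'unconstrained'. Equal constraints are kept as-is; differing ones
    yield two candidates: (1) their disjunction (union) and
    (2) removal of the constraint."""
    if a == b:
        return [a]
    if not a or not b:
        return [set()]          # one side unconstrained -> drop the constraint
    return [a | b, set()]
```

For the Atlanta/Kansas City example, generalizing the city constraints {"atlanta"} and {"kansas", "city"} yields the disjunction {"atlanta", "kansas", "city"} and the unconstrained alternative.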
Example Located in Atlanta, Georgia. Offices in Kansas City, Missouri.
Example: assume there is a semantic class for states, but not one for cities.
Located in Atlanta, Georgia. Offices in Kansas City, Missouri.
Experimental Evaluation
300 computer-related job postings
17 slots: employer, location, salary, job requirements, language, and platform
Experimental Evaluation
485 seminar announcements; 4 slots
WHISK: Learning Information Extraction Rules for Semi-structured and Free Text
S. Soderland
University of Washington
Machine Learning, 1999
Semi-structured Text
Free Text
WHISK Rule Representation
For Semi-structured IE
WHISK Rule Representation
For Free Text IE
Skip only within the same syntactic field
Example – Tagged by Users
The WHISK Algorithm
Creating a Rule from a Seed Instance
Top-down rule induction:
Start from an empty rule
Add terms within the extraction boundary (Base_1)
Add terms just outside the extraction (Base_2)
Repeat until the seed is covered
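A WHISK rule with '*' skips behaves like a regular expression with non-greedy gaps around literal anchors and extraction slots. The sketch below compiles such a rule to a regex; the term encoding (a `('cap', pattern)` tuple for a slot) is an assumption for illustration, not Soderland's notation, and the sample rule is in the spirit of WHISK's rental-ad examples.

```python
import re

def compile_rule(terms):
    """terms: '*' is a skip, a ('cap', regex) tuple is an extraction
    slot, and anything else is matched literally."""
    parts = []
    for t in terms:
        if t == "*":
            parts.append(r".*?")                 # non-greedy skip
        elif isinstance(t, tuple) and t[0] == "cap":
            parts.append("(" + t[1] + ")")       # capture the slot filler
        else:
            parts.append(re.escape(t))
    return re.compile("".join(parts))

# Skip to '$', extract the number, require ' per' just after it.
rule = compile_rule(["*", "$", ("cap", r"\d+"), " per"])
m = rule.search("Nice 2 BR apartment, $700 per month")
```

The literal terms just outside the slot (here "$" and " per") play the role of Base_2 context terms: they anchor the extraction without being part of it.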
Example
AutoSlog: Automatically Constructing a Dictionary for Information Extraction Tasks
Ellen Riloff, Dept. of Computer Science, University of Massachusetts, AAAI-93
AutoSlog
Purpose: automatically constructs a domain-specific dictionary for IE
Extraction patterns (concept nodes):
Conceptual anchor: a trigger word
Enabling conditions: constraints
Concept Node Example
Physical target slot of a bombing template
Construction of Concept Nodes
1. Given a target piece of information.
2. AutoSlog finds the first sentence in the text that contains the string.
3. The sentence is handed over to CIRCUS, which generates a conceptual analysis of the sentence.
4. The first clause in the sentence is used.
5. A set of heuristics is applied to suggest a good conceptual anchor point for a concept node.
6. If none of the heuristics is satisfied, AutoSlog searches for the next sentence and goes to step 3.
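The steps above can be caricatured in a few lines. This is a deliberately simplified sketch: the two heuristics are toy stand-ins for AutoSlog's real linguistic patterns, and CIRCUS's conceptual analysis is replaced by plain string context.

```python
HEURISTICS = [
    # (pattern name, toy test on the context to the left of the target)
    ("passive-verb", lambda before: before.endswith("was") or before.endswith("were")),
    ("prep-of",      lambda before: before.endswith("of")),
]

def propose_concept_node(sentences, target):
    """Scan sentences for the target string (step 2); on each hit, try
    the heuristics (step 5); move on to the next sentence when none
    fires (step 6)."""
    for sent in sentences:
        if target not in sent:
            continue
        before = sent.split(target)[0].strip()   # left context of the target
        for name, applies in HEURISTICS:
            if applies(before):
                return {"anchor": name, "target": target, "sentence": sent}
    return None
```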
Conceptual Anchor Point Heuristics
Background Knowledge
Concept Node Construction
Slot: the slot of the answer key; hard and soft constraints
Type: uses template types such as bombing, kidnapping
Enabling condition: heuristic pattern
Domain specification: the type of a template; the constraints for each template slot
Another good concept node definition: perpetrator slot from a perpetrator template
A bad concept node definition: victim slot from a kidnapping template
Empirical Results
Input: an annotated corpus of texts in which the targeted information is marked and annotated with semantic tags denoting the type of information (e.g., victim) and the type of event (e.g., kidnapping); 1500 texts with 1258 answer keys containing 4780 string fillers
Output: 1237 concept node definitions
Human intervention: 5 user-hours to sift through all generated concept nodes; 450 definitions are kept
Performance:
Conclusion
In 5 person-hours, AutoSlog creates a dictionary that achieves 98% of the performance of a hand-crafted dictionary
Each concept node is a single-slot extraction pattern
Reasons for bad definitions:
When a sentence contains the targeted string but does not describe the event
When a heuristic proposes the wrong conceptual anchor point
When CIRCUS incorrectly analyzes the sentence
CRYSTAL: Inducing a Conceptual Dictionary
S. Soderland, D. Fisher, J. Aseltine, W. Lehnert
University of Massachusetts
IJCAI’95
Concept Nodes (CN)
CN-type
Subtype
Extracted syntactic constituents
Linguistic patterns
Constraints on syntactic constituents
The CRYSTAL Induction Tool
Creating initial CN definitions: one for each instance
Inducing generalized CN definitions: relaxing constraints for highly similar definitions
Word constraints: intersecting strings of words
Class constraints: moving up the semantic hierarchy
Inducing Generalized CN Definitions
1. Start from a CN definition, D.
2. Assume we have found a second definition D' which is similar to D:
a) Create a new definition U by unifying D and D'.
b) Delete from the dictionary all definitions covered by U, e.g. D and D'.
c) Test whether U extracts only marked information:
If 'Yes', go to step 2 with D = U.
If 'No', start over from another definition as D.
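This bottom-up loop can be sketched as follows. The model here is a strong simplification made for illustration: a definition is a frozenset of constraints, unification is their intersection (relaxing to what the two definitions share), and an error is a covered negative instance.

```python
def unify(d1, d2):
    """Unification by intersection: keep only shared constraints."""
    return d1 & d2

def errors(defn, instances):
    """instances: (constraint-set, is_positive) pairs. A definition
    covers an instance when all of its constraints hold there; errors
    are covered negatives."""
    return sum(1 for cons, pos in instances if defn <= cons and not pos)

def generalize(definitions, instances, tolerance=0):
    """Repeatedly unify similar definitions, keeping a unification only
    if its error count stays within the tolerance; drop the definitions
    it covers."""
    defs = set(definitions)
    changed = True
    while changed:
        changed = False
        for d in list(defs):
            for d2 in list(defs):
                if d == d2 or d not in defs or d2 not in defs:
                    continue
                u = unify(d, d2)
                if u and errors(u, instances) <= tolerance:
                    defs = {x for x in defs if not u <= x}  # delete covered defs
                    defs.add(u)
                    changed = True
                    break
    return defs
```

The `tolerance` parameter plays the role of CRYSTAL's error-tolerance setting: raising it accepts more aggressive generalizations at the cost of precision.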
Implementation Issues
Finding similar definitions: indexing CN definitions by verbs and by extraction buffers
Similarity metric: intersecting classes or intersecting strings of words
Testing the error rate of a generalized definition: a database of instances segmented by the sentence analyzer is constructed
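Indexing definitions by their verb constraint, as mentioned above, avoids scanning all pairs when looking for similar definitions. A minimal sketch, assuming constraints are encoded as strings like "verb=bombed" (the encoding is an assumption, not CRYSTAL's actual representation):

```python
from collections import defaultdict

def build_verb_index(definitions):
    """Map each verb constraint to the list of CN definitions that
    carry it, so similarity candidates share a bucket."""
    index = defaultdict(list)
    for d in definitions:
        for c in d:
            if c.startswith("verb="):
                index[c].append(d)
    return index
```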
Experimental Results
385 annotated hospital discharge reports; 14719 training instances
The error-tolerance parameter is used to manipulate a trade-off between precision and recall
Output: CN definitions: 194 with coverage ≥ 10; 527 with 2 < coverage < 10
Comparison
Bottom-up (from specific to general): CRYSTAL [Soderland, 1996], RAPIER [Califf & Mooney, 1997]
Top-down (from general to specific): SRV [Freitag, 1998], WHISK [Soderland, 1999]
References
I. Muslea, Extraction Patterns for Information Extraction Tasks: A Survey, AAAI-99 Workshop on Machine Learning for Information Extraction, 1999.
E. Riloff, Automatically Constructing a Dictionary for Information Extraction Tasks, AAAI-93, pp. 811-816, 1993.
S. Soderland, D. Fisher, J. Aseltine, W. Lehnert, CRYSTAL: Inducing a Conceptual Dictionary, IJCAI-95, 1995.
D. Freitag, Information Extraction from HTML: Application of a General Machine Learning Approach, AAAI-98, 1998.
M. E. Califf and R. J. Mooney, Relational Learning of Pattern-Match Rules for Information Extraction, AAAI-99, Orlando, FL, pp. 328-334, July 1999.
S. Soderland, Learning Information Extraction Rules for Semi-structured and Free Text, Machine Learning, 1999.