Automatic Rule Refinement for Information Extraction
Bin Liu, University of Michigan
Laura Chiticariu, IBM Research - Almaden
Vivian Chu, IBM Research - Almaden
Frederick R. Reiss, IBM Research - Almaden
H. V. Jagadish, University of Michigan
VLDB 2010
Presenter: Ajay Gupta
Date: 20th Oct 2011
Outline
Introduction
Rules Representation
Method Overview
Experimental Setup
Results
Conclusion & Future Work
Information Extraction (IE)
Distill structured data from unstructured text
Exploit the extracted data in your applications
For years, Microsoft Corporation CEO Bill Gates was against open source. But today he appears to have changed his mind. "We can be open source. We love the concept of shared source," said Bill Veghte, a Microsoft VP. "That's a super-important shift for us in terms of code access."
Richard Stallman, founder of the Free Software Foundation, countered saying…
Annotations
Name              Title    Organization
Bill Gates        CEO      Microsoft
Bill Veghte       VP       Microsoft
Richard Stallman  Founder  Free Soft..
Source: Frederick Reiss et al., SIGMOD 2010 Tutorial
Rule Based Information Extraction
Most IE systems use rules to define important patterns in the text.
Example: Person name extractor
If a match of a dictionary of common first names occurs in the text, followed immediately by a capitalized word, mark the two words as a “candidate person name”.
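The person-name rule above can be sketched in a few lines of Python. This is only an illustration, not SystemT code: the FIRST_NAMES set stands in for the dictionary of common first names, and candidate_person_names is a hypothetical helper. "Followed immediately" is read here as "separated only by whitespace", which is an assumption.

```python
import re

# Hypothetical stand-in for a dictionary of common first names.
FIRST_NAMES = {"anna", "james", "john", "peter"}

def candidate_person_names(text):
    """Return (start, end, matched_text) for each first name followed by a capitalized word."""
    tokens = [(m.start(), m.end(), m.group()) for m in re.finditer(r"[A-Za-z]+", text)]
    candidates = []
    for (s1, e1, w1), (s2, e2, w2) in zip(tokens, tokens[1:]):
        # "Followed immediately" is taken to mean: only whitespace in between.
        followed_immediately = text[e1:s2].isspace()
        if w1.lower() in FIRST_NAMES and followed_immediately and w2[0].isupper():
            candidates.append((s1, e2, text[s1:e2]))
    return candidates

text = ("Anna at James St. office (555-1234), or James, "
        "her assistant - 555-7789 have the details.")
names = [c[2] for c in candidate_person_names(text)]
```

On this sentence the only candidate is "James St", a street rather than a person, which is the kind of mistake that rule refinement has to remove.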
Example Extraction Rules – When Things Go Wrong
Anna at James St. office (555-1234), or James, her assistant - 555-7789 have the details.
Annotations:
Person: Anna
Person: James
Phone: 555-1234
Person: James
Phone: 555-7789
Rule Development in Information Extraction
Analyze
Develop
Test
Iterative refinement process
labor intensive
time-consuming
error prone
Rule Refinement Is Hard
Number of rules could be large.
Rule interactions could be complex.
Analyzing side effects
- Removing a false positive → improves precision
- Removing a correct result → decreases recall
Identifying a change could take hours
- Person extractor has 14 complex rules
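The side effects of a refinement can be made concrete with the standard precision, recall and F1 definitions. The sketch below uses made-up result labels; precision_recall_f1 is a hypothetical helper and is not part of the system described in the talk.

```python
def precision_recall_f1(extracted, gold):
    """Compute precision, recall and F1 for a set of extracted results against gold labels."""
    true_pos = len(extracted & gold)
    precision = true_pos / len(extracted) if extracted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Made-up labels for illustration only.
gold = {"Anna", "James (assistant)"}
extracted = {"Anna", "James (assistant)", "James (street)"}

before = precision_recall_f1(extracted, gold)
# Removing only the false positive raises precision and leaves recall unchanged.
after_good = precision_recall_f1(extracted - {"James (street)"}, gold)
# A change that also removes a correct result lowers recall.
after_bad = precision_recall_f1(extracted - {"James (street)", "Anna"}, gold)
```

A good refinement behaves like after_good; a refinement with side effects behaves like after_bad.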
Rules Representation
SQL is used to represent rules.
SQL Subset:
- Select, Project, Join, Union ALL, Except ALL
SQL Extension:
Data Type: span
Table: Document(text span)
Predicate Functions:
- Follows, FollowsTok, Contains
Scalar Functions:
- Merge, Between, LeftContext
Table Functions:
- Regex, Dictionary
Rules Examples
Dictionary file first_names.dict: anna, james, john, peter…
R1: create view Phone as Regex('\d{3}-\d{4}', Document, text);
R2: create view FirstNameCand as Dictionary('first_names.dict', Document, text);
R3: create view FirstName as Select * from FirstNameCand F where Not(ContainsDict('street_suffix.dict', RightContextTok(F.match,1)));
Document
t0: Anna at James St. Office (555-1234), or James, her assistant - 555-7789 have the details.
Phone
t1 555-1234
t2 555-7789
FirstNameCand
t3 Anna
t4 James
t5 James
FirstName
t6 Anna
t7 James
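The three views above can be emulated in Python to see where each tuple comes from. This is a rough sketch, not AQL semantics: spans are (start, end) character offsets, tokens are approximated with \w+, and the contents of STREET_SUFFIXES are an assumption because street_suffix.dict is not shown in the slides.

```python
import re

DOC = ("Anna at James St. Office (555-1234), or James, "
       "her assistant - 555-7789 have the details.")
FIRST_NAMES = {"anna", "james", "john", "peter"}   # entries listed for first_names.dict
STREET_SUFFIXES = {"st", "ave", "rd"}              # assumed contents of street_suffix.dict

def regex_view(pattern, text):
    """R1-style view: one (start, end) span per regex match."""
    return [m.span() for m in re.finditer(pattern, text)]

def dictionary_view(entries, text):
    """R2-style view: spans of word tokens found in the dictionary (case-insensitive)."""
    return [m.span() for m in re.finditer(r"\w+", text) if m.group().lower() in entries]

def right_context_tok(span, text):
    """First word token to the right of span, or '' if there is none."""
    m = re.compile(r"\w+").search(text, span[1])
    return m.group() if m else ""

phone = regex_view(r"\d{3}-\d{4}", DOC)
first_name_cand = dictionary_view(FIRST_NAMES, DOC)
# R3-style filter: drop candidates whose next token is a street suffix.
first_name = [s for s in first_name_cand
              if right_context_tok(s, DOC).lower() not in STREET_SUFFIXES]
```

Under these assumptions first_name_cand holds three spans (Anna, James, James) and first_name holds two, because the "James" in "James St." is filtered out.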
Rules Examples
R4: create view PersonPhoneAll as Select Merge(F.match, P.match) as match from FirstName F, Phone P where Follows(F.match, P.match, 0, 60);
R5: create table PersonPhone(match span); insert into PersonPhone ( select * from PersonPhoneAll A ) except all ( select A1.* from PersonPhoneAll A1, PersonPhoneAll A2 where Contains(A1.match, A2.match) and Not(Equals(A1.match, A2.match)) );
PersonPhoneAll
t8 Anna at James St. Office (555-1234)
t9 James, her assistant - 555-7789
t10 Anna at James St. Office (555-1234), or James, her assistant - 555-7789
PersonPhone
t11 Anna at James St. Office (555-1234)
t12 James, her assistant - 555-7789
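The join in R4 and the containment filter in R5 can be emulated the same way. The helper names follows, merge and contains mirror the predicates in the rules, but their exact semantics here (character distances, inclusive bounds) are assumptions made for illustration.

```python
import re

DOC = ("Anna at James St. Office (555-1234), or James, "
       "her assistant - 555-7789 have the details.")

def follows(a, b, min_chars, max_chars):
    """True if span b starts between min_chars and max_chars after span a ends."""
    return min_chars <= b[0] - a[1] <= max_chars

def merge(a, b):
    """Smallest span covering both a and b."""
    return (min(a[0], b[0]), max(a[1], b[1]))

def contains(a, b):
    """True if span a fully covers span b."""
    return a[0] <= b[0] and b[1] <= a[1]

# Spans are located by search so that no offsets are hard-coded.
anna = re.search("Anna", DOC).span()
james = [m.span() for m in re.finditer("James", DOC)][1]  # second James; R3 filtered the first
first_name = [anna, james]
phone = [m.span() for m in re.finditer(r"\d{3}-\d{4}", DOC)]

# R4-style join: every (name, phone) pair satisfying Follows(name, phone, 0, 60).
person_phone_all = [merge(f, p) for f in first_name for p in phone if follows(f, p, 0, 60)]
# R5-style filter: drop spans that strictly contain another PersonPhoneAll span.
person_phone = [a1 for a1 in person_phone_all
                if not any(contains(a1, a2) and a1 != a2 for a2 in person_phone_all)]
```

With these definitions the join produces three spans, including the long one that pairs Anna with 555-7789, and the filter keeps the two shorter spans.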
Method Overview
Input: Set of correct and incorrect examples generated by an Extractor
Goal: Generate refinements of the Extractor that remove the incorrect examples while minimizing the effect on the rest of the results.
Basic Idea: Data provenance allows one to understand the origins of an output
Cut any provenance link and the wrong output disappears
[Figure: (Simplified) provenance of a wrong output. Doc → Dictionary (FirstNames.dict) → FirstNameCand: Anna. Doc → Regex /\d{3}-\d{4}/ → Phone: 555-7789. Both feed Join Follows(name,phone,0,60) → PersonPhoneAll: Anna ... 555-7789]
Method Overview
Solution:
Stage 1: Generate High-Level Changes
“remove tuple t from the output of operator Op in the canonical representation of the extractor”
Problems:
1) feasibility
2) side-effects
Stage 2: Generate Low-Level Changes
- How to modify the operator to implement a high-level change
- Ranking
High-Level Change
DEFINITION: HIGH-LEVEL CHANGE
Let t be a tuple in an output table V. A high-level change for t is a pair (t′, Op), where Op is an operator in the canonical operator graph of V and t′ is a tuple in the output of Op such that eliminating t′ from the output of Op by modifying Op results in eliminating t from V.
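One way to read the definition is as a walk over the provenance graph of a wrong output. The sketch below is a simplification and not the algorithm from the paper: the PROVENANCE dictionary layout and the tuple labels are made up for the running example. It also assumes select/join-style operators, where removing any ancestor tuple removes the output; a union would need every derivation removed.

```python
# Provenance of each tuple: the operator that produced it and its input tuples.
# The layout and labels are assumptions made for the running example.
PROVENANCE = {
    "PersonPhoneAll: Anna..555-7789": ("Join in R4", ["FirstName: Anna", "Phone: 555-7789"]),
    "FirstName: Anna": ("Select in R3", ["FirstNameCand: Anna"]),
    "FirstNameCand: Anna": ("Dictionary in R2", []),
    "Phone: 555-7789": ("Regex in R1", []),
}

def high_level_changes(wrong_tuple, provenance):
    """Walk the provenance graph from a wrong output and collect (tuple, operator) pairs."""
    changes, stack, seen = [], [wrong_tuple], set()
    while stack:
        t = stack.pop()
        if t in seen:
            continue
        seen.add(t)
        op, inputs = provenance[t]
        changes.append((t, op))   # removing t from the output of op is one candidate HLC
        stack.extend(inputs)
    return changes

hlcs = high_level_changes("PersonPhoneAll: Anna..555-7789", PROVENANCE)
```

For the wrong output "Anna ... 555-7789" this yields four candidate high-level changes, one per operator on its provenance path.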
HLC Example
[Figure: provenance graph of the wrong output. Doc → Dictionary (FirstNames.dict) → FirstNameCand: Anna → Select Not(ContainsDict('street_suffix.. → FirstName: Anna. Doc → Regex /\d{3}-\d{4}/ → Phone: 555-7789. Both feed Join Follows(name,phone,0,60) → PersonPhoneAll: Anna ... 555-7789]
HLC1: Remove Anna <--> 555-7789 from output of Join in R4
HLC2: Remove 555-7789 from output of Regex in R1
HLC3: Remove Anna from output of Select in R3
HLC4: Remove Anna from output of Dictionary in R2
Generating Low-Level Changes from HLCs
[Figure: same provenance graph as in the HLC example]
HLC1: Remove Anna <--> 555-7789 from output of Join in R4
LLC: Change Join predicate to Follows(name,phone,0,50)
HLC4: Remove Anna from output of Dictionary in R2
LLC: Remove 'anna' from FirstNames.dict
Generating Low-Level Changes from HLCs: Naive Approach
Input: Set of HLCs
Output: List of LLCs, ranked based on effects
Algorithm:
1) For each operator Op, consider all HLCs
2) For each HLC, enumerate all possible LLCs
3) For each LLC:
• Compute the set of local tuples it removes from the output of Op
• Propagate these removals up through the provenance graph to compute the effect on the end-to-end result
4) Rank LLCs
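Steps 3 and 4 of the naive approach can be sketched as below. Everything here is an assumption made for illustration: the DERIVES edges, the tuple labels, the two candidate LLCs, and the scoring (wrong results removed minus correct results lost), which is a simple stand-in for whatever ranking measure the system actually uses.

```python
# Hypothetical provenance edges: tuple -> tuples derived from it.
DERIVES = {
    "FirstNameCand:Anna": ["FirstName:Anna"],
    "FirstName:Anna": ["PersonPhoneAll:Anna..555-1234", "PersonPhoneAll:Anna..555-7789"],
}
CORRECT = {"PersonPhoneAll:Anna..555-1234", "PersonPhoneAll:James..555-7789"}
INCORRECT = {"PersonPhoneAll:Anna..555-7789"}

# Candidate LLCs mapped to the local tuples they remove from their operator's output.
LLCS = {
    "Remove 'anna' from FirstNames.dict": {"FirstNameCand:Anna"},
    "Change Join predicate to Follows(name,phone,0,50)": {"PersonPhoneAll:Anna..555-7789"},
}

def propagate(removed, derives):
    """All tuples that disappear once `removed` is deleted (join/select-style provenance)."""
    gone, stack = set(), list(removed)
    while stack:
        t = stack.pop()
        if t not in gone:
            gone.add(t)
            stack.extend(derives.get(t, []))
    return gone

def rank_llcs(llcs, derives, correct, incorrect):
    """Return (llc, wrong_removed, correct_lost) tuples, best net effect first."""
    rows = []
    for name, local_removed in llcs.items():
        gone = propagate(local_removed, derives)
        rows.append((name, len(gone & incorrect), len(gone & correct)))
    return sorted(rows, key=lambda r: r[2] - r[1])

ranked = rank_llcs(LLCS, DERIVES, CORRECT, INCORRECT)
```

In this toy setting the join-parameter change ranks first because it removes the wrong output without losing a correct one, while removing 'anna' from the dictionary also deletes the correct Anna result.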
Problems with Naive Approach
Problem 1: The number of possible LLCs for an HLC could be very large
Example: Remove an output tuple of a Dictionary operator
For a dictionary with 1000 entries, the number of possible LLCs is 2^999 - 1.
Solution:
Limit the LLCs considered to a set of tractable size, while still considering all feasible combinations of HLCs for a given operator
1) Generate a single LLC for each of k promising combinations of HLCs for a given operator
2) k is the number of LLCs presented to the user
Problems with Naive Approach
Problem 2: Traversing the provenance graph is expensive: O(n^2), where n is the size of the operator tree
Solution: Remember the mapping from each high-level change back to the affected output tuple.
Specific Classes of Low-Level Changes
1) Modify numerical join parameters
E.g., “Modify max char. distance of Follows() predicate in the join operator of rule R4 from 60 to 20”
2) Remove dictionary entries
E.g., “Modify the Dictionary operator of rule R2 by removing entry Anna from first_names.dict”
3) Add filtering dictionary
E.g., “Add predicate Not(ContainsDict('street_suffix.dict', RightContextTok(match,1))) to Dictionary operator of rule R3”
4) Add filtering view (applies to an entire view)
E.g., “Subtract from the result of rule R4 PersonPhoneAll spans that are strictly contained within another PersonPhoneAll span”
LLC Generation: Removing Dictionary Entries
Output of operator Dictionary('FirstNameDict') → Final output of FirstName extractor
James → James X, James Y, James Anderson
Anna → Anna XYZ, Anna ABC
Dictionary entries in 'FirstNameDict' → Effects of removing Dictionary entry
'anna' → Anna XYZ, Anna ABC
'james' → James X, James Y, James Anderson
Generated LLCs:
Remove from dictionary FirstNameDict the following entries:
1. 'anna'
2. 'anna', 'james'
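A possible way to arrive at such a list is to score each dictionary entry by the final outputs it is responsible for and emit the top-1, top-2, ... top-k entries as LLCs. The sketch below is an assumption-heavy illustration: the slide does not show which outputs are labeled incorrect, so the INCORRECT set is invented, and dictionary_llcs with its gain score is a hypothetical helper.

```python
# Final outputs that depend on each dictionary entry (from the example above).
ENTRY_OUTPUTS = {
    "anna": ["Anna XYZ", "Anna ABC"],
    "james": ["James X", "James Y", "James Anderson"],
}
# Assumed labels: the transcript does not say which outputs are wrong.
INCORRECT = {"Anna XYZ", "Anna ABC", "James X", "James Y"}

def dictionary_llcs(entry_outputs, incorrect, k):
    """Return up to k LLCs; the i-th LLC removes the i most beneficial entries."""
    def gain(entry):
        outs = entry_outputs[entry]
        bad = sum(o in incorrect for o in outs)
        return bad - (len(outs) - bad)   # wrong outputs removed minus correct outputs lost
    ranked = sorted((e for e in entry_outputs if gain(e) > 0), key=gain, reverse=True)
    return [ranked[:i] for i in range(1, min(k, len(ranked)) + 1)]

llcs = dictionary_llcs(ENTRY_OUTPUTS, INCORRECT, k=2)
```

Under the assumed labels, 'anna' has the larger gain, so the two generated LLCs are "remove 'anna'" and "remove 'anna', 'james'", the same shape as the list above. Generating k prefixes keeps the number of candidates small compared with enumerating every subset of entries.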
Experiments
- Rule refinement approach implemented in the SystemT information extraction system
- Uses SystemT’s AQL rule language
Goals:
- Quality evaluation of generated refinements
- Performance evaluation
Setup: Ubuntu Linux version 9.10, 2.26 GHz Intel Xeon CPU with 8 GB RAM. 10-fold cross-validation.
Extraction Tasks and Rule Sets
Person task
- 14 complex rules for identifying person names
• E.g., “CapitalizedWord followed by FirstName”, “LastName followed by Comma followed by CapitalizedWord”
- Rules for identifying other named entities
• E.g., Organization, EmailAddress, Address
• These can be used for filtering to enable refinement, e.g., “Morgan Stanley”, “Georgia”
PersonPhone task
- 11 complex rules for identifying phone numbers
- High-quality Person extractor
- One rule to identify PersonPhone candidates: “Person followed by Phone within 0 to 60 characters”
Evaluation Datasets
            Training Set        Test Set
Dataset     #docs    #labels    #docs    #labels
ACE         273      5201       69       1220
CoNLL       946      6560       216      1842
Enron       434      4500       218      1969
EnronPP     322      157        161      46
ACE: collection of newswire reports, broadcast news and conversations with Person labeled data from the ACE05 Dataset.
CoNLL: collection of news articles with Person labeled data from the CoNLL 2003 Shared Task.
Enron, EnronPP: collections of emails from the Enron corpus annotated with Person and PersonPhone labels, respectively.
Quality Evaluation
- F1-measure improves by 6% to 26% within a few iterations
- Recall remains stable
- F1-measure and precision reach a plateau
- First few high-ranked refinements
- Some low-level changes are not implemented yet
Quality Evaluation: Comparison with Experts
- Two experts
- Enron dataset for Person task
- Time: One hour
Performance Evaluation
- One hour by an expert
- 3 min to 15 min per refinement
- System refinement time: 2 min
Conclusion & Future work
- Database provenance technique for refining information extraction rules.
Future work:
- Extensions
• Other types of LLCs, e.g., Regex
- Addressing false negatives