ondux on-demand unsupervised learning for information extraction eli cortez, altigran da silva and...
Post on 11-Jan-2016
ONDUX
On-Demand Unsupervised Learning for Information Extraction
Eli Cortez, Altigran da Silva and Edleno de Moura
Federal University of Amazonas (UFAM) - BRAZIL
Marcos Gonçalves
Federal University of Minas Gerais (UFMG) - BRAZIL
Agenda
Introduction
Information Extraction by Text Segmentation
◦ Challenges
Related Work
ONDUX
Experiments
Conclusions and Future Work
Introduction (I)
Abundance of on-line sources of text documents containing implicit semi-structured data records:
◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions
Introduction (II)
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Classified Ad
Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214
Address
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
Bibliographic Reference
Introduction (III)
Why extract information?
◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
Classified Ad
<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273
IETS – Challenges (I)
Information Extraction by Text Segmentation (IETS)
◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09
Diversity of templates and styles
◦ Attribute Ordering
◦ Capitalization
◦ Abbreviations
Different applications share similar domains
◦ Ex.: Addresses and Ads
◦ Records from both domains contain address information
IETS – Challenges (II)
Diversity of templates and styles
◦ Attribute Ordering; Capitalization; Abbreviations.
HomePage
DBLP
ACM
Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221(2006)
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006
IETS – Challenges (III)
Existing approaches to this problem use Machine Learning techniques
◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Structured Support Vector Machines (SSVM)
• (Semi-)supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application.
• High computational cost.
Related Work
[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)
[McCallum et al. @ ICML 2001]
◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)
[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models
All of these approaches require an expert to create a hand-labeled training set for each application.
Related Work (II)
[Agichtein et al. @ SIGKDD 2004]
◦ Uses reference tables to create an unsupervised model based on Hidden Markov Models (HMM)
[Zhao et al. @ SIAM ICDM 2008]
◦ Uses reference tables to create unsupervised CRF models (U-CRF)
[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Domain-specific heuristics, not generally applicable
Both reference-table models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)
Contributions
Proposal of an extraction method based on information retrieval to perform IETS tasks
◦ Eliminates the need for a user to be involved in any source-specific training process
◦ Flexible, in the sense that it does not rely on any particular style to perform the extraction
◦ Unsupervised Reinforcement Phase: attribute ordering and positioning are learned On-Demand
Experimental comparison with the state-of-the-art information extraction approach (CRF).
Basic Concepts (I)
Given an input string I representing an implicit textual record (e.g. a classified ad), the IETS task consists of:
1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute a
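To make the two sub-tasks concrete, the following minimal sketch represents an extraction result as an ordered list of (segment, label) pairs and checks that the segments really come from the input string in order. The ad text and the labels are taken from the slides; the function name `is_valid_segmentation` and the list-of-pairs layout are assumptions made only for illustration.

```python
def is_valid_segmentation(text, labeled_segments):
    """Check that the segments occur in `text` in order and without overlap."""
    position = 0
    for segment, _label in labeled_segments:
        found = text.find(segment, position)
        if found == -1:
            return False
        position = found + len(segment)
    return True


# Input string I (classified ad from the slides).
ad = ("Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; "
      "2 Bathrooms. 412-638-7273")

# Step 1 (segmenting) yields the first item of each pair,
# step 2 (labeling) yields the second.
extraction = [
    ("Regent Square", "Neighborhood"),
    ("$228,900", "Price"),
    ("1028", "No."),
    ("Mifflin Ave.", "Street"),
    ("6 Bedrooms", "Bed."),
    ("2 Bathrooms", "Bath."),
    ("412-638-7273", "Phone"),
]

print(is_valid_segmentation(ad, extraction))
```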
Basic Concepts (II)
Knowledge Base
◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$ (attribute name $m_i$, set $O_i$ of known values of that attribute)
◦ Easily built from pre-existing sources
    ◦ Bibliographic DBs, Freebase, Google Fusion Tables, etc.
$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$
$O_{\text{Neigh.}}$ = { “Regent Square”, “Milenight Park” }
$O_{\text{Street}}$ = { “Regent St.”, “Morewood Ave.”, “Square Ave. Park” }
$O_{\text{Phone}}$ = { “323 462-6252”, “(171) 289-7527” }
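As a rough illustration of the structure, the knowledge base can be held as a mapping from attribute name to its set of known values. The values below are the ones shown on the slide; the dictionary layout and the helper `attributes_containing` are assumptions for this sketch, not part of ONDUX itself.

```python
# Hypothetical in-memory knowledge base: attribute name -> set of known values.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def attributes_containing(term, kb):
    """Return, sorted by name, the attributes having a value that contains `term` as a token."""
    return sorted(
        attribute
        for attribute, values in kb.items()
        if any(term in value.split() for value in values)
    )


# A term can be known under more than one attribute, which is why
# matching needs a similarity score rather than a plain lookup.
print(attributes_containing("Regent", KB))
print(attributes_containing("323", KB))
```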
ONDUX (I)
Three main steps
◦ Blocking
◦ Matching
◦ Reinforcement
ONDUX (II)
General View [figure: step 1]
ONDUX (III)
Blocking
◦ Split the input text into substrings called blocks
◦ Consider the co-occurrence of consecutive terms based on the KB
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273
◦ “Regent Square”: the terms co-occur in the KB (Neighborhood)
◦ “Mifflin” and “Ave.”: left separated (“Mifflin” has no presence in the KB)
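A minimal sketch of the blocking idea, under the assumption that two consecutive terms are merged into one block whenever both appear together in at least one known value of the KB. The three slide attributes are extended here with hypothetical "Bedrooms" and "Bathrooms" entries so the whole ad can be processed; the function names and the punctuation handling are likewise illustrative choices.

```python
def normalize(token):
    """Lowercase a token and drop surrounding separators."""
    return token.strip(".,;").lower()


def build_index(kb):
    """Map each normalized term to the set of (attribute, value) pairs containing it."""
    index = {}
    for attribute, values in kb.items():
        for value in values:
            for token in value.split():
                index.setdefault(normalize(token), set()).add((attribute, value))
    return index


def blocking(text, kb):
    """Group consecutive tokens that co-occur in some KB value into blocks."""
    index = build_index(kb)
    blocks, previous = [], None
    for token in text.split():
        term = normalize(token)
        shared = index.get(term, set()) & index.get(previous, set())
        if blocks and shared:
            blocks[-1].append(token.strip(";,"))
        else:
            blocks.append([token.strip(";,")])
        previous = term
    return [" ".join(block) for block in blocks]


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Bedrooms": {"6 Bedrooms"},    # hypothetical entry, not on the slide
    "Bathrooms": {"2 Bathrooms"},  # hypothetical entry, not on the slide
}
ad = ("Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; "
      "2 Bathrooms. 412-638-7273")
blocks = blocking(ad, kb)
print(blocks)
```

With this toy KB, "Regent" and "Square" end up in one block, while "Mifflin" stays alone because it never occurs in the KB.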
ONDUX (IV)
General View [figure: steps 1 and 2]
ONDUX (V)
Matching
◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base
◦ Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB
ONDUX (VI)
Matching
◦ Textual values: FF function (Field Frequency)
    ◦ Similarity between the terms in the block and the terms of a given attribute in the KB
◦ Numeric values: NM function (Numeric Matching) [Agrawal @ CIDR 2003]
    ◦ Similarity between the value in the block and the mean and standard deviation of a numeric attribute in the KB
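The slide does not give the formulas, so the sketch below shows one plausible instantiation and should be read as an assumption rather than the exact ONDUX definitions. Here `ff` averages, over the terms of a block, how exclusive and how frequent each term is within an attribute, and `nm` scores a number with a Gaussian-shaped function of its distance from the attribute's mean. The price values are hypothetical.

```python
import math
import statistics
from collections import Counter


def term_counts(values):
    """Count, for each term, how many values of an attribute contain it."""
    counts = Counter()
    for value in values:
        counts.update(set(value.lower().split()))
    return counts


def ff(block, attribute, kb):
    """Assumed field-frequency score of a textual block for one attribute."""
    counts = {name: term_counts(values) for name, values in kb.items()}
    f_max = max(counts[attribute].values(), default=0)
    terms = block.lower().split()
    if not terms or f_max == 0:
        return 0.0
    total = 0.0
    for term in terms:
        f = counts[attribute][term]               # frequency inside the attribute
        n = sum(c[term] for c in counts.values())  # frequency in the whole KB
        if n:
            total += (f / n) * (f / f_max)
    return total / len(terms)


def nm(value, known_values):
    """Assumed numeric-matching score based on mean and standard deviation."""
    mean = statistics.mean(known_values)
    sigma = statistics.pstdev(known_values)
    if sigma == 0:
        return 1.0 if value == mean else 0.0
    return math.exp(-((value - mean) ** 2) / (2 * sigma ** 2))


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
}
prices = [200000.0, 250000.0, 300000.0]  # hypothetical known prices

print(ff("Morewood Ave.", "Street", kb), ff("Morewood Ave.", "Neighborhood", kb))
print(nm(228900, prices), nm(1028, prices))
```

Under these assumptions a price-like number scores far higher against the price attribute than a house number does, which is the behavior the slide describes for numeric blocks.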
ONDUX (VI)
Matching
Regent Square | $228,900 | 1028 | Mifflin | Ave. | 6 Bedrooms | 2 Bathrooms | 412-638-7273
Street | Price | No. | ??? | Street | Bed. | Bath. | Phone
ONDUX (VII)
How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?
Regent Square | $228,900 | 1028 | Mifflin | Ave. | 6 Bedrooms | 2 Bathrooms | 412-638-7273
Street | Price | No. | ??? | Street | Bed. | Bath. | Phone
ONDUX (VIII)
Reinforcement
◦ Review the labeling task performed in the Matching step
    ◦ Unmatched blocks must receive a label of a given attribute
    ◦ Mismatched blocks must be correctly labeled
◦ How to handle these cases?
    ◦ Using positioning and sequencing information that is obtained On-Demand.
ONDUX (IX)
General View [figure: steps 2 and 3]
ONDUX (X)
Reinforcement
◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.
    ◦ PSM: Positioning and Sequencing Model.
ONDUX (XI)
Reinforcement – PSM
Ordering and positioning probabilities are learned On-Demand from the test instances through the Matching Phase
In the PSM, states represent the attributes of the KB, plus the special states start and end
Edges represent transition probabilities
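One way to picture how such a model could be estimated is by simple counting over the label sequences produced by the Matching step. The sketch below assumes relative-frequency estimates and assumes that unmatched blocks (represented as `None`) are skipped; the function name `build_psm`, the state names and the toy sequences are all hypothetical.

```python
from collections import Counter, defaultdict

START, END = "<start>", "<end>"


def normalize_counts(table):
    """Turn nested counts into relative frequencies per outer key."""
    return {
        key: {item: count / sum(counter.values()) for item, count in counter.items()}
        for key, counter in table.items()
    }


def build_psm(label_sequences):
    """Estimate transition and position probabilities from matching output.

    Each sequence lists the label given to every block of one input record;
    None marks a block the Matching step could not label.
    """
    transition_counts = defaultdict(Counter)
    position_counts = defaultdict(Counter)
    for labels in label_sequences:
        path = [START] + list(labels) + [END]
        for previous, following in zip(path, path[1:]):
            if previous is not None and following is not None:
                transition_counts[previous][following] += 1
        for k, label in enumerate(labels):
            if label is not None:
                position_counts[k][label] += 1
    return normalize_counts(transition_counts), normalize_counts(position_counts)


# Toy matching output for three records (hypothetical).
sequences = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Price", None],
    ["Street", "Price", "Street"],
]
transitions, positions = build_psm(sequences)
print(transitions[START])
print(positions[0])
```

Because the counts come from the records being extracted, no separate training set is involved, which is the sense in which the model is learned on demand.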
ONDUX (XII)
Reinforcement
◦ Remarks
    ◦ The PSM is automatically learned On-Demand from test instances
    ◦ No a priori training required
    ◦ No assumptions regarding a particular order of attribute values
    ◦ Relies on the very effective strategies deployed in the Matching Step
ONDUX (XIII)
Reinforcement
◦ Once the PSM is built, we combine the matching, sequencing and positioning evidence using the Bayesian operator OR.
$$FS(B, a_i) = 1 - \big( (1 - M(B, a_i)) \times (1 - t_{j,i}) \times (1 - p_{i,k}) \big)$$
Matching: $M(B, a_i)$; Sequence: $t_{j,i}$; Positioning: $p_{i,k}$
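The combination is a disjunction of three independent pieces of evidence, so any single strong signal is enough to yield a high score. A small sketch with hypothetical input values shows the effect; the function name `final_score` is an assumption.

```python
def final_score(matching, transition, position):
    """FS(B, a_i) = 1 - (1 - M(B, a_i)) * (1 - t_ji) * (1 - p_ik)."""
    return 1.0 - (1.0 - matching) * (1.0 - transition) * (1.0 - position)


# An unmatched block (M = 0) can still receive a high score for an
# attribute if the sequencing and positioning evidence support it.
print(final_score(0.0, 0.8, 0.5))

# With no evidence at all the score stays at zero.
print(final_score(0.0, 0.0, 0.0))
```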
ONDUX (XIV)
Reinforcement
◦ Extraction Result
Regent Square | $228,900 | 1028 | Mifflin | Ave. | 6 Bedrooms | 2 Bathrooms | 412-638-7273
Neighborhood (was Street) | Price | No. | Street (was ???) | Street | Bed. | Bath. | Phone
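A relabeling pass of this kind might look like the sketch below, which walks the blocks from left to right and assigns each one the attribute with the highest combined score. The greedy left-to-right strategy, the function name `reinforce`, and every probability and matching score in the example are assumptions chosen only to reproduce the corrections shown on the slide.

```python
def reinforce(blocks, match_scores, transitions, positions, attributes):
    """Relabel every block with the attribute that maximizes the combined score."""
    labels, previous = [], "<start>"
    for k, block in enumerate(blocks):
        best_attribute, best_score = None, -1.0
        for attribute in attributes:
            m = match_scores.get((block, attribute), 0.0)
            t = transitions.get(previous, {}).get(attribute, 0.0)
            p = positions.get(k, {}).get(attribute, 0.0)
            score = 1.0 - (1.0 - m) * (1.0 - t) * (1.0 - p)
            if score > best_score:
                best_attribute, best_score = attribute, score
        labels.append(best_attribute)
        previous = best_attribute
    return labels


attributes = ["Neighborhood", "Price", "No.", "Street"]
blocks = ["Regent Square", "$228,900", "1028", "Mifflin", "Ave."]

# Hypothetical matching scores: "Regent Square" leans towards Street,
# and "Mifflin" matched nothing.
match_scores = {
    ("Regent Square", "Street"): 0.4,
    ("Regent Square", "Neighborhood"): 0.3,
    ("$228,900", "Price"): 0.9,
    ("1028", "No."): 0.8,
    ("Ave.", "Street"): 0.9,
}
# Hypothetical PSM probabilities.
transitions = {
    "<start>": {"Neighborhood": 0.9, "Street": 0.1},
    "Neighborhood": {"Price": 1.0},
    "Price": {"No.": 1.0},
    "No.": {"Street": 1.0},
    "Street": {"Street": 0.5},
}
positions = {
    0: {"Neighborhood": 0.9, "Street": 0.1},
    1: {"Price": 1.0},
    2: {"No.": 1.0},
    3: {"Street": 1.0},
    4: {"Street": 1.0},
}

labels = reinforce(blocks, match_scores, transitions, positions, attributes)
print(list(zip(blocks, labels)))
```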
ONDUX (XV)
Overview [figure: steps 1, 2 and 3]
Experiments (I)
Setup
◦ We tested our proposed approach with several sources from 3 distinct domains:
    ◦ Addresses: BigBook, Restaurants [RISE]
    ◦ Bibliographic Data: CORA [Peng@IPM'06], PersonalBib [Mansuri@ICDE'06]
    ◦ Classified Ads: 7 distinct newspaper sites [Oliveira@SBBD'06]
◦ We limit the presentation to one experiment per domain. More in the paper.
Experiments (II)
Evaluation
◦ Metrics
    ◦ Precision, Recall and F-Measure
    ◦ T-Test for the statistical validation of the results
◦ Baselines: Conditional Random Fields (CRF)
    ◦ U-CRF (Unsupervised method) [Zhao@SICDM'08]
    ◦ S-CRF (Classical supervised method) [Peng@IPM'06]
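For reference, the three metrics can be computed by comparing the set of extracted (segment, label) pairs with a gold-standard set. The sketch below assumes exact matching of both segment and label, which may differ from how the paper counts partially correct segments; the gold and predicted sets are illustrative.

```python
def precision_recall_f1(predicted, gold):
    """Compute precision, recall and F-measure over sets of (segment, label) pairs."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ("Regent Square", "Neighborhood"),
    ("$228,900", "Price"),
    ("1028", "No."),
    ("Mifflin Ave.", "Street"),
}
predicted = {
    ("Regent Square", "Street"),  # mislabeled block
    ("$228,900", "Price"),
    ("1028", "No."),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
print(precision, recall, f1)
```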
Experiments (III)
Extraction Quality
U-CRF results similar to Zhao@SICDM (validation)
Dataset follows the single order assumption
After Reinforcement, ONDUX achieved similar quality
Experiments (IV)
Extraction Quality
S-CRF achieved better results than U-CRF thanks to its hand-labeled training set
CORA includes a variety of citation styles (conference, journal, books, etc.)
In general, ONDUX outperformed CRF models
Experiments (V)
Extraction Quality
Due to the Matching Phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results
U-CRF performed poorly (very heterogeneous dataset)
Experiments (VI)
Varying the number of terms common to test instances and the KB
◦ Determine how much the quality of results depends on the overlap between the previously known data and the text input.
These experiments were conducted with the BigBook dataset.
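The overlap in question can be thought of as the set of distinct terms that appear both in the KB values and in the input strings. A minimal sketch of how it could be measured follows; the normalization (lowercasing, stripping trailing punctuation) and the function name `shared_terms` are assumptions, and the data is the toy KB and ad from the earlier slides rather than BigBook.

```python
def normalize(token):
    """Lowercase a token and drop surrounding separators."""
    return token.strip(".,;").lower()


def shared_terms(kb, inputs):
    """Return the distinct terms found both in the KB and in the input strings."""
    kb_terms = {
        normalize(token)
        for values in kb.values()
        for value in values
        for token in value.split()
    }
    input_terms = {normalize(token) for text in inputs for token in text.split()}
    return kb_terms & input_terms


kb = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}
inputs = [
    "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; "
    "2 Bathrooms. 412-638-7273"
]
overlap = shared_terms(kb, inputs)
print(sorted(overlap), len(overlap))
```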
Experiments (VII)
Varying the number of shared terms
Even when the Matching Phase yields poor quality, the PSM is able to increase ONDUX's quality in the Reinforcement Step
Starting with a batch of 500 input strings, ONDUX achieved high-quality results once the overlap reached 500 terms
Experiments (VIII)
Varying the number of shared terms
As the number of shared terms increases, the quality achieved by the Matching phase improves
Conclusions and Future Work (I)
New approach for information extraction, independent of the style of the data records
ONDUX
◦ Flexible: does not consider any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching Phase
Proposed strategy achieves good precision and recall
◦ Small size of the Knowledge Base
◦ Comparison with the state of the art
Future Work
◦ Investigate different matching functions
◦ Nested structures?
Conclusions and Future Work (II)
Acknowledgements
Questions?
Experiments
Setup
Experiment           Dataset (# records)   Source (# records)
BigBook X BigBook    2000                  2000
CORA X CORA          150                   350
Folha X Web Ads      500                   125