ONDUX: On-Demand Unsupervised Learning for Information Extraction

Post on 11-Jan-2016


Page 1: ONDUX: On-Demand Unsupervised Learning for Information Extraction

Eli Cortez, Altigran da Silva and Edleno de Moura
Federal University of Amazonas (UFAM) - BRAZIL

Marcos Gonçalves
Federal University of Minas Gerais (UFMG) - BRAZIL

Page 2: Agenda

Introduction

Information Extraction by Text Segmentation

◦ Challenges

Related Work

ONDUX

Experiments

Conclusions and Future Work

Page 3: Introduction (I)

Abundance of on-line sources of text documents containing implicit semi-structured data records:

◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions

Page 4: Introduction (II)

Classified Ad:
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Address:
Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Bibliographic Reference:
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Page 5: Introduction (III)

Why extract information?

◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage

Classified Ad:
Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Extracted record:
<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273

Page 6: IETS – Challenges (I)

Information Extraction by Text Segmentation (IETS)

◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

Diversity of templates and styles:
◦ Attribute Ordering
◦ Capitalization
◦ Abbreviations

Different applications share similar domains.
Ex.: Addresses and Ads. Records from both domains contain address information.

Page 7: IETS – Challenges (II)

Diversity of templates and styles: Attribute Ordering; Capitalization; Abbreviations.

HomePage:
Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006

DBLP:
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)

ACM:
Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Page 8: IETS – Challenges (III)

Existing approaches to this problem use Machine Learning techniques:
◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Structured Support Vector Machines (SSVM)

• (Semi-)supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application.
• High computational cost.

Page 9: Related Work (I)

[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)

[McCallum et al. @ ICML 2001]
◦ Proposed the usage of Conditional Random Fields (CRF), a supervised model (S-CRF)

[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models

All of these approaches require an expert to create a hand-labeled training set for each application.

Page 10: Related Work (II)

[Agichtein et al. @ SIGKDD 2004]
◦ Usage of reference tables to create an unsupervised model using Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]
◦ Usage of reference tables to create unsupervised CRF models (U-CRF)

[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Domain-specific heuristics, not of general application

Both reference-table models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)

Page 11: Contributions

Proposal of an extraction method based on information retrieval to perform IETS tasks:

◦ Eliminates the need for a user to be involved in any source-specific training process
◦ Flexible, in the sense that it does not rely on any particular style to perform the extraction
◦ Unsupervised Reinforcement Phase: attribute ordering and positioning are learned On-Demand

Experimental comparison with the state-of-the-art information extraction approach (CRF).

Page 12: Basic Concepts (I)

Given an input string I representing an implicit textual record (e.g. a classified ad), the IETS task consists of:

1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute a
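As a hedged illustration of the input and output of this task, the sketch below writes the classified ad from the introduction as a list of (label, segment) pairs and checks that the segments cover the input string in order. The `Segment` type and the `check_segmentation` helper are assumptions made for illustration only; they are not part of ONDUX.

```python
from typing import List, Tuple

# One extracted segment: (attribute label, text of the segment).
Segment = Tuple[str, str]


def check_segmentation(input_string: str, segments: List[Segment]) -> bool:
    """Return True if the segments appear in order and together cover every
    alphanumeric character of the input (separators may be skipped)."""
    pos = 0
    for _label, text in segments:
        found = input_string.find(text, pos)
        if found < 0:
            return False
        # Only whitespace and punctuation may sit between two segments.
        if any(ch.isalnum() for ch in input_string[pos:found]):
            return False
        pos = found + len(text)
    # Nothing meaningful may be left after the last segment.
    return not any(ch.isalnum() for ch in input_string[pos:])


ad = "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273"

# Step 1 (segmenting) yields the texts; step 2 (labeling) yields the labels.
expected = [
    ("Neighborhood", "Regent Square"),
    ("Price", "$228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
    ("Bed.", "6 Bedrooms"),
    ("Bath.", "2 Bathrooms"),
    ("Phone", "412-638-7273"),
]

is_valid = check_segmentation(ad, expected)
```

Dropping any pair from `expected` leaves part of the ad uncovered, so the check fails; this is one simple way to state that IETS must account for the whole record.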

Page 13: Basic Concepts (II)

Knowledge Base
◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$
◦ Easily built from pre-existing sources
◦ Bibliographic DBs, Freebase, Google Fusion Tables, etc.

Example:

$KB = \{(\mathit{Neighborhood}, O_{Neigh.}), (\mathit{Street}, O_{Street}), (\mathit{Phone}, O_{Phone})\}$

$O_{Neigh.}$ = { “Regent Square”, “Milenight Park” }

$O_{Street}$ = { “Regent St.”, “Morewood Ave.”, “Square Ave. Park” }

$O_{Phone}$ = { “323 462-6252”, “(171) 289-7527” }
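A Knowledge Base of this shape maps naturally onto a dictionary from attribute name to a set of known values. The sketch below is only one possible encoding, using the example values from this slide; the `tokenize` and `attributes_containing` helpers are hypothetical names introduced here for illustration.

```python
from typing import Dict, List, Set

# Attribute name -> set of known occurrences (the O sets of the slide).
KnowledgeBase = Dict[str, Set[str]]

kb: KnowledgeBase = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def tokenize(text: str) -> List[str]:
    """Lowercase terms with surrounding punctuation removed."""
    terms = (t.strip(".,;()").lower() for t in text.split())
    return [t for t in terms if t]


def attributes_containing(kb: KnowledgeBase, term: str) -> List[str]:
    """Attributes that have at least one known value containing the term."""
    term = term.lower()
    return sorted(
        attr
        for attr, values in kb.items()
        if any(term in tokenize(value) for value in values)
    )


square_attrs = attributes_containing(kb, "Square")
morewood_attrs = attributes_containing(kb, "Morewood")
```

With these example values, "Square" occurs under both Neighborhood and Street, while "Morewood" occurs only under Street. A single term can therefore be ambiguous between attributes.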

Page 14: ONDUX (I)

Three main steps:

◦ Blocking
◦ Matching
◦ Reinforcement

Page 15: ONDUX (II): General View

[Figure: general view of ONDUX, with step 1 marked]

Page 16: ONDUX (III): Blocking

◦ Split the input text into substrings called blocks
◦ Consider the co-occurrence of consecutive terms based on the KB

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Figure annotations: "Co-occur in the KB (Neighborhood)"; "Left separated (no presence in the KB)"
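The blocking idea can be sketched as follows, under the assumption that two consecutive terms are merged into one block when they appear together in at least one known value of the KB. The function names and the normalization rule are illustrative choices, and the KB is the small example from the Basic Concepts slide, so the resulting blocks are not necessarily those shown in the original figure.

```python
from typing import Dict, List, Set

kb: Dict[str, Set[str]] = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def normalize(term: str) -> str:
    """Lowercase a term and drop trailing separators (assumed rule)."""
    return term.strip(".,;").lower()


def cooccur(kb: Dict[str, Set[str]], a: str, b: str) -> bool:
    """True if both terms appear in the same known value of some attribute."""
    a, b = normalize(a), normalize(b)
    for values in kb.values():
        for value in values:
            terms = {normalize(t) for t in value.split()}
            if a in terms and b in terms:
                return True
    return False


def blocking(text: str, kb: Dict[str, Set[str]]) -> List[str]:
    """Split text into blocks of consecutive terms that co-occur in the KB."""
    blocks: List[List[str]] = []
    for term in text.split():
        if blocks and cooccur(kb, blocks[-1][-1], term):
            blocks[-1].append(term)   # extend the current block
        else:
            blocks.append([term])     # start a new block
    return [" ".join(block) for block in blocks]


ad = "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273"
blocks = blocking(ad, kb)
```

With this small KB, "Regent" and "Square" are merged because they co-occur in a known Neighborhood value, while "Mifflin" stays on its own because it does not appear in the KB at all.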

Page 17: ONDUX (IV): General View

[Figure: general view of ONDUX, with steps 1 and 2 marked]

Page 18: ONDUX (V): Matching

◦ Associate each block generated in the previous phase with an attribute according to the Knowledge Base

◦ Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB

Page 19: ONDUX (VI): Matching

Textual Values: FF Function (Field Frequency)
◦ Similarity between the terms of the block and the terms of a given attribute of the KB

Numeric Values: NM Function (Numeric Matching) [Agrawal @ CIDR 2003]
◦ Similarity between the value of the block and the mean and standard deviation of a numeric attribute in the KB
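The slide gives only the intuition behind the two matching functions, not their formulas. The sketch below is a simplified stand-in rather than the functions used by ONDUX: `ff_score` averages, over the block's terms, the share of a term's KB occurrences that fall under the given attribute, and `nm_score` uses a Gaussian-shaped score around the attribute's mean. Both definitions, and the price values, are assumptions for illustration.

```python
import math
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List, Set

kb: Dict[str, Set[str]] = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def normalize(term: str) -> str:
    return term.strip(".,;").lower()


def term_counts(values: Set[str]) -> Counter:
    """For one attribute, count in how many known values each term appears."""
    counts: Counter = Counter()
    for value in values:
        counts.update({normalize(t) for t in value.split()})
    return counts


def ff_score(block: str, attr: str, kb: Dict[str, Set[str]]) -> float:
    """Simplified field-frequency score in [0, 1] (assumed definition)."""
    counts = {a: term_counts(values) for a, values in kb.items()}
    terms = [normalize(t) for t in block.split()]
    if not terms:
        return 0.0
    score = 0.0
    for term in terms:
        total = sum(c[term] for c in counts.values())
        if total:
            score += counts[attr][term] / total
    return score / len(terms)


def nm_score(value: float, known: List[float]) -> float:
    """Simplified numeric score based on mean and standard deviation."""
    mu, sigma = mean(known), pstdev(known)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


street_score = ff_score("Morewood Ave.", "Street", kb)
tie_a = ff_score("Regent Square", "Neighborhood", kb)
tie_b = ff_score("Regent Square", "Street", kb)

prices = [180000.0, 200000.0, 220000.0]  # hypothetical Price values
typical = nm_score(200000.0, prices)
atypical = nm_score(6.0, prices)
```

Under this simplified score, "Regent Square" ties between Neighborhood and Street because both of its terms occur under both attributes in the example KB. Ambiguity of this kind is what a later correction step has to resolve.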

Page 20: ONDUX (VI): Matching

Regent Square $228,900 1028 Mifflin Ave.;
Street Price No. ??? Street

6 Bedrooms; 2 Bathrooms. 412-638-7273
Bed. Bath. Phone

Page 21: ONDUX (VII)

How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?

Regent Square $228,900 1028 Mifflin Ave.;
Street Price No. ??? Street

6 Bedrooms; 2 Bathrooms. 412-638-7273
Bed. Bath. Phone

Page 22: ONDUX (VIII): Reinforcement

◦ Review the labeling performed in the Matching step
  Unmatched blocks must receive the label of some attribute
  Mismatched blocks must be correctly relabeled

◦ How to handle these cases?
  Using positioning and sequencing information obtained On-Demand.

Page 23: ONDUX (IX): General View

[Figure: general view of ONDUX, with steps 2 and 3 marked]

Page 24: ONDUX (X): Reinforcement

◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.

PSM: Positioning and Sequencing Model.

Page 25: ONDUX (XI): Reinforcement – PSM

Ordering and Positioning Probabilities are learned On-Demand, based on the test instances, through the Matching Phase.

In the PSM, each state represents an attribute of the KB, plus the special states start and end.

Edges represent transition probabilities.
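One way to picture how such a model could be estimated is by counting over the labels produced by the Matching step. In the sketch below, transition probabilities are the relative frequency of moving from one label to the next, and positioning probabilities are the relative frequency of a label at a given block position. Skipping unmatched blocks (represented as `None`) is an assumption made here, `build_psm` is a hypothetical name, and the label sequences are invented for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

START, END = "start", "end"


def build_psm(
    labeled_records: List[List[Optional[str]]],
) -> Tuple[Dict[Tuple[str, str], float], Dict[Tuple[str, int], float]]:
    """Estimate transition and positioning probabilities from Matching output.

    Each record is the list of labels given to its blocks, in order;
    None marks a block left unmatched.
    """
    trans_counts: Dict[str, Counter] = defaultdict(Counter)
    pos_counts: Dict[int, Counter] = defaultdict(Counter)

    for labels in labeled_records:
        path = [START] + list(labels) + [END]
        for prev, curr in zip(path, path[1:]):
            # Assumption: transitions touching an unmatched block are skipped.
            if prev is not None and curr is not None:
                trans_counts[prev][curr] += 1
        for k, label in enumerate(labels):
            if label is not None:
                pos_counts[k][label] += 1

    # t[(a_j, a_i)]: probability of moving from state a_j to state a_i.
    t = {
        (j, i): n / sum(counter.values())
        for j, counter in trans_counts.items()
        for i, n in counter.items()
    }
    # p[(a_i, k)]: probability of observing a_i at block position k.
    p = {
        (i, k): n / sum(counter.values())
        for k, counter in pos_counts.items()
        for i, n in counter.items()
    }
    return t, p


# Invented Matching outputs for four records.
records = [
    ["Neighborhood", "Price", "Street"],
    ["Neighborhood", "Price", "Street"],
    ["Street", "Price", None],
    ["Neighborhood", None, "Street"],
]
t, p = build_psm(records)
```

No hand-labeled data enters this computation: the counts come only from the labels assigned to the test instances themselves, which is the sense in which the model is learned On-Demand.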

Page 26: ONDUX (XII): Reinforcement

◦ Remarks
  The PSM is automatically learned On-Demand from test instances
  No a priori training required
  No assumptions regarding a particular order of attribute values
  Relies on the very effective strategies deployed in the Matching Step

Page 27: ONDUX (XIII): Reinforcement

◦ Once the PSM is built, we combine the matching, positioning and sequencing evidence using the Bayesian operator OR.

$$FS(B, a_i) = 1 - \big( (1 - M(B, a_i)) \times (1 - t_{j,i}) \times (1 - p_{i,k}) \big)$$

Here $M(B, a_i)$ is the Matching term, $t_{j,i}$ the Sequence term and $p_{i,k}$ the Positioning term.
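A Bayesian (noisy) OR of three scores in [0, 1] is one minus the product of their complements, so a high value in any single source of evidence is enough to produce a high combined score. The sketch below shows the arithmetic; the candidate scores are invented, and `final_score` and `best_label` are illustrative names.

```python
from typing import Dict, Tuple


def final_score(m: float, t: float, p: float) -> float:
    """Noisy-OR combination of matching (m), sequence (t) and
    positioning (p) evidence, each assumed to lie in [0, 1]."""
    return 1.0 - (1.0 - m) * (1.0 - t) * (1.0 - p)


def best_label(candidates: Dict[str, Tuple[float, float, float]]) -> str:
    """Pick the attribute whose combined score is highest."""
    return max(candidates, key=lambda attr: final_score(*candidates[attr]))


# Invented evidence for a block the Matching step left unlabeled (m = 0):
# only sequencing and positioning can speak for it.
candidates = {
    "Street": (0.0, 0.75, 0.75),
    "Price": (0.0, 0.25, 0.0),
}
street_fs = final_score(*candidates["Street"])  # 1 - 0.25 * 0.25
price_fs = final_score(*candidates["Price"])    # 1 - 0.75 * 1.0
chosen = best_label(candidates)
```

Because the combination is a product of complements, a block with no matching evidence at all can still receive a label when its position and its neighbors point strongly to one attribute.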

Page 28: ONDUX (XIV): Reinforcement

◦ Extraction Result

Regent Square $228,900 1028 Mifflin Ave.;
Neighborhood Price No. Street Street

6 Bedrooms; 2 Bathrooms. 412-638-7273
Bed. Bath. Phone

Labels changed with respect to the Matching output: "Regent Square" goes from Street to Neighborhood, and "Mifflin" goes from ??? to Street.

Page 29: ONDUX (XV): Overview

[Figure: overview of ONDUX, with steps 1, 2 and 3 marked]

Page 30: Experiments (I): Setup

◦ We tested our proposed approach with several sources from 3 distinct domains:
  Addresses: BigBook, Restaurants [RISE]
  Bibliographic Data: CORA [Peng@IPM'06], PersonalBib [Mansuri@ICDE'06]
  Classified Ads: 7 distinct newspaper sites [Oliveira@SBBD'06]

◦ We limited the presentation to one experiment per domain. More in the paper.

Page 31: Experiments (II): Evaluation

◦ Metrics
  Precision, Recall and F-Measure
  T-Test for the statistical validation of the results

◦ Baselines: Conditional Random Fields (CRF)
  U-CRF (Unsupervised method) [Zhao@SICDM'08]
  S-CRF (Classical supervised method) [Peng@IPM'06]
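For reference, the three metrics can be computed from the set of extracted (label, segment) pairs and the set of correct ones. The sketch below uses the standard definitions and an invented extraction result; the exact granularity used in the experiments (for example per attribute or per record) is not stated on this slide, so this is only an illustration.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]  # (attribute label, segment text)


def precision_recall_f1(predicted: Set[Pair], gold: Set[Pair]) -> Tuple[float, float, float]:
    """Standard precision, recall and F-measure over extracted pairs."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = {
    ("Neighborhood", "Regent Square"),
    ("Price", "$228,900"),
    ("No.", "1028"),
    ("Street", "Mifflin Ave."),
    ("Phone", "412-638-7273"),
}
# Invented system output: the street segment is cut short and No. is missed.
predicted = {
    ("Neighborhood", "Regent Square"),
    ("Price", "$228,900"),
    ("Street", "Mifflin"),
    ("Phone", "412-638-7273"),
}
precision, recall, f1 = precision_recall_f1(predicted, gold)
```

Here three of the four extracted pairs are correct and three of the five expected pairs are found, giving a precision of 0.75 and a recall of 0.6.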

Page 32: Experiments (III): Extraction Quality

U-CRF results are similar to those of Zhao@SICDM (validation)

The dataset follows the single-order assumption

After Reinforcement, ONDUX achieved similar quality

Page 33: Experiments (IV): Extraction Quality

S-CRF achieved better results than U-CRF due to the hand-labeled training

CORA includes a variety of citation styles (conference, journal, books, etc.)

In general, ONDUX outperformed the CRF models

Page 34: Experiments (V): Extraction Quality

Due to the Matching Phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results

U-CRF performed poorly (very heterogeneous dataset)

Page 35: Experiments (VI)

Varying the number of terms common to the test instances and the KB

◦ Determine how much the quality of the results depends on the overlap between the previously known data and the text input.

These experiments were conducted with the BigBook dataset.

Page 36: Experiments (VII)

Varying the number of shared terms

Even when the Matching Phase yields poor quality, the PSM is able to increase ONDUX’s quality in the Reinforcement Step

Starting with a batch of 500 input strings, ONDUX achieved high quality results once the overlap reached 500 terms

Page 37: Experiments (VIII)

Varying the number of shared terms

As the number of shared terms increases, the quality achieved by the Matching Phase improves

Page 38: Conclusions and Future Work (I)

New approach for information extraction, independent of the style of the data records

ONDUX
◦ Flexible: does not consider any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching Phase

Page 39: Conclusions and Future Work (II)

The proposed strategy achieves good precision and recall results
◦ Small size of the Knowledge Base
◦ Comparison with the state of the art

As Future Work
◦ Investigate different matching functions
◦ Nested structures?

Page 40:

Acknowledgements

UFMG

Page 41:

Questions?

Page 42: Experiments: Setup

Experiment           Dataset (records)   Source (records)
BigBook X BigBook    2000                2000
CORA X CORA          150                 350
Folha X Web Ads      500                 125

Page 43: Experiments

Page 44: Experiments