ondux on-demand unsupervised learning for information extraction eli cortez, altigran da silva and...

Post on 11-Jan-2016


ONDUX

On-Demand Unsupervised Learning for Information Extraction

Eli Cortez, Altigran da Silva and Edleno de Moura

Federal University of Amazonas (UFAM) - BRAZIL

Marcos Gonçalves

Federal University of Minas Gerais (UFMG) - BRAZIL

Agenda

Introduction

Information Extraction by Text Segmentation

◦ Challenges

Related Work

ONDUX

Experiments

Conclusions and Future Work

Introduction (I)

Abundance of on-line sources of text documents containing implicit semi-structured data records

◦ Addresses
◦ Bibliographic References
◦ Classified Ads
◦ Product Descriptions

Introduction (II)

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Classified Ad

Dr. Robert A. Jacobson, 8109 Harford Road, Baltimore, MD 21214

Address

Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

Bibliographic Reference

Introduction (III)

Why extract information?

◦ Database Storage, Query…
◦ Data Mining
◦ Record Linkage

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

Classified Ad

<Neighborhood> : Regent Square
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273

IETS – Challenges (I)

Information Extraction by Text Segmentation (IETS)

◦ Borkar@SIGMOD'01, McCallum@ICML'01, Agichtein@SIGKDD'04, Mansuri@ICDE'06, Zhao@SICDM'08, Cortez@JASIST'09

Diversity of templates and styles

◦ Attribute ordering
◦ Capitalization
◦ Abbreviations

Different applications share similar domains

◦ Ex.: Addresses and Ads. Records from both domains contain address information

IETS – Challenges (II)

Diversity of templates and styles: attribute ordering, capitalization, abbreviations.

HomePage: Link-based similarity measures for the classification of Web documents. Pável Calado. Journal of the American Society for the Information Science and Technology – 57(2) 2006

DBLP: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno Silva de Moura, Berthier A. Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST 57 (2) 208-221 (2006)

ACM: Pável Calado, Marco Cristo, Marcos André Gonçalves, Edleno S. de Moura, Berthier Ribeiro-Neto, Nivio Ziviani. Link-based similarity measures for the classification of Web documents. JASIST, v. 57 n.2, p. 208-221, January 2006

IETS – Challenges (III)

Existing approaches to this problem use Machine Learning techniques

◦ Hidden Markov Models (HMM)
◦ Conditional Random Fields (CRF)
◦ Structured Support Vector Machines (SSVM)

• (Semi-)supervised approaches require a hand-labeled training set created by an expert.
• Each generated model is particular to a given application
• High computational cost

Related Work

[Borkar et al. @ SIGMOD 2001]
◦ Supervised extraction method based on Hidden Markov Models (HMM)

[McCallum et al. @ ICML 2001]
◦ Proposed the use of Conditional Random Fields (CRF), a supervised model (S-CRF)

[Mansuri et al. @ ICDE 2006]
◦ Semi-supervised approach based on CRF models

All of these approaches require an expert to create a hand-labeled training set for each application.

Related Work (II)

[Agichtein et al. @ SIGKDD 2004]
◦ Use of reference tables to create an unsupervised model based on Hidden Markov Models (HMM)

[Zhao et al. @ SIAM ICDM 2008]
◦ Use of reference tables to create unsupervised CRF models (U-CRF)

[Cortez et al. @ JASIST 2009]
◦ Unsupervised method to extract bibliographic information
◦ Relies on domain-specific heuristics; not generally applicable

Both models assume a single positioning and ordering of attributes in all test instances. (Distinct orderings?)

Contributions

Proposal of an extraction method based on information retrieval to perform IETS tasks

◦ Eliminates the need for a user to be involved in any source-specific training process
◦ Flexible, in the sense that it does not rely on any particular style to perform the extraction
◦ Unsupervised Reinforcement phase: attribute ordering and positioning are learned On-Demand

Experimental comparison with the state-of-the-art information extraction approach (CRF).

Basic Concepts (I)

Given an input string I representing an implicit textual record (e.g., a classified ad), the IETS task consists of:

1. Segmenting I
2. Assigning to each segment a label corresponding to an attribute a

Basic Concepts (II)

Knowledge Base
◦ Set of pairs $KB = \{(m_1, O_1), \ldots, (m_n, O_n)\}$
◦ Easily built from pre-existing sources
◦ Bibliographic DBs, Freebase, Google Fusion Tables, etc.

$KB = \{(\text{Neighborhood}, O_{\text{Neigh.}}), (\text{Street}, O_{\text{Street}}), (\text{Phone}, O_{\text{Phone}})\}$

$O_{\text{Neigh.}} = \{$ “Regent Square”, “Milenight Park” $\}$

$O_{\text{Street}} = \{$ “Regent St.”, “Morewood Ave.”, “Square Ave. Park” $\}$

$O_{\text{Phone}} = \{$ “323 462-6252”, “(171) 289-7527” $\}$
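As a rough illustration, a knowledge base of this shape can be held in a plain mapping from attribute name to the set of its known values. The sketch below uses the example values from this slide; the dictionary layout and the helper name kb_vocabulary are assumptions made for illustration and are not taken from ONDUX itself.

```python
# Knowledge base: attribute name -> set of known values (slide example).
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}


def kb_vocabulary(kb):
    """Return, for each attribute, the set of lower-cased terms in its values.

    Hypothetical helper: splitting on whitespace is a simplification.
    """
    return {
        attr: {term.lower() for value in values for term in value.split()}
        for attr, values in kb.items()
    }


vocab = kb_vocabulary(KB)
print(sorted(vocab["Neighborhood"]))
```

Note that the terms "regent" and "square" appear in the vocabulary of both Neighborhood and Street, which is the kind of overlap that makes purely term-based labeling ambiguous.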

ONDUX (I)

Three main steps

◦Blocking

◦Matching

◦Reinforcement

ONDUX (II)

General View

[Figure: general view of ONDUX, with step marker 1]

ONDUX (III)

Blocking

◦ Split the input text into substrings called blocks
◦ Consider the co-occurrence of consecutive terms based on the KB

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

◦ "Regent Square": the two terms co-occur in the KB (Neighborhood)
◦ Terms with no presence in the KB are left separated
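The Blocking idea can be sketched in a few lines of Python: consecutive terms are merged into one block when they co-occur in some value of the KB, and everything else stays as a single-term block. This is a minimal sketch under assumptions: the function names, the punctuation stripping and the "co-occur in the same KB value" test are illustrative choices, not the authors' implementation.

```python
def normalize(term):
    """Lower-case a term and drop surrounding punctuation."""
    return term.strip(".,;:").lower()


def cooccurring_pairs(kb):
    """Collect unordered pairs of terms that appear together in some KB value."""
    pairs = set()
    for values in kb.values():
        for value in values:
            terms = [normalize(t) for t in value.split()]
            for i, first in enumerate(terms):
                for second in terms[i + 1:]:
                    pairs.add(frozenset((first, second)))
    return pairs


def blocking(text, kb):
    """Split text into blocks, merging consecutive terms that co-occur in the KB."""
    pairs = cooccurring_pairs(kb)
    blocks = []  # each block is a list of terms
    for raw in text.split():
        term = raw.strip(".,;:")
        if not term:
            continue
        previous = blocks[-1][-1] if blocks else None
        if previous and frozenset((normalize(previous), normalize(term))) in pairs:
            blocks[-1].append(term)
        else:
            blocks.append([term])
    return [" ".join(block) for block in blocks]


# Small KB taken from the Basic Concepts example.
KB = {
    "Neighborhood": {"Regent Square", "Milenight Park"},
    "Street": {"Regent St.", "Morewood Ave.", "Square Ave. Park"},
    "Phone": {"323 462-6252", "(171) 289-7527"},
}
AD = "Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273"

blocks = blocking(AD, KB)
print(blocks)
```

With this small KB, "Regent Square" becomes one block, while "6" and "Bedrooms" stay separate because the example KB holds no bedroom values; a KB covering that attribute would merge them.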

ONDUX (IV)

General View

[Figure: general view of ONDUX, with step markers 1 and 2]

ONDUX (V)

Matching

◦ Associate each block generated in the previous phase with an attribute, according to the Knowledge Base

◦ Use distinct functions to compute the similarity between a block and the known values of the attributes in the KB

ONDUX (VI)

Matching

◦ Textual values: FF function (Field Frequency). Similarity between the terms of the block and the terms of a given attribute in the KB

◦ Numeric values: NM function (Numeric Matching) [Agrawal @ CIDR 2003]. Similarity between the value in the block and the mean and standard deviation of a numeric attribute in the KB
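The slides name the two matching functions but do not give their formulas, so the following is only a sketch under stated assumptions. Here ff_similarity is a simplified stand-in that measures the fraction of block terms found among an attribute's KB terms, and nm_similarity assumes a Gaussian-shaped score built from the mean and standard deviation of the attribute's known numbers. Both function names and both formulas are illustrative, not the definitions used by ONDUX.

```python
import math
from statistics import mean, pstdev


def ff_similarity(block, known_values):
    """Assumed textual score: share of block terms seen in the attribute's values."""
    block_terms = {t.lower() for t in block.split()}
    attr_terms = {t.lower() for v in known_values for t in v.split()}
    if not block_terms:
        return 0.0
    return len(block_terms & attr_terms) / len(block_terms)


def nm_similarity(value, known_numbers):
    """Assumed numeric score: Gaussian-shaped closeness to the attribute's mean."""
    mu = mean(known_numbers)
    sigma = pstdev(known_numbers)
    if sigma == 0:
        return 1.0 if value == mu else 0.0
    return math.exp(-((value - mu) ** 2) / (2 * sigma ** 2))


neighborhoods = {"Regent Square", "Milenight Park"}
streets = {"Regent St.", "Morewood Ave.", "Square Ave. Park"}
prices = [200000, 250000, 300000]  # made-up values for illustration

print(ff_similarity("Regent Square", neighborhoods))
print(ff_similarity("Regent Square", streets))
print(ff_similarity("Mifflin", streets))
print(nm_similarity(250000, prices))
print(nm_similarity(1028, prices))
```

Under this simplified score, "Regent Square" matches Street as strongly as Neighborhood, because both of its terms also occur in street values, while "Mifflin" matches nothing. Cases like these are what a later correction step has to resolve.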

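ONDUX (VI)

Matching

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

<Street> : Regent Square
<Price> : $228,900
<No.> : 1028
<???> : Mifflin
<Street> : Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273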

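ONDUX (VII)

How can we deal with blocks that were incorrectly labeled or were not associated with any attribute?

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

<Street> : Regent Square (mismatched)
<Price> : $228,900
<No.> : 1028
<???> : Mifflin (unmatched)
<Street> : Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273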

ONDUX (VIII)

Reinforcement

◦ Review the labeling performed in the Matching step
  Unmatched blocks must receive the label of some attribute
  Mismatched blocks must be correctly relabeled

◦ How to handle these cases? Using positioning and sequencing information that is obtained On-Demand.

ONDUX (IX)

General View

[Figure: general view of ONDUX, with step markers 2 and 3]

ONDUX (X)

Reinforcement

◦ Given the extraction output of the Matching step, ONDUX automatically builds a graphical structure, the PSM.

PSM: Positioning and Sequencing Model.

ONDUX (XI)

Reinforcement – PSM

Ordering and positioning probabilities are learned On-Demand from the test instances through the Matching phase

In the PSM, each state represents an attribute of the KB, plus the special states start and end

Edges represent transition probabilities

ONDUX (XII)

Reinforcement

◦ Remarks
  The PSM is automatically learned On-Demand from test instances
  No a priori training required
  No assumptions regarding a particular order of attribute values
  Relies on the very effective strategies deployed in the Matching step

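ONDUX (XIII)

Reinforcement

◦ Once the PSM is built, we combine the matching, sequencing and positioning evidence using the Bayesian operator OR.

$FS(B, a_i) = 1 - \left( (1 - M(B, a_i)) \times (1 - t_{j,i}) \times (1 - p_{i,k}) \right)$

The three factors correspond to matching, $M(B, a_i)$; sequencing, $t_{j,i}$; and positioning, $p_{i,k}$.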
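The disjunctive combination of the three pieces of evidence, FS = 1 - (1 - M)(1 - t)(1 - p), can be sketched as follows. The function names and all numeric scores are hypothetical and chosen only to show the behavior: a block with no matching evidence at all can still receive a high final score when sequencing and positioning both point to the same attribute.

```python
def final_score(matching, transition, position):
    """Combine matching, sequencing and positioning evidence with a disjunctive OR."""
    return 1.0 - (1.0 - matching) * (1.0 - transition) * (1.0 - position)


def best_label(evidence_by_attribute):
    """Pick the attribute whose (M, t, p) evidence yields the highest final score."""
    return max(
        evidence_by_attribute,
        key=lambda attr: final_score(*evidence_by_attribute[attr]),
    )


# Made-up evidence for an unmatched block such as "Mifflin": (M, t, p) per attribute.
evidence = {
    "Street": (0.0, 0.8, 0.6),
    "Phone": (0.0, 0.1, 0.05),
}

print(final_score(0.0, 0.5, 0.5))
print(best_label(evidence))
```

Since each factor enters as (1 - x), a single strong piece of evidence is enough to push the score close to 1, and a zero in one factor never cancels the others.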

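ONDUX (XIV)

Reinforcement

◦ Extraction Result

Regent Square $228,900 1028 Mifflin Ave.; 6 Bedrooms; 2 Bathrooms. 412-638-7273

<Neighborhood> : Regent Square (previously Street)
<Price> : $228,900
<No.> : 1028
<Street> : Mifflin (previously ???)
<Street> : Ave.
<Bed.> : 6 Bedrooms
<Bath.> : 2 Bathrooms
<Phone> : 412-638-7273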

ONDUX (XV)

Overview

[Figure: overview of ONDUX, with step markers 1, 2 and 3]

Experiments (I)

Setup

◦ We tested our proposed approach with several sources from 3 distinct domains:
  Addresses: BigBook, Restaurants [RISE]
  Bibliographic Data: CORA [Peng@IPM'06], PersonalBib [Mansuri@ICDE'06]
  Classified Ads: 7 distinct newspaper sites [Oliveira@SBBD'06]

◦ We limit the presentation to one experiment per domain. More in the paper.

Experiments (II)

Evaluation

◦ Metrics
  Precision, Recall and F-Measure
  T-Test for the statistical validation of the results

◦ Baselines: Conditional Random Fields (CRF)
  U-CRF (unsupervised method) [Zhao@SICDM'08]
  S-CRF (classical supervised method) [Peng@IPM'06]
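For reference, precision, recall and F-measure for one attribute can be computed from the set of extracted values and a reference set, as in the sketch below. Treating the evaluation as a set comparison per attribute is an assumption made here for illustration, and the sample values are made up.

```python
def precision_recall_f(extracted, reference):
    """Return (precision, recall, F-measure) for one attribute's extracted values."""
    extracted, reference = set(extracted), set(reference)
    correct = len(extracted & reference)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Made-up Street values: two of three extractions are correct, two references are missed.
extracted = {"Mifflin Ave.", "Morewood Ave.", "Regent Square"}
reference = {"Mifflin Ave.", "Morewood Ave.", "Harford Road", "Regent St."}

p, r, f = precision_recall_f(extracted, reference)
print(p, r, f)
```

The F-measure is the harmonic mean of the two, so it stays low whenever either precision or recall is low.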

Experiments (III)

Extraction Quality

◦ U-CRF results similar to Zhao@SICDM (validation)
◦ Dataset follows the single order assumption
◦ After Reinforcement, ONDUX achieved similar quality

Experiments (IV)

Extraction Quality

◦ S-CRF achieved better results than U-CRF due to the hand-labeled training
◦ CORA includes a variety of citation styles (conference, journal, books, etc.)
◦ In general, ONDUX outperformed the CRF models

Experiments (V)

Extraction Quality

◦ Due to the Matching phase and the PSM that is learned On-Demand, ONDUX achieves very high quality results
◦ U-CRF performed poorly (very heterogeneous dataset)

Experiments (VI)

Varying the number of terms common to test instances and the KB

◦ Determine how much the quality of the results depends on the overlap between the previously known data and the text input.

These experiments were conducted with the BigBook dataset.

Experiments (VII)

Varying the number of shared terms

◦ Even when the Matching phase yields poor quality, the PSM is able to increase ONDUX's quality in the Reinforcement step
◦ Starting with a batch of 500 input strings, ONDUX achieved high quality results once the overlap reached 500 terms

Experiments (VIII)

Varying the number of shared terms

◦ As the number of shared terms increases, the quality achieved by the Matching phase improves

Conclusions and Future Work (I)

New approach for information extraction, independent of the style of the data records

ONDUX
◦ Flexible: does not consider any particular style
◦ Unsupervised: does not require any human effort to create a training set
◦ On-Demand: ordering and positioning information is learned through the Matching phase

Conclusions and Future Work (II)

The proposed strategy achieves good precision and recall
◦ Small size of the Knowledge Base
◦ Comparison with the state of the art

Future work
◦ Investigate different matching functions
◦ Nested structures?

Acknowledgements

UFMG

Questions?

Experiments

Setup

Experiment          Dataset (records)   Source (records)
BigBook X BigBook   2000                2000
CORA X CORA         150                 350
Folha X Web Ads     500                 125
