Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano. Computer Science Department, Columbia University


Page 1: Modeling Query-Based Access to Text Databases

Modeling Query-Based Access to Text Databases

Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano

Computer Science Department

Columbia University

Page 2: Modeling Query-Based Access to Text Databases

Extracting Structured Information “Buried” in Text Documents

Example document: May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…

Information Extraction System (e.g., NYU’s Proteus)

Extracted tuples:

Date       DiseaseName      Location
Jan. 1995  Malaria          Ethiopia
July 1995  Mad Cow Disease  The U.K.
Feb. 1995  Pneumonia        The U.S.
May 1995   Ebola            Zaire

Page 3: Modeling Query-Based Access to Text Databases

Extracting All “Tuples” of a Relation from a Text Database

Naïve approach: feed every document to the information extraction system (Text Database → Information Extraction System → Extracted Tuples)

At 7 secs./document, Proteus takes over 8 days for 100K documents

Only a tiny fraction of documents contains tuples, so processing every document is inefficient

Many databases are not crawlable (scannable), but available only via a search engine

Search engines can help: efficiency and accessibility
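As a rough check of the figures above, the cost of scanning every document can be computed directly. This is a minimal sketch: the function names are illustrative, and the query budget used in the second call is a hypothetical number, not one from the talk.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def full_scan_days(num_documents, seconds_per_document):
    """Days needed to run the extraction system over every document."""
    return num_documents * seconds_per_document / SECONDS_PER_DAY


def query_based_days_upper_bound(num_queries, max_results, seconds_per_document):
    """Upper bound on extraction time when only retrieved documents are processed.

    Treats every retrieved document as distinct, which can only overestimate
    the work, and ignores the time spent on the queries themselves.
    """
    return full_scan_days(num_queries * max_results, seconds_per_document)


# Figures from the slide: 100K documents at 7 seconds each.
naive_days = full_scan_days(100_000, 7)  # a little over 8 days
# Hypothetical budget: 100 queries, up to 100 documents per query.
budget_days = query_based_days_upper_bound(100, 100, 7)
```

Under these assumed numbers the query-based run touches at most a tenth of the documents, which is where the efficiency argument comes from.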

Page 4: Modeling Query-Based Access to Text Databases

A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]

0 Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)

1 While seed has unprocessed tuple t

2 Retrieve up to MaxResults documents using query derived from t

3 Extract new tuples te from these documents

4 Augment seed with te

Potential problem: may run out of tuples (and queries), leaving an incomplete relation!

[Figure: seed set containing tuples t0, t1, t2]
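The steps of this strategy can be sketched as a simple loop. This is an illustration under stated assumptions, not the authors' implementation: `search` and `extract` stand in for the search engine and the extraction system, and the toy database, `toy_search`, and `toy_extract` are hypothetical.

```python
from collections import deque


def iterative_extraction(seed_tuples, search, extract, max_results=100):
    """Grow a relation from seed tuples by alternating querying and extraction."""
    relation = set(seed_tuples)
    unprocessed = deque(seed_tuples)
    seen_docs = set()
    while unprocessed:                      # step 1: some tuple is unprocessed
        t = unprocessed.popleft()
        for doc_id, doc in search(t)[:max_results]:   # step 2
            if doc_id in seen_docs:
                continue
            seen_docs.add(doc_id)
            for new_t in extract(doc):      # step 3
                if new_t not in relation:   # step 4: augment the seed
                    relation.add(new_t)
                    unprocessed.append(new_t)
    return relation


# Hypothetical toy database: each document is the list of tuples it mentions.
TOY_DB = {
    "d1": [("May 1995", "Ebola", "Zaire"), ("Jan. 1995", "Malaria", "Ethiopia")],
    "d2": [("Jan. 1995", "Malaria", "Ethiopia"), ("Feb. 1995", "Pneumonia", "The U.S.")],
    "d3": [("July 1995", "Mad Cow Disease", "The U.K.")],
}


def toy_search(t):
    """Return documents mentioning any attribute value of tuple t."""
    return [(doc_id, doc) for doc_id, doc in sorted(TOY_DB.items())
            if any(value in tup for tup in doc for value in t)]


def toy_extract(doc):
    """Stand-in for the extraction system: the toy document already lists its tuples."""
    return doc


found = iterative_extraction([("May 1995", "Ebola", "Zaire")], toy_search, toy_extract)
```

In this toy run the Mad Cow Disease tuple is never found because no query derived from the other tuples retrieves `d3`, which is the "incomplete relation" problem noted above.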

Page 5: Modeling Query-Based Access to Text Databases

Iterative Methods Sometimes (but not Always) “Succeed”

[Figure: two example runs starting from a seed, one labeled SUCCESS! and one labeled FAIL]

Can we predict if a query-based strategy will succeed?

Page 6: Modeling Query-Based Access to Text Databases

Model: Querying Graph

Tokens: tuple attributes <“May 1995”, “Ebola”, “Zaire”>

Each token (as a query) retrieves documents

Documents contain tokens

[Figure: querying graph linking tokens t1..t5 and documents d1..d5]
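One way to hold a querying graph in memory is as two edge maps, one per edge direction. The sketch below assumes a toy search engine in which a token query returns the documents containing that token, in document-id order, optionally truncated to `max_results`; a real engine's ranking would differ, and all names here are illustrative.

```python
from collections import defaultdict


def build_querying_graph(docs, max_results=None):
    """Build the two edge sets of a querying graph.

    docs: doc_id -> set of tokens the document contains.
    Returns (retrieves, contains):
      retrieves: token -> list of doc_ids returned when the token is the query
      contains:  doc_id -> set of tokens found in the document
    """
    contains = {d: set(tokens) for d, tokens in docs.items()}
    retrieves = defaultdict(list)
    for d in sorted(contains):          # toy ranking: document-id order
        for t in contains[d]:
            retrieves[t].append(d)
    if max_results is not None:
        retrieves = {t: ds[:max_results] for t, ds in retrieves.items()}
    return dict(retrieves), contains


toy_docs = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3"}}
retrieves, contains = build_querying_graph(toy_docs, max_results=1)
```

Note that with a result limit the "retrieves" edges are no longer just the inverse of the "contains" edges: here `t2` occurs in `d1` and `d2` but retrieves only `d1`.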

Page 7: Modeling Query-Based Access to Text Databases

Model: Reachability Graph

Edge t1 → t2: t1 retrieves document d1 that contains t2

t2, t3, and t4 are “reachable” from t1

[Figure: querying graph (tokens t1..t5, documents d1..d5) and the reachability graph over tokens t1..t5]
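The reachability graph can be derived from the querying graph by composing its two kinds of edges. A minimal sketch follows; the edge maps are hypothetical toy data (each query is assumed to return only its top document), chosen so that t2, t3, and t4 are reachable from t1 while t5 is not.

```python
def reachability_graph(retrieves, contains):
    """Edge t -> u whenever t retrieves some document that contains u (u != t)."""
    graph = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for u in contains.get(d, ()):
                if u != t:
                    graph[t].add(u)
                    graph.setdefault(u, set())
    return graph


def reachable_from(graph, start):
    """All tokens that can be reached from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


# Hypothetical querying graph: token -> retrieved docs, doc -> contained tokens.
toy_retrieves = {"t1": ["d1"], "t2": ["d2"], "t3": ["d3"], "t4": ["d3"], "t5": ["d5"]}
toy_contains = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"},
                "d3": {"t3", "t4"}, "d5": {"t5", "t1"}}

toy_graph = reachability_graph(toy_retrieves, toy_contains)
from_t1 = reachable_from(toy_graph, "t1")
```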

Page 8: Modeling Query-Based Access to Text Databases

Model: Connected Components

Core: strongly connected component

In: tokens not in Core, but from which Core is reachable

Out: tokens not in Core, but reachable from Core

[Figure: In, Core, and Out illustrated on tokens t1..t4]

Page 9: Modeling Query-Based Access to Text Databases

Components of Reachability Graph

[Figure: reachability graph with several In / Core (strongly connected) / Out structures and token t0]

How many tokens are in the largest Core + Out?
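For a small graph, the question can be answered by brute force. The sketch below finds the largest strongly connected component by intersecting forward and backward reachability, which is simple but quadratic; it is an illustration on a hypothetical toy graph, not the method used in the talk.

```python
def _reach(graph, sources):
    """Nodes reachable from any source, sources included."""
    seen, stack = set(sources), list(sources)
    while stack:
        u = stack.pop()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def bow_tie(graph):
    """Split a directed graph {node: successors} into (largest Core, In, Out)."""
    nodes = set(graph) | {v for succ in graph.values() for v in succ}
    reverse = {n: set() for n in nodes}
    for u, succ in graph.items():
        for v in succ:
            reverse[v].add(u)
    core, assigned = set(), set()
    for n in sorted(nodes):
        if n in assigned:
            continue
        # Strongly connected component of n: reachable from n and reaching n.
        scc = _reach(graph, [n]) & _reach(reverse, [n])
        assigned |= scc
        if len(scc) > len(core):
            core = scc
    out = _reach(graph, core) - core
    in_ = _reach(reverse, core) - core
    return core, in_, out, nodes


def core_plus_out_fraction(graph):
    """Fraction of all tokens that lie in the largest Core or its Out."""
    core, _, out, nodes = bow_tie(graph)
    return (len(core) + len(out)) / len(nodes)


# Hypothetical graph: t0 -> t1 <-> t2 -> t3, plus a separate edge t4 -> t5.
toy = {"t0": {"t1"}, "t1": {"t2"}, "t2": {"t1", "t3"}, "t4": {"t5"}}
core, in_, out, _ = bow_tie(toy)
fraction = core_plus_out_fraction(toy)
```

Here Core is {t1, t2}, In is {t0}, Out is {t3}, and Core + Out covers 3 of the 6 tokens.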

Page 10: Modeling Query-Based Access to Text Databases

Model: Power-law Graphs

Conjecture: Degree distribution in the reachability graph follows power-law:

#(nodes with degree k) ≈ O(k^{-β})

(i.e., many nodes with small degree, a few nodes with large degree)

Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
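A quick way to eyeball whether a degree distribution looks like a power law is to fit a line to the histogram on log-log axes. The sketch below does this with ordinary least squares; this is a crude estimator chosen for brevity, using outdegree as the degree (an assumption), and not necessarily how the conjecture was checked in the talk.

```python
import math
from collections import Counter


def degree_histogram(graph):
    """Map outdegree k -> number of nodes with that outdegree.

    graph: {node: set of successors}; nodes that appear only as successors
    must also be present as keys (with an empty set) to be counted.
    """
    return Counter(len(succ) for succ in graph.values())


def fit_power_law_exponent(hist):
    """Estimate beta in count(k) ~ k^(-beta) by least squares on log-log points."""
    pts = [(math.log(k), math.log(c)) for k, c in hist.items() if k > 0 and c > 0]
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    return -sxy / sxx   # slope is -beta


# Synthetic histogram that follows count(k) = 1000 * k^(-2) exactly.
synthetic = {k: 1000.0 * k ** -2.0 for k in range(1, 11)}
beta = fit_power_law_exponent(synthetic)
```

On real data the tail of the histogram is noisy, so a fit like this is better treated as a sanity check than as a precise estimate of β.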

Page 11: Modeling Query-Based Access to Text Databases

Model: Reachability

Reachability: fraction of tokens in the largest Core + Out

(The power law allows us to ignore small components)

[Figure: In, Core (strongly connected), and Out, with token t0]

Page 12: Modeling Query-Based Access to Text Databases

Estimating Reachability

In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1

Graph theory results predict the relative size of C_G [Chung and Lu, Annals of Combinatorics, 2002]

Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph

[Plot: relative size of giant component (lower bound), 0 to 1, versus average outdegree, 0 to 10]
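To get a feel for the d > 1 threshold, the sketch below uses a stand-in model: the textbook result for undirected Erdős–Rényi random graphs, where the giant-component fraction s solves s = 1 - exp(-d·s). This is deliberately not the Chung and Lu bound for power-law graphs cited above, whose formula is not reproduced here; it only illustrates how a mean degree maps to a component size.

```python
import math


def er_giant_component_fraction(d, iterations=200):
    """Giant-component fraction in an Erdős–Rényi graph with mean degree d.

    Solves s = 1 - exp(-d * s) by fixed-point iteration starting from s = 1.
    Stand-in model only: not the power-law result used in the talk.
    """
    if d <= 1.0:
        return 0.0          # no giant component at or below the threshold
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-d * s)
    return s


# Fractions for a few mean degrees, including one below the threshold.
fractions = {d: er_giant_component_fraction(d) for d in (0.5, 1.5, 2.0, 3.0)}
```

As in the plot described above, the fraction is zero up to d = 1 and then rises quickly toward 1 as the average degree grows.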

Page 13: Modeling Query-Based Access to Text Databases

Estimating Reachability Using Sampling (estimate average outdegree)

1. Choose S random seed tokens

2. Query the database for seed tokens

3. Extract tokens to compute the reachability graph edges for seed tokens

4. Estimate d as average outdegree of seed tokens

5. Estimate reachability

[Figure: sampled querying graph (tokens t1..t5, documents d1..d5) and the reachability edges found for the seed tokens; d = 1.5]
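Steps 1 to 4 of the sampling procedure can be sketched as follows. The `search` and `extract` callables, the toy documents, and all parameter names are hypothetical; step 5 would feed the estimated d into the giant-component size result, which is not reproduced here.

```python
import random


def estimate_average_outdegree(tokens, search, extract, sample_size,
                               max_results=100, rng=None):
    """Estimate the average outdegree of the reachability graph by sampling."""
    rng = rng or random.Random()
    population = sorted(tokens)
    seeds = rng.sample(population, min(sample_size, len(population)))  # step 1
    total = 0
    for t in seeds:
        neighbors = set()
        for doc in search(t)[:max_results]:   # step 2: query for the seed
            neighbors.update(extract(doc))    # step 3: tokens in retrieved docs
        neighbors.discard(t)                  # no self-edges
        total += len(neighbors)
    return total / len(seeds)                 # step 4


# Hypothetical toy database: doc_id -> tokens contained in the document.
TOY_DOCS = {"d1": {"t1", "t2"}, "d2": {"t2", "t3"}, "d3": {"t3", "t4"}}


def toy_search(token):
    """Return the documents containing the token, in doc-id order."""
    return [TOY_DOCS[d] for d in sorted(TOY_DOCS) if token in TOY_DOCS[d]]


def toy_extract(doc):
    """Stand-in extractor: the toy document is already a set of tokens."""
    return doc


all_tokens = {"t1", "t2", "t3", "t4"}
exact = estimate_average_outdegree(all_tokens, toy_search, toy_extract, sample_size=4)
sampled = estimate_average_outdegree(all_tokens, toy_search, toy_extract,
                                     sample_size=2, rng=random.Random(0))
```

With every token used as a seed, the estimate equals the true average outdegree of this toy graph, which happens to be 1.5; with fewer seeds it varies from sample to sample.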

Page 14: Modeling Query-Based Access to Text Databases

Experimental Results: Verifying the “Power-law” Conjecture

Task 1: NYTDiseaseOutbreaks (Date, Disease, Location)

New York Times, 1995: |T| = 8,859, |D| = 137,000

Date       Disease          Location
Jan. 1995  Malaria          Ethiopia
June 1995  Ebola            Zaire
July 1995  Mad Cow Disease  The U.K.
Feb. 1995  Pneumonia        The U.S.
…          …                …

Follows the power-law distribution

Page 15: Modeling Query-Based Access to Text Databases

Experimental Results: Estimating Reachability by Sampling

[Plot: reachability (0 to 1) versus MaxResults (MR = 1, 10, 50, 100, 200, 1000), with series for S = 10, S = 50, S = 100, S = 200, and the real graph]

Approximate reachability is estimated with S = 50 tokens

Reachability correctly predicts the performance of the query-based information extraction strategy

If the estimated reachability is too low, we can switch to a different strategy early

Page 16: Modeling Query-Based Access to Text Databases

Future Work

What if we have only limited access to the database?
– Limit on number of queries
– Limit on number of documents retrieved

Not modelled by the reachability graph, but can be modelled using properties of the querying graph

[Figure: querying graph linking tokens t1..t5 and documents d1..d5]

Page 17: Modeling Query-Based Access to Text Databases

Summary

Presented graph model for query-based algorithms:
– for Information Extraction
– for Constructing Database Content Summaries

Showed that querying and reachability graphs can be used to analyze such algorithms

Presented single reachability metric to predict success of iterative query-based algorithms

Presented and verified conjecture that reachability graphs for these algorithms follow the power law

Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs