1
Modeling Query-Based Access to Text Databases
Eugene Agichtein, Panagiotis Ipeirotis, Luis Gravano
Computer Science Department
Columbia University
2
Extracting Structured Information “Buried” in Text Documents
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire, is finding itself hard pressed to cope with the crisis…
Information Extraction System (e.g., NYU’s Proteus)
Date        DiseaseName       Location
Jan. 1995   Malaria           Ethiopia
July 1995   Mad Cow Disease   The U.K.
Feb. 1995   Pneumonia         The U.S.
May 1995    Ebola             Zaire
3
Extracting All “Tuples” of a Relation from a Text Database
Naïve approach: feed every document to information extraction system. At 7 secs./document, Proteus takes over 8 days for 100K documents
Only a tiny fraction of documents contains tuples, so processing every document is inefficient
Many databases are not crawlable (scannable), but available only via a search engine.
[Figure: Text Database → Information Extraction System → Extracted Tuples]
Search engines can help: efficiency and accessibility
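As a quick check of the cost quoted for the naive approach, the arithmetic can be reproduced in a few lines. The constant names are illustrative; the two input figures (7 seconds per document, 100K documents) come from the slide.

```python
# Inputs quoted on the slide; the variable names are illustrative.
SECONDS_PER_DOCUMENT = 7
NUM_DOCUMENTS = 100_000
SECONDS_PER_DAY = 24 * 60 * 60

total_seconds = SECONDS_PER_DOCUMENT * NUM_DOCUMENTS
total_days = total_seconds / SECONDS_PER_DAY

print(f"{total_seconds:,} seconds is about {total_days:.1f} days")
# prints "700,000 seconds is about 8.1 days"
```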
4
A Query-Based Strategy for Information Extraction [Agichtein and Gravano, ICDE 2003]
0 Start with some seed tuples (e.g., <“May 1995”, “Ebola”, “Zaire”>)
1 While seed has unprocessed tuple t
2   Retrieve up to MaxResults documents using query derived from t
3   Extract new tuples te from these documents
4   Augment seed with te
Potential problem: May run out of tuples (and queries), resulting in an incomplete relation!
[Figure: seed containing tuples t0, t1, t2]
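The iterative strategy on this slide can be sketched as follows. This is a minimal illustration, not the authors' implementation: `search` and `extract` are hypothetical callables standing in for the search engine and the information extraction system, and the toy documents are invented to show how the loop can stop early.

```python
from collections import deque

def iterative_extraction(seed, search, extract, max_results):
    """Grow a relation from seed tuples (steps 0 to 4 on the slide).

    search(t, max_results) -> documents matching a query derived from t
    extract(doc)           -> tuples found in doc
    Both callables are hypothetical stand-ins, not a real API.
    """
    relation = set(seed)                       # step 0: start from the seed
    unprocessed = deque(seed)
    while unprocessed:                         # step 1: unprocessed tuple t
        t = unprocessed.popleft()
        for doc in search(t, max_results):     # step 2: retrieve documents
            for te in extract(doc):            # step 3: extract tuples
                if te not in relation:         # step 4: augment the seed
                    relation.add(te)
                    unprocessed.append(te)
    return relation

# Toy database (invented for illustration): document -> tuples it contains.
DOCS = {
    "d1": [("May 1995", "Ebola", "Zaire"), ("Jan. 1995", "Malaria", "Ethiopia")],
    "d2": [("Jan. 1995", "Malaria", "Ethiopia")],
    "d3": [("Feb. 1995", "Pneumonia", "The U.S.")],
}

def toy_search(t, max_results):
    hits = [d for d, tuples in sorted(DOCS.items()) if t in tuples]
    return hits[:max_results]

def toy_extract(doc):
    return DOCS[doc]

found = iterative_extraction(
    [("May 1995", "Ebola", "Zaire")], toy_search, toy_extract, max_results=10
)
```

In this toy run the Pneumonia tuple is never found, because no document retrieved from the seed mentions it. That is the "incomplete relation" problem in miniature.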
5
Iterative Methods Sometimes (but not Always) “Succeed”
[Figure: two runs starting from a seed, one labeled SUCCESS! and one labeled FAIL]
Can we predict if a query-based strategy will succeed?
6
Model: Querying Graph
Tokens: Tuple attributes, e.g., <“May 1995”, “Ebola”, “Zaire”>
Each Token (as query) retrieves documents
Documents contain tokens
[Figure: bipartite querying graph between Tokens t1 to t5 and Documents d1 to d5]
7
Model: Reachability Graph
t2, t3, and t4 are “reachable” from t1
t1 retrieves document d1 that contains t2
[Figure: querying graph between Tokens t1 to t5 and Documents d1 to d5, next to the derived reachability graph over t1 to t5]
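One way to derive a reachability graph from a querying graph is sketched below, assuming an edge from token t to token u whenever t retrieves a document that contains u. The dictionary representation and the toy data are assumptions made for illustration and do not reproduce the exact edges of the slide's figure.

```python
def reachability_graph(retrieves, contains):
    """Build token -> set of directly reachable tokens.

    retrieves: token -> set of documents the token retrieves as a query
    contains:  document -> set of tokens found in that document
    """
    edges = {t: set() for t in retrieves}
    for t, docs in retrieves.items():
        for d in docs:
            for u in contains.get(d, ()):
                if u != t:                      # ignore self-loops
                    edges[t].add(u)
                    edges.setdefault(u, set())  # make sure u is a node too
    return edges

# Toy querying graph (invented for illustration).
retrieves = {
    "t1": {"d1", "d2"},
    "t2": {"d3"},
    "t3": {"d3"},
    "t4": {"d4"},
    "t5": {"d5"},
}
contains = {
    "d1": {"t1", "t2"},
    "d2": {"t1", "t3"},
    "d3": {"t2", "t3", "t4"},
    "d4": {"t4"},
    "d5": {"t5"},
}

edges = reachability_graph(retrieves, contains)
```

Here `t1` points to `t2` and `t3`, both of which point to `t4`, while `t5` has no outgoing edges and cannot be reached from any other token.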
8
Model: Connected Components
Core: strongly connected
Out: Tokens not in Core, but reachable from Core
In: Tokens not in Core, but from which Core is reachable
[Figure: reachability graph over t1 to t4 partitioned into In, Core, and Out]
9
Components of Reachability Graph
[Figure: several components of the reachability graph, each with In, Core (strongly connected), and Out regions; seed token t0]
How many tokens are in the largest Core + Out?
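For a small graph, the question above can be answered directly. The sketch below finds the largest strongly connected set (Core) by intersecting forward and backward reachability from each node, then derives Out and In from it. It is a simple quadratic-time illustration with invented toy data, not an efficient algorithm for large graphs.

```python
def reachable_from(graph, start):
    """All nodes reachable from start, including start itself."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def reverse(graph):
    """Same nodes, every edge flipped."""
    rev = {node: set() for node in graph}
    for node, targets in graph.items():
        for t in targets:
            rev.setdefault(t, set()).add(node)
    return rev

def core_out_in(graph):
    """Return (core, out, in) for the largest strongly connected set."""
    rev = reverse(graph)
    best = (set(), set(), set())
    for node in sorted(rev):
        fwd = reachable_from(graph, node)
        bwd = reachable_from(rev, node)
        core = fwd & bwd            # nodes mutually reachable with node
        if len(core) > len(best[0]):
            best = (core, fwd - core, bwd - core)
    return best

# Toy reachability graph (invented for illustration).
graph = {
    "t0": {"t1"},
    "t1": {"t2"},
    "t2": {"t1", "t3"},
    "t3": set(),
    "t4": set(),
}
core, out, in_ = core_out_in(graph)
core_plus_out = len(core | out)
```

In the toy graph, Core is {t1, t2}, Out is {t3}, In is {t0}, and t4 is isolated, so three of the five tokens lie in Core + Out.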
10
Model: Power-law Graphs
Conjecture: Degree distribution in the reachability graph follows power-law:
#(nodes with degree k) ≈ O(k^(-β))
(i.e., many nodes with small degree, a few nodes with large degree)
Power-law random graphs are expected to have at most one giant connected component (~Core+In+Out). Other connected components are small.
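One rough way to check the conjecture on observed degrees is to fit a straight line to the degree histogram on a log-log scale; the negated slope is an estimate of β. The sketch below uses ordinary least squares, which is a crude estimator chosen here for brevity and is not necessarily what the authors used. The degree data is synthetic.

```python
import math
from collections import Counter

def fit_power_law_exponent(degrees):
    """Estimate beta in count(k) ~ k^(-beta) by least squares on log-log data."""
    counts = Counter(k for k in degrees if k > 0)   # log(0) is undefined
    if len(counts) < 2:
        raise ValueError("need at least two distinct positive degrees")
    xs = [math.log(k) for k in counts]
    ys = [math.log(c) for c in counts.values()]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx                                # slope is -beta

# Synthetic degrees whose counts are exactly 1000 * k^(-2).
degrees = [1] * 1000 + [2] * 250 + [5] * 40 + [10] * 10
beta = fit_power_law_exponent(degrees)
```

On this synthetic input the fit recovers β = 2 up to floating-point error. Real degree data is noisy in the tail, so the estimate should be read as indicative only.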
11
Model: Reachability
Reachability: fraction of tokens in the largest Core + Out
(The power law allows us to ignore small components)
[Figure: In, Core (strongly connected), and Out regions; seed token t0]
12
Estimating Reachability
In a power-law random graph G, a giant component C_G emerges if the average outdegree d > 1
Graph theory results predict the relative size of C_G
Estimate reachability as the relative size of C_G, which reduces to estimating the average outdegree of the reachability graph
[Chung and Lu, Annals of Combinatorics, 2002 ]
[Plot: relative size of giant component (lower bound), from 0 to 1, versus average outdegree, from 0 to 10]
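To illustrate the threshold at d = 1, the sketch below uses a stand-in model: the classical Erdős–Rényi result for undirected random graphs with mean degree d, where the giant-component fraction s satisfies s = 1 - exp(-d * s). This is explicitly not the Chung and Lu bound for power-law graphs that the slide cites; it only shows the same qualitative behavior, with no giant component below the threshold and a rapidly growing one above it.

```python
import math

def giant_component_fraction(d, iterations=1000):
    """Fixed-point iteration for s = 1 - exp(-d * s) (Erdős–Rényi stand-in).

    Starting from s = 1, the iteration converges to the largest solution:
    zero when d <= 1 (slowly at exactly d = 1), positive when d > 1.
    """
    s = 1.0
    for _ in range(iterations):
        s = 1.0 - math.exp(-d * s)
    return s

below = giant_component_fraction(0.5)   # below the threshold
above = giant_component_fraction(2.0)   # above the threshold

for d in (0.5, 1.5, 2.0, 4.0):
    print(f"d = {d}: giant component fraction = {giant_component_fraction(d):.3f}")
```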
13
Estimating Reachability Using Sampling (estimate average outdegree)
1. Choose S random seed tokens
2. Query the database for seed tokens
3. Extract tokens to compute the reachability graph edges for seed tokens
4. Estimate d as average outdegree of seed tokens
5. Estimate reachability
[Figure: querying graph between Tokens t1 to t5 and Documents d1 to d5, with the sampled seed tokens and their reachability edges; estimated d = 1.5]
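Steps 1 to 4 above can be sketched as follows. The `search` and `extract_tokens` callables are hypothetical stand-ins for the search interface and the extraction system, and the toy querying graph is invented. Step 5, mapping the estimated d to a reachability value, depends on the graph-theoretic result cited earlier and is not implemented here.

```python
import random

def estimate_average_outdegree(tokens, search, extract_tokens,
                               sample_size, max_results, rng=None):
    """Estimate d from the outdegrees of a random sample of seed tokens."""
    if not tokens:
        raise ValueError("need at least one token to sample from")
    rng = rng or random.Random()
    # Step 1: choose S random seed tokens.
    seeds = rng.sample(sorted(tokens), min(sample_size, len(tokens)))
    total = 0
    for t in seeds:
        neighbours = set()
        for doc in search(t, max_results):        # step 2: query the database
            neighbours.update(extract_tokens(doc))  # step 3: extract tokens
        neighbours.discard(t)                     # a token is not its own neighbour
        total += len(neighbours)
    return total / len(seeds)                     # step 4: average outdegree

# Toy querying graph (invented for illustration).
RETRIEVES = {"t1": {"d1", "d2"}, "t2": {"d3"}, "t3": {"d3"},
             "t4": {"d4"}, "t5": {"d5"}}
CONTAINS = {"d1": {"t1", "t2"}, "d2": {"t1", "t3"},
            "d3": {"t2", "t3", "t4"}, "d4": {"t4"}, "d5": {"t5"}}

def toy_search(token, max_results):
    return sorted(RETRIEVES[token])[:max_results]

def toy_extract_tokens(doc):
    return CONTAINS[doc]

# Sampling every token gives the exact average outdegree of the toy graph.
d_hat = estimate_average_outdegree(
    set(RETRIEVES), toy_search, toy_extract_tokens,
    sample_size=5, max_results=10, rng=random.Random(0),
)
```

With a smaller `sample_size` the estimate varies with the sample, which is the trade-off between the number of queries issued and the accuracy of d.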
14
Experimental Results: Verifying the “Power-law” Conjecture
Task 1: NYTDiseaseOutbreaks (Date, Disease, Location)
New York Times, 1995
|T| = 8,859, |D| = 137,000
Date        Disease           Location
Jan. 1995   Malaria           Ethiopia
June 1995   Ebola             Zaire
July 1995   Mad Cow Disease   The U.K.
Feb. 1995   Pneumonia         The U.S.
…           …                 …
Follows the power-law distribution
15
Experimental Results: Estimating Reachability by Sampling
[Plot: reachability, from 0 to 1, versus MaxResults (MR = 1, 10, 50, 100, 200, 1000); series: S = 10, S = 50, S = 100, S = 200, Real Graph]
Approximate reachability is estimated with S = 50 tokens
Reachability correctly predicts the performance of the query-based information extraction strategy
If the estimated reachability is too low, we can switch to a different strategy early
16
Future Work
What if we have only limited access to the database?
– Limit on number of queries
– Limit on number of documents retrieved
Not modelled by reachability graph, but can be modelled using properties of querying graph
[Figure: querying graph between Tokens t1 to t5 and Documents d1 to d5]
17
Summary
Presented graph model for query-based algorithms:
– for Information Extraction
– for Constructing Database Content Summaries
Showed that querying and reachability graphs can be used to analyze such algorithms
Presented single reachability metric to predict success of iterative query-based algorithms
Presented and verified conjecture that reachability graphs for these algorithms follow the power law
Presented efficient techniques for estimating reachability by exploiting properties of power-law random graphs