1
Scalable Information Extraction
Eugene Agichtein
2
Example: Angina treatments
Information sources: PDR, MedLine, web search results, structured databases (e.g., drug info, WHO drug adverse effects DB, etc.), medical reference and literature.
Example searches:
guideline for unstable angina
unstable angina management
herbal treatment for angina pain
medications for treating angina
alternative treatment for angina pain
treatment for angina
angina treatments
3
Research Goal: accurate, intuitive, and efficient access to knowledge in unstructured sources.
Approaches:
Information Retrieval: retrieve the relevant documents or passages; question answering
Human Reading: construct domain-specific "verticals" (MedLine)
Machine Reading: extract entities and relationships; build a network of relationships (Semantic Web)
4
Semantic Relationships “Buried” in Unstructured Text
Sources: web, newsgroups, web logs; text databases (PubMed, CiteSeer, etc.); newspaper archives
Corporate mergers, succession, location, terrorist attacks (the Message Understanding Conferences, MUC)
…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…
RecommendedTreatment relation (Drug, Condition):
statins → recurrent myocardial infarction
statins → strokes
statins → unstable angina pectoris
5
What Structured Representation Can Do for You:
… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for the Semantic Web
Large Text Collection → Structured Relation
6
Challenges in Information Extraction
Portability: reduce the effort to tune for new domains and tasks (MUC systems: experts would take 8-12 weeks to tune)
Scalability, efficiency, access: enable information extraction over large collections (1 sec/document × 5 billion docs = 158 CPU years)
Approach: learn from data ("bootstrapping"):
Snowball: partially supervised information extraction
Querying large text databases for efficient information extraction
7
Outline
Snowball: partially supervised information extraction (overview and key results)
Effective retrieval algorithms for information extraction (in detail)
Current: mining user behavior for web search
Future work
8
The Snowball System: Overview
[Diagram: Text Database → Snowball → extracted relation:]
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... ... ...
9
Snowball: Getting User Input
User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.
[Diagram, the Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → (repeat).]
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
10
Can use any full-text search engine.
Snowball: Finding Example Occurrences
Search Engine
Text Database
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp
The Armonk-based IBM introduced a new line…
Change of guard at IBM Corporation’s headquarters near Armonk, NY ...
11
Named entity taggers can recognize Dates, People, Locations, Organizations, … (e.g., MITRE's Alembic, IBM's Talent, LingPipe, …).
Snowball: Tagging Entities
Computer servers at Microsoft ’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp
The Armonk -based IBM introduced a new line…
Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...
12
Snowball: Extraction Patterns
General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2
Acceptor instantiations:
String match (accepts the string "'s headquarters in")
Vector-space (≈ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
Classifier (estimate P(T = valid | 's, headquarters, in))
Computer servers at Microsoft’s headquarters in Redmond…
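For concreteness, here is a minimal Python sketch of how a vector-space acceptor might score a candidate context against a learned pattern context, assuming contexts are sparse term-weight dictionaries (the names and weights are illustrative, not Snowball's actual implementation):

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Middle-context vector learned from "Microsoft's headquarters in Redmond"
pattern_mid = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
# Middle context of a new candidate occurrence
candidate_mid = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}

print(cosine(pattern_mid, candidate_mid))  # high similarity: the acceptor fires
```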
13
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms:
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
2. Cluster similar occurrences.
14
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids:
ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
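A rough sketch of the clustering step, assuming a simple single-pass scheme with a similarity threshold and running-average centroids (the threshold and data structure are illustrative; this reuses the `cosine` helper sketched above):

```python
def add_to_cluster(clusters, occ, threshold=0.7):
    """Assign an occurrence vector to the closest cluster centroid,
    or start a new cluster if none is similar enough."""
    best, best_sim = None, threshold
    for cluster in clusters:
        sim = cosine(cluster["centroid"], occ)
        if sim >= best_sim:
            best, best_sim = cluster, sim
    if best is None:
        clusters.append({"centroid": dict(occ), "members": [occ]})
        return
    best["members"].append(occ)
    n = len(best["members"])
    # Update the centroid as the running average of member term weights.
    for term in set(best["centroid"]) | set(occ):
        best["centroid"][term] = ((n - 1) * best["centroid"].get(term, 0.0)
                                  + occ.get(term, 0.0)) / n
```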
15
Snowball: Extracting New Tuples
Match tagged text fragments against the patterns.
Example: "Google's new headquarters in Mountain View are …"
Candidate occurrence: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION → Match = 0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION → Match = 0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION → Match = 0
16
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
IBM, Armonk, reported… → Positive
Intel, Santa Clara, introduced... → Positive
"Bet on Microsoft", New York-based analyst Jane Smith said... → Negative
P4: ORGANIZATION {<, 1>} LOCATION
Current seed tuples:
Organization Headquarters
IBM Armonk
Intel Santa Clara
Microsoft Redmond
17
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:
Conf(T) = 1 - ∏i (1 - Conf(Pi) · Match(T, Pi))
A tuple has high confidence if generated by high-confidence patterns.
Example: T = <3Com, Santa Clara>, matched by two patterns:
P4: ORGANIZATION {<, 1>} LOCATION with Conf(P4) = 0.66, Match = 0.4
P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION with Conf(P3) = 0.95, Match = 0.8
Conf(T) = 1 - (1 - 0.95 · 0.8)(1 - 0.66 · 0.4) ≈ 0.83
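Both confidence estimates are easy to express directly; a minimal sketch (the 0.83 above reflects the slide's rounding):

```python
def pattern_confidence(positive, negative):
    """Conf(P) = Positive / Total over the pattern's matches of seed tuples."""
    total = positive + negative
    return positive / total if total else 0.0

def tuple_confidence(matches):
    """Conf(T) = 1 - prod_i(1 - Conf(P_i) * Match(T, P_i))."""
    prod = 1.0
    for p_conf, match in matches:
        prod *= 1.0 - p_conf * match
    return 1.0 - prod

print(pattern_confidence(2, 1))                      # 0.66...
print(tuple_confidence([(0.95, 0.8), (0.66, 0.4)]))  # ~0.82
```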
18
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... ... ...
Keep only high-confidence tuples for the next iteration.
19
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
Start the new iteration with the expanded example set. Iterate until no new tuples are extracted.
20
Pattern-Tuple Duality
A "good" tuple: extracted by "good" patterns (tuple weight ≈ goodness)
A "good" pattern: generated by "good" tuples, and extracts "good" new tuples (pattern weight ≈ goodness)
Edge weight: match/similarity of the tuple context to the pattern
21
How to Set Node Weights
Constraint violation (from before):
Conf(P) = log(Pos) · Pos / (Pos + Neg)
Conf(T) = 1 - ∏i (1 - Conf(Pi) · Match(T, Pi))
HITS [Hassan et al., EMNLP 2006]: Conf(P) = ∑ Conf(T); Conf(T) = ∑ Conf(P)
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]: unknown tuples = Neg; compute Conf(P), Conf(T); iterate
22
Snowball: EM-based Pattern Evaluation
23
Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
"Hide" the labels for some seed tuples (the "spies")
Iterate the EM algorithm to convergence on tuple/pattern confidence values
Set the confidence threshold t so that 90% of the spy tuples score above t
Re-initialize Snowball using the new seed tuples
Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
...
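A sketch of the threshold-selection step, assuming confidences are kept in a dict keyed by tuple and that a given fraction of the spy tuples must score above the threshold (all names and values are illustrative):

```python
def emspy_threshold(conf, spies, coverage=0.9):
    """Pick a confidence threshold t such that `coverage` of the
    hidden 'spy' seed tuples score above t."""
    spy_scores = sorted(conf[s] for s in spies)
    cutoff = int((1.0 - coverage) * len(spy_scores))
    return spy_scores[cutoff]

conf = {("IBM", "Armonk"): 0.8, ("Intel", "Santa Clara"): 0.9,
        ("3DO", "Redwood City"): 0.7, ("157th Street", "Manhattan"): 0.52}
spies = [("IBM", "Armonk"), ("Intel", "Santa Clara")]
t = emspy_threshold(conf, spies)
new_seeds = [x for x, c in conf.items() if c >= t]  # re-initialize Snowball
```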
24
Adapting Snowball for New Relations
Large parameter space:
Initial seed tuples (randomly chosen, multiple runs)
Acceptor features: words, stems, n-grams, phrases, punctuation, POS
Feature selection techniques: OR, NB, Freq, "support", combinations
Feature weights: TF*IDF, TF, TF*NB, NB
Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
Estimate operating parameters based on occurrences of the seed tuples
Run cross-validation on hold-out sets of seed tuples for optimal performance
Discard seed occurrences that do not have close "neighbors"
25
Example Task 1: DiseaseOutbreaks
Proteus: 0.409; Snowball: 0.415
SDM 2006
26
Example Task 2: Bioinformatics, a.k.a. mining the "bibliome"
100,000+ gene and protein synonyms extracted from 50,000+ journal articles
Approximately 40% of the confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
ISMB 2003
“APO-1, also known as DR6…”“MEK4, also called SEK1…”
27
Snowball Used in Various Domains
News: NYT, WSJ, AP [DL'00, SDM'06]
CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
Medical literature: PDRHealth, Micromedex… [Thesis]
AdverseEffects, DrugInteractions, RecommendedTreatments
Biological literature: GeneWays corpus [ISMB’03]
Gene and Protein Synonyms
28
Limits of Bootstrapping for Extraction
The task is "easy" when context term distributions diverge from the background.
Quantify this as relative entropy (Kullback-Leibler divergence).
After calibration, the metric predicts whether bootstrapping is likely to work.
KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )
[Figure: term frequencies (0-0.07) in the context vs. background language models for terms such as "the", "to", "and", "said", "'s", "company", "mrs", "won", "president".]
CIKM 2005
President George W Bush’s three-day visit to India
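The divergence itself is a one-liner; a small sketch with toy, partial (unsmoothed) language models just to fix the notation:

```python
from math import log

def kl_divergence(lm_c, lm_bg):
    """KL(LM_C || LM_BG) = sum over w of LM_C(w) * log(LM_C(w) / LM_BG(w)).
    Assumes LM_BG(w) > 0 wherever LM_C(w) > 0 (i.e., a smoothed background)."""
    return sum(p * log(p / lm_bg[w]) for w, p in lm_c.items() if p > 0)

# Toy fragments of a context model vs. a background model (not full distributions)
lm_context = {"the": 0.05, "said": 0.02, "headquarters": 0.04, "based": 0.03}
lm_background = {"the": 0.06, "said": 0.03, "headquarters": 0.001, "based": 0.002}
print(kl_divergence(lm_context, lm_background))  # large value: contexts diverge
```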
29
Few Relations Cover Common Questions
25 relations cover > 50% of question types; 5 relations cover > 55% of question instances
SIGIR 2005
Relation Types(%) Instances(%)
<person> discovers <concept> 7.7 2.9
<person> has position <concept> 5.6 4.6
<location> has location <location> 5.2 1.5
<person> known for <concept> 4.7 1.7
<event> has date <date> 4.1 0.9
30
Outline
Snowball, a domain-independent, partially supervised information extraction system
Retrieval algorithms for scalable information extraction
Current: mining user behavior for web search
Future work
31
Extracting a Relation from a Large Text Database
[Diagram: Text Database → Information Extraction System → Structured Relation.]
Brute-force approach: feed all docs to the information extraction system (expensive for large collections)
Often only a tiny fraction of the documents are useful
Many databases are not crawlable
Often a search interface is available, with an existing keyword index
How to identify "useful" documents?
32
Accessing Text DBs via Search Engines
[Diagram: Text Database → Search Engine → Information Extraction System → Structured Relation.]
Search engines impose limitations:
Limit on documents retrieved per query
Support simple keywords and phrases
Ignore "stopwords" (e.g., "a", "is")
33
QXtract: Querying Text Databases for Robust Scalable Information EXtraction
User-provided seed tuples:
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
[Diagram: seed tuples → Query Generation → queries → Search Engine over the Text Database → promising documents → Information Extraction System → extracted relation:]
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
Problem: learn keyword queries to retrieve "promising" documents.
34
Learning Queries to Retrieve Promising Documents
1. Get document sample with “likely negative” and “likely positive” examples.
2. Label sample documents using information extraction system as “oracle.”
3. Train classifiers to “recognize” useful documents.
4. Generate queries from classifier model/rules.
[Diagram: user-provided seed tuples → seed sampling over the text database (via the search engine) → documents labeled +/- by the information extraction system → classifier training → query generation → queries.]
35
Training Classifiers to Recognize “Useful” Documents
Document features: words.
D1 (+): disease, reported, epidemic, expected, area
D2 (+): virus, reported, expected, infected, patients
D3 (-): products, made, used, exported, far
D4 (-): past, old, homerun, sponsored, event
Learned models:
Ripper: disease AND reported => USEFUL
SVM: virus 3, infected 2, sponsored -1
Okapi (IR): disease, infected, reported, virus, epidemic, products, used, far, exported
36
Generating Queries from Classifiers
Ripper: disease AND reported => USEFUL → query [disease AND reported]
SVM: virus 3, infected 2, sponsored -1 → queries [virus], [infected]
Okapi (IR): top-ranked terms disease, infected, reported, virus, epidemic, … → queries [epidemic], [virus]
QCombined: [disease AND reported], [epidemic], [virus], [virus infected]
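One way to realize this mapping in code; a sketch that turns conjunctive rules and the top positively weighted classifier terms into keyword queries (the selection cutoffs are illustrative, not QXtract's exact policy):

```python
def queries_from_rules(rules):
    """Each conjunctive rule, e.g. ('disease', 'reported'), becomes an AND query."""
    return [list(rule) for rule in rules]

def queries_from_weights(weights, k=2):
    """The k highest positively weighted terms become single-term queries."""
    top = sorted((t for t in weights if weights[t] > 0),
                 key=weights.get, reverse=True)
    return [[t] for t in top[:k]]

ripper_rules = [("disease", "reported")]
svm_weights = {"virus": 3, "infected": 2, "sponsored": -1}
combined = queries_from_rules(ripper_rules) + queries_from_weights(svm_weights)
print(combined)  # [['disease', 'reported'], ['virus'], ['infected']]
```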
37
SIGMOD 2003 Demonstration
38
Tuples: A Simple Querying Strategy
1. Convert given tuples into queries.
2. Retrieve matching documents.
3. Extract new tuples from the documents and iterate.
Example: <Ebola, Zaire, May 1995> → query ["Ebola" AND "Zaire"] → Search Engine → Information Extraction System → <Malaria, Ethiopia, Jan. 1995>, <hemorrhagic fever, Africa, May 1995>
39
[Figure: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for the QXtract, Manual, Tuples, and Baseline strategies.]
Comparison of Document Access Methods
QXtract: 60% of the relation extracted from just 10% of the documents in a database of 135,000 newspaper articles
Tuples strategy: recall at most 46%
40
How to choose the best strategy?
Tuples: simple, no training, but limited recall
QXtract: robust, but has training and query overhead
Scan: no overhead, but must process all documents
41
Predicting Recall of Tuples Strategy
[Diagram: starting from a seed tuple, querying may reach most of the relation (SUCCESS!) or only a small isolated part (FAILURE).]
Can we predict if Tuples will succeed?
WebDB 2003
42
Abstract the Problem: the Querying Graph
[Diagram: bipartite querying graph between tuples t1-t5 and documents d1-d5; e.g., querying ["Ebola" AND "Zaire"] retrieves documents that contain further tuples.]
Note: only the top K docs are returned for each query. <Violence, U.S.> retrieves many documents that do not contain tuples; searching for an extracted tuple may not retrieve its source document.
43
Information Reachability Graph
t1 retrieves document d1, which contains t2; so t2, t3, and t4 are "reachable" from t1.
[Diagram: tuple-level reachability graph (t1 → t2, t3, t4; t5 isolated) derived from the bipartite tuple-document graph.]
44
Connected Components
[Diagram: In → Core (strongly connected) → Out.]
In: tuples that retrieve other tuples but are not themselves reachable
Core (strongly connected): tuples that retrieve other tuples and themselves
Out: reachable tuples that do not retrieve tuples in the Core
45
Sizes of Connected Components
How many tuples are in the largest Core + Out?
Conjecture: the degree distribution in reachability graphs follows a "power law."
Then the reachability graph has at most one giant component.
Define reachability as the fraction of tuples in the largest Core + Out.
46
NYT Reachability Graph: Outdegree Distribution
[Figure: outdegree distributions for MaxResults=10 and MaxResults=50; both match the power-law distribution.]
47
NYT: Component Size Distribution
[Figure: component size distributions for MaxResults=10 (CG / |T| = 0.297, not "reachable") and MaxResults=50 (CG / |T| = 0.620, "reachable").]
48
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
49
Estimating Reachability
In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.
Estimate: Reachability ≈ CG / |T|, which depends only on d (the average outdegree).
* For power-law exponent < 3.457; Chung and Lu, Annals of Combinatorics, 2002.
50
Estimating Reachability Algorithm
1. Pick some random tuples
2. Use tuples to query database
3. Extract tuples from matching documents to compute reachability graph edges
4. Estimate average outdegree
5. Estimate reachability using results of Chung and Lu, Annals of Combinatorics, 2002
[Diagram: sampled tuples t1-t4 queried against documents d1-d4; the tuples extracted from matching documents give the graph edges, yielding average outdegree d = 1.5.]
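A sketch of the estimation loop, where `search` and `extract` are stand-ins for the search engine and the information extraction system (the function names and interface are illustrative):

```python
def estimate_avg_outdegree(sample_tuples, search, extract, max_results=10):
    """Query with a few random tuples, extract tuples from the matching
    documents, and average the resulting outdegrees. Per Chung and Lu (2002),
    d > 1 (for power-law exponent < 3.457) predicts a giant component,
    i.e., that the Tuples querying strategy is likely to succeed."""
    degrees = []
    for t in sample_tuples:
        reached = set()
        for doc in search(t, max_results):
            reached.update(extract(doc))
        reached.discard(t)  # count only edges to other tuples
        degrees.append(len(reached))
    return sum(degrees) / len(degrees)
```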
51
Estimating Reachability of NYT
[Figure: estimated reachability (0-1) vs. MaxResults (MR = 1, 10, 50, 100, 200, 1000) for sample sizes S = 10, 50, 100, 200, compared with the real graph; actual reachability ≈ 0.46.]
Approximate reachability is estimated after ~ 50 queries.
Can be used to predict success (or failure) of a Tuples querying strategy.
52
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
(e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
53
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → output tuples.]
1. Retrieve documents from the database
2. Process documents
3. Extract output tuples
Task → "tuple":
Information Extraction → Relation Tuple (the focus for the rest of the talk)
Database Selection → Word (+Frequency)
Focused Crawling → Web Page about a Topic
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
54
Executing a Text-Centric Task
[Diagram: Text Database → 1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples.]
Similar to the relational world, there are two major execution paradigms:
Scan-based: retrieve and process documents sequentially
Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
Indexes are only "approximate": the index is on keywords, not on the tuples of interest
The choice of execution plan affects output completeness (not only speed)
→ the underlying data distribution dictates what is best
55
Execution Plan Characteristics
[Diagram: Text Database → 1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples.]
Execution plans have two main characteristics:
Execution time
Recall (fraction of tuples retrieved)
Question: how do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
56
Outline
Description and analysis of crawl- and query-based plans:
Crawl-based: Scan, Filtered Scan
Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
57
Scan
[Diagram: Text Database → 1. Retrieve docs from database → 2. Process documents → 3. Extract output tuples.]
Scan retrieves and processes documents sequentially (until reaching the target recall).
Execution time = |Retrieved Docs| · (R + P)
where R = time for retrieving a document and P = time for processing a document.
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
Estimating Recall of Scan
Modeling Scan for tuple t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents?
A "sampling without replacement" process.
After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution.
Recall for tuple t is the probability that the frequency of t in the S documents is greater than 0.
[Diagram: sampling S of the N documents in D for tuple t, e.g., <SARS, China>; g(t) = frequency of tuple t.]
59
Estimating Recall of Scan
Modeling Scan: multiple "sampling without replacement" processes, one for each tuple.
Overall recall is the average recall across tuples.
→ We can compute the number of documents required to reach the target recall.
Execution time = |Retrieved Docs| · (R + P)
[Diagram: parallel sampling processes over documents d1…dN for tuples t1 (<SARS, China>), t2 (<Ebola, Zaire>), …, tM.]
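This yields a direct recall estimate; a sketch using SciPy's hypergeometric distribution (the tuple document frequencies g(t) are assumed known or estimated):

```python
from scipy.stats import hypergeom

def scan_recall(num_docs, tuple_freqs, s):
    """Expected recall of Scan after processing s of num_docs documents:
    for each tuple t with document frequency g, P(t is seen) =
    1 - P(X = 0), where X ~ Hypergeometric(num_docs, g, s)."""
    recalls = [1.0 - hypergeom.pmf(0, num_docs, g, s) for g in tuple_freqs]
    return sum(recalls) / len(recalls)

# Toy example: 1,000 docs; tuples appearing in 1, 5, and 20 documents
print(scan_recall(1000, [1, 5, 20], s=200))
```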
60
Iterative Set Expansion
[Diagram: 1. Query the database with seed tuples (e.g., [Ebola AND Zaire]) → 2. Process the retrieved documents → 3. Extract tuples from the docs (e.g., <Malaria, Ethiopia>) → 4. Augment the seed tuples with the new tuples → (repeat).]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
where R = time for retrieving a document, P = time for processing a document, and Q = time for answering a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
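The strategy itself is a short loop; a sketch with `search` and `extract` again as illustrative stand-ins for the search engine and the extraction system:

```python
def iterative_set_expansion(seeds, search, extract, max_iters=10):
    """Query with known tuples, process newly retrieved documents,
    add the newly extracted tuples, and repeat."""
    known = set(seeds)
    frontier = list(seeds)   # tuples not yet used as queries
    seen_docs = set()
    for _ in range(max_iters):
        if not frontier:
            break
        new_tuples = set()
        for t in frontier:
            for doc_id, text in search(t):   # e.g., ["Ebola" AND "Zaire"]
                if doc_id not in seen_docs:
                    seen_docs.add(doc_id)
                    new_tuples.update(extract(text))
        frontier = list(new_tuples - known)
        known |= new_tuples
    return known
```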
61
Using the Querying Graph for Analysis
We need to compute:
The number of documents retrieved after sending Q tuples as queries (estimates time)
The number of tuples that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
The degree distribution of the tuples discovered by retrieving documents
The degree distribution of the documents retrieved by the tuples
(These are not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)
[Diagram: bipartite querying graph between tuples (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1-d5.]
62
Summary of Cost Analysis
Our analysis so far:
Takes as input a target recall
Gives as output the time for each plan to reach the target recall (time = infinity, if the plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
Tuple degree distribution
Document degree distribution
Next, we show how to estimate the degree distributions on the fly.
63
Estimating Cost Model Parameters
Tuple and document degree distributions belong to known distribution families, so the distributions can be characterized with only a few parameters!
Task Document distribution Tuple distribution
Information Extraction Power-law Power-law
Content Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
[Figure: log-log plots of the number of documents vs. document degree (power-law fit y = 43060·x^-3.3863) and the number of tokens vs. token degree (fit y = 5492.2·x^-2.0254).]
64
Parameter Estimation
Naïve solution for parameter estimation:
Start with a separate "parameter-estimation" phase
Perform random sampling on the database
Stop when cross-validation indicates high confidence
We can do better than this!
No need for a separate sampling phase
Sampling is equivalent to executing the task:
→ piggyback parameter estimation onto execution
65
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall assuming "default" parameter values
Start executing the task
Update the parameter estimates during execution
Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Diagram: the initial default estimate converges toward the correct (but unknown) distribution as estimates are updated.]
66
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
67
Correctness of Theoretical Analysis
Solid lines: Actual time Dotted lines: Predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tuples
[Figure: execution time (secs, 100-100,000, log scale) vs. recall (0.0-1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion.]
68
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer.
(results similar in other experiments – see paper)
[Figure: execution time (secs, 100-100,000, log scale) vs. recall (0.0-1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and the OPTIMIZED plan.]
69
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
70
Can we do better?
Yes. For some information extraction systems
71
Bindings Engine (BE) [Slides: Cafarella 2005]
Bindings Engine (BE) is a search engine where:
there are no downloads during query processing
disk seeks are constant in corpus size
#queries = #phrases
BE's approach:
a "variabilized" search query language
pre-processes all documents before query time
integrates variable/type data with the inverted index, minimizing query seeks
72
BE Query Support
cities such as <NounPhrase>
President Bush <Verb>
<NounPhrase> is the capital of <NounPhrase>
reach me at <phone-number>
Any sequence of concrete terms and typed variables (NEAR is insufficient)
Functions (e.g., "head(<NounPhrase>)")
73
BE Operation
Like a generic search engine, BE:
downloads a corpus of pages
creates an index
uses the index to process queries efficiently
BE further requires:
a set of indexed types (e.g., "NounPhrase"), with a "recognizer" for each
string processing functions (e.g., "head()")
A BE system can only process types and functions that its index supports.
74
[Diagram: an inverted index; each term (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) points to a posting list: #docs, docid0, docid1, …, docid#docs-1.]
75
Query: such as
[Diagram: the posting lists for "as" (docids 21, 150, 322, 2501, …) and "such" (docids 99, 322, 426, 1309, …) are intersected:]
1. Test for equality
2. Advance the smaller pointer
3. Abort when a list is exhausted
Returned docs: 322
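The intersection step is the classic two-pointer merge over sorted posting lists; a minimal sketch (the docids are illustrative, chosen to match the example's result):

```python
def intersect(a, b):
    """Intersect two sorted posting lists: test for equality, advance the
    pointer with the smaller docid, stop when either list is exhausted."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([21, 104, 150, 322, 2501], [15, 99, 322, 426, 1309]))  # [322]
```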
76
In phrase queries, match positions as well.
[Diagram: posting-list entries for "such" and "as" augmented with position lists (#posns, pos0, pos1, …, pos#pos-1 per document); the phrase "such as" matches where the positions are adjacent.]
77
Neighbor Index
At each position in the index, store "neighbor text" that might be useful.
Let's index <NounPhrase> and <Adj-Term>.
"I love cities such as Atlanta."
[Diagram: e.g., at "cities", the Left neighbor is AdjT: "love".]
78
Neighbor Index
"I love cities such as Atlanta."
[Diagram, continued: at "such", the Left neighbors are AdjT: "cities" and NP: "cities"; at "love", they are AdjT: "I" and NP: "I".]
79
Neighbor Index
Query: "cities such as <NounPhrase>"
"I love cities such as Atlanta."
[Diagram: at "as", the Left neighbor is AdjT: "such", and the Right neighbors are AdjT: "Atlanta" and NP: "Atlanta"; the NP right neighbor binds the variable.]
80
"cities such as <NounPhrase>"
1. Find the phrase query positions, as with phrase queries.
2. If a term is adjacent to a variable, extract the typed value.
[Diagram: posting-list entries augmented with neighbor data (#neighbors, blk_offset; e.g., neighbor0 = AdjT-left: "such", neighbor1 = NP-right: "Atlanta"); in doc 19, starting at position 8: "I love cities such as Atlanta."]
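A hypothetical in-memory stand-in for the neighbor lookup (BE's real index is on disk and position-aware; here the phrase-position check is elided and all names, slots, and positions are illustrative):

```python
# term -> list of (doc_id, position, {neighbor_slot: typed_value})
neighbor_index = {
    "as":   [(19, 12, {"NP_right": "Atlanta", "AdjT_left": "such"})],
    "such": [(19, 11, {"AdjT_left": "cities", "NP_left": "cities"})],
}

def bind_variable(phrase_terms, var_slot):
    """Match the last concrete term of the phrase, then read the variable
    binding from its stored neighbor data (phrase-position check elided)."""
    bindings = []
    for doc_id, pos, neighbors in neighbor_index.get(phrase_terms[-1], []):
        if var_slot in neighbors:
            bindings.append((doc_id, neighbors[var_slot]))
    return bindings

print(bind_variable(["cities", "such", "as"], "NP_right"))  # [(19, 'Atlanta')]
```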
81
Current Research Directions
Modeling explicit and implicit network structures:
Modeling the evolution of explicit structure on the web, blogspace, Wikipedia
Modeling implicit link structures in text, collections, the web
Exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data:
Automatic sequence annotation: bioinformatics, genetics
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing:
Integrating information in structured and unstructured sources
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources:
Information propagation on the web
In collaborative sources (Wikipedia, MedLine)
82
Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."
83
Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
84
Modeling Social Networks for Epidemiology, security, …
Email exchange mapped onto cubicle locations.
85
Some Research Directions (overview repeated from slide 81; also lists query processing over unstructured text)
86
Mining Text and Sequence Data
Agichtein & Eskin, PSB 2004
ROC50 scores for each class and method
87
Some Research Directions (overview repeated from slide 81)
88
Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.
89
Current Research Directions (overview repeated from slide 81; also lists information propagation on the web, news)
90
Thank You
Details: http://www.mathcs.emory.edu/~eugene/