1
Scalable Information Extraction
Eugene Agichtein
2
Example: Angina treatments
Information sources: PDR, MedLine, web search results, structured databases (e.g., drug info, WHO drug adverse effects DB, etc.), medical reference and literature.
Example searches:
guideline for unstable angina
unstable angina management
herbal treatment for angina pain
medications for treating angina
alternative treatment for angina pain
treatment for angina
angina treatments
3
Research Goal: accurate, intuitive, and efficient access to knowledge in unstructured sources.
Approaches:
Information Retrieval: retrieve the relevant documents or passages; question answering
Human Reading: construct domain-specific "verticals" (MedLine)
Machine Reading: extract entities and relationships; build a network of relationships (Semantic Web)
4
Semantic Relationships “Buried” in Unstructured Text
Sources: web, newsgroups, web logs; text databases (PubMed, CiteSeer, etc.); newspaper archives
Corporate mergers, succession, location, terrorist attacks (the Message Understanding Conferences, MUC)
…A number of well-designed and -executed large-scale clinical trials have now shown that treatment with statins reduces recurrent myocardial infarction, reduces strokes, and lessens the need for revascularization or hospitalization for unstable angina pectoris…
RecommendedTreatment relation (Drug, Condition):
statins → recurrent myocardial infarction
statins → strokes
statins → unstable angina pectoris
5
What Structured Representation Can Do for You:
… allow precise and efficient querying
… allow returning answers instead of documents
… support powerful query constructs
… allow data integration with (structured) RDBMS
… provide useful content for the Semantic Web
Large Text Collection → Structured Relation
6
Challenges in Information Extraction
Portability: reduce the effort to tune for new domains and tasks (MUC systems: experts would take 8-12 weeks to tune)
Scalability, efficiency, access: enable information extraction over large collections (1 sec/document × 5 billion docs = 158 CPU years)
Approach: learn from data ("bootstrapping"):
Snowball: partially supervised information extraction
Querying large text databases for efficient information extraction
7
Outline
Snowball: partially supervised information extraction (overview and key results)
Effective retrieval algorithms for information extraction (in detail)
Current: mining user behavior for web search
Future work
8
The Snowball System: Overview
[Diagram: Text Database → Snowball → extracted relation:]
Organization Location Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... ... ...
9
Snowball: Getting User Input
User input:
• a handful of example instances
• integrity constraints on the relation, e.g., Organization is a "key", Age > 0, etc.
[Diagram, the Snowball loop: Get Examples → Find Example Occurrences in Text → Tag Entities → Generate Extraction Patterns → Extract Tuples → Evaluate Tuples → (repeat).]
ACM DL 2000
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
10
Can use any full-text search engine.
Snowball: Finding Example Occurrences
Search Engine
Text Database
Organization Headquarters
Microsoft Redmond
IBM Armonk
Intel Santa Clara
Computer servers at Microsoft’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA-based Microsoft Corp
The Armonk-based IBM introduced a new line…
Change of guard at IBM Corporation’s headquarters near Armonk, NY ...
11
Named entity taggers can recognize Dates, People, Locations, Organizations, … (e.g., MITRE's Alembic, IBM's Talent, LingPipe, …).
Snowball: Tagging Entities
Computer servers at Microsoft ’s headquarters in Redmond…
In mid-afternoon trading, shares of Redmond, WA -based Microsoft Corp
The Armonk -based IBM introduced a new line…
Change of guard at IBM Corporation‘s headquarters near Armonk, NY ...
12
Snowball: Extraction Patterns
General extraction pattern model: acceptor0, Entity, acceptor1, Entity, acceptor2
Acceptor instantiations:
String match (accepts the string "'s headquarters in")
Vector-space (≈ vector [('s, 0.5), (headquarters, 0.5), (in, 0.5)])
Classifier (estimate P(T = valid | 's, headquarters, in))
Computer servers at Microsoft’s headquarters in Redmond…
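For concreteness, here is a minimal Python sketch of how a vector-space acceptor might score a candidate context against a learned pattern context, assuming contexts are sparse term-weight dictionaries (the names and weights are illustrative, not Snowball's actual implementation):

```python
from math import sqrt

def cosine(v1, v2):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v2.get(t, 0.0) for t, w in v1.items())
    n1 = sqrt(sum(w * w for w in v1.values()))
    n2 = sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0

# Middle-context vector learned from "Microsoft's headquarters in Redmond"
pattern_mid = {"'s": 0.5, "headquarters": 0.5, "in": 0.5}
# Middle context of a new candidate occurrence
candidate_mid = {"'s": 0.5, "new": 0.5, "headquarters": 0.5, "in": 0.5}

print(cosine(pattern_mid, candidate_mid))  # high similarity: the acceptor fires
```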
13
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms:
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <in 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
ORGANIZATION {<'s 0.57>, <headquarters 0.57>, <near 0.57>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
2. Cluster similar occurrences.
14
Snowball: Generating Patterns
1. Represent occurrences as vectors of tags and terms.
2. Cluster similar occurrences.
3. Create patterns as filtered cluster centroids:
ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION
LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION
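A rough sketch of the clustering step, assuming a simple single-pass scheme with a similarity threshold and running-average centroids (the threshold and data structure are illustrative; this reuses the `cosine` helper sketched above):

```python
def add_to_cluster(clusters, occ, threshold=0.7):
    """Assign an occurrence vector to the closest cluster centroid,
    or start a new cluster if none is similar enough."""
    best, best_sim = None, threshold
    for cluster in clusters:
        sim = cosine(cluster["centroid"], occ)
        if sim >= best_sim:
            best, best_sim = cluster, sim
    if best is None:
        clusters.append({"centroid": dict(occ), "members": [occ]})
        return
    best["members"].append(occ)
    n = len(best["members"])
    # Update the centroid as the running average of member term weights.
    for term in set(best["centroid"]) | set(occ):
        best["centroid"][term] = ((n - 1) * best["centroid"].get(term, 0.0)
                                  + occ.get(term, 0.0)) / n
```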
15
Snowball: Extracting New Tuples
Match tagged text fragments against the patterns.
Example: "Google's new headquarters in Mountain View are …"
Candidate occurrence: ORGANIZATION {<'s 0.5>, <new 0.5>, <headquarters 0.5>, <in 0.5>} LOCATION {<are 1>}
P1: ORGANIZATION {<'s 0.71>, <headquarters 0.71>} LOCATION → Match = 0.8
P2: ORGANIZATION {<located 0.71>, <in 0.71>} LOCATION → Match = 0.4
P3: LOCATION {<- 0.71>, <based 0.71>} ORGANIZATION → Match = 0
16
Snowball: Evaluating Patterns
Automatically estimate pattern confidence: Conf(P4) = Positive / Total = 2/3 = 0.66
IBM, Armonk, reported… → Positive
Intel, Santa Clara, introduced... → Positive
"Bet on Microsoft", New York-based analyst Jane Smith said... → Negative
P4: ORGANIZATION {<, 1>} LOCATION
Current seed tuples:
Organization Headquarters
IBM Armonk
Intel Santa Clara
Microsoft Redmond
17
Snowball: Evaluating Tuples
Automatically evaluate tuple confidence:
Conf(T) = 1 - ∏i (1 - Conf(Pi) · Match(T, Pi))
A tuple has high confidence if generated by high-confidence patterns.
Example: T = <3Com, Santa Clara>, matched by two patterns:
P4: ORGANIZATION {<, 1>} LOCATION with Conf(P4) = 0.66, Match = 0.4
P3: LOCATION {<- 0.75>, <based 0.75>} ORGANIZATION with Conf(P3) = 0.95, Match = 0.8
Conf(T) = 1 - (1 - 0.95 · 0.8)(1 - 0.66 · 0.4) ≈ 0.83
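Both confidence estimates are easy to express directly; a minimal sketch (the 0.83 above reflects the slide's rounding):

```python
def pattern_confidence(positive, negative):
    """Conf(P) = Positive / Total over the pattern's matches of seed tuples."""
    total = positive + negative
    return positive / total if total else 0.0

def tuple_confidence(matches):
    """Conf(T) = 1 - prod_i(1 - Conf(P_i) * Match(T, P_i))."""
    prod = 1.0
    for p_conf, match in matches:
        prod *= 1.0 - p_conf * match
    return 1.0 - prod

print(pattern_confidence(2, 1))                      # 0.66...
print(tuple_confidence([(0.95, 0.8), (0.66, 0.4)]))  # ~0.82
```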
18
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
157th Street Manhattan 0.52
15th Party Congress China 0.3
15th Century Europe Dark Ages 0.1
... ... ...
Keep only high-confidence tuples for the next iteration.
19
Snowball: Evaluating Tuples
Organization Headquarters Conf
Microsoft Redmond 1
IBM Armonk 1
Intel Santa Clara 1
AG Edwards St Louis 0.9
Air Canada Montreal 0.8
7th Level Richardson 0.8
3Com Corp Santa Clara 0.8
3DO Redwood City 0.7
3M Minneapolis 0.7
MacWorld San Francisco 0.7
Start the new iteration with the expanded example set. Iterate until no new tuples are extracted.
20
Pattern-Tuple Duality
A "good" tuple: extracted by "good" patterns (tuple weight ≈ goodness)
A "good" pattern: generated by "good" tuples, and extracts "good" new tuples (pattern weight ≈ goodness)
Edge weight: match/similarity of the tuple context to the pattern
21
How to Set Node Weights
Constraint violation (from before):
Conf(P) = log(Pos) · Pos / (Pos + Neg)
Conf(T) = 1 - ∏i (1 - Conf(Pi) · Match(T, Pi))
HITS [Hassan et al., EMNLP 2006]: Conf(P) = ∑ Conf(T); Conf(T) = ∑ Conf(P)
URNS [Downey et al., IJCAI 2005]
EM-Spy [Agichtein, SDM 2006]: unknown tuples = Neg; compute Conf(P), Conf(T); iterate
22
Snowball: EM-based Pattern Evaluation
23
Evaluating Patterns and Tuples: Expectation Maximization
EM-Spy algorithm:
"Hide" the labels for some seed tuples (the "spies")
Iterate the EM algorithm to convergence on tuple/pattern confidence values
Set the confidence threshold t so that 90% of the spy tuples score above t
Re-initialize Snowball using the new seed tuples
Organization Headquarters Initial Final
Microsoft Redmond 1 1
IBM Armonk 1 0.8
Intel Santa Clara 1 0.9
AG Edwards St Louis 0 0.9
Air Canada Montreal 0 0.8
7th Level Richardson 0 0.8
3Com Corp Santa Clara 0 0.8
3DO Redwood City 0 0.7
3M Minneapolis 0 0.7
MacWorld San Francisco 0 0.7
157th Street Manhattan 0 0.52
15th Party Congress China 0 0.3
15th Century Europe Dark Ages 0 0.1
...
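A sketch of the threshold-selection step, assuming confidences are kept in a dict keyed by tuple and that a given fraction of the spy tuples must score above the threshold (all names and values are illustrative):

```python
def emspy_threshold(conf, spies, coverage=0.9):
    """Pick a confidence threshold t such that `coverage` of the
    hidden 'spy' seed tuples score above t."""
    spy_scores = sorted(conf[s] for s in spies)
    cutoff = int((1.0 - coverage) * len(spy_scores))
    return spy_scores[cutoff]

conf = {("IBM", "Armonk"): 0.8, ("Intel", "Santa Clara"): 0.9,
        ("3DO", "Redwood City"): 0.7, ("157th Street", "Manhattan"): 0.52}
spies = [("IBM", "Armonk"), ("Intel", "Santa Clara")]
t = emspy_threshold(conf, spies)
new_seeds = [x for x, c in conf.items() if c >= t]  # re-initialize Snowball
```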
24
Adapting Snowball for New Relations
Large parameter space:
Initial seed tuples (randomly chosen, multiple runs)
Acceptor features: words, stems, n-grams, phrases, punctuation, POS
Feature selection techniques: OR, NB, Freq, "support", combinations
Feature weights: TF*IDF, TF, TF*NB, NB
Pattern evaluation strategies: NN, constraint violation, EM, EM-Spy
Automatically estimate parameter values:
Estimate operating parameters based on occurrences of the seed tuples
Run cross-validation on hold-out sets of seed tuples for optimal performance
Discard seed occurrences that do not have close "neighbors"
25
Example Task 1: DiseaseOutbreaks
Proteus: 0.409; Snowball: 0.415
SDM 2006
26
Example Task 2: Bioinformatics, a.k.a. mining the "bibliome"
100,000+ gene and protein synonyms extracted from 50,000+ journal articles
Approximately 40% of the confirmed synonyms were not previously listed in the curated authoritative reference (SWISS-PROT)
ISMB 2003
“APO-1, also known as DR6…”“MEK4, also called SEK1…”
27
Snowball Used in Various Domains
News: NYT, WSJ, AP [DL'00, SDM'06]
CompanyHeadquarters, MergersAcquisitions, DiseaseOutbreaks
Medical literature: PDRHealth, Micromedex… [Thesis]
AdverseEffects, DrugInteractions, RecommendedTreatments
Biological literature: GeneWays corpus [ISMB’03]
Gene and Protein Synonyms
28
Limits of Bootstrapping for Extraction
The task is "easy" when context term distributions diverge from the background.
Quantify this as relative entropy (Kullback-Leibler divergence).
After calibration, the metric predicts whether bootstrapping is likely to work.
KL(LM_C || LM_BG) = Σ_{w ∈ V} LM_C(w) · log( LM_C(w) / LM_BG(w) )
[Figure: term frequencies (0-0.07) in the context vs. background language models for terms such as "the", "to", "and", "said", "'s", "company", "mrs", "won", "president".]
CIKM 2005
President George W Bush’s three-day visit to India
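The divergence itself is a one-liner; a small sketch with toy, partial (unsmoothed) language models just to fix the notation:

```python
from math import log

def kl_divergence(lm_c, lm_bg):
    """KL(LM_C || LM_BG) = sum over w of LM_C(w) * log(LM_C(w) / LM_BG(w)).
    Assumes LM_BG(w) > 0 wherever LM_C(w) > 0 (i.e., a smoothed background)."""
    return sum(p * log(p / lm_bg[w]) for w, p in lm_c.items() if p > 0)

# Toy fragments of a context model vs. a background model (not full distributions)
lm_context = {"the": 0.05, "said": 0.02, "headquarters": 0.04, "based": 0.03}
lm_background = {"the": 0.06, "said": 0.03, "headquarters": 0.001, "based": 0.002}
print(kl_divergence(lm_context, lm_background))  # large value: contexts diverge
```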
29
Few Relations Cover Common Questions
25 relations cover > 50% of question types; 5 relations cover > 55% of question instances
SIGIR 2005
Relation Types(%) Instances(%)
<person> discovers <concept> 7.7 2.9
<person> has position <concept> 5.6 4.6
<location> has location <location> 5.2 1.5
<person> known for <concept> 4.7 1.7
<event> has date <date> 4.1 0.9
30
Outline
Snowball, a domain-independent, partially supervised information extraction system
Retrieval algorithms for scalable information extraction
Current: mining user behavior for web search
Future work
31
Extracting a Relation from a Large Text Database
[Diagram: Text Database → Information Extraction System → Structured Relation.]
Brute-force approach: feed all docs to the information extraction system (expensive for large collections)
Often only a tiny fraction of the documents are useful
Many databases are not crawlable
Often a search interface is available, with an existing keyword index
How to identify "useful" documents?
32
Accessing Text DBs via Search Engines
[Diagram: Text Database → Search Engine → Information Extraction System → Structured Relation.]
Search engines impose limitations:
Limit on documents retrieved per query
Support simple keywords and phrases
Ignore "stopwords" (e.g., "a", "is")
33
QXtract: Querying Text Databases for Robust Scalable Information EXtraction
User-provided seed tuples:
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
[Diagram: seed tuples → Query Generation → queries → Search Engine over the Text Database → promising documents → Information Extraction System → extracted relation:]
DiseaseName Location Date
Malaria Ethiopia Jan. 1995
Ebola Zaire May 1995
Mad Cow Disease The U.K. July 1995
Pneumonia The U.S. Feb. 1995
Problem: learn keyword queries to retrieve "promising" documents.
34
Learning Queries to Retrieve Promising Documents
1. Get document sample with “likely negative” and “likely positive” examples.
2. Label sample documents using information extraction system as “oracle.”
3. Train classifiers to “recognize” useful documents.
4. Generate queries from classifier model/rules.
[Diagram: user-provided seed tuples → seed sampling over the text database (via the search engine) → documents labeled +/- by the information extraction system → classifier training → query generation → queries.]
35
Training Classifiers to Recognize “Useful” Documents
Document features: words.
D1 (+): disease, reported, epidemic, expected, area
D2 (+): virus, reported, expected, infected, patients
D3 (-): products, made, used, exported, far
D4 (-): past, old, homerun, sponsored, event
Learned models:
Ripper: disease AND reported => USEFUL
SVM: virus 3, infected 2, sponsored -1
Okapi (IR): disease, infected, reported, virus, epidemic, products, used, far, exported
36
Generating Queries from Classifiers
Ripper: disease AND reported => USEFUL → query [disease AND reported]
SVM: virus 3, infected 2, sponsored -1 → queries [virus], [infected]
Okapi (IR): top-ranked terms disease, infected, reported, virus, epidemic, … → queries [epidemic], [virus]
QCombined: [disease AND reported], [epidemic], [virus], [virus infected]
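One way to realize this mapping in code; a sketch that turns conjunctive rules and the top positively weighted classifier terms into keyword queries (the selection cutoffs are illustrative, not QXtract's exact policy):

```python
def queries_from_rules(rules):
    """Each conjunctive rule, e.g. ('disease', 'reported'), becomes an AND query."""
    return [list(rule) for rule in rules]

def queries_from_weights(weights, k=2):
    """The k highest positively weighted terms become single-term queries."""
    top = sorted((t for t in weights if weights[t] > 0),
                 key=weights.get, reverse=True)
    return [[t] for t in top[:k]]

ripper_rules = [("disease", "reported")]
svm_weights = {"virus": 3, "infected": 2, "sponsored": -1}
combined = queries_from_rules(ripper_rules) + queries_from_weights(svm_weights)
print(combined)  # [['disease', 'reported'], ['virus'], ['infected']]
```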
37
SIGMOD 2003 Demonstration
38
Tuples: A Simple Querying Strategy
1. Convert given tuples into queries.
2. Retrieve matching documents.
3. Extract new tuples from the documents and iterate.
Example: <Ebola, Zaire, May 1995> → query ["Ebola" AND "Zaire"] → Search Engine → Information Extraction System → <Malaria, Ethiopia, Jan. 1995>, <hemorrhagic fever, Africa, May 1995>
39
[Figure: recall (%) vs. MaxFractionRetrieved (5%, 10%, 25%) for the QXtract, Manual, Tuples, and Baseline strategies.]
Comparison of Document Access Methods
QXtract: 60% of the relation extracted from just 10% of the documents in a database of 135,000 newspaper articles
Tuples strategy: recall at most 46%
40
How to choose the best strategy?
Tuples: simple, no training, but limited recall
QXtract: robust, but has training and query overhead
Scan: no overhead, but must process all documents
41
Predicting Recall of Tuples Strategy
[Diagram: starting from a seed tuple, querying may reach most of the relation (SUCCESS!) or only a small isolated part (FAILURE).]
Can we predict if Tuples will succeed?
WebDB 2003
42
Abstract the Problem: the Querying Graph
[Diagram: bipartite querying graph between tuples t1-t5 and documents d1-d5; e.g., querying ["Ebola" AND "Zaire"] retrieves documents that contain further tuples.]
Note: only the top K docs are returned for each query. <Violence, U.S.> retrieves many documents that do not contain tuples; searching for an extracted tuple may not retrieve its source document.
43
Information Reachability Graph
t1 retrieves document d1, which contains t2; so t2, t3, and t4 are "reachable" from t1.
[Diagram: tuple-level reachability graph (t1 → t2, t3, t4; t5 isolated) derived from the bipartite tuple-document graph.]
44
Connected Components
[Diagram: In → Core (strongly connected) → Out.]
In: tuples that retrieve other tuples but are not themselves reachable
Core (strongly connected): tuples that retrieve other tuples and themselves
Out: reachable tuples that do not retrieve tuples in the Core
45
Sizes of Connected Components
How many tuples are in the largest Core + Out?
Conjecture: the degree distribution in reachability graphs follows a "power law."
Then the reachability graph has at most one giant component.
Define reachability as the fraction of tuples in the largest Core + Out.
46
NYT Reachability Graph: Outdegree Distribution
[Figure: outdegree distributions for MaxResults=10 and MaxResults=50; both match the power-law distribution.]
47
NYT: Component Size Distribution
[Figure: component size distributions for MaxResults=10 (CG / |T| = 0.297, not "reachable") and MaxResults=50 (CG / |T| = 0.620, "reachable").]
48
Connected Components Visualization
DiseaseOutbreaks, New York Times 1995
49
Estimating Reachability
In a power-law random graph G, a giant component CG emerges* if d (the average outdegree) > 1.
Estimate: Reachability ≈ CG / |T|, which depends only on d (the average outdegree).
* For power-law exponent < 3.457; Chung and Lu, Annals of Combinatorics, 2002.
50
Estimating Reachability Algorithm
1. Pick some random tuples
2. Use tuples to query database
3. Extract tuples from matching documents to compute reachability graph edges
4. Estimate average outdegree
5. Estimate reachability using results of Chung and Lu, Annals of Combinatorics, 2002
[Diagram: sampled tuples t1-t4 queried against documents d1-d4; the tuples extracted from matching documents give the graph edges, yielding average outdegree d = 1.5.]
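A sketch of the estimation loop, where `search` and `extract` are stand-ins for the search engine and the information extraction system (the function names and interface are illustrative):

```python
def estimate_avg_outdegree(sample_tuples, search, extract, max_results=10):
    """Query with a few random tuples, extract tuples from the matching
    documents, and average the resulting outdegrees. Per Chung and Lu (2002),
    d > 1 (for power-law exponent < 3.457) predicts a giant component,
    i.e., that the Tuples querying strategy is likely to succeed."""
    degrees = []
    for t in sample_tuples:
        reached = set()
        for doc in search(t, max_results):
            reached.update(extract(doc))
        reached.discard(t)  # count only edges to other tuples
        degrees.append(len(reached))
    return sum(degrees) / len(degrees)
```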
51
Estimating Reachability of NYT
[Figure: estimated reachability (0-1) vs. MaxResults (MR = 1, 10, 50, 100, 200, 1000) for sample sizes S = 10, 50, 100, 200, compared with the real graph; actual reachability ≈ 0.46.]
Approximate reachability is estimated after ~ 50 queries.
Can be used to predict success (or failure) of a Tuples querying strategy.
52
To Search or to Crawl? Towards a Query Optimizer for Text-Centric Tasks [Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
Information extraction applications extract structured relations from unstructured text
May 19 1995, Atlanta -- The Centers for Disease Control and Prevention, which is in the front line of the world's response to the deadly Ebola epidemic in Zaire , is finding itself hard pressed to cope with the crisis…
Date Disease Name Location
Jan. 1995 Malaria Ethiopia
July 1995 Mad Cow Disease U.K.
Feb. 1995 Pneumonia U.S.
May 1995 Ebola Zaire
Information Extraction System
(e.g., NYU’s Proteus)
Disease Outbreaks in The New York Times
53
An Abstract View of Text-Centric Tasks
[Diagram: Text Database → Extraction System → output tuples.]
1. Retrieve documents from the database
2. Process documents
3. Extract output tuples
Task → "tuple":
Information Extraction → Relation Tuple (the focus for the rest of the talk)
Database Selection → Word (+Frequency)
Focused Crawling → Web Page about a Topic
[Ipeirotis, Agichtein, Jain, Gravano, SIGMOD 2006]
54
Executing a Text-Centric Task
[Diagram: Text Database → 1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples.]
Similar to the relational world, there are two major execution paradigms:
Scan-based: retrieve and process documents sequentially
Index-based: query the database (e.g., [case fatality rate]), then retrieve and process the documents in the results
Unlike the relational world:
Indexes are only "approximate": the index is on keywords, not on the tuples of interest
The choice of execution plan affects output completeness (not only speed)
→ the underlying data distribution dictates what is best
55
Execution Plan Characteristics
[Diagram: Text Database → 1. Retrieve documents from database → 2. Process documents → 3. Extract output tuples.]
Execution plans have two main characteristics:
Execution time
Recall (fraction of tuples retrieved)
Question: how do we choose the fastest execution plan for reaching a target recall?
"What is the fastest plan for discovering 10% of the disease outbreaks mentioned in The New York Times archive?"
56
Outline
Description and analysis of crawl- and query-based plans:
Crawl-based: Scan, Filtered Scan
Query-based (index-based): Iterative Set Expansion, Automatic Query Generation
Optimization strategy
Experimental results and conclusions
57
Scan
[Diagram: Text Database → 1. Retrieve docs from database → 2. Process documents → 3. Extract output tuples.]
Scan retrieves and processes documents sequentially (until reaching the target recall).
Execution time = |Retrieved Docs| · (R + P)
where R = time for retrieving a document and P = time for processing a document.
Question: how many documents does Scan retrieve to reach the target recall?
Filtered Scan uses a classifier to identify and process only promising documents (details in paper).
Estimating Recall of Scan
Modeling Scan for tuple t: what is the probability of seeing t (with frequency g(t)) after retrieving S documents?
A "sampling without replacement" process.
After retrieving S documents, the frequency of tuple t follows a hypergeometric distribution.
Recall for tuple t is the probability that the frequency of t in the S documents is greater than 0.
[Diagram: sampling S of the N documents in D for tuple t, e.g., <SARS, China>; g(t) = frequency of tuple t.]
59
Estimating Recall of Scan
Modeling Scan: multiple "sampling without replacement" processes, one for each tuple.
Overall recall is the average recall across tuples.
→ We can compute the number of documents required to reach the target recall.
Execution time = |Retrieved Docs| · (R + P)
[Diagram: parallel sampling processes over documents d1…dN for tuples t1 (<SARS, China>), t2 (<Ebola, Zaire>), …, tM.]
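This yields a direct recall estimate; a sketch using SciPy's hypergeometric distribution (the tuple document frequencies g(t) are assumed known or estimated):

```python
from scipy.stats import hypergeom

def scan_recall(num_docs, tuple_freqs, s):
    """Expected recall of Scan after processing s of num_docs documents:
    for each tuple t with document frequency g, P(t is seen) =
    1 - P(X = 0), where X ~ Hypergeometric(num_docs, g, s)."""
    recalls = [1.0 - hypergeom.pmf(0, num_docs, g, s) for g in tuple_freqs]
    return sum(recalls) / len(recalls)

# Toy example: 1,000 docs; tuples appearing in 1, 5, and 20 documents
print(scan_recall(1000, [1, 5, 20], s=200))
```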
60
Iterative Set Expansion
[Diagram: 1. Query the database with seed tuples (e.g., [Ebola AND Zaire]) → 2. Process the retrieved documents → 3. Extract tuples from the docs (e.g., <Malaria, Ethiopia>) → 4. Augment the seed tuples with the new tuples → (repeat).]
Execution time = |Retrieved Docs| · (R + P) + |Queries| · Q
where R = time for retrieving a document, P = time for processing a document, and Q = time for answering a query.
Question: how many queries and how many documents does Iterative Set Expansion need to reach the target recall?
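The strategy itself is a short loop; a sketch with `search` and `extract` again as illustrative stand-ins for the search engine and the extraction system:

```python
def iterative_set_expansion(seeds, search, extract, max_iters=10):
    """Query with known tuples, process newly retrieved documents,
    add the newly extracted tuples, and repeat."""
    known = set(seeds)
    frontier = list(seeds)   # tuples not yet used as queries
    seen_docs = set()
    for _ in range(max_iters):
        if not frontier:
            break
        new_tuples = set()
        for t in frontier:
            for doc_id, text in search(t):   # e.g., ["Ebola" AND "Zaire"]
                if doc_id not in seen_docs:
                    seen_docs.add(doc_id)
                    new_tuples.update(extract(text))
        frontier = list(new_tuples - known)
        known |= new_tuples
    return known
```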
61
Using the Querying Graph for Analysis
We need to compute:
The number of documents retrieved after sending Q tuples as queries (estimates time)
The number of tuples that appear in the retrieved documents (estimates recall)
To estimate these, we need to compute:
The degree distribution of the tuples discovered by retrieving documents
The degree distribution of the documents retrieved by the tuples
(These are not the same as the degree distribution of a randomly chosen tuple or document: it is easier to discover documents and tuples with high degrees.)
[Diagram: bipartite querying graph between tuples (<SARS, China>, <Ebola, Zaire>, <Malaria, Ethiopia>, <Cholera, Sudan>, <H5N1, Vietnam>) and documents d1-d5.]
62
Summary of Cost Analysis
Our analysis so far:
Takes as input a target recall
Gives as output the time for each plan to reach the target recall (time = infinity, if the plan cannot reach the target recall)
Time and recall depend on task-specific properties of the database:
Tuple degree distribution
Document degree distribution
Next, we show how to estimate the degree distributions on the fly.
63
Estimating Cost Model Parameters
Tuple and document degree distributions belong to known distribution families, so the distributions can be characterized with only a few parameters!
Task Document distribution Tuple distribution
Information Extraction Power-law Power-law
Content Summary Construction Lognormal Power-law (Zipf)
Focused Resource Discovery Uniform Uniform
[Figure: log-log plots of the number of documents vs. document degree (power-law fit y = 43060·x^-3.3863) and the number of tokens vs. token degree (fit y = 5492.2·x^-2.0254).]
64
Parameter Estimation
Naïve solution for parameter estimation:
Start with a separate "parameter-estimation" phase
Perform random sampling on the database
Stop when cross-validation indicates high confidence
We can do better than this!
No need for a separate sampling phase
Sampling is equivalent to executing the task:
→ piggyback parameter estimation onto execution
65
On-the-fly Parameter Estimation
Pick the most promising execution plan for the target recall assuming "default" parameter values
Start executing the task
Update the parameter estimates during execution
Switch plans if the updated statistics indicate so
Important: only Scan acts as "random sampling"; all other execution plans need parameter adjustment (see paper)
[Diagram: the initial default estimate converges toward the correct (but unknown) distribution as estimates are updated.]
66
Outline
Description and analysis of crawl- and query-based plans
Optimization strategy
Experimental results and conclusions
67
Correctness of Theoretical Analysis
Solid lines: Actual time Dotted lines: Predicted time with correct parameters
Task: Disease Outbreaks
Snowball IE system
182,531 documents from NYT
16,921 tuples
[Figure: execution time (secs, 100-100,000, log scale) vs. recall (0.0-1.0) for Scan, Filtered Scan, Automatic Query Generation, and Iterative Set Expansion.]
68
Experimental Results (Information Extraction)
Solid lines: actual time. Green line: time with the optimizer.
(results similar in other experiments – see paper)
[Figure: execution time (secs, 100-100,000, log scale) vs. recall (0.0-1.0) for Scan, Filtered Scan, Iterative Set Expansion, Automatic Query Generation, and the OPTIMIZED plan.]
69
Conclusions
Common execution plans for multiple text-centric tasks
Analytic models for predicting execution time and recall of various crawl- and query-based plans
Techniques for on-the-fly parameter estimation
Optimization framework picks on-the-fly the fastest plan for target recall
70
Can we do better?
Yes. For some information extraction systems
71
Bindings Engine (BE) [Slides: Cafarella 2005]
Bindings Engine (BE) is a search engine where:
there are no downloads during query processing
disk seeks are constant in corpus size
#queries = #phrases
BE's approach:
a "variabilized" search query language
pre-processes all documents before query time
integrates variable/type data with the inverted index, minimizing query seeks
72
BE Query Support
cities such as <NounPhrase>
President Bush <Verb>
<NounPhrase> is the capital of <NounPhrase>
reach me at <phone-number>
Any sequence of concrete terms and typed variables (NEAR is insufficient)
Functions (e.g., "head(<NounPhrase>)")
73
BE Operation
Like a generic search engine, BE:
downloads a corpus of pages
creates an index
uses the index to process queries efficiently
BE further requires:
a set of indexed types (e.g., "NounPhrase"), with a "recognizer" for each
string processing functions (e.g., "head()")
A BE system can only process types and functions that its index supports.
74
[Diagram: an inverted index; each term (as, billy, cities, friendly, give, mayors, nickels, seattle, such, words) points to a posting list: #docs, docid0, docid1, …, docid#docs-1.]
75
Query: such as
[Diagram: the posting lists for "as" (docids 21, 150, 322, 2501, …) and "such" (docids 99, 322, 426, 1309, …) are intersected:]
1. Test for equality
2. Advance the smaller pointer
3. Abort when a list is exhausted
Returned docs: 322
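The intersection step is the classic two-pointer merge over sorted posting lists; a minimal sketch (the docids are illustrative, chosen to match the example's result):

```python
def intersect(a, b):
    """Intersect two sorted posting lists: test for equality, advance the
    pointer with the smaller docid, stop when either list is exhausted."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

print(intersect([21, 104, 150, 322, 2501], [15, 99, 322, 426, 1309]))  # [322]
```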
76
In phrase queries, match positions as well.
[Diagram: posting-list entries for "such" and "as" augmented with position lists (#posns, pos0, pos1, …, pos#pos-1 per document); the phrase "such as" matches where the positions are adjacent.]
77
Neighbor Index
At each position in the index, store "neighbor text" that might be useful.
Let's index <NounPhrase> and <Adj-Term>.
"I love cities such as Atlanta."
[Diagram: e.g., at "cities", the Left neighbor is AdjT: "love".]
78
Neighbor Index
"I love cities such as Atlanta."
[Diagram, continued: at "such", the Left neighbors are AdjT: "cities" and NP: "cities"; at "love", they are AdjT: "I" and NP: "I".]
79
Neighbor Index
Query: "cities such as <NounPhrase>"
"I love cities such as Atlanta."
[Diagram: at "as", the Left neighbor is AdjT: "such", and the Right neighbors are AdjT: "Atlanta" and NP: "Atlanta"; the NP right neighbor binds the variable.]
80
"cities such as <NounPhrase>"
1. Find the phrase query positions, as with phrase queries.
2. If a term is adjacent to a variable, extract the typed value.
[Diagram: posting-list entries augmented with neighbor data (#neighbors, blk_offset; e.g., neighbor0 = AdjT-left: "such", neighbor1 = NP-right: "Atlanta"); in doc 19, starting at position 8: "I love cities such as Atlanta."]
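A hypothetical in-memory stand-in for the neighbor lookup (BE's real index is on disk and position-aware; here the phrase-position check is elided and all names, slots, and positions are illustrative):

```python
# term -> list of (doc_id, position, {neighbor_slot: typed_value})
neighbor_index = {
    "as":   [(19, 12, {"NP_right": "Atlanta", "AdjT_left": "such"})],
    "such": [(19, 11, {"AdjT_left": "cities", "NP_left": "cities"})],
}

def bind_variable(phrase_terms, var_slot):
    """Match the last concrete term of the phrase, then read the variable
    binding from its stored neighbor data (phrase-position check elided)."""
    bindings = []
    for doc_id, pos, neighbors in neighbor_index.get(phrase_terms[-1], []):
        if var_slot in neighbors:
            bindings.append((doc_id, neighbors[var_slot]))
    return bindings

print(bind_variable(["cities", "such", "as"], "NP_right"))  # [(19, 'Atlanta')]
```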
81
Current Research Directions
Modeling explicit and implicit network structures:
Modeling the evolution of explicit structure on the web, blogspace, Wikipedia
Modeling implicit link structures in text, collections, the web
Exploiting implicit and explicit social networks (e.g., for epidemiology)
Knowledge discovery from biological and medical data:
Automatic sequence annotation: bioinformatics, genetics
Actionable knowledge extraction from medical articles
Robust information extraction, retrieval, and query processing:
Integrating information in structured and unstructured sources
Robust search/question answering for medical applications
Confidence estimation for extraction from text and other sources
Detecting reliable signals from (noisy) text data (e.g., medical surveillance)
Accuracy (≠ authority) of online sources
Information diffusion/propagation in online sources:
Information propagation on the web
In collaborative sources (Wikipedia, MedLine)
82
Page Quality: In Search of an Unbiased Web Ranking [Cho, Roy, Adams, SIGMOD 2005]
"Popular pages tend to get even more popular, while unpopular pages get ignored by an average user."
83
Sic Transit Gloria Telae: Towards an Understanding of the Web's Decay [Bar-Yossef, Broder, Kumar, Tomkins, WWW 2004]
84
Modeling Social Networks for Epidemiology, security, …
Email exchange mapped onto cubicle locations.
85
Some Research Directions (overview repeated from slide 81; also lists query processing over unstructured text)
86
Mining Text and Sequence Data
Agichtein & Eskin, PSB 2004
ROC50 scores for each class and method
87
Some Research Directions (overview repeated from slide 81)
88
Structure and evolution of blogspace [Kumar, Novak, Raghavan, Tomkins, CACM 2004, KDD 2006]
Fraction of nodes in components of various sizes within Flickr and Yahoo! 360 timegraph, by week.
89
Current Research Directions (overview repeated from slide 81; also lists information propagation on the web, news)
90
Thank You
Details: http://www.mathcs.emory.edu/~eugene/