
The Selim and Rachel Benin School of Engineering and Computer Science

Keyword Proximity Search in Complex Data Graphs

• Konstantin Golenberg • Benny Kimelfeld • Yehoshua Sagiv

Overview

Benny Kimelfeld, The Hebrew University. Keyword Proximity Search in Complex Data Graphs, SIGMOD’08

Keyword Proximity Search

System Overview

Algorithm for Answer Generation

Ranking Answers

Conclusions & Future Work

Schema-Free Extraction of Data

Nowadays…

Exposure to many databases

• Different types (relational, XML, RDF…)

• Different schemas

• Traditional query paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema

• Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema

The natural (and popular) option: Keyword Search

− Problem: Inherently different from standard IR

Keyword Proximity Search (KPS)

Data have varying degrees of structure

– Relational (w/ foreign keys), XML (w/ id-references)

– Natural representation by a graph

– Usually, data-centric rather than document-centric

A query is a set of keywords

− No structural constraints

The Goal:

Extract meaningful parts of data w.r.t. the keywords

• Agrawal et al., ICDE’02 • Hristidis et al., VLDB’02,03, ICDE’03 • Bhalotia et al., VLDB’05 • Kacholia et al., VLDB’06 • Ding et al., ICDE’07 • Liu et al., SIGMOD’06 • Wang et al., VLDB’06 • Luo et al., SIGMOD’07 …

Example: Search in RDB

Cities

ID    Name        Population
22    Amsterdam   1101407
73    Brussels    951580

Organizations

ID    Name   Head Q.
135   EU     73
175   ESA    81

Countries

Code  Name         Area    Capital
NL    Netherlands  37330   22
B     Belgium      30510   73

Memberships

Country   Org.

Search: Belgium, Brussels


Brussels is the capital city of Belgium


Brussels hosts EU and Belgium is a member

Example: Search in XML

[Figure: XML data graph with two article elements. The article “On the Approximation of Maximum Satisfiability” has the author Mihalis Yannakakis. The article “Improved Approximation Algorithms for MAX SAT” has the authors Takao Asano and David P. Williamson. A references edge connects the two articles.]

Search: Yannakakis, Approximation

Yannakakis wrote a paper about Approximation


Yannakakis is cited by a paper about Approximation

Data Graphs

[Figure: data graph with structural nodes (company, supplies, supply, product, supplier, president, department, manager, hq) and keyword nodes (papers, A4, coffee, Summers, Paris)]

Structural and keyword nodes

Edges and nodes may have weights

– Weak relationships are penalized by large weights

Each keyword has one occurrence in the data graph (technical)

Queries

Q = { Summers, Cohen, coffee }

[Figure: the data graph from the previous slide]

Queries are sets of keywords from the data graph

An Answer is a Reduced Subtree

[Figure: a subtree of the data graph that connects the query keywords]

An answer is a subtree of the data graph

Contains all keywords of the query

Has no redundant edges (and nodes)

3 variants: directed, undirected, strong (undirected, keywords are leaves)

This paper
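The definition above can be checked mechanically. The following is a minimal sketch for the undirected variant only, assuming an answer is given as a node set plus an edge list; the function name and the data layout are illustrative and not taken from the paper. It relies on the observation that an undirected tree containing all keywords has no redundant part exactly when every leaf is a query keyword.

```python
from collections import defaultdict


def is_reduced_answer(nodes, edges, query_keywords):
    """Check the undirected reduced-subtree conditions (illustrative sketch)."""
    nodes = set(nodes)
    keywords = set(query_keywords)
    # 1. The answer contains all keywords of the query.
    if not keywords <= nodes:
        return False
    # 2. The answer is a tree: |nodes| - 1 edges and connected.
    if len(edges) != len(nodes) - 1:
        return False
    neighbors = defaultdict(set)
    for u, v in edges:
        if u not in nodes or v not in nodes:
            return False
        neighbors[u].add(v)
        neighbors[v].add(u)
    start = next(iter(nodes))
    seen = {start}
    stack = [start]
    while stack:
        for n in neighbors[stack.pop()]:
            if n not in seen:
                seen.add(n)
                stack.append(n)
    if seen != nodes:
        return False
    # 3. No redundant edges or nodes: every leaf must be a query keyword,
    #    otherwise the leaf (and its edge) could be removed.
    return all(n in keywords for n in nodes if len(neighbors[n]) <= 1)


# Hypothetical toy answer for the query {Belgium, Brussels}.
toy_nodes = {"country", "Belgium", "city", "Brussels"}
toy_edges = [("country", "Belgium"), ("country", "city"), ("city", "Brussels")]
print(is_reduced_answer(toy_nodes, toy_edges, {"Belgium", "Brussels"}))
```

Adding a dangling non-keyword node (for example an organization attached to the country) makes the check fail, which is the redundancy the definition rules out.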

Previous Solutions

• Lack of guarantees

− Highly relevant answers might be missed, and / or

− Inefficient algorithms

• Rather simple data sets – a (very) small number of relevant answers

− They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc.

− An answer is usually within the scope of an entity

→ e.g., the keywords appear in a single movie

• Crucial problems ignored

− In particular, the “repeated information” problem

− Especially pervasive in complex data graphs

Contributions

A system for keyword proximity search

• An algorithm for generating answers with guarantees

− Does not miss (valuable) answers

− Efficient (polynomial delay)

− Answers generated in a 2-approximate order by height

• A ranking technique that is aware of the repeated-information problem

− Gives preference to answers with low similarity to earlier ones

• Experimentation over a highly-cyclic data graph

− The Mondial database

− Many “meaningful” connections among keywords

The MONDIAL Database, Institute for Informatics, Georg-August-Universität Göttingen, http://www.dbis.informatik.uni-goettingen.de/Mondial/

Overview

Challenges

• Huge no. of answers; not instantiated!

− Not simple to generate all relevant answers, even if ranking is ignored

− For practical ranking functions, enumerating the answers in ranked order is probably impossible

• For example, finding the smallest answer is the intractable Steiner-tree problem

• Redundancy / repeated information

− Many answers are very similar (altogether, they provide a small amount of information)

− Crucial in complex (highly cyclic) data graphs

We employ a two-phase architecture:

Architecture: Generator + Ranker

User requests:

• search(keywords)

• next k answers

Answer Generator: generates the next M·k answers (simplified ranking function)

Ranker: ranks all answers generated up to now (minus the printed ones)

Output: top-k answers (relative to those that have already been printed)

Simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05]

Overview

Generating the Top Answers: Not Trivial!

To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:

Find the Answers in this Example!

[Figure: example data graph with nodes location, country, Brussels and organization]

The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05]

Answers are directed subtrees

∀ nodes v (in a “good” order) and keyword occurrences:

Generate the min-height subtree emanating from v

[Figure: the example data graph (location, country, Brussels, organization) with the generated answers]

What about this answer?

Never generated!
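To make the “min-height subtree emanating from v” step concrete, here is a simplified sketch of a backward search: one Dijkstra run per keyword over the reversed edges, followed by picking the root whose farthest keyword is closest. This is not the BANKS implementation (which interleaves the iterators); the function names and the `(source, target, weight)` edge layout are assumptions made for illustration, and the sketch returns only the root and its height.

```python
import heapq


def backward_distances(reverse_adj, keyword):
    """Dijkstra from a keyword node over reversed edges.
    Returns, for every node that can reach the keyword, its distance to it."""
    dist = {keyword: 0.0}
    heap = [(0.0, keyword)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in reverse_adj.get(v, ()):
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def min_height_root(edges, keywords):
    """Return (root, height) minimizing the largest root-to-keyword distance,
    or None if no node reaches all keywords. Assumes keywords is non-empty."""
    reverse_adj = {}
    for u, v, w in edges:  # directed edge u -> v with weight w
        reverse_adj.setdefault(v, []).append((u, w))
    per_keyword = [backward_distances(reverse_adj, k) for k in keywords]
    best = None
    for v in per_keyword[0]:
        if all(v in dist for dist in per_keyword):
            height = max(dist[v] for dist in per_keyword)
            if best is None or height < best[1]:
                best = (v, height)
    return best


# Hypothetical toy graph; node names only mimic the slides.
toy_edges = [
    ("country", "Belgium", 1), ("country", "city", 1), ("city", "Brussels", 1),
    ("org", "city", 1), ("org", "country", 2),
]
print(min_height_root(toy_edges, ["Belgium", "Brussels"]))
```

In the toy graph both `country` and `org` reach the two keywords, and `country` wins because its farthest keyword is at distance 2 rather than 3.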

The NUITS Approach [Ding et al., ICDE’07]

Answers are undirected subtrees

∀ nodes v (in a “good” order):

Generate the min-weight subtree that includes v

[Figure: the example data graph (location, country, Brussels, organization), shown once per choice of v]

• This node is redundant. It is actually the previous answer!

• Again, the previous answer!

• This node is redundant

What about this answer?

Never generated!

Severe limit on # of generated answers! (≤ one per node)

The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02]

∀ possible queries Q (from the schema) in inc. size:

Evaluate Q over the database

All answers are generated in ranked order!

Easy to implement!

DBMS queries – No in-memory graph algorithms

[Figure: the example data graph (location, country, Brussels, organization)]

Inefficient!

But many queries do not generate any answer at all!

Worst case: exponential in the data

Limited Ranking!

by the query (rather than the answer) weight

We Need Generators w/ Guarantees!

All answers are generated

− In particular, each of the “relevant” answers is produced at some point (100% recall is achievable)

Controlled order of answers

− For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order

Efficiency

− The top-k answers should be generated efficiently

− Bound on time between successive answers

Order by Increasing Weight / Height

[Figure: a sequence of answers over the keywords A, B, C. If one answer precedes another, then its weight / height is ≤ that of the other. The first k answers are the top-k answers.]

Approximate and Heuristic Orders

Approximate order: There is a provable bound on the extent to which the actual order can deviate from the optimal one

Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee

C-Approximate Order (inc. Weight / Height)

[Figure: a sequence of answers over the keywords A, B, C. If one answer precedes another, then its weight / height is ≤ C times that of the other. The first k answers are a C-approximation of the top-k answers [Fagin et al., PODS’01].]
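As an illustration of the C-approximate order, the sketch below checks a produced sequence of answer weights (or heights) against the condition that every answer is at most C times as heavy as any answer produced after it. The function name is hypothetical, and the condition is the usual reading of a C-approximate order rather than a quotation from the slides.

```python
def is_c_approximate_order(values, c):
    """values[i] is the weight (or height) of the i-th produced answer.
    True if values[i] <= c * values[j] holds for all i < j."""
    smallest_later = float("inf")
    # Scan from the end so that each value is compared with the minimum
    # of everything produced after it.
    for value in reversed(values):
        if value > c * smallest_later:
            return False
        smallest_later = min(smallest_later, value)
    return True


print(is_c_approximate_order([2, 4, 3, 6], 1))  # exact order is violated by 4 before 3
print(is_c_approximate_order([2, 4, 3, 6], 2))  # but the order is 2-approximate
```

With C = 1 the check reduces to an exact increasing order.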

Our Approach

• PODS’06: Enum. by (exact / approx) inc. weight

− Problem: Repeated application of Steiner-tree alg’s

− “Heavy” – hard to implement efficiently

• Here: Follow the basic approach of PODS’06

• But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order

− Recall: BANKS might miss highly relevant answers

• Thus, we bypass Steiner trees and obtain a much faster algorithm

• Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay

An Overview of the Algorithm

Task: Enum. by (2-approx.) increasing height

↓ Lawler / Yen method

Find (a 2-approx. of) the shortest answer under constraints

↓ The intricate part …

Find the shortest answer (w/o constraints)

↓ Backward-search (Dijkstra) iterators (~ BANKS)

Types of Constraints:

• Inclusion: “include edge e”

• Exclusion: “exclude edge e”
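The Lawler / Yen method mentioned above can be sketched generically: after an answer is printed, the remaining answers are split into disjoint subspaces described by inclusion and exclusion constraints, and the best answer of each subspace is queued. The code below is a sketch under stated assumptions, not the authors' algorithm: answers are modelled as frozensets of edge names, `best_under_constraints` is a hypothetical solver supplied by the caller, and the toy search space exists only to make the example runnable.

```python
import heapq
from itertools import count


def lawler_partition(inclusion, exclusion, answer_edges):
    """Constraint pairs that cover the answers satisfying (inclusion, exclusion),
    other than the given answer, without overlap."""
    inclusion = frozenset(inclusion)
    exclusion = frozenset(exclusion)
    forced = set(inclusion)
    subspaces = []
    for edge in answer_edges:
        if edge in inclusion:
            continue  # already forced, so it cannot be excluded
        # Agree with the answer on the earlier edges, but exclude this one.
        subspaces.append((frozenset(forced), exclusion | {edge}))
        forced.add(edge)
    return subspaces


def enumerate_answers(best_under_constraints):
    """Yield (height, answer) pairs; the order is as good as the solver is."""
    tie = count()  # tie-breaker so the heap never compares answers
    queue = []
    first = best_under_constraints(frozenset(), frozenset())
    if first is not None:
        heapq.heappush(queue, (first[0], next(tie), first[1], frozenset(), frozenset()))
    while queue:
        height, _, answer, inc, exc = heapq.heappop(queue)
        yield height, answer
        for sub_inc, sub_exc in lawler_partition(inc, exc, sorted(answer)):
            found = best_under_constraints(sub_inc, sub_exc)
            if found is not None:
                heapq.heappush(queue, (found[0], next(tie), found[1], sub_inc, sub_exc))


# Toy search space (hypothetical): three answers as edge sets with heights.
TOY_ANSWERS = {frozenset({"e1", "e2"}): 2, frozenset({"e1", "e3"}): 3, frozenset({"e4"}): 5}


def toy_solver(inclusion, exclusion):
    """Brute-force stand-in for 'shortest answer under constraints'."""
    feasible = [(h, a) for a, h in TOY_ANSWERS.items()
                if inclusion <= a and not (exclusion & a)]
    return min(feasible, key=lambda pair: pair[0]) if feasible else None


for height, answer in enumerate_answers(toy_solver):
    print(height, sorted(answer))
```

If the solver returns only a 2-approximation of the shortest answer in each subspace, the same loop produces the answers in an approximate rather than exact order.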

Finding an Answer under Constraints

• Inclusion: “include edge e”

• Exclusion: “exclude edge e”

[Figure: example data graph with nodes Belgium, location, country, Brussels and organization]

Handling exclusion constraints is easy

Simply remove the excluded edges from the graph
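Handling exclusion constraints amounts to filtering the edge list before the search runs. A minimal sketch, assuming edges are `(source, target)` pairs; the edges shown are hypothetical and only reuse node names from the example.

```python
def remove_excluded_edges(edges, excluded):
    """Return the graph's edges without the excluded ones."""
    excluded = set(excluded)
    return [edge for edge in edges if edge not in excluded]


# Hypothetical edges over the example's node names.
graph_edges = [("country", "Belgium"), ("country", "location"), ("location", "Brussels")]
print(remove_excluded_edges(graph_edges, [("country", "location")]))
```

Any unconstrained search can then be run on the filtered graph unchanged.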

Inclusion Constraints are the Problem

• Inclusion: “include edge e”

• Exclusion: “exclude edge e”

[Figure: the example data graph (Belgium, location, country, Brussels, organization) with a redundant edge marked]

The shortest subtree that contains the keywords and satisfies the constraints

But it is not an answer!

• Not reduced (has redundancy)

• Moreover, includes a previously printed answer

• Sometimes, no answer at all!

The Correct Answer

• Inclusion: “include edge e”

• Exclusion: “exclude edge e”

Technique:

1. Generate a min-height subtree (as in the wrong solution)

2. Not an answer? → modify

• Intricate to guarantee 2-approx.

• Details in the proceedings

Running Times

[Figure: running times as a function of # keywords (2 to 10), for 100 answers and for 1000 answers]

Each entry is an avg. of 4 queries

Alg. Order vs. Weight Order

[Figure: x-axis: Weight-Based Rank (10 to 100); y-axis ticks from 0 to 900; one series per query size, from 2 to 10 keywords]

How many answers are generated in order to obtain the top-k (among 1000) according to weight?

Each entry is an avg. of 4 queries
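The quantity asked about above can be computed from the weights of the answers in the order in which they were generated. The sketch below is one possible way to do it; the function name is hypothetical and ties in weight are broken by generation order, which is an assumption.

```python
def answers_needed_for_top_k(weights_in_generation_order, k):
    """Number of answers that must be generated before the k lightest
    answers (among all generated ones) have all appeared. Assumes 1 <= k <= n."""
    weights = list(weights_in_generation_order)
    # Positions sorted by weight; ties are broken by generation order.
    by_weight = sorted(range(len(weights)), key=lambda i: (weights[i], i))
    return max(by_weight[:k]) + 1


# Toy weights (made up): the two lightest answers are the 2nd and 4th generated.
print(answers_needed_for_top_k([3, 1, 4, 2, 5], 2))
```

For the experiment described above, the list would hold the weights of the 1000 generated answers.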

Effective Approx. Ratio: Height ↑

[Figure: four panels (2, 3, 4 and 5 keywords). x-axis: k (answers), 100 to 9100; y-axis: effective approx. ratio, worst / best (among first k)]

Effective Approx. Ratio: Weight ↑

[Figure: four panels (2, 3, 4 and 5 keywords). x-axis: k (answers), 100 to 9100; y-axis: effective approx. ratio]

Overview

The Basic Ranking Function

$\text{abs-rel}(a) = \dfrac{1}{\text{weight}(a)}$

$\text{weight}(a) = \sum_{\text{node} \in a} \text{weight}(\text{node}) + \sum_{\text{edge} \in a} \text{weight}(\text{edge})$

Determining the Weight of an Edge

[Figure: example edges between organization and country nodes, between country nodes (borders), and between a country and its capital]

Many org’s enter country → weak connection (large weight)

org. enters many countries → weak connection (large weight)

Strong connection (small weight)

Strongest!

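The Basic Ranking Function (cont’d)

$\text{abs-rel}(a) = \dfrac{1}{\text{weight}(a)}$

$\text{weight}(a) = \sum_{\text{node} \in a} \text{weight}(\text{node}) + \sum_{\text{edge} \in a} \text{weight}(\text{edge})$

$\text{weight}(\text{node}) = 1$ (fixed)

$\text{weight}(\text{edge}) = \log\bigl(1 + \alpha \cdot \text{out}(v_1 \to t_2) + (1 - \alpha) \cdot \text{in}(t_1 \to v_2)\bigr)$, where $\text{edge} = (v_1, v_2)$ and $\text{tag}(v_i) = t_i$

• $\text{out}(v_1 \to t_2)$: # t2 nodes with edges from v1

• $\text{in}(t_1 \to v_2)$: # t1 nodes with edges to v2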
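The edge-weight formula above can be evaluated directly from an edge list and a tag map. This is a minimal sketch: the natural logarithm, the default α = 0.5, the function names and the toy graph are all assumptions, since the slides fix none of them.

```python
import math


def edge_weight(edges, tag, v1, v2, alpha=0.5):
    """weight(edge) = log(1 + alpha * out(v1 -> t2) + (1 - alpha) * in(t1 -> v2))."""
    t1, t2 = tag[v1], tag[v2]
    # out(v1 -> t2): number of t2-tagged nodes with edges from v1.
    out_count = len({b for a, b in edges if a == v1 and tag[b] == t2})
    # in(t1 -> v2): number of t1-tagged nodes with edges to v2.
    in_count = len({a for a, b in edges if b == v2 and tag[a] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)


def answer_weight(num_nodes, edge_weights, node_weight=1.0):
    """weight(a): fixed weight per node plus the sum of the edge weights."""
    return num_nodes * node_weight + sum(edge_weights)


def abs_rel(weight):
    """abs-rel(a) = 1 / weight(a)."""
    return 1.0 / weight


# Hypothetical graph: EU has three member countries, ESA has one.
toy_tag = {"EU": "organization", "ESA": "organization",
           "B": "country", "NL": "country", "F": "country"}
toy_edges = [("EU", "B"), ("EU", "NL"), ("EU", "F"), ("ESA", "B")]
print(edge_weight(toy_edges, toy_tag, "EU", "B"))
```

An edge from an organization with many member countries, or into a country that belongs to many organizations, receives a larger weight and therefore counts as a weaker connection.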

Answers with High Similarity

[Figure: several answers that connect Netherlands, Belgium and France (country nodes), either through an organization (EU, ADB, NATO, ESA) or through borders edges]

Relevant answers, but …

But each individual answer is relevant!

Combinations of Connections

[Figure: an answer that connects Belgium, France and Netherlands (country nodes) through both an organization and a borders edge]

Dynamic Ranking

[Figure: a pool of candidate answers that connect Netherlands, Belgium and France through organizations (EU, ADB, NATO, ESA) and through borders edges, and the output to which answers are moved one at a time]

Candidate Answers

Output

Next-Answer()

a ← extract-top-candidate()

print(a)

for all candidates c and pairs of keywords k1, k2

if c and a connect k1 and k2 similarly, then penalize(c)

What does it mean?

[Figure: two answers over Belgium, France and Netherlands, with a borders edge and the organization ESA]

Two Types of “Similarity”

k1, k2 = Belgium, France

[Figure: pairs of answers in which the two country nodes are connected through an organization (EU, ESA)]

• The same connection. Penalty: 1

• Isomorphic connection (same schema). Penalty: p (≤1)

2 options:

• Sum over printed answers

• Max over printed answers
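The two penalty types and the two aggregation options can be sketched as follows. The data layout is an assumption made for illustration: each answer maps a keyword pair to the node path that connects the pair, two paths are “the same connection” when they are identical, and they are “isomorphic” when their tag sequences are equal. Unrelated connections are assumed to get penalty 0.

```python
from itertools import combinations


def pair_penalty(path_c, path_a, tag, p):
    """Penalty for how similarly two answers connect one keyword pair."""
    if path_c == path_a:
        return 1.0  # the same connection
    if [tag[n] for n in path_c] == [tag[n] for n in path_a]:
        return p  # isomorphic connection (same schema)
    return 0.0  # assumption: no penalty otherwise


def rpt_inf(candidate, printed, keywords, tag, p, mode="sum"):
    """Repeated information of a candidate relative to the printed answers.
    mode is "sum" or "max" over the printed answers, per keyword pair."""
    total = 0.0
    for pair in combinations(sorted(keywords), 2):
        penalties = [pair_penalty(candidate[pair], a[pair], tag, p) for a in printed]
        if penalties:
            total += sum(penalties) if mode == "sum" else max(penalties)
    return total


# Hypothetical example for k1, k2 = Belgium, France.
toy_tag = {"Belgium": "keyword", "France": "keyword", "b": "country", "f": "country",
           "EU": "organization", "ESA": "organization", "NATO": "organization"}
PAIR = ("Belgium", "France")


def via(org):
    """Answer that connects the two countries through an organization."""
    return {PAIR: ("Belgium", "b", org, "f", "France")}


print(rpt_inf(via("NATO"), [via("EU"), via("ESA")], ["Belgium", "France"], toy_tag, 0.1, "sum"))
print(rpt_inf(via("NATO"), [via("EU"), via("ESA")], ["Belgium", "France"], toy_tag, 0.1, "max"))
```

With the sum option the penalty keeps growing as more isomorphic answers are printed, whereas the max option caps it at the single most similar printed answer.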

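The General Ranking Function

$\text{abs-rel}(c) = \dfrac{1}{\text{weight}(c)}$

$\text{rpt-inf}(c) = \sum_{k_1, k_2 \in \text{kw's}} \; \sum_{\text{printed answers}} (p \text{ or } 1)$ or $\sum_{k_1, k_2 \in \text{kw's}} \; \max_{\text{printed answers}} (p \text{ or } 1)$

$\text{score}(c) = \dfrac{1}{1 + \varepsilon \cdot \text{rpt-inf}(c)} \cdot \text{abs-rel}(c)$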
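A sketch of how the score could drive the choice of the next answer. It assumes the reading score(c) = abs-rel(c) / (1 + ε · rpt-inf(c)) of the formula above, which is an interpretation of the slide rather than a confirmed definition. The candidate layout (dicts with `abs_rel` and `kind`) and the stand-in `same_kind_count` for rpt-inf are hypothetical.

```python
def score(abs_rel, rpt_inf, epsilon):
    """Assumed reading: relevance discounted by repeated information."""
    return abs_rel / (1.0 + epsilon * rpt_inf)


def next_answer(candidates, printed, rpt_inf_of, epsilon):
    """Move the best-scoring candidate to the printed list and return it.
    rpt_inf_of(candidate, printed) is supplied by the caller."""
    best = max(candidates,
               key=lambda c: score(c["abs_rel"], rpt_inf_of(c, printed), epsilon))
    candidates.remove(best)
    printed.append(best)
    return best


def same_kind_count(candidate, printed):
    """Toy stand-in for rpt-inf: printed answers with the same connection kind."""
    return sum(1 for a in printed if a["kind"] == candidate["kind"])


def ranked_names(epsilon):
    """Print order of three made-up candidates for a given epsilon."""
    candidates = [
        {"name": "A", "abs_rel": 0.50, "kind": "organization"},
        {"name": "B", "abs_rel": 0.45, "kind": "organization"},
        {"name": "C", "abs_rel": 0.40, "kind": "borders"},
    ]
    printed = []
    while candidates:
        next_answer(candidates, printed, same_kind_count, epsilon)
    return [a["name"] for a in printed]


print(ranked_names(0.0))  # ranking by abs-rel alone
print(ranked_names(1.0))  # repeated information is penalized
```

Recomputing the scores at every step has the same effect as penalizing the remaining candidates after each printed answer, as in Next-Answer().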

Score Loss vs. Diversity

[Figure: two panels, “Sum, p=1.0” (top) and “Max, p=0.1” (bottom). x-axis: 0 to 10; y-axis: % of max. (0 to 75). Series: Score (1/weight), Connections (u.t. iso.), Connections]

• 5 keywords

• Avg. of 4 queries

• Top-20 answers

The bottom configuration is better than the top one: smaller reduction of score for a similar / higher degree of diversity

Overview

Conclusions

• KPS in complex data graphs has inherent problems that are ignored in existing systems

• 2-component arch.: answer generator & ranker

• 1st component: Enum. algorithm w/ guarantees

− Efficient, correct (no missed answers), 2-approximate order by height

− In the paper: Ext. to OR semantics (exact order)

• 2nd component: Dynamically ranks candidates by penalizing them for repeated information

− Our experiments over Mondial suggest a tuning of the parameters that gives the best tradeoff between information gain and score loss

Current & Future Research

• Improve / optimize the answer generator

− Successful: Parallelism

− Concurrent queries?

• Implement different answer generators

− E.g., by (approx.) increasing weight [KS-PODS’06]

• Assessment by humans

− Relevancy / repeated information

− Methodology example: [Zhang et al., SIGIR’02]

• Other aspects

− Answer presentation →

Answer Presentation

• On the Web, we instantly get the meaning of an answer (Web page) by the <title>, URL and, possibly, a snippet of the text

• In KPS, understanding the meaning of a subtree is not straightforward: we need to derive the semantics from the graphical presentation

What’s the Meaning of this Answer?

A snapshot of BANKS demo (http://www.cse.iitb.ac.in/banks/), IMDB

Harder in XML!

• No division into relations (everything is element / attribute)

• What information is needed to describe a node?

Answer Presentation

• On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snippet of the text

• In KPS, understanding the meaning of a subtree is cumbersome since we need to derive the semantics from the presentation

Solution (under develop.):

• Graphical presentation is based on restructuring answers in terms of entities, properties and relationships

• Apply heuristics for determining the minimal set of properties required for each entity

Thank you!

Questions?
