Post on 27-Mar-2015
The Selim and Rachel Benin School of Engineering and Computer Science
Keyword Proximity Search in Complex Data Graphs
• Konstantin Golenberg
• Benny Kimelfeld
• Yehoshua Sagiv
Benny Kimelfeld, The Hebrew University, SIGMOD’08

Overview
• Keyword Proximity Search
• System Overview
• Algorithm for Answer Generation
• Ranking Answers
• Conclusions & Future Work

Schema-Free Extraction of Data
Nowadays, users are exposed to many databases:
• Different types (relational, XML, RDF, …)
• Different schemas
• Traditional querying paradigms (e.g., SQL, XQuery, SPARQL) are not easy to use and, moreover, require a thorough understanding of the schema
• Goal: Enable users to instantly pose (inaccurate) queries without knowing the schema
The natural (and popular) option: Keyword Search
− Problem: Inherently different from standard IR

Keyword Proximity Search (KPS)
Data have varying degrees of structure
– Relational (w/ foreign keys), XML (w/ id-references)
– Natural representation by a graph
– Usually, data-centric rather than document-centric
A query is a set of keywords
− No structural constraints
The Goal: Extract meaningful parts of data w.r.t. the keywords
• Agrawal et al., ICDE’02
• Hristidis et al., VLDB’02,03, ICDE’03
• Bhalotia et al., VLDB’05
• Kacholia et al., VLDB’06
• Ding et al., ICDE’07
• Liu et al., SIGMOD’06
• Wang et al., VLDB’06
• Luo et al., SIGMOD’07
• …

Example: Search in RDB

Cities
ID    Name        Population
22    Amsterdam   1101407
73    Brussels    951580

Organizations
ID    Name   Head Q.
135   EU     73
175   ESA    81

Countries
Code   Name          Area    Capital
NL     Netherlands   37330   22
B      Belgium       30510   73

Memberships
Country   Org.
B         135
NL        135

search: Belgium, Brussels
Benny KimelfeldKeyword Proximity Search in Complex Data GraphsKeyword Proximity Search in Complex Data Graphs The Hebrew UniversitySIGMOD’08SIGMOD’08
IDNamePopulation
22Amsterdam1101407
73Brussels951580
IDNameHead Q.
135EU73
175ESA81
CountryOrg.
B135
NL135
search Belgium , Brussels
CodeNameAreaCapital
NLNetherlands3733022
BBelgium3051073
CitiesCities OrganizationsOrganizations
CountriesCountries MembershipsMemberships
Brussels is the capital city of Belgium
Brussels hosts EU and Belgium is a member
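The two answers above correspond to two different ways of connecting the Belgium tuple to the Brussels tuple through foreign keys. A minimal sketch of that idea follows, assuming the tuples are turned into nodes and each foreign-key reference into an undirected edge; the node naming scheme and the helper names `link` and `simple_paths` are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# Tuples of the example, keyed by primary key.
cities = {22: "Amsterdam", 73: "Brussels"}
organizations = {135: ("EU", 73), 175: ("ESA", 81)}            # id -> (name, head-quarters city id)
countries = {"NL": ("Netherlands", 22), "B": ("Belgium", 73)}  # code -> (name, capital city id)
memberships = [("B", 135), ("NL", 135)]                         # (country code, organization id)

graph = defaultdict(set)

def link(u, v):
    """Add an undirected edge for one foreign-key reference."""
    graph[u].add(v)
    graph[v].add(u)

for code, (_, capital) in countries.items():
    if capital in cities:
        link(("country", code), ("city", capital))
for org_id, (_, hq) in organizations.items():
    if hq in cities:  # city 81 is not among the listed tuples, so ESA gets no edge
        link(("org", org_id), ("city", hq))
for i, (code, org_id) in enumerate(memberships):
    link(("membership", i), ("country", code))
    link(("membership", i), ("org", org_id))

def simple_paths(src, dst, path=()):
    """Yield every simple path from src to dst as a list of nodes."""
    path = path + (src,)
    if src == dst:
        yield list(path)
        return
    for nxt in sorted(graph[src]):
        if nxt not in path:
            yield from simple_paths(nxt, dst, path)

# All connections between the Belgium tuple and the Brussels tuple, shortest first.
paths = sorted(simple_paths(("country", "B"), ("city", 73)), key=len)
for p in paths:
    print(" - ".join(f"{table}:{key}" for table, key in p))
```

The shorter path is the capital reference; the longer one goes through the membership tuple and the EU tuple, whose head-quarters reference points to Brussels.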

Example: Search in XML
[Figure: a dblp XML tree with two article elements. One has the title “On the Approximation of Maximum Satisfiability” and the author Mihalis Yannakakis. The other has the title “Improved Approximation Algorithms for MAX SAT”, the authors Takao Asano and David P. Williamson, and a references element with a cite link to the first article.]
search: Yannakakis, Approximation
Yannakakis wrote a paper about Approximation
Yannakakis is cited by a paper about Approximation

Data Graphs
[Figure: a data graph with structural nodes (company, supplies, supply, product, supplier, president, department, manager, hq) and keyword nodes (papers, A4, coffee, Cohen, Summers, Paris).]
• Structural and keyword nodes
• Edges and nodes may have weights
– Weak relationships are penalized by large weights
• Each keyword has one occurrence in the data graph (technical)

Queries
Q = { Summers, Cohen, coffee }
[Figure: the same data graph as on the previous slide.]
Queries are sets of keywords from the data graph

An Answer is a Reduced Subtree
[Figure: the data graph of the previous slides, with supplier relabeled as customer.]
An answer is a subtree of the data graph
• Contains all keywords of the query
• Has no redundant edges (and nodes)
3 variants: directed, undirected, strong (undirected, kw’s are leaves)
This paper
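The definition above can be checked mechanically. The following is a minimal sketch for the undirected variant only, under the assumption that "no redundant edges (and nodes)" amounts to "every leaf of the tree is a keyword"; the function name `is_reduced_answer` and the small example graph are hypothetical.

```python
def is_reduced_answer(edges, keywords):
    """Undirected variant. `edges` is a list of (u, v) pairs, `keywords` a set of nodes."""
    keywords = set(keywords)
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # With no edges, the only possible answer is a single keyword node.
    nodes = set(adj) or set(keywords)
    if not keywords <= nodes:
        return False  # some keyword is missing
    # Tree test, part 1: a tree on n nodes has exactly n - 1 distinct edges.
    if len({frozenset(e) for e in edges}) != len(nodes) - 1:
        return False
    # Tree test, part 2: all nodes are reachable from an arbitrary start node.
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for n in adj.get(stack.pop(), ()):
            if n not in seen:
                seen.add(n)
                stack.append(n)
    if seen != nodes:
        return False
    # Reduced: a non-keyword leaf could be removed, so every leaf must be a keyword.
    return all(n in keywords for n in nodes if len(adj.get(n, ())) <= 1)


query = {"Summers", "Cohen", "coffee"}
tree = [("company", "president"), ("president", "Cohen"),
        ("company", "department"), ("department", "Summers"),
        ("company", "supplies"), ("supplies", "coffee")]
print(is_reduced_answer(tree, query))                               # True
print(is_reduced_answer(tree + [("department", "manager")], query))  # False
```

In the second call the extra `manager` node is a leaf that is not a keyword, so the subtree is not reduced.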

Previous Solutions
• Lack of guarantees
− Highly relevant answers might be missed, and / or
− Inefficient algorithms
• Rather simple data sets – a (very) small number of relevant answers
− They considered data that are essentially collections of entities, namely, DBLP, IMDB, Lyrics, etc.
− An answer is usually within the scope of an entity
→ e.g., the keywords appear in a single movie
• Crucial problems ignored
− In particular, the “repeated information” problem
− Especially pervasive in complex data graphs

Contributions
A system for keyword proximity search
• An algorithm for generating answers with guarantees
− Does not miss (valuable) answers
− Efficient (polynomial delay)
− Answers generated in a 2-approximate order by height
• A ranking technique that is aware of the repeated-information problem
− Gives preference to answers with low similarity to earlier ones
• Experimentation over a highly cyclic data graph
− The Mondial database
− Many “meaningful” connections among keywords
The MONDIAL Database, Institute for Informatics, Georg-August-Universität Göttingen
http://www.dbis.informatik.uni-goettingen.de/Mondial/

Challenges
• Huge no. of answers; not instantiated!
− Not simple to generate all relevant answers, even if ranking is ignored
− For practical ranking functions, enumerating the answers in ranked order is probably impossible
→ For example, finding the smallest answer is the intractable Steiner-tree problem
• Redundancy / repeated information
− Many answers are very similar (altogether, they provide a low amount of information)
− Crucial in complex (highly cyclic) data graphs
We employ a two-phase architecture:

Architecture: Generator + Ranker
• User: search(keywords), then next k answers
• Answer Generator: generates the next M·k answers (simplified ranking function)
• Ranker: ranks all answers generated up to now (minus the printed ones)
• Output: top-k answers (relative to those that have already been printed)
Simplified ranking at first [Bhalotia et al., ICDE’02, VLDB’05]
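The interaction between the two components can be sketched as a loop: each request pulls the next M·k answers from the generator, adds them to the pool of not-yet-printed candidates, and prints the k best of the pool. The sketch below is a simplification that assumes a static ranking key; the ranker described later in the talk is dynamic. The function name `search` and the default `M=3` are illustrative assumptions.

```python
from itertools import islice

def search(generator, rank_key, k, M=3):
    """Yield successive batches of at most k answers.

    `generator` must be an iterator that produces answers in its own
    (simplified) order; `rank_key` maps an answer to a sortable value,
    smaller meaning better.
    """
    pool = []  # answers generated so far, minus the printed ones
    while True:
        pool.extend(islice(generator, M * k))  # generator phase: next M*k answers
        if not pool:
            return
        pool.sort(key=rank_key)                # ranker phase: rank the whole pool
        batch, pool = pool[:k], pool[k:]
        yield batch                            # top-k relative to what was printed


# Toy run: answers are numbers, and a smaller number is a better answer.
batches = list(search(iter([5, 3, 8, 1, 9, 2, 7]), rank_key=lambda a: a, k=2, M=2))
print(batches)
```

In the toy run the answer 2 is printed after 3 because it had not been generated when the first batch was ranked, which is what "top-k relative to those that have already been printed" allows.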

Generating the Top Answers: Not Trivial!
To demonstrate the difficulty of generating the “good” (top) answers, let’s see how existing approaches operate on a simple example:

Find the Answers in this Example!
[Figure: a small data graph with nodes labeled organization, headq, name, EU, location, country, city, name and Brussels.]

The BANKS Approach [Bhalotia et al., ICDE’02, VLDB’05]
Answers are directed subtrees
∀ nodes v (in a “good” order) and keyword occurrences:
    Generate the min-height subtree emanating from v
[Figure: the example data graph, shown twice.]
What about this answer?
[Figure: the example data graph with a further answer.]
Never generated!

The NUITS Approach [Ding et al., ICDE’07]
Answers are undirected subtrees
∀ nodes v (in a “good” order):
    Generate the min-weight subtree that includes v
[Figure: the example data graph, shown twice on each of four slides.]
• This node is redundant: it is actually the previous answer!
• This node is redundant: again, the previous answer!
• What about this answer? Never generated!
Severe limit on # of generated answers! (≤ one per node)

The DISCOVER / DBXplorer Approach [Hristidis et al., VLDB’02,03, ICDE’03] [Agrawal et al., ICDE’02]
∀ possible queries Q (from the schema) in inc. size:
    Evaluate Q over the database
[Figure: the example data graph.]
• All answers are generated in ranked order!
• Easy to implement! DBMS queries – no in-memory graph algorithms
But:
• Inefficient! Many queries do not generate any answer at all (worst case: exponential in the data)
• Limited ranking! By the query (rather than the answer) weight

We Need Generators w/ Guarantees!
• All answers are generated
− In particular, each of the “relevant” answers is produced at some point (100% recall is achievable)
• Controlled order of answers
− For instance, increasing weight, increasing height, approximate (what is the ratio?) / heuristic order
• Efficiency
− The top-k answers should be generated efficiently
− Bound on time between successive answers

Order by Increasing Weight / Height
[Figure: a sequence of answer trees over keywords A, B, C.]
If answer a is generated before answer a′, then weight(a) ≤ weight(a′) (respectively, height(a) ≤ height(a′))
The first k generated answers are the top-k answers

Approximate and Heuristic Orders
• Approximate order: There is a provable bound on the extent to which the actual order can deviate from the optimal one
• Heuristic order: Intuitively, expected to be close to the optimal order, but there is no guarantee

C-Approximate Order (inc. Weight / Height)
[Figure: a sequence of answer trees over keywords A, B, C.]
If answer a is generated before answer a′, then weight(a) ≤ C · weight(a′) (respectively, height(a) ≤ C · height(a′))
The first k generated answers are a C-approximation of the top-k answers [Fagin et al., PODS’01]
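A small check makes the notion of a C-approximate order concrete. It is a sketch under the assumption that the order is C-approximate exactly when no answer is generated before another answer that is more than a factor C better; the function name `is_c_approximate` is hypothetical.

```python
def is_c_approximate(values, c):
    """`values[i]` is the height (or weight) of the i-th generated answer.

    Returns True when values[i] <= c * values[j] holds for every i < j.
    """
    running_max = float("-inf")
    for v in values:
        # The largest value generated so far must not exceed c times this one.
        if running_max > c * v:
            return False
        running_max = max(running_max, v)
    return True


print(is_c_approximate([2, 4, 3, 5, 6], c=2))  # True: 4 precedes 3, and 4 <= 2 * 3
print(is_c_approximate([2, 4, 3, 5, 6], c=1))  # False: c = 1 demands an exact order
```

With C = 1 the check degenerates to the exact increasing order of the previous slide.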

Our Approach
• PODS’06: Enum. by (exact / approx) inc. weight
− Problem: Repeated application of Steiner-tree alg’s
− “Heavy” – hard to implement efficiently
• Here: Follow the basic approach of PODS’06
• But, we adopt the BANKS idea of using height (≠ weight) for the enumeration order
− Recall: BANKS might miss highly relevant answers
• Thus, we bypass Steiner trees and obtain a much faster algorithm
• Our alg. has all 3 guarantees: answers are not missed, approximate order, poly. delay

An Overview of the Algorithm
Task: Enum. by (2-approx.) increasing height
− Lawler / Yen method
− Types of constraints:
  • Inclusion: “include edge e”
  • Exclusion: “exclude edge e”
Task: Find (a 2-approx. of) the shortest answer under constraints
− The intricate part …
Task: Find the shortest answer (w/o constraints)
− Backward-search (Dijkstra) iterators (~ BANKS)
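The unconstrained task can be illustrated with a simplified backward search. The sketch below runs one complete Dijkstra per keyword over reversed edges and then picks the root whose farthest keyword is nearest; this is an assumption-laden simplification, since BANKS interleaves its iterators rather than running each to completion. Here "height" is taken to be the largest root-to-keyword distance, the graph is a dict of dicts `graph[u][v] = weight`, and the function names are hypothetical.

```python
import heapq

def backward_distances(graph, keyword):
    """Dijkstra over reversed edges: dist[v] is the length of a shortest
    directed path from v to `keyword` in the original graph."""
    reverse = {}
    for u, nbrs in graph.items():
        for v, w in nbrs.items():
            reverse.setdefault(v, {})[u] = w
    dist = {keyword: 0.0}
    heap = [(0.0, keyword)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale heap entry
        for u, w in reverse.get(v, {}).items():
            nd = d + w
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist

def min_height_root(graph, keywords):
    """Return (height, root) of a min-height directed subtree that reaches
    every keyword, or None if no node reaches all of them.
    Assumes `keywords` is a non-empty list."""
    dists = [backward_distances(graph, k) for k in keywords]
    candidates = set(dists[0]).intersection(*dists[1:])
    if not candidates:
        return None
    return min((max(d[v] for d in dists), v) for v in candidates)


# Hypothetical directed version of the example graph, all edge weights 1.
graph = {
    "organization": {"name1": 1, "headq": 1},
    "name1": {"EU": 1},
    "headq": {"city": 1},
    "city": {"name2": 1},
    "name2": {"Brussels": 1},
}
print(min_height_root(graph, ["EU", "Brussels"]))
```

A shortest-path tree from the returned root attains the reported height, which is why the maximum of the per-keyword distances is the right quantity to minimize.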

Finding an Answer under Constraints
• Inclusion: “include edge e”
• Exclusion: “exclude edge e”
[Figure: the example data graph with keywords Belgium and Brussels, shown twice.]
Handling exclusion constraints is easy: simply remove the excluded edges from the graph
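The way inclusion and exclusion constraints drive a ranked enumeration can be sketched generically. After an answer is printed, the remaining search space is split into subspaces, each described by a set of edges to include and a set to exclude, and the best answer of every subspace is queued. This is a minimal sketch of the Lawler-style partitioning, assuming that no answer is a strict subset of another and that a solver `best(include, exclude)` is supplied by the caller; the brute-force solver and the four toy answers below are hypothetical stand-ins for the real constrained subtree search.

```python
import heapq
from itertools import count

def lawler_enumerate(best):
    """Yield (cost, answer) pairs in the order induced by `best`.

    `best(include, exclude)` returns (cost, answer) for the cheapest answer
    (a frozenset of edges) that contains every edge of `include` and no edge
    of `exclude`, or None if no such answer exists.
    """
    tie = count()  # tie-breaker so the heap never has to compare answers
    heap = []
    first = best(frozenset(), frozenset())
    if first:
        heapq.heappush(heap, (first[0], next(tie), first[1], frozenset(), frozenset()))
    while heap:
        cost, _, answer, inc, exc = heapq.heappop(heap)
        yield cost, answer
        # Partition the rest of this subspace: the i-th part includes the first
        # i-1 free edges of the printed answer and excludes the i-th one.
        fixed = set(inc)
        for e in sorted(answer - inc):
            sub_inc, sub_exc = frozenset(fixed), exc | {e}
            sub = best(sub_inc, sub_exc)
            if sub:
                heapq.heappush(heap, (sub[0], next(tie), sub[1], sub_inc, sub_exc))
            fixed.add(e)


# Toy search space: four answers over edges a..d with their costs.
ANSWERS = [(3, frozenset("ab")), (4, frozenset("ac")),
           (5, frozenset("bd")), (7, frozenset("cd"))]

def brute_force_best(include, exclude):
    # Exclusion is the easy part: just ignore anything that uses an excluded edge.
    feasible = [(c, s) for c, s in ANSWERS if include <= s and not (exclude & s)]
    return min(feasible, key=lambda cs: cs[0]) if feasible else None

result = list(lawler_enumerate(brute_force_best))
print([cost for cost, _ in result])
```

With an exact solver the answers come out in exact order, each exactly once; plugging in a solver that only approximates the best answer of a subspace yields an approximate order instead.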

Inclusion Constraints are the Problem
• Inclusion: “include edge e”
• Exclusion: “exclude edge e”
[Figure: the example data graph with the shortest subtree that contains the kw’s and satisfies the const’s; one of its edges is marked as a redundant edge.]
But it is not an answer!
• Not reduced (has redundancy)
• Moreover, includes a previously printed answer
• Sometimes, no answer at all!

The Correct Answer
• Inclusion: “include edge e”
• Exclusion: “exclude edge e”
[Figure: the example data graph, shown twice.]
Technique:
1. Generate a min-height subtree (as in the wrong solution)
2. Not an answer? → modify
• Intricate to guarantee 2-approx.
• Details in the proceedings

Running Times
[Figure: time (sec), 0 to 500, against # keywords, 2 to 10, for 100 answers and 1000 answers.]
Each entry is an avg. of 4 queries

Alg. Order vs. Weight Order
How many answers are generated in order to obtain the top-k (among 1000) according to weight?
[Figure: generation rank, 0 to 1000, against weight-based rank, 10 to 100, with one series for each of 2 to 10 kw’s.]
Each entry is an avg. of 4 queries

Effective Approx. Ratio: Height ↑
Effective approx. ratio: worst / best (among first k)
[Figure: effective approx. ratio (%) against k (answers), 100 to 9100; panels for 2, 3, 4 and 5 keywords.]

Effective Approx. Ratio: Weight ↑
Effective approx. ratio: worst / best (among first k)
[Figure: effective approx. ratio (%) against k (answers), 100 to 9100; panels for 2, 3, 4 and 5 keywords.]

The Basic Ranking Function
\[ \text{abs-rel}(a) = \frac{1}{\text{weight}(a)} \]
\[ \text{weight}(a) = \sum_{\text{node} \in a} \text{weight}(\text{node}) + \sum_{\text{edge} \in a} \text{weight}(\text{edge}) \]

Determining the Weight of an Edge
[Figure: edges from many organization nodes into one country, edges from one organization into many country nodes, borders edges between country nodes, and a capital edge of a country.]
• Many org’s enter country → weak connection (large weight)
• org. enters many countries → weak connection (large weight)
• Strong connection (small weight)
• Strongest!

The Basic Ranking Function (cont’d)
\[ \text{abs-rel}(a) = \frac{1}{\text{weight}(a)} \]
\[ \text{weight}(a) = \sum_{\text{node} \in a} \text{weight}(\text{node}) + \sum_{\text{edge} \in a} \text{weight}(\text{edge}) \]
\[ \text{weight}(\text{node}) = 1 \quad \text{(fixed)} \]
\[ \text{weight}(\text{edge}) = \log\bigl(1 + \alpha \cdot \text{out}(v_1 \to t_2) + (1 - \alpha) \cdot \text{in}(t_1 \to v_2)\bigr) \]
where edge = (v_1, v_2), tag(v_i) = t_i, and
• out(v_1 → t_2) = # t_2 nodes with edges from v_1
• in(t_1 → v_2) = # t_1 nodes with edges to v_2
Relevant answers, but …
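The edge-weight formula can be evaluated directly from an edge list. The sketch below assumes a natural logarithm and a default of α = 0.5, neither of which is stated on the slide; the function name `edge_weight` and the toy graph are hypothetical.

```python
import math

def edge_weight(edges, tags, v1, v2, alpha=0.5):
    """weight(edge) = log(1 + alpha * out(v1 -> t2) + (1 - alpha) * in(t1 -> v2)).

    `edges` is a list of directed (u, v) pairs and `tags` maps a node to its tag.
    """
    t1, t2 = tags[v1], tags[v2]
    # Number of t2-tagged nodes with an edge from v1.
    out_count = len({v for (u, v) in edges if u == v1 and tags[v] == t2})
    # Number of t1-tagged nodes with an edge to v2.
    in_count = len({u for (u, v) in edges if v == v2 and tags[u] == t1})
    return math.log(1 + alpha * out_count + (1 - alpha) * in_count)


edges = [("org1", "c1"), ("org1", "c2"), ("org1", "c3"), ("org2", "c1"), ("c1", "cap1")]
tags = {"org1": "organization", "org2": "organization",
        "c1": "country", "c2": "country", "c3": "country", "cap1": "capital"}

w_member = edge_weight(edges, tags, "org1", "c1")   # out = 3, in = 2
w_capital = edge_weight(edges, tags, "c1", "cap1")  # out = 1, in = 1
print(w_member, w_capital)
```

The organization-to-country edge gets the larger weight because the organization enters several countries and the country is entered by several organizations, whereas a country has a single capital.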

Answers with High Similarity
[Figure: four answers connecting Netherlands, Belgium and France; in each, the three country nodes are attached to one organization node: EU, ADB, NATO and ESA, respectively.]

Combinations of Connections
[Figure: several answers connecting Netherlands, Belgium and France through different combinations of organization nodes and borders edges.]
But each individual answer is relevant!

Dynamic Ranking
[Figure: a pool of candidate answers connecting Netherlands, Belgium and France (through the organizations EU, ADB, NATO and ESA, and through borders edges), and the output list of answers printed so far.]
Next-Answer()
    a ← extract-top-candidate()
    print(a)
    for all candidates c and pairs of keywords k1, k2
        if c and a connect k1 and k2 similarly, then penalize(c)
What does “similarly” mean?

Two Types of “Similarity”
k1, k2 = Belgium, France
[Figure: a printed answer a and two candidates c1 and c2, all connecting Netherlands, Belgium and France; the organizations ESA and EU and a borders edge appear.]
• The same connection → Penalty: 1
• Isomorphic connection (same schema) → Penalty: p (≤1)
2 options:
• Sum over printed answers
• Max over printed answers

The General Ranking Function
\[ \text{abs-rel}(c) = \frac{1}{\text{weight}(c)} \]
\[ \text{rpt-inf}(c) = \sum_{k_1, k_2 \in \text{kw's}} \; \sum_{\text{printed answers}} (p \text{ or } 1) \quad \text{or} \quad \sum_{k_1, k_2 \in \text{kw's}} \; \max_{\text{printed answers}} (p \text{ or } 1) \]
\[ \text{score}(c) = \frac{1}{\dfrac{1}{\text{abs-rel}(c)} + \varepsilon \cdot \text{rpt-inf}(c)} \]
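The dynamic ranking loop can be sketched end to end. The sketch assumes that score(c) = 1 / (1/abs-rel(c) + ε · rpt-inf(c)), which equals 1 / (weight(c) + ε · rpt-inf(c)), that a connection between two keywords is the sequence of nodes on the path between them, and that two connections are isomorphic when their tag sequences agree. The data layout (`"weight"`, `"paths"`) and all function names are hypothetical.

```python
def penalty(path_c, path_a, tags, p):
    """1 for the same connection, p for an isomorphic one, 0 otherwise."""
    if path_c == path_a:
        return 1.0
    if [tags[n] for n in path_c] == [tags[n] for n in path_a]:
        return p
    return 0.0

def rpt_inf(cand, printed, tags, p, mode):
    """Repeated information of a candidate relative to the printed answers."""
    combine = sum if mode == "sum" else max
    total = 0.0
    for pair, path in cand["paths"].items():
        penalties = [penalty(path, a["paths"][pair], tags, p)
                     for a in printed if pair in a["paths"]]
        if penalties:
            total += combine(penalties)
    return total

def score(cand, printed, tags, eps, p, mode):
    # 1 / abs-rel(c) is weight(c), so the denominator is weight + eps * rpt-inf.
    return 1.0 / (cand["weight"] + eps * rpt_inf(cand, printed, tags, p, mode))

def rank(candidates, tags, k, eps=1.0, p=0.1, mode="max"):
    """Repeatedly print the top candidate; scores depend on what was printed."""
    pool, printed = list(candidates), []
    while pool and len(printed) < k:
        top = max(pool, key=lambda c: score(c, printed, tags, eps, p, mode))
        pool.remove(top)
        printed.append(top)
    return printed


tags = {"Belgium": "name", "France": "name", "cB": "country", "cF": "country",
        "EU": "organization", "ESA": "organization", "b": "borders"}
pair = ("Belgium", "France")
candidates = [
    {"id": "eu", "weight": 5.0, "paths": {pair: ("Belgium", "cB", "EU", "cF", "France")}},
    {"id": "esa", "weight": 5.2, "paths": {pair: ("Belgium", "cB", "ESA", "cF", "France")}},
    {"id": "borders", "weight": 5.5, "paths": {pair: ("Belgium", "cB", "b", "cF", "France")}},
]
static = [c["id"] for c in rank(candidates, tags, k=3, eps=0.0)]
dynamic = [c["id"] for c in rank(candidates, tags, k=3, eps=1.0, p=1.0, mode="sum")]
print(static, dynamic)
```

With ε = 0 the candidates are printed by weight alone; with ε = 1 and p = 1 the ESA answer is penalized for being isomorphic to the already printed EU answer, so the borders answer moves ahead of it.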

Score Loss vs. Diversity
[Figure: two panels, “Sum, p=1.0” and “Max, p=0.1”, plotting % of max. (0 to 100) against ε (0 to 10) for three series: Score (1/weight), Connections (u.t. iso.), and Connections.]
• 5 keywords
• Avg. of 4 queries
• Top-20 answers
The bottom configuration is better than the top one: smaller reduction of score for similar/higher degree of diversity

Conclusions
• KPS in complex data graphs has inherent problems that are ignored in existing systems
• 2-component arch.: answer generator & ranker
• 1st component: Enum. algorithm w/ guarantees
− Efficient, correct (no missed answers), 2-approximate order by height
− In the paper: Ext. to OR semantics (exact order)
• 2nd component: Dynamically ranks candidates by penalizing them for repeated information
− Our experiments over Mondial suggest a tuning of the parameters that gives the best tradeoff between information gain and score loss

Current & Future Research
• Improve / optimize the answer generator
− Successful: Parallelism
− Concurrent queries?
• Implement different answer generators
− E.g., by (approx.) increasing weight [KS-PODS’06]
• Assessment by humans
− Relevancy / repeated information
− Methodology example: [Zhang et al., SIGIR’02]
• Other aspects
− Answer presentation →

Answer Presentation
• On the Web, we instantly get the meaning of an answer (Web page) by the <title>, URL and, possibly, a snippet of the text
• In KPS, understanding the meaning of a subtree is not straightforward; we need to derive the semantics from the graphical presentation

What’s the Meaning of this Answer?
[Figure: a snapshot of the BANKS demo (http://www.cse.iitb.ac.in/banks/) over IMDB.]
Harder in XML!
• No division into relations (everything is element / attribute)
• What information is needed to describe a node?

Answer Presentation
• On the Web, we instantly understand the meaning of an answer (Web page) by reading the <title> element, the URL and, possibly, a snippet of the text
• In KPS, understanding the meaning of a subtree is cumbersome since we need to derive the semantics from the presentation
Solution (under develop.):
• Graphical presentation is based on restructuring answers in terms of entities, properties and relationships
• Apply heuristics for determining the minimal set of properties required for each entity

Thank you!
Questions?