a graph method for keyword-based selection of the top-k databases presented by gaurav dutta gaurav...

A Graph Method for Keyword-based Selection of the Top-K Databases

Presented by

GAURAV DUTTA

Motivation

Keyword search is the dominant information discovery method in documents.

Increasing amount of data stored in databases.

Motivation

Currently, information discovery in databases requires:
– Knowledge of the schema
– Knowledge of a query language (e.g., SQL, XQuery)
– Knowledge of the role of the keywords

IR-based Methods

They pre-compute a summary for each document repository.

They do not capture the inherent structure of a DBMS because they ignore connectivity and distance information among tuples containing terms.

Introduction

Keyword Relationship Graph (KRG): a graph that summarizes a database.

It is used to compute the similarity between each database and a keyword search (KS) query so that, during query processing, only the most promising databases are searched.

Example (q = {Anderson, Love})

M-KS

Based on the concept of the Keyword Relationship Matrix (KRM).

It builds a KRM for every DB. For each term pair ($k_i$, $k_j$) there is an entry $KRM(k_i, k_j)$ that records the frequencies of occurrences of the two terms at different distances.

Example

The entry $KRM_1(\text{Olson}, \text{love})$ stores that the two terms can be connected once at distance 2 (i.e., $t_2$ - $t_9$ - $t_4$) and once at distance 3 ($t_2$ - $t_{10}$ - $t_5$ - $t_7$).

The similarity between a query $q$ and $DB_l$ is computed using the $KRM_l$ entries of all possible keyword pairs in $q$.

E.g., if $q = \{k_1, k_2, k_3\}$, the score of $DB_l$ is based on $KRM_l(k_1, k_2)$, $KRM_l(k_1, k_3)$ and $KRM_l(k_2, k_3)$.
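The pairwise lookup described above can be sketched in a few lines. This is only an illustration of which matrix entries a query touches; the helper name `keyword_pairs` is hypothetical and the slides do not say how M-KS combines the entries into a score.

```python
from itertools import combinations

def keyword_pairs(query):
    """Return every unordered keyword pair of the query (hypothetical helper)."""
    return list(combinations(sorted(set(query)), 2))

# Each pair corresponds to one KRM entry that M-KS would consult.
print(keyword_pairs(["k1", "k2", "k3"]))
```

A query with n distinct keywords touches n(n-1)/2 entries, which is why the method only ever sees binary relationships.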

M-KS (Disadvantages)

Binary relationships between keywords.

Yields numerous false positives for queries where all pairs of keywords are related.

Records only the frequency of term co-occurrences.

Does not include a mechanism for handling OR semantics.

Importance of G-KS

G-KS summarizes each database $DB_l$ as a keyword relationship graph $KRG_l$ that captures the terms and their relationships with weighted nodes and edges.

Minimizes the chance of false positives by imposing more stringent conditions than simple binary relationships.

Based on $KRG_l$, G-KS can effectively estimate the importance of $DB_l$ with respect to a query under both the AND and the OR semantics.

KEYWORD SEARCH OVER SINGLE SYSTEM

Most methods are based on the vector space model.

Each element in a vector corresponds to a term, and its weight is $w = tf \cdot idf$.

Queries are represented as vectors of keywords.

Similarity is measured by the cosine of the angle between their vector representations.
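As a minimal sketch of the vector space model summarized above, the snippet below weights terms by tf · idf and ranks documents by cosine similarity. The toy documents, the function names, and the particular idf form ln(n_docs / df) are assumptions made for illustration; the slides do not fix them.

```python
import math
from collections import Counter

def tfidf_vector(tokens, doc_freq, n_docs):
    """Map each term to tf * idf, with tf = count / length and idf = ln(n_docs / df)."""
    counts = Counter(tokens)
    return {
        term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
        for term, count in counts.items()
        if doc_freq.get(term, 0) > 0
    }

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy collection, not taken from the slides.
docs = [
    ["keyword", "search", "database"],
    ["graph", "database"],
    ["keyword", "graph", "search", "search"],
]
doc_freq = Counter(term for doc in docs for term in set(doc))
query = ["keyword", "search"]
q_vec = tfidf_vector(query, doc_freq, len(docs))
scores = [cosine(q_vec, tfidf_vector(doc, doc_freq, len(docs))) for doc in docs]
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking)
```

Note that the score depends only on which terms a document contains, not on how the matching tuples are connected, which is the limitation the database-oriented methods address.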

Different methods

BANKS creates a data graph and employs a backward search strategy.

Bidirectional search.

DBXplorer and DISCOVER rank results exclusively on the distance of tuples containing query words.

KEYWORD SEARCH OVER DISTRIBUTED SYSTEMS

The existing techniques differ mainly in the way that they construct summaries.

GlOSS and CVV use term frequencies.

CORI adds a factor called inverse collection frequency to find the importance of a term.

Some also consider dependency relationships between terms.

M-KS

M-KS summarizes every $DB_l$ with a keyword relationship matrix $KRM_l$ capturing the binary relationships between terms at different distances.

Specifically, each entry $KRM_l(k_i, k_j)$ stores a vector $d_0 d_1 \ldots$, where $d_0$ is the number of times that terms $k_i$ and $k_j$ occur in the same tuple, $d_1$ is the number of times that they occur in tuples with distance 1, and so on.
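One way to picture such a distance vector is to treat tuples as graph vertices linked by join relationships and count tuple pairs by shortest join distance. The sketch below does this with breadth-first search. The edge list is hypothetical: it contains only the two join paths quoted in the Olson/love example, not the full sample database, and counting by shortest distance is an assumption about how connections are tallied.

```python
from collections import deque

def tuple_distances(adj, source):
    """Breadth-first search: shortest join distance from source to each reachable tuple."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def krm_entry(adj, tuples_i, tuples_j, max_d):
    """Vector [d0, ..., d_max]: tuple pairs (one with ki, one with kj) per distance."""
    vec = [0] * (max_d + 1)
    for tx in tuples_i:
        dist = tuple_distances(adj, tx)
        for ty in tuples_j:
            d = dist.get(ty)
            if d is not None and d <= max_d:
                vec[d] += 1
    return vec

# Hypothetical join edges covering only the two paths named in the example.
edges = [("t2", "t9"), ("t9", "t4"), ("t2", "t10"), ("t10", "t5"), ("t5", "t7")]
adj = {}
for a, b in edges:
    adj.setdefault(a, set()).add(b)
    adj.setdefault(b, set()).add(a)

olson_love = krm_entry(adj, {"t2"}, {"t3", "t4", "t7"}, max_d=3)
print(olson_love)
```

Under these assumptions the vector has one connection at distance 2 and one at distance 3, matching the entry described for Olson and love.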

M-KS

In order to compute these vectors (for all term pairs), M-KS scans $DB_l$, parses each tuple, removes stop words, stems the terms and inserts them into a table $T_0$ associating each term with the tuples that contain it.
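The scan described above amounts to building an inverted table from terms to tuple identifiers. The following is a rough sketch, not the M-KS implementation: the stop-word list is a tiny illustrative set, `stem` is a placeholder that only strips a trailing "s", and the two tuples are invented.

```python
import re
from collections import defaultdict

STOP_WORDS = {"a", "an", "and", "of", "the", "in"}  # illustrative, not exhaustive

def stem(term):
    """Placeholder stemmer (assumption): strip a trailing 's' from longer words."""
    return term[:-1] if len(term) > 3 and term.endswith("s") else term

def build_t0(db):
    """Build T0: stemmed term -> set of ids of the tuples that contain it."""
    t0 = defaultdict(set)
    for tid, text in db.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            if token in STOP_WORDS:
                continue
            t0[stem(token)].add(tid)
    return dict(t0)

# Hypothetical tuples, not the sample database from the slides.
db = {"t1": "The Graphs of Love", "t2": "Love and Databases"}
t0 = build_t0(db)
print(sorted(t0.items()))
```

A production system would replace the placeholder with a real stemmer such as Porter's algorithm.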

KEYWORD RELATIONSHIP GRAPHS

For instance, the edge between Olson and love has two values: 2 due to the result $t_2$ - $t_9$ - $t_4$, and 3 due to $t_2$ - $t_{10}$ - $t_5$ - $t_7$.

Weight of a node

We use $c_i(t)$ to denote the number of occurrences of $k_i$ in $t$.

$N$ is the number of tuples in DB that include terms, and $N_i$ is the number of tuples containing $k_i$.

Definition 1. The term frequency of $k_i$ in tuple $t$ is $tf_i(t) = c_i(t) / S(t)$, where $S(t)$ is the total number of terms in $t$.

Definition 2. The inverse tuple frequency of term $k_i$ is $iuf_i = \ln\frac{N + 1}{N_i}$.

E.g.: The sample database of Figure 1 contains $N = 7$ tuples with terms ($t_1$ to $t_7$). The term Anderson has a single appearance in $t_1$ (i.e., $N_{\text{Anderson}} = 1$ and $c_{\text{Anderson}}(t_1) = 1$), which contains 2 terms in total (i.e., $S(t_1) = 2$). Given the above, $tf_{\text{Anderson}}(t_1) = 1/2$ and $iuf_{\text{Anderson}} = \ln(8/1)$.

$w_{\text{Anderson}}(t_1) = tf_{\text{Anderson}}(t_1) \cdot iuf_{\text{Anderson}} = \frac{1}{2} \cdot \ln(8/1) = 1.040$. Since there is only one occurrence of the term in the database, the node representing Anderson in the KRG of Figure 3 has a weight $w_{\text{Anderson}} = w_{\text{Anderson}}(t_1)$.

Similarly, the term love appears in $N_{\text{love}} = 3$ tuples: $t_3$, $t_4$, $t_7$.

The sizes of these tuples are $S(t_3) = 3$, $S(t_4) = 2$ and $S(t_7) = 2$, respectively. Consequently, the weight of the term in the KRG is $w_{\text{love}} = \text{average}\left(\frac{1}{3} \cdot \ln\frac{8}{3},\ \frac{1}{2} \cdot \ln\frac{8}{3},\ \frac{1}{2} \cdot \ln\frac{8}{3}\right) = 0.436$.
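The two worked examples can be checked numerically. The sketch below assumes, as the examples do, that a node weight is the average of tf · iuf over the tuples containing the term, and that love occurs exactly once in each of its three tuples; the function names are hypothetical.

```python
import math

def tf(count, size):
    """Term frequency: occurrences of the term divided by the tuple size S(t)."""
    return count / size

def iuf(n, n_i):
    """Inverse tuple frequency: ln((N + 1) / N_i)."""
    return math.log((n + 1) / n_i)

def node_weight(occurrences, n):
    """Average tf * iuf over the tuples containing the term.

    occurrences: list of (c_i(t), S(t)) pairs, one per tuple containing the term.
    """
    n_i = len(occurrences)
    return sum(tf(c, s) * iuf(n, n_i) for c, s in occurrences) / n_i

w_anderson = node_weight([(1, 2)], n=7)
w_love = node_weight([(1, 3), (1, 2), (1, 2)], n=7)
print(round(w_anderson, 3), round(w_love, 3))
```

Rare terms in short tuples receive the largest weights, mirroring the usual tf · idf intuition at the tuple level.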

Weight of an edge

$nc(t_x, t_y, d)$ is the number of unique connections between $t_x$ and $t_y$ at distance $d$.

$c_{ij}(t_x, t_y, d) = nc(t_x, t_y, d) \cdot c_i(t_x) \cdot c_j(t_y)$ is the number of connections between $k_i$ and $k_j$ in $t_x$ and $t_y$.

The values of $c_{ij}(t_x, t_y, d)$ and $nc(t_x, t_y, d)$ differ only if there are multiple occurrences of $k_i$ ($k_j$) in $t_x$ ($t_y$).

$S_{ij}(t_x, t_y, d) = nc(t_x, t_y, d) \cdot S(t_x) \cdot S(t_y)$ is the number of connections between all term pairs in $t_x$ and $t_y$ at distance $d$.

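The frequency $pf_{ij}(t_x, t_y, d)$ is the ratio $c_{ij}(t_x, t_y, d) / S_{ij}(t_x, t_y, d)$, which simplifies to $tf_i(t_x) \cdot tf_j(t_y)$. Since $tf_{\text{Olson}}(t_2) = tf_{\text{love}}(t_4) = 0.5$, we obtain $pf_{\text{Olson,love}}(t_2, t_4, 2) = 0.25$.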

Let $N_{ij}(d)$ be the total number of cases where two tuples containing $k_i$ and $k_j$ can be connected at distance $d$.

Similarly, $N(d)$ is the total number of cases where two tuples containing any terms can be connected at distance $d$.

Inverse pairwise frequency, $iwf_{ij}(d)$: The inverse pairwise frequency of two terms $k_i$ and $k_j$ at distance $d$ is $iwf_{ij}(d) = \ln\frac{N(d) + 1}{N_{ij}(d)}$.
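The edge quantities can be sketched in the same way as the node weights. Two assumptions are made here and should be read as such: pf is taken to be the ratio of the pair's connections to all term-pair connections (which reproduces the 0.25 of the Olson/love example), and the logarithm is parenthesized as ln((N(d) + 1) / N_ij(d)) by analogy with the inverse tuple frequency. The counts passed to the second function are invented.

```python
import math

def pairwise_frequency(nc, c_i, c_j, s_x, s_y):
    """pf = c_ij / S_ij, with c_ij = nc * c_i * c_j and S_ij = nc * S(tx) * S(ty)."""
    c_ij = nc * c_i * c_j
    s_ij = nc * s_x * s_y
    return c_ij / s_ij

def inverse_pairwise_frequency(n_d, n_ij_d):
    """iwf_ij(d) = ln((N(d) + 1) / N_ij(d)) (assumed parenthesization)."""
    return math.log((n_d + 1) / n_ij_d)

# Olson occurs once in t2 (size 2) and love once in t4 (size 2), one connection.
pf = pairwise_frequency(nc=1, c_i=1, c_j=1, s_x=2, s_y=2)
# Hypothetical counts: 9 connected tuple pairs at distance 2, 1 of them for this term pair.
iwf = inverse_pairwise_frequency(n_d=9, n_ij_d=1)
print(pf, iwf)
```

Because nc appears in both numerator and denominator, pf does not depend on the number of connections, only on the term frequencies in the two tuples.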

G-KS identifies, for each tuple, the terms that appear only once in the database and inserts them in a compound node. Thus, the KRG has two types of nodes: single nodes containing one term and compound nodes consisting of multiple terms.
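A rough sketch of this grouping step follows. It assumes that a tuple with only one such rare term keeps it as a single node, since a compound node is described as holding multiple terms; that rule, the function name, and the toy tuples are all assumptions for illustration.

```python
from collections import defaultdict

def build_nodes(tuple_terms):
    """Split terms into single nodes and per-tuple compound nodes.

    tuple_terms: dict mapping tuple id -> list of terms in that tuple.
    Returns (sorted single-node terms, list of (tuple id, sorted compound terms)).
    """
    freq = defaultdict(int)
    for terms in tuple_terms.values():
        for term in terms:
            freq[term] += 1

    singles = {term for term, f in freq.items() if f > 1}
    compounds = []
    for tid in sorted(tuple_terms):
        once = sorted({t for t in tuple_terms[tid] if freq[t] == 1})
        if len(once) >= 2:
            compounds.append((tid, once))  # several rare terms share one node
        else:
            singles.update(once)  # assumption: a lone rare term stays a single node
    return sorted(singles), compounds

# Hypothetical tuples, not the sample database from the slides.
tuple_terms = {
    "t1": ["alpha", "beta"],
    "t2": ["gamma", "love"],
    "t3": ["love", "delta", "epsilon"],
}
singles, compounds = build_nodes(tuple_terms)
print(singles, compounds)
```

Terms merged into one compound node always co-occur in the same tuple, so they share the same relationships to every other node and the graph shrinks without losing connectivity information.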

Graph compression

Graph construction

Creation of compound nodes and single nodes.

The system constructs relationships between tuples in the database at different distances $d \geq 1$.

Based on the relationships of tuples at different distances, G-KS creates edges between nodes of the KRG.

QUERY PROCESSING

Join keyword tree, JKT(SG): Given a subgraph SG of the KRG, JKT(SG) is a tree satisfying the following properties:

1. Each tree vertex $tn_i$ maps to a non-empty set of nodes of SG, and the tree vertices should collectively contain all nodes in SG.

2. Edges connecting two vertices are associated with a single distance $d$.

3. If two SG nodes ($n_i$, $n_j$) map to the same tree vertex $tn_i$, they must co-exist in some tuple of DB.

4. If there is an edge between two vertices ($tn_i$, $tn_j$) with distance $d$ in JKT, then all pairs of corresponding nodes $n_i \in tn_i$ and $n_j \in tn_j$ must be related at $d$ in SG.

5. If two vertices ($tn_i$, $tn_j$) are not directly connected in JKT, then for each pair of nodes $n_i \in tn_i$ and $n_j \in tn_j$, there must be a relationship in SG at distance equal to that of the path connecting $tn_i$ and $tn_j$.

CANDIDATE GRAPH

Definition. Candidate graph, CG(KRG, q): Given a query q and a KRG, CG(KRG, q) is a subgraph SG of KRG satisfying the following properties:

1. SG includes all nodes of KRG containing the query keywords, and only these nodes.

2. SG is complete (i.e., there is an edge between each pair of nodes).

3. There exists at least one JKT(SG).
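The first two properties are cheap to test and can be sketched as a filter. The code below is a partial check only: it does not attempt property 3 (existence of a join keyword tree), and it additionally requires every query keyword to appear in some node, which is an assumption appropriate to AND semantics. The graph representation, the function name, and the term "smith" are hypothetical.

```python
from itertools import combinations

def candidate_nodes(nodes, edges, query):
    """Return the node ids passing properties 1 and 2, or None if they fail.

    nodes: dict node id -> set of terms held by that node.
    edges: dict frozenset({a, b}) -> set of distances relating nodes a and b.
    """
    query = set(query)
    # Property 1: exactly the nodes that contain at least one query keyword.
    selected = sorted(n for n, terms in nodes.items() if terms & query)
    covered = set().union(*(nodes[n] for n in selected))
    if not query <= covered:
        return None  # some keyword is absent from the database (AND semantics)
    # Property 2: the selected subgraph must be complete.
    for a, b in combinations(selected, 2):
        if frozenset((a, b)) not in edges:
            return None
    return selected

# Olson and love are related at distances 2 and 3; "smith" is a made-up term.
nodes = {"n1": {"olson"}, "n2": {"love"}, "n3": {"smith"}}
edges = {frozenset(("n1", "n2")): {2, 3}}
print(candidate_nodes(nodes, edges, ["olson", "love"]))
print(candidate_nodes(nodes, edges, ["smith", "love"]))
```

A database whose KRG fails this filter can be discarded immediately; one that passes still needs the JKT check before it counts as a candidate.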

CANDIDATE GRAPH

Theorem 5.1. If a database contains a result with all keywords of query q, then the corresponding KRG must have a candidate graph CG(KRG, q).

Theorem 5.2. The existence of a candidate graph CG(KRG, q) in KRG does not guarantee that the corresponding database has results for q.

Search for candidate graphs

Selection of top-K databases

Differences between M-KS and G-KS

Our experiments show that, compared to M-KS, G-KS
(i) is more effective (improving recall/precision by as much as 50%),
(ii) is more efficient (reducing query processing cost by as much as 50%),
(iii) incurs less pre-processing time (faster by 30%),
(iv) has less space overhead (smaller by 17%).

Future Work – Results Estimation

Top-k algorithms are usually slow when there are very few results.

CONCLUSION

G-KS summarizes each database as a keyword relationship graph, where nodes correspond to terms, and edges capture distance relationships.

Based on the KRG, G-KS applies an intricate algorithm to identify and eliminate non-promising databases.

G-KS considers all query keywords as a whole in order to minimize the number of false positives.

Experimental evaluation confirms the superiority of G-KS in terms of effectiveness, efficiency, processing and pre-processing overhead.
