
Efficient IR-Style Keyword Search

over Relational Databases

12 December 2005

Seminar on Databases and the Internet

The Hebrew University of Jerusalem, Winter 2006

Efficient IR-Style Keyword Search over Relational Databases, SDBI ’05

Introduction

This presentation is based mainly on the work of Hristidis, Gravano, and Papakonstantinou.

That work presents several efficient algorithms for information-retrieval (IR) style keyword search, based on the DISCOVER architecture.


Contents

Introduction

Goal and Motivation

Framework and examples

Architecture

Algorithms

Experimental Results

Criticism and Conclusion






Goal and Motivation

We present a detailed framework and methods for IR-style keyword search over relational databases.

What is Information Retrieval Keyword Search in general?

Mainly, it’s this…



…But not always:

SELECT * FROM Complaints C
WHERE CONTAINS (C.comment, 'disk crash', 1) > 0
ORDER BY score(1) DESC

prodID  custID  date       comment
p121    c3232   6-30-2002  “Disk crashed after one week of moderate use on an IBM Netvista X41”
p131    c3131   7-3-2002   “lower-end IBM Netvista caught fire, starting apparently with disk Crash”



Current status:

• RDBMSs (such as Oracle) provide querying capabilities for text attributes, provided that an exact column is specified.

• Only AND semantics are being used.

• Limited ranking functions.

• Known approaches for query processing strategies are inefficient (and sometimes even infeasible).



In particular, we’d like:

• Efficient ways to generate “top k” results according to some form of “ranking”.

• The use of AND and OR semantics (not just the default AND) when retrieving results.

• Assembling keyword occurrences from multiple attributes, perhaps in “unforeseen” ways, without needing to specify columns.



We would like to apply the same (or similar) methods and rules that apply in this world:

• Prioritizing: k-best results first

• Efficient searching

• Use of AND, OR semantics



Why should we care??

• Keyword queries require little or no knowledge about the database semantics.

• Ranking results correctly (and returning only relevant tuples) is, of course, highly desirable.

• Efficient implementation should reduce the querying process to a fraction of the time of a naïve implementation.






Framework

Query Model:

• A database with n relations R1, …, Rn.

• Relations possibly have primary-key to foreign-key constraints.

• The schema graph G is a directed graph in which, for each primary-key to foreign-key relationship between Ri and Rj, there is an edge (i, j). The relations of our example are:

Customers (custId, name, occupations)

Complaints (prodId, custId, date, comments)

Products (prodId, model, manufacturer)


Framework

A possible instance of the schema graph can be:

Complaints

tupleID  prodID  custID  date       comment
c1       p121    c3232   6-30-2002  “Disk crashed after one week of moderate use on an IBM Netvista X41”
c2       p131    c3131   7-3-2002   “lower-end IBM Netvista caught fire, starting apparently with disk”
c3       p131    c3143   8-3-2002   “IBM Netvista unstable with Maxtor HD”

Products

tupleID  prodID  manufac.     model
p1       p121    “Maxtor”     “D540X”
p2       p131    “IBM”        “Netvista”
p3       p141    “Tripplite”  “Smart 700VA”

Customers

tupleID  custID  name          Occupation
u1       c3232   “John Smith”  “Software engineer”
u2       c3131   “John L.”     “Architect”
u3       c3143   “Jack M.”     “student”


Framework

Joining trees of tuples:

• Given a schema graph G for a database, a joining tree of tuples T is a tree of tuples in which each edge (ti, tj), where ti ∈ Ri and tj ∈ Rj, satisfies two properties:

(1) (Ri, Rj) ∈ G (the schema graph we talked about)

(2) ti ⋈ tj ∈ Ri ⋈ Rj

• The size(T) of a joining tree is the number of tuples in T.
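The two edge properties can be checked mechanically. Below is a minimal sketch in Python, assuming tuples are stored as dictionaries and that the schema graph records the join column of each edge. The names `SCHEMA` and `edge_ok` are ours, introduced only for illustration; the tuple values are taken from the example instance.

```python
# Schema graph: (primary-key relation, foreign-key relation) -> join column.
# The layout of this dictionary is an assumption made for illustration.
SCHEMA = {
    ("Products", "Complaints"): "prodID",
    ("Customers", "Complaints"): "custID",
}

def edge_ok(ri, ti, rj, tj):
    """Return True if tuple ti (of relation ri) and tuple tj (of relation rj)
    may be adjacent in a joining tree of tuples."""
    for a, ta, b, tb in ((ri, ti, rj, tj), (rj, tj, ri, ti)):
        column = SCHEMA.get((a, b))        # property (1): the edge is in G
        if column is not None and ta[column] == tb[column]:
            return True                    # property (2): the tuples join
    return False

p1 = {"tupleID": "p1", "prodID": "p121"}
p2 = {"tupleID": "p2", "prodID": "p131"}
c3 = {"tupleID": "c3", "prodID": "p131", "custID": "c3143"}
u3 = {"tupleID": "u3", "custID": "c3143"}

print(edge_ok("Products", p2, "Complaints", c3))   # p2 and c3 share prodID p131
print(edge_ok("Products", p1, "Complaints", c3))   # different products
print(edge_ok("Customers", u3, "Products", p2))    # no such edge in the schema graph
```

A full check of a joining tree would also verify that the edges form a tree; the sketch covers only the two per-edge properties.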


Framework

A joining tree of tuples for our example:

[Figure: tuples of Complaints, Products, and Customers connected by joins (⋈)]


Framework

“Top-k” keyword query

• a “top-k” keyword query is a list of keywords Q={w1… wm}. The result for such a query is a list of the k joining trees of tuples T whose score(T,Q) is the highest, so that:

(1) each tree T in a result is minimal: cannot have a zero-scored leaf.

(2) no tuple appears more than once in a joining tree of tuples.


Framework

For example, the query Q = {Netvista, Maxtor} should yield the following results:

• c1 (by itself)

• the joining tree of p2 and c3

• the joining tree of p1 and c1


Framework

Score (ai,Q)

• A method to evaluate the relevance of a tree of tuples. It consists of a single-attribute (ai) IR-style relevance scoring function, which uses:

– tf: term frequency of w (w ∈ Q) in ai

– N: number of tuples in ai’s relation

– df: number of tuples in ai’s relation with the word w

– dl, avdl: attribute value size and average attribute value size

– s: a constant
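The quantities listed are those of the pivoted-normalization weighting used in IR. Assuming that this is the function meant, Score(ai,Q) sums, over the keywords w of Q that occur in ai, the term (1 + ln(1 + ln(tf))) / ((1 − s) + s·dl/avdl) · ln((N + 1)/df). The Python sketch below computes it; measuring sizes in words, the default s = 0.2, and the function name `attribute_score` are assumptions of this sketch.

```python
import math

def attribute_score(text, query, n_tuples, df, avdl, s=0.2):
    """Score of one attribute value against the query keywords.
    n_tuples: N; df: {word: number of tuples containing it}; avdl: average
    attribute value size; s: the constant. Sizes are counted in words here,
    which is an assumption of this sketch."""
    words = text.lower().split()
    dl = len(words)
    score = 0.0
    for w in query:
        w = w.lower()
        tf = words.count(w)
        if tf == 0 or df.get(w, 0) == 0:
            continue                      # keyword absent: contributes nothing
        tf_part = 1 + math.log(1 + math.log(tf))
        length_norm = (1 - s) + s * dl / avdl
        idf_part = math.log((n_tuples + 1) / df[w])
        score += tf_part / length_norm * idf_part
    return score

c3 = "IBM Netvista unstable with Maxtor HD"
df = {"maxtor": 1, "netvista": 3}         # counts in the Complaints instance
print(attribute_score(c3, ["Maxtor", "Netvista"], n_tuples=3, df=df, avdl=6))
```

The value avdl=6 in the usage line is chosen only to keep the arithmetic simple, and the printed number is not one of the illustrative scores used in the later examples.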


Framework

Combined Score (T,Q)

• Another function should be used to combine the single-attribute scores into a final score.

• The combining functions considered are only optional candidates.

• This framework can handle many functions, as long as they satisfy the tuple monotonicity property:

• if every tuple of T’ scores no higher than the corresponding tuple of T, then the combined score of T’ is no higher than the combined score of T.
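One combining function that satisfies tuple monotonicity is the one the later examples appear to use: the sum of the tuple scores divided by the size of the tree. A minimal Python sketch under that assumption (the name `combined_score` is ours):

```python
def combined_score(tuple_scores, size):
    """Sum of the individual tuple scores divided by the size of the tree.
    Free tuples count toward the size but contribute no score."""
    return sum(tuple_scores) / size

# c3 (1.33) joined with p2 (1), scores as in the Sparse example.
better = combined_score([1.33, 1.0], size=2)
# c1 (0.33) joined with p1 (1): no tuple scores higher than its counterpart
# above, so the combined score cannot be higher either.
worse = combined_score([0.33, 1.0], size=2)
print(better, worse)
```

Dividing by the size penalizes large trees, which is why a tree connected through free tuples scores lower than a direct join of the same matching tuples.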


Framework

Candidate Networks (CN)

• A CN can be thought of as a join expression that involves tuple sets plus (perhaps) “base” relations, which do not have occurrences of query keywords but help to connect relations that do.

For example, for Q = {IBM, Architect}:

ProductsQ ⋈ Complaints{} ⋈ CustomersQ


Framework

For example, all the candidate networks (with scores) for Q = {Maxtor, Netvista}:

P = products

C = complaints

U = customers






Architecture

• What follows is a quick overview of the system architecture needed to implement top-k keyword queries efficiently.

• The description relies heavily on the DISCOVER architecture, but is not really OS- or RDBMS-specific.



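• The architecture consists of:

– an IR Engine

– a CN generator

– an Execution Engine

[Diagram: the user’s keywords go to the IR Engine (backed by the IR index), which produces tuple sets; the Candidate Network Generator combines them with the database schema into candidate networks; the Execution Engine turns these into parameterized SQL queries against the database.]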



IR Engine

• Modern RDBMSs include IR-style text-indexing functionality (e.g. Oracle Text).

• It is useful to think of the IR-engine as an indexer that gives a SCORE>0 to tuples that have occurrences of the keywords




IR Engine

• The proposed architecture exploits this functionality: upon arrival of a query Q, it generates for each relation R the tuple set RQ = { t ∈ R | Score(t,Q) > 0 }.

• The tuple sets are then sorted by decreasing score and passed on to the next module.
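The tuple-set step can be sketched in a few lines of Python. The scoring function `keyword_hits` below is a toy stand-in for the RDBMS text index (it only counts matching keywords) and is not a real ranking function; `build_tuple_sets` and the flattened tuple texts are likewise assumptions made for illustration.

```python
def keyword_hits(text, query):
    """Toy stand-in for the IR engine's Score(t, Q): counts the query
    keywords that occur in the tuple's text."""
    words = text.lower().split()
    return sum(1 for w in query if w.lower() in words)

def build_tuple_sets(relations, query, score=keyword_hits):
    """Map each relation name to its tuple set, sorted by decreasing score.
    Relations whose tuple set is empty are left out."""
    tuple_sets = {}
    for name, tuples in relations.items():
        scored = [(score(text, query), tid) for tid, text in tuples]
        matching = [pair for pair in scored if pair[0] > 0]
        if matching:
            # sorted() is stable, so ties keep their original order
            tuple_sets[name] = sorted(matching, key=lambda p: p[0], reverse=True)
    return tuple_sets

relations = {
    "Complaints": [
        ("c1", "Disk crashed after one week of moderate use on an IBM Netvista X41"),
        ("c2", "lower-end IBM Netvista caught fire, starting apparently with disk"),
        ("c3", "IBM Netvista unstable with Maxtor HD"),
    ],
    "Products": [("p1", "Maxtor D540X"), ("p2", "IBM Netvista"),
                 ("p3", "Tripplite Smart 700VA")],
    "Customers": [("u1", "John Smith Software engineer"),
                  ("u2", "John L. Architect"), ("u3", "Jack M. student")],
}
tuple_sets = build_tuple_sets(relations, ["Maxtor", "Netvista"])
print(tuple_sets)
```

For Q = {Maxtor, Netvista} only Complaints and Products yield non-empty tuple sets, so only those two are passed on to the next module.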




CN Generator

• Receives the non-empty tuple sets (such as CQ, PQ) and the general schema graph.

• Attempts to join those sets, perhaps using “base” relations (U{}… remember?), and generates Candidate Networks (CNs)!




CN Generator

• Also receives a parameter M that bounds the maximum number of tuple sets (either free or non-free) participating in a CN.

• Why is this bound needed? The number of CNs might be exponential in the query size!



CN Generator

The generated CNs MUST satisfy:

• No leaf of a CN is a “free” tuple set (P{}…).

• No construct of the form R ← S → R exists, since a tree of tuples cannot include duplicate tuples!



Execution Engine

• This is the module that actually contacts the RDBMS query tools in order to generate the top-k results.

• This is our focus! (It is the hardest part to implement efficiently.)


Architecture - demonstration

Recall the database from before, with the query Q = {Maxtor, Netvista}. The user submits these keywords, and they flow through the IR Engine, the Candidate Network generator, and the Execution Engine as described above.

We now turn our attention to how the last step, the work of the execution engine, is done.






First of all, what do we have so far?

• An architecture that constructs Candidate Networks from keyword queries, using “black box” functions of modern RDBMSs, and some given SCORE functions.

• A notion of what should be done in order to produce the keyword query results.

So, how would you do it???


Naïve algorithm

• The naïve approach: simply issue an SQL query for each CN.

• The results from all the queries are then combined in a sort-merge manner to identify the top-k results.

• Main problem: runtime.

• What characteristic(s) can we use in order to make our algorithm more efficient?


Naïve algorithm is too slow

• Remember that the IR Engine returns tuple sets that are ranked in DESCENDING order with respect to the SCORE() function.

• So, by combining the top scores of the tuple sets in CNi, we can get an upper-bound ESTIMATE of the score of any of its results: its maximal possible score (MPSi).

• We can use this knowledge to disregard “unfruitful” CNs!!


Sparse Algorithm

• For every CNi, compute MPSi.

• If MPSi does not exceed the score of the lowest “best-k” match found so far for the query, DISCARD CNi.

• Otherwise, join the tuples in CNi as usual…

• As a further optimization, CNs are evaluated in ASCENDING SIZE order: smaller CNs are evaluated first, while “heavy” CNs might be discarded after only short calculation steps!
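The steps above can be sketched in Python as follows. This is a sketch under stated assumptions, not the authors’ implementation: each CN is represented by a dictionary whose `evaluate` callable stands in for the SQL query issued for that CN, and the combined score is assumed to be the sum of tuple scores divided by the CN size. The data reproduces the top-2 example for Q = {Maxtor, Netvista}; the empty result lists of the two size-3 CNs are placeholders, since those CNs are never evaluated.

```python
import heapq

def mps(top_scores, size):
    """Upper bound on the score of any result of a CN: the best score of
    each non-free tuple set, summed and divided by the CN size."""
    return sum(top_scores) / size

def sparse(cns, k):
    best = []  # min-heap of (score, tree) holding the k best results so far
    for cn in sorted(cns, key=lambda c: c["size"]):   # ascending size order
        bound = mps(cn["top_scores"], cn["size"])
        if len(best) == k and bound <= best[0][0]:
            continue  # cannot beat the current k-th best: skip the join
        for tree, score in cn["evaluate"]():
            if len(best) < k:
                heapq.heappush(best, (score, tree))
            elif score > best[0][0]:
                heapq.heapreplace(best, (score, tree))
    return sorted(best, reverse=True)

evaluated = []   # names of the CNs whose join was actually executed

def make_cn(name, size, top_scores, results):
    def evaluate():
        evaluated.append(name)
        return results
    return {"name": name, "size": size, "top_scores": top_scores,
            "evaluate": evaluate}

cns = [
    make_cn("CQ", 1, [1.33], [("c3", 1.33), ("c1", 0.33), ("c2", 0.33)]),
    make_cn("PQ", 1, [1.0], [("p1", 1.0), ("p2", 1.0)]),
    make_cn("CQ-PQ", 2, [1.33, 1.0],
            [("c3-p2", 1.17), ("c1-p1", 0.67), ("c2-p2", 0.67)]),
    make_cn("CQ-P{}-CQ", 3, [1.33, 1.33], []),   # placeholder results
    make_cn("CQ-U{}-CQ", 3, [1.33, 1.33], []),   # placeholder results
]
top2 = sparse(cns, k=2)
print(top2, evaluated)
```

Keeping the k best results in a min-heap makes the score of the current k-th best result available in constant time, which is exactly the value each MPS is compared against.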


Sparse algorithm example

Remember this database, with the query Q = {Maxtor, Netvista}?


Sparse algorithm example

• Suppose we want to find the top-2 results for the query Q = {Maxtor, Netvista} on our existing database.

• The CN generator supplies our execution engine with the following Candidate Networks, with M=3: CQ, PQ, CQ ⋈ PQ, CQ ⋈ P{} ⋈ CQ, and CQ ⋈ U{} ⋈ CQ.

• We start off with CQ; let’s take a look:


Sparse algorithm example

CQ consists of all the tuples of Complaints (with different scores, of course). Take c3, for example: its SCORE is 1.33.







Sparse algorithm example

• We start off with CQ. There is no need to calculate MPS(CQ), but we do it anyway!

• We already know everything! (We got these exact results from the IR engine!)

• We now turn to examine the CN PQ ...

CQ: C3 = 1.33, C1 = 0.33, C2 = 0.33; MPS(CQ) = 1.33

Queue of the 2 best results: C3 = 1.33, C1 = 0.33


Sparse algorithm example

These are the relevant tuples that PQ consists of. Take p1, for example: its SCORE is 1.





Sparse algorithm example

• Let’s look at how the algorithm works on PQ:

• We calculate MPS(PQ) = 1, so it might still yield some result that can be added to the top-k queue.

• We now turn to examine the CN CQ ⋈ PQ ...

CQ: C3 = 1.33, C1 = 0.33, C2 = 0.33; MPS(CQ) = 1.33

PQ: P1 = 1, P2 = 1; MPS(PQ) = 1

Queue of the 2 best results: C3 = 1.33, P1 = 1 (P1 replaces C1 = 0.33)


Sparse algorithm example

These are the joins of CQ ⋈ PQ:

C3P2 SCORE: 1.17

C2P2 SCORE: 0.67

C1P1 SCORE: 0.67


Sparse algorithm example

• Now, we turn to examine CQ ⋈ PQ ...

• We calculate MPS(CQ ⋈ PQ) = (1 + 1.33) / 2 = 1.17, so it might still yield some result!

CQ: C3 = 1.33, C1 = 0.33, C2 = 0.33; MPS(CQ) = 1.33

PQ: P1 = 1, P2 = 1; MPS(PQ) = 1

CQ ⋈ PQ: C3P2 = 1.17, C1P1 = 0.67, C2P2 = 0.67; MPS(CQ ⋈ PQ) = 1.17

Queue of the 2 best results: C3 = 1.33, C3P2 = 1.17 (C3P2 replaces P1 = 1)


Sparse algorithm example

• Now, we turn to examine CQ ⋈ P{} ⋈ CQ ...

• We calculate MPS(CQ ⋈ P{} ⋈ CQ) = (1.33 + 1.33) / 3 = 0.89, which is below the lowest score in the queue, so we don’t need to calculate this CN! The same goes for CQ ⋈ U{} ⋈ CQ.

• We’re finished! We return {C3, C3P2} as results.

CQ: C3 = 1.33, C1 = 0.33, C2 = 0.33; MPS(CQ) = 1.33

PQ: P1 = 1, P2 = 1; MPS(PQ) = 1

CQ ⋈ PQ: C3P2 = 1.17, C1P1 = 0.67, C2P2 = 0.67; MPS(CQ ⋈ PQ) = 1.17

CQ ⋈ P{} ⋈ CQ: MPS = 0.89, no need to calculate!

Queue of the 2 best results: C3 = 1.33, C3P2 = 1.17


Sparse is nice, but…

• What if there are many possible answers, some of them requiring multiple joins? (Keywords are “hiding” in multiple relations)

• Apparently, the Sparse algorithm becomes (almost) as inefficient as the Naïve algorithm – especially acute in AND queries.

• What plan should we devise now??

• We need to make better use of our architecture!


The Single-pipelined algorithm

• The Single-Pipelined algorithm is essentially what we’d like to happen in the case of a SINGLE CN.

• It DOES NOT solve the whole problem,

• but…

it’s a great building block for the more sophisticated General-pipelined algorithm!



• This algorithm accepts a Candidate Network and the non-empty tuple sets TS1…TSk that participate in it.

• Recall that TSi corresponds to a relation Ri and holds its tuples that match the query keywords (already sorted in descending order of the SCORE function).

• The Single-Pipelined Algorithm’s output: A stream of joining trees of tuples in descending SCORE order.



• We need to keep track of the prefix S(TS) we’ve already retrieved from every tuple set.

• Each iteration, retrieve another tuple t from some TSk, and try to match it against all other tuple sets, to create potential joining trees.

• All the joining trees of tuples T that we’ve found are added to the Queue of results.

• Anyone see a problem here?



• Yup, we’re back to the Naïve algorithm, aren’t we?

• Well – not quite!

• In order to guarantee that some result we’ve produced will be in the top-k, we need a similar method to the MPS.

• The MPFSi - Maximum Possible Future Score will be our estimate for the maximum score of any yet “unseen” result from TSi.



• We would have liked to use the state of each prefix S(TSi) to bound the maximum score that a yet unretrieved tuple of TSi can yield:

MPFSi = max { Score(T,Q) | T ∈ TS1 ⋈ … ⋈ TSi-1 ⋈ (TSi – S(TSi)) ⋈ TSi+1 ⋈ … ⋈ TSn }

• This is expensive!

• Instead we produce a cheaper overestimate, MPFS’i, computed as the score of the next tuple from TSi combined with the top-ranked tuples from every other TS.
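The cheap overestimate is easy to compute. A minimal Python sketch, assuming the combined score is the sum of the tuple scores divided by the CN size, and that an exhausted tuple set has MPFS’ = 0; the name `mpfs_prime` and its argument layout are ours. The scores are those of the three tuple sets used in the walk-through that follows.

```python
def mpfs_prime(scores, retrieved, i, cn_size):
    """Cheap overestimate MPFS'_i.
    scores[j]: scores of tuple set TS_j in decreasing order;
    retrieved[j]: number of tuples of TS_j already read, i.e. |S(TS_j)|."""
    if retrieved[i] == len(scores[i]):
        return 0.0                       # TS_i has no unseen tuple left
    next_score = scores[i][retrieved[i]]
    top_others = sum(s[0] for j, s in enumerate(scores) if j != i)
    return (next_score + top_others) / cn_size

scores = [[3, 2, 1], [9, 3, 1], [7, 3, 2]]   # TS1, TS2, TS3
print(mpfs_prime(scores, [0, 0, 0], 0, cn_size=6))   # (3 + 9 + 7) / 6
print(mpfs_prime(scores, [1, 0, 0], 0, cn_size=6))   # (2 + 9 + 7) / 6
```

Only one unseen tuple and the already known top tuples are consulted, so no join has to be executed to obtain the bound.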



Suppose the algorithm receives a CN with three tuple sets and three free tuple sets (R{}, Q{}, P{}) that connect them, so every joining tree has size 6. We want the algorithm to output the BEST-6 results!

TS1             TS2             TS3
TupleId  Score  TupleId  Score  TupleId  Score
A1       3      B1       9      C1       7
A2       2      B2       3      C2       3
A3       1      B3       1      C3       2

Initially S(TS1) = S(TS2) = S(TS3) = ∅ and the output queue is empty.

First, we calculate MPFS’i, which is the same for every TS in the beginning:

MPFS’1 = MPFS’2 = MPFS’3 = (3+9+7)/6 = 3.16

Then we compute MPFS’all as the maximum of the MPFS’i:

MPFS’all = 3.16



Now, we advance one of the S(TSi), say S(TS1), and have to update MPFS’1:

S(TS1) = {A1}, MPFS’1 = (2+9+7)/6 = 3, MPFS’2 = 3.16, MPFS’3 = 3.16, MPFS’all = 3.16

We try to join A1 with all the other tuples in the S(TSi), but there aren’t any.

We advance S(TS2). We also have no luck getting join results. Now the MPFS’s will be:

S(TS2) = {B1}, MPFS’1 = 3, MPFS’2 = (3+3+7)/6 = 2.16, MPFS’3 = 3.16, MPFS’all = 3.16

We advance S(TS3), and this time we’ve managed to join C1 ⇝ B1 ⇜ A1. (We’re not forgetting to update MPFS’3!)

S(TS3) = {C1}, MPFS’1 = 3, MPFS’2 = 2.16, MPFS’3 = (3+3+9)/6 = 2.5

The SCORE of C1 ⇝ B1 ⇜ A1 is 3.16 = MPFS’all, so we output it!

Output: C1B1A1 = 3.16



But now MPFS’all should reduce! Remember, it’s equal to Max{MPFS’i}:

MPFS’1 = 3, MPFS’2 = 2.16, MPFS’3 = 2.5, MPFS’all = 3

Now, we turn to advance S(TS1) again. We have no luck joining A2, but we update MPFS’1 and MPFS’all:

S(TS1) = {A1, A2}, MPFS’1 = (1+9+7)/6 = 2.83, MPFS’2 = 2.16, MPFS’3 = 2.5, MPFS’all = 2.83

We now advance S(TS2) and try to join B2 with the tuples in the other S(TSi), and succeed! We find C1 ⇝ B2 ⇜ A2 with score (3+2+7)/6 = 2 < MPFS’all, so we keep it in a queue for later output.

S(TS2) = {B1, B2}, MPFS’1 = 2.83, MPFS’2 = (3+1+7)/6 = 1.83, MPFS’3 = 2.5, MPFS’all = 2.83

We now advance S(TS3) and try to join C2 with the tuples in the other S(TSi), and succeed! We find C2 ⇝ B1 ⇜ A1 with score (3+3+9)/6 = 2.5 < MPFS’all, so we keep it in the queue for later output as well.

S(TS3) = {C1, C2}, MPFS’1 = 2.83, MPFS’2 = 1.83, MPFS’3 = (3+9+2)/6 = 2.33, MPFS’all = 2.83

Output so far: C1B1A1 = 3.16. Queued for later output: C2B1A1 = 2.5, C1B2A2 = 2.



We go back to S(TS1). We advance it and manage to find two joins:

• C1 ⇝ B1 ⇜ A3, with score (1+9+7)/6 = 2.83 = MPFS’all, which we output!

• C2 ⇝ B1 ⇜ A3, with score (1+9+3)/6 = 2.16, which we can’t yet output. We keep it in the queue for later output.

S(TS1) = {A1, A2, A3}, MPFS’1 = 0 (TS1 is exhausted), MPFS’2 = 1.83, MPFS’3 = 2.33

Now MPFS’all updates to 2.33, and we already have a result from before that can be output: C2 ⇝ B1 ⇜ A1! (SCORE = 2.5)

And what about C2 ⇝ B1 ⇜ A3, with score 2.16? It is still below MPFS’all, so it has to wait.

Output so far: C1B1A1 = 3.16, C1B1A3 = 2.83, C2B1A1 = 2.5. Queued for later output: C2B1A3 = 2.16, C1B2A2 = 2.

Now, let’s advance S(TS3). CAN ANYONE GUESS WHY NOT S(TS2)? TS3 has the biggest MPFS’i, so it is the most likely to yield results!

We find C3 ⇝ B1 ⇜ A2, with score (2+9+2)/6 = 2.16, which we can’t yet output.

S(TS3) = {C1, C2, C3}, MPFS’3 = 0

But with MPFS’3 = 0 we have to update MPFS’all to 1.83, so it turns out we can output C3 ⇝ B1 ⇜ A2 after all, together with C2 ⇝ B1 ⇜ A3 (both with score 2.16).

Also, remember C1 ⇝ B2 ⇜ A2 with SCORE = 2? Its time has come to be output!

That’s it, we’re done! Phew…

Output: C1B1A1 = 3.16, C1B1A3 = 2.83, C2B1A1 = 2.5, C2B1A3 = 2.16, C3B1A2 = 2.16, C1B2A2 = 2


In the common case…

• This algorithm would output the best results of the specific CN quickly,

• and would save time by not touching non-promising TSs!

• In our example it didn’t really happen (only the last tuple from TS2 was untouched)…
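The walk-through can be replayed with a short Python simulation. It is a sketch under stated assumptions: `joinable` is a hypothetical stand-in for the parameterized SQL query that checks whether tuples join, the set `joins` lists the joining trees found in the walk-through, the combined score is the sum of tuple scores divided by the CN size, and the sketch always advances the tuple set with the largest MPFS’, so its intermediate steps may come in a different order than in the walk-through.

```python
import heapq
from itertools import product

def single_pipelined(tuple_sets, joinable, k, cn_size):
    """tuple_sets: lists of (tuple_id, score), each sorted by decreasing score.
    joinable(ids): True if the given tuples (one per set) form a joining tree."""
    n = len(tuple_sets)
    retrieved = [0] * n          # length of the prefix S(TS_i) read so far
    queue, output = [], []       # queue: max-heap of (-score, ids)

    def mpfs(i):
        if retrieved[i] == len(tuple_sets[i]):
            return 0.0           # TS_i is exhausted
        nxt = tuple_sets[i][retrieved[i]][1]
        tops = sum(ts[0][1] for j, ts in enumerate(tuple_sets) if j != i)
        return (nxt + tops) / cn_size

    while len(output) < k:
        bounds = [mpfs(i) for i in range(n)]
        overall = max(bounds)    # MPFS'_all
        # Emit queued results that no unseen result can beat.
        while queue and len(output) < k and -queue[0][0] >= overall:
            neg, ids = heapq.heappop(queue)
            output.append((ids, -neg))
        if len(output) == k or overall == 0:
            break
        i = bounds.index(overall)            # most promising tuple set
        new = tuple_sets[i][retrieved[i]]
        retrieved[i] += 1
        # Join the new tuple with the retrieved prefixes of the other sets.
        prefixes = [[new] if j == i else ts[:retrieved[j]]
                    for j, ts in enumerate(tuple_sets)]
        for combo in product(*prefixes):
            ids = tuple(t[0] for t in combo)
            if joinable(ids):
                score = sum(t[1] for t in combo) / cn_size
                heapq.heappush(queue, (-score, ids))
    return output

ts = [[("A1", 3), ("A2", 2), ("A3", 1)],
      [("B1", 9), ("B2", 3), ("B3", 1)],
      [("C1", 7), ("C2", 3), ("C3", 2)]]
joins = {("A1", "B1", "C1"), ("A2", "B2", "C1"), ("A1", "B1", "C2"),
         ("A3", "B1", "C1"), ("A3", "B1", "C2"), ("A2", "B1", "C3")}
results = single_pipelined(ts, lambda ids: ids in joins, k=6, cn_size=6)
for ids, score in results:
    print(ids, round(score, 2))
```

The simulation emits the same six joining trees in descending score order without ever reading B3. It prints rounded values, so 19/6 appears as 3.17 rather than the truncated 3.16 used above.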


The General-pipelined algorithm

• As mentioned before, the Single-pipelined algorithm (which operates on a SINGLE CN) does not solve the whole problem.

• However, a concurrent approach using the Single-pipelined algorithm might!

• This is exactly the idea behind the General-pipelined algorithm:



• The General-pipelined algorithm evaluates all the CNs concurrently, using a priority-based, preemptive, round-robin protocol.

• What’s the priority of each CNi? MPFS’i !

• Also, a result will only be output once its score is no lower than GMPFS’, the maximal value among the current MPFS’ values of all the CNs.



[Diagram: the CN queue (CN5, CN1, CN3, CN2, …), ordered by ascending MPFS, feeds the execution engine; results wait in a queue of future results until they are output to the user.]


The Hybrid algorithm

• The hybrid algorithm simply combines the power of the two most successful algorithms.

• It estimates the number of results that a query would produce.

• If it expects “few” results, it runs the Sparse algorithm.

• In any other case, it runs the General-pipelined algorithm!






Runtime

• All the algorithms were run through a series of runtime tests.

• The tests used the DBLP data set translated to relations (Conferences, Papers, Citations…). Each test varied one parameter (e.g. query size) while the others were kept constant.

• Different tests for AND and OR semantics.

• Also, two modified algorithms are sometimes used:

– SASymmetric - Single pipelined with round-robin

– GASymmetric - General pipelined with round-robin


Maximal CN size (OR)

• This test evaluates M, the maximal CN size.


Maximal CN size (AND)

• Clearly, bigger M’s have a greater impact with AND queries (Why?).


Number of keywords (OR)


Number of keywords (AND)






Criticism

• Runtime is not clearly stated in the article (for a reason!)

• It is affected heavily by query size! For |Q|>4, most queries will take a lot of time!

• The same goes for M>6…

• The system is a bit “platform-dependent”… prone to future RDBMS policy changes…


Conclusion

• Today we’ve discussed a method for IR-style keyword search over relational databases:

– Motivations for such searches

– An architecture that can achieve this goal

– Several algorithms, of varying efficiency, that produce the results.

– Experimental results that allow better evaluation of runtime.


Thank You!

…Questions?

Phew!...


DISCOVER – original Architecture
