Boolean + Ranking: Querying a Database by K-Constrained Optimization

The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign
Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang


Post on 02-Jan-2016



Page 1: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

The Database and Info. Systems Lab, University of Illinois at Urbana-Champaign

Boolean + Ranking: Querying a Database by K-Constrained Optimization

Zhen Zhang
Joint work with: Seung-won Hwang, Kevin C. Chang, Min Wang, Christian A. Lang, Yuan-chi Chang

Page 2: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Many queries naturally combine Boolean and ranking

Information retrieval
  Ranking query: Top 5 ranked by gpa

Traditional databases
  Boolean query: dept = CS and year = 2

Database applications on Web (information retrieval + traditional databases)
  Qualifying constraint B: dept = CS and year = 2
  Quantifying function R: gpa
  Find top answers

Page 3: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Motivating scenarios

Data retrieval: Find houses in a certain price range with a good price/sqft ratio

  Select h.address from House h
  Where h.price ≤ 200k ∨ h.price ≥ 400k
  Order by h.size/|h.price-300k| Limit 1

  Select h.address from House h, CrimeRate c
  Where (h.price ≤ 200k ∨ h.price ≥ 400k) and h.zipcode = c.zipcode
  Order by h.size/|h.price-300k| * c.crimerate^-1 Limit 10

Data analysis: Find products with the highest sale increase in consecutive years

  Select s1.itemid from Sales s1, Sales s2
  Where s1.itemid = s2.itemid and s2.year - s1.year = 1
  Order by s2.sale - s1.sale Limit 10

Page 4: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Boolean + Ranking form a coherent goal function

Boolean B + Ranking R = Goal function G

For a tuple t:

$$G(t) = B(t) \cdot R(t) = \begin{cases} R(t) & \text{if } B(t) \text{ is true} \\ 0 & \text{if } B(t) \text{ is false (i.e., lowest score)} \end{cases}$$
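The composition of B and R into a single goal function can be sketched in Python. This is a minimal illustration and not the authors' implementation: the helper `make_goal`, the names `boolean_b` and `ranking_r`, and the two sample tuples are assumptions made for the example; only the predicate (dept = CS and year = 2) and the ranking attribute (gpa) come from the slides.

```python
from typing import Any, Callable, Dict

Row = Dict[str, Any]

def make_goal(b: Callable[[Row], bool], r: Callable[[Row], float]) -> Callable[[Row], float]:
    """Build G(t) = R(t) if B(t) holds, else 0 (the lowest score)."""
    def g(t: Row) -> float:
        # Evaluate B first so R is never computed for a disqualified tuple.
        return r(t) if b(t) else 0.0
    return g

# B and R from the earlier slide: dept = CS and year = 2, ranked by gpa.
boolean_b = lambda t: t["dept"] == "CS" and t["year"] == 2
ranking_r = lambda t: t["gpa"]
goal = make_goal(boolean_b, ranking_r)

qualifying = {"dept": "CS", "year": 2, "gpa": 3.7}    # hypothetical tuple
disqualified = {"dept": "EE", "year": 2, "gpa": 3.9}  # hypothetical tuple
print(goal(qualifying), goal(disqualified))
```

Treating a failed Boolean condition as the lowest score is what lets one function carry both the filter and the ranking.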

Page 5: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

The nature of Boolean + Ranking is a K-constrained optimization query

Optimize goal function G over database D

Goal function G: h.size/|h.price-300k| [h.price ≤ 200k ∨ h.price ≥ 400k]

Database D:

   Addr                Zip    Price  Size
1. Oak park, Chicago   60644  600K   4500
2. Mattis, Champaign   61821  350K   2000
3. …                   …      150K   1000
4. …                   …      250K   2000
5. …                   …      300K   3500
6. …                   …      80K    500
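To make the query semantics concrete, the goal function can be evaluated over the six listed tuples with a naive full scan. This is only a reference point for what the query should return, not the evaluation strategy the talk proposes; the names `HOUSES`, `goal`, and `top_k` are assumptions for the sketch, and prices are written in thousands.

```python
import heapq

# (id, price in thousands, size) from the slide's table; address and zip omitted.
HOUSES = [(1, 600, 4500), (2, 350, 2000), (3, 150, 1000),
          (4, 250, 2000), (5, 300, 3500), (6, 80, 500)]

def goal(price_k: float, size: float) -> float:
    """G = size / |price - 300k| if price <= 200k or price >= 400k, else 0."""
    if not (price_k <= 200 or price_k >= 400):
        return 0.0  # B fails; this also avoids dividing by zero at price = 300k
    return size / abs(price_k - 300)

def top_k(houses, k):
    """Full scan: score every tuple and keep the k best."""
    return heapq.nlargest(k, houses, key=lambda h: goal(h[1], h[2]))

best = top_k(HOUSES, 1)[0]
print(best[0], goal(best[1], best[2]))
```

Tuples 2, 4, and 5 fail the price constraint and score 0; among the rest, tuple 1 scores 4500/300 = 15 and ranks first.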

Page 6: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

What is the query evaluation mechanism?

Ranking query + Boolean query: how to answer?

Page 7: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Current techniques lack a global search mechanism

If evaluated as separate operators (Boolean query B and ranking query R over D)
  Current techniques optimize only condition-by-condition

If searched by an overall goal function G (combining B and R) as a ranking function over D
  Current techniques restrict G to be monotonic

Page 8: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Our thesis: Evaluate query as its nature suggests!

Optimize G over D
  Function optimization of G
  Discrete state search over D

[Figure: function optimization of G and discrete state search over D combine into OPT*]

Page 9: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We view compound index as discrete space

   Addr                Zip    Price  Size
1. Oak park, Chicago   60644  600K   4500
2. Mattis, Champaign   61821  350K   2000
3. …                   …      150K   1000
4. …                   …      250K   2000
5. …                   …      300K   3500
6. …                   …      80K    500

Page 10: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We view compound index as discrete space

[Figure: tuples 1 to 6 plotted in the plane with axes Price (k) and size. Each axis carries a hierarchical index: nodes b1, b2, b3, b6, b7 over price (tick marks at 100, 250, 350, 600) and nodes a1, a2, a3, a6, a7 over size (tick marks at 1500, 3000, 4000, 4500).]

Page 11: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We view compound index as discrete space

Mij = (ai, bj)

[Figure: the price index (nodes b1, b2, b3, b6, b7) and the size index (nodes a1, a2, a3, a6, a7) are paired into states over the Price (k) / size plane. M11 sits at the top; M22, M32, M23, M33 below it; then M55, M56, M75, M66, M77, M67, M76; tuples 1, 5, 4, 2 appear beneath the states.]

Page 12: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We view compound index as discrete space

Mij = (ai, bj): conceptually, a combined space

[Figure: the price index (nodes b1, b2, b3, b6, b7) and the size index (nodes a1, a2, a3, a6, a7) over the Price (k) / size plane, with the states M11; M22, M32, M23, M33; M55, M56, M75, M66, M77, M67, M76; and tuples 1, 5, 4, 2 shown as one combined space.]
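The pairing Mij = (ai, bj) can be sketched as a product of two index hierarchies. The code below is an illustration under assumptions: the `Node` class and `expand` function are hypothetical, the indices are cut down to two levels, and the ranges are read off the figure's labels where legible, so treat them as illustrative rather than as the authors' data structure.

```python
from dataclasses import dataclass
from itertools import product
from typing import Tuple

@dataclass(frozen=True)
class Node:
    """One index node covering the value range [lo, hi] of a single attribute."""
    name: str
    lo: float
    hi: float
    children: Tuple["Node", ...] = ()

def expand(state):
    """Children of state (a, b): every pairing of a's children with b's children.

    A leaf node on one side is kept as-is while the other side is refined;
    a state made of two leaves has no child states.
    """
    a, b = state
    if not a.children and not b.children:
        return []
    return list(product(a.children or (a,), b.children or (b,)))

# Size index (a-nodes) and price index (b-nodes), reduced to two levels.
a2, a3 = Node("a2", 0, 3000), Node("a3", 3000, 4500)
a1 = Node("a1", 0, 4500, (a2, a3))
b2, b3 = Node("b2", 0, 250), Node("b3", 250, 600)
b1 = Node("b1", 0, 600, (b2, b3))

m11 = (a1, b1)
print([(a.name, b.name) for a, b in expand(m11)])
```

The states are produced on demand from the two indices, which is one way to read "conceptually" on the slide: the combined space does not have to be stored.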

Page 13: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

How to perform the search in the space?

What is the search mechanism?
  How to conceptually view the index space of D for search

How to guide the search?
  How to use function G to focus the search

Page 14: Boolean + Ranking:  Querying a Database by K-Constrained Optimization


Challenge 1: What is the search mechanism?

Page 15: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We encode as A* because it’s optimal

What A* is: Finding the shortest path
Why we choose it: Completeness and optimality with proper heuristics
  Complete: guaranteed to find the shortest path
  Optimal: visits the least number of nodes

[Figure: weighted graph with a shortest path from origin to destination]
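For readers who want the search mechanism spelled out, here is a compact textbook A* in Python. The graph, its edge weights, and the heuristic values are hypothetical (the slide's own graph cannot be recovered from the transcript); the heuristic is chosen to be admissible, meaning it never overestimates the remaining distance.

```python
import heapq

def a_star(graph, h, start, goal):
    """Return (distance, path) of a shortest path, or None if goal is unreachable."""
    # Frontier entries: (f = g + h, g, node, path so far).
    frontier = [(h[start], 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, node, path = heapq.heappop(frontier)
        if node == goal:
            return g, path
        if g > best_g.get(node, float("inf")):
            continue  # stale entry: a cheaper route to this node was found later
        for nxt, weight in graph.get(node, {}).items():
            new_g = g + weight
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h[nxt], new_g, nxt, path + [nxt]))
    return None

# Hypothetical graph and admissible heuristic (estimated distance to destination).
GRAPH = {
    "origin": {"a": 5, "b": 2},
    "a": {"destination": 3},
    "b": {"a": 1, "c": 6},
    "c": {"destination": 1},
}
H = {"origin": 5, "a": 3, "b": 4, "c": 1, "destination": 0}

print(a_star(GRAPH, H, "origin", "destination"))
```

On this graph the route through `b` and `a` costs 2 + 1 + 3 = 6, shorter than the direct route through `a` (8) or the route through `c` (9).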

Page 16: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Encoding our problem into shortest path is challenging

K-constrained optimization: Find a tuple with maximal score
Shortest path: Find a path with minimal distance

How to encode: a tuple → a path? score of tuple → distance of path?

Page 17: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Therefore, we encode K-constrained opt. as:

How to encode a tuple to a path?
  Adding a virtual target t* only reachable through tuples

How to encode maximal tuple with minimal path?
  Quality of path depends solely on the tuple it passes by
  For tuple state t: D(t, t*) = -G(t)
  For two states r, u: D(r, u) = 0

[Figure: the state space (M11; M22, M32, M23, M33; M55, M56, M75, M66, M77, M67, M76; tuples 1, 5, 4, 2) with the virtual target t* added below the tuples. Links between states are labeled 0; the links from tuples 1 and 4 to t* are labeled -G(1) and -G(4).]
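The encoding can be checked on a miniature example: with zero-cost links between states and a final link of cost -G(t) into t*, the shortest path to t* passes through the tuple with the largest G. The sketch below assumes a hypothetical four-state graph (which region holds which tuple is invented); the G values for tuples 1 and 4 follow from the house table and goal function shown earlier (15 and 0).

```python
def edge_distance(u, v, g_of):
    """D(t, t*) = -G(t) for a tuple state t; every other link costs 0."""
    return -g_of[u] if v == "t*" else 0.0

def all_paths(graph, node, target):
    """Enumerate every path from node to target (fine for a tiny acyclic graph)."""
    if node == target:
        yield [node]
        return
    for nxt in graph.get(node, []):
        for rest in all_paths(graph, nxt, target):
            yield [node] + rest

def path_distance(path, g_of):
    return sum(edge_distance(u, v, g_of) for u, v in zip(path, path[1:]))

# Hypothetical layout: region states lead to tuple states,
# and only tuple states link to the virtual target t*.
GRAPH = {"M11": ["M22", "M33"], "M22": ["tuple4"], "M33": ["tuple1"],
         "tuple4": ["t*"], "tuple1": ["t*"]}
G_OF = {"tuple1": 15.0, "tuple4": 0.0}

shortest = min(all_paths(GRAPH, "M11", "t*"), key=lambda p: path_distance(p, G_OF))
print(shortest, path_distance(shortest, G_OF))
```

Because only the last link carries a cost, minimizing path distance and maximizing G select the same tuple.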

Page 18: Boolean + Ranking:  Querying a Database by K-Constrained Optimization


Challenge 2: How to guide the search?

Page 19: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

We use function opt. to sketch the landscape of G

Function optimization measures quality of states
Function optimization enables:
  1. How to define heuristics?
  2. How to configure space?
  3. Where to start the search?

Page 20: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

1. Define admissible heuristics: Measure tightest upper bound

H(region) = OPTMAX(G, region), i.e., maximal value of G in the region

To guarantee completeness
  A* requires admissible heuristics, i.e., estimate optimistically
To ensure admissible heuristics
  Function optimization gives tightest upper bound
    Analytical approaches
    Numeric analysis package
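As an example of the analytical approach, the maximum of the house goal function over a rectangular region can be written in closed form. This sketch is specific to G = size / |price - 300k| with the constraint price ≤ 200k or price ≥ 400k; the function name `upper_bound` and the box representation are assumptions, and prices are in thousands.

```python
def upper_bound(price_lo, price_hi, size_lo, size_hi):
    """Maximum of G over the box [price_lo, price_hi] x [size_lo, size_hi].

    G grows with size and shrinks with |price - 300|, so the maximum pairs the
    largest size with the feasible price closest to 300. size_lo is unused
    because the bound only needs the largest size in the box.
    """
    candidates = []
    if price_lo <= 200:                  # box overlaps the branch price <= 200
        candidates.append(min(price_hi, 200))
    if price_hi >= 400:                  # box overlaps the branch price >= 400
        candidates.append(max(price_lo, 400))
    if not candidates:
        return 0.0                       # B fails everywhere in the box, so G = 0
    nearest = min(abs(p - 300) for p in candidates)
    return size_hi / nearest

print(upper_bound(0, 250, 0, 3000))       # region of cheaper, smaller houses
print(upper_bound(250, 600, 3000, 4500))  # region of pricier, larger houses
print(upper_bound(250, 350, 0, 3000))     # region entirely excluded by B
```

The bound is optimistic by construction: no tuple inside a region can score higher, which is the admissibility A* needs. For goal functions without a convenient closed form, a numerical optimizer would play the same role.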

Page 21: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

2. Configure descending space: disconnect uphills

To guarantee optimality
  A* requires descending heuristics
To ensure descending heuristics
  Remove uphill links

[Figure: the state space (M11; M22, M32, M23, M33; M55, M56, M75, M66, M77, M67, M76; tuples 1, 5, 4, 2) with uphill links removed]
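Removing uphill links amounts to filtering the neighbor lists by the heuristic value. The following sketch assumes a hypothetical set of four neighboring states and hypothetical upper bounds H; the function name `remove_uphill_links` is also an assumption.

```python
def remove_uphill_links(neighbors, h):
    """Keep only links r -> u with h[u] <= h[r], so no step increases the bound."""
    return {r: [u for u in us if h[u] <= h[r]] for r, us in neighbors.items()}

# Hypothetical states, adjacency, and upper bounds.
NEIGHBORS = {"M22": ["M23", "M32"], "M23": ["M22", "M33"],
             "M32": ["M22", "M33"], "M33": ["M23", "M32"]}
H = {"M22": 30.0, "M23": 10.0, "M32": 20.0, "M33": 45.0}

descending = remove_uphill_links(NEIGHBORS, H)
print(descending)
```

After filtering, every path through the space visits states in non-increasing order of H.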

Page 22: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

3. Find right start point: Start from local optima

To guarantee correctness
  Every tuple state must be reachable from start states
  Taking only downhills requires starting from high points
To ensure reachability
  Initial states should contain all local optima

[Figure: the state space (M11; M22, M32, M23, M33; M55, M56, M75, M66, M77, M67, M76; tuples 1, 5, 4, 2) with the start states marked]
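The reachability argument can be checked mechanically: collect the states whose bound is at least that of every neighbor, then confirm that downhill moves from those states cover the whole space. The states, adjacency, and H values below are hypothetical, as are the function names.

```python
def local_optima(neighbors, h):
    """States whose bound is at least as high as every neighbor's."""
    return [s for s, us in neighbors.items() if all(h[s] >= h[u] for u in us)]

def reachable(starts, neighbors, h):
    """States reachable from the start states using only downhill links."""
    seen, stack = set(starts), list(starts)
    while stack:
        s = stack.pop()
        for u in neighbors[s]:
            if h[u] <= h[s] and u not in seen:
                seen.add(u)
                stack.append(u)
    return seen

# Hypothetical states, adjacency, and upper bounds.
NEIGHBORS = {"M22": ["M23", "M32"], "M23": ["M22", "M33"],
             "M32": ["M22", "M33"], "M33": ["M23", "M32"]}
H = {"M22": 30.0, "M23": 10.0, "M32": 20.0, "M33": 45.0}

starts = local_optima(NEIGHBORS, H)
print(starts, sorted(reachable(starts, NEIGHBORS, H)))
```

Starting from only one of the two optima would also reach every state in this tiny example, but in general a region whose only uphill neighbors lead to a missed optimum would be cut off.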

Page 23: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Putting together: Executing A* on the configured space

Search is implemented as priority queue driven traversal

[Figure: top-down traversal of the state space (M11; M22, M32, M23, M33; M55, M56, M57, M75, M66, M77, M67, M76, …; tuples 1, 5, 4, 2)]
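A priority-queue-driven, top-down traversal can be sketched end to end on the six-house example. This is a simplified best-first search under assumptions, not OPT* itself: the space is a hypothetical two-level split (price at 250k, size at 3000), regions are ordered by an analytic upper bound of G, tuples by their exact G, and the names `bound`, `children`, and `top_k` are invented for the sketch.

```python
import heapq
from itertools import count

HOUSES = {1: (600, 4500), 2: (350, 2000), 3: (150, 1000),
          4: (250, 2000), 5: (300, 3500), 6: (80, 500)}  # id -> (price k, size)
ROOT = ((0, 600), (0, 4500))

def goal(price, size):
    return size / abs(price - 300) if (price <= 200 or price >= 400) else 0.0

def bound(box):
    """Upper bound of G over a box: largest size over nearest feasible price."""
    (p_lo, p_hi), (_, s_hi) = box
    cands = [min(p_hi, 200)] if p_lo <= 200 else []
    if p_hi >= 400:
        cands.append(max(p_lo, 400))
    return s_hi / min(abs(p - 300) for p in cands) if cands else 0.0

def children(box):
    """The root splits into four regions; a region yields the tuples inside it."""
    if box == ROOT:
        return [("region", (p, s)) for p in ((0, 250), (250, 600))
                for s in ((0, 3000), (3000, 4500))]
    (p_lo, p_hi), (s_lo, s_hi) = box
    return [("tuple", i) for i, (p, s) in HOUSES.items()
            if p_lo < p <= p_hi and s_lo < s <= s_hi]

def top_k(k):
    tie = count()  # tie-breaker so heap entries never compare their payloads
    heap = [(-bound(ROOT), next(tie), "region", ROOT)]
    answers = []
    while heap and len(answers) < k:
        _, _, kind, item = heapq.heappop(heap)
        if kind == "tuple":
            answers.append(item)  # its exact G beats every bound still queued
            continue
        for child_kind, child in children(item):
            score = goal(*HOUSES[child]) if child_kind == "tuple" else bound(child)
            heapq.heappush(heap, (-score, next(tie), child_kind, child))
    return answers

print(top_k(3))
```

A tuple is reported only when it reaches the head of the queue, at which point no unexplored region can contain a better answer.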

Page 24: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Putting together: Executing A* on the configured space

Bottom-up approach is always better than top-down

[Figure: two traversals of the same state space (M11; M22, M32, M23, M33; M55, M56, M57, M75, M66, M77, M67, M76, …; tuples 1, 5, 4, 2), one labeled top-down and one labeled bottom-up]

Page 25: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Experiments

Comparison vs.
  Boolean then ranking
  Ranking then Boolean
Metrics: nodes accessed = Nl + Nt
Settings:
  Benchmark queries over real dataset
  Controlled queries over synthetic dataset

Page 26: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Benchmark queries

Datasets: 19,706 real estate listings crawled online

Queries
  Q1: size * bedrms / |price-450k| : [40k <= price <= 50k]
  Q2: size * e^bedrms / |price-350k| : [price < 400k ∧ size > 4000]
  Q3: size / price : [bedrms = 3 ∨ bedrms = 4]

[Figure: comparison of BR_unclustered, BR_clustered, and OPT* on Q1, Q2, and Q3]

Page 27: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Controlled queries

Datasets
  Three randomly generated datasets of 100k points
  Uniform, Gaussian, log-normal

Queries
  Linear average queries: (e.g., 0.4*a + 0.6*b)
  Nearest neighbor queries: (e.g., (x-3)^2 + (y-4)^2)
  Join queries: (0.4*R.a + 0.6*S.b : R.c=R.d)


Page 28: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

Conclusion

Problem
  Study K-constrained optimization queries as Boolean + ranking
Abstraction
  Encode K-constrained optimization into shortest path problem
Framework
  Develop OPT* to process K-constrained optimization

Page 29: Boolean + Ranking:  Querying a Database by K-Constrained Optimization


Thank you!

Questions?

Page 30: Boolean + Ranking:  Querying a Database by K-Constrained Optimization

How to implement function optimization?
How do we compare with RankSQL?
If bottom-up is always better, why consider top-down?
Computing upper bound for each region is costly
Random vs. sequential I/O
Assuming indices on every attribute?
Materialize state space for every query?
Exponential number of states when the number of attributes grows
  Not every attribute has an index on it
  Selectively choose the right index (attribute) to use
  We do perform experiments to study how the system scales with #attr
Your algorithm is not optimal because you change the space