Detecting Attribute Dependencies from Query Feedback

Peter J. Haas, Volker Markl (IBM Almaden Research Center, USA); Fabian Hueske (Univ. of Ulm, Germany). VLDB 2007. Summarized and presented by Jung-yeon, Yang, 2008.01.25, IDS Lab. Winter Seminar.

Page 1: Detecting Attribute Dependencies from Query Feedback

Detecting Attribute Dependencies from Query Feedback

Peter J. Haas, Volker Markl (IBM Almaden Research Center, USA)

Fabian Hueske (Univ. of Ulm, Germany)

Summarized by Jung-yeon, Yang

Presented by Jung-yeon, Yang

2008.01.25

VLDB 2007

IDS Lab. Winter Seminar

Page 2: Detecting Attribute Dependencies from Query Feedback

Copyright 2007 by CEBT Center for E-Business Technology, IDS Lab Seminar

Introduction

Real-world datasets typically exhibit strong and complex dependencies between data attributes.

Discovering this dependency structure is important in a variety of settings.

Commercial database systems currently use knowledge about dependent attributes for statistics configuration.

Approaches to detection and ranking

– Proactive

– Reactive

Page 3: Detecting Attribute Dependencies from Query Feedback


The Problem: Detecting Dependent Attributes

Example: Color and Year are independent if F(Color = c AND Year = y) = F(Color = c) × F(Year = y) for all values c and y

– F(P) = fraction of rows in table that satisfy predicate P

– Dependence = “significant” departure from independence

Detection needed for automatic statistics configuration in query optimizers

– Which multivariate statistics should we keep?

– Need to rank the dependencies (limited space budget)

Other uses include

– Schema discovery for data integration

– Data mining (dependency diagrams)

– Root-cause analysis and system monitoring
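The independence condition on this slide can be sketched in a few lines of Python. This is only an illustration: the row layout, the column names `color` and `year`, and the helper names `fraction` and `independence_gap` are assumptions made for the example, not something the slides define.

```python
def fraction(rows, predicate):
    """F(P): fraction of rows in the table that satisfy predicate P."""
    return sum(1 for r in rows if predicate(r)) / len(rows)


def independence_gap(rows, color, year):
    """F(color AND year) minus F(color) * F(year); zero under independence."""
    joint = fraction(rows, lambda r: r["color"] == color and r["year"] == year)
    f_color = fraction(rows, lambda r: r["color"] == color)
    f_year = fraction(rows, lambda r: r["year"] == year)
    return joint - f_color * f_year


# Hypothetical table in which color and year are independent.
independent_rows = [
    {"color": "blue", "year": 2005},
    {"color": "blue", "year": 2006},
    {"color": "red", "year": 2005},
    {"color": "red", "year": 2006},
]

# Hypothetical table in which color determines year.
dependent_rows = [
    {"color": "blue", "year": 2005},
    {"color": "blue", "year": 2005},
    {"color": "red", "year": 2006},
    {"color": "red", "year": 2006},
]

gap_independent = independence_gap(independent_rows, "blue", 2005)
gap_dependent = independence_gap(dependent_rows, "blue", 2005)
```

How large a gap counts as a "significant" departure is exactly what the detection methods on the following slides have to decide.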

Page 4: Detecting Attribute Dependencies from Query Feedback


Previous approaches : Proactive

CORDS [IMH+, SIGMOD '04]

– Sample the relation (or view) and compute a contingency table

– Compute a (robust) chi-squared statistic

– Declare a dependency if the statistic exceeds a threshold t

– Both t and the sample size are chosen using chi-squared theory
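The proactive test can be sketched as follows. This uses the classical Pearson chi-squared statistic on a contingency table built from sampled value pairs; it does not reproduce the robust variant or the sample-size rule that CORDS uses, and the function names and data are hypothetical. The threshold 3.841 is the 0.95 quantile of the chi-squared distribution with one degree of freedom, chosen here only for illustration.

```python
from collections import Counter


def chi_squared(pairs):
    """Pearson chi-squared statistic for a list of sampled (a, b) value pairs."""
    n = len(pairs)
    joint = Counter(pairs)
    row = Counter(a for a, _ in pairs)
    col = Counter(b for _, b in pairs)
    stat = 0.0
    for a in row:
        for b in col:
            expected = row[a] * col[b] / n  # expected count under independence
            stat += (joint[(a, b)] - expected) ** 2 / expected
    return stat


def looks_dependent(pairs, t):
    """Declare a dependency when the statistic exceeds the threshold t."""
    return chi_squared(pairs) > t


# Hypothetical sample of 100 (color, year) pairs.
sample = (
    [("blue", 2005)] * 40
    + [("red", 2006)] * 40
    + [("blue", 2006)] * 10
    + [("red", 2005)] * 10
)

T_95_DF1 = 3.841  # 0.95 quantile of chi-squared with 1 degree of freedom
stat = chi_squared(sample)
dependent = looks_dependent(sample, T_95_DF1)
```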

Page 5: Detecting Attribute Dependencies from Query Feedback

Previous approaches : Reactive

– Focus system resources on interesting attributes

– Complement proactive approaches

– Can exploit DB2 feedback warehouse

Page 6: Detecting Attribute Dependencies from Query Feedback

A Spectrum of Reactive Approaches

Page 7: Detecting Attribute Dependencies from Query Feedback

Reactive approach

Correlation Analyzer

Use multiple observations (actuals) for each attribute pair

– O1 = {(blue, 2005): 0.02, (blue): 0.2, (2005): 0.103}

– O2 = {(red, 2006): 0.07, (red): 0.82, (2006): 0.11}

– etc.

Computes a ratio for each pair and compares it to a reference value (the formulas for O1 and O2 were images and are not preserved in this transcript)

Attribute dependency if two or more observations look dependent

Problems

– Ad hoc procedures, wasted information

– Unstable: depends on amount, ordering of feedback
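A rough sketch of this kind of analyzer is given below. The transcript does not preserve the actual ratio or the value it is compared against, so two assumptions are made here: the ratio is taken to be the joint fraction divided by the product of the marginal fractions, and an observation "looks dependent" when that ratio differs from 1 by more than a hypothetical `tolerance`. Only the rule "two or more observations" comes from the slide.

```python
def ratio(obs):
    """Assumed ratio: joint fraction over the product of the two marginals."""
    return obs["joint"] / (obs["a"] * obs["b"])


def analyzer_flags_dependency(observations, tolerance=0.1, min_hits=2):
    """Flag the attribute pair if at least min_hits observations look dependent.

    tolerance is a hypothetical cutoff; the slide's reference value is not known.
    """
    hits = sum(1 for o in observations if abs(ratio(o) - 1.0) > tolerance)
    return hits >= min_hits


# The two observations listed on the slide.
o1 = {"joint": 0.02, "a": 0.2, "b": 0.103}   # (blue, 2005), (blue), (2005)
o2 = {"joint": 0.07, "a": 0.82, "b": 0.11}   # (red, 2006), (red), (2006)

flagged = analyzer_flags_dependency([o1, o2])
```

Under these assumptions only O2 departs noticeably from a ratio of 1, so the pair is not flagged. Adding one more dependent-looking observation would flip the decision, which illustrates why the slide calls the procedure unstable with respect to the amount of feedback.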

Page 8: Detecting Attribute Dependencies from Query Feedback

New Reactive approach

Dependency Discovery

– Like CORDS, but uses an incomplete contingency table with exact entries

– Critical value u from an extension of classical chi-squared theory

– Normalize HM to get a ranking metric
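The idea of a chi-squared-like statistic over an incomplete contingency table can be sketched as below. The transcript does not preserve the definition of HM, so this is not the authors' formula: it simply sums the usual chi-squared terms over the cells for which feedback exists, using exact fractions. The function name, the table cardinality `n`, and the sample cells are assumptions for illustration.

```python
def partial_chi_squared(n, cells):
    """Sum chi-squared terms over observed cells only.

    n     -- table cardinality (number of rows)
    cells -- iterable of (joint, marginal_a, marginal_b) exact fractions
             taken from query feedback; unobserved cells are skipped
    """
    stat = 0.0
    for joint, f_a, f_b in cells:
        expected = f_a * f_b  # joint fraction expected under independence
        stat += n * (joint - expected) ** 2 / expected
    return stat


# One observed cell, using the fractions from observation O1 and a
# hypothetical table of 1000 rows.
weak = partial_chi_squared(1000, [(0.02, 0.2, 0.103)])

# A hypothetical cell whose joint fraction is far from the product of marginals.
strong = partial_chi_squared(1000, [(0.15, 0.2, 0.3)])
```

Because the entries are exact rather than sampled, and because only some cells are present, the classical critical values do not apply directly, which is why the slide refers to an extension of chi-squared theory for choosing u.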

Page 9: Detecting Attribute Dependencies from Query Feedback

New Reactive approach

The HM Statistic

– Similar to chi-square

Choosing the Threshold u

Page 10: Detecting Attribute Dependencies from Query Feedback

New Reactive approach

Missing Feedback

Assume (rough) upper bound on relative error of estimate

– Can obtain from feedback-warehouse records

Handling Inconsistency

Problem: No full multivariate frequency distribution consistent with feedback

– Records collected at different time points

– Insert/Delete/Updates in between feedback observations

Solution method 1: use timestamps to resolve conflicts

Solution method 2: linear programming

– Obtain minimal adjustment of frequencies needed for consistency
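Solution method 1 might look like the sketch below. The slide only says that timestamps are used to resolve conflicts; the rule "the most recent record for a predicate wins" is an assumption, as are the record layout, the predicate strings, and the simple consistency check (a joint fraction can never exceed either marginal fraction).

```python
def resolve_by_timestamp(records):
    """Keep, for each predicate, the fraction from the most recent record.

    records -- iterable of (predicate, fraction, timestamp) tuples
    """
    latest = {}
    for predicate, fraction, ts in records:
        if predicate not in latest or ts > latest[predicate][1]:
            latest[predicate] = (fraction, ts)
    return {p: f for p, (f, _) in latest.items()}


def is_consistent(joint, marginals):
    """A joint fraction must not exceed any of its marginal fractions."""
    return all(joint <= m for m in marginals)


# Hypothetical feedback records collected at different times.
records = [
    ("color=blue", 0.20, 1),
    ("year=2005", 0.30, 2),
    ("color=blue AND year=2005", 0.22, 3),  # conflicts with the record at time 1
    ("color=blue", 0.25, 5),                # newer observation after inserts
]

resolved = resolve_by_timestamp(records)
ok = is_consistent(
    resolved["color=blue AND year=2005"],
    [resolved["color=blue"], resolved["year=2005"]],
)
```

Timestamps alone cannot repair every conflict, which is where the linear-programming method comes in: it looks for the smallest adjustment of the frequencies that makes them mutually consistent.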

Page 11: Detecting Attribute Dependencies from Query Feedback

New Reactive approach

Ranking Attribute Pairs

Heuristic normalizations: HM / z, with candidate choices for z

– Table Cardinality

– Minimal number of distinct values

– Degrees of freedom of chi-squared distribution

– * 0.995 Quantile of
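Ranking by HM / z can be sketched as follows for three of the normalization constants listed above. All numbers are invented for illustration, the attribute names are hypothetical, and the degrees of freedom are computed as (d1 - 1)(d2 - 1), which is the value for a full contingency table and may differ from what the authors use for incomplete tables. The quantile-based constant is omitted because its definition is cut off in the transcript.

```python
def by_cardinality(s):
    return s["cardinality"]


def by_min_distinct(s):
    return min(s["distinct_a"], s["distinct_b"])


def by_degrees_of_freedom(s):
    # Full-table degrees of freedom; an assumption for this sketch.
    return (s["distinct_a"] - 1) * (s["distinct_b"] - 1)


def rank_pairs(stats, normalizer):
    """Order attribute pairs by HM / z, strongest dependency first."""
    scored = [(s["hm"] / normalizer(s), s["pair"]) for s in stats]
    return [pair for _, pair in sorted(scored, reverse=True)]


# Hypothetical HM values and table statistics for two attribute pairs.
stats = [
    {"pair": ("make", "model"), "hm": 900.0, "cardinality": 1000,
     "distinct_a": 10, "distinct_b": 50},
    {"pair": ("color", "year"), "hm": 40.0, "cardinality": 1000,
     "distinct_a": 5, "distinct_b": 4},
]

rank_card = rank_pairs(stats, by_cardinality)
rank_min = rank_pairs(stats, by_min_distinct)
rank_dof = rank_pairs(stats, by_degrees_of_freedom)
```

With these made-up numbers the first two normalizations put (make, model) first, while dividing by the degrees of freedom puts (color, year) first, so the choice of z can change the ranking.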

Page 12: Detecting Attribute Dependencies from Query Feedback

Experiment

Normalization Constants

Page 13: Detecting Attribute Dependencies from Query Feedback

Experiment

Ranking vs Amount of Feedback

Page 14: Detecting Attribute Dependencies from Query Feedback

Experiment

Dependency Measure vs Amount of Feedback

Page 15: Detecting Attribute Dependencies from Query Feedback

Experiment

Execution Time

Page 16: Detecting Attribute Dependencies from Query Feedback

Conclusions

Query feedback is an effective way to detect dependence

– Chi-squared extension to implement detection

– Attributes can be in multiple tables

Effective ranking methods

Practical solutions for handling inconsistent or missing feedback

Acceptable performance using sampling and incremental maintenance
