Detecting Attribute Dependencies from Query Feedback
Peter J. Haas, Volker Markl (IBM Almaden Research Center, USA)
Fabian Hueske (Univ. of Ulm, Germany)
VLDB 2007
Summarized and presented by Jung-yeon Yang
IDS Lab. Winter Seminar, 2008.01.25
Copyright 2007 by CEBT, Center for E-Business Technology
Introduction
Real-world datasets typically exhibit strong and complex dependencies between data attributes.
Discovering this dependency structure is important in a variety of settings.
Commercial database systems currently use knowledge about dependent attributes for statistics configuration.
Approaches to detection and ranking
– Proactive
– Reactive
The Problem: Detecting Dependent Attributes
Example: Color and Year are independent if F(Color = c AND Year = y) = F(Color = c) × F(Year = y) for all values c and y
– F(P) = fraction of rows in the table that satisfy predicate P
– Dependence = “significant” departure from independence
Detection is needed for automatic statistics configuration in query optimizers
– Which multivariate statistics should we keep?
– Need to rank the dependencies (limited space budget)
Other uses include
– Schema discovery for data integration
– Data mining (dependency diagrams)
– Root-cause analysis and system monitoring
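As a minimal illustration of the independence condition above, the sketch below computes F(P) over a small in-memory table and compares the joint fraction with the product of the marginal fractions. The table contents and the helper names `fraction` and `independence_gap` are hypothetical, chosen only for illustration.

```python
# Hypothetical toy table; each row is a (color, year) pair.
ROWS = [
    ("blue", 2005), ("blue", 2006),
    ("red", 2005), ("red", 2006),
]

def fraction(rows, predicate):
    """F(P): fraction of rows that satisfy predicate P."""
    return sum(1 for row in rows if predicate(row)) / len(rows)

def independence_gap(rows, color, year):
    """Joint fraction minus the product of the marginal fractions.

    A gap of 0 for every (color, year) combination means independence.
    """
    joint = fraction(rows, lambda r: r[0] == color and r[1] == year)
    f_color = fraction(rows, lambda r: r[0] == color)
    f_year = fraction(rows, lambda r: r[1] == year)
    return joint - f_color * f_year

if __name__ == "__main__":
    for c in ("blue", "red"):
        for y in (2005, 2006):
            print(c, y, independence_gap(ROWS, c, y))
```

In practice the gap is rarely exactly zero, which is why the slide speaks of a "significant" departure rather than any departure.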
Previous Approaches: Proactive
CORDS [IMH+, SIGMOD '04]
– Sample the relation (or view) and compute a contingency table
– Compute a (robust) chi-squared statistic
– Declare dependency if the statistic exceeds a threshold t
– Both t and the sample size are chosen using chi-squared theory
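The CORDS-style procedure above can be sketched roughly as follows. This is a plain Pearson chi-squared test on a contingency table built from sampled rows, not the robust variant that CORDS uses. The function names are hypothetical, and the threshold t is supplied by the caller (3.841 is the standard 0.95 quantile of the chi-squared distribution with one degree of freedom).

```python
from collections import Counter

def contingency_table(sample):
    """Count co-occurrences of (a, b) value pairs in the sampled rows."""
    return Counter(sample)

def chi_squared(table):
    """Pearson chi-squared statistic for a two-attribute contingency table."""
    n = sum(table.values())
    row_totals = Counter()
    col_totals = Counter()
    for (a, b), count in table.items():
        row_totals[a] += count
        col_totals[b] += count
    stat = 0.0
    for a in row_totals:
        for b in col_totals:
            # Expected count if the two attributes were independent.
            expected = row_totals[a] * col_totals[b] / n
            observed = table.get((a, b), 0)
            stat += (observed - expected) ** 2 / expected
    return stat

def is_dependent(sample, t):
    """Declare dependency when the statistic exceeds the threshold t."""
    return chi_squared(contingency_table(sample)) > t
```

A call such as `is_dependent(sampled_rows, 3.841)` would then test a 2 × 2 table at the 5% level; larger tables need a threshold for the matching degrees of freedom.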
Previous Approaches: Reactive
– Focus system resources on interesting attributes
– Complement proactive approaches
– Can exploit the DB2 feedback warehouse
A Spectrum of Reactive Approaches
Reactive Approach
Correlation Analyzer
Uses multiple observations (actuals) for each attribute pair
– O1 = {(blue,2005):0.02, (blue):0.2, (2005):0.103}
– O2 = {(red,2006):0.07, (red):0.82, (2006):0.11}
– etc.
Computes a ratio for each observation (O1, O2, ...) and compares it to a reference value [formulas not preserved in the transcript]
Declares an attribute dependency if two or more observations look dependent
Problems
– Ad hoc procedures, wasted information
– Unstable: depends on the amount and ordering of feedback
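To make the ratio test concrete, the sketch below applies it to the two observations listed above. It assumes the ratio is the joint selectivity divided by the product of the marginal selectivities, and that the reference value is 1 (the value this ratio takes under independence). Neither assumption is confirmed by the transcript, and the tolerance `eps` and the function names are purely illustrative.

```python
def selectivity_ratio(joint, marginal_a, marginal_b):
    """Joint selectivity over the product of marginals (1.0 under independence)."""
    return joint / (marginal_a * marginal_b)

def looks_dependent(obs, eps):
    """An observation looks dependent if its ratio is farther than eps from 1."""
    return abs(selectivity_ratio(*obs) - 1.0) > eps

def pair_is_dependent(observations, eps, min_votes=2):
    """Flag the attribute pair if at least min_votes observations look dependent."""
    votes = sum(1 for obs in observations if looks_dependent(obs, eps))
    return votes >= min_votes

# Observations from the slide: (joint, marginal of first, marginal of second).
O1 = (0.02, 0.2, 0.103)   # (blue, 2005)
O2 = (0.07, 0.82, 0.11)   # (red, 2006)

if __name__ == "__main__":
    for name, obs in (("O1", O1), ("O2", O2)):
        print(name, round(selectivity_ratio(*obs), 3))
```

Under these assumptions O1 gives a ratio of about 0.97 and O2 about 0.78, so with `eps = 0.1` only O2 looks dependent and the pair is not flagged. The outcome changes with the tolerance and with which observations happen to be available, which is the instability the slide points out.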
New Reactive Approach
Dependency Discovery
– Like CORDS, but uses an incomplete contingency table with exact entries
– Critical value u from an extension of classical chi-squared theory
– Normalize HM to get a ranking metric
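The transcript does not preserve the definition of HM, so the following is only a rough stand-in for the idea of testing on an incomplete contingency table: a Pearson-style sum restricted to the cells for which feedback supplies an exact joint frequency. The function name, the dictionary layout, and the use of marginals to form expected values are all assumptions, and the result should not be read as the HM statistic itself.

```python
def partial_chi_squared(joint, marg_a, marg_b, n):
    """Pearson-style statistic over only the known cells.

    joint  : {(a, b): exact joint fraction} for the cells seen in feedback
    marg_a : {a: exact marginal fraction}
    marg_b : {b: exact marginal fraction}
    n      : table cardinality
    Illustrative stand-in only; not the HM statistic from the paper.
    """
    stat = 0.0
    for (a, b), observed in joint.items():
        # Fraction expected in this cell if the attributes were independent.
        expected = marg_a[a] * marg_b[b]
        stat += n * (observed - expected) ** 2 / expected
    return stat
```

Cells that never appear in the feedback contribute nothing here, which is the main difference from a sample-based test that fills in every cell.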
New Reactive Approach
The HM Statistic
– Similar to chi-squared
Choosing the Threshold u
New Reactive Approach
Missing Feedback
– Assume a (rough) upper bound on the relative error of the estimate
– The bound can be obtained from feedback-warehouse records
Handling Inconsistency
Problem: no full multivariate frequency distribution is consistent with the feedback
– Records are collected at different points in time
– Inserts/deletes/updates occur between feedback observations
Solution method 1: use timestamps to resolve conflicts
Solution method 2: linear programming
– Obtain the minimal adjustment of frequencies needed for consistency
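Solution method 1 might be sketched as below, under the assumption that "use timestamps to resolve conflicts" means keeping only the most recent observation for each predicate. The record layout (predicate, selectivity, timestamp) and the function name are hypothetical. The linear-programming method is not shown.

```python
def resolve_by_timestamp(records):
    """Keep the newest selectivity for each predicate.

    records: iterable of (predicate, selectivity, timestamp) tuples.
    Returns {predicate: selectivity} taken from the latest timestamp.
    """
    latest = {}
    for predicate, selectivity, ts in records:
        # Replace the stored value only when this record is strictly newer.
        if predicate not in latest or ts > latest[predicate][1]:
            latest[predicate] = (selectivity, ts)
    return {p: sel for p, (sel, _) in latest.items()}
```

This removes direct conflicts on the same predicate, but it does not guarantee that the surviving joint and marginal frequencies are mutually consistent, which is presumably where the second method comes in.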
New Reactive Approach
Ranking Attribute Pairs
Heuristic normalizations: rank by HM / z, where z is one of
– Table cardinality
– Minimal number of distinct values
– Degrees of freedom of the chi-squared distribution
– * 0.995 Quantile of
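The normalization step can be sketched as follows, assuming the HM value of each attribute pair has already been computed. The input layout and all names are hypothetical. The degrees of freedom are taken as (d1 - 1)(d2 - 1), the standard value for a complete d1 × d2 contingency table, which may differ from what the paper uses for incomplete tables. The quantile-based normalization is left out.

```python
def normalizer(kind, cardinality, d1, d2):
    """Return the constant z for one of the normalizations listed above."""
    if kind == "cardinality":
        return cardinality
    if kind == "min_distinct":
        return min(d1, d2)
    if kind == "dof":
        # Assumption: degrees of freedom of a complete d1 x d2 table.
        return (d1 - 1) * (d2 - 1)
    raise ValueError(f"unknown normalization: {kind}")

def rank_pairs(pairs, kind):
    """Sort attribute pairs by HM / z, strongest dependency first.

    pairs: list of dicts with keys name, hm, cardinality, d1, d2.
    """
    def score(p):
        return p["hm"] / normalizer(kind, p["cardinality"], p["d1"], p["d2"])
    return sorted(pairs, key=score, reverse=True)
```

Because the raw statistic grows with table size and domain size, dividing by z lets pairs from tables of very different sizes compete for the same statistics budget.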
Experiment
Normalization Constants
Experiment
Ranking vs Amount of Feedback
Experiment
Dependency Measure vs Amount of Feedback
Experiment
Execution Time
Conclusions
Query feedback is an effective way to detect dependence
– Chi-squared extension to implement detection
– Attributes can be in multiple tables
Effective ranking methods
Practical solutions for handling inconsistent or missing feedback
Acceptable performance using sampling and incremental maintenance