Copyright 2006, Data Mining Research Lab
Machine and Statistical Learning for Database Querying
Chao Wang
Data Mining Research Lab
Dept. of Computer Science & Engineering
The Ohio State University
Advisor: Prof. Srinivasan Parthasarathy
Supported by: NSF Career Award IIS-0347662
Outline
• Introduction
  – Selectivity estimation
  – Probabilistic graphical model
• Querying transaction database
• Probabilistic model-based itemset summarization
• Querying XML database
• Conclusion
Introduction
Introduction
• Database querying
• Selectivity estimation
  – Estimation of a query result size in database systems
  – Usage: for the query optimizer to choose an efficient execution plan
• Rely on probabilistic graphical models
Probabilistic Graphical Models
• Marriage of graph theory and probability theory
• Special cases of the basic algorithms discovered in many (dis)guises:
  – Statistical physics
  – Hidden Markov models
  – Genetics
  – Statistics
  – …
• Numerous applications
  – Bioinformatics
  – Speech
  – Vision
  – Robotics
  – Optimization
  – …
Directed Graphical Models (Bayesian Network)
p(x1,x2,x3,x4,x5,x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)
[Figure: directed graph over nodes x1, …, x6]
Undirected Graphical Models (Markov Random Field (MRF))
p(x1,x2,x3,x4,x5,x6) = (1/Z) Φ(x1,x2) Φ(x1,x3) Φ(x2,x4) Φ(x3,x5) Φ(x2,x5,x6)
[Figure: undirected graph over nodes x1, …, x6]
Inference – Computing Conditional Probabilities
[Figure: graph over nodes x1, …, x6]
• Conditioning
• Marginalization:
• Conditional probabilities
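The inference operations named on this slide can be illustrated by brute-force enumeration over the six-variable Bayesian network factorization shown earlier. This is a minimal sketch, not the method used in the talk: the variables are assumed binary, and every probability in the tables below is a made-up value chosen only for illustration.

```python
from itertools import product

# Hypothetical conditional probability tables (all values are illustrative
# assumptions). Each entry is P(variable = 1 | parent values).
p_x1 = 0.6
p_x2 = {0: 0.2, 1: 0.7}                 # P(x2 = 1 | x1)
p_x3 = {0: 0.5, 1: 0.1}                 # P(x3 = 1 | x1)
p_x4 = {0: 0.3, 1: 0.9}                 # P(x4 = 1 | x2)
p_x5 = {0: 0.4, 1: 0.8}                 # P(x5 = 1 | x3)
p_x6 = {(0, 0): 0.1, (0, 1): 0.5,
        (1, 0): 0.6, (1, 1): 0.95}      # P(x6 = 1 | x2, x5)


def bern(p_one, value):
    """Probability of `value` for a binary variable with P(1) = p_one."""
    return p_one if value == 1 else 1.0 - p_one


def joint(x1, x2, x3, x4, x5, x6):
    """p(x1..x6) = p(x1) p(x2|x1) p(x3|x1) p(x4|x2) p(x5|x3) p(x6|x2,x5)."""
    return (bern(p_x1, x1) * bern(p_x2[x1], x2) * bern(p_x3[x1], x3)
            * bern(p_x4[x2], x4) * bern(p_x5[x3], x5)
            * bern(p_x6[(x2, x5)], x6))


def marginal(evidence):
    """Marginalization: sum the joint over every variable not fixed in
    `evidence`, a dict mapping variable index (0 for x1, ..., 5 for x6)
    to a value."""
    total = 0.0
    for assignment in product((0, 1), repeat=6):
        if all(assignment[i] == v for i, v in evidence.items()):
            total += joint(*assignment)
    return total


def conditional(query, evidence):
    """Conditional probability P(query | evidence) as a ratio of marginals."""
    return marginal({**query, **evidence}) / marginal(evidence)
```

Enumeration touches all 2^6 assignments, which is why it only works for toy models; the junction tree and approximate algorithms discussed later exist to avoid this blow-up.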
Querying Transaction Database
Transaction Database
• Consists of records of interactions among entities
• Two examples:
  – Market-basket data: each basket is a transaction consisting of items
  – Co-authorship data: each paper is a transaction consisting of “author” items
Querying Transaction Database
• Rely on frequent itemsets to learn graphical models
• Rely on the model to solve the selectivity estimation problem
  – Given a conjunctive query Q, estimate the size of the answer set, i.e., how many transactions satisfy Q
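To make the selectivity estimation problem concrete, the sketch below counts the exact answer-set size of a conjunctive query by scanning a toy transaction database, and shows how a model-based estimate would be formed from a query probability. The transactions, item names, and function names are all hypothetical; the model probability is simply passed in rather than computed.

```python
# Toy market-basket data (hypothetical).
transactions = [
    {"bread", "milk"},
    {"bread", "beer", "eggs"},
    {"milk", "beer", "cola"},
    {"bread", "milk", "beer"},
    {"bread", "milk", "cola"},
]


def satisfies(transaction, positive, negative=()):
    """True if the transaction contains every item in `positive`
    and none of the items in `negative`."""
    return (all(item in transaction for item in positive)
            and not any(item in transaction for item in negative))


def exact_selectivity(db, positive, negative=()):
    """Exact answer-set size: requires a full scan of the database."""
    return sum(satisfies(t, positive, negative) for t in db)


def estimated_selectivity(n_transactions, query_probability):
    """Model-based estimate: |D| times the probability the model assigns
    to the query, with no scan of the data."""
    return n_transactions * query_probability
```

For example, `exact_selectivity(transactions, ["bread", "milk"], ["beer"])` scans all five baskets, whereas `estimated_selectivity(len(transactions), p)` needs only the probability `p` that a learned model assigns to the same query.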
Frequent Itemset Mining
• Market-Basket Analysis
[Table: binary transaction-by-item matrix with item columns A, B, C, D; the surviving row reads 1 0 1 1 0]
Frequent Itemset Mining
• Support(I): number of transactions “containing I”
Frequent Itemset Mining Problem
• Given D, minsup
Find all itemsets I with support(I) ≥ minsup
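The problem statement can be sketched with a small level-wise (Apriori-style) miner. The talk later mentions using Apriori, but this is only an illustrative, unoptimized version; the function names and the toy database are assumptions.

```python
from itertools import combinations


def support(db, itemset):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t)


def apriori(db, minsup):
    """Return {itemset: support} for all itemsets with support >= minsup."""
    items = sorted({i for t in db for i in t})
    level = [frozenset([i]) for i in items
             if support(db, frozenset([i])) >= minsup]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support(db, itemset)
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == k + 1}
        # Prune step: every k-subset must be frequent, then count support.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k))
                 and support(db, c) >= minsup]
        k += 1
    return frequent


# Toy database D (hypothetical).
db = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
frequent = apriori(db, minsup=3)
```

With `minsup = 3`, the three single items and the three pairs are frequent, while {A, B, C} has support 2 and is dropped.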
Using Frequent Itemsets to Learn an MRF
• A k-itemset can be viewed as a constraint on the underlying distribution generating the data
• Given a set of itemsets, we compute the distribution that satisfies them and has maximum entropy (ME)
• This maximum entropy distribution is equivalent to an MRF
An ME Distribution Example
Frequent Itemsets
X1
X2
X3
X4
X5
X1 X2
X1 X3
X2 X3
X3 X4
X4 X5
X1 X2 X3
• The maximum entropy distribution has a product form in which I(·) is an indicator function for the corresponding itemset constraint and the constants u0, u1, …, u11 are estimated from the data.
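The formula itself is not present in this transcript. Under the description given here (one indicator per itemset constraint, eleven itemsets, constants u0, …, u11), the standard maximum entropy product form would read as follows; the indexing of the itemsets by j is an assumption.

```latex
p(X_1, \dots, X_5) \;=\; u_0 \prod_{j=1}^{11} u_j^{\,I_j(X_1, \dots, X_5)}
```

Here I_j equals 1 when the assignment sets every item of the j-th frequent itemset to 1 and equals 0 otherwise, and u_0 acts as the normalizing constant.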
An MRF Example
[Figure: MRF over X1, …, X5 with cliques C1, C2, C3]
Iterative Scaling Algorithm
• Time complexity: O(k * m * t), where k is the number of iterations, m is the number of itemset constraints, and t is the average inference time
• Efficient inference is crucial!
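The algorithm listing is not included in this transcript. As a rough illustration of where the k * m * t cost comes from, here is a minimal iterative scaling sketch that stores the joint distribution as an explicit table and performs inference by summation. It is an assumption-laden toy, not the talk's implementation: variables are binary, targets must lie strictly between 0 and 1, and the constraints must be mutually consistent.

```python
from itertools import product


def iterative_scaling(n_vars, constraints, n_iters=200, tol=1e-10):
    """Fit a maximum entropy distribution over n_vars binary variables.

    constraints maps a tuple of variable indices (an itemset) to the target
    probability that all of those variables equal 1.
    """
    states = list(product((0, 1), repeat=n_vars))
    p = {s: 1.0 / len(states) for s in states}   # start from uniform
    for _ in range(n_iters):                      # k iterations
        worst = 0.0
        for itemset, target in constraints.items():   # m constraints
            # Inference step (cost t): current probability of the itemset.
            current = sum(pr for s, pr in p.items()
                          if all(s[i] for i in itemset))
            worst = max(worst, abs(current - target))
            # Rescale so the constraint holds exactly; the total mass stays 1.
            up = target / current
            down = (1.0 - target) / (1.0 - current)
            for s in states:
                p[s] *= up if all(s[i] for i in itemset) else down
        if worst < tol:
            break
    return p


# Hypothetical constraints: P(X0)=0.5, P(X1)=0.4, P(X0, X1)=0.3.
constraints = {(0,): 0.5, (1,): 0.4, (0, 1): 0.3}
model = iterative_scaling(3, constraints)
```

The inner summation is the inference step; here it is exponential in the number of variables, which is exactly the cost that faster exact or approximate inference is meant to reduce.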
Junction Tree Algorithm
• Exact inference algorithm
• Time complexity is exponential in the treewidth (tw) of the model
  – Treewidth = (maximum clique size in the graph formed by triangulating the model, minimized over triangulations) – 1
• For real-world models, tw is often well above 20, which makes exact inference intractable
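Computing the treewidth exactly is hard in general, so in practice an upper bound is obtained from a triangulation produced by an elimination ordering. The sketch below uses a greedy minimum-degree ordering, which is a different heuristic from the Maximum Cardinality Search ordering mentioned later in the talk; the function name and the example graph are illustrative assumptions.

```python
def treewidth_upper_bound(edges):
    """Upper bound on treewidth from greedy minimum-degree elimination.

    Eliminating a vertex connects all of its remaining neighbours (fill-in
    edges), so the vertex and its neighbours form a clique in the induced
    triangulation. The bound is the largest such clique size minus 1.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    width = 0
    while adj:
        # Pick a vertex of minimum degree (ties broken by name).
        v = min(adj, key=lambda x: (len(adj[x]), x))
        nbrs = adj.pop(v)
        width = max(width, len(nbrs))
        for a in nbrs:
            adj[a].discard(v)
            adj[a] |= nbrs - {a}    # add fill-in edges among the neighbours
    return width


# Graph induced by the pairwise itemsets of the earlier ME example.
example_edges = [("X1", "X2"), ("X1", "X3"), ("X2", "X3"),
                 ("X3", "X4"), ("X4", "X5")]
path_edges = [("a", "b"), ("b", "c"), ("c", "d")]
```

On the example graph the largest clique is the triangle X1, X2, X3, so the bound is 2; a simple path gives 1.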
Approximate Inference Algorithms
• Gibbs sampling
  – Simulating samples from posterior distributions
  – Sum over samples to evaluate marginal probabilities
• Mean field algorithm
  – Convert the inference problem to an optimization problem, and solve the relaxed optimization problem
• Loopy belief propagation
  – Apply Pearl’s belief propagation directly to loopy graphs
  – Works quite well in practice
Will the iterative scaling algorithm still converge when it relies on approximate inference algorithms?
Graph Partitioning-Based Approximate MRF Learning
Lemma: For all disjoint vertex subsets a, b and c in an MRF, whenever b and c are separated by a in the graph, the variables associated with b and c are independent given the variables associated with a alone.
Graph Partitioning-Based Approximate MRF Learning
• Cluster variables based on graph partitioning
• Interaction importance and treewidth based variable-cluster augmentation
• Learn an exact local MRF on a variable-cluster and combine all local models to derive an approximate global MRF
Clustering Variables
• k-MinCut
  – Partition the graph into k equal parts
  – Minimize the number of edges of E whose incident vertices belong to different partitions
  – Weighted graphs: Minimize the sum of weights of all edges across different partitions
Accumulative Edge Weighting Scheme
Itemsets     Support
X1 X2        3
X1 X3        4
X2 X3        2
X3 X4        2
X4 X5        6
X1 X2 X3     2
[Figure: example edge weight for edge X1–X2: 3 + 2 = 5]
• Edge weight should reflect the correlation strength
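One reading of the accumulative scheme, consistent with the "3+2" shown on the slide, is that an edge's weight is the sum of the supports of all itemsets containing both endpoints. The sketch below implements that reading and also evaluates the weight of a cut, which is the quantity k-MinCut minimizes. The function names and the example partition are assumptions.

```python
from collections import defaultdict
from itertools import combinations

# Itemset supports from the slide's table.
itemset_support = {
    ("X1", "X2"): 3,
    ("X1", "X3"): 4,
    ("X2", "X3"): 2,
    ("X3", "X4"): 2,
    ("X4", "X5"): 6,
    ("X1", "X2", "X3"): 2,
}


def accumulate_edge_weights(supports):
    """Each itemset adds its support to every pair of items it contains."""
    weights = defaultdict(int)
    for itemset, supp in supports.items():
        for u, v in combinations(sorted(itemset), 2):
            weights[(u, v)] += supp
    return dict(weights)


def cut_weight(weights, part_of):
    """Sum of weights of edges whose endpoints lie in different partitions."""
    return sum(w for (u, v), w in weights.items()
               if part_of[u] != part_of[v])


weights = accumulate_edge_weights(itemset_support)
# Hypothetical 2-way partition: {X1, X2, X3} versus {X4, X5}.
partition = {"X1": 0, "X2": 0, "X3": 0, "X4": 1, "X5": 1}
```

Under this reading, edge X1–X2 receives 3 from the pair and 2 from the triple, and the example partition cuts only the edge X3–X4.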
Clustering Variables
• The k-MinCut partitioning scheme yields disjoint partitions. However, there exist edges across different partitions. In other words, different partitions are correlated with each other. So how do we account for the correlations across different partitions?
Interaction Importance and Treewidth Based Variable-Cluster Augmentation
• Augmenting variable-cluster
  – Add back most significant incident edges to a variable-cluster
• Optimization
  – Take into consideration model complexity
    • Keep track of treewidth of the augmented variable-clusters
    • 1-hop neighboring nodes first, then 2-hop nodes, …, and so on
Treewidth Based Augmentation
[Figure: a variable-cluster with its 1-hop neighboring nodes, 2-hop neighboring nodes, …]
Interaction Importance and Treewidth Based Variable-Cluster Augmentation
Approximate Global MRFs
• For each augmented variable-cluster, collect related itemsets and learn an exact local MRF
• All local MRFs together offer an approximate global MRF
Learning Algorithm
A Greedy Inference Algorithm
• Given the global model consisting of a set of local MRFs, how do we perform inference?
  – Case 1: all query variables are covered by a single MRF, evaluate the marginal probability directly
  – Case 2: use a greedy decomposition scheme to compute
    • First, pick a local model that has the largest intersection with the current query (i.e., covers the most variables)
    • Then pick the next local model covering most uncovered query variables, and so on
    • Overlapped decomposition
A Greedy Inference Algorithm
Qx = X1 X2 X3 X4 X5
Local models: M1 = X1X2X3X6X7, M2 = X3X4X6X8, M3 = X5X9X10
$$P(Q_x) \approx \frac{P(X_1, X_2, X_3)\, P(X_3, X_4)\, P(X_5)}{P(X_3)}$$
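The greedy overlapped decomposition for the query X1 X2 X3 X4 X5 over local models M1, M2, M3 can be sketched as a set-cover style loop. This is a hedged illustration of the selection step only: it produces a plan of which model supplies which variables, and which already-covered variables it conditions on, without evaluating any probabilities. The function name and plan format are assumptions.

```python
def greedy_decomposition(query_vars, local_models):
    """Return a list of (model name, newly covered query variables,
    overlap with already-covered query variables).

    Each step contributes a factor P(new | overlap) computed inside the
    chosen local model; their product approximates P(query).
    """
    query = set(query_vars)
    covered = set()
    plan = []
    while covered != query:
        # Pick the model covering the most still-uncovered query variables.
        name, scope = max(local_models.items(),
                          key=lambda kv: len((kv[1] & query) - covered))
        new = (scope & query) - covered
        if not new:
            raise ValueError("some query variables are in no local model")
        overlap = scope & covered
        plan.append((name, sorted(new), sorted(overlap)))
        covered |= new
    return plan


# Local model scopes and query from the slide.
local_models = {
    "M1": {"X1", "X2", "X3", "X6", "X7"},
    "M2": {"X3", "X4", "X6", "X8"},
    "M3": {"X5", "X9", "X10"},
}
plan = greedy_decomposition(["X1", "X2", "X3", "X4", "X5"], local_models)
```

The resulting plan reads as P(X1, X2, X3) from M1, times P(X4 | X3) from M2, times P(X5) from M3, which is the same product as P(X1, X2, X3) P(X3, X4) P(X5) / P(X3).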
Discussions
• The greedy inference scheme is a heuristic
• Global model is not globally consistent; however, we expect that the global model is nearly consistent (Heckerman et al. 2000)
• A generalized belief propagation style approach is currently under investigation to enforce local consistency across the local models, thereby offering a globally consistent model
Experimental Results
• C++ implementation. The Junction tree algorithm is implemented based on Intel’s Open-Source Probabilistic Networks library (C++)
• Use Apriori algorithm to collect frequent itemsets
• Use Metis for graph partitioning
Experimental Setup
• Datasets
  – Microsoft Anonymous Web, |D|=32711, |I|=294
  – BMS-Webview1, |D|=59602, |I|=497
• Query workloads
  – Conjunctive queries, e.g., X1 & ¬X2 & X4
• Performance metrics
  – Time: online estimating time and offline learning time
  – Error: average absolute relative error
• Varying
  – k, the no. of clusters
  – g, the no. of vertices used during the augmentation
  – tw, the treewidth threshold when using treewidth based augmentation optimization
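The error metric can be written down directly. The sketch below assumes the usual definition, |estimate − truth| / truth averaged over the query workload, and additionally assumes that queries with an empty true answer set are skipped; neither detail is stated in the slides.

```python
def average_absolute_relative_error(true_sizes, estimated_sizes):
    """Mean of |estimate - truth| / truth over queries with truth > 0."""
    pairs = [(t, e) for t, e in zip(true_sizes, estimated_sizes) if t > 0]
    return sum(abs(e - t) / t for t, e in pairs) / len(pairs)


# Hypothetical workload of three queries.
true_sizes = [100, 50, 10]
estimated_sizes = [110, 45, 10]
error = average_absolute_relative_error(true_sizes, estimated_sizes)
```

In this toy workload the per-query relative errors are 0.1, 0.1 and 0, so the average is about 0.067.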
Results on the Web Data
• Support threshold = 20, which results in 9901 frequent itemsets
• Treewidth = 28 according to Maximum Cardinality Search (MCS)-ordering heuristic
Varying k (g = 5):
[Figures: estimation accuracy, online time, and offline time]
Varying g (k = 20):
[Figures: estimation accuracy, online time, and offline time]
Varying tw (k = 25):
[Figures: estimation accuracy, online time, and offline time]
Using Non-Redundant Itemsets
• There exist redundancies in a collection of frequent itemsets
• Select non-redundant patterns to learn probabilistic models
• Closely related to pattern summarization
Probabilistic Model-Based Itemset Summarization
Non-Derivable Itemsets
• Based on redundancies
  – How do supports relate?
• What information about unknown supports can we derive from known supports?
  – Concise representation: only store non-redundant information
The Inclusion-Exclusion Principle
Deduction Rules via Inclusion-Exclusion
• Let A, B, C, … be items
• Let A’ correspond to the set { transactions t | t contains A }
• (AB)’ = (A)’ ∩ (B)’
• Then supp(AB) = | (AB)’|
Deduction Rules via Inclusion-Exclusion
• Inclusion-exclusion principle:
|A’ ∪ B’ ∪ C’| = |A’| + |B’| + |C’| - |(AB)’| - |(AC)’| - |(BC)’| + |(ABC)’|
Thus, since |A’ ∪ B’ ∪ C’| ≤ n, where n is the number of transactions,
Supp(ABC) ≤ s(AB) + s(AC) + s(BC) - s(A) - s(B) - s(C) + n
Complete Set for Supp(ABC)
0   sABC ≥ 0
1   sABC ≤ sAB
    sABC ≤ sAC
    sABC ≤ sBC
2   sABC ≥ sAB + sAC - sA
    sABC ≥ sAB + sBC - sB
    sABC ≥ sAC + sBC - sC
3   sABC ≤ sAB + sAC + sBC - sA - sB - sC + n
Derivable Itemsets
Given: Supp(I) for all I ⊂ J
Lower bound on Supp(J) = L
Upper bound on Supp(J) = U
• Without counting: Supp(J) ∈ [L, U]
• J is a derivable itemset (DI) iff L = U
We know Supp(J) exactly without counting!
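The bounds L and U can be computed mechanically by generalizing the inclusion-exclusion rules listed for Supp(ABC): for every X ⊆ J, the signed sum of supports over all I with X ⊆ I ⊂ J is an upper bound when |J \ X| is odd and a lower bound when it is even. The sketch below applies that rule; the function names and the two support tables are illustrative assumptions, with the support of the empty itemset standing for n.

```python
from itertools import combinations


def subsets(items):
    """All subsets of `items` as frozensets, smallest first."""
    items = sorted(items)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def support_bounds(J, supp):
    """Return (L, U) for Supp(J) given supp[I] for every proper subset I
    of J, where supp[frozenset()] is the number of transactions n."""
    J = frozenset(J)
    lower, upper = 0, supp[frozenset()]
    for X in subsets(J):
        rest = J - X
        total = 0
        for extra in subsets(rest):
            I = X | extra
            if I == J:
                continue
            total += (-1) ** (len(J - I) + 1) * supp[I]
        if len(rest) % 2 == 1:
            upper = min(upper, total)   # odd |J \ X|: upper bound
        else:
            lower = max(lower, total)   # even |J \ X|: lower bound
    return lower, upper


def fs(s):
    return frozenset(s)


# Hypothetical supports where every transaction with A also has B.
derivable = {fs(""): 10, fs("A"): 4, fs("B"): 6, fs("C"): 5,
             fs("AB"): 4, fs("AC"): 2, fs("BC"): 3}
# Hypothetical supports that leave Supp(ABC) undetermined.
non_derivable = {fs(""): 10, fs("A"): 6, fs("B"): 6, fs("C"): 6,
                 fs("AB"): 4, fs("AC"): 4, fs("BC"): 4}
```

In the first table the bounds collapse to a single value, so ABC is derivable and need not be counted or stored; in the second, an interval remains.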
Derivable Itemsets
• J is a derivable itemset:
  – No need to count Supp(J)
  – No need to store Supp(J)
• We can use the deduction rules
  – Concise representation: C = { (J, Supp(J)) | J not derivable from Supp(I), I ⊂ J }
Probabilistic Model Based Itemset Summarization
• We can learn the MRF from non-derivable itemsets alone
Lemma: Given a transaction dataset D, the MRF M constructed from all of its σ-frequent itemsets is equivalent to M’, the MRF constructed from only its σ-frequent non-derivable itemsets
• Can we do better?
  – Further compress the patterns
Probabilistic Model Based Itemset Summarization
• Use smaller itemsets to learn an MRF
• Use this model to infer the supports of larger itemsets
• Use those itemsets whose occurrence cannot be explained by the model (within some error threshold) to augment the model
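The three steps above can be read as a level-wise loop. The sketch below is one possible interpretation, not the algorithm from the talk: it starts from single items, refits a model at each level, and keeps only the itemsets whose frequency the current model misses by more than a relative error threshold. The model is a brute-force maximum entropy fit over an explicit joint table, standing in for the MRF learner, and all names and data are hypothetical.

```python
from itertools import product


def fit_maxent(n_vars, constraints, sweeps=300):
    """Stand-in learner: iterative scaling over an explicit joint table.
    Targets must lie strictly between 0 and 1."""
    states = list(product((0, 1), repeat=n_vars))
    p = {s: 1.0 / len(states) for s in states}
    for _ in range(sweeps):
        for itemset, target in constraints.items():
            current = sum(pr for s, pr in p.items()
                          if all(s[i] for i in itemset))
            up = target / current
            down = (1.0 - target) / (1.0 - current)
            for s in states:
                p[s] *= up if all(s[i] for i in itemset) else down
    return p


def model_frequency(p, itemset):
    """Frequency the model assigns to an itemset."""
    return sum(pr for s, pr in p.items() if all(s[i] for i in itemset))


def summarize(n_vars, itemset_freq, error_threshold):
    """Keep single items plus every larger itemset the model cannot explain.

    itemset_freq maps tuples of variable indices to relative frequencies.
    """
    summary = {i: f for i, f in itemset_freq.items() if len(i) == 1}
    max_size = max(len(i) for i in itemset_freq)
    for size in range(2, max_size + 1):
        model = fit_maxent(n_vars, summary)
        for itemset, freq in itemset_freq.items():
            if len(itemset) != size:
                continue
            estimate = model_frequency(model, itemset)
            if abs(estimate - freq) / freq > error_threshold:
                summary[itemset] = freq     # unexplained: augment the model
    return summary


# Hypothetical frequencies: only the pair (0, 2) deviates from independence.
itemset_freq = {(0,): 0.5, (1,): 0.5, (2,): 0.5,
                (0, 1): 0.25, (0, 2): 0.4, (1, 2): 0.25,
                (0, 1, 2): 0.2}
summary = summarize(3, itemset_freq, error_threshold=0.05)
```

Here the summary keeps the three single items and the pair (0, 2); the remaining pairs and the triple are reproduced by the model and are dropped.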
Itemset Summarization Algorithm
Generalized Non-Derivable Itemsets
• All the itemsets in the final summary are non-derivable
• Relax the requirement for an itemset to be derivable
Experimental Results
• Experimental Setup
  – Datasets:
  – Performance metrics:
    • Summarization accuracy (restoration error)
    • Summary size
    • Summarizing time
Results on the Chess Dataset
minSup = 2000; 166581 frequent itemsets; 1276 non-derivable
[Figures: estimation accuracy, summary size, and summarizing time]
Results on the Chess Dataset
Skewed itemset distribution when varying error threshold
Results on the Mushroom Dataset
minSup = 2031 (25%); 5545 frequent itemsets; 534 non-derivable
[Figures: estimation accuracy, summary size, and summarizing time]
Results on the Mushroom Dataset
Skewed itemset distribution when varying error threshold
Result Summary and Discussions
• There do exist redundancies in a collection of itemsets, and the probabilistic model based summarization scheme can effectively eliminate such redundancies
  – When datasets are dense and largely satisfy the conditional independence assumption, our summarization approach is extremely efficient
  – When datasets become sparse and do not satisfy the conditional independence assumption, the summarization task becomes more difficult (needs more time and space)
• Itemsets-based MRF learning and MRF-based itemset summarization are two interactive procedures
Query XML Database – Exploiting Independence Structure from Complex Structural Patterns
Querying XML Database
• XML is becoming the standard for data exchange
• We need to query the structure and text data of XML documents
• XML twig query:
  – an important query mechanism
  – a structural query with small branches
• Optimizing these queries requires estimating the selectivity of the twig queries
Querying XML Database
• An XML document example: DBLP.xml
(Digital Bibliography & Library Project)
Querying XML Database
• A twig example:
FOR all books IN document("DBLP.xml")
WHERE publisher = "Morgan Kaufmann"
RETURN title
[Figure: twig with root b and children p and t, where b: book, p: publisher, t: title]
![Page 63: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/63.jpg)
Querying XML Database
Twig pattern: root b with child nodes p and t
b: book
p: publisher
t: title
selectivity = 2
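To make the notion of twig selectivity concrete, the sketch below counts the matches of the twig b(p, t) in a tiny, made-up document. The document contents and the helper name `twig_selectivity` are illustrative assumptions, not taken from DBLP.xml or from the slides; the count is taken as the number of (book, publisher, title) combinations with the publisher value fixed.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document; the real DBLP.xml is far larger.
DOC = """
<dblp>
  <book><publisher>Morgan Kaufmann</publisher><title>Title A</title></book>
  <book><publisher>Morgan Kaufmann</publisher><title>Title B</title></book>
  <book><publisher>Springer</publisher><title>Title C</title></book>
  <article><title>Title D</title></article>
</dblp>
"""

def twig_selectivity(root, publisher_name):
    """Count matches of the twig b(p, t) where p has the given text value."""
    matches = 0
    for book in root.iter("book"):
        publishers = [p for p in book.findall("publisher")
                      if (p.text or "").strip() == publisher_name]
        titles = book.findall("title")
        # Each (publisher, title) pair under the same book is one match.
        matches += len(publishers) * len(titles)
    return matches

root = ET.fromstring(DOC.strip())
selectivity = twig_selectivity(root, "Morgan Kaufmann")
print(selectivity)
```

Counting exactly like this requires a full scan of the document, which is what a compact summary structure is meant to avoid at query-optimization time.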
![Page 64: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/64.jpg)
Problem Statement
• The goal is to accurately estimate the selectivity of twig queries with limited memory
– Need a structure to store relevant statistics of the data
– Then estimate selectivity from these statistics
![Page 65: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/65.jpg)
Our Approach (TreeLattice)
• Key idea: store the occurrence statistics of small twigs in the summary
– The summary is a lattice consisting of small trees, thus called TreeLattice
• Then estimate the selectivity of large twigs from these statistics
![Page 66: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/66.jpg)
Challenges
• How to estimate the selectivity of a twig given the selectivity information of its sub-twigs?
• How to decompose a large twig into smaller twigs?
• What statistics to store in the lattice summary?
![Page 67: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/67.jpg)
Estimation Procedure
[Figure: a twig T and its two augmentations T1 and T2]
Augmenting T with edge e1 (adding node x) to get T1
Augmenting T with edge e2 (adding node y) to get T2
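Lemma: If these two tree augmentations are conditionally independent (conditioned on T), then the selectivity of T augmented with both e1 and e2 (written $T_{12}$ here) is:
$\sigma(T_{12}) = \dfrac{\sigma(T_1)\,\sigma(T_2)}{\sigma(T)}$
$\sigma(\cdot)$: selectivity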
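The lemma follows from a short argument, sketched here under notational assumptions: $\sigma(X)$ denotes the selectivity of twig $X$, and $T_{12}$ denotes $T$ augmented with both e1 and e2 (neither name is guaranteed to match the original slide). The ratio $\sigma(T_1)/\sigma(T)$ is read as the average number of ways a match of $T$ extends along e1, and conditional independence lets the two extension ratios multiply.

```latex
% Assumed notation: \sigma(X) = selectivity of twig X,
% T_{12} = T augmented with both e_1 and e_2.
\begin{align*}
\frac{\sigma(T_{12})}{\sigma(T)}
  &= \frac{\sigma(T_1)}{\sigma(T)} \cdot \frac{\sigma(T_2)}{\sigma(T)}
  && \text{(conditional independence given } T\text{)} \\[4pt]
\sigma(T_{12})
  &= \frac{\sigma(T_1)\,\sigma(T_2)}{\sigma(T)}
\end{align*}
% Worked example with made-up counts:
% \sigma(T) = 10,\ \sigma(T_1) = 20,\ \sigma(T_2) = 4
% \Rightarrow \sigma(T_{12}) = 20 \cdot 4 / 10 = 8.
```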
![Page 68: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/68.jpg)
Decomposition Strategies
• How to decompose a large twig into smaller sub-twigs?
– Recursive decomposition with or without voting
– Fixed-sized decomposition
– Hybrid decomposition
![Page 69: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/69.jpg)
Recursive Decomposition
[Figure: an example twig over nodes a to g, recursively decomposed into smaller sub-twigs by removing nodes]
Recursively apply the estimation formula.
Multiple feasible decompositions may exist. Voting across them is used to obtain the best possible estimate.
• Much more accurate than without voting
• The estimation process slows down
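The recursive procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: twigs are given as a node set with `parent` and `label` maps, the summary is a plain dictionary keyed by a canonical tuple form, removable nodes are taken to be leaves plus a single-child root, and "voting" is modeled as a simple average over all candidate decompositions. All names and the toy counts are hypothetical.

```python
from itertools import combinations

def canonical(nodes, parent, label):
    """Order-insensitive key for the twig induced by `nodes`."""
    kids = {n: [] for n in nodes}
    root = None
    for n in nodes:
        if parent[n] in kids:
            kids[parent[n]].append(n)
        else:
            root = n  # parent lies outside the twig, so n is its root
    def key(n):
        return (label[n], tuple(sorted(key(c) for c in kids[n])))
    return key(root)

def removable(nodes, parent):
    """Nodes whose removal keeps the twig connected (assumed rule)."""
    n_children = {n: 0 for n in nodes}
    root = None
    for n in nodes:
        if parent[n] in n_children:
            n_children[parent[n]] += 1
        else:
            root = n
    out = [n for n in nodes if n_children[n] == 0]  # leaves
    if n_children[root] == 1:                       # single-child root
        out.append(root)
    return sorted(out)

def estimate(nodes, parent, label, summary, voting=True):
    """Estimate selectivity via sigma(T1) * sigma(T2) / sigma(T)."""
    nodes = frozenset(nodes)
    key = canonical(nodes, parent, label)
    if key in summary:
        return float(summary[key])
    if len(nodes) <= 2:
        return 0.0  # all base twigs are stored, so a miss means no occurrence
    votes = []
    for x, y in combinations(removable(nodes, parent), 2):
        t = estimate(nodes - {x, y}, parent, label, summary, voting)
        if t == 0:
            votes.append(0.0)
        else:
            t1 = estimate(nodes - {y}, parent, label, summary, voting)
            t2 = estimate(nodes - {x}, parent, label, summary, voting)
            votes.append(t1 * t2 / t)
        if not voting:
            break  # without voting, use the first decomposition only
    return sum(votes) / len(votes)

# Toy summary holding all twigs of size 1 and 2 (made-up counts).
summary = {
    ("a", ()): 10, ("b", ()): 20, ("c", ()): 8,
    ("a", (("b", ()),)): 20,
    ("a", (("c", ()),)): 4,
    ("b", (("c", ()),)): 4,
}
label = {0: "a", 1: "b", 2: "c"}
fork_est = estimate({0, 1, 2}, {0: None, 1: 0, 2: 0}, label, summary)  # a(b, c)
path_est = estimate({0, 1, 2}, {0: None, 1: 0, 2: 1}, label, summary)  # a/b/c
print(fork_est, path_est)
```

Because every pair of removable nodes triggers further recursive calls, the number of evaluated decompositions grows quickly with twig size, which is consistent with the slowdown noted above; memoizing intermediate estimates would be a natural refinement.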
![Page 70: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/70.jpg)
Fixed-sized Decomposition
[Figure: the example twig over nodes a to g is covered by overlapping fixed-size sub-twigs, such as {a, b, c, d}, {b, c, d, e}, {b, d, f, e} and {b, d, f, g}, whose statistics are combined into one estimate]
Very fast, but cannot be applied directly
![Page 71: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/71.jpg)
Hybrid Decomposition
[Figure: hybrid decomposition of the example twig over nodes a to g. Small sub-twigs are estimated by recursive decomposition with voting, and the results are then combined through fixed-sized decomposition]
![Page 72: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/72.jpg)
Summary Statistics
• What to store in the lattice summary?
– Store important statistics
– Store non-redundant information
– How to achieve this?
• Store non-derivable patterns only!
![Page 73: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/73.jpg)
Summary Statistics
• A twig pattern is δ-derivable if and only if its true selectivity is within an error tolerance of δ of its expected selectivity according to TreeLattice.
– 0-derivable (δ=0) patterns are those patterns whose selectivity can be estimated exactly.
• Pruning 0-derivable patterns
– No loss of accuracy
![Page 74: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/74.jpg)
Summary Statistics
• Level-wise lattice summary construction
– Add all twigs of size 1 and 2 to the summary (base)
– Then add larger non-derivable frequent twigs into the summary, until the memory budget is depleted
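The level-wise construction with δ-derivable pruning might look like the sketch below. Several simplifications are assumptions made for illustration only: twigs are restricted to simple paths (tuples of labels), the tolerance δ is interpreted as a relative error, the memory budget is counted in summary entries, frequent twigs and their true counts are assumed to be mined beforehand, and within a level the most frequent twigs are considered first. The names `estimate_path` and `build_summary` are hypothetical.

```python
def estimate_path(path, summary):
    """Path-only estimator: sigma(p[:-1]) * sigma(p[1:]) / sigma(p[1:-1])."""
    if path in summary:
        return float(summary[path])
    if len(path) <= 2:
        return 0.0  # base twigs are all stored, so a miss means no occurrence
    mid = estimate_path(path[1:-1], summary)
    if mid == 0:
        return 0.0
    left = estimate_path(path[:-1], summary)
    right = estimate_path(path[1:], summary)
    return left * right / mid

def build_summary(levels, estimate, budget, delta=0.0):
    """levels[i] maps each frequent twig of size i+1 to its true count."""
    summary = {}
    for size, level in enumerate(levels, start=1):
        # Assumed ordering: most frequent twigs first within a level.
        for twig, count in sorted(level.items(), key=lambda kv: -kv[1]):
            if size > 2:
                # delta-derivable twigs are pruned: the summary already
                # predicts them within the (relative) tolerance.
                if abs(count - estimate(twig, summary)) <= delta * count:
                    continue
                if len(summary) >= budget:
                    return summary  # memory budget depleted
            summary[twig] = count  # sizes 1 and 2 are always stored (base)
    return summary

# Made-up counts for illustration.
levels = [
    {("a",): 10, ("b",): 20, ("c",): 8, ("d",): 6},
    {("a", "b"): 20, ("b", "c"): 4, ("c", "d"): 6},
    {("a", "b", "c"): 4, ("b", "c", "d"): 5},
]
summary = build_summary(levels, estimate_path, budget=10)
print(sorted(summary, key=len))
```

In this toy run the path a/b/c is estimated exactly from its sub-paths (20 × 4 / 20 = 4), so it is 0-derivable and not stored, while b/c/d (estimated 3, true count 5) is kept. Raising `delta` to 0.5 would prune b/c/d as well.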
![Page 75: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/75.jpg)
Experimental Methodology
• Datasets: NASA, PSD, IMDB and XMark
• Workloads: 1000 frequent twig queries of size between 4 and 9.
• Error metric: Mean absolute relative error
$\frac{1}{|W|} \sum_{q \in W} \frac{|\mathrm{estim}(q) - \mathrm{count}(q)|}{\mathrm{count}(q)}$
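As a small illustration of the error metric, the sketch below computes the mean absolute relative error over a workload. Here `W` is taken to be the query workload, and `estim` and `count` are assumed to be dictionaries mapping each query to its estimated and true selectivity; the query names and numbers are made up.

```python
def mean_abs_relative_error(workload, estim, count):
    """Average of |estim(q) - count(q)| / count(q) over all q in the workload."""
    total = 0.0
    for q in workload:
        total += abs(estim[q] - count[q]) / count[q]
    return total / len(workload)

# Hypothetical two-query workload.
count = {"q1": 100, "q2": 50}   # true selectivities
estim = {"q1": 90, "q2": 55}    # estimated selectivities
error = mean_abs_relative_error(["q1", "q2"], estim, count)
print(f"{error:.1%}")
```

The metric is undefined for queries with a true count of zero, which fits a workload made of frequent twig queries.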
![Page 76: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/76.jpg)
Accuracy of Estimators
[Figure: average relative error (%) versus query size (4 to 9) on NASA, comparing Recursive Decomp+Voting, Recursive Decomp, Fast Decomp and TreeSketches]
• Recursive decomposition with voting yields best estimates
• The quality of estimation degrades as the twig size increases due to error propagation
![Page 77: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/77.jpg)
Varying Summary Size
[Figure: average relative error (%) versus summary size (10k to 50k) on NASA, comparing TreeLattice and TreeSketches]
• The larger the summary, the better the estimations
• TreeLattice makes more efficient use of the memory budget
![Page 78: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/78.jpg)
Estimation Time
[Figure: response time (ms) versus query size (4 to 9) on NASA, comparing Recursive Decomp+Voting, Recursive Decomp, Fast Decomp and TreeSketches]
• TreeLattice is very fast when processing relatively small twigs
• Recursive decomposition with voting slows down a lot as the twig size increases.
• Overall, fast decomposition is best.
![Page 79: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/79.jpg)
δ-derivable Pruning
• The proportion of 0-derivable patterns is very high on NASA, PSD and XMark
– The tree-growing conditional independence assumption holds well
– TreeLattice works very well
• The assumption does not hold as well on IMDB. How can the estimates on IMDB be improved?
![Page 80: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/80.jpg)
δ-derivable Pruning
• Larger δ is good for large twigs, at the cost of sacrificing estimation accuracy for small twigs.
[Figure: results on IMDB, with TreeSketches shown for comparison]
![Page 81: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/81.jpg)
Discussions
• TreeLattice is effective in estimating the selectivity of XML twig queries
– Compares favorably with the state-of-the-art approach
– The lattice summary construction is fast
– The online estimation is fast
![Page 82: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/82.jpg)
Conclusion
![Page 83: Copyright 2006, Data Mining Research Lab Machine and Statistical Learning for Database Querying Chao Wang Data Mining Research Lab Dept. of Computer Science](https://reader030.vdocument.in/reader030/viewer/2022032707/56649e505503460f94b4783f/html5/thumbnails/83.jpg)
Conclusion
• Conditional independence structure is common in the real world
• Graphical models are effective at capturing such structures and solving the selectivity estimation problem for database querying
• Model structured data (sequence/tree/graph) using probabilistic models
• Model streaming/incremental data