Learning Tree Structures
• If we measured a distribution P, what is the tree-dependent distribution Pt that best approximates P?
• Search Space: all possible trees
• Goal: from all possible trees, find the one closest to P
• Distance Measurement: the Kullback–Leibler cross-entropy measure
• Operators/Procedure
Problem definition
• X1…Xn are random variables
• P is unknown
• Given independent samples x1,…,xs drawn from distribution P
• Estimate P
Solution 1 - independence
• Assume X1…Xn are independent
• P(x) = Π P(xi)
Solution 2 - trees
• P(x) = Π P(xi|xj)
• xj is the parent of xi in some tree
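As a concrete illustration of the two factorizations, here is a minimal sketch for three binary variables with made-up marginals and conditionals (none of the numbers or names come from the slides):

```python
# Contrasting Solution 1 (full independence) and Solution 2 (a tree factorization)
# for three binary variables X1, X2, X3 with illustrative probabilities.
import itertools

# Solution 1 - independence: P(x) = prod_i P(xi)
p_marg = {1: [0.7, 0.3], 2: [0.6, 0.4], 3: [0.9, 0.1]}   # [P(Xi=0), P(Xi=1)]

def p_independent(x):
    return p_marg[1][x[0]] * p_marg[2][x[1]] * p_marg[3][x[2]]

# Solution 2 - tree: P(x) = prod_i P(xi | x_parent(i)), here the chain X1 -> X2 -> X3
p_cond = {
    2: {0: [0.8, 0.2], 1: [0.3, 0.7]},   # P(X2 | X1)
    3: {0: [0.9, 0.1], 1: [0.4, 0.6]},   # P(X3 | X2)
}

def p_tree(x):
    return p_marg[1][x[0]] * p_cond[2][x[0]][x[1]] * p_cond[3][x[1]][x[2]]

# Both factorizations define valid joint distributions (they sum to 1):
print(sum(p_independent(x) for x in itertools.product([0, 1], repeat=3)))  # 1.0
print(sum(p_tree(x) for x in itertools.product([0, 1], repeat=3)))         # 1.0
```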
Kullback–Leibler cross-entropy measure
• For probability distributions P and Q of a discrete random variable, the K–L divergence of Q from P is defined as
  D_KL(P‖Q) = Σx P(x) log( P(x) / Q(x) )
• It can be seen from the definition of the Kullback–Leibler divergence that
  D_KL(P‖Q) = H(P, Q) − H(P)
  where H(P, Q) is called the cross entropy of P and Q, and H(P) is the entropy of P.
• Non-negative measure (by Gibbs' inequality)
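A small computational sketch of these quantities (the distributions below are illustrative only):

```python
# K-L divergence and its cross-entropy decomposition for two finite distributions.
import math

def kl_divergence(p, q):
    """D_KL(P || Q) = sum_x P(x) * log(P(x)/Q(x)); terms with P(x)=0 contribute 0."""
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """H(P) = -sum_x P(x) * log P(x)."""
    return -sum(px * math.log(px) for px in p if px > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) * log Q(x)."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

P = [0.5, 0.3, 0.2]
Q = [0.4, 0.4, 0.2]
# D_KL(P||Q) = H(P,Q) - H(P), and it is nonnegative (Gibbs' inequality)
print(kl_divergence(P, Q), cross_entropy(P, Q) - entropy(P))
```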
Entropy is a measure of uncertainty
• Fair coin: H(½, ½) = – ½ log2(½) – ½ log2(½) = 1 bit
  (i.e., we need 1 bit to convey the outcome of a coin flip)
• Biased coin: H(1/100, 99/100) = – 1/100 log2(1/100) – 99/100 log2(99/100) ≈ 0.08 bit
• As P(heads) → 1, the information in the actual outcome → 0:
  H(0, 1) = H(1, 0) = 0 bits, i.e., no uncertainty left in the source
  (with the convention 0 log2(0) = 0)
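These numbers can be checked with a few lines of Python (base-2 logs, using the 0·log2(0) = 0 convention above):

```python
# Entropy of a Bernoulli(p) source, in bits.
import math

def h2(p):
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

print(h2(0.5))    # 1.0 bit    (fair coin)
print(h2(0.01))   # ~0.08 bit  (biased coin)
print(h2(0.0))    # 0.0 bits   (no uncertainty left)
```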
Optimization Task
• Init: fix the structure of some tree t
• Assign Probabilities: what conditional probabilities Pt(x|y) would yield the best approximation of P?
• Procedure: vary the structure of t over all possible spanning trees
• Goal: among all trees with their assigned probabilities, which is the closest to P?
What Probabilities to assign?
Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.
How to vary over all trees? How to move in the search space?
Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on branch (x, y) is given by the mutual information I(X; Y).
Mutual information
• Mutual information measures how much knowing one of these variables reduces our uncertainty about the other
• In the extreme case where one variable completely determines the other, the mutual information equals the uncertainty contained in Y (or X) alone, namely the entropy of Y or X
• Mutual information is a measure of dependence
• Mutual information is nonnegative (i.e., I(X;Y) ≥ 0) and symmetric (i.e., I(X;Y) = I(Y;X))
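A sketch of the mutual information computed from a pairwise joint distribution (the 2×2 joint below is illustrative, not taken from the slides):

```python
# I(X;Y) = sum_{x,y} P(x,y) * log( P(x,y) / (P(x)P(y)) ), in nats.
import math

def mutual_information(joint):
    px = [sum(row) for row in joint]            # marginal P(x)
    py = [sum(col) for col in zip(*joint)]      # marginal P(y)
    return sum(
        pxy * math.log(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )

joint = [[0.5, 0.2],
         [0.1, 0.2]]
print(mutual_information(joint))                           # nonnegative
print(mutual_information([list(c) for c in zip(*joint)]))  # symmetric: same value
```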
The algorithm
• Find the maximum spanning tree, with weights given by the mutual information I(Xi; Xj)
• Compute Pt (see the sketch below):
  – Select an arbitrary root node and compute the conditional probabilities P(xi | xj) along the edges directed away from it
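A rough sketch in Python of the two steps above, assuming the pairwise mutual informations have already been estimated from the samples; the names (maximum_spanning_tree, direct_edges_from_root, mi) are illustrative, not from the slides:

```python
def maximum_spanning_tree(n_vars, mi):
    """Kruskal-style greedy: add the heaviest remaining edge unless it closes a cycle.
    `mi` maps frozenset({i, j}) -> estimated I(Xi; Xj)."""
    comp = list(range(n_vars))              # union-find: one component label per variable
    def find(i):
        while comp[i] != i:
            comp[i] = comp[comp[i]]         # path halving
            i = comp[i]
        return i
    edges = []
    for pair in sorted(mi, key=mi.get, reverse=True):
        i, j = tuple(pair)
        ri, rj = find(i), find(j)
        if ri != rj:                        # keep the edge only if it joins two components
            comp[ri] = rj
            edges.append((i, j))
    return edges

def direct_edges_from_root(n_vars, edges, root=0):
    """Orient the spanning tree away from an arbitrary root, giving each variable a
    parent; Pt(x) is then P(x_root) * prod_i P(x_i | x_parent(i))."""
    adj = {i: [] for i in range(n_vars)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    parent, stack, seen = {root: None}, [root], {root}
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                parent[v] = u
                stack.append(v)
    return parent

# Example with 3 variables and illustrative MI values:
mi = {frozenset({0, 1}): 0.30, frozenset({1, 2}): 0.25, frozenset({0, 2}): 0.05}
tree_edges = maximum_spanning_tree(3, mi)        # edges {0,1} and {1,2}, in some order
print(direct_edges_from_root(3, tree_edges))     # {0: None, 1: 0, 2: 1}
```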
Illustration of CL-Tree Learning over four variables A, B, C, D:

| Pair | Pairwise joint distribution | Mutual information |
| --- | --- | --- |
| A, B | (0.56, 0.11, 0.02, 0.31) | 0.3126 |
| A, C | (0.51, 0.17, 0.17, 0.15) | 0.0229 |
| A, D | (0.53, 0.15, 0.19, 0.13) | 0.0172 |
| B, C | (0.44, 0.14, 0.23, 0.19) | 0.0230 |
| B, D | (0.46, 0.12, 0.26, 0.16) | 0.0183 |
| C, D | (0.64, 0.04, 0.08, 0.24) | 0.2603 |
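Running a Kruskal-style greedy step over these six weights (heaviest edge first, skipping any edge that would close a cycle) selects the branches A–B, C–D, and B–C, i.e. the maximum-weight spanning tree for this example:

```python
# Greedy maximum-spanning-tree step applied to the six MI weights in the table above.
weights = {
    ("A", "B"): 0.3126, ("A", "C"): 0.0229, ("A", "D"): 0.0172,
    ("B", "C"): 0.0230, ("B", "D"): 0.0183, ("C", "D"): 0.2603,
}

component = {v: v for v in "ABCD"}       # union-find by component label
def find(v):
    while component[v] != v:
        v = component[v]
    return v

tree = []
for (u, v), w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    ru, rv = find(u), find(v)
    if ru != rv:                          # edge joins two different components
        component[ru] = rv
        tree.append((u, v, w))

print(tree)  # [('A', 'B', 0.3126), ('C', 'D', 0.2603), ('B', 'C', 0.023)]
```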
Theorem 1: If we force the probabilities along the branches of the tree t to coincide with those computed from P, we get the best t-dependent approximation of P.
By Gibb’s inequality the Expression
is maximized when P’(x)=P(x) => all the expression is maximal when P’(xi|xj)=P(xi|xj)
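A reconstruction of that expression, following the standard Chow–Liu argument (x_j(i) denotes the parent of x_i in t; the root's term is just its marginal):

```latex
D_{\mathrm{KL}}(P \,\|\, P_t)
  = -\sum_{x} P(x) \sum_{i=1}^{n} \log P'\!\left(x_i \mid x_{j(i)}\right) \;-\; H(P)
  = -\sum_{i=1}^{n} \;\sum_{x_i,\, x_{j(i)}} P\!\left(x_i, x_{j(i)}\right)\, \log P'\!\left(x_i \mid x_{j(i)}\right) \;-\; H(P),
```

so Gibbs' inequality applies to each inner sum separately, one per branch of t.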
Q.E.D.
Gibbs' inequality
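Stated for discrete distributions P and Q over the same set (the standard form of the inequality the proof relies on):

```latex
\sum_{x} P(x)\,\log P(x) \;\ge\; \sum_{x} P(x)\,\log Q(x),
\qquad \text{with equality iff } P = Q
\quad\bigl(\text{equivalently, } D_{\mathrm{KL}}(P \,\|\, Q) \ge 0\bigr).
```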
Theorem 2: The distance measure (Kullback–Leibler) is minimized by assigning the best distribution on any maximum-weight spanning tree, where the weight on branch (x, y) is given by the mutual information I(X; Y).
From Theorem 1, we get Pt(xi | xj(i)) = P(xi | xj(i)) along the branches of t.
After this assignment, and applying Bayes' rule, D(P, Pt) decomposes as shown below; the only t-dependent term is the (negated) sum of branch mutual informations.
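A reconstruction of the algebra, using Bayes' rule in the form log P(xi|xj) = log[ P(xi, xj) / (P(xi)P(xj)) ] + log P(xi) (with the root's term being just its marginal):

```latex
\begin{aligned}
D_{\mathrm{KL}}(P \,\|\, P_t)
  &= -\sum_{x} P(x) \sum_{i=1}^{n} \log P\!\left(x_i \mid x_{j(i)}\right) \;-\; H(P) \\
  &= -\sum_{i=1}^{n} I\!\left(X_i;\, X_{j(i)}\right)
     \;+\; \sum_{i=1}^{n} H(X_i)
     \;-\; H(X_1, \dots, X_n).
\end{aligned}
```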
• The second and third terms are independent of t
• D(P, Pt) is nonnegative (Gibbs' inequality)
Thus, minimizing the distance D(P, Pt) is equivalent to maximizing the sum of branch weights Σi I(Xi; Xj(i)).
Q.E.D.
Chow-Liu (CL) Results
• If distribution P is tree-structured, CL finds the CORRECT one
• If distribution P is NOT tree-structured, CL finds the tree-structured Q with minimal KL-divergence: argminQ KL(P; Q)
• Even though there are on the order of 2^(n log n) trees, CL finds the BEST one in polynomial time, O(n² [m + log n])
Chow-Liu Trees - Summary
• Approximation of a joint distribution with a tree-structured distribution [Chow and Liu 68]
• Learning the structure and the probabilities:
  – Compute individual and pairwise marginal distributions for all pairs of variables
  – Compute mutual information (MI) for each pair of variables
  – Build a maximum spanning tree for a complete graph with variables as nodes and MIs as weights
• Properties:
  – Efficient: O(#samples × (#variables)² × (#values per variable)²)
  – Optimal
MI(X, Y) = Σ_{X,Y} P(X, Y) log( P(X, Y) / (P(X) · P(Y)) )
• S. Kullback (1959). Information Theory and Statistics. John Wiley and Sons, NY.
• Chow, C. K. and Liu, C. N. (1968). Approximating discrete probability distributions with dependence trees. IEEE Transactions on Information Theory, IT-14(3):462–467.