Variational sampling approaches to word confusability
John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath
{jrhershe,pederao,rameshg}@us.ibm.com
IBM, T. J. Watson Research Center
Information Theory and Applications, 2/1-2007 – p.1/31
Abstract
In speech recognition it is often useful to determine how confusable two words are. For
speech models this comes down to computing the Bayes error between two HMMs. This
problem is analytically and numerically intractable. A common alternative that is
numerically approachable uses the KL divergence in place of the Bayes error. We present
new approaches to approximating the KL divergence that combine variational methods with
importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error,
turns out to be even more amenable to our approach. Our experiments demonstrate an
improvement of orders of magnitude in accuracy over conventional methods.
Outline
• Acoustic Confusability
• Divergence Measures for Distributions
• KL Divergence: Prior Art
• KL Divergence: Variational Approximations
• KL Divergence: Empirical Evaluations
• Bhattacharyya: Monte Carlo Approximation
• Bhattacharyya: Variational Approximation
• Bhattacharyya: Variational Monte Carlo Approximation
• Bhattacharyya: Empirical Evaluations
• Future Directions
A Toy Version of the Confusability Problem
[Figure: probability density plot over x ∈ [−5, 5] of N(x;−2,1), N(x;2,1), and (fg)^{1/2}.]
• X is an acoustic feature vector representing a speech class, say "E", with pdf f(x) = N(x;−2, 1).
• Another speech class, "O", is described by pdf g(x) = N(x; 2, 1).
• The asymmetric error is the probability that one class, "O", will be mistaken for the other, "E", when classifying: Ae(f, g) = ∫ f(x) 1{g(x) ≥ f(x)}(x) dx.
• The Bayes error is the total classification error: Be(f, g) = ∫ min(f(x), g(x)) dx.
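For this toy pair the integrals can be checked directly. A minimal numerical sketch (not part of the talk), using the fact that the two densities cross at x = 0, so that Be = 2Φ(−2) = erfc(√2):

```python
import math

def npdf(x, mu, sigma=1.0):
    """Normal density N(x; mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Bayes error Be(f, g) = integral of min(f(x), g(x)) dx, by a grid sum
# (equivalent to the trapezoid rule here; the endpoint terms are negligible).
f = lambda x: npdf(x, -2.0)
g = lambda x: npdf(x, 2.0)
n, lo, hi = 200000, -10.0, 10.0
h = (hi - lo) / n
be_numeric = sum(min(f(lo + i * h), g(lo + i * h)) for i in range(n + 1)) * h

# Closed form: the densities cross at x = 0, so Be = 2 * Phi(-2) = erfc(sqrt(2)).
be_exact = math.erfc(math.sqrt(2.0))
```

Both evaluations give Be ≈ 0.0455, i.e. about 4.6% of the mass is misclassified.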
Word Models
A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has the pronunciation D AY AX L and HMM F.
D AY AX L
CALL has a pronunciation K AO L and HMM G
K AO L
Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error Be(F, G). This quantity is too hard to compute!
The Edit Distance
DIAL   CALL   edit op.       cost
D      K      substitution   1
AY     –      ins/del        1
AX     AO     substitution   1
L      L      none           0
              Total cost     3
The edit distance is the length of the shortest path in the product graph of the two phone sequences (D AY AX L against K AO L, with initial node I and final node F); edges cost 1 for insertions, deletions, and substitutions, and 0 for matches.
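The cost table above is the standard dynamic-programming edit distance applied to phone sequences; a minimal sketch with unit costs:

```python
def edit_distance(a, b):
    """Levenshtein distance with unit insertion/deletion/substitution costs."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all of a[:i]
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # substitution / match
    return d[len(a)][len(b)]

# DIAL = D AY AX L vs. CALL = K AO L: one ins/del plus two substitutions.
dist = edit_distance("D AY AX L".split(), "K AO L".split())
```

The table in the slide is exactly one optimal alignment recovered from this recursion.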
Better ways
Other techniques use weights on the edges. Acoustic perplexity and Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as weights.
Bayes Error
We use Bayes error approximations for each pair of GMMs in the Cartesian HMM products:
[Figure: the Cartesian product of the DIAL and CALL HMMs, with initial node I and final node F; each node pairs a state of F with a state of G, e.g. D:K, D:AO, D:L, AY:K, AY:AO, AY:L, AX:K, AX:AO, AX:L, L:K, L:AO, L:L.]
Gaussian Mixture Models
Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write

f(x) = Σ_a πa fa(x), where fa(x) = N(x; μa, Σa),

and

g(x) = Σ_b ωb gb(x), where gb(x) = N(x; μb, Σb).

The high dimensionality of x ∈ R^d, d = 39, makes numerical integration difficult.
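For intuition, a one-dimensional sketch of evaluating such a mixture density (the real models are 39-dimensional; this is only illustrative):

```python
import math

def gauss(x, mu, var):
    """N(x; mu, var) in one dimension."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """f(x) = sum_a pi_a N(x; mu_a, var_a)."""
    return sum(w * gauss(x, m, v) for w, m, v in zip(weights, means, variances))

# A two-component mixture; its density should integrate to 1.
w, mu, var = [0.3, 0.7], [-1.0, 2.0], [1.0, 0.5]
n, lo, hi = 100000, -12.0, 12.0
h = (hi - lo) / n
mass = sum(gmm_pdf(lo + i * h, w, mu, var) for i in range(n + 1)) * h
```

In one dimension a grid sum like this is fine; at d = 39 the grid would need astronomically many points, which is why the rest of the talk is about approximations.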
Bayes, Bhattacharyya, Chernoff and Kullback-Leibler
• Bayes error: Be(f, g) = ∫ min(f(x), g(x)) dx
• Kullback–Leibler divergence: D(f‖g) = ∫ f(x) log (f(x)/g(x)) dx
• Bhattacharyya distance: B(f, g) = ∫ √(f(x) g(x)) dx
• Chernoff distance: C(f, g) = min_{0≤s≤1} Cs(f, g), where Cs(f, g) = ∫ f(x)^s g(x)^{1−s} dx, 0 ≤ s ≤ 1.

Why these? — For a pair of single Gaussians f and g we can compute D(f‖g), B(f, g) and Cs(f, g) analytically.
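The single-Gaussian closed forms are standard; a sketch using numpy (the formulas below are the usual multivariate-Gaussian expressions, not spelled out on the slide):

```python
import numpy as np

def kl_gauss(mu1, S1, mu2, S2):
    """D(N1 || N2) for multivariate Gaussians."""
    d = len(mu1)
    S2inv = np.linalg.inv(S2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(S2inv @ S1) + diff @ S2inv @ diff - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))

def bhattacharyya(mu1, S1, mu2, S2):
    """B(f, g) = integral sqrt(f g) dx for Gaussians (the coefficient, as defined on the slide)."""
    S = 0.5 * (S1 + S2)
    diff = mu2 - mu1
    dist = (0.125 * diff @ np.linalg.inv(S) @ diff
            + 0.5 * np.log(np.linalg.det(S)
                           / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.exp(-dist)

# 1-D sanity check: for N(0,1) vs N(m,1), D = m^2/2 and B = exp(-m^2/8).
mu1, mu2, I = np.array([0.0]), np.array([3.0]), np.eye(1)
kl = kl_gauss(mu1, I, mu2, I)
b = bhattacharyya(mu1, I, mu2, I)
```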
Connections
• Perimeter divergence (power mean):

  Pα(f, g) = ∫ ( (1/2) f(x)^α + (1/2) g(x)^α )^{1/α} dx.

  We have Be(f, g) = P−∞(f, g) and B(f, g) = P0(f, g).

• Rényi generalised divergence of order s:

  Ds(f‖g) = (1/(s − 1)) log ∫ f(x)^s g(x)^{1−s} dx.

  We have D1(f‖g) = D(f‖g) and Ds(f‖g) = (1/(s − 1)) log Cs(f, g).

• Generalisation:

  Gα,s(f‖g) = (1/(s − 1)) log ∫ ( s f(x)^α + (1 − s) g(x)^α )^{1/α} dx.

  Gα,s(f‖g) connects log Be(f, g), D(f‖g), log B(f, g) and Cs(f, g).
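The power-mean identities can be verified numerically on the toy pair; a sketch that works in log space to keep f(x)^α stable for negative α (a numerical choice of ours, not from the talk):

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 200001)
dx = xs[1] - xs[0]
logf = -0.5 * (xs + 2.0) ** 2 - 0.5 * np.log(2 * np.pi)   # log N(x; -2, 1)
logg = -0.5 * (xs - 2.0) ** 2 - 0.5 * np.log(2 * np.pi)   # log N(x;  2, 1)

def perimeter(alpha):
    """P_alpha(f, g) = integral ((f^a + g^a)/2)^(1/a) dx, in log space."""
    m = np.logaddexp(alpha * logf, alpha * logg) - np.log(2.0)
    return np.sum(np.exp(m / alpha)) * dx

# alpha -> 0 limit is the geometric mean, i.e. the Bhattacharyya integral.
bhat = np.sum(np.exp(0.5 * (logf + logg))) * dx
p1, pm1 = perimeter(1.0), perimeter(-1.0)   # arithmetic and harmonic means
```

P1 = ∫ (f + g)/2 = 1, and monotonicity in α gives P−1 ≤ P0 = B ≤ P1; for this pair B = e^{−2} ≈ 0.135.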
Relations and Inequalities
• Gα,1/2(f‖g) = −2 log Pα(f, g), G0,s(f‖g) = Ds(f‖g), and G−∞,s(f‖g) = (1/(s − 1)) log Be(f, g).
• Pα(f, g) ≤ Pβ(f, g) for α ≤ β, and B(f, g) ≤ √( Pα(f, g) P−α(f, g) ).
• Pα(f, f) = 1, Gα,s(f‖f) = 0, P−∞(f, g) + P∞(f, g) = 2, and Be(f, g) = 2 − P∞(f, g).
• B(f, g) = B(g, f) and Be(f, g) = Be(g, f), but D(f‖g) ≠ D(g‖f).
• D(f‖g) ≥ 2 log B(f, g) ≥ 2 log Be(f, g), Be(f, g) ≤ B(f, g) ≤ √( Be(f, g)(2 − Be(f, g)) ), Be(f, g) ≤ C(f, g) ≤ B(f, g), C0(f, g) = C1(f, g) = 1, and B(f, g) = C1/2(f, g).

And so on.
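A quick numerical check of the sandwich Be ≤ B ≤ √(Be(2 − Be)) on the toy Gaussian pair (grid quadrature; a sketch, not from the slides):

```python
import math

def npdf(x, mu):
    """Unit-variance normal density N(x; mu, 1)."""
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

n, lo, hi = 200000, -12.0, 12.0
h = (hi - lo) / n
xs = [lo + i * h for i in range(n + 1)]

be = sum(min(npdf(x, -2), npdf(x, 2)) for x in xs) * h            # Bayes error
b = sum(math.sqrt(npdf(x, -2) * npdf(x, 2)) for x in xs) * h      # Bhattacharyya
upper = math.sqrt(be * (2.0 - be))
```

For this pair Be ≈ 0.046 and B ≈ 0.135, comfortably inside the bound.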
The KL Divergence of a GMM
• Monte Carlo sampling: draw n samples {xi} from f. Then

  D(f‖g) ≈ (1/n) Σ_{i=1}^n log ( f(xi) / g(xi) ),

  with error O(1/√n).

• Gaussian approximation: approximate f with a Gaussian f̂ whose mean and covariance match the total mean and covariance of f; do the same for g and ĝ, then use D(f‖g) ≈ D(f̂‖ĝ), where

  μf̂ = Σ_a πa μa,
  Σf̂ = Σ_a πa ( Σa + (μa − μf̂)(μa − μf̂)ᵀ ).
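The Monte Carlo estimator is a few lines once we can sample from f; a one-dimensional, unit-variance sketch, checked against the single-Gaussian closed form D(N(0,1)‖N(m,1)) = m²/2:

```python
import math
import random

random.seed(0)

def npdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def gmm_pdf(x, weights, means):
    return sum(w * npdf(x, m) for w, m in zip(weights, means))

def gmm_sample(weights, means):
    """Pick a component by its weight, then sample that Gaussian."""
    a = random.choices(range(len(weights)), weights=weights)[0]
    return random.gauss(means[a], 1.0)

def kl_mc(wf, mf, wg, mg, n=100000):
    """D(f||g) ~ (1/n) sum_i log(f(x_i)/g(x_i)), x_i ~ f; error O(1/sqrt(n))."""
    total = 0.0
    for _ in range(n):
        x = gmm_sample(wf, mf)
        total += math.log(gmm_pdf(x, wf, mf) / gmm_pdf(x, wg, mg))
    return total / n

# Single-Gaussian case: D(N(0,1) || N(1,1)) = 0.5 exactly.
est = kl_mc([1.0], [0.0], [1.0], [1.0])
```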
Unscented Approximation
It is possible to pick 2d "sigma" points {xa,k}, k = 1, …, 2d, such that

∫ fa(x) h(x) dx = (1/2d) Σ_{k=1}^{2d} h(xa,k)

exactly for all quadratic functions h. One choice of sigma points is

xa,k = μa + √(d λa,k) ea,k,
xa,d+k = μa − √(d λa,k) ea,k,    k = 1, …, d,

where λa,k and ea,k are the eigenvalues and eigenvectors of Σa. This is akin to Gaussian quadrature.
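A sketch of the sigma-point rule, taking λa,k and ea,k from the eigendecomposition of Σa (our reading of the construction). For a quadratic h, the 2d-point average matches the Gaussian expectation exactly:

```python
import numpy as np

def sigma_points(mu, Sigma):
    """2d points mu +/- sqrt(d * lambda_k) e_k from the eigendecomposition of Sigma."""
    d = len(mu)
    lam, E = np.linalg.eigh(Sigma)
    offsets = np.sqrt(d * lam) * E          # column k is sqrt(d * lam_k) * e_k
    return np.concatenate([mu + offsets.T, mu - offsets.T])

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])

# An arbitrary quadratic h(x) = x'Ax + b'x + c.
A, b, c = np.array([[1.0, 0.3], [0.3, 2.0]]), np.array([0.5, -1.0]), 3.0
h = lambda x: x @ A @ x + b @ x + c

unscented = np.mean([h(x) for x in sigma_points(mu, Sigma)])
exact = np.trace(A @ Sigma) + mu @ A @ mu + b @ mu + c   # E[h(X)], X ~ N(mu, Sigma)
```

The linear terms cancel between the ± pairs, and the quadratic terms average to tr(AΣ), which is why the rule is exact for quadratics.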
Matched Bound Approximation
Match the closest pairs of Gaussians:

m(a) = argmin_b ( D(fa‖gb) − log ωb ).

Goldberger's approximate formula is

D(f‖g) ≈ DGoldberger(f‖g) = Σ_a πa ( D(fa‖g_{m(a)}) + log ( πa / ω_{m(a)} ) ).

This is analogous to the chain rule for relative entropy.

Min approximation: D(f‖g) ≈ min_{a,b} D(fa‖gb) is an approximation in the same spirit.
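A sketch of the matched-bound formula for one-dimensional mixtures of unit-variance Gaussians, where D(fa‖gb) = (μa − μb)²/2:

```python
import math

def goldberger_kl(wf, mf, wg, mg):
    """D_Goldberger = sum_a pi_a (D(f_a || g_m(a)) + log(pi_a / w_m(a))),
    with m(a) = argmin_b D(f_a || g_b) - log w_b.
    Components are 1-D unit-variance Gaussians given by their means."""
    kl = lambda m1, m2: 0.5 * (m1 - m2) ** 2
    total = 0.0
    for pa, ma in zip(wf, mf):
        b = min(range(len(wg)), key=lambda j: kl(ma, mg[j]) - math.log(wg[j]))
        total += pa * (kl(ma, mg[b]) + math.log(pa / wg[b]))
    return total

# Identical mixtures match component-to-component, giving exactly zero.
w, mu = [0.4, 0.6], [-3.0, 3.0]
zero = goldberger_kl(w, mu, w, mu)
approx = goldberger_kl([0.5, 0.5], [-2.0, 2.0], [0.5, 0.5], [-1.0, 3.0])
```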
Variational Approximation
Let φb|a ≥ 0 with Σ_b φb|a = 1 be free variational parameters. Then

∫ f log g = Σ_a πa ∫ fa log Σ_b ωb gb
          = Σ_a πa ∫ fa log Σ_b φb|a ( ωb gb / φb|a )
          ≥ Σ_a πa ∫ fa Σ_b φb|a log ( ωb gb / φb|a )          (Jensen's inequality)
          = Σ_a πa Σ_b φb|a ( log(ωb/φb|a) + ∫ fa log gb ).

Maximize over φb|a, and do the same for ∫ f log f, to get

D(f‖g) ≈ Dvar(f‖g) = Σ_a πa log ( Σ_{a′} π_{a′} e^{−D(fa‖f_{a′})} / Σ_b ωb e^{−D(fa‖gb)} ).
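Dvar needs only the pairwise component KLs; a one-dimensional, unit-variance sketch:

```python
import math

def kl_pair(m1, m2):
    """KL between unit-variance 1-D Gaussians: (m1 - m2)^2 / 2."""
    return 0.5 * (m1 - m2) ** 2

def kl_variational(wf, mf, wg, mg):
    """D_var(f||g) = sum_a pi_a log( sum_a' pi_a' e^{-D(fa||fa')}
                                   / sum_b  w_b  e^{-D(fa||gb)} )."""
    total = 0.0
    for pa, ma in zip(wf, mf):
        num = sum(p * math.exp(-kl_pair(ma, m)) for p, m in zip(wf, mf))
        den = sum(w * math.exp(-kl_pair(ma, m)) for w, m in zip(wg, mg))
        total += pa * math.log(num / den)
    return total

w, mu = [0.3, 0.7], [-2.0, 2.0]
self_div = kl_variational(w, mu, w, mu)              # f = g gives exactly 0
dvar = kl_variational(w, mu, [0.5, 0.5], [-1.0, 1.0])
```

When f = g the numerator and denominator coincide term by term, so Dvar is exactly zero, one of the properties that makes this approximation attractive.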
Variational Upper Bound
Variational parameters: Σ_b φb|a = πa and Σ_a ψa|b = ωb. Gaussian replication:

f = Σ_a πa fa = Σ_{ab} φb|a fa,    g = Σ_b ωb gb = Σ_{ba} ψa|b gb.

D(f‖g) = −∫ f log ( Σ_{ab} ψa|b gb / f )
       = −∫ f log Σ_{ab} ( φb|a fa / f ) · ( ψa|b gb / (φb|a fa) ) dx
       ≤ Σ_{ab} φb|a ∫ fa log ( φb|a fa / (ψa|b gb) ) dx      (Jensen, with weights φb|a fa / f)
       = D(φ‖ψ) + Σ_{ab} φb|a D(fa‖gb).

The chain rule for relative entropy for mixtures with an unequal number of components!
Variational Upper Bound
Optimize the variational bound

D(f‖g) ≤ D(φ‖ψ) + Σ_{ab} φb|a D(fa‖gb)

subject to the constraints Σ_b φb|a = πa and Σ_a ψa|b = ωb.

Fix φ and find the optimal ψ:

ψa|b = ωb φb|a / Σ_{a′} φb|a′.

Fix ψ and find the optimal φ:

φb|a = πa ψa|b e^{−D(fa‖gb)} / Σ_{b′} ψa|b′ e^{−D(fa‖gb′)}.

Iterate a few times to find the optimal solution!
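The two updates can be iterated directly; a one-dimensional, unit-variance sketch, initialised at φb|a = πa ωb (the initialisation is our choice; the slide only says to iterate):

```python
import math

def kl_pair(m1, m2):
    return 0.5 * (m1 - m2) ** 2      # KL between unit-variance 1-D Gaussians

def variational_upper_bound(wf, mf, wg, mg, iters=20):
    """Alternate the psi and phi updates, then evaluate
    D(phi||psi) + sum_ab phi_{b|a} D(fa||gb)."""
    A, B = len(wf), len(wg)
    kl = [[kl_pair(mf[a], mg[b]) for b in range(B)] for a in range(A)]
    phi = [[wf[a] * wg[b] for b in range(B)] for a in range(A)]   # sum_b = pi_a
    for _ in range(iters):
        col = [sum(phi[a][b] for a in range(A)) for b in range(B)]
        psi = [[wg[b] * phi[a][b] / col[b] for b in range(B)] for a in range(A)]
        for a in range(A):
            row = [psi[a][b] * math.exp(-kl[a][b]) for b in range(B)]
            z = sum(row)
            phi[a] = [wf[a] * r / z for r in row]
    bound = 0.0
    for a in range(A):
        for b in range(B):
            if phi[a][b] > 0.0:
                bound += phi[a][b] * (math.log(phi[a][b] / psi[a][b]) + kl[a][b])
    return bound

w, mu = [0.5, 0.5], [-5.0, 5.0]
tight = variational_upper_bound(w, mu, w, mu)                    # f = g: bound -> 0
loose = variational_upper_bound(w, mu, [0.5, 0.5], [-4.0, 4.0])  # f != g: bound > 0
```

For identical mixtures the iteration drives φ toward the diagonal matching, so the bound collapses to (essentially) zero.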
Comparison of KL divergence methods (1)
Histograms of the difference between closed-form approximations and Monte Carlo sampling with 1 million samples:
[Figure: histograms of the deviation from DMC(1M), over [−60, 20], for the zero, product, min, gaussian, and variational approximations.]
Comparison of KL divergence methods (2)
Histograms of the difference between various methods and Monte Carlo sampling with 1 million samples:
[Figure: histograms of the deviation from DMC(1M), over [−5, 5], for the MC(2dn), unscented, goldberger, variational, and upper-bound approximations.]
Summary of KL divergence methods
• Monte Carlo Sampling – arbitrary accuracy, arbitrary cost
• Gaussian Approximation – not cheap, not good, closed form
• Min Approximation – cheap, but not good
• Unscented Approximation – not as good as MC per cost
• Goldberger Approximation – cheap, good, not closed form
• Variational Approximation – cheap, good, closed form
• Variational Upper Bound – cheap, good, upper-bound
Information Theory and Applications, 2/1-2007 – p.21/31
![Page 67: Variational sampling approaches to word confusability](https://reader031.vdocument.in/reader031/viewer/2022020703/61fb439e2e268c58cd5c1c19/html5/thumbnails/67.jpg)
Comparison of KL, Bhattacharyya, Bayes
Bhattacharyya distance

The Bhattacharyya distance, $B(f,g)=\int\sqrt{fg}$, can be estimated using Monte Carlo sampling from an arbitrary distribution $h$:

$$B(f,g)\approx B_h=\frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)},$$

where $\{x_i\}_{i=1}^{n}$ are sampled from $h$. The estimators are unbiased, $\mathrm{E}[B_h]=B(f,g)$, with variance

$$\mathrm{var}(B_h)=\frac{1}{n}\left(\int\frac{fg}{h}-B(f,g)^2\right).$$

Taking $h=f$ gives $\mathrm{var}(B_f)=\frac{1-B(f,g)^2}{n}$.

Taking $h=\frac{f+g}{2}$ gives $\mathrm{var}\big(B_{\frac{f+g}{2}}\big)=\frac{1}{n}\left(\int\frac{2fg}{f+g}-B(f,g)^2\right)$, and $\mathrm{var}\big(B_{\frac{f+g}{2}}\big)\le\mathrm{var}(B_f)$ by the harmonic–arithmetic mean inequality.
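As a concrete illustration (not from the slides), the estimator $B_h$ can be checked against the known closed form for two one-dimensional Gaussians; the parameters below are arbitrary choices, and only NumPy is assumed:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two 1-D Gaussians f and g (arbitrary illustrative parameters).
m1, s1 = 0.0, 1.0
m2, s2 = 1.0, 1.5

def gauss_pdf(x, m, s):
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

f = lambda x: gauss_pdf(x, m1, s1)
g = lambda x: gauss_pdf(x, m2, s2)

def bhattacharyya_is(x, h_pdf):
    """Importance-sampling estimate B_h of B(f,g) = int sqrt(fg); x drawn from h."""
    return np.mean(np.sqrt(f(x) * g(x)) / h_pdf(x))

# h = f: variance (1 - B(f,g)^2) / n.
B_f = bhattacharyya_is(rng.normal(m1, s1, n), f)

# h = (f+g)/2: flip a fair coin per sample, then draw from f or g.
coin = rng.random(n) < 0.5
x_mix = np.where(coin, rng.normal(m1, s1, n), rng.normal(m2, s2, n))
B_mix = bhattacharyya_is(x_mix, lambda x: 0.5 * (f(x) + g(x)))

# Closed form for two 1-D Gaussians (standard result):
B_exact = np.sqrt(2 * s1 * s2 / (s1**2 + s2**2)) \
          * np.exp(-(m1 - m2)**2 / (4 * (s1**2 + s2**2)))
```

Both estimators should agree with `B_exact` (about 0.89 for these parameters); across repeated runs the mixture proposal has the smaller spread, matching the variance inequality above.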
Best sampling distribution

We can find the "best" sampling distribution $h$ by minimizing the variance of $B_h$ subject to the constraints $h\ge 0$ and $\int h=1$. The solution is $h=\frac{\sqrt{fg}}{\int\sqrt{fg}}$, which gives $\mathrm{var}(B_h)=0$.

Unfortunately, using this $h$ requires
• computing the quantity $B(f,g)=\int\sqrt{fg}$ that we are trying to compute in the first place, and
• sampling from $\sqrt{fg}$.

We will use variational techniques to approximate $\sqrt{fg}$ with some unnormalized $\hat{h}$ that can be analytically integrated to give a genuine pdf $h$.
Bhattacharyya Variational Upper Bound

Write the two mixtures with variational parameters $\phi$ and $\psi$:

$$f=\sum_a \pi_a f_a=\sum_{ab}\phi_{b|a}f_a,\qquad \sum_b \phi_{b|a}=\pi_a,$$
$$g=\sum_b \omega_b g_b=\sum_{ba}\psi_{a|b}g_b,\qquad \sum_a \psi_{a|b}=\omega_b.$$

Introduce the variational parameters $\psi_{a|b}$ and prepare to use Jensen's inequality:

$$B(f,g)=\int f\sqrt{\frac{g}{f}}=\int f\sqrt{\sum_{ab}\frac{\psi_{a|b}\,g_b}{f}}
=\int f\sqrt{\sum_{ab}\frac{\phi_{b|a}f_a}{f}\,\frac{\psi_{a|b}\,g_b}{\phi_{b|a}f_a}}\,dx.$$

Interchange $\sqrt{\cdot}$ and the convex combination $\sum_{ab}\frac{\phi_{b|a}f_a}{f}$ (Jensen's inequality, since $\sqrt{\cdot}$ is concave):

$$B(f,g)\ge\sum_{ab}\phi_{b|a}\int f_a\sqrt{\frac{\psi_{a|b}\,g_b}{\phi_{b|a}f_a}}\,dx
=\sum_{ab}\sqrt{\phi_{b|a}\psi_{a|b}}\,B(f_a,g_b).$$

An inequality linking the mixture Bhattacharyya distance to the component distances! (A lower bound on the coefficient $B$ is an upper bound on the distance $-\log B$.)
Bhattacharyya Variational Upper Bound

Optimize the variational bound

$$B(f,g)\ge\sum_{ab}\sqrt{\phi_{b|a}\psi_{a|b}}\,B(f_a,g_b)$$

with respect to the constraints $\sum_b\phi_{b|a}=\pi_a$ and $\sum_a\psi_{a|b}=\omega_b$.

Fix $\phi$, find the optimal value for $\psi$:

$$\psi_{a|b}=\frac{\omega_b\,\phi_{b|a}\,B(f_a,g_b)^2}{\sum_{a'}\phi_{b|a'}\,B(f_{a'},g_b)^2}.$$

Fix $\psi$, find the optimal value for $\phi$:

$$\phi_{b|a}=\frac{\pi_a\,\psi_{a|b}\,B(f_a,g_b)^2}{\sum_{b'}\psi_{a|b'}\,B(f_a,g_{b'})^2}.$$

Iterate to find the optimal solution!
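The two closed-form updates can be iterated directly. A minimal sketch (my own illustration, NumPy only; the weights and component-coefficient matrix `Bab` are arbitrary choices):

```python
import numpy as np

def variational_bound(pi, omega, Bab, iters=100):
    """Coordinate ascent on  sum_ab sqrt(phi * psi) * Bab  subject to
    rows of phi summing to pi and columns of psi summing to omega."""
    B2 = Bab**2
    phi = np.outer(pi, omega)   # feasible start: sum_b phi[a, :] = pi[a]
    psi = np.outer(pi, omega)   # feasible start: sum_a psi[:, b] = omega[b]
    for _ in range(iters):
        w = phi * B2
        psi = omega * w / w.sum(axis=0, keepdims=True)        # optimal psi given phi
        w = psi * B2
        phi = pi[:, None] * w / w.sum(axis=1, keepdims=True)  # optimal phi given psi
    return float(np.sum(np.sqrt(phi * psi) * Bab))

# Hypothetical 2x2 example: component coefficients B(f_a, g_b) in (0, 1].
pi = np.array([0.5, 0.5])
omega = np.array([0.3, 0.7])
Bab = np.array([[0.9, 0.1],
                [0.2, 0.8]])
bound = variational_bound(pi, omega, Bab)
```

Each update is the closed-form optimum from the slide, so the bound is non-decreasing over iterations; by Cauchy–Schwarz it never exceeds 1 when all entries of `Bab` are at most 1.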
Variational Monte Carlo Sampling

Write the variational estimate

$$V(f,g)=\sum_{ab}\sqrt{\phi_{b|a}\psi_{a|b}}\,B(f_a,g_b)=\int\sum_{ab}\sqrt{\phi_{b|a}\psi_{a|b}}\sqrt{f_a g_b}=\int\hat{h}.$$

Here $\hat{h}=\sum_{ab}\sqrt{\phi_{b|a}\psi_{a|b}}\sqrt{f_a g_b}$ is an unnormalized approximation of the optimal sampling distribution, $\sqrt{fg}/\int\sqrt{fg}$.

$h=\hat{h}/\int\hat{h}$ is a GMM, since each $h_{ab}=\sqrt{f_a g_b}/\int\sqrt{f_a g_b}$ is a Gaussian and

$$h=\sum_{ab}\pi_{ab}h_{ab},\qquad \pi_{ab}=\frac{\sqrt{\phi_{b|a}\psi_{a|b}}\int\sqrt{f_a g_b}}{V(f,g)}.$$

Thus, drawing samples $\{x_i\}_{i=1}^{n}$ from $h$, the estimate

$$V_n=\frac{1}{n}\sum_{i=1}^{n}\frac{\sqrt{f(x_i)\,g(x_i)}}{h(x_i)}$$

is unbiased and in experiments is seen to be far superior to sampling from $(f+g)/2$.
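The construction above can be sketched end-to-end for a small case (my own illustration, NumPy only; all parameters are arbitrary choices): a two-component mixture $f$ against a single Gaussian $g$, so the proposal $h$ has two Gaussian components with closed-form means, variances, and weights:

```python
import numpy as np

rng = np.random.default_rng(1)

def gauss_pdf(x, m, s):
    return np.exp(-(x - m)**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

# f: two-component GMM; g: a single Gaussian (arbitrary illustrative parameters).
pi = np.array([0.5, 0.5]); mf = np.array([0.0, 3.0]); sf = np.array([1.0, 1.0])
mg, sg = 1.0, 1.5

f = lambda x: sum(p * gauss_pdf(x, m, s) for p, m, s in zip(pi, mf, sf))
g = lambda x: gauss_pdf(x, mg, sg)

# sqrt(f_a * g) is an unnormalized Gaussian with mean m_ab and variance v_ab,
# and its integral is the component coefficient B(f_a, g).
v_ab = 2 * sf**2 * sg**2 / (sf**2 + sg**2)
m_ab = (mf * sg**2 + mg * sf**2) / (sf**2 + sg**2)
B_ab = np.sqrt(2 * sf * sg / (sf**2 + sg**2)) \
       * np.exp(-(mf - mg)**2 / (4 * (sf**2 + sg**2)))

# With a single component in g, phi[b|a] = pi[a] is forced, and one psi
# update (the closed-form optimum from the previous slide) is optimal.
psi = pi * B_ab**2 / np.sum(pi * B_ab**2)
V = np.sum(np.sqrt(pi * psi) * B_ab)        # variational lower bound on B(f, g)

# Normalized proposal h: a GMM with weights pi_ab = sqrt(phi psi) B_ab / V.
w_ab = np.sqrt(pi * psi) * B_ab / V
h = lambda x: sum(w * gauss_pdf(x, m, np.sqrt(v))
                  for w, m, v in zip(w_ab, m_ab, v_ab))

n = 100_000
comp = rng.choice(2, size=n, p=w_ab)
x = rng.normal(m_ab[comp], np.sqrt(v_ab)[comp])
V_n = np.mean(np.sqrt(f(x) * g(x)) / h(x))  # unbiased estimate of B(f, g)
```

In this small example the importance weights stay bounded, so `V_n` concentrates tightly around $B(f,g)$, consistent with the "fast convergence" behavior reported for variational importance sampling.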
Bhattacharyya Distance: Monte Carlo estimation

[Figure: histograms of the deviation from the reference Bhattacharyya estimate (1M samples), for runs of 10, 100, 1000, 10K, and 100K samples under each proposal]

• Importance sampling from f(x): Slow convergence.
• Importance sampling from (f(x)+g(x))/2: Better convergence.
• Variational importance sampling: Fast convergence.
Comparison of KL, Bhattacharyya, Bayes
Variational Monte Carlo Sampling: KL-Divergence
Future Directions

• HMM variational KL divergence
• HMM variational Bhattacharyya
• Variational Chernoff distance
• Variational sampling of Bayes error using Chernoff approximation
• Discriminative training using Bhattacharyya divergence
• Acoustic confusability using Bhattacharyya divergence
• Clustering of HMMs