
Variational Sampling Approaches to Word Confusability

John R. Hershey, Peder A. Olsen and Ramesh A. Gopinath

{jrhershe,pederao,rameshg}@us.ibm.com

IBM, T. J. Watson Research Center

Information Theory and Applications, 2/1-2007

Abstract

In speech recognition it is often useful to determine how confusable two words are. For speech models this comes down to computing the Bayes error between two HMMs. This problem is analytically and numerically intractable. A common, numerically tractable alternative uses the KL divergence in place of the Bayes error. We present new approaches to approximating the KL divergence that combine variational methods with importance sampling. The Bhattacharyya distance, a closer cousin of the Bayes error, turns out to be even more amenable to our approach. Our experiments demonstrate an improvement of orders of magnitude in accuracy over conventional methods.

Outline

• Acoustic Confusability
• Divergence Measures for Distributions
• KL Divergence: Prior Art
• KL Divergence: Variational Approximations
• KL Divergence: Empirical Evaluations
• Bhattacharyya: Monte Carlo Approximation
• Bhattacharyya: Variational Approximation
• Bhattacharyya: Variational Monte Carlo Approximation
• Bhattacharyya: Empirical Evaluations
• Future Directions

A Toy Version of the Confusability Problem

[Figure: probability density plot of N(x; −2, 1), N(x; 2, 1) and their pointwise geometric mean (fg)^{1/2}.]

• X is an acoustic feature vector representing a speech class, say "E", with pdf f(x) = N(x; −2, 1).
• Another speech class, "O", is described by pdf g(x) = N(x; 2, 1).
• The asymmetric error is the probability that one class, "O", will be mistaken for the other, "E", when classifying: Ae(f, g) = ∫ f(x) 1_{g(x) ≥ f(x)}(x) dx.
• The Bayes error is the total classification error: Be(f, g) = ∫ min(f(x), g(x)) dx.
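As a concrete check of these definitions, here is a minimal numerical sketch (my own illustration, not from the slides) that evaluates Ae, Be and B for the toy densities above on a grid, assuming only numpy and scipy:

```python
# Minimal numerical sketch: evaluate the toy example f = N(x; -2, 1),
# g = N(x; 2, 1) on a fine grid and integrate by Riemann sum.
import numpy as np
from scipy.stats import norm

f = norm(loc=-2.0, scale=1.0)
g = norm(loc=2.0, scale=1.0)

x = np.linspace(-12.0, 12.0, 200001)
dx = x[1] - x[0]
fx, gx = f.pdf(x), g.pdf(x)

ae = np.sum(fx * (gx >= fx)) * dx          # asymmetric error Ae(f, g)
be = np.sum(np.minimum(fx, gx)) * dx       # Bayes error Be(f, g)
bc = np.sum(np.sqrt(fx * gx)) * dx         # Bhattacharyya "distance" B(f, g)

print(ae, be, bc)   # roughly 0.0228, 0.0455, 0.135 for this symmetric example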

Word Models

A word is modeled using its pronunciation(s) and an HMM. As an example, the word DIAL has a pronunciation D AY AX L and HMM F.

CALL has a pronunciation K AO L and HMM G.

[Figure: HMMs for DIAL (D AY AX L) and CALL (K AO L).]

Each node in the HMM has a GMM associated with it. The word confusability is the Bayes error Be(F, G). This quantity is too hard to compute!!

The Edit Distance

  DIAL   CALL   edit op.       cost
  D      K      substitution   1
  AY            ins/del        1
  AX     AO     substitution   1
  L      L      none           0
                Total cost     3

The edit distance is the shortest path in the graph:

[Figure: edit-distance lattice from the initial node I to the final node F, with D AY AX L along one axis and K AO L along the other; every edge costs 1 except the final L:L match, which costs 0.]
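For reference, a short sketch of the standard dynamic program behind this edit distance (my own illustration, using the unit insertion, deletion and substitution costs from the table):

```python
# Sketch: unit-cost edit (Levenshtein) distance between two phone sequences.
def edit_distance(src, dst):
    m, n = len(src), len(dst)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete everything in src
    for j in range(n + 1):
        d[0][j] = j                      # insert everything in dst
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if src[i - 1] == dst[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

print(edit_distance(["D", "AY", "AX", "L"], ["K", "AO", "L"]))   # 3, as in the table
```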

Better ways

Other techniques put weights on the edges. Acoustic perplexity and the Average Divergence Distance are variants of this paradigm that use approximations to the KL divergence as edge weights.

Bayes Error

We use Bayes error approximations for each pair of GMMs in the Cartesian HMM product:

[Figure: lattice of GMM pairs from the Cartesian product of the DIAL and CALL HMMs, with nodes D:K, D:AO, D:L, AY:K, AY:AO, AY:L, AX:K, AX:AO, AX:L, L:K, L:AO and L:L.]

Gaussian Mixture Models

Each node in the Cartesian HMM product corresponds to a pair of Gaussian mixture models f and g. We write

  f(x) = ∑_a π_a f_a(x), where f_a(x) = N(x; μ_a, Σ_a),

and

  g(x) = ∑_b ω_b g_b(x), where g_b(x) = N(x; μ_b, Σ_b).

The high dimensionality of x ∈ R^d, d = 39, makes numerical integration difficult.

Bayes, Bhattacharyya, Chernoff and Kullback-Leibler

• Bayes error: Be(f, g) = ∫ min(f(x), g(x)) dx
• Kullback-Leibler divergence: D(f‖g) = ∫ f(x) log(f(x)/g(x)) dx
• Bhattacharyya distance: B(f, g) = ∫ √(f(x) g(x)) dx
• Chernoff distance: C(f, g) = min_{0≤s≤1} Cs(f, g), where Cs(f, g) = ∫ f(x)^s g(x)^{1−s} dx, 0 ≤ s ≤ 1.

Why these? For a pair of single Gaussians f and g we can compute D(f‖g), B(f, g) and Cs(f, g) analytically.
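To illustrate the closed forms alluded to here, a small sketch (my own, assuming full covariances and numpy only) of D(f‖g) and B(f, g) for a single pair of Gaussians:

```python
# Closed-form KL divergence and Bhattacharyya coefficient for two Gaussians
# N(mu0, S0) and N(mu1, S1); an illustrative sketch, numpy only.
import numpy as np

def kl_gauss(mu0, S0, mu1, S1):
    d = mu0.shape[0]
    S1inv = np.linalg.inv(S1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(S1inv @ S0) + diff @ S1inv @ diff - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

def bhatt_gauss(mu0, S0, mu1, S1):
    # B(f, g) in the slide's sense: the integral of sqrt(f g)
    S = 0.5 * (S0 + S1)
    diff = mu1 - mu0
    db = (0.125 * diff @ np.linalg.inv(S) @ diff
          + 0.5 * np.log(np.linalg.det(S)
                         / np.sqrt(np.linalg.det(S0) * np.linalg.det(S1))))
    return np.exp(-db)

# For the toy example: D = 8.0 and B = exp(-2) ≈ 0.135
print(kl_gauss(np.array([-2.0]), np.eye(1), np.array([2.0]), np.eye(1)),
      bhatt_gauss(np.array([-2.0]), np.eye(1), np.array([2.0]), np.eye(1)))
```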

Connections

• Perimeter divergence (power mean):

    Pα(f, g) = ∫ ( (1/2) f(x)^α + (1/2) g(x)^α )^{1/α} dx.

  We have Be(f, g) = P−∞(f, g) and B(f, g) = P0(f, g).

• Rényi generalised divergence of order s:

    Ds(f‖g) = 1/(s−1) · log ∫ f(x)^s g(x)^{1−s} dx.

  We have D1(f‖g) = D(f‖g) and Ds(f‖g) = 1/(s−1) · log Cs(f, g).

• Generalisation:

    Gα,s(f‖g) = 1/(s−1) · log ∫ ( s f(x)^α + (1−s) g(x)^α )^{1/α} dx.

  Gα,s(f‖g) connects log Be(f, g), D(f‖g), log B(f, g) and Cs(f, g).

Relations and Inequalities

• Gα,1/2(f‖g) = −2 log Pα(f, g), G0,s(f‖g) = Ds(f‖g) and G−∞,s(f‖g) = Be(f, g).
• Pα(f, g) ≤ Pβ(f, g) for α ≤ β, and B(f, g) ≤ √( Pα(f, g) P−α(f, g) ).
• Pα(f, f) = 1, Gα,s(f, f) = 0, P−∞(f, g) + P∞(f, g) = 2, Be(f, g) = 2 − P∞(f, g).
• B(f, g) = B(g, f) and Be(f, g) = Be(g, f), but D(f‖g) ≠ D(g‖f).
• D(f‖g) ≥ 2 log B(f, g) ≥ 2 log Be(f, g), Be(f, g) ≤ B(f, g) ≤ √( Be(f, g)(2 − Be(f, g)) ), Be(f, g) ≤ C(f, g) ≤ B(f, g), C0(f, g) = C1(f, g) = 1 and B(f, g) = C1/2(f, g), and so on.

The KL Divergence of a GMM

• Monte Carlo sampling: draw n samples {x_i} from f. Then

    D(f‖g) ≈ (1/n) ∑_{i=1}^{n} log( f(x_i) / g(x_i) ),

  with error O(1/√n).

• Gaussian approximation: approximate f with a Gaussian f̂ whose mean and covariance match the total mean and covariance of f; do the same for g and ĝ, and then use D(f‖g) ≈ D(f̂‖ĝ), where

    μ_f = ∑_a π_a μ_a,   Σ_f = ∑_a π_a ( Σ_a + (μ_a − μ_f)(μ_a − μ_f)^T ).
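Both approximations are easy to prototype. Below is a hedged sketch for 1-D GMMs represented as (weights, means, sds) triples; the helper names (gmm_logpdf, gmm_sample, kl_mc, moment_match) are my own, not from the slides.

```python
# Sketch: Monte Carlo KL estimate and moment matching for 1-D GMMs.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def gmm_logpdf(x, w, mu, sd):
    # log sum_a w_a N(x; mu_a, sd_a^2), evaluated on a 1-D array x
    comp = np.log(w)[:, None] + norm.logpdf(x[None, :], mu[:, None], sd[:, None])
    m = comp.max(axis=0)
    return m + np.log(np.exp(comp - m).sum(axis=0))

def gmm_sample(n, w, mu, sd):
    idx = rng.choice(len(w), size=n, p=w)
    return rng.normal(mu[idx], sd[idx])

def kl_mc(n, f, g):
    # D(f||g) ~ (1/n) sum_i log f(x_i)/g(x_i), x_i ~ f; error is O(1/sqrt(n))
    x = gmm_sample(n, *f)
    return np.mean(gmm_logpdf(x, *f) - gmm_logpdf(x, *g))

def moment_match(w, mu, sd):
    # total mean and variance of the mixture, as in the Gaussian approximation
    m = np.sum(w * mu)
    v = np.sum(w * (sd ** 2 + (mu - m) ** 2))
    return m, v
```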

Unscented Approximation

It is possible to pick 2d "sigma" points {x_{a,k}}, k = 1, …, 2d, such that

  ∫ f_a(x) h(x) dx = (1/2d) ∑_{k=1}^{2d} h(x_{a,k})

exactly for all quadratic functions h. One choice of sigma points is

  x_{a,k} = μ_a + √(d λ_{a,k}) e_{a,k},
  x_{a,d+k} = μ_a − √(d λ_{a,k}) e_{a,k},   k = 1, …, d,

where λ_{a,k} and e_{a,k} are the eigenvalues and eigenvectors of Σ_a. This is akin to Gaussian quadrature.
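A small sketch of this sigma-point rule for one Gaussian component (my own illustration, taking λ_{a,k} and e_{a,k} from the eigendecomposition of the component covariance):

```python
# Sketch: sigma-point (unscented) approximation of E_{f_a}[h(X)] for one
# Gaussian component N(mu, Sigma); exact for quadratic h.
import numpy as np

def unscented_expectation(h, mu, Sigma):
    d = mu.shape[0]
    lam, E = np.linalg.eigh(Sigma)              # Sigma = E diag(lam) E^T
    offsets = (np.sqrt(d * lam) * E).T          # row k is sqrt(d * lam_k) * e_k
    pts = np.vstack([mu + offsets, mu - offsets])   # the 2d sigma points
    return np.mean([h(x) for x in pts])

# e.g. unscented_expectation(lambda x: x @ x, np.zeros(3), np.eye(3)) == 3.0
```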

Matched Bound Approximation

Match the closest pairs of Gaussians:

  m(a) = argmin_b ( D(f_a‖g_b) − log ω_b ).

Goldberger's approximate formula is

  D(f‖g) ≈ D_Goldberger(f‖g) = ∑_a π_a ( D(f_a‖g_{m(a)}) + log( π_a / ω_{m(a)} ) ),

which is analogous to the chain rule for relative entropy.

Min approximation: D(f‖g) ≈ min_{a,b} D(f_a‖g_b) is an approximation in the same spirit.
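A sketch of the matched-bound (Goldberger) formula and the min approximation, assuming the pairwise component KLs D(f_a‖g_b) have already been computed in closed form (e.g. with a function like kl_gauss above); the array names are my own.

```python
# Sketch: Goldberger-style matched-bound KL approximation from a table of
# pairwise component divergences kl_fg[a, b] = D(f_a || g_b).
import numpy as np

def kl_goldberger(pi, omega, kl_fg):
    # match each f_a to the g_b minimizing D(f_a||g_b) - log omega_b
    m = np.argmin(kl_fg - np.log(omega)[None, :], axis=1)
    a = np.arange(len(pi))
    return np.sum(pi * (kl_fg[a, m] + np.log(pi / omega[m])))

def kl_min(kl_fg):
    # the "min approximation": D(f||g) ~ min over (a, b) of D(f_a||g_b)
    return kl_fg.min()
```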

Variational Approximation

Let φ_{b|a} ≥ 0 with ∑_b φ_{b|a} = 1 be free variational parameters. Then

  ∫ f log g = ∑_a π_a ∫ f_a log ∑_b ω_b g_b
            = ∑_a π_a ∫ f_a log ∑_b φ_{b|a} ( ω_b g_b / φ_{b|a} )            (introduce the variational parameters)
            ≥ ∑_a π_a ∫ f_a ∑_b φ_{b|a} log( ω_b g_b / φ_{b|a} )             (Jensen's inequality: interchange log and ∑_b φ_{b|a})
            = ∑_a π_a ∑_b φ_{b|a} ( log(ω_b / φ_{b|a}) + ∫ f_a log g_b ).    (simplify)

Maximize over φ_{b|a}, do the same for ∫ f log f, and combine to get

  D(f‖g) ≈ D_var(f‖g) = ∑_a π_a log( ∑_{a'} π_{a'} e^{−D(f_a‖f_{a'})} / ∑_b ω_b e^{−D(f_a‖g_b)} ).
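The closed-form D_var is a couple of lines once the pairwise component KLs are available; a hedged sketch (array names mine), reusing a kl_fg-style table as above plus the within-f table kl_ff[a, a'] = D(f_a‖f_{a'}):

```python
# Sketch: variational KL approximation D_var(f||g) from pairwise component KLs.
import numpy as np
from scipy.special import logsumexp

def kl_variational(pi, omega, kl_ff, kl_fg):
    # kl_ff[a, a'] = D(f_a || f_a'),  kl_fg[a, b] = D(f_a || g_b)
    num = logsumexp(np.log(pi)[None, :] - kl_ff, axis=1)     # log sum_a' pi_a' e^{-D(f_a||f_a')}
    den = logsumexp(np.log(omega)[None, :] - kl_fg, axis=1)  # log sum_b omega_b e^{-D(f_a||g_b)}
    return np.sum(pi * (num - den))
```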

Variational Upper Bound

Variational parameters: ∑_b φ_{b|a} = π_a and ∑_a ψ_{a|b} = ω_b. Gaussian replication:

  f = ∑_a π_a f_a = ∑_{ab} φ_{b|a} f_a,   g = ∑_b ω_b g_b = ∑_{ba} ψ_{a|b} g_b.

  D(f‖g) = ∫ f log(f/g)
         = −∫ f log( ∑_{ab} ψ_{a|b} g_b / f )                                     (introduce the variational parameters ψ_{a|b})
         = −∫ f log( ∑_{ab} (φ_{b|a} f_a / f) · ψ_{a|b} g_b / (φ_{b|a} f_a) ) dx  (prepare to use Jensen's inequality)
         ≤ −∑_{ab} φ_{b|a} ∫ f_a log( ψ_{a|b} g_b / (φ_{b|a} f_a) ) dx            (interchange log and ∑_{ab} φ_{b|a} f_a / f)
         = D(φ‖ψ) + ∑_{ab} φ_{b|a} D(f_a‖g_b),

where D(φ‖ψ) = ∑_{ab} φ_{b|a} log( φ_{b|a} / ψ_{a|b} ). The chain rule for relative entropy, for mixtures with unequal numbers of components!!

Variational Upper Bound

Optimize the variational bound

  D(f‖g) ≤ D(φ‖ψ) + ∑_{ab} φ_{b|a} D(f_a‖g_b)

subject to the constraints ∑_b φ_{b|a} = π_a and ∑_a ψ_{a|b} = ω_b.

Fix φ and find the optimal ψ:

  ψ_{a|b} = ω_b φ_{b|a} / ∑_{a'} φ_{b|a'}.

Fix ψ and find the optimal φ:

  φ_{b|a} = π_a ψ_{a|b} e^{−D(f_a‖g_b)} / ∑_{b'} ψ_{a|b'} e^{−D(f_a‖g_{b'})}.

Iterate a few times to find the optimal solution!
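A sketch of these fixed-point updates (my own arrangement; phi[a, b] stands for φ_{b|a} and psi[a, b] for ψ_{a|b}):

```python
# Sketch: iterate the phi/psi updates for the variational upper bound
# D(f||g) <= D(phi||psi) + sum_ab phi_{b|a} D(f_a||g_b).
import numpy as np

def kl_upper_bound(pi, omega, kl_fg, n_iter=20):
    phi = np.outer(pi, omega)                    # rows of phi sum to pi_a
    psi = np.outer(pi, omega)                    # columns of psi sum to omega_b
    for _ in range(n_iter):
        psi = omega[None, :] * phi / phi.sum(axis=0, keepdims=True)
        w = psi * np.exp(-kl_fg)
        phi = pi[:, None] * w / w.sum(axis=1, keepdims=True)
    return np.sum(phi * (np.log(phi / psi) + kl_fg))
```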

Comparison of KL divergence methods (1)

Histograms of the difference between closed-form approximations and Monte Carlo sampling with 1 million samples:

[Figure: histograms of the deviation from D_MC(1M) for the zero, product, min, gaussian and variational approximations.]

Comparison of KL divergence methods (2)

Histograms of the difference between various methods and Monte Carlo sampling with 1 million samples:

[Figure: histograms of the deviation from D_MC(1M) for the MC(2dn), unscented, goldberger, variational and upper-bound approximations.]

Summary of KL divergence methods

• Monte Carlo Sampling – arbitrary accuracy, arbitrary cost
• Gaussian Approximation – not cheap, not good, closed form
• Min Approximation – cheap, but not good
• Unscented Approximation – not as good as MC per unit cost
• Goldberger Approximation – cheap, good, not closed form
• Variational Approximation – cheap, good, closed form
• Variational Upper Bound – cheap, good, an upper bound

Comparison of KL, Bhattacharyya, Bayes

[Figures not preserved in the transcript.]

Bhattacharyya distance

The Bhattacharyya distance, B(f, g) = ∫ √(f g), can be estimated using Monte Carlo sampling from an arbitrary distribution h:

  B(f, g) ≈ B_h = (1/n) ∑_{i=1}^{n} √( f(x_i) g(x_i) ) / h(x_i),

where {x_i}, i = 1, …, n, are sampled from h. The estimators are unbiased, E[B_h] = B(f, g), with variance

  var(B_h) = (1/n) ( ∫ f g / h − B(f, g)² ).

h = f gives var(B_f) = ( 1 − B(f, g)² ) / n.

h = (f + g)/2 gives var(B_{(f+g)/2}) = ( ∫ 2 f g / (f + g) − B(f, g)² ) / n,

and var(B_{(f+g)/2}) ≤ var(B_f) (harmonic–arithmetic mean inequality).
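A sketch of the estimator B_h for 1-D GMMs, reusing the illustrative gmm_logpdf / gmm_sample helpers from the KL sketch earlier; sampling from (f + g)/2 simply means concatenating the two mixtures with halved weights.

```python
# Sketch: importance-sampling estimate of B(f, g) = ∫ sqrt(f g) with proposal h.
import numpy as np

def bhatt_is(n, f, g, h):
    # f, g, h are (weights, means, sds) triples; samples are drawn from h
    x = gmm_sample(n, *h)
    log_w = 0.5 * (gmm_logpdf(x, *f) + gmm_logpdf(x, *g)) - gmm_logpdf(x, *h)
    return np.mean(np.exp(log_w))

def mix_half(f, g):
    # the proposal h = (f + g) / 2 as a single GMM
    (wf, mf, sf), (wg, mg, sg) = f, g
    return (np.concatenate([0.5 * wf, 0.5 * wg]),
            np.concatenate([mf, mg]),
            np.concatenate([sf, sg]))
```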

Best sampling distribution

We can find the "best" sampling distribution h by minimizing the variance of B_h subject to the constraints h ≥ 0 and ∫ h = 1. The solution is h = √(f g) / ∫ √(f g), which gives var(B_h) = 0.

Unfortunately, using this h requires

• computing the quantity B(f, g) = ∫ √(f g) that we are trying to compute in the first place, and
• sampling from √(f g).

We will use variational techniques to approximate √(f g) with some unnormalized h̄ that can be analytically integrated to give a genuine pdf h.

Bhattacharyya Variational Upper Bound

Variational parameters and Gaussian replication:

  f = ∑_a π_a f_a = ∑_{ab} φ_{b|a} f_a,   ∑_b φ_{b|a} = π_a,
  g = ∑_b ω_b g_b = ∑_{ba} ψ_{a|b} g_b,   ∑_a ψ_{a|b} = ω_b.

  B(f, g) = ∫ f √( g / f )
          = ∫ f √( ∑_{ab} ψ_{a|b} g_b / f )                                      (introduce the variational parameters ψ_{a|b})
          = ∫ f √( ∑_{ab} (φ_{b|a} f_a / f) · ψ_{a|b} g_b / (φ_{b|a} f_a) ) dx   (prepare to use Jensen's inequality)
          ≥ ∑_{ab} φ_{b|a} ∫ f_a √( ψ_{a|b} g_b / (φ_{b|a} f_a) ) dx             (interchange √· and ∑_{ab} φ_{b|a} f_a / f)
          = ∑_{ab} √( φ_{b|a} ψ_{a|b} ) B(f_a, g_b).

An inequality linking the mixture Bhattacharyya distance to the component distances!!

Bhattacharyya Variational Upper Bound

Optimize the variational bound

  B(f, g) ≥ ∑_{ab} √( φ_{b|a} ψ_{a|b} ) B(f_a, g_b)

subject to the constraints ∑_b φ_{b|a} = π_a and ∑_a ψ_{a|b} = ω_b.

Fix φ and find the optimal ψ:

  ψ_{a|b} = ω_b φ_{b|a} B(f_a, g_b)² / ∑_{a'} φ_{b|a'} B(f_{a'}, g_b)².

Fix ψ and find the optimal φ:

  φ_{b|a} = π_a ψ_{a|b} B(f_a, g_b)² / ∑_{b'} ψ_{a|b'} B(f_a, g_{b'})².

Iterate to find the optimal solution!
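A sketch of these updates, analogous to the KL case above; bc[a, b] stands for the pairwise Gaussian Bhattacharyya coefficient B(f_a, g_b), e.g. from the bhatt_gauss sketch earlier.

```python
# Sketch: iterate the phi/psi updates for the variational Bhattacharyya bound
# B(f, g) >= sum_ab sqrt(phi_{b|a} psi_{a|b}) B(f_a, g_b).
import numpy as np

def bhatt_variational(pi, omega, bc, n_iter=20):
    phi = np.outer(pi, omega)
    psi = np.outer(pi, omega)
    bc2 = bc ** 2
    for _ in range(n_iter):
        psi = omega[None, :] * phi * bc2 / (phi * bc2).sum(axis=0, keepdims=True)
        phi = pi[:, None] * psi * bc2 / (psi * bc2).sum(axis=1, keepdims=True)
    bound = np.sum(np.sqrt(phi * psi) * bc)
    return bound, phi, psi
```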

Variational Monte Carlo Sampling

Write the variational estimate as

  V(f, g) = ∑_{ab} √( φ_{b|a} ψ_{a|b} ) B(f_a, g_b) = ∫ ∑_{ab} √( φ_{b|a} ψ_{a|b} ) √( f_a g_b ) = ∫ h̄.

Here h̄ = ∑_{ab} √( φ_{b|a} ψ_{a|b} ) √( f_a g_b ) is an unnormalized approximation of the optimal sampling distribution √(f g) / ∫ √(f g).

h = h̄ / ∫ h̄ is a GMM, since h_{ab} = √( f_a g_b ) / ∫ √( f_a g_b ) is a Gaussian and

  h = ∑_{ab} π_{ab} h_{ab}, where π_{ab} = √( φ_{b|a} ψ_{a|b} ) ∫ √( f_a g_b ) / V(f, g).

Thus, drawing samples {x_i}, i = 1, …, n, from h, the estimate

  V_n = (1/n) ∑_{i=1}^{n} √( f(x_i) g(x_i) ) / h(x_i)

is unbiased and in experiments is seen to be far superior to sampling from (f + g)/2.
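A sketch of the full variational importance sampler for 1-D GMMs, building the surrogate mixture h from the pairwise products √(f_a g_b) and the optimized φ, ψ (e.g. from bhatt_variational above); again all helper names, including the reused gmm_logpdf, are my own.

```python
# Sketch: variational importance sampling of B(f, g) for 1-D GMMs.
import numpy as np
from scipy.stats import norm

def bhatt_variational_mc(n, f, g, phi, psi, rng=np.random.default_rng(0)):
    (_, mf, sf), (_, mg, sg) = f, g              # (weights, means, sds) triples
    va, vb = sf[:, None] ** 2, sg[None, :] ** 2
    # sqrt(f_a g_b) is an unnormalized Gaussian with this variance and mean ...
    var = 2.0 * va * vb / (va + vb)
    mu = var * (mf[:, None] / (2 * va) + mg[None, :] / (2 * vb))
    # ... and total mass equal to the pairwise Bhattacharyya coefficient B(f_a, g_b)
    bc = (np.sqrt(2 * sf[:, None] * sg[None, :] / (va + vb))
          * np.exp(-((mf[:, None] - mg[None, :]) ** 2) / (4 * (va + vb))))
    w = (np.sqrt(phi * psi) * bc).ravel()        # unnormalized mixture weights pi_ab
    p = w / w.sum()
    idx = rng.choice(p.size, size=n, p=p)
    x = rng.normal(mu.ravel()[idx], np.sqrt(var.ravel())[idx])
    # normalized sampling density h(x), a GMM over the (a, b) pairs
    hx = (p[None, :] * norm.pdf(x[:, None], mu.ravel()[None, :],
                                np.sqrt(var.ravel())[None, :])).sum(axis=1)
    fg = np.exp(0.5 * (gmm_logpdf(x, *f) + gmm_logpdf(x, *g)))   # sqrt(f(x) g(x))
    return np.mean(fg / hx)                      # unbiased estimate of B(f, g)
```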

Bhattacharyya Distance: Monte Carlo estimation

[Figures: histograms of the deviation from the Bhattacharyya estimate with 1M samples, for sample sizes 10, 100, 1000, 10K and 100K, under importance sampling from f(x), from (f(x) + g(x))/2, and variational importance sampling.]

• Importance sampling from f(x): slow convergence.
• Importance sampling from (f(x) + g(x))/2: better convergence.
• Variational importance sampling: fast convergence.

Comparison of KL, Bhattacharyya, Bayes

[Figures not preserved in the transcript.]

Variational Monte Carlo Sampling: KL-Divergence

[Figure not preserved in the transcript.]

Future Directions

• HMM variational KL divergence
• HMM variational Bhattacharyya
• Variational Chernoff distance
• Variational sampling of the Bayes error using the Chernoff approximation
• Discriminative training using the Bhattacharyya divergence
• Acoustic confusability using the Bhattacharyya divergence
• Clustering of HMMs
