Type-Based MCMC
NAACL 2010 – Los Angeles
Percy Liang Michael I. Jordan Dan Klein
Learning latent-variable models

· · · DT NNS VBD · · ·
· · · some stocks rose · · ·
· · · NN NNS VBD · · ·
· · · tech stocks fell · · ·
· · · DT NNS VBD · · ·
· · · the stocks soared · · ·
· · · WRB NNS VBD · · ·
· · · how stocks dipped · · ·
· · · DT NNS VBD · · ·
· · · many stocks created · · ·

Variables are highly dependent.
Simple: token-based (e.g., Gibbs sampler) ⇒ local optima, slow mixing
Classic: sentence-based (e.g., EM) ⇒ handles structural dependencies (dynamic programming)
New: type-based ⇒ handles parameter dependencies (exchangeability)
Structural dependencies

Dependencies between adjacent variables

[Figure: probability of three tag assignments for "worker demands meeting resistance":
  NN VBZ NN NN · NN NNS VBG NN · NN VBZ VBG NN]

Token-based: update only one variable at a time. Problem: need to go downhill before uphill.
Sentence-based: update all variables in a sentence.
Parameter dependencies

Dependencies between variables with shared parameters

[Figure: probability of tagging the four occurrences of "stocks" ("stocks rose",
"stocks fell", "stocks took", "stocks shot") as VBZ or NNS, ranging from all-VBZ
through mixed assignments to all-NNS]

Token-based: need to go downhill a lot before going uphill
Type-based: update all variables of one type

1. Parameter dependencies create deeper valleys
2. Sentence-based cannot handle these dependencies
What exactly is a type?
How can we update all variables of a type efficiently?
Formal setup

Parameters θ: vector of conditional probabilities
  θ_{T:V,N}: from state V, probability of transitioning to state N
  p(θ) is a product of Dirichlets [prior]

Choices z: specifies values of latent and observed variables
  Each choice z ∈ z uses one parameter, indexed by j(z)
  e.g., z = [T:V,N at position 8] has j(z) = T:V,N
  p(z | θ) = ∏_{z∈z} θ_{j(z)} [likelihood]
  p(z) = ∫ p(z | θ) p(θ) dθ [marginal likelihood]

Observations x: observed part of z (e.g., the words)

Goal: sample from p(z | x) [not p(z | x, θ)]
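Since the prior is a product of Dirichlets and the likelihood is a product of the corresponding multinomial parameters, the integral above has a closed form (the standard Dirichlet-multinomial identity; the hyperparameter notation α_j below is an assumption, not taken from the slides):

```latex
% For each multinomial group g with a Dirichlet(\alpha) prior,
% integrating \theta out of p(z \mid \theta)\,p(\theta) gives
p(z) \;=\; \prod_{g}
  \frac{\Gamma\!\big(\sum_{j \in g} \alpha_j\big)}
       {\Gamma\!\big(\sum_{j \in g} (\alpha_j + n_j)\big)}
  \prod_{j \in g} \frac{\Gamma(\alpha_j + n_j)}{\Gamma(\alpha_j)}
```

where n_j counts how many times parameter j is used in z, i.e. n_j = |{z ∈ z : j(z) = j}|. Note that z enters only through these counts, which is the exchangeability property exploited next.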
Exchangeability

Sufficient statistics n: # times each parameter was used in z
  n_{T:V,N}: # times that state V transitioned to state N

Rewrite likelihood: p(z | θ) = ∏_j θ_j^{n_j} = simple function of n and θ
p(z) = simple function of n (key: exchangeability)

z1: · · · D V V · · · (many stocks rose)   · · · D N V · · · (tech stocks were)
z2: · · · D N V · · · (many stocks rose)   · · · D V V · · · (tech stocks were)

z1 and z2 have the same sufficient statistics ⇒ p(z1) = p(z2)
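The z1/z2 example above can be checked mechanically. This is a toy sketch (transition counts only, ignoring emissions) showing that swapping the two tag assignments across sentences leaves the sufficient statistics, and hence p(z), unchanged:

```python
from collections import Counter

def transition_counts(tags):
    """Sufficient statistics n: how often each (prev, next) transition is used."""
    return Counter(zip(tags, tags[1:]))

def stats(z):
    """Pool transition counts over all sentences of a corpus z."""
    total = Counter()
    for sent in z:
        total += transition_counts(sent)
    return total

# z1 and z2 differ only by swapping the tags of the two "stocks" tokens
# across sentences (cf. the slide's example).
z1 = [("D", "V", "V"), ("D", "N", "V")]
z2 = [("D", "N", "V"), ("D", "V", "V")]

assert stats(z1) == stats(z2)  # same n, therefore p(z1) = p(z2)
```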
Types

Type of a variable: its dependent parameter components
For an HMM, type = assignment to the Markov blanket
  e.g., type(·) = (D, stocks, V)

· · · D N V · · · (many stocks rose)
· · · D N V · · · (tech stocks were)
· · · D V V · · · (the stocks have)
· · · D V V · · · (how stocks from)

Possible assignments: [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]

p(assignment) only depends on the number of Vs and Ns
Sampling same-type variables

Goal: sample an assignment of a set of same-type variables

Algorithm:
1. Choose m ∈ {0, . . . , B} with prob. ∝ (B choose m) · p_m
2. Choose an assignment uniformly from column m

m = 0: p0 [VVVV]
m = 1: p1 [VVVN] [VVNV] [VNVV] [NVVV]
m = 2: p2 [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]
m = 3: p3 [VNNN] [NVNN] [NNVN] [NNNV]
m = 4: p4 [NNNN]
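The two-step procedure above can be sketched as follows. This is an illustrative sketch only: `p` stands in for the unnormalized column probabilities p_0, …, p_B, which in the real sampler come from the Dirichlet-multinomial marginal likelihood.

```python
import math
import random

def sample_assignment(B, p, rng=random):
    """Jointly sample B exchangeable binary variables (V vs. N).

    p[m] is the unnormalized probability of any single assignment with
    exactly m variables set to N; by exchangeability all C(B, m) such
    assignments are equiprobable, so column m has total mass C(B, m) * p[m].
    """
    # Step 1: choose m in {0, ..., B} with prob. proportional to C(B, m) * p[m].
    weights = [math.comb(B, m) * p[m] for m in range(B + 1)]
    m = rng.choices(range(B + 1), weights=weights)[0]
    # Step 2: choose an assignment uniformly from column m.
    flipped = set(rng.sample(range(B), m))
    return ["N" if i in flipped else "V" for i in range(B)]
```

The point of the two stages is that only B + 1 column probabilities ever need to be computed, even though the column sizes C(B, m) are combinatorially large.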
Full algorithm

Iterate:
1. Choose a position (e.g., 2)
2. Add variables with the same type (e.g., (D, stocks, V))
3. Sample m (e.g., 1 ⇒ {V, V, V, N})
4. Sample an assignment (e.g., [VNVV])

· · · D V V · · · (many stocks rose)
· · · D N V · · · (tech stocks were)
· · · D V V · · · (the stocks have)
· · · D V V · · · (how stocks from)
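Steps 1-2 (pick a site, then collect every site with the same Markov-blanket type) can be sketched on a single flattened tag sequence. This is a toy illustration: the real sampler works over a whole corpus and must also avoid including sites whose blankets overlap.

```python
def type_key(tags, words, i):
    """Markov blanket of position i in an HMM: (prev tag, word, next tag)."""
    return (tags[i - 1], words[i], tags[i + 1])

def same_type_sites(tags, words, i):
    """All interior positions sharing position i's type (including i itself)."""
    key = type_key(tags, words, i)
    return [j for j in range(1, len(tags) - 1)
            if type_key(tags, words, j) == key]

tags  = ["D",    "V",      "V",    "D",   "N",      "V"]
words = ["many", "stocks", "rose", "the", "stocks", "have"]
# Positions 1 and 4 both have type (D, stocks, V), so they are
# resampled jointly by the type-based update.
assert same_type_sites(tags, words, 1) == [1, 4]
```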
Experimental setup: sampling algorithms

[Figure: toy corpus with an induced tag id above each word: "tech stocks rose in
heavy trading", "worker demands meeting resistance", "many stocks shot up today",
"investors await stocks news"]

Token-based sampler (Token): 1. Choose token; 2. Update conditioned on rest
Sentence-based sampler (Sentence): 1. Choose sentence; 2. Update conditioned on rest
Type-based sampler (Type): 1. Choose type; 2. Update conditioned on rest
Experimental setup: models/tasks/datasets

Hidden Markov model (HMM)
  NN NNS VBG NN (worker demands meeting resistance)
  Task: part-of-speech induction
  Dataset: WSJ (49K sentences, 45 tags)

Unigram segmentation model (USM) [Goldwater et al., 2006]
  l o o k a t t h e b o o k
  Task: word segmentation
  Dataset: CHILDES (9.7K sentences)

Probabilistic tree-substitution grammar (PTSG) [Cohn et al., 2009]
  (S (NP (DT the) (NN sun)) (VP (VBD has) (VBN risen)))
  Dataset: WSJ (49K sentences, 45 tags)
Token versus Type

[Figure: Token vs. Type learning curves. HMM: log-likelihood and tag accuracy
vs. time (hr.). USM: log-likelihood and word-token F1 vs. time (min.).
PTSG: log-likelihood vs. time (hr.)]
Sensitivity to initialization

η = 0 (use few params.):
  V V V V (worker demands meeting resistance)
  l o o k a t t h e b o o k
η = 1 (use many params.):
  P N D V (worker demands meeting resistance)
  l o o k a t t h e b o o k
Sensitivity to initialization
HMM USM PTSG
[Plots: log-likelihood vs. η for HMM, USM, and PTSG; tag accuracy vs. η (HMM) and word-token F1 vs. η (USM); comparing the Token and Type samplers]
Type less sensitive than Token
15
Alternatives
Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
– Use p(z | x)^(1/T), temperature T from 5 to 1
17
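Annealed token-level Gibbs raises each conditional to the power 1/T and lowers T toward 1 over the run. A minimal sketch on a toy agreement-preferring distribution (the model and schedule here are illustrative, not the paper's experimental setup):

```python
import math
import random

random.seed(0)

def gibbs_sweep(z, logp, T):
    """One token-level Gibbs sweep at temperature T: each binary
    site i is resampled from its conditional raised to 1/T."""
    for i in range(len(z)):
        logs = []
        for v in (0, 1):
            z[i] = v
            logs.append(logp(z) / T)  # temper the (unnormalized) conditional
        mx = max(logs)
        ws = [math.exp(l - mx) for l in logs]
        r = random.random() * sum(ws)
        z[i] = 0 if r < ws[0] else 1
    return z

# Toy joint: adjacent sites prefer to agree, so an untempered
# token sampler mixes slowly between the all-0 and all-1 modes.
def logp(z):
    return 2.0 * sum(1 for a, b in zip(z, z[1:]) if a == b)

z = [random.randint(0, 1) for _ in range(20)]
for T in (5.0, 4.0, 3.0, 2.0, 1.0):  # anneal T from 5 down to 1
    for _ in range(10):
        z = gibbs_sweep(z, logp, T)
```

High temperatures flatten the conditionals, giving the random extra mobility the slide mentions; as the results show, this alone is not enough to match Type.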
Annealing
HMM USM PTSG
[Plots: log-likelihood vs. time and accuracy vs. time (tag accuracy for HMM, word-token F1 for USM), comparing Token, Tokenanneal, and Type]
[anneal. hurts] [anneal. helps] [no effect]
18
Alternatives
Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
– Use p(z | x)^(1/T), temperature T from 5 to 1
– More (random) mobility through space, but insufficient
• Sentence-based (Sentence)
20
Sentence
USM:
[Plots: log-likelihood vs. time and word-token F1 vs. time, comparing Token, Type, and Sentence]
Sentence performs comparably to Token, but worse than Type
Sentence requires dynamic programming, making it computationally more expensive than Token and Type
21
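The dynamic program behind a sentence-blocked sampler is forward-filtering backward-sampling, which draws a whole tag sequence jointly in O(nK^2). A self-contained sketch for a toy HMM (the transition and emission tables below are made up for illustration):

```python
import random

random.seed(0)

def ffbs(obs, trans, emit, init):
    """Forward-filtering backward-sampling: draw a complete tag
    sequence z ~ p(z | x) for one sentence in O(n * K^2)."""
    K, n = len(init), len(obs)
    # forward pass: alpha[t][k] ∝ p(x_1..t, z_t = k)
    alpha = [[0.0] * K for _ in range(n)]
    for k in range(K):
        alpha[0][k] = init[k] * emit[k][obs[0]]
    for t in range(1, n):
        for k in range(K):
            alpha[t][k] = emit[k][obs[t]] * sum(
                alpha[t - 1][j] * trans[j][k] for j in range(K))

    def draw(ws):  # sample an index proportional to the weights
        r = random.random() * sum(ws)
        for k, w in enumerate(ws):
            r -= w
            if r <= 0:
                return k
        return len(ws) - 1

    # backward pass: sample z_n, then each z_t given z_{t+1}
    z = [0] * n
    z[-1] = draw(alpha[-1])
    for t in range(n - 2, -1, -1):
        z[t] = draw([alpha[t][j] * trans[j][z[t + 1]] for j in range(K)])
    return z

# toy 2-tag, 2-symbol HMM
trans = [[0.8, 0.2], [0.3, 0.7]]
emit = [[0.9, 0.1], [0.2, 0.8]]
init = [0.5, 0.5]
tags = ffbs([0, 0, 1, 1, 1], trans, emit, init)
```

This handles the within-sentence structural dependencies; the slide's point is that it still updates only one sentence's variables, not the cross-sentence parameter dependencies that Type targets.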
Alternatives
Can we get the gains of Type via simpler means?
• Annealing (Tokenanneal)
– Use p(z | x)^(1/T), temperature T from 5 to 1
– More (random) mobility through space, but insufficient
• Sentence-based (Sentence)
– Sentence handles structural dependencies
– Type handles parameter dependencies
– Intuition: parameter dependencies are more important in unsupervised learning
23
Summary and outlook
General strategy: update many dependent variables tractably
Variables sharing parameters are highly dependent
Type-based sampler updates exactly these variables
Techniques for tractability:
Old: exploit dynamic programming structure
New: think about exchangeability
Tokens versus types:
• Older work operates on types (e.g., model merging):
larger updates, but greedy and brittle
• Recent methods operate on sentences or tokens:
smaller updates, but softer and more robust
Type-based sampling combines the advantages of both
24
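The exchangeability trick can be made concrete: in a collapsed model, the probability of an assignment to all n sites of one word type depends only on the count of each tag, so the sampler can draw that count directly instead of enumerating 2^n assignments. A toy Beta-Bernoulli sketch of this idea (priors and sizes are illustrative, not the paper's models):

```python
import math
import random

random.seed(0)

def sample_type_block(n, a=1.0, b=1.0):
    """Jointly resample the binary tags of all n sites of one word
    type under a collapsed Beta(a, b)-Bernoulli model.
    By exchangeability p(z) depends only on m = sum(z):
        p(m) ∝ C(n, m) * Γ(a + m) * Γ(b + n - m)
    so we sample m, then place the m ones uniformly at random."""
    logw = [math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
            + math.lgamma(a + m) + math.lgamma(b + n - m)
            for m in range(n + 1)]
    mx = max(logw)
    ws = [math.exp(l - mx) for l in logw]
    r = random.random() * sum(ws)
    m = n
    for k, w in enumerate(ws):
        r -= w
        if r <= 0:
            m = k
            break
    ones = set(random.sample(range(n), m))  # any placement of m ones is equally likely
    return [1 if i in ones else 0 for i in range(n)]

z = sample_type_block(12)
```

One O(n) draw over counts replaces an exponential enumeration, which is what makes updating all of a type's sites at once tractable.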
Thank you!
25