
Type-Based MCMC

NAACL 2010 – Los Angeles

Percy Liang Michael I. Jordan Dan Klein


Learning latent-variable models

· · · DT NNS VBD · · ·
· · · some stocks rose · · ·

· · · NN NNS VBD · · ·
· · · tech stocks fell · · ·

· · · DT NNS VBD · · ·
· · · the stocks soared · · ·

· · · WRB NNS VBD · · ·
· · · how stocks dipped · · ·

· · · DT NNS VBD · · ·
· · · many stocks created · · ·

Variables are highly dependent:

Simple: token-based (e.g., Gibbs sampler) ⇒ local optima, slow mixing

Classic: sentence-based (e.g., EM) ⇒ handles structural dependencies via dynamic programming

New: type-based ⇒ handles parameter dependencies via exchangeability

Structural dependencies

Dependencies between adjacent variables:

[plot: probability of each tag assignment]
NN VBZ NN NN — worker demands meeting resistance
NN NNS VBG NN — worker demands meeting resistance
NN VBZ VBG NN — worker demands meeting resistance

Token-based: update only one variable at a time. Problem: need to go downhill before going uphill.

Sentence-based: update all variables in a sentence.

Parameter dependencies

Dependencies between variables with shared parameters:

[plot: probability of each joint assignment across sentences]
VBZ VBD … stocks rose … / VBZ VBD … stocks fell … / VBZ VBD … stocks took … / VBZ VBD … stocks shot …
NNS VBD … stocks rose … / VBZ VBD … stocks fell … / VBZ VBD … stocks took … / VBZ VBD … stocks shot …
NNS VBD … stocks rose … / NNS VBD … stocks fell … / VBZ VBD … stocks took … / VBZ VBD … stocks shot …
NNS VBD … stocks rose … / NNS VBD … stocks fell … / NNS VBD … stocks took … / VBZ VBD … stocks shot …
NNS VBD … stocks rose … / NNS VBD … stocks fell … / NNS VBD … stocks took … / NNS VBD … stocks shot …

Token-based: need to go downhill a lot before going uphill.
1. Parameter dependencies create deeper valleys.
2. Sentence-based updates cannot handle these dependencies.

Type-based: update all variables of one type.

What exactly is a type?

How can we update all variables of a type efficiently?

Formal setup

Parameters θ: vector of conditional probabilities

θ_{T:V,N}: from state V, the probability of transitioning to state N

p(θ) is a product of Dirichlets [prior]

Choices z: specifies the values of the latent and observed variables. Each choice z ∈ z uses one parameter, indexed by j(z) [parameter used]; e.g., with · · · V N · · · at positions 8–9 over "stocks", the choice z = [T:V,N at 8] has j(z) = T:V,N.

p(z | θ) = ∏_{z ∈ z} θ_{j(z)} [likelihood]

p(z) = ∫ p(z | θ) p(θ) dθ [marginal likelihood]

Observations x: the observed part of z (e.g., the words)

Goal: sample from p(z | x) [not p(z | x, θ)]
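As a concrete reading of the likelihood above, here is a minimal sketch; the toy parameter dictionary and the encoding of choices as parameter indices are hypothetical illustrations, not from the talk:

```python
# Hypothetical toy transition parameters theta_{T:a,b}, indexed by j = ("T", a, b).
theta = {
    ("T", "V", "N"): 0.7, ("T", "V", "V"): 0.3,
    ("T", "N", "V"): 0.6, ("T", "N", "N"): 0.4,
}

def likelihood(choices, theta):
    """p(z | theta) = product over choices z in z of theta_{j(z)},
    where j(z) is the index of the parameter that choice z used."""
    p = 1.0
    for j in choices:
        p *= theta[j]
    return p

# A fragment ... V N ...: the choice [T:V,N] was made once.
z = [("T", "V", "N")]
print(likelihood(z, theta))  # 0.7
```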

Exchangeability

Sufficient statistics n: # times each parameter was used in z

n_{T:V,N}: # times state V transitioned to state N

Rewrite the likelihood:

p(z | θ) = ∏_j θ_j^{n_j} = simple function of n and θ

p(z) = simple function of n (key: exchangeability)

z1:
· · · D V V · · · many stocks rose
· · · D N V · · · tech stocks were

z2:
· · · D N V · · · many stocks rose
· · · D V V · · · tech stocks were

z1 and z2 have the same sufficient statistics ⇒ p(z1) = p(z2)
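That p(z) depends on z only through n can be checked mechanically. A minimal sketch, assuming a symmetric Dirichlet prior per conditional distribution (the helper names are hypothetical):

```python
from math import lgamma
from collections import Counter

def log_marginal(counts, alpha, K):
    """Dirichlet-multinomial marginal log-likelihood for one conditional
    distribution with K outcomes and a symmetric Dirichlet(alpha) prior;
    note it depends on z only through the sufficient statistics (counts)."""
    n = sum(counts.values())
    out = lgamma(K * alpha) - lgamma(K * alpha + n)
    for c in counts.values():
        out += lgamma(alpha + c) - lgamma(alpha)
    return out

def transition_counts(tag_seqs):
    """n_{T:a,b}: # times state a transitioned to state b."""
    n = Counter()
    for tags in tag_seqs:
        for a, b in zip(tags, tags[1:]):
            n[(a, b)] += 1
    return n

# z1 and z2 from the slide: the N and V tags swap sentences,
# but the transition counts (hence p(z)) are identical.
z1 = [["D", "V", "V"], ["D", "N", "V"]]
z2 = [["D", "N", "V"], ["D", "V", "V"]]
assert transition_counts(z1) == transition_counts(z2)
```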

Types

Type of a variable: the parameter components it depends on

For an HMM, type = assignment to the Markov blanket

type( · ) = (D, stocks, V)

Four occurrences of this type (the middle tag is the variable to resample):
· · · D _ V · · · many stocks rose
· · · D _ V · · · tech stocks were
· · · D _ V · · · the stocks have
· · · D _ V · · · how stocks from

Assignments with two Vs and two Ns: [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]

p(assignment) depends only on the number of Vs and Ns.

Sampling same-type variables

Goal: sample an assignment of a set of B same-type variables

Algorithm:

1. Choose m ∈ {0, …, B} with probability ∝ (B choose m) · p_m

2. Choose an assignment uniformly from column m

m = 0: p_0 [VVVV]
m = 1: p_1 [VVVN] [VVNV] [VNVV] [NVVV]
m = 2: p_2 [VVNN] [VNVN] [VNNV] [NVVN] [NVNV] [NNVV]
m = 3: p_3 [VNNN] [NVNN] [NNVN] [NNNV]
m = 4: p_4 [NNNN]
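The two-stage sampler above can be sketched as follows; this is a sketch under the assumption that `log_p[m]` (the common log-probability of any single column-m assignment, which exchangeability makes depend only on m) is supplied by the caller:

```python
import math
import random

def sample_same_type(B, log_p):
    """Two-stage sampler for B same-type binary variables (V vs. N).

    1. Sample m in {0, ..., B} with probability proportional to C(B, m) * p_m.
    2. Choose one of the C(B, m) assignments in column m uniformly at random."""
    logits = [math.log(math.comb(B, m)) + log_p[m] for m in range(B + 1)]
    mx = max(logits)
    weights = [math.exp(l - mx) for l in logits]  # shift by max for stability
    m = random.choices(range(B + 1), weights=weights)[0]
    n_sites = set(random.sample(range(B), m))     # which m positions get N
    return ["N" if i in n_sites else "V" for i in range(B)]
```

For example, with `log_p` overwhelmingly peaked at m = 4, the sampler returns the all-N assignment.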

Full algorithm

Iterate:
1. Choose a position (e.g., position 2)
2. Add the variables with the same type (e.g., (D, stocks, V))
3. Sample m (e.g., m = 1 ⇒ {V, V, V, N})
4. Sample an assignment (e.g., [VNVV])

· · · D V V · · · many stocks rose
· · · D N V · · · tech stocks were
· · · D V V · · · the stocks have
· · · D V V · · · how stocks from
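The iteration above can be sketched as one sweep over positions; the data structures and the names `blanket` and `sample_block` are hypothetical stand-ins (for an HMM, `blanket(s, i)` would return the triple (previous tag, word, next tag), and `sample_block` would perform steps 3–4):

```python
import random

def type_based_sweep(n_sentences, lengths, blanket, sample_block):
    """One pass of the full algorithm.
    blanket(s, i): the type of position i in sentence s.
    sample_block(sites): jointly resamples the variables at the given sites."""
    positions = [(s, i) for s in range(n_sentences) for i in range(lengths[s])]
    random.shuffle(positions)
    for s, i in positions:
        t = blanket(s, i)                                   # 1. choose a position, read its type
        sites = [p for p in positions if blanket(*p) == t]  # 2. add variables of the same type
        sample_block(sites)                                 # 3-4. sample m, then an assignment
```

Note the blanket is recomputed at every step, since earlier block updates in the sweep may have changed it.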

Experimental setup: sampling algorithms

[figure: a corpus of tagged sentences ("tech stocks rose in heavy trading", "worker demands meeting resistance", "many stocks shot up today", "investors await stocks news"), each word carrying its current tag]

Token-based sampler (Token): 1. Choose a token. 2. Update it conditioned on the rest.

Sentence-based sampler (Sentence): 1. Choose a sentence. 2. Update it conditioned on the rest.

Type-based sampler (Type): 1. Choose a type. 2. Update it conditioned on the rest.

Experimental setup: models/tasks/datasets

Hidden Markov model (HMM)
NN NNS VBG NN — worker demands meeting resistance
Task: part-of-speech induction
Dataset: WSJ (49K sentences, 45 tags)

Unigram segmentation model (USM) [Goldwater et al., 2006]
l o o k a t t h e b o o k
Task: word segmentation
Dataset: CHILDES (9.7K sentences)

Probabilistic tree-substitution grammar (PTSG) [Cohn et al., 2009]
[example parse: (S (NP (DT the) (NN sun)) (VP (VBD has) (VBN risen)))]
Dataset: WSJ (49K sentences, 45 tags)

Token versus Type

[plots: log-likelihood vs. time for the Token and Type samplers on the HMM (hours), USM (minutes), and PTSG (hours); also tag accuracy vs. time for the HMM and word-token F1 vs. time for the USM]

Sensitivity to initialization

Initialization controlled by η:

η = 0 (use few params.):

HMM: worker/V demands/V meeting/V resistance/V (all tokens share one tag)

USM: l o o k a t t h e b o o k

η = 1 (use many params.):

HMM: worker/P demands/N meeting/D resistance/V (distinct tags)

USM: l o o k a t t h e b o o k
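The η knob on this slide interpolates between a few-parameter initialization (every token gets the same tag) and a many-parameter one (diverse tags). A minimal sketch of one way to realize that interpolation; the function name and the per-token coin flip are illustrative assumptions, not the paper's exact scheme:

```python
import random

def init_tags(words, tagset, eta, seed=0):
    """eta = 0: every token gets the same default tag (few params).
    eta = 1: every token gets an independent random tag (many params).
    In between, each token flips an eta-weighted coin."""
    rng = random.Random(seed)
    default = tagset[0]
    tags = []
    for _ in words:
        if rng.random() < eta:
            tags.append(rng.choice(tagset))  # many params: diverse tags
        else:
            tags.append(default)             # few params: one shared tag
    return tags
```

At η = 0 this reproduces the "V V V V" initialization shown above.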

Sensitivity to initialization

HMM | USM | PTSG

[Figure: final log-likelihood vs. η (0.2-1.0) for Token and Type.
HMM: -7.2e6 to -6.6e6. USM: -3.4e5 to -1.9e5. PTSG: -5.7e6 to -5.4e6.]

[Figure: final accuracy vs. η (0.2-1.0) for Token and Type.
HMM: tag accuracy (10-60). USM: word token F1 (10-60).]

Type less sensitive than Token


Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)

– Use p(z | x)^(1/T), with temperature T annealed from 5 to 1
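The Tokenanneal baseline draws from the tempered distribution p(z | x)^(1/T): dividing log-probabilities by T flattens the distribution when T > 1, giving the sampler more mobility. A minimal, self-contained sketch of that mechanic (function names are illustrative, not from the paper):

```python
import math
import random

def sample_annealed(log_probs, T):
    """Sample an index with probability proportional to p_i^(1/T):
    scale log-probabilities by 1/T, exponentiate stably, renormalize.
    T = 1 recovers the original distribution."""
    scaled = [lp / T for lp in log_probs]
    m = max(scaled)
    weights = [math.exp(s - m) for s in scaled]  # subtract max for stability
    r = random.random() * sum(weights)
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            return i
    return len(weights) - 1

def temperature_schedule(step, n_steps, T_start=5.0, T_end=1.0):
    """Linear schedule from T_start down to T_end, as on the slide."""
    frac = step / max(1, n_steps - 1)
    return T_start + frac * (T_end - T_start)
```

As T approaches 0 the sampler concentrates on the highest-probability choice; as T grows it approaches uniform.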

Annealing

HMM | USM | PTSG

[Figure: log-likelihood vs. time and accuracy vs. time for Token, Tokenanneal, and Type on all three models. HMM: annealing hurts. USM: annealing helps. PTSG: no effect.]

Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)

– Use p(z | x)^(1/T), with temperature T annealed from 5 to 1

– More (random) mobility through the space, but insufficient

• Sentence-based (Sentence)

Sentence

USM:

[Figure: log-likelihood vs. time (min., -2.1e5 to -1.7e5) and word token F1 vs. time (35-60) for Token, Type, and Sentence.]

Sentence performs comparably to Token, but worse than Type

Sentence requires dynamic programming, making it computationally more expensive than Token and Type

Alternatives

Can we get the gains of Type via simpler means?

• Annealing (Tokenanneal)

– Use p(z | x)^(1/T), with temperature T annealed from 5 to 1

– More (random) mobility through the space, but insufficient

• Sentence-based (Sentence)

– Sentence handles structural dependencies

– Type handles parameter dependencies

– Intuition: parameter dependencies are more important in unsupervised learning

Summary and outlook

General strategy: update many dependent variables tractably

Variables sharing parameters are highly dependent; the type-based sampler updates exactly these variables

Techniques for tractability:

Old: exploit dynamic programming structure
New: think about exchangeability

Tokens versus types:

• Older work operates on types (e.g., model merging): larger updates, but greedy and brittle

• Recent methods operate on sentences or tokens: smaller updates, but softer and more robust

Type-based sampling combines the advantages of both
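The exchangeability trick can be sketched concretely. For a binary choice at m exchangeable sites (e.g., all tokens of one word type in identical contexts), the posterior depends only on how many sites take each value, so one can sample the count k and then assign it to sites uniformly, updating all m variables in one block. This is an illustrative sketch, not the paper's implementation; `log_weight` is a caller-supplied stand-in for the collapsed predictive probabilities:

```python
import math
import random

def type_block_sample(m, log_weight, rng=random):
    """Block-sample m exchangeable binary sites. Unnormalized mass of
    'k sites take value 1' is C(m, k) * exp(log_weight(k)): the
    binomial coefficient counts the interchangeable configurations."""
    log_masses = [math.log(math.comb(m, k)) + log_weight(k)
                  for k in range(m + 1)]
    mx = max(log_masses)
    weights = [math.exp(lm - mx) for lm in log_masses]  # stable normalization
    r = rng.random() * sum(weights)
    k = m
    for i, w in enumerate(weights):
        r -= w
        if r <= 0:
            k = i
            break
    ones = set(rng.sample(range(m), k))  # exchangeable: any k sites will do
    return [1 if i in ones else 0 for i in range(m)]
```

The cost is O(m) rather than O(2^m), which is what makes updating many dependent variables tractable.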

Thank you!
