
Part of Speech Tagging

Natural Language Processing: Jordan Boyd-Graber, University of Maryland
VITERBI

Adapted from material by Jimmy Lin and Jason Eisner

Finding Tag Sequences

Viterbi Algorithm

- Given an observed sequence of length L, {x_1, ..., x_L}, we want to find the hidden tag sequence {z_1, ..., z_L} with the highest probability.
- Enumerating all K^L possibilities is intractable.
- So we use dynamic programming: compute the most likely tags for each token subsequence from 0 to t that ends in state k.
- Memoization: fill a table of solutions to sub-problems.
- Solve larger problems by composing sub-solutions.
- Base case:

    δ_1(k) = π_k β_{k, x_1}    (1)

- Recursion (sketched in code below):

    δ_n(k) = [max_j δ_{n-1}(j) θ_{j,k}] β_{k, x_n}    (2)
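Here is a minimal sketch of this recursion in log space, where the products above become sums. The container-based parameters log_pi, log_theta, and log_beta are illustrative names, not part of the lecture:

```python
def viterbi_scores(obs, states, log_pi, log_theta, log_beta):
    """Fill the delta table of equations (1)-(2), in log space.

    log_pi[k]:       start log-probability of state k
    log_theta[j][k]: transition log-probability j -> k
    log_beta[k][x]:  emission log-probability of word x from state k
    """
    # Base case (1): delta_1(k) = log pi_k + log beta_{k, x_1}
    delta = [{k: log_pi[k] + log_beta[k][obs[0]] for k in states}]
    for x in obs[1:]:
        prev = delta[-1]
        # Recursion (2): best predecessor j, then emit x from state k
        delta.append({k: max(prev[j] + log_theta[j][k] for j in states)
                         + log_beta[k][x]
                      for k in states})
    return delta
```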


Finding Tag Sequences

- The complexity is now O(K^2 L).
- In class: an example showing why you need all O(KL) table cells (garden pathing).
- But just computing the max isn't enough: we also have to remember where we came from (breadcrumbs from the best previous state):

    Ψ_n(k) = argmax_j δ_{n-1}(j) θ_{j,k}    (3)

- Let's do that for the sentence "come and get it".
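The earlier sketch extended with the breadcrumbs of equation (3): psi records where each max came from, so the best tag sequence can be read back from the final column. Again an illustrative sketch, not the deck's reference implementation:

```python
def viterbi_decode(obs, states, log_pi, log_theta, log_beta):
    """Log-space Viterbi with back pointers (psi) and a final backtrace."""
    delta = [{k: log_pi[k] + log_beta[k][obs[0]] for k in states}]
    psi = []
    for x in obs[1:]:
        prev, d, p = delta[-1], {}, {}
        for k in states:
            # Equation (3): remember the best previous state for k
            best_j = max(states, key=lambda j: prev[j] + log_theta[j][k])
            p[k] = best_j
            d[k] = prev[best_j] + log_theta[best_j][k] + log_beta[k][x]
        delta.append(d)
        psi.append(p)
    # Follow the breadcrumbs back from the best final state
    z = max(states, key=lambda k: delta[-1][k])
    path = [z]
    for p in reversed(psi):
        z = p[z]
        path.append(z)
    return list(reversed(path))
```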


Viterbi Algorithm

Initialization for "come and get it" (x_1 = "come"):

POS    π_k     β_{k,x_1}   log δ_1(k)
MOD    0.234   0.024       -5.18
DET    0.234   0.032       -4.89
CONJ   0.234   0.024       -5.18
N      0.021   0.016       -7.99
PREP   0.021   0.024       -7.59
PRO    0.021   0.016       -7.99
V      0.234   0.121       -3.56

Why logarithms?

1. More interpretable than a float with lots of zeros.
2. Underflow is less of an issue.
3. Addition is cheaper than multiplication:

    log(ab) = log(a) + log(b)    (4)
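Point 2 is easy to demonstrate: the product of many small probabilities underflows, while the sum of their logs does not. The numbers here are made up for illustration:

```python
import math

probs = [1e-5] * 80                        # 80 tiny per-token probabilities
product = 1.0
for p in probs:
    product *= p                           # underflows to exactly 0.0

log_prob = sum(math.log(p) for p in probs)
print(product)                             # 0.0
print(log_prob)                            # about -921.0, still usable
```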

Viterbi Algorithm

Extending to the second word, "and":

POS    log δ_1(j)   log[δ_1(j) θ_{j,CONJ}]   log δ_2(CONJ)
MOD    -5.18        -8.48
DET    -4.89        -7.72
CONJ   -5.18        -8.47                    -6.02
N      -7.99        ≤ -7.99
PREP   -7.59        ≤ -7.59
PRO    -7.99        ≤ -7.99
V      -3.56        -5.21

The V row wins the max:

    log[δ_1(V) θ_{V,CONJ}] = log δ_1(V) + log θ_{V,CONJ} = -3.56 + -1.65 = -5.21

    log δ_2(CONJ) = -5.21 + log β_{CONJ,and} = -6.02


Viterbi Algorithm

The filled trellis with back pointers b_n (cells shown as "-0.00 X" are left unexpanded on the slide):

POS    δ_1(k)   δ_2(k)  b_2   δ_3(k)  b_3    δ_4(k)  b_4
MOD    -5.18    -0.00   X     -0.00   X      -0.00   X
DET    -4.89    -0.00   X     -0.00   X      -0.00   X
CONJ   -5.18    -6.02   V     -0.00   X      -0.00   X
N      -7.99    -0.00   X     -0.00   X      -0.00   X
PREP   -7.59    -0.00   X     -0.00   X      -0.00   X
PRO    -7.99    -0.00   X     -0.00   X      -14.6   V
V      -3.56    -0.00   X     -9.03   CONJ   -0.00   X
WORD   come     and           get            it
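Reading the breadcrumbs out of this trellis is just pointer chasing. A sketch, with the b_2..b_4 columns transcribed by hand from the table above:

```python
# Back-pointer columns transcribed from the trellis above
psi = {2: {"CONJ": "V"}, 3: {"V": "CONJ"}, 4: {"PRO": "V"}}

state = "PRO"                    # best surviving state at position 4
path = [state]
for n in (4, 3, 2):              # walk b_4, b_3, b_2 back to position 1
    state = psi[n][state]
    path.append(state)

print(list(zip("come and get it".split(), reversed(path))))
# [('come', 'V'), ('and', 'CONJ'), ('get', 'V'), ('it', 'PRO')]
```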


EM Algorithm

What if you don't have training data?

- You can still learn an HMM
- Using a general technique called expectation maximization:
  - Take a guess at the parameters
  - Figure out the latent variables
  - Find the parameters that best explain the latent variables
  - Repeat


EM Algorithm

EM for HMM

- Model parameters: we need to start with model parameters π, β, θ. We can initialize these any way we want.
- E step: we compute the E-step based on our data, "come and get it". Each word in our dataset could take any part of speech, but we don't know which state was used for each word. Determine the probability of being in each latent state using Forward/Backward.
- M step: calculate new parameters,

    θ_i = (E_p[n_i] + α_i) / Σ_k (E_p[n_k] + α_k)    (5)

  where the expected counts n_k come from the lattice.
- Replace the old parameters (and start over). A code sketch of this loop follows below.
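Here is a compact sketch of one full-EM iteration for a discrete HMM, assuming numpy arrays and a toy corpus of word-id sequences. It skips the log-space scaling a real implementation needs for long sentences, so treat it as a didactic sketch of the E and M steps, not reference code:

```python
import numpy as np

def forward_backward(x, pi, theta, beta):
    """Forward (alpha) and backward messages for one word-id sequence x.
    pi: (K,) start probs, theta: (K, K) transitions, beta: (K, V) emissions."""
    L, K = len(x), len(pi)
    alpha = np.zeros((L, K)); bwd = np.ones((L, K))
    alpha[0] = pi * beta[:, x[0]]
    for n in range(1, L):
        alpha[n] = (alpha[n - 1] @ theta) * beta[:, x[n]]
    for n in range(L - 2, -1, -1):
        bwd[n] = theta @ (beta[:, x[n + 1]] * bwd[n + 1])
    return alpha, bwd

def em_step(corpus, pi, theta, beta, a=0.1):
    """One EM iteration: expected counts from the lattice (E step),
    then smoothed re-estimates in the spirit of equation (5) (M step)."""
    K, V = beta.shape
    c_pi, c_th, c_be = np.zeros(K), np.zeros((K, K)), np.zeros((K, V))
    for x in corpus:
        alpha, bwd = forward_backward(x, pi, theta, beta)
        Z = alpha[-1].sum()                     # sequence likelihood
        gamma = alpha * bwd / Z                 # P(z_n = k | x)
        c_pi += gamma[0]
        for n, w in enumerate(x):
            c_be[:, w] += gamma[n]
        for n in range(len(x) - 1):             # pairwise posteriors
            xi = alpha[n][:, None] * theta * beta[:, x[n + 1]] * bwd[n + 1]
            c_th += xi / Z
    pi = (c_pi + a) / (c_pi + a).sum()
    theta = (c_th + a) / (c_th + a).sum(axis=1, keepdims=True)
    beta = (c_be + a) / (c_be + a).sum(axis=1, keepdims=True)
    return pi, theta, beta
```

Iterating em_step until the likelihood stops improving is exactly the "replace old parameters and start over" loop above.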

EM Algorithm

Hard vs. Full EM

Hard EM: train only on the most likely sequence (Viterbi).
- Faster: the E-step is faster
- Faster: fewer iterations

Full EM: compute the probability of all possible sequences.
- More accurate: doesn't get stuck in local optima as easily

EM Algorithm

Warning about the next homework(s)

- Kaggle competition
- Thus, late days are not very useful
- The following homework is not computational

EM Algorithm

Garden Pathing

What is the probability of the tagged sequence "a/Det blue/Adj boat/N"?

    π_Det β_{Det,a} θ_{Det,Adj} β_{Adj,blue} θ_{Adj,N} β_{N,boat}
        = 0.3 · 0.6 · 0.4 · 0.3 · 0.5 · 0.1 = 0.00108    (6)
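A quick check of the arithmetic, both directly and in log space:

```python
import math

factors = (0.3, 0.6, 0.4, 0.3, 0.5, 0.1)
print(math.prod(factors))                           # 0.00108 (up to float noise)
print(math.exp(sum(math.log(f) for f in factors)))  # same value via logs
```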


EM Algorithm

Example: "the old man (the boats)"

Base case:

1. δ_1(a) = -4.6
2. δ_1(v) = -5.7
3. δ_1(d) = -1.7
4. δ_1(n) = -4.6


EM Algorithm

Example . . .

Second position (the bracketed tag marks the predecessor state j; verified in code below):

1. δ_2(a) = max(-5.8 [a], -7.3 [v], -2.6 [d], -7.6 [n]) + -1.2 = -2.6 + -1.2 = -3.8
2. δ_2(v) = max(-6.9 [a], -7.3 [v], -4.7 [d], -4.8 [n]) + -2.3 = -4.7 + -2.3 = -7.0
3. δ_2(d) = max(-6.9 [a], -6.9 [v], -4.0 [d], -7.6 [n]) + -3.7 = -4.0 + -3.7 = -7.7
4. δ_2(n) = max(-5.3 [a], -6.9 [v], -2.5 [d], -6.9 [n]) + -1.9 = -2.5 + -1.9 = -4.4
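Each line is the recursion from equation (2): take the max over predecessor scores, then add the emission log-probability. A quick verification of the four updates above:

```python
# (candidate scores delta_1(j) + log theta_{j,k}, emission log prob) per tag
cands = {"a": ([-5.8, -7.3, -2.6, -7.6], -1.2),
         "v": ([-6.9, -7.3, -4.7, -4.8], -2.3),
         "d": ([-6.9, -6.9, -4.0, -7.6], -3.7),
         "n": ([-5.3, -6.9, -2.5, -6.9], -1.9)}
for k, (scores, emit) in cands.items():
    print(k, round(max(scores) + emit, 1))   # a -3.8, v -7.0, d -7.7, n -4.4
```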


EM Algorithm

Example . . .

Third position:

1. δ_3(a) = max(-5.0 [a], -8.6 [v], -8.6 [d], -7.4 [n]) + -2.3 = -5.0 + -2.3 = -7.3
2. δ_3(v) = max(-6.1 [a], -8.6 [v], -10.7 [d], -4.6 [n]) + -0.9 = -4.6 + -0.9 = -5.5
3. δ_3(d) = max(-6.1 [a], -8.2 [v], -10.0 [d], -7.4 [n]) + -3.7 = -6.1 + -3.7 = -9.8
4. δ_3(n) = max(-4.5 [a], -8.2 [v], -8.5 [d], -6.7 [n]) + -0.9 = -4.5 + -0.9 = -5.4


EM Algorithm

Example . . .

Fourth position:

1. δ_4(a) = max(-8.5 [a], -7.2 [v], -10.7 [d], -8.4 [n]) + -3.4 = -7.2 + -3.4 = -10.6
2. δ_4(v) = max(-9.6 [a], -7.2 [v], -12.8 [d], -5.7 [n]) + -3.4 = -5.7 + -3.4 = -9.1
3. δ_4(d) = max(-9.6 [a], -6.8 [v], -12.1 [d], -8.4 [n]) + -0.5 = -6.8 + -0.5 = -7.3
4. δ_4(n) = max(-8.0 [a], -6.8 [v], -10.6 [d], -7.7 [n]) + -3.4 = -6.8 + -3.4 = -10.2


EM Algorithm

Example . . .

Fifth position (values rounded to the nearest integer on the slide, except δ_5(n)):

1. δ_5(a) = max(-11.8 [a], -10.7 [v], -8.2 [d], -13.2 [n]) + -2.3 = -8.2 + -2.3 ≈ -11
2. δ_5(v) = max(-12.9 [a], -10.7 [v], -10.3 [d], -10.4 [n]) + -1.6 = -10.3 + -1.6 ≈ -12
3. δ_5(d) = max(-12.9 [a], -10.3 [v], -9.6 [d], -13.2 [n]) + -3.7 = -9.6 + -3.7 ≈ -13
4. δ_5(n) = max(-11.3 [a], -10.3 [v], -8.1 [d], -12.5 [n]) + -1.2 = -8.1 + -1.2 = -9.3


EM Algorithm

Example . . .

Reconstruction

For "the old man", the reconstruction starts with the best part of speech at position 3, which is noun (-5.4); it has an adjective back pointer, which has a back pointer to determiner. The overall sequence is "The/det old/adj man/n".

For "the old man the boats", the reconstruction starts with the best part of speech at position 5, which is a noun (-9.3), which leads to the sequence "The/det old/n man/v the/det boats/n".
