parameter estimation: the perceptron...

35
Parameter Estimation: The Perceptron Algorithm

Upload: lamthu

Post on 24-Jul-2019

233 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Parameter Estimation:

The Perceptron Algorithm

Page 2: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

What we have done so far...• packed forest

• as a general representation for many NLP problems

• formalized as a weighted hypergraph

• DP algorithms for 1-best and k-best on hypergraphs

2

Baoweier yu Shalong juxing le huitan Powell held a meeting w/ Sharon

Page 3: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

What we have done so far...• packed forest

• as a general representation for many NLP problems

• formalized as a weighted hypergraph

• DP algorithms for 1-best and k-best on hypergraphs

2

Baoweier yu Shalong juxing le huitan Powell held a meeting w/ Sharon

Big Q: where do the weights come from?

Page 4: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

A Quiz

3

Page 5: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Perceptron is ...

• an extremely simple algorithm

• almost universally applicable

• and works very well in practice

4

Page 6: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Perceptron is ...

• an extremely simple algorithm

• almost universally applicable

• and works very well in practice

4

vanilla perceptron(Rosenblatt, 1958)

structured perceptron(Collins, 2002)

the man bit the dog

DT NN VBD DT NN

Page 7: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Generic Perceptron• online-learning: one example at a time

• learning by doing

• find the best output under the current weights

• update weights at mistakes

5

inferencexi

update weightszi

yi

w

Page 8: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Example: POS Tagging• gold-standard: DT NN VBD DT NN

• the man bit the dog

• current output: DT NN NN DT NN

• the man bit the dog

• assume only two feature classes

• tag bigrams ti-1 ti

• word/tag pairs wi

• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)

• weights --: (NN, NN) (NN, DT) (NN→bit)

6

Page 9: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Example: POS Tagging• gold-standard: DT NN VBD DT NN

• the man bit the dog

• current output: DT NN NN DT NN

• the man bit the dog

• assume only two feature classes

• tag bigrams ti-1 ti

• word/tag pairs wi

• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)

• weights --: (NN, NN) (NN, DT) (NN→bit)

6

x

x

y

z

!(x, z)

!(x, y)

Page 10: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Example: POS Tagging• gold-standard: DT NN VBD DT NN

• the man bit the dog

• current output: DT NN NN DT NN

• the man bit the dog

• assume only two feature classes

• tag bigrams ti-1 ti

• word/tag pairs wi

• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)

• weights --: (NN, NN) (NN, DT) (NN→bit)

6

x

x

y

z

!(x, z)

!(x, y)

Page 11: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Example: POS Tagging• gold-standard: DT NN VBD DT NN

• the man bit the dog

• current output: DT NN NN DT NN

• the man bit the dog

• assume only two feature classes

• tag bigrams ti-1 ti

• word/tag pairs wi

• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)

• weights --: (NN, NN) (NN, DT) (NN→bit)

6

x

x

y

z

!(x, z)

!(x, y)

Page 12: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Example: POS Tagging• gold-standard: DT NN VBD DT NN

• the man bit the dog

• current output: DT NN NN DT NN

• the man bit the dog

• assume only two feature classes

• tag bigrams ti-1 ti

• word/tag pairs wi

• weights ++: (NN, VBD) (VBD, DT) (VBD→bit)

• weights --: (NN, NN) (NN, DT) (NN→bit)

6

x

x

y

z

!(x, z)

!(x, y)

Page 13: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

!

Structured Perceptron

7

inferencexi

update weightszi

yi

w

Page 14: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

8

inferencexi

update weightszi

yi

w

x

y

argmaxy!GEN(x)

Page 15: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

8

inferencexi

update weightszi

yi

w

x

y

argmaxy!GEN(x)

Page 16: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

8

inferencexi

update weightszi

yi

w

x

y

argmaxy!GEN(x)

Page 17: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

8

inferencexi

update weightszi

yi

w

x

y

argmaxy!GEN(x)

Page 18: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Efficiency vs. Expressiveness

• the inference (argmax) must be efficient

• either the search space GEN(x) is small, or factored

• features must be local to y (but can be global to x)

• e.g. bigram tagger, but look at all input words (cf. CRFs)

8

inferencexi

update weightszi

yi

w

x

y

argmaxy!GEN(x)

Page 19: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

What about tree-to-string?

9

Page 20: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Averaged Perceptron

• more stable and accurate results

• approximation of voted perceptron (Freund & Schapire, 1999)

10

j

!

0

jj + 1

=

!

j

Wj

Page 21: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Comparison with Other Models

Page 22: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

from HMM to MEMM

12

HMM: joint distribution MEMM: locally normalized(per-state conditional)

Page 23: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Label Bias Problem

• bias towards states with fewer outgoing transitions

• a problem with all locally normalized models

13

Page 24: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Conditional Random Fields

• globally normalized (no label bias problem)

• but training requires expected features counts

• (related to the fractional counts in EM)

• need to use Inside-Outside algorithm (sum)

• Perceptron just needs Viterbi (max)

14

Page 25: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Experiments

Page 26: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Experiments: Tagging• (almost) identical features from (Ratnaparkhi, 1996)

• trigram tagger: current tag ti, previous tags ti-1, ti-2

• current word wi and its spelling features

• surrounding words wi-1 wi+1 wi-2 wi+2..

16

Page 27: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Experiments: NP Chunking

• B-I-O scheme

• features:

• unigram model

• surrounding words and POS tags

17

Rockwell International Corp. B I I's Tulsa unit said it signed B I I O B O a tentative agreement ...B I I

Page 28: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Experiments: NP Chunking• results

• (Sha and Pereira, 2003) trigram tagger

• voted perceptron: 94.09% vs. CRF: 94.38%18

Page 29: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Other NLP Applications

• dependency parsing (McDonald et al., 2005)

• parse reranking (Collins)

• phrase-based translation (Liang et al., 2006)

• word segmentation

• ... and many many more ...

19

Page 30: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Theory

Page 31: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Vanilla Perceptron

21

Page 32: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Vanilla Perceptron

22

Page 33: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Vanilla Perceptron

23

Page 34: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Convergence Theorem

• Data is separable if and only if perceptron converges

• number of updates is bounded by

• γ is the margin; R = maxi || xi ||

• This result generalizes to structured perceptron

• Also in the paper: theorems for non-separable cases and generalization bounds

24

(R/!)2

R = maxi

!!(xi, yi) " !(xi, zi)!

Page 35: Parameter Estimation: The Perceptron Algorithmpageperso.lif.univ-mrs.fr/~alexis.nasr/Ens/M2/collins-perceptron.pdf · Collins Perceptron What we have done so far... • packed forest

Collins Perceptron

Conclusion

• a very simple framework that can work with many structured problems and that works very well

• all you need is (fast) 1-best inference

• much simpler than CRFs and SVMs

• can be applied to parsing, translation, etc.

• generalization bounds depend on separability

• not the (exponential) size of the search space

• extensions: MIRA, k-best MIRA, ...

• major limitation: only local features

25