Sequence Modelling with Features: Linear-Chain Conditional Random Fields COMP-599 Oct 3, 2016


Page 1

Sequence Modelling with Features: Linear-Chain Conditional Random Fields

COMP-599

Oct 3, 2016

Page 2

Outline

Sequence modelling: review and miscellaneous notes

Hidden Markov models: shortcomings

Generative vs. discriminative models

Linear-chain CRFs

• Inference and learning algorithms with linear-chain CRFs


Page 3

Hidden Markov Model

The graph specifies how the joint probability decomposes:

$$P(\boldsymbol{O}, \boldsymbol{Q}) = P(Q_1) \prod_{t=1}^{T-1} P(Q_{t+1} \mid Q_t) \prod_{t=1}^{T} P(O_t \mid Q_t)$$


๐‘„1

๐‘‚1

๐‘„2

๐‘‚2

๐‘„3

๐‘‚3

๐‘„4

๐‘‚4

๐‘„5

๐‘‚5

Initial state probability

State transition probabilities

Emission probabilities

Page 4

EM and Baum-Welch

EM finds a local optimum in $P(\boldsymbol{O} \mid \theta)$.

You can show that after each step of EM:

$$P(\boldsymbol{O} \mid \theta_{k+1}) \ge P(\boldsymbol{O} \mid \theta_k)$$

However, this is not necessarily a global optimum.


Page 5

Dealing with Local Optima

Random restarts

• Train multiple models with different random initializations

• Do model selection on a development set to pick the best one

Biased initialization

• Bias your initialization using some external source of knowledge (e.g., external corpus counts, a clustering procedure, or expert knowledge about the domain)

• Further training will hopefully improve results


Page 6

Caveats

Baum-Welch with no labelled data generally gives poor results, at least for linguistic structure (~40% accuracy, according to Johnson (2007))

Semi-supervised learning: combine small amounts of labelled data with a larger corpus of unlabelled data


Page 7

In Practice

Per-token (i.e., per-word) accuracy results on the WSJ corpus:

Most frequent tag baseline        ~90–94%
HMMs (Brants, 2000)               96.5%
Stanford tagger (Manning, 2011)   97.32%


Page 8

Other Sequence Modelling Tasks

Chunking (a.k.a. shallow parsing)

• Find syntactic chunks in the sentence; not hierarchical

[NP The chicken] [V crossed] [NP the road] [P across] [NP the lake].

Named-Entity Recognition (NER)

• Identify elements in text that correspond to some high-level categories (e.g., PERSON, ORGANIZATION, LOCATION)

[ORG McGill University] is located in [LOC Montreal, Canada].

• Problem: need to detect spans of multiple words


Page 9

First Attempt

Simply label words by their category:

ORG     ORG   ORG   -   ORG        -   -       -  LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.

What is the problem with this scheme?


Page 10

IOB Tagging

Also label whether a word is inside (I), outside (O), or at the beginning (B) of a span.

For n categories, 2n + 1 total tags.

B-ORG  I-ORG      O  O       O  B-LOC     I-LOC
McGill University is located in Montreal, Canada

B-ORG   B-ORG B-ORG O   B-ORG      O   O       O  B-LOC
McGill, UQAM, UdeM, and Concordia are located in Montreal.
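As an illustrative sketch (all names hypothetical), converting labelled spans into IOB tags might look like this:

```python
def to_iob(tokens, spans):
    """Convert labelled spans into IOB tags (an illustrative sketch).

    tokens : list of words
    spans  : list of (start, end, category), end exclusive
    """
    tags = ["O"] * len(tokens)             # outside by default
    for start, end, cat in spans:
        tags[start] = "B-" + cat           # beginning of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + cat           # inside the span
    return tags

tokens = ["McGill", "University", "is", "located", "in", "Montreal", "Canada"]
print(to_iob(tokens, [(0, 2, "ORG"), (5, 7, "LOC")]))
# ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC', 'I-LOC']
```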


Page 11

Sequence Modelling with Linguistic Features

Page 12

Shortcomings of Standard HMMs

How do we add more features to HMMs?

Might be useful for POS tagging:

• Word position within the sentence (1st, 2nd, last…)

• Capitalization

• Word prefixes and suffixes (-tion, -ed, -ly, -er, re-, de-)

• Features that depend on more than the current word or the previous words


Page 13

Possible to Do with HMMs

Add more emissions at each timestep:

[Figure: a single hidden state $Q_1$ emitting four observations $O_{11}, O_{12}, O_{13}, O_{14}$: word identity, capitalization, a prefix feature, and word position.]

This is clunky. Is there a better way to do it?

Page 14

Discriminative Models

HMMs are generative models

• They build a model of the joint probability distribution $P(\boldsymbol{O}, \boldsymbol{Q})$

• Let's rename the variables: the observations $\boldsymbol{O}$ become $X$, the labels $\boldsymbol{Q}$ become $Y$

• Generative models specify $P(X, Y; \theta_{\text{gen}})$

If we are only interested in POS tagging, we can instead train a discriminative model

• The model specifies $P(Y \mid X; \theta_{\text{disc}})$

• Now a task-specific model for sequence labelling; we cannot use it to generate new samples of word and POS sequences


Page 15

Generative or Discriminative?

Naive Bayes:

$$P(y \mid \vec{x}) = P(y) \prod_i P(x_i \mid y) \,/\, P(\vec{x})$$

Logistic regression:

$$P(y \mid \vec{x}) = \frac{1}{Z} e^{a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b}$$


Page 16

Discriminative Sequence Model

The parallel to an HMM in the discriminative case: linear-chain conditional random fields (linear-chain CRFs) (Lafferty et al., 2001)

$$P(Y \mid X) = \frac{1}{Z(X)} \exp \sum_t \sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$$

where $t$ ranges over all time-steps and $k$ over all features.

$Z(X)$ is a normalization constant, summing over all possible sequences of hidden states:

$$Z(X) = \sum_{\boldsymbol{y}} \exp \sum_t \sum_k \theta_k f_k(y_t, y_{t-1}, x_t)$$

Page 17

Intuition

Standard HMM: a product of probabilities, defined over the identities of the states and words

• Transition from state DT to NN: $P(y_{t+1} = \text{NN} \mid y_t = \text{DT})$

• Emit word the from state DT: $P(x_t = \text{the} \mid y_t = \text{DT})$

Linear-chain CRF: replace the factors in this product by numbers that are NOT probabilities, but exponentiated linear combinations of weights and feature values.


Page 18

Features in CRFs

Standard HMM probabilities as CRF features:

• Transition from state DT to state NN:
  $f_{\text{DT} \to \text{NN}}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_{t-1} = \text{DT}) \, \mathbf{1}(y_t = \text{NN})$

• Emit the from state DT:
  $f_{\text{DT} \to \text{the}}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = \text{DT}) \, \mathbf{1}(x_t = \text{the})$

• Initial state is DT:
  $f_{\text{DT}}(y_1, x_1) = \mathbf{1}(y_1 = \text{DT})$

Indicator function: $\mathbf{1}(\text{condition}) = 1$ if the condition is true, 0 otherwise.


Page 19

Features in CRFs

Additional features that may be useful:

• Word is capitalized:
  $f_{\text{cap}}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = \,?\,) \, \mathbf{1}(x_t \text{ is capitalized})$

• Word ends in -ed:
  $f_{\text{-ed}}(y_t, y_{t-1}, x_t) = \mathbf{1}(y_t = \,?\,) \, \mathbf{1}(x_t \text{ ends with -}ed)$

• Exercise: propose more features (one possible answer is sketched below)
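As one possible answer to the exercise, here is a sketch of such feature functions in code; the tag choices NNP and VBD filling the "?" slots above are our own assumptions:

```python
def f_dt_nn(y_t, y_prev, x_t):
    """Transition feature: previous tag DT, current tag NN."""
    return float(y_prev == "DT" and y_t == "NN")

def f_dt_the(y_t, y_prev, x_t):
    """Emission-style feature: current tag DT, current word 'the'."""
    return float(y_t == "DT" and x_t == "the")

def f_cap(y_t, y_prev, x_t):
    """Orthographic feature: capitalized word tagged NNP
    (our choice of tag for the '?' in the slide)."""
    return float(y_t == "NNP" and x_t[:1].isupper())

def f_ed(y_t, y_prev, x_t):
    """Suffix feature: word ending in -ed tagged VBD (again our choice)."""
    return float(y_t == "VBD" and x_t.endswith("ed"))
```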


Page 20

Inference with LC-CRFs

Dynamic programming still works: modify the forward and Viterbi algorithms to work with the weight-feature products.

                     HMM                                LC-CRF
Forward algorithm    $P(X \mid \theta)$                 $Z(X)$
Viterbi algorithm    $\arg\max_Y P(X, Y \mid \theta)$   $\arg\max_Y P(Y \mid X, \theta)$

Page 21

Forward Algorithm for HMMs

Create trellis $\alpha_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\alpha_j(1) = \pi_j \, b_j(O_1)$  for $j = 1 \dots N$

for $t = 2 \dots T$:
  for $j = 1 \dots N$:
    $\alpha_j(t) = \sum_{i=1}^{N} \alpha_i(t-1) \, a_{ij} \, b_j(O_t)$

$$P(\boldsymbol{O} \mid \theta) = \sum_{j=1}^{N} \alpha_j(T)$$

Runtime: $O(N^2 T)$
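Concretely, the recurrence can be written as a short NumPy sketch (the parameter arrays pi, A, B and the function name are our own assumptions):

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward algorithm for an HMM (a minimal sketch).

    pi  : (N,)   initial state probabilities pi_j
    A   : (N, N) transition probabilities a_ij
    B   : (N, V) emission probabilities b_j(o)
    obs : list of observation indices O_1..O_T
    Returns P(O | theta).
    """
    N, T = len(pi), len(obs)
    alpha = np.zeros((N, T))
    alpha[:, 0] = pi * B[:, obs[0]]                  # alpha_j(1) = pi_j b_j(O_1)
    for t in range(1, T):
        # alpha_j(t) = sum_i alpha_i(t-1) a_ij b_j(O_t)
        alpha[:, t] = (alpha[:, t - 1] @ A) * B[:, obs[t]]
    return alpha[:, -1].sum()                        # P(O | theta)
```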


Page 22

Forward Algorithm for LC-CRFs

Create trellis $\alpha_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\alpha_j(1) = \exp \sum_k \theta_k^{\text{init}} f_k^{\text{init}}(y_1 = j, x_1)$  for $j = 1 \dots N$

for $t = 2 \dots T$:
  for $j = 1 \dots N$:
    $\alpha_j(t) = \sum_{i=1}^{N} \alpha_i(t-1) \exp \sum_k \theta_k f_k(y_t = j, y_{t-1} = i, x_t)$

$$Z(X) = \sum_{j=1}^{N} \alpha_j(T)$$

Runtime: $O(N^2 T)$

The transition and emission probabilities are replaced by the exponent of weighted sums of features. Having $Z(X)$ allows us to compute $P(Y \mid X)$.
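A minimal sketch of the same loop, assuming the feature sums have been precomputed into score arrays (init_scores, scores, and the function name are hypothetical):

```python
import numpy as np

def crf_forward(init_scores, scores):
    """Forward algorithm for a linear-chain CRF (a sketch).

    init_scores : (N,)        sum_k theta_k^init f_k^init(y_1=j, x_1)
    scores      : (T-1, N, N) scores[t, i, j] =
                              sum_k theta_k f_k(y_t=j, y_{t-1}=i, x_t)
    Returns the normalization constant Z(X).
    """
    alpha = np.exp(init_scores)              # alpha_j(1)
    for t in range(scores.shape[0]):
        # alpha_j(t) = sum_i alpha_i(t-1) exp(score of i -> j at t)
        alpha = alpha @ np.exp(scores[t])
    return alpha.sum()                       # Z(X)
```

In practice this computation is usually carried out in log space with log-sum-exp, since products of exponentials quickly overflow for long sequences.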

Page 23

Viterbi Algorithm for HMMs

Create trellis $\delta_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\delta_j(1) = \pi_j \, b_j(O_1)$  for $j = 1 \dots N$

for $t = 2 \dots T$:
  for $j = 1 \dots N$:
    $\delta_j(t) = \max_i \delta_i(t-1) \, a_{ij} \, b_j(O_t)$

Take $\max_i \delta_i(T)$

Runtime: $O(N^2 T)$
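A sketch of the recurrence with backpointers, reusing the hypothetical parameter arrays from the forward-algorithm sketch above:

```python
import numpy as np

def hmm_viterbi(pi, A, B, obs):
    """Viterbi decoding for an HMM (a sketch)."""
    N, T = len(pi), len(obs)
    delta = np.zeros((N, T))
    back = np.zeros((N, T), dtype=int)           # backpointers
    delta[:, 0] = pi * B[:, obs[0]]              # delta_j(1)
    for t in range(1, T):
        cand = delta[:, t - 1, None] * A         # cand[i, j] = delta_i(t-1) a_ij
        back[:, t] = cand.argmax(axis=0)         # best previous state i
        delta[:, t] = cand.max(axis=0) * B[:, obs[t]]
    # Recover the best state sequence by following backpointers.
    path = [int(delta[:, -1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1], delta[:, -1].max()
```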


Page 24

Viterbi Algorithm for LC-CRFs

Create trellis $\delta_i(t)$ for $i = 1 \dots N$, $t = 1 \dots T$

$\delta_j(1) = \exp \sum_k \theta_k^{\text{init}} f_k^{\text{init}}(y_1 = j, x_1)$  for $j = 1 \dots N$

for $t = 2 \dots T$:
  for $j = 1 \dots N$:
    $\delta_j(t) = \max_i \delta_i(t-1) \exp \sum_k \theta_k f_k(y_t = j, y_{t-1} = i, x_t)$

Take $\max_i \delta_i(T)$

Runtime: $O(N^2 T)$

Remember that we need backpointers to recover the best label sequence.
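A sketch reusing the hypothetical score arrays from the LC-CRF forward sketch. Because exp is monotonic, the max can be taken over raw (log-space) scores, with no exponentiation or normalization needed for the argmax:

```python
import numpy as np

def crf_viterbi(init_scores, scores):
    """Viterbi decoding for a linear-chain CRF (a sketch;
    score arrays as in the CRF forward example above)."""
    delta = init_scores.copy()               # log delta_j(1)
    backs = []
    for t in range(scores.shape[0]):
        cand = delta[:, None] + scores[t]    # cand[i, j] in log space
        backs.append(cand.argmax(axis=0))    # best previous state i
        delta = cand.max(axis=0)
    # Follow backpointers from the best final state.
    path = [int(delta.argmax())]
    for bp in reversed(backs):
        path.append(int(bp[path[-1]]))
    return path[::-1]
```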


Page 25

Training LC-CRFs

Unlike for HMMs, there is no analytic MLE solution.

Use an iterative method to improve the data likelihood:

• Gradient descent

• A version of Newton's method to find where the gradient is 0


Page 26

Convexity

Fortunately, $l(\theta)$ is a concave function (equivalently, its negation is a convex function). That means that we will find the global maximum of $l(\theta)$ with gradient ascent (equivalently, gradient descent on $-l(\theta)$).


Page 27

Gradient Ascent

Walk in the direction of the gradient to maximize $l(\theta)$

• a.k.a. gradient descent on a loss function

$$\theta^{\text{new}} = \theta^{\text{old}} + \gamma \nabla l(\theta)$$

$\gamma$ is a learning rate that specifies how large a step to take.

There are more sophisticated ways to do this update:

• Conjugate gradient

• L-BFGS (approximates second-derivative information)


Page 28

Gradient Descent Summary

Descent vs. ascent: by convention, we think of the problem as a minimization problem and minimize the negative log likelihood.

Initialize $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ randomly

Do for a while:
  Compute $\nabla l(\theta)$, which will require dynamic programming (i.e., the forward algorithm)
  $\theta \leftarrow \theta - \gamma \nabla l(\theta)$


Page 29

Gradient of Log-Likelihood

Find the gradient of the log likelihood of the training corpus:

$$l(\theta) = \log \prod_i P(Y^{(i)} \mid X^{(i)})$$


Page 30

Interpretation of Gradient

The overall gradient with respect to $\theta_k$ is the difference between:

$$\sum_i \sum_t f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)})$$

the empirical count of feature $f_k$ in the training corpus,

and:

$$\sum_i \sum_t \sum_{y, y'} f_k(y, y', x_t^{(i)}) \, P(y, y' \mid X^{(i)})$$

the expected count of $f_k$ as predicted by the current model.
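For intuition, here is an illustrative brute-force sketch of this gradient for a single short sequence, enumerating every possible label sequence explicitly (all names are our own; real implementations compute the pairwise marginals $P(y, y' \mid X)$ with the forward-backward algorithm instead):

```python
import itertools
import numpy as np

def gradient_brute_force(features, theta, X, Y, labels):
    """Gradient of log P(Y|X) for one sequence, by enumeration
    (tiny examples only; a teaching sketch).

    features : list of functions f_k(y_t, y_prev, x_t); they must
               handle y_prev=None at the initial position t=0
    theta    : (K,) weights
    X, Y     : observed words and gold labels
    labels   : the label alphabet
    """
    def counts(ys):
        # sum_t f_k(y_t, y_{t-1}, x_t) for each feature k
        return np.array([sum(f(ys[t], ys[t - 1] if t else None, X[t])
                             for t in range(len(X))) for f in features])

    def score(ys):
        return np.exp(theta @ counts(ys))     # unnormalized probability

    Z = sum(score(ys) for ys in itertools.product(labels, repeat=len(X)))
    # expected counts of each feature under the current model
    expected = sum(score(ys) * counts(ys)
                   for ys in itertools.product(labels, repeat=len(X))) / Z
    return counts(Y) - expected               # empirical minus expected
```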


Page 31

Interpretation of Gradient

When the corpus likelihood is maximized, the gradient is zero, so the difference is zero.

Intuitively, this means that finding the parameter estimate by gradient descent is equivalent to telling our model to predict the features so that they appear with the same distribution as in the gold standard.


Page 32

Regularization

To avoid overfitting, we can encourage the weights to be close to zero.

Add a term to the corpus log likelihood:

$$l^*(\theta) = \log \prod_i P(Y^{(i)} \mid X^{(i)}) - \sum_k \frac{\theta_k^2}{2\sigma^2}$$

$\sigma$ controls the amount of regularization.

This results in an extra term in the gradient:

$$-\frac{\theta_k}{\sigma^2}$$


Page 33

Stochastic Gradient Descent

In the standard version of the algorithm, the gradient is computed over the entire training corpus.

• The weights are updated only once per iteration through the training corpus.

Alternative: calculate the gradient over a small mini-batch of the training corpus and update the weights then.

• Many weight updates per iteration through the training corpus

• Usually results in much faster convergence to the final solution, without loss in performance


Page 34

Stochastic Gradient Descent

Initialize $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ randomly

Do for a while:
  Randomize the order of samples in the training corpus
  For each mini-batch in the training corpus:
    Compute $\nabla l^*(\theta)$ over this mini-batch
    $\theta \leftarrow \theta - \gamma \nabla l^*(\theta)$
