Lecture 03: Machine Learning for Language Technology - Linear Classifiers

DESCRIPTION

We will talk about the following machine learning methods: perceptron, SVMs, MIRA, logistic regression/maximum entropy

TRANSCRIPT

Page 1: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Machine Learning for Language Technology

Uppsala University, Department of Linguistics and Philology

Slides borrowed from previous courses. Thanks to Ryan McDonald (Google Research) and Prof. Joakim Nivre.

Page 2: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Introduction

Linear Classifiers

- Classifiers covered so far:
  - Decision trees
  - Nearest neighbors
- Next two lectures: linear classifiers
- Statistics from Google Scholar (Sept 2013):
  - "Maximum Entropy" & "NLP": 11,400 hits (2013), 2,660 (2009), 141 before 2000
  - "SVM" & "NLP": 10,900 hits (2013), 2,210 (2009), 16 before 2000
  - "Perceptron" & "NLP": 3,160 hits (2013), 947 (2009), 118 before 2000
- All are linear classifiers that have become important tools in Language Technology in the past 10 years or so.

Page 3: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Introduction

Outline

- Today:
  - Preliminaries: input/output, features, etc.
  - Linear classifiers
    - Perceptron
    - Large-margin classifiers (SVMs, MIRA)
    - Logistic regression
- Next time:
  - Structured prediction with linear classifiers
    - Structured perceptron
    - Structured large-margin classifiers (SVMs, MIRA)
    - Conditional random fields
  - Case study: Dependency parsing

Page 4: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Inputs and Outputs

- Input: x ∈ X
  - e.g., a document or sentence with some words x = w1 ... wn, or a series of previous actions
- Output: y ∈ Y
  - e.g., a parse tree, document class, part-of-speech tags, word sense
- Input/output pair: (x, y) ∈ X × Y
  - e.g., a document x and its label y
  - Sometimes x is explicit in y, e.g., a parse tree y will contain the sentence x

Page 5: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Feature Representations

- We assume a mapping from input-output pairs (x, y) to a high-dimensional feature vector
  - f(x, y): X × Y → R^m
- For some cases, i.e., binary classification with Y = {−1, +1}, we can map only from the input to the feature space
  - f(x): X → R^m
- However, most problems in NLP require more than two classes, so we focus on the multi-class case
- For any vector v ∈ R^m, let v_j be the j-th value

Page 6: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Features and Classes

- All features must be numerical
  - Numerical features are represented directly as f_i(x, y) ∈ R
  - Binary (boolean) features are represented as f_i(x, y) ∈ {0, 1}
- Multinomial (categorical) features must be binarized (see the sketch after this list)
  - Instead of: f_i(x, y) ∈ {v_0, ..., v_p}
  - We have: f_{i+0}(x, y) ∈ {0, 1}, ..., f_{i+p}(x, y) ∈ {0, 1}
  - Such that: f_{i+j}(x, y) = 1 iff f_i(x, y) = v_j
- We need distinct features for distinct output classes
  - Instead of: f_i(x) (1 ≤ i ≤ m)
  - We have: f_{i+0m}(x, y), ..., f_{i+Nm}(x, y) for Y = {0, ..., N}
  - Such that: f_{i+jm}(x, y) = f_i(x) iff y = y_j
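As a small illustration (my own, not from the slides), here is a minimal Python sketch of binarizing one categorical feature into indicator features; the feature name and values are made up for the example.

```python
def binarize(value, possible_values):
    """Turn one categorical value into a block of 0/1 indicator features,
    one indicator per possible value."""
    return [1.0 if value == v else 0.0 for v in possible_values]

# Hypothetical categorical feature: coarse POS tag of the previous word
pos_values = ["NOUN", "VERB", "ADJ"]
print(binarize("VERB", pos_values))  # -> [0.0, 1.0, 0.0]
```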


Page 7: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Examples

- x is a document and y is a label

  f_j(x, y) = 1 if x contains the word "interest" and y = "financial", 0 otherwise

  f_j(x, y) = % of words in x with punctuation, if y = "scientific"

- x is a word and y is a part-of-speech tag

  f_j(x, y) = 1 if x = "bank" and y = Verb, 0 otherwise

Page 8: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Examples

- x is a name, y is a label classifying the name

  f_0(x, y) = 1 if x contains "George" and y = "Person", 0 otherwise
  f_1(x, y) = 1 if x contains "Washington" and y = "Person", 0 otherwise
  f_2(x, y) = 1 if x contains "Bridge" and y = "Person", 0 otherwise
  f_3(x, y) = 1 if x contains "General" and y = "Person", 0 otherwise
  f_4(x, y) = 1 if x contains "George" and y = "Object", 0 otherwise
  f_5(x, y) = 1 if x contains "Washington" and y = "Object", 0 otherwise
  f_6(x, y) = 1 if x contains "Bridge" and y = "Object", 0 otherwise
  f_7(x, y) = 1 if x contains "General" and y = "Object", 0 otherwise

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]

Page 9: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Preliminaries

Block Feature Vectors

- x = General George Washington, y = Person → f(x, y) = [1 1 0 1 0 0 0 0]
- x = George Washington Bridge, y = Object → f(x, y) = [0 0 0 0 1 1 1 0]
- x = George Washington George, y = Object → f(x, y) = [0 0 0 0 1 1 0 0]

- One equal-size block of the feature vector for each label
- Input features duplicated in each block
- Non-zero values allowed only in one block

Page 10: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Linear Classifiers

- Linear classifier: the score (or probability) of a particular classification is based on a linear combination of features and their weights
- Let w ∈ R^m be a high-dimensional weight vector
- If we assume that w is known, then we define our classifier as
- Multiclass classification: Y = {0, 1, ..., N}

  y = argmax_y w · f(x, y) = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y)

- Binary classification is just a special case of multiclass
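To make the block representation and the argmax rule concrete, here is a minimal Python sketch (my own, not from the slides). It reuses the word indicators and labels from the name example above; the weights are arbitrary toy values.

```python
WORDS = ["George", "Washington", "Bridge", "General"]   # input indicator features
LABELS = ["Person", "Object"]                            # output classes

def f(x, y):
    """Block feature vector f(x, y): non-zero values only in the block of label y."""
    base = [1.0 if word in x.split() else 0.0 for word in WORDS]
    return [v if label == y else 0.0 for label in LABELS for v in base]

def predict(w, x):
    """y_hat = argmax_y  w · f(x, y)"""
    return max(LABELS, key=lambda y: sum(wj * fj for wj, fj in zip(w, f(x, y))))

print(f("General George Washington", "Person"))   # [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]

w = [1.0, 0.5, -1.0, 1.0, -0.5, 0.0, 2.0, -1.0]    # toy weights: 2 labels x 4 indicators
print(predict(w, "George Washington Bridge"))       # -> "Object" with these weights
```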


Page 11: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Linear Classifiers - Bias Terms

- Often linear classifiers are presented as

  y = argmax_y Σ_{j=0}^{m} w_j × f_j(x, y) + b_y

- where b_y is a bias or offset term
- But this can be folded into f:

  x = General George Washington, y = Person → f(x, y) = [1 1 0 1 1 0 0 0 0 0]
  x = General George Washington, y = Object → f(x, y) = [0 0 0 0 0 1 1 0 1 1]

  f_4(x, y) = 1 if y = "Person", 0 otherwise
  f_9(x, y) = 1 if y = "Object", 0 otherwise

- w_4 and w_9 are now the bias terms for the labels

Page 12: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Binary Linear Classifier

Divides all points:


Page 13: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Multiclass Linear Classifier

Defines regions of space:

- i.e., + are all points (x, y) where + = argmax_y w · f(x, y)

Page 14: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Separability

- A set of points is separable if there exists a w such that classification is perfect

  Figure: Separable vs. Not Separable

- This can also be defined mathematically (and we will shortly)

Page 15: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Supervised Learning – how to find w

- Input: training examples T = {(x_t, y_t)}_{t=1}^{|T|}
- Input: feature representation f
- Output: w that maximizes/minimizes some important function on the training set
  - minimize error (Perceptron, SVMs, Boosting)
  - maximize likelihood of data (Logistic Regression)
- Assumption: the training data is separable
  - Not necessary, just makes life easier
  - There is a lot of good work in machine learning to tackle the non-separable case

Page 16: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron

"... The resulting hyperplane is called perceptron ..." (Witten and Frank, 2005: 126)

Page 17: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron

- Choose a w that minimizes error:

  w = argmin_w Σ_t ( 1 − 1[y_t = argmax_y w · f(x_t, y)] )

  where 1[p] = 1 if p is true, 0 otherwise

- This is a 0-1 loss function
- Aside: when minimizing error, people tend to use hinge loss or other smoother loss functions

Page 18: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron Learning Algorithm

Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T (random order: shuffle the training set beforehand!)
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)

- See also Daumé (2012: 38-41).
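For concreteness, here is a minimal Python sketch of the algorithm above (my own illustration, not from the slides). It assumes a placeholder feature function feat(x, y) returning a length-m list of floats and a list of candidate labels; with the block features from the earlier sketch, feat=f, labels=LABELS, and m=8 would train the toy name classifier.

```python
import random

def train_perceptron(data, labels, feat, m, epochs=5, seed=0):
    """Multiclass perceptron. data: list of (x, y) pairs; labels: possible outputs;
    feat(x, y): length-m feature vector; m: number of features."""
    rng = random.Random(seed)
    data = list(data)                # local copy so shuffling does not affect the caller
    w = [0.0] * m
    for _ in range(epochs):                          # step 2: passes over the data
        rng.shuffle(data)                            # step 3: random order
        for x, y_gold in data:
            # step 4: predict with the current weights
            y_pred = max(labels, key=lambda y: sum(wj * fj for wj, fj in zip(w, feat(x, y))))
            if y_pred != y_gold:                     # step 5: on a mistake...
                f_gold, f_pred = feat(x, y_gold), feat(x, y_pred)
                for j in range(m):                   # step 6: additive update
                    w[j] += f_gold[j] - f_pred[j]
    return w
```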


Page 19: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron: Separability and Margin

- Given a training instance (x_t, y_t), define:
  - Y_t = Y − {y_t}
  - i.e., Y_t is the set of incorrect labels for x_t
- A training set T is separable with margin γ > 0 if there exists a vector w with ||w|| = 1 such that:

  w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ

  for all y′ ∈ Y_t, where ||w|| = sqrt(Σ_j w_j^2) (Euclidean or L2 norm)

- Assumption: the training set is separable with margin γ

Page 20: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron: Main Theorem

- Theorem: For any training set separable with a margin of γ, the following holds for the perceptron algorithm:

  mistakes made during training ≤ R^2 / γ^2

  where R ≥ ||f(x_t, y_t) − f(x_t, y′)|| for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Thus, after a finite number of training iterations, the error on the training set will converge to zero
- For the proof, see the Appendix to these slides

Page 21: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Perceptron Summary

- Learns a linear classifier that minimizes error
- Guaranteed to find a w in a finite amount of time (given separable data)
- The perceptron is an example of an online learning algorithm
  - w is updated based on a single training instance in isolation:

    w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)

- Compare decision trees, which perform batch learning
  - All training instances are used to find the best split

Page 22: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Margin


Page 23: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Margin

Figure: Training vs. Testing

Denote the value of the margin by γ

Page 24: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Maximizing Margin (i)

- For a training set T, the margin of a weight vector w is the smallest γ such that

  w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ

  for every training instance (x_t, y_t) ∈ T and y′ ∈ Y_t

Page 25: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Maximizing Margin (ii)

- Intuitively, maximizing the margin makes sense
- More importantly, generalization error on unseen test data is proportional to the inverse of the margin (for the proof, see Daumé, 2012: 45-46)

  ε ∝ R^2 / (γ^2 × |T|)

- Perceptron: we have shown that:
  - If a training set is separable by some margin, the perceptron will find a w that separates the data
  - However, the perceptron does not pick w to maximize the margin!

Page 26: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Maximizing Margin (iii)

Let γ > 0:

  max_{||w|| ≤ 1} γ

  such that: w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ
  for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Note: the algorithm still minimizes error
- ||w|| is bounded, since scaling trivially produces a larger margin:

  β(w · f(x_t, y_t) − w · f(x_t, y′)) ≥ βγ, for some β ≥ 1

Page 27: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Max Margin = Min Norm

Let γ > 0. The following two problems are equivalent:

Max Margin:

  max_{||w|| ≤ 1} γ
  such that:
  w · f(x_t, y_t) − w · f(x_t, y′) ≥ γ
  for all (x_t, y_t) ∈ T and y′ ∈ Y_t

Min Norm:

  min_w (1/2) ||w||^2
  such that:
  w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
  for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Instead of fixing ||w||, we fix the margin γ = 1
- Technically γ ∝ 1/||w||

Page 28: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Support Vector Machines

a.k.a. SVM(s)


Page 29: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Support Vector Machines (i)

  min_w (1/2) ||w||^2
  such that:
  w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
  for all (x_t, y_t) ∈ T and y′ ∈ Y_t

- Quadratic programming problem – a well-known convex optimization problem
- Can be solved with out-of-the-box algorithms
- Batch learning algorithm – w is set w.r.t. all training points
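In practice the QP is rarely written by hand. As a hedged illustration (not from the slides), the sketch below uses scikit-learn's LinearSVC, whose "crammer_singer" option trains a linear multiclass SVM with a hinge loss in the same spirit as the constraints above (with slack, so it also handles non-separable data); the toy vectors and labels are made up.

```python
from sklearn.svm import LinearSVC

# Made-up toy data: four inputs as 3-dimensional feature vectors, two classes.
X = [[1.0, 0.0, 2.0],
     [0.9, 0.1, 1.8],
     [0.0, 1.0, 0.2],
     [0.1, 1.2, 0.0]]
y = ["financial", "financial", "scientific", "scientific"]

clf = LinearSVC(multi_class="crammer_singer", C=1.0)  # multiclass hinge loss, batch training
clf.fit(X, y)
print(clf.predict([[0.8, 0.2, 1.5]]))  # likely ['financial'] on this separable toy set
```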


Page 30: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Support Vector Machines (ii)

- Problem: sometimes |T| is far too large
- Thus the number of constraints might make solving the quadratic programming problem very difficult
- Common technique: Sequential Minimal Optimization (SMO)
- Sparse: the solution depends only on features in support vectors

Page 31: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

MIRA

Margin Infused Relaxed Algorithm


Page 32: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Margin Infused Relaxed Algorithm (MIRA)

- Another option – maximize the margin using an online algorithm
- Batch vs. Online
  - Batch – update parameters based on the entire training set (SVM)
  - Online – update parameters based on a single training instance at a time (Perceptron)
- MIRA can be thought of as a max-margin perceptron or an online SVM

Page 33: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

MIRA

Batch (SVMs):

  min_w (1/2) ||w||^2
  such that:
  w · f(x_t, y_t) − w · f(x_t, y′) ≥ 1
  for all (x_t, y_t) ∈ T and y′ ∈ Y_t

Online (MIRA):

  Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

  1. w^(0) = 0; i = 0
  2. for n : 1..N
  3.   for t : 1..T
  4.     w^(i+1) = argmin_{w*} ||w* − w^(i)||
           such that: w* · f(x_t, y_t) − w* · f(x_t, y′) ≥ 1 for all y′ ∈ Y_t
  5.     i = i + 1
  6. return w^(i)

- MIRA solves much smaller optimization problems, with only |Y_t| constraints per update
- Cost: sub-optimal optimization
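As a rough sketch (my own, not from the slides), here is the common 1-best simplification of the MIRA step in Python: instead of all |Y_t| constraints it enforces only the one for the current best-scoring incorrect label, for which the minimal update has a closed form; feat and labels are placeholders as before.

```python
def mira_update(w, x, y_gold, labels, feat):
    """One 1-best MIRA step: the smallest change to w that makes the gold label
    beat the current best wrong label by a margin of at least 1."""
    score = lambda y: sum(wj * fj for wj, fj in zip(w, feat(x, y)))
    y_wrong = max((y for y in labels if y != y_gold), key=score)
    delta = [g - b for g, b in zip(feat(x, y_gold), feat(x, y_wrong))]
    violation = 1.0 - sum(wj * dj for wj, dj in zip(w, delta))  # margin-1 constraint violation
    norm_sq = sum(d * d for d in delta)
    if violation > 0 and norm_sq > 0:
        tau = violation / norm_sq                               # closed-form step size
        w = [wj + tau * dj for wj, dj in zip(w, delta)]
    return w
```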


Page 34: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Interim Summary

What we have covered
- Linear classifiers:
  - Perceptron
  - SVMs
  - MIRA
- All are trained to minimize error
  - With or without maximizing the margin
  - Online or batch

What is next
- Logistic Regression
  - Train linear classifiers to maximize likelihood

Page 35: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Logistic Regression


Page 36: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Logistic Regression (i)

Define a conditional probability:

  P(y|x) = e^{w·f(x,y)} / Z_x,   where Z_x = Σ_{y′∈Y} e^{w·f(x,y′)}

Note: still a linear classifier

  argmax_y P(y|x) = argmax_y e^{w·f(x,y)} / Z_x
                  = argmax_y e^{w·f(x,y)}
                  = argmax_y w · f(x, y)
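To make the definition concrete, here is a minimal Python sketch of P(y|x) as a normalized exponential of linear scores (my own illustration, not from the slides; feat is a placeholder feature function).

```python
import math

def probs(w, x, labels, feat):
    """Return {y: P(y|x)} with P(y|x) = exp(w · f(x, y)) / Z_x."""
    scores = {y: sum(wj * fj for wj, fj in zip(w, feat(x, y))) for y in labels}
    m = max(scores.values())                       # subtract the max for numerical stability
    exp_scores = {y: math.exp(s - m) for y, s in scores.items()}
    z = sum(exp_scores.values())                   # partition function Z_x
    return {y: e / z for y, e in exp_scores.items()}
```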


Page 37: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Logistic Regression (ii)

  P(y|x) = e^{w·f(x,y)} / Z_x

- Q: How do we learn the weights w?
- A: Set the weights to maximize the log-likelihood of the training data:

  w = argmax_w Π_t P(y_t|x_t) = argmax_w Σ_t log P(y_t|x_t)

- In a nutshell, we set the weights w so that we assign as much probability as possible to the correct label y for each x in the training set

Page 38: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Logistic Regression

  P(y|x) = e^{w·f(x,y)} / Z_x,   where Z_x = Σ_{y′∈Y} e^{w·f(x,y′)}

  w = argmax_w Σ_t log P(y_t|x_t)   (*)

- The objective function (*) is concave
- Therefore there is a global maximum
- No closed-form solution, but lots of numerical techniques
  - Gradient methods (gradient ascent, iterative scaling)
  - Newton methods (limited-memory quasi-Newton)

Page 39: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Logistic Regression Summary

- Define the conditional probability:

  P(y|x) = e^{w·f(x,y)} / Z_x

- Set weights to maximize the log-likelihood of the training data:

  w = argmax_w Σ_t log P(y_t|x_t)

- Can find the gradient and run gradient ascent (or any gradient-based optimization algorithm)

  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)

  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)
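As an illustration (my own, not from the slides), a minimal Python sketch of this training procedure: the gradient is empirical minus expected feature counts, followed by plain gradient ascent with a fixed step size α. It reuses the hypothetical probs helper from the earlier sketch and again treats feat, labels, and m as placeholders.

```python
def log_likelihood_gradient(w, data, labels, feat, m):
    """dF/dw_i = sum_t f_i(x_t, y_t) - sum_t sum_y P(y|x_t) f_i(x_t, y)."""
    grad = [0.0] * m
    for x, y_gold in data:
        p = probs(w, x, labels, feat)               # model distribution P(y|x_t)
        for i, v in enumerate(feat(x, y_gold)):
            grad[i] += v                            # empirical feature counts
        for y in labels:
            for i, v in enumerate(feat(x, y)):
                grad[i] -= p[y] * v                 # expected feature counts
    return grad

def gradient_ascent(data, labels, feat, m, alpha=0.1, iters=100):
    """w^(i) = w^(i-1) + alpha * grad F(w^(i-1)), starting from w = 0."""
    w = [0.0] * m
    for _ in range(iters):
        g = log_likelihood_gradient(w, data, labels, feat, m)
        w = [wj + alpha * gj for wj, gj in zip(w, g)]
    return w
```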


Page 40: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Linear Classification: Summary

- Basic form of a (multiclass) classifier:

  y = argmax_y w · f(x, y)

- Different learning methods:
  - Perceptron – separate the data (0-1 loss, online)
  - Support vector machine – maximize the margin (hinge loss, batch)
  - Logistic regression – maximize likelihood (log loss, batch)
- All three methods are widely used in NLP

Page 41: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Aside: Min error versus max log-likelihood (i)

- Highly related but not identical
- Example: consider a training set T with 1001 points

  1000 points with f(x_i, y = 0) = [−1, 1, 0, 0] for i = 1 ... 1000
  1 point with f(x_1001, y = 1) = [0, 0, 3, 1]

- Now consider w = [−1, 0, 1, 0]
- The error in this case is 0 – so w minimizes error:

  [−1, 0, 1, 0] · [−1, 1, 0, 0] = 1 > [−1, 0, 1, 0] · [0, 0, −1, 1] = −1
  [−1, 0, 1, 0] · [0, 0, 3, 1] = 3 > [−1, 0, 1, 0] · [3, 1, 0, 0] = −3

- However, log-likelihood = −126.9 (calculation omitted)

Page 42: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Aside: Min error versus max log-likelihood (ii)

- Highly related but not identical
- Example: consider the same training set T with 1001 points

  1000 points with f(x_i, y = 0) = [−1, 1, 0, 0] for i = 1 ... 1000
  1 point with f(x_1001, y = 1) = [0, 0, 3, 1]

- Now consider w = [−1, 7, 1, 0]
- The error in this case is 1 – so w does not minimize error:

  [−1, 7, 1, 0] · [−1, 1, 0, 0] = 8 > [−1, 7, 1, 0] · [0, 0, −1, 1] = −1
  [−1, 7, 1, 0] · [0, 0, 3, 1] = 3 < [−1, 7, 1, 0] · [3, 1, 0, 0] = 4

- However, log-likelihood = −1.4
- Better log-likelihood and worse error
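The omitted log-likelihood values are easy to reproduce. The short Python check below (my own, not from the slides) recovers the −126.9 and −1.4 figures from the two examples above.

```python
import math

def log_likelihood(w, examples):
    """Sum of log P(y_gold | x); each example lists the feature vectors of all
    labels plus the index of the gold label."""
    total = 0.0
    for feats, gold in examples:
        scores = [sum(wj * fj for wj, fj in zip(w, f)) for f in feats]
        total += scores[gold] - math.log(sum(math.exp(s) for s in scores))
    return total

point_a = ([[-1, 1, 0, 0], [0, 0, -1, 1]], 0)   # the 1000 identical points, gold y = 0
point_b = ([[3, 1, 0, 0], [0, 0, 3, 1]], 1)     # the single point, gold y = 1
data = [point_a] * 1000 + [point_b]

print(round(log_likelihood([-1, 0, 1, 0], data), 1))  # approx. -126.9
print(round(log_likelihood([-1, 7, 1, 0], data), 1))  # approx. -1.4
```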


Page 43: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Aside: Min error versus max log-likelihood (iii)

- Max likelihood ≠ min error
- Max likelihood pushes as much probability as possible onto the correct labeling of each training instance
  - Even at the cost of mislabeling a few examples
- Min error forces all training instances to be correctly classified
- SVMs with slack variables allow some examples to be classified wrong if the resulting margin is improved on other examples

Page 44: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Aside: Max margin versus max log-likelihood

- Let's re-write the max likelihood objective function:

  w = argmax_w Σ_t log P(y_t|x_t)
    = argmax_w Σ_t log [ e^{w·f(x_t,y_t)} / Σ_{y′∈Y} e^{w·f(x_t,y′)} ]
    = argmax_w Σ_t [ w · f(x_t, y_t) − log Σ_{y′∈Y} e^{w·f(x_t,y′)} ]

- Pick w to maximize the score difference between the correct labeling and every possible labeling
- Margin: maximize the difference between the correct and all incorrect labelings
- The above formulation is often referred to as the soft-margin

Page 45: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Linear Classifiers

Aside: Logistic Regression = Maximum Entropy

- Well-known equivalence
- MaxEnt: maximize entropy subject to constraints on features
  - Empirical feature counts must equal expected counts
- Quick intuition
  - Partial derivative in logistic regression:

    ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)

  - The first term is the empirical feature counts and the second term is the expected counts
  - Setting the derivative to zero maximizes the function
  - Therefore when both counts are equal, we optimize the logistic regression objective!

Page 46: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Appendix

Proofs and Derivations


Page 47: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Convergence Proof for Perceptron

Perceptron Learning Algorithm
Training data: T = {(x_t, y_t)}_{t=1}^{|T|}

1. w^(0) = 0; i = 0
2. for n : 1..N
3.   for t : 1..T
4.     Let y′ = argmax_y w^(i) · f(x_t, y)
5.     if y′ ≠ y_t
6.       w^(i+1) = w^(i) + f(x_t, y_t) − f(x_t, y′)
7.       i = i + 1
8. return w^(i)

- Let u be a vector with ||u|| = 1 that separates the data with margin γ (it exists by the separability assumption)
- w^(k−1) are the weights before the kth mistake
- Suppose the kth mistake is made at the tth example, (x_t, y_t)
  - y′ = argmax_y w^(k−1) · f(x_t, y)
  - y′ ≠ y_t
  - w^(k) = w^(k−1) + f(x_t, y_t) − f(x_t, y′)
- Now: u · w^(k) = u · w^(k−1) + u · (f(x_t, y_t) − f(x_t, y′)) ≥ u · w^(k−1) + γ
- Now: w^(0) = 0 and u · w^(0) = 0, so by induction on k, u · w^(k) ≥ kγ
- Now: since u · w^(k) ≤ ||u|| × ||w^(k)|| and ||u|| = 1, then ||w^(k)|| ≥ kγ
- Now:

  ||w^(k)||^2 = ||w^(k−1)||^2 + ||f(x_t, y_t) − f(x_t, y′)||^2 + 2 w^(k−1) · (f(x_t, y_t) − f(x_t, y′))

  ||w^(k)||^2 ≤ ||w^(k−1)||^2 + R^2

  (since R ≥ ||f(x_t, y_t) − f(x_t, y′)|| and w^(k−1) · f(x_t, y_t) − w^(k−1) · f(x_t, y′) ≤ 0)

Page 48: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Convergence Proof for Perceptron

Perceptron Learning Algorithm

- We have just shown that ||w^(k)|| ≥ kγ and ||w^(k)||^2 ≤ ||w^(k−1)||^2 + R^2
- By induction on k, and since w^(0) = 0 and ||w^(0)||^2 = 0,

  ||w^(k)||^2 ≤ kR^2

- Therefore,

  k^2 γ^2 ≤ ||w^(k)||^2 ≤ kR^2

- and solving for k:

  k ≤ R^2 / γ^2

- Therefore the number of errors is bounded!

Page 49: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

Gradient Ascent

- Let F(w) = Σ_t log [ e^{w·f(x_t,y_t)} / Z_{x_t} ]
- Want to find argmax_w F(w)
- Set w^(0) = 0^m (the m-dimensional zero vector)
- Iterate until convergence:

  w^(i) = w^(i−1) + α ∇F(w^(i−1))

- α > 0 and set so that F(w^(i)) > F(w^(i−1))
- ∇F(w) is the gradient of F w.r.t. w
  - A gradient is the vector of all partial derivatives over the variables w_i
  - i.e., ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)
- Gradient ascent will always find w to maximize F (since F is concave)

Page 50: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

The partial derivatives

- Need to find all partial derivatives ∂F(w)/∂w_i

  F(w) = Σ_t log P(y_t|x_t)
       = Σ_t log [ e^{w·f(x_t,y_t)} / Σ_{y′∈Y} e^{w·f(x_t,y′)} ]
       = Σ_t log [ e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} ]

Page 51: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

Partial derivatives - some reminders

1. ∂/∂x log F = (1/F) ∂F/∂x
   - We always assume log is the natural logarithm log_e
2. ∂/∂x e^F = e^F ∂F/∂x
3. ∂/∂x Σ_t F_t = Σ_t ∂F_t/∂x
4. ∂/∂x (F/G) = (G ∂F/∂x − F ∂G/∂x) / G^2

Page 52: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

The partial derivatives

  ∂F(w)/∂w_i = ∂/∂w_i Σ_t log [ e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} ]

             = Σ_t ∂/∂w_i log [ e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} ]

             = Σ_t ( Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} )

             = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )

Page 53: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

The partial derivatives

Now,

  ∂/∂w_i [ e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} ]

    = [ Z_{x_t} ∂/∂w_i e^{Σ_j w_j f_j(x_t,y_t)} − e^{Σ_j w_j f_j(x_t,y_t)} ∂Z_{x_t}/∂w_i ] / Z_{x_t}^2

    = [ Z_{x_t} e^{Σ_j w_j f_j(x_t,y_t)} f_i(x_t, y_t) − e^{Σ_j w_j f_j(x_t,y_t)} ∂Z_{x_t}/∂w_i ] / Z_{x_t}^2

    = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}^2 ) ( Z_{x_t} f_i(x_t, y_t) − ∂Z_{x_t}/∂w_i )

    = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}^2 ) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )

because

  ∂Z_{x_t}/∂w_i = ∂/∂w_i Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} = Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′)

Page 54: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

The partial derivatives

From before,

  ∂/∂w_i [ e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} ]
    = ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t}^2 ) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )

Substituting this in,

  ∂F(w)/∂w_i = Σ_t ( Z_{x_t} / e^{Σ_j w_j f_j(x_t,y_t)} ) × ∂/∂w_i ( e^{Σ_j w_j f_j(x_t,y_t)} / Z_{x_t} )

             = Σ_t (1 / Z_{x_t}) ( Z_{x_t} f_i(x_t, y_t) − Σ_{y′∈Y} e^{Σ_j w_j f_j(x_t,y′)} f_i(x_t, y′) )

             = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} ( e^{Σ_j w_j f_j(x_t,y′)} / Z_{x_t} ) f_i(x_t, y′)

             = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)

Page 55: Lecture 03: Machine Learning for Language Technology - Linear Classifiers

Gradient Ascent for Logistic Regression

FINALLY!!!

- After all that,

  ∂F(w)/∂w_i = Σ_t f_i(x_t, y_t) − Σ_t Σ_{y′∈Y} P(y′|x_t) f_i(x_t, y′)

- And the gradient is:

  ∇F(w) = (∂F(w)/∂w_0, ∂F(w)/∂w_1, ..., ∂F(w)/∂w_m)

- So we can now use gradient ascent to find w!!