
Post on 01-Aug-2020


Machine Learning: Intro & Linear Classifier

Olga Veksler

Outline

• Introduction to Machine Learning

• motivation of machine learning

• types of machine learning

• basic concepts

• overfitting, underfitting, generalization

• Optimization with gradient descent

• Linear Classifier

• Two classes

• Loss functions

• Gradient descent schemes

• Multiple classes

INTRO TO ML

Why use Machine Learning?

• Difficult to come up with explicit program for some tasks

• Digit Recognition, a classic example

• Easy to collect images of digits with their correct labels

• Machine Learning Algorithm takes collected data and produces program for recognizing digits

• done right, program will correctly recognize new images it has never seen

What is Machine Learning?

• General definition (Tom Mitchell):

• Based on experience E, improve performance on task T as measured by performance measure P

• Digit Recognition Example

• T = recognize character in the image

• P = percentage of correctly classified images

• E = dataset of human-labeled images of characters


Types of Machine Learning

• Supervised Learning

• given training examples with corresponding outputs (also called label, target, class, answer, etc.)

• learn to produce correct output for a new example

• Unsupervised Learning

• given training examples only

• discover good data representation

• e.g. “natural” clusters

• k-means is the most widely known example

• Reinforcement Learning

• learn to select action that maximizes payoff


Supervised Machine Learning: subtypes

• Classification

• output belongs to a finite set

• example: age ∊ {baby, child, adult, elder}

• output is also called class or label

• Regression

• output is continuous

• example: age ∊ [0,130]

• Difference mostly in design of loss function

Supervised Machine Learning

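• Have examples with corresponding outputs

• Example: fish classification, salmon or sea bass?

[Figure: four fish images labeled salmon or sea bass, each with a two-component feature vector xi (component values 7.5 and 3.3, 7.8 and 3.6, 7.1 and 3.2, 0.7 and 4.6) and an output yi]

y1 = 0, y2 = 1, y3 = 0, y4 = 1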

• Each example represented by a vector

• data may be given in vector form from the start

• if not, for each example i, extract useful features and put them in a vector xi

• fish classification example

• extract two features, fish length and average fish brightness

• can extract as many other features as needed

• for images, can also use raw pixel intensity or color as features

• an example is often called feature vector

• yi is the output for example xi

Supervised Machine Learning

• Training phase

• estimate function y = f(x) from labeled data

• f(x) is called classifier, learning machine, prediction function, etc.

• Testing phase (deployment)

• predict output f(x) for a new (unseen) sample x

• We are given labeled data

1. Training examples x1, x2, …, xn

2. Target output for each sample y1, y2, …, yn

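Training/Testing Phases Illustrated

[Figure: Training: training examples are turned into feature vectors and, together with the training labels, used in training to produce the learned model f. Testing: a test image is turned into a feature vector and passed to the learned model f, which outputs the label prediction.]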

More on Training Phase

• Estimate prediction function y = f(x) from labeled data

• Choose hypothesis space f(x) belongs to

• hypothesis space f(x,w) is parameterized by vector of weights w

• each setting of w corresponds to a different hypothesis

• find f(x,w) in the hypothesis space s.t. f(xi,w) = yi “as much as possible” for training examples

• define “as much as possible” via some loss function L(f(x,w),y)

[Figure: hypothesis space containing the hypotheses f(x,w1), f(x,w2), f(x,w3), f(x,w4), f(x,w5)]

Training Phase Example in 1D

• 2 class classification problem, yi ∊ {-1, 1}

• Training set: class -1: {-2, -1, 1}, class 1: {2, 3, 5}

• Hypothesis space f(x,w) = sign(w0 + w1x), with $w = [w_0, w_1]^t$

• with w0 = -1, w1 = 2, f(x) = sign(-1 + 2x)

[Figure: the line -1 + 2x plotted against x; it crosses zero at x = 0.5, with class -1 to the left and class 1 to the right]

Training Phase Example in 1D

• 2 class classification problem, yi ∊ {-1, 1}

• Training set: class -1: {-2, -1, 1}, class 1: {2, 3, 5}

• Hypothesis space f(x,w) = sign(w0 + w1x)

• With w0 = -1.5, w1 = 1, f(x) = sign(-1.5 + x)

[Figure: the line -1.5 + x plotted against x; it crosses zero at x = 1.5, with class -1 to the left and class 1 to the right]

• The process of finding good w is weight tuning, or training
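The two weight settings from the 1D example can be compared by counting training mistakes. The following is a minimal sketch rather than part of the original slides; the helper names `classify` and `training_error` are illustrative, and assigning sign(0) to class -1 is an assumption.

```python
def sign(v):
    """Return +1 for positive values and -1 otherwise (tie at 0 goes to -1 by assumption)."""
    return 1 if v > 0 else -1


def classify(x, w0, w1):
    """Hypothesis f(x, w) = sign(w0 + w1 * x)."""
    return sign(w0 + w1 * x)


# Training set from the slides: class -1 is {-2, -1, 1}, class 1 is {2, 3, 5}.
training_set = [(-2, -1), (-1, -1), (1, -1), (2, 1), (3, 1), (5, 1)]


def training_error(w0, w1):
    """Fraction of training examples that f(x, w) labels incorrectly."""
    mistakes = sum(1 for x, y in training_set if classify(x, w0, w1) != y)
    return mistakes / len(training_set)


error_first = training_error(-1, 2)     # boundary at x = 0.5
error_second = training_error(-1.5, 1)  # boundary at x = 1.5
print(error_first, error_second)
```

With the first setting the sample x = 1 lands on the class 1 side, so one of six examples is wrong; the second setting separates the training set.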

Training Phase Example in 2D

• For 2 class problem and 2 dimensional samples

f(x,w) = sign(w0+w1x1+w2x2)

[Figure: a linear decision boundary in the (x1, x2) plane separating the two decision regions]

• Can be generalized to examples of arbitrary dimension

• Classifier that makes a decision based on linear combination of features is called a linear classifier

Training Phase: Linear Classifier

[Figure: two panels in the (x1, x2) plane. Bad setting of w: classification error 38%. Better setting of w: classification error 4%.]

Training Stage: More Complex Classifier

• for example if f(x,w) is a polynomial of high degree

• 0% classification error

[Figure: complex decision boundary in the (x1, x2) plane that fits every training sample]

Test Classifier on New Data

• The goal is to classify well on new data

• Test “wiggly” classifier on new data: 25% error


Overfitting

• Have only limited amount of data for training

• Complex model often has too many parameters to fit reliably with limited data

• Complex model may adapt too closely to random noise of training data, rather than look at a ‘big picture’


Overfitting: Extreme Example

• 2 class problem: face and non-face images

• Memorize (i.e. store) all the “face” images

• For a new image, see if it is one of the stored faces

• if yes, output “face” as the classification result

• if no, output “non-face”

• also called “rote learning”

• problem: new “face” images are different from stored “face” examples

• zero error on stored data, 50% error on test (new) data

• decision boundary is very irregular

• Rote learning is memorization without generalization

slide is modified from Y. LeCun

Generalization

• Ability to produce correct outputs on previously unseen examples is called generalization

• Big question of learning theory: how to get good generalization with a limited number of examples

• Intuitive idea: favor simpler classifiers

• William of Occam (1284-1347): “entities are not to be multiplied without necessity”

• Simpler decision boundary may not fit ideally to training data but tends to generalize better to new data

[Figure: the same simple decision boundary shown on the training data and on new data]

Training and Testing

• How to diagnose overfitting?

• Divide all labeled samples x1,x2,…xn into training set and test set

• Use training set (training samples) to tune classifier weights w

• Use test set (test samples) to see how well classifier with tuned weights w works on unseen examples

• Thus there are 2 main phases in classifier design

1. training

2. testing

Training Phase

• Find weights w s.t. f(xi,w) = yi “as much as possible” for training samples xi

• define “as much as possible”

• penalty whenever f(xi,w) ≠ yi

• via loss function L(f(xi,w), yi)

• how to search for good setting of w?

• usually through optimization

• can be very time consuming

• classification error on training data is called training error

Testing Phase

• The goal is good performance on unseen examples

• Evaluate performance of the trained classifier f(x,w) on the test samples (unseen labeled samples)

• Testing on unseen labeled examples is an approximation of how well classifier will perform in practice

• If testing results are poor, may have to go back to the training phase and redesign f(x,w)

• Classification error on test data is called test error

• Side note

• when we “deploy” the final classifier f(x,w) in practice, this is also called testing


Underfitting

• Can also underfit data, i.e. too simple decision boundary

• chosen hypothesis space is not expressive enough

• No linear decision boundary can separate the samples well

• Training error is too high

• test error is, of course, also high

Underfitting → Overfitting

• underfitting: high training error, high test error

• “just right”: low training error, low test error

• overfitting: low training error, high test error

Sketch of Supervised Machine Learning

• Choose a hypothesis space f(x,w)

• w are tunable weights

• x is the input sample

• tune w so that f(x,w) gives the correct label for training samples x

• Which hypothesis space f(x,w) to choose?

• has to be expressive enough to model our problem well, i.e. to avoid underfitting

• yet not too complicated, to avoid overfitting

Classification System Design Overview

• Collect and label data by hand

[Figure: fish images labeled salmon or sea bass]

• Preprocess data (e.g. segmenting fish from background)

• Extract possibly discriminating features

• length, lightness, width, number of fins, etc.

• Split data into training and test sets

• Classifier design

• choose model for classifier

• train classifier on training data

• Test classifier on test data

• mostly look at these steps in the course

OPTIMIZATION

Optimization

• How to minimize a function of a single variable

$J(x) = (x-5)^2$

• From calculus, take derivative, set it to 0

$\frac{d}{dx} J(x) = 0$

• Solve the resulting equation

• may be easy or hard to solve

• Example above is easy

$\frac{d}{dx} J(x) = 2(x-5) = 0 \;\Rightarrow\; x = 5$

Optimization

• How to minimize a function of many variables

J(x) = J(x1,…, xd)

• From calculus, take partial derivatives, set them to 0

$\nabla J(x) = \begin{bmatrix} \frac{\partial}{\partial x_1} J(x) \\ \vdots \\ \frac{\partial}{\partial x_d} J(x) \end{bmatrix} = 0$ (the vector $\nabla J(x)$ is the gradient)

• Solve the resulting system of d equations

• It may not be possible to solve the system of equations above analytically

Optimization: Gradient Direction

[Figure: surface plot of J(x1, x2) over x1 and x2. Picture from Andrew Ng]

• Gradient $\nabla J(x)$ points in the direction of steepest increase of function J(x)

• $-\nabla J(x)$ points in the direction of steepest decrease

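Gradient Direction in 2D

• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$

$\frac{\partial}{\partial x_1} J(x) = 2(x_1 - 5)$, $\frac{\partial}{\partial x_2} J(x) = 2(x_2 - 10)$

• Let $a = \begin{bmatrix} 10 \\ 5 \end{bmatrix}$

• $\frac{\partial}{\partial x_1} J(a) = 10$

• $\frac{\partial}{\partial x_2} J(a) = -10$

• $\nabla J(a) = \begin{bmatrix} 10 \\ -10 \end{bmatrix}$

• $-\nabla J(a) = \begin{bmatrix} -10 \\ 10 \end{bmatrix}$

[Figure: the (x1, x2) plane with the point a = (10, 5), the direction $-\nabla J(a)$, and the global min at (5, 10)]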

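Gradient Descent: Step Size

• $J(x_1, x_2) = (x_1-5)^2 + (x_2-10)^2$

• Which step size to take?

• Controlled by parameter α

• called learning rate

• From previous slide: $a = \begin{bmatrix} 10 \\ 5 \end{bmatrix}$, $-\nabla J(a) = \begin{bmatrix} -10 \\ 10 \end{bmatrix}$

• Let α = 0.2

$a - \alpha \nabla J(a) = \begin{bmatrix} 10 \\ 5 \end{bmatrix} + 0.2 \begin{bmatrix} -10 \\ 10 \end{bmatrix} = \begin{bmatrix} 8 \\ 7 \end{bmatrix}$

• J(10, 5) = 50; J(8, 7) = 18

[Figure: the (x1, x2) plane with the point a = (10, 5), the step to (8, 7), and the global min at (5, 10)]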

Gradient Descent Algorithm

k = 1

x(1) = any initial guess

choose α, ε

while ||∇J(x(k))|| > ε

    x(k+1) = x(k) - α∇J(x(k))

    k = k + 1

[Figure: J(x) plotted against x, with iterates x(1), x(2), …, x(k) moving along -∇J(x(1)), -∇J(x(2)), … until -∇J(x(k)) ≈ 0]
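The loop above can be sketched in runnable form on the earlier example J(x1, x2) = (x1-5)^2 + (x2-10)^2. This is an illustrative sketch, not the slides' own code; the function names and the `max_iters` safety cap are assumptions added here.

```python
import math


def grad_J(x):
    """Gradient of J(x1, x2) = (x1 - 5)^2 + (x2 - 10)^2."""
    return [2 * (x[0] - 5), 2 * (x[1] - 10)]


def gradient_descent(grad, x0, alpha, eps, max_iters=10_000):
    """Repeat x = x - alpha * grad(x) until the gradient norm is at most eps."""
    x = list(x0)
    for _ in range(max_iters):
        g = grad(x)
        if math.sqrt(sum(gi * gi for gi in g)) <= eps:
            break
        x = [xi - alpha * gi for xi, gi in zip(x, g)]
    return x


# One step from a = (10, 5) with alpha = 0.2, as in the step size example.
first_step = gradient_descent(grad_J, [10.0, 5.0], alpha=0.2, eps=1e-8, max_iters=1)
# Running to convergence approaches the global minimum at (5, 10).
minimum = gradient_descent(grad_J, [10.0, 5.0], alpha=0.2, eps=1e-8)
print(first_step, minimum)
```

Only the current iterate is stored, since the intermediate solutions are not needed once the loop has moved on.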

Gradient Descent: Local Minimum

• Not guaranteed to find global minimum

• gets stuck in local minimum

[Figure: J(x) plotted against x, with iterates x(1), x(2), …, x(k) following -∇J and stopping at a local minimum where -∇J(x(k)) = 0, away from the global minimum]

• Still gradient descent is very popular because it is simple and applicable to any differentiable function

How to Set Learning Rate?

• If too large, may overshoot the local minimum and possibly never even converge

• If too small, too many iterations to converge

[Figure: J(x) plotted against x, with iterates x(1), x(2), x(3), x(4) marked]

• It helps to compute J(x) as a function of iteration number, to make sure we are properly minimizing it
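The effect of the learning rate can be seen by recording J at every iteration. The sketch below uses the single-variable example J(x) = (x-5)^2; the three learning rates and the starting point x = 10 are illustrative choices, not values from the slides.

```python
def run(alpha, steps, x=10.0):
    """Run gradient descent on J(x) = (x - 5)^2 and return J after every step."""
    history = [(x - 5) ** 2]
    for _ in range(steps):
        x = x - alpha * 2 * (x - 5)  # dJ/dx = 2 (x - 5)
        history.append((x - 5) ** 2)
    return history


too_large = run(alpha=1.1, steps=20)    # overshoots: J grows at every step
too_small = run(alpha=0.001, steps=20)  # J decreases, but very slowly
reasonable = run(alpha=0.4, steps=20)   # J shrinks quickly toward 0

print(too_large[-1], too_small[-1], reasonable[-1])
```

Each update multiplies the distance to the minimum by (1 - 2α), which is why α = 1.1 diverges while α = 0.4 converges in a few steps.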

Variable Learning Rate

• Fixed learning rate

k = 1

x(1) = any initial guess

choose α, ε

while ||∇J(x(k))|| > ε

    x(k+1) = x(k) - α∇J(x(k))

    k = k + 1

• If desired, can change learning rate at each iteration

k = 1

x(1) = any initial guess

choose ε

while ||∇J(x(k))|| > ε

    choose α(k)

    x(k+1) = x(k) - α(k)∇J(x(k))

    k = k + 1

Variable Learning Rate

• Usually do not keep track of all intermediate solutions

k = 1

x = any initial guess

choose α, ε

while ||∇J(x)|| > ε

    x = x - α∇J(x)

    k = k + 1

Learning Rate

• Monitor learning rate by looking at how fast the objective function decreases

[Figure: J(x) plotted against number of iterations for a very high, a high, a low, and a good learning rate]

Learning Rate: Loss Surface Illustration

[Figure: gradient descent trajectories on a loss surface for several values of the learning rate α, annotated with the approximate number of updates]

Linear Classifiers

Supervised Machine Learning: Recap

• Choose type of f(x,w)

• w are tunable weights, x is the input example

• f(x,w) should output the correct class of sample x

• use labeled samples to tune weights w so that f(x,w) gives the correct class y for x in training data

• loss function L(f(x,w) ,y)

• How to choose type of f(x,w)?

• many choices

• simple choice: linear classifier

• other choices

• Neural Network

• Convolutional Neural Network

Linear Classifier

• Classifier makes a decision based on linear combination of features

g(x,w) = w0+x1w1 + … + xdwd

• g(x,w) is called discriminant function

• Use

• y = 1 for the first class

• y = -1 for the second class

• One choice for linear classifier

f(x,w) = sign(g(x,w))

• 1 if g(x,w) is positive

• -1 if g(x,w) is negative

[Figure: f(x) as a step function of g(x), taking the values -1 and 1]

Linear Classifier: Decision Boundary

• f(x,w) = sign(g(x,w)) = sign(w0+x1w1 + … + xdwd)

• Decision boundary is linear

• Find w0, w1,…, wd that give best separation of two classes with a linear boundary

[Figure: a bad boundary and a better boundary for the same two classes]

More on Linear Discriminant Function (LDF)

• LDF: g(x,w,w0) = w0+x1w1 + … + xdwd

• $w = [w_1, w_2, \ldots, w_d]^t$ is the weight vector, w0 is the bias or threshold

[Figure: the (x1, x2) plane with decision boundary g(x) = 0; g(x) > 0 is the decision region for class 1 and g(x) < 0 is the decision region for class 2]

More on Linear Discriminant Function (LDF)

• Decision boundary: g(x,w) = w0+x1w1 + … + xdwd = 0

• This is a hyperplane, by definition

• a point in 1D

• a line in 2D

• a plane in 3D

• a hyperplane in higher dimensions

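Vector Notation

• Linear discriminant function $g(x, w, w_0) = w^t x + w_0$

• Example in 2D

$g(x, w, w_0) = 3x_1 + 2x_2 + 4$, with $x = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}$, $w = \begin{bmatrix} 3 \\ 2 \end{bmatrix}$, $w_0 = 4$

• Shorter notation if add extra feature of value 1 to x

$z = \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix}$, $a = \begin{bmatrix} 4 \\ 3 \\ 2 \end{bmatrix}$

$g(z, a) = a^t z = \begin{bmatrix} 4 & 3 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ x_1 \\ x_2 \end{bmatrix} = 4 + 3x_1 + 2x_2 = w^t x + w_0 = g(x, w, w_0)$

• Use $a^t z$ instead of $w^t x + w_0$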

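Fitting Parameters w

• Rewrite $g(x, w, w_0) = \begin{bmatrix} w_0 & w^t \end{bmatrix} \begin{bmatrix} 1 \\ x \end{bmatrix} = a^t z = g(z, a)$

• new feature vector $z = [1, x_1, \ldots, x_d]^t$

• new weight vector $a = [w_0, w_1, \ldots, w_d]^t$

• z is called augmented feature vector

• new problem equivalent to the old: $g(z, a) = a^t z$

• f(z,a) = sign(g(z,a))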
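The equivalence between the original and the augmented notation can be checked numerically. This is a small sketch using the 2D example with w = (3, 2) and w0 = 4; the helper names and the test point x = (1, 2) are illustrative assumptions.

```python
def g_original(x, w, w0):
    """Discriminant in the original notation: w^t x + w0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0


def augment(x):
    """Prepend the constant feature 1 to obtain the augmented vector z."""
    return [1.0] + list(x)


def g_augmented(z, a):
    """Discriminant in the augmented notation: a^t z."""
    return sum(ai * zi for ai, zi in zip(a, z))


w, w0 = [3.0, 2.0], 4.0
a = [w0] + w            # a = (w0, w1, w2) = (4, 3, 2)
x = [1.0, 2.0]          # hypothetical test point

value_original = g_original(x, w, w0)
value_augmented = g_augmented(augment(x), a)
print(value_original, value_augmented)
```

Both calls evaluate 4 + 3·1 + 2·2, so folding the bias into a changes only the bookkeeping, not the classifier.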

Solution Region

• If there is a weight vector a that classifies all examples correctly, it is called a separating or solution vector

• then there are infinitely many solution vectors a

• original samples x1,… xn are also linearly separable

Loss Function

• How to find solution vector a?

• or, if no separating a exists, a good approximation a?

• Design a non-negative loss function L(a)

• L(a) is small if a is good

• L(a) is large if a is bad

• Minimize L(a) with gradient descent

• Two steps in design of L(a)

1. per-example loss L(f(zi,a),yi)

• penalizes for deviations of f(zi,a) from yi

2. total loss adds up per-sample loss over all training examples

$L(a) = \sum_i L(f(z_i, a), y_i)$

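Loss Function, First Attempt

• Per-example loss function ‘counts’ if error happens

$L(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ 1 & \text{otherwise} \end{cases}$

• Example

$z_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $y_1 = 1$; $z_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$, $y_2 = -1$; $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$

$f(z_1, a) = \mathrm{sign}(a^t z_1) = \mathrm{sign}(2 \cdot 1 - 3 \cdot 2) = \mathrm{sign}(-4) = -1$, so $L(f(z_1, a), y_1) = 1$

$f(z_2, a) = \mathrm{sign}(a^t z_2) = \mathrm{sign}(2 \cdot 1 - 3 \cdot 4) = \mathrm{sign}(-10) = -1$, so $L(f(z_2, a), y_2) = 0$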

Loss Function, First Attempt

• Per-example loss function ‘counts’ if error happens

$L(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ 1 & \text{otherwise} \end{cases}$

• Total loss function counts total number of errors

$L(a) = \sum_i L(f(z_i, a), y_i)$

• For previous example, with $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$: $L(f(z_1, a), y_1) = 1$ and $L(f(z_2, a), y_2) = 0$

• Total loss $L(a) = 1 + 0 = 1$

Loss Function: First Attempt

• ‘error count’ loss function not suitable for gradient descent

• piecewise constant, gradient zero or does not exist

[Figure: piecewise constant L(a) plotted against a]

Perceptron Loss Function

• Different Loss Function: Perceptron Loss

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Lp(a) is non-negative

• positive misclassified example zi: $a^t z_i < 0$ and $y_i = 1$, so $y_i (a^t z_i) < 0$

• negative misclassified example zi: $a^t z_i > 0$ and $y_i = -1$, so $y_i (a^t z_i) < 0$

• if zi is misclassified then $y_i (a^t z_i) < 0$

• if zi is misclassified then $-y_i (a^t z_i) > 0$

• Lp(a) proportional to distance of misclassified example to boundary

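Perceptron Loss Function

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Example

$z_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}$, $y_1 = 1$; $z_2 = \begin{bmatrix} 1 \\ 4 \end{bmatrix}$, $y_2 = -1$; $a = \begin{bmatrix} 2 \\ -3 \end{bmatrix}$

$f(z_1, a) = \mathrm{sign}(a^t z_1) = \mathrm{sign}(2 \cdot 1 - 3 \cdot 2) = \mathrm{sign}(-4) = -1$, so $L_p(f(z_1, a), y_1) = 4$

$f(z_2, a) = \mathrm{sign}(a^t z_2) = \mathrm{sign}(2 \cdot 1 - 3 \cdot 4) = \mathrm{sign}(-10) = -1$, so $L_p(f(z_2, a), y_2) = 0$

• Total loss Lp(a) = 4 + 0 = 4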
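The error-count loss and the perceptron loss can be compared on the two-example problem above. This is a minimal sketch; the function names are illustrative, and treating sign(0) as -1 is an assumption that does not affect this example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def predict(z, a):
    """f(z, a) = sign(a^t z), with sign(0) taken as -1 by assumption."""
    return 1 if dot(a, z) > 0 else -1


def zero_one_loss(z, y, a):
    """Error-count loss: 0 if classified correctly, 1 otherwise."""
    return 0 if predict(z, a) == y else 1


def perceptron_loss(z, y, a):
    """Perceptron loss: 0 if classified correctly, -y * a^t z otherwise."""
    return 0 if predict(z, a) == y else -y * dot(a, z)


examples = [([1, 2], 1), ([1, 4], -1)]  # (augmented z_i, label y_i)
a = [2, -3]

total_zero_one = sum(zero_one_loss(z, y, a) for z, y in examples)
total_perceptron = sum(perceptron_loss(z, y, a) for z, y in examples)
print(total_zero_one, total_perceptron)
```

The error count only records that one example is wrong, while the perceptron loss also measures how far on the wrong side it lies.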

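Perceptron Loss Function

• Per-example loss

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Total loss

$L_p(a) = \sum_i L_p(f(z_i, a), y_i)$

• Lp(a) is piecewise linear and suitable for gradient descent

[Figure: piecewise linear Lp(a) plotted against a]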

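Optimizing with Gradient Descent

• Per-example loss

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Total loss

$L_p(a) = \sum_i L_p(f(z_i, a), y_i)$

• Gradient descent to minimize Lp(a)

$a = a - \alpha \nabla L_p(a)$

• Need gradient vector $\nabla L_p(a)$

• the same dimension as a

• if $a = \begin{bmatrix} a_1 \\ a_2 \\ a_3 \end{bmatrix}$ then $\nabla L_p(a) = \begin{bmatrix} \frac{\partial L_p}{\partial a_1} \\ \frac{\partial L_p}{\partial a_2} \\ \frac{\partial L_p}{\partial a_3} \end{bmatrix}$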

Optimizing with Gradient Descent

• Gradient descent to minimize Lp(a)

$a = a - \alpha \nabla L_p(a)$

• Total loss

$L_p(a) = \sum_i L_p(f(z_i, a), y_i)$

• Compute

$\nabla L_p(a) = \nabla \sum_i L_p(f(z_i, a), y_i) = \sum_i \nabla L_p(f(z_i, a), y_i)$

• per example gradient

$\nabla L_p(f(z_i, a), y_i) = \begin{bmatrix} \frac{\partial}{\partial a_1} L_p(f(z_i, a), y_i) \\ \frac{\partial}{\partial a_2} L_p(f(z_i, a), y_i) \\ \frac{\partial}{\partial a_3} L_p(f(z_i, a), y_i) \end{bmatrix}$

• Compute and add up per example gradients

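Per Example Loss Gradient

• Per-example loss has two cases

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• First case, f(zi,a) = yi

$\nabla L_p(f(z_i, a), y_i) = \begin{cases} [0, 0, 0]^t & \text{if } f(z_i, a) = y_i \\ ? & \text{otherwise} \end{cases}$

• To save space, rewrite

$\nabla L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ ? & \text{otherwise} \end{cases}$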

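Per Example Loss Gradient

• Per-example loss has two cases

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Second case, f(zi,a) ≠ yi, where $z_i = [z_{i1}, z_{i2}, z_{i3}]^t$

$\nabla L_p(f(z_i, a), y_i) = \begin{bmatrix} \frac{\partial}{\partial a_1} \left( -y_i (a^t z_i) \right) \\ \frac{\partial}{\partial a_2} \left( -y_i (a^t z_i) \right) \\ \frac{\partial}{\partial a_3} \left( -y_i (a^t z_i) \right) \end{bmatrix} = \begin{bmatrix} \frac{\partial}{\partial a_1} \left( -y_i (a_1 z_{i1} + a_2 z_{i2} + a_3 z_{i3}) \right) \\ \frac{\partial}{\partial a_2} \left( -y_i (a_1 z_{i1} + a_2 z_{i2} + a_3 z_{i3}) \right) \\ \frac{\partial}{\partial a_3} \left( -y_i (a_1 z_{i1} + a_2 z_{i2} + a_3 z_{i3}) \right) \end{bmatrix} = \begin{bmatrix} -y_i z_{i1} \\ -y_i z_{i2} \\ -y_i z_{i3} \end{bmatrix} = -y_i z_i$

• Gradient for per-example loss

$\nabla L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i z_i & \text{otherwise} \end{cases}$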

Optimizing with Gradient Descent

• Simpler formula

$\nabla L_p(a) = -\sum_{\text{misclassified examples } i} y_i z_i$

• Gradient descent update rule for Lp(a)

$a = a + \alpha \sum_{\text{misclassified examples } i} y_i z_i$

• called batch because it is based on all examples

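Perceptron Loss Batch Example

• Examples

class 1: $x_1 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$, $x_2 = \begin{bmatrix} 4 \\ 3 \end{bmatrix}$, $x_3 = \begin{bmatrix} 3 \\ 5 \end{bmatrix}$

class 2: $x_4 = \begin{bmatrix} 1 \\ 3 \end{bmatrix}$, $x_5 = \begin{bmatrix} 5 \\ 6 \end{bmatrix}$

• Labels

y1 = 1, y2 = 1, y3 = 1, y4 = -1, y5 = -1

• Add extra feature

$z_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}$, $z_2 = \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix}$, $z_3 = \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix}$, $z_4 = \begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix}$, $z_5 = \begin{bmatrix} 1 \\ 5 \\ 6 \end{bmatrix}$

• Pile all examples as rows in matrix Z

$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$

• Pile all labels into column vector Y

$Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$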

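Perceptron Loss Batch Example

• Examples in Z, labels in Y

$Z = \begin{bmatrix} 1 & 2 & 3 \\ 1 & 4 & 3 \\ 1 & 3 & 5 \\ 1 & 1 & 3 \\ 1 & 5 & 6 \end{bmatrix}$, $Y = \begin{bmatrix} 1 \\ 1 \\ 1 \\ -1 \\ -1 \end{bmatrix}$

• Initial weights $a = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$

• This is line x1 + x2 + 1 = 0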

Perceptron Loss Batch Example

• Perceptron Batch

$a = a + \alpha \sum_{\text{misclassified examples } i} y_i z_i$

• Sample misclassified if $y (a^t z) < 0$

• Let us use learning rate α = 0.2

$a = a + 0.2 \sum_{\text{misclassified examples } i} y_i z_i$

Perceptron Loss Batch Example

• With $a = [1, 1, 1]^t$, can compute $y (a^t z)$ for all samples

Y .* (Z * a) = $[1, 1, 1, -1, -1]^t$ .* $[6, 8, 9, 5, 12]^t$ = $[6, 8, 9, -5, -12]^t$

• Sample misclassified if $y (a^t z) < 0$

• Samples 4 and 5 misclassified

• Per example loss is

$L_p(f(z_i, a), y_i) = \begin{cases} 0 & \text{if } f(z_i, a) = y_i \\ -y_i (a^t z_i) & \text{otherwise} \end{cases}$

• Total loss is L(a) = 5 + 12 = 17

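Perceptron Loss Batch Example

• Perceptron Batch rule update, samples 4 and 5 misclassified

$a = a + 0.2 \sum_{\text{misclassified examples } i} y_i z_i = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} + 0.2 \left( -\begin{bmatrix} 1 \\ 1 \\ 3 \end{bmatrix} - \begin{bmatrix} 1 \\ 5 \\ 6 \end{bmatrix} \right) = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} - \begin{bmatrix} 0.2 \\ 0.2 \\ 0.6 \end{bmatrix} - \begin{bmatrix} 0.2 \\ 1 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix}$

• This is line -0.2x1 - 0.8x2 + 0.6 = 0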

Perceptron Loss Batch Example

• With $a = [0.6, -0.2, -0.8]^t$, find all misclassified samples

Y .* (Z * a) = $[-2.2, -2.6, -4.0, 2, 5.2]^t$

• Sample misclassified if $y (a^t z) < 0$

• Total loss is L(a) = 2.2 + 2.6 + 4 = 8.8

• previous loss was 17 with 2 misclassified examples

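Perceptron Loss Batch Example

• Perceptron Batch rule update, samples 1, 2 and 3 misclassified

$a = a + 0.2 \sum_{\text{misclassified examples } i} y_i z_i = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix} + 0.2 \left( \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} + \begin{bmatrix} 1 \\ 4 \\ 3 \end{bmatrix} + \begin{bmatrix} 1 \\ 3 \\ 5 \end{bmatrix} \right) = \begin{bmatrix} 0.6 \\ -0.2 \\ -0.8 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.4 \\ 0.6 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.8 \\ 0.6 \end{bmatrix} + \begin{bmatrix} 0.2 \\ 0.6 \\ 1 \end{bmatrix} = \begin{bmatrix} 1.2 \\ 1.6 \\ 1.4 \end{bmatrix}$

• This is line 1.6x1 + 1.4x2 + 1.2 = 0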

Perceptron Loss Batch Example

• With $a = [1.2, 1.6, 1.4]^t$, find all misclassified samples

Y .* (Z * a) = $[8.6, 11.8, 13.0, -7, -17.6]^t$

• Sample misclassified if $y (a^t z) < 0$

• Total loss is L(a) = 7 + 17.6 = 24.6

• previous loss was 8.8 with 3 misclassified examples

• loss went up, means learning rate of 0.2 is too high
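The batch example can be replayed step by step. The sketch below uses the matrix Z, labels Y, initial weights (1, 1, 1) and learning rate 0.2 from the example; the function names are illustrative and the code is a hedged reconstruction, not the author's implementation.

```python
Z = [[1, 2, 3], [1, 4, 3], [1, 3, 5], [1, 1, 3], [1, 5, 6]]  # augmented examples
Y = [1, 1, 1, -1, -1]                                        # labels


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def margins(a):
    """y_i * (a^t z_i) for every sample; negative means misclassified."""
    return [y * dot(a, z) for z, y in zip(Z, Y)]


def total_loss(a):
    """Total perceptron loss: sum of -y_i * (a^t z_i) over misclassified samples."""
    return sum(-m for m in margins(a) if m < 0)


def batch_update(a, alpha):
    """a = a + alpha * (sum of y_i * z_i over samples misclassified by the current a)."""
    new_a = list(a)
    for z, y, m in zip(Z, Y, margins(a)):
        if m < 0:
            new_a = [ai + alpha * y * zi for ai, zi in zip(new_a, z)]
    return new_a


a = [1.0, 1.0, 1.0]
losses = [total_loss(a)]
for _ in range(2):
    a = batch_update(a, alpha=0.2)
    losses.append(total_loss(a))

print(losses, a)
```

The recorded losses follow the sequence in the example: the loss drops after the first update and rises after the second, which is the signal that the learning rate is too high.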

Single-Sample Gradient Descent

• Batch gradient descent can be slow to converge if lots of examples

• Single sample optimization

• update weights a as soon as possible, after seeing 1 example

• One iteration (epoch)

• go over all examples, in random order

• update after seeing one example zi

$a = a - \alpha \nabla L_p(f(z_i, a), y_i)$, where $\nabla L_p(f(z_i, a), y_i)$ is the per example gradient

Batch Size: Loss Surface Illustration

[Figure: two loss surface panels, each with start, finish and global min marked. One shows batch gradient descent, one iteration. The other shows single sample gradient descent, one iteration, where each update sees only one example.]

Mini-Batch Gradient Descent

• Mini-Batch optimization

• update weights a after seeing m examples

• One iteration (epoch)

• go over all examples, in random order

• update after seeing m examples

$a = a - \alpha \sum_{m \text{ examples}} \nabla L_p(f(z_i, a), y_i)$, where $\nabla L_p(f(z_i, a), y_i)$ is the per example gradient

• Most often used in practice

• Practical Issue: both batch and mini-batch algorithms converge faster if features are roughly on the same scale
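Single-sample and mini-batch updates differ only in how many examples are seen before the weights change, so one routine with a batch size m can cover both. The sketch below makes several assumptions that are not in the slides: it visits examples in a fixed order (instead of random order) so the run is reproducible, starts from a = 0, treats $y (a^t z) \le 0$ as misclassified so that learning can start from zero weights, and reuses the 1D training set from the earlier example.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def minibatch_perceptron(Z, Y, m, alpha=1.0, max_epochs=100):
    """Perceptron training with an update after every m examples (m = 1: single sample)."""
    a = [0.0] * len(Z[0])
    for _ in range(max_epochs):
        updated = False
        for start in range(0, len(Z), m):
            # Sum y_i * z_i over the misclassified examples of this mini-batch.
            step = [0.0] * len(a)
            for z, y in zip(Z[start:start + m], Y[start:start + m]):
                if y * dot(a, z) <= 0:
                    step = [s + y * zi for s, zi in zip(step, z)]
                    updated = True
            a = [ai + alpha * s for ai, s in zip(a, step)]
        if not updated:  # a full pass without mistakes: stop
            break
    return a


xs = [-2, -1, 1, 2, 3, 5]
Y = [-1, -1, -1, 1, 1, 1]
Z = [[1.0, float(x)] for x in xs]  # augmented feature vectors

a_single = minibatch_perceptron(Z, Y, m=1)
a_mini = minibatch_perceptron(Z, Y, m=2)
errors_single = sum(1 for z, y in zip(Z, Y) if y * dot(a_single, z) <= 0)
errors_mini = sum(1 for z, y in zip(Z, Y) if y * dot(a_mini, z) <= 0)
print(a_single, a_mini)
```

On this separable toy set both settings reach a = (-3, 2), that is the boundary x = 1.5, but the mini-batch run needs fewer passes over the data.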

Logistic Regression, Motivation

• Recall line fitting

[Figure: two panels plotting the labels 1 and -1 against z. Solution with squared error loss: one sample misclassified. Solution with perceptron loss: all samples classified correctly.]

• Instead of trying to get $a^t z$ close to y

• use differentiable function $\sigma(a^t z)$ that “squishes” the range

• try to get $\sigma(a^t z)$ close to y

Linear Classifier: Logistic Regression

• Use logistic sigmoid function σ(t) for “squishing” $a^t z$

$\sigma(t) = \frac{1}{1 + \exp(-t)}$

[Figure: σ(t) plotted against t, rising from 0 toward 1 and passing through 0.5]

• Denote classes with 1 and 0 now

• yi = 1 for positive class, yi = 0 for negative

• Despite “regression” in the name, logistic regression is used for classification, not regression

Logistic Regression vs. Regression

[Figure: $\sigma(a^t z)$ plotted against $a^t z$ with the levels 0.5 and 1 marked, comparing the quadratic loss fit with the logistic regression loss fit]

Logistic Regression: Loss Function

• Could use $(y_i - \sigma(a^t z_i))^2$ as per-example loss function

• Instead use a different loss

• if z has label 1, want $\sigma(a^t z)$ close to 1, define loss as $-\log[\sigma(a^t z)]$

• if z has label 0, want $\sigma(a^t z)$ close to 0, define loss as $-\log[1 - \sigma(a^t z)]$

[Figure: -log t plotted against t, reaching 0 at t = 1]

• Gradient descent batch update rule

$a = a + \alpha \sum_i \left( y_i - \sigma(a^t z_i) \right) z_i$

• Probabilistic interpretation

• P(class 1) = $\sigma(a^t z)$

• P(class 0) = 1 - P(class 1)

• loss function is negative log-likelihood, -log P(y)

• standard objective in statistics
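The logistic regression loss and its batch update rule can be sketched as follows. The data set (the 1D training set from the earlier example, relabeled with 0 and 1), the learning rate 0.05, and the number of steps are illustrative assumptions rather than values from the slides.

```python
import math


def sigmoid(t):
    """Logistic sigmoid: 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def total_loss(Z, Y, a):
    """Negative log-likelihood: -log(sigma) for label 1, -log(1 - sigma) for label 0."""
    loss = 0.0
    for z, y in zip(Z, Y):
        p = sigmoid(dot(a, z))
        loss += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return loss


def batch_step(Z, Y, a, alpha):
    """a = a + alpha * sum_i (y_i - sigma(a^t z_i)) z_i."""
    step = [0.0] * len(a)
    for z, y in zip(Z, Y):
        err = y - sigmoid(dot(a, z))
        step = [s + err * zi for s, zi in zip(step, z)]
    return [ai + alpha * s for ai, s in zip(a, step)]


xs = [-2, -1, 1, 2, 3, 5]
Y = [0, 0, 0, 1, 1, 1]             # classes denoted with 0 and 1
Z = [[1.0, float(x)] for x in xs]  # augmented feature vectors

a = [0.0, 0.0]
initial_loss = total_loss(Z, Y, a)
for _ in range(200):
    a = batch_step(Z, Y, a, alpha=0.05)
final_loss = total_loss(Z, Y, a)
print(initial_loss, final_loss, a)
```

With zero weights every example has probability 0.5, so the starting loss is 6·log 2; the batch updates then drive it down.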

Logistic Regression vs. Perceptron

• Green example classified correctly, but close to decision boundary

• no loss under Perceptron

• non-negligible loss under logistic regression

[Figure: $\sigma(w^t x)$ plotted against x, with the levels 0.5 and 1 marked]

• Logistic Regression encourages decision boundary to move away from training samples

• better generalization

[Figure: two classifiers on the same data. One has zero Perceptron loss and smaller LR loss, the other has zero Perceptron loss and larger LR loss.]

• red classifier works better for new data

Logistic Regression vs. Regression vs. Perceptron

• Assuming labels are +1 and -1

[Figure: quadratic loss, perceptron loss and logistic regression loss plotted against $y\,a^t z$, with three regions marked: misclassified; classified correctly but close to decision boundary; classified correctly and not too close to the decision boundary]

Multiple Classes: General Case

• Define m linear discriminant functions

$g_i(x) = w_i^t x + w_{i0}$ for i = 1, 2, … m

• Assign x to class i if

gi(x) > gj(x) for all j ≠ i

• Let Ri be decision region for class i

• all points in Ri assigned to class i

[Figure: three decision regions. R1: g1(x) > g2(x) and g1(x) > g3(x). R2: g2(x) > g1(x) and g2(x) > g3(x). R3: g3(x) > g1(x) and g3(x) > g2(x).]

Multiple Classes

• Can be shown that decision regions are convex

• In particular, they must be spatially contiguous

Multiclass Linear Classifier: Matrix Notation

• Assume examples x are augmented with extra feature 1, no need to write bias explicitly

• but from now on will not change notation to z’s

• Define m discriminant functions

$g_i(x) = w_i^t x$ for i = 1, 2, … m

• Assign x to i that gives maximum gi(x)

• Picture illustration

x → g1(x) = 5, g2(x) = 3, g3(x) = -9, g4(x) = 10

• pile all outputs into one vector $[5, 3, -9, 10]^t$

• decide class 4

Multiclass Linear Classifier: Matrix Notation

• Could use one dimensional output yi ∊ {1, 2, 3, …, m}

• Convenient to use multi-dimensional outputs (one-hot encoding)

class 1: $y_j = [1, 0, 0, 0]^t$, class 2: $y_j = [0, 1, 0, 0]^t$, class 3: $y_j = [0, 0, 1, 0]^t$, class 4: $y_j = [0, 0, 0, 1]^t$

• if sample is of class i, want output vector to be 0 everywhere except position i, where it should be 1

• Example: x is of class 2

• got this: $[g_1(x), g_2(x), g_3(x), g_4(x)]^t = [5, 3, -9, 10]^t$

• want this: $[0, 1, 0, 0]^t$

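• Assign x to i that gives maximum $g_i(x) = w_i^t x$

• In matrix notation, the rows of W are $w_1^t, w_2^t, w_3^t, w_4^t$

$Wx = \begin{bmatrix} w_1^t x \\ w_2^t x \\ w_3^t x \\ w_4^t x \end{bmatrix} = \begin{bmatrix} g_1(x) \\ g_2(x) \\ g_3(x) \\ g_4(x) \end{bmatrix}$

$W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$, $x = \begin{bmatrix} 1 \\ 7 \\ 4 \end{bmatrix}$, $Wx = \begin{bmatrix} 2 \\ -4 \\ 47 \\ -43 \end{bmatrix}$

• Assign x to class that corresponds to largest row of Wx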
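The matrix form of the multiclass decision rule can be sketched in a few lines. The matrix W and the sample x = (1, 7, 4) are the ones used in the slides' example; the function names and the 1-based class numbering helper are illustrative.

```python
W = [[2, 4, -7],
     [9, -3, 2],
     [4, 5, 2],
     [2, -7, 1]]  # row i holds the weight vector w_i of class i


def scores(W, x):
    """Compute Wx: one discriminant value g_i(x) = w_i^t x per class."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def predict_class(W, x):
    """Return the 1-based index of the largest entry of Wx."""
    s = scores(W, x)
    return s.index(max(s)) + 1


def one_hot(label, m):
    """One-hot target vector for a 1-based class label among m classes."""
    return [1 if i == label else 0 for i in range(1, m + 1)]


x = [1, 7, 4]  # already augmented with the extra feature 1
wx = scores(W, x)
predicted = predict_class(W, x)
target = one_hot(4, 4)
print(wx, predicted, target)
```

Here the largest entry of Wx is in the third row, so x is assigned to class 3.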

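Multiclass Linear Classifier: Matrix Notation

• Assign sample xi to class that corresponds to largest row of Wxi

$W x_i = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 7 \\ 4 \end{bmatrix} = \begin{bmatrix} 2 \\ -4 \\ 47 \\ -43 \end{bmatrix}$, $y_i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$

• Loss function?

• Can use quadratic loss per sample xi as $\frac{1}{2} \| W x_i - y_i \|^2$

• for example above, loss $(2^2 + 4^2 + 47^2 + 44^2)/2$

• total loss on all training samples $L(W) = \frac{1}{2} \sum_i \| W x_i - y_i \|^2$

• gradient of the loss

$\nabla L(W) = \sum_i (W x_i - y_i) x_i^t$

• $\nabla L(W)$ has the same shape as W

• batch gradient descent update

$W = W - \alpha \sum_i (W x_i - y_i) x_i^t$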

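Quadratic Loss Function

• Consider gradient descent update, single sample x with α = 1

$W = W - (Wx - y) x^t$

• Suppose $x = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$ is of class 2 and $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$

$Wx - y = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix}$

[Figure annotations on the entries of Wx - y: ok, too small, too large]

• update rule

$W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} - \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix} \begin{bmatrix} 1 & 3 & 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} - \begin{bmatrix} 0 & 0 & 0 \\ 3 & 9 & 6 \\ 23 & 69 & 46 \\ -17 & -51 & -34 \end{bmatrix} = \begin{bmatrix} 2 & 4 & -7 \\ 6 & -12 & -4 \\ -19 & -64 & -44 \\ 19 & 44 & 35 \end{bmatrix}$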

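Quadratic Loss Function

• With new $W = \begin{bmatrix} 2 & 4 & -7 \\ 6 & -12 & -4 \\ -19 & -64 & -44 \\ 19 & 44 & 35 \end{bmatrix}$, $Wx = \begin{bmatrix} 0 \\ -38 \\ -299 \\ 221 \end{bmatrix}$

• x is of class 2, but the largest entry of Wx is now in row 4, so x is still misclassified

• before the update, $Wx - y = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} - \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 3 \\ 23 \\ -17 \end{bmatrix}$

• Already saw that quadratic loss does not work that well for classification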
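The single-sample quadratic-loss update can be replayed numerically. The values of W, x, y and α = 1 are taken from the example; the function names are illustrative and the code is a sketch of the update rule W = W - α(Wx - y)x^t.

```python
def matvec(W, x):
    """Compute Wx."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def quadratic_step(W, x, y, alpha):
    """Single-sample update W = W - alpha * (Wx - y) x^t."""
    residual = [s - yi for s, yi in zip(matvec(W, x), y)]
    return [[wij - alpha * r * xj for wij, xj in zip(row, x)]
            for row, r in zip(W, residual)]


W = [[2, 4, -7],
     [9, -3, 2],
     [4, 5, 2],
     [2, -7, 1]]
x = [1, 3, 2]     # augmented sample
y = [0, 1, 0, 0]  # one-hot target: class 2

scores_before = matvec(W, x)
W_new = quadratic_step(W, x, y, alpha=1)
scores_after = matvec(W_new, x)
print(scores_before, W_new, scores_after)
```

After the update the score of class 4 jumps far above the others, so the sample is still misclassified; the step overshoots rather than fixes the prediction.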

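Softmax Function

• Define softmax(a) function

$\mathrm{softmax}\left( \begin{bmatrix} a_1 \\ a_2 \\ a_3 \\ a_4 \end{bmatrix} \right) = \begin{bmatrix} \exp(a_1) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_2) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_3) / \sum_{j=1}^{4} \exp(a_j) \\ \exp(a_4) / \sum_{j=1}^{4} \exp(a_j) \end{bmatrix}$

• Example

$\mathrm{softmax}\left( \begin{bmatrix} -3 \\ 2 \\ 1 \end{bmatrix} \right) = \begin{bmatrix} \exp(-3) / (\exp(-3) + \exp(2) + \exp(1)) \\ \exp(2) / (\exp(-3) + \exp(2) + \exp(1)) \\ \exp(1) / (\exp(-3) + \exp(2) + \exp(1)) \end{bmatrix} \approx \begin{bmatrix} 0.005 \\ 0.7275 \\ 0.2676 \end{bmatrix}$

• Softmax renormalizes a vector so that it can be interpreted as a vector of probabilities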
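A short sketch of the softmax function, applied to the example vector (-3, 2, 1). Subtracting the maximum before exponentiating is a common numerical-stability device added here as an assumption; it does not change the result.

```python
import math


def softmax(a):
    """exp(a_i) / sum_j exp(a_j), computed after shifting by max(a) for stability."""
    shift = max(a)
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([-3, 2, 1])
print([round(p, 4) for p in probs])
```

The outputs are positive and sum to 1, and the ordering of the inputs is preserved: the largest score receives the largest probability.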

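Softmax Loss Function

• Generalization of logistic regression to multiclass case

• Instead of raw scores

$\begin{bmatrix} w_1^t x \\ w_2^t x \\ w_3^t x \\ w_4^t x \end{bmatrix} = \begin{bmatrix} 2 \\ -1 \\ 5 \\ -3 \end{bmatrix}$

• Use softmax scores

$\mathrm{softmax}\left( \begin{bmatrix} 2 \\ -1 \\ 5 \\ -3 \end{bmatrix} \right) = \begin{bmatrix} \Pr(\text{class } 1) \\ \Pr(\text{class } 2) \\ \Pr(\text{class } 3) \\ \Pr(\text{class } 4) \end{bmatrix} = \begin{bmatrix} 0.0473 \\ 0.0024 \\ 0.9500 \\ 0.0003 \end{bmatrix}$

• Classifier output interpreted as probability for each class

• Optimize under $-\log \Pr(y_i)$ loss function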

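Gradient Descent: Softmax Loss Function

• Example

$x_i = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$, $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$, $y_i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$

$\mathrm{softmax}(W x_i) = \mathrm{softmax}\left( \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} \right) = \begin{bmatrix} \Pr(\text{class } 1) \\ \Pr(\text{class } 2) \\ \Pr(\text{class } 3) \\ \Pr(\text{class } 4) \end{bmatrix} \approx \begin{bmatrix} 0.000000000102619 \\ 0.000000005602796 \\ 0.999999994294585 \\ 4.2 \times 10^{-18} \end{bmatrix}$

• Loss on this example is $-\log \Pr(\text{class } 4) \approx -\log(4.2 \times 10^{-18}) \approx 40$

• Update rule for weight matrix W

$W = W + \alpha \sum_i \left( y_i - \mathrm{softmax}(W x_i) \right) x_i^t$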

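Gradient Descent: Softmax Loss Function

• Example, single sample gradient descent with α = 0.1

$x_i = \begin{bmatrix} 1 \\ 3 \\ 2 \end{bmatrix}$, $W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix}$, $y_i = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}$, $W x_i = \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix}$

• Update for W

$W = \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 4 & 5 & 2 \\ 2 & -7 & 1 \end{bmatrix} + 0.1 \left( \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix} - \mathrm{softmax}\left( \begin{bmatrix} 0 \\ 4 \\ 23 \\ -17 \end{bmatrix} \right) \right) \begin{bmatrix} 1 & 3 & 2 \end{bmatrix} \approx \begin{bmatrix} 2 & 4 & -7 \\ 9 & -3 & 2 \\ 3.9 & 4.7 & 1.8 \\ 2.1 & -6.7 & 1.2 \end{bmatrix}$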
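The softmax loss and the single-sample update can be checked numerically with the values from the example (W, x = (1, 3, 2), target class 4, α = 0.1). This is a hedged sketch with illustrative function names; the natural logarithm is assumed for the loss, and the max-shift inside softmax is a numerical-stability choice.

```python
import math


def softmax(a):
    shift = max(a)  # shift for numerical stability; result is unchanged
    exps = [math.exp(v - shift) for v in a]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in W]


def softmax_loss(W, x, y):
    """-log of the probability assigned to the true class (y is one-hot)."""
    probs = softmax(matvec(W, x))
    return -math.log(probs[y.index(1)])


def softmax_step(W, x, y, alpha):
    """Single-sample update W = W + alpha * (y - softmax(Wx)) x^t."""
    probs = softmax(matvec(W, x))
    return [[wij + alpha * (yi - pi) * xj for wij, xj in zip(row, x)]
            for row, yi, pi in zip(W, y, probs)]


W = [[2, 4, -7], [9, -3, 2], [4, 5, 2], [2, -7, 1]]
x = [1, 3, 2]
y = [0, 0, 0, 1]  # one-hot target: class 4

loss = softmax_loss(W, x, y)
W_new = softmax_step(W, x, y, alpha=0.1)
print(round(loss, 3), [[round(v, 3) for v in row] for row in W_new])
```

Unlike the quadratic-loss step, this update moves only the rows of the over-confident class 3 and the true class 4 by a bounded amount, because each entry of y - softmax(Wx) lies between -1 and 1.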

More General Discriminant Functions

• Linear discriminant functions

• simple decision boundary

• should try simpler models first to avoid overfitting

• optimal for certain type of data

• Gaussian distributions with equal covariance

• May not be optimal for other data distributions

• Discriminant functions can be more general than linear

• For example, polynomial discriminant functions

• Decision boundaries more complex than linear

• Later will look more at non-linear discriminant functions

Summary

• Linear classifier works well when examples are linearly separable, or almost separable

• Perceptron Loss function was the first historic loss

• Logistic regression/softmax work better in practice

• Optimization with gradient descent

• stochastic mini-batch works best in practice

Linear Classifier: Quadratic Loss

• Quadratic per-example loss

$L_p(f(z_i, a), y_i) = \frac{1}{2} \left( y_i - a^t z_i \right)^2$

• This is standard line fitting

• note that even correctly classified examples can have a large loss

[Figure: the line $a^t z$ fitted to the labels 1 and -1 plotted against z]

• Can find optimal weight a analytically with least squares

• expensive for large problems

• Gradient descent more efficient for a larger problem

$\nabla L_p(a) = -\sum_i \left( y_i - a^t z_i \right) z_i$

• Batch update rule

$a = a + \alpha \sum_i \left( y_i - a^t z_i \right) z_i$
