
Page 1:

Linear Models for Classification

Sumeet Agarwal, EEL709

(Most figures from Bishop, PRML)

Page 2:

Approaches to classification

● Discriminant function: Directly assigns each data point x to a particular class Ci

● Model the conditional class distribution p(Ci|x): allows separation of inference and decision

● Generative approach: model class likelihoods, p(x|Ci), and priors, p(Ci); use Bayes' theorem to get posteriors:

p(Ci|x) ∝ p(x|Ci)p(Ci)

Page 3:

Linear discriminant functions

y(x) = wTx + w0
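As a quick illustration (not from the slides; w and w0 are made-up values), a minimal NumPy sketch of evaluating this two-class discriminant and thresholding at zero:

```python
import numpy as np

def linear_discriminant(x, w, w0):
    """Evaluate y(x) = w^T x + w0; assign class 1 if y(x) >= 0, else class 2."""
    y = w @ x + w0
    return 1 if y >= 0 else 2

w = np.array([1.0, -2.0])    # made-up weight vector
w0 = 0.5                     # made-up bias
print(linear_discriminant(np.array([3.0, 1.0]), w, w0))   # y = 1.5 -> class 1
```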

Page 4:

Multiple Classes

Problem of ambiguous regions when combining several two-class discriminants.

Page 5:

Multiple Classes

Consider a single K-class discriminant, with K linear functions:

yk(x) = wkTx + wk0

and assign x to class Ck if yk(x) > yj(x) for all j ≠ k.

This implies singly connected and convex decision regions.
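A minimal sketch of this decision rule (the weights here are made up), computing all K discriminants at once and taking the argmax:

```python
import numpy as np

def k_class_discriminant(x, W, w0):
    """Assign x to argmax_k y_k(x), with y_k(x) = w_k^T x + w_k0.
    W: (K, D) array whose row k is w_k; w0: (K,) array of biases."""
    y = W @ x + w0             # all K discriminant values
    return int(np.argmax(y))   # index of the winning class

# Illustrative 3-class, 2-feature example
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.array([0.0, 0.0, 0.5])
print(k_class_discriminant(np.array([2.0, 0.5]), W, w0))   # -> class 0
```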

Page 6:

Least squares for classification

Too sensitive to outliers.

Page 7:

Least squares for classification

Problematic because the target values evidently have a non-Gaussian distribution, while least squares corresponds to maximum likelihood under a Gaussian noise model.
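As a concrete illustration (not from the slides; the data here is made up), a minimal sketch of the standard least-squares classifier with one-hot target coding, solved via the pseudo-inverse:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two made-up Gaussian blobs, one per class
X = np.vstack([rng.normal([0, 0], 1.0, (50, 2)),
               rng.normal([4, 4], 1.0, (50, 2))])
Phi = np.hstack([np.ones((100, 1)), X])        # prepend a bias feature
T = np.zeros((100, 2))                          # one-hot (1-of-K) targets
T[:50, 0] = 1
T[50:, 1] = 1

W = np.linalg.pinv(Phi) @ T                     # least-squares solution
pred = np.argmax(Phi @ W, axis=1)               # predicted class per point
print((pred == np.repeat([0, 1], 50)).mean())   # training accuracy
```

Adding a few far-away points to one class would visibly drag this boundary, which is the outlier sensitivity referred to above.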

Page 8:

Fisher's linear discriminant

A linear classification model is like a 1-D projection of the data: y = wTx. Thus we need to find a decision threshold along this 1-D projection (a line). The simplest measure of class separation is the distance between the projected means:

m2 – m1 = wT(m2 – m1)

If the classes have nondiagonal covariances, then a better idea is to use the Fisher criterion:

J(w) = (m2 – m1)² / (s1² + s2²)

where s1² denotes the variance of class 1 in the 1-D projection.

Maximising J(w) attempts to give a large separation between the projected class means, but also a small variance within each class.
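Maximising J(w) has the standard closed-form solution w ∝ SW⁻¹(m2 – m1), where SW is the within-class scatter matrix. A minimal NumPy sketch (the data is made up for illustration):

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher discriminant direction: w proportional to S_W^{-1} (m2 - m1)."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two classes' (unnormalised) covariances
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)

rng = np.random.default_rng(1)
cov = [[2.0, 1.5], [1.5, 2.0]]                  # strongly nondiagonal
X1 = rng.multivariate_normal([0, 0], cov, 100)
X2 = rng.multivariate_normal([3, 3], cov, 100)
w = fisher_direction(X1, X2)
print(w)   # project with w @ x, then pick a threshold along this direction
```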

Page 9:

Fisher's linear discriminant

[Figure: two panels comparing projection onto the line joining the class means with projection onto the Fisher discriminant direction]

Page 10:

The Perceptron

[Figure: perceptron schematic. Basis functions Φ1(x), Φ2(x), Φ3(x), Φ4(x) are weighted by w1, w2, w3, w4 and summed to give wTΦ(x); a step activation function f(·), with output values −1 and 1, then yields f(wTΦ(x))]

A non-linear transformation in the form of a step function is applied to the weighted sum of the input features. This is inspired by the way neurons appear to function, mimicking the action potential.
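A minimal sketch of this forward computation (assuming, purely for illustration, that the basis functions are just a bias term plus the raw inputs):

```python
import numpy as np

def step(a):
    """Step activation f(a): +1 if a >= 0, else -1 (mimicking a spike)."""
    return np.where(a >= 0, 1, -1)

def perceptron_output(x, w):
    """Compute f(w^T phi(x)) with phi(x) = [1, x] (bias plus raw features)."""
    phi = np.concatenate(([1.0], x))
    return int(step(w @ phi))

w = np.array([0.5, 1.0, -1.0])   # made-up weights
print(perceptron_output(np.array([2.0, 1.0]), w))   # w^T phi = 1.5 -> +1
```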

Page 11:

The perceptron criterion

● We'd like a weight vector w such that wTΦ(xi) > 0 for xi ∈ C1 (say, ti = 1) and wTΦ(xi) < 0 for xi ∈ C2 (ti = −1)

● Thus, we want wTΦ(xi)ti > 0 ∀i; those data points for which this is not true will be misclassified

● The perceptron criterion tries to minimise the 'magnitude' of misclassification, i.e., it tries to minimise -wTΦ(xi)ti for all misclassified points (the set of which is denoted by M):

EP(w) = -∑i∈M wTΦ(xi)ti

● Why not just count the number of misclassified points? Because this is a piecewise constant function of w, and thus the gradient is zero at most places, making optimisation hard
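To make this contrast concrete, a small sketch (made-up data) that computes both EP(w) and the raw misclassification count; the former varies smoothly with w while the latter is piecewise constant:

```python
import numpy as np

def perceptron_criterion(w, Phi, t):
    """Return E_P(w) = -sum over misclassified i of w^T phi(x_i) t_i,
    together with the number of misclassified points."""
    scores = (Phi @ w) * t        # w^T phi(x_i) t_i for every point
    mis = scores <= 0             # the misclassified set M
    return -scores[mis].sum(), int(mis.sum())

# Made-up data: rows of Phi are phi(x_i); targets t_i are in {-1, +1}
Phi = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, 0.5]])
t = np.array([1, -1, -1])
print(perceptron_criterion(np.array([0.2, 0.3]), Phi, t))   # ~ (0.35, 1)
```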

Page 12:

Learning by gradient descent

w(τ+1) = w(τ) – η∇EP(w)

= w(τ) + ηΦ(xi)ti (if xi is misclassified)

We can show that after this update, the error due to xi will be reduced:

-w(τ+1)TΦ(xi)ti = -w(τ)TΦ(xi)ti – (Φ(xi)ti)TΦ(xi)ti < -w(τ)TΦ(xi)ti

since (Φ(xi)ti)TΦ(xi)ti = ‖Φ(xi)‖² > 0, as ti² = 1 (having set η = 1, which can be done without loss of generality).
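Putting this update into a full training loop, a minimal sketch with η = 1 (the data is made up and linearly separable, so the loop terminates, by the convergence theorem on the next slide):

```python
import numpy as np

def train_perceptron(Phi, t, max_epochs=100):
    """Perceptron learning: w <- w + phi_i t_i whenever point i is misclassified."""
    w = np.zeros(Phi.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for phi_i, t_i in zip(Phi, t):
            if (w @ phi_i) * t_i <= 0:    # misclassified (or on the boundary)
                w = w + phi_i * t_i        # the update, with eta = 1
                mistakes += 1
        if mistakes == 0:                  # a full pass with no errors: done
            return w
    return w   # never converges if the data is not linearly separable

# Made-up linearly separable data, with a bias feature in column 0
Phi = np.array([[1, 2, 2], [1, 3, 3], [1, -1, -2], [1, -2, -1]], dtype=float)
t = np.array([1, 1, -1, -1])
print(train_perceptron(Phi, t))
```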

Page 13:

Perceptron convergence

The perceptron convergence theorem guarantees an exact solution in a finite number of steps for linearly separable data; for data that is not linearly separable, the algorithm never converges.

Page 14:

Gaussian Discriminant Analysis

● Generative approach, with class-conditional densities (likelihoods) modelled as Gaussians

For the case of two classes, we have:

p(C1|x) = σ(a), where a = ln [ p(x|C1)p(C1) / (p(x|C2)p(C2)) ]

and σ(a) = 1/(1 + e^(–a)) is the logistic sigmoid.

Page 15:

Gaussian Discriminant Analysis

● In the Gaussian case (with a shared covariance matrix Σ), we get

p(C1|x) = σ(wTx + w0), where w = Σ⁻¹(μ1 – μ2) and w0 = –½μ1TΣ⁻¹μ1 + ½μ2TΣ⁻¹μ2 + ln [p(C1)/p(C2)]

The assumption of equal covariance matrices leads to linear decision boundaries.

Page 16:

Gaussian Discriminant Analysis

Allowing for unequal covariance matrices for different classes leads to quadratic decision boundaries

Page 17:

Parameter estimation for GDA

Maximum likelihood estimators

Likelihood (assuming equal covariance matrices, with prior π = p(C1) and targets tn ∈ {0, 1}):

p(t | π, μ1, μ2, Σ) = ∏n [π N(xn|μ1, Σ)]^tn [(1 – π) N(xn|μ2, Σ)]^(1–tn)

Maximising gives π = N1/N, μ1 = (1/N1) ∑n tn xn, μ2 = (1/N2) ∑n (1 – tn) xn, and Σ as the N1/N- and N2/N-weighted average of the two per-class covariance matrices.
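A minimal NumPy sketch of these estimators (shared-covariance case; the data and names are illustrative assumptions):

```python
import numpy as np

def gda_fit(X, t):
    """ML estimates for two-class GDA with a shared covariance matrix.
    X: (N, D) features; t: (N,) targets in {0, 1}, with t_n = 1 for class 1."""
    N = len(t)
    N1 = int(t.sum())
    pi = N1 / N                                   # prior p(C1)
    mu1 = X[t == 1].mean(axis=0)
    mu2 = X[t == 0].mean(axis=0)
    S1 = np.cov(X[t == 1].T, bias=True)           # per-class covariances
    S2 = np.cov(X[t == 0].T, bias=True)
    Sigma = (N1 / N) * S1 + ((N - N1) / N) * S2   # weighted average
    return pi, mu1, mu2, Sigma

rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1.0, (60, 2)),   # class 1 samples
               rng.normal([3, 3], 1.0, (40, 2))])  # class 2 samples
t = np.concatenate([np.ones(60, dtype=int), np.zeros(40, dtype=int)])
print(gda_fit(X, t))
```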

Page 18:

Logistic Regression

● An example of a probabilistic discriminative model

● Rather than learning P(x|Ci) and P(Ci), attempts to directly learn P(Ci|x)

● Advantages: fewer parameters to learn, and more robust if the assumed form of the class-conditional densities is inaccurate

● We have seen how the class posterior for a two-class setting can be written as a logistic sigmoid acting on a linear function of the feature vector Φ:

p(C1|Φ) = y(Φ) = σ(wTΦ)

● This model is called logistic regression, even though it is a model for classification, not regression!

Page 19:

Parameter learning

● If we let yn = p(C1|Φn) = σ(wTΦn), with targets tn ∈ {0, 1}, then the likelihood function is

p(t|w) = ∏n yn^tn (1 – yn)^(1–tn)

and we can define a corresponding error, known as the cross-entropy:

E(w) = –ln p(t|w) = –∑n [tn ln yn + (1 – tn) ln(1 – yn)]

Page 20:

Parameter learning

● The derivative of the sigmoid function is given by dσ/da = σ(1 – σ)

● Using this, we can obtain the gradient of the error function with respect to w: ∇E(w) = ∑n (yn – tn)Φn

● Thus the contribution to the gradient from point n is given by the 'error' between the model prediction and the actual class label, (yn – tn), times the basis function vector for that point, Φn

● We could use this for sequential learning by gradient descent, exactly as for least-squares linear regression; a batch-mode sketch follows
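A minimal batch gradient-descent sketch using ∇E(w) = ∑n (yn – tn)Φn (the learning rate, iteration count, and data are all illustrative assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_logistic(Phi, t, eta=0.1, n_iters=2000):
    """Batch gradient descent on the cross-entropy error E(w)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        y = sigmoid(Phi @ w)            # predictions y_n = sigma(w^T phi_n)
        grad = Phi.T @ (y - t)          # sum_n (y_n - t_n) phi_n
        w -= eta * grad / len(t)        # averaged step, for a stable rate
    return w

# Made-up 1-D data with a bias feature; true boundary near x = 0
rng = np.random.default_rng(3)
x = rng.uniform(-3, 3, 200)
t = (x + 0.3 * rng.normal(size=200) > 0).astype(float)
Phi = np.column_stack([np.ones_like(x), x])
w = fit_logistic(Phi, t)
print(w)   # decision boundary at sigma = 0.5, i.e. x = -w[0]/w[1]
```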

Page 21:

Nonlinear basis functions