TRANSCRIPT
Lecture 8,9 – Linear Methods for Classification
Rice ELEC 697
Farinaz Koushanfar
Fall 2006
Summary
• Bayes Classifiers
• Linear Classifiers
• Linear regression of an indicator matrix
• Linear discriminant analysis (LDA)
• Logistic regression
• Separating hyperplanes
• Reading: Chapter 4, ESL (The Elements of Statistical Learning)
Bayes Classifier
• The marginal distribution of G is specified as a PMF $p_G(g)$, $g = 1, 2, \ldots, K$
• $f_{X|G}(x|G=g)$ is the conditional density of X given $G=g$
• The training set $(x_i, g_i)$, $i=1,\ldots,N$, consists of independent samples from the joint distribution $f_{X,G}(x,g) = p_G(g)\, f_{X|G}(x|G=g)$
• The loss of predicting $G^*$ when the truth is G is $L(G^*, G)$
• Classification goal: minimize the expected loss
– $E_{X,G}\, L(G(X), G) = E_X\big(E_{G|X}\, L(G(X), G)\big)$
Bayes Classifier (cont’d)
• It suffices to minimize $E_{G|X}\, L(G(X), G)$ pointwise for each X. The optimal classifier is $G(x) = \arg\min_g E_{G|X=x}\, L(g, G)$
• Under 0-1 loss, the Bayes rule is also known as the rule of maximum a posteriori probability: $G(x) = \arg\max_g \Pr(G=g|X=x)$
• Many classification algorithms estimate the Pr(G=g|X=x) and then apply the Bayes rule
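As a concrete illustration of the Bayes rule, here is a minimal Python sketch; the priors and Gaussian class-conditional densities are made-up values for illustration only.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical known class priors p_G(g) and class-conditional densities f_{X|G}.
priors = {0: 0.6, 1: 0.4}
densities = {
    0: multivariate_normal(mean=[0.0, 0.0], cov=np.eye(2)),
    1: multivariate_normal(mean=[2.0, 1.0], cov=np.eye(2)),
}

def bayes_classify(x):
    """Bayes rule under 0-1 loss: pick the class with the largest posterior.
    The posterior is proportional to p_G(g) * f_{X|G}(x | G=g)."""
    scores = {g: priors[g] * densities[g].pdf(x) for g in priors}
    return max(scores, key=scores.get)

print(bayes_classify([1.8, 0.9]))   # -> 1 (the point is much closer to the class-1 mean)
```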
Bayes classification rule
More About Linear Classification
• Since the predictor G(x) takes values in a discrete set $\mathcal{G}$, we can divide the input space into a collection of regions labeled according to the classification
• For K classes (1,2,…,K), the fitted linear model for the k-th indicator response variable is $\hat f_k(x) = \hat\beta_{k0} + \hat\beta_k^T x$
• The decision boundary between classes k and l is the set where $\hat f_k(x) = \hat f_l(x)$, an affine set or hyperplane: $\{x : (\hat\beta_{k0} - \hat\beta_{l0}) + (\hat\beta_k - \hat\beta_l)^T x = 0\}$
• Model a discriminant function $\delta_k(x)$ for each class, then classify x to the class with the largest value of $\delta_k(x)$
Linear Decision Boundary
• We require that some monotone transformation of $\delta_k$ or $\Pr(G=k|X=x)$ be linear in x for the decision boundaries to be linear
• Example for two classes: let the probability of class 1 be $p(x)$ and that of class 2 be $1-p(x)$, and apply the logit transformation $\log[p(x)/(1-p(x))] = \beta_0 + \beta^T x$. The decision boundary is the set of points with log-odds = 0
• Two popular methods that use the log-odds:
– Linear discriminant analysis, linear logistic regression
• A more direct approach: explicitly model the boundary between the two classes as linear. For a two-class problem with a p-dimensional input space, this means modeling the decision boundary as a hyperplane
• Two methods that use separating hyperplanes:
– Perceptron (Rosenblatt), optimally separating hyperplanes (Vapnik)
Generalizing Linear Decision Boundaries
• Expand the variable set X1,…,Xp by including squares and cross products, adding up to p(p+1)/2 additional variables
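A minimal sketch of this quadratic basis expansion (the function name and toy data are illustrative); a linear classifier in the expanded space then yields quadratic boundaries in the original space.

```python
import numpy as np
from itertools import combinations

def quadratic_expand(X):
    """Augment (n, p) inputs with squares and pairwise cross products,
    adding p + p*(p-1)/2 = p*(p+1)/2 extra columns."""
    n, p = X.shape
    squares = X ** 2
    crosses = (np.column_stack([X[:, i] * X[:, j] for i, j in combinations(range(p), 2)])
               if p > 1 else np.empty((n, 0)))
    return np.hstack([X, squares, crosses])

X = np.random.randn(5, 3)           # p = 3
print(quadratic_expand(X).shape)    # (5, 9): 3 originals + 3*(3+1)/2 = 6 quadratic terms
```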
Linear Regression of an Indicator Matrix
• For K classes, define K indicators $Y_k$, $k=1,\ldots,K$, with $Y_k = 1$ if $G = k$, else 0
• Indicator response matrix: each row contains a single 1, in the column of the observed class
Linear Regression of an Indicator Matrix (Cont’d)
• For N training data points, form the N×K indicator response matrix Y, a matrix of 0's and 1's
• Fit a linear regression to each column of Y simultaneously: $\hat{B} = (X^TX)^{-1}X^T Y$, so that $\hat{Y} = X\hat{B} = X(X^TX)^{-1}X^T Y$
• A new observation is classified as follows:
– Compute the fitted output (a K-vector): $\hat f(x)^T = (1, x^T)\hat{B}$
– Identify the largest component and classify accordingly: $\hat G(x) = \arg\max_{k\in\mathcal{G}} \hat f_k(x)$
• But… how good is the fit?
– One can verify that $\sum_{k\in\mathcal{G}} \hat f_k(x) = 1$ for any x
– However, $\hat f_k(x)$ can be negative or larger than 1
• We can also apply linear regression to a basis expansion h(X); as the size of the training set increases, adaptively add more basis functions
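A minimal numpy sketch of indicator-matrix regression, assuming labels coded 0,…,K-1 (names and toy data are illustrative):

```python
import numpy as np

def fit_indicator_regression(X, g, K):
    """Linear regression of an indicator matrix.
    X: (N, p) inputs, g: (N,) labels in {0,...,K-1}."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                         # N x K indicator response matrix
    X1 = np.hstack([np.ones((N, 1)), X])             # prepend intercept column
    B_hat, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # B_hat = (X^T X)^{-1} X^T Y
    return B_hat

def predict(B_hat, X):
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    F = X1 @ B_hat                                   # fitted K-vector f_hat(x) per row
    return F.argmax(axis=1)                          # classify to the largest component

# toy usage: two well-separated clouds
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 3])
g = np.array([0] * 20 + [1] * 20)
B = fit_indicator_regression(X, g, K=2)
print(predict(B, X)[:5], predict(B, X)[-5:])
```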
Linear Regression - Drawback
• For K ≥ 3 classes, and especially for large K, some classes can be masked by others
Linear Regression - Drawback
• For large K and small p, masking can naturally occur
• E.g. vowel recognition data viewed in a 2D subspace: K = 11 classes, p = 10 dimensions
Linear Regression and Projection*
• A linear regression function (here in 2D), $f(x; w) = w_0 + x^T w_1$
• Projects each point $x = [x_1\ x_2]^T$ onto a line parallel to $w_1$
• We can study how well the projected points $\{z_1, z_2, \ldots, z_n\}$, viewed as functions of $w_1$, are separated across the classes
* Slides courtesy of Tommi S. Jaakkola, MIT CSAIL
Projection and Classification
• By varying w1 we get different levels of separation between the projected points
Optimizing the Projection
• We would like to find the w1 that somehow maximizes the separation of the projected points across classes
• We can quantify the separation (overlap) in terms of means and variations of the resulting 1-D class distribution
Fisher Linear Discriminant: Preliminaries
• Class description in $\mathbb{R}^d$
– Class 0: $n_0$ samples, mean $\mu_0$, covariance $\Sigma_0$
– Class 1: $n_1$ samples, mean $\mu_1$, covariance $\Sigma_1$
• Projected class descriptions in $\mathbb{R}$
– Class 0: $n_0$ samples, mean $\mu_0^T w_1$, covariance $w_1^T \Sigma_0 w_1$
– Class 1: $n_1$ samples, mean $\mu_1^T w_1$, covariance $w_1^T \Sigma_1 w_1$
Fisher Linear Discriminant
• Estimation criterion: find $w_1$ that maximizes the separation of the projected class means relative to the projected within-class variation, $J(w_1) = \dfrac{(\mu_1^T w_1 - \mu_0^T w_1)^2}{w_1^T \Sigma_0 w_1 + w_1^T \Sigma_1 w_1}$
• The solution (class separation) $w_1 \propto (\Sigma_0 + \Sigma_1)^{-1}(\mu_1 - \mu_0)$ is decision-theoretically optimal for two normal populations with equal covariances ($\Sigma_1 = \Sigma_0$)
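A small numpy sketch of the Fisher direction under these assumptions (toy Gaussian data; names are illustrative):

```python
import numpy as np

def fisher_direction(X0, X1):
    """Fisher discriminant direction w1, proportional to (Sigma0 + Sigma1)^{-1} (mu1 - mu0)."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S0 = np.cov(X0, rowvar=False)
    S1 = np.cov(X1, rowvar=False)
    w1 = np.linalg.solve(S0 + S1, mu1 - mu0)
    return w1 / np.linalg.norm(w1)

rng = np.random.default_rng(0)
X0 = rng.normal([0, 0], 1.0, size=(100, 2))
X1 = rng.normal([2, 1], 1.0, size=(100, 2))
w1 = fisher_direction(X0, X1)
z0, z1 = X0 @ w1, X1 @ w1          # projected points
print(w1, z0.mean(), z1.mean())    # the two projected class means separate along w1
```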
Linear Discriminant Analysis (LDA)
• Let $\pi_k$ denote the class prior $\Pr(G=k)$
• Let $f_k(x)$ denote the density of X in class $G=k$
• Bayes theorem: $\Pr(G=k|X=x) = \dfrac{f_k(x)\,\pi_k}{\sum_{l=1}^K f_l(x)\,\pi_l}$
• Leads to LDA, QDA, MDA (mixture DA), Kernel DA, Naïve Bayes
• Suppose that we model each class density as a multivariate Gaussian (MVG): $f_k(x) = \dfrac{1}{(2\pi)^{p/2}|\Sigma_k|^{1/2}} \exp\!\big(-\tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k)\big)$
• LDA arises when we assume the classes share a common covariance matrix, $\Sigma_k = \Sigma$ for all k. It is then sufficient to look at the log-odds
LDA
• The log-odds function $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=l|X=x)} = \log\dfrac{\pi_k}{\pi_l} - \tfrac{1}{2}(\mu_k+\mu_l)^T\Sigma^{-1}(\mu_k-\mu_l) + x^T\Sigma^{-1}(\mu_k-\mu_l)$ is linear in x, so the decision boundary between classes k and l, the set where $\Pr(G=k|X=x)=\Pr(G=l|X=x)$, is a hyperplane in p dimensions
• Example: three classes and p=2
LDA (Cont’d)
• The linear discriminant functions $\delta_k(x) = x^T\Sigma^{-1}\mu_k - \tfrac{1}{2}\mu_k^T\Sigma^{-1}\mu_k + \log\pi_k$ give an equivalent description of the decision rule: classify to $G(x) = \arg\max_k \delta_k(x)$
LDA (Cont’d)
• In practice, we do not know the parameters of the Gaussian distributions. Estimate them with the training set ($N_k$ is the number of class-k observations):
– $\hat\pi_k = N_k / N$
– $\hat\mu_k = \sum_{g_i=k} x_i / N_k$
– $\hat\Sigma = \sum_{k=1}^K \sum_{g_i=k} (x_i - \hat\mu_k)(x_i - \hat\mu_k)^T / (N-K)$
• For two classes, this is like linear regression
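A minimal numpy sketch of these estimates and of the resulting linear discriminant rule $\delta_k(x)$ from the previous slide (labels assumed coded 0,…,K-1; names are illustrative):

```python
import numpy as np

def fit_lda(X, g, K):
    """Estimate LDA parameters: priors, class means, pooled covariance."""
    N, p = X.shape
    pi_hat = np.array([np.mean(g == k) for k in range(K)])          # N_k / N
    mu_hat = np.array([X[g == k].mean(axis=0) for k in range(K)])   # class means
    Sigma_hat = np.zeros((p, p))
    for k in range(K):
        Xc = X[g == k] - mu_hat[k]
        Sigma_hat += Xc.T @ Xc
    Sigma_hat /= (N - K)                                            # pooled covariance
    return pi_hat, mu_hat, Sigma_hat

def lda_predict(X, pi_hat, mu_hat, Sigma_hat):
    """Classify each row of X to argmax_k of the linear discriminant delta_k(x)."""
    Sinv = np.linalg.inv(Sigma_hat)
    deltas = (X @ Sinv @ mu_hat.T
              - 0.5 * np.sum(mu_hat @ Sinv * mu_hat, axis=1)
              + np.log(pi_hat))
    return deltas.argmax(axis=1)
```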
QDA
• If the $\Sigma_k$'s are not equal, the quadratic terms in x remain, and we get the quadratic discriminant functions (QDA): $\delta_k(x) = -\tfrac{1}{2}\log|\Sigma_k| - \tfrac{1}{2}(x-\mu_k)^T\Sigma_k^{-1}(x-\mu_k) + \log\pi_k$
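A short sketch of evaluating one quadratic discriminant under the Gaussian model above (the function name is illustrative):

```python
import numpy as np

def qda_discriminant(x, mu_k, Sigma_k, pi_k):
    """Quadratic discriminant:
       delta_k(x) = -1/2 log|Sigma_k| - 1/2 (x-mu_k)^T Sigma_k^{-1} (x-mu_k) + log pi_k."""
    diff = x - mu_k
    _, logdet = np.linalg.slogdet(Sigma_k)          # Sigma_k is positive definite
    return -0.5 * logdet - 0.5 * diff @ np.linalg.solve(Sigma_k, diff) + np.log(pi_k)
```

Classify to the class with the largest $\delta_k(x)$, exactly as in LDA but with a class-specific covariance.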
QDA (Cont’d)
• The estimates are similar to LDA, except that each class has a separate covariance matrix
• For large p, there is a dramatic increase in the number of parameters
– In LDA, there are (K-1)(p+1) parameters
– For QDA, there are (K-1){p(p+3)/2 + 1}
• LDA and QDA both work really well in practice
• This is not because the data are Gaussian; rather, for simple decision boundaries, the Gaussian estimates are stable
• Bias-variance trade-off
Regularized Discriminant Analysis
• A compromise between LDA and QDA: shrink the separate covariances of QDA towards a common covariance (similar to ridge regression)
• $\hat\Sigma_k(\alpha) = \alpha\,\hat\Sigma_k + (1-\alpha)\,\hat\Sigma$, with $\alpha\in[0,1]$
• The common covariance itself can be shrunk toward a scalar covariance: $\hat\Sigma(\gamma) = \gamma\,\hat\Sigma + (1-\gamma)\,\hat\sigma^2 I$, with $\gamma\in[0,1]$
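A minimal sketch of this shrinkage; the scalar variance $\hat\sigma^2$ is taken here as the average diagonal of the pooled covariance, which is one common choice and an assumption, not necessarily the one used in the lecture.

```python
import numpy as np

def rda_covariances(Sigmas, Sigma_pooled, alpha, gamma=1.0):
    """Regularized covariance estimates:
       Sigma_k(alpha) = alpha * Sigma_k + (1 - alpha) * Sigma(gamma)
       Sigma(gamma)   = gamma * Sigma_pooled + (1 - gamma) * sigma2 * I."""
    p = Sigma_pooled.shape[0]
    sigma2 = np.trace(Sigma_pooled) / p                      # scalar variance (assumed choice)
    Sigma_gamma = gamma * Sigma_pooled + (1 - gamma) * sigma2 * np.eye(p)
    return [alpha * S_k + (1 - alpha) * Sigma_gamma for S_k in Sigmas]
```

With $\alpha=1$ this recovers QDA; with $\alpha=0,\ \gamma=1$ it recovers LDA.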
Example - RDA
Computations for LDA
• Suppose we compute the eigendecomposition of each $\hat\Sigma_k = U_k D_k U_k^T$, where $U_k$ is a p×p orthonormal matrix and $D_k$ is a diagonal matrix of positive eigenvalues $d_{kl}$. Then
– $(x-\hat\mu_k)^T\hat\Sigma_k^{-1}(x-\hat\mu_k) = [U_k^T(x-\hat\mu_k)]^T D_k^{-1}\,[U_k^T(x-\hat\mu_k)]$
– $\log|\hat\Sigma_k| = \sum_l \log d_{kl}$
• The LDA classifier is implemented as:
– Sphere the data with respect to the common covariance estimate $\hat\Sigma = UDU^T$: $X^* \leftarrow D^{-1/2}U^T X$. The common covariance estimate of $X^*$ is the identity
– Classify to the closest class centroid in the transformed space, modulo the effect of the class prior probabilities $\pi_k$
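A minimal numpy sketch of this sphering implementation, assuming the estimates from the earlier LDA slide (names are illustrative):

```python
import numpy as np

def lda_sphere_and_classify(X, pi_hat, mu_hat, Sigma_hat):
    """LDA via sphering: X* = D^{-1/2} U^T X with Sigma = U D U^T,
    then classify to the closest centroid, adjusted by the log prior."""
    evals, U = np.linalg.eigh(Sigma_hat)               # Sigma = U D U^T
    W = U / np.sqrt(evals)                             # columns scaled: W = U D^{-1/2}
    Xs = X @ W                                         # sphered data (rows)
    Ms = mu_hat @ W                                    # sphered class centroids
    d2 = ((Xs[:, None, :] - Ms[None, :, :]) ** 2).sum(axis=2)   # squared distances (N x K)
    return (-0.5 * d2 + np.log(pi_hat)).argmax(axis=1)
```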
Background: Simple Decision Theory*
• Suppose we know the class-conditional densities p(X|y) for y=0,1 as well as the overall class frequencies P(y)
• How do we decide which class a new example x’ belongs to so as to minimize the overall probability of error?
* Courtesy of Tommi S. Jaakkola, MIT CSAIL
2-Class Logistic Regression
• The optimal decisions are based on the posterior class probabilities P(y|x). For binary classification problems, we can write these decisions in terms of the sign of the log-odds $\log\dfrac{P(y=1|x)}{P(y=0|x)}$
• We generally don't know P(y|x), but we can parameterize the possible decisions according to the linear log-odds model $\log\dfrac{P(y=1|x)}{P(y=0|x)} = \beta_0 + \beta^T x$
2-Class Logistic Regression (Cont’d)
• Our log-odds model $\log\dfrac{P(y=1|x)}{P(y=0|x)} = \beta_0 + \beta^T x$ gives rise to a specific form for the conditional probability over the labels (the logistic model): $P(y=1|X=x) = \sigma(\beta_0 + \beta^T x)$, where $\sigma(z) = \dfrac{1}{1+e^{-z}}$ is a logistic "squashing" function that turns linear predictions into probabilities
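A tiny sketch of the logistic model and the implied decision (the parameter values are made up):

```python
import numpy as np

def sigmoid(z):
    """Logistic squashing function: sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(x, beta0, beta):
    """P(y=1 | x) under the log-odds model beta0 + beta^T x."""
    return sigmoid(beta0 + x @ beta)

beta0, beta = -1.0, np.array([2.0, -0.5])   # illustrative parameters
x = np.array([0.8, 0.2])
p1 = predict_proba(x, beta0, beta)
print(p1, int(p1 > 0.5))                    # probability of class 1 and the implied decision
```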
2-Class Logistic Regression: Decisions
• Logistic regression models imply a linear decision boundary
K-Class Logistic Regression
• The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)
• The choice of denominator is arbitrary, typically last class
$\log\dfrac{\Pr(G=1|X=x)}{\Pr(G=K|X=x)} = \beta_{10} + \beta_1^T x$
$\log\dfrac{\Pr(G=2|X=x)}{\Pr(G=K|X=x)} = \beta_{20} + \beta_2^T x$
$\vdots$
$\log\dfrac{\Pr(G=K-1|X=x)}{\Pr(G=K|X=x)} = \beta_{(K-1)0} + \beta_{K-1}^T x$
K-Class Logistic Regression (Cont’d)
• The model is specified in terms of K-1 log-odds or logit transformations (reflecting the constraint that the probabilities sum to one)
• A simple calculation shows that
$\Pr(G=k|X=x) = \dfrac{\exp(\beta_{k0} + \beta_k^T x)}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}, \quad k = 1,\ldots,K-1$
$\Pr(G=K|X=x) = \dfrac{1}{1 + \sum_{l=1}^{K-1}\exp(\beta_{l0} + \beta_l^T x)}$
• To emphasize the dependence on the entire parameter set $\theta = \{\beta_{10}, \beta_1^T, \ldots, \beta_{(K-1)0}, \beta_{K-1}^T\}$, we denote the probabilities as $\Pr(G=k|X=x) = p_k(x;\theta)$
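A small sketch of recovering the K class probabilities from the K-1 logits (parameter values are made up; class K is the reference):

```python
import numpy as np

def k_class_probs(x, B0, B):
    """Class posteriors from K-1 logits against reference class K.
    B0: (K-1,) intercepts, B: (K-1, p) rows of coefficients beta_k."""
    logits = B0 + B @ x                          # beta_k0 + beta_k^T x, k = 1..K-1
    denom = 1.0 + np.exp(logits).sum()
    return np.append(np.exp(logits), 1.0) / denom   # last entry is Pr(G=K | X=x)

B0 = np.array([0.2, -0.1])
B = np.array([[1.0, -1.0], [0.5, 0.3]])          # K-1 = 2, p = 2, so K = 3
print(k_class_probs(np.array([0.4, 0.6]), B0, B).sum())   # probabilities sum to 1
```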
Fitting Logistic Regression Models
• Two-class case, with the response coded as $y_i \in \{0,1\}$ and $p(x;\beta) = \Pr(y=1|X=x)$:
– $\operatorname{logit} p(x) = \log\dfrac{p(x)}{1-p(x)} = \beta^T x$ (the intercept is absorbed into $\beta$)
– Log-likelihood: $\ell(\beta) = \sum_{i=1}^N \big\{ y_i \log p(x_i;\beta) + (1-y_i)\log(1-p(x_i;\beta)) \big\} = \sum_{i=1}^N \big\{ y_i\,\beta^T x_i - \log(1 + e^{\beta^T x_i}) \big\}$
Fitting Logistic Regression Models
• Maximize the log-likelihood by iteratively reweighted least squares (IRLS), which is equivalent to the Newton-Raphson procedure
Fitting Logistic Regression Models
• IRLS algorithm (equivalent to Newton-Raphson):
1. Initialize $\beta$
2. Form the linearized response $z_i = x_i^T\beta + \dfrac{y_i - p_i}{p_i(1-p_i)}$
3. Form the weights $w_i = p_i(1-p_i)$
4. Update $\beta$ by weighted least squares of $z_i$ on $x_i$ with weights $w_i$
– Steps 2-4 are repeated until convergence
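A minimal numpy sketch of this IRLS loop (illustrative names; no safeguards against vanishing weights, e.g. for perfectly separable data):

```python
import numpy as np

def logistic_irls(X, y, n_iter=25, tol=1e-8):
    """IRLS / Newton-Raphson for 2-class logistic regression.
    X: (N, p) inputs (an intercept column is prepended), y: (N,) in {0,1}."""
    X1 = np.hstack([np.ones((X.shape[0], 1)), X])
    beta = np.zeros(X1.shape[1])                     # 1. initialize beta
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))         # current probabilities p_i
        w = p * (1 - p)                              # 3. weights w_i
        z = X1 @ beta + (y - p) / w                  # 2. linearized response z_i
        # 4. weighted least squares of z on x with weights w
        beta_new = np.linalg.solve(X1.T @ (w[:, None] * X1), X1.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:    # stop when the update is tiny
            beta = beta_new
            break
        beta = beta_new
    return beta
```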
Example – Logistic Regression
• South African Heart Disease:
– Coronary risk factor study (CORIS) baseline survey, carried out in three rural areas
– White males between 15 and 64
– Response: presence or absence of myocardial infarction
– Maximum likelihood fit:
Example – Logistic Regression
• South African Heart Disease:
Logistic Regression or LDA?
• LDA: the log-posterior odds between classes k and K are linear in x, $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \alpha_{k0} + \alpha_k^T x$
• This linearity is a consequence of the Gaussian assumption for the class densities, as well as the assumption of a common covariance matrix.
• Logistic model: $\log\dfrac{\Pr(G=k|X=x)}{\Pr(G=K|X=x)} = \beta_{k0} + \beta_k^T x$
• They use the same form for the logit function
Logistic Regression or LDA?
• Discriminative vs informative learning:
• logistic regression uses the conditional distribution of Y given x to estimate parameters, while LDA uses the full joint distribution (assuming normality).
• If normality holds, LDA is up to 30% more efficient; otherwise logistic regression can be more robust. But the methods are similar in practice.
Separating Hyperplanes
Separating Hyperplanes
• Perceptrons: compute a linear combination of the input features and return the sign
• Consider a hyperplane (affine set) L defined by $f(x) = \beta_0 + \beta^T x = 0$
• For any $x_1, x_2$ in L, $\beta^T(x_1 - x_2) = 0$, so $\beta^* = \beta/\|\beta\|$ is the unit vector normal to the surface L
• For any $x_0$ in L, $\beta^T x_0 = -\beta_0$
• The signed distance of any point x to L is given by $\beta^{*T}(x - x_0) = \dfrac{1}{\|\beta\|}\,(\beta^T x + \beta_0) = \dfrac{1}{\|f'(x)\|}\, f(x)$
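A tiny sketch of the signed-distance formula (the hyperplane parameters are made up):

```python
import numpy as np

def signed_distance(x, beta, beta0):
    """Signed distance of x to the hyperplane {x : beta0 + beta^T x = 0}:
       (beta^T x + beta0) / ||beta||."""
    return (beta @ x + beta0) / np.linalg.norm(beta)

beta, beta0 = np.array([1.0, 1.0]), -1.0                     # the line x1 + x2 = 1
print(signed_distance(np.array([1.0, 1.0]), beta, beta0))    # +1/sqrt(2): above the line
print(signed_distance(np.array([0.0, 0.0]), beta, beta0))    # -1/sqrt(2): below the line
```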
Rosenblatt's Perceptron Learning Algorithm
• Finds a separating hyperplane by minimizing the distance of misclassified points to the decision boundary
• If a response $y_i = 1$ is misclassified, then $x_i^T\beta + \beta_0 < 0$, and the opposite holds for a misclassified point with $y_i = -1$
• The goal is to minimize $D(\beta, \beta_0) = -\sum_{i\in\mathcal{M}} y_i(x_i^T\beta + \beta_0)$, where $\mathcal{M}$ indexes the misclassified points
Rosenblatt's Perceptron Learning Algorithm (Cont’d)
• Stochastic gradient descent: the misclassified observations are visited in some sequence, and the parameters are updated as $\begin{pmatrix}\beta \\ \beta_0\end{pmatrix} \leftarrow \begin{pmatrix}\beta \\ \beta_0\end{pmatrix} + \rho \begin{pmatrix}y_i x_i \\ y_i\end{pmatrix}$
• Here $\rho$ is the learning rate, which can be taken to be 1 without loss of generality
• If the classes are linearly separable, it can be shown that the algorithm converges to a separating hyperplane in a finite number of steps
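A minimal sketch of the perceptron updates above, assuming labels in {-1, +1} (names are illustrative):

```python
import numpy as np

def perceptron(X, y, rho=1.0, max_epochs=1000):
    """Rosenblatt's perceptron: stochastic gradient descent on misclassified points.
    y in {-1, +1}; converges in finitely many steps if the classes are separable."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ beta + beta0) <= 0:        # misclassified (or on the boundary)
                beta += rho * yi * xi                # (beta, beta0) <- (beta, beta0) + rho*(yi*xi, yi)
                beta0 += rho * yi
                mistakes += 1
        if mistakes == 0:                            # every point correctly classified
            break
    return beta, beta0
```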
Optimal Separating Hyperplanes
• Problem: find the hyperplane that maximizes the margin M to the closest training points, $\max_{\beta,\beta_0,\|\beta\|=1} M$ subject to $y_i(x_i^T\beta + \beta_0) \ge M$, $i = 1,\ldots,N$
Example - Optimal Separating Hyperplanes