
Linear Methods for Classification

20.04.2015:

Presentation for MA seminar in statistics

Eli Dahan

Outline

Introduction: the problem and the solution
LDA: Linear Discriminant Analysis
LR: Logistic Regression (and Linear Regression)
LDA vs. LR
In a word – Separating Hyperplanes

Introduction - the problem

Given a new observation X = x, should it be assigned to group k or to group l?

*We can think of G as the "group label"

Posterior probability: p_j = P(G = j | X = x)

Introduction - the solution

Linear decision boundary: {x : p_k = p_l}

If p_k > p_l, choose k

If p_l > p_k, choose l
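As a minimal illustration of this rule in Python (the posterior values are made up, not from a fitted model):

```python
import numpy as np

# Decision rule: assign x to the class with the largest posterior
# probability p_j = P(G = j | X = x). The values below are
# illustrative, not produced by any fitted model.
posteriors = np.array([0.2, 0.5, 0.3])   # p_1, p_2, p_3 at some x
chosen = np.argmax(posteriors)           # index of the winning class
print(chosen)                            # -> 1, i.e. the second class
```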

Linear Discriminant Analysis

Let P(G = k) = π_k and P(X = x | G = k) = f_k(x). Then by Bayes' rule:

P(G = k | X = x) = f_k(x) π_k / Σ_l f_l(x) π_l

Decision boundary between classes k and l: the set of x where P(G = k | X = x) = P(G = l | X = x).

Linear Discriminant Analysis

Assuming f_k(x) ~ N(μ_k, Σ_k) and a common covariance Σ_1 = Σ_2 = ... = Σ_K = Σ, the log-odds

log [ P(G = k | X = x) / P(G = l | X = x) ] = log(π_k / π_l) - (1/2)(μ_k + μ_l)^T Σ^{-1} (μ_k - μ_l) + x^T Σ^{-1} (μ_k - μ_l)

are linear in x, so we get a linear (in x) decision boundary.

For non-common covariances Σ_k we get quadratic boundaries, QDA (with RDA as a regularized compromise between LDA and QDA).
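Equivalently, one classifies x to the largest linear discriminant score δ_k(x) = x^T Σ^{-1} μ_k - (1/2) μ_k^T Σ^{-1} μ_k + log π_k. A minimal Python sketch with made-up parameters:

```python
import numpy as np

def lda_discriminants(x, means, cov, priors):
    """Linear discriminant scores delta_k(x) for each class k:
    delta_k(x) = x' Sigma^{-1} mu_k - 0.5 mu_k' Sigma^{-1} mu_k + log pi_k.
    The class with the largest score is chosen; the scores are linear
    in x, so the boundaries between classes are hyperplanes."""
    cov_inv = np.linalg.inv(cov)
    scores = []
    for mu, pi in zip(means, priors):
        a = cov_inv @ mu
        scores.append(x @ a - 0.5 * mu @ a + np.log(pi))
    return np.array(scores)

# Two classes in R^2 with a shared covariance (illustrative values):
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0])]
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
priors = [0.5, 0.5]
x = np.array([1.0, 0.5])
print(np.argmax(lda_discriminants(x, means, cov, priors)))
```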

Linear Discriminant Analysis

Using empirical (plug-in) estimates:

π̂_k = N_k / N,  μ̂_k = Σ_{g_i = k} x_i / N_k,  Σ̂ = Σ_k Σ_{g_i = k} (x_i - μ̂_k)(x_i - μ̂_k)^T / (N - K)

LDA was a top classifier in the STATLOG comparison (Michie et al., 1994); presumably real data often support only linear boundaries, and the pooled estimates make LDA stable.
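A sketch of these plug-in estimates in Python (the function name and interface are mine; g holds integer class labels 0, ..., K-1):

```python
import numpy as np

def lda_estimates(X, g, K):
    """Plug-in estimates for LDA: pi_k = N_k / N, mu_k = class mean,
    Sigma = pooled within-class covariance with divisor N - K."""
    N, p = X.shape
    priors, means = [], []
    pooled = np.zeros((p, p))
    for k in range(K):
        Xk = X[g == k]                      # rows of class k
        priors.append(len(Xk) / N)          # pi_k = N_k / N
        means.append(Xk.mean(axis=0))       # mu_k = class centroid
        centered = Xk - means[-1]
        pooled += centered.T @ centered     # within-class scatter
    return np.array(priors), np.array(means), pooled / (N - K)
```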

Logistic Regression

Models the posterior probabilities of the K classes so that they sum to one and remain in [0, 1]:

log [ P(G = k | X = x) / P(G = K | X = x) ] = β_{k0} + β_k^T x,  k = 1, ..., K - 1

The decision boundary between any two classes, the set where their log-odds are equal, is linear in x.
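The implied posteriors have a softmax form. A minimal Python sketch of computing them from given coefficients (the interface and values are my own illustration, not fitted parameters):

```python
import numpy as np

def lr_posteriors(x, betas):
    """Posterior probabilities under the multiclass logit model.
    betas is a list of K-1 pairs (beta_k0, beta_k); class K is the
    reference. The probabilities are positive and sum to one by
    construction."""
    z = np.array([b0 + b @ x for b0, b in betas])  # log-odds vs class K
    e = np.exp(z)
    denom = 1.0 + e.sum()
    return np.append(e / denom, 1.0 / denom)       # last entry: class K

# Example with made-up coefficients for K = 3 classes in R^2:
betas = [(0.1, np.array([1.0, -0.5])), (-0.2, np.array([0.3, 0.8]))]
print(lr_posteriors(np.array([0.5, 1.0]), betas))  # sums to 1
```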

Logistic Regression

Model fit: the parameters are estimated by maximizing the conditional log-likelihood of G given X.

To maximize the likelihood, the Newton-Raphson algorithm is used; each step is a weighted least-squares fit (iteratively reweighted least squares, IRLS).
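A sketch of that Newton-Raphson fit for the two-class case, assuming X already contains an intercept column of ones; a bare-bones illustration without step-halving or convergence checks:

```python
import numpy as np

def fit_logistic_newton(X, y, n_iter=25):
    """Two-class logistic regression via Newton-Raphson (IRLS).
    X: N x p design matrix (include a column of ones for the
    intercept), y: 0/1 labels. Each step solves
    beta += (X' W X)^{-1} X' (y - p), with W = diag(p(1 - p))."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
        W = p * (1.0 - p)                      # diagonal of W
        H = X.T @ (W[:, None] * X)             # Hessian X' W X
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta
```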

Linear Regression

Recall the standard assumptions of multivariate regression (linearity, lack of severe multicollinearity, etc.). Here, assuming N instances (an N×p observation matrix X), Y is an N×K indicator response matrix (K classes): row i has a 1 in the column of its class and 0 elsewhere.

Fit by least squares, B̂ = (X^T X)^{-1} X^T Y, and classify a new observation x to the class with the largest fitted value in x^T B̂.
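A sketch of the indicator-matrix fit in Python (the function name and interface are my own; assumes X^T X is invertible):

```python
import numpy as np

def indicator_regression(X, g, K):
    """Regress an N x K indicator matrix Y on X by least squares:
    B_hat = (X'X)^{-1} X'Y. A new x is classified to
    argmax_k of its fitted row x' B_hat."""
    N = X.shape[0]
    Y = np.zeros((N, K))
    Y[np.arange(N), g] = 1.0                   # indicator responses
    B = np.linalg.solve(X.T @ X, X.T @ Y)      # least-squares coefficients
    return B

# Classification of a new observation x_new: np.argmax(x_new @ B)
```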


LDA Vs. LR

Similar results, with LDA slightly better (56% vs. 67% error rate for LR).

Presumably they perform alike because both arrive at decision boundaries of the same linear form (see the boundaries derived above).

LDA Vs. LR

LDA: parameters are fit by maximizing the full log-likelihood based on the joint density of (X, G), which assumes Gaussian class densities (Efron 1975: when the Gaussian assumption actually holds, ignoring it and using LR instead costs at worst about a 30% reduction in efficiency).

Linearity is derived

LR: leaves P(X) arbitrary (an advantage in model selection and in the ability to absorb extreme X values), and fits the parameters of P(G | X) by maximizing the conditional likelihood.

Linearity is assumed

In a word – Separating Hyperplanes
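Classifiers in this family, such as Rosenblatt's perceptron, search directly for a hyperplane β_0 + x^T β = 0 that separates the classes, without any probability model. A minimal perceptron sketch (the loop structure and parameters are my own illustration, not necessarily what the original slide showed):

```python
import numpy as np

def perceptron(X, y, n_iter=100, lr=1.0):
    """Rosenblatt's perceptron: cycle through the data and nudge the
    hyperplane beta_0 + x' beta = 0 toward each misclassified point.
    y takes values in {-1, +1}; the loop converges only when the
    classes are linearly separable, hence the iteration cap."""
    beta = np.zeros(X.shape[1])
    beta0 = 0.0
    for _ in range(n_iter):
        for xi, yi in zip(X, y):
            if yi * (beta0 + xi @ beta) <= 0:  # on the wrong side
                beta += lr * yi * xi
                beta0 += lr * yi
    return beta0, beta
```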
