Classification, Part 2
BMTRY 726, 4/11/14
The 3 Approaches
(1) Discriminant functions:
-Find a function f(x) that maps each point x directly to a class label
-NOTE: in this case, probabilities play no role
(2) Linear (Quadratic) Discriminant Analysis:
-Solve the inference problem using the prior class probabilities P(C = k) and the class-conditional densities p(x | C = k)
-Use Bayes' theorem to find the posterior class probabilities P(C = k | x)
-Use the posteriors to make the optimal decision
(3) Logistic Regression:
-Solve the inference problem by determining the posterior class probabilities P(C = k | x) directly
-Use the posteriors to make the optimal decision
Logistic Regression
Probably the most commonly used linear classifier (certainly one we all know)
If the outcome is binary, we can describe the relationship between our features x and the probability of our outcomes as a linear relationship:
Using this we define the posterior probability of being in either of the two classes. The log-odds (logit) is linear in x:

ln[ P(C = 1 | X = x) / P(C = 2 | X = x) ] = β₀ + β'x

If our reference class is 2, the posterior probabilities are

P(C = 1 | X = x) = exp(β₀ + β'x) / (1 + exp(β₀ + β'x))

P(C = 2 | X = x) = 1 / (1 + exp(β₀ + β'x))
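The two posteriors above can be computed directly from the fitted coefficients. A minimal sketch in plain Python; the coefficient values in the example are hypothetical:

```python
import math

def binary_logistic_posteriors(b0, beta, x):
    """Posterior class probabilities for a binary logistic model
    with class 2 as the reference class."""
    # Linear predictor: beta_0 + beta' x
    eta = b0 + sum(b * xi for b, xi in zip(beta, x))
    p1 = math.exp(eta) / (1.0 + math.exp(eta))   # P(C = 1 | X = x)
    p2 = 1.0 / (1.0 + math.exp(eta))             # P(C = 2 | X = x)
    return p1, p2

# Hypothetical fit: beta_0 = 0.5, beta = (1.0, -2.0), observation x = (0.3, 0.1)
p1, p2 = binary_logistic_posteriors(0.5, [1.0, -2.0], [0.3, 0.1])
```

Note the two probabilities necessarily sum to 1, since P(C = 2 | X = x) = 1 − P(C = 1 | X = x).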
Logistic Regression for K > 2
But what happens if we have more than two classes? Examples:
(1) Ordinal outcome (e.g., a Likert scale) defining attitudes towards the safety of epidurals during labor among moms-to-be
-Features may include things like age, ethnicity, level of education, socio-economic status, parity, etc.
(2) Goal is to distinguish between several different types of lung tumors (both malignant and non-malignant):
-small cell lung cancer-non-small cell lung cancer-granulomatosis -sarcoidosis
In this case features may be pixels from CT scan image
Logistic Regression for K > 2
In the first example, the outcome is ordinal. One possible option is to fit a cumulative logit model.
The model for P(C ≤ j | X = x) is just a logit model for a binary response.
In this case, the response takes value 1 if y ≤ j and takes value 0 if y ≥ j + 1.
Cumulative Logits:

P(C ≤ j | X = x) = P(C = 1 | X = x) + ... + P(C = j | X = x),  j = 1, 2, ..., J

logit[ P(C ≤ j | X = x) ] = log[ P(C ≤ j | X = x) / (1 − P(C ≤ j | X = x)) ]
  = log[ (P(C = 1 | X = x) + ... + P(C = j | X = x)) / (P(C = j + 1 | X = x) + ... + P(C = J | X = x)) ]
Logistic Regression for K > 2
It is of greater interest however to model all K – 1 cumulative logits in a single model.
This leads us to the proportional odds model
Notice the intercept is allowed to vary as j increases
However, the other model parameters remain constant
Does this make sense given the name of the model?
logit[ P(C ≤ j | X = x) ] = β₀ⱼ + β'x,  j = 1, 2, ..., K − 1
Logistic Regression for K > 2
Assumptions for the proportional odds model:
-Intercepts are increasing with increasing j
-Models share the same rate of increase with increasing j
-Odds ratios are proportional to the distance between x₁ and x₂, and the proportionality constant is the same for each logit

So for j < k, the curve for P(C ≤ k | X = x) is equivalent to the curve for P(C ≤ j | X = x) shifted (β₀ₖ − β₀ⱼ)/β units in the x direction
Odds ratios to interpret the model are cumulative odds ratios
logit[ P(C ≤ j | X = x₁) ] − logit[ P(C ≤ j | X = x₂) ]
  = log[ P(C ≤ j | X = x₁) / P(C > j | X = x₁) ] − log[ P(C ≤ j | X = x₂) / P(C > j | X = x₂) ]
  = β'(x₁ − x₂)

P(C ≤ k | X = x) = P(C ≤ j | X = x + (β₀ₖ − β₀ⱼ)/β)
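The cumulative probabilities implied by the proportional odds model can be turned into per-class probabilities by differencing. A minimal sketch in plain Python, with hypothetical intercepts and a hypothetical shared slope vector:

```python
import math

def cumulative_probs(intercepts, beta, x):
    """P(C <= j | X = x), j = 1..K-1, under a proportional odds model:
    logit P(C <= j | x) = beta_0j + beta' x (same beta for every j)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return [1.0 / (1.0 + math.exp(-(b0j + eta))) for b0j in intercepts]

def class_probs(intercepts, beta, x):
    """Per-class probabilities obtained by differencing cumulative ones."""
    cum = cumulative_probs(intercepts, beta, x) + [1.0]  # P(C <= K) = 1
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical fit for K = 4 ordinal classes, p = 2 features;
# the intercepts increase with j, as the model assumes
probs = class_probs([-1.0, 0.0, 1.5], [0.8, -0.4], [1.0, 2.0])
```

Because the intercepts are increasing and the slope is shared, the cumulative probabilities are increasing in j, so the differenced class probabilities are nonnegative and sum to 1.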
Logistic Regression for K > 2
In the second example, our class categories are not ordinal. We can however fit a multinomial logit model.
The model includes K – 1 logit models
log[ P(C = 1 | X = x) / P(C = K | X = x) ] = β₀₁ + β₁'x

log[ P(C = 2 | X = x) / P(C = K | X = x) ] = β₀₂ + β₂'x

...

log[ P(C = K − 1 | X = x) / P(C = K | X = x) ] = β₀,K−1 + β'K−1 x
Logistic Regression for K > 2
We can estimate the posterior probabilities of each of our K classes from the multinomial logit model
When K = 2, this reduces down to a single linear function (i.e. a single logistic regression)
P(C = 1 | X = x) = exp(β₀₁ + β₁'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

P(C = 2 | X = x) = exp(β₀₂ + β₂'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

P(C = k | X = x) = exp(β₀ₖ + βₖ'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x)),  k = 1, 2, ..., K − 1

P(C = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))
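The posterior formulas above are easy to compute once the K − 1 logit fits are in hand. A minimal sketch in plain Python, with hypothetical coefficients:

```python
import math

def multinomial_posteriors(b0, betas, x):
    """Posteriors P(C = k | X = x) for a multinomial logit model with
    class K as the reference; b0 and betas hold the K-1 fitted logits."""
    # Linear predictors beta_0k + beta_k' x for k = 1..K-1
    etas = [b0k + sum(b * xi for b, xi in zip(bk, x))
            for b0k, bk in zip(b0, betas)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]   # classes 1..K-1
    probs.append(1.0 / denom)                     # reference class K
    return probs

# Hypothetical K = 3 model with p = 2 features
probs = multinomial_posteriors([0.2, -0.5], [[1.0, 0.0], [0.5, 0.5]], [1.0, 1.0])
```

With K = 2 (a single intercept and slope vector), this reduces to the binary logistic posteriors, matching the K = 2 remark above.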
Logistic Regression for K > 2
When K = 2, this reduces down to a single linear function (i.e. a single logistic regression)
Though we’ve referenced the last category, since the data have no natural ordering we could reference any category we choose.
Unlike the cumulative logit and proportional odds models, all parameters vary in these models
Logistic Regression for K > 2
As in the case of the ordinal models, it makes more sense to fit these models simultaneously
As a result, there are some assumptions and constraints we must impose:
(1) The different classes in the data represent a multinomial distribution
-Constraint: all posterior probabilities must sum to 1
-In order to achieve this, all models are fit simultaneously
(2) Independence of Irrelevant Alternatives:
-The relative odds between any two outcomes are independent of the number and nature of the other outcomes being simultaneously considered
Logistic Regression vs. LDA
Both LDA and logistic regression represent models of the log-posterior odds between classes k and K that are linear functions of x
LDA:

log[ P(C = k | X = x) / P(C = K | X = x) ]
  = log(πₖ / π_K) − (1/2)(μₖ + μ_K)'Σ⁻¹(μₖ − μ_K) + x'Σ⁻¹(μₖ − μ_K)
  = α₀ₖ + αₖ'x

where πₖ = P(C = k), α₀ₖ = log(πₖ / π_K) − (1/2)(μₖ + μ_K)'Σ⁻¹(μₖ − μ_K), and αₖ = Σ⁻¹(μₖ − μ_K)

Logistic regression:

log[ P(C = k | X = x) / P(C = K | X = x) ] = β₀ₖ + βₖ'x
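For a single feature (p = 1, so Σ reduces to a scalar variance σ²), the LDA intercept and slope above can be computed directly from the class means, priors, and shared variance. A minimal sketch in plain Python; the priors, means, and variance used in the example are hypothetical:

```python
import math

def lda_logit_coeffs(pi_k, pi_K, mu_k, mu_K, sigma2):
    """Coefficients of the LDA log-posterior-odds line
    log[P(C=k|x)/P(C=K|x)] = a0 + a*x for one feature (p = 1)
    with shared within-class variance sigma2."""
    a = (mu_k - mu_K) / sigma2                   # slope: Sigma^{-1}(mu_k - mu_K)
    a0 = (math.log(pi_k / pi_K)
          - 0.5 * (mu_k + mu_K) * (mu_k - mu_K) / sigma2)
    return a0, a

# Hypothetical two-class setup: priors 0.4/0.6, means +1/-1, variance 2
a0, a = lda_logit_coeffs(0.4, 0.6, 1.0, -1.0, 2.0)
```

As a check, the line a0 + a·x agrees with the log prior ratio plus the log ratio of the two Gaussian densities at any x, which is exactly how the formula above was derived.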
Logistic Regression vs. LDA
The posterior conditional density of class k for both LDA and logistic regression can be written in the linear logit form
The joint density for both can be written in the same way
Both methods represent linear decision boundaries that classify observations
So what’s the difference?
LDA:   P(C = k | X = x) = exp(α₀ₖ + αₖ'x) / (1 + Σ_{l=1}^{K−1} exp(α₀ₗ + αₗ'x))

logistic:   P(C = k | X = x) = exp(β₀ₖ + βₖ'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

Joint density:   P(X, C = k) = P(X) P(C = k | X)
Logistic Regression vs. LDA
The difference lies in how the linear coefficients are estimated.
LDA: Parameters are fit by maximizing the full log-likelihood based on the joint density
-recall here f is the Gaussian density function
Logistic regression: In this case the marginal density P(C = k) is arbitrary and parameters are estimated by maximizing the conditional multinomial likelihood
-although ignored, we can think of this marginal density as being estimated in a nonparametric unrestricted fashion
LDA full likelihood:   P(X, C = k) = f(X; μₖ, Σ) P(C = k)

Logistic regression conditional likelihood:   P(C = k | X)
Logistic Regression vs. LDA
So this means...
(1) LR makes fewer assumptions about the distribution of the data (a more general approach)
(2) But LR "ignores" the marginal distribution P(C = k)
-Including additional distributional assumptions provides more information about the parameters, allowing for more efficient estimation (i.e. lower variance)
-If the Gaussian assumption is correct, we could lose up to 30% efficiency
-OR we need 30% more data for the conditional likelihood to do as well as the full likelihood
Logistic Regression vs. LDA
(3) If observations are far from the decision boundary (i.e. probably NOT Gaussian), they influence the estimation of the common covariance matrix
-i.e. LDA is not robust to outliers
(4) If the data in a two-class model can be perfectly separated by a hyperplane, the LR parameters are undefined. But the LDA coefficients are still well defined (the marginal likelihood avoids this degeneracy)
-e.g. one particular feature has all of its density in one class
-Advantages/disadvantages to both methods
-LR thought of as more robust because makes fewer assumptions
-In practice they tend to perform similarly
Problems with Both Methods
There are a few problems with both methods…
(1) As with regression, we have a hard time including a large number of covariates especially if n is small
(2) A linear boundary may not really be an appropriate choice for separating our classes
So what can we do?
LDA and logistic regression work well if classes are linearly separable...
But what if they aren't? Linear boundaries may be almost useless
Nonlinear test statistics
The optimal decision boundary may not be a hyperplane → nonlinear test statistic
[Figure: a nonlinear decision boundary; the accept region for H0 is separated from the H1 region by a curved boundary]
Multivariate statistical methods are a Big Industry:
-Neural Networks
-Support Vector Machines
-Kernel density methods
Artificial Neural Networks (ANNs)
Central Idea:Extract linear combinations of inputs as derived features and then model the outcome (classes) as a nonlinear function of these features
Huh!?
Really they are nonlinear statistical models but with pieces that are familiar to us already
Biologic Neurons
Input signals come from the axons of other neurons, which connect to dendrites (input terminals) at the synapses
If a sufficient excitatory signal is received, the neuron fires and sends an output signal along the axons
The firing of the neuron occurs when a threshold excitation is reached
Idea for Neural Networks came from biology- more specifically, the brain…
Brains versus Computers: Some numbers
-Approximately 10 billion neurons in the human cortex, compared with tens of thousands of processors in the most powerful parallel computers
-Each biological neuron is connected to several thousand other neurons, similar to the connectivity in powerful parallel computers
-Lack of processing units can be compensated for by speed. Typical operating speeds of biological neurons are measured in milliseconds, while a silicon chip can operate in nanoseconds
-The human brain is extremely energy efficient, using approximately 10⁻¹⁶ joules per operation per second, whereas the best computers today use around 10⁻⁶ joules per operation per second
-Brains have been evolving for tens of millions of years; computers have been evolving for tens of decades
ANNs
Non-linear (mathematical) models of an artificial neuron
[Figure: an artificial neuron. Input signals x₁, x₂, x₃, ..., xₚ enter with synaptic weights w₁, w₂, w₃, ..., wₚ, are summed (Sₕ), and pass through an activation/threshold function g to produce the output signal O]
ANNs
A Neural Network is a 2-stage classification (or regression) model
Can be represented as network diagram
-for classification these represent the K-classes -kth unit models probability of being in kth class
[Network diagram: input layer X₁, X₂, X₃, ..., Xₚ₋₁, Xₚ; hidden layer Z₁, Z₂, Z₃, ..., Z_M; output layer Y₁, Y₂, ..., Y_K]
ANNs
The Zₘ represent derived features created from linear combinations of the X's
The Y's are modeled as a function of linear combinations of the Zₘ
σ is called the activation function
Zₘ = σ(α₀ₘ + αₘ'X),  m = 1, 2, ..., M

Tₖ = β₀ₖ + βₖ'Z,  k = 1, 2, ..., K

fₖ(X) = gₖ(T),  k = 1, 2, ..., K
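The three equations above amount to one forward pass through the network. A minimal sketch in plain Python, using the sigmoid as σ and the softmax as the output function gₖ; all weight values in the example are hypothetical:

```python
import math

def sigmoid(v):
    """Activation function sigma applied to each hidden unit."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """One forward pass of a single-hidden-layer network:
    Z_m = sigma(alpha_0m + alpha_m' X), T_k = beta_0k + beta_k' Z,
    f_k(X) = g_k(T), with g taken here to be the softmax."""
    # Hidden derived features Z_1..Z_M
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(am, x)))
         for a0, am in zip(alpha0, alpha)]
    # Output linear combinations T_1..T_K
    t = [b0 + sum(b * zm for b, zm in zip(bk, z))
         for b0, bk in zip(beta0, beta)]
    # Softmax output function g_k (shift by max for numerical stability)
    mx = max(t)
    ex = [math.exp(tk - mx) for tk in t]
    s = sum(ex)
    return [e / s for e in ex]

# Hypothetical weights: p = 2 inputs, M = 3 hidden units, K = 2 classes
out = forward([0.5, -1.0],
              [0.1, -0.2, 0.0], [[1.0, 0.5], [-0.5, 1.0], [0.2, 0.2]],
              [0.0, 0.3], [[1.0, -1.0, 0.5], [0.5, 0.5, -0.5]])
```

With the softmax output, the K values can be read as the class posterior probabilities, consistent with "the kth unit models the probability of being in the kth class."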
ANNs
The activation function, σ, could be any function we choose
In practice, there are only a few that are frequently used
(i) sign(x) = 1 if x ≥ 0, 0 if x < 0

(ii) sigmoid: σ(x) = 1 / (1 + e^(−x))

(iii) Gaussian radial basis function
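The three activation functions listed above can be written down directly. A minimal sketch in plain Python; the threshold at 0 for the sign function and the RBF center/width defaults are illustrative choices:

```python
import math

def sign_act(x):
    """(i) Hard threshold used by the original perceptron."""
    return 1 if x >= 0 else 0

def sigmoid_act(x):
    """(ii) Sigmoid, the usual smooth replacement for the hard threshold."""
    return 1.0 / (1.0 + math.exp(-x))

def rbf_act(x, c=0.0, s=1.0):
    """(iii) Gaussian radial basis function centered at c with width s."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))
```

The sigmoid smooths the step of the sign function, which is what makes gradient-based fitting of the network possible and ties ANNs back to logistic regression.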
ANNs
ANNs are based on simpler classifiers called perceptrons
The original single layer perceptron used the hard threshold sign function but this lacks flexibility making separation of classes difficult
Later adapted to use the sigmoid function -Note this should be familiar (think back to logistic regression)
ANNs are adaptation of the original single layer perceptron that include multiple layers (and have hence also been referred to as multi-layer perceptrons)
Use of the sigmoid function also links it with multinomial logistic regression