Classification, Part 2
BMTRY 726, 4/11/14
The 3 Approaches
(1) Discriminant functions:
-Find a function f(x) that maps each point x directly to a class label
-NOTE: in this case, probabilities play no role
(2) Linear (Quadratic) Discriminant Analysis:
-Solve the inference problem using the prior class probabilities P(C = k) and the class-conditional densities p(x | C = k)
-Use Bayes' theorem to find the posterior class probabilities P(C = k | x)
-Use the posteriors to make the optimal decision
(3) Logistic Regression:
-Solve the inference problem by determining the posterior class probabilities P(C = k | x) directly
-Use the posteriors to make the optimal decision
Logistic Regression
Probably the most commonly used linear classifier (certainly one we all know)
If the outcome is binary, we can describe the relationship between our features x and the probability of our outcomes as a linear relationship:
Using this we define the posterior probability of being in either of the two classes. The log-odds (logit) is linear in x:

ln[ P(C = 1 | X = x) / P(C = 2 | X = x) ] = β₀ + β'x

If our reference class is 2, the posterior probabilities are

P(C = 1 | X = x) = exp(β₀ + β'x) / (1 + exp(β₀ + β'x))

P(C = 2 | X = x) = 1 / (1 + exp(β₀ + β'x))
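The two posteriors above can be computed directly from the fitted coefficients. A minimal sketch in plain Python; the coefficient values in the example are hypothetical:

```python
import math

def binary_logistic_posteriors(b0, beta, x):
    """Posterior class probabilities for a binary logistic model
    with class 2 as the reference class."""
    # Linear predictor: beta_0 + beta' x
    eta = b0 + sum(b * xi for b, xi in zip(beta, x))
    p1 = math.exp(eta) / (1.0 + math.exp(eta))   # P(C = 1 | X = x)
    p2 = 1.0 / (1.0 + math.exp(eta))             # P(C = 2 | X = x)
    return p1, p2

# Hypothetical fit: beta_0 = 0.5, beta = (1.0, -2.0), observation x = (0.3, 0.1)
p1, p2 = binary_logistic_posteriors(0.5, [1.0, -2.0], [0.3, 0.1])
```

Note the two probabilities necessarily sum to 1, since P(C = 2 | X = x) = 1 − P(C = 1 | X = x).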
Logistic Regression for K > 2
But what happens if we have more than two classes? Examples:
(1) Ordinal outcome (e.g., a Likert scale) defining attitudes towards the safety of epidurals during labor among moms-to-be
-Features may include things like age, ethnicity, level of education, socio-economic status, parity, etc.
(2) Goal is to distinguish between several different types of lung tumors (both malignant and non-malignant):
-small cell lung cancer-non-small cell lung cancer-granulomatosis -sarcoidosis
In this case features may be pixels from CT scan image
Logistic Regression for K > 2
In the first example, the outcome is ordinal. One possible option is to fit a cumulative logit model.
The model for P(C ≤ j | X = x) is just a logit model for a binary response.
In this case, the response takes value 1 if y ≤ j and takes value 0 if y ≥ j + 1.
Cumulative Logits:

P(C ≤ j | X = x) = P(C = 1 | X = x) + ... + P(C = j | X = x),  j = 1, 2, ..., J

logit[ P(C ≤ j | X = x) ] = log[ P(C ≤ j | X = x) / (1 − P(C ≤ j | X = x)) ]
  = log[ (P(C = 1 | X = x) + ... + P(C = j | X = x)) / (P(C = j + 1 | X = x) + ... + P(C = J | X = x)) ]
Logistic Regression for K > 2
It is of greater interest however to model all K – 1 cumulative logits in a single model.
This leads us to the proportional odds model
Notice the intercept is allowed to vary as j increases
However, the other model parameters remain constant
Does this make sense given the name of the model?
logit[ P(C ≤ j | X = x) ] = β₀ⱼ + β'x,  j = 1, 2, ..., K − 1
Logistic Regression for K > 2
Assumptions for the proportional odds model:
-Intercepts are increasing with increasing j
-Models share the same rate of increase with increasing j
-Odds ratios are proportional to the distance between x₁ and x₂, and the proportionality constant is the same for each logit

So for j < k, the curve for P(C ≤ k | X = x) is equivalent to the curve for P(C ≤ j | X = x) shifted (β₀ₖ − β₀ⱼ)/β units in the x direction
Odds ratios to interpret the model are cumulative odds ratios
logit[ P(C ≤ j | X = x₁) ] − logit[ P(C ≤ j | X = x₂) ]
  = log[ P(C ≤ j | X = x₁) / P(C > j | X = x₁) ] − log[ P(C ≤ j | X = x₂) / P(C > j | X = x₂) ]
  = β'(x₁ − x₂)

P(C ≤ k | X = x) = P(C ≤ j | X = x + (β₀ₖ − β₀ⱼ)/β)
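The cumulative probabilities implied by the proportional odds model can be turned into per-class probabilities by differencing. A minimal sketch in plain Python, with hypothetical intercepts and a hypothetical shared slope vector:

```python
import math

def cumulative_probs(intercepts, beta, x):
    """P(C <= j | X = x), j = 1..K-1, under a proportional odds model:
    logit P(C <= j | x) = beta_0j + beta' x (same beta for every j)."""
    eta = sum(b * xi for b, xi in zip(beta, x))
    return [1.0 / (1.0 + math.exp(-(b0j + eta))) for b0j in intercepts]

def class_probs(intercepts, beta, x):
    """Per-class probabilities obtained by differencing cumulative ones."""
    cum = cumulative_probs(intercepts, beta, x) + [1.0]  # P(C <= K) = 1
    return [cum[0]] + [cum[j] - cum[j - 1] for j in range(1, len(cum))]

# Hypothetical fit for K = 4 ordinal classes, p = 2 features;
# the intercepts increase with j, as the model assumes
probs = class_probs([-1.0, 0.0, 1.5], [0.8, -0.4], [1.0, 2.0])
```

Because the intercepts are increasing and the slope is shared, the cumulative probabilities are increasing in j, so the differenced class probabilities are nonnegative and sum to 1.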
Logistic Regression for K > 2
In the second example, our class categories are not ordinal. We can however fit a multinomial logit model.
The model includes K – 1 logit models
log[ P(C = 1 | X = x) / P(C = K | X = x) ] = β₀₁ + β₁'x

log[ P(C = 2 | X = x) / P(C = K | X = x) ] = β₀₂ + β₂'x

...

log[ P(C = K − 1 | X = x) / P(C = K | X = x) ] = β₀,K−1 + β'K−1 x
Logistic Regression for K > 2
We can estimate the posterior probabilities of each of our K classes from the multinomial logit model
When K = 2, this reduces down to a single linear function (i.e. a single logistic regression)
P(C = 1 | X = x) = exp(β₀₁ + β₁'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

P(C = 2 | X = x) = exp(β₀₂ + β₂'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

P(C = k | X = x) = exp(β₀ₖ + βₖ'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x)),  k = 1, 2, ..., K − 1

P(C = K | X = x) = 1 / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))
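The posterior formulas above are easy to compute once the K − 1 logit fits are in hand. A minimal sketch in plain Python, with hypothetical coefficients:

```python
import math

def multinomial_posteriors(b0, betas, x):
    """Posteriors P(C = k | X = x) for a multinomial logit model with
    class K as the reference; b0 and betas hold the K-1 fitted logits."""
    # Linear predictors beta_0k + beta_k' x for k = 1..K-1
    etas = [b0k + sum(b * xi for b, xi in zip(bk, x))
            for b0k, bk in zip(b0, betas)]
    denom = 1.0 + sum(math.exp(e) for e in etas)
    probs = [math.exp(e) / denom for e in etas]   # classes 1..K-1
    probs.append(1.0 / denom)                     # reference class K
    return probs

# Hypothetical K = 3 model with p = 2 features
probs = multinomial_posteriors([0.2, -0.5], [[1.0, 0.0], [0.5, 0.5]], [1.0, 1.0])
```

With K = 2 (a single intercept and slope vector), this reduces to the binary logistic posteriors, matching the K = 2 remark above.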
Logistic Regression for K > 2
When K = 2, this reduces down to a single linear function (i.e. a single logistic regression)
Though we’ve referenced the last category, since the data have no natural ordering we could reference any category we choose.
Unlike the cumulative logit and proportional odds models, all parameters vary in these models
Logistic Regression for K > 2
As in the case of the ordinal models, it makes more sense to fit these models simultaneously
As a result, there are some assumptions and constraints we must impose:
(1) The different classes in the data represent a multinomial distribution
-Constraint: all posterior probabilities must sum to 1
-In order to achieve this, all models are fit simultaneously
(2) Independence of Irrelevant Alternatives:
-The relative odds between any two outcomes are independent of the number and nature of the other outcomes being simultaneously considered
Logistic Regression vs. LDA
Both LDA and logistic regression represent models of the log-posterior odds between classes k and K that are linear functions of x
LDA:

log[ P(C = k | X = x) / P(C = K | X = x) ]
  = log(πₖ / π_K) − (1/2)(μₖ + μ_K)'Σ⁻¹(μₖ − μ_K) + x'Σ⁻¹(μₖ − μ_K)
  = α₀ₖ + αₖ'x

where πₖ = P(C = k), α₀ₖ = log(πₖ / π_K) − (1/2)(μₖ + μ_K)'Σ⁻¹(μₖ − μ_K), and αₖ = Σ⁻¹(μₖ − μ_K)

Logistic regression:

log[ P(C = k | X = x) / P(C = K | X = x) ] = β₀ₖ + βₖ'x
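For a single feature (p = 1, so Σ reduces to a scalar variance σ²), the LDA intercept and slope above can be computed directly from the class means, priors, and shared variance. A minimal sketch in plain Python; the priors, means, and variance used in the example are hypothetical:

```python
import math

def lda_logit_coeffs(pi_k, pi_K, mu_k, mu_K, sigma2):
    """Coefficients of the LDA log-posterior-odds line
    log[P(C=k|x)/P(C=K|x)] = a0 + a*x for one feature (p = 1)
    with shared within-class variance sigma2."""
    a = (mu_k - mu_K) / sigma2                   # slope: Sigma^{-1}(mu_k - mu_K)
    a0 = (math.log(pi_k / pi_K)
          - 0.5 * (mu_k + mu_K) * (mu_k - mu_K) / sigma2)
    return a0, a

# Hypothetical two-class setup: priors 0.4/0.6, means +1/-1, variance 2
a0, a = lda_logit_coeffs(0.4, 0.6, 1.0, -1.0, 2.0)
```

As a check, the line a0 + a·x agrees with the log prior ratio plus the log ratio of the two Gaussian densities at any x, which is exactly how the formula above was derived.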
Logistic Regression vs. LDA
The posterior conditional density of class k for both LDA and logistic regression can be written in the linear logit form
The joint density for both can be written in the same way
Both methods represent linear decision boundaries that classify observations
So what’s the difference?
LDA:   P(C = k | X = x) = exp(α₀ₖ + αₖ'x) / (1 + Σ_{l=1}^{K−1} exp(α₀ₗ + αₗ'x))

logistic:   P(C = k | X = x) = exp(β₀ₖ + βₖ'x) / (1 + Σ_{l=1}^{K−1} exp(β₀ₗ + βₗ'x))

Joint density:   P(X, C = k) = P(X) P(C = k | X)
Logistic Regression vs. LDA
The difference lies in how the linear coefficients are estimated.
LDA: Parameters are fit by maximizing the full log-likelihood based on the joint density
-recall here f is the Gaussian density function
Logistic regression: In this case the marginal density P(C = k) is arbitrary and parameters are estimated by maximizing the conditional multinomial likelihood
-although ignored, we can think of this marginal density as being estimated in a nonparametric unrestricted fashion
LDA full likelihood:   P(X, C = k) = f(X; μₖ, Σ) P(C = k)

Logistic regression conditional likelihood:   P(C = k | X)
Logistic Regression vs. LDA
So this means...
(1) LR makes fewer assumptions about the distribution of the data (a more general approach)
(2) But LR "ignores" the marginal distribution P(C = k)
-Including additional distributional assumptions provides more information about the parameters, allowing for more efficient estimation (i.e. lower variance)
-If the Gaussian assumption is correct, we could lose up to 30% efficiency
-OR we need 30% more data for the conditional likelihood to do as well as the full likelihood
Logistic Regression vs. LDA
(3) If observations are far from the decision boundary (i.e. probably NOT Gaussian), they influence the estimation of the common covariance matrix
-i.e. LDA is not robust to outliers
(4) If the data in a two-class model can be perfectly separated by a hyperplane, the LR parameters are undefined. But the LDA coefficients are still well defined (the marginal likelihood avoids this degeneracy)
-e.g. one particular feature has all of its density in one class
-Advantages/disadvantages to both methods
-LR thought of as more robust because makes fewer assumptions
-In practice they tend to perform similarly
Problems with Both Methods
There are a few problems with both methods…
(1) As with regression, we have a hard time including a large number of covariates especially if n is small
(2) A linear boundary may not really be an appropriate choice for separating our classes
So what can we do?
LDA and logistic regression work well if classes are linearly separable...
But what if they aren't? Linear boundaries may be almost useless
Nonlinear test statistics
The optimal decision boundary may not be a hyperplane → nonlinear test statistic
[Figure: a nonlinear decision boundary; the accept region for H0 is separated from the H1 region by a curved boundary]
Multivariate statistical methods are a Big Industry:
-Neural Networks
-Support Vector Machines
-Kernel density methods
Artificial Neural Networks (ANNs)
Central Idea:Extract linear combinations of inputs as derived features and then model the outcome (classes) as a nonlinear function of these features
Huh!?
Really they are nonlinear statistical models but with pieces that are familiar to us already
Biologic Neurons
Input signals come from the axons of other neurons, which connect to dendrites (input terminals) at the synapses
If a sufficient excitatory signal is received, the neuron fires and sends an output signal along the axons
The firing of the neuron occurs when a threshold excitation is reached
Idea for Neural Networks came from biology- more specifically, the brain…
Brains versus Computers: Some numbers
-Approximately 10 billion neurons in the human cortex, compared with tens of thousands of processors in the most powerful parallel computers
-Each biological neuron is connected to several thousand other neurons, similar to the connectivity in powerful parallel computers
-Lack of processing units can be compensated for by speed. Typical operating speeds of biological neurons are measured in milliseconds, while a silicon chip can operate in nanoseconds
-The human brain is extremely energy efficient, using approximately 10⁻¹⁶ joules per operation per second, whereas the best computers today use around 10⁻⁶ joules per operation per second
-Brains have been evolving for tens of millions of years; computers have been evolving for tens of decades
ANNs
Non-linear (mathematical) models of an artificial neuron
[Figure: an artificial neuron. Input signals x₁, x₂, x₃, ..., xₚ enter with synaptic weights w₁, w₂, w₃, ..., wₚ, are summed (Sₕ), and pass through an activation/threshold function g to produce the output signal O]
ANNs
A Neural Network is a 2-stage classification (or regression) model
Can be represented as network diagram
-for classification these represent the K-classes -kth unit models probability of being in kth class
[Network diagram: input layer X₁, X₂, X₃, ..., Xₚ₋₁, Xₚ; hidden layer Z₁, Z₂, Z₃, ..., Z_M; output layer Y₁, Y₂, ..., Y_K]
ANNs
The Zₘ represent derived features created from linear combinations of the X's
The Y's are modeled as a function of linear combinations of the Zₘ
σ is called the activation function
Zₘ = σ(α₀ₘ + αₘ'X),  m = 1, 2, ..., M

Tₖ = β₀ₖ + βₖ'Z,  k = 1, 2, ..., K

fₖ(X) = gₖ(T),  k = 1, 2, ..., K
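The three equations above amount to one forward pass through the network. A minimal sketch in plain Python, using the sigmoid as σ and the softmax as the output function gₖ; all weight values in the example are hypothetical:

```python
import math

def sigmoid(v):
    """Activation function sigma applied to each hidden unit."""
    return 1.0 / (1.0 + math.exp(-v))

def forward(x, alpha0, alpha, beta0, beta):
    """One forward pass of a single-hidden-layer network:
    Z_m = sigma(alpha_0m + alpha_m' X), T_k = beta_0k + beta_k' Z,
    f_k(X) = g_k(T), with g taken here to be the softmax."""
    # Hidden derived features Z_1..Z_M
    z = [sigmoid(a0 + sum(a * xi for a, xi in zip(am, x)))
         for a0, am in zip(alpha0, alpha)]
    # Output linear combinations T_1..T_K
    t = [b0 + sum(b * zm for b, zm in zip(bk, z))
         for b0, bk in zip(beta0, beta)]
    # Softmax output function g_k (shift by max for numerical stability)
    mx = max(t)
    ex = [math.exp(tk - mx) for tk in t]
    s = sum(ex)
    return [e / s for e in ex]

# Hypothetical weights: p = 2 inputs, M = 3 hidden units, K = 2 classes
out = forward([0.5, -1.0],
              [0.1, -0.2, 0.0], [[1.0, 0.5], [-0.5, 1.0], [0.2, 0.2]],
              [0.0, 0.3], [[1.0, -1.0, 0.5], [0.5, 0.5, -0.5]])
```

With the softmax output, the K values can be read as the class posterior probabilities, consistent with "the kth unit models the probability of being in the kth class."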
ANNs
The activation function, σ, could be any function we choose
In practice, there are only a few that are frequently used
(i) sign(x) = 1 if x ≥ 0, 0 if x < 0

(ii) sigmoid: σ(x) = 1 / (1 + e^(−x))

(iii) Gaussian radial basis function
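The three activation functions listed above can be written down directly. A minimal sketch in plain Python; the threshold at 0 for the sign function and the RBF center/width defaults are illustrative choices:

```python
import math

def sign_act(x):
    """(i) Hard threshold used by the original perceptron."""
    return 1 if x >= 0 else 0

def sigmoid_act(x):
    """(ii) Sigmoid, the usual smooth replacement for the hard threshold."""
    return 1.0 / (1.0 + math.exp(-x))

def rbf_act(x, c=0.0, s=1.0):
    """(iii) Gaussian radial basis function centered at c with width s."""
    return math.exp(-((x - c) ** 2) / (2.0 * s ** 2))
```

The sigmoid smooths the step of the sign function, which is what makes gradient-based fitting of the network possible and ties ANNs back to logistic regression.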
ANNs
ANNs are based on simpler classifiers called perceptrons
The original single layer perceptron used the hard threshold sign function but this lacks flexibility making separation of classes difficult
Later adapted to use the sigmoid function -Note this should be familiar (think back to logistic regression)
ANNs are adaptation of the original single layer perceptron that include multiple layers (and have hence also been referred to as multi-layer perceptrons)
Use of the sigmoid function also links it with multinomial logistic regression