7_4 Linear Discriminant Analysis
TRANSCRIPT, 7/29/2019
http://slidepdf.com/reader/full/74-linear-discriminant-analysis

Advanced Statistical Methods in Insurance
7. Multivariate Data
7.4 Linear Discriminant Analysis
Salzburg Institute of Actuarial Studies
Problem Definition
` Given are samples from g different populations
` The main question of discriminant analysis: find, based on a so-called training sample, a decision rule that allows the correct classification of future observations into the (unknown) population they belong to
` d: S → {1, ..., g}, with S ⊂ R^p the sampling space; d is a decision rule which can be applied to an observation x_ω: d(x_ω) = k
` If ω ∈ Ω_k and d(x_ω) = k: correct decision
` If ω ∈ Ω_k and d(x_ω) ≠ k: wrong decision
©Hudec & Schlögl
Data Structure
Training sample with known group membership
Bayes Theorem
` Prior probability of group membership
  p(k) = P{ω ∈ Ω_k} > 0
` Class specific (conditional) distribution of x: f(x|k)
` Unconditional distribution of x
  f(x) = Σ_{k=1}^{g} p(k) f(x|k)
` Posterior probability
  P(k|x) = p(k) f(x|k) / f(x)
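As a sketch (my own illustration, not code from the slides), the posterior formula can be evaluated for univariate normal class densities; the priors and the parameter values below are assumed for the example:

```python
import math

def normal_pdf(x, mu, sigma):
    # Class specific density f(x|k), here a univariate normal
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, params):
    # P(k|x) = p(k) * f(x|k) / f(x),  with f(x) = sum_l p(l) * f(x|l)
    joint = [p * normal_pdf(x, mu, s) for p, (mu, s) in zip(priors, params)]
    fx = sum(joint)
    return [j / fx for j in joint]

# Two classes with equal priors; at the midpoint x = 1 both posteriors are 0.5
post = posteriors(1.0, [0.5, 0.5], [(0.0, 1.0), (2.0, 1.0)])
```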
Decision Principles
` Bayes Decision Rule
` Assign an object to that class k_est for which the posterior probability is maximal
` k_est = e(x) with p(k_est|x) ≥ p(l|x) for l = 1, ..., g
` Equivalently: p(k_est) * f(x|k_est) ≥ p(l) * f(x|l) for l = 1, ..., g
` Maximum-Likelihood Rule
` Assign an object to that class k_est for which the likelihood is maximal
` k_est = e(x) with f(x|k_est) ≥ f(x|l) for l = 1, ..., g
` In case of equality of all prior probabilities both rules are equivalent
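Both rules can be sketched as an argmax; this is my own minimal illustration, with densities and priors chosen so the two rules disagree near the class boundary:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def bayes_rule(x, priors, densities):
    # argmax_k p(k) * f(x|k)
    return max(range(len(priors)), key=lambda k: priors[k] * densities[k](x))

def ml_rule(x, densities):
    # argmax_k f(x|k)
    return max(range(len(densities)), key=lambda k: densities[k](x))

densities = [lambda x: normal_pdf(x, 0.0, 1.0), lambda x: normal_pdf(x, 2.0, 1.0)]
# Near the midpoint the likelihood alone favours class 1, but the
# unequal priors tip the Bayes decision toward class 0
k_bayes = bayes_rule(1.2, [0.8, 0.2], densities)
k_ml = ml_rule(1.2, densities)
```

With equal priors the two rules coincide, as stated on the slide.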
Optimality of Bayes Decision
Conditional error rate:
ε(d|x) = P(d(x) ≠ k | x) = 1 − P(d(x) = k | x)
As the Bayes rule maximizes the second term on the right side, it minimizes the conditional error rate. Integration over the sampling space S leads to the minimization of the unconditional error rate ε(d) = P(d(x) ≠ k) for an object from population k.
[Figure: visualization of the optimality of the Bayes decision rule for p=1]
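The argument can be checked numerically; a small sketch with assumed posterior values at some fixed x:

```python
def conditional_error(k, post):
    # eps(d|x) = 1 - P(d(x) = k | x): error rate of deciding for class k at x
    return 1.0 - post[k]

post = [0.2, 0.7, 0.1]   # assumed posterior probabilities at some x
errors = [conditional_error(k, post) for k in range(3)]
# The Bayes rule picks the class with maximal posterior,
# which is exactly the class with minimal conditional error
bayes_choice = max(range(3), key=lambda k: post[k])
```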
Bayes Rule for g=2
Linear Decision Rules
Discrimination with p=2 and g=2
Assumption: In each population the data are multivariate normal with different centers but constant covariance matrix.
Discrimination with p=2 and g=2
[Figure: contour plot of the two class densities over a grid, contour levels 0.01 to 0.06]
Obviously the Bayes and the Maximum Likelihood decision rule both lead to a linear separation between the groups.
Discrimination with p=2 and g=2 (homoscedastic)
[Figure: scatter plot of the two homoscedastic training samples with the estimated separating line]
From the training set we can estimate the unknown parameters of the multivariate normal and can calculate an estimate for the optimum separating line.
Bayes versus Maximum Likelihood Rule
[Figure: two panels showing the Bayes rule boundary; one panel with prior probabilities 0.8 / 0.2, the other with prior probabilities 0.5 / 0.5]
Case of Non-Homogeneous Variances
[Figure: contour plot of two class densities with unequal covariance matrices, contour levels 0.01 to 0.06]
Results in quadratic separation of the populations.
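Why the separation becomes quadratic can be seen from the log-density difference; a univariate sketch of my own (parameters are assumed, not from the slides):

```python
import math

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def discriminant(x, mu1, s1, mu2, s2):
    # log f(x|1) - log f(x|2): linear in x for s1 == s2, quadratic for s1 != s2
    return log_normal_pdf(x, mu1, s1) - log_normal_pdf(x, mu2, s2)

# The second difference of a function on an equidistant grid vanishes
# exactly when the function is linear
curv_equal = discriminant(0, 0, 1, 2, 1) - 2 * discriminant(1, 0, 1, 2, 1) + discriminant(2, 0, 1, 2, 1)
curv_unequal = discriminant(0, 0, 1, 2, 2) - 2 * discriminant(1, 0, 1, 2, 2) + discriminant(2, 0, 1, 2, 2)
```

With equal variances the x² terms cancel and the boundary is linear; with unequal variances a quadratic term of coefficient (1/(2·s2²) − 1/(2·s1²)) remains.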
Empirical Data
[Figure: scatter plot of a heteroscedastic two-group training sample]
LDA & QDA
Nonparametric Discriminant Analysis
[Figure: nonparametric density estimates over the range −2 to 4; the dashed lines show the true class specific densities]
In this situation, where the true distributions are normal, the non-parametric density estimation will give sub-optimal classification results.
Nonparametric Discriminant Analysis
[Figure: nonparametric density estimates over the range −2 to 4; the dashed lines show the true class specific densities]
In this situation non-parametric density estimation will give better classification results than the parametric estimate shown on the next slide.
Inadequacy of Parametric Estimate
[Figure: parametric density estimate for the same data over the range −2 to 4]
Compare this example with the considerations on robustness from chapter 3.2!
Naïve Bayes
` In case of large p non-parametric methods suffer from the so-called curse of dimensionality (Bellman, 1961): estimates tend to break down, as the number of data points needed to derive reliable estimates increases very fast.
` In these situations the “Naïve Bayes” principle has its merits. It assumes that the class densities are products of marginal densities, which corresponds to assuming conditional independence of the variables within each class:
  f̂(x|k) = f̂((x_1, ..., x_p)'|k) = Π_{j=1}^{p} f̂(x_j|k)
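The product form can be sketched directly; here the marginal estimates are assumed to be univariate normals (my illustration, not the slides' estimator):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def naive_class_density(x, marginals):
    # f^(x|k) = prod_j f^(x_j|k): product of per-coordinate marginal estimates
    d = 1.0
    for xj, (mu, s) in zip(x, marginals):
        d *= normal_pdf(xj, mu, s)
    return d

# p = 2 independent standard normal coordinates; at the origin the
# product is (1/sqrt(2*pi))^2 = 1/(2*pi)
dens = naive_class_density((0.0, 0.0), [(0.0, 1.0), (0.0, 1.0)])
```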
LDA versus Logistic Regression
[Figure: decision boundaries; green: logistic regression, magenta: LDA]
Fisher - LDA
` The popularity of LDA is due to Fisher, who developed the method without the assumption of multivariate Gaussian distributions within each class.
` Fisher showed that the same result can be achieved by searching for the most informative (with regard to the class structure) low-dimensional projections of the data.
Not the Most Informative Projection
Discriminant Analysis due to Fisher
[Figure: projection of the grouped data onto the discriminant direction]
This projection gives the best discrimination between the groups.
Fisher's Approach with g=2 Groups
` Looking for a linear combination of the observed variables
  y_k = a' x_k,  k = 1, 2
` which maximizes the variance criterion
  Q(a) = (ȳ_1 − ȳ_2)² / (s_1² + s_2²) → max
` leads to the solution
  a = W⁻¹ (x̄_1 − x̄_2)
` W ~ within sum of squares cross product matrix
  W = N_1 S_1 + N_2 S_2 = Σ_{n=1}^{N_1} (x_{1n} − x̄_1)(x_{1n} − x̄_1)' + Σ_{n=1}^{N_2} (x_{2n} − x̄_2)(x_{2n} − x̄_2)'
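A pure-Python sketch of the g=2 solution for p=2 (the toy data are made up, and the 2x2 inverse is written out by hand):

```python
def mean2(xs):
    n = float(len(xs))
    return [sum(x[0] for x in xs) / n, sum(x[1] for x in xs) / n]

def scatter2(xs, m):
    # sum_n (x_n - m)(x_n - m)': within-group cross product matrix
    S = [[0.0, 0.0], [0.0, 0.0]]
    for x in xs:
        d = [x[0] - m[0], x[1] - m[1]]
        for i in range(2):
            for j in range(2):
                S[i][j] += d[i] * d[j]
    return S

def fisher_direction(X1, X2):
    # a = W^{-1} (xbar_1 - xbar_2), W = pooled within cross product matrix
    m1, m2 = mean2(X1), mean2(X2)
    S1, S2 = scatter2(X1, m1), scatter2(X2, m2)
    W = [[S1[i][j] + S2[i][j] for j in range(2)] for i in range(2)]
    det = W[0][0] * W[1][1] - W[0][1] * W[1][0]
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    # 2x2 inverse applied to diff
    return [(W[1][1] * diff[0] - W[0][1] * diff[1]) / det,
            (W[0][0] * diff[1] - W[1][0] * diff[0]) / det]

# Two square point clouds with identical scatter, shifted along the x-axis;
# the discriminant direction is then proportional to the mean difference
X1 = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
X2 = [(4.0, 0.0), (6.0, 0.0), (4.0, 2.0), (6.0, 2.0)]
a = fisher_direction(X1, X2)
```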
Fisher‘s Approach with g>2 Groups (1)
u a r i a l S t u d i e s
S a l z b u r g I n s t i t u t e o f A
c t
©Hudec & Schlögl25
Fisher's Approach with g>2 Groups (2)
It can be shown that the approach of Fisher leads to the same results as the LDA based on multivariate Gaussians with constant variance matrices.