class-modelling vs discriminant classificationclassiﬁcation rate or **cv efﬁciency sunday, march...

Dr. Marina CocchiDpt Chemical and Geological Sciences - UniMOREArctic Analysis 10-15 March [email protected]

Class-modelling vs Discriminant classification

A practitioner reflections

Dept. of Chemical and Geological Sciences, University of Modena and Reggio Emilia

Via G. Campi, 183 - 41125 Modena (Italy)

Servizi Tracciabilità, Autenticità, Monitoraggio Processo e prodotto

www.chemstamp.it

Dr. Marina Cocchi Researcher Analytical Chemistry (Chemometrics)

Sunday, March 16, 2014

mailto:[email protected]



outline

o The one class modeling context

o Retrospective on SIMCA

o PLS-DA and its implementation





Classification task

The one class modeling context

Class Modeling

Discriminant methods

‣single class information is used‣useful in authentication

‣other/others class/es information always used

by F. Marini

by F. Marini





Why class-modeling ?e.g. Authentication issue

4

Need to identify non conformity, adulterations...

Some.mes

Most o1en

we know what we want !!!

we would like to know!!!

Chemical Signature

there is a category to contrast with

to find if an dulterant is present even not knowing

what it could be

we expect something to change in the signal profiles if an ”unknown” constituent is added







the “unknown” has similar profile but some peculiar

features

modeled class/es

‣ in model space the “unknown” will be accepted /close to class but it will have high residuals

A discriminant method will assign it to one of the classes by definition

‣ Does the paradigm one class against the rest of words make sense ? (e.g. ill vs healthy)





SIMCA model Liguria 3LV**ECVA model 3LV in inner PLS*

CLASS Training set

Test set

Liguria 12 9

Apulia 12 10

Modeled categories

‣ how the models consider “not italian samples” ?

The one class modeling contextDistinguish different valued italian products

Canonical Variate Scores

Ort

hogo

nal D

ista

nce/

OD

lim

Score Distances/SDlim

* model dimensionality according to minimum CV classification rate or **CV Efficiency





ECVA weights

SIMCA PC1 loadings Liguria

different concentration common

main variance for liguria


std for each category+ Apulia

- Liguria





discirminant PLS model 3LV The one class modeling context

calibration predition

again the model is built on 2 classes Apulia and Liguria

‣ how the models consider “not italian samples” ?





Retrospective on SIMCA

1. Build a distinct PCA model for each class (separate centering/scaling & dimensionality choice)

2. Define a distance measure to the class model

3. Define a limit/boundary for acceptability (classification rule)

4. For a “new” sample estimate distances from each class model

5. objects may be assigned to one, more than one, or none of the classes

SIMCA methods

not assigned

to both





Common frame

SD = Scores Distancedistance within model spaceUse PCA scores

OD = Orthogonal Distancedistance from model spaceUse Residuals

2. Define a distance measure to the class model

SIMCA critical/differing steps

Variations

‣how they are definedSD‣Euclidean / Mahalonobis / Leverage‣from origin / boundaries

OD‣DoF correction

‣Use CV residuals and scores or not‣Use both / only one of the two‣ how to combine

3. Define limit(s)/boundary(ies) for acceptability (classification rule)

alternative-SIMCA

‣Use SD & OD distributions/statistics‣same / different

‣Use SD & OD limits Robust / distributions free

original-SIMCA

‣combine SD & OD for “external” object in a single “distance to class” measure : D

‣Use F-test to compare D with class residuals variance (RSD)






Implementation 1 SIMCA original[1-2]

[1] S.Wold, Pattern Recognition, 1976, 127; [2] S. Wold, M. Sjostrom, ACS Symposium Series, 1977, 243-281.

sp(q)

PC1

PC2

Class space

ta –ϑa(q)

,lim

dp(q) = sp

(q)

i = 1:M M = n° variables; k = 1:N N = n° objects of class qa = 1:A A = n° components

★ SD = 0 for calibration samples★ SD is weigthed by Φa

2

Φa2

= sp(q)2 / (Σktak(q)2 /Nq)

★ F-test because it is a comparison of variances

distance from a class, i.e q, of a test object, i.e. p

dp(q) = sqrt[sp(q)2 + ΣaΦa2(ta –ϑa

(q),lim)2]

total RSD of a class, i.e. q:

s0(q) = sqrt[Σikeik2/(N-‐‑A-‐‑1)(M-‐‑A)]

OD SD

sp(q) = sqrt[Σk e2pk/(M-A)]

Classification rule

F = dp(q)2 / s0(q)2 < Fcrit (M-‐‑A),(N-‐‑A-‐‑1)(M-‐‑A)

If true for both q and r Unique assignment only if

F = dp(q)2 / dp(r)2 > Fcrit (M-‐‑Ar),(M-‐‑Aq)

ϑa(q)

,lim can be: tmax , tmax + std(t), CI(t),






DoF , corrections:

F = dp(q)2/s0(q)2

• correct F *(N-1)/(N-A-1)

Fcrit (M-A) , (N-A-1)(M-A)Variants: • ½[(N-A-1)(M-A)] (to enlarge acceptance)

• Use: (r-A)(N-A-1)/N, (N-A-1)(r-A)r = M if N>M r = N-1 if N<=M

‣ correct F *N/(N-A-1) and use Fcrit 1, (N-A-1)

See references 15-18 in [1]

1. R.D. Maesschalck, A. Candolfi, D.L. Massart, S. Heuerding, Chemom. Intell. Lab. Syst. 47 (1999) 65-772. G.R. Flaten, B. Grung, O.M. Kvalheim, Chemom. Intell. Lab. Syst. 72 (2004) 1013. Todeschini et al., Chemom. Intell. Lab. Syst. 87 (2007) 3–17.

Sp(q)

PC1

PC2

Class space

✓ use cross-validated residuals in Orthogonal Distance [1, 2]

(DoF in F test were left unchanged)

✓ use only Orthogonal distance [2]

✓ ... use only Leverage values [3]..

Distances:

✓ Use Mahalonobis distance as SD

variations on Theme SIMCA original





Implementation in Software SIMCA original

SIMCA Umetrics s0(q) = sqrt(Σikeik2/(N-‐‑A-‐‑1)(M-‐‑A))pooled RSD of a class, i.e. q:

absolute DModX(q) = s i(q) = sqrt(Σkeik2/(M-‐‑A))*v

(calibration samples)

where v is a correction factor slightly higher than 1 accounting for the fact that distance to model is expected to be slightly smaller for a sample that is part of it

absolute DModXPS(q) = s p (q) = sqrt(Σkepk2/(M-‐‑A)) (“external” samples), i.e. OD

dp(q) = augmented DModXPS+(q) = sqrt[s p (q)2 + SD2]

Classification rule

(s i(q) / s0

(q) ) is approximately F-distributed with DoF of

observation and the model

SIMCA PARVUS Classification Rule is as in Umetrics but correct F *(N-1)/(N-A-1) Moreover variants are introduced in: ϑa

(q),lim can be restricted

D is considered as the hypotenuse from class box Residuals are corrected by leverage






Implementation in Software SIMCA original

SIMCA Unscramblers0

(q) = sqrt(ResXValTot) =(?) sqrt(Σikeik2 /[Nval *(M-‐‑A)/M]

ModelDist(q) =

s p (q) = sqrt(Σkepk2/(M-‐‑A)) (“external” samples)

i.e. OD

Classification Membership Limits

‣ Leverage Limit

<= 3*(A+1)/N

‣ Sample to model distance Limit

(s p(q) / s0

(q) )

is approximately F-distributed

Leverage correction: Fcrit 1, (N-‐‑A-‐‑1)

Cross Validation/Test: Fcrit 1, Nval

not clear to me how the two limits are used






alternative SIMCA a somehow alternative approach (coming from MSPC realm)

T2

Q

Variable 1

Variab

le 2

Varia

ble

3

PC1

PC2

Common frame

SD = Scores Distanceuse Mahalonobis Distance / Hotelling's T-square (T2)

OD = Orthogonal DistanceQ is the sum of squared residuals: Σkepk2 (same role as sp

(q)2) Qlim is calculated by assuming a χ2 distribution

Variations ‣ Which Reference Distribution for SD

‣ Use of Robust estimation

‣ Use CV scores

Classification rule

Use a summary of reduced distance

SD/SDlim & OD/ODlim

Variations ‣ Unique assignment /not






Classification rule

Assign an object to a class if its reduced combined distance satisfies :

< V2

PLS-Toolbox SIMCA

Tlim 2 = [A*(N-1)/N*(N-A)]*Fcrit A, (N-‐‑A)

alternative SIMCA

PLS-Toolbox SIMCA

Implementation II

Qlim χ2 distribution JM approximation

Q = Σkepk2

Ti2 = tiλ-1ti

Hotelling's T-square statistics





robust SIMCA [1,2]

1. K.V. Branden and M. Hubert, Chemolab 79 (2005) pp.10-212. M. Daszykowski, K. Kaczmareck, I. Stanimirova, Y. Vander Heyden, B. Walczak

Chemom. Intell. Lab. Syst. (2006)

- use robust PCA for class models

- use a weighted sum of “reduced” orthogonal (OD, i.e. sp(q)2 in previous slide) and

Mahalonobis (MD, in scores space) distances

- thus classification rule will be to assign a new sample to the class for which R is minimal:

R = γ (OD/ODlim) + (1-γ) (MD/MDlim)

where MDlim is estimated by using as reference the chi-squared distribution with dof equal to the number of retained components (A) and ODlim is estimated by assuming that a scaled version of chi-squared distribution is appropriate (Wilson-Hilferty approximation)

γ in the range 0-1

variations on Theme alternative SIMCA





The same method as for SD is used:

1. A. Pomerantsev J. Chemometrics 22 (2008) 601–609.


Pomerantsev acceptance area [1] Scores Distance = Leverage

mean Leverage

SD distribution

DoF = Nh

The author estimated DoF by the method of moment

The author prefers a Robust estimate:

DoF = A if scores are normally distributed

Ortoghonal Distance





The same method as for SD is used:

1. A. Pomerantsev J. Chemometrics 22 (2008) 601–609.


Pomerantsev acceptance area [1] Scores Distance = Leverage

mean Leverage

SD distribution

DoF = Nh

The author estimated DoF by the method of moment

The author prefers a Robust estimate:

DoF = A if scores are normally distributed

Ortoghonal Distance

Classification ruleDifferent acceptance areas are derived

depending on how SD and OD are

combined






Pomerantsev acceptance area [1]

I, II and V were considered acceptable

I

II

V see article





Hlim fit

Qlim fit χ2 distribution JM approximation

Classification rule (reduced distances as in PLS toolbox)

‣ Q, H limits:

Orthogonal Distance: Q

Scores Distance:

Leverage: Hfit = diag [ Tfit(TfitTTfit)-1TfitT]

HCV = diag [ TCV(TfitTTfit)-1TCVT]

S: scores Covariance matrix;T: Scores matrix;

D-statistic Dfit = diag[ Tfit(Sit)-1TfitT]

Dlim fit[1] ~ [A(N2 - 1)/N(N-A)] F(A, N-A)

[1] J.A. Westerhuis, S.P. Gurden and A.K. Smilde, Chemolab. 51 (2000) 95-114.

[2] M. Forina Chemolab. 96 (2009) 239-245

*

H CV 95%: the 95 percentile of HCV

Q CV 95%: the 95 percentile of QCV


hi N/(N-1) ~ ß(M/2, (N-M-1)/2)[2]





Implementation I & II comparison

square distance/rsd of a “new” sample is compared to square rsd of class by F-‐‑test

Score distance is computed from the “box” boundaries (but can be taken from average problematic degree of freedom for F-‐‑test scaling factor needed in order to combine score distance with orthogonal one reduced score distance and orthogonal distance are

combined, thus being directly comparable

Score distance is computed from the averageproblematic

two different reference distribution are assumed

degree of freedom (?) giving the same weight to both type of distances

Sp(q)

PC1

PC2

Class space

PC1

PC2 Q

Class space

T2

Qlim

T2limRejected by both approaches

Rejected by original SIMCAAccepted by this rule

Accepted by both approaches

Q

Accepted by rules using only orthogonal

distance?

T2

original in SIMCA (I): score and orthogonal distances are combined as a “rsd” measure

Variants using T2 , MD avoids to define “box” boundaries

using only Orthogonal distances does not take into account deviations inside class space using χ2 distribution for both OD and SD do not combine OD and SD in a single rule

alternative in PLS-‐‑Toolbox SIMCA (II):




22

0

5000.0000

10000.0000

15000.0000

20000.0000

0

50.0000

100.0000

150.0000

200.0000

0 25 50 75 100

Dlim

numero di oggetti

one thirdone fourthone halfN-1

ratio objs/LVs

Dlim fit[1] ~ [A(N2 - 1)/N(N-A)] F(A, N-A)



1. I. Frank, Chemom. Intell. Lab. Syst. 4 (1998) 215-222

Distances:

✓ Use a modified Mahalonobis distance defined by using both SD and OD

between LDA and SIMCA DASCO

Bayes rule:

DASCO: min

SD OD





Regression on dummy variables (DPLS, PLS-DA)

Y coding: as many dummy y-variables as classes 1/0 Fit a PLS2 model

DPLSPredict the dummy values (yi_pred) and assign the object to the group with highest predicted value, k = argmax(yi_pred)

PLS-DA

with more than 2 categories can be sub-optimal (masking effect Hastie1 )

1 Hastie T. et al. The element of statistical learning, Springer: New York,2001

It has been demonstrated PLS on dummy variables has connection with LDA2

2 Barker and Rayens,. J. Chemometrics 2003 (17): 166-173

or with Fisher canonical discriminant analysis, in fact it correspond to eigendecomposition of a slightly altered version (M) of the Between-group variance-covariance matrix (B)

a better alternative would be to use LDA, QDA, etc.. on PLS scores

Nocairi3 et al. implemented PLS-DA bringing to eigendecomposition of B

3 Nocairi et al. Comp. Stat & Data Anal. 2005 (48) 139-147





Regression on dummy variables (DPLS, PLS-DA)

Y coding: as many dummy y-variables as classes 1/0 Fit a PLS2 model

PLS-DA extension

1 U. Indhal, H. Martens, T. Naes,.J. Chemometrics 2007; 21: 529–536

A general framework has been proposed1 by including prior probability in the derivation

Given a classification problem with specified prior probabilities a feasible strategy isto extract PLS loading weights according to the dominant eigenvector of B found by SVD of: YtX

* X has to be centered by the weighted global means taking into account the prior

Moreover using LDA on PLS2 scores or Y fitted values (not redundant, i.e. with k column) give the same results





- Y coding: as many dummy y-variables as classes 1/0

- Fit a PLS2 model

‣ define a threshold‣ UNSCRAMBLER: fixed, e.g. 0.5 ; (Also suggest to use LDA... on PLS scores)‣ SIMCA-Umetrics : <0.35 reject; 0.35-0.65 border-line; 0.65-1.35 accept; >1.35 do not belong, possible extreme/outlier‣ PLS-toolbox:‣ Threshold selected on best SENS/SPEC compromise from ROC curve (CV‣ Bayesan thresholdThe Bayesian threshold calculation assumes that the predicted y values follow a distribution similar to what will be observed for future samples. Using these estimated distributions, a threshold is selected at the point where the two estimated distributions cross; this is the y-value at which the number of false positives and false negatives should be minimized for future predictions (either in fit or CV prediction).

Implementation in Software

Classification rule

discriminant PLS





Practitioners reflections:

‣ there is a clear imbalance between literature and implementations‣ there is a need for tutorials (back to basic.. but a little beyond...)‣ need to define a problem oriented strategy, e.g. modeling vs. discrimination, etc..

‣ using the threshold rule(s) in the DPLS spanned space how to relate to the theoretical FCDA framework or other?‣ when there are more than 2 classes an “hybrid” use of threshold can be found, i.e. use it as class boundaries and treat results as class modeling... make sense (??)‣ to interprete PLS weights, regression coefficients in the usual way forgetting Y is a dummy matrix is correct ?‣ what about target projection and SR in the discriminant PLS framework ?

‣ could we think of unified/generalized SIMCA making a synthesis of the different derivations ?

for discussion:





Thanks for attention




class-modelling vs discriminant classificationclassiﬁcation rate or **cv efﬁciency sunday, march...

Documents