12 Classification
TRANSCRIPT
Classification Methods
Classification problems
Classification Methods
Supported by XLStat
Classification Methods
All of these methods are considered supervised learning.
Initial assumptions regarding membership or properties are made when developing a model.
An initial evaluation of the data using exploratory data analysis is useful.
Data sets
Needed to develop and evaluate a classification model.
Training set: Representative samples used to build the model. The modeling software uses the class information.
Evaluation set: Samples of known class, used to test the model. The modeling software does not know the classes.
Test set: True unknowns.
Data pre-processing
With any of these methods, you may choose to do some sort of data pre-processing.
Raw: Is fastest.
Scaled: Gives equal weight to the variables.
PCA: Can be used to reduce noise and insignificant variables.
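As a rough illustration of the three options - a minimal sketch using scikit-learn with placeholder data, not XLStat's implementation:

```python
# Sketch of the three pre-processing choices; X is a samples-by-variables array.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X = np.random.default_rng(0).normal(size=(30, 8))       # placeholder data

X_raw = X                                               # raw: fastest, untouched
X_scaled = StandardScaler().fit_transform(X)            # scaled: equal weight per variable
X_scores = PCA(n_components=3).fit_transform(X_scaled)  # PCA scores: noise and
                                                        # insignificant variables reduced
```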
Data pre-processing
With some data sets, you may also want to do some other types of pre-processing.
Example: spectral or chromatographic traces.
Options may include:
Smoothing, baseline correction, signal averaging, using the first or second derivative.
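For traces like these, the options map naturally onto scipy's signal tools. A hedged sketch - the synthetic trace, window length, and polynomial order are illustrative only:

```python
# Smoothing, derivatives, and signal averaging via the Savitzky-Golay filter.
import numpy as np
from scipy.signal import savgol_filter

rng = np.random.default_rng(1)
trace = np.sin(np.linspace(0, 10, 500)) + rng.normal(0, 0.05, 500)  # noisy trace

smoothed = savgol_filter(trace, window_length=15, polyorder=3)  # smoothing
d1 = savgol_filter(trace, 15, 3, deriv=1)                       # first derivative
d2 = savgol_filter(trace, 15, 3, deriv=2)                       # second derivative

replicates = np.stack([trace + rng.normal(0, 0.05, 500) for _ in range(4)])
averaged = replicates.mean(axis=0)   # signal averaging of replicate scans
```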
Creating an evaluation set
The evaluation set is typically a sub-set of the training set that was omitted when building a model.
Randomly pick a subset of the data, or randomly pick members from each class (see the sketch below).
Any approach that selectively removes a portion of the data could cause bias.
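A minimal sketch of the random, per-class pick using scikit-learn's stratified split; the iris data and the 25% hold-out fraction are arbitrary choices:

```python
# Carve an evaluation set out of the training data, stratified by class.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)  # members from every class
```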
Leave-one-out validation
A standardized approach for validation of a model where each sample serves as an evaluation set.
1. Omit a single sample from the set.
2. Build the model.
3. Test the omitted sample.
4. Repeat the above steps until each sample has been omitted and tested once.
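In scikit-learn terms, the whole loop above collapses to one call; the iris data and the LDA classifier are illustrative choices only:

```python
# Leave-one-out validation: each sample is omitted, the model is rebuilt,
# and the omitted sample is tested, until every sample is tested once.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
scores = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print(f"LOO accuracy: {scores.mean():.3f}")
```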
Your data
While leave-one-out testing is the best approach, it can be slow for large sets.
An alternate approach is to leave two or more samples out with each pass.
Samples should be randomly listed in the matrix.
The same two (or more) samples should never be omitted together more than once.
Rule building methods
Methods where a set of rules is created to discriminate between classes.
Linear learning machine: One or more linear vectors are created to discriminate between classes.
Discriminant analysis: Linear or quadratic equations are used to separate classes.
Classification trees: A series of rules is used to sequentially classify.
Linear learning machine
The assumption is that one or more vectors can be found that can be used to discriminate between our classes.
This can make use of our raw data or work in PC space.
PC space would be better as there would be noise reduction.
Linear learning machine
For simple classifications, there can be many linear vectors that give complete class discrimination.
You would select the one that gives the best partitioning.
You are not limited to just 1 or 2-D vectors.
Linear learning machine
As the number of classes increases, the potential number of usable vectors will decrease.
The problem can become complex very rapidly.
You can reach a point where simple linear lines can no longer solve the problem.
Linear learning machine
In this example, a linear solution can't be found that discriminates between the classes.
Clearly, there should be a way to discriminate - the classes appear to be well defined.
A non-linear function may offer the best approach (discriminant analysis).
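In spirit, the linear learning machine trains a linear discriminant vector iteratively, much like a perceptron. A hedged scikit-learn stand-in (not the classical LLM algorithm itself), working in PC space as suggested above:

```python
# A linear discriminant vector trained in PC space.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import Perceptron
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
llm = make_pipeline(StandardScaler(), PCA(n_components=2), Perceptron(random_state=0))
llm.fit(X, y)
print(f"training accuracy: {llm.score(X, y):.3f}")  # < 1.0 if classes are not
                                                    # linearly separable
```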
Discriminant Analysis (DA)
First described by Fisher in 1936.
Similar to LLM, but it can use both quantitative and qualitative variables.
The approach uses linear models when sample classes have similar covariance matrices.
It uses quadratic models when classes have dissimilar covariance matrices.
It can have problems if you have variables with null variance or multicollinearity - these must be eliminated.
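A sketch of the two variants in scikit-learn; which one applies depends on whether the class covariance matrices look similar. The iris data here is a stand-in:

```python
# LDA assumes classes share a covariance matrix; QDA fits one per class.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)     # similar covariances: linear model
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # dissimilar covariances: quadratic
print(lda.score(X, y), qda.score(X, y))
```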
Iris example
We'll return to the Iris example data set - using XLStat's built-in DA function.
We're going to use autoscaled data.
DA with XLStat.
[Figure: DA of the iris data - correlation circle for Petal width, Petal length, Sepal width, and Sepal length, and a score plot of the three species (labeled 1-3) on F1 (99.01%) vs. F2 (0.99%).]
Coffee example
This consisted of 6 types of coffee - identified based on MS data.
To avoid collinearity and null variable problems, PCA scores were used (first 5 components).
[Figure: DA score plot of the coffee data on F1 (56.19%) vs. F2 (22.79%); the six classes (C, E, K, R, S, U) form distinct clusters.]
Classification trees
Predicts class membership by sequential application of rules based on predictor variables.
With DA and LLM, you create a set of math models that are all applied at once.
With classification trees, the predictor variables are evaluated as ordinal rules, one at a time.
Classification trees
[Figure: a simple example tree - samples are first split into solid vs. liquid, branches are then tested with rules such as density > 1, and the leaves assign classes such as red or green.]
Iris example (yet again!)
XLStat supports the use of classification and regression trees.
Classification if the Y variable (class) is qualitative, regression if the Y variable is quantitative.
The iris example is a classification example.
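For comparison, a minimal sketch with scikit-learn's CART tree - not XLStat's algorithm, so the exact splits will differ:

```python
# Grow and print a small classification tree for the iris data.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
print(export_text(tree, feature_names=iris.feature_names))  # one rule per line
```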
Iris example
Example rule: if Petal width is in [1, 8[ then assign to Species 1.
[Figure: XLStat classification tree for the iris data. The root node (150 samples, 50 per species) splits on Petal width: [1, 8[ gives a pure Species 1 node (50 samples), while [8, 25[ gives a 100-sample node that is split further on Petal width, Petal length, Sepal length, and Sepal width into progressively purer nodes.]
Purity is just the percentage of samples in that node that belong to the assigned class.
[Figures: iris classification results - using the classification tree vs. using DA.]
Wine example
Riesling vs. Chardonnay.
Ohio vs. California.
Assayed 5 organic and 4 trace metal components.
Yes, you'll do the same with your homework.
Node  Class  Freq.  Purity    Rules
1     CaC    17     41.46%    (root)
2     CaC    17     58.62%    If Ca in [17.5, 60.75[ then Class = CaC in 58.6% of cases
3     CaR    7      58.33%    If Ca in [60.75, 94.75[ then Class = CaR in 58.3% of cases
4     CaR    3      60.00%    If 2,3-butanediol in [0, 0.065[ and Ca in [17.5, 60.75[ then Class = CaR in 60% of cases
5     CaC    17     70.83%    If 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 70.8% of cases
6     CaC    14     100.00%   If Mn in [0.82, 1.625[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases
7     OhC    7      70.00%    If Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 70% of cases
8     CaC    3      60.00%    If K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 60% of cases
9     OhC    5      100.00%   If K in [881.75, 1147.5[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases
10    OhC    2      100.00%   If 1-hexanol in [0.638, 0.723[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases
11    CaC    3      100.00%   If 1-hexanol in [0.723, 1.056[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases
12    OhR    5      83.33%    If 1-hexanol in [0.409, 0.673[ and Ca in [60.75, 94.75[ then Class = OhR in 83.3% of cases
13    CaR    6      100.00%   If 1-hexanol in [0.673, 1.218[ and Ca in [60.75, 94.75[ then Class = CaR in 100% of cases
[Figure: classification tree for the wine data. The root node (41 samples) splits on Ca; deeper splits use 2,3-butanediol, Mn, K, and 1-hexanol, ending in the CaC, CaR, OhC, and OhR nodes listed above.]
Confusion matrix for the estimation sample:

from \ to  CaC  CaR  OhC  OhR  Total  % correct
CaC         17    0    0    0     17     100.0%
CaR          0    9    0    1     10      90.0%
OhC          0    1    7    0      8      87.5%
OhR          0    1    0    5      6      83.3%
Total       17   11    7    6     41      92.7%
K nearest neighbor classification
A similarity-based classification method.
It attempts to assign categories to unknown samples based on multivariate proximity to other samples.
It works best with discrete classification types and is tolerant of poor data sets.
K = the number of closest neighbors being compared.
Consider this as the supervised version of HCA.
K nearest neighbor classification
In its simplest form, KNN is conducted by:
First, a training set is collected that contains examples of each class.
Intersample distances are then calculated:
"
whereN = # of variables or components used.
da " b= aj-bb^ hj=1
N
!2
KNN
The distance matrix is sorted, and the distance of the unknown sample can be compared to:
1. The K nearest neighbors.
2. The nearest class cluster.
Option 2 requires that K = 1.
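A bare numpy sketch of both pieces - the intersample distance from the formula above and the sorted neighbor lookup (option 1). The array names are placeholders:

```python
import numpy as np

def distance(a, b):
    """Euclidean distance over the N variables or components."""
    return np.sqrt(np.sum((a - b) ** 2))

def k_nearest(unknown, X_train, k=3):
    """Option 1: indices of the K nearest training samples."""
    d = np.array([distance(unknown, x) for x in X_train])  # distances to the unknown
    return np.argsort(d)[:k]                               # sorted; keep the K nearest
```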
KNN
When using the distance to a class, you can use the same link options that were discussed earlier.
The distance can be based on:
Single link - closest member of the class.
Complete link - farthest member of the class.
Centroid - center of the class cluster.
KNN - single link
In this example, the unknown is compared to the 3 closest known samples.
In this case, the three closest samples are all red.
Single link
K = 3
KNN - centroid link
Centroid link
With this approach, the distance to the center of a class cluster is determined and compared.
KNN
Ideally, if a test sample falls well within a known class, its closest neighbors should all be of one class.
Here, all of the blue samples would be closer to the unknown than any of the green.
Mycobacteria - HCA
[Figure: HCA dendrogram of the mycobacteria data (distance scale 0-1000); sample labels from classes 42-47 and 49 are intermixed across the clusters.]
A quick review of ALL of the ways that this data set was difficult to get useful information from.
Mycobacteria - k means
Mycobacteria - PCA
[Figure: PCA score plot of the mycobacteria data, samples labeled by class (42, 43, 44, 45, 46, 47, 49), with considerable overlap among the classes.]
Mycobacteria - DA
Mycobacteria - DA
[Figure: DA score plot of the mycobacteria data on F1 (56.45%) vs. F2 (29.63%); samples labeled by class (42-47, 49) now group into class clusters.]
Mycobacteria - DA
[Figure: expanded view of the DA score plot (F1 56.45% vs. F2 29.63%) showing the class clusters in more detail.]
Getting out the vote
What if a sample's distances are such that it could be in more than one class?
When you have more than one possible class, we can take a vote. The class with the most votes wins.
K = 5
Getting out the vote
Example - K = 3

Sample  Class  Distance
1       A      0.134
2       B      0.145
3       A      0.158

Here you would end up with two votes for A and one for B - A would win, and A's distances are smaller as well.
Getting out the vote
Example - K = 5

Sample  Class  Distance
1       A      0.134
2       B      0.145
3       A      0.158
4       B      0.234
5       C      0.502

Here, A and B would tie. The tie-breaker would be that A averages a smaller distance, so A would be made the winner.
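A sketch of this voting scheme, including the mean-distance tie-breaker; `neighbors` is an assumed list of (class, distance) pairs for the K nearest samples:

```python
from collections import defaultdict

def vote(neighbors):
    votes, dists = defaultdict(int), defaultdict(list)
    for label, d in neighbors:
        votes[label] += 1
        dists[label].append(d)
    top = max(votes.values())
    tied = [c for c, v in votes.items() if v == top]
    # Tie-breaker: of the tied classes, the smaller average distance wins.
    return min(tied, key=lambda c: sum(dists[c]) / len(dists[c]))

# The K = 5 example above: A and B tie 2-2, but A's average distance is smaller.
print(vote([("A", 0.134), ("B", 0.145), ("A", 0.158), ("B", 0.234), ("C", 0.502)]))  # A
```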
KNN validation
The optimum number for K can be found by trial and error, but for a close match it should make no difference.
The classifying power of your data can be evaluated by leave-one-out validation of your training set.
This should be done before any sort of real classification begins.
KNN validation
You can sequentially leave out each of your samples and test it for votes at several K values.
You end up with a vote matrix that will tell you the optimum K value for each class.
You will also get a misclassification matrix - this tells you how often one of your knowns is incorrectly classified.
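A sketch of that validation pass with scikit-learn, scanning a few K values by leave-one-out; the iris data and the K values tried are placeholders:

```python
# Leave-one-out validation of a KNN training set over several K values.
from sklearn.datasets import load_iris
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
for k in (1, 3, 5, 7):
    acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y,
                          cv=LeaveOneOut()).mean()
    print(f"K = {k}: LOO accuracy = {acc:.3f}")
```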
K nearest neighbor classification
So KNN will always assign a class.
What if you have a material that is not a member of an existing class?
One option is to set a maximum distance.
Example
If your intraclass distances run about 0.2 for all of your classes, you might want to omit votes with distances that exceed 0.2.
Iris (of course)
The Iris data set is included with a demo of the program Pirouette.
We'll be using the Pirouette demo to show how to conduct KNN and SIMCA classifications.
You can download a copy of the demo from www.infometrix.com.
The demo is fully functional, but only with the data sets that are provided by Infometrix.
The actual software is pretty easy to use but too expensive for our use in the course.
Iris example
Iris - scores by class
Iris - voting results
Iris - class partitions
What? NOT the Iris data set?
Headspace MS of 4 cola classes.
Two cola brands.
Diet and regular.
m/e 44 - 149.
May need to preprocess to eliminate any nonvariant data.
Cola example
Class
1 Brand 1
2 Diet brand 1
3 Brand 2
4 Diet brand 2
PCA scores
PCA scores
PCA loadings
KNN classification
Not a bad job!
KNN classifications
SIMCA
Soft Independent Modeling of Class Analogy
A method of classification that provides:
Detection of outliers.
Estimates of confidence for a classification.
Determination of potential membership in more than a single class.
SIMCA
Basic approach:
For each class of samples, a PCA model is constructed.
This model is based on the optimum number of components that best clusters an individual class.
The optimum number of components can vary from class to class and can be determined by cross-validation.
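A bare-bones sketch of that basic approach - one PCA model per class, scored by residual distance. The fixed two components and the max-residual cutoff are crude stand-ins for the cross-validated choices a real SIMCA implementation would make:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)
models = {}
for c in np.unique(y):
    Xc = X[y == c]
    pca = PCA(n_components=2).fit(Xc)                 # one PCA model per class
    resid = Xc - pca.inverse_transform(pca.transform(Xc))
    cutoff = np.sqrt((resid ** 2).sum(axis=1)).max()  # crude class boundary
    models[c] = (pca, cutoff)

def simca_match(sample):
    """Classes whose hypervolume (residual cutoff) the sample falls inside."""
    hits = []
    for c, (pca, cutoff) in models.items():
        r = sample - pca.inverse_transform(pca.transform(sample[None, :]))[0]
        if np.sqrt((r ** 2).sum()) <= cutoff:
            hits.append(c)
    return hits                                       # zero, one, or several classes

print(simca_match(X[0]))  # class 0 must match, since X[0] helped set its cutoff
```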
SIMCA models
Since the number of components used can vary, each class will be best described by its own hypervolume.
SIMCA models
Limitation of a class hypervolume:
You can limit the size of a hypervolume by setting a standard deviation cutoff. This results in better defined classes.
SIMCA models
Once a model has been created for each class, you are ready to classify unknowns.
For each model/sample combination:
+ The sample is transformed into PC space and compared to see if it is a likely class member.
+ If it is within the hypervolume of a single class, you have a match.
SIMCA classification
The potential still exists for a sample to be classified as a member of more than one class.
It may also not be a member of any known class.
SIMCA classification
SIMCA will give you an estimate as to the probability of class membership.
Example - two possible classes:

Class     Probability
Class A   0.90
Class B   0.45

Here, the sample is more likely to be a member of Class A.
SIMCA summary
Of the methods covered, SIMCA offers the most options for developing a classification model when the classes are well known.
It also requires the most development time, as you must determine the optimum model conditions for each class.
If used, plan on spending quite a bit of time working with all of the available options.
SIMCA example - Iris.
Of course we'll look at the iris data set again.
Note: We have a separate model for each class in the data set - in this case, three.
SIMCA example - Iris.
Pirouette will provide an estimate as to the class hypervolumes based on the first three PCs.
SIMCA example - Iris.
It appears that petal length is the most useful for classifying.
SIMCA example - Iris.
These plots show the relative positions of each sample when projected into any of the three class models - two classes at a time - with color coding based on known class.
Cola example
With the cola example (two brands, diet and regular), we have 4 classes.
Here you can see that the classes are pretty well resolved.
Cola example
Mycobacteria again
This data set is included with the Pirouette demo.
File = Mycosing.wks
It is a subset of the version I've been using (only 72 samples).
Mycobacteria SIMCA
Perfect classifications - a first for this data set.
Mycobacteria SIMCA
Discriminating power is a measure of which variables show the biggest class differences.
Mycobacteria SIMCA
The example shows that a different number of components was used in developing the individual SIMCA hypervolumes.
Mycobacteria SIMCA
Modeling power indicates the relative importance of each variable for classification.
Loadings, as always, show the relative significance of each variable in constructing each PC. Here they are relatively unimportant.
Mycobacteria SIMCA
PC plots are pretty boring since you only have one class per model. However, they can be used to see if you have any sub-classes.
Outliers are tested for by plotting each sample's residual (the difference between the sample and the center of the hypervolume) vs. its Mahalanobis distance from the center of the cluster - similar to a Euclidean distance, but one that takes into account correlations in the data and is scale invariant.
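For reference, the Mahalanobis distance of a sample $x$ from a cluster with mean $\mu$ and covariance matrix $\Sigma$ is

$$d_M(x) = \sqrt{(x - \mu)^{\top} \Sigma^{-1} (x - \mu)}$$

which reduces to the ordinary Euclidean distance when $\Sigma$ is the identity; the $\Sigma^{-1}$ term is what accounts for correlations in the data and makes the measure scale invariant.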
Mycobacteria SIMCA