12 Classification



    Classification Methods

    Classification problems

    Classification Methods

    Supported by XLStat

    Classification Methods

    All of these methods are considered supervised learning.

    Initial assumptions regarding membership or properties are made when developing a model.

    An initial evaluation of the data using exploratory data analysis is useful.

    Data sets

    Needed to develop and evaluate a classification model.

    Training set: Representative samples used to build the model.

    The modeling software uses the class information.

    Evaluation set

    Samples of known class, used to test the model.

    The modeling software does not know the classes.

    Test set: True unknowns.

    Data pre-processing

    With any of these methods, you may choose to do some sort of data pre-processing.

    Raw

    Is fastest.

    Scaled

    Gives equal weight to the variables.

    PCA: Can be used to reduce noise and insignificant variables.
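
    A minimal sketch (not from the slides) of these three pre-processing choices, assuming scikit-learn and a made-up data matrix; the variable names are illustrative only.

    import numpy as np
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    X = np.random.rand(30, 8)                             # hypothetical data: 30 samples x 8 variables

    X_raw = X                                              # "Raw" - fastest, no transformation
    X_scaled = StandardScaler().fit_transform(X)           # "Scaled" - equal weight for every variable
    X_pca = PCA(n_components=5).fit_transform(X_scaled)    # "PCA" - noise / insignificant-variable reduction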


    Data pre-processing

    With some data sets, you may also want to do some other types of pre-processing.

    Example. Spectral or chromatographic traces.

    Options may include:

    Smoothing, baseline correction, signal averaging, using the first or second derivative.

    Creating an evaluation set

    The evaluation set is typically a sub-set of the training set that was omitted when building a model.

    Randomly pick a subset of the data. Randomly pick members from each class.

    Any approach that selectively removes a portion of the data could cause bias.

    Leave-one-out validation

    A standardized approach for validation of a model where each sample serves as an evaluation set.

    1. Omit a single sample from the set

    2. Build the model

    3. Test the omitted sample

    4. Repeat the above steps until each sample has been omitted and tested once.


    While Leave-One-Out testing is the best approach, it can be slow for large sets.

    Alternate approaches are to leave two or more samples out with each pass.

    Samples should be randomly listed in the matrix.

    The same two (or more) samples should never be omitted together more than once.
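
    A rough leave-one-out sketch of the procedure above, assuming scikit-learn and a generic nearest-neighbor classifier rather than the XLStat / Pirouette tools used in the course.

    from sklearn.datasets import load_iris
    from sklearn.model_selection import LeaveOneOut
    from sklearn.neighbors import KNeighborsClassifier

    X, y = load_iris(return_X_y=True)
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        # 1-2. omit a single sample and build the model on the rest
        model = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
        # 3. test the omitted sample
        correct += int(model.predict(X[test_idx])[0] == y[test_idx][0])
    # 4. every sample has now been omitted and tested exactly once
    print(f"LOO accuracy: {correct / len(y):.3f}")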

    Rule building methods

    Methods where a set of rules is created to discriminate between classes.

    Linear learning machine: One or more linear vectors are created to discriminate between classes.

    Discriminant analysis: Linear or quadratic equations are used to separate classes.

    Classification trees: A series of rules is used to sequentially classify.

    Linear learning machine

    The assumption is that one or more vectors can be found that can be used to discriminate between our classes.

    This can make use of our raw data or work in PC space.

    PC space would be better as there would be noise reduction.


    Linear learning machine

    For simple classifications, there can be many linear vectors that give complete class discrimination.

    You would select the one that gives the best partitioning.

    You are not limited to just 1 or 2-D vectors.

    Linear learning machine

    As the number of classes increases, the potential number of usable vectors will decrease.

    The problem can become complex very rapidly.

    You can reach a point where simple linear lines can no longer solve the problem.

    Linear learning machine

    In this example, a linear solution can't be found that discriminates between the classes.

    Clearly, there should be a way to discriminate - the classes appear to be well defined.

    A non-linear function may offer the best approach (discriminant analysis).

    Discriminant Analysis (DA)

    First described by Fisher in 1936.

    Similar to LLM but can use both quantitative and qualitative variables.

    Approach uses linear models when sample classes have similar covariance matrices.

    Uses quadratic models when classes have dissimilar covariance matrices.

    Can have problems if you have variables with null variance or multicollinearity - these must be eliminated.

    Iris example

    We'll return to the Iris example dataset - using XLStat's built-in DA function.

    We're going to use autoscaled data.

    DA with XLStat.
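
    A minimal sketch of the same idea outside XLStat, assuming scikit-learn's LinearDiscriminantAnalysis on autoscaled iris data; F1/F2 correspond to the two discriminant factors.

    from sklearn.datasets import load_iris
    from sklearn.preprocessing import StandardScaler
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_iris(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)         # autoscaled data

    da = LinearDiscriminantAnalysis(n_components=2).fit(X_scaled, y)
    scores = da.transform(X_scaled)                      # sample scores on F1 and F2
    print(da.explained_variance_ratio_)                  # fraction of discrimination carried by each factor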


    (Figure: DA of the iris data - variable correlations for Petal width, Petal length, Sepal Width, and Sepal Length, and sample scores for species 1, 2, and 3 on F1 (99.01 %) vs. F2 (0.99 %).)

    Coffee example

    This consisted of 6 types of coffee - identified based on MS data.

    To avoid collinearity and null-variable problems, PCA scores were used (first 5 components).

    (Figure: DA scores for the six coffee types (C, E, K, R, S, U) on F1 (56.19 %) vs. F2 (22.79 %).)


    Classification trees

    Predicts class membership by sequential application of rules based on predictor variables.

    With DA and LLM, you create a set of math models that are all applied at once.

    With classification trees, the predictor variables are evaluated as ordinal rules, one at a time.

    Classification trees

    (Figure: a simple example tree - sequential rules such as solid vs. liquid, density > 1, and red vs. green.)

    Iris example (yet again!)

    XLStat supports the use of classification and regression trees.

    Classification if the Y variable (class) is qualitative, regression if the Y variable is quantitative.

    The iris example is a classification example.

    Iris example

    Example rule: If Petal width is between 1 and 8, then assign to Species 1.

    (Figure: XLStat classification tree for the iris data. The root node (150 samples, 100 %) splits on Petal width ([1, 8[ vs. [8, 25[); later splits use Petal length, Sepal Width, and Sepal Length. Each node is labeled with its size, percent of the data set, purity, and class counts.)



    Purity is just the percentage of a node's samples that belong to the class assigned to that node.
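
    A rough classification-tree sketch for the iris data using scikit-learn's DecisionTreeClassifier (not XLStat); the printed rules play the same role as the node diagram above.

    from sklearn.datasets import load_iris
    from sklearn.tree import DecisionTreeClassifier, export_text

    iris = load_iris()
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)
    print(export_text(tree, feature_names=iris.feature_names))   # sequential if/then rules, one split at a time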

    Using Classification Tree

    Using DA

    Wine example: Riesling vs. Chardonnay.

    Ohio vs. California.

    Assayed 5 organic and 4 trace-metal components.

    Yes, you'll do the same with your homework.

    Node  Class  Freq.  Purity   Rules
    1     CaC    17     41.46%
    2     CaC    17     58.62%   If Ca in [17.5, 60.75[ then Class = CaC in 58.6% of cases
    3     CaR     7     58.33%   If Ca in [60.75, 94.75[ then Class = CaR in 58.3% of cases
    4     CaR     3     60.00%   If 2,3-butanediol in [0, 0.065[ and Ca in [17.5, 60.75[ then Class = CaR in 60% of cases
    5     CaC    17     70.83%   If 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 70.8% of cases
    6     CaC    14    100.00%   If Mn in [0.82, 1.625[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases
    7     OhC     7     70.00%   If Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 70% of cases
    8     CaC     3     60.00%   If K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 60% of cases
    9     OhC     5    100.00%   If K in [881.75, 1147.5[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases
    10    OhC     2    100.00%   If 1-hexanol in [0.638, 0.723[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = OhC in 100% of cases
    11    CaC     3    100.00%   If 1-hexanol in [0.723, 1.056[ and K in [735.5, 881.75[ and Mn in [1.625, 3.51[ and 2,3-butanediol in [0.065, 0.514[ and Ca in [17.5, 60.75[ then Class = CaC in 100% of cases
    12    OhR     5     83.33%   If 1-hexanol in [0.409, 0.673[ and Ca in [60.75, 94.75[ then Class = OhR in 83.3% of cases
    13    CaR     6    100.00%   If 1-hexanol in [0.673, 1.218[ and Ca in [60.75, 94.75[ then Class = CaR in 100% of cases

    (Figure: classification tree for the wine data set. The root node (41 samples) splits on Ca, with later splits on 2,3-butanediol, Mn, K, and 1-hexanol; node sizes, percentages, purities, and class counts for CaC, CaR, OhC, and OhR are shown at each node.)


    Confusion matrix for the estimation sample:

    from \ to CaC CaR OhC OhR Total % correct

    CaC 17 0 0 0 17 100.0%

    CaR 0 9 0 1 10 90.0%

    OhC 0 1 7 0 8 87.5%

    OhR 0 1 0 5 6 83.3%

    Total 17 11 7 6 41 92.7%

    K nearest neighbor classification

    A similarity-based classification method.

    It attempts to assign categories to unknown samples based on multivariate proximity to other samples.

    It works best with discrete classification types and is tolerant of poor data sets.

    K = the number of closest neighbors being compared.

    Consider this as the supervised version of HCA.

    K nearest neighbor classification

    In its simplest form, KNN is conducted by:

    First, a training set is collected that contains examples of each class.

    Intersample distances are then calculated:

    $d_{ab} = \sqrt{\sum_{j=1}^{N} (a_j - b_j)^2}$

    where N = the number of variables or components used.
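
    A small numpy sketch of this intersample distance calculation; the toy data matrix is hypothetical.

    import numpy as np

    def distance_matrix(X):
        """Pairwise d_ab = sqrt(sum_j (a_j - b_j)^2) over all rows of X."""
        diff = X[:, None, :] - X[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1))

    X = np.array([[0.1, 0.2], [0.4, 0.1], [0.9, 0.8]])   # 3 samples, N = 2 variables
    print(distance_matrix(X))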

    KNN

    The distance matrix is sorted and the distance of the unknown sample can be compared to:

    1. The K nearest neighbors

    2. The nearest class cluster.

    Option 2 requires that K = 1.

    KNN

    When using the distance to a class, you can use the same link options that were discussed earlier.

    The distance can be based on (see the sketch after this list):

    Single link - closest member of class.

    Complete link - farthest member of class.

    Centroid - center of class cluster.
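
    A sketch (with made-up arrays) of these three link options for the distance from an unknown sample to one class of known samples.

    import numpy as np

    def class_distance(unknown, class_samples, link="single"):
        d = np.sqrt(((class_samples - unknown) ** 2).sum(axis=1))   # distance to every class member
        if link == "single":        # closest member of the class
            return d.min()
        if link == "complete":      # farthest member of the class
            return d.max()
        if link == "centroid":      # distance to the center of the class cluster
            centroid = class_samples.mean(axis=0)
            return np.sqrt(((centroid - unknown) ** 2).sum())
        raise ValueError(link)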

    KNN - single link

    In this example, the unknown is compared to the 3 closest known samples (single link, K = 3).

    In this case, the three closest samples are all red.


    KNN - centroid link

    Centroid link: with this approach, the distance to the center of a class cluster is determined and compared.

    KNN

    Ideally, if a test sample falls well within a known class, its closest neighbors should all be of one class.

    Here, all of the blue samples would be closer to the unknown than any of the green.

    Mycobacteria - HCA

    (Figure: HCA dendrogram of the mycobacteria samples, labeled by class 42-47 and 49.)

    A quick review of ALL of the ways that this data set was difficult to get useful information from.

    Mycobacteria - k means

    Mycobacteria - PCA

    (Figure: PCA scores for the mycobacteria classes 42-47 and 49.)

    Mycobacteria - DA


    Mycobacteria - DA

    (Figure: DA scores for the mycobacteria classes 42-47 and 49 on F1 (56.45 %) vs. F2 (29.63 %).)

    Mycobacteria - DA

    (Figure: expanded view of the same DA scores plot, F1 (56.45 %) vs. F2 (29.63 %).)

    Mycobacteria - DA

    Getting out the vote

    What if a sample's distances are such that it could be in more than one class?

    When you have more than one possible class, we can take a vote. The class with the most votes wins.

    K = 5

    Getting out the vote

    Example - K = 3

    Sample  Class  Distance
    1       A      0.134
    2       B      0.145
    3       A      0.158

    Here you would end up with two votes for A and one for B - A would win, and its distances are also smaller.


    Getting out the vote

    Example - K = 5

    Sample  Class  Distance
    1       A      0.134
    2       B      0.145
    3       A      0.158
    4       B      0.234
    5       C      0.502

    Here, A and B would tie. The tie-breaker is that A averages a smaller distance, so A would be made the winner.

    KNN validation

    The optimum number for K can be found by trial and error, but for a close match, it should make no difference.

    The classifying power of your data can be evaluated by leave-one-out validation of your training set.

    This should be done before any sort of real classification begins.

    KNN validation

    You can sequentially leave out each of your samples and test it for votes at several K values.

    You end up with a vote matrix that will tell you the optimum K value for each class.

    You will also get a misclassification matrix - this tells you how often one of your knowns is incorrectly classified.

    K nearest neighbor classification

    So KNN will always assign a class.

    What if you have a material that is not a member of an existing class?

    One option is to set a maximum distance.

    Example

    If your intraclass distances run about 0.2 for all of your classes, you might want to omit votes with distances that exceed 0.2.
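
    A minimal sketch of that cutoff idea, reusing the hypothetical (class, distance) pairs from above:

    MAX_DIST = 0.2          # roughly the typical intraclass distance

    neighbors = [("A", 0.134), ("B", 0.145), ("A", 0.158), ("B", 0.45)]
    kept = [(c, d) for c, d in neighbors if d <= MAX_DIST]   # discard votes beyond the cutoff
    print(kept if kept else "no class assigned")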

    Iris (of course)

    The Iris data set is included with a demo of the program Pirouette.

    We'll be using the Pirouette demo to show how to conduct KNN and SIMCA classifications.

    You can download a copy of the demo from www.infometrix.com.

    The demo is fully functional but only with the data sets that are provided by Infometrix.

    The actual software is pretty easy to use but too expensive for our use in the course.

    Iris example


    Iris - scores by class

    Iris - voting results

    Iris - class partitions

    What? NOT the Iris data set?

    Headspace MS of 4 cola classes.

    Two cola brands.

    Diet and regular.

    m/e 44 - 149.

    May need to preprocess to eliminate any nonvariant data.

    Cola example

    Class

    1 Brand 1

    2 Diet brand 1

    3 Brand 2

    4 Diet brand 2

    PCA scores


    PCA scores / PCA loadings

    KNN classification - not a bad job!

    KNN classifications

    SIMCA - Soft Independent Modeling of Class Analogy

    A method of classification that provides:

    Detection of outliers.

    Estimates of confidence for a classification.

    Determination of potential membership in more than a single class.

    SIMCA - basic approach

    For each class of samples, a PCA model is constructed.

    This model is based on the optimum number of components that best clusters an individual class.

    The optimum number of components can vary from class to class and can be determined by cross-validation.
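
    A very rough SIMCA-style sketch under stated assumptions: one PCA model per class, membership judged by the residual distance to that model. Real SIMCA (and Pirouette in particular) also applies statistical cutoffs that are omitted here.

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    def fit_class_models(X, y, components_per_class):
        """Build one scaled PCA model per class, each with its own number of components."""
        models = {}
        for cls, k in components_per_class.items():
            Xc = X[y == cls]
            scaler = StandardScaler().fit(Xc)
            models[cls] = (scaler, PCA(n_components=k).fit(scaler.transform(Xc)))
        return models

    def residual_distance(x, model):
        """Distance of a sample off a class's PC hyperplane (smaller = more likely a member)."""
        scaler, pca = model
        z = scaler.transform(x.reshape(1, -1))
        recon = pca.inverse_transform(pca.transform(z))
        return float(np.sqrt(((z - recon) ** 2).sum()))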


    SIMCA models

    Since the number of components used can vary, each class will be best described by its own hypervolume.

    SIMCA models - limiting a class hypervolume

    You can limit the size of a hypervolume by setting a standard deviation cutoff. This results in better-defined classes.

    SIMCA models

    Once a model has been created for each class, you are ready to classify unknowns.

    For each model/sample combination:

    + The sample is transformed into PC space and compared to see if it is a likely class member.

    + If it is within the hypervolume of a single class, you have a match.

    SIMCA classification

    The potential still exists for a sample to be classified as a member of more than one class.

    It may also not be a member of any known class.

    SIMCA classification

    SIMCA will give you an estimate as to the probability of class membership.

    Example - two possible classes.

    Class    Probability
    Class A  0.90
    Class B  0.45

    Here, the sample is more likely to be a member of Class A.

    SIMCA summary

    Of the methods covered, SIMCA offers the most options for developing a classification model when the classes are well known.

    It also requires the most development time, as you must determine the optimum model conditions for each class.

    If used, plan on spending quite a bit of time working with all of the available options.


    SIMCA example - Iris

    Of course we'll look at the iris data set again.

    Note: We have a separate model for each class in the data set - in this case, three.

    SIMCA example - Iris

    Pirouette will provide an estimate as to the class hypervolumes based on the first three PCs.

    SIMCA example - Iris

    It appears that petal length is the most useful for classifying.

    SIMCA example - Iris

    These plots show the relative positions of each sample when projected into any of the three class models - two classes at a time - with color coding based on known class.


    Cola example

    With the cola example (two brands, diet and regular), we have 4 classes.

    Here you can see that the classes are pretty well resolved.

    Cola example

    Mycobacteria again

    This data set is included with the Pirouette demo.

    File = Mycosing.wks

    It is a subset of the version I've been using (only 72 samples).

    Mycobacteria SIMCA

    Perfect classifications - a first for this dataset.


    Mycobacteria SIMCA

    Discriminating power is a measure of which variables show the biggest class differences.

    Mycobacteria SIMCA

    The example shows that a different number of components were used in developing the individual SIMCA hypervolumes.

    Mycobacteria SIMCA

    Modeling power indicates the relative importance of each variable for classification.

    Loadings, as always, show the relative significance of each variable in constructing each PC.

    Here they are relatively unimportant.

    Mycobacteria SIMCA

    PC plots are pretty boring since you only have one class. However, they can be used to see if you have any sub-classes.

    Outliers are tested for by plotting sample residuals (the difference between a sample and the center of the hypervolume) vs. the sample's Mahalanobis distance from the center of the cluster - similar to a Euclidean distance, but it takes into account correlations in the data and is scale invariant.
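
    A small numpy sketch of that Mahalanobis distance; the class samples here are made up for illustration.

    import numpy as np

    def mahalanobis(x, class_samples):
        """Distance of x from the class center, scaled by the inverse class covariance."""
        mu = class_samples.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(class_samples, rowvar=False))
        diff = x - mu
        return float(np.sqrt(diff @ cov_inv @ diff))

    cls = np.random.rand(20, 3)            # 20 samples of one class, 3 variables
    print(mahalanobis(np.array([0.5, 0.5, 0.5]), cls))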

    Mycobacteria SIMCA