11-b_pca

Upload: deborahrosales

Post on 03-Apr-2018


  • 7/28/2019 11-B_PCA

    1/9

    PCA for removal of noise

    With PCA, correlated data is extracted into a series of score-loading pairs.

    Random noise is uncorrelated and tends to stay in the residual matrix.

    This feature can be exploited as a means of improving S/N in a 2-D data set.

    PCA for removal of noise

    GC/MS example

    A GC/MS dataset is already a suitable matrix

    for PCA work. It already exists as a scaled

    matrix.

    PCA can be used to extract all significant information, leaving the noise behind.

    The datafile can then be reconstructed from

    the loadings and scores.

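    GC/MS example

    $$X_{\text{GC/MS}} = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_a p_a^{T} + E_{\text{noise}}$$

    $$X_{\text{new}} = t_1 p_1^{T} + t_2 p_2^{T} + \dots + t_a p_a^{T}$$

    The GC/MS data, the new GC/MS data and the noise are n x m matrices, each score vector t is n x 1 and each loading vector p is m x 1, with m = m/e and n = scans.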
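The reconstruction from scores and loadings can be sketched with NumPy. This is a minimal illustration, not the procedure used for the results in these slides: it assumes the unscaled data matrix is decomposed by SVD (which yields the same score-loading pairs), and the function name `pca_denoise` and the synthetic two-component data are hypothetical.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Rebuild X (n scans x m m/e channels) from its first score-loading pairs."""
    # SVD of the unscaled matrix gives the score-loading pairs directly.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # t_1 ... t_a, n x a
    loadings = Vt[:n_components]                      # p_1' ... p_a', a x m
    X_new = scores @ loadings                         # reconstructed data
    residual = X - X_new                              # what is left behind
    return X_new, residual

# Hypothetical test data: two Gaussian elution profiles, each with a random spectrum.
rng = np.random.default_rng(0)
scans = np.arange(200)
profiles = np.stack([np.exp(-0.5 * ((scans - c) / 6.0) ** 2) for c in (70, 130)], axis=1)
spectra = rng.random((2, 50))
clean = profiles @ spectra
noisy = clean + rng.normal(scale=0.05, size=clean.shape)

denoised, residual = pca_denoise(noisy, n_components=2)
rms_before = np.sqrt(np.mean((noisy - clean) ** 2))
rms_after = np.sqrt(np.mean((denoised - clean) ** 2))
print(f"RMS error vs. noise-free data: {rms_before:.4f} before, {rms_after:.4f} after")
```

With real data the noise-free matrix is unknown, so the number of retained pairs has to be judged from some measured quantity instead.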

    GC/MS example

    For the evaluation of the approach, a 6 component mixture was assayed over a range of concentrations.

    - Components: benzene, toluene, ethylbenzene, dichloromethane, trichloromethane, tetrachloromethane.

    - Concentration range: 0.005-1% V/V

    Number of components required

    The optimum number of components was determined by measuring peak area for all 6 components.

    8 PCs were found to be adequate for extraction of all peak area information.

    Number of components required

    The same type of evaluation was made for mass spectral quality.

    Again, at 8 PCs it was determined that the optimum match quality was observed.
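One way to picture this kind of evaluation is to rebuild the data with an increasing number of PCs and track how well the peak areas are recovered. The sketch below is an assumption-laden illustration on synthetic data (three hypothetical components, a fixed scan window per peak, a noise-free reference that would not exist for real samples); it is not the evaluation that produced the figure of 8 PCs.

```python
import numpy as np

rng = np.random.default_rng(1)
scans = np.arange(200)
centers = (40, 100, 160)        # hypothetical peak positions (scan numbers)
amounts = (1.0, 0.3, 0.1)       # hypothetical relative concentrations
profiles = np.stack([a * np.exp(-0.5 * ((scans - c) / 5.0) ** 2)
                     for a, c in zip(amounts, centers)], axis=1)   # 200 x 3
spectra = rng.random((3, 40))                                       # 3 x 40
clean = profiles @ spectra
noisy = clean + rng.normal(scale=0.002, size=clean.shape)

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)

def peak_areas(X):
    """Total intensity inside a +/-15 scan window around each peak."""
    return np.array([X[c - 15:c + 16].sum() for c in centers])

true_areas = peak_areas(clean)
worst_error = {}
for k in range(1, 7):
    rebuilt = (U[:, :k] * s[:k]) @ Vt[:k]          # data rebuilt from k PCs
    worst_error[k] = np.max(np.abs(peak_areas(rebuilt) - true_areas) / true_areas)

for k, err in worst_error.items():
    print(f"{k} PCs: worst relative peak-area error = {err:.4f}")
```

The smallest number of PCs at which the error stops improving is the natural choice; in this synthetic case that is the number of components built into the data.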


    Effect on chromatographic data quality

    A significant reduction in noise was observed.

    S/N improved by a factor of 1.88.

    Mass spectral improvements

    Overall match quality improved by an average of 17%.

    The major reason for this was that small noise-related lines had been eliminated.

    An additional advantage of the method is that datafile size was reduced by 30%.

    Noise removal from 2-D NMR data

    PCA has also been evaluated as a means of removing noise from 2-D COSY NMR spectra.

    One major difference from the GC/MS approach is that the data is uncorrelated and it is the artifacts that are extracted.

    The data remains in the residual.
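The reversed use of PCA, where the extracted pairs are discarded and the residual is kept, can be sketched as follows. This is only an illustration under stated assumptions: the function name `strip_correlated`, the rank-one "ridge" artifact and the isolated single-point peaks are all hypothetical, and the removed fraction is computed from the uncentered sum of squares.

```python
import numpy as np

def strip_correlated(X, n_components):
    """Remove the first score-loading pairs from X and keep the residual."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    extracted = (U[:, :n_components] * s[:n_components]) @ Vt[:n_components]
    # Fraction of the total sum of squares carried by the removed pairs.
    removed_fraction = np.sum(s[:n_components] ** 2) / np.sum(s ** 2)
    return X - extracted, removed_fraction

rng = np.random.default_rng(2)
size = 64
signal = np.zeros((size, size))
for row, col in [(5, 40), (12, 22), (30, 50), (44, 9), (58, 33)]:
    signal[row, col] = 1.0                      # isolated, uncorrelated peaks
column_shape = 5.0 * np.exp(-0.5 * ((np.arange(size) - 16) / 3.0) ** 2)
ridge = np.outer(rng.normal(size=size), column_shape)   # strongly correlated artifact
measured = signal + ridge

cleaned, removed_fraction = strip_correlated(measured, n_components=1)
err_before = np.linalg.norm(measured - signal)
err_after = np.linalg.norm(cleaned - signal)
print(f"removed {100 * removed_fraction:.1f}% of the sum of squares")
print(f"distance to the true signal: {err_before:.2f} before, {err_after:.2f} after")
```

Almost all of the sum of squares is removed while the peaks of interest survive, which mirrors the large percentages of variance quoted for the COSY example.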

    Noise removal from 2-D NMR data

    FID trace

    The signal of interest in this type of experiment is uncorrelated by nature.

    For evaluation of the method, a 2K x 2K COSY spectrum for menthol was obtained.


    [Figure: panels "Original data" and "PCA processed"]

    At this point, 95% of the variance had been removed.

    [Figure: panel "PCA processed"]

    Here, 99% of the variance has been removed.

    [Figure: "Before and after"]

    Wine analysis

    A series of wines was assayed by both GC/MS and AA. Samples were obtained from two regions for two different types of wine.

    - Regions: Ohio (Lake Erie), California (Napa Valley)

    - Types: Chardonnay, J. Riesling

    Components were identified and only those found in all samples were used in subsequent characterizations. GC/MS peaks were used as relative areas and the metals as ppm values.

    Wine analysis

    Eighteen organic species and eight trace metals were evaluated. After an initial ANOVA, only 9 species were used.

    - Organic species: 1-hexanol, 3-ethoxy-1-propanol, ethyl octanoate, 2,3-butanediol, benzyl alcohol

    - Metals: manganese, calcium, potassium, sodium
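The slides do not say how the ANOVA screening was set up. As one plausible reading, each variable can be given a one-way ANOVA F statistic across the sample classes, and variables with a low F dropped. The sketch below assumes that reading; `one_way_f`, the class means and the simulated values are hypothetical.

```python
import numpy as np

def one_way_f(values, groups):
    """One-way ANOVA F statistic for a single variable across group labels."""
    values = np.asarray(values, dtype=float)
    groups = np.asarray(groups)
    labels = np.unique(groups)
    grand_mean = values.mean()
    ss_between = sum(np.sum(groups == g) * (values[groups == g].mean() - grand_mean) ** 2
                     for g in labels)
    ss_within = sum(np.sum((values[groups == g] - values[groups == g].mean()) ** 2)
                    for g in labels)
    df_between = len(labels) - 1
    df_within = len(values) - len(labels)
    return (ss_between / df_between) / (ss_within / df_within)

rng = np.random.default_rng(3)
classes = np.repeat(["Or", "Oc", "Cr", "Cc"], 10)            # class labels as used in the plots
class_shift = {"Or": 0.0, "Oc": 3.0, "Cr": 6.0, "Cc": 9.0}   # hypothetical class means
informative = np.array([class_shift[c] for c in classes]) + rng.normal(size=classes.size)
uninformative = rng.normal(size=classes.size)                # same mean in every class

f_informative = one_way_f(informative, classes)
f_uninformative = one_way_f(uninformative, classes)
print(f"F (class-dependent variable): {f_informative:.1f}")
print(f"F (class-independent variable): {f_uninformative:.1f}")
```

A variable whose F statistic is close to 1 carries little class information and is a candidate for removal before PCA.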


    Wines - Initial dendrogram

    [Figure: dendrogram of the wine samples, leaves labeled Or, Oc, Cr and Cc (O = Ohio, C = California; r = Riesling, c = Chardonnay); axis: Dissimilarity, 0 to 7]

    Wines - Eigenvalues

    [Figure: scree plot of eigenvalues for F1 to F9 (axis 0 to 3.5) with cumulative variability in % (axis 0 to 100)]

    PC Plots

    [Figure: score plots of the wine samples (labels Or, Oc, Cr, Cc) for F1 (36.35 %) vs. F2 (24.86 %), F1 (36.35 %) vs. F3 (11.19 %), and F2 (24.86 %) vs. F3 (11.19 %)]

    Wines - Loading plot

    [Figure: Loading 1 vs. Loading 2, both axes from -1 to 1, for 1-hexanol, 3-ethoxy-1-propanol, ethyl octanoate, 2,3-butanediol, benzyl alcohol, K, Na, Ca and Mn]


    Wines - summary

    Using just 9 variables (5 organic species and 4 metals), it was possible to identify both the type and source of the wine samples.

    The major source of variation was the type: Chardonnay or J. Riesling.

    Identification of the wine region (Ohio Lake Erie or California Napa Valley) was the second major source of variation.

    In the next unit we'll see if a classification system can be developed using this data.

    Earlier examples.

    Let's look at some of the earlier problems that we evaluated using HCA and see what we can do with them using PCA.

    - Iris classification: physical measurement of flowers, three species.

    - Coffee: MS analysis of bean headspace, six regions.

    Iris setosa Iris versicolor Iris virginica

    Data from Fisher, R. A., The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188 (1936).

    The data correspond to 150 Iris flowers described by four variables (sepal length, sepal width, petal length, petal width) and their species. Three different species have been included in this study: setosa, versicolor and virginica.

    Iris - Autoscaled, centroidal linkage

    [Figure: dendrogram of the 150 flowers, leaves labeled by species (1, 2, 3); axis: Similarity, 0.40 to 1.00]

    Our original dendrogram didn't provide much useful information.

    Only one species was resolved from the other two.

    Iris - PCA eigenvalues

    [Figure: scree plot of eigenvalues for F1 to F4 (axis 0 to 3.5) with cumulative variability in % (axis 0 to 100)]

    The first two components account for 95.7% of the variance. The data were autoscaled prior to analysis.
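The route from autoscaled data to eigenvalues and cumulative variability can be sketched as below. It is a minimal illustration assuming that autoscaling means mean-centering and division by the sample standard deviation, so that PCA works on the correlation matrix; the function names and the synthetic four-variable data are hypothetical stand-ins, not the Iris measurements.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit sample variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_eigen(X):
    """Eigenvalues, loadings, scores and cumulative variability (%) of autoscaled X."""
    Z = autoscale(X)
    corr = (Z.T @ Z) / (Z.shape[0] - 1)           # correlation matrix
    eigvals, eigvecs = np.linalg.eigh(corr)       # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]             # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    cumulative = 100.0 * np.cumsum(eigvals) / eigvals.sum()
    scores = Z @ eigvecs
    return eigvals, eigvecs, scores, cumulative

# Hypothetical stand-in for a 150 x 4 table with variables in different units.
rng = np.random.default_rng(4)
latent = rng.normal(size=(150, 2))
mixing = np.array([[1.0, 0.2], [-0.3, 1.0], [1.2, -0.1], [1.1, 0.0]])
X = latent @ mixing.T + 0.1 * rng.normal(size=(150, 4))
X *= np.array([1.0, 10.0, 100.0, 0.1])

eigvals, loadings, scores, cumulative = pca_eigen(X)
for i, (ev, cum) in enumerate(zip(eigvals, cumulative), start=1):
    print(f"F{i}: eigenvalue = {ev:.3f}, cumulative variability = {cum:.1f}%")
```

Because each autoscaled variable has unit variance, the eigenvalues sum to the number of variables, which is why the eigenvalue axis of a four-variable scree plot tops out below 4.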

    Iris - scores

    [Figure: score plot, F1 (72.82 %) vs. F2 (22.87 %), points labeled by species (1, 2, 3)]


    Iris - loadings

    [Figure: loading plot, F1 (72.82 %) vs. F2 (22.87 %), for sepal length, sepal width, petal length and petal width]

    Iris - PC3 vs. PC4

    Later components don't appear to have any useful information.

    All that can be seen is that some classes have a greater degree of sample to sample variability.

    [Figure: score plot, points labeled by species (1, 2, 3); axis labels as extracted: F1 (72.82 %) and F3 (3.71 %)]

    Complete linkage HCA

    [Figure: dendrogram of the coffee samples by region: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya, Colombia]

    Coffee - eigenvalues

    [Figure: scree plot of eigenvalues for F1 to F35 (axis 0 to 40) with a second axis from 0 to 100]

    Here you can see that the first 4 components account for almost all of the variance.

    The first two components account for 82.1% of the variance.

    Coffee - scores

    [Figure: score plot, F1 vs. F2, samples labeled by region code (U, S, K, E, R, C)]

    Coffee - Loadings

    The loading plot indicates that most m/e values contribute to the classes. Also, it should be possible to reduce the number of m/e used.

    [Figure: loading plot, F1 vs. F2, both axes from -1 to 1, points labeled by m/e value (47 to 99)]


    Coffee - later components

    PC3 still appears to contain some class related information.

    PC4 indicates that two of our Colombian samples are outliers.

    Arson example

    A series of flammable solvents associated with starting fires was assayed by GC/MS.

    GC traces were reduced to a pattern based on total peak areas for each 1 minute interval during an analysis.

    A matrix was constructed using that data, which was then subjected to PCA.
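The reduction of a GC trace to a per-minute pattern can be sketched as follows, assuming each trace is available as a list of peak retention times (in minutes) and peak areas. The function name `minute_pattern`, the run length and the peak tables are hypothetical.

```python
import numpy as np

def minute_pattern(retention_min, areas, run_length_min):
    """Sum the peak areas that fall inside each 1 minute interval of the run."""
    retention_min = np.asarray(retention_min, dtype=float)
    areas = np.asarray(areas, dtype=float)
    bins = np.floor(retention_min).astype(int)        # interval index of each peak
    keep = (bins >= 0) & (bins < run_length_min)      # ignore peaks outside the run
    pattern = np.zeros(run_length_min)
    np.add.at(pattern, bins[keep], areas[keep])       # accumulate areas per interval
    return pattern

# Two hypothetical samples from a 5 minute run.
pattern_a = minute_pattern([0.4, 0.9, 2.5, 2.6, 4.1], [10, 5, 3, 7, 2], run_length_min=5)
pattern_b = minute_pattern([1.2, 1.7, 3.3], [4, 6, 8], run_length_min=5)

# One row per sample, one column per 1 minute interval: the matrix passed to PCA.
matrix = np.vstack([pattern_a, pattern_b])
print(matrix)
```

Each row then has the same length regardless of how many peaks the sample contains, which is what makes the traces comparable in a single matrix.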

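    Earlier HCA work

    [Figure]

    Scores (PC1 - PC2)

    [Figure: score plot with five groups labeled A, B, C, D and E]

    Loadings

    [Figure: loading plot with annotations "1, 2 and 3" and "18 and 19"]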


    Arson example

    The five groups formed reasonable clusters.

    The loadings followed an odd pattern.

    It should be possible to develop a method for classifying samples.

    This will be used as an example in the next unit.

    Scree plot

    [Figure: eigenvalues for F1 to F21 (axis 0 to 7) with cumulative variability in % (axis 0 to 100)]

    Mycobacteria

    Mycobacteria - PC1 v PC2

    [Figure: score plots of PC1 vs. PC2, samples labeled by group (42, 43, 44, 45, 46, 47, 49)]

    Biplot

    [Figure: biplot on F1 (29.05 %) vs. F2 (17.55 %) with the variables E1, B1 to B4, B5A, B5B, B6 to B9, M1 to M6 and A1 to A5]

    After Varimax rotation (loadings)

    [Figure: loading plot on D1 (25.15 %) vs. D2 (16.12 %), both axes from -1 to 1, for the variables E1, B1 to B4, B5A, B5B, B6 to B9, M1 to M6 and A1 to A5]
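Varimax rotation itself is not described in the slides. As a rough sketch, a common SVD-based iteration for the raw varimax criterion looks like the code below; it is a generic implementation, not necessarily what the software behind these plots does, and the function names and the small two-factor example are hypothetical.

```python
import numpy as np

def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the varimax criterion with respect to the rotation matrix.
        gradient = loadings.T @ (rotated ** 3 - rotated * np.sum(rotated ** 2, axis=0) / p)
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                      # nearest orthogonal matrix
        new_objective = s.sum()
        if new_objective <= objective * (1 + tol):
            break                              # no further improvement
        objective = new_objective
    return loadings @ rotation, rotation

def varimax_criterion(loadings):
    """Sum over factors of the variance of the squared loadings."""
    return np.sum(np.var(loadings ** 2, axis=0))

# Hypothetical loadings with simple structure, deliberately turned by 30 degrees.
simple = np.array([[1.0, 0.0], [0.9, 0.0], [0.8, 0.0],
                   [0.0, 1.0], [0.0, 0.9], [0.0, 0.8]])
angle = np.deg2rad(30.0)
turn = np.array([[np.cos(angle), -np.sin(angle)],
                 [np.sin(angle), np.cos(angle)]])
start = simple @ turn

rotated, rotation = varimax(start)
print("criterion before:", round(varimax_criterion(start), 4))
print("criterion after: ", round(varimax_criterion(rotated), 4))
```

The rotation is orthogonal, so each variable keeps its total squared loading; only the way that total is shared among the rotated axes changes, which is why the rotated axes (D1, D2) carry different percentages than the original factors (F1, F2).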


    After Varimax rotation (biplot)

    [Figure: biplot on D1 (25.15 %) vs. D2 (16.12 %) with the samples (groups 42 to 47 and 49) and the variables E1, B1 to B4, B5A, B5B, B6 to B9, M1 to M6 and A1 to A5; the source repeats this figure several times, followed by the unrotated biplot on F1 (29.05 %) vs. F2 (17.55 %)]