11-b_pca
7/28/2019 11-B_PCA
PCA for removal of noise
With PCA, correlated data is extracted into a series of score-loading pairs.
Random noise is uncorrelated and tends to stay in the residual matrix.
This feature can be exploited as a means of improving S/N in a 2-D data set.
PCA for removal of noise
GC/MS example
A GC/MS dataset is already a suitable matrix for PCA work: it exists as a scaled matrix.
PCA can be used to extract all significant information, leaving the noise behind.
The datafile can then be reconstructed from the loadings and scores.
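GC/MS example
$$X_{\text{GC/MS}} = t_1 p_1 + t_2 p_2 + \dots + t_a p_a + E_{\text{noise}}$$
$$X_{\text{new}} = t_1 p_1 + t_2 p_2 + \dots + t_a p_a$$
Each data matrix is n x m, each score vector t_i is n x 1 and each loading vector p_i is 1 x m, where m = m/e and n = scans.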
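The score-loading reconstruction can be sketched with a truncated SVD. This is a minimal illustration on synthetic data, not the procedure used on the GC/MS files: the matrix sizes, the noise level and the helper name `pca_denoise` are assumptions, and no mean-centering is applied.

```python
import numpy as np

def pca_denoise(X, n_components):
    """Rebuild X from its first n_components score-loading pairs."""
    # SVD gives X = U S Vt; scores T = U S, loadings P = Vt
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    T = U[:, :n_components] * S[:n_components]   # scores, n x a
    P = Vt[:n_components, :]                     # loadings, a x m
    return T @ P                                 # the residual is X - T @ P

# Synthetic stand-in for a GC/MS file: n scans x m m/e channels
rng = np.random.default_rng(0)
n, m, a = 200, 60, 3
scans = np.arange(n)[:, None]
centers = np.array([50, 100, 150])
elution = np.exp(-0.5 * ((scans - centers) / 6.0) ** 2)   # n x a peak profiles
spectra = rng.random((a, m))                              # a x m spectra
clean = elution @ spectra
noisy = clean + rng.normal(scale=0.05, size=clean.shape)

denoised = pca_denoise(noisy, a)
err_before = np.linalg.norm(noisy - clean)
err_after = np.linalg.norm(denoised - clean)
print(f"error before: {err_before:.2f}, after: {err_after:.2f}")
```

In this toy case the number of components is known in advance; with real data it has to be chosen, which is the subject of the next slides.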
GC/MS example
For the evaluation of the approach, a 6-component mixture was assayed over a range of concentrations.
• Components: benzene, toluene, ethylbenzene, dichloromethane, trichloromethane, tetrachloromethane.
• Concentration range: 0.005-1% V/V
Number of components required
The optimum number of components was determined by measuring peak area for all 6 components.
8 PCs were found to be adequate for extraction of all peak area information.
Number of components required
The same type of evaluation was made for mass spectral match quality.
Again, at 8 PCs it was determined that the optimum match quality was observed.
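One way to picture this kind of evaluation is to rebuild the data with an increasing number of components and track how much of each peak area is recovered. The sketch below uses a synthetic three-compound data set with hypothetical sizes, peak positions and noise level; it is not the 6-component mixture from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 150, 30                      # scans x m/e channels (hypothetical sizes)
centers = [40, 75, 110]             # peak positions in scans
amps = [3.0, 2.0, 1.0]              # relative amounts

scans = np.arange(n)
clean = np.zeros((n, m))
for j, (c, amp) in enumerate(zip(centers, amps)):
    profile = amp * np.exp(-0.5 * ((scans - c) / 5.0) ** 2)
    spectrum = np.zeros(m)
    spectrum[10 * j: 10 * j + 5] = 1.0       # each compound has its own ions
    clean += np.outer(profile, spectrum)
noisy = clean + rng.normal(scale=0.02, size=clean.shape)

U, S, Vt = np.linalg.svd(noisy, full_matrices=False)

def peak_area(X, center, half_width=15):
    """Total ion current summed over a window of scans around a peak."""
    return X[center - half_width: center + half_width + 1].sum()

# Fraction of each true peak area recovered with k components
recovery = {}
for k in range(1, 6):
    rebuilt = (U[:, :k] * S[:k]) @ Vt[:k]
    recovery[k] = [peak_area(rebuilt, c) / peak_area(clean, c) for c in centers]
    print(k, np.round(recovery[k], 3))
```

The number of components is taken as adequate once the recovered areas stop changing; adding more components after that point mostly adds noise back.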
Effect on chromatographic data quality
A significant reduction in noise was observed.
S/N improved by a factor of 1.88.
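The slides do not say how S/N was measured. A common convention, assumed here, is baseline-corrected peak height divided by the standard deviation of a signal-free baseline region. The traces below are hypothetical numbers chosen only to show the arithmetic of an improvement factor.

```python
from statistics import mean, pstdev

def signal_to_noise(trace, baseline, peak):
    """S/N as baseline-corrected peak height over baseline standard deviation.

    baseline and peak are (start, stop) index pairs into the trace.
    """
    base = trace[baseline[0]:baseline[1]]
    height = max(trace[peak[0]:peak[1]]) - mean(base)
    return height / pstdev(base)

# Hypothetical traces, for illustration only (not the lecture data)
raw      = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 2.0, 10.0, 2.0]
filtered = [0.0, 0.5, 0.0, -0.5, 0.0, 0.5, 0.0, -0.5, 2.0, 10.0, 2.0]

sn_raw = signal_to_noise(raw, (0, 8), (8, 11))
sn_filtered = signal_to_noise(filtered, (0, 8), (8, 11))
improvement = sn_filtered / sn_raw     # ratio of S/N after to S/N before
print(round(improvement, 2))
```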
Mass spectral improvements
Overall match quality improved by an average of 17%.
The major reason for this was that small noise-related lines had been eliminated.
An additional advantage of the method is that datafile size was reduced by 30%.
Noise removal from 2-D NMR data
PCA has also been evaluated as a means of removing noise from 2-D COSY NMR spectra.
One major difference from the GC/MS approach is that the data is uncorrelated and it is the artifacts that are extracted.
The data remains in the residual.
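The reversed use of PCA, where the extracted components are discarded and the residual is kept, can be sketched as follows. The matrix is a synthetic stand-in (a correlated ridge along one column plus a few isolated peaks), not COSY data; the function name `remove_correlated` and all sizes are assumptions, and the removed fraction is computed on uncentered data.

```python
import numpy as np

def remove_correlated(X, n_components):
    """Return the residual after subtracting the first n_components."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    extracted = (U[:, :n_components] * S[:n_components]) @ Vt[:n_components]
    removed_fraction = np.sum(S[:n_components] ** 2) / np.sum(S ** 2)
    return X - extracted, removed_fraction

# Synthetic stand-in for a 2-D spectrum (not real COSY data)
rng = np.random.default_rng(2)
size = 64
rows = np.arange(size)
ridge_profile = 5.0 * (1.0 + 0.5 * np.sin(rows / 3.0))   # artifact along one column
ridge_shape = np.zeros(size)
ridge_shape[20] = 1.0
artifact = np.outer(ridge_profile, ridge_shape)           # correlated, rank 1

peaks = np.zeros((size, size))
peak_positions = [(10, 40), (30, 50), (45, 12)]           # isolated peaks
for r, c in peak_positions:
    peaks[r, c] = 1.0

spectrum = artifact + peaks + rng.normal(scale=0.01, size=(size, size))
residual, removed = remove_correlated(spectrum, 1)

print(f"fraction of sum of squares removed: {removed:.3f}")
print("peak heights kept:", [round(residual[r, c], 2) for r, c in peak_positions])
print("largest ridge value left:", round(np.abs(residual[:, 20]).max(), 3))
```

Most of the sum of squares is removed with the artifact, while the isolated peaks stay in the residual almost unchanged.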
Noise removal from 2-D NMR data
[Figure: FID trace]
The signal of interest in this type of experiment is uncorrelated by nature.
For evaluation of the method, a 2K x 2K COSY spectrum for menthol was obtained.
Original data vs. PCA processed
At this point 95% of the variance had been removed.
PCA processed
Here, 99% of the variance has been removed.
Before and after
[Figure: spectrum before and after PCA processing.]
Wine analysis
A series of wines was assayed by both GC/MS and AA. Samples were obtained from two regions for two different types of wines.
• Regions: Ohio (Lake Erie), California (Napa Valley)
• Types: Chardonnay, J. Riesling
Components were identified and only those found in all samples were used in subsequent characterizations. GC/MS peaks were used as relative areas and the metals as ppm values.
Wine analysis
Eighteen organic species and eight trace metals were evaluated. After an initial ANOVA, only 9 species were used.
• Organic species: 1-hexanol, 3-ethoxy-1-propanol, ethyl octanoate, 2,3-butanediol, benzyl alcohol
• Metals: manganese, calcium, potassium, sodium
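The slides do not describe how the ANOVA screening was set up. As one plausible reading, each variable can be tested with a one-way ANOVA across the sample groups, keeping only the variables with a large F statistic. The values and the cutoff below are hypothetical.

```python
def anova_f(groups):
    """One-way ANOVA F statistic for a list of groups of measurements."""
    values = [x for g in groups for x in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (mu - grand_mean) ** 2 for g, mu in zip(groups, means))
    ss_within = sum((x - mu) ** 2 for g, mu in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

# Hypothetical values for two variables across three sample groups
candidate = {
    "variable_a": [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]],
    "variable_b": [[1.0, 2.0, 3.0], [1.5, 2.5, 3.5], [1.0, 2.5, 3.0]],
}
f_values = {name: anova_f(groups) for name, groups in candidate.items()}
# Keep variables whose F exceeds a chosen cutoff (5.0 is an arbitrary example)
selected = [name for name, f in f_values.items() if f > 5.0]
print(f_values, selected)
```

In practice the cutoff would come from the F distribution for the chosen significance level and degrees of freedom rather than a fixed number.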
Wines - Initial dendrogram
[Figure: dendrogram of the wine samples, labeled Cr, Cc, Or and Oc; vertical axis: Dissimilarity (0-7).]
Wines - Eigenvalues
[Figure: scree plot of eigenvalues for F1-F9 (left axis, 0-3.5) with cumulative variability (%) on the right axis (0-100).]
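The cumulative variability curve is a running sum of the per-component percentages. The short check below uses the percentages printed on the wine score-plot axes for F1-F3 (36.35, 24.86 and 11.19 %). The conversion to eigenvalues assumes the 9 variables were autoscaled, so that the eigenvalues sum to 9; the slides do not state this for the wine data.

```python
# Percent variability for F1-F3 as printed on the score-plot axes
percent = [36.35, 24.86, 11.19]
n_variables = 9          # 5 organic species + 4 metals

cumulative = []
running = 0.0
for p in percent:
    running += p
    cumulative.append(round(running, 2))

# Assumption: autoscaled variables, so the eigenvalues sum to n_variables
eigenvalues = [round(p / 100 * n_variables, 2) for p in percent]

print(cumulative)    # [36.35, 61.21, 72.4]
print(eigenvalues)   # [3.27, 2.24, 1.01]
```

Under that assumption the first eigenvalue comes out just above 3, which agrees with the 0-3.5 range of the eigenvalue axis.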
PC Plots
[Figure: score plots of the wine samples, labeled Cr, Cc, Or and Oc. Panels: F1 (36.35 %) vs. F2 (24.86 %); F1 (36.35 %) vs. F3 (11.19 %); F2 (24.86 %) vs. F3 (11.19 %).]
Wines - Loading plot
[Figure: Loading 1 vs. Loading 2 (both axes -1 to 1) for 1-hexanol, 3-ethoxy-1-propanol, ethyl octanoate, 2,3-butanediol, benzyl alcohol, K, Na, Ca and Mn.]
Wines - summary
Using just 9 variables (5 organic species and 4 metals), it was possible to identify both the type and source of the wine samples.
The major source of variation was the type: Chardonnay or J. Riesling.
Identification of the wine region (Ohio Lake Erie or California Napa Valley) was the second major source of variation.
In the next unit we'll see if a classification system can be developed using this data.
Earlier examples.
Let's look at some of the earlier problems that we evaluated using HCA and see what we can do with them using PCA.
Iris classification
• Physical measurement of flowers
• Three species.
Coffee
• MS analysis of bean headspace.
• Six regions.
Iris setosa Iris versicolor Iris virginica
Data from Fisher, R. A., The Use of Multiple Measurements in Taxonomic Problems. Annals of Eugenics, 7, 179-188 (1936).
The data correspond to 150 Iris flowers described by four variables (sepal length, sepal width, petal length, petal width) and their species. Three different species are included in this study: setosa, versicolor and virginica.
Iris - Autoscaled, centroidal linkage
[Figure: dendrogram of the 150 iris samples labeled by species (1, 2, 3); vertical axis: Similarity (0.40-1.00).]
Our original dendrogram didn't provide much useful information.
Only one species was resolved from the other two.
Iris - PCA eigenvalues
[Figure: scree plot of eigenvalues for F1-F4 (left axis, 0-3.5) with cumulative variability (%) on the right axis (0-100).]
The first two components account for 95.7% of the variance. Data was autoscaled prior to analysis.
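Autoscaling followed by PCA amounts to an eigen-decomposition of the correlation matrix. The sketch below shows this on a synthetic matrix with 150 samples and 4 correlated variables; it does not use the Fisher data, and the function names are illustrative. With autoscaled data the eigenvalues sum to the number of variables, which is why the eigenvalue axis for a 4-variable problem cannot exceed 4.

```python
import numpy as np

def autoscale(X):
    """Mean-center each column and scale it to unit standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

def pca_eigen(X):
    """PCA of autoscaled data via the correlation matrix."""
    Z = autoscale(X)
    corr = (Z.T @ Z) / (Z.shape[0] - 1)
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    order = np.argsort(eigenvalues)[::-1]          # largest first
    eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]
    scores = Z @ eigenvectors
    percent = 100 * eigenvalues / eigenvalues.sum()
    return eigenvalues, eigenvectors, scores, percent

# Synthetic stand-in with 150 samples and 4 correlated variables
rng = np.random.default_rng(3)
latent = rng.normal(size=(150, 1))
X = latent @ np.array([[2.0, -0.5, 3.0, 1.2]]) + rng.normal(scale=0.5, size=(150, 4))
X += np.array([10.0, 20.0, 30.0, 40.0])   # arbitrary offsets; autoscaling removes them

eigenvalues, eigenvectors, scores, percent = pca_eigen(X)
print(np.round(eigenvalues, 2), np.round(np.cumsum(percent), 1))
```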
Iris - scores
[Figure: score plot of the iris samples labeled by species (1, 2, 3); axes F1 (72.82 %) and F2 (22.87 %).]
Iris - loadings
[Figure: loading plot for sepal length, sepal width, petal length and petal width; axes F1 (72.82 %), -1 to 1, and F2 (22.87 %), 0 to 1.]
Iris - PC3 vs. PC4
[Figure: score plot of the iris samples labeled by species (1, 2, 3); axes as labeled in the plot: F1 (72.82 %) and F3 (3.71 %).]
Later components don't appear to have any useful information.
All that can be seen is that some classes have a greater degree of sample-to-sample variability.
Complete linkage HCA
[Figure: dendrogram of the coffee samples from six regions: Sulawesi, Costa Rica, Ethiopia, Sumatra, Kenya and Colombia.]
Coffee - eigenvalues
[Figure: scree plot of eigenvalues for F1-F35 (axis 0-40) with a second axis running from 0 to 100.]
Here you can see that the first 4 components account for almost all of the variance.
The first two components account for 82.1% of the variance.
Coffee - scores
[Figure: score plot of F1 (-10 to 10) vs. F2 (-8 to 7) with samples labeled U, S, K, E, R and C.]
Coffee - Loadings
The loading plot indicates that most m/e values contribute to the classes. Also, it should be possible to reduce the number of m/e used.
[Figure: loading plot of m/e 47-99 on F1 vs. F2 (both axes -1 to 1).]
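One simple way to act on a loading plot when trimming variables is to rank each variable by its distance from the origin in the F1-F2 loading plane and keep the most distant ones. The slides do not say which criterion would be used, so this is only an illustration on synthetic data; the sizes, the two hidden factors and the choice of six variables are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
n_samples, n_noise = 80, 6
f1 = rng.normal(size=n_samples)        # two hypothetical sources of class variation
f2 = rng.normal(size=n_samples)
informative = [f + 0.3 * rng.normal(size=n_samples) for f in (f1, f1, f1, f2, f2, f2)]
noise = [rng.normal(size=n_samples) for _ in range(n_noise)]
X = np.column_stack(informative + noise)           # columns 0-5 carry the structure

Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # autoscale
eigenvalues, eigenvectors = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigenvalues)[::-1][:2]          # first two components
# Loadings scaled as variable-component correlations (values between -1 and 1)
loadings = eigenvectors[:, order] * np.sqrt(eigenvalues[order])

distance = np.hypot(loadings[:, 0], loadings[:, 1])   # distance from plot origin
keep = np.sort(np.argsort(distance)[::-1][:6])        # six most influential columns
print(keep)
```

Variables that sit near the origin of the loading plot contribute little to the first two components and are the natural candidates for removal.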
Coffee - later components
PC3 still appears to contain some class-related information.
PC4 indicates that two of our Colombian samples are outliers.
Arson example
A series of flammable solvents associated with starting fires were assayed by GC/MS.
GC traces were reduced to a pattern based on total peak areas for each 1-minute interval during an analysis.
A matrix was constructed using that data, which was then subjected to PCA.
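The reduction of a GC trace to a 1-minute interval pattern can be sketched as a simple binning step. The peak table, the run length and the function name `interval_pattern` are hypothetical; each sample yields one row, and stacking the rows gives the matrix passed to PCA.

```python
import math

def interval_pattern(peaks, run_minutes):
    """Sum peak areas into consecutive 1-minute retention-time intervals.

    peaks is a list of (retention_time_min, area) pairs.
    """
    pattern = [0.0] * run_minutes
    for retention_time, area in peaks:
        index = math.floor(retention_time)      # 0.00-0.99 min -> interval 0
        if 0 <= index < run_minutes:
            pattern[index] += area
    return pattern

# Hypothetical peak table for one sample (not lecture data)
sample_peaks = [(0.50, 10.0), (0.90, 5.0), (2.10, 7.0), (2.99, 1.0)]
row = interval_pattern(sample_peaks, run_minutes=4)
print(row)   # [15.0, 0.0, 8.0, 0.0]
```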
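Earlier HCA work
[Figure: result of the earlier HCA work.]
Scores (PC1 - PC2)
[Figure: score plot showing five groups labeled A, B, C, D and E.]
Loadings
[Figure: loading plot with annotated variables: 1, 2 and 3; 18 and 19.]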
Arson example
The five groups formed reasonable clusters.
The loadings followed an odd pattern.
It should be possible to develop a method for classifying samples.
This will be used as an example in the next unit.
Mycobacteria
Scree plot
[Figure: eigenvalues for F1-F21 (left axis, 0-7) with cumulative variability (%) on the right axis (0-100).]
Mycobacteria - PC1 v PC2
[Figure: score plots of PC1 vs. PC2 with samples labeled by class (42, 43, 44, 45, 46, 47, 49); the second panel uses a narrower PC2 range.]
Biplot
[Figure: biplot of the samples (classes 42-49) and the variables A1-A5, M1-M6, B1-B9 (B5 as B5A and B5B) and E1; axes F1 (29.05 %) and F2 (17.55 %).]
After Varimax rotation (loadings)
[Figure: loadings of the variables A1-A5, M1-M6, B1-B9 (B5 as B5A and B5B) and E1 on D1 (25.15 %) and D2 (16.12 %), both axes -1 to 1.]
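Varimax rotates the retained loadings orthogonally so that each variable loads strongly on as few components as possible. The sketch below is a widely used SVD-based iteration, written without Kaiser normalization; it is not necessarily what the software behind these plots does. The test matrix is a made-up simple structure mixed by a 30 degree rotation.

```python
import numpy as np

def varimax(loadings, max_iter=500, tol=1e-10):
    """Orthogonal Varimax rotation of a (variables x components) loading matrix."""
    p, k = loadings.shape
    rotation = np.eye(k)
    objective = 0.0
    for _ in range(max_iter):
        rotated = loadings @ rotation
        # Gradient of the Varimax criterion with respect to the rotation
        gradient = loadings.T @ (
            rotated ** 3 - rotated * (np.sum(rotated ** 2, axis=0) / p)
        )
        u, s, vt = np.linalg.svd(gradient)
        rotation = u @ vt                      # nearest orthogonal matrix
        new_objective = s.sum()
        if objective != 0.0 and new_objective / objective < 1 + tol:
            break
        objective = new_objective
    return loadings @ rotation, rotation

# Made-up example: two groups of variables, mixed by a 30 degree rotation
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
theta = np.deg2rad(30)
mix = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])
unrotated = simple @ mix          # every variable loads on both components
rotated, rotation = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, the total variance carried by the rotated components is unchanged; only its split between them differs.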
After Varimax rotation (biplot)
[Figure: biplot after Varimax rotation of the samples (classes 42-49) and the variables A1-A5, M1-M6, B1-B9 (B5 as B5A and B5B) and E1; axes D1 (25.15 %) and D2 (16.12 %).]
[Figure: the unrotated biplot shown again; axes F1 (29.05 %) and F2 (17.55 %).]