Pattern recognition : principal components analysis
Richard Brereton
NEED FOR PATTERN RECOGNITION
•Exploratory data analysis
e.g. PCA
•Unsupervised pattern recognition
e.g. Cluster analysis
•Supervised pattern recognition
e.g. Classification
Case study
Coupled chromatography in HPLC : profile
[Figure: HPLC elution profile]
Time : rows
Wavelength : columns
MULTIVARIATE DATA
DATA MATRICES
The rows do not need to correspond to elution times in chromatography; they can be any type of sample:
• Blood sample
• Wood
• Chromatograms
• Samples from a reaction mixture
• Chromatographic columns
The columns do not need to correspond to spectral wavelengths; they can be any type of measurement:
• NMR peak heights
• Atomic spectroscopy measurements of elements
• Chromatographic intensities
• Concentrations of compounds in a mixture
• Results of chromatographic tests
Return to example of chromatography.
Rows : elution times
Columns : wavelengths
Chemical factors : X = C.S + E
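As a rough illustration of X = C.S + E, the sketch below builds a small noise-free data matrix from elution profiles and spectra. All numbers are made up, the helper name `matmul` is hypothetical, and E is taken as zero for simplicity.

```python
# Made-up example: 4 elution times, 2 compounds, 3 wavelengths.
C = [[0.1, 0.0],
     [0.8, 0.2],
     [0.3, 0.9],
     [0.0, 0.2]]          # I x 2 : one elution profile per column
S = [[1.0, 0.5, 0.1],
     [0.2, 0.6, 1.0]]     # 2 x J : one spectrum per row


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


X = matmul(C, S)          # data matrix; no error term is added here
n_rows, n_cols = len(X), len(X[0])
print(n_rows, n_cols)     # rows = elution times, columns = wavelengths
```

Each row of X is then a mixture spectrum at one elution time, which is the kind of matrix PCA is applied to in the slides that follow.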
It would be nice to look at the chemical factors underlying the chromatogram. We can use mathematical methods to do this.
ABSTRACT FACTORS : PRINCIPAL COMPONENTS
[Diagram: CHROMATOGRAM, via the PCA TRANSFORMATION, gives SCORES (ELUTION PROFILES) and LOADINGS (SPECTRA)]
X = T . P + E = C . S + E
T are called scores : these correspond to elution profiles
P are called loadings : these correspond to spectra
Ideally the “size” of T and P equals the number of compounds in the mixture.
This “size” equals the number of principal components, e.g. 1, 2, 3 etc.
Each PC has an associated scores vector (column of T), and loadings vector (row of P).
[Diagram: PCA decomposes the data X (I × J) into scores T (I × A) and loadings P (A × J)]
Hence if the original data matrix has dimensions 30 × 28 (or I × J) (= 30 elution times and 28 wavelengths - or 30 blood samples and 28 compound concentrations - or 30 chromatographic columns and 28 tests) and if the number of PCs is denoted by A, then
•the dimensions of T will be 30 × A, and
•the dimensions of P will be A × 28.
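The dimensions above can be checked with a small sketch. The slides do not say which algorithm is used to compute the scores and loadings; NIPALS is assumed here as one common choice, the function name `nipals` is hypothetical, and the data are made up (5 rows, 4 columns, 2 underlying compounds).

```python
import math


def nipals(X, n_components, max_iter=500, tol=1e-12):
    """Return scores T (I x A), loadings P (A x J) and the residual matrix."""
    R = [row[:] for row in X]                 # residual, starts as a copy of X
    n_rows, n_cols = len(R), len(R[0])
    score_vectors, loading_vectors = [], []
    for _component in range(n_components):
        # Initial guess: the column with the largest sum of squares.
        start = max(range(n_cols), key=lambda j: sum(row[j] ** 2 for row in R))
        t = [row[start] for row in R]
        for _step in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(R[i][j] * t[i] for i in range(n_rows)) / tt
                 for j in range(n_cols)]
            norm = math.sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]         # loadings scaled to unit length
            t_new = [sum(R[i][j] * p[j] for j in range(n_cols))
                     for i in range(n_rows)]
            change = sum((u - v) ** 2 for u, v in zip(t_new, t))
            t = t_new
            if change < tol:
                break
        score_vectors.append(t)
        loading_vectors.append(p)
        # Subtract this PC's contribution before extracting the next one.
        R = [[R[i][j] - t[i] * p[j] for j in range(n_cols)]
             for i in range(n_rows)]
    T = [list(row) for row in zip(*score_vectors)]
    return T, loading_vectors, R


# Made-up elution profiles (5 x 2) and spectra (2 x 4); X = C.S has two factors.
C = [[0.1, 0.0], [0.5, 0.1], [0.9, 0.4], [0.4, 0.8], [0.1, 0.5]]
S = [[1.0, 0.7, 0.3, 0.1], [0.1, 0.4, 0.8, 1.0]]
X = [[sum(C[i][k] * S[k][j] for k in range(2)) for j in range(4)]
     for i in range(5)]

T, P, E = nipals(X, 2)
print(len(T), len(T[0]))   # I rows, A columns
print(len(P), len(P[0]))   # A rows, J columns
```

With I = 5, J = 4 and A = 2, T comes out as 5 × 2 and P as 2 × 4, and because the made-up data contain exactly two factors the residual E is numerically zero.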
[Diagram: PCA transforms a samples × variables matrix into a samples × scores matrix]
A major reason for performing PCA is data simplification.
Often datasets are very complex: it is possible to make many measurements, but there are only a few underlying factors.
“See the wood for the trees”. Will look at this in more detail later.
SCORES AND LOADINGS HAVE SPECIAL MATHEMATICAL PROPERTIES
•Scores and loadings are orthogonal.
What does this mean?
$\sum_{i=1}^{I} t_{ia} t_{ib} = 0$ and $\sum_{j=1}^{J} p_{aj} p_{bj} = 0$ for $a \neq b$
•Loadings are normalised.
What does this mean?
$\sum_{j=1}^{J} p_{aj}^{2} = 1$
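The same two properties can be written compactly in matrix form. This is a restatement, assuming T is I × A and P is A × J as defined earlier; the symbol g_a for the sum of squares of the a-th scores vector is introduced here only for illustration.

```latex
\mathbf{P}\,\mathbf{P}^{\mathsf{T}} = \mathbf{I}_{A},
\qquad
\mathbf{T}^{\mathsf{T}}\,\mathbf{T} = \operatorname{diag}(g_{1}, \ldots, g_{A}),
\qquad
g_{a} = \sum_{i=1}^{I} t_{ia}^{2}
```

The loadings are both orthogonal and normalised, so the product of P with its transpose is the identity; the scores are orthogonal but not normalised, so the corresponding product is diagonal rather than the identity.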
PCA is an abstract concept.
Theory. Non-mathematical
Spectrum recorded at different concentrations and several wavelengths; wavelength 6 versus 9 : six spectra.
[Figure: the six spectra plotted across wavelengths 1 to 20]
[Figure: wavelength 6 versus wavelength 9, one point per spectrum]
Each spectrum becomes ONE POINT IN 2 DIMENSIONAL SPACE
(2D = 2 wavelengths)
Spectra
•Fall on a straight line which is the FIRST PRINCIPAL COMPONENT
•The line has a DIRECTION often called the LOADINGS corresponding to the SPECTRAL CHARACTERISTICS
•Each spectrum has a DISTANCE along the line often called the SCORES corresponding to CONCENTRATION
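The geometric picture above can be sketched numerically for the ideal one-compound, two-wavelength case. All values are made up, the data are not centred, and the variable names are hypothetical.

```python
import math

# Made-up response of the pure compound at the two chosen wavelengths.
direction = [3.0, 4.0]
concentrations = [0.2, 0.4, 0.6, 0.8, 1.0, 1.2]     # six made-up concentrations

# Each spectrum is one point in 2D space: concentration times the response.
spectra = [[c * d for d in direction] for c in concentrations]

# Loadings: the direction of the line, scaled to unit length.
length = math.sqrt(sum(d * d for d in direction))
loadings = [d / length for d in direction]

# Scores: the distance of each point along the line (projection on the loadings).
scores = [sum(x * p for x, p in zip(point, loadings)) for point in spectra]

# In this ideal case the score divided by the concentration is the same constant.
ratios = [s / c for s, c in zip(scores, concentrations)]
print(loadings)
print(ratios)
```

The loadings depend only on the spectral characteristics, while the scores scale with concentration, which is the interpretation given in the bullets above.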
EXTENSIONS TO THE IDEA
1. Measurement error
2. Several wavelengths
3. Several compounds
Measurement error
[Figure: the two-wavelength plot with measurement error]
•Best fit straight line - statistics
•Two PCs - the second relates to the error around the straight line
Several wavelengths
•Now no longer a point in 2 dimensional space.
•Typical spectrum. Several thousand wavelengths
• The number of dimensions equals the number of wavelengths.
•The spectra still fall (roughly) on a straight line.
•A point in 1000 dimensional space.
Several compounds
Two compounds, two wavelengths.
[Figure: two-wavelength plot with the two compounds labelled A and B]
RANK AND EIGENVALUE
How many PCs describe a dataset?
Often unknown
•How many compounds in a series of mixtures?
•How many sources of pollution?
•How many compounds in a reaction mixture?
•Sometimes just statistical concept.
•Sometimes mixture of physical and chemical factors, e.g. a reaction mixture : compounds, temperature etc.
EVERY PRINCIPAL COMPONENT HAS A CORRESPONDING EIGENVALUE
•The eigenvalue equals the sum of squares of the scores vector for each PC.
•The more important the PC the bigger the eigenvalue.
•The sum of the eigenvalues should never exceed the sum of squares of the original matrix.
•The sum of the eigenvalues of all significant PCs should approximate to the sum of squares of the original matrix.
RESIDUAL SUM OF SQUARES : decreases as the number of PCs (eigenvalues) included increases.
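These eigenvalue properties can be checked on a tiny example. The sketch assumes uncentred, made-up data with only two variables, so that the eigenvalues of X'X can be written in closed form; it is an illustration and not necessarily the calculation used for the slides.

```python
import math

# Made-up data: 5 samples, 2 variables, roughly on a straight line.
X = [[1.0, 2.1], [2.0, 3.9], [3.0, 6.2], [4.0, 7.8], [5.0, 10.1]]

# Elements of the symmetric 2 x 2 matrix X'X.
a = sum(r[0] * r[0] for r in X)
b = sum(r[0] * r[1] for r in X)
c = sum(r[1] * r[1] for r in X)
total_ss = a + c                    # sum of squares of the original matrix

# Closed-form eigenvalues of a symmetric 2 x 2 matrix.
root = math.sqrt((a - c) ** 2 + 4 * b * b)
g1 = (a + c + root) / 2             # eigenvalue of PC1
g2 = (a + c - root) / 2             # eigenvalue of PC2

# Loadings of PC1 (unit-length eigenvector) and the corresponding scores.
v = [b, g1 - a]
norm = math.hypot(v[0], v[1])
p1 = [v[0] / norm, v[1] / norm]
t1 = [r[0] * p1[0] + r[1] * p1[1] for r in X]
ss_scores_1 = sum(t * t for t in t1)   # should equal g1

# Residual sum of squares after 0, 1 and 2 PCs.
residual_ss = [total_ss, total_ss - g1, total_ss - g1 - g2]
print(g1, g2, ss_scores_1)
print(residual_ss)
```

The first eigenvalue dominates because the points lie close to a line, and the residual sum of squares falls to zero once both PCs are included.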
Log eigenvalue versus component number.
Cut off?
[Figure: log eigenvalue plotted against component number, components 1 to 7]
SEVERAL OTHER APPROACHES EXIST FOR DETERMINING THE NUMBER OF SIGNIFICANT EIGENVALUES.
SUMMARY SO FAR
PCA
• Principal components – how many?
• Scores
• Loadings
• Eigenvalues
GRAPHIC DISPLAY OF PCS
SCORES PLOT
PC2 VERSUS PC1
[Figure: scores plot of PC2 against PC1; points labelled by elution time (1 to 30)]
SCORES AGAINST TIME
PC1 AND PC2 VERSUS TIME
[Figure: scores of PC1 and PC2 plotted against elution time]
LOADINGS PLOT
PC2 VERSUS PC1
[Figure: loadings plot of PC2 against PC1; points labelled by wavelength (220 to 349)]
FOR REFERENCE : pure spectra
[Figure: pure spectra plotted against wavelength (220 to 340); wavelengths 225 and 301 are marked]
LOADINGS AGAINST WAVELENGTH
PC1 AND PC2 VERSUS WAVELENGTH
[Figure: loadings of PC1 and PC2 plotted against wavelength (220 to 340)]
BIPLOTS : SUPERIMPOSING SCORES AND LOADINGS PLOTS
[Figure: biplot with the scores (labelled by elution time, 1 to 30) and the loadings (labelled by wavelength, 220 to 349) superimposed]
MANY OTHER PLOTS
•Not only PC2 versus 1, also PC3 versus 1, PC3 versus 2 etc.
•3D PC plots, 3 axes, rotation etc.
•Loadings and scores are sometimes presented as bar graphs, since the samples or variables do not always have a sequential meaning.
•Plots of eigenvalues against component number
DATA SCALING AND PREPROCESSING
Influences appearance of plots
• Column centring – common in traditional statistics
• Standardisation of columns – subtract mean and divide by standard deviation.
If the data are of different types or on different absolute scales, this is an essential technique.
• Row scaling – to constant total
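The three preprocessing options listed above can be sketched as follows. The data are made up, and the population standard deviation (dividing by the number of rows) is an assumption; dividing by one less than the number of rows is the other common convention.

```python
import math

# Made-up data: 3 samples (rows) by 2 variables (columns) on different scales.
X = [[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]]
n_rows, n_cols = len(X), len(X[0])

col_means = [sum(r[j] for r in X) / n_rows for j in range(n_cols)]
# Population standard deviation of each column (assumed convention).
col_sds = [math.sqrt(sum((r[j] - col_means[j]) ** 2 for r in X) / n_rows)
           for j in range(n_cols)]

# Column centring: subtract the column mean.
centred = [[r[j] - col_means[j] for j in range(n_cols)] for r in X]

# Standardisation: subtract the column mean and divide by the standard deviation.
standardised = [[(r[j] - col_means[j]) / col_sds[j] for j in range(n_cols)]
                for r in X]

# Row scaling to constant total: every row sums to 1.
row_scaled = [[v / sum(r) for v in r] for r in X]

print(centred)
print(standardised)
print(row_scaled)
```

After standardisation both columns carry equal weight despite their very different original scales, which is why the choice of preprocessing changes the appearance of the scores and loadings plots.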
ANOTHER EXAMPLE
Grouping of elements from fundamental properties using PCA.
Step 1 : standardise the data.
Why? On different scales.
PERFORM PCA : Choose the first two PCs
Scores plot
[Figure: scores plot of the first two PCs; points labelled by element symbol, with neighbouring labels including Pb and Bi; Fe and Cu; Co and Zn; Kr, Ar, Ne and He; I, Br, Cl and F; Sr and Ca; Mg and Be; Cs, Rb and K; Na and Li]
Loadings plot
[Figure: loadings plot of the first two PCs; variables labelled ElectroNeg, Oxidation#, Density, Boiling P. and Melting P.]
SUMMARY
•Many types of plot from PCA.
•Interpretation of the plots.
•Preprocessing important.