guide to multivariate analysis in sims chemometrics guide 2010.pdf · • multivariate analysis can...
TRANSCRIPT
A Guide to the Practical Use ofMultivariate Analysis in SIMS
© Crown Copyright 2010
Joanna Lee, Ian Gilmore
National Physical Laboratory, Teddington, UK
Email: [email protected]: http://www.npl.co.uk/nanoanalysis
Slide 2
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
• What is multivariate analysis?
• Some matrix algebra…2. Identification
• Principal component analysis (PCA)
• Multivariate curve resolution (MCR)
3. Quantification and prediction
• Partial least squares regression (PLS)4. Classification
• PCA classification
• Principal Component Discriminant Function Analysis (PC-DFA)
• Partial Least Squares Discriminant Analysis (PLS-DA)
5. Conclusion
Slide 3
(C) Crown Copyrig
ht 2010
0
20
40
60
80
100
120
1990 2000 2010
Year of Publication
Num
ber
of P
ublic
atio
ns
PCAMCRPLSDFAANNs
Why are we here?
Slide 4
(C) Crown Copyrig
ht 2010
Data analysis
SIMS Dataset
SIMS Dataset
How is it related to known properties?
Where are they located?
What chemicalsare on the surface?
Calibration / Quantification
Classification
Identification
Can we predictthese properties?
Which group does it belong to?
What are the differencesbetween groups?
Slide 5
(C) Crown Copyrig
ht 2010
Contents
1. Introduction• What is multivariate analysis?• Some matrix algebra…
2. Identification
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 6
(C) Crown Copyrig
ht 2010
From A. M. C. Davies Spectroscopy Europe 10 (1998) 28
Chemometrics is the science of relating measurements made on a chemical system to the state of the system via application of mathematical or statisticalmethods
Chemometrics
SIMSData
StatisticalResults
ChemometricsKnowledge of Surface Chemistry
and Instrumental Influences
Statistical methods e.g. multivariate analysis
Slide 7
(C) Crown Copyrig
ht 2010
Multivariate analysis
• Multivariate = More than 1 variable• Multivariate analysis is the statistical study of the dependence (covariance)
between different variables
• Variables are numerical values that we can measure on a sampleExample 1: A sample of people
Variables: Height, weight, shoe size, days since last haircut…Example 2: A sample of weather
Variables: Temperature, humidity, wind speed, visibility, UV index…Example 3: A sample of SIMS spectra
Variables: Intensity of Si+ peak, intensity of O+ peak, intensity of C3H5+
peak…
Key points• Many surface analytical technique, incl. SIMS and
XPS, gives data that are multivariate in nature• Finding correlations in the data is the key to
multivariate analysis!
Slide 8
(C) Crown Copyrig
ht 2010
Why use multivariate analysis?
• Modern ToF-SIMS instrument generates huge, multivariatedata sets
• Manual analysis involves selecting a sub-set of most interesting features for analysis by eye
• Multivariate analysis involves simultaneous statistical analysis of all the variables
• Multivariate analysis can summarise the data with a large number of variables, using a much smaller number of “factors”
72 ion images (out of > 400!)
Slide 9
(C) Crown Copyrig
ht 2010
Advantages and disadvantages
• Advantages
– Fast and efficient on modern computers
– Uses all information available– Improves signal to noise ratio
– Statistically valid, removes potential bias
• Disadvantages
– Lots of different methods, procedures, terminologies
– Can be difficult to understand and interpret
Slide 10
(C) Crown Copyrig
ht 2010
Why use multivariate analysis?
100
101
102
103
104
100
101
102
103
104
x ÷÷÷÷ + -
Counts per peak
Pea
ks o
r in
form
atio
n bi
ns
Multivariate methods such as PCA, PLS useful
Ordinary “x ÷ + -” jolly good here
Slide 11
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
• What is multivariate analysis?
• Some matrix algebra…2. Identification
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 12
(C) Crown Copyrig
ht 2010
Data matrix
X has 3 row and 5 columns →3 × 5 data matrix
Each row (spectra) is represented by a vector
=66301224
124222018
21110329
X
Variables
Sam
ples
Mass spectrum of Sample 1
010203040
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 2
0
10
20
30
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 3
010203040
1 2 3 4 5Mass
Inte
nsity
A matrix is simply a rectangular table of numbers!
Slide 13
(C) Crown Copyrig
ht 2010
Matrix algebra
• A and B must be the same size
• Each corresponding element is added
=
−−
+
493
161
210
021
683
142
CBA =+)()()( KIKIKI ×=×+×
Matrix addition
(e.g. ‘pure spectra’ + noise = experimental data)
• No. of columns of A must be equal no. of rows of B
• Row i of A times column j of B gives the row i and column j of the product matrix AB
CBA =×)()()( KIKNNI ×=×××
=
×+××+××+××+××+××+×
=
×
1210
88
1013
22243214
22223212
24213411
23
21
24
22
41
Matrix multiplication
Slide 14
(C) Crown Copyrig
ht 2010
Matrix algebra and SIMS
Mass spectrum of Sample 1
010203040
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 2
0
10
20
30
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 3
010203040
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Chemical 1
0
2
4
6
8
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Chemical 2
0
2
4
6
1 2 3 4 5Mass
Inte
nsity
= ×
60Sample
3
42Sample
2
15Sample
1
Chemical2
Chemical1
60
42
15
Variables [mass]S
amples
Variables [mass]Chemicals
Chem
icals
Data matrix Chemical spectraSample composition= ×
11524
40161
66301224
124222018
21110329
= ×
Sam
ples
Slide 15
(C) Crown Copyrig
ht 2010
Matrix algebra and SIMS
1. Each spectrum can be represented by a vector
2. Instead of x, y, z in 3D real space, the axes are mass1, mass2, mass3… etc in variable space (also ‘data space’)
3. Assuming the data are a linear combination of chemical spectra, we
can write it as a product of two matrices
4. There are infinite number of possible solutions!
60
42
15
Variables [mass]S
amples
Variables [mass]Chemicals
Chem
icals
Data matrix Chemical spectraSample composition= ×
11524
40161
66301224
124222018
21110329
= ×
Sam
ples
Slide 16
(C) Crown Copyrig
ht 2010
Factor analysis
1. Each spectrum can be represented by a vector
2. Instead of x, y, z in 3D real space, the axes are mass1, mass2, mass3… etc in variable space (also ‘data space’)
3. Assuming the data are a linear combination of chemical spectra, we
can write it as a product of two matrices
4. There are infinite number of possible solutions!
-1.63.8
-0.23.7
1.93.6
Variables [mass]S
amples
Variables [mass]Factors
Factors
Data matrix “Loadings”“Scores”= ××××
−−− 4.43.13.59.59.3
5.30.17.57.57.4
66301224
124222018
21110329
= ××××
Sam
ples
We can describe the data as a linear combination of spectra, by writing the data matrix as product of two matrices:
One contains the spectra (“loadings”)One containing the contributions (“scores”)
This is the basis of factor analysis!
Slide 17
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification• Principal component analysis (PCA)• PCA walkthrough
• Data preprocessing
• PCA examples
• Multivariate curve resolution (MCR)
• MCR examples3. Quantification and prediction
4. Classification
5. Conclusion
Slide 18
(C) Crown Copyrig
ht 2010
Data analysis
SIMS Dataset
SIMS Dataset
How is it related to known properties?
Where are they located?
What chemicalsare on the surface?
Calibration / Quantification
Classification
Identification
Can we predictthese properties?
Which group does it belong to?
What are the differencesbetween groups?
Slide 19
(C) Crown Copyrig
ht 2010
Terminology
In order to clarify existing terminology and emphasise the relationship between the different multivariate techniques, we are going to adopt the following terminology in this lecture
ISO FDIS 18115-1 Surface chemical analysis – Vocabulary –Part 1: General terms and terms for the spectroscopies
ScoresPure Component Concentration
Scores, Projections
Projection of the samples onto the factors
TScores
LoadingsPure Component Spectrum
Loadings, Eigenvector
Projection of a factor onto the variables
PLoadings
Latent Vectors, Latent Variables
Pure Component
Principal Component
An axis in the data space of a factor analysis model, representing an underlying dimension that contributes to summarising or accounting for the original data set
-Factor
PLSMCRPCADefinitionSymbolTerms Here
Slide 20
(C) Crown Copyrig
ht 2010
Principal component analysis (PCA)
m1
m2
PCA Factor 1
PCA Factor 2
• Factors are directions in the data space chosen such that they reflect interesting properties of the dataset
• Equivalent to a rotation in data space – factors are new axes
• Data described by their projections onto the factors
PCA Factor 1
PCA Factor 2
Slide 21
(C) Crown Copyrig
ht 2010
Principal component analysis (PCA)
m1
m2
PCA Factor 1
PCA Factor 2
• The projection of the PCA factors onto the original variables (m1, m2) are ‘loadings’
• The projection of the samples (stars) onto the PCA factors are ‘scores’
• The data is fully described by D factors, where D is the ‘dimensionality of the data’ (number of samples or variables, whichever is smaller)
Data matrix
Scores matrix Loadings matrix
PTX ′=
PCA Factor 1
PCA Factor 2
I = no. of samplesK = no. of mass units
D = dimensionality of data
))(()( KDDIKI ××=×
Slide 22
(C) Crown Copyrig
ht 2010
Principal component analysis (PCA)
m1
m2
PCA Factor 1
PCA Factor 2
• PCA extract orthogonal (uncorrelated)factors that successively capture the largest amount of variance within the data
• The amount of variance described by each factor is called ‘eigenvalue’
PCA Factor 1(78%)
PCA Factor 2 (22%)
Data matrix
Scores matrix Loadings matrix
PTX ′=))(()( KDDIKI ××=×
I = no. of samplesK = no. of mass units
D = dimensionality of data
Slide 23
(C) Crown Copyrig
ht 2010
Principal component analysis (PCA)
m1
m2
PCA Factor 1
• By removing higher factors (small variance due to noise) we can reduce the dimensionality of data ⇒ ‘factor compression’
• Often hundreds of variables can be described with just a handful of factors!
Data matrix
Scores matrix Loadings matrix
Residuals (noise)
EPTX +′=
PCA Factor 1(78%)
)())(()( KIKNNIKI ×+××=×
I = no. of samplesK = no. of mass units
N = no. of PCA factors
PCA reproduceddata matrix
Scores matrix Loadings matrix
))(()( KNNIKI ××=×PTX ′=
Slide 24
(C) Crown Copyrig
ht 2010
Number of factors
1 2 3 4 5 6 7 810-15
10-10
10-5
100
105
1010
Var
ianc
e ca
ptur
ed (
Eig
enva
lues
)
1 2 3 4 5 6 7 8101
102
103
104
105
106
107
108
PCA Factor
(a)
(b)
Data set of 8 spectra from mixing 3 pure compound spectra
no noise
Poisson noisemax 5000 counts
1. Prior knowledge of system
2. ‘Scree test’:Eigenvalue plot levels off in a linearly decreasing manner after 3 factors
3. Percentage of variance captured byNth PCA factor:
4. Percentage of total variance captured by first N PCA factors:
%100seigenvalue all of sum
eigenvalue th
×N
%100seigenvalue all of sum to up seigenvalue of sum ×N
Slide 25
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
• Principal component analysis (PCA)• PCA walkthrough• Data preprocessing
• PCA examples
• Multivariate curve resolution (MCR)
• MCR examples3. Quantification and prediction
4. Classification
5. Conclusion
Slide 26
(C) Crown Copyrig
ht 2010
PCA walkthrough
Eight library spectra:
PS 2480PS 3550PMMA 2170PMMA 2500PEG 1470PEG 4250PPG 425PPG 1000
Unit mass binnedand mean centeredprior to analysis
Calculation using MATLAB with PLS Toolbox 4.0
Slide 27
(C) Crown Copyrig
ht 2010
PCA walkthrough
Perform scree test using the log eigenvalue plot
Slide 28
(C) Crown Copyrig
ht 2010
PCA walkthrough
First PCA factor (PC1):
Slide 29
(C) Crown Copyrig
ht 2010
PCA walkthrough
Second PCA factor (PC2):
Slide 30
(C) Crown Copyrig
ht 2010
PCA walkthrough
Third PCA factor (PC3):
Slide 31
(C) Crown Copyrig
ht 2010
PCA walkthrough
Biplot - PCA factor 1 against PCA factor 2:
Slide 32
(C) Crown Copyrig
ht 2010
PCA walkthrough
Using PCA we have effectively reduced 300 correlated variables (mass units) to 3 independent variables (factors) by which all the samples can be characterised.
Slide 33
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
• Principal component analysis (PCA)• PCA walkthrough
• Data preprocessing• PCA examples
• Multivariate curve resolution (MCR)
• MCR examples3. Quantification and prediction
4. Classification
5. Conclusion
Slide 34
(C) Crown Copyrig
ht 2010
Data preprocessing
Data preprocessing is the manipulation of data prior to data analysis…
Slide 35
(C) Crown Copyrig
ht 2010• Enhances PCA by bringing out important variance in dataset
• Makes assumption about the nature of variance in data
• Can distort interpretation and quantification
• Includes:
– Peak selection and binning
– Centering
• Mean centering– Scaling
• Normalisation
• Variance scaling
• Poisson scaling
• Binominal scaling
Data preprocessing
Slide 36
(C) Crown Copyrig
ht 2010
Peak selection and binning
All peaks of interest included?What threshold to use?
Auto peak search
Straightforward to use but detailed information lost
Unit mass binning
0.5 u binning *Separates organic from inorganics
Peaks of interest onlyUnexpected features lost
Manual selection
A Henderson et al., Surf. Interface Anal. 41 (2009) 666-674
Slide 37
(C) Crown Copyrig
ht 2010
Peak selection and binning
All peaks of interest included?What threshold to use?
Auto peak search
Straightforward to use but detailed information lost
Unit mass binning
0.5 u binning *Separates organic from inorganics
Peaks of interest onlyUnexpected features lost
Manual selection
A Henderson et al., Surf. Interface Anal. 41 (2009) 666-674
Important considerations
• What information are we putting into PCA? What is included? What is omitted?
• Do we need to apply further processinge.g. dead time correction?
Slide 38
(C) Crown Copyrig
ht 2010• Subtract mean spectrum from each sample
• PCA describes variations from the mean
Mean centering
m1
Factor 2
Factor 1
Raw data
1st factor goes from origin to centre of gravity of data
m2
m1
m3
Factor 1Factor 2
Mean Centering
1st factor goes from origin and accounts for the highest variance
( )kikik :mean~
XXX −=
m3
m2
Slide 39
(C) Crown Copyrig
ht 2010
Normalisation
• Divide each spectrum by a constant for each sample e.g. intensity of a specific ion, total ion intensity
• Assumes chemical variances can be described by relative changes in ion intensities
• Preserves the shape of spectra
• Reduces effects of topography, sample charging, changes in primary ion current
( ) iki
ik XX
X ×=:sum
~ 1
0 50 100 150Mass, u
0 50 100 150Mass, u
Normalised by total ion intensity
Slide 40
(C) Crown Copyrig
ht 2010
Variance scaling
• Divide each variable by its standard deviation in the dataset
• Equalises importance of each variable (i.e. mass)
• Problematic for weak peaks – usually used with peak selection• Called ‘auto scaling’ if combined with mean centering
Var
ianc
e
Mean
For each variable (mass, in SIMS spectrum)
Raw dataMean-
centeringVariance scaling
Auto scaling
Diagram from P. Geladi and B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta, 185 (1986) 1
( ) ikk
ik XX
X ×=:var
~ 1
Slide 41
(C) Crown Copyrig
ht 2010
Poisson and binomial scaling
• SIMS data is dominated by Poisson counting noise – statistical uncertainty of a peak is proportional to intensity
• The noise becomes binomial for saturated data with dead time correction
• Divide data by the estimated noise variance of each data point
M R Keenan et al. Surf. Interface Anal., 36 (2004) 203M R Keenan et al., Surf. Interface Anal., 40 (2008) 97
No preprocessingNo. of factors?
Poisson scaling4 factors needed
• Emphasises weak peaks which vary above the expected counting noise, over intense peaks varying solely due to counting statistics
• Provides better noise rejection in PCA
Slide 42
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
• Principal component analysis (PCA)• PCA walkthrough
• Data preprocessing
• PCA examples• Multivariate curve resolution (MCR)
• MCR examples3. Quantification and prediction
4. Classification
5. Conclusion
Slide 43
(C) Crown Copyrig
ht 2010
PCA example (1)
• Three protein compositions (100% fibrinogen, 50% fibrinogen / 50% albumin, 100% albumin) adsorbed onto poly(DTB suberate)
• Loadings on first factor (PC1) shows relative abundance of amino acid peaks of two proteins
• Scores on PC1 separates samples based on protein composition
D.J. Graham et al, Appl. Surf. Sci., 252 (2006) 6860
PC
1 S
core
s (6
2%)
PC
1 Lo
adin
gs (
62%
)
Fib
Alb
Slide 44
(C) Crown Copyrig
ht 2010
PCA example (2)
• SIMS spectra acquired for antiferritin with or without trehalose coating
• Largest variance (PC 1) arises from sample heterogeneity
• PC 2 distinguishes samples protected by trehalose –higher intensities of polar and hydrophilic amino acid fragments
• Trehalose preserves protein conformation in UHV
N. Xia et al, Langmuir, 18 (2002) 4090
Slide 45
(C) Crown Copyrig
ht 2010
PCA example (3)
• 16 different single protein films adsorbed on mica
• Excellent classification of proteins using only 2 factors
• Loadings consistent with total amino acid composition of various proteins
• 95% confidence limits provide means for identification / classification
M. Wagner & D. G. Castner, Langmuir, 17 (2001) 4649
PC
2 S
core
s (1
9%)
PC1 Scores (53%)
PC
A L
oadi
ngs
Slide 46
(C) Crown Copyrig
ht 2010
PCA image analysis
I, rows
J, columnsK, m
ass
peaks963
852
741
963
852
741
• ‘Datacube’ contains a raster of I x J pixels and K mass peaks
• The datacube is rearranged into 2D data matrix with dimensions [(I × J) × K] prior to PCA – ‘unfolding’
• PCA results are folded to form scores images prior to interpretation
9
8
7
6
5
4
3
2
1
963
852
741 unfold
9
8
7
6
5
4
3
2
1
9
8
7
6
5
4
3
2
1
Slide 47
(C) Crown Copyrig
ht 2010
2 4 6 8 10 12 14 16 18 20
0
0.2
0.4
0.6
0.8
Sorted eigenvector index
log
(eig
env
alu
es)
2 4 6 8 10 12 14 16 18 20-2
-1
0
1
2
Sorted eigenvector index
log
(eig
en
valu
es)
2 4 6 8 10 12 14 16 18 20-5
-4.5-4
-3.5-3
-2.5-2
-1.5
Sorted eigenvector index
log(
eig
enva
lue
s)
PCA image example (1)
Immiscible PC / PVC polymer blend42 counts per pixel on average
Total ion image
Mean centering
Normalisation
Poisson scaling
Only 2 factors needed –dimensionality of image reducedby a factor of 20!
J L S Lee, I S Gilmore, in Surface Analysis: The Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.
Slide 48
(C) Crown Copyrig
ht 2010
-5
0
5
10
0 5 10 15 20 25 30 35 40-0.5
0
0.5
1
1.5
2
Mass, u
PCA image example (1)
PCA results after Poisson scaling and mean centering
1st factor distinguishes PVC and PC phases
PC1 loadings
PC1 scores
-2
0
2
4
6
8
10
0 5 10 15 20 25 30 35 40-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
Mass, u
2nd factor shows detector saturation for intense 35Cl peak
PC2 loadings
PC2 scores
35Cl + 37Cl
O + OH
J L S Lee, I S Gilmore, in Surface Analysis: The Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.
Slide 49
(C) Crown Copyrig
ht 2010
PCA image example (2)
72 ion images (out of > 400!)
Image courtesy ofDr Ian FletcherIntertek MSG
50µµµµmMass, u
Total Spectra
Slide 50
(C) Crown Copyrig
ht 2010
PCA image example (2)
Hair fibre with multi-component pretreatment
PCA factors are abstract combinations of chemical components andoptimally describe variance – PCA results can be difficult to interpret!
Image courtesy of Dr Ian Fletcher, Intertek MSG
Fac
tor
1
1000
2000
3000
4000
5000
Fac
tor
2
-3000
-2000
-1000
0
1000
0
0.2
0.4
A
C
D E
B
-0.2
0
0.2
0.4
0.6
E
CB
0.8
PCA scores PCA loadings
Fac
tor
1
1000
2000
3000
4000
5000
Fac
tor
2
-3000
-2000
-1000
0
1000
Fac
tor
1
1000
2000
3000
4000
5000
Fac
tor
2
-3000
-2000
-1000
0
1000
0
0.2
0.4
A
C
D E
B
-0.2
0
0.2
0.4
0.6
E
CB
0.8
0
0.2
0.4
A
C
D E
B
-0.2
0
0.2
0.4
0.6
E
CB
0.8
PCA scores PCA loadings
Fac
tor
3
-1000
0
1000
2000
3000
Fac
tor
4
-1000
0
1000
2000
Fac
tor
5
-1000
0
1000
2000
-0.4
-0.2
0
0.2
0.4
0.6
0.8
D
B
A
-0.4
-0.2
0
0.2
0.4
0.6
A
E
B
C
-0.4
-0.2
0
0.2
0.4
0.6
0.8
Mass (arb. scale)
C
D
B
Fac
tor
3
-1000
0
1000
2000
3000
Fac
tor
4
-1000
0
1000
2000
Fac
tor
5
-1000
0
1000
2000
Fac
tor
3
-1000
0
1000
2000
3000
Fac
tor
4
-1000
0
1000
2000
Fac
tor
5
-1000
0
1000
2000
-0.4
-0.2
0
0.2
0.4
0.6
0.8
D
B
A
-0.4
-0.2
0
0.2
0.4
0.6
A
E
B
C
-0.4
-0.2
0
0.2
0.4
0.6
0.8
Mass (arb. scale)
C
D
B
-0.4
-0.2
0
0.2
0.4
0.6
0.8
D
B
A
-0.4
-0.2
0
0.2
0.4
0.6
A
E
B
C
-0.4
-0.2
0
0.2
0.4
0.6
0.8
Mass (arb. scale)
C
D
B
J L S Lee et al., Surf. Interface Anal. 2009, 41, 653-665
Slide 51
(C) Crown Copyrig
ht 2010
PCA summary
• PCA describes the original data using factors, consisting of loadingsand scores which efficiently accounts for variance in the data
• Eigenvalues give the variance captured by the corresponding factors
• Data preprocessing method needs to be selected with care
• PCA is excellent for discrimination and classification based on differences in spectra, and for identifying important mass peaks
• PCA factors optimally describe variance – PCA results may be difficult to interpret
Data matrix
Projection of samplesonto factors (scores matrix)
Projection of factorsonto variables (loadings matrix)
Residuals (noise)
EPTX +′=
I = no. of samplesK = no. of mass unitsN = no. of factors
Slide 52
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
• Principal component analysis (PCA)• PCA walkthrough
• Data preprocessing
• PCA examples
• Multivariate curve resolution (MCR)• MCR examples
3. Quantification and prediction
4. Classification
5. Conclusion
Slide 53
(C) Crown Copyrig
ht 2010
Multivariate curve resolution (MCR)
• PCA factors are directions that describes variance
– positive and negative peaks in the loadings
– can be difficult to interpret• What if we want to resolve original chemical spectra and reverse
the following process?
• Try multivariate curve resolution (MCR)!
60
42
15
Variables [mass]
Sam
ples
Variables [mass]Chemicals
Chem
icals
Data matrix Chemical spectraSample composition= ×
11524
40161
66301224
124222018
21110329
= ×S
amples
Slide 54
(C) Crown Copyrig
ht 2010
Multivariate curve resolution (MCR)
Data matrix
Projection of samplesonto factors (scores matrix)
Projection of factors onto variables (loadings matrix)
Residuals (noise)
EPTX +′=)())(()( KIKNNIKI ×+××=×
I = no. of samplesK = no. of mass unitsN = no. of factors
m1
m2 MCR Factor 2
MCR Factor 1
MCR is designed for recovery of chemical spectra and contributions from a multi-component mixture, when little or no prior information about the composition is available
MCR assumes linear combination of chemical spectra (loadings) and contributions (scores) – only an approximation in SIMS
Slide 55
(C) Crown Copyrig
ht 2010
Multivariate curve resolution (MCR)
MCR uses an iterative least-squares algorithm to extract solutions, while applying suitable constraints
With non-negativity constraint, MCR factors resemble SIMS spectra and chemical contributions more directly, as these must be positive
m1
m2 MCR Factor 2
MCR Factor 1
Data matrix
Projection of samplesonto factors (scores matrix)
Projection of factors onto variables (loadings matrix)
Residuals (noise)
EPTX +′=)())(()( KIKNNIKI ×+××=×
I = no. of samplesK = no. of mass unitsN = no. of factors
Slide 56
(C) Crown Copyrig
ht 2010
Outline of MCR Raw Data
Constraints•Non-negativity•Equality
Convergence criterion
•Non-negativity•Equality
MCRalternating-least-squares
optimisation
EPTX +′=
Initial Estimates of T or P•Random initialisation•PCA loadings or scores•Varimax rotated PCA loadings or scores
•Pure variable detection algorithm e.g. SIMPLISMA Number of Factors
ReproducedData Matrix
•Noise filtered data•Ensures MCR solution is robust
X
MCRScores T
MCRLoadings P
PCA
Data Matrix X
Slide 57
(C) Crown Copyrig
ht 2010• MCR solutions are not unique!
• Accuracy of resolved spectra depends on the existence of pixels or samples where there is only contribution from one chemical component (‘selectivity’)
• Good initial estimates, suitable data preprocessing and correct number of factors are essential
m2
Rotational ambiguity
m1
MCR Factor 2
MCR Factor 1
m2
m1
MCR Factor 2
MCR Factor 1
Slide 58
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
• Principal component analysis (PCA)• PCA walkthrough
• Data preprocessing
• PCA examples
• Multivariate curve resolution (MCR)
• MCR examples3. Quantification and prediction
4. Classification
5. Conclusion
Slide 59
(C) Crown Copyrig
ht 2010
MCR image example (1)MCR calculations using Matlab with MCR-ALS toolbox, freely available from http://www.mcrals.info/
Simple PVC / PC polymer blend
Slide 60
(C) Crown Copyrig
ht 2010
MCR image example (1)
MCR calculations using Matlab with MCR-ALS toolbox, freely available from http://www.mcrals.info/
MCR scores (‘pure component concentration’) –these will be ‘folded’ to form projection images
MCR loadings (‘pure component spectra’)
Slide 61
(C) Crown Copyrig
ht 2010
MCR image example (1)
0 10 20 30 400
0.5
1
1.5
2
(a)
Loadings on MCR factor 1
0 10 20 30 400
0.5
1
1.5
Mass, u
Loadings on MCR factor 2
(b)
Scores on MCR factor 1
Scores on MCR factor 2
0
5
10
0
5
10
Simple PVC / PC polymer blend
• MCR extracts two distinctive factors, corresponding to PVC and PC respectively
• Straightforward interpretation
PVC
PC
J L S Lee et al., Surf. Interface Anal. 2008, 40, 1-14
Slide 62
(C) Crown Copyrig
ht 2010
MCR image example (2)
72 ion images (out of > 400!)
Image courtesy ofDr Ian FletcherIntertek MSG
50µµµµmMass, u
Total Spectra
Slide 63
(C) Crown Copyrig
ht 2010
MCR image example (2)
0
0.2
0.4
0.6
D
B
0
0.2
0.4 E
0
0.2
0.4
0.6
0.8
1
Mass (arb. scale)
C
Fac
tor
3
0
1000
2000
3000
4000
5000
Fac
tor
4
0
1000
2000
3000
4000
Fac
tor
5
0
1000
2000
MCR scores MCR loadings
0
0.2
0.4
0.6
0.8
1
B
0
0.2
0.4
0.6
0.8A
Fac
tor
1
0
1000
2000
3000
4000
5000
6000
Fac
tor
2
0
1000
2000
3000
MCR scores MCR loadings
Mass (arb. scale)
MCR loadings resemble SIMS spectra (characteristic peaks A-E) and fragments, and scores directly reveal spatial distributions!
J L S Lee et al., Surf. Interface Anal. 2009, 41, 653-665
Image courtesy of Dr Ian Fletcher, Intertek MSG
Slide 64
(C) Crown Copyrig
ht 2010
PCA Scores 1 PCA Scores 2 PCA Scores 3
MCR image example (3)
• We take three pictures and assign each with a SIMS spectra (PBC, PC, PVT)
• The pictures are combined to form a multivariate image dataset
• Poisson noise are added to the image (avg ~50 counts per pixel)
Slide 65
(C) Crown Copyrig
ht 2010
PCA Scores 1 PCA Scores 2 PCA Scores 3MCR Scores 1 MCR Scores 2 MCR Scores 3
MCR resolves the original images unambiguously!
MCR image example (3)
• We take three pictures and assign each with a SIMS spectra (PBC, PC, PVT)
• The pictures are combined to form a multivariate image dataset
• Poisson noise are added to the image (avg ~50 counts per pixel)
Slide 66
(C) Crown Copyrig
ht 2010
MCR spectra example
• ToF-SIMS depth profiling of copper film grown on TaNcoated silicon wafer
• Manual analysis is difficult, e.g. Si- can arise from SiOx
-, SiN- or silicon substrate
• MCR resolves 8 factors. Loadings resemble SIMS spectra of individual phases and scores resemble their contribution to the depth profile
• Improve signal to noise and correlation of related peaks
Scores for all 8 MCR factors
Scores and loadings for 3 of the MCR factors
K G Lloyd. J. Vac. Sci. Technol. A 25 (2007) 878
Slide 67
(C) Crown Copyrig
ht 2010
MCR summary
• MCR describes the original data using factors, consisting of loadingsand scores which which resembles chemical spectra and contributions from a multi-component mixture, respectively
• MCR uses an iterative algorithm to extract solutions, while applying suitable constraints e.g. non-negativity
• Good initial estimates and suitable data preprocessing are essential
• MCR is excellent for identification and localisation of chemicals in complex mixtures and allows for direct interpretation
Data matrix
Projection of samplesonto factors (scores matrix)
Projection of variablesonto factors (loadings matrix)
Residuals (noise)
EPTX +′=
I = no. of samplesK = no. of mass unitsN = no. of factors
Slide 68
(C) Crown Copyrig
ht 2010
Identification summary
Easy –
Non-negative scores and loadings
Medium / Difficult –
Abstract, orthogonal factors
Easy –
Single ion images
Ease of interpretation
Difficult ? –Possibly depend on system studied
Easy –Higher factors capture small variance
Difficult –Only if substance is known
Detection of minor components
Easy –Full spectra obtained
Medium –Important peaks and correlation
Difficult –Characteristic peaks only
Chemical identification
Most suitable for
Attributes
Identification for unknown mixtures
Discrimination of similar chemical phases
Simple dataset with good prior knowledge
MCRPCAManual analysis
Slide 69
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
3. Quantification and prediction• Partial least squares regression (PLS)• Calibration, validation and prediction• PLS examples
4. Classification
5. Conclusion
Slide 70
(C) Crown Copyrig
ht 2010
Data analysis
SIMS Dataset
SIMS Dataset
How is it related to known properties?
Where are they located?
What chemicalsare on the surface?
Calibration / Quantification
Classification
Identification
Can we predictthese properties?
Which group does it belong to?
What are the differencesbetween groups?
Slide 71
(C) Crown Copyrig
ht 2010
Regression analysis
We use regression analysis to find a predictive relationship between two variables y = 0.6944*x
0
5
10
15
20
25
30
0 10 20 30Predictor variable, x
Res
pons
e va
riabl
e, y
ey = b*x + e
‘Response’ variable
‘Predictor’ variable
Regressioncoefficient
Slide 72
(C) Crown Copyrig
ht 2010
Multivariate regression
Mass spectrum of Sample 1
01020
3040
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 2
0
10
20
30
1 2 3 4 5Mass
Inte
nsity
Mass spectrum of Sample 3
010203040
1 2 3 4 5Mass
Inte
nsity
6
4
1
Solution concentration
41Sample 3
72Sample 2
35Sample 1
Reaction time
Molecular weight
Measured properties
Can we predict the properties of similar materials from their SIMS spectra?
( )exbxbxbxby
efy
mm +++++=+=
...332211
x
‘Response’ variablei.e. measured property
‘Predictor’ variablei.e. intensity at mass m
Regressioncoefficient
Slide 73
(C) Crown Copyrig
ht 2010
Multivariate regression
EXBY +=
Responsevariables
SIMSdata matrix
Regressionmatrix
Residuals (noise)
)())(()( MIMKKIMI ×+××=×
I = no. of samplesK = no. of mass unitsM = no. of response variables
Extending to I samples and M response variables
1. We can calculate B to gain an understanding of the covariance relationship between X and Ye.g. relating SIMS spectra with sample preparation parameters
2. B can be applied to future samples in order to predict Y using only measurements of Xe.g. quantifying surface composition or coverage of samples using only their SIMS spectra
Slide 74
(C) Crown Copyrig
ht 2010
Partial least squares regression (PLS)
• Partial least squares regression (PLS) is a multivariate regressionmethod for data X containing a large number of strongly correlatedvariables
• PLS finds factors (called ‘latent variables’) that successively accounts for the largest covariance between X and Y
– Removes redundant information from the regression i.e. information describing X that has no correlation with Y
– Higher PLS factors that describe little covariance between X and Y can be discarded
• The regression vectors B are a linear combination of PLS loadingsthat best predict Y from X
EXBY +=)())(()( MIMKKIMI ×+××=×
Important to determine the number of factors to include in PLS!
Slide 75
(C) Crown Copyrig
ht 2010
Calibration, validation, prediction
Calibration
Fit a PLS model to a calibration data set with known X and Y
Use cross-validation to determine number of factors
Validation
Apply model to an independentvalidation data set with known X and Y, and calculate error between predicted Y and measured Y
Prediction
Apply model to future samples to predict Y using only measurements of X
EXBY +=)())(()( MIMKKIMI ×+××=×
Use validation to determine prediction error
Slide 76
(C) Crown Copyrig
ht 2010
Number of factors –cross validation
• PLS can be used to build predictive models (calibration)• Cross-validation can be used to determine the number of factors and
guard against over-fitting
0
5
10
15
20
25
30
35
0 10 20 30Predictor variable, X
Res
pons
e va
riabl
e, Y
0
5
10
15
20
25
30
35
40
45
0 10 20 30Predictor variable, X
Res
pons
e va
riabl
e, Y
Good predictive model Data is overfitted!
5th orderpolynomial
1st order polynomial
Slide 77
(C) Crown Copyrig
ht 2010
1 2 3
4 5
1 2 3 4 5 6 7 8 9 10 11 120
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Number of PLS Factors
RM
SE
CV
, RM
SE
C
RMSECVRMSEC
Number of factors –cross validation
• ‘Leave one out’ cross validation most popular– Calculate PLS model excluding sample i, for N PLS factors– Use model to predict sample i and calculate error
– Repeat for all different samples– Calculate root mean square error of cross validation (RMSECV)– Repeat for different number of PLS factors
RMSEC (Root Mean Square Error of Calibration) goes down with increasing number of factors
To decide optimal number of factors use minimum of RMSECV(Root Mean Square Error of Cross Validation)
1 2 3
4 5
Slide 78
(C) Crown Copyrig
ht 2010
Validation and prediction
Calibrationdata set
populationnext set of
samplenext set ofsample?
• Validation data should be statistically independent from calibration data e.g. data taken on a different batch of samples, on a different day
• Calculate RMSEP (Root Mean Square Error of Prediction)
Calibration
Validation
Prediction
Independent validation set is essential if we want to use model to predict new samples!
Slide 79
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
3. Quantification and prediction• Partial least squares regression (PLS)
• Calibration, validation and prediction
• PLS examples4. Classification
5. Conclusion
Slide 80
(C) Crown Copyrig
ht 2010
PLS example (1)
Calculation using MATLAB with PLS Toolbox 4.0
• 13 SIMS spectra of thin films of Irganox were compared with their thicknesses measured with XPS
• Both X and Y are mean centered before analysis, and peaks below 50 u were removed
• Two PLS factors are retained, explaining 99.8% of the variance in X (SIMS data) and 98.8% of the variance in Y (thicknesses)
Slide 81
(C) Crown Copyrig
ht 2010
PLS example (1)
• Leave-one-out cross validation shows 2 PLS factors to include in model
• Scores and loadings on PLSfactor 1 and factor 2 are calculated
2 4 6 8 10 12
-2
-1
0
1
2
x 10-11
Sample
PLS
Sco
res
Factor 1Factor 2
500 1000 1500-0.2
0
0.2
0.4
0.6
0.8
Mass
PLS
Loa
ding
s
Factor 1Factor 2
Slide 82
(C) Crown Copyrig
ht 2010
0 1 2 3 4 5 60
1
2
3
4
5
6
Thickness measured by XPS (nm)
Th
ickn
ess
pre
dic
ted
by
SIM
S (
nm
)
PLS example (1)
• PLS model able to predict thicknesses for t < 6 nm
• PLS regression vector shows Irganox characteristic peaks most correlated with thickness
• Irganox dewets on the surface so initial thickness is proportional to surface coverage!
200 400 600 800 1000 1200-2
0
2
4
6
8
10
12x 10
10
Mass, u
Re
gre
ssio
n V
ect
or
for
Y
231
1176
59
277
4 nm
FM Green et al., Anal. Chem. 2009, 81, 75–79
Slide 83
(C) Crown Copyrig
ht 2010
PLS example (2)
A. J. Urquhart et al., Anal. Chem., 80 (1), 135 -142, 2008
• ToF-SIMS spectra of 576 copolymers are related to their experimental water contact angles (WCA)
• Positive and negative ion spectra are normalised separately, then concatenated (combined) into single data matrix X
Polar speciese.g. CnHmO+ & CN-
hydrophilic
Hydrocarbonse.g. CnHm
+
hydrophobic
Slide 84
(C) Crown Copyrig
ht 2010
Quantification and prediction summary
• PLS is a multivariate linear regression technique
• PLS find factors that best describe the structure of covariancebetween X and Y
• Data preprocessing method needs to be selected with care
• PLS is excellent for calibration and quantification, and for studying the relationship between SIMS data and other measured properties
• Properly validated PLS models can be used for predictions of these properties using SIMS spectra
EXBY +=Responsevariables
Data matrix Regressionmatrix
Residuals (noise)
Slide 85
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
3. Quantification and prediction4. Classification
• PCA classification• Principal Component Discriminant Function Analysis (PC-DFA)• Partial Least Squares Discriminant Analysis (PLS-DA )
5. Conclusion
Slide 86
(C) Crown Copyrig
ht 2010
Data analysis
SIMS Dataset
SIMS Dataset
How is it related to known properties?
Where are they located?
What chemicalsare on the surface?
Calibration / Quantification
Classification
Identification
Can we predictthese properties?
Which group does it belong to?
What are the differencesbetween groups?
Slide 87
(C) Crown Copyrig
ht 2010
PCA classification
• 16 different single protein films adsorbed on mica
• Excellent classification of proteins using only 2 factors
• Factors consistent with total amino acid composition of various proteins
• 95% confidence limits provide means for identification / classification
M. Wagner & D. G. Castner, Langmuir, 17 (2001) 4649
PC
2 S
core
s (1
9%)
PC1 Scores (53%)
PC
A L
oadi
ngs
Slide 88
(C) Crown Copyrig
ht 2010PCA
Discriminant analysis
Discriminant analysis
1
Describes the difference between the smiley and the heart:
1
2
Describes the spread of the data in order of importance:
Slide 89
(C) Crown Copyrig
ht 2010• PC-DFA = “Principal Component – Discriminant Function Analysis”
• ‘Discriminant functions’ maximizes the Fisher’s ratio between groups
• Used to distinguish strains of bacteria
Example 1 – PC-DFA
( )21
221ratio sFisher'
varvarmeanmean
+−=
J. S. Fletcher et al, Appl. Surf. Sci. 252 (2006) 6869
Slide 90
(C) Crown Copyrig
ht 2010
Example 2 – PLS-DA
Example: Hyperaccumulator plant which stores high levels of Ni in epidermal cells. What is the difference between epidermal and other cells?
Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford
trichome
epidermal
mesophyll
50 100 150 200 250 300 350-0.3
-0.2
-0.1
0
0.1
0.2
0.3
0.4
0.5
0.6
Variable
Load
ings
on
PC
10
(0.5
8%)
2 4 8 12
16 19 22
23
28 30
31 32 34
37
38
39
42 47 50 55
56
57
58
62 63
66 68 70
71 72
73
74
75 76
77 79 82 83
84
85
86
89 91
92
93
94
95
97
99
100 102
104
109
111 112
113
116
121 122 127
130 131
132 134 135
137
140 145
147 148
155 156
157 159 164
166 167
170 172 173
175
177
178 182
184
187 188
194 195
197 203 205 208 212
213
215
218 220 223 225
231 232
235 237 241 243 247
249
250
253
257
259 261 267
273 275
281 284 289 294 297 299 302
309
310 311
317
319 323 327 335
336
337 344 350
Variables/Loadings Plot for data
Decluttered
PC1, 2 and 3 overlay
PCA
PC10 scores and loadings
PC 10 describes differences between epidermal cells and other areas –
but this is not efficient!
Slide 91
(C) Crown Copyrig
ht 2010
Example 2 – PLS-DA
Step 1 – Define epidermal cells
LV2 LV3LV1
Step 2 – PLS-DA factors: The largest co-variancebetween the data and the group assignments (0 or 1)
-0.4
-0.2
0
0.2
0.4
0.6
0.8
1
PLS-DA prediction
Decluttered
50 100 150 200 250 300 350-0.2
-0.15
-0.1
-0.05
0
0.05
0.1
0.15
Variable
Reg
Vec
tor
for
Y 1
3 4 5 9 11 16 17
19 22 23 24 27 31
32 33
37
38
39
40 42
44
45 50 53
54
55
57
60 61 62 65 67
71
72 74
76
77 79 81 84
86 87 89
91
92
93 94
95
96
97 98 103
104
105 108
109
111
113
114
118 119
123 126
129
131
132
135 136
139 141 144
145
147
149
151
154
157
158
163
164 169
170
171
172
173
175
177
179 182 183
184 187
191 195 197 198
199 200 208
209 212
213
214
215
217 219 225
226 230
233
235 238 243 246
249 251 254
255
257 259 263 267
274 276
278 284
288 290 292
298 300 308 314
316 317
325 326
330 333
339 341
344
349 350
Variables/Loadings Plot for data
Regression vector
Step 3 –Regression vector: Combination of peaks which best predicts the differences between epidermal and other cells
Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford
Slide 92
(C) Crown Copyrig
ht 2010
Classification summary
• PCA allows for quick grouping of samples based on their similarities
• PC-DFA and PLS-DA are supervised classification methods – prior knowledge about groups are required
• Properly validated classification models are needed for predictions• There also exists unsupervised clustering methods, e.g. hierarchal cluster
analysis, K-nearest-neighbours, artificial neural networks…..
design
retrieval
analysis dissemination
visualisation
usecreation
organisation
management
chemical information
chemical information
All these (and much, much more) belong to thewider field of chemoinformatics !
Slide 93
(C) Crown Copyrig
ht 2010
Contents
1. Introduction
2. Identification
3. Quantification and prediction4. Classification
5. Conclusion
Slide 94
(C) Crown Copyrig
ht 2010
Data analysis
SIMS Dataset
SIMS Dataset
How is it related to known properties?
Where are they located?
What chemicalsare on the surface?
Calibration / Quantification
Identification
Can we predictthese properties?
PCA, MCR
PLS
Classification
Which group does it belong to?
What are the differencesbetween groups?
PC-DFA, PLS-DA
Slide 95
(C) Crown Copyrig
ht 2010
Conclusion
In this lecture we looked at– Identification, quantification & classification using multivariate analysis– Importance of validation for predictive models– Data preprocessing techniques and their effects
Surface and Interface AnalysisMultivariate Analysis special issues (Volume 41 Issue 2 & 8, Feb/Aug 2009)
http://mvsa.nb.uw.edu/Community website with tutorials, links and softwareDeveloped by Dan Graham and hosted by NESAC/BIO
Surface Analysis: The Principal Techniques 2nd edition , Chapter 10 “The application of multivariate data analysis techniques in surface analysis”
http://www.npl.co.uk/nanoanalysis/chemometrics.html
• Copy of tutorial slides• Draft terminology for ISO18115
MVA vocabulary
Slide 96
(C) Crown Copyrig
ht 2010
Bibliography
General• J. L. S. Lee et al, “The application of multivariate data analysis techniques in surface analysis”, in Surface Analysis: The
Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.• S. Wold, Chemometrics; what do we mean with it, and what do we want from it?, Chemom. Intell. Lab. Syst. 30 (1995) 109• E. R. Malinowski, Factor analysis in Chemistry, John Wiley and Sons (2002)• P. Geladi et al, Multivariate image analysis, John Wiley and Sons (1996)• J. L. S. Lee et al, Quantification and methodology issues in multivariate analysis of ToF-SIMS data for mixed organic systems,
Surf. Interface Anal. 40 (2008) 1• D. J. Graham, NESAC/BIO ToF-SIMS MVA web resource, http://nb.engr.washington.edu/nb-sims-resource/
PCA• D. J. Graham et al, Information from complexity: challenges of ToF-SIMS data interpretation, Appl. Surf. Sci. 252 (2006) 6860• M. R. Keenan et al, Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images, Surf. Interface
Anal. 36 (2004) 203• M. R. Keenan et al, Mitigating dead-time effects during multivariate analysis of ToF-SIMS spectral images, Surf. Interface Anal.
40 (2008) 97
MCR• N. B. Gallagher et al, Curve resolution for multivariate images with applications to TOF-SIMS and Raman, Chemom. Intell.
Lab. Syst. 73 (2004) 105• J. A. Ohlhausen et al, Multivariate statistical analysis of time-of-flight secondary ion mass spectrometry using AXSIA, Appl.
Surf. Sci. 231-232 (2004) 230• R. Tauler, A. de Juan, MCR-ALS Graphic User Friendly Interface, http://www.ub.edu/mcr/
PLS• P. Geladi et al, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1• A. M. C. Davies et al, Back to basics: observing PLS, Spectroscopy Europe 17 (2005) 28