
A Guide to the Practical Use of Multivariate Analysis in SIMS

© Crown Copyright 2010

Joanna Lee, Ian Gilmore

National Physical Laboratory, Teddington, UK

Email: [email protected]

Web: http://www.npl.co.uk/nanoanalysis


Contents

1. Introduction

• What is multivariate analysis?

• Some matrix algebra…

2. Identification

• Principal component analysis (PCA)

• Multivariate curve resolution (MCR)

3. Quantification and prediction

• Partial least squares regression (PLS)

4. Classification

• PCA classification

• Principal Component Discriminant Function Analysis (PC-DFA)

• Partial Least Squares Discriminant Analysis (PLS-DA)

5. Conclusion

Why are we here?

[Figure: number of publications (0 to 120) against year of publication (1990 to 2010) for PCA, MCR, PLS, DFA and ANNs]

Data analysis

SIMS Dataset

• Identification: What chemicals are on the surface? Where are they located?
• Calibration / Quantification: How is it related to known properties? Can we predict these properties?
• Classification: Which group does it belong to? What are the differences between groups?


Chemometrics

Chemometrics is the science of relating measurements made on a chemical system to the state of the system via application of mathematical or statistical methods

From A. M. C. Davies, Spectroscopy Europe 10 (1998) 28

[Diagram: chemometrics relates SIMS data and statistical results, together with knowledge of surface chemistry and instrumental influences; statistical methods include, e.g., multivariate analysis]

Multivariate analysis

• Multivariate = more than 1 variable
• Multivariate analysis is the statistical study of the dependence (covariance) between different variables
• Variables are numerical values that we can measure on a sample

Example 1: A sample of people
Variables: height, weight, shoe size, days since last haircut…

Example 2: A sample of weather
Variables: temperature, humidity, wind speed, visibility, UV index…

Example 3: A sample of SIMS spectra
Variables: intensity of Si+ peak, intensity of O+ peak, intensity of C3H5+ peak…

Key points
• Many surface analytical techniques, including SIMS and XPS, give data that are multivariate in nature
• Finding correlations in the data is the key to multivariate analysis!


Why use multivariate analysis?

• Modern ToF-SIMS instruments generate huge, multivariate data sets

• Manual analysis involves selecting a sub-set of most interesting features for analysis by eye

• Multivariate analysis involves simultaneous statistical analysis of all the variables

• Multivariate analysis can summarise the data with a large number of variables, using a much smaller number of “factors”

72 ion images (out of > 400!)


Advantages and disadvantages

• Advantages

– Fast and efficient on modern computers

– Uses all information available

– Improves signal to noise ratio

– Statistically valid, removes potential bias

• Disadvantages

– Lots of different methods, procedures, terminologies

– Can be difficult to understand and interpret

Why use multivariate analysis?

[Figure: peaks or information bins (10^0 to 10^4) against counts per peak (10^0 to 10^4), both on logarithmic scales, with one region labelled "Multivariate methods such as PCA, PLS useful" and another labelled "Ordinary “× ÷ + −” jolly good here"]


Data matrix

$$
\mathbf{X} =
\begin{pmatrix}
9 & 32 & 10 & 1 & 21 \\
18 & 20 & 22 & 4 & 12 \\
24 & 12 & 30 & 6 & 6
\end{pmatrix}
$$

Rows are samples (Samples 1 to 3); columns are variables (masses 1 to 5).

X has 3 rows and 5 columns → 3 × 5 data matrix

Each row (spectrum) is represented by a vector

[Figure: mass spectra (intensity against mass, masses 1 to 5) of Samples 1 to 3]

A matrix is simply a rectangular table of numbers!

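Matrix algebra

Matrix addition

$$\mathbf{A} + \mathbf{B} = \mathbf{C}, \qquad (I \times K) + (I \times K) = (I \times K)$$

• A and B must be the same size
• Each corresponding element is added

$$
\begin{pmatrix} 2 & 4 & 1 \\ 3 & 8 & 6 \end{pmatrix}
+
\begin{pmatrix} -1 & 2 & 0 \\ 0 & 1 & -2 \end{pmatrix}
=
\begin{pmatrix} 1 & 6 & 1 \\ 3 & 9 & 4 \end{pmatrix}
$$

(e.g. ‘pure spectra’ + noise = experimental data)

Matrix multiplication

$$\mathbf{A} \times \mathbf{B} = \mathbf{C}, \qquad (I \times N) \times (N \times K) = (I \times K)$$

• The number of columns of A must equal the number of rows of B
• Row i of A times column j of B gives the element in row i and column j of the product matrix AB

$$
\begin{pmatrix} 1 & 4 \\ 2 & 2 \\ 4 & 2 \end{pmatrix}
\times
\begin{pmatrix} 1 & 2 \\ 3 & 2 \end{pmatrix}
=
\begin{pmatrix}
1 \times 1 + 4 \times 3 & 1 \times 2 + 4 \times 2 \\
2 \times 1 + 2 \times 3 & 2 \times 2 + 2 \times 2 \\
4 \times 1 + 2 \times 3 & 4 \times 2 + 2 \times 2
\end{pmatrix}
=
\begin{pmatrix} 13 & 10 \\ 8 & 8 \\ 10 & 12 \end{pmatrix}
$$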
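The two worked examples on this slide can be checked numerically. The following is a minimal sketch in Python with NumPy (chosen here purely for illustration; the calculations in this guide were done in MATLAB). The variable names are arbitrary.

```python
import numpy as np

# Matrix addition: A and B must be the same size (here 2 x 3).
A = np.array([[2, 4, 1],
              [3, 8, 6]])
B = np.array([[-1, 2, 0],
              [0, 1, -2]])
C = A + B                  # each corresponding element is added

# Matrix multiplication: the number of columns of A2 (2) equals the number of rows of B2 (2).
A2 = np.array([[1, 4],
               [2, 2],
               [4, 2]])    # 3 x 2
B2 = np.array([[1, 2],
               [3, 2]])    # 2 x 2
C2 = A2 @ B2               # 3 x 2; element (i, j) is row i of A2 times column j of B2

print(C)
print(C2)
```

Note that `A2 @ B2` is defined but `B2 @ A2` is not, because the inner dimensions would no longer match.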

Matrix algebra and SIMS

[Figure: mass spectra (intensity against mass, masses 1 to 5) of Samples 1 to 3, of Chemical 1 and of Chemical 2]

Data matrix = Sample composition × Chemical spectra

(samples × variables [mass]) = (samples × chemicals) × (chemicals × variables [mass])

$$
\begin{pmatrix}
9 & 32 & 10 & 1 & 21 \\
18 & 20 & 22 & 4 & 12 \\
24 & 12 & 30 & 6 & 6
\end{pmatrix}
=
\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}
\times
\begin{pmatrix}
1 & 6 & 1 & 0 & 4 \\
4 & 2 & 5 & 1 & 1
\end{pmatrix}
$$

In the sample composition matrix the rows are Samples 1 to 3 and the columns are Chemical 1 and Chemical 2.
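As a quick check of the relation "data matrix = sample composition × chemical spectra", the sketch below multiplies the two small matrices from this slide. It is an illustration in Python with NumPy (an assumption of this example, not the tool used in the guide), and the variable names are arbitrary.

```python
import numpy as np

# Sample composition: rows = Samples 1 to 3, columns = Chemical 1, Chemical 2.
composition = np.array([[5, 1],
                        [2, 4],
                        [0, 6]])

# Chemical spectra: rows = Chemical 1, Chemical 2; columns = masses 1 to 5.
spectra = np.array([[1, 6, 1, 0, 4],
                    [4, 2, 5, 1, 1]])

# Each sample spectrum is a linear combination of the chemical spectra.
X = composition @ spectra
print(X)
```

For example, the spectrum of Sample 3 is simply 6 times the spectrum of Chemical 2, because Sample 3 contains no Chemical 1.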

Matrix algebra and SIMS

1. Each spectrum can be represented by a vector
2. Instead of x, y, z in 3D real space, the axes are mass1, mass2, mass3… etc. in variable space (also ‘data space’)
3. Assuming the data are a linear combination of chemical spectra, we can write the data matrix as a product of two matrices
4. There are an infinite number of possible solutions!

$$
\begin{pmatrix}
9 & 32 & 10 & 1 & 21 \\
18 & 20 & 22 & 4 & 12 \\
24 & 12 & 30 & 6 & 6
\end{pmatrix}
=
\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}
\times
\begin{pmatrix}
1 & 6 & 1 & 0 & 4 \\
4 & 2 & 5 & 1 & 1
\end{pmatrix}
$$

(samples × variables [mass]) = (samples × chemicals) × (chemicals × variables [mass])

Factor analysis

We can describe the data as a linear combination of spectra, by writing the data matrix as a product of two matrices:

• One containing the spectra (“loadings”)
• One containing the contributions (“scores”)

Data matrix = “Scores” × “Loadings”

(samples × variables [mass]) = (samples × factors) × (factors × variables [mass])

$$
\begin{pmatrix}
9 & 32 & 10 & 1 & 21 \\
18 & 20 & 22 & 4 & 12 \\
24 & 12 & 30 & 6 & 6
\end{pmatrix}
\approx
\begin{pmatrix} 3.6 & 1.9 \\ 3.7 & -0.2 \\ 3.8 & -1.6 \end{pmatrix}
\times
\begin{pmatrix}
4.7 & 5.7 & 5.7 & 1.0 & 3.5 \\
-3.9 & 5.9 & -5.3 & -1.3 & 4.4
\end{pmatrix}
$$

(scores and loadings shown to one decimal place)

This is the basis of factor analysis!
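The statement that there are infinitely many ways to write the data matrix as a product of two matrices can be illustrated with a small sketch. It assumes Python with NumPy and uses an arbitrary invertible matrix `R`, which is a hypothetical choice made only for this example: if X = T P′, then (T R)(R⁻¹ P′) reproduces X equally well.

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)

# One valid factorisation: sample composition ("scores") times chemical spectra ("loadings").
T = np.array([[5, 1], [2, 4], [0, 6]], dtype=float)
P_t = np.array([[1, 6, 1, 0, 4],
                [4, 2, 5, 1, 1]], dtype=float)       # this is P' (factors x masses)

# Any invertible 2 x 2 matrix R gives another valid factorisation.
R = np.array([[2.0, 1.0],
              [0.5, 3.0]])                           # arbitrary, for illustration only
T_new = T @ R
P_t_new = np.linalg.inv(R) @ P_t

print(np.allclose(T @ P_t, X))            # original factorisation reproduces X
print(np.allclose(T_new @ P_t_new, X))    # so does the transformed one
```

The different factor analysis methods differ in the extra conditions they impose to pick one solution out of this infinite family.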


Terminology

In order to clarify existing terminology and emphasise the relationship between the different multivariate techniques, we are going to adopt the following terminology in this lecture

ISO FDIS 18115-1 Surface chemical analysis – Vocabulary –Part 1: General terms and terms for the spectroscopies

| Terms here | Symbol | Definition | PCA | MCR | PLS |
| --- | --- | --- | --- | --- | --- |
| Factor | - | An axis in the data space of a factor analysis model, representing an underlying dimension that contributes to summarising or accounting for the original data set | Principal Component | Pure Component | Latent Vectors, Latent Variables |
| Loadings | P | Projection of a factor onto the variables | Loadings, Eigenvector | Pure Component Spectrum | Loadings |
| Scores | T | Projection of the samples onto the factors | Scores, Projections | Pure Component Concentration | Scores |

Principal component analysis (PCA)

• Factors are directions in the data space chosen such that they reflect interesting properties of the dataset

• Equivalent to a rotation in data space – factors are new axes

• Data described by their projections onto the factors

[Figure: data points in the (m1, m2) plane with PCA Factor 1 and PCA Factor 2 drawn as new axes]


• The projections of the PCA factors onto the original variables (m1, m2) are the ‘loadings’

• The projections of the samples (stars) onto the PCA factors are the ‘scores’

• The data are fully described by D factors, where D is the ‘dimensionality of the data’ (number of samples or variables, whichever is smaller)

[Figure: samples (stars) in the (m1, m2) plane with PCA Factor 1 and PCA Factor 2]

$$\mathbf{X} = \mathbf{T}\mathbf{P}', \qquad (I \times K) = (I \times D)(D \times K)$$

X = data matrix, T = scores matrix, P = loadings matrix

I = no. of samples
K = no. of mass units
D = dimensionality of data
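The decomposition X = TP′ can be computed with a singular value decomposition (SVD). The sketch below is one common way to do this in Python with NumPy, offered as an illustration rather than as the routine used for this guide; no preprocessing is applied here, and the small 3 × 5 matrix is reused from the matrix algebra slides.

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)   # I = 3 samples, K = 5 masses

# Thin SVD: X = U diag(s) Vt, with D = min(I, K) = 3 factors.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

T = U * s      # scores (I x D): projections of the samples onto the factors
P = Vt.T       # loadings (K x D): projections of the factors onto the variables

print(T.shape, P.shape)                  # (3, 3) (5, 3)
print(np.allclose(X, T @ P.T))           # the D factors describe X fully
print(np.allclose(P.T @ P, np.eye(3)))   # the loadings are orthonormal
```

Conventions for scaling scores and loadings differ between software packages, so values from other tools may differ from these by a constant factor or a sign.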

• PCA extracts orthogonal (uncorrelated) factors that successively capture the largest amount of variance within the data

• The amount of variance described by each factor is called its ‘eigenvalue’

[Figure: data in the (m1, m2) plane; PCA Factor 1 captures 78% of the variance and PCA Factor 2 captures 22%]

$$\mathbf{X} = \mathbf{T}\mathbf{P}', \qquad (I \times K) = (I \times D)(D \times K)$$

D = dimensionality of data


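• By removing higher factors (small variance due to noise) we can reduce the dimensionality of data ⇒ ‘factor compression’

• Often hundreds of variables can be described with just a handful of factors!

[Figure: data in the (m1, m2) plane described by PCA Factor 1 (78%) alone]

$$\mathbf{X} = \mathbf{T}\mathbf{P}' + \mathbf{E}, \qquad (I \times K) = (I \times N)(N \times K) + (I \times K)$$

X = data matrix, E = residuals (noise)

N = no. of PCA factors

PCA reproduced data matrix:

$$\hat{\mathbf{X}} = \mathbf{T}\mathbf{P}', \qquad (I \times K) = (I \times N)(N \times K)$$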
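Factor compression can be sketched by keeping only the first N factors of an SVD and treating the remainder as residuals. The data below are synthetic and purely hypothetical (random "pure spectra" mixed at random and given Poisson noise); Python with NumPy is an assumption of this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed so the sketch is repeatable

# Hypothetical data: 8 mixtures of 3 made-up "pure" spectra over 50 mass channels.
pure = rng.uniform(0, 100, size=(3, 50))
mix = rng.uniform(0, 1, size=(8, 3))
X = rng.poisson(mix @ pure).astype(float)   # add Poisson counting noise

U, s, Vt = np.linalg.svd(X, full_matrices=False)

N = 3                        # number of PCA factors retained
T = U[:, :N] * s[:N]         # scores (I x N)
P = Vt[:N].T                 # loadings (K x N)
X_hat = T @ P.T              # PCA reproduced data matrix
E = X - X_hat                # residuals (noise)

print(X_hat.shape)
print(np.linalg.norm(E) / np.linalg.norm(X))   # fraction of the data left in the residuals
```

Here 50 variables per sample are summarised by 3 scores per sample, with the discarded factors collected in E.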

Number of factors

[Figure: variance captured (eigenvalues, logarithmic scale) against PCA factor (1 to 8) for a data set of 8 spectra from mixing 3 pure compound spectra: (a) no noise; (b) Poisson noise, max 5000 counts]

1. Prior knowledge of system

2. ‘Scree test’: eigenvalue plot levels off in a linearly decreasing manner after 3 factors

3. Percentage of variance captured by Nth PCA factor:

$$\frac{N\text{th eigenvalue}}{\text{sum of all eigenvalues}} \times 100\%$$

4. Percentage of total variance captured by first N PCA factors:

$$\frac{\text{sum of eigenvalues up to } N}{\text{sum of all eigenvalues}} \times 100\%$$
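The two percentages above are straightforward to compute once the eigenvalues are known. A minimal sketch in Python with NumPy follows; the function name `variance_captured` and the eigenvalues are hypothetical, chosen only to illustrate the arithmetic.

```python
import numpy as np

def variance_captured(eigenvalues):
    """Return (percentage per factor, cumulative percentage) from PCA eigenvalues."""
    ev = np.asarray(eigenvalues, dtype=float)
    percent = 100.0 * ev / ev.sum()       # Nth eigenvalue / sum of all eigenvalues
    return percent, np.cumsum(percent)    # running sum gives the first-N total

# Hypothetical eigenvalues, sorted largest first.
percent, cumulative = variance_captured([60.0, 30.0, 9.0, 1.0])
print(percent)       # percentage captured by each factor
print(cumulative)    # percentage captured by the first N factors
```

With these made-up eigenvalues the first two factors together capture 90% of the variance.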


PCA walkthrough

Eight library spectra: PS 2480, PS 3550, PMMA 2170, PMMA 2500, PEG 1470, PEG 4250, PPG 425, PPG 1000

Unit mass binned and mean centered prior to analysis

Calculation using MATLAB with PLS Toolbox 4.0

Perform scree test using the log eigenvalue plot

[Figures: log eigenvalue plot; first PCA factor (PC1); second PCA factor (PC2); third PCA factor (PC3); biplot of PCA factor 1 against PCA factor 2]

Using PCA we have effectively reduced 300 correlated variables (mass units) to 3 independent variables (factors) by which all the samples can be characterised.

Data preprocessing

Data preprocessing is the manipulation of data prior to data analysis…

• Enhances PCA by bringing out important variance in dataset
• Makes assumptions about the nature of variance in data
• Can distort interpretation and quantification
• Includes:
  – Peak selection and binning
  – Centering
    • Mean centering
  – Scaling
    • Normalisation
    • Variance scaling
    • Poisson scaling
    • Binomial scaling

Peak selection and binning

• Manual selection: peaks of interest only; unexpected features lost
• Auto peak search: all peaks of interest included? What threshold to use?
• Unit mass binning: straightforward to use but detailed information lost
• 0.5 u binning *: separates organic from inorganics

A Henderson et al., Surf. Interface Anal. 41 (2009) 666-674

Important considerations

• What information are we putting into PCA? What is included? What is omitted?

• Do we need to apply further processing, e.g. dead time correction?

Mean centering

• Subtract mean spectrum from each sample
• PCA describes variations from the mean

$$\tilde{X}_{ik} = X_{ik} - \operatorname{mean}(X_{:k})$$

[Figure: data in (m1, m2, m3) space with Factor 1 and Factor 2. Raw data: 1st factor goes from origin to centre of gravity of data. Mean centering: 1st factor goes from origin and accounts for the highest variance.]
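Mean centering amounts to subtracting the column means of the data matrix. The sketch below shows this in Python with NumPy (an assumption of this illustration), reusing the small 3 × 5 data matrix from the matrix algebra slides.

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)   # rows = samples, columns = masses

mean_spectrum = X.mean(axis=0)     # mean of each variable over all samples
X_centered = X - mean_spectrum     # subtract the mean spectrum from each sample

print(mean_spectrum)
print(X_centered.mean(axis=0))     # every column now averages to zero
```

After centering, a positive entry means "above the average spectrum at this mass" and a negative entry means "below it".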

Normalisation

• Divide each spectrum by a constant for each sample e.g. intensity of a specific ion, total ion intensity
• Assumes chemical variances can be described by relative changes in ion intensities
• Preserves the shape of spectra
• Reduces effects of topography, sample charging, changes in primary ion current

$$\tilde{X}_{ik} = \frac{1}{\operatorname{sum}(X_{i:})} \times X_{ik}$$

[Figure: spectra (mass 0 to 150 u) before and after normalisation; normalised by total ion intensity]
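Normalisation by total ion intensity divides every row of the data matrix by its own sum. A minimal sketch in Python with NumPy (an assumption of this illustration), again using the small 3 × 5 example matrix:

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)   # rows = samples, columns = masses

total_ion = X.sum(axis=1, keepdims=True)   # total ion intensity of each spectrum
X_norm = X / total_ion                     # each spectrum now sums to 1

# Alternative: normalise to the intensity of one specific ion (here the third mass channel).
X_norm_ion = X / X[:, [2]]

print(total_ion.ravel())
print(X_norm.sum(axis=1))
```

Both versions keep the shape of each spectrum and change only its overall scale.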

Variance scaling

• Divide each variable by its standard deviation in the dataset
• Equalises importance of each variable (i.e. mass)
• Problematic for weak peaks – usually used with peak selection
• Called ‘auto scaling’ if combined with mean centering

$$\tilde{X}_{ik} = \frac{1}{\sqrt{\operatorname{var}(X_{:k})}} \times X_{ik}$$

[Figure: variance and mean of each variable (mass, in SIMS spectrum) for raw data, mean centering, variance scaling and auto scaling. Diagram from P. Geladi and B. Kowalski, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta, 185 (1986) 1]
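Variance scaling and auto scaling can be sketched as below in Python with NumPy (an assumption of this illustration). The use of the sample standard deviation (`ddof=1`) is also an assumption; software packages may use a different convention.

```python
import numpy as np

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)   # rows = samples, columns = masses

std = X.std(axis=0, ddof=1)                # standard deviation of each variable (mass)

X_var_scaled = X / std                     # variance scaling
X_auto = (X - X.mean(axis=0)) / std        # auto scaling = mean centering + variance scaling

print(X_var_scaled.std(axis=0, ddof=1))    # every variable now has unit standard deviation
print(X_auto.mean(axis=0))                 # and, after auto scaling, zero mean
```

A variable that never changes has zero standard deviation and cannot be scaled this way, and a weak peak that varies only through noise is inflated to the same importance as a strong one, which is why this scaling is usually combined with peak selection.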

Poisson and binomial scaling

• SIMS data is dominated by Poisson counting noise – the noise variance of a peak is proportional to its intensity
• The noise becomes binomial for saturated data with dead time correction
• Divide data by the square root of the estimated noise variance of each data point
• Emphasises weak peaks which vary above the expected counting noise, over intense peaks varying solely due to counting statistics
• Provides better noise rejection in PCA

[Figure: PCA results. No preprocessing: no. of factors? Poisson scaling: 4 factors needed]

M R Keenan et al., Surf. Interface Anal., 36 (2004) 203

M R Keenan et al., Surf. Interface Anal., 40 (2008) 97
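One widely used form of Poisson scaling divides each data point by the square root of the product of its row mean and column mean. The sketch below assumes that form, which is my reading of the approach in the Keenan papers cited on this slide rather than something stated here; the function name `poisson_scale` is hypothetical, and Python with NumPy is an assumption of this illustration.

```python
import numpy as np

def poisson_scale(X):
    """Scale counts by the square roots of the row and column means (assumed form)."""
    X = np.asarray(X, dtype=float)
    row_mean = X.mean(axis=1, keepdims=True)   # mean count of each spectrum or pixel
    col_mean = X.mean(axis=0, keepdims=True)   # mean count of each mass channel
    # row_mean * col_mean is proportional to a rank-1 estimate of the expected count,
    # so its square root estimates the Poisson noise standard deviation up to a constant.
    return X / np.sqrt(row_mean * col_mean)

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)
X_scaled = poisson_scale(X)
print(np.round(X_scaled, 3))
```

Rows or columns whose mean is zero must be removed first, otherwise the division is undefined.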


PCA example (1)

• Three protein compositions (100% fibrinogen, 50% fibrinogen / 50% albumin, 100% albumin) adsorbed onto poly(DTB suberate)

• Loadings on first factor (PC1) show relative abundance of amino acid peaks of the two proteins

• Scores on PC1 separate samples based on protein composition

[Figure: PC1 scores (62%) and PC1 loadings (62%), with labels Fib and Alb]

D.J. Graham et al., Appl. Surf. Sci., 252 (2006) 6860

PCA example (2)

• SIMS spectra acquired for antiferritin with or without trehalose coating

• Largest variance (PC 1) arises from sample heterogeneity

• PC 2 distinguishes samples protected by trehalose – higher intensities of polar and hydrophilic amino acid fragments

• Trehalose preserves protein conformation in UHV

N. Xia et al, Langmuir, 18 (2002) 4090


PCA example (3)

• 16 different single protein films adsorbed on mica

• Excellent classification of proteins using only 2 factors

• Loadings consistent with total amino acid composition of various proteins

• 95% confidence limits provide means for identification / classification

M. Wagner & D. G. Castner, Langmuir, 17 (2001) 4649

[Figure: PC2 scores (19%) against PC1 scores (53%), and PCA loadings]

PCA image analysis

[Figure: datacube with I rows, J columns and K mass peaks; the pixels of a 3 × 3 raster are numbered 1 to 9]

• ‘Datacube’ contains a raster of I x J pixels and K mass peaks

• The datacube is rearranged into 2D data matrix with dimensions [(I × J) × K] prior to PCA – ‘unfolding’

• PCA results are folded to form scores images prior to interpretation
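Unfolding and folding are simple reshaping operations. The sketch below uses a small placeholder datacube in Python with NumPy (an assumption of this illustration); the sizes, and the stand-in for the scores matrix, are hypothetical.

```python
import numpy as np

I, J, K = 3, 3, 4      # hypothetical raster of 3 x 3 pixels with 4 mass peaks
cube = np.arange(I * J * K, dtype=float).reshape(I, J, K)   # placeholder datacube

# Unfold: one row (spectrum) per pixel, giving an (I*J) x K data matrix.
X = cube.reshape(I * J, K)

# PCA on X would give a scores matrix T with one row per pixel.
# A stand-in with N = 2 columns is used here so the sketch stays self-contained.
N = 2
T = X[:, :N]

# Fold: one I x J scores image per factor.
scores_images = T.reshape(I, J, N)

print(X.shape, scores_images.shape)
```

The same pixel ordering must be used for unfolding and folding, otherwise the scores images are scrambled.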

[Figure: a 3 × 3 raster of pixels numbered 1 to 9 is unfolded into a single column of pixels 1 to 9]

PCA image example (1)

Immiscible PC / PVC polymer blend, 42 counts per pixel on average

[Figure: total ion image, and plots of log(eigenvalues) against sorted eigenvector index (1 to 20) after mean centering, normalisation and Poisson scaling]

Only 2 factors needed – dimensionality of image reduced by a factor of 20!

J L S Lee, I S Gilmore, in Surface Analysis: The Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.

PCA results after Poisson scaling and mean centering

[Figure: PC1 loadings against mass (0 to 40 u) and PC1 scores image]

1st factor distinguishes PVC and PC phases

[Figure: PC2 loadings against mass (0 to 40 u) and PC2 scores image, with peaks labelled 35Cl + 37Cl and O + OH]

2nd factor shows detector saturation for intense 35Cl peak

J L S Lee, I S Gilmore, in Surface Analysis: The Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.



Hair fibre with multi-component pretreatment

[Figure: total ion image (scale bar 50 µm) and total spectrum against mass, u]

[Figure: PCA scores images and PCA loadings against mass (arb. scale) for Factors 1 to 5, with characteristic peaks labelled A to E]

PCA factors are abstract combinations of chemical components and optimally describe variance – PCA results can be difficult to interpret!

Image courtesy of Dr Ian Fletcher, Intertek MSG

J L S Lee et al., Surf. Interface Anal. 2009, 41, 653-665

PCA summary

• PCA describes the original data using factors, consisting of loadings and scores, which efficiently account for variance in the data

• Eigenvalues give the variance captured by the corresponding factors

• Data preprocessing method needs to be selected with care

• PCA is excellent for discrimination and classification based on differences in spectra, and for identifying important mass peaks

• PCA factors optimally describe variance – PCA results may be difficult to interpret

$$\mathbf{X} = \mathbf{T}\mathbf{P}' + \mathbf{E}$$

X = data matrix
T = projection of samples onto factors (scores matrix)
P = projection of factors onto variables (loadings matrix)
E = residuals (noise)

I = no. of samples
K = no. of mass units
N = no. of factors


Multivariate curve resolution (MCR)

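• PCA factors are directions that describe variance
  – positive and negative peaks in the loadings
  – can be difficult to interpret

• What if we want to resolve original chemical spectra and reverse the following process?

$$
\begin{pmatrix}
9 & 32 & 10 & 1 & 21 \\
18 & 20 & 22 & 4 & 12 \\
24 & 12 & 30 & 6 & 6
\end{pmatrix}
=
\begin{pmatrix} 5 & 1 \\ 2 & 4 \\ 0 & 6 \end{pmatrix}
\times
\begin{pmatrix}
1 & 6 & 1 & 0 & 4 \\
4 & 2 & 5 & 1 & 1
\end{pmatrix}
$$

(samples × variables [mass]) = (samples × chemicals) × (chemicals × variables [mass])

• Try multivariate curve resolution (MCR)!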


MCR is designed for recovery of chemical spectra and contributions from a multi-component mixture, when little or no prior information about the composition is available

MCR assumes linear combination of chemical spectra (loadings) and contributions (scores) – only an approximation in SIMS

$$\mathbf{X} = \mathbf{T}\mathbf{P}' + \mathbf{E}, \qquad (I \times K) = (I \times N)(N \times K) + (I \times K)$$

X = data matrix
T = scores matrix
P = projection of factors onto variables (loadings matrix)
E = residuals (noise)

[Figure: data in the (m1, m2) plane with MCR Factor 1 and MCR Factor 2]


MCR uses an iterative least-squares algorithm to extract solutions, while applying suitable constraints

With non-negativity constraint, MCR factors resemble SIMS spectra and chemical contributions more directly, as these must be positive

[Figure: data in the (m1, m2) plane with MCR Factor 1 and MCR Factor 2]

$$\mathbf{X} = \mathbf{T}\mathbf{P}' + \mathbf{E}, \qquad (I \times K) = (I \times N)(N \times K) + (I \times K)$$
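The idea of alternating least squares with a non-negativity constraint can be sketched in a few lines. This is a deliberately simplified illustration in Python with NumPy, not the algorithm of any particular MCR toolbox: negative values are simply clipped to zero after each least-squares step, which is a crude stand-in for a proper non-negative least-squares solver. The function name `mcr_als` and the choice of initial estimate are assumptions made for this example.

```python
import numpy as np

def mcr_als(X, P_init, n_iter=50, tol=1e-10):
    """Alternating least squares with non-negativity imposed by clipping.

    X is (I x K); P_init is an initial estimate of P' (N x K).
    Returns scores T (I x N) and loadings P' (N x K).
    """
    P_t = np.asarray(P_init, dtype=float).copy()
    previous = np.inf
    for _ in range(n_iter):
        # Solve X ~ T P' for T with P' fixed, then set negative scores to zero.
        T = np.clip(np.linalg.lstsq(P_t.T, X.T, rcond=None)[0].T, 0, None)
        # Solve X ~ T P' for P' with T fixed, then set negative loadings to zero.
        P_t = np.clip(np.linalg.lstsq(T, X, rcond=None)[0], 0, None)
        residual = np.linalg.norm(X - T @ P_t)
        if abs(previous - residual) < tol:     # convergence criterion
            break
        previous = residual
    return T, P_t

X = np.array([[9, 32, 10, 1, 21],
              [18, 20, 22, 4, 12],
              [24, 12, 30, 6, 6]], dtype=float)

# Initial estimate of the loadings: the spectra of Samples 1 and 3.
T, P_t = mcr_als(X, X[[0, 2]])
print(np.round(T, 3))
print(np.linalg.norm(X - T @ P_t))
```

With this starting point the fit is exact and non-negative, yet the first loading is the Sample 1 spectrum, which is a mixture rather than a pure chemical spectrum. A solution that fits the data perfectly is therefore not necessarily the chemically correct one.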


Outline of MCR

1. Raw data: data matrix X
2. PCA: number of factors, and reproduced data matrix X
   • Noise filtered data
   • Ensures MCR solution is robust
3. Initial estimates of T or P
   • Random initialisation
   • PCA loadings or scores
   • Varimax rotated PCA loadings or scores
   • Pure variable detection algorithm e.g. SIMPLISMA
4. MCR alternating-least-squares optimisation, X = TP′ + E
   • Constraints: non-negativity, equality
   • Convergence criterion
5. MCR scores T and MCR loadings P

Rotational ambiguity

• MCR solutions are not unique!

• Accuracy of resolved spectra depends on the existence of pixels or samples where there is only contribution from one chemical component (‘selectivity’)

• Good initial estimates, suitable data preprocessing and correct number of factors are essential

[Figure: two (m1, m2) plots of the same data with different directions for MCR Factor 1 and MCR Factor 2]


MCR image example (1)

MCR calculations using Matlab with MCR-ALS toolbox, freely available from http://www.mcrals.info/

Simple PVC / PC polymer blend

MCR scores (‘pure component concentration’) – these will be ‘folded’ to form projection images

MCR loadings (‘pure component spectra’)

[Figure: loadings on MCR factor 1 and MCR factor 2 against mass (0 to 40 u), and scores images on MCR factor 1 and MCR factor 2; the two factors are labelled PVC and PC]

• MCR extracts two distinctive factors, corresponding to PVC and PC respectively

• Straightforward interpretation



[Figure: total ion image (scale bar 50 µm) and total spectrum against mass, u]

[Figure: MCR scores images and MCR loadings against mass (arb. scale) for Factors 1 to 5, with characteristic peaks labelled A to E]

MCR loadings resemble SIMS spectra (characteristic peaks A-E) and fragments, and scores directly reveal spatial distributions!

Image courtesy of Dr Ian Fletcher, Intertek MSG

• We take three pictures and assign each a SIMS spectrum (PBC, PC, PVT)

• The pictures are combined to form a multivariate image dataset

• Poisson noise is added to the image (avg ~50 counts per pixel)

[Figure: PCA Scores 1 to 3 and MCR Scores 1 to 3]

MCR resolves the original images unambiguously!

MCR spectra example

• ToF-SIMS depth profiling of copper film grown on TaN coated silicon wafer

• Manual analysis is difficult, e.g. Si- can arise from SiOx-, SiN- or silicon substrate

• MCR resolves 8 factors. Loadings resemble SIMS spectra of individual phases and scores resemble their contribution to the depth profile

• Improves signal to noise and correlation of related peaks

[Figure: scores for all 8 MCR factors; scores and loadings for 3 of the MCR factors]

K G Lloyd, J. Vac. Sci. Technol. A 25 (2007) 878

MCR summary

• MCR describes the original data using factors, consisting of loadings and scores, which resemble chemical spectra and contributions from a multi-component mixture, respectively

• MCR uses an iterative algorithm to extract solutions, while applying suitable constraints e.g. non-negativity

• Good initial estimates and suitable data preprocessing are essential

• MCR is excellent for identification and localisation of chemicals in complex mixtures and allows for direct interpretation

$$\mathbf{X} = \mathbf{T}\mathbf{P}' + \mathbf{E}$$

X = data matrix
T = scores matrix
P = projection of variables onto factors (loadings matrix)
E = residuals (noise)

Identification summary

| Attributes | Manual analysis | PCA | MCR |
| --- | --- | --- | --- |
| Most suitable for | Simple dataset with good prior knowledge | Discrimination of similar chemical phases | Identification for unknown mixtures |
| Chemical identification | Difficult – Characteristic peaks only | Medium – Important peaks and correlation | Easy – Full spectra obtained |
| Detection of minor components | Difficult – Only if substance is known | Easy – Higher factors capture small variance | Difficult ? – Possibly depends on system studied |
| Ease of interpretation | Easy – Single ion images | Medium / Difficult – Abstract, orthogonal factors | Easy – Non-negative scores and loadings |


Regression analysis

We use regression analysis to find a predictive relationship between two variables

$$y = b x + e$$

• y: ‘response’ variable
• x: ‘predictor’ variable
• b: regression coefficient
• e: residual

[Figure: response variable y (0 to 30) against predictor variable x (0 to 30), with fitted line y = 0.6944*x]
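For a line through the origin, the least-squares estimate of the regression coefficient has a closed form, b = Σxy / Σx². The sketch below applies it to made-up measurements in Python with NumPy; both the data and the choice of language are assumptions of this illustration.

```python
import numpy as np

# Hypothetical measurements of a predictor x and a response y.
x = np.array([5.0, 10.0, 15.0, 20.0, 25.0])
y = np.array([3.0, 8.0, 10.0, 14.0, 17.0])

# Least-squares estimate of b in y = b*x + e (line through the origin).
b = (x @ y) / (x @ x)
e = y - b * x            # residuals: the part of y not explained by x

print(b)
print(e)
```

Once b is known, the response of a new sample is predicted as b times its measured x.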


Multivariate regression


[Figure: mass spectra (intensity against mass, masses 1 to 5) of Samples 1 to 3]

Measured properties

| | Molecular weight | Reaction time | Solution concentration |
| --- | --- | --- | --- |
| Sample 1 | 5 | 3 | 1 |
| Sample 2 | 2 | 7 | 4 |
| Sample 3 | 1 | 4 | 6 |

Can we predict the properties of similar materials from their SIMS spectra?

$$y = f(\mathbf{x}) + e = b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_m x_m + e$$

• y: ‘response’ variable, i.e. measured property
• x_m: ‘predictor’ variable, i.e. intensity at mass m
• b_m: regression coefficient

Multivariate regression

Extending to I samples and M response variables

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad (I \times M) = (I \times K)(K \times M) + (I \times M)$$

Y = response variables
X = SIMS data matrix
B = regression matrix
E = residuals (noise)

I = no. of samples
K = no. of mass units
M = no. of response variables

1. We can calculate B to gain an understanding of the covariance relationship between X and Y, e.g. relating SIMS spectra with sample preparation parameters

2. B can be applied to future samples in order to predict Y using only measurements of X, e.g. quantifying surface composition or coverage of samples using only their SIMS spectra

Partial least squares regression (PLS)

• Partial least squares regression (PLS) is a multivariate regression method for data X containing a large number of strongly correlated variables

• PLS finds factors (called ‘latent variables’) that successively account for the largest covariance between X and Y

– Removes redundant information from the regression, i.e. information describing X that has no correlation with Y

– Higher PLS factors that describe little covariance between X and Y can be discarded

• The regression vectors B are a linear combination of PLS loadings that best predict Y from X

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad (I \times M) = (I \times K)(K \times M) + (I \times M)$$

Important to determine the number of factors to include in PLS!
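To make the idea of successive factors concrete, the sketch below implements PLS for a single response variable (often called PLS1) using NIPALS-style deflation. It is an illustration in Python with NumPy under stated assumptions, not the PLS Toolbox routine used for the examples in this guide; the function name `pls1` and the synthetic data are hypothetical.

```python
import numpy as np

def pls1(X, y, n_factors):
    """PLS regression for one response (PLS1) with NIPALS-style deflation.

    X (I x K) and y (I,) must already be mean centered.
    Returns the regression vector b (K,) such that y is approximately X b.
    """
    X = np.asarray(X, dtype=float).copy()
    y = np.asarray(y, dtype=float).copy()
    W, P, q = [], [], []
    for _ in range(n_factors):
        w = X.T @ y                  # direction in X with largest covariance with y
        w /= np.linalg.norm(w)
        t = X @ w                    # scores on this factor
        tt = t @ t
        p = X.T @ t / tt             # X loadings
        q_a = y @ t / tt             # y loading
        X -= np.outer(t, p)          # remove this factor from X
        y -= q_a * t                 # and from y
        W.append(w)
        P.append(p)
        q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    return W @ np.linalg.solve(P.T @ W, q)

# Synthetic example: 20 samples, 3 variables, known coefficients plus a little noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 0.01 * rng.normal(size=20)
Xc, yc = X - X.mean(axis=0), y - y.mean()

b_pls = pls1(Xc, yc, n_factors=3)
b_ols = np.linalg.lstsq(Xc, yc, rcond=None)[0]
print(b_pls, b_ols)
```

When all available factors are kept, as here, PLS gives the same regression vector as ordinary least squares. The benefit of PLS comes from keeping fewer factors when X has many strongly correlated variables.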

Calibration, validation, prediction

Calibration

Fit a PLS model to a calibration data set with known X and Y

Use cross-validation to determine number of factors

Validation

Apply model to an independent validation data set with known X and Y, and calculate error between predicted Y and measured Y

Use validation to determine prediction error

Prediction

Apply model to future samples to predict Y using only measurements of X

$$\mathbf{Y} = \mathbf{X}\mathbf{B} + \mathbf{E}, \qquad (I \times M) = (I \times K)(K \times M) + (I \times M)$$

Number of factors – cross validation

• PLS can be used to build predictive models (calibration)
• Cross-validation can be used to determine the number of factors and guard against over-fitting

[Figure: response variable Y against predictor variable X for the same data, fitted with a 1st order polynomial (good predictive model) and with a 5th order polynomial (data is overfitted!)]

Number of factors – cross validation

• ‘Leave one out’ cross validation most popular
  – Calculate PLS model excluding sample i, for N PLS factors
  – Use model to predict sample i and calculate error
  – Repeat for all different samples
  – Calculate root mean square error of cross validation (RMSECV)
  – Repeat for different number of PLS factors

[Figure: RMSECV and RMSEC against number of PLS factors (1 to 12)]

RMSEC (Root Mean Square Error of Calibration) goes down with increasing number of factors

To decide optimal number of factors use minimum of RMSECV (Root Mean Square Error of Cross Validation)
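The leave-one-out procedure above can be sketched with a simpler model than PLS. In the illustration below, polynomial order stands in for the number of PLS factors, echoing the 1st order versus 5th order comparison made earlier; this substitution, the synthetic data and the function name `rmsec_and_rmsecv` are all assumptions of the example, which uses Python with NumPy.

```python
import numpy as np

def rmsec_and_rmsecv(x, y, degree):
    """Calibration error and leave-one-out cross-validation error of a polynomial fit."""
    fitted = np.polyval(np.polyfit(x, y, degree), x)
    rmsec = np.sqrt(np.mean((y - fitted) ** 2))

    errors = []
    for i in range(len(x)):
        keep = np.arange(len(x)) != i                      # exclude sample i
        coeffs = np.polyfit(x[keep], y[keep], degree)      # calibrate without it
        errors.append(y[i] - np.polyval(coeffs, x[i]))     # predict the excluded sample
    rmsecv = np.sqrt(np.mean(np.square(errors)))
    return rmsec, rmsecv

# Hypothetical data: a straight line with added noise.
rng = np.random.default_rng(2)
x = np.linspace(1, 30, 10)
y = 0.7 * x + rng.normal(scale=2.0, size=x.size)

for degree in (1, 5):
    rmsec, rmsecv = rmsec_and_rmsecv(x, y, degree)
    print(degree, rmsec, rmsecv)
```

The calibration error can only fall as the model becomes more flexible, so it cannot reveal over-fitting; the cross-validation error is the quantity to minimise when choosing model complexity.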


Validation and prediction

[Diagram: a population from which the calibration data set and the next sets of samples are drawn, with stages labelled Calibration, Validation and Prediction]

• Validation data should be statistically independent from calibration data e.g. data taken on a different batch of samples, on a different day

• Calculate RMSEP (Root Mean Square Error of Prediction)

Independent validation set is essential if we want to use model to predict new samples!

(C) Crown Copyrig

ht 2010

Contents

1. Introduction

2. Identification

3. Quantification and prediction
• Partial least squares regression (PLS)
• Calibration, validation and prediction
• PLS examples

4. Classification

5. Conclusion

PLS example (1)

Calculation using MATLAB with PLS Toolbox 4.0

• 13 SIMS spectra of thin films of Irganox were compared with their thicknesses measured with XPS

• Both X and Y are mean centered before analysis, and peaks below 50 u were removed

• Two PLS factors are retained, explaining 99.8% of the variance in X (SIMS data) and 98.8% of the variance in Y (thicknesses)
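The two preprocessing steps named above (removing peaks below 50 u and mean centering X and Y) can be sketched as follows. This is only an illustration in plain Python: the intensities and thicknesses are invented, and the function names `remove_low_mass` and `mean_centre` are hypothetical, not PLS Toolbox calls.

```python
def remove_low_mass(masses, X, min_mass=50.0):
    """Keep only the peaks (columns of X) whose mass is at least min_mass."""
    keep = [j for j, m in enumerate(masses) if m >= min_mass]
    return [masses[j] for j in keep], [[row[j] for j in keep] for row in X]

def mean_centre(X):
    """Subtract the mean of each column (variable) from that column."""
    n = len(X)
    means = [sum(col) / n for col in zip(*X)]
    return [[v - m for v, m in zip(row, means)] for row in X], means

# Invented example: 3 spectra (rows) with 5 peaks (columns)
masses = [27.0, 41.0, 59.0, 231.0, 277.0]
X = [[5.0, 4.0, 1.0, 2.0, 1.0],
     [4.0, 5.0, 2.0, 4.0, 2.0],
     [6.0, 3.0, 3.0, 6.0, 3.0]]
y = [1.0, 2.0, 3.0]  # invented thicknesses, nm

kept_masses, X_kept = remove_low_mass(masses, X)
X_centred, x_means = mean_centre(X_kept)
y_mean = sum(y) / len(y)
y_centred = [v - y_mean for v in y]

print(kept_masses)
print(X_centred)
print(y_centred)
```

The column means found during calibration need to be stored, because the same means are subtracted from any new spectrum before the model is applied to it.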

PLS example (1)

• Leave-one-out cross validation indicates that 2 PLS factors should be included in the model

• Scores and loadings on PLS factor 1 and factor 2 are calculated

[Figure: PLS scores against sample, and PLS loadings against mass, for factor 1 and factor 2]

PLS example (1)

[Figure: thickness predicted by SIMS (nm) against thickness measured by XPS (nm)]

• PLS model able to predict thicknesses for t < 6 nm

• PLS regression vector shows Irganox characteristic peaks most correlated with thickness

• Irganox dewets on the surface so initial thickness is proportional to surface coverage!

[Figure: regression vector for Y against mass (u), with peaks labelled at 59, 231, 277 and 1176; annotation: 4 nm]

FM Green et al., Anal. Chem. 2009, 81, 75–79

PLS example (2)

A. J. Urquhart et al., Anal. Chem., 80 (1), 135 -142, 2008

• ToF-SIMS spectra of 576 copolymers are related to their experimental water contact angles (WCA)

• Positive and negative ion spectra are normalised separately, then concatenated (combined) into single data matrix X

Polar species, e.g. CnHmO+ & CN-: hydrophilic

Hydrocarbons, e.g. CnHm+: hydrophobic
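Building the combined data matrix can be sketched as below. The slide does not say which normalisation was used, so normalisation to total intensity is assumed here purely for illustration; the function names and the intensities are likewise hypothetical.

```python
def normalise_to_total(spectrum):
    """Scale a spectrum so that its intensities sum to 1 (assumed normalisation)."""
    total = sum(spectrum)
    if total == 0:
        raise ValueError("cannot normalise a spectrum with zero total intensity")
    return [v / total for v in spectrum]

def build_row(positive, negative):
    """Normalise each polarity separately, then concatenate into one row of X."""
    return normalise_to_total(positive) + normalise_to_total(negative)

# Invented intensities for one sample: 3 positive ion peaks and 2 negative ion peaks
positive = [10.0, 30.0, 60.0]
negative = [5.0, 15.0]

row = build_row(positive, negative)
print(row)
```

Normalising the two polarities separately stops the polarity with the larger total signal from dominating the combined row purely because of its overall intensity.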

Quantification and prediction summary

• PLS is a multivariate linear regression technique

• PLS finds factors that best describe the structure of covariance between X and Y

• Data preprocessing method needs to be selected with care

• PLS is excellent for calibration and quantification, and for studying the relationship between SIMS data and other measured properties

• Properly validated PLS models can be used for predictions of these properties using SIMS spectra

$Y = XB + E$, where Y = response variables, X = data matrix, B = regression matrix and E = residuals (noise)

Contents

1. Introduction

2. Identification

3. Quantification and prediction

4. Classification
• PCA classification
• Principal Component Discriminant Function Analysis (PC-DFA)
• Partial Least Squares Discriminant Analysis (PLS-DA)

5. Conclusion

PCA classification

• 16 different single protein films adsorbed on mica

• Excellent classification of proteins using only 2 factors

• Factors consistent with total amino acid composition of various proteins

• 95% confidence limits provide means for identification / classification

M. Wagner & D. G. Castner, Langmuir, 17 (2001) 4649

[Figure: PC2 scores (19%) against PC1 scores (53%); PCA loadings]

Discriminant analysis

• PCA describes the spread of the data in order of importance (directions 1 and 2 in the figure)
• Discriminant analysis describes the difference between the smiley and the heart, the two groups in the figure (direction 1)

Example 1 – PC-DFA

• PC-DFA = “Principal Component – Discriminant Function Analysis”

• ‘Discriminant functions’ maximise the Fisher’s ratio between groups

• Used to distinguish strains of bacteria

$$\text{Fisher's ratio} = \frac{(\text{mean}_1 - \text{mean}_2)^2}{\text{var}_1 + \text{var}_2}$$

J. S. Fletcher et al, Appl. Surf. Sci. 252 (2006) 6869
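As a worked illustration, Fisher's ratio for two groups along a single direction is the squared difference of the group means divided by the sum of the group variances. The sketch below evaluates it for invented scores; the function name `fishers_ratio` is hypothetical, and the use of the sample variance (`statistics.variance`) is an assumption, since the slide does not specify which variance estimator is meant.

```python
from statistics import mean, variance

def fishers_ratio(group1, group2):
    """(mean1 - mean2)^2 / (var1 + var2) for scores along one direction."""
    return (mean(group1) - mean(group2)) ** 2 / (variance(group1) + variance(group2))

# Invented scores of two groups projected onto one candidate direction
group_a = [1.0, 2.0, 3.0]
group_b = [4.0, 5.0, 6.0]

# Means are 2 and 5, sample variances are 1 and 1, so the ratio is 9 / 2
print(fishers_ratio(group_a, group_b))  # 4.5
```

A direction along which the groups are far apart relative to their internal scatter gives a large ratio, which is the quantity a discriminant function seeks to maximise.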

Example 2 – PLS-DA

Example: Hyperaccumulator plant which stores high levels of Ni in epidermal cells. What is the difference between epidermal and other cells?

Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford

[Image labels: trichome, epidermal, mesophyll]

[Figure: PCA results. PC1, 2 and 3 overlay; PC10 scores and loadings (loadings on PC10 (0.58%) against variable)]

PC 10 describes differences between epidermal cells and other areas, but this is not efficient!

Example 2 – PLS-DA

Step 1 – Define epidermal cells

Step 2 – PLS-DA factors: The largest covariance between the data and the group assignments (0 or 1)

[Images: LV1, LV2 and LV3]

[Figure: PLS-DA prediction image]

Step 3 – Regression vector: Combination of peaks which best predicts the differences between epidermal and other cells

[Figure: regression vector for Y1 against variable]

Image courtesy of Dr Kat Smart and Prof Chris Grovenor at the University of Oxford
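The three steps can be sketched with a deliberately small example: group membership is coded as 0 or 1, a single PLS factor is found from the covariance between the data and that coding, and the resulting regression vector shows which variables separate the groups. This is a simplified illustration with one latent variable, invented data and hypothetical function names; the analysis shown on the slides used several latent variables on image data.

```python
import math

def plsda_one_lv(X, y):
    """PLS-DA with a single latent variable; y holds group assignments (0 or 1)."""
    n = len(X)
    x_mean = [sum(col) / n for col in zip(*X)]
    y_mean = sum(y) / n
    Xc = [[v - m for v, m in zip(row, x_mean)] for row in X]   # mean-centred data
    yc = [v - y_mean for v in y]                               # mean-centred coding
    # Weight vector: direction of largest covariance between X and y
    w = [sum(a * b for a, b in zip(col, yc)) for col in zip(*Xc)]
    norm = math.sqrt(sum(v * v for v in w))
    w = [v / norm for v in w]
    t = [sum(a * b for a, b in zip(row, w)) for row in Xc]     # scores
    q = sum(a * b for a, b in zip(yc, t)) / sum(v * v for v in t)
    b = [q * v for v in w]                                     # regression vector
    return x_mean, y_mean, b

def predict(model, x):
    """Predicted group value for one sample; values near 1 suggest group 1."""
    x_mean, y_mean, b = model
    return y_mean + sum(bj * (xj - mj) for bj, xj, mj in zip(b, x, x_mean))

# Step 1: define the groups (invented intensities for 3 variables)
X = [[5.0, 1.0, 0.0], [6.0, 1.0, 1.0], [5.0, 2.0, 0.0],   # group 1
     [1.0, 1.0, 5.0], [0.0, 2.0, 6.0], [1.0, 1.0, 5.0]]   # group 0
y = [1, 1, 1, 0, 0, 0]

# Steps 2 and 3: find the factor and the regression vector
model = plsda_one_lv(X, y)
regression_vector = model[2]
predictions = [predict(model, row) for row in X]

print("Regression vector:", regression_vector)
print("Predictions:", predictions)
```

In this toy data set the first variable is high in group 1 and the third is high in group 0, so they receive positive and negative regression coefficients respectively, while the second variable, which does not differ between the groups, receives a coefficient close to zero.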

Classification summary

• PCA allows for quick grouping of samples based on their similarities

• PC-DFA and PLS-DA are supervised classification methods: prior knowledge about the groups is required

• Properly validated classification models are needed for predictions

• There also exist other clustering and classification methods, e.g. hierarchical cluster analysis, K-nearest-neighbours, artificial neural networks…

[Diagram: chemical information: design, creation, organisation, management, retrieval, analysis, dissemination, visualisation, use]

All these (and much, much more) belong to the wider field of chemoinformatics!

Contents

1. Introduction

2. Identification

3. Quantification and prediction

4. Classification

5. Conclusion

Data analysis

[Diagram: SIMS dataset analysed for identification (PCA, MCR), calibration / quantification (PLS) and classification (PC-DFA, PLS-DA)]


Conclusion

In this lecture we looked at
• Identification, quantification & classification using multivariate analysis
• Importance of validation for predictive models
• Data preprocessing techniques and their effects

Surface and Interface Analysis
Multivariate Analysis special issues (Volume 41 Issue 2 & 8, Feb/Aug 2009)

http://mvsa.nb.uw.edu/
Community website with tutorials, links and software
Developed by Dan Graham and hosted by NESAC/BIO

Surface Analysis: The Principal Techniques 2nd edition, Chapter 10 “The application of multivariate data analysis techniques in surface analysis”

http://www.npl.co.uk/nanoanalysis/chemometrics.html
• Copy of tutorial slides
• Draft terminology for ISO18115

MVA vocabulary

Bibliography

General
• J. L. S. Lee et al, “The application of multivariate data analysis techniques in surface analysis”, in Surface Analysis: The Principal Techniques 2nd edition (eds J C Vickerman, I S Gilmore), Wiley.
• S. Wold, Chemometrics; what do we mean with it, and what do we want from it?, Chemom. Intell. Lab. Syst. 30 (1995) 109
• E. R. Malinowski, Factor analysis in Chemistry, John Wiley and Sons (2002)
• P. Geladi et al, Multivariate image analysis, John Wiley and Sons (1996)
• J. L. S. Lee et al, Quantification and methodology issues in multivariate analysis of ToF-SIMS data for mixed organic systems, Surf. Interface Anal. 40 (2008) 1
• D. J. Graham, NESAC/BIO ToF-SIMS MVA web resource, http://nb.engr.washington.edu/nb-sims-resource/

PCA
• D. J. Graham et al, Information from complexity: challenges of ToF-SIMS data interpretation, Appl. Surf. Sci. 252 (2006) 6860
• M. R. Keenan et al, Accounting for Poisson noise in the multivariate analysis of ToF-SIMS spectrum images, Surf. Interface Anal. 36 (2004) 203
• M. R. Keenan et al, Mitigating dead-time effects during multivariate analysis of ToF-SIMS spectral images, Surf. Interface Anal. 40 (2008) 97

MCR
• N. B. Gallagher et al, Curve resolution for multivariate images with applications to TOF-SIMS and Raman, Chemom. Intell. Lab. Syst. 73 (2004) 105
• J. A. Ohlhausen et al, Multivariate statistical analysis of time-of-flight secondary ion mass spectrometry using AXSIA, Appl. Surf. Sci. 231-232 (2004) 230
• R. Tauler, A. de Juan, MCR-ALS Graphic User Friendly Interface, http://www.ub.edu/mcr/

PLS
• P. Geladi et al, Partial Least-Squares Regression: A Tutorial, Analytica Chimica Acta 185 (1986) 1
• A. M. C. Davies et al, Back to basics: observing PLS, Spectroscopy Europe 17 (2005) 28
