multivariate data analysis - imedea · 2014. 5. 9. · univariate and multivariate data univariate...

Multivariate Data Analysis

• Introduction to Multivariate Data Analysis
• Principal Component Analysis (PCA)
• Multivariate Linear Regression (MLR, PCR and PLSR)

Laboratory exercises:
• Introduction to MATLAB
• Examples of PCA (cluster analysis of samples, identification and geographical distribution of contamination sources/patterns…)
• Examples of Multivariate Regression (prediction of concentration of chemicals from spectral analysis, investigation of correlation patterns and of the relative importance of variables…)

Romà Tauler (IDAEA, CSIC, Barcelona) [email protected]


Post on 20-Jan-2021


Univariate and Multivariate Data

Univariate
• Only one measurement per sample (pH, absorbance, peak height or area).
• The property is defined by one measurement.
• Total selectivity is needed.
• Interferences should be eliminated before the measurements (separation).
• Numerical treatment is easy.

Multivariate
• Multiple measurements per sample
  – Instrumental measurements (spectra, chromatograms, ...)
  – Multiple measurements (constituent concentrations, sensory variables, ...)
• Total selectivity is not needed.
• Interferences can be present.
• Different complexity levels (vectors, matrices, tensors, ...).

Univariate Statistics

One variable summary

Mean: $\mu \rightarrow \bar{X} = \dfrac{\sum_{i=1}^{n} x_i}{n}$

Variance: $\sigma^2 \rightarrow s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}$

Standard deviation: $\sigma \rightarrow s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{X})^2}{n-1}}$

Relation between 2 variables

Covariance: $\sigma_{x,y} \rightarrow s_{x,y} = \dfrac{\sum_{i=1}^{n} (x_i - \bar{X})(y_i - \bar{Y})}{n-1}$

Correlation: $r_{x,y} = \dfrac{s_{x,y}}{s_x s_y}$

Univariate Statistics
Description of one variable

Zn: 4.10, 4.04, 3.82, 3.96, 4.07, 4.23, 3.73, 3.80, 4.23

• Mean: 4.00
  – Mean value of the variable.
  – Size of the scale of the variable.
• Standard deviation: 0.18.
• Variance: 0.034.
  – Dispersion around the mean.
  – Spread (dispersion) of the scale of the variable.

Univariate Statistics
Description of the relation between 2 variables

Sn      Ni
0.20    0.0022
0.20    0.0020
0.15    0.0015
0.61    0.0062
0.57    0.0056
0.58    0.0055
0.30    0.0033
0.60    0.0056
0.10    0.0014

• Covariance: 0.00043.
  – High values indicate a linear relationship between x and y.
  – Sign: direct (+) or inverse (−) relation.
  – Depends on the scale sizes of x and y.
• Correlation: 0.995.
  – |r| = 1: total linearity; r = 0: absence of a linear relationship.
  – Sign: direct (+) or inverse (−) relation.
  – Independent of the scale sizes of x and y.

High correlation: redundant information.
Low correlation: complementary information.
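The covariance and correlation quoted for Sn and Ni can be reproduced from the definitions of the univariate statistics. The following is a minimal sketch in plain Python (the course itself uses MATLAB); the function names are illustrative choices, not part of the original slides.

```python
import math

# Sn and Ni values copied from the table on this slide.
sn = [0.20, 0.20, 0.15, 0.61, 0.57, 0.58, 0.30, 0.60, 0.10]
ni = [0.0022, 0.0020, 0.0015, 0.0062, 0.0056, 0.0055, 0.0033, 0.0056, 0.0014]

def mean(x):
    return sum(x) / len(x)

def covariance(x, y):
    # Sample covariance with the n - 1 denominator.
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)

def std(x):
    # The variance is the covariance of a variable with itself.
    return math.sqrt(covariance(x, x))

def correlation(x, y):
    # Covariance rescaled by both standard deviations, so it is unit-free.
    return covariance(x, y) / (std(x) * std(y))

cov_sn_ni = covariance(sn, ni)
r_sn_ni = correlation(sn, ni)
print(f"covariance = {cov_sn_ni:.5f}, correlation = {r_sn_ni:.3f}")
```

Multiplying one of the two variables by a constant changes the covariance by that factor but leaves the correlation unchanged, which is the scale dependence described in the bullets.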

Multivariate Data
Example

Data matrix X(9,4), in general X(n,m): rows are samples (objects), columns are variables.

Sn      Zn      Fe      Ni
0.20    4.10    0.06    0.0022
0.20    4.04    0.04    0.0020
0.15    3.82    0.08    0.0015
0.61    3.96    0.09    0.0062
0.57    4.07    0.08    0.0056
0.58    4.23    0.07    0.0055
0.30    3.73    0.02    0.0033
0.60    3.80    0.07    0.0056
0.10    4.23    0.05    0.0014

Multivariate Statistics
Description of variables

• Vector of means:
  – Mean value of each variable.
  – Sample differences due to the differences in the size scales of the variables.
• Variables with higher means have a higher influence on the data description.
• If the scale sizes are very different, a data pretreatment will probably be needed.

$\bar{x}_j = \dfrac{\sum_{i=1}^{n} x_{ij}}{n}$

Sn      Zn      Fe      Ni
(0.37   4.00    0.06    0.0037)

Multivariate Statistics
Description of variables

• Vector of standard deviations:
  – Dispersion of the different variables.
  – Shows the differences in the scale ranges (dispersion, spread) among variables.
• Variables with higher dispersions will have a higher influence on the data description.
• If the scale ranges of the different variables are very different, a data pretreatment will probably be needed.

s = (s1, s2, s3, ..., sm)

$s_j = \sqrt{\dfrac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}}$

Sn      Zn      Fe      Ni
(0.22   0.18    0.02    0.0020)
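The vectors of means and standard deviations are simply the univariate statistics applied column by column. A minimal sketch, assuming the 9 × 4 metals table from the example slide is stored as a list of rows (the variable names are illustrative):

```python
import math

# Rows are samples; columns are Sn, Zn, Fe, Ni (values from the example data matrix).
X = [
    [0.20, 4.10, 0.06, 0.0022],
    [0.20, 4.04, 0.04, 0.0020],
    [0.15, 3.82, 0.08, 0.0015],
    [0.61, 3.96, 0.09, 0.0062],
    [0.57, 4.07, 0.08, 0.0056],
    [0.58, 4.23, 0.07, 0.0055],
    [0.30, 3.73, 0.02, 0.0033],
    [0.60, 3.80, 0.07, 0.0056],
    [0.10, 4.23, 0.05, 0.0014],
]
names = ["Sn", "Zn", "Fe", "Ni"]

n = len(X)
columns = list(zip(*X))  # transpose: one tuple per variable

# One mean and one standard deviation (n - 1 denominator) per variable.
means = [sum(col) / n for col in columns]
stds = [
    math.sqrt(sum((v - m) ** 2 for v in col) / (n - 1))
    for col, m in zip(columns, means)
]

for name, m, s in zip(names, means, stds):
    print(f"{name}: mean = {m:.4f}, std = {s:.4f}")
```

The printed values span several orders of magnitude (Zn around 4, Ni around 0.004), which is the kind of scale difference that motivates a pretreatment.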

Multivariate Data
Robust Parameters

Each variable sorted in ascending order:

Sn      Zn      Fe      Ni
0.10    3.73    0.02    0.0014
0.15    3.80    0.04    0.0015
0.20    3.82    0.05    0.0020
0.20    3.96    0.06    0.0022
0.30    4.04    0.07    0.0033    (median: 5th of the 9 sorted values)
0.57    4.07    0.07    0.0055
0.58    4.10    0.08    0.0056
0.60    4.23    0.08    0.0056
0.61    4.23    0.09    0.0062

Location: median. Spread: interquartile range.

They are less sensitive to the presence of outliers.
The interquartile range is not necessarily symmetric with respect to the median.
They show the data structure.
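To illustrate why the median and the interquartile range are called robust, the sketch below compares them with the mean before and after one value is corrupted. The corrupted value is a hypothetical error introduced for the illustration, and the "inclusive" quartile rule is only one of several conventions (other software may return slightly different quartiles).

```python
import statistics

sn = [0.20, 0.20, 0.15, 0.61, 0.57, 0.58, 0.30, 0.60, 0.10]

# Hypothetical decimal-point slip: 0.61 recorded as 6.1.
sn_bad = list(sn)
sn_bad[3] = 6.1

def iqr(x):
    # Quartiles by linear interpolation over the sorted data ("inclusive" method).
    q1, _, q3 = statistics.quantiles(x, n=4, method="inclusive")
    return q3 - q1

for label, data in (("original", sn), ("with outlier", sn_bad)):
    print(
        f"{label}: mean = {statistics.mean(data):.3f}, "
        f"median = {statistics.median(data):.2f}, IQR = {iqr(data):.2f}"
    )
```

The mean moves substantially when the outlier is introduced, whereas the median and the interquartile range stay where they were.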

Descriptive Statistics (Excel)

Statistic            ppDDD         opDDT         ppDDT         Total DDX     Total POCs    PCB#28
Mean                 6.096511644   2.16580938    7.086504433   66.19020173   78.70933531   7.600267864
Median               1.928828322   0.035         0.099439743   18.10346088   23.83508898   2.1
Mode                 0.01          0.035         0.02          0.145         0.3215        0.9
Standard deviation   20.01177892   12.61189239   59.35342594   199.7332731   229.5487184   23.05609066
Sample variance      400.4712955   159.0598297   3522.829171   39893.38037   52692.61412   531.5833166
Kurtosis             45.08257127   52.62597243   100.808786    37.034318     41.9234197    37.79168548
Skewness             6.553464204   7.205359571   10.01488076   5.826281824   6.155842663   6.004360386
Range                158.99        103.565       598.939       1536.855      1856.1785     165.535
Minimum              0.01          0.035         0.02          0.145         0.3215        0.065
Maximum              159           103.6         598.959       1537          1856.5        165.6
Sum                  621.8441877   220.9125568   722.8234522   6751.400577   8028.352202   752.4265185
Count                102           102           102           102           102           99

Probability distributions: Box plots

A box plot summarizes the information on the data distribution primarily in terms of the median, the upper quartile, and the lower quartile. The "box" by definition extends from the upper to the lower quartile. Within the box is a dot or line marking the median. The width of the box, or the distance between the upper and lower quartiles, is equal to the interquartile range, and is a measure of spread. The median is a measure of location, and the relative distances of the median from the upper and lower quartiles are a measure of symmetry "in the middle" of the distribution. For example, the median is approximately in the middle of the box for a symmetric distribution, and is positioned toward the lower part of the box for a positively skewed distribution.

[Figure: Box plots, Tucson Precipitation. x-axis: Month (Jan, July); y-axis: P (in), 0 to 7. Annotations: median, upper quartile, lower quartile, interquartile range (iqr).]

Probability distributions: Box plots

"Whiskers" are drawn outside the box at what are called the "adjacent values." The upper adjacent value is the largest observation that does not exceed the upper quartile plus 1.5 iqr, where iqr is the interquartile range. The lower adjacent value is the smallest observation that is not less than the lower quartile minus 1.5 iqr. If no data fall outside this 1.5 iqr buffer around the box, the whiskers mark the data extremes. The whiskers also give information about symmetry in the tails of the distribution.

For example, if the distance from the top of the box to the upper whisker exceeds the distance from the bottom of the box to the lower whisker, the distribution is positively skewed in the tails. Skewness in the tails may be different from skewness in the middle of the distribution. For example, a distribution can be positively skewed in the middle and negatively skewed in the tails.

[Figure: Box plots, Tucson Precipitation. x-axis: Month (Jan, July); y-axis: P (in), 0 to 7. Annotations: whiskers, interquartile range (iqr).]

Probability distributions: Box plots

Any points lying outside the 1.5 iqr buffer around the box are marked by individual symbols as "outliers". These points are outliers in comparison to what is expected from a normal distribution with the same mean and variance as the data sample. For a standard normal distribution, the median and mean are both zero, and q0.25 = −0.67449, q0.75 = 0.67449, iqr = q0.75 − q0.25 = 1.349, where q0.25 and q0.75 are the first and third quartiles, and iqr is the interquartile range. The whiskers for a standard normal distribution are therefore at the data values: upper whisker = 2.698, lower whisker = −2.698.

[Figure: Box plots, Tucson Precipitation. x-axis: Month (Jan, July); y-axis: P (in), 0 to 7. Annotation: outliers.]

Probability distributions: Box plots

From the cdf of the standard normal distribution, the probability of a value lower than x = −2.698 is 0.0035. This shows that, for a normal distribution, roughly 0.35 percent of the data are expected to fall below the lower whisker. By symmetry, 0.35 percent of the data are expected above the upper whisker. These data values are classified as outliers. Exactly how many outliers might be expected in a sample of normally distributed data depends on the sample size. For example, with a sample size of 100 we expect no outliers, as 0.35 percent of 100 is less than 1. With a sample size of 10,000, however, we would expect 35 positive outliers and 35 negative outliers for a normal distribution.

For a normal distribution:
>> varnorm=randn(10000,3);
>> boxplot(varnorm)

0.35% of 10000 is approx. 35 outliers on each whisker side.
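The whisker positions and the expected outlier fraction for a normal distribution can be checked numerically. This is a minimal sketch using Python's standard library instead of MATLAB; it evaluates the theoretical quantiles rather than simulating random data.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

q25 = z.inv_cdf(0.25)  # first quartile
q75 = z.inv_cdf(0.75)  # third quartile
iqr = q75 - q25

# Whisker limits: quartiles extended by 1.5 interquartile ranges.
upper_whisker = q75 + 1.5 * iqr
lower_whisker = q25 - 1.5 * iqr

# Probability of falling below the lower whisker (same above the upper one).
p_below = z.cdf(lower_whisker)

print(f"iqr = {iqr:.3f}, whiskers at {lower_whisker:.3f} and {upper_whisker:.3f}")
for n in (100, 10_000):
    print(f"n = {n}: expected outliers per side = {n * p_below:.2f}")
```

The expected count per side is a fraction of one observation for n = 100 and close to 35 for n = 10,000, in line with the text.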

Parametric vs Robust Statistics

Parametric: mean, standard deviation.
Robust (box plot): median, minimum, maximum, interquartile range (IQR).

Parametric vs Robust Statistics

[Figure: mean and standard deviation plots, and median and IQR plots (box plots), for Sn, Zn, Fe and Ni.]

They help to see the differences in size and range of the scales.
They suggest the use of appropriate data pretreatments to handle these differences.

Multivariate Statistics
Description of the variable relationships

• Covariance matrix, S(m,m): it contains all the possible pairwise combinations between variables.

$$S = \begin{pmatrix} s^2_{11} & s^2_{12} & \dots & s^2_{1m} \\ s^2_{21} & \dots & \dots & s^2_{2m} \\ \dots & \dots & \dots & \dots \\ s^2_{m1} & \dots & \dots & s^2_{mm} \end{pmatrix}$$

Covariance: $s^2_{jl} = \dfrac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)(x_{il} - \bar{x}_l)}{n-1}$

Variance: $s^2_{jj} = \dfrac{\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}{n-1}$

Multivariate Statistics
Description of the variable relationships

• Correlation matrix, C(m,m): it contains all the possible correlations between variables.
  – Diagonal elements are 1.

$$C = \begin{pmatrix} r_{11} & r_{12} & \dots & r_{1m} \\ r_{21} & \dots & \dots & r_{2m} \\ \dots & \dots & \dots & \dots \\ r_{m1} & \dots & \dots & r_{mm} \end{pmatrix}$$

Correlation: $r_{jl} = \dfrac{s^2_{jl}}{s_j s_l}$, with $s_j = \sqrt{s^2_{jj}}$

Multivariate Statistics
Description of the variable relationships

Covariance Matrix

        Sn          Zn          Fe          Ni
Sn      0.047319    -0.002593   0.002518    0.000434
Zn      -0.002593   0.033644    0.000581    -0.000022
Fe      0.002518    0.000581    0.000494    0.000022
Ni      0.000434    -0.000022   0.000022    0.000004

Correlation Matrix

        Sn        Zn        Fe        Ni
Sn      1.000     -0.065    0.521     0.995
Zn      -0.065    1.000     0.142     -0.060
Fe      0.521     0.142     1.000     0.502
Ni      0.995     -0.060    0.502     1.000
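Both matrices follow from the element-wise definitions of covariance and correlation. A minimal sketch, assuming the 9 × 4 metals table is stored as a list of rows; plain Python loops are used here for transparency, whereas in MATLAB the built-in functions `cov` and `corrcoef` would normally be used.

```python
import math

# Rows are samples; columns are Sn, Zn, Fe, Ni.
X = [
    [0.20, 4.10, 0.06, 0.0022],
    [0.20, 4.04, 0.04, 0.0020],
    [0.15, 3.82, 0.08, 0.0015],
    [0.61, 3.96, 0.09, 0.0062],
    [0.57, 4.07, 0.08, 0.0056],
    [0.58, 4.23, 0.07, 0.0055],
    [0.30, 3.73, 0.02, 0.0033],
    [0.60, 3.80, 0.07, 0.0056],
    [0.10, 4.23, 0.05, 0.0014],
]
names = ["Sn", "Zn", "Fe", "Ni"]
n, m = len(X), len(X[0])

means = [sum(row[j] for row in X) / n for j in range(m)]

# S[j][l]: covariance between variables j and l (n - 1 denominator).
S = [
    [
        sum((X[i][j] - means[j]) * (X[i][l] - means[l]) for i in range(n)) / (n - 1)
        for l in range(m)
    ]
    for j in range(m)
]

# C[j][l]: covariance divided by the two standard deviations.
s = [math.sqrt(S[j][j]) for j in range(m)]
C = [[S[j][l] / (s[j] * s[l]) for l in range(m)] for j in range(m)]

for name, row in zip(names, C):
    print(name, " ".join(f"{v:7.3f}" for v in row))
```

The covariance entries differ by orders of magnitude because they carry the scales of the variables, while the correlation entries are all confined to the interval from −1 to 1.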

Pair-wise correlations (variables, samples)

CORRMAP: correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation between variables (columns) in a data set.

Multivariate Data Analysis
Need

• Nature is multivariate.
  – Climate = f(T, rain, winds, seasons, ...)
  – Health = f(genetics, diet, climate, habits, ...)
  – Analyte absorbance = f(solvent, T, interferences, matrix, ...)
• Few properties depend on only one variable.
• Many chemical measurements are multivariate.

Multivariate Data Analysis
Need

• Many times measurements are indirect.
  – Temperature.
    • Low values (thermometer) (univariate data).
    • High values (emission spectra, FTIR, f(T)).
  – Nitrogen concentration.
    • Kjeldahl chemical method (univariate data).
    • NIR spectra (f(C)).

Multivariate Data Analysis
Need

• The studied property is selectively correlated to a single variable only on a few occasions (lack of total selectivity).
• The studied property is determined by a set of variables with which it is highly correlated.

P = f(x1, x2, ..., xn)

Multivariate Data Analysis
Need

Experimental measurements contain information which is not relevant to the property of interest.

Observations = Structure + Noise

Structure = part of the signal correlated with the sought property.
Noise = all the other contributions: instrumental noise, experimental errors, other components, ...

Causality vs Correlation

Correlation is a statistical concept which measures the linear relation between two variables.

A causal relationship is a deterministic interpretation that comes from the problem or application.

Example: the number of storks and the number of newborn children in a geographical area.

Information

Initial hypothesis: the data contain the sought information. There is a relationship between the measured variables and the measured property that can be modelled. When the variables change their values, the property changes too.

X (variables) --model--> Y (property), Y = f(X)

X is a vector or a matrix (e.g. spectral measurements).
Y is a scalar, vector or matrix (e.g. analyte concentrations).

Visualization of original data
Plot of the matrix rows and/or columns

[Figure: a set of spectra plotted three ways: rows (signal vs. variables, with an outlier sample marked), columns (signal vs. samples), and rows and columns together (3D plot over variables and samples).]

Detection of outlier samples/variables.
Detection of scale and range differences among variables.
Systematic information (structure) is easily detected (instrumental responses).
Difficult to interpret when the number of samples is high.

Visualization of original data
Map of samples in the column space (variables)

Samples are drawn as points in the space of the variables.
Similarities among samples can be detected (distances among samples).

        v1    v2    v3
m1      2     3     4
m2      1     0     6

[Figure: samples m1 = (2,3,4) and m2 = (1,0,6) drawn as points on the axes v1, v2, v3.]
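The similarity between the two samples of this example can be quantified as a distance in the variable space. The sketch below uses the Euclidean distance, which is an assumption made for illustration; the slide does not name a specific distance measure.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points with the same number of coordinates.
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

# Samples from the 2 x 3 example matrix (coordinates are v1, v2, v3).
m1 = (2, 3, 4)
m2 = (1, 0, 6)

d12 = euclidean(m1, m2)
print(f"distance(m1, m2) = {d12:.3f}")
```

Because the distance sums squared differences over all variables, a variable with a much larger scale dominates it, which links this view to the pretreatments discussed later.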

Visualization of original data
Map of variables in the row space (samples)

Variables are drawn as vectors in the sample subspace.
Correlation among variables can be estimated (angle).

        v1    v2    v3
m1      2     3     4
m2      1     0     6

r(vi,vj) = cos(vi,vj)
r = 1, angle 0°
r = 0, angle 90°

[Figure: variables v1 = (2,1), v2 = (3,0) and v3 = (4,6) drawn as vectors on the axes m1, m2.]
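The angle interpretation can be tried on the three variable vectors of this example. A minimal sketch follows; note that the cosine of the angle coincides with the correlation coefficient only when each variable has been mean-centered first, so the values computed here on the raw vectors are cosines, not correlation coefficients.

```python
import math

def cosine(u, v):
    # Cosine of the angle between two vectors: dot product over the product of lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Variables as vectors in the sample space (coordinates are m1, m2).
v1, v2, v3 = (2, 1), (3, 0), (4, 6)

angle_12 = math.degrees(math.acos(cosine(v1, v2)))
angle_23 = math.degrees(math.acos(cosine(v2, v3)))
print(f"angle(v1, v2) = {angle_12:.1f} deg, angle(v2, v3) = {angle_23:.1f} deg")
```

A small angle means the two variables point in nearly the same direction across the samples; a right angle means they carry unrelated information.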

Visualization of original data
Graphical representation of multivariate data in the variable space (3D)

Sample   Sn      Zn      Fe      Ni
1        0.20    3.40    0.06    0.08
2        0.20    2.40    0.04    0.06
3        0.15    2.00    0.08    0.16
4        0.61    6.00    0.09    0.02
5        0.57    4.20    0.08    0.06
6        0.58    4.82    0.07    0.02
7        0.30    5.60    0.02    0.01
8        0.60    6.60    0.07    0.06
9        0.10    1.60    0.05    0.19

[Figure: 3D scatter plot of samples 1 to 9 on the axes Zn, Fe and Sn.]

There are two sample groups.
Is the representation of a 4th variable critical?

Visualization of original data
For spaces of more than 3 dimensions?

• Qualitative approximations for a few variables
  – Chernoff faces.
• Efficient compression of the original space of variables
  – Principal Component Analysis (PCA).

Visualization of original data
For spaces of more than 3 dimensions?

• Chernoff faces.
  – It is easy to distinguish different features in human faces. Each sample is a Chernoff face.
  – Each face feature is a variable.

V. 1 High front face of the head
V. 2 Lower front face of the head
V. 3 Eyebrows
V. 4 Smile

[Figure: Chernoff face for sample 7.]

Methods Classification

According to their goal
• Exploration methods.
• Discrimination and classification methods.
• Correlation and regression methods.
• Resolution methods.

According to data type
• Based on original data.
• Based on latent variables (factor analysis).

Exploration Methods

• Visualization of the information.
• Sample similarities and clusters.
• Correlations among variables.
• Outlier detection.
• Relevance of the measured variables. Selection.
• Principal Component Analysis (PCA).

Discriminant and Classification Methods

• Separation of the objects (samples) into defined groups or clusters (classes).
• Assignment of new objects to predefined classes.
• Detection of outlier objects not belonging to any group (class).
• PCA, SIMCA, LDA, PLS-DA, SVM, ...

Correlation and Regression Methods

• Finding relations between two blocks of variables.
• Modeling property changes from a group of variables.
• Prediction of a property from the indirect measurement of a group of variables correlated to it.
• Multilinear Regression (MLR), Principal Components Regression (PCR), Partial Least Squares Regression (PLS).
• Non-linear regression methods, kernel methods, SVM, ...

Factor Analysis based methods

• Factor: a source of the observed data variance, of independent and defined nature.
• Extraction of the relevant factors (structure) of the data set. Noise filtering.
• Description of the data variance from basic factors.
• Identification of the chemical nature of these relevant factors.
• PCA, PLS, PCR, SIMCA, ...
• Multivariate Curve Resolution (MCR), PMF, ICA, ...

Data pre-processing

• Pretreatments modify the size and the range of the scale of the variables.
• They can be applied in the direction of the columns (variables) or of the rows (objects, samples).
• They are selected as a function of the nature of the data and of the information to be obtained.
• There is no optimal treatment; it depends on the chemical problem to be investigated.

Data pre-processing

1) Mean centering (axes translation)

On the data matrix columns: $x^*_{ik} = x_{ik} - \bar{x}_k, \quad \bar{x}_k = \dfrac{\sum_{i=1}^{I} x_{ik}}{I}$

On the data matrix rows: $x^*_{ik} = x_{ik} - \bar{x}_i, \quad \bar{x}_i = \dfrac{\sum_{k=1}^{K} x_{ik}}{K}$

Data pre-processing

2) Scaling

On the data matrix columns: $x^*_{ik} = \dfrac{x_{ik}}{s_k}, \quad s_k = \sqrt{\dfrac{\sum_{i=1}^{I} (x_{ik} - \bar{x}_k)^2}{I-1}}$

On the data matrix rows: $x^*_{ik} = \dfrac{x_{ik}}{s_i}, \quad s_i = \sqrt{\dfrac{\sum_{k=1}^{K} (x_{ik} - \bar{x}_i)^2}{K-1}}$

3) Autoscaling = mean centering + scaling

$x^*_{ik} = \dfrac{x_{ik} - \bar{x}_k}{s_k}$ (columns); $\quad x^*_{ik} = \dfrac{x_{ik} - \bar{x}_i}{s_i}$ (rows)
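Mean centering, scaling and autoscaling can each be written as an operation on a single vector and then applied along the columns or along the rows of the data matrix. The sketch below does this in plain Python; the helper names and the small 3 × 2 matrix are hypothetical and serve only as an illustration.

```python
import math

def center(v):
    # Subtract the mean of the vector from each of its elements.
    mean = sum(v) / len(v)
    return [a - mean for a in v]

def scale(v):
    # Divide each element by the standard deviation (n - 1 denominator).
    mean = sum(v) / len(v)
    s = math.sqrt(sum((a - mean) ** 2 for a in v) / (len(v) - 1))
    return [a / s for a in v]

def autoscale(v):
    # Mean centering followed by scaling.
    return scale(center(v))

def by_columns(f, X):
    # Apply f to every column and return a matrix with the original shape.
    treated = [f(list(col)) for col in zip(*X)]
    return [list(row) for row in zip(*treated)]

def by_rows(f, X):
    # Apply f to every row.
    return [f(list(row)) for row in X]

# Hypothetical data: 3 samples (rows), 2 variables (columns) on very different scales.
X = [[1.0, 10.0], [2.0, 30.0], [3.0, 50.0]]

XA = by_columns(autoscale, X)
print(XA)
print(by_rows(center, X))
```

After column autoscaling both variables of this toy matrix become identical, which shows how the treatment removes differences in size and range and leaves only the pattern of variation.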

Data pre-processing

4) Normalization:

$x^*_{ik} = \dfrac{x_{ik}}{c_i}; \quad c_i = \sum_{k=1}^{K} x_{ik}; \quad c_i = \sqrt{\sum_{k=1}^{K} x_{ik}^2}; \quad \dots$

5) Rotation: X* = R^T X, where R^T is the rotation matrix.

E.g. in two dimensions:

$$\begin{pmatrix} x^*_1 \\ x^*_2 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \end{pmatrix}$$
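The two-dimensional rotation can be written out directly from the matrix with rows (cos θ, sin θ) and (−sin θ, cos θ). A minimal sketch; the point and the angle are hypothetical values chosen for the example.

```python
import math

def rotate(x, theta):
    # Multiply the 2-D point x = (x1, x2) by [[cos, sin], [-sin, cos]].
    c, s = math.cos(theta), math.sin(theta)
    x1, x2 = x
    return (c * x1 + s * x2, -s * x1 + c * x2)

point = (3.0, 4.0)             # hypothetical point
rotated = rotate(point, 0.7)   # hypothetical angle, in radians

print(rotated)
print(math.hypot(*point), math.hypot(*rotated))
```

The length of the vector is the same before and after: a rotation changes the coordinates in which the data are described, not the distances between points. This property is what later allows PCA to re-express the data on new axes.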

Data pre-processing

• Original data (without pretreatment). The scale, size and range of the variables are kept.

[Figure: metal concentrations vs. samples for Sn, Zn, Fe and Ni.]

Data pre-processing

• Centered data. The mean of all the values of a variable is subtracted from each value of that variable.

Differences among variables due to scale size are eliminated.

[Figure: centered metal concentrations vs. samples for Sn, Zn, Fe and Ni.]

Data pre-processing

• Autoscaled data. Each value of the variable is centered and divided by the standard deviation of the values of the variable.

Differences among variables due to size and range are eliminated.

[Figure: autoscaled metal concentrations vs. samples for Sn, Zn, Fe and Ni.]

Data pre-processing

[Figure: three panels of metal concentrations vs. samples for Sn, Zn, Fe and Ni: original data, centered data, autoscaled data.]

Original data: which samples/variables have higher values?
Centered data: which variables discriminate better?
Autoscaled data: what is the correlation among different variables?

Data pre-processing
Centered data and covariance

• Centered data. The mean value of each variable is subtracted from each of its values.
  – The mean value of all variables is zero.

$$XC(n,m) = \begin{pmatrix} x_{11} - \bar{x}_1 & x_{12} - \bar{x}_2 & \dots & x_{1m} - \bar{x}_m \\ x_{21} - \bar{x}_1 & x_{22} - \bar{x}_2 & \dots & x_{2m} - \bar{x}_m \\ \dots & \dots & \dots & \dots \\ x_{n1} - \bar{x}_1 & x_{n2} - \bar{x}_2 & \dots & x_{nm} - \bar{x}_m \end{pmatrix}$$

n: number of samples
m: number of variables

$$S(m,m) = \frac{1}{n-1}\, XC^{T}(m,n)\, XC(n,m)$$

Data pre-processing
Autoscaled data and correlation

• Autoscaled data. Each value of the variable is centered and divided by the standard deviation of the values of the variable.
  – The mean of all variables is 0.
  – The variance (dispersion) of all variables is 1.

XT(n,m): $xt_{ij} = \dfrac{x_{ij} - \bar{x}_j}{s_j}$

$$C(m,m) = \frac{1}{n-1}\, XT^{T}(m,n)\, XT(n,m)$$
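The relation between autoscaling and the correlation matrix can be verified on the metals data used earlier. The sketch below autoscales the columns and then forms the cross-product divided by n − 1; it is written with plain Python loops as an illustration, not as the course's MATLAB code.

```python
import math

# Rows are samples; columns are Sn, Zn, Fe, Ni.
X = [
    [0.20, 4.10, 0.06, 0.0022],
    [0.20, 4.04, 0.04, 0.0020],
    [0.15, 3.82, 0.08, 0.0015],
    [0.61, 3.96, 0.09, 0.0062],
    [0.57, 4.07, 0.08, 0.0056],
    [0.58, 4.23, 0.07, 0.0055],
    [0.30, 3.73, 0.02, 0.0033],
    [0.60, 3.80, 0.07, 0.0056],
    [0.10, 4.23, 0.05, 0.0014],
]
n, m = len(X), len(X[0])

cols = list(zip(*X))
means = [sum(c) / n for c in cols]
stds = [
    math.sqrt(sum((v - mu) ** 2 for v in c) / (n - 1))
    for c, mu in zip(cols, means)
]

# Autoscaled matrix XT: every column has mean 0 and variance 1.
XT = [[(row[j] - means[j]) / stds[j] for j in range(m)] for row in X]

# C = XT^T XT / (n - 1): element (j, l) sums the products of columns j and l.
C = [
    [sum(XT[i][j] * XT[i][l] for i in range(n)) / (n - 1) for l in range(m)]
    for j in range(m)
]

for row in C:
    print(" ".join(f"{v:7.3f}" for v in row))
```

The diagonal elements come out as 1 because each autoscaled column has unit variance, and the off-diagonal elements reproduce the correlation matrix given for these data.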

Data pre-processing
Normalization

Each value of a vector (spectrum) is divided by the length (norm) of that vector:

$xn_{ij} = \dfrac{x_{ij}}{\lVert x_i \rVert}, \quad \lVert x_i \rVert = \sqrt{\sum_{j} x_{ij}^2}$

X → XN(n,m)

Equalizes the response intensities.
Allows a better comparison of the shapes.

[Figure: spectra before (X) and after (XN) normalization. x-axis: wavelengths; y-axis: absorbance.]
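The effect of normalization on intensity and shape can be seen with two made-up spectra that have the same shape but different intensities. A minimal sketch; the numbers are hypothetical.

```python
import math

def normalize(spectrum):
    # Divide every value by the Euclidean length (norm) of the vector.
    norm = math.sqrt(sum(v * v for v in spectrum))
    return [v / norm for v in spectrum]

# Hypothetical spectra: same shape, the second is three times more intense.
weak = [1.0, 2.0, 1.0]
strong = [3.0, 6.0, 3.0]

print(normalize(weak))
print(normalize(strong))
```

Both normalized spectra coincide and have unit length, so any remaining difference between normalized spectra reflects a difference in shape rather than in overall intensity.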

Data pre-processing
In instrumental responses

• To eliminate changes due to instrumental variations without chemical information:
  – Smoothing: noise correction
  – 1st derivative: correction of constant variations
  – 2nd derivative: correction of linear variations
  – Peak alignment
  – Baseline corrections
  – Warping
  – …
• Different pretreatments can be combined.
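The effect of derivatives on constant and linear variations can be illustrated with simple finite differences. This is a sketch under an assumption: real spectra are usually differentiated with smoothing filters, which the slides do not specify, and the signal below is a hypothetical integer peak chosen so that the comparison is exact.

```python
def diff(y):
    # First finite difference: a crude stand-in for the derivative.
    return [b - a for a, b in zip(y, y[1:])]

# Hypothetical peak, then the same peak with a constant offset
# and with an offset plus a linear baseline.
signal = [0, 1, 4, 9, 4, 1, 0]
with_offset = [v + 5 for v in signal]
with_slope = [v + 5 + 2 * i for i, v in enumerate(signal)]

print(diff(signal), diff(with_offset))                 # offset removed
print(diff(diff(signal)), diff(diff(with_slope)))      # linear baseline removed
```

The first difference is identical with and without the offset, and the second difference is identical with and without the linear baseline, while the first difference alone still carries the slope as a constant shift.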

Data pre-processing
NIR spectra of samples

[Figure: panels showing the original data (with baseline and vertical offset), the 2nd derivative, row autoscaling, and row autoscaling + baseline correction.]


Other data pretreatments

– Baseline and background correction
– Noise filtering
– Shift alignment
– Warping
– ...

Data example: environmental monitoring

Data table or matrix X(30,50): 50 parameters are measured on 30 samples.

[Figure: plot of the rows (samples) of data matrix X, and plot of the columns (variables) of data matrix X.]

Page 54: Multivariate Data Analysis - IMEDEA · 2014. 5. 9. · Univariate and Multivariate Data Univariate • Only one measurement per sample (pH, Absorbance, peak height or area) • The

Use of descriptive statistics
1) Individual sample plots
2) Individual variable plots
3) Descriptive statistics (Excel Statistics)
4) Histograms/Box plots
5) Binary correlation between variables
6) ...

[Figure, two panels of the data matrix X, labeled "variables" and "samples".]


Samples sum, average and standard deviation for every variable

[Figure, three panels (sum, mean, std) plotted against variable number (0 to 50). Variables 6, 9, 11, 17, 29, 35, 39, 41 and 47 are labeled in the panels.]


Variables sum, average and standard deviation for every sample

[Figure, three panels (sum, mean, std) plotted against sample number (0 to 30). Samples 21 and 26 are labeled in the panels.]
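The per-variable and per-sample statistics plotted on these two slides can be computed as in the following NumPy sketch. The random 30 x 50 matrix is only a stand-in for the environmental data table, which is not reproduced in the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the 30 x 50 data table (rows = samples, columns = variables).
X = rng.gamma(shape=2.0, scale=20.0, size=(30, 50))

# Statistics over samples: one value for every variable (column).
var_sum = X.sum(axis=0)
var_mean = X.mean(axis=0)
var_std = X.std(axis=0, ddof=1)   # sample standard deviation

# Statistics over variables: one value for every sample (row).
smp_sum = X.sum(axis=1)
smp_mean = X.mean(axis=1)
smp_std = X.std(axis=1, ddof=1)

print(var_mean.shape, smp_mean.shape)
```

Variables or samples that stand out in these profiles (such as the labeled points in the figures) are natural candidates for closer inspection.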


Descriptive Statistics (Excel)

Statistic            ppDDD         opDDT         ppDDT         Total DDX     Total POCs    PCB#28
Mean                 6.096511644   2.16580938    7.086504433   66.19020173   78.70933531   7.600267864
Median               1.928828322   0.035         0.099439743   18.10346088   23.83508898   2.1
Mode                 0.01          0.035         0.02          0.145         0.3215        0.9
Standard deviation   20.01177892   12.61189239   59.35342594   199.7332731   229.5487184   23.05609066
Sample variance      400.4712955   159.0598297   3522.829171   39893.38037   52692.61412   531.5833166
Kurtosis             45.08257127   52.62597243   100.808786    37.034318     41.9234197    37.79168548
Skewness             6.553464204   7.205359571   10.01488076   5.826281824   6.155842663   6.004360386
Range                158.99        103.565       598.939       1536.855      1856.1785     165.535
Minimum              0.01          0.035         0.02          0.145         0.3215        0.065
Maximum              159           103.6         598.959       1537          1856.5        165.6
Sum                  621.8441877   220.9125568   722.8234522   6751.400577   8028.352202   752.4265185
Count                102           102           102           102           102           99


Boxplot

The box has lines at the lower quartile, median, and upper quartile values.

The whiskers are lines extending from each end of the box to show the extent of the rest of the data.

Outliers are data with values beyond the ends of the whiskers.

[Figure: boxplot of the 50 columns of X. x-axis: Column Number (1 to 50); y-axis: Values (0 to 300); outliers marked with asterisks.]
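The quantities drawn in a boxplot can be sketched as follows. The whisker rule of 1.5 times the interquartile range is an assumption (a common default), since the slides do not say which rule was used; `box_stats` is a hypothetical helper name.

```python
import numpy as np

def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends and outliers for one column of data."""
    v = np.asarray(values, dtype=float)
    q1, med, q3 = np.percentile(v, [25, 50, 75])
    iqr = q3 - q1
    lo_limit = q1 - whisker * iqr
    hi_limit = q3 + whisker * iqr
    inside = v[(v >= lo_limit) & (v <= hi_limit)]
    outliers = v[(v < lo_limit) | (v > hi_limit)]
    # Whiskers end at the most extreme points that are not outliers.
    return {"q1": q1, "median": med, "q3": q3,
            "whisker_low": inside.min(), "whisker_high": inside.max(),
            "outliers": outliers}

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
print(stats["median"], stats["whisker_high"], stats["outliers"])
```

Applied column by column to X, this gives the numbers behind each box of the figure.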


Columns 1 through 7

1.0000 0.7071 0.9088 0.8677 0.9011 0.7807 0.90200.7071 1.0000 0.7413 0.6186 0.7394 0.6311 0.68350.9088 0.7413 1.0000 0.8069 0.8313 0.9304 0.98120.8677 0.6186 0.8069 1.0000 0.9716 0.5649 0.74570.9011 0.7394 0.8313 0.9716 1.0000 0.5945 0.76320.7807 0.6311 0.9304 0.5649 0.5945 1.0000 0.96590.9020 0.6835 0.9812 0.7457 0.7632 0.9659 1.00000.8438 0.9207 0.8171 0.8244 0.9198 0.6322 0.74840.9213 0.6820 0.9842 0.8547 0.8528 0.9081 0.97900.8608 0.5930 0.8594 0.5502 0.5965 0.9161 0.92030.8508 0.6622 0.7248 0.9727 0.9742 0.4484 0.65250.9012 0.7057 0.8973 0.9595 0.9549 0.7050 0.84410.7564 0.6899 0.9133 0.5232 0.5746 0.9909 0.94470.6942 0.5023 0.6083 0.9450 0.9053 0.3058 0.51790.7363 0.4323 0.6508 0.9624 0.8984 0.3683 0.57590.8445 0.3825 0.5904 0.7990 0.7847 0.3878 0.58530.7738 0.6443 0.9330 0.5707 0.6013 0.9988 0.96540.9249 0.7821 0.9756 0.7923 0.8362 0.9140 0.96880.8241 0.4946 0.6426 0.9021 0.8829 0.4065 0.61110.8350 0.6862 0.9750 0.7015 0.7202 0.9747 0.98190.9383 0.7801 0.9716 0.7655 0.8175 0.9282 0.97590.8716 0.6598 0.9818 0.8190 0.8134 0.9242 0.97540.8883 0.5718 0.7118 0.6022 0.6668 0.6651 0.75560.9154 0.8532 0.9245 0.7477 0.8279 0.8566 0.91160.9280 0.6412 0.9564 0.8246 0.8240 0.8884 0.95960.7005 0.9217 0.7575 0.4710 0.6158 0.7461 0.73880.9209 0.8423 0.9303 0.7820 0.8482 0.8428 0.91660.9145 0.8839 0.9470 0.8490 0.9106 0.8218 0.90690.9090 0.8047 0.9795 0.8521 0.8842 0.8838 0.95450.7394 0.3741 0.6021 0.9413 0.8757 0.3190 0.53880.8043 0.6443 0.9127 0.5437 0.5826 0.9850 0.95670.8287 0.7583 0.6477 0.8407 0.9123 0.3899 0.57590.8259 0.6214 0.9145 0.5609 0.5979 0.9828 0.96280.8927 0.6820 0.9768 0.7402 0.7568 0.9627 0.98880.8630 0.6876 0.9399 0.6166 0.6632 0.9806 0.97630.8785 0.5285 0.7587 0.9735 0.9386 0.5223 0.71730.8702 0.7217 0.9775 0.7037 0.7371 0.9769 0.98840.9045 0.6304 0.9432 0.6802 0.7057 0.9552 0.97960.6489 0.5175 0.5127 0.9035 0.8850 0.1833 0.40990.9031 0.8127 0.9118 0.7370 0.8028 0.8697 0.92300.7481 0.6250 0.9203 0.5424 0.5707 0.9977 0.95470.8293 0.6533 0.9664 0.6903 
0.7032 0.9785 0.98260.8431 0.8600 0.9018 0.6070 0.7059 0.9062 0.91180.8637 0.5809 0.7495 0.9830 0.9618 0.4834 0.68450.8606 0.5226 0.7253 0.9750 0.9425 0.4707 0.67570.8783 0.6756 0.9791 0.7609 0.7716 0.9575 0.99090.8453 0.6615 0.9699 0.6840 0.7024 0.9872 0.99050.8632 0.6759 0.7094 0.5733 0.6677 0.6640 0.74220.8802 0.6873 0.9784 0.7389 0.7545 0.9698 0.99510.8733 0.6441 0.9746 0.7600 0.7583 0.9495 0.9837

Columns 8 through 14

0.8438 0.9213 0.8608 0.8508 0.9012 0.7564 0.69420.9207 0.6820 0.5930 0.6622 0.7057 0.6899 0.50230.8171 0.9842 0.8594 0.7248 0.8973 0.9133 0.60830.8244 0.8547 0.5502 0.9727 0.9595 0.5232 0.94500.9198 0.8528 0.5965 0.9742 0.9549 0.5746 0.90530.6322 0.9081 0.9161 0.4484 0.7050 0.9909 0.30580.7484 0.9790 0.9203 0.6525 0.8441 0.9447 0.51791.0000 0.7907 0.6191 0.8598 0.8566 0.6577 0.73770.7907 1.0000 0.8493 0.7651 0.9209 0.8765 0.66660.6191 0.8493 1.0000 0.4861 0.6604 0.9066 0.27430.8598 0.7651 0.4861 1.0000 0.9197 0.4205 0.94920.8566 0.9209 0.6604 0.9197 1.0000 0.6721 0.85000.6577 0.8765 0.9066 0.4205 0.6721 1.0000 0.26150.7377 0.6666 0.2743 0.9492 0.8500 0.2615 1.00000.6804 0.7210 0.3517 0.9398 0.8838 0.3068 0.97390.6010 0.6510 0.6050 0.8304 0.7372 0.3345 0.71760.6435 0.9102 0.9031 0.4539 0.7100 0.9909 0.31750.8429 0.9618 0.8833 0.7292 0.8843 0.9064 0.57920.7205 0.7091 0.5315 0.9175 0.8419 0.3639 0.85340.7284 0.9626 0.8686 0.5926 0.8117 0.9601 0.48010.8364 0.9603 0.9239 0.7106 0.8590 0.9243 0.54400.7595 0.9915 0.8207 0.7119 0.8976 0.8930 0.63270.6553 0.7158 0.8884 0.6247 0.6506 0.6550 0.38800.8947 0.9023 0.8610 0.7288 0.8361 0.8697 0.54760.7663 0.9731 0.8794 0.7436 0.8912 0.8571 0.62700.8329 0.6826 0.7322 0.4908 0.5931 0.8099 0.29060.8913 0.9125 0.8429 0.7626 0.8591 0.8498 0.60390.9505 0.9291 0.7747 0.8258 0.9139 0.8266 0.70050.8850 0.9749 0.8049 0.7916 0.9272 0.8739 0.68350.6457 0.6793 0.3610 0.9306 0.8371 0.2543 0.95470.6320 0.8852 0.9511 0.4452 0.6793 0.9810 0.27150.8997 0.6521 0.5012 0.9263 0.8069 0.3972 0.80140.6301 0.8963 0.9689 0.4626 0.6913 0.9724 0.28850.7383 0.9712 0.9148 0.6436 0.8435 0.9435 0.50500.7027 0.9208 0.9668 0.5335 0.7423 0.9743 0.35450.7583 0.8204 0.5713 0.9575 0.9194 0.4688 0.91790.7573 0.9602 0.9091 0.6111 0.8139 0.9685 0.47080.6942 0.9392 0.9745 0.5950 0.7782 0.9347 0.42660.7378 0.5648 0.1928 0.9486 0.8046 0.1520 0.97500.8557 0.9043 0.8666 0.7040 0.8075 0.8764 0.53550.6158 0.8959 0.8922 0.4208 0.6873 0.9904 0.28720.6956 0.9600 0.8774 0.5768 
0.8081 0.9584 0.46200.8389 0.8581 0.8934 0.5786 0.7289 0.9331 0.38330.8037 0.7970 0.5222 0.9823 0.9248 0.4407 0.94090.7596 0.7881 0.5305 0.9709 0.9102 0.4186 0.93500.7463 0.9830 0.8776 0.6584 0.8574 0.9316 0.54640.7044 0.9610 0.9051 0.5740 0.7984 0.9696 0.44840.7293 0.6920 0.8699 0.6102 0.6211 0.6794 0.36720.7405 0.9775 0.8998 0.6392 0.8421 0.9490 0.51460.7210 0.9781 0.8811 0.6526 0.8446 0.9213 0.5430

Columns 15 through 21

0.7363 0.8445 0.7738 0.9249 0.8241 0.8350 0.93830.4323 0.3825 0.6443 0.7821 0.4946 0.6862 0.78010.6508 0.5904 0.9330 0.9756 0.6426 0.9750 0.97160.9624 0.7990 0.5707 0.7923 0.9021 0.7015 0.76550.8984 0.7847 0.6013 0.8362 0.8829 0.7202 0.81750.3683 0.3878 0.9988 0.9140 0.4065 0.9747 0.92820.5759 0.5853 0.9654 0.9688 0.6111 0.9819 0.97590.6804 0.6010 0.6435 0.8429 0.7205 0.7284 0.83640.7210 0.6510 0.9102 0.9618 0.7091 0.9626 0.96030.3517 0.6050 0.9031 0.8833 0.5315 0.8686 0.92390.9398 0.8304 0.4539 0.7292 0.9175 0.5926 0.71060.8838 0.7372 0.7100 0.8843 0.8419 0.8117 0.85900.3068 0.3345 0.9909 0.9064 0.3639 0.9601 0.92430.9739 0.7176 0.3175 0.5792 0.8534 0.4801 0.54401.0000 0.7787 0.3747 0.6183 0.8824 0.5312 0.58290.7787 1.0000 0.3722 0.6206 0.9100 0.4613 0.63650.3747 0.3722 1.0000 0.9159 0.4040 0.9772 0.92630.6183 0.6206 0.9159 1.0000 0.6820 0.9455 0.97820.8824 0.9100 0.4040 0.6820 1.0000 0.5180 0.66320.5312 0.4613 0.9772 0.9455 0.5180 1.0000 0.95190.5829 0.6365 0.9263 0.9782 0.6632 0.9519 1.00000.6845 0.5635 0.9286 0.9475 0.6420 0.9758 0.94080.4501 0.8270 0.6476 0.7772 0.7038 0.6444 0.81650.5604 0.6168 0.8565 0.9405 0.6462 0.8927 0.96590.6882 0.6942 0.8852 0.9376 0.7333 0.9317 0.95170.2425 0.3147 0.7513 0.8036 0.3726 0.7524 0.82070.6100 0.6456 0.8462 0.9557 0.6918 0.8829 0.95430.6884 0.6141 0.8296 0.9502 0.7058 0.8927 0.94860.6992 0.5910 0.8907 0.9646 0.6811 0.9485 0.96140.9813 0.8440 0.3212 0.5780 0.9173 0.4731 0.55350.3342 0.4371 0.9801 0.9126 0.4224 0.9465 0.93100.7653 0.7952 0.3919 0.6972 0.8544 0.5099 0.68820.3604 0.4827 0.9761 0.9159 0.4596 0.9474 0.93930.5689 0.5764 0.9606 0.9639 0.6018 0.9739 0.96980.4142 0.5181 0.9764 0.9445 0.5102 0.9578 0.96740.9513 0.8847 0.5232 0.7495 0.9375 0.6455 0.73850.5164 0.5060 0.9783 0.9652 0.5407 0.9840 0.97200.4992 0.6229 0.9480 0.9449 0.6044 0.9439 0.96610.9418 0.7208 0.1943 0.5086 0.8534 0.3626 0.47080.5437 0.5980 0.8706 0.9384 0.6293 0.9024 0.95900.3467 0.3385 0.9982 0.8989 0.3687 0.9729 0.91130.5222 0.4659 0.9805 0.9462 
0.5140 0.9883 0.94200.3806 0.4650 0.9085 0.9304 0.4991 0.8998 0.95070.9625 0.8569 0.4859 0.7428 0.9309 0.6248 0.72140.9624 0.8954 0.4710 0.7190 0.9437 0.6027 0.70620.6040 0.5514 0.9601 0.9581 0.6012 0.9871 0.96110.5067 0.4843 0.9880 0.9484 0.5200 0.9912 0.95670.3904 0.7438 0.6513 0.7873 0.6794 0.6489 0.82270.5693 0.5432 0.9705 0.9622 0.5833 0.9856 0.96840.6055 0.5699 0.9497 0.9474 0.6119 0.9756 0.9530

Correlation between variables


Columns 22 through 28

0.8716 0.8883 0.9154 0.9280 0.7005 0.9209 0.91450.6598 0.5718 0.8532 0.6412 0.9217 0.8423 0.88390.9818 0.7118 0.9245 0.9564 0.7575 0.9303 0.94700.8190 0.6022 0.7477 0.8246 0.4710 0.7820 0.84900.8134 0.6668 0.8279 0.8240 0.6158 0.8482 0.91060.9242 0.6651 0.8566 0.8884 0.7461 0.8428 0.82180.9754 0.7556 0.9116 0.9596 0.7388 0.9166 0.90690.7595 0.6553 0.8947 0.7663 0.8329 0.8913 0.95050.9915 0.7158 0.9023 0.9731 0.6826 0.9125 0.92910.8207 0.8884 0.8610 0.8794 0.7322 0.8429 0.77470.7119 0.6247 0.7288 0.7436 0.4908 0.7626 0.82580.8976 0.6506 0.8361 0.8912 0.5931 0.8591 0.91390.8930 0.6550 0.8697 0.8571 0.8099 0.8498 0.82660.6327 0.3880 0.5476 0.6270 0.2906 0.6039 0.70050.6845 0.4501 0.5604 0.6882 0.2425 0.6100 0.68840.5635 0.8270 0.6168 0.6942 0.3147 0.6456 0.61410.9286 0.6476 0.8565 0.8852 0.7513 0.8462 0.82960.9475 0.7772 0.9405 0.9376 0.8036 0.9557 0.95020.6420 0.7038 0.6462 0.7333 0.3726 0.6918 0.70580.9758 0.6444 0.8927 0.9317 0.7524 0.8829 0.89270.9408 0.8165 0.9659 0.9517 0.8207 0.9543 0.94861.0000 0.6412 0.8737 0.9535 0.6691 0.8853 0.91060.6412 1.0000 0.7950 0.7707 0.6396 0.7997 0.71420.8737 0.7950 1.0000 0.8924 0.8785 0.9540 0.95290.9535 0.7707 0.8924 1.0000 0.6642 0.8923 0.89500.6691 0.6396 0.8785 0.6642 1.0000 0.8461 0.83560.8853 0.7997 0.9540 0.8923 0.8461 1.0000 0.95730.9106 0.7142 0.9529 0.8950 0.8356 0.9573 1.00000.9697 0.6849 0.9345 0.9382 0.7818 0.9398 0.97860.6306 0.5039 0.5324 0.6725 0.1988 0.5754 0.64100.8891 0.7393 0.8683 0.8778 0.7668 0.8557 0.81310.5822 0.7033 0.7501 0.6570 0.6333 0.7632 0.80480.8936 0.7671 0.8698 0.8967 0.7545 0.8484 0.80980.9670 0.7444 0.9098 0.9533 0.7354 0.9003 0.89840.9139 0.7918 0.9114 0.9169 0.7935 0.8968 0.86370.7702 0.6759 0.7088 0.8082 0.3938 0.7568 0.79170.9622 0.7177 0.9159 0.9328 0.7834 0.9112 0.91020.9233 0.8345 0.8970 0.9477 0.7268 0.8874 0.85570.5167 0.3819 0.5056 0.5321 0.2787 0.5518 0.64930.8816 0.7841 0.9493 0.8819 0.8475 0.9298 0.93980.9190 0.6208 0.8405 0.8690 0.7404 0.8248 0.80820.9737 0.6540 0.8717 0.9248 
0.7189 0.8752 0.87040.8463 0.7681 0.9522 0.8479 0.9303 0.9305 0.91850.7479 0.6382 0.7077 0.7824 0.4322 0.7487 0.80270.7349 0.6557 0.6838 0.7792 0.3784 0.7232 0.76990.9866 0.6961 0.8997 0.9481 0.7154 0.8971 0.90850.9700 0.6877 0.8878 0.9358 0.7358 0.8825 0.87850.6247 0.9617 0.8305 0.7468 0.7604 0.8218 0.74900.9794 0.7137 0.9027 0.9491 0.7330 0.9029 0.90580.9792 0.6980 0.8862 0.9558 0.6784 0.8777 0.8879

Columns 29 through 35

0.9090 0.7394 0.8043 0.8287 0.8259 0.8927 0.86300.8047 0.3741 0.6443 0.7583 0.6214 0.6820 0.68760.9795 0.6021 0.9127 0.6477 0.9145 0.9768 0.93990.8521 0.9413 0.5437 0.8407 0.5609 0.7402 0.61660.8842 0.8757 0.5826 0.9123 0.5979 0.7568 0.66320.8838 0.3190 0.9850 0.3899 0.9828 0.9627 0.98060.9545 0.5388 0.9567 0.5759 0.9628 0.9888 0.97630.8850 0.6457 0.6320 0.8997 0.6301 0.7383 0.70270.9749 0.6793 0.8852 0.6521 0.8963 0.9712 0.92080.8049 0.3610 0.9511 0.5012 0.9689 0.9148 0.96680.7916 0.9306 0.4452 0.9263 0.4626 0.6436 0.53350.9272 0.8371 0.6793 0.8069 0.6913 0.8435 0.74230.8739 0.2543 0.9810 0.3972 0.9724 0.9435 0.97430.6835 0.9547 0.2715 0.8014 0.2885 0.5050 0.35450.6992 0.9813 0.3342 0.7653 0.3604 0.5689 0.41420.5910 0.8440 0.4371 0.7952 0.4827 0.5764 0.51810.8907 0.3212 0.9801 0.3919 0.9761 0.9606 0.97640.9646 0.5780 0.9126 0.6972 0.9159 0.9639 0.94450.6811 0.9173 0.4224 0.8544 0.4596 0.6018 0.51020.9485 0.4731 0.9465 0.5099 0.9474 0.9739 0.95780.9614 0.5535 0.9310 0.6882 0.9393 0.9698 0.96740.9697 0.6306 0.8891 0.5822 0.8936 0.9670 0.91390.6849 0.5039 0.7393 0.7033 0.7671 0.7444 0.79180.9345 0.5324 0.8683 0.7501 0.8698 0.9098 0.91140.9382 0.6725 0.8778 0.6570 0.8967 0.9533 0.91690.7818 0.1988 0.7668 0.6333 0.7545 0.7354 0.79350.9398 0.5754 0.8557 0.7632 0.8484 0.9003 0.89680.9786 0.6410 0.8131 0.8048 0.8098 0.8984 0.86371.0000 0.6454 0.8624 0.7191 0.8614 0.9474 0.90150.6454 1.0000 0.3057 0.7741 0.3339 0.5304 0.38640.8624 0.3057 1.0000 0.4199 0.9885 0.9517 0.98780.7191 0.7741 0.4199 1.0000 0.4358 0.5735 0.51390.8614 0.3339 0.9885 0.4358 1.0000 0.9575 0.99150.9474 0.5304 0.9517 0.5735 0.9575 1.0000 0.97170.9015 0.3864 0.9878 0.5139 0.9915 0.9717 1.00000.7894 0.9573 0.5172 0.8311 0.5393 0.7029 0.59600.9526 0.4697 0.9636 0.5544 0.9631 0.9876 0.97930.8982 0.4896 0.9677 0.5563 0.9799 0.9747 0.98570.6040 0.9290 0.1639 0.8545 0.1795 0.4035 0.25740.9339 0.5161 0.8768 0.7138 0.8819 0.9091 0.91580.8747 0.2898 0.9765 0.3549 0.9715 0.9507 0.96800.9377 0.4652 0.9522 0.4792 
0.9528 0.9776 0.96060.9059 0.3472 0.9218 0.6330 0.9149 0.9030 0.94480.7928 0.9588 0.4789 0.8791 0.5015 0.6735 0.55950.7659 0.9702 0.4658 0.8514 0.4963 0.6651 0.54950.9607 0.5551 0.9353 0.5617 0.9388 0.9806 0.95560.9379 0.4578 0.9659 0.4918 0.9694 0.9867 0.97680.6975 0.4436 0.7411 0.7428 0.7609 0.7305 0.78960.9575 0.5219 0.9532 0.5496 0.9560 0.9876 0.97040.9397 0.5673 0.9261 0.5433 0.9357 0.9761 0.9468

Columns 36 through 42

0.8785 0.8702 0.9045 0.6489 0.9031 0.7481 0.82930.5285 0.7217 0.6304 0.5175 0.8127 0.6250 0.65330.7587 0.9775 0.9432 0.5127 0.9118 0.9203 0.96640.9735 0.7037 0.6802 0.9035 0.7370 0.5424 0.69030.9386 0.7371 0.7057 0.8850 0.8028 0.5707 0.70320.5223 0.9769 0.9552 0.1833 0.8697 0.9977 0.97850.7173 0.9884 0.9796 0.4099 0.9230 0.9547 0.98260.7583 0.7573 0.6942 0.7378 0.8557 0.6158 0.69560.8204 0.9602 0.9392 0.5648 0.9043 0.8959 0.96000.5713 0.9091 0.9745 0.1928 0.8666 0.8922 0.87740.9575 0.6111 0.5950 0.9486 0.7040 0.4208 0.57680.9194 0.8139 0.7782 0.8046 0.8075 0.6873 0.80810.4688 0.9685 0.9347 0.1520 0.8764 0.9904 0.95840.9179 0.4708 0.4266 0.9750 0.5355 0.2872 0.46200.9513 0.5164 0.4992 0.9418 0.5437 0.3467 0.52220.8847 0.5060 0.6229 0.7208 0.5980 0.3385 0.46590.5232 0.9783 0.9480 0.1943 0.8706 0.9982 0.98050.7495 0.9652 0.9449 0.5086 0.9384 0.8989 0.94620.9375 0.5407 0.6044 0.8534 0.6293 0.3687 0.51400.6455 0.9840 0.9439 0.3626 0.9024 0.9729 0.98830.7385 0.9720 0.9661 0.4708 0.9590 0.9113 0.94200.7702 0.9622 0.9233 0.5167 0.8816 0.9190 0.97370.6759 0.7177 0.8345 0.3819 0.7841 0.6208 0.65400.7088 0.9159 0.8970 0.5056 0.9493 0.8405 0.87170.8082 0.9328 0.9477 0.5321 0.8819 0.8690 0.92480.3938 0.7834 0.7268 0.2787 0.8475 0.7404 0.71890.7568 0.9112 0.8874 0.5518 0.9298 0.8248 0.87520.7917 0.9102 0.8557 0.6493 0.9398 0.8082 0.87040.7894 0.9526 0.8982 0.6040 0.9339 0.8747 0.93770.9573 0.4697 0.4896 0.9290 0.5161 0.2898 0.46520.5172 0.9636 0.9677 0.1639 0.8768 0.9765 0.95220.8311 0.5544 0.5563 0.8545 0.7138 0.3549 0.47920.5393 0.9631 0.9799 0.1795 0.8819 0.9715 0.95280.7029 0.9876 0.9747 0.4035 0.9091 0.9507 0.97760.5960 0.9793 0.9857 0.2574 0.9158 0.9680 0.96061.0000 0.6579 0.6777 0.8802 0.6993 0.4914 0.63900.6579 1.0000 0.9662 0.3664 0.9221 0.9686 0.98520.6777 0.9662 1.0000 0.3274 0.9024 0.9365 0.94810.8802 0.3664 0.3274 1.0000 0.4777 0.1600 0.34150.6993 0.9221 0.9024 0.4777 1.0000 0.8530 0.87850.4914 0.9686 0.9365 0.1600 0.8530 1.0000 0.97660.6390 0.9852 0.9481 0.3415 
0.8785 0.9766 1.00000.5576 0.9339 0.9067 0.3259 0.9468 0.8953 0.88870.9768 0.6322 0.6367 0.9226 0.6885 0.4547 0.61150.9872 0.6149 0.6358 0.9087 0.6750 0.4394 0.59700.7235 0.9805 0.9551 0.4358 0.9174 0.9501 0.98270.6406 0.9934 0.9664 0.3287 0.8991 0.9819 0.99410.6190 0.7194 0.8175 0.3708 0.8183 0.6242 0.63820.7004 0.9890 0.9664 0.4021 0.9176 0.9621 0.98870.7284 0.9723 0.9573 0.4249 0.8900 0.9419 0.9740

Correlation between variables


Columns 43 through 49

0.8431 0.8637 0.8606 0.8783 0.8453 0.8632 0.88020.8600 0.5809 0.5226 0.6756 0.6615 0.6759 0.68730.9018 0.7495 0.7253 0.9791 0.9699 0.7094 0.97840.6070 0.9830 0.9750 0.7609 0.6840 0.5733 0.73890.7059 0.9618 0.9425 0.7716 0.7024 0.6677 0.75450.9062 0.4834 0.4707 0.9575 0.9872 0.6640 0.96980.9118 0.6845 0.6757 0.9909 0.9905 0.7422 0.99510.8389 0.8037 0.7596 0.7463 0.7044 0.7293 0.74050.8581 0.7970 0.7881 0.9830 0.9610 0.6920 0.97750.8934 0.5222 0.5305 0.8776 0.9051 0.8699 0.89980.5786 0.9823 0.9709 0.6584 0.5740 0.6102 0.63920.7289 0.9248 0.9102 0.8574 0.7984 0.6211 0.84210.9331 0.4407 0.4186 0.9316 0.9696 0.6794 0.94900.3833 0.9409 0.9350 0.5464 0.4484 0.3672 0.51460.3806 0.9625 0.9624 0.6040 0.5067 0.3904 0.56930.4650 0.8569 0.8954 0.5514 0.4843 0.7438 0.54320.9085 0.4859 0.4710 0.9601 0.9880 0.6513 0.97050.9304 0.7428 0.7190 0.9581 0.9484 0.7873 0.96220.4991 0.9309 0.9437 0.6012 0.5200 0.6794 0.58330.8998 0.6248 0.6027 0.9871 0.9912 0.6489 0.98560.9507 0.7214 0.7062 0.9611 0.9567 0.8227 0.96840.8463 0.7479 0.7349 0.9866 0.9700 0.6247 0.97940.7681 0.6382 0.6557 0.6961 0.6877 0.9617 0.71370.9522 0.7077 0.6838 0.8997 0.8878 0.8305 0.90270.8479 0.7824 0.7792 0.9481 0.9358 0.7468 0.94910.9303 0.4322 0.3784 0.7154 0.7358 0.7604 0.73300.9305 0.7487 0.7232 0.8971 0.8825 0.8218 0.90290.9185 0.8027 0.7699 0.9085 0.8785 0.7490 0.90580.9059 0.7928 0.7659 0.9607 0.9379 0.6975 0.95750.3472 0.9588 0.9702 0.5551 0.4578 0.4436 0.52190.9218 0.4789 0.4658 0.9353 0.9659 0.7411 0.95320.6330 0.8791 0.8514 0.5617 0.4918 0.7428 0.54960.9149 0.5015 0.4963 0.9388 0.9694 0.7609 0.95600.9030 0.6735 0.6651 0.9806 0.9867 0.7305 0.98760.9448 0.5595 0.5495 0.9556 0.9768 0.7896 0.97040.5576 0.9768 0.9872 0.7235 0.6406 0.6190 0.70040.9339 0.6322 0.6149 0.9805 0.9934 0.7194 0.98900.9067 0.6367 0.6358 0.9551 0.9664 0.8175 0.96640.3259 0.9226 0.9087 0.4358 0.3287 0.3708 0.40210.9468 0.6885 0.6750 0.9174 0.8991 0.8183 0.91760.8953 0.4547 0.4394 0.9501 0.9819 0.6242 0.96210.8887 0.6115 0.5970 0.9827 
0.9941 0.6382 0.98871.0000 0.5562 0.5264 0.8900 0.9065 0.8271 0.90680.5562 1.0000 0.9874 0.6899 0.6073 0.6046 0.66790.5264 0.9874 1.0000 0.6798 0.5957 0.6020 0.65730.8900 0.6899 0.6798 1.0000 0.9867 0.6814 0.99240.9065 0.6073 0.5957 0.9867 1.0000 0.6793 0.99310.8271 0.6046 0.6020 0.6814 0.6793 1.0000 0.69930.9068 0.6679 0.6573 0.9924 0.9931 0.6993 1.00000.8684 0.6916 0.6858 0.9819 0.9799 0.6804 0.9837

Column 50

0.87330.64410.97460.76000.75830.94950.98370.72100.97810.88110.65260.84460.92130.54300.60550.56990.94970.94740.61190.97560.95300.97920.69800.88620.95580.67840.87770.88790.93970.56730.92610.54330.93570.97610.94680.72840.97230.95730.42490.89000.94190.97400.86840.69160.68580.98190.97990.68040.98371.0000

Correlation between variables

Pairwise correlations are difficult to interpret when many variables are involved. Multivariate data analysis tools are needed.
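A correlation matrix like the one listed above can be obtained with `numpy.corrcoef`. The sketch below uses simulated data (a few hidden sources plus noise) as a stand-in for the real 30 x 50 table; all numerical values in it are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
# Illustrative 30 x 50 table: three hidden sources plus noise, so that
# many columns are strongly correlated with each other.
sources = rng.random((30, 3))
mixing = rng.random((3, 50))
X = sources @ mixing + 0.01 * rng.standard_normal((30, 50))

# Correlation between variables: rowvar=False treats columns as variables.
R = np.corrcoef(X, rowvar=False)

# Number of distinct variable pairs that would have to be inspected.
n_pairs = R.shape[0] * (R.shape[0] - 1) // 2
print(R.shape, n_pairs)
```

With 50 variables there are already 1225 distinct pairs, which illustrates why inspecting the correlations one by one does not scale.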


Pair-wise correlations

[Figure: pseudocolor correlation maps; labels: variables, samples.]

CORRMAP: correlation map with variable grouping. CORRMAP produces a pseudocolor map which shows the correlation between variables (columns) in a data set.


• What is the SVD of a data matrix X?
– Singular value decomposition.
– The singular values are the square roots of the eigenvalues (of $X^T X$).
– $X = U S V^T$; $U$ and $V^T$ are orthonormal matrices and $S$ is a diagonal matrix with the singular values.
– SVD is an orthogonal matrix decomposition.
– The elements in $S$ are ordered according to the variance explained by each component.
– Variance is concentrated in the first components, which allows reducing the number of variables that explain the variance structure and filtering the noise.

$$X = U S V^T + E$$

$$x_{ij} = \sum_{k=1}^{K} u_{ik}\, s_k\, v_{kj} + e_{ij}, \qquad i = 1,\dots,I,\quad j = 1,\dots,J,\quad K \ll I \text{ or } J$$
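The properties listed for the SVD can be verified numerically. This is a minimal sketch with `numpy.linalg.svd` on a random matrix; the matrix size and the choice K = 4 are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((30, 50))   # illustrative I x J data matrix

# Economy-size SVD: X = U S V^T, singular values in decreasing order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values are the square roots of the eigenvalues of X X^T (or X^T X).
eigvals = np.linalg.eigvalsh(X @ X.T)[::-1]   # sorted in decreasing order

# Truncated model with K components and residual matrix E.
K = 4
X_K = U[:, :K] @ np.diag(s[:K]) @ Vt[:K, :]
E = X - X_K

# Fraction of the total sum of squares explained by each component.
explained = s**2 / np.sum(s**2)
print(np.round(explained[:K], 3))
```

The residual sum of squares of the K-component model equals the sum of the squared singular values that were left out.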


Effect of data pretreatments on SVD: plot(svds)

[Figure, four panels showing the first 10 singular values for: raw data X; mean-centered data X; scaled data X; autoscaled data X. Annotation: 4 larger components.]
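The comparison shown in this figure can be reproduced along the following lines. The sketch assumes column-wise mean-centering and scaling by the column standard deviation, and uses simulated data, so the singular values it prints are not those of the slides.

```python
import numpy as np

rng = np.random.default_rng(3)
# Illustrative positive data whose columns have very different scales.
X = rng.random((30, 50)) * np.linspace(1.0, 100.0, 50)

mean = X.mean(axis=0)
std = X.std(axis=0, ddof=1)

pretreated = {
    "raw": X,
    "mean-centered": X - mean,
    "scaled": X / std,
    "autoscaled": (X - mean) / std,
}

# Singular values for each pretreatment (the values a singular value plot shows).
sv = {name: np.linalg.svd(data, compute_uv=False)
      for name, data in pretreated.items()}

for name, values in sv.items():
    print(name, np.round(values[:5], 2))
```

Without centering, the first singular value mostly reflects the column means, so it is larger than the first singular value of the centered data.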


Methods of data pretreatment: effect of pretreatments

[Figure, two panels showing the first 10 singular values. Raw data X: 4 components. log10 data X: how many components? Non-linearity?]


Principal Components (PCA): Data Representation

• Reduce the dimensions of the original space.
• Keep the relevant information about data variance.
• PC1: direction of maximum variation.

[Figure: data points in the space of the original variables (x1, x2) with means m1, m2, and the same points projected onto the first principal component, PC1.]


Principal Components (PCA): Data Representation

• Reduce the dimensions of the original space.
• Keep the relevant information about data variance.
• Do not repeat information among PCs (they are orthogonal).
• PC2: direction of maximum remaining variance, ⊥ PC1.

[Figure: data points in the space of the original variables (x1, x2, x3) with the directions PC1 and PC2, and the same points represented in the PC1-PC2 plane.]


Principal Components (PCA): Geometrical interpretation

Sample   λ1   λ2   λ3
a        1    2    3
b        3    3    5
c        4    5    8
d        5    4    7
e        2    1    2
f        0    0    0

[Figure: samples a to f plotted in the space of the three variables λ1, λ2, λ3.]


Principal Components (PCA)

[Figure: after PCA, samples a to f plotted in the plane of the first two principal components, PC1 and PC2.]
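The six-sample table of the geometrical example is small enough to run through PCA directly. The sketch below assumes mean-centering as pretreatment (the slides do not state which one was used) and computes the PCA through an SVD.

```python
import numpy as np

# The six samples (a to f) and three variables from the geometrical example.
labels = ["a", "b", "c", "d", "e", "f"]
X = np.array([[1, 2, 3],
              [3, 3, 5],
              [4, 5, 8],
              [5, 4, 7],
              [2, 1, 2],
              [0, 0, 0]], dtype=float)

# Assumption: mean-centering; the slide does not state the pretreatment.
Xc = X - X.mean(axis=0)

U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s            # coordinates of the samples on PC1, PC2, PC3
loadings = Vt.T           # columns are the PC directions
explained = s**2 / np.sum(s**2)

for name, (t1, t2) in zip(labels, scores[:, :2]):
    print(f"{name}: PC1 = {t1:6.2f}, PC2 = {t2:6.2f}")
print("explained variance fraction:", np.round(explained, 3))
```

Under this assumption most of the variance falls on PC1, so the PC1-PC2 plane keeps nearly all the information of the three original variables. The signs of the scores are arbitrary, so a plot may appear mirrored with respect to the figure.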


Principal Components (PCA)

[Figure: cloud of points in the space of λ1, λ2, λ3. PC1: direction of maximum variance. PC2: orthogonal to PC1.]


Principal Components (PCA)

• They are mathematical variables which efficiently describe the data variance.
– The relevant variance of the original data is described by a reduced number of components (PCs).
– Large data sets (many variables) can be visualized in the PC space of reduced dimensions.
• Information is not repeated (no overlap; PCs are orthogonal).
• They describe the main directions of data variance in decreasing order.
• They are linear combinations of the original variables.


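Principal Components (PCA)

• They are linear combinations of the original variables.

$$t_1 = x_1 p_{11} + x_2 p_{21}$$

$p_{11}$, $p_{21}$: loadings of the original variables in the first PC, $t_1$.

[Figure: data points in the (x1, x2) plane with means m1, m2; the score t1 of a point is built from the contributions x1·p11 and x2·p21.]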


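PCA Model

Relationship between the data in the PC space and in the original space.

PCs are linear combinations of the original variables:

$$T = XP, \qquad T\,(n, 2),\quad X\,(n, 3),\quad P\,(3, 2)$$

$$t_{j1} = x_{j1}\, p_{11} + x_{j2}\, p_{21} + x_{j3}\, p_{31}$$

[Diagram: row j of T, (tj1, tj2), equals row j of X, (xj1, xj2, xj3), multiplied by the loadings matrix P with columns p1 and p2.]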


PCA Model

$T = X P$, with dimensions (n, npc) = (n, m)(m, npc); n (samples), m (variables), npc (principal components).

T (scores matrix)
• Describes the samples in the principal components space.
• The score vectors are orthogonal: $t_i^T t_j = 0$ for $i \neq j$.

P (loadings matrix)
• Describes the original variables in the principal components space.
• The loading vectors are orthonormal: $p_i^T p_j = 0$ for $i \neq j$, $\|p_i\| = 1$, $P^T P = I$.

$X = T P^T$
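The relations between X, T and P can be checked numerically. The sketch below takes the loadings from the SVD of a mean-centered random matrix; the sizes n = 20, m = 6 and npc = 2 are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)
n, m, npc = 20, 6, 2
X = rng.standard_normal((n, m))
X = X - X.mean(axis=0)            # mean-centered data (assumed pretreatment)

# Loadings P (m x npc): first npc right singular vectors of X.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = Vt[:npc].T

T = X @ P                          # scores (n x npc)
E = X - T @ P.T                    # residuals of the npc-component model

print(np.round(P.T @ P, 6))        # identity: loadings are orthonormal
print(np.round(T.T @ T, 6))        # diagonal: scores are orthogonal
```

With fewer components than variables, $T P^T$ only approximates X; the part that is left out is the residual matrix E.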


PCA Model: $X = T P^T + E$

T: scores; $P^T$: loadings (projections); E: residuals.

$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_n p_n^T + E$$

Each term $t_k p_k^T$ is a rank-1 matrix. Here n is the number of components (≪ number of variables in X).


PCA Model

Model: $X = T P^T + E$ (X = structure + noise)

It is an approximation to the experimental data matrix X.

• Loadings, projections: $P^T$. Relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in $P^T$ (loadings) are orthonormal (orthogonal and normalized).

• Scores, targets: T. Relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in T (scores) are orthogonal.

• Noise: E. Experimental error, non-explained variance.


PCA Model: determination of the number of components

• When the expected experimental error is known:
– Plots of explained or residual variance as a function of the number of components of the model.
  • E.g., are models explaining 95% of the variance satisfactory?
– Compare mean residual values with the size of the experimental error.
  • E.g., absorbance errors in UV are approx. 0.002.

$$\bar{e} = \sqrt{\frac{\sum_{i,j} e_{ij}^2}{n \times m}}$$

n × m: number of elements of the X matrix.

Number of PCs → $\bar{e} \le$ error (0.002).
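The criterion for a known experimental error can be sketched as follows, taking the mean residual as the root mean square of the elements of E (an assumption about how the slide's formula is read). The data are simulated with three components and noise of size 0.002, so every number here is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, true_rank, noise_sd = 30, 50, 3, 0.002

# Simulated data: rank-3 structure plus noise of known size.
X = rng.standard_normal((n, true_rank)) @ rng.standard_normal((true_rank, m))
X = X + noise_sd * rng.standard_normal((n, m))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

def rms_residual(k):
    """Root mean square of the residuals E = X - T P^T for a k-component model."""
    E = X - (U[:, :k] * s[:k]) @ Vt[:k]
    return np.sqrt(np.sum(E**2) / (n * m))

# Smallest number of components whose mean residual is within the known error.
n_pc = next(k for k in range(1, min(n, m) + 1) if rms_residual(k) <= noise_sd)
print(n_pc, rms_residual(n_pc))
```

Models with too few components leave residuals far above the error level, while adding components beyond the selected one only fits noise.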


PCA Model: determination of the number of components

• When the experimental error is unknown:
– Plot of singular values (or of eigenvalues).
– Empirical functions related to experimental errors.
– Cross-validation methods.


PCA Model: determination of the number of components. Example

[Figure: data set of n chromatograms and m spectra.]

How many components coeluted? The number of coeluting components is deduced from the number of principal components.


PCA Model: determination of the number of components

– Plot of singular values ($s_k$) or of functions of the eigenvalues ($\lambda_k$):
  • Singular values or eigenvalues vs. number of PCs ($s_k$).
  • Log(eigenvalues) vs. number of PCs ($\lambda_k$).
  • Log(reduced eigenvalues) vs. number of PCs ($\mathrm{REV}_k$).

$$\mathrm{REV}(k) = \frac{\lambda_k}{(r-k+1)(c-k+1)}, \qquad \lambda_k = (s_k)^2$$

r: number of rows; c: number of columns.

Size of $s_k$ / $\lambda_k$ / $\mathrm{REV}_k$ ∝ importance of the associated PC.
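The reduced eigenvalues can be computed directly from the formula. The sketch below uses the first ten eigenvalues listed in the results table near the end of these slides; the matrix size (96 x 51) is not stated anywhere in the slides and is inferred here from the ratios eigenvalue/REV in that table, so it should be read as an assumption.

```python
# First ten eigenvalues as listed in the results table of these slides.
eigenvalues = [5.4068e+01, 1.2172e+00, 6.6753e-02, 1.8613e-02, 2.1991e-03,
               2.0762e-03, 1.9640e-03, 1.9102e-03, 1.7319e-03, 1.6755e-03]

# Assumed matrix size: inferred from the table, where eigenvalue / REV gives
# r * c = 4896 for k = 1 and (r - 1) * (c - 1) = 4750 for k = 2.
r, c = 96, 51

def reduced_eigenvalue(lam, k, r, c):
    """REV(k) = lambda_k / ((r - k + 1) * (c - k + 1)), with k starting at 1."""
    return lam / ((r - k + 1) * (c - k + 1))

rev = [reduced_eigenvalue(lam, k, r, c)
       for k, lam in enumerate(eigenvalues, start=1)]

for k, value in enumerate(rev, start=1):
    print(k, f"{value:.4e}")
```

With these assumed dimensions the computed values agree with the REV column of the table: they drop sharply over the first four components and are nearly constant afterwards.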


PCA Model: determination of the number of components

[Figure, two panels: log(eigenvalues) and log(REV) vs. number of PCs.]

4 significant components. The remaining components only explain the experimental noise.


PCA Model: determination of the number of components

– Evaluation of empirical functions related to the error.
  • Eigenvalue functions: they take advantage of the relation between the explained variance and the size of the eigenvalues.
  • These functions have minima or considerable changes in size at the optimal number of PCs.


Malinowski error functions

Indicator Function (IND)

$$\mathrm{IND} = \frac{\mathrm{RSD}}{(c-n)^2}, \qquad \mathrm{RSD} = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda_k^0}{r\,(c-n)}}$$

$c$ = number of columns; $r$ = number of rows; $n$ = number of components; $\lambda_k^0$ = eigenvalue of component $k$

Minimum IND → optimal number of PCs

PCA Model: Determination of the number of components


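Malinowski error functions

$$\mathrm{RSD} = \sqrt{\frac{\sum_{k=n+1}^{c} \lambda_k^0}{r\,(c-n)}}, \qquad \mathrm{IND} = \frac{\mathrm{RSD}}{(c-n)^2}$$

Imbedded error: $\mathrm{IE} = \mathrm{RSD}\,\sqrt{n/c}$

Extracted error: $\mathrm{XE} = \mathrm{RSD}\,\sqrt{(c-n)/c}$

Statistical test of the eigenvalues (Malinowski)

$$F(1, s-n) = \frac{\sum_{k=n+1}^{s} (r-k+1)(c-k+1)}{(r-n+1)(c-n+1)} \cdot \frac{\lambda_n}{\sum_{k=n+1}^{s} \lambda_k^0}$$

PCA Model: Determination of the number of components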
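The error functions above can be evaluated directly from the eigenvalues. The following is a minimal sketch, not the course's own code: the function name `malinowski_functions` and the synthetic data are made up for illustration, and it assumes that the eigenvalues are the squared singular values of X and that r ≥ c.

```python
import numpy as np

def malinowski_functions(eigvals, r, c):
    """REV for k = 1..c, and RSD, IND, IE, XE for n = 1..c-1 components.

    eigvals: the c eigenvalues (squared singular values of X), largest first.
    r, c: number of rows and columns of X (this sketch assumes r >= c).
    """
    lam = np.asarray(eigvals, dtype=float)
    k = np.arange(1, c + 1)
    rev = lam / ((r - k + 1) * (c - k + 1))          # reduced eigenvalues
    n = np.arange(1, c)                              # candidate numbers of PCs
    tail = np.array([lam[m:].sum() for m in n])      # sum of lambda_k, k = n+1..c
    rsd = np.sqrt(tail / (r * (c - n)))              # residual standard deviation
    ind = rsd / (c - n) ** 2                         # indicator function
    ie = rsd * np.sqrt(n / c)                        # imbedded error
    xe = rsd * np.sqrt((c - n) / c)                  # extracted error
    return rev, rsd, ind, ie, xe

# Synthetic example (made-up data): 3 underlying components plus small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3)) @ rng.normal(size=(3, 8)) + 0.01 * rng.normal(size=(30, 8))
s = np.linalg.svd(X, compute_uv=False)
rev, rsd, ind, ie, xe = malinowski_functions(s**2, *X.shape)
print("n at minimum IND:", 1 + int(np.argmin(ind)))
```

Note that IE² + XE² = RSD² for every n, which is a quick sanity check on an implementation.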


n    Eigenvalue    RE            IND           REV           %SL (F test)
1    5.4068e+001   1.6716e-002   6.6864e-006   1.1043e-002   0
2    1.2172e+000   5.1353e-003   2.1388e-006   2.5625e-004   -3.5756e-005
3    6.6753e-002   3.5263e-003   1.5305e-006   1.4493e-005   1.6084e-003
4    1.8613e-002   2.9281e-003   1.3255e-006   4.1695e-006   3.7223e-001
5    2.1991e-003   2.8744e-003   1.3584e-006   5.0858e-007   2.9015e+001
6    2.0762e-003   2.8223e-003   1.3937e-006   4.9599e-007   2.9475e+001
7    1.9640e-003   2.7715e-003   1.4316e-006   4.8495e-007   2.9895e+001
8    1.9102e-003   2.7198e-003   1.4710e-006   4.8779e-007   2.9620e+001
9    1.7319e-003   2.6728e-003   1.5152e-006   4.5770e-007   3.1087e+001
10   1.6755e-003   2.6254e-003   1.5618e-006   4.5852e-007   3.0982e+001

• Eigenvalues and REVs for n > 4 are much smaller.
• IND has a minimum at n = 4.
• RE decreases in size.
• The eigenvalue of PC 4 is significantly larger than the following ones.

PCA Model: Determination of the number of components


– Cross-validation methods.
• One part of the data is used to build the model and another part is described by this model. The optimal number of components is the one giving the lowest residuals in the description of the new data.
• This procedure is repeated until all the samples have been used both to build the model and as external data set.
• The final results are the mean, over all repetitions, of the results obtained in the description of the left-out samples.

PCA Model: Determination of the number of components


1. Divide the sample set into q subsets.
2. Build PCA models with q − 1 data subsets (Xmodel).
3. Use these PCA models to explain the external data subset (Xextern).
   i. Scores: $T_{extern} = X_{extern} P$
   ii. Reproduction of Xextern: $\hat{X}_{extern} = T_{extern} P^T$
4. Calculate PRESS (Predictive Residual Sum of Squares) using different numbers of PCs.
   i. For PC k: $\mathrm{PRESS}(k) = \sum_{i,j} (\hat{x}_{ij} - x_{ij})^2$
5. Repeat steps 2-4 until all q subsets have been used as external data sets.
6. Plot PRESScum vs. number of PCs.
   i. For PC k: $\mathrm{PRESS}_{cum}(k) = \sum_{q} \mathrm{PRESS}(k)$

q = number of PCA models

PCA Model: Determination of the number of components
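The six steps above can be sketched as follows. This is an illustrative sketch only: the function name `press_cv`, the interleaved choice of subsets and the synthetic data are assumptions, not part of the slides. One caveat that follows from the algebra: when the left-out rows are simply projected onto the loadings, PRESS cannot increase as PCs are added, so with this literal scheme one looks for the point where the curve levels off; variants that produce a true minimum (for example, leaving out individual matrix elements) are not shown here.

```python
import numpy as np

def press_cv(X, max_pcs, q=5):
    """Cumulative PRESS for 1..max_pcs PCs, using q interleaved sample subsets."""
    X = np.asarray(X, dtype=float)
    r = X.shape[0]
    folds = [np.arange(f, r, q) for f in range(q)]   # assumed split: every q-th sample
    press_cum = np.zeros(max_pcs)
    for ext in folds:
        in_model = np.ones(r, dtype=bool)
        in_model[ext] = False
        mean = X[in_model].mean(axis=0)              # centre with the model subset only
        X_model, X_extern = X[in_model] - mean, X[ext] - mean
        _, _, Vt = np.linalg.svd(X_model, full_matrices=False)
        for k in range(1, max_pcs + 1):
            P = Vt[:k].T                             # loadings, one column per PC
            T_extern = X_extern @ P                  # step 3.i
            X_hat = T_extern @ P.T                   # step 3.ii
            press_cum[k - 1] += np.sum((X_hat - X_extern) ** 2)   # steps 4 and 6
    return press_cum

# Synthetic example (made-up data): 2 underlying components plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(40, 6))
print(np.round(press_cv(X, max_pcs=6), 4))
```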


Each external subset Xextern is described by the PCA model built from the remaining data (Xmodel1 ... Xmodeln), using 1 to m PCs:

            1PC          2PC          ....   mPC
Xmodel1     PRESS11      PRESS12      ....   PRESS1m
Xmodel2     PRESS21      PRESS22      ....   PRESS2m
...
Xmodeln     PRESSn1      PRESSn2      ....   PRESSnm
Sum         PRESScum,1   PRESScum,2   ....   PRESScum,m

PRESScum,i = Σj PRESSji

PCA Model: Determination of the number of components


[Figure: PRESScum (0 to 3.5) vs. number of PCs (1 to 10).]

PC Nr.   PRESScum
1        3.4772
2        0.2320
3        0.1117
4        0.0505
5        0.0515
6        0.0517
7        0.0521
8        0.0524
9        0.0535
10       0.0531

Optimal number of PCs → minimum value of PRESS

PCA Model: Determination of the number of components (cross-validation)


How many principal components?

[Figure: explained variance vs. number of PCs, with curves labelled "Model fitting", "PC reliability" and "Model reliability".]

• Higher PCs explain data noise.
• A PCA model with noisy PCs is less reliable when it describes new data.
• With more PCs in the model, the data fitting improves, but the reliability of the model when it is applied to new data may get worse (overfitting).

PCA Model: Determination of the number of components


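PCA model visualization

$X = T P^T$

Scores plot (map of samples in PC space): sample $m_j$ is plotted at $(t_{j1}, t_{j2})$, its scores on PC1 and PC2 taken from $T$.

Loadings plot (map of variables in PC space): variable $x_j$ is plotted at $(p_{j1}, p_{j2})$, its loadings on PC1 and PC2 taken from $P$.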


Data matrix X (9,4): 9 samples (rows), 4 variables (columns)

Sample   Sn     Zn     Fe     Ni
1        0.20   4.10   0.06   0.0022
2        0.20   4.04   0.04   0.0020
3        0.15   3.82   0.08   0.0015
4        0.61   3.96   0.09   0.0062
5        0.57   4.07   0.08   0.0056
6        0.58   4.23   0.07   0.0055
7        0.30   3.73   0.02   0.0033
8        0.60   3.80   0.07   0.0056
9        0.10   4.23   0.05   0.0014

Example: PCA model visualization
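As a rough illustration of how scores and loadings for this data matrix could be obtained with the three pretreatments discussed in this example (original, centered and autoscaled data), here is a minimal sketch. The helper name `pca_svd` is hypothetical, and autoscaling is assumed to divide by the sample standard deviation (n − 1 in the denominator). The signs of the PCs returned by an SVD are arbitrary, so plots may appear mirrored with respect to the figures.

```python
import numpy as np

# Sn, Zn, Fe, Ni values for the 9 samples of the example
X = np.array([
    [0.20, 4.10, 0.06, 0.0022],
    [0.20, 4.04, 0.04, 0.0020],
    [0.15, 3.82, 0.08, 0.0015],
    [0.61, 3.96, 0.09, 0.0062],
    [0.57, 4.07, 0.08, 0.0056],
    [0.58, 4.23, 0.07, 0.0055],
    [0.30, 3.73, 0.02, 0.0033],
    [0.60, 3.80, 0.07, 0.0056],
    [0.10, 4.23, 0.05, 0.0014],
])

def pca_svd(data, n_pcs=2):
    """Scores T, loadings P and all singular values of a data matrix."""
    U, s, Vt = np.linalg.svd(data, full_matrices=False)
    T = U[:, :n_pcs] * s[:n_pcs]     # scores: coordinates of the samples
    P = Vt[:n_pcs].T                 # loadings: one column per PC
    return T, P, s

centered = X - X.mean(axis=0)
autoscaled = centered / X.std(axis=0, ddof=1)

for name, data in [("original", X), ("centered", centered), ("autoscaled", autoscaled)]:
    T, P, s = pca_svd(data)
    explained = 100 * s**2 / np.sum(s**2)    # % of the sum of squares per PC
    print(name, "PC1/PC2 %:", np.round(explained[:2], 1))
    print("  loadings (Sn, Zn, Fe, Ni):", np.round(P.T, 2))
```

Plotting the first column of `T` against the second gives the samples map, and plotting the rows of `P` gives the map of variables.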


Scores plot
• The distance between samples shows their similarity (samples 5, 6 and samples 4, 8 are very similar).
• Detection of sample groups (clusters) (I and II).
• External information can help to identify the nature of the detected groups (e.g. sample origin, ...).
• Very distant samples are extreme samples.

[Figure: samples map for the original data; scores of samples 1 to 9 on PC1 and PC2, with clusters I and II marked.]

Example: PCA model visualization


Loadings plot

• Variables relevant for the model have high loadings and lie far from the origin (Zn, Sn).
• Variables close to the origin do not give information about the data variance (Fe, Ni).
• A high loading on one PC shows a high weight in this component (Zn on PC1, Sn on PC2).
• Variables correlated with the lower-numbered PCs are more important.
• Correlations between variables are described by the angle between them (Zn and Sn are weakly correlated).
• Positive (direct) and negative (inverse) correlations between variables can be detected.

[Figure: map of variables for the original data; loadings of Sn, Zn, Fe and Ni on PC1 and PC2.]

Example: PCA model visualization


Centered data

• Samples close to the origin are similar to the average sample. This cannot be seen when the samples are separated in groups.
• The importance of the variables changes because centering removes the weight of the scale size. Sn, related to PC1, is now more important than Zn.

[Figure: scores plot (samples 1 to 9) and loadings plot (Sn, Zn, Fe, Ni) on PC1 and PC2 for the centered data.]

Example: PCA model visualization


Autoscaled data

• Autoscaling removes the scale size and range. The effect of all variables is enhanced.
• Correlation information among variables is seen more clearly.

[Figure: scores plot (samples 1 to 9) and loadings plot (Sn, Zn, Fe, Ni) on PC1 and PC2 for the autoscaled data.]

Example: PCA model visualization


Interpretation of the ‘scores’ (targets)

• Objects (samples) close to the origin are ‘ordinary’ objects.
• Objects (samples) far from the origin are ‘extreme’ objects or ‘outliers’.
• Objects close to each other are ‘similar’ objects.
• Objects far from each other are ‘different’ objects.
• Objects can be grouped in ‘clusters’, which share common characteristics and differ from the other ‘clusters’ ==> Cluster Analysis.
• The set of objects should cover the whole scores plot; otherwise there are ‘clusters’.
• Identification of the principal components may be achieved from the external identification of the clusters of objects.
• Simultaneous analysis of the ‘loadings’ and ‘scores’ plots helps to identify and interpret the principal components.

Principal Components (PCA)


Scores and Loadings

• ‘scores’ show the relationships between samples

• ‘loadings’ show the relationships between variables

• ‘scores’ and ‘loadings’ should be interpreted in pairs

• ‘scores’ and ‘loadings’ are plotted against one another

Principal Components (PCA)


Outliers

• Outlier samples can have a great influence(leverage) in the PCA model

• They can be detected

• To detect them, find:
– isolated samples in scores plots
– samples with large values of Q or T2 or both

Principal Components (PCA)


Detection of anomalous objects

Extreme objects:
• Different from the rest of the objects.

Outlier objects:
• Extreme objects that cannot be fitted by the model.

[Figure: objects in the (x1, x2) plane with the PC1 axis, marking one extreme object (O. extreme) and one outlier object (O. outlier).]

An object (sample) is extreme or an outlier when one or more of its variables have values very different from those of the other samples.

Principal Components (PCA)


Outlier detection

Why is it needed?
• Outliers distort the model.
• They hide the structure of the rest of the data.

When should they be eliminated, and how?
• When it is justified mathematically and chemically.
• It should be done gradually, starting with the most outstanding ones.

[Figure: objects in the (x1, x2) plane with the PC1 axis, marking one outlier object (O. outlier).]

Principal Components (PCA)


Outlier Detection: Mathematical Indicators

• Extreme objects

– Mahalanobis distance (to the mean of the model), for centered data:

$$d_i^2 = (n_s - 1) \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

$d_i$ = distance of object (sample) $i$

– Leverage. Related to the size of the scores:

$$h_i = \frac{1}{n_s} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

$h_i$ = ‘leverage’ of sample $i$

– $T^2$. Related to the size of the scores:

$$T_i^2 = \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

• Outlier objects

– Size of the residual:

$$Q_i = e_i^2 = \sum_{j} e_{ij}^2$$

$j$ = variable index

$n_s$ = number of samples; $t_{i,k}$ = score value of sample $i$ on component $k$; $\lambda_k$ = eigenvalue of component $k$; $n$ = number of PCs
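A minimal sketch of how these indicators might be computed for mean-centered data is given below. The function name `outlier_statistics` and the test data are hypothetical, and the sketch assumes that $\lambda_k$ is the squared singular value of the centered matrix, following the relation $\lambda_k = (s_k)^2$ used earlier in these slides; other texts scale the eigenvalues by $n_s - 1$, which changes the values by a constant factor.

```python
import numpy as np

def outlier_statistics(X, n_pcs):
    """Leverage h_i, squared Mahalanobis distance d_i^2 and residual Q_i per sample."""
    X = np.asarray(X, dtype=float)
    ns = X.shape[0]
    Xc = X - X.mean(axis=0)                       # centered data
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :n_pcs] * s[:n_pcs]                  # scores
    P = Vt[:n_pcs].T                              # loadings
    lam = s[:n_pcs] ** 2                          # assumed: lambda_k = s_k^2
    ratio = np.sum(T**2 / lam, axis=1)            # sum_k t_ik^2 / lambda_k
    leverage = 1.0 / ns + ratio
    d2 = (ns - 1) * ratio
    E = Xc - T @ P.T                              # residuals left out of the model
    Q = np.sum(E**2, axis=1)
    return leverage, d2, Q

# Made-up data: 12 samples, 5 variables, with sample 0 made extreme on purpose.
rng = np.random.default_rng(2)
X = rng.normal(size=(12, 5))
X[0] *= 6.0
h, d2, Q = outlier_statistics(X, n_pcs=2)
print("sample with highest leverage:", int(np.argmax(h)))
print("sample with largest residual Q:", int(np.argmax(Q)))
```

Plotting `Q` against `h` gives a leverage plot of the kind used to separate extreme objects from outliers.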


Detection of outliers
• Leverage plot.

[Figure: leverage plot, residual (Q) vs. leverage (hi), with regions labelled "Objects poorly described by the model", "Outliers" and "Extreme objects".]

To be used with a low number of PCs! (Higher PCs are used to describe outliers.)

Similar plots of Q as a function of T2 are used in process control.


PCA Statistics

Residuals statistic Q, to measure the lack of fit (large residuals):

$$Q_i = e_i e_i^T = x_i (I - P_k P_k^T) x_i^T$$

Samples with large $Q_i$ values are unusual (they are out of the model!)

Hotelling statistic $T^2$:

$$T_i^2 = t_i \lambda^{-1} t_i^T = x_i P_k \lambda^{-1} P_k^T x_i^T$$

($\lambda$: diagonal matrix of the eigenvalues of the $k$ retained components)

Samples with large values of $T_i^2$ are unusual (they are inside the model with high leverage!)

These statistics are used to develop control charts and limits in Statistical Process Control.

Principal Component Analysis (PCA)


PCA Statistics

$Q_i = e_i e_i^T = x_i (I - P_k P_k^T) x_i^T$: variation outside the PCA model

$T_i^2 = t_i \lambda^{-1} t_i^T = x_i P_k \lambda^{-1} P_k^T x_i^T$: variation inside the PCA model

Principal Component Analysis (PCA)


Summary: Building PCA model

• Need for data pretreatment.
• Determination of the number of principal components of the PCA model.
• Detection and elimination of outliers.
• Repetition of the previous steps.
• Assessment of the model quality and reliability.


Singular Value Decomposition (SVD)

• What is SVD?
– PCA is often performed using SVD.
– SVD is an orthogonal matrix decomposition.
– $X = U S V^T$; $U$ and $V^T$ are orthonormal matrices and $S$ is a diagonal matrix with the singular values.
– The elements of $S$ are ordered according to the variance explained by each component.
– Variance is concentrated in the first components; this allows reducing the number of variables needed to explain the variance structure and filtering the noise.
– Singular values are the square roots of the eigenvalues.

$$x_{ij} = \sum_{k=1}^{K} u_{ik}\, s_k\, v_{kj}, \qquad X = U S V^T = T P^T$$

scores: $T = U S$; loadings: $P^T = V^T$
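The relations between SVD and PCA stated above can be checked numerically. This is a small sketch on made-up random data, assuming the eigenvalues meant are those of $X^T X$ for mean-centered $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
X = X - X.mean(axis=0)                        # mean-centered data

U, s, Vt = np.linalg.svd(X, full_matrices=False)
T = U * s                                     # scores   T = U S
P = Vt.T                                      # loadings P = V

# Eigenvalues of X^T X, sorted from largest to smallest
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]

print(np.allclose(X, T @ P.T))                # X = T P^T
print(np.allclose(s**2, eigvals))             # singular values squared = eigenvalues
print(np.allclose(P.T @ P, np.eye(3)))        # loadings are orthonormal
print(np.allclose(T.T @ T, np.diag(s**2)))    # scores are orthogonal
```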


Principal Components in m dimensions

$t_1 = p_{11}x_1 + p_{12}x_2 + \dots + p_{1m}x_m$
$t_2 = p_{21}x_1 + p_{22}x_2 + \dots + p_{2m}x_m$
...
$t_m = p_{m1}x_1 + p_{m2}x_2 + \dots + p_{mm}x_m$

Linear combinations of the original variables (already mean centred)

$$\begin{pmatrix} t_1 \\ t_2 \\ \vdots \\ t_m \end{pmatrix} = \begin{pmatrix} p_{11} & p_{12} & \dots & p_{1m} \\ p_{21} & p_{22} & \dots & p_{2m} \\ \vdots & \vdots & & \vdots \\ p_{m1} & p_{m2} & \dots & p_{mm} \end{pmatrix} \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_m \end{pmatrix}$$

(left: principal components; right: original variables)

Principal Component Analysis (PCA)


PCA Model: $X = T P^T + E$

$T$: scores; $P^T$: loadings (projections); $E$: residuals

$$X = t_1 p_1^T + t_2 p_2^T + \dots + t_n p_n^T + E$$

Each term $t_k p_k^T$ is a matrix of rank 1.

$n$ = number of components (<< number of variables in $X$)

Principal Component Analysis (PCA)


Model: $X = T P^T + E$
X = structure + noise

It is an approximation to the experimental data matrix X.

• Loadings, projections ($P^T$): relationships between the original variables and the principal components (eigenvectors of the covariance matrix). The vectors in $P^T$ (loadings) are orthonormal (orthogonal and normalized).

• Scores, targets ($T$): relationships between the samples (coordinates of the samples or objects in the space defined by the principal components). The vectors in $T$ (scores) are orthogonal.

• Noise ($E$): experimental error, non-explained variance.

Principal Component Analysis (PCA)


PCA Model: $X = T P^T + E$
Determination of the number of principal components A

For X(n,m): when n > m, m is the maximum number of PCs, Amax = m; when m > n, n is the maximum number of PCs, Amax = n.

In general, a much smaller number of PCs is used (‘data compression’, ‘data reduction’): A << n or m.

A is chosen so that $T P^T$ contains most of the relevant structure of X, whereas the noise remains in E (the noise is of no interest and should be filtered out).

To select the appropriate number of PCs, A, the matrix E (residuals, lack of fit, ...) has to be studied: quantification of the variance in E, e.g. by the residual variance in %.

Principal Component Analysis (PCA)


Determination of the number of principal components

a) Visual inspection of the magnitude of the singular values. Graphical representation (search for an inflection).
b) Representation of the explained/residual variance with respect to the number of principal components.
c) For autoscaled data, keep components until their λ is approximately 1-2.
d) When the noise level is known, add PCs until the residual variance is similar to the noise variance.
e) Keep PCs as long as their ‘loadings’ show structural features (not noise).
f) Use statistical tests and methods based on previous knowledge of the size of the experimental noise.
g) Approximate methods when the experimental noise is not known:
• Malinowski error functions
• cross validation
• ...

Principal Component Analysis (PCA)


Determination of the number of PCs from eigenvalue/singular value plots

[Figure: eigenvalue/singular value plot (vertical axis 0 to 450) vs. component number (0 to 12), indicating 4 components.]

Principal Component Analysis (PCA)


How many principal components?

[Figure: explained variance vs. number of PCs, with curves labelled "Model fitting", "PC reliability" and "Model reliability".]

With more PCs in the model, the data fitting improves, but the reliability of the model when it is applied to new data may get worse (overfitting).

Principal Component Analysis (PCA)


Determination of the number of principal components

Cross-validation methods
- A data subset is eliminated from the original data matrix X, giving Xr.
- A PCA model with k components is estimated for Xr.
- The eliminated data subset is predicted with k components, and the predicted values are compared with the actual values.

X → (subset x eliminated) → Xr → PCA with k PCs: $X_r = T_k P_k^T$

PCA projection of the eliminated subset: $\hat{x} = x P_k P_k^T$

Evaluation of $(\hat{x} - x)$

Principal Component Analysis (PCA)


Determination of the number of principal components

Cross-validation methods
- A data subset is eliminated from the original data matrix X, giving Xr.
- A PCA model with k components is estimated for Xr.
- The eliminated data subset is predicted with k components, and the predicted values are compared with the actual values.

$$\mathrm{PRESS}(k) = \sum_{i=1}^{r} \sum_{j=1}^{c} \left(x_{ij} - \hat{x}_{ij}(k)\right)^2$$

PRESS is plotted against the number of components k considered, and one looks for the minimum value of PRESS or for the point where it no longer decreases.

Principal Component Analysis (PCA)


Loadings

[Figure: samples (o) in the (x1, x2) plane with the PC1 axis; the x1 loading and the x2 loading of PC1 are marked.]

Loadings are orthonormal: $P^T P = I$ (and $P^T = P^{-1}$ when all components are kept)

Principal Component Analysis (PCA)


Loadings interpretation

• Determination of the most important variables in the formation of the principal components (variables with large loadings, either negative or positive, are important)

• Multivariate correlation between variables
– positive correlation (common variation)
– negative correlation (contrary variation)

• Identification of, and qualitative information (fingerprinting) on, the sources of variation

Principal Component Analysis (PCA)


Scores

[Figure: samples (o) in the (x1, x2) plane with the PC1 axis; the score t1 of one sample is marked on PC1.]

Projection of X onto the PCs (loadings) gives the ‘scores’: $T = XP$

Principal Component Analysis (PCA)



Outliers in scores plots

[Figure: scatter plot of the sample scores (x) on PC1 vs. PC2.]

Principal Component Analysis (PCA)


Detection of ‘outliers’

From scores plots ==> outlier samples
From loadings plots ==> outlier variables

Leverage: samples or variables that strongly affect the PCA model. It is evaluated from the expression:

$$h_i = \frac{1}{n_s} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k}$$

$h_i$ = ‘leverage’ of sample $i$; $t_{i,k}$ = ‘score’ of sample $i$ on component $k$; $\lambda_k$ = eigenvalue of component $k$; $n_s$ = number of samples; $n$ = number of components considered

Principal Component Analysis (PCA)



Multiple linear regression (MLR) is a method used to model the linear relationship between a dependent variable (predictand) and one or more independent variables (predictors).

MLR is based on least squares: the model is fit such that the sum of squares of the differences between observed and predicted values is minimized.

The performance of the model on data not used to fit it is usually checked by a process called validation.

The reconstruction is a "prediction" in the sense that the regression model is applied to generate estimates of the predictand for data different from those used to fit the model. The uncertainty in the reconstruction is summarized by confidence intervals, which can be computed in various alternative ways.

Multivariate (Multiple) Linear Regression (MLR)


MLR Model

$$y_i = b_0 + b_1 x_{i,1} + b_2 x_{i,2} + \dots + b_K x_{i,K} + e_i$$

$x_{i,j}$ = value of the $j$th predictor in sample $i$
$b_0$ = regression constant
$b_j$ = coefficient on the $j$th predictor
$K$ = total number of predictors
$y_i$ = predictand in sample $i$
$e_i$ = error term

In vector-matrix form: $y = Xb + e$

Multivariate Linear Regression
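A minimal sketch of fitting this model by least squares with NumPy follows. The helper names `fit_mlr` and `predict_mlr` and the small data set are made up for illustration; the regression constant $b_0$ is handled by adding a column of ones to X.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares estimate of b = (b0, b1, ..., bK) in y = Xb + e."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(X)), X])     # column of ones for b0
    b, *_ = np.linalg.lstsq(A, y, rcond=None)
    return b

def predict_mlr(b, X):
    """Predicted values y_hat = b0 + b1*x1 + ... + bK*xK for the rows of X."""
    X = np.asarray(X, dtype=float)
    return b[0] + X @ b[1:]

# Made-up, noise-free data generated from y = 1 + 2*x1 - 3*x2
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 1.0]])
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]

b = fit_mlr(X, y)
residuals = y - predict_mlr(b, X)                 # e_i = y_i - y_hat_i
print(np.round(b, 6))
```

With noise-free data the fit recovers the generating coefficients and the residuals vanish; with real data the residuals carry the error term $e_i$.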


MLR Predictions

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_K x_{i,K}$$

$x_{i,k}$ = value of predictor $k$ in new sample $i$
$\hat{b}_0, \hat{b}_1, \dots, \hat{b}_K$ = estimated regression constant and coefficients
$\hat{y}_i$ = predicted value for new sample $i$

In matrix-vector form: $\hat{y} = X\hat{b}$

Multivariate Linear Regression


MLR Prediction

$$\hat{y}_i = \hat{b}_0 + \hat{b}_1 x_{i,1} + \hat{b}_2 x_{i,2} + \dots + \hat{b}_K x_{i,K}$$

Measurement i might be outside the range used for calibration or validation

Multivariate Linear Regression


MLR Residuals

$$\hat{e}_i = y_i - \hat{y}_i$$

$y_i$ = observed value of the predictand in sample $i$
$\hat{y}_i$ = predicted value of the predictand in sample $i$

Multivariate Linear Regression


MLR Assumptions

1. Relationships are linear

2. Predictors are nonstochastic

3. Residuals have zero mean

4. Residuals have constant variance

5. Residuals are not autocorrelated

6. Residuals are normally distributed

Multivariate Linear Regression


1. Relationship may be nonlinear or outlier-driven

2. Correlation ≠ causation

3. Statistical significance ≠ practical significance

4. Lagged effects not measured

5. Problems when X variables (predictors) are strongly correlated (common in practice)

Caveats/Problems to interpretation

Multivariate Linear Regression


Alternatives to MLR

Nonlinearity?
• Data transformation (or use of a kernel) and try MLR again
• Quadratic response surfaces
• Neural networks
• Nonparametric regression (e.g., kernel regression)

Categorical predictand?
• Discriminant analysis
• Classification trees
• Logistic regression

Correlation among predictors?
• Reduce the number of variables
• Stepwise regression
• Factor analysis based methods

Multivariate Linear Regression


MLR Statistics

• R2 -- explanatory power

• Adjusted R2: R2 adjusted for loss of degrees of freedom due to number of predictors in model

• F and its p-value -- significance of the equation

• se: standard error of the estimate; equivalent to the “root mean square error” (RMSEc); the subscript “c” denotes “calibration”

• Confidence interval for parameters


MLR ANOVA Table (testing linearity)

Source               df       Sum of Squares   Mean squares
Regression (model)   K        SSR              MSR = SSR/K
Residual             n-K-1    SSE              MSE = SSE/(n-K-1)
Total                n-1      SST
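The ANOVA quantities and the summary statistics (R², adjusted R², F, se) can be computed directly from a least-squares fit. The following is a minimal sketch on synthetic data; the function name `mlr_anova` and the data are illustrative assumptions, and the p-value of F is left out because it needs an F-distribution routine (for example from SciPy).

```python
import numpy as np

def mlr_anova(X, y):
    """Fit y = b0 + X b by least squares and return the ANOVA quantities."""
    n, K = X.shape                                 # n samples, K predictors
    Xd = np.column_stack([np.ones(n), X])          # add the intercept column
    b, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    y_hat = Xd @ b
    SSE = float(np.sum((y - y_hat) ** 2))          # residual sum of squares
    SST = float(np.sum((y - y.mean()) ** 2))       # total sum of squares
    SSR = SST - SSE                                # regression sum of squares
    MSR, MSE = SSR / K, SSE / (n - K - 1)          # mean squares
    R2 = 1.0 - SSE / SST
    R2_adj = 1.0 - (1.0 - R2) * (n - 1) / (n - K - 1)
    return {"b": b, "SSR": SSR, "SSE": SSE, "SST": SST, "MSR": MSR, "MSE": MSE,
            "F": MSR / MSE, "R2": R2, "R2_adj": R2_adj, "se": float(np.sqrt(MSE))}

# Synthetic example (assumed data, not from the slides)
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=30)
stats = mlr_anova(X, y)
print({k: v for k, v in stats.items() if k != "b"})
```

Note that se here is sqrt(MSE), that is, it uses the n-K-1 residual degrees of freedom from the table.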


Validating the MLR regression model

• Regression R-squared, even if adjusted for the loss of degrees of freedom due to the number of predictors in the model, can give a misleading, overly optimistic view of the accuracy of prediction when the model is applied outside the calibration period.

• Several approaches to validation are available. Among these are cross-validation and split-sample validation.

• In cross-validation, a series of regression models is fit, each time deleting a different observation from the calibration set and using the model to predict the predictand for the deleted observation. The merged series of predictions for the deleted observations is then checked for accuracy against the observed data.

• In split-sample validation, the model is fit to some portion of the data (say, the second half), and accuracy is measured on the predictions for the other half of the data. The calibration and validation periods are then exchanged and the process repeated.
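Both validation schemes described above can be sketched in a few lines. This is an illustrative sketch on synthetic data, assuming an ordinary least-squares model with intercept; the helper names (`fit_ols`, `predict_ols`, `loo_cv_rmse`, `split_sample_rmse`) are hypothetical.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients, intercept first."""
    Xd = np.column_stack([np.ones(len(y)), X])
    return np.linalg.lstsq(Xd, y, rcond=None)[0]

def predict_ols(b, X):
    return np.column_stack([np.ones(X.shape[0]), X]) @ b

def loo_cv_rmse(X, y):
    """Leave-one-out cross-validation: refit n times, predict the deleted sample."""
    n = len(y)
    pred = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        pred[i] = predict_ols(fit_ols(X[keep], y[keep]), X[i:i + 1])[0]
    return float(np.sqrt(np.mean((y - pred) ** 2)))

def split_sample_rmse(X, y):
    """Fit on one half, validate on the other, then exchange the halves."""
    half = len(y) // 2
    first, second = slice(0, half), slice(half, None)
    errors = []
    for cal, val in ((second, first), (first, second)):
        coef = fit_ols(X[cal], y[cal])
        resid = y[val] - predict_ols(coef, X[val])
        errors.append(float(np.sqrt(np.mean(resid ** 2))))
    return errors

# Synthetic example (assumed data)
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = 0.5 + X @ np.array([1.0, -2.0, 0.5]) + 0.2 * rng.normal(size=40)
rmse_cal = float(np.sqrt(np.mean((y - predict_ols(fit_ols(X, y), X)) ** 2)))
rmse_cv = loo_cv_rmse(X, y)
rmse_split = split_sample_rmse(X, y)
print(rmse_cal, rmse_cv, rmse_split)
```

For a least-squares fit the leave-one-out error is never smaller than the calibration error computed on the same samples, which illustrates why calibration statistics alone are optimistic.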


Model Calibration vs Validation

Calibration
1. Fitting the model to the data
2. “calibration”, “construction”, “estimation” data
3. Accuracy statistics: R², adjusted R² (Ra²), SSEc, MSEc, RMSEc

Validation
1. Testing the model on data not used to fit the model
2. “validation”, “verification”, “independent” data
3. Accuracy statistics: RE, SSEv, MSEv, RMSEv

Definitions: validation, cross-validation, split-sample validation, mean square error (MSE), root-mean-square error (RMSE), standard error of prediction, PRESS statistic, "hat" matrix, extrapolation vs interpolation

Advantages of cross-validation over alternative validation methods


Cross validation stopping rule

[Figure: cross-validation plot with the stopping point marked “Stop here”]


Error bars for MLR predictions

1. Standard error of the estimate (calibration statistic)

2. Standard error of prediction (calibration statistic)

3. Root-mean-square error of validation (validation statistic)

Hierarchy


Standard Error of MLR Prediction (equation for simple linear regression)

$$ s_{\hat{y}} = s_e \left[ 1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2} \right]^{1/2} $$

• $s_e = \sqrt{MSE} = RMSE_c$: standard error of the estimate

• $1/n$: term due to uncertainty in the estimate of the predictand mean; a function of the sample size $n$

• $(x^* - \bar{x})^2 / \sum_i (x_i - \bar{x})^2$: term due to departure of the predictor value for the predicted observation from the predictor mean for the calibration period
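The effect of the third term can be seen numerically: the standard error of prediction grows as the new predictor value moves away from the calibration mean. This is a small sketch on synthetic data; the function name `prediction_se` is hypothetical, and the standard error of the estimate is assumed to use n-2 degrees of freedom (simple linear regression).

```python
import numpy as np

def prediction_se(x, y, x_new):
    """Standard error of prediction for simple linear regression.

    Returns (s_yhat, s_e), where s_e is the standard error of the estimate.
    """
    n = len(x)
    b1, b0 = np.polyfit(x, y, 1)                    # slope, intercept
    resid = y - (b0 + b1 * x)
    s_e = np.sqrt(np.sum(resid ** 2) / (n - 2))     # assumed n-2 degrees of freedom
    sxx = np.sum((x - x.mean()) ** 2)
    s_yhat = s_e * np.sqrt(1.0 + 1.0 / n + (x_new - x.mean()) ** 2 / sxx)
    return float(s_yhat), float(s_e)

# Synthetic calibration line (assumed data)
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 11)
y = 2.0 + 0.5 * x + 0.3 * rng.normal(size=x.size)

se_at_mean, s_e = prediction_se(x, y, x.mean())     # third term is zero here
se_far, _ = prediction_se(x, y, 20.0)               # far outside the calibration range
print(se_at_mean, se_far)
```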


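Least Squares and linear regression

Univariate linear regression

$$ y_i = b_0 + b_1 x_i + e_i, \qquad y_i = f(x_i), \qquad s_y \gg s_x \approx 0 $$

$$ b_1 = \frac{\sum_i (x_i - \bar{X})(y_i - \bar{Y})}{\sum_i (x_i - \bar{X})^2} = \frac{\sum_i (x_i - \bar{X})\, y_i}{\sum_i (x_i - \bar{X})^2} $$

$$ s_{b_1}^2 = \frac{s_y^2}{\sum_i (x_i - \bar{X})^2} = \frac{s_y^2}{S_{XX}}, \qquad s_{b_1} = \frac{s_y}{\sqrt{\sum_i (x_i - \bar{X})^2}} $$

$$ s_{b_0}^2 = s_y^2 \left( \frac{\sum_i x_i^2}{n \sum_i (x_i - \bar{X})^2} \right) $$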


Ordinary Least Squares (OLS)
Multivariate Linear Regression

n experimental measurements, m variables
X = {xij}, n x m matrix of independent variables
y = {y1, y2, ..., yn}, dependent variable
b = {b1, b2, ..., bm}, m parameters of the linear model (an intercept b0 is obtained by adding a column of ones to X)
Assumption: experimental errors are only important for yi

$$ y = X b, \qquad b = (X^T X)^{-1} X^T y $$

$$ s^2(b) = (A^T A)^{-1} s^2(y) = (X^T X)^{-1} s^2(y) $$

where A = {Aij}, $A_{ij} = \partial r_i / \partial b_j = -x_{ij}$; $r_i = y_i - y_{c,i}$ are the residuals (observed minus calculated values), and $s^2(y)$ is estimated from:

$$ s^2(y) = \sum_i r_i^2 / (n - m) $$
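As a rough illustration, the OLS estimate and the variance of the parameters can be computed as follows. The function name `ols` and the synthetic data are assumptions; the residual variance uses n - m degrees of freedom (n measurements, m fitted parameters), and the explicit inverse is used only to mirror the formulas (a solver such as `np.linalg.lstsq` is numerically preferable).

```python
import numpy as np

def ols(X, y):
    """b = (X^T X)^-1 X^T y, with the covariance matrix of b."""
    n, m = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y
    r = y - X @ b                        # residuals
    s2_y = float(r @ r) / (n - m)        # residual variance, n - m degrees of freedom
    cov_b = XtX_inv * s2_y               # s^2(b) = (X^T X)^-1 s^2(y)
    return b, cov_b, s2_y

# Synthetic example (assumed data); the column of ones gives the intercept
rng = np.random.default_rng(0)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
b_true = np.array([1.0, 2.0, -0.5])
y = X @ b_true + 0.1 * rng.normal(size=n)

b, cov_b, s2_y = ols(X, y)
print(b, np.sqrt(np.diag(cov_b)))        # estimates and their standard errors
```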


Weighted Least Squares (WLS)
Multivariate Linear Regression

n experimental measurements
X = {xij}, n x m matrix of independent variables
y = {y1, y2, ..., yn}, dependent variable
b = {b1, b2, ..., bm}, m parameters of the linear model
W, n x n diagonal matrix of weights that accounts for the measurement errors (error standard deviations si; wi = 1/si; Wii = wi²)

Unweighted:
$$ y = X b, \qquad b = (X^T X)^{-1} X^T y, \qquad s^2(b) = (A^T A)^{-1} s^2(y) $$

Weighted:
$$ y = X b, \qquad b = (X^T W X)^{-1} X^T W y, \qquad s^2(b) = (A^T W A)^{-1} s^2(y) $$

where A = {Aij}, $A_{ij} = \partial r_i / \partial b_j$; $r_i = y_i - y_{c,i}$ are the residuals, and $s^2(y)$ is estimated from:

$$ s^2(y) = \sum_i w_i^2 r_i^2 / (n - m) $$


Generalized Least Squares (GLS)
Multivariate Linear Regression

n experimental measurements
X = {xij}, n x m matrix of independent variables
y = {y1, y2, ..., yn}, dependent variable
b = {b1, b2, ..., bm}, m parameters of the linear model
M = {mij}, n x n matrix of weights calculated from the errors in X and y (it is more complex!)

$$ y = X b, \qquad b = (X^T M X)^{-1} X^T M y, \qquad s^2(b) = (A^T M A)^{-1} s^2(y) $$

where A = {Aij}, $A_{ij} = \partial r_i / \partial b_j$; $r_i = y_i - y_{c,i}$ are the residuals, and $s^2(y)$ is estimated from:

$$ s^2(y) = \sum_i r_i^2 / (n - m) $$


Interpolation vs Extrapolation

Interpolation: prediction based on predictor data “similar” to that in the calibration range

Extrapolation: prediction based on predictor data “unlike” that in the calibration range


Classifying MLR predicted values

$$ H = X (X^T X)^{-1} X^T $$
“Hat” matrix, computed from calibration-only predictor data

$$ h^* = x^{*T} (X^T X)^{-1} x^* $$
Classification statistic; x* is the vector of predictor data for some observation outside the calibration range

$$ h^* > h_{max} $$
Rule identifying “extrapolation”; hmax is the maximum value along the diagonal of the hat matrix
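The rule can be tried on a toy calibration set. This is a minimal sketch; the function name `extrapolation_flags` and the data are illustrative assumptions, and X is assumed to include a column of ones for the intercept.

```python
import numpy as np

def extrapolation_flags(X_cal, X_new):
    """Flag new observations whose h* exceeds the largest calibration leverage."""
    G = np.linalg.inv(X_cal.T @ X_cal)
    h_cal = np.einsum("ij,jk,ik->i", X_cal, G, X_cal)   # diagonal of the hat matrix H
    h_new = np.einsum("ij,jk,ik->i", X_new, G, X_new)   # h* for each new observation
    h_max = float(h_cal.max())
    return h_new > h_max, h_new, h_max

# Calibration predictor in [0, 1]; new observations at 0.5 and 5.0 (assumed data)
x_cal = np.linspace(0.0, 1.0, 11)
X_cal = np.column_stack([np.ones_like(x_cal), x_cal])
X_new = np.array([[1.0, 0.5], [1.0, 5.0]])

flags, h_new, h_max = extrapolation_flags(X_cal, X_new)
print(flags, h_new, h_max)
```

The observation at 0.5 lies at the calibration mean and is classed as interpolation, while the one at 5.0 is flagged as extrapolation.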


Conventions & Notation in Calibration

Data are arranged in two blocks/tables/matrices, X and Y, where:
X = matrix of predictor variables
Y = matrix of predicted (predictand) variables
ns = number of samples/observations
nx = number of variables in X
ny = number of variables in Y
n = number of PCs/latent variables/components

[Diagram: matrix X (ns samples x nx predictor variables) and matrix Y, or vector y (ns samples x ny predictand variables)]

Y = f(X): find f


Multivariate Linear Regression: Causal vs Predictive Model

Causal models: X = f(Y) (1)
Predictive (inverse) models: Y = f(X) or y = f(X) (2)

Independent variables (predictors) vs dependent variables (predictands)

Example:
X (R) is the matrix of multivariate (instrumental) responses for the different samples
Y is the matrix of concentrations of one or more chemical components in the different samples
y (c) is the concentration of one component in the samples
f is the calibration function, causal in (1) and predictive in (2); in linear models, f is a linear function


Multicomponent Analysis: Bilinear Model

$$ R = C S^T + E $$
dimensions: (ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

R: matrix of sensor responses (ns samples, nw wavelengths)

C: matrix of concentrations (ns samples, nc components)

ST: matrix of sensitivities (nc components, nw wavelengths)

E: matrix of experimental errors (ns samples, nw wavelengths)

$$
\begin{pmatrix} r_{1,1} & \cdots & r_{1,nw} \\ r_{2,1} & \cdots & r_{2,nw} \\ \vdots & & \vdots \\ r_{ns,1} & \cdots & r_{ns,nw} \end{pmatrix}
=
\begin{pmatrix} c_{1,1} & \cdots & c_{1,nc} \\ c_{2,1} & \cdots & c_{2,nc} \\ \vdots & & \vdots \\ c_{ns,1} & \cdots & c_{ns,nc} \end{pmatrix}
\begin{pmatrix} s_{1,1} & \cdots & s_{1,nw} \\ \vdots & & \vdots \\ s_{nc,1} & \cdots & s_{nc,nw} \end{pmatrix}
+
\begin{pmatrix} e_{1,1} & \cdots & e_{1,nw} \\ e_{2,1} & \cdots & e_{2,nw} \\ \vdots & & \vdots \\ e_{ns,1} & \cdots & e_{ns,nw} \end{pmatrix}
$$


• Advantages
  – Total selectivity is not needed
  – Allows multicomponent analysis
  – Outlier detection is possible

• Most used methods
  – MLR, Multilinear Regression
    • Classical Least Squares (CLS)
    • Inverse Least Squares (ILS)
  – Factor-based Linear Regression (biased)
    • Principal Components Regression (PCR)
    • Partial Least Squares Regression (PLSR)
  – Non-linear Regression


CLS model:

$$ R = C S^T + E $$
dimensions: (ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

The responses are modelled as a function of the concentrations. It is the same causal model as the generalized Beer’s law (generalized multilinear model).

Calibration step:
a) direct: the pure component spectra (sensitivities) are previously known; ST is known
b) indirect: the pure component spectra are not previously known; ST is unknown and has to be estimated in the calibration step:

$$ S^T = C^+ R, \qquad C^+ = (C^T C)^{-1} C^T \ \text{(pseudoinverse)} $$


Prediction step:

for a set of nunk samples with unknown analyte concentrations

$$ C_{unk} = R_{unk} (S^T)^+ $$
dimensions: (nunk,nc) = (nunk,nw)(nw,nc)

Cunk: matrix of concentrations of the nunk unknown samples (nunk samples, nc components)

Runk: matrix of their instrumental responses (nunk samples, nw wavelengths)

(ST)+: pseudoinverse of the sensitivities matrix (pure spectra) (nw wavelengths, nc components)

$$ (S^T)^+ = S (S^T S)^{-1} $$


Prediction step (one sample):

the concentrations of several analytes in one sample:

$$ c_{unk}^T = r_{unk}^T (S^T)^+ $$
dimensions: (1,nc) = (1,nw)(nw,nc)

or, equivalently:

$$ c_{unk} = S^+ r_{unk}, \qquad S^+ = (S^T S)^{-1} S^T $$
dimensions: (nc,1) = (nc,nw)(nw,1)
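The CLS calibration and prediction steps can be sketched with simulated two-component spectra. Everything in this example (Gaussian band shapes, noise level, concentrations, variable names) is an assumption chosen for illustration; `np.linalg.pinv` computes the pseudoinverses.

```python
import numpy as np

# Simulated system: nc = 2 components with Gaussian pure spectra (assumed data)
rng = np.random.default_rng(1)
ns, nc, nw = 10, 2, 50
wl = np.linspace(0.0, 1.0, nw)
S_true = np.column_stack([np.exp(-((wl - 0.3) / 0.1) ** 2),
                          np.exp(-((wl - 0.6) / 0.1) ** 2)])     # (nw, nc)
C = rng.uniform(0.1, 1.0, size=(ns, nc))                         # known concentrations
R = C @ S_true.T + 1e-3 * rng.normal(size=(ns, nw))              # measured responses

# Indirect calibration: S^T = C^+ R, with C^+ = (C^T C)^-1 C^T
St = np.linalg.pinv(C) @ R                                       # (nc, nw)

# Prediction: C_unk = R_unk (S^T)^+, with (S^T)^+ = S (S^T S)^-1
C_unk_true = np.array([[0.2, 0.7], [0.5, 0.5]])
R_unk = C_unk_true @ S_true.T
C_unk = R_unk @ np.linalg.pinv(St)                               # (nunk, nc)
print(C_unk)
```

All components contributing to the response are included in C here; if one were left out of the calibration, the estimated spectra and the predictions would be biased, which is the main limitation of CLS.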


Classical (Causal) Least Squares (CLS)

Advantages (compared to univariate least squares):
1. Increased precision of the estimates (signal averaging)
2. Allows the estimation of the pure responses (pure spectra) => qualitative information, identification
3. Allows multicomponent quantitative analysis

Disadvantages:
• Requires knowing and including the full information on all the components that contribute to the measured analytical response
• It does not allow calibration in the presence of unknown interferents (it is not used in the analysis of natural samples)


2. Inverse Calibration. Inverse Least Squares (ILS)

Model:

$$ c = R b + e \qquad (y = X b) $$
dimensions: (ns,1) = (ns,nw)(nw,1) + (ns,1)

The concentrations are modelled as a function of the instrumental responses. It is not a causal model; it is a predictive model. It only requires knowing the analyte concentration in the calibration samples.

Calibration:

$$ b = R^+ c \qquad (b = X^+ c) $$
dimensions: (nw,1) = (nw,ns)(ns,1)

b is the calibration vector, evaluated from the responses R of the calibration samples, for which the analyte concentration c is known.

$$ R^+ = (R^T R)^{-1} R^T \ \text{(pseudoinverse of R)}, \qquad X^+ = (X^T X)^{-1} X^T \ \text{(pseudoinverse of X)} $$


Prediction:

$$ c_{unk} = r_{unk}^T b \qquad (y_{unk} = x_{unk}^T b) $$
dimensions: (1,1) = (1,nw)(nw,1)

cunk is the concentration of the analyte in a new sample
rTunk is the instrumental response given by this sample
b is the calibration vector previously evaluated

R is not square => the generalized inverse or pseudoinverse has to be calculated: $R^+ = (R^T R)^{-1} R^T$

Problem: in the evaluation of the calibration vector $b = (R^T R)^{-1} R^T c$, the inverse $(R^T R)^{-1}$, of dimensions (nw,nw), has to be evaluated. The nw rows and nw columns should be linearly independent, and ns > nw (number of calibration samples > number of wavelengths)!
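The inversion problem, and the usual workaround of selecting a few wavelengths, can be illustrated as follows. This is a sketch on simulated noise-free data; the function name `ils_calibrate`, the spectra and the two selected wavelength indices are assumptions made for the example.

```python
import numpy as np

def ils_calibrate(R, c):
    """b = (R^T R)^-1 R^T c; refuses to run when R^T R (nw x nw) is singular."""
    if np.linalg.matrix_rank(R) < R.shape[1]:
        raise ValueError("R^T R is singular: the columns of R are not independent")
    return np.linalg.solve(R.T @ R, R.T @ c)

# Simulated mixtures of an analyte and an interferent (assumed data)
rng = np.random.default_rng(2)
ns, nw = 8, 40
wl = np.linspace(0.0, 1.0, nw)
S = np.column_stack([np.exp(-((wl - 0.3) / 0.1) ** 2),    # analyte spectrum
                     np.exp(-((wl - 0.6) / 0.1) ** 2)])   # interferent spectrum
C = rng.uniform(0.1, 1.0, size=(ns, 2))
R = C @ S.T                                               # noise-free responses

# Full spectrum: ns < nw, so R^T R cannot be inverted
try:
    ils_calibrate(R, C[:, 0])
    full_spectrum_ok = True
except ValueError:
    full_spectrum_ok = False

# Two selected wavelengths (hypothetical choice): calibration needs only the analyte
sel = [12, 23]
b = ils_calibrate(R[:, sel], C[:, 0])

r_unk = np.array([0.4, 0.9]) @ S.T       # new sample: analyte 0.4, interferent 0.9
c_unk = float(r_unk[sel] @ b)
print(full_spectrum_ok, c_unk)
```

Only the analyte concentrations enter the calibration, yet the interferent is handled correctly because its signal varies in the calibration samples.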


Inverse Least squares, ILS (inverse model)

Advantages
- Allows the determination of one analyte in the presence of unknown interferences (this is not possible with CLS!)
- Only needs the calibration information for one analyte

Disadvantages
- It does not use all the variables; it only uses a reduced number of selected variables (sensors or wavelengths)
- There is no increase in measurement precision (there is no signal averaging)


Methods based on Factor Analysis
• Factor decomposition of matrix X
• Resolve the collinearity problem in X
• Background noise filtering
• Improve precision (signal averaging)
• ‘Compression’ of the information into a reduced number of new variables or factors


Pretreatment Methods in Multivariate Regression/Calibration

As in PCA:
- linearization, if possible
- mean centering
- variance scaling, when the variables are in different units or differ considerably in magnitude
- outlier elimination is critical


Principal Component Regression PCR

1. Decomposition of X into factors by PCA

$$ X = T P^T + E $$
dimensions: (ns,nw) = (ns,nc)(nc,nw) + (ns,nw)

2. Multilinear regression (MLR) on the scores T (instead of on the original variables in X)

$$ y = T b $$
dimensions: (ns,1) = (ns,nc)(nc,1)


Principal Component Regression PCR

3. Evaluation of the regression vector b

$$ b = T^+ y = (T^T T)^{-1} T^T y $$

The PCA scores are orthogonal, so

$$ (T^T T)^{-1} = \mathrm{diag}(1/\lambda_i), \quad i = 1, \ldots, nc $$


Principal Component Regression PCR

4. Prediction of a new sample with response $x_{unk}^T$

Score of the new sample:

$$ t_{unk}^T = x_{unk}^T P $$
dimensions: (1,nc) = (1,nw)(nw,nc)

Prediction of its concentration:

$$ y_{unk} = t_{unk}^T b $$
dimensions: (1,1) = (1,nc)(nc,1)


Principal Component Regression PCR

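Direct calculation

$$ y_{cal} = X b_{cal}, \qquad b_{cal} = X^+ y_{cal} $$

$$ X^+ \approx X_{PCA}^+ = (T P^T)^+ = P (T^T T)^{-1} T^T $$

P has orthonormal columns: $P^T P = I$, so $(P^T)^+ = P$

T is orthogonal: $T^T T = \mathrm{diag}(\lambda_i)$

$$ y_{unk} = x_{unk}^T b_{cal} $$
dimensions: (1,1) = (1,nw)(nw,1)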
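The PCR steps (PCA decomposition, regression on the scores, regression vector in the original variables, prediction) can be sketched with a singular value decomposition. The function names `pcr_fit` and `pcr_predict`, the simulated spectra and the use of mean centering are assumptions made for this example.

```python
import numpy as np

def pcr_fit(X, y, nc):
    """PCR with nc components on mean-centred data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    P = Vt[:nc].T                      # loadings (nw, nc), P^T P = I
    T = Xc @ P                         # scores (ns, nc), T^T T = diag(lambda_i)
    lam = s[:nc] ** 2                  # lambda_i
    b_pc = (T.T @ yc) / lam            # (T^T T)^-1 T^T y
    b = P @ b_pc                       # regression vector for the original variables
    return b, x_mean, y_mean

def pcr_predict(model, X_new):
    b, x_mean, y_mean = model
    return (X_new - x_mean) @ b + y_mean

# Simulated two-component mixtures (assumed data); y is the first component
rng = np.random.default_rng(3)
ns, nw = 15, 30
wl = np.linspace(0.0, 1.0, nw)
S = np.column_stack([np.exp(-((wl - 0.35) / 0.12) ** 2),
                     np.exp(-((wl - 0.55) / 0.12) ** 2)])
C = rng.uniform(0.1, 1.0, size=(ns, 2))
X = C @ S.T + 1e-4 * rng.normal(size=(ns, nw))
y = C[:, 0]

model = pcr_fit(X, y, nc=2)
c_new = np.array([[0.3, 0.8]])
y_unk = pcr_predict(model, c_new @ S.T)
print(y_unk)
```

Here nw = 30 is larger than ns = 15, so the direct MLR inversion would fail, while PCR with nc = 2 components works.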


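PCR = PCA + MLR

[Diagram: PCA decomposes X (ns x nw) into scores T (ns x nc) and loadings PT (nc x nw); MLR then regresses y (ns x 1) on T (ns x nc) to obtain b (nc x 1)]

nc << nw: reduction of the number of variables!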


Principal Components Regression, PCR

PCR is one of the ways to deal with the inversion of ill-conditioned matrices in linear regression. The property of interest is modelled (regressed) on the PCA scores:

$$ y = T_k b_{PC} + e = X P_k b_{PC} + e $$

$$ b_{PC} = (T_k^T T_k)^{-1} T_k^T y $$
(regression vector from the PCA ‘scores’)

$$ b = P_k b_{PC}, \qquad y = X b $$
(regression vector from the original variables)

$$ X^+ = P_k (T_k^T T_k)^{-1} T_k^T $$
(the inverse matrix is calculated from orthogonal matrices)


Possible problems with PCR
• Some of the PCs may not be relevant for the prediction of y; they are only relevant for the description of X
• The PCs are estimated without considering the property to be predicted, y
• Solution: find the components using the information in y, not only the information in X => PLS
• The number of components has to be estimated using validation methods
• Diagnostics are used to detect outliers (Q residuals, T² values, leverage plots)


Partial Least Squares Regression PLSR

The response matrix X is decomposed and ‘truncated’ in a similar way to PCR, but the decomposition takes into account the information in y for the calibration samples

$$ X = T P^T + E $$
$$ Y = U Q^T + F $$

u = y (in the case of a single component)

[Diagram: X → T → U → Y, with the weights W; the X to T step is labelled “variance”, the T to U step “covariance” and the U to Y step “variance”]


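[Diagram comparing the models: CLS, X = f(y); ILS/MLR, y = f(X), with y regressed directly on the variables x1 to x5; PCA, with x1 to x5 compressed into the scores t1 and t2; PCR = PCA + MLR, with y regressed on t1 and t2; PLSR, with the scores t1 and t2 obtained from both X and y]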


Partial Least Squares, PLS
• PLS is a mixture of PCR and MLR
• PCR captures the maximum variance in X
• MLR achieves the maximum correlation between X and y
• PLS attempts both: it maximizes the covariance
• PLS requires an additional matrix of weights W to keep the scores orthogonal, which makes the matrix inversion easier
• The factors are evaluated sequentially by projection of y on X
• The expression for the matrix inverse is more complex than for PCR:

$$ X^+ = W_k (P_k^T W_k)^{-1} (T_k^T T_k)^{-1} T_k^T $$
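The sequential evaluation of the factors and the expression for X+ can be sketched for a single y variable (PLS1). The slides do not specify the algorithm; this example assumes the common NIPALS-type scheme with orthogonal scores, mean-centred data and simulated spectra, and the function name `pls1` is hypothetical.

```python
import numpy as np

def pls1(X, y, nc):
    """PLS1 with orthogonal scores; returns b, the scores T and the centring terms."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    Xk, yk = Xc.copy(), yc.copy()
    ns, nw = X.shape
    W, P, T = np.zeros((nw, nc)), np.zeros((nw, nc)), np.zeros((ns, nc))
    for a in range(nc):
        w = Xk.T @ yk                         # direction of maximum covariance with y
        w /= np.linalg.norm(w)
        t = Xk @ w                            # scores of factor a
        p = Xk.T @ t / (t @ t)                # X loadings of factor a
        Xk = Xk - np.outer(t, p)              # deflate X
        yk = yk - t * (t @ yk) / (t @ t)      # deflate y
        W[:, a], P[:, a], T[:, a] = w, p, t
    # X^+ = W (P^T W)^-1 (T^T T)^-1 T^T and b = X^+ y
    X_pinv = W @ np.linalg.inv(P.T @ W) @ np.linalg.inv(T.T @ T) @ T.T
    return X_pinv @ yc, T, x_mean, y_mean

# Simulated two-component mixtures (assumed data)
rng = np.random.default_rng(4)
ns, nw = 15, 30
wl = np.linspace(0.0, 1.0, nw)
S = np.column_stack([np.exp(-((wl - 0.35) / 0.12) ** 2),
                     np.exp(-((wl - 0.55) / 0.12) ** 2)])
C = rng.uniform(0.1, 1.0, size=(ns, 2))
X = C @ S.T + 1e-3 * rng.normal(size=(ns, nw))
y = C[:, 0]

b, T, x_mean, y_mean = pls1(X, y, nc=2)
y_fit = (X - x_mean) @ b + y_mean
G = T.T @ T                                   # diagonal if the scores are orthogonal
print(np.max(np.abs(y_fit - y)), G[0, 1])
```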


Comparison of predictive (inverse) linear calibration models

The different methods differ in the way they solve the same calibration equation (calculation of the inverse of the response matrix):

$$ b = X^+ y $$

MLR (ILS): $X^+ = (X^T X)^{-1} X^T$
Maximum correlation between X and y is achieved, but the direct inversion of X is problematic

PCR: $X^+ = P_k (T_k^T T_k)^{-1} T_k^T$
Maximum variance in X is achieved

PLSR: $X^+ = W_k (P_k^T W_k)^{-1} (T_k^T T_k)^{-1} T_k^T$
Maximum covariance between X and y is achieved

The inverses in PCR and PLS are calculated from orthogonal matrices, $A^T A = D$ (diagonal), or orthonormal matrices, $A^T A = I$


Validation of the calibration models (regression)

Calibration step: calculation of the model
Xcal, ycal → model; obtain the number of components

Validation step: test of the model
Xval, yval → test of the model; test the number of components


Validation of the model with the same calibration samples

Xcal + model → $\hat{y}_{cal}$

Model error: $\hat{y}_{cal} - y_{cal}$

Residual variance (calibration):

$$ \frac{\sum_{i=1}^{ns} (\hat{y}_{i,cal} - y_{i,cal})^2}{ns} $$

RMSEC (Root Mean Square Error of Calibration):

$$ RMSEC = \sqrt{\frac{\sum_{i=1}^{ns} (\hat{y}_{i,cal} - y_{i,cal})^2}{ns}} $$


Validation of the model with new validation samples

Xval + model → $\hat{y}_{val}$

Model error: $\hat{y}_{val} - y_{val}$

Residual variance (prediction):

$$ \frac{\sum_{i=1}^{ns} (\hat{y}_{i,val} - y_{i,val})^2}{ns} $$

RMSEP (Root Mean Square Error of Prediction):

$$ RMSEP = \sqrt{\frac{\sum_{i=1}^{ns} (\hat{y}_{i,val} - y_{i,val})^2}{ns}} $$


Validation of the calibration model

The ability of the calibration model to predict has to be evaluated using a new sample set not used in the development of the calibration model:

The samples are split into a training set and a test set.

Training set: calibration of the model, Xcal, ycal → fcal; calculation of RMSEC

Test set: validation of the model, Xval → fval → $\hat{y}_{val}$; calculation of RMSEP


Validation Methods

1) With a calibration set of samples and a different validation set of samples. Both data sets should be representative. It is the best method.

2) Cross-validation

2A) Two groups of data, A and B: X is split into XA and XB, and y into yA and yB. First A is used for calibration and B for validation; then B is used for calibration and A for validation.

The prediction error is evaluated for A and for B, and the average is taken.


2B) Full cross-validation or leave-one-out validation

As many PCA models are built as there are samples. Successively, one sample is removed, a new PCA model is built and the left-out sample is predicted. This is repeated for every sample and the prediction error is calculated.

2C) Segmented validation (small groups of samples are left out, e.g. 10 % of the samples)

3) Leverage correction (leverage, hi)

The residuals $f_i = y_{val} - \hat{y}_{val}$ are weighted according to hi:

$$ f_i^{corr} = f_i / (1 - h_i) $$


where the leverage hi of sample i is

$$ h_i = \frac{1}{ns} + \sum_{k=1}^{n} \frac{t_{i,k}^2}{\lambda_k} $$

hi values are between 0 and 1

samples with low ‘leverage’: $h_i \to 0$, $f_i^{corr} \approx f_i$

samples with high ‘leverage’: $h_i \to 1$, $f_i^{corr} \gg f_i$

This procedure uses only the calibration sample data; it gives a first approximation of the future prediction ability.
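The leverage correction can be sketched for a PCR model. This example assumes mean-centred data and takes λk as the sum of squares of the scores of component k (so that TᵀT = diag(λk)); the function name and the random data are illustrative only.

```python
import numpy as np

def leverage_corrected_residuals(X, y, nc):
    """Calibration residuals f, leverages h and corrected residuals f / (1 - h)."""
    Xc, yc = X - X.mean(axis=0), y - y.mean()
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :nc] * s[:nc]                            # PCA scores
    lam = s[:nc] ** 2                                 # lambda_k = t_k^T t_k
    h = 1.0 / len(y) + np.sum(T ** 2 / lam, axis=1)   # h_i = 1/ns + sum_k t_ik^2 / lambda_k
    f = yc - T @ ((T.T @ yc) / lam)                   # residuals of the PCR fit
    return f / (1.0 - h), f, h

# Random example (assumed data)
rng = np.random.default_rng(5)
X = rng.normal(size=(20, 6))
y = X[:, 0] - 0.5 * X[:, 1] + 0.1 * rng.normal(size=20)

nc = 2
f_corr, f, h = leverage_corrected_residuals(X, y, nc)
rmse_cal = float(np.sqrt(np.mean(f ** 2)))
rmse_lev = float(np.sqrt(np.mean(f_corr ** 2)))       # first estimate of prediction error
print(rmse_cal, rmse_lev, h.min(), h.max())
```

With this definition the leverages sum to 1 + nc, and every corrected residual is at least as large in magnitude as the uncorrected one.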


Cross-Validation

• Divide the data into q subsets
• Build the model with q-1 subsets
• Calculate PRESS (Predictive Residual Sum of Squares) for the subset left out:

$$ PRESS = \sum_i \sum_j (\hat{y}_{ij} - y_{ij})^2 $$

• Repeat until all the subsets have been left out once
• Find the minimum (or the inflection point) in the plot of PRESS vs number of components
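The procedure can be sketched with PCR as the regression method. The function names, the number of subsets q = 5 and the simulated three-component data are assumptions for this example; any other calibration method could be substituted inside the loop.

```python
import numpy as np

def pcr_coefficients(Xc, yc, nc):
    """Regression vector of a PCR model with nc components (centred data)."""
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:nc].T @ ((U[:, :nc].T @ yc) / s[:nc])

def press_curve(X, y, max_nc, q=5):
    """PRESS accumulated over q left-out subsets, for 1..max_nc components."""
    folds = np.array_split(np.arange(len(y)), q)
    press = np.zeros(max_nc)
    for val in folds:
        cal = np.setdiff1d(np.arange(len(y)), val)
        x_mean, y_mean = X[cal].mean(axis=0), y[cal].mean()
        for nc in range(1, max_nc + 1):
            b = pcr_coefficients(X[cal] - x_mean, y[cal] - y_mean, nc)
            y_hat = (X[val] - x_mean) @ b + y_mean
            press[nc - 1] += np.sum((y_hat - y[val]) ** 2)
    return press

# Simulated three-component mixtures (assumed data); y is the first component
rng = np.random.default_rng(6)
ns, nw = 30, 40
wl = np.linspace(0.0, 1.0, nw)
S = np.column_stack([np.exp(-((wl - c) / 0.1) ** 2) for c in (0.3, 0.5, 0.7)])
C = rng.uniform(0.1, 1.0, size=(ns, 3))
X = C @ S.T + 0.01 * rng.normal(size=(ns, nw))
y = C[:, 0]

press = press_curve(X, y, max_nc=6)
print(press, int(np.argmin(press)) + 1)     # PRESS values and the component count at the minimum
```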


Determination of the number of components using PRESS plots

[Figure: cumulative PRESS (axis from 0 to 500) versus number of components (1 to 10); 5 components are selected]


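Model evaluation and validation

Prediction Error Sum of Squares, PRESS:

$$ PRESS = \sum_i \sum_j (\hat{y}_{ij} - y_{ij})^2 $$

Root Mean Square Error in Prediction:

$$ RMSEP = \sqrt{\frac{\sum_i \sum_j (\hat{y}_{ij} - y_{ij})^2}{n_{samples}}} $$

Standard Error in Prediction:

$$ SEP = \sqrt{\frac{\sum_i \sum_j (\hat{y}_{ij} - y_{ij} - bias)^2}{n_{samples} - 1}} $$

bias:

$$ bias = \frac{\sum_i \sum_j (\hat{y}_{ij} - y_{ij})}{n_{samples}} $$

Relative Error (in %):

$$ RE = 100 \sqrt{\frac{\sum_i \sum_j (\hat{y}_{ij} - y_{ij})^2}{\sum_i \sum_j y_{ij}^2}} $$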
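These figures of merit are simple to compute from the reference and predicted values. The sketch below assumes a single y variable and assumes that the root-mean-square statistics (RMSEP, SEP, RE) carry a square root; the function name `prediction_metrics` and the four data points are illustrative.

```python
import numpy as np

def prediction_metrics(y, y_hat):
    """PRESS, RMSEP, bias, SEP and RE (%) from reference and predicted values."""
    y = np.asarray(y, dtype=float)
    d = np.asarray(y_hat, dtype=float) - y          # prediction errors
    n = d.size
    press = float(np.sum(d ** 2))
    bias = float(np.sum(d) / n)
    return {
        "PRESS": press,
        "RMSEP": float(np.sqrt(press / n)),
        "bias": bias,
        "SEP": float(np.sqrt(np.sum((d - bias) ** 2) / (n - 1))),
        "RE": float(100.0 * np.sqrt(press / np.sum(y ** 2))),
    }

# Predictions with a constant offset of 0.1: all of the error is bias
m = prediction_metrics([1.0, 2.0, 3.0, 4.0], [1.1, 2.1, 3.1, 4.1])
print(m)
```

In this example RMSEP equals the bias and SEP is essentially zero, which shows how SEP measures the scatter of the errors once the systematic part has been removed.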


Model evaluation and validation

• Comparison of experimental values versus model predicted values
  – for the calibration samples
  – for the external validation samples

• Plot and calculate the regression line of predicted versus actual values:
  predicted values = slope x experimental values + offset
  The slope should be one, the offset should be zero and r² = 1


Partial Least Squares Regression PLSR models
PLS model interpretation

$$ X = T P^T + E $$
(T: X scores; P: X loadings)

$$ Y = U Q^T + F $$
(U: Y scores; Q: Y loadings)

u = y (in the case of a single component)

[Diagram: X → T → U → Y, with the weights W; the X to T step is labelled “variance”, the T to U step “covariance” and the U to Y step “variance”]


Interpretation of PLS models

• A physical interpretation of PLS models can be obtained from plots of scores and loadings, as in PCA.

• More interestingly, PLS models provide the weights (Wk), which describe the covariance structure between the X and y blocks. Plots of the weights are extremely useful for the interpretation of PLS models:

$$ X^+ = W_k (P_k^T W_k)^{-1} (T_k^T T_k)^{-1} T_k^T, \qquad b = X^+ y, \qquad y = X b $$

• Other measures exist, such as the variable influence (importance) on projection (VIP) parameter. Plots of VIPs are also very useful for the interpretation of PLS models.


Variable influence (importance) on projection, VIP, parameter

$$ VIP_{Ak} = \sqrt{ K \sum_{a=1}^{A} \left[ w_{ak}^2 \, (SSY_{a-1} - SSY_a) \right] / (SSY_0 - SSY_A) } $$

A: total number of factors considered in the model
a: considered factor
k: considered variable
K: total number of variables in X

The variables with larger VIPs (larger than one) are more influential and important for explaining Y
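A VIP calculation can be sketched for a single y variable. The example assumes a PLS1 model with unit-length weight vectors, takes SSYa as the residual sum of squares of y after a factors, and uses synthetic data in which only the first two of ten variables influence y; the function name `pls1_vip` is hypothetical.

```python
import numpy as np

def pls1_vip(X, y, A):
    """VIP of each X variable for a PLS1 model with A factors."""
    Xk, yk = X - X.mean(axis=0), y - y.mean()
    K = X.shape[1]                             # total number of variables
    ssy = [float(yk @ yk)]                     # SSY_0
    W = np.zeros((K, A))
    for a in range(A):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)                 # unit-length weights
        t = Xk @ w
        p = Xk.T @ t / (t @ t)
        q = (yk @ t) / (t @ t)
        Xk = Xk - np.outer(t, p)               # deflate X
        yk = yk - q * t                        # deflate y
        W[:, a] = w
        ssy.append(float(yk @ yk))             # SSY_a
    ssy = np.array(ssy)
    explained = ssy[:-1] - ssy[1:]             # SSY_(a-1) - SSY_a for each factor
    return np.sqrt(K * (W ** 2 @ explained) / (ssy[0] - ssy[-1]))

# Synthetic example (assumed data): only variables 0 and 1 matter
rng = np.random.default_rng(7)
X = rng.normal(size=(100, 10))
y = 2.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

vip = pls1_vip(X, y, A=2)
print(np.round(vip, 2))
```

Because the weight vectors have unit length, the squared VIPs add up to K, so their average is one; this is why a VIP larger than one is used as the threshold for influential variables.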