partial least squares - ucd school of mathematics and...
TRANSCRIPT
12/9/2013
1
Partial Least Squares
A tutorial
Lutgarde Buydens
Partial Least Squares
• Multivariate regression
  • Multiple Linear Regression (MLR)
  • Principal Component Regression (PCR)
  • Partial Least Squares (PLS)
• Validation
• Preprocessing
Multivariate Regression
X (n × p) and Y (n × k)
Rows: cases, observations, … (analytical observations of different samples, experimental runs, persons, …)
Columns: variables, classes, tags
p: spectral variables, analytical measurements
k: class information, concentrations, …
X: independent variables (always available)
Y: dependent variables (to be predicted later from X)
[Figure: raw spectra, absorbance vs. wavenumber (cm⁻¹), 2000-14000]
Multivariate Regression
Y = f(X) : Predict Y from X
MLR: Multiple Linear Regression
PCR: Principal Component Regression
PLS: Partial Least Squares
From univariate to Multiple Linear Regression (MLR)
Least squares regression
y = b0 + b1x1 + ε
b0: intercept, b1: slope, ε: residual error
[Figure: least-squares line through the (x, y) points]
MLR: Multiple Linear Regression
Least squares regression: y = b0 + b1x1 + ε (b0: intercept, b1: slope)

Multiple Linear Regression: y = b0 + b1x1 + b2x2 + … + bpxp + ε

Least squares minimizes the residuals E = Y − Ŷ; equivalently, the fit maximizes r(y, ŷ).

[Figure: regression plane of y on x1 and x2, with residual ε]
MLR: Multiple Linear Regression
y= b0 +b1 x1 + b2x2 + … bpxp + ε
In matrix notation, with a leading column of ones in X so that b contains b0 (p + 1 coefficients in total):

y(n×1) = X(n×p) b(p×1) + e(n×1)

For k responses: Y(n×k) = X(n×p) B(p×k) + E(n×k)

Least-squares solution: b = (XᵀX)⁻¹Xᵀy
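The least-squares solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the tutorial; the helper names `mlr_fit` and `mlr_predict` and the toy data are made up:

```python
import numpy as np

def mlr_fit(X, y):
    """MLR by least squares: prepend a column of ones for the intercept b0,
    then solve the normal equations b = (X'X)^-1 X'y."""
    X1 = np.column_stack([np.ones(len(X)), X])   # n x (p+1)
    # solve() is numerically safer than forming the inverse explicitly
    return np.linalg.solve(X1.T @ X1, X1.T @ y)  # b[0] = intercept

def mlr_predict(X, b):
    return b[0] + X @ b[1:]

# toy data: 20 samples, 3 independent variables
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=20)
b = mlr_fit(X, y)   # recovers roughly [1.0, 2.0, -1.0, 0.5]
```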
MLR: Multiple Linear Regression
Disadvantages (the inverse (XᵀX)⁻¹ must exist):
• Uncorrelated X-variables required
• n ≥ p + 1
[Figure: y regressed on x1 and x2 when r(x1, x2) → 1]
MLR: Multiple Linear Regression
Disadvantages (the inverse (XᵀX)⁻¹ must exist):
• Uncorrelated X-variables required
When r(x1, x2) ≈ 1, the regression fits a plane through a line!
MLR: Multiple Linear Regression
Disadvantages (the inverse (XᵀX)⁻¹ must exist):
• Uncorrelated X-variables required

Two almost identical data sets; Set B differs from Set A only in the third value of x2 (0.21 instead of 0.23):

      Set A            Set B
  x1      x2       x1      x2        y
 3.67    3.76     3.67    3.76    11.29
-2.87   -2.91    -2.87   -2.91    -8.09
 0.23    0.23     0.23    0.21     2.19
 5.49    5.55     5.49    5.55    19.09
 3.23    3.25     3.23    3.25    10.33
-1.01   -0.99    -1.01   -0.99    -1.89

Model: y = b1x1 + b2x2 + ε

         Set A             Set B
       b1      b2        b1      b2
MLR   10.3   -6.92      2.96    0.28
      R² = 0.98         R² = 0.98

A tiny change in the data produces completely different MLR coefficients.
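The instability in the table can be reproduced directly. The sketch below (illustrative only; `fit_no_intercept` is a made-up helper) fits y = b1x1 + b2x2 to both sets and shows that the near-collinearity of x1 and x2 makes the coefficients swing wildly:

```python
import numpy as np

# data from the Set A / Set B table: x1 and y are identical in both sets,
# x2 differs only in its third value (0.23 vs 0.21)
x1  = np.array([3.67, -2.87, 0.23, 5.49, 3.23, -1.01])
x2a = np.array([3.76, -2.91, 0.23, 5.55, 3.25, -0.99])
x2b = np.array([3.76, -2.91, 0.21, 5.55, 3.25, -0.99])
y   = np.array([11.29, -8.09, 2.19, 19.09, 10.33, -1.89])

def fit_no_intercept(x1, x2, y):
    """Least-squares fit of y = b1*x1 + b2*x2 (no intercept, as on the slide)."""
    X = np.column_stack([x1, x2])
    return np.linalg.lstsq(X, y, rcond=None)[0]

ba = fit_no_intercept(x1, x2a, y)   # Set A coefficients
bb = fit_no_intercept(x1, x2b, y)   # Set B coefficients
print("r(x1, x2) =", np.corrcoef(x1, x2a)[0, 1])  # very close to 1
print("Set A:", ba, " Set B:", bb)  # very different b1, b2
```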
MLR: Multiple Linear Regression
Disadvantages (the inverse (XᵀX)⁻¹ must exist):
• Uncorrelated X-variables required
• n ≥ p + 1
For wide data (X with p >> n), two remedies:
• Dimension reduction / variable selection
• Latent variables (PCR, PLS)
PCR: Principal Component Regression
Step 1: Perform PCA on the original X
Step 2: Use the orthogonal PC scores as independent variables in an MLR model
Step 3: Calculate the b-coefficients from the a-coefficients
[Diagram: X (n × p) → PCA → scores T (n × a) (Step 1); MLR of y on T gives a1 … aa (Step 2); back-transformation gives b0, b1 … bp (Step 3)]
PCR: Principal Component Regression
Dimension reduction: use scores (projections) on latent variables that explain maximal variance in X
[Figure: PC1 in the (x1, x2, …, xp) space]
PCR: Principal Component Regression
Step 0: Mean-center X
Step 1: Perform PCA: X = TPᵀ (MLR is performed on the reconstructed X* = (TPᵀ)*)
Step 2: Perform MLR on the scores: Y = TA, so A = (TᵀT)⁻¹TᵀY
Step 3: Calculate B: Y = X*B = (TPᵀ)B, hence A = PᵀB and B = (PPᵀ)⁻¹PA = PA (P has orthonormal columns)
Finally, calculate the intercepts: b0 = mean(y) − mean(ŷ)
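The three PCR steps translate almost line by line into NumPy. A minimal sketch, assuming mean-centering and a made-up helper name `pcr_fit` (not from any library):

```python
import numpy as np

def pcr_fit(X, y, n_pc):
    """PCR following the steps above: mean-center, PCA via SVD,
    MLR on the scores, then back-transform to b-coefficients."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                                    # Step 0: mean-center
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Step 1: X = T P'
    T = U[:, :n_pc] * s[:n_pc]                         # scores
    P = Vt[:n_pc].T                                    # loadings
    a = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))   # Step 2: A = (T'T)^-1 T'y
    b = P @ a                                          # Step 3: B = P A
    b0 = y_mean - x_mean @ b                           # intercept from the means
    return b0, b

# toy check: with all PCs retained, PCR reproduces an exact linear relation
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 10))
y = X @ rng.normal(size=10) + 5.0
b0, b = pcr_fit(X, y, n_pc=10)
```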
PCR: Principal Component Regression
Optimal number of PCs: calculate the cross-validation RMSE for different numbers of PCs:
RMSECV = √( Σᵢ (yᵢ − ŷᵢ)² / n )
PLS: Partial Least Squares Regression
[Diagram: Phase 1: X (n × p) → PLS → scores T (n × a); Phase 2: MLR of Y (n × k) on T gives a1 … aa; Phase 3: back-transformation gives b0, b1 … bp]
PLS: Partial Least Squares Regression
Projection to Latent Structures
[Figure: in the (x1, x2, …, xp) space, PCR projects onto PC1 while PLS projects onto LV1 (w)]
PCR uses PCs: maximize the variance in X
PLS uses LVs: maximize the covariance of (X, y); note cov²(X, y) = var(X) · var(y) · cor²(X, y)
PLS: Partial Least Squares Regression
Sequential algorithm: latent variables and their scores are calculated sequentially.
Phase 1: calculate new independent variables (T)
• Step 0: Mean-center X
• Step 1: Calculate w: LV1 = w1 maximizes the covariance of (X, Y), via SVD on XᵀY:
  (XᵀY)(p×k) = W(p×a) D(a×a) Zᵀ(a×k),  w1 = 1st column of W
[Figure: direction w1 in the (x1, …, xp) space]
PLS: Partial Least Squares Regression
Sequential algorithm: latent variables and their scores are calculated sequentially.
Phase 1: calculate new independent variables (T)
• Step 1: Calculate LV1 = w1, which maximizes the covariance of (X, Y), via SVD on XᵀY:
  (XᵀY)(p×k) = W(p×a) D(a×a) Zᵀ(a×k),  w1 = 1st column of W
• Step 2: Calculate t1, the scores (projections) of X on w1: t(n×1) = X(n×p) w(p×1)
[Figure: projection of the samples onto w]
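Phase 1 of the sequential algorithm (w1 from an SVD of XᵀY, then the scores t1 = Xw1) can be sketched as follows; `pls_first_lv` is a made-up name for this illustration:

```python
import numpy as np

def pls_first_lv(X, Y):
    """First PLS latent variable: mean-center, take w1 as the first
    column of W from (X'Y) = W D Z', then project: t1 = X w1."""
    Xc = X - X.mean(axis=0)                 # Step 0: mean-center
    Yc = np.atleast_2d(Y - np.mean(Y, axis=0))
    if Yc.shape[0] == 1:                    # accept a 1-D y as well
        Yc = Yc.T
    W, D, Zt = np.linalg.svd(Xc.T @ Yc, full_matrices=False)
    w1 = W[:, 0]                            # Step 1: maximizes cov(X, Y)
    t1 = Xc @ w1                            # Step 2: scores of X on w1
    return w1, t1

rng = np.random.default_rng(2)
X = rng.normal(size=(25, 6))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=25)
w1, t1 = pls_first_lv(X, y)                 # t1 correlates strongly with y
```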
PLS: Partial Least Squares Regression
Optimal number of LVs: calculate the cross-validation RMSE for different numbers of LVs:
RMSECV = √( Σᵢ (yᵢ − ŷᵢ)² / n )
PLS: Partial Least Squares Regression
Same two data sets as before (Set B differs from Set A only in the third value of x2):

      Set A            Set B
  x1      x2       x1      x2        y
 3.67    3.76     3.67    3.76    11.29
-2.87   -2.91    -2.87   -2.91    -8.09
 0.23    0.23     0.23    0.21     2.19
 5.49    5.55     5.49    5.55    19.09
 3.23    3.25     3.23    3.25    10.33
-1.01   -0.99    -1.01   -0.99    -1.89

Model: y = b1x1 + b2x2 + ε

         Set A             Set B
       b1      b2        b1      b2
MLR   10.3   -6.92      2.96    0.28
PCR    1.60    1.62     1.60    1.62
PLS    1.60    1.62     1.60    1.62

Unlike MLR, PCR and PLS give stable coefficients for both sets.
MLR, PCR, PLS:
VALIDATION
Estimating prediction error.
Basic principle: test how well your model works with new data it has not seen yet!
Common measure for prediction error: RMSEP (the RMSE of prediction on such new data).
A Biased Approach
Prediction error of the samples the model was built on: this error is biased! The samples were also used to build the model, so the model is biased towards accurate prediction of these specific samples.

Validation: Basic Principle
Test how well your model works with new data it has not seen yet! Split the data into a training set and a test set. Several ways:
• One large test set
• Leave one out and repeat: LOO
• Leave n objects out and repeat: LNO
Apply the entire modelling procedure to the test set.
Validation
Full dataset → split into training set and test set.
Build the model on the training set (b0 … bp); predict y for the test set and compute RMSEP.
Remark: for the final model, use the whole data set.
Training and test sets
Split into a training and a test set:
• The test set should be representative of the training set
• A random choice is often the best
• Check for extremely unlucky divisions
• Apply the whole procedure to the test and validation sets
Cross-validation
• Simplest case: Leave-One-Out (LOO; segment = 1 sample). Normally 10-20% is left out (LnO).
• Remark: for the final model, use the whole data set.
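The LOO idea can be written down in a few lines. A sketch, not code from the tutorial; `rmsecv_loo` takes the model as a pair of made-up callbacks `fit`/`predict`:

```python
import numpy as np

def rmsecv_loo(X, y, fit, predict):
    """Leave-One-Out CV: leave each sample out once, rebuild the model
    on the rest, and collect the prediction residuals into an RMSECV."""
    n = len(y)
    resid = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i                 # training segment
        model = fit(X[keep], y[keep])
        resid[i] = y[i] - predict(X[i:i + 1], model)[0]
    return np.sqrt(np.mean(resid ** 2))          # sqrt(sum (y - yhat)^2 / n)

# illustration with ordinary least squares as the model
def ols_fit(X, y):
    X1 = np.column_stack([np.ones(len(X)), X])
    return np.linalg.lstsq(X1, y, rcond=None)[0]

def ols_predict(X, b):
    return b[0] + X @ b[1:]

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 4))
y = X @ np.array([1.0, 0.5, -0.5, 2.0]) + 3.0   # exactly linear toy data
rmse = rmsecv_loo(X, y, ols_fit, ols_predict)   # close to zero here
```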
Cross-validation: an example
• The data
Cross-validation: an example
• Split the data into a training set and a validation set
• Build a model on the training set and predict the left-out samples
• Split the data again into a training set and a validation set, and repeat until all samples have been in the validation set once (common: Leave-One-Out, LOO)
Cross-validation: a warning
• Data: 13 × 5 = 65 NIR spectra (1102 wavelengths)
  - 13 samples: different compositions of NaOH, NaOCl and Na2CO3
  - 5 temperatures: each sample measured at 5 temperatures

Composition:
Sample   NaOH (wt%)   NaOCl (wt%)   Na2CO3 (wt%)   Temperature (°C)
1        18.99        0             0              15, 21, 27, 34, 40
2         9.15        9.99          0.15           15, 21, 27, 34, 40
3        15.01        0             4.01           15, 21, 27, 34, 40
4         9.34        5.96          3.97           15, 21, 27, 34, 40
…        …            …             …              …
13       16.02        2.01          1.00           15, 21, 27, 34, 40
Cross-validation: a warning
• The data: X is 65 × 1102, with a corresponding y
• Leave SAMPLE out: leave out all 5 spectra of the same sample together (13 segments), not individual spectra
Selection of number of LV’s
Through validation: choose the number of LVs that results in the model with the lowest prediction error.
The test set used to assess the final model cannot be used for this!
→ Divide the training set further (cross-validation).
Validation
Full dataset → training set + test' set + test set
1) Determine the number of LVs with the test' set
2) Build the model (b0 … bp); predict y for the test set and compute RMSEP
Remark: for the final model, use the whole data set.
Double Cross Validation
Full dataset → training set (cross-validated)
1) Determine the number of LVs: CV inner loop (CV2)
2) Build the model and estimate the error: CV outer loop (CV1), giving RMSEP
Remark: for the final model, use the whole data set.
Double cross-validation
• The data
• Split the data into a training set and a validation set
  (the validation set is used later to assess model performance!)
Double cross-validation
• Apply cross-validation on the rest: split the training set into a (new) training set and a test set
• Fit models with 1, 2, 3, … LVs on each inner split and select the number of LVs with the lowest RMSECV
Cross-validation: an example
• Repeat the procedure until all samples have been in the validation set once
Double cross-validation
• In this way:
  - The number of LVs is determined using samples not used to build the model
  - The prediction error is also determined using samples the model has not seen before
Remark: for the final model, use the whole data set.
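Double CV is just two nested CV loops. The sketch below uses a simple PCR model as the inner learner; all names (`double_cv`, `pcr_fit`, `pcr_predict`) are made up for this illustration:

```python
import numpy as np

def double_cv(X, y, fit, predict, max_lv, outer_k=5, inner_k=5):
    """Inner loop (CV2) picks the number of LVs with the lowest RMSECV;
    outer loop (CV1) estimates the error on samples never used for
    fitting or LV selection."""
    n = len(y)
    resid = []
    for test in np.array_split(np.arange(n), outer_k):       # CV1
        train = np.setdiff1d(np.arange(n), test)
        rmse = np.zeros(max_lv)
        for a in range(1, max_lv + 1):                       # CV2
            e = []
            for val in np.array_split(train, inner_k):
                tr = np.setdiff1d(train, val)
                e.extend(y[val] - predict(X[val], fit(X[tr], y[tr], a)))
            rmse[a - 1] = np.sqrt(np.mean(np.square(e)))
        best_a = int(np.argmin(rmse)) + 1                    # lowest RMSECV
        m = fit(X[train], y[train], best_a)
        resid.extend(y[test] - predict(X[test], m))
    return np.sqrt(np.mean(np.square(resid)))                # outer RMSEP

def pcr_fit(X, y, a):
    """Tiny PCR used as the inner model (a = number of components)."""
    xm, ym = X.mean(axis=0), y.mean()
    U, s, Vt = np.linalg.svd(X - xm, full_matrices=False)
    b = Vt[:a].T @ ((U[:, :a].T @ (y - ym)) / s[:a])
    return xm, ym, b

def pcr_predict(X, model):
    xm, ym, b = model
    return ym + (X - xm) @ b

rng = np.random.default_rng(5)
X = rng.normal(size=(40, 6))
y = X @ rng.normal(size=6) + 2.0                 # exactly linear toy data
rmsep = double_cv(X, y, pcr_fit, pcr_predict, max_lv=6)
```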
Raw + mean-centered data
[Figures: raw spectra (absorbance (a.u.) vs. wavenumber (cm⁻¹), 2000-14000) and the same spectra after mean-centering]
PLS: an example
RMSECV vs. number of LVs
[Figure: RMSECV values for the prediction of NaOH, for 1-10 LVs]
Regression coefficients
[Figures: raw spectra (absorbance (a.u.) vs. wavenumber (cm⁻¹), 3000-13000) and the corresponding regression coefficients]
True vs. predicted
[Figure: predicted vs. true NaOH values, 8-20 wt%]
[Figure: a spectrum (intensity (a.u.) vs. wavelength (cm⁻¹), 500-2500) showing the original spectrum with offset, slope and scatter effects]
Why Pre-Processing? Data artefacts:
• Baseline correction
• Alignment
• Scatter correction
• Noise removal
• Scaling, normalisation
• Transformation
• …
Other:
• Missing values
• Outliers
[Figures: a spectrum (intensity (a.u.) vs. wavelength (a.u.), 0-1600) shown as original, with offset, with offset + slope, with a multiplicative effect, and with offset + slope + multiplicative effect combined]
Pre-Processing Methods: 4914 combinations, all reasonable
STEP 1, BASELINE (7×): no baseline correction; detrending, polynomial order 2, 3 or 4 (3×); derivatisation, 1st or 2nd derivative (2×); AsLS
STEP 2, SCATTER (10×): no scatter correction; scaling by the mean, median, max or L2 norm (4×); SNV; RNV at 15, 25 or 35% (3×); MSC
STEP 3, NOISE (10×): no noise removal; S-G smoothing with window 5, 9 or 11 pt and order 2, 3 or 4 (9×)
STEP 4, SCALING & TRANSFORMATIONS (7×): mean-centering; autoscaling; range scaling; Pareto scaling; Poisson scaling; level scaling; log transformation
Supervised pre-processing methods: OSC or DOSC, combined with no noise removal and one of the 7 scalings (7 × 10 × 10 × 7 + 2 × 7 = 4914)
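Two of the simplest options from the table, mean-centering (STEP 4) and SNV (STEP 2), can be sketched directly; the function names and the simulated spectra are made up for this illustration:

```python
import numpy as np

def mean_center(X):
    """STEP 4 example: subtract the mean of each column (variable)."""
    return X - X.mean(axis=0)

def snv(X):
    """STEP 2 example, Standard Normal Variate: center and scale each
    spectrum (row) by its own mean and standard deviation."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# simulate one spectrum measured with different offset and
# multiplicative (scatter-like) effects
base = np.sin(np.linspace(0, 3, 200))
X = np.array([a * base + o for a, o in [(1.0, 0.0), (1.5, 0.2), (0.7, -0.1)]])
Xs = snv(X)   # the three rows coincide again after SNV
```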
Pre-Processing Results
[Figure: classification accuracy (%) vs. complexity of the model (number of LVs) for the pre-processing combinations, with the raw-data result marked]
• Complexity of the model: number of LVs
• Classification accuracy of the raw data shown for reference
J. Engel et al., TrAC 2013
SOFTWARE
• PLS Toolbox (Eigenvector Inc.): www.eigenvector.com, for use in MATLAB (or standalone!)
• XLSTAT-PLS (XLSTAT): www.xlstat.com, for use in Microsoft Excel
• Package pls for R: free software, http://cran.r-project.org