"IT'S GOOD, BUT IT'S NOT RIGHT!..." HOW TO VALIDATE YOUR MODEL
ALEX HENDERSON [email protected]
University of Manchester & SurfaceSpectra
SurfaceSpectra (XPS)
XPS of Polymers Database
Surface Analysis by Auger and X-Ray Photoelectron Spectroscopy
surfacespectra.com
SurfaceSpectra (SIMS)
Static SIMS Library
TOF-SIMS: Materials Analysis by Mass Spectrometry
surfacespectra.com
Multivariate analysis of SIMS spectra
Content here taken from my chapter “Multivariate analysis of SIMS spectra” in
“TOF-SIMS: Materials Analysis by Mass Spectrometry”
Eds. John C. Vickerman and David Briggs, Second edition,
2013, SurfaceSpectra & IM Publications
ISBN: 978-1-906715-17-5
Book and individual electronic chapters available
from: http://impublications.com/tof-sims/
See also: http://surfacespectra.com/books/tof-sims
Availability
Code: MATLAB (requires Statistics Toolbox for gscatter & mahal)
Data: SIMS spectra from samples of bacteria
J.S. Fletcher, A. Henderson, R.M. Jarvis, N.P. Lockyer, J.C. Vickerman and R. Goodacre, Applied Surface Science 252 (2006) 6869, DOI: 10.1016/j.apsusc.2006.02.153
S. Vaidyanathan, J.S. Fletcher, R.M. Jarvis, A. Henderson, N.P. Lockyer, R. Goodacre and J.C. Vickerman, Analyst 134 (2009) 2352, DOI: 10.1039/b907570d
Both code and data will be made available shortly
on one or more of the following platforms
Check http://manchester.ac.uk/sarc for information
Uses of validation
Right/wrong answer
Luck
Overfitting
Mistakes
Outliers
Usefulness of result
Sensitivity
Specificity
Uses of validation we will cover
Right/wrong answer
Luck → sampling (cross-validation, bootstrap)
Overfitting → PRESS and RSS tests
Mistakes → visualisation
Outliers → robust methods (LIBRA)
Usefulness of result
Sensitivity → distance metrics
Specificity → distance metrics and confusion matrices
Chemometrics and Intelligent Laboratory Systems. 75, 127 (2005). http://wis.kuleuven.be/stat/robust/LIBRA.html
Example data
Bacterial samples related to urinary tract infection
5 bacterial species; 2 or 3 strains of each
Citrobacter freundii, coded Cf (14 spectra)
Escherichia coli (E. coli), coded Ec (32 spectra)
Enterococcus spp., coded En (33 spectra)
Klebsiella pneumoniae, coded Kp (15 spectra)
Proteus mirabilis, coded Pm (21 spectra)
Each species/strain grown 3 times
(biological replicates)
Applied Surface Science 252 (2006) 6869-6874
Example data
Positive ion ToF-SIMS data
Samples analysed in random sequence over 3 days
Each sample analysed 3 times (technical replicates)
115 spectra in total
Mass range: 1 – 800 u
Binned (summed) to nominal mass (±0.5 u)
Square root of data taken
Each spectrum normalised to unity
Applied Surface Science 252 (2006) 6869-6874
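The deck notes the accompanying code is MATLAB; as an illustrative sketch only, the pre-processing steps above (square root, then normalise each spectrum to unity) can be written in Python/NumPy as:

```python
import numpy as np

def preprocess(spectra):
    """Pre-process a (samples x mass-bins) matrix as in the slides:
    square-root transform, then normalise each spectrum to unit total."""
    s = np.sqrt(spectra)                      # variance-stabilising transform
    return s / s.sum(axis=1, keepdims=True)   # each row now sums to 1

# Toy example (not the bacterial data): 2 spectra, 4 mass bins
X = preprocess(np.array([[4.0, 1.0, 0.0, 4.0],
                         [9.0, 0.0, 1.0, 0.0]]))
print(X.sum(axis=1))   # each row sums to 1.0
```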
PCA results
Not too useful without context…
[PCA scores plot: principal component 1 (55.2%) vs principal component 2 (13.2%), unlabelled]
PCA results
A priori knowledge indicates some correlation between principal components and the bacterial classes. Labelled as species.
[PCA scores plot: principal component 1 (55.2%) vs principal component 2 (13.2%), labelled by species: Cf, Ec, En, Kp, Pm]
PCA results
A priori knowledge indicates some correlation between principal components and the bacterial classes. Labelled as strains.
[PCA scores plot: principal component 1 (55.2%) vs principal component 2 (13.2%), labelled by strain: Cf102, Cf109, Ec013, Ec017, Ec041, Ec007, EnC82, EnC85, EnC90, EnC93, Kp052, Kp059, Pm065, Pm070, Pm073]
Canonical Variates Analysis (CVA)
Also known as Discriminant Function Analysis (DFA)
Often used in conjunction with PCA (“PC-DFA”)
Blends principal components to better match a priori classes
PCA identifies unique characteristics of the data set as a whole
CVA identifies proportions of these characteristics that best match known classes of samples
Need to define how many PCs to use – see later…
The Great UKSAF Bake Off!
David Scurr did some baking(!!), what did he make?
The Great UKSAF Bake Off!
Approach…
Go to a real baker and collect some items of
different types: bread, scones, cakes, etc.
Analyse them by SIMS, XPS or Raman [insert your
favourite technique here] and do multivariate
analysis on the data
Pre-process
PCA
CVA
The Great UKSAF Bake Off!
PCA gives unique ingredients (characteristics):
Flour
Butter
Yeast
Eggs
Sugar
Salt
Etc
Note: it did not identify fatty acids, amino acids or other discrete chemicals
The Great UKSAF Bake Off!
CVA gives proportions of the PCA results (ingredients) that match the bakery items → helps to identify the recipe
Now analyse David’s offering in the same manner
Should identify the proportions of the various ingredients (PCs)
Given one of David’s (unrecognisable?) baked items, CVA could predict what it was supposed to be
(Well, we can only hope!)
Types of variance
Reduce within-class variance (W); increase between-class variance (B)
Fisher’s ratio
Wish to maximise the ratio of the between-class
variance to the within-class variance
Total variance = Between-class variance + Within-class variance
Between = Total − Within
Fisher’s ratio = Between / Within = (Total − Within) / Within = Total / Within − 1
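A minimal numeric check of the identity above, on hypothetical one-dimensional data (not the bacterial set):

```python
import numpy as np

def fishers_ratio(groups):
    """Between/Within variance ratio for a list of 1-D sample arrays,
    using the identity Between = Total - Within from the slide."""
    pooled = np.concatenate(groups)
    total = np.sum((pooled - pooled.mean()) ** 2)
    within = sum(np.sum((g - g.mean()) ** 2) for g in groups)
    return (total - within) / within   # = Total/Within - 1

# Two well-separated toy groups give a large ratio
a = np.array([1.0, 2.0, 3.0])
b = np.array([7.0, 8.0, 9.0])
print(fishers_ratio([a, b]))   # (58 - 4) / 4 = 13.5
```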
CVA results
Outcome of CVA using 9 principal components. Class separation better defined.
[CVA scores plot: canonical variate 1 (eig = 12.6) vs canonical variate 2 (eig = 5.71), labelled by species: Cf, Ec, En, Kp, Pm]
PCA versus CVA
[Side-by-side scores plots of the same data: PCA (principal component 1, 55.2% vs principal component 2, 13.2%) and CVA (canonical variate 1, eig = 12.6 vs canonical variate 2, eig = 5.71), labelled by species: Cf, Ec, En, Kp, Pm]
Data projection
Predict classification of an unknown sample
Unseen by the model – not used to train the model
Pre-treat in the same manner as the training data
Remove the mean of the TRAINING data, not the unknown or test data
Need to have the same origin for matrix rotation
Rotate the unknown data by the same amount as the training data
See where the unknown samples turn up
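The projection steps above can be sketched in Python/NumPy on synthetic data (the deck's own code is MATLAB; all data here are illustrative):

```python
import numpy as np

# Sketch of projecting unseen data into an existing PCA model.
# `train` and `test` stand in for pre-treated spectra (samples x variables).
rng = np.random.default_rng(0)
train = rng.normal(size=(20, 5))
test = rng.normal(size=(4, 5))

mean = train.mean(axis=0)            # mean of the TRAINING data only
Xc = train - mean
# Loadings (rotation) from the SVD of the centred training data
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
scores_train = Xc @ Vt.T             # rotate the training data
scores_test = (test - mean) @ Vt.T   # same origin, same rotation
print(scores_test.shape)             # one row of scores per unknown sample
```

The key point the slide makes is encoded in the last line: the test data are centred with the training mean and rotated by the training loadings, so both sets share the same origin and axes.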
CVA outcome with projected data
Outcome of CVA using 9 principal components. Model trained with bootstrap sampling (empty circles) and test data projected in (filled circles).
[CVA scores plot: canonical variate 1 (eig = 15.2) vs canonical variate 2 (eig = 4.64), training and projected test points labelled by species: Cf, Ec, En, Kp, Pm]
How close is close?
Problems:
Can only visualise up to 3 dimensions
Something that looks close in 2D may not be close in
another dimension
Spread of data may not be (hyper)spherical
Need an N-D measurement system
Simple solution – Euclidean distance
Use Euclidean distance metric
Measure N-dimensional distance of projected data
from each group centroid (essentially N-dimensional
trigonometry)
Smallest distance gives assigned group
Problems with non (hyper)spherical group
distribution
Better solution – Mahalanobis distance
Developed by Prasanta Chandra Mahalanobis in
India in 1936
Takes into account the spread of the data when
calculating the distance
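The deck's MATLAB code uses mahal from the Statistics Toolbox; a NumPy sketch of the same idea (squared distance, as mahal returns) on a hypothetical elongated group:

```python
import numpy as np

def mahalanobis(x, group):
    """Squared Mahalanobis distance of point x from a (samples x vars) group;
    the covariance term accounts for the spread of the data."""
    mu = group.mean(axis=0)
    cov = np.cov(group, rowvar=False)
    diff = x - mu
    return float(diff @ np.linalg.inv(cov) @ diff)

# Toy group spread mainly along the first axis (centroid at the origin)
g = np.array([[-3.0, 0.0], [-1.0, 0.1], [0.0, -0.1], [1.0, 0.05], [3.0, -0.05]])
# Two points at the same Euclidean distance (2.0) from the centroid...
along = np.array([2.0, 0.0])    # along the spread
across = np.array([0.0, 2.0])   # across the spread
# ...but very different Mahalanobis distances
print(mahalanobis(along, g) < mahalanobis(across, g))  # True
```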
Mahalanobis versus Euclidean
All 4 stars have the same Euclidean distance from the group centroid (green circle). Blue stars have a smaller Mahalanobis distance than the red stars.
Testing the model (Holdout)
Randomly split data into training set and test set
(~2:1)
Use the training set to develop the model
Pre-process → PCA → CVA
Project the test data into the model
Pre-process (using mean of training set), then rotate
Measure distance of each test point from each group; smallest distance is predicted class
Count how many correct answers we get
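A sketch of the holdout protocol on synthetic data, with a simple nearest-centroid (Euclidean) classifier standing in for the full pre-process → PCA → CVA pipeline (everything here is illustrative, not the deck's MATLAB code):

```python
import numpy as np

# Two well-separated synthetic classes, 30 samples each
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, size=(30, 3)),
               rng.normal(3, 0.5, size=(30, 3))])
y = np.array([0] * 30 + [1] * 30)

# Random ~2:1 split into training and test sets
idx = rng.permutation(len(y))
n_train = (2 * len(y)) // 3
train, test = idx[:n_train], idx[n_train:]

# "Model": class centroids from the training set only
centroids = np.array([X[train][y[train] == c].mean(axis=0) for c in (0, 1)])
# Assign each test point to the nearest centroid, then count correct answers
dists = np.linalg.norm(X[test][:, None, :] - centroids[None, :, :], axis=2)
pred = dists.argmin(axis=1)
accuracy = (pred == y[test]).mean()
print(accuracy)
```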
Caution
We are assigning the test sample to the nearest grouping; it could be very far away from that grouping
Could/should put limits on distance
Contingency table
Method of displaying results from a 2-class test
Used to assess how well the model performs
Relates to test samples not training samples
Need to define a perspective
For example; “We wish to predict class A”
Contingency table
Define a perspective, e.g. “From the point of view that I want to predict Class A…”

                     Truly Class A                   Truly Class B (not Class A)     Totals
Predicted Class A    Correctly predicted as          Wrongly predicted as            Total predicted as
                     Class A (TRUE POSITIVE)         Class A (FALSE POSITIVE)        Class A (TP + FP)
Predicted Class B    Wrongly predicted as not        Correctly predicted as not      Total predicted as
                     Class A (FALSE NEGATIVE)        Class A (TRUE NEGATIVE)         Class B (FN + TN)
Totals               Total of Class A (TP + FN)      Total of Class B (FP + TN)      Total number of samples
Sensitivity and specificity
Sensitivity: the proportion of things we were looking for that were found
sensitivity = TP / (TP + FN)
How good is the model at getting things right
Specificity: the proportion of things we were not looking for that were correctly rejected
specificity = TN / (FP + TN)
How good is the model at making sure we don’t get things wrong
Requires a perspective
Contingency table example
sensitivity_A = TP / (TP + FN) = 12 / (12 + 2) = 86%
specificity_A = TN / (FP + TN) = 7 / (4 + 7) = 64%
True Group A True Group B Totals
Predicted Group A 12 (TP) 4 (FP) 16
Predicted Group B 2 (FN) 7 (TN) 9
Totals 14 (TP+FN) 11 (FP+TN) 25
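The worked example above can be reproduced directly from the four cell counts:

```python
# Sensitivity and specificity from the worked contingency table
# (TP=12, FP=4, FN=2, TN=7), from the perspective "predict Class A"
TP, FP, FN, TN = 12, 4, 2, 7

sensitivity = TP / (TP + FN)   # 12/14, how many Class A samples were found
specificity = TN / (FP + TN)   # 7/11, how many non-A samples were rejected
print(round(sensitivity * 100), round(specificity * 100))  # 86 64
```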
Confusion matrix
Contingency table extended to more than 2
classes
Relates to test samples not training samples
Cannot simply use sensitivity and specificity
Use ‘percentage correctly classified’ instead (%CC)
Confusion matrix example
Overall %CC = (3+11+12+2+4)/44 = 73%
True Cf True Ec True En True Kp True Pm
Pred. Cf 3 1 0 0 3
Pred. Ec 1 11 0 4 1
Pred. En 0 0 12 2 0
Pred. Kp 0 0 0 2 0
Pred. Pm 0 0 0 0 4
Total 4 12 12 8 8
%CC 75% 92% 100% 25% 50%
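The %CC figures above can be checked from the matrix itself (rows are predictions, columns are true classes, as in the table):

```python
import numpy as np

# 5-class confusion matrix from the example (classes Cf, Ec, En, Kp, Pm)
cm = np.array([[3,  1,  0, 0, 3],
               [1, 11,  0, 4, 1],
               [0,  0, 12, 2, 0],
               [0,  0,  0, 2, 0],
               [0,  0,  0, 0, 4]])

# Percentage correctly classified: diagonal over column (true-class) totals
per_class_cc = 100 * np.diag(cm) / cm.sum(axis=0)   # 75, 91.7, 100, 25, 50
overall_cc = 100 * np.diag(cm).sum() / cm.sum()     # 32/44
print(round(overall_cc))   # 73
```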
Is it right or were we lucky?
Randomly selected training and test data only once
Get a single answer without a feel for its likelihood
Randomly selecting a second time gives a different answer
Which one is right?
How many times do we repeat?
Cross-validation (k-fold)
Decide how many ‘folds’, k, to make (arbitrary)
Randomly allocate all data into k groupings
Define first grouping as a test set
Pool all other groupings as a training set
Perform analysis (PCA → CVA → confusion matrix)
Repeat, but use second group as test set and pool all other groups as training set
Repeat until each grouping has been a test set
Produces k outcomes → a distribution of results
Stratification
If training sets are chosen randomly some classes
may be missed out entirely
Solution is to randomly select the same proportion
from each class
Ensures all classes are represented in training set
and therefore the model
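The stratified allocation above can be sketched as follows (a NumPy sketch with hypothetical labels, not the deck's MATLAB code):

```python
import numpy as np

def stratified_kfold(y, k, rng):
    """Assign each sample to one of k folds, drawing the same proportion
    from each class so no class is missed out entirely."""
    folds = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        members = np.flatnonzero(y == cls)
        rng.shuffle(members)
        # deal this class's members out round-robin across the k folds
        folds[members] = np.arange(len(members)) % k
    return folds

# Unbalanced toy labels: 10 of class 0, 5 of class 1
y = np.array([0] * 10 + [1] * 5)
folds = stratified_kfold(y, 5, np.random.default_rng(0))
# Every fold contains both classes
print(all(len(np.unique(y[folds == f])) == 2 for f in range(5)))  # True
```

Each fold in turn then serves as the test set, with the remaining folds pooled as the training set.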
Cross-validation (leave-one-out)
Extreme case of k-fold cross validation
If we have N spectra, let k=N
Each spectrum is treated as a test set (single entry)
Model is trained using N-1 spectra each time
Produces N outcomes
Distribution of outcomes
k-fold CV produces k confusion matrices
e.g. each class will have k %CC values
5 classes and 10-fold CV gives 50 %CC values
Treat each class as a distribution and determine the
mean, standard deviation etc
e.g. 10-fold CV produces a mean (of 10 values) and standard deviation of the percentage correctly classified for class A, rather than a single value
Bootstrap
Introduced by Bradley Efron in the 1970s
Sampling with replacement
I also wish to thank the many friends who suggested names more colorful than Bootstrap, including Swiss Army Knife, Meat Axe, Swan-Dive, Jack-Rabbit, and my personal favorite, the Shotgun, which, to paraphrase Tukey, “can blow the head off any problem if the statistician can stand the resulting mess.” – Bradley Efron (1977)
The Annals of Statistics 7 (1979)1–26
Population versus sample
Population is all possible spectra from all bacteria
(in the world)
Sample is our data, a much smaller collection
How can we tell if our collection is representative?
Bootstrap attempts to assess our model
Bootstrap
‘Sampling with replacement’ means each spectrum
has an equal probability of being selected
If our data is anything like the true population (all
possible spectra) we would expect some spectra to
be very similar
Need to repeat many times to get suitable
distribution
Not great for small sample sets
Bootstrap
Say we have N spectra
We randomly select one of these, record its identity on a list and replace it
Repeat this N times
Our list is now N spectra long
Some repeats and some not present (63.2% unique)
Training set is our list (including repeats)
Test set is the data not in the list (those never selected)
Perform analysis (PCA → CVA → confusion matrix)
Repeat many, many times (>50; perhaps >1000)
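A sketch of one bootstrap draw: sample N indices with replacement, train on the draw (repeats included), and test on everything never selected. The ≈63.2% unique figure quoted above emerges for large N:

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1000
draw = rng.integers(0, N, size=N)            # sampling with replacement
train_idx = draw                             # the list, repeats and all
test_idx = np.setdiff1d(np.arange(N), draw)  # samples never selected

unique_fraction = len(np.unique(draw)) / N
print(unique_fraction)   # close to 1 - 1/e = 0.632 for large N
```

In practice this whole draw-train-test cycle is repeated many times (>50; perhaps >1000) to build a distribution of outcomes.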
Sampling comparison
Protocol    Pro                                        Con
Holdout     Simple to implement;                       Single answer which may be inaccurate;
            computationally light                      should be stratified
k-fold CV   Computationally light;                     Small number of answers, so difficult to
            useful for large datasets                  determine the distribution; should be stratified
LOOCV       Relatively simple to implement;            Somewhat biased toward the ‘best’ answer;
            no need to stratify;                       computationally heavy
            good distribution of answers
Bootstrap   Little/no bias;                            Computationally heavy; requires large
            good distribution of answers               datasets; test set varies in size
Really confused?!
Each test produces a confusion matrix
Holdout only gives a single confusion matrix
k-fold CV → k matrices
LOOCV of N spectra → N confusion matrices
Bootstrap could produce >1000
Need to assess the results as a distribution
For example;
class A may be correctly classified with an average (mean) of p and a spread (standard deviation) of q
Repeating the analysis gives a better understanding of the situation
How many PCs do we use?
Complicated!
Malinowski compared 15 methods in 3 categories:
Empirical, statistical and pseudo-statistical
He didn’t like any of them!
Three common approaches are
Scree plot
95% cumulative explained variance
PRESS test
J. Chemometrics 23 (2009) 1–6
Cattell scree plot
Plot percentage of variance explained by each PC. Stop when the curve levels out. Rather subjective and difficult to determine.
Multivariate Behavioral Research 1 (1966) 245 and Multivariate Behavioral Research 12 (1977) 289
95% cumulative explained variance
Plot accumulated variance explained by each PC. Stop when greater than 95%.
Residual Sum of Squares (RSS)
PCA is a specific rotation of the data matrix
Possible to rotate back to exactly recover the original
Usually only want to keep the PCs that correspond to real data and discard the noise
Using only the informative PCs we should be able to reconstruct the original data well
Subtract reconstructed data from original data and calculate the error
Iteratively increase number of PCs used to reconstruct data until slope of error changes
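The RSS procedure above can be sketched via the SVD (a NumPy sketch on synthetic data with two real factors plus noise; the error drops steeply for the first 2 PCs, then levels out, which is the slope change we look for):

```python
import numpy as np

def rss_curve(X, max_pcs):
    """Residual sum of squares after reconstructing X from 1..max_pcs
    principal components of the mean-centred data."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    rss = []
    for n in range(1, max_pcs + 1):
        recon = U[:, :n] * s[:n] @ Vt[:n]        # rank-n reconstruction
        rss.append(np.sum((Xc - recon) ** 2))    # error vs original
    return rss

# Synthetic data: 2 underlying factors plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2)) @ rng.normal(size=(2, 10)) \
    + 0.01 * rng.normal(size=(50, 10))
print(rss_curve(X, 4))   # large drop to PC 2, then only noise remains
```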
Predicted Residual Error Sum of Squares (PRESS)
RSS predicts the original data using a number of PCs
PRESS uses LOOCV to give a better representation
Start with 1 PC
Reconstruct the data and compare to original
Check with LOOCV error value
Increase number of PCs and repeat
Stop when the slope changes → requisite number of PCs
PRESS/RSS
RSS interpretation involves determination of a slope
change
PRESS interpretation also involves a slope change
Brereton suggests using the PRESS/RSS ratio
Requisite number of PCs is when ratio > 1
R.G. Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons Ltd (2009)
Predicted Residual Error Sum of Squares (PRESS)
Start with 1 PC and slowly increase. At each step perform LOOCV. Stop when the ratio between steps exceeds 1
(this is actually the ratio of PRESSn+1 to RSSn)
R.G. Brereton, Chemometrics for Pattern Recognition, John Wiley & Sons Ltd (2009)
Overfitting
The more principal components used, the better the fit → DANGER
Just because you can doesn’t mean you should
Use the minimum number of PCs to generate your
model
Better to err on the side of caution
Overfitting
[CVA scores plots using 9, 20 and 50 PCs]
See how groups tighten and separate with an increased number of PCs
"It's good, but it's not right!..."
Any the wiser?
The combination of some data and an aching desire
for an answer does not ensure that a reasonable
answer can be extracted from a given body of data.
John W. Tukey
There are known knowns. These are things we know
that we know. There are known unknowns. That is to
say, there are things that we know we don't know. But
there are also unknown unknowns. There are things we
don't know we don't know.
Donald Rumsfeld