Exploratory Data Analysis and Multivariate Strategies
Andrew Mead (School of Life Sciences)
Multi-… approaches in statistics
Multiple comparison tests / multiple testing adjustments
  Methods for adjusting the significance levels when doing a large number of tests (comparisons between treatments) within a single analysis
Multiple regression analysis
  Statistical model building with more than one explanatory variable
Multi-factor analysis of variance
  Analysis of designed experiments with more than one explanatory factor
Multivariate analysis
  Methods to summarise and explore the relationships among multiple response variables, and/or to assess differences among “treatments” based on multiple response variables
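As an illustration of adjusting significance levels across several tests, here is a minimal sketch of two common adjustments, Bonferroni and Holm. The slides do not name a specific method, so the choice of these two, the function names and the p-values are all assumptions made for the example.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject a hypothesis when its p-value is at most alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down: compare the k-th smallest p-value with alpha / (m - k)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for k, i in enumerate(order):
        if p_values[i] <= alpha / (m - k):
            reject[i] = True
        else:
            break  # all remaining (larger) p-values are retained
    return reject


# Hypothetical p-values from four treatment comparisons (not from the slides)
p_values = [0.010, 0.040, 0.030, 0.015]
bonferroni_result = bonferroni(p_values)
holm_result = holm(p_values)
print(bonferroni_result, holm_result)
```

Holm rejects at least as many hypotheses as Bonferroni while still controlling the family-wise error rate, which is why it is often preferred when the number of comparisons is large.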
Multivariate data
Responses by a number of individuals to a range of different (related) questions in a (social studies) survey
Counts of different species in a range of locations in an ecological survey
Measurements of a range of traits of individual people, animals, plants, products, …
Different medical measurements made on a group of patients
Measurements of gene expression, protein expression or metabolite expression in biological samples
Counts of sequence consensus matches from microbial samples
Data on the “distances” between a number of objects
Multivariate questions
Identifying groups of objects with similar (and different) responses
  e.g. UK landscape areas with similar percentage composition of different land covers (woodland, arable, urban, …)?
Identifying the particular measurements that contribute to the variability among a set of objects
  e.g. weed species that are present under different long-term herbicide strategies?
Identifying the particular measurements that contribute to differences between groups of objects
  e.g. which genes discriminate between patients with and without some form of cancer?
Identifying the particular measurements that explain variation in some over-arching response variable
  e.g. which traits of predatory insects influence the predation rate of aphids?
Exploratory data analysis
Summary statistics / graphical summaries
  Variability for each variable/measurement
  Groups of observations
    both pre-determined – to find potential differences
    and to be identified – based on each individual variable
Scatter plots / correlations
  Associations between pairs of variables
Univariate analysis
For each individual variable:
Hypothesis tests
  Choice depends on the question to be answered
Analysis of variance
  For variables measured in designed experiments
Regression analysis
  To build statistical models to describe how one response variable depends on one (or more) explanatory variables
Generalised Linear Models (GLMs)
  For data where standard assumptions do not hold!
Time Series Analysis
…
Multivariate analysis
For a set of “correlated” variables:
  Assess relationships between variables
  Consider the effects of “treatments” on these relationships
  Consider how a “response” depends on these relationships
Multivariate methods concerned with “data reduction”
  Summarise the correlations between variables
  Produce a smaller set of (uncorrelated) variables containing the important information
For a set of “related” objects
  Identify groups of similar objects
  Identify differences between groups of similar objects
  And what makes the objects similar!
Simple graphical summaries 1
For compositional data
  e.g. numbers of onion bulbs in different marketable size grades
Present as a stacked bar-chart
  For raw data
  For percentage of the total
[Figure: two stacked bar charts for categories A to E, one of raw values (axis 0 to 45) and one of percentage of the total (axis 0% to 100%)]
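The percentage version of a stacked bar chart only needs each sample's counts rescaled to sum to 100. A small sketch, using made-up counts for three size grades (the numbers and names are hypothetical, not the onion data behind the slide):

```python
# Hypothetical counts of bulbs in three size grades for samples A to E
counts = {
    "A": [10, 20, 10],
    "B": [5, 15, 5],
    "C": [12, 6, 12],
    "D": [8, 8, 4],
    "E": [9, 18, 18],
}


def to_percentages(values):
    """Rescale one sample's counts so that they sum to 100."""
    total = sum(values)
    return [100.0 * v / total for v in values]


percentages = {sample: to_percentages(v) for sample, v in counts.items()}
for sample, values in percentages.items():
    print(sample, [round(v, 1) for v in values])
```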
Simple graphical summaries 2
More general data
  e.g. different measurements on a set of plants
Scatter plots for each pair of variables
  Present in a matrix
Calculate linear correlation coefficient for each pair of variables
[Figure: matrix of pairwise scatter plots for variables C1 to C5, axes running from about 6 to 16]
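The linear (Pearson) correlation coefficient for every pair of variables can be collected into a matrix that mirrors the scatter plot matrix. A minimal sketch follows; the three columns of measurements are invented for illustration, and only the labels C1 to C3 echo the figure.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Correlation between every pair of named variables."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical measurements on six plants (not the data behind the slide)
data = {
    "C1": [6, 8, 10, 12, 14, 16],
    "C2": [7, 9, 11, 13, 15, 17],
    "C3": [16, 13, 12, 10, 9, 6],
}
corr = correlation_matrix(data)
for name, row in corr.items():
    print(name, {k: round(v, 3) for k, v in row.items()})
```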
Two forms of data matrix
The DATA matrix
  p variables for each of n samples (observations)
  Presented in a rectangular matrix
    n rows and p columns
The ASSOCIATION matrix
  Distance, similarity or dissimilarity
  Between every pair of variables or every pair of samples
  Symmetric square matrix
    n-by-n - between samples
    p-by-p - between variables
    just show lower triangle
  Turns multivariate data into univariate data?
[Figure: schematic of an n samples by p variables data matrix beside a p-by-p association matrix with only the lower triangle shown]
Analysing Association Data
Start with associations
  Distances between locations on a map
  Psychometric (sensory) similarities between products
Construct associations from data
  Depends on the types of data
  Binary (presence/absence) data
    Simple matching coefficient; Jaccard coefficient; …
  Continuous data
    Euclidean distance; Manhattan distance; …
Similarities or dissimilarities/distances
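The four association measures named above can be written in a few lines each. This is a sketch under the usual textbook definitions; the example records are hypothetical, and returning 0.0 from the Jaccard coefficient when neither record has any presences is a convention chosen here, not something the slides specify.

```python
import math


def simple_matching(a, b):
    """Proportion of attributes on which two binary records agree (1-1 or 0-0)."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def jaccard(a, b):
    """Shared presences divided by attributes present in at least one record."""
    both = sum(x == 1 and y == 1 for x, y in zip(a, b))
    either = sum(x == 1 or y == 1 for x, y in zip(a, b))
    return both / either if either else 0.0  # convention for two all-absent records


def euclidean(a, b):
    """Straight-line distance between two continuous records."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute differences between two continuous records."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Hypothetical presence/absence of five species at two sites
site1 = [1, 1, 0, 0, 1]
site2 = [1, 0, 0, 1, 1]
# Hypothetical continuous measurements on two objects
u = [1.0, 2.0, 3.0]
v = [4.0, 6.0, 3.0]

print(simple_matching(site1, site2), jaccard(site1, site2))
print(euclidean(u, v), manhattan(u, v))
```

The simple matching coefficient counts joint absences as agreement, whereas the Jaccard coefficient ignores them, so the two can rank pairs of sites differently when many species are absent from both.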
Finding groups of similar objects
Hierarchical Cluster Analysis
  Aim: to arrange the objects into homogeneous groups
  Output:
    Dendrogram showing how objects are joined together
    Levels of similarity/distance at which groups are formed or divided
  Primarily a descriptive technique
  Interpretation includes identification of “how many groups?”
Agglomerative methods
  Start with individual objects, group the two most similar together, re-calculate similarity between the new group and other objects, and continue until all objects are in one group
  Different rules (algorithms) for re-calculating similarities, resulting in different dendrograms
Simple example
Relative intensity of fluorescence spectrum at four different wavelengths
Calculate distances using the “Euclidean” metric standardised by the mean absolute deviation
Illustrate HCA using the Single Link (Nearest Neighbour) algorithm
  Distance to new group is the minimum of the distances to the objects being grouped

Compound   Wavelength (nm)
           300   350   400   450
A           16    62    67    27
B           15    60    69    31
C           14    59    68    31
D           15    61    71    31
E           14    60    70    30
F           14    59    69    30
G           17    63    68    29
H           16    62    69    28
I           15    60    72    30
J           17    63    69    27
K           18    62    68    28
L           18    64    67    29
Step 1
Distance matrix (lower triangle):

       A      B      C      D      E      F      G      H      I      J      K      L
A  0.000
B  4.024  0.000
C  4.256  1.403  0.000
D  4.967  1.955  3.179  0.000
E  4.218  1.452  2.112  1.615  0.000
F  4.016  1.334  1.213  2.568  1.153  0.000
G  2.128  3.230  4.036  3.819  3.769  3.899  0.000
H  1.991  2.897  3.693  3.197  2.818  3.099  1.615  0.000
I  5.400  2.849  3.882  1.403  1.991  2.935  4.580  3.559  0.000
J  2.112  4.157  4.980  4.256  4.103  4.415  1.841  1.334  4.503  0.000
K  2.008  3.787  4.526  4.415  4.256  4.256  1.334  1.841  4.858  1.615  0.000
L  2.667  4.429  5.108  5.108  5.131  5.163  1.403  2.918  5.928  2.650  1.861  0.000

Identify minimum distance: 1.153, between E and F
Join these objects into a group
Re-calculate all distances to this group
And repeat!
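The procedure can be sketched in plain Python using the fluorescence table: each wavelength is divided by its mean absolute deviation before the Euclidean distance is taken, and groups are then merged by single link. The function names are illustrative, and the brute-force search over cluster pairs is chosen for readability rather than speed.

```python
import math

# Relative fluorescence intensities at 300, 350, 400 and 450 nm (from the table)
data = {
    "A": [16, 62, 67, 27], "B": [15, 60, 69, 31], "C": [14, 59, 68, 31],
    "D": [15, 61, 71, 31], "E": [14, 60, 70, 30], "F": [14, 59, 69, 30],
    "G": [17, 63, 68, 29], "H": [16, 62, 69, 28], "I": [15, 60, 72, 30],
    "J": [17, 63, 69, 27], "K": [18, 62, 68, 28], "L": [18, 64, 67, 29],
}
names = list(data)


def mean_abs_dev(values):
    """Mean absolute deviation about the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# One scale factor per wavelength
scales = [mean_abs_dev([data[k][j] for k in names]) for j in range(4)]


def distance(a, b):
    """Euclidean distance after dividing each variable by its mean absolute deviation."""
    return math.sqrt(sum(((x - y) / s) ** 2 for x, y, s in zip(data[a], data[b], scales)))


def cluster_distance(c1, c2):
    """Single link: the minimum distance between any member of c1 and any member of c2."""
    return min(distance(a, b) for a in c1 for b in c2)


def single_link(objects):
    """Merge the two closest groups until one remains; return (group1, group2, level)."""
    clusters = [(name,) for name in objects]
    merges = []
    while len(clusters) > 1:
        level, i, j = min(
            (cluster_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], level))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


merges = single_link(names)
for group1, group2, level in merges:
    print("".join(group1), "+", "".join(group2), "at", round(level, 3))
```

The first merge printed joins E and F at 1.153, matching the minimum identified in the distance matrix; the later merge levels are the heights at which branches join in a single-link dendrogram.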
Dendrograms
[Figure: cluster dendrogram produced by hclust with the "single" method on the dissimilarity matrix; vertical axis Height (about 1.0 to 2.5); leaf labels B, C, E, F, D, I, A, H, J, L, G, K]
Non-Hierarchical Clustering
Aim: to divide units into a number of mutually exclusive groups
Optimise some suitable criterion directly from the data matrix
  Does not analyse the similarity matrix
Criteria include
  Maximise the total Euclidean distance between groups
  Minimise the determinant of the within-group variance-covariance matrix, pooled over groups
Repeat for different numbers of groups
  Usually start with a large number of groups, and gradually reduce the number
  Grouping is not hierarchical, i.e. best 3-group solution may not be best 2-group solution with one group divided into 2 sub-groups
  Need a rule to determine the “right” number of groups
Also known as K-means clustering
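A common way to carry out K-means clustering is Lloyd's algorithm, which alternates between assigning each unit to its nearest group centre and recomputing the centres, so reducing the within-group sum of squares. The sketch below assumes that criterion (the slide lists others), uses the first k points as starting centres purely for simplicity, and runs on invented two-variable data.

```python
def kmeans(points, k, iterations=100):
    """Lloyd's algorithm with the first k points as starting centres (an arbitrary choice)."""
    centres = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for step in range(iterations):
        # Assign each point to its nearest centre (squared Euclidean distance)
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(p, centres[c])))
            for p in points
        ]
        if step > 0 and new_labels == labels:
            break  # assignments are stable, so the centres will not move again
        labels = new_labels
        # Move each centre to the mean of the points assigned to it
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centre if a group becomes empty
                centres[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centres


# Hypothetical data: two well-separated groups of three units each
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centres = kmeans(points, k=2)
print(labels, centres)
```

The result depends on the starting centres, so in practice the algorithm is usually run from several starts and the solution with the best criterion value is kept.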
Analysing Association Data
Multidimensional Scaling (MDS) and Principal Co-ordinate Analysis (PCO)
  Analyse the same matrix of similarities or distances to produce a multidimensional picture of the relationships between units
  Generates an “ordination” or configuration for a set of objects
  Matches the inter-point distances to the dissimilarities or distances
PCO works with similarities
  Produces an analytical solution (metric scaling)
  Matches configuration distances to the observed dissimilarities based on the sum of squared differences
MDS works with distances or dissimilarities
  Produces an iterated solution (non-metric scaling)
  Matches the configuration distances to the observed dissimilarities based on rank orders (monotonic regression)
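The analytical (metric) solution can be sketched as classical scaling: double-centre the matrix of -0.5 times the squared distances, then take the leading eigenvectors scaled by the square roots of their eigenvalues. The version below assumes the associations are supplied as distances, and finds the eigenvectors by power iteration with deflation, a simple stand-in for a proper eigenvalue routine. The function names and the test configuration are illustrative only.

```python
import math


def leading_eigenpair(matrix, iterations=500):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix by power iteration."""
    n = len(matrix)
    # Arbitrary start vector; it must not be orthogonal to the eigenvector sought
    vec = [float((i + 1) ** 2) for i in range(n)]
    norm = math.sqrt(sum(v * v for v in vec))
    vec = [v / norm for v in vec]
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(v * v for v in new))
        if norm < 1e-12:
            break
        vec = [v / norm for v in new]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(n) for j in range(n))
    return value, vec


def principal_coordinates(dist, dims=2):
    """Coordinates whose inter-point distances approximate the supplied distances."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(r) / n for r in a]
    grand_mean = sum(row_mean) / n
    # Double-centring: subtract row and column means, add back the grand mean
    b = [[a[i][j] - row_mean[i] - row_mean[j] + grand_mean for j in range(n)]
         for i in range(n)]
    coords = [[] for _ in range(n)]
    for _ in range(dims):
        value, vec = leading_eigenpair(b)
        scale = math.sqrt(max(value, 0.0))  # negative roots contribute nothing
        for i in range(n):
            coords[i].append(vec[i] * scale)
        # Deflation: remove the dimension just extracted
        b = [[b[i][j] - value * vec[i] * vec[j] for j in range(n)] for i in range(n)]
    return coords


# Hypothetical objects at the corners of a 3-by-4 rectangle
truth = [(0.0, 0.0), (3.0, 0.0), (3.0, 4.0), (0.0, 4.0)]
dist = [[math.dist(p, q) for q in truth] for p in truth]
coords = principal_coordinates(dist, dims=2)
fitted = [[math.dist(p, q) for q in coords] for p in coords]
print([[round(d, 3) for d in row] for row in fitted])
```

The recovered configuration may be rotated or reflected relative to the original, since only the inter-point distances are matched.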
MDS Output for fluorescence data
[Figure: MDS configuration of compounds A to L, plotted on Axis 1 against Axis 2, both running from about -1.0 to 1.0]
Exploring patterns
Principal Component Analysis (PCA)
  Aim: to identify the (combinations of) variables that explain the variability within a data set
  Primarily a descriptive technique
  Usually for quantitative variables
  Starts with DATA matrix (p variables by n units)
  Transforms original set of correlated variables into new set of orthogonal (uncorrelated) variables
    Linear combinations of original variables
    First principal component accounts for as much of the variability in the data as possible
    Second principal component accounts for as much of the remaining variability as possible, and is orthogonal to the first
    etc.
Matrix algebra
PCA best described in terms of matrix algebra
  In common with almost all multivariate analysis methods
PCA is an eigenvalue decomposition of the matrix of associations between the variables
  Produces two matrices
    A diagonal matrix containing the eigenvalues
    A rectangular (n-by-p) matrix containing the eigenvectors
Three possible matrices of associations can be used
  Constructed from the original data matrix
    Sum of Squares and Products Matrix (SSPM)
    Variance-Covariance Matrix
    Correlation Matrix
  Different results from the PCA applied to each
PCA Output
Roots (eigenvalues)
  How much of the variation is explained by each component
  Expressed as a percentage of total
  Indicates how many components are necessary
Loadings (eigenvectors)
  How each original variable contributes to each principal component
  Shows which variables are important
Scores
  Values of each observation on each principal component
Fluorescence example
Relative intensity of fluorescence spectrum at four different wavelengths

Compound   Wavelength (nm)
           300   350   400   450
A           16    62    67    27
B           15    60    69    31
C           14    59    68    31
D           15    61    71    31
E           14    60    70    30
F           14    59    69    30
G           17    63    68    29
H           16    62    69    28
I           15    60    72    30
J           17    63    69    27
K           18    62    68    28
L           18    64    67    29
PCA output
Analysis based on the variance-covariance matrix
  Variances of original variables are similar (~25% each)
PC1 accounts for almost 73% of total variability
  First two PCs account for nearly 89%
Obtain PC scores for each compound by multiplying observed values by coefficients (loadings)
  View groupings against PCs

Variance-covariance matrix (lower triangle):

           300       350       400       450
300     2.2046
350     2.2500    2.7500
400    -1.1136   -1.1591    2.2652
450    -1.4773   -1.7046    1.0227    2.2046

Eigenvalues:

                PC1      PC2      PC3      PC4
Eigenvalue   6.8519   1.4863   0.8795   0.2066
Proportion    0.727    0.158    0.093    0.022
Cumulative    0.727    0.885    0.978    1.000

Loadings:

Variable     PC1      PC2      PC3      PC4
300        0.529   -0.218   -0.343    0.745
350        0.594   -0.319   -0.324   -0.664
400       -0.383   -0.917    0.100    0.050
450       -0.470    0.099   -0.876   -0.041
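The first principal component can be recomputed from the fluorescence table as a check. The sketch builds the variance-covariance matrix, finds its dominant eigenvector by power iteration (a simple substitute for a full eigenvalue decomposition, used here to avoid external libraries), and forms PC1 scores by multiplying the observed values by the loadings. The variable and function names are illustrative.

```python
import math

names = "ABCDEFGHIJKL"
# Relative fluorescence intensities at 300, 350, 400 and 450 nm, compounds A to L
data = [
    [16, 62, 67, 27], [15, 60, 69, 31], [14, 59, 68, 31], [15, 61, 71, 31],
    [14, 60, 70, 30], [14, 59, 69, 30], [17, 63, 68, 29], [16, 62, 69, 28],
    [15, 60, 72, 30], [17, 63, 69, 27], [18, 62, 68, 28], [18, 64, 67, 29],
]


def covariance_matrix(rows):
    """Sample variance-covariance matrix (divisor n - 1)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [
        [sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
         for j in range(p)]
        for i in range(p)
    ]


def leading_eigenpair(matrix, iterations=200):
    """Dominant eigenvalue and unit eigenvector of a symmetric matrix by power iteration."""
    p = len(matrix)
    vec = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        new = [sum(matrix[i][j] * vec[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(v * v for v in new))
        vec = [v / norm for v in new]
    value = sum(vec[i] * matrix[i][j] * vec[j] for i in range(p) for j in range(p))
    return value, vec


cov = covariance_matrix(data)
total_variance = sum(cov[j][j] for j in range(4))
eigenvalue, loadings = leading_eigenpair(cov)
if loadings[0] < 0:  # the sign of an eigenvector is arbitrary; fix it for display
    loadings = [-v for v in loadings]
proportion = eigenvalue / total_variance

# PC1 score = observed values multiplied by the loadings (no centring)
scores = {name: sum(w * x for w, x in zip(loadings, row)) for name, row in zip(names, data)}

print("PC1 loadings:", [round(v, 3) for v in loadings])
print("Proportion of variance:", round(proportion, 3))
print("PC1 scores:", {k: round(v, 2) for k, v in scores.items()})
```

Because the scores here are formed from the uncentred observations, they are shifted by a constant relative to scores from centred data; the relative positions of the compounds are unchanged.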
Example PCA plot
[Figure: fluorescence intensities, PC2 (about -86 to -80.5) plotted against PC1 (0 to 9), with points labelled by compound A to L]
Biplots
Graphical approach to present results from PCA (and a number of other multivariate methods)
Plots objects as points
  As for example PCA plot
Plots variables as vectors
Supports interpretation of analysis
  Shows those objects that are similar
  Identifies variables that are highly correlated
  Identifies variables that are particularly associated with groups of objects
Correspondence Analysis
Analogous to Principal Components Analysis
  Appropriate for categorical variables rather than continuous variables
  Also known as “reciprocal averaging”
  Finds an ordination (ordering) of each categorical variable that maximises the correlation between the two categorical variables
Used in the analysis of ecological community data
  e.g. counts of the numbers of different species in different environments – identifies the species associated with particular environments
Extension to Canonical Correspondence Analysis
  Incorporates the influence of one or more explanatory variables (such as environmental variables) in finding the ordination
  Enables sites to be ranked along each environmental variable, taking account of correlations between species and environmental variables
Factor Analysis
Similar approach to PCA, predominantly used in the social sciences
Principle: that correlations between variables can be explained by a number of common factors, plus a number of specific factors (one for each original variable)
Focused on explaining the covariance between variables
  While PCA is focused on explaining the maximum amount of variance in the data
Observed variables are assumed to be linear combinations of hypothetical underlying (and un-observable) factors
  Creates an underlying causal model
Originally applied to the derivation of factors underlying intelligence
Can be used in a hypothesis testing mode (confirmatory factor analysis)
Canonical Variate Analysis (CVA)
Similar to PCA in working on the data matrix
  Works on within-group SSPM and between-group SSPM
Finds combinations of the original variables to maximise the ratio of between-group variance to within-group variance
  Groups are separated as much as possible whilst keeping each group as compact as possible
Combinations can be used to discriminate between the groups
  For g groups – at most g-1 combinations of variables to discriminate between them
  Need at least g-1 original variables
For a new observation, use “discriminant” functions to identify which group it is most likely to belong to
  Discriminant Analysis
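For the simplest case of two groups there is a single canonical variate, and its coefficients are proportional to the inverse of the within-group SSPM multiplied by the difference between the group means (Fisher's linear discriminant). The sketch below is restricted to two variables so the matrix inverse can be written out by hand, assigns a new observation to the group on its side of the midpoint between the group means (which assumes the groups are equally likely), and uses invented data.

```python
def mean_vector(rows):
    """Mean of each variable over the rows of one group."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def within_sspm(groups):
    """Within-group sums of squares and products matrix, pooled over groups (2 variables)."""
    w = [[0.0, 0.0], [0.0, 0.0]]
    for rows in groups:
        m = mean_vector(rows)
        for r in rows:
            d = [r[0] - m[0], r[1] - m[1]]
            for i in range(2):
                for j in range(2):
                    w[i][j] += d[i] * d[j]
    return w


def discriminant_coefficients(group1, group2):
    """Coefficients proportional to W^-1 (mean1 - mean2)."""
    w = within_sspm([group1, group2])
    m1, m2 = mean_vector(group1), mean_vector(group2)
    diff = [m1[0] - m2[0], m1[1] - m2[1]]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    # Explicit inverse of a 2-by-2 matrix applied to the mean difference
    return [
        (w[1][1] * diff[0] - w[0][1] * diff[1]) / det,
        (-w[1][0] * diff[0] + w[0][0] * diff[1]) / det,
    ]


def classify(x, group1, group2):
    """Return 1 or 2 according to which side of the midpoint the discriminant score falls."""
    coef = discriminant_coefficients(group1, group2)

    def score(v):
        return coef[0] * v[0] + coef[1] * v[1]

    midpoint = 0.5 * (score(mean_vector(group1)) + score(mean_vector(group2)))
    return 1 if score(x) > midpoint else 2


# Hypothetical measurements of two variables on two groups of four units
group1 = [(2.0, 3.0), (3.0, 3.5), (2.5, 4.0), (3.5, 4.5)]
group2 = [(6.0, 1.0), (7.0, 2.0), (6.5, 1.5), (7.5, 2.5)]
coef = discriminant_coefficients(group1, group2)
print(coef, classify((3.0, 4.0), group1, group2), classify((7.0, 2.0), group1, group2))
```

With more than two groups the canonical variates come from an eigenvalue decomposition involving both the between-group and within-group SSPMs, which this two-group sketch does not attempt.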
CVA Output
Latent roots (eigenvalues)
  How much variation is explained by each component
    Expressed as a percentage of total
  Root greater than 1 indicates that there is discrimination between groups on that canonical variate
    Explicit test for dimensionality
Latent vectors (loadings)
  Contributions of each original variable to new canonical variates
Canonical Variate Means
  Mean values for each group on the canonical variates
Adjustment terms so that the centroid of group means is at the origin
Produce plots showing each group mean with a 95% confidence interval
  Construct confidence intervals for the “population”
Example – Fisher “iris” data
Measurements of sepal length, sepal width, petal length and petal width for 50 plants of each of three iris species
Plot shows separation between the three species
Loadings (coefficients) indicate which variables are used to separate the groups
Multivariate ANOVA and Regression
Generalisation of univariate Analysis of Variance
  Analyse multiple variables of data from a designed experiment
  Assess for the effects of different factors on the whole set of variables
  Takes account of the covariance between variables
  Interpretation based on matrices of variances and covariances
Similar approach for multivariate regression
  Relates a set of correlated response variables to one or more explanatory variables
  PC Regression
  PLS Regression
Procrustes Rotation
Provides a way of comparing two (or more) multidimensional configurations of a set of units
  Procrustes = Greek inn-keeper who fitted his guests to one size of bed by chopping bits off or stretching bits!
Takes one configuration and fits the second configuration to it
  Combinations of rotation, reflection and scaling each axis
  Measure how much manipulation is needed to make the configurations similar
  Measure how similar it is possible to make them
Generalise to more than two configurations
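For two-dimensional configurations the best-fitting rotation has a closed form, which makes a compact illustration possible. The sketch below is a simplified variant: it uses translation, rotation and a single overall scale factor, and does not consider reflection or separate scaling of each axis. The residual sum of squares measures how similar the two configurations can be made. All names and the test configuration are hypothetical.

```python
import math


def centre(points):
    """Shift a configuration so that its centroid is at the origin."""
    mx = sum(p[0] for p in points) / len(points)
    my = sum(p[1] for p in points) / len(points)
    return [(x - mx, y - my) for x, y in points]


def procrustes_2d(target, source):
    """Fit source to target by translation, rotation and one overall scale factor."""
    t, s = centre(target), centre(source)
    dot = sum(sx * tx + sy * ty for (sx, sy), (tx, ty) in zip(s, t))
    cross = sum(sx * ty - sy * tx for (sx, sy), (tx, ty) in zip(s, t))
    angle = math.atan2(cross, dot)  # rotation that maximises agreement with the target
    c, sn = math.cos(angle), math.sin(angle)
    rotated = [(c * x - sn * y, sn * x + c * y) for x, y in s]
    scale = (
        sum(rx * tx + ry * ty for (rx, ry), (tx, ty) in zip(rotated, t))
        / sum(x * x + y * y for x, y in s)
    )
    fitted = [(scale * x, scale * y) for x, y in rotated]
    residual = sum((fx - tx) ** 2 + (fy - ty) ** 2 for (fx, fy), (tx, ty) in zip(fitted, t))
    return fitted, residual, math.degrees(angle), scale


# Hypothetical target, and a copy rotated by 30 degrees, doubled in size and shifted
target = [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)]
theta = math.radians(30.0)
source = [
    (2.0 * (math.cos(theta) * x - math.sin(theta) * y) + 5.0,
     2.0 * (math.sin(theta) * x + math.cos(theta) * y) - 3.0)
    for x, y in target
]
fitted, residual, angle_deg, scale = procrustes_2d(target, source)
print(round(residual, 10), round(angle_deg, 3), round(scale, 3))
```

A residual near zero means the second configuration is the first one up to position, orientation and size; larger residuals indicate real differences in the relative positions of the units.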