
Exploratory Data Analysis and Multivariate Strategies

Andrew Mead (School of Life Sciences)

Multi-… approaches in statistics

Multiple comparison tests
  Multiple testing adjustments
  Methods for adjusting the significance levels when doing a large number of tests (comparisons between treatments) within a single analysis

Multiple regression analysis
  Statistical model building with more than one explanatory variable

Multi-factor analysis of variance
  Analysis of designed experiments with more than one explanatory factor

Multivariate Analysis
  Methods to summarise and explore the relationships among multiple response variables, and/or to assess differences among “treatments” based on multiple response variables
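As a rough illustration of a multiple testing adjustment, the sketch below applies the Bonferroni rule, which is one common choice and not necessarily the method the lecture has in mind. The p-values and the function name `bonferroni` are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Return Bonferroni-adjusted p-values and reject/accept flags."""
    m = len(p_values)
    # Each p-value is multiplied by the number of tests, capped at 1.
    adjusted = [min(1.0, p * m) for p in p_values]
    reject = [p_adj <= alpha for p_adj in adjusted]
    return adjusted, reject


# Hypothetical p-values from four treatment comparisons (illustration only).
p_values = [0.001, 0.012, 0.030, 0.200]
adjusted, reject = bonferroni(p_values)
print(adjusted)
print(reject)
```

The adjustment is equivalent to testing each comparison at level alpha divided by the number of tests, so it becomes conservative when many comparisons are made.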

Multivariate data

Responses by a number of individuals to a range of different (related) questions in a (social studies) survey

Counts of different species in a range of locations in an ecological survey

Measurements of a range of traits of individual people, animals, plants, products, …

Different medical measurements made on a group of patients

Measurements of gene expression, protein expression or metabolite expression in biological samples

Counts of sequence consensus matches from microbial samples

Data on the “distances” between a number of objects

Multivariate questions

Identifying groups of objects with similar (and different) responses
  e.g. UK landscape areas with similar percentage composition of different land covers (woodland, arable, urban, …)?

Identifying the particular measurements that contribute to the variability among a set of objects
  e.g. weed species that are present under different long-term herbicide strategies?

Identifying the particular measurements that contribute to differences between groups of objects
  e.g. which genes discriminate between patients with and without some form of cancer?

Identifying the particular measurements that explain variation in some over-arching response variable
  e.g. which traits of predatory insects influence the predation rate of aphids?

Exploratory data analysis

Summary statistics / graphical summaries
  Variability for each variable/measurement
  Groups of observations
    both pre-determined (to find potential differences)
    and to be identified (based on each individual variable)

Scatter plots / correlations
  Associations between pairs of variables

Univariate analysis

For each individual variable:

Hypothesis tests
  Choice depends on the question to be answered

Analysis of variance
  For variables measured in designed experiments

Regression analysis
  To build statistical models to describe how one response variable depends on one (or more) explanatory variables

Generalised Linear Models (GLMs)
  For data where standard assumptions do not hold!

Time Series Analysis

…

Multivariate analysis

For a set of “correlated” variables:
  Assess relationships between variables
  Consider the effects of “treatments” on these relationships
  Consider how a “response” depends on these relationships

Multivariate methods concerned with “data reduction”
  Summarise the correlations between variables
  Produce a smaller set of (uncorrelated) variables containing the important information

For a set of “related” objects
  Identify groups of similar objects
  Identify differences between groups of similar objects
  And what makes the objects similar!

Simple graphical summaries 1

For compositional data
  e.g. numbers of onion bulbs in different marketable size grades

Present as a stacked bar-chart
  For raw data
  For percentage of the total
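The step from raw counts to percentages of the total can be sketched as follows. The counts, the four size grades and the sample labels are hypothetical values invented for illustration; they are not the data behind the lecture's charts.

```python
# Hypothetical counts of onion bulbs in four size grades for five samples.
counts = {
    "A": [5, 12, 18, 6],
    "B": [8, 10, 15, 4],
    "C": [3, 9, 20, 10],
    "D": [6, 14, 12, 3],
    "E": [4, 11, 16, 9],
}


def to_percentages(row):
    """Express each count as a percentage of the row total."""
    total = sum(row)
    return [100.0 * value / total for value in row]


percentages = {sample: to_percentages(row) for sample, row in counts.items()}
for sample, row in percentages.items():
    print(sample, [round(value, 1) for value in row])
```

Plotting the raw counts shows differences in total yield, while the percentage version puts every bar on the same 0 to 100 scale so that only the composition is compared.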

[Figure: two stacked bar-charts for samples A to E, one of raw counts (vertical axis 0 to 45) and one of percentages of the total (vertical axis 0% to 100%)]

Simple graphical summaries 2

More general data
  e.g. different measurements on a set of plants

Scatter plots for each pair of variables
  Present in a matrix

Calculate linear correlation coefficient for each pair of variables
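A minimal sketch of the correlation step, written in plain Python. Since the plant measurements behind the scatter plot matrix are not reproduced in the slides, the example borrows the fluorescence intensities tabulated later in the lecture; the function name `pearson` is an assumption of this sketch.

```python
from math import sqrt


def pearson(x, y):
    """Linear (Pearson) correlation coefficient between two variables."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Fluorescence intensities for compounds A to L at each wavelength (nm).
data = {
    "300": [16, 15, 14, 15, 14, 14, 17, 16, 15, 17, 18, 18],
    "350": [62, 60, 59, 61, 60, 59, 63, 62, 60, 63, 62, 64],
    "400": [67, 69, 68, 71, 70, 69, 68, 69, 72, 69, 68, 67],
    "450": [27, 31, 31, 31, 30, 30, 29, 28, 30, 27, 28, 29],
}

names = list(data)
corr = {(a, b): pearson(data[a], data[b]) for a in names for b in names}
for a in names:
    print(a, [round(corr[(a, b)], 3) for b in names])
```

The result is a symmetric p-by-p matrix with ones on the diagonal, which is one form of the association matrix between variables.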

[Figure: scatter plot matrix for variables C1 to C5, one panel for each pair of variables]

Two forms of data matrix

The DATA matrix
  p variables for each of n samples (observations)
  Presented in a rectangular matrix
    n rows and p columns

The ASSOCIATION matrix
  Distance, similarity or dissimilarity
  Between every pair of variables or every pair of samples
  Symmetric square matrix
    n-by-n - between samples
    p-by-p - between variables
    just show lower triangle
  Turns multivariate data into univariate data?

[Diagram: the data matrix (n samples by p variables) alongside the p-by-p association matrix between variables, of which only the lower triangle is shown]

Analysing Association Data


Start with associations
  Distances between locations on a map
  Psychometric (sensory) similarities between products

Construct associations from data
  Depends on the types of data
  Binary (presence/absence) data
    Simple matching coefficient; Jaccard coefficient; …
  Continuous data
    Euclidean distance; Manhattan distance; …

Similarities or Dissimilarities/Distances
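The two coefficients named for binary data can be sketched as below. The presence/absence records for the two sites are hypothetical, and returning 1.0 when both samples are empty is a convention assumed here rather than something the slides specify.

```python
def simple_matching(x, y):
    """Proportion of variables on which the two samples agree (1-1 or 0-0)."""
    matches = sum(1 for a, b in zip(x, y) if a == b)
    return matches / len(x)


def jaccard(x, y):
    """Shared presences divided by variables present in at least one sample."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    # Assumed convention: two samples with no presences count as identical.
    return both / either if either else 1.0


# Hypothetical presence (1) / absence (0) of six species at two sites.
site_1 = [1, 1, 0, 0, 1, 0]
site_2 = [1, 0, 0, 0, 1, 1]

print(round(simple_matching(site_1, site_2), 3))
print(round(jaccard(site_1, site_2), 3))
```

The difference between the two is the treatment of joint absences: simple matching counts them as agreement, while Jaccard ignores them, which often matters for species data where most species are absent from most sites.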

Finding groups of similar objects


Hierarchical Cluster Analysis
  Aim: to arrange the objects into homogeneous groups
  Output:
    Dendrogram showing how objects are joined together
    Levels of similarity/distance at which groups are formed or divided
  Primarily a descriptive technique
    Interpretation includes identification of “how many groups?”

Agglomerative methods
  Start with individual objects, group the two most similar together, re-calculate similarity between new group and other objects, and continue until all objects are in one group
  Different rules (algorithms) for re-calculating similarities, resulting in different dendrograms

Simple example

Relative intensity of fluorescence spectrum at four different wavelengths

Calculate distances using the “Euclidean” metric standardised by the mean absolute deviation

Illustrate HCA using Single Link (Nearest Neighbour) algorithm
  Distance to new group is the minimum of the distances to the objects being grouped

Compound   Wavelength (nm)
            300   350   400   450
A            16    62    67    27
B            15    60    69    31
C            14    59    68    31
D            15    61    71    31
E            14    60    70    30
F            14    59    69    30
G            17    63    68    29
H            16    62    69    28
I            15    60    72    30
J            17    63    69    27
K            18    62    68    28
L            18    64    67    29
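The distance calculation can be sketched as follows, reading “standardised by the mean absolute deviation” as dividing each wavelength's difference by that wavelength's mean absolute deviation about its mean before squaring. That reading is an assumption of this sketch, as are the names `mean_abs_dev` and `distance`.

```python
from math import sqrt

# Fluorescence intensities at 300, 350, 400 and 450 nm for each compound.
spectra = {
    "A": [16, 62, 67, 27], "B": [15, 60, 69, 31], "C": [14, 59, 68, 31],
    "D": [15, 61, 71, 31], "E": [14, 60, 70, 30], "F": [14, 59, 69, 30],
    "G": [17, 63, 68, 29], "H": [16, 62, 69, 28], "I": [15, 60, 72, 30],
    "J": [17, 63, 69, 27], "K": [18, 62, 68, 28], "L": [18, 64, 67, 29],
}


def mean_abs_dev(values):
    """Mean absolute deviation about the mean."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


# One scale factor per wavelength (column of the data matrix).
scales = [mean_abs_dev(column) for column in zip(*spectra.values())]


def distance(x, y):
    """Euclidean distance after dividing each difference by its scale."""
    return sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))


names = list(spectra)
dist = {(p, q): distance(spectra[p], spectra[q]) for p in names for q in names}

# Closest pair of distinct compounds.
pairs = [(p, q) for i, p in enumerate(names) for q in names[i + 1:]]
closest = min(pairs, key=lambda pair: dist[pair])
print(closest, round(dist[closest], 3))
print(round(dist[("A", "B")], 3))
```

Standardising puts the four wavelengths on a comparable scale, so that no single wavelength dominates the distance simply because its intensities vary more.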

Step 1

A   0.000
B   4.024  0.000
C   4.256  1.403  0.000
D   4.967  1.955  3.179  0.000
E   4.218  1.452  2.112  1.615  0.000
F   4.016  1.334  1.213  2.568  1.153  0.000
G   2.128  3.230  4.036  3.819  3.769  3.899  0.000
H   1.991  2.897  3.693  3.197  2.818  3.099  1.615  0.000
I   5.400  2.849  3.882  1.403  1.991  2.935  4.580  3.559  0.000
J   2.112  4.157  4.980  4.256  4.103  4.415  1.841  1.334  4.503  0.000
K   2.008  3.787  4.526  4.415  4.256  4.256  1.334  1.841  4.858  1.615  0.000
L   2.667  4.429  5.108  5.108  5.131  5.163  1.403  2.918  5.928  2.650  1.861  0.000
        A      B      C      D      E      F      G      H      I      J      K      L

Identify minimum distance 1.153 between E and F

Join these objects into a group

Re-calculate all distances to this group

And repeat!
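The join-and-repeat procedure can be sketched in plain Python. For brevity this version recomputes the single-link distance from the original compounds at every step instead of updating the distance matrix, which gives the same merge heights. The MAD-standardised Euclidean distance and all function names are assumptions of the sketch.

```python
from math import sqrt

spectra = {
    "A": [16, 62, 67, 27], "B": [15, 60, 69, 31], "C": [14, 59, 68, 31],
    "D": [15, 61, 71, 31], "E": [14, 60, 70, 30], "F": [14, 59, 69, 30],
    "G": [17, 63, 68, 29], "H": [16, 62, 69, 28], "I": [15, 60, 72, 30],
    "J": [17, 63, 69, 27], "K": [18, 62, 68, 28], "L": [18, 64, 67, 29],
}


def mean_abs_dev(values):
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / len(values)


scales = [mean_abs_dev(column) for column in zip(*spectra.values())]


def distance(x, y):
    return sqrt(sum(((a - b) / s) ** 2 for a, b, s in zip(x, y, scales)))


def single_link(group_1, group_2):
    """Nearest-neighbour rule: minimum distance between members of two groups."""
    return min(distance(spectra[p], spectra[q]) for p in group_1 for q in group_2)


# Start with every compound in its own group, then merge the closest pair.
clusters = [(name,) for name in spectra]
merges = []  # (group_1, group_2, height at which they join)
while len(clusters) > 1:
    candidates = [(single_link(c1, c2), c1, c2)
                  for i, c1 in enumerate(clusters) for c2 in clusters[i + 1:]]
    height, c1, c2 = min(candidates, key=lambda item: item[0])
    clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    merges.append((c1, c2, height))

for c1, c2, height in merges:
    print("".join(c1), "+", "".join(c2), "at", round(height, 3))
```

The printed heights are the levels at which branches join in the dendrogram; where several pairs tie at the same distance, the order in which they are merged depends on the implementation.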


Dendrograms

[Figure: cluster dendrogram from hclust (single linkage) applied to the dissimilarities; vertical axis “Height” with ticks from 1.0 to 2.5; leaves in the order B, C, E, F, D, I, A, H, J, L, G, K]

Non-Hierarchical Clustering

Aim: to divide units into a number of mutually exclusive groups

Optimise some suitable criterion directly from the data matrix
  Does not analyse the similarity matrix

Criteria include
  Maximise the total Euclidean distance between groups
  Minimise the determinant of the within-group variance-covariance matrix, pooled over groups

Repeat for different numbers of groups
  Usually start with a large number of groups, and gradually reduce the number
  Grouping is not hierarchical, i.e. best 3-group solution may not be best 2-group solution with one group divided into 2 sub-groups

Need a rule to determine the “right” number of groups

Also known as K-means clustering
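A minimal K-means sketch, using the standard alternating (Lloyd) scheme that minimises the within-group sum of squares. This is a swapped-in textbook algorithm, not necessarily the criterion or search strategy used in the lecture. It works on the fluorescence intensities from the hierarchical clustering example, unstandardised, and the choice of compounds A and B as starting centres is arbitrary.

```python
spectra = {
    "A": [16, 62, 67, 27], "B": [15, 60, 69, 31], "C": [14, 59, 68, 31],
    "D": [15, 61, 71, 31], "E": [14, 60, 70, 30], "F": [14, 59, 69, 30],
    "G": [17, 63, 68, 29], "H": [16, 62, 69, 28], "I": [15, 60, 72, 30],
    "J": [17, 63, 69, 27], "K": [18, 62, 68, 28], "L": [18, 64, 67, 29],
}


def sq_dist(x, y):
    """Squared Euclidean distance between two observations."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def centroid(points):
    """Column means of a non-empty list of observations."""
    return [sum(column) / len(points) for column in zip(*points)]


def k_means(data, start_names, max_iter=100):
    """Alternate between assigning units to the nearest centre and updating centres.

    Sketch only: there is no handling of groups that become empty.
    """
    centres = [list(data[name]) for name in start_names]
    labels = {}
    for _ in range(max_iter):
        new_labels = {
            name: min(range(len(centres)), key=lambda g: sq_dist(x, centres[g]))
            for name, x in data.items()
        }
        if new_labels == labels:  # assignments unchanged, so converged
            break
        labels = new_labels
        centres = [centroid([data[n] for n in data if labels[n] == g])
                   for g in range(len(centres))]
    return labels, centres


labels, centres = k_means(spectra, start_names=["A", "B"])
groups = [sorted(n for n in labels if labels[n] == g) for g in range(2)]
print(groups)
```

Because the result can depend on the starting centres, practical implementations usually try several starts and keep the best value of the criterion.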

Analysing Association Data

Multidimensional Scaling (MDS) and Principal Co-ordinate Analysis (PCO)
  Analyse the same matrix of similarities or distances to produce a multidimensional picture of the relationships between units
  Generates an “ordination” or configuration for a set of objects
  Matches the inter-point distances to the dissimilarities or distances

PCO works with similarities
  Produces an analytical solution (metric scaling)
  Matches configuration distances to the observed dissimilarities based on the sum of squared differences

MDS works with distances or dissimilarities
  Produces an iterated solution (non-metric scaling)
  Matches the configuration distances to the observed dissimilarities based on rank orders (monotonic regression)
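The analytical (metric scaling) solution can be sketched with NumPy as classical scaling: double-centre the squared distances and take an eigen-decomposition. The sketch starts from plain, unstandardised Euclidean distances between the fluorescence spectra instead of a similarity matrix; that choice and the variable names are assumptions made for illustration.

```python
import numpy as np

# Fluorescence intensities (rows: compounds A to L; columns: 300, 350, 400, 450 nm).
X = np.array([
    [16, 62, 67, 27], [15, 60, 69, 31], [14, 59, 68, 31], [15, 61, 71, 31],
    [14, 60, 70, 30], [14, 59, 69, 30], [17, 63, 68, 29], [16, 62, 69, 28],
    [15, 60, 72, 30], [17, 63, 69, 27], [18, 62, 68, 28], [18, 64, 67, 29],
], dtype=float)

# n-by-n matrix of Euclidean distances between every pair of compounds.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=2))

n = D.shape[0]
J = np.eye(n) - np.ones((n, n)) / n   # centring matrix
B = -0.5 * J @ (D ** 2) @ J           # doubly centred squared distances

eigvals, eigvecs = np.linalg.eigh(B)  # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Keep the dimensions with positive eigenvalues; scale eigenvectors to coordinates.
keep = eigvals > 1e-9
coords = eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Distances between points in the full configuration reproduce the input distances.
fitted = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))
print(np.round(coords[:, :2], 3))     # first two axes, as plotted in an ordination
```

Plotting only the first two columns of `coords` gives a two-dimensional picture; the quality of that picture depends on how much of the total the first two eigenvalues account for.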

MDS Output for fluorescence data

[Figure: MDS ordination of the twelve compounds (A to L), plotted on Axis 1 against Axis 2, both running from about -1.0 to 1.0]

Exploring patterns

Principal Component Analysis (PCA)
  Aim: to identify the (combinations of) variables that explain the variability within a data set
  Primarily a descriptive technique

Usually for quantitative variables
  Starts with DATA matrix (p variables by n units)
  Transforms original set of correlated variables into new set of orthogonal (uncorrelated) variables
    Linear combinations of original variables
    First principal component accounts for as much of the variability in the data as possible
    Second principal component accounts for as much of the remaining variability as possible, and is orthogonal to the first
    etc.

Matrix algebra

PCA best described in terms of matrix algebra
  In common with almost all multivariate analysis methods

PCA is an eigenvalue decomposition of the matrix of associations between the variables
  Produces two matrices
    A diagonal matrix containing the eigenvalues
    A square (p-by-p) matrix containing the eigenvectors

Three possible matrices of associations can be used
  Constructed from the original data matrix
    Sum of Squares and Products Matrix (SSPM)
    Variance-Covariance Matrix
    Correlation Matrix
  Different results from the PCA applied to each

PCA Output

Roots (eigenvalues)
  How much of the variation is explained by each component
  Expressed as a percentage of total
  Indicates how many components are necessary

Loadings (eigenvectors)
  How each original variable contributes to each principal component
  Shows which variables are important

Scores
  Values of each observation on each principal component

Fluorescence example

Relative intensity of fluorescence spectrum at four different wavelengths

Compound   Wavelength (nm)
            300   350   400   450
A            16    62    67    27
B            15    60    69    31
C            14    59    68    31
D            15    61    71    31
E            14    60    70    30
F            14    59    69    30
G            17    63    68    29
H            16    62    69    28
I            15    60    72    30
J            17    63    69    27
K            18    62    68    28
L            18    64    67    29

PCA output

Analysis based on the variance-covariance matrix

Variances of original variables are similar (~25% each)

PC1 accounts for almost 73% of total variability

First two PCs account for nearly 89%

Obtain PC Scores for each compound by multiplying observed values by coefficients (loadings)

View groupings against PCs

Variance-covariance matrix (lower triangle)

          300       350       400       450
300    2.2046
350    2.2500    2.7500
400   -1.1136   -1.1591    2.2652
450   -1.4773   -1.7046    1.0227    2.2046

Eigenvalue    6.8519   1.4863   0.8795   0.2066
Proportion    0.727    0.158    0.093    0.022
Cumulative    0.727    0.885    0.978    1.000

Variable      PC1      PC2      PC3      PC4
300          0.529   -0.218   -0.343    0.745
350          0.594   -0.319   -0.324   -0.664
400         -0.383   -0.917    0.100    0.050
450         -0.470    0.099   -0.876   -0.041
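A sketch of the same analysis with NumPy: an eigen-decomposition of the variance-covariance matrix of the fluorescence data, followed by scores obtained by multiplying the observed (uncentred) values by the loadings. The variable names are assumptions, and the sign of each eigenvector is arbitrary, so whole columns of loadings may come out with signs reversed relative to a printed table.

```python
import numpy as np

# Fluorescence intensities (rows: compounds A to L; columns: 300, 350, 400, 450 nm).
X = np.array([
    [16, 62, 67, 27], [15, 60, 69, 31], [14, 59, 68, 31], [15, 61, 71, 31],
    [14, 60, 70, 30], [14, 59, 69, 30], [17, 63, 68, 29], [16, 62, 69, 28],
    [15, 60, 72, 30], [17, 63, 69, 27], [18, 62, 68, 28], [18, 64, 67, 29],
], dtype=float)

# p-by-p variance-covariance matrix (divisor n - 1).
S = np.cov(X, rowvar=False)

# Eigen-decomposition; eigh returns ascending order, so sort into descending order.
eigvals, eigvecs = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, loadings = eigvals[order], eigvecs[:, order]

proportion = eigvals / eigvals.sum()
cumulative = np.cumsum(proportion)

# Scores: observed values multiplied by the loadings (one column per component).
scores = X @ loadings

print(np.round(S, 4))
print(np.round(eigvals, 4))
print(np.round(proportion, 3), np.round(cumulative, 3))
print(np.round(loadings, 3))
```

A useful check on any PCA output is that the eigenvalues sum to the trace of the matrix being decomposed, i.e. to the total of the variances of the original variables.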

Example PCA plot

[Figure: “Fluorescence intensities”, a scatter plot of PC2 scores (vertical axis, -86 to -80.5) against PC1 scores (horizontal axis, 0 to 9), with each point labelled by compound (A to L)]

Biplots


Graphical approach to present results from PCA (and a number of other multivariate methods)

Plots objects as points
  As for example PCA plot

Plots variables as vectors

Supports interpretation of analysis
  Shows those objects that are similar
  Identifies variables that are highly correlated
  Identifies variables that are particularly associated with groups of objects

Correspondence Analysis

Analogous to Principal Components Analysis
  Appropriate for categorical variables rather than continuous variables
  Also known as “reciprocal averaging”
  Finds an ordination (ordering) of each categorical variable that maximises the correlation between the two categorical variables

Used in the analysis of ecological community data
  e.g. counts of the numbers of different species in different environments, to identify the species associated with particular environments

Extension to Canonical Correspondence Analysis
  Incorporates the influence of one or more explanatory variables (such as environmental variables) in finding the ordination
  Enables sites to be ranked along each environmental variable, taking account of correlations between species and environmental variables

Factor Analysis


Similar approach to PCA, predominantly used in the social sciences

Principle: that correlations between variables can be explained by a number of common factors, plus a number of specific factors (one for each original variable)
  Focused on explaining the covariance between variables
    While PCA is focused on explaining the maximum amount of variance in the data
  Observed variables are assumed to be linear combinations of hypothetical underlying (and un-observable) factors
    Creates an underlying causal model

Original application to the derivation of factors underlying intelligence

Can be used in a hypothesis testing mode (confirmatory factor analysis)
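The principle above is usually written as the common factor model. The notation below is an assumption of this sketch (the slides give no symbols): x is the vector of p observed variables, f the vector of k common factors, and the specific factors are collected in the error vector.

```latex
\[
  \mathbf{x} = \boldsymbol{\mu} + \boldsymbol{\Lambda}\,\mathbf{f} + \boldsymbol{\varepsilon},
  \qquad
  \operatorname{Cov}(\mathbf{f}) = \mathbf{I}_k,
  \quad
  \operatorname{Cov}(\boldsymbol{\varepsilon}) = \boldsymbol{\Psi}
  \ \text{(diagonal)},
  \quad
  \operatorname{Cov}(\mathbf{f}, \boldsymbol{\varepsilon}) = \mathbf{0}
\]
\[
  \Rightarrow\quad
  \operatorname{Cov}(\mathbf{x}) = \boldsymbol{\Lambda}\boldsymbol{\Lambda}^{\mathsf{T}} + \boldsymbol{\Psi}
\]
```

Under these assumptions the off-diagonal covariances between variables come entirely from the common factor loadings, while the diagonal matrix of specific variances absorbs what is unique to each variable.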

Canonical Variate Analysis (CVA)

Similar to PCA in working on the data matrix
  Works on within-group SSPM and between-group SSPM
  Finds combinations of the original variables to maximise the ratio of between-group variance to within-group variance
    Groups are separated as much as possible whilst keeping each group as compact as possible

Combinations can be used to discriminate between the groups
  For g groups: at most g-1 combinations of variables to discriminate between them
  Need at least g-1 original variables

For a new observation, use “discriminant” functions to identify which group it is most likely to belong to
  Discriminant Analysis
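The criterion described above can be written compactly. The symbols are assumptions of this sketch: B and W denote the between-group and within-group matrices (SSPMs, or the corresponding variance matrices, which differ only by constant divisors), and a is the vector of coefficients defining one canonical variate.

```latex
\[
  \max_{\mathbf{a}} \;
  \frac{\mathbf{a}^{\mathsf{T}} \mathbf{B}\, \mathbf{a}}
       {\mathbf{a}^{\mathsf{T}} \mathbf{W}\, \mathbf{a}}
  \qquad\Longrightarrow\qquad
  \mathbf{W}^{-1}\mathbf{B}\,\mathbf{a} = \lambda\,\mathbf{a}
\]
```

The maximising vectors are therefore eigenvectors of the product of the inverse within-group matrix and the between-group matrix, and each eigenvalue is the ratio achieved by its canonical variate. With g groups and p variables there are at most min(p, g-1) non-zero eigenvalues, which is why at most g-1 canonical variates are available.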

CVA Output

Latent roots (eigenvalues)
  How much variation is explained by each component
  Expressed as a percentage of total
  Root greater than 1 indicates that there is discrimination between groups on that canonical variate
  Explicit test for dimensionality

Latent vectors (loadings)
  Contributions of each original variable to new canonical variates

Canonical Variate Means
  Mean values for each group on the canonical variates
  Adjustment terms so that the centroid of group means is at the origin

Produce plots showing each group mean with a 95% confidence interval
  Construct confidence intervals for the “population”

Example – Fisher “iris” data

Measurements of sepal length, sepal width, petal length and petal width for 50 plants of each of three iris species

Plot shows separation between the three species

Loadings (coefficients) indicate which variables are used to separate the groups


Multivariate ANOVA and Regression

Generalisation of univariate Analysis of Variance
  Analyse multiple variables from a designed experiment
  Assess the effects of different factors on the whole set of variables
  Takes account of the covariance between variables
  Interpretation based on matrices of variances and covariances

Similar approach for multivariate regression
  Relates a set of correlated response variables to one or more explanatory variables
  PC Regression
  PLS Regression

Procrustes Rotation

Provides a way of comparing two (or more) multidimensional configurations of a set of units
  Procrustes = Greek inn-keeper who fitted his guests to one size of bed by chopping bits off or stretching bits!

Takes one configuration and fits the second configuration to it
  Combinations of rotation, reflection and scaling each axis
  Measure how much manipulation is needed to make the configurations similar
  Measure how similar it is possible to make them

Generalise to more than two configurations
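A minimal two-configuration sketch with NumPy, using the standard least-squares solution based on a singular value decomposition. It assumes translation, rotation/reflection and a single uniform scale factor, not separate scaling of each axis, and both the function name `procrustes_fit` and the example configurations are hypothetical.

```python
import numpy as np


def procrustes_fit(X, Y):
    """Fit configuration Y to X by translation, rotation/reflection and uniform scaling.

    Returns the fitted version of Y and the residual sum of squares.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Yc.T @ Xc)
    R = U @ Vt                          # orthogonal matrix: rotation, possibly with reflection
    scale = s.sum() / (Yc ** 2).sum()   # least-squares uniform scale factor
    Y_fitted = scale * Yc @ R + X.mean(axis=0)
    residual = ((X - Y_fitted) ** 2).sum()
    return Y_fitted, residual


# Hypothetical target configuration of five units in two dimensions.
X = np.array([[0.0, 0.0], [2.0, 0.0], [2.0, 1.0], [0.0, 1.0], [1.0, 3.0]])

# Second configuration: the same shape, rotated by 40 degrees, enlarged and shifted.
theta = np.deg2rad(40.0)
rotation = np.array([[np.cos(theta), -np.sin(theta)],
                     [np.sin(theta), np.cos(theta)]])
Y = 1.7 * X @ rotation + np.array([5.0, -3.0])

Y_fitted, residual = procrustes_fit(X, Y)
print(np.round(Y_fitted, 3))
print(residual)
```

Here the second configuration is an exact transformation of the first, so the residual is essentially zero; with real ordinations the residual sum of squares measures how similar the two configurations can be made.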
