Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez- Osuna


Page 1: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Multivariate Methods

Slides from Machine Learning by Ethem Alpaydin, expanded by some slides from Gutierrez-Osuna

Page 2: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Overview

Now that we have a way (the Bayesian decision criterion) to do classification if we know the probability distributions of the underlying classes, p(x|Ci), we are going to look into how to estimate those densities from given samples.

Then, once we have samples, we will estimate the class-conditional densities and use them in our classification decisions.

2

Page 3: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

3

Reminder: Expectations

Approximate Expectation (discrete and continuous)

The average value of a function f(x) under a probability distribution p(x) is called the expectation of f(x); i.e., the average is weighted by the relative probabilities of the different values of x.

Now we are going to look at the concepts of variance and covariance of one or more random variables, defined using expectation.
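The defining formulas on the original slide are not reproduced in the transcript; for reference, the standard definitions they refer to are:

$$E[f] = \sum_x p(x)\, f(x) \ \ \text{(discrete)}, \qquad E[f] = \int p(x)\, f(x)\, dx \ \ \text{(continuous)}$$

$$E[f] \approx \frac{1}{N} \sum_{n=1}^{N} f(x_n) \quad \text{(approximation using } N \text{ samples } x_n \text{ drawn from } p(x))$$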

Page 4: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

4

Variance and Covariance

The variance of f(x) provides a measure for how much f(x) varies around its mean E[f(x)].

Specific example:

Given a set of N points {x_i} in 1D space, the variance of the corresponding random variable x is var[x] = E[(x − x̄)²], where x̄ is the mean E[x].

You can estimate this variance as

$$\mathrm{var}(x) \approx \frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2$$

Remember the definition of expectation: the sample average approximates the expectation.
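A minimal Matlab sketch of this estimate (the data and variable names are illustrative):

x = randn(1, 1000) * 2 + 3;              % example: 1000 samples with mean 3, std 2
xmean = sum(x) / numel(x);               % sample mean
xvar  = sum((x - xmean).^2) / numel(x);  % variance estimate: (1/N) * sum (x_i - xmean)^2
% The built-in var(x, 1) uses the same 1/N normalization; var(x) uses 1/(N-1).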

Page 5: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

5

Variance and Covariance

The variance of f(x) provides a measure for how much f(x) varies around its mean E[f(x)].

The covariance of two random variables x and y measures the extent to which they vary together:

cov[x, y] = E[(x − E[x])(y − E[y])]   for two random variables x, y

Page 6: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Multivariate Normal Distribution

Page 7: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

8

The Gaussian Distribution

Page 8: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

9

Expectations

For normally distributed x:  E[x] = μ  and  var[x] = E[(x − μ)²] = σ².

Page 9: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

10

Assume we have a d-dimensional input x (e.g. 2D). We will see how we can characterize p(x), assuming it is normally distributed.

For 1 dimension, the parameters were the mean (μ) and variance (σ²):  Mean = E[x],  Variance = E[(x − μ)²].

For d dimensions, we need the d-dimensional mean vector μ and the d × d covariance matrix Σ.

If x ~ N_d(μ, Σ), then each dimension of x is univariate normal. The converse is not true.

Page 10: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

11

Normal Distribution & Multivariate Normal Distribution

For a single variable, the normal density function is:

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x-\mu)^2}{2\sigma^2} \right)$$

For variables in higher dimensions, this generalizes to:

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$$

where the mean μ is now a d-dimensional vector, Σ is a d × d covariance matrix, and |Σ| is the determinant of Σ.
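As an illustration (not from the original slides), the multivariate density can be evaluated directly in Matlab; with the Statistics Toolbox, mvnpdf should give the same value. Parameter values below are made up:

mu    = [0 0];
Sigma = [2 1; 1 2];
x     = [1 -1];
d     = numel(mu);
v     = (x - mu)';                                   % column vector x - mu
p_manual = exp(-0.5 * v' * (Sigma \ v)) / ...
           ((2*pi)^(d/2) * sqrt(det(Sigma)));
% p_toolbox = mvnpdf(x, mu, Sigma);                  % should match p_manual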

Page 11: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

12

Multivariate Parameters: Mean, Covariance

Mean:  $$\boldsymbol{\mu} = E[\mathbf{x}] = [\mu_1, \ldots, \mu_d]^T$$

Covariance:  $$\boldsymbol{\Sigma} = \mathrm{Cov}(\mathbf{x}) = E[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T]$$

with entries  $$\sigma_{ij} = \mathrm{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]$$

$$\boldsymbol{\Sigma} = \begin{bmatrix} \sigma_1^2 & \sigma_{12} & \cdots & \sigma_{1d} \\ \sigma_{21} & \sigma_2^2 & \cdots & \sigma_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ \sigma_{d1} & \sigma_{d2} & \cdots & \sigma_d^2 \end{bmatrix}$$

Page 12: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Matlab code

close all;
rng(1987);  % seed the random number generator so the results are reproducible

% Define the parameters of two 2d-Normal distributions
mu1 = [5 -5];        mu2 = [0 0];
sigma1 = [2 0; 0 2]; sigma2 = [5 5; 5 5];

N = 500;  % number of samples to generate from each distribution

samp1 = mvnrnd(mu1, sigma1, N);
samp2 = mvnrnd(mu2, sigma2, N);

figure; clf;
plot(samp1(:,1), samp1(:,2), '.', 'MarkerEdgeColor', 'b'); hold on;
plot(samp2(:,1), samp2(:,2), '*', 'MarkerEdgeColor', 'r');
axis([-20 20 -20 20]);
legend('d1', 'd2');

13

Page 13: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

14

mu1 = [5 -5]; mu2 = [0 0];

sigma1 = [2 0; 0 2]; sigma2 = [5 5; 5 5];

Page 14: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

15

mu1 = [5 -5]; mu2 = [0 0];

sigma1 = [2 0; 0 2]; sigma2 = [5 2; 2 5];

Page 15: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Matlab sample cont.

% Let's compute the mean and covariance as if we were only given this data

sampmu1 = sum(samp1)/N;
sampmu2 = sum(samp2)/N;

sampcov1 = zeros(2,2);
sampcov2 = zeros(2,2);

for i = 1:N
    sampcov1 = sampcov1 + (samp1(i,:) - sampmu1)' * (samp1(i,:) - sampmu1);
    sampcov2 = sampcov2 + (samp2(i,:) - sampmu2)' * (samp2(i,:) - sampmu2);
end

sampcov1 = sampcov1 / N;
sampcov2 = sampcov2 / N;

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% The same estimates USING MATRIX OPERATIONS.
% Notice that in samp1 the samples are stored in ROWS, but for this
% multiplication we need columns * rows, hence the transposes.

sampcov1 = (samp1'*samp1)/N - sampmu1'*sampmu1;

% Or simply use the built-in functions:
mu1hat    = mean(samp1);
sigma1hat = cov(samp1, 1);   % cov(X,1) normalizes by N; cov(X) alone uses N-1

16

Page 16: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

17

Variance: How much X varies around the expected value

Covariance measures the strength of the linear relationship between two random variables:

covariance becomes more positive for each pair of values which differ from their mean in the same direction

covariance becomes more negative with each pair of values which differ from their mean in opposite directions.

if two variables are independent, then their covariance/correlation is zero (converse is not true).

Covariance:  $$\sigma_{ij} = \mathrm{Cov}(x_i, x_j) = E[(x_i - \mu_i)(x_j - \mu_j)]$$

Correlation:  $$\mathrm{Corr}(x_i, x_j) = \rho_{ij} = \frac{\sigma_{ij}}{\sigma_i\,\sigma_j}$$

Correlation is a dimensionless measure of linear dependence; it ranges between −1 and +1.
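A short Matlab illustration, reusing samp1 from the earlier sampling example (illustrative only):

S = cov(samp1, 1);      % covariance matrix (1/N normalization)
R = corrcoef(samp1);    % correlation matrix; every entry lies in [-1, 1]
% Each R(i,j) equals S(i,j) / sqrt(S(i,i) * S(j,j)).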

Page 17: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

18

How to characterize differences between these distributions

Page 18: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

19

Covariance Matrices

Page 19: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

20

Contours of constant probability density for a 2D Gaussian distribution with (a) a general covariance matrix, (b) a diagonal covariance matrix (the covariance of x1 and x2 is 0), and (c) a covariance matrix proportional to the identity matrix (covariances are 0 and the variances of each dimension are the same).

Page 20: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

21

The shape and orientation of the hyper-ellipsoid centered at μ are defined by Σ: its eigenvectors (v1 and v2 in the figure) give the principal axes, and its eigenvalues give their lengths.

Page 21: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

22

Properties of Σ

A small value of |Σ| (the determinant of the covariance matrix) indicates that the samples are close to μ.

A small |Σ| may also indicate that there is a high correlation between variables.

If some of the variables are linearly dependent, or if the variance of one variable is 0, then Σ is singular and |Σ| is 0.

In that case, dimensionality should be reduced to get a positive definite matrix.

Page 22: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

23

Mahalanobis Distance

From the equation for the normal density

$$p(\mathbf{x}) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu}) \right)$$

it is apparent that points which have the same density must have the same constant term in the exponent, (x − μ)^T Σ^{-1} (x − μ).

The Mahalanobis distance measures the distance from x to μ in terms of Σ:

$$r^2 = (\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$$
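A minimal Matlab sketch comparing the two distances for a single point (parameter values are illustrative):

mu    = [0 0];
Sigma = [4 2; 2 3];
x     = [2 1];
v     = (x - mu)';
d_mahal  = sqrt(v' * (Sigma \ v));   % Mahalanobis distance from x to mu
d_euclid = norm(x - mu);             % Euclidean distance from x to mu
% With the Statistics Toolbox, mahal(x, data) returns the squared Mahalanobis
% distance of x to the mean/covariance estimated from the rows of data.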

Page 23: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

24

Points that are the same distance from the center μ:

• The ellipse consists of points that are equidistant from the center w.r.t. the Mahalanobis distance.
• The circle consists of points that are equidistant from the center w.r.t. the Euclidean distance.

Page 24: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

25

Why Mahalanobis Distance?

It takes into account the covariance of the data.

• Point P is closer (in Euclidean distance) to the mean of the 'orange' class, but using the Mahalanobis distance it is found to be closer to the 'apple' class.

Page 25: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

26

Positive Semi-Definite-Advanced

The covariance matrix Σ is symmetric and positive semi-definite.

An n × n real symmetric matrix M is positive definite if z^T M z > 0 for all non-zero vectors z with real entries.

An n × n real symmetric matrix M is positive semi-definite if z^T M z ≥ 0 for all vectors z with real entries.

For a covariance matrix this follows from the requirement that the variance of any linear combination of the variables is non-negative, z^T Σ z = Var(z^T x) ≥ 0, together with symmetry (σ_ij = σ_ji).

When you construct or randomly generate a covariance matrix, it may violate this requirement:
Test whether all the eigenvalues are ≥ 0.
Higham (2002) shows how to find the nearest valid covariance matrix; a simple fix is to set negative eigenvalues to small positive values.
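A minimal Matlab sketch of this check and the simple eigenvalue-clipping fix (not the full Higham 2002 algorithm; the matrix below is made up):

C = [2 3; 3 2];                   % symmetric, but not positive semi-definite
[V, D] = eig((C + C')/2);         % eigendecomposition of the symmetrized matrix
lambda = diag(D);
isPSD  = all(lambda >= 0);        % test: all eigenvalues must be >= 0
lambda(lambda < 0) = 1e-6;        % clip negative eigenvalues to a small positive value
C_fixed = V * diag(lambda) * V';  % simple repaired covariance matrix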

Page 26: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Parameter Estimation

Only the maximum likelihood (ML) estimator is covered.

Page 27: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

You have some samples coming from an unknown distribution and you want to characterize that distribution; i.e., find its parameters.

For instance, given the samples in slides 14 or 15 and assuming they are normally distributed, you would try to find the parameters μ and Σ of the normal distribution.

We have referred to the sample mean (the mean of the samples, m) and the sample variance before, but why use those?

It turns out that the value of a parameter θ that maximizes the likelihood of seeing the samples is what we are interested in.

28

Page 28: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

29

Page 29: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

30

Page 30: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

31

Page 31: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

32

Gaussian Parameter Estimation

Likelihood function

Assuming iid data
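The equations on the original slide are not in the transcript; the standard form they refer to, assuming N i.i.d. samples X = {x^1, ..., x^N} from a 1D Gaussian, is:

$$p(X \mid \mu, \sigma^2) = \prod_{t=1}^{N} \mathcal{N}(x^t \mid \mu, \sigma^2) = \prod_{t=1}^{N} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(x^t - \mu)^2}{2\sigma^2} \right)$$

$$\log p(X \mid \mu, \sigma^2) = -\frac{N}{2}\log(2\pi) - N \log \sigma - \frac{1}{2\sigma^2} \sum_{t=1}^{N} (x^t - \mu)^2$$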

Page 32: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

33

Page 33: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

34

Page 34: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

35

Sample Mean and Variance

Sample mean:  $$\mathbf{m} = [m_1, \ldots, m_d]^T, \qquad m_i = \frac{1}{N}\sum_{t=1}^{N} x_i^t, \quad i = 1, \ldots, d$$

Covariance matrix S:  $$s_{ij} = \frac{1}{N}\sum_{t=1}^{N} (x_i^t - m_i)(x_j^t - m_j)$$

Correlation matrix R:  $$r_{ij} = \frac{s_{ij}}{s_i\, s_j}$$

where N is the number of data points x^t.

Page 35: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

36

Maximum (Log) Likelihood for a 1D-Gaussian

In other words, maximum likelihood estimates of mean and variance are the same as sample mean and variance.
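Concretely, setting the derivatives of the log-likelihood with respect to μ and σ² to zero gives:

$$\mu_{ML} = \frac{1}{N} \sum_{t=1}^{N} x^t, \qquad \sigma^2_{ML} = \frac{1}{N} \sum_{t=1}^{N} (x^t - \mu_{ML})^2$$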

Page 36: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

38

Reminder

In order to maximize or minimize a function f(x) w.r.t. x, we compute the derivative df(x)/dx and set it to 0, since a necessary condition for an extremum is that the derivative is 0.

Commonly used derivative rules are given on the right.

Page 37: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

Parametric Classification

We will use the Bayesian decision criterion applied to normally distributed classes, whose parameters are either known or estimated from the sample.

Page 38: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

46

Parametric Classification

If p(x | C_i) ~ N(μ_i, Σ_i), i.e.

$$p(\mathbf{x} \mid C_i) = \frac{1}{(2\pi)^{d/2} |\boldsymbol{\Sigma}_i|^{1/2}} \exp\!\left( -\frac{1}{2} (\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) \right)$$

the discriminant functions are:

$$g_i(\mathbf{x}) = \log p(\mathbf{x} \mid C_i) + \log P(C_i) = -\frac{d}{2}\log 2\pi - \frac{1}{2}\log|\boldsymbol{\Sigma}_i| - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x}-\boldsymbol{\mu}_i) + \log P(C_i)$$
Page 39: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

47

Estimation of Parameters

If we estimate the unknown parameters from the sample, the discriminant function becomes:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i)$$

(the constant term −(d/2) log 2π is dropped, since it is the same for all classes)

Page 40: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

48

Case 1) Different Si

Expanding the quadratic term, this discriminant can be written as

$$g_i(\mathbf{x}) = \mathbf{x}^T \mathbf{W}_i\, \mathbf{x} + \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{W}_i = -\frac{1}{2}\mathbf{S}_i^{-1}, \qquad \mathbf{w}_i = \mathbf{S}_i^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}_i^{-1}\mathbf{m}_i - \frac{1}{2}\log|\mathbf{S}_i| + \log \hat{P}(C_i)$$

This is a quadratic discriminant.
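A minimal Matlab sketch of this quadratic discriminant for two classes (all data and names below are illustrative; equal priors are assumed):

% Illustrative two-class data
rng(1);
X1 = mvnrnd([5 -5], [2 0; 0 2], 200);
X2 = mvnrnd([0  0], [5 2; 2 5], 200);

m1 = mean(X1);  S1 = cov(X1, 1);  P1 = 0.5;   % per-class estimates, equal priors
m2 = mean(X2);  S2 = cov(X2, 1);  P2 = 0.5;

% Quadratic discriminant g_i(x) for a row vector x
g = @(x, m, S, P) -0.5*log(det(S)) - 0.5*(x - m)*(S\(x - m)') + log(P);

x = [2 -2];                                   % a test point
if g(x, m1, S1, P1) > g(x, m2, S2, P2)
    label = 1;
else
    label = 2;
end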

Page 41: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

49

[Figure: class likelihoods p(x|C_i); the posterior for C1; the discriminant boundary where P(C1|x) = 0.5]

Page 42: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

50

Two bivariate normals with completely different covariance matrices, showing a hyperquadratic decision boundary.

Page 43: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

51

Page 44: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

52

Hyperbola: a hyperbola is an open curve with two branches, the intersection of a plane with both halves of a double cone. The plane may or may not be parallel to the axis of the cone. (Definition from Wikipedia.)

Page 45: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

53

Page 46: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

54

Typical single-variable normal distributions showing a disconnected decision region R2

Page 47: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

55

Notation: Ethem book

Using the notation in the Ethem Alpaydin book, the class priors, sample means, and sample covariances can be estimated as follows:

$$\hat{P}(C_i) = \frac{\sum_t r_i^t}{N}, \qquad \mathbf{m}_i = \frac{\sum_t r_i^t\, \mathbf{x}^t}{\sum_t r_i^t}, \qquad \mathbf{S}_i = \frac{\sum_t r_i^t\, (\mathbf{x}^t - \mathbf{m}_i)(\mathbf{x}^t - \mathbf{m}_i)^T}{\sum_t r_i^t}$$

where r_i^t is 1 if the t-th sample belongs to class i, and 0 otherwise.
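An illustrative Matlab sketch of these estimators (the data, labels, and variable names are made up):

% Two-class example data: rows of X are samples, y holds class labels 1..K
X = [mvnrnd([5 -5], [2 0; 0 2], 200); mvnrnd([0 0], [5 2; 2 5], 200)];
y = [ones(200,1); 2*ones(200,1)];

K = max(y);
Phat = zeros(1, K); m = cell(1, K); S = cell(1, K);
for i = 1:K
    r = (y == i);                  % indicator: r_i^t = 1 if sample t is in class i
    Phat(i) = sum(r) / numel(y);   % estimated prior P(C_i)
    m{i}    = mean(X(r, :), 1);    % estimated class mean m_i
    S{i}    = cov(X(r, :), 1);     % estimated class covariance S_i (1/N_i normalization)
end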

Page 48: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

56

If d (the dimension) is large with respect to N (the number of samples), we may have problems with this approach:

|S_i| may be zero, so S_i is singular and its inverse does not exist.

|S_i| may be non-zero but very small, leading to instability: small changes in S_i would cause large changes in S_i^{-1}.

Solutions:
Reduce the dimensionality: feature selection, or feature extraction (e.g. PCA).
Pool the data and estimate a common covariance matrix for all classes:

$$\mathbf{S} = \sum_i \hat{P}(C_i)\, \mathbf{S}_i$$

Page 49: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

57

Case 2) Common Covariance Matrix S

Shared common sample covariance S: an arbitrary covariance matrix, but shared between the classes.

We had this full discriminant function:

$$g_i(\mathbf{x}) = -\frac{1}{2}\log|\mathbf{S}_i| - \frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}_i^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i)$$

With S_i = S for all classes, the terms that do not depend on i (including the quadratic term x^T S^{-1} x) can be dropped, and it reduces to:

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i) = \mathbf{w}_i^T \mathbf{x} + w_{i0}$$

where

$$\mathbf{w}_i = \mathbf{S}^{-1}\mathbf{m}_i, \qquad w_{i0} = -\frac{1}{2}\mathbf{m}_i^T \mathbf{S}^{-1}\mathbf{m}_i + \log \hat{P}(C_i)$$

which is a linear discriminant.
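A minimal Matlab sketch of this case, reusing the illustrative per-class estimates m{i}, S{i}, Phat(i) and K from the earlier sketch (all names are made up):

% Pooled (shared) covariance: S = sum_i P(C_i) * S_i
Spooled = zeros(size(S{1}));
for i = 1:K
    Spooled = Spooled + Phat(i) * S{i};
end

% Linear discriminant weights for each class
for i = 1:K
    w{i}  = Spooled \ m{i}';                              % w_i = S^-1 m_i
    w0(i) = -0.5 * m{i} * (Spooled \ m{i}') + log(Phat(i));
end

% Classify a test point x (row vector): pick the class with the largest score
x = [2 -2];
scores = cellfun(@(wi) wi' * x', w) + w0;
[~, label] = max(scores);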

Page 50: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

58

Linear discriminant:
Decision boundaries are hyperplanes.
Decision regions are convex: all points between two arbitrary points chosen from one decision region belong to the same decision region.

If we also assume equal class priors, the classifier becomes a minimum Mahalanobis distance classifier.

Page 51: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

60

Case 2) Common Covariance Matrix S

Linear discriminant

Page 52: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

61

Page 53: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

62

Page 54: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

63

Unequal priors shift the decision boundary towards the less likely class, as before.

Page 55: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

64

Page 56: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

65

Case 3) Common Covariance Matrix S which is Diagonal

When the x_j (j = 1, ..., d) are independent, Σ is diagonal.

p(x | C_i) = ∏_j p(x_j | C_i)   (the Naive Bayes assumption)

This gives a Naive Bayes classifier where the p(x_j | C_i) are univariate Gaussians.

The discriminant, further simplified from

$$g_i(\mathbf{x}) = -\frac{1}{2}(\mathbf{x}-\mathbf{m}_i)^T \mathbf{S}^{-1} (\mathbf{x}-\mathbf{m}_i) + \log \hat{P}(C_i)$$

becomes

$$g_i(\mathbf{x}) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)$$

Classification is done based on the weighted Euclidean distance (in s_j units) to the nearest mean.

Page 57: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

66

Page 58: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

67

Case 3) Common Covariance Matrix S which is Diagonal

variances may be different

Page 59: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

68

Page 60: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

69

Case 4) Common Covariance Matrix S which is Diagonal + equal variances

If the priors are also equal, we have the Nearest mean classifier: Classify based on Euclidean distance to the nearest mean!

Each mean can be considered a prototype or template and this is template matching

When the variances are also equal (s_j = s for all j), the discriminant is further simplified from

$$g_i(\mathbf{x}) = -\frac{1}{2} \sum_{j=1}^{d} \left( \frac{x_j - m_{ij}}{s_j} \right)^2 + \log \hat{P}(C_i)$$

to

$$g_i(\mathbf{x}) = -\frac{1}{2s^2} \sum_{j=1}^{d} (x_j - m_{ij})^2 + \log \hat{P}(C_i) = -\frac{\|\mathbf{x} - \mathbf{m}_i\|^2}{2s^2} + \log \hat{P}(C_i)$$
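A minimal sketch of the resulting nearest-mean classifier in Matlab, reusing the illustrative class means m{i} and K from the earlier sketch (equal priors and equal variances assumed):

% Nearest-mean (template matching) classification of a test point x
x = [2 -2];
dist2 = zeros(1, K);
for i = 1:K
    dist2(i) = sum((x - m{i}).^2);   % squared Euclidean distance to each class mean
end
[~, label] = min(dist2);             % assign x to the nearest mean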

Page 61: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

70

Case 4) Common Covariance Matrix S which is Diagonal + equal variances


Page 62: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

71

Case 4) Common Covariance Matrix S which is Diagonal + equal variances

Page 63: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

72

A second example where the priors have been changed. The decision boundary has shifted away from the more likely class, although it is still orthogonal to the line joining the 2 means.

Page 64: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

73

Page 65: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

74

Case 5: S_i = s_i² I

Page 66: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

75

Model Selection

As we increase complexity (less restricted S), bias decreases and variance increases

Assume simple models (allow some bias) to control variance (regularization)

Assumption                      Covariance matrix          No. of parameters
Shared, hyperspheric            S_i = S = s²I              1
Shared, axis-aligned            S_i = S, with s_ij = 0     d
Shared, hyperellipsoidal        S_i = S                    d(d+1)/2
Different, hyperellipsoidal     S_i                        K · d(d+1)/2

(K is the number of classes.)

Page 67: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

77

Tuning Complexity

When we use Euclidean distance to measure similarity, we are assuming that all variables have the same variance and that they are independent (e.g., two variables such as age and yearly income clearly violate both assumptions).

When these assumptions don't hold, normalization may be used (PCA, whitening, making each dimension zero-mean and unit-variance, ...) so that Euclidean distance becomes appropriate.

We may still want to use simpler models in order to estimate the related parameters more accurately.

Page 68: Multivariate Methods Slides from Machine Learning by Ethem Alpaydin Expanded by some slides from Gutierrez-Osuna

78

Conclusions

The Bayes classifier for normally distributed classes is a quadratic classifier

The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier

The minimum Mahalanobis distance classifier is Bayes-optimal for normally distributed classes with equal covariance matrices and equal priors.

The minimum Euclidean distance classifier is Bayes-optimal for normally distributed classes with equal covariance matrices proportional to the identity matrix and equal priors.

Both the Euclidean and the Mahalanobis distance classifiers are linear classifiers.