1 multi variate variable n-th object m-th variable

Upload: ross-merritt

Post on 25-Dec-2015

TRANSCRIPT

Page 1: 1 MULTI VARIATE VARIABLE n-th OBJECT m-th VARIABLE

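MULTIVARIATE VARIABLE

$$
\mathbf{X} =
\begin{pmatrix}
\mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_j^T \\ \vdots \\ \mathbf{x}_n^T
\end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & \cdots & x_{i,1} & \cdots & x_{m,1} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{1,j} & \cdots & x_{i,j} & \cdots & x_{m,j} \\
\vdots  &        & \vdots  &        & \vdots  \\
x_{1,n} & \cdots & x_{i,n} & \cdots & x_{m,n}
\end{pmatrix}
$$

n-th OBJECT: row n of the matrix

m-th VARIABLE: column m of the matrix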

Page 2

STATISTICAL DEPENDENCE

CORRELATION – relationship between QUANTITATIVE (measured) data

CONTINGENCY – relationship between QUALITATIVE (descriptive) data

Page 3

CORRELATION

 simple – for two variables,

 multiple – for more than two variables,

 partial – describes the relationship between two variables in a multivariable data set (the influence of all other variables is excluded).

Page 4

CORRELATION

[Figure: two panels, "positive" and "negative" correlation]

Page 5

CORRELATION

[Figure: x2 plotted against x1, with the mean of x2 marked]

TOTAL VARIABILITY of Y (deviation of a measured value from the mean)

RESIDUAL VARIABILITY (deviation between measured and model, i.e. computed, values)

MODEL VARIABILITY, explained by the model (deviation of the model values from the mean)

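Page 6

CORRELATION

COEFF. OF DETERMINATION

$$R^2 = \frac{S^2_{\hat{x}_2}}{S^2_{x_2}} = 1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}$$

COEFF. OF CORRELATION

$$R = \sqrt{\frac{S^2_{\hat{x}_2}}{S^2_{x_2}}} = \sqrt{1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}}$$

Here $S^2_{x_2}$ is the total variability, $S^2_{\hat{x}_2}$ the model variability and $S^2_{x_2 - \hat{x}_2}$ the residual variability.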

Page 7

COEFF. OF DETERMINATION

quantifies which part of the total variability of the response is explained by the model

[Figure: three example panels labelled r² = 0.9, r² = 1 and r² = 0.05]
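The definition above can be checked numerically. The following is a minimal sketch in plain Python; the function name `r_squared` and the data are illustrative assumptions, not part of the slides. It uses the form 1 − residual/total, which equals model/total for a least-squares fit with an intercept.

```python
def r_squared(y, y_model):
    """Coefficient of determination: 1 - residual / total variability."""
    mean_y = sum(y) / len(y)
    ss_total = sum((yi - mean_y) ** 2 for yi in y)                 # total variability
    ss_resid = sum((yi - mi) ** 2 for yi, mi in zip(y, y_model))   # residual variability
    return 1.0 - ss_resid / ss_total


measured = [1.0, 2.0, 3.0, 4.0]          # hypothetical measured values
print(r_squared(measured, measured))     # model reproduces the data exactly
print(r_squared(measured, [2.5] * 4))    # model predicts only the mean
```

A model that reproduces the data explains all of the variability; a model that returns only the mean explains none of it.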

Page 8

COEFF. OF CORRELATION

simple correlation

        Pearson

        Spearman (rank correlation)

Page 9

PEARSON COEFF. OF CORRELATION

$$r_{x_1 x_2} = r_{x_2 x_1} = \frac{\mathrm{cov}_{x_1 x_2}}{S_{x_1} S_{x_2}}$$

= standardised covariance

BIVARIATE normal distribution

Page 10

COVARIANCE

• measure of linear relationship
• can be negative as well as positive
• the product of the standard deviations is the upper limit of its absolute value
• its magnitude depends on the units of the arguments, so standardisation is necessary

COVARIANCE:

$$\mathrm{cov}_{x_1 x_2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)$$
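A short sketch of the covariance formula with the 1/(n − 1) factor, in plain Python. The variable names and numbers are hypothetical; the example only illustrates that the value scales with the units of the arguments and that its sign follows the direction of the relationship.

```python
def covariance(x1, x2):
    """Sample covariance with the 1/(n - 1) factor."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    return sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)


length_m = [1.0, 2.0, 3.0, 4.0]               # hypothetical lengths in metres
other = [2.0, 4.0, 5.0, 9.0]                  # hypothetical second variable
length_cm = [100.0 * v for v in length_m]     # the same lengths in centimetres

print(covariance(length_m, other))                 # lengths in metres
print(covariance(length_cm, other))                # same data, 100 times larger value
print(covariance(length_m, [-v for v in other]))   # reversed relationship, negative sign
```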

Page 11

PEARSON COEFF. OF CORRELATION

Basic properties:

• It is a dimensionless measure of correlation;
• it ranges from 0 to 1 for positive correlation and from 0 to −1 for negative correlation;
• 0 means that there is no linear relationship between the variables (there can be a nonlinear one!) or that the relationship is not statistically significant on the basis of the available data;
• 1 or −1 indicates a functional (perfect) relationship;
• the value of the correlation coefficient is the same for the dependence of x1 on x2 and for the reverse dependence of x2 on x1.
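Some of these properties can be illustrated with a small sketch in plain Python. The function name `pearson_r` and the data are assumptions made for the example: a parabola gives a coefficient of 0 despite a perfect nonlinear relationship, a straight line gives 1, and swapping the arguments does not change the value.

```python
import math


def pearson_r(x1, x2):
    """Pearson coefficient: covariance divided by both standard deviations."""
    n = len(x1)
    mean1, mean2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)
    s1 = math.sqrt(sum((a - mean1) ** 2 for a in x1) / (n - 1))
    s2 = math.sqrt(sum((b - mean2) ** 2 for b in x2) / (n - 1))
    return cov / (s1 * s2)


x = [-2.0, -1.0, 0.0, 1.0, 2.0]
parabola = [v * v for v in x]        # exact but nonlinear relationship
line = [2.0 * v + 1.0 for v in x]    # exact linear relationship

print(pearson_r(x, parabola))                       # no linear relationship
print(pearson_r(x, line))                           # functional linear relationship
print(pearson_r(x, line) == pearson_r(line, x))     # same value in both directions
```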

Page 12

SPEARMAN CORRELATION COEFFICIENT

nonparametric correlation coeff. based on ranks

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$

$d_i$ – difference between the ranks of X and Y in one row
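The rank formula can be sketched as follows in plain Python. The helper names and the data are hypothetical, and the simple ranking assumes there are no tied values (with ties, average ranks and a corrected formula would be needed).

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_r(x, y):
    """Spearman coefficient from squared rank differences d_i."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d_squared / (n ** 3 - n)


x = [10, 20, 30, 40, 500]    # the last value is an extreme
y = [1, 3, 2, 5, 4]
print(spearman_r(x, y))
print(spearman_r([10, 20, 30, 40, 50], y))   # same ranks, same coefficient
```

Because only ranks enter the formula, replacing 500 by 50 leaves the coefficient unchanged.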

Page 13

SPEARMAN CORRELATION COEFFICIENT

influential points (extremes)

Pearson R = -0.412 (influential points are fully counted)

Spearman R = +0.541 (influential points are strongly limited)

Page 14

CONFIDENCE INTERVAL OF R (CI)

The CI covers the interval of possible values of the population correlation coefficient ρ (with probability 1 − α).

Because the distribution of the corr. coeff. is not normal, we must use the Fisher transformation

$$Z(R) = \operatorname{artanh}(R) = 0.5 \ln\frac{1 + R}{1 - R}$$

with approx. normal distribution with mean E(Z) = Z(ρ) and variance D(Z) = 1/(n − 3).

Page 15

CONFIDENCE INTERVAL OF R (CI)

R → Fisher transformation → Z(R)

$$Z(R) \pm z_{1-\alpha/2} \sqrt{\frac{1}{n-3}}$$

lower and upper boundary of the CI in the Fisher transformation; the term $z_{1-\alpha/2}\sqrt{1/(n-3)}$ is half of the CI of the transformed value

retransformation of Z(R) to the correlation coeff. → lower and upper boundary of the CI of the correlation coeff.

Page 16

CONFIDENCE INTERVAL OF R (CI)

R = 0.95305, n = 12

Fisher value: fisherz(0.95305) = 1.864

CI Fisher value:

$$1.864 \pm 1.96 \sqrt{\frac{1}{12 - 3}} = 1.864 \pm 0.65333 \;\Rightarrow\; (1.2107;\ 2.5173)$$

lower boundary 1.21, Fisher value 1.864, upper boundary 2.517

CI correlation coeff:
=fisherz2r(1.2107) = 0.83689
=fisherz2r(2.5174) = 0.98707

lower boundary 0.837, R = 0.953, upper boundary 0.987
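The worked example can be reproduced with the Python standard library. This is a sketch under stated assumptions: `fisher_ci` is a hypothetical helper name, `math.atanh` and `math.tanh` stand in for the `fisherz` and `fisherz2r` calls shown on the slide, and n = 12 is taken from the half-width 0.65333 = 1.96 · sqrt(1/9).

```python
import math
from statistics import NormalDist


def fisher_ci(r, n, alpha=0.05):
    """Confidence interval of a correlation coefficient via the Fisher transformation."""
    z = math.atanh(r)                                   # Z(R)
    half = NormalDist().inv_cdf(1 - alpha / 2) * math.sqrt(1 / (n - 3))
    return math.tanh(z - half), math.tanh(z + half)     # back to the R scale


lower, upper = fisher_ci(0.95305, 12)
print(round(lower, 3), round(upper, 3))
```

The interval is not symmetric around R = 0.953, which is the reason for working on the transformed scale.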

Page 17

REGRESSION ANALYSIS

[Figure: MEASURED VALUES and MODEL VALUES; horizontal axis: independent (explanatory) variable X; vertical axis: dependent, explained, response var. Y]

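Page 18

REGRESSION MODEL

$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots &        & \vdots &        & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_m \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

y: response variable; X: explanatory variable(s); β: regression parameters; ε: random error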

Page 19

REGRESSION MODEL

[Figure: regression line; horizontal axis: independent (explanatory) variable X; vertical axis: response Y; marked on the plot: intercept a, regression parameter b and a step of 1]
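For one explanatory variable the model reduces to y = a + b·x with intercept a and regression parameter b. A minimal sketch in plain Python follows; it assumes ordinary least squares as the fitting method, and the function name `fit_line` and the data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b in y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx              # change of the response per unit of x
    a = mean_y - b * mean_x    # value of the model at x = 0
    return a, b


x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]       # generated from y = 1 + 2x, no random error
a, b = fit_line(x, y)
print(a, b)
```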

Page 20

CONFIDENCE INTERVAL OF MODEL

upper boundary of CI, lower boundary of CI

VALUE OF REGRESSION MODEL (these values are only point estimates)

Area in which all possible models computed from any sample (coming from the same population) appear with probability 1 − α

CI of one model value

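Page 21

CI OF Y VALUES – PREDICTION INTERVAL

is an estimate of an interval in which future observations will fall with a certain probability 1 − α

$$y_{i\,(\min,\max)} = \hat{y}_i \pm t_{\alpha/2;\,n-m}\, s_{y_i}$$

where $t_{\alpha/2;\,n-m}$ is the critical value of Student's t distribution with n − m degrees of freedom and $s_{y_i}$ is the standard error of prediction at the i-th point.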

Page 22

CONFIDENCE INTERVAL OF MODEL (CI), PREDICTION INTERVAL OF RESPONSE (PI)

Page 23

COMPARISON OF REGRESSION MODELS

Akaike information criterion (AIC)

$$AIC = n \ln\frac{RSS}{n} + 2m$$

RSS – residual sum of squares
m – number of parameters

The smaller the AIC, the better the model (from the statistical point of view!!).
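A small sketch of how the criterion trades fit against complexity, in plain Python. The function name `aic` and the RSS values of the two competing models are made-up numbers for illustration only.

```python
import math


def aic(rss, n, m):
    """Akaike information criterion from the residual sum of squares."""
    return n * math.log(rss / n) + 2 * m


n = 20                                # hypothetical number of observations
aic_simple = aic(10.0, n, 2)          # 2 parameters, RSS = 10.0
aic_complex = aic(9.5, n, 4)          # 4 parameters, slightly smaller RSS
print(aic_simple, aic_complex)
print("simple" if aic_simple < aic_complex else "complex", "model preferred")
```

The two extra parameters reduce RSS too little to pay for the 2m penalty, so the simpler model has the smaller AIC.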

Page 24

REGRESSION DIAGNOSTICS

Diagnostics of residuals:

• normality
• homoscedasticity (constant variance)
• independence

Page 25

REGRESSION DIAGNOSTICS

Breusch–Pagan test (and many others…)

Weighted OLS method

Page 26

REGRESSION DIAGNOSTICS

Page 27

REGRESSION DIAGNOSTICS

Influential points

Page 28

REGRESSION DIAGNOSTICS

HAT VALUES (leverages)

The hat matrix, H, relates the fitted values to the observed values. It describes the influence each observed value has on each fitted value.

The diagonal elements of the hat matrix are the leverages, which describe the influence each observed value has on the fitted value for that same observation.
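For a straight-line model y = a + b·x the diagonal of the hat matrix has a closed form, h_i = 1/n + (x_i − x̄)² / Σ(x_j − x̄)². The sketch below, in plain Python with hypothetical data, uses that special case as an assumption instead of building H explicitly.

```python
def leverages(x):
    """Diagonal of the hat matrix for the straight-line model y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 10.0]   # the last point lies far from the others in x
h = leverages(x)
print(h)
print(sum(h))                    # equals the number of parameters (2) up to rounding
```

The point far from the mean of x has by far the largest leverage, whatever its y value is.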

Page 29

REGRESSION DIAGNOSTICS

Cook's distance

Measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
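A sketch of Cook's distance for a straight-line fit, in plain Python. It assumes the usual closed form D_i = e_i² / (p·s²) · h_i / (1 − h_i)², which combines the residual e_i and the leverage h_i and avoids refitting; the function name and the data are hypothetical.

```python
def cooks_distance(x, y):
    """Cook's distance of every point for the least-squares line y = a + b*x."""
    n, p = len(x), 2                        # p = number of parameters (a and b)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    a = mean_y - b * mean_x
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in resid) / (n - p)            # residual variance
    lev = [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    return [e * e / (p * s2) * h / (1.0 - h) ** 2 for e, h in zip(resid, lev)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]         # the last point breaks the trend
d = cooks_distance(x, y)
print([round(v, 3) for v in d])
```

The last point has both a large residual and a high leverage, so its distance is the largest of the six.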

Page 30

REGRESSION DIAGNOSTICS

DFFITS

The DFFITS statistic is a scaled measure of the change in the predicted value for the i-th observation and is calculated by deleting the i-th observation. A large value indicates that the observation is very influential in its neighborhood of the X space.

A general cutoff to consider is 2; a recommended size-adjusted cutoff is $2\sqrt{m/n}$, where m is the number of parameters and n the number of observations.

Page 31

REGRESSION DIAGNOSTICS

DFBETAS

DFBETAS are the scaled measures of the change in each parameter estimate and are calculated by deleting the i-th observation.

The general cutoff value is 2; the size-adjusted cutoff is $2/\sqrt{n}$.
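Both deletion statistics can be sketched by literally refitting the line without the i-th point. The code below is plain Python for a straight-line model; the function names and data are hypothetical, only the DFBETAS of the slope is computed, and the commonly cited size-adjusted cutoffs 2·sqrt(m/n) for DFFITS and 2/sqrt(n) for DFBETAS are used as an assumption.

```python
import math


def fit_line(x, y):
    """Least-squares intercept a and slope b of y = a + b*x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - b * mean_x, b


def deletion_diagnostics(x, y):
    """Return (DFFITS, DFBETAS of the slope) for each point, by refitting without it."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a, b = fit_line(x, y)
    dffits, dfbetas_slope = [], []
    for i in range(n):
        xd, yd = x[:i] + x[i + 1:], y[:i] + y[i + 1:]      # data without point i
        a_i, b_i = fit_line(xd, yd)
        rss_i = sum((yj - (a_i + b_i * xj)) ** 2 for xj, yj in zip(xd, yd))
        s_i = math.sqrt(rss_i / (n - 1 - 2))               # residual sd without point i
        h_i = 1.0 / n + (x[i] - mean_x) ** 2 / sxx         # leverage of point i
        change_in_fit = (a + b * x[i]) - (a_i + b_i * x[i])
        dffits.append(change_in_fit / (s_i * math.sqrt(h_i)))
        dfbetas_slope.append((b - b_i) / (s_i / math.sqrt(sxx)))
    return dffits, dfbetas_slope


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 12.0]          # the last point breaks the trend
dffits, dfbetas_slope = deletion_diagnostics(x, y)
n, m = len(x), 2
print([abs(v) > 2 * math.sqrt(m / n) for v in dffits])         # flagged by DFFITS
print([abs(v) > 2 / math.sqrt(n) for v in dfbetas_slope])      # flagged by DFBETAS
```

Refitting n times is slow for large data sets, but it follows the verbal definition directly and needs at least four points with a non-perfect fit after deletion.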