1 multi variate variable n-th object m-th variable

1

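MULTIVARIATE VARIABLE

$$
\mathbf{X} =
\begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_j^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,i} & \cdots & x_{1,m} \\
\vdots & & \vdots & & \vdots \\
x_{j,1} & \cdots & x_{j,i} & \cdots & x_{j,m} \\
\vdots & & \vdots & & \vdots \\
x_{n,1} & \cdots & x_{n,i} & \cdots & x_{n,m}
\end{pmatrix}
$$

rows: objects (the last row is the n-th OBJECT)

columns: variables (the last column is the m-th VARIABLE)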

2

STATISTICAL DEPENDENCE

CORRELATION – relationship between QUANTITATIVE (measured) data

CONTINGENCY – relationship between QUALITATIVE (descriptive) data

3

CORRELATION

simple – for two variables,

multiple – for more than two variables,

partial – describes the relationship of two variables in a multivariable data set (the influence of all other variables is excluded)

4

CORRELATION

[Figure: two scatter plots, one showing positive and one showing negative correlation]

5

Correlation

[Figure: scatter plot of x2 against x1 with a fitted line and the mean of x2 marked]

TOTAL VARIABILITY of Y (deviation of a measured value from the mean)

RESIDUAL VARIABILITY (deviation of a measured value from the model, i.e. computed, value)

MODEL VARIABILITY, i.e. variability explained by the model (deviation of a model value from the mean)

6

CORRELATION

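COEFF. OF DETERMINATION

$$R^2 = \frac{S^2_{\hat{x}_2}}{S^2_{x_2}} = 1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}$$

COEFF. OF CORRELATION

$$R = \sqrt{\frac{S^2_{\hat{x}_2}}{S^2_{x_2}}} = \sqrt{1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}}$$

where $S^2_{x_2}$ is the total variability, $S^2_{\hat{x}_2}$ the model variability and $S^2_{x_2 - \hat{x}_2}$ the residual variability.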

7

COEFF. OF DETERMINATION

quantifies what part of the total variability of the response is explained by the model

[Figure: example scatter plots with r² = 0.9, r² = 1 and r² = 0.05]
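As a rough illustration of "the part of total variability explained by the model", the sketch below computes the coefficient of determination as 1 minus the ratio of residual to total sum of squares. The function name `r_squared` and the small data set are made up for this example; they are not taken from the slides.

```python
def r_squared(y, y_model):
    """Coefficient of determination: 1 - residual SS / total SS."""
    n = len(y)
    mean_y = sum(y) / n
    # total variability: measured values vs. their mean
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    # residual variability: measured values vs. model values
    ss_resid = sum((yi - mi) ** 2 for yi, mi in zip(y, y_model))
    return 1.0 - ss_resid / ss_total


y = [1.0, 2.0, 3.0, 4.0]           # hypothetical measured values
perfect = [1.0, 2.0, 3.0, 4.0]     # model reproduces the data exactly
mean_only = [2.5, 2.5, 2.5, 2.5]   # model predicts only the mean

print(r_squared(y, perfect))    # 1.0
print(r_squared(y, mean_only))  # 0.0
```

A model that reproduces the data exactly gives 1, and a model that only predicts the mean explains none of the variability and gives 0.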

8

COEFF. OF CORRELATION

simple correlation

Pearson

Spearman (rank correlation)

9

PEARSON COEFF. OF CORRELATION

$$r_{x_1 x_2} = r_{x_2 x_1} = \frac{\mathrm{cov}_{x_1 x_2}}{S_{x_1} S_{x_2}}$$

= standardised covariance

BIVARIATE normal distribution

10

COVARIANCE

• measure of linear relationship
• can be positive or negative; its absolute value is never greater than the product of the standard deviations
• its magnitude depends on the units of the arguments, so standardisation is necessary

COVARIANCE:

$$\mathrm{cov}_{x_1 x_2} = \frac{1}{n-1} \sum_{i=1}^{n} (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)$$
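The following sketch shows, on made-up numbers, why covariance needs standardisation: rescaling one variable changes the covariance but not the Pearson coefficient. The function names and the data are assumptions chosen for this illustration.

```python
import math


def covariance(x1, x2):
    """Sample covariance with the 1/(n-1) factor."""
    n = len(x1)
    m1 = sum(x1) / n
    m2 = sum(x2) / n
    return sum((a - m1) * (b - m2) for a, b in zip(x1, x2)) / (n - 1)


def pearson_r(x1, x2):
    """Covariance divided by the product of the standard deviations."""
    s1 = math.sqrt(covariance(x1, x1))  # cov(x, x) is the variance
    s2 = math.sqrt(covariance(x2, x2))
    return covariance(x1, x2) / (s1 * s2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical data
x2 = [2.0, 4.0, 6.0, 8.0, 10.0]
x2_rescaled = [100.0 * v for v in x2]   # same quantity in other units

print(covariance(x1, x2), covariance(x1, x2_rescaled))
print(pearson_r(x1, x2), pearson_r(x1, x2_rescaled))
```

The covariance grows by the scaling factor, while the correlation coefficient stays the same (up to rounding).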

11

PEARSON COEFF. OF CORRELATION

Basic properties:

• It is a dimensionless measure of correlation;
• 0 to 1 for positive correlation, 0 to (-1) for negative correlation;
• 0 means that there is no linear relationship between the variables (it can be nonlinear!) or that the relationship is not statistically significant on the basis of the available data;
• 1 or (-1) indicates a functional (perfect) relationship;
• The value of the correlation coefficient is the same for the dependence of x1 on x2 and for the reverse dependence of x2 on x1.

12

SPEARMAN CORRELATION COEFFICIENT

nonparametric correlation coeff. based on ranks

$$r_S = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n^3 - n}$$

where $d_i$ is the difference between the ranks of X and Y in one row
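A minimal sketch of the rank-based formula, assuming there are no tied values (ties would need averaged ranks, which this example does not handle). The helper names `ranks` and `spearman_r` and the data are hypothetical.

```python
def ranks(values):
    """Ranks 1..n of the values; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_r(x, y):
    """1 - 6 * sum(d_i^2) / (n^3 - n), d_i = rank difference in row i."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n ** 3 - n)


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 8.0, 27.0, 64.0, 1000.0]  # monotone but nonlinear, one extreme value

print(spearman_r(x, y))        # 1.0
print(spearman_r(x, y[::-1]))  # -1.0
```

Because only ranks enter the formula, the extreme value 1000 has no more weight than any other largest value would.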

13

SPEARMAN CORRELATION COEFFICIENT

influential points (extremes)

Pearson R = -0.412 (influential points are fully counted)

Spearman R = +0.541 (influential points are strongly limited)

14

CONFIDENCE INTERVAL R (CI)

CI(ρ) includes the interval of possible values of the population correlation coefficient ρ (with probability 1 − α)

Because the distribution of the corr. coeff. is not normal, we must use the Fisher transformation

$$Z(R) = \operatorname{artanh}(R) = 0.5 \ln\frac{1+R}{1-R}$$

with approx. normal distribution with mean E(Z) = Z(ρ) and variance D(Z) = 1/(n − 3).

15


R → Fisher transformation → Z(R)

$$Z(R) \pm z_{1-\alpha/2} \frac{1}{\sqrt{n-3}}$$

The term $z_{1-\alpha/2} / \sqrt{n-3}$ is half of the CI of the transformed value; the two results are the lower and upper boundary of the CI in the Fisher transformation.

Retransformation of Z(R) to the correlation coeff. gives the lower and upper boundary of the CI of the correlation coeff.

16


R = 0.95305, n = 12

Fisher value: fisherz(0.95305) = 1.864

CI of the Fisher value:

$$Z = 1.864 \pm 1.96 \cdot \frac{1}{\sqrt{12-3}} = 1.864 \pm 0.65333 \;\Rightarrow\; \langle 1.2107;\ 2.5173 \rangle$$

1.21 < 1.864 < 2.517

CI of the correlation coeff.:

fisherz2r(1.2107) = 0.83689

fisherz2r(2.5174) = 0.98707

0.837 < 0.953 < 0.987
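The worked example can be retraced with a short sketch using only the Python standard library. The function name `correlation_ci` is made up here; `math.atanh` and `math.tanh` play the role of the `fisherz` and `fisherz2r` calls shown on the slide, and the normal quantile is computed rather than rounded to 1.96.

```python
import math
from statistics import NormalDist


def correlation_ci(r, n, alpha=0.05):
    """CI of a correlation coefficient via the Fisher transformation."""
    z = math.atanh(r)                              # Z(R) = artanh(R)
    quantile = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    half = quantile / math.sqrt(n - 3)             # half-width on the Z scale
    # back-transform both boundaries to the correlation scale
    return math.tanh(z - half), math.tanh(z + half)


lo, hi = correlation_ci(0.95305, 12)
print(round(lo, 3), round(hi, 3))  # approx. 0.837 0.987
```

Note that the interval is not symmetric around R = 0.953, which is a consequence of the back-transformation.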

17

REGRESSION ANALYSIS

[Figure: scatter plot of MEASURED VALUES with a fitted line of MODEL VALUES; x-axis: independent (explanatory) variable X; y-axis: dependent (explained, response) variable Y]

18

REGRESSION MODEL

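$$
\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}
=
\begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
\begin{pmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_j \\ \vdots \\ \beta_m \end{pmatrix}
+
\begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix}
$$

$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$

y: response; X: explanatory variable(s); β: regression parameters; ε: random error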

19

REGRESSION MODEL

[Figure: regression line; x-axis: independent (explanatory) variable X; y-axis: dependent variable (response) Y]

intercept a

regression parameter b (slope: change in Y for a change of 1 in X)
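To make the roles of the intercept a and the regression parameter b concrete, here is a small sketch that estimates both by ordinary least squares for one explanatory variable. The slide does not state the estimation method, so least squares is an assumption, and `fit_line` and the data are invented for the example.

```python
def fit_line(x, y):
    """Least-squares estimates of intercept a and slope b in y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx      # slope: change in y per unit change in x
    a = my - b * mx    # intercept: model value at x = 0
    return a, b


x = [0.0, 1.0, 2.0, 3.0]   # hypothetical explanatory variable
y = [1.0, 3.0, 5.0, 7.0]   # hypothetical response, exactly y = 1 + 2x

a, b = fit_line(x, y)
print(a, b)  # 1.0 2.0
```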

20

CONFIDENCE INTERVAL OF MODEL

upper boundary of CI, lower boundary of CI

VALUE OF REGRESSION MODEL (these values are only point estimates)

Area where all possible models computed from any sample (coming from the same population) appear with probability 1 − α

CI of one model value

21

CI OF Y VALUES – PREDICTION INTERVAL

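is an estimate of an interval in which future observations will fall with a certain probability 1 − α

$$y_{i\,(\min,\max)} = \hat{y}_i \pm t_{\alpha/2;\,n-m}\; s_{y_i}$$

where $t_{\alpha/2;\,n-m}$ is the quantile of Student's t distribution with n − m degrees of freedom and $s_{y_i}$ stands for the standard error of the predicted value.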

CONFIDENCE INTERVAL OF MODEL (CI), PREDICTION INTERVAL OF RESPONSE (PI)

22

23

COMPARISON OF REGRESSION MODELS

Akaike information criterion (AIC)

$$AIC = n \ln\left(\frac{RSS}{n}\right) + 2m$$

RSS: residual sum of squares; m: number of parameters; n: number of observations

The smaller the AIC, the better the model

(from the statistical point of view!!).
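A short sketch of how the criterion trades fit against the number of parameters. The RSS values below are invented purely for illustration; only the formula AIC = n ln(RSS/n) + 2m comes from the slide.

```python
import math


def aic(rss, n, m):
    """AIC = n * ln(RSS / n) + 2 * m."""
    return n * math.log(rss / n) + 2 * m


n = 20                                  # hypothetical number of observations
aic_simple = aic(rss=12.0, n=n, m=2)    # e.g. a straight line
aic_complex = aic(rss=11.8, n=n, m=5)   # slightly smaller RSS, three more parameters

print(aic_simple, aic_complex)
print("simple model preferred:", aic_simple < aic_complex)
```

Here the small reduction in RSS does not pay for the three extra parameters, so the simpler model has the smaller AIC.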

REGRESSION DIAGNOSTICS

24

Diagnostics of residuals:

• normality
• homoscedasticity (constant variance)
• independence


25

Breusch–Pagan test (and many others…)

Weighted OLS method


26


27

Influential points


28

HAT VALUES (leverages)

The hat matrix, H, relates the fitted values to the observed values. It describes the influence each observed value has on each fitted value.

The diagonal elements of the hat matrix are the leverages, which describe the influence each observed value has on the fitted value for that same observation.
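For a model with an intercept and one explanatory variable, the diagonal of the hat matrix has a simple closed form, h_ii = 1/n + (x_i − x̄)² / Σ(x_k − x̄)². The sketch below uses that special case (an assumption made here to avoid matrix algebra) on made-up data with one remote x value.

```python
def leverages(x):
    """Diagonal of the hat matrix for the model y = a + b*x."""
    n = len(x)
    mx = sum(x) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    return [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 20.0]   # hypothetical data; the last point lies far from the rest
h = leverages(x)

for xi, hi in zip(x, h):
    print(xi, round(hi, 3))
print("sum of leverages:", round(sum(h), 6))  # equals the number of parameters (2)
```

The remote point has a leverage close to 1, meaning its fitted value is determined almost entirely by its own observed value.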


29

Cook's distance

measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
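One common way to compute this without refitting the model n times uses the residual e_i and the leverage h_ii: D_i = e_i² / (m s²) · h_ii / (1 − h_ii)², with s² = RSS / (n − m). This formula is not given on the slide; the sketch below applies it to a least-squares line (m = 2) on invented data in which the last point breaks the trend.

```python
def cooks_distances(x, y):
    """Cook's distance of every observation for the model y = a + b*x."""
    n, m = len(x), 2   # m = 2 parameters: intercept and slope
    mx = sum(x) / n
    my = sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    a = my - b * mx
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    h = [1.0 / n + (xi - mx) ** 2 / sxx for xi in x]   # leverages
    s2 = sum(e ** 2 for e in resid) / (n - m)          # residual variance
    return [e ** 2 / (m * s2) * hi / (1.0 - hi) ** 2
            for e, hi in zip(resid, h)]


x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]   # hypothetical data
y = [1.1, 1.9, 3.2, 3.9, 5.1, 4.0]    # last point: high leverage, off the trend

d = cooks_distances(x, y)
print([round(v, 3) for v in d])
print("most influential observation:", d.index(max(d)))
```

The last observation combines a high leverage with a sizeable residual, so its Cook's distance dominates all the others.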


30

DFFITS

The DFFITS statistic is a scaled measure of the change in the predicted value for the i-th observation and is calculated by deleting the i-th observation. A large value indicates that the observation is very influential in its neighborhood of the X space.

A general cutoff to consider is 2; a recommended size-adjusted cutoff is $2\sqrt{m/n}$, where n is the number of observations and m is the number of parameters in the model.


31

DFBETAS

DFBETAS are the scaled measures of the change in each parameter estimate and are calculated by deleting the i-th observation.

The general cutoff value is 2; the size-adjusted cutoff is $2/\sqrt{n}$, where n is the number of observations.
