1
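MULTIVARIATE VARIABLE
$$
\mathbf{X} =
\begin{pmatrix} \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_j^T \\ \vdots \\ \mathbf{x}_n^T \end{pmatrix}
=
\begin{pmatrix}
x_{1,1} & \cdots & x_{1,i} & \cdots & x_{1,m} \\
\vdots & & \vdots & & \vdots \\
x_{j,1} & \cdots & x_{j,i} & \cdots & x_{j,m} \\
\vdots & & \vdots & & \vdots \\
x_{n,1} & \cdots & x_{n,i} & \cdots & x_{n,m}
\end{pmatrix}
$$
rows: objects (the last row is the n-th OBJECT)
columns: variables (the last column is the m-th VARIABLE)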
2
STATISTICAL DEPENDENCE
CORRELATION – relationship between QUANTITATIVE (measured) data
CONTINGENCY – relationship between QUALITATIVE (descriptive) data
3
CORRELATION
simple – for two variables,
multiple – for more than two variables,
partial – describes the relationship of two variables in a multivariable data set (the influence of all other variables is excluded)
4
CORRELATION
positive negative
5
Correlation
[Figure: scatter plot with axes x1 and x2, showing the three components of variability]
TOTAL VARIABILITY (deviation of a measured value from the mean)
RESIDUAL VARIABILITY (deviation of measured values from model, i.e. computed, values)
MODEL VARIABILITY, the variability explained by the model (deviation of model values from the mean)
6
CORRELATION
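COEFF. OF DETERMINATION
$$R^2 = \frac{S^2_{\hat{x}_2}}{S^2_{x_2}} = 1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}$$
COEFF. OF CORRELATION
$$R = \sqrt{\frac{S^2_{\hat{x}_2}}{S^2_{x_2}}} = \sqrt{1 - \frac{S^2_{x_2 - \hat{x}_2}}{S^2_{x_2}}}$$
where $S^2_{x_2}$ is the total variability, $S^2_{\hat{x}_2}$ the model variability and $S^2_{x_2 - \hat{x}_2}$ the residual variability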
7
COEFF. OF DETERMINATION
quantifies which part of the total variability of the response is explained by the model
[Figure: three scatter plots with r2 = 0.9, r2 = 1 and r2 = 0.05]
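The definition above can be checked numerically. The following is a minimal sketch, assuming a straight-line model fitted by least squares; the data values and the function name `r_squared` are made up for illustration. It computes the coefficient of determination both as model variability over total variability and as one minus residual variability over total variability.

```python
def r_squared(x, y):
    """Return R^2 computed two ways for a least-squares straight line."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    fitted = [intercept + slope * xi for xi in x]

    total = sum((yi - mean_y) ** 2 for yi in y)                   # total variability
    residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # residual variability
    model = sum((fi - mean_y) ** 2 for fi in fitted)              # model variability
    return model / total, 1 - residual / total


# Made-up illustrative data, not taken from the slides.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_from_model, r2_from_residuals = r_squared(x, y)
print(r2_from_model, r2_from_residuals)
```

For a least-squares fit with an intercept the two values agree up to rounding, because total variability splits exactly into model and residual parts.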
8
COEFF. OF CORRELATION
simple correlation
Pearson
Spearman (rank correlation)
9
PEARSON COEFF. OF CORRELATION
$$r_{x_1 x_2} = r_{x_2 x_1} = \frac{\mathrm{cov}_{x_1 x_2}}{S_{x_1} S_{x_2}}$$
= standardised covariance
BIVARIATE normal distribution
10
COVARIANCE
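• measure of linear relationship
• can be positive or negative; its absolute value is bounded above by the product of the standard deviations
• its magnitude depends on the units of the arguments, so standardisation is necessary
COVARIANCE:
$$\mathrm{cov}_{x_1 x_2} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_{1i} - \bar{x}_1\right)\left(x_{2i} - \bar{x}_2\right)$$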
11
PEARSON COEFF. OF CORRELATION
Basic properties:
• It is a dimensionless measure of correlation;
• 0 to 1 for positive correlation, 0 to −1 for negative correlation;
• 0 means that there is no linear relationship between the variables (it can be nonlinear!) or that the relationship is not statistically significant on the basis of the available data;
• 1 or −1 indicates a functional (perfect) linear relationship;
• The value of the correlation coefficient is the same for the dependence of x1 on x2 and for the reverse dependence of x2 on x1.
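A short sketch of sample covariance and the Pearson coefficient as standardised covariance may help here. The height and weight values and the function names are made up for illustration. It shows that covariance changes with the units of measurement, while the correlation coefficient does not and is symmetric in its arguments.

```python
from math import sqrt


def covariance(x1, x2):
    """Sample covariance with the n - 1 denominator."""
    n = len(x1)
    mean1 = sum(x1) / n
    mean2 = sum(x2) / n
    return sum((a - mean1) * (b - mean2) for a, b in zip(x1, x2)) / (n - 1)


def pearson(x1, x2):
    """Covariance divided by the product of the standard deviations."""
    s1 = sqrt(covariance(x1, x1))
    s2 = sqrt(covariance(x2, x2))
    return covariance(x1, x2) / (s1 * s2)


# Made-up illustrative data.
height_m = [1.60, 1.70, 1.75, 1.80, 1.90]
weight_kg = [55.0, 68.0, 70.0, 80.0, 88.0]
height_cm = [100 * h for h in height_m]

print(covariance(height_m, weight_kg), covariance(height_cm, weight_kg))  # differ by a factor of 100
print(pearson(height_m, weight_kg), pearson(height_cm, weight_kg))        # same value
print(pearson(weight_kg, height_m))                                       # same value again
```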
12
SPEARMAN CORRELATION COEFFICIENT
nonparametric correlation coeff. based on ranks
$$r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n^3 - n}$$
$d_i$ = difference between the ranks of X and Y in one row
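The rank-difference formula can be sketched as follows. This assumes there are no tied values, since the simple formula is exact only without ties. The helper names `ranks` and `spearman` and the data are illustrative.

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(x, y):
    """Spearman coefficient from squared rank differences (no ties)."""
    n = len(x)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n ** 3 - n)


# Made-up data: y grows with x, but far from linearly.
x = [10, 20, 30, 40, 50]
y = [1, 4, 9, 16, 1000]
print(spearman(x, y))                   # ranks agree perfectly
print(spearman(x, list(reversed(y))))   # ranks are exactly reversed
```

Because only ranks enter the calculation, the extreme value 1000 has no more weight than any other largest value would.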
13
SPEARMAN CORRELATION COEFFICIENT
influential points (extremes)
Pearson R = -0.412 (influential points are fully counted)
Spearman R = +0.541 (influential points are strongly limited)
14
CONFIDENCE INTERVAL R (CI)
CI(ρ) is the interval of possible values of the population correlation coefficient ρ (with probability 1 − α)
Because the distribution of the corr. coeff. is not normal, we must use the Fisher transformation
$$Z(R) = \operatorname{arctanh}(R) = 0.5\,\ln\frac{1+R}{1-R}$$
with approx. normal distribution with mean E(Z) = Z(ρ) and variance D(Z) = 1/(n − 3).
15
CONFIDENCE INTERVAL R (CI)
R → Fisher transformation → Z(R)
$$Z(R) \pm z_{1-\alpha/2}\,\frac{1}{\sqrt{n-3}}$$
half-width of the CI of the transformed value: $z_{1-\alpha/2}/\sqrt{n-3}$
lower and upper boundary of the CI in Fisher transformation
retransformation of Z(R) to correlation coeff. → lower and upper boundary of the CI of the correlation coeff.
16
CONFIDENCE INTERVAL R (CI)
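R = 0.95305, n = 12
Fisher value: fisherz(0.95305) = 1.864
CI of Fisher value:
$$1.864 \pm 1.96\cdot\frac{1}{\sqrt{12-3}} = 1.864 \pm 0.65333 = (1.2107;\ 2.5173)$$
lower boundary 1.21, Fisher value 1.864, upper boundary 2.517
CI of correlation coeff.:
fisherz2r(1.2107) = 0.83689
fisherz2r(2.5173) = 0.98707
lower boundary 0.837, R = 0.953, upper boundary 0.987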
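The worked example can be reproduced with the Python standard library. This is a sketch only: `math.atanh` and `math.tanh` stand in for the slide's `fisherz` and `fisherz2r`, and the function name `correlation_ci` is an assumption made for illustration.

```python
from math import atanh, sqrt, tanh
from statistics import NormalDist


def correlation_ci(r, n, alpha=0.05):
    """Confidence interval of a correlation coefficient via the Fisher transformation."""
    z = atanh(r)                                       # Fisher value Z(R)
    quantile = NormalDist().inv_cdf(1 - alpha / 2)     # about 1.96 for alpha = 0.05
    half_width = quantile / sqrt(n - 3)                # half of the CI in the Z scale
    return tanh(z - half_width), tanh(z + half_width)  # back to the correlation scale


low, high = correlation_ci(0.95305, 12)
print(round(low, 3), round(high, 3))
```

The interval is symmetric around Z(R) in the transformed scale but not around R after retransformation, which is why the lower boundary lies farther from 0.953 than the upper one.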
17
REGRESSION ANALYSIS
[Figure: MEASURED VALUES and MODEL VALUES; horizontal axis: independent (explanatory) variable X; vertical axis: dependent (explained, response) variable Y]
18
REGRESSION MODEL
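$$
\mathbf{y} = \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix},
\qquad
\mathbf{X} = \begin{pmatrix}
x_{11} & x_{12} & \cdots & x_{1j} & \cdots & x_{1m} \\
x_{21} & x_{22} & \cdots & x_{2j} & \cdots & x_{2m} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{i1} & x_{i2} & \cdots & x_{ij} & \cdots & x_{im} \\
\vdots & \vdots & & \vdots & & \vdots \\
x_{n1} & x_{n2} & \cdots & x_{nj} & \cdots & x_{nm}
\end{pmatrix}
$$
$$\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}$$
y: response variable; X: explanatory variable(s); β: regression parameters; ε: random error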
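One common way to estimate the regression parameters in this model is ordinary least squares. This is an assumption here, since the slide does not name the estimator. The sketch below solves the normal equations (XᵀX)β = Xᵀy in plain Python; the helper names and the data are made up, and the response is generated without random error so that the known parameters are recovered.

```python
def solve(a, b):
    """Solve the linear system a * x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(X, y):
    """Least-squares estimate of beta from the normal equations."""
    columns = list(zip(*X))
    xtx = [[sum(p * q for p, q in zip(ci, cj)) for cj in columns] for ci in columns]
    xty = [sum(p * q for p, q in zip(ci, y)) for ci in columns]
    return solve(xtx, xty)


# Made-up design matrix: a column of ones (intercept) and two explanatory variables.
X = [[1.0, 1.0, 2.0],
     [1.0, 2.0, 1.0],
     [1.0, 3.0, 4.0],
     [1.0, 4.0, 3.0],
     [1.0, 5.0, 6.0]]
beta_true = [1.0, 2.0, -0.5]
y = [sum(b * x for b, x in zip(beta_true, row)) for row in X]  # no random error added
beta = ols(X, y)
print(beta)
```

In practice a numerical library with a QR or SVD based solver is preferable, because the normal equations become ill-conditioned when explanatory variables are strongly correlated.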
19
REGRESSION MODEL
[Figure: regression line with intercept a and regression parameter b, drawn as the change in Y per unit step (1) in X; horizontal axis: independent (explanatory) variable X; vertical axis: dependent (response) variable Y]
20
CONFIDENCE INTERVAL OF MODEL
upper boundary of CI lower boundary of CI
VALUE OF REGRESSION MODEL ( these values are only point estimates )
Area in which all possible models computed from any sample (coming from the same population) appear with probability 1 − α
CI of one model value
21
CI OF Y VALUES – PREDICTION INTERVAL
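is an estimate of an interval in which future observations will fall with a given probability 1 − α
$$y_{i\,(\min,\max)} = \hat{y}_i \pm t_{\alpha/2;\,n-m}\,\hat{\sigma}_{y_i}$$
where $\hat{\sigma}_{y_i}$ is the standard error of the predicted value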
CONFIDENCE INTERVAL OF MODEL (CI), PREDICTION INTERVAL OF RESPONSE (PI)
22
23
COMPARISON OF REGRESSION MODELS
Akaike information criterion (AIC)
$$AIC = n\,\ln\frac{RSS}{n} + 2m$$
RSS – residual sum of squares
m – number of parameters
The smaller the AIC, the better the model
(from the statistical point of view!).
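A small sketch of how the criterion trades goodness of fit against the number of parameters. The residual sums of squares below are hypothetical numbers chosen for illustration, not results from the slides.

```python
from math import log


def aic(rss, n, m):
    """AIC = n * ln(RSS / n) + 2 * m for n observations and m parameters."""
    return n * log(rss / n) + 2 * m


# Hypothetical comparison: two models fitted to the same 20 observations.
aic_line = aic(rss=12.0, n=20, m=2)    # straight line, 2 parameters
aic_cubic = aic(rss=11.5, n=20, m=4)   # cubic polynomial, 4 parameters
print(aic_line, aic_cubic)
```

Here the cubic model reduces the residual sum of squares only slightly, so the penalty for its two extra parameters outweighs the gain and the straight line has the smaller AIC.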
REGRESSION DIAGNOSTICS
24
Diagnostics of residuals:
• normality
• homoscedasticity (constant variance)
• independence
REGRESSION DIAGNOSTICS
25
Breusch–Pagan test (and many others…)
Weighted OLS method
REGRESSION DIAGNOSTICS
26
REGRESSION DIAGNOSTICS
27
Influential points
REGRESSION DIAGNOSTICS
28
HAT VALUES (leverages)
The hat matrix, H, relates the fitted values to the observed values. It describes the influence each observed value has on each fitted value.
The diagonal elements of the hat matrix are the leverages, which describe the influence each observed value has on the fitted value for that same observation.
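For a straight-line model with an intercept, the diagonal of the hat matrix has a simple closed form, h_ii = 1/n + (x_i − x̄)² / Σ(x_k − x̄)². The sketch below assumes that special case; the data and the function name `leverages` are made up.

```python
def leverages(x):
    """Diagonal of the hat matrix for the model y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]


# Made-up data: the last x value lies far from the others.
x = [1.0, 2.0, 3.0, 4.0, 15.0]
h = leverages(x)
print(h)
print(sum(h))  # equals the number of parameters (2) up to rounding
```

Leverage depends only on the explanatory variable, so a point can have high leverage regardless of its response value.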
REGRESSION DIAGNOSTICS
29
Cook's distance
measures the effect of deleting a given observation. Data points with large residuals (outliers) and/or high leverage may distort the outcome and accuracy of a regression.
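A minimal sketch for a straight-line model, using the standard closed form D_i = e_i² h_ii / (p s² (1 − h_ii)²), where e_i is the residual, h_ii the leverage, p the number of parameters and s² the residual variance. The data are made up, and the function name is illustrative.

```python
def cooks_distance(x, y):
    """Cook's distance of every observation for the model y = a + b * x."""
    n, p = len(x), 2                       # p = number of parameters (a and b)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    hat = [1 / n + (xi - mean_x) ** 2 / sxx for xi in x]
    s2 = sum(e ** 2 for e in residuals) / (n - p)      # residual variance
    return [e ** 2 * h / (p * s2 * (1 - h) ** 2) for e, h in zip(residuals, hat)]


# Made-up data: the last point has high leverage and breaks the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]
d = cooks_distance(x, y)
print(d)
```

The last observation has only a moderate residual, because it pulls the fitted line towards itself, yet its Cook's distance is by far the largest since its leverage is close to 1.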
REGRESSION DIAGNOSTICS
30
DFFITS
The DFFITS statistic is a scaled measure of the change in the predicted value for the i-th observation and is calculated by deleting the i-th observation. A large value indicates that the observation is very influential in its neighborhood of the X space.
A general cutoff to consider is 2; a recommended size-adjusted cutoff is $2\sqrt{m/n}$, where m is the number of parameters and n the number of observations.
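The deletion idea can be sketched directly for a straight-line model: refit without observation i, compare the two predictions at x_i, and scale by the deleted-fit residual standard deviation times the square root of the leverage. The data are made up, the function names are illustrative, and the cutoff used is the commonly cited size-adjusted value of 2·sqrt(parameters / observations).

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def dffits(x, y):
    """DFFITS of every observation, computed by explicit deletion and refitting."""
    n, n_params = len(x), 2
    a, b = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    values = []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a_i, b_i = fit_line(xs, ys)                        # fit without observation i
        rss_i = sum((yj - (a_i + b_i * xj)) ** 2 for xj, yj in zip(xs, ys))
        s_i = sqrt(rss_i / (n - 1 - n_params))             # deleted residual std. dev.
        h_i = 1 / n + (x[i] - mean_x) ** 2 / sxx           # leverage in the full fit
        change = (a + b * x[i]) - (a_i + b_i * x[i])       # change in predicted value
        values.append(change / (s_i * sqrt(h_i)))
    return values


# Made-up data: the last point has high leverage and breaks the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]
values = dffits(x, y)
cutoff = 2 * sqrt(2 / len(x))
flagged = [i for i, v in enumerate(values) if abs(v) > cutoff]
print(values, flagged)
```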
REGRESSION DIAGNOSTICS
31
DFBETAS
DFBETAS are scaled measures of the change in each parameter estimate and are calculated by deleting the i-th observation.
The general cutoff value is 2; the size-adjusted cutoff is $2/\sqrt{n}$.
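A sketch for a straight-line model, again by explicit deletion. Each parameter change is scaled by the deleted-fit residual standard deviation times the square root of the matching diagonal element of (XᵀX)⁻¹, which for this model is Σx²/(n·Sxx) for the intercept and 1/Sxx for the slope. The data and function names are made up for illustration.

```python
from math import sqrt


def fit_line(x, y):
    """Least-squares intercept and slope of y = a + b * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    return mean_y - slope * mean_x, slope


def dfbetas(x, y):
    """(intercept, slope) DFBETAS of every observation for y = a + b * x."""
    n, n_params = len(x), 2
    a, b = fit_line(x, y)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    c_intercept = sum(xi ** 2 for xi in x) / (n * sxx)   # diagonal of (X'X)^-1
    c_slope = 1 / sxx
    rows = []
    for i in range(n):
        xs, ys = x[:i] + x[i + 1:], y[:i] + y[i + 1:]
        a_i, b_i = fit_line(xs, ys)                      # fit without observation i
        rss_i = sum((yj - (a_i + b_i * xj)) ** 2 for xj, yj in zip(xs, ys))
        s_i = sqrt(rss_i / (n - 1 - n_params))           # deleted residual std. dev.
        rows.append(((a - a_i) / (s_i * sqrt(c_intercept)),
                     (b - b_i) / (s_i * sqrt(c_slope))))
    return rows


# Made-up data: the last point has high leverage and breaks the trend.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 15.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.0]
rows = dfbetas(x, y)
cutoff = 2 / sqrt(len(x))
print(rows, cutoff)
```

Unlike DFFITS, which gives one value per observation, DFBETAS gives one value per observation and parameter, so it shows which estimate an influential point distorts.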