Fundamentals of Data Analysis
Lecture 11
Correlation and regression
Program for today
• Basic concepts
• Correlation diagram and correlation table
• Linear correlation
• Linear regression
• The correlation of multiple variables
• Regression curves
Basic concepts
Correlation is defined as the statistical interdependence of measurements of different phenomena that either depend on a common cause or are in a direct causal relationship with each other.
Note, however, that the concept of correlation differs both from a causal relationship and from the notion of stochastic dependence between random variables.
An extreme case is the correlation of collinear random variables.
A correlation is called direct or positive when an increase in one variable is accompanied by an increase in the other. When an increase in one variable is accompanied by a decrease in the other, we are dealing with an inverse or negative correlation.
Basic concepts
Regression in mathematical statistics is an empirically determined functional relationship between correlated random variables.
Having established that the studied traits are correlated, we proceed to find a regression function that allows us to predict the value of one feature on the assumption that the other takes a given value.
In practice, the most important case is linear regression, which corresponds to a linear relationship between the random variables under consideration. Although a "pure" linear relationship is rare in practice, linear regression is a convenient tool for obtaining approximate relationships.
Basic concepts
For more complex interdependencies non-linear regression is used, for example quadratic regression.
Two models of the data are distinguished:
• Model I, in which the values of the explanatory variable are known exactly (well defined)
• Model II, in which the explanatory variable is itself random or subject to error.
Correlation table and correlation diagram
Suppose we have a general population with two measurable characteristics X and Y, both of which are random variables. If certain parameters of the distribution of the two-dimensional variable (X, Y) are unknown, the problem arises of estimating them from a random sample of n pairs of numbers (xi, yi). Treating xi and yi as the coordinates of a point on the plane, the sample can be represented graphically in a correlation diagram.
Correlation table and correlation diagram
To build the table, construct a distribution series for each of the features, first calculating the range:
Rx = xmax - xmin,  Ry = ymax - ymin
Then, on the basis of the sample size n, we choose an appropriate number of classes k and calculate the class width:
dx = Rx / k,  dy = Ry / k
As the lower limit of the first class we take a value slightly lower than the minimum value, and as the upper limit of the last class a value slightly higher than the maximum value.
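The construction above can be sketched in Python. This is a minimal illustration rather than the lecturer's procedure: the helper names `class_width` and `correlation_table` are hypothetical, NumPy's `histogram2d` is used for the counting, and the lower limits and rounded class widths are passed in by hand, as in the worked example of this lecture (31.0 and 2 for X, 3.25 and 0.5 for Y). The five sample pairs are the first five rows of the example data.

```python
import numpy as np


def class_width(values, k):
    """Return the range R = max - min and the raw class width d = R / k."""
    values = np.asarray(values, dtype=float)
    r = values.max() - values.min()
    return r, r / k


def correlation_table(x, y, k, x_low, d_x, y_low, d_y):
    """Count the pairs (x_i, y_i) falling into each of the k x k classes.

    counts[j, i] is the number of pairs in the j-th class of Y and the
    i-th class of X. Classes are half-open [low, high), except the last
    one, which also includes its upper limit (NumPy's convention).
    """
    x_edges = x_low + d_x * np.arange(k + 1)
    y_edges = y_low + d_y * np.arange(k + 1)
    counts, _, _ = np.histogram2d(y, x, bins=[y_edges, x_edges])
    return counts.astype(int), x_edges, y_edges


# First five pairs of the example data (illustration only).
x = [38.5, 41.1, 37.8, 36.0, 32.2]
y = [5.5, 4.8, 5.0, 4.9, 5.1]

table, x_edges, y_edges = correlation_table(
    x, y, k=7, x_low=31.0, d_x=2.0, y_low=3.25, d_y=0.5
)
column_totals = table.sum(axis=0)  # number of pairs in each class of X
print(table)
print(column_totals)
```

Rounding the raw width R / k up to a convenient value matters: with the lower limit placed below the minimum, k classes of the raw width would not reach the maximum.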
Correlation table and correlation diagram
Linear correlation
The strength of the interdependence of two variables can be expressed numerically by many measures, but the most popular of these is the Pearson correlation coefficient:
$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_x \sigma_y} $$
where the covariance is given by:
$$ \operatorname{cov}(x, y) = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) $$
The estimator of the correlation coefficient $\rho$ between the two studied features X and Y in the population is the sample correlation coefficient, calculated from the n pairs (xi, yi) of results with the aid of the equation:
$$ r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \sum_{i=1}^{n} (y_i - \bar{y})^2}} $$
Linear correlation
The sample coefficient r, with (n-1) degrees of freedom, can serve as an estimator of the correlation; its square, r², is called the coefficient of determination.
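As a small illustration, the covariance and the sample correlation coefficient can be computed directly from their definitions. The function names `covariance` and `pearson_r` below are hypothetical, and the data are made up purely to show the two extreme cases.

```python
import math


def covariance(x, y):
    """Covariance with the 1/n normalisation: mean of (x_i - x_mean)(y_i - y_mean)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n


def pearson_r(x, y):
    """Sample correlation coefficient: sum of cross-products over the root of the product of sums of squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / math.sqrt(s_xx * s_yy)


# Made-up data: an exactly increasing and an exactly decreasing linear relation.
r_up = pearson_r([1, 2, 3], [2, 4, 6])
r_down = pearson_r([1, 2, 3], [3, 2, 1])
print(r_up, r_down)
```

Because the same normalising factor appears in the numerator and the denominator, it cancels, so the choice between 1/n and 1/(n-1) does not change r.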
Linear correlation
The correlation coefficient takes values in the interval [-1, 1].
Its absolute value indicates the strength of the relationship: the closer to zero, the weaker the relationship; the closer to 1 or -1, the stronger. A value of 1 or -1 indicates a perfect linear relationship. The sign of the coefficient indicates the direction of the relationship: "+" indicates a positive relationship, i.e. an increase (decrease) in the value of one feature is accompanied by an increase (decrease) in the other; "-" indicates a negative relationship, i.e. an increase (decrease) in the value of one feature is accompanied by a decrease (increase) in the other.
Linear correlation
The following scale can be used to assess the strength of a correlation from the absolute value of the coefficient (keeping in mind the appropriate sample size):
• below 0.1 - negligible
• from 0.1 to 0.3 - weak
• from 0.3 to 0.5 - moderate
• from 0.5 to 0.7 - high
• from 0.7 to 0.9 - very high
• above 0.9 - almost complete.
This scale is arbitrary.
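For illustration, the scale can be written as a small lookup function. The name `correlation_strength` is hypothetical, the labels paraphrase the scale above, and the treatment of the boundary values (each class includes its lower limit) is an assumption, since the scale does not say which class a boundary belongs to.

```python
def correlation_strength(r):
    """Map a correlation coefficient to a verbal label using the arbitrary scale above."""
    a = abs(r)  # only the magnitude matters for strength
    if a < 0.1:
        return "negligible"
    if a < 0.3:
        return "weak"
    if a < 0.5:
        return "moderate"
    if a < 0.7:
        return "high"
    if a < 0.9:
        return "very high"
    return "almost complete"


print(correlation_strength(0.6878))
print(correlation_strength(-0.95))
```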
Correlation table and correlation diagram
Example
N = 50 measurements of casting dimensions were made; the results are shown in the table. At the 95% confidence level, verify the hypothesis that there is a correlation between the dimensions of the castings.
Correlation table and correlation diagram
Example
i xi yi i xi yi
1 38.5 5.5 26 34.2 3.6
2 41.1 4.8 27 39.1 5.1
3 37.8 5.0 28 37.5 4.9
4 36.0 4.9 29 35.5 5.0
5 32.2 5.1 30 36.6 4.1
6 36.8 4.3 31 40.5 5.5
7 33.5 4.5 32 37.2 5.0
8 35.3 3.8 33 34.5 4.8
9 31.1 3.4 34 38.5 4.5
10 42.5 5.7 35 34.0 4.1
11 39.5 5.4 36 33.5 4.0
12 42.1 5.2 37 32.5 4.5
13 38.0 5.2 38 36.4 4.5
14 36.5 5.1 39 37.5 5.6
15 40.0 4.5 40 41.4 5.3
16 36.5 4.4 41 39.5 6.0
17 34.0 4.4 42 38.1 3.9
18 34.5 3.9 43 35.7 4.6
19 44.5 6.6 44 39.5 6.0
20 38.0 5.9 45 35.5 4.6
21 40.0 5.7 46 40.5 6.1
22 36.5 5.4 47 37.5 4.3
23 38.8 5.1 48 33.5 5.2
24 34.5 4.6 49 42.5 6.6
25 36.1 4.2 50 38.0 4.4
Correlation table and correlation diagram
Example
We calculate the ranges:
Rx = 44.5 - 31.1 = 13.4 and Ry = 6.6 - 3.4 = 3.2
Since the number of measurements is n = 50, we take the number of classes k equal to 7. Thus the class widths are: for characteristic X (dimension) dx = Rx / k = 13.4 / 7 ≈ 2, and for characteristic Y dy = Ry / k = 3.2 / 7 ≈ 0.5.
As the lower limit for characteristic X we take x = 31.0, and for characteristic Y the value y = 3.25.
Thus we obtain the correlation table shown below.
Correlation table and correlation diagram
Example

          i            1      2      3      4      5      6      7
 j    Y \ X          31-33  33-35  35-37  37-39  39-41  41-43  43-45
 1    3.25-3.75        1      1      -      -      -      -      -
 2    3.75-4.25        1      3      3      1      -      -      -
 3    4.25-4.75        1      3      5      3      1      -      -
 4    4.75-5.25        1      2      3      5      2      -      -
 5    5.25-5.75        -      -      1      2      3      2      -
 6    5.75-6.25        -      -      -      1      2      1      -
 7    6.25-6.75        -      -      -      -      -      -      -
      ni.              4      9     12     12      8      4      1
Correlation table and correlation diagram
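Example
The mean values are 37.273 for x and 5.19 for y, and the variances are 8.5136 and 0.4544 respectively (standard deviations 2.9178 and 0.6741), thus
$$ r_{XY} = \frac{\frac{1}{n}\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij} x_i y_j - \bar{x}\,\bar{y}}{s_X s_Y} = \frac{\frac{1}{50}\sum_{i=1}^{k}\sum_{j=1}^{k} n_{ij} x_i y_j - 37.273 \cdot 5.19}{2.9178 \cdot 0.6741} = 0.6878 $$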
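The slides do not show the verification step itself. One common way to carry it out, given here only as a sketch, is the t-test for a correlation coefficient: under the hypothesis of no correlation, t = r·sqrt(n - 2) / sqrt(1 - r²) follows Student's t distribution with n - 2 degrees of freedom. This choice of test is an assumption, not something stated in the lecture; the critical value 2.011 (two-sided, significance level 0.05, 48 degrees of freedom) is taken from standard t tables, and the name `t_statistic` is hypothetical.

```python
import math


def t_statistic(r, n):
    """t statistic for testing the hypothesis of zero correlation (n - 2 degrees of freedom)."""
    return r * math.sqrt(n - 2) / math.sqrt(1.0 - r * r)


r = 0.6878  # sample correlation coefficient from the example
n = 50      # number of measurements

# Assumption: two-sided critical value of Student's t for alpha = 0.05
# and 48 degrees of freedom, read from standard tables.
T_CRITICAL = 2.011

t = t_statistic(r, n)
significant = abs(t) > T_CRITICAL
print(t, significant)
```

If the statistic exceeds the critical value, the hypothesis of no correlation is rejected at the chosen level.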
Correlation table and correlation diagram
Example
[Figure: correlation diagram (scatter plot) of Y against X; X axis from 30 to 45, Y axis from 3 to 7]
Linear regression
Consider a general population in which the characteristics (X, Y) have a two-dimensional distribution. The regression line of the second kind of the characteristic Y on the characteristic X is given by the equation:
$$ y = ax + b $$
where:
$$ a = \rho \frac{\sigma_Y}{\sigma_X} $$
is called the coefficient of linear regression of Y on X, and
$$ b = EY - \rho \frac{\sigma_Y}{\sigma_X} EX $$
is the offset coefficient (intercept).
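In practice the population quantities are unknown, so a natural sketch replaces them with their sample counterparts: the sample correlation coefficient, the sample standard deviations and the sample means. The function name `regression_line` and the data below are hypothetical; the comparison with `numpy.polyfit` is included only to show that this line coincides with the ordinary least-squares fit.

```python
import numpy as np


def regression_line(x, y):
    """Estimate a and b in y = a*x + b from sample moments.

    a = r * s_y / s_x and b = mean(y) - a * mean(x), i.e. the population
    formulas with sample estimates substituted.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    r = np.corrcoef(x, y)[0, 1]
    a = r * y.std() / x.std()
    b = y.mean() - a * x.mean()
    return a, b


# Made-up, slightly noisy data for illustration.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = regression_line(x, y)
a_ls, b_ls = np.polyfit(x, y, 1)  # least-squares straight line for comparison
print(a, b)
```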
The correlation of the multiple variables
When more than two variables are correlated, the following additional terms should be defined:
• Simple (total) correlation is the correlation between two variables, without taking other variables into account.
• Partial correlation is the correlation between two variables when the other variables are held constant.
• Multiple correlation is the correlation among several related variables that change simultaneously.
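The difference between simple and partial correlation can be illustrated for three variables. The sketch below uses the standard first-order formula r_xy.z = (r_xy - r_xz·r_yz) / sqrt((1 - r_xz²)(1 - r_yz²)), which does not appear in the slides; the function name `partial_correlation` and the data are hypothetical. The data are constructed so that x and y are related only through the common variable z.

```python
import numpy as np


def partial_correlation(x, y, z):
    """First-order partial correlation of x and y with z held constant."""
    r = np.corrcoef([x, y, z])  # rows are variables
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Made-up data: x and y both follow z plus disturbances that are
# uncorrelated with z and with each other.
z = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
e1 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0])
e2 = np.array([1.0, -1.0, 1.0, -1.0, -1.0, 1.0, -1.0, 1.0])
x = z + e1
y = z + e2

simple = np.corrcoef(x, y)[0, 1]        # simple (total) correlation
partial = partial_correlation(x, y, z)  # correlation with z held constant
print(simple, partial)
```

Here the simple correlation is high while the partial correlation vanishes, because the whole relationship between x and y comes from their common dependence on z.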
Regression curves
Regression curves have the general form of the equation:
y = a + b1x1 + b2x2+ ...
where bi is the partial regression coefficient of the i-th order.
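A minimal sketch of estimating a and the coefficients bi is an ordinary least-squares fit. This choice of method is an assumption, since the slides do not say how the coefficients are obtained; the function name `fit_multiple_regression` and the data are hypothetical, and `numpy.linalg.lstsq` does the fitting.

```python
import numpy as np


def fit_multiple_regression(X, y):
    """Least-squares estimates of a and b_1..b_m in y = a + b_1*x_1 + ... + b_m*x_m."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so that the intercept a is estimated too.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]


# Made-up data generated exactly from y = 1 + 2*x1 - 0.5*x2.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = 1.0 + 2.0 * X[:, 0] - 0.5 * X[:, 1]

a, b = fit_multiple_regression(X, y)
print(a, b)
```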
Regression curves
[Figure: surface chart]
Thank you for your attention!