prediction. prediction is a scientific guess of unknown by using the known data. statistical...

15
Prediction

Upload: oscar-barnett

Post on 23-Dec-2015

218 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Prediction

Page 2: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Prediction

• Prediction is a scientific guess of unknown by using the known data.

• Statistical prediction is based on correlation. If the correlation among variables is high, the success in prediction will be high.– So, if there is a perfect correlation among variables,

then the prediction will be perfect. • Today, we will study on the bivariate linear

correlation and regression. That is, linear relation between two variables.

Page 3: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Prediction

• Today, we will study on the bivariate linear correlation and regression. That is, linear relation between two variables. – In a graduate level of statistic class, you will learn curvilinear

correlation/regression for U-shaped relations and Cannonical Correlation for the relations among three or more variables

• By using statistical predictions (regression), we can answer such questions:– What is the height of a person whose weight is 80 kg?– If a students’ OSS score is 168, what will be his/her GPA in

university?

Page 4: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Prediction

• As you remember, correlation coefficient provides us a measure of how strongly two variables are related. – Pearson r is one of the correlation coefficients and

it is calculated on the least squared deviations. • By using that method, we can draw a

hypothetic line between the scores on a scatter diagram.– This line represents the best fit of the predicted

scores

Page 5: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

• In this example, the correlation between weight and height is perfect (r= +1). So, all the dots in the diagram are on the hypothetic line (the line of best fit).

• The formula of the line is Y=aX+c. For these distributions, it is Y=1x + 100.

Weight HeightAli 54 154Kezban 55 155Can 65 165Sevgi 45 145Kenan 65 165Arda 66 166Zehra 54 154Selin 56 156Nalan 67 167Serap 67 167

r=1 40 45 50 55 60 65 70130

135

140

145

150

155

160

165

170

Page 6: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Terminology:• The line in this diagram is regression line. • The equation of the line is regression equation (Y=aX+c).

– For correlation, it is not important which axis represents which variable, since there is no direction in correlation.

– For regression, Y axis is always shows the predicted (dependent) variable, and X axis represents the predictor (independent variable)

• Be careful about the language of regression. The equation is always read as regression of Y on X.

Weight Height

Ali 54 154

Kezban 55 155

Can 65 165

Sevgi 45 145

Kenan 65 165

Arda 66 166

Zehra 54 154

Selin 56 156

Nalan 67 167

Serap 67 167

r=1 40 45 50 55 60 65 70130

135

140

145

150

155

160

165

170

Page 7: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

• Now let’s focus on the other examples in which the relation between two variables is less perfect.

• As you can see, when the correlation become less perfect, dots in the diagram starts to diverge from the regression line

Weight HeightAli 53 154Kezban 54 155Can 65 165Sevgi 43 145Kenan 68 165Arda 67 166Zehra 53 154Selin 54 156Nalan 65 167Serap 66 167

r=.986 40 45 50 55 60 65 70130

135

140

145

150

155

160

165

170

Page 8: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

• In this example, the correlation is far from being perfect. – So, the dots are far away from the regression line.

Weight HeightAli 54 154Kezban 61 155Can 45 165Sevgi 46 145Kenan 72 165Arda 68 166Zehra 57 154Selin 71 156Nalan 47 167Serap 49 167

r=.08640 45 50 55 60 65 70 75

130

135

140

145

150

155

160

165

170

Page 9: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

The Criterion of Best Fit• By using the regression equation,

we can calculate predicted values of Y. – Let’s say these predicted values

are Yl

• As you can see in the scatter diagram, these predicted values of Yl are not the same of Y.– The discrepancy with Y and Yl is

the error of prediction. • These discrepancies are

presented as vertical lines in the figure. – As you can see, the higher the r,

the lower the error is.

Page 10: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

The Regression Equation

• The regression of Y on X (Standart-Score Formula)– zl

y = rzx • zl

y = the predicted standart-score value of Y• r = correlation coefficient• rzx = the standart-score value of X from which zl

y is predicted

• As you can see, – 1. zl

y = rzx when r=1

– 2. zly is 0 when zx is 0. That is mean of X overlaps with mean of

Yl

• To calculate the predicted values of Y from raw scores of X, we can use a simpler formula

Page 11: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

The Regression Equation

• In this formula– Yl = the predicted raw

score in Y– Sx and Sy= the two

standart deviations– r= correlation coefficient

• Now, lets try to calculate Yl for the distribution presented in Table 1.

Note that this form of the formula is similar toY=aX+c

Page 12: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Error of PredictionThe Standard Error of Estimate

• The regression line is a kind of a mean of the score in the scatter plot

• The discrepancy from regression line is then a kind of a deviation form mean (line)

• So, we can use a similar formula to SD in order to calculate standard discrepancy

Page 13: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Error of PredictionThe Standard Error of Estimate

• A preferred formula for Syx is Syx = Sy

Page 14: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Limits of Error in Estimating Y from X

• Assume that actual scores of Y are normally distributed about predicted scores– So, we can use normal curve to calculate the limits of

error in prediction• In a Normal Distribuion – 68% of the Y scores fall within the limits Y(mean) +/- 1 SD– 95 % of the Y scores fall within the limits Y(mean) +/- 1.96

SD– 99 % of the Y scores fall within the limits Y(mean) +/- 2.58

SD

Page 15: Prediction. Prediction is a scientific guess of unknown by using the known data. Statistical prediction is based on correlation. If the correlation among

Limits of Error in Estimating Y from X

• Given that regression line is a kind of a mean and standard error is a kind of a SD

• We can see that – 68% of actual Y values fall within the limits

Y(mean) +/- 1 Syx – 95 % of actual Y values fall within the limits

Y(mean) +/- 1.96 Syx – 99 % of actual Y values fall within the limits

Y(mean) +/- 2.58 Syx