12b. Regression Analysis, Part 2 CSCI N207 Data Analysis Using Spreadsheet Lingma Acheson linglu@iupui.edu Department of Computer and Information Science, IUPUI

Upload: john-alexander

Post on 28-Dec-2015


Page 1: 12b. Regression Analysis, Part 2 CSCI N207 Data Analysis Using Spreadsheet Lingma Acheson linglu@iupui.edu Department of Computer and Information Science,

12b. Regression Analysis, Part 2

CSCI N207 Data Analysis Using Spreadsheet

Lingma Acheson

linglu@iupui.edu

Department of Computer and Information Science, IUPUI

Fitting the Data

[Figure: scatter plot "Reading Aptitude and Reading Hours"; x-axis: Aptitude, y-axis: Hours.]

[Figure: "Reading Aptitude and Reading Hours" with a fitted trendline, f(x) = 0.4x − 1.]

Fitting the Data

• If there are more than two data points, chances are they don’t all fall on one straight line.

• We need to find the equation for a straight line that does the “best job” of reproducing the data.

• About half of the data points should fall above our line (“positive residual”) and about half should fall below (“negative residual”).

Residual

• Difference between the measured and the calculated Y-values: residual = measured Y − calculated Y.

[Figure: "Average Income versus % with a College Degree (by State)"; x-axis: Percentage of Population with College Degree or Higher, y-axis: Average Income Level ($ per year).]
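The residual definition can be sketched in Python. This is a minimal illustration, not part of the original slides: it assumes the reading aptitude and reading hours data from the Practice slide and the rounded trendline y = 0.2029x + 0.9429 reported later in the deck. The function name `residuals` is hypothetical.

```python
# Practice data from the slides: reading aptitude (x) and reading hours (y).
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]

# Trendline coefficients as reported on the slides (rounded to 4 decimals).
M = 0.2029
B = 0.9429


def residuals(xs, ys, m, b):
    """Return measured Y minus calculated Y for every data point."""
    return [y - (m * x + b) for x, y in zip(xs, ys)]


res = residuals(aptitude, hours, M, B)
above = sum(1 for r in res if r > 0)  # points above the line (positive residual)
below = sum(1 for r in res if r < 0)  # points below the line (negative residual)

for x, y, r in zip(aptitude, hours, res):
    print(f"x={x:>2}  measured={y}  calculated={M * x + B:.4f}  residual={r:+.4f}")
print(f"above: {above}, below: {below}, sum of residuals: {sum(res):+.4f}")
```

For a least-squares line the residuals sum to zero; here the sum is only approximately zero because the coefficients are rounded.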

Finding the Slope (m) of an Estimated Line

• The slope of the estimated line is the ratio of the covariance between the X and Y data sets to the variance of the X data set:

$$ m = \frac{\text{Covariance between } x \text{ and } y \text{ data sets}}{\text{Variance of the } x \text{ data set}} $$
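As a rough sketch of the slope formula, the following Python computes the covariance and variance by hand for the reading data from the Practice slide. The helper names (`mean`, `covariance`, `variance`, `slope`) are illustrative. Population formulas (dividing by n) are assumed; using the sample versions (dividing by n − 1) would give the same slope because the divisors cancel.

```python
# Practice data from the slides: reading aptitude (x) and reading hours (y).
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / len(xs)


def variance(xs):
    """Population variance: average squared deviation from the mean."""
    x_bar = mean(xs)
    return sum((x - x_bar) ** 2 for x in xs) / len(xs)


def slope(xs, ys):
    """Slope m = covariance(x, y) / variance(x)."""
    return covariance(xs, ys) / variance(xs)


m = slope(aptitude, hours)
print(round(m, 4))  # 0.2029, matching the trendline shown in the slides
```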

Finding the y-Intercept (b) of an Estimated Line

• Once we’ve found the slope, we can find the Y-intercept using the standard equation for a line, with one exception: we must use the means of the X and Y data sets as our coordinates (since the actual data points are unlikely to be on the estimated line):

$$ b = (\text{mean of } y) - \text{slope} \times (\text{mean of } x) $$

$$ b = \bar{Y} - m\bar{X} $$

• Excel functions:
– m: SLOPE(..,..)
– b: INTERCEPT(..,..)
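The intercept formula can be checked with a short Python sketch. It assumes the reading data from the Practice slide and takes the slope from the trendline label shown on the charts (0.202857142857143). The function name `intercept` is hypothetical and is not Excel's INTERCEPT.

```python
# Practice data from the slides: reading aptitude (x) and reading hours (y).
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]

# Slope as printed on the chart trendline in the slides.
M = 0.202857142857143


def intercept(xs, ys, m):
    """Y-intercept b = (mean of y) - m * (mean of x)."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    return y_bar - m * x_bar


b = intercept(aptitude, hours, M)
print(round(b, 4))  # 0.9429, matching the trendline shown in the slides
```

Here the mean of x is 20 and the mean of y is 5, so b = 5 − 0.202857 × 20 ≈ 0.9429.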

Practice

• Find an equation for the trendline of the following data set and predict the reading hours when aptitude is 25, 33 or 45.

Student   Reading Aptitude   Reading Hours
1         20                 5
2         5                  1
3         5                  2
4         35                 7
5         30                 8
6         35                 8
7         10                 3
8         5                  2
9         15                 5
10        40                 9

Predicting Values

• Once we get the slope (m) and the y-intercept (b) of the estimated line, we have a mathematical relation that ties the X variable to the Y variable.

• Once we have this relation, we can use it to predict X- and Y-coordinates that are not part of the data sets.

• E.g. What are the estimated reading hours for two new students, one with a reading aptitude of 25 and the other with 46?

y = 0.2029x + 0.9429

x = 25, y = 0.2029*25 + 0.9429 = 6.0154

x = 46, y = 0.2029*46 + 0.9429 = 10.2763
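The same arithmetic can be reproduced in a few lines of Python. This is only a sketch: it uses the rounded coefficients from the slide, and `predict_hours` is a hypothetical helper name. It also covers the aptitude values 33 and 45 from the Practice slide.

```python
# Rounded trendline coefficients from the slide: y = 0.2029x + 0.9429.
M = 0.2029
B = 0.9429


def predict_hours(aptitude):
    """Estimated reading hours for a given reading aptitude."""
    return M * aptitude + B


for x in (25, 33, 45, 46):
    print(f"aptitude {x}: {predict_hours(x):.4f} hours")
# aptitude 25: 6.0154 hours
# aptitude 33: 7.6386 hours
# aptitude 45: 10.0734 hours
# aptitude 46: 10.2763 hours
```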

Interpolation

• Interpolation is the process by which we use the formula for the estimated line to predict a value of Y for a given value of X that is not included in the data set, but is within the range of the data set.

• The given value of X and the predicted Y-value will be on the estimated line.

[Figure: scatter plot "Reading Aptitude and Reading Hours" (x-axis: Aptitude, y-axis: Hours) with trendline f(x) = 0.202857142857143x + 0.942857142857143, R² = 0.947556390977444.]

Extrapolation

• Extrapolation is the process by which we use the formula for the estimated line to predict a value of Y for a given value of X that is not included in the data set AND is not within the range of the data set.

• The given value of X and the predicted Y-value will be on the estimated line, but outside of the range of the data set.
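To make the distinction between interpolation and extrapolation concrete, here is a small Python sketch that labels a prediction based on where the X value sits relative to the data. It assumes the reading aptitude values from the Practice slide, which range from 5 to 40. The function name `classify_prediction` and its return labels are illustrative.

```python
# Reading aptitude values (X data) from the Practice slide.
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]


def classify_prediction(x, xs):
    """Label a prediction at x relative to the X data set xs."""
    if x in xs:
        return "in data set"      # x is one of the measured X values
    if min(xs) <= x <= max(xs):
        return "interpolation"    # new x, inside the range of the data
    return "extrapolation"        # new x, outside the range of the data


for x in (20, 25, 33, 45, 46):
    print(x, classify_prediction(x, aptitude))
# 20 in data set
# 25 interpolation
# 33 interpolation
# 45 extrapolation
# 46 extrapolation
```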

R² Value

• How good is the line? How confident is the prediction?

• R: Correlation Coefficient, −1 ≤ R ≤ 1

• R²: Coefficient of Determination, 0 ≤ R² ≤ 1

• The Coefficient of Determination is used to measure the certainty of making predictions from a graph. It is the proportion of the variation in Y that the trendline accounts for (often loosely described as the percent of the data closest to the trendline).

• The closer it is to 1, the more confident the prediction is.

- From "Correlation Coefficient" (http://mathbits.com/MathBits/TISection/Statistics2/correlation.htm)
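As a sketch of how R and R² relate, the following Python computes the correlation coefficient for the reading data from the Practice slide and squares it. The function name `correlation` is hypothetical. In Excel, the built-in functions CORREL and RSQ serve the same purpose.

```python
import math

# Practice data from the slides: reading aptitude (x) and reading hours (y).
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]


def correlation(xs, ys):
    """Pearson correlation coefficient R between two equal-length data sets."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


r = correlation(aptitude, hours)
r_squared = r * r  # coefficient of determination for a straight-line fit
print(f"R = {r:.4f}, R^2 = {r_squared:.4f}")  # R = 0.9734, R^2 = 0.9476
```

The R² value agrees with the one shown on the chart trendline in the slides (0.9476 after rounding).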

Excel Functions

• TREND() - Returns predicted Y values in a linear trend when passed X data.

• Add Trendline (from the Chart menu) - Adds the trendline to a chart and can display its equation and R² value (coefficient of determination) for a set of X,Y data.

[Figure: scatter plot "Reading Aptitude and Reading Hours" (x-axis: Aptitude, y-axis: Hours) with trendline f(x) = 0.202857142857143x + 0.942857142857143, R² = 0.947556390977444.]
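For readers working outside Excel, a rough Python analogue of TREND() might look like the sketch below. The function `trend` is hypothetical: it only mirrors the argument order of Excel's TREND(known_y's, known_x's, new_x's) for a single X variable, and it fits the line with the covariance/variance and mean formulas from the earlier slides.

```python
# Practice data from the slides: reading aptitude (x) and reading hours (y).
aptitude = [20, 5, 5, 35, 30, 35, 10, 5, 15, 40]
hours = [5, 1, 2, 7, 8, 8, 3, 2, 5, 9]


def trend(known_ys, known_xs, new_xs):
    """Fit y = m*x + b to the known data and return predicted Y for each new X."""
    x_bar = sum(known_xs) / len(known_xs)
    y_bar = sum(known_ys) / len(known_ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(known_xs, known_ys))
    sxx = sum((x - x_bar) ** 2 for x in known_xs)
    m = sxy / sxx          # slope: covariance over variance (common divisor cancels)
    b = y_bar - m * x_bar  # intercept from the means
    return [m * x + b for x in new_xs]


predicted = trend(hours, aptitude, [25, 46])
print([round(y, 4) for y in predicted])  # [6.0143, 10.2743]
```

These values differ slightly from the hand calculation in the slides (6.0154 and 10.2763) because the slides round the slope and intercept to four decimals before predicting.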