presentation stats updated

Upload: prerna-makhijani

Post on 06-Apr-2018


  • 8/3/2019 Presentation Stats Updated

    1/21

    REGRESSION MODELS

    By:

    Ayush Sharma 09

    Mickey Haldia 19

    Prerna Makhijani 29

    Sanoj George 39

    Sushant Jaggi 49

    Nitish Dorle 59

  • 8/3/2019 Presentation Stats Updated

    2/21

    Example

    Year    Population on Farm (in millions)
    1935    32.1
    1940    30.5
    1945    24.4
    1950    23.0
    1955    19.1
    1960    15.6
    1965    12.5

  • 8/3/2019 Presentation Stats Updated

    3/21

    Scatter Plot

    [Figure: scatter plot of population on farm (in millions) against year, 1930 to 1970]

  • 8/3/2019 Presentation Stats Updated

    4/21

    Correlation Coefficient (r)

    The correlation coefficient measures the strength of the linear relationship between two variables. It is calculated using the following formula:

    r = [nΣXY - (ΣX)(ΣY)] / sqrt([nΣX² - (ΣX)²] [nΣY² - (ΣY)²])
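    As a minimal sketch, the correlation coefficient for the farm population example can be computed from raw sums in the standard Pearson form. The function name `correlation` is an illustrative choice, not something from the slides.

```python
import math

# Farm population data from the example slide (year, millions of people).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def correlation(xs, ys):
    """Pearson correlation coefficient computed from raw sums."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


r = correlation(years, population)
print(round(r, 3))  # -0.993
```

    The result agrees with the value r = -0.993 quoted on the next slide.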

  • 8/3/2019 Presentation Stats Updated

    5/21

    Interpretation

    After calculating we find r = -0.993

    There is a strong negative correlation.

  • 8/3/2019 Presentation Stats Updated

    6/21

    Coefficient of Determination

    Squaring the correlation coefficient (r) gives the proportion of the variation in the y-variable that is explained by the variation in the x-variable.

    To relate x and y, the regression equation is calculated using the least squares technique.

    Regression equation: Y = a + bX

    Slope of the regression line: b = [nΣXY - (ΣX)(ΣY)] / [nΣX² - (ΣX)²]
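    The least squares fit for the farm population data can be sketched as follows. It uses the deviation-from-mean form of the slope, which is algebraically equivalent to the raw-sum form, and the standard result a = mean(Y) - b * mean(X) for the intercept. The function name `least_squares` is illustrative.

```python
# Farm population data from the example slide (year, millions of people).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def least_squares(xs, ys):
    """Return (a, b) for the fitted line Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept
    return a, b


a, b = least_squares(years, population)
print(round(b, 3), round(a, 2))  # -0.671 1330.35
```

    These match the fitted line y = -0.671x + 1,330.350 shown in the example.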

  • 8/3/2019 Presentation Stats Updated

    7/21

    To continue with the example

    We found r = -0.993. By squaring we get the

    Coefficient of Determination (R^2) = 0.987

    [Figure: regression line fitted to population on farm (in millions) against year, 1930 to 1970; y = -0.671x + 1,330.350, R^2 = 0.987]

  • 8/3/2019 Presentation Stats Updated

    8/21

    Interpretation

    We conclude that 98.7% of the variation in farm population can be explained by the progression of time.

    Here, population is the dependent variable (y-axis) and time is the independent variable (x-axis).

  • 8/3/2019 Presentation Stats Updated

    9/21

    Assumptions of the Regression Model

    The following assumptions are made about the errors:

    a) The errors are independent
    b) The errors are normally distributed
    c) The errors have a mean of zero
    d) The errors have a constant variance (regardless of the value of X)

  • 8/3/2019 Presentation Stats Updated

    10/21

    Patterns of Indicating Errors

    [Figure: residual plots of error against X]

  • 8/3/2019 Presentation Stats Updated

    11/21

    Estimating the Variance

    The error variance is estimated by the mean squared error (MSE):

    s² = MSE = SSE / (n - k - 1)

    where n = number of observations in the sample
          k = number of independent variables

    Therefore the standard deviation is

    s = sqrt(MSE)
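    To make the formula concrete, here is a sketch that applies it to the farm population example (n = 7, k = 1). SSE is taken as the sum of squared differences between observed and fitted values, which is the usual definition but is not spelled out on the slide. The helper names are illustrative.

```python
import math

# Farm population data from the example slide (year, millions of people).
years = [1935, 1940, 1945, 1950, 1955, 1960, 1965]
population = [32.1, 30.5, 24.4, 23.0, 19.1, 15.6, 12.5]


def fit_line(xs, ys):
    """Least squares intercept and slope for Y = a + bX."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b


def error_variance(xs, ys, k=1):
    """Return (SSE, MSE) where MSE = SSE / (n - k - 1)."""
    a, b = fit_line(xs, ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    mse = sse / (len(xs) - k - 1)
    return sse, mse


sse, mse = error_variance(years, population)
s = math.sqrt(mse)
print(round(sse, 3), round(mse, 3), round(s, 3))
```

    For this data the script should give SSE of about 4.277, MSE of about 0.855 and s of about 0.925 (in millions of people).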

  • 8/3/2019 Presentation Stats Updated

    12/21

    Testing the Model for Significance

    MSE and the coefficient of determination (r²) do not provide a good measure of accuracy when the sample size is small.

    In this case, it is necessary to test the model for significance.

    The linear model is given by

    Y = β0 + β1X + ε

    Null hypothesis: β1 = 0 (there is no linear relationship between X and Y)

    Alternative hypothesis: β1 ≠ 0 (there is a linear relationship)

  • 8/3/2019 Presentation Stats Updated

    13/21

    Steps in Hypothesis Test for a Significant

    Regression Model

    1. Specify the null and alternative hypotheses.

    2. Select the level of significance (α). Common values are between 0.01 and 0.05.

    3. Calculate the value of the test statistic using the formula:

       F = MSR/MSE

    4. Make a decision using one of the following methods:

       a) Reject the null hypothesis if Fcalculated > Ftable

       b) Reject the null hypothesis if p-value < α
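    Steps 3 and 4 can be sketched with the sums of squares from the ANOVA table of the two-variable house price model that appears later in this presentation. The sketch assumes the usual definitions MSR = SSR / k and MSE = SSE / (n - k - 1), and a significance level of 0.05. The function names are illustrative.

```python
def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE with MSR = SSR / k and MSE = SSE / (n - k - 1)."""
    msr = ssr / k
    mse = sse / (n - k - 1)
    return msr / mse


def reject_null(p_value, alpha=0.05):
    """Decision rule (b): reject the null hypothesis if p-value < alpha."""
    return p_value < alpha


# Regression and residual sums of squares from the ANOVA table (n = 14, k = 2).
f_value = f_statistic(ssr=13313936968, sse=6502131603, n=14, k=2)
print(round(f_value, 3))         # 11.262
print(reject_null(0.002178765))  # True
```

    The computed F agrees with the F column of that ANOVA table, and the reported Significance F is below 0.05, so the null hypothesis is rejected at that level.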

  • 8/3/2019 Presentation Stats Updated

    14/21

    Multiple regression Analysis

    More than one independent variable

    Y = β0 + β1X1 + β2X2 + … + βkXk + ε

    where

    Y = dependent variable (response variable)

    Xi = ith independent variable (predictor variable or explanatory variable)

    β0 = intercept (value of Y when all Xi = 0)

    βi = coefficient of the ith independent variable

    k = number of independent variables

    ε = random error

    To estimate the values of these coefficients, a sample is taken and the following equation is developed:

    Ŷ = b0 + b1X1 + b2X2 + … + bkXk

    where

    Ŷ = predicted value of Y

    b0 = sample intercept (an estimate of β0)

    bi = sample coefficient of the ith variable (an estimate of βi)

  • 8/3/2019 Presentation Stats Updated

    15/21

    Selling Price ($)   Square Footage   Age   Condition
    95000               1926             30    Good
    119000              2069             40    Excellent
    124800              1720             30    Excellent
    135000              1396             15    Good
    142800              1706             32    Mint
    145000              1847             38    Mint
    159000              1950             27    Mint
    165000              2323             30    Excellent
    182000              2285             26    Mint
    183000              3752             35    Good
    200000              2300             18    Good
    211000              2525             17    Good
    215000              3800             40    Excellent
    219000              1740             12    Mint

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.819680305
    R Square             0.671875802
    Adjusted R Square    0.612216857
    Standard Error       24312.60729
    Observations         14

    ANOVA
                  df    SS             MS         F         Significance F
    Regression    2     13313936968    6.7E+09    11.262    0.002178765
    Residual      11    6502131603     5.9E+08
    Total         13    19816068571

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%      Upper 95%   Lower 95.0%   Upper 95.0%
    Intercept    146630.89      25482.08287      5.75427    0.0001    90545.20735    202717      90545         202717
    SF           43.819366      10.28096507      4.26218    0.0013    21.19111495    66.448      21.191        66.448
    AGE          -2898.686      796.5649421      -3.639     0.0039    -4651.91386    -1145       -4651.9       -1145.5

    The p-values are used to test the individual variables for significance.

    R Square is the coefficient of determination (r²).

    The Coefficients column gives the regression coefficients.

    Jenny Wilson Realty
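    As a sketch of how the individual variables are tested, each t statistic in the output is the coefficient divided by its standard error, and a variable is judged significant when its p-value is below the chosen level. The 0.05 level and the names used here are assumptions for illustration. The numbers are copied from the summary output above.

```python
# (coefficient, standard error, p-value) copied from the summary output.
OUTPUT = {
    "Intercept": (146630.89, 25482.08287, 0.0001),
    "SF": (43.819366, 10.28096507, 0.0013),
    "AGE": (-2898.686, 796.5649421, 0.0039),
}
ALPHA = 0.05  # assumed level of significance


def t_statistic(coefficient, standard_error):
    """t Stat column: coefficient divided by its standard error."""
    return coefficient / standard_error


t_stats = {name: t_statistic(c, se) for name, (c, se, _) in OUTPUT.items()}
significant = [name for name, (_, _, p) in OUTPUT.items() if p < ALPHA]

for name in OUTPUT:
    print(name, round(t_stats[name], 3), name in significant)
```

    The computed t statistics reproduce the t Stat column to the precision shown, and both square footage and age are significant at the assumed level.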

  • 8/3/2019 Presentation Stats Updated

    16/21

    Binary or Dummy Variables

    Indicator variable

    Assigned a value of 1 if a particular condition is met, 0 otherwise

    The number of dummy variables must equal one less than the number of categories of a qualitative variable

    The Jenny Wilson Realty example:

    X3 = 1 for excellent condition
       = 0 otherwise

    X4 = 1 for mint condition
       = 0 otherwise
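    The encoding above can be sketched as a small function. With three condition categories there are two dummy variables, and "good" acts as the baseline where both are 0. The function name `encode_condition` is illustrative.

```python
def encode_condition(condition):
    """Map a condition label to the dummy variables (X3, X4).

    X3 = 1 for excellent condition, X4 = 1 for mint condition.
    Good condition is the baseline: both dummies are 0.
    """
    label = condition.strip().lower()
    if label not in {"good", "excellent", "mint"}:
        raise ValueError(f"unknown condition: {condition!r}")
    x3 = 1 if label == "excellent" else 0
    x4 = 1 if label == "mint" else 0
    return x3, x4


for condition in ["GOOD", "Excellent", "Mint"]:
    print(condition, encode_condition(condition))
```

    This prints (0, 0) for good, (1, 0) for excellent and (0, 1) for mint, matching the X3 and X4 columns in the data table.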

  • 8/3/2019 Presentation Stats Updated

    17/21

    Selling Price ($)   Square Footage   Age   X3 (Exc.)   X4 (Mint)   Condition
    95000               1926             30    0           0           Good
    119000              2069             40    1           0           Excellent
    124800              1720             30    1           0           Excellent
    135000              1396             15    0           0           Good
    142800              1706             32    0           1           Mint
    145000              1847             38    0           1           Mint
    159000              1950             27    0           1           Mint
    165000              2323             30    1           0           Excellent
    182000              2285             26    0           1           Mint
    183000              3752             35    0           0           Good
    200000              2300             18    0           0           Good
    211000              2525             17    0           0           Good
    215000              3800             40    1           0           Excellent
    219000              1740             12    0           1           Mint

    SUMMARY OUTPUT

    Regression Statistics
    Multiple R           0.94762
    R Square             0.89798
    Adjusted R Square    0.85264
    Standard Error       14987.6
    Observations         14

    ANOVA
                  df    SS             MS       F          Significance F
    Regression    4     17794427451    4E+09    19.8044    0.000174421
    Residual      9     2021641120     2E+08
    Total         13    19816068571

                 Coefficients   Standard Error   t Stat     P-value   Lower 95%      Upper 95%   Lower 95.0%   Upper 95.0%
    Intercept    121658         17426.61432      6.9812     6.5E-05   82236.71393    161080      82236.71      161080
    SF           56.4276        6.947516792      8.122      2E-05     40.71122594    72.144      40.71123      72.144
    AGE          -3962.82       596.0278736      -6.6487    9.4E-05   -5311.12866    -2614.5     -5311.129     -2614.5
    X3 (Exc.)    33162.6        12179.62073      2.7228     0.0235    5610.432651    60714.9     5610.433      60715
    X4 (Mint)    47369.2        10649.26942      4.4481     0.0016    23278.92699    71459.6     23278.93      71460

    The coefficient of AGE is negative, indicating that the price decreases as a house gets older.

    Jenny Wilson Realty
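    To illustrate how the fitted coefficients are read, here is a sketch that plugs them into the prediction equation. The coefficients are copied from the output above; the function name and the choice of a 1,926 square foot, 30-year-old house are for illustration only.

```python
# Coefficients copied from the summary output of the model with dummy variables.
COEF = {
    "intercept": 121658,
    "sf": 56.4276,
    "age": -3962.82,
    "x3": 33162.6,   # excellent condition
    "x4": 47369.2,   # mint condition
}


def predict_price(square_feet, age, condition):
    """Predicted selling price from the fitted equation."""
    label = condition.lower()
    x3 = 1 if label == "excellent" else 0
    x4 = 1 if label == "mint" else 0
    return (
        COEF["intercept"]
        + COEF["sf"] * square_feet
        + COEF["age"] * age
        + COEF["x3"] * x3
        + COEF["x4"] * x4
    )


for condition in ["good", "excellent", "mint"]:
    print(condition, round(predict_price(1926, 30, condition), 2))
```

    Holding square footage and age fixed, the model prices a mint house 47,369.20 above a good one, and each extra year of age lowers the predicted price by 3,962.82.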

  • 8/3/2019 Presentation Stats Updated

    18/21

    Model Building

    The value of r² can never decrease when more variables are added to the model.

    Adjusted r² is often used to determine whether an additional independent variable is beneficial.

    The adjusted r² is

    Adjusted r² = 1 - (1 - r²)(n - 1) / (n - k - 1)

    A variable should not be added to the model if it causes the adjusted r² to decrease.
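    As a check on this rule, the sketch below computes the adjusted r² with the standard formula 1 - (1 - r²)(n - 1)/(n - k - 1) for the two house price models, using the R Square values from their summary outputs. The function name is illustrative.

```python
def adjusted_r2(r2, n, k):
    """Adjusted r-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


# R Square values copied from the two summary outputs (n = 14 in both).
two_variable = adjusted_r2(0.671875802, n=14, k=2)  # SF and AGE
with_dummies = adjusted_r2(0.89798, n=14, k=4)      # SF, AGE, X3, X4

print(round(two_variable, 5), round(with_dummies, 5))
print(with_dummies > two_variable)  # True
```

    Both results agree with the Adjusted R Square lines in the outputs (about 0.612 and 0.853), and the increase suggests that adding the condition dummies is beneficial.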

  • 8/3/2019 Presentation Stats Updated

    19/21

    Multiple Regression

    Sales/Decision to buy = B0 + B1 * Price

    Sales/Decision to buy = B0 + B1 * (Price)^3 + B2 * (Design)^2 + B3 * (Performance)

    L = (Price)^3

    M = (Design)^2

    N = (Performance)

    Sales/Decision to buy = B0 + B1 * L + B2 * M + B3 * N
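    The substitution above can be sketched in code: once the inputs are transformed, the model is an ordinary linear combination of L, M and N. The coefficient values and input numbers below are hypothetical placeholders, since the slides do not give any.

```python
def transform(price, design, performance):
    """Return (L, M, N) = (Price^3, Design^2, Performance)."""
    return price ** 3, design ** 2, performance


def linear_model(coefficients, features):
    """B0 + B1*L + B2*M + B3*N for coefficients (B0, B1, B2, B3)."""
    b0, *slopes = coefficients
    return b0 + sum(b * x for b, x in zip(slopes, features))


# Hypothetical coefficients and inputs, for illustration only.
coefficients = (10.0, -0.5, 2.0, 1.5)
features = transform(price=2, design=3, performance=4)

print(features)                              # (8, 9, 4)
print(linear_model(coefficients, features))  # 30.0
```

    The same least squares machinery used for multiple regression applies unchanged to the transformed variables.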

  • 8/3/2019 Presentation Stats Updated

    20/21

    Pitfalls In Regression

    A high correlation does not mean one variable is causing a change in another. (Some regressions have shown a significantly positive relation between individuals' college GPA and future salary.)

    The model should not be used to predict for values of the independent variable that are above or below the ones in the sample.

    The number of independent variables that should be used in the model is limited by the number of observations.
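    The danger of predicting outside the sample range can be seen with the farm population line from the earlier example. The sketch uses the rounded coefficients shown on that slide; the year 2000 is an arbitrary out-of-sample choice and the function names are illustrative.

```python
# Rounded coefficients of the fitted farm population line, y = -0.671x + 1330.350.
SLOPE = -0.671
INTERCEPT = 1330.350
SAMPLE_YEARS = (1935, 1965)  # range of years in the sample


def predict(year):
    """Predicted farm population in millions."""
    return SLOPE * year + INTERCEPT


def in_sample_range(year):
    """True if the year lies within the range of the sample data."""
    low, high = SAMPLE_YEARS
    return low <= year <= high


for year in (1950, 2000):
    print(year, round(predict(year), 2), in_sample_range(year))
```

    For 1950 the prediction is close to the observed 23.0 million, but for 2000 the line gives a negative population, which is impossible.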

  • 8/3/2019 Presentation Stats Updated

    21/21