Regression and Sample Correlation

Post on 02-Jun-2018

  • 8/10/2019 Regression and Sample Correlation

    1/28

    Lecture 10: REGRESSION AND SAMPLE CORRELATION

    Predrag Spasojevic

    DESCRIPTIVE AND INFERENTIAL STATISTICS LECTURES summer 2013/14

    INTRODUCTION

    Many engineering and scientific problems are concerned with determining a relationship between a set of variables.

    For example, in a chemical process we might be interested in the relationship between:

    the output of the process,

    the temperature at which it occurs,

    the amount of catalyst employed.

    Knowledge of such a relationship would enable us to predict the output for various values of temperature and amount of catalyst.

    LINEAR REGRESSION LINE

    In many situations there is a single response variable $Y$, called the dependent variable, which depends on the values of a set of inputs $x_1, \ldots, x_r$, called the independent variables.

    The simplest type of relationship is a linear relationship. That is, for some constants $\beta_0, \beta_1, \ldots, \beta_r$ the following equation would hold:

    $$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r \qquad (1)$$

    If this were the relationship between $Y$ and the $x_i$, $i = 1, \ldots, r$, then it would be possible (once the $\beta_i$ were learned) to predict exactly the response for any set of input values.

    In practice, such precision is almost never attainable. The most that one can expect is that Equation (1) is valid subject to random error, i.e., the explicit relationship is:

    $$Y = \beta_0 + \beta_1 x_1 + \cdots + \beta_r x_r + e \qquad (2)$$

    where $e$, representing the random error, is assumed to be a random variable having mean 0. This relationship is called a linear regression equation.

    The linear regression equation describes the regression of $Y$ on the set of independent variables $x_1, \ldots, x_r$.

    The quantities $\beta_0, \beta_1, \ldots, \beta_r$ are called the regression coefficients, and must usually be estimated from a set of data.

    A simple regression equation is a regression equation containing a single independent variable $x$ (the input level):

    $$Y = \alpha + \beta x + e$$

    Here $Y$ is the response and $e$, representing the random error, is a random variable having mean 0 and variance $\sigma^2$.
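
    To make the simple regression model concrete, the following sketch simulates responses from $Y = \alpha + \beta x + e$. The normal distribution for $e$, the parameter values, and the name `simulate_responses` are all assumptions made for illustration; the slides only require that $e$ has mean 0.

```python
import random


def simulate_responses(xs, alpha, beta, sigma, seed=0):
    """Return one simulated response Y = alpha + beta*x + e for each input x.

    Assumption (not from the slides): e is drawn from Normal(0, sigma^2).
    """
    rng = random.Random(seed)  # seeded so that a run is reproducible
    return [alpha + beta * x + rng.gauss(0.0, sigma) for x in xs]


# Hypothetical parameter values, chosen only for illustration.
inputs = [100, 110, 120, 130, 140]
responses = simulate_responses(inputs, alpha=-4.0, beta=0.5, sigma=3.0)

# With sigma = 0 the random error vanishes and Y lies exactly on the line.
noiseless = simulate_responses(inputs, alpha=-4.0, beta=0.5, sigma=0.0)
```

    With `sigma=0.0` the responses fall exactly on the line $\alpha + \beta x$, which is the idealized situation of Equation (1) with a single input.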

    EX. 1: Consider the following 10 data pairs $(x_i, y_i)$, $i = 1, \ldots, 10$, relating $y$, the percent yield of a laboratory experiment, to $x$, the temperature at which the experiment was run.

    i     xi    yi        i     xi    yi
    1     100   45        6     150   68
    2     110   52        7     160   75
    3     120   54        8     170   76
    4     130   63        9     180   92
    5     140   62        10    190   88

    A plot of $y_i$ versus $x_i$, called a scatter diagram, is given in Fig. 1. It seems that a simple linear regression model would be appropriate.

    LEAST SQUARES ESTIMATORS OF THE REGRESSION PARAMETERS

    Suppose that the responses $Y_i$ corresponding to the input values $x_i$, $i = 1, \ldots, n$, are to be observed and used to estimate $\alpha$ and $\beta$ in a simple linear regression model.

    If $A$ is the estimator of $\alpha$ and $B$ the estimator of $\beta$, then the estimator of the response corresponding to the input variable $x_i$ would be:

    $$A + B x_i.$$

    The actual response is $Y_i$, so the squared difference is:

    $$(Y_i - A - B x_i)^2.$$

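    The sum of the squared differences between the estimated responses and the actual response values, call it $SS$, is:

    $$SS = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$

    The method of least squares chooses as estimators of $\alpha$ and $\beta$ the values of $A$ and $B$ that minimize $SS$.

    So, to determine these estimators, we differentiate $SS$ first with respect to $A$ and then with respect to $B$, as follows:

    $$\frac{\partial SS}{\partial A} = -2 \sum_{i=1}^{n} (Y_i - A - B x_i)$$

    $$\frac{\partial SS}{\partial B} = -2 \sum_{i=1}^{n} x_i (Y_i - A - B x_i)$$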
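
    As a small illustration of the quantity being minimized, the sketch below evaluates $SS$ on the data of EX. 1 for two candidate lines. The function name `sum_of_squares` and the two candidate pairs $(A, B)$ are hypothetical choices made only for this example.

```python
def sum_of_squares(xs, ys, a, b):
    """SS = sum over i of (Y_i - A - B*x_i)^2 for a candidate line A + B*x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))


# Data of EX. 1: temperature and percent yield.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

# Two hypothetical candidate lines; the smaller SS indicates the better fit.
ss_rough = sum_of_squares(temps, yields, a=0.0, b=0.45)
ss_better = sum_of_squares(temps, yields, a=-4.5, b=0.5)
```

    Least squares looks for the pair $(A, B)$ for which this value is as small as possible over all candidate lines.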

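    Setting these partial derivatives equal to zero yields the normal equations for the minimizing values $A$ and $B$:

    $$\sum_{i=1}^{n} Y_i = nA + B \sum_{i=1}^{n} x_i$$

    $$\sum_{i=1}^{n} x_i Y_i = A \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$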
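
    The normal equations form a 2-by-2 linear system in $A$ and $B$. A minimal sketch of solving them directly with Cramer's rule follows; the function name `solve_normal_equations` and the three-point data set are assumptions made for illustration.

```python
def solve_normal_equations(xs, ys):
    """Solve  n*A + (sum x)*B = sum Y  and  (sum x)*A + (sum x^2)*B = sum x*Y."""
    n = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    det = n * sum_xx - sum_x * sum_x  # determinant of the 2x2 system
    if det == 0:
        raise ValueError("all x values are equal; the line is not determined")
    a = (sum_y * sum_xx - sum_x * sum_xy) / det
    b = (n * sum_xy - sum_x * sum_y) / det
    return a, b


# Hypothetical points lying exactly on y = 1 + 2x, so A = 1 and B = 2.
a_hat, b_hat = solve_normal_equations([0, 1, 2], [1, 3, 5])
```

    The determinant is zero only when every $x_i$ is the same, in which case no unique line exists.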

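    Let

    $$\bar{Y} = \frac{1}{n} \sum_{i=1}^{n} Y_i, \qquad \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i.$$

    By the method of substitution:

    First normal equation:

    $$A = \bar{Y} - B \bar{x}$$

    Second normal equation:

    $$\sum_{i=1}^{n} x_i Y_i = (\bar{Y} - B \bar{x}) \sum_{i=1}^{n} x_i + B \sum_{i=1}^{n} x_i^2$$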

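    By the usual transformations of the second normal equation,

    $$B \left( \sum_{i=1}^{n} x_i^2 - \bar{x} \sum_{i=1}^{n} x_i \right) = \sum_{i=1}^{n} x_i Y_i - \bar{Y} \sum_{i=1}^{n} x_i,$$

    and the fact that

    $$\sum_{i=1}^{n} x_i = n \bar{x},$$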

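    So we get the following proposition:

    The least squares estimators of $\alpha$ and $\beta$ corresponding to the data set $x_i, Y_i$, $i = 1, \ldots, n$ are, respectively,

    $$A = \bar{Y} - B \bar{x}, \qquad B = \frac{\sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y}}{\sum_{i=1}^{n} x_i^2 - n \bar{x}^2},$$

    where $\bar{x}$ and $\bar{Y}$ are the averages of the $x_i$ and of the $Y_i$.

    The straight line $A + Bx$ is called the estimated regression line.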
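
    The proposition can be applied directly to the data of EX. 1. The sketch below uses the closed-form expressions for $B$ and $A$; the function name `least_squares_line` and the prediction input 155 are illustrative assumptions, not part of the lecture.

```python
def least_squares_line(xs, ys):
    """Return (A, B) for the estimated regression line A + B*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    numerator = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    denominator = sum(x * x for x in xs) - n * x_bar ** 2  # zero if all x are equal
    b = numerator / denominator
    a = y_bar - b * x_bar
    return a, b


# Data of EX. 1: temperature and percent yield.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

A, B = least_squares_line(temps, yields)

# Estimated yield at a hypothetical temperature of 155.
predicted_at_155 = A + B * 155
```

    For these data the estimates come out at roughly $A \approx -4.47$ and $B \approx 0.496$, so the estimated yield rises by about half a percentage point per degree.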

    EX. 2: The raw material used in the production of a certain synthetic fiber is stored in a location without humidity control.

    Measurements of the relative humidity in the storage location and of the moisture content of a sample of the raw material were taken over 15 days, with the following data (in percentages) resulting.

    Calculating the least squares estimators by the last proposition, we obtain the estimated regression line of moisture content on relative humidity in the storage location, shown in the following figure.

    THE COEFFICIENT OF DETERMINATION

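    Notation: If we let

    $$S_{xY} = \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) = \sum_{i=1}^{n} x_i Y_i - n \bar{x} \bar{Y},$$

    $$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n \bar{x}^2,$$

    $$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} Y_i^2 - n \bar{Y}^2,$$

    the least squares estimators can be expressed as

    $$B = \frac{S_{xY}}{S_{xx}}, \qquad A = \bar{Y} - B \bar{x}.$$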
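
    A short sketch of this notation in code, again on the data of EX. 1. The helper name `centered_sums` is hypothetical, and the sums are assumed to be the usual centered sums of squares and cross products.

```python
def centered_sums(xs, ys):
    """Return (S_xY, S_xx, S_YY), the centered cross-product and square sums."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy


# Data of EX. 1: temperature and percent yield.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

S_xY, S_xx, S_YY = centered_sums(temps, yields)
slope = S_xY / S_xx                                          # B
intercept = sum(yields) / len(yields) - slope * sum(temps) / len(temps)  # A
```

    Working with centered sums keeps the intermediate numbers small, which tends to be numerically safer than subtracting two large raw sums.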

    Suppose we wish to measure the amount of variation in the set of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.

    A standard measure in statistics of the amount of variation in a set of values $Y_1, \ldots, Y_n$ is:

    $$S_{YY} = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$$

    For instance, if all the $Y_i$ are equal, and thus are all equal to $\bar{Y}$, then $S_{YY}$ would equal 0.

    The variation in the values of the $Y_i$ arises from two factors:

    First: because the input values $x_i$ are different, the response variables $Y_i$ all have different mean values.

    Second: even when the differences in the input values are taken into account, each of the response variables $Y_i$ has variance $\sigma^2$ and thus will not exactly equal the predicted value at its input $x_i$.

    How much of the variation in the values of the response variables is due to the different input values?

    How much is due to the inherent variance of the responses even when the input values are taken into account?

    Answer: note that the quantity

    $$SS_R = \sum_{i=1}^{n} (Y_i - A - B x_i)^2$$

    measures the remaining amount of variation in the response values after the different input values have been taken into account.

    Thus, $S_{YY} - SS_R$ represents the amount of variation in the response variables that is explained by the different input values.

    The quantity $R^2$ defined by

    $$R^2 = \frac{S_{YY} - SS_R}{S_{YY}} = 1 - \frac{SS_R}{S_{YY}}$$

    represents the proportion of the variation in the response variables that is explained by the different input values.

    $R^2$ is called the coefficient of determination.

    $$0 \le R^2 \le 1.$$

    A value of $R^2$ near 1: most of the variation of the response data is explained by the different input values.

    A value of $R^2$ near 0: little of the variation is explained by the different input values.

    The value of $R^2$ is an indicator of how well the regression model fits the data, with a value near 1 indicating a good fit, and one near 0 indicating a poor fit.
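
    The coefficient of determination can be computed by fitting the line, summing the squared residuals, and comparing with the total variation. The sketch below does this for the data of EX. 1; the function name `coefficient_of_determination` is a hypothetical choice.

```python
def coefficient_of_determination(xs, ys):
    """Return R^2 = 1 - SS_R / S_YY for the least squares line through the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)  # total variation; zero if all y are equal
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    ss_r = sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))  # residual variation
    return 1 - ss_r / s_yy


# Data of EX. 1: temperature and percent yield.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

r_squared = coefficient_of_determination(temps, yields)

# Hypothetical points lying exactly on a line leave no residual variation.
r_squared_exact = coefficient_of_determination([0, 1, 2], [1, 3, 5])
```

    For EX. 1 the value is about 0.955, so the temperature differences account for most of the variation in yield, which agrees with the impression given by the scatter diagram.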

    THE SAMPLE CORRELATION COEFFICIENT

    For a data set consisting of the paired values $(x_i, y_i)$, $i = 1, \ldots, n$, we can obtain a statistic that measures the association between the individual values of the paired data. That statistic is called the sample correlation coefficient and is defined by:

    $$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 \, \sum_{i=1}^{n} (y_i - \bar{y})^2}}$$

    The sample correlation coefficient is always between $-1$ and $1$.

    If the correlation coefficient is positive, the relationship is direct: $y$ tends to increase as $x$ increases.

    If the correlation coefficient is negative, the relationship is inverse: $y$ tends to decrease as $x$ increases.

    If $|r| = 1$, the linear relationship between the paired values is perfect: all the points lie on a straight line.

    So, the closer the absolute value of $r$ is to 1, the stronger the linear correlation.
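
    The following sketch computes the sample correlation coefficient from centered sums and checks the two extreme cases described above. The function name `sample_correlation` and the tiny data sets are assumptions made for illustration.

```python
import math


def sample_correlation(xs, ys):
    """Return r = S_xy / sqrt(S_xx * S_yy) for paired data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)  # undefined if either sum is zero


# Hypothetical data: y rises exactly linearly with x, then falls exactly linearly.
r_increasing = sample_correlation([1, 2, 3], [2, 4, 6])
r_decreasing = sample_correlation([1, 2, 3], [6, 4, 2])

# Data of EX. 1: a strong but not perfect direct relationship.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]
r_example = sample_correlation(temps, yields)
```

    The first two calls reach the bounds $1$ and $-1$, while the EX. 1 data give a value close to, but below, 1.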

    THE COEFFICIENT OF DETERMINATION AND THE SAMPLE CORRELATION COEFFICIENT

    Consider data pairs $(x_i, Y_i)$, $i = 1, \ldots, n$, of response values $Y_1, \ldots, Y_n$ corresponding to the set of input values $x_1, \ldots, x_n$.

    The sample correlation coefficient $r$ of these data pairs, in the notation $S_{xY}$, $S_{xx}$, $S_{YY}$ introduced with the coefficient of determination, is:

    $$r = \frac{S_{xY}}{\sqrt{S_{xx} S_{YY}}}$$

    Upon using the identity

    $$SS_R = \frac{S_{xx} S_{YY} - S_{xY}^2}{S_{xx}}$$

    we see that:

    $$r^2 = \frac{S_{xY}^2}{S_{xx} S_{YY}} = \frac{S_{xx} S_{YY} - SS_R \, S_{xx}}{S_{xx} S_{YY}} = 1 - \frac{SS_R}{S_{YY}} = R^2$$

    So,

    $$|r| = \sqrt{R^2}$$

    The sign of $r$ is the same as that of $B$.

    The above gives additional meaning to the sample correlation coefficient.
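
    The relation between $r$ and $R^2$ can be checked numerically. The sketch below computes both quantities independently on the data of EX. 1 and compares them; the variable names are illustrative choices.

```python
import math

# Data of EX. 1: temperature and percent yield.
temps = [100, 110, 120, 130, 140, 150, 160, 170, 180, 190]
yields = [45, 52, 54, 63, 62, 68, 75, 76, 92, 88]

n = len(temps)
x_bar = sum(temps) / n
y_bar = sum(yields) / n
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(temps, yields))
s_xx = sum((x - x_bar) ** 2 for x in temps)
s_yy = sum((y - y_bar) ** 2 for y in yields)

# Least squares line and its residual sum of squares SS_R.
b = s_xy / s_xx
a = y_bar - b * x_bar
ss_r = sum((y - a - b * x) ** 2 for x, y in zip(temps, yields))

# R^2 from the residuals, r from the centered sums.
r_squared = 1 - ss_r / s_yy
r = s_xy / math.sqrt(s_xx * s_yy)

same_sign = (r > 0) == (b > 0)
gap = abs(r ** 2 - r_squared)  # should be zero up to rounding error
```

    The two routes agree up to floating-point rounding, and $r$ carries the sign of the slope $B$, which $R^2$ alone does not show.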

    For instance, if a data set has sample correlation coefficient $r$ equal to 0.9, then this implies that a simple linear regression model for these data explains 81 percent (since $R^2 = 0.9^2 = 0.81$) of the variation in the response values.

    That is, 81 percent of the variation in the response values is explained by the different input values.