frank q1q3


  • 7/29/2019 Frank Q1Q3

    1/18

    Exercise 1:

    a) We used the command describe to get information about the dataset.

    The total sample size is 74. Each observation is one car model (make and model). There are 12 variables in the dataset:

    make Make and Model

    price Price

    mpg Mileage (mpg)

    rep78 Repair Record 1978

    headroom Headroom (in.)

    trunk Trunk space (cu. ft.)

    weight Weight (lbs.)

    length Length (in.)

    turn Turn Circle (ft.)

    displacement Displacement (cu. in.)

    gear_ratio Gear Ratio

    foreign origin car type

    b) To get descriptive statistics for the variables, we use the command summarize.


    To describe only the foreign models, we add if foreign==1 to the command. To describe only the domestic models, we add if foreign==0 to the command.

    In the result, we found 52 domestic models, 22 foreign models.

    For price we have statistics below:

    For mileage we have statistics below:


    For length we have statistics below:

    For weight we have statistics below:

    c) We create new variables using the command gen:

    mpgDos Miles per gallon of gas (domestic models)

    mpgFor Miles per gallon of gas (foreign models)

    priceDos Price (domestic models)

    priceFor Price (foreign models)


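    To test the claim that domestic cars are cheaper than foreign cars, we conducted a two-sample mean comparison test.

    H0 : mean(priceDos) - mean(priceFor) <= 0

    Ha : mean(priceDos) - mean(priceFor) > 0

    We used the command ttest and got the result below.

    Right-tail test: the p-value equals 0.6701, which is higher than alpha = 0.05, so we fail to reject the null hypothesis. There is no evidence that domestic cars are more expensive than foreign ones. Failing to reject H0 does not by itself prove that domestic cars are cheaper: the left-tail test of that claim (Ha : mean(priceDos) - mean(priceFor) < 0) has p-value 1 - 0.6701 = 0.3299, which is also higher than 0.05, so the price difference is not statistically significant.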
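    For a continuous test statistic such as t, the two one-sided p-values sum to one, so the p-value for the opposite tail follows directly from the reported one. A minimal Python sketch of that arithmetic (the function name is illustrative and not part of the Stata output):

```python
def opposite_tail_p(p_one_sided: float) -> float:
    """Return the p-value of the opposite one-sided test for a continuous statistic."""
    if not 0.0 <= p_one_sided <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    return 1.0 - p_one_sided


ALPHA = 0.05
p_right = 0.6701  # right-tail p-value reported for the price comparison
p_left = opposite_tail_p(p_right)  # tests Ha: mean(priceDos) - mean(priceFor) < 0

print(f"left-tail p-value = {p_left:.4f}")
print("reject H0 at 5%" if p_left < ALPHA else "fail to reject H0 at 5%")
```

    Neither tail falls below 0.05 here, so the sample does not show a significant price difference in either direction.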

    To test the claim that domestic cars have better mileage (they can go more miles per gallon of gas consumed), we conducted a two-sample mean comparison test.

    H0 : mean(mpgDos) - mean(mpgFor) >= 0

    Ha : mean(mpgDos) - mean(mpgFor) < 0


    Left-tail test: the p-value equals 0.0017, which is less than alpha = 0.05, so we reject the null hypothesis. We conclude that domestic cars go fewer miles per gallon of gas consumed than foreign ones (that is, foreign cars have better mileage), so the data do not support the claim.

    d) D: Event the car has domestic origin

    F: Event the car has foreign origin

    E: Event the car is expensive (price > 5500)

    To find the number of cars which are expensive and have domestic origin, we used command below. The result is 18.

    To find the number of cars which are expensive and have foreign origin, we used

    command below. The result is 12.

    The number of domestic cars is 52 and the number of foreign cars is 22. The total sample size is 74. So we can construct this table.

                     Domestic   Foreign   Totals

    Expensive           18         12       30

    Not Expensive       34         10       44

    Totals              52         22       74
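    The remaining cells of the table follow by subtraction from the group totals. A short Python sketch of that bookkeeping, using only the counts stated in the text (the variable names are illustrative):

```python
# Counts taken from the text: group sizes and number of expensive cars per group.
n_domestic, n_foreign = 52, 22
expensive_domestic, expensive_foreign = 18, 12

table = {
    "Expensive": (expensive_domestic, expensive_foreign),
    "Not Expensive": (n_domestic - expensive_domestic, n_foreign - expensive_foreign),
    "Totals": (n_domestic, n_foreign),
}

# Print each row with its row total in the last column.
print(f"{'':<15}{'Domestic':>9}{'Foreign':>9}{'Totals':>8}")
for label, (dom, fgn) in table.items():
    print(f"{label:<15}{dom:>9}{fgn:>9}{dom + fgn:>8}")
```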


    From the table we have:

    p(E|D) = 18/52 = 0.3462, nD = 52

    p(E|F) = 12/22 = 0.5455, nF = 22

    We now test the hypothesis that the proportion of expensive cars is higher for foreign models. So we do the two-sample proportion test.

    Ho: p(E|D) - p(E|F) <= 0

    Ha: p(E|D) - p(E|F) > 0

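    We used the command below to get the result.

    Right-tail test. The p-value equals 0.9347, which is greater than alpha = 0.05. Consequently, we fail to reject the Ho: there is no evidence that the proportion of expensive cars is higher for domestic models. This is consistent with the claim but does not prove it. The left-tail test of the claim itself (Ha: p(E|D) - p(E|F) < 0) has p-value 1 - 0.9347 = 0.0653, which is still greater than 0.05, so the higher proportion of expensive cars among foreign models is not statistically significant at the 5% level.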
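    The two-sample proportion test can be reproduced by hand from the table counts. The Python sketch below assumes the usual textbook z statistic with a pooled standard error and a standard normal reference distribution; the function names are illustrative and the exact formula used by the Stata command is not confirmed by the text.

```python
import math


def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def two_proportion_z(x1: int, n1: int, x2: int, n2: int):
    """Return (p1, p2, z) for H0: p1 - p2 = 0 using the pooled standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    return p1, p2, (p1 - p2) / se


# Expensive cars: 18 of 52 domestic, 12 of 22 foreign (counts from the table).
p_dom, p_for, z = two_proportion_z(18, 52, 12, 22)
p_right = 1.0 - normal_cdf(z)  # Ha: p(E|D) - p(E|F) > 0
p_left = normal_cdf(z)         # Ha: p(E|D) - p(E|F) < 0

print(f"p(E|D) = {p_dom:.4f}, p(E|F) = {p_for:.4f}, z = {z:.3f}")
print(f"right-tail p = {p_right:.4f}, left-tail p = {p_left:.4f}")
```

    With the exact counts this gives z of about -1.60 and a right-tail p-value of about 0.945, slightly different from the 0.9347 reported above; the gap may come from the proportions that were typed into the command. The conclusion at the 5% level is the same either way.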

    e) We estimate the relationship between price and mileage by constructing the simple linear regression equation:

    price = b0 + b1(mpg)

    We used the command below to get the results of the regression:

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient for _cons is 11253.06 and its p-value P>|t| is 0.000. The coefficient = 11253.06 is significantly different from 0 because its p-value is less than 0.05. So we reject the Ho. We have b0 = 11253.06

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for mpg is -238.89 and its p-value P>|t| is 0.000. The coefficient = -238.89 is significantly different from 0 because its p-value is less than 0.05. So we reject the Ho. We have b1 = -238.89


    We can conclude the simple regression equation:

    price = 11253.06 - 238.89(mpg)

    For each one-unit increase in mileage (one more mile per gallon), we predict a decrease of 238.89 in the price of the car. That is, the more miles the car can go per gallon of gas, the cheaper the car is (a negative relationship).
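    To make the interpretation of the slope concrete, the fitted line can be evaluated at a few mileage values. A small Python sketch using the two coefficients reported above; the mpg values are arbitrary illustrations, not observations from the dataset.

```python
B0 = 11253.06  # intercept (_cons) reported in the regression output
B1 = -238.89   # slope on mpg reported in the regression output


def predicted_price(mpg: float) -> float:
    """Predicted price from the fitted line price = b0 + b1 * mpg."""
    return B0 + B1 * mpg


for mpg in (20, 21, 30):  # illustrative mileage values
    print(f"mpg = {mpg}: predicted price = {predicted_price(mpg):.2f}")

# One extra mile per gallon changes the prediction by exactly the slope.
print(f"change per extra mpg = {predicted_price(21) - predicted_price(20):.2f}")
```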

    We have R-squared = 0.2196. This means the portion of the total variation in the dependent variable (price) that is explained by the variation of the independent variable (mpg) in our regression is 21.96%. Because the R-squared is small, the simple regression does not have sufficient quality to claim a strong relationship between price and mileage.


    Question 3:

    We used the Data Editor to enter the given dataset into Stata. To get a description of the dataset, we used the command sum. There are 5 variables, namely grocery, housing, utilities, transportation, and healthcare. There are 25 observations in total.

    a. We need to plot the grocery index against each of the other indexes. The command we used is scatter.

    For Grocery vs Housing

    In the scatter plot given below, the observations for Grocery and Housing require extra attention since the points tend to stay on one side of the plot. However, the points do not clearly follow any linear direction. We keep this in mind when we do the regression. At this point we could not observe any significant pattern.

    For Grocery vs Utilities

    The scatter plot given below suggests a pattern: the cost of grocery tends to increase when the cost of utilities increases. However, the points are widely scattered, so we may see a large variation around the predicted pattern. We check this with the regression.


    For Grocery vs Transportation :

    In the scatter plot given below, the points tend to follow a linear pattern: the cost of Grocery tends to increase with the cost of transportation. However, we suggest a cautious regression before claiming a relationship between these variables, since there are some points which go clearly against the pattern.

    For Grocery vs Healthcare:

    The scatter plot for Grocery vs Healthcare given below does not suggest any clear pattern, since the points are scattered far from each other. We check this with the regression.


    b.

    Grocery vs Housing

    To run the regression of grocery index vs housing index, we construct a simple linear

    regression equation given below:

    Grocery = b0 + b1(housing)

    We test hypotheses about the coefficients.

    We used the command

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for housing is 0.0517 and its p-value P>|t| is 0.186. The coefficient = 0.0517 is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject the Ho.

    This means, from the observations given, we can not conclude any significant linear

    relationship between Grocery and Housing.

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient for _cons is 92.94 and its p-value P>|t| is 0.00. The coefficient 92.94 is significantly different from 0 because its p-value is lower than 0.05. So we reject the Ho.

    Treating the insignificant slope as zero, we can conclude the equation: Grocery = 92.94


    We have R-squared = 0.1860. This means the portion of the total variation in the dependent variable that is explained by variation in the independent variable is only 18.6%; 81.4% of the total variation cannot be explained by the model. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of housing.

    Grocery vs Utilities

    To run the regression of grocery index vs utilities index, we construct a simple linear regression equation given below:

    Grocery = b0 + b1(utilities)

    Command :

    We test hypothesis about the coefficient.

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for utilities is 0.1411 and its p-value P>|t| is 0.029. The coefficient = 0.1411 is significantly different from 0 because its p-value is smaller than 0.05. So we reject the Ho. This means, from the observations given, we can conclude a linear relationship between the cost of Grocery and the cost of utilities.

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient = 83.99 is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject the Ho.

    We can conclude the equation: Grocery = 83.99 + 0.1411*(utilities)

    For each one-unit increase in the cost of utilities, we predict an increase of 0.1411 units in the cost of grocery.


    We have R-squared = 0.1911. This means the portion of the total variation in the dependent variable that is explained by variation in the independent variable is only 19.11%. The regression does not have sufficient quality to claim a strong relationship between the cost of Grocery and the cost of utilities.

    Grocery vs Transportation:

    To run the regression of grocery index vs transportation index, we construct a simple

    linear regression equation given below:

    Grocery = b0 + b1(transportation)

    Command :

    We test hypothesis about the coefficient.

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for transportation is 0.1372 and its p-value P>|t| is 0.45. The coefficient = 0.1372 is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject the Ho. This means, from the observations given, we cannot conclude any linear relationship between the cost of Grocery and the cost of transportation.

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient = 84.25 is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject the Ho.


    Treating the insignificant slope as zero, we can conclude the estimated equation: Grocery = 84.25

    We have R-squared = 0.0251. This means the portion of the total variation in the dependent variable that is explained by variation in the independent variable is only 2.51%. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of transportation.

    Grocery vs Healthcare

    To run the regression of grocery index vs healthcare index, we construct a simple

    linear regression equation given below:

    Grocery = b0 + b1(healthcare)

    Command :

    We test hypothesis about the coefficient.

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for healthcare is 0.0869 and its p-value P>|t| is 0.258. The coefficient = 0.0869 is not significantly different from 0 because its p-value is greater than 0.05. So we fail to reject the Ho. This means, from the observations given, we cannot conclude any linear relationship between the cost of Grocery and the cost of healthcare.

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient = 89.44 is significantly different from 0 because its p-value P>|t| is 0.00, which is smaller than 0.05. So we reject the Ho.

    Treating the insignificant slope as zero, we can conclude the estimated equation: Grocery = 89.44

    We have R-squared = 0.0552. This means the portion of the total variation in the dependent variable that can be explained by our model is only 5.52%. The regression does not have sufficient quality to claim a relationship between the cost of Grocery and the cost of healthcare.
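    In a regression with a single regressor, the slope's t statistic and the R-squared are tied together by t^2 = (n - 2) * R-squared / (1 - R-squared), so the R-squared and the slope p-value reported for each simple regression can be cross-checked. The Python sketch below assumes n = 25 observations, as stated earlier, and uses the rounded two-sided 5% critical value 2.069 for 23 degrees of freedom; the dictionary and function names are illustrative.

```python
import math

N_OBS = 25      # number of observations stated in the text
T_CRIT = 2.069  # two-sided 5% critical value of Student's t, 23 df (rounded)

# (R-squared, slope p-value) as reported for each simple regression of Grocery.
REPORTED = {
    "housing": (0.1860, 0.186),
    "utilities": (0.1911, 0.029),
    "transportation": (0.0251, 0.45),
    "healthcare": (0.0552, 0.258),
}


def abs_t_from_r2(r2: float, n_obs: int) -> float:
    """|t| of the slope implied by R-squared in a one-regressor model."""
    return math.sqrt((n_obs - 2) * r2 / (1.0 - r2))


agrees = {}
for name, (r2, p_value) in REPORTED.items():
    abs_t = abs_t_from_r2(r2, N_OBS)
    # The two figures agree if both say "significant at 5%" or both say "not".
    agrees[name] = (abs_t > T_CRIT) == (p_value < 0.05)
    print(f"{name:<15} implied |t| = {abs_t:.2f}  reported p = {p_value}  agree: {agrees[name]}")
```

    With the figures as transcribed, the housing pair is the only one that does not agree, which suggests that either the housing R-squared or its p-value may have been copied incorrectly.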


    c. Log Grocery vs Log Housing

    In order to estimate the elasticity of the grocery index with respect to the housing index, we construct a simple linear regression equation given below:

    ln(Grocery) = b0 + b1 ln(housing)

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for ln_housing is 0.066 and its p-value P>|t| is 0.199. The coefficient

    =0.066 is not significantly different from 0 because its p-value P>|t| is 0.199, which

    is higher than 0.05. So we fail to reject the Ho.

    This means, from the observations given, we cannot conclude any significant linear relationship between Log Grocery and Log Housing.

    The housing elasticity of the grocery index is not significantly different from 0.

    We have R-squared = 0.0708. This means the portion of the total variation in the dependent variable that is explained by our regression is only 7.08%. The regression does not have sufficient quality to make a claim about the elasticity of the grocery index with respect to the housing index.

    Log Grocery vs Log Utilities

    In order to estimate the elasticity of the grocery index with respect to the utilities index, we construct a simple linear regression equation given below:

    ln(Grocery) = b0 + b1ln(Utilities)


    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for ln_utilities is 0.131 and its p-value P>|t| is 0.047. The coefficient = 0.131 is significantly different from 0 because its p-value is lower than 0.05. So we reject the Ho.

    This means, from the observations given, we can conclude a significant linear relationship between Log Grocery and Log Utilities. We predict that when the utilities index increases by 1%, the grocery index increases by about 0.131%. The estimated elasticity is 0.131.

    We have R-squared = 0.1605. This means the portion of the total variation in the dependent variable that is explained by our regression is only 16.05%. The regression does not have sufficient quality to make a strong claim about the elasticity of the grocery index with respect to the utilities index.
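    In a log-log model the slope is an elasticity: multiplying utilities by a factor r multiplies the predicted grocery index by r raised to the power b1. The Python sketch below uses the reported slope 0.131 to show that the "1% leads to about 0.131%" reading is a close approximation for small changes; the function name and the chosen percentage changes are illustrative.

```python
ELASTICITY = 0.131  # slope on ln(utilities) reported above


def pct_change_in_grocery(pct_change_in_utilities: float,
                          elasticity: float = ELASTICITY) -> float:
    """Exact predicted % change in grocery implied by ln(G) = b0 + b1 * ln(U)."""
    ratio = 1.0 + pct_change_in_utilities / 100.0
    return (ratio ** elasticity - 1.0) * 100.0


for pct in (1.0, 10.0):  # illustrative changes in the utilities index
    print(f"utilities +{pct:.0f}% -> grocery {pct_change_in_grocery(pct):+.3f}%")
```

    For a 1% change the exact figure is about 0.130%, very close to the elasticity itself; for a 10% change it is about 1.26% rather than 1.31%, so the linear reading drifts as the change grows.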

    Log Grocery vs Log Transportation

    In order to estimate the elasticity of the grocery index with respect to the transportation index, we construct a simple linear regression equation given below:

    ln(Grocery) = b0 + b1 ln(Transportation)

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for ln_transportation is 0.1297 and its p-value P>|t| is 0.481. The

    coefficient =0.1297 is not significantly different from 0 because its p-value P>|t| is

    0.481, which is higher than 0.05. So we fail to reject the Ho.

    This means, from the observations given, we cannot conclude any significant linear relationship between Log Grocery and Log Transportation.

    The transportation elasticity of the grocery index is not significantly different from 0.


    We have R-squared = 0.0218. This means the portion of the total variation in the dependent variable that is explained by our regression is only 2.18%. The regression does not have sufficient quality to make a claim about the elasticity of the grocery index with respect to the transportation index.

    Log Grocery vs Log Healthcare

    In order to estimate the elasticity of the grocery index with respect to the healthcare index, we construct a simple linear regression equation given below:

    ln(Grocery) = b0 + b1 ln(Healthcare)

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for ln_Healthcare is 0.092 and its p-value P>|t| is 0.265. The

    coefficient = 0.092 is not significantly different from 0 because its p-value P>|t| is

    0.265, which is higher than 0.05. So we fail to reject the Ho.

    This means, from the observations given, we can not conclude any significant linear

    relationship between Log Grocery and Log Healthcare.

    The healthcare elasticity of the grocery index is not significantly different from 0.

    We have R-squared = 0.0537. This means the portion of the total variation in the dependent variable that is explained by our regression is only 5.37%. The regression does not have sufficient quality to make a claim about the elasticity of the grocery index with respect to the healthcare index.


    d. Multiple regression

    To estimate the multiple linear model, we construct a multiple regression equation as below:

    Grocery= b0 + b1(housing) +b2(utilities) +b3(transportation) + b4(healthcare)

    We used command below to do the regression:

    Ho : b0 = 0 Ha : b0 ≠ 0

    The coefficient for _cons is 76.31 and its p-value P>|t| is 0.000. The coefficient = 76.31 is significantly different from 0 because its p-value is less than 0.05. So we reject the Ho. We have b0 = 76.31

    Ho : b1 = 0 Ha : b1 ≠ 0

    The coefficient for housing is 0.0859 and its p-value P>|t| is 0.109. The coefficient = 0.0859 is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject the Ho. We treat b1 as 0

    Ho : b2 = 0 Ha : b2 ≠ 0

    The coefficient for utilities is 0.1677 and its p-value P>|t| is 0.018. The coefficient = 0.1677 is significantly different from 0 because its p-value is less than 0.05. So we reject the Ho. We have b2 = 0.1677

    Ho : b3 = 0 Ha : b3 ≠ 0

    The coefficient for transportation is 0.0284 and its p-value P>|t| is 0.87. The coefficient = 0.0284 is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject the Ho. We treat b3 as 0

    Ho : b4 = 0 Ha : b4 ≠ 0

    The coefficient for healthcare is -0.0659 and its p-value P>|t| is 0.53. The coefficient = -0.0659 is not significantly different from 0 because its p-value is higher than 0.05. So we fail to reject the Ho. We treat b4 as 0


    We can conclude the multiple regression equation:

    Grocery = 76.31 + 0(housing) + 0.1677(utilities) + 0(transportation) + 0(healthcare)

    Grocery = 76.31 + 0.1677(utilities)
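    Setting the insignificant slopes to zero while keeping the intercept and the utilities slope from the full fit is a simplification: a regression re-estimated on utilities alone would generally give different numbers, as the simple regression of Grocery on utilities reported earlier does. The Python sketch below compares the prediction from all reported coefficients with the prediction from the reduced equation for a hypothetical city whose four indexes all equal 100 (an assumed input, not a row of the dataset).

```python
# Coefficients reported above for the multiple regression.
FULL = {
    "_cons": 76.31,
    "housing": 0.0859,
    "utilities": 0.1677,
    "transportation": 0.0284,
    "healthcare": -0.0659,
}
# Reduced equation from the text: insignificant slopes set to zero.
REDUCED = {"_cons": 76.31, "utilities": 0.1677}


def predict(coefs: dict, indexes: dict) -> float:
    """Intercept plus the sum of slope * index over the regressors in coefs."""
    return coefs["_cons"] + sum(
        slope * indexes[name] for name, slope in coefs.items() if name != "_cons"
    )


# Hypothetical city: every index equals 100 (illustrative assumption).
city = {"housing": 100.0, "utilities": 100.0, "transportation": 100.0, "healthcare": 100.0}

full_pred = predict(FULL, city)
reduced_pred = predict(REDUCED, city)
print(f"all coefficients: {full_pred:.2f}")
print(f"reduced equation: {reduced_pred:.2f}")
print(f"difference:       {full_pred - reduced_pred:.2f}")
```

    The two predictions differ by several index points for this input, so the reduced equation is best read as a summary of which coefficient is significant rather than as a refitted model.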

    We have R-squared = 0.3145. This means the portion of the total variation in the dependent variable (grocery) that is explained by the variation of the independent variables (housing, utilities, transportation, healthcare) in our regression is 31.45%. Although this R-squared is higher than the R-squared in the individual regressions, which suggests a better fit, the multiple regression still does not have sufficient quality to claim a strong relationship between the Grocery index and the other indexes.

    We can see that in the multiple regression we increase the number of independent variables. The more independent variables we have, the more variation of the dependent variable can be explained and the smaller the error variation is. This means R-squared cannot decrease when variables are added. A higher R-squared alone therefore does not show that the added variables improve the model; here only utilities has a significant coefficient.

    R-squared = SSR/SST
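    Because R-squared = SSR/SST = 1 - SSE/SST never decreases when a regressor is added, a common companion measure is adjusted R-squared, which penalizes each extra regressor. The Python sketch below applies the standard formula to two R-squared values reported above, assuming n = 25 observations as stated earlier; the function and dictionary names are illustrative.

```python
def adjusted_r2(r2: float, n_obs: int, n_regressors: int) -> float:
    """Adjusted R-squared: 1 - (1 - R2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_regressors - 1)


N_OBS = 25  # number of observations stated in the text

# (R-squared, number of regressors) for two of the reported models.
REPORTED_R2 = {
    "utilities only": (0.1911, 1),
    "all four indexes": (0.3145, 4),
}

adjusted = {name: adjusted_r2(r2, N_OBS, k) for name, (r2, k) in REPORTED_R2.items()}
for name, (r2, k) in REPORTED_R2.items():
    print(f"{name:<17} R2 = {r2:.4f}  adjusted R2 = {adjusted[name]:.4f}")
```

    After the adjustment the four-variable model scores about 0.177 against about 0.156 for utilities alone, a much smaller gap than the raw R-squared values suggest.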