Assignment - FINAL-1 - IBP Union


Upload: danghanh

Post on 20-Apr-2018


Page 1: Assignment - FINAL-1 - IBP Union assumptions that are continuous throughout the assignment and important for the validity of its findings. ... The software reports a P-value of 0.0001,

Statistics & Research Methods January 2013


When looking into the two datasets used, “For Sale” and “Sales”, it is important to state some basic assumptions that hold throughout the assignment and matter for the validity of its findings. We begin by treating the data in these datasets as a random sample. We can do this because boliga.dk is an independent site on which house sales are listed regardless of agent, location, price range etc. There are exceptions, as houses can be sold through other channels without being listed on this site, but we take this to be such a small percentage of houses that the dataset can still be considered representative. When making inferences we need to keep in mind that our population is only the housing market in Ballerup, Vanløse and Rødovre, as the statistical inferences are based on data for these areas only. It is also worth mentioning that the financial crisis affects both the number of sales and the prices, so houses that have not been sold are equally important to consider when testing the data and making inferences.

Question 1 Give a brief description of the variables in the two data sets.

Quantitative variables, discrete:
- Number of rooms
- Built year
- Total time for sale

Quantitative variables, continuous:
- Sales price
- Price reduction
- Asking price
- Net per month
- Square meters
- Plot sq. m.
- Price per square meter
- Change of price

Categorical variables:
- Date (split into intervals)
- Postal code
- Type of house
- First announced (split into intervals)
- Sold within 270 days
- Sold within 180 days
- Sold within 90 days
- Sold

Observations contain characteristics that can be split into different types of variables: categorical and quantitative. Quantitative variables can be further split into discrete and continuous variables.

Categorical variables are variables whose values belong to one of a set of categories. An example is a question that can be answered by ‘yes’ or ‘no’, which creates two categories depending on the answer. Another example is postal codes, which are numbers but are used as categories describing the different areas in which we observe house sales.

Quantitative variables, on the other hand, take on numerical values that describe the magnitude of the observation in question. They can be further split into discrete variables, whose possible values form a set of separate numbers (e.g. number of rooms in a house or year of construction), and continuous variables, whose possible values form an interval (e.g. asking price or price per square meter). Continuous values can be recorded with more decimals to give a more precise description of the value.

Variables are divided into these types because they are treated and analyzed differently. For quantitative variables we consider the shape, center and variability to describe the observations. For categorical variables, in contrast, we observe the frequencies and calculate proportions.

Report the mean and the median of the variable Price Reduction and discuss the results.

Price reduction is a continuous variable measured in percentages. The mean is the average value of the observations, calculated by the formula $\bar{x} = \frac{\sum x}{n}$. The median, on the other hand, is the middle value of the observations when they are ordered from smallest to largest. Using the analysis tool in JMP we compute:

Mean = 5.63
Median = 4.76

We observe that the mean is higher than the median. The reason is that the distribution of the observations is skewed to the right: a long tail of large price reductions pulls the mean upwards, while the median is barely affected by these extreme values.

We observe quite a number of outliers, which could explain the skew. By an outlier we mean an observation that falls more than 3 standard deviations away from the mean.

State an approximate confidence interval for the mean

We construct a confidence interval to find a range of plausible values for the population mean. We use a point estimate, the sample mean, and then add and subtract the margin of error.

Margin of error = z- or t-score * standard error

t-score: This number states how many standard errors the margin of error extends from the sample mean. For a 95% interval with df > 100 the t-score is approximately equal to the z-score, which for a standard normal distribution is 1.96. Otherwise it can be looked up.

Standard error: Ideally, we should use the standard deviation of the sampling distribution, $\sigma / \sqrt{n}$. However, since $\sigma$, the standard deviation of the population, is unknown to us, we use an estimate instead, the standard error $se = s / \sqrt{n}$, where s is the sample standard deviation.

Confidence interval: 5.63 ± 0.476 = (5.154, 6.106)

The Standard Deviation
Roughly the typical distance of an observation from the mean. For a bell-shaped distribution, about 99.7% of the observations fall within 3 standard deviations of the mean.
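The mean, median and confidence interval described above can be sketched in a few lines of Python. This is only an illustration: the list `sample` and the function name `mean_confidence_interval` are our own assumptions, and the numbers are hypothetical rather than the boliga.dk data. The last lines simply redo the arithmetic of the reported interval, 5.63 ± 0.476.

```python
import math
import statistics


def mean_confidence_interval(values, critical_value):
    """Return (mean, lower, upper) for the interval mean +/- critical_value * se."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divides by n - 1)
    se = s / math.sqrt(n)          # standard error of the sample mean
    margin = critical_value * se
    return mean, mean - margin, mean + margin


# Hypothetical price reductions in percent (NOT the assignment data).
sample = [1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.3, 7.9, 12.4, 21.0]

# With only 10 observations we use the t-score for df = 9 (2.262) instead of 1.96.
mean, lower, upper = mean_confidence_interval(sample, critical_value=2.262)
median = statistics.median(sample)
print(f"mean = {mean:.2f}, median = {median:.2f}")
print(f"95% CI = ({lower:.2f}, {upper:.2f})")

# Arithmetic check of the interval reported in the text: 5.63 +/- 0.476.
reported = (round(5.63 - 0.476, 3), round(5.63 + 0.476, 3))
print(reported)
```

In this made-up sample the two largest values pull the mean above the median, which is the same right-skew pattern as in the price reduction variable.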


Assumptions for creating a confidence interval
When creating a 95% confidence interval we assume the following:
• Data is obtained by randomization
• The population distribution is approximately normal, i.e. it follows a bell-shaped curve.
If the randomization assumption is violated, the data risks being biased, which makes statistical inference inappropriate. If the population distribution is not normal, the method is still robust when using a t-score. Being robust means that the method still performs adequately in spite of a violation of this assumption.

Question 2 This question concerns the comparison of two means with a quantitative response variable.

Compare Price per Sq.m in the sales data set between detached and terraced houses.

For this comparison we have a quantitative response variable, price per sq.m, and a categorical explanatory variable, type of house, with the categories terraced (group 1) and detached (group 2). Using the “Fit Y by X” analysis we get the following data:

The difference between the mean price per sq.m of terraced and detached houses is:

$\bar{x}_1 - \bar{x}_2 = 255.9$

This means that terraced houses in the sample are, on average, DKK 255.9 more expensive per sq.m than detached houses. However, the standard deviation of the price per sq.m of detached houses is greater than for terraced houses (6874.39 > 5847.82), which means that there is larger variability in the price per sq.m of detached houses. It should be noted that the sample size, n, of terraced houses is much lower than that of detached houses. A larger sample is more likely to include the extreme values in the population, but the sample standard deviation does not automatically increase with n; the smaller terraced sample simply gives a less precise estimate of both the mean and the standard deviation.

Produce 95% confidence intervals for the difference and a significance test for whether the price is the same in both groups.

The confidence interval for the difference between means assumes:
• Independent random samples from the two groups
• An approximately normal population distribution for each group
Referring to the introductory remarks, randomization is assumed to hold. We know from question 1 that when using the t-score the confidence interval is robust even if the normal distribution assumption is violated.

The 95% confidence interval for comparing two means is:

$(\bar{x}_1 - \bar{x}_2) \pm t_{.025}(se)$

We use $t_{.025}$ because it corresponds to a 95% confidence level.
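As a rough sketch of how such an interval is put together, the Python snippet below computes the difference, its standard error and the interval. The two standard deviations are the ones quoted in the text; the group means and sample sizes are hypothetical placeholders chosen by us (only their difference of 255.9 comes from the text), so the resulting interval will not match the JMP output. The critical value 1.96 assumes large samples.

```python
import math


def two_mean_interval(mean1, s1, n1, mean2, s2, n2, critical_value=1.96):
    """Return (difference, se, lower, upper) for the difference between two means."""
    diff = mean1 - mean2
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)   # unpooled standard error
    margin = critical_value * se
    return diff, se, diff - margin, diff + margin


# Standard deviations from the text; means and sample sizes are HYPOTHETICAL.
diff, se, lower, upper = two_mean_interval(
    mean1=25200.0, s1=5847.82, n1=100,    # terraced (group 1)
    mean2=24944.1, s2=6874.39, n2=400,    # detached (group 2)
)
print(f"difference = {diff:.1f}, se = {se:.1f}")
print(f"95% CI = ({lower:.1f}, {upper:.1f})")

# Consistency check on the JMP interval (-806.1, 1317.9): its midpoint
# should equal the observed difference of 255.9.
reported_midpoint = (-806.1 + 1317.9) / 2
print(round(reported_midpoint, 1))
```

The midpoint check shows that the interval reported by JMP is centred on the observed difference, as the formula requires.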


The standard error is:

$se = \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$

Confidence interval as reported in JMP = (-806.1, 1317.9)

Significance test
1. Assumptions: The assumptions hold (see footnote 1) and we will test using a significance level of 0.05, i.e. if the P-value < 0.05 we will reject H0, the null hypothesis.
2. Hypothesis: the hypothesis we wish to test is whether the mean price per sq.m is the same in both groups, H0: μ1 = μ2. We make a two-sided test, as our alternative hypothesis, Ha: μ1 ≠ μ2, is that the price is not the same and can be either higher or lower in one group than in the other.
3. Test statistic: We use the t-score as our test statistic because we are comparing two means: $t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se}$
4. P-value: This tells us the probability, assuming H0 is true, that the test statistic equals the observed test statistic or a value even more extreme. To get the P-value we either use a t-table together with the degrees of freedom or the output from the software. We use the P-value from the software, as it is more precise, and thus get a P-value of 0.64.
5. Conclusion: Since the P-value is above the significance level of 0.05 we cannot reject H0, the null hypothesis; thus the two population means may be equal to each other. This is also supported by the above 95% confidence interval for the difference between the means, which includes 0.

Also produce a significance test of whether the price per square meter differs according to Postal Code.

In this case, because we are comparing the means of several groups, we conduct a one-way ANOVA (Analysis of Variance). It is a bivariate method with a quantitative response variable, price per sq.m, and one categorical explanatory variable, postal code.
1. Assumptions
- Independent random samples: referring to our introductory remarks we expect this assumption to hold.
- We also assume normal population distributions with equal standard deviations, but the test is robust to moderate violations of this assumption.
2. Hypothesis: our H0 is that all three population means are equal to each other. We can reject H0 if the P-value is below our selected significance level of 0.05.

Ha: at least two of the means are unequal.

1 p.481, Agresti and Franklin, Statistics, the Art and Science of Learning from Data


3. Test statistic: for a one-way ANOVA test we use the F-test statistic because it expresses the ratio between the between-group and within-group variability. Evidence against H0 is stronger with a lower within-group variability and a higher between-group variability, both of which increase this ratio.

$F = \frac{\text{between-groups variability}}{\text{within-groups variability}}$

with df1 = g - 1 = 2 and df2 = N - g = 654. From the software we get an F-statistic of 78.51. This is a very high F value, which provides us with strong evidence against H0.
4. P-value: The software reports a P-value of 0.0001, which is much lower than our significance level of 0.05, again strong evidence against H0.
5. Conclusion: Because our P-value < 0.05 we can reject H0 and conclude that at least two of the population means for price per sq.m. differ between postal codes. This means that the mean price per sq.m. is higher or lower in some postal codes than in others.
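To make the F ratio concrete, the following Python sketch computes it by hand for three small groups. The function name `one_way_anova_f` and the three lists of prices (in DKK 1000 per sq.m) are hypothetical and chosen by us for illustration; they are not the sales data, and the resulting F value is unrelated to the 78.51 reported by JMP.

```python
import statistics


def one_way_anova_f(groups):
    """Return (F, df1, df2) for a one-way ANOVA on a list of groups."""
    g = len(groups)
    n_total = sum(len(group) for group in groups)
    grand_mean = sum(sum(group) for group in groups) / n_total
    group_means = [statistics.mean(group) for group in groups]

    # Between-groups sum of squares: how far the group means lie from the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Within-groups sum of squares: spread of observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df1 = g - 1
    df2 = n_total - g
    f_stat = (ss_between / df1) / (ss_within / df2)
    return f_stat, df1, df2


# HYPOTHETICAL price per sq.m (in DKK 1000) for three postal codes.
groups = [
    [21.0, 22.0, 23.0],
    [22.0, 23.0, 24.0],
    [25.0, 26.0, 27.0],
]
f_stat, df1, df2 = one_way_anova_f(groups)
print(f"F = {f_stat:.2f}, df1 = {df1}, df2 = {df2}")
```

With g = 3 groups, df1 = 2 as in the text; df2 = 654 there corresponds to N = 657 observations.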

Question 3 Is it easier to sell a detached house?

Compare Sold within 90 Days in the for sale data set with respect to Type of House using a significance test.

For this analysis we have a categorical explanatory variable (type of house) and a categorical response variable (sold within 90 days). We thus conduct a two-sided significance test for comparing two population proportions.
1. Assumptions: There are three assumptions for this test and all three are satisfied.
- We have a categorical response variable for two groups.
- We assume the samples are independent and random.
- The samples, n1 (= 414 detached) and n2 (= 110 terraced), each contain at least 5 successes and 5 failures.
2. Hypothesis: our null hypothesis is that the probability of selling a detached house within 90 days is equal to the probability of selling a terraced house within the same period of time. We can reject H0 if the P-value is below our selected significance level of 0.05.

H0: p1 = p2
Ha: p1 ≠ p2
Since it is a two-sided test, the alternative hypothesis covers both p1 < p2 and p1 > p2.
3. Test Statistic

Probability of a detached house being sold within 90 days:

Probability of a terraced house being sold within 90 days:

When conducting a two-sided significance test for comparing two population proportions, z and se0 are calculated as follows:

$z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{se_0}$, $se_0 = \sqrt{\hat{p}(1 - \hat{p})\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$

4/1/13 09.48 Comment [1]: Which means that it lies very far out in the tail, page 685


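using the pooled estimate $\hat{p}$, i.e. the total number of houses sold within 90 days in both groups divided by $n_1 + n_2$.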

4. P-value: We look up the tail probability in Table A (see footnote 2) as 0.0764 and multiply by two, as we are conducting a two-sided test, leaving us with a P-value of 0.1528. The software reports a P-value of 0.1323. The difference between the software and our calculation may be explained by the fact that we calculate with fewer decimals. Both P-values are in any case much higher than our significance level of 0.05, so they do not provide evidence against our null hypothesis and we are unable to reject it.
5. Conclusion: Because our P-value > 0.05 we cannot reject H0, and we conclude that it is plausible that the probability of selling a detached house within 90 days is equal to the probability of selling a terraced house within the same period of time.

Give 95% confidence intervals for the difference in the probability of a sale.

To calculate the confidence interval for the difference between two population proportions we use $(\hat{p}_1 - \hat{p}_2) \pm z(se)$. For a 95% confidence interval we use z-score = 1.96. The standard error is 0.0479 as calculated above.
95% confidence interval as calculated by us = (-0.163, 0.025)
We can see that this confidence interval includes zero, which supports the conclusion of our significance test, namely that the probability of selling a house within 90 days may be the same for detached and terraced houses.

Test the significance of a possible relation between Sold within 90 Days and Postal Code.

For this test we still have the categorical response variable “Sold within 90 days”, but now the explanatory variable is postal code, a categorical variable with 3 categories. We therefore conduct a chi-squared test of independence.
1. Assumptions: the following assumptions are satisfied.
- Two categorical variables
- Randomization
- Expected count is at least 5 in all cells.
2. Hypothesis: our H0 is that the two variables are independent. Thus, Ha is that they are dependent, i.e. there is an association between the two variables. H0 can be rejected when the P-value < 0.05.

3. Test Statistic

2 Appendix A, p. A-1 (Agresti and Franklin, Statistics, the Art and Science of Learning from Data)


For this test we use chi2 as our test statistic. Chi2 summarizes how far the observed cell counts in a contingency table fall from the cell counts expected under H0. In the case of independence the observed and expected cell counts are close and chi2 has a relatively small value. Chi2 is calculated as follows:

$X^2 = \sum \frac{(\text{observed count} - \text{expected count})^2}{\text{expected count}}$, where $\text{expected count} = \frac{\text{row total} \times \text{column total}}{\text{total sample size}}$

Since this calculation is quite extensive we use the chi2 reported by JMP: X2 = 9.169
4. P-value: The P-value can be found either in Table C (see footnote 3) or via the software. To find it in Table C we first need to calculate the degrees of freedom: df = (r-1)*(c-1). Here, though, we use the software to find our P-value as it is more accurate than the table, and we thereby get a P-value of 0.0102, which provides us with evidence against H0.
5. Conclusion: Because our P-value < 0.05 we can reject H0 and conclude that there is an association between the postal code and whether a house is sold within 90 days. This means that a house in Vanløse may be sold quicker than a similar house in Ballerup.
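The chi-squared calculation that we left to JMP can be sketched as follows. The contingency table `counts` is hypothetical (the actual cell counts are not reproduced in this text), and the function names are our own. For the P-value we rely on the fact that a chi-squared distribution with exactly 2 degrees of freedom has right-tail probability exp(-x/2); df = 2 is what we would expect here for a 2 x 3 table (sold yes/no by three postal codes), but it is derived by us rather than stated in the output.

```python
import math


def chi_squared_statistic(table):
    """Return (chi2, df) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence.
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(table[0]) - 1)
    return chi2, df


def p_value_df2(chi2):
    """Right-tail probability for a chi-squared statistic with exactly 2 df."""
    return math.exp(-chi2 / 2)


# HYPOTHETICAL counts: rows = sold within 90 days (yes, no), columns = postal codes.
counts = [
    [30, 45, 25],
    [120, 110, 130],
]
chi2, df = chi_squared_statistic(counts)
print(f"chi2 = {chi2:.3f}, df = {df}, P-value = {p_value_df2(chi2):.4f}")

# Check against the statistic reported by JMP in the text.
p_reported = p_value_df2(9.169)
print(f"P-value for X2 = 9.169 with df = 2: {p_reported:.4f}")
```

The last line agrees with the P-value of 0.0102 reported by JMP, which supports reading the test as having 2 degrees of freedom.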

Question 4 Fit a simple linear regression in which Price per Sq.m varies according to Sq.m.

It is worth looking at the fit of the model when applying a regression to a dataset. The fit is measured by the squared correlation (R2), which tells us how well the regression line predicts the response. Here R2 is only 0.08. Considering that 0 ≤ R2 ≤ 1, this value shows that the model has only 8% less prediction error than ȳ (the sample mean) in predicting price per sq.m. This means that the model is not particularly good, but it could be improved by including more variables.

Compute a 95% confidence interval for the regression slope and interpret the result.

The confidence interval gives a range of plausible values for the population slope β and shows whether this range includes 0. It is calculated by $b \pm t_{.025}(se)$, but can also be generated by the software.

JMP gives us a confidence interval = (-70.17, -38.53). From this we can state that the slope is negative and that a one-unit increase in sq.m is associated with a fall of between 38.53 and 70.17 DKK in the mean price per sq.m.

Extend the analysis to a multiple linear regression by further including Postal Code, Type of House, and Built Year.

3 Appendix A, A-4 (Agresti and Franklin, Statistics, the Art and Science of Learning from Data)

4/1/13 09.48 Comment [2]: Measuring Strength of Association. Now, we have found that there is an association between the response and explanatory variables. However, we do not know how strong this association is. There are several ways to measure this, e.g. by calculating the relative risk or by performing a residual analysis.
4/1/13 09.48 Comment [3]: Why are we using the “for sale” dataset? LIV'S REASONING. <3


We can already see here that our R2 value has improved to 0.37 with the addition of these variables, thus improving the model.

b. Fit a basic additive model to data and explain the most important parts of the output.

A multiple regression model can be set up as $\mu_y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots$, and from the software we have obtained the following expression for the data:

The y-intercept, α, is estimated as -7126.17; the predicted price per sq.m moves away from this baseline according to the values of the explanatory variables. We are also given several β values, which describe how the predicted response changes with the particular x when the other variables are held constant. The interpretation of these β values differs depending on whether the variable is quantitative or categorical. From the output we can also look at the fit of the model, basing our analysis on the squared correlation, R2, which is given as 0.37. This means that the multiple regression equation has 37% less prediction error than ȳ, given as 25038.87, in predicting price per sq.m.

c. Check the model assumptions.

The model assumptions for a regression model analysis are:
- The population means of y at different values of x have a straight-line relationship with x
  - This assumption states that a straight-line regression model is valid
  - This can be verified with a scatterplot
- The data were gathered using randomization.
- The population values of y at each value of x follow a normal distribution, with the same standard deviation at each x value.
The model assumptions are somewhat satisfied, but the model may be improved by removing statistically insignificant variables. In reality, a normal distribution with exactly the same standard deviation at each x value hardly ever occurs. This does not, however, imply that the model cannot be used.

Check for nonlinear effects

We check for these effects by plotting the residuals against the different explanatory variables. Sq.m, postal code and type of house show no sign of nonlinear effects, as the spread of the residuals is fairly constant for all three variables. Attention must be paid to built year, as it shows a non-constant standard deviation of the residuals. This means that we should be critical of inferences about price per sq.m based on built year.

4/1/13 09.48 Comment [4]: Read into residual analysis.


Testing for interaction

We test for interaction by crossing the variables in JMP and testing against H0: no interaction. If the P-value is above our significance level of 0.05 we cannot reject H0, and we remove the corresponding interaction (cross) terms from the model.


First we remove the crossed terms with Type of House, as they have P-values above our significance level, so we cannot reject H0 of no interaction.

Since the P-values of the crosses with Postal Code are above 0.05, we find no evidence of interaction and thus remove these crosses.

Finally we remove the remaining Built Year cross, as its P-value is above our significance level.

With this analysis we find no evidence against the model assumption that there is no interaction between the explanatory variables.

Extend the model

We do not extend the model with interaction terms, as we found no significant interaction between any of the variables.

b. Which variables are statistically significant and which might be removed from the model? Reduce the model as far as possible, and explain how to interpret the results.

To reduce the model we again assess the P-values. When the P-value of a variable is above 0.05, our significance level, we cannot reject H0: β = 0, and we remove the variable. The reason is that if β = 0 is plausible, we have no evidence that the explanatory variable has an effect on price per sq.m.

4/1/13 09.48 Comment [5]: In the case of interactions we would have needed to include the crosses to account for the specific interaction between the two variables in question.
4/1/13 09.48 Comment [6]: Look into “multiple testing issues” as stated in the assignment...


Looking at the above Parameter Estimates we see that Built Year is the only variable with a P-value above our significance level of 0.05. For this reason we remove it from the model, as we have no evidence that it has an effect on price per sq.m.

Having removed Built Year we can see that the P-value of Type of House rises above our significance level, meaning that we must now remove it from the model as well and end with this:

With this result, price per sq.m. is still the response variable, but the regression model contains only postal code and sq.m. as explanatory variables, having removed the variables that did not have a statistically significant effect on the response variable.

c. Compute the predicted sales price for a detached 110 m2 house from 1938 in 2720 Vanløse. Give an approximate 95% prediction interval.

Using the prediction equation given by JMP with the newly defined variables we create the following equation for this detached house in Vanløse:

To calculate the predicted total sales price we multiply the price per sq. m. with sq. m.:

Prediction interval

To compute the approximate 95% prediction interval around ŷ, the predicted value of y, we use the following:

$\hat{y} \pm t_{.025}(s)$

The standard error is in this case taken to be the residual standard deviation s (see footnote 4), which is the same as the root mean square error given in JMP. We can therefore calculate the interval with a t-score of 1.96 for a 95% level as:

4 p. 611 (Agresti and Franklin, Statistics, the Art and Science of Learning from Data)

4/1/13 09.48 Comment [7]: For the C. total the F-value is below 0.05 meaning that every variable has an effect on the model. (See ANOVA in JMP)


The approximate 95% prediction interval for the price per sq.m. therefore is: (21394.85; 41578.23)
When multiplying by 110 sq.m to get the sales price, the interval is: (2,353,433.81; 4,573,604.99)
This is quite a wide interval and not a very precise prediction, which is consistent with the low R2 value of 0.3648.
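The arithmetic behind these two intervals can be retraced with a short Python sketch. Since the predicted value and the root mean square error are not reproduced in this text, we back-calculate both from the reported per-sq.m interval; these back-calculated values (`y_hat`, `residual_sd`) and the function names are our assumptions, not figures taken from the JMP output.

```python
def prediction_interval(y_hat, residual_sd, critical_value=1.96):
    """Approximate prediction interval: y_hat +/- critical_value * residual_sd."""
    margin = critical_value * residual_sd
    return y_hat - margin, y_hat + margin


def scale_interval(interval, factor):
    """Multiply both endpoints of an interval by a positive factor."""
    lower, upper = interval
    return lower * factor, upper * factor


# Interval for price per sq.m as reported in the text.
reported = (21394.85, 41578.23)

# ASSUMPTION: back-calculated from the reported interval, not read from JMP.
y_hat = (reported[0] + reported[1]) / 2                 # midpoint = predicted value
residual_sd = (reported[1] - reported[0]) / 2 / 1.96    # half-width / t-score

per_sqm = prediction_interval(y_hat, residual_sd)
total = scale_interval(per_sqm, 110)                    # house of 110 sq.m

print(f"predicted price per sq.m = {y_hat:.2f}, residual sd = {residual_sd:.2f}")
print(f"interval for sales price = ({total[0]:,.2f}; {total[1]:,.2f})")
```

The scaled endpoints differ from the reported sales price interval by less than one krone, presumably because the reported figures were computed from unrounded values.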

Question 5 a. Fit a logistic model predicting Sold Within 90 Days from Postal Code.

We use a logistic model in this case because we have two categorical variables, with a binary response variable. For a logistic regression the outcome for y is a proportion between 0 and 1. The logistic regression equation is $p = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$, which for a quantitative x can be displayed as an S-shaped curve. The probability is then either constant, increasing or decreasing as x increases. Since α and β are unknown parameters the software provides estimates of these:

Compute odds ratios between the three areas and find 95% confidence intervals for the OR.

To compute an odds ratio you must first calculate the odds of the outcome in each of the two cases you want to compare. This is done with the formula:

$\text{odds} = \frac{p}{1 - p}$

To find the odds ratio (OR) we then calculate:

$OR = \frac{\text{odds}_1}{\text{odds}_2}$

This has already been done in JMP, so we have included the JMP output stating the odds ratios and 95% confidence intervals for each:

The odds ratio compares the odds of one outcome occurring in two different situations. As an illustrative example, the first stated odds ratio of 1.3 means that the odds of selling a house within 90 days are 1.3 times larger in 2720 than in 2610. The 95% confidence intervals show the range in which the population odds ratio plausibly falls, taking into consideration the standard error and the desired confidence level. In the example above, the 95% confidence interval includes values below 1, meaning that the odds of selling a house in 2610 may in fact be higher than the odds of selling in 2720; since the interval includes 1, this particular difference is not statistically significant at the 0.05 level.

4/1/13 09.48 Comment [8]: Work on how to justify this!

Compare the results to those from Question 3.

In Q3 we concluded through a chi-squared test that there is an association between selling a house within 90 days and the postal code in which the sale takes place. Our OR calculations support our answer in Q3, as they illustrate how the association plays out.

b. Extend the model to a multiple logistic regression using the predictors Postal Code, Net per Month, Sq.m, and Built Year.

We extend the model by adding net per month, sq.m, and built year to the above model as explanatory variables. We thus have a binary response variable and four explanatory variables, one categorical (postal code) and three quantitative, which means we will be conducting a multiple logistic regression.

Make basic checks of the model assumptions

The model assumptions are:
- Additivity, which is satisfied by the logistic regression equation stated above.
- No interaction, which will be tested for below
- Linearity, which will be explored afterwards

As we can see, all the P-values for the crossed variables are above our significance level of 0.05, which means we find no evidence of interaction and treat the assumption of no interaction as satisfied.

Linearity


To test the linearity we begin by crossing the explanatory variables with themselves and run a chi2 test on the model. We hope not to reject H0 that the model fits the data, which would support linearity. Therefore we want a P-value above our significance level of 0.05 (see footnote 5). Postal Code is not given any values because a categorical variable cannot meaningfully be crossed with itself. If we look at the rest of the output given by JMP, we can see that the P-values for crossed Net per Month and crossed Sq.m. are above 0.05, while crossed Built Year is very close to our significance level. From this we conclude, with some hesitation due to Built Year, that the model fits the assumption of linearity, but we should be cautious in our inferences due to this potential pitfall.

Reduce the model to the significant predictors

To reduce the model to the significant predictors we begin by removing the variable with the highest P-value, which is Postal Code. Having removed this we can see that the model fits and the P-values of the remaining variables are now below our significance level.

With this we can conclude that Net per Month, Sq.m. and Built Year are the explanatory variables that have a statistically significant influence on whether or not a house is sold within 90 days. This seemingly contradicts our finding in the last part of question 3, where we concluded that there is dependence between Postal Code and Sold Within 90 Days. This can be explained by the fact that this model includes more variables, which reduce the importance of Postal Code in predicting sales time once they are accounted for.

c. Explain the interpretation of the parameters in the reduced model.

We can begin by taking a look at the multiple logistic regression equation, which looks as follows:

$p = \frac{e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}{1 + e^{\alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3}}$

The estimates given in JMP are the α and the various β values used in the above logistic regression equation. In the JMP output the intercept is equal to α, while Net Per Month, Sq.m. and Built Year each have a separate β value. We are also given the standard error, which is used to calculate the confidence interval for each parameter, shown by JMP to the right of the estimates. The chi2 value for each parameter is the test statistic for H0: β = 0, with a lower chi2 giving less evidence of an effect. If you take the square root of the chi2 you end up with the absolute value of the z-score. Last is the P-value, which is used to determine whether or not we can reject H0 that the explanatory variable has no effect on the response.

5 http://www-hsc.usc.edu/~eckel/biostat2/notes/notes16.pdf

4/1/13 09.48 Comment [9]: So what if it’s not linear?


Compute the predicted probability of a sale within 90 days of the house in Question 4(d) financed with a net payment of 16000 kr per month. To compute the predicted probability we insert the parameter values given by JMP into the logistic regression equation given above.

From this we can conclude that p = 0.382, meaning that the predicted probability of selling this house within 90 days is approximately 0.38.
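How a predicted probability follows from the logistic regression equation can be sketched in Python as below. The parameter estimates from JMP are not reproduced in this text, so the intercept and coefficients in the snippet are hypothetical placeholders chosen by us, and the resulting probability is not expected to equal 0.382. Only the house characteristics (net payment 16000 kr per month, 110 sq.m, built 1938) come from the assignment.

```python
import math


def logistic_probability(intercept, coefficients, values):
    """Predicted probability p = e^L / (1 + e^L) with L = alpha + sum(beta_i * x_i)."""
    linear = intercept + sum(b * x for b, x in zip(coefficients, values))
    # 1 / (1 + e^-L) is algebraically the same as e^L / (1 + e^L).
    return 1 / (1 + math.exp(-linear))


# HYPOTHETICAL estimates for intercept, Net per Month, Sq.m and Built Year.
alpha = 2.0
betas = [-0.0001, 0.005, -0.001]

# House from the assignment: 16000 kr net per month, 110 sq.m, built in 1938.
house = [16000, 110, 1938]
p = logistic_probability(alpha, betas, house)
print(f"predicted probability with hypothetical estimates: {p:.3f}")

# Working backwards from the reported result: p = 0.382 corresponds to a
# linear predictor equal to the log-odds ln(0.382 / 0.618).
log_odds = math.log(0.382 / (1 - 0.382))
p_check = logistic_probability(log_odds, [], [])
print(f"log-odds = {log_odds:.3f}, recovered p = {p_check:.3f}")
```

The second part shows that a probability of 0.382 corresponds to negative log-odds, i.e. the estimated linear predictor for this house lies below zero.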