assignment 2 solution s211


Department of Mathematics and Computing

University of Southern Queensland

Solution to Assignment # 2 Course No. STA2300 Data Analysis

Semester 2, 2011

Question 1 (18 marks)

(a) 4 marks

Variable of interest: The variable of interest is the number of small businesses that fail in their first year of operation.

If we assume that the small businesses are independent of one another and that each has an equal probability of failing, then the binomial model is appropriate.

The assumptions for a binomial model are:

S = Success/failure condition: a success is that a small business fails in its first year of operation; a failure is that a small business does not fail in its first year of operation.

P = Probability of success: the probability of a small business failing in its first year of operation is 10% (thus p = 0.10).

I = Independence: it is assumed that the small businesses are independent of one another.

N = Number of trials: a fixed number of small businesses is needed. For parts (b) and (c) this is 12; for parts (d) and (e) this is 200.

(b) 2 marks

Let X represent the number of small businesses, out of 12, that fail in their first year of operation. We will assume that X follows a binomial model with parameters n=12 and p=0.10. We are interested in 2 or fewer failing, thus X = 0, 1, 2.

Using Table B in the Study Book:

$$P(X \le 2) = P(X = 0) + P(X = 1) + P(X = 2) = 0.2824 + 0.3766 + 0.2301 = 0.8891$$

We estimate the probability that 2 or fewer (X ≤ 2) of the next 12 small businesses fail in their first year of operation to be 0.8891 = 88.91%.
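
The Table B lookup above can be cross-checked by evaluating the binomial formula directly. The following is a minimal sketch, not part of the original solution; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n, p = 12, 0.10  # values given in the question

# P(X = 0), P(X = 1), P(X = 2), the three Table B entries used above
terms = [binomial_pmf(k, n, p) for k in range(3)]
prob_at_most_2 = sum(terms)

print([round(t, 4) for t in terms])  # individual probabilities
print(round(prob_at_most_2, 4))      # P(X <= 2)
```

Rounded to four decimal places, the three terms and their sum agree with the table values quoted in the solution.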

(c) 2 marks From above we know:

p=0.10

n=12

The formula for the mean of a binomial model is thus:

µ = np = 12 x 0.10 = 1.2 small businesses

1.2 small businesses are expected to fail in their first year of operation.

(d) 4 marks

We now assume we have 200 small businesses. Let X represent the number of small

businesses, out of 200, that fail in their first year of operation. We will assume that X

follows a binomial model with parameters n=200 and p=0.10. As n is large we will not

be able to use Table B and so will use the normal approximation to the binomial instead.

Calculate the mean and standard deviation of this new sample:

Mean = µ = np = 200 x 0.10 = 20 small businesses

Standard deviation:

$$\sigma = \sqrt{np(1-p)} = \sqrt{200 \times 0.10 \times (1 - 0.10)} = \sqrt{18} = 4.24 \text{ small businesses}$$
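
As a quick check of the arithmetic in parts (c) and (d), the binomial mean and standard deviation can be computed for both sample sizes. This is only a sketch; the helper name `binomial_mean_sd` is an assumption made for illustration.

```python
from math import sqrt


def binomial_mean_sd(n: int, p: float) -> tuple[float, float]:
    """Return (mean, standard deviation) of a binomial model: np and sqrt(np(1-p))."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return mean, sd


mean_12, sd_12 = binomial_mean_sd(12, 0.10)     # part (c): n = 12
mean_200, sd_200 = binomial_mean_sd(200, 0.10)  # part (d): n = 200

print(round(mean_12, 2))                       # expected failures out of 12
print(round(mean_200, 2), round(sd_200, 2))    # expected failures and sd out of 200
```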

(e) 4 marks

To use the normal approximation we need to assume that the 200 small businesses make up less than 10% of the entire population. We also need to check the rules of thumb:

np = 200 x 0.10 = 20 > 10 and

n(1-p) = 200 x (1-0.10) = 180 > 10. So the data pass the rules of thumb.

The mean and standard deviation of this new sample were calculated in part (d):

Mean = µ = np = 20

Standard deviation = σ = 4.24

Using z-scores and Table Z, calculate the probability that 15 or fewer will fail:

$$P(X \le 15) \approx P(y \le 15) = P\left(Z \le \frac{15 - 20}{4.24}\right) = P(Z \le -1.18) = 0.1190$$

We estimate the probability that 15 or fewer of the 200 small businesses will fail in their first year of operation to be approximately 11.90%.
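
The Table Z lookup can be reproduced with the standard normal distribution from Python's standard library. This sketch follows the worked solution in rounding the z-score to two decimal places and in not applying a continuity correction; the variable names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

n, p = 200, 0.10
mu = n * p                      # 20
sigma = sqrt(n * p * (1 - p))   # about 4.24

# Table-style calculation: round z to two decimals, then look up the area below z.
z = round((15 - mu) / sigma, 2)
prob_table = NormalDist().cdf(z)

# Same calculation without rounding z, for comparison.
prob_unrounded = NormalDist(mu, sigma).cdf(15)

print(z, round(prob_table, 4))
print(round(prob_unrounded, 4))
```

The table-style value matches the 0.1190 quoted above; the unrounded value differs only slightly, in the fourth decimal place.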

(f) 2 marks

Firstly, the assumptions of the binomial model need to be satisfied:

S = Success/failure condition: a success is that a small business fails in its first year of operation; a failure is that a small business does not fail in its first year of operation.

P = Probability of success: the probability of a small business failing in its first year of operation is 10% (thus p = 0.10).

I = Independence: it is assumed that the small businesses are independent of one another.

N = Number of trials: a fixed number of small businesses is needed. For parts (d) and (e) this is 200.

Secondly, the rules of thumb need to be satisfied:

np = 200 x 0.10 = 20 > 10 and

nq = n(1-p) = 200 x (1-0.10) = 180 > 10. So the data pass the rules of thumb.
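
The rules of thumb can be expressed as a small check. The sketch below is an illustration, not part of the original solution; the function name and the "at least 10" threshold are assumptions (the solution writes the comparison as > 10, which gives the same result for these values).

```python
def normal_approximation_check(n: int, p: float, threshold: float = 10):
    """Return (np, n(1-p), ok) where ok is True if both counts reach the threshold."""
    expected_successes = n * p
    expected_failures = n * (1 - p)
    ok = expected_successes >= threshold and expected_failures >= threshold
    return expected_successes, expected_failures, ok


np_200, nq_200, ok_200 = normal_approximation_check(200, 0.10)
np_12, nq_12, ok_12 = normal_approximation_check(12, 0.10)

print(round(np_200, 1), round(nq_200, 1), ok_200)  # n = 200 passes
print(round(np_12, 1), round(nq_12, 1), ok_12)     # n = 12 does not
```

With n = 12 the expected number of failures is only 1.2, which is why parts (b) and (c) rely on Table B rather than the normal approximation.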

Question 2 (12 marks)

(a) 2 marks

Variable of interest – Speed of cars

Unit of measurement – Speed is measured in km/h

(b) 5 marks

Let y denote a car's speed (km/h). The proportion of cars for which y is greater than 110 is calculated using z-scores (standardising/forward method):

$$P(y > 110) = P\left(z > \frac{110 - 105}{4}\right) = P(z > 1.25)$$

We wish to use Table Z to find this proportion, but remember that Table Z shows the proportion below z. So first we need to rewrite the proportion we are looking for (an adjustment is needed because the table gives the area below z, whereas we want the area above):

$$P(Z > 1.25) = 1 - P(Z < 1.25)$$

Now using Table Z we find

$$P(Z > 1.25) = 1 - 0.8944 = 0.1056$$

The percentage of cars expected to obtain a speeding ticket is 10.56%.

NOTE: Some students may use a speed of 111 or more (reasoning that a car travelling at 110 is doing the limit rather than speeding). They will get full marks if they do this. In that case they will find a z-score of 1.50 and a final answer of (1-0.9332) = 0.0668 = 6.68%.
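
Both versions of this answer can be checked with the standard library's normal distribution in place of Table Z. This is a minimal sketch using the mean of 105 km/h and standard deviation of 4 km/h from the working above; the variable names are illustrative.

```python
from statistics import NormalDist

speed = NormalDist(mu=105, sigma=4)  # model for car speed in km/h

# Area above 110 km/h is one minus the area below it.
p_over_110 = 1 - speed.cdf(110)

# Alternative accepted reading: 111 km/h or more counts as speeding.
p_111_or_more = 1 - speed.cdf(111)

print(round(p_over_110, 4))
print(round(p_111_or_more, 4))
```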

(c) 5 marks

The question asks us to find the speed y (unstandardising) at or below which a car would be considered a nuisance driver:

$$P(Y \le y) = 0.06$$

Converting to z-scores, that is,

$$P\left(z \le \frac{y - 105}{4}\right) = 0.0600$$

Using Table Z we can see that $P(z \le -1.56) = 0.0594$, which is pretty close to the value we are looking for (you could also use $P(z \le -1.55) = 0.0606$ or $P(z \le -1.555) = 0.0600$).

To find the y which produces a z value of -1.56 we need to solve

$$\frac{y - 105}{4} = -1.56$$

That is, $y - 105 = -1.56 \times 4$, or $y = -1.56 \times 4 + 105$, or $y = 98.76$.

Similarly, you can simply use the unstandardising formula:

$$y = z\sigma + \mu = -1.56 \times 4 + 105 = 98.76$$

To be classified as a nuisance driver a car needs to be travelling at 98.76 km/h or less.

If you had used z = -1.55 then you would have obtained y = 98.80.

If you had used z = -1.555 then you would have obtained y = 98.78.
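
The backward (unstandardising) calculation can be sketched with the inverse normal function from the standard library. The table-based answer is reproduced alongside a value computed without rounding z; the variable names are chosen for illustration.

```python
from statistics import NormalDist

mu, sigma = 105, 4  # mean and standard deviation of car speed (km/h)

# Table-based answer: take z = -1.56 from Table Z and unstandardise.
y_table = -1.56 * sigma + mu

# Software answer: invert the normal distribution at the 6% point directly.
z_exact = NormalDist().inv_cdf(0.06)
y_exact = NormalDist(mu, sigma).inv_cdf(0.06)

print(round(y_table, 2))
print(round(z_exact, 3), round(y_exact, 2))
```

The unrounded z lies between -1.56 and -1.55, so the software answer falls between the two table-based answers accepted above.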

Question 3 (14 marks)

(a) 4 marks

The relationship between the number of cylinders and origin of these cars

                               Origin of Car
Number of cylinders     America   European   Japanese    Total
4                            65         58         67      190
6                            64          4          6       74
8                           101          0          0      101
Total                       230         62         73      365

(b) 2 marks 67/190 = 35.26%

(c) 2 marks 67/365 = 18.36%

(d) 6 marks

Origin and the number of cylinders of these cars do appear to be associated. From the table below (see the America column of the row percentages) it can be seen that 100% of 8-cylinder cars are from America, whereas only 34.2% of 4-cylinder cars are from America. The large difference in percentages indicates that origin and the number of cylinders of these cars may be associated (the origin of a car appears to depend on how many cylinders it has, or vice versa).

Conditional distribution table showing the relationship between origin and number of cylinders

                                                        Origin of Car
Number of cylinders                              America   European   Japanese    Total
4       Count                                         65         58         67      190
        % within Number of cylinders               34.2%      30.5%      35.3%   100.0%
6       Count                                         64          4          6       74
        % within Number of cylinders               86.5%       5.4%       8.1%   100.0%
8       Count                                        101          0          0      101
        % within Number of cylinders              100.0%        .0%        .0%   100.0%
Total   Count                                        230         62         73      365
        % within Number of cylinders               63.0%      17.0%      20.0%   100.0%

                                                        Origin of Car
Number of cylinders                              America   European   Japanese    Total
4       Count                                         65         58         67      190
        % within Origin of Car                     28.3%      93.5%      91.8%    52.1%
6       Count                                         64          4          6       74
        % within Origin of Car                     27.8%       6.5%       8.2%    20.3%
8       Count                                        101          0          0      101
        % within Origin of Car                     43.9%        .0%        .0%    27.7%
Total   Count                                        230         62         73      365
        % within Origin of Car                    100.0%     100.0%     100.0%   100.0%
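
The two conditional distribution tables can be rebuilt from the raw counts, which also makes clear which total each percentage is taken from. The sketch below is illustrative only; the function names are assumptions and the counts are copied from the two-way table in part (a).

```python
# Counts from the two-way table: keys are number of cylinders, then origin.
counts = {
    4: {"America": 65, "European": 58, "Japanese": 67},
    6: {"America": 64, "European": 4, "Japanese": 6},
    8: {"America": 101, "European": 0, "Japanese": 0},
}
origins = ["America", "European", "Japanese"]


def row_percentages(table):
    """Percentage of each origin within a cylinder group (each row sums to 100)."""
    result = {}
    for cyl, row in table.items():
        total = sum(row.values())
        result[cyl] = {o: round(100 * c / total, 1) for o, c in row.items()}
    return result


def column_percentages(table):
    """Percentage of each cylinder group within an origin (each column sums to 100)."""
    col_totals = {o: sum(row[o] for row in table.values()) for o in origins}
    return {
        cyl: {o: round(100 * row[o] / col_totals[o], 1) for o in origins}
        for cyl, row in table.items()
    }


row_pct = row_percentages(counts)
col_pct = column_percentages(counts)
grand_total = sum(sum(row.values()) for row in counts.values())

print(row_pct[8]["America"], row_pct[4]["America"])  # the comparison used in part (d)
print(round(100 * 67 / 190, 2))          # part (b): divides by the 4-cylinder row total
print(round(100 * 67 / grand_total, 2))  # part (c): divides by the grand total
```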

Question 4 (20 marks)

(a) 4 marks

Time to accelerate is a quantitative variable, so an appropriate graph is a histogram. There

is too much data for a stem and leaf plot; a boxplot is for comparing a quantitative

variable across various groups. The histogram appears below. (To see any useful

information, it may be necessary to adjust the scale and limits of the histogram).

If students change the bin size then the following graph will be produced. Either graph is

acceptable.

(b) 4 marks

The shape of the distribution of time is slightly skewed to the right with one mode at approximately 14-15 seconds (unimodal). (The graph could also be said to be approximately symmetric.)

The centre is between 14 and 15 seconds.

The spread is from about 11 seconds to 25 seconds (if students said there were outliers then this would be 11-22 seconds).

There do not appear to be any outliers (if students produced the first graph and said there are outliers, this is acceptable).

(c) 4 marks The mean time is 16.55 seconds

The standard deviation is 2.37 seconds

(d) 4 marks The median time is 16.05 seconds

The IQR is 3.2 seconds

(e) 4 marks As the distribution is slightly skewed, it would be best to use the median and IQR as the centre and spread for this distribution, as skewness affects both the mean and the standard deviation.

If students said in part (b) that the distribution was symmetric then they should report the centre and spread as the mean and standard deviation.

Students are not to include SPSS output in parts (c) and (d) BUT they may include it as an appendix.

Time to accelerate from 0 to 100km/h (seconds)

                                                   Statistic   Std. Error
Mean                                                  16.547        .1720
95% Confidence Interval for Mean   Lower Bound        16.208
                                   Upper Bound        16.886
5% Trimmed Mean                                       16.418
Median                                                16.050
Variance                                               5.622
Std. Deviation                                        2.3711
Minimum                                                 11.6
Maximum                                                 24.8
Range                                                   13.2
Interquartile Range                                      3.2
Skewness                                                .892         .176
Kurtosis                                                .888         .351

Question 5 (24 marks)

(a) 2 marks

Horsepower and vehicle weight are both quantitative variables.

(b) 4 marks

As both horsepower and vehicle weight are quantitative variables the most

appropriate graph to display both of these variables is a scatterplot.

(c) 4 marks

Form: Approximately Linear

Direction: Positive

Scatter: Medium

Outliers: There appears to be an outlier at vehicle weight 1400kg with a horsepower

of 225hp. This car deviates away from the rest of the linear form of the data and thus

is classified as an outlier.

(d) 4 marks

The most appropriate statistic to measure both strength and direction is the correlation coefficient, r, since both variables are quantitative and the form of the scatterplot is approximately linear. Using SPSS the correlation coefficient is found to be 0.836. Students should NOT report R² here as this statistic measures strength ONLY.

Model Summary

Model      R       R Square   Adjusted R Square   Std. Error of the Estimate
1          .836a   .699       .698                22.201

a. Predictors: (Constant), Vehicle weight (in kilograms)
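
The link between the two statistics in the Model Summary can be illustrated with a short calculation. This sketch simply squares the reported r; it shows why R² carries no information about direction, since a negative correlation of the same size would give the same R².

```python
r = 0.836  # correlation coefficient reported by SPSS

r_squared = r ** 2
r_squared_if_negative = (-r) ** 2  # a negative r of the same size gives the same R²

print(round(r_squared, 3))
print(r_squared == r_squared_if_negative)
```

The squared value agrees with the R Square column of the table to three decimal places.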

(e) 6 marks

The regression line to predict horsepower (HP) from vehicle weight (kg) is:

$$\hat{y} = -21.839 + 0.092x$$

Where:

y is horsepower (HP)

x is vehicle weight (kg)

Coefficients

                                     Unstandardized Coefficients   Standardized Coefficients
Model                                B           Std. Error        Beta                          t         Sig.
1   (Constant)                       -21.839     6.345                                           -3.442    .001
    Vehicle weight (in kilograms)    .092        .004              .836                          23.015    .000

Students may have included the regression line in the graph on part (b) and this is fine and

will receive full marks.

(f) 4 marks

A car weighing 3000kg would have a predicted horsepower of 254.16HP:

$$\hat{y} = -21.839 + 0.092x = -21.839 + 0.092 \times 3000 = 254.161$$

It is difficult to tell from the above graph if this would be an accurate prediction or

not. We need to extrapolate beyond the data to find the horsepower required for a

3000kg car and we cannot say with certainty that the graph will continue in a linear

form beyond the data given.
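
The prediction can be reproduced by coding the fitted line as a function. This is a minimal sketch using the rounded coefficients from the SPSS output; the function name `predict_horsepower` is hypothetical, and the code does nothing to guard against extrapolation, so the caution above still applies.

```python
def predict_horsepower(weight_kg: float,
                       intercept: float = -21.839,
                       slope: float = 0.092) -> float:
    """Predicted horsepower from the fitted line y-hat = intercept + slope * weight."""
    return intercept + slope * weight_kg


predicted = predict_horsepower(3000)
print(round(predicted, 3))  # predicted horsepower for a 3000 kg car
```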

Question 6 (12 marks)

(a) 3 marks

This is an observational study. In order to be an experiment, treatments (in this case the treatment is where a person lives) have to be imposed (i.e. you have to be told which treatment you are going to receive). Ethically you can't tell a person where to live, and thus this is an observational study.

(b) 3 marks

Explanatory Variable – location of residence

Response Variable – incidence of premature death from heart disease and circulatory

disease.

OR

Parameter of interest - incidence of premature death from heart disease and

circulatory disease.

(c) 3 marks

In order to be a lurking variable it must affect both the explanatory and response

variables. An example of a lurking variable in this case may be a person’s job.

Link to explanatory: As a person’s job is located in Ipswich they have to move to this

area.

Link to response: It may be the person’s job that is causing them to be sick and not in

fact where they live.

(d) 3 marks

In order to say that an explanatory variable (location of residence) causes some response in the response variable (premature death), a well-designed EXPERIMENT is needed. For an observational study you can only say that there appears to be a relationship between the two variables of interest.

Key to errors

The marker of your assignment may have used some of the following key words to describe

errors in your assignment answers.

Define Variables: Symbols without some way of knowing what they refer to are

meaningless.

Title: Tables, like graphs, should be given contextual titles. Your assignment is a report and

your submission of your assignment is publication of this report so each graph and table

should be easy to read by anyone.

Label: Columns and rows of tables and axes of graphs should be given meaningful labels,

describing exactly what you have used in your table or graph. The numbers such as 0 and 1

from the data are not meaningful unless you give values to these numbers in the variable view

of the data file (see SPSS Exercise 4). Also in the variable view the variables need to have the

correct measure selected. See Assignment 1 solutions for comments about this.

Units: Units (such as cms) should be attached to numerical results or included with labels

where appropriate.

Printout: Raw output alone from computer software typically contains lots of information,

some of which may be relevant to the required answer. The marker will not decide for you

what is relevant and what is not. You must indicate this yourself, preferably by transcribing

the appropriate information into your answer.

Error: Your method and formula are correct but you have made an arithmetic error.

Rounding: As a general rule, final answers should be rounded to one or two more significant

figures than the original data. Too severe rounding, incorrect rounding or ridiculous precision

in final answers will be penalised. Rounding however should only occur at the final answer.

For intermediate calculations use whatever precision your calculator provides. This was

highlighted in Assignment 1. (These procedures were followed in the worked solutions even

though in some places intermediate steps may only show numbers rounded to four decimal

places.)

Working: Incorrect numerical answers without working receive no marks. Incorrect

numerical answers with working may attract part marks. Working includes appropriate

formulae, values of relevant statistics or variables, intermediate calculations, statement of

assumptions, etc.

Consequential Marking: If you make an error in one part of a question which impacts

further parts of the question, consequential marking is used where possible.

In Context: You need to finish off each question by putting your answer into context (in relation to the question asked). Often this just requires a sentence at the end of the question stating what your answer means.