

CHAPTER 4

Bivariate data

What is a scatterplot, how is it constructed and what does it tell us?

What is the q-correlation coefficient, how is it calculated and what does it tell us?

How do we fit a straight line to a scatterplot by eye?

How do we fit a straight line to a scatterplot using the two-mean method?

How do we interpret the intercept and slope of a line fitted to a scatterplot?

How do we use a line fitted to a scatterplot to make predictions?

What is the difference between interpolation and extrapolation?

In Chapter 1, ‘Univariate data’, you learned about the statistical methods we use to analyse

data recorded about a single variable, such as a person’s weight. In this chapter, you will learn

about the statistical methods used to analyse data recorded about two related variables, such as

a person’s weight and height. Such data is called bivariate data (two-variable data).

When we analyse bivariate data, we are interested in how the two variables relate to each

other. We try to answer questions such as: ‘Is there a relationship between these two

variables?’ and ‘Does knowing the value of one of the variables tell us anything about the

value of the second variable?’

For example, let us take as our two variables the mark a student obtained on a test and the

amount of time they spent studying for that test. Since the amount of time spent studying may

affect the mark obtained, we distinguish between the two variables by calling the time spent

studying the independent variable (IV) and the mark obtained the dependent variable (DV).

4.1 Displaying bivariate data

Scatterplots

The first step in investigating the relationship between two numerical variables is to construct a

scatterplot.

We will illustrate the process by constructing a scatterplot to display the marks students

obtained on an examination (the DV) against the times they spent studying for the examination

(the IV).

140Cambridge University Press • Uncorrected Sample Pages • 978-0-521-74049-4 2008 © Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard


Student 1 2 3 4 5 6 7 8 9 10

Time (hours) 4 36 23 19 1 11 18 13 18 8

Mark (%) 41 87 67 62 23 52 61 43 65 52

In a scatterplot, each point represents a single case, in this instance a student.

When constructing a scatterplot, it is conventional to use the vertical or y-axis for the

dependent variable (DV) and the horizontal or x-axis for the independent variable (IV). This

will become very important when we come to fitting lines to scatterplots later in the chapter.

The horizontal or x-coordinate of the point represents the time spent studying (the IV).

The vertical or y-coordinate represents the mark obtained (the DV).

The scatterplot below shows the point for Student 1, who studied 4 hours for the examination

and obtained a mark of 41.

[Figure: scatterplot of Mark (%) against Time (hours), showing the single point for Student 1 at (4, 41)]

The scatterplot is completed by plotting the points for each remaining student, as shown

below.

[Figure: scatterplot of Mark (%) against Time (hours), showing the points for all 10 students]
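The construction described above can also be sketched in a few lines of Python. This is only an illustration, not part of the calculator instructions in this chapter: the helper `text_scatter` is a hypothetical name, and it draws a crude text grid rather than a proper graph. It pairs each student's time (the IV, on the horizontal axis) with their mark (the DV, on the vertical axis).

```python
# Data from the table above: one entry per student.
times = [4, 36, 23, 19, 1, 11, 18, 13, 18, 8]     # IV: time (hours), horizontal axis
marks = [41, 87, 67, 62, 23, 52, 61, 43, 65, 52]  # DV: mark (%), vertical axis

# Each student becomes one point (x, y) = (time, mark).
points = list(zip(times, marks))


def text_scatter(points, x_max=40, y_max=90, y_step=10):
    """Return a crude text scatterplot (hypothetical helper).

    Each row covers a band of y_step marks; each column is one hour.
    Points that fall in the same cell share a single '*'.
    """
    rows = []
    for top in range(y_max, 0, -y_step):
        cells = [" "] * (x_max + 1)
        for x, y in points:
            if top - y_step < y <= top:
                cells[x] = "*"
        rows.append(f"{top:>3} |" + "".join(cells))
    return "\n".join(rows)


print(points[0])             # the point for Student 1: (4, 41)
print(text_scatter(points))
```

The grid is far coarser than a calculator plot, but it shows the same convention: the DV runs up the page and the IV runs across it.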



142 Essential Standard General Mathematics

For example, in the scatterplot opposite, the

advertised prices of 12 second-hand cars are

plotted against the cars’ ages (in years).

In this relationship, the car’s price is clearly

the dependent variable (DV) as it depends on

its age, so price is plotted on the vertical

axis. Age, the independent variable (IV), is

plotted on the horizontal axis.

[Figure: scatterplot of Price ($’000) against Age (years) for 12 second-hand cars]

Using a graphics calculator to construct a scatterplot

While you need to understand the principles of constructing a scatterplot, and maybe to construct one by hand for a few points, in practice you will use a graphics calculator to complete this task.

How to construct a scatterplot using the TI-Nspire CAS

The data below give the marks that students obtained on an examination and the times

they spent studying for the examination.

Time (hours) 4 36 23 19 1 11 18 13 18 8

Mark (%) 41 87 67 62 23 52 61 43 65 52

Use a graphics calculator to construct a scatterplot. Treat time as the independent

(x) variable.

Steps

1 Start a new document (by pressing / + N) and select 3:Add Lists & Spreadsheet. Enter the data into lists named time and mark.

2 Statistical graphing is done through the Data & Statistics application. Press and select 5:Data & Statistics.

Note: A random display of dots will appear – this is to indicate list data are available for plotting. It is not a statistical plot.





3 To construct a scatterplot

a Move the cursor to the textbox area below the

horizontal (or x-) axis. Press when

prompted and select the variable time (i.e. the

independent variable). Press enter to paste the

variable onto that axis.

b Move the cursor towards the centre of the

vertical (or y-) axis until a textbox appears.

Press when prompted to select the variable

mark.

c Finally, press enter to paste the variable mark

onto that axis and generate the required

scatterplot, which is shown opposite. The plot

is automatically scaled.

How to construct a scatterplot using the ClassPad

The data below give the marks that students obtained on an examination and the

times they spent studying for the examination.

Time (hours) 4 36 23 19 1 11 18 13 18 8

Mark (%) 41 87 67 62 23 52 61 43 65 52

Use a graphics calculator to construct a scatterplot. Treat time as the independent

(x) variable.

Steps

1 Open the Statistics application and

enter the coordinate values into

lists named time and mark, as

shown.

2 Tap from the toolbar to open

the Set StatGraphs dialog box.





Complete the dialog box as given below.

For Draw: select On

For Type: select Scatter ( )

For XList: select main \ time ( )

For YList: select main \ mark ( )

Leave Freq: as 1

Leave Mark: as square

Tap h to confirm your selections.

3 Tapping from the toolbar at

the top of the screen

automatically plots a scaled

graph in the lower-half of the

screen.

Tapping the icon will give a

full-screen sized graph. Tap

again to return to a half-screen.

4 Tapping from the toolbar

places a marker on the first data

point (xc = 4, yc = 41).

Use the horizontal cursor arrow

( ) to move from point to

point.

Exercise 4A

1 Height, x 190 183 176 178 185 165 185 163

Weight, y 77 73 70 65 65 65 74 54

The table above shows the heights and weights of eight people. Use a graphics calculator to

construct a scatterplot with height as the IV (i.e. x variable).

2 Wife’s age 26 29 27 21 23 31 27 20 22 17 22

Husband’s age 29 43 33 22 27 36 26 25 26 21 24

The table above shows the ages at marriage of 11 couples. Use a graphics calculator to

construct a scatterplot with wife’s age as the independent variable.





3 Number of seats 405 296 288 258 240 193 188 148

Airspeed (km/h) 830 797 774 736 757 765 760 718

The table above shows the numbers of seats and airspeeds of eight passenger aircraft. Use a

graphics calculator to construct a scatterplot with number of seats as the independent

variable.

4 Drug dosage (mg) 0.5 1.2 4.0 5.3 2.6 3.7 5.1 1.7 0.3 0.6

Response time (min) 65 35 15 10 22 16 10 18 70 50

The table above shows the response times of 10 patients

given a pain relief drug, and the drug dosages. Use a

graphics calculator to construct a scatterplot using

drug dosage as the independent variable.

5 Time (min) 0 5 10 15 20 25

Number in cinema 87 102 118 123 135 137

The table above shows the numbers of people in a cinema at 5-minute intervals after the

advertisements started. Use a graphics calculator to construct an appropriate scatterplot.

4.2 How to interpret a scatterplot

What features do we look for in a scatterplot that will help us to identify and describe any

relationships present?

Presence of a relationship

First we look to see if there is a clear pattern in the scatterplot.

In the example opposite, there is no clear pattern

in the points. The points are randomly scattered across

the plot, so we conclude that there is no relationship.

For the three examples below, there is a clear (but different) pattern in each set of points, so

we conclude that there is a relationship in each case.

[Figure: three scatterplots of y against x, each showing a clear but different pattern]

Having found a clear pattern, there are two main things we look for in the pattern of points:

direction and outliers (if any)

strength of the relationship (amount of scatter).





Direction and outliers

This scatterplot of calf diameter against age

of a group of people is just a random scatter

of points. This suggests that there is no

relationship between the variables calf

diameter and age for this group of

people. However, there is an outlier, the

person with a calf diameter of 22 cm.

[Figure: scatterplot of Calf diameter (cm) against Age (years)]

In contrast, there is a clear pattern in this scatterplot

of the mark students obtained in an exam and the

time they spent studying for the exam.

The two variables, mark and time, are related.

Furthermore, the points seem to drift upwards

from left to right. When this happens, we say

that there is a positive relationship between

the variables. People who spend more time

studying tend to get higher marks, and

vice versa.

In this scatterplot there are no outliers.

[Figure: scatterplot of Mark (%) against Time (hours)]

Likewise, this scatterplot of the price against

age of a number of second-hand cars shows

a clear pattern. The two variables are

related. However, in this case the points seem

to drift downwards from left to right. When

this happens, we say that there is a negative

relationship between the variables. Older

second-hand cars tend to have a lower

price than newer second-hand cars.

In this scatterplot there are no outliers.

[Figure: scatterplot of Price ($’000) against Age (years)]

Strength of a relationship (scatter)

The strength of a relationship is measured by how much scatter there is in a scatterplot.





Strong relationship

When there is a strong relationship between the variables, the points will tend to follow a

single stream. A pattern is clearly seen. There is only a small amount of scatter in the plot.

Strong positive relationship Strong positive relationship Strong negative relationship

Moderate relationship

As the amount of scatter in the plot increases, the pattern becomes less clear. This indicates

that the relationship is less strong. In the examples below, we might say that there is a

moderate relationship between the variables.

Moderate positive relationship Moderate positive relationship Moderate negative relationship

Weak relationship

As the amount of scatter increases further, the pattern becomes even less clear. This indicates

that any relationship between the variables is weak. The scatterplots below are examples of

weak relationships between the variables.

Weak positive relationship Weak positive relationship Weak negative relationship





No relationship

Finally, when all we have is scatter, as seen in the scatterplots below, no pattern can be seen. In

this situation we say that there is no relationship between the variables.

No relationship No relationship No relationship

These scatterplots should help you to get a feel for the strength of a relationship, as indicated

by the amount of scatter in a scatterplot. Later in this chapter, you will learn to calculate its

value using the idea of q-correlation. At the moment, you only need be able to estimate the

strength of a relationship as strong, moderate, weak or none, by comparing it with the standard

scatterplots given above.

Exercise 4B

1 For each of the following pairs of variables, indicate whether you expect a relationship to

exist between the variables and, if so, whether you would expect the variables to be

positively or negatively related.

a Fitness level and amount of daily exercise b Foot length and height

c Comfort level and temperature above 30°C d Foot length and intelligence

e Time taken to get to school and distance travelled

f Weight of an ice cube and surrounding temperature

2 For each of the following scatterplots:

i state whether the variables appear to be related and note any possible outliers.

If the variables appear to be related:

ii state whether the relationship is positive or negative.

iii estimate the strength of the relationship as strong, moderate or weak.

a [Figure: scatterplot of Height (cm) against Age (years)]

b [Figure: scatterplot of Business ($’000) against Advertising ($)]





c [Figure: scatterplot of Daughter’s height (cm) against Mother’s height (cm)]

d [Figure: scatterplot of Reaction time (min) against Drug dosage (mg)]

e [Figure: scatterplot of Score on test against Temperature (°C)]

f [Figure: scatterplot of Wife’s age (years) against Husband’s age (years)]

4.3 The q-correlation coefficient

In the previous section you learned how to estimate the strength of a relationship from a

scatterplot by considering the amount of scatter in the plot. In this section, you will learn how

the q-correlation coefficient (q for quadrant) can be used to give a measure of the strength of

the relationship between two variables.

The idea behind the q-correlation coefficient

From our earlier investigation of the relationship between two variables, we found that for:

positive relationships, high values on one variable tend to go with high values for the

other variable, and vice versa

negative relationships, high values on one variable tend to go with low values for the other

variable, and vice versa.

The q-correlation coefficient gives a measure of the tendency for points in a scatterplot to

follow these patterns.

Example 1 Calculating the q-correlation coefficient

Calculate the q-correlation coefficient for the

scatterplot shown.

[Figure: scatterplot of y against x, both axes scaled from 0 to 10]





Solution

[Figure: the scatterplot of y against x, divided into four quadrants A, B, C and D by a vertical dotted line at the median of the x-values and a horizontal dotted line at the median of the y-values]

Quadrant counts: a = 4, b = 1, c = 3, d = 1

1 Find the median of the x-values. There are

11 points, so the median will be the 6th

point from the left.

2 Draw a vertical dotted line through this point.

3 Find the median of the y-values. There are

11 points, so the median will be the 6th point

up from the bottom of the scatterplot.

4 Draw a horizontal dotted line through this

point.

5 The scatterplot has now been divided into

four quadrants. Label them A, B, C and D,

proceeding anticlockwise from the top right.

6 Count the number of points in each of the

quadrants A, B, C and D.

Call these a, b, c and d respectively.

Any points that lie on the line are omitted.

7 The q-correlation coefficient is given by

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$

Substitute the values for a, b, c and d and evaluate.

$$\therefore q = \frac{(4 + 3) - (1 + 1)}{4 + 1 + 3 + 1} = \frac{5}{9}$$

The properties of the q-correlation coefficient are summarised below.

The q-correlation coefficient

The q-correlation coefficient is defined by

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$

where a, b, c and d are the numbers of points in the four quadrants of the scatterplot, labelled A, B, C and D respectively.

Any points that lie on the lines are omitted.
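As a quick check on the definition, the formula can be written as a short Python function. This is a minimal sketch: the name `q_from_counts` is our own, and it assumes that any points lying on the median lines have already been left out of the counts. Exact fractions are used so that a result such as 5/9 is not rounded.

```python
from fractions import Fraction


def q_from_counts(a, b, c, d):
    """q-correlation coefficient from the four quadrant counts.

    a, b, c, d are the numbers of points in quadrants A, B, C and D;
    points lying on either median line are assumed to be left out already.
    """
    total = a + b + c + d
    if total == 0:
        raise ValueError("no points to count")
    return Fraction((a + c) - (b + d), total)


print(q_from_counts(4, 1, 3, 1))   # counts used in the Example 1 calculation: 5/9
print(q_from_counts(5, 0, 5, 0))   # all points in A and C: 1
print(q_from_counts(0, 5, 0, 5))   # all points in B and D: -1
print(q_from_counts(3, 3, 3, 3))   # equal numbers in every quadrant: 0
```

The last three calls use made-up counts chosen only to show the largest, smallest and zero values that q can take.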

We can see that the q-correlation can take both positive and negative values.

Suppose that all the points lie in quadrants A and C, as shown. Then b = 0 and d = 0 and

$$q = \frac{(a + c) - (0 + 0)}{a + 0 + c + 0} = \frac{a + c}{a + c} = 1$$

[Figure: scatterplot with all points in quadrants A and C]





Suppose all the points lie in quadrants B and D, as shown. Then a = 0 and c = 0 and

$$q = \frac{(0 + 0) - (b + d)}{0 + b + 0 + d} = \frac{-(b + d)}{b + d} = -1$$

[Figure: scatterplot with all points in quadrants B and D]

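When there are an equal number of points in each quadrant, then a = b = c = d, so a + c = b + d, and

$$q = \frac{(a + c) - (b + d)}{a + b + c + d} = \frac{0}{4a} = 0$$

Here there is no relationship (q = 0).

[Figure: scatterplot with equal numbers of points in quadrants A, B, C and D]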

Thus we can see that in general:

−1 ≤ q ≤ 1

If there is a positive relationship then most of the points are in A and C and q is positive.

If there is a negative relationship then most of the points are in B and D and q is negative.

Example 2 Calculating the q-correlation coefficient

Use the scatterplot opposite to calculate the

q-correlation coefficient for reaction time and

drug dosage.

[Figure: scatterplot of Reaction time (min) against Drug dosage (mg)]

Solution

[Figure: the scatterplot of Reaction time (min) against Drug dosage (mg) with both median lines drawn in, dividing it into quadrants A, B, C and D with counts a = 1, b = 4, c = 1, d = 4]

1 Draw in the median line for both

variables on the scatterplot.

Since there are 10 points, the median

lines fall between the 5th and

6th points.

2 Count the number of points in each

of the quadrants A, B, C and D. Call

these a, b, c and d respectively.





3 The q-correlation coefficient is given by

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$

Substitute the values for a, b, c and d and evaluate.

$$q = \frac{(1 + 1) - (4 + 4)}{1 + 4 + 1 + 4} = \frac{-6}{10} = -0.6$$
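The counting procedure can also be carried out directly on the data. The sketch below is an illustration only: the function name `q_correlation` is our own, and it assumes that the scatterplot in this example shows the drug dosage and response time data listed in Exercise 4A, question 4. It finds the two medians, skips any point lying on a median line, and counts the remaining points quadrant by quadrant.

```python
from statistics import median


def q_correlation(xs, ys):
    """Return the quadrant counts (a, b, c, d) and the q-correlation coefficient."""
    x_med, y_med = median(xs), median(ys)
    a = b = c = d = 0
    for x, y in zip(xs, ys):
        if x == x_med or y == y_med:
            continue                      # points on a median line are omitted
        if x > x_med and y > y_med:
            a += 1                        # quadrant A: top right
        elif x < x_med and y > y_med:
            b += 1                        # quadrant B: top left
        elif x < x_med and y < y_med:
            c += 1                        # quadrant C: bottom left
        else:
            d += 1                        # quadrant D: bottom right
    total = a + b + c + d
    if total == 0:
        raise ValueError("every point lies on a median line")
    return (a, b, c, d), ((a + c) - (b + d)) / total


# Assumed data: drug dosage (mg) and response time (min) from Exercise 4A, question 4.
dosage = [0.5, 1.2, 4.0, 5.3, 2.6, 3.7, 5.1, 1.7, 0.3, 0.6]
response = [65, 35, 15, 10, 22, 16, 10, 18, 70, 50]

counts, q = q_correlation(dosage, response)
print(counts, q)
```

With these data the counts agree with those read from the scatterplot in the solution, and q comes out as −0.6.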

Guidelines for classifying the strength of a linear relationship using the q-correlation coefficient

Earlier, we used the degree of scatter in a

scatterplot to classify the strength of the

relationship observed as weak, moderate

or strong. Using the table opposite, we can

do the same using the q-correlation

coefficient.

For example, a q-correlation coefficient

of q = 0.86 indicates that there is a strong

positive relationship.

In contrast, a q-correlation coefficient of

q = −0.34 indicates that there is a weak

negative relationship.

Strong positive relationship 0.75 ≤ q ≤ 1

Moderate positive relationship 0.5 ≤ q < 0.75

Weak positive relationship 0.25 ≤ q < 0.5

No relationship –0.25 < q < 0.25

Weak negative relationship –0.5 < q ≤ –0.25

Moderate negative relationship –0.75 < q ≤ –0.5

Strong negative relationship –1 ≤ q ≤ –0.75
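The table can be read as a chain of comparisons. The following Python sketch (the function name `classify_q` is our own) applies the boundaries exactly as they are listed, so that a boundary value such as q = 0.75 falls in the stronger category.

```python
def classify_q(q):
    """Classify a q-correlation coefficient using the table above."""
    if not -1 <= q <= 1:
        raise ValueError("q must lie between -1 and 1")
    if q >= 0.75:
        return "strong positive relationship"
    if q >= 0.5:
        return "moderate positive relationship"
    if q >= 0.25:
        return "weak positive relationship"
    if q > -0.25:
        return "no relationship"
    if q > -0.5:
        return "weak negative relationship"
    if q > -0.75:
        return "moderate negative relationship"
    return "strong negative relationship"


print(classify_q(0.86))    # strong positive relationship
print(classify_q(-0.34))   # weak negative relationship
```

The two calls reproduce the two examples given in the text.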

Correlation and causation

The existence of even a strong relationship between two variables is not, in itself, sufficient to

imply that altering one variable causes a change in the other. It only implies that this may be

the explanation. It may be that both the measured variables are affected by a third and different

variable. For example, if data about the variables crime rates and unemployment in a range of

cities were gathered, a high correlation would be found. But could it be inferred that high

unemployment causes high crime rates? The explanation could be that both of these variables

are dependent on other variables, such as home circumstances, peer group pressure, level of

education or economic conditions, all of which may be related to both unemployment and

crime rates. These two variables may vary together, without one being the direct cause of the

other. Correlations must be interpreted with care.





Exercise 4C

1 Use the table of q-correlation coefficients to classify each of the following.

a q = 0.20 b q = −0.30 c q = −0.85 d q = 0.33

e q = 0.95 f q = −0.75 g q = 0.75 h q = −0.24

i q = −1 j q = 0.25 k q = 1 l q = −0.50

2 Calculate the value of the q-correlation coefficient for each of the following scatterplots.

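a [Figure: scatterplot of y against x, both axes scaled from 0 to 10]

b [Figure: scatterplot of y against x, both axes scaled from 0 to 10]

c [Figure: scatterplot of y against x, both axes scaled from 0 to 10]

d [Figure: scatterplot of y against x, both axes scaled from 0 to 10]

e [Figure: scatterplot of y against x, both axes scaled from 0 to 10]

f [Figure: scatterplot of y against x, both axes scaled from 0 to 10]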





3 Calculate the q-correlation coefficient for each pair of variables shown in the following

scatterplots.

a [Figure: scatterplot of Price ($’000) against Age (years)]

b [Figure: scatterplot of Calf diameter (cm) against Age (years)]

c [Figure: scatterplot of Mark (%) against Time (hours)]

4.4 Fitting lines to scatterplots

If the points on the scatterplot tend to lie on a straight line, then we can fit a line to the

scatterplot. The process of fitting a straight line to bivariate data is known as linear

regression. The aim of linear regression is to model the relationship between two numerical

variables by using a simple equation: the equation of a straight line.

In regression, we write the equation of a straight line as

y = a + bx

where:

y is the dependent variable (DV)

x is the independent variable (IV).

a is the y-intercept of the line

b is the slope of the line.

Once we have the equation, we can use it to predict the value of the dependent variable (y) for

different values of the independent variable (x).





Fitting a line by eye

What we want to find is the straight line that ‘best’ fits the data. You met this idea earlier, in the

chapter on linear graphs. There is no one way of finding the line that best fits a set of bivariate

data. There are many ways.

The easiest way to fit a line to bivariate data is to construct a scatterplot and draw the line in

‘by eye’. To do this, place a ruler on the scatterplot in a position that captures the general trend

of the data, and then use the ruler to draw a straight line. This method works best when the

points in the scatterplot are reasonably tightly clustered around a straight line.

Once the line is drawn, we can use the methods you learned in ‘Linear graphs’ (Chapter 3)

to find its equation. The starting point for fitting a line ‘by eye’ is a scatterplot.

Example 3 Fitting a line by eye using the intercept and slope

The scatterplot opposite plots mark against

time spent studying for an examination, for

10 students.

In this plot, mark is the y (or dependent) variable

and time is the x (or independent) variable.

Fit a line to the scatterplot by eye and write its

equation in terms of:

a x and y

b the variables mark and time.

[Figure: scatterplot of Mark (%) against Time (hours) for 10 students]

Solution

[Figure: the scatterplot of Mark (%) against Time (hours) with a line fitted by eye, marked with rise = 60 and run = 35]

1 Place a transparent ruler on the scatterplot so

that the points in the scatterplot are reasonably

evenly spread around the line made by the

edge of the ruler.

2 Draw in the line.

3 Find the equation of the line in terms

of y and x.

As the y-intercept can be read from the graph,

use the intercept–slope form of the equation

of a straight line, y = a + bx .





To calculate the slope, choose two

easily read points that are reasonably

widely separated. The points (0, 30)

and (35, 90) are suitable.

Substitute the values for a and b into

the equation.

4 Noting that y represents the variable mark

and x represents the variable time, rewrite

the equation in terms of mark and time.

$$y = a + bx$$

$$a = y\text{-intercept} = 30$$

$$b = \text{slope} = \frac{\text{rise}}{\text{run}} = \frac{60}{35} = 1.7 \text{ (to 1 d.p.)}$$

∴ y = 30 + 1.7x

∴ mark = 30 + 1.7 × time
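The arithmetic in this solution can be checked with a short Python sketch. The function name `line_from_intercept_and_point` is our own; it assumes that the y-intercept has been read from the graph and that a second point on the line is known.

```python
def line_from_intercept_and_point(intercept, point):
    """Return (a, b) for the line y = a + bx through (0, intercept) and point."""
    x, y = point
    if x == 0:
        raise ValueError("the second point must not lie on the y-axis")
    rise = y - intercept      # vertical change from (0, intercept) to the point
    run = x                   # horizontal change from (0, intercept) to the point
    return intercept, rise / run


a, b = line_from_intercept_and_point(30, (35, 90))
print(f"mark = {a} + {b:.1f} × time")   # mark = 30 + 1.7 × time
```

Here the rise is 90 − 30 = 60 and the run is 35 − 0 = 35, as marked on the graph.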

Example 4 Fitting a line by eye using the two-point formula

The scatterplot on the right plots weight against

height for eight people. In this plot, height is the

x (or independent) variable, and weight is the y

(or dependent) variable.

Fit a line to the scatterplot by eye and write its

equation in terms of:

a x and y

b the variables weight and height.

[Figure: scatterplot of Weight (kg) against Height (cm) for eight people]

Solution

[Figure: the scatterplot of Weight (kg) against Height (cm) with a line fitted by eye]

1 Place a ruler on the scatterplot

so that the points in the scatterplot

are reasonably evenly spread around

the line.

2 Draw in the line.

3 Find the equation of the line in

terms of y and x.

As the y-intercept cannot be read from

the graph, use the two-point formula,

$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1}$$

or use a graphics calculator.

Choose two easily read points that are

reasonably widely separated. The points

(155, 53) and (195, 77) are suitable.





Either substitute these values into the

formula and transform or use a graphics

calculator (see page 109).

4 Noting that y represents the variable

weight and x represents the variable

height, rewrite the equation in terms of

weight and height.

$$\frac{y - y_1}{x - x_1} = \frac{y_2 - y_1}{x_2 - x_1}$$

$$x_1 = 155,\ y_1 = 53;\quad x_2 = 195,\ y_2 = 77$$

$$\frac{y - 53}{x - 155} = \frac{77 - 53}{195 - 155}$$

$$\frac{y - 53}{x - 155} = 0.6$$

y − 53 = 0.6(x − 155)

y − 53 = 0.6x − 93

∴ y = −40 + 0.6x

∴ weight = −40 + 0.6 × height
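The two-point calculation can be checked with a short Python sketch. The function name `line_through` is our own. It rearranges the two-point formula into the form y = a + bx, with y standing for weight and x for height as stated in the example.

```python
def line_through(p1, p2):
    """Return (a, b) for the line y = a + bx through the points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points must have different x values")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1       # from y - y1 = slope * (x - x1) with x = 0
    return intercept, slope


a, b = line_through((155, 53), (195, 77))
print(f"weight = {a:.1f} + {b:.1f} × height")   # weight = -40.0 + 0.6 × height
```

Choosing widely separated points, as the example recommends, keeps small reading errors from having a large effect on the slope.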

Fitting a line using the two-mean method

While fitting a line by eye is quick and easy, it is not a reliable method for finding the equation

of the line that best fits a scatterplot, as everyone is likely to come up with a slightly different

line.

One method for overcoming this problem is to use the two-mean method. The two-mean

method locates the line on the scatterplot by finding the mean of the bottom half and top half

of the data values and draws a line between the two.

To fit a line using the two-mean method requires both the scatterplot and the data values.

Example 5 Fitting a line using the two-mean method

The data below give the marks that students obtained on an examination and the times they

spent studying for the examination.

Time (hours), x 4 36 23 19 1 11 18 13 18 8

Mark (%), y 41 87 67 62 23 52 61 43 65 52

Fit a line to the scatterplot using the two-mean method and write its equation in terms of:

a x and y b the variables mark and time.

Solution

1 Rewrite the data pairs in order, according to the x values.

Time, x 1 4 8 11 13 18 18 19 23 36

Mark, y 23 41 52 52 43 61 65 62 67 87





2 Divide the ordered table into two new tables: one for the lower half of data values, the

other for the top half of data values. Find the mean values of x and y for each new table.

Lower half

Time, x 1 4 8 11 13 $\bar{x}_L = 7.4$

Mark, y 23 41 52 52 43 $\bar{y}_L = 42.2$

Upper half

Time, x 18 18 19 23 36 $\bar{x}_U = 22.8$

Mark, y 61 65 62 67 87 $\bar{y}_U = 68.4$

[Figure: scatterplot of Mark (%) against Time (hours) with the two mean points (7.4, 42.2) and (22.8, 68.4) marked and the two-mean line drawn through them]

3 Plot the two mean points (7.4, 42.2)

and (22.8, 68.4) on the scatterplot.

4 Draw in the line through the two mean

points to plot the two-mean line.

5 Use the two mean points (7.4, 42.2)

and (22.8, 68.4) to find the equation of the

line in terms of y and x. Use either the

two-point formula or a graphics calculator

(see page 109).

6 Rewrite the equation of the two-mean line

in terms of the variables mark and time.

Equation of the two-mean line :

y = 29.6 + 1.7x

∴ mark = 29.6 + 1.7 × time

It is interesting to note that the equation of the two-mean line is very close to the equation we

got by fitting a line by eye. This is often the case when the points in the scatterplot are

reasonably closely scattered around the line. However, for scatterplots where this is not the

case, the two-mean method is a more reliable technique to use than fitting a line by eye.

To find the equation of the two-mean line:

Order the data pairs according to the x values and divide into two equal-sized groups: lower and upper. If there is an odd number of data points, discard the middle data point.

Find the coordinates of the point $(\bar{x}_L, \bar{y}_L)$, where $\bar{x}_L$ is the mean of the x values in the lower half and $\bar{y}_L$ the mean of the y values in the lower half.

Find the coordinates of the point $(\bar{x}_U, \bar{y}_U)$, where $\bar{x}_U$ is the mean of the x values in the upper half and $\bar{y}_U$ the mean of the y values in the upper half.

Mark in the two points on the scatterplot. Draw a line through the two points to display the two-mean line.

Use the two points $(\bar{x}_L, \bar{y}_L)$ and $(\bar{x}_U, \bar{y}_U)$ to find the equation of the line. This can be done using either the two-point formula or a graphics calculator.
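The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the calculator method used in this chapter: the function name `two_mean_line` is our own, and pairs with equal x values are simply kept in sorted order when the data are split.

```python
from statistics import mean


def two_mean_line(points):
    """Return (lower mean point, upper mean point, intercept a, slope b)."""
    ordered = sorted(points)              # order the pairs by x value
    half = len(ordered) // 2              # size of each group
    if half == 0:
        raise ValueError("need at least two data points")
    lower = ordered[:half]
    upper = ordered[-half:]               # with an odd count, the middle pair is discarded
    x_l = mean([x for x, _ in lower])
    y_l = mean([y for _, y in lower])
    x_u = mean([x for x, _ in upper])
    y_u = mean([y for _, y in upper])
    if x_u == x_l:
        raise ValueError("the two mean points have the same x value")
    slope = (y_u - y_l) / (x_u - x_l)     # two-point formula
    intercept = y_l - slope * x_l
    return (x_l, y_l), (x_u, y_u), intercept, slope


# Data from Example 5.
times = [4, 36, 23, 19, 1, 11, 18, 13, 18, 8]
marks = [41, 87, 67, 62, 23, 52, 61, 43, 65, 52]

low, up, a, b = two_mean_line(list(zip(times, marks)))
print(low, up)
print(f"mark = {a:.1f} + {b:.1f} × time")   # mark = 29.6 + 1.7 × time
```

Applied to the data in Example 5, the sketch gives the same two mean points and, after rounding to one decimal place, the same equation.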





Exercise 4D

1 Fit a line by eye to the scatterplot opposite.

Write the equation of the line in terms of the

variables infant death rate and female literacy

rate.

[Figure: scatterplot of Infant death rate (per 100 000) against Female literacy rate (%)]



variables height and age.

[Figure: scatterplot of Height (cm) against Age (months)]



variables daughter’s height and mother’s height.

[Figure: scatterplot of Daughter’s height (cm) against Mother’s height (cm)]

4 The data below gives the velocity of a motorbike (in m/s) over a 5-second interval. Also

shown is the scatterplot in which velocity is plotted against time.

Time (s) Velocity (m/s)

0.5 19.3

1 20.4

1.5 18.6

2 22.2

3 22.5

3.5 24.3

4 22.5

5 25.5

[Figure: scatterplot of Velocity (m/s) against Time (s)]

Find the equation of the two-mean line for this data. Write the equation in terms of the

variables velocity and time.



5 The data below gives the prices and ages of 12 used cars. Also shown is the scatterplot

constructed from this data.

Age (years) Price ($)

2 15 800

3 14 300

3 13 800

4 11 800

4 13 000

4 13 300

5 11 000

6 12 200

6 9 500

7 8 300

7 9 700

8 8 000

[Figure: scatterplot of Price ($’000) against Age (years)]


variables price and age.

6 The data below gives the airspeed and the number of seats in 8 aircraft. Also shown is the

scatterplot constructed from this data.

Number of seats Airspeed (km/hr)

405 830

296 797

288 774

258 736

240 757

193 765

188 760

148 718

[Figure: scatterplot of Airspeed (km/h) against Number of seats]


variables airspeed and number of seats.

4.5 Using regression lines to make predictions

As we said earlier, the process of fitting a straight line to bivariate data is known as linear

regression. The aim of linear regression is to model the relationship between two numerical

variables by using the equation of a straight line. Once we have this equation, we can use the

equation to make predictions.





For example, in Example 5 we fitted a line to the data relating students’ marks on an

examination to the time they spent studying for the examination. The equation was

mark = 29.6 + 1.7 × time

Using this equation, and rounding off to the nearest whole number, we would predict that a

student who spent:

0 hours studying would obtain a mark of 30% (mark = 29.6 + 1.7 × 0 = 29.6)


12 hours studying would obtain a mark of 50% (mark = 29.6 + 1.7 × 12 = 50)


80 hours studying would obtain a mark of 166%! (mark = 29.6 + 1.7 × 80 = 165.6)

This last result points to one of the limitations of regression lines. We are predicting someone

to get more than 100%. When using a regression line to make predictions, we must remember

that, strictly speaking, the equation only applies to the range of data values used to determine

the equation.

Thus, we are safe using the line to make predictions within this data range. This is called

interpolation.

However, we must be extremely careful about how much faith we put into predictions made

outside the data range. Making predictions outside the data range is called extrapolating.

Predicting within the range of data is called interpolation.

Predicting outside the range of data is called extrapolation.

For example, if we use the regression

line to predict the examination mark for

30 hours of studying time, we would be

interpolating because we would be making

a prediction within the data.

However, if we use the regression

line to predict the examination mark for

50 hours of studying time, we would be

extrapolating because we would be making

a prediction outside the data. Extrapolation

is a less reliable process than interpolation

because we are going beyond the original

data, and we don’t know if the relationship is

still linear there.
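Deciding whether a prediction is an interpolation or an extrapolation only requires comparing the value with the smallest and largest values of the independent variable in the data. The sketch below shows this check. The function name and the data range of 10 to 40 hours are assumptions made purely for illustration (the range is chosen so that 30 hours falls inside it and 50 hours outside, as in the discussion above), and values exactly at the ends of the range are treated as being within it.

```python
def prediction_type(x, x_min, x_max):
    """Classify a prediction at x relative to the data range x_min to x_max."""
    if x_min <= x <= x_max:
        return "interpolation"
    return "extrapolation"


# Assumed data range for study time, for illustration only.
STUDY_MIN, STUDY_MAX = 10, 40

for hours in (30, 50):
    print(hours, "hours:", prediction_type(hours, STUDY_MIN, STUDY_MAX))
```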

[Figure: Mark (%) against Time (hours), with the regression line extended beyond the data. Interpolation: line is used to make a prediction within the data range. Extrapolation: line is used to make a prediction outside the data range.]

Exercise 4E

1 Complete the following sentences. Using a regression line to make a prediction:

a within the range of data that was used to derive the equation is called ________.

b outside the range of data that was used to derive the equation is called ________.





2 For children between the ages of 36 and 60 months, the equation relating their height (in cm)

to their age (in months) is:

height = 72 + 0.4 × age

Use this equation to predict the height (to the nearest cm) of a child who is:

a 40 months old. Is this interpolation or extrapolation?

b 55 months old. Is this interpolation or extrapolation?

c 70 months old. Is this interpolation or extrapolation?

3 For shoe sizes between 6 and 12, the equation

relating a person’s weight (in kg) to shoe size is:

weight = 48.1 + 2.2 × shoe size

Use this equation to predict the weight (to

the nearest kg) of a person whose shoe size is:

a 5. Is this interpolation or extrapolation?

b 8. Is this interpolation or extrapolation?

c 11. Is this interpolation or extrapolation?

4 When preparing between 25 and 100 meals, a cafeteria’s cost (in dollars) is given by the

equation:

cost = 175 + 5.8 × number of meals

Use this equation to predict the cost (to the nearest dollar) of preparing:

a no meals. Is this interpolation or extrapolation?

b 60 meals. Is this interpolation or extrapolation?

c 89 meals. Is this interpolation or extrapolation?

5 For women of heights from 150 to 180 cm, the equation relating a daughter’s adult height

(in cm) to her mother’s height (in cm) is:

daughter’s height = 18.3 + 0.91 × mother’s height

Use this equation to predict (to the nearest centimetre) the adult height of a woman whose

mother is:

a 168 cm tall. Is this interpolation or extrapolation?

b 196 cm tall. Is this interpolation or extrapolation?

c 155 cm tall. Is this interpolation or extrapolation?




Review


Key ideas and chapter summary

Scatterplot

A scatterplot is used to help identify and describe the relationship between two numerical variables.

[Scatterplot: Score on hearing test (DV) against Age (IV)]

In a scatterplot, the dependent variable (DV) is plotted on the vertical axis and the independent variable (IV) on the horizontal axis.

Identifying relationships between two numerical variables

A random cluster of points (no clear pattern) indicates that the variables are unrelated.

A clear pattern in the scatterplot indicates that the variables are related.

Describing relationships in scatterplots

Relationships are described in terms of:

direction (positive or negative) and outliers

strength (strong, moderate, weak or none).

q-correlation coefficient

The quadrant or q-correlation coefficient is a measure of the strength of the relationship between two numerical variables.

The q-correlation coefficient is defined by

$$q = \frac{(a + c) - (b + d)}{a + b + c + d}$$

where a, b, c and d are the number of points in the four quadrants of the scatterplot labelled A, B, C and D respectively.

Any points that lie on the lines are omitted.

[Diagram: x and y axes with origin O, dividing the scatterplot into four quadrants labelled A, B, C and D]
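Once the points in each quadrant have been counted, the calculation is simple arithmetic. The sketch below assumes the four counts have already been found by hand, with points on the dividing lines left out; the function name and the example counts are hypothetical.

```python
def q_correlation(a, b, c, d):
    """Return q = ((a + c) - (b + d)) / (a + b + c + d) from the quadrant counts."""
    total = a + b + c + d
    if total == 0:
        raise ValueError("no points lie inside the four quadrants")
    return ((a + c) - (b + d)) / total


# Hypothetical counts: 4 points in A, 1 in B, 4 in C and 1 in D.
print(q_correlation(4, 1, 4, 1))
```

With these counts, q = (8 − 2)/10 = 0.6.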




q-correlation: strength

The q-correlation coefficient can be used to classify the strength of the relationship between two numerical variables as weak, moderate or strong, using the guidelines shown in the table.

Strong positive relationship       0.75 ≤ q ≤ 1
Moderate positive relationship     0.5 ≤ q < 0.75
Weak positive relationship         0.25 ≤ q < 0.5
No relationship                    –0.25 < q < 0.25
Weak negative relationship         –0.5 < q ≤ –0.25
Moderate negative relationship     –0.75 < q ≤ –0.5
Strong negative relationship       –1 ≤ q ≤ –0.75
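The guidelines in the table can be written as a short set of tests on the value of q. The following sketch is one possible way to do this; the function name classify_q is our own, and the cut-off values are taken directly from the table.

```python
def classify_q(q):
    """Classify a q-correlation coefficient using the guidelines in the table."""
    if not -1 <= q <= 1:
        raise ValueError("q must lie between -1 and 1")
    size = abs(q)
    if size < 0.25:
        return "no relationship"
    direction = "positive" if q > 0 else "negative"
    if size >= 0.75:
        strength = "strong"
    elif size >= 0.5:
        strength = "moderate"
    else:
        strength = "weak"
    return f"{strength} {direction} relationship"


for value in (0.6, -0.8, 0.1):
    print(value, "->", classify_q(value))
```

Because the positive and negative rows of the table mirror each other, the sketch tests the size of q first and then attaches the direction.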

Fitting lines to scatterplots: linear regression

A straight line can be used to model the relationship between two numerical variables when the relationship is linear. This is known as linear regression.

The relationship can then be described by a rule of the form

y = a + bx

where y is the dependent variable (DV), x is the independent variable (IV), a is the y-intercept of the line and b is the slope of the line.

Fitting a line by eye

Fitting a line by eye means drawing a line on the scatterplot that captures the general trend of the data. It is most suitable when there is minimal scatter in the scatterplot.




Fitting a line using the two-mean method

The two-mean method positions the line on the scatterplot by dividing the data into a lower half and an upper half according to the x-values, and then finding the mean x-value and the mean y-value of each half. A line is then drawn through the two mean points.
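The two-mean method can be sketched in Python as follows. The function name and the data are made up for illustration, and the sketch only handles an even number of data points so that the data splits cleanly into two halves.

```python
def two_mean_line(xs, ys):
    """Return (a, b) for the two-mean line y = a + b*x.

    Simplifying assumption: the number of data points is even.
    """
    if len(xs) != len(ys) or len(xs) < 2 or len(xs) % 2 != 0:
        raise ValueError("need an even number of (x, y) pairs")
    pairs = sorted(zip(xs, ys))              # order the points by x-value
    half = len(pairs) // 2
    lower, upper = pairs[:half], pairs[half:]

    def mean_point(points):
        """Mean x-value and mean y-value of a list of points."""
        n = len(points)
        return sum(x for x, _ in points) / n, sum(y for _, y in points) / n

    x_low, y_low = mean_point(lower)
    x_up, y_up = mean_point(upper)
    b = (y_up - y_low) / (x_up - x_low)      # slope between the two mean points
    a = y_low - b * x_low                    # intercept from the lower mean point
    return a, b


# Hypothetical data, not taken from the text.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 3, 9, 8, 10]
a, b = two_mean_line(xs, ys)
print(f"y = {a} + {b}x")
```

For this made-up data the two mean points are (2, 3) and (5, 9), so the slope is (9 − 3)/(5 − 2) = 2 and the intercept is 3 − 2 × 2 = −1.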

Using a regression line to make predictions

The regression line y = a + bx enables the value of y to be determined for a given value of x.

Interpolation and extrapolation

Predicting within the range of data is called interpolation.

Predicting outside the range of data is called extrapolation.

Skills check

Having completed the current chapter you should be able to:

construct a scatterplot

use a scatterplot to comment on the direction of a relationship (positive or negative)

and possible outliers

calculate and interpret the q-correlation coefficient

determine the equation of a line drawn by eye

determine the equation of a two-mean line

use the equation of the line for prediction

distinguish between interpolation and extrapolation when using a line to make a

prediction.




Multiple-choice questions

1 For which one of the following pairs of variables would it be appropriate to

construct a scatterplot?

A Eye colour (blue, green, brown, other) and hair colour (black, brown, blonde,

red, other)

B Score out of 100 on a test for a group of Year 9 students and a group of Year 11

students

C Political party preference (Labor, Liberal, Other) and age in years

D Age in years and blood pressure in mm Hg

E Height in cm and sex (male, female)

2 For the scatterplot shown, the relationship between the

variables is best described as:

A weak negative

B strong negative

C no relationship

D weak positive

E strong positive

[Scatterplot: y against x]



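3 For the scatterplot shown, the relationship between the

variables is best described as:

A weak negative

B strong negative

C no relationship

D weak positive

E strong positive

[Scatterplot: y against x]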



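4 For the scatterplot shown, the relationship between the

variables is best described as:

A weak negative

B strong negative

C no relationship

D weak positive

E strong positive

[Scatterplot: y against x]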






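5 For the scatterplot shown, the relationship between the

variables is best described as:

A weak negative

B strong negative

C no relationship

D weak positive

E strong positive

[Scatterplot: y against x]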

6 A q-correlation coefficient of 0.32 would describe a relationship classified as:

A weak positive B moderate positive C strong positive

D close to zero E moderately strong

7 For the scatterplot shown, the q-correlation

coefficient is:

A −1

B −0.5

C 0

D 0.5

E 1

[Scatterplot: both axes scaled from 0 to 10]


8 For the scatterplot shown, the q-correlation

coefficient is:

A −1

B −0.5

C 0

D 0.5

E 1

[Scatterplot: both axes scaled from 0 to 10]


9 For the scatterplot shown, the q-correlation

coefficient is:

A 0.2

B 0.4

C 0.6

D 0.8

E 1.0

[Scatterplot: both axes scaled from 0 to 10]




10 For the scatterplot shown, the line drawn by eye

would have an equation closest to:

A velocity = 5 × time

B velocity = 19 + 1 × time

C velocity = 1 + 19 × time

D velocity = 19 + 5 × time

E velocity = 5 + 19 × time

[Scatterplot: Velocity (m/s) against Time (s), with a line drawn by eye]

11 For the scatterplot shown, the line drawn by

eye would have a slope closest to:

A −2000

B −1000

C −200

D 2000

E 1000

[Scatterplot: Price ($’000) against Age (years), with a line drawn by eye]

The following information relates to Questions 12 and 13

The weekly income and weekly food costs for a group of 10 university students are given

in the following table.

Income ($) 150 250 300 300 380 450 600 850 950 1000

Food cost ($) 40 60 70 130 150 260 120 460 200 600

12 The equation of the two-mean line would be found by finding the equation of the

line passing through the points:

A (276, 90) and (770, 328) B (300, 70) and (850, 460)

C (90, 276) and (328, 770) D (150, 40) and (1000, 600)

E (276, 84) and (770, 334)

13 The equation of the two-mean line that would enable food cost to be predicted from

weekly income is closest to:

A food cost = 0.48 + 43 × income B food cost = 0.48 − 43 × income

C food cost = −43 + 0.48 × income D food cost = 240 + 1.4 × income

E food cost = 1.4 + 240 × income

The following information relates to Questions 14 and 15

For incomes between $600 and $1200 per week, the equation of a line that relates

weekly expenditure on entertainment (in dollars) to weekly income (in dollars) is given

by:

expenditure = 40 + 0.10 × income

Cambridge University Press • Uncorrected Sample Pages • 978-0-521-74049-4 2008 © Evans, Lipson, Jones, Avery, TI-Nspire & Casio ClassPad material prepared in collaboration with Jan Honnens & David Hibbard



14 The equation predicts that the amount spent on entertainment by a person with an

income of $800 is:

A $40 B $80 C $120 D $160 E $1200

15 The following statements relate to the equation

expenditure = 40 + 0.10 × income

Which statement is not true?

A Expenditure is the dependent variable. B Income is the independent variable.

C The slope of the line is 0.10. D The intercept of the line is 40.

E Using the line to predict the expenditure of a person with an income of $1500

per week is called interpolation.

Short-answer questions

1 The following table gives the number of times the ball was inside the 50 m line in an

AFL football game, and the team’s score in that game.

Inside 50 m 64 57 34 61 51 52 53 51 64 55 58 71

Score (points) 90 134 76 92 93 45 120 66 105 108 88 133

a Construct a scatterplot of score against the number of times the ball was

inside 50 m.

b From the scatterplot, describe any relationship between the two variables.

2 Determine the q-correlation coefficient for the

scatterplot shown.

[Scatterplot: Distance (km) against Time (min)]

3 The following scatterplot shows the relationship

between height and weight for a group of obese

people. A line by eye has been drawn on the

scatterplot. Find the equation of the line.

[Scatterplot: Weight (kg) against Height (cm), with a line drawn by eye]




4 The time taken to complete a task, and the number of errors on the task, were

recorded for a sample of 10 primary school children. Determine the equation of the

two-mean line that fits this data.

Time (s) 22.6 21.7 21.7 21.3 19.3 17.6 17.0 14.6 14.0 8.8

Errors 2 3 3 4 5 5 7 7 9 9

Extended-response questions

1 A marketing company wishes to predict the likely number of new clients that each of

its graduates will attract to the business in their first year of employment. It plans to

do this by using the graduates’ scores on a marketing examination in the final year of

their course.

Graduate Examination score Number of new clients

1 65 7

2 72 9

3 68 8

4 85 10

5 74 10

6 61 8

7 60 6

8 78 10

9 70 5

10 82 11

a Which is the independent variable and which is the dependent variable?

b Construct a scatterplot of this data.

c Describe the relationship between the number of new clients and the examination

score.

d Determine the value of the q-correlation coefficient for this data, and classify the

strength of the relationship.

e Determine the equation for the two-mean line and write it down in terms of the

variables number of new clients and examination score.

f Use your equation to predict, to the nearest whole number, the number of new

clients for a graduate who scored 100 on the examination.

g In making this prediction, are you interpolating or extrapolating?




2 To investigate the relationship between marks on an assignment and the final

examination mark, a sample of 10 students was taken. The table below indicates the

marks for the assignment and the final exam mark for each student.

Assignment mark (max = 80) 80 77 71 78 65 80 68 64 50 66

Final exam mark (max = 90) 83 83 79 75 68 84 71 69 66 58



c Describe the relationship between the assignment mark and the final examination

mark.



e Use your answer to part d to comment on the statement: ‘Good final exam marks

are the result of good assignment marks.’

f Determine the equation for the two-mean line and write it down in terms of the

variables final exam mark and assignment mark.

g Use your equation to predict the final examination mark for a student who scored

50 on the assignment.

h In making this prediction, are you interpolating or extrapolating?

3 A marketing firm wanted to investigate the relationship between airplay and CD

sales (in the following week) of newly released CDs. The following data was

collected on a random sample of 10 CDs.

Number of times played 47 34 40 34 33 50 28 53 25 46

Weekly sales 3950 2500 3700 2800 2900 3750 2300 4400 2200 3400



c Describe the association between the number of times the CD was played and

weekly sales.



e Determine the equation for the two-mean line and write it down in terms of the

variables number of times played and weekly sales.

f Use your equation to predict the weekly sales for a CD that was played 60 times.

g In making this prediction, are you interpolating or extrapolating?




4 The following table gives the gold-medal winning distance, in metres, for the men’s

long jump for the Olympic Games for the years 1896 to 2004. (Some years were

missing owing to the two world wars.)

Year 1896 1900 1904 1908 1912 1920 1924 1928 1932 1936 1948 1952 1956

Distance (m) 6.35 7.19 7.34 7.49 7.59 7.16 7.44 7.75 7.65 8.05 7.82 7.57 7.82

Year 1960 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004

Distance (m) 8.13 8.08 8.92 8.26 8.36 8.53 8.53 8.72 8.67 8.50 8.55 8.59


b Construct a scatterplot of these data.

c Describe the association between the distance and year.

d Determine the value of the q-correlation coefficient for these data, and classify the

strength of the relationship.

e Determine the equation for the two-mean line and write it down in terms of the

variables distance and year.

f Use your equation to predict the winning distance in the year 2008.

g How reliable is the prediction made in part f?

