statistics all standardgradegeneralcreditmaths notes

Page 1: Statistics All StandardGradeGeneralCreditMaths Notes

Box-and-Whisker Plots: Quartiles, Boxes, and Whiskers (page 1 of 3)

Sections: Quartiles, boxes, and whiskers, Five-number summary, Interquartile ranges and outliers

Statistics assumes that your data points (the numbers in your list) are clustered around some central value. The "box" in the box-and-whisker plot contains, and thereby highlights, the middle half of these data points.

To create a box-and-whisker plot, you start by ordering your data (putting the values in numerical order), if they aren't ordered already. Then you find the median of your data. The median divides the data into two halves. To divide the data into quarters, you then find the medians of these two halves. Note: If you have an even number of values, so the first median was the average of the two middle values, then you include the middle values in your sub-median computations. If you have an odd number of values, so the first median was an actual data point, then you do not include that value in your sub-median computations. That is, to find the sub-medians, you're only looking at the values that haven't yet been used.

You have three points: the first middle point (the median), and the middle points of the two halves (what I call the "sub-medians"). These three points divide the entire data set into quarters, called "quartiles". The top point of each quartile has a name, being a "Q" followed by the number of the quarter. So the top point of the first quarter of the data points is "Q1", and so forth. Note that Q1 is also the middle number for the first half of the list, Q2 is also the middle number for the whole list, Q3 is the middle number for the second half of the list, and Q4 is the largest value in the list.

Once you have these three points, Q1, Q2, and Q3, you have all you need in order to draw a simple box-and-whisker plot. Here's an example of how it works.

Draw a box-and-whisker plot for the following data set:

4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6, 4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4

My first step is to order the set. This gives me:

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The first number I need is the median of the entire set. Since there are seventeen values in this list, I need the ninth value:

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4, 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The median is Q2 = 4.4.

The next two numbers I need are the medians of the two halves. Since I used the "4.4" in the middle of the list, I can't re-use it, so my two remaining data sets are:

Page 1 of 2

25/02/2009mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Box-and-Whisker...

3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4 and 4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1

The first half has eight values, so the median is the average of the middle two:

Q1 = (4.3 + 4.3)/2 = 4.3

The median of the second half is:

Q3 = (4.7 + 4.8)/2 = 4.75
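The median-of-halves rule used in this example can be sketched in Python. This is only an illustration of the method described above, not part of the original notes; the function name `quartiles` is made up for the example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # values before the middle of the list
    upper = data[(n + 1) // 2 :]  # for odd n this skips the middle value
    return median(lower), median(data), median(upper)

data = [4.3, 5.1, 3.9, 4.5, 4.4, 4.9, 5.0, 4.7, 4.1, 4.6,
        4.4, 4.3, 4.8, 4.4, 4.2, 4.5, 4.4]
q1, q2, q3 = quartiles(data)
print(q1, q2, q3)  # about 4.3, 4.4 and 4.75, up to floating-point rounding
```

With seventeen values the middle value is left out of both halves, so each half has eight values, matching the working above.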

Since my list values have one decimal place and range from 3.9 to 5.1, I won't use a scale of, say, zero to ten, marked off by ones. Instead, I'll draw a number line from 3.5 to 5.5, and mark off by tenths.

Now I'll mark off the minimum and maximum values, and Q1, Q2, and Q3:

The "box" part of the plot goes from Q1 to Q3:

And then the "whiskers" are drawn to the endpoints:

By the way, box-and-whisker plots don't have to be drawn horizontally as I did above; they can be vertical, too.

Original URL: http://www.purplemath.com/modules/boxwhisk.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved. Terms of Use: http://www.purplemath.com/terms.htm

Box-and-Whisker Plots: Five-Number Summary (page 2 of 3)

More terminology: The top end of your box may also be called the "upper hinge"; the lower end may also be called the "lower hinge". The lower hinge is also called "the 25th percentile"; the median is "the 50th percentile"; the upper hinge is "the 75th percentile". This means that 25%, 50% and 75% of the data, respectively, is at or below that point. The distance between the hinges may be referred to as the "H-spread" or, as you will see on the following page, the "Interquartile Range", abbreviated "IQR". ("Hinge" actually has a different technical definition, but the term is sometimes used informally.)

Also, some books and software will include the overall median (Q2) when computing Q1 and Q3 for data sets with an odd number of elements. The Texas Instruments calculators do not include Q2 in this case, so you may encounter a book answer that doesn't match the calculator answer. And different software packages use all different sorts of formulas. Be careful to use the formula from your book when doing your homework!
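To see how two conventions can disagree, here is a small sketch using the `quantiles` function from Python's standard `statistics` module (Python 3.8 or later), applied to the seventeen ordered values from the earlier example. The choice of Python and of this data set is mine, for illustration only; your book or calculator may use yet another rule.

```python
from statistics import quantiles

data = [3.9, 4.1, 4.2, 4.3, 4.3, 4.4, 4.4, 4.4, 4.4,
        4.5, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0, 5.1]

# Two of the interpolation rules offered by the standard library.
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)  # Q3 comes out at about 4.75
print("inclusive:", inclusive)  # Q3 comes out at about 4.7
```

For this particular list the "exclusive" rule agrees with the median-of-halves answer (Q3 = 4.75), while the "inclusive" rule gives a different Q3.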

Additionally, the box-and-whisker plot may include a cross or an "X" marking the mean value of the data, in addition to the line inside the box that marks the median. The difference between the "X" and the median line can then be used as a measure of "skew".

Please don't ask me to explain "skew".

Draw the box-and-whisker plot for the following data set: 77, 79, 80, 86, 87, 87, 94, 99

My first step is to find the median. Since there are eight data points, the median will be the average of the two middle values: (86 + 87) ÷ 2 = 86.5 = Q2

This splits the list into two halves: 77, 79, 80, 86 and 87, 87, 94, 99. Since the halves of the data set each contain an even number of values, the sub-medians will be the average of the middle two values.

Q1 = (79 + 80) ÷ 2 = 79.5

Q3 = (87 + 94) ÷ 2 = 90.5

The minimum value is 77 and the maximum value is 99, so I have:

min: 77, Q1: 79.5, Q2: 86.5, Q3: 90.5, max: 99

Then my plot looks like this:

As you can see, you only need the five values listed above (min, Q1, Q2, Q3, and max) in order to draw your box-and-whisker plot. This set of five values has been given the name "the five-number summary".

Give the five-number summary of the following data set: 79, 53, 82, 91, 87, 98, 80, 93

The five-number summary consists of the numbers I need for the box-and-whisker plot: the minimum value, Q1 (the bottom of the box), Q2 (the median of the set), Q3 (the top of the box), and the maximum value (which is also Q4). So I need to order the set, find the median and the sub-medians, and then list the required values in order.

ordering the list: 53, 79, 80, 82, 87, 91, 93, 98, so the minimum is 53 and the maximum is 98

finding the median: (82 + 87) ÷ 2 = 84.5 = Q2

lower half of the list: 53, 79, 80, 82, so Q1 = (79 + 80) ÷ 2 = 79.5

upper half of the list: 87, 91, 93, 98, so Q3 = (91 + 93) ÷ 2 = 92

five-number summary: 53, 79.5, 84.5, 92, 98
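The same steps (order, find the median, find the sub-medians, list the five values) can be written as a short Python sketch. The function name `five_number_summary` is hypothetical, and the sub-medians follow the median-of-halves rule used on this page.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, Q2, Q3, max) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    lower = data[: n // 2]        # lower half of the ordered list
    upper = data[(n + 1) // 2 :]  # upper half (middle value skipped if n is odd)
    return data[0], median(lower), median(data), median(upper), data[-1]

summary = five_number_summary([79, 53, 82, 91, 87, 98, 80, 93])
print(summary)  # (53, 79.5, 84.5, 92.0, 98)
```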

Part of the point of a box-and-whisker plot is to show how spread out your values are. But what if one or another of your values is way out of line? For this, we need to consider "outliers"....

Original URL: http://www.purplemath.com/modules/boxwhisk2.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved. Terms of Use: http://www.purplemath.com/terms.htm

Box-and-Whisker Plots: Interquartile Ranges and Outliers (page 3 of 3)

The "interquartile range", abbreviated "IQR", is just the width of the box in the box-and-whisker plot. That is, IQR = Q3 – Q1. The IQR can be used as a measure of how spread-out the values are.

Statistics assumes that your values are clustered around some central value. The IQR tells how spread out the "middle" values are; it can also be used to tell when some of the other values are "too far" from the central value. These "too far away" points are called "outliers", because they "lie outside" the range in which we expect them.

The IQR is the length of the box in your box-and-whisker plot. An outlier is any value that lies more than one and a half times the length of the box from either end of the box. That is, if a data point is below Q1 – 1.5×IQR or above Q3 + 1.5×IQR, it is viewed as being too far from the central values to be reasonable. Maybe you bumped the weigh-scale when you were making that one measurement, or maybe your lab partner is an idiot and you should never have let him touch any of the equipment. Who knows? But whatever their cause, the outliers are those points that don't seem to "fit".

(Why one and a half times the width of the box? Why does that particular value demark the difference between "acceptable" and "unacceptable" values? Because, when John Tukey was inventing the box-and-whisker plot in 1977 to display these values, he picked 1.5×IQR as the demarkation line for outliers. This has worked well, so we've continued using that value ever since.)

Find the outliers, if any, for the following data set:

10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6, 14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4

To find out if there are any outliers, I first have to find the IQR. There are fifteen data points, so the median will be at position (15 + 1) ÷ 2 = 8. Then Q2 = 14.6. There are seven data points on either side of the median, so Q1 is the fourth value in the list and Q3 is the twelfth: Q1 = 14.4 and Q3 = 14.9. Then IQR = 14.9 – 14.4 = 0.5.

Outliers will be any points below Q1 – 1.5×IQR = 14.4 – 0.75 = 13.65 or above Q3 + 1.5×IQR = 14.9 + 0.75 = 15.65.

Then the outliers are at 10.2, 15.9, and 16.4.

The values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "fences" that mark off the "reasonable" values from the outlier values. Outliers lie outside the fences.
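As a quick check of the fence rule, here is a Python sketch that takes Q1 and Q3 as already computed above and picks out the values beyond the fences. The helper name `fences` and the parameter `k` are just labels chosen for this illustration.

```python
def fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences at k box-lengths from the box."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

data = [10.2, 14.1, 14.4, 14.4, 14.4, 14.5, 14.5, 14.6,
        14.7, 14.7, 14.7, 14.9, 15.1, 15.9, 16.4]

low, high = fences(q1=14.4, q3=14.9)  # roughly 13.65 and 15.65
outliers = [x for x in data if x < low or x > high]
print(outliers)  # [10.2, 15.9, 16.4]
```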

If your assignment is having you consider outliers and "extreme values", then the values for Q1 – 1.5×IQR and Q3 + 1.5×IQR are the "inner" fences and the values for Q1 – 3×IQR and Q3 + 3×IQR are the "outer" fences. The outliers (marked with asterisks or open dots) are between the inner and outer fences, and the extreme values (marked with whichever symbol you didn't use for the outliers) are outside the outer fences.

By the way, your book may refer to the value of "1.5×IQR" as being a "step". Then the outliers will be the numbers that are between one and two steps from the hinges, and extreme values will be the numbers that are more than two steps from the hinges.

Looking again at the previous example, the outer fences would be at 14.4 – 3×0.5 = 12.9 and 14.9 + 3×0.5 = 16.4. Since 16.4 is right on the upper outer fence, this would be considered to be only an outlier, not an extreme value. But 10.2 is fully below the lower outer fence, so 10.2 would be an extreme value.
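The inner and outer fence rule can be sketched as a small classifier. This is an illustration only: the function name `classify` and the labels it returns are my own, and I use Python's `decimal` module so that a value sitting exactly on a fence (like 16.4 here) is compared exactly rather than through floating-point rounding.

```python
from decimal import Decimal

def classify(value, q1, q3):
    """Label a value as 'regular', 'outlier', or 'extreme' using the fences."""
    iqr = q3 - q1
    inner_low, inner_high = q1 - Decimal("1.5") * iqr, q3 + Decimal("1.5") * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    if value < outer_low or value > outer_high:
        return "extreme"
    if value < inner_low or value > inner_high:
        return "outlier"
    return "regular"

q1, q3 = Decimal("14.4"), Decimal("14.9")
labels = {v: classify(Decimal(v), q1, q3) for v in ("10.2", "14.6", "15.9", "16.4")}
print(labels)
# {'10.2': 'extreme', '14.6': 'regular', '15.9': 'outlier', '16.4': 'outlier'}
```

A value exactly on a fence is not counted as beyond it, which is the convention used in the example above.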

Your graphing calculator may or may not indicate whether a box-and-whisker plot includes outliers. For instance, the above problem includes the points 10.2, 15.9, and 16.4 as outliers. One setting on my graphing calculator gives the simple box-and-whisker plot which uses only the five-number summary, so the furthest outliers are shown as being the endpoints of the whiskers:

A different calculator setting gives the box-and-whisker plot with the outliers specially marked (in this case, with a simulation of an open dot), and the whiskers going only as far as the highest and lowest values that aren't outliers:

Note that my calculator makes no distinction between outliers and extreme values.

If you're using your graphing calculator to help with these plots, make sure you know which setting you're supposed to be using and what the results mean, or the calculator may give you a perfectly correct but "wrong" answer.

Find the outliers and extreme values, if any, for the following data set, and draw the box-and-whisker plot. Mark any outliers with an asterisk and any extreme values with an open dot.

21, 23, 24, 25, 29, 33, 49

To find the outliers and extreme values, I first have to find the IQR. Since there are seven values in the list, the median is the fourth value, so Q2 = 25. The first half of the list is 21, 23, 24, so Q1 = 23; the second half is 29, 33, 49, so Q3 = 33. Then IQR = 33 – 23 = 10.

The outliers will be any values below 23 – 1.5×10 = 23 – 15 = 8 or above 33 + 1.5×10 = 33 + 15 = 48. The extreme values will be those below 23 – 3×10 = 23 – 30 = –7 or above 33 + 3×10 = 33 + 30 = 63.

So I have an outlier at 49 but no extreme values. I won't have a top whisker because Q3 is also the highest non-outlier, and my plot looks like this:

It should be noted that the methods, terms, and rules outlined above are what I have taught and what I have most commonly seen taught. However, your course may have different specific rules, or your calculator may do computations slightly differently. You may need to be somewhat flexible in finding the answers specific to your curriculum.

Cumulative Frequency Graphs

Cumulative Frequency

This is the running total of the frequencies. On a graph, it can be represented by a cumulative frequency polygon, where straight lines join up the points, or a cumulative frequency curve.
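A running total is easy to compute in code. The sketch below uses `itertools.accumulate` from Python's standard library on the frequencies from the example in this section; the variable names are mine.

```python
from itertools import accumulate

frequencies = [4, 6, 3, 2, 6, 4]

# Each entry is the sum of all frequencies up to and including that row.
cumulative = list(accumulate(frequencies))
print(cumulative)  # [4, 10, 13, 15, 21, 25]
```

The last entry (25) is the total number of values, which is the "n" used when reading the median and quartiles off a cumulative frequency curve.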

Example

Frequency:    Cumulative Frequency:
4             4
6             10 (4 + 6)
3             13 (4 + 6 + 3)
2             15 (4 + 6 + 3 + 2)
6             21 (4 + 6 + 3 + 2 + 6)
4             25 (4 + 6 + 3 + 2 + 6 + 4)

The Median Value

The median of a group of numbers is the number in the middle, when the numbers are in order of magnitude. For example, if the set of numbers is 4, 1, 6, 2, 6, 7, 8, the median is 6:

1, 2, 4, 6, 6, 7, 8 (6 is the middle value when the numbers are in order)

If you have n numbers in a group, the median is the (n + 1)/2 th value. For example, there are 7 numbers in the example above, so replace n by 7 and the median is the (7 + 1)/2 th value = 4th value. The 4th value is 6.

When dealing with a cumulative frequency curve, "n" is the cumulative frequency (25 in the above example). Therefore the median would be the 13th value. To find this, on the cumulative frequency curve, find 13 on the y-axis (which should be labelled cumulative frequency). The corresponding 'x' value is an estimate of the median.

Quartiles

If we divide a cumulative frequency curve into quarters, the value at the lower quarter is referred to as the lower quartile, the value at the middle gives the median and the value at the upper quarter is the upper quartile.

A set of numbers may be as follows: 8, 14, 15, 16, 17, 18, 19, 50. The mean of these numbers is 19.625. However, the extremes in this set (8 and 50) distort this value. The interquartile range is a method of measuring the spread of the middle 50% of the values and is useful since it ignores the extreme values.

Page 1 of 2Cum. Freq. Graphs

23/02/2009http://www.mathsrevision.net/gcse/pages.php?page=21

The lower quartile is the (n+1)/4 th value (n is the cumulative frequency, i.e. 157 in this case) and the upper quartile is the 3(n+1)/4 th value. The difference between these two is the interquartile range (IQR). In the above example, the upper quartile is the 118.5th value and the lower quartile is the 39.5th value. If we draw a cumulative frequency curve, we see that the lower quartile, therefore, is about 17 and the upper quartile is about 37. Therefore the IQR is 20 (bear in mind that this is a rough sketch; if you plot the values on graph paper you will get a more accurate value).
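The positions quoted above follow directly from the formulas. A tiny Python sketch (the function name `quartile_positions` is hypothetical) shows where to look on the cumulative frequency axis:

```python
def quartile_positions(n):
    """Positions of the lower quartile, median, and upper quartile among n values."""
    return (n + 1) / 4, (n + 1) / 2, 3 * (n + 1) / 4

positions = quartile_positions(157)
print(positions)  # (39.5, 79.0, 118.5)
```

These are positions on the cumulative frequency axis, not data values; the quartiles themselves still have to be read off the curve.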

Linear Regression Scatter Diagrams

We often wish to look at the relationship between two things (e.g. between a person's height and weight) by comparing data for each of these things. A good way of doing this is by drawing a scatter diagram.

"Regression" is the process of finding the function satisfied by the points on the scatter diagram. Of course, the points might not fit the function exactly but the aim is to get as close as possible. "Linear" means that the function we are looking for is a straight line (so our function f will be of the form f(x) = mx + c for constants m and c).

Here is a scatter diagram with a regression line drawn in:

Correlation

Correlation is a term used to describe how strong the relationship between the two variables appears to be.

We say that there is a positive linear correlation if y increases as x increases and we say there is a negative linear correlation if y decreases as x increases. There is no correlation if x and y do not appear to be related.

Explanatory and Response Variables

In many experiments, one of the variables is fixed or controlled and the point of the experiment is to determine how the other variable varies with the first. The fixed/controlled variable is known as the explanatory or independent variable and the other variable is known as the response or dependent variable.

I shall use "x" for my explanatory variable and "y" for my response variable, but I could have used any letters.

Regression Lines

By Eye

If there is very little scatter (we say there is a strong correlation between the variables), a regression line can be drawn "by eye". You should make sure that your line passes through the mean point (the point (x̄, ȳ), where x̄ is the mean of the data collected for the explanatory variable and ȳ is the mean of the data collected for the response variable).

Two Regression Lines

When there is a reasonable amount of scatter, we can draw two different regression lines depending upon which variable we consider to be the more accurate. The first is a line of regression of y on x, which can be used to estimate y given x. The other is a line of regression of x on y, used to estimate x given y.

If there is a perfect correlation between the data (in other words, if all the points lie on a straight line), then the two regression lines will be the same.

Page 1 of 2Linear Regression

23/02/2009http://www.mathsrevision.net/alevel/pages.php?page=61

Least Squares Regression Lines

This is a method of finding a regression line without estimating where the line should go by eye.

If the equation of the regression line is y = ax + b, we need to find what a and b are. We find these by solving the "normal equations".

Normal Equations

The "normal equations" for the line of regression of y on x are:

Σy = aΣx + nb and

Σxy = aΣx² + bΣx

The values of a and b are found by solving these equations simultaneously.

For the line of regression of x on y, the "normal equations" are the same but with x and y swapped.
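As a sketch of how the normal equations for y on x can be solved, the Python below eliminates b from the two equations and computes a and b directly. The function name and the five data points are invented for illustration; they are not from the notes.

```python
def regression_y_on_x(xs, ys):
    """Solve the normal equations for y = ax + b (regression of y on x)."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    # From Σy = aΣx + nb we get b = (Σy - aΣx)/n; substituting into
    # Σxy = aΣx² + bΣx gives the expression for a below.
    a = (n * sxy - sx * sy) / (n * sxx - sx ** 2)
    b = (sy - a * sx) / n
    return a, b

xs = [1, 2, 3, 4, 5]   # illustrative data only
ys = [2, 4, 5, 4, 5]
a, b = regression_y_on_x(xs, ys)
print(a, b)  # about 0.6 and 2.2
```

For these points the means are x̄ = 3 and ȳ = 4, and 0.6 × 3 + 2.2 = 4, so the fitted line passes through the mean point.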

MathsII-Statistics Standard-deviation

Comparing distributions

When comparing distributions, it is better to use a measure of spread or dispersion (such as standard deviation or semi-interquartile range) in addition to a measure of central tendency (such as mean, median or mode).

For example, the following two data sets are significantly different in nature and yet have the same mean, median and range. Some sort of numerical measure which distinguishes between them would be useful.

• 1, 7, 12, 15, 20, 22, 28
• 1, 15, 15, 15, 15, 16, 28

The standard deviation of the first set of data is significantly larger than the standard deviation of the second set of data (ie there is more spread about the mean in the first set of data).
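The claim about these two data sets can be checked with Python's standard `statistics` module. This is just a verification sketch; `stdev` is the sample standard deviation (the "n - 1" version).

```python
from statistics import mean, median, stdev

first = [1, 7, 12, 15, 20, 22, 28]
second = [1, 15, 15, 15, 15, 16, 28]

for name, data in (("first", first), ("second", second)):
    spread = max(data) - min(data)  # the range
    print(name, mean(data), median(data), spread, round(stdev(data), 2))
# first 15 15 27 9.24
# second 15 15 27 7.81
```

The mean, median and range agree, and only the standard deviation tells the two sets apart.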

The formulae

There are two formulae for standard deviation given in the formulae list in the Credit Level examination paper. The first of the two formulae is often referred to as the defining formula and shows more clearly that the standard deviation of a set of numbers is the square root of the average of the squares of differences between each of the numbers and the mean of the numbers:

s = √( Σ(x - x̄)² ÷ (n - 1) )

The second formula is a re-arrangement which may make it better for calculation purposes:

s = √( (Σx² - (Σx)² ÷ n) ÷ (n - 1) )

You may use either of the formulae; they'll give you the same answer.

More about formulae.

Comparing these formulae with standard deviation formulae in books or in your calculator, you may notice that sometimes the "n - 1" in the denominator is replaced by n.

When you're finding the standard deviation of a set of measures, which are only a sample of the total set of measures, then it's correct to use "n - 1". All examples in the exams will be of this type. When statisticians know they're working with the whole set or the population then they use "n" instead of "n - 1".

Remember

Σ means "sum of"

x̄ is the "mean"

Example

Question 1

Find the mean and standard deviation of the following numbers.

4, 7, 9, 11, 13, 15, 18

The Answer

Here are two ways of calculating the standard deviation, using formulae.

(i)

The mean is x̄ = 77 ÷ 7 = 11.

x      x - x̄    (x - x̄)²
4      -7       49
7      -4       16
9      -2       4
11     0        0
13     2        4
15     4        16
18     7        49

Σ(x - x̄)² = 138

s = √(138 ÷ 6) = √23 = 4.796, correct to 3 decimal places

If you're having a problem with this table, this is how it works.

• The first column lists the numbers.
• The second column finds the difference between each of the numbers and the mean.
• The third column squares these differences. This makes all of them positive numbers.

The next step is to find the average of these squared differences. In this case, add them up and divide by six (one less than the number of numbers).

The final step is to take the square root. This undoes the squaring we did earlier.

(ii)

x      x²
4      16
7      49
9      81
11     121
13     169
15     225
18     324

Σx = 77 and Σx² = 985

s = √( (985 - 77² ÷ 7) ÷ 6 ) = √(138 ÷ 6) = √23

So the standard deviation "s" = 4.796, using either of the formulae.
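To confirm that the two routes agree, here is a short Python sketch of both calculations, using the "n - 1" denominator as in the worked example. The function names are mine.

```python
from math import sqrt

def sd_defining(xs):
    """Square root of the sum of squared differences from the mean, over n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_rearranged(xs):
    """Same quantity computed from the totals Σx and Σx² only."""
    n = len(xs)
    return sqrt((sum(x * x for x in xs) - sum(xs) ** 2 / n) / (n - 1))

data = [4, 7, 9, 11, 13, 15, 18]
print(round(sd_defining(data), 3), round(sd_rearranged(data), 3))  # 4.796 4.796
```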

Now try a Test Bite

Statistics – Standard Deviation Test bite

Standard Deviation

1. Here are two sets of data:

• First set of data: 30, 35, 45, 50, 55, 65, 70
• Second set of data: 30, 50, 50, 50, 50, 50, 70

State whether the statements below are true or false.

(i). The mean, median and range of the two sets of data are the same. True / False

(ii). The standard deviation of the first set of data is larger than the standard deviation of the second set of data. True / False

2. Here are two sets of data:

• First set of data: 47, 48, 49, 50, 51, 52, 53
• Second set of data: 1, 10, 20, 50, 80, 90, 99

State whether the statements below are true or false.

(i). The mean, median and range of the two sets of data are the same. True / False

(ii). The standard deviation of the first set of data is larger than the standard deviation of the second set of data. True / False

3. Here is a set of 10 numbers.

2, 7, 5, 5, 3, 9, 10, 8, 12, 11

Choose the correct figures from the lists below.

(i). The mean of this set of numbers is:-

a. 72

b. 7.2

c. 7

d. 7.5

(ii). The standard deviation of the set of numbers is:-

a. 3.39

b. 3

c. 3.22

d. 0.2

4. Here is a set of 5 numbers.

100, 121, 123, 145, 152

Choose the correct figures from the lists below.

(i). The mean of this set of numbers is:-

a. 160.2

b. 130

c. 123

d. 128.2

(ii). The standard deviation of the set of numbers is:-

a. 0.75

b. 20.75

c. 2.75

d. 20

Mean, Median, Mode, and Range

Mean, median, and mode are three kinds of "averages". There are many "averages" in statistics, but these are, I think, the three most common, and are certainly the three you are most likely to encounter in your pre-statistics courses, if the topic comes up at all.

The "mean" is the "average" you're used to, where you add up all the numbers and then divide by the number of numbers. The "median" is the "middle" value in the list of numbers. To find the median, your numbers have to be listed in numerical order, so you may have to rewrite your list first. The "mode" is the value that occurs most often. If no number is repeated, then there is no mode for the list.

The "range" is just the difference between the largest and smallest values.

Find the mean, median, mode, and range for the following list of values:

13, 18, 13, 14, 13, 16, 14, 21, 13

The mean is the usual average, so:

(13 + 18 + 13 + 14 + 13 + 16 + 14 + 21 + 13) ÷ 9 = 15

Note that the mean isn't a value from the original list. This is a common result. You should not assume that your mean will be one of your original numbers.

The median is the middle value, so I'll have to rewrite the list in order:

13, 13, 13, 13, 14, 14, 16, 18, 21

There are nine numbers in the list, so the middle one will be the (9 + 1) ÷ 2 = 10 ÷ 2 = 5th number:

13, 13, 13, 13, 14, 14, 16, 18, 21

So the median is 14.

The mode is the number that is repeated more often than any other, so 13 is the mode.

The largest value in the list is 21, and the smallest is 13, so the range is 21 – 13 = 8.

mean: 15
median: 14
mode: 13
range: 8

Note: The formula for the place to find the median is "( [the number of data points] + 1) ÷ 2", but you don't have to use this formula. You can just count in from both ends of the list until you meet in the middle, if you prefer. Either way will work.
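If you want to check answers like these, Python's standard `statistics` module has functions for all three averages. This is only a checking aid I've added, not part of the method above.

```python
from statistics import mean, median, mode

values = [13, 18, 13, 14, 13, 16, 14, 21, 13]

summary = {
    "mean": mean(values),
    "median": median(values),            # sorts the list internally
    "mode": mode(values),                # the most frequent value
    "range": max(values) - min(values),
}
print(summary)  # {'mean': 15, 'median': 14, 'mode': 13, 'range': 8}
```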

Page 1 of 3

25/02/2009mhtml:file://G:\MyDocs\Personal\Maths\Int2M\Int2notes\Statistics\Mean-median-mo...

Find the mean, median, mode, and range for the following list of values:

1, 2, 4, 7

The mean is the usual average: (1 + 2 + 4 + 7) ÷ 4 = 14 ÷ 4 = 3.5

The median is the middle number. In this example, the numbers are already listed in numerical order, so I don't have to rewrite the list. But there is no "middle" number, because there are an even number of numbers. In this case, the median is the mean (the usual average) of the middle two values: (2 + 4) ÷ 2 = 6 ÷ 2 = 3

The mode is the number that is repeated most often, but all the numbers appear only once. Then there is no mode.

The largest value is 7, the smallest is 1, and their difference is 6, so the range is 6.

mean: 3.5
median: 3
mode: none
range: 6

The list values were whole numbers, but the mean was a decimal value. Getting a decimal value for the mean (or for the median, if you have an even number of data points) is perfectly okay; don't round your answers to try to match the format of the other numbers.

Find the mean, median, mode, and range for the following list of values:

8, 9, 10, 10, 10, 11, 11, 11, 12, 13

The mean is the usual average:

(8 + 9 + 10 + 10 + 10 + 11 + 11 + 11 + 12 + 13) ÷ 10 = 105 ÷ 10 = 10.5

The median is the middle value. In a list of ten values, that will be the (10 + 1) ÷ 2 = 5.5th value; that is, I'll need to average the fifth and sixth numbers to find the median:

(10 + 11) ÷ 2 = 21 ÷ 2 = 10.5

The mode is the number repeated most often. This list has two values that are repeated three times.

The largest value is 13 and the smallest is 8, so the range is 13 – 8 = 5.

mean: 10.5
median: 10.5
modes: 10 and 11
range: 5

While unusual, it can happen that two of the averages (the mean and the median, in this case) will have the same value.

Note: Depending on your text or your instructor, the above data set may be viewed as having no mode (rather than two modes), since no single solitary number was repeated more often than any other. I've seen books that go either way; there doesn't seem to be a consensus on the "right" definition of "mode" in the above case. So if you're not certain how you should answer the "mode" part of the above example, ask your instructor before the next test.
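Software has to pick a convention too. As one illustration, `multimode` in Python's standard `statistics` module (Python 3.8 or later) returns every value tied for most frequent, which is yet another way of answering the question:

```python
from statistics import multimode

two_modes = multimode([8, 9, 10, 10, 10, 11, 11, 11, 12, 13])
no_repeats = multimode([1, 2, 4, 7])

print(two_modes)   # [10, 11]
print(no_repeats)  # [1, 2, 4, 7]
```

Note that for the list with no repeated values this function reports all four numbers, where the examples above say "no mode". Neither answer is a bug; they follow different definitions.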

About the only hard part of finding the mean, median, and mode is keeping straight which "average" is which. Just remember the following:

mean: regular meaning of "average"
median: middle value
mode: most often

(In the above, I've used the term "average" rather casually. The technical definition of "average" is the arithmetic mean: adding up the values and then dividing by the number of values. Since you're probably more familiar with the concept of "average" than with "measure of central tendency", I used the more comfortable term.)

A student has gotten the following grades on his tests: 87, 95, 76, and 88. He wants an 85 or better overall. What is the minimum grade he must get on the last test in order to achieve that average?

The unknown score is "x". Then the desired average is:

(87 + 95 + 76 + 88 + x) ÷ 5 = 85

Multiplying through by 5 and simplifying, I get:

87 + 95 + 76 + 88 + x = 425
346 + x = 425
x = 79

He needs to get at least a 79 on the last test.
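The same rearrangement can be sketched in Python. This is only an illustration of the algebra above; the names `scores`, `target_average`, and `needed` are mine.

```python
# Scores so far and the target average, taken from the exercise above.
scores = [87, 95, 76, 88]
target_average = 85
total_tests = len(scores) + 1  # the four tests taken plus the last one

# (sum of scores + x) / total_tests = target_average, solved for x
needed = target_average * total_tests - sum(scores)

print(needed)  # 79
```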

Original URL: http://www.purplemath.com/modules/meanmode.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved. Terms of Use: http://www.purplemath.com/terms.htm


Probability Introduction

Probability is the likelihood or chance of an event occurring.

Probability = (the number of ways of achieving success) ÷ (the total number of possible outcomes)

For example, the probability of flipping a coin and it being heads is ½, because there is 1 way of getting a head and the total number of possible outcomes is 2 (a head or tail). We write P(heads) = ½.

The probability of something which is certain to happen is 1. The probability of something which is impossible to happen is 0. The probability of something not happening is 1 minus the probability that it will happen.

Single Events

Example

There are 6 beads in a bag: 3 are red, 2 are yellow and 1 is blue. What is the probability of picking a yellow? The probability is the number of yellows in the bag divided by the total number of beads, i.e. 2/6 = 1/3.

Example

There is a bag full of coloured balls: red, blue, green and orange. Balls are picked out and replaced. John did this 1000 times and obtained the following results:

Number of blue balls picked out: 300
Number of red balls: 200
Number of green balls: 450
Number of orange balls: 50

a) What is the probability of picking a green ball?
b) If there are 100 balls in the bag, how many of them are likely to be green?

a) For every 1000 balls picked out, 450 are green. Therefore P(green) = 450/1000 = 0.45
b) The experiment suggests that 450 out of 1000 balls are green. Therefore, out of 100 balls, 45 are likely to be green (using ratios).
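The estimate in this example is a relative frequency, and it can be sketched in a few lines of Python. The dictionary `counts` and the other names are illustrative choices, not part of the original example; `Fraction` is used only to keep the arithmetic exact.

```python
from fractions import Fraction

# Results of John's 1000 draws (with replacement), from the example above.
counts = {"blue": 300, "red": 200, "green": 450, "orange": 50}
total_draws = sum(counts.values())

# The relative frequency of green is the estimate of P(green).
p_green = Fraction(counts["green"], total_draws)

# Scale the estimate to a bag of 100 balls.
expected_green = p_green * 100

print(float(p_green), expected_green)  # 0.45 45
```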

Multiple Events


Possibility Spaces

When working out what the probability of two things happening is, a probability/ possibility space can be drawn. For example, if you throw two dice, what is the probability that you will get: a) 8, b) 9, c) either 8 or 9?

a) The black blobs indicate the ways of getting 8 (a 2 and a 6, a 3 and a 5, ...). There are 5 different ways. The probability space shows us that when throwing 2 dice, there are 36 different possibilities (36 squares). With 5 of these possibilities, you will get 8. Therefore P(8) = 5/36.

b) The red blobs indicate the ways of getting 9. There are four ways, therefore P(9) = 4/36 = 1/9.

c) You will get an 8 or 9 in any of the 'blobbed' squares. There are 9 altogether, so P(8 or 9) = 9/36 = 1/4.
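The possibility space for two dice is small enough to list by machine. The following is a minimal sketch in Python, assuming two fair six-sided dice; the helper `probability` is a hypothetical name introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

# Every (first die, second die) pair: the 36 squares of the possibility space.
space = list(product(range(1, 7), repeat=2))

def probability(event):
    """Fraction of the equally likely outcomes for which event(outcome) is true."""
    favourable = sum(1 for outcome in space if event(outcome))
    return Fraction(favourable, len(space))

p8 = probability(lambda roll: sum(roll) == 8)
p9 = probability(lambda roll: sum(roll) == 9)
p8_or_9 = probability(lambda roll: sum(roll) in (8, 9))

print(p8, p9, p8_or_9)  # 5/36 1/9 1/4
```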

Probability Trees

Another way of representing 2 or more events is on a probability tree.

Example

There are 3 balls in a bag: red, yellow and blue. One ball is picked out, and not replaced, and then another ball is picked out.


The first ball can be red, yellow or blue. The probability is 1/3 for each of these. If a red ball is picked out, there will be two balls left, a yellow and blue. The probability the second ball will be yellow is 1/2 and the probability the second ball will be blue is 1/2. The same logic can be applied to the cases of when a yellow or blue ball is picked out first. In this example, the question states that the ball is not replaced. If it were, the probability of picking a red ball (etc.) the second time would be the same as the first (i.e. 1/3).

The AND and OR rules

In the above example, the probability of picking a red first is 1/3 and a yellow second is 1/2. The probability that a red AND then a yellow will be picked is 1/3 × 1/2 = 1/6 (this is shown at the end of the branch). The probability of picking a red OR yellow first is 1/3 + 1/3 = 2/3. When the word 'and' is used we multiply. When 'or' is used, we add. On a probability tree, when moving from left to right we multiply and when moving down we add.

Example

What is the probability of getting a yellow and a red in any order? This is the same as: what is the probability of getting a yellow AND a red OR a red AND a yellow?

P(yellow and red) = 1/3 × 1/2 = 1/6
P(red and yellow) = 1/3 × 1/2 = 1/6


P(yellow and red or red and yellow) = 1/6 + 1/6 = 1/3
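The AND and OR rules in this example can be checked by listing every path through the tree. This is a sketch in Python under the example's assumptions (three balls, two picks, no replacement); the variable names are my own.

```python
from fractions import Fraction
from itertools import permutations

balls = ["red", "yellow", "blue"]

# Each ordered pair (first pick, second pick) without replacement is one
# complete path through the tree; all six paths are equally likely.
paths = list(permutations(balls, 2))
p_path = Fraction(1, 3) * Fraction(1, 2)  # AND rule: multiply along a branch

# OR rule: add the probabilities of the paths that match.
matching = [p for p in paths if set(p) == {"yellow", "red"}]
p_yellow_and_red_any_order = p_path * len(matching)

print(len(paths), p_path, p_yellow_and_red_any_order)  # 6 1/6 1/3
```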


Bar Chart

A bar chart is a chart where the height of each bar represents the frequency. The data is 'discrete' (discontinuous), unlike in histograms, where the data is continuous. The bars should be separated by small gaps.

Pie Chart

A pie chart is a circle which is divided into a number of parts.

The pie chart above shows the TV viewing figures for the following TV programmes:

Eastenders, 15 million
Casualty, 10 million
Peak Practice, 5 million
The Bill, 8 million

The total number of viewers for the four programmes is 38 million. To work out the angle that 'Eastenders' will have in the pie chart, we divide 15 by 38 and multiply by 360 (degrees). This is 142 degrees. So 142 degrees of the circle represents Eastenders. Similarly, 95 degrees of the circle is Casualty, 47 degrees is Peak Practice and the remaining 76 degrees is The Bill.
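The angle calculation is the same for every slice, so it can be sketched as a short Python loop. The dictionary name `viewers` is an illustrative choice; the figures are the ones quoted above.

```python
# Viewing figures in millions, from the example above.
viewers = {"Eastenders": 15, "Casualty": 10, "Peak Practice": 5, "The Bill": 8}
total = sum(viewers.values())

# Each programme's share of the 360 degrees, rounded to the nearest degree.
angles = {name: round(count / total * 360) for name, count in viewers.items()}

print(angles)
# {'Eastenders': 142, 'Casualty': 95, 'Peak Practice': 47, 'The Bill': 76}
```

With other data, rounded angles will not always add up to exactly 360, so it is worth checking the total before drawing.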


Scatterplots and Regressions (page 1 of 4)

Real life is messy, so it is expected that measurements taken from real life will be messy as well. When you graph measurements of real life, it is expected that the dots won't line up exactly in a nice neat line, but will instead form a scattering of dots which, at best, might suggest a nice neat line. These dots are called a scatterplot.

Create a scatterplot from the following data:

(1, 49), (3, 51), (4, 52), (6, 52), (6, 53), (7, 53), (8, 54), (11, 56), (12, 56), (14, 57), (14, 58), (17, 59), (18, 59), (20, 60), (20, 61)

One of the first things I have to do when graphing these points is figure out what my axis scale values are going to be. If I try doing an axis system with the "standard" –10 to 10 values, none of the above points will even show up on my graph. As is common with these sorts of data sets, all the x- and y-values are positive, so I only really need scales for the first quadrant. The y-values are much larger than the x-values, but instead of squeezing all the y-values together, I'll spread them out (so I can see them better) by using an interrupted scale.

The little "hicky-bob" at the bottom of my y-axis above shows that I've skipped some of the scale values. For some reason, this broken-axis notation seems almost never to be taught in schools, though it is very commonly used in "the real world". If you read financial journals, you're very likely to see many graphs with this sort of axis notation. If you use this notation in your homework, don't be surprised if you have to explain it to your instructor.

You'll probably be expected to do your scatterplots in your graphing calculator. My calculator gives me this picture:


You will often need to adjust your WINDOW settings in order to have all your data points show up on the screen. I used window settings of 0 < X < 25 with an X-scale of 5 and 45 < Y < 65 with a Y-scale of 5 for the above graph.

When you're done with the scatterplot, don't forget to turn the STATPLOT "off", or the parameters for the statistics graphing could mess with your regular graphing utility.

I will give you fair warning now: It has become fashionable to insert the topic of scatterplots and regressions into algebra and other non-statistics classes, and to require students to use a graphing calculator to answer questions. While they may give you the slope formula and the Quadratic Formula and all sorts of other stuff on the test (even though you should have memorized them), they will NOT give you help with your calculator. They often don't seem to care if you've learned the math, but you had gosh-darned better know your calculator! So pull out your owners manual, or go to the manufacturer's web site, or search online, or get together with a friend NOW, because if you're doing this stuff in class, you ARE going to have to know it, and know it well, on the test.

Original URL: http://www.purplemath.com/modules/scattreg.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved. Terms of Use: http://www.purplemath.com/terms.htm


Scatterplots and Regressions (page 2 of 4)

You may be asked about "correlation". Correlation can be used in at least two different ways: to refer to how well an equation matches the scatterplot, or to refer to the way in which the dots line up. If you're asked about "positive" or "negative" correlation, they're using the second definition, and they're asking if the dots line up with a positive or a negative slope, respectively. If you can't plausibly put a line through the dots, if the dots are just an amorphous cloud of specks, then there is probably no correlation.

Tell whether the data graphed in the following scatterplots appear to have positive, negative, or no correlation.

Plot A: Low x-values correspond to high y-values, and high x-values correspond to low y-values. If I put a line through the dots, it would have a negative slope. This scatterplot shows a negative correlation.

Plot B: Low x-values correspond to low y-values, and high x-values correspond to high y-values. If I put a line through the dots, it would have a positive slope. This scatterplot shows a positive correlation.

Plot C: There doesn't seem to be any trend to the dots; they're just all over the place. This scatterplot shows no correlation.

Plot D: I might think that this plot shows a correlation, because I can clearly put a line through the dots. But the line would be horizontal, thus having a slope value of zero. These dots actually show that whatever is being measured on the x-axis has no bearing on whatever is being measured on the y-axis, because the value of x has no effect on the value of y. So even though I could draw a line through these points, this scatterplot still shows no correlation.

(Figure: four scatterplots, labelled Plot A, Plot B, Plot C, and Plot D.)


You may also be asked about "outliers", which are the dots that don't seem to fit with the rest of the dots. (There are more technical definitions of "outliers", but they will have to wait until you take statistics classes.) Maybe you dropped the crucible in chem lab, or maybe you should never have left your idiot lab partner alone with the Bunsen burner in the middle of the experiment. Whatever the cause, having outliers means you have points that don't line up with everything else.

Identify any points that appear to be outliers.

Most of the points seem to line up in a fairly straight line, but the dot at (6, 7) is way off to the side from the general trend-line of the points.

The outlier is the point at (6, 7).

Usually you'll be working with scatterplots where the dots line up in some sort of vaguely straight row. But you shouldn't expect everything to line up nice and neat, especially in "real life" (like, for instance, in a physics lab). And sometimes you'll need to pick a different sort of equation as a model, because the dots line up, but not in a straight line.

Tell which sort of equation you think would best model the data in the following scatterplots, and why.

Graph A: The dots look like they line up fairly straight, so a linear model would probably work well.

Graph B: The dots here do line up, but as more of a curvy line. A quadratic model might work better.

Graph C: The dots are very close to the x-axis, and then they shoot up, so an exponential or power-function model might work better here.

In general, expect only to need to recognize linear (straight-line) versus quadratic (curvy-line) models, and never anything that you haven't already covered in class. For instance, if you haven't done logs yet, you won't be expected to recognize the need for a logarithmic model for a given scatterplot. The next lesson explains how to define these models, called "regressions".

Original URL: http://www.purplemath.com/modules/scattreg2.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Scatterplots and Regressions (page 3 of 4)

The point of collecting data and plotting the collected values is usually to try to find a formula that can be used to model a (presumed) relationship. I say "presumed" because the researcher may end up concluding that there isn't really any relationship where he'd hoped there was one. For instance, you could run experiments timing a ball as it drops from various heights, and you would be able to find a definite relationship between "the height from which I dropped the ball" and "the time it took to hit the floor". On the other hand, you could collect reams of data on the colors of people's eyes and the colors of their cars, only to discover that there is no discernable connection between the two data sets.

The process of taking your data points and coming up with an equation is called "regression", and the graph of the "regression equation" is called "the regression line". If you're doing your scatterplots by hand, you may be told to find a regression equation by putting a ruler against the first and last dots in the plot, drawing a line, and guessing the line's equation from the picture. This is an incredibly clumsy way to proceed, and can give very wrong answers, especially since values at the ends often turn out to be outliers (numbers that don't quite fit with everything else).

For instance, suppose your dots look like this:

Connecting the first and last points, you would end up with this:

On the other hand, you could ignore the outliers and instead just eyeball the cloud of dots to locate a general trend. Put the ruler about where you think a line ought to go (regardless of whether the ruler actually crosses any of the dots), draw the line, and guess the equation from that. You'll likely end up with a more sensible result. Your equation will still be guess-work, but it'll be better guess-work than using only the first and last points:


If you're finding regression equations with a ruler, you'll need to work extremely neatly, of course, and using graph paper would probably be a really good idea. Once you've drawn in your line (and this will only work for linear, or straight-line, regressions), you will estimate two points on the line that seem to be close to where the gridlines intersect, and then find the line equation through those two points. From the above graph, I would guess that the line goes close to the points (3, 7) and (19, 1), so the regression equation would be y = (–3/8)x + 65/8.
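Finding the line through the two estimated points is ordinary slope-intercept work, and it can be sketched in Python as follows. The function `line_through` is a hypothetical helper written for this illustration, and `Fraction` is used so the answer comes out as exact fractions.

```python
from fractions import Fraction

def line_through(p1, p2):
    """Return (slope, intercept) of the non-vertical line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    slope = Fraction(y2 - y1, x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

slope, intercept = line_through((3, 7), (19, 1))
print(slope, intercept)  # -3/8 65/8
```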

Most likely, though, you'll be doing regressions in your calculator. Doing regressions properly is a difficult and technical process, but your graphing calculator has been programmed with the necessary formulas and has the memory to crunch the many numbers. The calculator will give you "the" regression line. If you're working by hand, you and your classmates will get slightly different answers; if you're using calculators, you'll all get the same answer. (Consult your owners manual or calculator web sites for specific information on doing regressions with your particular calculator model.)

If you're supposed to report how "good" a given regression is, then figure out how to find the "r", "r²", and/or "R²" values in your calculator. These diagnostic tools measure the degree to which the regression equation matches the scatterplot. The closer these correlation values are to 1 (or to –1), the better a fit your regression equation is to the data values. If the correlation value is more than 0.8 or less than –0.8, the match is judged to be pretty good; if the value is between –0.5 and 0.5, the match is judged to be pretty poor; and a correlation value close to zero means you're kidding yourself if you think there's really a relationship of the type you're looking for. (There should be instructions, somewhere in your owners manual, for finding this information.) When you're doing a regression, you're trying to find the "best fit" line to the data, and the correlation numbers help you to tell how good your "fit" is.

Given the following data values, find the linear and cubic regression lines. Say which regression is a better fit, and why.

(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59), (17, 76), (20, 107), (22, 120), (23, 131), (27, 182)

After plugging these values into the STAT utility of my calculator, I can then do a linear regression:

...and a cubic regression:


The line looks a little curvy on the scatterplot, so it's reasonable that the curvy line, the cubic y = 0.000829x³ + 0.23x² – 1.09x + 24.60, is a better fit to the data points than the straight-line linear model y = 6.03x – 10.64.

Since the correlation value is closer to 1 for the cubic and since the graph of the cubic model is closer to the dots, the cubic equation y = 0.000829x³ + 0.23x² – 1.09x + 24.60 is the better regression.

You shouldn't expect, by the way, always to get correlation values that are close to "1". If they tell you to find, say, the linear regression equation for a data set, and the correlation factor is close to zero, this doesn't mean that you've found the "wrong" linear equation; it only means that a linear equation probably wasn't a good model to the data. A quadratic model, for instance, might have been better.
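If you're curious what the calculator is doing for the linear case, here is a rough sketch in Python of the usual least-squares formulas, applied to the data from the linear-versus-cubic example. I'm assuming the calculator uses ordinary least squares; the variable names are mine, and the cubic fit is not attempted here.

```python
from math import sqrt

# Data points from the linear-versus-cubic example above.
points = [(2, 23), (3, 24), (8, 32), (10, 36), (13, 51), (14, 59),
          (17, 76), (20, 107), (22, 120), (23, 131), (27, 182)]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Sums of squared and cross deviations from the means.
sxx = sum((x - mean_x) ** 2 for x, _ in points)
syy = sum((y - mean_y) ** 2 for _, y in points)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)

slope = sxy / sxx                     # least-squares slope
intercept = mean_y - slope * mean_x   # the line passes through (mean_x, mean_y)
r = sxy / sqrt(sxx * syy)             # correlation coefficient

print(f"y = {slope:.2f}x + ({intercept:.2f}), r = {r:.3f}")
```

The slope and intercept should come out as about 6.03 and –10.64, matching the linear model quoted above.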

Original URL: http://www.purplemath.com/modules/scattreg3.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved. Terms of Use: http://www.purplemath.com/terms.htm


Scatterplots and Regressions (page 4 of 4)

Until (and unless) you get into a statistics class, the preceding pages cover pretty much all there is to scatterplots and regressions. You draw the dots (or enter them into your calculator), you eyeball a line (or find one in the calculator), and you see how well the line fits the dots. About the only other thing you might do is "extrapolate" and "interpolate".

Remember that the point of all this data-collection, dot-drawing, and regression-computing was to try to find a formula that models... whatever it is that they're measuring. You can use these models to try to find missing data points or to try to project into the future (or, sometimes, into the past).

If you have data, say, for the years 1950, 1960, 1970, and 1980, and you find a model for your data, you might use it to guess at values between these dates. For instance, given Namibian population data for the listed years, you might try to guess the population of Namibia in 1965. The prefix "inter" means "between", so this guessing-between-the-points would be interpolation. On the other hand, you might try to work backwards to guess the population in 1940, or try to fill in the missing data up through 2000. The prefix "extra" means "outside", so this guessing-outside-the-points would be extrapolation.

Find a regression equation for the following population data, using t = 0 to stand for 1950. Then estimate the population of Namibia in the years 1940, 1997, and 2005. Note: Population values are in thousands.

year t    pop.
0         511
5         561
10        625
15        704
20        800
25        921
30        1 018
35        1 142
40        1 409
45        1 646
50        1 894

Setting my window range as 0 < X < 55, counting by 5's, and 500 < Y < 2000, counting by 250's, my calculator gives me the following scatterplot:

The dots look like they line up in a curve, so I'll try a quadratic regression. The calculator gives me:


As you can see, I've set the calculator to "DiagnosticsOn", so it displays the correlation value whenever I do a regression. This regression looks pretty darned good, especially when it's graphed with the data values:

...so I'll use this model for my computations.

Now that I have an equation for modelling Namibia's population, I can use it to estimate the population in the given years. For 1940, I'll use t = –10, since this is ten years before 1950. (This is an extrapolated value, since I'm going outside the data set.)

f(–10) = 0.4958(–10)² + 1.9389(–10) + 538.6993 = 568.8903

For 2005, I'll use t = 55; this will be another extrapolated value.

f(55) = 0.4958(55)² + 1.9389(55) + 538.6993 = 2145.1338

For 1997, I'll use t = 47. Since this value is between known values, this will be an interpolated answer.

f(47) = 0.4958(47)² + 1.9389(47) + 538.6993 = 1725.0498

Remembering that the population values are in thousands, I'll add three zeroes to my numbers and round to get my final answers:

The estimated value for the population in 1940 is about 569 000; for 2005, the estimated value is about 2.15 million; and for 1997, the estimated value is about 1.73 million.
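The three evaluations above can be sketched in Python. The coefficients are the ones used in the computations above; the function name `population_model` and the interpolated/extrapolated labelling are my own additions for illustration.

```python
def population_model(t):
    """Quadratic model from the worked example; t is years since 1950,
    and the result is in thousands of people."""
    return 0.4958 * t**2 + 1.9389 * t + 538.6993

for year in (1940, 1997, 2005):
    t = year - 1950
    # The data run from t = 0 to t = 50; anything outside is an extrapolation.
    kind = "interpolated" if 0 <= t <= 50 else "extrapolated"
    print(year, round(population_model(t), 4), kind)
```

The further an extrapolated year is from the data, the less the estimate should be trusted, since the model was fitted only to 1950 through 2000.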

Depending on your calculator, you may need to memorize what the regression values mean. On my old TI-85, the regression screen would list values for a and b for a linear regression. But I had to memorize that the related regression equation was "a + bx", instead of the "ax + b" that I would otherwise have expected, because the screen didn't say. If you need to memorize this sort of information, do it now, because the teacher will not bail you out if you forget on the test what your calculator's variables mean.

Original URL: http://www.purplemath.com/modules/scattreg4.htm


Standard Deviation

The standard deviation measures the spread of the data about the mean value. It is useful in comparing sets of data which may have the same mean but a different range. For example, the mean of the following two is the same: 15, 15, 15, 14, 16 and 2, 7, 14, 22, 30. However, the second is clearly more spread out. If a set has a low standard deviation, the values are not spread out too much.

Just like when working out the mean, the method is different if the data is given to you in groups.

Non-Grouped Data

Non-grouped data is just a list of values. The standard deviation is given by the formula:

$$\sigma = \sqrt{\frac{\sum (x - \bar{x})^2}{n}}$$

σ means 'standard deviation'. Σ means 'the sum of'.

x̄ means 'the mean'.

Example

Find the standard deviation of 4, 9, 11, 12, 17, 5, 8, 12, 14

First work out the mean: 10.222

Now, subtract the mean individually from each of the numbers given and square the result. This is equivalent to the (x - x̄)² step. x refers to the values given in the question.

x      (x - x̄)²
4      38.7
9      1.49
11     0.60
12     3.16
17     45.9
5      27.3
8      4.94
12     3.16
14     14.3

Now add up these results (this is the 'sigma' in the formula): 139.55

Divide by n. n is the number of values, so in this case is 9. This gives us: 15.51


And finally, square root this: 3.94

The standard deviation can usually be calculated much more easily with a calculator and this may be acceptable in some exams. On my calculator, you go into the standard deviation mode (mode '.'). Then type in the first value, press 'data', type in the second value, press 'data'. Do this until you have typed in all the values, then press the standard deviation button (it will probably have a lower case sigma on it). Check your calculator's manual to see how to calculate it on yours.

NB: If you have a set of numbers (e.g. 1, 5, 2, 7, 3, 5 and 3), if each number is increased by the same amount (e.g. to 3, 7, 4, 9, 5, 7 and 5), the standard deviation will be the same and the mean will have increased by the amount each of the numbers were increased by (2 in this case). This is because the standard deviation measures the spread of the data. Increasing each of the numbers by 2 does not make the numbers any more spread out, it just shifts them all along.
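The worked example and the NB can both be checked with a short Python sketch. It follows the "divide by n" method used above, which is what `statistics.pstdev` (the population standard deviation) computes; the variable names are illustrative.

```python
from math import sqrt
from statistics import pstdev

values = [4, 9, 11, 12, 17, 5, 8, 12, 14]

# Step by step, as in the worked example.
n = len(values)
mean = sum(values) / n
squared_deviations = [(x - mean) ** 2 for x in values]
sigma = sqrt(sum(squared_deviations) / n)

print(round(mean, 3), round(sigma, 2))  # 10.222 3.94

# pstdev is the "divide by n" (population) standard deviation.
print(round(pstdev(values), 2))  # 3.94

# Adding 2 to every value shifts the mean but leaves the spread unchanged.
data = [1, 5, 2, 7, 3, 5, 3]
shifted = [x + 2 for x in data]
print(abs(pstdev(data) - pstdev(shifted)) < 1e-12)  # True
```

Be aware that `statistics.stdev` (without the "p") divides by n - 1 instead of n, so it would give a slightly larger answer than the method shown here.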

Grouped Data

When dealing with grouped data, such as the following:

x    f
4    9
5    14
6    22
7    11
8    17

the formula for standard deviation becomes:

$$\sigma = \sqrt{\frac{\sum f (x - \bar{x})^2}{\sum f}}$$

Try working out the standard deviation of the above data. You should get an answer of 1.32.

You may be given the data in the form of groups, such as:

Number       Frequency
3.5 - 4.5    9
4.5 - 5.5    14
5.5 - 6.5    22
6.5 - 7.5    11
7.5 - 8.5    17

In such a circumstance, x is the midpoint of groups.
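The grouped-data calculation can be sketched in Python as a frequency-weighted version of the earlier method. This assumes the same "divide by the total frequency" approach; the dictionary names `table` and `groups` are hypothetical names chosen for the illustration.

```python
from math import sqrt

# Frequency table from above: value x occurs f times.
table = {4: 9, 5: 14, 6: 22, 7: 11, 8: 17}

n = sum(table.values())                                    # total frequency
mean = sum(x * f for x, f in table.items()) / n            # weighted mean
variance = sum(f * (x - mean) ** 2 for x, f in table.items()) / n
sigma = sqrt(variance)

print(n, round(sigma, 2))  # 73 1.32

# For data given in groups, use each group's midpoint as x.
groups = {(3.5, 4.5): 9, (4.5, 5.5): 14, (5.5, 6.5): 22, (6.5, 7.5): 11, (7.5, 8.5): 17}
midpoints = {(low + high) / 2: f for (low, high), f in groups.items()}
print(midpoints == table)  # True
```

Because the midpoints of the groups shown are 4, 5, 6, 7 and 8, the grouped version of this data gives the same standard deviation as the first table.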


Stem-and-Leaf Plots (page 1 of 2)

Stem-and-leaf plots are a method for showing the frequency with which certain classes of values occur. You could make a frequency distribution table or a histogram for the values, or you can use a stem-and-leaf plot and let the numbers themselves show pretty much the same information.

For instance, suppose you have the following list of values: 12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41. You could make a frequency distribution table showing how many tens, twenties, thirties, and forties you have:

Frequency Class    Frequency
10 - 19            2
20 - 29            2
30 - 39            4
40 - 49            3

You could make a histogram, which is a bar-graph showing the number of occurrences, with the classes being numbers in the tens, twenties, thirties, and forties:

(The shading of the bars in a histogram isn't necessary, but it can be helpful by making the bars easier to see, especially if you can't use color to differentiate the bars.)

The downside of frequency distribution tables and histograms is that, while the frequency of each class is easy to see, the original data points have been lost. You can tell, for instance, that there must have been three listed values that were in the forties, but there is no way to tell from the table or from the histogram what those values might have been.

On the other hand, you could make a stem-and-leaf plot for the same data:


The "stem" is the left-hand column which contains the tens digits. The "leaves" are the lists in the right-hand column, showing all the ones digits for each of the tens, twenties, thirties, and forties. As you can see, the original values can still be determined; you can tell, from that bottom leaf, that the three values in the forties were 40, 40, and 41.

Note that the horizontal leaves in the stem-and-leaf plot correspond to the vertical bars in the histogram, and the leaves have lengths that equal the numbers in the frequency table.
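The construction described above (tens digits as the stem, ones digits as the leaves) can be sketched in Python. The function `stem_and_leaf` is a hypothetical helper written for this illustration, and it assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the rows of a stem-and-leaf plot: tens digits as stems,
    ones digits as leaves. Assumes non-negative whole numbers."""
    leaves = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    return [f"{stem} | {' '.join(str(leaf) for leaf in leaves[stem])}"
            for stem in sorted(leaves)]

rows = stem_and_leaf([12, 13, 21, 27, 33, 34, 35, 37, 40, 40, 41])
print("\n".join(rows))
# 1 | 2 3
# 2 | 1 7
# 3 | 3 4 5 7
# 4 | 0 0 1
```

The number of leaves in each row matches the frequency table (2, 2, 4, 3). This sketch skips any stem that has no values, which a hand-drawn plot would normally still show as an empty row.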

That's pretty much all there is to a stem-and-leaf plot. You're just listing out how many entries you have in certain classes of numbers, and what those entries are. Here are some more examples of stem-and-leaf plots, containing a few additional details.

Complete a stem-and-leaf plot for the following list of grades on a recent test:

73, 42, 67, 78, 99, 84, 91, 82, 86, 94

I'll use the tens digits as the stem values and the ones digits as the leaves. For convenience's sake, I'll order the list, but this is not required:

42, 67, 73, 78, 82, 84, 86, 91, 94, 99

Since I know where these data points came from ("a recent test"), I'll use a title. Then my plot looks like this:

The above is the simplest case for stem-and-leaf plots, but even the "complicated" cases aren't much more complex.

Original URL: http://www.purplemath.com/modules/stemleaf.htm

Copyright 2006 Elizabeth Stapel; All Rights Reserved.


Stem-and-Leaf Plots: Examples (page 2 of 2)

Subjects in a psychological study were timed while completing a certain task. Complete a stem-and-leaf plot for the following list of times:

7.6, 8.1, 9.2, 6.8, 5.9, 6.2, 6.1, 5.8, 7.3, 8.1, 8.7, 7.4, 7.8, 8.2

First, I'll reorder this list:

5.8, 5.9, 6.1, 6.2, 6.8, 7.3, 7.4, 7.6, 7.8, 8.1, 8.1, 8.2, 8.7, 9.2

These values have one decimal place, but the stem-and-leaf plot makes no accommodation for this. The stem-and-leaf plot only looks at the last digit (for the leaves) and all the digits before (for the stem). So I'll have to put a "key" or legend on this plot to show what I mean by the numbers in this plot. The ones digits will be the stem values, and the tenths will be the leaves.

Properly, every stem-and-leaf plot should have a key.

Complete a stem-and-leaf plot for the following two lists of class sizes:

Economics 101: 9, 13, 14, 15, 16, 16, 17, 19, 20, 21, 21, 22, 25, 25, 26 Libertarianism: 14, 16, 17, 18, 18, 20, 20, 24, 29

This example has two lists of values. Since the values are similar, I can plot them all on one stem-and-leaf plot by drawing leaves on either side of the stem. I will use the tens digits as the stem values, and the ones digits as the leaves. Since "9" (in the Econ 101 list) has no tens digit, the stem value will be "0".


Complete a stem-and-leaf plot for the following list of values:

100, 110, 120, 130, 130, 150, 160, 170, 170, 190, 210, 230, 240, 260, 270, 270, 280, 290, 290

Since all the ones digits are zeroes, I'll do this plot with the hundreds digits being the stem values and the tens digits being the leaves. I can do the plot like this:

...but the leaves are fairly long this way, because the values are so close together. To spread the values out a bit, I can break each leaf into two. For instance, the leaf for the two-hundreds class can be split into two classes, being the numbers between 200 and 240 and the numbers between 250 and 290. I can also reverse the order, so the smaller values are at the bottom of the "stem". The new plot looks like this:

For very compact data points, you can even split the leaves into five classes, like this:


Complete a stem-and-leaf plot for the following list of values:

23.25, 24.13, 24.76, 24.81, 24.98, 25.31, 25.57, 25.89, 26.28, 26.34, 27.09

If I try to use the last digit, the hundredths digit, for these numbers, the stem-and-leaf plot will be enormously long, because these values are so spread out. (With the numbers' first three digits ranging from 232 to 270, I'd have thirty-nine leaves, most of which would be empty.) So instead of working with the given numbers, I'll round each of the numbers to the nearest tenth, and then use those new values for my plot. Rounding gives me the following list:

23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1

Then my plot looks like this:

Naturally, when you're drawing a stem-and-leaf plot, you should use a ruler to construct a neat table, and you should label everything clearly.
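The rounding step in the last example can be sketched in Python. I'm assuming the usual classroom rule that a 5 rounds up, which is what the `decimal` module's `ROUND_HALF_UP` does; the values are entered as strings so they are read exactly as written, and the names `raw` and `rounded` are mine.

```python
from decimal import Decimal, ROUND_HALF_UP

raw = ["23.25", "24.13", "24.76", "24.81", "24.98", "25.31",
       "25.57", "25.89", "26.28", "26.34", "27.09"]

# Decimal with ROUND_HALF_UP rounds 23.25 up to 23.3, as in the list above.
# Python's built-in round() uses round-half-to-even on binary floats, so it can differ.
rounded = [Decimal(v).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP) for v in raw]

print(", ".join(str(r) for r in rounded))
# 23.3, 24.1, 24.8, 24.8, 25.0, 25.3, 25.6, 25.9, 26.3, 26.3, 27.1
```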

Original URL: http://www.purplemath.com/modules/stemleaf.htm
