© Copyright McGraw-Hill 2004 3-1 CHAPTER 3 Data Description


Upload: maurice-palmer

Post on 29-Dec-2015



CHAPTER 3

Data Description


Objectives

Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.

Describe data using the measures of variation, such as the range, variance, and standard deviation.

Identify the position of a data value in a data set using various measures of position, such as percentiles, deciles, and quartiles.


Objectives (cont’d.)

Use the techniques of exploratory data analysis, including boxplots and five-number summaries, to discover various aspects of data.


Introduction

Statistical methods can be used to summarize data.

Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.

Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.


Introduction (cont’d.)

Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.

The most common measures of position are percentiles, deciles, and quartiles.


Introduction (cont’d.)

The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of statistics is typically used to confirm conjectures about the data.


Introduction (cont’d.)

Another type of statistics is called exploratory data analysis. These techniques include the boxplot and the five-number summary. They can be used to explore data to see what they show.


Basic Vocabulary

A statistic is a characteristic or measure obtained by using the data values from a sample.

A parameter is a characteristic or measure obtained by using all the data values for a specific population.

When the data in a data set are ordered, the result is called a data array.


General Rounding Rule

In statistics, the basic rounding rule is that intermediate computations should not be rounded. Rounding should not be done until the final answer is calculated.


The Arithmetic Average

The mean is the sum of the values divided by the total number of values.

Rounding rule: the mean should be rounded to one more decimal place than occurs in the raw data.

The type of mean that considers an additional factor is called the weighted mean.
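As a minimal sketch of the two definitions above, the following Python computes an ordinary mean and a weighted mean. The helper name `weighted_mean` and the grade-point data are hypothetical and chosen only for illustration; the weights here are assumed to be credit hours.

```python
from statistics import mean


def weighted_mean(values, weights):
    """Sum of value * weight divided by the sum of the weights."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


# Hypothetical data: grade points earned in three courses,
# weighted by the number of credit hours of each course.
grade_points = [4.0, 2.0, 3.0]
credit_hours = [4, 3, 3]

sample_mean = mean(grade_points)                  # (4 + 2 + 3) / 3
gpa = weighted_mean(grade_points, credit_hours)   # (16 + 6 + 9) / 10
```

The ordinary mean treats every course equally, while the weighted mean lets the four-credit course count more than the others.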


The Arithmetic Average

The Greek letter μ (mu) is used to represent the population mean.

The symbol x̄ (“x-bar”) represents the sample mean.

Assume that data are obtained from a sample unless otherwise specified.


Median and Mode

The median is the halfway point in a data set. The symbol for the median is MD.

The median is found by arranging the data in order and selecting the middle point.

The value that occurs most often in a data set is called the mode.

The mode for grouped data, or the class with the highest frequency, is the modal class.
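The median and mode can be sketched with Python's standard `statistics` module. The data values below are hypothetical. Note one assumption about conventions: `multimode` returns every value when all frequencies tie, whereas the slides treat that case as having no mode.

```python
from statistics import median, multimode

# Median: arrange the data in order and take the middle point.
data_odd = [7, 1, 5, 3, 9]     # ordered: 1 3 5 7 9, middle value is 5
data_even = [7, 1, 5, 3]       # ordered: 1 3 5 7, middle is halfway between 3 and 5

md_odd = median(data_odd)
md_even = median(data_even)

# Mode: the value(s) that occur most often.
modes_one = multimode([2, 4, 4, 6])       # a single mode
modes_two = multimode([1, 1, 2, 2, 3])    # two modes (bimodal)
```

With an even number of values there is no single middle value, so the median is taken as the mean of the two middle values.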


Midrange

The midrange is defined as the sum of the lowest and highest values in the data set divided by 2.

The symbol for midrange is MR.
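A short sketch of the midrange calculation follows. The function name `midrange` and the data are hypothetical.

```python
def midrange(values):
    """(lowest value + highest value) / 2."""
    return (min(values) + max(values)) / 2


# Hypothetical data with one unusually high value.
data = [2, 3, 6, 8, 21]
mr = midrange(data)   # (2 + 21) / 2
```

Because only the two extreme values enter the calculation, the single high value 21 pulls the midrange well above the middle of the remaining data.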


Central Tendency: The Mean

One computes the mean by using all the values of the data.

The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.

The mean is used in computing other statistics, such as variance.


Central Tendency: The Mean (cont’d.)

The mean for the data set is unique, and not necessarily one of the data values.

The mean cannot be computed for an open-ended frequency distribution.

The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.


Central Tendency: The Median

The median is used when one must find the center or middle value of a data set.

The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.

The median is used to find the average of an open-ended distribution.

The median is affected less than the mean by extremely high or extremely low values.


Central Tendency: The Mode

The mode is used when the most typical case is desired.

The mode is the easiest average to compute.

The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.

The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.


Central Tendency: The Midrange

The midrange is easy to compute.

The midrange gives the midpoint.

The midrange is affected by extremely high or low values in a data set.


Distribution Shapes

In a positively skewed or right-skewed distribution, the majority of the data values fall to the left of the mean and cluster at the lower end of the distribution.


Distribution Shapes (cont’d.)

In a symmetrical distribution, the data values are evenly distributed on both sides of the mean.


Distribution Shapes (cont’d.)

When the majority of the data values fall to the right of the mean and cluster at the upper end of the distribution, with the tail to the left, the distribution is said to be negatively skewed or left-skewed.
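The description of skewed shapes can be checked on a small data set. The sketch below uses hypothetical data and a hypothetical helper, `share_below_mean`, to count how much of the data falls to the left of the mean. Comparing the mean with the median is a common rule of thumb for the direction of skew, not a formal definition.

```python
from statistics import mean, median

# Hypothetical data: most values are small, one large value forms a right tail.
right_skewed = [1, 2, 2, 3, 3, 3, 4, 20]
# Mirror image of the same data: the tail is now on the left.
left_skewed = [21 - x for x in right_skewed]


def share_below_mean(values):
    """Fraction of the data values that lie below the mean."""
    m = mean(values)
    return sum(1 for x in values if x < m) / len(values)


right_share = share_below_mean(right_skewed)   # majority falls left of the mean
left_share = share_below_mean(left_skewed)     # majority falls right of the mean
```

In the right-skewed set the extreme value pulls the mean above the median, and in the mirrored set it pulls the mean below the median.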


The Range

The range is the highest value minus the lowest value in a data set.

The symbol R is used for the range.


Variance and Standard Deviation

The variance is the average of the squares of the distance each value is from the mean.

The symbol for the population variance is σ² (sigma squared).

$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$


Variance and Standard Deviation

The standard deviation is the square root of the variance. The symbol for the population standard deviation is σ (sigma). Rounding rule: The final answer should be rounded to one more decimal place than the original data.

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
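A minimal sketch of the population variance (the average squared distance from the mean) and the population standard deviation (its square root) is given below. The function name and the data are hypothetical. The result is cross-checked against `statistics.pstdev`, which also divides by N.

```python
from math import sqrt
from statistics import pstdev


def population_variance(values):
    """Average of the squared distances of each value from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


# Hypothetical population of six whole-number values.
data = [10, 60, 50, 30, 40, 20]

variance = population_variance(data)   # 1750 / 6, about 291.7
sigma = sqrt(variance)                 # about 17.1

# Rounding rule: one more decimal place than the original (whole-number) data,
# applied only to the final answer.
sigma_rounded = round(sigma, 1)
```

Note that `statistics.variance` and `statistics.stdev` divide by n − 1 instead of N; those are sample formulas and are not the ones stated here.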


Coefficient of Variation

The coefficient of variation is the standard deviation divided by the mean. The result is expressed as a percentage.

The coefficient of variation is used to compare standard deviations when the units are different for the two variables being compared.
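The comparison can be sketched as follows. The function name `coefficient_of_variation` and the summary figures (number of sales versus commissions in dollars) are hypothetical and used only to show two variables measured in different units.

```python
def coefficient_of_variation(std_dev, mean_value):
    """Standard deviation divided by the mean, expressed as a percentage."""
    if mean_value == 0:
        raise ValueError("mean must not be zero")
    return std_dev / mean_value * 100


# Hypothetical summary figures for two variables with different units.
cv_sales = coefficient_of_variation(std_dev=5, mean_value=87)            # units: sales
cv_commissions = coefficient_of_variation(std_dev=773, mean_value=5225)  # units: dollars
```

Here the commissions have the larger coefficient of variation, so they are the more variable of the two even though their raw standard deviation cannot be compared directly with that of the sales counts.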


Variance and Standard Deviation

Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.

The measures of variance and standard deviation are used to determine the consistency of a variable.


Variance and Standard Deviation (cont’d.)

The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.

The variance and standard deviation are used quite often in inferential statistics.


Chebyshev’s Theorem

The proportion of values from a data set that will fall within k standard deviations of the mean will be at least 1 − 1/k², where k is a number greater than 1.

This theorem applies to any distribution regardless of its shape.
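As an illustration, the bound 1 − 1/k² can be computed and compared with the observed proportion in a deliberately skewed data set. The function names and the data are hypothetical; the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def proportion_within(values, k):
    """Observed proportion of values within k standard deviations of the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return sum(1 for x in values if abs(x - mu) <= k * sigma) / len(values)


# Hypothetical, strongly right-skewed data set.
skewed = [1, 2, 2, 3, 3, 3, 4, 20]

bound_2 = chebyshev_lower_bound(2)          # 1 - 1/4
observed_2 = proportion_within(skewed, 2)   # should be at least bound_2
```

The bound is a guaranteed minimum, so the observed proportion is often noticeably larger than it.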


Empirical Rule for Normal Distributions

The following apply to a bell-shaped distribution.

Approximately 68% of the data values fall within one standard deviation of the mean.

Approximately 95% of the data values fall within two standard deviations of the mean.

Approximately 99.7% of the data values fall within three standard deviations of the mean.
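These percentages can be checked against the normal curve itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `normal_proportion_within` is hypothetical.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)


def normal_proportion_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


# Proportions for one, two, and three standard deviations.
within = {k: normal_proportion_within(k) for k in (1, 2, 3)}
```

The computed proportions are roughly 0.683, 0.954, and 0.997, which the empirical rule rounds to the familiar percentages.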


Standard Scores

A standard score or z score is used when direct comparison of raw scores is impossible.

A standard score or z score for a value is obtained by subtracting the mean from the value and dividing the result by the standard deviation.
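A short sketch of the z score calculation follows. The function name and the two test results are hypothetical, chosen to show how scores from tests with different scales can be compared.

```python
def z_score(value, mean_value, std_dev):
    """(value - mean) / standard deviation."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean_value) / std_dev


# Hypothetical scores of one student on two tests with different scales.
z_test_a = z_score(value=65, mean_value=50, std_dev=10)   # 1.5 standard deviations above
z_test_b = z_score(value=30, mean_value=25, std_dev=5)    # 1.0 standard deviation above
```

Although the raw scores cannot be compared directly, the z scores show that the relative position on test A is higher than on test B.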


Percentiles

Percentiles are position measures used in educational and health-related fields to indicate the position of an individual in a group.

A percentile, P, is an integer between 1 and 99 such that the Pth percentile is a value where P% of the data values are less than or equal to the value and (100 − P)% of the data values are greater than or equal to the value.


Quartiles and Deciles

Quartiles divide the distribution into four groups, separated by Q1, Q2, and Q3. Note that Q1 is the same as the 25th percentile; Q2 is the same as the 50th percentile or the median; and Q3 corresponds to the 75th percentile.

Deciles divide the distribution into 10 groups, separated by D1, D2, …, D9.
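Quartiles and deciles can be sketched with `statistics.quantiles` (Python 3.8 or later), which returns the cut points that separate the groups. The data are hypothetical. The exact values of the cut points depend on the interpolation method, so they may differ slightly from a hand calculation that uses another convention.

```python
from statistics import median, quantiles

# Hypothetical data set of nine values.
data = [3, 5, 7, 8, 12, 13, 14, 18, 21]

# n=4 gives the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(data, n=4)

# n=10 gives the nine cut points that split the data into ten groups.
deciles = quantiles(data, n=10)
```

The second quartile coincides with the median, as the definition above states.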


Outliers

An outlier is an extremely high or an extremely low data value when compared with the rest of the data values.

Outliers can be the result of measurement or observational error.

When a distribution is normal or bell-shaped, data values that are beyond three standard deviations of the mean can be considered suspected outliers.
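The three-standard-deviation check can be sketched as follows. The function name `suspected_outliers` and the data are hypothetical, and the population standard deviation is assumed. The check is only meaningful for roughly bell-shaped data with enough values.

```python
from statistics import mean, pstdev


def suspected_outliers(values, k=3):
    """Values lying more than k standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]


# Hypothetical data: twenty values near 50 plus one extreme value.
data = [48, 49, 50, 51, 52] * 4 + [120]

flagged = suspected_outliers(data)
```

A flagged value is only a suspect; whether it is a measurement or observational error has to be decided by examining how the value was obtained.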


Exploratory Data Analysis

The purpose of exploratory data analysis is to examine data in order to find out what information can be discovered. For example:

Are there any gaps in the data?

Can any patterns be discerned?


Boxplots and Five-Number Summaries

Boxplots are graphical representations of a five-number summary of a data set. The five specific values that make up a five-number summary are:

The lowest value of the data set (minimum)

Q1 (or 25th percentile)

The median (or 50th percentile)

Q3 (or 75th percentile)

The highest value of the data set (maximum)
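The five values listed above can be computed with a short function. This sketch assumes one common convention, in which Q1 and Q3 are the medians of the lower and upper halves of the ordered data (excluding the middle value when the count is odd); other conventions give slightly different quartiles. The function name and the data are hypothetical.

```python
from statistics import median


def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) using the median-of-halves rule."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two data values")
    half = n // 2
    lower = data[:half]
    # With an odd count, the middle value belongs to neither half.
    upper = data[half + 1:] if n % 2 else data[half:]
    return (data[0], median(lower), median(data), median(upper), data[-1])


# Hypothetical data set of nine values.
summary = five_number_summary([3, 5, 7, 8, 12, 13, 14, 18, 21])
```

In a boxplot, the box would span Q1 to Q3 with a line at the median, and the whiskers would extend to the minimum and maximum.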


Summary

Some basic ways to summarize data include measures of central tendency, measures of variation or dispersion, and measures of position.

The three most commonly used measures of central tendency are the mean, median, and mode. The midrange is also used to represent an average.


Summary (cont’d.)

The three most commonly used measurements of variation are the range, variance, and standard deviation.

The most common measures of position are percentiles, quartiles, and deciles.

Data values are distributed according to Chebyshev’s theorem and, in special cases, the empirical rule.


Summary (cont’d.)

The coefficient of variation is used to describe the standard deviation in relationship to the mean.

These methods are commonly called traditional statistics.

Other methods, such as the boxplot and five-number summary, are part of exploratory data analysis; they are used to examine data to see what they reveal.


Conclusions

By combining all of these techniques together, the student is now able to collect, organize, summarize and present data.