Post on 17-Jan-2018

1Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Lecture Notes Number 1

INTRO STATS - 4TH EDITION

CHAPTERS 1 – 4

INTRO STATS – 3RD EDITION

CHAPTERS 1 – 5

2Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Is Statistics?

Statistics is the science of (1) collecting, (2) organizing, (3) summarizing, and (4) analyzing data to answer questions and/or draw conclusions.

3Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Two Different Meanings of Statistics

(a) IN THE PLURAL SENSE, STATISTICS MEANS A SET OF OBSERVATIONS, USUALLY COLLECTED BY MEASUREMENT OR COUNTING, COLLECTIVELY KNOWN AS DATA.

(b) IN THE SINGULAR SENSE, STATISTICS REFERS TO A GROUP OF SCIENTIFIC METHODS USED FOR

(i) collecting data

(ii) interpreting and analyzing data

(iii) making conclusions or inferences.

4Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Why Do We Care About Statistics?

Statistics allows us to:

•Explore the world around us.

•Use evidence to check whether our beliefs are true.

•Find patterns to lead to discoveries.

•Share new discoveries with others.

Keep in Mind

•Statistics must be used carefully.

•Inappropriate use will result in inaccurate beliefs.

•Results are always uncertain.

5Copyright © 2014, 2012, 2009 Pearson Education, Inc.

THREE MAIN ASPECTS OF STATISTICS

DESIGN: PLANNING HOW TO OBTAIN DATA TO ANSWER THE QUESTIONS OF INTEREST (DATA COLLECTION)

DESCRIPTION: EXPLORING AND SUMMARIZING PATTERNS IN THE DATA (DATA ANALYSES)

INFERENCE: MAKING DECISIONS AND PREDICTIONS BASED ON THE DATA. TO INFER MEANS TO ARRIVE AT A DECISION OR PREDICTION BY REASONING FROM KNOWN EVIDENCE.

6Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Types Of Statistics

DESCRIPTIVE STATISTICS IS DEFINED AS THOSE METHODS INVOLVING THE COLLECTION, PRESENTATION, AND CHARACTERIZATION OF A SET OF DATA IN ORDER TO DESCRIBE PROPERLY THE VARIOUS FEATURES OF THAT SET OF DATA. TO ACHIEVE THIS, STATISTICIANS USE TABLES (EITHER FREQUENCY OR CONTINGENCY), BAR AND PIE CHARTS, STEM-AND-LEAF DISPLAYS, BOX-AND-WHISKER PLOTS, PARETO DIAGRAMS, AND HISTOGRAMS.

IT IS ALSO DEFINED AS THE BRANCH OF STATISTICS THAT INVOLVES ORGANIZING, DISPLAYING, AND DESCRIBING DATA.

INFERENTIAL STATISTICS IS THE BRANCH OF STATISTICS THAT INVOLVES DRAWING CONCLUSIONS ABOUT A POPULATION BASED ON INFORMATION CONTAINED IN A SAMPLE FROM THAT POPULATION.

7Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Think, Show, and Tell

•Think about what information you want to know.

•Show your results by displaying the data in a professional and accurate manner.

•Tell your story by describing what can be concluded from the data that was collected.

8Copyright © 2014, 2012, 2009 Pearson Education, Inc.

SOME RELEVANT STATISTICAL TERMINOLOGIES

9Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Is Data?

Data

• Any collection of numbers, characters, images, or other items that provide information about something

• Data vary: Surveys and experiments produce a variety of outcomes.

Statistics helps us make sense of the data and how the data vary.

10Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Is Data?

DATA IS SYSTEMATICALLY RECORDED INFORMATION, WHETHER NUMBERS OR LABELS, TOGETHER WITH ITS CONTEXT.

CONTEXT TELLS WHO WAS MEASURED, WHAT WAS MEASURED, AND WHEN, WHERE, HOW, AND WHY IT WAS MEASURED.

11Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Six “W”s

•Who: Describe the individuals who were surveyed.

•What: Determine what is being measured.

•When: When was the research conducted?

•Where: Where was the research conducted?

•Why: What was the purpose of the survey or experiment?

•How: Describe how the survey or experiment was conducted.

12Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Example

BECAUSE OF THE DIFFICULTY OF WEIGHING A BEAR IN THE WOODS, RESEARCHERS CAUGHT AND MEASURED 54 BEARS, RECORDING THEIR WEIGHT, NECK SIZE, LENGTH, AND SEX. THEY HOPED TO FIND A WAY TO ESTIMATE THE WEIGHT FROM THE OTHER, MORE EASILY DETERMINED QUANTITIES. IDENTIFY THE W’S

SOLUTION

13Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Who and What

•Respondents: Individuals who answer the survey

  •Customers at Amazon

•Subjects or Participants: People who are experimented on

  •Patients who receive the new medication

•Experimental Units: The object of the experiment when it is not a person

  •Rats that run through a maze

•Records: Rows in a database

  •Each person’s purchase record at Amazon

14Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Population Versus Sample

Population

•Collection of all data values that ever will occur for a group

•Often difficult to obtain all of this information

Sample

•A subset of the population

•Represents the population at large

•Easier to obtain this information

15Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Population Versus Sample

•The goal of collecting a sample is to describe the population.

•Data on the entire population are usually impractical or impossible to collect.

•A sample is used to make inferences about the population.

•The sample should be representative of the population.

16Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Population Versus Sample

ILLUSTRATION: POT OF CHICKEN SOUP

17Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Examples: Identify the population and sample

(i) A QUESTION POSTED ON THE LYCOS WEBSITE IN THE USA ON 18 JUNE 2000 ASKED VISITORS TO THE SITE TO SAY WHETHER THEY THOUGHT MARIJUANA SHOULD BE LEGALLY AVAILABLE FOR MEDICINAL PURPOSES.

(ii) THE GALLUP POLL INTERVIEWED 1007 RANDOMLY SELECTED U.S. ADULTS AGED 18 AND OLDER, MARCH 23 – 25, 2007. GALLUP REPORTS THAT WHEN ASKED WHEN, IF EVER, THE EFFECTS OF GLOBAL WARMING WILL BEGIN TO HAPPEN, 60% OF THE RESPONDENTS SAID THE EFFECTS HAD ALREADY BEGUN. ONLY 11% THOUGHT THAT THEY WOULD NEVER HAPPEN.

18Copyright © 2014, 2012, 2009 Pearson Education, Inc.

PARAMETER Versus STATISTIC

• PARAMETER or POPULATION PARAMETER: A PARAMETER IS A NUMERICAL SUMMARY OF THE POPULATION.

• STATISTIC or SAMPLE STATISTIC: A STATISTIC IS A NUMERICAL SUMMARY OF A SAMPLE TAKEN FROM A POPULATION.

• ILLUSTRATION:

19Copyright © 2014, 2012, 2009 Pearson Education, Inc.

OUTLIERS

• OUTLIERS ARE UNUSUAL OR EXTREME VALUES THAT DO NOT APPEAR TO BELONG WITH THE REST OF THE DATA.

• SUCH STRAGGLERS STAND APART FROM THE BODY OF THE DISTRIBUTION OF THE DATA SET.

• OUTLIERS CAN AFFECT MANY STATISTICAL ANALYSES, SO YOU SHOULD ALWAYS BE ALERT FOR THEM.

20Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Variables

DEFINITION: THE CHARACTERISTICS RECORDED ABOUT EACH INDIVIDUAL ARE CALLED VARIABLES.

THERE ARE TWO TYPES OF VARIABLES – CATEGORICAL AND QUANTITATIVE VARIABLES.

21Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Categorical Variables

•Categorical Variable: A variable that tells us what group or category an individual belongs to

•Synonyms: nominal and qualitative

•Examples: Favorite color, country of birth, area code

•Drawback of Categorical Variables: Challenging to analyze with computation

22Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Quantitative Variables

•Quantitative Variable: Contains measured numerical values with measurement units

•Typically records the amount or degree of something

•Unit Examples: ounces, dollars, degrees Fahrenheit

23Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Quantitative Variables

Discrete Quantitative Variable

A VARIABLE IS DISCRETE IF IT TAKES ITS VALUE FROM A COUNTABLE SET OF NUMBERS LIKE {0, 1, 2, 3, 4, …} OR FROM A FINITE SET LIKE {3, 4.5, 6, 11}

Continuous Quantitative Variable

A VARIABLE IS CONTINUOUS IF IT TAKES ITS POSSIBLE VALUES FROM AN INTERVAL OR A CONTINUUM LIKE [2,7], (-5,10), OR THE ENTIRE NUMBER LINE, R.

24Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Categorical or Quantitative Variable?

• Amazon knows your age and will use it to present an age-appropriate image customized for you.

• Is Age categorical or quantitative?

• If age is recorded as Child, Teen, Young Adult, Middle Aged, or Senior, it is categorical.

• Can also be perceived as quantitative if not categorized into a type

25Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Identifiers

• Identifier Variable: A variable that is used to uniquely identify the individual. It does not describe the individual.

  •Login ID

  •Customer Number

  •Transaction Number

  •Social Security Number

26Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Ordinal Variables

• Ordinal Variable: A variable that reports order without natural units

• Four-point Likert Scale: Strongly Disagree, Disagree, Agree, Strongly Agree

• Olympic Rank: Gold, Silver, Bronze

• Can be treated as quantitative by using the rank number

• 1 = Strongly Disagree, 2 = Disagree, 3 = Agree, 4 = Strongly Agree

27Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Quantitative And Categorical Data

• DATA COLLECTED FROM A QUANTITATIVE VARIABLE ARE CALLED QUANTITATIVE DATA. EXAMPLES INCLUDE THE HEIGHT AND WEIGHT OF STUDENTS AND THE TIME TO COMPLETE DIFFERENT TASKS.

• DATA COLLECTED FROM A CATEGORICAL VARIABLE ARE CALLED CATEGORICAL DATA. EXAMPLES INCLUDE COLOR OF EYES.

28Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Chapter 3

Displaying and Summarizing Quantitative Data

29Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.1

Displaying Quantitative Variables

30Copyright © 2014, 2012, 2009 Pearson Education, Inc.

HISTOGRAMS

• A HISTOGRAM IS A SUMMARY GRAPH SHOWING A COUNT OF THE DATA FALLING IN VARIOUS RANGES OR BINS OR CLASSES OR GROUPS.

• THE PURPOSE IS TO GRAPHICALLY SUMMARIZE AND DISPLAY THE DISTRIBUTION OF A PROCESSED DATA SET.

• IT IS PARTICULARLY USEFUL WHEN THERE ARE A LARGE NUMBER OF OBSERVATIONS.

• THE OBSERVATIONS OR DATA SETS FOR WHICH WE DRAW A HISTOGRAM ARE QUANTITATIVE VARIABLES.

31Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Constructing Histograms Using Minitab Commands

• Open Minitab and ENTER the Data Set

• Click on Graph then on Histogram then on Simple then on OK

• Click on C1 then on Select

• Click on Labels then on Title (Write the title of the histogram)

• Click on OK

32Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Histograms

•Histogram: A chart that displays quantitative data

• Great for seeing the distribution of the data

• Most tsunami-generating earthquakes have magnitudes between 6.5 and 8.

• Japan and Sumatra quakes (9.0 and 9.1) are rare.

• Quakes under 5 rarely cause tsunamis.

• Quakes between 7.0 and 7.5 are the most common cause of tsunamis.

[Figure: a histogram of tsunami-generating earthquakes]

33Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Choosing the Bin Width

•Different bin widths tell different stories.

•Choose the width that best shows the important features.

•Presentations can feature two histograms that present the same data in different ways.

•A gap in the histogram means that there were no occurrences in that range.
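To see how bin width changes the story, the sketch below counts the same values with two different widths. The magnitudes are hypothetical numbers invented for illustration, and `bin_counts` is a helper name assumed here, not something from the slides.

```python
from collections import Counter

def bin_counts(values, width, start=0.0):
    """Count the values falling in each bin [start + k*width, start + (k+1)*width)."""
    counts = Counter(int((v - start) // width) for v in values)
    lo, hi = min(counts), max(counts)
    # List every bin between the first and last occupied one, so gaps show up as 0.
    return [(start + k * width, counts.get(k, 0)) for k in range(lo, hi + 1)]

# Hypothetical earthquake magnitudes, made up for this example only.
magnitudes = [5.1, 6.6, 6.8, 7.0, 7.1, 7.2, 7.3, 7.4, 7.7, 9.0]

wide = bin_counts(magnitudes, width=1.0)    # five bins, one of them empty
narrow = bin_counts(magnitudes, width=0.5)  # nine bins, several gaps

for left_edge, count in narrow:
    print(f"{left_edge:4.1f} | {'*' * count}")
```

With the wider bins the data look like one broad hump; the narrower bins expose the gaps between the bulk of the values and the extremes.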

34Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Relative Frequency Histograms

•Relative Frequency Histogram

•The vertical axis represents the relative frequency, the frequency divided by the total.

•The horizontal axis is the same as the horizontal axis for the frequency histogram.

•The shape of the relative frequency histogram is the same as the frequency histogram.

•Only the scale of the y-axis is different.

35Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Stem-and-Leaf Displays

•Stem-and-Leaf: Shows both the shape of the distribution and all of the individual values

•Not as visually pleasing as a histogram; more technical looking

•Can only be used for small collections of data

•The first column (stems) represents the leftmost digit.

•The second column (leaves) shows the remaining digit(s).
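A minimal sketch of how a stem-and-leaf display can be built, assuming non-negative whole numbers where the last digit is the leaf and everything to its left is the stem. The pulse rates are hypothetical values used only for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return the lines of a stem-and-leaf display for non-negative integers."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)   # stem = leading digit(s), leaf = last digit
    stems = range(min(leaves), max(leaves) + 1)
    # Include stems with no leaves so that gaps in the data stay visible.
    return [f"{s} | {''.join(str(d) for d in leaves[s])}" for s in stems]

# Hypothetical pulse rates (beats per minute), invented for this example.
pulses = [88, 80, 76, 72, 68, 64, 64, 60, 56, 92]

for line in stem_and_leaf(pulses):
    print(line)
```

Each original value can be read back from the display: stem 6 with leaves 0, 4, 4, 8 stands for 60, 64, 64, and 68.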

36Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Dotplots

•Dotplot: Displays dots to describe the shape of the distribution

•There were 30 races with a winning time of 122 seconds.

•Good for smaller data sets

•Visually more appealing than stem-and-leaf

•In StatCrunch: Graphics → Dotplot

37Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Example

THE DATA BELOW GIVE THE NUMBER OF HURRICANES THAT HAPPENED EACH YEAR FROM 1944 THROUGH 2000 AS REPORTED BY SCIENCE MAGAZINE.

• 3,2,1,2,4,3,7,2,3,3,2,5,2,2,4,2,2,6,0,2,5,1,3,1,0,3,2,1,0,1,2,3,2,1,2,2,2,3,1,1,1,3,0,1,3,2,1,2,1,1,0,5,6,1,3,5,3

38Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Frequency Table For Hurricane Data

No. of hurricanes    Frequency or Count
0                    5
1                    14
2                    17
3                    12
4                    2
5                    4
6                    2
7                    1
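The counts in the frequency table can be reproduced from the raw hurricane data. The sketch below also adds the relative frequency (count divided by total) used in relative frequency histograms; the variable names are choices made for this example.

```python
from collections import Counter

# Number of hurricanes per year, 1944 through 2000, as listed in the example.
hurricanes = [
    3, 2, 1, 2, 4, 3, 7, 2, 3, 3, 2, 5, 2, 2, 4, 2, 2, 6, 0, 2,
    5, 1, 3, 1, 0, 3, 2, 1, 0, 1, 2, 3, 2, 1, 2, 2, 2, 3, 1, 1,
    1, 3, 0, 1, 3, 2, 1, 2, 1, 1, 0, 5, 6, 1, 3, 5, 3,
]

counts = Counter(hurricanes)
total = len(hurricanes)

# One row per distinct value: (value, frequency, relative frequency).
table = [(value, counts[value], counts[value] / total) for value in sorted(counts)]

print("Hurricanes  Count  Relative")
for value, freq, rel in table:
    print(f"{value:>10}  {freq:>5}  {rel:8.1%}")
```

The relative frequencies sum to 1, so a relative frequency histogram has the same shape as the frequency histogram; only the vertical scale changes.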

39Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Dot Plot For Hurricane Data

[Figure: dot plot for the hurricane data, with the number of hurricanes (0 to 7) on the horizontal axis]

40Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Think Before you Draw

•Is the variable quantitative? Is the answer to the survey question or result of the experiment a number whose units are known?

•Histograms, stem-and-leaf diagrams, and dotplots can only display quantitative data.

•Bar and pie charts display categorical data.

41Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.2

Shape

42Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Describing The Distribution of a Quantitative Variable From Histograms

• WHEN YOU DESCRIBE THE DISTRIBUTION OF A QUANTITATIVE VARIABLE, YOU SHOULD ALWAYS TELL ABOUT FOUR THINGS, NAMELY:

• THE SHAPE

• THE CENTER

• THE SPREAD

• UNUSUAL FEATURES OR OUTLIERS

43Copyright © 2014, 2012, 2009 Pearson Education, Inc.

THE SHAPE OF A DISTRIBUTION

• DOES THE HISTOGRAM HAVE A SINGLE, CENTRAL HUMP OR SEVERAL SEPARATED HUMPS? THESE HUMPS ARE CALLED MODES.

• A HISTOGRAM WITH ONE PEAK IS DUBBED UNIMODAL; HISTOGRAMS WITH TWO PEAKS ARE CALLED BIMODAL; AND THOSE WITH THREE OR MORE PEAKS ARE CALLED MULTIMODAL.

• A HISTOGRAM THAT DOESN’T APPEAR TO HAVE ANY MODE AND IN WHICH ALL THE BARS ARE APPROXIMATELY THE SAME HEIGHT IS CALLED UNIFORM.

44Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Modes

•A Mode of a histogram is a hump or high-frequency bin.

•One mode → Unimodal

•Two modes → Bimodal

•3 or more → Multimodal

[Figure: three example histograms labeled Unimodal, Bimodal, and Multimodal]

45Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Uniform Distributions

•Uniform Distribution: All the bins have the same frequency, or at least close to the same frequency.

•The histogram for a uniform distribution will be flat.

46Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Is The Histogram Symmetric or Skewed?

• CAN YOU FOLD THE HISTOGRAM ALONG A VERTICAL LINE THROUGH THE MIDDLE AND HAVE THE EDGES MATCH PRETTY CLOSELY, OR ARE MORE OF THE VALUES ON ONE SIDE?

• THE (USUALLY) THINNER ENDS OF A DISTRIBUTION ARE CALLED TAILS. IF ONE TAIL STRETCHES OUT FARTHER THAN THE OTHER, THE HISTOGRAM IS SAID TO BE SKEWED TO THE SIDE OF THE LONGER TAIL.

• A SKEWED RIGHT DISTRIBUTION IS ONE IN WHICH THE TAIL IS ON THE RIGHT SIDE.

• A SKEWED LEFT DISTRIBUTION IS ONE IN WHICH THE TAIL IS ON THE LEFT SIDE.

47Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Symmetry

•The histogram for a symmetric distribution will look the same on the left and the right of its center.

[Figure: three example histograms labeled Symmetric, Not Symmetric, and Symmetric]

48Copyright © 2014, 2012, 2009 Pearson Education, Inc.

SYMMETRIC HISTOGRAM

49Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Skew

•A histogram is skewed right if the longer tail is on the right side of the mode.

•A histogram is skewed left if the longer tail is on the left side of the mode.

[Figure: two example histograms labeled Skewed Left and Skewed Right]

50Copyright © 2014, 2012, 2009 Pearson Education, Inc.

LEFT – SKEWED HISTOGRAM

51Copyright © 2014, 2012, 2009 Pearson Education, Inc.

RIGHT – SKEWED HISTOGRAM

52Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Outliers

•An Outlier is a data value that is far above or far below the rest of the data values.

•An outlier is sometimes just an error in the data collection.

•An outlier can also be the most important data value.

•Income of a CEO

•Temperature of a person with a high fever

•Elevation at Death Valley

53Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Example

•The histogram shows the amount of money spent by a credit card company’s customers. Describe and interpret the distribution.

•The distribution is unimodal. Customers most commonly spent a small amount of money.

•The distribution is skewed right. Many customers spent only a small amount and a few were spread out at the high end.

•There is an outlier at around $7000. One customer spent much more than the rest of the customers.

54Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.3

Center

55Copyright © 2014, 2012, 2009 Pearson Education, Inc.

THE CENTER OF THE DISTRIBUTION

• THE CENTER IS A VALUE THAT ATTEMPTS THE IMPOSSIBLE BY SUMMARIZING THE ENTIRE DISTRIBUTION WITH A SINGLE NUMBER, A “TYPICAL” VALUE.

• MEASURES OF CENTER INCLUDE THE MEAN AND THE MEDIAN.

• WHEN A HISTOGRAM IS UNIMODAL AND SYMMETRIC, WE WOULD AGREE ON THE CENTER OF SYMMETRY, WHERE WE WOULD FOLD THE HISTOGRAM TO MATCH THE TWO SIDES.

• WHEN THE DISTRIBUTION IS SKEWED OR POSSIBLY MULTIMODAL, DEFINING THE CENTER IS MORE OF A CHALLENGE.

56Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Median

•Median: The center of the data values

•Half of the data values are to the left of the median and half are to the right of the median.

•For symmetric distributions, the median is directly in the middle.

57Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Calculating the Median: Odd Sample Size

•First order the numbers.

•If there are an odd number of numbers, n, the median is at position $\frac{n+1}{2}$.

•Find the median of the numbers: 2, 4, 5, 6, 7, 9, 9.

•Since $\frac{n+1}{2} = \frac{7+1}{2} = 4$, the median is the fourth number: 6

•Note that there are 3 numbers to the left of 6 and 3 to the right.

58Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Calculating the Median: Even Sample Size

•First order the numbers.

•If there are an even number of numbers, n, the median is the average of the two middle numbers, at positions $\frac{n}{2}$ and $\frac{n}{2} + 1$.

•Find the median of the numbers: 2, 2, 4, 6, 7, 8.

•Since $\frac{n}{2} = \frac{6}{2} = 3$, the median is the average of the third and the fourth numbers:

$\text{Median} = \frac{4 + 6}{2} = 5$
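The odd and even cases can be combined into one short function. This is a minimal sketch of the procedure described on the two slides; the function name `median` is a choice made here, and Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Median by the slide's rule: sort, then take the middle value(s)."""
    ordered = sorted(values)          # always order the numbers first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[mid]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([2, 4, 5, 6, 7, 9, 9]))  # odd sample size from the slide
print(median([2, 2, 4, 6, 7, 8]))     # even sample size from the slide
```

Because the function sorts its input, it gives the same answer regardless of the order in which the values were recorded.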

59Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.4

Spread

60Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Spread

•Locating the center is only part of the story

•Are the data all near the center or are they spread out?

•Is the highest value much higher than the lowest value?

•To describe data, we must discuss both the center and the spread.

61Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Measures of Spread of Quantitative Data

• A MEASURE OF SPREAD IS A NUMERICAL SUMMARY OF HOW TIGHTLY THE VALUES ARE CLUSTERED AROUND THE CENTER.

• MEASURES OF SPREAD INCLUDE:

- RANGE

- INTERQUARTILE RANGE (IQR)

- STANDARD DEVIATION

62Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Range

•The range is the difference between the maximum and minimum values.

Range = Maximum – Minimum

•The ages of the guests at your dinner party are: 16, 18, 23, 23, 27, 35, 74

•The range is: 74 – 16 = 58

•The range is sensitive to outliers. A single high or low value will affect the range significantly.
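A small sketch of the range calculation using the dinner party ages above. Dropping the single oldest guest shows how strongly one extreme value affects the range; `value_range` is a name assumed for this example.

```python
def value_range(values):
    """Range = maximum - minimum."""
    return max(values) - min(values)

ages = [16, 18, 23, 23, 27, 35, 74]

full_range = value_range(ages)                              # 74 - 16
without_oldest = value_range([a for a in ages if a != 74])  # 35 - 16

print(full_range, without_oldest)
```

Removing one guest cuts the range from 58 to 19, which is why the range alone is a fragile summary of spread.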

63Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Percentiles and Quartiles

•Percentiles divide the data into one hundred groups.

•The nth percentile is the data value such that n percent of the data lies below that value.

•For large data sets, the median is the 50th percentile.

•The median of the lower half of the data is the 25th percentile and is called the first quartile (Q1).

•The median of the upper half of the data is the 75th percentile and is called the third quartile (Q3).

64Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Interquartile Range

•The Interquartile Range (IQR) is the difference between the upper quartile and the lower quartile: IQR = Q3 – Q1

•The IQR measures the range of the middle half of the data.

•Example: If Q1 = 23 and Q3 = 44 then IQR = 44 – 23 = 21

65Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Example

(1) Find Q1, Q3, and IQR for the dataset 7, 3, 5, 1, 9 (n = odd)

(2) Find Q1, Q3, and IQR for the dataset 7, 3, 5, 1, 9, 11 ( n = even)
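One way to work through the two data sets above is the "median of each half" rule from the Percentiles and Quartiles slide. The slides do not say whether the median belongs to both halves when n is odd, so the `include_median` flag below is an assumption; software packages differ on this point, and small differences between methods are normal.

```python
def median(values):
    ordered = sorted(values)
    n, mid = len(ordered), len(ordered) // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values, include_median=True):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Odd n: the overall median is counted in both halves (assumed convention).
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # Even n, or odd n with the median left out of both halves.
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(upper)

def iqr(values, include_median=True):
    q1, q3 = quartiles(values, include_median)
    return q3 - q1

odd_data = [7, 3, 5, 1, 9]
even_data = [7, 3, 5, 1, 9, 11]

print(quartiles(odd_data), iqr(odd_data))
print(quartiles(even_data), iqr(even_data))
```

With the median included in both halves, the odd data set gives Q1 = 3 and Q3 = 7; leaving it out gives Q1 = 2 and Q3 = 8. Both answers are defensible, which is why the choice of convention should be stated.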

66Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Interquartile Range

•The Interquartile Range for tsunami-causing earthquakes is 0.9.

•The picture below shows the meaning of the IQR.

67Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Benefits and Drawbacks of the IQR

•The Interquartile Range is not sensitive to outliers.

•The IQR provides a reasonable summary of the spread of the distribution.

•The IQR shows where typical values are, except for the case of a bimodal distribution.

•The IQR is not great for a general audience since most people do not know what it is.

68Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Extra credit

Find the median, Q1, Q3, and IQR of the following dataset:

(a)45, 46, 49, 35, 76, 80, 89, 94, 37, 61, 62, 64, 68, 56, 57, 57, 59, 71, 72.

(b) 850, 900,1400,1200,1050, 1000, 750, 1250, 1050, 565

69Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.5

Boxplots and 5-Number Summaries

70Copyright © 2014, 2012, 2009 Pearson Education, Inc.

5-Number Summary

•The 5-Number Summary provides a numerical description of the data. It consists of

•Minimum

•First Quartile (Q1)

•Median

•Third Quartile (Q3)

•Maximum

•The list to the right shows the 5-Number Summary for the tsunami data.

71Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Interpreting the 5-Number Summary

•The smallest tsunami-causing earthquake had magnitude 3.7.

•The largest tsunami-causing earthquake had magnitude 9.1.

•The middle half of tsunami-causing earthquakes is between 6.7 and 7.6.

•Half of tsunami-causing earthquakes have magnitudes below 7.2 and half are above 7.2.

•A tsunami-causing earthquake with magnitude less than 6.7 is small.

•A tsunami-causing earthquake with magnitude more than 7.6 is large.

72Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Boxplots

•A Boxplot is a chart that displays the 5-Number Summary and the outliers.

•The Box shows the Interquartile Range.

•The dashed lines are called fences. Outside the fences lie the outliers.

•Above and below the box are the whiskers that display the most extreme data values within the fences.

•The line inside the box shows the median.

73Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Finding the Fences

•The lower fence is defined by Lower Fence = Q1 – 1.5 × IQR

•The upper fence is defined by Upper Fence = Q3 + 1.5 × IQR

•Tsunami Example: Q1 = 6.7, Q3 = 7.6, IQR = 7.6 – 6.7 = 0.9

•Lower Fence = 6.7 – 1.5 × 0.9 = 5.35

•Upper Fence = 7.6 + 1.5 × 0.9 = 8.95
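The fence formulas translate directly into code. The sketch below uses the tsunami quartiles from this slide and checks a few magnitudes mentioned elsewhere in these notes (3.7, 7.2, 9.0, 9.1); the function names are assumptions made for illustration.

```python
def fences(q1, q3, k=1.5):
    """Return (lower_fence, upper_fence) using the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def is_outlier(value, q1, q3):
    """True if the value lies outside the fences."""
    lower, upper = fences(q1, q3)
    return value < lower or value > upper

lower, upper = fences(6.7, 7.6)
print(round(lower, 2), round(upper, 2))

# Floating-point arithmetic is inexact, so the results are rounded for display.
for magnitude in (3.7, 7.2, 9.0, 9.1):
    print(magnitude, is_outlier(magnitude, 6.7, 7.6))
```

By this rule the smallest quake (3.7) and the two largest (9.0 and 9.1) fall outside the fences, while the median magnitude of 7.2 does not.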

74Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Self – read example (Slides 75 – 78)

75Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Step-by-Step Example of Shape, Center, Spread: Flight Cancellations

•Question: How often are flights cancelled?

•Who? Months

•What? Percentage of Flights Cancelled at U.S. Airports

•When? 1995 – 2011

•Where? United States

•How? Bureau of Transportation Statistics Data

76Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Flight Cancellations: Think

•Identify the Variable

  •Percent of flight cancellations at U.S. airports

  •Quantitative: Units are percentages.

•How will the data be summarized?

  •Histogram

  •Numerical Summary

  •Boxplot

77Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Flight Cancellations: Show

•Use StatCrunch to create the histogram, boxplot, and numerical summary.

78Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Flight Cancellations: Tell

•Describe the shape, center, and spread of the distribution. Report on the symmetry, number of modes, and any gaps or outliers. You should also mention any concerns you may have about the data.

•Skewed to the Right: Can’t be a negative percent. Bad weather and other airport troubles can cause extreme cancellations.

•IQR is small: 1.23%. Consistency among cancellation percents

•Extraordinary outlier at 20.2%: September 2001

79Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.6

The Center of Symmetric Distributions: The Mean

80Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Mean

•The Mean is what most people think of as the average.

•Add up all the numbers and divide by the number of numbers.

•Recall that $\sum$ means “Add them all.”

$\bar{y} = \frac{\sum y}{n}$

•In StatCrunch, the mean is listed in the Summary Statistics.
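As a quick illustration of "add up all the numbers and divide by the number of numbers", the sketch below computes the mean of the seven values used earlier in the median example. The function name `mean` is a choice made here; Python's `statistics.mean` does the same job.

```python
def mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

data = [2, 4, 5, 6, 7, 9, 9]
y_bar = mean(data)   # (2 + 4 + 5 + 6 + 7 + 9 + 9) / 7 = 42 / 7
print(y_bar)
```

For this small data set the mean happens to equal the median (both are 6).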

81Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Mean is the “Balancing Point”

•If you put your finger on the mean, the histogram will balance perfectly.

82Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Mean Vs. Median

•For symmetric distributions, the mean and the median are equal.

•The balancing point is at the center.

•The tail “pulls” the mean towards it more than it pulls the median.

•The mean is more sensitive to outliers than the median.

83Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Mean Is Attracted to the Outlier

•The mean is larger than the median since it is “pulled” to the right by the outlier.

•The median is a better measure of the center for data that is skewed.
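The pull of an outlier on the mean can be checked numerically. This sketch reuses the dinner party ages from the Range slide (an example chosen here, not the data behind this slide's figure) and compares the mean and median with and without the 74-year-old guest.

```python
import statistics

ages = [16, 18, 23, 23, 27, 35, 74]
ages_without_outlier = [a for a in ages if a != 74]

mean_all = statistics.mean(ages)
median_all = statistics.median(ages)
mean_trimmed = statistics.mean(ages_without_outlier)
median_trimmed = statistics.median(ages_without_outlier)

print(f"with outlier:    mean = {mean_all:.2f}, median = {median_all}")
print(f"without outlier: mean = {mean_trimmed:.2f}, median = {median_trimmed}")
```

The one high value raises the mean by roughly seven years while the median stays at 23, which is the sense in which the mean is "attracted" to the outlier.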

84Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Why Use the Mean?

•Although the median is a better measure of the center, the mean weighs in large and small values better.

•The mean is easier to work with.

•For symmetric data, statisticians would rather use the mean.

•It is always ok to report both the mean and the median.

85Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.7

The Spread of Symmetric Distributions: The Standard Deviation

86Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Variance

•The variance is a measure of how far the data is spread out from the mean.

•The difference from the mean is: $y - \bar{y}$.

•To make it positive, square it.

•Then find the average of all of these distances, except instead of dividing by n, divide by n – 1.

•Use $s^2$ to represent the variance.

•The variance will mostly be used to find the standard deviation s, which is the square root of the variance.

$s^2 = \frac{\sum (y - \bar{y})^2}{n - 1}$

87Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Standard Deviation

•The variance’s units are the square of the original units.

•Taking the square root of the variance gives the standard deviation, which will have the same units as y.

•The standard deviation is a number that is close to the average distance of the y values from the mean.

•If data values are close to the mean (less spread out), then the standard deviation will be small.

•If data values are far from the mean (more spread out), then the standard deviation will be large.

$s = \sqrt{\frac{\sum (y - \bar{y})^2}{n - 1}}$
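The variance and standard deviation steps can be followed one at a time in code. This is a minimal sketch that divides by n – 1 as described above; the data set is the seven values from the median example, and the function names are assumptions made for illustration.

```python
import math

def sample_variance(values):
    """s^2: sum of squared differences from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    squared_diffs = [(y - y_bar) ** 2 for y in values]
    return sum(squared_diffs) / (n - 1)

def sample_sd(values):
    """s: the square root of the variance, in the same units as the data."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 5, 6, 7, 9, 9]     # mean is 6
s2 = sample_variance(data)       # (16 + 4 + 1 + 0 + 1 + 9 + 9) / 6 = 40 / 6
s = sample_sd(data)
print(round(s2, 3), round(s, 3))
```

Python's `statistics.variance` and `statistics.stdev` use the same n – 1 divisor, so they can serve as a cross-check.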

88Copyright © 2014, 2012, 2009 Pearson Education, Inc.

The Standard Deviation and Histograms

Order the histograms below from smallest standard deviation to largest standard deviation.

[Figure: three histograms labeled A, B, and C]

Answer: C, A, B

89Copyright © 2014, 2012, 2009 Pearson Education, Inc.

3.8

Summary—What to Tell About a Quantitative Variable

90Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What to Tell

•Histogram, Stem-and-Leaf, Boxplot

•Describe modality, symmetry, outliers

•Center and Spread

  •Median and IQR if not symmetric

  •Mean and Standard Deviation if symmetric

  •For unimodal symmetric data, expect IQR > s. If not, check for errors.

•Unusual Features

  •For multiple modes, possibly split the data into groups.

  •When there are outliers, report the mean and standard deviation with and without the outliers.

91Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Self – read example (Slides 92 – 93)

92Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Example: Fuel Efficiency

•The car owner has checked the fuel efficiency each time he filled the tank. How would you describe the fuel efficiency?

•Plan: Summarize the distribution of the car’s fuel efficiency.

•Variable: mpg for 100 fill ups, Quantitative

•Mechanics: show a histogram

  •Fairly symmetric

  •Low outlier

93Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Fuel Efficiency Continued

•Which to report?

  •The mean and median are close.

  •Report the mean and standard deviation.

•Conclusion

  •Distribution is unimodal and symmetric.

  •Mean is 22.4 mpg.

  •Low outlier may be investigated, but limited effect on the mean

  •s = 2.45; from one filling to the next, fuel efficiency differs from the mean by an average of about 2.45 mpg.

94Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Self – read (Slides 95 – 97)

95Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Can Go Wrong?

•Don’t make a histogram for categorical data.

•Don’t look for shape, center, and spread for a bar chart.

•Choose a bin width appropriate for the data.

96Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Can Go Wrong? Continued

•Do a reality check

•Don’t blindly trust your calculator. For example, a mean student age of 193 years old is nonsense.

•Sort before finding the median and percentiles.

  •315, 8, 2, 49, 97 does not have a median of 2. Sorted (2, 8, 49, 97, 315), the median is 49.

•Don’t worry about small differences in the quartile calculation.

•Don’t compute numerical summaries for a categorical variable.

•The mean Social Security number is meaningless.

97Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What Can Go Wrong? Continued

•Don’t report too many decimal places.

  •Citing the mean fuel efficiency as 22.417822453 is going overboard.

•Don’t round in the middle of a calculation.

•For multiple modes, think about separating groups.

  •Heights of people → Separate men and women

•Beware of outliers: the mean and standard deviation are sensitive to outliers.

•Use a histogram or dotplot to ensure that the mean and standard deviation really do describe the data.

98Copyright © 2014, 2012, 2009 Pearson Education, Inc.

PARTIAL STUDY MATERIAL FOR TAKE HOME EXAMINATION

CHAPTER 2

Displaying and Describing Categorical Data

99Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Frequency Tables

•A frequency table is a table whose first column displays each distinct outcome and second column displays that outcome’s frequency.

•If there are many distinct outcomes, then combining them into a few categories is recommended

100Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Relative Frequency Tables

•A relative frequency table is a table whose first column displays each distinct outcome and second column displays that outcome’s relative frequency.

•The relative frequency table is similar to the frequency table, but it displays relative frequencies rather than frequencies.

101Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Bar Charts

•A bar chart displays the frequency or relative frequency of each category.

•All bars must have the same width.

•Good for general audience

102Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Pie Charts

•A pie chart presents each category as a slice of a circle, so that each slice has a size proportional to the fraction of the whole in that category.

•Pie charts are also good for a general audience.

•Pie charts help to display the fraction of the whole that each category represents

103Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Think Before You Draw

•Choose the chart that best tells the story of your data.

•Think about the intended audience to select a chart that is best for them.

•Charts often work better when the categories do not overlap.

•Don’t try to fool your audience; just give a chart that honestly expresses the interesting features of the data.

104Copyright © 2014, 2012, 2009 Pearson Education, Inc.

2.2

Exploring the Relationship Between Two Categorical Variables

105Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Contingency Tables

•A contingency table is a table that displays two categorical variables and their relationships.

•There were 528 third-class ticket holders who died.

•The bottom row represents the totals for class and is called the marginal distribution.

•The right column represents the totals for survival and is also a marginal distribution.

106Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Table of Percents

•A table of percents can be misleading.

•Looking at “Alive”, was it better to have a second- or third-class ticket?

• 8.1% were third-class survivors, 5.4% were second-class survivors.

• What is wrong with just comparing these percentages?

107Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Conditional Distributions

•A conditional distribution provides the percent of one variable satisfying the conditions of another.

•25.2% of all third-class ticket holders survived.

•Was it better to have a second- or third-class ticket?
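The difference between a table of percents and a conditional distribution is easiest to see with counts. The slide's contingency table did not survive in this transcript, so the counts below are the commonly cited Titanic figures and should be treated as an assumption; they do agree with the numbers quoted in these slides (528 third-class deaths, 8.1%, 5.4%, and 25.2%).

```python
# Assumed counts (commonly cited Titanic data); the slide's own table is not shown here.
counts = {
    "First":  {"Alive": 203, "Dead": 122},
    "Second": {"Alive": 118, "Dead": 167},
    "Third":  {"Alive": 178, "Dead": 528},
    "Crew":   {"Alive": 212, "Dead": 673},
}

grand_total = sum(sum(row.values()) for row in counts.values())

def percent_of_whole(ticket_class, outcome):
    """Table of percents: the cell count as a share of everyone on board."""
    return 100 * counts[ticket_class][outcome] / grand_total

def percent_within_class(ticket_class, outcome):
    """Conditional distribution: the cell count as a share of that class only."""
    class_total = sum(counts[ticket_class].values())
    return 100 * counts[ticket_class][outcome] / class_total

for cls in ("Second", "Third"):
    print(cls,
          f"{percent_of_whole(cls, 'Alive'):.1f}% of all on board,",
          f"{percent_within_class(cls, 'Alive'):.1f}% of that class survived")
```

Third-class survivors are a larger share of everyone on board only because third class was much larger; conditioning on class shows that second-class ticket holders survived at a clearly higher rate.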

108Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Conditional Distribution: Rows or Columns

•The “Condition” can either be based on rows or columns.

•This table shows that the highest percent of survivors were crew members.

•The highest percent of the dead were also crew members.

109Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Conditional Distributions as Pie Charts

•Pie charts can give a visual representation of the conditional distributions.

•Compare how the first-class ticket holders were represented amongst the survivors vs. the dead.

110Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Bar Charts

•Bar charts can also effectively tell the story for conditional distributions.

•Which is best: Table, Pie chart, or Bar Graph?

111Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Independence

•Independence: The distribution of one variable is the same for all categories of another.

•For dependent variables, there is an association between the two variables

112Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Independence Example

•Is there an association between gender and interest in Super Bowl TV Coverage?

•Large difference for men between watching the game and commercials

•Smaller difference for women

•There is an association between gender and interest.

113Copyright © 2014, 2012, 2009 Pearson Education, Inc.

What’s Wrong With These Charts?

•Violates the area principle

•Add up the percents

114Copyright © 2014, 2012, 2009 Pearson Education, Inc.

More Words of Caution

•Don’t confuse percents of the whole with marginal percents.

•Don’t leave out marginal percents.

•Don’t make conclusions based on only a handful of individuals.

•Don’t make independence conclusions where there is only a small difference.

115Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Simpson’s Paradox

•Which pilot had a better on-time flight record?

•Moe was better overall.

•Jill was better for both day and night flights.

•Simpson’s Paradox: One is higher overall while the other is higher in every category.

[Table: Number of On-Time Flights]
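Simpson's Paradox can be verified with a few lines of arithmetic. The on-time counts below are illustrative values chosen to reproduce the pattern described above; the slide's own table is not included in this transcript, so they are an assumption rather than the original data.

```python
# Illustrative (on_time, total) flight counts per pilot and time of day (assumed values).
flights = {
    "Moe":  {"Day": (90, 100), "Night": (10, 20)},
    "Jill": {"Day": (19, 20),  "Night": (75, 100)},
}

def rate(pilot, period):
    """On-time proportion for one pilot in one category."""
    on_time, total = flights[pilot][period]
    return on_time / total

def overall_rate(pilot):
    """On-time proportion for one pilot across all flights."""
    on_time = sum(o for o, _ in flights[pilot].values())
    total = sum(t for _, t in flights[pilot].values())
    return on_time / total

for pilot in flights:
    print(pilot,
          f"day {rate(pilot, 'Day'):.0%},",
          f"night {rate(pilot, 'Night'):.0%},",
          f"overall {overall_rate(pilot):.0%}")
```

Jill has the higher rate in both categories, yet Moe has the higher overall rate because most of his flights were in the easier daytime category. The reversal comes from the unequal mix of flights, not from either pilot's skill.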

116Copyright © 2014, 2012, 2009 Pearson Education, Inc.

Learning Objectives

•Summarize categorical data by counting cases and expressing the results as percents.

•Create and interpret bar charts, pie charts and contingency tables.

•Interpret marginal and conditional distributions.

•Make conclusions about independence and associations from analyzing conditional distributions.
