Chapter Two: Organizing and Summarizing Data

Post on 03-Jul-2018


Organizing and Summarizing Data

Learning Objectives:

1. Organize qualitative data using a) a frequency and relative frequency table, b) a bar graph, c) a pie graph, and d) a Pareto graph.

2. Organize quantitative data for

– Discrete data, using a) a frequency and relative frequency table, b) a bar graph, c) a pie graph, and d) a Pareto graph.

– Continuous data, using a) a histogram, b) a stem-and-leaf plot, and c) a time series plot.

Data Presentation

• Qualitative Data: Summary Table, Bar Chart, Pie Chart

• Quantitative Data: Dot Chart, Stem-&-Leaf Display, Frequency Distribution, Histogram, Time Series Plot, Box Plot

Organizing Qualitative or Categorical Data

• A statistical table can be used to summarize data as a data distribution. It consists of the class (category), the class frequency, and the relative frequency or percentage.

• For qualitative data, three measurements are available for the list of categories:

– the frequency, or number of measurements in each category

– the relative frequency, or proportion = frequency / total number of observations

– the percentage = 100 x relative frequency

• A pie chart is the familiar circular graph that shows how the measurements are distributed among the categories.

• A bar chart shows the same distribution of measurements in categories, with the height of the bar measuring how often a particular category was observed.

• A Pareto chart is a bar chart in which the bars are ordered from largest to smallest.

A survey asks 400 individuals to rate school quality. The data are summarized below:

Rating               A     B     C     D
Frequency            35    260   93    12
Relative Frequency
Percentage

Draw a pie chart, a bar chart, and a Pareto chart.

Pareto chart order (largest to smallest frequency): B, C, A, D
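The relative frequency and percentage rows follow directly from the definitions given earlier (relative frequency = frequency / total number of observations). A minimal Python sketch of that arithmetic is shown here; the variable names are illustrative and not part of the original slides.

```python
# Frequencies from the school-quality survey (400 individuals in total).
frequency = {"A": 35, "B": 260, "C": 93, "D": 12}

n = sum(frequency.values())  # total number of observations

print(f"{'Rating':<8}{'Freq':>6}{'Rel. freq':>11}{'Percent':>9}")
for rating, f in frequency.items():
    rel = f / n  # relative frequency = frequency / total
    print(f"{rating:<8}{f:>6}{rel:>11.4f}{rel * 100:>8.2f}%")

# Pareto order: categories sorted from largest to smallest frequency.
pareto_order = sorted(frequency, key=frequency.get, reverse=True)
print("Pareto order:", pareto_order)
```

The sorted list is the left-to-right bar order a Pareto chart would use, while an ordinary bar chart keeps the categories in their natural order A, B, C, D.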


Exercise: A set of ten students is selected, and measurements are recorded as in the following table: [similar exam questions]

Student  GPA  Gender  Year  Major        Number of Credit Hours Enrolled
1        2.0  F       1     Psychology   16
2        2.3  F       2     Mathematics  15
3        2.9  M       2     English      17
4        2.7  M       1     English      15
5        2.6  F       3     Business     14
6        3.2  F       3     Computer     16
7        2.7  F       1     Chemistry    14
8        3.5  M       4     Chemistry    15
9        2.1  M       3     Business     12
10       2.7  F       3     Sociology    16

• Which variables can be described using a pie chart or bar chart?

• Construct a bar chart and a Pareto chart for the variable Year.

Answer

Variables that can be described by a bar chart or pie chart must be qualitative or discrete variables.

Gender and Major are qualitative variables.

Year and Number of Credit Hours Enrolled are discrete.

GPA is a continuous variable.

NOTE: The student ID is only a label. It is not a characteristic for describing students.
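For the second part of the exercise, the bar heights come from counting how many students fall in each Year. The following is a rough sketch of that tally in Python, using the Year column of the table; the text rendering of the bars is only a stand-in for a drawn chart, and the helper name `text_bars` is hypothetical.

```python
from collections import Counter

# Year values for students 1 to 10, copied from the exercise table.
years = [1, 2, 2, 1, 3, 3, 1, 4, 3, 3]

year_counts = Counter(years)

# Bar chart: categories in their natural order (Year 1, 2, 3, 4).
bar_order = sorted(year_counts.items())

# Pareto chart: the same bars ordered from largest to smallest frequency.
pareto = sorted(year_counts.items(), key=lambda pair: pair[1], reverse=True)

def text_bars(pairs):
    """Render (category, frequency) pairs as horizontal text bars."""
    return "\n".join(f"Year {year}: {'#' * count} ({count})" for year, count in pairs)

print("Bar chart:")
print(text_bars(bar_order))
print("Pareto chart:")
print(text_bars(pareto))
```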


Organizing Quantitative Data: Popular Plots

• If the variable can take only a finite or countable number of values, it is a discrete variable.

– For a discrete variable, a bar chart, pie chart, or Pareto chart can be used to describe the variable, as we did for qualitative variables.

• A variable that can assume an infinite number of values corresponding to points on a line interval is called continuous.

– The stem and leaf plot and the histogram are two common graphs for displaying continuous data.

– The time series plot displays the data along the time domain to show trends or patterns over time.

• Dotplots: Plot the measurements as points along the x-axis, stacking the points that duplicate existing points.
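As an illustration of the stacking idea, here is a small Python sketch that prints a character dotplot for the Number of Credit Hours Enrolled column from the earlier exercise table. It assumes integer-valued data and one column per integer; the function name `character_dotplot` is hypothetical.

```python
from collections import Counter

# Credit hours enrolled for the ten students in the earlier exercise table.
credit_hours = [16, 15, 17, 15, 14, 16, 14, 15, 12, 16]

def character_dotplot(values):
    """Return a text dotplot: one column per integer, repeated values stacked upward."""
    counts = Counter(values)
    axis = list(range(min(values), max(values) + 1))
    rows = []
    # Build rows from the tallest stack down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" . " if counts[x] >= level else "   " for x in axis)
        rows.append(row.rstrip())
    # Axis labels, each centered in a three-character column.
    rows.append("".join(f"{x:^3}" for x in axis).rstrip())
    return "\n".join(rows)

print(character_dotplot(credit_hours))
```

Each dot is one student, so the tallest stacks (15 and 16 credit hours) show where the data concentrate.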


Stem and leaf plots: This plot presents a graphical display of the data using the actual numerical values of each data point.

Constructing a Stem and Leaf Plot:

1. Divide each measurement into two parts: the stem and the leaf.

2. List the stems in a column, with a vertical line to their right.

3. For each measurement, record the leaf portion in the same row as its matching stem.

4. Order the leaves from lowest to highest in each stem.

5. Provide a key to your stem and leaf coding so that the reader can recreate the actual measurements if necessary.

Example

The following table lists the prices (in dollars) of 19 different brands of walking shoes. Construct a stem and leaf plot to display the distribution of the data.

90 70 70 70 75 70

65 68 60 74 70 95

75 70 68 65 40 65

70

Solution

The price 74 is represented by the stem 7 and the leaf 4. The price is recovered as 74 x (Leaf Unit) = 74 x (1) = 74.
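The five construction steps can be sketched in Python for the shoe price data. This is a minimal illustration assuming whole-dollar prices with the tens digit as the stem and the ones digit as the leaf (leaf unit = 1); the function name `stem_and_leaf` is hypothetical.

```python
from collections import defaultdict

# Prices (in dollars) of the 19 brands of walking shoes.
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95,
          75, 70, 68, 65, 40, 65, 70]

def stem_and_leaf(values, leaf_unit=1):
    """Map each stem to its sorted leaves (step 1: split, steps 3-4: record and order)."""
    leaves = defaultdict(list)
    for v in values:
        stem, leaf = divmod(v // leaf_unit, 10)
        leaves[stem].append(leaf)
    # Step 2: list every stem from smallest to largest, keeping empty stems
    # so that gaps in the data remain visible.
    return {s: sorted(leaves.get(s, [])) for s in range(min(leaves), max(leaves) + 1)}

plot = stem_and_leaf(prices)
for stem, row in plot.items():
    print(f"{stem:>2} | {' '.join(map(str, row))}")
# Step 5: the key lets the reader recover the original measurements.
print("Key: 7 | 4 represents 74 (leaf unit = 1)")
```

The long row for stem 7 shows that most prices are in the 70s, with 40 standing apart at the low end.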


Interpreting Graphs with a Critical Eye:

• What to look for as you describe the data:

- Scale: the measurement unit, such as $, inches, etc.

- Location: where the center of the data is.

- Shape: the shape of the frequency distribution.

- Outliers: unusual data values, such as a distance of 6000 miles from home when the rest of the distances are much smaller.

• Distributions are often described by their shapes:

- symmetric

- skewed to the right (long tail goes right)

- skewed to the left (long tail goes left)

- unimodal, bimodal, multimodal (one peak, two peaks, many peaks)

Examine the three dotplots generated by Minitab and shown in the following figure. Describe these distributions in terms of their locations and shapes.

[Figure: Character dotplots and the corresponding distribution shapes. Panels: Symmetric, Skew-to-right, Skew-to-left.]

• Skew-to-right: Most values are small. Only a few are much larger. The long tail is on the right side.

• Skew-to-left: Most values are large. Only a few are much smaller. The long tail is on the left side.

Similar exam questions: Identify the shape of a distribution.

Exercise

Determine the shape of the distribution of each of the following variables:

1. Score on a very easy test

2. Score on a very difficult test

3. Entry level salary for college graduates

4. Adult’s height

Answer

1. Very easy test: skew-to-the-left (most scores are high; only a few scores are low).

2. Very difficult test: skew-to-the-right (most scores are low; only a few scores are high).

3. Entry level salary: likely to be skew-to-the-right, since most salaries would be lower than $50,000 and a few could be quite high.

4. Adult’s height: this typically has a symmetric distribution.

Relative Frequency Histograms

What is it?

A relative frequency histogram for a quantitative data set is a graph that describes the relative frequency (or frequency) of a variable, for example, distance from home. The possible values of the variable are divided into a few groups (classes, or intervals). Each class is represented by a rectangle whose height is the relative frequency (or frequency) of observations in that class.

• On the X-axis: the classes (or groups) of the variable.

• On the Y-axis: the relative frequency or frequency of observations within each class.

Histogram for Continuous Data

Why do we want to do this?

A histogram summarizes the data values of a variable in a graph that shows the distribution of the variable. It helps us quickly see where the majority of the data values are, whether there are any very unusual values, whether those unusual values are on the high side or the low end, whether the data values are far apart or close together, and so on.

Is this different from a bar or pie graph?

YES, it is different. A bar or pie graph is for categorical or discrete variables. A histogram is for continuous variables.

How to construct a histogram by hand (in case you do not have technology):

Constructing a relative frequency histogram for continuous variables:

1. Choose the number of classes, usually between 5 and 15.

2. Calculate the approximate class width by dividing the range (Range = largest – smallest) by the number of classes.

3. Round the approximate class width up to a convenient number.

4. Locate the class boundaries.

– If discrete, assign one or more integers to a class.

– If continuous, use the method of left inclusion: include the left class boundary point but not the right boundary point in the class.

– NOTE: Different software may use different methods. Some use right inclusion. Some add an additional decimal place to the class boundaries.

5. Construct a statistical table containing the classes, their boundaries, and their relative frequencies.

6. Construct the histogram like a bar graph.

Example: Constructing a Histogram by Hand

The following table lists the prices (in dollars) of 19 different brands of walking shoes. Construct a relative frequency histogram to display the distribution of the data.

90 70 70 70 75 70 65 68 60 74 70 95

75 70 68 65 40 65 70

Solution

1. Determine the number of classes: for example, use k = 6 classes.

2. Range = 95 – 40 = 55.

3. Class width = 55/6 ≈ 9.17, rounded up to 10. (Round the width up to a convenient number; do not round off or truncate.)

4. Use left inclusion to determine the class boundaries: [40,50), [50,60), [60,70), [70,80), [80,90), [90,100)

5. Construct a relative frequency table. First count the number of observations in each class. This is the frequency, call it fi. The relative frequency is rfi = fi/n, where n is the total number of data points.

6. Draw a two-dimensional graph with the class boundaries of the variable on the X-axis and the relative frequency on the Y-axis. For each class, draw a rectangle whose height is the relative frequency of that class.

Activity: Complete the construction of the histogram

Relative Frequency Table

Group      Frequency   Relative Frequency
[40,50)    1           1/19
[50,60)    0           0
[60,70)    6           6/19
[70,80)    10          10/19
[80,90)    0           0
[90,100)   2           2/19

[Figure: Histogram of ShoePrice, constructed using Minitab. X-axis: ShoePrice (40 to 100); Y-axis: Frequency (0 to 10). Bar heights, left to right: 1, 0, 6, 10, 0, 2.]

NOTE: We need to know how a relative frequency table and a histogram are constructed. The construction of a histogram, however, can easily be done by computer software.
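To make the by-hand procedure concrete, the sketch below follows steps 2 to 5 in Python for the shoe price data. It assumes the first class starts at the smallest value and that a "convenient number" means a multiple of 10; both are assumptions made for this example, and the function name `relative_frequency_table` is hypothetical.

```python
import math
from fractions import Fraction

# Prices (in dollars) of the 19 brands of walking shoes.
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95,
          75, 70, 68, 65, 40, 65, 70]

def relative_frequency_table(values, k, convenient=10):
    """Return (left, right, frequency, relative frequency) for k left-inclusive classes."""
    lo, hi = min(values), max(values)
    # Steps 2-3: approximate width = range / k, rounded UP to a multiple of `convenient`.
    width = math.ceil((hi - lo) / k / convenient) * convenient
    if hi >= lo + k * width:
        raise ValueError("largest value falls outside the last class; use more or wider classes")
    # Step 4: left inclusion, so a value v belongs to the class with left <= v < right.
    counts = [0] * k
    for v in values:
        counts[int((v - lo) // width)] += 1
    # Step 5: relative frequency = f_i / n, kept exact as a fraction.
    n = len(values)
    return [(lo + i * width, lo + (i + 1) * width, f, Fraction(f, n))
            for i, f in enumerate(counts)]

table = relative_frequency_table(prices, k=6)
for left, right, f, rf in table:
    print(f"[{left},{right})  f = {f:>2}  rf = {rf}")
```

With k = 6 this reproduces the classes [40,50) through [90,100) and the frequencies 1, 0, 6, 10, 0, 2 from the table.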

Using Minitab to create the Default Histogram for the Shoes Price Data

[Figure: Histogram of ShoePrice, using default options. X-axis: ShoePrice (40 to 100); Y-axis: Frequency (0 to 14).]

Go to Minitab. In the Worksheet window, enter the prices of the 19 pairs of shoes, and name the column: Price.

90 70 70 70 75 70 65 68 60 74 70 95

75 70 68 65 40 65 70

SAVE your data set: File, Save Worksheet As, name it Shoes Price, and save it on your desktop.

Go to the Graph menu, choose Histogram, select Simple, select the variable ‘Price’, OK.
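The default histogram groups the same 19 prices differently from the by-hand table. The Python sketch below shows one way such counts can arise, assuming seven left-inclusive bins of width 10 centered on the axis values 40, 50, ..., 100. That binning rule is an assumption made for illustration, not a statement of how Minitab works internally, and the function name `counts_centered_bins` is hypothetical.

```python
# Prices (in dollars) of the 19 brands of walking shoes.
prices = [90, 70, 70, 70, 75, 70, 65, 68, 60, 74, 70, 95,
          75, 70, 68, 65, 40, 65, 70]

def counts_centered_bins(values, first_mid, width, n_bins):
    """Count values in left-inclusive bins centered at first_mid, first_mid + width, ..."""
    left_edge = first_mid - width / 2  # left boundary of the first bin
    counts = [0] * n_bins
    for v in values:
        i = int((v - left_edge) // width)
        if 0 <= i < n_bins:  # ignore anything outside the covered range
            counts[i] += 1
    return counts

# Assumed bins: [35,45), [45,55), ..., [95,105), with midpoints 40, 50, ..., 100.
default_like = counts_centered_bins(prices, first_mid=40, width=10, n_bins=7)
for mid, count in zip(range(40, 101, 10), default_like):
    print(f"midpoint {mid:>3}: {count}")
```

Under this assumption the prices 65 through 74 share one bin, so the tallest bar holds 13 values instead of the 10 in the class [70,80). The same data can look quite different when the class boundaries move.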


Change the # of intervals in Minitab

As you see, the default histogram has its own number of classes. One can change this number to display a different histogram for the same data. For the Shoes Price example, the following steps change the number of classes from 7 to 4:

Click inside the histogram graph so that the bars are highlighted. Right-click on the bars and choose ‘Edit Bars’.

Go to the Binning menu, change the # of Intervals to 4, OK.

[Figure: Histogram of Distance Data, Distance from Home for CMU students. Sample size = 56. X-axis: distance (0 to 800), excluding an extreme distance of 6000 miles; Y-axis: Frequency (0 to 25). Bar frequencies, left to right: 7, 25, 21, 1, 1, 0, 0, 0, 1.]

The Distance data are grouped into k equal intervals; in this case, k = 9 (Minitab chooses k = 9 as the number of classes).

The X-axis shows the intervals of distance. Minitab chooses the first interval [-50,49], the 2nd interval [50,149], and so on.

The Y-axis shows the frequency of students whose distance falls within each respective interval. For example, there are 25 distances between 50 and 149 miles.

A rectangle is used to represent each interval and its frequency.

A histogram shows the distribution of the Distance variable. Several properties can be noticed:

• The majority of students are from within 250 miles, with a few very far away.

• The distribution of distance is strongly skewed to the right (where the long tail is).

For the Distance data used in your Activity #1, can you construct the histogram shown above?

The following data represent the closing value of the Dow Jones Industrial Average for the years 1980 - 2001.

Time Series Plot


What did you learn from this chapter?

Graphical display for qualitative or categorical data: bar chart, pie chart, Pareto chart.

Graphical display for discrete quantitative data: bar chart, pie chart, Pareto chart.

Graphical display for quantitative continuous data: stem-leaf plot, dot plot, histogram, time-series plot.

The shape of distribution: skew-to-left, symmetric, skew-to-right.

Outliers, rare cases.

Real-time activities for illustration: How far are you from home? Does one minute of exercise increase your pulse rate dramatically?
