
Chapter 2: Descriptive Statistics

2.1 Organizing Qualitative Data

2.2 Organizing Quantitative Data

2.3 Additional Displays

2.4 Misrepresentations of Data

September 10, 2008

Categorical Variables

• Each observation (data point) for a categorical variable belongs to one category among different categories

• Examples of categorical variables:

– Gender (Categories: male or female)

– Religious Affiliation (Protestant, Catholic, Jew, Muslim, etc.)

– Home State or Country (NJ, AR, CA, FL, Canada, etc.)

– Favorite Singer (Elvis, Sting, Sinatra, etc.)

– Eye Color (brown, green, blue, hazel, black)

– Favorite Type of Music (jazz, country, rock, etc.)

Section 2.1

Frequency Tables for Categorical Data

Consider a population that has $N$ categorical variables: $\{C_1, C_2, \ldots, C_N\}$. For example, consider the population of freshmen at Vanderbilt during the present semester and the categorical variables for this population: $\{C_1, C_2, C_3\} = \{\text{gender}, \text{state}, \text{favorite color}\}$.

For each categorical variable $C_j$, we list its possible categories: say it can take $k$ values, $\{x_{j1}, x_{j2}, \ldots, x_{jk}\}$. For example, $C_1$ has the possible categories $\{\text{male}, \text{female}\}$, i.e., $k = 2$; $C_2$ has 51 possible categories (50 states + other), i.e., $k = 51$.

Definition: For a population or a sample and a particular categorical variable, the number of times that the variable falls in a particular category is called the frequency of this category. The category that has the highest frequency is called the mode for the variable. A table composed of the frequencies for the categories is sometimes called the frequency distribution or simply the distribution of the categorical variable.

Remark: It makes sense to construct frequency tables for a discrete quantitative variable since we can consider each discrete value of the variable a category.

Relative Frequency

Example: The categorical variable is the color of a ball in a population. A sample of 10 balls (red, green, and blue) gives the following table.

Category Frequency Relative Frequency

Red 5 5/10 = 0.5

Green 2 2/10 = 0.2

Blue 3 3/10 = 0.3

Definition: Suppose that a categorical variable has $N$ categories. Furthermore, suppose that category $k$ has frequency $f_k$ and that $n = f_1 + f_2 + \cdots + f_N$ is the total number of data points in the sample. Then the relative frequency of the $k$th category is defined as $f_k / n$, $k = 1, 2, \ldots, N$. The relative frequency is also called the proportion.
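
As a minimal sketch (Python is our choice here, not part of the original slides), the relative frequencies in the ball example can be computed directly from the frequency table. The names `frequencies` and `relative_frequencies` are illustrative only.

```python
# Frequency table from the ball example above.
frequencies = {"Red": 5, "Green": 2, "Blue": 3}

# n is the total number of data points in the sample.
n = sum(frequencies.values())

# Relative frequency (proportion) of each category: f_k / n.
relative_frequencies = {color: f / n for color, f in frequencies.items()}

for color, f in frequencies.items():
    print(f"{color:<6} {f:>2}  {relative_frequencies[color]:.1f}")
```

The relative frequencies always add up to 1, which is a quick check on any table built by hand.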

Example

Consider the population of vehicles that are parked in the 25th Avenue Garage and consider the categorical variable for the type of transmission (automatic or manual) in the vehicles. One hundred cars were surveyed. We construct a frequency table.

Category Automatic Manual

Number of Vehicles 73 27

The frequency of automatics is 73 and the frequency of manuals is 27. The mode for this variable in the sample is automatic, the category with the highest frequency (73). The relative frequency of automatics is 73/100 = 0.73 (73%).

Remarks on Frequency Tables

• A method of organizing data.

• Lists all possible categories for a variable along with the number of observations for each value of the variable.

• In addition, we sometimes add columns for the proportion and percentage for each value of the variable.

Example

Florida: 289/735 = 0.3931972 and (289/735) × 100 ≈ 39.3%

Example (categorical)

We are interested in the dominant color of cars that are parked on the Vanderbilt campus. Suppose we go to the 25th Avenue Garage and survey the color (black, white, red, blue, green, other) of 110 cars for a sample. In the table below we summarize the counts of this categorical variable.

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20

Bar Chart


Bar charts can also be constructed using Excel.

Definition: A bar chart for a categorical variable is a series of horizontal or vertical bars with the length (or height) of each bar representing the frequency of a particular category for the variable.

Bar Chart for Relative Frequency

Remark: Instead of the bars representing the frequency of a category, they could represent the relative frequency.

Color Frequency Relative Frequency

Black 20 0.182

White 10 0.091

Red 15 0.136

Blue 35 0.318

Green 10 0.091

Other 20 0.182
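
A rough text-only sketch of a relative frequency bar chart for the car-color table is given below. The helper `bar_chart_rows` and the bar width of 40 characters are illustrative assumptions; a spreadsheet or plotting library would normally be used instead.

```python
# Car-color frequencies from the table above.
color_counts = {"Black": 20, "White": 10, "Red": 15,
                "Blue": 35, "Green": 10, "Other": 20}
total = sum(color_counts.values())

def bar_chart_rows(counts, width=40):
    """Return one text row per category; bar length is proportional
    to the relative frequency of that category."""
    n = sum(counts.values())
    rows = []
    for category, count in counts.items():
        rel = count / n
        bar = "#" * round(rel * width)
        rows.append(f"{category:<6} {rel:.3f} {bar}")
    return rows

for row in bar_chart_rows(color_counts):
    print(row)
```

Because every bar is scaled by the same total, the relative frequency chart has the same shape as the frequency chart; only the axis labels change.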

Pie Chart


Definition: A pie chart for a categorical variable is a circle divided into sectors, with the size (central angle) of each sector proportional to the frequency of a category for the variable.
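
To make the definition concrete, here is a small sketch that converts the car-color frequencies into central angles, using the fact that a full circle is 360 degrees. The variable names are illustrative.

```python
# Car-color frequencies from the example table.
color_counts = {"Black": 20, "White": 10, "Red": 15,
                "Blue": 35, "Green": 10, "Other": 20}
total = sum(color_counts.values())

# Central angle of each sector = relative frequency * 360 degrees.
sector_angles = {color: 360 * count / total
                 for color, count in color_counts.items()}

for color, angle in sector_angles.items():
    print(f"{color:<6} {angle:6.1f} degrees")
```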

Variations of Pie Chart

Pie Chart with Excel

Create a pie chart for the following data using Excel.

Color Frequency

Black 20

White 10

Red 15

Blue 35

Green 10

Other 20

Example (Doctorates)

Year    Physical Sciences    Engineering    Life Sciences    Social Sciences    Humanities    Education

1983    4425    2781    5553    6096    3500    7174

1993    6496    5698    7395    6545    4481    6689

2003    5963    5265    8369    6777    5412    6627

Doctorate Recipients: 1983, 1993, 2003. For each year we have six categories: type of degree.

(continued)

Chart legend: green = 1983, red = 1993, orange = 2003.

Pareto Charts

In a bar chart, if we order the bars (categories) from tallest to shortest, then this bar chart is called a Pareto Chart. The reason for doing this is that the “most important” category appears first.

Definition: A Pareto Chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.
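
As a sketch, the bar order for a Pareto chart of the car-color data can be obtained by sorting the categories by frequency. How ties are broken is not specified by the definition; here tied categories simply keep their original table order.

```python
# Car-color frequencies from the earlier example.
color_counts = {"Black": 20, "White": 10, "Red": 15,
                "Blue": 35, "Green": 10, "Other": 20}

# Sort by frequency, largest first. Python's sort is stable, so
# categories with equal frequency keep their original relative order.
pareto_order = sorted(color_counts.items(),
                      key=lambda item: item[1],
                      reverse=True)

for color, count in pareto_order:
    print(f"{color:<6} {count}")
```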

Example

Consider the following sample composed of Vanderbilt students who are studying at least one foreign language.

Spanish Chinese Spanish Spanish Spanish

Chinese German Spanish Spanish French

Spanish Spanish Japanese Latin Spanish

German German Spanish Italian Spanish

Italian Japanese Chinese Spanish French

Spanish Spanish Russian Latin French

(a) Construct the frequency distribution for this sample.

(b) Construct the relative frequency distribution.

(c) Construct the bar chart for the frequency.

(d) Construct the bar chart for the relative frequency.

(e) What is the mode of the frequency distribution?

Solution

Category Frequency Relative Frequency

French 3 3/30 = 0.100

Latin 2 2/30 = 0.067

Russian 1 1/30 = 0.033

Japanese 2 2/30 = 0.067

Italian 2 2/30 = 0.067

German 3 3/30 = 0.100

Chinese 3 3/30 = 0.100

Spanish 14 14/30 = 0.467
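
The table can be checked by tallying the raw responses. The sketch below uses `collections.Counter` from the Python standard library; it also answers part (e), since the mode is the category with the largest count.

```python
from collections import Counter

# The 30 responses from the example, row by row.
responses = """
Spanish Chinese Spanish Spanish Spanish
Chinese German Spanish Spanish French
Spanish Spanish Japanese Latin Spanish
German German Spanish Italian Spanish
Italian Japanese Chinese Spanish French
Spanish Spanish Russian Latin French
""".split()

frequency = Counter(responses)   # (a) frequency distribution
n = len(responses)
relative = {lang: count / n for lang, count in frequency.items()}  # (b)

# (e) The mode is the category with the highest frequency.
mode, mode_count = frequency.most_common(1)[0]

for lang, count in frequency.most_common():
    print(f"{lang:<9} {count:>2}  {relative[lang]:.3f}")
print("Mode:", mode)
```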

Organizing Quantitative Data

Section 2.2

Two Types of Quantitative Data

• Discrete

– Tables (frequency tables, relative frequency tables)

– Dot Plots

– Stem-and-Leaf Plots

– Histograms

• Continuous

– Histograms

Tables and Discrete Data

Remark: For the purpose of building frequency tables, there is essentially no difference between categorical data and discrete quantitative data. Each distinct number is treated as a category.

Example: Consider a discrete set of quantitative data:

{1,-1,1,0,0,2,3,1,0,2} .

We can construct a frequency table for the numbers in this set of numbers.

Data Point Frequency Relative Frequency

-1 1 1/10 = 0.1

0 3 3/10 = 0.3

1 3 3/10 = 0.3

2 2 2/10 = 0.2

3 1 1/10 = 0.1

Sum 10 1.0

Frequency Chart

Histograms

Definition: A histogram is a special type of bar chart that shows the frequency of quantitative data that is separated into intervals (bins or classes).

Example

Construct a histogram for the data, {1.1,1.8, 0.9, 0.2, 2.5, 1.3 ,2.1, 2.1, 2.9, 2.0}, using the bins: [0,1), [1,2), [2,3).

[0,1): 0.9, 0.2 (frequency = 2)

[1,2): 1.1, 1.8, 1.3 (frequency = 3)

[2,3): 2.5, 2.1, 2.1, 2.9, 2.0 (frequency = 5)
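
A minimal sketch of the counting step, assuming half-open bins [low, high) as in the example; the helper name `bin_counts` is ours.

```python
data = [1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0]
bins = [(0, 1), (1, 2), (2, 3)]  # each pair is a half-open bin [low, high)

def bin_counts(values, intervals):
    """Count how many values fall in each half-open interval [low, high)."""
    return [sum(1 for v in values if low <= v < high)
            for low, high in intervals]

counts = bin_counts(data, bins)

# Print a sideways text histogram: one star per data point.
for (low, high), count in zip(bins, counts):
    print(f"[{low},{high}): {'*' * count} ({count})")
```

Using half-open bins means a boundary value such as 2.0 is counted exactly once, in [2,3).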

Dot Plots

• Primarily for discrete quantitative data.

• Similar to a bar chart or histogram.

• Includes information about frequency, i.e., how many times a data point appears as a single number or in a range of values.

Definition: A dot plot is a chart for discrete quantitative data where each observation is represented by a dot and the possible values of the data are represented along the horizontal axis.

Example (quantitative)

Suppose we stand at the entrance of the Math. Building and count the number of people entering over a 10 minute period in 1 minute increments. Below we have a table that summarizes our sample and the resulting dot plot.

Time Interval (minutes)    Number of People

1 (0-1)    3

2 (1-2)    1

5 (4-5)    3

6 (5-6)    4

10 (9-10)    7

In the table, we omitted the intervals during which no people entered.

Example

This table summarizes the amount of sodium (mg) and sugar (g) for some popular breakfast cereals. It also characterizes the type (adult or child) of cereal. Hence, we have three pieces of data (variables) for each cereal: 2 quantitative and 1 categorical. We will use the dot plot for the sodium.

Dot Plot of Sodium

Notice that a dot plot gives information about how often a number in a numerical data sample recurs, e.g., 70 occurs once and 200 twice.

Stem-and-Leaf Plots

• A stem-and-leaf plot organizes data to show its shape and distribution.

• Each data point is represented by a stem and a leaf.

• Usually, the leaf is the last digit of the numerical data point and the other digits to the left of the leaf form the stem. For example, if 9834 is a data point, then 4 is the leaf and 983 is the stem (stem | leaf).

• In a set of data, a stem may have several leaves.

• For one-digit data (0,1,2,…,9), we can represent the data as 00,01,…,09. For a data point 0X, the leaf is X and the stem is 0.

• We usually organize by stems.

• It is sometimes necessary to modify this representation when large numbers are involved. In this case the stem will represent a class of numbers of the form $d \times 10^s$.

Example

Suppose a sample contains the following data points: {9, 15, 17, 24, 50, 65, 101, 170, 171}.

Number Stem Leaf

9 = 09 0 9

15 1 5

17 1 7

24 2 4

50 5 0

65 6 5

101 10 1

170 17 0

171 17 1

Stems Leaves

0 9

1 57

2 4

5 0

6 5

10 1

17 01

Example

Construct a Stem-and-Leaf plot for the data: {5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5, 12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9, 6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9, 4.9, 6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7}.

We ignore the decimal point or, alternatively, multiply each number by 10.

Stems Leaves

6 00114578

8 23369

9 1489

10 1139
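
The construction can be sketched in a few lines of Python. The function `stem_and_leaf` and its `scale` parameter are illustrative names; the code assumes non-negative data with one decimal place, so that multiplying by 10 gives whole numbers, as described above.

```python
from collections import defaultdict

data = [5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5,
        12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9,
        6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9, 4.9,
        6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7]

def stem_and_leaf(values, scale=10):
    """Return {stem: sorted leaves as a string} after multiplying
    every value by `scale` to remove the decimal point."""
    plot = defaultdict(list)
    for value in values:
        whole = round(value * scale)      # e.g. 6.1 -> 61
        stem, leaf = divmod(whole, 10)    # 61 -> stem 6, leaf 1
        plot[stem].append(leaf)
    return {stem: "".join(str(leaf) for leaf in sorted(leaves))
            for stem, leaves in sorted(plot.items())}

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(f"{stem:>2} | {leaves}")
```

Running this prints one row per stem from 0 through 14, including the rows for stems 6, 8, 9 and 10 shown above.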

On-line Stem-and-Leaf Plotter

http://www.shodor.org/interactivate/activities/StemAndLeafPlotter/

Stem-and-leaf Plots and Frequency

Consider a sample {101,103,104,108,109}. If we construct the stem-and-leaf plot for this data, then there is a single stem (10) and five leaves (1,3,4,8,9). Hence, the number of leaves, i.e., 5, is the frequency with which the data fall in the interval [100,109]. We can conclude that the number of leaves on a stem equals the number of data points that fall in the corresponding interval of 10 consecutive integers.

Bottom Line

Dot plots and stem-and-leaf plots segregate the data into bins (or numerical ranges or classes) and they show the frequency of data within those classes. This is useful information, but it is not practical when one has a sample with a large number of data points.

Remark: Frequency Tables & Dot Plots

Sodium data (mg): 0, 210, 260, 125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 150, 170, 70, 230, 200, 290, 180

The frequency of any sodium level or interval can be read off the dot plot.

A frequency table and a dot plot give basically the same information.
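
To illustrate the remark, the sketch below builds a frequency table from the sodium values and prints it as a sideways text dot plot. It assumes the run-together sodium list is read as the 20 values shown in the code, with the first entry taken to be 0 mg.

```python
from collections import Counter

# Sodium values (mg); the reading of the list is an assumption, see above.
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

# The frequency table: value -> number of occurrences.
frequency = Counter(sodium)

# The dot plot: one dot per occurrence, values in increasing order.
for value in sorted(frequency):
    print(f"{value:>3} | {'.' * frequency[value]}")
```

Each row of dots is just the corresponding frequency drawn out, which is why the two displays carry the same information.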

Continuous Data described by Histograms

Definition: A histogram is a type of bar chart that gives the frequencies or relative frequencies of occurrences of a quantitative variable (either discrete or continuous) in specified intervals.

Interval Frequency

0-39 1

40-79 1

80-119 0

120-159 4

160-199 3

200-239 7

240-279 2

280-319 2

Construction of Histograms

• Define intervals of equal width for the variable under consideration. For example, if the data in our sample are integers ranging from 0 to 50, we might choose the intervals (bins) [0,9], [10,19], [20,29], [30,39], [40,49], [50,59]. The intervals or bins are called classes. The length of a class is called the class width.

• Count the number of data points in each bin. In the above example, this gives 6 nonnegative integer counts, one per bin.

• Construct a bar chart with the intervals specifying the width of the bars and the frequencies giving the height of the bars. Note that the width of the bar is arbitrary as long as we know the length of the intervals over which we do the frequency counting.

• The heights of the bars in the histogram are called the distribution of the sample.

• Histograms could be used for categorical data.

• Remark: Instead of using the frequency counts, we could use the fraction of the total sample size (percentage) as the height.
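
The steps above can be sketched for the sodium data with classes of width 40 (0-39, 40-79, ..., 280-319). The sodium values are our reading of the earlier slide (first entry taken as 0 mg), and the names `class_width`, `lowest` and `num_classes` are illustrative.

```python
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

class_width = 40   # equal width for every class
lowest = 0         # lower limit of the first class
num_classes = 8    # classes 0-39 up to 280-319

# Step 2: count the data points in each class. With equal-width
# classes the class index is found by integer division.
counts = [0] * num_classes
for value in sodium:
    counts[(value - lowest) // class_width] += 1

# Step 3: report frequency and percentage for each class.
for index, count in enumerate(counts):
    low = lowest + index * class_width
    high = low + class_width - 1
    percent = 100 * count / len(sodium)
    print(f"{low:>3}-{high:<3} {count:>2} {percent:5.1f}%")
```

The counts agree with the interval table for the sodium histogram shown earlier.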

Example

Construct a histogram (using percentages) for the following sample: {1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1}.

Interval Frequency Fraction

[-3,-2) 1 1/11 ≈ 0.091

[-2,-1) 0 0/11 = 0

[-1,0) 2 2/11 ≈ 0.182

[0,1) 3 3/11 ≈ 0.273

[1,2) 1 1/11 ≈ 0.091

[2,3) 1 1/11 ≈ 0.091

[3,4) 1 1/11 ≈ 0.091

[4,5) 2 2/11 ≈ 0.182

Histogram for Example

Example (IQ Scores)

IQ Range Frequency

60-69 2

70-79 3

80-89 13

90-99 42

100-109 58

110-119 40

120-129 31

130-139 8

140-149 2

150-159 1


How many students were sampled?

What is the width of the intervals?

Which range of IQ had the highest frequency?

Which range of IQ had the lowest frequency?

Dot, Stem-and-leaf, or Histogram?

• Dot plot and Stem-and-Leaf plot:

– Useful for showing information about small data sets.

– Shows actual data.

• Histogram:

– Useful for showing information about large data sets.

– Can be used for continuous or discrete data.

– Most compact plot.

– Has flexibility in defining intervals.

The Shape of the Distribution

For a histogram, we can associate the graph of a function by drawing a smooth curve through the midpoints of the tops of the bars. The shape of this curve can be used to describe the shape of the histogram.

Unimodal and Bimodal

Unimodal: one hump

Bimodal: two humps

Skewed Distributions

Skewed to the right

Skewed to the left

Symmetric

Distribution Terminology

• The value (class) at which a histogram has its highest bar is called the mode of the distribution. Hence the terminology unimodal and bimodal.

• A distribution is said to be symmetric if there is a vertical line that separates the distribution into two mirror-image halves.

• A distribution that is not symmetric is said to be skewed.

• The “ends” of a distribution are called the tails of the distribution.

Outliers

A bar that is completely separated from the cluster of bars is called an outlier.

Hours of TV Watching

Wechsler Adult Intelligence Scale (IQ)

Range %

<55 0.15

55-70 1.85

70-85 13.0

85-100 35.0

100-115 33.0

115-130 15.0

130-145 1.80

>145 0.20

The distribution is almost symmetric.

Additional Displays for Quantitative Data

Section 2.3

Alternative to histograms for quantitative data: Frequency Polygons.

Definition: Suppose that an interval, [a,b), represents a class for a set of quantitative data. The class midpoint is defined as (a+b)/2.

Definition: A frequency polygon is a graph that is constructed from the class midpoints and their frequencies.

Bins (class) Class Midpoint Frequency

[a,b) (a+b)/2 f

… … …
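
As a small sketch, the points of a frequency polygon can be computed from the bins and frequencies of the earlier histogram example ([0,1), [1,2), [2,3) with frequencies 2, 3, 5). The name `polygon_points` is illustrative; plotting and joining the points with line segments is left to the reader's tool of choice.

```python
# Bins [a, b) and their frequencies from the earlier histogram example.
bins = [(0, 1), (1, 2), (2, 3)]
frequencies = [2, 3, 5]

# Each polygon vertex is (class midpoint, frequency),
# where the class midpoint of [a, b) is (a + b) / 2.
polygon_points = [((a + b) / 2, f) for (a, b), f in zip(bins, frequencies)]

for midpoint, f in polygon_points:
    print(f"midpoint {midpoint:.1f}  frequency {f}")
```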

Example

Mathematica Demonstration

Cumulative Frequency Distribution

Suppose that $\{f_1, f_2, \ldots, f_k\}$ is the set of frequencies for some data set of size $n$. That is, suppose that we subdivide the interval between the smallest and largest values of the data set into $k$ categories (subintervals). We then count the number of data points that lie in each subinterval. The cumulative frequency of category $j$ is defined as

$$f_1 + f_2 + \cdots + f_j = \sum_{i=1}^{j} f_i .$$

Note that the cumulative frequency of category $k$ is $f_1 + f_2 + \cdots + f_k = n$.

Cumulative Frequency

Example

data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}

bins = [0,1), [1,2), [2,3), [3,4)

n = 13

Bin Frequency Cumulative Frequency

[0,1) 3 3

[1,2) 6 3+6 = 9

[2,3) 2 3+6+2 = 11

[3,4) 2 3+6+2+2 = 13
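
The running sums in the last column can be sketched with `itertools.accumulate` from the Python standard library; the variable names are illustrative.

```python
from itertools import accumulate

bins = ["[0,1)", "[1,2)", "[2,3)", "[3,4)"]
frequencies = [3, 6, 2, 2]

# Cumulative frequency of bin j is f_1 + f_2 + ... + f_j.
cumulative = list(accumulate(frequencies))

for label, f, c in zip(bins, frequencies, cumulative):
    print(f"{label}  {f}  {c}")
```

The last cumulative frequency equals the sample size n = 13, as noted in the definition.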

Cumulative Relative Frequency Distribution

If $\{f_1, f_2, \ldots, f_k\}$ are the frequencies in bins (classes) $\{[a_1, a_2), [a_2, a_3), \ldots, [a_k, a_{k+1})\}$ for a set of data such that $f_1 + f_2 + \cdots + f_k = n$, then we define the relative frequencies $r_j = f_j / n$. We note that $r_1 + r_2 + \cdots + r_k = 1$. The cumulative relative frequency for bin $j$ is defined as $r_1 + r_2 + \cdots + r_j$.

Example

data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}

bins = [0,1), [1,2), [2,3), [3,4)

n = 13

Bin    Frequency    Cumulative Frequency    Relative Frequency (rounded)    Cumulative Relative Frequency (rounded)

[0,1)    3    3    3/13 ≈ 0.231    3/13 ≈ 0.231

[1,2)    6    3+6 = 9    6/13 ≈ 0.462    9/13 ≈ 0.692

[2,3)    2    3+6+2 = 11    2/13 ≈ 0.154    11/13 ≈ 0.846

[3,4)    2    3+6+2+2 = 13    2/13 ≈ 0.154    13/13 = 1.000

Relative Frequency Distribution (histogram)

Definition: An ogive is a graph of the cumulative frequency or the relative cumulative frequency as a function of the bins used to construct the cumulative or relative cumulative frequency. It is constructed by using a cumulative frequency (or relative cumulative frequency) table.
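
A sketch of the points one would plot for a relative cumulative ogive, using the bins [0,1), [1,2), [2,3), [3,4) with frequencies 3, 6, 2, 2. Plotting each cumulative value at the upper boundary of its bin, and starting at height 0 at the lower boundary of the first bin, is one common convention and is an assumption here, not something the definition fixes. Exact fractions are used to avoid accumulating rounding error.

```python
from fractions import Fraction
from itertools import accumulate

upper_boundaries = [1, 2, 3, 4]   # right end of each bin [a, b)
frequencies = [3, 6, 2, 2]
n = sum(frequencies)

# Cumulative relative frequency of bin j: (f_1 + ... + f_j) / n.
cumulative_relative = [Fraction(c, n) for c in accumulate(frequencies)]

# Ogive vertices: start at (0, 0), then one point per bin boundary.
ogive_points = [(0, Fraction(0))] + list(zip(upper_boundaries,
                                             cumulative_relative))

for x, y in ogive_points:
    print(f"x = {x}  cumulative relative frequency = {float(y):.3f}")
```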


Time-series Data

Definition: Data about a particular variable collected over a period of time is called time-series data.

Example: Closing prices of IBM stock since Jan. 1, 2008.

Bad Graphical Representation of Data

Section 2.4

Problem: Graphs can give an incomplete or even misleading representation of the sample (data).

The Scale Problem

The number of bachelor’s degrees in engineering for 1999-2003 is given in the following table:

Year Number of Degrees

1999 62,372

2000 63,731

2001 65,113

2002 67,301

2003 70,949

Misleading Bar Chart
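
The scale problem can be quantified with a short sketch. The axis baseline of 60,000 used below is a hypothetical value chosen for illustration; the slides do not state where the vertical axis of the misleading chart starts.

```python
# Bachelor's degrees in engineering, from the table above.
degrees = {1999: 62372, 2000: 63731, 2001: 65113, 2002: 67301, 2003: 70949}

first, last = degrees[1999], degrees[2003]

# Actual growth from 1999 to 2003.
percent_increase = 100 * (last - first) / first

# Ratio of bar heights when the vertical axis starts at 0.
honest_ratio = last / first

# Ratio of bar heights when the axis starts at a truncated baseline.
baseline = 60000  # hypothetical axis start, not taken from the slides
truncated_ratio = (last - baseline) / (first - baseline)

print(f"Increase 1999-2003: {percent_increase:.1f}%")
print(f"Bar height ratio, axis from 0: {honest_ratio:.2f}")
print(f"Bar height ratio, axis from {baseline}: {truncated_ratio:.2f}")
```

Under this assumed baseline the 2003 bar would be drawn more than four times as tall as the 1999 bar, even though the number of degrees grew by less than 14%.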
