1 chapter 2: descriptive statistics 2.1 organizing qualitative data 2.2 organizing quantitative data...
Post on 26-Dec-2015
1
Chapter 2: Descriptive Statistics
2.1 Organizing Qualitative Data
2.2 Organizing Quantitative Data
2.3 Additional Displays
2.4 Misrepresentations of Data
September 10, 2008
2
Categorical Variables
• Each observation (data point) for a categorical variable belongs to one category among different categories
• Variable:
– Gender (Categories: male or female)
– Religious Affiliation (Protestant, Catholic, Jew, Muslim, etc.)
– Home State or Country (NJ, AR, CA, FL, Canada, etc.)
– Favorite Singer (Elvis, Sting, Sinatra, etc.)
– Eye Color (brown, green, blue, hazel, black)
– Favorite Type of Music (jazz, country, rock, etc.)
Section 2.1
3
Frequency Tables for Categorical Data
Consider a population that has $N$ categorical variables: $\{C_1, C_2, \ldots, C_N\}$. For example, consider the population of freshmen at Vanderbilt during the present semester and the categorical variables for this population: $\{C_1, C_2, C_3\}$ = {gender, state, favorite color}.
For each categorical variable $C_j$, we list its possible categories; say it can take $k$ values, $\{x_{j1}, x_{j2}, \ldots, x_{jk}\}$. For example, $C_1$ has the possible categories {male, female}, i.e., $k = 2$; $C_2$ has 51 possible categories (50 states + other), i.e., $k = 51$.
Definition: For a population or a sample and a particular categorical variable, the number of times that the variable falls in a particular category is called the frequency of that category. The category that has the highest frequency is called the mode of the variable. A table composed of the frequencies for the categories is sometimes called the frequency distribution, or simply the distribution, of the categorical variable.
Remark: It makes sense to construct frequency tables for a discrete quantitative variable since we can consider each discrete value of the variable a category.
4
Relative Frequency
Example: The categorical variable is the color of a ball in a population. A sample of 10 balls (red, green, and blue) gives the following table.
Category Frequency Relative Frequency
Red 5 5/10 = 0.5
Green 2 2/10 = 0.2
Blue 3 3/10 = 0.3
Definition: Suppose that a categorical variable has $N$ categories. Furthermore, suppose that category $k$ has frequency $f_k$ and that $n = f_1 + f_2 + \cdots + f_N$ is the total number of data points in the sample. Then the relative frequency of the $k$th category is defined as $r_k = f_k / n$, $k = 1, 2, \ldots, N$. The relative frequency is also called the proportion.
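As a minimal sketch of the definition, the relative frequency of each category is its frequency divided by the total number of data points. The numbers are the ball-color counts from the example; the variable names are illustrative only.

```python
# Frequencies from the ball-color example above.
frequencies = {"Red": 5, "Green": 2, "Blue": 3}

# n is the total number of data points: the sum of all category frequencies.
n = sum(frequencies.values())

# Relative frequency (proportion) of each category: frequency divided by n.
relative = {category: f / n for category, f in frequencies.items()}

for category, f in frequencies.items():
    print(f"{category:<6} {f:>2} {relative[category]:.1f}")
```

The relative frequencies always sum to 1, which is a quick check on any such table.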
5
Example
Consider the population of vehicles that are parked in the 25th Avenue Garage and consider the categorical variable for the type of transmission (automatic or manual) in the vehicles. One hundred cars were surveyed. We construct a frequency table.
Category Automatic Manual
Number of Vehicles 73 27
The frequency of automatics is 73 and the frequency of manuals is 27. The mode for this categorical variable in the sample is automatic, the category with the highest frequency (73). The relative frequency of automatics is 73/100 = 0.73 (73%).
6
Remarks on Frequency Tables
• A method of organizing data
• Lists all possible categories for a variable along with the number of observations for each value of the variable.
• In addition, we sometimes add columns for the proportion and percentage for each value of the variable.
7
Example
Florida: 289/735 ≈ 0.3931972 and (289/735) × 100 ≈ 39.3%
8
Example (categorical)
We are interested in the dominant color of cars that are parked on the Vanderbilt campus. Suppose we go to the 25th Avenue Garage and survey the color (black, white, red, blue, green, other) of 110 cars for a sample. In the table below we summarize the counts of this categorical variable.
Color Frequency
Black 20
White 10
Red 15
Blue 35
Green 10
Other 20
9
Bar Chart
Color Frequency
Black 20
White 10
Red 15
Blue 35
Green 10
Other 20
Bar charts can also be constructed using Excel.
Definition: A bar chart for a categorical variable is a series of horizontal or vertical bars, with the height (or length) of each bar representing the frequency of a particular category for the variable.
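Besides Excel, a rough text-only bar chart can be produced in a few lines. The sketch below uses the car-color frequencies from the table; the function name `text_bar_chart` and the scale of one "#" per 5 cars are illustrative choices, not part of the original slides.

```python
# Car-color frequencies from the table above.
car_colors = {"Black": 20, "White": 10, "Red": 15, "Blue": 35, "Green": 10, "Other": 20}

def text_bar_chart(frequencies, scale=5):
    """Return one line per category; each '#' stands for `scale` observations."""
    lines = []
    for category, f in frequencies.items():
        bar = "#" * (f // scale)  # bar length is proportional to the frequency
        lines.append(f"{category:<6} {bar} {f}")
    return lines

chart = text_bar_chart(car_colors)
print("\n".join(chart))
```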
10
Bar Chart for Relative Frequency
Remark: Instead of the bars representing the frequency of a category, they could represent the relative frequency.
Color Frequency Relative Frequency
Black 20 0.182
White 10 0.091
Red 15 0.136
Blue 35 0.318
Green 10 0.091
Other 20 0.182
11
Pie Chart
Color Frequency
Black 20
White 10
Red 15
Blue 35
Green 10
Other 20
Definition: A pie chart for a categorical variable is a circle divided into sectors, with each sector representing the frequency of a category for the variable; the angle of each sector is proportional to that frequency.
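One way to see how the sectors are sized: each category gets the share of the full 360 degrees equal to its relative frequency. This is a small sketch using the car-color frequencies from the table; the names are illustrative.

```python
# Car-color frequencies from the table above.
car_colors = {"Black": 20, "White": 10, "Red": 15, "Blue": 35, "Green": 10, "Other": 20}

# Total number of observations in the table.
n = sum(car_colors.values())

# Sector angle in degrees: 360 times the relative frequency of the category.
angles = {color: 360 * f / n for color, f in car_colors.items()}

for color, angle in angles.items():
    print(f"{color:<6} {angle:6.1f} degrees")
```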
12
Variations of Pie Chart
13
Pie Chart with Excel
Create a pie chart for the following data using Excel.
Color Frequency
Black 20
White 10
Red 15
Blue 35
Green 10
Other 20
14
Example (Doctorates)
Year  Physical Sciences  Engineering  Life Sciences  Social Sciences  Humanities  Education
1983 4425 2781 5553 6096 3500 7174
1993 6496 5698 7395 6545 4481 6689
2003 5963 5265 8369 6777 5412 6627
Doctorate Recipients: 1983, 1993, 2003. For each year we have six categories: type of degree.
15
(continued)
Green - 1983
Red - 1993
Orange - 2003
16
Pareto Charts
In a bar chart, if we order the bars (categories) from tallest to shortest, then this bar chart is called a Pareto chart. The reason for doing this is that the “most important” category appears first.
Definition: A Pareto Chart is a bar graph whose bars are drawn in decreasing order of frequency or relative frequency.
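The ordering step behind a Pareto chart is just a sort by decreasing frequency. The sketch below applies it to the car-color frequencies used earlier; the variable names are illustrative.

```python
# Car-color frequencies from the earlier bar-chart example.
car_colors = {"Black": 20, "White": 10, "Red": 15, "Blue": 35, "Green": 10, "Other": 20}

# Sort categories by frequency, largest first. Python's sort is stable,
# so categories with equal frequency keep their original relative order.
pareto_order = sorted(car_colors.items(), key=lambda item: item[1], reverse=True)

for color, f in pareto_order:
    print(f"{color:<6} {f}")
```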
17
Example
Consider the following sample composed of Vanderbilt students who are studying at least one foreign language.
Spanish Chinese Spanish Spanish Spanish
Chinese German Spanish Spanish French
Spanish Spanish Japanese Latin Spanish
German German Spanish Italian Spanish
Italian Japanese Chinese Spanish French
Spanish Spanish Russian Latin French
(a) Construct the frequency distribution for this sample.
(b) Construct the relative frequency distribution.
(c) Construct the bar chart for the frequency.
(d) Construct the bar chart for the relative frequency.
(e) What is the mode of the frequency distribution?
18
Solution
Category Frequency Relative Frequency
French 3 3/30 = 0.100
Latin 2 2/30 = 0.067
Russian 1 1/30 = 0.033
Japanese 2 2/30 = 0.067
Italian 2 2/30 = 0.067
German 3 3/30 = 0.100
Chinese 3 3/30 = 0.100
Spanish 14 14/30 = 0.467
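The table can be checked by tallying the raw sample. The following is a minimal sketch using Python's `collections.Counter`; the data are the 30 responses listed in the example, and the variable names are illustrative.

```python
from collections import Counter

# The 30 responses from the example, row by row.
sample = """
Spanish Chinese Spanish Spanish Spanish
Chinese German Spanish Spanish French
Spanish Spanish Japanese Latin Spanish
German German Spanish Italian Spanish
Italian Japanese Chinese Spanish French
Spanish Spanish Russian Latin French
""".split()

counts = Counter(sample)   # frequency of each category
n = len(sample)            # sample size

# The mode is the category with the highest frequency.
mode, mode_frequency = counts.most_common(1)[0]

for language, f in counts.most_common():
    print(f"{language:<9} {f:>2} {f}/{n} = {f / n:.3f}")
print("Mode:", mode)
```

This also answers part (e): the mode is the category Spanish, not its frequency 14.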
19
Organizing Quantitative Data
Section 2.2
Two Types of Quantitative Data
• Discrete
– Tables
– Frequency Tables
– Relative Frequency Tables
– Dot Plots
– Stem-and-Leaf Plots
– Histograms
• Continuous
– Histograms
20
Tables and Discrete Data
Remark: There is essentially no difference between categorical data and discrete quantitative data. Each number represents a category.
Example: Consider a discrete set of quantitative data:
{1,-1,1,0,0,2,3,1,0,2} .
We can construct a frequency table for the numbers in this set of numbers.
Data Point Frequency Relative Frequency
-1 1 1/10 = 0.1
0 3 3/10 = 0.3
1 3 3/10 = 0.3
2 2 2/10 = 0.2
3 1 1/10 = 0.1
Sum 10 1.0
21
Frequency Chart
22
Histograms
Definition: A histogram is a special type of bar chart that shows the frequency of quantitative data that is separated into intervals (bins or classes).
23
Example
Construct a histogram for the data {1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0}, using the bins [0,1), [1,2), [2,3).
[0,1): 0.9, 0.2 (frequency = 2)
[1,2): 1.1, 1.8, 1.3 (frequency = 3)
[2,3): 2.5, 2.1, 2.1, 2.9, 2.0 (frequency = 5)
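The counting step in this example can be sketched as follows. Each bin is treated as a half-open interval [a, b), so a value such as 2.0 is counted in [2,3) and not in [1,2). The helper name `bin_counts` is illustrative.

```python
data = [1.1, 1.8, 0.9, 0.2, 2.5, 1.3, 2.1, 2.1, 2.9, 2.0]
bins = [(0, 1), (1, 2), (2, 3)]  # each pair (a, b) stands for the interval [a, b)

def bin_counts(values, intervals):
    """Count how many values fall in each half-open interval [a, b)."""
    return [sum(1 for x in values if a <= x < b) for a, b in intervals]

frequencies = bin_counts(data, bins)

# Print a rough text histogram: one '*' per data point in the bin.
for (a, b), f in zip(bins, frequencies):
    print(f"[{a},{b}): {'*' * f} ({f})")
```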
24
Dot Plots
• Primarily for discrete quantitative data
• Similar to a bar chart or histogram
• Includes information about frequency, i.e., how many times a data point appears as a single number or in a range of values.
Definition: A dot plot is a chart for discrete quantitative data in which each observation is represented by a dot and the possible values of the data are represented along the horizontal axis.
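A dot plot can be roughed out in plain text by stacking one "*" per observation above each value. The sketch below uses the small discrete data set {1,-1,1,0,0,2,3,1,0,2} from the earlier frequency-table slide; the layout choices are illustrative.

```python
from collections import Counter

data = [1, -1, 1, 0, 0, 2, 3, 1, 0, 2]

counts = Counter(data)                          # frequency of each value
values = list(range(min(data), max(data) + 1))  # horizontal axis
height = max(counts.values())                   # tallest stack of dots

rows = []
# Build the plot from the top row down; a dot appears at a level
# only if the value occurs at least that many times.
for level in range(height, 0, -1):
    rows.append(" ".join(" * " if counts[v] >= level else "   " for v in values))
rows.append(" ".join(f"{v:^3}" for v in values))  # axis labels

print("\n".join(rows))
```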
25
Example (quantitative)
Suppose we stand at the entrance of the Math. Building and count the number of people entering over a 10 minute period in 1 minute increments. Below we have a table that summarizes our sample and the resulting dot plot.
Time Interval Count
1 (0-1) 3
2 (1-2) 1
5 (4-5) 3
6 (5-6) 4
10 (9-10) 7
In the table, we didn’t put intervals during which no people entered.
26
Example
This table summarizes the amount of sodium (mg) and sugar (g) for some popular breakfast cereals. It also characterizes the type (adult or child) of cereal. Hence, we have three pieces of data (variables) for each cereal: 2 quantitative and 1 categorical. We will use the dot plot for the sodium.
27
Dot Plot of Sodium
Notice that a dot plot gives information about how often a number in a numerical data sample recurs, e.g., 70 occurs once and 200 twice.
28
Stem-and-Leaf Plots
• A stem-and-leaf plot organizes data to show its shape and distribution.
• Each data point is represented by a stem and a leaf.
• Usually, the leaf is the last digit of the numerical data point and the other digits to the left of the leaf form the stem. For example, if 9834 is a data point, then 4 is the leaf and 983 is the stem (stem | leaf = 983 | 4).
• In a set of data, a stem may have several leaves.
• For one-digit data (0,1,2,…,9), we can represent the data as 00,01,…,09. For a data point 0X, the leaf is X and the stem is 0.
• We usually organize by stems.
• It is sometimes necessary to modify this representation when large numbers are involved. In this case the stem will represent a class of numbers of the form d × 10^s.
29
Example
Suppose a sample contains the following data points: {9, 15, 17, 24, 50, 65, 101, 170, 171}.
Number Stem Leaf
9 = 09 0 9
15 1 5
17 1 7
24 2 4
50 5 0
65 6 5
101 10 1
170 17 0
171 17 1
Stems Leaves
0 9
1 57
2 4
3
4
5 0
6 5
7
8
9
10 1
11
12
13
14
15
16
17 01
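The split into stems and leaves can be sketched with integer division: for a nonnegative integer, the leaf is the remainder on division by 10 and the stem is the quotient. The function name `stem_and_leaf` is illustrative, and the sketch assumes nonnegative integer data as in this example.

```python
from collections import defaultdict

data = [9, 15, 17, 24, 50, 65, 101, 170, 171]

def stem_and_leaf(values):
    """Map each stem (all digits but the last) to its sorted leaves (last digits)."""
    plot = defaultdict(list)
    for x in sorted(values):
        stem, leaf = divmod(x, 10)  # e.g. 171 -> stem 17, leaf 1
        plot[stem].append(leaf)
    return plot

plot = stem_and_leaf(data)

# Print every stem from the smallest to the largest, including empty ones.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

For decimal data with one digit after the point, multiplying each value by 10 and rounding to an integer first reduces the problem to this case.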
30
Example
Construct a stem-and-leaf plot for the data: {5.4, 4.3, 4.1, 8.6, 6.0, 7.9, 9.1, 6.1, 3.1, 14.5, 12.5, 8.3, 10.1, 8.2, 6.8, 10.9, 2.3, 1.0, 8.3, 8.9, 6.1, 6.5, 6.0, 9.4, 0.1, 13.9, 3.7, 10.1, 9.9, 4.9, 6.4, 10.3, 2.3, 11.9, 11.7, 12.1, 9.8, 7.8, 2.9, 6.7}.
We ignore the decimal point or, alternatively, multiply each number by 10.
Stems Leaves
0 1
1 0
2 339
3 17
4 139
5 4
6 00114578
7 89
8 23369
9 1489
10 1139
11 79
12 15
13 9
14 5
31
On-line Stem-and-Leaf Plotter
http://www.shodor.org/interactivate/activities/StemAndLeafPlotter/
32
Stem-and-leaf Plots and Frequency
Consider a sample {101,103,104,108,109}. If we construct the stem-and-leaf plot for this data, there is a single stem (10) with five leaves (1,3,4,8,9). Hence the number of leaves, 5, is the frequency with which the data fall in the interval [100,109]. We can conclude that the number of leaves on a stem equals the number of data points that fall in the corresponding interval of ten consecutive integers.
33
Bottom Line
Dot plots and stem-and-leaf plots segregate the data into bins (or numerical ranges or classes) and show the frequency of data within those classes. This is useful information, but these plots are not practical when one has a sample with a large number of data points.
34
Remark: Frequency Tables & Dot Plots
Sodium data (mg): 0, 210, 260, 125, 220, 290, 210, 140, 220, 200, 125, 170, 250, 150, 170, 70, 230, 200, 290, 180
The frequency of each sodium level or interval can be read directly from the dot plot.
A frequency table and a dot plot give basically the same information.
35
Continuous Data described by Histograms
Definition: A histogram is a type of bar chart that gives the frequencies or relative frequencies of occurrences of a quantitative variable (either discrete or continuous) in specified intervals.
Interval Frequency
0-39 1
40-79 1
80-119 0
120-159 4
160-199 3
200-239 7
240-279 2
280-319 2
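The interval counts in this table can be reproduced from the 20 sodium values listed on the earlier dot-plot slide. The sketch below assumes that list is read as 20 separate values, with the leading "000" taken to be 0, and uses classes of width 40 starting at 0.

```python
# Sodium values (mg) as read from the earlier slide; "000" is assumed to mean 0.
sodium = [0, 210, 260, 125, 220, 290, 210, 140, 220, 200,
          125, 170, 250, 150, 170, 70, 230, 200, 290, 180]

width = 40  # class width used in the table: 0-39, 40-79, ...

frequencies = {}
for start in range(0, 320, width):
    end = start + width - 1
    # Count the values that fall in the closed integer range [start, end].
    frequencies[f"{start}-{end}"] = sum(1 for x in sodium if start <= x <= end)

for interval, f in frequencies.items():
    print(f"{interval:<8} {f}")
```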
36
Construction of Histograms
• Define intervals of equal width for the variable under consideration. For example, if the data in our sample are integers ranging from 0 to 50, we might choose the intervals (bins) [0,9], [10,19], [20,29], [30,39], [40,49], [50,59]. The intervals or bins are called classes. The length of a class is called the class width.
• Count the number of data points that are in each bin. In the above example, we would obtain 6 nonnegative integer counts, one per bin.
• Construct a bar chart with the intervals specifying the width of the bars and the frequencies giving the height of the bars. Note that the width of the bar is arbitrary as long as we know the length of the intervals over which we do the frequency counting.
• The heights of the bars in the histogram are called the distribution of the sample.
• Histograms could be used for categorical data.
• Remark: Instead of using the frequency counts, we could use the fraction of the total sample size (percentage) as the height.
37
Example
Construct a histogram (using percentages) for the following sample:{1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1}.
Interval Frequency Fraction
[-3,-2) 1 1/11 ≈ 0.091
[-2,-1) 0 0/11 = 0
[-1,0) 2 2/11 ≈ 0.182
[0,1) 3 3/11 ≈ 0.273
[1,2) 1 1/11 ≈ 0.091
[2,3) 1 1/11 ≈ 0.091
[3,4) 1 1/11 ≈ 0.091
[4,5) 2 2/11 ≈ 0.182
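For unit-width bins [a, a+1), the bin of a value is given by its floor, which also handles negative values correctly (for example, -2.1 falls in [-3,-2) and -1.0 in [-1,0)). The sketch below recomputes the frequencies and fractions for this sample; the two largest values, 4.1 and 4.5, land in [4,5). Variable names are illustrative.

```python
import math
from collections import Counter

sample = [1.1, -1.0, 2.1, 3.5, -2.1, 0.9, 0.75, -0.5, 0.25, 4.5, 4.1]
n = len(sample)

# floor(x) is the left endpoint a of the unit bin [a, a+1) that contains x.
counts = Counter(math.floor(x) for x in sample)

table = []
for a in range(-3, 5):
    f = counts[a]                      # Counter returns 0 for empty bins
    table.append((a, a + 1, f, f / n))

for a, b, f, fraction in table:
    print(f"[{a},{b}) {f} {f}/{n} = {fraction:.3f}")
```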
38
Histogram for Example
39
Example (IQ Scores)
IQ Range Frequency
60-69 2
70-79 3
80-89 13
90-99 42
100-109 58
110-119 40
120-129 31
130-139 8
140-149 2
150-159 1
40
(continued)
IQ Range Frequency
60-69 2
70-79 3
80-89 13
90-99 42
100-109 58
110-119 40
120-129 31
130-139 8
140-149 2
150-159 1
How many students were sampled?
What is the width of the intervals?
Which range of IQ had the highest frequency?
Which range of IQ had the lowest frequency?
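These questions can all be answered directly from the frequency table. The sketch below is one way to do it; it takes the classes as non-overlapping ten-point ranges (so the fifth class is read as 100 to 109), which is an assumption about the intended table.

```python
# (lower, upper) IQ range and its frequency; classes assumed non-overlapping.
iq_table = [
    ((60, 69), 2), ((70, 79), 3), ((80, 89), 13), ((90, 99), 42),
    ((100, 109), 58), ((110, 119), 40), ((120, 129), 31),
    ((130, 139), 8), ((140, 149), 2), ((150, 159), 1),
]

# Sample size: the sum of all class frequencies.
total = sum(f for _, f in iq_table)

# Class width for integer scores: upper - lower + 1.
width = iq_table[0][0][1] - iq_table[0][0][0] + 1

# Classes with the highest and lowest frequency.
highest = max(iq_table, key=lambda row: row[1])
lowest = min(iq_table, key=lambda row: row[1])

print("Students sampled:", total)
print("Class width:", width)
print("Highest frequency:", highest)
print("Lowest frequency:", lowest)
```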
41
Dot, Stem-and-leaf, or Histogram?
• Dot plot and Stem-and-Leaf plot:
– Useful for showing information about small data sets.
– Shows actual data.
• Histogram
– Useful for showing information about large data sets.
– Can be used for continuous or discrete data.
– Most compact plot.
– Has flexibility in defining intervals.
42
The Shape of the Distribution
For a histogram, we can associate the graph of a function by drawing a smooth curve through the midpoints of the tops of the bars. The shape of this curve can be used to describe the shape of the histogram.
43
Unimodal and Bimodal
Unimodal: one hump Bimodal: two humps
44
Skewed Distributions
Skewed to the right Skewed to the left
Symmetric
45
Distribution Terminology
• The class with the highest bar in a histogram is called the mode of the distribution. Hence the terminology unimodal and bimodal.
• A distribution is said to be symmetric if there is a vertical line that separates the distribution into two mirror-image halves.
• A distribution that is not symmetric is said to be skewed.
• The “ends” of a distribution are called the tails of the distribution.
46
Outliers
A bar that is completely separated from the cluster of bars is called an outlier.
47
Hours of TV Watching
48
Wechsler Adult Intelligence Scale (IQ)
Range %
<55 0.15
55-70 1.85
70-85 13.0
85-100 35.0
100-115 33.0
115-130 15.0
130-145 1.80
>145 0.20
The distribution is almost symmetric.
49
Additional Displays for Quantitative Data
Section 2.3
Alternative to histograms for quantitative data: Frequency Polygons.
Definition: Suppose that an interval, [a,b), represents a class for a set of quantitative data. The class midpoint is defined as (a+b)/2.
Definition: A frequency polygon is a graph that is constructed from the class midpoints and their frequencies.
Bins (class) Class Midpoint Frequency
[a,b) (a+b)/2 f
… … …
… … …
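To make the table concrete, the points of a frequency polygon are the pairs (class midpoint, frequency). The sketch below uses the bins [0,1), [1,2), [2,3), [3,4) with frequencies 3, 6, 2, 2, which are the values from the cumulative-frequency example later in this section; the names are illustrative.

```python
bins = [(0, 1), (1, 2), (2, 3), (3, 4)]  # each pair (a, b) stands for [a, b)
frequencies = [3, 6, 2, 2]

# Each polygon vertex is (class midpoint, frequency), with midpoint (a + b) / 2.
polygon_points = [((a + b) / 2, f) for (a, b), f in zip(bins, frequencies)]

for midpoint, f in polygon_points:
    print(f"midpoint {midpoint}: frequency {f}")
```

Joining consecutive points with straight line segments gives the polygon.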
50
Example
Mathematica Demonstration
51
Cumulative Frequency Distribution
Suppose that $\{f_1, f_2, \ldots, f_k\}$ is the set of frequencies for some data set of size $n$. That is, suppose that we subdivide the interval between the smallest and largest values of the data set into $k$ categories (subintervals). We then count the number of data points that lie in each subinterval. The cumulative frequency of category $j$ is defined as $f_1 + f_2 + \cdots + f_j = \sum_{i=1}^{j} f_i$. Note that the cumulative frequency of category $k$ is $f_1 + f_2 + \cdots + f_k = n$.
52
Cumulative Frequency
53
Example
data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}
bins = [0,1), [1,2), [2,3), [3,4)
n = 13
k = 4
Bin Frequency Cumulative Frequency
[0,1) 3 3
[1,2) 6 3+6 = 9
[2,3) 2 3+6+2 = 11
[3,4) 2 3+6+2+2 = 13
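The running sums in this table can be sketched with `itertools.accumulate` from the Python standard library. The data and bins are those of the example; the variable names are illustrative.

```python
from itertools import accumulate

data = [3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8]
bins = [(0, 1), (1, 2), (2, 3), (3, 4)]  # each pair (a, b) stands for [a, b)

# Frequency of each bin: number of data points x with a <= x < b.
frequencies = [sum(1 for x in data if a <= x < b) for a, b in bins]

# Cumulative frequency of bin j: f_1 + f_2 + ... + f_j.
cumulative = list(accumulate(frequencies))

for (a, b), f, c in zip(bins, frequencies, cumulative):
    print(f"[{a},{b}) {f} {c}")
```

The last cumulative frequency equals the sample size, n = 13.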
54
Cumulative Relative Frequency Distribution
If $\{f_1, f_2, \ldots, f_k\}$ are the frequencies in bins (classes) $\{[a_1, a_2), [a_2, a_3), \ldots, [a_k, a_{k+1})\}$ for a set of data such that $f_1 + f_2 + \cdots + f_k = n$, then we define the relative frequencies $r_j = f_j / n$. We note that $r_1 + r_2 + \cdots + r_k = 1$. The cumulative relative frequency for bin $j$ is defined as $r_1 + r_2 + \cdots + r_j$.
55
Example
data = {3.1, 0.1, 0.9, 1.1, 1.3, 1.6, 2.5, 0.3, 2.5, 1.6, 1.6, 3.5, 1.8}
bins = [0,1), [1,2), [2,3), [3,4)
n = 13
k = 4
Bin Frequency Cumulative Frequency Relative Frequency (rounded) Cumulative Relative Frequency
[0,1) 3 3 3/13 ≈ 0.231 3/13 ≈ 0.231
[1,2) 6 3+6 = 9 6/13 ≈ 0.462 9/13 ≈ 0.692
[2,3) 2 3+6+2 = 11 2/13 ≈ 0.154 11/13 ≈ 0.846
[3,4) 2 3+6+2+2 = 13 2/13 ≈ 0.154 13/13 = 1.000
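When computing cumulative relative frequencies, adding already-rounded decimals can drift by a unit in the last place (3/13 is about 0.2308). One way to avoid this, sketched below, is to accumulate exact fractions and round only for display. The frequencies are those of the example; the names are illustrative.

```python
from fractions import Fraction
from itertools import accumulate

frequencies = [3, 6, 2, 2]
n = sum(frequencies)

# Exact relative frequencies r_j = f_j / n.
relative = [Fraction(f, n) for f in frequencies]

# Cumulative relative frequencies r_1 + ... + r_j, still exact.
cumulative_relative = list(accumulate(relative))

# Round only when printing.
for r, c in zip(relative, cumulative_relative):
    print(f"{float(r):.3f} {float(c):.3f}")
```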
56
Relative Frequency Distribution (histogram)
57
Ogive
Definition: An ogive is a graph of the cumulative frequency or the relative cumulative frequency as a function of the bins used to construct the cumulative or relative cumulative frequency. It is constructed by using a cumulative frequency (or relative cumulative frequency) table.
58
Example
Bin Frequency Cumulative Frequency Relative Frequency (rounded) Cumulative Relative Frequency
[0,1) 3 3 3/13 ≈ 0.231 3/13 ≈ 0.231
[1,2) 6 3+6 = 9 6/13 ≈ 0.462 9/13 ≈ 0.692
[2,3) 2 3+6+2 = 11 2/13 ≈ 0.154 11/13 ≈ 0.846
[3,4) 2 3+6+2+2 = 13 2/13 ≈ 0.154 13/13 = 1.000
59
Time-series Data
Definition: Data about a particular variable collected over a period of time is called time-series data.
Example: Closing prices of IBM stock since Jan. 1, 2008.
60
Bad Graphical Representation of Data
Section 2.4
Problem: Graphs can give an incomplete picture or even a misrepresentation of the sample (data).
61
The Scale Problem
The number of bachelor’s degrees in engineering for 1999-2003 is given in the following table:
Year Number of Degrees
1999 62,372
2000 63,731
2001 65,113
2002 67,301
2003 70,949
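A quick calculation illustrates how the vertical scale can distort these numbers. The sketch below compares the true ratio of the 2003 count to the 1999 count with the ratio of bar heights when the vertical axis starts at a nonzero value. The baseline of 60,000 is a hypothetical choice made for illustration; the slides do not state which axis the misleading chart uses.

```python
# Number of engineering bachelor's degrees, from the table above.
degrees = {1999: 62372, 2000: 63731, 2001: 65113, 2002: 67301, 2003: 70949}

first, last = degrees[1999], degrees[2003]

# True ratio of the 2003 count to the 1999 count.
true_ratio = last / first

# Hypothetical truncated axis: bars are drawn starting from 60,000, not 0.
baseline = 60000
apparent_ratio = (last - baseline) / (first - baseline)

print(f"True ratio of counts:        {true_ratio:.2f}")
print(f"Ratio of drawn bar heights:  {apparent_ratio:.2f}")
```

Under this assumed baseline the last bar would be drawn several times taller than the first, even though the actual increase is under 14%.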
62
Misleading Bar Chart