DESCRIPTIVE STATISTICS
CHAPTER TWO
Content
2.1 Data organization and Frequency Distribution
2.2 Types of Graph
2.3 Summary Statistics (Data Description)
• Measures of Central Tendency
• Measures of Variation
• Measures of Position
Objectives
At the end of this chapter, you should be able to:
• Organize data using frequency distributions.
• Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.
• Represent data using Pareto charts, time series graphs, and pie graphs.
• Draw and interpret a stem and leaf plot.
• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.
• Describe data using measures of variation, such as the range, variance, and standard deviation.
• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.
2.1 Data Organization & Frequency Distribution
A. The raw data – Newly collected data, from any source, that have not yet been organized
Example of the Raw Data
The Slimline Beverage Company makes and sells a line of dietetic soft drink products. These products are sold in bottles and cans. In addition, soft drink syrups are sold to restaurants, theaters, and other outlets that mix small amounts of the syrup with carbonated water and sell the result in cups. The sales manager wants to see how the new Fizzy Cola syrup is selling, so the raw sales data on gallons of syrup sold were gathered, as shown in the table below.
Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold    Employee  Gallons sold
PP        95              RN        95              GH        135.5           IT        135.5
SM        100.75          SG        100.75          RI        115.25          NI        115.25
PT        126             AD        126             OS        128.75          GC        128.75
PU        114             RO        114             US        113.25          AS        113.25
MS        134             EY        134             PO        132             NC        132
FK        116.75          YO        116.75          OR        105             YA        105
LZ        97.5            OU        97.5            FT        118.25          TN        118.25
FE        102.25          US        102.25          WO        121.75          HB        121.75
AN        110             LT        110             OF        109.25          IE        109.25
RJ        125             EA        125             RT        136             NF        136
OO        144             AT        144             KH        124             GU        124
UY        112             RI        112             EI        91              XN        91
TT        82.5            NS        82.5
Raw data: Gallons Of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month
Example of Data Array
Array data: Gallons of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month (ascending order, reading down each column)
82.5 105 116.75 128.75
82.5 109.25 116.75 128.75
91 109.25 118.25 132
91 110 118.25 132
95 110 121.75 134
95 112 121.75 134
97.5 112 124 135.5
97.5 113.25 124 135.5
100.75 113.25 125 136
100.75 114 125 136
102.25 114 126 144
102.25 115.25 126 144
105 115.25
The lowest value: 82.5
The highest value: 144
Range: 144 - 82.5 = 61.5
B. The data array
An arrangement of data items in either ascending (lowest to highest) or descending (highest to lowest) order.
Advantages
- We can see the range of the data.
- We can determine the data distribution.
- An array can show large concentrations of items at particular values and can reveal outliers (data values that are much larger or smaller than the rest of the data).
Disadvantages
- The array is still a rather awkward data organization tool, especially when the number of data items is large.
- There’s often a need to arrange the data into a more compact form for analysis and communication purposes.
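As a minimal sketch of building a data array, the Python snippet below sorts the syrup sales figures from the raw data table and reads off the lowest value, the highest value, and the range. The variable names are illustrative only; the list holds the 25 distinct figures, each of which appears twice in the table.

```python
# The 25 distinct sales figures (gallons); each appears twice in the raw data table.
distinct_values = [
    95, 100.75, 126, 114, 134, 116.75, 97.5, 102.25, 110, 125, 144, 112, 82.5,
    135.5, 115.25, 128.75, 113.25, 132, 105, 118.25, 121.75, 109.25, 136, 124, 91,
]
raw_data = distinct_values * 2          # 50 observations in total

data_array = sorted(raw_data)           # ascending order: lowest to highest
lowest = data_array[0]
highest = data_array[-1]
data_range = highest - lowest

print(len(data_array), lowest, highest, data_range)   # 50 82.5 144 61.5
```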
C. Frequency distribution (frequency table)
Groups data items into classes and then records the number of items that appear in each class.
The purpose
To organize the data items into a compact form without obscuring essential facts.
How to do it (general)
1. Determine the number of classes that will be used to group the data.
   a. Minimum 5, maximum 20.
   b. The actual number depends on such factors as:
      i. The number of observations being grouped
      ii. The purpose of the distribution
      iii. The arbitrary preferences of the analyst
   c. Use classes that give a good view of the data pattern and enable you to gain insights into the information that is there.
   d. All data items, from the smallest to the largest, must be included.
   e. Each item must be assigned to one and only one class.
2. Determine the width (class interval) of these classes.
   a. The widths should be equal.
   b. Width = range / number of classes
   c. Whenever possible, an open-ended class interval (one with an unspecified upper or lower class limit) should be avoided.
3. Determine the number of observations (frequency) in each class.
Types of Frequency Distribution
• Categorical Frequency Distribution – Used for data that can be placed in specific categories, such as nominal or ordinal level data
• Ungrouped Frequency Distribution – Used for numerical data when the range of the data is small
• Grouped Frequency Distribution – Used for numerical data when the range of the data is large
Example : Categorical Frequency Distribution
Twenty-five army inductees were given a blood test to determine their blood type. The data set is
A  B  B  AB O
O  O  B  AB B
B  B  O  A  O
A  O  O  O  AB
AB A  O  B  A
Construct a frequency distribution for the data.
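One possible way to tally this categorical data is sketched below in Python, using `collections.Counter` from the standard library. The variable names and the printed layout are illustrative choices, not part of the original example.

```python
from collections import Counter

# Blood types of the 25 inductees, as listed in the example.
blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

frequency = Counter(blood_types)        # tally each category
total = len(blood_types)

print("Class  Frequency  Percent")
for blood_type in ["A", "B", "O", "AB"]:
    count = frequency[blood_type]
    print(f"{blood_type:<7}{count:<11}{100 * count / total:.0f}%")
```

The percent column is each class frequency divided by the total number of observations, times 100.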
Constructing an Ungrouped & Grouped Frequency Distribution
STEP 1 Determine the classes.
- Find the highest and lowest value.
- Find the range.
- Select the number of classes desired.
- Find the width by dividing the range by the number of classes and rounding up.
- Select a starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.
- Find the upper class limits.
- Find the boundaries.
STEP 2 Tally the data.
STEP 3 Find the numerical frequencies from the tallies.
STEP 4 Find the cumulative frequencies.
• The lower class limit represents the smallest data value that can be included in the class.
• The upper class limit represents the largest value that can be included in the class.
• The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.
• Rule of Thumb: Class limits should have the same decimal place value as the data, but the class boundaries have one additional place value and end in a 5.
• The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.
• The class midpoint is found by adding the lower and upper boundaries (or limits) and dividing by 2.
Class Rules
• There should be between 5 and 20 classes.
• The classes must be mutually exclusive.
• The classes must be continuous.
• The classes must be exhaustive.
• The classes must be equal in width.
Example : Ungrouped Frequency Distribution
The data shown here represent the number of miles per gallon that 30 selected four-wheel-drive sports utility vehicles obtained in city driving. Construct a frequency distribution.
12 17 12 14 16 18
16 18 12 16 17 15
15 16 12 15 16 16
12 14 15 12 15 15
19 13 16 18 16 14
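Because the range of these data is small (12 to 19), each distinct value can serve as its own class. The following Python sketch, with illustrative variable names, tallies the frequencies and accumulates them into cumulative frequencies.

```python
from collections import Counter
from itertools import accumulate

# Miles per gallon for the 30 vehicles in the example.
mpg = [
    12, 17, 12, 14, 16, 18,
    16, 18, 12, 16, 17, 15,
    15, 16, 12, 15, 16, 16,
    12, 14, 15, 12, 15, 15,
    19, 13, 16, 18, 16, 14,
]

tally = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))    # one class per value: 12, 13, ..., 19
frequencies = [tally[c] for c in classes]
cumulative = list(accumulate(frequencies))       # running total of the frequencies

for c, f, cf in zip(classes, frequencies, cumulative):
    print(c, f, cf)
```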
Example : Grouped Frequency Distribution
These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
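The construction steps can be sketched in Python as follows. This is one reasonable reading of the procedure, assuming the lowest value (100) is the starting point and that the data are whole numbers, so each upper class limit is one less than the next lower limit; the variable names are illustrative.

```python
import math

# Record high temperatures for the 50 states, as listed in the example.
temps = [
    112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
    110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
    107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
    116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
    120, 113, 120, 117, 105, 110, 118, 112, 114, 114,
]

num_classes = 7
lowest, highest = min(temps), max(temps)
data_range = highest - lowest                    # 134 - 100 = 34
width = math.ceil(data_range / num_classes)      # 34 / 7 = 4.86, rounded up to 5

# Lower limits start at the lowest value; integer data assumed for the upper limits.
lower_limits = [lowest + k * width for k in range(num_classes)]
upper_limits = [lower + width - 1 for lower in lower_limits]
boundaries = [(lower - 0.5, upper + 0.5) for lower, upper in zip(lower_limits, upper_limits)]

frequencies = [
    sum(lower <= t <= upper for t in temps)
    for lower, upper in zip(lower_limits, upper_limits)
]

for (lower, upper), (lb, ub), f in zip(zip(lower_limits, upper_limits), boundaries, frequencies):
    print(f"{lower}-{upper}  {lb}-{ub}  {f}")
```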
Why Construct Frequency Distributions?
• To facilitate computational procedures for measures of average and spread.
• To enable the reader to determine the nature or shape of the distribution.
• To enable the reader to make comparisons among different data sets.
• To organize the data in a meaningful, intelligible way.
• To enable the researcher to draw charts and graphs for the presentation of data.
2.2 Types of Graph
The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.
Graphs are useful in getting the audience’s attention in a publication or a presentation.
A. Histogram, Frequency Polygon, Ogive
• Histogram – A graph that displays the data by using vertical bars of various heights to represent the frequencies.
• Frequency Polygon – A graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.
• Ogive (Cumulative Frequency Graph) – A graph that represents the cumulative frequencies for the classes in a frequency distribution.
Procedure to Construct a Histogram, Frequency Polygon & Ogive
• STEP 1 Draw and label the x and y axes.
• STEP 2 Choose a suitable scale for the frequencies or cumulative frequencies, and label it on the y axis.
• STEP 3 Represent the class boundaries for the histogram or ogive, or the midpoint for the frequency polygon, on the x axis.
• STEP 4 Plot the points and then draw the bars or lines.
Example
These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes. Then, construct a histogram, frequency polygon and ogive for these data.
112 100 127 120 134 118 105 110 109 112
110 118 117 116 118 122 114 114 105 109
107 112 114 115 118 117 118 122 106 110
116 108 110 121 113 120 119 111 104 111
120 113 120 117 105 110 118 112 114 114
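Before drawing, it can help to compute the coordinates each graph needs. The Python sketch below assumes the data have already been tallied into the seven classes 100–104, 105–109, ..., 130–134 (width 5), giving frequencies 2, 8, 18, 13, 7, 1, 1. It then derives the bar edges for the histogram, the midpoints for the frequency polygon, and the cumulative frequencies for the ogive. Starting the ogive at the first lower boundary with a cumulative frequency of 0 is a common convention, assumed here.

```python
from itertools import accumulate

# Assumed result of tallying the temperature data into 7 classes of width 5.
lower_limits = [100, 105, 110, 115, 120, 125, 130]
frequencies = [2, 8, 18, 13, 7, 1, 1]
width = 5

lower_boundaries = [lower - 0.5 for lower in lower_limits]
upper_boundaries = [lb + width for lb in lower_boundaries]
midpoints = [(lb + ub) / 2 for lb, ub in zip(lower_boundaries, upper_boundaries)]
cumulative = list(accumulate(frequencies))

# Histogram: one bar per class, spanning its boundaries, height = frequency.
histogram_bars = list(zip(lower_boundaries, upper_boundaries, frequencies))
# Frequency polygon: points at (midpoint, frequency), joined by line segments.
polygon_points = list(zip(midpoints, frequencies))
# Ogive: points at (upper boundary, cumulative frequency), starting from 0.
ogive_points = [(lower_boundaries[0], 0)] + list(zip(upper_boundaries, cumulative))

print(polygon_points)
print(ogive_points)
```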
Distribution Shapes
B. Pareto Chart
Used to represent a frequency distribution for a categorical variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.
Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is
A  B  B  AB O
O  O  B  AB B
B  B  O  A  O
A  O  O  O  AB
AB A  O  B  A
Construct a Pareto chart for the data.
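A Pareto chart conventionally arranges its bars from the highest frequency to the lowest, so the main preparatory step is ordering the categories. A minimal Python sketch of that ordering, with illustrative names, is shown here; the drawing itself is left out.

```python
from collections import Counter

blood_types = """
A B B AB O
O O B AB B
B B O A O
A O O O AB
AB A O B A
""".split()

# most_common() returns (category, frequency) pairs sorted from highest to lowest count.
pareto_bars = Counter(blood_types).most_common()

for category, count in pareto_bars:
    print(f"{category:<3} {'#' * count} {count}")
```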
C. Time Series Graph
• Represents data that occur over a specified period of time
• STEP 1 Draw and label the x and y axes.
• STEP 2 Label the x axis for years and the y axis for the number of theaters.
• STEP 3 Plot each point according to the table.
• STEP 4 Draw line segments connecting adjacent points. Do not try to fit a smooth curve through the data points.
• When analyzing the graph, look for a trend or pattern over the time period (ascending or descending) and at the slope or steepness of the line (how rapidly the values increase or decrease).
• Two time series can be drawn on the same graph for comparison (compound time series graph).
Example
In 1958, there were more than 4000 outdoor drive-in theaters. The number of these theaters has changed over the years. Draw a time series graph for the data and summarize the findings.
Year   Number
1988   1497
1990   910
1992   870
1994   859
1996   826
1998   750
2000   637
D. Pie Chart
A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.
The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sectors.
Percentages or proportions can be used. The variable is nominal or categorical.
Example
Twenty-five army inductees were given a blood test to determine their blood type. The data set is
A  B  B  AB O
O  O  B  AB B
B  B  O  A  O
A  O  O  O  AB
AB A  O  B  A
Construct a pie chart for the data.
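Each wedge of the pie is sized in proportion to its category's share of the total, so the percentage is f / n × 100 and the central angle is f / n × 360°. The Python sketch below assumes the blood-type tally A = 5, B = 7, O = 9, AB = 4 (obtained by counting the 25 values) and uses illustrative names.

```python
# Assumed tally of the 25 blood types.
frequencies = {"A": 5, "B": 7, "O": 9, "AB": 4}
total = sum(frequencies.values())

# For each category: (percentage of the whole, central angle in degrees).
sectors = {
    category: (100 * count / total, 360 * count / total)
    for category, count in frequencies.items()
}

for category, (percent, degrees) in sectors.items():
    print(f"{category:<3} {percent:5.1f}%  {degrees:6.1f} degrees")
```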
Stem-and-Leaf Plots
• A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.
• It has the advantage over grouped frequency distribution of retaining the actual data while showing them in graphic form.
Stem leaf
Example
An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a stem and leaf plot.
52 62 51 50 69
58 77 66 53 57
75 56 55 67 73
79 59 68 65 72
57 51 63 69 75
65 53 78 66 55
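For two-digit data such as these, the tens digit can serve as the stem and the units digit as the leaf. The following Python sketch, with illustrative names, builds the plot that way; other stem choices are possible for other data.

```python
from collections import defaultdict

# Daily car thefts over the 30 days, as listed in the example.
thefts = [
    52, 62, 51, 50, 69,
    58, 77, 66, 53, 57,
    75, 56, 55, 67, 73,
    79, 59, 68, 65, 72,
    57, 51, 63, 69, 75,
    65, 53, 78, 66, 55,
]

plot = defaultdict(list)
for value in sorted(thefts):            # sorting first keeps the leaves in order
    stem, leaf = divmod(value, 10)      # tens digit is the stem, units digit is the leaf
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Unlike a grouped frequency distribution, the plot keeps every original value recoverable from its stem and leaf.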
Conclusions (2.1 & 2.2)
• Data can be organized in some meaningful way using frequency distributions. Once the frequency distribution is constructed, the representation of the data by graphs is a simple task.
2.3 Summary Statistics (Data Description)
• Statistical methods can be used to summarize data.
• Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.
• Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.
• Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.
• The most common measures of position are percentiles, deciles, and quartiles.
• The measures of central tendency, variation, and position are part of what is called traditional statistics. This type of analysis is typically used to confirm conjectures about the data.
Measures of Central Tendency
Mean
the sum of the values divided by the total number of values.
Population Mean
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}, \quad N = \text{population size}$$
Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad n = \text{sample size}$$
Arithmetic Mean – Individual Data
Example 1
• Calculate the arithmetic mean for the following:
3, 5, 8, 12, 15
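As a quick check of the definition (sum of the values divided by the number of values), a short Python sketch with illustrative names:

```python
values = [3, 5, 8, 12, 15]

total = sum(values)                 # 3 + 5 + 8 + 12 + 15 = 43
mean = total / len(values)          # 43 / 5

print(total, mean)                  # 43 8.6
```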
The Arithmetic Mean – Ungrouped Frequency Distribution
Example 2
• Number of defects in a sample of 50 products
No of defects No of products
0 5
1 7
2 15
3 13
4 6
5 4
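For an ungrouped frequency distribution, each value is weighted by its frequency: the mean is the sum of (value × frequency) divided by the total frequency. A minimal Python sketch, with illustrative names:

```python
defects = [0, 1, 2, 3, 4, 5]            # number of defects
products = [5, 7, 15, 13, 6, 4]         # number of products with that many defects

n = sum(products)                                             # 50 products
total_defects = sum(x * f for x, f in zip(defects, products)) # sum of f * x
mean_defects = total_defects / n

print(n, total_defects, mean_defects)   # 50 120 2.4
```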
The Arithmetic Mean – Grouped Frequency Distribution
Example 3
• A radar speed recorder was set up on a stretch of road to which a legal speed limit applied. The results are summarized in the table below:
Speed (mph)   No. of cars observed
15 – 20       5
20 – 25       39
25 – 30       112
30 – 35       295
35 – 40       242
40 – 45       89
45 – 50       8
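For grouped data the individual values are unknown, so each class is represented by its midpoint and the mean is estimated as the sum of (frequency × midpoint) divided by the total frequency. The Python sketch below applies this to the speed data, taking the class frequencies as 5, 39, 112, 295, 242, 89, and 8; the variable names are illustrative.

```python
classes = [(15, 20), (20, 25), (25, 30), (30, 35), (35, 40), (40, 45), (45, 50)]
cars = [5, 39, 112, 295, 242, 89, 8]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # 17.5, 22.5, ..., 47.5
n = sum(cars)                                                   # total cars observed
sum_fx = sum(f * x for f, x in zip(cars, midpoints))            # sum of f * midpoint
mean_speed = sum_fx / n

print(n, sum_fx, round(mean_speed, 2))   # 790 26870.0 34.01
```

The result is an estimate, since it treats every car in a class as travelling at the class midpoint.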
Mean
• One computes the mean by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as variance.
• The mean for the data set is unique, and not necessarily one of the data values.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations
Measures of Central Tendency
Median
the middle number of n ordered data (smallest to largest)
If n is odd:
$$\text{Median} = x_{\frac{n+1}{2}}$$
If n is even:
$$\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$
Median
• The median is used when one must find the center or middle value of a data set.
• The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.
• The median is used to find the average of an open-ended distribution.
• The median is affected less than the mean by extremely high or extremely low values.
The Median – Individual Data
Example 4
• The following data relate to the marks obtained by 15 students in a course.
• Progress test 1: marks obtained
30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72
• In the case of an even number of observations, there is no definite middle item.
• The median is then taken to be the average of the two middle items.
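The odd and even cases can be sketched in a few lines of Python. The function name `median` and the small even-sized list are illustrative; the 15 marks are those of the example.

```python
def median(values):
    """Return the middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]                           # the ((n + 1) / 2)-th ordered value
    return (ordered[middle - 1] + ordered[middle]) / 2   # average of the two middle values

marks = [30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72]
print(median(marks))            # n = 15 (odd): the 8th ordered value, 52
print(median([1, 2, 3, 4]))     # n = 4 (even): (2 + 3) / 2 = 2.5
```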
The Median – Locating the Median Graphically
Example 5
• Given below is the frequency distribution of marks obtained by 50 students in a certain college.
Marks No. of Students
10 – 20 3
20 – 30 7
30 – 40 10
40 – 50 20
50 – 60 7
60 – 70 3
The Median – Ungrouped Frequency Distribution
Example 6
• Tests for defects are carried out in a textile factory on a lot comprising 400 pieces of cloth. The results of the tests are tabulated below.
No. of faults per piece   No. of pieces
0                         92
1                         142
2                         96
3                         46
4                         18
5                         6
6                         0
Measures of Central Tendency
Mode
the most commonly occurring value in a data series
• The mode is used when the most typical case is desired.
• The mode is the easiest average to compute.
• The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.
• The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.
The Mode – Individual Data
Example 7
• Determine the mode from the following data:
• Marks obtained by 10 students
10, 27, 24, 12, 27, 27, 20, 18, 15, 20
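Finding the mode amounts to counting how often each value occurs. The Python sketch below returns every value tied for the highest count, so it also covers data sets with more than one mode. Treating a data set in which no value repeats as having no mode is a convention assumed here, and the function name `modes` is illustrative.

```python
from collections import Counter

def modes(values):
    """Return all values that occur most often; an empty list if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                       # every value occurs once: no mode (assumed convention)
    return sorted(v for v, c in counts.items() if c == highest)

marks = [10, 27, 24, 12, 27, 27, 20, 18, 15, 20]
print(modes(marks))                     # [27]: the value 27 occurs three times
print(modes([1, 1, 2, 2, 3]))           # [1, 2]: two modes
```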
The Mode – Grouped Frequency Distribution
Example 8
• A client company of your firm is a horticultural shop selling a wide variety of products to its customers. The analysis of weekly sales of plants throughout the year is summarized in the following frequency distribution.
Weekly sales of plants ($) No. of weeks
1255 – 1280 9
1280 – 1305 19
1305 – 1330 10
1330 – 1355 8
1355 – 1380 6
Midrange
is a rough estimate of the middle and also a very rough estimate of the average; it can be affected by one extremely high or low value.
$$\text{MR} = \frac{\text{lowest value} + \text{highest value}}{2}$$
Measures of Central Tendency
Types of Distribution
Symmetric
Positively skewed or right-skewed Negatively skewed or left-skewed
Measures of Variation / Dispersion
• Used when a measure of central tendency alone does not describe the data adequately (e.g., two data sets that have the same mean but different spreads)
• One that gauges the variability that exists in a data set
• To form a judgment about how well the average value illustrate/ depict the data
• To learn the extent of the scatter so that steps may be taken to control the existing variation
Measures of Variation / Dispersion
Range
is the difference between the highest value and the lowest value in a data set.
The symbol R is used for the range.
R = highest value - lowest value
Measures of Variation / Dispersion
Variance
is the average of the squares of the distance each value is from the mean.
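Population Variance
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \quad N = \text{population size}$$
Sample Variance
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \quad n = \text{sample size}$$
Standard Deviation
is the square root of the variance.
Population standard deviation, σ
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}, \quad N = \text{population size}$$
Sample standard deviation, s
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \quad n = \text{sample size}$$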
Variance
• The variance is the average of the squared deviations from the arithmetic mean
• Calculation of Variance
• The following data relate to the marks obtained by 15 students in an Accounting examination:
50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70
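The calculation can be sketched in Python as follows. Whether the 15 marks are treated as a population (divide by N) or as a sample (divide by n - 1) is not stated in the example, so both results are computed; the variable names are illustrative.

```python
import math

marks = [50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70]

n = len(marks)
mean = sum(marks) / n                                           # 900 / 15 = 60
sum_squared_deviations = sum((x - mean) ** 2 for x in marks)    # 2550

population_variance = sum_squared_deviations / n                # divide by N
sample_variance = sum_squared_deviations / (n - 1)              # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(mean, population_variance, round(population_sd, 2))       # 60.0 170.0 13.04
print(round(sample_variance, 2), round(sample_sd, 2))           # 182.14 13.5
```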
Standard Deviation
• Calculation of Standard Deviation – grouped frequency distribution
• The following data relate to the sales of electronic calculators in the South of England.
Sales per week (thousand) No. of weeks
4 – 6 2
6 – 8 5
8 – 10 12
10 – 12 9
12 - 14 3
Variance & Standard deviation
• Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
• The measures of variance and standard deviation are used to determine the consistency of a variable.
• The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
• The variance and standard deviation are used quite often in inferential statistics.
Describing the position of the data value
• Measures of Position
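Quartile (Grouped Data)
$$Q_i = L + \left( \frac{\frac{i\,n}{4} - F}{f} \right) C, \quad \text{for } i = 1, 2, 3$$
where
L = lower limit of the interval containing Q_i
C = width of the interval containing Q_i
F = cumulative frequency before the class containing Q_i
f = frequency of the class containing Q_i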
Quartile Deviation – Individual Data
• The following are the marks of 9 students in a certain examination.
Student No Marks
1 20
2 28
3 40
4 12
5 30
6 15
7 50
8 45
9 60
Quartile Deviation – Grouped Frequency Distribution
Example
• The following grouped frequency table describes the weights of 95 packages selected for a QC test.
Weight (grams) No. of Packages
450 – 452 11
452 – 454 26
454 – 456 34
456 - 458 24
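The grouped-data quartile formula, Q_i = L + ((i·n/4 - F) / f) · C, can be applied to this table as sketched below in Python. The function name `grouped_quartile` is illustrative, and the quartile deviation is taken to be (Q3 - Q1) / 2, the usual semi-interquartile range, which the slides do not spell out.

```python
def grouped_quartile(i, lower_limits, width, frequencies):
    """Estimate the i-th quartile (i = 1, 2, 3) from a grouped frequency distribution."""
    n = sum(frequencies)
    target = i * n / 4                       # position of Q_i in the cumulative count
    cumulative_before = 0                    # F: cumulative frequency before the class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative_before + f >= target:
            return lower + (target - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("quartile class not found")

lower_limits = [450, 452, 454, 456]          # weight classes in grams, width 2
packages = [11, 26, 34, 24]                  # 95 packages in total

q1 = grouped_quartile(1, lower_limits, 2, packages)
q3 = grouped_quartile(3, lower_limits, 2, packages)
quartile_deviation = (q3 - q1) / 2           # assumed definition: semi-interquartile range

print(round(q1, 2), round(q3, 2), round(quartile_deviation, 2))   # 452.98 456.02 1.52
```

For Q1, the position 95/4 = 23.75 falls in the class 452 – 454 (F = 11, f = 26); for Q3, the position 71.25 falls in 456 – 458 (F = 71, f = 24).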
The measures of central tendency, variation, and position for Grouped data
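Measures of Central Tendency
Mean
$$\bar{x} = \frac{\sum f_i x_i}{N}$$
where
f_i = frequency of class i
x_i = midpoint of class i
N = total number of observations (the sum of the frequencies)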
Median
$$\text{Median} = L + \left( \frac{\frac{n}{2} - F}{f} \right) C$$
where
L = lower limit of the interval containing the median
C = width of the interval containing the median
F = cumulative frequency before the median class
f = frequency of the median class
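A minimal Python sketch of the grouped median formula follows, applied to the marks distribution of Example 5 (classes 10 – 20 through 60 – 70 with frequencies 3, 7, 10, 20, 7, 3). The function name `grouped_median` is illustrative.

```python
def grouped_median(lower_limits, width, frequencies):
    """Estimate the median as L + ((n/2 - F) / f) * C for a grouped frequency distribution."""
    n = sum(frequencies)
    half = n / 2
    cumulative_before = 0                    # F: cumulative frequency before the median class
    for lower, f in zip(lower_limits, frequencies):
        if f > 0 and cumulative_before + f >= half:
            return lower + (half - cumulative_before) / f * width
        cumulative_before += f
    raise ValueError("median class not found")

lower_limits = [10, 20, 30, 40, 50, 60]
students = [3, 7, 10, 20, 7, 3]              # 50 students in total

print(grouped_median(lower_limits, 10, students))   # 40 + (25 - 20) / 20 * 10 = 42.5
```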
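Mode
$$\text{Mode} = L + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) C$$
where
L = lower limit of the interval containing the mode
C = width of the interval containing the mode
Δ1 = frequency of the modal class - frequency of the class before the modal class
Δ2 = frequency of the modal class - frequency of the class after the modal class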
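The grouped mode formula can be sketched in Python as below, applied to the weekly plant sales of Example 8 (classes of width 25 starting at 1255, with frequencies 9, 19, 10, 8, 6). The function name `grouped_mode` is illustrative, and the sketch assumes a single modal class; a missing neighbouring class is treated as having frequency 0.

```python
def grouped_mode(lower_limits, width, frequencies):
    """Estimate the mode as L + (d1 / (d1 + d2)) * C, assuming one modal class."""
    k = frequencies.index(max(frequencies))                       # index of the modal class
    before = frequencies[k - 1] if k > 0 else 0                   # class before (0 if none)
    after = frequencies[k + 1] if k < len(frequencies) - 1 else 0 # class after (0 if none)
    d1 = frequencies[k] - before
    d2 = frequencies[k] - after
    return lower_limits[k] + d1 / (d1 + d2) * width

lower_limits = [1255, 1280, 1305, 1330, 1355]
weeks = [9, 19, 10, 8, 6]                                         # 52 weeks in total

mode_sales = grouped_mode(lower_limits, 25, weeks)
print(round(mode_sales, 2))      # 1280 + 10 / (10 + 9) * 25 = 1293.16
```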
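Measures of Variation
Population variance
$$\sigma^2 = \frac{\sum f_i (x_i - \mu)^2}{N} = \frac{\sum f_i x_i^2 - \frac{\left( \sum f_i x_i \right)^2}{N}}{N}$$
where
f_i = frequency of class i
x_i = midpoint of class i
N = population size
μ = mean of the grouped data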
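Sample variance
$$s^2 = \frac{\sum f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum f_i x_i^2 - \frac{\left( \sum f_i x_i \right)^2}{n}}{n - 1}$$
where
f_i = frequency of class i
x_i = midpoint of class i
n = sample size
x̄ = mean of the grouped data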
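The grouped variance formulas can be tried on the calculator sales data given earlier (classes 4 – 6 through 12 – 14 thousand, with 2, 5, 12, 9, and 3 weeks). The Python sketch below uses the computational form, sum of f·x² minus (sum of f·x)² / n, and reports both the population and the sample versions, since the example does not say which is intended. Variable names are illustrative.

```python
import math

classes = [(4, 6), (6, 8), (8, 10), (10, 12), (12, 14)]   # sales per week (thousand)
weeks = [2, 5, 12, 9, 3]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # 5, 7, 9, 11, 13
n = sum(weeks)                                                  # 31 weeks
sum_fx = sum(f * x for f, x in zip(weeks, midpoints))           # sum of f * x
sum_fx2 = sum(f * x ** 2 for f, x in zip(weeks, midpoints))     # sum of f * x^2

numerator = sum_fx2 - sum_fx ** 2 / n       # equals the sum of f * (x - mean)^2
population_variance = numerator / n
sample_variance = numerator / (n - 1)
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(n, sum_fx, sum_fx2)                                # 31 291.0 2863.0
print(round(population_sd, 2), round(sample_sd, 2))      # 2.06 2.09
```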
Conclusions
• By combining the techniques discussed in this chapter, the student is now able to collect, organize, summarize, and present data.