
DESCRIPTIVE STATISTICS

CHAPTER TWO

Content

2.1 Data organization and Frequency Distribution

2.2 Types of Graph

2.3 Summary Statistics (Data Description)

• Measures of Central Tendency

• Measures of Variation

• Measures of Position

Objectives

At the end of this chapter, you should be able to

• Organize data using frequency distributions.

• Represent data in frequency distributions graphically using histograms, frequency polygons, and ogives.

• Represent data using Pareto charts, time series graphs, and pie graphs.

• Draw and interpret a stem and leaf plot.

• Summarize data using measures of central tendency, such as the mean, median, mode, and midrange.

• Describe data using measures of variation, such as the range, variance, and standard deviation.

• Identify the position of a data value in a data set, using various measures of position, such as percentiles, deciles, and quartiles.

2.1 Data Organization & Frequency Distribution

A. The raw data – Data that have been freshly collected from any source and not yet organized

Example of the Raw Data

The Slimline Beverage Company makes and sells a line of dietetic soft drink products. These products are sold in bottles and cans. In addition, soft drink syrups are sold to restaurants, theaters, and other outlets that mix small amounts of the syrup with carbonated water and sell the result in cups. The sales manager wants to see how the new Fizzy Cola syrup is selling, so the raw sales data on gallons of syrup sold were gathered, as shown in the table below.

Employee  Gallons sold   Employee  Gallons sold   Employee  Gallons sold   Employee  Gallons sold
PP        95             RN        95             GH        135.5          IT        135.5
SM        100.75         SG        100.75         RI        115.25         NI        115.25
PT        126            AD        126            OS        128.75         GC        128.75
PU        114            RO        114            US        113.25         AS        113.25
MS        134            EY        134            PO        132            NC        132
FK        116.75         YO        116.75         OR        105            YA        105
LZ        97.5           OU        97.5           FT        118.25         TN        118.25
FE        102.25         US        102.25         WO        121.75         HB        121.75
AN        110            LT        110            OF        109.25         IE        109.25
RJ        125            EA        125            RT        136            NF        136
OO        144            AT        144            KH        124            GU        124
UY        112            RI        112            EI        91             XN        91
TT        82.5           NS        82.5

Raw data: Gallons Of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month

Example of Data Array

Array data: Gallons Of Fizzy Cola Syrup Sold by 50 Employees of Slimline Beverage Company in 1 Month

82.5 105 116.75 128.75

82.5 109.25 116.75 128.75

91 109.25 118.25 132

91 110 118.25 132

95 110 121.75 134

95 112 121.75 134

97.5 112 124 135.5

97.5 113.25 124 135.5

100.75 113.25 125 136

100.75 114 125 136

102.25 114 126 144

102.25 115.25 126 144

105 115.25

The lowest data: 82.5

The highest data: 144

Range: 144 - 82.5 = 61.5

B. The data array

An arrangement of data items in either ascending (lowest to highest) or descending (highest to lowest) order.

Advantages

- We can see the range of the data

- We can determine the data distribution

- An array can show the presence of large concentrations of items at particular values, as well as outliers (data values that are much larger or smaller than the rest of the data)

Disadvantages

- The array is still a rather awkward data organization tool, especially when the number of data items is large.

- There's often a need to arrange the data into a more compact form for analysis and communication purposes.
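As a minimal sketch of building a data array (Python is used purely for illustration, and the variable names are assumptions rather than part of the chapter), the syrup data can be sorted and its range read off the two ends:

```python
# Gallons of Fizzy Cola syrup sold by the 50 employees, in raw (unsorted) order.
raw = [
    95, 95, 135.5, 135.5, 100.75, 100.75, 115.25, 115.25, 126, 126,
    128.75, 128.75, 114, 114, 113.25, 113.25, 134, 134, 132, 132,
    116.75, 116.75, 105, 105, 97.5, 97.5, 118.25, 118.25, 102.25, 102.25,
    121.75, 121.75, 110, 110, 109.25, 109.25, 125, 125, 136, 136,
    144, 144, 124, 124, 112, 112, 91, 91, 82.5, 82.5,
]

array = sorted(raw)                    # ascending data array
lowest, highest = array[0], array[-1]  # the two ends of the array
data_range = highest - lowest          # range = highest - lowest

print(lowest, highest, data_range)
```

Passing `reverse=True` to `sorted` would give the descending form of the array instead.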

C. Frequency distribution (frequency table)

Groups data items into classes and then records the number of items that appear in each class.

The purpose

To organize the data items into a compact form without obscuring essential facts

How to do it (general)

1. Determine the number of classes that will be used to group the data.

a. Minimum 5, maximum 20

b. The actual number depends on such factors as:

i. The number of observations being grouped

ii. The purpose of the distribution

iii. The arbitrary preferences of the analyst

c. Use classes that give you a good view of the data pattern and enable you to gain insights into the information that is there

d. All data items, from the smallest to the largest, must be included

e. Each item must be assigned to one and only one class

2. Determine the width (class interval) of these classes

a. The widths should be equal

b. Width = range / number of classes

c. Whenever possible, an open-ended class interval (one with an unspecified upper or lower class limit) should be avoided

3. Determine the number of observations (frequency) in each class

Types of Frequency Distribution

• Categorical Frequency Distribution

– Used for data that can be placed in specific categories, such as nominal or ordinal level data

• Ungrouped Frequency Distribution

– Used for numerical data

– The range of the data is small

• Grouped Frequency Distribution

– Used for numerical data too

– The range of the data is large

Example : Categorical Frequency Distribution

Twenty-five army inductees were given a blood test to determine their blood type. The data set is

A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A

Construct a frequency distribution for the data.
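One way to tally a categorical frequency distribution is sketched below. It assumes the 25 blood types are read row by row from the data set above; the percent column is an extra that the exercise does not ask for.

```python
from collections import Counter

# Blood types of the 25 inductees, read row by row.
blood = ("A B B AB O O O B AB B B B O A O "
         "A O O O AB AB A O B A").split()

freq = Counter(blood)   # tally: class -> frequency
n = len(blood)

print("Class  Frequency  Percent")
for blood_type in ["A", "B", "O", "AB"]:
    f = freq[blood_type]
    print(f"{blood_type:<5}  {f:>9}  {100 * f / n:>6.0f}%")
```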

Constructing an ungrouped & Grouped Frequency Distribution

STEP 1 Determine the classes.

- Find the highest and lowest value.

- Find the range.

- Select the number of classes desired.

- Find the width by dividing the range by the number of classes and rounding up.

- Select a starting point (usually the lowest value or any convenient number less than the lowest value); add the width to get the lower limits.

- Find the upper class limits.

- Find the boundaries.

STEP 2 Tally the data.

STEP 3 Find the numerical frequencies from the tallies.

STEP 4 Find the cumulative frequencies.

• The lower class limit represents the smallest data value that can be included in the class.

• The upper class limit represents the largest value that can be included in the class.

• The class boundaries are used to separate the classes so that there are no gaps in the frequency distribution.

• Rule of Thumb: Class limits should have the same decimal place value as the data, but the class boundaries have one additional place value and end in a 5.

• The class width for a class in a frequency distribution is found by subtracting the lower (or upper) class limit of one class from the lower (or upper) class limit of the next class.

• The class midpoint is found by adding the lower and upper boundaries (or limits) and dividing by 2.

Class Rules

• There should be between 5 and 20 classes.

• The classes must be mutually exclusive.

• The classes must be continuous.

• The classes must be exhaustive.

• The classes must be equal width.

Example : Ungrouped Frequency Distribution

The data shown here represent the number of miles per gallon that 30 selected four-wheel-drive sports utility vehicles obtained in city driving. Construct a frequency distribution.

12  17  12  14  16  18
16  18  12  16  17  15
15  16  12  15  16  16
12  14  15  12  15  15
19  13  16  18  16  14
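Because the range of this data set is small (12 to 19), each distinct value can serve as its own class. The sketch below tallies the frequencies and the cumulative frequencies; the variable names are illustrative assumptions.

```python
from collections import Counter
from itertools import accumulate

mpg = [12, 17, 12, 14, 16, 18, 16, 18, 12, 16, 17, 15, 15, 16, 12,
       15, 16, 16, 12, 14, 15, 12, 15, 15, 19, 13, 16, 18, 16, 14]

counts = Counter(mpg)
classes = list(range(min(mpg), max(mpg) + 1))  # one class per value: 12, ..., 19
freqs = [counts[c] for c in classes]
cum_freqs = list(accumulate(freqs))            # running totals

print("Class  Freq  Cum. freq")
for c, f, cf in zip(classes, freqs, cum_freqs):
    print(f"{c:>5}  {f:>4}  {cf:>9}")
```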

Example : Grouped Frequency Distribution

These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes.

112  100  127  120  134  118  105  110  109  112
110  118  117  116  118  122  114  114  105  109
107  112  114  115  118  117  118  122  106  110
116  108  110  121  113  120  119  111  104  111
120  113  120  117  105  110  118  112  114  114
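The steps for a grouped frequency distribution can be sketched as follows. The sketch assumes the lowest value is chosen as the starting point and that the class limits are whole numbers like the data, so each boundary sits 0.5 beyond its limit; other starting points are equally valid.

```python
import math

temps = [112, 100, 127, 120, 134, 118, 105, 110, 109, 112,
         110, 118, 117, 116, 118, 122, 114, 114, 105, 109,
         107, 112, 114, 115, 118, 117, 118, 122, 106, 110,
         116, 108, 110, 121, 113, 120, 119, 111, 104, 111,
         120, 113, 120, 117, 105, 110, 118, 112, 114, 114]

num_classes = 7
low, high = min(temps), max(temps)
width = math.ceil((high - low) / num_classes)  # range / number of classes, rounded up

table = []
cumulative = 0
for k in range(num_classes):
    lower = low + k * width          # lower class limit
    upper = lower + width - 1        # upper class limit
    f = sum(lower <= t <= upper for t in temps)
    cumulative += f
    # (limits, boundaries, frequency, cumulative frequency)
    table.append((lower, upper, lower - 0.5, upper + 0.5, f, cumulative))

for lower, upper, lb, ub, f, cf in table:
    print(f"{lower}-{upper}  {lb}-{ub}  {f:>3}  {cf:>3}")
```

If the range happened to divide evenly by the number of classes, the width would need to be increased by one so that the highest value still falls inside the last class.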

Why Construct Frequency Distributions?

• To facilitate computational procedures for measures of average and spread.

• To enable the reader to determine the nature or shape of the distribution.

• To enable the reader to make comparisons among different data sets.

• To organize the data in a meaningful, intelligible way.

• To enable the researcher to draw charts and graphs for the presentation of data.

2.2 Types of Graph

The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.

Graphs are useful in getting the audience's attention in a publication or a presentation.

A. Histogram, Frequency Polygon, Ogive

• Histogram – A graph that displays the data by using vertical bars of various heights to represent the frequencies

• Frequency Polygon – A graph that displays the data by using lines that connect points plotted for the frequencies at the midpoints of the classes. The frequencies are represented by the heights of the points.

• Ogive (Cumulative Frequency Graph) – A graph that represents the cumulative frequencies for the classes in a frequency distribution

Procedure to construct Histogram, Frequency Polygon & Ogive

• STEP 1 Draw and label the x and y axes.

• STEP 2 Choose a suitable scale for the frequencies or cumulative frequencies, and label it on the y axis.

• STEP 3 Represent the class boundaries for the histogram or ogive, or the midpoint for the frequency polygon, on the x axis.

• STEP 4 Plot the points and then draw the bars or lines.

Example

These data represent the record high temperatures for each of the 50 states. Construct a grouped frequency distribution for the data using 7 classes. Then, construct a histogram, frequency polygon and ogive for these data.

112  100  127  120  134  118  105  110  109  112
110  118  117  116  118  122  114  114  105  109
107  112  114  115  118  117  118  122  106  110
116  108  110  121  113  120  119  111  104  111
120  113  120  117  105  110  118  112  114  114
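The sketch below prepares what each of the three graphs needs for this example. It assumes the data are grouped into 7 classes of width 5 starting at 100, which gives the frequencies listed in the code, and it draws only a rough text histogram; a plotting library could be used to draw the actual figures.

```python
from itertools import accumulate

# Grouped frequency distribution of the temperature data, assuming 7 classes
# of width 5 starting at the lowest value (100).
boundaries = [99.5 + 5 * k for k in range(8)]   # 99.5, 104.5, ..., 134.5
freqs = [2, 8, 18, 13, 7, 1, 1]

# Histogram: one bar per class, drawn between consecutive class boundaries.
for lb, ub, f in zip(boundaries, boundaries[1:], freqs):
    print(f"{lb:>5}-{ub:<5} {'*' * f}")

# Frequency polygon: frequencies plotted at the class midpoints.
midpoints = [(lb + ub) / 2 for lb, ub in zip(boundaries, boundaries[1:])]
polygon_points = list(zip(midpoints, freqs))

# Ogive: cumulative frequencies plotted at the upper class boundaries,
# starting from 0 at the first lower boundary.
ogive_points = list(zip(boundaries, [0] + list(accumulate(freqs))))
```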

Distribution Shapes

B. Pareto Chart

Used to represent a frequency distribution for a categorical variable. The frequencies are displayed by the heights of vertical bars, which are arranged in order from highest to lowest.

Example


A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A

Construct a Pareto chart for the data.
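A Pareto chart starts from the categorical frequency distribution with the categories ordered from the most to the least frequent. The sketch below shows that ordering with a rough text rendering of the bars, assuming the 25 blood types are read row by row.

```python
from collections import Counter

blood = ("A B B AB O O O B AB B B B O A O "
         "A O O O AB AB A O B A").split()

# Categories ordered from highest to lowest frequency.
ordered = Counter(blood).most_common()

for category, f in ordered:
    print(f"{category:<3} {'#' * f} {f}")
```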

C. Time Series Graph

• Represents data that occur over a specified period of time

• STEP 1 Draw and label the x and y axes.

• STEP 2 Label the x axis for years and the y axis for the number of theaters.

• STEP 3 Plot each point according to the table.

• STEP 4 Draw line segments connecting adjacent points. Do not try to fit a smooth curve through the data points.

• We look for a trend or pattern that occurs over the time period (ascending or descending) and at the slope or steepness of the line (how fast the values increase or decrease)

• Two time series can be drawn on the same graph for comparison (compound time series graph)

Example

In 1958, there were more than 4000 outdoor drive-in theaters. The number of these theaters has changed over the years. Draw a time series graph for the data and summarize the findings.

Year   Number
1988   1497
1990   910
1992   870
1994   859
1996   826
1998   750
2000   637
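Before drawing the graph, the trend and the steepness of each segment can be summarized numerically. The following is a small sketch with illustrative variable names: it computes the change between adjacent points, which is what the slope of each line segment shows.

```python
years = [1988, 1990, 1992, 1994, 1996, 1998, 2000]
theaters = [1497, 910, 870, 859, 826, 750, 637]

# Change between adjacent points (each step covers two years).
changes = [b - a for a, b in zip(theaters, theaters[1:])]
for year, change in zip(years[1:], changes):
    print(f"{year - 2}-{year}: {change:+d}")

steepest_drop = min(changes)                      # most negative change
always_decreasing = all(c < 0 for c in changes)   # descending trend throughout?
print(steepest_drop, always_decreasing)
```

The number of drive-in theaters falls in every period, with by far the steepest decline between 1988 and 1990.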

D. Pie Chart

A pie graph is a circle that is divided into sections or wedges according to the percentage of frequencies in each category of the distribution.

The purpose of the pie graph is to show the relationship of the parts to the whole by visually comparing the sizes of the sectors.

Percentages or proportions can be used. The variable is nominal or categorical.

Example


A   B   B   AB  O
O   O   B   AB  B
B   B   O   A   O
A   O   O   O   AB
AB  A   O   B   A

Construct a pie chart for the data.
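Each wedge of the pie takes the same share of the 360° circle as its category takes of the total frequency. The sketch below computes the percentage and the central angle for each blood type, assuming the data are read row by row; the names are illustrative.

```python
from collections import Counter

blood = ("A B B AB O O O B AB B B B O A O "
         "A O O O AB AB A O B A").split()

freq = Counter(blood)
n = sum(freq.values())

# category -> (percent of the whole, central angle of the wedge in degrees)
sectors = {c: (100 * f / n, 360 * f / n) for c, f in freq.items()}

for category in ["A", "B", "O", "AB"]:
    percent, degrees = sectors[category]
    print(f"{category:<3} {percent:5.1f}%  {degrees:6.1f} degrees")
```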

Stem-and-Leaf Plots

• A stem-and-leaf plot is a data plot that uses part of a data value as the stem and part of the data value as the leaf to form groups or classes.

• It has the advantage over grouped frequency distribution of retaining the actual data while showing them in graphic form.

Stem leaf

Example

An insurance company researcher conducted a survey on the number of car thefts in a large city for a period of 30 days last summer. The raw data are shown below. Construct a stem and leaf plot.

52  62  51  50  69
58  77  66  53  57
75  56  55  67  73
79  59  68  65  72
57  51  63  69  75
65  53  78  66  55
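For two-digit data like these, the tens digit can serve as the stem and the units digit as the leaf. That choice is an assumption of the sketch below and works only because all values lie between 50 and 79.

```python
from collections import defaultdict

thefts = [52, 62, 51, 50, 69, 58, 77, 66, 53, 57, 75, 56, 55, 67, 73,
          79, 59, 68, 65, 72, 57, 51, 63, 69, 75, 65, 53, 78, 66, 55]

plot = defaultdict(list)
for value in sorted(thefts):          # sorting first keeps the leaves in order
    stem, leaf = divmod(value, 10)    # tens digit = stem, units digit = leaf
    plot[stem].append(leaf)

for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

The number of leaves on each stem is the frequency of that class, while the individual data values remain readable.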

Conclusions (2.1 & 2.2)

• Data can be organized in some meaningful way using frequency distributions. Once the frequency distribution is constructed, the representation of the data by graphs is a simple task.

2.3 Summary Statistics (Data Description)

• Statistical methods can be used to summarize data.

• Measures of average are also called measures of central tendency and include the mean, median, mode, and midrange.

• Measures that determine the spread of data values are called measures of variation or measures of dispersion and include the range, variance, and standard deviation.

• Measures of position tell where a specific data value falls within the data set or its relative position in comparison with other data values.

• The most common measures of position are percentiles, deciles, and quartiles.

• The measures of central tendency, variation, and position are part of what is called traditional statistics. These measures are typically used to confirm conjectures about the data.

Measures of Central Tendency

Mean

The sum of the values divided by the total number of values.

Population Mean

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}, \quad N = \text{population size}$$

Sample Mean

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}, \quad n = \text{sample size}$$

Arithmetic Mean – Individual Data

Example 1

• Calculate the arithmetic mean for the following: 3, 5, 8, 12, 15

The Arithmetic Mean – Ungrouped Frequency Distribution

Example 2

• Number of defects in a sample of 50 products

No. of defects   No. of products
0                5
1                7
2                15
3                13
4                6
5                4

The Arithmetic Mean – Grouped Frequency Distribution

Example 3

• A radar speed recorder was set up on a stretch of road to which a legal speed limit was applied. The results are summarized in the table below:

Speed (mph)   No. of cars observed
15 – 20       5
20 – 25       39
25 – 30       112
30 – 35       295
35 – 40       242
40 – 45       89
45 – 50       8
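The three mean examples can be worked through as sketched below. Two assumptions are made: in Example 3 each class is represented by its midpoint, and the last class (45 – 50) has a frequency of 8, reading the trailing "37" in the extracted table as a page number.

```python
# Example 1: individual data.
values = [3, 5, 8, 12, 15]
mean_individual = sum(values) / len(values)

# Example 2: ungrouped frequency distribution (number of defects -> frequency).
defects = {0: 5, 1: 7, 2: 15, 3: 13, 4: 6, 5: 4}
mean_defects = sum(x * f for x, f in defects.items()) / sum(defects.values())

# Example 3: grouped frequency distribution, each class represented by its
# midpoint. The frequency 8 for the last class is an assumption (see above).
speeds = [((15, 20), 5), ((20, 25), 39), ((25, 30), 112), ((30, 35), 295),
          ((35, 40), 242), ((40, 45), 89), ((45, 50), 8)]
total_cars = sum(f for _, f in speeds)
mean_speed = sum((lo + hi) / 2 * f for (lo, hi), f in speeds) / total_cars

print(mean_individual, mean_defects, round(mean_speed, 2))
```

The grouped mean is only an approximation of the mean of the original speeds, because the individual values inside each class are no longer known.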

Mean

• One computes the mean by using all the values of the data.

• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.

• The mean is used in computing other statistics, such as variance.

• The mean for the data set is unique, and not necessarily one of the data values.

• The mean cannot be computed for an open-ended frequency distribution.

• The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations


Median

The middle number of n ordered data values (smallest to largest).

If n is odd:

$$\text{Median} = x_{\frac{n+1}{2}}$$

If n is even:

$$\text{Median} = \frac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$$

Median

• The median is used when one must find the center or middle value of a data set.

• The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.

• The median is used to find the average of an open-ended distribution.

• The median is affected less than the mean by extremely high or extremely low values.

The Median – Individual Data

Example 4

• The following data relate to the marks obtained by 15 students in a course

• Progress test 1: marks obtained

30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72

• In the case of even number of observations, there is, no definite middle item

• The median is then taken to be the average of two middle items
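Both cases can be sketched in a few lines. The function below is an illustrative helper, not part of the chapter; the even case is shown by dropping the last mark from the Example 4 data, purely to demonstrate the rule.

```python
marks = [30, 35, 52, 52, 35, 40, 59, 60, 41, 46, 61, 65, 47, 70, 72]

def median(data):
    ordered = sorted(data)            # the data must be ordered first
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # middle item, x_((n+1)/2) counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2   # average of the two middle items

print(median(marks))        # 15 values: the 8th ordered value
print(median(marks[:-1]))   # 14 values: average of the 7th and 8th ordered values
```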


The Median – Locating the Median Graphically

• Example 5

• Given below is the frequency distribution of marks obtained by 50 students in a certain college

Marks     No. of Students
10 – 20   3
20 – 30   7
30 – 40   10
40 – 50   20
50 – 60   7
60 – 70   3

The Median – Ungrouped Frequency Distribution

• Example 6

• Tests for defects are carried out in a textile factory on a lot comprising 400 pieces of cloth. The results of the tests are tabulated below

No. of faults per piece   No. of pieces
0                         92
1                         142
2                         96
3                         46
4                         18
5                         6
6                         0


Mode

The most commonly occurring value in a data series

• The mode is used when the most typical case is desired.

• The mode is the easiest average to compute.

• The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.

• The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

The Mode – Individual Data

• Example 7

• Determine the mode from the following data:

• Marks obtained by 10 students

10, 27, 24, 12, 27, 27, 20, 18, 15, 20
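As a quick check of Example 7, the sketch below uses `multimode` from Python's standard `statistics` module (available from Python 3.8). It returns every value that ties for the highest frequency, which matches the remark that the mode is not always unique.

```python
from statistics import multimode

marks = [10, 27, 24, 12, 27, 27, 20, 18, 15, 20]

modes = multimode(marks)   # list of all values with the highest frequency
print(modes)
```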


The Mode – Grouped Frequency Distribution

• Example 8

• A client company of your firm is a horticultural shop selling a wide variety of products to its customers. The analysis of weekly sales of plants throughout the year is summarized in the following frequency distribution

Weekly sales of plants ($)   No. of weeks
1255 – 1280                  9
1280 – 1305                  19
1305 – 1330                  10
1330 – 1355                  8
1355 – 1380                  6

Midrange is a rough estimate of the middle. It is also a very rough estimate of the average and can be affected by one extremely high or low value.

$$MR = \frac{\text{lowest value} + \text{highest value}}{2}$$


Types of Distribution

Symmetric

Positively skewed or right-skewed Negatively skewed or left-skewed

Measures of Variation / Dispersion

• Used when a measure of central tendency alone is not informative enough (e.g., two data sets have the same mean but different spreads)

• A measure of variation gauges the variability that exists in a data set

• Used to form a judgment about how well the average value depicts the data

• To learn the extent of the scatter so that steps may be taken to control the existing variation


Range

The difference between the highest value and the lowest value in a data set.

The symbol R is used for the range.

R = highest value - lowest value


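Variance

The average of the squares of the distance each value is from the mean.

Population Variance

$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}, \quad N = \text{population size}$$

Sample Variance

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}, \quad n = \text{sample size}$$

Standard Deviation

The square root of the variance.

Population standard deviation, σ

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}, \quad N = \text{population size}$$

Sample standard deviation, s

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}, \quad n = \text{sample size}$$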

Variance

• The variance is the average of the squared deviations from the arithmetic mean

• Calculation of Variance

• The following data relate to the marks obtained by 15 students in an Accounting examination

• 50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70
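The example does not say whether the 15 marks are to be treated as a population or as a sample, so the sketch below computes both versions directly from the definitions; the variable names are illustrative.

```python
import math

marks = [50, 60, 60, 65, 70, 50, 40, 45, 40, 50, 70, 80, 80, 70, 70]

n = len(marks)
mean = sum(marks) / n
squared_deviations = sum((x - mean) ** 2 for x in marks)

population_variance = squared_deviations / n     # divide by N
sample_variance = squared_deviations / (n - 1)   # divide by n - 1
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

print(mean, population_variance, round(sample_variance, 2))
print(round(population_sd, 2), round(sample_sd, 2))
```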


Standard Deviation

• Calculation of Standard Deviation – grouped frequency distribution

• The following data relate to the sales of electronic calculators in the South of England

Sales per week (thousand)   No. of weeks
4 – 6                       2
6 – 8                       5
8 – 10                      12
10 – 12                     9
12 – 14                     3

Variance & Standard deviation

• Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.

• The measures of variance and standard deviation are used to determine the consistency of a variable.

• The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.

• The variance and standard deviation are used quite often in inferential statistics.

Describing the position of the data value

• Measures of Position

Quartile (Grouped Data)

$$Q_i = L + \left( \frac{\frac{i\,n}{4} - F}{f} \right) C \quad \text{for } i = 1, 2, 3$$

where;

L = Lower limit of the interval containing Q_i

C = Width of the interval containing Q_i

F = Cumulative frequency before the class containing Q_i

f = Frequency of the class containing Q_i

Quartile Deviation – Individual Data

• The following are the marks of 9 students in a certain examination.

Student No.   Marks
1             20
2             28
3             40
4             12
5             30
6             15
7             50
8             45
9             60
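The chapter does not state which convention is used to locate quartiles in individual data, and several are in common use. The sketch below assumes the position rule i(n + 1)/4, which is the default ("exclusive") method of `statistics.quantiles` in Python 3.8 and later, and it takes the quartile deviation to be half the distance between the third and first quartiles (the standard definition, not spelled out in the chapter).

```python
from statistics import quantiles

marks = [20, 28, 40, 12, 30, 15, 50, 45, 60]

# quantiles() sorts the data itself; n=4 asks for the three quartile cut points.
q1, q2, q3 = quantiles(marks, n=4)
quartile_deviation = (q3 - q1) / 2    # semi-interquartile range

print(q1, q2, q3, quartile_deviation)
```

A different quartile convention would give slightly different values for Q1 and Q3, and therefore for the quartile deviation.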


Quartile Deviation

Example – Grouped Frequency Distribution

• The following grouped frequency table describes the weight of 95 packages selected for a QC test.

Weight (grams)   No. of Packages
450 – 452        11
452 – 454        26
454 – 456        34
456 – 458        24
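The grouped-data quartile formula can be applied to this table as sketched below. The helper name `grouped_quantile` is hypothetical, and the quartile deviation is again taken as (Q3 − Q1)/2, the standard definition.

```python
def grouped_quantile(classes, fraction):
    """classes: list of (lower limit, upper limit, frequency); fraction: 1/4 for Q1, etc."""
    n = sum(f for _, _, f in classes)
    target = fraction * n
    cumulative = 0                    # F: cumulative frequency before the class
    for lower, upper, f in classes:
        if cumulative + f >= target:  # first class whose cumulative frequency reaches the target
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("fraction must be between 0 and 1")

weights = [(450, 452, 11), (452, 454, 26), (454, 456, 34), (456, 458, 24)]

q1 = grouped_quantile(weights, 1 / 4)
q3 = grouped_quantile(weights, 3 / 4)
quartile_deviation = (q3 - q1) / 2

print(round(q1, 2), round(q3, 2), round(quartile_deviation, 2))
```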

The measures of central tendency, variation, and position for Grouped data

Measures of Central Tendency

Mean Class

$$\bar{x} = \frac{\sum_i f_i x_i}{N}$$

where;

f_i = frequency

x_i = midpoint

N = total number of observations (the sum of the frequencies)

Median class

$$\text{Median} = L + \left( \frac{\frac{n}{2} - F}{f} \right) C$$

where;

L = Lower limit of the interval containing the median

C = Width of the interval containing the median

F = Cumulative frequency before the median class

f = Frequency of the median class
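Applied to the marks of the 50 students in Example 5, the median class formula can be sketched as follows (variable names mirror the symbols L, C, F and f above):

```python
marks = [(10, 20, 3), (20, 30, 7), (30, 40, 10),
         (40, 50, 20), (50, 60, 7), (60, 70, 3)]   # (lower, upper, frequency)

n = sum(f for _, _, f in marks)
F = 0                                  # cumulative frequency before the median class
for lower, upper, f in marks:
    if F + f >= n / 2:                 # first class whose cumulative frequency reaches n/2
        C = upper - lower              # class width
        median = lower + (n / 2 - F) / f * C
        break
    F += f

print(lower, F, f, median)
```

Here the median class is 40 – 50, so L = 40, F = 20, f = 20 and C = 10.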

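Mode class

$$\text{Mode} = L + \left( \frac{\Delta_1}{\Delta_1 + \Delta_2} \right) C$$

where;

L = Lower limit of the interval containing the mode

C = Width of the interval containing the mode

Δ_1 = Frequency of the modal class - frequency of the class before the modal class

Δ_2 = Frequency of the modal class - frequency of the class after the modal class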
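For the weekly plant sales in Example 8, the mode class formula can be sketched as below. Treating a missing neighbouring class as having frequency 0 is an assumption of the sketch; it is not needed for this table, where the modal class has a class on either side.

```python
sales = [(1255, 1280, 9), (1280, 1305, 19), (1305, 1330, 10),
         (1330, 1355, 8), (1355, 1380, 6)]          # (lower, upper, frequency)

freqs = [f for _, _, f in sales]
k = freqs.index(max(freqs))                         # position of the modal class

before = freqs[k - 1] if k > 0 else 0               # neighbouring frequencies,
after = freqs[k + 1] if k < len(freqs) - 1 else 0   # taken as 0 at either end
d1 = freqs[k] - before                              # delta 1
d2 = freqs[k] - after                               # delta 2

lower, upper, _ = sales[k]
mode = lower + d1 / (d1 + d2) * (upper - lower)

print(lower, upper, d1, d2, round(mode, 2))
```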

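Measures of Variation

Population variance

$$\sigma^2 = \frac{\sum_i f_i (x_i - \mu)^2}{N} = \frac{\sum_i f_i x_i^2 - \frac{\left( \sum_i f_i x_i \right)^2}{N}}{N}$$

where;

f_i = frequency

x_i = midpoint

N = population size

μ = mean class

Sample variance

$$s^2 = \frac{\sum_i f_i (x_i - \bar{x})^2}{n - 1} = \frac{\sum_i f_i x_i^2 - \frac{\left( \sum_i f_i x_i \right)^2}{n}}{n - 1}$$

where;

f_i = frequency

x_i = midpoint

n = sample size

x̄ = mean class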
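The shortcut forms of the grouped variance can be tried on the calculator sales table from the standard deviation example (31 weeks in total). The sketch below computes both the population and the sample version, since the example does not say which one is intended.

```python
import math

sales = [(4, 6, 2), (6, 8, 5), (8, 10, 12), (10, 12, 9), (12, 14, 3)]   # (lower, upper, weeks)

n = sum(f for _, _, f in sales)
sum_fx = sum(f * (lo + hi) / 2 for lo, hi, f in sales)          # sum of f * midpoint
sum_fx2 = sum(f * ((lo + hi) / 2) ** 2 for lo, hi, f in sales)  # sum of f * midpoint squared

mean = sum_fx / n
population_variance = (sum_fx2 - sum_fx ** 2 / n) / n
sample_variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(round(mean, 2), round(population_variance, 2),
      round(sample_variance, 2), round(sample_sd, 2))
```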

Conclusions

• By combining all of these techniques discussed in this chapter together, the student is now able to collect, organize, summarize and present data.
