
SHF1124

1

CHAPTER 3: INTRODUCTION TO STATISTICS

The Statistical Process

3.1 Introduction

Statistics:

A field of study that involves collecting, presenting, analyzing and interpreting data as a basis for explanation, description and comparison.

It is used to analyze the results of surveys and as a tool in scientific research to make decisions based on controlled experiments.

It is also useful for operations research, quality control, estimation and prediction.

Population: a collection, or set of individuals or objects or events whose properties are

to be analyzed.

Sample: a group of subjects selected from the population. Sample is a subset of a

population.

The statistical process:

1) Statistical POPULATION: the collection of data we wish to gather information about.

- E.g.: All students of CFS IIUM

2) Plan the Investigation: What? How? Who? Where?

3) Collect the Sample.

4) SAMPLE: data collected from the population.

- E.g.: Students of Dept. of Science

5) Analyze the Data: organize, describe and present them using sample statistics.

- Graphic: e.g. histogram, ogive, frequency polygon

- Numeric: e.g. mean, standard deviation

6) Make Inferences: determine what the statistics tell us about the population.


Data: a set of recorded observations or values. Any quantity that can take a number of different values is a variable. Variables whose values are determined by chance are called random variables.

Data set: a collection of data values. Each value in the data set is called a data value or a

datum.

Variable: a characteristic or attribute that can assume different values.

A statistical exercise normally consists of 4 stages:

i) Collection of data by counting or measuring.

ii) Ordering and presentation of the data in a convenient form.

iii) Analysis of the collected data.

iv) Interpretation of the results and conclusions formulated.

3.1.1 Two branches of Statistics

STATISTICS

DESCRIPTIVE STATISTICS

Consists of the collection, organization, summarization and presentation of data.

-Describes a situation. Data are presented in the form of charts, graphs or tables.

-Makes use of graphical techniques and numerical descriptive measures, such as the average, to summarize and present the data.

-E.g.: The national census conducted by the Malaysian government every 5 or 10 years. The results of this census give information about the average age, income and other characteristics of the Malaysian population.

INFERENTIAL STATISTICS

Consists of generalizing from samples to populations, performing hypothesis tests, determining relationships among variables and making predictions.

- Inferences are made from samples to populations.

-Uses probability, that is, the chance of an event occurring.

-The area of inferential statistics called hypothesis testing is a decision-making process for evaluating claims about a population, based on information obtained from samples.

- E.g.: A researcher may want to know whether a new skin lotion containing aloe vera reduces skin problems in children. For this study, two groups of young children would be selected. One group would be given the lotion containing aloe vera and the other a normal lotion without aloe vera. The results are then observed by experts to assess the effectiveness of the new product.


3.1.2 Variables and Types of Data

LEVEL OF MEASUREMENT

Statisticians gain information about a particular situation by collecting data for

random variables.

Types of Data (variables)

1) Qualitative variables

Variables that can be placed into distinct categories, according to some

characteristics or attribute.

Nonnumeric categories

E.g.: Gender, color, religion, workplace, etc.

2) Quantitative variables

Variables that are numerical in nature and can be ordered or ranked.

A quantitative variable may be one of two kinds:

Discrete variable – a variable whose values can be counted, or for which there is a fixed set of values. Example: the number of children in a family, the number of students in a class, etc.

Continuous variable – a variable that can be measured on a continuous scale, the result depending on the precision of the measuring instrument or the accuracy of the observer. A continuous variable can assume all values between any two specific values. Example: temperatures, heights, weights, time taken, etc.

[Diagram: TYPES OF DATA (VARIABLES). QUALITATIVE: nominal, ordinal. QUANTITATIVE: discrete, continuous; interval, ratio.]


Variables can be classified by how they are categorized, counted or measured. Data/

variables can be classified according to the LEVEL OF MEASUREMENT as follows:

1) Nominal Level Data: - classifies data (persons/objects) into two or more

categories. Whatever the basis for classification, a person can only be in one

category and members of a given category have a common set of characteristics.

The lowest level of measurement.

No ranking/order can be placed on the data

E.g. : Gender (Male / Female) , Type of school (Public / Private),

Height (Tall/Short) , etc

2) Ordinal Level Data:- classifies data into categories that can be ranked; however

precise differences between the ranks do not exist.

This type of measuring scale puts the data/subjects in order from highest to

lowest, from most to least. It does not indicate how much higher or how

much better. Intervals between ranks are not equal.

E.g.: Letter grades (A,B,C,D,E,F) ; Man’s build (small, medium, or large)-large

variation exists among the individuals in each class.

3) Interval Level Data:- has all the characteristics of the nominal and ordinal scales and, in addition, is based on predetermined equal intervals. It has no true zero point (ratios between numbers on the scale are not meaningful). E.g.:

Achievement tests, aptitude tests, IQ tests. There is a meaningful one-point difference between an IQ of 110 and an IQ of 111.

The Fahrenheit scale is a clear example of the interval scale of measurement. Thus, 60 degrees Fahrenheit and -10 degrees Fahrenheit are interval data. Measurement of sea level is another example of an interval scale. Each of these scales has direct, measurable quantities with equality of units. In addition, zero does not represent the absolute lowest value. Rather, it is a point on the scale with numbers both above and below it (for example, -10 degrees Fahrenheit).

4) Ratio Level Data:- possesses all the characteristics of the interval scale and, in addition, has a meaningful (true) zero point. True ratios exist when the same variable is measured on two different members of the population.

The highest, most precise level of measurement.

E.g.: Weight, number of calls received, height.


3.1.3 Data collection and Sampling Techniques

Sampling is the process of selecting a number of individuals for a study in such a way

that the individuals represent the larger group from which they were selected.

The purpose of sampling is to use a sample to gain information about a population.

To obtain samples that are unbiased, statisticians use 4 basic methods of sampling:

i) Random Sampling: Subjects are selected by using random numbers.

ii) Systematic Sampling: Subjects are selected by taking every kth subject after the first subject is randomly selected from 1 through k.

iii) Stratified Sampling: Subjects are selected by dividing the population into groups (strata) and randomly selecting subjects within each group.

- E.g.: We divide the population into 5 groups, then take subjects from each group to form our sample.

iv) Cluster Sampling: Subjects are selected by using intact groups that are representative of the population.

- E.g.: We divide the population into 5 groups, then take 2 whole groups as our sample. That means 2 groups of subjects represent all 5 groups.
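The four sampling methods can be sketched in Python. This is only an illustration: the population of 100 numbered subjects, the 5 groups of 20 and all function names are assumptions made for the example, not part of the notes.

```python
import random

def simple_random_sample(population, n, rng):
    """Random sampling: pick n subjects, each equally likely to be chosen."""
    return rng.sample(population, n)

def systematic_sample(population, k, rng):
    """Systematic sampling: random start among the first k, then every kth subject."""
    start = rng.randrange(k)  # index 0..k-1 stands for subject 1..k
    return population[start::k]

def stratified_sample(strata, n_per_stratum, rng):
    """Stratified sampling: randomly pick subjects from every group (stratum)."""
    sample = []
    for stratum in strata:
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Cluster sampling: randomly pick whole groups and keep everyone in them."""
    chosen = rng.sample(clusters, n_clusters)
    return [subject for cluster in chosen for subject in cluster]

rng = random.Random(0)                      # fixed seed so the run is repeatable
population = list(range(1, 101))            # hypothetical subjects numbered 1..100
groups = [population[i:i + 20] for i in range(0, 100, 20)]  # 5 hypothetical groups

random_s = simple_random_sample(population, 10, rng)
systematic_s = systematic_sample(population, 10, rng)
stratified_s = stratified_sample(groups, 4, rng)
cluster_s = cluster_sample(groups, 2, rng)
```

Note the difference between the last two: the stratified sample takes a few subjects from all 5 groups, while the cluster sample takes every subject from only 2 groups.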

Exercise:

A ) Classify each set of data as discrete or continuous.

1) The number of suitcases lost by an airline.

2) The height of corn plants.

3) The number of ears of corn produced.

4) The number of green M&M's in a bag.

5) The time it takes for a car battery to die.

6) The production of tomatoes by weight.


B) Identify the following as nominal level, ordinal level, interval level, or ratio level data.

1) Percentage scores on a Math exam.

2) Letter grades on an English essay.

3) Flavors of yogurt.

4) Instructors classified as: Easy, Difficult or Impossible.

5) Employee evaluations classified as : Excellent, Average, Poor.

6) Religions.

7) Political parties.

8) Commuting times to school.

9) Years (AD) of important historical events.

10) Ages (in years) of statistics students.

11) Ice cream flavor preference.

12) Amount of money in savings accounts.

13) Students classified by their reading ability: Above average, Below average, Normal.


3.2 ORGANIZING DATA & PRESENTATION OF DATA

3.2.1 FREQUENCY DISTRIBUTION

A frequency distribution is the organization of raw data in table form, using classes and

frequencies.

There are three types of frequency distribution.

1. Categorical frequency distribution

-for data that can be placed in specific category

Example : The following data represent the color of men’s shirts purchased in the

men’s department of a large department store. Construct a frequency distribution

for the data. (W = White, BL = Blue, BR = Brown, Y =Yellow, G = Gray)

W W BR Y BL BL W W Y G

W W BL BR BL BR BL BL BR Y

BL G W BL W W BL W BL BR

Y BL G BR G BR W W BR Y

W BL Y W W BL W BR G G

(A complete categorical distribution must have class, frequency & percentage column

in the table)
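As a rough check on the shirt example, the class, frequency and percentage columns can be tallied in Python. The variable names are illustrative; the data are the 50 shirt colors listed above.

```python
from collections import Counter

shirts = """
W W BR Y BL BL W W Y G
W W BL BR BL BR BL BL BR Y
BL G W BL W W BL W BL BR
Y BL G BR G BR W W BR Y
W BL Y W W BL W BR G G
""".split()

counts = Counter(shirts)          # frequency of each class (color)
total = sum(counts.values())      # total number of shirts

# One row per class: (class, frequency, percentage)
table = [(color, f, 100 * f / total) for color, f in counts.most_common()]

print(f"{'Class':<6}{'Frequency':>10}{'Percent':>10}")
for color, f, pct in table:
    print(f"{color:<6}{f:>10}{pct:>9.1f}%")
```

The percentage column is simply frequency divided by the total, times 100, so the column should add up to 100%.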

2. Grouped frequency distribution

-when the range of the data is large, the data must be grouped into classes.

Example: The ages of the signers of the Declaration of Independence are shown

below. Construct a frequency distribution for the data using seven classes.

41 54 47 40 39 35 50 37 49 42 70

32 44 52 39 50 40 30 34 69 39 45

33 42 44 63 60 27 42 34 50 42 52

38 36 45 35 43 48 46 31 27 55 63

46 33 60 62 35 46 45 34 53 50 50

Example: The number of calories per serving for selected ready-to-eat cereals is

listed here. Construct a frequency distribution using seven classes.

130 190 140 80 100 120 220 220 110 100

210 130 100 90 210 120 200 120 180 120

190 210 120 200 130 180 260 270 100 160

190 240 80 120 90 190 200 210 190 180

115 210 110 225 190 130


3. Ungrouped frequency distribution

-when the range of data is small

Example: A survey taken in a restaurant shows the following number of cups of

coffee consumed with each meal. Construct frequency distribution.

0 2 2 1 1 2 3 5 3 2

2 2 1 0 1 2 4 2 0 1

0 1 4 4 2 2 0 1 1 5

Procedure to construct a frequency distribution (this procedure is not unique):

1) Determine the number of classes, normally 5 – 20.

2) Find the range = highest value – lowest value.

3) Find the class width by dividing the range by the number of classes and rounding up. The class width should be an odd number.

$\text{Class width} = \dfrac{\text{range}}{\text{number of classes}}$ (rounded up)

4) Class Limit:

Lower class limit = the lowest value or any number less than the lowest value.

Upper class limit = (Lower class limit + class width) - 1

5) Class Boundary (to separate classes so that there are no gaps in the frequency distribution):

Lower Class Boundary = Lower class limit - 0.5

Upper Class Boundary = Upper class limit + 0.5

6) Find the frequency and cumulative frequency.

Class width = Upper Class Boundary - Lower Class Boundary

= Lower class limit of next class - Lower class limit of one class

= Upper class limit of next class - Upper class limit of one class

Class Midpoint = (Lower Class Boundary + Upper Class Boundary)/2

= (Upper class limit + Lower class limit)/2
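The procedure can be sketched in Python using the ages of the signers from the earlier example with seven classes. The function name `grouped_frequency` is hypothetical, and the sketch assumes whole-number data (which is what the -1 and ±0.5 rules rely on) and starts the first class at the lowest value.

```python
def grouped_frequency(data, num_classes):
    """Build a grouped frequency table following steps 1) to 6)."""
    lowest, highest = min(data), max(data)
    # Step 3: range / number of classes, always rounded up to the next integer
    width = (highest - lowest) // num_classes + 1
    if width % 2 == 0:          # the notes prefer an odd class width
        width += 1
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = lowest + i * width              # lower class limit
        upper = lower + width - 1               # upper class limit
        f = sum(lower <= x <= upper for x in data)
        cumulative += f
        rows.append({
            "limits": (lower, upper),
            "boundaries": (lower - 0.5, upper + 0.5),
            "midpoint": (lower + upper) / 2,
            "f": f,
            "cf": cumulative,
        })
    return rows

ages = [41, 54, 47, 40, 39, 35, 50, 37, 49, 42, 70,
        32, 44, 52, 39, 50, 40, 30, 34, 69, 39, 45,
        33, 42, 44, 63, 60, 27, 42, 34, 50, 42, 52,
        38, 36, 45, 35, 43, 48, 46, 31, 27, 55, 63,
        46, 33, 60, 62, 35, 46, 45, 34, 53, 50, 50]

rows = grouped_frequency(ages, 7)
for r in rows:
    print(r["limits"], r["boundaries"], r["midpoint"], r["f"], r["cf"])
```

Here the range is 70 - 27 = 43 and 43/7 is about 6.1, so the class width is rounded up to 7 and the first class is 27 – 33.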


3.2.2 HISTOGRAMS, FREQUENCY POLYGONS AND OGIVES

Example:

For 108 randomly selected college applicants, the following frequency distribution for

entrance exam scores was obtained.

Class Limit    Frequency
90 – 98        6
99 – 107       22
108 – 116      43
117 – 125      28
126 – 134      9

Construct:

1. Histogram

i) x-axis: class boundary; y-axis: frequency

ii) x-axis: class boundary; y-axis: relative frequency

2. Frequency Polygon

i) x-axis: class midpoint; y-axis: frequency

ii) x-axis: class midpoint; y-axis: relative frequency

3. Ogive

i) x-axis: class boundary; y-axis: cumulative frequency

ii) x-axis: class boundary; y-axis: cumulative relative frequency

Relative frequency = $\dfrac{f}{\sum f}$

Cumulative relative frequency = $\dfrac{\text{cumulative frequency}}{\sum f}$, or add the relative frequency of each class to the running total of relative frequencies.
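For the entrance exam example, the columns needed for the three graphs can be computed as follows. This is a small sketch; the variable names are illustrative.

```python
from itertools import accumulate

limits = [(90, 98), (99, 107), (108, 116), (117, 125), (126, 134)]
freq = [6, 22, 43, 28, 9]

total = sum(freq)                                 # sum of f
boundaries = [(lo - 0.5, hi + 0.5) for lo, hi in limits]
midpoints = [(lo + hi) / 2 for lo, hi in limits]
rel = [f / total for f in freq]                   # relative frequency
cum = list(accumulate(freq))                      # cumulative frequency
cum_rel = [cf / total for cf in cum]              # cumulative relative frequency

for b, m, f, r, cf, cr in zip(boundaries, midpoints, freq, rel, cum, cum_rel):
    print(b, m, f, round(r, 3), cf, round(cr, 3))
```

The histogram and ogive use the boundaries on the x-axis, while the frequency polygon uses the midpoints.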


Note: Graphing

Given the frequency distribution below:

Class Limit    Class Boundary    f     Cf
0 – 19         -0.5 – 19.5       13    13
20 – 39        19.5 – 39.5       18    31

The first value on the x-axis is -0.5. It can be drawn in either of two ways:

[Figure: two alternative x-axes, each marked -0.5, 19.5, 39.5.]

All graphs must be drawn on the right side of the y-axis. Omit the exercise questions on analyzing the graph.

Exercise:

1. In a class of 35 students, the following grade distribution was found. Construct a

histogram, frequency polygon and ogive for the data. (A=4, B=3, C=2, D=1, F=0)

Grade    Frequency
0        3
1        6
2        9
3        12
4        5

2. Using the histogram shown below, construct

i) A frequency distribution

ii) A frequency polygon

iii) An ogive

[Figure: histogram with frequency (0 to 7) on the y-axis and class boundaries 21.5, 24.5, 27.5, 30.5, 33.5, 36.5, 39.5, 42.5 on the x-axis.]


3. The number of calories per serving for selected ready-to-eat cereals is listed here.

Construct a histogram, frequency polygon and ogive for the data using relative

frequency.

130 190 140 80 100 120 220 220 110 100

210 130 100 90 210 120 200 120 180 120

190 210 120 200 130 180 260 270 100 160

190 240 80 120 90 190 200 210 190 180

115 210 110 225 190 130

4. Below is a data set for the duration (in minutes) of a random sample of 24 long-

distance phone calls:

1 20 10 20 12 23 3 7 18 12 4 5

15 7 29 10 18 10 10 23 4 12 8 6

a) Construct a frequency distribution table for the data using the classes “1 to 5” “6

to 10” etc.

b) Construct a cumulative frequency distribution table and use it to draw up an

ogive.

5. The following table refers to the 2003 average income (in thousand Ringgit) per

year for 20 employees of company A.

Income ('000 Ringgit)    Frequency
5 – 9                    6
10 – 14                  3
15 – 19                  2
20 – 24                  4
25 – 29                  3
30 – 34                  2

a) Draw the histogram and frequency polygon for the above data.

b) Construct the cumulative frequency table. Hence, draw up an ogive for the above

data.


3.3 DATA DESCRIPTION

3.3.1 MEASURES OF CENTRAL TENDENCY

Mean, Median and Mode for Ungrouped Data

Mean (arithmetic average)

Symbol for sample: $\bar{X}$    Symbol for population: $\mu$

(The syllabus focuses on the sample formula.)

Mean, $\bar{X} = \dfrac{\sum X}{n}$

Median: (the middle point in an ordered data set)

- arrange the data in order, ascending or descending

- select the middle point, or use the formula $T = \dfrac{n+1}{2}$, where n is the number of data values.

- Then, the median is:

the value at location T (for an odd number of data values)

the average of the two values on either side of location T, that is, the values at locations n/2 and n/2 + 1 (for an even number of data values)

Mode : the value that occur most often in the data set

Example:

1) The following data are the number of burglaries reported for a specific year for nine

western Pennsylvania universities. Find mean, median and mode.

61, 11, 1, 3, 2, 30, 18, 3, 7

2) Twelve major earthquakes had Richter magnitudes shown here. Find mean, median

and mode.

7.0 , 6.2 , 7.7 , 8.0 , 6.4 , 6.2 , 7.2 , 5.4 , 6.4 , 6.5 , 7.2 , 5.4

3) The number of hospitals for the five largest hospital systems is shown here. Find

mean, median and mode.

340, 75, 123, 259, 151
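Example 1 can be checked with Python's standard `statistics` module. This is a quick sketch and the variable names are illustrative.

```python
import statistics

burglaries = [61, 11, 1, 3, 2, 30, 18, 3, 7]

mean = statistics.mean(burglaries)        # sum of X divided by n = 136 / 9
median = statistics.median(burglaries)    # middle value of the ordered data
modes = statistics.multimode(burglaries)  # list of the most frequent value(s)

print(round(mean, 2), median, modes)
```

With n = 9 the median sits at location T = (9 + 1)/2 = 5 in the ordered data 1, 2, 3, 3, 7, 11, 18, 30, 61. `multimode` returns a list because a data set may have more than one mode.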


Mean, Median and Mode for Ungrouped Frequency Distribution

Mean, $\bar{X} = \dfrac{\sum fX}{\sum f}$

Median:

- find the cumulative frequency

- Location of median = $\dfrac{\sum f}{2}$

Mode: the value with the largest frequency

Example:

4) A survey taken in a restaurant. This ungrouped frequency distribution of the

number of cups of coffee consumed with each meal was obtained. Find mean,

median and mode.

Number of cups Frequency

0 5

1 8

2 10

3 2

4 3

5 2
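The coffee example can be worked through directly from the frequency table. The sketch below takes the median as the first value whose cumulative frequency reaches the median location; this matches the data here, but when the location falls exactly on a class's cumulative frequency the two neighbouring values would have to be averaged instead.

```python
cups = [0, 1, 2, 3, 4, 5]
freq = [5, 8, 10, 2, 3, 2]

total = sum(freq)                                     # sum of f
mean = sum(f * x for x, f in zip(cups, freq)) / total # sum of fX over sum of f

location = total / 2                                  # location of the median
cumulative = 0
median = None
for x, f in zip(cups, freq):
    cumulative += f
    if cumulative >= location:   # first value whose cumulative frequency reaches the location
        median = x
        break

mode = cups[freq.index(max(freq))]                    # value with the largest frequency

print(round(mean, 2), median, mode)
```

The cumulative frequencies are 5, 13, 23, so location 15 falls in the row for 2 cups.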

Mean, Median and Mode for Grouped Frequency Distribution

Mean, $\bar{X} = \dfrac{\sum f X_m}{\sum f}$

where $X_m$ = class midpoint

(Students must show the working, i.e. find the midpoint $X_m$ and $f X_m$.)

Median:

- find the cumulative frequency

- find the location of the median class, $\dfrac{\sum f}{2}$

- Median $= L + \left(\dfrac{\frac{\sum f}{2} - F}{f}\right) c$

where L = lower boundary of the median class

F = cumulative frequency up to the point L (before the median class)

f = frequency of the median class

c = class width of the median class


Mode:

- find the location of the modal class: the class with the largest frequency

- Mode $= L + \left(\dfrac{a}{a+b}\right) c$

where L = lower boundary of the modal class

a = difference between the frequencies of the modal class and the class before it

b = difference between the frequencies of the modal class and the class after it

c = class width of the modal class

Example:

5) These numbers of books were read by each of the 28 students in a literature class.

Find mean, median and mode.

Number of books    Frequency
0 – 2              2
3 – 5              6
6 – 8              12
9 – 11             5
12 – 14            3

6) Eighty randomly selected light bulbs were tested to determine their lifetimes (in

hours). This frequency distribution was obtained. Find mean, median and mode.

Class Boundaries    Frequency
52.5 – 63.5         6
63.5 – 74.5         12
74.5 – 85.5         25
85.5 – 96.5         18
96.5 – 107.5        14
107.5 – 118.5       5
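As a sketch of the grouped-data formulas, example 5 (number of books read) can be worked in Python. The variable names mirror the symbols L, F, f, c, a and b in the formulas; the code assumes whole-number class limits, so each lower boundary is the lower limit minus 0.5.

```python
limits = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]
freq = [2, 6, 12, 5, 3]

total = sum(freq)                                   # sum of f = 28
c = limits[1][0] - limits[0][0]                     # class width = 3
midpoints = [(lo + hi) / 2 for lo, hi in limits]    # X_m

# Mean = sum(f * X_m) / sum(f)
mean = sum(f * m for f, m in zip(freq, midpoints)) / total

# Median = L + ((sum(f)/2 - F) / f) * c
half = total / 2
F = 0                                               # cumulative frequency before the class
for (lo, hi), f in zip(limits, freq):
    if F + f >= half:                               # this is the median class
        median = (lo - 0.5) + (half - F) / f * c
        break
    F += f

# Mode = L + (a / (a + b)) * c
i = freq.index(max(freq))                           # modal class
a = freq[i] - (freq[i - 1] if i > 0 else 0)
b = freq[i] - (freq[i + 1] if i < len(freq) - 1 else 0)
mode = (limits[i][0] - 0.5) + a / (a + b) * c

print(round(mean, 2), round(median, 2), round(mode, 2))
```

Both the median class and the modal class are 6 – 8 here, with lower boundary L = 5.5.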


3.3.2 MEASURES OF VARIATION

Variance and Standard Deviation (the spread of a data set)

Group A    Group B
80         55
81         88
82         100

Group A: $\bar{X}$ = 81, variance $s^2$ = 1

Group B: $\bar{X}$ = 81, variance $s^2$ = 543

[Figure: number lines showing the values of Group A (80, 81, 82) and Group B (55, 88, 100).]

Even though the average for both groups is the same, the spread or variation of the data in Group B is larger than in Group A.

(Syllabus focus on sample formula)

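Variance

Population variance, $\sigma^2 = \dfrac{\sum (X - \mu)^2}{N}$

Sample variance, $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Standard deviation

Population standard deviation, $\sigma = \sqrt{\dfrac{\sum (X - \mu)^2}{N}} = \sqrt{\sigma^2}$

Sample standard deviation, $s = \sqrt{s^2}$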


Sample variance and standard deviation

For Ungrouped Data

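Variance, $s^2 = \dfrac{\sum (X - \bar{X})^2}{n - 1}$

Standard deviation, $s = \sqrt{s^2} = \sqrt{\dfrac{\sum (X - \bar{X})^2}{n - 1}}$

where X = individual value

$\bar{X}$ = sample mean

n = sample size

OR

Variance, $s^2 = \dfrac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}$

Standard deviation, $s = \sqrt{s^2} = \sqrt{\dfrac{\sum X^2 - \dfrac{(\sum X)^2}{n}}{n - 1}}$

(Note: $\sum X^2$ is not the same as $(\sum X)^2$.)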

Example:

1) The normal daily temperatures (in degrees Fahrenheit) in January for 10 selected

cities are as follows. Find the variance and standard deviation.

50 37 29 54 30 61 47 38 34 61

2) Twelve students were given an arithmetic test and the times (in minutes) to

complete it were

10 9 12 11 8 15 9 7 8 6 12 10

Find the variance and standard deviation.
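For example 1 (the January temperatures), both versions of the sample variance formula can be checked against each other. This is a small sketch with illustrative variable names.

```python
import math

temps = [50, 37, 29, 54, 30, 61, 47, 38, 34, 61]
n = len(temps)
mean = sum(temps) / n                      # sample mean = 441 / 10

# Definition: sum of squared deviations from the mean over (n - 1)
var_definition = sum((x - mean) ** 2 for x in temps) / (n - 1)

# Shortcut: (sum of X^2 - (sum of X)^2 / n) over (n - 1)
sum_x = sum(temps)                         # sum of X
sum_x2 = sum(x * x for x in temps)         # sum of X^2
var_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

sd = math.sqrt(var_definition)             # standard deviation

print(round(var_definition, 2), round(var_shortcut, 2), round(sd, 2))
```

Here the sum of X is 441 and the sum of X squared is 20777, which shows why the two quantities in the note above must not be confused: 20777 is very different from 441 squared.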


For Grouped Data

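Variance, $s^2 = \dfrac{\sum f X_m^2 - \dfrac{(\sum f X_m)^2}{\sum f}}{\sum f - 1}$

Standard deviation, $s = \sqrt{s^2} = \sqrt{\dfrac{\sum f X_m^2 - \dfrac{(\sum f X_m)^2}{\sum f}}{\sum f - 1}}$

(Students must show the working, i.e. find $f X_m$ and $f X_m^2$.)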

Example:

3) In a class of 29 students, this distribution of quiz scores was recorded. Find variance

and standard deviation.

Grade      Frequency
0 – 2      1
3 – 5      3
6 – 8      5
9 – 11     14
12 – 14    6

4) Eighty randomly selected light bulbs were tested to determine their lifetimes (in

hours). This frequency distribution was obtained. Find variance and standard

deviation.

Class Boundaries    Frequency
52.5 – 63.5         6
63.5 – 74.5         12
74.5 – 85.5         25
85.5 – 96.5         18
96.5 – 107.5        14
107.5 – 118.5       5

5) These data represent the scores (in words per minute) of 25 typists on a speed test.

Find variance and standard deviation.

Class limit    Frequency
54 – 58        2
59 – 63        5
64 – 68        8
69 – 73        0
74 – 78        4
79 – 83        5
84 – 88        1
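The grouped-data variance formula can be sketched on example 3 (the quiz scores). The working columns $f X_m$ and $f X_m^2$ are computed explicitly, as the notes require; the variable names are illustrative.

```python
import math

limits = [(0, 2), (3, 5), (6, 8), (9, 11), (12, 14)]
freq = [1, 3, 5, 14, 6]

midpoints = [(lo + hi) / 2 for lo, hi in limits]          # X_m
f_xm = [f * m for f, m in zip(freq, midpoints)]           # f * X_m
f_xm2 = [f * m * m for f, m in zip(freq, midpoints)]      # f * X_m^2

n = sum(freq)                 # sum of f
sum_fx = sum(f_xm)            # sum of f * X_m
sum_fx2 = sum(f_xm2)          # sum of f * X_m^2

variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
sd = math.sqrt(variance)

print(n, sum_fx, sum_fx2, round(variance, 2), round(sd, 2))
```

Because the midpoints stand in for the actual scores, the result is an approximation of the variance of the raw data.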


3.3.3 MEASURES OF POSITION

Standard scores, percentiles, deciles and quartiles are used to locate the relative position

of the data value in the data set.

Standard score / z-score

The z-score represents the number of standard deviations that a data value lies above or below the mean.

$z = \dfrac{X - \bar{X}}{s}$

If the z-score is positive, the score is above the mean.

If the z-score is negative, the score is below the mean.

Example:

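1) Let the data set be 65, 70, 75, 80, 85, with $\bar{X}$ = 75 and s = 5.

Data value    Position          z-score
65            $\bar{X} - 2s$    z = -2
70            $\bar{X} - s$     z = -1
75            $\bar{X}$         z = 0
80            $\bar{X} + s$     z = 1
85            $\bar{X} + 2s$    z = 2

For data value 83: $z = \dfrac{83 - 75}{5} = 1.6$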

2) Test marks for two classes are shown here. A student scored 65 in math and 75 in biology. On which test did she perform better?

Math marks: 65 50 45 ; $\bar{X}$ = 53.3 , s = 10.4

Biology marks: 80 75 70 ; $\bar{X}$ = 75 , s = 5

$z_M = \dfrac{65 - 53.3}{10.4} \approx 1.122$ (computed with the unrounded mean, 53.33)

$z_B = \dfrac{75 - 75}{5} = 0$

Since $z_M > z_B$, her relative position in the math class is higher than her relative position in the biology class. She performed better on the math paper than on the biology paper.

(Her biology mark is higher than her mathematics mark, but we cannot compare the marks directly because the papers are different, i.e. in the number of questions, the standard of the questions and so on. That is why we have to compare the relative positions.)
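The comparison can be reproduced with a short Python sketch. The helper `z_score` is a hypothetical name, and it recomputes the mean and sample standard deviation from the marks instead of using the rounded values, so the math z-score comes out at about 1.12.

```python
import statistics

def z_score(x, data):
    """Number of sample standard deviations that x lies above or below the mean."""
    return (x - statistics.mean(data)) / statistics.stdev(data)

math_marks = [65, 50, 45]
biology_marks = [80, 75, 70]

z_math = z_score(65, math_marks)       # her math mark
z_bio = z_score(75, biology_marks)     # her biology mark

better = "math" if z_math > z_bio else "biology"
print(round(z_math, 2), round(z_bio, 2), better)
```

`statistics.stdev` uses the sample formula with n - 1 in the denominator, which is the formula the syllabus focuses on.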


Quartiles, deciles and percentile

For Ungrouped data

Quartiles: divide the distribution into four groups, separated by Q1, Q2, Q3.

Smallest data | 25% | Q1 | 25% | Q2 (Median) | 25% | Q3 | 25% | Largest data

- Arrange the data in order.

- Find the location of the quartile, $c = \dfrac{n \cdot q}{4}$

where n = total number of values

q = quartile

i) If c is not a whole number, round up to the next whole number.

ii) If c is a whole number, take the average of the cth and (c+1)th values.

Example:

1) The weights in pounds in the data set. Find Q1 , Q2 , Q3.

16 18 22 19 3 21 17 20

2) The test score in the data set. Find Q1 , Q2 , Q3.

42 35 28 12 47 50 49
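The location rule can be sketched as one small Python function. The name `position_value` and its `parts` parameter are assumptions made for this example: `parts=4` gives quartiles, and the same rule with 10 or 100 in the denominator would give deciles or percentiles. The sketch assumes 1 ≤ k < parts.

```python
def position_value(data, k, parts):
    """Value at location c = n * k / parts, using the round-up / average rule."""
    ordered = sorted(data)                 # arrange the data in order
    n = len(ordered)
    c, remainder = divmod(n * k, parts)    # whole part of c and what is left over
    if remainder:                          # c is not whole: round up to position c + 1
        return ordered[c]                  # 0-based index c is the (c + 1)th value
    # c is whole: average of the cth and (c + 1)th values
    return (ordered[c - 1] + ordered[c]) / 2

weights = [16, 18, 22, 19, 3, 21, 17, 20]
scores = [42, 35, 28, 12, 47, 50, 49]

weight_quartiles = [position_value(weights, q, 4) for q in (1, 2, 3)]
score_quartiles = [position_value(scores, q, 4) for q in (1, 2, 3)]
print(weight_quartiles, score_quartiles)
```

For the weights, n = 8, so c is 2, 4 and 6 (all whole numbers) and each quartile is an average of two neighbouring values. For the test scores, n = 7, so c is 1.75, 3.5 and 5.25 and each is rounded up.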

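Deciles: divide the distribution into 10 groups, separated by D1, D2, ..., D9.

Smallest data | D1 | D2 | D3 | D4 | D5 (Median) | D6 | D7 | D8 | D9 | Largest data (10% of the data in each group)

- Find the location of the decile, $c = \dfrac{n \cdot d}{10}$

where n = total number of values

d = decile

i) If c is not a whole number, round up to the next whole number.

ii) If c is a whole number, take the average of the cth and (c+1)th values.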


Example:

1) (from previous example) Find D5.

16 18 22 19 3 21 17 20

2) (from previous example)Find D7.

42 35 28 12 47 50 49

Percentiles: divide the distribution into 100 equal groups, separated by P1, P2, ..., P99.

Smallest data | P1 | P2 | P3 | ... | P97 | P98 | P99 | Largest data (1% of the data in each group)

D1, D2, D3, …, D9 correspond to P10, P20, P30, …, P90

Q1, Q2, Q3 correspond to P25, P50, P75

Median = Q2 = D5 = P50

- Find the location of the percentile, $c = \dfrac{n \cdot p}{100}$

where n = total number of values

p = percentile

i) If c is not a whole number, round up to the next whole number.

ii) If c is a whole number, take the average of the cth and (c+1)th values.

Example:

1) (from previous example) Find P33.

16 18 22 19 3 21 17 20

2) (from previous example)Find P60.

42 35 28 12 47 50 49

Finding the percentile corresponding to a given value, X

$\text{Percentile} = \dfrac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \times 100\%$

Example of data set: 1 1 3 4 5

Find the percentile for 4.

$\text{Percentile} = \dfrac{3 + 0.5}{5} \times 100\% = 70\%$

P70 = 4

(Round off the answer.)


Example:

2) (from previous example) Find the percentile rank for each test score in the data set.

42 35 28 12 47 50 49

(Data value 47 corresponds to P64, but previously, when we found P60, the data value was also 47. So P60 is actually close to P64, which is data value 47.)
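The percentile rank formula can be sketched as a small Python function. The name `percentile_rank` is hypothetical, and the sketch rounds halves upward, which is one reading of "round off".

```python
import math

def percentile_rank(value, data):
    """(number of values below X + 0.5) / total number of values * 100, rounded off."""
    below = sum(x < value for x in data)
    percentile = (below + 0.5) / len(data) * 100
    return math.floor(percentile + 0.5)    # round half up

small_set = [1, 1, 3, 4, 5]
scores = [42, 35, 28, 12, 47, 50, 49]

rank_of_4 = percentile_rank(4, small_set)
ranks = {x: percentile_rank(x, scores) for x in sorted(scores)}
print(rank_of_4, ranks)
```

For the test scores, four values lie below 47, so its rank is (4 + 0.5)/7 × 100%, which rounds to 64.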

For Grouped Data

METHOD 1: (USE PERCENTILE GRAPH)

x-axis: class boundaries

y-axis: relative cumulative frequency (percentage)

Cumulative relative frequency (%) = $\dfrac{\text{cumulative frequency}}{\sum f} \times 100\%$

Graph:

[Figure: three ways to read P25 from a graph. (i) Percentile graph: relative cumulative frequency (%) up to 100, read at 25. (ii) Ogive using relative frequency: relative cumulative frequency up to 1.0, read at 0.25. (iii) Ogive: cumulative frequency up to 75, read at 25% x 75 = 18.75.]


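METHOD 2: (USE FORMULA)

$P_n = L + \left(\dfrac{\dfrac{n}{100}\sum f - F}{f}\right) c$

where L, F, f and c are defined as in the grouped median formula, but for the class that contains $P_n$.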

Example:

This distribution represents the data for weights of fifth-grade boys.

Weights (pounds)    Frequency
52.5 – 55.5         9
55.5 – 58.5         12
58.5 – 61.5         17
61.5 – 64.5         22
64.5 – 67.5         15

1) Find the approximate weights corresponding to each percentile given by

constructing a percentile graph.

(i) Q1 (ii) D8 (iii) Median (iv) P95

2) Find the approximate percentile ranks of the following weights.

(i) 57 pounds (ii) 64 pounds (iii) 62 pounds (iv) 59 pounds

3) Find P63 by using the formula.
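The formula method can be sketched in Python for the weights example. The function name `grouped_percentile` is hypothetical, and the symbols L, F, f and c are taken to mean the lower boundary, the cumulative frequency before, the frequency and the width of the class containing the percentile.

```python
boundaries = [(52.5, 55.5), (55.5, 58.5), (58.5, 61.5), (61.5, 64.5), (64.5, 67.5)]
freq = [9, 12, 17, 22, 15]

def grouped_percentile(n, boundaries, freq):
    """P_n = L + ((n/100 * sum(f) - F) / f) * c for grouped data."""
    total = sum(freq)
    target = n / 100 * total          # cumulative position of the percentile
    F = 0                             # cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freq):
        if F + f >= target:           # the percentile falls in this class
            return lower + (target - F) / f * (upper - lower)
        F += f
    raise ValueError("n must be between 0 and 100")

p63 = grouped_percentile(63, boundaries, freq)
q1 = grouped_percentile(25, boundaries, freq)
print(round(p63, 2), round(q1, 2))
```

For P63 the target position is 0.63 × 75 = 47.25, which falls in the class 61.5 – 64.5 (F = 38, f = 22, c = 3). Values read from a percentile graph should be close to these results.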


EXERCISE CHAPTER 3

1. What type of sampling is being employed if a country is divided into economic classes

and a sample is chosen from each class to be surveyed?

2. Given a set of data 5, 2, 8, 14, 10, 5, 7, 10, m, n where $\bar{X}$ = 7 and mode = 5, find the possible values of m and n. (ans: m=5, n=4 or m=4, n=5)

3. Find the value that corresponds to the 30th percentile of the following data set:

78 82 86 88 92 97 (ans: P30 =82)

4. Given the variance of the set of 8 data x1 , x2, x3, … , x8 is 5.67. If $\sum X^2 = 944.96$, find the mean of the data. (ans: 11.09)

5. Find Q3 for the given data set : 18,22,50,15,13,6,5,12 (ans: 20)

6. The number of credits in business courses that eight applicants took is 9, 12, 15, 27,

33, p, 63, 72. Given the value that corresponds to the 75th percentile is 54, find p.

(ans: 45)

7. The mean of 5, 10, 26, 30, 45, 32, x, y is 25 where x and y are constants. If x = 16, find

the median. (ans: 28)

8. A physician is interested in studying scheduling procedures. She questions 40 patients concerning the length of time in minutes that they wait past their scheduled appointment time. The following data are obtained:

60 29 34 25 31 30 6 17 6 50

10 18 38 25 35 36 31 23 12 52

8 27 27 30 42 9 47 31 27 6

45 33 25 37 3 50 53 28 16 19

a) Construct a frequency distribution by using 7 classes (use 3 as lower limit of the first

class)

b) Find the mean, mode and standard deviation. (ans: 28.15 , 31.3 , 14.63)

c) Draw an ogive by using relative frequency and estimate the median from the graph.
