sampling - what, why and how

Sampling

Date: 12/02/2013

Author: K. S. Alok Ranjan

About: Meaning, Types and Formulas of Sampling in Statistics.

http://www.sevensolutions.in/

https://www.facebook.com/success7s

https://twitter.com/

http://www.slideshare.net/AlokRanjan9

http://in.linkedin.com/in/ksalokranjan


[email protected] | www.sevensolutions.in | +91 9810 77 5457

I. Sampling – Meaning and Need

Sampling, in simple terms, means choosing a few individual entities from a complete group of entities in order to assess the characteristics or qualities of the complete group. For example:

a. Choosing some individuals from a city for a poll on whether the city as a whole will vote or not.

b. Choosing some iron rods from a manufacturing plant to assess whether the complete production meets a certain quality standard.

c. Choosing some patients with a particular disease to find out whether every patient suffering from that disease has a particular symptom, or how they will react to a particular medicine.

Sampling is a statistical survey methodology for selecting a subset of individuals from a Population in order to estimate characteristics of the complete Population.

In the above examples:

Populations are all the people in the city, the complete production of the manufacturing plant, and all the people suffering from the disease.

Subsets are some individuals in the city, some iron rods from the plant, and some patients suffering from the disease.

Characteristics to find are voting likelihood, conformance to the quality standard, and disease symptoms or reaction to the medicine.

II. Stages in Sampling

1. Defining Population: Clearly define who makes up the complete Population. This reduces the possibility of bias and helps ensure that the Sample taken is correct. For e.g. all persons suffering from the disease in the above example.

2. Deciding Sampling Frame: A Sampling Frame is a list of the elements of the Population from which Samples can be drawn. For e.g. the contact information of individuals in the poll example above.

3. Deciding Sampling Method: There are different Sampling Methods, which we will study below.

4. Determining the Sample Size: The size of the Sample can be determined statistically using formulas that we will also study below.

5. Planning the Sampling Implementation: Devise a strategy for collecting the Samples.

6. Collecting the Data: Collect data from the Samples on the characteristics decided earlier.

III. Population, Subpopulation, Frame and Sample

A Population in Statistics is a set of entities (an entity is an identifiable individual that can be studied on its own) that share some common measurable characteristics. Generally they are found in a group. For e.g. all students in Delhi who are in any DU college.

A Subpopulation is a subset of the Population that inherits the characteristics of the Population and also has some unique characteristics that are not present in the other Subpopulations of the Population. For e.g. each college under DU is a Subpopulation, or all males and all females are two Subpopulations.

A Frame is a mechanism that helps in picking a Sample from a Population. There has to be an instrument that helps in contacting the sampled entities and including them in the survey. This instrument can be a telephone directory, a University magazine, a patient list etc. Simply put, a Frame is a list of the Population (preferably the complete Population) that also provides a means of picking Samples. For e.g. the Enrolment forms of an academic year.

A Sample is a subset of the Population, chosen using the Frame, that is studied for certain characteristics which can later be generalized to the Population. For e.g. a few of the students from any DU college, selected from their Enrolment forms.

IV. Types of Sampling

Types of Sampling are the two broad categories of Sampling, distinguished by whether every Unit's chance of being selected is known.

1. Probability

Probability Sampling means every Unit in the Population has a valid chance, whether larger or smaller, of being selected in the Sample, and this chance can be statistically measured.

For e.g. if each home and hospital in a city is searched to identify patients of a particular type, and a patient is then selected at random from the city, each patient has a valid chance of getting selected. The chance may be higher, if the person is the only patient in a home, or lower, if there is more than one such patient in a hospital. But the valid chance, or probability in statistical language, exists for each patient. This is Probability Sampling.

In case the Probability is equal for each Unit in the Population, it is called EPS, Equal Probability of Selection. An example is searching for a patient with a particular disease in only one hospital.

2. Non Probability

Non Probability Sampling methods are the ones in which some Units of the Population either have no chance of being selected, or have a chance that cannot be known beforehand.

Non Probability Sampling happens when assumptions are used to sample from a Population. For this reason, sampling errors cannot be determined. It also gives rise to exclusion bias, meaning the Population might not be properly estimated from the Sample.

An example is visiting only the hospitals in a city to find patients with a particular disease, and not visiting the homes in the city.

V. Calculating Sample Size

Let’s calculate how many calls should be sampled for quality checks in a Customer Service Process which handles 50,000 calls a month.

$$\text{Corrected SS} = \frac{SS}{1 + \dfrac{SS - 1}{Pop}}$$

Where:

SS = Sample Size to be calculated

$$SS = \frac{Z^2 \cdot p \cdot (1 - p)}{C^2}$$

Pop = Population

p = Percentage of the Population that you expect will satisfy (or not satisfy) the criterion for which you are sampling.

For e.g. 30% of the Population meets quality standards and 70% does not.

This is expressed as a decimal and is generally taken as 50%, or 0.5. Any percentage greater or smaller than this reduces the Sample Size; 50% (0.5) gives the largest Sample Size, because p(1 - p) is greatest at p = 0.5.



Z = Z value for the chosen Confidence Level (if you manually checked the complete Population for the criterion 100 times, like the quality check in our example, how many times would the p you took above be correct: 90, 95 or 99 times?)

Generally taken at 95%. In the formula, use -

1.645 if you are 90% sure

1.960 if you are 95% sure

2.576 if you are 99% sure



C = Confidence Interval (the error margin allowed between what is observed in the Sample and what would be observed in the Population)

This is the margin of error allowed in the Sample, compared with (hypothetically) doing the quality check for the complete Population.

Expressed as a decimal: for a 3% error, use 0.03.

Example:

Population of 50,000 (that is, 50,000 calls in a month)

p = 50% or 0.5 (because I think half of my Population will fail the quality check and half will pass, and this way I get the largest Sample)

Z = 1.960 (because I am confident that if I check the quality of the whole Population 100 times, p above will be correct 95 times)

C = 0.03 (because I want to allow 3% of sampling error, that is, p may vary from 47% to 53% but no further)

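So,

$$SS = \frac{1.960^2 \times 0.5 \times (1 - 0.5)}{0.03^2} = \frac{3.8416 \times 0.25}{0.0009} = \frac{0.9604}{0.0009} \approx 1067.11$$

And, now, correcting for the Population of 50,000,

$$\text{Corrected SS} = \frac{1067.11}{1 + \dfrac{1067.11 - 1}{50000}} = \frac{1067.11}{1.02132} \approx 1044.83$$

which rounds up to 1045.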

So, the Sample Size for 50,000 calls in a month should be 1045.
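The calculation above can be sketched in Python. This is a minimal sketch, assuming the two-step formula used in this section (Sample Size for a large Population, then a correction for the actual Population size); the function name `sample_size` and its parameter names are illustrative only and do not come from any library.

```python
import math


def sample_size(population, z=1.960, p=0.5, c=0.03):
    """Sample Size for a proportion, corrected for a finite Population.

    population: number of entities in the Population (Pop)
    z: Z value for the chosen Confidence Level (1.960 for 95%)
    p: expected proportion, as a decimal
    c: Confidence Interval (margin of error), as a decimal
    """
    # Sample Size before taking the Population size into account.
    ss = (z ** 2) * p * (1 - p) / (c ** 2)
    # Correction for the actual (finite) Population size.
    corrected = ss / (1 + (ss - 1) / population)
    # Round up, since a fraction of a call cannot be sampled.
    return math.ceil(corrected)


print(sample_size(50_000))  # 1045
```

Calling `sample_size(50_000, p=0.3)` returns a smaller number, which illustrates why p = 0.5 is the conservative choice.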

VI. Sampling Methods

Methods of Sampling are the different ways of selecting the Sample from the Population. The list generally differs from one Statistician to another or from one Six Sigma expert to another, because interpretation may merge two Methods into one or split one Method into two.

Since the Methods are situational and must be chosen strictly according to the kind of Population you are handling and the kind of analysis you are looking at, I have listed here an almost exhaustive list of Sampling Methods that you may choose from.

1. Simple Random Sampling:

Simple Random Sampling is a Probability Sampling. It means choosing a Sample (a subset of individuals) from a Population (the larger set). Each individual is chosen randomly and entirely by chance, such that each individual has the same probability of being chosen at any stage during the sampling process.

Simply put, once the Sample Size is calculated, the Samples have to be chosen from the Population in such a way that each entity in the Population has an equal chance of being chosen.

For e.g. if the Enrolment forms are kept in a box and the number of forms specified by the Sample Size is drawn at random.
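Drawing forms from a box can be sketched in Python with the standard library. This is only an illustration: the function name `simple_random_sample`, the list of 300 numbered forms and the Sample Size of 12 are assumptions made for the example.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size entities from frame without replacement.

    Every entity in the frame is equally likely to be chosen.
    """
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(list(frame), sample_size)


# Hypothetical Sampling Frame: 300 numbered Enrolment forms.
forms = [f"form-{i}" for i in range(1, 301)]
chosen = simple_random_sample(forms, 12, seed=7)
print(chosen)
```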

Advantage:

i. Minimizes bias and simplifies analysis.

ii. The Variance observed in the Sample is a good estimate of the Variance of the Population.

Disadvantage:

i. Might not reflect the makeup of the Population, like number of boys and girls in all DU colleges.

ii. Tiresome and clumsy in case of a large Population.

2. Systematic Sampling:

Systematic Sampling is a Probability Sampling. In this, once the Sample Size is determined, a fixed interval is calculated and Samples are chosen systematically at that interval. The procedure is as below:

Divide the Population size by the Sample Size to arrive at k. Start from a randomly chosen entity between 1 and k. Choose every kth entity from the Population, counting from that starting entity. If the end of the Population is reached, wrap around to the beginning of the Population cyclically. Continue choosing until the Sample Size is reached.

For e.g. from a Population of 300, if the Sample Size is 12, choose every (300/12)th = 25th entity, starting from any random number between 1 and 25. Choose each successive 25th entity from the starting entity until the Sample Size is reached. However, the Population size will very rarely be evenly divisible by the Sample Size. For e.g. if the Sample Size is 9, then 300/9 = 33.33. In these cases, choose a starting point between 1 and 33.33, keep adding 33.33, and round each result up to the next whole number. For e.g. if you start from 23.6, then select 24, 57, 91… and so on.
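The procedure above, including the fractional interval case, can be sketched as follows. This is a minimal sketch; the function name `systematic_sample` and its parameters are illustrative, and the random start is drawn so that it is never zero.

```python
import math
import random


def systematic_sample(population_size, sample_size, start=None):
    """Return the 1-based positions chosen by Systematic Sampling.

    start: a point greater than 0 and at most k; drawn at random when omitted.
    """
    k = population_size / sample_size  # sampling interval, may be fractional
    if start is None:
        # 1 - random() lies in (0, 1], so start lies in (0, k].
        start = k * (1 - random.random())
    # Step through the Population k at a time and round each point up.
    # min() guards against floating point overshoot past the last entity.
    return [
        min(population_size, math.ceil(start + i * k))
        for i in range(sample_size)
    ]


print(systematic_sample(300, 12, start=5))        # every 25th entity from 5
print(systematic_sample(300, 9, start=23.6)[:3])  # [24, 57, 91]
```

Because the start is at most k, the last chosen position never passes the end of the Population, so the cyclic wrap-around is not needed in this sketch.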



Advantage:

i. Efficient for Databases.

ii. Very efficient for Data with gradual trend and slope.

Disadvantage:

i. Data with periodicity will be heavily biased. If the Sampling Frame lists boys' and girls' names alternately, Systematic Sampling with an even k will choose either all boys or all girls.

ii. Variations between neighbouring entities are never captured.

3. Stratified Sampling:

A Population may have different Subpopulations that are independent homogeneous groups which contribute to the characteristics of the Population but have a unique set of characteristics of their own. The Subpopulations are homogeneous internally but heterogeneous with respect to each other.

In Stratified Sampling, the Population is divided into Strata (Subpopulations) according to what makes each Stratum unique. Care must be taken that no entity is in more than one Stratum and that no entity of the Population is left out. Then Simple Random Sampling or Systematic Sampling is applied within each Stratum.

For e.g. if all the students in all colleges of DU are the Population, the Strata can be each college, or each academic area across colleges (Science, Commerce), or the geographical origin of students (North India, East India).

While doing Stratified Sampling, care must be taken that the proportion of each Stratum in the Population is reflected in the Sample. For e.g. if the Population is 30% males and 70% females and the Strata are males and females, then the Sample should have 3 males for every 7 females. Also, if a Subpopulation has a larger Standard Deviation, a larger Sample should be taken from it than from a Subpopulation with a smaller Standard Deviation.

As a rule of thumb, a Population should not be divided into more than six Strata.
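Proportional allocation across Strata, followed by Simple Random Sampling within each Stratum, can be sketched as below. The function names, the rounding rule (leftover units go to the Strata with the largest fractional share) and the 300/700 Population are assumptions made for illustration; the sketch does not cover allocation by Standard Deviation.

```python
import math
import random


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across Strata in proportion to their sizes."""
    total = sum(strata_sizes.values())
    exact = {name: size * sample_size / total
             for name, size in strata_sizes.items()}
    alloc = {name: math.floor(share) for name, share in exact.items()}
    # Hand out units lost to rounding, largest fractional part first.
    leftover = sample_size - sum(alloc.values())
    by_fraction = sorted(exact, key=lambda n: exact[n] - alloc[n], reverse=True)
    for name in by_fraction[:leftover]:
        alloc[name] += 1
    return alloc


def stratified_sample(strata, sample_size, seed=None):
    """Simple Random Sampling inside each Stratum, proportionally allocated."""
    rng = random.Random(seed)
    sizes = {name: len(members) for name, members in strata.items()}
    alloc = proportional_allocation(sizes, sample_size)
    return {name: rng.sample(strata[name], alloc[name]) for name in strata}


# Hypothetical Population: 30% males and 70% females.
strata = {
    "male": [f"M{i}" for i in range(300)],
    "female": [f"F{i}" for i in range(700)],
}
sample = stratified_sample(strata, 10, seed=1)
print({name: len(chosen) for name, chosen in sample.items()})
# {'male': 3, 'female': 7}
```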

Advantage:

i. Sample represents the Population better, Sampling Error reduces. Subpopulations with more

importance can be weighted more.

ii. Different Sampling Methods can be exercised for different Subpopulations.

iii. Sampling from a Population over a wide geographical area is more accurate.

Disadvantage:

i. Cannot be applied to a large Population where Subgroups may not be distinctly disjoint, or where entities have characteristics that could make them part of more than one Subpopulation.

ii. Scope of Sampling error increases with the number of Subpopulations in a Population.

iii. Can be expensive to implement.

4. Probability Proportional to Size Sampling:

If there is more than one Subpopulation, each with a different number of entities, PPS Sampling ensures that the number of Samples selected from a Subpopulation increases or decreases in Proportion to the size of that Subpopulation.

In this case, the Subpopulations are sorted in increasing order of size and each one is given a range of numbers (range for Subp1 = 1 to the number of entities in Subp1; range for Subp2 = end of Subp1's range + 1 to end of Subp1's range + the number of entities in Subp2; and so on). Then k (as in Systematic Sampling) is calculated (k = Population/Sample). Then every kth number is chosen, and each chosen number is mapped to the Subpopulation whose range contains it.

For e.g. in a Population of all students in all colleges of DU, each college is a Subpopulation, and the number of entities (students) varies considerably between colleges. Suppose we have a Sample Size of 25 to select from 3100 students in 5 DU colleges:

DU = Population = 3100

Subpopulation = 5 colleges

Sample Size derived = 25

Number Calculation for each Subpopulation

First sort and list the Subpopulation in increasing order.

Colleges         College 1   College 2   College 3   College 4   College 5
Subpopulation    340         510         620         750         880
Numbers from     1           341         851         1471        2221
Numbers to       340         850         1470        2220        3100
Number calc      340         340+510     850+620     1470+750    2220+880

(In the Number calc row, the first entry is the number of entities in the least populated Subpopulation.)

Calculation of k = Population/Sample = 3100/25 = 124

Randomly select the first Sample between 1 and 124, say 113, then 113 + 124 = 237, 237 + 124 = 361… and we get the Table below, where the College for each Number is derived from the number ranges of the colleges:

Sample Number College

1 113 College 1

2 237 College 1

3 361 College 2

4 485 College 2

5 609 College 2

6 733 College 2

7 857 College 3

8 981 College 3

9 1105 College 3

10 1229 College 3

11 1353 College 3

12 1477 College 4

13 1601 College 4

14 1725 College 4

15 1849 College 4

16 1973 College 4

17 2097 College 4

18 2221 College 5

19 2345 College 5

20 2469 College 5

21 2593 College 5

22 2717 College 5

23 2841 College 5

24 2965 College 5

25 3089 College 5

The Table above means that the following number of Samples should be collected from each Subpopulation (College):

Subpopulation Sample Size

College 1 2

College 2 4

College 3 5

College 4 6

College 5 8

DU 25

Which sums to 25, the Sample Size. As you can see, larger Samples result from Subpopulations with a larger number of entities.
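The worked example above can be reproduced with a short Python sketch. It uses the college sizes, the Sample Size of 25 and the random start of 113 from the example; the variable names are illustrative only.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

# Subpopulations sorted in increasing order of size, as in the example.
colleges = {"College 1": 340, "College 2": 510, "College 3": 620,
            "College 4": 750, "College 5": 880}
sample_size = 25
start = 113  # the random start used in the example

names = list(colleges)
# Upper end of each college's number range: [340, 850, 1470, 2220, 3100].
upper_bounds = list(accumulate(colleges.values()))
k = upper_bounds[-1] // sample_size  # 3100 // 25 = 124

numbers = [start + i * k for i in range(sample_size)]
# bisect_left finds the first college whose upper bound is >= the number.
counts = Counter(names[bisect_left(upper_bounds, n)] for n in numbers)

for name in names:
    print(name, counts[name])
```

The printed counts are 2, 4, 5, 6 and 8, matching the Table of Samples per college.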



Advantage:

i. Concentrates the Sample on larger Subpopulations, increasing the representativeness of the Sample.

ii. Counters the disadvantages of Systematic and Stratified Sampling when there is Variance between Subpopulations.

Disadvantage:

i. Fails to account for negative balances while Sampling for a Business’ Finance data.

ii. Decreases precision of estimates; thus, requires larger sample size.

5. Cluster Sampling:

Cluster Sampling is a method in which the Population is divided into Clusters, taking care that each Cluster has all the characteristics that the Population as a whole has. Then one or more Clusters are taken as the Sample, leaving the rest of the Clusters untouched.

The difference between Stratified Sampling and Cluster Sampling is that in Stratified Sampling the Sample has to come from every Stratum, whereas in Cluster Sampling the Sample comes from only one or a few Clusters. The other, more basic difference is that Strata are internally homogeneous but heterogeneous with respect to each other, whereas Clusters are internally heterogeneous but homogeneous with respect to each other.

For e.g. a team made up of one student from each DU college for an inter-college competition would be a Cluster for the Population DU.
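Picking whole Clusters at random and taking every entity in them can be sketched as follows. The function name `cluster_sample` and the six hypothetical teams of five students are assumptions made for the example.

```python
import random


def cluster_sample(clusters, clusters_to_pick, seed=None):
    """Pick whole Clusters at random and take every entity in them."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), clusters_to_pick)
    # Every entity of a picked Cluster is in the Sample; other Clusters
    # are left untouched.
    sample = [entity for name in picked for entity in clusters[name]]
    return picked, sample


# Hypothetical Clusters: 6 teams of 5 students each.
clusters = {
    f"team-{t}": [f"student-{t}-{s}" for s in range(1, 6)]
    for t in range(1, 7)
}
picked, sample = cluster_sample(clusters, 2, seed=3)
print(picked, len(sample))
```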

Advantage:

i. Cheaper than other Methods.

ii. Sampling Frame for complete Population is not needed.

Disadvantage:

i. Sampling error possibility is high. Extra care needed to choose a Cluster.

ii. Requires larger Sample than SRS or Systematic Sampling for similar accuracy.

6. Multistage Sampling:

Multistage Sampling is a complex form of Cluster Sampling with multiple levels of selection. After identifying Clusters in a Population, the second stage is to randomly select Samples from each Cluster. Sometimes, when the Population is huge or not completely available, multiple stages of Cluster selection may be applied before the final Sample is collected.

For e.g. if the students in all the colleges under DU are the Population, and we need to find out about student involvement in National level Competitions, the first level Clusters would be the students (from each college under DU) participating in each competition; then from each Cluster students can be picked using either Systematic Sampling or SRS.
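A two-stage version of this can be sketched as below. It assumes, for illustration, that stage 1 picks some Clusters at random and stage 2 applies SRS inside each picked Cluster; the function name `two_stage_sample` and the eight hypothetical competitions are not from the text.

```python
import random


def two_stage_sample(clusters, clusters_to_pick, per_cluster, seed=None):
    """Stage 1: pick Clusters at random. Stage 2: SRS inside each one."""
    rng = random.Random(seed)
    picked = rng.sample(sorted(clusters), clusters_to_pick)
    return {
        # Never ask for more entities than the Cluster contains.
        name: rng.sample(clusters[name], min(per_cluster, len(clusters[name])))
        for name in picked
    }


# Hypothetical Clusters: 8 competitions with 40 participating students each.
clusters = {
    f"competition-{c}": [f"student-{c}-{s}" for s in range(40)]
    for c in range(1, 9)
}
sample = two_stage_sample(clusters, clusters_to_pick=3, per_cluster=5, seed=11)
for name, students in sample.items():
    print(name, students)
```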

Advantage:

i. Cost and speed are optimized, convenience to researchers is assured.

ii. Less Sampling error than normal Cluster Sampling for same size Sample.

Disadvantage:

i. Less accurate than SRS for same Sample Size.

ii. Not much testing and analysis can be done on Sample.



7. Multiphase sampling:

Multiphase Sampling refers to a method where some information is collected from the whole of the main Sample and the rest is collected from a subset (sub Sample) of the main Sample. It ensures that some part of the Sample provides more information than the others. Basically, the sub Samples provide more detailed information.

For e.g. if all the students in colleges under DU are the Population and we need to find out which students speak fluent Tamil and can teach Basic Stats in Tamil, a large Sample of South Indian students can be separated and asked, “Are you from Tamil Nadu?” Then the sub Sample of students who confirm that they are from Tamil Nadu can be asked if they speak Tamil and can teach Basic Stats.
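The two phases in this example can be sketched as a screening step followed by a detailed step. Everything in the sketch is hypothetical: the student records are randomly generated, and the field names `from_tamil_nadu` and `can_teach_in_tamil` are invented for the illustration.

```python
import random

rng = random.Random(5)

# Hypothetical student records; the field names are illustrative only.
students = [
    {
        "id": i,
        "from_tamil_nadu": rng.random() < 0.2,
        "can_teach_in_tamil": rng.random() < 0.5,
    }
    for i in range(1000)
]

# Phase 1: a large Sample answers one cheap screening question.
phase_one = rng.sample(students, 200)
screened = [s for s in phase_one if s["from_tamil_nadu"]]

# Phase 2: only the screened sub Sample is asked the detailed question.
phase_two = [s for s in screened if s["can_teach_in_tamil"]]

print(len(phase_one), len(screened), len(phase_two))
```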

Advantage:

i. Useful when Sampling Frame lacks auxiliary information for Stratified Sampling.

ii. Cost effective when budget is not available for collecting complete information from the whole Sample.

Disadvantage:

i. The planning and implementation are complicated.

ii. Time consuming.

8. Quota Sampling:

Quota Sampling is a Non Probability Sampling. In this method the Population is segmented into mutually exclusive Subpopulations. Then judgement, based on pre-determined criteria, is used to pick Samples from each Subpopulation.

For e.g. from the Population of all students in DU colleges, after defining each college as a Subpopulation, 20 female students with entrance exam marks between 75% and 85% are to be chosen. The Researcher may choose any 20 such females from any of the colleges, perhaps based on whose language is easiest to understand.

Advantage:

i. This method is incredibly cheap.

Disadvantage:

i. Samples have a high probability of being biased.

ii. Limits decisions.

iii. Does not allow variety in Sample.

iv. Not possible to assess Sampling error as it is not random.

9. Accidental Sampling:

Accidental Sampling is like Snowball Sampling in Social Science Research and is also called Convenience/Grab/Opportunity Sampling. It is a Non Probability Sampling and consists of collecting the Sample from the part of the Population that is easy to access or readily available.

However, the research should include safeguards to lessen the impact of the non-randomness. Also, there should be reason to believe that the Convenience Sample represents the Population.

For e.g. Sampling the students found in a particular gift shop nearest to one of the colleges under DU.



Advantage:

i. Useful for Pilot Testing of a product or service, where target user is not particular.

ii. Cost effective.

Disadvantage:

i. Sampling error possible due to non-randomness of the Sample.

ii. High probability that the Sample does not represent the Population.

10. Line Intercept Sampling:

Line Intercept Sampling is a Non Probability Sampling Method generally applied across an area where the entities to be sampled are stationary or have very low mobility, for e.g. patches in a certain habitat type, herbs and vegetation, rocks on a mountain, or slow-moving animals like cows grazing in a field.

Lines, often called Transects, are drawn through the area and any entity falling on the line of a Transect is taken as a Sample. The Transect is drawn along the diagonal if the area is a square or rectangle; more than one Transect is drawn if the area has an irregular boundary. Generally it is used in Biological studies or in Vegetation and Geographical Data collection.

Advantage:

i. Simple method of Sample collection.

ii. Useful for Populations that are not mobile and cannot participate in the selection process.

Disadvantage:

i. Since it is Non Probability Sampling, some entities have no chance of getting selected.

ii. Cannot be applied to all kinds of data collection.

11. Panel Sampling:

Panel Sampling is mostly used for Social Science Research. It consists of selecting a Sample using any Random Sampling Method and then collecting the same information from the Sample more than once over a period of time. Each round of information collection is called a Wave. It is like studying Repeatability for Gage R&R in Six Sigma.

For e.g. after selecting the Sample of students from all colleges under DU, asking the students once at the beginning of each year whether they will join the family business or go for a job, and studying the Variance in their answers.

It is a very useful Sampling Method; done carefully, it can give useful analysis of people and their changing views using MANOVA, Growth Curve etc.

Advantage:

i. Useful for People Study, Political mileage trends.

ii. Can help find out within-person health changes due to changing stress, time, prices etc.

Disadvantage:

i. Time consuming and can prove to be costly.

ii. Cannot be applied to all kinds of data collection.



12. The Judgement Sample:

A Judgement Sample is a Non Probability method in which either the Researcher or an Expert makes a Judgement on which entities are to be included in the Sample. Here the Sampling Frame and the Population are not identical, so there is scope for bias.

This is appropriate if the Population is difficult to locate, or if some part of the Population is known to have more data or knowledge, or to be more receptive to data sharing, than others.

For e.g. from the Population of all students in colleges under DU, if the Researcher chooses the ones who are involved in College Elections to ask about current Political events in the country, it would be Judgement Sampling.

Advantage:

i. Easy to determine Samples.

ii. Useful for Population with definite expertise and skills.

Disadvantage:

i. High scope of bias in Sample.

ii. Expert’s or Researcher’s reliability evaluation is necessary.

VII. End Notes

Sampling is the first step in Analysis and a very important part of the complete Analysis Process. It forms the primary step in the Measure Phase of Six Sigma. The complete chapter on Sampling above has been presented in as easy a language as possible. However, there are a few pointers listed below that need further study. You can also contact me for any clarification:

1. Systematic Sampling, weighted method

2. Systematic Sampling vs. SRS

3. Standard Deviation while Sampling from Strata

4. Post Stratification and Over Sampling in Stratified Sampling

5. MANOVA, Growth Curve etc

I will also come back with an Excel based Sample Size calculator in which you can enter data knowing what each input signifies. Until then, here is a very nice and simple calculator developed by Macorr that you can download and use.

http://www.macorr.com/sample-size-calculator.htm [Courtesy: Macorr]

VIII. References

http://www.pitt.edu/~super7/43011-44001/43911.ppt

http://en.wikipedia.org/wiki/Sampling_(statistics)

www.hivhub.ir/fa/document-center/doc_download/161-probability-sampling

http://encyclopedia2.thefreedictionary.com/multiphase+sampling

http://archa1.blogspot.in/2007/04/multiphase-sampling.html

http://www.businessdictionary.com/definition/quota-sampling.html

http://www.blurtit.com/q788493.html

http://www.jstor.org/discover/10.2307/2531331?uid=3738256&uid=2&uid=4&sid=21101797994437

http://www.math.montana.edu/~parker/PattersonStats/Lineint.pdf

http://en.wikipedia.org/wiki/Judgment_sample

http://www.htm.uoguelph.ca/MJResearch/ResearchProcess/JudgementSampling.htm