sampling - what, why and how
Sampling, what, why and how. Complete Guide to Statistical Sampling.
Sampling
Date: 12/02/2013
Author: K. S. Alok Ranjan
About: Meaning, Types and Formulas of Sampling in
Statistics.
Sampling, What, Why and How Feb-2013
2 [email protected] | www.sevensolutions.in | +91 9810 77 5457
I. Sampling – Meaning and Need
Sampling, in simple terms, refers to choosing a few individual entities from a complete group of entities for the
purpose of assessing characteristics or qualities of the complete group. For example:
a. Choosing some individuals from a city to poll whether the city as a whole will vote or not.
b. Choosing some iron rods from a manufacturing plant to assess if the complete production meets a certain
quality standard.
c. Choosing some patients of a particular disease to figure out if every patient suffering from the same
disease has a particular symptom, or what their reaction will be to a particular medicine.
Sampling is a Statistical Survey Methodology that helps to select a subset of individuals from a Population to
estimate characteristics of the complete Population.
In the above examples:
Populations are all the people of the city, the complete production of the manufacturing plant and all the people
suffering from the disease.
Subsets are some individuals in the city, some iron rods from the plant and some patients suffering from the
same disease.
Characteristics to find are voting possibility, certain quality standards and disease symptoms/medicine
reaction.
II. Stages in Sampling
1. Defining Population: Clearly define who the complete population is. This reduces the possibility of
bias and helps ensure that the sample is drawn from the right group. For e.g. all persons suffering from the disease in
the above example.
2. Deciding Sampling Frame: A Sampling Frame is a set of elements of the Population that can be used to
extract Samples. For e.g. contact information of individuals in the poll example above.
3. Deciding Sampling Method: There are different kinds of Sampling Methods, which we will study below.
4. Determining the Sample Size: The size of the sample can be statistically determined using certain
formulas, which we will also study below.
5. Planning the Sampling Implementation: Devise a strategy on how to go about collecting the samples.
6. Collecting the Data: Collect data from the samples on the characteristics decided earlier.
III. Population, Subpopulation, Frame and Sample
A Population in Statistics means a set of entities (identifiable individuals that can be studied on their own for any
purpose are entities) that are bound by some common measurable characteristics. Generally they are found in a
group. For e.g. all students in Delhi who are in any DU college.
A Subpopulation is a subset within the Population that inherits the characteristics of the Population and also
has some unique characteristics that are not present in other distinct subpopulations inside the Population.
For e.g. each college under DU is a subpopulation, or all males and all females are two subpopulations.
A Frame is a mechanism that helps in picking a Sample from a Population. Note that there has to be an instrument
that helps in contacting the Samples and including them in the survey. This instrument can be a telephone
directory, a University Magazine, a patient list etc. So, simply put, a Frame is a list of the Population (preferably
the complete Population) that also has a medium to help pick Samples. For e.g. Enrolment forms of an academic
year.
A Sample is a subset of the Population, chosen using the Frame so that it can be studied for certain
characteristics that can later be generalized to the Population. For e.g. a few of the students from any college of DU
selected from their Enrolment forms.
IV. Types of Sampling
Types of Sampling signifies the different (two in this case) categories of Sampling, based on whether each Unit's
chance of being selected is known.
1. Probability
Probability Sampling means every Unit in the Population has a valid (non-zero) chance, whether small or large, of being
selected as a sample. This chance can also be statistically measured.
For e.g. if each home and hospital in a city is searched for a particular type of patient, the
patients are identified, and one patient is then randomly selected from the city, each patient has a valid chance of getting selected.
The chance may be higher, in case the person is the only patient in a home, or lower, in case there is more than one such
patient in a hospital. But the valid chance, or probability in statistical language, remains for each patient.
This is Probability Sampling.
In case the probability is equal for each Unit in the Population, it is called EPS, Equal Probability of
Selection. An example can be searching for a patient with a particular disease in only one hospital.
2. Non Probability
Non Probability Sampling methods are the ones in which some Units of the Population either do not have any
valid chance of getting selected, or their chance cannot be known beforehand.
Non Probability Sampling happens when assumptions are used to sample from a Population. For this
reason, sampling errors cannot be determined. It further gives rise to bias due to exclusion,
meaning that the Population might not be properly estimated from the Sample.
An example can be visiting only the hospitals in a city to find patients with a particular disease, and not
visiting the homes in the city.
V. Calculating Sample Size
Let’s calculate how many calls should be sampled for quality checks in a Customer Service Process
which handles 50,000 calls a month.
$$\text{Corrected SS} = \frac{SS}{1 + \frac{SS - 1}{Pop}}$$
Where:
SS = Sample Size to be calculated
$$SS = \frac{Z^2 \cdot p \cdot (1 - p)}{C^2}$$
Pop = Population
p = Per cent of the Population that you expect will satisfy (or not satisfy) the criterion for which you are
sampling.
For e.g. 30% of the population is meeting quality standards and 70% is not.
This is expressed as a decimal and generally taken as 50% or 0.5. Any per cent greater or lesser than this
would reduce the sample size, because p × (1 − p) is largest at p = 0.5. So 0.5 gives the largest, most conservative sample size.
Z = z-score for the Confidence Level (if you did a manual check of the complete population for the criterion, like the Quality check in
our example, how many times out of 100 would the p you took above hold within the margin of error: 90, 95 or 99 times?)
Generally taken at 95%. In the formula, use:
1.645 if you are 90% sure
1.960 if you are 95% sure
2.576 if you are 99% sure
C = Confidence Interval (Error Margin allowed between what may happen with the Sample and what should
happen in the population)
This is the margin of error allowed in the sample, compared with the (hypothetical) case where quality is checked for the complete population.
Expressed as a decimal, e.g. 0.03 for a 3% error.
Example:
Population of 50,000 (means, 50,000 calls in a month)
p = 50% or 0.5 (Because I think half of my population will flunk in quality and half may not, and this way I can
ensure the largest sample)
Z = 1.960 (Because I am confident that if I check the Quality of the whole population 100 times, 95 times p above will
be correct)
C = 0.03 (Because I want to allow 3% of sampling error, that is, p may vary from 47% to 53% but no more or
less)
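So,
$$SS = \frac{Z^2 \cdot p \cdot (1 - p)}{C^2} = \frac{(1.960)^2 \times 0.5 \times (1 - 0.5)}{(0.03)^2} = \frac{0.9604}{0.0009} \approx 1067.11$$
And, now,
$$\text{Corrected SS} = \frac{SS}{1 + \frac{SS - 1}{Pop}} = \frac{1067.11}{1 + \frac{1067.11 - 1}{50000}} = \frac{1067.11}{1.02132} \approx 1044.83$$
So, rounding up, the Sample size in a month for 50,000 calls should be 1045.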
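The same arithmetic can be checked with a short script. This is only a minimal sketch: the names `Z_SCORES` and `sample_size` are illustrative and not part of the original text, and the formula used is the one implied by the worked example (Z² × p × (1 − p) / C², followed by the correction for a finite Population).

```python
import math

# z-scores listed in the text for each confidence level (in per cent)
Z_SCORES = {90: 1.645, 95: 1.960, 99: 2.576}

def sample_size(pop, confidence=95, p=0.5, c=0.03):
    """Return (uncorrected SS, SS corrected for a finite population)."""
    z = Z_SCORES[confidence]
    ss = (z ** 2) * p * (1 - p) / (c ** 2)   # sample size for a very large population
    corrected = ss / (1 + (ss - 1) / pop)    # adjustment for the actual population size
    return ss, corrected

ss, corrected = sample_size(50_000)
calls_to_sample = math.ceil(corrected)       # round up to whole calls
print(calls_to_sample)                       # prints 1045
```

Changing `pop`, `confidence` or `c` shows how quickly the required Sample grows when the error margin is tightened.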
VI. Sampling Methods
Method of Sampling signifies the different ways of selecting the Sample from the Population. This list will generally differ from one
Statistician to another or from one Six Sigma expert to another. This is because the interpretation may result in
merging two Methods into one or splitting a Method into two.
Since the Methods are situational and are to be decided strictly as per the kind of Population you are handling and
the kind of analysis you are looking at, I have listed here an almost exhaustive list of Sampling Methods that you
may choose from.
1. Simple Random Sampling:
Simple Random Sampling is a Probability Sampling. It is choosing a sample (a subset of individuals) from a
Population (larger set). Each individual is chosen randomly and entirely by chance, such that each individual
has the same probability of being chosen at any stage during the sampling process.
Simply put, once the Sample Size is calculated, the Samples have to be chosen from the
Population in such a way that each entity in the Population has an equal chance of being chosen.
For e.g. the Enrolment forms are kept in a box and the number of forms specified by
the Sample Size is drawn at random.
Advantage:
i. Minimizes bias and simplifies analysis.
ii. Variance seen in the Sample is a good approximation of the Variance in the Population.
Disadvantage:
i. Might not reflect the makeup of the Population, like the number of boys and girls in all DU colleges.
ii. Tiresome and clumsy in case of a large Population.
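Drawing forms from a box can be imitated in a few lines. This is a minimal sketch, assuming the Frame is available as a list; the names `simple_random_sample` and `enrolment_forms` and the Frame of 300 forms are made up for illustration.

```python
import random

def simple_random_sample(frame, sample_size, seed=None):
    """Draw sample_size distinct entities; every entity has the same chance."""
    rng = random.Random(seed)          # seed only makes the draw repeatable
    return rng.sample(frame, sample_size)

# Hypothetical Frame: 300 enrolment forms
enrolment_forms = [f"form-{i}" for i in range(1, 301)]
picked = simple_random_sample(enrolment_forms, 12, seed=42)
print(picked)
```

`random.sample` draws without replacement, so no form can be picked twice.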
2. Systematic Sampling:
Systematic Sampling is a Probability Sampling. In this, once the Sample size is determined, an interval is
created and Samples are chosen at that interval systematically. The procedure is as below:
Divide the Population by the Sample size to arrive at k. Start from a randomly chosen entity between 1 and k. Choose every
kth entity from the Population after that starting entity. Once the end of the Population is reached, rotate back to the
beginning of the Population cyclically. Continue choosing until the Sample size is reached.
For e.g. from a Population of 300, if the Sample size is 12, choose every (300/12) = 25th entity starting from any
random number between 1 and 25. Choose each successive 25th entity from the starting entity until the Sample
size is reached. However, the Population will very rarely be divisible by the Sample size evenly. For e.g. if the Sample
size is 9, then (300/9) = 33.33. In these cases, choose a starting point between 1 and 33.33 and round
each successive position up to the next whole number. For e.g. if you start from 23.6, then select 24, 57, 91… and so on.
Advantage:
i. Efficient for Databases.
ii. Very efficient for Data with gradual trend and slope.
Disadvantage:
i. Data with periodicity will be heavily biased. If the Sample frame alternates boys' and girls' names and k is even,
Systematic Sampling will choose either all boys or all girls.
ii. Variations between neighbouring entities are never captured.
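The interval procedure, including the rounding-up rule for a fractional k, can be sketched as follows. The function name `systematic_sample` and the way the random start is drawn are assumptions made for illustration only.

```python
import math
import random

def systematic_sample(pop_size, sample_size, start=None, seed=None):
    """Return 1-based positions picked at interval k = pop_size / sample_size."""
    k = pop_size / sample_size
    if start is None:
        # Assumed convention: draw the starting point uniformly between 0 and k
        start = random.Random(seed).uniform(0, k)
    positions = []
    for i in range(sample_size):
        pos = math.ceil(start + i * k)     # round fractional positions up
        pos = (pos - 1) % pop_size + 1     # rotate back to the beginning if needed
        positions.append(pos)
    return positions

even = systematic_sample(300, 12, start=7)       # k = 25
uneven = systematic_sample(300, 9, start=23.6)   # k = 33.33...
print(even[:3], uneven[:3])                      # prints [7, 32, 57] [24, 57, 91]
```

The second call reproduces the 24, 57, 91… sequence from the example above.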
3. Stratified Sampling:
A Population may have different Subpopulations that are independent homogeneous groups which contribute to
the characteristics of the Population, but have a unique set of characteristics of their own. The Subpopulations
are homogeneous internally but heterogeneous with each other.
In Stratified Sampling, the Population is divided into Strata or Subpopulations as per the uniqueness of
each Stratum. Care is to be taken that no entity is in more than one Stratum and that no entity is left out of the
Population. Then, in each Subpopulation or Stratum, Simple Random Sampling or Systematic Sampling is
applied.
For e.g. if all the students in all colleges of DU are the Population, the Strata can be each college, or each
academic area of colleges combined (Science, Commerce), or geographical origin of students (North India,
East India).
While doing Stratified Sampling, care is to be taken that the proportion of each Stratum in the Population is
reflected in the Samples. For e.g. if there are 30% males and 70% females and the Strata are males and
females, then a Sample should have 3 males to 7 females. Also, if a Subpopulation has a higher Standard
Deviation, larger samples should be taken from it than from a Subpopulation with a lower Standard Deviation.
As a rule of thumb, a Population should not be divided into more than six Strata.
Advantage:
i. Sample represents the Population better, Sampling Error reduces. Subpopulations with more
importance can be weighted more.
ii. Different Sampling Methods can be exercised for different Subpopulations.
iii. Sampling from a Population over a wide geographical area is more accurate.
Disadvantage:
i. Cannot be applied to a large Population where Subgroups may not be distinctly disjoint or entities
have characteristics that are liable to make them a part of more than one Subpopulation.
ii. Scope of Sampling error increases with the number of Subpopulations in a Population.
iii. Can be expensive to implement.
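The 3 males to 7 females example corresponds to proportional allocation, which might be sketched as below. The function name and the made-up Strata of 300 and 700 entities are assumptions; the sketch uses Simple Random Sampling inside each Stratum and does not cover the Standard Deviation based adjustment mentioned above.

```python
import random

def proportional_stratified_sample(strata, sample_size, seed=None):
    """Sample each stratum in proportion to its share of the population."""
    rng = random.Random(seed)
    total = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        # Rounded share; with awkward proportions the parts may not add up
        # exactly to sample_size and would need a manual adjustment.
        n = round(sample_size * len(members) / total)
        sample[name] = rng.sample(members, n)   # SRS inside the stratum
    return sample

# Hypothetical Population: 30% males, 70% females
strata = {
    "male": [f"M{i}" for i in range(300)],
    "female": [f"F{i}" for i in range(700)],
}
sample = proportional_stratified_sample(strata, 10, seed=1)
print(len(sample["male"]), len(sample["female"]))   # prints 3 7
```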
4. Probability Proportional to Size Sampling:
If there is more than one Subpopulation, each with a different number of entities, PPS Sampling ensures that the
number of Samples drawn from a Subpopulation increases or decreases in proportion to the size of that
Subpopulation.
In this case, the Subpopulations are sorted in increasing order of size and each one is given a range of numbers (numbers for
Subp1 = 1 to the number of entities in Subp1, numbers for Subp2 = last number of Subp1 + 1 to last number of Subp1 + number of entities in Subp2…
). Then k (as in Systematic Sampling) is calculated (k = Population/Sample). Then every kth number is chosen,
and each chosen number is mapped to the Subpopulation whose range contains it.
For e.g. in a Population of all students in all colleges of DU, each college is a Subpopulation, and the
number of entities (students) varies considerably between colleges. Say we have a Sample Size of
25 to select from 3100 students across 5 DU colleges:
DU = Population = 3100
Subpopulations = 5 colleges
Sample Size derived = 25
Number Calculation for each Subpopulation
First sort and list the Subpopulations in increasing order.
Colleges         College 1    College 2     College 3      College 4       College 5
Subpopulation    340          510           620            750             880
Numbers          1 to 340     341 to 850    851 to 1470    1471 to 2220    2221 to 3100
Number calc      340          340+510       850+620        1470+750        2220+880
For College 1, the upper number is simply the number of entities in the least populated Subp.
Calculation of k = Population/Sample = 3100/25 = 124
Randomly select the first Sample between 1 and 124, say 113, then 113 + 124 = 237, 237 + 124 = 361… and we
get the table below, with the college derived for each number:
Sample Number College
1 113 College 1
2 237 College 1
3 361 College 2
4 485 College 2
5 609 College 2
6 733 College 2
7 857 College 3
8 981 College 3
9 1105 College 3
10 1229 College 3
11 1353 College 3
12 1477 College 4
13 1601 College 4
14 1725 College 4
15 1849 College 4
16 1973 College 4
17 2097 College 4
18 2221 College 5
19 2345 College 5
20 2469 College 5
21 2593 College 5
22 2717 College 5
23 2841 College 5
24 2965 College 5
25 3089 College 5
The table above means that the below number of Samples should be collected from each Subpopulation (College):
Subpopulation    Sample Size
College 1        2
College 2        4
College 3        5
College 4        6
College 5        8
DU               25
This sums to 25, the Sample Size. As you can see, larger Samples result from Subpopulations with a larger number of entities.
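The numbering and counting in this example can be reproduced with a short script. This is a minimal sketch: the function name `pps_systematic` is made up, the college sizes and the starting number 113 are taken from the example, and integer division is used for k because 3100 divides evenly by 25.

```python
from bisect import bisect_left
from itertools import accumulate

# Subpopulations from the example, already sorted in increasing order
sizes = {"College 1": 340, "College 2": 510, "College 3": 620,
         "College 4": 750, "College 5": 880}

def pps_systematic(sizes, sample_size, start):
    """Pick every kth number and map it to the subpopulation holding it."""
    names = list(sizes)
    upper_bounds = list(accumulate(sizes.values()))   # 340, 850, 1470, 2220, 3100
    k = upper_bounds[-1] // sample_size               # 3100 // 25 = 124
    picks = [start + i * k for i in range(sample_size)]
    counts = {name: 0 for name in names}
    for number in picks:
        # index of the first range whose upper number is >= the picked number
        counts[names[bisect_left(upper_bounds, number)]] += 1
    return picks, counts

picks, counts = pps_systematic(sizes, 25, start=113)
print(counts)
# prints {'College 1': 2, 'College 2': 4, 'College 3': 5, 'College 4': 6, 'College 5': 8}
```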
Advantage:
i. Sample is concentrated on larger Subpopulations, increasing the representativeness of the Sample.
ii. Counters the disadvantages of Systematic and Stratified Sampling when Subpopulations have
Variance between them.
Disadvantage:
i. Fails to account for negative balances while Sampling a Business’ Finance data.
ii. Decreases precision of estimates; thus, requires a larger sample size.
5. Cluster Sampling:
Cluster Sampling is a method in which the Population is divided into Clusters, taking care that each Cluster
has all the characteristics that the Population as a whole has. Then one or more Clusters are taken as
Sample/s, leaving the rest of the Clusters untouched.
The difference between Stratified Sampling and Cluster Sampling is that in Stratified Sampling, Samples have to
come from every Stratum, while in Cluster Sampling, Samples come from only one or a few Clusters. The other,
more basic difference is that Strata are internally homogeneous but heterogeneous with each other, whereas Clusters
are internally heterogeneous but homogeneous with each other.
For e.g. a team made up of one student from each college in DU at an inter-college competition would be a Cluster for the
Population DU.
Advantage:
i. Cheaper than other Methods.
ii. Sampling Frame for complete Population is not needed.
Disadvantage:
i. Sampling error possibility is high. Extra care needed to choose a Cluster.
ii. Requires larger Sample than SRS or Systematic Sampling for similar accuracy.
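Taking whole Clusters as the Sample can be sketched as below. The Cluster names and members are made up for illustration, and choosing the Clusters at random is an assumption; the text only says that one or more Clusters are taken.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly choose whole clusters; every member of a chosen cluster is sampled."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    sample = [member for name in chosen for member in clusters[name]]
    return chosen, sample

# Hypothetical Population split into 6 clusters of 5 students each
clusters = {f"cluster-{c}": [f"c{c}-student-{s}" for s in range(5)]
            for c in range(1, 7)}
chosen, sample = cluster_sample(clusters, 2, seed=7)
print(chosen, len(sample))
```

The Clusters that are not chosen are left untouched, which is what keeps the cost low.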
6. Multistage Sampling:
Multistage Sampling is a complex form of Cluster Sampling with multiple levels of selection. After identifying
Clusters in a Population, the second stage is to randomly select Samples from each Cluster. Sometimes,
when the Population is huge or not completely available, multiple stages of Cluster selection may be applied
before the final Sample is collected.
For e.g. if students in all the colleges under DU are the Population, and we need to find out about student
involvement in National level Competitions, the first level Clusters would be all the students (from each college
under DU) participating in each competition; then from each Cluster students can be picked using either
Systematic Sampling or SRS.
Advantage:
i. Cost and speed are optimized, convenience to researchers is assured.
ii. Less Sampling error than normal Cluster Sampling for same size Sample.
Disadvantage:
i. Less accurate than SRS for same Sample Size.
ii. Not much testing and analysis can be done on Sample.
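A two-stage version might look like the sketch below. It assumes that stage one picks Clusters at random and stage two applies SRS inside each picked Cluster; the competition names and participant lists are made up for illustration.

```python
import random

def two_stage_sample(clusters, n_clusters, n_per_cluster, seed=None):
    """Stage 1: pick clusters at random. Stage 2: SRS within each picked cluster."""
    rng = random.Random(seed)
    stage_one = rng.sample(sorted(clusters), n_clusters)
    return {name: rng.sample(clusters[name], n_per_cluster) for name in stage_one}

# Hypothetical clusters: students grouped by the competition they took part in
competitions = {f"competition-{c}": [f"comp{c}-student-{s}" for s in range(20)]
                for c in range(1, 6)}
staged = two_stage_sample(competitions, n_clusters=2, n_per_cluster=3, seed=11)
print(staged)
```

Further stages can be added by grouping Clusters into larger Clusters and repeating the first step.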
7. Multiphase sampling:
Multiphase Sampling refers to a method where some information is collected from the whole of the main Sample and
the rest is collected from a subset of the main Sample. It ensures that some part of the Sample provides more
information than the others. Basically, the sub Samples provide more detailed information.
For e.g. if all the students in colleges under DU are the Population and we need to find out which students
speak fluent Tamil and can teach Basic Stats in Tamil, a large Sample of South Indian students can be
separated and asked, “Are you from Tamil Nadu?” Then the sub Sample of students who confirm that they
are from Tamil Nadu can be asked if they speak Tamil and can teach Basic Stats.
Advantage:
i. Useful when Sampling Frame lacks auxiliary information for Stratified Sampling.
ii. Cost effective when budget is not available for complete Sample information collation.
Disadvantage:
i. The planning and implementation are complicated.
ii. Time consuming.
8. Quota Sampling:
Quota Sampling is a Non-Probability Sampling. This method segments the Population into mutually
exclusive Subpopulations. Then a pre-determined judgement is used to pick Samples from each
Subpopulation.
For e.g. from the Population of all students in DU colleges, after defining each college as a Subpopulation, 20
female students with entrance exam marks between 75% and 85% are to be chosen. The Researcher may
choose any 20 such female students from any college at their own discretion, maybe based on whose language is easiest to
understand.
Advantage:
i. This method is incredibly cheap.
Disadvantage:
i. Samples have a high probability of being biased.
ii. Limits decisions.
iii. Does not allow variety in Sample.
iv. Not possible to assess Sampling error as it is not random.
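Filling quotas can be sketched as below. This is only an illustration under assumptions: the names, colleges and quotas are made up, and candidates are simply taken in the order the Researcher happens to meet them, which is exactly why the result is not random.

```python
def quota_sample(candidates, quotas, group_of):
    """Accept candidates in the order met until each group's quota is filled."""
    remaining = dict(quotas)
    picked = []
    for candidate in candidates:
        group = group_of(candidate)
        if remaining.get(group, 0) > 0:
            picked.append(candidate)
            remaining[group] -= 1
    return picked

# Hypothetical students who already meet the criteria, in the order met
students = [
    ("Asha", "College 1"), ("Bina", "College 2"), ("Chitra", "College 1"),
    ("Divya", "College 1"), ("Esha", "College 2"), ("Farah", "College 2"),
]
picked = quota_sample(students, {"College 1": 2, "College 2": 1},
                      group_of=lambda student: student[1])
print(picked)
# prints [('Asha', 'College 1'), ('Bina', 'College 2'), ('Chitra', 'College 1')]
```

Students met later (Divya, Esha, Farah) have no chance of selection once the quotas are full.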
9. Accidental Sampling:
Accidental Sampling is like Snowball Sampling in Social Science Research and is also called
Convenience/Grab/Opportunity Sampling. It is a Non Probability Sampling and consists of collecting the Sample
from the part of the Population that is easy to access or is readily available.
However, it should be ensured that the research is equipped with safeguards to lessen the impact of the
non-randomness. Also, it should be ensured that there is reason to believe the Convenience Sample represents the
Population.
For e.g. Sampling the students in a particular gift shop nearest to one of the colleges under DU.
Advantage:
i. Useful for Pilot Testing of a product or service, where target user is not particular.
ii. Cost effective.
Disadvantage:
i. Sampling error possible due to non-randomness of the Sample.
ii. High probability of Sample not representing Population exists.
10. Line Intercept Sampling:
Line Intercept Sampling is a Non Probability Sampling Method generally applied across an area
where the entities are stationary or relatively immobile, for e.g. patches in a certain habitat type,
herbs and vegetation, rocks on a mountain, or slow-moving animals like cows grazing in a field.
Lines, often called Transects, are drawn through the area and any entity falling on the line of a Transect is
taken as a Sample. A single Transect is drawn along the diagonal if the area is a square or rectangle, or more than
one Transect is drawn if the area has an irregular boundary. Generally it is used in Biological studies or Vegetation and
Geographical Data collection.
Advantage:
i. Simple method of Sample collection.
ii. Useful for Populations that are not mobile and cannot participate in the selection process.
Disadvantage:
i. Since it is Non Probability Sampling, some entities do not have a chance of getting selected.
ii. Cannot be applied to all kinds of data collection.
11. Panel Sampling:
Panel Sampling is mostly used for Social Science Research. It consists of selecting a Sample using any
Random Sampling Method and then extracting the same information more than once from the Sample over a period of
time. The information extracted each time is called a Wave. It is like studying Repeatability for Gage R&R in Six
Sigma.
For e.g. after selecting the Sample of students from all colleges under DU, asking the students once at the beginning of each year if they will join
the family business or go for a job, and studying the Variance in their
answers.
It is a very useful Sampling Method; carefully done, it can support useful analysis using MANOVA, Growth Curve
etc. about people and their changing views.
Advantage:
i. Useful for People Study, Political mileage trends.
ii. Can help find out within-person health changes due to changing stress, time, prices etc.
Disadvantage:
i. Time consuming and can prove to be costly.
ii. Cannot be applied to all kinds of data collection.
12. The Judgement Sample:
Judgement Sample is a Non Probability method in which either the Researcher or an Expert makes a
judgement on which entities are to be included in the Sample. Here the Sampling Frame and the Population
are not identical, so there is scope for bias.
This is appropriate if the Population is difficult to locate or some part of the Population is known to have more
data or knowledge, or to be more receptive to data sharing, than others.
For e.g. from the Population of all students in colleges under DU, if the researcher chooses the ones who are
into College Election to ask about current Political events in the country, it would be Judgement Sampling.
Advantage:
i. Easy to determine Samples.
ii. Useful for Population with definite expertise and skills.
Disadvantage:
i. High scope of bias in Sample.
ii. Expert’s or Researcher’s reliability evaluation is necessary.
VII. End Notes
Sampling is the first step in Analysis and a very important part of the complete Analysis Process. It forms the
primary step in the Measure Phase of Six Sigma. The complete chapter on Sampling above has been presented in as
easy language as possible. However, there are a few pointers listed below that need further study. You can also
contact me for any clarification:
1. Systematic Sampling, weighted method
2. Systematic Sampling vs. SRS
3. Standard Deviation while Sampling from Strata
4. Post Stratification and Over Sampling in Stratified Sampling
5. MANOVA, Growth Curve etc
I will also come back with an Excel based Sample Size calculator where you can enter data knowing what its
significance is. Until then, here is a very nice and simple calculator developed by Macorr that you can download and
use.
http://www.macorr.com/sample-size-calculator.htm [Courtesy: Macorr]