
Upload: darlene-radcliff

Post on 11-Dec-2015


Page 1: Survey sampling l Sampling & non-sampling error l Bias l Simple sampling methods l Sampling terminology l Cluster sampling l Design effect l Stratified

Survey sampling

- Sampling & non-sampling error
- Bias
- Simple sampling methods
- Sampling terminology
- Cluster sampling
- Design effect
- Stratified sampling
- Sampling weights


Why sample?

To make an inference about a population

Studying the entire population is impractical or impossible


Example of sampling

Estimate the proportion of adults aged 18-65 in Port Elizabeth who have type 2 diabetes

Select a sample from which to estimate the proportion

Population: adults aged 18-65 living in Port Elizabeth

Inference: proportion with type 2 diabetes


Probability sampling

Each individual has a known (non-zero) probability of selection

Precision of estimates can be quantified


Non-probability sampling

Cheaper, more convenient

Quality of estimates cannot be assessed

May not be representative of population


Sampling error v. non-sampling error


Sampling error

Random variability in sample estimates that arises out of the randomness of the sample selection process

Precision can be quantified (estimation of standard errors, confidence intervals)


Non-sampling error

Estimation error that arises from sources other than random variation
  - non-response
  - undercoverage of survey
  - poorly trained interviewers
  - non-truthful answers
  - non-probability sampling

This type of error is a bias


What is bias?

We want to estimate the mean weight of all women aged 15-44 living in Coopersville. Suppose there are 50,000 such women and the true mean weight is 61.7 kg.

We select a sample of 200 such women and interview them, asking each woman what her weight is.

The sample mean weight is 59.4 kg.

Is our estimate biased?


Bias

Suppose we could repeat the survey many, many times.

Then we compute the mean of all the sample means.

Say the mean of the means = 62.9

Bias = (mean of means) - (true mean)

= 62.9 - 61.7 = 1.2 kg


Unbiased estimation

If . . .

(mean of the means) = (true mean)

then the bias is zero, and we say that the estimator is unbiased.

The “mean of the means” is called the “expected value” of the estimator.
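The "repeat the survey many, many times" idea can be imitated by simulation. The sketch below is only an illustration: the population is synthetic (normally distributed weights with an assumed spread of 9 kg), and the reporting error is assumed to be a constant +1.2 kg, chosen to match the bias in the example above. The names `population`, `mean_of_means`, and `report` are hypothetical.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: 50,000 true weights (kg), centred near 61.7.
population = [random.gauss(61.7, 9.0) for _ in range(50_000)]
true_mean = statistics.fmean(population)

def mean_of_means(report, n=200, repeats=2_000):
    """Repeat the survey `repeats` times and average the sample means."""
    means = []
    for _ in range(repeats):
        sample = random.sample(population, n)   # simple random sample
        means.append(statistics.fmean(report(w) for w in sample))
    return statistics.fmean(means)

honest = mean_of_means(lambda w: w)          # weights reported accurately
shifted = mean_of_means(lambda w: w + 1.2)   # assumed: every answer is off by +1.2 kg

bias_honest = honest - true_mean     # close to 0, so the estimator is unbiased
bias_shifted = shifted - true_mean   # close to 1.2 kg
print(f"bias with accurate reports: {bias_honest:+.2f} kg")
print(f"bias with shifted reports:  {bias_shifted:+.2f} kg")
```

Any single sample mean (such as 59.4 kg) can miss the true mean through sampling error alone; bias is a property of the average over repeated surveys, which is why the simulation averages many sample means.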


Simple sampling methods

Task: Select a sample of n individuals or items from a population of N individuals or items

Common methods
  - simple random sampling
  - systematic sampling


Simple sampling methods

Simple random sampling (SRS)
  - each item in population is equally likely to be selected
  - each combination of n items is equally likely to be selected

Systematic sampling (typical method)
  - randomly select a starting point
  - select every kth item thereafter


Systematic sampling example

Stack of 213 hospital admission forms; select a sample of 15

213/15 = 14.2

Select every 14th form

Starting point: random number between 1 and 14 (we choose 11)

First form selected is 11th from top

Second form selected is 25th from top (11 + 14 = 25)

Third form selected is 39th from top (11 + 2 x 14 = 39)

And so forth . . .
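The selection rule in this example can be sketched in a few lines. The function name `systematic_sample` is hypothetical, and the starting point is fixed at 11 to match the slide; in practice it would be drawn at random between 1 and 14.

```python
def systematic_sample(n_items, interval, start):
    """Return 1-based positions: start, start + interval, ... up to n_items."""
    return list(range(start, n_items + 1, interval))

N, n = 213, 15
k = N // n            # 213 / 15 = 14.2, rounded down to 14
selected = systematic_sample(N, k, start=11)

print(selected[:3])   # first three forms selected
print(len(selected))  # number of forms selected
```

With a start of 11 this yields 15 forms, beginning 11, 25, 39. A start of 1, 2, or 3 would yield 16 forms instead, one sign that "every 14th form" is an approximation to an interval of 14.2.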


Systematic sampling, continued

What is the probability that the 146th form will be selected? The 195th?

Does this qualify as a simple random sample? Why or why not?

Is there any potential problem arising from the use of systematic sampling in this situation?


The example used a typical quick method

In the preceding example, we selected every 14th form

Ideally, we would select every 14.2th form (see later example on 2-stage sample of nurses)

Example is a quick and easy method, commonly used in the field; it is a good approximation to the more rigorous procedure
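One way to read "select every 14.2th form" is to generate fractional sampling numbers and take the first form at or above each one, the same rule the two-stage nurse example applies to cumulative sizes. The sketch below assumes that reading; the function name `fractional_systematic_sample` is hypothetical, and the start is fixed at 11 only for comparability with the quick method.

```python
import math
from fractions import Fraction

def fractional_systematic_sample(n_items, n_sample, start):
    """Take the first position at or above each fractional sampling number."""
    interval = Fraction(n_items, n_sample)   # exact 213/15 = 14.2
    return [math.ceil(start + j * interval) for j in range(n_sample)]

# In practice `start` would be a random number in (0, 14.2].
positions = fractional_systematic_sample(213, 15, start=11)
print(positions)
```

Because the interval is exactly N/n, this version always returns n positions within the stack whatever start is drawn, whereas the every-14th shortcut can return 15 or 16.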


Systematic sampling: + and -

Advantages of systematic sampling
  - typically simpler to implement than SRS
  - can provide a more uniform coverage

Potential disadvantage of systematic sampling
  - can produce a bias if there is a systematic pattern in the sequence of items from which the sample is selected


Role of simple sampling methods

These simple sampling methods are necessary components of more complex sampling methods:
  - cluster sampling
  - stratified sampling

We’ll discuss these more complex methods next (following some definitions)


Definitions

Listing units (or enumeration units)
  - the lowest level sampled units (e.g., households or individuals)

PSUs (primary sampling units)
  - the first units sampled (e.g., states or regions)

Sampling probability
  - for any unit eligible to be sampled, the probability that the unit is selected in the sample


More definitions

EPSEM sampling
  - "equal probability of selection method", thus a method in which each listing unit has the same sampling probability

Sampling frame
  - the set of items from which sampling is done, often a list of items


More definitions

Undercoverage: the degree to which we fail to identify all eligible units in the population
  - incomplete lists
  - incomplete or incorrect eligibility information


Still more definitions

Non-response: failure to interview sampled listing units (study subjects)
  - refusal
  - death
  - physician refusal
  - inability to locate subject
  - unavailability


Still more definitions

Precision: the amount of random error in an estimate
  - often measured by the width or half-width of the confidence interval
  - standard error is another measure of precision
  - estimates with smaller standard error or narrower CI are said to be more precise


CLUSTER SAMPLING: single stage


Clusters

Subsets of the listing units in the population

Set of clusters must be mutually exclusive and collectively exhaustive
  - counties
  - townships
  - regions
  - institutions


Example: Single-stage cluster sampling

There are 361 nurses working at the 31 hospitals and clinics in Region 4

We wish to interview a sample of these nurses
  - select a simple random sample of 5 hospitals/clinics
  - interview all nurses employed at the 5 selected institutions


Assessing the example

Hospitals/clinics are the PSUs

Nurses are the listing units

Sampling probability for each nurse is 5/31

Thus, this is an EPSEM sample

Sampling frame is the list of 31 hospitals and clinics


CLUSTER SAMPLING: two stage


Cluster sampling -- two stage

Select a sample of clusters, as in the single-stage method

From each selected cluster, select a subsample of listing units


Cluster sampling -- two stage

It is always nice to do EPSEM sampling because such samples are self-weighting
  - don’t need sampling weights in analysis

A common EPSEM method for two-stage sampling is PPS (probability proportional to size)


PPS sampling

The key to the method is that the sampling probabilities of clusters in the first stage are proportional to the “sizes” of the clusters
  - size = number of listing units in cluster

At stage 2, select the same number of listing units from each selected cluster


Nurse example revisited: Two-stage sampling

We want to interview a sample of 36 nurses

We can afford to visit 9 different hospitals/clinics

Thus, we need to interview 36/9 = 4 nurses at each institution


Nurse example revisited: Two-stage sampling

Stage 1: select a sample of 9 hospitals/clinics
  - selection probability proportional to “size”

Stage 2: select a sample of 4 nurses from each selected institution

At each stage, use one of the simple sampling methods


Nurse example revisited: Two-stage sampling

PSUs are the hospitals/clinics

Listing units are the nurses

Sampling frames
  - Stage 1: List of 31 hospitals/clinics
  - Stage 2: Lists of nurses at each selected hospital/clinic


Selecting 2-stage nurse sample

Sampling interval, I = 361/9 = 40.1

Starting point, random number between 1 and 40; we choose R = 14

First sampling number = R = 14

2nd sampling number = 14 + 1 x 40.1 = 54.1

3rd sampling number = 14 + 2 x 40.1 = 94.2

We have selected institutions 2, 5, 9, . . .


Two-stage nurse sample

Institution   No. of   Cumulative   Sampling
number        nurses   nurses       number
1             12       12
2             7        19           14
3             9        28
4             18       46
5             11       57           54.1
6             7        64
7             10       74
8             14       88
9             8        96           94.2
. . .         . . .    . . .        . . .
31            9        361
Total         361


Applying the sampling numbers

For each sampling number, choose the first unit with cumulative “size” equal to or greater than the sampling number

Example: sampling number 54.1
  - first unit with cumulative size of at least 54.1 is unit 5 (cum. no. of nurses = 57)
  - so we select unit 5 for the sample
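The rule above can be sketched for the part of the table that is shown. Only the sizes of institutions 1 to 9 appear in the slides, so the sketch is limited to the first three sampling numbers; the function name `select_unit` is hypothetical.

```python
from bisect import bisect_left
from itertools import accumulate

# Sizes of institutions 1-9 from the table; institutions 10-31 are not shown there.
sizes = [12, 7, 9, 18, 11, 7, 10, 14, 8]
cumulative = list(accumulate(sizes))   # 12, 19, 28, 46, 57, 64, 74, 88, 96

interval = 361 / 9                     # sampling interval I, about 40.1
start = 14                             # random start R chosen in the slides
sampling_numbers = [start + j * interval for j in range(3)]   # 14, 54.1, 94.2

def select_unit(number):
    """1-based index of the first unit whose cumulative size is >= number."""
    return bisect_left(cumulative, number) + 1

chosen = [select_unit(x) for x in sampling_numbers]
print(chosen)
```

This reproduces the selection of institutions 2, 5, and 9. Larger institutions span more of the cumulative scale, which is what makes their selection probability proportional to size.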


Optional challenge

What is the selection probability for institution 1?

12/40.1 = 0.299

What is the selection probability for a nurse in institution 1?

(12/40.1) x (4/12) = 4/40.1 = 0.0998, which is 36/361 when the unrounded interval 361/9 is used

What is the selection probability for a nurse in institution 2?

(7/40.1) x (4/7) = 4/40.1 = 0.0998, again 36/361

All nurses have the same selection probability.
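The cancellation that makes PPS an EPSEM method can be checked with exact fractions. This is a small sketch under the assumptions of the example (9 institutions selected, 4 nurses per institution, every institution smaller than the interval and holding at least 4 nurses); the function name `nurse_probability` is hypothetical.

```python
from fractions import Fraction

TOTAL_NURSES = 361
N_CLUSTERS = 9    # institutions selected at stage 1
TAKE = 4          # nurses selected per institution at stage 2
interval = Fraction(TOTAL_NURSES, N_CLUSTERS)   # exact 361/9

def nurse_probability(size):
    """Net selection probability for a nurse in an institution of `size` nurses."""
    stage1 = Fraction(size) / interval   # P(institution is selected)
    stage2 = Fraction(TAKE, size)        # P(nurse is selected, given the institution is)
    return stage1 * stage2

for size in (12, 7, 18):
    print(size, nurse_probability(size))
```

The institution size cancels between the two stages, leaving 36/361 (about 0.0997) for every nurse regardless of where she works.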


Why do cluster sampling instead of a simple sampling method?

Advantages
  - reduced logistical costs (e.g., travel)
  - list of all 361 nurses may not be available (reduces listing labor)

Disadvantages
  - estimates are less precise
  - analysis is more complicated (requires special software)


Design effect

Relative increase in variance of an estimate due to the sampling design
  - “variance” = $(\text{standard error})^2$

Formula
  - $s_1$ = standard error under simple random sampling
  - $s_2$ = standard error under complex sampling design (e.g., cluster sampling)
  - design effect = $(s_2/s_1)^2$
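As a quick numerical illustration of the formula, the sketch below uses made-up standard errors (0.020 under simple random sampling, 0.025 under a cluster design); neither figure comes from the slides, and the function name `design_effect` is hypothetical.

```python
def design_effect(se_srs, se_complex):
    """Ratio of variances: (SE under complex design / SE under SRS) squared."""
    return (se_complex / se_srs) ** 2

# Hypothetical standard errors for the same estimate and the same sample size.
deff = design_effect(se_srs=0.020, se_complex=0.025)
print(round(deff, 4))
```

Here the standard error is 25% larger under the cluster design, so the variance, and hence the design effect, is about 1.56 times that of simple random sampling.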


Design effect for cluster sampling

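For cluster sampling designs, the design effect is usually > 1 (it falls below 1 only in the rare case where units within a cluster are less alike than units drawn at random from the population)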

This means that estimates from a survey done with cluster sampling are less precise than corresponding estimates obtained from a survey having the same sample size done with simple random sampling


Cluster sizes

Recommended “take” per cluster is 20-40 for multi-purpose surveys

Time and resource limitations will often dictate the maximum number of clusters you can include in the study

Including more clusters improves the precision of your estimates more than a corresponding increase in sample size within the clusters already in the sample


STRATIFIED SAMPLING


Strata

Subsets of the listing units in the population

Set of strata must be mutually exclusive and collectively exhaustive

Strata are often based on demographic variables
  - age
  - sex
  - race


Stratified sampling

Sample from each stratum

Often, sampling probabilities vary across strata


Stratified sampling

Advantages
  - guarantees coverage across strata
  - can over-sample some strata in order to obtain precise within-stratum estimates
  - typically, design effect < 1

Disadvantages
  - with unequal sampling probabilities, sampling weights must be included in analysis
    - more complicated
    - requires special software


Example: sampling breast cancer cases for the Women’s CARE Study

Stratification variables
  - geographic site
  - race (2 races)
  - five-year age group

Over-sampled younger women

Over-sampled black women


Example: Sampling households for a reproductive health survey in 11 refugee camps in Pakistan

Selected simple random sample of households from within each of the 11 camps

All households were selected with the same probability


Refugee camp sampling

Camp           Population   Sample size   Completed interviews
Lakhte Banda   12,943       64            61
Kotki 1        7,262        36            29
Kotki 2        5,781        29            21
Kata Kanra     8,437        42            38
Mohd Khoja     12,791       63            45
Doaba          13,584       67            25
Darsamand      17,797       88            53
Kahi           11,061       55            32
Naryab         5,543        28            19
Thal 1         11,087       55            44
Thal 2         17,130       85            60
Dallan         10,990       55            45
Total          134,406      667           472
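A quick check of the table shows how the sample was spread across camps. The sketch below computes, for each camp, the ratio of sample size to population and the share of sampled households with a completed interview. One caveat, stated as an assumption: the population column probably counts people while the sample counts households, so the ratio is an indicator of proportional allocation rather than the household sampling probability itself.

```python
# (population, sample size, completed interviews) per camp, from the table
camps = {
    "Lakhte Banda": (12_943, 64, 61),
    "Kotki 1": (7_262, 36, 29),
    "Kotki 2": (5_781, 29, 21),
    "Kata Kanra": (8_437, 42, 38),
    "Mohd Khoja": (12_791, 63, 45),
    "Doaba": (13_584, 67, 25),
    "Darsamand": (17_797, 88, 53),
    "Kahi": (11_061, 55, 32),
    "Naryab": (5_543, 28, 19),
    "Thal 1": (11_087, 55, 44),
    "Thal 2": (17_130, 85, 60),
    "Dallan": (10_990, 55, 45),
}

ratios = {name: n / pop for name, (pop, n, _) in camps.items()}
completion = {name: done / n for name, (_, n, done) in camps.items()}

for name in camps:
    print(f"{name:<13} sample/pop = {ratios[name]:.4f}  completed = {completion[name]:.0%}")

total_pop = sum(pop for pop, _, _ in camps.values())
total_n = sum(n for _, n, _ in camps.values())
total_done = sum(done for _, _, done in camps.values())
print(f"overall: sample/pop = {total_n / total_pop:.4f}, completed = {total_done / total_n:.0%}")
```

The ratio stays close to 0.005 in every camp, consistent with equal selection probabilities, while completion varies widely (from roughly 37% to 95%), which is where non-response adjustments become relevant.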


The sampling operation

Must be carefully controlled
  - don’t leave to discretion in the field
  - use a carefully defined procedure

Document what you did
  - for reference during analysis
  - to defend your study


Sampling frames

A list containing all listing units is great if you can get it
  - ok if it includes some ineligibles

Problems associated with geographic location-based sampling
  - map-based sampling
  - EPI sampling


Sampling weights

Inverse of the net sampling probability

Interpretation: the sampling weight for a sampled individual is the number of individuals his/her data “represent”


Example: sampling weights

There are 150 employees in a firm
  - stratum 1: 50 employees aged 18-29
  - stratum 2: 100 employees aged 30-69

We sample 10 from each stratum

Sampling probabilities are
  - stratum 1: 10/50 = 0.20
  - stratum 2: 10/100 = 0.10


Example: sampling weights (continued)

Sampling weights
  - stratum 1: 1/0.20 = 5
  - stratum 2: 1/0.10 = 10

Interpretation:
  - Each sampled employee in stratum 1 represents 5 employees
  - Each sampled employee in stratum 2 represents 10 employees
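The weights in this example, and their effect on an estimate, can be sketched as follows. The stratum sizes and sample sizes come from the slides; the stratum sample means (30.0 and 40.0 of some measured quantity) are hypothetical values added only to show how weighting changes the answer.

```python
strata = {
    "stratum 1 (18-29)": {"N": 50, "n": 10},
    "stratum 2 (30-69)": {"N": 100, "n": 10},
}

# Sampling weight = 1 / sampling probability = N / n
weight = {k: s["N"] / s["n"] for k, s in strata.items()}   # 5.0 and 10.0

# Hypothetical sample means within each stratum (not from the slides).
sample_mean = {"stratum 1 (18-29)": 30.0, "stratum 2 (30-69)": 40.0}

sum_of_weights = sum(strata[k]["n"] * weight[k] for k in strata)
weighted_total = sum(strata[k]["n"] * weight[k] * sample_mean[k] for k in strata)
weighted_mean = weighted_total / sum_of_weights
unweighted_mean = sum(sample_mean.values()) / len(sample_mean)   # valid here: equal n

print(weight)
print(sum_of_weights)        # weights add up to the 150 employees
print(round(weighted_mean, 2), unweighted_mean)
```

The weights sum to the population of 150, and the weighted mean is pulled toward stratum 2, which was sampled at half the rate of stratum 1 and is therefore under-represented in the raw sample.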


What about non-response?

1 employee in the stratum 1 sample and 3 employees in the stratum 2 sample refuse to participate in the survey

Net sampling probabilities
  - stratum 1: 9/50 = 0.18
  - stratum 2: 7/100 = 0.07


Revised sampling weights

Sampling weights revised for non-response
  - stratum 1: 1/0.18 = 5.56
  - stratum 2: 1/0.07 = 14.29

This computation is often done by multiplying the original sampling weights by adjustment factors to account for non-response rates
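The adjustment-factor route gives the same revised weights. In the sketch below the factor is taken to be (number sampled) / (number responding) within each stratum, which is one common choice and is assumed here; the dictionary names are hypothetical.

```python
base_weight = {1: 5.0, 2: 10.0}   # original sampling weights by stratum
sampled = {1: 10, 2: 10}          # employees sampled
responded = {1: 9, 2: 7}          # employees who took part

# Adjustment factor = sampled / responded (inverse of the response rate)
adjusted = {s: base_weight[s] * sampled[s] / responded[s] for s in base_weight}

for s, w in adjusted.items():
    print(f"stratum {s}: {w:.2f}")

# Respondents' weights still add up to the stratum sizes (50 and 100).
represented = {s: responded[s] * adjusted[s] for s in adjusted}
print(represented)
```

Each respondent now also stands in for the non-respondents in the same stratum, which implicitly assumes that non-respondents resemble respondents within that stratum.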


Post-stratification weighting

Define strata, which may or may not have been used as strata in the sampling design

Compute sampling probabilities = proportion of each stratum that was actually sampled

Compute sampling weights from these sampling probabilities

Allows post-hoc treatment of unequal representation of population segments in the sample
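The three steps can be sketched with made-up numbers. Everything in the block is hypothetical: a population of 600 women and 400 men (assumed known from an external source) and an achieved sample of 30 women and 70 men.

```python
# Hypothetical figures, for illustration only.
population_counts = {"women": 600, "men": 400}   # known from an external source
sample_counts = {"women": 30, "men": 70}         # how the achieved sample fell out

# Step 2: sampling probability = proportion of each stratum actually sampled
probabilities = {g: sample_counts[g] / population_counts[g] for g in population_counts}

# Step 3: sampling weight = 1 / sampling probability
weights = {g: 1 / probabilities[g] for g in probabilities}

for g in weights:
    print(f"{g}: probability = {probabilities[g]:.3f}, weight = {weights[g]:.2f}")

weighted_size = sum(sample_counts[g] * weights[g] for g in weights)
print(round(weighted_size))   # matches the population total
```

Women are under-represented in this hypothetical sample, so each sampled woman receives a weight of 20 against roughly 5.7 for each man, restoring the population's 60/40 split in weighted analyses.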


Discussion topics

What is the population of interest?

Infinite populations

Selecting random numbers

Selecting simple random samples
  - from finite populations
  - from infinite populations

Analysis software for complex surveys