A Comparison of Design-based and
Model-based Analysis of Sample Surveys in Geography
by
David C. Wheeler1, Jason E. VanHorn2, and Electra D. Paskett3
Technical Report 07-11 December 2007
Department of Biostatistics Rollins School of Public Health
Emory University Atlanta, Georgia
1Department of Biostatistics Rollins School of Public Health
Emory University
2Department of Geology, Geography, and Environmental Studies Calvin College
3Department of Epidemiology
College of Public Health Ohio State University
Correspondence Author: Dr. David Wheeler Telephone: (404) 727-8059 FAX: (404) 727-1370
e-mail: [email protected]
A comparison of design-based and model-based analysis of sample
surveys in geography
David C. Wheeler1*, Jason E. VanHorn2, Electra D. Paskett3
1 Department of Biostatistics, Rollins School of Public Health, Emory University
2 Department of Geology, Geography, and Environmental Studies, Calvin College
3 Division of Epidemiology, College of Public Health, Ohio State University
* Corresponding author: [email protected]
Abstract. Sample surveys are routinely used to gather primary data in human geography
research. We highlight the difference between design-based analysis and model-based
analysis of sample surveys and emphasize the advantages of using the design-based
approach with these data. As an example, we demonstrate differences in results from
model-based and design-based analyses of cancer prevalence in a population of
predominantly minority women in North Carolina and South Carolina. The results from
the two approaches reveal differences in population estimates of numerous variables and
a different conclusion regarding the significance of an explanatory variable in a logistic
regression model to explain colon cancer prevalence.
Key Words: logistic regression, stratified sample, survey research, cancer, Stata
Introduction
Social and health scientists frequently turn to survey research to investigate research
questions about individual behavior or characteristics where any existing data from
secondary sources would not adequately address the questions of interest. Survey
sampling research is a method of data collection where individuals provide some basis
for making extrapolations to a larger population (Manheim and Rich 1995). A sample
survey is a study involving a subset of individuals selected from a larger population,
where the members of the sample are interviewed and characteristics of interest, or
variables, are measured on each observation. Sample surveys are routinely employed in
human geography research to gather primary data to address research questions (Yankson
2000; Li and Siu 2001; Takasaki, Barham, and Coomes 2001; McSweeney 2002; Sinai
2002; McSweeney 2004; Paudel and Thapa 2004; Overmars and Verburg 2005). A
sample survey is different from a census, where all individuals in a population are
measured. The motivation for taking a sample instead of a census is primarily the
prohibitive expense, in terms of time, capital, and human resources, of enumerating a
population of interest. In some small studies where relatively few subjects are to be
measured, it may be possible and worthwhile to perform a complete census of the target
population. In some circumstances, the inability to measure all target population subjects
adequately can turn an intended census into a sample of the population (see, for example,
Wyllie and Smith 1996).
A primary objective of many sample surveys is the description of certain characteristics
of the population, achieved through sample-based estimates of the characteristics. Sample
estimates of population parameters are obtained by aggregating the measurements from
sampled individuals. Inferences about the population are then based on these summary
statistics. The reliability and validity of the summary statistics, or population parameter
estimates, are interrelated with the design of the sample. Reliability is associated with the
size of the standard error of an estimate, and validity is measured by the bias, or deviation
from the true population value, of an estimate. Reliability of an estimate can only be
assessed if a probability-based sample is taken, so that the probability of selecting any
individual for the sample is known. Levy and Lemeshow (1999) demonstrate the
differences between probability-based and non-probability-based sampling schemes. The
distinguishing characteristic is that in probability-based schemes, each element or
individual in the sampling frame, which is a list of the population from which the sample
can be drawn, has a known, non-zero probability of being selected in the sample.
Quota sampling or “open-to-the-public” Internet polling is considered non-probability-
based because there is no way to achieve a non-zero probability of selection for
individuals who do not take the survey (Weisberg, Krosnick, and Bowen 1996).
There are several types of designs for sample surveys, including simple random
sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage
combinations of these. All of these sampling designs are relevant for geographers
engaged in survey research. The decision on the type of sampling design to employ is an
important one and requires some careful consideration. Typically, this decision depends
on the objectives of the study and the data that are available in the sampling design
process. Korn and Graubard (1999) and Levy and Lemeshow (1999) provide detailed
descriptions of the different sampling designs and the properties associated with each
one. In simple random sampling, each element in a population has the same probability of
being selected in the sample. In complex sample designs, such as cluster or stratified
sampling, each element does not necessarily have the same probability of being sampled.
Instead, the probability of being sampled depends on the type of complex sample design
one uses. In addition, the statistical formulas to calculate population parameter estimates,
such as means and standard errors of mean estimates, account for the probability of
selection and hence differ across the numerous types of sampling designs.
The different types of sample surveys and associated estimation methods have not been
well distinguished in the geography literature. When differentiating between cluster
sampling and stratified sampling, Holmes (1967) briefly states to a geographic audience
that scholars have erroneously applied methods for simple random samples to complex
samples and have thus reached incorrect conclusions. This statement
was ancillary to Holmes’s main theme and has not been brought to the attention of many
geographers. A review of papers dealing with sample surveys in the geographic literature
reveals that increased attention to the appropriate analysis of sample data is needed.
Others (Poon 2007) have noted an underreporting of sampling design details in some
geographic journals. The work in many papers does not explicitly account for the
sampling design in the analysis, which requires a design-based analysis. A design-based
analysis treats the data as a sample, sometimes a complex one, from a finite population
and explicitly considers the sampling design by applying a statistical weight to each
observation. In contrast, a model-based analysis assumes the data are a simple random
sample from an infinite population, and are independent and identically distributed (Kott
1991). There are numerous examples in the recent geographic literature of using model-
based analysis of complex survey sample data, not accounting for sampling from a finite
population, and not reporting sampling design details. Several commonly used statistical
packages in human geography research, such as SPSS, have historically assumed data are
a simple random sample from an infinite population and, therefore, performed model-
based analyses of sample survey data. A historical dearth of statistical software that
enabled design-based analysis of sample survey data may have contributed to the current
underutilization of this analytical approach. Fortunately, there are statistical software
packages, such as SUDAAN (Research Triangle Institute 2001) and Stata (Stata 2005),
that enable one to perform design-based analysis of complex sample survey data with
ease. In addition, current versions of SPSS (SPSS Inc. 2006) allow for design-based
analysis of sample surveys using a complex sample add-on module that specifies the
sample design parameters through a complex sample file.
While some geographers have not fully recognized the benefits of a design-based
approach to sample survey analysis, colleagues outside the discipline have been more
active in this area. Lemeshow et al. (1998) provide an excellent example for
biostatisticians and epidemiologists of the difference between design-based analysis and
model-based analysis in the study of the association between wine consumption and
dementia that argues for design-based analysis. Brogan (1998) demonstrates to the
biostatistics community the difference in results between using SUDAAN (design-based)
and SAS (model-based) software to analyze the Behavioral Risk Factor Surveillance
System (BRFSS) surveys. Both Lemeshow et al. (1998) and Brogan (1998) illustrate that
biased point estimates and inappropriate standard errors can result from using non-
specialized statistical software to analyze sample surveys. Work by Pickle and Su (2002)
and Pickle et al. (2007) in public health journals also adjusts for sample survey statistical
weights when analyzing the BRFSS health surveys over several years. Pickle and Su
(2002) produce smoothed maps at the county level of certain disease risk factor estimates,
such as proportion smoking and proportion at risk of obesity, using the BRFSS. Pickle et
al. (2007) use population estimates of certain risk factors from the BRFSS as covariates
in a statistical model to predict new annual cancer cases. Schaible (1996) and Malec
(1996) describe the more complicated method of indirect estimates of population
parameters, which rely on information from other locations or time periods and are
designed to increase accuracy of estimates when sample sizes are small.
Regrettably, many geographers have neither been exposed to the sample survey
literature in biostatistics/statistics nor had formal education in sample survey
methodology. This is a situation that should be addressed directly in the geographic
literature. The work presented in this article draws on the existing literature to raise the
awareness of geographers to the importance of using the design-based approach when
analyzing sample survey data and is aimed primarily at an audience of human geography
researchers engaged in or planning sample survey research in a range of geographic sub-
disciplines. To illustrate our point, we demonstrate differences in results from model-
based and design-based analyses of cancer prevalence in a population of predominantly
low-income, minority women in urban environments in North Carolina and South
Carolina using sample survey data from the CARES (Carolinas education and screening)
study. We compare estimates of population means and proportions and standard errors of
estimates for numerous variables collected in the study. We also compare the results of
logistic regression models to measure associations between potential risk factors and
colon and non-colon cancer prevalence. We perform the analysis in Stata software and
describe how to do so. This example illustrates the different sample survey analysis
approaches and should be of substantive interest to readers pursuing a variety of
human geographic research agendas, especially those in medical and health
geography, a currently growing area of research (see, for example, Kolivras 2006;
Langford and Higgs 2006; Strait 2006; Xu et al. 2006; Sui 2007; Wheeler 2007).
Methods
The CARES Study
In this section, we present the background for an analysis of a sample survey study. The
initial analysis of the sample survey data by the study investigators did not consider the
sampling design, but rather treated the complex stratified sample as a simple random
sample from an infinite population. We contrast this model-based analysis with the
appropriate design-based analysis. The CARES study was a three-year study that began in
2000 to investigate colorectal cancer screening among women aged fifty years and
older who resided in subsidized housing communities in cities in North Carolina and
South Carolina. The primary goal of the complete study was to assess the impact of an
intervention program delivered by American Cancer Society (ACS) volunteers to
improve attitudes, knowledge, and behavior related to colorectal cancer screening. We
had access to the baseline data from the principal investigator, Dr. Electra Paskett, and
analyzed these data only. The baseline survey included questions related to knowledge of
colon cancer risk factors, history of cancer and colon cancer screening, as well as
demographic questions and medical care questions. The motivation for the CARES study
was the substantial risk for colorectal cancer among the female population in the United
States, as it is the third most common cancer among women (Landis et al. 1999). In
addition, colorectal cancer age-adjusted incidence and mortality rates are higher for
African-Americans than any other racial/ethnic group in the United States, as mortality
rates for colorectal cancer are 20-30 percent higher in African-Americans than for other
ethnicities and five-year survival rates are 15 percent lower for African-Americans than
for Caucasians (Landis et al. 1999). Regarding screening, fewer colorectal cancers are
detected at localized stages among African-Americans compared to Caucasians.
Colorectal cancer screening is an important preventative measure, as improved survival
rates have been linked to the use of early detection tests and changes in lifestyle factors
(Chu et al. 1994).
The survey population was women aged fifty and older who resided in subsidized
housing communities in ten cities in North Carolina and South Carolina. The population
was low-income and primarily minority, as 79 percent of the women in the population
were African-American. The investigators selected the ten cities because each contained
active American Cancer Society volunteer units, was within approximately three hours'
driving distance of the ACS project staff base in Charlotte, NC, and contained a housing
authority managing subsidized housing communities that would cooperate with the
project. A small number of women were sampled in an eleventh city, Anderson, South
Carolina, but the data were not available for this city; hence, it was excluded from the
study population. The cities
that contained the study population are displayed in Figure 1. The sizes of the survey
population and sample by city are listed in Table 1. The sampling plan of the study was to
draw a simple random sample (SRS) from each city using a compiled list of all women
aged fifty years and older residing in the housing authority in each city. After SRS
selection, each woman was sent a letter, followed by a maximum of five attempts to
complete an interview in a home visit by an interviewer. The interview was
approximately thirty to forty-five minutes in duration. The investigators expected an
approximately 80 percent response rate to interview requests, and the actual rate was
slightly higher. An assumption of the analysis in this article is that the non-respondents
were not significantly different from those who were interviewed, in terms of
the characteristics that were measured.
The survey sample design is actually a single-stage stratified random sample, stratified by
city. In this design, all the cities (strata) were first selected, and then a simple random
sample was taken of women on the listing of sampling units for each city. There are two
advantages to using stratified sampling instead of simple random sampling. First,
stratified sampling readily provides population mean and proportion estimates for each
stratum, or subdomain, of the population. Holmes (1967) noted the importance of
stratified sampling for geographers interested in local estimates of variables under study
and Wood (1955) provides an early example of using stratified random sampling in
geographic research. Second, stratified sampling offers the potential of lower standard
errors of population estimates under certain conditions, such as homogeneity within
each stratum and heterogeneity between strata, thereby yielding improved estimate precision
and reliability (Levy and Lemeshow 1999).
In the CARES sampling design, the list of women in each city constitutes the sampling
frame, where every element on the list for each city had an equal probability of being
selected in the sample. The probability of being sampled was not the same for all women
in the population, however, as the probability of selection depends on the population in
each city. Defining $N_k$ as the population size in city $k$, and $n_k$ as the sample size in city
$k$, the probability of selecting an individual in the sample is $n_k / N_k$, which depends on
the fixed population size that varies from city to city. It should be clear that this is not a
simple random sample from an infinite population, where all sampling elementary units
have an equal probability of being sampled. Design-based analysis of sample surveys
incorporates statistical weights, which account for how many observations in the
population a sampled observation represents. The statistical weight, $w_k$, is equal to 1
divided by the probability of selection of an element; therefore, it is $w_k = N_k / n_k$ for
sampled observations in city $k$. In the CARES example, the statistical weights are easy
to calculate, but this may not always be the case, such as when the population counts are
not known. Statistical weights may also be more difficult to calculate in more complex
sampling designs; however, weights are often provided for users of large national
surveys, such as the National Health and Nutrition Examination Survey (National Center
for Health Statistics 2006), so that unbiased estimates and proper standard errors may be
calculated. In practice, some statistical analysis software, such as Stata, allows the user to
specify a variable that contains the statistical weights for sampled observations.
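
The weight calculation described above can be sketched in a few lines of Python. This is only an illustration: the city labels and the population and sample counts below are hypothetical placeholders (the actual CARES counts are those listed in Table 1), and the function name is our own.

```python
# Hypothetical stratum sizes for illustration only; not the CARES counts.
population = {"City A": 400, "City B": 150, "City C": 60}  # N_k
sample = {"City A": 80, "City B": 50, "City C": 30}        # n_k


def design_weights(pop_sizes, sample_sizes):
    """Return the statistical weight w_k = N_k / n_k for each stratum k."""
    weights = {}
    for k, N_k in pop_sizes.items():
        n_k = sample_sizes[k]
        if not 0 < n_k <= N_k:
            raise ValueError(f"sample size for {k} must be between 1 and N_k")
        weights[k] = N_k / n_k
    return weights


weights = design_weights(population, sample)

# Each sampled observation in stratum k represents w_k population members,
# so the weights summed over all sampled observations recover N.
represented = sum(sample[k] * weights[k] for k in weights)
```

In this hypothetical setting, a sampled woman in the largest city stands in for more members of the population than one in the smallest city, which is exactly the information a model-based analysis discards.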
The fact that a sample does not come from an infinite population is accounted for by the
finite population correction (FPC). The FPC is a multiplier adjustment to the estimate of
a standard error of a population parameter estimate and has the effect of reducing the
standard error as the sample size n approaches the population size N (Levy and
Lemeshow 1999). The formula for the FPC is $(N - n)/(N - 1)$; it multiplies the variance, so the standard error is scaled by its square root. The FPC can be seen in
the equation for the standard error of a sample mean for variable x in a stratified random
sample,
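$$SE(\bar{x}) = \frac{1}{N}\left[\sum_{k=1}^{K} N_k^2 \left(\frac{\sigma_k^2}{n_k}\right)\left(\frac{N_k - n_k}{N_k - 1}\right)\right]^{1/2}, \qquad (1)$$
where there are $K$ strata and the population stratum variance is
$$\sigma_k^2 = \frac{\sum_{i=1}^{N_k} \left(X_{ik} - \bar{X}_k\right)^2}{N_k}.$$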
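
As a minimal sketch, equation (1) can be evaluated directly once the stratum population sizes, sample sizes, and variances are in hand. The function name and the example strata below are hypothetical, and the sketch assumes every stratum has more than one population member so that the denominator N_k - 1 is non-zero.

```python
import math


def stratified_se_mean(strata):
    """Standard error of a stratified sample mean, following equation (1).

    strata: iterable of (N_k, n_k, sigma2_k) tuples, one per stratum,
    where sigma2_k is the stratum variance. Assumes N_k > 1.
    """
    strata = list(strata)
    N = sum(N_k for N_k, _, _ in strata)  # total population size
    total = 0.0
    for N_k, n_k, sigma2_k in strata:
        fpc_k = (N_k - n_k) / (N_k - 1)  # finite population correction
        total += N_k ** 2 * (sigma2_k / n_k) * fpc_k
    return math.sqrt(total) / N


# Hypothetical strata: (population size, sample size, stratum variance)
example_strata = [(101, 26, 26.0), (51, 11, 4.0)]
se_example = stratified_se_mean(example_strata)

# When every stratum is fully enumerated, the FPC is zero in each stratum
# and the standard error vanishes.
se_census = stratified_se_mean([(50, 50, 3.0), (20, 20, 1.0)])
```

In practice the stratum variances are estimated from the sample, and survey software such as Stata handles this internally; the sketch only makes the structure of the formula explicit.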
The use of the FPC offers notably improved precision of population parameter estimates
under certain conditions, and there is no penalty for using it. Whether the goal is to make
statistical statements about a sample estimate itself or to infer about a population
parameter, the FPC will effectively reduce the standard error. As the sample size
increases relative to the population size, the effect of the FPC becomes more dramatic in
decreasing the standard error, and gains in precision can be nearly 70 percent. For
example, if N = 100 and n = 90, the FPC will reduce the standard error by approximately
68 percent, thereby increasing the reliability of the estimate.
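
The arithmetic behind this example can be checked with a short sketch. It assumes, consistent with equation (1), that the FPC enters the standard error under the square root; the function names are our own.

```python
import math


def fpc(N, n):
    """Finite population correction (N - n) / (N - 1)."""
    return (N - n) / (N - 1)


def se_reduction(N, n):
    """Fractional reduction in the standard error from applying the FPC.

    The FPC multiplies the variance, so the standard error is scaled
    by sqrt(FPC) and reduced by 1 - sqrt(FPC).
    """
    return 1.0 - math.sqrt(fpc(N, n))


# The example from the text: N = 100, n = 90.
reduction = se_reduction(100, 90)
print(f"SE reduction for N=100, n=90: {reduction:.0%}")

# The reduction grows as the sample covers more of the population.
for n in (10, 50, 90):
    print(n, round(se_reduction(100, n), 3))
```

With N = 100 and n = 90 the multiplier on the standard error is the square root of 10/99, which gives the reduction of approximately 68 percent quoted above.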
In the CARES study data, there were some missing values for responses to certain
questions of interest. Rather than impute these missing values with a simple method, such as mean
imputation, and artificially decrease the standard errors of the estimates, we chose to delete
these observations by variable and account for the deletion by adjusting the
corresponding statistical weights. For example, in estimating the proportion of people
who have been told by a doctor they had colon cancer, there were two observations with
missing values (skipped the question) and these observations were dropped. Multiple
imputation would be an alternative to fill in the missing values and account for the extra
variation in the imputed values (Korn and Graubard 1999), particularly if there were
more missing observations. Using mean imputation with a binary variable, such as
presence of colon cancer, is clearly problematic. As a result of removing some
observations, the statistical weights are different for some variables. In notation, the
statistical weights for variable $i$ in city $k$ are $w_{ik} = N_k / n_{ik}$, where $n_{ik} = n_k - n_{ik}^{0}$ is the
sample size for variable $i$ in city $k$ after removing the $n_{ik}^{0}$ observations with missing
values for variable $i$. Removing one observation effectively increases the weights of the
other sampled observations, as each observation now represents a slightly larger number
of individuals in the population.
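
A minimal sketch of this weight adjustment follows. The stratum counts are hypothetical and the function name is our own; only the formula, the stratum population size divided by the number of non-missing sampled observations, comes from the text.

```python
def adjusted_weight(N_k, n_k, n_missing):
    """Weight for one variable in stratum k after dropping missing values.

    N_k: stratum population size; n_k: stratum sample size;
    n_missing: sampled observations with a missing value for the variable.
    """
    n_ik = n_k - n_missing  # usable sample size for this variable
    if n_ik <= 0:
        raise ValueError("no non-missing observations remain in the stratum")
    return N_k / n_ik


# Hypothetical stratum: 120 women in the population, 40 sampled.
w_complete = adjusted_weight(120, 40, 0)  # no missing values
w_adjusted = adjusted_weight(120, 40, 2)  # two respondents skipped the item

# The remaining 38 observations still represent all 120 women.
represented = (40 - 2) * w_adjusted
```

Because the adjustment depends on the variable, a dataset analyzed this way carries one weight variable per survey item with missing responses rather than a single weight.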
We used the statistical weights to calculate, in Stata version 9, estimates of the
population mean and proportion for numerous variables. To do so in Stata, we first
specify the survey design settings using the svyset command. The statistical weights,
strata, and finite population correction are specified using the variables WEIGHT, CITY,
and N, where WEIGHT contains the statistical weight for each observation, CITY is a
number indicating the city where each observation resides, and N is the size of the survey
population for each city. The variable WEIGHT is calculated using the population and
sample sizes by strata. After the survey design parameters have been specified, we
estimate proportions, means, and standard errors of estimates for certain variables. For
example, to estimate the overall prevalence of colon cancer (CCANCER), we use the
following commands with the binary variable CCANCER.
. svyset _n [pw=WEIGHT], strata(CITY) fpc(N)
. svy: prop CCANCER
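
To make the role of the weights concrete, the following Python sketch compares a weighted proportion with the unweighted sample proportion. It is an illustration under stated assumptions, not a reimplementation of the Stata commands: the responses and weights are hypothetical, the function name is our own, and only the point estimate is reproduced, not the design-based standard error.

```python
def weighted_proportion(values, weights):
    """Weighted estimate of a proportion: sum(w_i * y_i) / sum(w_i)."""
    total_weight = sum(weights)
    return sum(w * y for y, w in zip(values, weights)) / total_weight


# Hypothetical binary responses (1 = condition present) from two strata.
# Stratum A: N = 100, n = 4, so each observation has weight 25.
# Stratum B: N = 20,  n = 4, so each observation has weight 5.
responses = [1, 0, 0, 0, 1, 1, 0, 0]
obs_weights = [25, 25, 25, 25, 5, 5, 5, 5]

design_based = weighted_proportion(responses, obs_weights)

# Ignoring the design treats every observation as equally representative.
model_based = sum(responses) / len(responses)
```

In this made-up example the unweighted proportion is pulled toward the small stratum, which is over-represented in the sample relative to its share of the population; the weights correct for that.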
Results
Estimation of Population Parameters
The original investigators in the CARES study were primarily interested in survey
questions related to colon cancer screening history as response variables. These variables
were FOB, an indicator for having a fecal occult blood test in the past, and FLEX, an
indicator for having a flexible sigmoidoscopy in the past year. For this analysis, we are
also interested in the variables related to prevalence of cancer. The two variables of
interest are CCANCER, an indicator for having a doctor diagnose the individual with
colon cancer in the past, and OCANCER, an indicator for having a doctor diagnose the
individual with another cancer in the past. The population proportion estimates for these
four response variables are listed in Table 2 for both the design-based and model-based
analyses. The model-based estimates are calculated by ignoring the sample survey design
and treating the data as a simple random sample from an infinite population. This is
accomplished easily in Stata by not setting the survey parameters and calculating
descriptive statistics. The design-based estimates include a finite population correction.
The estimates of the population proportions are somewhat similar with the two analytic
approaches; however, the proportions are consistently higher with the design-based
approach. The largest difference is with the estimated proportion of the population that
has had a fecal occult blood test, which is about 20 percent higher in the design-based
analysis. Hence, using only the model-based approach would substantially underestimate
the proportion of women who have had a fecal occult blood test. In addition, the model-based
estimates are biased estimates of the population parameters. The standard errors are
lower for the design-based approach only in the prevalence of cancer other than colon.
This may be due to reasons such as sample element allocation issues, lack of
homogeneity within each stratum, lack of heterogeneity between strata, as well as the
small number of cancer cases in the survey population. Stratified sampling will generally
give lower standard errors when there is homogeneity within stratum and heterogeneity
between strata. Simple random sampling underestimates the uncertainty of the population
parameter estimates in the sample survey. The proportion estimates by stratum for colon
and other cancer prevalence are listed in Table 3 using the design-based approach. The
proportions are indeed more heterogeneous between strata with other cancer, as High
Point, NC has a much higher prevalence than the other cities. This is clear in the plotted
estimates of other cancer prevalence in Figure 1. The proportion estimates are more
homogeneous between strata for colon cancer simply because there are fewer
cases of colon cancer in the population and many of the cities had no sampled individuals
with a history of colon cancer.
There were many questions in the CARES survey related to cancer risk factors and
knowledge of these factors, as well as colon cancer screening methods. In this analysis,
we selected a subset of these for which to calculate population parameter estimates. We
produced estimates of the following variables: race of respondent (WHITE, BLACK,
OTHER), age of respondent (AGE), knowledge about exercise as a preventative measure
for colon cancer (EXERCISE), knowledge about a healthy diet as a preventative measure
for colon cancer (DIET), having a doctor suggest a fecal occult blood test in the past
(DOCFOB), having a doctor suggest a flexible sigmoidoscopy test in the past
(DOCFLEX), and status as a current smoker (SMOKING). We later use these as
predictor variables in logistic regression models to explain cancer prevalence. The
estimates of these variable means and proportions and standard errors of the estimates are
listed in Table 4 for both the design-based and model-based analyses. Aside from the
estimate for mean age, the estimates are proportions, as the other variables are all
indicator variables. The mean and proportion estimates are similar for the two analytic
approaches, with some small differences. Again, the model-based estimates are biased.
Regarding the reliability of the estimates, the standard errors are either the same or lower
with the design-based analysis for all of the variables. Therefore, the design-based
analysis estimates are overall more precise than those from the model-based analysis.
Taken together, the estimates in Table 2 and Table 4 demonstrate that there can be more
heterogeneity between strata and more homogeneity within stratum for one variable than
for another, and the gain in precision of population parameter estimates from using
stratified random sampling need not be consistent across variables.
Assessing Risk Factors for Cancer Prevalence with Logistic Regression
To study the association of risk factors with past diagnosis of colon and other cancer, we
next built logistic regression models using the previously defined explanatory variables.
The response variables are the binary indicators for colon and other cancer diagnosis,
which are the variables used to estimate cancer prevalence in the population. The logistic
regression was carried out in Stata. We built separate regression models for colon cancer
and other cancer using the design-based approach, starting with the common confounder
AGE in the models, adding other variables to the model one at a time, and retaining only
those variables that were statistically significant near the significance level α = 0.05. To
fit the logistic regression models in Stata while considering the sampling design, we first
specify the sample design settings as discussed previously and then use the survey logit
command. For example, the logistic regression model to explain colon cancer occurrence
using variables AGE and SMOKING is fitted with the following command, where “or” is
an option that specifies that the estimated odds ratio for each variable should be returned.
. svy: logit CCANCER AGE SMOKING, or
We verified the model assumption of linearity in the logit for the continuous age variable
by fitting a logistic regression model for colon cancer with quartile design variables for
age (see Hosmer and Lemeshow 2000 for details). A plot of estimated regression
coefficients versus age quartile midpoints suggested some nonlinearity in the logit for
AGE, but a quadratic term for AGE did not significantly add to the model. Hence, AGE
was retained as a continuous variable in subsequent logistic regression models in the
design-based analysis. After building the design-based model, we fitted the same model
using a model-based scenario for comparison. We again checked for linearity in the logit
for AGE in the model-based analysis using quartile design variables and reached the
same conclusion as with the design-based analysis. In addition, we used fractional
polynomials (Hosmer and Lemeshow 2000) to verify a linear relationship for AGE with
the outcome variable. We assessed goodness-of-fit using the Hosmer-Lemeshow test with
ten categories and evaluated the discrimination of the model with the receiver operating
characteristic (ROC) curve. Both goodness-of-fit and discrimination were adequate.
Currently, goodness-of-fit tests, ROC curves for model discrimination, and fractional
polynomials cannot be applied to complex survey data in Stata. Therefore, we used these
assessment tools in the model-based analysis. We repeated these steps for the non-colon
cancer model and did not find evidence that AGE was nonlinear in the logit in either the
design-based or model-based analyses; goodness-of-fit and model discrimination were
again adequate.
The logistic regression model results for colon cancer for both design-based and model-
based analytic approaches are listed in Table 5. The table includes the estimated odds
ratio (OR), 95% confidence interval for the odds ratio, and the associated p-value for the
Wald test for each variable. The odds ratio measures the multiplicative increase in the
odds for the response variable associated with an increase in the exposure variable. For a
binary exposure variable, such as status as a current smoker, the increase in odds is
associated with the presence of the condition. For a continuous variable, such as age, the
multiplicative increase in odds measured by the odds ratio reflects a one-unit increase in
the variable, here, one year of age. For colon cancer, the only variable in addition to AGE
that is significant is SMOKING. Both age and smoking status have a positive association
with history of colon cancer, as would be expected. Based on the statistics in Table 5, the
variable coefficients are more significant in the design-based analysis than in the model-
based analysis. The smoking status effect is 9 percent higher in the design-based analysis.
In fact, the effect of smoking is not significant in the model-based analysis at the typical
5% significance level, while it clearly is in the design-based analysis, given that the 95%
confidence interval for the odds ratio does not contain 1, the neutral value of equal risk
for exposed and non-exposed groups. The significance level is subjective and we specify
a value that is a commonly used threshold. The example here is one where using the
model-based approach with sample survey data would lead to an incorrect conclusion
about the significance of a relationship in the population. The logistic regression model
results for history of other cancer for both design-based and model-based analytic
approaches are listed in Table 6. For this outcome, both age and current smoking status
are significant at the 0.05 level for both the design-based and model-based analyses. Both
age and smoking have a positive relationship with non-colon cancer status, as would be
expected.
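As a hedged illustration of the odds ratio interpretation above, the following Python sketch rescales the per-year odds ratio for AGE to a ten-year difference, compares the two smoking odds ratios, and checks whether each 95% confidence interval for SMOKING excludes 1. It uses only the rounded values reported in Table 5. The helper names are ours and are not part of the original Stata analysis, and the rescaling assumes AGE is linear in the logit.

```python
# Illustrative only: values are the rounded estimates reported in Table 5.
# Helper names (scale_odds_ratio, excludes_one) are hypothetical.

def scale_odds_ratio(or_per_unit: float, units: float) -> float:
    """Odds ratio for a `units`-unit increase, assuming a linear logit."""
    return or_per_unit ** units


def excludes_one(lower: float, upper: float) -> bool:
    """True when a confidence interval for an odds ratio does not contain 1."""
    return lower > 1.0 or upper < 1.0


# (odds ratio, lower 95% limit, upper 95% limit) for the colon cancer model
table5 = {
    "design": {"AGE": (1.057, 0.999, 1.119), "SMOKING": (4.319, 1.056, 17.671)},
    "model": {"AGE": (1.042, 0.986, 1.102), "SMOKING": (3.952, 0.921, 16.954)},
}

# Odds ratio associated with a ten-year difference in age (design-based fit)
age_or_10yr = scale_odds_ratio(table5["design"]["AGE"][0], 10)

# Relative size of the smoking odds ratios from the two approaches
smoking_ratio = table5["design"]["SMOKING"][0] / table5["model"]["SMOKING"][0]

# Does the smoking interval exclude 1 under each approach?
significant = {
    approach: excludes_one(*values["SMOKING"][1:])
    for approach, values in table5.items()
}

print(f"Design-based OR for a 10-year age increase: {age_or_10yr:.2f}")
print(f"Design/model ratio of smoking ORs: {smoking_ratio:.3f}")
print(f"Smoking CI excludes 1: {significant}")
```

The interval check mirrors the reasoning in the text: an interval that excludes 1 corresponds to rejection at the 5% level.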
Conclusions
In this article, we emphasize to geographers the importance of using a design-based
approach when analyzing sample survey data. We focus on the sampling of individuals
and not on sampling of spatial units (see Griffith 2005). We have demonstrated by
example with a cancer sample survey that results from model-based and design-based
analyses differ. Model-based analysis of sample survey data can result in biased estimates
of population parameters. In the analyses of this sample survey, there were gains in
precision of the population parameter estimates of many variables when taking into
account the study design and finite size of the population. This results in increased
confidence in population parameter estimates. Even when inference on the population is
not of primary interest, precision gains in sample standard errors are possible through
using the finite population correction. In stratified sampling designs with heterogeneity
between strata and homogeneity within each stratum, design-based estimates of population
parameters have lower standard errors and are therefore more precise than those from
model-based analysis. In other
sampling schemes, the standard errors of estimates of population parameters from model-
based analysis can underestimate the uncertainty in the estimates. Overall, estimating
population proportions and means is more reliable when using the appropriate method of
analyzing sample survey data, and is worth the additional analysis steps if sampling
weights, which give the number of people in the population that each sampled person
represents, are available.
Also, in our example, taking into account the design of the sample survey resulted in
finding through logistic regression a statistically significant association between smoking
and colon cancer in the CARES population, where this was not found using the model-
based analysis that inappropriately assumed a simple random sample from an infinite
population. In general, failing to account for a finite survey population and not utilizing
appropriate sample survey analysis techniques can lead to misrepresentation of the data
and incorrect conclusions about the significance of associations in regression models.
Results of sample survey analysis could have policy implications and real
ramifications, so it is important to prepare and report findings carefully.
Readily available statistical software packages explicitly take into account the sampling
design and make design-based analysis of sample surveys relatively straightforward. We conclude
that geographers should perform design-based analysis of complex sample survey data
whenever possible when making inferences about a population, and make this
recommendation to researchers undertaking the design and analysis of sample surveys in
human geography. In closing, we encourage geographers involved in survey research to
invest considerable thought into the sampling design of sample surveys and to provide
a detailed description of that design when reporting the results of survey research.
References
Brogan, D. J. 1998. Software for sample survey data, misuse of standard packages. In
Encyclopedia of biostatistics, volume 5. ed. P. Armitage and T. Colton, 4167-4174. New
York, NY: Wiley & Sons.
Chu, K. C., R. E. Tarone, W. H. Chow, B. F. Hankey, and L. A. Ries. 1994. Temporal
patterns in colorectal cancer incidence, survival, and mortality through 1990. Journal of
the National Cancer Institute 87:997.
Griffith, D. A. 2005. Effective geographic sample size in the presence of spatial
autocorrelation. Annals of the Association of American Geographers 94(5):740-760.
Holmes, J. 1967. Problems in location sampling. Annals of the Association of American
Geographers 47:757-780.
Hosmer, D. W., and S. Lemeshow. 2000. Applied logistic regression, 2nd Edition. New
York, NY: John Wiley & Sons.
Kolivras, K. N. 2006. Mosquito habitat and dengue risk potential in Hawaii: a conceptual
framework and GIS application. The Professional Geographer 58(2):139-154.
Korn, E. L., and B. I. Graubard. 1999. Analysis of health surveys. New York, NY: John
Wiley & Sons.
Kott, P. S. 1991. A model-based look at linear regression with survey data. The American
Statistician 45:107-112.
Landis, S. H., T. Murray, S. Bolden, and P. A. Wingo. 1999. Cancer statistics. CA: A
Cancer Journal for Clinicians 49:8-31.
Langford, M., and G. Higgs. 2006. Measuring potential access to primary healthcare
services: the influence of alternative spatial representations of population. The
Professional Geographer 58(3):294-306.
Lemeshow, S., L. Letenneur, J. F. Dartigues, S. Lafont, J. M. Orgogozo, and D.
Commenges. 1998. Illustration of analysis taking into account complex survey
considerations: the association between wine consumption and dementia in the PAQUID
study. American Journal of Epidemiology 148:298-306.
Levy, P. S., and S. Lemeshow. 1999. Sampling of populations: methods and applications.
New York, NY: John Wiley & Sons.
Li, S. M., and Y. M. Siu. 2001. Residential mobility and urban restructuring under market
transition: a study of Guangzhou, China. The Professional Geographer 53(2):219-229.
Malec, D. 1996. Model based state estimates from the National Health Interview survey.
In Indirect estimators in U.S. federal programs. ed. W. L. Schaible, 145-167. New York,
NY: Springer-Verlag.
Manheim, J., and R. Rich. 1995. Empirical political analysis: research methods in
political science. 4th ed. White Plains, NY: Longman.
McSweeney, K. 2002. Who is “forest-dependent”? Capturing local variation in forest-
product sale, Eastern Honduras. The Professional Geographer 54(2):158-174.
McSweeney, K. 2004. Forest product sale as natural insurance: the effects of household
characteristics and the nature of shock in eastern Honduras. Society and Natural
Resources 17:39-56.
National Center for Health Statistics. 2006. Analytic and reporting guidelines: The
National Health and Nutrition Examination Survey (NHANES). Hyattsville, MD.
Overmars, K. P., and P. H. Verburg. 2005. Analysis of land use drivers at the watershed
and household level: linking two paradigms at the Philippine forest fringe. International
Journal of Geographical Information Science 19:125-152.
Paudel, G. S., and G. B. Thapa. 2004. Impact of social, institutional and ecological
factors on land management practices in mountain watersheds of Nepal. Applied
Geography 24:35-55.
Pickle, L. W., Y. Hao, A. Jemal, Z. Zou, R. C. Tiwari, E. Ward, M. Hachey, H. L. Howe,
and E. J. Feuer. 2007. A new method of estimating United States and state-level cancer
incidence counts for the current calendar year. CA: A Cancer Journal for Clinicians
57(1):30-42.
Pickle, L. W., and Y. Su. 2002. Within-state geographic patterns of health insurance
coverage and health risk factors in the United States. American Journal of Preventive
Medicine 22(2):75-83.
Poon, J. 2007. Instrumentation rigor and practice. Environment and Planning A
39(5):1017-1019.
Research Triangle Institute. 2001. SUDAAN user’s manual, release 8.0. Research
Triangle Park, NC: Research Triangle Institute.
Schaible, W. L. 1996. Introduction and summary. In Indirect estimators in U.S. federal
programs. ed. W. L. Schaible, 1-15. New York, NY: Springer-Verlag.
Sinai, I. 2002. The determinants of the number of rooms occupied by compound dwellers
in Kumasi, Ghana: does working at home mean more rooms? Applied Geography
22:77-90.
SPSS Inc. 2006. Command Syntax Reference. Chicago, IL.
Stata. 2005. Release 9 user’s guide. College Station, TX: Stata Press.
Strait, J. B. 2006. An epidemiology of neighborhood poverty: causal factors of infant
mortality among blacks and whites in the metropolitan United States. The Professional
Geographer 58(1):39-53.
Sui, D. Z. 2007. Geographic information systems and medical geography: toward a new
synergy. Geography Compass 1(3):556-582.
Takasaki, Y., B. L. Barham, and O. T. Coomes. 2001. Amazonian peasants, rain forest
use, and income generation: the role of wealth and geographical factors. Society and
Natural Resources 14:291-308.
Weisberg, H. F., J. A. Krosnick, and B. D. Bowen. 1996. An introduction to
survey research, polling, and data analysis. 3rd ed. Thousand Oaks, CA: Sage
Publications.
Wheeler, D. 2007. A comparison of spatial clustering and cluster detection techniques for
childhood leukemia incidence in Ohio, 1996-2003. International Journal of Health
Geographics 6:13.
Wood, W. F. 1955. Use of stratified random samples in a land use study. Annals of the
Association of American Geographers 45(4):350-367.
Wyllie, D. S., and G. C. Smith. 1996. Effects of extroversion on the routine spatial
behavior of middle adolescents. The Professional Geographer 48(2):166-180.
Xu, B., P. Gong, E. Seto, S. Liang, C. Yang, S. Wen, D. Qiu, X. Gu, and R. Spear. 2006.
A spatial-temporal model for assessing the effects of intervillage connectivity in
schistosomiasis transmission. Annals of the Association of American Geographers
96(1):31-46.
Yankson, P. W. K. 2000. Houses and residential neighbourhoods as work places in urban
areas: the case of selected low income residential areas in Greater Accra Metropolitan
Area (GAMA), Ghana. Singapore Journal of Tropical Geography 21(2):200-214.
Tables
City                 Population   Sample
Winston-Salem, NC           459       49
Greensboro, NC              652       55
High Point, NC              385       41
Greenville, NC              379       57
Rocky Mount, NC             231       43
Wilson, NC                  149       17
Charlotte, NC              1006      126
Rock Hill, SC                87       13
Spartanburg, SC             440       69
Greenville, SC              331       47
Total                      4119      517
Table 1. Total population counts and sampled counts of women aged fifty years or older who resided in subsidized housing communities in the ten cities of the survey design area at the time of the CARES project
            Design-based analysis    Model-based analysis
Variable    Mean     SE(mean)        Mean     SE(mean)
FOB         0.122    0.021           0.102    0.018
FLEX        0.062    0.013           0.059    0.012
CCANCER     0.019    0.006           0.018    0.006
OCANCER     0.077    0.011           0.076    0.012
Table 2. Estimates of population proportions and standard errors of estimates for the colon cancer screening variables (FOB, FLEX) and the cancer history variables (CCANCER, OCANCER) in the CARES data
                     Other cancer                     Colon cancer
City                 Mean    Design SE   Model SE     Mean    Design SE   Model SE
Winston-Salem, NC    0.041   0.027       0.029        0.041   0.027       0.029
Greensboro, NC       0.073   0.034       0.035        0.019   0.018       0.019
High Point, NC       0.171   0.056       0.059        0.024   0.023       0.024
Greenville, NC       0.054   0.028       0.030        .       .           .
Rocky Mount, NC      0.070   0.035       0.039        .       .           .
Wilson, NC           .       .           .            .       .           .
Charlotte, NC        0.080   0.023       0.024        0.024   0.013       0.014
Rock Hill, SC        .       .           .            .       .           .
Spartanburg, SC      0.087   0.031       0.034        0.029   0.019       0.020
Greenville, SC       0.085   0.038       0.041        .       .           .
Table 3. Design-based estimates and standard errors of estimates within stratum (city) for prevalence of other cancer and colon cancer, where “.” indicates no estimate due to no cases
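The relationship between Tables 1, 2, and 3 can be sketched numerically. The Python below is only an illustration and is not the authors' Stata code. It assumes stratum weights equal to population shares, treats strata shown as "." for other cancer as having prevalence 0, uses the full stratum sample sizes from Table 1, and applies standard textbook formulas for a stratified proportion. Because it works from rounded table values, small discrepancies with the published figures are to be expected.

```python
import math

# (population N_h, sample n_h, other-cancer prevalence p_h) per city.
# N_h and n_h come from Table 1, p_h from Table 3.
# Assumption: strata shown as "." in Table 3 are entered as 0.0 (no cases).
strata = {
    "Winston-Salem, NC": (459, 49, 0.041),
    "Greensboro, NC": (652, 55, 0.073),
    "High Point, NC": (385, 41, 0.171),
    "Greenville, NC": (379, 57, 0.054),
    "Rocky Mount, NC": (231, 43, 0.070),
    "Wilson, NC": (149, 17, 0.0),
    "Charlotte, NC": (1006, 126, 0.080),
    "Rock Hill, SC": (87, 13, 0.0),
    "Spartanburg, SC": (440, 69, 0.087),
    "Greenville, SC": (331, 47, 0.085),
}


def srs_se(p: float, n: int) -> float:
    """SE of a proportion assuming a simple random sample, infinite population."""
    return math.sqrt(p * (1 - p) / (n - 1))


def fpc_se(p: float, n: int, N: int) -> float:
    """Same SE multiplied by the finite population correction sqrt(1 - n/N)."""
    return srs_se(p, n) * math.sqrt(1 - n / N)


N_total = sum(N for N, _, _ in strata.values())

# Stratified estimate: stratum prevalences weighted by population share.
p_strat = sum(N * p for N, _, p in strata.values()) / N_total

# Stratified SE: squared-weight sum of within-stratum variances, each with fpc.
se_strat = math.sqrt(
    sum((N / N_total) ** 2 * fpc_se(p, n, N) ** 2 for N, n, p in strata.values())
)

for city, (N, n, p) in strata.items():
    print(f"{city:18s} fpc SE = {fpc_se(p, n, N):.3f}  SRS SE = {srs_se(p, n):.3f}")
print(f"Stratified prevalence = {p_strat:.3f}, SE = {se_strat:.3f}")
```

Under these assumptions, the per-city values should track the Design SE and Model SE columns of Table 3 for other cancer, and the stratified result should land close to the design-based OCANCER row of Table 2.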
            Design-based analysis    Model-based analysis
Variable    Mean      SE(mean)       Mean      SE(mean)
WHITE       0.191     0.017          0.184     0.017
BLACK       0.770     0.018          0.779     0.018
OTHER       0.038     0.008          0.037     0.008
EXERCISE    0.141     0.015          0.136     0.015
AGE         69.465    0.477          69.430    0.492
DIET        0.191     0.017          0.189     0.017
DOCFOB      0.927     0.015          0.920     0.017
DOCFLEX     0.922     0.021          0.908     0.022
SMOKING     0.210     0.017          0.204     0.018
Table 4. Estimates of population means and proportions and standard errors of estimates for explanatory variables in the CARES dataset using both design-based analysis and model-based analysis
            Design-based analysis               Model-based analysis
Variable    OR       95% CI          p-value    OR       95% CI          p-value
AGE         1.057    0.999-1.119     0.054      1.042    0.986-1.102     0.142
SMOKING     4.319    1.056-17.671    0.042      3.952    0.921-16.954    0.064
Table 5. Odds ratios for age and smoking from the logistic regression model for history of colon cancer
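The Wald p-values in Table 5 can be approximately recovered from the reported odds ratios and confidence intervals. The sketch below is illustrative Python, not the authors' Stata code. It assumes the intervals are symmetric on the log-odds scale and uses a standard normal reference distribution with critical value 1.96, whereas survey software may instead use a t distribution with design-based degrees of freedom. The function name wald_from_ci is ours.

```python
import math


def wald_from_ci(odds_ratio: float, lower: float, upper: float,
                 z_crit: float = 1.96) -> tuple[float, float, float]:
    """Back out the log-odds SE, Wald z statistic, and two-sided p-value.

    Assumes the CI was built as exp(beta +/- z_crit * SE) (normal approximation).
    """
    beta = math.log(odds_ratio)                        # coefficient on logit scale
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = beta / se                                      # Wald statistic
    p = math.erfc(abs(z) / math.sqrt(2))               # two-sided normal tail area
    return se, z, p


# SMOKING rows of Table 5: (OR, lower 95% limit, upper 95% limit)
smoking = {
    "design-based": (4.319, 1.056, 17.671),
    "model-based": (3.952, 0.921, 16.954),
}

for label, (or_, lo, hi) in smoking.items():
    se, z, p = wald_from_ci(or_, lo, hi)
    print(f"{label}: SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}")
```

Working on the log-odds scale makes the comparison concrete: the design-based fit has both a larger coefficient and a smaller standard error for SMOKING, which together move the p-value across the 5% threshold.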
            Design-based analysis               Model-based analysis
Variable    OR       95% CI          p-value    OR       95% CI          p-value
AGE         1.029    1.000-1.058     0.049      1.030    1.000-1.062     0.048
SMOKING     2.101    1.043-4.231     0.038      2.115    1.020-4.384     0.044
Table 6. Odds ratios for age and smoking from the logistic regression model for history of other cancer