A Comparison of Design-based and Model-based Analysis of Sample Surveys in Geography

by

David C. Wheeler 1, Jason E. VanHorn 2, and Electra D. Paskett 3

Technical Report 07-11
December 2007

Department of Biostatistics
Rollins School of Public Health
Emory University
Atlanta, Georgia

1 Department of Biostatistics, Rollins School of Public Health, Emory University
2 Department of Geology, Geography, and Environmental Studies, Calvin College
3 Department of Epidemiology, College of Public Health, Ohio State University

Correspondence Author: Dr. David Wheeler
Telephone: (404) 727-8059
FAX: (404) 727-1370
e-mail: [email protected]

A comparison of design-based and model-based analysis of sample

surveys in geography

David C. Wheeler1*, Jason E. VanHorn2, Electra D. Paskett3

1 Department of Biostatistics, Rollins School of Public Health, Emory University

2 Department of Geology, Geography, and Environmental Studies, Calvin College

3 Division of Epidemiology, College of Public Health, Ohio State University

* Corresponding author: [email protected]

Abstract. Sample surveys are routinely used to gather primary data in human geography

research. We highlight the difference between design-based analysis and model-based

analysis of sample surveys and emphasize the advantages of using the design-based

approach with these data. As an example, we demonstrate differences in results from

model-based and design-based analyses of cancer prevalence in a population of

predominantly minority women in North Carolina and South Carolina. The results from

the two approaches reveal differences in population estimates of numerous variables and

a different conclusion regarding the significance of an explanatory variable in a logistic

regression model to explain colon cancer prevalence.

Key Words: logistic regression, stratified sample, survey research, cancer, Stata

Introduction

Social and health scientists frequently turn to survey research to investigate research

questions about individual behavior or characteristics where any existing data from

secondary sources would not adequately address the questions of interest. Survey

sampling research is a method of data collection where individuals provide some basis

for making extrapolations to a larger population (Manheim and Rich 1995). A sample

survey is a study involving a subset of individuals selected from a larger population,

where the members of the sample are interviewed and characteristics of interest, or

variables, are measured on each observation. Sample surveys are routinely employed in

human geography research to gather primary data to address research questions (Yankson

2000; Li and Siu 2001; Takasaki, Barham, and Coomes 2001; McSweeney 2002; Sinai

2002; McSweeney 2004; Paudel and Thapa 2004; Overmars and Verburg 2005). A

sample survey is different from a census, where all individuals in a population are

measured. The motivation for taking a sample instead of a census is primarily the

prohibitive expense, in terms of time, capital, and human resources, of enumerating a

population of interest. In some small studies where relatively few subjects are to be

measured, it may be possible and worthwhile to perform a complete census of the target

population. In some circumstances, the inability to measure all target population subjects

adequately can turn an intended census into a sample of the population (see, for example,

Wyllie and Smith 1996).

A primary objective of many sample surveys is the description of certain characteristics

of the population, achieved through sample-based estimates of the characteristics. Sample

estimates of population parameters are obtained by aggregating the measurements from

sampled individuals. Inferences about the population are then based on these summary

statistics. The reliability and validity of the summary statistics, or population parameter

estimates, are interrelated with the design of the sample. Reliability is associated with the

size of the standard error of an estimate and validity is measured by the bias, or deviation

from the true population value, of an estimate. Reliability of an estimate can only be

assessed if a probability-based sample is taken, so that the probability of selecting any

individual for the sample is known. Levy and Lemeshow (1999) demonstrate the

differences between probability-based and non-probability-based sampling schemes. The

distinguishing characteristic is that in probability-based schemes, each element or

individual in the sampling frame, which is a list of the population from which the sample

can be drawn, has a known, non-zero probability of being selected for the sample.

Quota sampling or “open-to-the-public” Internet polling is considered non-probability-

based because there is no way to achieve a non-zero probability of selection for

individuals who do not take the survey (Weisberg, Krosnick, and Bowen 1996).

There are several types of designs for sample surveys, including simple random

sampling, systematic sampling, stratified sampling, cluster sampling, and multi-stage

combinations of these. All of these sampling designs are relevant for geographers

engaged in survey research. The decision on the type of sampling design to employ is an

important one and requires some careful consideration. Typically, this decision depends

on the objectives of the study and the data that are available in the sampling design

process. Korn and Graubard (1999) and Levy and Lemeshow (1999) provide detailed

descriptions of the different sampling designs and the properties associated with each

one. In simple random sampling, each element in a population has the same probability of

being selected in the sample. In complex sample designs, such as cluster or stratified

sampling, each element does not necessarily have the same probability of being sampled.

Instead, the probability of being sampled depends on the type of complex sample design

one uses. In addition, the statistical formulas to calculate population parameter estimates,

such as means and standard errors of mean estimates, account for the probability of

selection and hence differ across the numerous types of sampling designs.

The different types of sample surveys and associated estimation methods have not been

well distinguished in the geography literature. When differentiating between cluster

sampling and stratified sampling, Holmes (1967) briefly states to a geographic audience

that scholars have applied methods for simple random samples to complex

samples in error and have reached incorrect conclusions as a result. This statement

was ancillary to Holmes’s main theme, and has not been brought to the attention of many

geographers. A review of papers dealing with sample surveys in the geographic literature

reveals that increased attention to the appropriate analysis of sample data is needed.

Others (Poon 2007) have noted an underreporting of sampling design details in some

geographic journals. The work in many papers does not explicitly account for the

sampling design in the analysis, which requires a design-based analysis. A design-based

analysis treats the data as a sample, sometimes a complex one, from a finite population

and explicitly considers the sampling design by applying a statistical weight to each

observation. In contrast, a model-based analysis assumes the data are a simple random

sample from an infinite population, and are independent and identically distributed (Kott

1991). There are numerous examples in the recent geographic literature of using model-

based analysis of complex survey sample data, not accounting for sampling from a finite

population, and not reporting sampling design details. Several commonly used statistical

packages in human geography research, such as SPSS, have historically assumed data are

a simple random sample from an infinite population and, therefore, performed model-

based analyses of sample survey data. A historical dearth of statistical software that

enabled design-based analysis of sample survey data may have contributed to the current

underutilization of this analytical approach. Fortunately, there are statistical software

packages, such as SUDAAN (Research Triangle Institute 2001) and Stata (Stata 2005),

that enable one to perform design-based analysis of complex sample survey data with

ease. In addition, current versions of SPSS (SPSS Inc. 2006) allow for design-based

analysis of sample surveys using a complex sample add-on module that specifies the

sample design parameters through a complex sample file.

While some geographers have not fully recognized the benefits of a design-based

approach to sample survey analysis, colleagues outside the discipline have been more

active in this area. Lemeshow et al. (1998) provide an excellent example for

biostatisticians and epidemiologists of the difference between design-based analysis and

model-based analysis in the study of the association between wine consumption and

dementia that argues for design-based analysis. Brogan (1998) demonstrates to the

biostatistics community the difference in results between using SUDAAN (design-based)

and SAS (model-based) software to analyze the Behavioral Risk Factor Surveillance

System (BRFSS) surveys. Both Lemeshow et al. (1998) and Brogan (1998) illustrate that

biased point estimates and inappropriate standard errors can result from using non-

specialized statistical software to analyze sample surveys. Work by Pickle and Su (2002)

and Pickle et al. (2007) in public health journals also adjusts for sample survey statistical

weights when analyzing the BRFSS health surveys over several years. Pickle and Su

(2002) produce smoothed maps at the county level of certain disease risk factor estimates,

such as proportion smoking and proportion at risk of obesity, using the BRFSS. Pickle et

al. (2007) use population estimates of certain risk factors from the BRFSS as covariates

in a statistical model to predict new annual cancer cases. Schaible (1996) and Malec

(1996) describe the more complicated method of indirect estimates of population

parameters, which rely on information from other locations or time periods and are

designed to increase accuracy of estimates when sample sizes are small.

Regrettably, many geographers either have not been exposed to the sample survey

literature in biostatistics and statistics or have not had formal education in sample survey

methodology. This is a situation that should be addressed directly in the geographic

literature. The work presented in this article draws on the existing literature to raise the

awareness of geographers to the importance of using the design-based approach when

analyzing sample survey data and is aimed primarily at an audience of human geography

researchers engaged in or planning sample survey research in a range of geographic sub-

disciplines. To illustrate our point, we demonstrate differences in results from model-

based and design-based analyses of cancer prevalence in a population of predominantly

low-income, minority women in urban environments in North Carolina and South

Carolina using sample survey data from the CARES (Carolinas education and screening)

study. We compare estimates of population means and proportions and standard errors of

estimates for numerous variables collected in the study. We also compare the results of

logistic regression models to measure associations between potential risk factors and

colon and non-colon cancer prevalence. We perform the analysis in Stata software and

describe how to do so. This example is illustrative of the different sample survey analysis

approaches and should be substantively stimulating to readers interested in a variety of

human geographic research agendas and especially to those in medical and health

geography, a currently growing area of research (see, for example, Kolivras 2006;

Langford and Higgs 2006; Strait 2006; Xu et al. 2006; Sui 2007; Wheeler 2007).

Methods

The CARES Study

In this section, we present the background for an analysis of a sample survey study. The

initial analysis of the sample survey data by the study investigators did not consider the

sampling design, but rather treated the complex stratified sample as a simple random

sample from an infinite population. We contrast this model-based analysis with the

appropriate design-based analysis. The CARES study was a three-year study started in

year 2000 to investigate colorectal cancer screening among women aged fifty years and

older who resided in subsidized housing communities in cities in North Carolina and

South Carolina. The primary goal of the complete study was to assess the impact of an

intervention program delivered by American Cancer Society (ACS) volunteers to

improve attitudes, knowledge, and behavior related to colorectal cancer screening. We

had access to the baseline data from the principal investigator, Dr. Electra Paskett, and

analyzed these data only. The baseline survey included questions related to knowledge of

colon cancer risk factors, history of cancer and colon cancer screening, as well as

demographic questions and medical care questions. The motivation for the CARES study

was the substantial risk for colorectal cancer among the female population in the United

States, as it is the third most common cancer among women (Landis et al. 1999). In

addition, colorectal cancer age-adjusted incidence and mortality rates are higher for

African-Americans than any other racial/ethnic group in the United States, as mortality

rates for colorectal cancer are 20-30 percent higher in African-Americans than for other

ethnicities and five-year survival rates are 15 percent lower for African-Americans than

for Caucasians (Landis et al. 1999). Regarding screening, fewer colorectal cancers are

detected at localized stages among African-Americans compared to Caucasians.

Colorectal cancer screening is an important preventative measure, as improved survival

rates have been linked to the use of early detection tests and changes in lifestyle factors

(Chu et al. 1994).

The survey population was women aged fifty and older who resided in subsidized

housing communities in ten cities in North Carolina and South Carolina. The population

was low-income and primarily minority, as 79 percent of the women in the population

were African-American. The ten cities were selected by the investigators for the

population based on the fact that they each contained active American Cancer Society

volunteer units, were within approximately three hours driving distance from the ACS

project staff base in Charlotte, NC, and contained a housing authority that manages

subsidized housing communities that would cooperate with the project. A small number

of women were sampled in an eleventh city, Anderson, South Carolina, but the data were

not available for this city; hence, it was excluded from the study population. The cities

that contained the study population are displayed in Figure 1. The sizes of the survey

population and sample by city are listed in Table 1. The sampling plan of the study was to

draw a simple random sample (SRS) from each city using a compiled list of all women

aged fifty years and older residing in the housing authority in each city. After SRS

selection, each woman was sent a letter, followed by a maximum of five attempts to

complete an interview in a home visit by an interviewer. The interview was

approximately thirty to forty-five minutes in duration. The investigators expected an

approximately 80 percent response rate to interview requests, and the actual rate was

slightly higher. An assumption of the analysis in this article is that the non-response

individuals were not significantly different from those who were interviewed, in terms of

the characteristics that were measured.

The survey sample design is actually a single-stage stratified random sample, stratified by

city. In this design, all the cities (strata) were first selected, and then a simple random

sample was taken of women on the listing of sampling units for each city. There are two

advantages to using stratified sampling instead of simple random sampling. First,

stratified sampling readily provides population mean and proportion estimates for each

stratum, or subdomain, of the population. Holmes (1967) noted the importance of

stratified sampling for geographers interested in local estimates of variables under study

and Wood (1955) provides an early example of using stratified random sampling in

geographic research. Second, stratified sampling offers the potential of lower standard

errors of population estimates under certain conditions, such as with homogeneity within

stratum and heterogeneity between strata, thereby yielding improved estimate precision

and reliability (Levy and Lemeshow 1999).

In the CARES sampling design, the list of women in each city constitutes the sampling

frame, where every element on the list for each city had an equal probability of being

selected in the sample. The probability of being sampled was not the same for all women

in the population, however, as the probability of selection depends on the population in

each city. Defining $N_k$ as the population size in city $k$, and $n_k$ as the sample size in city

$k$, the probability of selecting an individual in the sample is $n_k / N_k$, which depends on

the fixed population size that varies from city to city. It should be clear that this is not a

simple random sample from an infinite population, where all sampling elementary units

have an equal probability of being sampled. Design-based analysis of sample surveys

incorporates statistical weights, which account for how many observations in the

population a sampled observation represents. The statistical weight, $w_k$, is equal to 1

divided by the probability of selection of an element; therefore, it is $w_k = N_k / n_k$ for

sampled observations in city $k$. In the CARES example, the statistical weights are easy

to calculate, but this may not always be the case, such as when the population counts are

not known. Statistical weights may also be more difficult to calculate in more complex

sampling designs; however, weights are often provided for users of large national

surveys, such as the National Health and Nutrition Examination Survey (National Center

for Health Statistics 2006), so that unbiased estimates and proper standard errors may be

calculated. In practice, some statistical analysis software, such as Stata, allows the user to

specify a variable that contains the statistical weights for sampled observations.
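
As a minimal sketch of how such a weight variable might be built, the Stata commands below compute the weight for each observation as the city population size divided by the city sample size. The toy data, the variable name n_k, and the population sizes are assumptions made only for illustration; they are not the CARES data.

```stata
* Toy data (hypothetical): six sampled women in two cities.
* N is the assumed population size of each woman's city.
clear
input CITY N
1 200
1 200
1 200
1 200
2 50
2 50
end

* n_k: number of sampled observations in each city (stratum)
bysort CITY: generate n_k = _N

* WEIGHT = N_k / n_k: population members represented by each observation
generate WEIGHT = N / n_k

list CITY N n_k WEIGHT, sepby(CITY)
```

In this toy example each sampled woman in city 1 represents 200/4 = 50 women and each sampled woman in city 2 represents 50/2 = 25 women, so the weights sum to the assumed total population of 250.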

The fact that a sample does not come from an infinite population is accounted for by the

finite population correction (FPC). The FPC is a multiplier adjustment to the estimate of

a standard error of a population parameter estimate and has the effect of reducing the

standard error as the sample size n approaches the population size N (Levy and

Lemeshow 1999). The formula for the FPC is $(N - n)/(N - 1)$; it multiplies the variance of the estimate, so its square root multiplies the standard error. The FPC can be seen in

the equation for the standard error of a sample mean for variable x in a stratified random

sample,

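$$SE(\bar{x}) = \left[ \frac{1}{N^2} \sum_{k=1}^{K} N_k^2 \left( \frac{\sigma_k^2}{n_k} \right) \left( \frac{N_k - n_k}{N_k - 1} \right) \right]^{1/2}, \qquad (1)$$

where there are $K$ strata and the population stratum variance is $\sigma_k^2 = \frac{\sum_{i=1}^{N_k} (X_{ik} - \bar{X}_k)^2}{N_k}$.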
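
As a rough numerical illustration of the stratified standard error in equation (1), the worked example below uses two strata. All population sizes, sample sizes, and stratum variances are hypothetical values chosen for easy arithmetic, not values from the CARES study.

```latex
% Hypothetical two-stratum example (all values assumed for illustration):
% N_1 = 100, n_1 = 20, \sigma_1^2 = 4;  N_2 = 50, n_2 = 10, \sigma_2^2 = 9;  N = 150
\begin{align*}
SE(\bar{x}) &= \left[ \frac{1}{N^2} \sum_{k=1}^{K} N_k^2
               \left(\frac{\sigma_k^2}{n_k}\right)
               \left(\frac{N_k - n_k}{N_k - 1}\right) \right]^{1/2} \\
            &= \left[ \frac{1}{150^2} \left( 100^2 \cdot \frac{4}{20} \cdot \frac{80}{99}
               + 50^2 \cdot \frac{9}{10} \cdot \frac{40}{49} \right) \right]^{1/2} \\
            &\approx \left[ \frac{1616.16 + 1836.73}{22500} \right]^{1/2} \approx 0.392
\end{align*}
% Dropping both FPC factors gives [(2000 + 2250)/22500]^{1/2} \approx 0.435
```

With these assumed values, one fifth of each stratum is sampled and the finite population correction lowers the standard error from about 0.435 to about 0.392.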

The use of the FPC offers notably improved precision of population parameter estimates

under certain conditions, and there is no penalty for using it. Whether the goal is to make

statistical statements about a sample estimate itself or to infer about a population

parameter, the FPC will effectively reduce the standard error. As the sample size

increases relative to the population size, the effect of the FPC becomes more dramatic in

decreasing the standard error, and gains in precision can be nearly 70 percent. For

example, if N = 100 and n = 90, the FPC will reduce the standard error by approximately

68 percent, thereby increasing the reliability of the estimate.
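
The arithmetic behind this example can be written out as follows. The second line, with n = 10, is a hypothetical contrast added here to show how small the correction is when the sampling fraction is low.

```latex
% Standard-error multiplier implied by the FPC: \sqrt{(N - n)/(N - 1)}
\begin{align*}
N = 100,\; n = 90: &\quad \sqrt{\frac{100 - 90}{100 - 1}} = \sqrt{\frac{10}{99}} \approx 0.318
   \quad (\text{standard error reduced by about } 68\%) \\
N = 100,\; n = 10: &\quad \sqrt{\frac{100 - 10}{100 - 1}} = \sqrt{\frac{90}{99}} \approx 0.953
   \quad (\text{standard error reduced by about } 5\%)
\end{align*}
```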

In the CARES study data, there were some missing values for responses to certain

questions of interest. Rather than impute these missing values simply, such as using mean

imputation, and artificially decrease the estimate standard errors, we choose to delete

these observations by variable and account for the deletion by adjusting the

corresponding statistical weights. For example, in estimating the proportion of people

who have been told by a doctor they had colon cancer, there were two observations with

missing values (skipped the question) and these observations were dropped. Multiple

imputation would be an alternative to fill in the missing values and account for the extra

variation in the imputed values (Korn and Graubard 1999), particularly if there were

more missing observations. Using mean imputation with a binary variable, such as

presence of colon cancer, is clearly problematic. As a result of removing some

observations, the statistical weights are different for some variables. In notation, the

statistical weights for variable $i$ in city $k$ are $w_{ik} = N_k / n_{ik}$, where $n_{ik} = n_k - n^0_{ik}$ is the

sample size for variable $i$ in city $k$ after removing the $n^0_{ik}$ observations with missing

values for variable $i$. Removing one observation effectively increases the weights of the

other sampled observations, as each observation now represents a slightly larger number

of individuals in the population.
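
A minimal Stata sketch of this per-variable weight adjustment is given below. The toy data and the variable names n_ik and WEIGHT_CC are hypothetical and serve only to show the mechanics; one response in city 1 is entered as missing.

```stata
* Toy data (hypothetical): city 1 has one skipped answer (.) for CCANCER
clear
input CITY N CCANCER
1 200 0
1 200 1
1 200 .
1 200 0
2 50 0
2 50 0
end

* n_ik: number of non-missing responses to CCANCER in each city
bysort CITY: egen n_ik = count(CCANCER)

* Weight N_k / n_ik, defined only for observations kept in the analysis
generate WEIGHT_CC = N / n_ik if !missing(CCANCER)

list, sepby(CITY)
```

Here the three remaining observations in city 1 each receive a weight of 200/3, or about 66.7, instead of the 200/4 = 50 they would carry if all four responses were present, while the weights in city 2 are unchanged at 25.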

We made use of the statistical weights to calculate in Stata version 9 estimates of the

population mean and proportion for numerous variables. To do so in Stata, we first

specify the survey design settings using the svyset command. The statistical weights,

strata, and finite population correction are specified using the variables WEIGHT, CITY,

and N, where WEIGHT contains the statistical weight for each observation, CITY is a

number indicating the city where each observation resides, and N is the size of the survey

population for each city. The variable WEIGHT is calculated using the population and

sample sizes by strata. After the survey design parameters have been specified, we

estimate proportions, means, and standard errors of estimates for certain variables. For

example, to estimate the overall prevalence of colon cancer (CCANCER), we use the

following commands with the binary variable CCANCER.

. svyset _n [pw=WEIGHT], strata(CITY) fpc(N)

. svy: prop CCANCER
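
To make the contrast between the two approaches concrete, the sketch below runs the design-based commands and their model-based counterpart on a small toy data set. The data values are hypothetical and are not the CARES data; the command name proportion is written out in full.

```stata
* Toy data (hypothetical values, not the CARES data)
clear
input CITY N CCANCER
1 200 0
1 200 1
1 200 0
1 200 0
2 50 0
2 50 1
2 50 1
2 50 0
end

* Statistical weight: city population size divided by city sample size
bysort CITY: generate WEIGHT = N / _N

* Design-based: declare weights, strata, and FPC, then estimate
svyset _n [pw=WEIGHT], strata(CITY) fpc(N)
svy: proportion CCANCER

* Model-based: same data, survey settings ignored (treated as an SRS)
proportion CCANCER
```

In this toy example the design-based point estimate is (50 x 1 + 12.5 x 2)/250 = 0.30, while the unweighted model-based estimate is 3/8 = 0.375, because the smaller city is sampled at a higher rate and is therefore overrepresented in the unweighted data.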

Results

Estimation of Population Parameters

The original investigators in the CARES study were primarily interested in survey

questions related to colon cancer screening history as response variables. These variables

were FOB, an indicator for having a fecal occult blood test in the past, and FLEX, an

indicator for having a flexible sigmoidoscopy in the past year. For this analysis, we are

also interested in the variables related to prevalence of cancer. The two variables of

interest are CCANCER, an indicator for having a doctor diagnose the individual with

colon cancer in the past, and OCANCER, an indicator for having a doctor diagnose the

individual with another cancer in the past. The population proportion estimates for these

four response variables are listed in Table 2 for both the design-based and model-based

analyses. The model-based estimates are calculated by ignoring the sample survey design

and treating the data as a simple random sample from an infinite population. This is

accomplished easily in Stata by not setting the survey parameters and calculating

descriptive statistics. The design-based estimates include a finite population correction.

The estimates of the population proportions are somewhat similar with the two analytic

approaches; however, the proportions are consistently higher with the design-based

approach. The largest difference is with the estimated proportion of the population that

has had a fecal occult blood test, which is about 20 percent higher in the design-based

analysis. Hence, using only the model-based approach would underestimate substantially

the proportion of women that have had a fecal occult blood test. In addition, the model-

based estimates are biased estimates of the population parameters. The standard errors are

lower for the design-based approach only in the prevalence of cancer other than colon.

This may be due to reasons such as sample element allocation issues, lack of

homogeneity within each stratum, lack of heterogeneity between strata, as well as the

small number of cancer cases in the survey population. Stratified sampling will generally

give lower standard errors when there is homogeneity within stratum and heterogeneity

between strata. Simple random sampling underestimates the uncertainty of the population

parameter estimates in the sample survey. The proportion estimates by stratum for colon

and other cancer prevalence are listed in Table 3 using the design-based approach. The

proportions are indeed more heterogeneous between strata with other cancer, as High

Point, NC has a much higher prevalence than the other cities. This is clear in the plotted

estimates of other cancer prevalence in Figure 1. The proportion estimates are more

homogeneous between strata for colon cancer simply from the fact that there are fewer

cases of colon cancer in the population and many of the cities had no sampled individuals

with a history of colon cancer.

There were many questions in the CARES survey related to cancer risk factors and

knowledge of these factors, as well as colon cancer screening methods. In this analysis,

we selected a subset of these for which to calculate population parameter estimates. We

produced estimates of the following variables: race of respondent (WHITE, BLACK,

OTHER), age of respondent (AGE), knowledge about exercise as a preventative measure

for colon cancer (EXERCISE), knowledge about a healthy diet as a preventative measure

for colon cancer (DIET), having a doctor suggest a fecal occult blood test in the past

(DOCFOB), having a doctor suggest a flexible sigmoidoscopy test in the past

(DOCFLEX), and status as a current smoker (SMOKING). We later use these as

predictor variables in logistic regression models to explain cancer prevalence. The

estimates of these variable means and proportions and standard errors of the estimates are

listed in Table 4 for both the design-based and model-based analyses. Aside from the

estimate for mean age, the estimates are proportions, as the other variables are all

indicator variables. The mean and proportion estimates are similar for the two analytic

approaches, with some small differences. Again, the model-based estimates are biased.

Regarding the reliability of the estimates, the standard errors are either the same or lower

with the design-based analysis for all of the variables. Therefore, the design-based

analysis estimates are overall more precise than those from the model-based analysis.

Taken together, the estimates in Table 2 and Table 4 demonstrate that there can be more

heterogeneity between strata and more homogeneity within stratum for one variable than

for another, and the gain in precision of population parameter estimates from using

stratified random sampling need not be consistent across variables.

Assessing Risk Factors for Cancer Prevalence with Logistic Regression

To study the association of risk factors with past diagnosis of colon and other cancer, we

next built logistic regression models using the previously defined explanatory variables.

The response variables are the binary indicators for colon and other cancer diagnosis,

which are the variables used to estimate cancer prevalence in the population. The logistic

regression was carried out in Stata. We built separate regression models for colon cancer

and other cancer using the design-based approach, starting with the common confounder

AGE in the models, adding other variables to the model one at a time, and retaining only

those variables that were statistically significant near the significance level α = 0.05. To

fit the logistic regression models in Stata while considering the sampling design, we first

specify the sample design settings as discussed previously and then use the survey logit

command. For example, the logistic regression model to explain colon cancer occurrence

using variables AGE and SMOKER is fitted with the following command, where “or” is

an option that specifies that the estimated odds ratio for each variable should be returned.

. svy:logit CCANCER AGE SMOKER, or

We verified the model assumption of linearity in the logit for the continuous age variable

by fitting a logistic regression model for colon cancer with quartile design variables for

age (see Hosmer and Lemeshow 2000 for details). A plot of estimated regression

coefficients versus age quartile midpoints suggested some nonlinearity in the logit for

AGE, but a quadratic term for AGE did not significantly add to the model. Hence, AGE

was retained as a continuous variable in subsequent logistic regression models in the

design-based analysis. After building the design-based model, we fitted the same model

using a model-based scenario for comparison. We again checked for linearity in the logit

for AGE in the model-based analysis using quartile design variables and reached the

same conclusion as with the design-based analysis. In addition, we used fractional

polynomials (Hosmer and Lemeshow 2000) to verify a linear relationship for AGE with

the outcome variable. We assessed goodness-of-fit using the Hosmer-Lemeshow test with

ten categories and evaluated the discrimination of the model with the receiver operating

characteristic (ROC) curve. Both goodness-of-fit and discrimination were adequate.

Currently, goodness-of-fit tests, ROC curves for model discrimination, and fractional

polynomials cannot be applied to complex survey data in Stata. Therefore, we used these

assessment tools in the model-based analysis. We repeated these steps for the non-colon

cancer model and did not find evidence that AGE was nonlinear in the logit in either the

design-based or model-based analyses; goodness-of-fit and model discrimination were

again adequate.

The logistic regression model results for colon cancer for both design-based and model-

based analytic approaches are listed in Table 5. The table includes the estimated odds

ratio (OR), 95% confidence interval for the odds ratio, and the associated p-value for the

Wald test for each variable. The odds ratio measures the multiplicative increase in the

odds for the response variable associated with an increase in the exposure variable. For a

binary exposure variable, such as status as a current smoker, the increase in odds is

associated with the presence of the condition. For a continuous variable, such as age, the

multiplicative increase in odds measured by the odds ratio reflects a one-unit increase in

the variable, here one year of age. For colon cancer, the only variable in addition to AGE

that is significant is SMOKING. Both age and smoking status have a positive association

with history of colon cancer, as would be expected. Based on the statistics in Table 5, the

variable coefficients are more significant in the design-based analysis than in the model-

based analysis. The smoking status effect is 9 percent higher in the design-based analysis.

In fact, the effect of smoking is not significant in the model-based analysis at the typical

5% significance level, while it clearly is in the design-based analysis, given that the 95%

confidence interval for the odds ratio does not contain 1, the neutral value of equal risk

for exposed and non-exposed groups. The significance level is subjective and we specify

a value that is a commonly used threshold. The example here is one where using the

model-based approach with sample survey data would lead to an incorrect conclusion

about the significance of a relationship in the population. The logistic regression model


results for history of other cancer for both design-based and model-based analytic

approaches are listed in Table 6. For this outcome, both age and current smoking status

are significant at the 0.05 level for both the design-based and model-based analyses. Both

age and smoking have a positive relationship with non-colon cancer status, as would be

expected.
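
The odds ratio interpretation used for Table 5 can be checked with simple arithmetic. The sketch below copies the published odds ratios and confidence limits for the colon cancer model and applies three small helper functions. The helper names are hypothetical, and the rescaling of the AGE odds ratio assumes the effect is linear in the logit, as the model checks suggested.

```python
# Odds ratio (OR) and 95% confidence limits for the colon cancer model,
# copied from Table 5 as (OR, lower, upper).
TABLE5 = {
    "design": {"AGE": (1.057, 0.999, 1.119), "SMOKING": (4.319, 1.056, 17.671)},
    "model": {"AGE": (1.042, 0.986, 1.102), "SMOKING": (3.952, 0.921, 16.954)},
}


def ci_excludes_one(lower, upper):
    """True when the interval lies entirely above or below 1, the value
    of equal odds for exposed and non-exposed groups."""
    return lower > 1.0 or upper < 1.0


def odds_ratio_for_increase(or_per_unit, units):
    """OR for a `units` increase, assuming linearity in the logit."""
    return or_per_unit ** units


def percent_higher(a, b):
    """How much larger a is than b, in percent."""
    return 100.0 * (a / b - 1.0)


design_or, design_lo, design_hi = TABLE5["design"]["SMOKING"]
model_or, model_lo, model_hi = TABLE5["model"]["SMOKING"]

smoking_significant_design = ci_excludes_one(design_lo, design_hi)
smoking_significant_model = ci_excludes_one(model_lo, model_hi)
smoking_gap_percent = percent_higher(design_or, model_or)
age_or_10_years = odds_ratio_for_increase(TABLE5["design"]["AGE"][0], 10)
```

With the Table 5 values, the smoking interval excludes 1 only in the design-based column, the design-based smoking odds ratio is about 9 percent larger than the model-based one, and a ten-year difference in age corresponds to an odds ratio of roughly 1.74 in the design-based model.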

Conclusions

In this article, we emphasize to geographers the importance of using a design-based

approach when analyzing sample survey data. We focus on the sampling of individuals

and not on sampling of spatial units (see Griffith 2005). We have demonstrated by

example with a cancer sample survey that results from model-based and design-based

analyses differ. Model-based analysis of sample survey data can result in biased estimates

of population parameters. In the analyses of this sample survey, there were gains in

precision of the population parameter estimates of many variables when taking into

account the study design and finite size of the population. This results in increased

confidence in population parameter estimates. Even when inference on the population is

not of primary interest, precision gains in sample standard errors are possible through

using the finite population correction. In stratified sampling designs with heterogeneity

between strata and homogeneity within each stratum, design-based estimates of population

parameters have lower standard errors, and are therefore more precise, than those from

model-based analysis. In other

sampling schemes, the standard errors of estimates of population parameters from model-


based analysis can underestimate the uncertainty in the estimates. Overall, estimating

population proportions and means is more reliable when using the appropriate method of

analyzing sample survey data, and is worth the additional analysis steps if statistical

weights that represent the number of people in the population sampled are available.

Also, in our example, taking into account the design of the sample survey resulted in

finding through logistic regression a statistically significant association between smoking

and colon cancer in the CARES population, where this was not found using the model-

based analysis that inappropriately assumed a simple random sample from an infinite

population. In general, failing to account for a finite survey population and not utilizing

appropriate sample survey analysis techniques can lead to misrepresentation of the data

and incorrect conclusions about the significance of associations in regression models.

Results of sample survey analysis could conceivably have policy implications and real

ramifications, and it is therefore important to prepare and report findings carefully.

Readily available statistical software packages explicitly take into account sampling design and

make design-based analysis of sample surveys relatively straightforward. We conclude

that geographers should perform design-based analysis of complex sample survey data

whenever possible when making inferences about a population, and make this

recommendation to researchers undertaking the design and analysis of sample surveys in

human geography. In closing, we encourage geographers involved in survey research to

invest considerable thought into the sampling design of sample surveys and to provide

detailed description when reporting on results of survey research.
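
As a rough illustration of how the stratified design and the finite population enter the estimates, the sketch below combines the city counts in Table 1 with the within-city prevalence of other cancer in Table 3. It is not the Stata computation used for the paper. It assumes the textbook stratified estimator, treats cities shown as "." in Table 3 as having prevalence 0 and standard error 0, and works from rounded published values. All function names are hypothetical.

```python
import math

# (city, population count N_h, sample count n_h) from Table 1.
STRATA = [
    ("Winston-Salem, NC", 459, 49), ("Greensboro, NC", 652, 55),
    ("High Point, NC", 385, 41), ("Greenville, NC", 379, 57),
    ("Rocky Mount, NC", 231, 43), ("Wilson, NC", 149, 17),
    ("Charlotte, NC", 1006, 126), ("Rock Hill, SC", 87, 13),
    ("Spartanburg, SC", 440, 69), ("Greenville, SC", 331, 47),
]

# Within-city prevalence of other cancer and design-based SE from Table 3.
# ASSUMPTION: cities printed as "." (no cases) enter with prevalence 0, SE 0.
OTHER_CANCER = {
    "Winston-Salem, NC": (0.041, 0.027), "Greensboro, NC": (0.073, 0.034),
    "High Point, NC": (0.171, 0.056), "Greenville, NC": (0.054, 0.028),
    "Rocky Mount, NC": (0.070, 0.035), "Charlotte, NC": (0.080, 0.023),
    "Spartanburg, SC": (0.087, 0.031), "Greenville, SC": (0.085, 0.038),
}


def fpc_factor(n, N):
    """Finite population correction multiplier for a standard error."""
    return math.sqrt(1.0 - n / N)


def stratified_estimate(strata, within):
    """Population-weighted mean of stratum estimates and its standard error."""
    total = sum(N_h for _, N_h, _ in strata)
    mean = 0.0
    variance = 0.0
    for city, N_h, _ in strata:
        p_h, se_h = within.get(city, (0.0, 0.0))
        w_h = N_h / total  # stratum weight N_h / N
        mean += w_h * p_h
        variance += (w_h * se_h) ** 2
    return mean, math.sqrt(variance)


sampled = sum(n_h for _, _, n_h in STRATA)
population = sum(N_h for _, N_h, _ in STRATA)
overall_fpc = fpc_factor(sampled, population)
ocancer_mean, ocancer_se = stratified_estimate(STRATA, OTHER_CANCER)
```

Under these assumptions the combined estimate is about 0.077 with a standard error of about 0.011, which agrees with the design-based OCANCER row of Table 2 to rounding. With 517 of 4119 women sampled, the overall correction factor is about 0.935, so ignoring the finite population overstates a simple random sampling standard error by roughly 7 percent.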


References

Brogan, D. J. 1998. Software for sample survey data, misuse of standard packages. In

Encyclopedia of biostatistics, volume 5. ed. P. Armitage and T. Colton, 4167-4174. New

York, NY: Wiley & Sons.

Chu, K. C., R. E. Tarone, W. H. Chow, B. F. Hankey, and L. A. Ries. 1994. Temporal

patterns in colorectal cancer incidence, survival, and mortality through 1990. Journal of

the National Cancer Institute 87:997.

Griffith, D. A. 2005. Effective geographic sample size in the presence of spatial

autocorrelation. Annals of the Association of American Geographers 94(5):740-760.

Holmes, J. 1967. Problems in location sampling. Annals of the Association of American

Geographers 47:757-780.

Hosmer, D. W., and S. Lemeshow. 2000. Applied logistic regression, 2nd Edition. New

York, NY: John Wiley & Sons.

Kolivras, K. N. 2006. Mosquito habitat and dengue risk potential in Hawaii: a conceptual

framework and GIS application. The Professional Geographer 58(2): 139–154.


Korn, E. L., and B. I. Graubard. 1999. Analysis of health surveys. New York, NY: John

Wiley & Sons.

Kott, P. S. 1991. A model-based look at linear regression with survey data. The American

Statistician 45:107-112.

Landis, S. H., T. Murray, S. Bolden, and P. A. Wingo. 1999. Cancer statistics. CA: A

Cancer Journal for Clinicians 49:8-31.

Langford, M., and G. Higgs. 2006. Measuring potential access to primary healthcare

services: the influence of alternative spatial representations of population. The

Professional Geographer 58(3):294–306.

Lemeshow, S., L. Letenneur, J. F. Dartigues, S. Lafont, J. M. Orgogozo, and D.

Commenges. 1998. Illustration of analysis taking into account complex survey

considerations: the association between wine consumption and dementia in the PAQUID

study. American Journal of Epidemiology 148:298-306.

Levy, P. S., and S. Lemeshow. 1999. Sampling of populations: methods and applications.

New York, NY: John Wiley & Sons.

Li, S. M., and Y. M. Siu. 2001. Residential mobility and urban restructuring under market

transition: a study of Guangzhou, China. The Professional Geographer 53(2):219-229.


Malec, D. 1996. Model based state estimates from the National Health Interview survey.

In Indirect estimators in U.S. federal programs. ed. W. L. Schaible, 145-167. New York,

NY: Springer-Verlag.

Manheim, J., and R. Rich. 1995. Empirical political analysis: research methods in

political science. 4th ed. White Plains, NY: Longman.

McSweeney, K. 2002. Who is “forest-dependent”? Capturing local variation in forest-

product sale, Eastern Honduras. The Professional Geographer 54(2):158-174.

McSweeney, K. 2004. Forest product sale as natural insurance: the effects of household

characteristics and the nature of shock in eastern Honduras. Society and Natural

Resources 17:39-56.

National Center for Health Statistics. 2006. Analytic and reporting guidelines: The

National Health and Nutrition Examination Survey (NHANES). Hyattsville, MD.

Overmars, K. P., and P. H. Verburg. 2005. Analysis of land use drivers at the watershed

and household level: linking two paradigms at the Philippine forest fringe. International

Journal of Geographical Information Science 19:125-152.


Paudel, G. S., and G. B. Thapa. 2004. Impact of social, institutional and ecological

factors on land management practices in mountain watersheds of Nepal. Applied

Geography 24:35-55.

Pickle, L. W., Y. Hao, A. Jemal, Z. Zou, R. C. Tiwari, E. Ward, M. Hachey, H. L. Howe,

and E. J. Feuer. 2007. A new method of estimating United States and state-level cancer

incidence counts for the current calendar year. CA: A Cancer Journal for Clinicians

57(1):30-42.

Pickle, L. W., and Y. Su. 2002. Within-state geographic patterns of health insurance

coverage and health risk factors in the United States. American Journal of Preventive

Medicine 22(2):75-83.

Poon, J. 2007. Instrumentation rigor and practice. Environment and Planning A

39(5):1017-1019.

Research Triangle Institute. 2001. SUDAAN user’s manual, release 8.0. Research

Triangle Park, NC: Research Triangle Institute.

Schaible, W. L. 1996. Introduction and summary. In Indirect estimators in U.S. federal

programs. ed. W. L. Schaible, 1-15. New York, NY: Springer-Verlag.


Sinai, I. 2002. The determinants of the number of rooms occupied by compound dwellers

in Kumasi, Ghana: does working at home mean more rooms? Applied Geography 22:77-

90.

SPSS Inc. 2006. Command Syntax Reference. Chicago, IL.

Stata. 2005. Release 9 user’s guide. College Station, TX: Stata Press.

Strait, J. B. 2006. An epidemiology of neighborhood poverty: causal factors of infant

mortality among blacks and whites in the metropolitan United States. The Professional

Geographer 58(1): 39–53.

Sui, D. Z. 2007. Geographic information systems and medical geography: toward a new

synergy. Geography Compass 1(3): 556–582.

Takasaki, Y., B. L. Barham, and O. T. Coomes. 2001. Amazonian peasants, rain forest

use, and income generation: the role of wealth and geographical factors. Society and

Natural Resources 14:291-308.

Weisberg, H. F., J. A. Krosnick, and B. D. Bowen. 1996. An introduction to

survey research, polling, and data analysis. 3rd ed. Thousand Oaks, Calif.: Sage

Publications.


Wheeler, D. 2007. A comparison of spatial clustering and cluster detection techniques for

childhood leukemia incidence in Ohio, 1996-2003. International Journal of Health

Geographics 6:13.

Wood, W. F. 1955. Use of stratified random samples in a land use study. Annals of the

Association of American Geographers 45(4):350-367.

Wyllie, D. S., and G. C. Smith. 1996. Effects of extroversion on the routine spatial

behavior of middle adolescents. The Professional Geographer 48(2):166-180.

Xu, B., P. Gong, E. Seto, S. Liang, C. Yang, S. Wen, D. Qiu, X. Gu, and R. Spear. 2006.

A spatial-temporal model for assessing the effects of intervillage connectivity in

schistosomiasis transmission. Annals of the Association of American Geographers 96(1):

31–46.

Yankson, P. W. K. 2000. Houses and residential neighbourhoods as work places in urban

areas: the case of selected low income residential areas in Greater Accra Metropolitan

Area (GAMA), Ghana. Singapore Journal of Tropical Geography 21(2):200-214.


Tables

City                 Population   Sample
Winston-Salem, NC           459       49
Greensboro, NC              652       55
High Point, NC              385       41
Greenville, NC              379       57
Rocky Mount, NC             231       43
Wilson, NC                  149       17
Charlotte, NC              1006      126
Rock Hill, SC                87       13
Spartanburg, SC             440       69
Greenville, SC              331       47
Total                      4119      517

Table 1. Total population counts and sampled counts of women aged fifty years or older who resided in subsidized housing communities in the ten cities of the survey design area at the time of the CARES project


             Design-based analysis    Model-based analysis
Variable     Mean     SE(mean)        Mean     SE(mean)
FOB          0.122    0.021           0.102    0.018
FLEX         0.062    0.013           0.059    0.012
CCANCER      0.019    0.006           0.018    0.006
OCANCER      0.077    0.011           0.076    0.012

Table 2. Estimates of population proportions and standard errors of estimates for the colon cancer screening variables (FOB, FLEX) and the cancer history variables (CCANCER, OCANCER) in the CARES data


                     Other cancer                     Colon cancer
City                 Mean    Design SE   Model SE     Mean    Design SE   Model SE
Winston-Salem, NC    0.041   0.027       0.029        0.041   0.027       0.029
Greensboro, NC       0.073   0.034       0.035        0.019   0.018       0.019
High Point, NC       0.171   0.056       0.059        0.024   0.023       0.024
Greenville, NC       0.054   0.028       0.030        .       .           .
Rocky Mount, NC      0.070   0.035       0.039        .       .           .
Wilson, NC           .       .           .            .       .           .
Charlotte, NC        0.080   0.023       0.024        0.024   0.013       0.014
Rock Hill, SC        .       .           .            .       .           .
Spartanburg, SC      0.087   0.031       0.034        0.029   0.019       0.020
Greenville, SC       0.085   0.038       0.041        .       .           .

Table 3. Design-based estimates and standard errors of estimates within stratum (city) for prevalence of other cancer and colon cancer, where “.” indicates no estimate due to no cases


             Design-based analysis    Model-based analysis
Variable     Mean      SE(mean)       Mean      SE(mean)
WHITE         0.191    0.017           0.184    0.017
BLACK         0.770    0.018           0.779    0.018
OTHER         0.038    0.008           0.037    0.008
EXERCISE      0.141    0.015           0.136    0.015
AGE          69.465    0.477          69.430    0.492
DIET          0.191    0.017           0.189    0.017
DOCFOB        0.927    0.015           0.920    0.017
DOCFLEX       0.922    0.021           0.908    0.022
SMOKING       0.210    0.017           0.204    0.018

Table 4. Estimates of population means and proportions and standard errors of estimates for explanatory variables in the CARES dataset using both design-based analysis and model-based analysis


            Design-based analysis              Model-based analysis
Variable    OR      95% CI          p-value    OR      95% CI          p-value
AGE         1.057   0.999-1.119     0.054      1.042   0.986-1.102     0.142
SMOKING     4.319   1.056-17.671    0.042      3.952   0.921-16.954    0.064

Table 5. Odds ratios for age and smoking from the logistic regression model for history of colon cancer


            Design-based analysis             Model-based analysis
Variable    OR      95% CI         p-value    OR      95% CI         p-value
AGE         1.029   1.000-1.058    0.049      1.030   1.000-1.062    0.048
SMOKING     2.101   1.043-4.231    0.038      2.115   1.020-4.384    0.044

Table 6. Odds ratios for age and smoking from the logistic regression model for history of other cancer


Figures

Figure 1. Design-based estimates of non-colon cancer prevalence within stratum (city)