JENNIFER SAYLOR, PHD, RN, ACNS-BC
UNIVERSITY OF DELAWARE
SEPTEMBER 14, [email protected]
Essentials of Complex Data Analysis Utilizing National Survey Data
National Surveys: Advantages
Guide actions & policies to improve health
Allow researchers to ask & answer questions on a population level from previously collected data
Make data available without any replication of effort
Avoid the prohibitive cost of obtaining primary data
Use probability-based complex sample designs
National Surveys: Challenges
Locating a dataset that includes the variables to address the research question
Additions or deletions of variables and differences in methods of assessment of variables in different waves (years) in each survey
Merging multiple data files
Analyzing data using complex sample design
Complex Sample Designs
Used to acquire representation of an entire population using a sample of the population
More efficient than simple random samples:
  Do not require complete enumeration of the population
  Allow researchers to visit compact areas to obtain in-person data (interview or laboratory)
  Able to oversample small or sensitive subgroups to adequately represent their variability
Assuming simple random sampling gives variance estimates that are too low, because it does not account for the correlation among individuals within clusters, and so yields biased results
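The understated variance under a simple random sampling assumption is often summarized with a design effect. The following is a minimal sketch, assuming Kish's approximation for equal-sized clusters, deff = 1 + (m - 1) * rho. The cluster size and intraclass correlation are hypothetical values chosen for illustration, not NHANES figures.

```python
import math


def design_effect(cluster_size: float, icc: float) -> float:
    """Kish's approximate design effect for equal-sized clusters.

    cluster_size: average number of sampled people per cluster (hypothetical).
    icc: intraclass correlation, i.e. how alike people within a cluster are.
    """
    return 1.0 + (cluster_size - 1.0) * icc


def cluster_adjusted_se(srs_se: float, cluster_size: float, icc: float) -> float:
    """Inflate a simple-random-sampling standard error by sqrt(design effect)."""
    return srs_se * math.sqrt(design_effect(cluster_size, icc))


# Hypothetical inputs: 20 people per cluster, modest within-cluster correlation.
deff = design_effect(cluster_size=20, icc=0.05)
adjusted = cluster_adjusted_se(srs_se=0.10, cluster_size=20, icc=0.05)
print(f"design effect = {deff:.2f}, adjusted SE = {adjusted:.3f}")
```

With no within-cluster correlation the design effect is 1 and the two standard errors agree; any positive correlation makes the simple random sampling standard error too small.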
Complex Sample Analysis
Accurately estimates population means and frequencies from the sample after accounting for over- or undersampling of specific groups
Statistical programs address sampling design elements such as stratification, clusters, and weights:
  SUDAAN®, Complex Samples analysis in SPSS®, & survey procedures in SAS®
Weighting without complex samples analysis leads to grossly reduced estimates of population variability
  Estimates are computed as if the measures were obtained from the number of cases in the entire population rather than the number of cases in the sample in the data set
Clusters include participants who are more similar to one another than to those in another cluster
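To make the point about weighting without complex samples analysis concrete, here is a small sketch that compares two standard errors for the same weighted mean. All values and weights are hypothetical, and neither function accounts for strata or clusters; the only difference is whether the sum of the weights or the real number of sampled cases is treated as the sample size.

```python
import math


def weighted_mean(values, weights):
    return sum(w * y for y, w in zip(values, weights)) / sum(weights)


def weighted_variance(values, weights):
    mean = weighted_mean(values, weights)
    return sum(w * (y - mean) ** 2 for y, w in zip(values, weights)) / sum(weights)


def se_weights_as_case_counts(values, weights):
    """Standard error if the sum of the weights is mistaken for the sample size."""
    return math.sqrt(weighted_variance(values, weights) / sum(weights))


def se_actual_sample_size(values, weights):
    """Standard error using the real number of sampled cases."""
    return math.sqrt(weighted_variance(values, weights) / len(values))


# Hypothetical measurements and survey weights for six participants.
values = [118.0, 124.0, 131.0, 109.0, 140.0, 127.0]
weights = [42000.0, 55000.0, 38000.0, 61000.0, 47000.0, 52000.0]

naive = se_weights_as_case_counts(values, weights)
honest = se_actual_sample_size(values, weights)
print(f"SE with weights as counts: {naive:.4f}; SE with n = {len(values)}: {honest:.4f}")
```

The weighted mean is identical in both cases; only the standard error collapses when the weights are read as case counts.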
National Health and Nutrition Examination Survey
Combination of health and nutrition questionnaires and physical examination to assess the health and nutritional status of adults and children in the United States [1].
National Health and Nutrition Examination Survey (NHANES) data can be used to:
  Produce estimates of personal health conditions
  Vital statistics
  Establish national standards for measurements (e.g., blood pressure)
  Analyze risk factors for diseases
  Examine disparities in health status
60-minute Interview: demographic, socioeconomic, dietary, & health-related questions
Physical Examination: mostly performed in mobile examination centers; medical, dental, & physiological measurements and laboratory tests, depending on the participant's age & gender
NHANES Sampling Plan
Probability-based complex sample design represents the civilian, non-institutionalized U.S. population.
Excludes individuals:
  Residing in nursing homes
  Armed forces
  Institutionalized
  U.S. nationals living outside the U.S.
Stage 1: Fifteen Primary Sampling Units (PSUs) are selected: counties or small groups of contiguous counties, depending on the population of the counties; across all PSUs, approximately 5,000 participants are examined per year
Stage 2: Segments within the PSUs are selected: a cluster of households in a block or a group of blocks, depending on population density
Stage 3: Households within the segments are selected
Stage 4: One or more participants within the households are randomly selected
Diagram: NHANES Sampling Plan [2]
NHANES 2007-2008 Sampling
Oversamples small and sensitive subgroups:
  Persons over 60
  African Americans
  Low-income population
  Entire Hispanic population (not only Mexican Americans)
A representative sample of these groups by age, sex, and income
Reliable and precise health status indicator estimates
Each NHANES 2007-2008 participant represents approximately 50,000 other U.S. residents.
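The link between oversampling and the number of people each participant represents can be sketched with inverse-probability base weights. The selection probabilities below are hypothetical, and this sketch covers only the base weight; it omits any further adjustments applied to the final survey weights.

```python
def base_weight(selection_probability: float) -> float:
    """Number of people one sampled person stands for: 1 / P(selected)."""
    if not 0.0 < selection_probability <= 1.0:
        raise ValueError("selection probability must be in (0, 1]")
    return 1.0 / selection_probability


# Hypothetical probabilities: the oversampled group is three times as likely
# to be selected, so each of its members represents fewer people.
oversampled = base_weight(1 / 25_000)
other = base_weight(1 / 75_000)

# Three sampled people from each group; the weights sum to the estimated
# size of the population they jointly represent.
sample_weights = [oversampled] * 3 + [other] * 3
estimated_population = sum(sample_weights)
print(f"estimated population represented: {estimated_population:,.0f}")
```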
NHANES Survey Weights
Purpose: account for oversampling, survey non-response, & post-stratification
Sample weights are assigned to each person based on the number of people they represent within the U.S. Census non-institutionalized civilian population
NHANES provides three weights (2- & 4-year increments):
  Interview weights: all people interviewed
  Medical examination weights: interviewed & medical examination
  Fasting laboratory weights: interviewed, medical examination, & fasting laboratory tests
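A common rule of thumb is to use the weight that belongs to the smallest subsample contributing any variable to the analysis. The sketch below encodes that rule. WTSAF2YR appears later in this presentation; WTMEC2YR and WTINT2YR are the names I would expect for the examination and interview weights in the public files and should be checked against the codebook. Halving a 2-year weight when two adjacent cycles are combined is likewise an assumption to verify against NCHS guidance for the cycles in use.

```python
def select_weight(uses_fasting_lab: bool, uses_exam: bool) -> str:
    """Pick the weight variable for the most restrictive subsample used."""
    if uses_fasting_lab:
        return "WTSAF2YR"  # fasting laboratory subsample
    if uses_exam:
        return "WTMEC2YR"  # assumed name: medical examination weight
    return "WTINT2YR"      # assumed name: interview weight


def four_year_weight(two_year_weight: float) -> float:
    """Assumed rule for pooling two adjacent 2-year cycles: halve each weight."""
    return two_year_weight / 2.0


# An analysis that mixes interview items with a fasting laboratory measure
# should use the fasting weight, because that subsample is the smallest.
print(select_weight(uses_fasting_lab=True, uses_exam=True))
```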
Creating Analytical File from NHANES
Download data files & codebooks
  Combination of 11 individual data files in NHANES 07-08
Transfer text data files to a statistical package (SPSS, SAS)
Screen each data file for unused variables & delete them
  If names of variables or responses changed between multiple survey years, rename variables and recode responses before merging
Merge different data files by the sequence number assigned to each participant
  Data collected varied by age; files do not have the same number of records
Recode 07-08 variables to create the study variables
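The merge step can be sketched as a left join keyed on the participant sequence number. This is an illustration only, not the SPSS or SAS procedure used in the study: the records are hypothetical, SEQN is assumed to be the sequence-number variable, and RIDAGEYR and LBXGLU are assumed variable names to be confirmed in the codebooks. A data-frame library would normally do this work; plain dictionaries keep the example self-contained.

```python
def left_merge_on_key(base_rows, other_rows, key="SEQN"):
    """Left-join two lists of dict records on a participant identifier."""
    other_by_key = {row[key]: row for row in other_rows}
    other_columns = sorted({col for row in other_rows for col in row if col != key})
    merged = []
    for row in base_rows:
        match = other_by_key.get(row[key], {})
        combined = dict(row)
        for col in other_columns:
            # None marks participants with no record in the second file.
            combined[col] = match.get(col)
        merged.append(combined)
    return merged


# Hypothetical records: the laboratory file has fewer participants than the
# demographics file because data collection varied by age.
demographics = [
    {"SEQN": 41475, "RIDAGEYR": 62},
    {"SEQN": 41476, "RIDAGEYR": 6},
    {"SEQN": 41477, "RIDAGEYR": 71},
]
fasting_glucose = [
    {"SEQN": 41475, "LBXGLU": 104.0},
    {"SEQN": 41477, "LBXGLU": 98.0},
]

analytic = left_merge_on_key(demographics, fasting_glucose)
for record in analytic:
    print(record)
```

Keeping every row of the base file, rather than only the matches, preserves the full sample that the design variables and weights describe.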
2007-2008 Files Merged to Create Analytical File
Data Analysis Prep
In SPSS Complex Samples analysis, a complex sample plan file is created with the NHANES 2007-2008 2-year fasting laboratory weight (WTSAF2YR) and the design variables: strata (SDMVSTRA) and cluster (SDMVPSU)
From: Saylor, J., Friedmann, E. & Lee, H. J. (2012). Navigating complex sample analysis using national survey data. Nursing Research, 61 (3), 231–237.
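To show what a plan file with a weight, a strata variable, and a cluster variable makes possible, here is a minimal sketch of a design-based standard error for a weighted mean. It assumes Taylor series linearization with PSUs treated as sampled with replacement within strata, which is a common default in survey software; it is not a description of SPSS internals. The data are hypothetical.

```python
import math
from collections import defaultdict


def design_based_mean_se(values, weights, strata, psus):
    """Weighted mean and its linearized standard error for a stratified cluster sample."""
    total_weight = sum(weights)
    mean = sum(w * y for y, w in zip(values, weights)) / total_weight

    # Sum each person's linearized score within their (stratum, PSU).
    psu_totals = defaultdict(float)
    for y, w, stratum, psu in zip(values, weights, strata, psus):
        psu_totals[(stratum, psu)] += w * (y - mean) / total_weight

    totals_by_stratum = defaultdict(list)
    for (stratum, _psu), total in psu_totals.items():
        totals_by_stratum[stratum].append(total)

    # Between-PSU variation within each stratum drives the variance.
    variance = 0.0
    for totals in totals_by_stratum.values():
        n_psus = len(totals)
        if n_psus < 2:
            raise ValueError("each stratum needs at least two PSUs")
        center = sum(totals) / n_psus
        variance += n_psus / (n_psus - 1) * sum((t - center) ** 2 for t in totals)
    return mean, math.sqrt(variance)


# Hypothetical sample: two strata, two PSUs per stratum, two people per PSU.
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
weights = [1.0] * 8
strata = [1, 1, 1, 1, 2, 2, 2, 2]
psus = [1, 1, 2, 2, 1, 1, 2, 2]

mean, se = design_based_mean_se(values, weights, strata, psus)
print(f"mean = {mean:.2f}, design-based SE = {se:.4f}")
```

The effective sample size for the variance is the number of PSUs per stratum, not the number of people or the sum of the weights.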
Comparison of Descriptive Statistics of Categorical Data
Frequencies: weighting & complex sample results are the same because the sample size is the same.
Race/Ethnicity variable:
  Unweighted: racial minorities account for 52% of the sample (oversampled)
  Complex samples: racial minorities account for only 30% when estimated for the U.S. population (more representative of the U.S.)
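The shift in the race/ethnicity percentages follows directly from how a weighted proportion is computed. A small sketch with hypothetical flags and weights (not the study data) shows an oversampled group shrinking once each person counts for the number of people they represent.

```python
def unweighted_proportion(flags):
    """Share of sampled cases in the group."""
    return sum(flags) / len(flags)


def weighted_proportion(flags, weights):
    """Share of the represented population in the group."""
    in_group = sum(w for flag, w in zip(flags, weights) if flag)
    return in_group / sum(weights)


# Hypothetical sample: half the cases belong to an oversampled group whose
# members each carry a smaller weight.
flags = [True] * 5 + [False] * 5
weights = [20_000.0] * 5 + [60_000.0] * 5

print(f"unweighted: {unweighted_proportion(flags):.0%}")
print(f"weighted:   {weighted_proportion(flags, weights):.0%}")
```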
Comparison of Descriptive Statistics of Continuous Data
Mean for each continuous variable changes when weighted
  Mean remains the same with weighting and complex samples analysis
  The proportion of cases with each value remains constant
Standard error of the mean for weighting
  Almost non-existent when weighting since the sample size appears to be the entire population
Complex samples analysis mean and standard error
  Accurate because the mean is estimated for the entire population based on calculations of the # of cases from which data are obtained
Note. Not all the variables from the metabolic syndrome study are presented in the table; * measured using the Patient Health Questionnaire.
From: Saylor, J., Friedmann, E. & Lee, H. J. (2012). Navigating complex sample analysis using national survey data. Nursing Research, 61 (3), 231–237.
Comparison of Logistic Regression: Education Regressed on Metabolic Syndrome
Logistic regression with two dichotomous variables
  Education: < a high school education = 0 and > high school education = 1
  Metabolic syndrome: absent = 0 and present = 1
Chi-square is the underlying statistical method; it analyzes frequencies, which are not affected by dispersion, so the analyses differ only slightly
Unweighted analysis: those who have less than a high school education have 62% higher odds of metabolic syndrome
Weighted and complex samples analyses: the increase in odds rises to 74%
  The odds ratio is the same for the weighted and complex samples analyses, but the 95% CIs are unrealistically narrow for the weighted analysis
Notes. See below for coding.
From: Saylor, J., Friedmann, E. & Lee, H. J. (2012). Navigating complex sample analysis using national survey data. Nursing Research, 61 (3), 231–237.
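Why the odds ratio stays put while the confidence interval narrows can be seen in a 2x2 table. The sketch below uses the standard Wald interval on the log odds ratio, which matches a logistic regression with a single dichotomous predictor. The counts are hypothetical rather than study data, and multiplying them by 1,000 stands in for treating survey weights as if they were case counts.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald 95% CI from a 2x2 table.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Hypothetical counts, then the same table inflated as if every sampled
# person were 1,000 observed cases.
sample_counts = (60, 40, 45, 55)
inflated_counts = tuple(n * 1000 for n in sample_counts)

or_sample, lo_sample, hi_sample = odds_ratio_with_ci(*sample_counts)
or_inflated, lo_inflated, hi_inflated = odds_ratio_with_ci(*inflated_counts)

print(f"sample:   OR = {or_sample:.2f}, CI width = {hi_sample - lo_sample:.3f}")
print(f"inflated: OR = {or_inflated:.2f}, CI width = {hi_inflated - lo_inflated:.3f}")
```

A design-based interval would instead use the variation between clusters within strata, which this sketch does not attempt.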
Comparison of Linear Regression: Depressive Symptoms Regressed on Diet
Linear regression with two continuous variables
  Depressive symptoms: measured via the Patient Health Questionnaire (PHQ-9)
  Diet: measured as the number of calories consumed per day
Unweighted and weighted data: biased results
  Depressive symptoms predict diet (p = .006, p < .001, respectively)
Complex samples analysis: depressive symptoms do not predict diet (p = .151)
Notes. See below for coding.
From: Saylor, J., Friedmann, E. & Lee, H. J. (2012). Navigating complex sample analysis using national survey data. Nursing Research, 61 (3), 231–237.
Limitations: NHANES & Other Secondary Data
Cross-sectional data: unable to determine causality
Unable to control definitions of variables, measurement, & data collection
  Unable to exclude subjects with a history of psychosis since NHANES does not collect these data
  Definition of smoking in the metabolic syndrome study limited by the data collected in NHANES 07-08
  Unable to choose how depressive symptoms were measured
Conclusion
National data sets make extensive, expensive, well-documented survey data available for exploratory questions, but they limit analysis to the variables included in the data set
Large sample: examine multiple predictors & interactive relationships
Challenges of national databases
  Merging data files
  Differentiating the availability of data in different waves of surveys
  Using complex sampling techniques to provide a representative sample
Complex samples data analysis programs allow inclusion of sampling design elements (stratification, clusters, & weights)
  Provide unbiased population estimates of frequencies, means, & variability
  Provide results representative of the U.S. population
FOR FURTHER INFORMATION, PLEASE CONTACT ME AT
THANK YOU
Questions or Comments?
References
[1] NCHS. (2009). NHANES 2007-2008 public data general release file documentation. Retrieved from http://www.cdc.gov/nchs/nhanes/nhanes2007-2008/generaldoc_e.htm
[2] NCHS. (2010). Continuous NHANES Web Tutorial: Sample Design. Retrieved from http://www.cdc.gov/nchs/tutorials/Nhanes/SurveyDesign/SampleDesign/intro.htm
Labeled Tables: Saylor, J., Friedmann, E. & Lee, H. J. (2012). Navigating complex sample analysis using national survey data. Nursing Research, 61 (3), 231–237.