The Weighting Strategy of the Canadian Community Health Survey
Cathlin Sarafin, Methodologist
Statistics Canada
March 25, 2008
Outline
Introduction Methodology
The Canadian Community Health Survey (CCHS)
The Multiple Frames
The Weighting Strategy of the CCHS
Methodology Recruitment Process
Introduction
Methodology Structure:
You: Recruits are called Junior Methodologists
Your Unit: 2 to 7 Methodologists supervised by one Senior Methodologist
Your Section: 3 to 6 units working on related projects, managed by a Chief
Your Division: A division has roughly 100 people, usually all together on one floor of the building
Introduction
Every person has their own responsibilities
Senior Methodologist outlines tasks
Discuss options and approaches as a team
Introduction
Survey Methodology:
Frame creation
Sampling
Questionnaire design
Data collection methods
Data processing
Edit and imputation
Weighting and estimation
Variance estimation
Data quality indicators
Record linkage
Time series
Data analysis
Disclosure control
Research and development
The CCHS
Collects general health information on the Canadian population
Estimates produced for more than 120 Health Regions (HRs) across Canada
Produces estimates on:
Health Risk Factors
Health Status
Health Care Services
The CCHS
The CCHS was introduced in 2000
Data was collected every second year for a total sample size of 130,000 per year
It was redesigned in 2007
Data is now collected continuously for a total sample size of ≈ 65,000 respondents per year
Annual files are released
Multi-year files will be produced starting in 2009
The CCHS
A cross-sectional survey: survey a specific population for a given period of time
A longitudinal survey: survey a specific population repeatedly over time
The CCHS
Target population: individuals aged 12 years old and over living in private dwellings
Exclusions: those living on Indian Reserves and Crown Lands, residents of institutions, full-time members of the Canadian Forces and residents of some remote areas
CCHS covers ~98% of the Canadian population
The CCHS
Has a complex, multi-stage, dual frame design
Area frame (49%)
Telephone list frame (50%)
Random digit dialing (RDD) frame (1%)
The telephone frames complement the area frame in most HRs
The Area Frame
Units are geographical areas
Target sampling units are not listed
Based on Labour Force Survey (LFS) design
6 rotation groups
Stratified probability proportional to size sample of clusters
Systematic sample of dwellings
Random selection of a start
Probabilistic sample of one individual per household
The Area Frame
[Figure: LFS Sample Selection in Province XYZ, showing Stratum #1 and Stratum #2]
1. Each province is divided into geographic strata
2. Clusters selected within strata (PPS sampling): 1st stage
3. Dwellings selected within clusters (systematic sampling): 2nd stage
4. People selected within responding dwellings: 3rd stage
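The first two selection stages can be sketched in Python. This is only an illustration under stated assumptions, not the LFS production procedure: the cluster sizes, the dwelling lists and the function names `pps_systematic` and `systematic` are hypothetical, and systematic PPS is just one common way of drawing a probability proportional to size sample.

```python
import random

def pps_systematic(sizes, n, rng):
    """Pick n cluster indices with probability proportional to size.

    A cluster larger than the sampling interval can be picked more than once.
    """
    interval = sum(sizes) / n
    start = rng.random() * interval              # random start in [0, interval)
    picks, upper, idx = [], sizes[0], 0
    for k in range(n):
        target = start + k * interval
        # Advance to the cluster whose cumulative size range contains target.
        while target >= upper and idx < len(sizes) - 1:
            idx += 1
            upper += sizes[idx]
        picks.append(idx)
    return picks

def systematic(units, n, rng):
    """Pick n units from an ordered list: random start, then a fixed step."""
    step = len(units) / n
    start = rng.random() * step
    return [units[min(int(start + k * step), len(units) - 1)] for k in range(n)]

rng = random.Random(2008)
cluster_sizes = [120, 80, 200, 50, 150]          # hypothetical dwelling counts
clusters = pps_systematic(cluster_sizes, 2, rng)                  # 1st stage
dwellings = {c: systematic(list(range(cluster_sizes[c])), 5, rng)  # 2nd stage
             for c in clusters}
```

The third stage (one person per responding dwelling) is left out because the slides do not say how the selection probabilities are set.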
The Area Frame
Why use such a design?
Stratification:
Better coverage of the entire region of interest
Increases precision
Clustering:
Efficient for interviewing (less travel, less costly)
Decreases precision
The Area Frame
The CCHS selection process:
The LFS provides a list of available starts (systematic samples) within each cluster
The clusters are mapped to the CCHS HRs
A random selection of starts is chosen within a HR
Probabilistic sample of one individual per household
The Area Frame
2-phase sample
1st phase is the LFS sample of starts within the LFS strata
2nd phase is the CCHS sample of starts within the HRs
The Area Frame
Why use the LFS?
No adequate list of addresses available
Costly to create and maintain such a frame
LFS has good coverage of target population
It is a monthly sample conducted at Statistics Canada
Continually updated
The Telephone Frame
List of telephone numbers from across Canada
Created using InfoDirect© files
Stratified by HR
SRSWOR sample of phone numbers
Probabilistic sample of one individual per household
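A stratified SRSWOR of telephone numbers can be sketched as follows. The frame contents, the sample sizes and the function name `stratified_srswor` are made up for illustration; only the design (simple random sampling without replacement within each HR) comes from the slide.

```python
import random

def stratified_srswor(frame, sample_sizes, rng):
    """Draw a simple random sample without replacement in each stratum (HR)."""
    return {hr: rng.sample(numbers, sample_sizes[hr])
            for hr, numbers in frame.items()}

rng = random.Random(1)
# Hypothetical frame: Health Region -> listed telephone numbers
frame = {
    "HR1": [f"613555{i:04d}" for i in range(1000)],
    "HR2": [f"902555{i:04d}" for i in range(400)],
}
sample = stratified_srswor(frame, {"HR1": 50, "HR2": 20}, rng)
```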
The RDD Frame
Phone numbers are grouped into banks
Banks are assigned to a HR
Computer randomly generates the last 2 digits
Probabilistic sample of one individual per household
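As a rough sketch of the RDD idea, the snippet below completes a bank with random two-digit endings. It assumes that a bank is the first 8 digits of a 10-digit number and that endings are drawn without repetition; neither detail is stated on the slide, and the bank value and the function name `rdd_numbers` are hypothetical.

```python
import random

def rdd_numbers(bank, count, rng):
    """Complete a bank with `count` distinct random two-digit endings."""
    endings = rng.sample(range(100), count)
    return [f"{bank}{ending:02d}" for ending in endings]

rng = random.Random(7)
numbers = rdd_numbers("86755512", 10, rng)       # hypothetical bank
```

Many generated numbers will not belong to a household, which is why the RDD frame yields a large share of out-of-scope numbers.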
Dual Frame Design
Multiple frames are used to:
Improve the coverage of the target population
Reduce costs
Area Frame
Covers target population
Costly to implement: listing costs, face-to-face interview costs
Dual Frame Design
Telephone Frame
Only covers population with listed phone numbers
Undercoverage may bias the estimates
Growing problem with the increasing popularity of cell phones
Less costly to implement
Calls made from regional offices
Dual Frame Design
RDD Frame
Inefficient: results in a large number of out-of-scope numbers
Used alone for 2 northern regions
LFS is not adequate for these 2 regions
Used as a complement to the area frame in Whitehorse and Yellowknife
Quality of telephone frame is considered poor in these regions
The Weighting Strategy of the CCHS
Area Frame
A0 - Initial weight
A1 - Sub-cluster adjustment
A2 - Stabilization
A3 - Out-of-scope dwellings
A4 - Household nonresponse
Telephone Frame
T0 - Initial weight
T1 - Number of collection periods
T2 - Out-of-scope numbers
T3 - Household nonresponse
T4 - Multiple phone lines
Combined Frame
I1 - Integration
I2 - Person selection
I3 - Person nonresponse
I4 - Winsorization
I5 - Calibration
Final CCHS Weight
Sampling Weights
Number of people in the population represented by the interviewed person
Ex: $w_i = 500$
Can be broken down into 3 major steps:
Design weights
Nonresponse adjustment
Calibration
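To make the interpretation of a weight concrete, the short sketch below estimates how many people in the population have some characteristic by summing the weights of the respondents who report it. The weights, the indicator values and the function name `estimated_total` are hypothetical.

```python
def estimated_total(weights, indicator):
    """Sum the weights of respondents for whom the indicator is true."""
    return sum(w for w, flag in zip(weights, indicator) if flag)

weights = [500, 350, 420, 610]                   # hypothetical final weights
has_condition = [True, False, True, False]       # hypothetical survey answers
estimate = estimated_total(weights, has_condition)   # 500 + 420 = 920
```

Here four respondents stand for 1,880 people in total, of whom an estimated 920 have the characteristic.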
Design Weights
Weights determined by the design of the survey
They are the inverse of the inclusion probability
A person selected according to a sampling fraction of 1% will have a weight of 1/0.01 = 100
The design weights in the CCHS are calculated separately for each frame
Sampling fractions differ between HRs, therefore design weights are not uniform
List Frame Design Weights
The sample is stratified by HR, so weights are calculated within HR
It is an SRSWOR of phone numbers
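Probability of selection within HR g is
$\pi_{gi} = \frac{n_g}{N_g}$
where $n_g$ is the number of phone numbers sampled in HR g and $N_g$ is the number of phone numbers on the list frame in HR g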
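A small sketch of the list frame design weight, computed as the inverse of the selection probability within each HR. The frame and sample counts are invented, and `list_frame_design_weights` is a hypothetical helper name.

```python
def list_frame_design_weights(frame_counts, sample_counts):
    """Return the design weight N_g / n_g for each Health Region g."""
    weights = {}
    for hr, frame_size in frame_counts.items():
        probability = sample_counts[hr] / frame_size   # n_g / N_g
        weights[hr] = 1 / probability
    return weights

frame_counts = {"HR1": 20000, "HR2": 5000}       # hypothetical N_g
sample_counts = {"HR1": 200, "HR2": 100}         # hypothetical n_g
design_weights = list_frame_design_weights(frame_counts, sample_counts)
```

HR1 is sampled at 1% and HR2 at 2%, so their weights are about 100 and 50, which illustrates why design weights are not uniform across HRs.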
Area Frame Design Weights
The LFS is redesigned every 10 years
A 20-year sample plan is created
The LFS provides a list of available starts
Typically consists of 40 columns and 6 rows per LFS stratum
Each row represents a rotation group
Each column represents a monthly LFS sample
Area Frame Design Weights
LFS Stratum  Rotation  Cluster  Start  Cluster  Start  Cluster  Start
50           1         1        1      1        2      1        3
50           2         2        4      2        5      3        6
50           3         7        8      7        9      7        10
50           4         6        1      6        2      4        3
50           5         9        4      9        5      9        6
50           6         5        16     5        12     5        13
One LFS sample
Area Frame Design Weights
The LFS provides a weight for one LFS sample
A weight for every start in one column
This weight is used to assign a weight to all available starts
The weights are then redistributed to the CCHS selected starts within each HR
$W_s = W_{lfs} \times \frac{R}{S}$
Nonresponse Adjustments
The design weights are corrected for total nonresponse (NR)
All the variables for the respondent are missing:
Complete refusal
Unable to contact the respondent
Respondent absent for the duration of the survey
Language barrier
Information obtained is unusable
Nonresponse Adjustments
There are 2 types of NR in the CCHS
Household level
Person level
The weights of the nonrespondents have to be redistributed to the respondents
Form groups based on auxiliary information
NR Adjustments
There are several methods available for the creation of response homogeneity groups (RHGs)
The CCHS uses the scoring method
Logistic regression is used to obtain a probability of response ($\hat{p}$) for every unit
Groups are formed based on the values of $\hat{p}$
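One simple way to turn response probabilities into groups is to sort the units by their estimated probability and cut the ordered list into equal-sized classes. The sketch below does that; it is a stand-in for illustration only and is not the grouping method used by the CCHS. The probabilities are assumed to come from an already fitted logistic regression, and `form_groups` is a hypothetical helper.

```python
def form_groups(p_hat, n_groups):
    """Split unit indices into n_groups classes of increasing p_hat."""
    order = sorted(range(len(p_hat)), key=lambda i: p_hat[i])
    size = len(order) / n_groups
    groups = [[] for _ in range(n_groups)]
    for rank, unit in enumerate(order):
        groups[min(int(rank / size), n_groups - 1)].append(unit)
    return groups

p_hat = [0.9, 0.2, 0.6, 0.4, 0.8, 0.3]           # hypothetical probabilities
groups = form_groups(p_hat, 3)
```

Units with similar estimated probabilities end up in the same group, which is the property a response homogeneity group is meant to have.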
NR Adjustments
Logistic Regression Models
Variables include geographic information, process data and socio-economic indicators
Variables derived from process data include:
Number of attempts
Time/day of attempt
Called on weekday/weekend
NR Adjustments
Initial groups are formed using a clustering algorithm in SAS
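These groups are then collapsed to ensure:
A response rate of at least 50%
At least 20 observations
The adjustment within each RHG is
$a_{NR} = \frac{\sum_{i=1}^{n} W_{D,i}}{\sum_{i=1}^{r} W_{D,i}}$
where the numerator sums the design weights of all n units in the RHG and the denominator sums those of the r respondents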
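The adjustment can be illustrated for a single RHG: the factor is the weighted count of all units divided by the weighted count of respondents, and it is applied to the respondents only. The weights and response flags are hypothetical, as is the function name `nonresponse_adjust`.

```python
def nonresponse_adjust(units):
    """units: list of (design_weight, responded) pairs from one RHG.

    Returns the adjustment factor and the adjusted respondent weights.
    """
    total = sum(w for w, _ in units)
    responding = sum(w for w, responded in units if responded)
    factor = total / responding
    return factor, [w * factor for w, responded in units if responded]

rhg = [(100.0, True), (100.0, False), (200.0, True), (100.0, True)]
factor, adjusted = nonresponse_adjust(rhg)       # factor = 500 / 400 = 1.25
```

After the adjustment the respondents carry the whole weight of the group, so the group total of 500 is preserved.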
Integration of Frames
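[Figure: the area frame covers households with no phone line, an unlisted phone number or a listed phone number; the telephone frame covers only households with a listed phone number]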
Integration of Frames
Area Frame: Population = A, Sample = $S_A$
Telephone Frame: Population = B, Sample = $S_B$
$\hat{Y}_{int} = \alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}$
Integration
Integration factor:
A number between 0 and 1
For CCHS it is based on sample size
$\alpha = \frac{n_A}{n_A + n_B}$
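The sketch below computes a sample-size based integration factor and applies it to the weights: area frame weights are multiplied by the factor and telephone frame weights by one minus the factor. Summing the integrated weights then gives the same result as combining the two frame estimates. All data and helper names (`integration_factor`, `integrate_weights`, `weighted_total`) are hypothetical, and the example treats both samples as covering the same population.

```python
def integration_factor(n_a, n_b):
    """Share of the combined sample that comes from the area frame."""
    return n_a / (n_a + n_b)

def integrate_weights(area, phone, alpha):
    """area, phone: lists of (weight, y). Scale each frame's weights."""
    area_int = [(alpha * w, y) for w, y in area]
    phone_int = [((1 - alpha) * w, y) for w, y in phone]
    return area_int + phone_int

def weighted_total(units):
    """Estimate a total as the sum of weight times value."""
    return sum(w * y for w, y in units)

area = [(100.0, 1), (120.0, 0), (80.0, 1)]       # hypothetical area sample
phone = [(150.0, 1), (150.0, 0)]                 # hypothetical phone sample
alpha = integration_factor(len(area), len(phone))    # 3 / 5 = 0.6
combined = integrate_weights(area, phone, alpha)
y_int = weighted_total(combined)
```

With these numbers the area estimate is 180 and the telephone estimate is 150, so the integrated estimate is 0.6 × 180 + 0.4 × 150 = 168.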
Integration
Parameter of interest:
Unbiased estimates: $E(\hat{Y}_{AB}^{S_A}) = E(\hat{Y}_{AB}^{S_B})$
Integration
Composite estimation
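$E(\alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}) = \alpha E(\hat{Y}_{AB}^{S_A}) + (1-\alpha) E(\hat{Y}_{AB}^{S_B})$
Both expectations equal the parameter of interest, so the composite estimator is unbiased as well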
Integration of Frames
Possible to integrate only the overlapping populations covered by the 2 frames
Problem identifying the overlapping portion for the area frame due to nonresponse
Possible to impute these cases
$\hat{Y}_{int} = \hat{Y}_{A}^{S_A} + \alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}$
Integration of Frames
[Figure: area frame and telephone frame samples, with regions labelled $S_A$, $S_{AB}$, $S_{AU}$ and $S_B$]
Integration of Frames
Logistic regression is used to assign a probability of belonging to the non-common part $S_A$
The final integration method is
$\hat{Y}_{int} = p \hat{Y}_{A}^{S_A} + (1-p) \alpha \hat{Y}_{AB}^{S_A} + (1-\alpha) \hat{Y}_{AB}^{S_B}$
Calibration
Weights are adjusted to match population projection counts
Based on the Census
Adjusted to account for births, deaths, immigration and emigration
The rounded average of the monthly projection counts is used within each post-stratum
Calibration
Why is calibration used?
Gives confidence when estimating totals
Improves precision of the estimates if auxiliary variables are well correlated to the survey variables
Adjusts for coverage inadequacies when the survey population differs from the target population
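In its simplest form, calibration to population counts is a post-stratification: within each post-stratum the weights are multiplied by the ratio of the projection count to the current sum of weights. The sketch below shows that simple form only; the weights, the projection counts and the function name `calibrate` are hypothetical.

```python
def calibrate(weights_by_stratum, projections):
    """Scale the weights in each post-stratum to sum to its projection count."""
    calibrated = {}
    for stratum, weights in weights_by_stratum.items():
        factor = projections[stratum] / sum(weights)
        calibrated[stratum] = [w * factor for w in weights]
    return calibrated

weights_by_stratum = {
    ("HR 2", "20-29", "F"): [300.0, 250.0, 350.0],   # sums to 900
    ("HR 2", "20-29", "M"): [400.0, 600.0],          # sums to 1000
}
projections = {("HR 2", "20-29", "F"): 1000, ("HR 2", "20-29", "M"): 950}
calibrated = calibrate(weights_by_stratum, projections)
```

After calibration the weights in each post-stratum add up to the projection count, so estimated population totals agree with the demographic figures.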
Calibration
In the CCHS
All post-strata with at least 20 observations are calibrated at the HR by age by sex level
HR: 120 across Canada
Age groups: 12-19, 20-29, 30-44, 45-64 and 65+
Sex: Male and Female
Calibration
Example: HR 2
Number of Observations
Age Group  Females  Males
12-19      15       25
20-29      40       40
30-44      53       53
45-64      18       22
65+        31       31
Post-strata = HR by age by sex
Post-strata = HR by sex
Post-strata = Prov by age by sex
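The 20-observation rule can be checked against the HR 2 example. The sketch reads the first table of counts as females and the second as males, which is an assumption about the slide layout, and it only flags the cells that are too small; how such cells are then handled is not spelled out here.

```python
MIN_OBSERVATIONS = 20

# Counts from the HR 2 example (first table read as females, second as males)
counts = {
    "Females": {"12-19": 15, "20-29": 40, "30-44": 53, "45-64": 18, "65+": 31},
    "Males": {"12-19": 25, "20-29": 40, "30-44": 53, "45-64": 22, "65+": 31},
}

# Post-strata that cannot be calibrated at the HR by age by sex level
too_small = [(sex, age)
             for sex, ages in counts.items()
             for age, n in ages.items()
             if n < MIN_OBSERVATIONS]
```

Under that reading, only the female 12-19 and 45-64 cells fall below the threshold.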
Final Weights
Master: Contains all variables for all respondents
Share: Contains all variables for the subset of people who agreed to share (subset of records)
PUMF: Contains a subset of variables for all respondents (subset of variables)
Dummy: Contains a subset of records from the master file. Scrambled data used for testing and remote access purposes
Bootstrap: Created for variance estimation purposes
Special Requests: linkage, different geographies, etc.
Methodology
Typical tasks:
Write computer programs to solve problems or explore data
Attend meetings
Write documentation
Present our work at seminars
Work on different committees
Methodology
Working Conditions
Permanent job
Continuous learning:
Computer courses
Statistics and methodology courses
Language courses
Seminars, conferences and publications
Methodology
All methodologists work at the Head Office in Ottawa
Recruitment
Our recruitment campaign takes place each fall
Detailed presentations at the Universities by early October
It is a 3 step process:
On-line application: starts in September, deadline in mid-October
Written exam: early November
Interview: January
Recruitment
Who can apply?
Persons residing in Canada and Canadian citizens residing abroad
Preference will be given to Canadian citizens
Bilingualism
No preference is given to those who speak both English and French
For more information please contact
www.statcan.ca, under: About Us > Employment opportunities > Mathematical statisticians (MA)
Email: [email protected]
Telephone: 1-888-321-3089