Convergence of the Biostatistical and Survey Worlds
Thomas A. Louis, PhD
Department of Biostatistics, Johns Hopkins Bloomberg SPH
Research & Methodology, U. S. Census Bureau
T. A. Louis: Johns Hopkins Biostatistics & Census Bureau McGill, Epidemiology/Biostatistics 50th , 2015 1
Outline
The Census Bureau
A sampling of research at Census
Adaptive design
Disclosure avoidance
A few other topics
Design-based/Model-based
Convergence of the Biostatistical and survey cultures
Preamble
Historically, survey, biostatistical and epidemiological methods and cultures were quite distinct, or at least appeared to be so
However, service as Associate Director for Research & Methodology and Chief Scientist at the U. S. Census Bureau has heightened my awareness of the similarities of goals and methods, and of the many potentials
Convergence steadily increases to the benefit of all
I highlight some examples, but first
HAPPY 50th!
The U. S. Census Bureau
Employees
≈ 15,000 employees; of these, ≈ 5,000 are on permanent appointments
The remainder are primarily part-time interviewers and other field staff
Central office in Suitland MD, and 6 Regional offices
Censuses
The decennial census (the only activity embedded in the U. S. Constitution)
The Population & Housing Census - every 10 years
The Economic Census - every 5 years
The Census of Governments - every 5 years
Monthly Import/Export compilations
Selected surveys (of ≈ 130/yr)
The American Community Survey (continuous)
The Current Population Survey (CPS): includes health insurance questions
The Survey of Income and Program Participation (SIPP): ditto
The National Survey of College Graduates
The National Crime Victimization Survey (NCVS)
The National Survey of Family Growth (NSFG)
The Health Interview Survey
International surveys and censuses
Adaptive Design
Goals & Methods
Reduce the time/expense from the start of data collection to completion
Efficiently allocate data collection resources
Use dynamic mode-switching to increase efficiency and enhance quality (dynamic treatment regimens)
Employ stopping rules (possibly stratum-specific)
Necessary Inputs
Sampling frame (under-utilized in clinical and field trials)
Paradata ⇒ propensity models
Cost & Quality metrics
Measures of statistical information
Timely and accurate data
R-indicators: Overview
Based on the sampling frame and attributes, R-Indicators quantify representativeness of survey coverage
They identify the attributes that drive variation in response propensities and support adaptation by evaluating which subgroups are over/under represented
The sample R-indicator
ρ_i is the estimated (possibly adjusted) response propensity for group i
\[ R(\rho) = 1 - 2\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(\rho_i - \bar{\rho})^2} \]
R(ρ) = 1 indicates that the sample is fully representative
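A minimal sketch of the sample R-indicator, one minus twice the sample standard deviation (divisor N − 1) of the estimated propensities. The function name and the propensity values are hypothetical illustrations, not NSCG figures.

```python
import numpy as np

def r_indicator(propensities):
    """Sample R-indicator: 1 - 2 * (sample SD of the estimated propensities)."""
    rho = np.asarray(propensities, dtype=float)
    # ddof=1 gives the 1/(N-1) divisor
    return 1.0 - 2.0 * rho.std(ddof=1)

# Hypothetical propensities for four groups
equal = r_indicator([0.6, 0.6, 0.6, 0.6])    # no variation across groups
unequal = r_indicator([0.2, 0.4, 0.6, 0.8])  # propensities differ by group
print(equal, unequal)
```

With no variation in propensities the indicator equals 1; the more the propensities differ across groups, the further it falls below 1.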
The National Survey of College Graduates¹
Data are collected by a variety of modes: web, telephone, . . .
The 2013 NSCG uses monitoring to identify target cases for mode-switching, with the goal of moving a case to the mode with the highest response propensity or to control costs by not moving
Hold a case in web if it is “low impact”
Switch to CATI (computer-assisted telephone interview) if it has not responded via web and is “high impact”
Put a CATI case on hold (no contacts) if the “R-indicator” shows that the group is over-represented
Strike an effective cost/quality tradeoff
¹ Thanks to Ben Reist
Comparison of incentive approaches in the NSCG
4 separate surveys, each using a different set of incentives, but with the same attributes used in the propensity model
Partial, unconditional, R-indicators
Identify subgroups that are over/under represented and use the information to encourage or “not encourage” specific cases or groups
Adapt by switching modes, incentives, etc.
With ρ_k the estimated (possibly adjusted) response propensity for group X = k, ρ the composed vector, and ρ̄ the (weighted) mean, the (partial) unconditional R-indicator is
\[ R_u(X = k, \boldsymbol{\rho}) = \left(\frac{N_k}{N_+}\right)^{1/2} (\rho_k - \bar{\rho}) \]
It’s a residual, and R_u = 0 ⇒ balance
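A small sketch of the partial unconditional R-indicator for each group. It assumes N_+ is the total frame size and that ρ̄ is weighted by the group sizes N_k; the function name, group sizes, and propensities are hypothetical.

```python
import numpy as np

def partial_unconditional_r(group_sizes, group_propensities):
    """R_u(X = k) = sqrt(N_k / N_+) * (rho_k - rho_bar) for every group k."""
    n_k = np.asarray(group_sizes, dtype=float)
    rho_k = np.asarray(group_propensities, dtype=float)
    n_total = n_k.sum()                       # N_+ (assumed: total frame size)
    rho_bar = np.sum(n_k * rho_k) / n_total   # size-weighted mean propensity
    return np.sqrt(n_k / n_total) * (rho_k - rho_bar)

sizes = [500, 300, 200]     # hypothetical group sizes
props = [0.70, 0.50, 0.40]  # hypothetical response propensities
r_u = partial_unconditional_r(sizes, props)
print(r_u)
```

Positive values flag over-represented groups (candidates for "not encouraging"), negative values flag under-represented groups, and values near zero indicate balance.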
NSCG Data Monitoring Example
How long to wait before sending hard copy?
Event-time analysis
In the American Community Survey (ACS), need to determine how long to wait for an internet response before sending hard-copy
Demographic group-specific, event-time distributions were estimated with the event being “answered via the internet”
The event-time is administratively censored via sending hard-copy, contacting by phone, etc.
With T the internet return time, compute,
P(s, d) = pr(T ≤ s + d | T > s)
Switch to hard copy if P(s, d) < γ for a specified delay d.
Optimize wrt (d , γ) to reduce delay and control costs
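One way to compute P(s, d) from censored return times is through a Kaplan-Meier estimate of S(t) = pr(T > t), using P(s, d) = 1 − S(s + d)/S(s). The sketch below assumes that approach (the slides do not say which estimator was used); the data, the threshold γ = 0.5, and all function names are hypothetical.

```python
import numpy as np

def km_survival(times, events):
    """Kaplan-Meier estimate: distinct event times and S(t) just after each."""
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=bool)
    event_times = np.unique(times[events])
    surv, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)            # still waiting just before t
        d = np.sum((times == t) & events)       # internet responses at t
        s *= 1.0 - d / at_risk
        surv.append(s)
    return event_times, np.array(surv)

def surv_at(t, event_times, surv):
    """S(t) = pr(T > t), a right-continuous step function."""
    idx = np.searchsorted(event_times, t, side="right")
    return 1.0 if idx == 0 else surv[idx - 1]

def p_respond_soon(s, d, event_times, surv):
    """P(s, d) = pr(T <= s + d | T > s) = 1 - S(s + d) / S(s)."""
    return 1.0 - surv_at(s + d, event_times, surv) / surv_at(s, event_times, surv)

# Hypothetical days to internet response; event 0 = censored (hard copy sent)
days = [2, 3, 3, 5, 8, 10, 10, 14, 14, 14]
responded = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
et, sv = km_survival(days, responded)
p = p_respond_soon(s=5, d=5, event_times=et, surv=sv)
switch_to_hard_copy = p < 0.5   # gamma = 0.5 is an illustrative threshold
print(p, switch_to_hard_copy)
```

In practice the curves would be estimated separately for each demographic group, and (d, γ) tuned against delay and cost.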
Internet response time distributions²
² From ACS Memorandum #ACS13–RER–18
Stopping rules
When is there sufficient information to stop conducting interviews?
The “stop and impute rule”
θ̂_now: Use currently collected data, augmented by imputation of missing values
The “project rule”
θ̂_future: Collect a specified number of additional interviews, and then augment by imputation of missing values
If a prediction model indicates that
\[ \mathrm{pr}\left( \left| \hat{\theta}_{now} - \hat{\theta}_{future} \right| > \epsilon \right) < \gamma \]
then stop and use θ̂_now
Similar to futility assessment in a clinical trial
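A rough Monte Carlo sketch of the stopping rule for a simple mean. As a stand-in for the prediction model it resamples future responses from the current ones, and it ignores the imputation step; both are simplifying assumptions, and the data, ε, and γ values are hypothetical.

```python
import numpy as np

def prob_estimate_moves(y_now, n_more, eps, n_sims=2000, seed=0):
    """Estimate pr(|theta_now - theta_future| > eps) for a sample mean."""
    rng = np.random.default_rng(seed)
    y_now = np.asarray(y_now, dtype=float)
    theta_now = y_now.mean()
    # Stand-in prediction model: future interviews resemble current ones
    future = rng.choice(y_now, size=(n_sims, n_more), replace=True)
    theta_future = (y_now.sum() + future.sum(axis=1)) / (y_now.size + n_more)
    return np.mean(np.abs(theta_now - theta_future) > eps)

rng = np.random.default_rng(1)
y = rng.normal(50.0, 10.0, size=400)   # hypothetical completed interviews

p_move = prob_estimate_moves(y, n_more=100, eps=1.0)
stop = p_move < 0.05                   # gamma = 0.05 is illustrative
p_move_tight = prob_estimate_moves(y, n_more=100, eps=0.01)
print(p_move, stop, p_move_tight)
```

With a generous ε the extra 100 interviews are very unlikely to move the estimate, so the rule says stop; with a very tight ε it does not.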
Issues with adaptive designs
Need robust approaches to avoid degrading quality due to inappropriate adaptation wrt identified subgroups of interest
To avoid degrading coverage for other subgroups
You are creating the database; don’t mess it up!
Learning from data generated by an adaptive design is complex
Adaptation may induce confounding that needs to be removed
The good news is that the propensities are available
The database may be less useful for learning than one produced non-adaptively
There is a trade-off between generating a learning database and optimizing survey performance
A single mode becomes a “vector mode”
Very similar to issues in adaptive clinical trials
Disclosure Avoidance and Data Dissemination
Setting the scene
Make data available while protecting confidentiality
Average income in small areas, industrial output in small areas
Longitudinal Employer-Household Dynamics (LEHD)
American FactFinder or “OnTheMap”
Micro-data Analysis System (MAS)
Releasing full “micro-data” provides complete information, but with 100% disclosure risk
Releasing no data provides no information, with 0% risk
The trade-off should be societally determined, but formalism is needed to guide the choice
Achieving an acceptable trade-off is becoming more difficult in the context of “big data” and active intruder threats
Record linkage is closely related (for both good and ill)
Trade-offs, as in diagnostic testing
Disclosure risk and the value of the data are positively related
The trade-off is very similar to that for an ROC
The X-axis is disclosure risk, rather than (1 - specificity)
The Y-axis is available information, rather than sensitivity
[Figure: ROC-style curve; X-axis: Disclosure Risk; Y-axis: Available Info]
Methods to reduce disclosure risk
Bureaucratic/legal: Titles 13 & 26, RDCs, . . .
In the big data era, these may become the mainstays
Cell suppression
Aggregation
Random swapping
Add noise, “noisy fusion”
Add random N(0, σ²) noise, split noise, . . .
The variance controls disclosure risk and available information
Added noise inflates variance, but aggregation or modeling still supports informative inferences
Synthetic data
Develop a (Bayesian) model for the full micro-data that preserves important relations
Generate one or more datasets based on the model-based, posterior predictive distribution
Can provide an effective information/protection trade-off
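The noise-addition idea above can be sketched as follows: independent N(0, σ²) noise masks individual records while an aggregate (here the mean) remains informative. The simulated incomes and the value of σ are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical confidential micro-data: 5,000 incomes
incomes = rng.lognormal(mean=10.5, sigma=0.5, size=5000)

sigma = 20000.0  # illustrative noise SD; larger sigma = more protection, less information
released = incomes + rng.normal(0.0, sigma, size=incomes.size)

record_error = np.mean(np.abs(released - incomes))     # typical per-record distortion
mean_error = abs(released.mean() - incomes.mean())     # distortion of the aggregate
variance_inflation = released.var() / incomes.var()    # added noise inflates variance

print(record_error, mean_error, variance_inflation)
```

Individual released values are off by thousands of dollars on average, while the released mean stays close to the true mean because the noise averages out over many records.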
Partially synthetic data allow users to select custom geographies in “OnTheMap”
[Figures: Commuting Patterns, Portland OR; Hurricane Sandy]
Measuring disclosure risk
There is always a disclosure risk when data are made available, and it is best measured by the probability of disclosure
For example, among n identified people, one with “income > $100,000” and with no other information available, the disclosure risk is 1/n
More sophisticated measures are available
In the era of big data there is other information available from record matching and melding, increasing the risk beyond (sometimes far beyond) what a “local” assessment computes
Probabilistic Differential Privacy³
For ε ≥ 0, a randomized function K gives ε-differential privacy, if for all data sets D and D′ differing in at most one element (e.g., row of data), and all S ⊆ range(K),
\[ \left| \log\left[ \frac{\mathrm{pr}\{K(D) \in S\}}{\mathrm{pr}\{K(D') \in S\}} \right] \right| \le \epsilon \]
A global guarantee: The protection is for all possible deletions for datasets that you have identified
Plausible deniability: A reported value has a “similar” probability irrespective of whether your data are or are not included in the data set
Example: Reporting mean salary in successive years with one hire for the second year confers almost no protection
The trimmed mean or an M-estimate confer protection
So do other robust statistics, noisy fusion, synthetic data
To compute ε you need to know D and in this Big Data era, it is likely bigger than you assume
³ Dwork & Smith, J. Privacy and Confidentiality, 2009
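To make the definition concrete, here is a sketch using the Laplace mechanism, a standard construction that the slides do not mention. It releases a count plus Laplace noise with scale sensitivity/ε and checks numerically that the log density ratio for two neighboring data sets never exceeds ε (a pointwise density bound implies the bound for every set S). The counts and ε are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng):
    """Release true_value plus Laplace noise with scale sensitivity / epsilon."""
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

def laplace_log_density(x, center, scale):
    """Log density of the released value when the true value is `center`."""
    return -np.log(2.0 * scale) - np.abs(x - center) / scale

epsilon = 0.5
sensitivity = 1.0                    # a count changes by at most 1 when one row changes
count_d, count_d_prime = 120, 121    # hypothetical counts on neighboring data sets D, D'

scale = sensitivity / epsilon
grid = np.linspace(80.0, 160.0, 2001)  # possible released values
log_ratio = (laplace_log_density(grid, count_d, scale)
             - laplace_log_density(grid, count_d_prime, scale))
max_abs_log_ratio = np.max(np.abs(log_ratio))

rng = np.random.default_rng(0)
noisy_count = laplace_mechanism(count_d, sensitivity, epsilon, rng)
print(max_abs_log_ratio, noisy_count)
```

Smaller ε means a larger noise scale: stronger plausible deniability, at the price of a less accurate released count.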
Record Matching
Latanya Sweeney
A large percentage of the U.S. population has a high probability of being identified based only on place, gender, date of birth
This probability soars when other information is matched.
Applications
e-Health: Combining medical records and iPhone info, and . . .
Precision Medicine: Gene signatures and medical records
Who was where doing what: credit card charges melding with Facebook posts
Social networks: Cell phone meta-data
Real-time population estimates: Cell phone meta-data
Facial recognition
Mortality in a war zone
Scans of the universe: Is this galaxy in Sloan Sky the same as the one in another database?
More on record matching
In conducting a survey
Is the person at address A the same as the person in the IRS data? If so, can use the IRS information to augment other information
De-duplication: root out double counting
Imputation: Help build the imputation model
Approaches and challenges
Frequentist: “Big Match”
Bayesian structuring: pr(match | data) ⇒ fractional matches
In general, a challenging computing problem, especially for matching using ≥ 3 sources
Computer science to the rescue?
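A toy illustration of pr(match | data) for one candidate record pair, using a Fellegi-Sunter-style model with conditionally independent field agreements. This is a simplified stand-in, not a description of any Census Bureau matching system; the fields, the m- and u-probabilities, and the prior are all hypothetical.

```python
import numpy as np

def posterior_match_prob(agree, m, u, prior):
    """pr(match | agreement pattern) by Bayes' rule, fields assumed independent.

    m[j] = pr(field j agrees | true match), u[j] = pr(field j agrees | non-match).
    """
    agree = np.asarray(agree, dtype=bool)
    m = np.asarray(m, dtype=float)
    u = np.asarray(u, dtype=float)
    like_match = np.prod(np.where(agree, m, 1.0 - m))
    like_nonmatch = np.prod(np.where(agree, u, 1.0 - u))
    return prior * like_match / (prior * like_match + (1.0 - prior) * like_nonmatch)

# Hypothetical fields: last name, birth year, ZIP code
m = [0.95, 0.90, 0.85]
u = [0.01, 0.05, 0.10]
prior = 0.001   # hypothetical share of candidate pairs that are true matches

p_all_agree = posterior_match_prob([1, 1, 1], m, u, prior)
p_zip_differs = posterior_match_prob([1, 1, 0], m, u, prior)
p_none_agree = posterior_match_prob([0, 0, 0], m, u, prior)
print(p_all_agree, p_zip_differs, p_none_agree)
```

The middle case is a "fractional match": rather than forcing a yes/no link, downstream analyses can carry the match probability forward.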
Micro-simulation/agent-based model for the 2020 Census
Households are “agents”
Inputs: Mode-specific and mode-sequence specific response probability, quality and NRFU (non-response follow-up) values
Outputs: National and domain-specific (cost, quality) probability distributions that help to identify high-leverage research
Leading to a design with a high probability of success
(cost, quality) in an acceptable region
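A toy micro-simulation in this spirit: each household is an agent that may respond by web, then by mail, and otherwise goes to NRFU. Repeating the simulation with uncertain inputs gives a joint distribution of (cost, quality) and a probability of landing in an acceptable region. Every probability, cost, and threshold below is a made-up illustration, and self-response rate is used as a crude quality proxy.

```python
import numpy as np

def simulate_once(n_households, rng):
    """One simulated census: returns (cost per household, self-response rate)."""
    # Hypothetical inputs; uncertainty about web uptake is expressed as a Beta draw
    p_web = rng.beta(45, 55)
    p_mail_given_no_web = 0.30
    cost_web, cost_mail, cost_nrfu = 1.0, 5.0, 50.0

    web = rng.random(n_households) < p_web
    mail = ~web & (rng.random(n_households) < p_mail_given_no_web)
    nrfu = ~(web | mail)                      # remaining households need follow-up

    cost = (cost_web * n_households           # everyone gets the web invitation
            + cost_mail * (~web).sum()        # mail goes to web non-responders
            + cost_nrfu * nrfu.sum())         # field visits are the expensive part
    return cost / n_households, (web | mail).mean()

rng = np.random.default_rng(2020)
runs = np.array([simulate_once(10_000, rng) for _ in range(200)])
cost_per_hh, self_response = runs[:, 0], runs[:, 1]

# "Success" = (cost, quality) in an acceptable region; thresholds are illustrative
p_success = np.mean((cost_per_hh <= 24.0) & (self_response >= 0.60))
print(cost_per_hh.mean(), self_response.mean(), p_success)
```

Varying one input at a time and watching p_success change is one simple way to spot the high-leverage inputs.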
Logic Diagram for Administrative Records
Statistical inference is all about missing data
Infinite reference population
Infer T(Y_{n+1}, . . . , Y_∞), conditional on (Y_1, . . . , Y_n) and on the sampling plan
Finite population
Infer T(Y_{n+1}, . . . , Y_N), conditional on (Y_1, . . . , Y_n) and on the sampling plan
Always need to account for uncertainties
Uncertainty due to (Y_1, . . . , Y_n) providing only finite information on the predictive distribution
Uncertainty in predicting the unobserved Ys using a known predictive distribution
Bayesian formulations are the way to go!
Design vs. model-based inference
Design-based (randomization) inference: The Ys are fixed and inference is based on the distribution of sample inclusion indicators
Model-based inference: The Ys are also random variables from a probability distribution
Superpopulation: Frequentist inference based on repeated samples from the super-population and from the resulting sample
Bayes: add a prior for parameters; inference based on posterior distribution of finite population quantities
The fundamental distinction is use of a randomization distribution versus a stochastic model for the Ys, however
“Weighters” shouldn’t ignore models
Modelers can’t ignore (design) weights
Bayesian models that incorporate design features can yield inferences with very good design-based properties
The basic setup
Finite population: U = {1, 2, . . . , N}
Values of interest: Y_k, k ∈ U
The Y_k are a set of fixed, but unknown numbers, not necessarily from a probability distribution
Goal: Estimate a function of the Y_k, any function, but we’ll focus on the population total or mean
\[ \text{total: } T(\mathbf{Y}) = \sum_{k=1}^{N} Y_k \qquad \text{mean: } \frac{T(\mathbf{Y})}{N} \]
Draw a sample S ⊂ U with,
pr(unit k ∈ S) = π_k > 0 (can depend on covariates)
pr(k and ℓ ∈ S) = π_{kℓ}
pr(k_1, . . . , k_n ∈ S) = π_{k_1 k_2 . . . k_n}
The weighting game
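Sample membership indicators:
\[ Z_k = \begin{cases} 1, & k \in S \\ 0, & k \notin S \end{cases} \]
E(Z_k) = π_k
E(Z_k Z_ℓ) = π_{kℓ}
The Z_k are random variables; the Y_k are constants
The Horvitz-Thompson, unbiased estimate of T:
\[ \hat{T} = HT[Y_k] = \sum_{k \in S} \frac{Y_k}{\pi_k} = \sum_{k \in U} \frac{Z_k Y_k}{\pi_k} \]
\[ E(\hat{T}) = \sum_{k \in U} \frac{E(Z_k) Y_k}{\pi_k} = \sum_{k \in U} \frac{\pi_k Y_k}{\pi_k} = \sum_{k \in U} Y_k = T(\mathbf{Y}) \]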
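The unbiasedness argument can be checked exactly on a tiny population by enumerating every possible sample. The sketch assumes Poisson sampling (units included independently with probability π_k), which is only one of many designs; the population values and inclusion probabilities are hypothetical.

```python
from itertools import product

import numpy as np

y = np.array([3.0, 7.0, 1.0, 9.0])    # fixed population values (hypothetical)
pi = np.array([0.2, 0.5, 0.8, 0.4])   # inclusion probabilities, all > 0

def horvitz_thompson(y, pi, z):
    """HT estimate of the total from membership indicators z (0/1 per unit)."""
    z = np.asarray(z, dtype=bool)
    return np.sum(y[z] / pi[z])

# Exact design expectation: sum over all 2^N samples of pr(sample) * estimate
expected_ht = 0.0
for z in product([0, 1], repeat=len(y)):
    z = np.array(z)
    prob = np.prod(np.where(z == 1, pi, 1.0 - pi))   # Poisson sampling (assumed)
    expected_ht += prob * horvitz_thompson(y, pi, z)

print(expected_ht, y.sum())
```

Only the Z_k are random here; the Y_k never change, which is the design-based view in miniature.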
Using auxiliary information
Assume we have information X_k, k ∈ U
Examples include location, information from administrative records (e.g., tax data), etc.
And have a model m(X_k) to predict Y_k
Then,
\[ \hat{T}_{GReG} = \sum_{k \in U} m(X_k) + \sum_{k \in U} Z_k \cdot \left( \frac{Y_k - m(X_k)}{\pi_k} \right) = \sum_{k \in U} m(X_k) + HT[Y_k - m(X_k)] \]
T̂_GReG is the Generalized Regression Estimator (GReG)
It is the original doubly robust estimator!
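A small enumeration sketch of the GReG estimator: the model total plus a Horvitz-Thompson correction of the residuals. It assumes Poisson sampling and a deliberately simple working model m(x) = 2x; the population values, covariates, and inclusion probabilities are hypothetical.

```python
from itertools import product

import numpy as np

y = np.array([3.0, 7.0, 1.0, 9.0])    # fixed population values (hypothetical)
x = np.array([1.5, 3.4, 0.6, 4.4])    # auxiliary information known for all of U
pi = np.array([0.2, 0.5, 0.8, 0.4])   # inclusion probabilities

def model(x):
    """Hypothetical working model m(x); it need not be correct."""
    return 2.0 * x

def ht(values, pi, z):
    """Horvitz-Thompson total of `values` for membership indicators z."""
    return np.sum(z * values / pi)

def greg(y, x, pi, z):
    """GReG: sum of model predictions over U + HT of the residuals."""
    return model(x).sum() + ht(y - model(x), pi, z)

samples = [np.array(z) for z in product([0, 1], repeat=len(y))]
probs = np.array([np.prod(np.where(z == 1, pi, 1.0 - pi)) for z in samples])  # Poisson sampling (assumed)
ht_vals = np.array([ht(y, pi, z) for z in samples])
greg_vals = np.array([greg(y, x, pi, z) for z in samples])

e_ht, e_greg = probs @ ht_vals, probs @ greg_vals
var_ht = probs @ (ht_vals - e_ht) ** 2
var_greg = probs @ (greg_vals - e_greg) ** 2
print(e_ht, e_greg, var_ht, var_greg)
```

Both estimators are design-unbiased when the π_k are correct, whatever the model; a model that predicts well shrinks the residuals and hence the design variance.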
“Pure” design-based is not so pure
Benefit: If the π_k are correct, the estimator is unbiased
However, in complicated surveys producing the π_k is a complicated business and computing the π_{kℓ} is (complicated)²
For example, the American Community Survey (ACS) uses a very complicated cluster design ⇒ complicated π_k
And adjustments of the π_k are needed to reflect non-response and imputation
Models are used for the adjustments and imputations
Variance computations can be complicated
Successive difference replication
Bootstrap, but beware. . .
Generally, inference for non-linear functions of the Ys requires a model
As does small domain estimation
Challenges of collecting probability samples
Most state that nonprobability or volunteer samples can’t be used for population estimates
But, “Would you rather have a 60% response rate from a well-designed and conducted Gallup survey or a 95% rate from a self-selected group?”
Advantage Gallup: The 60% is also self-selected, but information on the relation of respondents to non-respondents is available from the sampling frame and generalizing from the sample is possible
However: For the self-selected survey, there may be other data that can be used to develop reasonable weights for some reference population
Analogously, in clinical trials many (most) interesting questions are not protected by randomization, are not Intent to Treat (ITT), but progress can be made, with care!
Collecting information to support “causal analysis” is key
Informative sample size
Mean menstrual cycle length (MCL) in a prospective pregnancy study
Enroll couples who are trying to have a child
Follow until pregnancy or end of study
Average the MCLs to get a “population” estimate µ̂
Informative sample size: the relatively less fecund couples provide relatively more cycles and so the average is over-weighted towards the MCLs for less fecund couples
If MCL and fecundity are related, µ̂ will be biased relative to the population value
There are fixes (e.g., equal weighting), but using them depends on recognizing the issue
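A two-couple toy example of the issue and the equal-weighting fix. The cycle lengths and counts are invented purely for illustration.

```python
import numpy as np

# Hypothetical data: cycle lengths (days) observed for each couple
cycles_by_couple = {
    "couple_A": [28, 28],      # conceived quickly: few cycles observed
    "couple_B": [32] * 8,      # less fecund: many cycles observed
}

all_cycles = np.concatenate([np.asarray(v, dtype=float) for v in cycles_by_couple.values()])
pooled_mean = all_cycles.mean()                 # every cycle counts once

couple_means = [np.mean(v) for v in cycles_by_couple.values()]
equal_weight_mean = np.mean(couple_means)       # every couple counts once

print(pooled_mean, equal_weight_mean)
```

The pooled average (31.2 days) is pulled toward the less fecund couple, who contribute 8 of the 10 cycles; weighting each couple equally gives 30 days.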
Internal and External Worlds
Surveys focus on external validity, representation of a well-specified reference population
Clinical and epidemiological studies traditionally focus on internal validity with relatively little direct attention to representation
Without question,
The biostat/epi communities should pay more attention to survey goals and methods
The survey communities should pay more attention to biostat/epi goals and methods
Transportability as a unifying theme, see
Pearl J, Bareinboim E (2014). External Validity: From do-calculus to Transportability across Populations. Statistical Science, 29:579–595
Big Data (all data!) potentials
To support adaptation
To make sense of collected data
To transport to a reference population
Miettinen’s view, at least in 1985
Miettinen, O. S. (1985). Theoretical Epidemiology. Wiley, New York
“In science the generalization from the actual study experience is not made to a population of which the study experience is a sample in a technical sense of probability sampling. In science the generalization is from the actual study experience to the abstract, with no referent in place or time.”
Olli’s view is far too optimistic; far too trusting in immutable truths
A pleasing trend
Convergence
There is convergence, with some clinical/epi studies identifying a reasonably well-defined reference population
Sometimes using all data (big, small, in between) to help identify the population, to compute weights and transport to it
A stumbling block
Different interpretations of “representative”
In Epi/Bio it is commonly reserved for a “self-weighting” sample
In the Survey world, if the sampling weights are known, the sample is representative
The broader (and correct!) definition opens up opportunities for beneficial convergence
Coda
The goals and methods of the epi/biostat and survey communities will never completely converge; however, there are considerable similarities in goals and methods with more “sims” available
These, anchored by overarching principles, will empower convergence that will benefit each field, science and society
Consider the opportunities as you,
Enjoy the journey to your 75th and 100th
THANK YOU