Convergence of the Biostatistical and Survey Worlds

Thomas A. Louis, PhD

Department of Biostatistics, Johns Hopkins Bloomberg SPH


Research & Methodology, U. S. Census Bureau

T. A. Louis: Johns Hopkins Biostatistics & Census Bureau. McGill, Epidemiology/Biostatistics 50th, 2015

Outline

The Census Bureau

A sampling of research at Census

Adaptive design
Disclosure avoidance
A few other topics

Design-based/Model-based

Convergence of the Biostatistical and survey cultures


Preamble

Historically, survey, biostatistical and epidemiological methods and cultures were quite distinct, or at least appeared to be so

However, service as Associate Director for Research & Methodology and Chief Scientist at the U. S. Census Bureau has heightened my awareness of the similarities of goals and methods, and of the many potentials

Convergence steadily increases to the benefit of all

I highlight some examples, but first

HAPPY 50th!


The U. S. Census Bureau

Employees

≈ 15,000 employees; of these, ≈ 5,000 are on permanent appointments
The remainder are primarily part-time interviewers and other field staff

Central office in Suitland MD, and 6 Regional offices

Censuses

The decennial census (the only activity embedded in the U. S. Constitution)
The Population & Housing Census - every 10 years
The Economic Census - every 5 years
The Census of Governments - every 5 years
Monthly Import/Export compilations

Selected surveys (of ≈ 130/yr)

The American Community Survey (continuous)
The Current Population Survey (CPS): includes health insurance questions
The Survey of Income and Program Participation (SIPP): also includes health insurance questions
The National Survey of College Graduates
The National Crime Victimization Survey (NCVS)
The National Survey of Family Growth (NSFG)
The Health Interview Survey
International surveys and censuses


Adaptive Design

Goals & Methods

Reduce the time/expense from the start of data collection to completion

Efficiently allocate data collection resources

Use dynamic mode-switching to increase efficiency and enhance quality (dynamic treatment regimens)

Employ stopping rules (possibly stratum-specific)

Necessary Inputs

Sampling frame (under-utilized in clinical and field trials)

Paradata ⟹ propensity models

Cost & Quality metrics

Measures of statistical information

Timely and accurate data


R-indicators: Overview

Based on the sampling frame and attributes, R-Indicators quantify representativeness of survey coverage

They identify the attributes that drive variation in response propensities and support adaptation by evaluating which subgroups are over/under represented

The sample R-indicator

ρi is the estimated (possibly adjusted) response propensity for group i

$$R(\rho) = 1 - 2\sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(\rho_i - \bar{\rho})^2}$$

R(ρ) = 1 indicates that the sample is fully representative
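As a rough illustration of the sample R-indicator defined on this slide, the sketch below computes one minus twice the sample standard deviation of the estimated propensities. The function name `r_indicator` and the example propensities are hypothetical; the slide does not say how the propensities are estimated or weighted, so the sketch is unweighted.

```python
import math

def r_indicator(propensities):
    """Sample R-indicator: 1 - 2 * (sample standard deviation of propensities)."""
    n = len(propensities)
    if n < 2:
        raise ValueError("need at least two propensities")
    mean = sum(propensities) / n
    # Sample variance with the N - 1 divisor used on the slide.
    var = sum((p - mean) ** 2 for p in propensities) / (n - 1)
    return 1.0 - 2.0 * math.sqrt(var)

# Hypothetical estimated response propensities (illustrative values only).
equal = r_indicator([0.6, 0.6, 0.6, 0.6])    # no variation in propensities
unequal = r_indicator([0.2, 0.4, 0.6, 0.8])  # strong variation in propensities
```

With identical propensities the indicator is essentially 1 (fully representative); the more the propensities vary, the lower it falls.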


The National Survey of College Graduates¹

Data are collected by a variety of modes: web, telephone, . . .

The 2013 NSCG uses monitoring to identify target cases for mode-switching, with the goal of moving a case to the mode with the highest response propensity or to control costs by not moving

Hold a case in web if it is “low impact”
Switch to CATI (computer-assisted telephone interview) if it has not responded via web and is “high impact”
Put a CATI case on hold (no contacts) if the “R-indicator” shows that the group is over-represented

Strike an effective cost/quality tradeoff

¹ Thanks to Ben Reist

Comparison of incentive approaches in the NSCG

4 separate surveys, each using a different set of incentives, but with the same attributes used in the propensity model


Partial, unconditional, R-indicators

Identify subgroups that are over/under represented and use the information to encourage or “not encourage” specific cases or groups

Adapt by switching modes, incentives, etc.

With ρk the estimated (possibly adjusted) response propensity for group X = k, ρ the composed vector, and ρ̄ the (weighted) mean, the (partial) unconditional R-indicator is

$$R_u(X = k, \rho) = \left(\frac{N_k}{N_+}\right)^{1/2}(\rho_k - \bar{\rho})$$

It’s a residual, and $R_u = 0 \Rightarrow$ balance
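The partial unconditional R-indicator can be sketched in a few lines. The group labels, sizes and propensities below are hypothetical, and the sketch assumes that N+ is the total of the group sizes and that ρ̄ is the size-weighted mean propensity.

```python
import math

def partial_r_unconditional(group_sizes, group_propensities):
    """Return {group: sqrt(N_k / N_+) * (rho_k - rho_bar)} for each group k."""
    total = sum(group_sizes.values())  # assumed meaning of N_+
    rho_bar = sum(group_sizes[k] * group_propensities[k] for k in group_sizes) / total
    return {
        k: math.sqrt(group_sizes[k] / total) * (group_propensities[k] - rho_bar)
        for k in group_sizes
    }

# Hypothetical groups (illustrative values only).
sizes = {"group_a": 600, "group_b": 400}
propensities = {"group_a": 0.7, "group_b": 0.4}
partial = partial_r_unconditional(sizes, propensities)
```

A positive value flags an over-represented group (a candidate for being put on hold), a negative value an under-represented one (a candidate for a mode switch or an incentive).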


NSCG Data Monitoring Example


How long to wait before sending hard copy?

Event-time analysis

In the American Community Survey (ACS), need to determine how long to wait for an internet response before sending hard-copy

Demographic group-specific, event-time distributions were estimated with the event being “answered via the internet”

The event-time is administratively censored via sending hard-copy, contacting by phone, etc.

With T the internet return time, compute,

P(s, d) = pr(T ≤ s + d | T > s)

Switch to hard copy if P(s, d) < γ for a specified delay d

Optimize with respect to (d, γ) to reduce delay and control costs
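The switching rule can be sketched with a Kaplan-Meier estimate of the internet response-time distribution, using P(s, d) = 1 − S(s + d)/S(s) where S is the survival function of T. Kaplan-Meier is an assumption here (the slides do not name the estimator), and the response times, censoring flags and threshold γ are made-up illustrative values.

```python
def kaplan_meier(times, observed):
    """Kaplan-Meier curve as a list of (event_time, S(event_time))."""
    pairs = sorted(zip(times, observed))
    at_risk = len(pairs)
    surv = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0
        removed = 0
        # Group all cases sharing the same time.
        while i < len(pairs) and pairs[i][0] == t:
            events += pairs[i][1]
            removed += 1
            i += 1
        if events:
            surv *= 1.0 - events / at_risk
            curve.append((t, surv))
        at_risk -= removed
    return curve

def survival_at(curve, t):
    """S(t): probability that no internet response has arrived by time t."""
    s = 1.0
    for time, value in curve:
        if time > t:
            break
        s = value
    return s

def prob_response_within(curve, s, d):
    """P(s, d) = pr(T <= s + d | T > s) = 1 - S(s + d) / S(s)."""
    s_now = survival_at(curve, s)
    if s_now == 0.0:
        raise ValueError("no cases remain at risk at time s")
    return 1.0 - survival_at(curve, s + d) / s_now

# Hypothetical data: days to internet response; 0 = administratively censored.
days = [2, 3, 3, 5, 8, 10, 14, 14, 20, 20]
responded = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0]
curve = kaplan_meier(days, responded)

gamma = 0.25  # assumed threshold
p = prob_response_within(curve, s=5, d=5)
switch_to_hard_copy = p < gamma
```

In practice one curve would be fitted per demographic group, and (d, γ) tuned against cost and delay.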


Internet response time distributions²

² From ACS Memorandum #ACS13–RER–18

Stopping rules

When is there sufficient information to stop conducting interviews?

The “stop and impute rule”

θ̂now : Use currently collected data, augmented by imputation of missing values

The “project rule”

θ̂future : Collect a specified number of additional interviews, and then augment by imputation of missing values

If a prediction model indicates that

$$\mathrm{pr}\left(|\hat{\theta}_{\text{now}} - \hat{\theta}_{\text{future}}| > \epsilon\right) < \gamma$$

then stop and use θ̂now

Similar to futility assessment in a clinical trial
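A minimal Monte Carlo sketch of the stopping comparison follows, under strong simplifying assumptions that are not from the slides: the target is a proportion, the prediction model is a Beta(1, 1) prior updated with the current responses, and the imputation step is omitted so that each estimate is just the respondent proportion. All names and numbers are hypothetical.

```python
import random

def prob_estimate_changes(successes, n_now, n_extra, eps, n_sims=1000, seed=0):
    """Monte Carlo estimate of pr(|theta_now - theta_future| > eps)."""
    rng = random.Random(seed)
    theta_now = successes / n_now
    exceed = 0
    for _ in range(n_sims):
        # Assumed prediction model: draw the response probability from its posterior.
        p = rng.betavariate(1 + successes, 1 + n_now - successes)
        # Simulate the additional interviews under that probability.
        extra_successes = sum(rng.random() < p for _ in range(n_extra))
        theta_future = (successes + extra_successes) / (n_now + n_extra)
        if abs(theta_now - theta_future) > eps:
            exceed += 1
    return exceed / n_sims

def should_stop(successes, n_now, n_extra, eps, gamma):
    """Stop and use theta_now if the estimate is unlikely to move by more than eps."""
    return prob_estimate_changes(successes, n_now, n_extra, eps) < gamma

# Many interviews in hand, few to come: further interviewing cannot move the estimate much.
late = should_stop(successes=1000, n_now=2000, n_extra=50, eps=0.02, gamma=0.05)
# Few interviews in hand, many to come: the estimate could still move a lot.
early = should_stop(successes=20, n_now=40, n_extra=400, eps=0.02, gamma=0.05)
```

The same structure carries over to other estimands; only the prediction model and the estimator change.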


Issues with adaptive designs

Need robust approaches to avoid degrading quality due to inappropriate adaptation wrt identified subgroups of interest

To avoid degrading coverage for other subgroups
You are creating the database; don’t mess it up!

Learning from data generated by an adaptive design is complex

Adaptation may induce confounding that needs to be removed
The good news is that the propensities are available
The database may be less useful for learning than one produced non-adaptively

There is a trade-off between generating a learning database and optimizing survey performance

A single mode becomes a “vector mode”

Very similar to issues in adaptive clinical trials


Disclosure Avoidance and Data Dissemination

Setting the scene

Make data available while protecting confidentiality

Average income in small areas, industrial output in small areas
Longitudinal Employer-Household Dynamics (LEHD)
American FactFinder or “OnTheMap”
Micro-data Analysis System (MAS)

Releasing full “micro-data” provides complete information, but with 100% disclosure risk

Releasing no data provides no information, with 0% risk

The trade-off should be societally determined, but formalism is needed to guide the choice

Achieving an acceptable trade-off is becoming more difficult in the context of “big data” and active intruder threats

Record linkage is closely related (for both good and ill)


Trade-offs, as in diagnostic testing

Disclosure risk and the value of the data are positively related

The trade-off is very similar to that for an ROC
The X-axis is disclosure risk, rather than (1 - specificity)
The Y-axis is available information, rather than sensitivity

[Figure: ROC-style curve; x-axis: Disclosure Risk, y-axis: Available Info]


Methods to reduce disclosure risk

Bureaucratic/legal: Titles 13 & 26, RDCs, . . .

In the big data era, these may become the mainstays

Cell suppression

Aggregation

Random swapping

Add noise, “noisy fusion”

Add random N(0, σ²) noise, split noise, . . .
The variance controls disclosure risk and available information
Added noise inflates variance, but aggregation or modeling still supports informative inferences

Synthetic data

Develop a (Bayesian) model for the full micro-data that preserves important relations
Generate one or more datasets based on the model-based, posterior predictive distribution
Can provide an effective information/protection trade-off
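The noise-addition point in the list above (added noise inflates variance, yet aggregation still supports informative inference) can be illustrated with a small simulation. The incomes are randomly generated, not real data, and the noise level σ is an arbitrary assumption.

```python
import random
import statistics

rng = random.Random(2015)
sigma = 20_000.0  # assumed noise standard deviation (hypothetical)

# Hypothetical "true" incomes; randomly generated, not real data.
true_income = [rng.uniform(20_000, 120_000) for _ in range(10_000)]

# Released values: each record perturbed by independent N(0, sigma^2) noise.
released = [y + rng.gauss(0.0, sigma) for y in true_income]

# A single released record is far from its true value on average ...
typical_record_error = statistics.mean(
    [abs(r - y) for r, y in zip(released, true_income)]
)
# ... but the released mean stays close to the true mean (noise sd is sigma / sqrt(n)).
mean_error = abs(statistics.mean(released) - statistics.mean(true_income))
```

Raising σ lowers disclosure risk for individual records but also widens the uncertainty of every aggregate, which is the trade-off the slide describes.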


Partially synthetic data allow users to select custom geographies in “OnTheMap”

[Figure panels: Commuting Patterns, Portland OR; Hurricane Sandy]


Measuring disclosure risk

There is always a disclosure risk when data are made available, and it is best measured by the probability of disclosure

For example, among n identified people, one with “income > $100,000” and with no other information available, the disclosure risk is 1/n

More sophisticated measures are available

In the era of big data there is other information available from record matching and melding, increasing the risk beyond (sometimes far beyond) what a “local” assessment computes


Probabilistic Differential Privacy³

For ε ≥ 0, a randomized function K gives ε-differential privacy if, for all data sets D and D′ differing in at most one element (e.g., row of data), and all S ⊆ range(K),

$$\left|\log\left[\frac{\mathrm{pr}\{K(D) \in S\}}{\mathrm{pr}\{K(D') \in S\}}\right]\right| \le \epsilon$$

A global guarantee: The protection is for all possible deletions for datasets that you have identified

Plausible deniability: A reported value has a “similar” probability irrespective of whether your data are or are not included in the data set

Example: Reporting mean salary in successive years with one hire for the second year confers almost no protection

The trimmed mean or an M-estimate confer protection
So do other robust statistics, noisy fusion, synthetic data

To compute ε you need to know D and in this Big Data era, it is likely bigger than you assume

³ Dwork & Smith, J. Privacy and Confidentiality, 2009
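To make the ε bound concrete, here is a sketch using the Laplace mechanism on a counting query. The Laplace mechanism is a standard construction brought in for illustration; it is not mentioned on the slide. The counts, ε and the evaluation grid are hypothetical, and the check is done on output densities, which implies the same bound for any set S.

```python
import math
import random

def laplace_log_density(y, center, scale):
    """Log density of a Laplace(center, scale) distribution at y."""
    return -math.log(2.0 * scale) - abs(y - center) / scale

def noisy_count(true_count, epsilon, rng):
    """Release a count with Laplace noise of scale 1 / epsilon (sensitivity 1)."""
    scale = 1.0 / epsilon
    # The difference of two exponentials with mean `scale` is Laplace(0, scale).
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise

epsilon = 0.5
count_d, count_d_prime = 42, 43  # neighbouring data sets: counts differ by one row

# Largest absolute log density ratio over a grid of possible released values.
grid = [20 + 0.1 * i for i in range(500)]
max_abs_log_ratio = max(
    abs(
        laplace_log_density(y, count_d, 1.0 / epsilon)
        - laplace_log_density(y, count_d_prime, 1.0 / epsilon)
    )
    for y in grid
)

release = noisy_count(count_d, epsilon, random.Random(0))
```

The log ratio never exceeds ε, so a released value is almost as likely whether or not any one row is in the data: the "plausible deniability" described above.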

Record Matching

Latanya Sweeney

A large percentage of the U.S. population has a high probability of being identified based only on place, gender, date of birth
This probability soars when other information is matched.

Applications

e-Health: Combining medical records and iPhone info, and . . .
Precision Medicine: Gene signatures and medical records
Who was where doing what: credit card charges melding with Facebook posts
Social networks: Cell phone meta-data
Real-time population estimates: Cell phone meta-data
Facial recognition
Mortality in a war zone
Scans of the universe: Is this galaxy in Sloan Sky the same as the one in another database?


More on record matching

In conducting a survey

Is the person at address A the same as the person in the IRS data? If so, can use the IRS information to augment other information

De-duplication: root out double counting

Imputation: Help build the imputation model

Approaches and challenges

Frequentist: “Big Match”

Bayesian structuring: pr(match | data) ⟹ fractional matches

In general, a challenging computing problem, especially for matching using ≥ 3 sources

Computer science to the rescue?


Micro-simulation/agent-based model for the 2020 Census

Households are “agents”

Inputs: Mode-specific and mode-sequence specific response probability, quality and NRFU (non-response follow-up) values

Outputs: National and domain-specific (cost, quality) probability distributions that help to identify high-leverage research

Leading to a design with a high probability of success

(cost, quality) in an acceptable region


Logic Diagram for Administrative Records


Statistical inference is all about missing data

Infinite reference population

Infer T(Yn+1, . . . , Y∞), conditional on (Y1, . . . , Yn) and on the sampling plan

Finite population

Infer T(Yn+1, . . . , YN), conditional on (Y1, . . . , Yn) and on the sampling plan

Always need to account for uncertainties

Uncertainty due to (Y1, . . . , Yn) providing only finite information on the predictive distribution
Uncertainty in predicting the unobserved Ys using a known predictive distribution

Bayesian formulations are the way to go!


Design vs. model-based inference

Design-based (randomization) inference: The Ys are fixed and inference is based on the distribution of sample inclusion indicators

Model-based inference: The Ys are also random variables from a probability distribution

Superpopulation: Frequentist inference based on repeated samples from the super-population and from the resulting sample
Bayes: add a prior for parameters; inference based on posterior distribution of finite population quantities

The fundamental distinction is use of a randomization distribution versus a stochastic model for the Ys; however,

“Weighters” shouldn’t ignore models
Modelers can’t ignore (design) weights

Bayesian models that incorporate design features can yield inferences with very good design-based properties


The basic setup

Finite population: U = {1, 2, . . . ,N}

Values of interest: Yk , k ∈ U

The Yk are a set of fixed, but unknown numbers, not necessarily from a probability distribution

Goal: Estimate a function of the Yk, any function, but we’ll focus on the population total or mean

$$\text{total: } T(\mathbf{Y}) = \sum_{k=1}^{N} Y_k \qquad \text{mean: } \frac{T(\mathbf{Y})}{N}$$

Draw a sample S ⊆ U with

$$\mathrm{pr}(\text{unit } k \in S) = \pi_k > 0 \quad \text{(can depend on covariates)}$$

$$\mathrm{pr}(k \text{ and } \ell \in S) = \pi_{k\ell}$$

$$\mathrm{pr}(k_1, \ldots, k_n \in S) = \pi_{k_1 k_2 \ldots k_n}$$


The weighting game

Sample membership indicators:

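$$Z_k = \begin{cases} 1, & k \in S \\ 0, & k \notin S \end{cases}$$

$$E(Z_k) = \pi_k, \qquad E(Z_k Z_\ell) = \pi_{k\ell}$$

The Zk are random variables; the Yk are constants

The Horvitz-Thompson, unbiased estimate of T:

$$\hat{T} = HT[Y_k] = \sum_{k \in S} \frac{Y_k}{\pi_k} = \sum_{k \in U} \frac{Z_k Y_k}{\pi_k}$$

$$E(\hat{T}) = \sum_{k \in U} \frac{E(Z_k)\, Y_k}{\pi_k} = \sum_{k \in U} \frac{\pi_k Y_k}{\pi_k} = \sum_{k \in U} Y_k = T(\mathbf{Y})$$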
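The unbiasedness argument can be checked exactly on a toy population by enumerating every possible sample. The sketch assumes Poisson sampling (units included independently with probability πk), which is one convenient design and not one named on the slide; the Y values and inclusion probabilities are made up.

```python
from itertools import product

def horvitz_thompson(y, pi, z):
    """HT estimate of the total: sum over sampled units of Y_k / pi_k."""
    return sum(zk * yk / pk for yk, pk, zk in zip(y, pi, z))

# Hypothetical finite population and inclusion probabilities.
y = [10.0, 20.0, 30.0, 40.0]
pi = [0.2, 0.5, 0.5, 0.8]

# Exact design expectation: enumerate all 2^N membership vectors Z.
expected = 0.0
for z in product([0, 1], repeat=len(y)):
    prob = 1.0
    for zk, pk in zip(z, pi):
        # Assumed Poisson sampling: independent inclusion with probability pi_k.
        prob *= pk if zk else 1.0 - pk
    expected += prob * horvitz_thompson(y, pi, z)

true_total = sum(y)
```

Individual samples can give estimates far from the total; it is the average over the randomization distribution of Z that equals T(Y).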


Using auxiliary information

Assume we have information Xk , k ∈ U

Examples include location, information from administrative records (e.g., tax data), etc.

And have a model m(Xk) to predict Yk

Then,

$$\hat{T}_{GReG} = \sum_{k \in U} m(X_k) + \sum_{k \in U} Z_k \cdot \left(\frac{Y_k - m(X_k)}{\pi_k}\right) = \sum_{k \in U} m(X_k) + HT[Y_k - m(X_k)]$$

T̂GReG is the Generalized Regression Estimator (GReG)

It is the original doubly robust estimator!
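The double robustness can be seen on a toy population: the design expectation of the GREG estimator equals the true total when the πk are correct even if the model is poor, and also when the model is exact even if the πk used are wrong. The sketch assumes Poisson sampling and a fixed prediction m(Xk) that is not re-estimated from the sample; all values are hypothetical.

```python
from itertools import product

def greg(y, m, pi, z):
    """GREG total: sum of predictions plus the HT total of the residuals."""
    ht_resid = sum(zk * (yk - mk) / pk for yk, mk, pk, zk in zip(y, m, pi, z))
    return sum(m) + ht_resid

def design_expectation(y, m, pi_true, pi_used):
    """Exact expectation over all samples drawn with pi_true (Poisson sampling)."""
    total = 0.0
    for z in product([0, 1], repeat=len(y)):
        prob = 1.0
        for zk, pk in zip(z, pi_true):
            prob *= pk if zk else 1.0 - pk
        total += prob * greg(y, m, pi_used, z)
    return total

# Hypothetical finite population.
y = [10.0, 20.0, 30.0, 40.0]
pi = [0.2, 0.5, 0.5, 0.8]
poor_model = [5.0, 5.0, 5.0, 5.0]  # predictions far from the Y_k
wrong_pi = [0.5, 0.5, 0.5, 0.5]    # misspecified inclusion probabilities

case_a = design_expectation(y, poor_model, pi, pi)  # correct pi, poor model
case_b = design_expectation(y, y, pi, wrong_pi)     # exact model, wrong pi
```

Both cases recover the true total of 100; when both the model and the πk are wrong, no such guarantee holds.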


“Pure” design-based is not so pure

Benefit: If the πk are correct, the estimator is unbiased

However, in complicated surveys producing the πk is a complicated business and computing the πkℓ is (complicated)²

For example, the American Community Survey (ACS) uses a very complicated, cluster design ⇒ complicated πk

And adjustments of the πk are needed to reflect non-response and imputation
Models are used for the adjustments and imputations

Variance computations can be complicated

Successive difference replication
Bootstrap, but beware. . .

Generally, inference for non-linear functions of the Y s requires a model

As does small domain estimation


Challenges of collecting probability samples

Most state that nonprobability or volunteer samples can’t be used for population estimates

But, “Would you rather have a 60% response rate from a well-designed and conducted Gallup survey or a 95% rate from a self-selected group?”

Advantage Gallup: The 60% is also self-selected, but information on the relation of respondents to non-respondents is available from the sampling frame and generalizing from the sample is possible
However: For the self-selected survey, there may be other data that can be used to develop reasonable weights for some reference population

Analogously, in clinical trials many (most) interesting questions are not protected by randomization, are not Intent to Treat (ITT), but progress can be made, with care!

Collecting information to support “causal analysis” is key


Informative sample size

Mean menstrual cycle length (MCL) in a prospective pregnancy study

Enroll couples who are trying to have a child

Follow until pregnancy or end of study

Average the MCLs to get a “population” estimate µ̂

Informative sample size: the relatively less fecund couples provide relatively more cycles and so the average is over-weighted towards the MCLs for less fecund couples

If MCL and fecundity are related, µ̂ will be biased relative to the population value

There are fixes (e.g., equal weighting), but using them depends on recognizing the issue
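A small simulation shows the bias and the equal-weighting fix. Everything here is invented for illustration: two equally common couple types, with the less fecund type given longer cycles, a 12-cycle follow-up limit, and Gaussian cycle-to-cycle variation.

```python
import random
import statistics

rng = random.Random(50)
MAX_CYCLES = 12  # assumed end of follow-up

def simulate_couple(p_conceive, mean_mcl):
    """Observed cycle lengths until pregnancy or end of study."""
    cycles = []
    for _ in range(MAX_CYCLES):
        cycles.append(rng.gauss(mean_mcl, 2.0))
        if rng.random() < p_conceive:
            break  # pregnancy ends follow-up
    return cycles

# Hypothetical population: half more fecund with shorter cycles, half less fecund
# with longer cycles, so the per-couple population mean MCL is 30 days.
couples = []
for i in range(2000):
    if i % 2 == 0:
        couples.append(simulate_couple(0.5, 28.0))
    else:
        couples.append(simulate_couple(0.1, 32.0))

# Pooling all cycles over-weights the less fecund couples, who contribute more cycles.
pooled = statistics.mean([c for couple in couples for c in couple])
# Equal weighting: average each couple's own mean first.
equal_weight = statistics.mean([statistics.mean(couple) for couple in couples])
```

The pooled mean sits noticeably above 30 days, while the equal-weight mean stays close to it.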


Internal and External Worlds

Surveys focus on external validity, representation of a well-specified reference population

Clinical and epidemiological studies traditionally focus on internal validity with relatively little direct attention to representation

Without question,

The biostat/epi communities should pay more attention to survey goals and methods
The survey communities should pay more attention to biostat/epi goals and methods

Transportability as a unifying theme, see

Pearl J, Bareinboim E (2014). External Validity: From do-calculus to Transportability across Populations. Statistical Science, 29: 579–595

Big Data (all data!) potentials

To support adaptation
To make sense of collected data
To transport to a reference population


Miettinen’s view, at least in 1985

Miettinen, O. S. (1985). Theoretical Epidemiology. Wiley, New York

“In science the generalization from the actual study experience is not made to a population of which the study experience is a sample in a technical sense of probability sampling. In science the generalization is from the actual study experience to the abstract, with no referent in place or time.”

Olli’s view is far too optimistic; far too trusting in immutable truths


A pleasing trend

Convergence

There is convergence, with some clinical/epi studies identifying a reasonably well-defined reference population

Sometimes using all data (big, small, in between) to help identify the population, to compute weights and transport to it

A stumbling block

Different interpretations of “representative”

In Epi/Bio it is commonly reserved for a “self-weighting” sample
In the Survey world, if the sampling weights are known, the sample is representative

The broader (and correct!) definition opens up opportunities for beneficial convergence


Coda

The goals and methods of the epi/biostat and survey communities will never completely converge; however, there are considerable similarities in goals and methods with more “sims” available

These, anchored by overarching principles, will empower convergence that will benefit each field, science and society

Consider the opportunities as you,

Enjoy the journey to your 75th and 100th


THANK YOU

