can a synthetic data approach applied to high risk data ... · cdac/fcsm workshop new advances in...

CanaSyntheticDataApproachAppliedtoHighRiskDataResultinUsableDatawithaVeryLowRisk?

ApplicationtotheFederalEmployeeViewpointSurvey

TaylorLewis1,U.S. OfficeofPersonnelManagementTom Krenzke2,Westat

1

CDAC/FCSMWorkshopNewAdvancesinDisclosureLimitation

BureauofLaborStatisticsSeptember27,2017

1Theopinions,findings,andconclusionsexpressedinthispresentationarethoseoftheauthoranddonotnecessarilyreflectthoseoftheU.S.OfficeofPersonnelManagement.

2TheauthorwouldliketoacknowledgeJaneLiandLinLifortheirassistanceincarryingoutthisresearch.

I. BackgroundonTraditionalDisclosureAvoidanceStrategiesandThoseAppliedtotheFederalEmployeeViewpointSurvey(FEVS)

II. FEVSSyntheticDataApplication– Methodology– DataUtilityAssessments– RiskAssessments

III. SummaryandFurtherResearchQuestions

Outline

2

• Informationreduction:– Topcodingà cappingagesat“60+”– Roundingà convertingincomeintoranges– Droppingvariables– Separatefileswithseparatesetsofvariables– Sampling– Suppressionà deletingcertainvalues

• Dataperturbation:– Swappingvaluesacrosstwoormorerecords– Noiseinfusion(e.g.,addingrandomerrors)– generallymoreapplicablefor

continuousvariables

TraditionalStrategiesReducingDisclosureRisk

3

• Informationreduction:– Droppingvariablesdegradesoveralldatautility– Combining/collapsingmayhidekeyrelationshipsindata– Suppressionmightproducedatathatarenotmissingcompletelyatrandom

(MCAR)(LittleandRubin,2002)andcanreduceprecisionofestimates

• Dataperturbation:– Dataswappingmaintains(unweighted)marginaldistributions,butanalyses

involvingtheswappedandun-swappedvariablesjointlycanbedistorted(Reiter,2012)

– Noiseinfusioncanalsoattenuatecorrelationsanddistortrelationshipsamongstvariables

ProblemswithTraditionalStrategies

4

• FirstproposedbyRubin(1993),generatingsyntheticdataisapromising(andrapidlyevolving)methodologythataddressesmanyofthetraditionalstrategies’limitations

• Premise:modeltheobserveddataandusethatmodeltoproduceplausiblesubstitutevalues

• Twotypesofsyntheticdata:– Fullysyntheticdata(Raghunathan etal.,2003)– allvaluesaresynthesized– Partiallysyntheticdata(Reiter,2003)– onlysomevaluesaresynthesized

(eitheraportionofvariables,aportionofrecords,orsomecombinationofboth)

SyntheticDatatotheRescue?

5

• Keyadvantageoffullysyntheticdata:becausenoactualvaluesarereleased,disclosureriskisextremelylow

• Attemptstomatchsyntheticdatawithexternaldatabasesforpurposesofdisclosurearepointless

• Partiallysyntheticdatabettermaintainsrelationshipsinthedata,butincreasesdisclosurerisk

• Keydisadvantageofsyntheticdata:relationshipsomittedfrommodelwillnotappearinthesyntheticdataà itisonlypossibleforanalyststorediscoverwhatisaccountedforbythesynthesismodels

SyntheticData:AdvantagesandDisadvantages

6

BackgroundontheFEVS

7

• TheFederalEmployeeViewpointSurvey(FEVS)isanannual,Web-basedsurveyoffull-time,permanent,non-seasonalfederalemployeesadministeredbytheU.S.OfficeofPersonnelManagement(OPM)

• Asof2016FEVS:samplesize~900,000;80+agenciesparticipating;responseratejustunder50%

• Instrumentconsistsmainlyofattitudinalitems(e.g.,perceptionsofleadership,jobsatisfaction)onaLikert-typescale,butalsoaboutadozenpotentiallyobservabledemographics

• Highlydetailedindividual-level,work-unitinformationisprovidedbyagenciesforsampling/reportingpurposes

• Afterextensivereportingphase,threepublic-releasedatafiles(PRDFs)aremadeavailable(seehttps://www.fedview.opm.gov/2015/EVSDATA/):1. General(excludingLGBTitem)2. LGBT(includingLGBTitem,fewervariables,commonvariablesrecodedto

determergingwithgeneralfile)3. Trend(allpriorgeneralPRDFsstackedandcodedforwardtomostrecent

FEVS)

• PrivacyActstatementassuresrespondents“Inanypublicreleaseofsurveyresults,nodatawillbedisclosedthatcouldbeusedtoidentifyspecificindividuals”

FEVSDataReleases

8

• InFEVS,work-unitinformationandobservabledemographicscompeteagainsteachotherwithrespecttodisclosurerisk

• Forlowerdisclosurerisk,onecouldreleasecompletework-unitdetailbutnodemographics,orviceversaà neitherisideal

• Moreappropriateapproachistostrikeacompromise,withtheendgoaltominimizedisclosureriskwhilemaximizingdatautility

StrikingaCompromise

9

HigherDisclosureRisk

MoreWork-UnitInfo MoreDemographicInfoLowerDisclosureRisk

LowerDisclosureRisk

• Detailedintechnicalreport(OPM,2015):– SeparateLGBTfile– Startingpointforworkunits:agencyrequest,solongasatleast250respondents– Certainvariablesremovedand/orcombined(e.g.,minoritystatus)– Categoriescollapsedforothervariables

• Exhaustivetabulationsassessment(ETA)(Krenzke etal.,2014)systematicallyexaminesallpossibledemographiccombinationswithinaworkunit,flaggingrecordsposingadisclosurerisk

• Workunitidentifierswith>25%recordsflaggedaresettomissing,andETAisdoneoncemore;forrecordsstillflagged(~8000inFEVS2016),onlyoneoffour“core”demographics(gender,agegroup,supervisorystatus,andminoritystatus)ismaintained

CurrentDisclosureAvoidanceMethods

10

SchematicRepresentationofOriginalDataSet:

APartiallySyntheticApproach

11

Premise:leaveX0andX1intact,butmodelrelationshipbetweenX1 andY (independentlywithinworkunits),andusetoderivesubstitutevaluesforY

• FifteenvariablescomprisingY synthesizedsequentiallyalaRaghunathan etal.(2001)using“synthpop”Rpackage(Nowok etal.,2015)

• Nonparametric“ctree”methodusedexclusively,basedonclassificationandregressiontrees(CART)(Breiman etal.,1984)– successivebinarysplitspartitiondatasetintocells

• Withinacell,valuesaresynthesizedrandomlyinproportiontotheiroccurrenceintheobserveddata

• CreatedM =3implicates,notforvarianceestimationpurposesperse,buttoruleoutdeterministicrelationships

APartiallySyntheticApproach(2)

12

VisualizationofCTREEMethod

13

• KeyadvantageofCART(Reiter,2005):findandexploitonlymostimportantrelationshipsfromalargepoolofpotentialpredictorsà inexamplebelow,onlytheHQ/fielddutystationindicator(DLOC)andagencytenure(DAGYTEN)areneededforsynthesizinggender

• Dramaticallyreduceddisclosurerisk

• Moreworkscanbeidentified– 368vs181distinctworkunits

• Moredemographicinformationcanbeincluded– 15vs11variables– 48vs31totaldemographicvariablecategories

• NoneedforseparateLGBTfile

• Keydownside:noguaranteesyntheticdataresultsmatchthosethatwouldbegeneratedwiththeactualdata

BenefitsRelativetoCurrentMethods

14

Results:UnivariateMarginalDistributions

15

Results:BivariateMarginalDistributions

16

AveragePercentPositiveDifferencefor2016FEVSDemographicCategorieswithinaWorkUnit:PartiallySyntheticDatavsActualData

Results:PointEstimateDifferences

17

EstimatedOddsRatioDifferencesfortheMultinomialLogisticRegressionModelDiscussedinWhitford andLee(2015):

Results:ModelParameterDifferences

18

-0.20

-0.15

-0.10

-0.05

0.00

0.05

0.10

0.15

0.20

Estim

ated

Odd

sRatioDifferen

ce:

PartiallySynthetic-Ac

tual

RiskAssessments

• Traditionalpublic-releasedatafile(PRDF)• PartiallysyntheticPRDF

• Re-identification-- Hundepool etal.(2012)– Achievedbyanintruderwhencomparingatargetindividualinasamplewithanavailablelistofunits(externalfile)thatcontainsindividualidentifiers(e.g.,nameandaddress),plusasetofidentifyingvariables

– Occurswhentheunitinthereleasedfileandaunitintheexternalfilebelongtothesameindividualinthepopulation

19

RiskElements

• Questionnaireitems– 15indirectidentifiers–Mostlyattitudinal

• Workunit• Highsamplingrate– Samplingrateequalto1

• About50percentresponserate

20

RiskAssessmentonTraditionalPRDF

• Fileriskmeasure– Theexpectednumberofpopulationuniques giventhesampleuniques• 𝑅𝑖𝑠𝑘 = ∑ 𝐸(𝐹* = 1|𝑓* = 1)�

01– whereSU isthesetofsampleuniques,fk isthesamplefrequencyincellk,andFk isthepopulationfrequencyincellk

– Fk mustbeestimated» EstimatedusingloglinearmodelswiththeSkinnerandShlomo (2008)approach• Stabilizesestimate• Usesweights

• Workunitand12selectindirectidentifiers– Lownumberofmissingvalues– Highlyidentifiable

21

RiskAssessmentonTraditionalPRDF(2)

22

RiskAssessmentonTraditionalPRDF:Results

• Fileriskmeasurewascomputedforeachworkunit– Rangedfrom3%to69%acrossallworkunits(orcombinedworkunits)

withamedianof26%andameanof28%– Highrisk

23

RiskAssessmentonPartiallySyntheticPRDF• Firstlook:Percentageofchangedvaluesamongthe15

indirectidentifiers,byworkunit

24

RiskAssessmentonPartiallySyntheticPRDF(2)• Re-identificationrisk– Whatistheexpectednumberofcorrectmatches?

• Raw-to-Raw• Synthetic-to-Raw

• Raw-to-Raw-- Exactmatching– Foreachworkunit,thematchwasconductedon15indirectidentifiers• Somemultiplerecordswiththesamesubgroup• Example:If3recordshavethesamecharacteristics,thenriskisa1in3chanceofmatchingcorrectly

– Onaverage,88percentmatchedcorrectly,rangingfrom57percentto98percentacrossworkunits

– Didnotaccountforapproximate50percentresponserate,whichlowerstheriskvalue

25

RiskAssessmentonPartiallySyntheticPRDF(3)• Synthetic-to-Raw-- Probability-basedmatching– UsedWestat’sWesLink SASmacro,basedonlog-likelihoodestimation

– Identifygroupofbestmatchesforeachrecord,giventheworkunitandthe15indirectidentifiers• Thresholdissettominimizefalsepositivesandfalsenegatives

– Probabilityofcorrectmatchcomputedforeachindividualrecord• Ifthetruerecordwasamongthebestmatches

– Probabilityofcorrectmatch=1/(#ofbestmatches)• Ifthetruerecordwasnotamongthebestmatches

– Probabilityofcorrectmatch=0– Fileriskwascomputedtheaverageoftheprobabilities

• 0.43percent,notaccountingfortheresponserate• 20unitsrangedfrom1.0percentto2.2percent

26

RiskAssessmentonPartiallySyntheticPRDF(4)

27

• Partiallysynthetic2016FEVSdatadoesnotproduceperfectreplicationsoftheactualdata,butresultsarereasonablycloseanddevoidofanysystematicbiases

• Differencestendtozeroassamplesizesincrease,asdomeasuresofdisclosurerisk

• Ofcourse,noguaranteeallconceivableanalyseswillbeasharmoniousasthosepresentedhere

• Openquestion:istheextranoiseafairpricetopayinexchangeformoredetaileddemographicandworkunitinformation,anddramaticallyreduceddisclosurerisk?

Summary

28

• Coulddatautilitybeincreasediffewervaluesweresynthesized?– Donotsynthesizevariablesthatarenothighlyidentifiable(e.g.,intention

toleave)– Synthesizeonlyasubsetofvariablesforasubsetofrecordswithhigh

disclosurerisk

• Couldtheanalysisweightsberecalibratedinsomewaytomakeresultsmoreconcomitant?

• Arethereothersolutions?– Remoteaccessservers– Hybridapproachofboththetraditionalstatisticaldisclosurelimitation

techniques(coarseningandsuppression)andsyntheticdata

FurtherResearchQuestions

29

Breiman,L.,Friedman,J.,Stone,C.J.,andOlshen,R.A.(1984).ClassificationandRegressionTrees.BocaRaton,FL:CRCPress.

Hundepool,A.,Domingo-Ferrer,J.,Franconi,L.,Giessing,S.,Nordholt,E.S.,Spicer,K.,anddeWolf,P.-P.(2012).StatisticalDisclosureControl.Chichester,UK:JohnWiley&Sons.

Krenzke,T.,Li,J.,andLi,L.(2014).“AnEvaluationoftheImpactofMissingDataonDisclosureRiskMeasures,”ProceedingsoftheJointStatisticalMeetings,Alexandria,VA:AmericanStatisticalAssociation.

Nowok,B.,Raab,G.andDibben,C.(2015)."synthpop:BespokeCreationofSyntheticDatainR,"Packagevignetteavailableonlineat:http://cran.r-project.org/web/packages/synthpop/vignettes/synthpop.pdf.

Raghunathan,T.,Lepkowski,J.,VanHoewyk,J.,andSolenberger,P.(2001).“AMultivariateTechniqueforMultiplyImputingMissingValuesUsingaSequenceofRegressionModels,”SurveyMethodology,27,pp.85– 95.

Raghunathan,T.,Reiter,J.,andRubin,D.(2003).“MultipleImputationforStatisticalDisclosureLimitation,”JournalofOfficialStatistics,19,pp.1– 16.

References

30

Reiter,J.(2003).“InferenceforPartiallySynthetic,PublicUseMicrodataSets,”SurveyMethodology,29,pp.181– 188.

Reiter,J.(2005).“UsingCARTtoGeneratePartiallySyntheticPublicUseMicrodata,”JournalofOfficialStatistics,21,pp.441– 462.

Reiter,J.(2012).“StatisticalApproachestoProtectingConfidentialityforMicrodataandtheirEffectsontheQualityofStatisticalInferences.”PublicOpinionQuarterly,76,pp.163–181

Rubin,D.(1993).“Discussion:StatisticalDisclosureLimitation,”JournalofOfficialStatistics,9,pp.461– 468.

Skinner,C.,andShlomo,N.(2008).“AssessingIdentificationRiskinSurveyMicrodataUsingLog-LinearModels,”JournaloftheAmericanStatisticalAssociation,103,pp.989– 1001.

U.S.OfficeofPersonnelManagement.(2015).FederalEmployeeViewpointSurveyTechnicalReport.Availableonlineat:https://www.fedview.opm.gov/2015/Published/.

Whitford,A.,andLee,S.-Y.(2015).“Exit,Voice,andLoyaltywithMultipleExitOptions:EvidencefromtheUSFederalWorkforce,”JournalofPublicAdministrationsResearchandTheory,25,pp.373– 398.

References(2)

31

can a synthetic data approach applied to high risk data ... · cdac/fcsm workshop new advances in...

Documents