can a synthetic data approach applied to high risk data ... · cdac/fcsm workshop new advances in...
TRANSCRIPT
CanaSyntheticDataApproachAppliedtoHighRiskDataResultinUsableDatawithaVeryLowRisk?
ApplicationtotheFederalEmployeeViewpointSurvey
TaylorLewis1,U.S. OfficeofPersonnelManagementTom Krenzke2,Westat
1
CDAC/FCSMWorkshopNewAdvancesinDisclosureLimitation
BureauofLaborStatisticsSeptember27,2017
1Theopinions,findings,andconclusionsexpressedinthispresentationarethoseoftheauthoranddonotnecessarilyreflectthoseoftheU.S.OfficeofPersonnelManagement.
2TheauthorwouldliketoacknowledgeJaneLiandLinLifortheirassistanceincarryingoutthisresearch.
I. BackgroundonTraditionalDisclosureAvoidanceStrategiesandThoseAppliedtotheFederalEmployeeViewpointSurvey(FEVS)
II. FEVSSyntheticDataApplication– Methodology– DataUtilityAssessments– RiskAssessments
III. SummaryandFurtherResearchQuestions
Outline
2
• Informationreduction:– Topcodingà cappingagesat“60+”– Roundingà convertingincomeintoranges– Droppingvariables– Separatefileswithseparatesetsofvariables– Sampling– Suppressionà deletingcertainvalues
• Dataperturbation:– Swappingvaluesacrosstwoormorerecords– Noiseinfusion(e.g.,addingrandomerrors)– generallymoreapplicablefor
continuousvariables
TraditionalStrategiesReducingDisclosureRisk
3
• Informationreduction:– Droppingvariablesdegradesoveralldatautility– Combining/collapsingmayhidekeyrelationshipsindata– Suppressionmightproducedatathatarenotmissingcompletelyatrandom
(MCAR)(LittleandRubin,2002)andcanreduceprecisionofestimates
• Dataperturbation:– Dataswappingmaintains(unweighted)marginaldistributions,butanalyses
involvingtheswappedandun-swappedvariablesjointlycanbedistorted(Reiter,2012)
– Noiseinfusioncanalsoattenuatecorrelationsanddistortrelationshipsamongstvariables
ProblemswithTraditionalStrategies
4
• FirstproposedbyRubin(1993),generatingsyntheticdataisapromising(andrapidlyevolving)methodologythataddressesmanyofthetraditionalstrategies’limitations
• Premise:modeltheobserveddataandusethatmodeltoproduceplausiblesubstitutevalues
• Twotypesofsyntheticdata:– Fullysyntheticdata(Raghunathan etal.,2003)– allvaluesaresynthesized– Partiallysyntheticdata(Reiter,2003)– onlysomevaluesaresynthesized
(eitheraportionofvariables,aportionofrecords,orsomecombinationofboth)
SyntheticDatatotheRescue?
5
• Keyadvantageoffullysyntheticdata:becausenoactualvaluesarereleased,disclosureriskisextremelylow
• Attemptstomatchsyntheticdatawithexternaldatabasesforpurposesofdisclosurearepointless
• Partiallysyntheticdatabettermaintainsrelationshipsinthedata,butincreasesdisclosurerisk
• Keydisadvantageofsyntheticdata:relationshipsomittedfrommodelwillnotappearinthesyntheticdataà itisonlypossibleforanalyststorediscoverwhatisaccountedforbythesynthesismodels
SyntheticData:AdvantagesandDisadvantages
6
BackgroundontheFEVS
7
• TheFederalEmployeeViewpointSurvey(FEVS)isanannual,Web-basedsurveyoffull-time,permanent,non-seasonalfederalemployeesadministeredbytheU.S.OfficeofPersonnelManagement(OPM)
• Asof2016FEVS:samplesize~900,000;80+agenciesparticipating;responseratejustunder50%
• Instrumentconsistsmainlyofattitudinalitems(e.g.,perceptionsofleadership,jobsatisfaction)onaLikert-typescale,butalsoaboutadozenpotentiallyobservabledemographics
• Highlydetailedindividual-level,work-unitinformationisprovidedbyagenciesforsampling/reportingpurposes
• Afterextensivereportingphase,threepublic-releasedatafiles(PRDFs)aremadeavailable(seehttps://www.fedview.opm.gov/2015/EVSDATA/):1. General(excludingLGBTitem)2. LGBT(includingLGBTitem,fewervariables,commonvariablesrecodedto
determergingwithgeneralfile)3. Trend(allpriorgeneralPRDFsstackedandcodedforwardtomostrecent
FEVS)
• PrivacyActstatementassuresrespondents“Inanypublicreleaseofsurveyresults,nodatawillbedisclosedthatcouldbeusedtoidentifyspecificindividuals”
FEVSDataReleases
8
• InFEVS,work-unitinformationandobservabledemographicscompeteagainsteachotherwithrespecttodisclosurerisk
• Forlowerdisclosurerisk,onecouldreleasecompletework-unitdetailbutnodemographics,orviceversaà neitherisideal
• Moreappropriateapproachistostrikeacompromise,withtheendgoaltominimizedisclosureriskwhilemaximizingdatautility
StrikingaCompromise
9
HigherDisclosureRisk
MoreWork-UnitInfo MoreDemographicInfoLowerDisclosureRisk
LowerDisclosureRisk
• Detailedintechnicalreport(OPM,2015):– SeparateLGBTfile– Startingpointforworkunits:agencyrequest,solongasatleast250respondents– Certainvariablesremovedand/orcombined(e.g.,minoritystatus)– Categoriescollapsedforothervariables
• Exhaustivetabulationsassessment(ETA)(Krenzke etal.,2014)systematicallyexaminesallpossibledemographiccombinationswithinaworkunit,flaggingrecordsposingadisclosurerisk
• Workunitidentifierswith>25%recordsflaggedaresettomissing,andETAisdoneoncemore;forrecordsstillflagged(~8000inFEVS2016),onlyoneoffour“core”demographics(gender,agegroup,supervisorystatus,andminoritystatus)ismaintained
CurrentDisclosureAvoidanceMethods
10
SchematicRepresentationofOriginalDataSet:
APartiallySyntheticApproach
11
Premise:leaveX0andX1intact,butmodelrelationshipbetweenX1 andY (independentlywithinworkunits),andusetoderivesubstitutevaluesforY
• FifteenvariablescomprisingY synthesizedsequentiallyalaRaghunathan etal.(2001)using“synthpop”Rpackage(Nowok etal.,2015)
• Nonparametric“ctree”methodusedexclusively,basedonclassificationandregressiontrees(CART)(Breiman etal.,1984)– successivebinarysplitspartitiondatasetintocells
• Withinacell,valuesaresynthesizedrandomlyinproportiontotheiroccurrenceintheobserveddata
• CreatedM =3implicates,notforvarianceestimationpurposesperse,buttoruleoutdeterministicrelationships
APartiallySyntheticApproach(2)
12
VisualizationofCTREEMethod
13
• KeyadvantageofCART(Reiter,2005):findandexploitonlymostimportantrelationshipsfromalargepoolofpotentialpredictorsà inexamplebelow,onlytheHQ/fielddutystationindicator(DLOC)andagencytenure(DAGYTEN)areneededforsynthesizinggender
• Dramaticallyreduceddisclosurerisk
• Moreworkscanbeidentified– 368vs181distinctworkunits
• Moredemographicinformationcanbeincluded– 15vs11variables– 48vs31totaldemographicvariablecategories
• NoneedforseparateLGBTfile
• Keydownside:noguaranteesyntheticdataresultsmatchthosethatwouldbegeneratedwiththeactualdata
BenefitsRelativetoCurrentMethods
14
AveragePercentPositiveDifferencefor2016FEVSDemographicCategorieswithinaWorkUnit:PartiallySyntheticDatavsActualData
Results:PointEstimateDifferences
17
EstimatedOddsRatioDifferencesfortheMultinomialLogisticRegressionModelDiscussedinWhitford andLee(2015):
Results:ModelParameterDifferences
18
-0.20
-0.15
-0.10
-0.05
0.00
0.05
0.10
0.15
0.20
Estim
ated
Odd
sRatioDifferen
ce:
PartiallySynthetic-Ac
tual
RiskAssessments
• Traditionalpublic-releasedatafile(PRDF)• PartiallysyntheticPRDF
• Re-identification-- Hundepool etal.(2012)– Achievedbyanintruderwhencomparingatargetindividualinasamplewithanavailablelistofunits(externalfile)thatcontainsindividualidentifiers(e.g.,nameandaddress),plusasetofidentifyingvariables
– Occurswhentheunitinthereleasedfileandaunitintheexternalfilebelongtothesameindividualinthepopulation
19
RiskElements
• Questionnaireitems– 15indirectidentifiers–Mostlyattitudinal
• Workunit• Highsamplingrate– Samplingrateequalto1
• About50percentresponserate
20
RiskAssessmentonTraditionalPRDF
• Fileriskmeasure– Theexpectednumberofpopulationuniques giventhesampleuniques• 𝑅𝑖𝑠𝑘 = ∑ 𝐸(𝐹* = 1|𝑓* = 1)�
01– whereSU isthesetofsampleuniques,fk isthesamplefrequencyincellk,andFk isthepopulationfrequencyincellk
– Fk mustbeestimated» EstimatedusingloglinearmodelswiththeSkinnerandShlomo (2008)approach• Stabilizesestimate• Usesweights
• Workunitand12selectindirectidentifiers– Lownumberofmissingvalues– Highlyidentifiable
21
RiskAssessmentonTraditionalPRDF:Results
• Fileriskmeasurewascomputedforeachworkunit– Rangedfrom3%to69%acrossallworkunits(orcombinedworkunits)
withamedianof26%andameanof28%– Highrisk
23
RiskAssessmentonPartiallySyntheticPRDF• Firstlook:Percentageofchangedvaluesamongthe15
indirectidentifiers,byworkunit
24
RiskAssessmentonPartiallySyntheticPRDF(2)• Re-identificationrisk– Whatistheexpectednumberofcorrectmatches?
• Raw-to-Raw• Synthetic-to-Raw
• Raw-to-Raw-- Exactmatching– Foreachworkunit,thematchwasconductedon15indirectidentifiers• Somemultiplerecordswiththesamesubgroup• Example:If3recordshavethesamecharacteristics,thenriskisa1in3chanceofmatchingcorrectly
– Onaverage,88percentmatchedcorrectly,rangingfrom57percentto98percentacrossworkunits
– Didnotaccountforapproximate50percentresponserate,whichlowerstheriskvalue
25
RiskAssessmentonPartiallySyntheticPRDF(3)• Synthetic-to-Raw-- Probability-basedmatching– UsedWestat’sWesLink SASmacro,basedonlog-likelihoodestimation
– Identifygroupofbestmatchesforeachrecord,giventheworkunitandthe15indirectidentifiers• Thresholdissettominimizefalsepositivesandfalsenegatives
– Probabilityofcorrectmatchcomputedforeachindividualrecord• Ifthetruerecordwasamongthebestmatches
– Probabilityofcorrectmatch=1/(#ofbestmatches)• Ifthetruerecordwasnotamongthebestmatches
– Probabilityofcorrectmatch=0– Fileriskwascomputedtheaverageoftheprobabilities
• 0.43percent,notaccountingfortheresponserate• 20unitsrangedfrom1.0percentto2.2percent
26
• Partiallysynthetic2016FEVSdatadoesnotproduceperfectreplicationsoftheactualdata,butresultsarereasonablycloseanddevoidofanysystematicbiases
• Differencestendtozeroassamplesizesincrease,asdomeasuresofdisclosurerisk
• Ofcourse,noguaranteeallconceivableanalyseswillbeasharmoniousasthosepresentedhere
• Openquestion:istheextranoiseafairpricetopayinexchangeformoredetaileddemographicandworkunitinformation,anddramaticallyreduceddisclosurerisk?
Summary
28
• Coulddatautilitybeincreasediffewervaluesweresynthesized?– Donotsynthesizevariablesthatarenothighlyidentifiable(e.g.,intention
toleave)– Synthesizeonlyasubsetofvariablesforasubsetofrecordswithhigh
disclosurerisk
• Couldtheanalysisweightsberecalibratedinsomewaytomakeresultsmoreconcomitant?
• Arethereothersolutions?– Remoteaccessservers– Hybridapproachofboththetraditionalstatisticaldisclosurelimitation
techniques(coarseningandsuppression)andsyntheticdata
FurtherResearchQuestions
29
Breiman,L.,Friedman,J.,Stone,C.J.,andOlshen,R.A.(1984).ClassificationandRegressionTrees.BocaRaton,FL:CRCPress.
Hundepool,A.,Domingo-Ferrer,J.,Franconi,L.,Giessing,S.,Nordholt,E.S.,Spicer,K.,anddeWolf,P.-P.(2012).StatisticalDisclosureControl.Chichester,UK:JohnWiley&Sons.
Krenzke,T.,Li,J.,andLi,L.(2014).“AnEvaluationoftheImpactofMissingDataonDisclosureRiskMeasures,”ProceedingsoftheJointStatisticalMeetings,Alexandria,VA:AmericanStatisticalAssociation.
Nowok,B.,Raab,G.andDibben,C.(2015)."synthpop:BespokeCreationofSyntheticDatainR,"Packagevignetteavailableonlineat:http://cran.r-project.org/web/packages/synthpop/vignettes/synthpop.pdf.
Raghunathan,T.,Lepkowski,J.,VanHoewyk,J.,andSolenberger,P.(2001).“AMultivariateTechniqueforMultiplyImputingMissingValuesUsingaSequenceofRegressionModels,”SurveyMethodology,27,pp.85– 95.
Raghunathan,T.,Reiter,J.,andRubin,D.(2003).“MultipleImputationforStatisticalDisclosureLimitation,”JournalofOfficialStatistics,19,pp.1– 16.
References
30
Reiter,J.(2003).“InferenceforPartiallySynthetic,PublicUseMicrodataSets,”SurveyMethodology,29,pp.181– 188.
Reiter,J.(2005).“UsingCARTtoGeneratePartiallySyntheticPublicUseMicrodata,”JournalofOfficialStatistics,21,pp.441– 462.
Reiter,J.(2012).“StatisticalApproachestoProtectingConfidentialityforMicrodataandtheirEffectsontheQualityofStatisticalInferences.”PublicOpinionQuarterly,76,pp.163–181
Rubin,D.(1993).“Discussion:StatisticalDisclosureLimitation,”JournalofOfficialStatistics,9,pp.461– 468.
Skinner,C.,andShlomo,N.(2008).“AssessingIdentificationRiskinSurveyMicrodataUsingLog-LinearModels,”JournaloftheAmericanStatisticalAssociation,103,pp.989– 1001.
U.S.OfficeofPersonnelManagement.(2015).FederalEmployeeViewpointSurveyTechnicalReport.Availableonlineat:https://www.fedview.opm.gov/2015/Published/.
Whitford,A.,andLee,S.-Y.(2015).“Exit,Voice,andLoyaltywithMultipleExitOptions:EvidencefromtheUSFederalWorkforce,”JournalofPublicAdministrationsResearchandTheory,25,pp.373– 398.
References(2)
31