post-tabular stochastic noise
© Destatis, Mathematical Statistical Methods
Post-tabular Stochastic Noise to Protect Skewed Business Data
Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods
Stochastic Noise
Input perturbation (pre-tabular; example: Evans, T., Zayatz, L. and Slanta, J. (1998)) or output perturbation (post-tabular)?
Challenges for post-tabular noise:
  Between-tables consistency: use a micro-data seed when generating the noise (ABS; Fraser and Wooton, 2006)
  Table additivity: restoring additivity leads to between-tables inconsistency
  Idea: it is enough to achieve near-additivity through flexible rounding
Masking skewed business data using multiplicative noise
Pre-tabular approach (Höhne, 2008): multiply variable $y_i$ (in record $i$) by $(1 \pm (\mu_0 + z_i))$, $z_i \sim N(0, \sigma_0^2)$
Post-tabular approach: $T^{post} = T^{orig} - y_1 + y_1 (1 \pm (\mu_0 + |z_c|))$, where $y_1$ is the largest contribution to the cell
  Set $\mu_0$ to 0 for non-sensitive cells
  $\mu_0 = 2p/100$ => $T^{post}$ is non-sensitive according to the p%-rule
For between-tables consistency:
  Attach a "seed" variable to the microdata
  When making tables: add up the seeds of the records in cell $c$ (➙ $U_c$; e.g. $Q_c := \mathrm{mod}_{100}(U_c)$) => consistent seed on the cell level (= pseudo-random numbers)
Both approaches: the noise parameters determine the strength of the perturbation
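The post-tabular recipe on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say how the cell-level seed $Q_c$ is turned into noise, so seeding Python's `random.Random` with it is an assumption, as are the function names, the default values `p=10.0` and `sigma0=0.02`, and drawing the sign from the same generator.

```python
import random

def cell_seed(record_seeds):
    """Cell-level seed Q_c = mod100(U_c), with U_c the sum of the record seeds."""
    return sum(record_seeds) % 100

def post_tabular_noise(values, record_seeds, sensitive, p=10.0, sigma0=0.02):
    """Return T_post = T_orig - y1 + y1 * (1 +/- (mu0 + |z_c|)) for one cell.

    p and sigma0 are hypothetical defaults, not values from the slides.
    """
    t_orig = sum(values)
    y1 = max(values)                           # largest contribution to the cell
    rng = random.Random(cell_seed(record_seeds))  # assumption: Q_c seeds the generator
    mu0 = 2 * p / 100 if sensitive else 0.0    # mu0 is set to 0 for non-sensitive cells
    z_c = rng.gauss(0.0, sigma0)
    sign = 1 if rng.random() < 0.5 else -1     # assumption: sign from the same generator
    return t_orig - y1 + y1 * (1 + sign * (mu0 + abs(z_c)))

# The same set of records yields the same noisy value in any table it appears in.
values = [500.0, 120.0, 80.0]
seeds = [17, 42, 93]
in_table_a = post_tabular_noise(values, seeds, sensitive=False)
in_table_b = post_tabular_noise(values[::-1], seeds[::-1], sensitive=False)
```

Because the seed depends only on the records that fall into the cell, a cell that appears in two different tables receives identical noise, which is the between-tables consistency the slide asks for.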
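Noise variances
Pre-tabular approach: $V(t^{pre}) = \sum_{i=1,\dots,n} y_i^2 \sigma_{pre}^2$, where $\sigma_{pre}^2 = \mu_0^2 + \sigma_0^2$
Post-tabular approach: $V(t^{post}) = y_1^2 \sigma_{post}^2$, where $\sigma_{post}^2$ depends on $\mu_0$, $\sigma_0$, the sensitivity indicator $I_s$ and constants $a$, $b$
  $V(t^{post}) = y_1^2 \sigma_0^2$ for non-sensitive cells ($I_s = 0$), because $a^2 + b = 1$
=> $V(t^{post}) < V(t^{pre})$ for non-sensitive cells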
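The identity $a^2 + b = 1$ can be motivated by a short calculation. The sketch below assumes that $a$ and $b$ are the mean and variance constants of the half-normal distribution and that the sign is chosen with equal probability, independently of $z_c$; the slides state neither explicitly.

```latex
% Non-sensitive cell: relative noise r = \pm |z_c|, \quad z_c \sim N(0, \sigma_0^2)
\begin{align*}
E|z_c| &= a\,\sigma_0, & a &= \sqrt{2/\pi} \\
\operatorname{Var}|z_c| &= b\,\sigma_0^2, & b &= 1 - 2/\pi \\
E(r) &= 0 && \text{(symmetric random sign)} \\
V(r) = E(r^2) &= (E|z_c|)^2 + \operatorname{Var}|z_c| = (a^2 + b)\,\sigma_0^2 = \sigma_0^2 \\
V(t^{post}) &= y_1^2\, V(r) = y_1^2\,\sigma_0^2 \;<\; \sum_{i=1}^{n} y_i^2\,(\mu_0^2 + \sigma_0^2) = V(t^{pre})
\end{align*}
```

The inequality holds because the post-tabular sum has only the single term for $y_1$ and no $\mu_0^2$ component.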
Post-tabular noise: what about additivity?
Noisy tables are not additive
Restoring additivity (iterative methods, CTA) causes between-tables inconsistency
We want additivity, but how much?
  Only a few users need exact additivity
  For everyone else, "approximate" additivity (subject to rounding errors) is enough
  Rounding also provides a local information loss measure
What should be the rounding basis?
  Idea: use the width of a confidence interval for $T^{orig}$ to compute the rounding basis $B = 10^b$
  Require: $\mathrm{Round}_B(T^{orig}) \approx \mathrm{Round}_B(T^{post})$
  Publish $\mathrm{Round}_B(T^{post})$
Confidence Interval and Rounding Basis
Confidence interval for $T^{orig}$: centred on $T^{post}$, with the parameters chosen to model the user's ambiguity about the true parameters; $u = 3$ for a 99% interval
Rounding basis: require $\mathrm{Round}_B(T^{orig}) = \mathrm{Round}_B(T^{post})$ (±1)
Example: $T^{orig}$ = 156 764, $T^{post}$ = 156 755, confidence interval [155 463; 158 047]
  Candidate rounding bases: B = 10 000, B = 1 000, B = 100, B = 10
  Choose B = 100
  Publish 156 8XX
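The rounding and publication step of the example can be illustrated as follows. This is a minimal sketch: the helper names are hypothetical, rounding halves upward is an assumption, and the rule that derives B from the confidence interval is not reproduced here because the slides do not spell it out.

```python
def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of base (halves round up)."""
    return (value + base // 2) // base * base

def publish(value, base):
    """Rounded value with the digits below the rounding basis B = 10**b shown as 'X'."""
    n_masked = len(str(base)) - 1      # b digits are masked
    digits = str(round_to_base(value, base))
    if n_masked == 0:
        return digits
    return digits[:-n_masked] + "X" * n_masked

t_orig, t_post = 156764, 156755
same_after_rounding = round_to_base(t_orig, 100) == round_to_base(t_post, 100)
published = publish(t_post, 100)       # "1568XX"
```

With B = 100 both the original and the noisy total round to 156 800, so the published figure carries no visible trace of the noise beyond the masked digits.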
Results: An Example
Turnover (in hundreds) by NACE x District x Size Class; NACE DA151, District XY ("!": sensitive)

Item                 total            < 50 000       50 000-100 000   100 000-250 000   250 000-500 000  500 000-1 000 000
orig                 120,136          328            720              13,85             14,733           29,632
pretab (relDev)      125,471 (-4,44)  265 ! (19,21)  863 ! (-19,9)    11,838 ! (14,53)  14,555 (1,21)    29,869 (-0,80)
posttab (relDev)     121,xxx (-0,72)  xxx ! (100,0)  x,xxx ! (100,0)  13,8xx (0,36)     14,7xx (0,22)    29,7xx (-0,23)
count (sensitivity)  21 (20,0)        1 (0,0)        1 (0,0)          8 (20,0)          4 (20,0)         4 (20,0)
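The relative deviations in this table can be reproduced from the original and pre-tabular values. The sign convention used below, (orig - masked) / orig in percent, is inferred from the table entries and is not stated on the slide; the function name is hypothetical. The slide writes decimals with a comma, the code with a point.

```python
def rel_dev(orig, masked):
    """Relative deviation in percent; the sign convention is inferred from the table."""
    return 100.0 * (orig - masked) / orig

# Pre-tabular values from the table above
total_dev = round(rel_dev(120136, 125471), 2)   # -4.44
small_dev = round(rel_dev(328, 265), 2)         # 19.21
large_dev = round(rel_dev(29632, 29869), 2)     # -0.8
```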
Distribution of non-sensitive cells by relative deviation of the noise
Entries are percentages of the non-sensitive cells in each table.

                    NACE5 x State x SizeCl        NACE5 x Distr x SizeCl        NACE5 x Municipality
Range of rel. dev.  124 204 non-sensitive cells   38 256 non-sensitive cells    1 865 non-sensitive cells
(in %)              Pre    Post   Rd     Adj      Pre    Post   Rd     Adj      Pre    Post   Rd     Adj
0-1                 22.8   88.6   87.2   75.5     13.9   82.8   81.1   63.8     6.5    60.1   61.6   38.7
1-2                 14.8   8.9    8.2    12.0     12.1   13.4   11.6   16.7     6.5    26.0   20.2   21.2
2-3                 10.9   1.9    2.6    5.1      10.0   2.9    4.1    7.6      6.4    8.7    8.9    12.1
3-4                 8.5    0.4    1.0    2.7      8.7    0.7    1.6    4.2      5.5    3.4    4.8    6.8
4-5                 7.0    0.1    0.5    1.6      7.4    0.2    0.8    2.7      5.4    1.3    1.8    5.6
5-6                 5.7    0.1    0.2    1.0      6.8    0.1    0.4    1.6      4.8    0.2    1.2    3.3
6-7                 4.9    0.0    0.1    0.6      5.9    0.0    0.2    1.0      4.8    0.2    0.8    3.0
7-8                 4.1    0.0    0.1    0.4      5.2    0.0    0.1    0.7      5.4    0.1    0.5    1.9
8-9                 3.5    0.0    0.0    0.3      4.7    0.0    0.0    0.5      6.0    0.1    0.1    1.1
9-10                3.1    0.0    0.0    0.2      4.2    0.0    0.0    0.4      5.2    0.0    0.2    1.4
≥ 10                14.7   0.0    0.0    0.5      21.2   0.0    0.1    0.8      43.6   0.0    0.2    4.8
Disclosure Risks
Risk Type I: masked value too close to the original value
  Not very critical: users can't tell which values are actually close
Risk Type II: post-tabular masked data are not additive, i.e. considering the table relations, they are not "feasible"
  It is possible to compute a feasibility interval for sensitive cells from the constraints given by the table relations and the rounding intervals
  Feasibility interval too close = obvious case of disclosure
  No such cases found in empirical tests

                           NACE5 x SizeCl x State  NACE5 x SizeCl x Distr  NACE5 x Municipality
pre-tab                    14%                     13%                     4%
post-adj (after Rounding)  39%                     42%                     15%
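The feasibility interval mentioned under Risk Type II can be illustrated for the simplest case of a single additive relation (one total and its inner cells). This is a sketch under simplifying assumptions: a real analysis would combine all table relations, typically with linear programming, and the published values, the rounding basis and the function names below are hypothetical. The (±1) rounding tolerance is ignored.

```python
import math

def rounding_interval(published, base):
    """Values that round to `published` when rounding to multiples of `base`."""
    return (published - base / 2, published + base / 2)

def feasibility_interval(cell, others, total):
    """Bounds for one cell implied by total = cell + sum(others).

    cell and total are (lo, hi) intervals; others is a list of (lo, hi) intervals.
    """
    lo_others = sum(lo for lo, _ in others)
    hi_others = sum(hi for _, hi in others)
    lo = max(cell[0], total[0] - hi_others)
    hi = min(cell[1], total[1] - lo_others)
    return (lo, hi)

# Hypothetical row: total published as 1 000, two inner cells as 300 and 400 (B = 100),
# and one sensitive cell that is fully masked (known only to be non-negative).
total = rounding_interval(1000, 100)
others = [rounding_interval(300, 100), rounding_interval(400, 100)]
bounds = feasibility_interval((0.0, math.inf), others, total)   # (150.0, 450.0)
```

An intruder could compute such bounds from the published figures; the risk check on the slide asks whether they ever pin the original value down too tightly.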
Conclusions
Flexible rounding of post-tabular noisy data is a promising new method for flexible table servers
Paradigms hold:
  Exact between-tables consistency
  Near-additivity (only "rounding" deviations)
Quality:
  In tables with "usual" detail: more than 95% of non-sensitive cells with less than 2% rel. dev.
  In tables with too much detail for usual cell suppression methods: more than 95% of non-sensitive cells with less than 5% rel. dev.
Transparency: influence of SDC on the data is obvious to users
Risk: no "obvious" disclosure risk found in testing so far
Easy to implement, computational effort negligible
Thanks for your attention