Post-tabular Stochastic Noise to Protect Skewed Business Data

Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods

© Destatis, Mathematical Statistical Methods


Stochastic Noise

Input perturbation (pre-tabular), e.g. Evans, T., Zayatz, L. and Slanta, J. (1998), or output perturbation (post-tabular)?

Challenges for post-tabular noise:

Between-tables consistency
  Use a micro-data seed when generating the noise (ABS; Fraser and Wooton, 2006).

Table additivity
  Restoring additivity leads to between-tables inconsistency.
  Idea: it is enough to achieve near-additivity through flexible rounding.


Masking skewed business data using multiplicative noise

Pre-tabular approach (Höhne, 2008)
  Multiply variable $y_i$ (in record $i$) by $1 \pm (\mu_0 + z_i)$, $z_i \sim N(0, \sigma_0^2)$.

Post-tabular approach
  $T_{post} = T_{orig} - y_1 + y_1 \, (1 \pm (\mu_0 + |z_c|))$
  Set $\mu_0$ to 0 for non-sensitive cells.
  $\mu_0 = 2p/100$ => $T_{post}$ is non-sensitive according to the p%-rule.

  For between-tables consistency:
    Attach a "seed" variable to the microdata.
    When making tables: add up the seeds (-> $U_c$; e.g. $Q_c := \mathrm{mod}_{100}(U_c)$)
    => consistent seed on the cell level (= pseudo-random numbers).

Both approaches: $\sigma_0$ determines the strength of the perturbation.
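The post-tabular rule and the seed mechanism can be sketched as follows. This is a minimal illustration under stated assumptions, not the Destatis implementation: the function names, seeding Python's `random.Random` with $Q_c$, the equiprobable sign, the default values of `p` and `sigma0`, and taking $y_1$ as the largest contribution of the cell are all choices made for the example.

```python
import random

def cell_seed(record_seeds):
    """Add up the record-level seeds (U_c) and reduce them: Q_c = U_c mod 100."""
    return sum(record_seeds) % 100

def post_tabular_noise(contributions, record_seeds, p=10.0, sigma0=0.02, sensitive=False):
    """Return T_post = T_orig - y1 + y1 * (1 +/- (mu + |z_c|)).

    mu is mu0 = 2p/100 for sensitive cells and 0 otherwise.
    Assumptions (not from the source): y1 is the largest contribution,
    the sign is drawn with equal probability, and Q_c seeds the generator.
    """
    rng = random.Random(cell_seed(record_seeds))  # same records -> same noise
    t_orig = sum(contributions)
    y1 = max(contributions)
    mu = 2.0 * p / 100.0 if sensitive else 0.0
    z_c = rng.gauss(0.0, sigma0)
    sign = 1.0 if rng.random() < 0.5 else -1.0
    return t_orig - y1 + y1 * (1.0 + sign * (mu + abs(z_c)))
```

Because the seed is derived only from the records contributing to the cell, the same cell receives the same noise in every table in which it appears, which is the between-tables consistency the slide asks for.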


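Noise variances

Pre-tabular approach
  $V(t_{pre}) = \sigma_{pre}^2 \sum_{i=1,\dots,n} y_i^2$, where $\sigma_{pre}^2 = \mu_0^2 + \sigma_0^2$

Post-tabular approach
  $V(t_{post}) = \sigma_{post}^2 \, y_1^2$, where $\sigma_{post}^2$ depends on $\sigma_0$, $\mu_0$, the constants $a$, $b$ and the sensitivity indicator $I_s$
  $V(t_{post}) = \sigma_0^2 \, y_1^2$ for non-sensitive cells ($I_s = 0$), because $a^2 + b = 1$

=> $V(t_{post}) < V(t_{pre})$ for non-sensitive cells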
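A worked version of the variance comparison, as a sketch under stated assumptions rather than the slide's own derivation: the sign of the noise is taken to be $+$ or $-$ with equal probability and independent of $z$, the noise is independent across records, $y_1$ is the single perturbed contribution of the cell, and $a$, $b$ are read as the half-normal constants $a = \sqrt{2/\pi}$, $b = 1 - 2/\pi$ (this reading is an assumption; the slide only states $a^2 + b = 1$).

```latex
\begin{align*}
E|z_c| &= a\,\sigma_0, \qquad \operatorname{Var}|z_c| = b\,\sigma_0^2,
  \qquad a = \sqrt{2/\pi},\; b = 1 - \tfrac{2}{\pi},\; a^2 + b = 1 \\
E\,z_c^2 &= \operatorname{Var}|z_c| + \big(E|z_c|\big)^2 = (a^2 + b)\,\sigma_0^2 = \sigma_0^2 \\
V(t_{pre}) &= \sum_{i=1}^{n} y_i^2\, E\big[(\mu_0 + z_i)^2\big]
   = (\mu_0^2 + \sigma_0^2) \sum_{i=1}^{n} y_i^2 \\
V(t_{post}) &= y_1^2\, E\big[(I_s \mu_0 + |z_c|)^2\big]
   = y_1^2 \big[(a^2 + b)\,\sigma_0^2 + I_s\,(\mu_0^2 + 2a\,\mu_0 \sigma_0)\big] \\
I_s = 0: \quad V(t_{post}) &= \sigma_0^2\, y_1^2
   \;<\; (\mu_0^2 + \sigma_0^2) \sum_{i=1}^{n} y_i^2 = V(t_{pre})
\end{align*}
```

The inequality in the last line is strict whenever $\mu_0 > 0$, since $y_1^2 \le \sum_i y_i^2$.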


Post-tabular noise: what about additivity?

Noisy tables are not additive.
Restoring additivity (iterative methods, CTA) causes between-tables inconsistency.

We want additivity, but how much?
  Only a few users need exact additivity.
  For everyone else, "approximate" additivity (subject to rounding errors) is enough.
  Rounding also provides a local information loss measure.

What should be the rounding basis?
  Idea: use the width of a confidence interval for $T_{orig}$ to compute the rounding basis $B = 10^b$.
  Require: $Round_B(T_{orig}) \approx Round_B(T_{post})$
  Publish $Round_B(T_{post})$.


Confidence Interval and Rounding Basis

Confidence interval for $T_{orig}$: $T_{post} \pm$ half-width
  Parameters are chosen to model the user's ambiguity about the true parameters.
  $u = 3$ for a 99% interval.

Rounding basis
  Require: $Round_B(T_{orig}) = Round_B(T_{post})$ ($\pm 1$)

Example: $T_{orig}$ = 156 764, $T_{post}$ = 156 755, confidence interval [155 463; 158 047]
  Candidate bases: B = 10 000, B = 1 000, B = 100, B = 10
  Choose B = 100. Publish 156 8XX.
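The rounding requirement and the publication format can be illustrated with a short sketch. How the basis B is derived from the width of the confidence interval is not reproduced here; B is simply passed in. The function names, the round-half-up convention and the restriction to non-negative integers are assumptions made for the example.

```python
def round_to_base(value, base):
    """Round a non-negative integer to the nearest multiple of base (halves up)."""
    return (value + base // 2) // base * base

def agrees(t_orig, t_post, base):
    """Check Round_B(T_orig) = Round_B(T_post) up to one rounding step (+/- 1)."""
    return abs(round_to_base(t_orig, base) - round_to_base(t_post, base)) <= base

def publish(value, base):
    """Round to base = 10**b and show the b rounded-away digits as 'x'."""
    rounded = round_to_base(value, base)
    b = len(str(base)) - 1          # exponent of the basis, assuming base = 10**b
    if b == 0:
        return str(rounded)
    kept = str(rounded)[:-b]        # leading digits that remain informative
    return kept + "x" * b
```

With the example values, `agrees(156764, 156755, 100)` holds. A value that rounds to zero is published as a run of `x` only, which matches the fully masked entries in the results table.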


Results: An Example

Turnover (in hundreds) by NACE x District x Size Class; NACE DA151, District XY.
"!" marks a sensitive cell; relDev in %.

Item                  total             < 50 000
orig                  120,136           328
pretab (relDev)       125,471 (-4.44)   265 ! (19.21)
posttab (relDev)      121,xxx (-0.72)   xxx ! (100.0)
count (sensitivity)   21 (20.0)         1 (0.0)

Item                  50 000-100 000    100 000-250 000   250 000-500 000   500 000-1 000 000
orig                  720               13,850            14,733            29,632
pretab (relDev)       863 ! (-19.9)     11,838 ! (14.53)  14,555 (1.21)     29,869 (-0.80)
posttab (relDev)      x,xxx ! (100.0)   13,8xx (0.36)     14,7xx (0.22)     29,7xx (-0.23)
count (sensitivity)   1 (0.0)           8 (20.0)          4 (20.0)          4 (20.0)
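The relative deviations in the table can be recomputed with a few lines. The sign convention (original minus masked, divided by original) is inferred from the table values and is an assumption, as are the function names and the reading of masked digits as zeros.

```python
def published_value(text):
    """Read a published figure such as '121,xxx': masked digits count as 0."""
    cleaned = text.replace(",", "").replace(" ", "").lower()
    return int(cleaned.replace("x", "0"))

def rel_dev(orig, masked):
    """Relative deviation in percent; positive when the masked value is lower."""
    return 100.0 * (orig - masked) / orig
```

For the cell total, `rel_dev(120136, 125471)` rounds to the table's pre-tabular deviation and `rel_dev(120136, published_value("121,xxx"))` to the post-tabular one. A fully masked cell such as `xxx` is read as 0 and therefore shows a deviation of 100%.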


Distribution of non-sensitive cells by relative deviation of the noise

(1) NACE5 x State x SizeCl, 124 204 non-sensitive cells
(2) NACE5 x Distr x SizeCl, 38 256 non-sensitive cells
(3) NACE5 x Municipality, 1 865 non-sensitive cells
Entries are shares of cells in %.

Range of           (1)                       (2)                       (3)
rel. dev. (in %)   Pre   Post  Rd    Adj     Pre   Post  Rd    Adj     Pre   Post  Rd    Adj
0-1                22.8  88.6  87.2  75.5    13.9  82.8  81.1  63.8     6.5  60.1  61.6  38.7
1-2                14.8   8.9   8.2  12.0    12.1  13.4  11.6  16.7     6.5  26.0  20.2  21.2
2-3                10.9   1.9   2.6   5.1    10.0   2.9   4.1   7.6     6.4   8.7   8.9  12.1
3-4                 8.5   0.4   1.0   2.7     8.7   0.7   1.6   4.2     5.5   3.4   4.8   6.8
4-5                 7.0   0.1   0.5   1.6     7.4   0.2   0.8   2.7     5.4   1.3   1.8   5.6
5-6                 5.7   0.1   0.2   1.0     6.8   0.1   0.4   1.6     4.8   0.2   1.2   3.3
6-7                 4.9   0.0   0.1   0.6     5.9   0.0   0.2   1.0     4.8   0.2   0.8   3.0
7-8                 4.1   0.0   0.1   0.4     5.2   0.0   0.1   0.7     5.4   0.1   0.5   1.9
8-9                 3.5   0.0   0.0   0.3     4.7   0.0   0.0   0.5     6.0   0.1   0.1   1.1
9-10                3.1   0.0   0.0   0.2     4.2   0.0   0.0   0.4     5.2   0.0   0.2   1.4
≥ 10               14.7   0.0   0.0   0.5    21.2   0.0   0.1   0.8    43.6   0.0   0.2   4.8
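Cumulative shares are often easier to compare than the per-range figures. The sketch below sums the Pre and Post columns of the NACE5 x State x SizeCl table up to a chosen limit; the constant and function names are assumptions made for the example, and the data are copied from the table.

```python
from itertools import accumulate

# Upper end (in %) of the ranges 0-1, 1-2, ..., 9-10; the open class ">= 10" has none.
UPPER_BOUNDS = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Shares in % per range, NACE5 x State x SizeCl (124 204 non-sensitive cells).
PRE_STATE = [22.8, 14.8, 10.9, 8.5, 7.0, 5.7, 4.9, 4.1, 3.5, 3.1, 14.7]
POST_STATE = [88.6, 8.9, 1.9, 0.4, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0]

def share_below(shares, limit):
    """Cumulative share (in %) of cells with relative deviation below `limit` percent."""
    cumulative = list(accumulate(shares))
    return round(cumulative[UPPER_BOUNDS.index(limit)], 1)
```

For a limit of 2%, this sums the first two ranges of the chosen column, so the pre-tabular and post-tabular columns can be compared directly.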


Disclosure Risks

Risk Type I: masked value too close to the original value
  Not very critical: users can't tell which values are actually close.

Risk Type II: post-tabular masked data are not additive, i.e. considering the table relations, they are not "feasible".
  It is possible to compute a feasibility interval for sensitive cells, considering the constraints given by
    table relations and
    rounding intervals.
  Feasibility interval too close = obvious case of disclosure.
  No such cases found in empirical tests.

                            NACE5 x SizeCl x State   NACE5 x SizeCl x Distr   NACE5 x Municipality
pre-tab                     14%                      13%                      4%
post-adj (after Rounding)   39%                      42%                      15%
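The idea of a feasibility interval can be sketched with simple interval arithmetic on a single additive relation. This is a simplified illustration, not the procedure used in the empirical tests: a fuller assessment would consider all table relations jointly. The function names, the half-width of B/2 for a rounding interval and the clipping at zero (turnover taken as non-negative) are assumptions.

```python
def rounding_interval(published, base):
    """Values that round to `published` under basis `base` (assumed half-width base/2)."""
    return (published - base / 2, published + base / 2)

def feasibility_interval(total_interval, other_intervals):
    """Bounds for one unknown cell in the relation total = unknown + sum(others),
    given (low, high) intervals for the total and for the other cells."""
    total_low, total_high = total_interval
    low = total_low - sum(high for _, high in other_intervals)
    high = total_high - sum(low_ for low_, _ in other_intervals)
    # Clip at zero: the cell value is assumed to be non-negative.
    return (max(low, 0.0), max(high, 0.0))
```

With hypothetical published values (a total of 1000 at basis 100, and two other cells of 600 at basis 100 and 300 at basis 10), the unknown cell is only bounded to the range from 0 to 205. An interval that wide would not point to the original value.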


Conclusions

Flexible rounding of post-tabular noisy data is a promising new method for flexible table servers.

Paradigms hold:
  Exact between-tables consistency
  Near-additivity (only "rounding" deviations)

Quality:
  In tables with "usual" detail: more than 95% of non-sensitive cells with less than 2% rel. dev.
  In tables with too much detail for usual cell suppression methods: more than 95% of non-sensitive cells with less than 5% rel. dev.

Transparency: influence of SDC on the data is obvious to users.
Risk: no "obvious" disclosure risk found in testing so far.
Easy to implement; computational effort negligible.


Thanks for your attention
