post-tabular stochastic noise


Page 1: Post-tabular Stochastic Noise

© Destatis, Mathematical Statistical Methods

Post-tabular Stochastic Noise to Protect Skewed Business Data

Sarah GIESSING, Federal Statistical Office of Germany, Division Mathematical Statistical Methods

Page 2: Post-tabular Stochastic Noise

Stochastic Noise

Input perturbation (pre-tabular) (Example: Evans, T., Zayatz, L. and Slanta, J. (1998)) or output perturbation (post-tabular)?

Challenges for Post-tabular Noise:
  Between-tables consistency: use a micro-data seed when generating the noise (ABS (Fraser and Wooton, 2006))
  Table additivity: restoring additivity leads to between-tables inconsistency
  Idea: it is enough to achieve near-additivity through Flexible Rounding

Page 3: Post-tabular Stochastic Noise

Masking skewed business data using multiplicative noise

Pre-tabular approach (Höhne, 2008):
  Multiply variable $y_i$ (in record $i$) by $(1 \pm (\mu_0 + z_i))$, $z_i \sim N(0, \sigma_0^2)$

Post-tabular approach:
  $T_{post} = T_{orig} - y_1 + y_1 (1 \pm (\mu_s + |z_c|))$
  $\mu_s$ is set to 0 for non-sensitive cells
  For sensitive cells $\mu_s = \mu_0 = 2p/100$ => $T_{post}$ is non-sensitive according to the p%-rule

For between-tables consistency:
  Attach a "seed" variable to the microdata
  When making tables: add up the seeds (-> $U_c$, e.g. $Q_c := \mathrm{mod}_{100}(U_c)$) => consistent seed on the cell level (= pseudo-random numbers)

Both approaches: $\sigma_0$ determines the strength of the perturbation
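The seed idea above can be sketched in a few lines of Python. This is only an illustration under assumptions that are not on the slide: each record is assumed to carry a fixed integer seed between 0 and 99, $y_1$ is taken to be the largest contribution to the cell, the cell seed $Q_c$ is fed into Python's `random.Random` to draw the sign and $z_c$, and the function name `post_tabular_total` and the parameter values are hypothetical.

```python
import random


def post_tabular_total(records, p, sigma0, sensitive):
    """Return a noisy cell total for a list of (value, seed) records.

    Assumptions (not from the slides): seeds are integers in 0..99 that stay
    attached to each record, and y1 is the largest contribution in the cell.
    """
    values = [value for value, _ in records]
    t_orig = sum(values)
    y1 = max(values)

    # Add up the record seeds and reduce them: Q_c := mod100(U_c).
    u_c = sum(seed for _, seed in records)
    q_c = u_c % 100

    # The same set of records always yields the same Q_c, hence the same noise,
    # no matter which table the cell appears in.
    rng = random.Random(q_c)
    sign = rng.choice((-1, 1))
    z_c = rng.gauss(0.0, sigma0)

    # mu_s is 0 for non-sensitive cells and 2p/100 for sensitive cells.
    mu_s = 2.0 * p / 100.0 if sensitive else 0.0

    # T_orig - y1 + y1 * (1 +/- (mu_s + |z_c|)) simplifies to the line below.
    return t_orig + sign * y1 * (mu_s + abs(z_c))


if __name__ == "__main__":
    cell = [(500, 17), (120, 83), (60, 5)]
    print(post_tabular_total(cell, p=10, sigma0=0.02, sensitive=False))
    print(post_tabular_total(cell[::-1], p=10, sigma0=0.02, sensitive=False))
```

Both printed values agree because the noise depends only on the set of records in the cell, not on their order or on the table being built.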

Page 4: Post-tabular Stochastic Noise

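Noise variances

Pre-tabular approach:
  $V(t_{pre}) = \sum_{i=1,\dots,n} y_i^2 \, \sigma_{pre}^2$, where $\sigma_{pre}^2 = \mu_0^2 + \sigma_0^2$

Post-tabular approach:
  $V(t_{post}) = y_1^2 \, \sigma_{post}^2$, where $\sigma_{post}^2 = I_s (\mu_0^2 + 2 a \mu_0 \sigma_0) + (a^2 + b) \sigma_0^2$, $a = \sqrt{2/\pi}$, $b = 1 - 2/\pi$
  $V(t_{post}) = y_1^2 \, \sigma_0^2$ for non-sensitive cells ($I_s = 0$) because of $a^2 + b = 1$

=> $V(t_{post}) < V(t_{pre})$ for non-sensitive cells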
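As a rough numerical check of the variance comparison, the sketch below evaluates both variances in closed form. It rests on assumptions: the noise factors are taken to be $\pm(\mu_0 + z_i)$ with $z_i \sim N(0, \sigma_0^2)$ per record (pre-tabular) and $\pm(I_s \mu_0 + |z_c|)$ applied to the largest contribution $y_1$ (post-tabular), with equally likely signs; $a = \sqrt{2/\pi}$ and $b = 1 - 2/\pi$ are the mean and variance of $|z|$ for a standard normal $z$. The function names and parameter values are hypothetical.

```python
import math

# Half-normal constants: E|z| = A * sigma and Var|z| = B * sigma^2
# for z ~ N(0, sigma^2), so that A^2 + B = 1.
A = math.sqrt(2.0 / math.pi)
B = 1.0 - 2.0 / math.pi


def var_pre(values, mu0, sigma0):
    """Variance of the pre-tabular noisy total, assuming independent record noise."""
    return sum(y * y for y in values) * (mu0 ** 2 + sigma0 ** 2)


def var_post(values, mu0, sigma0, sensitive):
    """Variance of the post-tabular noisy total (noise applied to y1 only)."""
    i_s = 1 if sensitive else 0
    y1 = max(values)  # assumed: y1 is the largest contribution
    sigma_post_sq = i_s * (mu0 ** 2 + 2 * A * mu0 * sigma0) + (A * A + B) * sigma0 ** 2
    return y1 * y1 * sigma_post_sq


if __name__ == "__main__":
    values = [500, 120, 60]
    print(var_pre(values, mu0=0.2, sigma0=0.02))
    print(var_post(values, mu0=0.2, sigma0=0.02, sensitive=False))
```

For a non-sensitive cell the post-tabular variance reduces to $y_1^2 \sigma_0^2$, which is smaller than the pre-tabular sum over all records whenever the cell has any contribution at all.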

Page 5: Post-tabular Stochastic Noise

Post-tabular noise: what about additivity?

Noisy tables are not additive
Restoring additivity (iterative methods, CTA) causes between-tables inconsistency

We want additivity, but how much?
  Only a few users need exact additivity
  For everyone else, "approximate" additivity (subject to rounding errors) is enough
  Rounding also provides a local information loss measure

What should be the rounding basis?
  Idea: use the width of the confidence interval for $T_{orig}$ to compute the rounding basis $B = 10^b$
  Require: $\mathrm{Round}_B(T_{orig}) \approx \mathrm{Round}_B(T_{post})$
  Publish $\mathrm{Round}_B(T_{post})$

Page 6: Post-tabular Stochastic Noise

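Confidence Interval and Rounding Basis

Confidence interval for $T_{orig}$:
  $T_{post} \pm \dots$ (half-width in terms of $y_1$, $a$, $b$, $u$ and cell-level parameters with subscript $c$)
  The cell-level parameters model the user's ambiguity about the true parameters; $u = 3$ for a 99% interval

Rounding basis:
  Require: $\mathrm{Round}_B(T_{orig}) = \mathrm{Round}_B(T_{post})$ (±1)

Example: $T_{orig}$ = 156 764, $T_{post}$ = 156 755, confidence interval [155 463; 158 047]
  Candidate bases: B = 10 000, B = 1 000, B = 100, B = 10
  Choose B = 100
  Publish 156 8XX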
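The rounding step itself can be illustrated with a small Python sketch. It does not reproduce how the basis is derived from the confidence interval; it only shows rounding to a given basis, the check that the original and noisy totals agree after rounding (up to one unit of the basis), and the masked display of the published value. The helper names, the round-half-up convention and the restriction to non-negative integers and power-of-ten bases are assumptions.

```python
def round_to_basis(value, basis):
    """Round a non-negative integer to the nearest multiple of basis (halves go up)."""
    return (value + basis // 2) // basis * basis


def rounds_agree(t_orig, t_post, basis, tolerance_units=1):
    """Check Round_B(T_orig) = Round_B(T_post) up to tolerance_units steps of B."""
    diff = abs(round_to_basis(t_orig, basis) - round_to_basis(t_post, basis))
    return diff <= tolerance_units * basis


def published(value, basis):
    """Show the rounded value with the rounded-away digits replaced by X.

    Assumes basis is a power of ten, e.g. 10, 100, 1000.
    """
    rounded = round_to_basis(value, basis)
    hidden_digits = len(str(basis)) - 1
    return str(rounded // basis) + "X" * hidden_digits


if __name__ == "__main__":
    t_orig, t_post = 156764, 156755
    for basis in (10000, 1000, 100, 10):
        print(basis, rounds_agree(t_orig, t_post, basis), published(t_post, basis))
```

With the example values from the slide and B = 100, both totals round to 156 800, so the noisy total would be shown as 1568XX.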

Page 7: Post-tabular Stochastic Noise

Results: An Example

Turnover (in hundreds) by NACE x District x Size Class (relDev in %)

[marker]: sensitive

NACE   District  Item                 total            < 50 000
DA151  XY        orig                 120,136          328
DA151  XY        pretab (relDev)      125,471 (-4.44)  265 ! (19.21)
DA151  XY        posttab (relDev)     121,xxx (-0.72)  xxx ! (100.0)
DA151  XY        count (sensitivity)  21 (20.0)        1 (0.0)

Item                 50 000-100 000   100 000-250 000   250 000-500 000  500 000-1 000 000
orig                 720              13,850            14,733           29,632
pretab (relDev)      863 ! (-19.9)    11,838 ! (14.53)  14,555 (1.21)    29,869 (-0.80)
posttab (relDev)     x,xxx ! (100.0)  13,8xx (0.36)     14,7xx (0.22)    29,7xx (-0.23)
count (sensitivity)  1 (0.0)          8 (20.0)          4 (20.0)         4 (20.0)
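The relative deviations in the table appear to follow the convention (original minus masked) divided by original, in percent. That convention is inferred from the figures rather than stated on the slide, and the function name `rel_dev` is hypothetical. A minimal sketch:

```python
def rel_dev(orig, masked):
    """Relative deviation in percent, signed as (orig - masked) / orig.

    The sign convention is an assumption inferred from the example table.
    """
    return 100.0 * (orig - masked) / orig


if __name__ == "__main__":
    # Total column of the example: original 120,136, pre-tabular 125,471.
    print(round(rel_dev(120136, 125471), 2))
    # Size class < 50 000: original 328, pre-tabular 265.
    print(round(rel_dev(328, 265), 2))
```

A suppressed or fully rounded-away value corresponds to a masked value of 0 under this convention, which gives the 100.0 entries shown for the sensitive cells.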

Page 8: Post-tabular Stochastic Noise

Distribution of non-sensitive cells by relative deviation of the noise (% of cells)

(1) NACE5 x State x SizeCl, 124 204 non-sensitive cells
(2) NACE5 x Distr x SizeCl, 38 256 non-sensitive cells
(3) NACE5 x Municipality, 1 865 non-sensitive cells

Range of           (1)                        (2)                        (3)
rel. dev. (in %)   Pre   Post  Rd    Adj      Pre   Post  Rd    Adj      Pre   Post  Rd    Adj
0-1                22.8  88.6  87.2  75.5     13.9  82.8  81.1  63.8     6.5   60.1  61.6  38.7
1-2                14.8  8.9   8.2   12.0     12.1  13.4  11.6  16.7     6.5   26.0  20.2  21.2
2-3                10.9  1.9   2.6   5.1      10.0  2.9   4.1   7.6      6.4   8.7   8.9   12.1
3-4                8.5   0.4   1.0   2.7      8.7   0.7   1.6   4.2      5.5   3.4   4.8   6.8
4-5                7.0   0.1   0.5   1.6      7.4   0.2   0.8   2.7      5.4   1.3   1.8   5.6
5-6                5.7   0.1   0.2   1.0      6.8   0.1   0.4   1.6      4.8   0.2   1.2   3.3
6-7                4.9   0.0   0.1   0.6      5.9   0.0   0.2   1.0      4.8   0.2   0.8   3.0
7-8                4.1   0.0   0.1   0.4      5.2   0.0   0.1   0.7      5.4   0.1   0.5   1.9
8-9                3.5   0.0   0.0   0.3      4.7   0.0   0.0   0.5      6.0   0.1   0.1   1.1
9-10               3.1   0.0   0.0   0.2      4.2   0.0   0.0   0.4      5.2   0.0   0.2   1.4
≥ 10               14.7  0.0   0.0   0.5      21.2  0.0   0.1   0.8      43.6  0.0   0.2   4.8

Page 9: Post-tabular Stochastic Noise

Disclosure Risks

Risk Type I: Masked value too close to original value
  Not very critical: users can't tell which values are actually close

Risk Type II: Post-tabular masked data are not additive, i.e. considering the table relations, they are not "feasible".
  Possible to compute feasibility intervals for sensitive cells considering the constraints given by table relations and rounding intervals
  Feasibility interval too close = obvious case of disclosure
  No such cases found in empirical tests

                           NACE5 x SizeCl x State  NACE5 x SizeCl x Distr  NACE5 x Municipality
pre-tab                    14%                     13%                     4%
post-adj (after Rounding)  39%                     42%                     15%

Page 10: Post-tabular Stochastic Noise

Conclusions

Flexible rounding of post-tabular noisy data is a promising new method for flexible table servers

Paradigms hold:
  Exact between-tables consistency
  Near-additivity (only "rounding" deviations)

Quality:
  In tables with "usual" detail: more than 95% of non-sensitive cells with less than 2% rel. dev.
  In tables with too much detail for usual cell suppression methods: more than 95% of non-sensitive cells with less than 5% rel. dev.

Transparency: influence of SDC on the data is obvious to users
Risk: no "obvious" disclosure risk found in testing so far
Easy to implement, computational effort negligible

Page 11: Post-tabular Stochastic Noise

Thanks for your attention