self-identifying sensor data - jeff vaughan · parameter check bit allows for parameter recovery....

Post on 01-Aug-2020

1 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Self-Identifying Sensor Data

Jeffrey Vaughan

Department of Computer ScienceHarvard University

With Stephen Chong (Harvard) and Christian Skalka (UVM)

IPSNApril 13, 2010

1/23

Example: Hubbard Brook Ecosystem Study

2/23

Example: Hubbard Brook Ecosystem Study

2/23

Data sharing: a tension

Scientists want their work to be used and cited.But don’t want use without attribution.

GoalTrack data provenance in a light-weight, cooperative way.

3/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

Round

90

3020

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

Round

90

3020

Reorder

435220

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

Round

90

3020

Reorder

435220

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

Round

90

3020

Reorder

435220

ProvenaceMark: 54321

Recover

4/23

Key Idea: Encode provenance in insignificant bits.

Sensorw/ error±1mm

31234423818

123275435493

Snow (μm)

ProvenaceMark: 54321

185338425215

Publish 31218523338

123425435215

Snow (μm)

Sample

Round

90

3020

Reorder

435220

ProvenaceMark: 54321

Recover

Lookup

4/23

The scientific/educational threat model is unusual.

Scientists. . .

are cooperative: they want to properly cite and attributedata.

use legacy workflows, and will be harmed byschema/format changes.

care a lot about data integrity, but understand the boundson measurement precision and accuracy.

5/23

Outline

1 Introduction

2 Encoding and Decoding

3 Evaluation

4 Conclusion

6/23

Encoding and Decoding

7/23

Self-identifying data based on three functions

Encode : Dataset ×Mark → AnnotatedDataset

Check : Dataset ×Mark → bool

Recover : Dataset → Mark

Where

Mark , int32Datapoint , int

Dataset , AnnotatedDataset , List of Datapoint

8/23

Self-identifying data based on three functions

Encode : Dataset ×Mark → AnnotatedDataset

Check : Dataset ×Mark → bool (cf. One-bit watermarking)

Recover : Dataset → Mark (cf. Blind watermarking)

Where

Mark , int32Datapoint , int

Dataset , AnnotatedDataset , List of Datapoint

8/23

Data is collected for scientific analysis.

Sensors measure real-world phenomena.High-order significant bits contain meaningful data.Low-order insignificant bits contain noise.Acceptable encodings should be minimally perceptible.

Guiding principle

Never alter significant bits.

9/23

Encoding must be robust against transformation.

Sampling Store metadata in many points: Pointwiseredundancy.

Truncation Store key metadata in more significant bits.

Other metadata stored in several bit locations:Positional redundancy

Reordering Order independent encoding.

Insertions/Bit Flips Enable recovery even with “bad” datapoints.

10/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 411/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 411/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 811/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp 11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 0

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

1

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

Insig

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

Insig

Mark check bit: mc = hash(Sig. Bits@Mark) mod 2 = 0

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

Insig

Mark check bit: mc = hash(Sig. Bits@Mark) mod 2 = 0

0

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

Insig

Mark check bit: mc = hash(Sig. Bits@Mark) mod 2 = 0

010

11/23

Encoding must use limited supply of insignificant bits

Harvard Red

Publish

Datapoint0 1111 1 11 00 00 000 1c

mc

p

Mark0 1010 0 10 11 10 100 0

c

pp Sig. BitsSig. Bits Insig

L = 6md

0 1111 1 00 00

Number of Provenance Pieces: ppN = 4

0pp 2pp1pp 3pp

ppN = 4

0pp 2pp1pp 3pp4pp 6pp5pp 7pp

ppN = 8

ppN = 8

hash(1110001100) mod N = 4pp

01 1 0

4pp

01 1 001 1 0

01 1 0

Insig

Parameter check bit: pc = hash(Sig. Bits@N ) mod 2 = 1pp

101

Insig

Mark check bit: mc = hash(Sig. Bits@Mark) mod 2 = 0

010

Insig

11/23

Decoding uses all metadata from annotated datasets.

parameter check: recovery of encoding parameters

mark check: checking if a mark is consistent with dataset

provenance piece: recover an unknown mark from a dataset

12/23

Parameter check bit allows for parameter recovery.

GoalLet D be an annotated data set with k elements. Find Lmd andNpp used in encoding.

Algorithm sketch (FindParams)

Enumerate Lmd and Npp candidates, rejecting incorrectguesses.

Calculate pc for each guess and compare with data points in D.

Correct guess: Probability pc-consistent = 1.Incorrect guess:

Probability consistent with any data point ≈ 1/2.Probability consistent with all points ≈ (1/2)k .

13/23

Parameter check bit allows for parameter recovery.

GoalLet D be an annotated data set with k elements. Find Lmd andNpp used in encoding.

Algorithm sketch (FindParams)

Enumerate Lmd and Npp candidates, rejecting incorrectguesses.(200 or fewer candidates in the worst case.)Calculate pc for each guess and compare with data points in D.

Correct guess: Probability pc-consistent = 1.Incorrect guess:

Probability consistent with any data point ≈ 1/2.Probability consistent with all points ≈ (1/2)k .

13/23

Define Check function using mark check bit.

Goal

Check(D,m) = true

iff mark m is encoded into dataset D.

Algorithm sketch (Check )

Recover parameter Lmd (and Npp) using FindParams.

For each data point in D, recompute mc.

Correct guess: Probability mc-consistent = 1.Incorrect guess:

Probability consistent with any data point ≈ 1/2.Probability consistent with all points ≈ (1/2)k .

14/23

Define Check function using mark check bit.

Goal

Check(D,m) = true

iff mark m is encoded into dataset D.

Algorithm sketch (Check )

Recover parameter Lmd (and Npp) using FindParams.

For each data point in D, recompute mc.

Correct guess: Probability mc-consistent = 1.Incorrect guess:

Probability consistent with any data point ≈ 1/2.Probability consistent with all points ≈ (1/2)k .

14/23

Recover function based on Check .

Goal

Recover(D) = m

iff mark m is encoded into dataset D.

Algorithm sketch (Recover )

Let suggestions = GatherSuggestions(D)Let candidates : List mark = Merge(suggestions)

for each m ∈ candidatesif Check(D,m)

return m

Heuristics in Merge keep candidateslist short, and search tractable.

15/23

Recover function based on Check .

Goal

Recover(D) = m

iff mark m is encoded into dataset D.

Algorithm sketch (Recover )

Let suggestions = GatherSuggestions(D)Let candidates : List mark = Merge(suggestions)

for each m ∈ candidatesif Check(D,m)

return m

Heuristics in Merge keep candidateslist short, and search tractable.

15/23

Recover function based on Check .

Goal

Recover(D) = m

iff mark m is encoded into dataset D.

Algorithm sketch (Recover )

Let suggestions = GatherSuggestions(D)Let candidates : List mark = Merge(suggestions)

for each m ∈ candidatesif Check(D,m)

return m

Heuristics in Merge keep candidateslist short, and search tractable.

15/23

Probabilistic algorithms deal with corruption.

Key Ideas

Empirical cutoffs < 1 determine when mc- andpc-consistency scores indicate correct parameters ormarks.

Weighted voting selects likely candidates provenancemarks.

See paper for details.

16/23

Evaluation

17/23

Encoding preserves bulk statistical data properties.

Assumption: hash mod k samples uniform distributionover {0 . . . k − 1}.Let ε = 2Lmd , the largest value written with insignificant bits.

Change in mean Change invariance

worst case ε− 1 (ε− 1)2/4

insignificant bits ∼ 0 0uniformly dist.

18/23

Recovery is robust against sampling. (1/2)

1 10 100 1000 10000

Sample size

0.0

0.2

0.4

0.6

0.8

1.0

pc-c

onsi

sten

cy s

core

19/23

Recovery is robust against sampling. (2/2)

1 10 100 1000

Sample size

0.0

0.2

0.4

0.6

0.8

1.0

Probabilityofrecovery

Predicted, R=1Predicted, R=2Predicted, R=4Predicted, R=8Predicted, R=16Predicted, R=32Observed, R=2

20/23

Conclusion

21/23

Self-identifying data solves a new problem.

This work: a mechanism for embeddingprovenance information into sensor data.

Management of provenance metadata:Metadata generation [SensorBase]Republication and curation [Park and Heidermann; Ledlie,Ng, et al. ’05]

Structured data:Databases [Lee, Bressen, and Madnick ’95; Buneman,Chapmen and Cheney ’06]Video and images [Genani and Lindqvist, ’07]

Non-malleable data:Digital signatures [Goldwasser, Micali, and Rivest ’88]

22/23

Take home message

Self identifying data. . .

associates sensor data with provenance metadata.

maintains integrity of significant bits, means, and variance.

is robust against benign transformations.

23/23

Bonus Slides

Definition of GatherSuggestions

Definition of GatherSuggestions

Algorithm sketch(GatherSuggestions(D))

Recover Lmd and Npp using FindParams.Let suggestions := ∅

For each d ∈ DLet (S,pc,mc,pp) = split(d ,Lmd)Let offset = hash(S) mod NppLet

suggest(i) =

(pp[j], j) j = i − offset mod Npp

and pp[j] in bounds∗ otherwise

suggestions := suggestions ∪ {suggest}

return suggestions

top related