synthetic encoding
DESCRIPTION
Synthetic encoding is a state-of-the-art solution for data lineage. It is related to Dryad and Spark.
TRANSCRIPT
The Synthetic Encoding
AN EFFICIENT APPROACH TO IDENTIFY DAG NODES
[email protected], some rights reserved
Outline
• Prelude
• Canonical Encoding
• Digested Encoding
• Synthetic Encoding
• Cryptanalysis
• Application
• Related work
[email protected], 2013 - 2014
Prelude
A BRIEF INTRODUCTION
Directed Acyclic Graph (Wikipedia)
A directed graph may be used to represent a network of processing elements; in this formulation, data enters a processing element through its incoming edges and leaves the element through its outgoing edges. Examples of this include the following:
• In electronic circuit design, a combinational logic circuit is an acyclic system of logic gates that computes a function of an input, where the input and output of the function are represented as individual bits.
• Dataflow programming languages describe systems of values that are related to each other by a directed acyclic graph. When one value changes, its successors are recalculated; each value is evaluated as a function of its predecessors in the DAG.
• In compilers, straight-line code (that is, sequences of statements without loops or conditional branches) may be represented by a DAG describing the inputs and outputs of each of the arithmetic operations performed within the code; this representation allows the compiler to perform common subexpression elimination efficiently.
• In most spreadsheet systems, the dependency graph that connects one cell to another if the first cell stores a formula that uses the value in the second cell must be a directed acyclic graph. Cycles of dependencies are disallowed because they cause the cells involved in the cycle to not have a well-defined value. Additionally, requiring the dependencies to be acyclic allows a topological order to be used to schedule the recalculations of cell values when the spreadsheet is changed.
Example DAG as an Archetype
• The archetype
• Arithmetic example
• ((a+b)*(a+b))*(c+d)*d
• Aliases
• Node#1
• Node#2
• Node#3
• Node#4
[Figure: the example DAG, with data nodes a, b, c, d and combinator nodes +, +, *, *; the combinators are labeled Node#1 to Node#4. Legend: Data, Combinator.]
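The archetype above can be written down as a small data structure. The following is only a sketch: the class names `Data` and `Combinator` and the function `evaluate` are hypothetical, and the assignment of the aliases Node#1 to Node#4 to particular combinators is an assumption read off the figure.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Data:
    name: str


@dataclass(frozen=True)
class Combinator:
    op: str          # "+" or "*"
    inputs: tuple    # child nodes, each a Data or a Combinator


def evaluate(node, env):
    """Evaluate a node given values for the data nodes."""
    if isinstance(node, Data):
        return env[node.name]
    values = [evaluate(child, env) for child in node.inputs]
    return sum(values) if node.op == "+" else prod(values)


a, b, c, d = (Data(name) for name in "abcd")
node1 = Combinator("+", (a, b))              # a+b
node2 = Combinator("+", (c, d))              # c+d
node3 = Combinator("*", (node1, node1))      # (a+b)*(a+b), sharing node1
node4 = Combinator("*", (node3, node2, d))   # ((a+b)*(a+b))*(c+d)*d

result = evaluate(node4, {"a": 1, "b": 2, "c": 3, "d": 4})  # 9 * 7 * 4 = 252
```

Because `node3` refers to `node1` twice, the structure is a DAG rather than a tree.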
Alias IS NOT Identity
• Arbitrary aliasing
• Node#1 -> Node#a
• Node#2 -> Node#b
• Node#3 -> Node#c
• Node#4 -> Node#d
[Figure: the same DAG with the combinators relabeled Node#a to Node#d. Legend: Data, Combinator.]
Or else…
• Arbitrary aliasing
• Node#1 -> Node#3
• Node#2 -> Node#4
• Node#3 -> Node#2
• Node#4 -> Node#1
[Figure: the same DAG with the combinator labels Node#1 to Node#4 permuted. Legend: Data, Combinator.]
Identity of DAG nodes
• No false positive
• Isomorphic
• Nodes with the same identity should not have different semantics
• No false negative
• Distinctive
• Nodes with different identities should not have the same semantics
• Identity Encoding
• Universal
• Invariant across DAGs
• Presumably existing
[Figure: the example DAG with the combinator identities still to be determined: A=?, B=?, C=?, D=?. Legend: Data, Combinator.]
Application of Identity Encodings
• Structural Analysis
• Profiling
  • Detect hot spots
  • Detect the critical path
• Caching
  • Memoizing with identity
  • Cache(identity(node)) == Apply(node)
• Verification
  • Replay the computation later
  • Formal verification of semantically identical combinators
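The caching rule Cache(identity(node)) == Apply(node) can be sketched as plain memoization. This is an illustration under assumptions: nodes are modeled as nested tuples `(op, input, ...)` with data nodes as strings, `identity` is a stand-in that simply uses the printed form of the node, and the cache is only valid while the data values stay fixed.

```python
from math import prod


def identity(node):
    # Stand-in identity for this sketch: the printed form of the node.
    return repr(node)


def apply_cached(node, env, cache, log):
    """Evaluate node, reusing results stored under identity(node)."""
    if isinstance(node, str):            # data node: look up its value
        return env[node]
    key = identity(node)
    if key in cache:                     # cache hit: skip the computation
        return cache[key]
    op, *inputs = node
    values = [apply_cached(child, env, cache, log) for child in inputs]
    result = sum(values) if op == "+" else prod(values)
    log.append(key)                      # record each real evaluation
    cache[key] = result
    return result


C = ("+", "a", "b")
B = ("*", C, C)                          # uses C twice
cache, log = {}, []
value = apply_cached(B, {"a": 1, "b": 2}, cache, log)
```

Here `C` is evaluated once and served from the cache the second time, so `log` ends up with two entries, one for `C` and one for `B`.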
Interpretations
• Mathematics: Embedding a graph into a linear form
• Engineering: Flatten a DAG into a string
• Feasibility
• Arbitrarily long strings for graph encoding
• Trade-off and fault tolerance
Canonical Encoding
IRREDUCIBLE COMPUTATION
Canonical Encoding
• A = ((a+b)*(a+b))*(c+d)*d
• B = (a+b)*(a+b)
• C = a+b
• D = c+d
[Figure: the example DAG with combinators labeled A, B, C, D. Legend: Data, Combinator.]
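The canonical encodings listed above can be produced by flattening each node into its full expression. A minimal sketch, assuming nodes are modeled as nested tuples `(op, input, ...)` with data nodes as strings; the function name `canonical` is hypothetical.

```python
def canonical(node, top=True):
    """Flatten a node into its full expression string."""
    if isinstance(node, str):            # data node: its own name
        return node
    op, *inputs = node
    body = op.join(canonical(child, top=False) for child in inputs)
    return body if top else f"({body})"  # parenthesize nested combinators


C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")

encodings = {name: canonical(node) for name, node in
             [("A", A), ("B", B), ("C", C), ("D", D)]}
```

The string for a node contains the strings of all its inputs, so its length grows with the size of the unfolded computation.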
Interpretations
• Mathematics: Isomorphism
• Engineering: Script for the evaluation
Digested Encoding
NAÏVE OPTIMIZATION
Digested Encoding
• A = md(((a+b)*(a+b))*(c+d)*d)
• B = md((a+b)*(a+b))
• C = md(a+b)
• D = md(c+d)
[Figure: the example DAG with combinators labeled A, B, C, D. Legend: Data, Combinator.]
where md() is a one-way function, such as MD5 or SHA-1.
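The digested encoding applies md() to the whole canonical string. The sketch below assumes SHA-256 from Python's `hashlib` as md() (the slide only names MD5 and SHA-1 as examples) and models nodes as nested tuples; `canonical` and `digested` are hypothetical names.

```python
import hashlib


def canonical(node, top=True):
    """Flatten a node into its full expression string."""
    if isinstance(node, str):
        return node
    op, *inputs = node
    body = op.join(canonical(child, top=False) for child in inputs)
    return body if top else f"({body})"


def digested(node):
    """md(canonical expression); SHA-256 is assumed as md() here."""
    return hashlib.sha256(canonical(node).encode("utf-8")).hexdigest()


C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")

digests = {name: digested(node) for name, node in
           [("A", A), ("B", B), ("C", C), ("D", D)]}
```

Every digest has the same length, but computing the digest of A still requires building the full expression string for A first.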
Interpretations
• Comparison to Canonical Encoding
• Fixed length
• False positive
• Mathematics: Quasi-isomorphism
• Engineering: Digest of the computation
Synthetic Encoding
FINITE INDUCTION WITH RECURSIVE SEMANTICS
Synthetic Encoding
• A = md([*, B, D, d])
• B = md([*, C, C])
• C = md([+, a, b])
• D = md([+, c, d])
[Figure: the example DAG with combinators labeled A, B, C, D. Legend: Data, Combinator.]
where md() is a one-way function, such as MD5 or SHA-1.
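The recursive definition above can be sketched as follows. Assumptions: SHA-256 stands in for md(), nodes are nested tuples `(op, input, ...)`, data nodes appear by name as on the slide, and the list `[op, x, y]` is serialized by joining its parts with ", " (a real implementation would need an unambiguous serialization).

```python
import hashlib


def md(text):
    """One-way function; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def synthetic(node):
    """Digest of the operator plus the identities of the direct inputs."""
    if isinstance(node, str):            # data node: referenced by name
        return node
    op, *inputs = node
    parts = [op] + [synthetic(child) for child in inputs]
    return md("[" + ", ".join(parts) + "]")


C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")

id_c, id_d, id_b, id_a = (synthetic(n) for n in (C, D, B, A))
```

Each digest input has bounded size: it contains only the operator and the fixed-length identities of the direct inputs, never the whole expression. This sketch recomputes shared inputs such as C; memoizing on the node would avoid that.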
Interpretations
• Comparison to Canonical Encoding
• Fixed length
• False positive
• Comparison to Digested Encoding
• Better Locality
• Propagative Error (false positive)
• Mathematics: Quasi-isomorphism
• Engineering: Digest of the node dependency
Refinement
• Reduce hash collisions by
  • Encoding the cardinality of the node
  • Encoding the depth of the node
  • Using a longer digest, such as SHA-512
• Example
  • A = 4-12-md([*, B, D, d])
  • B = 3-7-md([*, C, C])
  • C = 2-3-md([+, a, b])
  • D = 2-3-md([+, c, d])
• Data
  • x` = 1-1-md([x])
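The prefixes in the example can be reproduced with a short sketch. It assumes, based on the numbers shown, that the first field is the depth (1 for data) and the second is the cardinality, counted as 1 plus the cardinalities of all inputs with shared inputs counted each time they are used. SHA-256 as md(), the tuple node model and the name `refined` are assumptions as well.

```python
import hashlib


def md(text):
    """One-way function; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def refined(node):
    """Return (depth, cardinality, identity) for a node."""
    if isinstance(node, str):                        # data: x` = 1-1-md([x])
        return 1, 1, "1-1-" + md("[" + node + "]")
    op, *inputs = node
    children = [refined(child) for child in inputs]
    depth = 1 + max(d for d, _, _ in children)
    cardinality = 1 + sum(n for _, n, _ in children)
    digest = md("[" + ", ".join([op] + [i for _, _, i in children]) + "]")
    return depth, cardinality, f"{depth}-{cardinality}-{digest}"


C = ("+", "a", "b")
D = ("+", "c", "d")
B = ("*", C, C)
A = ("*", B, D, "d")

prefixes = {name: refined(node)[:2] for name, node in
            [("A", A), ("B", B), ("C", C), ("D", D)]}
```

Under these assumptions the prefixes come out as 4-12 for A, 3-7 for B and 2-3 for both C and D, matching the example.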
Cryptanalysis
IMPLICATION OF FALSE POSITIVE
Metaphor
• Proton should decay
• half-life > 6.6×10^33 years
• at 90% confidence level
• via antimuon decay (*)
• Can be safely ignored
• For experiments
• For non-GUT theory
• Except GUTs POCs
• Hot topics of 1980s
• Kamioka/Super-K
• Failed to observe proton decay
Case Study – SHA2 Family
• Unavoidable collisions
• Pigeonhole principle
• Yet no known collisions
• Birthday Attack
• 2^(L/2) evaluations
• 2^128 for SHA-256
• Attack Attempt
• 41-round SHA-256 out of 64 rounds with time complexity of 2^253.5 and space complexity of 2^16
• 42-round SHA-256 with time complexity of 2^251.7 and space complexity of 2^12
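The birthday figures above can be checked numerically. This sketch uses the standard approximation p ≈ 1 - exp(-n(n-1) / 2^(L+1)) for the chance of at least one collision among n random L-bit digests; the function names and the sample size of 10^12 nodes are illustrative assumptions, not figures from the slides.

```python
import math


def birthday_evaluations(bits):
    """Approximate evaluations needed for a ~50% collision chance."""
    return math.sqrt(2 * math.log(2)) * 2 ** (bits / 2)


def collision_probability(n, bits):
    """Approximate chance of at least one collision among n digests."""
    return -math.expm1(-n * (n - 1) / 2 ** (bits + 1))


evaluations_sha256 = birthday_evaluations(256)     # on the order of 2^128
p_trillion_nodes = collision_probability(10 ** 12, 256)
```

For a 256-bit digest the bound lands between 2^128 and 2^129, and the accidental collision probability for a trillion nodes is far below any practical threshold, which is the point of the proton-decay metaphor.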
Security of Synthetic Encoding
• Immune to birthday attack
  • Data is not arbitrary
  • So what?
• Resistant to collisions
  • Data is not arbitrary
  • Structural dependency
    • Cardinality collision
    • Depth collision
  • High-order digest
  • NSA backdoor?
• Attack Surface
  • Data nodes
    • Cardinality = 1
    • Depth = 1
• Possible path
  • Select data to pollute
  • Generate fake data
  • Upload fake data to system
  • Trigger computation
• Requirements
  • Prescient knowledge
  • Access to data ingestion
Application
NOW WHAT?
Advantage
• Minimal representation of computation
• Exact data lineage
• Trivial de-duplication
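De-duplication and lineage can both be illustrated with a store keyed by synthetic identity. This is a sketch under assumptions: SHA-256 stands in for md(), nodes are nested tuples, and `register` and `store` are hypothetical names.

```python
import hashlib


def md(text):
    """One-way function; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def register(node, store):
    """Add a node and its combinator inputs to store; return its identity."""
    if isinstance(node, str):            # data node: referenced by name
        return node
    op, *inputs = node
    parts = [op] + [register(child, store) for child in inputs]
    key = md("[" + ", ".join(parts) + "]")
    store.setdefault(key, tuple(parts))  # lineage: operator + input identities
    return key


store = {}
job1 = ("*", ("+", "a", "b"), ("+", "a", "b"))   # (a+b)*(a+b)
job2 = ("*", ("+", "a", "b"), ("+", "c", "d"))   # (a+b)*(c+d), built separately
key1 = register(job1, store)
key2 = register(job2, store)
```

Although `a+b` is constructed three times across the two jobs, the store holds it once. Each entry records the operator and the identities of its inputs, so the lineage of any result can be followed one step at a time.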
Big Data
• Data Warehouse
• ETL
• Analytical Computing
• Profiling
• Cache
• In-Memory Computing
• Failover
• Replication
Related Works
REINVENT THE WHEEL?
Research-driven Technology
MICROSOFT RESEARCH
• Dryad
• Nectar
BERKELEY AMPLAB
• Spark
• Tachyon
Statement
• This work was related to previous work since 2005
• This work was influenced by both Dryad and Spark
• But it was done without knowledge of Nectar or Tachyon