Verifiable ASICs
Michael Walfish
Dept. of Computer Science, Courant Institute, NYU
Aarhus Workshop on Secure Multiparty Computation 1 June 2016
This is joint work with:
Riad S. Wahby (Stanford), Max Howald (Cooper Union and NYU), Siddharth Garg (NYU), abhi shelat (U. of Virginia)
Riad recently presented this work at IEEE S&P (Oakland).
Problem: the manufacturer (“foundry” or “fab”) of a custom chip (“ASIC”) can undermine the chip’s execution.
[Diagram labels: principal (govt, chip vendor, …); chip design; chip manufacturer (“foundry” or “fab”); eavesdropper; two encrypted phones]
Response: control the manufacturing chain with a trusted foundry
But trusted fabrication is not a panacea:
§ Only 5 countries have cutting-edge fabs on shore
§ Building a new fab takes $billions and years of R&D
§ With semiconductor technology, area and energy shrink with the square and cube of transistor dimension, respectively
§ So: old fabs mean an enormous penalty. Example of India: 10^8×.
Trusted fabrication is the only solution with strong guarantees.
§ For example, post-fab detection can be thwarted [A2: Analog Malicious Hardware. Yang et al., IEEE S&P 2016]
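To make the scaling bullet concrete, the sketch below applies the square and cube rule to the 350 nm and 7 nm nodes that appear later in this talk's evaluation constraints. It is a back-of-the-envelope model only: the function name is illustrative, and real processes deviate from ideal scaling.

```python
def scaling_penalty(old_nm: float, new_nm: float) -> dict:
    """Ideal-scaling penalty of an old technology node relative to a new one."""
    ratio = old_nm / new_nm
    return {
        "area": ratio ** 2,    # area shrinks with the square of the dimension
        "energy": ratio ** 3,  # energy shrinks with the cube of the dimension
    }

penalty = scaling_penalty(350, 7)
print(penalty)
```

Under this idealized rule, the older node pays a larger factor in energy than in area, which is why the gap between trusted and untrusted fabs matters so much in what follows.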
We thought: probabilistic proofs might let us get trust more cheaply!
An alternative: Verifiable ASICs
principal: F → designs for P, V
untrusted fab (fast) builds P
trusted fab (slow) builds V
integrator: P, V
operator: input x; output y, with proof that F(x) = y
Makes sense if V + P cheaper than trusted F
Reasons for hope:
§ Running time of V < F (asymptotically)
§ Implementations exist, and …
§ … though their costs for P are absurd, advanced fab might make P cheaper than F (!)
[Diagram: V and P, with input x, output y, and proof that F(x) = y, vs. F]
GMR85 Babai85 BCC86 BFLS91 FGLSS91 Kilian92 ALMSS92 AS92 Micali94 BG02 GOS06 IKO07 GKR08 KR09 GGP10 Groth10 GLR11 Lipmaa11 BCCT12 GGPR12 BCCT13 KRR14 …
SBW11 CMT12 SMBW12 TRMP12 SVPBBW12 SBVBPW13 VSBW13 PGHR13 Thaler13 BCGTV13 BFRSBW13 BFR13 DFKP13 BCTV14a BCTV14b BCGGMTV14 FL14 KPPSST14 FGP14 WSRHBW15 BBFR15 CFHKKNPZ15 CTV15 KZMQCPPsS15
Reasons for caution:
§ The theory is silent about feasibility (and the onus here is heavier than in prior work)
§ Costs must reflect hardware: energy, area, ….
§ We need physically realizable designs and plausible computation sizes
(1) Zebra: a system that saves costs
(2) … sometimes
Implementations of probabilistic proofs:
§ program translator (front-end): C program, main(){ ... } → arithmetic circuit (AC) over 𝔽p
§ probabilistic proof protocol (back-end): prover (P) and verifier (V); x → y, proof
  § interactive proof [GKR08]
  § interactive argument [IKO07]
  § non-interactive argument (CS proof, SNARG, SNARK) [Micali94, Groth10, Lipmaa12, GGPR12]
arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]
• non-deterministic ACs
• arbitrary AC geometry
• 1-round, 2-round protocols
• unsuited to hardware
interactive proofs [GKR08, CMT12, VSBW13]
• deterministic ACs only
• layered, low-depth ACs
• lots of rounds, communication
• suited to hardware
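As an illustration of what a layered, deterministic arithmetic circuit over 𝔽p looks like, here is a minimal evaluator sketch. The gate encoding, the function name, and the choice of prime are assumptions made for this example; they are not taken from Zebra or from any front-end compiler.

```python
P = 2**61 - 1  # a Mersenne prime, chosen here only for illustration

def eval_layered_ac(layers, inputs, p=P):
    """Evaluate a layered arithmetic circuit over F_p.

    layers[0] is the layer nearest the inputs. Each gate is (op, i, j), where
    i and j index into the previous layer's values and op is "add" or "mul".
    """
    values = [x % p for x in inputs]
    for layer in layers:
        nxt = []
        for op, i, j in layer:
            a, b = values[i], values[j]
            nxt.append((a + b) % p if op == "add" else (a * b) % p)
        values = nxt  # a layer reads only the layer directly before it
    return values

# F(x0, x1, x2, x3) = (x0 * x1) + (x2 * x3): two layers, at most two gates wide
circuit = [
    [("mul", 0, 1), ("mul", 2, 3)],
    [("add", 0, 1)],
]
y = eval_layered_ac(circuit, [2, 3, 4, 5])
print(y)
```

Because each layer depends only on the previous one, the gates within a layer are independent of each other, which is the property the later slides on parallelism rely on.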
Zebra builds on the GKR interactive proof [GKR08, CMT12, VSBW13]; computations are expressed as layered arithmetic circuits over 𝔽p
[Diagram: verifier and prover; x goes to the prover, which returns y; a sequence of sum-check invocations [LFKN90] follows; the verifier outputs ACCEPT/REJECT]
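The sum-check protocol [LFKN90] is the building block of each invocation above. The sketch below runs a generic sum-check for a small polynomial of degree at most 2 in each variable, so each round's message is three evaluations. It is not the GKR protocol itself: the polynomial f, the brute-force prover, and the prime are assumptions chosen for illustration.

```python
import itertools
import random

P = 2**61 - 1  # illustrative prime modulus

def interp3(g0, g1, g2, r, p=P):
    """Evaluate at r the degree-<=2 polynomial taking values g0, g1, g2 at 0, 1, 2."""
    inv2 = pow(2, p - 2, p)
    l0 = (r - 1) * (r - 2) % p * inv2 % p
    l1 = (-r * (r - 2)) % p
    l2 = r * (r - 1) % p * inv2 % p
    return (g0 * l0 + g1 * l1 + g2 * l2) % p

def prover_round(f, n, fixed, p=P):
    """Honest prover: sum f over the unfixed suffix, with the next variable at 0, 1, 2."""
    free = n - len(fixed) - 1
    out = []
    for k in (0, 1, 2):
        total = 0
        for b in itertools.product((0, 1), repeat=free):
            total += f(*fixed, k, *b)
        out.append(total % p)
    return out

def sumcheck(f, n, claim, prover=prover_round, p=P, rng=random):
    """Run the verifier's side; return True iff it accepts the claimed sum."""
    fixed = []
    for _ in range(n):
        g0, g1, g2 = prover(f, n, fixed, p)
        if (g0 + g1) % p != claim % p:   # round consistency check
            return False
        r = rng.randrange(p)             # verifier's random challenge
        claim = interp3(g0, g1, g2, r, p)
        fixed.append(r)
    return f(*fixed) % p == claim        # final check: one evaluation of f

def f(x, y, z):
    return (x * y + 2 * z * z + x * z) % P  # degree <= 2 in each variable

true_sum = sum(f(*b) for b in itertools.product((0, 1), repeat=3)) % P
print(sumcheck(f, 3, true_sum), sumcheck(f, 3, (true_sum + 1) % P))
```

The verifier does a constant amount of work per round plus one evaluation of f, while the prover does the summation; this asymmetry is what the running-time claims in this talk build on.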
V’s sequential running time: O(depth · log width + |x| + |y|), assuming precomputation of queries
Cost to execute F directly: O(depth · width)
Soundness error: minuscule for large p
P’s sequential running time: O(depth · width · log width)
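To get a feel for the three asymptotic costs on these slides, the sketch below plugs sample sizes into the formulas with every constant factor set to 1. That is a strong simplifying assumption, and the sample depth, width, and input/output size are arbitrary; the numbers show the shape of the gap, not measured costs.

```python
import math

def gkr_costs(depth: int, width: int, io_size: int) -> dict:
    """Operation counts from the slide formulas, with all constants assumed to be 1."""
    log_w = math.log2(width)
    return {
        "verifier": depth * log_w + io_size,  # O(depth · log width + |x| + |y|)
        "direct": depth * width,              # O(depth · width)
        "prover": depth * width * log_w,      # O(depth · width · log width)
    }

costs = gkr_costs(depth=20, width=2**16, io_size=2**10)
print(costs)
```

In this toy setting the verifier's count is far below the direct execution, and the prover's is a log width factor above it.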
Zebra extracts parallelism
Execution step: layers are sequential, but gates can be executed in parallel.
Proving step: can P and V parallelize the interaction?
§ No: V must ask questions in order
§ But parallelism is still available
V questions P about F(x1)’s output layer
Simultaneously, P returns F(x2)
V questions P about F(x1)’s next layer and F(x2)’s output layer
Meanwhile, P returns F(x3)
This process continues until V and P are completing one proof in each time step.
prover
sub-prover, layer 0
sub-prover, layer 1
…
sub-prover, layer i
…
sub-prover, layer d-1
This is nothing other than pipelining, a classic hardware technique.
It applies because layering organizes the work into stages.
There are other opportunities along these lines.
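The pipelining described here can be sketched as a schedule: which instance each per-layer sub-prover is working on at each time step. The sketch assumes every layer takes exactly one time step and numbers instances and layers from 0; both are simplifications for illustration, and the function name is hypothetical.

```python
def pipeline_schedule(num_layers: int, num_instances: int):
    """For each time step, map sub-prover (layer) index to the instance it handles.

    Instance n enters sub-prover 0 at step n and advances one sub-prover per step.
    """
    steps = []
    for t in range(num_instances + num_layers - 1):
        busy = {layer: t - layer
                for layer in range(num_layers)
                if 0 <= t - layer < num_instances}
        steps.append(busy)
    return steps

schedule = pipeline_schedule(num_layers=3, num_instances=5)
for t, busy in enumerate(schedule):
    print(t, busy)
```

Once the pipeline is full, every sub-prover is busy in each step and one instance finishes per step, even though each individual proof still proceeds layer by layer in order.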
Sub-prover’s obligation in round j of sum-check invocation: return H_j(0), H_j(1), H_j(2), where
H_j(k) = \sum_{\text{gates } g} u_j(g) \cdot v_j(g, k)
u_{j+1}(g) = u_j(g) \cdot v_j(g, r_j)
Messages: r_j (from V); H_j(0), H_j(1), H_j(2) (to V)

for k in {0,1,2}:
    H[k] ← 0
    for all gates g:
        H[k] ← H[k] + u[g]*v(g,k)
for all gates g:
    u[g] ← u[g]*v(g,rj)

[Diagram: gate modules 1 … g, each performing: load u[g]; u[g]*v(g, k=0); u[g]*v(g, k=1); u[g]*v(g, k=2); store new u[g]. The modules share a RAM and feed an adder tree.]
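The round computation on this slide can be written out directly. In the sketch below, the per-gate factor v is passed in as a function because the slides do not define it; the toy v used in the demonstration is purely hypothetical, as are the function names and the prime.

```python
P = 2**61 - 1  # illustrative prime modulus

def round_evaluations(u, v, p=P):
    """Return [H(0), H(1), H(2)], where H(k) = sum over gates g of u[g] * v(g, k)."""
    return [sum(u[g] * v(g, k) for g in range(len(u))) % p for k in (0, 1, 2)]

def fold_state(u, v, r_j, p=P):
    """After V sends r_j, update each gate's state: u[g] <- u[g] * v(g, r_j)."""
    return [u[g] * v(g, r_j) % p for g in range(len(u))]

# Hypothetical per-gate factor, for demonstration only: v(g, k) = g + k + 1.
def toy_v(g, k):
    return (g + k + 1) % P

u0 = [1, 2, 3]
H = round_evaluations(u0, toy_v)
u1 = fold_state(u0, toy_v, r_j=5)
print(H, u1)
```

Each gate's products depend only on that gate's own u[g], so the per-gate work can run in parallel, and the only cross-gate step is the summation, which is what an adder tree provides.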
[Diagram: sub-prover, layer i, without the RAM: gate modules 1 … g hold u[1] … u[g] locally and feed the adder tree]
[Diagram: sub-prover, layer i, with each gate module computing u[g]*v(g, k=0), u[g]*v(g, k=1), and u[g]*v(g, k=2) side by side]
Summary of Zebra’s design approach:
§ Extract parallelism
  § Pipelined proving, adder tree, gate proving, etc.
§ Exploit locality: distribute state and control
  § Custom registers (no RAM): “data” wires are few and short
  § Latency-insensitive design: few “control” wires
§ Reuse and recycle
  § Recycle hardware circuitry for different tasks
  § Save energy by adding memoization to P
  § Reuse block designs; optimizations thus have high pay-off
Architectural and operational challenges for Zebra
1. Communication between V and P is high bandwidth
§ V and P on circuit board? Too much energy, circuit area
§ Zebra’s response: use 3D packaging
2. Protocol requires input-independent precomputation
§ Zebra’s response: amortize precomputations over many V-P pairs
3. Trusted storage would be prohibitive
§ Zebra’s response: use untrusted storage, with authenticated encryption
The implementation of Zebra includes:
§ An arithmetic circuit to synthesizable Verilog compiler for P
  § Composes with existing C to arithmetic circuit compilers
§ Two V implementations:
  § hardware (Verilog)
  § software (C++)
§ Library to generate V’s precomputations
§ Verilog simulator extensions to model software or hardware V’s interactions with P and with storage
This implementation seemed to work great.
Zebra: 10^4 or 10^5 proofs per second
Existing implementations: 10 seconds per proof, at least
But that isn’t a serious evaluation …
Evaluation method
§ Baseline: direct implementation of F in same technology as V
§ Metrics: energy, chip size per throughput (in paper)
§ Assessed with circuit synthesis and simulation, published chip designs, and CMOS scaling models
§ Charge for V, P, communication; retrieving and decrypting precomputations; PRNG; operator communicating with V
§ Constraints: trusted fab = 350 nm; untrusted fab = 7 nm; max chip area = 200 mm²; max total power = 150 W
[Diagram: V and P, with input x, output y, and proof, vs. F]
Technology nodes: 350 nm (1997); 7 nm (2017) [TSMC]
Application #1: number theoretic transform
NTT: a Fourier transform over 𝔽p
Used in signal processing, computer algebra, etc.
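For readers unfamiliar with the NTT, the sketch below computes it naively from the definition, over a tiny field. The parameters p = 17 and ω = 4 are chosen only because 4 has multiplicative order 4 modulo 17; the talk does not say which algorithm or parameters Zebra's benchmark uses, and practical implementations use an O(n log n) butterfly structure rather than this O(n²) loop.

```python
def ntt(a, omega, p):
    """Naive O(n^2) number theoretic transform: A[i] = sum_j a[j] * omega^(i*j) mod p."""
    n = len(a)
    return [sum(a[j] * pow(omega, i * j, p) for j in range(n)) % p for i in range(n)]

def intt(A, omega, p):
    """Inverse transform: use omega^-1 and scale by n^-1 (inverses via Fermat's little theorem)."""
    n = len(A)
    omega_inv = pow(omega, p - 2, p)
    n_inv = pow(n, p - 2, p)
    return [x * n_inv % p for x in ntt(A, omega_inv, p)]

p, omega = 17, 4  # 4 is a primitive 4th root of unity modulo 17
a = [1, 2, 3, 4]
A = ntt(a, omega, p)
print(A, intt(A, omega, p))
```

The transform uses only additions and multiplications over 𝔽p, which is why it maps naturally onto an arithmetic circuit.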
[Plot: ratio of baseline energy to Zebra energy (baseline vs. Zebra; higher is better), with vertical-axis ticks at 0.1, 0.3, 1, 3, versus log2(NTT size) from 6 to 13]
Application #2: Curve25519 point multiplication
Curve25519: a commonly-used elliptic curve
Point multiplication: primitive used for ECDH
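As a reference point for what this primitive computes, here is a compact software sketch of Curve25519 point multiplication using the Montgomery ladder as described in RFC 7748. It is not Zebra's arithmetic circuit and not constant-time; the function names and the two sample secrets are illustrative.

```python
P25519 = 2**255 - 19
A24 = 121665  # (486662 - 2) // 4, a constant of the Montgomery curve

def clamp(k: int) -> int:
    """Clamp a scalar as RFC 7748 does: clear the low 3 bits, set bit 254."""
    return (k & ((1 << 254) - 8)) | (1 << 254)

def x25519(k: int, u: int) -> int:
    """Multiply the point with x-coordinate u by the clamped scalar k."""
    k, p = clamp(k), P25519
    x1 = u % p
    x2, z2, x3, z3, swap = 1, 0, x1, 1, 0
    for t in reversed(range(255)):
        bit = (k >> t) & 1
        if swap ^ bit:  # conditional swap (a plain branch here, so not constant-time)
            x2, x3, z2, z3 = x3, x2, z3, z2
        swap = bit
        a, b = (x2 + z2) % p, (x2 - z2) % p
        aa, bb = a * a % p, b * b % p
        e = (aa - bb) % p
        c, d = (x3 + z3) % p, (x3 - z3) % p
        da, cb = d * a % p, c * b % p
        x3 = (da + cb) ** 2 % p
        z3 = x1 * (da - cb) ** 2 % p
        x2 = aa * bb % p
        z2 = e * (aa + A24 * e) % p
    if swap:
        x2, x3, z2, z3 = x3, x2, z3, z2
    return x2 * pow(z2, p - 2, p) % p  # divide by z2 via Fermat inversion

# Toy ECDH exchange with arbitrary sample secrets and base point u = 9.
alice_secret = int.from_bytes(bytes(range(32)), "little")
bob_secret = int.from_bytes(bytes(range(32, 64)), "little")
alice_public = x25519(alice_secret, 9)
bob_public = x25519(bob_secret, 9)
shared_a = x25519(alice_secret, bob_public)
shared_b = x25519(bob_secret, alice_public)
print(shared_a == shared_b)
```

Every step of the ladder is field addition, subtraction, or multiplication modulo 2^255 − 19, and the sequence of operations is the same for every scalar, with only the conditional swaps depending on the scalar's bits.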
[Plot: ratio of baseline energy to Zebra energy (baseline vs. Zebra; higher is better), with vertical-axis ticks at 0.1, 0.3, 1, 3, versus the number of parallel Curve25519 point multiplications: 84, 170, 340, 682, 1147]
(1) Zebra: a system that saves costs
(2) … sometimes
Summary of Zebra’s applicability:
1. Computation F must have a layered, shallow, deterministic AC
2. Need wide gap between (fast) fab for P and (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips
Restriction 1 comes from the interactive proof (IP) setup
Why did we build Zebra atop IPs instead of arguments?
Because argument protocols seem unfriendly to hardware:
Design principle | interactive proofs [GKR08, CMT12, VSBW13] | arguments (interactive, SNARK, CS proof, etc.) [GGPR12, PGHR13, SBVBPW13, BCTV14]
Extract parallelism | ✓ | ✓
Exploit locality | ✓ | ✗
Reduce and reuse | ✓ | ✗
§ In arguments, P computes over entire AC at once ⟶ need RAM
§ P does crypto for every gate in AC ⟶ special crypto circuits needed
We hope these issues are surmountable!
Reality check on the restrictions:
1. Computation F must have a layered, shallow, deterministic AC
2. Need wide gap between (fast) fab for P and (trusted) fab for V
3. Computation F must be relatively large for V to save work
4. Computation F must be efficient as an arithmetic circuit (AC)
5. Must amortize precomputations over many chips
Restriction 1 applies to interactive proofs (IPs) but not arguments
Restrictions 2–5 are common to all implementations of probabilistic proofs
A limitation that is endemic to the area: Need wide gap between (fast) fab for P and (trusted) fab for V
[Figure: worker’s cost normalized to native C, with axis ticks up to 10^13, for matrix multiplication (m=128) and PAM clustering (m=20, d=128), across Pepper, Ginger, Pinocchio, Zaatar, CMT, native C, Allspice, TinyRAM, and Thaler; at least one entry is N/A]
Limitations that are endemic to the area:
Computation F must be relatively large for V to save work
Computation F must be efficient as an arithmetic circuit
§ Example: libsnark’s [BCTV14] optimized implementation of GGPR/Pinocchio [GGPR12, PGHR13]. Great work, but:
  § Verification time: 6 ms + (|x| + |y|) · 3 µs on 2.7 GHz CPU
  § That time is >16 million CPU ops, which is a break-even point
  § libsnark handles ≤ 16 million gates (with 32 GB of RAM), so to break even, F also needs on average CPU_ops/AC_gate > 1.
    § Example: addition over 𝔽p instead of over fixed-width integers
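The break-even arithmetic in this example can be checked in a few lines. The sketch below uses only the figures quoted on the slide and treats one clock cycle as one CPU operation, which is a simplifying assumption; the variable and function names are illustrative.

```python
CPU_HZ = 2.7e9      # 2.7 GHz CPU, from the slide
MAX_GATES = 16e6    # largest circuit libsnark handles, from the slide

def verify_seconds(io_elems: int) -> float:
    """Verification time from the slide: 6 ms + (|x| + |y|) · 3 µs."""
    return 6e-3 + io_elems * 3e-6

# Fixed part of verification, expressed in CPU cycles (assumed ~ CPU ops).
fixed_cycles = verify_seconds(0) * CPU_HZ

# For V to save work, running F natively must cost more than verifying it,
# so the largest supported circuit needs at least this many CPU ops per gate.
ops_per_gate_needed = fixed_cycles / MAX_GATES
print(round(fixed_cycles), ops_per_gate_needed)
```

The fixed 6 ms alone comes to a little over 16 million cycles, so a circuit at the 16-million-gate limit only breaks even if each gate stands in for more than one native CPU operation on average.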
Built probabilistic proof protocols amortize precomputations*
*Exception: CMT [CMT12] applied to highly regular arithmetic circuits
System | amortizes precomputation over | size of advice
Zebra | multiple V-P pairs | short
Allspice [VSBW13] | a batch of instances of a given F | short
Bootstrapped SNARKs [BCTV14a, CTV15] | all computations | long
BCTV [BCTV14b] | all computations of the same length | long
Pinocchio [PGHR13] | all future uses of a given F | long
Pepper [SMBW12], Ginger [SVPBBW12], Zaatar [SBVBPW13] | a batch of instances of a given F | long
Lessons (re)learned:
§ Do careful feasibility studies first!
§ Hardware is a powerful tool for acceleration …
§ ... but only if data flows are amenable
§ Theory of computation versus application of physics
§ General-purpose verifiable computation and succinct arguments are still far from practical
Summary and take-aways
§ Verifiable ASICs: a new approach to building trustworthy hardware under a strong threat model
§ First hardware design for a probabilistic proof protocol; first work to capture cost of prover, verifier together
§ Improves performance compared to trusted baseline
§ Improvement compared to baseline is modest
§ Applicability is limited
  § Amortization, arithmetic circuits, “big enough” computations, large gap between trusted and untrusted technology, etc.
§ Zebra is plausibly deployable (!), but work remains for this area
http://www.pepper-project.org