BCRYPT ECC-Day 2008 Requirements, Algorithms, Architectures The design space of ECC hardware

Upload: cecil-berry

Post on 13-Jan-2016


Contents

Applications of ECC Hardware

Existing Solutions

Design of ECC Hardware

Details of ECC Hardware


Motivation

ECC hardware: what for?
– Acceleration
– Power efficiency
– Implementation security: side-channel resistance

Competitors of ECC hardware:
– RSA hardware
– Software implementation: very fast on a PC, but very slow on an 8-bit µC

Application: server
– High throughput: > 100 signatures/s

Application: smartcard
– Low latency: 100 ms per signature
– Low die size

Application: RFID
– Low power consumption
– Low die size


ECC Hardware: Application

Different requirements for ECC applications:

Smartcard
– Acceptable latency
– Implementation security
– One EC curve sufficient

Server acceleration
– Throughput (not latency)
– Complete offloading

[Figure: customers and clients communicating with an e-commerce server and an e-government server using ECC]


ECC Hardware: Server Acceleration

GF(2^191) hardware accelerator:
– No GF(2^m) support in processors (x86, PPC, …)
– FPGA (programmable HW) as platform
– Optimized for one curve
– Complete EC operation in HW

[Figure: block diagram: PCI chipset (Infineon PITA-2) connected to the FPGA, which contains the interface, register file, arithmetic unit, and ECC control unit]

GF(2^191), f_clk = 66 MHz:

Multiplier digit size W | k·P [cycles] | f_CLK,max [MHz] | k·P/s at 66 MHz [ops]
8 bit                   | 40,210       | 74.6            | 1641
16 bit                  | 23,820       | 71.3            | 2770
32 bit                  | 15,623       | 70.4            | 4224
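
The last column follows directly from the first two: at the fixed 66 MHz clock, throughput is the clock frequency divided by the cycle count of one k·P. The Python sketch below reproduces 1641, 2770 and 4224 from the table and also reports the latency per operation; it assumes one point multiplication is in flight at a time, which is what the table's numbers imply.

```python
# Throughput and latency of a k·P accelerator, from the cycle counts in the table.
# Assumption: one point multiplication is processed at a time (no overlap between jobs).
F_CLK_HZ = 66_000_000  # operating clock of the FPGA accelerator

# multiplier digit size W (bits) -> clock cycles per k·P, as listed in the table
CYCLES_PER_KP = {8: 40_210, 16: 23_820, 32: 15_623}


def kp_per_second(f_clk_hz: int, cycles: int) -> int:
    """Number of complete point multiplications per second."""
    return f_clk_hz // cycles


def kp_latency_ms(f_clk_hz: int, cycles: int) -> float:
    """Time for one point multiplication, in milliseconds."""
    return 1e3 * cycles / f_clk_hz


for w, cycles in CYCLES_PER_KP.items():
    print(f"W = {w:2d}: {kp_per_second(F_CLK_HZ, cycles)} k·P/s, "
          f"{kp_latency_ms(F_CLK_HZ, cycles):.3f} ms per k·P")
```

Widening the digit from 8 to 32 bits cuts the cycle count by about 2.6x while f_CLK,max drops only slightly, so the wider multiplier wins on throughput here.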


ECC Hardware: Smartcards (Infineon SLE88CFX4000P)

– SLE 88 32-bit platform with a 1408-bit RSA coprocessor
– RSA coprocessor:
  – Local memory (704 bytes)
  – Scalable word width
  – Support for ECC: GF(p), GF(2^m)

[Figure: smartcard; photo © Infineon Technologies]


ECC Hardware: Smartcards (NXP Smart MX P5CC072)

– Smart MX 8-bit smartcard with FameXE coprocessor
– FameXE:
  – RSA, ECC: GF(p), GF(2^m)
  – 2.5 kB local RAM
  – Word width < 4096 bits

[Figure: smartcard; photo © NXP]


ECC Hardware: RFID Authentication

– Challenge-response authentication in RFID
– Minimization of power consumption
– Trading performance for power:
  – Lower clock speed
  – Reduced word size

[Figure: RFID tag block diagram: antenna, analog front-end supplying Vdd, digital front-end, RFID controller, NVRAM, and an ECC processor with register file, ECC controller, and ALU]


Hardware Design: CMOS Circuits

– CMOS: complementary metal-oxide-semiconductor
– Silicon circuit: up to 2·10^6 transistors per IC
– Digital hardware: standard-cell circuits (flip-flops, full adders, muxes, gates: XOR, AND, …)


Hardware Design: Top → Down

Top-down design methodology: from specification to working silicon, "first time right"

Design process:
– Refinement of models
– Early estimates of area, power, performance
– Design iterations when constraints are not met

[Figure: efficiency and effort across the abstraction levels: system level, algorithm, architecture, circuit level]


Hardware Design: Design Flow

Abstraction levels and tools:
1. System level: defining functionality and constraints
2. Algorithmic level: high-level model
3. Architectural level: paper + pencil
4. Register-transfer level: HDL description
5. Circuit level: schematic + layout


Challenges of ECC Hardware

EC algorithms (ladder, EC point operation, point representation):
– Define the number of multiplications
– Define the storage requirements
– Define the implementation security

Multiplication: determines performance
Storage: determines circuit size
Control: determines HDL complexity

Do's:
– Fix EC parameters
– Fixed field size
– Separate storage and computation

Don'ts:
– Trading increased storage for lower computation
– Optimization of negligible things
  – Inversion
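
The ladder is listed first because the choice of scalar-multiplication loop fixes how many field multiplications are needed and how uniform the operation sequence is. As a hedged illustration, here is a generic Montgomery ladder in Python; `add` and `dbl` are hypothetical callbacks standing in for EC point addition and doubling, so the control flow can be checked with plain integers (where k·P is simply k*P).

```python
from typing import Callable, TypeVar

T = TypeVar("T")  # group element type (EC points in a real implementation)


def montgomery_ladder(k: int, P: T, add: Callable[[T, T], T], dbl: Callable[[T], T]) -> T:
    """Return k*P (k >= 1) with the Montgomery ladder.

    Invariant: R1 = R0 + P. Each bit after the leading one costs exactly
    one add and one dbl, whatever the bit value.
    """
    if k < 1:
        raise ValueError("k must be >= 1")
    R0, R1 = P, dbl(P)
    for bit in bin(k)[3:]:  # skip '0b' and the leading 1 bit
        if bit == "1":
            R0, R1 = add(R0, R1), dbl(R1)
        else:
            R0, R1 = dbl(R0), add(R0, R1)
    return R0


def ladder_op_count(k: int) -> tuple[int, int]:
    """Count (additions, doublings) for scalar k, using integers as the group."""
    counts = {"add": 0, "dbl": 0}

    def add(a: int, b: int) -> int:
        counts["add"] += 1
        return a + b

    def dbl(a: int) -> int:
        counts["dbl"] += 1
        return 2 * a

    montgomery_ladder(k, 1, add, dbl)
    return counts["add"], counts["dbl"]
```

For example, `montgomery_ladder(11, 7, lambda a, b: a + b, lambda a: 2 * a)` returns 77, and `ladder_op_count(0b1011)` equals `ladder_op_count(0b1000)`: the operation count depends only on the bit length of k, not on its bit pattern.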


Approaches to ECC Hardware

EC processor:
– Computes the full point multiplication
– No external interaction necessary

Coprocessor:
– Acceleration of finite-field operations (limited local memory)
– External interaction needed for the point ladder and point operations

ISE (instruction-set extension):
– Enhancement of an existing instruction set
– Acceleration of core operations: multiply-accumulate instructions, support for polynomial arithmetic
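
For the ISE approach, "support of polynomial arithmetic" typically means a word-level multiply instruction for GF(2)[x] (a carry-less multiply). The sketch below models such an instruction in Python and builds a multi-word polynomial product from it, with XOR as the accumulate step. The 16-bit word size and the names `mulgf2` and `poly_mul_words` are hypothetical choices for this illustration, not taken from the slides.

```python
W = 16                 # hypothetical word size of the host processor
MASK = (1 << W) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product of two GF(2)[x] polynomials packed into integers."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def mulgf2(a: int, b: int) -> tuple[int, int]:
    """Model of a W x W -> 2W-bit polynomial-multiply instruction; returns (hi, lo)."""
    p = clmul(a & MASK, b & MASK)
    return p >> W, p & MASK


def poly_mul_words(a: list[int], b: list[int]) -> list[int]:
    """Multi-word GF(2)[x] product built only from the ISE instruction.

    Accumulation is XOR, so no carries propagate between words
    (the GF(2^m) counterpart of an integer multiply-accumulate).
    """
    c = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            hi, lo = mulgf2(ai, bj)
            c[i + j] ^= lo
            c[i + j + 1] ^= hi
    return c


def to_words(x: int, n: int) -> list[int]:
    """Split x into n little-endian W-bit words."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_words(ws: list[int]) -> int:
    """Reassemble little-endian W-bit words into one integer."""
    return sum(w << (W * i) for i, w in enumerate(ws))
```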



Algorithms for ECC

Bit-serial multiplication: a in full precision, b bitwise. Faster: digit-serial (w bits of b per step).

Modular reduction without division:
– NIST reduction: for trinomials/pentanomials; for Mersenne-like primes
– Montgomery multiplication: combines a·b and mod p; for arbitrary moduli
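
As an example of NIST reduction for a Mersenne-like prime: with p192 = 2^192 - 2^64 - 1 we have 2^192 ≡ 2^64 + 1 (mod p192), so a 384-bit product can be reduced with word shuffles and additions only. The slides do not say which prime is used; P-192 is our choice for this Python sketch.

```python
P192 = 2**192 - 2**64 - 1   # NIST prime p192 (example choice)
MASK64 = (1 << 64) - 1


def reduce_p192(c: int) -> int:
    """Reduce 0 <= c < p192**2 modulo p192 without division.

    Write c = (c5, ..., c0) in 64-bit words and use 2^192 = 2^64 + 1 (mod p192):
      c3*2^192 -> (0, c3, c3), c4*2^256 -> (c4, c4, 0), c5*2^320 -> (c5, c5, c5).
    """
    w = [(c >> (64 * i)) & MASK64 for i in range(6)]

    def words(x2: int, x1: int, x0: int) -> int:
        return (x2 << 128) | (x1 << 64) | x0

    r = (words(w[2], w[1], w[0]) + words(0, w[3], w[3])
         + words(w[4], w[4], 0) + words(w[5], w[5], w[5]))
    while r >= P192:          # the sum is < 4*2^192, so only a few subtractions
        r -= P192
    return r
```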

MulSer(a, b) = a·b:
  c = 0
  for i = n-1 down to 0 do
    c = 2·c + a·b_i
  return c
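
A direct Python transcription of MulSer, plus the digit-serial variant that consumes w bits of b per step. Both are sketches on unbounded integers without any reduction; the function names and the parameter `n` (bit length of b) are ours.

```python
def mul_bit_serial(a: int, b: int, n: int) -> int:
    """MulSer: a in full precision, b consumed MSB-first, one bit per step."""
    c = 0
    for i in range(n - 1, -1, -1):
        b_i = (b >> i) & 1
        c = 2 * c + a * b_i
    return c


def mul_digit_serial(a: int, b: int, n: int, w: int) -> int:
    """Digit-serial variant: w bits of b per step, so ceil(n/w) steps instead of n."""
    c = 0
    for i in range((n + w - 1) // w - 1, -1, -1):
        b_i = (b >> (w * i)) & ((1 << w) - 1)   # i-th w-bit digit of b
        c = (c << w) + a * b_i                  # shift by w instead of by 1
    return c
```

The trade-off is the one on the slides: a larger w means fewer iterations but a wider partial-product array (a times a w-bit digit) per step.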

Precomputation: R = 2^(n+2) mod p, R^2 mod p, p' = (-p)^(-1) mod 2
MonMul(a, b) = a·b·R^(-1) mod p:
  c = 0
  for i = 0 to n+1 do
    q = ((c_0 + a_0·b_i) mod 2)·p'
    c = (c + p·q + a·b_i) / 2
  return c
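
The Montgomery routine can be checked in software. The sketch below follows the variant above (n+2 iterations, R = 2^(n+2), one bit of b per step), assuming p is odd with p < 2^n and both inputs below 2p; with these bounds the result stays below 2p, so chained multiplications need no final subtraction. `mod_mul` is a hypothetical helper showing how the precomputed R^2 mod p is used to enter and leave the Montgomery domain.

```python
def monmul(a: int, b: int, p: int, n: int) -> int:
    """MonMul(a, b) = a*b*R^-1 mod p with R = 2**(n+2), one bit of b per iteration.

    Assumes p odd, p < 2**n, and a, b < 2p; the result is then < 2p
    (it may still exceed p, which is fine for chained multiplications).
    """
    assert p % 2 == 1 and p < (1 << n)
    p_prime = 1  # (-p)^-1 mod 2 equals 1 for every odd p
    c = 0
    for i in range(n + 2):  # i = 0 .. n+1
        b_i = (b >> i) & 1
        q = ((c + (a & 1) * b_i) % 2) * p_prime  # makes c + q*p + a*b_i even
        c = (c + q * p + a * b_i) >> 1            # exact division by 2
    return c


def mod_mul(x: int, y: int, p: int, n: int) -> int:
    """Hypothetical helper: x*y mod p (x, y < p) using only monmul."""
    R = 1 << (n + 2)
    r2 = R * R % p                 # precomputed R^2 mod p
    x_m = monmul(x, r2, p, n)      # x*R (mod p), < 2p
    y_m = monmul(y, r2, p, n)      # y*R (mod p), < 2p
    z_m = monmul(x_m, y_m, p, n)   # x*y*R (mod p), < 2p
    z = monmul(z_m, 1, p, n)       # x*y (mod p), <= p
    return z - p if z >= p else z
```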

[Figure: operand split into a high part a_h and a low part a_l]


Modular Multiplication in HW

GF(2^191) example: digit-serial multiplication
– c(x) = a(x)·m(x) mod f(x), with a(x) in full precision and m(x) processed in w-bit digits
– Digit size w = 8, 16, 32
– Alignment of the intermediate result
– Interleaved NIST reduction: small intermediate results
– Squaring as a separate operation: simple when the irreducible polynomial f(x) is fixed
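
A bit-accurate Python model of the digit-serial GF(2^191) multiplier. The slides do not name f(x); we assume the trinomial x^191 + x^9 + 1, so the NIST-style reduction becomes the identity x^191 = x^9 + 1. `poly_mod` is a slow reference used only to check the result.

```python
M = 191
F = (1 << 191) | (1 << 9) | 1   # ASSUMED f(x) = x^191 + x^9 + 1 (not given on the slide)
MASK = (1 << M) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less (GF(2)[x]) product."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(x: int, f: int = F) -> int:
    """Reference reduction by polynomial long division (slow, for checking only)."""
    df = f.bit_length() - 1
    while x.bit_length() - 1 >= df:
        x ^= f << (x.bit_length() - 1 - df)
    return x


def fold(c: int) -> int:
    """Trinomial reduction: replace x^191 by x^9 + 1 until deg(c) < 191."""
    while c >> M:
        hi = c >> M
        c = (c & MASK) ^ hi ^ (hi << 9)
    return c


def gf_mul_digit_serial(a: int, m: int, w: int) -> int:
    """c(x) = a(x)*m(x) mod f(x); a in full precision, m in w-bit digits, MSB first.

    Reduction is interleaved, so the accumulator never exceeds 191 + w bits.
    """
    c = 0
    for i in range((M + w - 1) // w - 1, -1, -1):
        m_i = (m >> (w * i)) & ((1 << w) - 1)
        c = fold((c << w) ^ clmul(a, m_i))  # align by w, add a(x)*m_i(x), reduce
    return c
```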

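[Figure: digit-serial multiplier datapath: the partial product a(x)·m_i(x) is added to the accumulator c(x) shifted left by w, with multiplexers (muxm, muxb) selecting the operands and a separate squaring path (^2); embedded in the FPGA accelerator (PCI chipset with Infineon PITA-2, interface, register file, arithmetic unit, ECC control unit)]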
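
Why squaring gets its own operation: in GF(2^m) the cross terms of a square vanish (2·a_i·a_j = 0), so squaring only spreads the bits apart and then reduces, with no partial products at all. A hedged Python sketch, again assuming f(x) = x^191 + x^9 + 1 (the slides do not state f(x)):

```python
M = 191
F = (1 << 191) | (1 << 9) | 1   # ASSUMED f(x) = x^191 + x^9 + 1
MASK = (1 << M) - 1


def clmul(a: int, b: int) -> int:
    """Carry-less product, used only as a reference."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def poly_mod(x: int, f: int = F) -> int:
    """Reference reduction by polynomial long division."""
    df = f.bit_length() - 1
    while x.bit_length() - 1 >= df:
        x ^= f << (x.bit_length() - 1 - df)
    return x


def fold(c: int) -> int:
    """Trinomial reduction with x^191 = x^9 + 1."""
    while c >> M:
        hi = c >> M
        c = (c & MASK) ^ hi ^ (hi << 9)
    return c


def gf_square(a: int) -> int:
    """(sum a_i x^i)^2 = sum a_i x^(2i): insert a zero between bits, then reduce."""
    s, i = 0, 0
    while a:
        if a & 1:
            s |= 1 << (2 * i)
        a >>= 1
        i += 1
    return fold(s)
```

In hardware the bit spreading is pure wiring, so the only logic left is the fixed XOR network of the reduction, which is why a fixed f(x) makes squaring cheap.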


Multiplier in HW

– Partial product generation a(x)·m_i: simply 191 AND gates; amplification (high fan-out) of m_i is crucial
– Aligning intermediate results: simple, a fixed shift operation
– Accumulation of partial products: array or tree adder
– Modular reduction: 200 bits → 191 bits
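
The four steps above can be spelled out for one clock cycle of the multiplier. In this Python sketch, w = 8 and f(x) = x^191 + x^9 + 1 are assumptions and `multiplier_cycle` is a hypothetical name. After alignment and accumulation the value has at most 191 + 8 = 199 bits, matching the "200 bits → 191 bits" reduction, and a single fold with x^191 = x^9 + 1 is enough.

```python
M = 191
MASK = (1 << M) - 1
F = (1 << 191) | (1 << 9) | 1   # ASSUMED trinomial x^191 + x^9 + 1


def poly_mod(x: int, f: int = F) -> int:
    """Reference reduction by polynomial long division (for checking only)."""
    df = f.bit_length() - 1
    while x.bit_length() - 1 >= df:
        x ^= f << (x.bit_length() - 1 - df)
    return x


def multiplier_cycle(c: int, a: int, m_i: int, w: int = 8) -> int:
    """One cycle of the digit-serial multiplier (c, a < 2^191, m_i < 2^w).

    1. Partial products: per digit bit, 191 AND gates form a(x) * m_ij.
    2. Alignment: the accumulator is shifted by the fixed amount w.
    3. Accumulation: XOR (GF(2) addition) of the shifted rows.
    4. Reduction: one fold of the overflow bits with x^191 = x^9 + 1.
    """
    acc = c << w
    for j in range(w):
        bit = (m_i >> j) & 1
        row = a if bit else 0        # 191 AND gates driven by one digit bit
        acc ^= row << j
    hi = acc >> M                    # overflow bits a_191, a_192, ...
    assert hi.bit_length() <= w      # hence one fold suffices for w <= 182
    return (acc & MASK) ^ hi ^ (hi << 9)
```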

[Figure: bit-level view of partial-product generation (bits a_0 … a_190, digit bits m_0, m_1, partial products p_0 … p_191) and of the reduction that folds the overflow bits a_191, a_192, a_193 back onto the low-order positions]


GF(p) Multiplier

– Radix-4 multiplier: A in full precision, B processed 2 bits per cycle
– Montgomery multiplication with Orup's optimization
– Redundant number representation: carry-save (CS)
  – More storage
  – Shorter critical path
  – Red2bin (redundant-to-binary conversion): CSA reuse
– Booth recoding (Benc)
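
Two of the building blocks above in a small Python model: a 3:2 carry-save adder, which adds three numbers into a (sum, carry) pair without propagating carries, and radix-4 Booth recoding, which turns each pair of bits of B into a digit in {-2, -1, 0, 1, 2} so that one partial product per cycle covers 2 bits. This is an illustrative sketch only; it does not model Orup's optimization or the Montgomery quotient encoder (Qenc).

```python
def csa(x: int, y: int, z: int) -> tuple[int, int]:
    """3:2 carry-save adder: x + y + z == s + c, with no carry propagation."""
    s = x ^ y ^ z                                  # full-adder sum bits
    c = ((x & y) | (x & z) | (y & z)) << 1         # full-adder carry bits, weight 2
    return s, c


def csa_sum(terms: list[int]) -> tuple[int, int]:
    """Accumulate many terms in redundant (s, c) form; s + c is the total."""
    s, c = 0, 0
    for t in terms:
        s, c = csa(s, c, t)
    return s, c


def booth_radix4(b: int, n: int) -> list[int]:
    """Radix-4 Booth digits of an n-bit unsigned b, LSB first.

    Digit i = -2*b_{2i+1} + b_{2i} + b_{2i-1} with b_{-1} = 0; enough digits are
    produced that the top scanned bit is 0, so sum(d_i * 4**i) == b.
    In hardware, negative multiples -A, -2A come from inverting A plus a carry-in.
    """
    digits = []
    prev = 0  # b_{-1}
    for i in range(0, n + 2, 2):
        b0 = (b >> i) & 1
        b1 = (b >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits
```

Only the final `s + c` needs a carry-propagating addition, which is the Red2bin step that the slide notes can reuse the CSA.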

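[Figure: radix-4 Montgomery multiplier datapath: Booth encoder (Benc) for the bits b_i of B, quotient encoder (Qenc), partial-product generators (PPG, PPG') for A and M, carry-save adder rows (full adders, half adder) holding sum S and carry C, right shifts by 2 per cycle, carry-in (cin) for negative multiples, and control]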


Dual-field Support

Application: e.g. ECDSA
– ECC over GF(2^m)
– Protocol: GF(p), i.e. Mul, Add, Inv mod n (n: base point order)

Architecture: ~GF(p) multiplier
– CSA for GF(p)
– XOR for GF(2^m): carries blocked

GF(p) versus GF(2^m): GF(2^m) faster … GF(p) needs register C
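
The "carries blocked" idea fits in a few lines: the sum output of a carry-save adder is x XOR y XOR z in both modes, and forcing the carry output to zero turns integer addition into GF(2^m) addition. The Python sketch below (function names are ours) reuses one accumulation loop for both an integer product and a carry-less product; modular reduction is omitted.

```python
def dual_field_csa(x: int, y: int, z: int, gf2m: bool) -> tuple[int, int]:
    """Carry-save adder with a field-select input."""
    s = x ^ y ^ z                        # identical in both modes
    if gf2m:
        c = 0                            # carries blocked: GF(2^m) addition is XOR
    else:
        c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def dual_field_mul(a: int, b: int, gf2m: bool) -> int:
    """Shift-and-add product on the shared datapath (no modular reduction)."""
    s, c = 0, 0
    i = 0
    while b >> i:
        pp = (a << i) if (b >> i) & 1 else 0   # partial product a * b_i, aligned
        s, c = dual_field_csa(s, c, pp, gf2m)
        i += 1
    return s + c                          # c is always 0 in GF(2^m) mode
```

With `gf2m=False`, 0b1011 times 0b111 gives 77; with `gf2m=True`, it gives the polynomial product x^5 + x^4 + 1 (0b110001). In GF(p) mode the carry vector must be stored, which is the extra register C noted on the slide.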

[Figure: dual-field datapath: two carry-save adder rows fed by a / a(x), p / p(x) and the registers Reg B, Reg C, Reg S, with operand multiplexers (-a, a, 0; p, p/2, 0; shifted copies of c and s) and control]

ECDSA signature generation (d: private key):
e = SHA-1(message)
k = random(1, n-1)
R = k·(P_x, P_y) = (R_x, R_y)
r = R_x mod n
s = k^(-1)·(e + d·r) mod n
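
To make the split between curve arithmetic and the protocol arithmetic mod n concrete, here is a toy ECDSA in Python. The curve y^2 = x^3 + 2x + 2 over F_17 with G = (5, 1) of order n = 19 is a common textbook example chosen so every value is small, not a curve from the slides, and far too small to be secure. Reducing the hash mod n is a simplification (real ECDSA keeps the leftmost bits of the hash). Everything after k·G is the Mul, Add, Inv mod n work that a dual-field unit would take over.

```python
import hashlib

# Toy textbook curve y^2 = x^3 + 2x + 2 over F_17; G = (5, 1) has prime order 19.
P_MOD, A = 17, 2
G = (5, 1)
N = 19
INF = None  # point at infinity


def ec_add(P, Q):
    """Affine point addition/doubling (field arithmetic mod P_MOD)."""
    if P is INF:
        return Q
    if Q is INF:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % P_MOD == 0:
        return INF
    if P == Q:
        lam = (3 * x1 * x1 + A) * pow(2 * y1 % P_MOD, -1, P_MOD) % P_MOD
    else:
        lam = (y2 - y1) * pow((x2 - x1) % P_MOD, -1, P_MOD) % P_MOD
    x3 = (lam * lam - x1 - x2) % P_MOD
    y3 = (lam * (x1 - x3) - y1) % P_MOD
    return (x3, y3)


def ec_mul(k, P):
    """Double-and-add k*P (not side-channel resistant; fine for a toy)."""
    R = INF
    while k:
        if k & 1:
            R = ec_add(R, P)
        P = ec_add(P, P)
        k >>= 1
    return R


def hash_to_int(msg: bytes) -> int:
    """e = SHA-1(message), reduced mod n (simplification for this toy n)."""
    return int.from_bytes(hashlib.sha1(msg).digest(), "big") % N


def sign(msg: bytes, d: int, k: int):
    """ECDSA signing with nonce 1 <= k < n; returns None if a fresh k is needed."""
    e = hash_to_int(msg)
    Rx, _ = ec_mul(k, G)                    # curve arithmetic mod p
    r = Rx % N                              # from here on: arithmetic mod n
    s = pow(k, -1, N) * (e + d * r) % N
    if r == 0 or s == 0:
        return None
    return r, s


def verify(msg: bytes, Q, sig) -> bool:
    r, s = sig
    if not (1 <= r < N and 1 <= s < N):
        return False
    w = pow(s, -1, N)                       # Inv mod n
    u1 = hash_to_int(msg) * w % N           # Mul mod n
    u2 = r * w % N
    X = ec_add(ec_mul(u1, G), ec_mul(u2, Q))
    return X is not INF and X[0] % N == r
```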


ECC for RFID

Problem: very constrained power budget
$P = E/t = I \cdot U = f_{clk} \cdot C_L \cdot V_{dd}^2$

Problem analysis: where is power consumed?
– Mostly for storage: clocking of registers
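
Since the RFID budget is stated as a current (about 15 µA in the results), it helps to read the formula as I = P/V_dd = f_clk·C_L·V_dd. The Python sketch evaluates it for a hypothetical operating point that is NOT from the slides (100 kHz clock, 100 pF switched per cycle, 1.0 V supply, which gives about 10 µA) and shows the two levers: current scales linearly with f_clk, power quadratically with V_dd.

```python
def dynamic_power_w(f_clk_hz: float, c_load_f: float, vdd_v: float) -> float:
    """Dynamic CMOS power P = f_clk * C_L * Vdd^2 (C_L: capacitance switched per cycle)."""
    return f_clk_hz * c_load_f * vdd_v ** 2


def supply_current_a(f_clk_hz: float, c_load_f: float, vdd_v: float) -> float:
    """I = P / U with U = Vdd, i.e. I = f_clk * C_L * Vdd."""
    return dynamic_power_w(f_clk_hz, c_load_f, vdd_v) / vdd_v


# Hypothetical operating point (illustration only, not a result from the slides).
i_a = supply_current_a(100e3, 100e-12, 1.0)
print(f"{i_a * 1e6:.1f} uA")
```

Fewer clocked registers lower the effective C_L, which is why the slide targets storage first.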

New idea:
– Fewer registers, more combinational logic
– Smaller datapaths: no computation at full word size
– Adoption of ISE techniques: MAC operation
– Simple HDL implementation
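
A sketch of what "no computation at full word size" plus a MAC operation can look like: a 192-bit product computed from 16-bit words, column by column (product scanning), where every inner step is one 16 × 16-bit multiply-accumulate into a small accumulator. The 16-bit width and the product-scanning order are our assumptions for this Python model.

```python
W = 16                 # assumed datapath width
MASK = (1 << W) - 1


def to_words(x: int, n: int) -> list[int]:
    """Split x into n little-endian W-bit words (as stored in RAM)."""
    return [(x >> (W * i)) & MASK for i in range(n)]


def from_words(ws: list[int]) -> int:
    """Reassemble little-endian W-bit words."""
    return sum(w << (W * i) for i, w in enumerate(ws))


def mul_product_scanning(a: list[int], b: list[int]) -> list[int]:
    """Multi-precision product of two n-word operands, one MAC per inner step.

    The accumulator needs only 2W bits plus a few guard bits (3 words suffice
    for n < 2^16); one result word is written back per column.
    """
    n = len(a)
    c = [0] * (2 * n)
    acc = 0
    for k in range(2 * n - 1):                         # column k of the product
        for i in range(max(0, k - n + 1), min(k, n - 1) + 1):
            acc += a[i] * b[k - i]                     # MAC: 16x16-bit multiply, accumulate
        c[k] = acc & MASK                              # emit the low word
        acc >>= W                                      # keep the carry for the next column
    c[2 * n - 1] = acc & MASK
    return c
```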

[Figure: block diagram with RAM, datapath, and control; 16-bit data width]


Control

Task of control logic: generate control signals for 60,000 to 6 million clock cycles
– Separation of control and datapath
– Registered control signals: for performance and power efficiency, avoiding the critical path
– Hierarchical control for complex control

Options:
– Hardwired: state machine
– Micro-program: counter + ROM
– Micro-controller: software
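
To make the "micro-program: counter + ROM" option concrete, here is a hedged Python model: a program counter addresses a ROM of micro-instructions, and two branch conditions (current scalar bit, bits remaining) let six ROM words sequence an entire scalar loop. The instruction format and opcodes are hypothetical, and modular exponentiation stands in for the point multiplication (squaring for doubling, multiplication for addition) to keep the datapath trivial.

```python
# Hypothetical micro-instructions for a tiny datapath with registers R[0..1]:
#   ("ldi", d, imm)       R[d] = imm
#   ("mul", d, s1, s2)    R[d] = R[s1] * R[s2] mod p   (datapath operation)
#   ("bit", target)       if the current scalar bit is 0: jump to target
#   ("nxt", target)       step to the next lower scalar bit; if one remains, jump
#   ("halt",)
ROM = [
    ("ldi", 0, 1),        # 0: acc = 1
    ("mul", 0, 0, 0),     # 1: acc = acc^2        ("double")
    ("bit", 4),           # 2: skip the multiply if the bit is 0
    ("mul", 0, 0, 1),     # 3: acc = acc * base   ("add")
    ("nxt", 1),           # 4: next bit, loop back
    ("halt",),            # 5
]


def run(rom: list[tuple], base: int, k: int, p: int) -> tuple[int, int]:
    """Execute the micro-program; returns (result, number of ROM words issued)."""
    if k < 1:
        raise ValueError("k must be >= 1")
    R = [0, base % p]
    bit = k.bit_length() - 1   # scalar bit counter (part of the control state)
    pc = 0                     # the counter that addresses the ROM
    issued = 0
    while True:
        op = rom[pc]
        issued += 1
        pc += 1                # default: next ROM word
        if op[0] == "ldi":
            R[op[1]] = op[2]
        elif op[0] == "mul":
            R[op[1]] = R[op[2]] * R[op[3]] % p
        elif op[0] == "bit":
            if ((k >> bit) & 1) == 0:
                pc = op[1]
        elif op[0] == "nxt":
            bit -= 1
            if bit >= 0:
                pc = op[1]
        elif op[0] == "halt":
            return R[0], issued
        else:
            raise ValueError(f"unknown opcode {op[0]}")
```

In a real EC processor each "mul" word would itself expand into a multi-cycle field operation, and the double/add steps would be longer micro-programs in a second ROM, giving the hierarchical control the slide describes.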

[Figure: elliptic-curve processor: scalar control and point control (init, final, double, add) with ROM1 and ROM2, driving the multiplier control, ALU, modular reduction, RAM, register, and mux]


Results

Server acceleration (GF(2^191)):
– Size: 1500 slices on a Xilinx FPGA
– > 1000 EC ops/s at 66 MHz clock

Smartcard coprocessor (dual-field capability):
– 192-bit ECC: 23k GE, 400k to 700k cycles
– 256-bit ECC: 31k GE, 600k to 900k cycles

ECC for RFID:
– 163-bit ECC: 12k GE, 400k cycles
– 192-bit ECC: 18k GE, 850k cycles
– Storage: 75% of area
– ISE datapath: 75% of power
– Realistic on < 130 nm CMOS
– Power constraint: ~15 µA

(GE: gate equivalents)


Conclusions

– Different applications require different ECC hardware
– Fixed parameters (EC parameters, field) allow a more efficient implementation:
  – Squaring in GF(2^m)
  – NIST reduction
– ECC for RFID seems possible
