hardware accelerators for ecc and hecc · a. tisserand, cnrs{irisa{cairn. hardware accelerators for...

Hardware Accelerators for ECC and HECC

Arnaud Tisserand

CNRS, IRISA laboratory, CAIRN research team

ECCBordeaux

Sep. 29–30, 2015

Summary

• Introduction

• Accelerator architecture and units

• Accelerator programming

• Implementation results: comparison ECC vs HECC on FPGA

• Conclusion & current/future works

A. Tisserand, CNRS–IRISA–CAIRN. Hardware Accelerators for ECC and HECC 2/35

Current Projects on (H)ECC AcceleratorsPAVOIS project 2012–2016

Arithmetic Protections Against PhysicalAttacks for Elliptic Curve based Cryptography

• IRISA (Lannion)• LIRMM (Perpignan, Montpellier & Toulon)

http://pavois.irisa.fr/

ANR 12 BS02 002

HAH project 2014–2017

Hardware and Arithmetic for HyperellipticCurves Cryptography

• IRISA (Lannion)• IRMAR (Rennes)

http://h-a-h.inria.fr/

Labex

and


http://pavois.irisa.fr/http://h-a-h.inria.fr/

Introduction

encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

Scalar multiplication operationfor i from 0 to t − 1 do

if ki = 1 then Q = ADD(P,Q)P = DBL(P)

Point addition/doubling operationssequence of finite field operationsDBL: v1 = z

21 , v2 = x1 − v1, . . .

ADD: w1 = z21 ,w2 = z1 × w1, . . .

Fp or F2m operationsoperation modulo large prime (Fp)or irreducible polynomial (F2m )


Introduction

encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

E : y2 = x3 + 4x + 20 over GF(1009)

points: P, Q= (x , y) or (x , y , z) or . . .




21 , v2 = x1 − v1, . . .

ADD: w1 = z21 ,w2 = z1 × w1, . . .



Introduction

encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

E : y2 = x3 + 4x + 20 over GF(1009)


coordinates: x , y , z ∈ GF(·)Fp, F2m , t : 80–600 bits

k = (kt−1kt−2 . . . k1k0)2 ∈ N




21 , v2 = x1 − v1, . . .

ADD: w1 = z21 ,w2 = z1 × w1, . . .



Introduction

encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

E : y2 = x3 + 4x + 20 over GF(1009)


coordinates: x , y , z ∈ GF(·)Fp, F2m , t : 80–600 bits

k = (kt−1kt−2 . . . k1k0)2 ∈ NScalar multiplication operationfor i from 0 to t − 1 do



21 , v2 = x1 − v1, . . .

ADD: w1 = z21 ,w2 = z1 × w1, . . .



Side Channel Attacks

encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

curv

ele

vel

x±y x×y . . .

fiel

dle

vel



• simple power analysis (& variants)

• differential power analysis (& variants)

• horizontal/vertical/. . . attacks



encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

curv

ele

vel

x±y x×y . . .

fiel

dle

vel

DBL DBL DBL DBL DBL DBL








encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

curv

ele

vel

x±y x×y . . .

fiel

dle

vel

DBL DBL DBL DBL DBL DBLADD ADD








encryption

signature

etc

pro

toco

lle

vel

[k]P

ADD(P,Q) DBL(P)

curv

ele

vel

x±y x×y . . .

fiel

dle

vel

DBL DBL DBL DBL DBL DBLADD ADD

0 0 0 1 1 0







Objectives of Our Research Group

• Study and implementation of efficient hardware supports:I Cryptography over (hyper)-elliptic curves (H)ECCI Operations over finite fields Fp & F2m and curve pointsI Hardware targets: FPGAs and ASICsI Flexibility programmable in software

• Study and implementation of protections against physical attacks:I Passive attacks: measure of power consumption, electromagnetic

radiations, timingsI Active attacks: fault injection (in progress)

• Levels: algorithm, representation, operator, architecture, circuit

• Trade-offs between: performance, cost (area/energy), security

• Study, development and distribution of an open source (H)ECCaccelerator and its programming tools


Accelerator Specifications

encryption

signature

etc

pro

toco

lle

vel

HW

SW

HW

[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

• Performances =⇒ hardware (HW)I dedicated functional unitsI internal parallelism

• Limited cost (embedded systems)I reduced silicon areaI low energy (& power consumption)I large area used at each clock cycle

• Flexibility =⇒ software (SW)I curves, algorithms, representations

(points/elements), k recoding, . . .I at design time / at run time

• Security against SCAs =⇒ HWI secure units (F2m , Fp)I secure key storage/managementI secure control


Accelerator Specifications

encryption

signature

etc

pro

toco

lle

vel

HW

SW

HW[k]P

ADD(P,Q) DBL(P)

P + Pcurv

ele

vel

x±y x×y . . .

fiel

dle

vel

• Performances =⇒ hardware (HW)I dedicated functional unitsI internal parallelism

• Limited cost (embedded systems)I reduced silicon areaI low energy (& power consumption)I large area used at each clock cycle

• Flexibility =⇒ software (SW)I curves, algorithms, representations

(points/elements), k recoding, . . .I at design time / at run time

• Security against SCAs =⇒ HWI secure units (F2m , Fp)I secure key storage/managementI secure control


Accelerator Architecture

exte

rnal

inte

rfac

e

accelerator

Data: w -bit (32, . . . , 128) except for k digits, control: a few bits per unit



exte

rnal

inte

rfac

e

accelerator

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

registerfile

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

key

mn

g.

registerfile

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

CTRL

key

mn

g.

registerfile

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

CTRL

codemem.

key

mn

g.

registerfile

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

interconnect

CTRL

codemem.

key

mn

g.

registerfile

FU1 FU2 FU3




exte

rnal

inte

rfac

e

accelerator

interconnect

CTRL

codemem.

key

mn

g.

registerfile

FU1 FU2 FU3

Data: w -bit (32, . . . , 128) except for k digits, control: a few bits per unitA. Tisserand, CNRS–IRISA–CAIRN. Hardware Accelerators for ECC and HECC 8/35


exte

rnal

inte

rfac

eaccelerator

interconnect

CTRL

codemem.

key

mn

g.

registerfile

FU1 FU2 FU3

Data: w -bit (32, . . . , 128) except for k digits, control: a few bits per unitA. Tisserand, CNRS–IRISA–CAIRN. Hardware Accelerators for ECC and HECC 8/35

Functional Units for Field Level Operations

data (w bits)

control (few bits)

FUα

x [i ] y [i ] r [i ]

Notation: x [i ] is the i-th w -bit word of x ∈ FqUnits:

• Fp: addition/subtraction, multiplication (2-step, Montgomery,variants), inversion

• F2m (polynomial basis, normal basis & variants): addition/subtraction,multiplication (Montgomery, Mastrovito, 2-step), square, inversion

Internal parameters: nb of sub-blocks, radix, pipelining scheme,countermeasure, mapping of local registers, output/input bypass, . . .


Register File (≈ Dual Port Memory)

x [i ] y [i ] r [i ]

field elements (size ≥ m bits)

word size (w bits)

Control signals: addresses (port A, port B), read/write, write enable

Specific addressing model for Fq elements (through an intermediate addresstable with hardware loop)

• linear addresses, SW: LOAD @x =⇒ HW: loop x [0], x [1], . . . x [`− 1]• randomized addresses


Key Management Unit

key

mn

g.

kkey

recoding

ki

CTRL

• On-the-fly recoding of k: binary, λ-NAF (λ ∈ {2, 3, 4, 5}), variants(fixed/sliding), double-base [1] and multiple-base [2] number systems(w/wo randomization), addition chains [12], other ?

• Specific private path in the interconnect (no key leaks in RF or FUs)


External Interface(s)

Under development:

• Basic (neither clock rate nor width adaptation)

• ARM Cortex cores in Zynq 7 FPGAs (through AXI bus)

• MicroBlaze softcore processor for Xilinx FPGAsI AXI bus (V6+)I PLB bus (V2 – V5)

• Specific for a “small” ASIC pad ring

Future development:

• NIOS softcore processor for Altera FPGAs

• LEON softcore processor (depending on internal demand)


Protected F2m Multipliers

Unprotected

0

50

100

150

200

250

0 100 200 300 400 500

#tr

an

sitio

ns

cycles

Mastrovito 233

200 225 250cycles

Protected

Overhead:Area/time < 10 %

References:PhD D. Pamula [8]Articles: [11], [10], [9]


Protected (Old) Accelerator for F2m

0 100 200 300

0 50 100 150 200 250 300 350

#tr

an

sit.

cycles

DBL operationMastrovitoUnprotectedActivity trace

0.000.020.040.060.08

cu

rre

nt

[mA

]

DBL operationMastrovitoUnprotectedCurrent measures

0 100 200 300

#tr

an

sit.

DBL operationMastrovitoProtectedActivity trace

0.000.040.080.120.16

cu

rre

nt

[mA

]

DBL operationMastrovitoProtectedCurrent measures

0 100 200 300

#tr

an

sit.

ADD operationMastrovitoProtectedActivity trace

Warning: old dedicated accelerator (similar behavior is expected for our new one)A. Tisserand, CNRS–IRISA–CAIRN. Hardware Accelerators for ECC and HECC 14/35

Circuit-Level Protections for Arithmetic Operators

References: [4] and [3]


Units Impact on Side Channel Information (1/2)

Activity traces measured with CABA1 simulations for three configurationsof the multiplier (1,2,4 sub-blocks of 32 bits) and a very small accelerator

1 2 4

ADD

0

200

400

600

800

1000

1200

0 5000 10000 15000 20000 25000

activity [#tr

ansitio

ns]

time [clock cycles]

0

200

400

600

800

1000

1200

0 2000 4000 6000 8000 10000 12000 14000 16000

activity [#tr

ansitio

ns]

time [clock cycles]

0

200

400

600

800

1000

1200

0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000

activity [#tr

ansitio

ns]

time [clock cycles]

DBL

0

200

400

600

800

1000

1200

0 5000 10000 15000 20000 25000

activity [#tr

ansitio

ns]

time [clock cycles]

0

200

400

600

800

1000

1200

0 2000 4000 6000 8000 10000 12000 14000

activity [#tr

ansitio

ns]

time [clock cycles]

0

200

400

600

800

1000

1200

0 1000 2000 3000 4000 5000 6000 7000 8000 9000

activity [#tr

ansitio

ns]

time [clock cycles]

1 Cycle Accurate Bit Accurate


Units Impact on Side Channel Information (2/2)

0

200

400

600

800

1000

1200

16700 16720 16740 16760 16780 16800 16820 16840 16860

activity [

#tr

an

sitio

ns]

time [clock cycles]

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

mu

ltip

lica

tio

n

WA

IT m

ultip

lica

tio

nW

RIT

E

0

200

400

600

800

1000

1200

1400

6500 6520 6540 6560 6580 6600 6620 6640 6660

activity [

#tr

an

sitio

ns]

time [clock cycles]

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

ad

ditio

n

WA

IT a

dd

itio

nW

RIT

E

RE

AD

LA

UN

CH

mu

ltip

lica

tio

n

WA

IT m

ultip

lica

tio

nW

RIT

E


Developed Programming Tools

timenow

V0

acceleratormodules

. . .

configurations

CAD tools

selection

user

crypto.lib.

assembler

binaryimplementation



timenow

V0 V1

acceleratormodules

. . .

configurations

CAD tools

selection

user

crypto.lib.

assembler


compilerpython

API/TLS-SSL



timenow

V0 V1 V2

acceleratormodules

acceleratormodules

. . .

configurations

CAD tools

selection

user

crypto.lib.

crypto.lib.

assembler


compiler Sage

API/TLS-SSL


Instruction Set

READ FUid @Rid @Rid B/U

WRITE FUid @Rid

LAUNCH FUid MODE

WAIT FUid

SETADDRO @Rid OFFSET

SETADDRN @Rid #WORD

WRITEK #WORD

CALL @DEST

RET

BZ @DEST

BNZ @DEST

JMP @DEST

CMPD DIGIT

SET FLAGid

TST FLAGid


Address Model in the Register FileRF requirements :

• 5–16 registers of m-bit Fq elements• worst case: w small (16 bits) and m large (600 bits) ⇒ 550+ words

and 10-bit physical addresses

x ∈ Fq is addressed by one entry (notation @Rid) of the intermediateaddress table (IAT) with 2 values:

• offset of the first word (e.g. x [0])• number of w -bit words

CTRLregister

file

addresstable

@Rid physical@


Code Memory

Behavior:

• Specific private path in the interconnect for code download (no leaksin RF or FUs)

• Code input can be disabled (ROM mode with code in the FPGAbitstream)

• Instruction CALL: push PC then jump to @DEST• Instruction RET: jump to (pop) + 1

Memory mapping to be defined


Internal Parallelism Model

non-blocking instruction decoding (i.e. always do PC ← PC + 1 orPC ← cst) except for WAIT instruction

Example of operations sequence, its dependency graph and assembly codefor 2 multipliers:

r = ((a×b)+c)+(d×e))

a

b

c

d

e

r

M

0

1

M

3

4

A

2

5

A

5

6

5

1 read fu mul 0, 0, 1 read a & b2 launch fu mul 0 start ab3 read fu mul 1, 3, 4 lit d & e4 launch fu mul 1 start de5 wait fu mul 0 wait for ab6 write fu mul 0, 5 write ab7 set OPMODE, 0 addition mode (+)8 read fu add sub 0, 5, 2 read ab & c9 launch fu add sub 0 start (ab) + c

10 wait fu mul 1 wait for de11 write fu mul 1, 6 write de12 wait fu add sub 0 wait for (ab) + c13 write fu add sub 0, 5 write (ab) + c14 read fu add sub 0, 5, 6 read (ab) + c & de15 launch fu add sub 0 start ((ab) + c) + (de)16 wait fu add sub 0 wait for ((ab) + c) + (de)17 write fu add sub 0, 5 write ((ab) + c) + (de)


ECC Accelerator with Additions ChainsFirst full hardware implementation of recoding using additions chains

FPGA implementation

Spartan-6 XC6SLX9

192-bit FpVery small config.

Euclide

computationof C

MEM.(BRAM)

1/φ, k, k/φ,a, b, C, C′

a(j) − b(j) b(j) − a(j) a(j)

2− b(j) b

(j)

2− a(j)

k

unused unusedcout cout

CTRL CSIPO C C

LSB(a(j)) LSB(b(j) )

computationof kφ

±

+

+

ε

+ CTRL@

offset C′offset C

offset boffset a

offset k/φoffset k

write ports

read ports

address control signalsscalar

word digit w-bit data word

recoding

BR

AM optim. area freq. dura. SCA

method target slices (FF/LUT) MHz ms prot.

EAC 3area 534 (1813/1508) 132 35.8

Yspeed 556 (1872/1523) 137 34.5

DA 2area 429 (1243/1134) 191 30

Nspeed 399 (1302/1222) 177 32.5

ML 2area 429 (1243/1134) 191 42.5

Yspeed 399 (1302/1222) 177 45.8

UF 2area 429 (1243/1134) 191 50.4

Yspeed 399 (1302/1222) 177 54.4

NAF-3 2area 422 (1280/1157) 181 25.2

Nspeed 423 (1321/1242) 175 26.1

NAF-4 2area 420 (1277/1161) 158 27.3

Nspeed 425 (1233/1246) 177 24.4

EAC: Euclidean addition chains, DA: dbl-and-add, ML: Montgomery ladder,UF: unified formula

See details in [12]A. Tisserand, CNRS–IRISA–CAIRN. Hardware Accelerators for ECC and HECC 23/35

Comparison ECC 256 vs HECC 128 (1/7)field Fp ADD DBL

ECC ` bits mul OUTRX

sub OUTRY

mul OUTRZ

PZ

mul

PZ

mul

PZ

mulPZ

PX mulPX

PYmul

PY

QYQY

QXQX

QZ

QZ

QZ

QZ

mulv18

add addv12

subv13

mulv10

v10

v10

mulv11

v11

sub

v11

mul

v16

v17

v14

v14

subv0

v1

v1

v2

v2

sqrv2

subv3

v4

v4

v5

sqr

v5

mul

v6

v7

v7v8

v9

v9

Cost: 12M + 2S

mul OUTRX

sub OUTRY

add OUTRZ

PZ

mul

PZ

mul

PZ

PX

sqr

PX

mul

PX

PYPY

mul

PY

aa

addadd

v18

v18

addv19

v19

sub

addv12

v12

sub

v12

v13

add

addv10

v10

v10

v11

mul

v16

sqrv17

v17

v15

mul addv23

v23sqr

v22

v20

add

v25

v25

v24

v24

add

v0

add

v1

v1

addv1

v2

v3

v4

sqr

v4

v5

v6

v6

v6

v6

v7

v7

add

v8

v8

v9

v9

Cost: 6M + 5S

HECC `2 bits

mul OUTRU0

mul OUTRU1

add OUTRV0

add OUTRV1

mul OUTRZ

PZ

mul

PZ

mul

PZ

mul

PZ

add

PZ

mul

PZ

mul

PZ

mul

PZ

mul

PZ

QV0

QV0

QV1

QV1

QU1

add

QU1

QU1

QU0QU0

QU0

PU0

mul

PU0

mul

PU0

mul

PU0

PU1

PU1

mul

PU1

mul

PU1

PV1 mulPV1

QZ

mul

QZ

QZ

QZ

QZ

QZ

PV0PV0

sub

v18

mul

v18

add

v19

mul

v19

add

v12

mulv13

mulv13

mul

v10

sqr

v11

add mulv16

v17

sqr

v14

mul

v14

mul

v14

v15

add

v85

mul

v84

mulv87

add

v86

add

sub

v81

mulv80

subv83

v82

sqr

v69

v69

mul

v69

mul

sub

v68

sub

v67

mul

v67

sub

v66

v66

sub

v65

add

v64

v64

mul

v63

v63

subv61

v60

v60

add

v78

v79

v74

v76

mul v77

mul

v70

v70

v71

v72

mul

v73

v73

v23

mul

mul

v41

sub

v40

mul

v43

v43

v43

v43

mul

v43

mul

v43

addv42

add

mul

v45

add

v44

sub

add

v47

addv46

v49

v48

sub

v22

mulv22

v21

v21

v21

v21

v20

mul

v27

v26

v26

addv25

sub

v25

sub

v24

v29

v28

mul

v56

add

v56

v56

sub

v57

add

v54

v54

v54

v52

v53

v50

v51

v58

v58

v59

v30

v30

v30

v30

mul

v30

mulv30

v31

v31

v31

v31 v32

v32

v32v32

v33

mulv34

v35

v35

sqrv35

addv35

v35

v36

add

v37

v37v38

v39

v0

v0

v0

v0

v0

v1

sub

v1

v2

v3

v3

v3

subv4

v5

v5

v5

v6

v6

v6

v6

v6

v6

v6

v6

add

v7

v8

v9

v9

v9

Cost: 47M + 4S

mul OUTRU0

mul OUTRU1

sub OUTRV0

sub OUTRV1

mul OUTRZ

PZ

mul

PZ

mul

PZ

sub

PZ

mul

PZ

mulPZ

sub

PZ

mul

PZ

mul

PZ

mul

PZ

mul

PZ

sqrPZ

mul

PZ

mul

PZ

PU0

add

PU0

PU0

mul

PU0

add

PU0

mul

PU0

mul

PU0

PU1

sqr

PU1

add

PU1

mul

PU1

PU1

mul

PU1

mul

PU1

PU1

Z

sub

Z

PV1

mul

PV1

sqr

PV1

addPV1

PV1

PV0

mul

PV0

add

PV0

PV0

add

sub

v18

addv18

v18

v19

add

mul

v12

add

v13

v13

v13

add

v13

mul

mul

v10

sqr

v10

mul

v10

v11

mul

v11

v16

v17

v14

v15

v15

v15

v80

v69

mul v68

v68

subv67

mul

v66

v65

addv65

sqr

v64

v64

mul

v64

mul

v63

addv62

mul

v62

sub

v61

v60

v60

add

v78

v79

mul

sub

v74

v75

sub

v76

v77

v71

add

v72

v73

v41

v40

v40

v40

v40

sub

sqr

v43

mul

v43

v42

mul

add

v45

add

v44

mulv47

v46

v49

v48

sub

v23

v22

v21

add

v20subv27

addv26

mul

v26

v25

v24

v29

v28

sub

v56

v57

v57

v57

v54

add

v54

v54v55

v53

v53

v53

v50

v51

v51

v51

mul

v59

v59

v59

v30

v30

v31

sub

v32

v33

add

v33

v34

v34

addv34

v35

v36

v37

v38

v38

v38

v38

v39

v39

v39

v0

v0

v1

mul

v1

sub

v2

v3

v4

v4

v4

add

v5

v6

v6

sqr

v6

v7

v8

v9

v9

Cost: 38M + 6S

Configurations on a XC6SLX75 FPGA (details in [5]):

• w = 32 bits internal words• 1 adder/subtracter, 1 inversion unit• nM multipliers (Montgomery) with nB w -bit sub-blocks• No DSP blocks• ISE 14.6 Xilinx CAD tools, standard efforts (synthesis and P&R)


Comparison ECC 256 vs HECC 128 (2/7)

• Compared recoding techniques:I BIN: standard binary from left to rightI NAF: non-adjacent formI λ-NAF: window methods with λ ∈ {3, 4}

• Implementation results for a full ECC accelerator (nM = 1, nB = 1):

Recoding BIN NAF 3-NAF 4-NAF

area slices (FF/LUT) 565 (1321/1461) 570 (1340/1479) 571 (1344/1495) 503 (1348/1489)freq. (MHz) 225 228 237 217

All other results are reported for 4-NAF



Impact of the number/size of multipliers on the area and frequency:

nM

BR

AM nB = 1 nB = 2 nB = 4

area freq. area freq. area freq.slices (FF/LUT) MHz slices (FF/LUT) MHz slices (FF/LUT) MHz

EC

C

1 3 547 (1374/1460) 231 573 (1476/1625) 233 673 (1674/1875) 2332 3 722 (1776/1903) 220 811 (1979/2210) 227 942 (2377/2701) 2203 3 810 (2174/2236) 221 915 (2480/2698) 215 1130 (3077/3430) 2144 3 952 (2569/2656) 215 1100 (2977/3282) 217 1512 (3771/4293) 2165 3 1064 (2982/3136) 210 1405 (3492/3902) 206 1722 (4487/5122) 209

HE

CC

1 4 514 (1336/1374) 235 549 (1434/1513) 2342 4 646 (1716/1783) 220 737 (1912/2055) 2343 4 732 (2092/2075) 224 826 (2386/2485) 2254 4 870 (2476/2424) 218 1022 (2868/2987) 2145 4 976 (2865/2773) 219 1115 (3355/3465) 2106 4 1089 (3233/3092) 203 1240 (3821/3908) 2087 4 1145 (3601/3426) 213 1372 (4287/4365) 2058 4 1281 (3981/3809) 191 1552 (4765/4890) 1839 4 1379 (4363/4051) 202 1691 (5245/5277) 199

10 4 1543 (4739/4435) 196 1856 (5719/5801) 19811 4 1547 (5114/4750) 189 1936 (6192/6240) 19812 4 1738 (5499/5128) 191 2100 (6675/6771) 188



Impact of the number/size of multipliers on the average time (ms):

nBnM

1 2 3 4 5 6 7 8 9 10 11 12

HECC1 15.6 8.6 5.7 4.7 3.9 3.7 3.3 3.6 3.4 3.5 3.6 3.62 11.9 6.2 4.5 3.6 3.2 2.8 2.8 3.0 2.7 2.7 2.8 2.9

ECC1 28.1 15.3 12.4 12.4 12.72 17.7 9.6 8.3 8.0 8.44 11.1 6.2 5.4 5.1 5.3

Standard deviation for 1000 [k]P:

configuration ECC (1,1) ECC (3,4) HECC (1,1) HECC (6,2)

average time [ms] 28.1 5.4 15.6 2.8standard deviation [ms] 0.289 0.056 0.324 0.045



area

[slice

s]

time [ms]

ECC

HECC

600 800 1000 1200 1400 1600 1800 2000 2200

5

10

15

20

25

30

5,4

5,2

5,1

4,4

4,2

4,1

3,4

3,2

3,1

2,4

2,2

2,1

1,4

1,2

1,1

12,212,1 11,211,1 10,210,1 9,2

9,1

8,2

8,1

7,2

7,1

6,2

6,1

5,2

5,1

4,2

4,1

3,23,1

2,2

2,1

1,2

1,1

On average HECC is 40 % faster than ECC for a similar silicon cost


Comparison ECC 256 vs HECC 128 (6/7)%

usa

ge×

area

spee

du

p

ECC HECC

020406080

1001

2

3012345

1,1 1,2 1,4 2,4 3,4 4,4 1,1 1,2 2,1 3,1 3,2 5,2 8,2



Source FPGAarea freq. duration [k]P

slices / DSP blocks MHz ms

ECC 1,2

Spartan 6

573 / 0 233 17.7ECC 1,4 673 / 0 233 11.1ECC 2,4 942 / 0 220 6.2ECC 3,4 1 130 / 0 214 5.4

[7]Virtex-5 1 725 / 37 291 0.38Virtex-4 4 655 / 37 250 0.44

[6] Virtex-413 661 / 0 43 9.220 123 / 0 43 7.7


Conclusion & Current/Future Works

• HECC is efficient in hardware (40 % speedup vs ECC)• Flexible architecture and tools for research activities• Advanced recoding schemes are efficient in hardware

Current/future works:

• Hardware implementation of halving based method(s)• Protections against fault injection• HECC extensions of the accelerator (and tools)• ASIC (CMOS 65nm) implementation of the accelerator• Side channel evaluation of (some) proposed protections• HW/SW Code distribution under free license• More advanced architecture/circuit level protections• Collaboration with other research groups


Our Long Term ObjectivesStudy the links between:

• curves• arithmetic algorithms• Fq, pts representations• architecture & units• circuit styles

to ensure

• high security againstI theoretical attacksI physical attacks

• low design cost• low silicon cost• low energy(/power)• high performances• high flexibility

area 1

delay 1

energy 1

security 1




to ensure



area 1 1 + a

delay 1 1 + t

energy 1 1 + e

a, t, e ∈ 0%, 5%, 10%, . . . , 100%

security 1




to ensure



area 1 1 + a

delay 1 1 + t

energy 1 1 + e

a, t, e ∈ 0%, 5%, 10%, . . . , 100%

security 1

×10×100


References I

T. Chabrier, D. Pamula, and A. Tisserand.

Hardware implementation of DBNS recoding for ECC processor.In Proc. 44rd Asilomar Conference on Signals, Systems and Computers, pages 1129–1133, Pacific Grove, California,U.S.A., November 2010. IEEE.

T. Chabrier and A. Tisserand.

On-the-fly multi-base recoding for ECC scalar multiplication without pre-computations.In A. Nannarelli, P.-M. Seidel, and P. T. P. Tang, editors, Proc. 21st Symposium on Computer Arithmetic (ARITH), pages219–228, Austin, TX, U.S.A, April 2013. IEEE Computer Society.

J. Chen, A. Tisserand, E. Popovici, and S. Cotofana.

Asynchronous charge sharing power consistent montgomery multiplier.In J. Sparso and E Yahya, editors, Proc. 21st IEEE International Symposium on Asynchronous Circuits and Systems(ASYNC), pages 132–138, Mountain View, California, USA, May 2015.

J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana.

Robust sub-powered asynchronous logic.In J. Becker and M. R. Adrover, editors, Proc. 24th International Workshop on Power and Timing Modeling, Optimizationand Simulation (PATMOS), pages 1–7, Palma de Mallorca, Spain, September 2014. IEEE.

G. Gallin, A. Tisserand, and N. Veyrat-Charvillon.

Comparaison expérimentale d’architectures de crypto-processeurs pour courbes elliptiques et hyper-elliptiques.In Actes Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015.Prix meilleur papier track architecture.

S. Ghosh, M. Alam, D. Roychowdhury, and I.S. Gupta.

Parallel crypto-devices for GF(p) elliptic curve multiplication resistant against side channel attacks.Computers and Electrical Engineering, 35(2):329–338, March 2009.


References II

Y. Ma, Z. Liu, W. Pan, and J. Jing.

A high-speed elliptic curve cryptographic processor for generic curves over GF(p).In Proc. 20th International Workshop on Selected Areas in Cryptography (SAC), volume 8282 of LNCS, pages 421–437,Burnaby, BC, Canada, August 2013. Springer.

D. Pamula.

Arithmetic Operators on GF(2m) for Cryptographic Applications: Performance - Power Consumption - Security Tradeoffs.Phd thesis, University of Rennes 1 and Silesian University of Technology, December 2012.

D. Pamula, E. Hrynkiewicz, and A. Tisserand.

Analysis of GF(2233) multipliers regarding elliptic curve cryptosystem applications.In 11th IFAC/IEEE International Conference on Programmable Devices and Embedded Systems (PDeS), pages 252–257,Brno, Czech Republic, May 2012.

D. Pamula and A. Tisserand.

GF(2m) finite-field multipliers with reduced activity variations.In 4th International Workshop on the Arithmetic of Finite Fields, volume 7369 of LNCS, pages 152–167, Bochum,Germany, July 2012. Springer.

D. Pamula and A. Tisserand.

Fast and secure finite field multipliers.In Proc. Euromicro Conference on Digital System Design (DSD), pages 1–8, Funchal, Portugal, August 2015.

J. Proy, N. Veyrat-Charvillon, A. Tisserand, and N. Meloni.

Full hardware implementation of short addition chains recoding for ECC scalar multiplication.In Actes Conférence d’informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015.


The end, questions ?

Contact:

• mailto:[email protected]• http://people.irisa.fr/Arnaud.Tisserand/• CAIRN Group http://www.irisa.fr/cairn/• IRISA Laboratory, CNRS–INRIA–Univ. Rennes 1

6 rue Kerampont, CS 80518, F-22305 Lannion cedex, France

Thank you


mailto:[email protected]://people.irisa.fr/Arnaud.Tisserand/http://www.irisa.fr/cairn/

IntroductionCurrent Projects on (H)ECC AcceleratorsIntroductionSide Channel AttacksObjectives of Our Research Group

Accelerator ArchitectureAccelerator SpecificationsAccelerator ArchitectureFunctional Units for Field Level OperationsRegister File ( Dual Port Memory)Key Management UnitExternal Interface(s)Protected F2m MultipliersProtected (Old) Accelerator for F2mCircuit-Level Protections for Arithmetic OperatorsUnits Impact on Side Channel Information

Accelerator ProgrammingDeveloped Programming ToolsInstruction SetAddress Model in the Register FileCode MemoryInternal Parallelism Model

Implementation resultsECC Accelerator with Additions ChainsComparison ECC 256 vs HECC 128

Conclusion

hardware accelerators for ecc and hecc · a. tisserand, cnrs{irisa{cairn. hardware accelerators for...

Documents