
Kris Gaj and Pawel Chodowiec
Electrical and Computer Engineering
George Mason University

Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays

http://ece.gmu.edu/crypto-text.htm

AES Contest - NIST Evaluation Criteria

Security

Hardware Efficiency

Software Efficiency

Flexibility

AES Contest Effort

• June 1998: 15 candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica
• Round 1 (security, software efficiency)
• August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish
• Round 2 (security, hardware efficiency)
• October 2000: 1 winner: Rijndael (Belgium)

Hardware Efficiency Comparisons

• FPGA (academia and small business): GMU, WPI, USC, UC Berkeley, MICRONIC
• ASIC (government and large companies): NSA, Mitsubishi, IBM

Primary ways of implementing cryptography in hardware

ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time-consuming fabrication in a semiconductor foundry

FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves

Which way to go?

ASICs
• High performance
• Low power
• Low cost (but only in high volumes)

FPGAs
• Off-the-shelf
• Short time to the market
• Low development costs
• Reconfigurability

Reconfigurability

External ROM and microprocessor enable changing an FPGA function in several milliseconds

• Encryption vs. decryption vs. key scheduling: the FPGA is reconfigured between key scheduling, encryption and decryption, 5-15 ms per change
• Various algorithms: the FPGA is reconfigured between Triple DES, IDEA and AES, 5-15 ms per change

Target FPGA devices

Xilinx Virtex - XCV 1000
• 0.22 µm CMOS process
• 1 million equivalent logic gates
• 12 288 CLB slices
• 10 4-kbit block RAMs
• Up to 200 MHz clock

[Figure: FPGA structure showing Configurable Logic Block slices (CLB slices), Programmable Interconnects and Block RAMs]

Methodology and Tools

Code in VHDL
1. Functional simulation (Aldec, Active-HDL)
2. Synthesis and Implementation (Xilinx, Foundation Series v. 2.1), producing a netlist with timing and a bitstream
3. Timing simulation of the netlist with timing (Aldec, Active-HDL)
4. Experimental Testing of the bitstream (USC-ISI, SLAAC-1V FPGA board)

[Diagram columns: Implementation, Verification]

Primary parameters of hardware implementations for secret-key block ciphers

• Latency: time to encrypt/decrypt a single block of data (Mi → Encryption/decryption → Ci)
• Throughput: number of bits encrypted/decrypted in a unit of time (Mi, Mi+1, Mi+2 → Encryption/decryption → Ci, Ci+1, Ci+2)

Throughput = (Block_size · Number_of_blocks_processed_simultaneously) / Latency
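The throughput formula can be checked numerically with a minimal sketch. The function name and the sample figures (a 128-bit block, one block processed at a time, 297 ns latency) are illustrative assumptions, not part of the slide.

```python
def throughput_bits_per_s(block_size_bits, blocks_in_flight, latency_s):
    """Throughput = block_size * blocks_processed_simultaneously / latency."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return block_size_bits * blocks_in_flight / latency_s

# Hypothetical example: one 128-bit block at a time, 297 ns per block.
basic = throughput_bits_per_s(128, 1, 297e-9)
print(f"{basic / 1e6:.0f} Mbit/s")
```

With the latency fixed, doubling the number of blocks processed simultaneously doubles the throughput, which is the effect pipelining aims for.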

Dependence of the encryption time on latency and throughput

Encryption time = Latency + (Message_size - Block_size) / Throughput

[Figure: encryption time plotted against message size]
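A small sketch of this dependence, assuming the first block costs one full latency and every further bit arrives at the steady-state throughput. The function name and the sample numbers are hypothetical.

```python
def encryption_time_s(message_bits, block_bits, latency_s, throughput_bps):
    """Latency for the first block, then the remaining bits at full throughput."""
    if message_bits < block_bits:
        raise ValueError("message must contain at least one block")
    return latency_s + (message_bits - block_bits) / throughput_bps

# Hypothetical circuit: 300 ns latency, 400 Mbit/s throughput, 128-bit blocks.
one_block = encryption_time_s(128, 128, 300e-9, 400e6)
long_msg = encryption_time_s(128 * 1000, 128, 300e-9, 400e6)
print(one_block, long_msg)
```

For a single block only the latency matters; for long messages the throughput term dominates.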

Top level block diagram

[Figure: block diagram with input interface (input/key), key scheduling, memory of internal keys, encryption/decryption, output interface (output) and control unit (control)]

Primary factor in choosing the encryption/decryption unit architecture

Symmetric-key cipher mode of operation:

1. Non-feedback cipher modes

ECB, counter mode

2. Feedback cipher modes

CBC, CFB, OFB

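Non-feedback Counter Mode - CTR

[Figure: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N each pass through an encryption block E; each result is XORed with the corresponding message block M0, M1, M2, ..., MN-1, MN to produce a ciphertext block]

Ci = Mi ⊕ AES(IV+i) for i=0..N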
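The counter-mode data flow, Ci = Mi ⊕ AES(IV+i), can be sketched in a few lines. This is a minimal sketch under a loud assumption: `toy_block_encrypt` is a hypothetical stand-in built from SHA-256, not AES and not a secure cipher. It is there only to show that each ciphertext block depends on its own counter value and on nothing else.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit block

def toy_block_encrypt(key, block):
    """Stand-in for AES: NOT a real cipher, used only to show the data flow."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]

def ctr_encrypt(key, iv, message_blocks):
    """Ci = Mi xor E(IV + i); every block is processed independently."""
    out = []
    for i, m in enumerate(message_blocks):
        counter = ((iv + i) % (1 << 128)).to_bytes(BLOCK_BYTES, "big")
        keystream = toy_block_encrypt(key, counter)  # depends only on IV + i
        out.append(bytes(a ^ b for a, b in zip(m, keystream)))
    return out

key = b"k" * 16
msg = [bytes([i]) * BLOCK_BYTES for i in range(4)]
ct = ctr_encrypt(key, 7, msg)
```

Because no block waits for another, several blocks can be in the circuit at the same time.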

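Feedback cipher modes - CBC

[Figure: message blocks M1, M2, M3, ..., MN-1, MN; each is XORed with the previous ciphertext block (the IV for the first block) and passed through an encryption block E to produce C1, C2, C3, ..., CN-1, CN]

C1 = AES(M1 ⊕ IV)
Ci = AES(Mi ⊕ Ci-1) for i=2..N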
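The chaining in CBC can be sketched the same way. As an assumption for illustration, `toy_block_encrypt` is a hypothetical SHA-256-based stand-in for AES (not invertible, not secure); only the encryption-side dependency between blocks is modeled.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit block

def toy_block_encrypt(key, block):
    """Stand-in for AES (not invertible, not secure); shows chaining only."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def cbc_encrypt(key, iv, message_blocks):
    """C1 = E(M1 xor IV), Ci = E(Mi xor C(i-1)): each block waits for the previous one."""
    previous = iv
    out = []
    for m in message_blocks:
        previous = toy_block_encrypt(key, xor_bytes(m, previous))
        out.append(previous)
    return out

key = b"k" * 16
iv = bytes(BLOCK_BYTES)
msg = [bytes([i]) * BLOCK_BYTES for i in range(4)]
ct = cbc_encrypt(key, iv, msg)
ct_changed = cbc_encrypt(key, iv, [b"\xff" * BLOCK_BYTES] + msg[1:])
```

Changing the first message block changes every later ciphertext block, so block i cannot start before block i-1 has finished.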

Feedback cipher modes: CBC, CFB, OFB

Basic iterative architecture

[Figure: a multiplexer feeds a register followed by the combinational logic of one round; the round output loops back to the multiplexer]

Architectures suitable for feedback modes

[Figure: basic iterative architecture (MUX, register, combinational logic of one round) beside loop unrolling (MUX, register, round 1, round 2, ..., round K, up to round #rounds)]

Partial Loop Unrolling

[Figure: multiplexer and register followed by the combinational logic of K rounds (round 1, round 2, ..., round K)]
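A software model of the iterative loop may help here. It is a sketch under two assumptions: one clock cycle per pass through the combinational logic, and a hypothetical toy round function. With K rounds unrolled, a block then needs ceil(#rounds / K) clock cycles while computing the same result.

```python
import math

def clock_cycles_per_block(num_rounds, unrolled_rounds=1):
    """Passes through the register/multiplexer loop needed for one block."""
    if not 1 <= unrolled_rounds <= num_rounds:
        raise ValueError("unrolled_rounds must be between 1 and num_rounds")
    return math.ceil(num_rounds / unrolled_rounds)

def run_iterative(block, round_keys, round_fn, unrolled_rounds=1):
    """Software model: the register is updated once per clock cycle."""
    register, cycles = block, 0
    for start in range(0, len(round_keys), unrolled_rounds):
        # Combinational logic: up to `unrolled_rounds` rounds in one cycle.
        for key in round_keys[start:start + unrolled_rounds]:
            register = round_fn(register, key)
        cycles += 1
    return register, cycles

def toy_round(state, key):
    """Hypothetical round function, not taken from any AES candidate."""
    return ((state ^ key) * 5 + 1) % (1 << 128)

keys = list(range(1, 33))  # 32 toy round keys
basic = run_iterative(0x1234, keys, toy_round)
unrolled = run_iterative(0x1234, keys, toy_round, unrolled_rounds=8)
```

Fewer cycles do not by themselves mean higher throughput: each cycle now has to cover the delay of K rounds, and the model says nothing about the extra area.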

Loop Unrolling: Speed vs. Area

[Figure: throughput plotted against area for the basic architecture, loop unrolling (k=2, k=3, k=4, k=5) and resource sharing]

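Decreasing area by resource sharing

[Figure: before, two copies of function F process D0 and D1 in parallel to give D0’ and D1’; after, a multiplexer selects D0 or D1 for a single shared F, and two registers hold D0’ and D1’]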

First basic architecture of Serpent - Serpent I1

[Figure: 128-bit register; key Ki; eight S-box banks (32 x S-box 0, 32 x S-box 1, ..., 32 x S-box 7) selected by an 8-to-1 128-bit multiplexer; linear transformation; key K32 at the output. The loop implements one regular Serpent round; all data paths are 128 bits wide]

Alternative basic architecture of Serpent: Serpent I8

[Figure: 128-bit register followed by eight unrolled rounds, from round 0 (K0, 32 x S-box 0, linear transformation) to round 7 (K7, 32 x S-box 7, linear transformation); key K32 at the output; all data paths are 128 bits wide]

One implementation round of Serpent = 8 regular cipher rounds

Our Results: Basic architecture - Speed

Throughput [Mbit/s]
Serpent    431
Rijndael   414
Twofish    177
RC6        142
Mars        61
3DES        59

Our Results: Basic architecture - Area

Area [CLB slices]
Serpent    4507
3DES        356
Rijndael   2507
Twofish    1076
RC6        1137
Mars       2744
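The speed and area results can be combined into a throughput/area ratio. The sketch below assumes a pairing of bar values with ciphers that the flattened slide text does not state explicitly; treat the pairing as an assumption of this example.

```python
# cipher: (throughput in Mbit/s, area in CLB slices); pairing is assumed.
results = {
    "Serpent": (431, 4507),
    "Rijndael": (414, 2507),
    "Twofish": (177, 1076),
    "RC6": (142, 1137),
    "Mars": (61, 2744),
    "3DES": (59, 356),
}

# Throughput per CLB slice, in Mbit/s per slice.
ratio = {name: mbit / slices for name, (mbit, slices) in results.items()}
ranking = sorted(ratio, key=ratio.get, reverse=True)

for name in ranking:
    print(f"{name:9s} {ratio[name]:.3f} Mbit/s per CLB slice")
```

Under this pairing the ratio separates the ciphers differently from raw throughput, which is why the choice of comparison criterion matters.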

Comparison with results of other groups: Speed

[Figure: bar chart of throughput [Mbit/s] (axis 0 to 500) for Serpent I8, Rijndael, Twofish, RC6, Mars and Serpent I1, with bars for Worcester Polytechnic Institute, University of Southern California and Our Results. Bar values in extraction order: 431, 444, 414, 353, 294, 177, 173, 104, 149, 62, 143, 112, 88, 102, 61]

Comparison with results of other groups: Area

[Figure: bar chart of area [CLB slices] (axis 0 to 9000) for Serpent I8, Rijndael, Twofish, RC6, Mars and Serpent I1, with bars for Worcester Polytechnic Institute, University of Southern California and Our Results. Bar values in extraction order: 1250, 5511, 1076, 2809, 2666, 1137, 1749, 2638, 2507, 4312, 3528, 2744, 4621, 4507, 7964]

Our Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA

[Figure: scatter plot of throughput [Mbit/s] (axis 0 to 500) against area [CLB slices] (axis 0 to 5000) for Rijndael, Serpent I8, Mars, RC6, Twofish and Serpent I1]

NSA Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - ASIC, 0.5 µm CMOS

[Figure: scatter plot of throughput [Mbit/s] (axis 0 to 700) against area (axis 0 to 40, labeled "Area [CLB slices]" on the slide) for Serpent I1, RC6, Twofish, Mars and Rijndael]

Conclusions for feedback cipher modes (1) (CBC, CFB, OFB)

• Speed (throughput) should be the primary criterion of comparison

• Basic iterative architecture is the most appropriate for comparison and future implementations

• Serpent and Rijndael are over twice as fast as the next best candidate for all implementations

Conclusions for feedback cipher modes (2) (CBC, CFB, OFB)

• Results confirmed by
  - three independent university groups for FPGAs, and
  - NSA group for ASICs

• Results of comparison independent of implementation technology (FPGAs vs. ASICs)

[Figure: bar chart of # votes (axis 0 to 100) for Serpent, Rijndael, Twofish, RC6 and Mars]

Survey filled out by 167 participants of the Third AES Conference, April 2000

Our Results: Basic architecture - Speed

[Figure: bar chart of throughput [Mbit/s] (axis 0 to 500) for Serpent, Rijndael, Twofish, RC6 and Mars]

Non-Feedback Cipher Modes: ECB, counter

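Comparison for non-feedback cipher modes, e.g. Counter Mode - CTR

[Figure: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N each pass through an encryption block E; each result is XORed with the corresponding message block M0, M1, M2, ..., MN-1, MN to produce a ciphertext block]

Ci = Mi ⊕ AES(IV+i) for i=0..N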

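NSA approach: Traditional methodology

[Figure: three architectures. Basic: MUX, register and the combinational logic of one round, no pipelining. Partial outer-round pipelining: MUX and K registers, with round 1 through round K each forming one pipeline stage. Full outer-round pipelining: round 1 through round #rounds, each forming one pipeline stage]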

Our approach: New methodology

[Figure: four architectures, labeled a) to d). Basic: MUX, register and the combinational logic of one round, no pipelining. Inner-round pipelining: MUX and k registers, one round = k pipeline stages. Partial mixed pipelining: MUX, round 1 through round K, each with k registers = k pipeline stages. Full mixed pipelining: round 1 through round #rounds, each = k pipeline stages]

Our approach: Inner-Round Pipelining

[Figure: a multiplexer feeds one round split into pipeline stage 1, pipeline stage 2, ..., pipeline stage k, separated by register 1, register 2, ..., register k]
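The effect of splitting one round into k pipeline stages can be explored with a rough timing model. Everything in this sketch is an assumption made for illustration: the round delay is taken to split evenly across the k stages, each register adds a fixed overhead, and k blocks are kept in the loop at once. The numbers are hypothetical.

```python
def inner_round_pipelining(block_bits, num_rounds, round_delay_s, k, reg_overhead_s):
    """Return (latency, throughput) for one round split into k pipeline stages."""
    if k < 1:
        raise ValueError("k must be at least 1")
    clock_period = round_delay_s / k + reg_overhead_s  # assumed evenly split stages
    latency = num_rounds * k * clock_period            # every block visits every stage
    throughput = block_bits * k / latency              # k blocks processed simultaneously
    return latency, throughput

# Hypothetical cipher: 10 rounds, 30 ns per round, 1 ns register overhead.
lat1, thr1 = inner_round_pipelining(128, 10, 30e-9, k=1, reg_overhead_s=0.0)
lat3, thr3 = inner_round_pipelining(128, 10, 30e-9, k=3, reg_overhead_s=1e-9)
print(lat1, thr1, lat3, thr3)
```

In this model throughput grows almost linearly with k while latency grows only by the register overhead, until the stages can no longer be balanced.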

Comparison of the traditional and new design methodologies

[Figure: throughput plotted against area for the basic architecture, outer-round pipelining (K=2, K=3, K=4), inner-round pipelining (k=2 up to kopt) and mixed inner and outer-round pipelining (K=2, K=3)]

Latency vs. area dependence for the new design methodology

[Figure: latency plotted against area for the basic architecture, outer-round pipelining (K=2, K=3, K=4, K=5), inner-round pipelining (k=2 up to kopt) and mixed inner and outer-round pipelining (K=2, K=3)]

NSA architecture: Full outer-round pipelining

[Figure: round 1 through round #rounds in sequence, each = one pipeline stage, with #rounds registers]

Total # of pipeline stages = #rounds

NSA Results: Full outer-round pipelining

[Figure: bar chart of throughput [Gbit/s] (axis 0 to 9), CMOS ASIC 0.5 µm, for Mars, RC6, Twofish, Rijndael and Serpent. Bar values in extraction order: 2.2, 5.7, 2.3, 2.2, 8.0]

Our approach: Full mixed inner- and outer-round pipelining

[Figure: round 1, round 2, ..., round #rounds in sequence, each with k registers = k pipeline stages]

Total # of pipeline stages = #rounds·k
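The stage counts of the two fully pipelined architectures can be compared with a minimal sketch. It assumes that every pipeline stage holds a different block and that the latency equals the number of stages times the clock period; the round counts, k and the clock period below are hypothetical.

```python
def pipeline_stages(num_rounds, k=1):
    """Full outer-round pipelining is the k = 1 case; full mixed uses k stages per round."""
    return num_rounds * k

def pipelined_throughput(block_bits, num_rounds, k, clock_period_s):
    """Throughput when every pipeline stage holds a different block."""
    stages = pipeline_stages(num_rounds, k)
    latency = stages * clock_period_s
    return block_bits * stages / latency  # reduces to block_bits / clock_period_s

outer_only = pipeline_stages(32)     # hypothetical 32-round cipher, outer-round only
mixed = pipeline_stages(10, k=7)     # hypothetical 10-round cipher with k = 7
thr = pipelined_throughput(128, 10, 7, 10e-9)
print(outer_only, mixed, thr)
```

Once the pipeline is full, the throughput depends only on the block size and the clock period, so a cipher with few long rounds can catch up by using a larger k.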

Our Results: Full mixed pipelining

Throughput [Gbit/s], Virtex FPGA
Serpent    16.8
Twofish    15.2
RC6        13.1
Rijndael   12.2

Speed-up compared to the basic architecture

Cipher      Our results   NSA
Rijndael    29.5          9.5
Serpent     39 (I8)       40 (I1)
Twofish     86            21.5
RC6         91.5          21
Mars        -             38
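The speed-up is the ratio of the fully pipelined throughput to the basic-architecture throughput. The sketch below recomputes it from the rounded bar values on the earlier result slides; the pairing of values with ciphers is an assumption of this example, and the results match the chart only to within rounding.

```python
# Basic architecture throughput [Mbit/s] and full mixed pipelining
# throughput [Gbit/s]; the cipher/value pairing is assumed.
basic_mbit = {"Serpent I8": 431, "Rijndael": 414, "Twofish": 177, "RC6": 142}
pipelined_gbit = {"Serpent I8": 16.8, "Rijndael": 12.2, "Twofish": 15.2, "RC6": 13.1}

speed_up = {
    cipher: pipelined_gbit[cipher] * 1000 / basic_mbit[cipher]
    for cipher in basic_mbit
}

for cipher, factor in speed_up.items():
    print(f"{cipher:10s} x {factor:.1f}")
```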

Our Results: Full mixed pipelining

Area [CLB slices]
Serpent    19,700
Twofish    21,000
RC6        46,900
Rijndael   12,600 + 80 RAMs (dedicated memory blocks)

Our Results: Increase in the circuit latency

Latency without and with pipelining (chart axis 0 to 6 µs; bar values in ns)
Cipher       Without   With   Increase
Serpent I8   297       733    x 2.5
Rijndael     309       737    x 2.4
Twofish      722       3092   x 4.3
RC6          897       5490   x 6.1
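The increase factors can be recomputed from the latency bars. This sketch assumes that the bar values are in nanoseconds and pair with the ciphers as listed in the code; both are assumptions, since the flattened chart does not state them.

```python
# cipher: (latency without pipelining, latency with pipelining), assumed to be in ns.
latency_ns = {
    "Serpent I8": (297, 733),
    "Rijndael": (309, 737),
    "Twofish": (722, 3092),
    "RC6": (897, 5490),
}

increase = {cipher: with_p / without for cipher, (without, with_p) in latency_ns.items()}

for cipher, factor in increase.items():
    print(f"{cipher:10s} x {factor:.2f}")
```

Under this pairing the ciphers with the longest basic latency also pay the largest latency penalty for pipelining.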

Conclusions for non-feedback cipher modes (1) (ECB, counter)

• All ciphers can achieve approximately the same speed. Area should be the primary criterion of comparison.

• Architecture with inner-round pipelining combined with full outer-round pipelining is the most appropriate for comparison and future implementations

• Serpent, Twofish and Rijndael are the most cost-efficient and take approximately the same amount of area

Conclusions for non-feedback cipher modes (2) (ECB, counter)

No agreement regarding the methodology and architecture used for comparison

NSA methodology favored ciphers with
• short cipher round
• large number of rounds

Our methodology
• fair
• practical (superior throughput/area ratio)

Importance of the AES candidate hardware efficiency comparison

• Important factor used to differentiate among final candidates
  - objective and commonly accepted measures
  - good agreement among results from various groups
  - large differences among final candidates

• Efficient architectures and methodologies developed for all algorithms

Basic building blocks of FPGA devices: Virtex

CLB slice = 1/2 of a CLB
CLB - Configurable Logic Block

[Figure: a CLB slice in logic mode (combinational logic, input width marked 8, and two one-bit registers) and in memory mode (two RAM16x1 blocks, input width marked 4 each, and two one-bit registers)]
