Kris Gaj and Pawel Chodowiec
Electrical and Computer Engineering
George Mason University
Fast implementation and fair comparison of the final candidates for Advanced Encryption Standard using Field Programmable Gate Arrays
http://ece.gmu.edu/crypto-text.htm
AES Contest Effort
June 1998: 15 candidates from USA, Canada, Belgium, France, Germany, Norway, UK, Israel, Korea, Japan, Australia, Costa Rica
Round 1 (criteria: security, software efficiency)
August 1999: 5 final candidates: Mars, RC6, Rijndael, Serpent, Twofish
Round 2 (criteria: security, hardware efficiency)
October 2000: 1 winner: Rijndael (Belgium)
Hardware Efficiency Comparisons
FPGA (academia and small business): GMU, WPI, USC, UC Berkeley, MICRONIC
ASIC (government and large companies): NSA, Mitsubishi, IBM
Primary ways of implementing cryptography in hardware
ASIC (Application Specific Integrated Circuit)
• designed all the way from behavioral description to physical layout
• designs must be sent for expensive and time-consuming fabrication in semiconductor foundry
FPGA (Field Programmable Gate Array)
• no physical layout design; design ends with a bitstream used to configure a device
• bought off the shelf and reconfigured by designers themselves
Which way to go?
ASICs
• High performance
• Low power
• Low cost (but only in high volumes)
FPGAs
• Off-the-shelf
• Low development costs
• Short time to the market
• Reconfigurability
Reconfigurability
External ROM and microprocessor enable changing an FPGA function in several milliseconds
Encryption vs. decryption vs. key scheduling: Key scheduling → (5-15 ms) → Encryption → (5-15 ms) → Decryption
Various algorithms: Triple DES → (5-15 ms) → IDEA → (5-15 ms) → AES
Target FPGA devices
Xilinx Virtex - XCV 1000
• 0.22 µm CMOS process
• 1 mln equivalent logic gates
• 12 288 CLB slices
• 10 4-kbit block RAMs
• Up to 200 MHz clock
[Figure: FPGA structure with programmable interconnects, Configurable Logic Block slices (CLB slices), and block RAMs]
Methodology and Tools
Design flow, divided into Implementation and Verification:
Code in VHDL
1. Functional simulation (Aldec, Active-HDL)
2. Synthesis and Implementation (Xilinx, Foundation Series v. 2.1), producing a netlist with timing and a bitstream
3. Timing simulation (Aldec, Active-HDL), using the netlist with timing
4. Experimental Testing (USC-ISI, SLAAC-1V FPGA board), using the bitstream
Primary parameters of hardware implementations for secret-key block ciphers
Latency: time to encrypt/decrypt a single block of data
[Figure: Mi → Encryption/decryption → Ci]
Throughput: number of bits encrypted/decrypted in a unit of time
[Figure: Mi, Mi+1, Mi+2 → Encryption/decryption → Ci, Ci+1, Ci+2]
Throughput = (Block_size · Number_of_blocks_processed_simultaneously) / Latency
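The throughput definition above can be sketched as a small function. The function name, parameter names, and sample numbers below are illustrative assumptions, not values taken from the slides.

```python
def throughput_bits_per_s(block_size_bits, blocks_in_flight, latency_s):
    """Throughput = Block_size * Number_of_blocks_processed_simultaneously / Latency."""
    if latency_s <= 0:
        raise ValueError("latency must be positive")
    return block_size_bits * blocks_in_flight / latency_s


# Hypothetical example: 128-bit blocks, one block at a time, 320 ns latency.
single = throughput_bits_per_s(128, 1, 320e-9)
# Same latency, but 10 blocks processed simultaneously (e.g. a pipelined design).
pipelined = throughput_bits_per_s(128, 10, 320e-9)
print(f"{single / 1e6:.0f} Mbit/s, {pipelined / 1e6:.0f} Mbit/s")
```

With a fixed latency, throughput in this model grows linearly with the number of blocks processed simultaneously.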
Dependence of the encryption time on latency and throughput
Encryption time = Latency + (Message_size - Block_size) / Throughput
[Figure: time versus message size]
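A minimal sketch of this dependence, assuming the relation reads Encryption time = Latency + (Message_size - Block_size) / Throughput. The function name and the sample figures are hypothetical.

```python
def encryption_time_s(message_bits, block_bits, latency_s, throughput_bps):
    """Time to encrypt a whole message under the assumed linear model:
    the first block costs one full latency, every further bit is paid for
    at the steady-state throughput."""
    if message_bits < block_bits:
        raise ValueError("message must contain at least one block")
    return latency_s + (message_bits - block_bits) / throughput_bps


# Hypothetical circuit: 128-bit block, 320 ns latency, 400 Mbit/s throughput.
t_one_block = encryption_time_s(128, 128, 320e-9, 400e6)
t_ten_blocks = encryption_time_s(1280, 128, 320e-9, 400e6)
print(f"{t_one_block * 1e9:.0f} ns, {t_ten_blocks * 1e9:.0f} ns")
```

For a single block the time equals the latency; for long messages the throughput term dominates.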
Top level block diagram
[Figure: input interface (input/key), control unit, key scheduling, memory of internal keys, encryption/decryption, output interface (output)]
Primary factor in choosing the encryption/decryption unit architecture
Symmetric-key cipher mode of operation:
1. Non-feedback cipher modes
ECB, counter mode
2. Feedback cipher modes
CBC, CFB, OFB
Non-feedback Counter Mode - CTR
[Figure: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N are each encrypted by E and XORed with message blocks M0, M1, M2, ..., MN-1, MN to give the ciphertext blocks]
Ci = Mi ⊕ AES(IV+i) for i=0..N
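The counter-mode relation Ci = Mi ⊕ AES(IV+i) can be sketched as follows. Python's standard library has no AES, so `stand_in_cipher` is a hypothetical placeholder built from SHA-256; it is not AES and not secure. All names, the key, and the IV are assumptions made for illustration.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit block size, as in AES


def stand_in_cipher(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES(.): a keyed hash truncated to one block.
    NOT AES and NOT secure; it only gives a deterministic 128-bit output."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def ctr_block(key: bytes, iv: int, i: int, m_i: bytes) -> bytes:
    """Ci = Mi XOR E(IV + i); depends only on the index i and on Mi."""
    counter = ((iv + i) % (1 << 128)).to_bytes(BLOCK_BYTES, "big")
    return xor_blocks(m_i, stand_in_cipher(key, counter))


def ctr_encrypt(key: bytes, iv: int, blocks: list) -> list:
    # No block needs the result of another one, so these calls could run
    # in any order, in parallel, or back to back in a pipeline.
    return [ctr_block(key, iv, i, m) for i, m in enumerate(blocks)]


key = b"demo key (hypothetical)"
iv = 1000
message = [bytes([i]) * BLOCK_BYTES for i in range(4)]
ciphertext = ctr_encrypt(key, iv, message)
recovered = ctr_encrypt(key, iv, ciphertext)  # CTR decryption is the same operation
```

Because each block is independent of the others, a non-feedback mode lets a hardware implementation keep many blocks in flight at once.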
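Feedback cipher modes - CBC
[Figure: message blocks M1, M2, M3, ..., MN-1, MN; each is XORed with the previous ciphertext block (IV for the first block) and encrypted by E to give C1, C2, C3, ..., CN-1, CN]
C1 = AES(M1 ⊕ IV)
Ci = AES(Mi ⊕ Ci-1) for i=2..N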
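The CBC chaining can be sketched in the same spirit. Again `stand_in_cipher` is a hypothetical SHA-256-based placeholder for AES (not a real cipher, and not invertible, so only the encryption direction is shown); all names and sample data are assumptions.

```python
import hashlib

BLOCK_BYTES = 16  # 128-bit block size, as in AES


def stand_in_cipher(key: bytes, block: bytes) -> bytes:
    """Hypothetical stand-in for AES(.); NOT AES and NOT secure."""
    return hashlib.sha256(key + block).digest()[:BLOCK_BYTES]


def xor_blocks(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def cbc_encrypt(key: bytes, iv: bytes, blocks: list) -> list:
    ciphertext = []
    previous = iv  # the first block is chained with IV
    for m_i in blocks:
        # Ci = E(Mi XOR Ci-1): block i cannot start before block i-1 is done.
        previous = stand_in_cipher(key, xor_blocks(m_i, previous))
        ciphertext.append(previous)
    return ciphertext


key = b"demo key (hypothetical)"
iv = bytes(BLOCK_BYTES)
message_a = [bytes([i]) * BLOCK_BYTES for i in range(1, 4)]
message_b = [b"\xff" * BLOCK_BYTES] + message_a[1:]  # differs only in block 1
c_a = cbc_encrypt(key, iv, message_a)
c_b = cbc_encrypt(key, iv, message_b)
```

Changing only the first message block changes every later ciphertext block, which illustrates the sequential dependency that prevents processing several blocks of one message simultaneously in feedback modes.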
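Architectures suitable for feedback modes
[Figure: three architectures. Basic iterative architecture: a register, a multiplexer (MUX), and combinational logic implementing one round. Partial loop unrolling: a MUX followed by round 1, round 2, ..., round K. Full loop unrolling: round 1, round 2, ..., round #rounds.]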
Partial Loop Unrolling
[Figure: a register and a multiplexer followed by combinational logic implementing K rounds: round 1, round 2, ..., round K]
Loop Unrolling: Speed vs. Area
[Figure: throughput versus area for the basic architecture, loop unrolling (k=2, k=3, k=4, k=5), and resource sharing]
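Decreasing area by resource sharing
[Figure: before and after resource sharing. Before: two instances of F operate on D0 and D1, producing D0' and D1'. After: a multiplexer feeds D0 and D1 to a single shared F, with registers holding D0' and D1'.]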
First basic architecture of Serpent - Serpent I1
[Figure: 128-bit register; round key Ki; eight sets of S-boxes (32 x S-box 0, 32 x S-box 1, ..., 32 x S-box 7); 8-to-1 128-bit multiplexer; linear transformation; final key K32; 128-bit output. The loop implements one regular Serpent round.]
Alternative basic architecture of Serpent: Serpent I8
[Figure: 128-bit register; round 0 (K0, 32 x S-box 0, linear transformation) through round 7 (K7, 32 x S-box 7, linear transformation); final key K32; 128-bit output]
One implementation round of Serpent = 8 regular cipher rounds
Our Results: Basic architecture - Speed
Throughput [Mbit/s]
Serpent: 431
Rijndael: 414
Twofish: 177
RC6: 142
Mars: 61
3DES: 59
Our Results: Basic architecture - Area
[Figure: bar chart of area [CLB slices] for Serpent, Rijndael, Twofish, RC6, Mars, and 3DES; bar values shown: 1076, 1137, 2744, 2507, 4507, 356]
Comparison with results of other groups: Speed
[Figure: bar chart of throughput [Mbit/s] for Serpent I8, Serpent I1, Rijndael, Twofish, RC6, and Mars, comparing Our Results with Worcester Polytechnic Institute and University of Southern California; bar values shown: 431, 444, 414, 353, 294, 177, 173, 104, 149, 62, 143, 112, 88, 102, 61]
Comparison with results of other groups: Area
[Figure: bar chart of area [CLB slices] for Serpent I8, Serpent I1, Rijndael, Twofish, RC6, and Mars, comparing Our Results with Worcester Polytechnic Institute and University of Southern California; bar values shown: 1250, 5511, 1076, 2809, 2666, 1137, 1749, 2638, 2507, 4312, 3528, 2744, 4621, 4507, 7964]
Our Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - Virtex FPGA
[Figure: scatter plot of throughput [Mbit/s] versus area [CLB slices] for Rijndael, Serpent I8, Serpent I1, Twofish, RC6, and Mars]
NSA Results: Encryption in cipher feedback modes (CBC, CFB, OFB) - ASIC, 0.5 µm CMOS
[Figure: scatter plot of throughput [Mbit/s] versus area for Rijndael, Serpent I1, Twofish, RC6, and Mars]
Conclusions for feedback cipher modes (1) (CBC, CFB, OFB)
• Speed (throughput) should be the primary criterion of comparison
• Basic iterative architecture is the most appropriate for comparison and future implementations
• Serpent and Rijndael are over twice as fast as the next best candidate for all implementations
Conclusions for feedback cipher modes (2) (CBC, CFB, OFB)
• Results confirmed by
  - three independent university groups for FPGAs, and
  - NSA group for ASICs
• Results of comparison independent of implementation technology (FPGAs vs. ASICs)
Survey filled by 167 participants of the Third AES Conference, April 2000
[Figure: bar chart of # votes for Serpent, Rijndael, Twofish, RC6, and Mars]
Our Results: Basic architecture - Speed
[Figure: bar chart of throughput [Mbit/s] for Serpent, Rijndael, Twofish, RC6, and Mars]
Comparison for non-feedback cipher modes, e.g. Counter Mode - CTR
[Figure: counter values IV, IV+1, IV+2, ..., IV+N-1, IV+N are each encrypted by E and XORed with message blocks M0, M1, M2, ..., MN-1, MN to give the ciphertext blocks]
Ci = Mi ⊕ AES(IV+i) for i=0..N
NSA approach: Traditional methodology
[Figure: three architectures. One round, no pipelining: MUX, register, and combinational logic. Partial outer-round pipelining: MUX and K registers, with round 1, round 2, ..., round K each forming one pipeline stage. Full outer-round pipelining: round 1, round 2, ..., round #rounds each forming one pipeline stage.]
Our approach: New methodology
[Figure: four panels, a) to d), showing: one round with no pipelining (MUX, register, combinational logic); one round = k pipeline stages (MUX, k registers); round 1, round 2, ..., round K, each = k pipeline stages (MUX, k registers); round 1, round 2, ..., round #rounds, each = k pipeline stages (k registers)]
Our approach: Inner-Round Pipelining
[Figure: a multiplexer followed by one round split into pipeline stage 1, pipeline stage 2, ..., pipeline stage k, separated by register 1, register 2, ..., register k]
Comparison of the traditional and new design methodologies
[Figure: throughput versus area for the basic architecture, outer-round pipelining, inner-round pipelining, and mixed inner and outer-round pipelining; marked points: K=2, K=3, K=4; K=2, K=3; k=2, kopt]
Latency vs. area dependence for the new design methodology
[Figure: latency versus area for the basic architecture, outer-round pipelining, inner-round pipelining, and mixed inner and outer-round pipelining; marked points: K=2, K=3, K=4, K=5; K=2, K=3; k=2, kopt]
NSA architecture: Full outer-round pipelining
[Figure: round 1, round 2, ..., round #rounds, each forming one pipeline stage, separated by #rounds registers]
Total # of pipeline stages = #rounds
NSA Results: Full outer-round pipelining
[Figure: bar chart of throughput [Gbit/s], CMOS ASIC 0.5 µm, for Mars, RC6, Twofish, Rijndael, and Serpent; bar values shown: 2.2, 5.7, 2.3, 2.2, 8.0]
Our approach: Full mixed inner- and outer-round pipelining
[Figure: round 1, round 2, ..., round #rounds, each split into k pipeline stages by k registers]
Total # of pipeline stages = #rounds·k
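Combining this stage count with the throughput definition given earlier in the talk (Throughput = Block_size · Number_of_blocks_processed_simultaneously / Latency) gives a simple model of a fully pipelined design. The sketch assumes that every pipeline stage holds one block and takes one clock cycle; the round count, k, and clock period are hypothetical numbers, not measurements from the slides.

```python
def full_mixed_pipeline(rounds, k, block_bits, clock_period_s):
    """Model of full mixed inner- and outer-round pipelining.

    Assumptions (not from the slides): one clock cycle per pipeline stage,
    and one block in flight per stage once the pipeline is full.
    """
    stages = rounds * k                    # Total # of pipeline stages = #rounds * k
    latency_s = stages * clock_period_s    # a block passes through every stage
    throughput_bps = block_bits * stages / latency_s
    return stages, latency_s, throughput_bps


# Hypothetical cipher: 10 rounds, k = 3 stages per round, 128-bit block, 10 ns clock.
stages, latency_s, throughput_bps = full_mixed_pipeline(10, 3, 128, 10e-9)
print(stages, f"{latency_s * 1e9:.0f} ns", f"{throughput_bps / 1e9:.1f} Gbit/s")
```

Under these assumptions the throughput reduces to block size divided by clock period, so adding stages raises throughput only to the extent that it shortens the clock period, while latency grows with the number of stages.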
Our Results: Full mixed pipelining
[Figure: bar chart of throughput [Gbit/s] on Virtex FPGA for Serpent, Twofish, RC6, and Rijndael; bar values shown: 16.8, 15.2, 13.1, 12.2]
Speed-up compared to the basic architecture
[Figure: bar chart of speed-up for Serpent I8, Serpent I1, Rijndael, Twofish, RC6, and Mars, comparing Our results with NSA; bar values shown: 29.5, 9.5, 39, 40, 86, 21.5, 91.5, 21, 38]
Our Results: Full mixed pipelining
Area [CLB slices]
Serpent: 19,700
Twofish: 21,000
RC6: 46,900
Rijndael: 12,600 plus 80 RAMs (dedicated memory blocks)
Our Results: Increase in the circuit latency
Latency without and with pipelining (chart axis in µs, 0 to 6; bar labels in ns)
Serpent I8: 297 → 733 (x 2.5)
Rijndael: 309 → 737 (x 2.4)
Twofish: 722 → 3092 (x 4.3)
RC6: 897 → 5490 (x 6.1)
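The multiplication factors in this chart can be checked with a few lines of arithmetic. The pairing of the bar labels into (without pipelining, with pipelining) pairs and their interpretation as nanoseconds are assumptions made here, chosen because they reproduce the factors printed on the chart.

```python
# (latency without pipelining, latency with pipelining), assumed to be in ns.
latency_pairs_ns = [(297, 733), (722, 3092), (897, 5490), (309, 737)]

# Ratio of pipelined to non-pipelined latency, rounded as on the chart.
factors = [round(with_p / without_p, 1) for without_p, with_p in latency_pairs_ns]
print(factors)  # [2.5, 4.3, 6.1, 2.4]
```

The largest pipelined latency, 5490 ns, is consistent with a chart axis running from 0 to 6 µs.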
Conclusions for non-feedback cipher modes (1) (ECB, counter)
• All ciphers can achieve approximately the same speed. Area should be the primary criterion of comparison.
• Architecture with inner-round pipelining combined with full outer-round pipelining is the most appropriate for comparison and future implementations
• Serpent, Twofish and Rijndael are the most cost-efficient and take approximately the same amount of area
Conclusions for non-feedback cipher modes (2) (ECB, counter)
No agreement regarding the methodology and architecture used for comparison
NSA methodology favored ciphers with
• short cipher round
• large number of rounds
Our methodology
• fair
• practical (superior throughput/area ratio)
Importance of the AES candidate hardware efficiency comparison
• Important factor used to differentiate among final candidates
  - objective and commonly accepted measures
  - good agreement among results from various groups
  - large differences among final candidates
• Efficient architectures and methodologies developed for all algorithms