INSTRUCTION SET EXTENSIONS FOR
ENHANCING THE PERFORMANCE OF SYMMETRIC KEY
CRYPTOGRAPHIC ALGORITHMS
BY
SEAN R. O'MELIA
BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)
SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
UNIVERSITY OF MASSACHUSETTS LOWELL
Signature of Author Date
Dr. Adam J. Elbirt, Thesis Advisor
Prof. George P. Cheney, Thesis Committee Member
Dr. Dalila B. Megherbi, Thesis Committee Member
-
INSTRUCTION SET EXTENSIONS FOR
ENHANCING THE PERFORMANCE OF SYMMETRIC KEY
CRYPTOGRAPHIC ALGORITHMS
BY
SEAN R. O'MELIA
BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)
ABSTRACT OF A THESIS SUBMITTED TO THE FACULTY OF THE
DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
FOR THE DEGREE OF
MASTER OF SCIENCE IN ENGINEERING
UNIVERSITY OF MASSACHUSETTS LOWELL
2007
Thesis Advisor: Dr. Adam J. Elbirt
Assistant Professor, Department of Computer Science
-
ABSTRACT
In this thesis, instruction set extensions for a RISC processor are presented
to improve the performance in software of the Data Encryption Standard (DES),
Triple-DES, International Data Encryption Algorithm (IDEA), and Advanced En-
cryption Standard (AES) algorithms. The most computationally intensive operations
of each algorithm are handled by a set of new instructions. The hardware supporting
these instructions is integrated into the processor's datapath. For each of the targeted
algorithms, comparisons are presented between traditional software implementations
and new implementations that take advantage of the extended instruction set ar-
chitecture. Results show that utilization of the proposed instructions significantly
reduces program code size and improves encryption and decryption throughput. The
additional hardware resources required by the custom hardware increase the
total area of the processor by less than fifty percent.
-
ACKNOWLEDGEMENTS
There are several people I wish to thank for their assistance and support
in the completion of this thesis. I would like to express many thanks to my advisor,
Dr. Adam J. Elbirt, who has been an excellent guide throughout all stages of the
research, and Prof. George Cheney and Dr. Dalila Megherbi for their membership
on the defense committee. I received a great deal of support on technical matters
from Gaisler Research, the creator of the LEON2 processor, and the members of the
Instruction Set Extensions for Cryptography Project at Graz University of Technol-
ogy. Their advice was most helpful for understanding the LEON2 model and how the
processor architecture can be extended. I would also like to thank all of my friends
and family, who have encouraged me throughout the course of my work.
-
Contents
List of Figures
List of Tables
1 INTRODUCTION
2 PREVIOUS WORK
3 THE LEON2 PROCESSOR
  3.1 Overview
  3.2 VHDL Model Hierarchy
  3.3 Processor Architecture
  3.4 SPARC® V8 Instruction Model
  3.5 Customization
  3.6 Synthesis and Simulation
  3.7 Software Development Tools
4 TARGET ALGORITHMS
  4.1 Triple-DES
    4.1.1 The DES Algorithm
    4.1.2 DES Key Schedule
    4.1.3 The Triple-DES Algorithm
    4.1.4 Triple-DES Key Schedule
    4.1.5 Performance
  4.2 IDEA
    4.2.1 Mathematical Background
    4.2.2 Algorithm Description
    4.2.3 Key Schedule
    4.2.4 Performance
  4.3 AES
    4.3.1 Mathematical Background
    4.3.2 Algorithm Description
    4.3.3 Key Schedule
    4.3.4 Performance
  4.4 Modes of Operation
5 PROPOSED INSTRUCTION SET EXTENSIONS
  5.1 DES and Triple-DES
    5.1.1 Initial and Final Permutations
    5.1.2 Set Encryption Direction
    5.1.3 Key Loading
    5.1.4 Round Core (f) Function
    5.1.5 New DES and Triple-DES Algorithm Implementations
  5.2 IDEA
    5.2.1 Multiplication modulo 2^16 + 1
    5.2.2 New IDEA Algorithm Implementation
  5.3 AES
    5.3.1 SubBytes Operations
    5.3.2 GF(2^m) Matrix Multiplier Constant Loading
    5.3.3 GF(2^m) Matrix Multiplication
    5.3.4 New AES Algorithm Implementations
6 LEON2 HARDWARE AND SOFTWARE TOOLCHAIN MODIFICATIONS
  6.1 Custom Hardware Units
    6.1.1 DES Permutation Unit
    6.1.2 DES Round f-function Unit
    6.1.3 DES Key Generator
    6.1.4 Modulo (2^16 + 1) Multiplier
    6.1.5 AES S-Boxes
    6.1.6 Galois Field Fixed Field Constant Multiplier
  6.2 Architecture Modifications
  6.3 Modifications to Software Development Tools
7 RESULTS AND ANALYSIS
  7.1 Testing Methodology
  7.2 Software Code Size
  7.3 Algorithm Execution Times
  7.4 Hardware Utilization
  7.5 Throughput to Area Comparisons
8 CONCLUSIONS AND FUTURE WORK
REFERENCES
Appendix A: VHDL Source for Custom Functional Units
Appendix B: Modifications to LEON2 VHDL Model and Development Tools
Appendix C: Test Vectors for Functional Evaluation
Appendix D: Example Source Code for Functional and Performance Evaluations
About the Author
-
List of Figures
1 Structure of SPARC® V8 Format 3 instructions
2 Block diagram for standard block ciphers
3 The Data Encryption Standard algorithm
4 The DES f-function
5 The computation graph for IDEA
6 The AES encryption process
7 The state representation of data blocks in AES
8 The AES decryption process
9 The key expansion process for AES
10 DES encryption routine with custom instructions
11 DES decryption routine with custom instructions
12 Triple-DES encryption routine with custom instructions
13 Triple-DES decryption routine with custom instructions
14 IDEA algorithm routine with custom instructions
15 AES encryption routine with aessb and gfmmul instructions
16 AES decryption routine with aessb and gfmmul instructions
17 AES encryption routine with aessb4 and gfmmul instructions
18 AES decryption routine with aessb4 and gfmmul instructions
19 DES permutation unit
20 DES key generator
-
List of Tables
1 LEON2 VHDL Model File Hierarchy
2 The Initial Permutation IP
3 The Expansion Operation E
4 The DES S-Boxes
5 The Pre-Output Permutation P
6 The Final Permutation FP
7 Permuted Choice 1 (PC-1)
8 Rotations for the DES key schedule
9 Permuted Choice 2 (PC-2)
10 IDEA key schedule
11 Interpretation of the asi field for the DES permutation instructions
12 Usage of the simm13 field by the aessb and aessb4 instructions
13 Code size in bytes for DES
14 Code size in bytes for Triple-DES
15 Code size in bytes for IDEA
16 Code size in bytes for AES without gfmmul instruction
17 Code size in bytes for AES with gfmmul instruction
18 Execution cycles for DES
19 Execution cycles for Triple-DES
20 Execution cycles for IDEA
21 Execution cycles for AES without gfmmul instruction
22 Execution cycles for AES with gfmmul instruction
23 Comparison with ISEC Extensions in C/Inline Assembly
24 Comparison with ISEC Extensions in Pure Assembly
25 Hardware utilization on the Xilinx XC4VLX25 FPGA
26 Throughput to area ratios for algorithm implementations
-
1 INTRODUCTION
With more than 188 million Americans connected to the Internet [1], information
security has become a top priority. Many applications, such as electronic mail,
electronic banking, medical databases, and electronic commerce, require the
exchange of private information. For example, when engaging in electronic commerce,
customers provide credit card numbers when purchasing products. If the connection
is not secure, an attacker can easily obtain this sensitive data. In order to imple-
ment a comprehensive security plan for a given network to guarantee the security of
a connection, the following services must be provided [2], [3], [4]:
• Confidentiality: Information cannot be observed by an unauthorized party. This
  is accomplished via public-key and private-key encryption.
• Data Integrity: Transmitted data within a given communication cannot be
  altered in transit, whether by error or by an unauthorized party. This is
  accomplished via the use of hash functions and Message Authentication Codes.
• Authentication: Parties within a given communication session must provide
  certifiable proof of their identity. This is accomplished via the use of digital
  signatures.
• Non-repudiation: Neither the sender nor the receiver of a message may deny
  transmission. This is accomplished via digital signatures and third party notary
  services.
Cryptographic algorithms used to ensure confidentiality fall within one of two
categories: private-key (also known as symmetric-key) and public-key. Symmetric-key
algorithms use the same key for both encryption and decryption. Conversely, public-
key algorithms use a public key for encryption and a private key for decryption. In a
typical session, a public-key algorithm will be used for the exchange of a session key
and to provide authenticity through digital signatures. The session key is then used
in conjunction with a symmetric-key algorithm. Symmetric-key algorithms tend to
be significantly faster than public-key algorithms and as a result are typically used
in bulk data encryption [3]. The two types of symmetric-key algorithms are block
ciphers and stream ciphers. Block ciphers operate on a block of data while stream
ciphers encrypt individual bits. Block ciphers are typically used when performing
bulk data encryption, and the data transfer rate of the connection directly follows the
throughput of the implemented algorithm.
High throughput encryption and decryption are becoming increasingly im-
portant in the area of high-speed networking. Many applications demand the creation
of networks that are both private and secure while using public data-transmission
links. These systems, known as Virtual Private Networks (VPNs), can demand en-
cryption throughputs at speeds exceeding Asynchronous Transfer Mode (ATM) rates
of 622 million bits per second (Mbps). Increasingly, security standards and applica-
tions are defined to be algorithm independent. Although context switching between
algorithms can be easily realized via software implementations, the task is significantly
more difficult when using hardware implementations. The advantages of a software
implementation include ease of use, ease of upgrade, ease of design, portability, and
flexibility. However, a software implementation offers only limited physical security,
especially with respect to key storage [3], [5]. Conversely, cryptographic algorithms
that are implemented in hardware are by nature more physically secure as they can-
not easily be read or modified by an outside attacker when the key is stored in special
memory internal to the device [5]. As a result, the attacker does not have easy access
to the key storage area and cannot discover or alter its value in a straightforward
manner [3].
When using a general-purpose processor, even the fastest software imple-
mentations of block ciphers cannot satisfy the required bulk data encryption data
rates for high-end applications [6], [7], [8], [9], [10]. As a result, hardware imple-
mentations are necessary for block ciphers to achieve this required performance level.
Although traditional hardware implementations lack flexibility with respect to al-
gorithm and parameter switching, configurable hardware devices offer a promising
alternative for the implementation of processors via the use of IP cores in Applica-
tion Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA)
technology. To illustrate, Altera Corporation offers IP core implementations of the
Intel 8051 microcontroller and the Motorola 68000 processor in addition to their
own Nios® II embedded processor [11]. Similarly, Xilinx Inc. offers IP core
implementations of the PowerPC processor in addition to their own MicroBlaze™ and
PicoBlaze™ embedded processors [12]. ASIC and FPGA technologies provide the
opportunity to augment the existing datapath of a processor implemented via an IP
core to add acceleration modules supported through newly defined instruction set
extensions targeting performance-critical functions [13], [14], [15]. Moreover, many
licensable and extendible processor cores are also available for the same purpose [16],
[17], [18], [19].
The use of instruction set extensions follows the hardware/software co-
design paradigm to achieve the performance and physical security associated with
hardware implementations while providing the portability and flexibility traditionally
associated with software implementations [20]. Moreover, when considering alternative
solutions, instruction set extensions result in significant performance improvements
versus traditional software implementations with considerably reduced logic
resource requirements versus hardware-only solutions such as co-processors [21], [22],
[23], [24], [25], [26], [27], [28], [29]. It is the goal of this research to demonstrate a set
of instruction set extensions for a reduced instruction set computing (RISC) processor
that enhance the performance of symmetric-key algorithms in software implementa-
tions.
Chapter 2 discusses related work on the various methods of speeding up
symmetric-key algorithms in software. These include optimization of pure software
implementations, off-loading of cryptographic algorithm execution to co-processors,
and other instruction set extensions. It will be shown that advances in technology
have fueled trends towards increased reconfigurability in embedded systems, resulting
in instruction set extensions becoming a more viable and attractive option when the
performance of symmetric-key algorithms is critical.
Chapter 3 describes the target processor, the LEON2 RISC processor whose
architecture is based on the SPARC® architecture. The LEON2 processor was chosen
for its robust design, ease of configurability, and customization through the full
availability of the model's hardware description language (HDL) source code.
In Chapter 4, the cryptographic algorithms that are the focus of this re-
search are explained. These include the Data Encryption Standard (DES) and Triple-
DES, the International Data Encryption Algorithm (IDEA), and the Advanced En-
cryption Standard (AES). The various performance bottlenecks that are commonly
encountered in software implementations of these algorithms are also discussed.
Syntax and encoding of the newly developed instructions are presented in
Chapter 5, followed by a description of the modifications to the LEON2 processor
and its associated development tools in Chapter 6. To evaluate the effectiveness of
the instruction set extensions, Chapter 7 presents data on the logic utilization of the
custom hardware, as well as throughput data for the target algorithms both with
and without the use of the custom instructions. Chapter 8 concludes the thesis
with recommendations for investigating further architectural enhancements to
the LEON2 processor for supporting symmetric-key algorithms.
-
2 PREVIOUS WORK
Most traditional methods for improving the throughput of pure software
implementations of symmetric-key algorithms fall into one of two categories. One
option is to construct memory-based look-up tables where results of some of the basic
operations of the algorithm have been pre-computed and stored. The substitution
boxes, or S-Boxes, of the DES and AES algorithms are commonly stored in look-up
tables in software implementations. Look-up tables may also be used to combine
operations used in the DES and AES algorithms. An implementation of DES in [30]
combines the S-Box table look-up with the subsequent 32-bit permutation and uses
tables as part of the Initial and Final Permutations. The AES algorithm requires
several complicated mathematical operations that are time-consuming on general-
purpose processors. Therefore in some implementations, large look-up tables, called
T-tables, are employed that combine several of these complex operations into a single
table access [20]. A look-up table based implementation is a viable option for systems
with a large memory space and low memory access times. However, area-constrained
systems suffer large performance penalties under these implementations [20], [29],
so look-up tables are generally not employed in those environments.
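The idea of folding a substitution and the permutation that follows it into one table can be illustrated with a minimal sketch. It assumes a toy 16-bit step with a single 4-bit S-Box and a fixed bit permutation; the names SBOX, PERM, slow_step, and fast_step, and the table values, are hypothetical choices for this illustration and are not the DES or AES tables or the implementation of [30].

```python
# Toy 16-bit substitution-permutation step. SBOX and PERM are illustrative
# values chosen for this sketch; they are not the actual DES or AES tables.
SBOX = [0xE, 0x4, 0xD, 0x1, 0x2, 0xF, 0xB, 0x8,
        0x3, 0xA, 0x6, 0xC, 0x5, 0x9, 0x0, 0x7]
# PERM[i] is the source bit index (0 = least significant) of output bit i.
PERM = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15]


def permute(word):
    """Apply the fixed 16-bit permutation one bit at a time."""
    out = 0
    for dst, src in enumerate(PERM):
        out |= ((word >> src) & 1) << dst
    return out


def slow_step(word):
    """Substitute each 4-bit nibble, then permute the whole word."""
    substituted = 0
    for n in range(4):
        nibble = (word >> (4 * n)) & 0xF
        substituted |= SBOX[nibble] << (4 * n)
    return permute(substituted)


# Precompute, for each nibble position and nibble value, the already
# permuted contribution of that S-Box output to the 16-bit result.
TABLES = [[permute(SBOX[v] << (4 * n)) for v in range(16)] for n in range(4)]


def fast_step(word):
    """Same result as slow_step using four table look-ups and ORs."""
    out = 0
    for n in range(4):
        out |= TABLES[n][(word >> (4 * n)) & 0xF]
    return out
```

The combined tables trade memory for speed: here four tables of sixteen 16-bit entries replace the bit-by-bit permutation, which mirrors the memory cost that makes this approach less attractive on area-constrained systems.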
Another method for speeding up software implementations of cryptographic
algorithms involves taking advantage of mathematical or structural properties of the
particular algorithm. The Initial and Final Permutations of the DES algorithm have
regular structures that make it possible to execute a series of matrix transformations
and exclusive-OR operations as demonstrated in [31]. This translates into a sequence
of instructions that is much smaller than the traditional sequence required to perform
the Initial and Final Permutations. In previous work on improving the performance of
the AES algorithm on 32-bit systems, it has been shown that transforming a block of
plaintext from a column-oriented matrix to a row-oriented matrix reduces the number
of instructions required to complete the cipher rounds. In particular, a row-oriented
representation allows for a more efficient implementation of the Galois Field matrix
multiplication operations required for AES encryption and decryption [32].
In order to extend the cryptographic capabilities of an embedded system
without modifying the main processor, a co-processor solution can be adopted. When
there is data that must be encrypted or decrypted through the chosen symmetric-key
algorithm, the main processor sends the data and key material to the co-processor,
and the co-processor performs the algorithm, sending the processed data back over the
interface to the main processor. Most co-processor solutions have tended to combine
a number of different algorithms to provide a multi-faceted security solution. Co-
processors have generally achieved high throughput values compared to traditional
software implementations and therefore are much more capable of meeting demands
for speed-critical network communications. However, this type of solution is generally
associated with considerable overhead in terms of hardware area utilization, data
transfer latency, and complex interfaces to the main processor [23], [33], [34], [35],
[36], [37], [38], [39], [40].
There has been previous work on instruction set extensions for general per-
mutations that are useful for improving the performance of permutations for the
DES algorithm. Shi and Lee [41] presented two new instructions for general and
dynamically specified permutations. The input and a string of configuration bits are
specified in the source operands and the result is stored in the destination register.
These instructions, along with two new instructions, are discussed in [21]. In general,
permutations of n bits require log_2(n) issues of the custom instructions, as well
as several loads into registers of configuration bits. The MOSES platform developed
by a group based at NEC Research Laboratories is based on the Xtensa T1040, a
RISC-like processor designed to be easily extended with additional custom hardware
and supporting instructions. Throughput improvement factors of 31.0 for DES, 33.9
for Triple-DES, and 17.4 for AES were reported for this custom architecture [42], [43].
Study of the effect of custom instructions that support the AES cipher is
extensive. Most of this work targets the memory look-ups and multiplications that
are needed to perform the encryption rounds and key schedule. The Instruction Set
Extensions for Cryptography (ISEC) project conducted at the Graz University of
Technology in Graz, Austria, has investigated instruction set extensions that perform
the mathematical operations in the AES rounds using custom functional units integrated
into the targeted processor's datapath [29]. Earlier work in the ISEC project
demonstrated the effectiveness of instruction set extensions for elliptic curve cryptog-
raphy [27] in improving the performance of binary extension Galois Field arithmetic
[28].
-
3 THE LEON2 PROCESSOR
3.1 Overview
The target processor for this work is the LEON2, a RISC central processing
unit (CPU) that was produced by Gaisler Research [44] (note that at the time of
this writing, support for the LEON2 processor has been discontinued in favor of the
newer LEON3 processor model). The LEON2 processor is implemented in VHDL
and is fully synthesizable. The model is highly configurable, allowing for adjustments
to many features of the processor using a graphical configuration utility. The entire
source code is freely available under the GNU General Public License which enables
custom modifications and enhancements to the architecture. Information presented
in this chapter is derived from the LEON2 documentation [45], [46].
3.2 VHDL Model Hierarchy
The source code for the LEON2 processor has the directory structure shown
in Table 1. The top-level folder /leon2/ is used as an example; it is permissible for
the root directory to have any name.
3.3 Processor Architecture
LEON2 is based on the Scalable Processor Architecture (SPARC®).
SPARC® was first developed in 1985 at Sun Microsystems and is based on the work
that produced the RISC I and RISC II architectures at the University of California at
Berkeley during the early 1980s [47]. The LEON2 processor attained full certification
of compliance with the SPARC® V8 architecture in 2003 [48].

Folder            Description                       Refer to
leon2/            Top directory                     Sec. 3.2
leon2/boards/     FPGA board support files          Sec. 3.6
leon2/doc/        User manuals                      [46]
leon2/leon/       LEON2 processor VHDL model        Sec. 3.3
leon2/pmon/       Simple boot-monitor               Not discussed
leon2/sim/        Simulator support files           Sec. 3.6
leon2/syn/        Synthesis support files           Sec. 3.6
leon2/tbench/     LEON2 VHDL test bench             Sec. 3.6
leon2/tkconfig/   graphical configuration utility   Sec. 3.5
leon2/tsource/    LEON2 test bench (C source)       Sec. 3.6
Table 1: LEON2 VHDL Model File Hierarchy
Features of the LEON2 processor coding style include fully synchronous
design with a single clock, use of multiplexers for loading of pipeline registers, sep-
arate combinational and sequential processes, and record types for interconnection
of component I/O signals. LEON2 provides support for on-chip peripherals such as
a floating-point unit (FPU), Peripheral Component Interconnect (PCI), and Ethernet;
co-processor support is also available in accordance with the SPARC® model.
However, these features are outside the scope of this research and are therefore not
discussed in any further detail. The main focus of the LEON2 architecture with regard
to the proposed instruction set extensions is the pipelined integer unit (IU).
The IU pipeline consists of five stages: fetch, decode, execute, memory, and write back.
The VHDL model implements each stage in its own process. A process
statement in VHDL is a closed block of code that runs sequentially. The inputs are
specified by a sensitivity list. The process statement executes whenever a signal
in the sensitivity list changes state. Processes are used for behavioral VHDL code, a
high-level coding style used commonly for describing sequential logic [49].
3.4 SPARC® V8 Instruction Model
All SPARC® V8 instructions are implemented in the LEON2 processor
architecture. Instructions are grouped according to the values of the various fields in
the instruction operation code. Arithmetic, logic, and memory operations have the
Format 3 structure [47] shown in Figure 1.
op | rd | op3 | rs1 | i=0 | asi | rs2
op | rd | op3 | rs1 | i=1 | simm13
Figure 1: Structure of SPARC® V8 Format 3 instructions
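The two Format 3 layouts can be made concrete with a short encoding sketch. The field widths used here (op: 2 bits, rd: 5, op3: 6, rs1: 5, i: 1, asi: 8, rs2: 5, simm13: 13) are taken from the SPARC V8 architecture manual rather than from the text above, and the helper name encode_format3 is a hypothetical one introduced only for illustration.

```python
def encode_format3(op, rd, op3, rs1, rs2=None, simm13=None, asi=0):
    """Pack the fields of a SPARC V8 Format 3 instruction into a 32-bit word.

    Exactly one of rs2 (register form, i=0) or simm13 (immediate form, i=1)
    must be given. This is an illustrative helper, not part of any toolchain.
    """
    if (rs2 is None) == (simm13 is None):
        raise ValueError("give exactly one of rs2 or simm13")
    word = ((op & 0x3) << 30) | ((rd & 0x1F) << 25)
    word |= ((op3 & 0x3F) << 19) | ((rs1 & 0x1F) << 14)
    if simm13 is None:
        # i = 0: bits 12..5 hold asi, bits 4..0 hold rs2
        word |= ((asi & 0xFF) << 5) | (rs2 & 0x1F)
    else:
        # i = 1: bits 12..0 hold a sign-extended 13-bit immediate
        if not -4096 <= simm13 <= 4095:
            raise ValueError("simm13 out of range")
        word |= (1 << 13) | (simm13 & 0x1FFF)
    return word


# add %g1, %g2, %g3  (op = 2, op3 = 0x00, rs1 = 1, rs2 = 2, rd = 3)
ADD_REG = encode_format3(op=2, rd=3, op3=0x00, rs1=1, rs2=2)
# add %g1, -1, %g3   (immediate form of the same instruction)
ADD_IMM = encode_format3(op=2, rd=3, op3=0x00, rs1=1, simm13=-1)
```

In the register form the asi field is unused by ordinary arithmetic instructions, which leaves those eight bits available for other purposes when i = 0.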
3.5 Customization
Most of the available features of the LEON2 processor can be enabled, dis-
abled, or adjusted by using the graphical configuration utility. For the purposes of
this work, a basic configuration is used with no FPU, PCI, Ethernet, co-processor in-
terface, or hardware multiplier or divider. To extend the LEON2 architecture beyond
the scope of the standard model, additional VHDL code is required. The specific
files that must be modified depend on what functionality is to be added, but if the
instruction set is to be extended, the module containing the SPARC® V8 opcode
constants must be updated, and the new instructions must follow the SPARC® V8
architecture specification [47]. The graphical configuration utility may also be mod-
ified to provide an easy interface for adjusting parameters of the custom functionality.
3.6 Synthesis and Simulation
The LEON2 VHDL implementation is a fully synthesizable processor that
can be targeted to any type of FPGA or ASIC technology. There are pre-made
packages for several synthesis tools such as XST, Synplify, Synopsys, and Leonardo
in the /syn/ sub-folder of the LEON2 directory structure. These packages enable use
of technology-specific cells to directly instantiate or automatically infer the register
files, caches, PCI FIFOs, and I/O pads. There are also a number of packages in the
/boards/ sub-folder that support programming of physical FPGA boards [50] with
the LEON2 architecture.
Functional verification of programs built for the LEON2 architecture can be
performed with the provided generic test bench. The VHDL source for the test bench
is located in the /tbench/ sub-folder of the LEON2 directory structure. Software code
is placed in the /tsource/ sub-folder in a format readable by the test bench VHDL
code. The software can then be read and executed by the test bench for purposes of
functional verification and performance evaluation.
3.7 Software Development Tools
In order to facilitate the development of programs targeting the LEON2
processor, Gaisler Research has provided a series of compilers and simulators that
may be chosen depending on the software environment. For stand-alone applications,
the Bare C Compiler (BCC) is recommended. BCC is based on the GNU Compiler
Collection (GCC) and GNU binutils. The BCC development tools are used in the
same way as those included in the standard GCC and binutils packages. The actual
names of the executables have a sparc-elf- prefix.
Packages containing the binaries for Linux and Cygwin environments are
available, as well as the full source code for developers who wish to take advantage
of an expanded LEON2 architecture.
4 TARGET ALGORITHMS
4.1 Triple-DES
4.1.1 The DES Algorithm
Many block ciphers may be characterized as Feistel networks [3]. Feistel
networks were invented by Horst Feistel [51] and are a general method of transforming
a function into a permutation. The basic Feistel network divides the data into two
halves where one half operates upon the other [52]. The f-function uses one of the
halves of the data block and a key to create a pseudo-random bit stream that is
used to encrypt or decrypt the other half of the data block. Therefore, to encrypt or
decrypt both halves requires two iterations of the Feistel network.
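The basic two-half structure described above can be sketched as follows. This is a minimal illustration that assumes a 64-bit block split into two 32-bit halves; the round function toy_f and its constant are hypothetical stand-ins and not the f-function of DES or of any other cipher.

```python
MASK32 = 0xFFFFFFFF


def toy_f(half, key):
    """Hypothetical round function for illustration only.

    It does not need to be invertible: the Feistel structure itself
    makes the overall transformation a permutation.
    """
    return ((half * 0x9E3779B1) ^ key ^ (half >> 7)) & MASK32


def feistel(block, round_keys):
    """Run a basic Feistel network over a 64-bit block."""
    left, right = block >> 32, block & MASK32
    for key in round_keys:
        # One half passes through unchanged; the other is XORed with f.
        left, right = right, left ^ toy_f(right, key)
    # Undo the last swap so the same routine decrypts with reversed keys.
    return (right << 32) | left


KEYS = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]
PLAINTEXT = 0x0123456789ABCDEF
CIPHERTEXT = feistel(PLAINTEXT, KEYS)
RECOVERED = feistel(CIPHERTEXT, KEYS[::-1])
```

Each pass through the loop modifies only one half, which is why two iterations are needed before both halves have been processed. Running the same routine with the round keys in reverse order recovers the original block.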
A generalization of the basic Feistel network allows for the support of larger
data blocks. Generalization occurs by considering the data swap as a circular right
shift. This allows for the use of the same f-function but requires multiple rounds to
input all of the sub-blocks to the f-function [53]. Figure 2 from [53] details the block
diagram for block ciphers employing both the basic Feistel network and generalized
Feistel networks of three and four blocks. The f-function is represented by the box
and the ⊕ symbol represents a bit-wise XOR operation.
The f-function employs confusion and diffusion to obscure redundancies in
a plaintext message [54]. Confusion obscures the relationship between the plaintext,
the ciphertext, and the key. S-Box look-up tables are an example of a confusion
operation. Diffusion spreads the influence of individual plaintext or key bits over
as much of the ciphertext as possible. Expansion and permutation functions are
examples of diffusion operations [3]. The basic operations that may be found within
an f-function include:
• Bitwise XOR, AND, or OR.
• Modular addition or subtraction.
• Shift or rotation by a constant number of bits.
• Data-dependent rotation by a variable number of bits.
• Modular multiplication.
• Multiplication in a Galois field.
• Modular inversion.
• Look-up-table substitution.
Figure 2: Block diagram for standard block ciphers
DES is a sixteen-round Feistel Network block cipher. A block diagram of
the entire operation is given in Figure 3 from [55]. The DES cipher takes as input a
64-bit key, where 8 of the 64 bits are used for parity and the other 56 bits comprise
the actual key material. The input and output both have a size of 64 bits for both
encryption and decryption. The procedures for encryption and decryption are almost
exactly the same; the only difference is that the key schedule for decryption is the
reverse of that used for encryption.
Figure 3: The Data Encryption Standard algorithm
Throughout the rest of this section, bit ordering is denoted for an n-bit
vector such that bit 1 is the most significant bit and n is the least significant bit.
In all Figures that show the bit assignments for DES permutations, the numbers
correspond to input bits that are mapped to a specific position in the output, starting
with output bits 1,2,3,... in the top row and ending with output bits ...,n-2,n-1,n in
the bottom row.
The first part of DES encryption is an Initial Permutation (IP) on the input
block. The IP rearranges the input according to Table 2. The output of the IP is
divided into a left half L0 and a right half R0, which becomes the input to the first
round. For each round iteration i from 1 to 16:
$$L_i = R_{i-1}$$
$$R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$$
58 50 42 34 26 18 10  2
60 52 44 36 28 20 12  4
62 54 46 38 30 22 14  6
64 56 48 40 32 24 16  8
57 49 41 33 25 17  9  1
59 51 43 35 27 19 11  3
61 53 45 37 29 21 13  5
63 55 47 39 31 23 15  7
Table 2: The Initial Permutation IP
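As a sketch of how a table such as Table 2 is applied under the bit-ordering convention above (bit 1 is the most significant bit), the helper below builds the output one bit at a time. The function names `permute` and `invert_table` are illustrative choices, not names used elsewhere in this work.

```python
# Table 2, read row by row: entry j names the input bit copied to output bit j.
IP = [58, 50, 42, 34, 26, 18, 10, 2,
      60, 52, 44, 36, 28, 20, 12, 4,
      62, 54, 46, 38, 30, 22, 14, 6,
      64, 56, 48, 40, 32, 24, 16, 8,
      57, 49, 41, 33, 25, 17, 9, 1,
      59, 51, 43, 35, 27, 19, 11, 3,
      61, 53, 45, 37, 29, 21, 13, 5,
      63, 55, 47, 39, 31, 23, 15, 7]


def permute(value, table, in_width):
    """Apply a DES-style bit permutation; bit 1 is the most significant bit."""
    out = 0
    for src in table:
        bit = (value >> (in_width - src)) & 1  # fetch input bit number src
        out = (out << 1) | bit                 # append it as the next output bit
    return out


def invert_table(table):
    """Build the inverse table (valid only for true permutations such as IP)."""
    inv = [0] * len(table)
    for out_pos, src in enumerate(table, start=1):
        inv[src - 1] = out_pos
    return inv
```

The loop touches one bit per iteration, which illustrates why bit-level permutations are costly on a word-oriented processor.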
Figure 4 is an illustration of the round function and shows the individual
blocks of the f -function, which is the core operation of each round.
The following tables show the mappings of input bits to output bits of the
E and P operations, as well as the eight S-Boxes. The E expansion duplicates some
of the bits of the 32-bit input to the f -function as shown in Table 3 and outputs
a 48-bit value. The result of $E(R_{i-1}) \oplus K_i$ is partitioned into eight 6-bit values.
The S-Boxes output a 4-bit number based on a 6-bit input. The input is in the form
$a_5a_4a_3a_2a_1a_0$; the row index into the S-Box is the number formed from $a_5a_0$ and the
column index is the number formed from $a_4a_3a_2a_1$. The outputs of the S-Boxes are
concatenated to form the input to the P permutation. The P permutation rearranges
the 32 bits of the combined S-Box outputs, and the result is XOR-ed with $L_{i-1}$ to
obtain $R_i$ for the current round.
Figure 4: The DES f -function
32  1  2  3  4  5
 4  5  6  7  8  9
 8  9 10 11 12 13
12 13 14 15 16 17
16 17 18 19 20 21
20 21 22 23 24 25
24 25 26 27 28 29
28 29 30 31 32  1
Table 3: The Expansion Operation E
S1
14  4 13  1  2 15 11  8  3 10  6 12  5  9  0  7
 0 15  7  4 14  2 13  1 10  6 12 11  9  5  3  8
 4  1 14  8 13  6  2 11 15 12  9  7  3 10  5  0
15 12  8  2  4  9  1  7  5 11  3 14 10  0  6 13
S2
15  1  8 14  6 11  3  4  9  7  2 13 12  0  5 10
 3 13  4  7 15  2  8 14 12  0  1 10  6  9 11  5
 0 14  7 11 10  4 13  1  5  8 12  6  9  3  2 15
13  8 10  1  3 15  4  2 11  6  7 12  0  5 14  9
S3
10  0  9 14  6  3 15  5  1 13 12  7 11  4  2  8
13  7  0  9  3  4  6 10  2  8  5 14 12 11 15  1
13  6  4  9  8 15  3  0 11  1  2 12  5 10 14  7
 1 10 13  0  6  9  8  7  4 15 14  3 11  5  2 12
S4
 7 13 14  3  0  6  9 10  1  2  8  5 11 12  4 15
13  8 11  5  6 15  0  3  4  7  2 12  1 10 14  9
10  6  9  0 12 11  7 13 15  1  3 14  5  2  8  4
 3 15  0  6 10  1 13  8  9  4  5 11 12  7  2 14
S5
 2 12  4  1  7 10 11  6  8  5  3 15 13  0 14  9
14 11  2 12  4  7 13  1  5  0 15 10  3  9  8  6
 4  2  1 11 10 13  7  8 15  9 12  5  6  3  0 14
11  8 12  7  1 14  2 13  6 15  0  9 10  4  5  3
S6
12  1 10 15  9  2  6  8  0 13  3  4 14  7  5 11
10 15  4  2  7 12  9  5  6  1 13 14  0 11  3  8
 9 14 15  5  2  8 12  3  7  0  4 10  1 13 11  6
 4  3  2 12  9  5 15 10 11 14  1  7  6  0  8 13
S7
 4 11  2 14 15  0  8 13  3 12  9  7  5 10  6  1
13  0 11  7  4  9  1 10 14  3  5 12  2 15  8  6
 1  4 11 13 12  3  7 14 10 15  6  8  0  5  9  2
 6 11 13  8  1  4 10  7  9  5  0 15 14  2  3 12
S8
13  2  8  4  6 15 11  1 10  9  3 14  5  0 12  7
 1 15 13  8 10  3  7  4 12  5  6 11  0 14  9  2
 7 11  4  1  9 12 14  2  0  6 10 13 15  3  5  8
 2  1 14  7  4 10  8 13 15 12  9  0  3  5  6 11
Table 4: The DES S-Boxes
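The row and column indexing of the S-Boxes can be illustrated with S1 from Table 4. The sketch below assumes the 6-bit input is held in the low bits of an integer with $a_5$ as the most significant bit; the helper name `sbox_lookup` is an illustrative choice.

```python
# S1 from Table 4: four rows of sixteen 4-bit outputs.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]


def sbox_lookup(sbox, six_bits):
    """Map a 6-bit input a5..a0 to the 4-bit S-Box output."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 0b01)  # outer bits a5 and a0
    col = (six_bits >> 1) & 0b1111                      # inner bits a4 a3 a2 a1
    return sbox[row][col]
```

For the input 011011, the outer bits 01 select row 1 and the inner bits 1101 select column 13, so the S1 output is 5.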
16  7 20 21
29 12 28 17
 1 15 23 26
 5 18 31 10
 2  8 24 14
32 27  3  9
19 13 30  6
22 11  4 25
Table 5: The Pre-Output Permutation P
After the final round, the left and right halves of the 64-bit block, L16 and
R16, are swapped, and then subject to a Final Permutation (FP). This operation is
simply the inverse of the IP. The bit mapping for this operation is shown in Table 6;
bit positions are represented in the same manner as Table 2 for the IP.
40  8 48 16 56 24 64 32
39  7 47 15 55 23 63 31
38  6 46 14 54 22 62 30
37  5 45 13 53 21 61 29
36  4 44 12 52 20 60 28
35  3 43 11 51 19 59 27
34  2 42 10 50 18 58 26
33  1 41  9 49 17 57 25
Table 6: The Final Permutation FP
4.1.2 DES Key Schedule
The key schedule for DES operates on the 64-bit master key to produce
a series of 48-bit round keys, each used one at a time for the sixteen rounds of the
cipher. Initially, the bits of the master key are arranged by Permuted Choice 1 (PC-
1 ) into two 28-bit vectors, C and D (note that every eighth bit is a parity bit and is
discarded). Table 7 depicts the PC-1 operation.
For each round of the cipher, a bit rotation is performed separately on the
C and D values. Rotation moves to the left for encryption, and to the right for
decryption. The rotation amount depends on the next round; these amounts are
given for encryption in Table 8. These amounts are carried out in reverse order for
the right-rotations of the decryption key schedule.
         C0                      D0
57 49 41 33 25 17  9    63 55 47 39 31 23 15
 1 58 50 42 34 26 18     7 62 54 46 38 30 22
10  2 59 51 43 35 27    14  6 61 53 45 37 29
19 11  3 60 52 44 36    21 13  5 28 20 12  4
Table 7: Permuted Choice 1 (PC-1 )
Round          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Rotate amount  1  1  2  2  2  2  2  2  1  2  2  2  2  2  2  1
Table 8: Rotations for the DES key schedule
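The rotation schedule of Table 8 can be sketched as below for the encryption direction. The names `rotl28` and `cd_states` are illustrative, and the PC-1 and PC-2 steps are deliberately left out so that only the rotations are shown.

```python
# Left-rotation amounts from Table 8, one entry per round.
ROTATIONS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]
MASK28 = (1 << 28) - 1


def rotl28(value, amount):
    """Rotate a 28-bit value left by the given number of bits."""
    return ((value << amount) | (value >> (28 - amount))) & MASK28


def cd_states(c0, d0):
    """Yield the (C, D) pair that feeds PC-2 in each encryption round."""
    c, d = c0, d0
    for amount in ROTATIONS:
        c, d = rotl28(c, amount), rotl28(d, amount)
        yield c, d
```

The sixteen amounts sum to 28, so after the last round C and D are back at their initial values.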
The round key is the result of passing the current state of the C and D bit
vectors through Permuted Choice 2 (PC-2 ). This operation maps the concatenation
of C and D to the 48-bit round key as shown in Table 9.
14 17 11 24  1  5
 3 28 15  6 21 10
23 19 12  4 26  8
16  7 27 20 13  2
41 52 31 37 47 55
30 40 51 45 33 48
44 49 39 56 34 53
46 42 50 36 29 32
Table 9: Permuted Choice 2 (PC-2 )
4.1.3 The Triple-DES Algorithm
The Triple-DES algorithm has been suggested as a more secure alternative
to DES [55]. As the name suggests, this cipher sequentially executes the DES algorithm
three times with keys K1, K2, and K3, where two or all three of these keys may
be equivalent. The following rules are used for encryption and decryption, where PT
is the plaintext, CT is the ciphertext, $E_{K_i}$ is a DES encryption using $K_i$, and $D_{K_i}$ is
a DES decryption using $K_i$. Decryption undoes the three stages in reverse order:
$$CT = E_{K_3}(D_{K_2}(E_{K_1}(PT)))$$
$$PT = D_{K_1}(E_{K_2}(D_{K_3}(CT)))$$
Since the output ciphertext from implementations 1 and 2 of DES is used as
the input plaintext to implementations 2 and 3 respectively, and the Initial and Final
Permutations are inverses of each other, the inner Initial and Final Permutations may
be removed from the algorithm.
There are three keying options commonly used for Triple DES [55]:
Keying Option 1: K1, K2, and K3 are independent.
Keying Option 2: K1 = K3 and K2 is independent from K1 and K3.
Keying Option 3: K1 = K2 = K3.
Note that Keying Option 3 is equivalent to a single iteration of DES.
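The encrypt-decrypt-encrypt composition and the collapse of Keying Option 3 can be sketched with any invertible block function. The `toy_encrypt` and `toy_decrypt` functions below are hypothetical stand-ins for DES (they are not DES and offer no security); only the composition order is the point. Decryption applies the inverse stages in reverse order, starting with K3.

```python
MASK64 = (1 << 64) - 1


def toy_encrypt(block, key):
    """Hypothetical stand-in for DES encryption (NOT DES, not secure)."""
    x = (block + key) & MASK64
    return ((x << 13) | (x >> 51)) & MASK64  # rotate left by 13 bits


def toy_decrypt(block, key):
    """Inverse of toy_encrypt: rotate right by 13 bits, then subtract the key."""
    x = ((block >> 13) | (block << 51)) & MASK64
    return (x - key) & MASK64


def ede_encrypt(pt, k1, k2, k3, enc=toy_encrypt, dec=toy_decrypt):
    """Encrypt with K1, decrypt with K2, encrypt with K3."""
    return enc(dec(enc(pt, k1), k2), k3)


def ede_decrypt(ct, k1, k2, k3, enc=toy_encrypt, dec=toy_decrypt):
    """Undo the three stages in reverse order: K3 first, K1 last."""
    return dec(enc(dec(ct, k3), k2), k1)
```

With K1 = K2 = K3 the first two stages cancel, so `ede_encrypt(pt, k, k, k)` equals `toy_encrypt(pt, k)`.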
4.1.4 Triple-DES Key Schedule
To perform the key schedule for Triple-DES, the key expansion must be
performed on each unique key that is used in the chosen implementation of the algo-
rithm. This means that three key expansions are required for Keying Option 1, two
are required for Keying Option 2, and one is required for Keying Option 3.
4.1.5 Performance
Software implementations of DES tend to be significantly slower than hard-
ware implementations. Bit-level manipulations such as those contained in the permu-
tation, expansion, permuted choice, and Cyclic Left/Right Shift units do not map well
to general purpose processors. General purpose processor instruction sets operate on
multiple bits at a time based on the processor word size. Moreover, the DES S-Boxes
do not use memory in an efficient manner. Software look-up tables would appear
to be the obvious implementation choice for the DES S-Boxes. However, the DES
S-Boxes have 6-bit addresses and 4-bit output data while most memories associated
with general purpose processors use byte addressing with either 8-bit or 32-bit output
data. As a result, many software implementations of DES exhibit throughputs that
are at least a full order of magnitude slower than hardware implementations.
Even the best software implementations are only capable of throughputs in
the range of 100 to 200 Mbps. Most of these implementations recommend storing the $L_i$
and $R_i$ data as a 48-bit padded word within a 64-bit processor word and implementing
the permutations and S-Boxes as precomputed look-up tables. Additionally, there is
general agreement that the look-up table implementation for the S-Boxes is most
effective when the size of the look-up tables is minimized, guaranteeing that the
data will fit entirely in on-chip cache. Size minimization of the S-Box look-up tables
is achieved by implementing each S-Box in its own look-up table. Finally, one key
software optimization is the unrolling of software loops to increase performance. Even
when software loops are too cumbersome to unroll, using loop counters that decrement
to zero in place of loop counters that increment to a terminal count has been shown to
greatly increase the performance of software implementations of the DES algorithm.
However, the unrolling of software loops must be done with great care such that the
total data storage space does not exceed the size of the on-chip cache as this would
cause extreme performance degradation [56], [57], [58].
DES hardware implementations are easily realized in a single chip, such as
an FPGA or an ASIC. Supporting encryption or decryption of a new block of data
every clock cycle in an implementation operating in a non-feedback mode, such as
Electronic Code Book (ECB) or counter mode, also requires that the chip have at least
128 input pins (for the input data and key) and 64 output pins (for the output data).
Once again, FPGA and ASIC technology provide more than enough I/O pins to meet
these requirements. As a result, numerous fast and efficient DES implementations
have been reported, reaching throughputs in the Gbps when targeting either FPGAs
or ASICs. Examples of such implementations may be found in [2], [59], [60]. When
operating in feedback modes (such as Cipher Block Chaining (CBC) mode), DES
does not map nicely to pipelined hardware implementations because of the chaining
of blocks. The chaining requires ciphertext block $y_{i-1}$ to process plaintext block $x_i$,
and thus simultaneous processing of the two blocks is impossible, requiring that the
pipeline be stalled until generation of the ciphertext block $y_{i-1}$ is completed. However,
the stalling of the pipeline may be avoided in an environment with multiple data
streams, such as in a network processor. In such a situation, the pipeline may be fully
utilized by interleaving the data streams. For a fully pipelined DES implementation
where the atomic unit of the pipeline is the DES round function, the pipeline will
have sixteen stages, thus requiring sixteen interleaved data streams. Let $x_0^{S_0}$ denote
plaintext block 0 from data stream 0, $x_0^{S_1}$ denote plaintext block 0 from data stream
1, etc. Using this notation, the pipeline is filled with blocks $x_0^{S_0}, x_0^{S_1}, x_0^{S_2}, \ldots,
x_0^{S_{15}}$. When $x_0^{S_0}$ has passed the final stage of the pipeline to yield $y_0^{S_0}$, $x_1^{S_0}$ is
ready to enter the first stage of the pipeline and is combined with $y_0^{S_0}$ via the XOR
operation to perform CBC mode chaining. Thus each data stream is encrypted and
decrypted in CBC mode while also maintaining full pipeline utilization, maximizing
the performance of the implementation. Note that such an implementation must also
maintain sixteen Initialization Vectors, one for each data stream, to be combined with
the first plaintext block $x_0$ of the associated data stream via the XOR operation.
The earliest VLSI implementations of DES [61], [62] achieved throughputs
ranging from 20 to 32 Mbps using 3 µm technology. The variances in throughput are
compared and contrasted based upon speed versus area tradeoffs. The implementations
support multiple modes of operation, including ECB, CBC, Cipher Feedback
(CFB), and Output Feedback (OFB) (see [63] for a detailed description of DES modes
of operation). Other ASIC implementations of DES [64], [65] achieve a throughput
of 1 Gbps using 0.8 µm Gallium Arsenide (GaAs) technology. More recently, a DES
ASIC implementation has been demonstrated to operate at up to 10 Gbps using 0.6
µm technology [59].
4.2 IDEA
4.2.1 Mathematical Background
The International Data Encryption Algorithm (IDEA) was originally pub-
lished as the Proposed Encryption Standard (PES) by Xuejia Lai and James Massey
[66]. The computations involved in IDEA are based on operations from three different
mathematical groups:
16-bit bitwise exclusive-OR, denoted by ⊕,
Addition modulo $2^{16}$, denoted by ⊞,
Multiplication modulo $(2^{16} + 1)$, denoted by ⊙.
For the third operation, an input of 0x0000 represents the value $2^{16}$. This
is because the ⊙ operation is performed over the multiplicative group $\mathbb{Z}^*_{2^{16}+1}$, where
zero is not a member but $2^{16}$ is a member of the group. The value $2^{16}$ is therefore
denoted by 0x0000 so that only sixteen bits are required to represent all possible
values for the inputs to each operation.
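The three group operations, including the convention that 0x0000 stands for $2^{16}$ in the multiplication, can be written directly from the definitions above. This is a straightforward reference sketch that favors clarity over speed; the function names are illustrative.

```python
def idea_xor(a, b):
    """16-bit bitwise exclusive-OR."""
    return (a ^ b) & 0xFFFF


def idea_add(a, b):
    """Addition modulo 2**16."""
    return (a + b) & 0xFFFF


def idea_mul(a, b):
    """Multiplication modulo 2**16 + 1, where 0x0000 represents 2**16."""
    a = a or 0x10000                      # decode 0x0000 as 2**16
    b = b or 0x10000
    product = (a * b) % 0x10001           # result lies in 1 .. 2**16
    return product & 0xFFFF               # encode 2**16 back as 0x0000
```

For instance, `idea_mul(0, 0)` evaluates to 1, because $2^{16} \equiv -1 \pmod{2^{16}+1}$ and $(-1)(-1) = 1$.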
The security of IDEA is based not only on its large key size but also on the
fact that the output of one group operation is never used as an input to the same
operation. Further details are available in the original proposal for PES [66]. IDEA
evolved into its final form [67] due to modifications required to strengthen the cipher
against differential cryptanalysis attacks [68]. IDEA is used in many commercial
applications, such as Pretty Good Privacy (PGP). Like DES, IDEA operates across
64-bit blocks. However, while DES requires a 56-bit key, IDEA requires a 128-bit key,
accounting for the increased security of the cipher as compared to DES.
4.2.2 Algorithm Description
IDEA operates on 64-bit plaintext blocks and has a key size of 128 bits.
The algorithm consists of eight rounds followed by a final transformation to obtain
the output. Similar to DES, the procedure (referred to as a computation graph; see
Figure 5), is the same for both encryption and decryption but different key schedules
are used.
Figure 5 shows the input text as four 16-bit sub-blocks $X_1$, $X_2$, $X_3$, and $X_4$.
These text sub-blocks are combined with the six 16-bit sub-blocks of the round key
for the current round r, labeled $Z_1^{(r)}$ through $Z_6^{(r)}$, using the mathematical operations
noted above.
4.2.3 Key Schedule
The IDEA key schedule for encryption is based on a series of left rotations
of the 128-bit master key. The master key is first partitioned into eight 16-bit blocks;
these are the first eight key sub-blocks: $Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}, Z_4^{(1)}, Z_5^{(1)}, Z_6^{(1)}, Z_1^{(2)}, Z_2^{(2)}$.
The next eight key blocks are obtained by rotating the key to the left by 25 bits,
then performing the partition again. This process is repeated until all 52 key blocks
are generated (six blocks for each of the eight rounds and four blocks for the final
transformation).
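The encryption key schedule described above amounts to a partition followed by a 25-bit rotation, repeated until 52 blocks exist. The sketch below assumes the 128-bit master key is supplied as a Python integer with the first sub-block in the most significant position; the function name is illustrative.

```python
def idea_encrypt_subkeys(key):
    """Expand a 128-bit integer key into the 52 16-bit encryption sub-blocks."""
    mask128 = (1 << 128) - 1
    subkeys = []
    while len(subkeys) < 52:
        # Partition the current 128-bit value into eight 16-bit blocks, MSB first.
        for i in range(8):
            subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the whole key left by 25 bits before the next partition.
        key = ((key << 25) | (key >> 103)) & mask128
    return subkeys[:52]  # the seventh partition supplies only four blocks
```

With the key 0x00010002000300040005000600070008, the first eight sub-blocks are 0x0001 through 0x0008 and the ninth is 0x0400.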
Figure 5: The computation graph for IDEA.
The key schedule for decryption is based on the encryption key schedule.
Table 10 shows the relationship between the decryption key blocks and the encryption
key blocks, where $(Z_n^{(r)})^{-1}$ represents the multiplicative inverse modulo $(2^{16}+1)$ of $Z_n^{(r)}$,
and $-Z_n^{(r)}$ represents the additive inverse modulo $2^{16}$ of $Z_n^{(r)}$.
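The two inverses used by the decryption key schedule can be sketched as follows. The multiplicative inverse here uses Fermat's little theorem, which applies because $2^{16}+1 = 65537$ is prime; this is one convenient method and is not necessarily the one used in the implementations discussed in this work. The function names are illustrative.

```python
MODULUS = 0x10001  # 2**16 + 1, a prime


def mul_inverse(z):
    """Multiplicative inverse modulo 2**16 + 1, with 0x0000 representing 2**16."""
    value = z or 0x10000
    # Fermat's little theorem: value**(p - 2) is the inverse of value modulo p.
    return pow(value, MODULUS - 2, MODULUS) & 0xFFFF


def add_inverse(z):
    """Additive inverse modulo 2**16."""
    return (-z) & 0xFFFF
```

Note that 0x0000 is its own multiplicative inverse, since it represents $2^{16} \equiv -1$.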
Round | Encrypt keys | Decrypt keys
1 | $Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}, Z_4^{(1)}, Z_5^{(1)}, Z_6^{(1)}$ | $(Z_1^{(9)})^{-1}, -Z_2^{(9)}, -Z_3^{(9)}, (Z_4^{(9)})^{-1}, Z_5^{(8)}, Z_6^{(8)}$
2 | $Z_1^{(2)}, Z_2^{(2)}, Z_3^{(2)}, Z_4^{(2)}, Z_5^{(2)}, Z_6^{(2)}$ | $(Z_1^{(8)})^{-1}, -Z_3^{(8)}, -Z_2^{(8)}, (Z_4^{(8)})^{-1}, Z_5^{(7)}, Z_6^{(7)}$
3 | $Z_1^{(3)}, Z_2^{(3)}, Z_3^{(3)}, Z_4^{(3)}, Z_5^{(3)}, Z_6^{(3)}$ | $(Z_1^{(7)})^{-1}, -Z_3^{(7)}, -Z_2^{(7)}, (Z_4^{(7)})^{-1}, Z_5^{(6)}, Z_6^{(6)}$
4 | $Z_1^{(4)}, Z_2^{(4)}, Z_3^{(4)}, Z_4^{(4)}, Z_5^{(4)}, Z_6^{(4)}$ | $(Z_1^{(6)})^{-1}, -Z_3^{(6)}, -Z_2^{(6)}, (Z_4^{(6)})^{-1}, Z_5^{(5)}, Z_6^{(5)}$
5 | $Z_1^{(5)}, Z_2^{(5)}, Z_3^{(5)}, Z_4^{(5)}, Z_5^{(5)}, Z_6^{(5)}$ | $(Z_1^{(5)})^{-1}, -Z_3^{(5)}, -Z_2^{(5)}, (Z_4^{(5)})^{-1}, Z_5^{(4)}, Z_6^{(4)}$
6 | $Z_1^{(6)}, Z_2^{(6)}, Z_3^{(6)}, Z_4^{(6)}, Z_5^{(6)}, Z_6^{(6)}$ | $(Z_1^{(4)})^{-1}, -Z_3^{(4)}, -Z_2^{(4)}, (Z_4^{(4)})^{-1}, Z_5^{(3)}, Z_6^{(3)}$
7 | $Z_1^{(7)}, Z_2^{(7)}, Z_3^{(7)}, Z_4^{(7)}, Z_5^{(7)}, Z_6^{(7)}$ | $(Z_1^{(3)})^{-1}, -Z_3^{(3)}, -Z_2^{(3)}, (Z_4^{(3)})^{-1}, Z_5^{(2)}, Z_6^{(2)}$
8 | $Z_1^{(8)}, Z_2^{(8)}, Z_3^{(8)}, Z_4^{(8)}, Z_5^{(8)}, Z_6^{(8)}$ | $(Z_1^{(2)})^{-1}, -Z_3^{(2)}, -Z_2^{(2)}, (Z_4^{(2)})^{-1}, Z_5^{(1)}, Z_6^{(1)}$
Final transform | $Z_1^{(9)}, Z_2^{(9)}, Z_3^{(9)}, Z_4^{(9)}$ | $(Z_1^{(1)})^{-1}, -Z_2^{(1)}, -Z_3^{(1)}, (Z_4^{(1)})^{-1}$
Table 10: IDEA key schedule
4.2.4 Performance
In terms of the core operations of IDEA, the bit-wise XOR and addition
are easily implemented with one instruction each in software. For the reduction
modulo $2^{16}$, a processor such as the LEON2 that only performs arithmetic on 32-bit
register operands requires an additional logic instruction to mask out the bits
that may overflow into the sixteen most significant bits of the destination register.
The major performance bottleneck for a software implementation of the IDEA cipher
is the multiplication modulo $(2^{16} + 1)$. The reason for this is that multiplication
in general may take several clock cycles to complete on the processor running the
algorithm (especially those without hardware multipliers), and the modular reduction,
which is commonly implemented using the Low-High Lemma [66], requires additional
execution time.
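A common way to apply the Low-High Lemma is sketched below: since $2^{16} \equiv -1 \pmod{2^{16}+1}$, the 32-bit product reduces to the difference between its low and high 16-bit halves, with a correction of one when that difference is negative. This is an illustrative sketch under that reading of the lemma, not a transcription of any particular implementation cited here; the function name is an assumption.

```python
def idea_mul_low_high(a, b):
    """Multiply modulo 2**16 + 1 without a division, 0x0000 representing 2**16."""
    if a == 0:                       # a represents 2**16, which is -1 mod 2**16 + 1
        return (1 - b) & 0xFFFF
    if b == 0:
        return (1 - a) & 0xFFFF
    product = a * b                  # fits in 32 bits for 16-bit operands
    low, high = product & 0xFFFF, product >> 16
    # product = high * 2**16 + low, which is congruent to low - high.
    # Add 2**16 + 1 when the difference is negative; the 2**16 part
    # disappears in the 16-bit mask, leaving a correction of one.
    return (low - high + (1 if low < high else 0)) & 0xFFFF
```

Even in this form each multiplication costs a multiply, a shift, a mask, a compare, a subtract, and a conditional add, which is consistent with the bottleneck described above.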
Several software implementations of the IDEA algorithm take advantage
of advanced processor architectures that employ instruction parallelism or functional
units for multimedia support. A four-way parallel implementation on a 166 MHz Pentium
MMX processor [69] achieved a throughput of approximately 72 Mbps. Throughput
values ranging from 421 Mbps to 550 Mbps have been achieved on the Itanium
platform running at 733 MHz [70]. The performance evaluations reported in [71]
include a comparison of IDEA software implementations on processors with various
word sizes, clock frequencies, and cache sizes. Execution times for IDEA encryption
ranged from 2555 µs on the 8-bit 4 MHz Atmega 103 to 9 µs on the 64-bit 440 MHz
UltraSparc2® with instruction and data cache sizes of 16 kbytes. The ability to
perform fast multiplications was shown to be a major factor in the performance of
the IDEA algorithm.
Implementations of IDEA on reconfigurable computing platforms and sys-
tems with co-processors have shown improved performance. An implementation on a
SRC-6E platform [72] achieved throughputs of approximately 590 Mbps for end-to-
end software time for bulk data processing. Comparisons have been made between
the performance of IDEA on Digital Signal Processing (DSP) chips, cryptographic
co-processors, and hardware implementations on FPGAs in a hardware-software co-
design system that makes use of encryption in a mobile device. Reported perfor-
mance figures ranged from 32 Mbps on the DEC SA-110 and 53.1 Mbps on the TI
TMX320C6x DSP chips, to 180 Mbps using the VINCI cryptographic co-processor,
to 528 Mbps with an FPGA-based implementation [73].
A VLSI implementation of PES [74] achieved a throughput of 44 Mbps using
1.5 µm technology. This implementation was limited in clock frequency to maintain
compatibility with the Sun Microsystems SBus. The earliest VLSI implementations
of IDEA [75], [76] achieved throughputs of 177 Mbps using 1.2 µm technology. More
recent VLSI implementations [77] achieve a throughput of 355 Mbps using 0.8 µm
technology. When using 0.7 µm technology, a throughput of 424 Mbps was achieved
in a single chip solution [78]. However, the performance of these implementations
was significantly reduced when operating in feedback modes.
4.3 AES
4.3.1 Mathematical Background
Joan Daemen and Vincent Rijmen proposed the Rijndael algorithm to NIST
as a candidate for the Advanced Encryption Standard [79]. One of the most significant
features of the algorithm is the extensive use of finite field, or Galois Field, arithmetic.
The particular field used in the AES algorithm is the Galois Field $GF(2^8)$. Values
are represented by polynomials of the form
$$a(x) = a_7x^7 + a_6x^6 + a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0,$$
or in bit vector notation, $\{a_7a_6a_5a_4a_3a_2a_1a_0\}$, where each $a_i$ is a
coefficient in the Galois Field GF(2). Addition is done by computing the sum modulo 2
of coefficients in the same bit positions; this can be accomplished by applying
a bit-wise exclusive-OR on the coefficients. Multiplication works in much the same
way as ordinary polynomial multiplication, but there is an additional step to make
a modular reduction of the product by an irreducible polynomial so that the final
product is in the Galois Field $GF(2^8)$. For the AES algorithm, this polynomial is
$m(x) = x^8 + x^4 + x^3 + x + 1$.
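Multiplication in $GF(2^8)$ can be sketched with a shift-and-add loop in which every doubling is reduced modulo $m(x)$, whose bit pattern is 0x11B. The helper names `xtime` and `gf_mul` are illustrative (the former follows common usage in AES literature).

```python
def xtime(a):
    """Multiply by x, i.e. {02}, in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a  # reduce when degree 8 appears


def gf_mul(a, b):
    """Multiply two field elements given as 8-bit integers."""
    result = 0
    while b:
        if b & 1:
            result ^= a      # addition in GF(2^8) is exclusive-OR
        a = xtime(a)         # advance to the next power of x
        b >>= 1
    return result
```

For example, {57} multiplied by {83} gives {c1}.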
4.3.2 Algorithm Description
AES always operates on a block size of 128 bits, but key sizes of 128, 192,
or 256 bits are allowed. The number of rounds used in the cipher is dependent on the
key size: ten rounds for a 128-bit key, twelve rounds for a 192-bit key, and fourteen
rounds for a 256-bit key. This research focuses on a 128-bit key implementation but
is easily extended for use in implementations with larger key sizes.
Encryption of one plaintext block in AES requires the sequence of operations
shown in Figure 6. The word data type is a 32-bit value. In the AES algorithm
specification [80], the plaintext is arranged into a 4 × 4 matrix of 8-bit values called
the state, depicted in Figure 7.
Encrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[0, 3])
    for round = 1 step 1 to 9
        SubBytes(state)
        ShiftRows(state)
        MixColumns(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
    end for
    SubBytes(state)
    ShiftRows(state)
    AddRoundKey(state, k[40, 43])
    out = state
end
Figure 6: The AES encryption process
s0,0 s0,1 s0,2 s0,3
s1,0 s1,1 s1,2 s1,3
s2,0 s2,1 s2,2 s2,3
s3,0 s3,1 s3,2 s3,3
Figure 7: The state representation of data blocks in AES
The four types of operations performed on the state are:
SubBytes: substitutes each byte in the state with a new value according to
the following procedure:
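(1) Compute the multiplicative inverse in the Galois Field $GF(2^8)$, denoted as
$a^{-1}$ (except for the value 0x00, which is mapped to itself);
(2) Perform the following affine transformation over the Galois Field GF(2) on
$a^{-1}$, where $a^{-1}_i$ denotes bit i of $a^{-1}$:
$$\begin{bmatrix} b_7 \\ b_6 \\ b_5 \\ b_4 \\ b_3 \\ b_2 \\ b_1 \\ b_0 \end{bmatrix} =
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a^{-1}_7 \\ a^{-1}_6 \\ a^{-1}_5 \\ a^{-1}_4 \\ a^{-1}_3 \\ a^{-1}_2 \\ a^{-1}_1 \\ a^{-1}_0 \end{bmatrix} +
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$$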
The result b is copied into the position of a in the state.
ShiftRows: performs cyclic left-shifts on each row in the state. The amount
of bytes by which to shift depends on the row: zero for the top row, one for the
second row, two for the third row, and three for the bottom row.
MixColumns: each column of the state is treated as a vector of four polynomials
in the Galois Field $GF(2^8)$ in this operation. Each of the four columns
is multiplied by a 4 × 4 constant matrix with coefficients in the Galois Field
$GF(2^8)$ reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$. For each column c from
0 to 3,
$$\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix} =
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.$$
AddRoundKey: like MixColumns, AddRoundKey operates on individual
columns of the state. Each column $C_i$ is combined by a bit-wise exclusive-OR
operation with a 32-bit word $k_{4r+i}$ from the current round key (the key schedule
is explained in Section 4.3.3).
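The two-step SubBytes procedure (field inversion followed by the affine transformation) can be checked with a short sketch. It assumes the bit-level form of the affine step, $b_i = a^{-1}_i \oplus a^{-1}_{(i+4) \bmod 8} \oplus a^{-1}_{(i+5) \bmod 8} \oplus a^{-1}_{(i+6) \bmod 8} \oplus a^{-1}_{(i+7) \bmod 8} \oplus c_i$ with $c$ = {63}, and computes the inverse by brute-force exponentiation, which is far slower than the table look-up used in practice. All function names are illustrative.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse as a**254; 0x00 is mapped to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def sub_byte(a):
    """SubBytes for one byte: field inversion, then the affine transformation."""
    x = gf_inv(a)
    out = 0
    for i in range(8):
        bit = (0x63 >> i) & 1                      # constant vector bit c_i
        for offset in (0, 4, 5, 6, 7):
            bit ^= (x >> ((i + offset) % 8)) & 1   # matrix row i applied to x
        out |= bit << i
    return out
```

For example, `sub_byte(0x53)` yields 0xED, and `sub_byte(0x00)` yields the constant 0x63.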
Decryption of a ciphertext block incorporates the inverses of the operations
used in the encryption process as shown in Figure 8. Note that AddRoundKey is
its own inverse since it involves only the bitwise exclusive-OR operation.
Decrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[40, 43])
    for round = 9 step -1 downto 1
        InvShiftRows(state)
        InvSubBytes(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
        InvMixColumns(state)
    end for
    InvShiftRows(state)
    InvSubBytes(state)
    AddRoundKey(state, k[0, 3])
    out = state
end
Figure 8: The AES decryption process
InvSubBytes: reverses the transformation performed by SubBytes by first
applying an affine transformation using the inverse of the 8 × 8 matrix used for
SubBytes followed by calculation of the multiplicative inverse in the Galois
Field $GF(2^8)$ modulo m(x).
InvShiftRows: performs cyclic right-shifts on each row in the state in the
same amounts as ShiftRows.
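InvMixColumns: performs multiplication of the state by the inverse of the
Galois Field constant matrix from MixColumns:
$$\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix} =
\begin{bmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.$$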
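The MixColumns and InvMixColumns matrices can be applied to a single column with one generic matrix-vector routine over $GF(2^8)$, as sketched below. The names `MIX`, `INV_MIX`, and `mat_vec` are illustrative; the matrices themselves are the two constant matrices given above.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def mat_vec(matrix, column):
    """Multiply a 4 x 4 constant matrix by one state column over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)   # sums in GF(2^8) are exclusive-ORs
        out.append(acc)
    return out
```

Applying `MIX` to the column (db, 13, 53, 45) gives (8e, 4d, a1, bc), and applying `INV_MIX` to that result restores the original column.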
4.3.3 Key Schedule
For the 128-bit key size implementation of the AES algorithm, the master
key is expanded into a linear array of forty-four 4-byte words (eleven round keys
of four words each) using the process presented
in Figure 9. There are two operations and an array of constants used specifically for
the key schedule:
KeyExpansion(byte key[16], word w[44])
begin
    word temp
    i = 0
    while (i < 4)
        w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
        i = i + 1
    end while
    i = 4
    while (i < 44)
        temp = w[i-1]
        if (i mod 4 = 0)
            temp = SubWord(RotWord(temp)) xor Rcon[i/4]
        end if
        w[i] = w[i-4] xor temp
        i = i + 1
    end while
end
Figure 9: The key expansion process for AES
SubWord: applies a substitution to each of the four bytes in the input word
using the same S-Box that is used in encryption.
RotWord: performs a cyclic left rotation by one byte on the input word
$a_0a_1a_2a_3$ to produce an output of $a_1a_2a_3a_0$.
Rcon[ ]: the round constant array with a size of ten words. For i from 1 to 10,
$$Rcon[i] = [\{02\}^{i-1}, \{00\}, \{00\}, \{00\}],$$
where the powers $\{02\}^{i-1}$ are in the Galois Field $GF(2^8)$.
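The round constants and the RotWord operation are simple enough to sketch directly. The code below assumes a word is held as a 32-bit integer with $a_0$ in the most significant byte; `rcon_table` and `rot_word` are illustrative names, and index 0 of the returned list corresponds to Rcon[1].

```python
def xtime(a):
    """Multiply by {02} in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def rcon_table():
    """Return Rcon[1..10] as 32-bit words with the power of {02} in the top byte."""
    table, value = [], 0x01          # {02}**0
    for _ in range(10):
        table.append(value << 24)    # [value, {00}, {00}, {00}]
        value = xtime(value)         # next power of {02}
    return table


def rot_word(word):
    """Cyclic left rotation by one byte: a0 a1 a2 a3 becomes a1 a2 a3 a0."""
    return ((word << 8) | (word >> 24)) & 0xFFFFFFFF
```

The leading bytes of the ten constants come out as 01, 02, 04, 08, 10, 20, 40, 80, 1b, 36; the last two show the modular reduction taking effect.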
4.3.4 Performance
Rijndael software performance bottlenecks typically occur in the SubBytes
and MixColumns transformations, one or both of which are usually implemented
via 8-bit to 8-bit look-up tables. Often most of the Rijndael round transformations
(SubBytes, ShiftRows, and MixColumns) are combined into large look-up tables
termed T-tables. Such implementations require up to three T-tables whose size
may be either 1 kbytes or 4 kbytes where the smaller tables require performing an
additional rotation operation. The goal of the T-tables is to avoid performing the
MixColumns and InvMixColumns transformations as these operations perform Galois
Field fixed field constant multiplication, an operation which maps poorly to general
purpose processors. However, the use of T-tables has significant disadvantages. The
T-tables significantly increase code size, their performance is dependent on the mem-
ory system architecture as well as cache size, and their use causes key expansion for
Rijndael decryption to become significantly more complex. As an alternative to the
T-tables implementation method, it is also feasible to have the processor perform all of
the Rijndael round transformations. Row-based implementations have been demon-
strated to allow for greater efficiency in the implementation of the MixColumns and
InvMixColumns transformations versus column-based implementations. However, the
SubBytes transformation still remains as a bottleneck, requiring separate 256-byte
look-up tables for encryption and decryption [20], [29], [32], [81], [82], [83], [84].
Numerous co-processors have been developed to accelerate cryptographic
algorithm implementations. The CryptoManiac VLIW co-processor was developed
as a result of instruction set extensions designed to accelerate the performance of
a number of the AES candidate algorithms. CryptoManiac features the execution
of up to four instructions per cycle and the use of instructions with up to three
operands to allow for the combination of short latency instructions for single cycle
execution. Similarly, the Cryptonite co-processor is also VLIW based, with two 64-
bit datapaths and special instructions combined with dedicated memories to support
Rijndael implementations. Both co-processors improve the performance of Rijndael
implementations versus implementations targeting general purpose processors. Other
implementations couple FPGA co-processors with a LEON-2 processor core. The
co-processors connect to the processor core via either a dedicated interface or as a
memory-mapped peripheral and were able to significantly improve the performance
of Rijndael implementations [23], [33], [35], [85], [86].
Multiple implementations of Rijndael have been presented targeting a wide
range of hardware technologies. These implementations use specific Galois Field fixed
field constant multipliers resulting in either logic equations or look-up tables being
generated to perform the multiplication. Implementations based on logic equations
are optimized for area and require a moderate number of logic levels. Implementations
based on look-up tables are optimized for speed at the cost of additional logic resources
though the performance of these implementations, like the software implementations
employing T-tables, is highly dependent on the memory system and cache organization
and size. In the case of the Galois Field fixed field constant multipliers used in
the MixColumns transformation, the 8-bit to 8-bit look-up tables may be replaced by
8-bit × 8-bit mapping matrices, reducing the associated memory requirements
by a factor of nearly 20 [87], [88]. Look-up tables may also be replaced with logic
equation implementations for the SubBytes and MixColumns transformations, sig-
nificantly reducing the hardware resource requirements. To illustrate the significant
reduction in logic resource requirements, in the case of the SubBytes transformation,
a reduction in gate count by as much as a factor of 4.66 has been realized using logic
equations in place of a look-up table. When performing sixteen SubBytes transfor-
mations in parallel in a single round of Rijndael (assuming a 128-bit implementation),
this equates to a savings of over 38,000 gate equivalences. For a pipelined implemen-
tation of 128-bit AES, this savings increases to over 380,000 gate equivalences [29].
Encryption, decryption, and Key Scheduling are all easily pipelined in non-feedback
modes of operation while single-round implementations are typically used when op-
erating in feedback modes. Depending on the implementation methodology, Rijndael
throughputs as high as 70 Gbps when operating in non-feedback modes and 2.29
Gbps when operating in feedback modes have been reported [89], [90], [91], [92], [93],
[94], [95], [96], [97], [98], [99], [100], [101], [102], [103], [104], [105], [106], [107].
Instruction set extensions are an interesting implementation option that
bridges the gap between hardware and software. Significantly improved performance
of software implementations has been demonstrated as a result of adding functionality
to a processor's datapath and corresponding control logic to decode new instructions.
Instruction set extensions designed to accelerate the performance of software
implementations of Rijndael have been proposed for a wide range of processors.
the SubBytes and MixColumns transformations into one T-table look-up operation
to speed up algorithm execution. While T-table performance is heavily dependent
upon available cache size, these extensions have been shown to result in performance
improvements of up to a factor of 3.68 versus Rijndael implementations without the
use of the instruction set extensions [20], [42], [85], [86], [88], [108], [109].
4.4 Modes of Operation
All of the symmetric-key algorithms targeted in this research support many
different modes of operation, which are methods that specify the information entered into the
algorithm. Two modes of operation are of particular interest here:
Electronic Code Book (ECB): Each block of plaintext x_i is input directly into
the encryption function to form the ciphertext y_i. Encrypting a specific value
for x_i always produces the same value for y_i, and the result is not affected by
previous plaintext blocks.
Cipher Block Chaining (CBC): The first plaintext block is combined with an
initialization vector (IV) using the bitwise exclusive-OR operation, and the
result is encrypted to form the first ciphertext block. Subsequent blocks of
plaintext are exclusive-ORed with the last ciphertext block computed. In this
mode, every block of ciphertext depends on the preceding runs of the algorithm.
ECB mode may be employed to create pipelined implementations of block
ciphers, leading to very high throughput. However, a disadvantage of ECB mode is
that identical plaintext blocks encrypted with the same key always produce the same
ciphertext. Also, because data blocks encrypted in ECB mode have no chaining
or feedback mechanism, the encrypted data is vulnerable to a substitution attack.
In this type of attack, encrypted blocks in the original data stream are swapped
by the attacker with other data blocks that are encrypted with the same key. CBC
mode is not vulnerable to such attacks, but because the current block to be encrypted
depends directly on the previous ciphertext, CBC mode is not well suited to pipelined
implementations.
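The difference between the two modes can be sketched in a few lines of C. This is a minimal illustration only: toy_encrypt_block is a hypothetical, insecure stand-in for a real 64-bit block cipher such as DES, and the function names are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a real 64-bit block cipher such as DES.
 * It is NOT secure; it only needs to be a key-dependent bijection. */
static uint64_t toy_encrypt_block(uint64_t x, uint64_t key)
{
    x ^= key;
    x = (x << 13) | (x >> 51);       /* rotate left by 13 bits */
    x *= 0x9E3779B97F4A7C15ULL;      /* odd constant, invertible mod 2^64 */
    return x ^ (x >> 29);
}

/* ECB: y[i] = E(x[i]). Every block is independent of the others,
 * so the blocks could be processed by a pipeline in any order. */
void ecb_encrypt(const uint64_t *x, uint64_t *y, size_t n, uint64_t key)
{
    assert(x && y);
    for (size_t i = 0; i < n; i++)
        y[i] = toy_encrypt_block(x[i], key);
}

/* CBC: y[0] = E(x[0] ^ IV) and y[i] = E(x[i] ^ y[i-1]).
 * Block i cannot start until block i-1 has finished. */
void cbc_encrypt(const uint64_t *x, uint64_t *y, size_t n,
                 uint64_t key, uint64_t iv)
{
    assert(x && y);
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        y[i] = toy_encrypt_block(x[i] ^ prev, key);
        prev = y[i];
    }
}
```

Encrypting two identical plaintext blocks with ecb_encrypt yields two identical ciphertext blocks, whereas cbc_encrypt yields different blocks whenever the first ciphertext block differs from the IV. The loop-carried dependency on prev is what prevents pipelining in CBC mode.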
5 PROPOSED INSTRUCTION SET EXTENSIONS
This chapter specifies the new instructions that are intended to enhance the
performance of the algorithms described in Chapter 4. All of the custom instructions
are intended to comply with the SPARC® V8 instruction model [47]. In particular,
the instructions have the Format 3 structure described in Section 3.3.
The sub-sections below indicate the syntax and encoding, as well as a brief
description, of each instruction. All instructions that write to a register execute in
one clock cycle, except for the mmul16 instruction which takes two clock cycles.
For those instructions that store data directly into registers contained in the custom
hardware, the data is available at the start of the next cycle, after instruction execu-
tion has completed.
5.1 DES and Triple-DES
5.1.1 Initial and Final Permutations
Instruction Syntax:
desipl rs1,rs2,rd
desipr rs1,rs2,rd
desfpl rs1,rs2,rd
desfpr rs1,rs2,rd
Instruction Encoding:
op  rd  op3     rs1  i  asi       rs2
10  rd  001101  rs1  0  XXXXXXnn  rs2
The desipl and desipr instructions produce the left and right halves of the
DES IP, respectively. Similarly, the desfpl and desfpr instructions produce the left
and right halves of the DES FP, respectively. The left half of the input block must
be located in the rs1 register and the right half must be located in the rs2 register.
The specific instruction to be executed is determined by the value of bits
[1:0] of the asi field as given in Table 11 below. Bits [7:2] of the asi field are ignored
by all of the DES permutation instructions.
asi[1:0]  Instruction
00        desipl
01        desipr
10        desfpl
11        desfpr
Table 11: Interpretation of the asi field for the DES permutation instructions
Inclusion of these instructions allows for the IP and FP for DES and Triple-
DES to be completed in two instructions each. Traditional implementations of the
IP and FP in software require a series of bit mask setup, shift, logical AND, and
logical OR operations for each bit for a total of 256 instructions [41]. The improved
permutation algorithm used in the reference code [110] requires 44 instructions to
complete on a SPARC® V8 processor such as the LEON2, which is still significantly
larger than the instruction count required to perform the permutations as proposed
in this research.
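For reference, the behavior of the four permutation instructions can be modeled in software with the conventional bit-by-bit approach. The sketch below is an assumption-laden illustration, not the hardware description: the function names (desipl_model and so on) are hypothetical, and the IP table is the standard one from the DES specification, with bit 1 denoting the most significant bit of the 64-bit block.

```c
#include <assert.h>
#include <stdint.h>

/* Standard DES initial permutation: output bit k (1 = MSB) takes
 * input bit IP[k-1]. */
static const uint8_t IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2,  60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6,  64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1,  59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5,  63, 55, 47, 39, 31, 23, 15, 7
};

/* Bit-by-bit initial permutation of a 64-bit block. */
static uint64_t des_ip(uint64_t in)
{
    uint64_t out = 0;
    for (int k = 0; k < 64; k++) {
        assert(IP[k] >= 1 && IP[k] <= 64);
        uint64_t bit = (in >> (64 - IP[k])) & 1u;
        out |= bit << (63 - k);
    }
    return out;
}

/* The final permutation is the inverse of IP: input bit k+1 is moved
 * to output bit IP[k]. */
static uint64_t des_fp(uint64_t in)
{
    uint64_t out = 0;
    for (int k = 0; k < 64; k++) {
        uint64_t bit = (in >> (63 - k)) & 1u;
        out |= bit << (64 - IP[k]);
    }
    return out;
}

/* rs1 holds the left half of the input block, rs2 the right half. */
static uint64_t join(uint32_t rs1, uint32_t rs2)
{
    return ((uint64_t)rs1 << 32) | rs2;
}

uint32_t desipl_model(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_ip(join(rs1, rs2)) >> 32); }
uint32_t desipr_model(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_ip(join(rs1, rs2)); }
uint32_t desfpl_model(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_fp(join(rs1, rs2)) >> 32); }
uint32_t desfpr_model(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_fp(join(rs1, rs2)); }
```

Each model function loops over all 64 bits, which is the cost that the single-cycle custom instructions avoid.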
5.1.2 Set Encryption Direction
Instruction Syntax:
desdir imm
Instruction Encoding:
op  rd     op3     rs1    i  simm13
10  00000  001001  00000  1  (dir)
Set up the DES key generator to output round keys in either encryption or
decryption order. The imm operand is set to zero for encryption, one for decryption.
This instruction also resets the round counter of the key generator according to the
chosen direction to ensure that output of the round keys may be immediately carried
out in the proper order. It is not necessary to re-load the master key after this in-
struction is executed. The desdir instruction is used in conjunction with the deskey
and desf instructions as explained in Sections 5.1.3 and 5.1.4.
5.1.3 Key Loading
Instruction Syntax:
deskey rs1,rs2
Instruction Encoding:
op  rd     op3     rs1  i  asi     rs2
10  00000  001001  rs1  0  unused  rs2
The deskey instruction loads the 64-bit master key for DES. The left half
of the master key must be contained in the rs1 register, and the right half in the rs2
register.
5.1.4 Round Core (f) Function
Instruction Syntax:
desf rs1,rd
Instruction Encoding:
op  rd     op3     rs1  i  simm13
10  00000  001001  rs1  1  0x1XXX
This instruction takes the right half of a round output block stored in the
rs1 register and stores the output of the core (f) function into the rd register. The
round key is not specified here since the round key output of the DES key generator
is hard-wired to the round key input of the f-function circuit. After completion of this
instruction, the key generator is signaled to generate the key for the next round. Due
to the logic of the DES key generator (see Section 6.1.3), the desf instruction may
not be followed by another desf instruction. However, this is not expected to cause
a performance bottleneck due to the additional instruction required for swapping the
values of the left and right halves of the round input block.
Implementation of the desdir, deskey, and desf instructions removes the
need for storage of the sixteen round keys and S-Boxes in memory. All round keys
are generated on-the-fly in the custom hardware. An implementation of the DES al-
gorithm using these instructions requires two instructions for key scheduling and four
instructions for each of the sixteen rounds (one desf, one exclusive-OR, and two regis-
ter data transfers for swapping the left and right halves of the round function output).
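The interaction between desdir, deskey, and desf can be illustrated with a small software model. The sketch below is hypothetical throughout: the keygen_t structure and the *_model function names are assumptions, and both the round key derivation and the f function are toy placeholders, NOT the DES key schedule or the DES f function. Only the control flow (direction select, round counter reset, one round key consumed per desf, four operations per round) follows the description above.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software stand-in for the key generator state. */
typedef struct {
    uint64_t master;  /* loaded by deskey */
    int      dir;     /* 0 = encryption order, 1 = decryption order */
    int      round;   /* rounds issued since the last desdir, 0..15 */
} keygen_t;

void deskey_model(keygen_t *kg, uint32_t kl, uint32_t kr)
{
    kg->master = ((uint64_t)kl << 32) | kr;
}

void desdir_model(keygen_t *kg, int dir)
{
    assert(dir == 0 || dir == 1);
    kg->dir = dir;
    kg->round = 0;    /* counter is reset; the master key is kept */
}

/* Toy round key (NOT the DES key schedule): a value that depends on
 * the master key and on the round index selected by the direction. */
static uint32_t toy_round_key(const keygen_t *kg)
{
    int n = kg->dir ? 15 - kg->round : kg->round;
    uint64_t k = kg->master * (2u * (uint64_t)n + 1u);
    return (uint32_t)(k >> 16);
}

/* Toy f function (NOT the DES f function). Each call consumes one
 * round key and advances the key generator to the next round. */
uint32_t desf_model(keygen_t *kg, uint32_t r)
{
    uint32_t x = r ^ toy_round_key(kg);
    kg->round = (kg->round + 1) & 15;
    return ((x << 7) | (x >> 25)) + 0x9E3779B9u;
}

/* Sixteen rounds: one desf, one XOR, and two moves per round,
 * followed by the half swap that the final permutation operands
 * perform in the assembly routines. */
void des_rounds_model(keygen_t *kg, int dir, uint32_t *l, uint32_t *r)
{
    desdir_model(kg, dir);
    for (int i = 1; i <= 16; i++) {
        uint32_t temp = *r;
        *r = *l ^ desf_model(kg, *r);
        *l = temp;
    }
    uint32_t t = *l; *l = *r; *r = t;
}
```

Because a Feistel network is inverted by reversing the round key order, calling des_rounds_model with dir set to 1 undoes a previous call with dir set to 0 without re-loading the master key, regardless of which f function is used.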
5.1.5 New DES and Triple-DES Algorithm Implementations
The following figures show instruction sequences that may be used to implement
the DES and Triple-DES algorithms for encryption and decryption. All operand
names are symbolic and do not represent the names of physical registers of the LEON2
processor.
desipl %[ptextl], %[ptextr], %[l]
desipr %[ptextl], %[ptextr], %[r]
deskey %[keyl], %[keyr]
desdir 0
mov 1, %[i]
round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ctextl]
desfpr %[r], %[l], %[ctextr]
Figure 10: DES encryption routine with custom instructions
desipl %[ctextl], %[ctextr], %[l]
desipr %[ctextl], %[ctextr], %[r]
deskey %[keyl], %[keyr]
desdir 1
mov 1, %[i]
round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ptextl]
desfpr %[r], %[l], %[ptextr]
Figure 11: DES decryption routine with custom instructions
desipl %[ptextl], %[ptextr], %[l]
desipr %[ptextl], %[ptextr], %[r]
deskey %[key1l], %[key1r]
desdir 0
mov 1, %[i]
d1round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d1round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key2l], %[key2r]
desdir 1
mov 1, %[i]
d2round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d2round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key3l], %[key3r]
desdir 0
mov 1, %[i]
d3round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d3round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ctextl]
desfpr %[r], %[l], %[ctextr]
Figure 12: Triple-DES encryption routine with custom instructions
desipl %[ctextl], %[ctextr], %[l]
desipr %[ctextl], %[ctextr], %[r]
deskey %[key3l], %[key3r]
desdir 1
mov 1, %[i]
d1round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d1round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key2l], %[key2r]
desdir 0
mov 1, %[i]
d2round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d2round
add %[i], 1, %[i]
mov %[r], %[temp]
mov %[l], %[r]
mov %[temp], %[l]
deskey %[key1l], %[key1r]
desdir 1
mov 1, %[i]
d3round:
mov %[r], %[temp]
desf %[r], %[r]
xor %[l], %[r], %[r]
mov %[temp], %[l]
cmp %[i], 16
blu d3round
add %[i], 1, %[i]
desfpl %[r], %[l], %[ptextl]
desfpr %[r], %[l], %[ptextr]
Figure 13: Triple-DES decryption routine with custom instructions
5.2 IDEA
5.2.1 Multiplication modulo 2^16 + 1
Instruction Syntax:
mmul16 rs1, rs2, rd
Instruction Encoding:
op  rd  op3     rs1  i  asi     rs2
10  rd  101101  rs1  0  unused  rs2
This instruction calculates rs1 × rs2 mod (2^16 + 1) and stores the product
in the rd register. Both sources must be in the lower sixteen bits of their respective
registers. The 16-bit product is stored in the lower sixteen bits of the rd register.
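A minimal software model of this operation is sketched below. It assumes the usual IDEA convention that the 16-bit value zero represents 2^16; whether the mmul16 hardware handles the operands in exactly this way is an assumption of the sketch, and the name mmul16_model is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplication modulo 2^16 + 1 = 65537, where the 16-bit value 0
 * stands for 2^16 (assumed IDEA convention). */
uint16_t mmul16_model(uint16_t a, uint16_t b)
{
    uint32_t x = a ? a : 0x10000u;
    uint32_t y = b ? b : 0x10000u;
    uint32_t p = (uint32_t)(((uint64_t)x * y) % 0x10001u);

    /* 65537 is prime and neither operand is a multiple of it,
     * so the product lies in the range 1..2^16. */
    assert(p >= 1u && p <= 0x10000u);

    return (uint16_t)p;  /* a result of 2^16 is truncated to 0 */
}
```

For example, mmul16_model(0, 0) returns 1, since 2^16 is congruent to -1 modulo 2^16 + 1. A software implementation needs a multiply, a modular reduction, and the zero checks on every call, which is the work the two-cycle instruction replaces.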
5.2.2 New IDEA Algorithm Implementation
t0_1 = in[0]; t0_2 = in[1]; t0_3 = in[2]; t0_4 = in[3];
for(i=0;i
5.3 AES
5.3.1 SubBytes Operations
Instruction Syntax:
aessb rs1, imm, rd
aessb4 rs1, imm, rd
Instruction Encoding:
op  rd  op3     rs1  i  simm13
10  rd  101100  rs1  1  see below
These instructions perform the AES SubBytes and InvSubBytes operations
on either one (aessb) or all four (aessb4) of the bytes in the rs1 register. Only one
of these two instructions may be implemented in the hardware, not both.
The value specified in the simm13 field determines the actual operation
performed. The least significant bit is set to zero for SubBytes, or one for InvSubBytes.
Bits 5 and 4 together indicate the byte to be substituted by
the aessb instruction, as shown in Table 12. These bits are not used by the aessb4
instruction.
simm13[5:4]  Substituted byte
00           rs1[31:24]
01           rs1[23:16]
10           rs1[15:8]
11           rs1[7:0]
Table 12: Usage of the simm13 field by the aessb and aessb4 instructions
Bits [12:6] and [3:1] of simm13 are ignored by both the aessb and aessb4
instructions.
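The decoding of the simm13 field can be sketched as a software model of aessb. The name aessb_model is hypothetical, and the sketch assumes that the three bytes not selected pass through to rd unchanged, which the usage in Figure 15 suggests but which is not stated explicitly above. The S-box is computed from its definition (inverse in GF(2^8) followed by the affine transformation) rather than taken from a table.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* Forward S-box: multiplicative inverse (0 maps to 0), then the
 * affine transformation with constant 0x63. */
static uint8_t sbox(uint8_t x)
{
    uint8_t inv = 0;
    for (int c = 1; c < 256 && x; c++)
        if (gf_mul(x, (uint8_t)c) == 1) { inv = (uint8_t)c; break; }
    return (uint8_t)(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2) ^
                     rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63);
}

/* Inverse S-box by exhaustive search over the forward S-box. */
static uint8_t inv_sbox(uint8_t y)
{
    for (int x = 0; x < 256; x++)
        if (sbox((uint8_t)x) == y) return (uint8_t)x;
    return 0;  /* not reached: the S-box is a bijection */
}

/* simm13[0]: 0 = SubBytes, 1 = InvSubBytes.
 * simm13[5:4]: byte select per Table 12. Other bits are ignored. */
uint32_t aessb_model(uint32_t rs1, unsigned simm13)
{
    assert(simm13 < (1u << 13));
    unsigned sel   = (simm13 >> 4) & 3u;
    unsigned shift = 24u - 8u * sel;
    uint8_t  b     = (uint8_t)(rs1 >> shift);
    uint8_t  s     = (simm13 & 1u) ? inv_sbox(b) : sbox(b);
    return (rs1 & ~(0xFFu << shift)) | ((uint32_t)s << shift);
}
```

With this model, an immediate of 0x10 substitutes rs1[23:16] using SubBytes, while 0x31 substitutes rs1[7:0] using InvSubBytes.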
5.3.2 GF(2^m) Matrix Multiplier Constant Loading
Instruction Syntax:
gfmkld rs1,rs2
Instruction Encoding:
op  rd     op3     rs1  i  asi     rs2
10  00000  011001  rs1  0  unused  rs2
The gfmkld instruction is used to load one of the sixteen constants into
the constant matrix of the Galois Field fixed field constant matrix multiplier. The
constant matrix has the following structure:
K00 K01 K02 K03
K10 K11 K12 K13
K20 K21 K22 K23
K30 K31 K32 K33
The first constants to be loaded are those in the first row, from K00 to K03.
The loading process continues for each row in descending order, from left to right,
until the last constant K33 has been loaded. Due to the logic that has been added to
the multiplier for inclusion into the LEON2 processor datapath (see Section 6.1.6),
instances of the gfmkld instruction may not be issued consecutively.
5.3.3 GF(2^m) Matrix Multiplication
Instruction Syntax:
gfmmul rs1, rd
Instruction Encoding:
op  rd  op3     rs1  i  asi     rs2
10  rd  011101  rs1  0  unused  00000
Perform the Galois Field fixed field constant matrix multiplication on the
input in the rs1 register and store the result in the rd register.
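As an illustration of what the matrix multiplication computes when the constant matrix holds the AES MixColumns coefficients, a minimal software model follows. The names gf_matrix_t and gfmmul_model are hypothetical, and the sketch assumes that the field is GF(2^8) with the AES reduction polynomial and that rs1[31:24] is the first element of the input column.

```c
#include <assert.h>
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8+x^4+x^3+x+1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1) p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

/* Hypothetical model of the 4x4 constant matrix K00..K33. */
typedef struct { uint8_t k[4][4]; } gf_matrix_t;

/* out[i] = XOR over j of K[i][j] * in[j], with in[0] = rs1[31:24]
 * (assumed byte order). */
uint32_t gfmmul_model(const gf_matrix_t *m, uint32_t rs1)
{
    assert(m);
    uint32_t rd = 0;
    for (int i = 0; i < 4; i++) {
        uint8_t acc = 0;
        for (int j = 0; j < 4; j++)
            acc ^= gf_mul(m->k[i][j], (uint8_t)(rs1 >> (24 - 8 * j)));
        rd |= (uint32_t)acc << (24 - 8 * i);
    }
    return rd;
}

/* MixColumns and InvMixColumns constants from the AES specification. */
const gf_matrix_t MIXCOLUMNS = {{
    {2, 3, 1, 1}, {1, 2, 3, 1}, {1, 1, 2, 3}, {3, 1, 1, 2}
}};
const gf_matrix_t INV_MIXCOLUMNS = {{
    {14, 11, 13, 9}, {9, 14, 11, 13}, {13, 9, 14, 11}, {11, 13, 9, 14}
}};
```

Loading a different set of sixteen constants, as gfmkld permits, switches the same multiplier between MixColumns for encryption and InvMixColumns for decryption.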
5.3.4 New AES Algorithm Implementations
// First add round key
state[0] = plaintext[0] ^ key_schedule[0][0];
state[1] = plaintext[1] ^ key_schedule[0][1];
state[2] = plaintext[2] ^ key_schedule[0][2];
state[3] = plaintext[3] ^ key_schedule[0][3];
// The nine rounds
for (i = 1; i < Nr; i++)
{
// SubBytes + ShiftRows
asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x10, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x20, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x30, %[s3]\n\t" : [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
asm( "aessb %[s1], 0x00, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x10, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x20, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x30, %[s0]\n\t" : [s0] "+r" (state[0]) );
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
asm( "aessb %[s2], 0x00, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x10, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x20, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x30, %[s1]\n\t" : [s1] "+r" (state[1]) );
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
asm( "aessb %[s3], 0x00, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x10, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x20, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x30, %[s2]\n\t" : [s2] "+r" (state[2]) );
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// MixColumns
asm(
"gfmmul %[t0], %[s0]\n\t"
"gfmmul %[t1], %[s1]\n\t"
"gfmmul %[t2], %[s2]\n\t"
"gfmmul %[t3], %[s3]\n\t"
: [s0] "=&r" (state[0]) , [s1] "=&r" (state[1]) , [s2] "=&r" (state[2]) , [s3] "=&r" (state[3])
: [t0] "r" (tmp[0]) , [t1] "r" (tmp[1]) , [t2] "r" (tmp[2]) , [t3] "r" (tmp[3])
);
// Add round key
state[0] = state[0] ^ key_schedule[i][0];
state[1] = state[1] ^ key_schedule[i][1];
state[2] = state[2] ^ key_schedule[i][2];
state[3] = state[3] ^ key_schedule[i][3];
}
// Final round
// SubBytes + ShiftRows
asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x10, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x20, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x30, %[s3]\n\t" : [s3] "+r" (state[3]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[1] & 0x00FF0000 ;
tmp[0] |= state[2] & 0x0000FF00 ;
tmp[0] |= state[3] & 0x000000FF ;
asm( "aessb %[s1], 0x00, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x10, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x20, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x30, %[s0]\n\t" : [s0] "+r" (state[0]) );
tmp[1] = state[1] & 0xFF000000 ;
tmp[1] |= state[2] & 0x00FF0000 ;
tmp[1] |= state[3] & 0x0000FF00 ;
tmp[1] |= state[0] & 0x000000FF ;
asm( "aessb %[s2], 0x00, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s3], 0x10, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x20, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x30, %[s1]\n\t" : [s1] "+r" (state[1]) );
tmp[2] = state[2] & 0xFF000000 ;
tmp[2] |= state[3] & 0x00FF0000 ;
tmp[2] |= state[0] & 0x0000FF00 ;
tmp[2] |= state[1] & 0x000000FF ;
asm( "aessb %[s3], 0x00, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s0], 0x10, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s1], 0x20, %[s1]\n\t" : [s1] "+r" (state[1]) );
asm( "aessb %[s2], 0x30, %[s2]\n\t" : [s2] "+r" (state[2]) );
tmp[3] = state[3] & 0xFF000000 ;
tmp[3] |= state[0] & 0x00FF0000 ;
tmp[3] |= state[1] & 0x0000FF00 ;
tmp[3] |= state[2] & 0x000000FF ;
// Add round key
ciphertext[0] = tmp[0] ^ key_schedule[Nr][0];
ciphertext[1] = tmp[1] ^ key_schedule[Nr][1];
ciphertext[2] = tmp[2] ^ key_schedule[Nr][2];
ciphertext[3] = tmp[3] ^ key_schedule[Nr][3];
}
Figure 15: AES encryption routine with aessb and gfmmul instructions
// AddRoundKey
state[0] = ciphertext[0] ^ key_schedule[Nr][0];
state[1] = ciphertext[1] ^ key_schedule[Nr][1];
state[2] = ciphertext[2] ^ key_schedule[Nr][2];
state[3] = ciphertext[3] ^ key_schedule[Nr][3];
for (i = Nr-1; i > 0; i--)
{
// InvShiftRows + InvSubBytes
asm( "aessb %[s0], 0x01, %[s0]\n\t" : [s0] "+r" (state[0]) );
asm( "aessb %[s3], 0x11, %[s3]\n\t" : [s3] "+r" (state[3]) );
asm( "aessb %[s2], 0x21, %[s2]\n\t" : [s2] "+r" (state[2]) );
asm( "aessb %[s1], 0x31, %[s1]\n\t" : [s1] "+r" (state[1]) );
tmp[0] = state[0] & 0xFF000000 ;
tmp[0] |= state[3] & 0x00FF0000 ;
tmp[0] |=