
INSTRUCTION SET EXTENSIONS FOR ENHANCING THE PERFORMANCE OF SYMMETRIC KEY CRYPTOGRAPHIC ALGORITHMS

BY

SEAN R. O'MELIA
BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)

SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING

DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING
UNIVERSITY OF MASSACHUSETTS LOWELL

Signature of Author    Date

Dr. Adam J. Elbirt, Thesis Advisor

Prof. George P. Cheney, Thesis Committee Member

Dr. Dalila B. Megherbi, Thesis Committee Member


ABSTRACT OF A THESIS SUBMITTED TO THE FACULTY OF THE DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING

UNIVERSITY OF MASSACHUSETTS LOWELL

2007

Thesis Advisor: Dr. Adam J. Elbirt
Assistant Professor, Department of Computer Science

ABSTRACT

In this thesis, instruction set extensions for a RISC processor are presented to improve the software performance of the Data Encryption Standard (DES), Triple-DES, International Data Encryption Algorithm (IDEA), and Advanced Encryption Standard (AES) algorithms. The most computationally intensive operations of each algorithm are handled by a set of new instructions. The hardware supporting these instructions is integrated into the processor's datapath. For each of the targeted algorithms, comparisons are presented between traditional software implementations and new implementations that take advantage of the extended instruction set architecture. Results show that utilization of the proposed instructions significantly reduces program code size and improves encryption and decryption throughput. The additional hardware resources required by all of the custom hardware increase the total area of the processor by less than fifty percent.


ACKNOWLEDGEMENTS

There are several people I wish to thank for their assistance and support

in the completion of this thesis. I would like to express many thanks to my advisor,

Dr. Adam J. Elbirt, who has been an excellent guide throughout all stages of the

research, and Prof. George Cheney and Dr. Dalila Megherbi for their membership

on the defense committee. I received a great deal of support on technical matters

from Gaisler Research, the creator of the LEON2 processor, and the members of the

Instruction Set Extensions for Cryptography Project at Graz University of Technol-

ogy. Their advice was most helpful for understanding the LEON2 model and how the

processor architecture can be extended. I would also like to thank all of my friends

and family, who have encouraged me throughout the course of my work.


Contents

List of Figures
List of Tables
1 INTRODUCTION
2 PREVIOUS WORK
3 THE LEON2 PROCESSOR
3.1 Overview
3.2 VHDL Model Hierarchy
3.3 Processor Architecture
3.4 SPARC® V8 Instruction Model
3.5 Customization
3.6 Synthesis and Simulation
3.7 Software Development Tools
4 TARGET ALGORITHMS
4.1 Triple-DES
4.1.1 The DES Algorithm
4.1.2 DES Key Schedule
4.1.3 The Triple-DES Algorithm
4.1.4 Triple-DES Key Schedule
4.1.5 Performance
4.2 IDEA
4.2.1 Mathematical Background
4.2.2 Algorithm Description
4.2.3 Key Schedule
4.2.4 Performance
4.3 AES
4.3.1 Mathematical Background
4.3.2 Algorithm Description
4.3.3 Key Schedule
4.3.4 Performance
4.4 Modes of Operation
5 PROPOSED INSTRUCTION SET EXTENSIONS
5.1 DES and Triple-DES
5.1.1 Initial and Final Permutations
5.1.2 Set Encryption Direction
5.1.3 Key Loading
5.1.4 Round Core (f) Function
5.1.5 New DES and Triple-DES Algorithm Implementations
5.2 IDEA
5.2.1 Multiplication modulo 2^16 + 1
5.2.2 New IDEA Algorithm Implementation
5.3 AES
5.3.1 SubBytes Operations
5.3.2 GF(2^m) Matrix Multiplier Constant Loading
5.3.3 GF(2^m) Matrix Multiplication
5.3.4 New AES Algorithm Implementations
6 LEON2 HARDWARE AND SOFTWARE TOOLCHAIN MODIFICATIONS
6.1 Custom Hardware Units
6.1.1 DES Permutation Unit
6.1.2 DES Round f-function Unit
6.1.3 DES Key Generator
6.1.4 Modulo (2^16 + 1) Multiplier
6.1.5 AES S-Boxes
6.1.6 Galois Field Fixed Field Constant Multiplier
6.2 Architecture Modifications
6.3 Modifications to Software Development Tools
7 RESULTS AND ANALYSIS
7.1 Testing Methodology
7.2 Software Code Size
7.3 Algorithm Execution Times
7.4 Hardware Utilization
7.5 Throughput to Area Comparisons
8 CONCLUSIONS AND FUTURE WORK
REFERENCES
Appendix A: VHDL Source for Custom Functional Units
Appendix B: Modifications to LEON2 VHDL Model and Development Tools
Appendix C: Test Vectors for Functional Evaluation
Appendix D: Example Source Code for Functional and Performance Evaluations
About the Author

List of Figures

1 Structure of SPARC® V8 Format 3 instructions
2 Block diagram for standard block ciphers
3 The Data Encryption Standard algorithm
4 The DES f-function
5 The computation graph for IDEA
6 The AES encryption process
7 The state representation of data blocks in AES
8 The AES decryption process
9 The key expansion process for AES
10 DES encryption routine with custom instructions
11 DES decryption routine with custom instructions
12 Triple-DES encryption routine with custom instructions
13 Triple-DES decryption routine with custom instructions
14 IDEA algorithm routine with custom instructions
15 AES encryption routine with aessb and gfmmul instructions
16 AES decryption routine with aessb and gfmmul instructions
17 AES encryption routine with aessb4 and gfmmul instructions
18 AES decryption routine with aessb4 and gfmmul instructions
19 DES permutation unit
20 DES key generator

List of Tables

1 LEON2 VHDL Model File Hierarchy
2 The Initial Permutation IP
3 The Expansion Operation E
4 The DES S-Boxes
5 The Pre-Output Permutation P
6 The Final Permutation FP
7 Permuted Choice 1 (PC-1)
8 Rotations for the DES key schedule
9 Permuted Choice 2 (PC-2)
10 IDEA key schedule
11 Interpretation of the asi field for the DES permutation instructions
12 Usage of the simm13 field by the aessb and aessb4 instructions
13 Code size in bytes for DES
14 Code size in bytes for Triple-DES
15 Code size in bytes for IDEA
16 Code size in bytes for AES without gfmmul instruction
17 Code size in bytes for AES with gfmmul instruction
18 Execution cycles for DES
19 Execution cycles for Triple-DES
20 Execution cycles for IDEA
21 Execution cycles for AES without gfmmul instruction
22 Execution cycles for AES with gfmmul instruction
23 Comparison with ISEC Extensions in C/Inline Assembly
24 Comparison with ISEC Extensions in Pure Assembly
25 Hardware utilization on the Xilinx XC4VLX25 FPGA
26 Throughput to area ratios for algorithm implementations

1 INTRODUCTION

With more than 188 million Americans connected to the Internet [1], information security has become a top priority. Many applications (electronic mail, electronic banking, medical databases, and electronic commerce) require the exchange of private information. For example, when engaging in electronic commerce, customers provide credit card numbers when purchasing products. If the connection is not secure, an attacker can easily obtain this sensitive data. In order to implement a comprehensive security plan for a given network to guarantee the security of a connection, the following services must be provided [2], [3], [4]:

Confidentiality: Information cannot be observed by an unauthorized party. This is accomplished via public-key and private-key encryption.

Data Integrity: Transmitted data within a given communication cannot be altered in transit due to error or an unauthorized party. This is accomplished via the use of hash functions and Message Authentication Codes.

Authentication: Parties within a given communication session must provide certifiable proof of their identity. This is accomplished via the use of digital signatures.

Non-repudiation: Neither the sender nor the receiver of a message may deny transmission. This is accomplished via digital signatures and third party notary services.

Cryptographic algorithms used to ensure confidentiality fall within one of two categories: private-key (also known as symmetric-key) and public-key. Symmetric-key algorithms use the same key for both encryption and decryption. Conversely, public-key algorithms use a public key for encryption and a private key for decryption. In a typical session, a public-key algorithm will be used for the exchange of a session key and to provide authenticity through digital signatures. The session key is then used in conjunction with a symmetric-key algorithm. Symmetric-key algorithms tend to be significantly faster than public-key algorithms and as a result are typically used in bulk data encryption [3]. The two types of symmetric-key algorithms are block ciphers and stream ciphers. Block ciphers operate on a block of data while stream ciphers encrypt individual bits. Block ciphers are typically used when performing bulk data encryption, and the data transfer rate of the connection directly follows the throughput of the implemented algorithm.

High throughput encryption and decryption are becoming increasingly important in the area of high-speed networking. Many applications demand the creation of networks that are both private and secure while using public data-transmission links. These systems, known as Virtual Private Networks (VPNs), can demand encryption throughputs at speeds exceeding Asynchronous Transfer Mode (ATM) rates of 622 million bits per second (Mbps). Increasingly, security standards and applications are defined to be algorithm independent. Although context switching between algorithms can be easily realized via software implementations, the task is significantly more difficult when using hardware implementations. The advantages of a software implementation include ease of use, ease of upgrade, ease of design, portability, and flexibility. However, a software implementation offers only limited physical security, especially with respect to key storage [3], [5]. Conversely, cryptographic algorithms that are implemented in hardware are by nature more physically secure as they cannot easily be read or modified by an outside attacker when the key is stored in special memory internal to the device [5]. As a result, the attacker does not have easy access to the key storage area and cannot discover or alter its value in a straightforward manner [3].

When using a general-purpose processor, even the fastest software implementations of block ciphers cannot satisfy the required bulk data encryption data rates for high-end applications [6], [7], [8], [9], [10]. As a result, hardware implementations are necessary for block ciphers to achieve this required performance level. Although traditional hardware implementations lack flexibility with respect to algorithm and parameter switching, configurable hardware devices offer a promising alternative for the implementation of processors via the use of IP cores in Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA) technology. To illustrate, Altera Corporation offers IP core implementations of the Intel 8051 microcontroller and the Motorola 68000 processor in addition to their own Nios® II embedded processor [11]. Similarly, Xilinx Inc. offers IP core implementations of the PowerPC processor in addition to their own MicroBlaze™ and PicoBlaze™ embedded processors [12]. ASIC and FPGA technologies provide the opportunity to augment the existing datapath of a processor implemented via an IP core to add acceleration modules supported through newly defined instruction set extensions targeting performance-critical functions [13], [14], [15]. Moreover, many licensable and extendible processor cores are also available for the same purpose [16], [17], [18], [19].

The use of instruction set extensions follows the hardware/software co-design paradigm to achieve the performance and physical security associated with hardware implementations while providing the portability and flexibility traditionally associated with software implementations [20]. Moreover, when considering alternative solutions, instruction set extensions result in significant performance improvements versus traditional software implementations with considerably reduced logic resource requirements versus hardware-only solutions such as co-processors [21], [22], [23], [24], [25], [26], [27], [28], [29]. It is the goal of this research to demonstrate a set of instruction set extensions for a reduced instruction set computing (RISC) processor that enhance the performance of symmetric-key algorithms in software implementations.

Chapter 2 discusses related work on the various methods of speeding up

symmetric-key algorithms in software. These include optimization of pure software

implementations, off-loading of cryptographic algorithm execution to co-processors,

and other instruction set extensions. It will be shown that advances in technology

have fueled trends towards increased reconfigurability in embedded systems, resulting

in instruction set extensions becoming a more viable and attractive option when the

performance of symmetric-key algorithms is critical.

Chapter 3 describes the target processor, the LEON2 RISC processor, whose architecture is based on the SPARC® architecture. The LEON2 processor was chosen for its robust design, ease of configurability, and customization through the full availability of the model's hardware description language (HDL) source code.

In Chapter 4, the cryptographic algorithms that are the focus of this re-

search are explained. These include the Data Encryption Standard (DES) and Triple-

DES, the International Data Encryption Algorithm (IDEA), and the Advanced En-

cryption Standard (AES). The various performance bottlenecks that are commonly

encountered in software implementations of these algorithms are also discussed.

Syntax and encoding of the newly developed instructions are presented in Chapter 5, followed by a description of the modifications to the LEON2 processor and its associated development tools in Chapter 6. To evaluate the effectiveness of the instruction set extensions, Chapter 7 presents data on the logic utilization of the custom hardware, as well as throughput data for the target algorithms both with and without the use of the custom instructions. The thesis concludes in Chapter 8 with recommendations for investigating further architectural enhancements to the LEON2 processor for supporting symmetric-key algorithms.

2 PREVIOUS WORK

Most traditional methods for improving the throughput of pure software

implementations of symmetric-key algorithms fall into one of two categories. One

option is to construct memory-based look-up tables where results of some of the basic

operations of the algorithm have been pre-computed and stored. The substitution

boxes, or S-Boxes, of the DES and AES algorithms are commonly stored in look-up

tables in software implementations. Look-up tables may also be used to combine

operations used in the DES and AES algorithms. An implementation of DES in [30]

combines the S-Box table look-up with the subsequent 32-bit permutation and uses

tables as part of the Initial and Final Permutations. The AES algorithm requires

several complicated mathematical operations that are time-consuming on general-

purpose processors. Therefore in some implementations, large look-up tables, called

T-tables, are employed that combine several of these complex operations into a single

table access [20]. A look-up table based implementation is a viable option for systems

with a large memory space and low memory access times. However, area-constrained

systems suffer large performance penalties under these implementations [20], [29],

thus they are generally not employed in those environments.
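The table-combining idea described above can be illustrated with a small sketch. It folds a substitution step and a fixed bit permutation into a single pre-computed table, so that one memory access replaces two separate operations. The 4-bit substitution and the permutation used here are toy values invented for this example (they are not the DES or AES tables), and all names are hypothetical.

```python
# Toy illustration of folding a substitution and a bit permutation into one
# look-up table. TOY_SBOX and TOY_PERM are invented values, not DES or AES data.

TOY_SBOX = [(7 * x + 3) % 16 for x in range(16)]   # arbitrary 4-bit bijection
TOY_PERM = [2, 0, 3, 1]   # output bit i (bit 0 = LSB) takes input bit TOY_PERM[i]


def permute4(x):
    """Rearrange the four bits of x according to TOY_PERM."""
    out = 0
    for i, src in enumerate(TOY_PERM):
        out |= ((x >> src) & 1) << i
    return out


def slow_path(x):
    """Substitution followed by a bit-by-bit permutation."""
    return permute4(TOY_SBOX[x])


# Pre-compute the composition once; afterwards each input costs one table access.
COMBINED = [permute4(TOY_SBOX[x]) for x in range(16)]


def fast_path(x):
    """Single look-up that yields the same result as slow_path."""
    return COMBINED[x]
```

The trade-off is the one noted in the text: the combined table removes per-call computation but costs memory, and that cost grows quickly as the input width increases.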

Another method for speeding up software implementations of cryptographic algorithms involves taking advantage of mathematical or structural properties of the particular algorithm. The Initial and Final Permutations of the DES algorithm have regular structures that make it possible to execute a series of matrix transformations and exclusive-OR operations as demonstrated in [31]. This translates into a sequence of instructions that is much smaller than the traditional sequence required to perform the Initial and Final Permutations. In previous work on improving the performance of the AES algorithm on 32-bit systems, it has been shown that transforming a block of plaintext from a column-oriented matrix to a row-oriented matrix reduces the number of instructions required to complete the cipher rounds. In particular, a row-oriented representation allows for a more efficient implementation of the Galois Field matrix multiplication operations required for AES encryption and decryption [32].
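For readers unfamiliar with the Galois Field matrix multiplication mentioned above, the following is a minimal reference sketch. It multiplies bytes in GF(2^8) using the AES reduction polynomial x^8 + x^4 + x^3 + x + 1 (0x11B) and applies the MixColumns matrix, both taken from the AES specification, to a single column. The function names are hypothetical, and this is the plain column-wise form, not the row-oriented optimization of [32].

```python
# Reference sketch of GF(2^8) arithmetic as used by AES (polynomial 0x11B).
# Function names are illustrative only.

def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) by shift-and-add with modular reduction."""
    result = 0
    for _ in range(8):
        if b & 1:
            result ^= a          # addition in GF(2^8) is XOR
        b >>= 1
        a <<= 1
        if a & 0x100:            # degree-8 term appeared: reduce
            a ^= 0x11B
    return result


# MixColumns coefficient matrix from the AES specification.
MIX = [[2, 3, 1, 1],
       [1, 2, 3, 1],
       [1, 1, 2, 3],
       [3, 1, 1, 2]]


def mix_column(col):
    """Multiply the 4x4 matrix MIX by one 4-byte column over GF(2^8)."""
    out = []
    for row in MIX:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)
        out.append(acc)
    return out
```

Each output byte needs four field multiplications and three XORs, which is why software implementations either restructure the data or replace the arithmetic with table look-ups.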

In order to extend the cryptographic capabilities of an embedded system without modifying the main processor, a co-processor solution can be adopted. When there is data that must be encrypted or decrypted through the chosen symmetric-key algorithm, the main processor sends the data and key material to the co-processor, and the co-processor performs the algorithm, sending the processed data back over the interface to the main processor. Most co-processor solutions have tended to combine a number of different algorithms to provide a multi-faceted security solution. Co-processors have generally achieved high throughput values compared to traditional software implementations and therefore are much more capable of meeting demands for speed-critical network communications. However, this type of solution is generally associated with considerable overhead in terms of hardware area utilization, data transfer latency, and complex interfaces to the main processor [23], [33], [34], [35], [36], [37], [38], [39], [40].

There has been previous work on instruction set extensions for general permutations that are useful for improving the performance of permutations for the DES algorithm. Shi and Lee [41] presented two new instructions for general and dynamically specified permutations. The input and a string of configuration bits are specified in the source operands and the result is stored in the destination register. These instructions, along with two new instructions, are discussed in [21]. In general, permutations of n bits require log2(n) issues of the custom instructions, as well as several loads of configuration bits into registers. The MOSES platform developed by a group based at NEC Research Laboratories is based on the Xtensa T1040, a RISC-like processor designed to be easily extended with additional custom hardware and supporting instructions. Throughput improvement factors of 31.0 for DES, 33.9 for Triple-DES, and 17.4 for AES were reported for this custom architecture [42], [43].

Study of the effect of custom instructions that support the AES cipher is extensive. Most of this work targets the memory look-ups and multiplications that are needed to perform the encryption rounds and key schedule. The Instruction Set Extensions for Cryptography (ISEC) project conducted at the Graz University of Technology in Graz, Austria, has investigated instruction set extensions that perform the mathematical operations in the AES rounds using custom functional units integrated into the targeted processor's datapath [29]. Earlier work in the ISEC project demonstrated the effectiveness of instruction set extensions for elliptic curve cryptography [27] in improving the performance of binary extension Galois Field arithmetic [28].

3 THE LEON2 PROCESSOR

3.1 Overview

The target processor for this work is the LEON2, a RISC central processing

unit (CPU) that was produced by Gaisler Research [44] (note that at the time of

this writing, support for the LEON2 processor has been discontinued in favor of the

newer LEON3 processor model). The LEON2 processor is implemented in VHDL

and is fully synthesizable. The model is highly configurable, allowing for adjustments

to many features of the processor using a graphical configuration utility. The entire

source code is freely available under the GNU General Public License which enables

custom modifications and enhancements to the architecture. Information presented

in this chapter is derived from the LEON2 documentation [45], [46].

3.2 VHDL Model Hierarchy

The source code for the LEON2 processor has the directory structure shown

in Table 1. The top-level folder /leon2/ is used as an example; it is permissible for

the root directory to have any name.

3.3 Processor Architecture

LEON2 is based on the Scalable Processor Architecture (SPARC®). SPARC® was first developed in 1985 at Sun Microsystems and is based on the work that produced the RISC I and RISC II architectures at the University of California at Berkeley during the early 1980s [47]. The LEON2 processor attained full certification of compliance with the SPARC® V8 architecture in 2003 [48].

Folder            Description                        Refer to
leon2/            Top directory                      Sec. 3.2
leon2/boards/     FPGA board support files           Sec. 3.6
leon2/doc/        User manuals                       [46]
leon2/leon/       LEON2 processor VHDL model         Sec. 3.3
leon2/pmon/       Simple boot-monitor                Not discussed
leon2/sim/        Simulator support files            Sec. 3.6
leon2/syn/        Synthesis support files            Sec. 3.6
leon2/tbench/     LEON2 VHDL test bench              Sec. 3.6
leon2/tkconfig/   Graphical configuration utility    Sec. 3.5
leon2/tsource/    LEON2 test bench (C source)        Sec. 3.6

Table 1: LEON2 VHDL Model File Hierarchy

Features of the LEON2 processor coding style include fully synchronous design with a single clock, use of multiplexers for loading of pipeline registers, separate combinational and sequential processes, and record types for interconnection of component I/O signals. LEON2 provides support for on-chip peripherals such as a floating-point unit (FPU), Peripheral Component Interconnect (PCI), and Ethernet; co-processor support is also available in accordance with the SPARC® model. However, these features are outside the scope of this research and are therefore not discussed in any further detail. The main focus of the LEON2 architecture with regard to the proposed instruction set extensions is the pipelined integer unit (IU). The IU pipeline consists of five stages: fetch, decode, execute, memory, and write back.

The VHDL model implements each stage in its own process. A process statement in VHDL is a closed block of code whose statements run sequentially. The inputs are specified by a sensitivity list, and the process executes whenever a signal in the sensitivity list changes state. Processes are used for behavioral VHDL code, a high-level coding style used commonly for describing sequential logic [49].


3.4 SPARC® V8 Instruction Model

All SPARC® V8 instructions are implemented in the LEON2 processor architecture. Instructions are grouped according to the values of the various fields in the instruction operation code. Arithmetic, logic, and memory operations have the Format 3 structure [47] shown in Figure 1.

op | rd | op3 | rs1 | i=0 | asi | rs2
op | rd | op3 | rs1 | i=1 | simm13

Figure 1: Structure of SPARC® V8 Format 3 instructions
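To make the field layout concrete, the following sketch packs the Format 3 fields into a 32-bit instruction word. The field widths (op: 2 bits, rd: 5, op3: 6, rs1: 5, i: 1, asi: 8, rs2: 5, simm13: 13) are those defined in the SPARC V8 architecture manual [47]; the function names are hypothetical and are not part of any LEON2 tool.

```python
# Sketch of SPARC V8 Format 3 instruction encoding. Field widths follow the
# SPARC V8 manual; the helper names are invented for this illustration.

def encode_format3_reg(op, rd, op3, rs1, rs2, asi=0):
    """i = 0 form: the second operand is register rs2; bits 12..5 hold asi."""
    return (((op & 0x3) << 30) | ((rd & 0x1F) << 25) | ((op3 & 0x3F) << 19) |
            ((rs1 & 0x1F) << 14) | (0 << 13) | ((asi & 0xFF) << 5) |
            (rs2 & 0x1F))


def encode_format3_imm(op, rd, op3, rs1, simm13):
    """i = 1 form: the second operand is a 13-bit signed immediate."""
    return (((op & 0x3) << 30) | ((rd & 0x1F) << 25) | ((op3 & 0x3F) << 19) |
            ((rs1 & 0x1F) << 14) | (1 << 13) | (simm13 & 0x1FFF))


def decode_format3(word):
    """Split a 32-bit word back into its Format 3 fields."""
    fields = {
        "op": (word >> 30) & 0x3,
        "rd": (word >> 25) & 0x1F,
        "op3": (word >> 19) & 0x3F,
        "rs1": (word >> 14) & 0x1F,
        "i": (word >> 13) & 0x1,
    }
    if fields["i"]:
        fields["simm13"] = word & 0x1FFF
    else:
        fields["asi"] = (word >> 5) & 0xFF
        fields["rs2"] = word & 0x1F
    return fields
```

For example, `add %g1, %g2, %g3` uses op = 2, op3 = 0, rs1 = 1, rs2 = 2, and rd = 3, which this sketch encodes as 0x86004002.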

3.5 Customization

Most of the available features of the LEON2 processor can be enabled, disabled, or adjusted by using the graphical configuration utility. For the purposes of this work, a basic configuration is used with no FPU, PCI, Ethernet, co-processor interface, or hardware multiplier or divider. To extend the LEON2 architecture beyond the scope of the standard model, additional VHDL code is required. The specific files that must be modified depend on what functionality is to be added, but if the instruction set is to be extended, the module containing the SPARC® V8 opcode constants must be updated, and these instructions must follow the SPARC® V8 architecture specification [47]. The graphical configuration utility may also be modified to provide an easy interface for adjusting parameters of the custom functionality.


3.6 Synthesis and Simulation

The LEON2 VHDL implementation is a fully synthesizable processor that

can be targeted to any type of FPGA or ASIC technology. There are pre-made

packages for several synthesis tools such as XST, Synplify, Synopsys, and Leonardo

in the /syn/ sub-folder of the LEON2 directory structure. These packages enable use

of technology-specific cells to directly instantiate or automatically infer the register

files, caches, PCI FIFOs, and I/O pads. There are also a number of packages in the

/boards/ sub-folder that support programming of physical FPGA boards [50] with

the LEON2 architecture.

Functional verification of programs built for the LEON2 architecture can be

performed with the provided generic test bench. The VHDL source for the test bench

is located in the /tbench/ sub-folder of the LEON2 directory structure. Software code

is placed in the /tsource/ sub-folder in a format readable by the test bench VHDL

code. The software can then be read and executed by the test bench for purposes of

functional verification and performance evaluation.

3.7 Software Development Tools

In order to facilitate the development of programs targeting the LEON2

processor, Gaisler Research has provided a series of compilers and simulators that

may be chosen depending on the software environment. For stand-alone applications,

the Bare C Compiler (BCC) is recommended. BCC is based on the GNU Compiler

Collection (GCC) and GNU binutils. The BCC development tools are used in the

same way as those included in the standard GCC and binutils packages. The actual

names of the executables have a sparc-elf- prefix.


Packages containing the binaries for Linux and Cygwin environments are

available, as well as the full source code for developers who wish to take advantage

of an expanded LEON2 architecture.


4 TARGET ALGORITHMS

4.1 Triple-DES

4.1.1 The DES Algorithm

Many block ciphers may be characterized as Feistel networks [3]. Feistel networks were invented by Horst Feistel [51] and are a general method of transforming a function into a permutation. The basic Feistel network divides the data into two halves where one half operates upon the other [52]. The f-function uses one of the halves of the data block and a key to create a pseudo-random bit stream that is used to encrypt or decrypt the other half of the data block. Therefore, to encrypt or decrypt both halves requires two iterations of the Feistel network.
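The basic two-half structure can be sketched in a few lines. The round function `toy_f` below is invented purely for illustration and has no cryptographic strength; it is not the DES f-function, and the round keys in the usage note are arbitrary. The sketch shows that each round modifies only one half, and that running the same routine with the round keys in reverse order undoes the encryption.

```python
# Minimal sketch of a basic (two-half) Feistel network on 32-bit halves.
# toy_f is a made-up round function for illustration; it is NOT the DES f-function.

MASK32 = 0xFFFFFFFF


def toy_f(half, key):
    """Mix one half of the block with a round key (illustrative only)."""
    x = (half + key) & MASK32
    rotated = ((x << 5) | (x >> 27)) & MASK32
    return rotated ^ key


def feistel(left, right, round_keys):
    """Run one pass of the network; the same routine encrypts and decrypts."""
    for k in round_keys:
        # Only one half is transformed per round; the halves then swap roles.
        left, right = right, left ^ toy_f(right, k)
    # Undo the last swap so that decryption can reuse this function unchanged.
    return right, left
```

With hypothetical keys `keys = [0x0F1E2D3C, 0x4B5A6978, 0x8796A5B4, 0xC3D2E1F0]`, calling `feistel(*feistel(l, r, keys), keys[::-1])` returns the original `(l, r)`. Note that `toy_f` never needs to be inverted, which is what allows an arbitrary function to be turned into a permutation.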

A generalization of the basic Feistel network allows for the support of larger data blocks. Generalization occurs by considering the data swap as a circular right shift. This allows for the use of the same f-function but requires multiple rounds to input all of the sub-blocks to the f-function [53]. Figure 2 from [53] details the block diagram for block ciphers employing both the basic Feistel network and generalized Feistel networks of three and four blocks. The f-function is represented by the box and the ⊕ symbol represents a bit-wise XOR operation.

Figure 2: Block diagram for standard block ciphers

The f-function employs confusion and diffusion to obscure redundancies in a plaintext message [54]. Confusion obscures the relationship between the plaintext, the ciphertext, and the key. S-Box look-up tables are an example of a confusion operation. Diffusion spreads the influence of individual plaintext or key bits over as much of the ciphertext as possible. Expansion and permutation functions are examples of diffusion operations [3]. The basic operations that may be found within an f-function include:

Bitwise XOR, AND, or OR.

Modular addition or subtraction.

Shift or rotation by a constant number of bits.

Data-dependent rotation by a variable number of bits.

Modular multiplication.

Multiplication in a Galois field.

Modular inversion.

Look-up-table substitution.

DES is a sixteen-round Feistel Network block cipher. A block diagram of

the entire operation is given in Figure 3 from [55]. The DES cipher takes as input a

64-bit key, where 8 of the 64 bits are used for parity and the other 56 bits comprise

the actual key material. The input and output both have a size of 64 bits for both

encryption and decryption. The procedures for encryption and decryption are almost

exactly the same; the only difference is that the key schedule for decryption is the

reverse of that used for encryption.

Figure 3: The Data Encryption Standard algorithm

Throughout the rest of this section, bit ordering is denoted for an n-bit

vector such that bit 1 is the most significant bit and n is the least significant bit.

In all Figures that show the bit assignments for DES permutations, the numbers

correspond to input bits that are mapped to a specific position in the output, starting

with output bits 1,2,3,... in the top row and ending with output bits ...,n-2,n-1,n in

the bottom row.

The first part of DES encryption is an Initial Permutation (IP) on the input

block. The IP rearranges the input according to Table 2. The output of the IP is

divided into a left half $L_0$ and a right half $R_0$, which become the input to the first

round. For each round iteration i from 1 to 16:

$$L_i = R_{i-1}$$

$$R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$$
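The round equations above can be illustrated with a minimal sketch. The round function `toy_f`, the key values, and the function names are placeholders chosen for illustration only; they are not the DES f-function or a real key schedule. The sketch shows that running the same network with the round keys in reverse order recovers the input.

```python
MASK32 = 0xFFFFFFFF

def toy_f(right, key):
    """Placeholder round function (NOT the DES f-function)."""
    x = (right ^ key) & MASK32
    x = (x * 0x9E3779B1 + 0x7F4A7C15) & MASK32
    return x ^ (x >> 15)

def feistel(left, right, round_keys):
    """Apply L_i = R_{i-1}, R_i = L_{i-1} XOR f(R_{i-1}, K_i) once per key."""
    for k in round_keys:
        left, right = right, left ^ toy_f(right, k)
    # Swap the halves after the last round, as DES does before its
    # Final Permutation; this makes the network its own inverse.
    return right, left

# Sixteen arbitrary example round keys (hypothetical values).
keys = [0x0F1E2D3C + 0x01010101 * i for i in range(16)]

ct = feistel(0x01234567, 0x89ABCDEF, keys)
pt = feistel(ct[0], ct[1], keys[::-1])   # same routine, reversed key order
```

Note that `toy_f` never needs to be inverted: only the XOR is undone, which is why any function can be turned into a permutation this way.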

58 50 42 34 26 18 10  2
60 52 44 36 28 20 12  4
62 54 46 38 30 22 14  6
64 56 48 40 32 24 16  8
57 49 41 33 25 17  9  1
59 51 43 35 27 19 11  3
61 53 45 37 29 21 13  5
63 55 47 39 31 23 15  7

Table 2: The Initial Permutation IP

Figure 4 is an illustration of the round function and shows the individual

blocks of the f-function, which is the core operation of each round.

The following tables show the mappings of input bits to output bits of the

E and P operations, as well as the eight S-Boxes. The E expansion duplicates some

of the bits of the 32-bit input to the f-function as shown in Table 3 and outputs

a 48-bit value. The result of $E(R_{i-1}) \oplus K_i$ is partitioned into eight 6-bit values.

The S-Boxes output a 4-bit number based on a 6-bit input. The input is in the form

$a_5a_4a_3a_2a_1a_0$; the row index into the S-Box is the number formed from $a_5a_0$ and the

column index is the number formed from $a_4a_3a_2a_1$. The outputs of the S-Boxes are

concatenated to form the input to the P permutation. The P permutation rearranges

the 32 bits of the combined S-Box outputs, and the result is XOR-ed with $L_{i-1}$ to

obtain $R_i$ for the current round.

Figure 4: The DES f-function
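As a small illustration of the S-Box indexing rule, the sketch below looks up a 6-bit input in S1 (values taken from Table 4). The function name `sbox_lookup` is an illustrative choice, not part of any standard API.

```python
# S1 from Table 4: four rows of sixteen 4-bit outputs.
S1 = [
    [14, 4, 13, 1, 2, 15, 11, 8, 3, 10, 6, 12, 5, 9, 0, 7],
    [0, 15, 7, 4, 14, 2, 13, 1, 10, 6, 12, 11, 9, 5, 3, 8],
    [4, 1, 14, 8, 13, 6, 2, 11, 15, 12, 9, 7, 3, 10, 5, 0],
    [15, 12, 8, 2, 4, 9, 1, 7, 5, 11, 3, 14, 10, 0, 6, 13],
]

def sbox_lookup(table, six_bits):
    """Row index is formed from bits a5 and a0, column index from a4..a1."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 0b1)
    col = (six_bits >> 1) & 0b1111
    return table[row][col]

# Input 011011: row = 01 (1), column = 1101 (13).
result = sbox_lookup(S1, 0b011011)
```

The split of the outer and inner bits is what makes this lookup awkward on a byte-addressed machine: the index has to be rearranged before it can be used as a table offset.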

32  1  2  3  4  5
 4  5  6  7  8  9
 8  9 10 11 12 13
12 13 14 15 16 17
16 17 18 19 20 21
20 21 22 23 24 25
24 25 26 27 28 29
28 29 30 31 32  1

Table 3: The Expansion Operation E

S1
14  4 13  1  2 15 11  8  3 10  6 12  5  9  0  7
 0 15  7  4 14  2 13  1 10  6 12 11  9  5  3  8
 4  1 14  8 13  6  2 11 15 12  9  7  3 10  5  0
15 12  8  2  4  9  1  7  5 11  3 14 10  0  6 13

S2
15  1  8 14  6 11  3  4  9  7  2 13 12  0  5 10
 3 13  4  7 15  2  8 14 12  0  1 10  6  9 11  5
 0 14  7 11 10  4 13  1  5  8 12  6  9  3  2 15
13  8 10  1  3 15  4  2 11  6  7 12  0  5 14  9

S3
10  0  9 14  6  3 15  5  1 13 12  7 11  4  2  8
13  7  0  9  3  4  6 10  2  8  5 14 12 11 15  1
13  6  4  9  8 15  3  0 11  1  2 12  5 10 14  7
 1 10 13  0  6  9  8  7  4 15 14  3 11  5  2 12

S4
 7 13 14  3  0  6  9 10  1  2  8  5 11 12  4 15
13  8 11  5  6 15  0  3  4  7  2 12  1 10 14  9
10  6  9  0 12 11  7 13 15  1  3 14  5  2  8  4
 3 15  0  6 10  1 13  8  9  4  5 11 12  7  2 14

S5
 2 12  4  1  7 10 11  6  8  5  3 15 13  0 14  9
14 11  2 12  4  7 13  1  5  0 15 10  3  9  8  6
 4  2  1 11 10 13  7  8 15  9 12  5  6  3  0 14
11  8 12  7  1 14  2 13  6 15  0  9 10  4  5  3

S6
12  1 10 15  9  2  6  8  0 13  3  4 14  7  5 11
10 15  4  2  7 12  9  5  6  1 13 14  0 11  3  8
 9 14 15  5  2  8 12  3  7  0  4 10  1 13 11  6
 4  3  2 12  9  5 15 10 11 14  1  7  6  0  8 13

S7
 4 11  2 14 15  0  8 13  3 12  9  7  5 10  6  1
13  0 11  7  4  9  1 10 14  3  5 12  2 15  8  6
 1  4 11 13 12  3  7 14 10 15  6  8  0  5  9  2
 6 11 13  8  1  4 10  7  9  5  0 15 14  2  3 12

S8
13  2  8  4  6 15 11  1 10  9  3 14  5  0 12  7
 1 15 13  8 10  3  7  4 12  5  6 11  0 14  9  2
 7 11  4  1  9 12 14  2  0  6 10 13 15  3  5  8
 2  1 14  7  4 10  8 13 15 12  9  0  3  5  6 11

Table 4: The DES S-Boxes

16  7 20 21
29 12 28 17
 1 15 23 26
 5 18 31 10
 2  8 24 14
32 27  3  9
19 13 30  6
22 11  4 25

Table 5: The Pre-Output Permutation P

After the final round, the left and right halves of the 64-bit block, $L_{16}$ and

$R_{16}$, are swapped, and then subject to a Final Permutation (FP). This operation is

simply the inverse of the IP. The bit mapping for this operation is shown in Table 6;

bit positions are represented in the same manner as Table 2 for the IP.

40  8 48 16 56 24 64 32
39  7 47 15 55 23 63 31
38  6 46 14 54 22 62 30
37  5 45 13 53 21 61 29
36  4 44 12 52 20 60 28
35  3 43 11 51 19 59 27
34  2 42 10 50 18 58 26
33  1 41  9 49 17 57 25

Table 6: The Final Permutation FP
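The bit-numbering convention used by the permutation tables (bit 1 is the most significant bit) can be sketched as follows. The helper name `permute` is an illustrative assumption; the table values are those of Tables 2 and 6. The sketch also checks that FP undoes IP.

```python
IP = [58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
      62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
      57, 49, 41, 33, 25, 17, 9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
      61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]

FP = [40, 8, 48, 16, 56, 24, 64, 32, 39, 7, 47, 15, 55, 23, 63, 31,
      38, 6, 46, 14, 54, 22, 62, 30, 37, 5, 45, 13, 53, 21, 61, 29,
      36, 4, 44, 12, 52, 20, 60, 28, 35, 3, 43, 11, 51, 19, 59, 27,
      34, 2, 42, 10, 50, 18, 58, 26, 33, 1, 41, 9, 49, 17, 57, 25]

def permute(value, table, in_bits):
    """Build the output MSB first: output bit j takes input bit table[j-1].

    Bits are numbered from 1 (most significant) to in_bits (least significant).
    """
    out = 0
    for src in table:
        out = (out << 1) | ((value >> (in_bits - src)) & 1)
    return out

block = 0x0123456789ABCDEF
after_ip = permute(block, IP, 64)
restored = permute(after_ip, FP, 64)
```

Each output bit costs a shift, a mask, and an OR in this bit-serial form, which hints at why such permutations are expensive on a word-oriented processor.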

4.1.2 DES Key Schedule

The key schedule for DES operates on the 64-bit master key to produce

a series of 48-bit round keys, each used one at a time for the sixteen rounds of the

cipher. Initially, the bits of the master key are arranged by Permuted Choice 1 (PC-1)

into two 28-bit vectors, C and D (note that every eighth bit is a parity bit and is

discarded). Table 7 depicts the PC-1 operation.

For each round of the cipher, a bit rotation is performed separately on the

C and D values. Rotation moves to the left for encryption, and to the right for

decryption. The rotation amount depends on the next round; these amounts are

given for encryption in Table 8. These amounts are carried out in reverse order for

the right-rotations of the decryption key schedule.

C0                      D0
57 49 41 33 25 17  9    63 55 47 39 31 23 15
 1 58 50 42 34 26 18     7 62 54 46 38 30 22
10  2 59 51 43 35 27    14  6 61 53 45 37 29
19 11  3 60 52 44 36    21 13  5 28 20 12  4

Table 7: Permuted Choice 1 (PC-1)

Round          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
Rotate amount  1  1  2  2  2  2  2  2  1  2  2  2  2  2  2  1

Table 8: Rotations for the DES key schedule

The round key is the result of passing the current state of the C and D bit

vectors through Permuted Choice 2 (PC-2). This operation maps the concatenation

of C and D to the 48-bit round key as shown in Table 9.

14 17 11 24  1  5
 3 28 15  6 21 10
23 19 12  4 26  8
16  7 27 20 13  2
41 52 31 37 47 55
30 40 51 45 33 48
44 49 39 56 34 53
46 42 50 36 29 32

Table 9: Permuted Choice 2 (PC-2)
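The encryption key schedule described in this section can be sketched as below, using the values of Tables 7, 8, and 9. The function names are illustrative assumptions, and the key `0x133457799BBCDFF1` is only an arbitrary example value.

```python
PC1 = [57, 49, 41, 33, 25, 17, 9, 1, 58, 50, 42, 34, 26, 18,
       10, 2, 59, 51, 43, 35, 27, 19, 11, 3, 60, 52, 44, 36,      # C0
       63, 55, 47, 39, 31, 23, 15, 7, 62, 54, 46, 38, 30, 22,
       14, 6, 61, 53, 45, 37, 29, 21, 13, 5, 28, 20, 12, 4]       # D0

PC2 = [14, 17, 11, 24, 1, 5, 3, 28, 15, 6, 21, 10,
       23, 19, 12, 4, 26, 8, 16, 7, 27, 20, 13, 2,
       41, 52, 31, 37, 47, 55, 30, 40, 51, 45, 33, 48,
       44, 49, 39, 56, 34, 53, 46, 42, 50, 36, 29, 32]

ROTATIONS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]

def permute(value, table, in_bits):
    """Output bit j (1 = MSB) takes input bit table[j-1] (1 = MSB)."""
    out = 0
    for src in table:
        out = (out << 1) | ((value >> (in_bits - src)) & 1)
    return out

def rotl28(x, n):
    """Rotate a 28-bit value left by n bits."""
    return ((x << n) | (x >> (28 - n))) & 0xFFFFFFF

def des_round_keys(key64):
    """Return the sixteen 48-bit round keys in encryption order."""
    cd = permute(key64, PC1, 64)            # drops the eight parity bits
    c, d = cd >> 28, cd & 0xFFFFFFF
    keys = []
    for n in ROTATIONS:
        c, d = rotl28(c, n), rotl28(d, n)
        keys.append(permute((c << 28) | d, PC2, 56))
    return keys

round_keys = des_round_keys(0x133457799BBCDFF1)
decrypt_keys = round_keys[::-1]             # decryption uses the reverse order
```

Because the rotation amounts sum to 28, C and D return to their initial values after the sixteenth round, which is what allows the decryption schedule to start from the same state and rotate to the right.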

4.1.3 The Triple-DES Algorithm

The Triple-DES algorithm has been suggested as a more secure alternative

to DES [55]. As the name suggests, this cipher sequentially executes the DES algorithm

three times with keys $K_1$, $K_2$, and $K_3$, where two or all three of these keys may

be equivalent. The following rules are used for encryption and decryption, where PT

is the plaintext, CT is the ciphertext, $E_{K_i}$ is a DES encryption using $K_i$, and $D_{K_i}$ is

a DES decryption using $K_i$:

$$CT = E_{K_3}(D_{K_2}(E_{K_1}(PT)))$$

$$PT = D_{K_1}(E_{K_2}(D_{K_3}(CT)))$$

Since the output ciphertext from implementations 1 and 2 of DES is used as

the input plaintext to implementations 2 and 3 respectively, and the Initial and Final

Permutations are inverses of each other, the inner Initial and Final Permutations may

be removed from the algorithm.

There are three keying options commonly used for Triple DES [55]:

Keying Option 1: $K_1$, $K_2$, and $K_3$ are independent.

Keying Option 2: $K_1 = K_3$ and $K_2$ is independent from $K_1$ and $K_3$.

Keying Option 3: $K_1 = K_2 = K_3$.

Note that Keying Option 3 is equivalent to a single iteration of DES.
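The encrypt-decrypt-encrypt composition can be sketched with a stand-in block cipher. The functions `toy_encrypt` and `toy_decrypt` below are hypothetical placeholders for single DES (any keyed, invertible 64-bit mapping serves the purpose here), and the key values are arbitrary. Decryption undoes the three stages in the opposite order, and Keying Option 3 collapses to a single encryption.

```python
MASK64 = (1 << 64) - 1

def toy_encrypt(block, key):
    """Stand-in for one DES encryption (NOT DES itself)."""
    x = (block ^ key) & MASK64
    x = ((x << 13) | (x >> 51)) & MASK64      # rotate left by 13 bits
    return (x + key) & MASK64

def toy_decrypt(block, key):
    """Exact inverse of toy_encrypt for the same key."""
    x = (block - key) & MASK64
    x = ((x >> 13) | (x << 51)) & MASK64      # rotate right by 13 bits
    return x ^ key

def tdes_encrypt(pt, k1, k2, k3):
    return toy_encrypt(toy_decrypt(toy_encrypt(pt, k1), k2), k3)

def tdes_decrypt(ct, k1, k2, k3):
    return toy_decrypt(toy_encrypt(toy_decrypt(ct, k3), k2), k1)

k1, k2, k3 = 0x0123456789ABCDEF, 0x23456789ABCDEF01, 0x456789ABCDEF0123
pt = 0x5468652071756663
ct = tdes_encrypt(pt, k1, k2, k3)
```

With all three keys equal, the inner decryption cancels the first encryption, leaving only the final stage. This is the property that lets a Triple-DES implementation interoperate with single DES.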

4.1.4 Triple-DES Key Schedule

To perform the key schedule for Triple-DES, the key expansion must be

performed on each unique key that is used in the chosen implementation of the algo-

rithm. This means that three key expansions are required for Keying Option 1, two

are required for Keying Option 2, and one is required for Keying Option 3.

4.1.5 Performance

Software implementations of DES tend to be significantly slower than hard-

ware implementations. Bit-level manipulations such as those contained in the permu-

tation, expansion, permuted choice, and Cyclic Left/Right Shift units do not map well

to general purpose processors. General purpose processor instruction sets operate on

multiple bits at a time based on the processor word size. Moreover, the DES S-Boxes

do not use memory in an efficient manner. Software look-up tables would appear

to be the obvious implementation choice for the DES S-Boxes. However, the DES

S-Boxes have 6-bit addresses and 4-bit output data while most memories associated

with general purpose processors use byte addressing with either 8-bit or 32-bit output

data. As a result, many software implementations of DES exhibit throughputs that

are at least a full order of magnitude slower than hardware implementations.

Even the best software implementations are only capable of throughputs in

the range of 100 to 200 Mbps. Most of these implementations recommend storing the $L_i$

and $R_i$ data as a 48-bit padded word within a 64-bit processor word and implementing

the permutations and S-Boxes as precomputed look-up tables. Additionally, there is

general agreement that the look-up table implementation for the S-Boxes is most

effective when the size of the look-up tables is minimized, guaranteeing that the

data will fit entirely in on-chip cache. Size minimization of the S-Box look-up tables

is achieved by implementing each S-Box in its own look-up table. Finally, one key

software optimization is the unrolling of software loops to increase performance. Even

when software loops are too cumbersome to unroll, using loop counters that decrement

to zero in place of loop counters that increment to a terminal count has been shown to

greatly increase the performance of software implementations of the DES algorithm.

However, the unrolling of software loops must be done with great care such that the

total data storage space does not exceed the size of the on-chip cache as this would

cause extreme performance degradation [56], [57], [58].

DES hardware implementations are easily realized in a single chip, such as

an FPGA or an ASIC. Supporting encryption or decryption of a new block of data

every clock cycle in an implementation operating in a non-feedback mode, such as

Electronic Code Book (ECB) or counter mode, also requires that the chip have at least

128 input pins (for the input data and key) and 64 output pins (for the output data).

Once again, FPGA and ASIC technology provide more than enough I/O pins to meet

these requirements. As a result, numerous fast and efficient DES implementations

have been reported, reaching throughputs in the Gbps range when targeting either FPGAs

or ASICs. Examples of such implementations may be found in [2], [59], [60]. When

operating in feedback modes (such as Cipher Block Chaining (CBC) mode), DES

does not map nicely to pipelined hardware implementations because of the chaining

of blocks. The chaining requires ciphertext block $y_{i-1}$ to process plaintext block $x_i$

and thus simultaneous processing of the two blocks is impossible, requiring that the

pipeline be stalled until generation of the ciphertext block $y_{i-1}$ is completed. However,

the stalling of the pipeline may be avoided in an environment with multiple data

streams, such as in a network processor. In such a situation, the pipeline may be fully

utilized by interleaving the data streams. For a fully pipelined DES implementation

where the atomic unit of the pipeline is the DES round function, the pipeline will

have sixteen stages, thus requiring sixteen interleaved data streams. Let $x_{0S_0}$ denote

plaintext block 0 from data stream 0, $x_{0S_1}$ denote plaintext block 0 from data stream

1, etc. Using this notation, the pipeline is filled with blocks $x_{0S_0}$, $x_{0S_1}$, $x_{0S_2}$, ...,

$x_{0S_{15}}$. When $x_{0S_0}$ has passed the final stage of the pipeline to yield $y_{0S_0}$, $x_{1S_0}$ is

ready to enter the first stage of the pipeline and is combined with $y_{0S_0}$ via the XOR

operation to perform CBC mode chaining. Thus each data stream is encrypted and

decrypted in CBC mode while also maintaining full pipeline utilization, maximizing

the performance of the implementation. Note that such an implementation must also

maintain sixteen Initialization Vectors, one for each data stream, to be combined with

the first plaintext block x0 of the associated data stream via the XOR operation.

The earliest VLSI implementations of DES [61], [62] achieved throughputs

ranging from 20 to 32 Mbps using 3 µm technology. The variances in throughput are

compared and contrasted based upon speed versus area tradeoffs. The implementations

support multiple modes of operation, including ECB, CBC, Cipher Feedback

(CFB), and Output Feedback (OFB) (see [63] for a detailed description of DES modes

of operation). Other ASIC implementations of DES [64], [65] achieve a throughput

of 1 Gbps using 0.8 µm Gallium Arsenide (GaAs) technology. More recently, a DES

ASIC implementation has been demonstrated to operate at up to 10 Gbps using 0.6 µm

technology [59].

4.2 IDEA

4.2.1 Mathematical Background

The International Data Encryption Algorithm (IDEA) was originally pub-

lished as the Proposed Encryption Standard (PES) by Xuejia Lai and James Massey

[66]. The computations involved in IDEA are based on operations from three different

mathematical groups:

16-bit bitwise exclusive-OR, denoted by $\oplus$,

Addition modulo $2^{16}$, denoted by $\boxplus$,

Multiplication modulo $(2^{16} + 1)$, denoted by $\odot$.

For the third operation, an input of 0x0000 represents the value $2^{16}$. This

is because the $\odot$ operation is performed over the multiplicative group $Z^*_{2^{16}+1}$, where

zero is not a member but $2^{16}$ is a member of the group. The value $2^{16}$ is therefore

denoted by 0x0000 so that only sixteen bits are required to represent all possible

values for the inputs to each operation.
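A minimal sketch of the three group operations follows, written directly from the definitions above. The function names are illustrative assumptions. The multiplication maps the word 0x0000 to the value 2^16 on input and back again on output.

```python
def idea_xor(a, b):
    """16-bit bitwise exclusive-OR."""
    return a ^ b

def idea_add(a, b):
    """Addition modulo 2**16."""
    return (a + b) & 0xFFFF

def idea_mul(a, b):
    """Multiplication modulo 2**16 + 1, where 0x0000 stands for 2**16."""
    a = a or 0x10000
    b = b or 0x10000
    # The product of two group members is never 0 modulo the prime 65537,
    # so the only result that needs remapping is 2**16, stored as 0x0000.
    return ((a * b) % 0x10001) & 0xFFFF

examples = [idea_mul(0, 0), idea_mul(0, 1), idea_mul(0, 2), idea_mul(2, 3)]
```

For instance, `idea_mul(0, 0)` is 2^16 times 2^16, which is (-1)(-1) = 1 modulo 2^16 + 1.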

The security of IDEA is based not only on its large key size but also on the

fact that the output of one group operation is never used as an input to the same

operation. Further details are available in the original proposal for PES [66]. IDEA

evolved into its final form [67] due to modifications required to strengthen the cipher

against differential cryptanalysis attacks [68]. IDEA is used in many commercial

applications, such as Pretty Good Privacy (PGP). Like DES, IDEA operates across

64-bit blocks. However, while DES requires a 56-bit key, IDEA requires a 128-bit key,

accounting for the increased security of the cipher as compared to DES.

4.2.2 Algorithm Description

IDEA operates on 64-bit plaintext blocks and has a key size of 128 bits.

The algorithm consists of eight rounds followed by a final transformation to obtain

the output. Similar to DES, the procedure (referred to as a computation graph; see

Figure 5), is the same for both encryption and decryption but different key schedules

are used.

Figure 5 shows the input text as four 16-bit sub-blocks $X_1$, $X_2$, $X_3$, and $X_4$.

These text sub-blocks are combined with the six 16-bit sub-blocks of the round key

for the current round r, labeled $Z_1^{(r)}$ through $Z_6^{(r)}$, using the mathematical operations

noted above.

4.2.3 Key Schedule

The IDEA key schedule for encryption is based on a series of left rotations

of the 128-bit master key. The master key is first partitioned into eight 16-bit blocks;

these are the first eight key sub-blocks: $Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}, Z_4^{(1)}, Z_5^{(1)}, Z_6^{(1)}, Z_1^{(2)}, Z_2^{(2)}$.

The next eight key blocks are obtained by rotating the key to the left by 25 bits,

then performing the partition again. This process is repeated until all 52 key blocks

are generated (six blocks for each of the eight rounds and four blocks for the final

transformation).
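The partition-and-rotate procedure can be sketched as follows. The function name and the example key are illustrative assumptions; the master key is held as a single 128-bit integer with the first sub-block in the most significant sixteen bits.

```python
def idea_encryption_subkeys(key128):
    """Return the 52 16-bit encryption key sub-blocks in order of use."""
    mask128 = (1 << 128) - 1
    subkeys = []
    while len(subkeys) < 52:
        # Partition the current 128-bit value into eight 16-bit blocks.
        for i in range(8):
            subkeys.append((key128 >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the whole key left by 25 bits before the next partition.
        key128 = ((key128 << 25) | (key128 >> 103)) & mask128
    return subkeys[:52]       # the seventh partition yields four unused blocks

subkeys = idea_encryption_subkeys(0x00010002000300040005000600070008)
```

Sub-blocks are consumed six per round, so `subkeys[6 * (r - 1) + (n - 1)]` corresponds to $Z_n^{(r)}$ under this indexing assumption.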

The key schedule for decryption is based on the encryption key schedule.

Table 10 shows the relationship between the decryption key blocks and the encryption

key blocks, where $(Z_n^{(r)})^{-1}$ represents the multiplicative inverse modulo $(2^{16}+1)$ of $Z_n^{(r)}$,

and $-Z_n^{(r)}$ represents the additive inverse modulo $2^{16}$ of $Z_n^{(r)}$.

Figure 5: The computation graph for IDEA.
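The two kinds of inverse used by the decryption key schedule can be sketched as below. The function names are illustrative assumptions, and the indexing `enc[6 * (r - 1) + (n - 1)]` for $Z_n^{(r)}$ is a convention chosen for this sketch. The multiplicative inverse uses Fermat's little theorem, which applies because 2^16 + 1 is prime; the ordering of the decryption blocks follows Table 10.

```python
def mul_inverse(z):
    """Inverse modulo 2**16 + 1, with the word 0x0000 standing for 2**16."""
    value = z or 0x10000
    return pow(value, 0x10001 - 2, 0x10001) & 0xFFFF

def add_inverse(z):
    """Additive inverse modulo 2**16."""
    return (-z) & 0xFFFF

def decryption_subkeys(enc):
    """Derive the 52 decryption sub-blocks from the 52 encryption sub-blocks."""
    def Z(r, n):
        return enc[6 * (r - 1) + (n - 1)]

    dec = [mul_inverse(Z(9, 1)), add_inverse(Z(9, 2)),
           add_inverse(Z(9, 3)), mul_inverse(Z(9, 4)), Z(8, 5), Z(8, 6)]
    for r in range(2, 9):
        s = 10 - r
        # In rounds 2 to 8 the two additive sub-blocks are swapped.
        dec += [mul_inverse(Z(s, 1)), add_inverse(Z(s, 3)),
                add_inverse(Z(s, 2)), mul_inverse(Z(s, 4)),
                Z(s - 1, 5), Z(s - 1, 6)]
    dec += [mul_inverse(Z(1, 1)), add_inverse(Z(1, 2)),
            add_inverse(Z(1, 3)), mul_inverse(Z(1, 4))]
    return dec

dec = decryption_subkeys(list(range(1, 53)))   # placeholder sub-block values
```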

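Round            Encrypt keys                                                                Decrypt keys
1                $Z_1^{(1)}$ $Z_2^{(1)}$ $Z_3^{(1)}$ $Z_4^{(1)}$ $Z_5^{(1)}$ $Z_6^{(1)}$     $(Z_1^{(9)})^{-1}$ $-Z_2^{(9)}$ $-Z_3^{(9)}$ $(Z_4^{(9)})^{-1}$ $Z_5^{(8)}$ $Z_6^{(8)}$
2                $Z_1^{(2)}$ $Z_2^{(2)}$ $Z_3^{(2)}$ $Z_4^{(2)}$ $Z_5^{(2)}$ $Z_6^{(2)}$     $(Z_1^{(8)})^{-1}$ $-Z_3^{(8)}$ $-Z_2^{(8)}$ $(Z_4^{(8)})^{-1}$ $Z_5^{(7)}$ $Z_6^{(7)}$
3                $Z_1^{(3)}$ $Z_2^{(3)}$ $Z_3^{(3)}$ $Z_4^{(3)}$ $Z_5^{(3)}$ $Z_6^{(3)}$     $(Z_1^{(7)})^{-1}$ $-Z_3^{(7)}$ $-Z_2^{(7)}$ $(Z_4^{(7)})^{-1}$ $Z_5^{(6)}$ $Z_6^{(6)}$
4                $Z_1^{(4)}$ $Z_2^{(4)}$ $Z_3^{(4)}$ $Z_4^{(4)}$ $Z_5^{(4)}$ $Z_6^{(4)}$     $(Z_1^{(6)})^{-1}$ $-Z_3^{(6)}$ $-Z_2^{(6)}$ $(Z_4^{(6)})^{-1}$ $Z_5^{(5)}$ $Z_6^{(5)}$
5                $Z_1^{(5)}$ $Z_2^{(5)}$ $Z_3^{(5)}$ $Z_4^{(5)}$ $Z_5^{(5)}$ $Z_6^{(5)}$     $(Z_1^{(5)})^{-1}$ $-Z_3^{(5)}$ $-Z_2^{(5)}$ $(Z_4^{(5)})^{-1}$ $Z_5^{(4)}$ $Z_6^{(4)}$
6                $Z_1^{(6)}$ $Z_2^{(6)}$ $Z_3^{(6)}$ $Z_4^{(6)}$ $Z_5^{(6)}$ $Z_6^{(6)}$     $(Z_1^{(4)})^{-1}$ $-Z_3^{(4)}$ $-Z_2^{(4)}$ $(Z_4^{(4)})^{-1}$ $Z_5^{(3)}$ $Z_6^{(3)}$
7                $Z_1^{(7)}$ $Z_2^{(7)}$ $Z_3^{(7)}$ $Z_4^{(7)}$ $Z_5^{(7)}$ $Z_6^{(7)}$     $(Z_1^{(3)})^{-1}$ $-Z_3^{(3)}$ $-Z_2^{(3)}$ $(Z_4^{(3)})^{-1}$ $Z_5^{(2)}$ $Z_6^{(2)}$
8                $Z_1^{(8)}$ $Z_2^{(8)}$ $Z_3^{(8)}$ $Z_4^{(8)}$ $Z_5^{(8)}$ $Z_6^{(8)}$     $(Z_1^{(2)})^{-1}$ $-Z_3^{(2)}$ $-Z_2^{(2)}$ $(Z_4^{(2)})^{-1}$ $Z_5^{(1)}$ $Z_6^{(1)}$
Final transform  $Z_1^{(9)}$ $Z_2^{(9)}$ $Z_3^{(9)}$ $Z_4^{(9)}$                             $(Z_1^{(1)})^{-1}$ $-Z_2^{(1)}$ $-Z_3^{(1)}$ $(Z_4^{(1)})^{-1}$

Table 10: IDEA key schedule

4.2.4 Performance

In terms of the core operations of IDEA, the bit-wise XOR and addition

are easily implemented with one instruction each in software. For the reduction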

modulo $2^{16}$, a processor such as the LEON2 that only performs arithmetic on 32-bit

register operands requires an additional logic instruction to mask out the bits

that may overflow into the sixteen most significant bits of the destination register.

The major performance bottleneck for a software implementation of the IDEA cipher

is the multiplication modulo $(2^{16} + 1)$. The reason for this is that multiplication

in general may take several clock cycles to complete on the processor running the

algorithm (especially those without hardware multipliers), and the modular reduction,

which is commonly implemented using the Low-High Lemma [66], requires additional

execution time.
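The extra work involved in the modular reduction can be seen in the following sketch of a low-high style reduction: the 32-bit product is split into its low and high 16-bit halves, and since 2^16 is congruent to -1 modulo 2^16 + 1, the result is their difference with a conditional correction. This is a sketch under that stated identity, not a transcription of the formulation in [66]; the function names are illustrative, and `idea_mul_reference` exists only for comparison.

```python
def idea_mul_reference(a, b):
    """Direct definition: 0x0000 stands for 2**16, reduce modulo 2**16 + 1."""
    return (((a or 0x10000) * (b or 0x10000)) % 0x10001) & 0xFFFF

def idea_mul_low_high(a, b):
    """Same result without a division, using the low and high product halves."""
    if a == 0:                          # a represents 2**16 = -1 (mod 2**16 + 1)
        return (0x10001 - b) & 0xFFFF
    if b == 0:
        return (0x10001 - a) & 0xFFFF
    product = a * b                     # fits in 32 bits
    lo, hi = product & 0xFFFF, product >> 16
    # product = hi * 2**16 + lo, which is congruent to lo - hi;
    # add one (after the 16-bit wrap this is +65537) when lo < hi.
    return (lo - hi + (lo < hi)) & 0xFFFF

samples = [0, 1, 2, 3, 255, 256, 0x7FFF, 0x8000, 0xFFFE, 0xFFFF]
agree = all(idea_mul_low_high(a, b) == idea_mul_reference(a, b)
            for a in samples for b in samples)
```

Even in this form, each multiplication costs a multiply, two zero tests, a shift, a mask, a compare, and a subtraction, which is consistent with the multiplication being the dominant cost in software.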

Several software implementations of the IDEA algorithm take advantage

of advanced processor architectures that employ instruction parallelism or functional

units for multimedia support. A four-way parallel implementation on a 166 MHz Pentium

MMX processor [69] achieved a throughput of approximately 72 Mbps. Throughput

values ranging from 421 Mbps to 550 Mbps have been achieved on the Itanium

platform running at 733 MHz [70]. The performance evaluations reported in [71]

include a comparison of IDEA software implementations on processors with various

word sizes, clock frequencies, and cache sizes. Execution times for IDEA encryption

ranged from 2555 µs on the 8-bit 4 MHz Atmega 103 to 9 µs on the 64-bit 440 MHz

UltraSparc2 with instruction and data cache sizes of 16 kbytes. The ability to

perform fast multiplications was shown to be a major factor in the performance of

the IDEA algorithm.

Implementations of IDEA on reconfigurable computing platforms and sys-

tems with co-processors have shown improved performance. An implementation on a

SRC-6E platform [72] achieved throughputs of approximately 590 Mbps for end-to-

end software time for bulk data processing. Comparisons have been made between

the performance of IDEA on Digital Signal Processing (DSP) chips, cryptographic

co-processors, and hardware implementations on FPGAs in a hardware-software co-

design system that makes use of encryption in a mobile device. Reported perfor-

mance figures ranged from 32 Mbps on the DEC SA-110 and 53.1 Mbps on the TI

TMX320C6x DSP chips, to 180 Mbps using the VINCI cryptographic co-processor,

to 528 Mbps with an FPGA-based implementation [73].

A VLSI implementation of PES [74] achieved a throughput of 44 Mbps using

1.5 µm technology. This implementation was limited in clock frequency to maintain

compatibility with the Sun Microsystems SBus. The earliest VLSI implementations

of IDEA [75], [76] achieved throughputs of 177 Mbps using 1.2 µm technology. More

recent VLSI implementations [77] achieve a throughput of 355 Mbps using 0.8 µm

technology. When using 0.7 µm technology, a throughput of 424 Mbps was achieved

in a single chip solution [78]. However, the performance of these implementations

was significantly reduced when operating in feedback modes.

4.3 AES

4.3.1 Mathematical Background

Joan Daemen and Vincent Rijmen proposed the Rijndael algorithm to NIST

as a candidate for the Advanced Encryption Standard [79]. One of the most significant

features of the algorithm is the extensive use of finite field, or Galois Field, arithmetic.

The particular field used in the AES algorithm is the Galois Field GF($2^8$). Values

are represented by polynomials of the form

$$a(x) = a_7x^7 + a_6x^6 + a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0,$$

or in bit vector notation, $\{a_7a_6a_5a_4a_3a_2a_1a_0\}$, where each $a_i$ is a

coefficient in the Galois Field GF(2). Addition is done by computing the sum modulo 2

of coefficients in the same bit positions; this can be accomplished by applying

a bit-wise exclusive-OR on the coefficients. Multiplication works in much the same

way as ordinary polynomial multiplication, but there is an additional step to make

a modular reduction of the product by an irreducible polynomial so that the final

product is in the Galois Field GF($2^8$). For the AES algorithm, this polynomial is

$m(x) = x^8 + x^4 + x^3 + x + 1$.
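The field arithmetic described above can be sketched as a shift-and-add multiplication with reduction by m(x), whose bit pattern is 0x11B. The function names are illustrative assumptions.

```python
def gf_add(a, b):
    """Addition in GF(2^8): coefficient-wise sum modulo 2, i.e. XOR."""
    return a ^ b

def xtime(a):
    """Multiply by x (the byte {02}) and reduce modulo m(x)."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a

def gf_mul(a, b):
    """Multiply two field elements by accumulating shifted copies of a."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result

product = gf_mul(0x57, 0x83)
```

Each call runs up to eight data-dependent iterations, which illustrates why this operation maps poorly to a general purpose instruction set and is usually replaced by look-up tables in software.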

4.3.2 Algorithm Description

AES always operates on a block size of 128 bits, but key sizes of 128, 192,

or 256 bits are allowed. The number of rounds used in the cipher is dependent on the

key size: ten rounds for a 128-bit key, twelve rounds for a 192-bit key, and fourteen

rounds for a 256-bit key. This research focuses on a 128-bit key implementation but

is easily extended for use in implementations with larger key sizes.

Encryption of one plaintext block in AES requires the sequence of operations

shown in Figure 6. The word data type is a 32-bit value. In the AES algorithm

specification [80], the plaintext is arranged into a 4 × 4 matrix of 8-bit values called

the state, depicted in Figure 7.

Encrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[0, 3])
    for round = 1 step 1 to 9
        SubBytes(state)
        ShiftRows(state)
        MixColumns(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
    end for
    SubBytes(state)
    ShiftRows(state)
    AddRoundKey(state, k[40, 43])
    out = state
end

Figure 6: The AES encryption process

s0,0 s0,1 s0,2 s0,3
s1,0 s1,1 s1,2 s1,3
s2,0 s2,1 s2,2 s2,3
s3,0 s3,1 s3,2 s3,3

Figure 7: The state representation of data blocks in AES

The four types of operations performed on the state are:

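SubBytes: substitutes each byte in the state with a new value according to

the following procedure:

(1) Compute the multiplicative inverse in the Galois Field GF($2^8$), denoted as

$a^{-1}$ (except for the value 0x00, which is mapped to itself);

(2) Perform the following affine transformation over the Galois Field GF(2) on

$a^{-1}$:

$$
\begin{bmatrix} b_7 \\ b_6 \\ b_5 \\ b_4 \\ b_3 \\ b_2 \\ b_1 \\ b_0 \end{bmatrix}
=
\begin{bmatrix}
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
\end{bmatrix}
\begin{bmatrix} a^{-1}_7 \\ a^{-1}_6 \\ a^{-1}_5 \\ a^{-1}_4 \\ a^{-1}_3 \\ a^{-1}_2 \\ a^{-1}_1 \\ a^{-1}_0 \end{bmatrix}
+
\begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}
$$

The result b is copied into the position of a in the state.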

ShiftRows: performs cyclic left-shifts on each row in the state. The number

of bytes by which to shift depends on the row: zero for the top row, one for the

second row, two for the third row, and three for the bottom row.

MixColumns: each column of the state is treated as a vector of four polynomials

in the Galois Field GF($2^8$) in this operation. Each of the four columns

is multiplied by a 4 × 4 constant matrix with coefficients in the Galois Field

GF($2^8$) reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$. For each column c from

0 to 3,

$$
\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix}
=
\begin{bmatrix}
02 & 03 & 01 & 01 \\
01 & 02 & 03 & 01 \\
01 & 01 & 02 & 03 \\
03 & 01 & 01 & 02
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.
$$

AddRoundKey: like MixColumns, AddRoundKey operates on individual

columns of the state. Each column $C_i$ is combined by a bit-wise exclusive-OR

operation with a 32-bit word $k_{4r+i}$ from the current round key (the key schedule

is explained in Section 4.3.3).
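The round operations listed above can be sketched without look-up tables as follows. The function names are illustrative assumptions, the state is assumed to be a list of four rows, and the inverse is computed as the 254th power purely for brevity (a practical implementation would use a table or a dedicated circuit).

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def gf_inv(a):
    """a^254 equals the inverse of a for nonzero a; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result

def sub_byte(a):
    """Step (1): field inverse. Step (2): affine transformation plus {63}."""
    x = gf_inv(a)
    b = x
    for shift in range(1, 5):           # XOR with x rotated left by 1..4 bits
        b ^= ((x << shift) | (x >> (8 - shift))) & 0xFF
    return b ^ 0x63

def shift_rows(state):
    """Rotate row r of the state left by r bytes."""
    return [row[r:] + row[:r] for r, row in enumerate(state)]

def mix_column(col):
    """Multiply one column by the constant MixColumns matrix."""
    a0, a1, a2, a3 = col
    return [gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
            a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
            a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
            gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3)]

def add_round_key(col, key_word_bytes):
    """XOR one column with the four bytes of a round-key word."""
    return [c ^ k for c, k in zip(col, key_word_bytes)]

mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

Applying `add_round_key` twice with the same key word returns the original column, which is why AddRoundKey is its own inverse.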

Decryption of a ciphertext block incorporates the inverses of the operations

used in the encryption process as shown in Figure 8. Note that AddRoundKey is

its own inverse since it involves only the bitwise exclusive-OR operation.

Decrypt(byte in[16], byte out[16], word k[44])
begin
    byte state[4,4]
    state = in
    AddRoundKey(state, k[40, 43])
    for round = 9 step -1 downto 1
        InvShiftRows(state)
        InvSubBytes(state)
        AddRoundKey(state, k[round*4, (round+1)*4-1])
        InvMixColumns(state)
    end for
    InvShiftRows(state)
    InvSubBytes(state)
    AddRoundKey(state, k[0, 3])
    out = state
end

Figure 8: The AES decryption process

InvSubBytes: reverses the transformation performed by SubBytes by first

applying an affine transformation using the inverse of the 8 × 8 matrix used for

SubBytes followed by calculation of the multiplicative inverse in the Galois

Field GF($2^8$) modulo m(x).

InvShiftRows: performs cyclic right-shifts on each row in the state in the

same amounts as ShiftRows.

InvMixColumns: performs multiplication of the state by the inverse of the

Galois Field constant matrix from MixColumns:

$$
\begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix}
=
\begin{bmatrix}
0e & 0b & 0d & 09 \\
09 & 0e & 0b & 0d \\
0d & 09 & 0e & 0b \\
0b & 0d & 09 & 0e
\end{bmatrix}
\begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.
$$

4.3.3 Key Schedule

For the 128-bit key size implementation of the AES algorithm, the master

key is expanded into a linear array of forty-four 4-byte words (eleven four-word round keys) using the process presented

in Figure 9. There are two operations and an array of constants used specifically for

the key schedule:

KeyExpansion(byte key[16], word w[44])
begin
    word temp
    i = 0
    while (i < 4)
        w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
        i = i + 1
    end while
    i = 4
    while (i < 44)
        temp = w[i-1]
        if (i mod 4 = 0)
            temp = SubWord(RotWord(temp)) xor Rcon[i/4]
        end if
        w[i] = w[i-4] xor temp
        i = i + 1
    end while
end

Figure 9: The key expansion process for AES

SubWord: applies a substitution to each of the four bytes in the input word

using the same S-Box that is used in encryption.

RotWord: performs a cyclic left rotation by one byte on the input word

$a_0a_1a_2a_3$ to produce an output of $a_1a_2a_3a_0$.

Rcon[ ]: the round constant array with a size of ten words. For i from 1 to 10,

$$Rcon[i] = [\{02\}^{i-1}, \{00\}, \{00\}, \{00\}],$$

where the powers $\{02\}^{i-1}$ are in the Galois Field GF($2^8$).
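The key expansion for a 128-bit key can be sketched in runnable form as follows. The function names are illustrative assumptions, the S-Box is generated from its mathematical definition rather than stored as constants, and the example key is an arbitrary value chosen for this sketch.

```python
def gf_mul(a, b):
    """Multiplication in GF(2^8) modulo m(x) = x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result

def sub_byte(a):
    """Field inverse (computed as a^254) followed by the affine transformation."""
    x = 1
    for _ in range(254):
        x = gf_mul(x, a)
    b = x
    for shift in range(1, 5):
        b ^= ((x << shift) | (x >> (8 - shift))) & 0xFF
    return b ^ 0x63

SBOX = [sub_byte(a) for a in range(256)]

def key_expansion(key):
    """Expand a 16-byte key into forty-four 32-bit words."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01                                   # {02}^0
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = ((temp << 8) | (temp >> 24)) & 0xFFFFFFFF          # RotWord
            temp = int.from_bytes(
                bytes(SBOX[b] for b in temp.to_bytes(4, "big")), "big")  # SubWord
            temp ^= rcon << 24                    # Rcon[i/4] = [rcon, 00, 00, 00]
            rcon = gf_mul(rcon, 0x02)             # next power of {02}
        w.append(w[i - 4] ^ temp)
    return w

w = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

Words `w[4*r]` through `w[4*r + 3]` form the round key for round r, matching the `k[round*4, (round+1)*4-1]` indexing used in the encryption and decryption figures.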

4.3.4 Performance

Rijndael software performance bottlenecks typically occur in the SubBytes

and MixColumns transformations, one or both of which are usually implemented

via 8-bit to 8-bit look-up tables. Often most of the Rijndael round transformations

(SubBytes, ShiftRows, and MixColumns) are combined into large look-up tables

termed T-tables. Such implementations require up to three T-tables whose size

may be either 1 kbyte or 4 kbytes where the smaller tables require performing an

additional rotation operation. The goal of the T-tables is to avoid performing the

MixColumns and InvMixColumns transformations as these operations perform Galois

Field fixed field constant multiplication, an operation which maps poorly to general

purpose processors. However, the use of T-tables has significant disadvantages. The

T-tables significantly increase code size, their performance is dependent on the mem-

ory system architecture as well as cache size, and their use causes key expansion for

Rijndael decryption to become significantly more complex. As an alternative to the

T-tables implementation method, it is also feasible to have the processor perform all of

the Rijndael round transformations. Row-based implementations have been demon-

strated to allow for greater efficiency in the implementation of the MixColumns and

InvMixColumns transformations versus column-based implementations. However, the

SubBytes transformation still remains as a bottleneck, requiring separate 256-byte

look-up tables for encryption and decryption [20], [29], [32], [81], [82], [83], [84].

Numerous co-processors have been developed to accelerate cryptographic

algorithm implementations. The CryptoManiac VLIW co-processor was developed

as a result of instruction set extensions designed to accelerate the performance of

a number of the AES candidate algorithms. CryptoManiac features the execution

of up to four instructions per cycle and the use of instructions with up to three

operands to allow for the combination of short latency instructions for single cycle

execution. Similarly, the Cryptonite co-processor is also VLIW based, with two 64-

bit datapaths and special instructions combined with dedicated memories to support

Rijndael implementations. Both co-processors improve the performance of Rijndael

implementations versus implementations targeting general purpose processors. Other

implementations couple FPGA co-processors with a LEON-2 processor core. The

co-processors connect to the processor core via either a dedicated interface or as a

memory-mapped peripheral and were able to significantly improve the performance

of Rijndael implementations [23], [33], [35], [85], [86].

Multiple implementations of Rijndael have been presented targeting a wide

range of hardware technologies. These implementations use specific Galois Field fixed

field constant multipliers resulting in either logic equations or look-up tables being

generated to perform the multiplication. Implementations based on logic equations

are optimized for area and require a moderate number of logic levels. Implementations

based on look-up tables are optimized for speed at the cost of additional logic resources

though the performance of these implementations, like the software implementations

employing T-tables, is highly dependent on the memory system and cache organiza-

tion and size. In the case of the Galois Field fixed field constant multipliers used in

the MixColumns transformation, the 8-bit to 8-bit look-up tables may be replaced by

8-bit × 8-bit mapping matrices, reducing the associated memory requirements

by a factor of nearly 20 [87], [88]. Look-up tables may also be replaced with logic

equation implementations for the SubBytes and MixColumns transformations, sig-

nificantly reducing the hardware resource requirements. To illustrate the significant

reduction in logic resource requirements, in the case of the SubBytes transformation,

a reduction in gate count by as much as a factor of 4.66 has been realized using logic

equations in place of a look-up table. When performing sixteen SubBytes transfor-

mations in parallel in a single round of Rijndael (assuming a 128-bit implementation),

this equates to a savings of over 38,000 gate equivalences. For a pipelined implemen-

tation of 128-bit AES, this savings increases to over 380,000 gate equivalences [29].

Encryption, decryption, and Key Scheduling are all easily pipelined in non-feedback

modes of operation while single-round implementations are typically used when op-

erating in feedback modes. Depending on the implementation methodology, Rijndael

throughputs as high as 70 Gbps when operating in non-feedback modes and 2.29

Gbps when operating in feedback modes have been reported [89], [90], [91], [92], [93],

[94], [95], [96], [97], [98], [99], [100], [101], [102], [103], [104], [105], [106], [107].

Instruction set extensions are an interesting implementation option that

bridges the gap between hardware and software. Significantly improved performance

of software implementations has been demonstrated as a result of adding functionality

to a processor's datapath and corresponding control logic to decode new instructions.

Instruction set extensions designed to accelerate the performance of software

implementations of Rijndael have been proposed for a wide range of processors.

These extensions minimize the number of memory accesses, usually by combining

the SubBytes and MixColumns transformations into one T-table look-up operation

to speed up algorithm execution. While T-table performance is heavily dependent

upon available cache size, these extensions have been shown to result in performance

improvements of up to a factor of 3.68 versus Rijndael implementations without the

use of the instruction set extensions [20], [42], [85], [86], [88], [108], [109].

4.4 Modes of Operation

All of the symmetric-key algorithms targeted in this research support many

different modes of operation, methods that specify the information entered into the

algorithm. Two modes of operation are of particular interest here:

Electronic Code Book (ECB): Each block of plaintext x_i is input directly into

the encryption function to form the ciphertext y_i. Encrypting a specific value

for x_i always produces the same value for y_i and the result is not affected by

previous plaintext blocks.

Cipher Block Chaining (CBC): The first plaintext block is combined with an

initialization vector (IV) using the bitwise exclusive-OR operation, and the

result is encrypted to form the first ciphertext block. Subsequent blocks of

plaintext are exclusive-ORed with the last ciphertext block computed. In this

mode, every block of ciphertext depends on the preceding runs of the algorithm.
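The difference between the two modes can be sketched in C. This is a minimal illustration, not code from this research: toy_encrypt and toy_decrypt are hypothetical stand-ins for a real 64-bit block cipher such as DES, chosen only so that the sketch is invertible and runnable.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy 64-bit "block cipher" used only to make the sketch runnable.
 * It is NOT DES, IDEA, or AES; it is just an invertible keyed mapping. */
static uint64_t toy_encrypt(uint64_t block, uint64_t key)
{
    block ^= key;
    block = (block << 13) | (block >> 51);   /* rotate left by 13 */
    return block + 0x9E3779B97F4A7C15ull;
}

static uint64_t toy_decrypt(uint64_t block, uint64_t key)
{
    block -= 0x9E3779B97F4A7C15ull;
    block = (block >> 13) | (block << 51);   /* rotate right by 13 */
    return block ^ key;
}

/* ECB: y[i] = E(x[i]); every block is independent of the others. */
void ecb_encrypt(const uint64_t *x, uint64_t *y, size_t n, uint64_t key)
{
    for (size_t i = 0; i < n; i++)
        y[i] = toy_encrypt(x[i], key);
}

/* CBC: y[0] = E(x[0] ^ IV), y[i] = E(x[i] ^ y[i-1]). */
void cbc_encrypt(const uint64_t *x, uint64_t *y, size_t n,
                 uint64_t key, uint64_t iv)
{
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        y[i] = toy_encrypt(x[i] ^ prev, key);
        prev = y[i];                         /* chain into the next block */
    }
}

/* CBC decryption: x[i] = D(y[i]) ^ y[i-1], with y[-1] = IV. */
void cbc_decrypt(const uint64_t *y, uint64_t *x, size_t n,
                 uint64_t key, uint64_t iv)
{
    uint64_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        x[i] = toy_decrypt(y[i], key) ^ prev;
        prev = y[i];
    }
}
```

In the ECB loop no iteration reads the result of another, which is what permits pipelining. In the CBC loop each iteration needs the previous ciphertext block before it can start.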

ECB mode may be employed to create pipelined implementations of block

ciphers, leading to very high throughput. However, a disadvantage of ECB mode is

that identical plaintext blocks encrypted with the same key always produce the same

ciphertext. Also, because data blocks encrypted in ECB mode have no chaining

or feedback mechanism, the encrypted data is vulnerable to a substitution attack.

In this type of attack, encrypted blocks in the original data stream are swapped

by the attacker with other data blocks that are encrypted with the same key. CBC

mode is not vulnerable to such attacks, but because the current block to be encrypted

depends directly on the previous ciphertext, CBC mode is not well suited to pipelined

implementations.

5 PROPOSED INSTRUCTION SET EXTENSIONS

This chapter specifies the new instructions that are intended to enhance the

performance of the algorithms described in Chapter 4. All of the custom instructions

are intended to comply with the SPARC® V8 instruction model [47]. In particular,

the instructions have the Format 3 structure described in Section 3.3.

The sub-sections below indicate the syntax and encoding, as well as a brief

description, of each instruction. All instructions that write to a register execute in

one clock cycle, except for the mmul16 instruction which takes two clock cycles.

For those instructions that store data directly into registers contained in the custom

hardware, the data is available at the start of the next cycle, after instruction execu-

tion has completed.

5.1 DES and Triple-DES

5.1.1 Initial and Final Permutations

Instruction Syntax:

desipl rs1,rs2,rd

desipr rs1,rs2,rd

desfpl rs1,rs2,rd

desfpr rs1,rs2,rd

Instruction Encoding:

op   rd   op3      rs1   i   asi        rs2
10   rd   001101   rs1   0   XXXXXXnn   rs2

The desipl and desipr instructions produce the left and right halves of the

DES IP, respectively. Similarly, the desfpl and desfpr instructions produce the left

and right halves of the DES FP, respectively. The left half of the input block must

be located in the rs1 register and the right half must be located in the rs2 register.

The specific instruction to be executed is determined by the value of bits

[1:0] of the asi field as given in Table 11 below. Bits [7:2] of the asi field are ignored

by all of the DES permutation instructions.

asi[1:0]   Instruction
00         desipl
01         desipr
10         desfpl
11         desfpr

Table 11: Interpretation of the asi field for the DES permutation instructions
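As an illustration of the encoding above, the following C sketch packs and unpacks the instruction word for the four permutation instructions. The bit positions are those of the standard SPARC V8 Format 3 layout with i = 0. The function and enum names are hypothetical and are not part of the tool chain used in this research.

```c
#include <stdint.h>

/* Values of asi[1:0] taken from Table 11. */
enum des_perm { DESIPL = 0, DESIPR = 1, DESFPL = 2, DESFPR = 3 };

/* Pack a SPARC V8 Format 3 word with i = 0:
 * op[31:30] rd[29:25] op3[24:19] rs1[18:14] i[13] asi[12:5] rs2[4:0]. */
uint32_t encode_des_perm(enum des_perm which,
                         unsigned rs1, unsigned rs2, unsigned rd)
{
    const uint32_t op  = 0x2;    /* binary 10 */
    const uint32_t op3 = 0x0D;   /* binary 001101 */

    return (op << 30) |
           ((uint32_t)(rd & 0x1F) << 25) |
           (op3 << 19) |
           ((uint32_t)(rs1 & 0x1F) << 14) |
           (0u << 13) |                       /* i = 0 */
           ((uint32_t)which << 5) |           /* asi[1:0]; asi[7:2] left zero */
           (uint32_t)(rs2 & 0x1F);
}

/* Recover the instruction selector the way a decoder would:
 * only asi[1:0] matters, asi[7:2] is ignored. */
enum des_perm decode_des_perm(uint32_t word)
{
    return (enum des_perm)((word >> 5) & 0x3);
}
```

For example, encode_des_perm(DESIPL, 2, 3, 1) yields the word 0x82688003, and setting any of the ignored asi bits does not change what decode_des_perm returns.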

Inclusion of these instructions allows for the IP and FP for DES and Triple-

DES to be completed in two instructions each. Traditional implementations of the

IP and FP in software require a series of bit mask setup, shift, logical AND, and

logical OR operations for each bit for a total of 256 instructions [41]. The improved

permutation algorithm used in the reference code [110] requires 44 instructions to

complete on a SPARC® V8 processor such as the LEON2, which is still significantly

larger than the instruction count required to perform the permutations as proposed

in this research.
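For comparison, a bit-by-bit software permutation of the kind described above can be sketched as follows. This is a generic table-driven version written for illustration, not the reference code of [41] or [110]. The IP table is the standard one from the DES specification (bit 1 is the most significant bit), and FP is derived as its inverse. The function names are hypothetical.

```c
#include <stdint.h>

/* DES initial permutation table, 1-based, bit 1 = most significant bit. */
static const uint8_t IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2,  60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6,  64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1,  59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5,  63, 55, 47, 39, 31, 23, 15, 7
};

/* Generic permutation: output bit k+1 is input bit table[k].
 * Each bit costs a shift, a mask, another shift, and an OR. */
static uint64_t permute64(uint64_t in, const uint8_t table[64])
{
    uint64_t out = 0;
    for (int k = 0; k < 64; k++) {
        uint64_t bit = (in >> (64 - table[k])) & 1u;
        out |= bit << (63 - k);
    }
    return out;
}

/* Initial permutation of a 64-bit block (left half in the upper 32 bits). */
uint64_t des_ip(uint64_t block)
{
    return permute64(block, IP);
}

/* Final permutation: the inverse of IP, derived here from the IP table. */
uint64_t des_fp(uint64_t block)
{
    uint8_t fp[64];
    for (int k = 0; k < 64; k++)
        fp[IP[k] - 1] = (uint8_t)(k + 1);
    return permute64(block, fp);
}
```

Every output bit costs several basic operations in this form, which is the overhead that the desipl, desipr, desfpl, and desfpr instructions are intended to remove.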

5.1.2 Set Encryption Direction

Instruction Syntax:

desdir imm


Instruction Encoding:

op   rd      op3      rs1     i   simm13
10   00000   001001   00000   1   (dir)

Set up the DES key generator to output round keys in either encryption or

decryption order. The imm operand is set to zero for encryption, one for decryption.

This instruction also resets the round counter of the key generator according to the

chosen direction to ensure that output of the round keys may be immediately carried

out in the proper order. It is not necessary to re-load the master key after this in-

struction is executed. The desdir instruction is used in conjunction with the deskey

and desf instructions as explained in Sections 5.1.3 and 5.1.4.

5.1.3 Key Loading

Instruction Syntax:

deskey rs1,rs2


Instruction Encoding:

op   rd      op3      rs1   i   asi      rs2
10   00000   001001   rs1   0   unused   rs2

The deskey instruction loads the 64-bit master key for DES. The left half

of the master key must be contained in the rs1 register, and the right half in the rs2

register.

5.1.4 Round Core (f ) Function

Instruction Syntax:

desf rs1,rd

Instruction Encoding:

op   rd      op3      rs1   i   simm13
10   00000   001001   rs1   1   0x1XXX

This instruction takes the right half of a round output block stored in the

rs1 register and stores the output of the core (f) function into the rd register. The

round key is not specified here since the round key output of the DES key generator

is hard-wired to the f-function circuit's round key input. After completion of this

instruction, the key generator is signaled to generate the key for the next round. Due

to the logic of the DES key generator (see Section 6.1.3), the desf instruction may

not be followed by another desf instruction. However, this is not expected to cause

a performance bottleneck due to the additional instruction required for swapping the

values of the left and right halves of the round input block.

Implementation of the desdir, deskey, and desf instructions removes the

need for storage of the sixteen round keys and S-Boxes in memory. All round keys

are generated on-the-fly in the custom hardware. An implementation of the DES al-

gorithm using these instructions requires two instructions for key scheduling and four

instructions for each of the sixteen rounds (one desf, one exclusive-OR, and two regis-

ter data transfers for swapping the left and right halves of the round function output).
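The interaction of the three instructions can be modeled in C as shown below. This is a behavioral sketch under stated assumptions: the struct and function names are hypothetical, and round_key and the body of model_desf are placeholders, not the DES key schedule or the DES f-function. Only the control flow (direction select, counter reset, one round key consumed per desf) follows the description above.

```c
#include <stdint.h>

/* Software stand-in for the key generator state held in custom hardware. */
struct keygen {
    uint32_t master_l, master_r;   /* halves of the master key */
    int decrypt;                   /* 0 = encryption order, 1 = decryption */
    int round;                     /* round keys emitted since last desdir */
};

/* deskey: load the master key halves. */
void model_deskey(struct keygen *kg, uint32_t l, uint32_t r)
{
    kg->master_l = l;
    kg->master_r = r;
}

/* desdir: select the key order and reset the round counter. */
void model_desdir(struct keygen *kg, int dir)
{
    kg->decrypt = dir;
    kg->round = 0;
}

/* Placeholder round key derivation: NOT the DES key schedule. */
static uint32_t round_key(const struct keygen *kg, int n)
{
    return (kg->master_l * 0x9E3779B1u) ^
           (kg->master_r + (uint32_t)n * 0x85EBCA6Bu);
}

/* desf: placeholder f-function (NOT the DES expansion/S-box/P logic).
 * Uses the current round key, then advances the key generator. */
uint32_t model_desf(struct keygen *kg, uint32_t r)
{
    int n = kg->decrypt ? 15 - kg->round : kg->round;
    kg->round++;
    uint32_t x = r ^ round_key(kg, n);
    return ((x << 5) | (x >> 27)) * 0x27D4EB2Fu;
}

/* Sixteen rounds of four operations each, as counted in the text. */
void model_rounds(struct keygen *kg, uint32_t *l, uint32_t *r)
{
    for (int i = 1; i <= 16; i++) {
        uint32_t temp = *r;          /* mov  r, temp  */
        *r = model_desf(kg, *r);     /* desf r, r     */
        *r ^= *l;                    /* xor  l, r, r  */
        *l = temp;                   /* mov  temp, l  */
    }
}
```

Running model_rounds with direction 0, swapping the two halves, and running it again with direction 1 returns the original halves in swapped positions. This holds for any f-function, which is why only the key order has to change between encryption and decryption.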

5.1.5 New DES and Triple-DES Algorithm Implementations

The following figures show instruction sequences that may be used to implement

the DES and Triple-DES algorithms for encryption and decryption. All operand

names are symbolic and do not represent the names of physical registers of the LEON2

processor.

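    desipl  %[ptextl], %[ptextr], %[l]      ! left half of IP
    desipr  %[ptextl], %[ptextr], %[r]      ! right half of IP
    deskey  %[keyl], %[keyr]                ! load the 64-bit master key
    desdir  0                               ! round keys in encryption order
    mov     1, %[i]                         ! round counter
round:
    mov     %[r], %[temp]                   ! save the current right half
    desf    %[r], %[r]                      ! f-function with the current round key
    xor     %[l], %[r], %[r]                ! new right half = left half xor f output
    mov     %[temp], %[l]                   ! new left half = old right half
    cmp     %[i], 16
    blu     round
    add     %[i], 1, %[i]                   ! executes in the branch delay slot
    desfpl  %[r], %[l], %[ctextl]           ! FP applied to the swapped halves
    desfpr  %[r], %[l], %[ctextr]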

Figure 10: DES encryption routine with custom instructions

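    desipl  %[ctextl], %[ctextr], %[l]      ! left half of IP
    desipr  %[ctextl], %[ctextr], %[r]      ! right half of IP
    deskey  %[keyl], %[keyr]                ! load the 64-bit master key
    desdir  1                               ! round keys in decryption order
    mov     1, %[i]                         ! round counter
round:
    mov     %[r], %[temp]                   ! save the current right half
    desf    %[r], %[r]                      ! f-function with the current round key
    xor     %[l], %[r], %[r]                ! new right half = left half xor f output
    mov     %[temp], %[l]                   ! new left half = old right half
    cmp     %[i], 16
    blu     round
    add     %[i], 1, %[i]                   ! executes in the branch delay slot
    desfpl  %[r], %[l], %[ptextl]           ! FP applied to the swapped halves
    desfpr  %[r], %[l], %[ptextr]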

Figure 11: DES decryption routine with custom instructions

desipl %[ptextl], %[ptextr], %[l]

desipr %[ptextl], %[ptextr], %[r]

deskey %[key1l], %[key1r]

desdir 0

mov 1, %[i]

d1round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d1round

add %[i], 1, %[i]

mov %[r], %[temp]

mov %[l], %[r]

mov %[temp], %[l]


desdir 1

mov 1, %[i]

d2round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d2round

add %[i], 1, %[i]

mov %[r], %[temp]

mov %[l], %[r]

mov %[temp], %[l]


desdir 0

mov 1, %[i]

d3round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d3round

add %[i], 1, %[i]



Figure 12: Triple-DES encryption routine with custom instructions
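The direction sequence 0, 1, 0 used for Triple-DES encryption, and 1, 0, 1 for decryption, corresponds to the encrypt-decrypt-encrypt composition. A minimal C sketch follows. It assumes three independent keys applied in the order k1, k2, k3 for encryption and in reverse for decryption. toy_des is a hypothetical invertible stand-in for a single DES pass, not DES itself.

```c
#include <stdint.h>

/* Toy invertible 64-bit mapping standing in for one DES pass. NOT DES.
 * decrypt = 0 plays the role of desdir 0, decrypt = 1 of desdir 1. */
static uint64_t toy_des(uint64_t block, uint64_t key, int decrypt)
{
    const uint64_t c = 0x9E3779B97F4A7C15ull;

    if (!decrypt) {
        block ^= key;
        block = (block << 13) | (block >> 51);   /* rotate left by 13 */
        return block + c;
    }
    block -= c;
    block = (block >> 13) | (block << 51);       /* rotate right by 13 */
    return block ^ key;
}

/* Encrypt-decrypt-encrypt: directions 0, 1, 0. */
uint64_t tdes_encrypt(uint64_t x, uint64_t k1, uint64_t k2, uint64_t k3)
{
    x = toy_des(x, k1, 0);
    x = toy_des(x, k2, 1);
    return toy_des(x, k3, 0);
}

/* Inverse composition: directions 1, 0, 1 with the keys in reverse order. */
uint64_t tdes_decrypt(uint64_t y, uint64_t k1, uint64_t k2, uint64_t k3)
{
    y = toy_des(y, k3, 1);
    y = toy_des(y, k2, 0);
    return toy_des(y, k1, 1);
}
```

With k1 = k2 = k3 the first two passes cancel, so the composition reduces to a single encryption pass. This is the usual reason for choosing a decryption pass in the middle.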

desipl %[ctextl], %[ctextr], %[l]

desipr %[ctextl], %[ctextr], %[r]


desdir 1

mov 1, %[i]

d1round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d1round

add %[i], 1, %[i]

mov %[r], %[temp]

mov %[l], %[r]

mov %[temp], %[l]


desdir 0

mov 1, %[i]

d2round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d2round

add %[i], 1, %[i]

mov %[r], %[temp]

mov %[l], %[r]

mov %[temp], %[l]


desdir 1

mov 1, %[i]

d3round:

mov %[r], %[temp]

desf %[r], %[r]

xor %[l], %[r], %[r]

mov %[temp], %[l]

cmp %[i], 16

blu d3round

add %[i], 1, %[i]



Figure 13: Triple-DES decryption routine with custom instructions

5.2 IDEA

5.2.1 Multiplication modulo 2^16 + 1

Instruction Syntax:

mmul16 rs1, rs2, rd


Instruction Encoding:

op   rd   op3      rs1   i   asi      rs2
10   rd   101101   rs1   0   unused   rs2

This instruction calculates rs1 × rs2 mod (2^16 + 1) and stores the product

in the rd register. Both sources must be in the lower sixteen bits of their respective

registers. The 16-bit product is stored in the lower sixteen bits of the rd register.
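A software model of this operation can be sketched in C as follows. It assumes the usual IDEA convention, in which a 16-bit operand of zero represents the value 2^16 and a result of 2^16 is written back as zero. Whether the hardware handles the zero operand in exactly this way is an assumption made here for illustration. The function name is hypothetical.

```c
#include <stdint.h>

/* Multiplication modulo 2^16 + 1 on 16-bit operands.
 * Assumed convention (standard for IDEA): the word 0 stands for 2^16. */
uint16_t model_mmul16(uint16_t a, uint16_t b)
{
    uint32_t x = a ? a : 0x10000u;   /* map 0 to 2^16 */
    uint32_t y = b ? b : 0x10000u;

    /* x and y are at most 2^16, so the product fits in 64 bits easily. */
    uint32_t r = (uint32_t)(((uint64_t)x * y) % 0x10001u);

    /* r is in 1..2^16 because 2^16 + 1 is prime; 2^16 truncates to 0. */
    return (uint16_t)(r & 0xFFFFu);
}
```

For example, model_mmul16(2, 3) returns 6, and model_mmul16(0, 0) returns 1 because 2^16 is congruent to -1 modulo 2^16 + 1.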

5.2.2 New IDEA Algorithm Implementation

t0_1 = in[0]; t0_2 = in[1]; t0_3 = in[2]; t0_4 = in[3];

for(i=0;i

5.3 AES

5.3.1 SubBytes Operations

Instruction Syntax:

aessb rs1, imm, rd

aessb4 rs1, imm, rd


Instruction Encoding:

op   rd   op3      rs1   i   simm13
10   rd   101100   rs1   1   see below

These instructions perform the AES SubBytes and InvSubBytes operations

on either one (aessb) or all four (aessb4) of the bytes in the rs1 register. Only one

of these instructions may be implemented in the hardware, not both.

The value specified in the simm13 field determines the actual operation

performed. The least significant bit is set to zero for SubBytes, or one for

InvSubBytes. The value composed of bits 5 and 4 indicates the byte to be substituted for

the aessb instruction as shown in Table 12. These bits are not used by the aessb4

instruction.

simm13[5:4]   Substituted byte
00            rs1[31:24]
01            rs1[23:16]
10            rs1[15:8]
11            rs1[7:0]

Table 12: Usage of the simm13 field by the aessb and aessb4 instructions

Bits [12:6] and [3:1] of simm13 are ignored by both the aessb and aessb4

instructions.
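The decoding of the simm13 field can be illustrated with the following behavioral model in C. It is a sketch, not a description of the hardware: the S-box is computed arithmetically (inversion in GF(2^8) followed by the AES affine transformation) instead of using a table, and it is assumed that aessb passes the three unselected bytes of rs1 through unchanged. All function names are hypothetical.

```c
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

/* Brute-force multiplicative inverse; 0 maps to 0 as in AES. */
static uint8_t gf_inv(uint8_t a)
{
    for (unsigned y = 1; y < 256; y++)
        if (gf_mul(a, (uint8_t)y) == 1)
            return (uint8_t)y;
    return 0;
}

static uint8_t rotl8(uint8_t x, int n)
{
    return (uint8_t)((x << n) | (x >> (8 - n)));
}

/* SubBytes on one byte: inversion, then the affine transformation. */
static uint8_t sub_byte(uint8_t b)
{
    uint8_t v = gf_inv(b);
    return (uint8_t)(v ^ rotl8(v, 1) ^ rotl8(v, 2) ^
                     rotl8(v, 3) ^ rotl8(v, 4) ^ 0x63);
}

/* InvSubBytes on one byte: inverse affine transformation, then inversion. */
static uint8_t inv_sub_byte(uint8_t s)
{
    return gf_inv((uint8_t)(rotl8(s, 1) ^ rotl8(s, 3) ^ rotl8(s, 6) ^ 0x05));
}

/* aessb: simm13[0] selects SubBytes (0) or InvSubBytes (1),
 * simm13[5:4] selects the byte as in Table 12; other bits are ignored.
 * Assumption: the three unselected bytes are passed through unchanged. */
uint32_t model_aessb(uint32_t rs1, unsigned simm13)
{
    int inverse = simm13 & 1;
    unsigned shift = 24 - 8 * ((simm13 >> 4) & 3);
    uint8_t b = (uint8_t)(rs1 >> shift);
    uint8_t s = inverse ? inv_sub_byte(b) : sub_byte(b);

    return (rs1 & ~(0xFFu << shift)) | ((uint32_t)s << shift);
}

/* aessb4: substitute all four bytes; simm13[5:4] is not used. */
uint32_t model_aessb4(uint32_t rs1, unsigned simm13)
{
    for (unsigned sel = 0; sel < 4; sel++)
        rs1 = model_aessb(rs1, (sel << 4) | (simm13 & 1));
    return rs1;
}
```

Under this model, four aessb operations with simm13 values 0x00, 0x10, 0x20, and 0x30 produce the same result as one aessb4 operation with simm13 equal to 0x00.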

5.3.2 GF(2^m) Matrix Multiplier Constant Loading

Instruction Syntax:

gfmkld rs1,rs2


Instruction Encoding:

op   rd      op3      rs1   i   asi      rs2
10   00000   011001   rs1   0   unused   rs2

The gfmkld instruction is used to load one of the sixteen constants into

the constant matrix of the Galois Field fixed field constant matrix multiplier. The

constant matrix has the following structure:

K00 K01 K02 K03

K10 K11 K12 K13

K20 K21 K22 K23

K30 K31 K32 K33

The first constants to be loaded are those in the first row, from K00 to K03.

The loading process continues for each row in descending order, from left to right,

until the last constant K33 has been loaded. Due to the logic that has been added to

the multiplier for inclusion into the LEON2 processor datapath (see Section 6.1.6),

instances of the gfmkld instruction may not be issued consecutively.

5.3.3 GF(2^m) Matrix Multiplication

Instruction Syntax:

gfmmul rs1, rd


Instruction Encoding:

op   rd   op3      rs1   i   asi      rs2
10   rd   011101   rs1   0   unused   00000

Perform the Galois Field fixed field constant matrix multiplication on the

input in the rs1 register and store the result in the rd register.
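The combined behavior of gfmkld and gfmmul can be modeled in C as sketched below. Several points are assumptions made for illustration: the field is taken to be GF(2^8) with the AES polynomial, each load supplies one 8-bit constant, the most significant byte of rs1 is treated as the first element of the input column, and the constants loaded are those of the standard AES MixColumns matrix. The names are hypothetical and the way gfmkld actually uses its two source registers is not modeled.

```c
#include <stdint.h>

/* Multiply in GF(2^8) modulo the AES polynomial x^8 + x^4 + x^3 + x + 1. */
static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0));
        b >>= 1;
    }
    return p;
}

/* Constant matrix state held in the multiplier. */
struct gf_matrix {
    uint8_t k[4][4];
    unsigned next;      /* index of the next constant to load, 0..15 */
};

/* gfmkld model: constants arrive in row-major order K00, K01, ..., K33. */
void model_gfmkld(struct gf_matrix *m, uint8_t constant)
{
    m->k[m->next / 4][m->next % 4] = constant;
    m->next = (m->next + 1) % 16;
}

/* gfmmul model: multiply the constant matrix by the 4-byte column in rs1
 * (assumed: most significant byte is the first column element). */
uint32_t model_gfmmul(const struct gf_matrix *m, uint32_t rs1)
{
    uint32_t rd = 0;
    for (int row = 0; row < 4; row++) {
        uint8_t acc = 0;
        for (int col = 0; col < 4; col++)
            acc ^= gf_mul(m->k[row][col], (uint8_t)(rs1 >> (24 - 8 * col)));
        rd |= (uint32_t)acc << (24 - 8 * row);
    }
    return rd;
}

/* Load the standard AES MixColumns constants with sixteen gfmkld steps. */
void load_mixcolumns(struct gf_matrix *m)
{
    static const uint8_t mc[16] = {
        2, 3, 1, 1,
        1, 2, 3, 1,
        1, 1, 2, 3,
        3, 1, 1, 2
    };
    m->next = 0;
    for (int i = 0; i < 16; i++)
        model_gfmkld(m, mc[i]);
}
```

With the MixColumns constants loaded, the column 0xDB135345 maps to 0x8E4DA1BC. Loading the InvMixColumns constants in place of these would let the same multiplier serve decryption.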

5.3.4 New AES Algorithm Implementations

// First add round key

state[0] = plaintext[0] ^ key_schedule[0][0];




// The nine rounds

for (i = 1; i < Nr; i++)

{

// SubBytes + ShiftRows

asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );




tmp[0] = state[0] & 0xFF000000 ;

tmp[0] |= state[1] & 0x00FF0000 ;

tmp[0] |= state[2] & 0x0000FF00 ;

tmp[0] |= state[3] & 0x000000FF ;





tmp[1] = state[1] & 0xFF000000 ;

tmp[1] |= state[2] & 0x00FF0000 ;

tmp[1] |= state[3] & 0x0000FF00 ;

tmp[1] |= state[0] & 0x000000FF ;





tmp[2] = state[2] & 0xFF000000 ;

tmp[2] |= state[3] & 0x00FF0000 ;

tmp[2] |= state[0] & 0x0000FF00 ;

tmp[2] |= state[1] & 0x000000FF ;





tmp[3] = state[3] & 0xFF000000 ;

tmp[3] |= state[0] & 0x00FF0000 ;

tmp[3] |= state[1] & 0x0000FF00 ;

tmp[3] |= state[2] & 0x000000FF ;

// MixColumns

asm(

"gfmmul %[t0], %[s0]\n\t"




: [s0] "=r" (state[0]) , [s1] "=r" (state[1]) , [s2] "=r" (state[2]) , [s3] "=r" (state[3])

: [t0] "r" (tmp[0]) , [t1] "r" (tmp[1]) , [t2] "r" (tmp[2]) , [t3] "r" (tmp[3])

);

// Add round key

state[0] = state[0] ^ key_schedule[i][0];




}

// Final round

// SubBytes + ShiftRows





tmp[0] = state[0] & 0xFF000000 ;

tmp[0] |= state[1] & 0x00FF0000 ;

tmp[0] |= state[2] & 0x0000FF00 ;

tmp[0] |= state[3] & 0x000000FF ;





tmp[1] = state[1] & 0xFF000000 ;

tmp[1] |= state[2] & 0x00FF0000 ;

tmp[1] |= state[3] & 0x0000FF00 ;

tmp[1] |= state[0] & 0x000000FF ;





tmp[2] = state[2] & 0xFF000000 ;

tmp[2] |= state[3] & 0x00FF0000 ;

tmp[2] |= state[0] & 0x0000FF00 ;

tmp[2] |= state[1] & 0x000000FF ;





tmp[3] = state[3] & 0xFF000000 ;

tmp[3] |= state[0] & 0x00FF0000 ;

tmp[3] |= state[1] & 0x0000FF00 ;

tmp[3] |= state[2] & 0x000000FF ;

// Add round key

ciphertext[0] = tmp[0] ^ key_schedule[Nr][0];




}

Figure 15: AES encryption routine with aessb and gfmmul instructions

// AddRoundKey

state[0] = ciphertext[0] ^ key_schedule[Nr][0];




for (i = Nr-1; i > 0; i--)

{

// InvShiftRows + InvSubBytes





tmp[0] = state[0] & 0xFF000000 ;

tmp[0] |= state[3] & 0x00FF0000 ;

tmp[0] |=
