
  • INSTRUCTION SET EXTENSIONS FOR ENHANCING THE PERFORMANCE OF SYMMETRIC KEY

    CRYPTOGRAPHIC ALGORITHMS

    BY

    SEAN R. O'MELIA

    BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)

    SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

    FOR THE DEGREE OF MASTER OF SCIENCE IN ENGINEERING

    DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

    UNIVERSITY OF MASSACHUSETTS LOWELL

    Signature of Author Date

    Dr. Adam J. Elbirt, Thesis Advisor

    Prof. George P. Cheney, Thesis Committee Member

    Dr. Dalila B. Megherbi, Thesis Committee Member

  • INSTRUCTION SET EXTENSIONS FOR ENHANCING THE PERFORMANCE OF SYMMETRIC KEY

    CRYPTOGRAPHIC ALGORITHMS

    BY

    SEAN R. O'MELIA

    BS CpE, UNIVERSITY OF MASSACHUSETTS LOWELL (2005)

    ABSTRACT OF A THESIS SUBMITTED TO THE FACULTY OF THE

    DEPARTMENT OF ELECTRICAL AND COMPUTER ENGINEERING

    IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

    FOR THE DEGREE OF

    MASTER OF SCIENCE IN ENGINEERING

    UNIVERSITY OF MASSACHUSETTS LOWELL

    2007

    Thesis Advisor: Dr. Adam J. Elbirt

    Assistant Professor, Department of Computer Science

  • ABSTRACT

    In this thesis, instruction set extensions for a RISC processor are presented

    to improve the performance in software of the Data Encryption Standard (DES),

    Triple-DES, International Data Encryption Algorithm (IDEA), and Advanced

    Encryption Standard (AES) algorithms. The most computationally intensive operations

    of each algorithm are handled by a set of new instructions. The hardware supporting

    these instructions is integrated into the processor's datapath. For each of the targeted

    algorithms, comparisons are presented between traditional software implementations

    and new implementations that take advantage of the extended instruction set

    architecture. Results show that utilization of the proposed instructions significantly

    reduces program code size and improves encryption and decryption throughput. The

    additional hardware resources required by all of the custom hardware increase the

    total area of the processor by less than fifty percent.


  • ACKNOWLEDGEMENTS

    There are several people I wish to thank for their assistance and support

    in the completion of this thesis. I would like to express many thanks to my advisor,

    Dr. Adam J. Elbirt, who has been an excellent guide throughout all stages of the

    research, and Prof. George Cheney and Dr. Dalila Megherbi for their membership

    on the defense committee. I received a great deal of support on technical matters

    from Gaisler Research, the creator of the LEON2 processor, and the members of the

    Instruction Set Extensions for Cryptography Project at Graz University of

    Technology. Their advice was most helpful for understanding the LEON2 model and how the

    processor architecture can be extended. I would also like to thank all of my friends

    and family, who have encouraged me throughout the course of my work.


  • Contents

    List of Figures
    List of Tables

    1 INTRODUCTION
    2 PREVIOUS WORK
    3 THE LEON2 PROCESSOR
      3.1 Overview
      3.2 VHDL Model Hierarchy
      3.3 Processor Architecture
      3.4 SPARC® V8 Instruction Model
      3.5 Customization
      3.6 Synthesis and Simulation
      3.7 Software Development Tools
    4 TARGET ALGORITHMS
      4.1 Triple-DES
        4.1.1 The DES Algorithm
        4.1.2 DES Key Schedule
        4.1.3 The Triple-DES Algorithm
        4.1.4 Triple-DES Key Schedule
        4.1.5 Performance
      4.2 IDEA
        4.2.1 Mathematical Background
        4.2.2 Algorithm Description
        4.2.3 Key Schedule
        4.2.4 Performance
      4.3 AES
        4.3.1 Mathematical Background
        4.3.2 Algorithm Description
        4.3.3 Key Schedule
        4.3.4 Performance
      4.4 Modes of Operation
    5 PROPOSED INSTRUCTION SET EXTENSIONS
      5.1 DES and Triple-DES
        5.1.1 Initial and Final Permutations
        5.1.2 Set Encryption Direction
        5.1.3 Key Loading
        5.1.4 Round Core (f) Function
        5.1.5 New DES and Triple-DES Algorithm Implementations
      5.2 IDEA
        5.2.1 Multiplication modulo 2^16 + 1
        5.2.2 New IDEA Algorithm Implementation
      5.3 AES
        5.3.1 SubBytes Operations
        5.3.2 GF(2^m) Matrix Multiplier Constant Loading
        5.3.3 GF(2^m) Matrix Multiplication
        5.3.4 New AES Algorithm Implementations
    6 LEON2 HARDWARE AND SOFTWARE TOOLCHAIN MODIFICATIONS
      6.1 Custom Hardware Units
        6.1.1 DES Permutation Unit
        6.1.2 DES Round f-function Unit
        6.1.3 DES Key Generator
        6.1.4 Modulo (2^16 + 1) Multiplier
        6.1.5 AES S-Boxes
        6.1.6 Galois Field Fixed Field Constant Multiplier
      6.2 Architecture Modifications
      6.3 Modifications to Software Development Tools
    7 RESULTS AND ANALYSIS
      7.1 Testing Methodology
      7.2 Software Code Size
      7.3 Algorithm Execution Times
      7.4 Hardware Utilization
      7.5 Throughput to Area Comparisons
    8 CONCLUSIONS AND FUTURE WORK

    REFERENCES
    Appendix A: VHDL Source for Custom Functional Units
    Appendix B: Modifications to LEON2 VHDL Model and Development Tools
    Appendix C: Test Vectors for Functional Evaluation
    Appendix D: Example Source Code for Functional and Performance Evaluations
    About the Author

  • List of Figures

    1 Structure of SPARC® V8 Format 3 instructions
    2 Block diagram for standard block ciphers
    3 The Data Encryption Standard algorithm
    4 The DES f-function
    5 The computation graph for IDEA
    6 The AES encryption process
    7 The state representation of data blocks in AES
    8 The AES decryption process
    9 The key expansion process for AES
    10 DES encryption routine with custom instructions
    11 DES decryption routine with custom instructions
    12 Triple-DES encryption routine with custom instructions
    13 Triple-DES decryption routine with custom instructions
    14 IDEA algorithm routine with custom instructions
    15 AES encryption routine with aessb and gfmmul instructions
    16 AES decryption routine with aessb and gfmmul instructions
    17 AES encryption routine with aessb4 and gfmmul instructions
    18 AES decryption routine with aessb4 and gfmmul instructions
    19 DES permutation unit
    20 DES key generator

  • List of Tables

    1 LEON2 VHDL Model File Hierarchy
    2 The Initial Permutation IP
    3 The Expansion Operation E
    4 The DES S-Boxes
    5 The Pre-Output Permutation P
    6 The Final Permutation FP
    7 Permuted Choice 1 (PC-1)
    8 Rotations for the DES key schedule
    9 Permuted Choice 2 (PC-2)
    10 IDEA key schedule
    11 Interpretation of the asi field for the DES permutation instructions
    12 Usage of the simm13 field by the aessb and aessb4 instructions
    13 Code size in bytes for DES
    14 Code size in bytes for Triple-DES
    15 Code size in bytes for IDEA
    16 Code size in bytes for AES without gfmmul instruction
    17 Code size in bytes for AES with gfmmul instruction
    18 Execution cycles for DES
    19 Execution cycles for Triple-DES
    20 Execution cycles for IDEA
    21 Execution cycles for AES without gfmmul instruction
    22 Execution cycles for AES with gfmmul instruction
    23 Comparison with ISEC Extensions in C/Inline Assembly
    24 Comparison with ISEC Extensions in Pure Assembly
    25 Hardware utilization on the Xilinx XC4VLX25 FPGA
    26 Throughput to area ratios for algorithm implementations

  • 1 INTRODUCTION

    With more than 188 million Americans connected to the Internet [1],

    information security has become a top priority. Many applications (electronic mail,

    electronic banking, medical databases, and electronic commerce) require the

    exchange of private information. For example, when engaging in electronic commerce,

    customers provide credit card numbers when purchasing products. If the connection

    is not secure, an attacker can easily obtain this sensitive data. In order to

    implement a comprehensive security plan for a given network to guarantee the security of

    a connection, the following services must be provided [2], [3], [4]:

    Confidentiality: Information cannot be observed by an unauthorized party. This

    is accomplished via public-key and private-key encryption.

    Data Integrity: Transmitted data within a given communication cannot be

    altered in transit due to error or an unauthorized party. This is accomplished

    via the use of hash functions and Message Authentication Codes.

    Authentication: Parties within a given communication session must provide

    certifiable proof of their identity. This is accomplished via the use of digital

    signatures.

    Non-repudiation: Neither the sender nor the receiver of a message may deny

    transmission. This is accomplished via digital signatures and third party notary

    services.

    Cryptographic algorithms used to ensure confidentiality fall within one of two

    categories: private-key (also known as symmetric-key) and public-key. Symmetric-key

    algorithms use the same key for both encryption and decryption. Conversely, public-

    key algorithms use a public key for encryption and a private key for decryption. In a

    typical session, a public-key algorithm will be used for the exchange of a session key

    and to provide authenticity through digital signatures. The session key is then used

    in conjunction with a symmetric-key algorithm. Symmetric-key algorithms tend to

    be significantly faster than public-key algorithms and as a result are typically used

    in bulk data encryption [3]. The two types of symmetric-key algorithms are block

    ciphers and stream ciphers. Block ciphers operate on a block of data while stream

    ciphers encrypt individual bits. Block ciphers are typically used when performing

    bulk data encryption and the data transfer rate of the connection directly follows the

    throughput of the implemented algorithm.

    High throughput encryption and decryption are becoming increasingly

    important in the area of high-speed networking. Many applications demand the creation

    of networks that are both private and secure while using public data-transmission

    links. These systems, known as Virtual Private Networks (VPNs), can demand

    encryption throughputs at speeds exceeding Asynchronous Transfer Mode (ATM) rates

    of 622 million bits per second (Mbps). Increasingly, security standards and

    applications are defined to be algorithm independent. Although context switching between

    algorithms can be easily realized via software implementations, the task is significantly

    more difficult when using hardware implementations. The advantages of a software

    implementation include ease of use, ease of upgrade, ease of design, portability, and

    flexibility. However, a software implementation offers only limited physical security,

    especially with respect to key storage [3], [5]. Conversely, cryptographic algorithms

    that are implemented in hardware are by nature more physically secure as they

    cannot easily be read or modified by an outside attacker when the key is stored in special

    memory internal to the device [5]. As a result, the attacker does not have easy access

    to the key storage area and cannot discover or alter its value in a straightforward

    manner [3].

    When using a general-purpose processor, even the fastest software

    implementations of block ciphers cannot satisfy the required bulk data encryption data

    rates for high-end applications [6], [7], [8], [9], [10]. As a result, hardware

    implementations are necessary for block ciphers to achieve this required performance level.

    Although traditional hardware implementations lack flexibility with respect to

    algorithm and parameter switching, configurable hardware devices offer a promising

    alternative for the implementation of processors via the use of IP cores in

    Application Specific Integrated Circuit (ASIC) and Field Programmable Gate Array (FPGA)

    technology. To illustrate, Altera Corporation offers IP core implementations of the

    Intel 8051 microcontroller and the Motorola 68000 processor in addition to their

    own Nios® II embedded processor [11]. Similarly, Xilinx Inc. offers IP core

    implementations of the PowerPC processor in addition to their own MicroBlaze™ and

    PicoBlaze™ embedded processors [12]. ASIC and FPGA technologies provide the

    opportunity to augment the existing datapath of a processor implemented via an IP

    core to add acceleration modules supported through newly defined instruction set

    extensions targeting performance-critical functions [13], [14], [15]. Moreover, many

    licensable and extendible processor cores are also available for the same purpose [16],

    [17], [18], [19].

    The use of instruction set extensions follows the hardware/software co-

    design paradigm to achieve the performance and physical security associated with

    hardware implementations while providing the portability and flexibility traditionally

    associated with software implementations [20]. Moreover, when considering alternative

    solutions, instruction set extensions result in significant performance improvements

    versus traditional software implementations with considerably reduced logic

    resource requirements versus hardware-only solutions such as co-processors [21], [22],

    [23], [24], [25], [26], [27], [28], [29]. It is the goal of this research to demonstrate a set

    of instruction set extensions for a reduced instruction set computing (RISC) processor

    that enhance the performance of symmetric-key algorithms in software

    implementations.

    Chapter 2 discusses related work on the various methods of speeding up

    symmetric-key algorithms in software. These include optimization of pure software

    implementations, off-loading of cryptographic algorithm execution to co-processors,

    and other instruction set extensions. It will be shown that advances in technology

    have fueled trends towards increased reconfigurability in embedded systems, resulting

    in instruction set extensions becoming a more viable and attractive option when the

    performance of symmetric-key algorithms is critical.

    Chapter 3 describes the target processor, the LEON2 RISC processor whose

    architecture is based on the SPARC® architecture. The LEON2 processor was

    chosen for its robust design, ease of configurability, and customization through the full

    availability of the model's hardware description language (HDL) source code.

    In Chapter 4, the cryptographic algorithms that are the focus of this

    research are explained. These include the Data Encryption Standard (DES) and

    Triple-DES, the International Data Encryption Algorithm (IDEA), and the Advanced

    Encryption Standard (AES). The various performance bottlenecks that are commonly

    encountered in software implementations of these algorithms are also discussed.

    Syntax and encoding of the newly developed instructions are presented in

    Chapter 5, followed by a description of the modifications to the LEON2 processor

    and its associated development tools in Chapter 6. To evaluate the effectiveness of

    the instruction set extensions, Chapter 7 presents data on the logic utilization of the

    custom hardware, as well as throughput data for the target algorithms both with

    and without the use of the custom instructions. The thesis concludes in Chapter 8

    with recommendations for investigating further architectural enhancements to

    the LEON2 processor for supporting symmetric-key algorithms.

  • 2 PREVIOUS WORK

    Most traditional methods for improving the throughput of pure software

    implementations of symmetric-key algorithms fall into one of two categories. One

    option is to construct memory-based look-up tables where results of some of the basic

    operations of the algorithm have been pre-computed and stored. The substitution

    boxes, or S-Boxes, of the DES and AES algorithms are commonly stored in look-up

    tables in software implementations. Look-up tables may also be used to combine

    operations used in the DES and AES algorithms. An implementation of DES in [30]

    combines the S-Box table look-up with the subsequent 32-bit permutation and uses

    tables as part of the Initial and Final Permutations. The AES algorithm requires

    several complicated mathematical operations that are time-consuming on general-

    purpose processors. Therefore in some implementations, large look-up tables, called

    T-tables, are employed that combine several of these complex operations into a single

    table access [20]. A look-up table based implementation is a viable option for systems

    with a large memory space and low memory access times. However, area-constrained

    systems suffer large performance penalties under these implementations [20], [29];

    thus, they are generally not employed in those environments.
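
    The table-combining idea described above can be sketched in C. This is a minimal

    illustration under stated assumptions, not the implementation from [30]: the 4-bit

    S-box values, the 8-bit permutation, and all names (SBOX, PERM, sp_build_tables,

    sp_table) are hypothetical. The sketch relies only on the fact that a bit permutation

    distributes over OR, so the permuted S-box output for each input nibble can be

    pre-computed once and the per-nibble results combined with a single OR.

```c
#include <assert.h>
#include <stdint.h>

/* Toy 4-bit S-box: arbitrary illustrative values, NOT a DES or AES S-box. */
static const uint8_t SBOX[16] = {
    0xC, 0x5, 0x6, 0xB, 0x9, 0x0, 0xA, 0xD,
    0x3, 0xE, 0xF, 0x8, 0x4, 0x7, 0x1, 0x2
};

/* Toy 8-bit permutation (hypothetical): output bit i is input bit PERM[i]. */
static const uint8_t PERM[8] = { 3, 7, 0, 4, 1, 5, 2, 6 };

/* Bit-by-bit permutation, the slow operation the tables are meant to avoid. */
static uint8_t permute8(uint8_t x)
{
    uint8_t y = 0;
    for (int i = 0; i < 8; i++)
        y |= (uint8_t)(((x >> PERM[i]) & 1u) << i);
    return y;
}

/* Straightforward version: substitute both nibbles, then permute. */
static uint8_t sp_reference(uint8_t x)
{
    uint8_t s = (uint8_t)((SBOX[x >> 4] << 4) | SBOX[x & 0xF]);
    return permute8(s);
}

/* Combined tables: permuted S-box output for the high and the low nibble. */
static uint8_t SP_HI[16], SP_LO[16];

static void sp_build_tables(void)
{
    for (int v = 0; v < 16; v++) {
        SP_HI[v] = permute8((uint8_t)(SBOX[v] << 4));
        SP_LO[v] = permute8(SBOX[v]);
    }
}

/* Table version: two look-ups and one OR replace substitution + permutation. */
static uint8_t sp_table(uint8_t x)
{
    return (uint8_t)(SP_HI[x >> 4] | SP_LO[x & 0xF]);
}

/* Checks that both versions agree on every 8-bit input. */
static void sp_self_check(void)
{
    for (unsigned x = 0; x < 256; x++)
        assert(sp_table((uint8_t)x) == sp_reference((uint8_t)x));
}
```

    Calling sp_build_tables() once and then sp_table() per input trades 32 bytes of

    table storage for the bit loop. Real ciphers use far larger tables, which is where

    the memory cost noted above becomes significant.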

    Another method for speeding up software implementations of cryptographic

    algorithms involves taking advantage of mathematical or structural properties of the

    particular algorithm. The Initial and Final Permutations of the DES algorithm have

    regular structures that make it possible to execute a series of matrix transformations

    and exclusive-OR operations as demonstrated in [31]. This translates into a sequence

    of instructions that is much smaller than the traditional sequence required to perform

    the Initial and Final Permutations. In previous work on improving the performance of

    the AES algorithm on 32-bit systems, it has been shown that transforming a block of

    plaintext from a column-oriented matrix to a row-oriented matrix reduces the number

    of instructions required to complete the cipher rounds. In particular, a row-oriented

    representation allows for a more efficient implementation of the Galois Field matrix

    multiplication operations required for AES encryption and decryption [32].
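
    The change of representation mentioned above can be illustrated with a short C

    sketch. It shows only the conversion of a 4x4 byte state between column-oriented and

    row-oriented 32-bit words; the optimized round computation of [32] is not reproduced.

    The function name aes_state_transpose and the byte order (the first byte of a column

    or row held in the most significant byte of the word) are assumptions made for this

    example.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Transpose a 4x4 byte matrix packed into four 32-bit words.
 * Assumed layout: in[c] holds column c, with row 0 in bits 31..24 and
 * row 3 in bits 7..0. On return, out[r] holds row r, with column 0 in
 * bits 31..24. Applying the function twice restores the original words.
 */
static void aes_state_transpose(const uint32_t in[4], uint32_t out[4])
{
    assert(in != out); /* not an in-place transpose */

    for (int r = 0; r < 4; r++) {
        uint32_t w = 0;
        for (int c = 0; c < 4; c++) {
            uint32_t byte = (in[c] >> (24 - 8 * r)) & 0xFFu;
            w |= byte << (24 - 8 * c);
        }
        out[r] = w;
    }
}
```

    With the state held as rows, an operation that treats all four columns alike can act

    on whole 32-bit words instead of individual bytes, which is the kind of saving the

    row-oriented approach aims for.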

    In order to extend the cryptographic capabilities of an embedded system

    without modifying the main processor, a co-processor solution can be adopted. When

    there is data that must be encrypted or decrypted through the chosen symmetric-key

    algorithm, the main processor sends the data and key material to the co-processor,

    and the co-processor performs the algorithm, sending the processed data back over the

    interface to the main processor. Most co-processor solutions have tended to combine

    a number of different algorithms to provide a multi-faceted security solution. Co-

    processors have generally achieved high throughput values compared to traditional

    software implementations and therefore are much more capable of meeting demands

    for speed-critical network communications. However, this type of solution is generally

    associated with considerable overhead in terms of hardware area utilization, data

    transfer latency, and complex interfaces to the main processor [23], [33], [34], [35],

    [36], [37], [38], [39], [40].

    There has been previous work on instruction set extensions for general

    permutations that are useful for improving the performance of permutations for the

    DES algorithm. Shi and Lee [41] presented two new instructions for general and

    dynamically specified permutations. The input and a string of configuration bits are

    specified in the source operands and the result is stored in the destination register.

    These instructions, along with two new instructions, are discussed in [21]. In general,

    permutations of n bits required log2(n) issues of the custom instructions, as well

    as several loads into registers of configuration bits. The MOSES platform developed

    by a group based at NEC Research Laboratories is based on the Xtensa T1040, a

    RISC-like processor designed to be easily extended with additional custom hardware

    and supporting instructions. Throughput improvement factors of 31.0 for DES, 33.9

    for Triple-DES, and 17.4 for AES were reported for this custom architecture [42], [43].

    Study of the effect of custom instructions that support the AES cipher is

    extensive. Most of this work targets the memory look-ups and multiplications that

    are needed to perform the encryption rounds and key schedule. The Instruction Set

    Extensions for Cryptography (ISEC) project conducted at the Graz University of

    Technology in Graz, Austria, has investigated instruction set extensions that perform

    the mathematical operations in the AES rounds using custom functional units

    integrated into the targeted processor's datapath [29]. Earlier work in the ISEC project

    demonstrated the effectiveness of instruction set extensions for elliptic curve

    cryptography [27] in improving the performance of binary extension Galois Field arithmetic

    [28].

  • 3 THE LEON2 PROCESSOR

    3.1 Overview

    The target processor for this work is the LEON2, a RISC central processing

    unit (CPU) that was produced by Gaisler Research [44] (note that at the time of

    this writing, support for the LEON2 processor has been discontinued in favor of the

    newer LEON3 processor model). The LEON2 processor is implemented in VHDL

    and is fully synthesizable. The model is highly configurable, allowing for adjustments

    to many features of the processor using a graphical configuration utility. The entire

    source code is freely available under the GNU General Public License which enables

    custom modifications and enhancements to the architecture. Information presented

    in this chapter is derived from the LEON2 documentation [45], [46].

    3.2 VHDL Model Hierarchy

    The source code for the LEON2 processor has the directory structure shown

    in Table 1. The top-level folder /leon2/ is used as an example; it is permissible for

    the root directory to have any name.

    3.3 Processor Architecture

    LEON2 is based on the Scalable Processor Architecture (SPARC®).

    SPARC® was first developed in 1985 at Sun Microsystems and is based on the work

    that produced the RISC I and RISC II architectures at the University of California at

    Berkeley during the early 1980s [47]. The LEON2 processor attained full certification

    of compliance with the SPARC® V8 architecture in 2003 [48].

    Folder             Description                        Refer to
    leon2/             Top directory                      Sec. 3.2
    leon2/boards/      FPGA board support files           Sec. 3.6
    leon2/doc/         User manuals                       [46]
    leon2/leon/        LEON2 processor VHDL model         Sec. 3.3
    leon2/pmon/        Simple boot-monitor                Not discussed
    leon2/sim/         Simulator support files            Sec. 3.6
    leon2/syn/         Synthesis support files            Sec. 3.6
    leon2/tbench/      LEON2 VHDL test bench              Sec. 3.6
    leon2/tkconfig/    Graphical configuration utility    Sec. 3.5
    leon2/tsource/     LEON2 test bench (C source)        Sec. 3.6

    Table 1: LEON2 VHDL Model File Hierarchy

    Features of the LEON2 processor coding style include fully synchronous

    design with a single clock, use of multiplexers for loading of pipeline registers, sep-

    arate combinational and sequential processes, and record types for interconnection

    of component I/O signals. LEON2 provides support for on-chip peripherals such as

    a floating-point unit (FPU), Peripheral Component Interconnect (PCI), and

    Ethernet; co-processor support is also available in accordance with the SPARC® model.

    However, these features are outside the scope of this research and are therefore not

    discussed in any further detail. The main focus of the LEON2 architecture with

    regard to the proposed instruction set extensions is the pipelined integer unit (IU).

    The IU pipeline consists of five stages: fetch, decode, execute, memory, and write back.

    The VHDL model implements each stage in its own process. A process

    statement in VHDL is a closed block of code that runs sequentially. The inputs are

    specified by a sensitivity list. The process statement executes at any time a signal

    in the sensitivity list changes state. Processes are used for behavioral VHDL code, a

    high-level coding style used commonly for describing sequential logic [49].

    3.4 SPARC® V8 Instruction Model

    All SPARC® V8 instructions are implemented in the LEON2 processor

    architecture. Instructions are grouped according to the values of the various fields in

    the instruction operation code. Arithmetic, logic, and memory operations have the

    Format 3 structure [47] shown in Figure 1.

    | op | rd | op3 | rs1 | i=0 | asi | rs2 |
    | op | rd | op3 | rs1 | i=1 |  simm13   |

    Figure 1: Structure of SPARC® V8 Format 3 instructions
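
    To make the field layout of Figure 1 concrete, the following C sketch packs and

    unpacks Format 3 instruction words. The field names come from Figure 1. The bit

    widths (op: 2, rd: 5, op3: 6, rs1: 5, i: 1, asi: 8, rs2: 5, simm13: 13) follow the

    SPARC V8 specification [47] rather than the text above, and the function names are

    hypothetical helpers introduced for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Format 3 with i = 0: op | rd | op3 | rs1 | i=0 | asi | rs2 */
static uint32_t sparc_fmt3_reg(unsigned op, unsigned rd, unsigned op3,
                               unsigned rs1, unsigned asi, unsigned rs2)
{
    assert(op < 4 && rd < 32 && op3 < 64);
    assert(rs1 < 32 && asi < 256 && rs2 < 32);
    return ((uint32_t)op << 30) | ((uint32_t)rd << 25) |
           ((uint32_t)op3 << 19) | ((uint32_t)rs1 << 14) |
           ((uint32_t)asi << 5) | (uint32_t)rs2;
}

/* Format 3 with i = 1: op | rd | op3 | rs1 | i=1 | simm13 (signed) */
static uint32_t sparc_fmt3_imm(unsigned op, unsigned rd, unsigned op3,
                               unsigned rs1, int simm13)
{
    assert(op < 4 && rd < 32 && op3 < 64 && rs1 < 32);
    assert(simm13 >= -4096 && simm13 <= 4095);
    return ((uint32_t)op << 30) | ((uint32_t)rd << 25) |
           ((uint32_t)op3 << 19) | ((uint32_t)rs1 << 14) |
           (1u << 13) | ((uint32_t)simm13 & 0x1FFFu);
}

/* Extract the destination register field (bits 29..25). */
static unsigned sparc_fmt3_rd(uint32_t insn)
{
    return (insn >> 25) & 0x1Fu;
}

/* Extract and sign-extend the 13-bit immediate (bits 12..0). */
static int sparc_fmt3_simm13(uint32_t insn)
{
    int v = (int)(insn & 0x1FFFu);
    return (v & 0x1000) ? v - 0x2000 : v;
}
```

    For example, the standard instruction add %g1, %g2, %g3 (op = 2, op3 = 0, rd = 3,

    rs1 = 1, rs2 = 2) is produced by sparc_fmt3_reg(2, 3, 0, 1, 0, 2) as the word

    0x86004002.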

    3.5 Customization

    Most of the available features of the LEON2 processor can be enabled,

    disabled, or adjusted by using the graphical configuration utility. For the purposes of

    this work, a basic configuration is used with no FPU, PCI, Ethernet, co-processor

    interface, or hardware multiplier or divider. To extend the LEON2 architecture beyond

    the scope of the standard model, additional VHDL code is required. The specific

    files that must be modified depend on what functionality is to be added, but if the

    instruction set is to be extended, the module containing the SPARC® V8 opcode

    constants must be updated, and these instructions must follow the SPARC® V8

    architecture specification [47]. The graphical configuration utility may also be

    modified to provide an easy interface for adjusting parameters of the custom functionality.


    3.6 Synthesis and Simulation

    The LEON2 VHDL implementation is a fully synthesizable processor that

    can be targeted to any type of FPGA or ASIC technology. There are pre-made

    packages for several synthesis tools such as XST, Synplify, Synopsys, and Leonardo

    in the /syn/ sub-folder of the LEON2 directory structure. These packages enable use

    of technology-specific cells to directly instantiate or automatically infer the register

    files, caches, PCI FIFOs, and I/O pads. There are also a number of packages in the

    /boards/ sub-folder that support programming of physical FPGA boards [50] with

    the LEON2 architecture.

    Functional verification of programs built for the LEON2 architecture can be

    performed with the provided generic test bench. The VHDL source for the test bench

    is located in the /tbench/ sub-folder of the LEON2 directory structure. Software code

    is placed in the /tsource/ sub-folder in a format readable by the test bench VHDL

    code. The software can then be read and executed by the test bench for purposes of

    functional verification and performance evaluation.

    3.7 Software Development Tools

    In order to facilitate the development of programs targeting the LEON2

    processor, Gaisler Research has provided a series of compilers and simulators that

    may be chosen depending on the software environment. For stand-alone applications,

    the Bare C Compiler (BCC) is recommended. BCC is based on the GNU Compiler

    Collection (GCC) and GNU binutils. The BCC development tools are used in the

    same way as those included in the standard GCC and binutils packages. The actual

    names of the executables have a sparc-elf- prefix.

    Packages containing the binaries for Linux and Cygwin environments are

    available, as well as the full source code for developers who wish to take advantage

    of an expanded LEON2 architecture.

    4 TARGET ALGORITHMS

    4.1 Triple-DES

    4.1.1 The DES Algorithm

    Many block ciphers may be characterized as Feistel networks [3]. Feistel

    networks were invented by Horst Feistel [51] and are a general method of transforming

    a function into a permutation. The basic Feistel network divides the data into two

    halves where one half operates upon the other [52]. The f -function uses one of the

    halves of the data block and a key to create a pseudo-random bit stream that is

    used to encrypt or decrypt the other half of the data block. Therefore, to encrypt or

    decrypt both halves requires two iterations of the Feistel network.

    A generalization of the basic Feistel network allows for the support of larger

    data blocks. Generalization occurs by considering the data swap as a circular right

    shift. This allows for the use of the same f -function but requires multiple rounds to

    input all of the sub-blocks to the f -function [53]. Figure 2 from [53] details the block

    diagram for block ciphers employing both the basic Feistel network and generalized

    Feistel networks of three and four blocks. The f -function is represented by the box

    and the symbol ⊕ represents a bit-wise XOR operation.
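    The basic two-branch Feistel structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the round function `toy_f`, the key values, and the 32-bit half width are all hypothetical choices made for the example. The point it shows is that decryption works for any f-function, invertible or not, simply by reversing the order of the round keys.

```python
def feistel_encrypt(left, right, round_keys, f):
    """Run a basic two-branch Feistel network over two integer halves."""
    for k in round_keys:
        # New left is the old right; new right is the old left XOR f(old right, key).
        left, right = right, left ^ f(right, k)
    return left, right


def feistel_decrypt(left, right, round_keys, f):
    """Invert feistel_encrypt by undoing the rounds with the keys in reverse order."""
    for k in reversed(round_keys):
        left, right = right ^ f(left, k), left
    return left, right


def toy_f(half, key):
    """Hypothetical 32-bit round function; it does not need to be invertible."""
    return ((half * 0x9E3779B1) ^ key) & 0xFFFFFFFF


keys = [0x0F0F0F0F, 0x12345678, 0xDEADBEEF, 0x0BADF00D]  # illustrative values only
ct = feistel_encrypt(0x01234567, 0x89ABCDEF, keys, toy_f)
pt = feistel_decrypt(ct[0], ct[1], keys, toy_f)
```

    Because only XOR is used to combine the halves, each round is its own inverse once the halves are swapped back, which is why DES can share one datapath for encryption and decryption.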

    The f -function employs confusion and diffusion to obscure redundancies in

    a plaintext message [54]. Confusion obscures the relationship between the plaintext,

    the ciphertext, and the key. S-Box look-up tables are an example of a confusion

    operation. Diffusion spreads the influence of individual plaintext or key bits over

    as much of the ciphertext as possible. Expansion and permutation functions are

    examples of diffusion operations [3].

    Figure 2: Block diagram for standard block ciphers

    The basic operations that may be found within an f-function include:

    Bitwise XOR, AND, or OR.

    Modular addition or subtraction.

    Shift or rotation by a constant number of bits.

    Data-dependent rotation by a variable number of bits.

    Modular multiplication.

    Multiplication in a Galois field.

    Modular inversion.

    Look-up-table substitution.

    DES is a sixteen-round Feistel Network block cipher. A block diagram of

    the entire operation is given in Figure 3 from [55]. The DES cipher takes as input a

    64-bit key, where 8 of the 64 bits are used for parity and the other 56 bits comprise

    the actual key material. The input and output both have a size of 64 bits for both

    encryption and decryption. The procedures for encryption and decryption are almost

    exactly the same; the only difference is that the key schedule for decryption is the

    reverse of that used for encryption.

    Figure 3: The Data Encryption Standard algorithm

    Throughout the rest of this section, bit ordering is denoted for an n-bit

    vector such that bit 1 is the most significant bit and n is the least significant bit.

    In all Figures that show the bit assignments for DES permutations, the numbers

    correspond to input bits that are mapped to a specific position in the output, starting

    with output bits 1,2,3,... in the top row and ending with output bits ...,n-2,n-1,n in

    the bottom row.

    The first part of DES encryption is an Initial Permutation (IP) on the input

    block. The IP rearranges the input according to Table 2. The output of the IP is

    divided into a left half L0 and a right half R0, which becomes the input to the first

    round. For each round iteration i from 1 to 16:

    $$L_i = R_{i-1}$$

    $$R_i = L_{i-1} \oplus f(R_{i-1}, K_i)$$

    58 50 42 34 26 18 10  2
    60 52 44 36 28 20 12  4
    62 54 46 38 30 22 14  6
    64 56 48 40 32 24 16  8
    57 49 41 33 25 17  9  1
    59 51 43 35 27 19 11  3
    61 53 45 37 29 21 13  5
    63 55 47 39 31 23 15  7

    Table 2: The Initial Permutation IP

    Figure 4 is an illustration of the round function and shows the individual

    blocks of the f -function, which is the core operation of each round.

    The following tables show the mappings of input bits to output bits of the

    E and P operations, as well as the eight S-Boxes. The E expansion duplicates some

    of the bits of the 32-bit input to the f-function as shown in Table 3 and outputs

    a 48-bit value. The result of $E(R_{i-1}) \oplus K_i$ is partitioned into eight 6-bit values.

    The S-Boxes output a 4-bit number based on a 6-bit input. The input is in the form

    $a_5 a_4 a_3 a_2 a_1 a_0$; the row index into the S-Box is the number formed from $a_5 a_0$ and the

    column index is the number formed from $a_4 a_3 a_2 a_1$. The outputs of the S-Boxes are

    concatenated to form the input to the P permutation. The P permutation rearranges

    the 32 bits of the combined S-Box outputs, and the result is XOR-ed with $L_{i-1}$ to

    obtain $R_i$ for the current round.

    Figure 4: The DES f-function

    32  1  2  3  4  5
     4  5  6  7  8  9
     8  9 10 11 12 13
    12 13 14 15 16 17
    16 17 18 19 20 21
    20 21 22 23 24 25
    24 25 26 27 28 29
    28 29 30 31 32  1

    Table 3: The Expansion Operation E
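    As a small illustration of how such a bit-mapping table is applied in software, the sketch below implements the E expansion with a generic table-driven routine. The function name `permute` and the sample input are choices made for this example only; the table is Table 3, and the bit numbering follows the convention above (bit 1 is the most significant bit).

```python
# Table 3: output bit j of E is input bit E[j-1] (bit 1 = most significant).
E = [32,  1,  2,  3,  4,  5,
      4,  5,  6,  7,  8,  9,
      8,  9, 10, 11, 12, 13,
     12, 13, 14, 15, 16, 17,
     16, 17, 18, 19, 20, 21,
     20, 21, 22, 23, 24, 25,
     24, 25, 26, 27, 28, 29,
     28, 29, 30, 31, 32,  1]


def permute(value, table, in_bits):
    """Build the output one bit at a time by selecting input bits listed in table."""
    out = 0
    for pos in table:
        # Input bit `pos` (1 = MSB) sits at shift distance in_bits - pos.
        out = (out << 1) | ((value >> (in_bits - pos)) & 1)
    return out


# Expand an arbitrary 32-bit half-block to 48 bits.
expanded = permute(0xF0AAF0AA, E, 32)
```

    The loop touches one bit per iteration, which is exactly why such operations are costly on a word-oriented processor and cheap in hardware, where they reduce to wiring.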

    S1
    14  4 13  1  2 15 11  8  3 10  6 12  5  9  0  7
     0 15  7  4 14  2 13  1 10  6 12 11  9  5  3  8
     4  1 14  8 13  6  2 11 15 12  9  7  3 10  5  0
    15 12  8  2  4  9  1  7  5 11  3 14 10  0  6 13

    S2
    15  1  8 14  6 11  3  4  9  7  2 13 12  0  5 10
     3 13  4  7 15  2  8 14 12  0  1 10  6  9 11  5
     0 14  7 11 10  4 13  1  5  8 12  6  9  3  2 15
    13  8 10  1  3 15  4  2 11  6  7 12  0  5 14  9

    S3
    10  0  9 14  6  3 15  5  1 13 12  7 11  4  2  8
    13  7  0  9  3  4  6 10  2  8  5 14 12 11 15  1
    13  6  4  9  8 15  3  0 11  1  2 12  5 10 14  7
     1 10 13  0  6  9  8  7  4 15 14  3 11  5  2 12

    S4
     7 13 14  3  0  6  9 10  1  2  8  5 11 12  4 15
    13  8 11  5  6 15  0  3  4  7  2 12  1 10 14  9
    10  6  9  0 12 11  7 13 15  1  3 14  5  2  8  4
     3 15  0  6 10  1 13  8  9  4  5 11 12  7  2 14

    S5
     2 12  4  1  7 10 11  6  8  5  3 15 13  0 14  9
    14 11  2 12  4  7 13  1  5  0 15 10  3  9  8  6
     4  2  1 11 10 13  7  8 15  9 12  5  6  3  0 14
    11  8 12  7  1 14  2 13  6 15  0  9 10  4  5  3

    S6
    12  1 10 15  9  2  6  8  0 13  3  4 14  7  5 11
    10 15  4  2  7 12  9  5  6  1 13 14  0 11  3  8
     9 14 15  5  2  8 12  3  7  0  4 10  1 13 11  6
     4  3  2 12  9  5 15 10 11 14  1  7  6  0  8 13

    S7
     4 11  2 14 15  0  8 13  3 12  9  7  5 10  6  1
    13  0 11  7  4  9  1 10 14  3  5 12  2 15  8  6
     1  4 11 13 12  3  7 14 10 15  6  8  0  5  9  2
     6 11 13  8  1  4 10  7  9  5  0 15 14  2  3 12

    S8
    13  2  8  4  6 15 11  1 10  9  3 14  5  0 12  7
     1 15 13  8 10  3  7  4 12  5  6 11  0 14  9  2
     7 11  4  1  9 12 14  2  0  6 10 13 15  3  5  8
     2  1 14  7  4 10  8 13 15 12  9  0  3  5  6 11

    Table 4: The DES S-Boxes
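    The row and column indexing rule for the S-Boxes is easy to get wrong, so a short sketch may help. It uses only S1 from Table 4; the function name `sbox_lookup` is an illustrative choice, and the 6-bit input is assumed to be passed as an integer whose bit 5 is a5 and whose bit 0 is a0.

```python
# S1 from Table 4: four rows of sixteen 4-bit entries.
S1 = [
    [14,  4, 13,  1,  2, 15, 11,  8,  3, 10,  6, 12,  5,  9,  0,  7],
    [ 0, 15,  7,  4, 14,  2, 13,  1, 10,  6, 12, 11,  9,  5,  3,  8],
    [ 4,  1, 14,  8, 13,  6,  2, 11, 15, 12,  9,  7,  3, 10,  5,  0],
    [15, 12,  8,  2,  4,  9,  1,  7,  5, 11,  3, 14, 10,  0,  6, 13],
]


def sbox_lookup(sbox, six_bits):
    """Return the 4-bit S-Box output for a 6-bit input a5 a4 a3 a2 a1 a0."""
    row = ((six_bits >> 4) & 0b10) | (six_bits & 1)  # outer bits a5 and a0
    col = (six_bits >> 1) & 0xF                      # inner bits a4 a3 a2 a1
    return sbox[row][col]


# Input 011011: row = 01 (1), column = 1101 (13).
result = sbox_lookup(S1, 0b011011)
```

    A software implementation would typically reorder each table once at start-up so that the 6-bit value can index it directly, avoiding the bit shuffling on every lookup.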

    16  7 20 21
    29 12 28 17
     1 15 23 26
     5 18 31 10
     2  8 24 14
    32 27  3  9
    19 13 30  6
    22 11  4 25

    Table 5: The Pre-Output Permutation P

    After the final round, the left and right halves of the 64-bit block, L16 and

    R16, are swapped, and then subject to a Final Permutation (FP). This operation is

    simply the inverse of the IP. The bit mapping for this operation is shown in Table 6;

    bit positions are represented in the same manner as Table 2 for the IP.

    40  8 48 16 56 24 64 32
    39  7 47 15 55 23 63 31
    38  6 46 14 54 22 62 30
    37  5 45 13 53 21 61 29
    36  4 44 12 52 20 60 28
    35  3 43 11 51 19 59 27
    34  2 42 10 50 18 58 26
    33  1 41  9 49 17 57 25

    Table 6: The Final Permutation FP
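    The claim that FP is the inverse of IP can be checked directly from Tables 2 and 6. The sketch below is a minimal illustration; the helper name `permute` and the sample block are assumptions of the example, and bit 1 is taken as the most significant bit as stated earlier.

```python
IP = [58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
      62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
      57, 49, 41, 33, 25, 17,  9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
      61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7]

FP = [40, 8, 48, 16, 56, 24, 64, 32, 39, 7, 47, 15, 55, 23, 63, 31,
      38, 6, 46, 14, 54, 22, 62, 30, 37, 5, 45, 13, 53, 21, 61, 29,
      36, 4, 44, 12, 52, 20, 60, 28, 35, 3, 43, 11, 51, 19, 59, 27,
      34, 2, 42, 10, 50, 18, 58, 26, 33, 1, 41,  9, 49, 17, 57, 25]


def permute(value, table, in_bits):
    """Output bit j (1 = MSB) is input bit table[j-1]."""
    out = 0
    for pos in table:
        out = (out << 1) | ((value >> (in_bits - pos)) & 1)
    return out


block = 0x0123456789ABCDEF            # arbitrary 64-bit sample block
after_ip = permute(block, IP, 64)     # upper 32 bits are L0, lower 32 bits are R0
round_trip = permute(after_ip, FP, 64)
```

    Since the two permutations cancel, back-to-back DES operations (as in Triple-DES) can drop the inner FP and IP pairs without changing the result.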

    4.1.2 DES Key Schedule

    The key schedule for DES operates on the 64-bit master key to produce

    a series of 48-bit round keys, each used one at a time for the sixteen rounds of the

    cipher. Initially, the bits of the master key are arranged by Permuted Choice 1 (PC-1)

    into two 28-bit vectors, C and D (note that every eighth bit is a parity bit and is

    discarded). Table 7 depicts the PC-1 operation.

    C0                        D0
    57 49 41 33 25 17  9      63 55 47 39 31 23 15
     1 58 50 42 34 26 18       7 62 54 46 38 30 22
    10  2 59 51 43 35 27      14  6 61 53 45 37 29
    19 11  3 60 52 44 36      21 13  5 28 20 12  4

    Table 7: Permuted Choice 1 (PC-1)

    For each round of the cipher, a bit rotation is performed separately on the

    C and D values. Rotation moves to the left for encryption, and to the right for

    decryption. The rotation amount depends on the next round; these amounts are

    given for encryption in Table 8. These amounts are carried out in reverse order for

    the right-rotations of the decryption key schedule.

    Round          1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
    Rotate amount  1  1  2  2  2  2  2  2  1  2  2  2  2  2  2  1

    Table 8: Rotations for the DES key schedule

    The round key is the result of passing the current state of the C and D bit

    vectors through Permuted Choice 2 (PC-2 ). This operation maps the concatenation

    of C and D to the 48-bit round key as shown in Table 9.

    14 17 11 24  1  5
     3 28 15  6 21 10
    23 19 12  4 26  8
    16  7 27 20 13  2
    41 52 31 37 47 55
    30 40 51 45 33 48
    44 49 39 56 34 53
    46 42 50 36 29 32

    Table 9: Permuted Choice 2 (PC-2 )
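    The rotation step of the key schedule can be sketched as follows. Only the C and D rotations from Table 8 are modeled; PC-1 and PC-2 are omitted for brevity, and the names `rotl28` and `cd_states` as well as the sample register values are assumptions of this example. One useful property falls out of Table 8: the rotation amounts sum to 28, so C and D return to their starting values after sixteen rounds.

```python
ROTATIONS = [1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1]  # Table 8
MASK28 = (1 << 28) - 1


def rotl28(value, amount):
    """Rotate a 28-bit value left by `amount` bits."""
    return ((value << amount) | (value >> (28 - amount))) & MASK28


def cd_states(c0, d0):
    """Return the (C, D) pair used to derive each of the sixteen round keys."""
    states = []
    c, d = c0, d0
    for amount in ROTATIONS:
        c, d = rotl28(c, amount), rotl28(d, amount)
        states.append((c, d))  # PC-2 would be applied to this pair
    return states


states = cd_states(0xF0CCAAF, 0x556678F)  # sample 28-bit halves
```

    Because the registers wrap around to their initial contents, a decryption key schedule can start from the same C0 and D0 and rotate to the right instead.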

    4.1.3 The Triple-DES Algorithm

    The Triple-DES algorithm has been suggested as a more secure alternative

    to DES [55]. As the name suggests, this cipher sequentially executes the DES algorithm

    three times with keys K1, K2, and K3, where two or all three of these keys may

    be equivalent. The following rules are used for encryption and decryption, where PT

    is the plaintext, CT is the ciphertext, $E_{K_i}$ is a DES encryption using Ki, and $D_{K_i}$ is

    a DES decryption using Ki:

    $$CT = E_{K_3}(D_{K_2}(E_{K_1}(PT)))$$

    $$PT = D_{K_1}(E_{K_2}(D_{K_3}(CT)))$$

    Since the output ciphertext from implementations 1 and 2 of DES is used as

    the input plaintext to implementations 2 and 3 respectively, and the Initial and Final

    Permutations are inverses of each other, the inner Initial and Final Permutations may

    be removed from the algorithm.

    There are three keying options commonly used for Triple DES [55]:

    Keying Option 1: K1, K2, and K3 are independent.

    Keying Option 2: K1 = K3 and K2 is independent from K1 and K3.

    Keying Option 3: K1 = K2 = K3.

    Note that Keying Option 3 is equivalent to a single iteration of DES.
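    The encrypt-decrypt-encrypt composition and the effect of Keying Option 3 can be demonstrated without a full DES implementation. In the sketch below, `toy_encrypt` and `toy_decrypt` are hypothetical stand-ins for single DES (a simple invertible 64-bit mixing function with no security value), and the key values are arbitrary. Decryption undoes the three stages in reverse order, starting with K3.

```python
MASK64 = (1 << 64) - 1


def toy_encrypt(key, block):
    """Hypothetical stand-in for DES encryption: XOR, rotate left 7, add key."""
    x = (block ^ key) & MASK64
    x = ((x << 7) | (x >> 57)) & MASK64
    return (x + key) & MASK64


def toy_decrypt(key, block):
    """Exact inverse of toy_encrypt."""
    x = (block - key) & MASK64
    x = ((x >> 7) | (x << 57)) & MASK64
    return x ^ key


def ede_encrypt(k1, k2, k3, pt):
    """Encrypt with K1, decrypt with K2, encrypt with K3."""
    return toy_encrypt(k3, toy_decrypt(k2, toy_encrypt(k1, pt)))


def ede_decrypt(k1, k2, k3, ct):
    """Undo the three stages in reverse order."""
    return toy_decrypt(k1, toy_encrypt(k2, toy_decrypt(k3, ct)))


k1, k2, k3 = 0x0123456789ABCDEF, 0x23456789ABCDEF01, 0x456789ABCDEF0123
pt = 0x5468652071756663
ct = ede_encrypt(k1, k2, k3, pt)
single = ede_encrypt(k1, k1, k1, pt)  # Keying Option 3
```

    With all three keys equal, the first two stages cancel and `single` equals one plain encryption under K1, which is what makes Keying Option 3 backward compatible with single DES.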

    4.1.4 Triple-DES Key Schedule

    To perform the key schedule for Triple-DES, the key expansion must be

    performed on each unique key that is used in the chosen implementation of the algo-

    rithm. This means that three key expansions are required for Keying Option 1, two

    are required for Keying Option 2, and one is required for Keying Option 3.

    4.1.5 Performance

    Software implementations of DES tend to be significantly slower than hard-

    ware implementations. Bit-level manipulations such as those contained in the permu-

    tation, expansion, permuted choice, and Cyclic Left/Right Shift units do not map well

    to general purpose processors. General purpose processor instruction sets operate on

    multiple bits at a time based on the processor word size. Moreover, the DES S-Boxes

    do not use memory in an efficient manner. Software look-up tables would appear

    to be the obvious implementation choice for the DES S-Boxes. However, the DES

    S-Boxes have 6-bit addresses and 4-bit output data while most memories associated

    with general purpose processors use byte addressing with either 8-bit or 32-bit output

    data. As a result, many software implementations of DES exhibit throughputs that

    are at least a full order of magnitude slower than hardware implementations.

    Even the best software implementations are only capable of throughputs in

    the range of 100 to 200 Mbps. Most of these implementations recommend storing the Li

    and Ri data as a 48-bit padded word within a 64-bit processor word and implementing

    the permutations and S-Boxes as precomputed look-up tables. Additionally, there is

    general agreement that the look-up table implementation for the S-Boxes is most

    effective when the size of the look-up tables is minimized, guaranteeing that the

    data will fit entirely in on-chip cache. Size minimization of the S-Box look-up tables

    is achieved by implementing each S-Box in its own look-up table. Finally, one key

    software optimization is the unrolling of software loops to increase performance. Even

    when software loops are too cumbersome to unroll, using loop counters that decrement

    to zero in place of loop counters that increment to a terminal count has been shown to

    greatly increase the performance of software implementations of the DES algorithm.

    However, the unrolling of software loops must be done with great care such that the

    total data storage space does not exceed the size of the on-chip cache as this would

    cause extreme performance degradation [56], [57], [58].

    DES hardware implementations are easily realized in a single chip, such as

    an FPGA or an ASIC. Encrypting or decrypting a new block of data every clock

    cycle in a non-feedback mode, such as Electronic Code Book (ECB) or counter mode,

    also requires that the chip have at least 128 input pins (for the input data and key)

    and 64 output pins (for the output data).

    Once again, FPGA and ASIC technology provide more than enough I/O pins to meet

    these requirements. As a result, numerous fast and efficient DES implementations

    have been reported, reaching throughputs in the Gbps when targeting either FPGAs

    or ASICs. Examples of such implementations may be found in [2], [59], [60]. When

    operating in feedback modes (such as Cipher Block Chaining (CBC) mode), DES

    does not map nicely to pipelined hardware implementations because of the chaining

    of blocks. The chaining requires ciphertext block $y_{i-1}$ to process plaintext block $x_i$

    and thus simultaneous processing of the two blocks is impossible, requiring that the

    pipeline be stalled until generation of the ciphertext block $y_{i-1}$ is completed. However,

    the stalling of the pipeline may be avoided in an environment with multiple data

    streams, such as in a network processor. In such a situation, the pipeline may be fully

    utilized by interleaving the data streams. For a fully pipelined DES implementation

    where the atomic unit of the pipeline is the DES round function, the pipeline will

    have sixteen stages, thus requiring sixteen interleaved data streams. Let x0S0 denote

    plaintext block 0 from data stream 0, x0S1 denote plaintext block 0 from data stream

    1, etc. Using this notation, the pipeline is filled with blocks x0S0 , x0S1 , x0S2 , . . .,

    x0S15 . When x0S0 has passed the final stage of the pipeline to yield y0S0 , x1S0 is

    ready to enter the first stage of the pipeline and is combined with y0S0 via the XOR

    operation to perform CBC mode chaining. Thus each data stream is encrypted and

    decrypted in CBC mode while also maintaining full pipeline utilization, maximizing

    the performance of the implementation. Note that such an implementation must also

    maintain sixteen Initialization Vectors, one for each data stream, to be combined with

    the first plaintext block x0 of the associated data stream via the XOR operation.

    The earliest VLSI implementations of DES [61], [62] achieved throughputs

    ranging from 20 to 32 Mbps using 3 µm technology. The variances in throughput are

    compared and contrasted based upon speed versus area tradeoffs. The implementa-

    tions support multiple modes of operation, including ECB, CBC, Cipher Feedback

    (CFB), and Output Feedback (OFB) (see [63] for a detailed description of DES modes

    of operation). Other ASIC implementations of DES [64], [65] achieve a throughput

    of 1 Gbps using 0.8 µm Gallium Arsenide (GaAs) technology. More recently, a DES

    ASIC implementation has been demonstrated to operate at up to 10 Gbps using 0.6

    µm technology [59].

    4.2 IDEA

    4.2.1 Mathematical Background

    The International Data Encryption Algorithm (IDEA) was originally pub-

    lished as the Proposed Encryption Standard (PES) by Xuejia Lai and James Massey

    [66]. The computations involved in IDEA are based on operations from three different

    mathematical groups:

    16-bit bitwise exclusive-OR, denoted by $\oplus$,

    Addition modulo $2^{16}$, denoted by $\boxplus$,

    Multiplication modulo $(2^{16} + 1)$, denoted by $\odot$.

    For the third operation, an input of 0x0000 represents the value $2^{16}$. This

    is because the $\odot$ operation is performed over the multiplicative group $Z^*_{2^{16}+1}$, where

    zero is not a member but $2^{16}$ is a member of the group. The value $2^{16}$ is therefore

    denoted by 0x0000 so that only sixteen bits are required to represent all possible

    values for the inputs to each operation.
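    The zero-encoding convention for the multiplication is the part of IDEA most often implemented incorrectly, so a minimal sketch is given here. The function names are illustrative only. The inverse is computed with Fermat's little theorem, which applies because 65537 is prime; this is one straightforward approach and not necessarily the one used in any particular IDEA implementation.

```python
MODULUS = (1 << 16) + 1  # 65537, a prime


def idea_mul(a, b):
    """Multiply modulo 2^16 + 1, where the 16-bit value 0 stands for 2^16."""
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2^16 is encoded back to 0 by the 16-bit mask.
    return (a * b % MODULUS) & 0xFFFF


def idea_add(a, b):
    """Add modulo 2^16."""
    return (a + b) & 0xFFFF


def idea_mul_inverse(a):
    """Multiplicative inverse modulo 2^16 + 1 via a^(p-2) mod p."""
    a = a or 0x10000
    return pow(a, MODULUS - 2, MODULUS) & 0xFFFF


def idea_add_inverse(a):
    """Additive inverse modulo 2^16."""
    return (-a) & 0xFFFF


product = idea_mul(0x0000, 0x0000)  # (2^16 * 2^16) mod 65537
```

    Note that the intermediate product can be up to 33 bits wide, which is one reason this operation is awkward on a 32-bit datapath.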

    The security of IDEA is based not only on its large key size but also on the

    fact that the output of one group operation is never used as an input to the same

    operation. Further details are available in the original proposal for PES [66]. IDEA

    evolved into its final form [67] due to modifications required to strengthen the cipher

    against differential cryptanalysis attacks [68]. IDEA is used in many commercial

    applications, such as Pretty Good Privacy (PGP). Like DES, IDEA operates across

    64-bit blocks. However, while DES requires a 56-bit key, IDEA requires a 128-bit key,

    accounting for the increased security of the cipher as compared to DES.

    4.2.2 Algorithm Description

    IDEA operates on 64-bit plaintext blocks and has a key size of 128 bits.

    The algorithm consists of eight rounds followed by a final transformation to obtain

    the output. Similar to DES, the procedure (referred to as a computation graph; see

    Figure 5), is the same for both encryption and decryption but different key schedules

    are used.

    Figure 5 shows the input text as four 16-bit sub-blocks X1, X2, X3, and X4.

    These text sub-blocks are combined with the six 16-bit sub-blocks of the round key

    for the current round r, labeled $Z_1^{(r)}$ through $Z_6^{(r)}$, using the mathematical operations

    noted above.

    4.2.3 Key Schedule

    The IDEA key schedule for encryption is based on a series of left rotations

    of the 128-bit master key. The master key is first partitioned into eight 16-bit blocks;

    these are the first eight key sub-blocks: $Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}, Z_4^{(1)}, Z_5^{(1)}, Z_6^{(1)}, Z_1^{(2)}, Z_2^{(2)}$.

    The next eight key blocks are obtained by rotating the key to the left by 25 bits,

    then performing the partition again. This process is repeated until all 52 key blocks

    are generated (six blocks for each of the eight rounds and four blocks for the final

    transformation).
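    The encryption key schedule just described can be sketched in a few lines. The function name and the representation of the master key as a 128-bit Python integer are assumptions of this example; the sample key is arbitrary.

```python
MASK128 = (1 << 128) - 1


def idea_encrypt_subkeys(key):
    """Derive the 52 16-bit encryption sub-blocks from a 128-bit integer key."""
    subkeys = []
    while len(subkeys) < 52:
        # Partition the current 128-bit value into eight 16-bit blocks, MSB first.
        for i in range(8):
            subkeys.append((key >> (112 - 16 * i)) & 0xFFFF)
        # Rotate the whole key left by 25 bits before the next partition.
        key = ((key << 25) | (key >> 103)) & MASK128
    return subkeys[:52]  # the seventh partition is only partly used


subkeys = idea_encrypt_subkeys(0x00010002000300040005000600070008)
round1 = subkeys[0:6]   # Z1..Z6 for round 1
final = subkeys[48:52]  # four blocks for the final transformation
```

    Since the schedule involves only rotations and bit selection, it is cheap in hardware; in software the 128-bit rotation must be emulated across several registers on a 32-bit processor.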

    Figure 5: The computation graph for IDEA.

    The key schedule for decryption is based on the encryption key schedule.

    Table 10 shows the relationship between the decryption key blocks and the encryption

    key blocks, where $(Z_n^{(r)})^{-1}$ represents the multiplicative inverse modulo $(2^{16}+1)$ of $Z_n^{(r)}$,

    and $-Z_n^{(r)}$ represents the additive inverse modulo $2^{16}$ of $Z_n^{(r)}$.

    Round            Encrypt keys                                                          Decrypt keys
    1                $Z_1^{(1)}, Z_2^{(1)}, Z_3^{(1)}, Z_4^{(1)}, Z_5^{(1)}, Z_6^{(1)}$    $(Z_1^{(9)})^{-1}, -Z_2^{(9)}, -Z_3^{(9)}, (Z_4^{(9)})^{-1}, Z_5^{(8)}, Z_6^{(8)}$
    2                $Z_1^{(2)}, Z_2^{(2)}, Z_3^{(2)}, Z_4^{(2)}, Z_5^{(2)}, Z_6^{(2)}$    $(Z_1^{(8)})^{-1}, -Z_3^{(8)}, -Z_2^{(8)}, (Z_4^{(8)})^{-1}, Z_5^{(7)}, Z_6^{(7)}$
    3                $Z_1^{(3)}, Z_2^{(3)}, Z_3^{(3)}, Z_4^{(3)}, Z_5^{(3)}, Z_6^{(3)}$    $(Z_1^{(7)})^{-1}, -Z_3^{(7)}, -Z_2^{(7)}, (Z_4^{(7)})^{-1}, Z_5^{(6)}, Z_6^{(6)}$
    4                $Z_1^{(4)}, Z_2^{(4)}, Z_3^{(4)}, Z_4^{(4)}, Z_5^{(4)}, Z_6^{(4)}$    $(Z_1^{(6)})^{-1}, -Z_3^{(6)}, -Z_2^{(6)}, (Z_4^{(6)})^{-1}, Z_5^{(5)}, Z_6^{(5)}$
    5                $Z_1^{(5)}, Z_2^{(5)}, Z_3^{(5)}, Z_4^{(5)}, Z_5^{(5)}, Z_6^{(5)}$    $(Z_1^{(5)})^{-1}, -Z_3^{(5)}, -Z_2^{(5)}, (Z_4^{(5)})^{-1}, Z_5^{(4)}, Z_6^{(4)}$
    6                $Z_1^{(6)}, Z_2^{(6)}, Z_3^{(6)}, Z_4^{(6)}, Z_5^{(6)}, Z_6^{(6)}$    $(Z_1^{(4)})^{-1}, -Z_3^{(4)}, -Z_2^{(4)}, (Z_4^{(4)})^{-1}, Z_5^{(3)}, Z_6^{(3)}$
    7                $Z_1^{(7)}, Z_2^{(7)}, Z_3^{(7)}, Z_4^{(7)}, Z_5^{(7)}, Z_6^{(7)}$    $(Z_1^{(3)})^{-1}, -Z_3^{(3)}, -Z_2^{(3)}, (Z_4^{(3)})^{-1}, Z_5^{(2)}, Z_6^{(2)}$
    8                $Z_1^{(8)}, Z_2^{(8)}, Z_3^{(8)}, Z_4^{(8)}, Z_5^{(8)}, Z_6^{(8)}$    $(Z_1^{(2)})^{-1}, -Z_3^{(2)}, -Z_2^{(2)}, (Z_4^{(2)})^{-1}, Z_5^{(1)}, Z_6^{(1)}$
    Final transform  $Z_1^{(9)}, Z_2^{(9)}, Z_3^{(9)}, Z_4^{(9)}$                          $(Z_1^{(1)})^{-1}, -Z_2^{(1)}, -Z_3^{(1)}, (Z_4^{(1)})^{-1}$

    Table 10: IDEA key schedule

    4.2.4 Performance

    In terms of the core operations of IDEA, the bit-wise XOR and addition

    are easily implemented with one instruction each in software. For the reduction

    modulo $2^{16}$, a processor such as the LEON2 that only performs arithmetic on 32-bit

    register operands requires an additional logic instruction to mask out the bits

    that may overflow into the sixteen most significant bits of the destination register.

    The major performance bottleneck for a software implementation of the IDEA cipher

    is the multiplication modulo $(2^{16} + 1)$. The reason for this is that multiplication

    in general may take several clock cycles to complete on the processor running the

    algorithm (especially those without hardware multipliers), and the modular reduction,

    which is commonly implemented using the Low-High Lemma [66], requires additional

    execution time.

    Several software implementations of the IDEA algorithm take advantage

    of advanced processor architectures that employ instruction parallelism or functional

    units for multimedia support. A four-way parallel implementation on a 166 MHz Pentium

    MMX processor [69] achieved a throughput of approximately 72 Mbps. Throughput

    values ranging from 421 Mbps to 550 Mbps have been achieved on the Itanium

    platform running at 733 MHz [70]. The performance evaluations reported in [71]

    include a comparison of IDEA software implementations on processors with various

    word sizes, clock frequencies, and cache sizes. Execution times for IDEA encryption

    ranged from 2555 µs on the 8-bit 4 MHz Atmega 103 to 9 µs on the 64-bit 440 MHz

    UltraSparc2® with instruction and data cache sizes of 16 kbytes. The ability to

    perform fast multiplications was shown to be a major factor in the performance of

    the IDEA algorithm.

    Implementations of IDEA on reconfigurable computing platforms and sys-

    tems with co-processors have shown improved performance. An implementation on a

    SRC-6E platform [72] achieved throughputs of approximately 590 Mbps for end-to-

    end software time for bulk data processing. Comparisons have been made between

    the performance of IDEA on Digital Signal Processing (DSP) chips, cryptographic

    co-processors, and hardware implementations on FPGAs in a hardware-software co-

    design system that makes use of encryption in a mobile device. Reported perfor-

    mance figures ranged from 32 Mbps on the DEC SA-110 and 53.1 Mbps on the TI

    TMX320C6x DSP chips, to 180 Mbps using the VINCI cryptographic co-processor,

    to 528 Mbps with an FPGA-based implementation [73].

    A VLSI implementation of PES [74] achieved a throughput of 44 Mbps using

    1.5 µm technology. This implementation was limited in clock frequency to maintain

    compatibility with the Sun Microsystems SBus. The earliest VLSI implementations

    of IDEA [75], [76] achieved throughputs of 177 Mbps using 1.2 µm technology. More

    recent VLSI implementations [77] achieve a throughput of 355 Mbps using 0.8 µm

    technology. When using 0.7 µm technology, a throughput of 424 Mbps was achieved

    in a single chip solution [78]. However, the performance of these implementations

    was significantly reduced when operating in feedback modes.

    4.3 AES

    4.3.1 Mathematical Background

    Joan Daemen and Vincent Rijmen proposed the Rijndael algorithm to NIST

    as a candidate for the Advanced Encryption Standard [79]. One of the most significant

    features of the algorithm is the extensive use of finite field, or Galois Field, arithmetic.

    The particular field used in the AES algorithm is the Galois Field GF($2^8$). Values

    are represented by polynomials of the form

    $$a(x) = a_7x^7 + a_6x^6 + a_5x^5 + a_4x^4 + a_3x^3 + a_2x^2 + a_1x + a_0,$$

    or in bit vector notation, $\{a_7a_6a_5a_4a_3a_2a_1a_0\}$, where each $a_i$ is a

    coefficient in the Galois Field GF(2). Addition is done by computing the sum modulo 2

    of coefficients in the same bit positions; this can be accomplished by applying

    a bit-wise exclusive-OR on the coefficients. Multiplication works in much the same

    way as ordinary polynomial multiplication, but there is an additional step to make

    a modular reduction of the product by an irreducible polynomial so that the final

    product is in the Galois Field GF($2^8$). For the AES algorithm, this polynomial is

    $m(x) = x^8 + x^4 + x^3 + x + 1$.
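    A minimal sketch of this arithmetic is shown below, with field elements held as integers from 0 to 255 whose bits are the polynomial coefficients. The function names are illustrative; the shift-and-reduce loop is one common way to write the multiplication in software and is not claimed to be the method used later in this work. The constant 0x11B is the bit vector of m(x).

```python
def gf_add(a, b):
    """Addition in GF(2^8) is a bit-wise XOR of the coefficient vectors."""
    return a ^ b


def gf_mul(a, b):
    """Multiply two GF(2^8) elements, reducing modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a   # add the current multiple of a
        b >>= 1
        a <<= 1            # multiply a by x
        if a & 0x100:
            a ^= 0x11B     # subtract m(x) when the degree reaches 8
    return product


total = gf_add(0x57, 0x83)
product = gf_mul(0x57, 0x83)
```

    Each multiplication takes up to eight conditional XOR and shift steps in this form, which illustrates why constant multiplications in this field are usually replaced by look-up tables on general purpose processors.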

    4.3.2 Algorithm Description

    AES always operates on a block size of 128 bits, but key sizes of 128, 192,

    or 256 bits are allowed. The number of rounds used in the cipher is dependent on the

    key size: ten rounds for a 128-bit key, twelve rounds for a 192-bit key, and fourteen

    rounds for a 256-bit key. This research focuses on a 128-bit key implementation but

    is easily extended for use in implementations with larger key sizes.

    Encryption of one plaintext block in AES requires the sequence of operations

    shown in Figure 6. The word data type is a 32-bit value. In the AES algorithm

    specification [80], the plaintext is arranged into a 4 × 4 matrix of 8-bit values called

    the state, depicted in Figure 7.

    Encrypt(byte in[16], byte out[16], word k[44])
    begin
        byte state[4,4]
        state = in
        AddRoundKey(state, k[0, 3])
        for round = 1 step 1 to 9
            SubBytes(state)
            ShiftRows(state)
            MixColumns(state)
            AddRoundKey(state, k[round*4, (round+1)*4-1])
        end for
        SubBytes(state)
        ShiftRows(state)
        AddRoundKey(state, k[40, 43])
        out = state
    end

    Figure 6: The AES encryption process

    s0,0  s0,1  s0,2  s0,3
    s1,0  s1,1  s1,2  s1,3
    s2,0  s2,1  s2,2  s2,3
    s3,0  s3,1  s3,2  s3,3

    Figure 7: The state representation of data blocks in AES

    The four types of operations performed on the state are:

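    SubBytes: substitutes each byte in the state with a new value according to

    the following procedure:

    (1) Compute the multiplicative inverse in the Galois Field GF($2^8$), denoted as

    $a^{-1}$ (except for the value 0x00, which is mapped to itself);

    (2) Perform the following affine transformation over the Galois Field GF(2) on

    $a^{-1}$:

    $$
    \begin{bmatrix} b_7 \\ b_6 \\ b_5 \\ b_4 \\ b_3 \\ b_2 \\ b_1 \\ b_0 \end{bmatrix}
    =
    \begin{bmatrix}
    1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
    0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
    0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
    0 & 0 & 0 & 1 & 1 & 1 & 1 & 1 \\
    1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
    1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
    1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
    1 & 1 & 1 & 1 & 0 & 0 & 0 & 1
    \end{bmatrix}
    \begin{bmatrix} a^{-1}_7 \\ a^{-1}_6 \\ a^{-1}_5 \\ a^{-1}_4 \\ a^{-1}_3 \\ a^{-1}_2 \\ a^{-1}_1 \\ a^{-1}_0 \end{bmatrix}
    +
    \begin{bmatrix} 0 \\ 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}
    $$

    The result b is copied into the position of a in the state.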

    ShiftRows: performs cyclic left-shifts on each row in the state. The amount

    of bytes by which to shift depends on the row: zero for the top row, one for the

    second row, two for the third row, and three for the bottom row.

    MixColumns: each column of the state is treated as a vector of four polynomials

    in the Galois Field GF($2^8$) in this operation. Each of the four columns

    is multiplied by a 4 × 4 constant matrix with coefficients in the Galois Field

    GF($2^8$) reduced modulo $m(x) = x^8 + x^4 + x^3 + x + 1$. For each column c from

    0 to 3,

    $$
    \begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix}
    =
    \begin{bmatrix}
    02 & 03 & 01 & 01 \\
    01 & 02 & 03 & 01 \\
    01 & 01 & 02 & 03 \\
    03 & 01 & 01 & 02
    \end{bmatrix}
    \begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.
    $$

    AddRoundKey: like MixColumns, AddRoundKey operates on individual

    columns of the state. Each column $C_i$ is combined by a bit-wise exclusive-OR

    operation with a 32-bit word $k_{4r+i}$ from the current round key (the key schedule

    is explained in Section 4.3.3).
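    Two of the operations above, SubBytes and MixColumns, can be written directly from their mathematical definitions. The sketch below is a slow reference version intended only to make the definitions concrete; all function names are illustrative, the inverse is found by raising to the power 254 (a simple choice, since the nonzero elements form a group of order 255), and production code would use precomputed tables instead.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def gf_inv(a):
    """Multiplicative inverse as a^254; the value 0 maps to itself."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result


def sub_byte(a):
    """SubBytes for one byte: field inverse followed by the affine transformation."""
    x = gf_inv(a)
    b = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8)) \
            ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        b |= (bit & 1) << i
    return b


def mix_column(col):
    """MixColumns for one column given as a list of four bytes, top to bottom."""
    a0, a1, a2, a3 = col
    return [
        gf_mul(2, a0) ^ gf_mul(3, a1) ^ a2 ^ a3,
        a0 ^ gf_mul(2, a1) ^ gf_mul(3, a2) ^ a3,
        a0 ^ a1 ^ gf_mul(2, a2) ^ gf_mul(3, a3),
        gf_mul(3, a0) ^ a1 ^ a2 ^ gf_mul(2, a3),
    ]


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```

    Evaluating `sub_byte` for all 256 inputs yields the 256-byte look-up table that software implementations normally store, trading memory for the cost of the inversion.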

    Decryption of a ciphertext block incorporates the inverses of the operations

    used in the encryption process as shown in Figure 8. Note that AddRoundKey is

    its own inverse since it involves only the bitwise exclusive-OR operation.

    Decrypt(byte in[16], byte out[16], word k[44])
    begin
        byte state[4,4]
        state = in
        AddRoundKey(state, k[40, 43])
        for round = 9 step -1 downto 1
            InvShiftRows(state)
            InvSubBytes(state)
            AddRoundKey(state, k[round*4, (round+1)*4-1])
            InvMixColumns(state)
        end for
        InvShiftRows(state)
        InvSubBytes(state)
        AddRoundKey(state, k[0, 3])
        out = state
    end

    Figure 8: The AES decryption process

    InvSubBytes: reverses the transformation performed by SubBytes by first

    applying an affine transformation using the inverse of the 8 × 8 matrix used for

    SubBytes followed by calculation of the multiplicative inverse in the Galois

    Field GF($2^8$) modulo m(x).

    InvShiftRows: performs cyclic right-shifts on each row in the state in the

    same amounts as ShiftRows.

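    InvMixColumns: performs multiplication of the state by the inverse of the

    Galois Field constant matrix from MixColumns:

    $$
    \begin{bmatrix} B_{(0,c)} \\ B_{(1,c)} \\ B_{(2,c)} \\ B_{(3,c)} \end{bmatrix}
    =
    \begin{bmatrix}
    0e & 0b & 0d & 09 \\
    09 & 0e & 0b & 0d \\
    0d & 09 & 0e & 0b \\
    0b & 0d & 09 & 0e
    \end{bmatrix}
    \begin{bmatrix} A_{(0,c)} \\ A_{(1,c)} \\ A_{(2,c)} \\ A_{(3,c)} \end{bmatrix}.
    $$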
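    That the InvMixColumns matrix really is the inverse of the MixColumns matrix can be confirmed numerically. The sketch below multiplies a sample column by one matrix and then the other; the helper names and the sample column are assumptions of this example.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]

INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def mat_vec(matrix, col):
    """Multiply a 4 x 4 constant matrix by a 4-byte column over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, col):
            acc ^= gf_mul(coeff, byte)  # field addition is XOR
        out.append(acc)
    return out


column = [0xDB, 0x13, 0x53, 0x45]
restored = mat_vec(INV_MIX, mat_vec(MIX, column))
```

    The inverse matrix has larger coefficients (09, 0b, 0d, 0e instead of 01, 02, 03), so a straightforward software decryption performs more shift-and-reduce steps per byte than encryption.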

    4.3.3 Key Schedule

    For the 128-bit key size implementation of the AES algorithm, the master

    key is expanded into a linear array of forty-four 4-byte words (eleven round keys of

    four words each) using the process presented in Figure 9. There are two operations

    and an array of constants used specifically for the key schedule:

    KeyExpansion(byte key[16], word w[44])
    begin
        word temp
        i = 0
        while (i < 4)
            w[i] = word(key[4*i], key[4*i+1], key[4*i+2], key[4*i+3])
            i = i + 1
        end while
        i = 4
        while (i < 44)
            temp = w[i-1]
            if (i mod 4 = 0)
                temp = SubWord(RotWord(temp)) xor Rcon[i/4]
            end if
            w[i] = w[i-4] xor temp
            i = i + 1
        end while
    end

    Figure 9: The key expansion process for AES

    SubWord: applies a substitution to each of the four bytes in the input word

    using the same S-Box that is used in encryption.

    RotWord: performs a cyclic left rotation by one byte on the input word

    $a_0a_1a_2a_3$ to produce an output of $a_1a_2a_3a_0$.

    Rcon[ ]: the round constant array with a size of ten words. For i from 1 to 10,

    $$Rcon[i] = [\{02\}^{i-1}, \{00\}, \{00\}, \{00\}],$$

    where the powers $\{02\}^{i-1}$ are in the Galois Field GF($2^8$).
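    The key expansion of Figure 9 translates almost line for line into Python. The sketch below is a reference-style illustration under a few assumptions: the S-Box is generated from its mathematical definition instead of being stored as a literal table, words are held as big-endian 32-bit integers, and all function names are chosen for this example. The round constant is updated by multiplying by {02} in the field instead of indexing a stored Rcon array.

```python
def gf_mul(a, b):
    """Multiply in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    product = 0
    for _ in range(8):
        if b & 1:
            product ^= a
        b >>= 1
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
    return product


def sub_byte(a):
    """S-Box entry: field inverse (a^254) followed by the affine transformation."""
    x = 1
    for _ in range(254):
        x = gf_mul(x, a)
    b = 0
    for i in range(8):
        bit = (x >> i) ^ (x >> ((i + 4) % 8)) ^ (x >> ((i + 5) % 8)) \
            ^ (x >> ((i + 6) % 8)) ^ (x >> ((i + 7) % 8)) ^ (0x63 >> i)
        b |= (bit & 1) << i
    return b


SBOX = [sub_byte(a) for a in range(256)]


def sub_word(word):
    """Apply the S-Box to each of the four bytes of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in word.to_bytes(4, "big")), "big")


def key_expansion(key):
    """Expand a 16-byte key into forty-four 32-bit words."""
    w = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(4)]
    rcon = 0x01
    for i in range(4, 44):
        temp = w[i - 1]
        if i % 4 == 0:
            temp = ((temp << 8) | (temp >> 24)) & 0xFFFFFFFF  # RotWord
            temp = sub_word(temp) ^ (rcon << 24)              # SubWord, then Rcon
            rcon = gf_mul(rcon, 0x02)
        w.append(w[i - 4] ^ temp)
    return w


words = key_expansion(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

    Only every fourth word requires the S-Box and rotation; the other three are plain XORs, so the S-Box access is again the step that dominates the cost of the schedule in software.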

    4.3.4 Performance

    Rijndael software performance bottlenecks typically occur in the SubBytes

    and MixColumns transformations, one or both of which are usually implemented

    via 8-bit to 8-bit look-up tables. Often most of the Rijndael round transformations

    (SubBytes, ShiftRows, and MixColumns) are combined into large look-up tables

    termed T-tables. Such implementations require up to three T-tables whose size

    may be either 1 kbyte or 4 kbytes, where the smaller tables require performing an

    additional rotation operation. The goal of the T-tables is to avoid performing the

    MixColumns and InvMixColumns transformations as these operations perform Galois

    Field fixed field constant multiplication, an operation which maps poorly to general

    purpose processors. However, the use of T-tables has significant disadvantages. The

    T-tables significantly increase code size, their performance is dependent on the mem-

    ory system architecture as well as cache size, and their use causes key expansion for

    Rijndael decryption to become significantly more complex. As an alternative to the

    T-tables implementation method, it is also feasible to have the processor perform all of

    the Rijndael round transformations. Row-based implementations have been demonstrated

    to allow for greater efficiency in the implementation of the MixColumns and

    InvMixColumns transformations versus column-based implementations. However, the

    SubBytes transformation still remains as a bottleneck, requiring separate 256-byte

    look-up tables for encryption and decryption [20], [29], [32], [81], [82], [83], [84].
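    To make concrete why Galois Field fixed field constant multiplication maps poorly to a general purpose processor, the following sketch shows how it is usually built in software from shifts, conditional reductions, and exclusive-ORs. The names xtime and gf256_mul are illustrative choices for this example; the reduction polynomial x^8 + x^4 + x^3 + x + 1 is the one defined for Rijndael.

```c
#include <stdint.h>

/* Multiply by {02} in GF(2^8): shift left and, if the top bit was set,
   reduce by XORing with {1b} (the low byte of the Rijndael polynomial). */
uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* General multiplication by shift-and-add: for every set bit of b,
   accumulate the matching multiple of a. A fixed constant such as {03}
   reduces to xtime(a) ^ a. */
uint8_t gf256_mul(uint8_t a, uint8_t b)
{
    uint8_t product = 0;
    while (b) {
        if (b & 1)
            product ^= a;
        a = xtime(a);
        b >>= 1;
    }
    return product;
}
```

    Each MixColumns output byte needs several such multiplications plus the exclusive-ORs that combine them, which is the per-byte work that T-tables trade for memory accesses. The values {57} x {02} = {ae} and {57} x {13} = {fe} from the AES standard can be used to check the sketch.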

    Numerous co-processors have been developed to accelerate cryptographic

    algorithm implementations. The CryptoManiac VLIW co-processor was developed

    as a result of instruction set extensions designed to accelerate the performance of

    a number of the AES candidate algorithms. CryptoManiac features the execution

    of up to four instructions per cycle and the use of instructions with up to three

    operands to allow for the combination of short latency instructions for single cycle

    execution. Similarly, the Cryptonite co-processor is also VLIW based, with two 64-

    bit datapaths and special instructions combined with dedicated memories to support

    Rijndael implementations. Both co-processors improve the performance of Rijndael

    implementations versus implementations targeting general purpose processors. Other

    implementations couple FPGA co-processors with a LEON-2 processor core. The

    co-processors connect to the processor core via either a dedicated interface or as a

    memory-mapped peripheral and were able to significantly improve the performance

    of Rijndael implementations [23], [33], [35], [85], [86].

    Multiple implementations of Rijndael have been presented targeting a wide

    range of hardware technologies. These implementations use specific Galois Field fixed

    field constant multipliers resulting in either logic equations or look-up tables being

    generated to perform the multiplication. Implementations based on logic equations

    are optimized for area and require a moderate number of logic levels. Implementations

    based on look-up tables are optimized for speed at the cost of additional logic resources

    though the performance of these implementations, like the software implementations

    employing T-tables, is highly dependent on the memory system and cache organization

    and size. In the case of the Galois Field fixed field constant multipliers used in

    the MixColumns transformation, the 8-bit to 8-bit look-up tables may be replaced by

    8-bit × 8-bit mapping matrices, reducing the associated memory requirements

    by a factor of nearly 20 [87], [88]. Look-up tables may also be replaced with logic

    equation implementations for the SubBytes and MixColumns transformations, sig-

    nificantly reducing the hardware resource requirements. To illustrate the significant

    reduction in logic resource requirements, in the case of the SubBytes transformation,

    a reduction in gate count by as much as a factor of 4.66 has been realized using logic

    equations in place of a look-up table. When performing sixteen SubBytes transfor-

    mations in parallel in a single round of Rijndael (assuming a 128-bit implementation),

    this equates to a savings of over 38,000 gate equivalences. For a pipelined implemen-

    tation of 128-bit AES, this savings increases to over 380,000 gate equivalences [29].

    Encryption, decryption, and Key Scheduling are all easily pipelined in non-feedback

    modes of operation while single-round implementations are typically used when op-

    erating in feedback modes. Depending on the implementation methodology, Rijndael

    throughputs as high as 70 Gbps when operating in non-feedback modes and 2.29

    Gbps when operating in feedback modes have been reported [89], [90], [91], [92], [93],

    [94], [95], [96], [97], [98], [99], [100], [101], [102], [103], [104], [105], [106], [107].
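    The mapping-matrix idea can be illustrated in software. Multiplication by a fixed constant in GF(2^8) is linear over GF(2), so it can be written as an 8 × 8 bit matrix whose column j is the constant multiplied by x^j. The sketch below is only a model of that principle; the type gf_const_matrix and the function names are hypothetical, and the storage figures of the cited hardware designs are not modelled here.

```c
#include <stdint.h>

/* One byte per matrix column; bit k of col[j] is the matrix entry (k, j). */
typedef struct {
    uint8_t col[8];
} gf_const_matrix;

/* Multiply by {02} modulo the Rijndael polynomial. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* Build the mapping matrix for multiplication by the constant c:
   column j holds c * x^j. */
gf_const_matrix gf_matrix_for(uint8_t c)
{
    gf_const_matrix m;
    for (int j = 0; j < 8; j++) {
        m.col[j] = c;
        c = xtime(c);
    }
    return m;
}

/* Matrix-vector product over GF(2): XOR together the columns selected
   by the set bits of the input byte. */
uint8_t gf_matrix_apply(const gf_const_matrix *m, uint8_t in)
{
    uint8_t out = 0;
    for (int j = 0; j < 8; j++) {
        if (in & (1u << j))
            out ^= m->col[j];
    }
    return out;
}
```

    In hardware each output bit of such a matrix is a fixed exclusive-OR of input bits, so no memory access is required to evaluate it.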

    Instruction set extensions are an interesting implementation option that

    bridges the gap between hardware and software. Significantly improved performance

    of software implementations has been demonstrated as a result of adding functionality

    to a processor's datapath and corresponding control logic to decode new instructions.

    Instruction set extensions designed to accelerate the performance of software

    implementations of Rijndael have been proposed for a wide range of processors.

    These extensions minimize the number of memory accesses, usually by combining

    the SubBytes and MixColumns transformations into one T-table look-up operation

    to speed up algorithm execution. While T-table performance is heavily dependent

    upon available cache size, these extensions have been shown to result in performance

    improvements of up to a factor of 3.68 versus Rijndael implementations without the

    use of the instruction set extensions [20], [42], [85], [86], [88], [108], [109].

    4.4 Modes of Operation

    All of the symmetric-key algorithms targeted in this research support many

    different modes of operation, that is, methods that specify the information entered into the

    algorithm. Two modes of operation are of particular interest here:

    Electronic Code Book (ECB): Each block of plaintext x_i is input directly into

    the encryption function to form the ciphertext y_i. Encrypting a specific value

    for x_i always produces the same value for y_i and the result is not affected by

    previous plaintext blocks.

    Cipher Block Chaining (CBC): The first plaintext block is combined with an

    initialization vector (IV) using the bitwise exclusive-OR operation, and the

    result is encrypted to form the first ciphertext block. Subsequent blocks of

    plaintext are exclusive-ORed with the last ciphertext block computed. In this

    mode, every block of ciphertext depends on the preceding runs of the algorithm.

    ECB mode may be employed to create pipelined implementations of block

    ciphers, leading to very high throughput. However, a disadvantage of ECB mode is

    that identical plaintext blocks encrypted with the same key always produce the same

    ciphertext. Also, because data blocks encrypted in ECB mode have no chaining

    or feedback mechanisms, the encrypted data is vulnerable to a substitution attack.

    In this type of attack, encrypted blocks in the original data stream are swapped

    by the attacker with other data blocks that are encrypted with the same key. CBC

    mode is not vulnerable to such attacks, but because the current block to be encrypted

    depends directly on the previous ciphertext, CBC mode is not well suited to pipelined

    implementations.
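    The difference between the two modes can be sketched with a stand-in block cipher. In the example below, toy_encrypt and toy_decrypt are a deliberately trivial, insecure 32-bit placeholder invented for this illustration (they are not DES, IDEA, or AES); only the ECB and CBC wrappers around them matter. The loop in cbc_encrypt shows the data dependency on the previous ciphertext block that prevents pipelining.

```c
#include <stddef.h>
#include <stdint.h>

static uint32_t rotl32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }
static uint32_t rotr32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }

/* Placeholder 32-bit "block cipher" (NOT secure, illustration only). */
static uint32_t toy_encrypt(uint32_t x, uint32_t k)
{
    return rotl32(x ^ k, 7) + 0x9E3779B9u;
}

static uint32_t toy_decrypt(uint32_t y, uint32_t k)
{
    return rotr32(y - 0x9E3779B9u, 7) ^ k;
}

/* ECB: every block is processed independently, so blocks could be
   pipelined or processed in any order. */
void ecb_encrypt(const uint32_t *x, uint32_t *y, size_t n, uint32_t k)
{
    for (size_t i = 0; i < n; i++)
        y[i] = toy_encrypt(x[i], k);
}

/* CBC: each plaintext block is XORed with the previous ciphertext block
   (the IV for the first block) before encryption. */
void cbc_encrypt(const uint32_t *x, uint32_t *y, size_t n, uint32_t k, uint32_t iv)
{
    uint32_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        y[i] = toy_encrypt(x[i] ^ prev, k);
        prev = y[i];
    }
}

/* CBC decryption undoes the chaining with the previous ciphertext block. */
void cbc_decrypt(const uint32_t *y, uint32_t *x, size_t n, uint32_t k, uint32_t iv)
{
    uint32_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        x[i] = toy_decrypt(y[i], k) ^ prev;
        prev = y[i];
    }
}
```

    With two identical plaintext blocks, ecb_encrypt produces two identical ciphertext blocks, whereas cbc_encrypt produces different ones unless the IV happens to equal the first ciphertext block.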

    5 PROPOSED INSTRUCTION SET EXTENSIONS

    This chapter specifies the new instructions that are intended to enhance the

    performance of the algorithms described in Chapter 4. All of the custom instructions

    are intended to comply with the SPARC® V8 instruction model [47]. In particular,

    the instructions have the Format 3 structure described in Section 3.3.

    The sub-sections below indicate the syntax and encoding, as well as a brief

    description, of each instruction. All instructions that write to a register execute in

    one clock cycle, except for the mmul16 instruction which takes two clock cycles.

    For those instructions that store data directly into registers contained in the custom

    hardware, the data is available at the start of the next cycle, after instruction execu-

    tion has completed.

    5.1 DES and Triple-DES

    5.1.1 Initial and Final Permutations

    Instruction Syntax:

    desipl rs1,rs2,rd

    desipr rs1,rs2,rd

    desfpl rs1,rs2,rd

    desfpr rs1,rs2,rd

    Instruction Encoding:

    op   rd   op3      rs1   i   asi        rs2
    10   rd   001101   rs1   0   XXXXXXnn   rs2

    The desipl and desipr instructions produce the left and right halves of the

    DES IP, respectively. Similarly, the desfpl and desfpr instructions produce the left

    and right halves of the DES FP, respectively. The left half of the input block must

    be located in the rs1 register and the right half must be located in the rs2 register.

    The specific instruction to be executed is determined by the value of bits

    [1:0] of the asi field as given in Table 11 below. Bits [7:2] of the asi field are ignored

    by all of the DES permutation instructions.

    asi[1:0]   Instruction
    00         desipl
    01         desipr
    10         desfpl
    11         desfpr

    Table 11: Interpretation of the asi field for the DES permutation instructions

    Inclusion of these instructions allows for the IP and FP for DES and Triple-

    DES to be completed in two instructions each. Traditional implementations of the

    IP and FP in software require a series of bit mask setup, shift, logical AND, and

    logical OR operations for each bit for a total of 256 instructions [41]. The improved

    permutation algorithm used in the reference code [110] requires 44 instructions to

    complete on a SPARC® V8 processor such as the LEON2, which is still significantly

    larger than the instruction count required to perform the permutations as proposed

    in this research.
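    A bit-level software model of the four permutation instructions is sketched below to make their operand convention concrete. The table is the standard DES initial permutation; the final permutation is applied as its inverse. The C function names mirror the instruction mnemonics for readability, but the bodies are an assumption about behaviour based on the description above, not the hardware design of this research.

```c
#include <stdint.h>

/* Standard DES initial permutation: output bit i (1 = most significant)
   is taken from input bit IP[i-1]. */
static const uint8_t IP[64] = {
    58, 50, 42, 34, 26, 18, 10, 2, 60, 52, 44, 36, 28, 20, 12, 4,
    62, 54, 46, 38, 30, 22, 14, 6, 64, 56, 48, 40, 32, 24, 16, 8,
    57, 49, 41, 33, 25, 17,  9, 1, 59, 51, 43, 35, 27, 19, 11, 3,
    61, 53, 45, 37, 29, 21, 13, 5, 63, 55, 47, 39, 31, 23, 15, 7
};

static uint64_t des_ip(uint64_t in)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++)
        out |= ((in >> (64 - IP[i])) & 1u) << (63 - i);
    return out;
}

/* The final permutation is the inverse of the initial permutation. */
static uint64_t des_fp(uint64_t in)
{
    uint64_t out = 0;
    for (int i = 0; i < 64; i++)
        out |= ((in >> (63 - i)) & 1u) << (64 - IP[i]);
    return out;
}

/* rs1 holds the left half of the input block, rs2 the right half. */
static uint64_t join(uint32_t rs1, uint32_t rs2)
{
    return ((uint64_t)rs1 << 32) | rs2;
}

uint32_t desipl(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_ip(join(rs1, rs2)) >> 32); }
uint32_t desipr(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_ip(join(rs1, rs2)); }
uint32_t desfpl(uint32_t rs1, uint32_t rs2) { return (uint32_t)(des_fp(join(rs1, rs2)) >> 32); }
uint32_t desfpr(uint32_t rs1, uint32_t rs2) { return (uint32_t)des_fp(join(rs1, rs2)); }
```

    The bit-by-bit loops in des_ip and des_fp correspond to the per-bit mask, shift, AND, and OR work of a traditional software implementation; in the proposed hardware each permutation is fixed wiring, which is why a single instruction per output half suffices.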

    5.1.2 Set Encryption Direction

    Instruction Syntax:

    desdir imm

    Instruction Encoding:

    op   rd      op3      rs1     i   simm13
    10   00000   001001   00000   1   (dir)

    Set up the DES key generator to output round keys in either encryption or

    decryption order. The imm operand is set to zero for encryption, one for decryption.

    This instruction also resets the round counter of the key generator according to the

    chosen direction to ensure that output of the round keys may be immediately carried

    out in the proper order. It is not necessary to re-load the master key after this in-

    struction is executed. The desdir instruction is used in conjunction with the deskey

    and desf instructions as explained in Sections 5.1.3 and 5.1.4.
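    One way a key generator can deliver round keys in either order without storing them is to rotate the two 28-bit key halves left for encryption and right for decryption. The sketch below models only one half (the C register) and omits the PC-1 and PC-2 permutations. It is an assumption about how such a generator could behave, not a description of the circuit in Section 6.1.3, and the names key_gen, keygen_load, keygen_set_dir, and keygen_next are hypothetical.

```c
#include <stdint.h>

#define MASK28 0x0FFFFFFFu

/* Left-rotation amounts of the standard DES key schedule, rounds 1..16. */
static const uint8_t SHIFTS[16] = {1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 1};

static uint32_t rotl28(uint32_t v, unsigned n)
{
    n %= 28;
    return ((v << n) | (v >> (28 - n))) & MASK28;
}

static uint32_t rotr28(uint32_t v, unsigned n)
{
    return rotl28(v, 28 - (n % 28));
}

typedef struct {
    uint32_t c0;    /* C half as loaded (models deskey) */
    uint32_t c;     /* current C half */
    int dir;        /* 0 = encryption order, 1 = decryption order */
    unsigned round; /* next round to be produced, 1..16 */
} key_gen;

/* Models the effect of deskey on one key half. */
void keygen_load(key_gen *g, uint32_t c_half)
{
    g->c0 = g->c = c_half & MASK28;
    g->dir = 0;
    g->round = 1;
}

/* Models desdir: choose the direction and reset the round counter.
   The loaded key is kept, so no reload is needed. */
void keygen_set_dir(key_gen *g, int dir)
{
    g->c = g->c0;
    g->dir = dir;
    g->round = 1;
}

/* Advance one round and return the C half that would feed PC-2. */
uint32_t keygen_next(key_gen *g)
{
    unsigned r = g->round++;
    if (g->dir == 0)
        g->c = rotl28(g->c, SHIFTS[r - 1]);
    else if (r > 1)
        g->c = rotr28(g->c, SHIFTS[17 - r]);
    return g->c;
}
```

    Because the sixteen left rotations add up to 28 bit positions, the register returns to its loaded value after the last encryption round. In decryption order the first round key is therefore available with no rotation at all, and each later one is a right rotation away.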

    5.1.3 Key Loading

    Instruction Syntax:

    deskey rs1,rs2

    Instruction Encoding:

    op   rd      op3      rs1   i   asi      rs2
    10   00000   001001   rs1   0   unused   rs2

    The deskey instruction loads the 64-bit master key for DES. The left half

    of the master key must be contained in the rs1 register, and the right half in the rs2

    register.

    5.1.4 Round Core (f) Function

    Instruction Syntax:

    desf rs1,rd

    Instruction Encoding:

    op   rd      op3      rs1   i   simm13
    10   00000   001001   rs1   1   0x1XXX

    This instruction takes the right half of a round output block stored in the

    rs1 register and stores the output of the core (f) function into the rd register. The

    round key is not specified here since the round key output of the DES key generator

    is hard-wired to the f-function circuit's round key input. After completion of this

    instruction, the key generator is signaled to generate the key for the next round. Due

    to the logic of the DES key generator (see Section 6.1.3), the desf instruction may

    not be followed by another desf instruction. However, this is not expected to cause

    a performance bottleneck due to the additional instruction required for swapping the

    values of the left and right halves of the round input block.

    Implementation of the desdir, deskey, and desf instructions removes the

    need for storage of the sixteen round keys and S-Boxes in memory. All round keys

    are generated on-the-fly in the custom hardware. An implementation of the DES al-

    gorithm using these instructions requires two instructions for key scheduling and four

    instructions for each of the sixteen rounds (one desf, one exclusive-OR, and two regis-

    ter data transfers for swapping the left and right halves of the round function output).
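    The four-instruction round described above is an ordinary Feistel step, and the same loop serves for decryption once the round keys arrive in reverse order. The C sketch below models that data flow with a placeholder round function: toy_f is invented for this illustration and stands in for desf, whose real expansion, S-Box, and permutation logic is not reproduced, and the array k stands in for the keys that the hardware generates on the fly.

```c
#include <stdint.h>

/* Placeholder for the DES core function; any function of (r, key)
   yields an invertible Feistel network. */
static uint32_t toy_f(uint32_t r, uint32_t key)
{
    return (r ^ key) * 2654435761u + (r >> 5);
}

/* Sixteen Feistel rounds. decrypt = 0 consumes k[0..15] in order,
   decrypt = 1 consumes them in reverse, which is the role desdir plays. */
void feistel16(uint32_t *l, uint32_t *r, const uint32_t k[16], int decrypt)
{
    for (int i = 0; i < 16; i++) {
        uint32_t key = decrypt ? k[15 - i] : k[i];
        uint32_t temp = *r;        /* mov  r, temp             */
        *r = *l ^ toy_f(*r, key);  /* desf r, r ; xor l, r, r  */
        *l = temp;                 /* mov  temp, l             */
    }
    /* Final swap of the halves; in Figures 10 to 13 this is expressed by
       passing r before l to the final permutation instructions. */
    uint32_t t = *l;
    *l = *r;
    *r = t;
}
```

    Running feistel16 with decrypt = 1 on the output of a run with decrypt = 0 restores the original halves, which is why the encryption and decryption routines differ only in the desdir operand.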

    5.1.5 New DES and Triple-DES Algorithm Implementations

    The following figures show instruction sequences that may be used to implement

    the DES and Triple-DES algorithms for encryption and decryption. All operand

    names are symbolic and do not represent the names of physical registers of the LEON2

    processor.

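        desipl  %[ptextl], %[ptextr], %[l]    ! initial permutation, left half
        desipr  %[ptextl], %[ptextr], %[r]    ! initial permutation, right half
        deskey  %[keyl], %[keyr]              ! load the 64-bit master key
        desdir  0                             ! round keys in encryption order
        mov     1, %[i]
    round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     round
        add     %[i], 1, %[i]                 ! executes in the branch delay slot
        desfpl  %[r], %[l], %[ctextl]         ! final permutation, halves swapped
        desfpr  %[r], %[l], %[ctextr]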

    Figure 10: DES encryption routine with custom instructions

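        desipl  %[ctextl], %[ctextr], %[l]    ! initial permutation, left half
        desipr  %[ctextl], %[ctextr], %[r]    ! initial permutation, right half
        deskey  %[keyl], %[keyr]              ! load the 64-bit master key
        desdir  1                             ! round keys in decryption order
        mov     1, %[i]
    round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     round
        add     %[i], 1, %[i]                 ! executes in the branch delay slot
        desfpl  %[r], %[l], %[ptextl]         ! final permutation, halves swapped
        desfpr  %[r], %[l], %[ptextr]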

    Figure 11: DES decryption routine with custom instructions

        desipl  %[ptextl], %[ptextr], %[l]
        desipr  %[ptextl], %[ptextr], %[r]
        deskey  %[key1l], %[key1r]            ! stage 1: encrypt with key 1
        desdir  0
        mov     1, %[i]
    d1round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d1round
        add     %[i], 1, %[i]                 ! executes in the branch delay slot
        mov     %[r], %[temp]                 ! swap the halves between stages
        mov     %[l], %[r]
        mov     %[temp], %[l]
        deskey  %[key2l], %[key2r]            ! stage 2: decrypt with key 2
        desdir  1
        mov     1, %[i]
    d2round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d2round
        add     %[i], 1, %[i]
        mov     %[r], %[temp]                 ! swap the halves between stages
        mov     %[l], %[r]
        mov     %[temp], %[l]
        deskey  %[key3l], %[key3r]            ! stage 3: encrypt with key 3
        desdir  0
        mov     1, %[i]
    d3round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d3round
        add     %[i], 1, %[i]
        desfpl  %[r], %[l], %[ctextl]
        desfpr  %[r], %[l], %[ctextr]

    Figure 12: Triple-DES encryption routine with custom instructions

        desipl  %[ctextl], %[ctextr], %[l]
        desipr  %[ctextl], %[ctextr], %[r]
        deskey  %[key3l], %[key3r]            ! stage 1: decrypt with key 3
        desdir  1
        mov     1, %[i]
    d1round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d1round
        add     %[i], 1, %[i]                 ! executes in the branch delay slot
        mov     %[r], %[temp]                 ! swap the halves between stages
        mov     %[l], %[r]
        mov     %[temp], %[l]
        deskey  %[key2l], %[key2r]            ! stage 2: encrypt with key 2
        desdir  0
        mov     1, %[i]
    d2round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d2round
        add     %[i], 1, %[i]
        mov     %[r], %[temp]                 ! swap the halves between stages
        mov     %[l], %[r]
        mov     %[temp], %[l]
        deskey  %[key1l], %[key1r]            ! stage 3: decrypt with key 1
        desdir  1
        mov     1, %[i]
    d3round:
        mov     %[r], %[temp]
        desf    %[r], %[r]
        xor     %[l], %[r], %[r]
        mov     %[temp], %[l]
        cmp     %[i], 16
        blu     d3round
        add     %[i], 1, %[i]
        desfpl  %[r], %[l], %[ptextl]         ! result is the recovered plaintext
        desfpr  %[r], %[l], %[ptextr]

    Figure 13: Triple-DES decryption routine with custom instructions

    5.2 IDEA

    5.2.1 Multiplication modulo 2^16 + 1

    Instruction Syntax:

    mmul16 rs1, rs2, rd

    Instruction Encoding:

    op   rd   op3      rs1   i   asi      rs2
    10   rd   101101   rs1   0   unused   rs2

    This instruction calculates rs1 × rs2 mod (2^16 + 1) and stores the product

    in the rd register. Both sources must be in the lower sixteen bits of their respective

    registers. The 16-bit product is stored in the lower sixteen bits of the rd register.
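    A software model of the operation is sketched below. It assumes the standard IDEA convention in which an all-zero 16-bit operand represents 2^16, so that every operand is a nonzero residue modulo the prime 2^16 + 1 and a result of 2^16 is written back as zero. The C function reuses the mnemonic mmul16 for readability but is only a behavioural model.

```c
#include <stdint.h>

/* Multiplication modulo 2^16 + 1 on 16-bit operands, with the value 0
   standing for 2^16 on input and on output (IDEA convention). */
uint16_t mmul16(uint16_t rs1, uint16_t rs2)
{
    uint32_t a = rs1 ? rs1 : 0x10000u;
    uint32_t b = rs2 ? rs2 : 0x10000u;

    /* The product of two values up to 2^16 needs 33 bits, so a 64-bit
       intermediate is used before the reduction. */
    uint32_t p = (uint32_t)(((uint64_t)a * b) % 0x10001u);

    /* p lies in 1..2^16; truncation to 16 bits maps 2^16 back to 0. */
    return (uint16_t)p;
}
```

    A general purpose processor needs a multiply, a reduction, and special handling of the zero operand for this operation, which is what a dedicated instruction collapses into a single step.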

    5.2.2 New IDEA Algorithm Implementation

    t0_1 = in[0]; t0_2 = in[1]; t0_3 = in[2]; t0_4 = in[3];

    for(i=0;i

    5.3 AES

    5.3.1 SubBytes Operations

    Instruction Syntax:

    aessb rs1, imm, rd

    aessb4 rs1, imm, rd

    Instruction Encoding:

    op   rd   op3      rs1   i   simm13
    10   rd   101100   rs1   1   see below

    These instructions perform the AES SubBytes and InvSubBytes operations

    on either one (aessb) or all four (aessb4) of the bytes in the rs1 register. Only one

    of these instructions may be implemented in the hardware, not both.

    The value specified in the simm13 field determines the actual operation

    performed. The least significant bit is set to zero for SubBytes, or one for InvSubBytes.

    The value composed of bits 5 and 4 indicates the byte to be substituted for

    the aessb instruction as shown in Table 12. These bits are not used by the aessb4

    instruction.

    simm13[5:4]   Substituted byte
    00            rs1[31:24]
    01            rs1[23:16]
    10            rs1[15:8]
    11            rs1[7:0]

    Table 12: Usage of the simm13 field by the aessb and aessb4 instructions

    Bits [12:6] and [3:1] of simm13 are ignored by both the aessb and aessb4

    instructions.
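    The decoding of the simm13 field can be modelled in C as shown below. The sketch makes two assumptions that are not stated above: the bytes not selected by aessb are passed through unchanged (inferred from the way Figure 15 reuses a register across four aessb calls), and the S-Boxes are computed from the Rijndael definition instead of being taken from the hardware design. The C functions reuse the mnemonics aessb and aessb4 for readability.

```c
#include <stdint.h>

static uint8_t SBOX[256], INV_SBOX[256];
static int tables_ready = 0;

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* Build both S-Boxes: inverse in GF(2^8), then the affine map with {63}. */
static void build_tables(void)
{
    for (unsigned x = 0; x < 256; x++) {
        uint8_t inv = 0;
        for (unsigned c = 1; c < 256 && x != 0; c++) {
            if (gf_mul((uint8_t)x, (uint8_t)c) == 1) {
                inv = (uint8_t)c;
                break;
            }
        }
        uint8_t s = inv;
        for (int k = 1; k <= 4; k++)
            s ^= (uint8_t)((inv << k) | (inv >> (8 - k)));
        s ^= 0x63;
        SBOX[x] = s;
        INV_SBOX[s] = (uint8_t)x;
    }
    tables_ready = 1;
}

static uint8_t subst(uint8_t b, unsigned inverse)
{
    if (!tables_ready)
        build_tables();
    return inverse ? INV_SBOX[b] : SBOX[b];
}

/* aessb: simm13[0] selects SubBytes (0) or InvSubBytes (1);
   simm13[5:4] selects the byte, 00 -> rs1[31:24] ... 11 -> rs1[7:0]. */
uint32_t aessb(uint32_t rs1, unsigned simm13)
{
    unsigned shift = 24 - 8 * ((simm13 >> 4) & 3u);
    uint32_t s = subst((uint8_t)(rs1 >> shift), simm13 & 1u);
    return (rs1 & ~((uint32_t)0xFF << shift)) | (s << shift);
}

/* aessb4: substitute all four bytes; simm13[5:4] is ignored. */
uint32_t aessb4(uint32_t rs1, unsigned simm13)
{
    uint32_t out = 0;
    for (unsigned shift = 0; shift < 32; shift += 8)
        out |= (uint32_t)subst((uint8_t)(rs1 >> shift), simm13 & 1u) << shift;
    return out;
}
```

    For example, the immediate 0x10 selects SubBytes on rs1[23:16], and 0x31 selects InvSubBytes on rs1[7:0], matching the immediates used in Figures 15 and 16.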

    5.3.2 GF(2^m) Matrix Multiplier Constant Loading

    Instruction Syntax:

    gfmkld rs1,rs2

    Instruction Encoding:

    op rd op3 rs1 i asi rs210 00000 011001 rs1 0 unused rs2

    The gfmkld instruction is used to load one of the sixteen constants into

    the constant matrix of the Galois Field fixed field constant matrix multiplier. The

    constant matrix has the following structure:

    K00 K01 K02 K03

    K10 K11 K12 K13

    K20 K21 K22 K23

    K30 K31 K32 K33

    The first constants to be loaded are those in the first row, from K00 to K03.

    The loading process continues for each row in descending order, from left to right,

    until the last constant K33 has been loaded. Due to the logic that has been added to

    the multiplier for inclusion into the LEON2 processor datapath (see Section 6.1.6),

    instances of the gfmkld instruction may not be issued consecutively.

    5.3.3 GF(2^m) Matrix Multiplication

    Instruction Syntax:

    gfmmul rs1, rd

    Instruction Encoding:

    op   rd   op3      rs1   i   asi      rs2
    10   rd   011101   rs1   0   unused   00000

    Perform the Galois Field fixed field constant matrix multiplication on the

    input in the rs1 register and store the result in the rd register.
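    The interaction between constant loading and matrix multiplication can be modelled as follows. The sketch fixes the field to GF(2^8) with the Rijndael polynomial and assumes that the most significant byte of rs1 pairs with the first matrix column and that the most significant result byte comes from the first row. The type gf_matrix_unit, the explicit unit pointer, and the one-constant-per-call interface of the gfmkld model are hypothetical, since the register-level operand format of gfmkld is not described here.

```c
#include <stdint.h>

typedef struct {
    uint8_t k[4][4]; /* constant matrix K00..K33 */
    unsigned next;   /* position of the next constant to load, 0..15 */
} gf_matrix_unit;

static uint8_t gf_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    while (b) {
        if (b & 1)
            p ^= a;
        a = (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
        b >>= 1;
    }
    return p;
}

/* Load one constant; successive calls fill K00..K03, then K10..K13,
   and so on until K33, after which loading wraps around. */
void gfmkld(gf_matrix_unit *u, uint8_t constant)
{
    u->k[u->next / 4][u->next % 4] = constant;
    u->next = (u->next + 1) % 16;
}

/* Multiply the constant matrix by the four bytes of rs1, treated as a
   column vector with rs1[31:24] as the first element. */
uint32_t gfmmul(const gf_matrix_unit *u, uint32_t rs1)
{
    const uint8_t in[4] = {(uint8_t)(rs1 >> 24), (uint8_t)(rs1 >> 16),
                           (uint8_t)(rs1 >> 8), (uint8_t)rs1};
    uint32_t out = 0;
    for (int r = 0; r < 4; r++) {
        uint8_t acc = 0;
        for (int c = 0; c < 4; c++)
            acc ^= gf_mul(u->k[r][c], in[c]);
        out = (out << 8) | acc;
    }
    return out;
}
```

    Loading the MixColumns constants (02 03 01 01, 01 02 03 01, 01 01 02 03, 03 01 01 02) row by row and applying gfmmul to the column db 13 53 45 gives 8e 4d a1 bc, a commonly used MixColumns test vector. Loading the InvMixColumns constants instead would let the same unit serve decryption.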

    5.3.4 New AES Algorithm Implementations

    // First add round key

    state[0] = plaintext[0] ^ key_schedule[0][0];

    state[1] = plaintext[1] ^ key_schedule[0][1];

    state[2] = plaintext[2] ^ key_schedule[0][2];

    state[3] = plaintext[3] ^ key_schedule[0][3];

    // The nine rounds

    for (i = 1; i < Nr; i++)

    {

    // SubBytes + ShiftRows

    asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x10, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x20, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x30, %[s3]\n\t" : [s3] "+r" (state[3]) );

    tmp[0] = state[0] & 0xFF000000 ;

    tmp[0] |= state[1] & 0x00FF0000 ;

    tmp[0] |= state[2] & 0x0000FF00 ;

    tmp[0] |= state[3] & 0x000000FF ;

    asm( "aessb %[s1], 0x00, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x10, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x20, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x30, %[s0]\n\t" : [s0] "+r" (state[0]) );

    tmp[1] = state[1] & 0xFF000000 ;

    tmp[1] |= state[2] & 0x00FF0000 ;

    tmp[1] |= state[3] & 0x0000FF00 ;

    tmp[1] |= state[0] & 0x000000FF ;

    asm( "aessb %[s2], 0x00, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x10, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x20, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x30, %[s1]\n\t" : [s1] "+r" (state[1]) );

    tmp[2] = state[2] & 0xFF000000 ;

    tmp[2] |= state[3] & 0x00FF0000 ;

    tmp[2] |= state[0] & 0x0000FF00 ;

    tmp[2] |= state[1] & 0x000000FF ;

    asm( "aessb %[s3], 0x00, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x10, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x20, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x30, %[s2]\n\t" : [s2] "+r" (state[2]) );

    tmp[3] = state[3] & 0xFF000000 ;

    tmp[3] |= state[0] & 0x00FF0000 ;

    tmp[3] |= state[1] & 0x0000FF00 ;

    tmp[3] |= state[2] & 0x000000FF ;

    // MixColumns

    // Outputs are marked early-clobber ("=&r") because each one is written

    // before the remaining inputs have been read.

    asm(

    "gfmmul %[t0], %[s0]\n\t"

    "gfmmul %[t1], %[s1]\n\t"

    "gfmmul %[t2], %[s2]\n\t"

    "gfmmul %[t3], %[s3]\n\t"

    : [s0] "=&r" (state[0]) , [s1] "=&r" (state[1]) , [s2] "=&r" (state[2]) , [s3] "=&r" (state[3])

    : [t0] "r" (tmp[0]) , [t1] "r" (tmp[1]) , [t2] "r" (tmp[2]) , [t3] "r" (tmp[3])

    );

    // Add round key

    state[0] = state[0] ^ key_schedule[i][0];

    state[1] = state[1] ^ key_schedule[i][1];

    state[2] = state[2] ^ key_schedule[i][2];

    state[3] = state[3] ^ key_schedule[i][3];

    }

    // Final round

    // SubBytes + ShiftRows

    asm( "aessb %[s0], 0x00, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x10, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x20, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x30, %[s3]\n\t" : [s3] "+r" (state[3]) );

    tmp[0] = state[0] & 0xFF000000 ;

    tmp[0] |= state[1] & 0x00FF0000 ;

    tmp[0] |= state[2] & 0x0000FF00 ;

    tmp[0] |= state[3] & 0x000000FF ;

    asm( "aessb %[s1], 0x00, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x10, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x20, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x30, %[s0]\n\t" : [s0] "+r" (state[0]) );

    tmp[1] = state[1] & 0xFF000000 ;

    tmp[1] |= state[2] & 0x00FF0000 ;

    tmp[1] |= state[3] & 0x0000FF00 ;

    tmp[1] |= state[0] & 0x000000FF ;

    asm( "aessb %[s2], 0x00, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s3], 0x10, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x20, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x30, %[s1]\n\t" : [s1] "+r" (state[1]) );

    tmp[2] = state[2] & 0xFF000000 ;

    tmp[2] |= state[3] & 0x00FF0000 ;

    tmp[2] |= state[0] & 0x0000FF00 ;

    tmp[2] |= state[1] & 0x000000FF ;

    asm( "aessb %[s3], 0x00, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s0], 0x10, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s1], 0x20, %[s1]\n\t" : [s1] "+r" (state[1]) );

    asm( "aessb %[s2], 0x30, %[s2]\n\t" : [s2] "+r" (state[2]) );

    tmp[3] = state[3] & 0xFF000000 ;

    tmp[3] |= state[0] & 0x00FF0000 ;

    tmp[3] |= state[1] & 0x0000FF00 ;

    tmp[3] |= state[2] & 0x000000FF ;

    // Add round key

    ciphertext[0] = tmp[0] ^ key_schedule[Nr][0];

    ciphertext[1] = tmp[1] ^ key_schedule[Nr][1];

    ciphertext[2] = tmp[2] ^ key_schedule[Nr][2];

    ciphertext[3] = tmp[3] ^ key_schedule[Nr][3];

    }

    Figure 15: AES encryption routine with aessb and gfmmul instructions

    // AddRoundKey

    state[0] = ciphertext[0] ^ key_schedule[Nr][0];

    state[1] = ciphertext[1] ^ key_schedule[Nr][1];

    state[2] = ciphertext[2] ^ key_schedule[Nr][2];

    state[3] = ciphertext[3] ^ key_schedule[Nr][3];

    for (i = Nr-1; i > 0; i--)

    {

    // InvShiftRows + InvSubBytes

    asm( "aessb %[s0], 0x01, %[s0]\n\t" : [s0] "+r" (state[0]) );

    asm( "aessb %[s3], 0x11, %[s3]\n\t" : [s3] "+r" (state[3]) );

    asm( "aessb %[s2], 0x21, %[s2]\n\t" : [s2] "+r" (state[2]) );

    asm( "aessb %[s1], 0x31, %[s1]\n\t" : [s1] "+r" (state[1]) );

    tmp[0] = state[0] & 0xFF000000 ;

    tmp[0] |= state[3] & 0x00FF0000 ;

    tmp[0] |=