University of Dublin
TRINITY COLLEGE
ENCRYPTION CODE GENERATOR
Paul Magrath
B.A. (Mod.) Computer Science
Final Year Project - May 2009
Supervisor: David Gregg
School of Computer Science and Statistics
O’Reilly Institute, Trinity College, Dublin 2, Ireland
Declaration
I hereby declare that this thesis is entirely my own work and that it has not been
submitted as an exercise for a degree at any other university.
_________________________________ April 24th, 2009
Paul Magrath
Permission to Lend
I agree that the Library and other agents of the College may lend or copy this thesis upon
request.
_________________________________ April 24th, 2009
Paul Magrath
Acknowledgements
To David Gregg, for his support and advice throughout this project.
To Laura, for the love you have given me and for putting up with me.
Table of Contents
Declaration
Permission to Lend
Acknowledgements
1. Motivation
    Introduction
    Readers' Guide to the Report
        Background
        AES Code Generator
        Experimental Results
        Conclusions
        References
2. Background
    Encryption
    AES
    SIMD
    SSE
    OpenMP
    Optimization
3. AES Code Generator
    I. Correctness: An AES-256 implementation
    II. The Generator
        A. Streaming store
        B. Unwind inner loop
        C. Use local variables
        D. Unwind outer loop
        E. Interleave
        F. OpenMP
        G. Prefetch to cache
        H. Preload to register
    III. Simulating
4. Experimental Results
    Intel Core 2 Quad 2.4GHz
        Sequential
        Parallel
    Intel Core 2 Duo 2.16GHz
        Sequential
        Parallel
    Intel Pentium 4 Dual Processor
        Sequential
        Parallel
5. Conclusions
    Contributions
    Future Work
References
1. Motivation
Introduction
In the world today, encryption is vital. Without it, there is no security, freedom or privacy
at all. Governments, corporations, strangers and criminals all routinely attempt to gather
as much information as possible about us and what we do online and in real life in order
to profile us, tempt us, learn about us and defraud us respectively. Every day, with social
networking and online records, more and more of this information becomes available.
Even more can be gathered through measures such as deep packet inspection.
Encryption is a solution to these problems in that it allows us to have a modicum of
control over who can access the data that we distribute, who can listen to our calls, read
our mail, and read our bank statements.
This project investigates using a code generator to generate the various variants of an
Advanced Encryption Standard (AES) encryption loop. AES (see ‘Background’, chapter
2) is a form of encryption that Intel will begin supporting in hardware in their
microprocessors in 2010. An encryption loop is the loop that iterates over all the data to
be encrypted, performing the steps necessary to encrypt it. As such it is a
computationally expensive loop, requiring a large amount of CPU time, and is hence a
candidate for optimization in order to reduce the time taken.
A code generator, in this context, is a program that generates a number of different
variants of a piece of code in order to find which combination of optimization techniques
yields the best possible result (see ‘AES Code Generator’, chapter 3, for variants
employed here). The use of code generators is an established technique for solving
problems in optimizing for modern architectures, used primarily in the research
community. It is ideal for optimizing a small piece of code that uses a vast amount of
processing time and to which the best optimizations are not obvious. Code generators
have been successfully applied to several projects such as [1] and [2]. The motivation for
a code generator is to avoid the problems with code maintenance and code readability
that almost inevitably result from hand-tuned assembly specific to the architecture it is
written for. A code generator can tune itself to the architecture it is running on to find the
best combination of optimizations for that architecture, while remaining readable and
maintainable as it can be written in a high level language, such as C++.
In 2010 Intel will release processors that will have AES instructions built in to their
instruction set. This will greatly reduce the cost of encryption, as it will be possible to
perform the encryption much more quickly and efficiently than previously. This is part of
an enhancement of, and replacement for, the current Intel SIMD instruction set, the
Streaming SIMD Extensions (SSE). The replacement instruction set will be known as the
Intel Advanced Vector Extensions (AVX).
This project examines the most likely ways in which an AES encryption loop built on
such instructions could be optimised, shows how this optimisation can be automated
using a code generator, and presents the results of running this AES code generator,
with its various optimisation combinations, on different architectures and processors.
Readers’ Guide to the Report
Background
A summary of what readers should understand and be aware of when reading the
report, particularly Encryption, AES, SIMD, SSE, OpenMP and Optimisation.
AES Code Generator
An outline and discussion of the various variants supported by the generator that was
implemented as part of the project as well as an explanation of how the correctness of the
input and output AES encryption loops was confirmed.
Experimental Results
Tables and diagrams summarising and demonstrating the results obtained by the timing
and measurement of the outputs from the Encryption Code Generator using hardware
performance counters, or other means as available.
Conclusions
A discussion of the conclusions that can be drawn from the results and of the possible
future works that can build upon this project.
References
A systematic and complete reference to sources used and a classified list of all sources.
2. Background
Encryption
Encryption is the translation of data into a secret code. This is done in an attempt to keep
information secure. To read an encrypted file, you must have access to a secret key or
password that enables you to decrypt it. Hence, third parties without access to the shared
secret key, such as online criminals, curious neighbours and oppressive governments, are
unable to easily (or, in the case of strong encryption, at all) access the data or information
that has been encrypted. Unencrypted data is called plain text while encrypted data is
referred to as cipher text.
There are two main types of encryption: asymmetric encryption (also known as public
key encryption) and symmetric encryption. Asymmetric encryption is a form of
encryption where keys come in pairs. What one key encrypts, only the other can decrypt.
The encryption key is usually made public and distributed freely as only the holder of the
decryption key is able to read the data that has been encrypted with the encryption key.
Symmetric encryption is a form of encryption where the same key is used for both
encryption and decryption. The key must be kept secret, and is shared by the message
sender and recipient. This form of encryption can usually be performed much more
quickly and efficiently than asymmetric encryption. In practice, asymmetric encryption is
usually used to encrypt an insecure communication channel in order to allow for the
exchange of the secret key for the symmetric encryption that will be used for the rest of
the communications. This technique effectively combines the strengths of the two forms
of encryption and is the basis of the SSL/TLS family of encryption protocols, including
HTTPS, which is used every day for online banking and shopping.
AES
AES (Advanced Encryption Standard) is one of the most popular algorithms used in
symmetric encryption.
Originally published as Rijndael [3], it was adopted as a standard by the U.S. government
in November 2001 [4], after a five-year standardization process involving fifteen
competing designs. The standard comprises three block ciphers, AES-128, AES-192 and
AES-256, adopted from a larger collection. A block cipher is a cipher that operates on
fixed-length groups of bits, termed blocks, with an unvarying transformation. The block
cipher takes in two inputs, the plaintext of the block and the secret key, and outputs the
ciphertext (encrypted text) of the block. Each AES cipher has a 128-bit block size, which
means that 128-bits of the plaintext are encrypted into ciphertext in each iteration of the
encryption loop. AES-128, AES-192 and AES-256 have secret keys of sizes 128, 192 and
256 bits respectively where AES-128 is the least secure while AES-256 is the most
secure.
When the length of data to be encrypted exceeds the block size, a mode of operation must
be used [5]. The two that we will concern ourselves with are Electronic Code Book
(ECB) and Counter (CTR). These will be discussed in detail in the AES Code Generator
(chapter 3).
An outline of the algorithm for AES-256 encryption is:
Key Expansion
Using the Rijndael key schedule, the 14 round keys are extracted from the 256-bit
secret key.
Initial Round (round 0)
The initial round is simply the bitwise XOR of the round key to the plaintext
(referred to as the ‘state’ during the encryption).
Rounds 1 through 12
Each of these rounds is comprised of a non-linear substitution step, a transposition
step, a mixing step, and a bitwise XOR of the round key to the state.
Final Round (round 13)
The final round is identical to Rounds 1 through 12 except the mixing step is
omitted.
It should be noted that the key expansion only has to be performed once for any given
secret key. Hence, the encryption loop is composed only of the Initial Round, the Rounds,
and the Final Round as the round keys extracted during key expansion can be reused for
whatever many iterations of the encryption loop it takes to encrypt the entirety of the
plaintext.
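The loop structure described above can be sketched in C. This is a minimal sketch of the control flow only: the round helpers below perform just the bitwise XOR of the round key, with the substitution, transposition and mixing steps deliberately elided, so it demonstrates the round structure rather than real AES; block_t, add_round_key and encrypt_block are illustrative names, not the implementation described later in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Structural sketch of the AES-256 encryption of one block.
   Only the round-key XOR step is performed; the substitution,
   transposition and mixing steps are elided, so this is NOT AES. */
typedef struct { uint8_t b[16]; } block_t;

static block_t add_round_key(block_t state, block_t key)
{
    for (int i = 0; i < 16; i++)
        state.b[i] ^= key.b[i];
    return state;
}

/* Encrypt one 128-bit block with 14 pre-expanded round keys.
   Key expansion happens once; this function is then called for
   every block, reusing the same round keys. */
block_t encrypt_block(block_t state, const block_t keys[14])
{
    state = add_round_key(state, keys[0]);      /* initial round (0)       */
    for (int r = 1; r <= 12; r++)               /* rounds 1 through 12     */
        state = add_round_key(state, keys[r]);  /* (sub/shift/mix elided)  */
    return add_round_key(state, keys[13]);      /* final round (13)        */
}
```

Because every step in this sketch reduces to XOR, applying it twice with the same keys returns the original block; the real cipher of course does not have this property.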
SIMD
SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level
parallelism. Parallelism is when calculations are carried out simultaneously. In SIMD
computer architecture, the computer exploits multiple data streams against a single
instruction stream in order to perform operations that may be easily parallelized [6].
By processing multiple data elements in parallel, SIMD processors provide a way to
utilize data parallelism in applications that apply a single operation to all elements in a
data set, such as a vector or matrix [7].
SSE
Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86
architecture. It was designed by Intel and introduced with the Pentium III microprocessor
family.
SSE originally added eight new 128-bit registers (a register is a small, fast storage
location inside the processor) known as XMM0 through XMM7. The 64-bit variant of the
x86 architecture, x86-64, has a further eight registers referred to as XMM8 through
XMM15. XMM0 through XMM15 can be accessed in 64-bit operating mode, while only
XMM0 through XMM7 can be accessed in 32-bit operating mode.
Each register packs together four 32-bit single-precision floating point numbers or two
64-bit double-precision floating point numbers or four 32-bit integers or eight 16-bit short
integers or sixteen 8-bit bytes or characters. (Hence, each register could hold the contents
of an entire 128-bit AES block.) There have been a number of iterations of SSE, each of
which has added a number of enhancements in terms of instructions. SSE4 is the version
of SSE that is supported by the current Intel Core microarchitecture [8].
AVX (Advanced Vector Extensions) is an advanced version of SSE, which will appear in
Intel products in 2010, and which features a 256 bit data path (widened from 128 bits in
SSE4). AVX will provide six new instructions for symmetric encryption/decryption using
the Advanced Encryption Standard (AES) and one instruction performing carry-less
multiplication (PCLMULQDQ), which aids in performing finite field arithmetic (a type of
arithmetic used in advanced block cipher encryption). These hardware-based primitives
provide a security benefit apart from their speed advantage by avoiding table-lookups and
hence protecting against software side channel attacks (attempts to discover the secret
key by observing and analyzing the flow of information in the computer during the
encryption process). [9]
However, we do not need to program in assembly in order to utilise any of these
extensions to the instruction set. Instead we can use intrinsics. Intrinsics are special
functions for which the compiler generally has a specific optimisation path and which the
compiler encodes to one or more machine instructions. They represent a good middle
ground between speed of execution and ease of use for the programmer. With modern
optimising compilers, particularly the Intel C++ Compiler, we can achieve nearly as
good results as the best hand-tuned assembly code, without compromising code
readability or having to worry about register management.
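As an illustration, the 128-bit XOR used throughout an AES loop can be written with intrinsics rather than assembly. This is a sketch under stated assumptions: xor_block is a made-up helper name, and the example assumes an x86 machine with SSE2 support (the emmintrin.h header).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_xor_si128, ... */
#include <stdint.h>

/* XOR one 16-byte block against another in a single 128-bit SSE
   operation. Each intrinsic maps to (roughly) one machine
   instruction; the compiler handles register allocation itself. */
void xor_block(const uint8_t *src, const uint8_t *key, uint8_t *dst)
{
    __m128i a = _mm_loadu_si128((const __m128i *)src);  /* unaligned load  */
    __m128i b = _mm_loadu_si128((const __m128i *)key);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b));
}
```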
OpenMP
The OpenMP Application Program Interface (API) supports multi-platform shared-
memory parallel programming in C/C++ and Fortran on many architectures, including
Linux, Windows and Mac OS X. It consists of a set of compiler directives, library
routines, and environment variables that influence run-time behavior. OpenMP gives
programmers a simple model and interface for developing parallel applications [10].
The Hello World example below demonstrates how easy it is to create additional threads
to carry out work using the API provided by OpenMP:
#include <stdio.h>

int main(int argc, char* argv[]) {
    #pragma omp parallel
    printf("Hello World!\n");
    return 0;
}
The above program will print “Hello World!” once for each thread in the parallel team,
which by default is typically one thread per processor core on the machine.
Similarly, a for loop can be parallelised, with its iterations divided among the threads:

int main(int argc, char **argv) {
    const int N = 100000;
    int i, a[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}
Optimization
Optimization is the process of tuning the output of a compiler to minimize or maximize
some attribute of an executable computer program. The most common requirement is to
minimize the time taken to execute a program; a less common one is to minimize the
amount of memory occupied. Optimization is usually applied by the compiler, which will
usually attempt to generate the fastest possible code. However, a programmer will
sometimes attempt to help the compiler out by performing some optimizations manually
as (s)he, in theory, knows the algorithms better.
The general themes of optimizing programs are:
Optimize the common case
The common case may have unique properties that allow a very quick computation at the
expense of a very slow computation for certain less common cases. If the common case is
taken most often, the result can be better over-all performance.
Avoid redundancy
Reuse results that are already computed and store them for use later, instead of re-
computing them.
Less code
Remove unnecessary computations and intermediate values. Less work for the CPU,
cache, and memory usually results in faster execution. Alternatively, in embedded
systems, less code requires less memory and brings a lower product cost.
Straight line code
Less complicated code with fewer jumps and conditional branches will be faster, as
these interfere with the pre-fetching of instructions and thus slow down execution.
Locality
Code and data that are accessed closely together in time should be placed close together
in memory to increase spatial locality of reference, and hence reduce the amount of
register loading.
Manage memory efficiently
Place the most commonly used items in registers first, then caches, then main memory,
before going to disk.
Parallelize
Reorder operations to allow multiple computations to happen in parallel, either at the
instruction (using SIMD instructions such as those in the Intel SSE instruction set),
memory, or thread level (using an API like OpenMP). [11]
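Two of these themes, avoiding redundancy and less code, can be seen in one tiny hypothetical example: hoisting a loop-invariant computation out of the loop body so it is computed once instead of on every iteration. The function names and arithmetic here are purely illustrative.

```c
#include <assert.h>

/* Before: s * s + 1 is recomputed on every iteration even though it
   never changes inside the loop (redundant work). */
int sum_scaled_naive(const int *a, int n, int s)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i] * (s * s + 1);
    return total;
}

/* After: the loop-invariant value is computed once and reused,
   following the "avoid redundancy" and "less code" themes above. */
int sum_scaled_hoisted(const int *a, int n, int s)
{
    const int scale = s * s + 1;
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i] * scale;
    return total;
}
```

An optimising compiler will often perform this hoisting itself, but doing it manually is a simple example of the programmer helping the compiler along.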
3. AES Code Generator
AES Code Generator is discussed here under a number of sections:
I. Correctness: An AES-256 implementation
II. The Generator
III. Simulating
IV. Testing
I. Correctness: An AES-256 implementation
The correctness of the input and output is confirmed by means of a customised AES-256
implementation.
The implementation described in this report emulates AES-256 and was based upon a
byte-oriented portable C implementation in which all the lookup tables had been replaced
with “on-the-fly” calculations [12]. The implementation is fully compliant with the
specification and is highly portable. There was no assembler in the original code but I
later wrapped the code with vector functions (SSE functions) and vectorised some, but
not all, of the code. This was done so that the function signatures (i.e. the names and
parameters of the functions) would be compatible with those of the AES instructions that
Intel will introduce as part of AVX (the Advanced Vector Extensions, see Background,
chapter 2).
Since the purpose of including the AES code is to check correctness, the slow speed of
this implementation is not a problem. It is provided as a simple means of verifying that
the implemented variants do not change the basic algorithm of the AES encryption loop
that is the input to the generator.
II. The Generator
The generator system generates a large number of simple variations of a basic AES
encryption loop. These various modifications of the code can then be run on a particular
model of processor, and with various compiler switches, to find the best variant for that
particular processor.
The generator was tested with two block cipher modes of operation of 256 bit AES:
Electronic Code Book (ECB) and Counter (CTR). Electronic code book is the simplest of
the encryption modes. The data to be encrypted is divided into blocks and each block is
encrypted separately. However, identical plaintext blocks are encrypted into identical
ciphertext blocks, so data patterns are easily recognised in the ciphertext. Hence, it is not
recommended for use in cryptographic protocols at all. On the other hand, counter turns a
block cipher into a stream cipher. Instead of encrypting the data itself, successive values
of a “counter” are encrypted. The counter is usually concatenated with, added to, or
XORed with an initialization vector to produce a unique counter value for each block.
The encrypted counter is then XORed with the actual text to be encrypted in order to
form the ciphertext.
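The CTR construction just described can be sketched as follows. The block cipher here is a trivial stand-in (a fixed XOR), chosen only so the mode's control flow is visible and testable; ctr_crypt, encrypt_block and block_t are illustrative names, not the generator's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } block_t;

/* Placeholder block "cipher": XOR every byte with a constant.
   This is NOT AES; it only stands in so the mode can be run. */
static block_t encrypt_block(block_t in)
{
    for (int i = 0; i < 16; i++)
        in.b[i] ^= 0x5A;
    return in;
}

/* CTR mode over nblocks 16-byte blocks: encrypt successive counter
   values (nonce + i) and XOR the result with the text. Encryption
   and decryption are the same operation, which is what turns the
   block cipher into a stream cipher. */
void ctr_crypt(const block_t *src, block_t *dst, size_t nblocks,
               uint64_t nonce)
{
    for (size_t i = 0; i < nblocks; i++) {
        block_t ctr;
        memset(&ctr, 0, sizeof ctr);
        uint64_t v = nonce + i;            /* unique counter value */
        memcpy(ctr.b, &v, sizeof v);
        block_t ks = encrypt_block(ctr);   /* encrypt the counter  */
        for (int j = 0; j < 16; j++)       /* XOR with the text    */
            dst[i].b[j] = src[i].b[j] ^ ks.b[j];
    }
}
```

Because each block depends only on its own counter value, the iterations are independent of one another, which is what makes CTR comparatively easy to unroll, interleave and parallelise.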
The generator can take an input file of an encryption loop of either mode of operation,
electronic codebook (ECB) or counter (CTR), as its input file. These two types of block
cipher mode of operation were chosen because they were the easiest to parallelise. There
are a number of different options available for generating the variants. Each of these
options is, in essence, a distinct optimization, or set of possible configurations of
optimizations, that can be performed. The options are:
Streaming store
Unwind inner loop
Use local variables
Unwind outer loop
Interleave
OpenMP (parallel)
Prefetch to cache
Prefetch to register
In the following sections, these options are described in more detail.
A. Streaming store
In the Streaming Store option, we use an SSE instruction instead of storing the result to
memory using a standard memory assignment. This variant uses the _mm_stream_si128
instruction to store the result directly to memory without polluting the caches. As with
all the options to the generator, specifying it will generate variants with the option both
enabled and disabled.
Before:
    result = encrypt_final(result, *keys);
    result = _mm_xor_si128(result, source[i]);
    dest[i] = result;

After:
    result = encrypt_final(result, *keys);
    result = _mm_xor_si128(result, source[i]);
    _mm_stream_si128(&(dest[i]), result);
Normally, when we write to memory, the cache is updated with the contents of the write
so that if there is a read request for that information soon after the write, it can be recalled
quickly. However, in this situation the information that is being written is the cipher text
that has just been encrypted, and we want to keep the cache for memory that we are going
to be accessing again such as round keys and the plain text to be encrypted.
However, this is an option that can be turned on or off as it can happen that the use of
these instructions can interfere with the compiler’s own optimizations. The generator can
therefore experiment with both versions in combination with lots of other options.
B. Unwind inner loop
This variant unwinds the inner loop to the extent specified in the argument.
Loop unrolling is a technique that attempts to increase the execution speed of the
program at the expense of its size. The loop is rewritten as a sequence of independent
statements, hence reducing (in this case, eliminating) the overhead of evaluating the loop
condition on each of the iterations and reducing the number of jumps and conditional
branches that need to be executed.
There are two side effects of loop unrolling. These are an increased register usage in a
single iteration to store temporary variables (but not in this case, as we are completely
eliminating the loop rather than just unwinding it a little), and the code size expansion
after the unrolling. Large code size can lead to an increase in instruction cache misses.
In this case, the loop is quite short (equal to the number of keys, 14) and hence a
significant speed boost should be observed due to the removal of the control variable
check, the 14 jumps and the conditional branches from the control flow.
Before:
    for (int j = 1; j < nKeys; j++) {
        result = encrypt_round(result, *(keys+j));
    }

After:
    result = encrypt_round(result, *(keys+1));
    result = encrypt_round(result, *(keys+2));
    result = encrypt_round(result, *(keys+3));
    ...
    result = encrypt_round(result, *(keys+12));
    result = encrypt_round(result, *(keys+13));
Loop unrolling can also aid the compiler and the processor in performing their own
optimizations, such as instruction scheduling.
C. Use local variables
This variant uses local variables for the AES round keys instead of memory accesses (the
round keys are sub-keys used for the individual rounds extracted from the cipher key
using the Rijndael key schedule). This involves defining the variables, assigning them the
round keys from their memory locations and updating all references in the input file to
the memory location to refer to the variables instead. The idea is that assigning the round
keys to variables gives the compiler a strong hint to keep the round keys in
registers rather than performing memory accesses (it is faster to access registers than the L1
cache, where the round keys will probably reside).
It is also observable that the number of round keys that are stored in registers can have a
negative impact on run time. This is due to the impact that storing them in registers has on
the number of registers available for other purposes, such as storing temporary variables
such as results. This is particularly relevant when a high level of unwinding of the outer
loop has also occurred, especially when it has been interleaved as well.
As such, up to 2^14 different variants of the number of round keys stored in local variables
as opposed to accessed from memory can be generated. There are 2^14 different variants as
there are 14 round keys that could each be stored in a local variable, and hence 2^14 different
combinations of round keys in local variables and memory accesses. However, for testing
purposes, only the 14 combinations formed by loading successive round keys into local
variables were looked at.
Before:
    for ( i = 0; i < limit; i++ ) {
        ...
        result = encrypt_round(result, *(keys+1));
        result = encrypt_round(result, *(keys+2));
        result = encrypt_round(result, *(keys+3));
        ...
        result = encrypt_round(result, *(keys+12));
        result = encrypt_round(result, *(keys+13));

After:
    const vector_type key0 = keys[0];
    const vector_type key1 = keys[1];
    const vector_type key2 = keys[2];
    const vector_type key3 = keys[3];
    ...
    const vector_type key12 = keys[12];
    const vector_type key13 = keys[13];

    for ( i = 0; i < limit; i++ ) {
        ...
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        result = encrypt_round(result, key3);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
Using local variables for the round keys also has the bonus that it allows the compiler to
apply other optimizations much earlier in the process than if the round keys are left as
elements of an array as it can be difficult for the compiler to prove that the array is
unaliased (i.e. does not reference the same location or variables as another pointer).
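The aliasing issue can be illustrated with a small hypothetical example (the function names and the int keys are made up for illustration; the real loop uses 128-bit vector keys). In the first function the compiler must assume that dst might overlap keys, so it cannot safely keep keys[1] in a register across the stores; copying the key to a local variable removes that obstacle.

```c
#include <assert.h>

/* With a raw pointer, the compiler must assume dst may alias keys
   (const does not rule this out), so keys[1] may be reloaded from
   memory on every iteration. */
void rounds_aliased(int *dst, const int *keys, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] ^= keys[1];
}

/* Copying the round key to a local proves it cannot change during
   the loop, so the compiler is free to keep it in a register. */
void rounds_local(int *dst, const int *keys, int n)
{
    const int key1 = keys[1];
    for (int i = 0; i < n; i++)
        dst[i] ^= key1;
}
```

When dst and keys do not overlap, the two functions compute identical results; the difference is only in what the compiler is allowed to assume, and hence in the code it can generate.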
D. Unwind outer loop
This variant unwinds the outer loop to the extent specified in the argument. The
technique, and the side effects, are the same as with unwinding the inner loop, except the
effect is much greater as a result of the greater number of instructions involved.
There is a certain threshold beyond which the returns from the unwinding rapidly
diminish; then the effects of the limited number of registers and of instruction cache
misses take hold, and run time increases again.
Before:
    for ( i = 0; i < limit; i++ ) {
        vector_type result;
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
    } // end outer loop

After:
    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result;

        // iteration 0
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
        // end of original outer loop

        // iteration 1
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i+1]);
        dest[i+1] = result;
        // end of original outer loop
    } // end unrolled loop
E. Interleave
This variant interleaves an unwound outer loop to the extent specified in the argument.
The idea is that by interleaving the loop, we reduce the number of instructions that
stall waiting on the result of a previous instruction.

Since each iteration works on a different intermediate result, and revisits that result a
number of times as it works through the encryption rounds, it makes sense to place
operations of the same round, rather than of the same iteration, after one another. The
delay that would otherwise exist within an iteration between the first and second rounds,
while waiting for the result of the first, is instead filled by calculating the first round
of the next iteration.
It is also potentially very important for performance that each key is used multiple
times in sequence, so that it need only be loaded once for each run of instructions
using that key.
Before (unrolled):

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result;
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, (key1));
        result = encrypt_round(result, (key2));
        ...
        result = encrypt_round(result, (key12));
        result = encrypt_round(result, (key13));
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
        // end of original outer loop
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, (key1));
        result = encrypt_round(result, (key2));
        ...
        result = encrypt_round(result, (key12));
        result = encrypt_round(result, (key13));
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i+1]);
        dest[i+1] = result;
        // end of original outer loop
    } // end unrolled loop

After (interleaved):

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        result0 = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        result1 = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result0 = encrypt_initial(result0, enckey);
        result1 = encrypt_initial(result1, enckey);
        // encryption
        result0 = encrypt_round(result0, (key1));
        result1 = encrypt_round(result1, (key1));
        result0 = encrypt_round(result0, (key2));
        result1 = encrypt_round(result1, (key2));
        result0 = encrypt_round(result0, (key3));
        result1 = encrypt_round(result1, (key3));
        ...
        result0 = encrypt_round(result0, (key11));
        result1 = encrypt_round(result1, (key11));
        result0 = encrypt_round(result0, (key12));
        result1 = encrypt_round(result1, (key12));
        result0 = encrypt_round(result0, (key13));
        result1 = encrypt_round(result1, (key13));
        // final round
        result0 = encrypt_final(result0, key0);
        result1 = encrypt_final(result1, key0);
        result0 = _mm_xor_si128(result0, source[i]);
        result1 = _mm_xor_si128(result1, source[i+1]);
        dest[i] = result0;
        dest[i+1] = result1;
        // end of original outer loop
    } // end unrolled loop
F. OpenMP
This variant includes the OpenMP pragma directives that are commented out in the
other variants. This allows the investigation of parallel versions of the code, for
running on processors with multiple cores and/or simultaneous multithreading.
Any of the other variants can be generated with the OpenMP variant to create both
threaded and non-threaded versions of the same code to investigate which is the most
efficient of the options.
Before:

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        vector_type src0;
        vector_type src1;

After:

    #pragma omp parallel for
    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        vector_type src0;
        vector_type src1;
G. Prefetch to cache
This variant uses the _mm_prefetch SSE intrinsic to prefetch source data to the cache.
The instruction loads one cache line of data from the given address to a location closer to
the processor. The idea is to prime the cache for the next iteration(s) of the encryption
loop.
However, a balance must be struck to ensure that priming the cache too far ahead does
not poison it (i.e. if we prefetch too many lines, or prefetch them too soon, lines that
we still need may be evicted). Hence, the generator must generate a large number of
variants in order to find a version that prefetches the source data just far enough ahead.

The number of iterations ahead of its use at which source data is prefetched can also
affect the speedup gained from the optimization, so a number of variants must be
produced in order to find the one with the greatest speedup.
Example:
_mm_prefetch((char const*)&source[i+2],_MM_HINT_T0);
H. Preload to register
This variant uses a variable to preload source data into a register. The idea is to touch
the L1 cache so that it is primed with the line from which the source data for the next
iteration(s) of the encryption loop will be taken. This is important because the prefetch
SSE instruction only brings data into the L2 cache.
However, only a limited number of registers are available. Hence, the generator must
generate a large number of variants in order to find the version that leads to an
improvement in runtimes.
Example:
const vector_type src2 = source[i+2];
Most Intel microprocessors are also capable of prefetching in hardware, so attempts at
software prefetching sometimes have little, no, or even a negative effect if the hardware
prefetcher does a better job.
III Simulating
The new architecture that Intel will release in 2010 introduces six Intel SSE instructions
that facilitate encryption. Four of them, namely AESENC, AESENCLAST,
AESDEC, and AESDECLAST, facilitate high-performance AES encryption and
decryption. The other two, AESIMC and AESKEYGENASSIST, support the
AES key expansion procedure. Together, these instructions will provide full hardware
support for AES, offering security, high performance, and a great deal of flexibility. [13]
In order to simulate the performance effect of the actual AES instructions without the
hardware that supports them, the generator supports replacing the AES encryption
instructions in the input code with instructions of various latencies (provided in
#defines in the input code) that serve as proxies for the purposes of testing the effect
of the optimizations. This allows the code to be tested at several different latencies
to give an idea of the effect of the speedups implemented by the variants. This is done
only for the four instructions that facilitate high-performance AES encryption and
decryption. The two instructions supporting AES key expansion are not emulated:
since key expansion is done only once regardless of the amount of plaintext to be
encrypted, and is relatively quick, there is little benefit in studying speedups of it.
For the purposes of this project, instructions of latencies of two, three and five cycles
respectively were chosen for comparison, using the available Intel documentation [12].
Intel documentation published to date [13] indicates that the initial chips supporting AES
encryption in hardware will do so in six cycles. As such, the figures for the effect of the
various optimizations on five-cycle latency instructions are the most relevant for the first
generation of hardware supporting the instructions (given that there are not currently any
six- or seven-cycle latency instructions in the SSE instruction set). The figures for two-
and three-cycle latencies, however, represent what we can expect from the second or third
generation of hardware.
IV Testing
In order to test my code and obtain my experimental results, I ran the generated code
under a number of compiler flags and architectures: specifically, with both the GNU C
Compiler and the Intel C Compiler, on 32- and 64-bit multiprocessors, and on machines
with a single core, dual cores and eight cores.
In order to generate the detailed results reported in the next section, I used PapiEx.
PapiEx is a performance analysis tool designed to transparently and passively measure
the hardware performance counters of an application using PAPI [15]. The
Performance API (PAPI) project specifies a standard application programming interface
(API) for accessing the hardware performance counters available on most modern
microprocessors [14]. PapiEx uses Monitor, a library that gives the user callbacks or
traps upon events related to library/process/thread initialization/creation/destruction, to
intercept process and thread creation and destruction effortlessly [17].
Using PapiEx, I was able to get accurate information on the number of instructions
issued, executed and completed, the number of data cache misses and the number of stall
cycles, as well as the total number of cycles the execution took.
At the time of writing, the Intel C Compiler did not yet support the compilation of the
AES instructions; however, it should be trivial to test the correctness of the output using
the Intel Software Development Emulator [18].
4. Experimental Results
The experimental results illustrated below showcase the effects of some of the
variants. The data for these graphs came from running quite small subsets of the possible
output of the generator and graphing the intermediate results – specifically, the
generator's medians of the runtimes from twenty-five consecutive runs of each variant.
By default, the generator runs through all the possible combinations unless, as here, it is
passed arguments, in which case it generates only the combinations of the specified
variants. Normally, the 'winner' variant printed by the generator at the end is all one
would be concerned with; the detailed graphing and discussion of the effects of each
variant and combination here is simply to demonstrate what is going on inside the
generator.
The AES Code Generator can be run on practically any system with a working C++
compiler, provided the generated code is compiled with SSE support appropriate to the
microprocessor. To get the most out of it, however, one should use a compiler that also
supports OpenMP; the GNU C++ Compiler (v4.2+) and the Intel C++ Compiler (v10.1+)
are hence the best choices.
In general, the AES Code Generator simply has to be deployed to the target machine and
executed in order for it to build itself, generate the variants, test them, and report back
the fastest variant. The optimization process is entirely automated, and no further
user intervention is needed.
To demonstrate this, data generated by the AES Code Generator will be shown from
three different architectures:
Intel Core 2 Quad 2.4GHz (4 cores)
Intel Core 2 Duo 2.16GHz (2 cores)
Intel Pentium 4 Dual Processors
Intel Core 2 Quad 2.4GHz
Figure 1: Effect on runtime in terms of processor cycles of combinations of Streaming Store and of Parallel
(on an Intel Core 2 Quad)
The graph above shows the base case (Sequential) along with the possible combinations
of the OpenMP variant (Parallel) and the Streaming Store variant.
As you can see, the effect of using OpenMP can be quite significant, with a speedup of
about 250%, while the Streaming Store seems to be largely ineffective, perhaps because
the work of the hardware prefetcher (the part of the processor that speeds up the flow of
data accessed by the program [20]) negates the effect of not using the instruction. The
speedup from multiple threads is not proportional to the number of threads used, as the
processor has other bottlenecks that constrain the speed at which it can execute
instructions even when there is work for all four cores; the limited memory bus
bandwidth, in a case such as this where a large amount of data is being processed, could
quite easily become the bottleneck.
A number of variants are possible but we will look next at the effect of putting the round
keys (the encryption keys used for the individual rounds in the AES algorithm - see
‘Background’, chapter 2) into local variables.
Sequential
Figure 2: Effect on runtime in terms of processor cycles of combinations of Streaming Store, of Parallel
and of using Local Variables to store round keys (on an Intel Core 2 Quad)
In the chart above, you can see the combined effect of several simple variants: OpenMP,
Streaming Store and the use of local variables to store the round keys. The unwinding of
the inner loop has also been performed, as that optimization is a prerequisite for using
local variables to store the round keys. Each of the four possible combinations of use and
non-use of OpenMP (Parallel) and Streaming Store is represented by one of the four
colours, while the number of local variables used is on the x-axis. The y-axis is the
number of CPU cycles taken by the encryption loop.
As you can see, the use of multiple threads in the parallel version provides a considerable
speed boost, to which the use of local variables is largely irrelevant, whereas in the
sequential version the use of about nine local variables seems to be best. In either case,
the use of Streaming Store appears largely irrelevant, indicating that the hardware
prefetcher in the microprocessor and the compiler are doing an excellent job without any
help from the Streaming Store.
Figure 3: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 4: Effect on runtime in terms of issued instructions of varying levels of Unwinding and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 5: Effect on runtime in terms of completed instructions of varying levels of Unwinding and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 6: Effect on runtime in terms of processor stall cycles of varying levels of Unwinding and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 3 shows the effect of local variables combined with unwinding on the runtime.
As you can see, the combination of a high number of round keys in local variables and a
high level of unwinding is very effective here. Figures 4 through 6 show the effect of the
optimizations using other metrics: issued instructions, completed instructions and stall
cycles respectively. Lower is better for all of these, with the minimization of stall cycles
the most important of the three. In each, we see a similar pattern indicating that a high
number of round keys in local variables and a high level of unwinding are very effective
when combined.
The diagram below (Figure 7) demonstrates the effect of the fully unwound outer loop
when combined with the Streaming Store option. As you can see, the fastest variants are
again clearly those with both a high level of unrolling and a high number of round keys in
local variables, but the fastest here are slower than the fastest on the previous page, where
streaming store was disabled.
Figure 7: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of
round keys in local variables when Streaming Store is used (on an Intel Core 2 Quad)
Clearly, the streaming store optimization is not worthwhile for this architecture so it will
not be looked at again here. We can postulate that the hardware prefetcher in the
microprocessor and the compiler are doing an excellent job without any help from the
Streaming Store.
A further optimization of the unrolled outer loop is to interleave it. The effect of
this is shown in Figure 8 below.
Figure 8: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 9: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 10: Effect on runtime in terms of issued instructions of varying levels of Interleaving and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 11: Effect on runtime in terms of completed instructions of varying levels of Interleaving and
numbers of round keys in local variables (on an Intel Core 2 Quad)
As you can see from Figure 8, the fastest variants are clearly those with high levels of
interleaving and a fair number of round keys in local variables, as demonstrated by their
runtimes in the range of 200,000 to 300,000 cycles.

Comparing the variants with unrolling alone to the variants with interleaving, one can
clearly see a considerable speed advantage for the interleaved versions. This is likely due
to the reduction in cache misses that results from keeping operations with the same round
key together, rather than operations on the same state and result. When unrolling alone is
used, it will often still be necessary to load the round keys into registers from the caches,
with a considerable length of time between uses, whereas when the code is interleaved,
all the operations with the same round key are performed without interruption.
From Figures 8 through 11, and comparing to the corresponding graphs for unwinding
alone (Figures 3 through 6), we can clearly see that code generated by the AES Code
Generator that uses interleaving is significantly faster than that using unrolling alone on
this architecture since the number of cycles, instructions and stalls is nearly uniformly
lower in the interleaved variants.
The next optimization we will look at is prefetching some of the plaintext source data to
be encrypted into the caches. For these graphs, we are only looking at the prefetching
variants with five local variables and interleaving to a factor of ten (the generator itself
would run through all the possible combinations, not just these).
Below is a graph (Figure 12) of the effect of pre-fetching to cache (the level two cache)
in terms of cycles.
Figure 12: Effect on runtime in terms of processor cycles of varying the source data line and number of
iterations ahead to prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)
From Figure 12 above, you can see that this technique can shave a couple of thousand
cycles off the runtime, but there is little in the way of a general rule. Multiple runs show
these results to be distinctly consistent, however, indicating that software prefetching can
beat the hardware prefetcher. Hence, this is a case where a code generator is perfectly
suited, as it can generate and test all the permutations. Figure 13 below shows the effect
on the L2 cache itself in terms of the misses that resulted.
Figure 13: Effect on L2 Cache Misses of varying the source data line and number of iterations ahead to
prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)
Figure 14: Effect on runtime in terms of processor cycles of varying the source data line and number of
iterations ahead to prefetch when prefetching to register (on an Intel Core 2 Quad)
Figure 15: Effect on L1 Cache Misses of varying the source data line and number of iterations ahead to
prefetch when prefetching to register (on an Intel Core 2 Quad)
Figure 14 shows the effect of prefetching to register (the level 1 cache, effectively) in
terms of cycles, while Figure 15 shows the effect on the L1 cache itself in terms of the
misses that resulted. Although the results here are a little less unpredictable, they again
show that a generator is in its element in this kind of search for the most efficient
combination of optimizations, especially when dealing with new, unseen architectures
such as those that will support AES in hardware from 2010 onwards.
This section has looked at the effect of various optimizations that the generator can
easily generate, test, measure and compare. What has not yet been looked at, though, is
the effect of using multiple threads to perform the encryption in parallel; that is the
subject of the next section.
Parallel

In the previous section, we looked at running optimizations on the sequential
version of the encryption code loop. However, the AES Code Generator generates all
these optimizations for the parallel, OpenMP, version of the code, just as easily.
In the diagram below we see the effect of unrolling and the number of round keys in local
variables with the OpenMP variant.
Figure 16: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Quad)
As you can see from the diagram above, the effect of the unwinding, although not as
pronounced as in the sequential version, is still significant. One could postulate that this
is because the slower sequential version puts less pressure on the shared memory bus.
In the diagram below we see the effect of interleaving and the number of round keys in
local variables with the OpenMP variant.
Figure 17: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Quad)
Again, the interleaved version is clearly faster than the unrolled version alone and, if you
compare it to the interleaved version without OpenMP (sequential) from the previous
section, you will see that although it is close, the OpenMP (parallel) version is still
distinctly faster. On a machine with multiple processors rather than multiple cores, it is
entirely possible that the use of multiple threads working in parallel would yield a greater
speedup, as the memory bus to/from the processor could well be a bottleneck here.
To conclude this section, we have shown that the generator can apply the techniques used
to optimize sequential versions equally to parallel versions. It should be noted that,
although we have separated the sequential and parallel versions here, the generator treats
the OpenMP variant as just another variant to be tried in combination with all the others.
As such, the generator will conclude with a specific recommendation, as can be seen
below.
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-14LocalVariables-Interleaved11.cpp
Intel Core 2 Duo 2.16GHz
In dealing with the final two architectures, we will focus on the most interesting areas
rather than going as in-depth as previously, in order to avoid repetition. As such, we will
be concentrating on the parallel versions of the unrolled and interleaved outer loop and
the effect of different combinations of local variables used for round keys and levels of
unwinding or interleaving respectively.
Sequential
Tail of the AES Code Generator output:
Winner is: output-ctr-L5-UI-01LocalVariables-Interleaved10.cpp
On this architecture, as you can see above, the fastest sequential version was an
interleaved version which explicitly put only a single round key into a local variable. The
most likely reason for the low number of local variables used for round keys is that the
compiler performed much the same optimization itself, or that the hardware prefetcher
performed better than the generator's software efforts.
Parallel
Figure 18: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Duo)
The graph above shows the effect of the unrolling of the outer loop while the graph below
shows the effect of the interleaving of the outer loop. As you can see, the unrolled
versions are significantly quicker on average while the effect of the interleaving is more
predictable but with a larger possible range.
Figure 19: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Duo)
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-12LocalVariables-Interleaved03.cpp
As you can see from the Generator output above though, an interleaved version is the
fastest overall.
Intel Pentium 4 Dual Processor
Sequential
Tail of the AES Code Generator output:
Winner is: output-ctr-L5-UI-05LocalVariables-Interleaved14.cpp
On this architecture, as you can see above, the fastest sequential version was an
interleaved version which explicitly put only five round keys into local variables. The
interleaving could be expected from the previous sections, and the relatively low use of
local variables could be due to the lower number of registers available, as this is a 32-bit
machine whereas the others were 64-bit machines.
Parallel
Figure 20: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Pentium 4 Dual Processor)
Figure 21: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Pentium 4 Dual Processor)
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-00LocalVariables-Interleaved07.cpp
As you can see, the fastest version was one which eschewed the use of local variables for
round keys entirely. This could be due to it being a 32-bit machine with fewer registers
available, so the compiler and hardware were able to do the best job of managing them.
5. Conclusions
The AES Code Generator is a valuable means of finding the most efficient, optimized
AES code for any given architecture. In general, for the architectures surveyed here, the
interleaved, parallel variants seem to be the most efficient. This is most likely due to the
use of all available cores through multiple threads, and to the reduction in loads from
cache to register brought about by interleaving the outer encryption loop. The number of
local variables used for storing round keys was consistently low on the 32-bit
architecture and medium on the 64-bit architectures, reflecting the greater number of
registers available in 64-bit mode.
To conclude, this report will outline what contributions this project has made to the
current state of the art and briefly discuss what future work could be attempted that could
build upon the progress achieved in this project.
Contributions
This project has made a number of contributions to the state of the art.
Firstly, it is one of the very first to deal with the new AES instruction set which will
appear in the next generation of Intel processors.
Secondly, it provides a proof of the concept of using a code generator to provide all the
various optimized variants of the standard AES encryption loop.
Thirdly, it extends the idea of self-tuning generators to the area of encryption code
generation.
Future Work
Future work will no doubt focus on optimizations taking advantage of the new Single
Instruction Multiple Data (SIMD) instructions that will be introduced in the next
generation of Intel's processors. Once these instructions provide fast and secure AES
encryption and decryption at the hardware level, future AES code generators will
doubtless use them directly, and the efforts taken here to emulate the instructions and
estimate the effect of optimizations by using older SIMD instructions of similar latency
will be a thing of the past.
A minor extension of this work would consider AES-128 and/or AES-192 instead of the
AES-256 dealt with here. However, given that AES is a block cipher that always deals
with 128-bit blocks, the only differences are the smaller number of encryption rounds
and the correspondingly smaller number of round keys. As such, the effect of the
optimizations would be somewhat smaller with AES-128 or AES-192.
Another minor extension of this work would investigate AES on a machine with an
extremely large number of processors, and comparing the performance of pthreads
against OpenMP.
References

[1] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela
Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen
Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, “SPIRAL: Code
Generation for DSP Transforms”, Proceedings of the IEEE special issue on
"Program Generation, Optimization, and Adaptation," Vol. 93, No. 2, 2005, pp.
232-275.
[2] Matteo Frigo and Steven G. Johnson, "The Design and Implementation of
FFTW3," Proceedings of the IEEE 93 (2), 216–231 (2005). Invited paper, Special
Issue on Program Generation, Optimization, and Platform Adaptation.
[3] Joan Daemen , Vincent Rijmen, “The Block Cipher Rijndael”, Proceedings of the
The International Conference on Smart Card Research and Applications, p.277-
284, September 14-16, 1998.
[4] National Institute for Standards and Technology, “Announcing the Advanced
Encryption Standard (AES)”, Federal Information Processing Publication #197,
2001.
[5] Michael Flynn, “Some Computer Organizations and Their Effectiveness”, IEEE
Trans. Comput., Vol. C-21, pp. 948, 1972.
[6] Aart J.C. Bik, “Vectorization with the Intel Compilers”, Intel, 2008.
[7] R.M. Ramanthan, “Extending the World’s Most Popular Architecture”, Intel,
2006.
[8] Nadeem Firasta, “Intel AVX: New Frontiers in Performance Improvements and
Energy Efficiency”, Intel, 2008.
[9] Barbara Chapman, “Using OpenMP: Portable Shared Memory Parallel
Programming”, The MIT Press, 2007.
[10] Unknown, “Compiler optimization”,
http://en.wikipedia.org/wiki/Compiler_optimization (last accessed on 3rd April
2009)
[11] Ilya O. Levin, “A byte-orientated AES-256 implementation”,
http://www.literatecode.com/2007/11/11/aes256/, (last accessed on 4th April
2009).
[12] Shay Gueron, Intel Mobility Group Israel, “AES Instructions Set White Paper”,
Intel, July 2008.
[13] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual”,
November 2007.
[14] P.Mucci, “PapiEx - Execute arbitrary application and measure hardware
performance counters with PAPI”, http://icl.cs.utk.edu/~mucci/papiex/ (last
accessed on 10th April 2009).
[15] P. Mucci et al, “A Scalable Cross-Platform Infrastructure for Application
Performance Tuning Using Hardware Counters”, Proceedings of Supercomputing
2000, 2000.
[16] P. Mucci and N. Tallent, “Monitor - user callbacks for library, process and thread
initialization/creation/destruction”, 2004.
[17] Mark Charney, “Intel Software Development Emulator”, Intel,
http://www.intel.com/software/sde/ (last accessed on 14th April 2009).
[18] Robert Konighofer, “A Fast and Cache-Timing Resistant Implementation of the
AES”, Proceedings of the Cryptographer’s Track at RSA Conference 2008, 2008.
[19] Intel, “Intel 64 and IA-32 Architectures Software Developers Manual”,
November 2007.
[20] Guido Bertoni et al, “Efficient Software Implementation of AES on 32-Bit
Platforms”, Proceedings of Cryptographic Hardware and Embedded Systems
2002: p159-171, 2002.