University of Dublin
TRINITY COLLEGE
ENCRYPTION CODE GENERATOR
Paul Magrath
B.A. (Mod.) Computer Science
Final Year Project - May 2009
Supervisor: David Gregg
School of Computer Science and Statistics
O’Reilly Institute, Trinity College, Dublin 2, Ireland
Declaration
I hereby declare that this thesis is entirely my own work and that it has not been
submitted as an exercise for a degree at any other university.
_________________________________ April 24th, 2009
Paul Magrath
Permission to Lend
I agree that the Library and other agents of the College may lend or copy this thesis upon
request.
_________________________________ April 24th, 2009
Paul Magrath
Acknowledgements
To David Gregg, for his support and advice throughout this project.
To Laura, for the love you have given me and for putting up with me.
Table of Contents
Declaration
Permission to Lend
Acknowledgements
1. Motivation
    Introduction
    Readers' Guide to the Report
        Background
        AES Code Generator
        Experimental Results
        Conclusions
        References
2. Background
    Encryption
    AES
    SIMD
    SSE
    OpenMP
    Optimization
3. AES Code Generator
    I. Correctness: An AES-256 implementation
    II. The Generator
        A. Streaming store
        B. Unwind inner loop
        C. Use local variables
        D. Unwind outer loop
        E. Interleave
        F. OpenMP
        G. Prefetch to cache
        H. Preload to register
    III. Simulating
4. Experimental Results
    Intel Core 2 Quad 2.4GHz
        Sequential
        Parallel
    Intel Core 2 Duo 2.16GHz
        Sequential
        Parallel
    Intel Pentium 4 Dual Processor
        Sequential
        Parallel
5. Conclusions
    Contributions
    Future Work
References
1. Motivation
Introduction
In the world today, encryption is vital. Without it, there is no security, freedom or privacy
at all. Governments, corporations, strangers and criminals all routinely attempt to gather
as much information as possible about us and what we do online and in real life in order
to profile us, tempt us, learn about us and defraud us respectively. Every day, with social
networking and online records, more and more of this information becomes available.
Even more can be gathered through measures such as deep packet inspection.
Encryption is a solution to these problems in that it allows us to have a modicum of
control over who can access the data that we distribute, who can listen to our calls, read
our mail, and read our bank statements.
This project investigates using a code generator to generate the various variants of an
Advanced Encryption Standard (AES) encryption loop. AES (see ‘Background’, chapter
2) is a form of encryption that Intel will begin supporting in hardware in their
microprocessors in 2010. An encryption loop is the loop that iterates over all the data to
be encrypted, performing the steps necessary to encrypt it. As such it is a
computationally expensive loop, requiring a large amount of CPU time, and is hence a
candidate for optimization in order to reduce the time taken.
A code generator, in this context, is a program that generates a number of different
variants of a piece of code in order to find which combination of optimization techniques
yields the best possible result (see ‘AES Code Generator’, chapter 3, for variants
employed here). The use of code generators is an established technique for solving
problems in optimizing for modern architectures, used primarily in the research
community. It is ideal for optimizing a small piece of code that uses a vast amount of
processing time and to which the best optimizations are not obvious. Code generators
have been successfully applied to several projects such as [1] and [2]. The motivation for
a code generator is to avoid the problems with code maintenance and code readability
that almost inevitably result from hand-tuned assembly specific to the architecture it is
written for. A code generator can tune itself to the architecture it is running on to find the
best combination of optimizations for that architecture, while remaining readable and
maintainable as it can be written in a high level language, such as C++.
In 2010 Intel will release processors that will have AES instructions built in to their
instruction set. This will greatly reduce the cost of encryption, as it will be possible to
perform the encryption much more quickly and efficiently than previously. This is part of
an enhancement of, and replacement for, the current Intel SIMD instruction set, the
Streaming SIMD Extensions (SSE). The replacement instruction set will be known as the
Intel Advanced Vector Extensions (AVX).
This project examines the most likely ways in which an AES encryption loop built on
such instructions could be optimised, shows how this optimisation can be automated
using a code generator, and presents the results of running this AES code generator,
with its various optimisation combinations, on different architectures and processors.
Readers’ Guide to the Report
Background
A summary of what readers should understand and be aware of when reading the
report, particularly Encryption, AES, SIMD, SSE, OpenMP and Optimisation.
AES Code Generator
An outline and discussion of the various variants supported by the generator that was
implemented as part of the project as well as an explanation of how the correctness of the
input and output AES encryption loops was confirmed.
Experimental Results
Tables and diagrams summarising and demonstrating the results obtained by the timing
and measurement of the outputs from the Encryption Code Generator using hardware
performance counters, or other means as available.
Conclusions
A discussion of the conclusions that can be drawn from the results and of the possible
future works that can build upon this project.
References
A systematic and complete reference to sources used and a classified list of all sources.
2. Background
Encryption
Encryption is the translation of data into a secret code. This is done in an attempt to keep
information secure. To read an encrypted file, you must have access to a secret key or
password that enables you to decrypt it. Hence, third parties without access to the shared
secret key, such as online criminals, curious neighbours and oppressive governments, are
unable to easily (or, in the case of strong encryption, at all) access the data or information
that has been encrypted. Unencrypted data is called plain text while encrypted data is
referred to as cipher text.
There are two main types of encryption: asymmetric encryption (also known as public
key encryption) and symmetric encryption. Asymmetric encryption is a form of
encryption where keys come in pairs. What one key encrypts, only the other can decrypt.
The encryption key is usually made public and distributed freely as only the holder of the
decryption key is able to read the data that has been encrypted with the encryption key.
Symmetric encryption is a form of encryption where the same key is used for both
encryption and decryption. The key must be kept secret, and is shared by the message
sender and recipient. This form of encryption can usually be performed much more
quickly and efficiently than asymmetric encryption. In practice, asymmetric encryption is
usually used to encrypt an insecure communication channel in order to allow for the
exchange of the secret key for the symmetric encryption that will be used for the rest of
the communications. This technique effectively combines the strengths of the two forms
of encryption and is the basis of the SSL/TLS family of encryption protocols, including
HTTPS, which is used every day for online banking and shopping.
AES
AES (Advanced Encryption Standard) is one of the most popular algorithms used in
symmetric encryption.
Originally published as Rijndael [3], it was adopted as a standard by the U.S. government
in November 2001 [4], after a five-year standardization process involving fifteen
competing designs. The standard comprises three block ciphers, AES-128, AES-192 and
AES-256, adopted from a larger collection. A block cipher is a cipher that operates on
fixed-length groups of bits, termed blocks, with an unvarying transformation. The block
cipher takes in two inputs, the plaintext of the block and the secret key, and outputs the
ciphertext (encrypted text) of the block. Each AES cipher has a 128-bit block size, which
means that 128-bits of the plaintext are encrypted into ciphertext in each iteration of the
encryption loop. AES-128, AES-192 and AES-256 have secret keys of sizes 128, 192 and
256 bits respectively where AES-128 is the least secure while AES-256 is the most
secure.
When the length of data to be encrypted exceeds the block size, a mode of operation must
be used [5]. The two that we will concern ourselves with are Electronic Code Book
(ECB) and Counter (CTR). These will be discussed in detail in the AES Code Generator
(chapter 3).
An outline of the algorithm for AES-256 encryption is:
Key Expansion
Using the Rijndael key schedule, the 14 round keys are extracted from the 256-bit
secret key.
Initial Round (round 0)
The initial round is simply the bitwise XOR of the round key to the plaintext
(referred to as the ‘state’ during the encryption).
Rounds 1 through 12
Each of these rounds is comprised of a non-linear substitution step, a transposition
step, a mixing step, and a bitwise XOR of the round key to the state.
Final Round (round 13)
The final round is identical to Rounds 1 through 12 except the mixing step is
omitted.
It should be noted that the key expansion only has to be performed once for any given
secret key. Hence, the encryption loop is composed only of the Initial Round, the Rounds,
and the Final Round as the round keys extracted during key expansion can be reused for
whatever many iterations of the encryption loop it takes to encrypt the entirety of the
plaintext.
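The loop structure described above can be sketched in C. This is a minimal sketch of the control flow only: the round helpers below perform just the bitwise XOR of the round key, with the substitution, transposition and mixing steps deliberately elided, so it demonstrates the round structure rather than real AES; block_t, add_round_key and encrypt_block are illustrative names, not the implementation described later in this report.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Structural sketch of the AES-256 encryption of one block.
   Only the round-key XOR step is performed; the substitution,
   transposition and mixing steps are elided, so this is NOT AES. */
typedef struct { uint8_t b[16]; } block_t;

static block_t add_round_key(block_t state, block_t key)
{
    for (int i = 0; i < 16; i++)
        state.b[i] ^= key.b[i];
    return state;
}

/* Encrypt one 128-bit block with 14 pre-expanded round keys.
   Key expansion happens once; this function is then called for
   every block, reusing the same round keys. */
block_t encrypt_block(block_t state, const block_t keys[14])
{
    state = add_round_key(state, keys[0]);      /* initial round (0)       */
    for (int r = 1; r <= 12; r++)               /* rounds 1 through 12     */
        state = add_round_key(state, keys[r]);  /* (sub/shift/mix elided)  */
    return add_round_key(state, keys[13]);      /* final round (13)        */
}
```

Because every step in this sketch reduces to XOR, applying it twice with the same keys returns the original block; the real cipher of course does not have this property.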
SIMD
SIMD (Single Instruction, Multiple Data) is a technique employed to achieve data level
parallelism. Parallelism is when calculations are carried out simultaneously. In SIMD
computer architecture, the computer exploits multiple data streams against a single
instruction stream in order to perform operations that may be easily parallelized [6].
By processing multiple data elements in parallel, SIMD processors provide a way to
utilize data parallelism in applications that apply a single operation to all elements in a
data set, such as a vector or matrix [7].
SSE
Streaming SIMD Extensions (SSE) is a SIMD instruction set extension to the x86
architecture. It was designed by Intel and introduced with the Pentium III microprocessor
family.
SSE originally added eight new 128-bit registers (a register is a small, fast storage
location inside the processor) known as XMM0 through XMM7. The 64-bit variant of the
x86 architecture, x86-64, has a further eight registers referred to as XMM8 through
XMM15. XMM0 through XMM15 can be accessed in 64-bit operating mode, while only
XMM0 through XMM7 can be accessed in 32-bit operating mode.
Each register packs together four 32-bit single-precision floating point numbers or two
64-bit double-precision floating point numbers or four 32-bit integers or eight 16-bit short
integers or sixteen 8-bit bytes or characters. (Hence, each register could hold the contents
of an entire 128-bit AES block.) There have been a number of iterations of SSE, each of
which has added a number of enhancements in terms of instructions. SSE4 is the version
of SSE that is supported by the current Intel Core microarchitecture [8].
AVX (Advanced Vector Extensions) is an advanced version of SSE, which will appear in
Intel products in 2010, and which features a 256 bit data path (widened from 128 bits in
SSE4). AVX will provide six new instructions for symmetric encryption/decryption using
the Advanced Encryption Standard (AES) and one instruction performing carry-less
multiplication (PCLMULQDQ), which aids in performing finite field arithmetic (a type of
arithmetic used in advanced block cipher encryption). These hardware-based primitives
provide a security benefit apart from their speed advantage by avoiding table-lookups and
hence protecting against software side channel attacks (attempts to discover the secret
key by observing and analyzing the flow of information in the computer during the
encryption process). [9]
However, we do not need to program in assembly in order to utilise any of these
extensions to the instruction set. Instead we can use intrinsics. Intrinsics are special
functions for which the compiler generally has a specific optimisation path and which the
compiler encodes to one or more machine instructions. They represent a good middle
ground between speed of execution and ease of use for the programmer. With modern
optimising compilers, particularly the Intel C++ Compiler, we can achieve nearly as
good results as the best hand-tuned assembly code, without compromising code
readability or having to worry about register management.
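As an illustration, the 128-bit XOR used throughout an AES loop can be written with intrinsics rather than assembly. This is a sketch under stated assumptions: xor_block is a made-up helper name, and the example assumes an x86 machine with SSE2 support (the emmintrin.h header).

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_xor_si128, ... */
#include <stdint.h>

/* XOR one 16-byte block against another in a single 128-bit SSE
   operation. Each intrinsic maps to (roughly) one machine
   instruction; the compiler handles register allocation itself. */
void xor_block(const uint8_t *src, const uint8_t *key, uint8_t *dst)
{
    __m128i a = _mm_loadu_si128((const __m128i *)src);  /* unaligned load  */
    __m128i b = _mm_loadu_si128((const __m128i *)key);
    _mm_storeu_si128((__m128i *)dst, _mm_xor_si128(a, b));
}
```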
OpenMP
The OpenMP Application Program Interface (API) supports multi-platform shared-
memory parallel programming in C/C++ and Fortran on many architectures, including
Linux, Windows and Mac OS X. It consists of a set of compiler directives, library
routines, and environment variables that influence run-time behavior. OpenMP gives
programmers a simple model and interface for developing parallel applications [10].
The Hello World example below demonstrates how easy it is to create additional threads
to carry out work using the API provided by OpenMP:
#include <stdio.h>

int main(int argc, char* argv[]) {
    #pragma omp parallel
    printf("Hello World!\n");
    return 0;
}
The above program will print “Hello World!” once for each thread in the parallel team,
which by default is typically one thread per processor core on the machine.
Similarly, a for loop can be parallelised, with its iterations divided among the threads:

int main(int argc, char **argv) {
    const int N = 100000;
    int i, a[N];
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = 2 * i;
    return 0;
}
Optimization
Optimization is the process of tuning the output of a compiler to minimize or maximize
some attribute of an executable computer program. The most common requirement is to
minimize the time taken to execute a program; a less common one is to minimize the
amount of memory occupied. Optimization is usually applied by the compiler, which will
usually attempt to generate the fastest possible code. However, a programmer will
sometimes attempt to help the compiler out by performing some optimizations manually
as (s)he, in theory, knows the algorithms better.
The general themes of optimizing programs are:
Optimize the common case
The common case may have unique properties that allow a very quick computation at the
expense of a very slow computation for certain less common cases. If the common case is
taken most often, the result can be better over-all performance.
Avoid redundancy
Reuse results that are already computed and store them for use later, instead of re-
computing them.
Less code
Remove unnecessary computations and intermediate values. Less work for the CPU,
cache, and memory usually results in faster execution. Alternatively, in embedded
systems, less code requires less memory and brings a lower product cost.
Straight line code
Less complicated code with fewer jumps and conditional branches will be faster, as
these interfere with the pre-fetching of instructions and thus slow down execution.
Locality
Code and data that are accessed closely together in time should be placed close together
in memory to increase spatial locality of reference, and hence reduce the amount of
register loading.
Manage memory efficiently
Place the most commonly used items in registers first, then caches, then main memory,
before going to disk.
Parallelize
Reorder operations to allow multiple computations to happen in parallel, either at the
instruction (using SIMD instructions such as those in the Intel SSE instruction set),
memory, or thread level (using an API like OpenMP). [11]
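Two of these themes, avoiding redundancy and less code, can be seen in one tiny hypothetical example: hoisting a loop-invariant computation out of the loop body so it is computed once instead of on every iteration. The function names and arithmetic here are purely illustrative.

```c
#include <assert.h>

/* Before: s * s + 1 is recomputed on every iteration even though it
   never changes inside the loop (redundant work). */
int sum_scaled_naive(const int *a, int n, int s)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i] * (s * s + 1);
    return total;
}

/* After: the loop-invariant value is computed once and reused,
   following the "avoid redundancy" and "less code" themes above. */
int sum_scaled_hoisted(const int *a, int n, int s)
{
    const int scale = s * s + 1;
    int total = 0;
    for (int i = 0; i < n; i++)
        total += a[i] * scale;
    return total;
}
```

An optimising compiler will often perform this hoisting itself, but doing it manually is a simple example of the programmer helping the compiler along.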
3. AES Code Generator
AES Code Generator is discussed here under a number of sections:
I. Correctness: An AES-256 implementation
II. The Generator
III. Simulating
IV. Testing
I. Correctness: An AES-256 implementation
The correctness of the input and output is confirmed by means of a customised AES-256
implementation.
The implementation described in this report emulates AES-256 and was based upon a
byte-oriented portable C implementation in which all the lookup tables had been replaced
with “on-the-fly” calculations [12]. The implementation is fully compliant with the
specification and is highly portable. There was no assembler in the original code but I
later wrapped the code with vector functions (SSE functions) and vectorised some, but
not all, of the code. This was done so that the function signatures (i.e. the names and
parameters of the functions) would be compatible with those of the AES instructions that
Intel will introduce as part of AVX (the Advanced Vector Extensions, see Background,
chapter 2).
Since the purpose of including the AES code is to check correctness, the slow speed of
this implementation is not a problem. It is provided as a simple means of verifying that
the implemented variants do not change the basic algorithm of the AES encryption loop
that is the input to the generator.
II. The Generator
The generator system generates a large number of simple variations of a basic AES
encryption loop. These various modifications of the code can then be run on a particular
model of processor, and with various compiler switches, to find the best variant for that
particular processor.
The generator was tested with two block cipher modes of operation of 256 bit AES:
Electronic Code Book (ECB) and Counter (CTR). Electronic code book is the simplest of
the encryption modes. The data to be encrypted is divided into blocks and each block is
encrypted separately. However, identical plaintext blocks are encrypted into identical
ciphertext blocks, so data patterns are easily recognised in the ciphertext. Hence, it is not
recommended for use in cryptographic protocols at all. On the other hand, counter turns a
block cipher into a stream cipher. Instead of encrypting the data itself, successive values
of a “counter” are encrypted. The counter is usually concatenated with, added to, or
XORed with an initialization vector to produce a unique counter value for each block.
The encrypted counter is then XORed with the actual text to be encrypted in order to
form the ciphertext.
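The CTR construction just described can be sketched as follows. The block cipher here is a trivial stand-in (a fixed XOR), chosen only so the mode's control flow is visible and testable; ctr_crypt, encrypt_block and block_t are illustrative names, not the generator's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint8_t b[16]; } block_t;

/* Placeholder block "cipher": XOR every byte with a constant.
   This is NOT AES; it only stands in so the mode can be run. */
static block_t encrypt_block(block_t in)
{
    for (int i = 0; i < 16; i++)
        in.b[i] ^= 0x5A;
    return in;
}

/* CTR mode over nblocks 16-byte blocks: encrypt successive counter
   values (nonce + i) and XOR the result with the text. Encryption
   and decryption are the same operation, which is what turns the
   block cipher into a stream cipher. */
void ctr_crypt(const block_t *src, block_t *dst, size_t nblocks,
               uint64_t nonce)
{
    for (size_t i = 0; i < nblocks; i++) {
        block_t ctr;
        memset(&ctr, 0, sizeof ctr);
        uint64_t v = nonce + i;            /* unique counter value */
        memcpy(ctr.b, &v, sizeof v);
        block_t ks = encrypt_block(ctr);   /* encrypt the counter  */
        for (int j = 0; j < 16; j++)       /* XOR with the text    */
            dst[i].b[j] = src[i].b[j] ^ ks.b[j];
    }
}
```

Because each block depends only on its own counter value, the iterations are independent of one another, which is what makes CTR comparatively easy to unroll, interleave and parallelise.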
The generator can take an input file of an encryption loop of either mode of operation,
electronic codebook (ECB) or counter (CTR), as its input file. These two types of block
cipher mode of operation were chosen because they were the easiest to parallelise. There
are a number of different options available for generating the variants. Each of these
options is, in essence, a distinct optimization, or set of possible configurations of
optimizations, that can be performed. The options are:
Streaming store
Unwind inner loop
Use local variables
Unwind outer loop
Interleave
OpenMP (parallel)
Prefetch to cache
Prefetch to register
In the following sections, these options are described in more detail.
A. Streaming store
In the Streaming Store option, we use an SSE instruction instead of storing the result to
memory using a standard memory assignment. This variant uses the _mm_stream_si128
instruction to store the result directly to memory without polluting the caches. As with
all the options to the generator, specifying it will generate variants with the option both
enabled and disabled.
Before:
    result = encrypt_final(result, *keys);
    result = _mm_xor_si128(result, source[i]);
    dest[i] = result;

After:
    result = encrypt_final(result, *keys);
    result = _mm_xor_si128(result, source[i]);
    _mm_stream_si128(&(dest[i]), result);
Normally, when we write to memory, the cache is updated with the contents of the write
so that if there is a read request for that information soon after the write, it can be recalled
quickly. However, in this situation the information that is being written is the cipher text
that has just been encrypted, and we want to keep the cache for memory that we are going
to be accessing again such as round keys and the plain text to be encrypted.
However, this is an option that can be turned on or off as it can happen that the use of
these instructions can interfere with the compiler’s own optimizations. The generator can
therefore experiment with both versions in combination with lots of other options.
B. Unwind inner loop
This variant unwinds the inner loop to the extent specified in the argument.
Loop unrolling is a technique that attempts to increase the execution speed of the
program at the expense of its size. The loop is rewritten as a sequence of independent
statements, hence reducing (in this case, eliminating) the overhead of evaluating the loop
condition on each of the iterations and reducing the number of jumps and conditional
branches that need to be executed.
There are two side effects of loop unrolling. These are an increased register usage in a
single iteration to store temporary variables (but not in this case, as we are completely
eliminating the loop rather than just unwinding it a little), and the code size expansion
after the unrolling. Large code size can lead to an increase in instruction cache misses.
In this case, the loop is quite short (equal to the number of keys, 14) and hence a
significant speed boost should be observed due to the removal of the control variable
check, the 14 jumps and the conditional branches from the control flow.
Before:
    for (int j = 1; j < nKeys; j++) {
        result = encrypt_round(result, *(keys+j));
    }

After:
    result = encrypt_round(result, *(keys+1));
    result = encrypt_round(result, *(keys+2));
    result = encrypt_round(result, *(keys+3));
    ...
    result = encrypt_round(result, *(keys+12));
    result = encrypt_round(result, *(keys+13));
Loop unrolling can also aid the compiler and the processor in performing their own
optimizations, such as instruction scheduling.
C. Use local variables
This variant uses local variables for the AES round keys instead of memory accesses (the
round keys are sub-keys used for the individual rounds extracted from the cipher key
using the Rijndael key schedule). This involves defining the variables, assigning them the
round keys from their memory locations and updating all references in the input file to
the memory location to refer to the variables instead. The idea is that assigning the round
keys to variables gives the compiler a strong hint to keep the round keys in
registers rather than performing memory accesses (it is faster to access registers than the L1
cache, where the round keys will probably reside).
It is also observable that the number of round keys that are stored in registers can have a
negative impact on run time. This is due to the impact that storing them in registers has on
the number of registers available for other purposes, such as storing temporary variables
such as results. This is particularly relevant when a high level of unwinding of the outer
loop has also occurred, especially when it has been interleaved as well.
As such, up to 2^14 different variants of the number of round keys stored in local variables
as opposed to accessed from memory can be generated. There are 2^14 different variants as
there are 14 round keys that could each be stored in a local variable, and hence 2^14 different
combinations of round keys in local variables and memory accesses. However, for testing
purposes, only the 14 combinations formed by loading successive round keys into local
variables were looked at.
Before:
    for ( i = 0; i < limit; i++ ) {
        ...
        result = encrypt_round(result, *(keys+1));
        result = encrypt_round(result, *(keys+2));
        result = encrypt_round(result, *(keys+3));
        ...
        result = encrypt_round(result, *(keys+12));
        result = encrypt_round(result, *(keys+13));

After:
    const vector_type key0 = keys[0];
    const vector_type key1 = keys[1];
    const vector_type key2 = keys[2];
    const vector_type key3 = keys[3];
    ...
    const vector_type key12 = keys[12];
    const vector_type key13 = keys[13];

    for ( i = 0; i < limit; i++ ) {
        ...
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        result = encrypt_round(result, key3);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
Using local variables for the round keys also has the bonus that it allows the compiler to
apply other optimizations much earlier in the process than if the round keys are left as
elements of an array as it can be difficult for the compiler to prove that the array is
unaliased (i.e. does not reference the same location or variables as another pointer).
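The aliasing issue can be illustrated with a small hypothetical example (the function names and the int keys are made up for illustration; the real loop uses 128-bit vector keys). In the first function the compiler must assume that dst might overlap keys, so it cannot safely keep keys[1] in a register across the stores; copying the key to a local variable removes that obstacle.

```c
#include <assert.h>

/* With a raw pointer, the compiler must assume dst may alias keys
   (const does not rule this out), so keys[1] may be reloaded from
   memory on every iteration. */
void rounds_aliased(int *dst, const int *keys, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] ^= keys[1];
}

/* Copying the round key to a local proves it cannot change during
   the loop, so the compiler is free to keep it in a register. */
void rounds_local(int *dst, const int *keys, int n)
{
    const int key1 = keys[1];
    for (int i = 0; i < n; i++)
        dst[i] ^= key1;
}
```

When dst and keys do not overlap, the two functions compute identical results; the difference is only in what the compiler is allowed to assume, and hence in the code it can generate.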
D. Unwind outer loop
This variant unwinds the outer loop to the extent specified in the argument. The
technique, and the side effects, are the same as with unwinding the inner loop, except the
effect is much greater as a result of the greater number of instructions involved.
There is a certain threshold beyond which the returns from the unwinding rapidly
diminish; then the effects of the limited number of registers and of instruction cache
misses take hold, and run time increases again.
Before:
    for ( i = 0; i < limit; i++ ) {
        vector_type result;
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
    } // end outer loop

After:
    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result;

        // iteration 0
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
        // end of original outer loop

        // iteration 1
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, key1);
        result = encrypt_round(result, key2);
        ...
        result = encrypt_round(result, key12);
        result = encrypt_round(result, key13);
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i+1]);
        dest[i+1] = result;
        // end of original outer loop
    } // end unrolled loop
E. Interleave
This variant interleaves an unwound outer loop to the extent specified in the argument.
The idea is that by interleaving the loop, we reduce the number of instructions that
stall waiting on the result of a previous instruction.

Since each iteration works on a different intermediate result, and revisits that result a
number of times as it works through the encryption rounds, it makes sense to place
operations of the same round, rather than of the same iteration, after one another. The
delay that would otherwise exist within an iteration between the first and second rounds,
while waiting for the result of the first, is instead filled by calculating the first round
of the next iteration.
It is also potentially very important for performance that each key is used multiple
times in sequence, so that it need only be loaded once for each run of instructions
using that key.
Before (unrolled):

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result;
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, (key1));
        result = encrypt_round(result, (key2));
        ...
        result = encrypt_round(result, (key12));
        result = encrypt_round(result, (key13));
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i]);
        dest[i] = result;
        // end of original outer loop
        result = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result = encrypt_initial(result, enckey);
        // encryption
        result = encrypt_round(result, (key1));
        result = encrypt_round(result, (key2));
        ...
        result = encrypt_round(result, (key12));
        result = encrypt_round(result, (key13));
        // final round
        result = encrypt_final(result, key0);
        result = _mm_xor_si128(result, source[i+1]);
        dest[i+1] = result;
        // end of original outer loop
    } // end unrolled loop

After (interleaved):

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        result0 = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i) );
        result1 = _mm_add_epi64( nonce, _mm_set_epi32(0,0,0,i+1) );
        // initial round
        result0 = encrypt_initial(result0, enckey);
        result1 = encrypt_initial(result1, enckey);
        // encryption
        result0 = encrypt_round(result0, (key1));
        result1 = encrypt_round(result1, (key1));
        result0 = encrypt_round(result0, (key2));
        result1 = encrypt_round(result1, (key2));
        result0 = encrypt_round(result0, (key3));
        result1 = encrypt_round(result1, (key3));
        ...
        result0 = encrypt_round(result0, (key11));
        result1 = encrypt_round(result1, (key11));
        result0 = encrypt_round(result0, (key12));
        result1 = encrypt_round(result1, (key12));
        result0 = encrypt_round(result0, (key13));
        result1 = encrypt_round(result1, (key13));
        // final round
        result0 = encrypt_final(result0, key0);
        result1 = encrypt_final(result1, key0);
        result0 = _mm_xor_si128(result0, source[i]);
        result1 = _mm_xor_si128(result1, source[i+1]);
        dest[i] = result0;
        dest[i+1] = result1;
        // end of original outer loop
    } // end unrolled loop
F. OpenMP
This variant includes the OpenMP pragma directives that are commented out in the
other variants. This allows the investigation of parallel versions of the code, for
running on processors with multiple cores and/or simultaneous multithreading.
Any of the other variants can be generated with the OpenMP variant to create both
threaded and non-threaded versions of the same code to investigate which is the most
efficient of the options.
Before:

    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        vector_type src0;
        vector_type src1;

After:

    #pragma omp parallel for
    for ( i = 0; i < (limit-2); i+=2 ) {
        vector_type result0;
        vector_type result1;
        vector_type src0;
        vector_type src1;
G. Prefetch to cache
This variant uses the _mm_prefetch SSE intrinsic to prefetch source data to the cache.
The instruction loads one cache line of data from the given address to a location closer to
the processor. The idea is to prime the cache for the next iteration(s) of the encryption
loop.
However, a balance must be struck to ensure that priming the cache too far ahead does
not poison it (i.e. if we prefetch too many lines, or prefetch them too soon, lines that
we still need may be evicted). Hence, the generator must generate a large number of
variants in order to find a version that prefetches the source data just far enough ahead.

The number of iterations ahead of its use at which source data is prefetched can also
affect the speedup gained from the optimization, so a number of variants must be
produced in order to find the one with the greatest speedup.
Example:
_mm_prefetch((char const*)&source[i+2],_MM_HINT_T0);
H. Preload to register
This variant uses a variable to preload source data into a register. The idea is to touch
the L1 cache so that it is primed with the line from which the source data for the next
iteration(s) of the encryption loop will be taken. This is important because the prefetch
SSE instruction only brings data into the L2 cache.
However, only a limited number of registers are available. Hence, the generator must
generate a large number of variants in order to find the version that leads to an
improvement in runtimes.
Example:
const vector_type src2 = source[i+2];
Most Intel microprocessors are also capable of prefetching in hardware, so attempts at
software prefetching sometimes have little, no, or even a negative effect if the hardware
prefetcher does a better job.
III Simulating
The new architecture that Intel will release in 2010 introduces six Intel SSE instructions
that facilitate encryption. Four of them, namely AESENC, AESENCLAST,
AESDEC, and AESDECLAST, facilitate high-performance AES encryption and
decryption. The other two, AESIMC and AESKEYGENASSIST, support the
AES key expansion procedure. Together, these instructions will provide full hardware
support for AES, offering security, high performance, and a great deal of flexibility. [13]
In order to simulate the performance effect of the actual AES instructions without the
hardware that supports them, the generator supports replacing the AES encryption
instructions in the input code with instructions of various latencies (provided in
#defines in the input code) that serve as proxies for the purposes of testing the effect
of the optimizations. This allows the code to be tested at several different latencies
to give an idea of the effect of the speedups implemented by the variants. This is done
only for the four instructions that facilitate high-performance AES encryption and
decryption. The two instructions supporting AES key expansion are not emulated:
since key expansion is done only once regardless of the amount of plaintext to be
encrypted, and is relatively quick, there is little benefit in studying speedups of it.
For the purposes of this project, instructions of latencies of two, three and five cycles
respectively were chosen for comparison, using the available Intel documentation [12].
Intel documentation published to date [13] indicates that the initial chips supporting AES
encryption in hardware will do so in six cycles. As such, the figures for the effect of the
various optimizations on five-cycle latency instructions are the most relevant for the first
generation of hardware supporting the instructions (given that there are not currently any
six- or seven-cycle latency instructions in the SSE instruction set). The figures for two-
and three-cycle latencies, however, represent what we can expect from the second or third
generation of hardware.
IV Testing
In order to test my code and obtain my experimental results, I ran the generated code
under a number of compiler flags and architectures: specifically, with both the GNU C
Compiler and the Intel C Compiler, on 32- and 64-bit multiprocessors, and on machines
with a single core, dual cores and eight cores.
In order to generate the detailed results reported in the next section, I used PapiEx.
PapiEx is a performance analysis tool designed to transparently and passively measure
the hardware performance counters of an application using PAPI [15]. The
Performance API (PAPI) project specifies a standard application programming interface
(API) for accessing the hardware performance counters available on most modern
microprocessors [14]. PapiEx uses Monitor, a library that gives the user callbacks or
traps upon events related to library/process/thread initialization/creation/destruction, to
intercept process and thread creation and destruction effortlessly [17].
Using PapiEx, I was able to get accurate information on the number of instructions
issued, executed and completed, the number of data cache misses and the number of stall
cycles, as well as the total number of cycles the execution took.
At the time of writing, the Intel C Compiler did not yet support the compilation of the
AES instructions; however, it should be trivial to test the correctness of the output using
the Intel Software Development Emulator [18].
4. Experimental Results
The experimental results illustrated below showcase the effects of some of the
variants. The data for these graphs came from running quite small subsets of the possible
output of the generator and graphing the intermediate results – specifically, the
generator's medians of the runtimes from twenty-five consecutive runs of each variant.
By default, the generator runs through all the possible combinations unless, as here, it is
passed arguments, in which case it generates only the combinations of the specified
variants. Normally, the 'winner' variant printed by the generator at the end is all one
would be concerned with; the detailed graphing and discussion of the effects of each
variant and combination here is simply to demonstrate what is going on inside the
generator.
The AES Code Generator can be run on practically any system with a working C++
compiler, provided the generated code is compiled with SSE support appropriate to the
microprocessor. To get the most out of it, however, one should use a compiler that also
supports OpenMP; the GNU C++ Compiler (v4.2+) and the Intel C++ Compiler (v10.1+)
are hence the best choices.
In general, the AES Code Generator simply has to be deployed to the target machine and
executed in order for it to build itself, generate the variants, test them, and report back
the fastest variant. The optimization process is entirely automated, and no further
user intervention is needed.
To demonstrate this, data generated by the AES Code Generator will be shown from
three different architectures:
Intel Core 2 Quad 2.4GHz (4 cores)
Intel Core 2 Duo 2.16GHz (2 cores)
Intel Pentium 4 Dual Processors
Intel Core 2 Quad 2.4GHz
Figure 1: Effect on runtime in terms of processor cycles of combinations of Streaming Store and of Parallel
(on an Intel Core 2 Quad)
The graph above shows the base case (Sequential) along with the possible combinations
of the OpenMP variant (Parallel) and the Streaming Store variant.
As you can see, the effect of using OpenMP can be quite significant, with a speedup of
about 250%, while the Streaming Store seems to be largely ineffective, perhaps because
the work of the hardware prefetcher (the part of the processor that speeds up the flow of
data accessed by the program [20]) negates the effect of not using the instruction. The
speedup from multiple threads is not proportional to the number of threads used, as the
processor has other bottlenecks that constrain the speed at which it can execute
instructions even when there is work for all four cores; the limited memory bus
bandwidth, in a case such as this where a large amount of data is being processed, could
quite easily become the bottleneck.
A number of variants are possible but we will look next at the effect of putting the round
keys (the encryption keys used for the individual rounds in the AES algorithm - see
‘Background’, chapter 2) into local variables.
Sequential
Figure 2: Effect on runtime in terms of processor cycles of combinations of Streaming Store, of Parallel
and of using Local Variables to store round keys (on an Intel Core 2 Quad)
In the chart above, you can see the combined effect of several simple variants: OpenMP,
Streaming Store and the use of local variables to store the round keys. The unwinding of
the inner loop has also been performed, as that optimization is a prerequisite for using
local variables to store the round keys. Each of the four possible combinations of use and
non-use of OpenMP (Parallel) and Streaming Store is represented by one of the four
colours, while the number of local variables used is on the x-axis. The y-axis is the
number of CPU cycles taken by the encryption loop.
As you can see, the use of multiple threads in the parallel version provides a considerable
speed boost, to which the use of local variables is largely irrelevant, whereas in the
sequential version the use of about nine local variables seems to be best. In either case,
the use of Streaming Store appears largely irrelevant, indicating that the hardware
prefetcher in the microprocessor and the compiler are doing an excellent job without any
help from the Streaming Store.
Figure 3: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 4: Effect on runtime in terms of issued instructions of varying levels of Unwinding and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 5: Effect on runtime in terms of completed instructions of varying levels of Unwinding and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 6: Effect on runtime in terms of processor stall cycles of varying levels of Unwinding and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 3 shows the effect of local variables combined with unwinding on the runtime.
As you can see, the combination of a high number of round keys in local variables and a
high level of unwinding is very effective here. Figures 4 through 6 show the effect of the
optimizations using other metrics: issued instructions, completed instructions and stall
cycles respectively. Lower is better for all of these, with the minimization of stall cycles
the most important of the three. In each, we see a similar pattern indicating that a high
number of round keys in local variables and a high level of unwinding are very effective
when combined.
The diagram below (Figure 7) demonstrates the effect of the fully unwound outer loop
when combined with the Streaming Store option. As you can see, the fastest variants are
again clearly those with both a high level of unrolling and a high number of round keys in
local variables, but the fastest here are slower than the fastest on the previous page, where
streaming store was disabled.
Figure 7: Effect on runtime in terms of processor cycles of varying levels of Unwinding and numbers of
round keys in local variables when Streaming Store is used (on an Intel Core 2 Quad)
Clearly, the streaming store optimization is not worthwhile for this architecture so it will
not be looked at again here. We can postulate that the hardware prefetcher in the
microprocessor and the compiler are doing an excellent job without any help from the
Streaming Store.
A further optimization of the unrolled outer loop is to interleave it. The effect of
this is shown in Figure 8 below.
Figure 8: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 9: Effect on runtime in terms of processor cycles of varying levels of Interleaving and numbers of
round keys in local variables (on an Intel Core 2 Quad)
Figure 10: Effect on runtime in terms of issued instructions of varying levels of Interleaving and numbers
of round keys in local variables (on an Intel Core 2 Quad)
Figure 11: Effect on runtime in terms of completed instructions of varying levels of Interleaving and
numbers of round keys in local variables (on an Intel Core 2 Quad)
As you can see from Figure 8, the fastest variants are clearly those with high levels of
interleaving and a fair number of round keys in local variables, as demonstrated by their
runtimes in the range of 200,000 to 300,000 cycles.

Comparing the variants with unrolling alone to the variants with interleaving, one can
clearly see a considerable speed advantage for the interleaved versions. This is likely due
to the reduction in cache misses that results from keeping operations with the same round
key together, rather than operations on the same state and result. When unrolling alone is
used, it will often still be necessary to load the round keys into registers from the caches,
with a considerable length of time between uses, whereas when the code is interleaved,
all the operations with the same round key are performed without interruption.
From Figures 8 through 11, and comparing to the corresponding graphs for unwinding
alone (Figures 3 through 6), we can clearly see that code generated by the AES Code
Generator that uses interleaving is significantly faster than that using unrolling alone on
this architecture since the number of cycles, instructions and stalls is nearly uniformly
lower in the interleaved variants.
The next optimization we will look at is prefetching some of the plaintext source data to
be encrypted into the caches. For these graphs, we are only looking at the prefetching
variants with five local variables and interleaving to a factor of ten (the generator itself
would run through all the possible combinations, not just these).
Below is a graph (Figure 12) of the effect of pre-fetching to cache (the level two cache)
in terms of cycles.
Figure 12: Effect on runtime in terms of processor cycles of varying the source data line and number of
iterations ahead to prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)
From Figure 12 above, you can see that this technique can shave a couple of thousand
cycles off the runtime, but there is little in the way of a general rule. Multiple runs show
these results to be distinctly consistent, however, indicating that software prefetching can
beat the hardware prefetcher. Hence, this is a case where a code generator is perfectly
suited, as it can generate and test all the permutations. Figure 13 below shows the effect
on the L2 cache itself in terms of the misses that resulted.
Figure 13: Effect on L2 Cache Misses of varying the source data line and number of iterations ahead to
prefetch when prefetching to L2 cache (on an Intel Core 2 Quad)
Figure 14: Effect on runtime in terms of processor cycles of varying the source data line and number of
iterations ahead to prefetch when prefetching to register (on an Intel Core 2 Quad)
Figure 15: Effect on L1 Cache Misses of varying the source data line and number of iterations ahead to
prefetch when prefetching to register (on an Intel Core 2 Quad)
Figure 14 shows the effect of prefetching to register (the level 1 cache, effectively) in
terms of cycles, while Figure 15 shows the effect on the L1 cache itself in terms of the
misses that resulted. Although the results here are a little less unpredictable, they again
show that a generator is in its element in this kind of search for the most efficient
combination of optimizations, especially when dealing with new, unseen architectures
such as those that will support AES in hardware from 2010 onwards.
This section has looked at the effect of various optimizations that the generator can
easily generate, test, measure and compare. What has not yet been looked at, though, is
the effect of using multiple threads to perform the encryption in parallel; that is the
subject of the next section.
Parallel

In the previous section, we looked at running optimizations on the sequential
version of the encryption code loop. However, the AES Code Generator generates all
these optimizations for the parallel, OpenMP, version of the code, just as easily.
In the diagram below we see the effect of unrolling and the number of round keys in local
variables with the OpenMP variant.
Figure 16: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Quad)
As you can see from the diagram above, the effect of the unwinding, although not as
pronounced as in the sequential version, is still significant. One could postulate that this
is because the slower sequential version puts less pressure on the shared memory bus.
In the diagram below we see the effect of interleaving and the number of round keys in
local variables with the OpenMP variant.
Figure 17: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Quad)
Again, the interleaved version is clearly faster than the unrolled version alone and, if you
compare it to the interleaved version without OpenMP (sequential) from the previous
section, you will see that although it is close, the OpenMP (parallel) version is still
distinctly faster. On a machine with multiple processors rather than multiple cores, it is
entirely possible that the use of multiple threads working in parallel would yield a greater
speedup, as the memory bus to/from the processor could well be a bottleneck here.
To conclude this section, we have shown that the generator can apply the techniques used
to optimize sequential versions equally to parallel versions. It should be noted that,
although we have separated the sequential and parallel versions here, the generator treats
the OpenMP variant as just another variant to be tried in combination with all the others.
As such, the generator will conclude with a specific recommendation, as can be seen
below.
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-14LocalVariables-Interleaved11.cpp
Intel Core 2 Duo 2.16GHz
In dealing with the final two architectures, we will focus on the most interesting areas
rather than going as in-depth as previously, in order to avoid repetition. As such, we will
be concentrating on the parallel versions of the unrolled and interleaved outer loop and
the effect of different combinations of local variables used for round keys and levels of
unwinding or interleaving respectively.
Sequential
Tail of the AES Code Generator output:
Winner is: output-ctr-L5-UI-01LocalVariables-Interleaved10.cpp
On this architecture, as you can see above, the fastest sequential version was an
interleaved version which explicitly put only a single round key into a local variable. The
most likely reason for the low number of local variables used for round keys is that the
compiler performed much the same optimization itself, or that the hardware prefetcher
performed better than the generator's software efforts.
Parallel
Figure 18: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Duo)
The graph above shows the effect of the unrolling of the outer loop while the graph below
shows the effect of the interleaving of the outer loop. As you can see, the unrolled
versions are significantly quicker on average while the effect of the interleaving is more
predictable but with a larger possible range.
Figure 19: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Core 2 Duo)
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-12LocalVariables-Interleaved03.cpp
As you can see from the Generator output above though, an interleaved version is the
fastest overall.
Intel Pentium 4 Dual Processor
Sequential
Tail of the AES Code Generator output:
Winner is: output-ctr-L5-UI-05LocalVariables-Interleaved14.cpp
On this architecture, as you can see above, the fastest sequential version was an
interleaved version which explicitly put only five round keys into local variables. The
interleaving could be expected from the previous sections, and the relatively low use of
local variables could be due to the lower number of registers available, as this is a 32-bit
machine whereas the others were 64-bit machines.
Parallel
Figure 20: Effect of varying levels of Unwinding and numbers of round keys in local variables when
multiple threads are used (on an Intel Pentium 4 Dual Processor)
Figure 21: Effect of varying levels of Interleaving and numbers of round keys in local variables when
multiple threads are used (on an Intel Pentium 4 Dual Processor)
Tail of the AES Code Generator output:
Winner is: output-omp-ctr-L5-UI-00LocalVariables-Interleaved07.cpp
As you can see, the fastest version was one which eschewed the use of local variables for
round keys entirely. This could be due to it being a 32-bit machine with fewer registers
available, so the compiler and hardware were able to do the best job of managing them.
5. Conclusions
The AES Code Generator is a valuable means of finding the most efficient, optimized
AES code for any given architecture. In general, for the architectures surveyed here, the
interleaved, parallel variants seem to be the most efficient. This is most likely due to the
use of all available cores through multiple threads, and to the reduction in loads from
cache to register brought about by interleaving the outer encryption loop. The number of
local variables used for storing round keys was consistently low on the 32-bit
architecture and medium on the 64-bit architectures, reflecting the greater number of
registers available in 64-bit mode.
To conclude, this report will outline what contributions this project has made to the
current state of the art and briefly discuss what future work could be attempted that could
build upon the progress achieved in this project.
Contributions
This project has made a number of contributions to the state of the art.
Firstly, it is one of the very first to deal with the new AES instruction set which will
appear in the next generation of Intel processors.
Secondly, it provides a proof of the concept of using a code generator to provide all the
various optimized variants of the standard AES encryption loop.
Thirdly, it extends the idea of self-tuning generators to the area of encryption code
generation.
Future Work
Future work will no doubt focus on optimizations taking advantage of the new Single
Instruction Multiple Data (SIMD) instructions that will be introduced in the next
generation of Intel's processors. Once these instructions provide fast and secure AES
encryption and decryption at the hardware level, future AES code generators will
doubtless use them directly, and the efforts taken here to emulate the instructions and
estimate the effect of optimizations by using older SIMD instructions of similar latency
will be a thing of the past.
A minor extension of this work would consider AES-128 and/or AES-192 instead of the
AES-256 dealt with here. However, given that AES is a block cipher that always deals
with 128-bit blocks, the only differences are the smaller number of encryption rounds
and the correspondingly smaller number of round keys. As such, the effect of the
optimizations would be somewhat smaller with AES-128 or AES-192.
Another minor extension of this work would investigate AES on a machine with an
extremely large number of processors, and comparing the performance of pthreads
against OpenMP.
References

[1] Markus Püschel, José M. F. Moura, Jeremy Johnson, David Padua, Manuela
Veloso, Bryan Singer, Jianxin Xiong, Franz Franchetti, Aca Gacic, Yevgen
Voronenko, Kang Chen, Robert W. Johnson, and Nick Rizzolo, “SPIRAL: Code
Generation for DSP Transforms”, Proceedings of the IEEE special issue on
"Program Generation, Optimization, and Adaptation," Vol. 93, No. 2, 2005, pp.
232-275.
[2] Matteo Frigo and Steven G. Johnson, "The Design and Implementation of
FFTW3," Proceedings of the IEEE 93 (2), 216–231 (2005). Invited paper, Special
Issue on Program Generation, Optimization, and Platform Adaptation.
[3] Joan Daemen , Vincent Rijmen, “The Block Cipher Rijndael”, Proceedings of the
The International Conference on Smart Card Research and Applications, p.277-
284, September 14-16, 1998.
[4] National Institute for Standards and Technology, “Announcing the Advanced
Encryption Standard (AES)”, Federal Information Processing Publication #197,
2001.
[5] Michael Flynn, “Some Computer Organizations and Their Effectiveness”, IEEE
Trans. Comput., Vol. C-21, pp. 948, 1972.
[6] Aart J.C. Bik, “Vectorization with the Intel Compilers”, Intel, 2008.
[7] R.M. Ramanthan, “Extending the World’s Most Popular Architecture”, Intel,
2006.
[8] Nadeem Firasta, “Intel AVX: New Frontiers in Performance Improvements and
Energy Efficiency”, Intel, 2008.
[9] Barbara Chapman, “Using OpenMP: Portable Shared Memory Parallel
Programming”, The MIT Press, 2007.
[10] Unknown, “Compiler optimization”,
http://en.wikipedia.org/wiki/Compiler_optimization (last accessed on 3rd April
2009)
[11] Ilya O. Levin, “A byte-orientated AES-256 implementation”,
http://www.literatecode.com/2007/11/11/aes256/, (last accessed on 4th April
2009).
[12] Shay Gueron, Intel Mobility Group Israel, “AES Instructions Set White Paper”,
Intel, July 2008.
[13] Intel, “Intel 64 and IA-32 Architectures Optimization Reference Manual”,
November 2007.
[14] P.Mucci, “PapiEx - Execute arbitrary application and measure hardware
performance counters with PAPI”, http://icl.cs.utk.edu/~mucci/papiex/ (last
accessed on 10th April 2009).
[15] P. Mucci et al, “A Scalable Cross-Platform Infrastructure for Application
Performance Tuning Using Hardware Counters”, Proceedings of Supercomputing
2000, 2000.
[16] P. Mucci and N. Tallent, “Monitor - user callbacks for library, process and thread
initialization/creation/destruction”, 2004.
[17] Mark Charney, “Intel Software Development Emulator”, Intel,
http://www.intel.com/software/sde/ (last accessed on 14th April 2009).
[18] Robert Konighofer, “A Fast and Cache-Timing Resistant Implementation of the
AES”, Proceedings of the Cryptographer’s Track at RSA Conference 2008, 2008.
[19] Intel, “Intel 64 and IA-32 Architectures Software Developers Manual”,
November 2007.
[20] Guido Bertoni et al, “Efficient Software Implementation of AES on 32-Bit
Platforms”, Proceedings of Cryptographic Hardware and Embedded Systems
2002: p159-171, 2002.