
AES Implementation in Verilog

Siddharth Chatterjee Sushobhan Nayak

February 18, 2011

1 Original Encryption Algorithm and its Modification

In our implementation of the AES cipher, we have essentially followed the approach of [2]. In the write-up below, we first state the algorithm described in [1] and then present the equivalent algorithm of [2]. We then describe the efficient pipelined use of the modified algorithm for maximum throughput with minimum use of space. The following figure presents the pseudocode of both algorithms. In what follows, we describe the Verilog module used to implement each function in the modified algorithm.

Figure 1: Comparison of both algos


1.1 State representation

The data to be encrypted and the key to be used are both 128-bit streams. [1] represents them as a 4-by-4 matrix of bytes, as shown below. In [1], an input stream is arranged column-wise (refer to the figure below). However, our implementation arranges the state row-wise. So an arow2column() module is employed to achieve this conversion before the key and the data are used for processing.

Figure 2: Comparison of representation
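The two layouts can be illustrated with a small software model. This is only a sketch in Python, not the Verilog arow2column() module itself: the function names are hypothetical, and it assumes the conversion amounts to a 4-by-4 transpose of the byte order.

```python
def to_state_columnwise(block: bytes):
    """Layout of [1]: input byte i goes to row i % 4, column i // 4."""
    return [[block[r + 4 * c] for c in range(4)] for r in range(4)]


def to_state_rowwise(block: bytes):
    """Row-wise layout: input byte i goes to row i // 4, column i % 4."""
    return [[block[4 * r + c] for c in range(4)] for r in range(4)]


def transpose_block(block: bytes) -> bytes:
    """Reorder 16 bytes so that one layout reads as the other (assumed
    behavior of the conversion step, modelled as a 4x4 transpose)."""
    return bytes(block[4 * (i % 4) + i // 4] for i in range(16))


block = bytes(range(16))
# Loading the transposed block row-wise gives the column-wise state of [1].
same = to_state_rowwise(transpose_block(block)) == to_state_columnwise(block)
```

Since the transpose is its own inverse, the same reordering could be reused to convert results back before output.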

1.2 AddRoundKey()

This function is a simple bit-wise XOR of the key and the generated state, so it has been implemented directly in the main AES module.
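As a brief illustration, the XOR can be modelled in a few lines of Python. The function name add_round_key is hypothetical; the sample block and key are the ones used in the worked example of [1].

```python
def add_round_key(state: bytes, round_key: bytes) -> bytes:
    """Bit-wise XOR of a 128-bit state with a 128-bit round key."""
    return bytes(s ^ k for s, k in zip(state, round_key))


state = bytes.fromhex("3243f6a8885a308d313198a2e0370734")
key = bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c")
mixed = add_round_key(state, key)
# Applying the same key a second time restores the original state.
restored = add_round_key(mixed, key)
```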

1.3 SubBytes() & ShiftRows()

Both functions have been implemented in a single module. As described in Section 5.1.1 of [1], SubBytes() can essentially be implemented through a look-up table. So, in our design, module sbox() achieves this through a combinatorial circuit. ShiftRows() is essentially a cyclic shift of the rows of the state matrix, so both functions have been implemented in a single module, subRow(), following Figure 3.


Figure 3: Row operation
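The combined substitution and row shift can be sketched in software as follows. This is a Python model under stated assumptions, not the subRow() module itself: the S-box is computed from its definition in [1] (inverse in GF(2^8) followed by the affine transformation) rather than wired as a circuit, the state is held row-wise, and all function names are hypothetical.

```python
def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a: int) -> int:
    """Multiplicative inverse computed as a^254; 0 maps to 0."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x: int, n: int) -> int:
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


# S-box: field inverse followed by the affine transformation of [1].
SBOX = []
for value in range(256):
    inv = gf_inv(value)
    SBOX.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)


def sub_row(state):
    """Rotate row r left by r bytes, then substitute every byte."""
    return [[SBOX[b] for b in row[r:] + row[:r]]
            for r, row in enumerate(state)]


state = [[0, 1, 2, 3] for _ in range(4)]
shifted = sub_row(state)
```

Because the substitution acts on each byte independently, it does not matter whether it is applied before or after the rotation, which is what makes merging the two steps into one module straightforward.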

1.4 MixColumns()

This function has been implemented with the help of module mixColumns(), which encompasses modules mult() and xtime(). Please refer to Figures 3, 4 and 5 of [2] for the design diagram. xtime() essentially implements the xtime() function defined in Section 4.2.1 of [1]. As the transformation matrix for this column operation multiplies each element of the state matrix by 01, 02 or 03 in GF(2^8), following [2], we implemented a combinatorial circuit, mult, which takes an input and, based on a select signal, outputs the multiplied result. It is easy to notice that Figure 3 in [2] corresponds to the following set of equations from [1]. As defined in Section 4.2.1 of [1], multiplication by any constant can be expressed through xtime() and XOR operations. So the invMux() module used in decryption, which uses the constants 0e, 0d, 09 and 0b, can also be easily generated.

Figure 4: Column operation
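The column operation can be sketched in Python as below. The names xtime and mult mirror the modules mentioned above, but this is a software model rather than the circuit: the select signal is modelled as the integer constant 1, 2 or 3, and mix_column is a hypothetical helper that processes one column.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8), as defined in [1]."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def mult(b: int, select: int) -> int:
    """Multiply by {01}, {02} or {03} depending on the select value."""
    if select == 1:
        return b
    if select == 2:
        return xtime(b)
    if select == 3:
        return xtime(b) ^ b
    raise ValueError("select must be 1, 2 or 3")


# First row of the MixColumns() matrix; the other rows are rotations of it.
MIX_ROW = [2, 3, 1, 1]


def mix_column(col):
    """Apply the MixColumns() matrix to one 4-byte column."""
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= mult(col[c], MIX_ROW[(c - r) % 4])
        out.append(acc)
    return out


mixed = mix_column([0xDB, 0x13, 0x53, 0x45])
```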


1.5 Inverse Cipher

The equivalent inverse cipher described in Section 5.3.5 of [1] follows exactly the same procedure for decryption, albeit with each function replaced by its inverse. So the inverse of each function was implemented, and invMixColumn() was applied to the key schedule as described in [1].

2 Design Decisions and Implementation Issues

For details of the pipelined implementation, please refer to Section III of [2]. To avoid repetition, we do not describe them again and focus only on the design issues we faced:

2.1 Pipelining

We considered two possible ways of pipelining. One is to use one pipeline stage for each of the 11 stages of the encryption process, thereby increasing the throughput. The other is the one described in [2]. An 11-stage pipeline would have a 3 to 4 times higher throughput than a 3-stage one, but it requires eleven 128-bit registers, which takes up too much space and risks exceeding the capacity of the FPGA. So we stuck with the implementation of the paper. The following are the design issues we faced:

As described in the following subsection, we run the key scheduler first to generate the whole 1408 bits needed for the entire process. Once the schedule is ready, it sets signal keyGen low, signalling the control unit that it can go ahead with the encryption process. The control signal has three output bits: control[0] decides whether new data should be let into the pipeline or the next stage of the old data is to be selected; control[1] is set high when the data is ready at the output; control[2] chooses between the output of the MixColumns() function and the data not processed through it, which is crucial in the last stage of encryption of a 128-bit data stream. So, when keyGen is set low, control[0] is set high and is held for the next three clock cycles to allow three new 128-bit data streams to enter the pipeline. Once, say, data1 is in, it takes 3 clock cycles to return to Register0.


So, after the 30th clock cycle, the output for data1 is ready. On the 30th, control[1] is set high, so that we can get the output in the 31st clock cycle. It is held for the next two cycles to allow for output of the next two data streams. So, in 33 cycles, we get 3 outputs, leading to a throughput of

$$T_p = \frac{3 \times 128}{33 \times \text{clockCycle}}$$

which in our case, considering that the minimum clock period is 4.97 ns, amounts to 2.34 Gbps.

At each stage, the key schedule to be used has to be decided. Since there are 11 stages, we used a 4-bit stateInfo to keep track of the stage of encryption for a single 128-bit data block. When new data enters the pipeline, the last four bits of Register0, which we have implemented as a 132-bit register, are set to zero to indicate that we are in the first stage. This is incremented each time the output of the AddRoundKey() function is put in Register1, so that when the same data returns to Register0 after processing, we can choose the next key schedule just by looking at those 4 bits.
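The throughput figure and the 132-bit register layout described above can be checked with a short Python sketch. The helper names are hypothetical, and placing stateInfo in the four least significant bits of the register is an assumption made for illustration; the text only says "the last four bits".

```python
BLOCK_BITS = 128


def throughput_bps(clock_period_s: float, blocks: int = 3,
                   cycles: int = 33) -> float:
    """Bits per second when `blocks` blocks complete every `cycles` cycles."""
    return blocks * BLOCK_BITS / (cycles * clock_period_s)


def pack_register(data: int, state_info: int) -> int:
    """Model of a 132-bit register: 128 data bits plus a 4-bit stage tag
    (assumed here to sit in the least significant bits)."""
    return (data << 4) | (state_info & 0xF)


def unpack_register(reg: int):
    """Split the register back into (data, state_info)."""
    return reg >> 4, reg & 0xF


def next_stage(reg: int) -> int:
    """Increment the stage tag while leaving the data bits untouched."""
    data, state_info = unpack_register(reg)
    return pack_register(data, state_info + 1)


gbps = throughput_bps(4.97e-9) / 1e9   # clock period quoted in the text
reg = pack_register((1 << 128) - 1, 0)  # new data enters with stage 0
reg = next_stage(reg)
```

With the 4.97 ns clock period, the first function reproduces the throughput figure quoted above, rounded to two decimals.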

2.2 keyExpansion() Routine

Based on speed and space considerations, there are three ways to go about performing key expansion. The encryption process takes 11 key matrices, amounting to a total of 1408 generated bits. A purely combinatorial implementation would use too many XOR gates (around 1200). We considered three sequential key generation techniques:

Sequential/Dynamic: As only 128 bits are used at each stage of the encryption process, the key schedule for each stage can be generated dynamically. The implementation is shown in Figure 5. With each clock cycle, the previous value is loaded into the register. Another 128-bit register, R0, has to be maintained to hold the original key, so that when the present cycle of encryption ends, the original key can be loaded for the next run. As is evident from our pipelined design, the key for stage i has to be held for three consecutive clock cycles (as there are three stages in the pipeline, the same key acts on three different inputs in three clock cycles), and we employed a counter to achieve this. When one encryption cycle ends, new data is loaded conditioned on the control[0] signal; the same signal is used to load the key with the contents of R0, enabling us to synchronize the process of data and key selection.


Sequential/Static: While the above method is very efficient in terms of space and speed, it has the obvious disadvantage that it cannot be used for the decryption process. Also, assuming only one key schedule is to be used for a single encryption process on any file, and if processor/memory space is not much of a consideration, it is wasteful of energy to regenerate the key schedule throughout the entire process when it can easily be generated in one run and then saved for use in subsequent stages. So we also implemented this approach, as we have implemented the decryption process too. The process is started with a reset signal, whereupon the key scheduler runs and generates a 1408-bit key schedule to be used in the 11 stages of one encryption of a 128-bit data input. Once the key is generated, the module outputs a keyGen signal, giving the control module the go-ahead to start the encryption process by setting control[0] to one to accept data.

Sequential/RAM: If space is very much an issue and we need to implement the decryption process nonetheless, and if we are willing to compromise on speed and throughput, we can run the Sequential/Dynamic scheduler and then use the block RAM in the FPGA (which can read or write 128 bits in a single clock cycle) to store the schedule for later use.
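The difference between generating round keys on the fly and storing the whole schedule can be sketched in Python. This is a software model only, not the Verilog scheduler: next_round_key stands in for one step of the dynamic scheme, static_schedule for the stored 1408-bit array, and both names are hypothetical. The S-box is computed from its definition in [1] so that the block is self-contained.

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8) by shift-and-add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def rotl8(x: int, n: int) -> int:
    return ((x << n) | (x >> (8 - n))) & 0xFF


def sbox_entry(value: int) -> int:
    """Field inverse (a^254) followed by the affine transformation."""
    inv = 1
    for _ in range(254):
        inv = gf_mul(inv, value)
    inv = inv if value else 0
    return (inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
            ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)


SBOX = [sbox_entry(v) for v in range(256)]


def next_round_key(prev: bytes, rcon: int) -> bytes:
    """Dynamic step: derive the next 128-bit round key from the previous."""
    words = [prev[4 * i:4 * i + 4] for i in range(4)]
    # Rotate and substitute the last word, then add the round constant.
    temp = [SBOX[b] for b in words[3][1:] + words[3][:1]]
    temp[0] ^= rcon
    out = []
    for word in words:
        temp = [a ^ b for a, b in zip(word, temp)]
        out.extend(temp)
    return bytes(out)


def static_schedule(key: bytes):
    """Static scheme: generate and keep all 11 round keys (1408 bits)."""
    keys, rcon = [key], 0x01
    for _ in range(10):
        keys.append(next_round_key(keys[-1], rcon))
        rcon = xtime(rcon)
    return keys


schedule = static_schedule(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
total_bits = sum(len(k) * 8 for k in schedule)
```

In the dynamic scheme only the most recent round key has to be held, whereas the static list keeps every key available in any order, which is what decryption requires.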

2.3 I/O module

The I/O module we used is shown in Figure 8. We employ the I/O module in a generic device that contains a memory which can read or write 32 bits in a single clock cycle. A 128-bit block is read 32 bits at a time and stored in the first register in 4 clock cycles, which is controlled by InCounter. Three registers are used so that when the pipeline needs three consecutive inputs, they can be provided. It takes 4 × 3 = 12 clock cycles to fill up the three registers. Given that the pipeline requires data every 33 clock cycles, there is more than enough time to keep the data ready when the need arises. The same is the case with the output.
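The cycle budget above can be illustrated with a small Python model of the input side. This is a sketch under assumptions: the helper name assemble_block is hypothetical, and the word order (first word read is the most significant) is chosen only for illustration.

```python
WORD_BITS = 32        # memory width per clock cycle
WORDS_PER_BLOCK = 4   # 4 x 32 = 128 bits
INPUT_REGISTERS = 3   # one per pipeline stage
PIPELINE_PERIOD = 33  # cycles between requests for new data


def assemble_block(words):
    """Build one 128-bit block from four 32-bit reads (first word is
    assumed to be the most significant)."""
    block = 0
    for word in words:
        block = (block << WORD_BITS) | (word & 0xFFFFFFFF)
    return block


fill_cycles = WORDS_PER_BLOCK * INPUT_REGISTERS
slack_cycles = PIPELINE_PERIOD - fill_cycles
block = assemble_block([0x00112233, 0x44556677, 0x8899AABB, 0xCCDDEEFF])
```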

2.4 Sbox

[2] uses the ROM in the FPGA for the Sbox look-up table implementation. While this reduces the number of FPGA slices used, it also brings down the speed that a combinatorial implementation would otherwise achieve. We decided to go for higher speed and hence implemented a combinatorial module of 16 Sboxes.

3 Decryption routine

It is easy to notice that the equivalent inverse cipher described in Section 5.3.5 of [1] is equivalent in its operations to the encryption process we have implemented. So it can be used in a very similar way, except that we replace each of the functions with its inverse and introduce a minor change in key lookup. The issues are:

AddRoundKey() is its own inverse, as it is simply an XOR operation.

InvSubByte() and InvSubRow() are easily reproducible by defining an inverse Sbox, which is a look-up table, and rotating the bytes in Register0 appropriately.



InvMixColumn() can be easily implemented if we recognize the fact that any multiplication of numbers in GF(2^8) can be implemented through xtime() and XOR operations. The inverse matrix and the generated numbers are shown in Figure 6.

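We use the following equations, where s is one byte long and belongs to GF(2^8):

$$
\begin{aligned}
\{02\} \bullet s &= \mathrm{xtime}(s) \\
\{04\} \bullet s &= \mathrm{xtime}(\{02\} \bullet s) \\
\{08\} \bullet s &= \mathrm{xtime}(\{04\} \bullet s) \\
\{0e\} \bullet s &= (\{02\} \oplus \{04\} \oplus \{08\}) \bullet s = (\{02\} \bullet s) \oplus (\{04\} \bullet s) \oplus (\{08\} \bullet s) \\
\{0d\} \bullet s &= (\{01\} \oplus \{04\} \oplus \{08\}) \bullet s = s \oplus (\{04\} \bullet s) \oplus (\{08\} \bullet s) \\
\{0b\} \bullet s &= (\{01\} \oplus \{02\} \oplus \{08\}) \bullet s = s \oplus (\{02\} \bullet s) \oplus (\{08\} \bullet s) \\
\{09\} \bullet s &= (\{01\} \oplus \{08\}) \bullet s = s \oplus (\{08\} \bullet s)
\end{aligned}
$$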

So, following the encryption process, the invMult() unit is shown in Figure 7. This module replaces the mult() module in MixColumns() to produce InvMixColumn(), with proper care taken for the enable signals.



The key schedules used are taken from the generated key array. We look at the stateInfo, select the appropriate key, pass it through InvMixColumn() and then, conditioned on the stateInfo, pass on either the original key or the output of InvMixColumn() (the first and last stages use the unprocessed key).
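The xtime()-based equations for the constants 0e, 0d, 0b and 09 can be checked with a short Python sketch. It is a software model, not the invMult() circuit: the function names are hypothetical, and inv_mix_column processes a single column using the inverse matrix of [1].

```python
def xtime(b: int) -> int:
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    if b & 0x100:
        b ^= 0x11B
    return b


def inv_mult(s: int, coeff: int) -> int:
    """Multiply s by {0e}, {0d}, {0b} or {09} using only xtime and XOR."""
    s2 = xtime(s)    # {02} * s
    s4 = xtime(s2)   # {04} * s
    s8 = xtime(s4)   # {08} * s
    if coeff == 0x0E:
        return s2 ^ s4 ^ s8
    if coeff == 0x0D:
        return s ^ s4 ^ s8
    if coeff == 0x0B:
        return s ^ s2 ^ s8
    if coeff == 0x09:
        return s ^ s8
    raise ValueError("coeff must be 0x0e, 0x0d, 0x0b or 0x09")


# First row of the inverse matrix; the other rows are rotations of it.
INV_ROW = [0x0E, 0x0B, 0x0D, 0x09]


def inv_mix_column(col):
    """Apply the inverse MixColumns matrix to one 4-byte column."""
    out = []
    for r in range(4):
        acc = 0
        for c in range(4):
            acc ^= inv_mult(col[c], INV_ROW[(c - r) % 4])
        out.append(acc)
    return out


recovered = inv_mix_column([0x8E, 0x4D, 0xA1, 0xBC])
```

The three xtime() results are shared by all four constants, so one chain of three xtime() stages per byte is enough to serve every multiplier.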

4 Specifications on Synthesis

1. Platform used: Xilinx Virtex 5, XUPV5-LX110T.

2. Maximum clock speed achieved was 340.022 MHz, corresponding to a minimum clock period of 2.941 ns.

3. Maximum throughput achieved was 3.8 Gbps. As expected, due to the combinatorial implementation of the Sbox, the throughput almost doubled.

4. An estimated 530 registers and flip flops were used, with some 450 XORgates.



References

1. National Institute of Standards and Technology, Advanced Encryption Standard, Federal Information Processing Standards 197, November 2001.

2. Nadia Nedjah, Luiza de Macedo Mourelle, Marco Paulo Cardoso, A Compact Pipelined Hardware Implementation of the AES-128 Cipher, Third International Conference on Information Technology: New Generations (ITNG'06), pp. 216-221, 2006.
