hdl design laboratory report 1

HDL Design Laboratory

Report

Hardware Implementation of

An Encryption Algorithm

Daniel Thangaraj Stanley

MSCE / WS 2013

Matrikel-Nr: 03636685

User ID: ga39por

February 01, 2013

CONTENTS

I. Answers for all the marked questions

II. Delay Estimation

III. Comparison of Different Implementations

IV. Simulation Plots with all Inputs and Outputs

V. Screenshots using Ideatester

Answers for all the marked questions

1) Preparation Question 1:

The variables Xi(1), i = {1,2,3,4}, are the inputs to the hardware, i.e. the data to be encrypted, which is also the input to the first round. Similarly, Yi(9) are the outputs of the encryption algorithm, i.e. the outputs of the output transformation.

The inputs Xi(r+1) of a round are fed by the outputs Yi(r) of the previous round r, where i = {1,2,3,4}.


If either Xi or Zi is 0, the corresponding input, a or b, is set to 2^n, which requires n+1 bits. Otherwise a and b are set to the values of Xi and Zi with a zero prepended as the MSB to make them n+1 bits wide.

Representing a and b in the range {1,…,2^n} requires n+1 bits. In the worst case (a = b = 2^n), the product ab = 2^(2n) requires 2n+1 bits.

The value of n for the modulo-multiplier in the IDEA Algorithm

is 16.

In the modulo 2^n operation, the n LSBs remain and the upper n+1 bits are masked. Thus, n bits are needed for the result of (ab mod 2^n).

The simple bit operation equivalent to (ab div 2^n) is a right shift by n bits. The bits of ab from the MSB (bit 2n) down to bit n remain. After the division these bits occupy positions n down to 0, so n+1 bits are needed to store the result.

The result of (ab mod (2^n+1)) is zero if (ab mod 2^n) = (ab div 2^n). This equality implies that bits 2n-1 down to n of ab are identical to bits n-1 down to 0 (from the previous answers) and that the top bit, bit 2n, is 0. Using this result we can write ab = 2^n C + C = (2^n+1)C, where C = (ab mod 2^n) = (ab div 2^n).

Since 2^n+1 is a prime number for n = 16 (2^16+1 = 65537), ab can be a multiple of 2^n+1 only if a or b is zero or is itself a multiple of 2^n+1. This is not possible, because a and b lie in the range {1,…,2^n}, which contains neither 0 nor 2^n+1. Hence this case never occurs.
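The answers above can be tied together in a small software model. The following is a minimal Python sketch, not the VHDL used in the lab; the function name `mulmod` and the constant names are hypothetical, and the final subtraction step is the usual low/high-word reduction, assumed here because 2^n is congruent to -1 modulo 2^n+1.

```python
N = 16                 # operand width of the IDEA modulo-multiplier
MOD = (1 << N) + 1     # 2^n + 1 = 65537 for n = 16
MASK = (1 << N) - 1    # selects the n LSBs


def mulmod(x, z):
    """Multiply two n-bit words modulo 2^n + 1; the word 0 stands for 2^n."""
    # Map the n-bit inputs into the range {1, ..., 2^n}.
    a = x if x != 0 else 1 << N
    b = z if z != 0 else 1 << N
    ab = a * b             # at most 2^(2n), i.e. 2n + 1 bits
    lo = ab & MASK         # ab mod 2^n: keep the n LSBs
    hi = ab >> N           # ab div 2^n: right shift by n bits
    # lo == hi would mean ab = (2^n + 1) * lo, which cannot happen
    # for a, b in {1, ..., 2^n} (see the argument above).
    assert lo != hi
    r = lo - hi if lo > hi else lo - hi + MOD
    return r & MASK        # a result of 2^n is written back as the word 0
```

For example, `mulmod(3, 5)` gives 15, and `mulmod(0, 0)` gives 1, since 2^16 * 2^16 = (-1)(-1) = 1 modulo 65537.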


If there are two modulo-multipliers, two adders and unlimited

XOR modules, then the round can be split up as follows.

Fig: Round module split using two modulo-multipliers and two adders

This design needs two partial steps, each using the two modulo-multipliers and the two adders together with some XORs.

Registers have to be inserted at the outputs of the modulo-multipliers, adders and XORs of each partial step.

The design for the datapath with 2 modulo-multipliers and two

adders can be found below.

Fig: Design for the datapath with 2 modulo-multipliers and two adders

DELAY ESTIMATION


Since a SLICE can perform two XOR operations, to perform a 16

bit XOR operation we require 8 SLICES.


Since the XOR operation is performed only by the F/G function generators, we consider the propagation delay of the function generators alone. Since the function generators within a SLICE and the SLICES themselves operate in parallel, a 16 bit XOR operation takes 3ns. Because all bits are computed in parallel, this time is independent of the input width.


The adder used here is a ripple carry adder. It consists of a sum logic and a carry logic, given by

S_n = X_n ⊕ Y_n ⊕ C_n

C_(n+1) = X_n Y_n + (X_n + Y_n) C_n

The XOR operations are performed by the function generators, and the carry bit is calculated by the carry logic inside the SLICE. Thus each SLICE implements two full adders. At time t=0 all input signals are valid. The sum of the LSB is available after 3ns and the carry bit after a further 1ns, so the sum of the next bit is available after 4ns; similarly, a 16 bit addition takes 18ns. The longest delay occurs when the LSB generates a carry and all other bit positions propagate it.
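The sum and carry logic of the ripple carry adder can be checked with a bit-level model. The following Python sketch is only an illustration: the function names are hypothetical, the carry-in of each stage is the carry-out of the previous stage, and dropping the final carry (addition modulo 2^16) is an assumption made here to match the 16 bit word width.

```python
def full_adder(x, y, c):
    """One adder stage: sum bit and carry-out from two operand bits and carry-in."""
    s = x ^ y ^ c                       # sum logic: XOR of the operand bits and the carry-in
    c_next = (x & y) | ((x | y) & c)    # carry logic: generate, or propagate the carry-in
    return s, c_next


def ripple_add(x, y, width=16):
    """Add two words bit by bit; the carry ripples from the LSB to the MSB."""
    carry = 0
    result = 0
    for n in range(width):
        s, carry = full_adder((x >> n) & 1, (y >> n) & 1, carry)
        result |= s << n
    return result   # the final carry is dropped (assumed: addition modulo 2^width)
```

The call `ripple_add(0xFFFF, 1)` exercises the worst case described above: the LSB generates a carry and every other position propagates it.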


One SLICE consists of two function generators. Each function generator can be programmed as a 2:1 MUX. The SLICE also has a built-in 2:1 MUX at the outputs of the function generators. Combined, these function as a 4:1 MUX for one bit, so a 16 bit 4:1 MUX requires 16 SLICES. The propagation delay is only the 3ns delay of the function generator, since the delay of the output MUX is neglected.


The time taken for the entire encryption can be computed from the delays of the round modules and of the output transformation block. As shown in the figure, the longest propagation path in one round includes 3 modulo-multipliers, 2 adders and 2 XORs; in the output transformation block, the longest path consists of one modulo-multiplier only.

From the previous calculations we have:

Delay of XOR: 3ns

Delay of adder: 10ns

Delay of modulo-multiplier: 22ns

Thus the delay of the longest path in a round is 3(22) + 2(10) + 2(3) = 92ns. Since there are 8 round modules followed by an output transformation, the total delay is 8(92) + 22 = 758ns.

The number of encryptions per second is given by 1/delay, i.e. 1/758ns = 1319261 encryptions per second.
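The arithmetic of this estimate can be reproduced in a few lines. This is a plain Python check using the component delays quoted in this answer; the variable names are chosen here for illustration only.

```python
# Component delays in nanoseconds, as used in this answer.
T_XOR = 3
T_ADD = 10
T_MUL = 22

# Longest path in one round: 3 modulo-multipliers, 2 adders, 2 XORs.
t_round = 3 * T_MUL + 2 * T_ADD + 2 * T_XOR

# 8 rounds followed by the output transformation (one modulo-multiplier).
t_total = 8 * t_round + T_MUL

# Throughput without pipelining: one encryption per total delay.
encryptions_per_second = 1e9 / t_total

print(t_round, t_total, int(encryptions_per_second))
```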

Yes, it is possible to start a new calculation before the current encryption has finished by using additional registers. The output of each round is stored in registers and passed to the input of the next round module on the next clock pulse, while the first round receives the next set of inputs from the input registers. This works provided the key value remains the same.

9) Preparation Question 9:

The shortest clock cycle corresponds to the delay of the longest combinational logic path. For RCS I this path includes the round module, a 16 bit 2:1 MUX and the register. The delay of the round module is known from the previous answer.

Time delay = tround + treg + tmux = 92+4+3 = 99ns.

Thus the shortest clock cycle is 99ns.

The number of clock cycles required to complete one encryption is 8.


The shortest clock cycle is found by calculating the delay of the longest combinational logic path. In the clocked round, the longest path runs from register R5 to register R8. It includes one multiplier, one adder, two multiplexers and the register.

tlong = tmul + tadder + 2 tmux + tsetup = 22+10+2(3)+4 = 42ns.

Thus the shortest clock cycle for the clocked round is 42ns.

For this path the multiplexer is switched to S = 10.

The register at the end of this path is R8.

Fig: Longest path in the clocked round is the path from the register R5 to R8.


The shortest clock cycle for RCS II (complete encryption) corresponds to the delay of the longest path, which runs from register R8 to the outer register R3. It includes one multiplier, one adder, two MUXes, one XOR and one register.

tlong = tmul + tadder + 2 tmux + txor + tsetup = 22+10+2(3)+3+4 = 45ns

The shortest clock cycle for RCS II = 45ns

Thus the minimal clock cycle for RCS II is longer than the shortest clock cycle of the clocked round. The longer cycle is caused by the additional delay of the XOR operation. The trace of the longest path is shown below.

Fig: Longest path from register R8 to the outer register R3
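The three clock periods derived above follow the same pattern: sum the delays along the longest register-to-register path. The Python sketch below restates those sums; the variable names are hypothetical, and the register delay (treg) and setup time (tsetup) are both taken as 4ns, as in the calculations above.

```python
# Delays in nanoseconds, as used in the answers above.
T_XOR, T_ADD, T_MUL, T_MUX, T_REG = 3, 10, 22, 3, 4
T_ROUND = 3 * T_MUL + 2 * T_ADD + 2 * T_XOR   # unclocked round module

# RCS I: round module, one 2:1 MUX and the register.
t_clk_rcs1 = T_ROUND + T_REG + T_MUX

# Clocked round (R5 to R8): multiplier, adder, two MUXes, register.
t_clk_round = T_MUL + T_ADD + 2 * T_MUX + T_REG

# RCS II (R8 to outer R3): the same path plus one XOR.
t_clk_rcs2 = t_clk_round + T_XOR

print(t_clk_rcs1, t_clk_round, t_clk_rcs2)
```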


The total number of clock cycles needed for a complete encryption with RCS II is the sum of the clock cycles for the 8 round modules and the output transformation.

Each round takes 8 clock cycles, and the output transformation takes 6 clock cycles.

Clock cycles = 8(8) + 6 = 70 clock cycles. The output is available at the next rising edge of the clock. Thus a complete encryption with RCS II requires 71 clock cycles.

Assuming that the start signal is active at clock edge 0, the init signal is set at the following rising edges of the clock: 0, 8th, 16th, 24th, 32nd, 40th, 48th, 56th and 62nd.

The result signal is set at the 7th, 15th, 23rd, 31st, 39th, 47th, 55th, 63rd and 69th rising edges of the clock.

The ready signal is set at the 70th rising edge of the clock.

From Question 11 we know the shortest clock cycle to be 45ns.

Thus the time taken for a complete encryption in RCS II is 71 * 45ns = 3195ns.

Therefore the number of encryptions per second is 1/3195ns = 312989 encryptions/second.


For a complete encryption with RCS II+, each round takes 5 clock cycles and the output transformation takes 4 clock cycles.

Number of clock cycles for complete encryption = 5(8)+4 = 44 clock cycles. The output is available at the next rising edge of the clock. Thus it requires 45 clock cycles.

Time taken to complete the encryption = 45*45ns = 2025ns

Number of encryptions per second is 1/T = 1/2025ns = 493827.2 encryptions per second.
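The cycle counts and throughput figures for RCS II and RCS II+ use the same formula with different cycle counts per round. The following Python sketch restates it; the function name and parameter names are hypothetical, and the one extra cycle models the output becoming available at the next rising clock edge.

```python
def encryption_timing(cycles_per_round, cycles_trafo, t_clk_ns, rounds=8):
    """Return (clock cycles, total time in ns, encryptions per second)."""
    # Rounds plus output transformation, plus one cycle until the output is valid.
    cycles = rounds * cycles_per_round + cycles_trafo + 1
    t_total_ns = cycles * t_clk_ns
    return cycles, t_total_ns, 1e9 / t_total_ns


# RCS II: 8 cycles per round, 6 for the output transformation, 45ns clock.
rcs2 = encryption_timing(8, 6, 45)

# RCS II+: 5 cycles per round, 4 for the output transformation, 45ns clock.
rcs2_plus = encryption_timing(5, 4, 45)

print(rcs2, rcs2_plus)
```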

Comparison of Different Implementations

Implementation | 1 | 2 | 3 | 4 | 5

Encryptions/second | 1319261 | 10869565 | 1265822 | 312989 | 493827.2

Required area in SLICES | 2458 | 2714 | 478 | 293 | 293

Required input/output pins | 192/64 | 193/64 | 194/65 | 194/65 | 194/65

Efficiency (encryptions per second per SLICE) | 536.7 | 4005 | 2648.2 | 1068.2 | 1685.417

1) Direct Implementation:

The number of encryptions per second is as calculated in the previous answers.

From the LUT table, we know that the number of LUTs for a modulo-multiplier is 106, thus the number of SLICES is 53.

The number of SLICES for an XOR is 8, for an adder 8 and for a modulo-multiplier 53. In a round module there are 6 XORs, 4 adders and 4 multipliers.

Number of SLICES per round = 8(6)+8(4)+53(4) = 292 SLICES. For 8 rounds it is 8(292) = 2336 SLICES.

Similarly, the output transformation takes 2 multipliers and 2 adders. Therefore its number of SLICES = 2(53)+2(8) = 122 SLICES.

The total number of SLICES for the direct implementation = 2336+122 = 2458 SLICES.

There are 4 inputs of 16 bits each, thus 16*4 = 64 PINs, and a 128 bit input for the key, thus the input PINs are 64+128 = 192 PINs.

There are 4 outputs of 16 bits each, thus 16*4 = 64 PINs.

Efficiency is number of encryptions per second/required SLICES = 1319261/2458 = 536.721.
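The area and pin counts of the direct implementation can be recomputed from the per-component SLICE figures. The Python sketch below does so; the dictionary and variable names are chosen here for illustration.

```python
# SLICES per 16 bit component, as stated above (106 LUTs = 53 SLICES per multiplier).
SLICES = {"xor": 8, "adder": 8, "mul": 106 // 2}

# One round module: 6 XORs, 4 adders, 4 modulo-multipliers.
round_slices = 6 * SLICES["xor"] + 4 * SLICES["adder"] + 4 * SLICES["mul"]

# Output transformation: 2 modulo-multipliers, 2 adders.
trafo_slices = 2 * SLICES["mul"] + 2 * SLICES["adder"]

# Direct implementation: 8 round modules plus the output transformation.
total_slices = 8 * round_slices + trafo_slices

# Pins: four 16 bit data inputs plus a 128 bit key; four 16 bit outputs.
input_pins = 4 * 16 + 128
output_pins = 4 * 16

print(round_slices, trafo_slices, total_slices, input_pins, output_pins)
```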

2) Direct implementation with simultaneous calculation of several encryptions:

In this case we place registers between the round modules to store the output of each round before it is forwarded to the next round. This ensures that all values propagate to the next round at the same time instant, which enables us to feed a new set of values into round 1. The longest propagation path is now the 92ns of a single round, so a new set of inputs can be applied every 92ns. We can therefore calculate 1/92ns = 10869565 encryptions/second.

The number of SLICES increases, since there are 4 additional registers at the output of each round and each register uses 8 SLICES, which gives 8(4)(8) = 256 additional SLICES. Therefore the total is 256 + 2458 = 2714 SLICES.

The number of input/output PINs is similar to the previous case, with one extra input pin for the clock of the registers. Hence 193 input PINs and 64 output PINs.

Efficiency is number of encryptions per second/required SLICES = 10869565/2714 = 4004.998

3) Hardware Oriented Implementation – Resource Constrained Scheduling I

Since the round module is called 8 times, we are reusing the resources. The number of SLICES for the round module = 6(8)+4(8)+4(53) = 292 SLICES. The output transformation uses 122 SLICES. There are four 16 bit registers, each needing 8 SLICES, and four 2:1 MUXes, each needing 8 SLICES, which requires 4(8)+4(8) = 64 SLICES.

The total number of SLICES = 292 + 122 + 64 = 478 SLICES.

The number of input/output PINs is similar to the previous case, with extra input pins for the clock and start signals and an extra ready output signal. Hence 194 input PINs and 65 output PINs.

Time taken for a complete encryption is given by

T = 8(tround+tsetup)+ttrafo = 8(92+4)+22 = 790ns.

Number of encryptions per second is 1/T = 1/790ns = 1265822

Efficiency is 1265822/478 = 2648.2

4) Hardware Oriented Implementation – Resource Constrained

Scheduling II

The number of encryptions per second is 312989

In the datapath we use 1 adder, 1 multiplier, 5 XORs, 8 registers and four 4:1 MUXes, which use 16 SLICES each. This gives 1(8)+1(53)+5(8)+4(16)+8(8) = 229 SLICES. In addition there are 4 external registers and four 2:1 MUXes, which gives the number of SLICES = 4(8)+4(8)+229 = 293 SLICES.

The number of input/output PINs is the same as in the previous case: 194 input PINs and 65 output PINs.

Efficiency is 312989/293 = 1068.221

5) Hardware Oriented Implementation – Resource Constrained

Scheduling II+

From Question 14, the number of encryptions per second = 493827.2.

The number of SLICES is the same as in the previous case, 293 SLICES.

The number of input/output PINs is the same as in the previous case: 194 input PINs and 65 output PINs.

Efficiency is 493827.2/293 = 1685.417
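The efficiency figures of all five implementations can be recomputed side by side from the throughput and area values derived in this section. The Python sketch below is only a cross-check; the dictionary keys are labels chosen here, not names from the lab.

```python
# (encryptions per second, required SLICES), as derived in this section.
implementations = {
    "1 Direct": (1319261, 2458),
    "2 Direct, pipelined": (10869565, 2714),
    "3 RCS I": (1265822, 478),
    "4 RCS II": (312989, 293),
    "5 RCS II+": (493827.2, 293),
}

# Efficiency = encryptions per second divided by required SLICES.
efficiency = {name: rate / area for name, (rate, area) in implementations.items()}

for name, value in efficiency.items():
    print(f"{name}: {value:.1f}")

# Implementation with the highest throughput per SLICE.
best = max(efficiency, key=efficiency.get)
```

With these inputs, the pipelined direct implementation has the highest efficiency, followed by RCS I.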

SIMULATION PLOTS WITH ALL INPUTS AND

OUTPUTS

1) Direct Implementation

2) RCS1

a. Control Path

b. Result

3) RCS2

a. Control Logic

b. Control logic and its extension

c. Round Counter

d. Result

4) RCS2plus

SCREENSHOTS USING IDEATESTOR

1) Final Testing of RCS1

2) Final Testing of RCS2

3) Final Testing of RCS2plus
