
HARDWARE ARCHITECTURE FOR DATA SECURITY

Thesis Submitted in partial fulfillment for the

Award of Degree of

DOCTOR OF PHILOSOPHY IN ELECTRONICS AND COMMUNICATION ENGINEERING

By

AMBIKA R.

(Reg. No. D898900001)

Under the Guidance of

Dr. S. RAMACHANDRAN

Professor, Department of ECE

SJBIT, BENGALURU – 560060

VINAYAKA MISSIONS UNIVERSITY

SALEM, TAMILNADU, INDIA

AUGUST 2015

Dedicated to

MY BELOVED PARENTS AND TEACHERS


Declaration

I, Ambika R., declare that the thesis entitled “HARDWARE ARCHITECTURE FOR DATA SECURITY”, submitted by me for the award of the Degree of Doctor of Philosophy, is the record of work carried out by me during the period from January 2008 to August 2015 under the guidance of Dr S. Ramachandran and has not formed the basis for the award of any degree, diploma, associateship, fellowship or other title in this or any other University or other similar institution of higher learning.

Place: Bengaluru (AMBIKA R.)

Date:


Certificate by the Guide

I, S. Ramachandran, certify that the thesis entitled “HARDWARE ARCHITECTURE FOR DATA SECURITY”, submitted for the award of the Degree of Doctor of Philosophy by Ms. Ambika R., is the record of research work carried out by her during the period from January 2008 to August 2015 under my guidance and supervision, and that this work has not formed the basis for the award of any degree, diploma, associateship, fellowship or other title in this University or any other university or institution of higher learning.

Place: Bengaluru (Dr S. Ramachandran)

Date:


ACKNOWLEDGEMENTS

I would like to express my deepest gratitude to my research guide

Dr. S. Ramachandran, Professor, Department of Electronics and

Communication Engineering, SJBIT, Bengaluru for his constant support,

patience and timely guidance at every step of my research work,

without whom this herculean task would not have been completed.

My sincere thanks to Vinayaka Missions University, Salem, for

providing me an opportunity to carry out my research work under its

banner. My heartfelt thanks to the Chancellor, Dean (Research) of

Vinayaka Missions University for their constant support. I would like to

thank other officers of Vinayaka Missions University, Salem for their

timely assistance and guidance in completing my research work.

I sincerely thank BMS Educational Trust and Management of

BMS Institute of Technology for supporting me in my endeavor to

complete the thesis work.

It gives me great pleasure to place on record my humble and sincere gratitude and heartfelt thanks to Dr Mohan Babu G. N., Principal, BMSIT&M; Dr R. V. Ranganath, Professor, Department of Civil Engineering, BMSCE; Dr S. Venkateswaran, Principal, BMS Evening College; and Dr A. C. Bhaskar Naidu, former Principal and Professor of Mechanical Engineering.

I would also like to thank Mrs Priyanka Agarwal of SJBIT,

Mrs Vijayalakshmi K. of BMSCE, Mrs S. K. Pushpa, Mrs C. S. Mala,

Mrs Sahana Devanathan, Mr Saneesh Cleatus T., Mrs Shashikala J. of

BMSIT&M, Dr Fathima Jabeen, HOD, ECE, KSSEM, Ms P. C. Sunitha

and Ms Yashaswini, Staff of ECE department, BMSIT&M for all their help

and cooperation in carrying out my work. I thank Mr Ashutosh and

Kashyap for their valuable suggestions during the course of my work.

I thank my parents, sisters, my nephew Kushal B. N. and friends

for their constant encouragement and support.

Ambika R.


Abstract

High-speed data communication among multiple transceivers and in Multiple Input Multiple Output (MIMO) applications demands a robust system model that provides security. Security is essential for realizing secure telephony, e-commerce, e-banking and multi-user secure communication. Confidentiality, authenticity, data integrity and non-repudiation are the major security requirements.

A number of approaches and systems have been designed and developed to ensure data security in competitive multi-user scenarios. Among them, the public key cryptosystem has been recognized as one of the optimum solutions. With the rapid spread of digital communication networks, there is a great need for privacy and security of transmitted data. Safeguarding information has therefore become a major issue, which encryption and decryption systems have been created to address.

Numerous efforts have been made to optimize authentication with RSA cryptosystems, and the majority were implemented on hardware platforms. For competitive multi-transceiver or MIMO applications, however, these approaches are limited in terms of critical latency and power, and hence in overall performance. Most of these schemes are computationally intensive since they use serial Montgomery Multiplication. Further, RSA approaches suffer from reorder cryptosystem limitations. The proposed implementation of commutative cryptography using Parallel Montgomery Multiplication offers a processing time of 0.5 µs at an operating frequency of 100 MHz with 3522 mW of power consumption. It also avoids key exchange complications.

The RSA algorithm has been enhanced by adding commutative behavior. The Commutative RSA approach has exhibited better results for Multiple Input Multiple Output transceiver based secure communication. In the present work, the commutative nature of the RSA algorithm has been proved, and the RSA algorithm has been implemented using both Serial and Parallel Montgomery Multiplication algorithms. A Commutative Cryptography Core with Key Generation has been designed for a distributed Field Programmable Gate Array (FPGA) architecture, which reduces the key exchange overheads.

The results obtained for the Serial Montgomery and Parallel Montgomery Multiplication based Commutative RSA (CRSA) architectures have been compared. Considering performance parameters such as memory occupancy, speed, power consumption, delay and throughput, the proposed Parallel Montgomery based implementation performs better than the Serial Montgomery based CRSA implementation. The delay of the proposed Parallel Montgomery based CRSA is 13.8% lower than that of the Serial Montgomery based CRSA cryptography core. Similarly, its throughput is 12.1% higher than that of the Serial Montgomery based CRSA architecture. In the proposed design, the trade-off between power consumption and area is also very small.

The architecture has been realized in the hardware description language VHDL, conforming to Register Transfer Level (RTL) coding guidelines, without which no chip can work. For convenience, the frequency of operation in simulation has been set to 100 MHz, although the Place and Route report gives 199 MHz. Various encryption and decryption data values at each user location are computed.

Synthesis and Place and Route have been run on the RTL design. The design was synthesized using Xilinx Design Suite 14.3 and the Xilinx Vivado 2012.3 tool, targeting the Virtex-5 xc5vfx70t-2ff1136 FPGA. The design for both encryption and decryption utilizes about 67% of the chip resources, leaving room for future additions, if any. The maximum operating frequency reported by the Xilinx Design Suite 14.3 tool is 199 MHz for CRSA encryption and decryption and 335 MHz for the commutative cryptography core.


TABLE OF CONTENTS

1. INTRODUCTION
1.1. Background
1.2. Classification of Cryptosystem
1.2.1 Symmetric Key Cryptosystem
1.2.2 Public Key Cryptosystem
1.3. The RSA Cryptosystem
1.3.1. Encryption and Decryption
1.3.2. Why RSA?
1.4. Motivations
1.5. Research Objectives
1.6. Methodology Adopted
1.7. Thesis Organization

2. REVIEW OF LITERATURE

3. DEVELOPMENT OF COMMUTATIVE RSA ALGORITHM
3.1. Introduction
3.2. System Model
3.3. Commutative RSA
3.4. Commutative Nature of Commutative RSA Algorithm
3.5. Commutative RSA Implementation with Serial and Parallel Montgomery Multiplication
3.5.1 Modular Exponentiation
3.5.2 Modular Multiplication
3.5.3 Montgomery Multiplication Algorithm
3.5.4 Radix-2 Modular Multiplier
3.6 Modular Multiplication Algorithms
3.6.1 Algorithm for Sequential Binary (T, E, M)
3.6.2 Algorithm for Parallel Binary (T, E, M)

4. DEVELOPMENT OF ALGORITHM FOR COMMUTATIVE CRYPTOGRAPHY CORE WITH KEY GENERATION
4.1. Sequential Implementation of Key Generation
4.1.1. Linear Feedback Shift Register
4.1.2. Fibonacci LFSR
4.1.3. Galois LFSR
4.1.4. Primality Test
4.2. Key Generation
4.2.1 RTL Coding Guidelines
4.3. Commutative Encryption
4.4. Commutative Decryption
4.5. Commutative RSA Cryptography Core
4.6. CRSA Oriented Montgomery Parallel Multiplier
4.6.1. Parallel Montgomery Multiplier
4.7. Realization of Parallel Montgomery Multiplier
4.7.1. Pseudo Algorithm for Parallel Integer Multiplication
4.7.2. Pseudo Algorithm for Parallel Montgomery Multiplication
4.8. Realization of CRSA with Multiple Distributed Cores

5. SIMULATION AND PLACE AND ROUTE RESULTS OF COMMUTATIVE CRYPTOGRAPHY ARCHITECTURE WITH KEY GENERATION
5.1. Introduction
5.2. Hardware Design
5.3. A Comparative Analysis for Serial versus Parallel Montgomery Multiplication Based CRSA Implementation
5.4. Analysis of Simulation Waveforms
5.5. RTL View of Architecture of Commutative RSA
5.5.1. Rationale for 199 MHz

6. CONCLUSIONS AND SCOPE FOR FUTURE WORK
6.1. Conclusions
6.2. Contributions of This Work
6.3. Scope for Future Work

REFERENCES
LIST OF PUBLICATIONS
APPENDIX A
APPENDIX B
APPENDIX C


LIST OF FIGURES

Figure 3.1. Square and Multiply Algorithm
Figure 3.2. Algorithm for Modular Multiplication
Figure 3.3. Algorithm for Radix-2 Modular Multiplication
Figure 3.4. Algorithm for Sequential Binary Modular Multiplication
Figure 3.5. Serial Montgomery Multiplier
Figure 3.6. Algorithm for Parallel Binary Modular Multiplication
Figure 3.7. Parallel Montgomery Multiplier
Figure 4.1. ASM Chart for Finding the Prime Number
Figure 4.2. RTL Schematic for Finding the Prime Number Using ISE 14.3 Xilinx Tool
Figure 4.3. Flowchart for Finding GCD of Two Numbers
Figure 4.4. RTL View for Finding GCD Using ISE 14.3 Xilinx Tool
Figure 4.5. Key Generation for Commutative RSA Algorithm
Figure 4.6. Pseudo Algorithm for Commutative RSA Key Generation
Figure 4.7. ASM Chart for Key Generation for Commutative RSA
Figure 4.8. RTL Schematics of Top Module of CRSA Key Generation Using Vivado 2012.3
Figure 4.9. Schematic of CRSA Key Generation Using Vivado 2012.3
Figure 4.10. Pseudo Algorithm for Commutative Encryption
Figure 4.11. Pseudo Algorithm for Commutative Decryption
Figure 4.12. Sequential Model for Commutative RSA Realization
Figure 4.13. Pseudo Algorithm for Montgomery Multiplication
Figure 4.14. Pseudo Algorithm for Parallel Integer Multiplication
Figure 4.15. Pseudo Algorithm for Parallel Montgomery Multiplication
Figure 5.1. Comparison of Chip Area of Serial and Parallel Montgomery Based CRSA
Figure 5.2. Comparison of Power Consumption of Serial and Parallel Montgomery Based CRSA
Figure 5.3. Delay Comparison for Serial and Parallel Montgomery Based CRSA
Figure 5.4. Throughput Comparison of Serial and Parallel Montgomery Based CRSA
Figure 5.5. Simulation Waveform of CRSA Encryption at User Terminal 1
Figure 5.6. Simulation Waveform of CRSA Encryption at User Terminal 2
Figure 5.7. Simulation Waveform of CRSA Encryption at User Terminal 3
Figure 5.8. Simulation Waveform of CRSA Decryption at User Terminal 3
Figure 5.9. Simulation Waveform of CRSA Decryption at User Terminal 2
Figure 5.10. Simulation Waveform of CRSA Decryption at User Terminal 1
Figure 5.11. Top Level RTL View of Key Generation Using ISE 14.3 Xilinx Tool
Figure 5.12. Top Level RTL View of Key Generation Using Vivado 2012.3 Xilinx Tool
Figure 5.13. Second Level RTL View of CRSA Key Generation Using ISE 14.3 Xilinx Tool
Figure 5.14. Second Level RTL View of CRSA Key Generation Using Vivado 2012.3 Xilinx Tool
Figure 5.15. (a-g) RTL View of LFSR Using Vivado 2012.3 Xilinx Tool
Figure 5.16. RTL View of Checking the Prime Number
Figure 5.17. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vlx20t-2ff323
Figure 5.18. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vsx50t-2ff1136
Figure 5.19. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vfx70t-2ff1136
Figure 5.20. Timing Report for Commutative Encryption and Decryption Using ISE 14.3 Xilinx Tool
Figure 5.21. Timing Report for CRSA with Key Generation
Figure 5.22. Timing Report for CRSA with Key Generation Using Vivado 2012.3 Xilinx Tool


LIST OF TABLES

Table 5.1. Comparison of Chip Area of Serial and Parallel Montgomery Based CRSA Cryptography Core
Table 5.2. Comparison of Power Consumption of Serial and Parallel Montgomery Based Commutative RSA Cryptography Core
Table 5.3. Performance (Delay, Frequency and Throughput) Comparison for Serial and Parallel Montgomery Based CRSA Cryptography Core
Table 5.4. Data at User Terminal 1
Table 5.5. Data at User Terminal 2
Table 5.6. Data at User Terminal 3
Table 5.7. Data Mapping
Table 5.8. Device Utilization of Encryption, Decryption and Key Generation RTL Designs
Table 5.9. Timing Report for CRSA Using ISE 14.3
Table 5.10. FPGA Resource Consumption of the RTL VHDL Design


LIST OF ACRONYMS

RSA Algorithm - Rivest, Shamir, Adleman Algorithm
MIMO - Multiple Input Multiple Output
CRSA - Commutative RSA
DES - Data Encryption Standard
IDEA - International Data Encryption Algorithm
SSL - Secure Sockets Layer
TLS - Transport Layer Security
SSH - Secure Shell
PKI - Public Key Infrastructure
GCD - Greatest Common Divisor
FPGA - Field Programmable Gate Array
ISE - Integrated Synthesis Environment
VHDL - Very High Speed Integrated Circuit Hardware Description Language
CRT - Chinese Remainder Theorem
ASIC - Application Specific Integrated Circuit
VLSI - Very Large Scale Integration
GF - Galois Field
RNS - Residue Number System
ECC - Elliptic Curve Cryptography
MMM - Montgomery Modular Multiplication
CSA - Carry Save Adder
SMFCP - Secure Multi FPGA Communication Protocol
RTL - Register Transfer Level
CPA - Carry Propagation Adder

Vinayaka Missions University, Salem 1

Chapter 1

Introduction

1.1 Background

The rapid growth of data communication and the spread of internet services involve multiple transceiver terminals and Multiple Input Multiple Output (MIMO) applications, which require a highly secure and robust system model. Such a model must meet the fundamental requirements of secure and authenticated data communication in multiparty networks: privacy or confidentiality, authentication, data integrity and non-repudiation. These security requirements are necessary for services such as secure telephony, e-commerce, e-banking and multi-user secure communication.

With the rapid spread of communication networks, there is a need for privacy and security of transmitted data. Safeguarding information has therefore become a key issue, which cryptographic techniques have been created to address. Software and hardware protocols have been implemented to improve the security of information. A number of approaches have been developed to achieve security and data authenticity in multi-user communication scenarios, and cryptosystems are potential solutions for data security and authenticity in MIMO communication environments. Toward this goal, an enhanced and optimized algorithm for public key cryptography has been developed in this work.

In the proposed work, the well-known Rivest, Shamir, Adleman (RSA) algorithm has been enhanced by adding commutative behavior. Commutative behavior means that the order in which the encryptions are applied does not affect the result, provided the decryptions are applied in the reverse order. The mathematically formulated RSA scheme has been converted into unitary algorithms, the overall system model has been enriched with both Serial and Parallel Montgomery multiplication, and the complete system model has been realized with multiple MIMO transceivers or distributed FPGA cores. The Commutative RSA (CRSA) approach has exhibited good results for MIMO transceiver based secure communication. To optimize the cryptosystem, the processes of encryption and decryption must first be enhanced. This enhancement can be accomplished effectively by using computationally efficient modular multiplication approaches, such as Montgomery multiplication, for the modular exponentiation. To accomplish secure communication and data authentication, the encryption and decryption need to be the best in terms of security.
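The commutative behavior described above can be illustrated with a small software sketch. It assumes that all terminals hold exponent pairs defined over one modulus n, which is what makes the exponentiations commute. The toy primes, the three exponents and the helper names are assumptions made for illustration only and are not taken from the hardware design.

```python
from math import gcd

# Toy parameters (assumed for illustration; far too small to be secure).
p, q = 61, 53
n = p * q                    # shared modulus
phi = (p - 1) * (q - 1)      # Euler totient of n


def make_key(e):
    """Return (e, d) with e*d = 1 (mod phi); e must be coprime to phi."""
    assert gcd(e, phi) == 1
    return e, pow(e, -1, phi)  # modular inverse, Python 3.8+


# Three hypothetical user terminals, each with its own exponent pair.
keys = [make_key(e) for e in (17, 7, 11)]


def encrypt_chain(m, order):
    """Apply the encryption exponents of the listed terminals in sequence."""
    for i in order:
        m = pow(m, keys[i][0], n)
    return m


def decrypt_chain(c, order):
    """Apply the decryption exponents of the listed terminals in sequence."""
    for i in order:
        c = pow(c, keys[i][1], n)
    return c


message = 65
c_123 = encrypt_chain(message, [0, 1, 2])   # terminal 1, then 2, then 3
c_312 = encrypt_chain(message, [2, 0, 1])   # a different encryption order
recovered = decrypt_chain(c_123, [2, 1, 0]) # decryption in reverse order
```

Because (M^a)^b = (M^b)^a modulo n, both encryption orders give the same cipher text, and decrypting in reverse order returns the original message.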


1.2 Classification of Cryptosystem

Cryptography is defined as the study of mathematical techniques related to the security of transmission and storage of information. Cryptography is the science concerned with the design of ciphers, whereas cryptanalysis is the related study of breaking ciphers. Cryptography and cryptanalysis are complementary to each other: development in one is usually followed by further development in the other. Cryptography is an important tool in today's information security. Although cryptography has historically been linked to confidentiality, modern cryptographic techniques also address integrity, authentication and non-repudiation. A cryptosystem is an implementation of cryptographic techniques and their accompanying infrastructure to provide information security services. A cryptosystem is also referred to as a cipher system. The term cryptography may be used in place of cryptosystem and vice versa. There are two types of cryptography: symmetric-key cryptography and public-key (or asymmetric-key) cryptography.


1.2.1 Symmetric Key Cryptosystem

In this cryptosystem, only one key is used, known as the secret key. Both communicating parties know this secret key. It can be a fixed key, or it can be passed between the two parties over a secure communication link. Symmetric-key (or secret-key) cryptography can be seen as an outgrowth of classical cryptography. If users want to communicate securely with each other, they must share a key, which is used both to encrypt and to decrypt messages. The security of a symmetric-key scheme should rely on the secrecy of the key, as well as on the infeasibility of decryption without knowledge of that key.

Symmetric-key schemes are usually fast. One of the main issues when deploying symmetric-key cryptography is the problem of key establishment. To communicate securely, users must share a secret key, which by nature should be known to no one else. Historically, key distribution was done in advance over a secure communication channel, e.g., a trusted courier. In a network of users wishing to communicate securely, each pair of users must share a secret key, which makes the approach impractical for any medium-size network. One solution involves a central entity trusted by all the users, e.g., a trusted third party, whose job is to issue session keys. There are a number of other solutions, but it should be clear that key establishment is one of the main problems.
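The scaling problem of pairwise keys can be made concrete with a short count. Here N denotes the number of users and K(N) the number of distinct secret keys needed when every pair shares one; both symbols are introduced only for this illustration.

```latex
\[
  K(N) = \binom{N}{2} = \frac{N(N-1)}{2},
  \qquad K(10) = 45, \quad K(100) = 4950, \quad K(1000) = 499\,500 .
\]
```

The key count grows quadratically with the number of users, whereas a public-key system needs only one key pair per user.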

1.2.2 Public Key Cryptosystem

In this cryptosystem, two keys are present: a public key and a private key. Each user has both a private key and a public key. Two users can communicate because they know each other's public keys. Normally, in a public-key cryptosystem, a sender encrypts with the recipient's public key and the recipient decrypts with the corresponding private key. The private transformation is described by a private key, and the public transformation is described by a public key derived from the private key. The RSA approach is one of the most popular public-key techniques and is based on the difficulty of factoring large integers.

The concept of public-key cryptography was proposed in 1976 by Whitfield Diffie and Martin Hellman in their innovative work [1]. As advocated by them, the main motivation for this new concept was to “minimize the need for secure key distribution channels and supply the equivalent of a written signature”. In public-key cryptography, each user has a key pair (e, d), which consists of a public key “e” and a private key “d”. User “A” can make “e” publicly available and keep only “d” secret. Anyone can then encrypt information with “e”, while only user “A” can decrypt it with the private key “d”. Alternatively, the private key can be used to sign a document, and anyone can use the corresponding public key to verify its authenticity. Public-key cryptography is secure as long as it is computationally infeasible to derive the private key “d” from the corresponding public key “e”. Examples of public-key cryptosystems are the Diffie-Hellman key exchange scheme and the ElGamal and RSA encryption and signature schemes [1, 2].

At the heart of public-key cryptography lies the concept of a one-way function. A function “f” is “one-way” if it is easy to compute f(x) for every “x” in the domain of “f”, but for most randomly chosen “y” in the range of “f”, it is computationally infeasible to find “x” such that “f(x) = y”. A “trapdoor one-way function” is a one-way function for which, given some extra information (the trapdoor), it becomes feasible to find “x” such that “f(x) = y” [1]. One can consider encryption with “e” as the one-way function, with “d” being the trapdoor information. In some cases, especially in MIMO transceiver based communication scenarios, the one-way function of RSA causes certain limitations. In multi-party communication this drawback must be addressed, and with this objective this work presents a Commutative RSA scheme that efficiently provides the benefits of bi-directional cryptosystems under certain functional principles.


Public-key cryptosystems are usually slower and require longer keys than symmetric-key cryptography [1, 2]. On the other hand, a public-key cryptosystem simplifies key distribution. Conventional symmetric-key algorithms have a limited lifetime: they become useless once exhaustive key search becomes feasible due to computational progress. Public-key cryptography is highly flexible, as keys can be selected from a very large key space. In practice, it is common to use hybrid schemes: public-key techniques are used to establish short-term (symmetric) keys, which are then used to secure the communication. This research work advocates a Montgomery multiplication based approach with higher-radix multiplication. This reduces the execution time and facilitates secure communication in multi-party environments.
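As background for the Montgomery multiplication based approach mentioned above, the following is a minimal software sketch of the textbook bit-serial (radix-2) Montgomery product and of modular exponentiation built on it. It is not the parallel, higher-radix architecture of this work. The function names are hypothetical, and the sketch assumes an odd modulus.

```python
def mont_mul(a, b, n, k):
    """Radix-2 Montgomery product a*b*2^(-k) mod n.

    Assumes n is odd, n < 2^k and 0 <= a, b < n.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add b when bit i of a is set
        if s & 1:                 # make s even by adding the odd modulus
            s += n
        s >>= 1                   # exact division by 2
    return s - n if s >= n else s # single final correction, since s < 2n


def mont_pow(m, e, n):
    """Compute m^e mod n by square-and-multiply on Montgomery residues."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)            # R^2 mod n, with R = 2^k
    x = mont_mul(m % n, r2, n, k)    # m in Montgomery form: m*R mod n
    acc = mont_mul(1, r2, n, k)      # 1 in Montgomery form: R mod n
    for bit in bin(e)[2:]:           # scan exponent bits left to right
        acc = mont_mul(acc, acc, n, k)
        if bit == "1":
            acc = mont_mul(acc, x, n, k)
    return mont_mul(acc, 1, n, k)    # convert back from Montgomery form


cipher = mont_pow(65, 17, 3233)      # toy RSA encryption with assumed values
```

The inner loop uses only additions and one-bit shifts and needs no trial division by n, which is why the method suits hardware. Each iteration depends on the previous one, so the serial form takes k steps per product.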

Considering the requirement for efficient, robust and secure communication in MIMO transceiver based scenarios, public key cryptography is a potential candidate. Hence, in this work, the public key cryptographic technique RSA has been taken up for further enhancement and optimization.

The RSA cryptosystem has exhibited good results in facilitating secure data communication [2], but it still leaves scope for further optimization. Its predominant limitations are iterative key computation and key distribution across users. In addition, conventional RSA requires a lot of time for encryption and decryption computations. Most cryptographic techniques use a standard method to encrypt the message. Some issues, such as uni-linear encryption, can be eliminated effectively by employing schemes such as the commutative approach. This thesis takes commutative behavior into consideration and develops a final system model called Commutative RSA.

The difference in the encryption process is determined by an electronic key which is added to the encryption process. This key may be private, so that both the sender and the receiver use the same key for encryption and decryption of the data. Unfortunately, with such a private key system, users require a different key for each conversation. Another disadvantage is that a user must pass the private key through a secure channel, and this channel may or may not actually be secure. There is no way of knowing whether an external party has obtained the secret key. Private keys are usually changed at regular intervals. If an unauthorized party knows the technique by which these keys change, that party can also change the keys. These problems are overcome with public-key encryption, in which each user has two keys. One is a public key, which is used to encrypt the data sent to the user. The other is a private key, which is used to decrypt the received encrypted data. No one except the user knows the private key. The iterative key computation and key distribution make RSA implementation somewhat complicated and costly in terms of overheads and overall performance. Such limitations and unwanted overheads make the RSA cryptosystem inefficient for MIMO application scenarios. To reduce the key computation cost, Commutative RSA has been advocated, and to make it suitable for hardware processors, parallelized Montgomery multiplication has been proposed in this work.

1.3 RSA Cryptosystem

The RSA cryptosystem was created in 1977 and named after its inventors Ronald Rivest, Adi Shamir and Leonard Adleman [2]. It is widely used to secure communication over the Internet and to ensure confidentiality and authenticity of e-mail, and it has become fundamental to e-commerce. RSA is used in the most popular security protocols, including Secure Sockets Layer (SSL) / Transport Layer Security (TLS) and Secure Shell (SSH), as well as in most Public Key Infrastructure (PKI) products. Besides encryption, RSA is used in many applications such as digital signatures and key establishment. RSA is usually present wherever security of digital data is of concern.


The mathematical structure of the RSA function is quite simple, and this may be another reason for its popularity. RSA is based on basic algebraic operations on large integers. Before describing the algorithm, we set some conventions as follows:

1. Let a, b and n be positive integers. “a” is equal to “b” modulo “n” (denoted as a = b mod n) if “b” is the remainder of “a” divided by “n”.

2. $Z_n$ is the set of integers modulo n. We can define the operations of addition and multiplication in this set by using the usual operations on integers, but taking the result modulo “n” as defined above. These are called modular addition and modular multiplication.

3. We denote the greatest common divisor of a and b by “gcd (a, b)”. This is defined as the largest positive integer which divides both a and b.
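The conventions above can be checked with a few lines of code. The sketch below shows modular addition and multiplication on small numbers and the Euclidean algorithm for the greatest common divisor. The function name and the sample values are chosen only for illustration.

```python
def gcd(a, b):
    """Greatest common divisor by the Euclidean algorithm."""
    while b:
        a, b = b, a % b   # replace (a, b) by (b, a mod b) until b is zero
    return a


n = 7
add_mod = (5 + 4) % n     # modular addition in Z_7
mul_mod = (5 * 4) % n     # modular multiplication in Z_7
common = gcd(48, 18)      # largest integer dividing both 48 and 18
coprime = gcd(17, 3120)   # equals 1 when the two numbers share no factor
```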

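The RSA algorithm [3] can now be described as:

1. Generate two distinct large prime numbers $p$ and $q$ of the same size.

2. Compute $n = pq$ and $\varphi(n) = (p - 1)(q - 1)$.

3. Choose an integer $e$ with $1 < e < \varphi(n)$ such that $\gcd(e, \varphi(n)) = 1$.

4. Calculate $d$ such that $d \cdot e \equiv 1 \pmod{\varphi(n)}$.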


The pair of integers (e, n) is the public key. The pair of integers (d, n) is the private key. The integer “n” is the modulus. “e” and “d” are the public and private exponents respectively. By convention, the bit length of the modulus “n” is called the size of the RSA key.
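The four key generation steps can be sketched in software as follows. The primes 61 and 53 and the exponent 17 are toy values assumed for illustration, and `generate_keypair` is a hypothetical helper name. Real keys use primes hundreds of digits long.

```python
from math import gcd


def generate_keypair(p, q, e):
    """Build an RSA key pair from two distinct primes and a public exponent."""
    n = p * q                              # step 2: modulus
    phi = (p - 1) * (q - 1)                # step 2: Euler totient of n
    if not (1 < e < phi and gcd(e, phi) == 1):
        raise ValueError("e must satisfy 1 < e < phi(n) and gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)                    # step 4: d*e = 1 (mod phi(n))
    return (e, n), (d, n)


public_key, private_key = generate_keypair(61, 53, 17)
key_size_bits = public_key[1].bit_length() # size of the RSA key
```

With these toy values the modulus is 3233, a 12-bit key, and the private exponent is 2753.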

1.3.1 Encryption and Decryption using RSA

Let “M” be the plain text or message that conveys information. Before transmission, it should be converted to an unreadable form, so the message “M” is encrypted using the encryption key “e”. After encryption, the cipher text “C” is obtained.

Represent the message to be encrypted as an integer $M \in Z_n$, i.e. $0 \le M < n$.

Encrypt M as $C = M^e \bmod n$.

The resulting cipher text “C” can be decrypted by computing $D = C^d \bmod n$. That $D = M$ follows from $d \cdot e \equiv 1 \pmod{\varphi(n)}$, which gives $C^d \equiv M^{ed} \equiv M \pmod{n}$.

The RSA algorithm is also used for generating digital signatures, which can provide authenticity and non-repudiation of electronic legal documents.
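A short round trip shows encryption, decryption and signing with the same pair of exponents. The key values are the toy numbers n = 3233, e = 17, d = 2753, assumed for illustration, and the signature shown is plain "textbook" RSA without the hashing and padding that practical schemes add.

```python
# Toy key (assumed values): n = 61 * 53, e * d = 1 (mod 3120).
n, e, d = 3233, 17, 2753


def encrypt(m):
    """C = M^e mod n, for a message represented as an integer 0 <= M < n."""
    assert 0 <= m < n
    return pow(m, e, n)


def decrypt(c):
    """D = C^d mod n recovers the message."""
    return pow(c, d, n)


def sign(m):
    """Textbook signature: apply the private exponent to the message."""
    return pow(m, d, n)


def verify(m, s):
    """Anyone holding the public key can check the signature."""
    return pow(s, e, n) == m


cipher_text = encrypt(65)
plain_text = decrypt(cipher_text)
signature_ok = verify(65, sign(65))
```

For the message 65 the cipher text is 2790, and decrypting 2790 returns 65.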

The RSA encryption scheme is a public-key cryptosystem, and the RSA trapdoor function is defined as $f(x) = x^e \pmod{n}$. The trapdoor is the private exponent “d”, since $(x^e)^d \equiv x \pmod{n}$. The security of the RSA cryptosystem relies on the problem of factoring large integers, which is widely believed to be intractable. Factorization of the modulus is devastating for the system. If an adversary can factor “n”, he can easily calculate the private exponent by solving the congruence $d \cdot e \equiv 1 \pmod{\varphi(n)}$ and thus invert the RSA function. Conversely, recovery of the private exponent “d” is equivalent to factoring the modulus “n” [4]. This means that an adversary who knows “d” can efficiently factor “n”. This illustrates a possible misuse of the RSA cryptosystem.

In order to avoid generating new primes for every user, a trusted

central authority could generate a common modulus and distinct

exponent pairs (e, d) for each user. Although this might seem at first

glance a good idea, the fact above shows that it is completely insecure:

any user “i”, knowing his/her own key pair, could factor “n” and then

recover all the other private keys. This shows that an RSA modulus

should never be used by more than one entity.
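The common-modulus weakness can be demonstrated with the standard probabilistic method for factoring n from one key pair. This is a sketch of that well-known textbook technique, not an algorithm taken from this work. The function name, the seed parameter and the toy values are assumptions.

```python
import random
from math import gcd


def factor_from_key_pair(n, e, d, seed=0):
    """Factor n = p*q given one valid RSA key pair (e, d) for that modulus."""
    rng = random.Random(seed)
    k = e * d - 1                  # a multiple of the order of every unit mod n
    r, t = k, 0
    while r % 2 == 0:              # write k = 2^t * r with r odd
        r //= 2
        t += 1
    while True:
        g = rng.randrange(2, n - 1)
        f = gcd(g, n)
        if f > 1:                  # g happens to share a factor with n
            return f, n // f
        x = pow(g, r, n)
        for _ in range(t):
            y = pow(x, 2, n)
            if y == 1 and x not in (1, n - 1):
                p = gcd(x - 1, n)  # x is a non-trivial square root of 1
                return p, n // p
            x = y


# A user holding (e, d) = (17, 2753) for the shared modulus 3233 factors it,
p, q = factor_from_key_pair(3233, 17, 2753)
phi = (p - 1) * (q - 1)
# and then derives the private exponent of a hypothetical second user (e = 7).
other_d = pow(7, -1, phi)
```

Each random base succeeds with probability of at least one half, so the loop ends after a few attempts on average. Once the factors are known, every other private exponent over the same modulus follows from one modular inversion.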

Considering the aforementioned functional behavior of the RSA cryptosystem and its limitations, a robust system called “Commutative Cryptography” has been developed in this work and implemented on multiple distributed FPGA architectures. Commutative RSA is implemented to ensure performance in real-time hardware-assisted applications. Three distinct FPGA cores have been considered, and the respective Commutative RSA algorithms have been implemented on the individual cores. The overall system has been developed as a sequence: a key generation phase, followed by robust commutative encryption based on modified, parallel Montgomery Multiplication [5-7], followed by commutative decryption based on parallel Montgomery Multiplication. In this approach to key generation, two individual random generators have been employed, with two 32-bit entities and first-input first-output shift registers. The pseudo-random data bits are tested for primality and then processed for greatest common divisor estimation. In further steps, the encryption and decryption keys are generated from the available variables. The primary significance of this approach is that iterative key generation and its allied overheads are minimized, so the efficiency increases. The implementation of parallel Montgomery multiplication with high-radix multiplication makes the system robust and offers good performance with increased security.
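The key generation flow outlined above (two pseudo-random generators, a primality check, a GCD check, then key derivation) can be sketched in software. This is only a behavioral illustration under stated assumptions. The 32-bit Galois LFSR tap mask, the seeds, the trial-division primality test and the rule for choosing e are all hypothetical choices and are not taken from the RTL design.

```python
from math import gcd

TAPS = 0x80200003  # assumed tap mask for x^32 + x^22 + x^2 + x + 1 (Galois form)


def lfsr32(state):
    """Advance a 32-bit Galois LFSR by one step (state must be nonzero)."""
    lsb = state & 1
    state >>= 1
    if lsb:
        state ^= TAPS
    return state


def is_prime(x):
    """Trial division; adequate for 32-bit candidates in a software sketch."""
    if x < 2:
        return False
    if x % 2 == 0:
        return x == 2
    i = 3
    while i * i <= x:
        if x % i == 0:
            return False
        i += 2
    return True


def next_prime_from_lfsr(state):
    """Step the LFSR until an odd candidate passes the primality check."""
    while True:
        state = lfsr32(state)
        candidate = state | 1          # force the candidate to be odd
        if is_prime(candidate):
            return candidate, state


def generate_keys(seed_p, seed_q):
    """Derive (n, e, d) from two independently seeded generators."""
    p, _ = next_prime_from_lfsr(seed_p)
    q, state_q = next_prime_from_lfsr(seed_q)
    while q == p:                      # the two primes must be distinct
        q, state_q = next_prime_from_lfsr(state_q)
    n, phi = p * q, (p - 1) * (q - 1)
    e = 3
    while gcd(e, phi) != 1:            # GCD check on the public exponent
        e += 2
    return n, e, pow(e, -1, phi)


n, e, d = generate_keys(0xACE1ACE1, 0x1234ABCD)
```

A hardware realization would replace the trial division with a dedicated primality-checking block, and the chosen seeds determine the whole key sequence, so they must be unpredictable in practice.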

1.3.2 Why RSA?

In the traditional approach to public key cryptography, key generation must be performed at each terminal; with a large number of cores, the overhead caused by key generation becomes a serious disadvantage. Therefore, a system model is developed that can noticeably reduce the key computation cost and thus enhance the performance of the system. RSA has established itself as an optimum approach for public-key cryptography [1, 2]. However, in scenarios such as multi-user communication, distributed MIMO applications, heterogeneous multiprocessor systems on chip and multi-DSP based communication cores, conventional RSA implementation might not be very fruitful, and it remains unexplored in combination with recent and optimized encryption techniques. With approaches that exploit commutative characteristics, the order in which encryption takes place does not affect the decryption process if it is done in the same way, which avoids security breaches. In MIMO applications, encryption of the data takes place at every user terminal.

1.4 Motivations

With the growth of secure data communication

requirements and Multiple Input Multiple Output (MIMO) transceiver based

communication systems, the need for a secure communication

paradigm is increasing day by day. Data security is one of the

predominant issues in modern multiple transceiver based

communications. The approach of public key cryptography plays a

potential role in establishing security for MIMO based applications and

RSA approach plays an important role with public key cryptography to


be used in public infrastructure. A number of cryptosystems reported

earlier suffer from the limitation of linear cryptography, which is required

to be optimized along with the overhead reduction of the system. An

approach called “commutative” cryptography has the property that the order in which encryption

is performed does not influence the final result. In addition, the security

and authenticity of the participating MIMO transceiver terminals also

become critical in public infrastructure based communication. These

are the main motivating factors for undertaking this work.

1.5 Research Objectives

Considering the need for highly efficient and robust security

mechanism for multiple users in communication, this research work

proposes a highly robust and efficient security and authentication

mechanism. In this work, a commutative behavior enabled RSA

cryptographic core has been developed which has been implemented

with multiple users. The main objectives of this work may be

summarized as follows:

1. To develop an efficient secured cryptosystem for multiple cores in

multi-party communication environment.

2. To develop a public key cryptographic scheme or algorithm

introducing Commutative Nature of RSA Encryption and

Decryption.


3. To develop FPGA compatible CRSA algorithm using serial as

well as parallel Montgomery multiplication so as to enhance

processing efficiency and overall throughput.

4. To develop a CRSA cryptosystem and implement it with multiple

FPGA cores to ensure the justifiable performance of CRSA with

real time hardware assisted applications.

1.6 Methodology Adopted

In order to meet the objectives stated earlier, the following

methods are adopted:

1. Parameters and algorithms for data security have been identified for

Multiple Input Multiple Output applications.

2. Commutative nature of RSA algorithm has been proved.

Performance analysis of serial and parallel Montgomery

Multiplication algorithms on CRSA has been verified.

3. Commutative cryptography core with key generation suiting FPGA

realization has been designed using VHDL.

4. Functionality has been verified using Modelsim.

5. The design has been synthesized using Xilinx ISE and Vivado

targeted on a Virtex FPGA.

1.7 Thesis Organization

The overall thesis organization is as follows:

Chapter 1 is an introduction to the research domain. In this chapter, a

brief discussion of cryptosystems and their significance, kinds of

cryptosystems, the RSA algorithm for public key cryptography applications,

etc. is presented. This chapter also presents the motivations for the

research work, objectives, methodology and thesis organization.

Chapter 2 presents the literature review for RSA algorithms and its

implementation for secure communication. In this chapter, reviews have

been made for RSA algorithms. Different architectures of Montgomery

multiplications with its implementation on FPGA have been discussed.

The theoretical background of the RSA cryptosystems, public key

cryptography, and various kinds of Montgomery multiplications of RSA

have been presented in Chapter 3. Implementations of Commutative

RSA using serial and parallel Montgomery multiplication algorithms are

also presented.

Chapter 4 discusses the key generation for Commutative RSA

using parallel Montgomery multiplication. A novel Algorithmic

development is also presented.

Chapter 5 presents the simulation results and their respective

validation. Performance evaluation for the developed system in terms of

execution time, memory occupancy, delay analysis with multiple

processors and varying frequencies etc. have been presented.

Chapter 6 presents the Conclusion and the Scope for future work.


Chapter 2

Review of Literature

An introduction to the security of multiple transceiver terminals and

MIMO applications has been presented in Chapter 1. An appropriate

literature survey has been carried out in order to understand the prior

work in RSA algorithm and its implementation with FPGA cores. In this

chapter, a detailed review is presented for Montgomery multiplication

implementation for enhancing real time performance of RSA algorithm.

Many different architectures have been proposed in the

technical literature for modular exponentiation and RSA implementation.

However, they often rely on specific technologies and thus a fair

comparison is difficult.

Hardware implementations have been proposed for the RSA

algorithm for public-key cryptography [8-11]. Bit-slice based architecture

was developed to implement modular exponentiation algorithm in

hardware [8]. Bo Song et al. presented the RSA module for 2048-bit

size [9]. They presented hardware algorithm for RSA

encryption/decryption based on Montgomery multiplication. These

implementations are not suitable for high throughput hardware as is the

case with the proposed architectures. Programmable active memory

implementation of RSA which combines Chinese remainders, star


chains, Hensel's odd division, carry-save representation, quotient pipe-

lining and asynchronous carry completion adders is proposed in [10].

Nakano K. et al. presented hardware algorithms for modulo

exponentiation (P^E mod M) used in RSA encryption and decryption,

and implemented them on an FPGA [11]. Daniel Mesquita et al. proposed

an architecture to perform modular exponentiation using the

Montgomery Powering Ladder algorithm [12]. However, the authors

have not proved the commutative nature of the RSA algorithm and they

have not applied it for multi core applications, which is vital for effective

security of multi transceiver systems.
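
For reference, the Montgomery Powering Ladder mentioned above can be sketched in a few lines of Python. This is a generic textbook form of the ladder, not the architecture of Ref. [12]; the function name is hypothetical and ordinary modular multiplication stands in for a hardware Montgomery multiplier.

```python
def montgomery_ladder(base, exponent, modulus):
    """Compute base**exponent mod modulus with the Montgomery powering ladder.

    Invariant after each step: r1 == r0 * base (mod modulus).
    Both branches perform one multiplication and one squaring, which is
    why the ladder is often used where a regular operation pattern is wanted.
    """
    r0, r1 = 1 % modulus, base % modulus
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            r0 = (r0 * r1) % modulus
            r1 = (r1 * r1) % modulus
        else:
            r1 = (r0 * r1) % modulus
            r0 = (r0 * r0) % modulus
    return r0
```

The result can be checked against Python's built-in `pow(base, exponent, modulus)`.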

Very high speed RSA demodulator is proposed in [13]. Xuewen

Tan and Yunfei Li [13] proposed Batch RSA-S1 Multi-Power RSA

(BS1PRSA) algorithm to improve the performance of RSA decryption by

combining the load transferring technique and multi-prime technique in

the Batch RSA algorithm. The algorithm in Ref. [14] has been

implemented using a Xilinx XC6VLX240T FPGA chip, and the maximum

running frequency reaches 188 MHz. However, the authors have not

addressed encryption.

An implementation of RSA algorithm using software has been

proposed in [15]. Suli Wang and Ganlai Liu used C++ Class Library to

develop RSA encryption algorithm Class Library to realize Groupware


encapsulation on 32-bit windows platform. Wenjun Fan et al. have

proposed an implementation of RSA algorithm using JCUDA and

Hadoop software [16]. They have used CUDA framework to realize RSA

algorithm. However, they have not substantiated their claim by

presenting processing speeds achieved. Further, software realization is

not suitable for hardware implementation.

Public key cryptosystem was proposed in Ref. [17-19]. Ljupco

Kocarev et al. [16] proposed a public-key encryption algorithm which is

based on torus automorphisms. Authors generalized RSA algorithm

replacing powers with matrix powers, choosing the matrix, which

defines a two-torus automorphism. Software implementation is

discussed and hardware implementation is not mentioned. Montgomery

multiplication methods have been used for encrypting and signing digital

data in public-key cryptography by Koç C. K. et al. [18]. However, they

did not compare the Montgomery techniques to other modular

multiplication approaches. Software implementation of RSA using C++

is proposed in Ref. [19] by Xin Zhou and Xiaofei Tang.

Jiang Huiping et al. have proposed improved architecture for RSA

coprocessor against power analysis [20]. The Shadow technology was

introduced into the RSA algorithm in order to improve resistance to differential power

analysis theoretically. Their result showed that it would take about 498


ms to encrypt 1024 bits plaintext operating at 5 MHz. Their proposal is

applicable for the security of wireless sensor networks only and not to

Multi-core applications.

Nagar S. A. and Saad Alshamma have implemented the RSA

algorithm and their aim was to increase the speed during data

transmission among different communication networks and Internet

[21]. They generated the keys offline and stored them in different databases

to increase the speed of RSA. Key Generation and storage were

developed using C# language. However, software is not suitable for

hardware realization.

RSA algorithm with counter-measures against different attacks

was proposed in Ref. [22, 23]. A new Chinese Remainder Theorem

based RSA digital signature with counter-measures to hardware fault

attacks was proposed by Sining Liu et al [22]. These were implemented

for the security of smart card data. Several computational issues as well

as the analysis of attacks are discussed in Ref. [23].

An algorithm to enhance security in RSA is proposed by the

authors of Ref. [24, 25]. Al-Hamami et al. proposed an enhancement to

the RSA algorithm by using an additional third prime number in the

computation of the public and private keys [25]. However, computational


complexity may increase because of the third prime number.

Dahui Hu and Zhiguo Du discussed a network

authentication protocol called Kerberos [26], which

provides a secure authentication

service based on a reliable third party. Being a software realization,

this, however, is not suitable for hardware implementation.

Đorđević G. et al. proposed optimization techniques for the

modular reduction procedure of RSA algorithm [27]. They realized RSA

algorithm using TMS320C54x assembler of Texas Instruments. A

reciprocal value method and Montgomery's procedure were considered.

However, authors have proposed only optimization techniques for

modular reduction procedure.

Na Qi et al. presented RSA password system and applied it for

mobile phone short message encryption system [28]. It is applicable

only for the mobile terminal equipment. Perovic N.S. et al. proposed

RSA crypto algorithm with a 1024 bits long key [29]. Iana G.V. et al.

presented an implementation of the RSA algorithm as a prototype

programmable structure [30]. Xilinx Spartan3 was integrated on an

ASIC structure. The purpose was to optimize the Field Programmable

Gate Array area used. Only encryption technique was implemented.


Some new structures which can implement RSA cryptographic

algorithm were presented by the authors of Ref. [31- 33]. These

structures were designed upon a modified Montgomery modular

multiplier. The operations of multiplication and modular reductions are

carried out in parallel rather than interleaved as in the traditional

Montgomery multiplier. The authors implemented and compared eight different structures,

each referred to as “Struct”. Struct 4 requires

7.55 ms to carry out the full modular exponentiation operation while

Struct 7 carries out the same operation in 6.78 ms. However, the area

used by Struct 7 is almost twice that of Struct 4.

Montgomery modular multiplication on reconfigurable hardware

was presented in [34-38]. Hariri A. and Reyhani-Masoleh proposed bit-serial

and bit-parallel multipliers [34], and Miguel Morales-Sandoval and

Arturo Díaz-Pérez proposed a scalable GF(p) Montgomery multiplier

based on a digit-digit computation approach [35]. Perin G. et al.

presented a comparison of two FPGA Montgomery modular

multiplication architectures, a fully systolic array and a parallel

implementation [36]. Fully systolic array implementation takes 3.23 ms

to run 1024 bit RSA decryption process and the parallel architecture

executes the same operation in 6 ms. Ali Ziya Alkar et al. proposed

modular multiplication and squaring, bit level systolic arrays to increase

the speed of multipliers by 20% [37]. However, they have not

implemented an entire cryptosystem.


Galois field arithmetic has been proposed for implementing

Montgomery's algorithm [39, 40], which finds wide use in cryptography.

Poolakkaparambil M. et al. [39] and Bajard, J. et al. [40] implemented

Montgomery multipliers for elliptic curve cryptographic applications. However,

they have not implemented them for RSA.

Scalable Montgomery's algorithm was proposed by the authors of

Ref. [41, 42]. Chiou-Yng Lee et al. presented a scalable and systolic

Montgomery's algorithm in GF(2^m) using the Hankel matrix-vector

representation [41]. The proposed architectures have the features of

regularity, modularity, and local interconnect ability. They are well suited

for VLSI implementation. However, it was not implemented on any

hardware. Sanu M.O. et al. proposed four different architectures for

Modular multiplication for speeding up application-specific crypto-

processors [43]. They have not implemented RSA algorithm.

Parallelization of the Montgomery multiplication was attempted

using software by Zhimin Chen et al. [44, 45]. A scalable parallel

programming scheme to map the Montgomery multiplication to a

general multicore architecture was presented. However, their scheme, parallel Separated Hybrid Scanning (pSHS), trades

some throughput for latency.

Authors of Ref. [46- 49] have proposed high-radix scalable

Montgomery multiplier. Amberg P. et al. proposed an algorithm for


parallel high-radix scalable Montgomery multiplier with trade-offs for

multiple hardware implementations [46]. They presented the processing

element designs exploring combinations of radices 2, 4, and 8, right vs.

left shifting and Booth encoding. However, left shifting adds additional

design complexity and places strict constraints on word length and

number of processing elements in the pipeline. Thomas Blum and

Christof Paar proposed arithmetic architectures. These are optimized

for modern field programmable gate arrays [48, 49]. The proposed

architectures perform modular exponentiation with very long integers.

Parallel implementation of Montgomery multiplication has been

proposed earlier [50, 51]. Jun Han et al. proposed parallel

implementation of Montgomery multiplication with improved task

partitioning for multicore platform with area-efficient processors [50].

Selçuk and Erkay proposed a new parallel Montgomery multiplication

algorithm, which can be adopted for parallel realization using general-

purpose multi-core processors [51]. Miladinovic et al. presented

Montgomery multipliers, which can perform modular multiplication of

two integers without trial division [52]. The researchers described the

design and FPGA implementation of two architectures. Ciaran McIvor et

al. proposed fast Montgomery multipliers [53].

New methods to increase the speed of the Montgomery

Multiplication were presented by the authors of Ref. [54 – 56]. Neto


J.C. et al. proposed a new approach to speed up the Montgomery

Multiplication by distributing the multiplier operand bits into partitions

that can be processed in parallel [54, 55]. Batina and Muurling proposed an

efficient way of implementing Montgomery in hardware for an arbitrary

bit length [56]. The scope of this work is limited to the application of the

sequential radix-2 Montgomery Multiplication algorithm.
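
To make the sequential radix-2 Montgomery Multiplication concrete, a minimal bit-serial sketch in Python is given below. It is a generic textbook formulation and not the design of any cited reference; the function names and the toy operand sizes are assumptions for illustration. For an odd modulus n < 2^k and operands below n, the routine returns a·b·2^(-k) mod n using only additions and right shifts.

```python
def montgomery_multiply_radix2(a, b, n, k):
    """Return a * b * 2**(-k) mod n for odd n, with 0 <= a, b < n < 2**k.

    One bit of `a` is consumed per iteration (radix 2). Adding n when the
    running sum is odd makes it even, so the division by 2 is exact.
    """
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # conditional add of the multiplicand
        if s & 1:                 # make the sum divisible by 2
            s += n
        s >>= 1                   # exact division by the radix
    if s >= n:                    # single final subtraction (s < 2n holds)
        s -= n
    return s


def modmul_via_montgomery(a, b, n, k):
    """Ordinary a*b mod n built from two Montgomery multiplications."""
    r_squared = pow(2, 2 * k, n)                # R^2 mod n, with R = 2**k
    t = montgomery_multiply_radix2(a, b, n, k)  # a*b*R^-1 mod n
    return montgomery_multiply_radix2(t, r_squared, n, k)  # removes R^-1
```

In a hardware realization the loop body corresponds to one clock cycle of an adder and shifter; higher-radix and parallel variants process several bits per step.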

The authors of Ref. [57, 59] presented novel multipliers using

irreducible polynomials for Montgomery multiplication defined on binary

fields GF(2^m). They have used a linear feedback shift register (LFSR) as

the main module for the presented architecture. Huapeng Wu presented

a low complexity Montgomery multiplier in GF(2^m) [57]. They are well

suited for VLSI systems and applicable for Elliptic Curve Cryptography.

However, application of the same for RSA algorithm was not proposed.

Talapatra S. et al. presented unified digit-serial systolic

multiplication architecture for all-one polynomials and trinomials over

GF(2^m) for efficient implementation of Montgomery Multiplication

algorithm suitable for cryptosystem [60, 61]. Michalski A. et al. proposed

an improvement to a limited resource Montgomery multiplier design [62,

63]. Their design was scaled to utilize available FPGA multipliers, CLB

logic and frequencies of operation.


Implementation of RSA algorithm on FPGA was proposed in [64-

68]. Chu A. et al. analysed an RSA implementation on a reconfigurable

platform [64]. Montgomery modular multiplier replaced the expensive

multiplier and modular operations and provides reconfigurable hardware

support for the Montgomery modular multiplier and Montgomery

modular exponentiation. Mazzeo et al. proposed FPGA-based

Implementation of a serial RSA processor [66]. However, RSA was

implemented for embedded systems and not for multi core systems as

in the present work.

Liang Wang and Yonggui Zhang introduced a new personal

information protection approach based on RSA cryptography [68]. Using

this approach, personal information can be transformed from plain text

into cipher text. Customer representatives were able to contact their

clients without seeing their private information. They have used programmable active

memory based on programmable gate array.

Ming-Der Shieh et al. proposed VLSI implementation of the

modular exponentiation, which was based on the asynchronous

behaviour of the modular multiplication [69]. The basic idea is to

partition the operand (multiplier) into several equal-sized segments and

then to perform the multiplication and residue calculation of each

segment in a micro pipelining fashion.


Manaf N. V. et al. proposed a method to increase the speed of

arithmetic circuits in the Residue Number System (RNS) [70]. Hariri A.

et al. considered the finite field multiplication used in elliptic curve

cryptography and designed concurrent error detection circuits [71, 72].

Hariri and Reyhani proposed concurrent error detection in the

Montgomery multiplication over binary extension fields [72]. Error

detection schemes for two Montgomery multiplication architectures

were presented in that work.

Miaoqing Huang et al. proposed two Montgomery modular

multiplication architectures which perform multiplication operation [73].

These two architectures were based on pre-computing partial results

using two possible assumptions regarding the most significant bit of the

previous word. McLoone M. et al. proposed novel hardware architecture

implemented using coarsely integrated hybrid scanning (CIHS)

algorithm to perform Montgomery modular multiplication [74].

Novel CSA architecture for Montgomery multiplication was

presented by the authors of Ref. [75, 76]. Garg R. et al. proposed

Montgomery modular multiplication technique that employs multi-bit

shifting and carry-save addition to perform long-integer arithmetic [75].

As per the claims of the authors, the gain in data throughput for

Montgomery multiplication is approximately 45.49% (for 1024-bit length)

and the hardware reduction is 24.27% of the traditional methods.


Modified radix-4 modular multiplication based on Booth's

multiplication technique was presented in Ref. [77, 78]. The authors have

used Carry Save Adder to avoid carry propagation. Bayhan D. et al.

analysed and compared the Montgomery multiplication algorithms for

their power dissipation on FPGA devices [79]. Among various

architectures proposed for Montgomery multiplication, parallel,

sequential and systolic variants were considered [80].

Sever R. et al. presented a high speed, non-pipelined FPGA

implementation of the Rijndael algorithm [81]. They have implemented

both the encryption and the decryption algorithms of Rijndael on the

same FPGA. Kshirsagar R. V. et al. proposed a high data throughput

AES hardware architecture by partitioning into sub-blocks of repeated

AES modules [82]. However, they have not applied their architecture for

Public Key Cryptosystem.

Sushanta Kumar et al. [83] presented architecture and modeling

of RSA public key encryption/decryption systems. It supports multiple

key sizes of 128 bits, 256 bits, and 512 bits. Tanimura, K. et al.

proposed a scalable unified dual-radix architecture for Montgomery

multiplication in GF(p) and GF(2^n) [84]. Schinianakis D. et al. describe a

methodology for incorporating Polynomial Residue Arithmetic in the

Montgomery multiplication algorithm for polynomials in GF(2^n) [85].

Talapatra S et al. define a scalable VLSI multiplication architecture


based on Montgomery multiplication (MM) algorithm for elliptic curve

cryptography (ECC) over GF(p^m) [86]. Satzoda R.K. et al. proposed a

new scalable and pipelined Montgomery multiplier architecture that

unifies the two important finite fields, GF(p) and GF(2^m) [87].

Fournaris A.P. presented the first complete attempt at an efficient

design of a fault and simple power attack resistant RSA [88]. This

algorithm was based on the Montgomery modular multiplication. Faster

multiplication scheme was adopted by Thapliyal H. et al. to improve the

efficiency of the public key encryption systems like RSA and ECC [89].

Modified Montgomery multiplications and circuit architectures were

presented in this paper. Mohammadi M. et al. proposed Montgomery

modular multiplication based on Residue Number System (RNS) [90].

Hamming weight based and reverse converter based moduli were

implemented. The proposed architecture encompasses fast RNS to

RNS converter, choosing appropriate moduli sets.

Haining Fan et al. presented a new parallel multiplier for

irreducible trinomials [91]. Koç C. K. et al. explained Montgomery

multiplication methods [92]. Several high-speed, space-efficient

algorithms for computing the Montgomery product MonPro(a, b)

were described. McIvor C. et al. proposed new FPGA architectures

for the ordinary Montgomery multiplication algorithm and the Finely

Integrated Operand Scanning modular multiplication algorithms [93].


Venkatasubramani V. R. and S. Rajaram [94] propose Modified

Montgomery Modular Multiplication algorithms that reduce the number

of computational operations such as the number of additions, memory

reads and writes involved in the existing algorithms, thereby, saving

considerable time and area for execution. In Ref. [89], carry save adders

are used, while Finely Integrated Operand Scanning modular multiplication

algorithms are used in Ref. [93]. In Ref. [94], the number of

computational operations is reduced. McIvor C. et al. [95, 96] presented

Modified Montgomery multiplication and associated RSA modular

exponentiation algorithms and circuit architectures. These modified

multipliers use carry save adders (CSAs) to perform large word length

additions.
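
The idea behind carry save addition can be illustrated with a short Python sketch. This is a generic 3-to-2 compressor, not the circuit of Ref. [95, 96]; the function name is hypothetical. Three operands are reduced to a sum word and a carry word without propagating carries across bit positions, so the delay does not grow with word length; a single conventional addition is needed only at the end.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to (partial_sum, carry).

    Each bit position acts as an independent full adder:
    the sum bit is the XOR of the inputs, the carry bit is their
    majority, shifted one position to the left.
    The identity a + b + c == partial_sum + carry always holds.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return partial_sum, carry


def add_many(values):
    """Accumulate a list in carry-save form, resolving carries only once."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # the only carry-propagating addition
```

In a Montgomery multiplier the running sum is typically kept in this redundant (sum, carry) form across all iterations.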

Fournaris A.P. et al. [97] proposed two Finite Field multiplier

architectures and their VLSI implementations using Montgomery

Multiplication Algorithm. The first architecture, called the Folded

architecture, is optimized to minimize the silicon

area (gate count), and the second architecture, which uses pipelining, is

optimized to reduce the multiplication time delay. Both

architectures are measured in terms of gate count (or chip covered

area) and multiplication time delay. Design and implementation of a

real-time FPGA based Network security application was proposed by

Paul R. et al. [98]. They have used the RSA encryption and decryption


algorithm for implementation, as security is one of the most important

needs for data communication. Pellegrini A. et al. developed a

theoretical attack to the RSA signature algorithm and realized it using

an FPGA [99]. Kamal et al. improved the NTRU encryption algorithm,

known as NTRUEncrypt [100].

Marcelo E. Kaihara and Naofumi Takagi proposed a new fast

method for computing modular multiplication [101, 102]. The calculation

is performed using residue classes modulo M, which enables splitting of

the multiplier into two parts. These parts are then processed separately

in parallel, potentially doubling the calculation speed. The upper part

and the lower part of the multiplier are processed separately using the

interleaved modular multiplication algorithm and the Montgomery

algorithm respectively. This technique can also easily be adopted for

operation in the binary extended field GF(2^m).
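
The splitting idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm of Ref. [101, 102] as published: the function names and the split point k are hypothetical, and the two halves are evaluated one after the other here, whereas in hardware they would run in parallel. With R = 2^k and the multiplier written as y = y_high·R + y_low, the product x·y·R^(-1) mod M equals x·y_high + x·y_low·R^(-1) mod M, so the upper part needs only interleaved modular multiplication and the lower part only Montgomery multiplication.

```python
def interleaved_modmul(x, y, m):
    """Classical MSB-first interleaved modular multiplication: x*y mod m."""
    s = 0
    for i in reversed(range(y.bit_length())):
        s = (2 * s + ((y >> i) & 1) * x) % m
    return s


def montgomery_part(x, y_low, m, k):
    """LSB-first Montgomery multiplication: x*y_low*2**(-k) mod m (m odd)."""
    s = 0
    for i in range(k):
        s += ((y_low >> i) & 1) * x
        if s & 1:
            s += m
        s >>= 1
    return s % m


def bipartite_modmul(x, y, m, k):
    """Return x*y*2**(-k) mod m by splitting the multiplier y at bit k."""
    y_high, y_low = y >> k, y & ((1 << k) - 1)
    upper = interleaved_modmul(x, y_high, m)   # processes the upper bits
    lower = montgomery_part(x, y_low, m, k)    # processes the lower k bits
    return (upper + lower) % m
```

Because each half handles roughly half of the multiplier bits, the two loops are about half as long as a single full-length multiplication.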

Kazuo Sakiyama et al. present a new modular multiplication

algorithm that allows one to implement modular multiplications

efficiently [103, 104]. They propose a systematic approach to

modular multiplication that maximizes the level of parallelism. The

proposed algorithm combines several different existing algorithms, a

classical modular multiplication built on Barrett reduction, the modular

multiplication with Montgomery reduction and the Karatsuba


multiplication algorithms to reduce the computational complexity and

increase the potential for parallel processing. This algorithm is

suitable for both software and hardware implementations in a

multiprocessor environment. Reconfigurable curve based crypto-

processor that accelerates scalar multiplication of Elliptic Curve

Cryptography (ECC) and Hyper Elliptic Curve Cryptography (HECC) of

genus 2 over GF(2^n) was presented by Kazuo Sakiyama et al. [104]. N.

Costigan and P. Schwabe proposed fast elliptic curve cryptography

[105].

Junfeng Fan et al. investigate efficient software implementations

of the Montgomery modular multiplication algorithm ported on a multi-

core system [106]. A hardware/software co-design technique

was used to find an efficient system architecture and the instruction

scheduling method. Cetin Kaya Koc et al. proposed several

Montgomery multiplication algorithms [107, 108]. These algorithms have

been implemented in C and also using assembly language. Colin D.

Walter proposed an optimal upper bound for the number of iterations in

Montgomery Modular multiplication [109].

A new version of Montgomery's algorithm for modular

multiplication of large integers and its implementation in hardware was

presented by Viktor Bunimov et al. [110]. The algorithm is superior to

Montgomery's original method by a factor of 2, with respect to


both chip area and latency. The new method has a simple structure. It

requires a small amount of pre-computation and storage in order to

reduce the number of necessary additions by a factor of 2. A GF(2^n)

Montgomery multiplier using polynomial arithmetic was presented by

Skavantzos et al. [111]. A fast pipelined RSA architecture based on the

Montgomery algorithm was proposed by Iput Heri et al. [112]. Digital

signatures using RSA and other public key cryptosystems were

presented in [113] by Dorothy E. Denning et al. RSA using 3D graphics

hardware was proposed by A. Moss et al. [114].

Some of the work presented earlier has come up with certain

good proposals for data security. However, those cannot be considered

as the optimum approach for data security. Some of the predominant

issues such as one-way encryption and high computational complexity

still exist in the majority of existing work. In order to accomplish a better and

optimized system for multiple input and multiple output applications, it is

required to enhance the system model. Hence in the present work, a

robust Commutative RSA Algorithm and Architecture has been

developed so that it may be realized efficiently in hardware. Serial

and parallel Montgomery Multiplication algorithms were developed. To

reduce the key exchange overheads in public key cryptosystems, CRSA

algorithm with key generation has been designed.


Chapter 3

Development of Commutative RSA Algorithm

A detailed literature survey was carried out and the findings were

presented in Chapter 2. In this chapter, RSA algorithm, its commutative

nature, serial and parallel Montgomery multipliers are presented.

3.1 Introduction

Commutative RSA algorithm is designed and the developed

module has been incorporated with multiple FPGA cores. In order to

implement the developed CRSA algorithm on FPGA cores, two different

models based on serial Montgomery and parallel Montgomery have

been developed. Initially Serial Montgomery based CRSA algorithm is

implemented and in later stages the Parallelized Montgomery has been

employed and is explained in detail in the following sections. The

parallelized Montgomery has exhibited better results as compared to

serial Montgomery implementation and therefore in final stage,

parallelized Montgomery architecture with higher radix multiplication is

used to improve the execution time of CRSA with multiple FPGA cores

[115]. This chapter mainly presents the system development and

discussion about the algorithmic development for


Commutative RSA cryptosystem. The detailed algorithm development and

its implementation have been discussed in the following section.

3.2 System Model

RSA is considered as an efficient and optimized solution for

public-key cryptography [1]. In most of the existing systems, data

authentication or security is accomplished by a key exchange approach.

This increases the key exchange overheads. In MIMO or multiple

transceiver systems, at every transceiver terminal, encryption and

decryption are required. If a general RSA approach is applied, the data

authentication and security could be violated. Therefore, to achieve the

goal of data security with individual encryption and decryption without

affecting the integrity and data security, a modified RSA has been

developed, known as Commutative RSA. This chapter discusses the initial stage

of the system development, specifically

the highly robust and optimized system architecture used to

implement the Commutative RSA algorithm for data authentication among

MIMO terminals or multiple FPGA cores.

The first step considered was to develop a commutative behavior

based RSA algorithm. Commutative behavior means that the

order in which the encryption is done does not affect the result if the

decryption is accomplished in the reverse order. An advantage of

CRSA is that each user generates his or her own key, so there are no key

exchange overheads. The mathematically formulated scheme for RSA

has been converted into unitary algorithms and the overall system

model has been enriched with serial as well as parallel Montgomery

multipliers and the complete system model has been realized with

MIMO transceivers or FPGA distributed cores. In the following section,

the implementation of CRSA with serial and parallel Montgomery with

Radix 2 multiplication has been presented, that would be followed by

ultimate parallelized Montgomery multiplication based CRSA

implementation. The algorithmic development, enhancements and

ultimately its implementation with distributed FPGA cores has been

presented in the following section.

3.3 Commutative RSA

The primary objective of the presented thesis is to facilitate a

highly efficient and robust authentication or security system for MIMO

transceiver based data communication applications. For such

applications, key management and hardware compatibility become

prime concerns, which must be addressed to provide an

optimum solution for secure data communication. One limitation called

“one-way encryption” might cause extra overheads on the system


function in case of multiple user scenarios. Therefore, the consideration

of commutative behavior might play significant role in system

performance optimization and overhead reduction for distributed cores.

Hence, in the initial phase of this research work, the commutative RSA

has been implemented.

A secure plane is realizable only if the data communicated over the plane is secured against collusion. A cryptographic algorithm is generally preferred for this purpose; hence the proposed Secure Multi FPGA Communication Protocol (SMFCP) adopts the commutative RSA algorithm. The SMFCP considers two prime numbers $p$ and $q$, which are initialized amongst all the group members. Let $A$ and $B$ represent the group members required to connect over the secure plane. In order to compute the encryption and decryption key pairs of the commutative RSA algorithm, the parameters $\phi_{CRSA}$ and $N_{CRSA}$ are computed using the following:

$N_{CRSA} = p \times q$    (3.1)

$\phi_{CRSA} = (p - 1) \times (q - 1)$    (3.2)

The encryption key pairs of A and B, represented as $(N^{A}_{CRSA}, E^{A}_{CRSA})$ and $(N^{B}_{CRSA}, E^{B}_{CRSA})$ respectively, need to be computed. The exponent $E_{CRSA}$ is computed by randomly selecting a number that is co-prime to $\phi_{CRSA}$, or in other terms

$\gcd(E_{CRSA}, \phi_{CRSA}) = 1$    (3.3)

where $\gcd(x, y)$ represents the greatest common divisor function between two variables “x” and “y”.

The decryption key pairs of A and B are represented by $(N^{A}_{CRSA}, D^{A}_{CRSA})$ and $(N^{B}_{CRSA}, D^{B}_{CRSA})$; the exponent $D_{CRSA}$ is computed based on the following equation:

$D_{CRSA} \times E_{CRSA} \equiv 1 \pmod{\phi_{CRSA}}$    (3.4)

where $D^{A}_{CRSA}$ and $D^{B}_{CRSA}$ are the commutative RSA decryption keys of user A and user B respectively.

Let $Enc_X$ represent the encrypted data of plain text “X”. The encryption operation is defined as follows:

$Enc_X = X^{E_{CRSA}} \bmod N_{CRSA}$    (3.5)

The commutative RSA decryption operation on the encrypted data $Enc_X$ is defined as

$X = (Enc_X)^{D_{CRSA}} \bmod N_{CRSA}$    (3.6)
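The key set-up and the encryption and decryption operations of equations (3.1) to (3.6) can be sketched in a few lines of Python. This is a minimal illustration only: the primes 61 and 53, the exponent 17 and the function names `make_keys`, `encrypt` and `decrypt` are assumptions chosen for readability and do not come from the thesis, whose keys are hundreds of bits long.

```python
from math import gcd

# Toy primes, assumed for illustration only.
p, q = 61, 53
N = p * q                    # eq. (3.1)
PHI = (p - 1) * (q - 1)      # eq. (3.2)

def make_keys(e):
    """Return (E, D) for a candidate exponent e that is co-prime to PHI."""
    if gcd(e, PHI) != 1:     # eq. (3.3)
        raise ValueError("e must be co-prime to phi")
    d = pow(e, -1, PHI)      # eq. (3.4): D * E = 1 (mod phi); needs Python 3.8+
    return e, d

def encrypt(x, e):
    return pow(x, e, N)      # eq. (3.5)

def decrypt(c, d):
    return pow(c, d, N)      # eq. (3.6)

E_A, D_A = make_keys(17)     # hypothetical exponent for user A
X = 1234                     # plain text, must be smaller than N
assert decrypt(encrypt(X, E_A), D_A) == X
```

Python's built-in three-argument `pow` is used here in place of the Montgomery based exponentiation that the hardware design employs.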


3.4 Commutative Nature of RSA Algorithm

After developing the commutative RSA algorithm, its function must be verified for real-time data security. In this section, the proof of the commutative property of CRSA is presented. For simplicity, only two users, “A” and “B”, are considered. The commutative property of the RSA algorithm employed in multiple input multiple output applications is proved if data “X” encrypted by user “A” first and then by user “B” delivers the same result when the order of encryption is changed.

$Enc_B(Enc_A(X)) = (X^{E^{A}_{CRSA}} \bmod N_{CRSA})^{E^{B}_{CRSA}} \bmod N_{CRSA}$    (3.7)

$= X^{E^{A}_{CRSA} E^{B}_{CRSA}} \bmod N_{CRSA}$    (3.8)

$= X^{E^{B}_{CRSA} E^{A}_{CRSA}} \bmod N_{CRSA}$    (3.9)

As it can be concluded that

$X^{E^{B}_{CRSA} E^{A}_{CRSA}} \bmod N_{CRSA} = (X^{E^{B}_{CRSA}} \bmod N_{CRSA})^{E^{A}_{CRSA}} \bmod N_{CRSA} = Enc_A(Enc_B(X))$    (3.10)

$Enc_B(Enc_A(X)) = Enc_A(Enc_B(X))$    (3.11)

Thus the commutative nature of RSA is justified.
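The commutative property can also be checked numerically. The sketch below assumes two users who share the toy primes 61 and 53 and use the hypothetical exponents 17 and 23; none of these values come from the thesis. It encrypts in both orders and then removes the two encryption layers in either order.

```python
# Shared toy parameters (assumed values, for illustration only).
p, q = 61, 53
N, PHI = p * q, (p - 1) * (q - 1)

# Hypothetical exponents for users A and B, both co-prime to PHI.
E_A, E_B = 17, 23
D_A, D_B = pow(E_A, -1, PHI), pow(E_B, -1, PHI)

def enc(x, e):
    return pow(x, e, N)

def dec(c, d):
    return pow(c, d, N)

X = 1234
a_then_b = enc(enc(X, E_A), E_B)
b_then_a = enc(enc(X, E_B), E_A)

# Same cipher text regardless of the order of encryption.
assert a_then_b == b_then_a
# The layers can be removed in either order.
assert dec(dec(a_then_b, D_A), D_B) == X
assert dec(dec(a_then_b, D_B), D_A) == X
```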


3.5 Commutative RSA Implementation with Serial and Parallel

Montgomery Multiplication

The RSA algorithm relies on a long sequence of multiplication operations, and this becomes more complicated if the system has to deal with multiple users or multiple MIMO transceivers. A system is therefore sought that functions with minimum overheads and computational cost. To enhance system performance, Montgomery multiplication is commonly advocated. Considering the prime objective of this research work, for MIMO or multiple transceiver based communication systems, the enhanced RSA, called the Commutative RSA cryptography core, has been developed and realized on multiple FPGA devices. To ensure optimum performance, a streamlined architecture for Montgomery modular multiplication based on Radix-2 has been developed. Such implementations reduce memory occupancy and increase the speed of operation. A brief description of the implemented approaches is presented in the following sections.

3.5.1 Modular Exponentiation

Modular exponentiation can be reduced to a series of modular multiplication and squaring operations using the square and multiply algorithm. The algorithm scans the bits of the exponent from left (most significant bit) to right (least significant bit). In every iteration, i.e., for every exponent bit, the current result is squared; if and only if the currently scanned exponent bit has the value “1”, the squaring is followed by a multiplication of the current result by “A”. This algorithm can be represented in pseudo code as shown in Figure 3.1.

Algorithm 3.1

Input : A, b, n

Output : C = A^b (mod n)

Let “b” contain “k” bits, b_{k-1} ... b_0

If b_{k-1} = 1,

then C = A;

else C = 1;

For i = k-2 down to 0

C = C x C (mod n);

If b_i = 1 then

C = C x A (mod n)

Return C

Figure 3.1: Square and Multiply Algorithm
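The left-to-right scan of Figure 3.1 can be written as a short Python sketch. The function name `square_and_multiply` is an assumption for illustration, and ordinary integer arithmetic stands in for the hardware multiplier.

```python
def square_and_multiply(a, b, n):
    """Left-to-right binary exponentiation: returns a**b mod n."""
    if b == 0:
        return 1 % n
    bits = bin(b)[2:]        # exponent bits, most significant first
    c = a % n                # the leading bit is always 1, so C starts at A
    for bit in bits[1:]:     # remaining bits b_{k-2} ... b_0
        c = (c * c) % n      # square for every exponent bit
        if bit == "1":
            c = (c * a) % n  # multiply only when the bit is 1
    return c

# Cross-check against Python's built-in modular exponentiation.
assert square_and_multiply(7, 560, 561) == pow(7, 560, 561)
```

For a k-bit exponent the loop performs k-1 squarings and one multiplication per non-zero bit after the leading one, which is why the number of ones in the exponent affects the run time.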


3.5.2 Modular Multiplication

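The modular multiplication problem is defined as the computation of P = [(A × B) mod n], where “A”, “B”, and “n” are integers. It is assumed that “A” and “B” are non-negative integers with 0 ≤ A, B < n.

Let A_i and B_i be the “ith” bits of A and B, respectively, and let “k” be the number of bits of the odd modulus “n”. The algorithm is stated as follows:

Algorithm 3.2

Input : A, B, n

Output : M = (A x B x 2^-k) mod n

M = 0;

For i = 0 to k-1

M = M + (A x B_i);

If M_0 = 0

M = M/2;

Else

M = (M+n)/2;

If M ≥ n then M = M - n;

Return M;

Figure 3.2: Algorithm for Modular Multiplication.

Because each iteration halves the partial sum, the value returned is A × B × 2^-k (mod n); the ordinary product (A × B) mod n is recovered through the domain conversion described in Section 3.5.3.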


3.5.3 Montgomery Multiplication Algorithm

The basic operation of the RSA algorithm is modular exponentiation on large integers, i.e. “B = A^E (mod N)”. This computation is used for encryption, decryption and digital signatures. The security level of an RSA cryptosystem depends on the length of the modulus “N”. A modulus of at least 768 bits is recommended. All operands involved in the computation of modular exponentiation normally have the same size as the modulus.

For computing “A^E (mod N)”, all existing techniques reduce modular exponentiation to a sequence of modular multiplications, so the efficiency of the exponentiation depends on the efficiency of the modular multiplication. However, modular multiplication is a complex arithmetic operation. Several algorithms have been proposed for achieving efficient implementations of modular multiplication. Blakley's method [5] and Montgomery's method [6] are the most studied ones and are suitable for practical hardware implementation [7]. Both methods perform modular reduction during the multiplication process, and neither requires a division operation at any point. Blakley's method, however, needs a comparison between two large integers at each step of the modular multiplication process, whereas Montgomery's method does not require any comparison. This is achieved by representing the operands as residue classes modulo N. Montgomery's technique therefore requires pre-processing and post-processing steps to convert the numbers to and from the residue based representation, but the cost of these steps is negligible when many consecutive modular multiplications are to be executed, as in the case of RSA. This is the reason why Montgomery's method is considered the most efficient algorithm for implementing RSA operations. There are several versions of Montgomery's algorithm, depending on the number “r” used as the radix for the representation of numbers; in hardware implementations “r” is always a power of 2.

The Montgomery algorithm [6] uses simple divisions by a power of two instead of the divisions by M used in conventional modular operation. Montgomery's modular multiplication algorithm uses only addition, Boolean and shift operations and thereby avoids trial division, which is the critical and time-consuming operation in conventional modular multiplication. The disadvantage is the need to convert operands into and out of Montgomery's domain. However, this latency is almost negligible in cryptosystems, where many Montgomery multiplications are performed between conversions. The Multiple-Word Radix-2 Montgomery Multiplication algorithm characterizes a classic architecture for implementing Montgomery multiplication in hardware. With properties streamlined for least time delay, this architecture performs a single Montgomery multiplication in approximately “2n” clock cycles, where “n” is the width of the operands in bits.

For “M” being an odd integer, computing “A·B (mod M)” is an essential operation in many cryptosystems, such as RSA. The reduction modulo “M” is a highly time-consuming step compared to the multiplication of “A” and “B” without reduction. Montgomery introduced a method for calculating “products (mod M)” without the time-consuming reduction (mod M). Montgomery multiplication of A and B (mod M), denoted by MP (A, B, M), is defined as $A \cdot B \cdot 2^{-n} \pmod{M}$ for some integer “n”. As Montgomery multiplication is not an ordinary multiplication, there is a conversion process between the ordinary domain (with ordinary multiplication) and the Montgomery domain. The conversion between the two domains is given by the relation A ↔ A', where $A' = A \cdot 2^{n} \pmod{M}$.

Mathematically, it can be given as:

$A' = A \cdot 2^{n} \pmod{M}$    (3.12)

$MP(A', B', M) = A' \cdot B' \cdot 2^{-n} = (A \cdot B) \cdot 2^{n} = (A \cdot B)' \pmod{M}$    (3.13)

The conversion between the domains can be executed using the same Montgomery operation, in particular $A' = MP(A, 2^{2n} \bmod M, M)$ and $A = MP(A', 1, M)$, where $2^{2n} \bmod M$ can be pre-computed. Despite the initial conversion delay, there is a significant advantage over ordinary multiplication when many Montgomery multiplications are performed, followed by an inverse conversion at the end.
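The conversion into and out of the Montgomery domain can be illustrated with a small reference model. In the sketch below, `mp` evaluates the definition A·B·2^-n mod M directly with a modular inverse; it is a mathematical stand-in and not the shift-and-add hardware algorithm. The modulus 239 and the helper names `to_mont` and `from_mont` are assumptions for illustration.

```python
def mp(a, b, m, n):
    """Reference model of MP(A, B, M) = A*B*2^-n mod M (needs Python 3.8+)."""
    return (a * b * pow(2, -n, m)) % m

M = 239                     # odd toy modulus (assumed value)
n = M.bit_length()          # size of M in bits
R2 = pow(2, 2 * n, M)       # pre-computed constant 2^(2n) mod M

def to_mont(a):
    return mp(a, R2, M, n)  # A' = MP(A, 2^(2n) mod M, M) = A * 2^n mod M

def from_mont(a_m):
    return mp(a_m, 1, M, n) # A = MP(A', 1, M)

A, B = 123, 200
A_m, B_m = to_mont(A), to_mont(B)
assert A_m == (A * pow(2, n, M)) % M

# One multiplication inside the Montgomery domain, then convert back.
C_m = mp(A_m, B_m, M, n)
assert from_mont(C_m) == (A * B) % M
```

Only the first and last steps pay the conversion cost; every intermediate product stays in the Montgomery domain, which is what makes the scheme attractive for long exponentiations.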

3.5.4 Radix-2 Modular Multiplier

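The optimized algorithm for the Radix-2 modular multiplier for Montgomery multiplication is presented below.

Algorithm 3.3

Input : odd M, $n = \lfloor \log_2 M \rfloor + 1$, $A = \sum_{i=0}^{n-1} A_i 2^i$, with $0 \le A, B < M$    (3.14)

Output :

C = MP (A, B, M) = $A \cdot B \cdot 2^{-n} \pmod{M}$, $0 \le C < M$    (3.15)

1.1 X[0] = 0

1.2 for i = 0 to n-1 do

1.3 q_i = (X[i] + A_i B) mod 2

1.4 X[i+1] = (X[i] + A_i B + q_i M) / 2

1.5 if X[n] ≥ M then X[n] = X[n] - M

1.6 return C = X[n]

Figure 3.3: Algorithm for Radix-2 Modular Multiplication

The above algorithm presents the pseudocode for the Radix-2 Montgomery multiplication, where $A = \sum_{i=0}^{n-1} A_i 2^i$ and “n” is the size of “M” in bits.

The verification of the above algorithm is presented as follows:

Consider X[i] given as

$X[i] \equiv \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} A_j 2^{j} \right) B \pmod{M}$    (3.16)

with X[0] = 0. Then $X[n] \equiv A \cdot B \cdot 2^{-n} \pmod{M} = MP(A, B, M)$ can be computed iteratively using the following dependence:

$X[i+1] \equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i} A_j 2^{j} \right) B$    (3.17)

$\equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i-1} A_j 2^{j} + A_i 2^{i} \right) B$    (3.18)

$\equiv \frac{1}{2} \left( \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} A_j 2^{j} \right) B + A_i B \right)$    (3.19)

$\equiv \frac{1}{2} \left( X[i] + A_i B \right) \pmod{M}$    (3.20)

Hence, depending on the parity of $X[i] + A_i B$, X[i+1] is computed as $(X[i] + A_i B)/2$ or $(X[i] + A_i B + M)/2$, so as to make the numerator divisible by 2. Since $B < M$ and $X[0] = 0$, one has $0 \le X[i] < 2M$ for all $0 \le i \le n$. In [109] and [110] it has been shown that the result of Montgomery multiplication satisfies

$X[n] < 2M$ when $A, B < 2M$ and $2^{n} > 4M$    (3.21)

As a result, by redefining “n” to be the least integer such that $2^{n} > 4M$, the subtraction at the final step of the presented algorithm can be sidestepped and the output of the multiplication can be used directly as an input for the next Montgomery multiplication.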
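The bound that removes the final subtraction can be checked empirically. The following sketch implements the radix-2 iteration with shifts and additions only and omits the final subtraction; the function name `mont_radix2` and the modulus 239 are assumptions for illustration. With n chosen as the least integer satisfying 2^n > 4M, operands below 2M are expected to give results below 2M that are congruent to A·B·2^-n modulo M.

```python
import random

def mont_radix2(a, b, m, n):
    """Radix-2 Montgomery product without the final subtraction.

    Returns a value congruent to a*b*2^-n (mod m); m must be odd and a < 2^n.
    """
    x = 0
    for i in range(n):
        x += ((a >> i) & 1) * b   # add A_i * B
        if x & 1:                 # q_i = 1: add M so the sum becomes even
            x += m
        x >>= 1                   # exact division by 2
    return x

M = 239                           # odd toy modulus (assumed value)
n = (4 * M).bit_length()          # least n with 2^n > 4M

random.seed(1)
for _ in range(1000):
    a, b = random.randrange(2 * M), random.randrange(2 * M)
    x = mont_radix2(a, b, M, n)
    assert x < 2 * M                                  # output is a valid next input
    assert x % M == (a * b * pow(2, -n, M)) % M       # correct residue class
```

Keeping intermediate results in the range [0, 2M) costs a couple of extra iterations per multiplication but removes the comparison and subtraction from every step of an exponentiation; only the very last result needs to be reduced below M.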

3.6 Modular Multiplication Algorithms

In RSA, the encryption key is a pair of positive integers (e, n) and the decryption key is likewise a pair of positive integers (d, n). To encrypt a message using the key (e, n), the following structural approach has been implemented.

The sequential binary modular exponentiation algorithm is presented as Algorithm 3.4 in Figure 3.4. Here “m” represents the number of bits in the exponent “E”. The multiplication of line 4 of the algorithm depends on the result of the squaring of line 3.


3.6.1 Algorithm for Sequential Binary (T, E, M)

Algorithm 3.4

int R := 1;

1 if e_{m-1} = 1, then R := T;

2 for i = (m-2) downto 0 do

3 R := R x R (mod M);

4 if e_i = 1, then R := R x T (mod M);

5 return R;

end

Figure 3.4: Algorithm for Sequential Binary Modular Multiplication

The serial Montgomery multiplier is presented in Figure 3.5. It includes two Montgomery modular multipliers, SerialMMM1 and SerialMMM2, and uses five registers: one for each of T, M, E, RxR (SQUARE) and RxT (MPRODUCT). Depending on the length of the exponent “E”, the CONTROLLER controls the number of iterations required to perform the exponentiation. The exponent “E” is stored in a shift register. The two multipliers SerialMMM1 and SerialMMM2 are not strictly necessary: the same multiplier can be reused to perform the squaring step of line 3 and subsequently the multiplication step of line 4 of Algorithm 3.4. This reduces the hardware area but lowers the encryption/decryption throughput.


Figure 3.5: Serial Montgomery Multiplier

The parallel algorithm for binary exponentiation is given as Algorithm 3.5 in Figure 3.6. The parallel algorithm uses an extra register P, so that the squaring and the multiplication of each iteration no longer depend on each other and can proceed concurrently. When every exponent bit is one, the parallel algorithm reduces the encryption/decryption time to half that of the serial algorithm; in the average case, when the exponent consists of half ones and half zeros, the execution time is improved by a factor of 1.5. This gain comes at the cost of almost twice the circuitry needed to implement the serial algorithm.

3.6.2 Algorithm for Parallel Binary (T, E, M)

Algorithm 3.5

1. R(0) = 1, P(0) = T;

2. for i = 0 to (m-1) do

3. P(i+1) = P(i) X P(i) (mod M);

4. if ei = 1, then R(i+1) = R(i) X P(i) (mod M)

5. else R(i+1) = R(i)

6. return R(m)

7. end

Figure 3.6: Algorithm for Parallel Binary Modular Multiplication.
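The right-to-left scan of Algorithm 3.5 can be sketched in Python as follows. The function name `parallel_binary_exp` is an assumption; software executes the two products one after the other, whereas in hardware they can be issued to two multipliers in the same iteration because neither uses the other's result.

```python
def parallel_binary_exp(t, e, m):
    """Right-to-left binary exponentiation: returns t**e mod m."""
    r, p = 1 % m, t % m          # R(0) = 1, P(0) = T
    while e:
        next_p = (p * p) % m     # line 3: P(i+1) = P(i) x P(i)
        if e & 1:
            r = (r * p) % m      # line 4: R(i+1) = R(i) x P(i), uses the old P(i)
        p = next_p
        e >>= 1                  # move to the next exponent bit e_i
    return r

# Cross-check against Python's built-in modular exponentiation.
assert parallel_binary_exp(1234, 17, 3233) == pow(1234, 17, 3233)
```

Both products read only P(i) and R(i), which is the data independence that the parallel architecture exploits.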

The hardware architecture of the parallel modular multiplier is

shown in figure 3.7 below. It includes two Montgomery modular

multipliers SerialMMM1 and SerialMMM2. It uses eight registers: one

for each of T, M, E, R(i) (SQUAREi), P(i)(MPRODUCTi),

R(i+1)(SQUAREi+1) and P(i+1) (MPRODUCTi+1). Depending on the length

of E, the CONTROLLER controls the number of iterations required to

perform the exponentiation. Exponent “E” is stored in a shift register.


Figure 3.7: Parallel Montgomery Multiplier

In this research work, a Radix-2 modular multiplier based multiplication architecture is implemented. A brief description of the employed algorithm is given below:

In the initial stage, the commutative RSA operands are fed serially into shift registers via an input buffer. While the message “M” is being loaded into its register, the exponent register is shifted up until its first non-zero bit is in the most significant position, and the number of bits of the exponent, log2 E, is counted. After this initial stage, the multiplier begins operating. As soon as the first output bit of the multiplier is complete, the Montgomery module starts operating. Hence, the execution times of the multiplier, the Carry Propagation Adder and the Montgomery module nearly overlap, and the functional units of the design are fully utilized during computation.

Summary

In this chapter, the commutative nature of RSA has been proved and the designs of the serial and parallel Montgomery multipliers have been presented. The design of the commutative cryptography core with key generation, the ASM charts for finding the prime number and for key generation, the flow chart for finding the GCD of two numbers, and the algorithms for parallel multiplication are presented in the next chapter.


Chapter 4

Development of Algorithm for Commutative Cryptography Core

with Key Generation

The design of serial Montgomery multiplier and parallel

Montgomery multiplier, RSA algorithm and its commutative property

have been presented in Chapter 3. In this chapter, design of

commutative cryptography core with key generation and ASM chart for

the key generation are presented. These functions have been coded in

VHDL and simulated using Modelsim.

4.1 Sequential Implementation of Key Generation

In this section, the mathematical modeling and its sequential

implementation for key generation are presented. In the implemented

system, the first phase is to generate pseudo random numbers. For this

purpose, the Linear Feedback Shift Register (LFSR) has been

employed and more precisely the Fibonacci LFSR is taken into

consideration.

4.1.1 Linear Feedback Shift Register

Linear-feedback shift register is a shift register whose input bit is

a linear function of its previous state. Exclusive-or (XOR) function is


commonly used to get the linear function of single bits. Hence, an LFSR

is a shift register, whose input bit is derived by the XOR of some bits of

the complete shift register value.

4.1.2 Fibonacci LFSRs

The feedback tap numbers in Fibonacci LFSR correspond to a

primitive polynomial. The initial value of the Linear Feedback Shift

Register is called the seed. The rightmost bit is generally the output bit.

The taps are XORed sequentially with the output bit and are provided

as feedback at the leftmost bit. The sequence of bits in the rightmost

position provides the output stream.
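A Fibonacci LFSR can be sketched in software as below. The thesis uses 512-bit registers; this illustration assumes a 16-bit register with the commonly cited tap positions 16, 14, 13 and 11, and the function name and seed value are likewise assumptions.

```python
def fibonacci_lfsr16(state):
    """One step of a 16-bit Fibonacci LFSR with taps 16, 14, 13, 11.

    Returns (new_state, output_bit). The all-zero state must be avoided.
    """
    out = state & 1                                            # rightmost bit is the output
    fb = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1  # XOR of the tapped bits
    state = (state >> 1) | (fb << 15)                          # feedback enters at the leftmost bit
    return state, out

def stream(seed, count):
    """Collect `count` output bits starting from `seed`."""
    bits, s = [], seed
    for _ in range(count):
        s, b = fibonacci_lfsr16(s)
        bits.append(b)
    return bits

print(stream(0xACE1, 16))
```

The seed only selects the starting point within the sequence; the tap positions determine the period.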

4.1.3 Galois LFSR

This is an LFSR in Galois configuration, also known as a modular, internal-XOR or one-to-many LFSR. It is named after the French mathematician Évariste Galois. In contrast to the Fibonacci LFSR discussed previously, the taps here are XORed with the output bit before they are stored in the next position. The current output bit is the

forthcoming input bit. The result of this is that when the output bit is zero

all the bits in the register shift to the right without any alteration, and the

input bit becomes zero. When the output bit is one, the bits in the tap

positions will flip and complement themselves and then the entire



register is shifted to the right and the input bit becomes “1”.
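The Galois behaviour described above, where the register shifts unchanged when the output bit is zero and the tap positions are complemented when it is one, can be sketched as follows. A 16-bit register and the toggle mask 0xB400 (tap positions 16, 14, 13 and 11) are assumed for illustration; the order of the shift and the XOR is a software convenience that gives the same next state.

```python
def galois_lfsr16(state):
    """One step of a 16-bit Galois LFSR with toggle mask 0xB400.

    Returns (new_state, output_bit). The all-zero state must be avoided.
    """
    out = state & 1          # current output bit
    state >>= 1              # shift the whole register to the right
    if out:
        state ^= 0xB400      # output bit 1: flip the tap positions, input bit becomes 1
    return state, out

s = 0xACE1                   # assumed non-zero seed
for _ in range(8):
    s, bit = galois_lfsr16(s)
    print(bit, format(s, "016b"))
```

All tap XORs happen in the same step and in parallel, so no chain of XOR gates is needed to form the feedback bit.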

Fibonacci LFSR facilitates a better scenario for hardware or

multiple core implementations for distributed CRSA core. It can produce

sequences of large period with good statistical properties and because

of its structure, it can be analysed using algebraic techniques.

4.1.4 Primality Test

A primality test is an algorithm for determining whether an input

number is prime or not. Primality tests do not give prime factors, but

they will check whether the input number is prime or not.
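As a compact software illustration of a primality test, the sketch below uses the Miller-Rabin test. This is a swapped-in technique chosen for brevity, not the polynomial-based procedure that the thesis implements; the function name and the fixed list of bases are assumptions, and for very large inputs the result should be treated as probabilistic.

```python
def is_probable_prime(n, bases=(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)):
    """Miller-Rabin test with a fixed set of bases (illustrative stand-in)."""
    if n < 2:
        return False
    for p in bases:              # quick trial division by the small bases
        if n % p == 0:
            return n == p
    d, s = n - 1, 0
    while d % 2 == 0:            # write n - 1 = d * 2^s with d odd
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(s - 1):
            x = (x * x) % n
            if x == n - 1:
                break
        else:
            return False         # a is a witness: n is composite
    return True

print([k for k in range(50) if is_probable_prime(k)])
```

As stated in the text, the test only answers whether the input is prime and never produces the factors of a composite input.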

4.2 Key Generation

Each output of the LFSR is checked to determine whether it is a prime number. Two prime numbers are selected randomly and designated as “p” and “q”. These prime numbers are shared between all the users in the system. Using these prime numbers, the encryption exponent (e) and the decryption exponent (d) are computed. These are the commutative encryption key and decryption key respectively. These keys are then employed for creating the cipher text “C” and later for converting it back into the plain text “M”. Considering the architectural differences between the generic RSA and the proposed Commutative RSA, two parallel pseudo random generators based on 512-bit LFSRs have been employed in order to enhance the overall efficiency and reduce the critical delay in key generation. The pseudo random bits generated are stored in shift registers; once a register is filled, the LFSR stops generating bits until register memory is available for the next random bits. This makes the system power efficient and increases the speed of operation.

The flow charts of the various functions are represented using Algorithmic State Machine (ASM) charts [115]. ASM charts are a graphical representation of the step-by-step execution of a hardware code. They are easier to interpret and can be easily converted to other forms of representation. They do not enumerate all the possible inputs and outputs; only the inputs that matter and the outputs that are asserted are indicated. The ASM chart for testing whether the output of the LFSR is a prime number is presented in Figure 4.1. The RTL schematic of the same, extracted using the ISE design tool, is presented in Figure 4.2.

4.2.1 RTL Coding Guidelines

The ultimate aim of the designer is to map the design onto an FPGA device or to implement it as an Application Specific Integrated Circuit (ASIC), and this is possible only if certain guidelines are followed [115]. A popular set of guidelines practiced in industry, known as the RTL coding guidelines, signifies that data transfers in a system take place via registers. It basically means adhering to synchronous design practices, and it governs the regulation of data flow and how the data is processed. Since we deal with a synchronous design, it should run smoothly through simulation, synthesis and finally the place and route tools. In order to do this, we have to isolate the asynchronous and sequential circuits. Code developed using the RTL coding guidelines will run smoothly in all the tools.


[Figure 4.1 flow: Start; input an integer “n” > 1; if “n” is an exact power of another positive integer, declare “n” to be a composite number; otherwise choose a set of polynomials g(x) and a polynomial f(x) that are sufficient for testing the primality of “n”; choose g(x) from the set of g(x)s and check whether (g(x))^n ≡ g(x^n) mod (f(x), n); if the check fails, declare “n” to be a composite number; once the check has been performed for all g(x), declare “n” to be a prime number; Stop. State labels in the chart: S 52, S 56, S 58, S 60, S 66.]

Figure 4.1: ASM Chart for Finding the Prime Number


Figure 4.2: RTL Schematic for Finding the Prime Number Using


In Figure 4.2, CRSA_KGCORE_CMP_PRME_TRUE is the block name used in the VHDL code. clk, reset_I, start, rd_ack and rand_in (511:0) are the inputs, and ready, rd_en, prime_valid and prime_out (511:0) are the outputs of this block. reset_I is an asynchronous active low input; the system is reset only if this input is low. For normal working of the system, this signal is high. start is an active high input which indicates the start of the primality check. rand_in (511:0) is the input number to be checked for primality; it is a random number generated by the linear feedback shift register. Continuous clock pulses are applied. At every positive edge of the clock, reset_I is checked. If it is low, the output is in the reset state. When reset_I is high, the random number present at that time is tested for primality. The algorithm implemented for checking the prime number is presented in Figure 4.1. If the applied random number is a prime number, the prime_valid and rd_en outputs will be “1”, indicating that the output prime_out (511:0) is a prime number.

Two prime numbers have been taken and designated as “p” and “q”. Variables “n” and “φ(n)” are computed using these two prime numbers. The encryption exponent “e” is found by trial and error such that the Greatest Common Divisor (GCD) of “e” and “φ(n)” is “1”. The flow chart for finding the GCD of two numbers is presented in Figure 4.3 and the RTL schematic is presented in Figure 4.4.

After obtaining “φ(n)”, an integer value is assumed for “e”, and these two are the inputs to the GCD block. “e” and “φ(n)” are compared for three cases: “e” greater than “φ(n)”, “e” equal to “φ(n)” and “e” less than “φ(n)”. If “e” is greater than “φ(n)”, “φ(n)” is subtracted from “e” and the two are compared again. If “e” is less than “φ(n)”, “e” is subtracted from “φ(n)” and the two are compared again. These steps are repeated until the two values become equal; the common value is the GCD. If this value is “1”, the assumed “e” is selected as the encryption key; otherwise another value of “e” is tried.

[Figure 4.3 flow: START; read “e” and “φ(n)”; compare “e” and “φ(n)”; if e > φ(n), set e = e - φ(n) and compare again; if e < φ(n), set φ(n) = φ(n) - e and compare again; if e = φ(n), GCD = 1; STOP.]

Figure 4.3: Flowchart for Finding GCD of Two Numbers
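The compare-and-subtract procedure of the flow chart can be sketched in Python as below. The function names, the starting value 3 and the step of 2 used in the trial-and-error search are assumptions for illustration; the example value 3120 corresponds to the toy primes 61 and 53.

```python
def gcd_by_subtraction(e, phi):
    """GCD of two positive integers using only comparison and subtraction."""
    a, b = e, phi            # work on copies so that e and phi are preserved
    while a != b:
        if a > b:
            a -= b           # a > b: subtract b from a
        else:
            b -= a           # a < b: subtract a from b
    return a                 # the common value is the GCD

def pick_exponent(phi, start=3):
    """Trial and error: return the first odd e >= start with gcd(e, phi) == 1."""
    e = start
    while gcd_by_subtraction(e, phi) != 1:
        e += 2
    return e

phi = (61 - 1) * (53 - 1)    # 3120, from assumed toy primes
e = pick_exponent(phi)
assert gcd_by_subtraction(e, phi) == 1
```

Only a comparator and a subtractor are needed, which suits a hardware realization, although the number of iterations can be large when the two operands differ greatly in size.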


Figure 4. 4: RTL View for Finding GCD Using ISE 14.3 Xilinx Tool

In Figure 4.4, CRSA_KGCORE_CMP_GCD is the component declared in the VHDL code for finding the GCD of two numbers. A(1023:0) is the 1024-bit input. clk, reset_I and start are the other inputs. reset_I is an asynchronous active low input that resets the system; this signal is high for normal functioning. start is an active high input which indicates the start of the function. d(1023:0), e(1023:0), out_valid and ready are the outputs. d(1023:0) and e(1023:0) are the decryption and encryption keys respectively. These two keys are valid when the out_valid signal is high.


After obtaining the GCD of “e” and “φ(n)” as “1”, the pair (e, n) is considered as the encryption key. The decryption key pair (d, n) is computed from the encryption key such that $d = e^{-1} \bmod \phi(n)$. The mathematical approach to key generation is presented in Figure 4.5 and Figure 4.6.

Figure 4.5: Key Generation for Commutative RSA Algorithm


Figure 4.5 presents the functional block diagram of key generation in Commutative RSA. In the designed key generation system, seed data of size 512 bits are applied to the key generator, which finally generates the 1024-bit encryption key “e”, the 1024-bit decryption key “d” and the 1024-bit parameter “φ(n)”. The two prime numbers of 512 bits each are generated using pseudo random number generators. In Figure 4.5, pseudo random number generator-1 and pseudo random number generator-2 are the two blocks that generate the random numbers of 512 bits each. These random numbers are fed to the next block, which tests whether they are prime. Two prime numbers “p” and “q” are obtained after the primality test. These two prime numbers are multiplied and the product is designated as the integer “n”. One more parameter, “φ(n)”, is obtained by multiplying (p-1) and (q-1). An integer “e” is assumed, and the GCD of “e” and “φ(n)” is tested by trial and error. After obtaining the GCD of “e” and “φ(n)” as “1”, the pair (e, n) is considered as the encryption key. The decryption key pair (d, n) is computed from the encryption key such that $d = e^{-1} \bmod \phi(n)$. The step-by-step procedure for generating both the encryption key (e, n) and the decryption key (d, n) is presented in the following algorithm.


Algorithm 4.1: Key Generation

1. Select p and q such that both p and q are prime numbers

2. Compute n = p x q

3. Compute φ(n) = (p - 1) x (q - 1)

4. Select integer variable “e” in such a way that GCD (φ(n), e) = 1; 1 < e < φ(n)

5. Calculate d = e^-1 (mod φ(n))

6. Generate the public and private key pairs:

Encryption key pair, (e, n)

Decryption key pair, (d, n)

Figure 4.6: Pseudo Algorithm for Commutative RSA Key Generation
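The steps of the key generation algorithm can be sketched in Python as follows. The inverse in the step that calculates “d” is computed here with the extended Euclidean algorithm, which is an assumed choice since the thesis does not state how the inverse is obtained in hardware; the function names, the toy primes and the starting value for “e” are also assumptions.

```python
from math import gcd

def mod_inverse(e, phi):
    """Extended Euclid: return d such that (d * e) % phi == 1."""
    old_r, r = e, phi
    old_s, s = 1, 0              # invariant: old_r = old_s * e (mod phi)
    while r:
        k = old_r // r
        old_r, r = r, old_r - k * r
        old_s, s = s, old_s - k * s
    if old_r != 1:
        raise ValueError("e and phi are not co-prime")
    return old_s % phi

def generate_keys(p, q, e_start=3):
    """Return ((e, n), (d, n)) for primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    e = e_start
    while gcd(e, phi) != 1:      # trial and error over odd candidates
        e += 2
    d = mod_inverse(e, phi)
    return (e, n), (d, n)

(enc_e, n), (dec_d, _) = generate_keys(61, 53)   # assumed toy primes
message = 1234
assert pow(pow(message, enc_e, n), dec_d, n) == message
```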

The algorithm may be efficiently designed using Algorithmic State

Machine (ASM) charts rather than by the traditional state diagram [115].

The process of commutative key generation has been illustrated in

Figure 4.7 using ASM Chart.


[Figure 4.7 chart, states S0 to S9: START; Pseudo Random Number Generator; decision boxes “Prime No. 1?” and “Prime No. 2?” with T branches yielding p and q; n = p x q; Ф(n) = (p-1) x (q-1); GCD (Ф(n), e); decision box “GCD = 1?”; on T, the keys are formed, labelled in the chart as private key (e, n) and public key (d, n); Encryption Process; Decryption Process.]

Figure 4.7: ASM Chart for Key Generation for Commutative RSA

In the designed system of key generation for CRSA, the flow of Figure 4.7 follows the block diagram of Figure 4.5. Seed data of size 512 bits are applied to the key generator, two 512-bit prime numbers “p” and “q” are obtained after the primality test, and “n” and “φ(n)” are computed from them. An exponent “e” whose GCD with “φ(n)” is “1” is found by trial and error. The pair (e, n) is the encryption key and the pair (d, n), with $d = e^{-1} \bmod \phi(n)$, is the decryption key; “e”, “d” and “φ(n)” are 1024 bits each. These two keys are then used for encryption of the plain text and decryption of the cipher text respectively.

RTL schematics of the top module of CRSA key generation using

Xilinx ISE 14.3 and Vivado 2012.3 are presented in Figure 4.8 and

Figure 4.9 respectively.


Figure 4.8: RTL Schematics of Top Module of CRSA Key

Generation Using Vivado 2012.3


[Figure 4.9 block labels: KEYGEN_CRSA_KGCORE_GENPVAL, KEYGEN_CRSA_KGCORE_GENQVAL, KEYGEN_CRSA_KGCORE_CHKPVAL, KEYGEN_CRSA_KGCORE_CHKQVAL, KEYGEN_CRSA_MAIN_MEM_MGMT, CRSA Encoding, Decoding. Signal labels: CLK, SEED[511:0], ADDA[511:0], RAND_IN[511:0], PRIME_OUT[511:0], A[1023:0], d_va[1023:0], e_va[1023:0], Fi_out[1023:0], N_out[1023:0], Key_out.]

Figure 4.9: Schematic of CRSA Key Generation Using Vivado 2012.3

In Figures 4.8 and 4.9, the first stage generates the 512-bit random numbers. Two random numbers, P and Q, are required. KEYGEN_CRSA_KGCORE_GENPVAL is the block that generates the first random number and KEYGEN_CRSA_KGCORE_GENQVAL is the block that generates the second. The inputs to these two blocks are the 512-bit seed [511:0], Reset, clk and chip enable (ce). These two blocks are Linear Feedback Shift Registers (LFSR); the Fibonacci LFSR described in Section 4.1.2 is implemented. The second stage checks for prime numbers. KEYGEN_CRSA_KGCORE_CHKQVAL and KEYGEN_CRSA_KGCORE_CHKPVAL are the two blocks that verify whether the random numbers are prime. The inputs to these two blocks are the outputs of the first stage. Figure 4.1 describes the Algorithmic State Machine chart for checking whether the input number is prime. After the two prime numbers P and Q are found, the variables “n” and “φ(n)” are computed from them. The encryption exponent “e” is found by trial and error such that the Greatest Common Divisor (GCD) of “e” and “φ(n)” is “1”. The flow chart for finding the GCD of two numbers is presented in Figure 4.3 and the RTL schematic in Figure 4.4.

KEYGEN_CRSA_KGCORE_GEN_N_PHI is the block for computing “n” and “φ(n)”. These variables are the inputs to the next block, KEYGEN_CRSA_E_D_VAL_CMPTE, which computes the 1024-bit encryption key “e” and decryption key “d”, indicated in the figure as e_val (1023:0) and d_val (1023:0) respectively. The Algorithmic State Machine chart of Figure 4.7 presents the key generation for the Commutative RSA algorithm implemented in this work.

The second phase of the CRSA cryptography core is commutative encryption, which is presented in the following section.


4.3 Commutative Encryption

The RSA cryptosystem is one of the finest solutions among public key cryptography approaches. However, its usefulness is limited by its restricted (one-way) encryption, and RSA algorithms suffer from reordering issues. Hence, in order to make the system simpler and more efficient, a methodology called Commutative RSA has been designed, which works on the basis of the commutative property. The mathematical structure for performing encryption is described by the pseudo algorithm presented in Figure 4.10. To achieve robust encryption with the least critical delay, an improved architecture called “Parallel Montgomery Multiplication for distributed cores” has been designed. The algorithmic development and the details of this modified Montgomery multiplier are given in the following section. Using the encryption key “e”, the plain text “M” is converted to the cipher text “C”. This CRSA encryption with parallel Montgomery multiplication is implemented at every FPGA core or distributed core. Such an implementation not only reduces the computational cost in terms of execution time but also reduces the key exchange overheads in multiple core or MIMO transceiver based communication systems. Mathematically, the commutative RSA encryption algorithm can be stated as follows:


Algorithm 4.2 Commutative Encryption

1. Prime numbers: p, q

2. Compute n: n = p x q

3. Plain text: M (M < n)

4. Cipher text: C = M^e (mod n)

Figure 4.10: Pseudo Algorithm for Commutative Encryption

The encryption process takes as input the data together with the encryption exponent and the parameter “n”, all of size 32 bits, and produces 32 bits of cipher text. The designed scheme uses Montgomery multiplication in its parallel form. In each individual multiplication step, the inputs are a 32-bit modulus, multiplicand and multiplier. This CRSA encryption with parallel Montgomery multiplication is implemented at every participating FPGA core or distributed core. For better performance, Montgomery based decryption is also necessary. Therefore, the developed decryption algorithm also uses the parallel Montgomery multiplier.

4.4 Commutative Decryption

In commutative RSA, decryption is done in the same fashion as commutative encryption, using the parallel Montgomery multiplier. In commutative decryption, the respective plain text is obtained from the cipher text at each individual transceiver terminal. It is obtained by raising the cipher text to the power of the decryption key exponent “d_crsa”, followed by reduction modulo “n”. Mathematically, the commutative RSA decryption algorithm can be stated as follows:

Algorithm 4.3: Commutative Decryption

1. Input: Cipher text: C

2. Output: Plain text: M

3. Decryption key pair: (d, n)

4. Plain text: $M = C^{d} \pmod{n}$

Figure 4.11: Pseudo Algorithm for Commutative Decryption
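A minimal Python sketch of Algorithm 4.3 follows. The toy primes, the exponent e = 17 and the function name `crsa_decrypt` are illustrative assumptions; the decryption exponent is derived here as the modular inverse of e modulo φ(n), which is the standard RSA relation and is assumed rather than taken from the design.

```python
# Toy illustration of Algorithm 4.3: M = C^d mod n.
# p, q and e are illustrative assumptions; d is derived as e^-1 mod phi(n).

def crsa_decrypt(c: int, d: int, n: int) -> int:
    """Return the plain text M = C^d mod n."""
    return pow(c, d, n)

p, q = 61, 53
n = p * q
phi = (p - 1) * (q - 1)
e = 17
d = pow(e, -1, phi)      # modular inverse (Python 3.8 or later)

m = 1234
c = pow(m, e, n)         # encryption as in Algorithm 4.2
recovered = crsa_decrypt(c, d, n)
print(recovered == m)
```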

The algorithm for accomplishing security and authentication for

multi user or multiple core distributed FPGA frameworks has been

presented in the next section.

4.5 Commutative RSA Cryptography Core

In this section overall design of CRSA Architecture and its

implementation with multiple distributed FPGA cores has been

presented. The entire system implementation was done sequentially

using multiple cores.


In this work, three user terminals are considered. Each user terminal encrypts the message using the private key, and decryption is performed using the public key. Even though a public key is generated at each transceiver terminal, the public keys are transmitted using the classical method. At the receiver end, each user terminal knows the number of terminals the plaintext has traversed. The receiver has to perform the same number of decryption iterations, using the respective decryption keys, to obtain the plaintext. Phase 1 is the key generation step; the sequential procedure for key generation has been presented in sections 4.1 and 4.2. The next phase is commutative encryption, presented in section 4.3, and the third phase is commutative decryption, discussed in section 4.4. The sequential phases are presented in Figure 4.12.


Figure 4.12: Sequential Model for Commutative RSA Realization
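The three-terminal sequence can be sketched in Python as follows. This is a toy model under stated assumptions: the terminals share one modulus n, the primes and the three exponents are illustrative, and the helper name `through_terminals` is hypothetical.

```python
from itertools import permutations

# Three hypothetical terminals sharing n; all values are toy assumptions.
p, q = 61, 53
n, phi = p * q, (p - 1) * (q - 1)
enc_keys = [17, 7, 11]                          # one exponent per terminal
dec_keys = [pow(e, -1, phi) for e in enc_keys]  # matching decryption exponents

def through_terminals(value, keys):
    """Apply one modular exponentiation per terminal, in the given order."""
    for k in keys:
        value = pow(value, k, n)
    return value

m = 2021
cipher = through_terminals(m, enc_keys)               # terminal 1, 2, then 3
plain = through_terminals(cipher, reversed(dec_keys)) # terminal 3, 2, then 1

# Because the scheme is commutative, any decryption order recovers m.
all_orders_ok = all(
    through_terminals(cipher, order) == m for order in permutations(dec_keys)
)
print(plain == m, all_orders_ok)
```

The receiver performs as many decryption iterations as there were encryptions, which matches the requirement that each terminal knows how many terminals the plaintext has traversed.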

Various phases of commutative cryptography core are discussed

in the previous section. Montgomery parallel multiplier is presented in

the next section.


4.6 CRSA Oriented Parallel Montgomery Multiplier

Inter-core communication is, in general, much slower than intra-core communication, so a system needs a high tolerance for the associated communication delay. In multiple core communications, the inter-core communication might therefore be a bottleneck in parallel multipliers. Parallel Montgomery multipliers are eminently suited to diverse multi-core models and keep the overall system stable even at higher throughputs.

4.6.1 Parallel Montgomery Multiplier

The implementation of Montgomery multiplication using a parallel multiplier for multi-core applications is effective for RSA cryptosystems and their applications. In a number of cryptosystems, a series of multiplication operations, operating concurrently, is required. For example, modular exponentiation is computed as a chain of modular multiplications and squarings. Multiplication and squaring are in general implemented by means of an integer multiplication followed by a modular reduction with a predefined modulus. The design of the Montgomery multiplication algorithm was presented in detail in the previous chapter. In the Montgomery multiplication process, the operands are first transformed into their Montgomery residue representation. Integer multiplication and squaring are then performed in that representation. Finally, the result is converted back into its ordinary integer representation.
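The chain of modular squarings and multiplications that makes up a modular exponentiation can be sketched as a left-to-right square-and-multiply loop. This is a minimal Python illustration, not the hardware design; the function name `modexp` is hypothetical, and the plain `%` reduction stands in for the Montgomery multiplier.

```python
def modexp(base: int, exponent: int, modulus: int) -> int:
    """Left-to-right square-and-multiply: returns base^exponent mod modulus."""
    result = 1 % modulus
    for bit in bin(exponent)[2:]:                # scan exponent bits, MSB first
        result = (result * result) % modulus     # one modular squaring per bit
        if bit == "1":
            result = (result * base) % modulus   # extra multiplication for a 1 bit
    return result

print(modexp(1234, 17, 3233) == pow(1234, 17, 3233))
```

Each loop iteration needs one squaring and at most one multiplication, so the cost of an exponentiation is dominated by the modular multiplier, which is the unit the parallel Montgomery design targets.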

In this work, multiple core hardware architecture is proposed for

Commutative RSA encryption and decryption. The developed scheme

involves highly efficient communication among incorporated multiple

cores. This CRSA approach is robust, efficient and offers real time

functional hardware architecture.

4.7 Realization of Parallel Montgomery Multiplier

In this algorithm for multiple distributed core architecture, the

intrinsic parallelism scheme has been taken into consideration. The

following algorithm has been used for parallel Montgomery

multiplication. Algorithm 4.4 explains the general Montgomery

Multiplication algorithm.


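Algorithm 4.4: Montgomery Multiplication

Input: $A, B \in \mathbb{Z}_n$, where n is an odd integer,

$n' = -n^{-1} \bmod 2^{m}$, where $2^{m} > n$.

Output: $A \cdot B \cdot 2^{-m} \bmod n$

1. $T \leftarrow A \cdot B$

2. $T \leftarrow (T + (T \cdot n' \bmod 2^{m}) \cdot n) / 2^{m}$

3. If $T \geq n$ then

4. Return $T - n$

5. Else

6. Return $T$

7. End if

Figure 4.13: Pseudo Algorithm for Montgomery Multiplication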
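The steps of Algorithm 4.4 can be sketched in Python as follows. The sketch assumes the usual Montgomery constant n' = -n^(-1) mod 2^m with 2^m > n; the function name and the toy operands are illustrative, and the conversion into and out of Montgomery residue form is shown only to check the result.

```python
def montgomery_multiply(a: int, b: int, n: int, m: int) -> int:
    """Return a*b*2^(-m) mod n for odd n < 2^m and 0 <= a, b < n."""
    r = 1 << m
    n_prime = (-pow(n, -1, r)) % r            # assumed constant n' = -n^-1 mod 2^m
    t = a * b                                 # step 1
    t = (t + ((t * n_prime) % r) * n) >> m    # step 2: the division by 2^m is exact
    return t - n if t >= n else t             # steps 3 to 7

n, m = 3233, 12                               # toy odd modulus, 2^12 > n
r = 1 << m
a, b = 1234, 567
a_bar, b_bar = (a * r) % n, (b * r) % n       # operands in Montgomery residue form
prod_bar = montgomery_multiply(a_bar, b_bar, n, m)
product = montgomery_multiply(prod_bar, 1, n, m)  # back to ordinary form
print(product == (a * b) % n)
```

The shift in step 2 replaces the trial division of an ordinary modular reduction, which is the property that makes the method attractive in hardware.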

Parallel Integer Multiplication and Parallel Montgomery

Multiplication algorithms are presented in the following sections.


4.7.1 Pseudo algorithm for Parallel Integer Multiplication

The pseudo code for the parallel integer multiplication used inside the parallel Montgomery multiplication is as follows:

Algorithm 4.5: Parallel Integer Multiplication

Input: Integers $E = \sum_{i=0}^{s-1} e_i \, 2^{i \cdot d}$ and N, each of size m = d·s bits. Here the variable “s” states the number of cores to be realized.

Result: parmul(E, N) = E·N (parallel multiplied output)

1. Initialize the loop

2. for i = 0 to s-1 do (one partial product per core)

$w_i \leftarrow e_i \cdot N \cdot 2^{i \cdot d}$

End for

3. for i = 1 to s-1 do

4. $w_0 \leftarrow w_0 + w_i$

End for

5. return ($w_0$)

Figure 4.14: Pseudo Algorithm for Parallel Integer Multiplication
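A minimal Python sketch of the parallel integer multiplication follows. It assumes that the first operand is split into s words of d bits each; the list of partial products models the work that would be distributed over s cores, and the names are illustrative rather than taken from the RTL.

```python
def parmul(e: int, n: int, s: int, d: int) -> int:
    """Multiply e by n using s partial products built from d-bit words of e."""
    if not 0 <= e < (1 << (s * d)):
        raise ValueError("e must fit in s*d bits")
    mask = (1 << d) - 1
    # Each entry could be computed on its own core: w_i = e_i * N * 2^(i*d).
    w = [(((e >> (i * d)) & mask) * n) << (i * d) for i in range(s)]
    for i in range(1, s):          # accumulate the partial products into w[0]
        w[0] += w[i]
    return w[0]

print(parmul(0x12345678, 0x9ABCDEF0, 4, 8) == 0x12345678 * 0x9ABCDEF0)
```

Only the accumulation loop needs data from other cores, which is why the inter-core communication cost is concentrated there.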


4.7.2 Pseudo Algorithm for Parallel Montgomery Multiplication

Algorithm 4.6: Parallel Montgomery Multiplication

Input: Variables $E, N \in \mathbb{Z}_n$ for an odd integer n, while $n' = -n^{-1} \bmod 2^{m}$, with $2^{m} > n$

Result: $E \cdot N \cdot 2^{-m} \bmod n$

Apply the previous algorithm (parallel integer multiplication) to achieve the following:

1. $w \leftarrow$ parmul(E, N)

2. $u \leftarrow$ parmul(w, n') mod $2^{m}$

3. Update g:

4. $g \leftarrow$ parmul(u, n)

5. $g \leftarrow (g + w) / 2^{m}$

6. if $g \geq n$, then

7. return (g - n)

8. else

9. return (g)

10. end if

Figure 4.15: Pseudo Algorithm for Parallel Montgomery Multiplication
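The parallel Montgomery multiplication can be sketched in Python by combining the Montgomery reduction with a word-wise parallel multiplication. This is a sketch under assumptions: the constant n' = -n^(-1) mod 2^m is the usual Montgomery constant, operands are split into s words of d bits with m = s·d, and the partial products in `parmul` stand in for the work of s cores. All names are illustrative.

```python
def parmul(x: int, y: int, s: int, d: int) -> int:
    """Word-wise product: x is split into s words of d bits (x < 2^(s*d))."""
    mask = (1 << d) - 1
    return sum((((x >> (i * d)) & mask) * y) << (i * d) for i in range(s))

def parallel_montgomery(operand_e: int, operand_n: int, modulus: int,
                        s: int, d: int) -> int:
    """Return operand_e * operand_n * 2^(-m) mod modulus, with m = s*d."""
    m = s * d
    r = 1 << m
    n_prime = (-pow(modulus, -1, r)) % r       # assumed constant n'
    w = parmul(operand_e, operand_n, s, d)     # w = E * N
    u = parmul(w % r, n_prime, s, d) % r       # only the low m bits of w matter
    g = parmul(u, modulus, s, d)               # g = u * n
    g = (g + w) >> m                           # exact division by 2^m
    return g - modulus if g >= modulus else g

result = parallel_montgomery(1234, 567, 3233, s=3, d=4)
print(result == (1234 * 567 * pow(1 << 12, -1, 3233)) % 3233)
```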


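The partial product accumulation has been accomplished by means of a “binary tree approach”, as presented below, with at most $\lceil \log_2 s \rceil$ phases when “s” cores are available.

For j = 1 to $\lceil \log_2 s \rceil$

For i = 0 to s-1 in steps of $2^{j}$, where $i + 2^{j-1} \leq s - 1$ (in parallel)

$w_i \leftarrow w_i + w_{i + 2^{j-1}}$

End for

End for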
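A binary tree accumulation of s partial products can be sketched in Python as follows. The pairing pattern (adding the entry one stride away and doubling the stride in each phase) is one common way to build the tree and is an assumption here, as is the function name `tree_accumulate`.

```python
def tree_accumulate(partials):
    """Sum a non-empty list pairwise; returns (total, number_of_phases)."""
    w = list(partials)
    s = len(w)
    stride, phases = 1, 0
    while stride < s:
        # All additions inside one phase are independent of each other,
        # so each could run on a separate core.
        for i in range(0, s - stride, 2 * stride):
            w[i] += w[i + stride]
        stride *= 2
        phases += 1
    return w[0], phases

print(tree_accumulate([1, 2, 3, 4, 5]))
```

With five partial products the sketch needs three phases, in line with the bound of ceil(log2(s)) phases.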

Thus, by implementing this parallel Montgomery multiplication at every MIMO transceiver or distributed FPGA core terminal for performing commutative encryption and decryption, the overall latency and the critical delay have been minimized. A highly robust system has thus been obtained for the Commutative RSA cryptosystem.

4.8 Realization of CRSA with Multiple Distributed Cores

Assuming that a platform with a secure medium is available, the data to be communicated over the communication medium needs to be secured; in fact, it is sufficient to prohibit any collusion for a secure plane to be achievable. A public key cryptosystem such as RSA is normally well suited to this purpose. Accordingly, the commutative cryptographic core for the distributed FPGA scheme with parallelized Montgomery multiplication uses the commutative RSA (CRSA) algorithm, in which the order in which the encryption is done does not affect the cryptosystem. The decryption can also be realized in a similar manner. The developed

approach takes into account two prime variables stated as

and that has been initialized and accompanied by all the

group members.

Let the variables and represent the group members

required to perform communication over the secure plane. For

computing the encryption and decryption key of the proposed CRSA

algorithm, the Property and have been

calculated while considering the expression presented in Eq. 4.1 and

Eq. 4.2.

(4.1)

(4.2)

Taking into account of the above expressions, it is verified that

and for A

and B respectively. The encryption key pairs for user A and B are


represented as:

and

(4.3)

The is computed using randomly selected variables in

such a way that:

(4.4)

where is the greatest common divisor (GCD) function that

exists between variables and .

The decryption key pair of A and B is presented in terms of

and . The

property are calculated based on the following expression.

(4.5)

(4.6)

Similarly, the resulting commutative RSA decryption functions on

the retrieved encrypted data which can be defined by the following

expression.

. (4.7)

After decryption, the data has been verified.
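The key computation for two group members can be sketched in Python under the standard RSA relations, which are assumed here: n = p·q, φ(n) = (p - 1)(q - 1), an encryption exponent e chosen at random with gcd(e, φ(n)) = 1, and a decryption exponent d = e^(-1) mod φ(n). The toy primes, the seed and the function name `generate_key_pair` are illustrative only.

```python
from math import gcd
import random

def generate_key_pair(p, q, rng):
    """Return ((e, n), (d, n)) under the standard RSA relations (assumed)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    while True:
        e = rng.randrange(3, phi, 2)      # randomly selected odd candidate
        if gcd(e, phi) == 1:              # accept only if coprime to phi(n)
            break
    d = pow(e, -1, phi)                   # d = e^(-1) mod phi(n)
    return (e, n), (d, n)

rng = random.Random(2015)                 # fixed seed so the sketch is repeatable
p, q = 61, 53                             # toy primes shared by members A and B
(e_a, n), (d_a, _) = generate_key_pair(p, q, rng)
(e_b, _), (d_b, _) = generate_key_pair(p, q, rng)

m = 1234
c = pow(pow(m, e_a, n), e_b, n)           # A encrypts, then B
recovered = pow(pow(c, d_a, n), d_b, n)   # A decrypts first, then B
print(recovered == m)
```

Decryption succeeds even though A removes its layer before B does, which is the commutative behaviour the section describes.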


Summary

In this chapter, design of commutative cryptography core with key

generation, ASM charts for finding the prime number and key

generation, Flow chart for finding the GCD of two numbers, algorithms

for parallel multiplication are presented. Analysis of simulation results

and hardware design are presented in the next chapter.


Chapter 5

Simulation and Place and Route Results of

Commutative Cryptography Architecture with Key Generation

In the previous two chapters, the design of serial and parallel Montgomery multipliers and the design of the commutative cryptography architecture with key generation were presented. In this chapter, their simulation and place and route results are presented in detail.

5.1 Introduction

Secure and authenticated data communication in present-day competitive applications needs a robust and efficient security system that ensures authenticated data communication in multiple user scenarios. Considering the efficiency of the RSA algorithm and the optimization opportunities this cryptosystem offers, a new approach called Commutative RSA (CRSA) has been developed in this work and implemented with multiple distributed FPGA cores. The developed system has been enhanced with other optimization features such as parallel Montgomery multiplication with radix-2, which makes the system perform well in terms of execution time. The CRSA has been tested with both paradigms of Montgomery multiplication: serial and parallel. The results illustrate that parallel Montgomery multiplication gives CRSA better performance than serial Montgomery multiplication. The parallel CRSA system has been implemented with multiple distributed cores, and the results indicate that CRSA with parallel Montgomery multiplication can deliver secure communication among multiple users with high throughput.

5.2 Hardware Design

The architecture of a 32-bit RSA processor has been designed

that functions on the basis of the proposed Commutative RSA

algorithm. In this work, four 32-bit linear shift registers are employed to

store operands needed for computing 32-bit RSA operations. The

commutative RSA core has been implemented on multiple FPGA

devices for simulation. Data authenticity among multiple user terminals

in a communication environment is illustrated. The implementation of

commutative RSA cryptography core has been simulated with three

individual FPGA devices.

The system has been coded in VHDL with multiple transceivers. The design has been simulated using Modelsim, and the RTL design has been synthesized using Xilinx Design Suite 14.3 and the Vivado 2012.3 Xilinx tool, targeting a Virtex-5 FPGA.

In this work, two systems have been developed. One is Serial

Montgomery based Cryptography core and the second is Parallel

Montgomery based cryptography core. The designed system utilizes


31298 out of 44800 (69%) slice registers, 30129 out of 44800 (67%)

slice LUTs. The system clock frequency reported by the Place and

Route tool is 199 MHz for CRSA encryption and decryption and 335

MHz for commutative cryptography core.

5.3 Comparative Analysis for Serial Versus Parallel Montgomery

Multiplication Based CRSA Implementation

The results for Serial Montgomery and Parallel Montgomery

Multiplication based CRSA architectures have been compared in this

section. Considering the performance parameters like memory

occupancy, speed, power consumption, delay and throughput, it has

been found that the Parallel Montgomery based Commutative RSA

performs better when compared to Serial Montgomery based

Commutative RSA realization. The delay in the proposed Parallel

Montgomery based CRSA is 13.8% lower as compared to Serial

Montgomery based CRSA cryptography core. Similarly, the throughput

of the proposed Parallel Montgomery based CRSA is 12.1% higher than

the serial Montgomery based CRSA architecture. In the proposed

design, the trade-off between power consumption and area is also very

small. The comparative results for Serial Montgomery based

Commutative RSA and Parallel Montgomery based Commutative RSA

are presented below.


Table 5.1 presents the comparison of area in Serial and Parallel

Montgomery based CRSA Cryptography Core. Figure 5.1 presents the

graphical representation of the same.

Table 5.1 Comparison of Chip Area of Serial and Parallel Montgomery Based CRSA Cryptography Core

Cryptography Core      Serial Montgomery Based CRSA     Parallel Montgomery Based CRSA

Device                 xc5vlx330t-2-ff1738              xc5vlx330t-2-ff1738

Slice LUTs             913                              844

LUTs used as logic     913                              813

Occupied slices        290                              311


Figure 5.1: Comparison of Chip Area of Serial and Parallel


Resources on the FPGA that can perform logic functions are defined as logic resources. They are grouped in slices to create configurable logic blocks. A slice contains a set number of look-up tables (LUTs), flip-flops and multiplexers. An LUT stores a predefined list of outputs for every combination of its inputs and so provides a fast way to retrieve the output of a logic operation. A flip-flop is a circuit capable of two stable states and represents a single bit. A multiplexer, also known as a mux, is a circuit that selects between two or more inputs and outputs the selected input.

Slices are the basic building blocks of the FPGA fabric. Each slice contains a number of LUTs, flip-flops and carry logic elements, which make up the logic of the design before mapping. After mapping, all the LUTs and flip-flops are packed into slices, but they do not necessarily fill the slices: a slice with two LUTs and two flip-flops may be in use for just one LUT. In the map report, any slice that is used even partially is counted as a complete "occupied slice". Hence the percentage of occupied slices can be greater than the larger of the LUT and flip-flop percentages. For example, a design that uses about 25% of the LUTs and flip-flops can, owing to sparse packing, have nearly 50% occupied slices.

Different FPGA families implement slices and LUTs differently.

For example, a slice on a Virtex-II FPGA has two LUTs and two flip-

flops but a slice on a Virtex-5 FPGA has four LUTs and four flip-flops. In

addition, the number of inputs to an LUT is generally two to six,

depending on the FPGA family.

In Table 5.1, number of slice LUT used for the implementation of

serial Montgomery based CRSA and parallel Montgomery based CRSA

are 913 and 844 respectively. Occupied slices for serial Montgomery

based CRSA and parallel Montgomery based CRSA are 290 and 311

respectively. This table and the graphical representation indicate that

Parallel Montgomery based CRSA occupies only 21 extra slices

compared to serial Montgomery based CRSA. Hence for further

implementation, Parallel Montgomery based CRSA is used.


Table 5.2 and Figure 5.2 present the comparison of Power

consumption in Serial and Parallel Montgomery Based Commutative

RSA Cryptography Core.

Table 5.2 Comparison of Power Consumption of Serial and Parallel Montgomery Based Commutative RSA Cryptography Core

Cryptography Core      Serial Montgomery Based CRSA     Parallel Montgomery Based CRSA

Device                 xc5vlx330t-2-ff1738              xc5vlx330t-2-ff1738

Static Power (mW)      3516.7                           3516.75

Dynamic Power (mW)     4.76                             5.72

Total Power (mW)       3521                             3522


Figure 5.2: Comparison of Power Consumption of Serial and

Parallel Montgomery Based CRSA

Power consumption of serial Montgomery based CRSA and

parallel Montgomery based CRSA are extracted from the device

utilization summary. Table 5.2 and Figure 5.2 indicate both the methods

consume almost the same power.

Table 5.3 presents the Performance such as Delay, Frequency

and Throughput for Serial and Parallel Montgomery Based CRSA

Cryptography Core. The graphical comparisons are presented in Figure

5.3 and Figure 5.4 for delay and throughput respectively.


Table 5.3: Performance (Delay, Frequency and Throughput) Comparison for Serial and Parallel Montgomery Based CRSA Cryptography Core

Cryptography Core      Serial Montgomery Based CRSA     Parallel Montgomery Based CRSA

Device                 xc5vlx330t-2-ff1738              xc5vlx330t-2-ff1738

Frequency (MHz)        199                              227

Delay (ns)             5.01                             4.4

Throughput (Kbps)      779                              887

Figure 5.3: Delay Comparison for Serial and Parallel



Figure 5.4: Throughput Comparison of Serial and Parallel


The maximum frequency of operation and the delay are given by the timing report of the device. The Parallel Montgomery based CRSA reports less delay than the Serial Montgomery based CRSA, which indicates a higher speed of operation. Throughput indicates the number of outputs per unit time; the higher the throughput, the better the performance. The data presented in Table 5.3 reveal that the Parallel Montgomery based CRSA performs better than the Serial Montgomery based CRSA. The delay of the Serial Montgomery based CRSA is 5.01 ns and that of the Parallel Montgomery based CRSA is 4.4 ns, a reduction of 0.61 ns; this is about 13.8% of the parallel delay, or about 12.2% of the serial delay. Similarly, the throughput of the Serial Montgomery based CRSA is 779 Kbps and that of the Parallel Montgomery based CRSA is 887 Kbps, an increase of 108 Kbps; this is about 12% of the parallel throughput, or about 13.9% of the serial throughput.
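The percentages can be checked directly from the Table 5.3 entries. The short Python sketch below computes each change relative to the serial figure; taking the serial design as the base is an assumption about the intended reference, and the function name is illustrative.

```python
def percent_change(old: float, new: float) -> float:
    """Change from old to new, as a percentage of old."""
    return 100.0 * (new - old) / old

# Values taken from Table 5.3.
serial_delay_ns, parallel_delay_ns = 5.01, 4.4
serial_kbps, parallel_kbps = 779, 887

delay_change = percent_change(serial_delay_ns, parallel_delay_ns)
throughput_change = percent_change(serial_kbps, parallel_kbps)
parallel_fmax_mhz = 1000.0 / parallel_delay_ns   # the period in ns sets the clock rate

print(round(delay_change, 1), round(throughput_change, 1), round(parallel_fmax_mhz))
```

The last value reproduces the 227 MHz entry of Table 5.3 from the 4.4 ns delay, so the frequency and delay rows are consistent with each other.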


Simulation waveforms of encryption and decryption are presented

in the next section.

5.4 Analysis of Simulation Waveforms

Various encryption and decryption data values for each user

location are computed and presented in Tables 5.4 to 5.6. Frequency of

operation in simulation has been set to 100 MHz for convenience. The

data values shown in these three tables are remapped for the VHDL

RTL codes as presented in Table 5.7.

Table 5.4 Data at User Terminal 1

User Terminal 1

Data                                  Decimal Value     Hexadecimal Value

p_val                                 59083             E6CB

q_val                                 33223             81C7

n_pram                                1962914509        74FFB2CD

e_pram                                699776239         29B5BCEF

data_pram                             7487875           724183

cypher                                848084699         328CBEDB

d_pram                                1389794659        52D69563

Original plain text after decoding    7487875           724183


Table 5.5 presents the data at user terminal 2.


User Terminal 2

Data                                  Decimal Value     Hexadecimal Value

p_val                                 59083             E6CB

q_val                                 33223             81C7

n_pram                                1962914509        74FFB2CD

e_pram                                1154032391        44C92307

data_pram                             848084699         328CBEDB

cypher                                752490942         2CDA19BE

d_pram                                1608356723        5FDD9373

Original plain text after decoding    848084699         328CBEDB


Table 5.6 presents the data at user terminal 3.


User Terminal 3

Data                                  Decimal Value     Hexadecimal Value

p_val                                 59083             E6CB

q_val                                 33223             81C7

n_pram                                1962914509        74FFB2CD

e_pram                                627898457         256CF859

data_pram                             752490942         2CDA19BE

cypher                                553018001         20F66291

d_pram                                1057410797        3F06CEED

Original plain text after decoding    752490942         2CDA19BE

Tables 5.4 to 5.6 present the data at user terminal 1, user

terminal 2 and user terminal 3. Prime numbers generated using LFSR

are designated as p_val and q_val. The parameters “n” and “φ(n)” are

computed. Parameter “n” is designated as “n_pram” in tables.

Encryption key is designated as “e_pram”. Encryption key for user

terminal 1 is computed as “699776239” in decimal and “29B5BCEF” in

Hexadecimal. Similarly for user terminal 2 and user terminal 3

encryption keys are “1154032391” in decimal, “44C92307” in


hexadecimal and “627898457” in decimal, “256CF859” in hexadecimal

respectively. “data_pram” is the original data required for encryption.

“7487875” is the original data for user terminal 1. This original data is

encrypted using an encryption key “e_pram”. The encrypted output data

is shown as “cypher” in the table and at user terminal 1 the “cypher”

value is “848084699”. This encrypted data at user terminal 1 is the

“data_pram” for user terminal 2, which is shown in table 5.5.

“752490942” is the “cypher” of user terminal 2 and is the “data_pram”

for user terminal 3, which is shown in Table 5.6. “cypher” of user

terminal 3 is obtained after encrypting its “data_pram” and is shown as

“20F66291”.

The decryption key is different from the encryption key and is shown as “d_pram”. The decryption key for user terminal 1 is computed as “1389794659” in decimal or “52D69563 H”. Similarly, the decryption keys at user terminal 2 and user terminal 3 are “1608356723” in decimal or “5FDD9373 H”, and “1057410797” in decimal or “3F06CEED H”, respectively.

The decrypted output is shown as “Original plain text after

decoding” in the table. Decoding starts at user terminal 3 and the value

obtained is “752490942” in decimal. This data is decrypted at user

terminal 2 and the obtained result is “848084699”. Decrypted data of

user 2 terminal is now decrypted at user terminal 1 and the result is


“7487875”. It may be noted that the encrypted data of user terminal 1 is the input data for user terminal 2, and so on; the same holds, in reverse order, for the decryption. At user terminal 1, “data_pram” and “Original plain text after decoding” are the same, and likewise at user terminal 2 and user terminal 3. This demonstrates the commutative nature of the algorithm.

Table 5.7 presents the data mapping between equations

presented and data used in coding to obtain the waveforms, presented

in the next section.

Table 5.7 Data Mapping

Designation of Data in Equations     Designation of Data in Waveforms

p                                    p_val

q                                    q_val

n                                    n_pram

e                                    e_pram

M                                    data_pram

C                                    Cypher

d                                    d_pram

M (recovered)                        Originalplaintext


In Table 5.7, the first column presents the data designated in

equations which were discussed in detail in Chapter 4. The second

column presents the same data designated in waveforms.

“p” is designated as “p_val” in simulation waveforms. Similarly, “q” is shown as “q_val”, “n” as “n_pram”, “e” as “e_pram”, the plain text “M” as “data_pram”, “C” as “Cypher”, “d” as “d_pram”, and the recovered “M” as “Originalplaintext” in equations and waveforms respectively.

Simulation waveforms of encryption and decryption at each user terminal are presented below. For convenience, the frequency of operation in simulation has been set to 100 MHz, although any other

value can be set. It may be noted that the “reset” signal is active high

here, being exact complement of “reset_l” shown elsewhere. The

original data required for encryption is shown as “data_pram” in Figures

5.5 to Figure 5.7. The encryption key is shown as “e_pram” and the

encrypted output data is shown as “cypher”. The encryption starts at

150 ns and completes processing at 10035 ns for User terminal 1 as

shown in simulation waveforms presented in Figure 5.5. Similarly, for

User terminal 2, encryption commences at time 10035 ns and ends at

22195 ns, whereas for User terminal 3, start and end times are 22195

ns and 34115 ns respectively as presented in Figure 5.6 and Figure 5.7.


The decryption timings for the three user terminals are presented in

Figure 5.8 to Figure 5.10.

The encrypted data are input as “incipher” for decryption processing. The decryption key is different from the encryption key and is shown as “d_pram”, and the decrypted output is shown as “originalplaintext” in the waveforms. It may be noted that the encrypted data of User 1 is the input data for User 2, and so on; the same holds, in reverse order, for the decryption. For example, User 1 data “7487875” is encrypted as shown in Figure 5.5 and decrypted as shown in Figure 5.10, recovering the same data. This demonstrates the commutative nature of the algorithm.

From the waveforms, it can be seen that each of the encryption and the

decryption process takes 50 clock cycles or 0.5 µs at 100 MHz.

The Synthesis, Place and Route have been run on the RTL

design. The design was synthesized using Xilinx Design Suite 14.3

targeted on Virtex-5, xc5vfx70t-2ff1136 FPGA. The design for both the

encryption and the decryption utilizes about 67% of the chip resources

as presented in Table 5.8. Although only 100 MHz was used during

Simulation, the maximum operating frequency reported by the Xilinx

Design Suite 14.3 tool is 199 MHz for CRSA encryption and decryption

and 335 MHz for commutative cryptography core as presented in Table

5.9.


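Figure 5.5: Simulation Waveform of CRSA Encryption at User Terminal 1

Figure 5.6: Simulation Waveform of CRSA Encryption at User Terminal 2

Figure 5.7: Simulation Waveform of CRSA Encryption at User Terminal 3

Figure 5.8: Simulation Waveform of CRSA Decryption at User Terminal 3

Figure 5.9: Simulation Waveform of CRSA Decryption at User Terminal 2

Figure 5.10: Simulation Waveform of CRSA Decryption at User Terminal 1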


5.5 RTL View of Architecture for Commutative RSA

Top level RTL schematics of Commutative RSA with key

generation using ISE Xilinx tool and Vivado Xilinx tool are presented in

Figure 5.11 and Figure 5.12 respectively.

Figure 5.11: Top Level RTL View of Key Generation Using ISE 14.3 Xilinx Tool

Figure 5.12: Top Level RTL View of Key Generation Using Vivado 2012.3 Xilinx Tool


The top level RTL view presents the inputs and outputs of a module and is extracted from the Xilinx tool after synthesis. CRSA_KGCORE_KGENTOP is the name of the key generation module. Its inputs are seed(511:0), ce, clk and reset_I, and its outputs are d_val(1023:0), e_val(1023:0), fi_out(1023:0), n_out(1023:0) and key_out. seed(511:0) is the 512-bit input applied to produce the encryption and decryption keys. “ce” is the chip enable input and is an asynchronous active high input. On every positive edge of the clock, reset_I is checked. “reset_I” is an asynchronous active high input: the component works only if this input is low, and if it is high the output is held in the reset state. After computing the data, the module produces the encryption key e_val(1023:0) of 1024 bits, the decryption key d_val(1023:0) of 1024 bits, and the parameters fi_out(1023:0) and n_out(1023:0) of 1024 bits each. A high output on key_out indicates that the computation is completed and the keys are available.

Second level RTL view of Commutative RSA with key generation

using ISE Xilinx tool and Vivado Xilinx tool are presented in Figure 5.13

and Figure 5.14 respectively. Second level RTL view presents the

internal details of the module. Five sub - modules are seen in Figure

5.13 with their inputs and outputs. Two modules for checking the prime

number represented as KEYGEN_CRSA_KGCORE_CHKPVAL and

KEYGEN_CRSA_KGCORE_CHKQVAL are the first two blocks.

KEYGEN_CRSA_KGCORE_GEN_N_PHI is for computing “n” and

“φ(n)”. KEYGEN_CRSA_E_D_VAL_CMPTE is for computing encryption

key and decryption key. KEYGEN_CRSA_MAIN_MEM_MGMT is for

memory management. Second level RTL view also presents the

interconnection between each block by showing the intermediate inputs

and outputs of the sub-modules. It is observed that ISE 14.3 Xilinx Tool

presents the sub-modules as blocks and Vivado 2012.3 Xilinx Tool


presents the sub-modules as multiplexers and registers.

Figure 5.13: Second Level RTL View of CRSA Key Generation Using ISE 14.3 Xilinx Tool



Figure 5.14: Second level RTL view of CRSA Key Generation

Using Vivado 2012.3 Xilinx Tool

RTL schematics of the Linear Feedback Shift Register (LFSR) for generating pseudo random numbers, obtained using the Vivado Xilinx tool, are presented in Figure 5.15 (a) to Figure 5.15 (g). To generate a 512-bit random number, 512 registers are used. The interconnection of signals between these registers is shown in Figures 5.15 (a) to 5.15 (g).
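The behaviour of an LFSR can be illustrated on a much smaller register. The Python sketch below uses a 16-bit Fibonacci LFSR with the feedback polynomial x^16 + x^14 + x^13 + x^11 + 1, a well-known maximal-length choice; the width, the taps and the seed are assumptions for illustration and are not the 512-bit configuration of the design.

```python
def lfsr16_step(state: int) -> int:
    """One shift of a 16-bit Fibonacci LFSR (taps at bits 16, 14, 13 and 11)."""
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    return (state >> 1) | (bit << 15)

def lfsr16_sequence(seed: int, count: int):
    """Return the next `count` register states after a non-zero seed."""
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")  # the all-zero state never changes
    states = []
    for _ in range(count):
        state = lfsr16_step(state)
        states.append(state)
    return states

seq = lfsr16_sequence(0xACE1, 65535)
print(len(set(seq)), seq[-1] == 0xACE1)
```

A maximal-length register of this width visits every non-zero state once before repeating, which is the property that makes LFSR outputs usable as prime number candidates.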


Figure 5.15 (a) to (g): RTL View of LFSR Using Vivado 2012.3 Xilinx Tool


RTL View of checking the prime number using Vivado Xilinx tool

is presented in Figure 5.16. Sub modules are shown as multiplexers

and registers.


Figure 5.16: RTL View of Checking the Prime Number


Table 5.8 presents Device utilization of encryption, decryption

and CRSA key generation using ISE 14.3 Xilinx tool.

Table 5.8 Device Utilization of Encryption, Decryption and Key Generation RTL Designs

Device Utilization                    Utilized     Available     Utilization (%)

Number of Slice Registers             31298        44800         69

Number of Slice LUTs                  30129        44800         67

Number of Bonded IOBs                 132          172           76

Number of fully used LUT-FF pairs     20258        41169         49

1. The RTL VHDL codes were first run on the ISE design tool targeting device xc5vlx20t-2ff323. The utilization was found to be only 7%, which means that the design consumes very little of this device: only 7% of the slice LUTs were utilized and 93% of the resources remained unused. Device utilization summary is presented in Figure 5.17.


Figure 5.17: Device Utilization summary Using ISE 14.3 Design

Tool xc5vlx20t-2ff323

2. Thereafter, the codes were run on the ISE design tool targeting device xc5vsx50t-2ff1136. The utilization was found to be 95%. The slice LUTs were thus efficiently used, but this leaves no room for further extension of the design. Device utilization summary is presented in Figure 5.18.


Figure 5.18: Device Utilization Summary Using ISE 14.3

Design Tool xc5vsx50t-2ff1136

3. Based on the above two points, the codes were run on the ISE design tool targeting device xc5vfx70t-2ff1136, which revealed 67% utilization of the slice LUTs. The slice LUTs are thus utilized efficiently, and further enhancements remain possible. Device utilization summary is presented in Figure 5.19.


Figure 5.19: Device Utilization Summary Using ISE 14.3

Design Tool xc5vfx70t-2ff1136

Table 5.9 presents the timing report of CRSA with key generation,

encryption and decryption.

Table 5.9 Timing Report for CRSA Using ISE 14.3

RTL Design                              Maximum Frequency (MHz)

CRSA with Key Generation                335

CRSA with Encryption and Decryption     199


It may be noted that key generation takes considerable time, whereas encryption and decryption, optimized using the parallelized Montgomery implementation with higher radix, have an efficient processing time. The implementation of the parallel Montgomery algorithm with the enhanced key generation approach has exhibited good results in terms of not only memory occupancy but also throughput, which makes this approach suitable especially for real time applications. In addition, the optimization in key generation with commutative characteristics has reduced the key exchange overheads to a great extent.

Table 5.10 presents the FPGA Resource Consumption of the RTL

VHDL Design. After compiling, a report is generated which provides

information on the speed and size of the compiled program. The Device

Utilization Summary section provides information on the number of

slices used. This metric is the most important measure of the size of the

compiled program on hardware. The designed system utilizes 31298 out

of 44800 (69%) slice registers, 30129 out of 44800 (67%) slice LUTs.


Table 5.10 FPGA Resource Consumption of the RTL VHDL Design

Device utilization summary: ---------------------------

Selected Device : 5vfx70tff1136-2

Slice Logic Utilization:

Number of Slice Registers: 31298 out of 44800 69%

Number of Slice LUTs: 30129 out of 44800 67%

Number used as Logic: 30129 out of 44800 67%

Slice Logic Distribution:

Number of LUT Flip Flop pairs used: 41169

Number with an unused Flip Flop: 9871 out of 41169 23%

Number with an unused LUT: 11040 out of 41169 26%

Number of fully used LUT-FF pairs: 20258 out of 41169 49%

Number of unique control sets: 26
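The percentages in the report can be reproduced from the raw counts. The Python sketch below assumes that the tool truncates each percentage to a whole number, which is consistent with every row of Table 5.10; the function and dictionary names are illustrative.

```python
def utilization_percent(used: int, available: int) -> int:
    """Integer percentage, truncated (assumed to match the report's rounding)."""
    return (100 * used) // available

# (used, available) pairs copied from Table 5.10.
report = {
    "Slice Registers": (31298, 44800),
    "Slice LUTs": (30129, 44800),
    "LUT-FF pairs with an unused Flip Flop": (9871, 41169),
    "LUT-FF pairs with an unused LUT": (11040, 41169),
    "Fully used LUT-FF pairs": (20258, 41169),
}
percentages = {name: utilization_percent(u, a) for name, (u, a) in report.items()}

# The three LUT-FF pair categories should add up to the total number of pairs.
pairs_total = 9871 + 11040 + 20258

for name, value in percentages.items():
    print(f"{name}: {value}%")
print(pairs_total)
```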

The generated timing reports are presented in Figures 5.20 to

5.22. The system clock frequency reported by the Place and Route tool

is 199 MHz for CRSA encryption and decryption and 335 MHz for

commutative cryptography core.


Timing Summary: --------------- Speed Grade: -2

Minimum period: 5.011ns (Maximum Frequency: 199 MHz)

Minimum input arrival time before clock: 2.402ns

Maximum output required time after clock: 2.965ns

Figure 5.20: Timing Report for Commutative Encryption and

Decryption Using ISE 14.3 Xilinx Tool

Timing Summary: --------------- Speed Grade: -2




Maximum combinational path delay: 0.468ns

Figure 5.21: Timing Report for CRSA with Key Generation Using

ISE 14.3 Xilinx Tool



Timing Summary:

---------------

Speed Grade: -2




Maximum combinational path delay: 0.930ns

Figure 5.22: Timing Report for CRSA with Key Generation Using

Vivado 2012.3 Xilinx Tool


5.5.1 Rationale for 199 MHz

As per the design, the requested frequency of operation is 100

MHz to match the simulation frequency, whereas synthesis yielded

a much faster clock of 199 MHz, as shown in Figure 5.20. In terms of

periods, the requested and reported values are 10 ns and 5.011 ns

respectively. The difference, known as the slack time,

is 4.989 ns. The slack time must be positive; otherwise, the device

cannot meet the requested frequency of operation. Similarly, the timing

report for CRSA with key generation using the ISE 14.3 Xilinx tool in

Figure 5.21 indicates a minimum period of 2.978 ns (maximum

frequency: 335 MHz), so the slack time is 7.022 ns. The timing report

for CRSA with key generation using the Vivado 2012.3 Xilinx tool in

Figure 5.22 indicates a minimum period of 3.422 ns (maximum

frequency: 292 MHz), so the slack time is 6.578 ns. In all three cases,

the slack time is positive. Hence the device meets at least the

requested frequency of operation. In an actual FPGA implementation of

the CRSA encryption and decryption processors, a clock close to

200 MHz is expected to be used, roughly doubling the throughput

compared to the simulation.
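The slack values quoted in this section follow directly from the requested 100 MHz clock and the minimum periods in the three timing reports. A minimal sketch of that arithmetic is given below. The function names are illustrative, and the figure labels are used only as dictionary keys.

```python
def slack_ns(requested_mhz: float, min_period_ns: float) -> float:
    """Slack = requested clock period minus the minimum period reported by the tool."""
    requested_period_ns = 1000.0 / requested_mhz
    return requested_period_ns - min_period_ns


def max_frequency_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency implied by a minimum period given in nanoseconds."""
    return 1000.0 / min_period_ns


REQUESTED_MHZ = 100.0

# Minimum periods (ns) as quoted in the text for Figures 5.20 to 5.22.
reported_periods = {
    "Figure 5.20": 5.011,
    "Figure 5.21": 2.978,
    "Figure 5.22": 3.422,
}

for figure, period in reported_periods.items():
    slack = slack_ns(REQUESTED_MHZ, period)
    fmax = max_frequency_mhz(period)
    # Timing is met only when the slack is positive.
    print(f"{figure}: slack = {slack:.3f} ns, fmax = {fmax:.1f} MHz, met = {slack > 0}")
```

A positive slack in every row corresponds to the statement that the device meets the requested frequency in all three cases.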

Summary

An architecture for Commutative RSA with key generation has been

designed and implemented. This chapter presented the simulation results

obtained at three transceiver terminals for the commutative cryptography

core, a comparative analysis of the serial and parallel

Montgomery multipliers in terms of area, throughput and delay, the device

utilization summary, the timing summary, and the top level and second

level RTL views.
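For readers who want to recall what a Montgomery multiplier computes, the following is a minimal software sketch of the textbook radix-2 (bit-serial) Montgomery product. It is not the VHDL design compared in this chapter, and the function names, the toy modulus and the operand width are all assumptions chosen for illustration.

```python
def montgomery_multiply(a: int, b: int, modulus: int, bits: int) -> int:
    """Radix-2 Montgomery product: a * b * 2^(-bits) mod modulus.

    Requires an odd modulus below 2**bits and operands below the modulus.
    """
    if modulus % 2 == 0 or modulus >= (1 << bits):
        raise ValueError("modulus must be odd and smaller than 2**bits")
    if not (0 <= a < modulus and 0 <= b < modulus):
        raise ValueError("operands must lie in [0, modulus)")
    s = 0
    for i in range(bits):
        if (a >> i) & 1:      # add b when bit i of a is set
            s += b
        if s & 1:             # make the sum even by adding the odd modulus
            s += modulus
        s >>= 1               # exact division by 2
    if s >= modulus:          # single final correction; s is below 2 * modulus
        s -= modulus
    return s


def modular_multiply(a: int, b: int, modulus: int, bits: int) -> int:
    """Compute a * b mod modulus by moving through the Montgomery domain."""
    r_squared = pow(1 << bits, 2, modulus)                     # R^2 mod N with R = 2**bits
    a_mont = montgomery_multiply(a, r_squared, modulus, bits)  # a * R mod N
    b_mont = montgomery_multiply(b, r_squared, modulus, bits)  # b * R mod N
    product_mont = montgomery_multiply(a_mont, b_mont, modulus, bits)
    return montgomery_multiply(product_mont, 1, modulus, bits)  # leave the domain


# Toy 12-bit example; real RSA operands are far wider.
print(modular_multiply(65, 2790, 3233, 12) == (65 * 2790) % 3233)
```

Each loop iteration uses only additions and a one-bit shift, which is why the method avoids trial division and maps well to hardware.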


Chapter 6

Conclusions and Scope for Future Work

6.1 Conclusions

In the present work, data security algorithms and architectures

were designed for multiple input multiple output or multi transceiver

systems. The commutative RSA approach has been implemented with

multiple FPGA cores that function as individual transceiver terminals,

each performing its own encryption and decryption to recover the

original data. Architectures were realized using the hardware description

language VHDL, conforming to RTL coding guidelines.

The complete design of Commutative RSA with key generation

was validated using the Modelsim simulator, and the synthesis and place

and route results were obtained by implementing the design using

Xilinx Design Suite 14.3 and Vivado 2012.3.

6.2 Contributions of This Work

The main contribution of this work was to design a security

system for multi-party communications. The major contributions of this

work may be summarized as follows:


1. A public key cryptographic scheme was designed that introduces the

commutative nature of RSA encryption and decryption. For

multiple input multiple output communications, three users were

considered. User 1 generates the “original message”, encrypts it

using its private key (encryption key) and sends it to User 2.

User 2 in turn encrypts the message received from User 1 and

transmits it to User 3. Thereafter, User 3 decrypts the

message twice, using the public keys of User 2 and User 1

respectively, to recover the “original message”. Any authenticated

user in the multi-party system can decrypt the message, provided he/she

knows the order of encryption.

2. Two versions of the Montgomery multiplier, serial and parallel,

were designed and their performances compared with respect to

chip area, power consumption, delay, throughput and frequency of

operation. The delay of the proposed Parallel Montgomery based

CRSA is 13.8% lower than that of the Serial Montgomery based CRSA

cryptography core. Similarly, the throughput of the proposed

Parallel Montgomery based CRSA is 12.1% higher than that of the Serial

Montgomery based CRSA architecture. The Parallel Montgomery

multiplier thus exhibits better performance than the Serial

Montgomery multiplier.


3. A Commutative RSA cryptosystem was designed using the Parallel

Montgomery multiplier. In order to avoid key exchange

overheads, each user generates both a public key and a private

key. The private key is used for encrypting the data and the public

key for decrypting it. A classical method may be

used to share the public key among the authenticated users in

the group.

4. Commutative RSA with key generation architectures were coded

in VHDL conforming to RTL coding guidelines, which are essential

for a synthesizable, working design.

5. Suitable test benches were developed using VHDL and the

complete design was functionally tested using Modelsim.

6. Synthesis and place and route results were obtained by

implementing the RTL design using Xilinx Design Suite 14.3. The

designed system utilizes 31298 out of 44800 (69%) slice

registers and 30129 out of 44800 (67%) slice LUTs. The system

clock frequency reported by the place and route tool is 199 MHz

for CRSA encryption and decryption and 335 MHz for the

commutative cryptography core.
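The three-user exchange described in the first contribution can be illustrated with toy numbers. The sketch below is an illustration under stated assumptions, not the implemented core: it assumes that all users share one RSA modulus (which is what makes the exponentiations commute), uses tiny primes that offer no security, and picks the exponents 17 and 7 arbitrarily. All names are hypothetical.

```python
from math import gcd

# Toy shared modulus (assumption: every user works modulo the same N).
P, Q = 61, 53
N = P * Q
PHI = (P - 1) * (Q - 1)


def make_key_pair(public_exponent: int):
    """Return (public, private) exponents with public * private = 1 mod PHI."""
    if gcd(public_exponent, PHI) != 1:
        raise ValueError("public exponent must be coprime to PHI")
    private_exponent = pow(public_exponent, -1, PHI)  # modular inverse (Python 3.8+)
    return public_exponent, private_exponent


user1_public, user1_private = make_key_pair(17)
user2_public, user2_private = make_key_pair(7)

message = 65

# User 1 encrypts with its private key, then User 2 encrypts the result again.
after_user1 = pow(message, user1_private, N)
after_user2 = pow(after_user1, user2_private, N)

# User 3 decrypts twice, with the public keys of User 2 and then User 1.
recovered = pow(pow(after_user2, user2_public, N), user1_public, N)

# With a shared modulus the exponentiations commute, so the reverse order also works.
recovered_other_order = pow(pow(after_user2, user1_public, N), user2_public, N)

print(recovered == message, recovered_other_order == message)
```

In this toy model either decryption order recovers the message, which is the commutative property the scheme relies on.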


The developed system can be used for sending or receiving

stereo audio channels at sampling rates up to 44.1 kHz. It can also be

used for transmitting or receiving video signals at up to 360 megabytes per

second.

6.3 Scope for Future Work

In the present work, only three user terminals were considered.

Future work may consider larger group sizes. Public keys may be

generated offline and stored in different databases to increase the

speed of the RSA algorithm. For group communications, novel

algorithms may be designed for updating the keys when a new member

enters the group or an existing member leaves it.



List of Publications

1. R. Ambika, S. Ramachandran, K. R. Kashwan, “Securing

Distributed FPGA System using Commutative RSA Core”, Global

Journal of Researches in Engineering Electrical and Electronics

Engineering, Vol. 13, Issue 15, Version 1, Nov. 2013.

2. R. Ambika, S. Ramachandran, K. R. Kashwan, “Data Security

Using Serial Commutative RSA Core for Multiple FPGA System”,

2014 2nd International Conference on Devices, Circuits and

Systems, pp. 1-5, March 2014.

3. R. Ambika, S. Ramachandran, K. R. Kashwan, “Design of

Commutative Cryptography Core with Key Generation for

Distributed FPGA Architecture”, International Journal of Current

Engineering and Technology, Vol. 4, No. 5, pp. 3519-3527, Oct.

2014.

4. R. Ambika and Sahana Devanathan, “FPGA Implementation of

Cryptographic Algorithms: A Survey”, International Journal of

Scientific & Engineering Research, Volume 4, Issue 4, pp. 884-888,

April 2013.

5. R. Ambika and Hamsavahini R, “A Survey on Hardware

Architectures for Montgomery Modular Multiplication Algorithm”,

International Journal of Emerging Technologies in Computational

and Applied Sciences, Vol. 5, Issue 3, pp. 217-221, June-August

2013.


APPENDIX A

RTL CODING:

RTL coding guidelines are widely followed rules for writing

synthesizable code. The code in this project has been written in

accordance with them. They are as follows:

1 Avoid hard-coded values and numeric constants; use attributes on

objects or explicitly declared constants.

2 Use rising_edge() exclusively (it is time to give up clk'event

and clk='1').

3 "wait" must not be used.

4 Input ports can be left unconnected (open) at instantiation

provided they are assigned default values at declaration.

5 Do not leave ports unconnected by omission: use "open".

6 Avoid recursive code.

7 Inside an entity's synthesizable architecture, the authorized types

are: std_logic, std_logic_vector, signed, unsigned. The use of

integer range and Boolean requires care and caution since these

scalar types are implicitly initialized at creation. Enumerated types

have the same behavior and must be treated with even greater

care.

8 Use as few variables as possible in synthesizable code, and

never where a signal could be used instead. Use variables only for their

specific behavior (factoring, intermediate results, re-using the

flip-flop inputs, etc.).

9 Never create latches, combinational feedback or

asynchronous sequential logic, whether intended or

unintended.

10 You must not initialize signals at their declaration. You must

not initialize variables at their declaration in processes. It is

acceptable to initialize variables at their declaration inside

functions.

11 All asynchronous input signals should be re-synchronized.

12 All the Entity's outputs should be registered. Combinational

outputs can also create combinational feedbacks through the

hierarchy.


APPENDIX B

DEVELOPMENT TOOLS

SIMULATION TOOL: ModelSim

ModelSim is a tool that provides a comprehensive simulation and debug

environment for complex ASIC and FPGA designs.

Version used: ModelSim PE 5.5e

Command Summary:

Double click on the icon “Shortcut to Modelsim.exe.link” on your desktop.

The main ModelSim window and the “Welcome to ModelSim” window open.

Click on “Create a Project” in the welcome window to start a new

project. The Create Project window opens. Type the project name in the

project name field, and type the location of the directory where your

Verilog files are stored in the project location field. Make sure

“work” is entered as the default library name. Click on OK. In the

main ModelSim window, click on Library at the bottom left.


Opening a Project

In the main window, click on Design => Compile. The compile menu opens.

Double click on the desired test bench. The test bench and the

design are compiled. Any errors are reported in the main

window; fix them and recompile. Otherwise, click on Done. Click

on Design => Load Design in the main window. The Load Design menu

opens. Double click on the desired test bench, or select the

test bench name and click on Load. The test bench is

loaded into the simulator.


Loading a Design


Compiling a Design

In the main window, click on View => Signals. The Signals window opens. In

that window, click on View => Wave => Signals in Design. The waveform

window, listing all the signals, opens. The X-axis of the

waveform gives the time in nanoseconds.


Signals in Design

Click on Run All, located in the second row of the toolbar at the top right. Click on

Zoom Full to view all the waveforms. Other options such as

Zoom In, Zoom Out and Zoom Area may be used to get the desired view.


Simulation Result Window

Locate a signal, say “clk”, and click on it. It will be highlighted. Click

on the highlighted signal, drag it up and drop it above all the other

signals. Likewise, you can drop the other signals belonging to the same

function next to the first signal, and so on. You can also use copy or

cut and paste to place a signal where desired. Click on “Format => Radix =>

Binary”, “Decimal”, “Hexadecimal” or “Octal” as required.

All selected signals are displayed in the chosen number system.


Use <, > arrows to move the waveforms along the time axis. Analyze

the waveforms to check the functionality.

This completes one session. To exit ModelSim, click on “File =>

Quit” and then “Yes”.

To resume the same project in ModelSim later, double

click on the same icon on the desktop.

Click on “Open a Project” in the “Welcome to ModelSim” window. Make sure

that the directory is the same by clicking on “Library”. Otherwise, click

on “File => Change Directory”, select the desired directory and click

on “Open”.


APPENDIX C

DEVELOPMENT OF ASM CHARTS

Algorithmic State Machine (ASM) charts are a graphical representation of

the step-by-step operation of a hardware design. They are easy to

interpret and can be readily converted to other forms of representation.

They do not enumerate all the possible inputs and outputs. Only the

inputs that matter and the outputs that are asserted are indicated. It

must therefore be known whether a signal uses positive or negative logic:

Positive logic signals are said to be asserted when they are high.

Negative logic signals are said to be asserted when they are low.

In this report, a _n suffix is added to the names of negative logic

(active-low) signals.
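The naming convention can be illustrated with a minimal Python sketch. The helper name is_asserted and the example signal names reset_n and start are hypothetical and are used only for illustration; they are not taken from the design.

```python
def is_asserted(name: str, level: int) -> bool:
    """Return True if a signal is asserted at the given logic level (0 or 1).

    Assumption: names ending in "_n" denote negative logic (active-low)
    signals; every other name denotes a positive logic signal.
    """
    if name.endswith("_n"):
        return level == 0   # negative logic: asserted when low
    return level == 1       # positive logic: asserted when high


# Hypothetical signal names, for illustration only.
print(is_asserted("reset_n", 0))  # active-low reset driven low
print(is_asserted("start", 0))    # active-high start driven low
```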

The ASM Diagram Block

An ASM chart has an entry point and is constructed with blocks. A block

is constructed with the following type of symbols.

state box: The state box has a name and lists outputs that are asserted

when the system is in that state. These outputs are called synchronous

or Moore type outputs.


State Box Representation

Optional decision box: A decision box may be conditioned on a signal

or a test of some kind.

Condition Box Representation

Optional conditional output box: Such an output box indicates

outputs that are conditionally asserted. These outputs are called

asynchronous or Mealy outputs.


Conditional Output Box

There is no rule that outputs must appear exclusively in a conditional

output box or exclusively in a state box. An output written inside a state

box is simply independent of the inputs while the system is in that state.
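The difference between the two kinds of output can be sketched in Python by evaluating one ASM block at a time. This is a minimal sketch of a hypothetical two-state controller: the state names IDLE and BUSY, the inputs start and done, and the outputs load and busy are assumptions made for illustration and do not describe the cores in this thesis.

```python
def asm_step(state, inputs):
    """Evaluate one ASM block and return (next_state, asserted_outputs).

    Hypothetical controller used only to illustrate the three symbols:
    state box, decision box and conditional output box.
    """
    outputs = set()
    if state == "IDLE":
        # State box: IDLE asserts no Moore outputs.
        if inputs["start"]:         # decision box on input "start"
            outputs.add("load")     # conditional output box (Mealy output)
            return "BUSY", outputs
        return "IDLE", outputs
    if state == "BUSY":
        outputs.add("busy")         # state box output (Moore output)
        if inputs["done"]:          # decision box on input "done"
            return "IDLE", outputs
        return "BUSY", outputs
    raise ValueError(f"unknown state: {state}")
```

Here busy is asserted whenever the chart is in BUSY, whatever the inputs are, whereas load is asserted in IDLE only while start is high.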

The drawing of ASM charts must follow certain necessary rules:

i. The entrance paths to an ASM block lead to only one state

box.

ii. Of the 'N' possible exit paths, only one can be followed for

each valid input combination; that is, there

is only one valid next state.

iii. No feedback internal to a state box is allowed.
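Rule ii can be checked mechanically by enumerating every input combination for every state. The following Python sketch assumes each ASM block is described as a list of (condition, next state) exit paths and treats all input combinations as valid. The chart data, the signal names start and done, and the function name check_single_exit are hypothetical.

```python
from itertools import product


def check_single_exit(exits, input_names):
    """Return the (state, inputs, next_states) cases that break rule ii.

    exits maps each state name to a list of (condition, next_state) pairs,
    where condition is a function of the input dictionary. Exactly one
    condition should hold for every input combination.
    """
    violations = []
    for state, paths in exits.items():
        for values in product((0, 1), repeat=len(input_names)):
            inputs = dict(zip(input_names, values))
            taken = [nxt for cond, nxt in paths if cond(inputs)]
            if len(taken) != 1:
                violations.append((state, inputs, taken))
    return violations


# Hypothetical chart that satisfies rule ii.
GOOD = {
    "IDLE": [(lambda i: i["start"] == 1, "BUSY"),
             (lambda i: i["start"] == 0, "IDLE")],
    "BUSY": [(lambda i: i["done"] == 1, "IDLE"),
             (lambda i: i["done"] == 0, "BUSY")],
}

# Same chart with overlapping exit conditions in IDLE (breaks rule ii).
BAD = dict(GOOD)
BAD["IDLE"] = [(lambda i: i["start"] == 1, "BUSY"),
               (lambda i: i["start"] == 0 or i["done"] == 1, "IDLE")]
```

Calling check_single_exit(GOOD, ("start", "done")) returns an empty list, while the BAD chart is reported for the IDLE state when both start and done are high, because two exit paths are enabled at once.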
