HARDWARE ARCHITECTURE FOR DATA
SECURITY
Thesis Submitted in partial fulfillment for the
Award of Degree of
DOCTOR OF PHILOSOPHY IN ELECTRONICS AND
COMMUNICATION ENGINEERING
By
AMBIKA R.
(Reg. No. D898900001)
Under the Guidance of
Dr. S. RAMACHANDRAN
Professor, Department of ECE
SJBIT, BENGALURU – 560060
VINAYAKA MISSIONS UNIVERSITY
SALEM, TAMILNADU, INDIA
AUGUST 2015
Dedicated to
MY BELOVED PARENTS AND TEACHERS
VINAYAKA MISSIONS UNIVERSITY
Declaration
I, Ambika R. declare that the thesis entitled “HARDWARE
ARCHITECTURE FOR DATA SECURITY” submitted by me for the
award of the Degree of Doctor of Philosophy is the record of work
carried out by me during the period from January 2008 to August
2015 under the guidance of Dr S. Ramachandran and has not
formed the basis for the award of any degree, diploma, associate-ship,
fellowship, titles in this or any other University or other similar
institutions of higher learning.
Place: Bengaluru (AMBIKA R.)
Date:
VINAYAKA MISSIONS UNIVERSITY
Certificate by the Guide
I, S. Ramachandran certify that the thesis entitled “HARDWARE
ARCHITECTURE FOR DATA SECURITY” submitted for the award of
the Degree of Doctor of Philosophy by Ms. Ambika R. is the record of
research work carried out by her during the period from January 2008 to
August 2015 under my guidance and supervision and this work has not
formed the basis for the award of any degree, diploma, associate-ship,
fellowship or other titles in this University or any other university or
Institution of higher learning.
Place: Bengaluru (Dr S. Ramachandran)
Date:
ACKNOWLEDGEMENTS
I would like to express my deepest gratitude to my research guide
Dr. S. Ramachandran, Professor, Department of Electronics and
Communication Engineering, SJBIT, Bengaluru for his constant support,
patience and timely guidance at every step of my research work,
without whom this herculean task would not have been completed.
My sincere thanks to Vinayaka Missions University, Salem, for
providing me an opportunity to carry out my research work under its
banner. My heartfelt thanks to the Chancellor, Dean (Research) of
Vinayaka Missions University for their constant support. I would like to
thank other officers of Vinayaka Missions University, Salem for their
timely assistance and guidance in completing my research work.
I sincerely thank BMS Educational Trust and Management of
BMS Institute of Technology for supporting me in my endeavor to
complete the thesis work.
It gives me great pleasure indeed to place on record my
humblest and sincere token of gratitude and heartfelt thanks to Dr
Mohan Babu G. N., Principal, BMSIT&M, Dr R. V. Ranganath, Professor,
Department of Civil Engineering, BMSCE, Dr S. Venkateswaran,
Principal, BMS Evening College, and Dr A. C. Bhaskar Naidu, former
Principal and Professor of Mechanical Engineering.
I would also like to thank Mrs Priyanka Agarwal of SJBIT,
Mrs Vijayalakshmi K. of BMSCE, Mrs S. K. Pushpa, Mrs C. S. Mala,
Mrs Sahana Devanathan, Mr Saneesh Cleatus T., Mrs Shashikala J. of
BMSIT&M, Dr Fathima Jabeen, HOD, ECE, KSSEM, Ms P. C. Sunitha
and Ms Yashaswini, Staff of ECE department, BMSITM for all their help
and cooperation in carrying out my work. I thank Mr Ashutosh and
Kashyap for their valuable suggestions during the course of my work.
I thank my parents, sisters, my nephew Kushal B. N. and friends
for their constant encouragement and support.
Ambika R.
Abstract
High speed data communication in multiple transceivers and
Multiple Input Multiple Output applications demands a highly robust
and secure system model. Such security is essential in realizing
secure telephony systems, e-commerce, e-banking and multi-user
secure communication scenarios. Confidentiality, authenticity, data
integrity and non-repudiation are the major requirements in security.
A number of approaches and systems have been designed and
developed for ensuring data security in competitive multiuser
scenarios. Among them, the public key cryptosystem has been
recognized as one of the optimum solutions. With the rapid spread of
digital communication networks, there is a great need for privacy and
security of transmitted data. Therefore, the methods of safeguarding
information are becoming a major issue, for which encryption and
decryption systems have been created.
Numerous efforts have been made to optimize authentication with
RSA cryptosystems, and the majority were implemented on hardware
platforms. But for competitive multi-transceiver or MIMO kinds of
applications, these approaches are found to be limited in terms of
critical latency, power and hence overall performance. Most of these
schemes are computationally intense since they use serial
Montgomery Multiplication. Further, RSA approaches suffer from
reorder cryptosystem limitations. The proposed implementation of
commutative cryptography using Parallel Montgomery Multiplication
offers a high processing speed of 0.5 µs at an operating frequency of
100 MHz with 3522 mW of power consumption. It also avoids the key
exchange complications.
RSA algorithm has been enhanced by means of its
optimization with commutative behavior. The Commutative RSA
approach has exhibited better results for Multiple Input Multiple
Output transceiver based secure communication. In the present work,
the Commutative nature of RSA algorithm has been proved. RSA
algorithm is implemented using both Serial and Parallel Montgomery
Multiplication Algorithms. Commutative Cryptography Core with Key
Generation has been designed for distributed Field Programmable
Gate Array (FPGA) Architecture, which reduces the key exchange
overheads.
The results obtained for Serial Montgomery and Parallel
Montgomery Multiplication based Commutative RSA (CRSA)
architectures have been compared. Considering performance
parameters such as memory occupancy, speed, power consumption,
delay and throughput, it has been found that the proposed Parallel
Montgomery based implementation performs better than the Serial
Montgomery based Commutative RSA implementation. The delay in the
proposed Parallel Montgomery based CRSA is 13.8% lower than in the
Serial Montgomery based CRSA cryptography core. Similarly, the
throughput of the proposed Parallel Montgomery based CRSA is
12.1% higher than that of the Serial Montgomery based CRSA
architecture. In the proposed design, the trade-off between power
consumption and area is also very small.
The Architecture has been realized using the Hardware
Description Language VHDL, conforming to Register Transfer Level
(RTL) coding guidelines, without which no chip can work. For
convenience, the frequency of operation in simulation has been
set to 100 MHz although the Place & Route report gives 199 MHz.
Various encryption and decryption data values at each user location
are computed.
The Synthesis, Place and Route have been run on the RTL
design. The design was synthesized using Xilinx Design Suite 14.3
targeted on Virtex-5, xc5vfx70t-2ff1136 FPGA and Vivado 2012.3
Xilinx tool. The design for both the encryption and the decryption
utilizes about 67% of the chip resources, leaving room for future
additions, if any. The maximum operating frequency reported by the
Xilinx Design Suite 14.3 tool is 199 MHz for CRSA encryption and
decryption and 335 MHz for commutative cryptography core.
TABLE OF CONTENTS
1. INTRODUCTION
1.1. Background
1.2. Classification of Cryptosystem
1.2.1 Symmetric Key Cryptosystem
1.2.2 Public Key Cryptosystem
1.3. The RSA Cryptosystem
1.3.1. Encryption and Decryption
1.3.2. Why RSA?
1.4. Motivations
1.5. Research Objectives
1.6. Methodology Adopted
1.7. Thesis Organization
2. REVIEW OF LITERATURE
3. DEVELOPMENT OF COMMUTATIVE RSA ALGORITHM
3.1. Introduction
3.2. System Model
3.3. Commutative RSA
3.4. Commutative Nature of Commutative RSA Algorithm
3.5. Commutative RSA Implementation with Serial and Parallel Montgomery Multiplication
3.5.1 Modular Exponentiation
3.5.2 Modular Multiplication
3.5.3 Montgomery Multiplication Algorithm
3.5.4 Radix – 2 Modular Multiplier
3.6 Modular Multiplication Algorithms
3.6.1 Algorithm for Sequential Binary (T, E, M)
3.6.2 Algorithm for Parallel Binary (T, E, M)
4. DEVELOPMENT OF ALGORITHM FOR COMMUTATIVE CRYPTOGRAPHY CORE WITH KEY GENERATION
4.1. Sequential Implementation of Key Generation
4.1.1. Linear Feedback Shift Register
4.1.2. Fibonacci LFSR
4.1.3. Galois LFSR
4.1.4. Primality Test
4.2. Key Generation
4.2.1 RTL Coding Guidelines
4.3. Commutative Encryption
4.4. Commutative Decryption
4.5. Commutative RSA Cryptography Core
4.6. CRSA Oriented Montgomery Parallel Multiplier
4.6.1. Parallel Montgomery Multiplier
4.7. Realization of Parallel Montgomery Multiplier
4.7.1. Pseudo Algorithm for Parallel Integer Multiplication
4.7.2. Pseudo Algorithm for Parallel Montgomery Multiplication
4.8. Realization of CRSA with Multiple Distributed Cores
5. SIMULATION AND PLACE AND ROUTE RESULTS OF COMMUTATIVE CRYPTOGRAPHY ARCHITECTURE WITH KEY GENERATION
5.1. Introduction
5.2. Hardware Design
5.3. A Comparative Analysis for Serial versus Parallel Montgomery Multiplication Based CRSA Implementation
5.4. Analysis of Simulation Waveforms
5.5. RTL View of Architecture of Commutative RSA
5.5.1. Rationale for 199 MHz
6. CONCLUSIONS AND SCOPE FOR FUTURE WORK
6.1. Conclusions
6.2. Contributions of This Work
6.3. Scope for Future Work
REFERENCES
LIST OF PUBLICATIONS
APPENDIX A
APPENDIX B
APPENDIX C
LIST OF FIGURES
Figure 3.1. Square and Multiply Algorithm
Figure 3.2. Algorithm for Modular Multiplication
Figure 3.3. Algorithm for Radix-2 Modular Multiplication
Figure 3.4. Algorithm for Sequential Binary Modular Multiplication
Figure 3.5. Serial Montgomery Multiplier
Figure 3.6. Algorithm for Parallel Binary Modular Multiplication
Figure 3.7. Parallel Montgomery Multiplier
Figure 4.1. ASM Chart for Finding the Prime Number
Figure 4.2. RTL Schematic for Finding the Prime Number Using ISE 14.3 Xilinx Tool
Figure 4.3. Flowchart for Finding GCD of Two Numbers
Figure 4.4. RTL View for Finding GCD Using ISE 14.3 Xilinx Tool
Figure 4.5. Key Generation for Commutative RSA Algorithm
Figure 4.6. Pseudo Algorithm for Commutative RSA Key Generation
Figure 4.7. ASM Chart for Key Generation for Commutative RSA
Figure 4.8. RTL Schematics of Top Module of CRSA Key Generation Using Vivado 2012.3
Figure 4.9. Schematic of CRSA Key Generation Using Vivado 2012.3
Figure 4.10. Pseudo Algorithm for Commutative Encryption
Figure 4.11. Pseudo Algorithm for Commutative Decryption
Figure 4.12. Sequential Model for Commutative RSA Realization
Figure 4.13. Pseudo Algorithm for Montgomery Multiplication
Figure 4.14. Pseudo Algorithm for Parallel Integer Multiplication
Figure 4.15. Pseudo Algorithm for Parallel Montgomery Multiplication
Figure 5.1. Comparison of Chip Area of Serial and Parallel Montgomery Based CRSA
Figure 5.2. Comparison of Power Consumption of Serial and Parallel Montgomery Based CRSA
Figure 5.3. Delay Comparison for Serial and Parallel Montgomery Based CRSA
Figure 5.4. Throughput Comparison of Serial and Parallel Montgomery Based CRSA
Figure 5.5. Simulation Waveform of CRSA Encryption at User Terminal 1
Figure 5.6. Simulation Waveform of CRSA Encryption at User Terminal 2
Figure 5.7. Simulation Waveform of CRSA Encryption at User Terminal 3
Figure 5.8. Simulation Waveform of CRSA Decryption at User Terminal 3
Figure 5.9. Simulation Waveform of CRSA Decryption at User Terminal 2
Figure 5.10. Simulation Waveform of CRSA Decryption at User Terminal 1
Figure 5.11. Top Level RTL View of Key Generation Using ISE 14.3 Xilinx Tool
Figure 5.12. Top Level RTL View of Key Generation Using Vivado 2012.3 Xilinx Tool
Figure 5.13. Second Level RTL View of CRSA Key Generation Using ISE 14.3 Xilinx Tool
Figure 5.14. Second Level RTL View of CRSA Key Generation Using Vivado 2012.3 Xilinx Tool
Figure 5.15 (a-g). RTL View of LFSR Using Vivado 2012.3 Xilinx Tool
Figure 5.16. RTL View of Checking the Prime Number
Figure 5.17. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vlx20t-2ff323
Figure 5.18. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vsx50t-2ff1136
Figure 5.19. Device Utilization Summary Using ISE 14.3 Design Tool, xc5vfx70t-2ff1136
Figure 5.20. Timing Report for Commutative Encryption and Decryption Using ISE 14.3 Xilinx Tool
Figure 5.21. Timing Report for CRSA with Key Generation Using ISE 14.3 Xilinx Tool
Figure 5.22. Timing Report for CRSA with Key Generation Using Vivado 2012.3 Xilinx Tool
LIST OF TABLES
Table 5.1. Comparison of Chip Area of Serial and Parallel Montgomery Based CRSA Cryptography Core
Table 5.2. Comparison of Power Consumption of Serial and Parallel Montgomery Based Commutative RSA Cryptography Core
Table 5.3. Performance (Delay, Frequency and Throughput) Comparison for Serial and Parallel Montgomery Based CRSA Cryptography Core
Table 5.4. Data at User Terminal 1
Table 5.5. Data at User Terminal 2
Table 5.6. Data at User Terminal 3
Table 5.7. Data Mapping
Table 5.8. Device Utilization of Encryption, Decryption and Key Generation RTL Designs
Table 5.9. Timing Report for CRSA Using ISE 14.3
Table 5.10. FPGA Resource Consumption of the RTL VHDL Design
LIST OF ACRONYMS
RSA - Rivest, Shamir, Adleman Algorithm
MIMO - Multiple Input Multiple Output
CRSA - Commutative RSA
DES - Data Encryption Standard
IDEA - International Data Encryption Algorithm
SSL - Secure Sockets Layer
TLS - Transport Layer Security
SSH - Secure Shell
PKI - Public Key Infrastructure
GCD - Greatest Common Divisor
FPGA - Field Programmable Gate Array
ISE - Integrated Synthesis Environment
VHDL - Very High Speed Integrated Circuit Hardware Description Language
CRT - Chinese Remainder Theorem
ASIC - Application Specific Integrated Circuit
VLSI - Very Large Scale Integration
GF - Galois Field
RNS - Residue Number System
ECC - Elliptic Curve Cryptography
MMM - Montgomery Modular Multiplication
CSA - Carry Save Adder
SMFCP - Secure Multi FPGA Communication Protocol
RTL - Register Transfer Level
CPA - Carry Propagation Adder
Vinayaka Missions University, Salem 1
Chapter 1
Introduction
1.1 Background
The rapid increase in data communication and the related spread
of internet services encompass multiple transceiver terminals and
Multiple Input Multiple Output (MIMO) applications. They require a
highly secure and robust system model that meets the fundamental
requirements of secure and authenticated data communication in
multiparty networks: privacy or confidentiality, authentication, data
integrity and non-repudiation. These security requirements are
necessary for services such as secure telephony, e-commerce,
e-banking and multi-user secure communication.
With the rapid spread of communication networks, there is a
need for privacy and security of transmitted data. Therefore, the
methods of safeguarding information are becoming a key issue, for
which cryptographic techniques have been created. Software and
hardware protocols have been implemented to improve the security of
information. A number of approaches have been developed to provide
security and data authenticity in multiple-user communication
scenarios, and cryptosystems are potential solutions for data security
and authenticity in MIMO kinds of communication environments. To this
end, an enhanced and optimized algorithm for public key cryptography
has been developed in this work.
In the proposed work, the well-known Rivest, Shamir, Adleman
(RSA) algorithm has been enhanced by adding commutative behavior.
Commutative behavior means that the order in which the encryption is
done does not affect the result if the decryption is accomplished in
the reverse order. The mathematically formulated scheme for RSA has
been converted into unitary algorithms, the overall system model has
been enriched with Serial as well as Parallel Montgomery
multiplication, and the complete system model has been realized with
multiple MIMO transceivers or FPGA distributed cores. The Commutative
RSA (CRSA) approach has exhibited good results for MIMO transceiver
based secure communication. In order to optimize the cryptosystem, it
is first required to enhance the process of encryption and decryption.
This enhancement can be accomplished effectively by introducing
modular multiplication approaches for exponentiation, such as
Montgomery multiplication, which exhibit high computational
efficiency. In order to accomplish secure communication and data
authentication, the encryption and decryption need to be the best in
terms of security.
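The commutative behavior described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions: the three users are assumed to share one modulus n and to differ only in their exponent pairs, the primes 61 and 53 are far too small for real use, and the names `make_keypair`, `encrypt_chain` and `decrypt_chain` are hypothetical. The actual CRSA construction is developed in Chapter 3 and may differ in its details.

```python
from itertools import permutations
from math import gcd

# Toy parameters, assumed for illustration only.
P, Q = 61, 53
N = P * Q                      # modulus shared by all users in this sketch
PHI = (P - 1) * (Q - 1)


def make_keypair(e):
    """Return (e, d) where d is the inverse of e modulo phi(n)."""
    if gcd(e, PHI) != 1:
        raise ValueError("e must be coprime to phi(n)")
    return e, pow(e, -1, PHI)


# Three user terminals, each with its own exponent pair.
USERS = [make_keypair(e) for e in (7, 11, 17)]


def encrypt_chain(message, order):
    """Encrypt at each user terminal in the given order."""
    c = message
    for user in order:
        e, _ = USERS[user]
        c = pow(c, e, N)
    return c


def decrypt_chain(cipher, order):
    """Decrypt at each user terminal in the given order."""
    m = cipher
    for user in order:
        _, d = USERS[user]
        m = pow(m, d, N)
    return m


message = 65
ciphers = {encrypt_chain(message, o) for o in permutations(range(3))}
print(len(ciphers))            # a single distinct cipher text for all orders
cipher = ciphers.pop()
print(all(decrypt_chain(cipher, o) == message
          for o in permutations(range(3))))
```

Because every step is an exponentiation modulo the same n, the exponents simply multiply, so in this toy model any encryption order gives the same cipher text and any decryption order recovers the message.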
1.2 Classification of Cryptosystem
Cryptography is defined as the study of mathematical techniques
related to the security of transmission and storage of information.
Cryptography is the science concerned with the design of ciphers,
whereas cryptanalysis is the related study of breaking ciphers.
Cryptography and cryptanalysis are complementary to each other:
development in one is usually followed by further development in
the other. Cryptography is an important tool in today's information
security. Although cryptography has historically been linked to
confidentiality, modern cryptographic techniques also address the
issues of integrity, authentication and non-repudiation. A cryptosystem
is an implementation of cryptographic techniques and their
accompanying infrastructure to provide information security services.
A cryptosystem is also referred to as a cipher system. The term
cryptography may be used in place of cryptosystem and vice versa.
There are two types of cryptography: symmetric-key cryptography and
public-key (or asymmetric-key) cryptography.
1.2.1 Symmetric Key Cryptosystem
In this cryptosystem, only one key is used, known as the secret
key. Both the communicating parties know this secret key. This secret
key can be a fixed key or it can be passed between the two parties
over a secure communication link. Symmetric-key (or secret-key)
cryptography can be seen as an outgrowth of classical cryptography. If
users want to communicate securely with each other, they must share a
key, which is used to both encrypt and decrypt messages. The security
of a symmetric-key scheme should rely on the secrecy of the key, as
well as on the “infeasibility” of decryption without knowledge of the
key.
Symmetric-key schemes are usually fast. One of the main issues
when deploying symmetric-key cryptography is the problem of key
establishment. Users must share a secret key to be able to communicate
securely, and by nature this key should be known to no one else.
Historically, key distribution was done in advance using a secure
communication channel, e.g., a trusted courier. In a network of users
wishing to communicate securely, each pair of users must share a
secret key, which makes the approach impractical for any medium-size
network. A solution could involve a central entity trusted by all the
users, e.g., a trusted third party, whose job would be the issuing of
session keys. There are a number of other solutions, but it should be
clear that key establishment is one of the main problems.
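As a rough worked count of why pairwise keys scale poorly (an illustration, not a figure from this work), a network of N users in which every pair shares its own secret key needs

```latex
\binom{N}{2} = \frac{N(N-1)}{2} \ \text{keys}, \qquad
N = 100 \;\Rightarrow\; \frac{100 \cdot 99}{2} = 4950 \ \text{keys},
```

and each user has to store N - 1 of them, so the total grows quadratically with the number of users.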
1.2.2 Public Key Cryptosystem
In this cryptosystem, two keys are present: a public key and a
private key. Each user has both a private key and a public key. Two
users can communicate because they know each other's public keys.
Normally in a public-key cryptosystem, a public key is used for
encryption and the corresponding private key for decryption. The
private transformation is described by a private key, and the public
transformation is described by a public key derived from the private
key. The RSA approach is one of the most popular public-key
techniques and is based on the difficulty of factoring large integers.
The concept of public-key cryptography was proposed in 1976 by
Whitfield Diffie and Martin Hellman in their innovative work [1]. As
advocated by them, the main motivation for this new concept was to
“minimize the need for secure key distribution channels and supply the
equivalent of a written signature”. In public-key cryptography, each user
has a key pair (e, d), which consists of a public key “e” and a private key
“d”. The user “A” can make “e” publicly available and keep only “d”
secret. Now anyone can encrypt information with “e”, while only user
“A” can decrypt it with the private key “d”. Alternatively, the private key
can be used to sign a document, and anyone can use the
corresponding public key to verify its authenticity. Public-key
cryptography is workable because it is computationally infeasible to
derive the private key “d” from the corresponding public key “e”.
Examples of public-key cryptosystems are the Diffie-Hellman key
exchange scheme and the ElGamal and RSA encryption and signature
schemes [1, 2].
At the heart of public-key cryptography lies the concept of a
one-way function. We say that a function “f” is “one-way” if it is easy
to compute f(x) for every “x” in the domain of “f”, but for most
randomly chosen “y” in the range of “f”, it is computationally
infeasible to find “x” such that “f(x) = y”. A “trapdoor one-way
function” is a one-way function for which, given some extra information
(the trapdoor), it becomes feasible to find “x” such that “f(x) = y” [1].
One can consider the encryption with “e” as the one-way function, with
“d” being the trapdoor information. In some cases, especially in MIMO
transceiver based communication scenarios, the one-way function of
RSA causes certain limitations. In multiple party communication this
drawback needs to be addressed, and with this objective, this work
presents a Commutative RSA scheme which is efficient in providing the
benefits of bi-directional cryptosystems under certain functional
principles.
Public-key cryptosystems are usually slower and require longer
keys compared with symmetric-key cryptography [1, 2]. On the other
hand, a public-key cryptosystem simplifies key distribution.
Conventional symmetric-key algorithms have a limited lifetime; they
become useless once exhaustive key search becomes feasible due to
computational progress. Public-key cryptography is highly flexible, as
one can select among a large number of keys. In practice, it is common
to use hybrid schemes: public-key techniques are used to establish
short-term (symmetric) keys, which are used to secure the
communication. This research work advocates a Montgomery
multiplication based approach with higher radix multiplication. This
reduces the execution time and would facilitate an optimum approach
for secure communication in multiple party communication
environments.
Considering the requirement of efficient, robust and secure
communication in MIMO transceiver based communication scenarios,
public key cryptography is a potential candidate for productive
results. Hence, in this work, the public key cryptographic technique
RSA has been taken up for further enhancement and optimization.
The RSA cryptosystem has exhibited good results in facilitating
secure data communication [2], but it still has scope for further
optimization. Two predominant limitations are iterative key
computation and key distribution across users. In addition, common
RSA requires a lot of time for encryption and decryption computations.
Most cryptographic techniques use a standard method to encrypt the
message. Some of the issues, such as uni-linear encryption, can be
effectively eliminated by employing schemes such as the commutative
approach. This thesis work takes the commutative behavior into
consideration, and a final system model called Commutative RSA has
been developed.
The difference in the encryption process is determined by an
electronic key which is added to the encryption process. This key may
be private, so that both the sender and the receiver use the same key
for encryption and decryption of the data. Unfortunately, with this
private key system, users would require a different key for each
conversation. Another disadvantage is that a user would need to pass
the private key through a secure channel, yet this channel may or may
not be secure. There is no way of knowing whether an external party
has the secret key. Usually private keys are changed at regular
intervals of time. If an unauthorized party knows the technique by
which these keys change, that party can also change the keys. These
problems are overcome with public-key encryption. This involves each
user having two keys. One is a public key, which is used to encrypt
the data that is sent to the user. The other is a private key, which is
used to decrypt the received encrypted data. No one knows the private
key except the user. The iterative key computation and key
distribution make the RSA implementation somewhat complicated and
costly in terms of overheads and overall performance. Such limitations
and unwanted overheads make the RSA cryptosystem inefficient and
somewhat complicated for MIMO application scenarios. In order to
eliminate the key computation cost, commutative RSA has been
advocated, and to make it compatible with hardware processors,
Montgomery multiplication with its parallelized application has been
proposed in this work.
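To make the role of Montgomery multiplication concrete, the following is a minimal Python sketch of the bit-serial (radix-2) Montgomery product. It is an illustration under stated assumptions, not the VHDL architecture of this work: the function names `montgomery_product` and `mod_mul` are hypothetical, the modulus is assumed to be odd, and the parallel and higher-radix variants discussed in later chapters are not reproduced here.

```python
def montgomery_product(a, b, n, k):
    """Return a * b * 2^(-k) mod n.

    Assumes n is odd, n < 2^k and 0 <= a, b < n. One bit of `a` is
    consumed per iteration, as in a bit-serial hardware multiplier.
    """
    s = 0
    for i in range(k):
        if (a >> i) & 1:       # add b when the current bit of a is set
            s += b
        if s & 1:              # make s even so the halving below is exact
            s += n
        s >>= 1                # divide by 2
    return s - n if s >= n else s


def mod_mul(a, b, n):
    """Plain modular product a * b mod n built from Montgomery products."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)                      # R^2 mod n, with R = 2^k
    a_m = montgomery_product(a, r2, n, k)      # a * R mod n
    b_m = montgomery_product(b, r2, n, k)      # b * R mod n
    p_m = montgomery_product(a_m, b_m, n, k)   # a * b * R mod n
    return montgomery_product(p_m, 1, n, k)    # strip the factor R


print(mod_mul(1234, 2345, 3233) == (1234 * 2345) % 3233)
```

Each loop iteration needs only additions and a one-bit shift, with no trial division by n, which is why the operation maps well onto hardware.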
1.3 RSA Cryptosystem
The RSA cryptosystem was created in 1977 and named after its
inventors Ronald Rivest, Adi Shamir and Leonard Adleman [2]. It is
widely used to secure communication on the Internet and to ensure
confidentiality and authenticity of e-mail, and it has become
fundamental to e-commerce. RSA is used in the most popular security
protocols, including Secure Sockets Layer (SSL) / Transport Layer
Security (TLS) and Secure Shell (SSH), as well as in most Public
Key Infrastructure (PKI) products. RSA is used in many applications
such as encryption, digital signatures and key establishment. RSA is
usually present wherever security of digital data is of concern.
The mathematical structure of the RSA function is quite simple
and this may be another reason for its popularity. RSA is based on
basic algebraic operations on large integers. Before the description of
the algorithm, we need to set some conventions as follows:
1. Let a, b and n be positive integers.
“a” is equal to “b” modulo “n” (denoted as a = b mod n) if b is
the remainder of “a” divided by “n”.
2. $Z_n$ is the set of integers modulo n. We can define the
operations addition and multiplication in this set by using the
usual operations on integers, but taking the result modulo “n”
as defined above. These are called modular addition and
modular multiplication.
3. We denote the greatest common divisor of a and b by
“gcd (a, b)”. This is defined as the largest positive integer
which divides both a and b.
RSA algorithm [3] can now be described as:
1. Generate two distinct large prime numbers $p$ and $q$ of the same size.
2. Compute $n = pq$ and $\phi(n) = (p-1)(q-1)$.
3. Choose an integer $e$, $1 < e < \phi(n)$, such that $\gcd(e, \phi(n)) = 1$.
4. Calculate $d$, $1 < d < \phi(n)$, such that $d \cdot e \equiv 1 \pmod{\phi(n)}$.
The pair of integers (e, n) is the public key. The pair of integers
(d, n) is the private key. The integer “n” is the modulus. “e” and “d” are
the public and private exponents respectively. It is also conventional to
call the bit length of the modulus “n” the size of the RSA key.
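The four steps above can be sketched in a few lines of Python. This is a minimal illustration with textbook-sized numbers, not the hardware key generator of this work; the function name `generate_keys` and the example values p = 61, q = 53, e = 17 are assumptions chosen only to keep the arithmetic easy to follow.

```python
from math import gcd


def generate_keys(p, q, e):
    """Return ((e, n), (d, n)) for distinct primes p, q and a chosen e.

    The caller is assumed to supply primes; no primality test is done here.
    """
    if p == q:
        raise ValueError("p and q must be distinct")
    n = p * q                          # step 2: modulus
    phi = (p - 1) * (q - 1)            # step 2: Euler's totient of n
    if not 1 < e < phi or gcd(e, phi) != 1:
        raise ValueError("e must satisfy 1 < e < phi(n), gcd(e, phi(n)) = 1")
    d = pow(e, -1, phi)                # step 4: inverse of e modulo phi(n)
    return (e, n), (d, n)


public_key, private_key = generate_keys(61, 53, 17)
print(public_key)    # (17, 3233)
print(private_key)   # (2753, 3233)
```

Here n = 3233 has a bit length of 12, so in the convention above this would be a 12-bit RSA key, which is of course far too small for any real use.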
1.3.1 Encryption and Decryption using RSA
Let “M” be the plain text or message, which conveys information.
Before transmission, it should be converted to an unreadable form, so
the message M is encrypted using the encryption key “e”. After
encryption, the cipher text “C” is obtained.
Represent a message to be encrypted as an integer $M \in Z_n$.
Encrypt M as $C = M^e \pmod{n}$.
The resulting cipher text “C” can be decrypted by computing
$D = C^d \pmod{n}$, and D equals M. This follows from
$d \cdot e \equiv 1 \pmod{\phi(n)}$.
The RSA algorithm is also used for generating digital signatures, which
can provide authenticity and non-repudiation of electronic legal
documents.
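A short Python sketch of these operations follows. It uses unpadded "textbook" RSA with the toy key n = 3233, e = 17, d = 2753, all of which are assumptions made for illustration; the function names are hypothetical, and practical systems add padding and hashing that are outside the scope of this section.

```python
# Toy key, assumed for illustration: n = 61 * 53, e * d = 1 mod phi(n).
N, E, D = 3233, 17, 2753


def encrypt(m):
    """C = M^e mod n, for an integer message 0 <= M < n."""
    return pow(m, E, N)


def decrypt(c):
    """D = C^d mod n, which recovers M."""
    return pow(c, D, N)


def sign(m):
    """Textbook signature: apply the private exponent to the message."""
    return pow(m, D, N)


def verify(m, signature):
    """Anyone holding the public key can check the signature."""
    return pow(signature, E, N) == m


cipher = encrypt(65)
print(cipher)                  # 2790
print(decrypt(cipher))         # 65
print(verify(65, sign(65)))    # True
```

Signing and decryption use the same private exponent, which is why the private key alone is sufficient to provide both confidentiality and a signature in this scheme.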
The RSA encryption scheme is a public-key cryptosystem, and
the RSA trapdoor function is defined as $f(x) = x^e \pmod{n}$. The
trapdoor is the private exponent “d”, since $(x^e)^d \equiv x \pmod{n}$.
The security of the RSA cryptosystem relies on the problem of factoring
large integers, which is widely believed to be intractable. Factorization
of the modulus is devastating for the system. If an adversary can factor
“n”, he can easily calculate the private exponent by solving the
congruence $d \cdot e \equiv 1 \pmod{\phi(n)}$ and thus invert the RSA
function. Recovery of the private exponent “d” is equivalent to factoring
the modulus “n” [4]. This means that an adversary who knows “d” can
efficiently factor “n”. This illustrates a possible misuse of the RSA
cryptosystem.
In order to avoid generating new primes for every user, a trusted
central authority could generate a common modulus and distinct
exponent pairs (e, d) for each user. Although this might seem a good
idea at first glance, the fact above shows that it is completely
insecure: any user “i”, knowing his/her own key pair, could factor “n”
and then recover all the other private keys. This shows that an RSA
modulus should never be used by more than one entity.
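The attack described above can be sketched in Python. The code below uses the standard square-root-of-unity argument: since e·d − 1 is a multiple of the order of every unit modulo n, repeated squaring exposes a non-trivial square root of 1, whose gcd with n is a prime factor. The function name `factor_from_key_pair` and the toy key are assumptions for illustration, and the bases are tried in increasing order rather than at random so that the run is repeatable.

```python
from math import gcd


def factor_from_key_pair(n, e, d):
    """Factor n = p * q given one valid exponent pair (e, d)."""
    k = e * d - 1                  # a multiple of the order of every unit
    t, r = 0, k
    while r % 2 == 0:              # write k = 2^t * r with r odd
        r //= 2
        t += 1
    for g in range(2, n):
        common = gcd(g, n)
        if common != 1:            # g already shares a factor with n
            return tuple(sorted((common, n // common)))
        x = pow(g, r, n)
        for _ in range(t):
            y = x * x % n
            if y == 1 and x not in (1, n - 1):
                p = gcd(x - 1, n)  # x is a non-trivial square root of 1
                return tuple(sorted((p, n // p)))
            x = y
    raise ValueError("no factor found; (e, d) is probably not a valid pair")


# User i knows only n and his/her own pair (17, 2753).
p, q = factor_from_key_pair(3233, 17, 2753)
print(p, q)                                    # 53 61

# With p and q known, any other user's private exponent follows
# from that user's public exponent (7 is a hypothetical example).
other_d = pow(7, -1, (p - 1) * (q - 1))
print((7 * other_d) % ((p - 1) * (q - 1)))     # 1
```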
Considering the aforementioned functional behavior of the RSA
cryptosystem and its limitations, a robust system called
“Commutative Cryptography” has been developed in this work and
implemented with multiple distributed FPGA architectures. Commutative
RSA is implemented to ensure performance in real-time
hardware-assisted applications. Three distinct FPGA cores have been
considered, and the respective Commutative RSA algorithms have been
implemented on individual cores. The overall system development has
been carried out in sequence: a key generation phase, followed by
robust commutative encryption based on modified and parallel
Montgomery Multiplication [5-7], followed by commutative decryption
based on parallel Montgomery Multiplication. In this approach to key
generation, two individual random generators have been employed, with
two 32-bit entities and first-in first-out shift registers. The
pseudo-random data bits are verified for primality and then processed
for greatest common divisor estimation. In further steps, with the help
of the available variables, the encryption and decryption keys are
generated. The primary significance of this approach is that iterative
key generation and its allied overheads are minimized, and thus the
efficiency increases. The implementation of parallel Montgomery with
high radix multiplication makes this system highly robust and offers
good performance with increased security.
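The key generation flow outlined above (pseudo-random words, primality check, GCD check, key derivation) can be sketched in Python as follows. This is a software illustration under stated assumptions and not the RTL design of this work: the 32-bit Fibonacci LFSR with tap positions 32, 22, 2 and 1, the two seed values, the trial-division primality test and all function names are choices made here for the sake of a runnable example.

```python
from math import gcd

MASK32 = 0xFFFFFFFF


def lfsr32_step(state):
    """Shift a 32-bit Fibonacci LFSR left by one bit (taps 32, 22, 2, 1)."""
    bit = ((state >> 31) ^ (state >> 21) ^ (state >> 1) ^ state) & 1
    return ((state << 1) | bit) & MASK32


def is_prime(x):
    """Trial division; adequate for 32-bit candidates in a sketch."""
    if x < 2:
        return False
    if x % 2 == 0:
        return x == 2
    f = 3
    while f * f <= x:
        if x % f == 0:
            return False
        f += 2
    return True


def draw_prime(state):
    """Clock the LFSR 32 bits at a time until the word read out is prime."""
    while True:
        for _ in range(32):
            state = lfsr32_step(state)
        candidate = state | 1              # force the candidate to be odd
        if is_prime(candidate):
            return candidate, state


def generate_keys(seed_p, seed_q):
    """Derive ((e, n), (d, n)) from two independently seeded LFSRs."""
    p, _ = draw_prime(seed_p)
    q, state_q = draw_prime(seed_q)
    while q == p:                          # the two primes must differ
        q, state_q = draw_prime(state_q)
    n, phi = p * q, (p - 1) * (q - 1)
    e = 3
    while gcd(e, phi) != 1:                # GCD check on the public exponent
        e += 2
    d = pow(e, -1, phi)
    return (e, n), (d, n)


# Hypothetical non-zero seeds; a zero seed would lock the LFSR at zero.
(e, n), (d, _) = generate_keys(0xACE1ACE1, 0x1234ABCD)
message = 123456789
print(pow(pow(message, e, n), d, n) == message)   # True
```

An LFSR is a pseudo-random source rather than a cryptographically secure one, so the sketch only mirrors the structure of the flow and says nothing about the strength of the resulting keys.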
1.3.2 Why RSA?
In the traditional approach to public key cryptography, key
generation must be performed at each terminal, and thus with a higher
count of cores the overhead caused by key generation becomes too
disadvantageous. Therefore a system model is developed that can
noticeably reduce the key computation cost and thus enhance the
performance of the system. It is a matter of fact that RSA has
established itself as an optimum approach for public-key cryptography
[1, 2]. However, in certain scenarios such as multiuser communication,
distributed MIMO applications, heterogeneous multiprocessor systems on
chip and multi-DSP based communication cores, the normal RSA
implementation might not be very fruitful, and it remains unexplored
with recent and optimized encryption techniques. Hence, commutative
characteristics are taken into account: the order in which encryption
takes place does not affect the decryption process if it is done in the
same way, and security breaches are avoided. In MIMO applications,
encryption of the data takes place at every user terminal.
1.4 Motivations
With the growth in secure data communication requirements and in
Multiple Input Multiple Output (MIMO) transceiver based communication
systems, the need for a secure communication paradigm is increasing
day by day. Data security is one of the predominant issues in modern
multiple transceiver based communications. Public key cryptography
plays an important role in establishing security for MIMO based
applications, and RSA plays an important role within the public key
cryptography used in public infrastructure. A number of cryptosystems
reported earlier suffer from the limitation of linear cryptography,
which needs to be addressed along with a reduction in system overhead.
An approach called “Commutative” states that the order in which
encryption is performed does not influence the final result. In
addition, the security and authenticity of the participating MIMO
transceiver terminals also become critical in public infrastructure
based communication. These are the main motivating factors for
undertaking this work.
1.5 Research Objectives
Considering the need for a highly efficient and robust security
mechanism for multiple users in communication, this research work
proposes a robust and efficient security and authentication
mechanism. In this work, an RSA cryptographic core with commutative
behavior has been developed and implemented with multiple users. The
main objectives of this work may be summarized as follows:
1. To develop an efficient secured cryptosystem for multiple cores in
multi-party communication environment.
2. To develop a public key cryptographic scheme or algorithm
introducing Commutative Nature of RSA Encryption and
Decryption.
3. To develop an FPGA compatible CRSA algorithm using serial as
well as parallel Montgomery multiplication so as to enhance
processing efficiency and overall throughput.
4. To develop a CRSA cryptosystem and implement it with multiple
FPGA cores to ensure the justifiable performance of CRSA with
real time hardware assisted applications.
1.6 Methodology Adopted
In order to meet the objectives stated earlier, the following
methods are adopted:
1. Parameters and algorithms for data security have been identified for
Multiple Input Multiple Output applications.
2. The commutative nature of the RSA algorithm has been proved.
The performance of serial and parallel Montgomery
Multiplication algorithms on CRSA has been analysed.
3. Commutative cryptography core with key generation suiting FPGA
realization has been designed using VHDL.
4. Functionality has been verified using Modelsim.
5. The design has been synthesized using Xilinx ISE and Vivado
targeted on a Virtex FPGA.
1.7 Thesis Organization
The overall thesis organization is as follows:
Chapter 1 is an introduction to the research domain. In this chapter,
a brief discussion of cryptosystems and their significance, kinds of
cryptosystems, the RSA algorithm for public key cryptography
applications etc. is presented. This chapter also presents the
motivations for the research work, objectives, methodology and thesis
organization.
Chapter 2 presents the literature review for RSA algorithms and its
implementation for secure communication. In this chapter, reviews have
been made for RSA algorithms. Different architectures of Montgomery
multiplications with its implementation on FPGA have been discussed.
The theoretical background of the RSA cryptosystems, public key
cryptography, and various kinds of Montgomery multiplications of RSA
have been presented in Chapter 3. Implementations of Commutative
RSA using serial and parallel Montgomery multiplication algorithms are
also presented.
Chapter 4 discusses the key generation for Commutative RSA
using parallel Montgomery multiplication. A novel Algorithmic
development is also presented.
Chapter 5 presents the simulation results and their validation.
Performance evaluation of the developed system in terms of
execution time, memory occupancy, delay analysis with multiple
processors and varying frequencies etc. is presented.
Chapter 6 presents the Conclusion and the Scope for future work.
Chapter 2
Review of Literature
An introduction to the security of multiple transceiver terminals and
MIMO applications has been presented in Chapter 1. A literature
survey has been carried out in order to understand the prior
work on the RSA algorithm and its implementation on FPGA cores. In this
chapter, a detailed review is presented of Montgomery multiplication
implementations for enhancing the real-time performance of the RSA
algorithm.
Many different architectures have been proposed in the
technical literature for modular exponentiation and RSA implementation.
However, they often rely on specific technologies and thus a fair
comparison is difficult.
Hardware implementations have been proposed for the RSA
algorithm for public-key cryptography [8-11]. A bit-slice based
architecture was developed to implement the modular exponentiation
algorithm in hardware [8]. Bo Song et al. presented an RSA module for
a 2048-bit size [9], with a hardware algorithm for RSA
encryption/decryption based on Montgomery multiplication. Unlike the
architectures proposed in this work, these implementations are not
suitable for high throughput hardware. A programmable active memory
implementation of RSA, which combines Chinese remainders, star
chains, Hensel's odd division, carry-save representation, quotient
pipelining and asynchronous carry completion adders, is proposed in
[10]. Nakano K. et al. presented hardware algorithms for the modular
exponentiation $P^E \bmod M$ used in RSA encryption and decryption,
and implemented them on an FPGA [11]. Daniel Mesquita et al. proposed
an architecture to perform modular exponentiation using the
Montgomery Powering Ladder algorithm [12]. However, the authors
have not proved the commutative nature of the RSA algorithm and
have not applied it to multi-core applications, which is vital for the
effective security of multi-transceiver systems.
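For readers unfamiliar with the Montgomery Powering Ladder mentioned above, the following is a minimal Python sketch of the general technique. It is a textbook formulation written for illustration, not the architecture of Ref. [12]; the function name is an assumption of this example.

```python
def montgomery_ladder_pow(base, exponent, modulus):
    """Compute base**exponent mod modulus with the Montgomery powering ladder.

    Each exponent bit triggers exactly one multiplication and one squaring,
    whatever the bit value, so the operation sequence is regular.
    """
    if modulus == 1:
        return 0
    # Invariant maintained by the loop: r1 == r0 * base (mod modulus).
    r0, r1 = 1, base % modulus
    for i in reversed(range(exponent.bit_length())):
        if (exponent >> i) & 1:
            r0 = (r0 * r1) % modulus
            r1 = (r1 * r1) % modulus
        else:
            r1 = (r0 * r1) % modulus
            r0 = (r0 * r0) % modulus
    return r0


assert montgomery_ladder_pow(4, 13, 497) == pow(4, 13, 497)
```

The regular multiply-and-square pattern is the property usually cited as the reason for choosing the ladder over plain square-and-multiply.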
A very high speed RSA demodulator is proposed in [13]. Xuewen
Tan and Yunfei Li [13] proposed the Batch RSA-S1 Multi-Power RSA
(BS1PRSA) algorithm to improve the performance of RSA decryption by
combining the load transferring technique and the multi-prime
technique in the Batch RSA algorithm. The algorithm in Ref. [14] has
been implemented using a Xilinx XC6VLX240T FPGA chip, and the maximum
running frequency reaches 188 MHz. However, the authors have not
addressed encryption.
A software implementation of the RSA algorithm has been
proposed in [15]. Suli Wang and Ganlai Liu used a C++ class library to
develop an RSA encryption class library that realizes groupware
encapsulation on the 32-bit Windows platform. Wenjun Fan et al. have
proposed an implementation of the RSA algorithm using JCUDA and
Hadoop software [16], using the CUDA framework to realize the RSA
algorithm. However, they have not substantiated their claim by
presenting the processing speeds achieved. Further, a software
realization is not suitable for hardware implementation.
Public key cryptosystems were proposed in Ref. [17-19]. Ljupco
Kocarev et al. [17] proposed a public-key encryption algorithm
based on torus automorphisms. The authors generalized the RSA
algorithm by replacing powers with matrix powers, choosing a matrix
that defines a two-torus automorphism. A software implementation is
discussed, but hardware implementation is not mentioned. Montgomery
multiplication methods have been used for encrypting and signing digital
data in public-key cryptography by Koç C. K. et al. [18]. However, they
did not compare the Montgomery techniques with other modular
multiplication approaches. A software implementation of RSA using C++
is proposed in Ref. [19] by Xin Zhou and Xiaofei Tang.
Jiang Huiping et al. have proposed an improved architecture for an
RSA coprocessor resistant to power analysis [20]. The Shadow
technology was introduced into the RSA algorithm in order to improve,
in theory, resistance to differential power analysis. Their results
showed that it takes about 498 ms to encrypt a 1024-bit plaintext when
operating at 5 MHz. Their proposal is applicable to the security of
wireless sensor networks only and not to multi-core applications.
Nagar S. A. and Saad Alshamma have implemented the RSA
algorithm with the aim of increasing speed during data
transmission among different communication networks and the Internet
[21]. They generated the keys offline and stored them in different
databases to increase the speed of RSA. Key generation and storage
were developed using the C# language. However, a software realization
is not suitable for hardware implementation.
RSA algorithm with counter-measures against different attacks
was proposed in Ref. [22, 23]. A new Chinese Remainder Theorem
based RSA digital signature with counter-measures to hardware fault
attacks was proposed by Sining Liu et al. [22]. These were implemented
for the security of smart card data. Several computational issues as well
as the analysis of attacks are discussed in Ref. [23].
An algorithm to enhance security in RSA is proposed by the
authors of Ref. [24, 25]. Al-Hamami et al. proposed an enhancement to
the RSA algorithm by using an additional third prime number in the
computation of the public and private keys [25]. However, computational
complexity may increase because of the third prime number.
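The effect of adding a third prime can be illustrated with a toy Python example. This is a generic multi-prime RSA sketch, not the specific scheme of Ref. [25]; the function name, the small primes and the exponent are assumed values chosen only so the arithmetic is easy to follow.

```python
import math


def three_prime_keys(p, q, r, e):
    """Return (n, e, d) for an RSA modulus built from three distinct primes."""
    n = p * q * r
    phi = (p - 1) * (q - 1) * (r - 1)
    if math.gcd(e, phi) != 1:
        raise ValueError("e must be co-prime to phi")
    d = pow(e, -1, phi)  # modular inverse; requires Python 3.8+
    return n, e, d


# Toy primes for illustration only; real primes are hundreds of digits long.
n, e, d = three_prime_keys(1009, 1013, 1019, 65537)

message = 123456789
cipher = pow(message, e, n)
assert pow(cipher, d, n) == message
```

Key generation now needs three primality searches instead of two, which is one source of the extra computational cost noted above.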
Dahui Hu and Zhiguo Du discussed a network authentication
protocol called Kerberos [26], which provides a secure authentication
service based on a reliable third party. Being a software realization,
this, however, is not suitable for hardware implementation.
Doroeviae G. et al. proposed optimization techniques for the
modular reduction procedure of RSA algorithm [27]. They realized RSA
algorithm using TMS320C54x assembler of Texas Instruments. A
reciprocal value method and Montgomery's procedure were considered.
However, authors have proposed only optimization techniques for
modular reduction procedure.
Na Qi et al. presented RSA password system and applied it for
mobile phone short message encryption system [28]. It is applicable
only for the mobile terminal equipment. Perovic N.S. et al. proposed
RSA crypto algorithm with a 1024 bits long key [29]. Iana G.V. et al.
presented an implementation of the RSA algorithm as a prototype
programmable structure [30]. Xilinx Spartan3 was integrated on an
ASIC structure. The purpose was to optimize the Field Programmable
Gate Array area used. Only encryption technique was implemented.
Some new structures that can implement the RSA cryptographic
algorithm were presented by the authors of Ref. [31-33]. These
structures were designed around a modified Montgomery modular
multiplier in which the multiplication and modular reduction
operations are carried out in parallel rather than interleaved as in
the traditional Montgomery multiplier. The authors implemented and
compared 8 different structures, each referred to as a Struct. Struct 4
requires 7.55 ms to carry out the full modular exponentiation
operation, while Struct 7 carries out the same operation in 6.78 ms.
However, the area used by Struct 7 is almost twice that of Struct 4.
Montgomery modular multiplication on reconfigurable hardware
was presented in [34-38]. Hariri A. and Reyhani-Masoleh proposed
bit-serial and bit-parallel multipliers [34], and Miguel
Morales-Sandoval and Arturo Díaz-Pérez proposed a scalable GF(p)
Montgomery multiplier based on a digit-digit computation approach
[35]. Perin G. et al. compared two FPGA Montgomery modular
multiplication architectures, a fully systolic array and a parallel
implementation [36]. The fully systolic array implementation takes
3.23 ms to run a 1024-bit RSA decryption, and the parallel
architecture executes the same operation in 6 ms. Ali Ziya Alkar et
al. proposed bit-level systolic arrays for modular multiplication and
squaring that increase the speed of multipliers by 20% [37]. However,
they have not implemented an entire cryptosystem.
Galois field arithmetic has been proposed for implementing
Montgomery's algorithm [39, 40], which finds wide use in cryptography.
Poolakkaparambil M. et al. [39] and Bajard, J. et al. [40] implemented
Montgomery multipliers for elliptic curve cryptographic applications.
However, they have not implemented them for RSA.
Scalable Montgomery algorithms were proposed by the authors of
Ref. [41, 42]. Chiou-Yng Lee et al. presented a scalable and systolic
Montgomery's algorithm in $GF(2^m)$ using the Hankel matrix-vector
representation [41]. The proposed architectures have the features of
regularity, modularity, and local interconnectivity, and are well
suited to VLSI implementation. However, they were not implemented on
any hardware. Sanu M.O. et al. proposed four different architectures
for modular multiplication for speeding up application-specific
crypto-processors [43]. They have not implemented the RSA algorithm.
Parallelization of the Montgomery multiplication was attempted
using software by Zhimin Chen et al. [44, 45]. A scalable parallel
programming scheme to map the Montgomery multiplication to a
general multicore architecture was presented. However, their scheme,
referred to as pSHS, trades some throughput for latency.
The authors of Ref. [46-49] have proposed high-radix scalable
Montgomery multipliers. Amberg P. et al. proposed an algorithm for a
parallel high-radix scalable Montgomery multiplier with trade-offs for
multiple hardware implementations [46]. They presented processing
element designs exploring combinations of radices 2, 4, and 8, right
vs. left shifting, and Booth encoding. However, left shifting adds
design complexity and places strict constraints on the word length and
the number of processing elements in the pipeline. Thomas Blum and
Christof Paar proposed arithmetic architectures optimized
for modern field programmable gate arrays [48, 49]. The proposed
architectures perform modular exponentiation with very long integers.
Parallel implementations of Montgomery multiplication have been
proposed earlier [50, 51]. Jun Han et al. proposed a parallel
implementation of Montgomery multiplication with improved task
partitioning for a multicore platform with area-efficient processors
[50]. Selçuk and Erkay proposed a new parallel Montgomery
multiplication algorithm, which can be adopted for parallel
realization using general-purpose multi-core processors [51].
Miladinovic et al. presented Montgomery multipliers, which can perform
modular multiplication of two integers without trial division [52].
The researchers described the design and FPGA implementation of two
architectures. Ciaran McIvor et al. proposed fast Montgomery
multipliers [53].
New methods to increase the speed of Montgomery
Multiplication were presented by the authors of Ref. [54-56]. Neto
J.C. et al. proposed a new approach to speed up Montgomery
Multiplication by distributing the multiplier operand bits into
partitions that can be processed in parallel [54, 55]. Batina and
Muurling proposed an efficient way of implementing Montgomery
multiplication in hardware for an arbitrary bit length [56]. The scope
of that work is limited to the application of the sequential radix-2
Montgomery Multiplication algorithm.
The authors of Ref. [57, 59] presented novel multipliers using
irreducible polynomials for Montgomery multiplication defined on binary
fields $GF(2^m)$. They used a linear feedback shift register (LFSR) as
the main module of the presented architecture. Huapeng Wu presented
a low complexity Montgomery multiplier in $GF(2^m)$ [57]. These
multipliers are well suited to VLSI systems and applicable to Elliptic
Curve Cryptography. However, their application to the RSA algorithm
was not proposed.
Talapatra S. et al. presented a unified digit-serial systolic
multiplication architecture for all-one polynomials and trinomials over
$GF(2^m)$ for efficient implementation of the Montgomery Multiplication
algorithm suitable for cryptosystems [60, 61]. Michalski A. et al. proposed
an improvement to a limited resource Montgomery multiplier design [62,
63]. Their design was scaled to utilize available FPGA multipliers, CLB
logic and frequencies of operation.
Implementation of the RSA algorithm on FPGA was proposed in
[64-68]. Chu A. et al. analysed an RSA implementation on a
reconfigurable platform [64]. A Montgomery modular multiplier replaced
the expensive multiplication and modular operations, and
reconfigurable hardware support was provided for the Montgomery
modular multiplier and Montgomery modular exponentiation. Mazzeo et
al. proposed an FPGA-based implementation of a serial RSA processor
[66]. However, RSA was implemented for embedded systems and not for
multi-core systems as in the present work.
Liang Wang and Yonggui Zhang introduced a new personal
information protection approach based on RSA cryptography [68]. Using
this approach, personal information can be transformed from plain text
into cipher text. Customer representatives were able to contact their
clients without seeing their private information. They have used
programmable active memory based on a programmable gate array.
Ming-Der Shieh et al. proposed VLSI implementation of the
modular exponentiation, which was based on the asynchronous
behaviour of the modular multiplication [69]. The basic idea is to
partition the operand (multiplier) into several equal-sized segments and
then to perform the multiplication and residue calculation of each
segment in a micro pipelining fashion.
Manaf N. V. et al. proposed a method to increase the speed of
arithmetic circuits in the Residue Number System (RNS) [70]. Hariri A.
et al. considered the finite field multiplication used in elliptic curve
cryptography and designed concurrent error detection circuits [71, 72].
Hariri and Reyhani proposed concurrent error detection in the
Montgomery multiplication over binary extension fields [72]. Error
detection schemes for two Montgomery multiplication architectures
were presented in that work.
Miaoqing Huang et al. proposed two Montgomery modular
multiplication architectures [73].
These two architectures were based on pre-computing partial results
using two possible assumptions regarding the most significant bit of the
previous word. McLoone M. et al. proposed novel hardware architecture
implemented using coarsely integrated hybrid scanning (CIHS)
algorithm to perform Montgomery modular multiplication [74].
Novel CSA architecture for Montgomery multiplication was
presented by the authors of Ref. [75, 76]. Garg R. et al. proposed
Montgomery modular multiplication technique that employs multi-bit
shifting and carry-save addition to perform long-integer arithmetic [75].
As per the claims of the authors, the gain in data throughput for
Montgomery multiplication is approximately 45.49% (for 1024-bit length)
and the hardware reduction is 24.27% of the traditional methods.
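The carry-save addition that these architectures rely on can be sketched in a few lines of Python. This shows only the generic carry-save principle, not the architecture of Ref. [75]; the function names are assumptions of this example.

```python
def carry_save_add(a, b, c):
    """Reduce three non-negative integers to a (sum, carry) pair.

    No carry ripples between bit positions: each output bit depends only
    on the three input bits of the same column.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (b & c) | (a & c)) << 1
    return partial_sum, carry


def add_many(values):
    """Sum a list of integers with a single carry-propagating addition."""
    s, c = 0, 0
    for v in values:
        s, c = carry_save_add(s, c, v)
    return s + c  # the only ordinary (carry-propagating) addition


assert sum(carry_save_add(5, 6, 7)) == 18
assert add_many([13, 250, 7, 1024]) == 13 + 250 + 7 + 1024
```

Keeping the running total as a (sum, carry) pair is what lets long-integer accumulation avoid a full-width carry chain in every step.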
Modified radix-4 modular multiplication based on Booth's
multiplication technique was presented in Ref. [77, 78]. The authors
used a Carry Save Adder to avoid carry propagation. Bayhan D. et al.
analysed and compared the Montgomery multiplication algorithms for
their power dissipation on FPGA devices [79]. Among various
architectures proposed for Montgomery multiplication, parallel,
sequential and systolic variants were considered [80].
Sever R. et al. presented a high speed, non-pipelined FPGA
implementation of the Rijndael algorithm [81]. They have implemented
both the encryption and the decryption algorithms of Rijndael on the
same FPGA. Kshirsagar R. V. et al. proposed a high data throughput
AES hardware architecture by partitioning into sub-blocks of repeated
AES modules [82]. However, they have not applied their architecture for
Public Key Cryptosystem.
Sushanta Kumar et al. [83] presented the architecture and modeling
of RSA public key encryption/decryption systems, supporting multiple
key sizes of 128 bits, 256 bits, and 512 bits. Tanimura, K. et al.
proposed a scalable unified dual-radix architecture for Montgomery
multiplication in $GF(p)$ and $GF(2^n)$ [84]. Schinianakis D. et al.
described a methodology for incorporating Polynomial Residue
Arithmetic in the Montgomery multiplication algorithm for polynomials
in $GF(2^n)$ [85]. Talapatra S. et al. defined a scalable VLSI
multiplication architecture based on the Montgomery multiplication
(MM) algorithm for elliptic curve cryptography (ECC) over $GF(p^m)$
[86]. Satzoda R.K. et al. proposed a new scalable and pipelined
Montgomery multiplier architecture that unifies the two important
finite fields, $GF(p)$ and $GF(2^m)$ [87].
Fournaris A.P. presented what is described as the first complete
attempt at an efficient design of a fault and simple power attack
resistant RSA [88]. This algorithm was based on Montgomery modular
multiplication. A faster multiplication scheme was adopted by
Thapliyal H. et al. to improve the efficiency of public key encryption
systems such as RSA and ECC [89]; modified Montgomery multiplications
and circuit architectures were presented in that paper. Mohammadi M.
et al. proposed Montgomery modular multiplication based on the Residue
Number System (RNS) [90]. Hamming weight based and reverse converter
based moduli were implemented. The proposed architecture encompasses a
fast RNS to RNS converter and the choice of appropriate moduli sets.
Haining Fan et al. presented a new parallel multiplier for
irreducible trinomials [91]. Koç C.K. et al. explained Montgomery
multiplication methods [92], describing several high-speed,
space-efficient algorithms for computing the Montgomery product
MonPro(a, b). McIvor C. et al. proposed new FPGA architectures
for the ordinary Montgomery multiplication algorithm and the Finely
Integrated Operand Scanning modular multiplication algorithm [93].
Venkatasubramani V. R. and S. Rajaram [94] proposed modified
Montgomery modular multiplication algorithms that reduce the number
of computational operations, such as additions and memory
reads and writes, involved in the existing algorithms, thereby saving
considerable execution time and area. To summarize, carry save adders
are used in Ref. [89], Finely Integrated Operand Scanning modular
multiplication algorithms are used in Ref. [93], and the number of
computational operations is reduced in Ref. [94]. McIvor C. et al.
[95, 96] presented modified Montgomery multiplication and associated
RSA modular exponentiation algorithms and circuit architectures. These
modified multipliers use carry save adders (CSAs) to perform large
word length additions.
Fournaris A.P. et al. [97] proposed two finite field multiplier
architectures and their VLSI implementations using the Montgomery
Multiplication Algorithm. The first, called the Folded architecture,
is optimized to minimize the silicon area (gate count), and the
second, which uses pipelining, is optimized to reduce the
multiplication time delay. Both architectures are measured in terms of
gate count (or chip area) and multiplication time delay. The design
and implementation of a real-time FPGA based network security
application was proposed by Paul R. et al. [98]. They used the RSA
encryption and decryption algorithm for the implementation, as
security is one of the most important needs for data communication.
Pellegrini A. et al. developed a theoretical attack on the RSA
signature algorithm and realized it using an FPGA [99]. Kamal et al.
improved the NTRU encryption algorithm, known as NTRUEncrypt [100].
Marcelo E. Kaihara and Naofumi Takagi proposed a new fast
method for computing modular multiplication [101, 102]. The calculation
is performed using a representation of residue classes modulo M that
enables splitting of the multiplier into two parts. These parts are
then processed separately in parallel, potentially doubling the
calculation speed. The upper part and the lower part of the multiplier
are processed using the interleaved modular multiplication algorithm
and the Montgomery algorithm, respectively. This technique can also
easily be adopted for operation in the binary extension field
$GF(2^m)$.
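The splitting idea described above can be sketched in Python as follows. This is a sequential, bit-level illustration written under assumptions, not the hardware design of Ref. [101, 102]: the function names and the split position `k` are hypothetical, and the two loops that would run concurrently in hardware are simply called one after the other. The result is $x \cdot y \cdot 2^{-k} \bmod m$ for an odd modulus $m$.

```python
def interleaved_modmul(x, y, m, bits):
    """Classical MSB-first interleaved modular multiplication: x*y mod m."""
    acc = 0
    for i in reversed(range(bits)):
        acc = (2 * acc) % m
        if (y >> i) & 1:
            acc = (acc + x) % m
    return acc


def montgomery_modmul(x, y, m, bits):
    """Radix-2 Montgomery multiplication: x*y*2**(-bits) mod m, for odd m."""
    acc = 0
    for i in range(bits):
        if (y >> i) & 1:
            acc += x
        if acc & 1:
            acc += m  # make the accumulator even so the halving is exact
        acc >>= 1
    return acc % m


def bipartite_modmul(x, y, m, k):
    """Return x*y*2**(-k) mod m by splitting the multiplier y at bit k."""
    y_high, y_low = y >> k, y & ((1 << k) - 1)
    # The two partial products are independent of each other.
    high_part = interleaved_modmul(x, y_high, m, max(y_high.bit_length(), 1))
    low_part = montgomery_modmul(x, y_low, m, k)
    return (high_part + low_part) % m


m, x, y, k = 1000003, 123456, 654321, 10
expected = (x * y * pow(2, -k, m)) % m  # requires Python 3.8+
assert bipartite_modmul(x, y, m, k) == expected
```

Because $y = y_{high} \cdot 2^k + y_{low}$, the product $x \cdot y \cdot 2^{-k}$ equals $x \cdot y_{high} + x \cdot y_{low} \cdot 2^{-k}$ modulo $m$, which is why each half can use a different algorithm.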
Kazuo Sakiyama et al. presented a new modular multiplication
algorithm that allows modular multiplications to be implemented
efficiently [103, 104]. It is a systematic approach to performing
modular multiplication that maximizes the level of parallelism. The
proposed algorithm combines several existing algorithms, namely a
classical modular multiplication built on Barrett reduction, the
modular multiplication with Montgomery reduction and the Karatsuba
multiplication algorithm, to reduce the computational complexity and
increase the potential for parallel processing. This algorithm is
suitable for both software and hardware implementations in a
multiprocessor environment. A reconfigurable curve-based
crypto-processor that accelerates scalar multiplication in Elliptic
Curve Cryptography (ECC) and Hyper Elliptic Curve Cryptography (HECC)
of genus 2 over $GF(2^n)$ was presented by Kazuo Sakiyama et al. [104].
N. Costigan and P. Schwabe proposed fast elliptic curve cryptography
[105].
Junfeng Fan et al. investigated efficient software implementations
of the Montgomery modular multiplication algorithm ported to a
multi-core system [106]. A combined hardware and software technique
was used to find an effective system architecture and instruction
scheduling method. Cetin Kaya Koc et al. proposed several
Montgomery multiplication algorithms [107, 108]. These algorithms have
been implemented in C and also using assembly language. Colin D.
Walter proposed an optimal upper bound for the number of iterations in
Montgomery Modular multiplication [109].
A new version of Montgomery's algorithm for modular
multiplication of large integers and its implementation in hardware was
presented by Viktor Bunimov et al. [110]. The algorithm is superior to
Montgomery's original method by a factor of 2 with respect to
both chip area and latency. The new method has a simple structure and
requires a small amount of pre-computation and storage in order to
reduce the number of necessary additions by a factor of 2. A $GF(2^n)$
Montgomery multiplier using polynomial arithmetic was presented by
Skavantzos et al. [111]. A fast pipelined RSA architecture based on the
Montgomery algorithm was proposed by Iput Heri et al. [112]. Digital
signatures using RSA and other public key cryptosystems were
presented in [113] by Dorothy E. Denning et al. RSA using 3D graphics
hardware was proposed by A. Moss et al. [114].
Some of the work presented earlier has come up with good
proposals for data security. However, none of them can be considered
the optimum approach for data security. Predominant issues such as
one-way encryption and high computational complexity still exist in
the majority of existing work. To achieve a better, optimized system
for multiple input and multiple output applications, the system model
needs to be enhanced. Hence, in the present work, a robust Commutative
RSA Algorithm and Architecture has been developed so that it may be
realized efficiently in hardware. Serial and parallel Montgomery
Multiplication algorithms were developed. To reduce the key exchange
overheads in public key cryptosystems, a CRSA algorithm with key
generation has been designed.
Chapter 3
Development of Commutative RSA Algorithm
A detailed literature survey was carried out and the findings were
presented in Chapter 2. In this chapter, RSA algorithm, its commutative
nature, serial and parallel Montgomery multipliers are presented.
3.1 Introduction
A Commutative RSA (CRSA) algorithm is designed, and the developed
module has been incorporated into multiple FPGA cores. In order to
implement the developed CRSA algorithm on an FPGA core, two different
models, based on serial Montgomery and parallel Montgomery
multiplication, have been developed. Initially, the serial Montgomery
based CRSA algorithm is implemented; in later stages, the parallelized
Montgomery multiplier is employed, as explained in detail in the
following sections. The parallelized Montgomery implementation
exhibited better results than the serial one, and therefore, in the
final stage, a parallelized Montgomery architecture with higher radix
multiplication is used to reduce the execution time of CRSA with
multiple FPGA cores [115]. This chapter mainly presents the system
development and the algorithmic development for the Commutative RSA
cryptosystem. The detailed algorithm development and its
implementation are discussed in the following sections.
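As background for the serial variant, the following Python sketch shows textbook bit-serial (radix-2) Montgomery multiplication and a modular exponentiation built on it. It is a software illustration under assumptions, not the VHDL design of this thesis: the function names are hypothetical, the modulus must be odd, and one multiplier bit is consumed per loop iteration, which is the behavior that the parallel and higher-radix variants improve on.

```python
def mont_mul(a, b, n, k):
    """Bit-serial Montgomery product a*b*2**(-k) mod n.

    Assumes n is odd and a, b < n < 2**k.
    """
    acc = 0
    for i in range(k):
        acc += ((a >> i) & 1) * b
        if acc & 1:
            acc += n  # make the accumulator even so the halving is exact
        acc >>= 1
    return acc - n if acc >= n else acc


def mont_exp(m, e, n):
    """Compute m**e mod n using only Montgomery products in the main loop."""
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)              # R^2 mod n with R = 2**k, precomputed
    m_bar = mont_mul(m % n, r2, n, k)  # m * R mod n (Montgomery domain)
    x_bar = mont_mul(1, r2, n, k)      # 1 * R mod n
    for i in reversed(range(e.bit_length())):
        x_bar = mont_mul(x_bar, x_bar, n, k)
        if (e >> i) & 1:
            x_bar = mont_mul(x_bar, m_bar, n, k)
    return mont_mul(x_bar, 1, n, k)    # leave the Montgomery domain


assert mont_exp(4, 13, 497) == pow(4, 13, 497)
```

The division by $R = 2^k$ reduces to right shifts, which is the reason Montgomery multiplication maps well onto hardware.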
3.2 System Model
RSA is considered an efficient and optimized solution for
public-key cryptography [1]. In most existing systems, data
authentication or security is accomplished by a key exchange approach,
which increases the key exchange overheads. In MIMO or multiple
transceiver systems, encryption and decryption are required at every
transceiver terminal. If a general RSA approach is applied, data
authentication and security could be violated. Therefore, to achieve
data security with individual encryption and decryption without
affecting integrity, a modified RSA, known as Commutative RSA, has
been developed. This chapter discusses the initial stage of the system
development, specifically a robust and optimized system architecture
for implementing the Commutative RSA algorithm for data authentication
among MIMO terminals or multiple FPGA cores.
The first step was to develop an RSA algorithm with commutative
behavior. Commutative behavior means that the order in which the
encryption is done does not affect the result, provided the decryption
is accomplished in the reverse order. The advantage of CRSA is that
each user generates his/her own key, so there are no key exchange
overheads. The mathematically formulated scheme for RSA has been
converted into unitary algorithms, the overall system model has been
enriched with serial as well as parallel Montgomery multipliers, and
the complete system model has been realized with MIMO transceivers or
distributed FPGA cores. The following sections present the
implementation of CRSA with serial and parallel Montgomery
multiplication using radix 2, followed by the final parallelized
Montgomery multiplication based CRSA implementation, together with the
algorithmic development, its enhancements and its implementation on
distributed FPGA cores.
3.3 Commutative RSA
The primary objective of this thesis is to provide an efficient and
robust authentication or security system for MIMO transceiver based
data communication applications. For such applications, key
management and hardware compatibility become prime concerns that must
be addressed to provide an optimum solution for secure data
communication. The limitation known as “one-way encryption” might
cause extra overheads on the system function in multiple user
scenarios. Therefore, commutative behavior might play a significant
role in system performance optimization and overhead reduction for
distributed cores. Hence, in the initial phase of this research work,
the commutative RSA has been implemented.
A secure plane is realizable only if the data communicated over the
plane is secured and cannot be colluded. A cryptographic algorithm is
generally preferred for this purpose; hence the proposed Secure Multi
FPGA Communication Protocol (SMFCP) adopts the commutative RSA
algorithm. The SMFCP considers two prime numbers $p$ and $q$ that are
initialized amongst all the group members. Let $A$ and $B$ represent
the group members required to connect over the secure plane. In order
to compute the encryption and decryption key pairs of the commutative
RSA algorithm, the parameters $\phi_{CRSA}$ and $N_{CRSA}$ are
computed using the following:
$N_{CRSA} = p \times q$ (3.1)
$\phi_{CRSA} = (p - 1) \times (q - 1)$ (3.2)
The encryption key pairs of A and B, represented as
$(N^{A}_{CRSA}, E^{A}_{CRSA})$ and $(N^{B}_{CRSA}, E^{B}_{CRSA})$
respectively, need to be computed. $E_{CRSA}$ is computed by randomly
selecting a number that is co-prime to $\phi_{CRSA}$, or in other terms
$\gcd(E_{CRSA}, \phi_{CRSA}) = 1$ (3.3)
where $\gcd(x, y)$ represents the greatest common divisor function
between two variables “x” and “y”.
The decryption key pairs of A and B are represented by
$(N^{A}_{CRSA}, D^{A}_{CRSA})$ and $(N^{B}_{CRSA}, D^{B}_{CRSA})$;
$D_{CRSA}$ is computed based on the following equation:
$D_{CRSA} = E_{CRSA}^{-1} \bmod \phi_{CRSA}$ (3.4)
where $D^{A}_{CRSA}$ and $D^{B}_{CRSA}$ are the commutative RSA
decryption keys of user A and user B respectively.
Let $Enc_X$ represent the encrypted data of plain text “X”. The
encryption operation is defined as follows:
$Enc_X = X^{E_{CRSA}} \bmod N_{CRSA}$ (3.5)
The commutative RSA decryption operation on the encrypted data
$Enc_X$ is defined as
$X = (Enc_X)^{D_{CRSA}} \bmod N_{CRSA}$ (3.6)
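The key set-up and the encryption and decryption operations of equations (3.1) to (3.6) can be illustrated with a minimal Python sketch. The primes, exponents and function names below are illustrative assumptions chosen only to keep the numbers small; they are not the 512-bit values or the VHDL blocks of this work.

```python
from math import gcd

# Toy shared primes (assumption: far too small for real use).
P, Q = 61, 53
N_CRSA = P * Q                      # equation (3.1)
PHI_CRSA = (P - 1) * (Q - 1)        # equation (3.2)

def make_key_pair(e_candidate):
    """Return (E, D) for one user from a candidate encryption exponent."""
    if gcd(e_candidate, PHI_CRSA) != 1:        # equation (3.3)
        raise ValueError("exponent is not co-prime to phi")
    d = pow(e_candidate, -1, PHI_CRSA)         # equation (3.4), Python 3.8+
    return e_candidate, d

def crsa_encrypt(x, e):
    """Equation (3.5): Enc_X = X^E mod N."""
    return pow(x, e, N_CRSA)

def crsa_decrypt(enc_x, d):
    """Equation (3.6): X = Enc_X^D mod N."""
    return pow(enc_x, d, N_CRSA)

# Each user picks an exponent independently over the shared modulus.
E_A, D_A = make_key_pair(17)
E_B, D_B = make_key_pair(7)

X = 1234
assert crsa_decrypt(crsa_encrypt(X, E_A), D_A) == X
assert crsa_decrypt(crsa_encrypt(X, E_B), D_B) == X
```

Both users operate over the same modulus, which is what later allows their encryptions to be applied in either order.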
3.4 Commutative Nature of RSA Algorithm
After developing the commutative RSA algorithm, its function must be
verified for real time data security. In this section, the proof of
the commutative property of CRSA is presented. For simplicity, only
two users, “A” and “B”, have been considered. The commutative property
of the RSA algorithm employed in multiple input multiple output
applications is proved if data “X” encrypted by user “A” first and
then by user “B” delivers the same result when the order of
encryption is changed.
(3.7)
(3.8)
(3.9)
As it can be concluded that
(3.10)
(3.11)
Thus commutative nature of RSA is justified.
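One way to write out the argument is sketched below. It assumes that both users share the modulus $N_{CRSA}$ of Section 3.3; the shorthand $Enc_A$ and $Enc_B$ for encryption with the exponents $E^{A}_{CRSA}$ and $E^{B}_{CRSA}$ is introduced here for illustration only.

```latex
\begin{align*}
Enc_B\big(Enc_A(X)\big)
  &= \big(X^{E^{A}_{CRSA}} \bmod N_{CRSA}\big)^{E^{B}_{CRSA}} \bmod N_{CRSA}
   = X^{E^{A}_{CRSA}\, E^{B}_{CRSA}} \bmod N_{CRSA} \\
Enc_A\big(Enc_B(X)\big)
  &= \big(X^{E^{B}_{CRSA}} \bmod N_{CRSA}\big)^{E^{A}_{CRSA}} \bmod N_{CRSA}
   = X^{E^{B}_{CRSA}\, E^{A}_{CRSA}} \bmod N_{CRSA} \\
E^{A}_{CRSA}\, E^{B}_{CRSA} = E^{B}_{CRSA}\, E^{A}_{CRSA}
  &\;\Longrightarrow\;
  Enc_B\big(Enc_A(X)\big) = Enc_A\big(Enc_B(X)\big)
\end{align*}
```

The same reasoning applies to the decryption exponents, so the two decryptions can also be applied in either order.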
3.5 Commutative RSA Implementation with Serial and Parallel
Montgomery Multiplication
The RSA algorithm relies on a large number of sequential
multiplication operations, and this becomes more demanding if the
system has to deal with multiple users or multiple MIMO transceivers.
Therefore, a system that functions with minimum overheads and
computational cost is desirable. In order to enhance the system
performance, Montgomery multiplication is commonly advocated.
Considering the prime objective of this research work, for MIMO or
multiple transceiver based communication systems, the enhanced RSA
called the Commutative RSA cryptography core has been developed and
realized on multiple FPGA devices. In order to ensure optimum
performance, a streamlined system architecture based on Radix-2
Montgomery modular multiplication has been developed. Such
implementations reduce memory occupancy and improve speed. A brief
description of the implemented approaches is presented in the
following sections.
3.5.1 Modular Exponentiation
Modular exponentiation can be reduced to a series of modular
multiplication and squaring operations by means of the square and
multiply algorithm. The algorithm scans the bits of the exponent from
left (most significant bit) to right (least significant bit). In
every iteration, i.e., for every exponent bit, the current result is
squared. If and only if the currently scanned exponent bit has the
value “1”, the squaring is followed by a multiplication of the
current result by “A”. This algorithm is represented in pseudo code
in Figure 3.1.
Algorithm 3.1
Input : A, b, n
Output : C = A^b (mod n)
Let “b” contain “k” bits, b_(k-1) ... b_0
If b_(k-1) = 1,
then C = A;
else C = 1;
For i = k-2 down to 0
C = C x C (mod n);
If b_i = 1 then
C = C x A (mod n)
Return C
Figure 3.1: Square and Multiply Algorithm
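The left-to-right scan of Figure 3.1 can be sketched in Python as follows. The function name is illustrative, and the sketch assumes an exponent of at least 1 so that the most significant scanned bit is always “1”.

```python
def square_and_multiply(a, b, n):
    """Left-to-right binary exponentiation: returns a**b mod n for b >= 1."""
    if b < 1:
        raise ValueError("this sketch assumes an exponent b >= 1")
    bits = bin(b)[2:]            # exponent bits, most significant first
    c = a % n                    # leading bit is 1, so C starts at A
    for bit in bits[1:]:         # remaining bits, i = k-2 down to 0
        c = (c * c) % n          # square for every exponent bit
        if bit == "1":
            c = (c * a) % n      # multiply only when the bit is 1
    return c
```

For a k-bit exponent the loop performs k-1 squarings and one additional multiplication for every “1” bit after the leading one.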
3.5.2 Modular Multiplication
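The modular multiplication problem is defined as the computation
of P = [(A × B) mod n], where “A”, “B” and “n” are integers. It is
assumed that “A” and “B” are non-negative integers with 0 ≤ A, B < n.
Let B_i be the “ith” bit of B and let “k” be the number of bits of
“n”, with “n” odd. The algorithm below interleaves the accumulation
of partial products with a halving step. Because of the “k” halvings,
it returns the product scaled by 2^(-k), which is the Montgomery form
discussed in Section 3.5.3. The algorithm is stated as follows:
Algorithm 3.2
Input : A, B, n
Output : M = (A x B x 2^(-k)) mod n
M = 0;
For i = 0 to k-1
M = M + (A x B_i)
If M_0 = 0
M = M/2;
Else
M = (M+n)/2;
If M ≥ n
M = M - n;
Return M;
Figure 3.2: Algorithm for Modular Multiplication.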
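A bit-serial loop of the kind shown in Figure 3.2 can be sketched in Python as below. The sketch rests on stated assumptions: the modulus is odd, the accumulator is halved directly when it is even and after adding “n” when it is odd, and a single conditional subtraction is applied at the end. Under these assumptions the loop returns A × B × 2^(-k) mod n rather than the plain product. The function name is illustrative.

```python
def mont_mul_bitserial(a, b, n, k):
    """Return a*b*2**(-k) mod n for odd n with 0 <= a, b < n < 2**k."""
    if n % 2 == 0:
        raise ValueError("the modulus must be odd")
    m = 0
    for i in range(k):
        b_i = (b >> i) & 1       # i-th bit of B, least significant first
        m += a * b_i             # accumulate the partial product
        if m & 1:                # odd accumulator: add n to make it even
            m += n
        m >>= 1                  # exact division by 2
    if m >= n:                   # the accumulator stays below 2n throughout
        m -= n
    return m
```

The result can be compared against a direct computation such as `(a * b * pow(2, -k, n)) % n`, which uses the modular inverse available in Python 3.8 and later.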
3.5.3 Montgomery Multiplication Algorithm
The basic operation of the RSA algorithm is modular exponentiation
on large integers, i.e. “B = A^E (mod N)”. This computation is used
for encryption, decryption and digital signatures. The security level
of an RSA cryptosystem depends on the length of the modulus “N”. A
modulus of at least 768 bits is recommended. All operands involved in
the computation of modular exponentiation normally have the same size
as the modulus.
For computing “A^E (mod N)”, all existing techniques reduce the
modular exponentiation to a sequence of modular multiplications.
However, modular multiplication is itself a complex arithmetic
operation, and several algorithms have been proposed for implementing
it efficiently. Blakley's method [5] and Montgomery's method [6] are
the most studied ones and are suitable for practical hardware
implementation [7]. Both methods perform the modular reduction during
the multiplication process, and no division operation is required at
any point. However, Blakley's method needs a comparison between two
large integers at each step of the modular multiplication process,
whereas Montgomery's method does not require any comparison. This is
achieved by representing the operands as residue classes modulo N.
Montgomery's technique therefore requires pre-processing and
post-processing steps to convert the numbers to and from the residue
based representation. The cost of these steps is negligible when many
consecutive modular multiplications are to be executed, as in the
case of RSA. This is the reason why Montgomery's method is considered
the most efficient algorithm for implementing RSA operations.
Depending upon the number “r” used as the radix for the
representation of numbers, there are several versions of Montgomery's
algorithm. In hardware implementations, “r” is always a power of 2.
The Montgomery algorithm [6] uses simple divisions by a power of two
instead of the divisions by M used in conventional modular
operations. Montgomery's modular multiplication algorithm uses only
arithmetic, Boolean and shift operations and thereby avoids trial
division, which is the most time-consuming operation in conventional
modular multiplication. The disadvantage is the need to convert
operands into and out of the Montgomery domain. However, this latency
is almost negligible in cryptosystems, where Montgomery modular
multiplication is a key operation that is executed many times in
succession. The Multiple-Word Radix-2 Montgomery Multiplication
algorithm provides a now-classic architecture for implementing
Montgomery multiplication in hardware. Streamlined for minimum delay,
this architecture performs a single Montgomery multiplication in
approximately “2n” clock cycles, where “n” is the width of the
operands in bits.
Let “M” be an odd integer. In many cryptosystems, such as RSA,
computing “A × B (mod M)” is an essential operation. The reduction
modulo “M” is a highly time-consuming step compared to the
multiplication of “A” and “B” without reduction. Montgomery
introduced a method for calculating products (mod M) without the
time-consuming reduction (mod M). The Montgomery multiplication of A
and B (mod M), denoted by MP(A, B, M), is defined as
[A × B × 2^(-n) (mod M)] for some integer “n”. As Montgomery
multiplication is not an ordinary multiplication, there is an
inter-conversion process between the ordinary domain (with ordinary
multiplication) and the Montgomery domain, given by the relation
A ↔ A', where A' = A × 2^n (mod M).
Mathematically, it can be given as:
$A' = A \cdot 2^{n} \pmod{M}$ (3.12)
$A = A' \cdot 2^{-n} \pmod{M}$ (3.13)
The conversion between the domains can be executed using the same
Montgomery operation, in particular $A' = MP(A, 2^{2n} \bmod M, M)$
and $A = MP(A', 1, M)$, where $2^{2n} \pmod{M}$ can be pre-computed.
Despite the initial conversion delay, there is a significant
advantage over ordinary multiplication when many Montgomery
multiplications are performed, followed by a single inverse
conversion at the end.
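The conversion into and out of the Montgomery domain can be illustrated with a short Python sketch. Here `mp` is a reference model of MP(A, B, M) = A × B × 2^(-n) (mod M) built on Python's modular inverse, not the hardware algorithm, and the modulus, operand values and helper names are assumptions chosen for illustration.

```python
def mp(a, b, m, n):
    """Reference Montgomery product: a*b*2**(-n) mod m, for odd m."""
    return (a * b * pow(2, -n, m)) % m       # pow with -n needs Python 3.8+

def to_mont(a, m, n, r2):
    """Ordinary -> Montgomery domain: MP(a, 2^(2n) mod m, m) = a*2^n mod m."""
    return mp(a, r2, m, n)

def from_mont(a_bar, m, n):
    """Montgomery -> ordinary domain: MP(a', 1, m) = a'*2^(-n) mod m."""
    return mp(a_bar, 1, m, n)

M_MOD, N_BITS = 239, 8                       # toy odd modulus and its bit size
R2 = pow(2, 2 * N_BITS, M_MOD)               # pre-computed 2^(2n) mod M

a, b = 57, 101
a_bar = to_mont(a, M_MOD, N_BITS, R2)
b_bar = to_mont(b, M_MOD, N_BITS, R2)
prod_bar = mp(a_bar, b_bar, M_MOD, N_BITS)   # still in the Montgomery domain
assert from_mont(prod_bar, M_MOD, N_BITS) == (a * b) % M_MOD
```

Only one conversion in and one conversion out are needed, however many Montgomery products are chained in between.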
3.5.4 Radix-2 Modular Multiplier
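The optimized algorithm for the Radix-2 modular multiplier for
Montgomery multiplication is presented below.
Algorithm 3.3
Input : odd $M$, $n = \lfloor \log_2 M \rfloor + 1$, $A = \sum_{i=0}^{n-1} A_i \cdot 2^{i}$, with $0 \le A, B < M$ (3.14)
Output :
C = MP (A, B, M) = A × B × 2^(-n) (mod M), 0 ≤ C < M (3.15)
1.1 X[0] = 0
1.2 for i = 0 to n-1
1.3 q_i = (X[i] + A_i × B) mod 2
1.4 X[i+1] = (X[i] + A_i × B + q_i × M) / 2
1.5 if X[n] ≥ M then X[n] = X[n] - M
1.6 return C = X[n]
Figure 3.3: Algorithm for Radix-2 Modular Multiplication
The above algorithm presents the pseudocode for the Radix-2
Montgomery multiplication, where $A_i$ is the “ith” bit of “A” and
“n” is the size of “M” in bits.
The verification of the above algorithm is presented as follows.
Consider X[i] given as
$X[i] = \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} A_j \cdot 2^{j} \right) B \pmod{M}$ (3.16)
with X[0] = 0. Then $X[n] \equiv A \cdot B \cdot 2^{-n} \pmod{M} = MP(A, B, M)$ can be
computed iteratively using the following dependence:
$X[i+1] \equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i} A_j \cdot 2^{j} \right) B$ (3.17)
$= \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i-1} A_j \cdot 2^{j} + A_i \cdot 2^{i} \right) B$ (3.18)
$= \frac{1}{2} \left( \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} A_j \cdot 2^{j} \right) B + A_i \cdot B \right)$ (3.19)
$= \frac{1}{2} \left( X[i] + A_i \cdot B \right) \pmod{M}$ (3.20)
Hence, depending on the parity of $X[i] + A_i \cdot B$, X[i+1] is computed as
$(X[i] + A_i \cdot B)/2$ or $(X[i] + A_i \cdot B + M)/2$, so as to make the numerator divisible by 2.
Since $B < M$ and $X[0] = 0$, one has $0 \le X[i] < 2M$ for all $0 \le i \le n$. In the
literature [109] and [110] it has been shown that the result of a
Montgomery multiplication satisfies
$A \cdot B \cdot 2^{-n} \pmod{M} < 2M$ when $A, B < 2M$ and $2^{n} > 4M$ (3.21)
As a result, by redefining “n” to be the least integer such that
$2^{n} > 4M$, the subtraction at the final step of the presented algorithm can be
avoided and the output of the multiplication can be used directly
as an input to the next Montgomery multiplication.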
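The effect of dropping the final subtraction can be checked with a small Python sketch. It assumes the standard radix-2 recurrence X[i+1] = (X[i] + A_i × B + q_i × M) / 2, with q_i chosen to make the numerator even, and the condition 2^n > 4M with operands below 2M. The function name and the toy modulus are illustrative.

```python
def mont_mul_no_sub(a, b, m, n):
    """Radix-2 Montgomery product without the final subtraction.

    Assumes odd m, 2**n > 4*m and 0 <= a, b < 2*m. The result is
    congruent to a*b*2**(-n) mod m and stays below 2*m, so it can be
    fed straight into the next multiplication.
    """
    if m % 2 == 0 or (1 << n) <= 4 * m:
        raise ValueError("need odd m and 2**n > 4*m")
    x = 0
    for i in range(n):
        a_i = (a >> i) & 1       # scan the bits of A, least significant first
        x += a_i * b
        if x & 1:                # q_i = 1: add m so the sum is even
            x += m
        x >>= 1
    return x

# Chained squarings remain bounded by 2*m without any correction step.
m, n = 239, 10                   # 2**10 = 1024 > 4*239
t = 200
for _ in range(20):
    t = mont_mul_no_sub(t, t, m, n)
    assert t < 2 * m
```

The price of removing the comparison and subtraction is two extra iterations per multiplication, since “n” is enlarged so that 2^n exceeds 4M.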
3.6 Modular Multiplication Algorithms
In RSA, the encryption key is a pair of positive integers (e, n) and
the decryption key is a pair of positive integers (d, n). To encrypt
a message using the key (e, n), the following structural approach
has been implemented.
The sequential binary modular exponentiation algorithm is presented
as Algorithm 3.4 in Figure 3.4, where “m” represents the number of
bits in the exponent “E” and “M” is the modulus. The multiplication
in line 4 of the algorithm depends on the result of the squaring in
line 3.
3.6.1 Algorithm for Sequential Binary (T, E, M)
Algorithm 3.4
int R := 1;
1 if e_(m-1) = 1, then R := T;
2 for i = (m-2) downto 0 do
3 R := R x R (mod M);
4 if e_i = 1, then R := R x T (mod M);
5 return R;
end
Figure 3.4: Algorithm for Sequential Binary Modular Multiplication
The serial Montgomery multiplier is presented in Figure 3.5. It
includes two Montgomery modular multipliers, SerialMMM1 and
SerialMMM2, and uses five registers: one for each of T, M, E, R x R
(SQUARE) and R x T (MPRODUCT). Depending on the length of the
exponent “E”, the CONTROLLER controls the number of iterations
required to perform the exponentiation. The exponent “E” is stored in
a shift register. The two multipliers SerialMMM1 and SerialMMM2 are
not both necessary: the same multiplier can be reused to perform the
squaring step of line 3 and subsequently the multiplication step of
line 4 in Algorithm 3.4. This reduces the hardware area but increases
the encryption/decryption time.
Figure 3.5: Serial Montgomery Multiplier
The parallel algorithm for binary exponentiation is given as
Algorithm 3.5 in Figure 3.6. The parallel algorithm uses an extra
register P, which allows the squaring and the multiplication of each
iteration to proceed concurrently. At best, when every exponent bit
is “1”, this reduces the encryption/decryption time to half that of
the serial algorithm. In the average case, when the exponent consists
of half ones and half zeros, the execution time is improved by a
factor of 1.5. The parallel algorithm needs almost twice the
circuitry needed to implement the serial algorithm, in exchange for
this reduction in the execution time of encryption and decryption.
3.6.2 Algorithm for Parallel Binary (T, E, M)
Algorithm 3.5
1. R(0) = 1, P(0) = T;
2. for i = 0 to (m-1) do
3. P(i+1) = P(i) X P(i) (mod M);
4. if ei = 1, then R(i+1) = R(i) X P(i) (mod M)
5. else R(i+1) = R(i)
6. return R(m)
7. end
Figure 3.6: Algorithm for Parallel Binary Modular Multiplication.
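The right-to-left scan of Figure 3.6 can be sketched in Python as below. In software the two products are evaluated one after the other, but both depend only on the values P(i) and R(i) of the previous iteration, which is what allows two multipliers to work concurrently in hardware. The function name is illustrative.

```python
def parallel_binary_exp(t, e, m):
    """Right-to-left binary exponentiation: returns t**e mod m (m > 1)."""
    r, p = 1, t % m                      # R(0) = 1, P(0) = T
    while e > 0:
        e_i = e & 1                      # current exponent bit, LSB first
        # Both updates read only the old values of p and r, so they are
        # independent of each other within one iteration.
        p_next = (p * p) % m             # P(i+1) = P(i) x P(i)
        r_next = (r * p) % m if e_i else r   # R(i+1) = R(i) x P(i) or R(i)
        r, p = r_next, p_next
        e >>= 1
    return r
```

Compared with the sequential algorithm, no multiplication here has to wait for the squaring of the same iteration.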
The hardware architecture of the parallel modular multiplier is
shown in Figure 3.7. It includes two Montgomery modular multipliers,
SerialMMM1 and SerialMMM2. It uses eight registers, including one for
each of T, M, E, R(i) (SQUAREi), P(i) (MPRODUCTi), R(i+1) (SQUAREi+1)
and P(i+1) (MPRODUCTi+1). Depending on the length of E, the
CONTROLLER controls the number of iterations required to perform the
exponentiation. The exponent “E” is stored in a shift register.
Figure 3.7: Parallel Montgomery Multiplier
In this research work, a Radix-2 modular multiplier based
multiplication architecture is implemented. A brief description of
the employed algorithm is given below.
In the initial stage, the commutative RSA operands are fed serially
into shift registers via an input buffer. While the message “M” is
being loaded into its register, the exponent register is shifted up
until its first non-zero bit is in the most significant position,
and the number of bits of the exponent, log2 E, is counted. After
this initial stage, the multiplier starts operating. As soon as the
first output bit of the multiplier is complete, the Montgomery module
starts operating. Hence, the execution times of the multiplier, the
Carry Propagation Adder and the Montgomery module almost entirely
overlap. Thus, the functional units of the design are fully utilized
during computation.
Summary
In this chapter, the commutative nature of RSA is proved, and the
designs of the serial Montgomery multiplier and the parallel
Montgomery multiplier are presented. The design of the commutative
cryptography core with key generation, the ASM charts for finding
the prime number and for key generation, the flow chart for finding
the GCD of two numbers, and the algorithms for parallel
multiplication are presented in the next chapter.
Chapter 4
Development of Algorithm for Commutative Cryptography Core
with Key Generation
The design of the serial Montgomery multiplier and the parallel
Montgomery multiplier, the RSA algorithm and its commutative property
have been presented in Chapter 3. In this chapter, the design of the
commutative cryptography core with key generation and the ASM chart
for key generation are presented. These functions have been coded in
VHDL and simulated using ModelSim.
4.1 Sequential Implementation of Key Generation
In this section, the mathematical modeling and its sequential
implementation for key generation are presented. In the implemented
system, the first phase is to generate pseudo random numbers. For this
purpose, the Linear Feedback Shift Register (LFSR) has been
employed and more precisely the Fibonacci LFSR is taken into
consideration.
4.1.1 Linear Feedback Shift Register
A linear feedback shift register is a shift register whose input
bit is a linear function of its previous state. The exclusive-or
(XOR) function is commonly used as the linear function of single
bits. Hence, an LFSR is typically a shift register whose input bit
is derived by the XOR of some bits of the complete shift register
value.
4.1.2 Fibonacci LFSRs
The feedback tap numbers in Fibonacci LFSR correspond to a
primitive polynomial. The initial value of the Linear Feedback Shift
Register is called the seed. The rightmost bit is generally the output bit.
The taps are XORed sequentially with the output bit and are provided
as feedback at the leftmost bit. The sequence of bits in the rightmost
position provides the output stream.
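The behaviour described above can be sketched in Python with a 16-bit register. The width, the seed and the tap set (16, 14, 13, 11, a commonly quoted maximal-length choice) are assumptions made for illustration; the design in this work uses 512-bit registers whose taps are not reproduced here.

```python
def fibonacci_lfsr16_step(state):
    """Advance a 16-bit Fibonacci LFSR one clock; return (new_state, out_bit)."""
    out_bit = state & 1                          # rightmost bit is the output
    # Taps 16, 14, 13, 11 correspond to bit positions 0, 2, 3 and 5.
    feedback = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    new_state = (state >> 1) | (feedback << 15)  # feedback enters at the left
    return new_state, out_bit

def fibonacci_lfsr16_period(seed):
    """Count the clocks needed for the register to return to its seed."""
    state, count = seed, 0
    while True:
        state, _ = fibonacci_lfsr16_step(state)
        count += 1
        if state == seed:
            return count

def fibonacci_lfsr16_bits(seed, n_bits):
    """Collect n_bits of the output stream, starting from a non-zero seed."""
    state, bits = seed, []
    for _ in range(n_bits):
        state, bit = fibonacci_lfsr16_step(state)
        bits.append(bit)
    return bits
```

A non-zero seed is required, since the all-zero state maps to itself and produces a constant output.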
4.1.3 Galois LFSR
This is an LFSR in the Galois configuration, also known as a
modular, internal-XOR or one-to-many LFSR. It is named after the
French mathematician Évariste Galois. In contrast to the Fibonacci
LFSR discussed above, the taps here are XORed with the output bit
before they are stored in the next position, and the current output
bit becomes the next input bit. As a result, when the output bit is
zero, all the bits in the register shift to the right without any
alteration and the input bit becomes zero. When the output bit is
one, the bits in the tap positions are complemented, the entire
register is then shifted to the right and the input bit becomes “1”.
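A 16-bit Galois LFSR can be sketched in Python as follows. The register width and the toggle mask 0xB400 (taps 16, 14, 13, 11) are illustrative assumptions, not the configuration of the implemented core. The sketch shifts first and then applies the mask, which is equivalent to complementing the tap positions before the shift.

```python
GALOIS_MASK = 0xB400   # assumed tap mask for a 16-bit maximal-length register

def galois_lfsr16_step(state):
    """Advance a 16-bit Galois LFSR one clock; return (new_state, out_bit)."""
    out_bit = state & 1          # rightmost bit is the output
    state >>= 1                  # shift right; a zero enters at the left
    if out_bit:
        state ^= GALOIS_MASK     # flip the tap positions; the top bit becomes 1
    return state, out_bit

def galois_lfsr16_period(seed):
    """Count the clocks needed for the register to return to its seed."""
    state, count = seed, 0
    while True:
        state, _ = galois_lfsr16_step(state)
        count += 1
        if state == seed:
            return count
```

All tap XORs happen in parallel within one clock, so no chain of XOR gates sits in the feedback path.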
Fibonacci LFSR facilitates a better scenario for hardware or
multiple core implementations for distributed CRSA core. It can produce
sequences of large period with good statistical properties and because
of its structure, it can be analysed using algebraic techniques.
4.1.4 Primality Test
A primality test is an algorithm for determining whether an input
number is prime or not. Primality tests do not give prime factors, but
they will check whether the input number is prime or not.
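As a software illustration of a primality test, the sketch below uses the Miller-Rabin test. This is a substitute technique chosen for brevity and is not the polynomial-based test of Figure 4.1. With the fixed bases used here the answer is known to be exact for inputs below 2^64; for larger inputs, such as 512-bit candidates, it should be treated as a probabilistic test. The function name is illustrative.

```python
def is_probable_prime(n):
    """Miller-Rabin test with the first twelve primes as fixed bases."""
    if n < 2:
        return False
    bases = (2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37)
    for p in bases:                  # quick trial division by small primes
        if n % p == 0:
            return n == p
    d, s = n - 1, 0                  # write n - 1 as d * 2**s with d odd
    while d % 2 == 0:
        d //= 2
        s += 1
    for a in bases:
        x = pow(a, d, n)
        if x == 1 or x == n - 1:
            continue                 # this base does not expose n
        for _ in range(s - 1):
            x = (x * x) % n
            if x == n - 1:
                break
        else:
            return False             # base a is a witness: n is composite
    return True
```

Like the tests described above, the function only reports whether the input is prime and gives no information about its factors.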
4.2 Key Generation
Each output of the LFSR is checked to determine whether it is a
prime number. Two prime numbers are selected randomly and designated
as “p” and “q”. These prime numbers are shared between all the users
in the system. Using these prime numbers, the encryption exponent
(e) and the decryption exponent (d) are computed. These are the
commutative encryption key and decryption key respectively. The keys
are then employed to create the cipher text “C” and later to convert
it back into the plain text “M”. In contrast to the generic RSA
architecture, the proposed Commutative RSA employs two parallel
pseudo random generators, each followed by a 512-bit LFSR, in order
to enhance the overall efficiency and reduce the critical delay in
key generation. The pseudo random bits generated are stored in shift
registers; once a register is filled, the LFSR stops generating
bits until register memory is available for the next random bits.
This makes the system power efficient and increases the speed of
operation.
Flow charts of the various functions are represented using
Algorithmic State Machine (ASM) charts [115]. ASM charts are a
graphical representation of the step-by-step execution of a hardware
code. They are easy to interpret and can be readily converted to
other forms of representation. They do not enumerate all the possible
inputs and outputs; only the inputs that matter and the outputs that
are asserted are indicated. The ASM chart for testing whether the
output of the LFSR is a prime number is presented in Figure 4.1. The
RTL schematic of the same, extracted using the ISE design tool, is
presented in Figure 4.2.
4.2.1 RTL Coding Guidelines
The ultimate aim of the designer is to map the design on an FPGA
device or to implement it as an Application Specific Integrated
Circuit (ASIC), and this is possible only if certain guidelines are
followed [115]. A popular set of guidelines practiced in industry,
known as the RTL coding guidelines, requires that data transfers in
a system take place via registers. It basically means adhering to
synchronous design practices, which regulate the data flow and how
the data is processed. Since this is a synchronous design, it should
run smoothly through simulation, synthesis and finally the place and
route tools. In order to achieve this, the asynchronous and
sequential circuits have to be isolated. Code developed using the
RTL coding guidelines runs smoothly in all the tools.
Start
1. Input an integer “n” > 1.
2. Is “n” an exact power of another positive integer? If true, declare “n” to be a composite number and stop.
3. Choose a set of polynomials g(x) and a polynomial f(x) that are sufficient for testing the primality of “n”.
4. Choose g(x) from the set of g(x)s.
5. Is (g(x))^n ≡ g(x^n) mod (f(x), n)? If false, declare “n” to be a composite number and stop.
6. Has the check been performed for all g(x)? If false, return to step 4. If true, declare “n” to be a prime number.
Stop
(State labels shown in the chart: S52, S56, S58, S60, S66.)
Figure 4.1: ASM Chart for Finding the Prime Number
Figure 4.2: RTL Schematic for Finding the Prime Number Using
ISE 14.3 Xilinx Tool
In Figure 4.2, CRSA_KGCORE_CMP_PRME_TRUE is the block name used in
the VHDL code. clk, reset_I, start, rd_ack and rand_in (511:0) are
the inputs, and ready, rd_en, prime_valid and prime_out (511:0) are
the outputs of this block. reset_I is an asynchronous active low
input; the system is reset only if this input is low. For normal
working of the system, this signal is high. start is an active high
input which indicates the start of checking whether the number is
prime or not. rand_in (511:0) is the input number to be checked for
primality; it is a random number generated by the linear feedback
shift register. Continuous clock pulses are applied. At every
positive edge of the clock, reset_I is checked. If it is low, the
outputs are in the reset state. When reset_I is high, the random
number present at that time is tested for primality. The algorithm
implemented for checking the prime number is presented in Figure
4.1. If the applied random number is a prime number, the prime_valid
and rd_en outputs are “1”, indicating that the output prime_out
(511:0) is a prime number.
Two prime numbers have been taken and designated as “p” and “q”.
The variables “n” and “φ(n)” are computed using these two prime
numbers. The encryption exponent “e” is computed using a trial and
error method in such a way that the Greatest Common Divisor (GCD) of
“e” and “φ(n)” is “1”. The flow chart for finding the GCD of two
numbers is presented in Figure 4.3 and the RTL schematic is
presented in Figure 4.4.
After obtaining “φ(n)”, an integer value is assumed for “e”, and
these two values are the inputs to the GCD block. “e” and “φ(n)” are
compared for three cases: “e” greater than “φ(n)”, “e” equal to
“φ(n)” and “e” less than “φ(n)”. If “e” is greater than “φ(n)”,
“φ(n)” is subtracted from “e” and the two are compared again. If “e”
is less than “φ(n)”, “e” is subtracted from “φ(n)” and the two are
compared again. These steps are repeated until “e” becomes equal to
“φ(n)”; the common value is then the GCD of the two original
numbers. If this GCD is equal to “1”, the assumed value of “e” is
selected as the encryption key; otherwise another value of “e” is
tried.
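START
1. Read “e” and “φ(n)”.
2. Compare “e” and “φ(n)”.
3. If e > φ(n) is true, set e = e - φ(n) and return to step 2.
4. If e < φ(n) is true, set φ(n) = φ(n) - e and return to step 2.
5. If e = φ(n) is true, the common value is the GCD; the chart marks the accepted case as GCD = 1.
STOP
Figure 4.3: Flowchart for Finding GCD of Two Numbers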
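The subtract-and-compare loop of Figure 4.3 can be sketched in Python as below. The function names are illustrative, and the helper that accepts or rejects a candidate “e” is an assumption about how the result of the GCD block is used.

```python
def gcd_by_subtraction(e, phi):
    """GCD of two positive integers using only comparison and subtraction."""
    if e <= 0 or phi <= 0:
        raise ValueError("both inputs must be positive")
    while e != phi:
        if e > phi:
            e -= phi             # e = e - phi(n)
        else:
            phi -= e             # phi(n) = phi(n) - e
    return e                     # the common value is the GCD

def is_valid_exponent(e, phi):
    """A candidate e is accepted when 1 < e < phi and GCD(e, phi) = 1."""
    return 1 < e < phi and gcd_by_subtraction(e, phi) == 1
```

Only a comparator and a subtractor are needed, which suits a hardware implementation, although the number of iterations grows when the two operands differ greatly in size.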
Figure 4.4: RTL View for Finding GCD Using ISE 14.3 Xilinx Tool
In Figure 4.4, CRSA_KGCORE_CMP_GCD is the component declared in the
VHDL code for finding the GCD of two numbers. A(1023:0) is the
1024-bit input; clk, reset_I and start are the other inputs. reset_I
is an asynchronous active low input that resets the system; this
signal is high for normal functioning. start is an active high input
which indicates the start of the function. d(1023:0), e(1023:0),
out_valid and ready are the outputs. d(1023:0) and e(1023:0) are the
decryption and encryption keys respectively. These two keys are
valid when the out_valid signal is high.
After obtaining the GCD of “e” and “φ(n)” as “1”, the pair (e, n) is
considered as the encryption key. The decryption key pair (d, n) is
computed using the encryption key in such a way that
d = e^(-1) (mod φ(n)). The mathematical approach to perform key
generation is presented in Figure 4.5 and Figure 4.6.
Figure 4.5: Key Generation for Commutative RSA Algorithm
Figure 4.5 presents the functional block diagram of key generation
in Commutative RSA. In the designed system, seed data of size 512
bits are applied to the key generator, which finally generates the
1024-bit encryption key “e”, the 1024-bit decryption key “d” and the
1024-bit parameter “φ(n)”. In Figure 4.5, pseudo random number
generator-1 and pseudo random number generator-2 are the two blocks
that generate random numbers of 512 bits each. These random numbers
are fed to the next block, which tests whether they are prime. Two
prime numbers “p” and “q” are obtained after the primality test.
These two prime numbers are multiplied and the product is designated
as the integer “n”. One more parameter, “φ(n)”, is obtained by
multiplying (p-1) and (q-1). An integer “e” is assumed and the GCD
of “e” and “φ(n)” is tested using the trial and error method. After
obtaining the GCD of “e” and “φ(n)” as “1”, the pair (e, n) is
considered as the encryption key, and the decryption key pair (d, n)
is computed such that d = e^(-1) (mod φ(n)). The step by step
procedure for generating both the encryption key (e, n) and the
decryption key (d, n) is presented in the following algorithm.
Algorithm 4.1: Key Generation
1. Select p and q such that both p and q are prime numbers
2. Compute n = p x q
3. Compute φ(n) = (p - 1) x (q - 1)
4. Select an integer “e” in such a way that
   GCD (φ(n), e) = 1; 1 < e < φ(n)
5. Calculate d = e^(-1) (mod φ(n))
6. Generate the public and private key pairs:
   Encryption key pair, (e, n)
   Decryption key pair, (d, n)
Figure 4.6: Pseudo Algorithm for Commutative RSA Key Generation
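The steps of Algorithm 4.1 can be traced with a small Python sketch. The toy primes, the starting value of the trial-and-error search and the use of the extended Euclidean algorithm for the modular inverse are illustrative assumptions; the sketch does not describe how the hardware block computes “d”.

```python
from math import gcd

def extended_gcd(a, b):
    """Return (g, x, y) with a*x + b*y == g == gcd(a, b)."""
    old_r, r = a, b
    old_x, x = 1, 0
    old_y, y = 0, 1
    while r:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_x, x = x, old_x - q * x
        old_y, y = y, old_y - q * y
    return old_r, old_x, old_y

def generate_keys(p, q, e_start=3):
    """Return ((e, n), (d, n)) for primes p and q, searching e upwards."""
    n = p * q                                # step 2
    phi = (p - 1) * (q - 1)                  # step 3
    e = e_start | 1                          # phi is even, so only odd e can work
    while e < phi and gcd(e, phi) != 1:      # step 4: trial and error
        e += 2
    if not 1 < e < phi:
        raise ValueError("no valid exponent found from this starting value")
    _, x, _ = extended_gcd(e, phi)
    d = x % phi                              # step 5: d = e^(-1) mod phi(n)
    return (e, n), (d, n)

(enc_e, enc_n), (dec_d, dec_n) = generate_keys(61, 53)
message = 65
assert pow(pow(message, enc_e, enc_n), dec_d, dec_n) == message
```

The inverse is taken modulo φ(n), not modulo n, which is what makes decryption undo encryption.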
The algorithm may be efficiently designed using Algorithmic State
Machine (ASM) charts rather than by the traditional state diagram [115].
The process of commutative key generation has been illustrated in
Figure 4.7 using ASM Chart.
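START
1. The Pseudo Random Number Generator produces candidate numbers.
2. Prime No. 1? and Prime No. 2?: if false, a new candidate is generated; if true, the candidates are taken as p and q.
3. n = p x q
4. φ(n) = (p-1) x (q-1)
5. GCD (φ(n), e)
6. GCD = 1? If false, another value of “e” is tried. If true, the key pairs are produced, labelled in the chart as “Private key (e, n)” and “Public key (d, n)”.
7. Encryption Process
8. Decryption Process
(State labels shown in the chart: S0 to S9.)
Figure 4.7: ASM Chart for Key Generation for Commutative RSA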
The encryption key pair (e, n) and the decryption key pair (d, n)
obtained in this way are then used for computing the encryption of
the plain text and the decryption of the cipher text respectively.
RTL schematics of the top module of CRSA key generation using
Xilinx ISE 14.3 and Vivado 2012.3 are presented in Figure 4.8 and
Figure 4.9 respectively.
Figure 4.8: RTL Schematics of Top Module of CRSA Key
Generation Using Xilinx ISE 14.3
[Schematic: CRSA encoding and decoding blocks with inputs CLK, SEED[511:0] and ADDA[511:0]; key generation blocks KEYGEN_CRSA_KGCORE_GENPVAL, KEYGEN_CRSA_KGCORE_GENQVAL, KEYGEN_CRSA_KGCORE_CHKPVAL, KEYGEN_CRSA_KGCORE_CHKQVAL and KEYGEN_CRSA_MAIN_MEM_MGMT; intermediate signals RAND_IN[511:0], PRIME_OUT[511:0] and A[1023:0]; outputs e_val[1023:0], d_val[1023:0], N_out[1023:0], Fi_out[1023:0] and Key_out.]
Figure 4.9: Schematic of CRSA Key Generation Using Vivado 2012.3
In Figures 4.8 and 4.9, the first stage is the generation of
512-bit random numbers. Two random numbers, P and Q, are required.
KEYGEN_CRSA_KGCORE_GENPVAL is the block that generates the first
random number and KEYGEN_CRSA_KGCORE_GENQVAL is the block that
generates the second random number. The inputs to these two blocks
are the 512-bit seed [511:0], Reset, clk and chip enable (ce). These
two blocks are Linear Feedback Shift Registers (LFSR); the Fibonacci
LFSR described in Section 4.1.2 is implemented. The second stage
checks for prime numbers. KEYGEN_CRSA_KGCORE_CHKQVAL and
KEYGEN_CRSA_KGCORE_CHKPVAL are the two blocks that verify whether a
random number is prime or not. The inputs to these two blocks are
the outputs of the first stage. Figure 4.1 describes the Algorithmic
State Machine chart for checking whether the input number is prime
or not. After finding the two prime numbers P and Q, the variables
“n” and “φ(n)” are computed from them, and the encryption exponent
“e” is found by trial and error such that the Greatest Common
Divisor (GCD) of “e” and “φ(n)” is “1”, following the flow chart of
Figure 4.3 and the RTL schematic of Figure 4.4.
KEYGEN_CRSA_KGCORE_GEN_N_PHI is the block for computing “n” and
“φ(n)”. These variables are the inputs to the next block,
KEYGEN_CRSA_E_D_VAL_CMPTE, which computes the encryption key “e” and
the decryption key “d”, indicated in the figure as e_val (1023:0) and
d_val (1023:0) respectively. These two keys are of 1024 bits each.
The Algorithmic State Machine chart of Figure 4.7 presents the key
generation for the Commutative RSA algorithm implemented in this
work.
The second phase of the CRSA cryptography core is commutative
encryption, which is presented in the following section.
4.3 Commutative Encryption
The RSA cryptosystem is one of the finest solutions for public key
cryptography. However, its full strength is limited by its
restricted encryption, and RSA algorithms suffer from reorder issues.
Hence, in order to make the system simpler and more efficient, a
methodology called Commutative RSA, which works on the basis of the
commutative property, has been designed. The mathematical structure
for performing encryption is described by the pseudo algorithm
presented in Figure 4.10. To achieve robust encryption with the
least critical delay, an improved architecture called “Parallel
Montgomery Multiplication for distributed cores” has been designed.
The algorithmic development and the details of this modified
Montgomery multiplication are given in the following section. Using
the encryption key “e”, the plain text “M” is converted to the cipher
text “C”. This CRSA encryption with parallel Montgomery
multiplication is implemented at every FPGA core or distributed
core. Such an implementation not only reduces the computational cost
in terms of execution time but also reduces the key exchange
overheads in multiple core or MIMO transceiver based communication
systems. Mathematically, the commutative RSA encryption algorithm
can be stated as follows:
Vinayaka Missions University, Salem 74
Algorithm 4.2 Commutative Encryption
1. Prime numbers: $p$, $q$
2. Compute n: $n = p \times q$
3. Plain text: $M$
4. Cipher text: $C = M^{e} \pmod{n}$
Figure 4.10: Pseudo Algorithm for Commutative Encryption
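Step 4 of Algorithm 4.2 is a modular exponentiation, which is usually evaluated as a chain of modular squarings and multiplications. The sketch below shows the standard right-to-left binary method; the function name `mod_exp` and the toy values are assumptions made for illustration, and the hardware design replaces each modular product with a Montgomery multiplication.

```python
def mod_exp(base: int, exponent: int, modulus: int) -> int:
    """Right-to-left binary (square-and-multiply) modular exponentiation."""
    result = 1 % modulus
    base %= modulus
    while exponent > 0:
        if exponent & 1:                      # current exponent bit is 1
            result = (result * base) % modulus
        base = (base * base) % modulus        # square for the next bit
        exponent >>= 1
    return result

n, e, M = 3233, 7, 65        # toy modulus (61 * 53), exponent and plain text
C = mod_exp(M, e, n)         # cipher text C = M^e mod n
assert C == pow(M, e, n)     # agrees with Python's built-in modular power
```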
The encryption process takes as input the data, the encryption
exponent and the parameter “n”, all of size 32 bits, and produces 32
bits of cipher text. The designed scheme uses Montgomery
multiplication in its parallel form. Each individual multiplication
step takes a 32-bit modulus, multiplicand and multiplier, and the final
result is 32 bits of cipher text. This CRSA encryption with parallel
Montgomery multiplication is implemented at every FPGA core or
distributed core. For better performance, Montgomery based decryption
is also necessary. Therefore, the developed decryption algorithm also
uses the parallel Montgomery Multiplier.
4.4 Commutative Decryption
In commutative RSA, the decryption is done in the same fashion as
commutative encryption, using the parallel Montgomery multiplier. In
commutative decryption, the respective plain text is obtained from the
cipher text at each individual transceiver terminal. The cipher text is
raised to the power of the decryption key exponent “d_crsa” and then
reduced modulo “n”. Mathematically, the commutative RSA decryption
algorithm can be stated as follows:
Algorithm 4.3: Commutative Decryption
1. Input: Cipher text: $C$
2. Output: Plain text: $M$
3. Decryption key pair: $(d, n)$
4. Plain text: $M = C^{d} \pmod{n}$
Figure 4.11: Pseudo Algorithm for Commutative Decryption
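The commutative property that Algorithms 4.2 and 4.3 rely on can be checked numerically when two users share the same modulus. The sketch below is illustrative only: the toy primes, the exponents 7 and 11, the message value and the helper `key_pair` are assumptions, not values from the design.

```python
from math import gcd

p, q = 61, 53                          # toy primes shared by both users (assumed)
n, phi = p * q, (p - 1) * (q - 1)

def key_pair(e: int):
    """Return (e, d) with d the inverse of e modulo phi."""
    assert gcd(e, phi) == 1
    return e, pow(e, -1, phi)

e_a, d_a = key_pair(7)                 # user A
e_b, d_b = key_pair(11)                # user B

M = 1234
c_ab = pow(pow(M, e_a, n), e_b, n)     # A encrypts first, then B
c_ba = pow(pow(M, e_b, n), e_a, n)     # B encrypts first, then A
assert c_ab == c_ba                    # encryption order does not matter

m_ab = pow(pow(c_ab, d_a, n), d_b, n)  # decrypt with A's key, then B's
m_ba = pow(pow(c_ab, d_b, n), d_a, n)  # decrypt with B's key, then A's
assert m_ab == m_ba == M               # either order recovers the plain text
```

The property holds because both exponentiations are taken modulo the same “n”, so the exponents simply multiply and their order is irrelevant.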
The algorithm for accomplishing security and authentication for
multi user or multiple core distributed FPGA frameworks has been
presented in the next section.
4.5 Commutative RSA Cryptography Core
In this section overall design of CRSA Architecture and its
implementation with multiple distributed FPGA cores has been
presented. The entire system implementation was done sequentially
using multiple cores.
In this work, three user terminals are considered. Each user or
terminal encrypts the message using the private key and decryption is
performed using the public key. Even though a public key is generated
at each transceiver terminal, the public keys are transmitted using the
classical method. At the receiver end, each user terminal knows the
number of terminals the plaintext has traversed. The receiver has to
perform the same number of decryption iterations, using the respective
decryption keys, to obtain the plaintext. Phase 1 is the key generation
step. The sequential procedure for key generation has been presented in
sections 4.1 and 4.2. The next phase is commutative encryption, which is
presented in section 4.3, and the third phase is commutative decryption,
presented in section 4.4. The sequential phases are presented in
Figure 4.12.
Figure 4.12: Sequential Model for Commutative RSA Realization
The various phases of the commutative cryptography core have been
discussed in the previous sections. The parallel Montgomery multiplier
is presented in the next section.
4.6 CRSA Oriented Parallel Montgomery Multiplier
A multi-core system must tolerate communication delay, since
inter-core communication is, in general, much slower than intra-core
communication. Hence, in multiple core systems, the inter-core
communication might become a bottleneck for parallel multipliers. The
parallel Montgomery Multipliers are well suited to diverse multi-core
models and keep the overall system stable even at higher throughputs.
4.6.1 Parallel Montgomery Multiplier
The implementation of Montgomery multiplication using a parallel
multiplier for multi-core applications is effective for RSA
cryptosystems and their applications. A number of cryptosystems require
a series of multiplications, some of which can operate concurrently.
For example, modular exponentiation is computed as a chain of modular
multiplications and squarings. Multiplication and squaring are in
general implemented by means of an integer multiplication followed by a
modular reduction with a predefined modulus. The design of the
Montgomery Multiplication Algorithm was presented in detail in the
previous chapter. In the Montgomery multiplication process, the
operands are first transformed into their Montgomery residue
representation. This is followed by the integer multiplications and
squarings. Finally, the result is converted back into its ordinary
integer representation.
In this work, multiple core hardware architecture is proposed for
Commutative RSA encryption and decryption. The developed scheme
involves highly efficient communication among incorporated multiple
cores. This CRSA approach is robust, efficient and offers real time
functional hardware architecture.
4.7 Realization of Parallel Montgomery Multiplier
In this algorithm for multiple distributed core architecture, the
intrinsic parallelism scheme has been taken into consideration. The
following algorithm has been used for parallel Montgomery
multiplication. Algorithm 4.4 explains the general Montgomery
Multiplication algorithm.
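Algorithm 4.4: Montgomery Multiplication
Input: $A, B \in \mathbb{Z}_n$, where $n$ is an odd integer,
$n' = -n^{-1} \bmod 2^{m}$, where $m = \lceil \log_2 n \rceil$.
Output: $A \cdot B \cdot 2^{-m} \bmod n$
1. $T \leftarrow A \cdot B$
2. $T \leftarrow (T + (T \cdot n' \bmod 2^{m}) \cdot n) / 2^{m}$
3. If $T \ge n$ then
4. Return $T - n$
5. Else
6. Return $T$
7. End if
Figure 4.13: Pseudo Algorithm for Montgomery Multiplication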
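Algorithm 4.4 can be exercised in software with the sketch below. It is a minimal model of the standard Montgomery product, assuming an odd modulus with 2^m > n; the function name `montgomery_multiply` and the toy values are hypothetical and do not come from the RTL code.

```python
def montgomery_multiply(a: int, b: int, n: int, m: int) -> int:
    """Return a*b*2^(-m) mod n for odd n, 2^m > n and 0 <= a, b < n."""
    r_mask = (1 << m) - 1
    n_prime = (-pow(n, -1, 1 << m)) & r_mask   # n' = -n^(-1) mod 2^m
    t = a * b                                   # step 1
    u = ((t & r_mask) * n_prime) & r_mask       # (T * n') mod 2^m
    t = (t + u * n) >> m                        # step 2: exact division by 2^m
    return t - n if t >= n else t               # steps 3 to 7

n = 3233                      # toy odd modulus, assumed for illustration
m = n.bit_length()            # 12 bits, so R = 2^12 = 4096 > n
R = 1 << m
x, y = 1234, 2345

# Transform the operands into Montgomery residue form, multiply, convert back.
x_bar, y_bar = (x * R) % n, (y * R) % n
z_bar = montgomery_multiply(x_bar, y_bar, n, m)   # equals x*y*R mod n
z = montgomery_multiply(z_bar, 1, n, m)           # back to ordinary form
assert z == (x * y) % n
```

The division by 2^m is a plain right shift, which is why the method suits hardware: no trial division by “n” is ever required.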
Parallel Integer Multiplication and Parallel Montgomery
Multiplication algorithms are presented in the following sections.
4.7.1 Pseudo algorithm for Parallel Integer Multiplication
The pseudo code for the parallel integer multiplication used within
the Parallel Montgomery multiplication is as follows:
Algorithm 4.5: Algorithm for Parallel Integer Multiplication
Input: Integers $E = \sum_{i=0}^{s-1} e_i \, 2^{i \cdot d}$ and $N$, each of size
$m = d \cdot s$ bits. Here the variable “s” states the number of cores to be realized.
Results: parmul(E, N) = $E \cdot N$ (parallel multiplied output)
1. Initialize one loop
2. for i = 0 to s-1 do (one iteration per core, for multicore implementation)
Update variable $w_i$:
$w_i \leftarrow e_i \cdot N \cdot 2^{i \cdot d}$
End for
3. for i = 1 to s-1 do
Update $w_0$:
4. $w_0 \leftarrow w_0 + w_i$
End for
5. return ($w_0$)
Figure 4.14: Pseudo Algorithm for Parallel Integer Multiplication
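Algorithm 4.5 can be modelled in software as shown below. This is a sequential sketch only: each list entry stands for the partial product that one core would compute, and the name `parmul` with its `s` and `d` parameters follows the pseudo code, while the operand values are assumptions chosen for illustration.

```python
def parmul(e: int, n_op: int, s: int, d: int) -> int:
    """Multiply e by n_op by splitting e into s words of d bits each.

    Assumes 0 <= e < 2^(s*d). The list comprehension models the s cores
    working independently; in hardware these products run concurrently.
    """
    mask = (1 << d) - 1
    # Step 2: core i forms w_i = e_i * N * 2^(i*d)
    w = [(((e >> (i * d)) & mask) * n_op) << (i * d) for i in range(s)]
    # Steps 3 and 4: accumulate the remaining partial products into w_0
    for i in range(1, s):
        w[0] += w[i]
    return w[0]

E, N = 0xBEEF, 0xCAFE                 # 16-bit toy operands (assumed)
product = parmul(E, N, s=4, d=4)      # four "cores", 4-bit words
assert product == E * N
```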
4.7.2 Pseudo Algorithm for Parallel Montgomery Multiplication
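Algorithm 4.6: Parallel Montgomery Multiplication
Input: Variables $E, N \in \mathbb{Z}_n$, $n$ for an odd integer, while
$n' = -n^{-1} \bmod 2^{m}$, with $m = \lceil \log_2 n \rceil$
Results: $E \cdot N \cdot 2^{-m} \bmod n$
Apply the previous algorithm (parallel integer multiplication algorithm) for
achieving the following:
1. $w \leftarrow$ parmul(E, N)
2. $u \leftarrow$ parmul(w, n') mod $2^{m}$
3. Update g:
4. $g \leftarrow$ parmul(u, n)
5. $g \leftarrow (g + w) / 2^{m}$
6. if $g \ge n$, then
7. return (g - n)
8. else
9. return (g)
10. end if
Figure 4.15: Pseudo Algorithm for Parallel Montgomery
Multiplication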
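The steps of Algorithm 4.6 can be sketched as below, with every large product routed through a software stand-in for parmul. The helper names, the compact `parmul` model and the toy parameters (s = 4 cores of d = 3 bits, so m = 12) are assumptions made for illustration; the first operand of step 2 is reduced modulo 2^m beforehand so that it fits in s words.

```python
def parmul(x: int, y: int, s: int, d: int) -> int:
    """Software model of parallel integer multiplication (x < 2^(s*d))."""
    mask = (1 << d) - 1
    return sum((((x >> (i * d)) & mask) * y) << (i * d) for i in range(s))

def par_montgomery(e_op: int, n_op: int, n: int, s: int, d: int) -> int:
    """Return e_op*n_op*2^(-m) mod n with m = s*d, odd n and 2^m > n."""
    m = s * d
    r_mask = (1 << m) - 1
    n_prime = (-pow(n, -1, 1 << m)) & r_mask           # n' = -n^(-1) mod 2^m
    w = parmul(e_op, n_op, s, d)                       # step 1
    u = parmul(w & r_mask, n_prime, s, d) & r_mask     # step 2
    g = parmul(u, n, s, d)                             # step 4
    g = (g + w) >> m                                   # step 5
    return g - n if g >= n else g                      # steps 6 to 10

n, s, d = 3233, 4, 3              # toy odd modulus, 4 cores, 3-bit words
R = 1 << (s * d)                  # R = 4096 > n
a, b = 1234, 2345
result = par_montgomery(a, b, n, s, d)
assert result == (a * b * pow(R, -1, n)) % n
```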
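The partial product accumulation has been accomplished by
means of a “binary tree approach” as presented below, with at most
$\lceil \log_2 s \rceil$ phases when “s” cores are available.
For $i = 1$ to $\lceil \log_2 s \rceil$ do
For $j = 0$ to $s - 1$ in steps of $2^{i}$ do (in parallel, skipping terms with $j + 2^{i-1} \ge s$)
$w_j \leftarrow w_j + w_{j + 2^{i-1}}$
End for
End for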
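The binary tree accumulation can be illustrated with the following sketch, which adds pairs of partial products phase by phase and counts the phases. The function name `tree_accumulate` is hypothetical, and the additions inside one phase are executed in a loop here although they are independent and could run on separate cores.

```python
def tree_accumulate(partials):
    """Sum a non-empty list of partial products pairwise.

    Returns (total, phases); phases equals ceil(log2(s)) for s inputs.
    """
    w = list(partials)
    s = len(w)
    stride, phases = 1, 0
    while stride < s:
        # All additions of one phase are independent of each other.
        for j in range(0, s - stride, 2 * stride):
            w[j] += w[j + stride]
        stride *= 2
        phases += 1
    return w[0], phases

print(tree_accumulate([1, 2, 3, 4, 5]))   # prints: (15, 3)
```

Compared with the sequential loop of Algorithm 4.5, which needs s - 1 dependent additions, the tree needs only about log2(s) dependent steps.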
Thus, by implementing this parallel Montgomery multiplication at
every MIMO transceiver or distributed FPGA core terminal for
performing commutative encryption and decryption, the overall latency
and the critical delay have been reduced. A robust system has thereby
been obtained for the Commutative RSA Cryptosystem.
4.8 Realization of CRSA with Multiple Distributed Cores
It is assumed that a platform providing a secure medium is
available, that the data to be communicated over this medium needs to
be secured, and that it is sufficient to prohibit any collusion, so
that a secure plane is achievable. A public key cryptosystem like RSA
is normally ideal for this purpose. As a result, the commutative
cryptographic core for the distributed FPGA scheme with parallelized
Montgomery Multiplication uses the commutative RSA algorithm (CRSA), in
which the order in which the encryptions are performed does not affect
the cryptosystem.
The decryption can also be realized in a similar manner. The developed
approach takes into account two prime variables, stated as $p$ and
$q$, that have been initialized and are shared by all the group
members.
Let the variables $A$ and $B$ represent the group members
required to perform communication over the secure plane. For
computing the encryption and decryption keys of the proposed CRSA
algorithm, the properties $n$ and $\varphi(n)$ have been
calculated using the expressions presented in Eq. 4.1 and
Eq. 4.2.
$n = p \cdot q$ (4.1)
$\varphi(n) = (p - 1)(q - 1)$ (4.2)
Taking into account the above expressions, it is verified that
$n_A = n_B = n$ and $\varphi(n_A) = \varphi(n_B) = \varphi(n)$ for A
and B respectively. The encryption key pairs for users A and B are
represented as:
$(e_A, n)$ and $(e_B, n)$ (4.3)
Each encryption exponent $e$ is computed using randomly selected
variables in such a way that:
$\gcd(e, \varphi(n)) = 1$ (4.4)
where $\gcd$ is the greatest common divisor (GCD) function that
exists between the variables $e$ and $\varphi(n)$.
The decryption key pairs of A and B are presented in terms of
$(d_A, n)$ and $(d_B, n)$. The properties $d_A$ and $d_B$ are
calculated based on the following expressions.
$d_A = e_A^{-1} \bmod \varphi(n)$ (4.5)
$d_B = e_B^{-1} \bmod \varphi(n)$ (4.6)
Similarly, the resulting commutative RSA decryption operates on
the retrieved encrypted data and can be defined by the following
expression.
$M = C^{d} \bmod n$ (4.7)
After decryption, the data has been verified.
Summary
In this chapter, design of commutative cryptography core with key
generation, ASM charts for finding the prime number and key
generation, Flow chart for finding the GCD of two numbers, algorithms
for parallel multiplication are presented. Analysis of simulation results
and hardware design are presented in the next chapter.
Chapter 5
Simulation and Place and Route Results of
Commutative Cryptography Architecture with Key Generation
In the previous two chapters, the design of serial and parallel
Montgomery multipliers and the design of the commutative cryptography
architecture with key generation were presented. In this chapter, their
Simulation, Place and Route results are presented in detail.
5.1 Introduction
The secure and authenticated data communication in the present
day competitive application scenario needs a robust and efficient
security system that ensures authenticated data communication in
multiple user scenarios. Considering the efficiency of RSA algorithm
and certain optimization opportunity with this cryptosystem, in this work,
a new approach called Commutative RSA has been developed and it
has been implemented with multiple FPGA distributed cores. The
developed system has been enhanced with other optimization features
such as parallel Montgomery multiplication with radix-2. This makes the
system perform well in terms of execution time. The CRSA has been
tested with both paradigms of Montgomery multiplication: Serial and
Parallel Montgomery Multiplications. The results illustrate that parallel
Montgomery Multiplication optimizes CRSA with better performance as
compared to serial Montgomery Multiplication. The parallel CRSA
system has been implemented with multiple distributed cores and the
results justify that CRSA with parallel Montgomery Multiplication can
deliver secure communication among multiple users with high
throughput.
5.2 Hardware Design
The architecture of a 32-bit RSA processor has been designed
that functions on the basis of the proposed Commutative RSA
algorithm. In this work, four 32-bit linear shift registers are employed to
store operands needed for computing 32-bit RSA operations. The
commutative RSA core has been implemented on multiple FPGA
devices for simulation. Data authenticity among multiple user terminals
in a communication environment is illustrated. The implementation of
commutative RSA cryptography core has been simulated with three
individual FPGA devices.
The system has been coded using VHDL with multiple transceivers.
The design has been simulated using Modelsim and the RTL design is
synthesized using Xilinx Design Suite 14.3 targeted on Virtex-5, FPGA
and Vivado 2012.3 Xilinx tool.
In this work, two systems have been developed. One is Serial
Montgomery based Cryptography core and the second is Parallel
Montgomery based cryptography core. The designed system utilizes
31298 out of 44800 (69%) slice registers, 30129 out of 44800 (67%)
slice LUTs. The system clock frequency reported by the Place and
Route tool is 199 MHz for CRSA encryption and decryption and 335
MHz for commutative cryptography core.
5.3 Comparative Analysis for Serial Versus Parallel Montgomery
Multiplication Based CRSA Implementation
The results for Serial Montgomery and Parallel Montgomery
Multiplication based CRSA architectures have been compared in this
section. Considering the performance parameters like memory
occupancy, speed, power consumption, delay and throughput, it has
been found that the Parallel Montgomery based Commutative RSA
performs better when compared to Serial Montgomery based
Commutative RSA realization. The delay of the proposed Parallel
Montgomery based CRSA is 4.4 ns against 5.01 ns for the Serial
Montgomery based CRSA cryptography core, which is about 12% lower.
Similarly, the throughput of the proposed Parallel Montgomery based
CRSA is 887 Kbps against 779 Kbps, which is about 14% higher than
the serial Montgomery based CRSA architecture. In the proposed
design, the penalty in power consumption and area is also very
small. The comparative results for Serial Montgomery based
Commutative RSA and Parallel Montgomery based Commutative RSA
are presented below.
Table 5.1 presents the comparison of area in Serial and Parallel
Montgomery based CRSA Cryptography Core. Figure 5.1 presents the
graphical representation of the same.
Table 5.1 Comparison of Chip Area of Serial and Parallel
Montgomery Based CRSA Cryptography Core
CRYPTOGRAPHY CORE | SERIAL MONTGOMERY Based CRSA | PARALLEL MONTGOMERY Based CRSA
DEVICE | xc5vlx330t-2-ff1738 | xc5vlx330t-2-ff1738
SLICE LUT | 913 | 844
LUT USED AS LOGIC | 913 | 813
OCCUPIED SLICES | 290 | 311
Figure 5.1: Comparison of Chip Area of Serial and Parallel
Montgomery Based CRSA
Resources on the FPGA that can perform logic functions are
defined as Logic resources. They are grouped in slices to create
configurable logic blocks. A slice contains a set number of Look-Up-Tables
(LUTs), flip-flops and multiplexers. An LUT is a collection of logic
gates hard-wired on the FPGA. LUTs store a predefined list of outputs
for every combination of inputs and provide a fast way to retrieve the
output of a logic operation. A flip-flop is a circuit capable of two stable
states and represents a single bit. A multiplexer, also known as a mux,
is a circuit that selects between two or more inputs and outputs the
selected input.
Slices are the basic building block components in the FPGA
fabric. However, each slice contains number of LUTs, flip-flops, and
carry logic elements, which make up the logic of the design before
mapping. After mapping, all the LUTs and flip-flops are packed into
slices, but not necessarily filling the slices. A slice with two LUTs and
two flip-flops may be in use for just one LUT. In the map report, any
slice that is used even partially is counted as a complete "occupied
slice". Hence the percentage of usage of slices is greater than the
larger of LUTs and flip-flops. The design may use about 25% of LUTs
and flip-flops. Owing to sparse packing it can have nearly 50% occupied
slices.
Different FPGA families implement slices and LUTs differently.
For example, a slice on a Virtex-II FPGA has two LUTs and two flip-
flops but a slice on a Virtex-5 FPGA has four LUTs and four flip-flops. In
addition, the number of inputs to an LUT is generally two to six,
depending on the FPGA family.
In Table 5.1, number of slice LUT used for the implementation of
serial Montgomery based CRSA and parallel Montgomery based CRSA
are 913 and 844 respectively. Occupied slices for serial Montgomery
based CRSA and parallel Montgomery based CRSA are 290 and 311
respectively. This table and the graphical representation indicate that
Parallel Montgomery based CRSA occupies only 21 extra slices
compared to serial Montgomery based CRSA. Hence for further
implementation, Parallel Montgomery based CRSA is used.
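The effect of sparse packing discussed above can be quantified from Table 5.1 by comparing the LUTs actually used with the LUT capacity of the occupied slices. The sketch below assumes four LUTs per Virtex-5 slice, as stated in the text; the helper name `packing` is hypothetical.

```python
LUTS_PER_SLICE = 4   # Virtex-5 slice, as stated in the text

def packing(luts_used: int, occupied_slices: int) -> float:
    """Fraction of the LUT positions in the occupied slices that are used."""
    return luts_used / (occupied_slices * LUTS_PER_SLICE)

serial = packing(913, 290)      # Serial Montgomery based CRSA, Table 5.1
parallel = packing(844, 311)    # Parallel Montgomery based CRSA, Table 5.1
print(f"{serial:.1%} {parallel:.1%}")   # prints: 78.7% 67.8%
```

On these figures the parallel core uses fewer LUTs but spreads them over more slices, which is consistent with its 21 additional occupied slices.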
Table 5.2 and Figure 5.2 present the comparison of Power
consumption in Serial and Parallel Montgomery Based Commutative
RSA Cryptography Core.
Table 5.2 Comparison of Power Consumption of Serial and
Parallel Montgomery Based Commutative RSA
Cryptography Core
Cryptography Core | Serial Montgomery Based CRSA | Parallel Montgomery Based CRSA
Device | xc5vlx330t-2-ff1738 | xc5vlx330t-2-ff1738
Static Power (mW) | 3516.7 | 3516.75
Dynamic Power (mW) | 4.76 | 5.72
Total Power (mW) | 3521 | 3522
Figure 5.2: Comparison of Power Consumption of Serial and
Parallel Montgomery Based CRSA
Power consumption of serial Montgomery based CRSA and
parallel Montgomery based CRSA are extracted from the device
utilization summary. Table 5.2 and Figure 5.2 indicate both the methods
consume almost the same power.
Table 5.3 presents the Performance such as Delay, Frequency
and Throughput for Serial and Parallel Montgomery Based CRSA
Cryptography Core. The graphical comparisons are presented in Figure
5.3 and Figure 5.4 for delay and throughput respectively.
Table 5.3: Performance (Delay, Frequency and Throughput)
Comparison for Serial and Parallel Montgomery Based
CRSA Cryptography Core
Cryptography Core | Serial Montgomery Based CRSA | Parallel Montgomery Based CRSA
Device | xc5vlx330t-2-ff1738 | xc5vlx330t-2-ff1738
Frequency (MHz) | 199 | 227
Delay (ns) | 5.01 | 4.4
Throughput (Kbps) | 779 | 887
Figure 5.3: Delay Comparison for Serial and Parallel
Montgomery Based CRSA
Figure 5.4: Throughput Comparison of Serial and Parallel
Montgomery Based CRSA
Maximum frequency of operation and delay are given by the
timing report of the device. Parallel Montgomery based CRSA has
reported less delay compared to serial Montgomery based CRSA. This
indicates that the speed of operation is higher in Parallel Montgomery
based CRSA. Throughput indicates the number of outputs per unit time;
the higher the throughput, the better the performance. The data
presented in Table 5.3 reveal that the Parallel Montgomery based CRSA
performs better than the Serial Montgomery based CRSA. The delay of
the Serial Montgomery based CRSA is 5.01 ns and that of the Parallel
Montgomery based CRSA is 4.4 ns, or about 12% less in the case of
Parallel CRSA. Similarly, the throughput of the Serial Montgomery based
CRSA is 779 Kbps and that of the Parallel Montgomery based CRSA is 887
Kbps, or about 14% better in the latter.
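The relative changes implied by Table 5.3 can be recomputed with the short sketch below, taking the serial core as the baseline. The helper name `pct_change` is an assumption; the input figures are those of Table 5.3.

```python
def pct_change(new: float, old: float) -> float:
    """Percentage change of new relative to the baseline old."""
    return 100.0 * (new - old) / old

delay_serial, delay_parallel = 5.01, 4.4     # ns, Table 5.3
thr_serial, thr_parallel = 779, 887          # Kbps, Table 5.3

delay_reduction = -pct_change(delay_parallel, delay_serial)
throughput_gain = pct_change(thr_parallel, thr_serial)
print(round(delay_reduction, 1), round(throughput_gain, 1))   # prints: 12.2 13.9
```

The two percentages swap if the parallel core is taken as the baseline instead, so the baseline should always be stated alongside the figure.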
Simulation waveforms of encryption and decryption are presented
in the next section.
5.4 Analysis of Simulation Waveforms
Various encryption and decryption data values for each user
location are computed and presented in Tables 5.4 to 5.6. Frequency of
operation in simulation has been set to 100 MHz for convenience. The
data values shown in these three tables are remapped for the VHDL
RTL codes as presented in Table 5.7.
Table 5.4 Data at User Terminal 1
User Terminal 1
Data | Decimal Value | Hexadecimal Value
p_val | 59083 | E6CB
q_val | 33223 | 81C7
n_pram | 1962914509 | 74FFB2CD
e_pram | 699776239 | 29B5BCEF
data_pram | 7487875 | 724183
cypher | 848084699 | 328CBEDB
d_pram | 1389794659 | 52D69563
Original plain text after decoding | 7487875 | 724183
Table 5.5 presents the data at user terminal 2.
Table 5.5 Data at User Terminal 2
User Terminal 2
Data | Decimal Value | Hexadecimal Value
p_val | 59083 | E6CB
q_val | 33223 | 81C7
n_pram | 1962914509 | 74FFB2CD
e_pram | 1154032391 | 44C92307
data_pram | 848084699 | 328CBEDB
cypher | 752490942 | 2CDA19BE
d_pram | 1608356723 | 5FDD9373
Original plain text after decoding | 848084699 | 328CBEDB
Table 5.6 presents the data at user terminal 3.
Table 5.6 Data at User Terminal 3
User Terminal 3
Data | Decimal Value | Hexadecimal Value
p_val | 59083 | E6CB
q_val | 33223 | 81C7
n_pram | 1962914509 | 74FFB2CD
e_pram | 627898457 | 256CF859
data_pram | 752490942 | 2CDA19BE
cypher | 553018001 | 20F66291
d_pram | 1057410797 | 3F06CEED
Original plain text after decoding | 752490942 | 2CDA19BE
Tables 5.4 to 5.6 present the data at user terminal 1, user
terminal 2 and user terminal 3. Prime numbers generated using LFSR
are designated as p_val and q_val. The parameters “n” and “φ(n)” are
computed. Parameter “n” is designated as “n_pram” in tables.
Encryption key is designated as “e_pram”. Encryption key for user
terminal 1 is computed as “699776239” in decimal and “29B5BCEF” in
Hexadecimal. Similarly for user terminal 2 and user terminal 3
encryption keys are “1154032391” in decimal, “44C92307” in
hexadecimal and “627898457” in decimal, “256CF859” in hexadecimal
respectively. “data_pram” is the original data required for encryption.
“7487875” is the original data for user terminal 1. This original data is
encrypted using an encryption key “e_pram”. The encrypted output data
is shown as “cypher” in the table and at user terminal 1 the “cypher”
value is “848084699”. This encrypted data at user terminal 1 is the
“data_pram” for user terminal 2, which is shown in table 5.5.
“752490942” is the “cypher” of user terminal 2 and is the “data_pram”
for user terminal 3, which is shown in Table 5.6. “cypher” of user
terminal 3 is obtained after encrypting its “data_pram” and is shown as
“20F66291”.
The decryption key is different from the encryption key and is
shown as “d_pram”. Decryption key for user terminal 1 is computed as
“1389794659” in decimal or “52D69563 H”. Similarly, the decryption keys
at user terminal 2 and user terminal 3 are “1608356723” in decimal or
“5FDD9373 H”, and “1057410797” in decimal or “3F06CEED H” respectively.
The decrypted output is shown as “Original plain text after
decoding” in the table. Decoding starts at user terminal 3 and the value
obtained is “752490942” in decimal. This data is decrypted at user
terminal 2 and the obtained result is “848084699”. The decrypted data of
user terminal 2 is now decrypted at user terminal 1 and the result is
“7487875”. It may be noted that the encrypted data of User terminal 1
is the input data for User terminal 2 and so on, and similarly for the
decryption. At user terminal 1, “data_pram” and “Original plain text
after decoding” are the same, and likewise at user terminal 2 and user
terminal 3. This demonstrates the commutative nature of the algorithm.
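The values in Tables 5.4 to 5.6 can be cross-checked in software with the sketch below. It confirms the modulus and its hexadecimal form, confirms that each terminal's cipher text is the next terminal's input, and then prints whether the tabulated cipher and plain text values are reproduced by direct modular exponentiation. No outcome is assumed for that last check; the tuple layout and variable names are hypothetical.

```python
p, q = 59083, 33223                     # p_val and q_val from the tables
n = p * q
assert n == 1962914509                  # n_pram
assert format(n, "X") == "74FFB2CD"     # hexadecimal form of n_pram

# (e_pram, d_pram, data_pram, cypher) for user terminals 1, 2 and 3
terminals = [
    (699776239, 1389794659, 7487875, 848084699),
    (1154032391, 1608356723, 848084699, 752490942),
    (627898457, 1057410797, 752490942, 553018001),
]

# The cipher text of one terminal is the data input of the next one.
for prev, nxt in zip(terminals, terminals[1:]):
    assert prev[3] == nxt[2]

for idx, (e, d, data, cypher) in enumerate(terminals, start=1):
    enc_ok = pow(data, e, n) == cypher      # does C = M^e mod n reproduce?
    dec_ok = pow(cypher, d, n) == data      # does M = C^d mod n reproduce?
    print(f"terminal {idx}: encryption matches={enc_ok}, decryption matches={dec_ok}")
```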
Table 5.7 presents the data mapping between equations
presented and data used in coding to obtain the waveforms, presented
in the next section.
Table 5.7 Data Mapping
Designation of Data in Equations | Designation of Data in Waveforms
$p$ | p_val
$q$ | q_val
$n$ | n_pram
$e$ | e_pram
$M$ | data_pram
$C$ | Cypher
$d$ | d_pram
$M$ (recovered) | Originalplaintext
In Table 5.7, the first column presents the data designated in
equations which were discussed in detail in Chapter 4. The second
column presents the same data designated in waveforms.
$p$ is designated as “p_val” in simulation waveforms. Similarly,
$q$ is shown as “q_val”, $n$ as “n_pram”, $e$ as
“e_pram”, $M$ as “data_pram”, $C$ as “Cypher”, $d$ as
“d_pram”, and the recovered $M$ as “Originalplaintext” in equations
and waveforms respectively.
Simulation waveforms of encryption and decryption at each user
terminal are presented below. For convenience, the frequency of
operation in simulation has been set to 100 MHz, although any other
value can be set. It may be noted that the “reset” signal is active high
here, being the exact complement of “reset_l” shown elsewhere. The
original data required for encryption is shown as “data_pram” in Figures
5.5 to 5.7. The encryption key is shown as “e_pram” and the
encrypted output data is shown as “cypher”. The encryption starts at
150 ns and completes processing at 10035 ns for User terminal 1 as
shown in simulation waveforms presented in Figure 5.5. Similarly, for
User terminal 2, encryption commences at time 10035 ns and ends at
22195 ns, whereas for User terminal 3, start and end times are 22195
ns and 34115 ns respectively as presented in Figure 5.6 and Figure 5.7.
The decryption timings for the three user terminals are presented in
Figure 5.8 to Figure 5.10.
The encrypted data are input as “incipher” for Decryption
processing. The decryption key is different from the encryption key and
is shown as “d_pram” and the decrypted output is shown as
“originalplaintext” in the waveforms. It may be noted that the encrypted
data of User 1 is the input data for User 2 and so on, and similarly
for the decryption. For example, User 1 data “7487875” is encrypted as
shown in Figure 5.5 and decrypted as shown in Figure 5.10 recovering
the same data. This proves the commutative nature of the algorithm.
From the waveforms, it can be seen that each of the encryption and the
decryption process takes 50 clock cycles or 0.5 µs at 100 MHz.
The Synthesis, Place and Route have been run on the RTL
design. The design was synthesized using Xilinx Design Suite 14.3
targeted on Virtex-5, xc5vfx70t-2ff1136 FPGA. The design for both the
encryption and the decryption utilizes about 67% of the chip resources
as presented in Table 5.8. Although only 100 MHz was used during
Simulation, the maximum operating frequency reported by the Xilinx
Design Suite 14.3 tool is 199 MHz for CRSA encryption and decryption
and 335 MHz for commutative cryptography core as presented in Table
5.9.
Figure 5.5: Simulation Waveform of CRSA Encryption at
User Terminal 1
Figure 5.6: Simulation Waveform of CRSA Encryption at
User Terminal 2
Figure 5.7: Simulation Waveform of CRSA Encryption at
User Terminal 3
Figure 5.8: Simulation Waveform of CRSA Decryption at
User Terminal 3
Figure 5.9: Simulation Waveform of CRSA Decryption at
User Terminal 2
Figure 5.10: Simulation Waveform of CRSA Decryption at
User Terminal 1
5.5 RTL View of Architecture for Commutative RSA
Top level RTL schematics of Commutative RSA with key
generation using ISE Xilinx tool and Vivado Xilinx tool are presented in
Figure 5.11 and Figure 5.12 respectively.
Figure 5.11: Top Level RTL View of Key Generation Using
ISE 14.3 Xilinx Tool
Figure 5.12: Top Level RTL View of Key Generation Using
Vivado 2012.3 Xilinx Tool
Top level RTL view presents inputs and outputs of a module and
is extracted from the Xilinx tool after synthesis.
CRSA_KGCORE_KGENTOP is the key generation module name.
seed(511:0), ce, clk and reset_I are the inputs. d_val(1023:0),
e_val(1023:0), fi_out(1023:0), n_out(1023:0) and key_out are the
outputs. Seed (511:0) is 512 bits input applied to produce the encryption
and decryption keys. “ce” is the chip enable input and it is an
asynchronous active high input. For every positive edge trigger of the
clock, reset_I is checked. If it is high, then the output will be in reset
state. “reset_I” is an asynchronous active high input, the component
works only if this input is low. If it is high, the output will be in reset
state. After computing the data, it produces the encryption key,
e_val(1023:0) of 1024 bits, decryption key, d_val(1023:0) of 1024 bits,
parameters fi_out(1023:0) and n_out(1023:0) of 1024 bits each. High
output on key_out indicates that the computation is completed and the
keys are obtained.
Second level RTL view of Commutative RSA with key generation
using ISE Xilinx tool and Vivado Xilinx tool are presented in Figure 5.13
and Figure 5.14 respectively. Second level RTL view presents the
internal details of the module. Five sub-modules are seen in Figure
5.13 with their inputs and outputs. Two modules for checking the prime
number, represented as KEYGEN_CRSA_KGCORE_CHKPVAL and
KEYGEN_CRSA_KGCORE_CHKQVAL, are the first two blocks.
KEYGEN_CRSA_KGCORE_GEN_N_PHI is for computing “n” and
“φ(n)”. KEYGEN_CRSA_E_D_VAL_CMPTE is for computing encryption
key and decryption key. KEYGEN_CRSA_MAIN_MEM_MGMT is for
memory management. Second level RTL view also presents the
interconnection between each block by showing the intermediate inputs
and outputs of the sub-modules. It is observed that ISE 14.3 Xilinx Tool
presents the sub-modules as blocks and Vivado 2012.3 Xilinx Tool
presents the sub-modules as multiplexers and registers.
Figure 5.13: Second Level RTL View of CRSA Key Generation
Using ISE 14.3 Xilinx Tool
Figure 5.14: Second level RTL view of CRSA Key Generation
Using Vivado 2012.3 Xilinx Tool
RTL schematics of Linear Feedback Shift Register for generating
pseudo random numbers using Vivado Xilinx tool are presented in
Figure 5.15 (a) to Figure 5.15 (g). To generate a 512-bit random
number, 512 registers are used. The interconnection of signals
between these registers is shown in Figures 5.15 (a) to 5.15 (g).
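The behaviour of a Fibonacci LFSR can be illustrated with a much smaller register than the 512-bit one used in the design. The sketch below uses a 16-bit register with feedback taps at bits 16, 14, 13 and 11, a commonly cited maximal-length configuration; the width, taps and seed are assumptions chosen for illustration and are not the taps of the thesis design.

```python
def fibonacci_lfsr(seed: int, count: int):
    """Yield successive states of a 16-bit Fibonacci LFSR.

    Feedback polynomial x^16 + x^14 + x^13 + x^11 + 1 (illustrative only).
    """
    state = seed & 0xFFFF
    if state == 0:
        raise ValueError("seed must be non-zero")   # all-zero state locks up
    for _ in range(count):
        # XOR of the tapped bits becomes the new most significant bit.
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
        yield state

states = list(fibonacci_lfsr(0xACE1, 65535))
assert states[-1] == 0xACE1          # sequence returns to the seed
assert len(set(states)) == 65535     # every non-zero state visited once
```

Each state is obtained from the previous one by a shift and a single XOR chain, which is why the hardware reduces to a row of registers and a few gates.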
Figure 5.15 (a) to (g): RTL View of LFSR Using Vivado 2012.3 Xilinx Tool
RTL View of checking the prime number using Vivado Xilinx tool
is presented in Figure 5.16. Sub modules are shown as multiplexers
and registers.
Figure 5.16: RTL View of Checking the Prime Number
Table 5.8 presents Device utilization of encryption, decryption
and CRSA key generation using ISE 14.3 Xilinx tool.
Table 5.8 Device Utilization of Encryption, Decryption and
Key Generation RTL Designs
Device Utilization | Utilized | Available | Utilization (%)
Number of Slice Registers | 31298 | 44800 | 69
Number of Slice LUTs | 30129 | 44800 | 67
Number of Bonded IOBs | 132 | 172 | 76
Number of fully used LUT-FF pairs | 20258 | 41169 | 49
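The utilization percentages of Table 5.8 follow directly from the utilized and available counts. The sketch below recomputes them with integer division, on the assumption that the reported figures are truncated rather than rounded; the dictionary keys are descriptive labels only.

```python
rows = {
    "Slice Registers": (31298, 44800),
    "Slice LUTs": (30129, 44800),
    "Bonded IOBs": (132, 172),
    "Fully used LUT-FF pairs": (20258, 41169),
}

# Integer division truncates, which matches the percentages in the table.
percent = {name: 100 * used // avail for name, (used, avail) in rows.items()}
for name, value in percent.items():
    print(f"{name}: {value}%")
```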
1. The RTL VHDL codes were first run in the ISE design tool
targeting the xc5vlx20t-2ff323 device. It was found that the
utilization was only 7%, which means that the design consumes very
little of this device: only 7% of the slice LUTs were utilized and
93% of the resources remained unused. The device utilization
summary is presented in Figure 5.17.
Figure 5.17: Device Utilization summary Using ISE 14.3 Design
Tool xc5vlx20t-2ff323
2. Thereafter, the code was synthesized in the ISE design tool
targeting the xc5vsx50t-2ff1136 device. The utilization was found
to be 95%, meaning that the slice LUTs were used efficiently.
However, this leaves no room for further extension of the design.
The device utilization summary is presented in Figure 5.18.
Figure 5.18: Device Utilization Summary Using ISE 14.3
Design Tool xc5vsx50t-2ff1136
3. Based on the above two points, the code was then synthesized in
the ISE design tool targeting the xc5vfx70t-2ff1136 device. This
revealed 67% utilization of the slice LUTs. Not only are the slice
LUTs utilized efficiently, but further enhancements are also
possible. The device utilization summary is presented in Figure 5.19.
Figure 5.19: Device Utilization Summary Using ISE 14.3
Design Tool xc5vfx70t-2ff1136
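The three trials above amount to choosing a device whose utilization is neither too low nor too high. The following Python sketch expresses that reasoning; the function name and the 50% to 80% acceptance band are assumptions introduced only for illustration, while the utilization figures are those reported in points 1 to 3.

```python
def pick_devices(utilization, low=50, high=80):
    """Return the devices whose LUT utilization (%) lies within [low, high].

    low and high are assumed thresholds: below low the device is largely
    unused, above high there is little room left to extend the design.
    """
    return [name for name, percent in utilization.items()
            if low <= percent <= high]


# Slice LUT utilization reported for the three target devices.
LUT_UTILIZATION = {
    "xc5vlx20t-2ff323": 7,
    "xc5vsx50t-2ff1136": 95,
    "xc5vfx70t-2ff1136": 67,
}

print(pick_devices(LUT_UTILIZATION))
```

Under these assumed thresholds only xc5vfx70t-2ff1136 qualifies, which matches the device finally selected.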
Table 5.9 presents the timing report of CRSA with key generation,
encryption and decryption.
Table 5.9 Timing Report for CRSA Using ISE 14.3
RTL Design                            Maximum Frequency (MHz)
CRSA with Key Generation              335
CRSA with Encryption and Decryption   199
It may be noted that key generation is time-consuming, whereas
encryption and decryption, optimized using a parallelized higher-radix
Montgomery implementation, achieve an efficient processing time. The
parallel Montgomery algorithm combined with the enhanced key generation
approach has exhibited good results in terms of both memory occupancy
and throughput, which makes this approach suitable for real-time
applications in particular. In addition, optimizing key generation
with commutative characteristics has greatly reduced the key exchange
overheads.
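The arithmetic at the heart of a Montgomery multiplier can be illustrated with a short software model. The Python sketch below implements the simple bit-serial (radix-2) form, not the parallel higher-radix architecture of this work; the function names and the small modulus used in the demonstration are assumptions chosen for readability.

```python
def montgomery_multiply(a, b, n, k):
    """Return a * b * 2^(-k) mod n for odd n < 2^k and 0 <= a, b < n.

    Each iteration adds one partial product, makes the sum even by
    adding n when needed, and halves it, so no trial division is used.
    """
    s = 0
    for i in range(k):
        if (a >> i) & 1:
            s += b
        if s & 1:
            s += n
        s >>= 1
    # The running sum stays below 2n, so one subtraction suffices.
    if s >= n:
        s -= n
    return s


def mod_mul(a, b, n, k):
    """Ordinary a * b mod n computed through the Montgomery domain."""
    r2 = (1 << (2 * k)) % n                    # R^2 mod n, with R = 2^k
    a_m = montgomery_multiply(a, r2, n, k)     # a * R mod n
    b_m = montgomery_multiply(b, r2, n, k)     # b * R mod n
    p_m = montgomery_multiply(a_m, b_m, n, k)  # a * b * R mod n
    return montgomery_multiply(p_m, 1, n, k)   # leave the Montgomery domain


# Assumed toy parameters: n = 61 * 53 = 3233 fits in k = 12 bits.
print(mod_mul(1234, 2345, 3233, 12), (1234 * 2345) % 3233)
```

The two printed values agree. In an exponentiation the conversions into and out of the Montgomery domain are performed once, so their cost is amortized over many multiplications.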
Table 5.10 presents the FPGA resource consumption of the RTL VHDL
design. After synthesis, a report is generated that provides
information on the speed and size of the implemented design. The
Device Utilization Summary section provides information on the number
of slices used. This metric is the most important measure of the size
of the design in hardware. The designed system utilizes 31298 out of
44800 (69%) slice registers and 30129 out of 44800 (67%) slice LUTs.
Table 5.10 FPGA Resource Consumption of the RTL VHDL Design
Device utilization summary:
---------------------------
Selected Device : 5vfx70tff1136-2
Slice Logic Utilization:
 Number of Slice Registers:            31298 out of 44800   69%
 Number of Slice LUTs:                 30129 out of 44800   67%
    Number used as Logic:              30129 out of 44800   67%
Slice Logic Distribution:
 Number of LUT Flip Flop pairs used:   41169
   Number with an unused Flip Flop:     9871 out of 41169   23%
   Number with an unused LUT:          11040 out of 41169   26%
   Number of fully used LUT-FF pairs:  20258 out of 41169   49%
   Number of unique control sets:         26
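The figures in this report can be cross-checked with a few lines of arithmetic. The Python sketch below is a hypothetical helper, not part of the tool flow; it assumes that the reported percentages are truncated to whole numbers, which is consistent with every row of the table.

```python
def utilization_percent(used, available):
    """Whole-number utilization, truncated as the reported figures appear to be."""
    return used * 100 // available


# (used, available, reported %) taken from the device utilization summary.
ROWS = {
    "Slice Registers": (31298, 44800, 69),
    "Slice LUTs": (30129, 44800, 67),
    "Unused Flip Flop": (9871, 41169, 23),
    "Unused LUT": (11040, 41169, 26),
    "Fully used LUT-FF pairs": (20258, 41169, 49),
}

for name, (used, available, reported) in ROWS.items():
    print(name, utilization_percent(used, available), reported)

# The three LUT-FF pair categories should account for every pair used.
print(9871 + 11040 + 20258)
```

The three LUT-FF pair categories add up to 41169, the total number of LUT flip-flop pairs used.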
The generated timing reports are presented in Figures 5.20 to
5.22. The system clock frequency reported by the Place and Route tool
is 199 MHz for CRSA encryption and decryption and 335 MHz for the
commutative cryptography core.
Timing Summary:
---------------
Speed Grade: -2
Minimum period: 5.011ns (Maximum Frequency: 199 MHz)
Minimum input arrival time before clock: 2.402ns
Maximum output required time after clock: 2.965ns
Figure 5.20: Timing Report for Commutative Encryption and
Decryption Using ISE 14.3 Xilinx Tool
Timing Summary:
---------------
Speed Grade: -2
Minimum period: 2.978ns (Maximum Frequency: 335 MHz)
Minimum input arrival time before clock: 0.991ns
Maximum output required time after clock: 2.826ns
Maximum combinational path delay: 0.468ns
Figure 5.21: Timing Report for CRSA with Key Generation Using
ISE 14.3 Xilinx Tool
Timing Summary:
---------------
Speed Grade: -2
Minimum period: 3.422ns (Maximum Frequency: 292 MHz)
Minimum input arrival time before clock: 1.403ns
Maximum output required time after clock: 0.717ns
Maximum combinational path delay: 0.930ns
Figure 5.22: Timing Report for CRSA with Key Generation Using
Vivado 2012.3 Xilinx Tool
5.5.1 Rationale for 199 MHz
As per the design, the requested frequency of operation is 100
MHz, to match the simulation frequency, whereas synthesis yielded a
much faster clock of 199 MHz, as shown in Figure 5.20. In terms of
clock periods, the requested and reported values are 10 ns and
5.011 ns respectively. The difference, known as the slack time, is
4.989 ns. The slack time must be positive; otherwise, the device
cannot meet the requested frequency of operation. Similarly, the
timing report for CRSA with key generation using the ISE 14.3 Xilinx
tool in Figure 5.21 indicates a minimum period of 2.978 ns (maximum
frequency: 335 MHz), giving a slack time of 7.022 ns. The timing
report for CRSA with key generation using the Vivado 2012.3 Xilinx
tool in Figure 5.22 indicates a minimum period of 3.422 ns (maximum
frequency: 292 MHz), giving a slack time of 6.578 ns. In all three
cases, the slack time is positive. Hence the device meets at least
the requested frequency of operation. In the actual FPGA
implementation of the CRSA encryption and decryption processors, a
clock of approximately 200 MHz is expected to be used, thereby
achieving roughly double the throughput of the simulation.
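The relationship between the minimum period, the maximum frequency and the slack time can be reproduced with a short calculation. The Python sketch below uses hypothetical helper names; the 100 MHz request and the three minimum periods are the values quoted in this section.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


def slack_ns(requested_mhz, min_period_ns):
    """Requested clock period minus the achieved minimum period, in ns."""
    return 1000.0 / requested_mhz - min_period_ns


REQUESTED_MHZ = 100.0
MIN_PERIODS_NS = {
    "Figure 5.20 (ISE, encryption and decryption)": 5.011,
    "Figure 5.21 (ISE, key generation)": 2.978,
    "Figure 5.22 (Vivado, key generation)": 3.422,
}

for report, period in MIN_PERIODS_NS.items():
    print(report,
          int(max_frequency_mhz(period)), "MHz,",
          round(slack_ns(REQUESTED_MHZ, period), 3), "ns slack")
```

A positive result from slack_ns means the requested clock can be met; a negative result would indicate a timing violation at the requested frequency.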
Summary
An architecture for Commutative RSA with key generation has been
designed and implemented. This chapter presented the simulation
results obtained at three transceiver terminals for the commutative
cryptography core, a comparative analysis of the serial and parallel
Montgomery multipliers in terms of area, throughput and delay, the
device utilization summary, the timing summary, and the top-level and
second-level RTL views.
Chapter 6
Conclusions and Scope for Future Work
6.1 Conclusions
In the present work, data security algorithms and architectures
were designed for multiple-input multiple-output or multi-transceiver
systems. The commutative RSA approach has been implemented with
multiple FPGA cores that function as individual transceiver terminals,
each performing its own encryption and decryption to recover the
original data. The architectures were realized using the Hardware
Description Language VHDL, conforming to RTL coding guidelines.
The complete design of Commutative RSA with key generation
was validated using the ModelSim simulator, and the Synthesis and
Place and Route results were obtained by implementing the design
using Xilinx Design Suite 14.3 and Vivado 2012.3.
6.2 Contributions of This Work
The main contribution of this work was to design a security
system for multi-party communications. The major contributions of this
work may be summarized as follows:
1. A public key cryptographic scheme was designed that introduces
commutativity into RSA encryption and decryption. For
multiple-input multiple-output communications, three users were
considered. User 1 generates the “original message”, encrypts it
using its private key (encryption key) and sends it to User 2.
User 2 in turn encrypts the message received from User 1 and
transmits it to User 3. Thereafter, User 3 decrypts the
message twice, using the public keys of User 2 and User 1
respectively, to recover the “original message”. Any authenticated
user in the multi-party system can decrypt the message, if he/she
knows the order of encryption.
2. Two versions of the Montgomery multiplier, serial and parallel,
were designed, and their performance was compared with respect to
chip area, power consumption, delay, throughput and frequency of
operation. The delay of the proposed Parallel Montgomery based
CRSA is 13.8% lower than that of the Serial Montgomery based CRSA
cryptography core. Similarly, the throughput of the proposed
Parallel Montgomery based CRSA is 12.1% higher than that of the
Serial Montgomery based CRSA architecture. The Parallel Montgomery
multiplier thus exhibits better performance than the Serial
Montgomery multiplier.
3. A Commutative RSA cryptosystem was designed using the Parallel
Montgomery multiplier. In order to avoid key exchange
overheads, each user generates both a public key and a private
key. The private key is used for encrypting the data and the
public key is used for decrypting the data. A classical method
may be used to share the public key among the authenticated users
in the group.
4. The Commutative RSA with key generation architectures were coded
in VHDL conforming to RTL coding guidelines, which is essential
for obtaining a synthesizable, working design.
5. Suitable test benches were developed using VHDL and the
complete design was functionally tested using Modelsim.
6. Synthesis and Place and Route results were obtained by
implementing the RTL design using Xilinx Design Suite 14.3. The
designed system utilizes 31298 out of 44800 (69%) slice
registers and 30129 out of 44800 (67%) slice LUTs. The system
clock frequency reported by the Place and Route tool is 199 MHz
for CRSA encryption and decryption and 335 MHz for the
Commutative Cryptography core.
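The three-user exchange described in the first contribution can be traced with a small numerical model. The Python sketch below is only an illustration under stated assumptions: the function names are hypothetical, the primes are toy values, and all users are assumed to share one modulus so that their operations commute. It follows the convention of this work, in which the private key encrypts and the public key decrypts.

```python
from math import gcd

# Assumed toy parameters shared by all users (far too small to be secure).
P, Q = 61, 53
N = P * Q
PHI = (P - 1) * (Q - 1)


def make_key_pair(public_exponent):
    """Return (public, private) exponents for the shared modulus N."""
    if gcd(public_exponent, PHI) != 1:
        raise ValueError("public exponent must be coprime to phi(N)")
    # Three-argument pow with exponent -1 needs Python 3.8 or later.
    return public_exponent, pow(public_exponent, -1, PHI)


def apply_key(value, exponent):
    """Modular exponentiation, used for both encryption and decryption."""
    return pow(value, exponent, N)


def three_user_exchange(message, user1_keys, user2_keys):
    """User 1 and User 2 encrypt in turn; User 3 decrypts twice."""
    public1, private1 = user1_keys
    public2, private2 = user2_keys
    cipher1 = apply_key(message, private1)    # User 1 encrypts
    cipher2 = apply_key(cipher1, private2)    # User 2 encrypts again
    partial = apply_key(cipher2, public2)     # User 3 removes User 2's layer
    return apply_key(partial, public1)        # User 3 removes User 1's layer


user1 = make_key_pair(17)
user2 = make_key_pair(7)
print(three_user_exchange(65, user1, user2))
```

Because both layers are exponentiations modulo the same N, applying the two private keys in either order yields the same ciphertext; this is the commutative property on which the scheme relies.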
The developed system can be used for sending or receiving
stereo audio channels at sampling rates of up to 44.1 kHz. It can
also be used for transmitting or receiving video signals at up to
360 Megabytes per second.
6.3 Scope for Future Work
In the present work, only three user terminals were considered.
Future work may consider larger group sizes. Public keys may be
generated offline and stored in different databases to increase the
speed of the RSA algorithm. For group communications, novel
algorithms may be designed for updating the keys when a new member
enters the group or an existing member leaves the group.
Vinayaka Missions University, Salem 132
References
[1] W. Diffie and M. Hellman, “New Directions in Cryptography”, IEEE
Transactions on Information Theory, pp. 644-654, 1976.
[2] R. Rivest, A. Shamir and L. Adleman, “A Method for Obtaining
Digital Signatures and Public Key Cryptosystems”,
Communications of ACM 21, pp. 120-125, 1978.
[3] A. Menezes, P. van Oorschot and S. Vanstone, “Handbook of
Applied Cryptography”, CRC Press, Oct. 1996.
[4] D. Boneh, “Twenty Years of Attacks on the RSA Cryptosystem”,
Notices of the American Mathematical Society, Vol. 46(2), pp.
203-213, 1999.
[5] G. R. Blakley, “A computer algorithm for the product AB modulo
M”, IEEE Transactions on Computers, Vol.32, No.5, pp. 497-500,
May 1983.
[6] P. L. Montgomery, “Modular multiplication without trial division”,
Math. of Computation, Vol. 44, No. 170, pp. 519-521, April 1985.
[7] C. K. Koc, “High-speed RSA Implementation”, Technical Report,
RSA Laboratories, Nov. 1994.
[8] Cilardo A., Mazzeo A., Romano L., Saggese G.P., "Carry-save
Montgomery modular exponentiation on reconfigurable hardware",
Proceedings of Design, Automation and Test in Europe
Conference and Exhibition, Vol. 3, pp. 206-211, Feb. 2004.
Vinayaka Missions University, Salem 133
[9] Bo Song, Kawakami K., Nakano K. and Ito Y., "An RSA Encryption
Hardware Algorithm Using a Single DSP Block and a Single Block
RAM on the FPGA", First International Conference on Networking
and Computing, pp. 140-147, Nov. 2010.
[10] Shand M., Vuillemin J., "Fast implementations of RSA
cryptography", 11th Symposium on Computer Arithmetic,
Proceedings, pp. 252-259, Jun-Jul 1993.
[11] Koji Nakano, Kawakami K. and Shigemoto K., "RSA encryption
and decryption using the redundant number system on the FPGA",
2009 IEEE International Symposium on Parallel & Distributed
Processing, pp. 1-8, May 2009.
[12] Daniel Mesquita , Guilherme Perin , Fernando Luís Herrmann ,
João Baptista Martins, “An efficient implementation of
Montgomery powering ladder in reconfigurable hardware”. Pp.
121-126, Jan. 2010.
[13] Xuewen Tan, Yunfei Li, "Parallel Analysis of an Improved RSA
Algorithm", International Conference on Computer Science and
Electronics Engineering, Vol. 1, pp. 318-320, March 2012.
[14] Changxing Lin, Jian Zhang, Beibei Shao, "A High Speed Parallel
Timing Recovery Algorithm and Its FPGA Implementation", 2011
2nd International Symposium on Intelligence Information
Processing and Trusted Computing, pp. 63-66, Oct. 2011.
Vinayaka Missions University, Salem 134
[15] Suli Wang, Ganlai Liu, "File Encryption and Decryption System
Based on RSA Algorithm", International Conference on
Computational and Information Sciences, pp. 797-800, Oct. 2011.
[16] Wenjun Fan, Xudong Chen, Xuefeng Li, "Parallelization of RSA
Algorithm Based on Compute Unified Device Architecture", Ninth
International Conference on Grid and Cooperative Computing, pp.
174-178, Nov. 2010.
[17] Ljupco Kocarev, Marjan Sterjev, Paolo Amato, P., "RSA ncryption
algorithm based on torus automorphisms", Proceedings of the
2004 International Symposium on Circuits and Systems, Vol. 4,
pp. 577-580, May 2004.
[18] Koç C.K., Tolga Acar, Kaliski, B.S. Jr., "Analyzing and comparing
Montgomery multiplication algorithms”, IEEE Micro, Vol. 16, pp.
26-33, Jun 1996.
[19] Xin Zhou, Xiaofei Tang, "Research and implementation of RSA
algorithm for encryption and decryption", 6th International Forum
on Strategic Technology, Vol. 2, pp. 1118-1121, Aug. 2011.
[20] Jiang Huiping, Yang Guosheng, "Resistant against power analysis
for a fast parallel high-radix RSA algorithm", International
Conference on Electric Information and Control Engineering, pp.
1668-1671, April 2011.
[21] Sami A. Nagar, Saad Alshamma, "High speed implementation of
RSA algorithm with modified keys exchange", 6th International
Vinayaka Missions University, Salem 135
Conference on Sciences of Electronics, Technologies of
Information and Telecommunications, pp. 639-642, March 2012.
[22] Sining Liu, King B., Wang Wei, "A CRT-RSA Algorithm Secure
against Hardware Fault Attacks", 2nd IEEE International
Symposium on Dependable, Autonomic and Secure Computing,
pp. 51- 60, Sept-Oct. 2006.
[23] Abdullah AI Hasib and Abdul Ahsan Md. Mahmudul Haque, "A
Comparative Study of the Performance and Security Issues of
AES and RSA Cryptography", Third 2008 International
Conference on Convergence and Hybrid Information Technology,
Vol. 2, pp. 505-510, Nov. 2008.
[24] Minni, Rohit, Sultania, Kaushal, Mishra, Saurabh, Vincent, Durai
Raj, "An algorithm to enhance security in RSA", Fourth
International Conference on Computing, Communications and
Networking Technologies, pp. 1-4, July 2013.
[25] Al-Hamami A.H., Aldariseh I.A., "Enhanced Method for RSA
Cryptosystem Algorithm", 2012 International Conference on
Advanced Computer Science Applications and Technologies, pp.
402-408, Nov. 2012.
[26] Dahui Hu, Zhiguo Du, "An improved Kerberos protocol based on
fast RSA algorithm", IEEE International Conference on
Information Theory and Information Security, pp. 274-278, Dec.
2010.
Vinayaka Missions University, Salem 136
[27] Doroeviae G., Unkasevia T., Markoviae M., "Optimization of
modular reduction procedure in RSA algorithm implementation on
assembler of TMS320C54x signal processors", 14th International
Conference on Digital Signal Processing, pp. 811-814, 2002.
[28] Na Qi Jing Pan Quan Ding, "The Implementation of FPGA-based
RSA Public-key Algorithm and its Application in Mobile-phone
SMS Encryption System", First International Conference on
Instrumentation, Measurement, Computer, Communication and
Control, pp. 700- 703, Oct. 2011.
[29] Perovic, N.S., Popovic-Bozovic, M., "FPGA implementation of
RSA cryptoalgorithm using shift and carry algorithm", 20th
Telecommunications Forum, pp. 1040-1043, Nov. 2012.
[30] Iana G.V., Anghelescu P., Serban G., "RSA encryption algorithm
implemented on FPGA", International Conference on Applied
Electronics, pp. 1-4, Sept. 2011.
[31] Nibouche O., Nibouche M., Bouridane A.and Belatreche A., "Fast
architectures for FPGA-based implementation of RSA encryption
algorithm", Proceedings of IEEE International Conference on
Field-Programmable Technology, pp. 271-278, Dec. 2004.
[32] Nibouche O., Nibouche M., Bouridane A., "High speed FPGA
implementation of RSA encryption algorithm", Proceedings of the
10th IEEE International Conference on Electronics, Circuits and
Systems, pp. 204-207, Dec. 2003.
Vinayaka Missions University, Salem 137
[33] Chhabra A., Mathur S., "Modified RSA Algorithm: A Secure
Approach, International Conference Computational Intelligence
and Communication Networks, pp. 545-548, Oct.2011.
[34] Hariri A., Reyhani-Masoleh A., "Bit-Serial and Bit-Parallel
Montgomery Multiplication and Squaring over GF(2m)", IEEE
Transactions on Computers, Vol. 58, No. 10, pp. 1332-1345, Oct.
2009.
[35] Miguel Morales-Sandoval, Arturo Díaz-Pérez, “Scalable GF(p)
Montgomery multiplier based on a digit–digit computation
approach”, IET Computers & Digital Techniques 10(3) , Sept.
2015.
[36] Perin G., Daniel G. Mesquita, Fernado L. Herrmann, Martins J.B.,
"Montgomery modular multiplication on reconfigurable hardware:
Fully systolic array vs parallel implementation", Programmable
Logic Conference, 2010 pp. 61-66, March 2010.
[37] Ali Ziya Alkar, Remziye So¨nmez “A hardware version of the RSA
using the Montgomery‟s algorithm with systolic arrays”,
INTEGRATION, the VLSI journal 38, pp. 299–307, 2004.
[38] Guilherme Perin, Daniel GomesMesquita, and Jo˜ao
BaptistaMartins, “Montgomery Modular Multiplication on
Reconfigurable Hardware: Systolic versus Multiplexed
Implementation”, International Journal of Reconfigurable
Computing, pp. 1-10, Nov. 2011.
Vinayaka Missions University, Salem 138
[39] M. Poolakkaparambil, J. Mathew, A. M. Jabir, and D. K. Pradhan,
"A dynamically error correctable bit parallel Montgomery multiplier
over binary extension fields", 2011 20th European Conference on
Circuit Theory and Design, pp. 600-603, Aug. 2011.
[40] Jean-Claude Bajard, Imbert L., Graham A. Jullien, "Parallel
Montgomery multiplication in GF(2k) using trinomial residue
arithmetic", Proceedings of the 17th IEEE Symposium on
Computer Arithmetic, pp. 164-171, June 2005.
[41] Chiou-Yng Lee, Chin-Chin Chen, Erl-Huei Lu, "Compact Bit-
Parallel Systolic Montgomery Multiplication Over GF(2m)
Generated by Trinomials", TENCON 2006. 2006 IEEE Region 10
Conference, pp. 1-4, Nov. 2006.
[42] Daesung Lim, Nam Su Chang , Sung Yeon Ji , Chang Han Kim ,
Sangjin Lee, Young-Ho Park, “An efficient signed digit
montgomery multiplication for RSA”, Journal of Systems
Architecture, Vol. 55, pp. 355–362, 2009.
[43] Sanu M.O., Swartzlander E.E., Chase C.M., "Parallel Montgomery
multipliers", 15th IEEE International Conference on Application-
Specific Systems, Architectures and Processors, pp. 63-72, Sept.
2004.
[44] Zhimin Chen, Schaumont P., "pSHS: A scalable parallel software
implementation of Montgomery multiplication for multicore
systems", Design, Automation & Test in Europe Conference &
Vinayaka Missions University, Salem 139
Exhibition, pp. 843-848, March 2010.
[45] Zhimin Chen, Schaumont P., "A Parallel Implementation of
Montgomery Multiplication on Multicore Systems: Algorithm,
Analysis, and Prototype", IEEE Transactions on Computers,
Vol.60, No.12, pp.1692-1703, Dec. 2011.
[46] Amberg P. Pinckney N. Harris D.M., "Parallel high-radix
Montgomery multipliers", 42nd Asilomar Conference on Signals,
Systems and Computers, pp. 772-776, Oct. 2008.
[47] F. Bernard, “Scalable hardware implementing high-radix
Montgomery multiplication algorithm”, Journal of Systems
Architecture, Vol. 53, pp. 117–126, 2007.
[48] Thomas Blum and Christof Paar, “High Radix Montgomery
Modular Exponentiation on Reconfigurable Hardware”, IEEE
Transactions on Computers, Vol. 50, No. 5, pp. 759-764, July
2001.
[49] Thomas Blum and Christof Paar, “Montgomery modular
exponentiation on reconfigurable hardware”, 14th Symposium on
Computer Arithmetic, pp. 70–77, 1999.
[50] Jun Han, Shuai Wang, Wei Huang, Zhiyi Yu, Xiaoyang Zeng,
"Parallelization of Radix-2 Montgomery Multiplication on Multicore
Platform", IEEE Transactions on Very Large Scale Integration
Systems, Vol. 21, No. 12, pp. 2325-2330, Dec. 2013.
Vinayaka Missions University, Salem 140
[51] Selçuk Baktir, Erkay Savaş, “Highly-Parallel Montgomery
multiplication for Multi-Core General-Purpose Microprocessors”,
Computer and Information Sciences III, pp. 467- 476, 2013.
[52] Miladinovic N., Popovi -Bo ovi J., FPGA realization of fully
systolic and parallel architecture of Montgomery multipliers", 19th
Telecommunications Forum, pp. 928-931, Nov. 2011.
[53] Ciaran McIvor, Máire McLoone, John V McCanny, Alan Daly,
William Marnane, “Fast Montgomery Modular Multiplication and
RSA Cryptographic Processor Architectures”, thirty seventh
Asilomar conference on signals, systems and computers, Vol. 1,
pp. 379-384, Nov. 2003.
[54] Neto J.C., Tenca A.F., Ruggiero W.V., "A parallel k-partition
method to perform Montgomery Multiplication", IEEE International
Conference on Application-Specific Systems, Architectures and
Processors, pp. 251-254, Sept. 2011.
[55] Néto J. C., Tenca A. F., Ruggiero W. V., "A Parallel and Uniform
k-Partition Method for Montgomery Multiplication", IEEE
Transactions on Computers, Vol. 63, No. 9, pp. 2122-2133, Sept.
2014.
[56] L. Batina and G. Muurling, “Montgomery in Practice: How to Do It
More Efficiently in Hardware”, Proc. Cryptographer‟s Track at the
RSA Conf. Topics in Cryptology, pp. 40-52, Feb. 2002.
[57] Huapeng Wu, "Low complexity LFSR based bit-serial montgomery
Vinayaka Missions University, Salem 141
multiplier in GF(2m)", IEEE International Symposium on Circuits
and Systems, pp. 1962-1965, May 2013.
[58] M. Morales-Sandoval, C. Feregrino-Uribe, P. Kitsos, "Bit-serial
and digit-serial GF(2m)Montgomery multipliers using linear
feedback shift registers", Computers & Digital Techniques, Vol.5,
Issue 2, pp. 86-94, March 2011.
[59] Gustavo D. Sutter, Jean-Pierre Deschamps, José Luis Imaña,
“Modular Multiplication and Exponentiation Architectures for Fast
RSA Cryptosystem Based on Digit Serial Computation”, IEEE
Transactions on Industrial Electronics, Vol. 58, No. 7, July 2011.
[60] Talapatra S., Rahaman H. Saha S.K., "Unified Digit Serial Systolic
Montgomery Multiplication Architecture for Special Classes of
Polynomials over GF(2m)", 2010 13th Euromicro Conference on
Digital System Design: Architectures, Methods and Tools, pp.
427-432, Sept. 2010.
[61] Talapatra S., Rahaman H., Mathew J., "Low Complexity Digit
Serial Systolic Montgomery Multipliers for Special Class of
GF(2m)", IEEE Transactions on Very Large Scale Integration
Systems, Vol. 18, No. 5, pp. 847-852, May 2010.
[62] Michalski A. Buell D., "A Scalable Architecture for RSA
Cryptography on Large FPGAs", 14th Annual IEEE Symposium on
Field- Programmable Custom computing Machines, pp. 1-8,
Aug. 2006.
Vinayaka Missions University, Salem 142
[63] Michalski A. and Buell D., "A Scalable Architecture for RSA
Cryptography on Large FPGAs", 14th Annual IEEE Symposium
on Field-Programmable Custom Computing Machines, pp. 331-
332, April 2006.
[64] Chu A., Sima, M., "Reconfigurable RSA Cryptography for
Embedded Devices", Canadian Conference on Electrical and
Computer Engineering, pp. 1312-1315, May 2006.
[65] Oskuzoglu E, Savaş E, “Parametric, secure and compact
implementation of RSA on FPGA”, International Conference on
Reconfigurable Computing and FPGAs, pp. 391–396, Dec. 2008.
[66] A. Mazzeo, L. Romano, G. P. Saggese - Universita‟ degli Studi di
Napoli “Federico II”, FPGA-based implementation of a serial RSA
processor”, Proceedings of the Design, Automation and Test in
Europe Conference and Exhibition, pp. 582-587 , March 2003.
[67] Ersin Öksüzoğlu, Erkay Savaş, “Parametric, Secure and Compact
Implementation of RSA on FPGA”, International Conference on
Reconfigurable Computing and FPGAs, pp. 391-396, 2008.
[68] Liang Wang, Yonggui Zhang, "A new personal information
protection approach based on RSA cryptography", 2011
International Symposium on IT in Medicine and Education, Vol.1,
pp. 591-593, Dec. 2011.
[69] Ming-Der Shieh, Chien-Hsing Wu, Ming-hwa Sheu, Jia-Lin Sheu,
Che-Han Wu, "Asynchronous implementation of modular
Vinayaka Missions University, Salem 143
exponentiation for RSA cryptography", Proceedings of the
Second IEEE Asia Pacific Conference on ASICs, pp. 191-194,
2000.
[70] Manaf N.V., Sheramin, G.Y., "A Simple Approach for VLSI
Improvement of OHRNS for Use RSA Cryptography", 2010
International Conference on Computational Intelligence and
Communication Networks, pp. 355-358, Nov. 2010.
[71] Hariri A., Reyhani-Masoleh A., "Concurrent Error Detection in
Montgomery Multiplication over Binary Extension Fields", IEEE
Transactions on Computers, Vol. 60, No. 9, pp. 1341-1353, Sept.
2011.
[72] Hariri A., Reyhani-Masoleh A., "Fault Detection Structures for the
Montgomery Multiplication over Binary Extension Fields", 2007
Workshop on Fault Diagnosis and Tolerance in Cryptography,
pp. 37- 46, Sept. 2007.
[73] Miaoqing Huang, Kris Gaj, El-Ghazawi T., "New Hardware
Architectures for Montgomery Modular Multiplication Algorithm",
IEEE Transactions on Computers, Vol. 60, No. 7, pp. 923-936,
July 2011.
[74] McLoone M., McIvor C. McCanny J.V., "Montgomery modular
multiplication architecture for public key cryptosystems", IEEE
Workshop on Signal Processing Systems, pp. 349-354, Oct.
2004.
Vinayaka Missions University, Salem 144
[75] Richa Garg, Renu Vig, "An Efficient Montgomery Multiplication
Algorithm and RSA Cryptographic Processor", International
Conference on Computational Intelligence and Multimedia
Applications 2007, Vol. 2, pp. 188-195, 13-15 Dec. 2007.
[76] Yuan-Yang Zhang , Zheng Li, Lei Yang, Shao-Wu Zhang, “ An
efficient CSA architecture for montgomery modular multiplication”,
Microprocessors and Microsystems 31, pp. 456–459, 2007.
[77] S. S. Ghoreishi, M. A. Pourmina, H. Bozorgi, M. Dousti, "High
Speed RSA Implementation Based on Modified Booth's
Technique and Montgomery's Multiplication for FPGA Platform",
2009 Second International Conference on Advances in Circuits,
Electronics and Micro-electronics, pp. 86-93, Oct. 2009.
[78] Jin-Hua Hong, Cheng-Wen Wu, “Cellular-Array Modular Multiplier
for Fast RSA Public-Key Cryptosystem Based on Modified
Booth‟s Algorithm”, IEEE Transactions on Very Large Scale
Integration Systems, Vol. 11, No. 3, June 2003.
[79] Bayhan D., Ors S.B. Saldamli G., "Analyzing and comparing the
Montgomery multiplication algorithms for their power
consumption", International Conference on Computer Engineering
and Systems, pp. 257- 261, Nov.-Dec. 2010.
[80] Nadia Nedjah, Luiza de Macedo Mourelle, “Three Hardware
Architectures for the Binary Modular Exponentiation: Sequential,
Vinayaka Missions University, Salem 145
Parallel, and Systolic”, IEEE Transactions on Circuits and
Systems, Vol. 53, No. 3, March 2006.
[81] Refik Sever, A. Neslin Ismailglu, Yusuf C Tekmen, Murat Askar,
Burak Okcan, "A high speed FPGA implementation of the
Rijndael algorithm”, Proceedings of the Euromicro Symposium on
Digital System Design, pp. 358-362, Aug-Sept. 2004.
[82] R.V. Kshirsagar, M. V. Vyawahare, "FPGA Implementation of High
Speed VLSI Architectures for AES Algorithm", 2012 Fifth
International Conference on Emerging Trends in Engineering and
Technology, pp. 239-242, Nov. 2012.
[83] Sushanta Kumar Sahu, Manoranjan Pradhan “FPGA
Implementation of RSA Encryption System”, International Journal
of Computer Applications, Vol. 19, No. 9, pp. 10-12, April 2011.
[84] Tanimura K., Nara R., Kohara S., Shimizu K. Shi Y., Nozomu
Togawa, Yanagisawa M., Ohtsuki T., "Scalable unified dual-radix
architecture for Montgomery multiplication in GF(P) and GF(2n)",
Asia and South Pacific Design Automation Conference, pp. 697-
702, March 2008.
[85] Schinianakis D., Skavantzos A., Stouraitis T., "GF(2n) Montgomery
multiplication using Polynomial Residue Arithmetic", 2012 IEEE
International Symposium on Circuits and Systems, pp. 3033-
3036, May 2012.
[86] Talapatra S., Rahaman H., "Low complexity Montgomery
Vinayaka Missions University, Salem 146
multiplication architecture for elliptic curve cryptography over
GF(pm)", VLSI System on Chip Conference 18th IEEE/IFIP, pp.
219-224, Sept. 2010.
[87] Satzoda R.K., Chip-Hong Chang, "A fast kernel for unifying GF(p)
and GF(2m) Montgomery multiplications in a scalable pipelined
architecture", Proceedings of 2006 IEEE International Symposium
on Circuits and Systems, pp. 3378-3381, May 2006.
[88] Fournaris A.P., "Fault and simple power attack resistant RSA
using Montgomery modular multiplication", Proceedings of 2010
IEEE International Symposium on Circuits and Systems, pp.
1875-1878, May-June 2010.
[89] Himanshu Thapliyal, Anvesh Ramasahayam, Vivek Reddy Kotha,
Kunul Gottimukkula and M.B. Srinivas, "Modified Montgomery
modular multiplication using 4:2 compressor and CSA adder",
Third IEEE International Workshop on Electronic Design, Test
and Applications, pp. 17-19, Jan. 2006.
[90] Maryam Mohammadi, Amir Sabbagh Molahosseini, "Efficient
design of Elliptic Curve Point Multiplication based on fast
Montgomery modular multiplication", 3rd International Conference
on Computer and Knowledge Engineering, pp. 424-429, Oct-Nov
2013.
[91] Haining Fan; M. Anwar Hasan, "Relationship between GF(2m)
Montgomery and Shifted Polynomial Basis Multiplication
Vinayaka Missions University, Salem 147
Algorithms", IEEE Transactions on Computers,Vol.55, No.9, pp.
1202-1206, Sept. 2006.
[92] Koç, C.K., Tolga Acar, Burton S. Kaliski, Jr., "Analyzing and
comparing Montgomery multiplication algorithms", Micro, IEEE,
Vol. 16, No. 3, pp. 26-33, Jun 1996.
[93] Ciaran McIvor, McLoone M., John V. McCanny, "FPGA
Montgomery modular multiplication architectures suitable for
ECCs over GF(p)", Proceedings of the 2004 International
Symposium on Circuits and Systems, Vol.3, pp. III-509-III-512,
May 2004.
[94] V. R. Venkatasubramani, S. Rajaram, "Novel techniques for
Montgomery modular multiplication algorithms for public key
cryptosystems", 2011 IEEE Electrical Design of Advanced
Packaging and Systems Symposium, pp. 1-6, Dec. 2011.
[95] C. McIvor, M. McLoone and J. V. McCanny, "Modified
Montgomery modular multiplication and RSA exponentiation
techniques", IEE Proceedings of Computers and Digital
Techniques, Vol.151, No.6, pp. 402-408, Nov. 2004.
[96] Mclvor C. McLoone M., McCanny J.V., "Fast Montgomery modular
multiplication and RSA cryptographic processor architectures",
Conference Record of the Thirty-Seventh Asilomar Conference
on Signals, Systems and Computers, Vol. 1, pp. 379-384, Nov.
2003.
Vinayaka Missions University, Salem 148
[97] A. P. Fournaris, O. Koufopavlou, "GF(2K) multipliers based on
Montgomery Multiplication Algorithm", Proceedings of the 2004
International Symposium on Circuits and Systems, Vol.2, pp. II-
849-II-852, May 2004.
[98] Paul R., Saha S., Suman Sau, Amlan Chakrabarti, "Real time
communication between multiple FPGA systems in multitasking
environment using RTOS", 2012 International Conference on
Devices, Circuits and Systems, pp. 130-134, March 2012.
[99] Pellegrini A., Bertacco V. and Austin T., "Fault-based attack of
RSA authentication", Design, Automation & Test in Europe
Conference & Exhibition, pp. 855-860, March 2010.
[100] Abdel Alim Kamal and Amr M. Youssef, "An FPGA
implementation of the NTRUEncrypt cryptosystem", 2009
International Conference on Microelectronics, pp. 209- 212, Dec.
2009.
[101] Marcelo E. Kaihara and Naofumi Takagi, “Bipartite modular
multiplication”, In Proceedings of Cryptographic Hardware and
Embedded Systems, Lecture notes in Computer Science 3659,
pp. 201-210. Springer-Verlag, 2005.
[102] Marcelo E. Kaihara and Naofumi Takagi, “Bipartite modular
multiplication method”, IEEE Transactions on Computers, pp.
157-164, 2008.
[103] Kazuo Sakiyama, Miroslav Knezevic, Junfeng Fan, Bart Preneel,
Vinayaka Missions University, Salem 149
and Ingrid Verbauwhede, “Tripartite modular multiplication”,
INTEGRATION THE VLSI JOURNAL, Vol. 44, No. 4, pp. 259-
269, Sept. 2011.
[104] Kazuo Sakiyama, Lejla Batina, Bart Preneel, and Ingrid
Verbauwhede, “Multicore Curve-Based Cryptoprocessor with
Reconfigurable Modular Arithmetic Logic Units over GF(2^n)”,
IEEE Transactions on Computers, Vol. 56, No. 9, pp. 1269-1282,
Sept. 2007.
[105] N. Costigan and P. Schwabe, “Fast Elliptic-Curve Cryptography
on the Cell Broadband Engine”, International Conference on
Cryptology in Africa: Progress in Cryptology '09, pp. 368-385,
2009.
[106] Junfeng Fan, Kazuo Sakiyama, and Ingrid Verbauwhede,
“Montgomery modular multiplication algorithm on multi-core
systems”, IEEE Workshop on Signal Processing Systems, pp.
261-266, 2007.
[107] Ç. K. Koç, T. Acar, and B. Kaliski, “Analyzing and comparing
Montgomery multiplication algorithms”, IEEE Micro, Jun 1996.
[108] Ç. K. Koç and T. Acar, “Montgomery Multiplication in GF(2^k)”,
Designs, Codes and Cryptography, 14(1):57-69, 1998.
[109] C.D. Walter, “Precise Bounds for Montgomery Modular
Multiplication and Some Potentially Insecure RSA Moduli”, Proc.
Cryptographers' Track at the RSA Conf. Topics in
Cryptology (CT-RSA '02), pp. 30-39, Feb. 2002.
[110] Bunimov V., Schimmler M., Tolg B., “A Complexity-Effective
Version of Montgomery's Algorithm”, Presented at the Workshop
on Complexity Effective Designs, May 2002.
[111] Skavantzos A., Stouraitis T., "GF(2^n) Montgomery multiplication
using Polynomial Residue Arithmetic", 2012 IEEE International
Symposium on Circuits and Systems, pp. 3033-3036, May 2012.
[112] Iput Heri K., Asep Bagja N., Purba R.S., Adiono T., "Very fast
pipelined RSA architecture based on Montgomery's
algorithm", 2009 International Conference on Electrical
Engineering and Informatics, Vol. 02, pp. 491-495, Aug 2009.
[113] Dorothy E. Denning, “Digital Signatures with RSA and other
Public-Key Cryptosystems”, Communications of the ACM, Vol.
27, No. 4, April 1984.
[114] A. Moss, D. Page, and N.P. Smart, “Toward Acceleration of RSA
Using 3D Graphics Hardware”, IMA International Conference on
Cryptography and Coding, pp. 213-220, 2007.
[115] Ramachandran Seetharaman, “Digital VLSI Systems Design: A
Design Manual for Implementation of Projects on FPGAs and
ASICs Using Verilog”, Springer, 2007.
[116] R. Ambika, S. Ramachandran, K. R. Kashwan, “Securing
Distributed FPGA System using Commutative RSA Core”,
Global Journal of Researches in Engineering Electrical and
Electronics Engineering, Vol. 13, Issue 15, Version 1, Nov. 2013.
[117] R. Ambika, S. Ramachandran, K. R. Kashwan, “Data Security
Using Serial Commutative RSA Core for Multiple FPGA
System”, 2014 2nd International Conference on Devices,
Circuits and Systems, pp. 1-5, March 2014.
[118] R. Ambika, S. Ramachandran, K. R. Kashwan, “Design of
Commutative Cryptography Core with Key Generation for
Distributed FPGA Architecture”, International Journal of Current
Engineering and Technology, Vol. 4, No. 5, pp. 3519-3527, Oct.
2014.
[119] ModelSim, a mixed-language simulator supporting Verilog-2001
and SystemVerilog.
[120] Xilinx ISE Tool, Design Suite, Version 14.3.
[121] Xilinx Vivado Design Suite, Version 2012.3.
List of Publications
1. R. Ambika, S. Ramachandran, K. R. Kashwan, “Securing
Distributed FPGA System using Commutative RSA Core”, Global
Journal of Researches in Engineering Electrical and Electronics
Engineering, Vol. 13, Issue 15, Version 1, Nov. 2013.
2. R. Ambika, S. Ramachandran, K. R. Kashwan, “Data Security
Using Serial Commutative RSA Core for Multiple FPGA System”,
2014 2nd International Conference on Devices, Circuits and
Systems, pp. 1-5, March 2014.
3. R. Ambika, S. Ramachandran, K. R. Kashwan, “Design of
Commutative Cryptography Core with Key Generation for
Distributed FPGA Architecture”, International Journal of Current
Engineering and Technology, Vol. 4, No. 5, pp. 3519-3527, Oct.
2014.
4. R. Ambika and Sahana Devanathan, “FPGA Implementation of
Cryptographic Algorithms: A Survey”, International Journal of
Scientific & Engineering Research, Volume 4, Issue 4, pp. 884-888,
April 2013.
5. R. Ambika and Hamsavahini R, “A Survey on Hardware
Architectures for Montgomery Modular Multiplication Algorithm”,
International Journal of Emerging Technologies in Computational
and Applied Sciences, Vol. 5, Issue 3, pp. 217-221, June-August,
2013.
APPENDIX A
RTL CODING:
RTL coding guidelines are widely followed rules for writing
synthesizable code, and the code in this project has been written
in accordance with them. They are as follows:
1 Avoid hard-coded values and numeric constants; use attributes on
objects or explicitly declared constants.
2 Use rising_edge(clk) exclusively (it is time to give up
clk'event and clk = '1').
3 "wait" must not be used.
4 Input ports can be left unconnected (open) at instantiation
provided they are assigned default values at declaration.
5 Do not leave ports unconnected by omission: use "open".
6 Avoid recursive code.
7 Inside an entity's synthesizable architecture, the authorized types
are: std_logic, std_logic_vector, signed, unsigned. The use of
integer range and Boolean requires care and caution since these
scalar types are implicitly initialized at creation. Enumerated types
have the same behavior and must be treated with even greater
care.
8 Use as few variables as possible in synthesizable code, and
never when you may use signals instead. Use variables for their
specific behavior (factoring, intermediate results, re-using the
flip-flop inputs, etc.).
9 Never create latches, combinational feedback or
asynchronous sequential logic, whether intended or
unintended.
10 You must not initialize signals at their declaration. You must
not initialize variables at their declaration in processes. It is
acceptable to initialize variables at their declaration inside
functions.
11 All asynchronous input signals should be re-synchronized.
12 All outputs of an entity should be registered, since
combinational outputs can create combinational feedback loops
through the hierarchy.
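Guideline 11 can be illustrated with a small cycle-based behavioural
model. The sketch below is written in Python rather than an HDL,
purely for illustration. It assumes a two-flip-flop synchronizer,
which is one common re-synchronization scheme and is not prescribed
by the guidelines above. The names TwoFlopSynchronizer and
synchronize are hypothetical, the zero initial values stand in for
an assumed reset state, and metastability itself is not modelled.

```python
class TwoFlopSynchronizer:
    """Cycle-based model of two flip-flops in series on the destination clock."""

    def __init__(self):
        # Assumed reset state of both registers (illustrative only).
        self.ff1 = 0
        self.ff2 = 0

    def rising_edge(self, async_in):
        """Advance one clock edge and return the synchronized output."""
        # Both registers update together: ff2 captures the OLD value of ff1,
        # so an input change needs two clock edges to reach the output.
        self.ff1, self.ff2 = async_in, self.ff1
        return self.ff2


def synchronize(samples):
    """Feed one input sample per clock edge; collect the output after each edge."""
    sync = TwoFlopSynchronizer()
    return [sync.rising_edge(sample) for sample in samples]
```

For example, synchronize([1, 1, 0, 0]) returns [0, 1, 1, 0]: a change
on the asynchronous input becomes visible at the registered output
only after the second clock edge.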
APPENDIX B
DEVELOPMENT TOOLS
SIMULATION TOOL: ModelSim
ModelSim is a tool that provides a comprehensive simulation and
debug environment for complex ASIC and FPGA designs.
Version used: ModelSim PE 5.5e
Command Summary:
Double-click the icon “Shortcut to Modelsim.exe.link” on your
desktop. The main ModelSim window and the Welcome to ModelSim
window open. Click “Create a Project” in the Welcome to ModelSim
window to start a new project. The Create Project window opens.
Type the project name in the project name field, and enter the
directory where your Verilog files are stored in the project
location field. Make sure “work” is entered as the default library
name, then click OK. In the main ModelSim window, click Library at
the bottom left.
Opening a Project
In the main window, click Design => Compile. The compile menu
opens. Double-click the desired test bench. The test bench and the
design will be compiled, and any errors will be reported in the
main window. If there are errors, fix them and compile again;
otherwise click Done. Click Design => Load Design in the main
window. The Load Design menu opens. Double-click the desired test
bench, or select the test bench name and click Load. The test
bench will be loaded into the simulator.
Loading a Design
Compiling a Design
In the main menu, click View => “Signals”. The Signals window
opens. In that window, click View => Wave => Signals in Design.
The waveform window opens with a listing of all the signals. The
X-axis of the waveform gives the time in nanoseconds.
Signals in Design
Click Run All, found in the second row of menus at the top right.
Click Zoom Full to view all the waveforms. Other options such as
Zoom In, Zoom Out and Zoom Area may be used to get the desired view.
Simulation Result Window
Locate a signal, say “clk”, and click on it. It will be highlighted.
Click on the highlighted signal, drag it up and drop it above all
the other signals. Likewise, you can drop the other signals
belonging to the same function next to the first signal, and so on.
You can also use copy or cut and paste to place a signal where
desired. Click “Format => Radix” and choose Binary, Decimal,
Hexadecimal or Octal as per your requirements. All selected signals
are set to the chosen number system.
Use the < and > arrows to move the waveforms along the time axis.
Analyze the waveforms to check the functionality.
This completes one session. To exit ModelSim, click “File =>
Quit” and then “Yes”.
To use ModelSim again and resume the same project, double-click
the same icon on the desktop.
Click “Open a Project” in the Welcome to ModelSim window. Make sure
that the directory is the same by clicking “Library”. Otherwise,
click “File => Change Directory”, select the desired directory and
click “Open”.
APPENDIX C
DEVELOPMENT OF ASM CHARTS
Algorithmic State Machine (ASM) charts are a graphical
representation of the step-by-step execution of a hardware design.
They are easy to interpret and can be easily converted to other
forms of representation.
They do not enumerate all the possible inputs and outputs. Only the
inputs that matter and the outputs that are asserted are indicated. It
must be known whether a signal is positive or negative logic:
Positive logic signals that are high are said to be asserted.
Negative logic signals that are low are said to be asserted.
In this report, a _n suffix is added to indicate negative logic
(active-low) signals.
The ASM Diagram Block
An ASM chart has an entry point and is constructed from blocks. A
block is constructed from the following types of symbols.
State box: The state box has a name and lists the outputs that are
asserted when the system is in that state. These outputs are called
synchronous or Moore-type outputs.
State Box Representation
Optional decision box: A decision box may be conditioned on a signal
or a test of some kind.
Condition Box Representation
Optional conditional output box: Such an output box indicates
outputs that are conditionally asserted. These outputs are called
asynchronous or Mealy outputs:
Conditional Output Box
There is no rule saying that outputs are exclusively inside a conditional
output box or in a state box. An output written inside a state box is
simply independent of the input, while in that state.
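The roles of the three symbols can be sketched with a small
behavioural model. The Python code below is an illustration only,
not RTL and not a design from this work: the two-state controller
and the signal names start, done, busy, load and ready are all
hypothetical. Outputs listed in a state box are asserted whenever
the machine is in that state, while outputs in a conditional
output box are asserted only on the exit path selected by the
decision box.

```python
# Behavioural sketch of a two-state ASM chart (illustration only, not RTL).
# The controller, its states and its signal names are hypothetical.

def idle_decision(inputs):
    # Decision box on `start`; the conditional output box asserts `load`
    # only on the path where start is high (Mealy-type output).
    if inputs["start"]:
        return "BUSY", {"load"}
    return "IDLE", set()


def busy_decision(inputs):
    # Decision box on `done`; `ready` is asserted conditionally on exit.
    if inputs["done"]:
        return "IDLE", {"ready"}
    return "BUSY", set()


# Each ASM block: outputs listed in the state box (Moore-type) plus the
# decision logic that selects exactly one exit path.
ASM_BLOCKS = {
    "IDLE": {"state_outputs": set(), "decide": idle_decision},
    "BUSY": {"state_outputs": {"busy"}, "decide": busy_decision},
}


def step(state, inputs):
    """Evaluate one ASM block; return (outputs asserted this cycle, next state)."""
    block = ASM_BLOCKS[state]
    next_state, conditional_outputs = block["decide"](inputs)
    return block["state_outputs"] | conditional_outputs, next_state


def run(input_sequence, state="IDLE"):
    """Apply one input dict per clock cycle; return the trace and final state."""
    trace = []
    for inputs in input_sequence:
        outputs, next_state = step(state, inputs)
        trace.append((state, sorted(outputs)))
        state = next_state
    return trace, state
```

In this sketch busy is asserted in every cycle spent in the BUSY
state regardless of the inputs, whereas load and ready appear only
in the cycles in which start or done, respectively, is high.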
The drawing of ASM charts must follow certain necessary rules:
i. The entrance paths to an ASM block lead to only one state
box
ii. Of 'N' possible exit paths, for each possible valid input
combination, only one exit path can be followed, that is there
is only one valid next state.
iii. No feedback internal to a state box is allowed.