

Public-key Cryptography on SIMD Mobile Devices

Paulo Sérgio Alves Martins

Thesis to obtain the Master of Science Degree in

Electrical and Computer Engineering

Supervisor: Dr. Leonel Augusto Pires Seabra de Sousa

Examination Committee

Chairperson: Dr. Nuno Cavaco Gomes Horta

Supervisor: Dr. Leonel Augusto Pires Seabra de Sousa

Members of the Committee: Dr. Ricardo Jorge Fernandes Chaves

October, 2014


Acknowledgments

I would like to thank Professor Leonel Sousa for giving me the opportunity to perform this work and for supporting me with his never-ending energy, great scientific expertise and genuine friendliness.

Also, I want to express my gratitude to my parents, the rest of my family and my friends for their motivation and for being by my side.

Finally, I want to thank the SiPS group, Instituto Superior Técnico and everyone else who helped me achieve my goals.

This work was partially supported by national funds through FCT – Fundação para a Ciência e a Tecnologia, under projects PEst-OE/EEI/LA0021/2013 and PTDC/EEI-ELC/3152/2012.


Abstract

The acceleration of cryptographic applications on embedded devices is a topic of increasing importance, due to their massive use. In this thesis, the efficiency of embedded devices when operating as cryptographic accelerators is evaluated. Single Instruction Multiple Data (SIMD) parallelism, in complement to multithreading parallelism, is exploited as an efficient and broadly available approach to accelerate cryptographic operations.

Firstly, the throughput of modular multiplications is increased, and the parallel algorithm is tested for the Rivest-Shamir-Adleman (RSA) and Elliptic Curve (EC) cryptosystems. Speedups of up to 7.2 and 3.9 are obtained for the RSA and EC cryptosystems, respectively, on the ARM A15 quad-core system. Moreover, the delay of a single multiplication is reduced, and the technique is applied to the RSA cryptosystem, reducing its central operation execution time by up to 2.2 times. Lastly, the feasibility of the proposed approaches is assessed by programming a tool based on the proposed algorithms for producing and verifying digital signatures.

The use of Graphics Processing Units (GPUs) as cryptographic accelerators is also evaluated in this thesis. The Residue Number System (RNS) is employed to divide large integer operations over several channels, and to harness the parallel computational power of the Qualcomm Adreno 320 GPU. Speedups of up to 2.2 are obtained for the RSA cryptosystem.

Finally, the relative effectiveness of SIMD and multithreading parallelism on embedded and general-purpose devices is experimentally evaluated. It can be concluded that it is possible to achieve the same levels of execution enhancement on both types of platforms.

Keywords

Public-key Cryptography, Elliptic Curve Cryptography, Rivest-Shamir-Adleman Cryptography, Parallel Algorithms, Single Instruction Multiple Data, Mobile Devices


Resumo

The acceleration of cryptographic applications on embedded systems is a topic of growing importance, due to their massive use. In this thesis, the efficiency of these devices, when used as cryptographic accelerators, is evaluated. Single Instruction Multiple Data (SIMD) parallelism, in complement to multithreading parallelism, is exploited as an efficient and widely available approach to accelerate cryptographic operations.

First, the throughput of modular multiplications is increased, and the parallel algorithm is tested for the Rivest-Shamir-Adleman (RSA) and Elliptic Curve (EC) based cryptosystems. Speedups of 7.2 and 3.9 are obtained for the RSA and EC-based systems, respectively, on the quad-core ARM A15 device. Next, the time of a single multiplication is reduced, and the technique is applied to the RSA system, decreasing the execution time of its basic operation by up to 2.2 times. Finally, the feasibility of the proposed solutions is assessed by presenting a tool, supported by the developed algorithms, capable of generating and verifying digital signatures.

The use of Graphics Processing Units (GPUs) as cryptographic accelerators is also evaluated in this thesis. The Residue Number System (RNS) is used to split operations on large numbers over several channels and to exploit the parallel computational power of the Qualcomm Adreno 320 GPU. In this case, speedups of up to 2.2 are obtained for the RSA system.

Finally, the relative efficiency of SIMD and multithreading parallelism on embedded and general-purpose devices is experimentally evaluated. It can be concluded that similar levels of performance improvement can be achieved on both types of platforms.

Palavras Chave

Public-key Cryptography, Elliptic Curve Cryptography, Rivest-Shamir-Adleman Cryptography, Parallel Algorithms, Single Instruction Multiple Data, Mobile Devices


Contents

1 Introduction
  1.1 Motivation
  1.2 Related Work
  1.3 Objectives
  1.4 Main Contributions
  1.5 Organisation

2 Background
  2.1 Cryptographic Systems
  2.2 Embedded Computer Architecture
    2.2.1 Central Processing Unit
    2.2.2 Graphics Processing Unit
    2.2.3 Heterogeneous Systems
  2.3 Summary

3 Multi-Precision Arithmetic
  3.1 Multi-Precision Multiplication
    3.1.1 Operand-Scanning Method
    3.1.2 Product-Scanning Method
    3.1.3 Operand-Caching Method
  3.2 Multi-Precision Modular Multiplication
    3.2.1 Classic Reduction
    3.2.2 Barrett Reduction
    3.2.3 Montgomery Reduction
    3.2.4 Parallel Execution of a Single Modular Multiplication
    3.2.5 Residue Number System (RNS)-based Modular Multiplication
  3.3 Implementation and Experimental Results
    3.3.1 Multi-precision Multiplication
    3.3.2 Multi-Precision Modular Multiplication
  3.4 Summary

4 Cryptosystems and Experimental Assessment
  4.1 Modular Exponentiation
  4.2 Elliptic Curve Cryptosystems
  4.3 Implementation Details and Experimental Results
  4.4 System Application
  4.5 Summary

5 Comparative Evaluation of General Purpose and Embedded Processors
  5.1 Montgomery Multiplication
    5.1.1 RSA Exponentiation
  5.2 Summary

6 Conclusion
  6.1 Summary and Overall Conclusions
  6.2 Future Work


List of Figures

2.1 Cipher Operation
2.2 Elliptic Curve Point Addition [1]
2.3 Example of Instruction Execution using Pipelining
2.4 SIMD Operation for 4 Lanes [2]
2.5 GPU Modules Diagram [3]
3.1 Operand-Scanning Method Illustrated for 8-word (s = 8) Large Operands [4]
3.2 Product-Scanning Method Illustrated for 8-word (s = 8) Large Operands [4]
3.3 Parallel Product-Scanning Method Illustrated for 8-word (s = 8) Large Operands
3.4 Operand-Caching Method Illustrated for 8-word Operands and e = 3 [4]
3.5 Montgomery Multiplication methods: multiplication and reduction loops organisation
3.6 SIMD operation for 4 lanes
3.7 Multi-Core architecture with 4-lane SIMD engines
4.1 Graphical User Interface (GUI) Algorithm Stack
4.2 Digital Signature Algorithm (DSA) Form
4.3 DSA Signature Generation
4.4 DSA Signature Verification
4.5 DSA Signature Rejection
4.6 Elliptic Curve Digital Signature Algorithm (ECDSA) Signature Generation
4.7 ECDSA Signature Verification
5.1 Relative performance comparison for the execution of Montgomery multiplication using NEON, SSE4.1 and AVX2 technologies
5.2 Relative performance comparison for the execution of modular exponentiation using NEON, SSE4.1 and AVX2 technologies


List of Tables

2.1 Rivest-Shamir-Adleman (RSA) Algorithm's Operation
3.1 NEON Instructions Summary
3.2 Multi-Precision NEON Multiplication Performance
3.3 Obtained performance for implementing Montgomery multiplications on the ARM processor
3.4 Multi-Precision Montgomery PowerVR Series5XT SGX544 MP3 Graphics Processing Unit (GPU) Multiplication Performance
3.5 k-ary Multi-Precision Modular Multiplication Performance for 3 Cores
3.6 Obtained performance from the execution of the Montgomery multiplication algorithm on the Adreno GPU, and the sequential version on the 1.7 GHz Krait 300 ARM Cortex-A15-based CPU
4.1 Scheduling for point arithmetic in modified Jacobian coordinates
4.2 Multi-Precision NEON Montgomery Exponentiation Performance
4.3 Multi-Precision k-Ary Method Exponentiation Performance, for 3 Cores
4.4 Execution time [µs] obtained from the execution of the modular exponentiation algorithm on the Adreno GPU, and the sequential version on the 1.7 GHz Krait 300 ARM Cortex-A15-based Central Processing Unit (CPU)
4.5 Performance of the Elliptic Curve (EC) point multiplication on the ARM processor
4.6 DSA Example Parameters
4.7 ECDSA Example Parameters
5.1 Experimental Setups
5.2 SIMD instructions adopted for the ARM and the Intel processors
5.3 Multi-Precision SSE4.1 and AVX2 Montgomery Multiplication Performance
5.4 Execution time [clock cycles] obtained from the execution of the modular exponentiation algorithm on the Intel processor


List of Algorithms

2.1 DSA Message Signing
2.2 DSA Message Verification
3.1 Classic Modular Multiplication
3.2 Barrett Modular Multiplication
3.3 Montgomery Modular Multiplication
3.4 k-ary Modular Multiplication
3.5 RNS Montgomery Multiplication
3.6 Operand-Scanning Method C Implementation
3.7 Product-Scanning Method NEON Implementation
3.8 Operand-Caching Method NEON Implementation
3.9 NEON Vector Multiply and Accumulate
3.10 Separated Operand Scanning (SOS) Main Loops
3.11 Finely Integrated Operand Scanning (FIOS) Inner Loop
3.12 FIOS2 Main Loop
3.13 SOS SIMD OpenCL Implementation
3.14 k-ary Modular Multiplication OpenMP Implementation
3.15 Implementation of RNS Montgomery Multiplication
3.16 Evaluation of k
3.17 2nd Evaluation of k
3.18 GPU Modular Reduction
3.19 OpenCL Implementation of RNS Montgomery Multiplication
4.1 Generic 2^k-ary Method
4.2 Point Addition Formulae for Jacobian Coordinates
4.3 Generic 2^k-ary Method C Implementation
4.4 Montgomery Multiplication Wrapper
5.1 AVX2 Vector Multiply and Accumulate
5.2 AVX2 FIOS2 Main Loop


List of Acronyms

ALU Arithmetic-Logic Unit

API Application Programming Interface

CPU Central Processing Unit

CRT Chinese Remainder Theorem

DES Data Encryption Standard

DSA Digital Signature Algorithm

EC Elliptic Curve

ECC Elliptic Curve Cryptography

ECDSA Elliptic Curve Digital Signature Algorithm

FFT Fast Fourier Transform

FIOS Finely Integrated Operand Scanning

FIOS2 Finely Integrated Operand Scanning Version 2

GCD Greatest Common Divisor

GPU Graphics Processing Unit

GUI Graphical User Interface

ISA Instruction Set Architecture

JNI Java Native Interface

MIMD Multiple Instruction Multiple Data

MRS Mixed Radix System

OS Operating System

PGP Pretty Good Privacy

RISC Reduced Instruction Set Computer

RNS Residue Number System

RSA Rivest-Shamir-Adleman

SIMD Single Instruction Multiple Data


SIMO Single Instruction Multiple Operations

SOS Separated Operand Scanning

SSH Secure Shell

SSL Secure Sockets Layer

VLIW Very Long Instruction Word

VLSI Very-Large-Scale Integration


1 Introduction

Contents
  1.1 Motivation
  1.2 Related Work
  1.3 Objectives
  1.4 Main Contributions
  1.5 Organisation


During the past decades, technology has evolved at a tremendous pace. In particular, with the advent of the Internet, more and more aspects of our personal and economic lives tend to depend on this global public network. The technology that enables these interactions is digitally based on the transmission and broadcast of physical quantities that represent sequences of zeroes and ones.

With such uncomplicated foundations, the question arises of how to ensure secure communication over public networks, which means preventing someone from impersonating us, or stopping them from altering information we send for their own advantage. The answer to these questions comes with cryptography. Cryptography provides us with techniques to keep data secret, to determine that information has not been tampered with, and to determine the authorship of pieces of information.

Cryptography traces back 4000 years, to when the Egyptians made an initial, limited use of it [5]. In recent times, the topic has attracted great interest since the proliferation of computer and telecommunication systems in the 1960s. During the 1970s, the most well-known cryptographic mechanism in history, the Data Encryption Standard (DES), was developed as the standard for encrypting unclassified information.

Diffie and Hellman introduced the revolutionary concept of public-key cryptography in 1976, with the publication of the New Directions in Cryptography paper [6]. Even though they found no practical realisation for it at that time, the concept instilled extensive interest and activity in the scientific community. The first practical public-key encryption and signature scheme published to support this type of ciphering was Rivest-Shamir-Adleman (RSA), which was presented in 1977 and is named after its authors [7]. The security strength of this algorithm is based on the complexity of both factorisation and the computation of discrete logarithms. Later, several other algorithms were proposed, such as ElGamal's in 1985 [8], and others based on Elliptic Curves (ECs) [9] [10].

One of the most significant contributions provided by public-key cryptography is the digital signature. In 1991, the first international standard for digital signatures, based on the RSA scheme, was established. Later, in 1994, the U.S. Government adopted the Digital Signature Algorithm (DSA), a mechanism based on the ElGamal scheme.

1.1 Motivation

Due to the wide deployment of the cryptographic operations and protocols required to secure communication, it is most beneficial to perform them efficiently. Public-key cryptosystems, in particular, generally require computationally demanding arithmetic defined over finite fields of large characteristic.

It is, therefore, of practical interest not only to maximise the throughput of such operations but also to reduce their latency, namely in real-time applications with strict temporal requirements, and in embedded systems, which have modest computational power and a limited energy budget. The portability of these devices has led to a rise in the demand for higher communication security, which requires the use of increasingly larger keys, making such improvements more and more important.

For software implementations, these improvements can be achieved by exploiting data-stream parallelism using Single Instruction Multiple Data (SIMD) extensions, which not only reduce the execution time of computations, but also do so inexpensively, since they are broadly available. SIMD is also an energy-efficient approach [11], and it makes it possible to increase security through the introduction of redundant computation [12].

The main motivation for this dissertation is the enhancement of the mathematical operations and algorithms underlying public-key cryptography, using the hardware provided with most embedded systems, with a focus on the exploitation of SIMD parallelism.


1.2 Related Work

Most public-key cryptosystems make extensive use of the multiplication of large numbers, and there has been significant research on how to speed up this operation. Comba, in 1990, proposed an approach to reduce the number of memory accesses [13], thus improving memory efficiency. More recently, Gura proposed a way to further reduce memory usage [14], and Hutter expanded on this idea [4].

Cryptosystem multiplications often take place in modular fields, and there have been several proposals to enhance the modular reduction operation, namely the Montgomery [15] and Barrett [16] modular multiplication algorithms. These algorithms have been thoroughly studied, and several techniques have been proposed to enhance their implementation. In [17], the operations involved in computing the Montgomery product are analysed, and several high-speed and space-efficient methods for this computation are compared.

In [12], a Montgomery redundant representation is described as a way to allow for high-performance SIMD-based implementations of RSA and Elliptic Curve Cryptography (ECC). SIMD instructions are also used to enhance the performance of modular reduction for moduli with a special structure in [18]. In [19], an algorithm is proposed for the execution of SIMD parallel reductions. A scheduling for ECC arithmetic based on parallel SIMD multiplications was proposed in [20]. Finally, other publications [21] [22] have proposed multithreaded multiplication reductions based on the Barrett and Montgomery algorithms.

Montgomery multiplication can also be implemented using the Residue Number System (RNS), which allows for fast parallel arithmetic due to its carry-free nature [23]. Under this system, an integer is represented by the set of its remainders when divided by the co-prime integers that compose the base set. There have been several proposals of different algorithms that take advantage of this representation [23] [24] [25].

1.3 Objectives

The main objective of this thesis is the analysis of the mathematical operations required to perform public-key cryptographic operations, in order to propose approaches and algorithms through which SIMD parallelism can be efficiently exploited. Parallel algorithms are developed and evaluated, and their properties are discussed so as to find how they can be used to implement efficient cryptosystems. These algorithms are also used, in the scope of this thesis, to implement the ECC, DSA and RSA core operations in a parallel manner. Finally, a software application is developed and used to sign and verify messages, in order to confirm the feasibility and practical interest of the proposed algorithms.

1.4 Main Contributions

Several contributions of this thesis can be listed. One of them results from the investigation of how SIMD parallelism can be exploited to enhance large-operand arithmetic, with an emphasis on modular arithmetic, mainly for cryptographic applications. Since the proposed algorithms are general, they may enhance not only cryptographic operations but also any application that requires multiplications of large numbers. A large set of experimental results is presented, which allows the best methodologies for the implementation of modular arithmetic on embedded devices to be evaluated experimentally.

Another important contribution of this thesis is the analysis of results for embedded processors, and their comparison to those for general-purpose processors.


It is of practical interest to understand how their performances differ, so that the best characteristics of each device may be taken into account when designing new, efficient cryptographic systems.

These main contributions led to results that have been accepted for publication at the following international conference and in the following journal:

• Paulo Martins and Leonel Sousa. On the evaluation of multi-core systems with SIMD engines for public-key cryptography. In Applications for Multi-Core Architectures (WAMCA), 2014 Fifth Workshop on, October 2014.

• Leonel Sousa and Paulo Martins. Efficient sign identification engines for integers represented in the RNS extended 3-moduli set {2^n − 1, 2^(n+k), 2^n + 1}. Electronics Letters, IET, vol. 50, no. 16, pp. 1138–1139, July 2014.

1.5 Organisation

This dissertation is organised in six chapters. In Chapter 2, the theoretical background of public-key cryptography and computer architecture is presented. An overview of cryptography is given, underlining the differences between public-key and private-key cryptography, and how they are used in practice. Since modular arithmetic is often used to implement public-key cryptosystems, this topic is also covered, and it is explained how it underpins public-key cryptosystems, namely ECC, RSA and DSA. Afterwards, an overview of computer architectures is given, with emphasis on the differences between Central Processing Units (CPUs) and Graphics Processing Units (GPUs). Finally, the characteristics of the platforms used in this thesis are also presented.

Chapter 3 starts with a description of different approaches to perform multi-precision multiplication on embedded devices. They are analysed so as to find ways to exploit SIMD parallelism. Afterwards, several approaches to modularly reduce multi-precision multiplications are studied, which feature SIMD and multithreading parallelism. Finally, the technical aspects of the implementation of the arithmetic algorithms are described, and their performance is evaluated.

In Chapter 4, the aforementioned techniques are employed to realise cryptographic operations, namely modular exponentiation and point multiplication over ECs. The implementation details of these cryptosystems are presented, and the performance enhancement of SIMD and multithreading parallel implementations is analysed. Finally, the feasibility of the developed algorithms is tested for the DSA and Elliptic Curve Digital Signature Algorithm (ECDSA) cryptosystems, by generating and verifying digital signatures of defined messages.

Chapter 5 features a comparison of the enhancements achieved through the use of SIMD and multithreading parallelism on general-purpose and embedded devices, so as to find the strengths and weaknesses of the different architectures. With embedded devices providing low power consumption and modest computational resources, while general-purpose devices often provide greater performance, it is of practical interest to understand how this dichotomy affects SIMD and multithreading parallel implementations of cryptographic applications.

Concluding remarks are presented in Chapter 6, based on the experimental results of the attained SIMD performance. Furthermore, future work is proposed.


2 Background

Contents
  2.1 Cryptographic Systems
  2.2 Embedded Computer Architecture
    2.2.1 Central Processing Unit
    2.2.2 Graphics Processing Unit
    2.2.3 Heterogeneous Systems
  2.3 Summary


Figure 2.1: Cipher Operation

The development and implementation of cryptosystems require in-depth knowledge of cryptographic operations and computer architecture. In this section, the theoretical foundations of these areas are presented. Firstly, the application of cryptosystems is overviewed, with an emphasis on how the use of public-key and private-key cryptosystems differs. Afterwards, a top-down description of public-key cryptography is given, by explaining how several protocols apply it, stating which public-key cryptographic systems are most commonly used, and identifying the mathematical foundations on which they are supported. Then, the construction of these cryptosystems is described, and some algorithms used to implement them are referred to.

Finally, the main techniques employed in designing CPUs and GPUs are described. The development platforms that employ those techniques, and which were used for testing the developed algorithms, are also presented.

2.1 Cryptographic Systems

Cryptography is based on the use of ciphers. A cipher is a technique used to conceal information or reveal it, whereby a clear text is converted into a cryptogram and vice-versa. The operation of most ciphers is established by an algorithm and a key. The algorithm establishes the way data is transformed, whereas the key is used as a parameter of the algorithm that modifies its behaviour in a complex manner. The correct decryption of a cryptogram implies the knowledge of a specific key, as depicted in Figure 2.1.

Cryptographic systems can be characterised as public-key or private-key cryptosystems. Public-key encryption is a cryptographic system where a pair of distinct but related keys is used – one is public and the other one is private – and it is infeasible to compute the private key from the value of the public key. When a message is ciphered using the public key, the resulting cryptogram can only be deciphered using the private key, and therefore confidentiality is achieved when transmitting the cryptogram. Some ciphering algorithms allow for the reverse – to cipher a message using the private key and to decipher the result using the public key – ensuring authorship. This pair of keys is always related to a certain entity, who must publicly announce the public key while ensuring that the corresponding private key is not disclosed. For private-key encryption schemes, a common key is used for both encryption and decryption. They are often more computationally efficient than public-key cryptosystems and are used to ensure the confidentiality of data exchanged by two or more entities. A more in-depth introduction to cryptography can be found in [26].

Public-key cryptography is typically used to solve the problem of sharing, over public networks, the secret key required by private-key cryptosystems. Supposing two entities want to establish a secure communication, one may use the other's public key to cipher the secret key. Afterwards, only the recipient, who has access to the corresponding private key, is able to decipher the cryptogram, and therefore it is assured that only the two entities know the value of the secret key.

The private and public key pair can also be used to establish the author of a message.


When someone uses their private key to cipher a message, it is as if they signed the message, since no one else has access to this key. Anyone else may decipher the resulting cryptogram, and thereby verify its authorship.

Public-key cryptosystems employ complex mathematical problems as the basis for the ciphering process. These are chosen such that there are no polynomial-time solutions, and are applied to large numbers, in order to assure the security strength of the system. This type of encryption is used in protocols such as Pretty Good Privacy (PGP), Secure Sockets Layer (SSL) and Secure Shell (SSH), providing a way to distribute keys and to generate digital signatures [27].

The first algorithm published to support this type of ciphering was RSA, which was presented in 1977 [7]. The security strength of this algorithm is based on the complexity of both factorisation and the computation of discrete logarithms, which make it infeasible to determine the private key based on the public one [28]. Later, several other algorithms were published, such as ElGamal's in 1985 [8]. Among those, algorithms based on ECs [9] [10] have recently received notable attention from the scientific community [29], due to their enhanced benefits in terms of performance when compared to RSA.

Modular arithmetic is employed in most public-key cryptosystems. Two integers A and B ∈ N are congruent modulo M if they leave the same remainder when divided by M. This relation is usually denoted by A ≡ B (mod M). For instance, 20 ≡ 33 (mod 13), since both 20 and 33 leave the remainder 7 when divided by 13. For a modulus M, every integer is congruent to some value 0 ≤ A < M. Even though every integer is congruent to an infinite number of others, it is usual to represent it using the smallest positive member of the equivalence class. As such, operations modulo M are performed solely using integers between 0 and M − 1.

Euler’s totient function is an important tool of number theory and has a great impact on modulararithmetic. This is the function Φ : N → N such that on any input n ∈ N, Φ(n) returns the count ofnatural numbers less than n that are co-prime to it; two numbers are said to be co-prime if they shareno common factors except for 1. For instance Φ(14) = 6, since 2, 4, 6, 8, 10, 12 and 14 share commonfactors with 14, but the following 6 numbers do not: 1, 3, 5, 9, 11 and 13. For a prime number p ∈ P,Φ(p) = p−1. Additionally, it can be proven that function Φ is multiplicative, that is Φ(ab) = Φ(a)Φ(b), andthat AΦ(n) ≡ 1(modn) [30], for A 6≡ 0(modn), and GCD(A,n) = 1, where GCD denotes the GreatestCommon Divisor. For example, 36 ≡ 729 ≡ 1(mod14).

For a modulus M, A^{-1} is defined to be the multiplicative inverse of A if A·A^{-1} ≡ 1 (mod M). It is known that this value only exists when A and M are co-prime. Furthermore, Euler's totient function provides the ability to compute A^{-1} ≡ A^{Φ(M)−1} (mod M), since A^{Φ(M)−1}·A ≡ A^{Φ(M)} ≡ 1 (mod M). This value can also be computed using the extended Euclidean algorithm, which, given A and M, computes X and Y such that X·A + Y·M = GCD(A, M) [30]. If A and M are co-prime, X·A + Y·M ≡ X·A ≡ GCD(A, M) ≡ 1 (mod M).
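The following minimal C sketch, added here for illustration and not taken from the thesis, computes a modular inverse with the extended Euclidean algorithm for single-word operands; cryptographic code would apply the same idea to multi-precision integers.

```c
/* Extended Euclidean modular inverse for single-word operands (illustration only). */
#include <stdint.h>
#include <stdio.h>

/* Returns X such that X*a ≡ 1 (mod m), assuming GCD(a, m) = 1. */
static int64_t mod_inverse(int64_t a, int64_t m)
{
    int64_t old_r = a, r = m;      /* remainders of the Euclidean algorithm */
    int64_t old_x = 1, x = 0;      /* Bézout coefficients for a             */

    while (r != 0) {
        int64_t q = old_r / r, t;
        t = old_r - q * r; old_r = r; r = t;   /* update remainder   */
        t = old_x - q * x; old_x = x; x = t;   /* update coefficient */
    }
    /* old_r = GCD(a, m); old_x may be negative, bring it into [0, m) */
    return ((old_x % m) + m) % m;
}

int main(void)
{
    /* Inverse of 3 modulo 14 is 5, since 3*5 = 15 ≡ 1 (mod 14). */
    printf("%lld\n", (long long)mod_inverse(3, 14));
    return 0;
}
```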

The RSA algorithm's operation is described in Table 2.1. Firstly, a user produces two random large prime numbers p and q, and computes n = pq. Afterwards, the public key e is randomly generated such that 0 < e < n and e is co-prime to Φ(n). Φ(n) can be computed, taking the aforementioned properties of the totient function into account, as Φ(n) = Φ(pq) = Φ(p)Φ(q) = (p − 1)(q − 1). A message can be ciphered by computing C ≡ P^e (mod n). The private key d that matches e is computed as d ≡ e^{-1} (mod Φ(n)). This fact assures the correctness of the deciphering process: C^d ≡ P^{e·d} ≡ P^{1+kΦ(n)} ≡ P (mod n), for some value of k. The value of d can easily be computed from e if the values of p and q are known. Despite this, the computation of p and q from n is a very complex operation. Another way to gain access to d would be to compute the discrete logarithm of P, given the base C and the modulus n, but there is no known efficient way to perform this operation. An example of a concrete standard that uses this procedure can be found in [31], and a discussion of its security strength in [32].

Table 2.1: RSA Algorithm's Operation

  Public Key (n, e): n is the product of two large prime numbers p, q; e is co-prime to Φ(n) = (p − 1)(q − 1)
  Private Key d:     d is chosen such that e · d ≡ 1 (mod Φ(n))
  Cipher:            C ≡ P^e (mod n)
  Decipher:          P ≡ C^d (mod n)
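A toy rendering of Table 2.1 in C is given below, with artificially small primes chosen only for this example (p = 61, q = 53, e = 17); these values are assumptions made for illustration, and real RSA keys are 1024 bits or longer, handled with multi-precision arithmetic.

```c
/* Toy RSA following Table 2.1, with insecurely small parameters (illustration only). */
#include <stdint.h>
#include <stdio.h>

/* Square-and-multiply modular exponentiation: base^exp mod m. */
static uint64_t mod_pow(uint64_t base, uint64_t exp, uint64_t m)
{
    uint64_t result = 1;
    base %= m;
    while (exp > 0) {
        if (exp & 1) result = (result * base) % m;
        base = (base * base) % m;
        exp >>= 1;
    }
    return result;
}

int main(void)
{
    uint64_t p = 61, q = 53;          /* two (small) primes                         */
    uint64_t n = p * q;               /* n = 3233                                   */
    uint64_t e = 17;                  /* public exponent, co-prime to Φ(n) = 3120   */
    uint64_t d = 2753;                /* private exponent: 17 * 2753 ≡ 1 (mod 3120) */

    uint64_t P = 65;                  /* plaintext message                          */
    uint64_t C = mod_pow(P, e, n);    /* cipher:   C ≡ P^e (mod n)                  */
    uint64_t R = mod_pow(C, d, n);    /* decipher: R ≡ C^d (mod n)                  */

    printf("C = %llu, recovered P = %llu\n",
           (unsigned long long)C, (unsigned long long)R);
    return 0;
}
```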

Another cryptosystem similar to RSA, used for signing messages, is the DSA [33]. Firstly, a key-length pair (L, N) is decided on. Secondly, an N-bit prime q is selected, and an L-bit prime modulus p is determined such that p − 1 is a multiple of q. Finally, a number g is computed such that q is the least positive number that satisfies g^q ≡ 1 (mod p). This may be achieved by setting g ≡ h^((p−1)/q) (mod p), for an arbitrary h (1 < h < p − 1) such that g ≢ 1. The (p, q, g) triplet is a parameter shared among the users of the system. Subsequently, a user selects a value x, such that 0 < x < q, as their private key. The corresponding public key is y ≡ g^x (mod p). In order to find x, an attacker would have to compute the discrete logarithm of y modulo p, which is a complex operation. The user then follows the procedure in Algorithm 2.1 for the generation of the signature of the message m. The function H(m) is a hash function that converts the message m to an N-bit output; it should be as secure as the chosen key length, and suitable hash functions can be found in [34]. The generation of r amounts to creating a new per-message key. s is computed in a way that not only involves the hashed message, but also enables the re-computation of r by another user, using the public key. Verification of the signature then proceeds as depicted in Algorithm 2.2.

Algorithm 2.1: DSA Message Signing
  Generate a random per-message k, 0 < k < q
  Compute r = (g^k mod p) mod q
  if r = 0 then
    Restart the procedure
  end
  Compute s = k^{-1}(H(m) + x·r) mod q
  if s = 0 then
    Restart the procedure
  end
  Output (r, s)

Algorithm 2.2: DSA Message Verification
  Verify that 0 < r < q and 0 < s < q
  Compute w = s^{-1} mod q; u1 = H(m)·w mod q; u2 = r·w mod q
  Determine if v = ((g^{u1} · y^{u2}) mod p) mod q is equal to r
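The sketch below walks through Algorithms 2.1 and 2.2 with deliberately tiny, insecure parameters (p = 23, q = 11, g = 4) and a fixed stand-in for H(m); all of these values are assumptions made for the example and do not come from the thesis.

```c
/* Toy DSA signing and verification (illustration only; parameters far too small). */
#include <stdint.h>
#include <stdio.h>

static uint64_t mod_pow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1; b %= m;
    for (; e; e >>= 1) { if (e & 1) r = r * b % m; b = b * b % m; }
    return r;
}

/* a^{-1} mod m via Euler/Fermat: m is prime here, so a^{m-2} mod m works. */
static uint64_t mod_inv(uint64_t a, uint64_t m) { return mod_pow(a, m - 2, m); }

int main(void)
{
    uint64_t p = 23, q = 11, g = 4;   /* domain parameters: q | p-1, g has order q mod p */
    uint64_t x = 7;                   /* private key, 0 < x < q                          */
    uint64_t y = mod_pow(g, x, p);    /* public key y = g^x mod p                        */
    uint64_t h = 3;                   /* stand-in for H(m) reduced mod q                 */

    /* Signing (Algorithm 2.1) with per-message secret k */
    uint64_t k = 5;
    uint64_t r = mod_pow(g, k, p) % q;
    uint64_t s = (mod_inv(k, q) * ((h + x * r) % q)) % q;

    /* Verification (Algorithm 2.2) */
    uint64_t w  = mod_inv(s, q);
    uint64_t u1 = (h * w) % q;
    uint64_t u2 = (r * w) % q;
    uint64_t v  = ((mod_pow(g, u1, p) * mod_pow(y, u2, p)) % p) % q;

    printf("r = %llu, s = %llu, valid: %s\n",
           (unsigned long long)r, (unsigned long long)s, v == r ? "yes" : "no");
    return 0;
}
```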

EC cryptosystems are based on the algebraic structure of ECs [35]. ECC emerged as a new type of cryptography that addresses some of the problems found in traditional cryptosystems, namely the DSA and the RSA referred to before. As previously stated, the latter systems rely on the difficulty of the discrete logarithm problem. As computational power increases, the key length must continue to grow, namely to prevent attacks by those with access to sufficient computational power, and this tendency leads to poorer performance. It is possible to define the discrete logarithm problem over ECs so as to mitigate this issue. Using EC groups for this purpose, it is possible to have smaller key lengths whilst still maintaining the same level of security.

The EC group and the point addition operation, which replaces DSA modular multiplication, are represented in Figure 2.2(a) for an elliptic curve defined over R.

Figure 2.2: Elliptic Curve Point Addition [1]: (a) over R^2; (b) over F_p

This operation adds the points A and B by drawing the line that crosses these two points and considering the point C, which is the vertical reflection of the point where the line crosses the curve a third time. The ECs used in practice are defined over finite fields. The same structure is shown in Figure 2.2(b) for F_p, over which the point addition operation is represented. The repeated application of this operation is named point multiplication, [k]P, which is analogous to the modular exponentiation operation. The EC discrete logarithm problem, which is to find the value of k, given P and B, such that P = [k]B, is generally a complex problem [36], and underpins EC cryptosystems. ECs are used extensively in this thesis, and are therefore further analysed in Section 4.2.
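For reference, the usual affine addition formulas on a short Weierstrass curve y² = x³ + ax + b over F_p are summarised below; they are standard textbook formulas rather than a listing from this thesis.

```latex
% Affine point addition on y^2 = x^3 + ax + b over F_p (standard formulas,
% included here for reference; not a listing from the thesis).
% For P = (x_1, y_1), Q = (x_2, y_2) with P \neq \pm Q:
\lambda = \frac{y_2 - y_1}{x_2 - x_1}, \qquad
x_3 = \lambda^2 - x_1 - x_2, \qquad
y_3 = \lambda (x_1 - x_3) - y_1
% For doubling (P = Q, y_1 \neq 0) the slope becomes:
\lambda = \frac{3 x_1^2 + a}{2 y_1}
% All operations are taken modulo p; the division is a multiplication
% by a modular inverse, which projective coordinates avoid.
```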

ECs can be used to implement the DSA algorithm, similarly to what is done with finite fields [33]: p should be replaced by the parameters of the curve, and g should be selected as a point such that q is the least positive number that verifies [q]g = O, where O is the identity element for point addition. Then, the generation of the private and public key pair (x, y) takes place as the selection of an arbitrary value x, such that 0 < x < q, and the computation of y = [x]g. Finally, signature generation and verification use Algorithms 2.1 and 2.2, but multiplication and exponentiation modulo p are replaced by point addition and point multiplication, respectively: in Algorithm 2.1, the value of g^k mod p is replaced by the x-coordinate of [k]g, and in Algorithm 2.2, the value of (g^{u1}·y^{u2}) mod p by the x-coordinate of [u1]g + [u2]y.

Modular arithmetic defined over large groups is computationally demanding, and the most time-consuming modular operations are division and multiplication [37]. The RSA algorithm does not involve division, but this operation is required in ECC to compute the slope of the line in Figure 2.2. This computation can be avoided by using projective coordinates [30], which use a third coordinate Z and replace the k modular divisions required for the computation of [k]P by a single division, used to convert the final resulting point to affine coordinates.
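As an illustration of this idea (using the standard Jacobian convention, assumed here rather than quoted from the thesis), a projective point represents an affine point as follows.

```latex
% Jacobian projective coordinates (standard convention; the thesis later uses
% a modified variant in Section 4.2):
(X, Y, Z) \;\text{represents}\; (x, y) = \left(\frac{X}{Z^2}, \frac{Y}{Z^3}\right) \pmod{p}
% Addition and doubling can then be written without modular divisions;
% a single inversion of the final Z recovers the affine result of [k]P.
```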

Generally, cryptographic operands are represented by hundreds or thousands of bits. However, most processors are only able to operate on w bits at a time, where w typically takes the value of 16, 32 or 64. As such, to compute the operands' product, their representations should be divided into w-bit words, and the words multiplied pairwise. This procedure is called multi-precision multiplication.
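A minimal sketch of this procedure, for operands stored as arrays of 32-bit words in little-endian word order, is shown below; it illustrates the generic operand-scanning method described here, not the exact listings given later in Chapter 3.

```c
/* Schoolbook (operand-scanning) multi-precision multiplication, illustration only. */
#include <stdint.h>
#include <stdio.h>

/* c[0..2s-1] = a[0..s-1] * b[0..s-1], least significant word first. */
static void mp_mul(uint32_t *c, const uint32_t *a, const uint32_t *b, int s)
{
    for (int i = 0; i < 2 * s; i++) c[i] = 0;

    for (int i = 0; i < s; i++) {
        uint32_t carry = 0;
        for (int j = 0; j < s; j++) {
            /* 32x32 -> 64-bit partial product plus previous word and carry */
            uint64_t t = (uint64_t)a[i] * b[j] + c[i + j] + carry;
            c[i + j] = (uint32_t)t;         /* low 32 bits stay in place */
            carry    = (uint32_t)(t >> 32); /* high 32 bits propagate    */
        }
        c[i + s] = carry;
    }
}

int main(void)
{
    /* (2^64 - 1) * (2^32 + 1), with s = 2 words per operand */
    uint32_t a[2] = { 0xFFFFFFFFu, 0xFFFFFFFFu };
    uint32_t b[2] = { 0x00000001u, 0x00000001u };
    uint32_t c[4];
    mp_mul(c, a, b, 2);
    printf("%08x %08x %08x %08x\n", c[3], c[2], c[1], c[0]);
    return 0;
}
```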

Multi-precision modular multiplication is classically performed by firstly computing the value of the product, C = AB, and afterwards taking the remainder of the division by the modulus, T = C − QM, where Q = ⌊C/M⌋. The computation of Q requires a long division operation, which can be cumbersome when applied to large integers. As a way to speed up this algorithm, Barrett proposed in 1984 the pre-computation of the value of 1/M using a fixed-point approach, in order to estimate Q using only multiplications and logic shifts [16]. A similar algorithm was proposed by Montgomery in 1985, in which Q is redefined as the number that satisfies the condition C + QM ≡ 0 (mod 2^n) [15]. By definition, C + QM is divisible by 2^n and, as such, it is possible to compute T = (C + QM)·2^{-n} ≡ AB·2^{-n} (mod M) efficiently, using only multiplication and logic shifts. This operation can be repeatedly used to compute modular exponentiation: if A and B are each multiplied by 2^n, the result of the multiplication algorithm corresponds to the product of A and B multiplied by the same value. At the end of an exponentiation algorithm using the Montgomery reduction, with base A and exponent e, T has the value T ≡ A^e·2^n (mod M), and therefore it should be multiplied by 2^{-n} in order to obtain the final result.
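The following single-word sketch illustrates the principle behind Montgomery reduction with R = 2^32; the modulus and operands are assumptions chosen for the demonstration, and the thesis applies the same idea to multi-precision operands in Chapter 3.

```c
/* Single-word Montgomery reduction (REDC) with R = 2^32, illustration only. */
#include <stdint.h>
#include <stdio.h>

/* m' = -M^{-1} mod 2^32, computed by Newton iteration (valid for odd M). */
static uint32_t neg_inv32(uint32_t m)
{
    uint32_t inv = m;                 /* correct to 3 bits for odd m           */
    for (int i = 0; i < 4; i++)
        inv *= 2 - m * inv;           /* each step doubles the correct bits    */
    return (uint32_t)(0u - inv);      /* return -M^{-1} mod 2^32               */
}

/* REDC(c) = c * 2^{-32} mod m, for 0 <= c < m * 2^32. */
static uint32_t redc(uint64_t c, uint32_t m, uint32_t m_prime)
{
    uint32_t q = (uint32_t)c * m_prime;        /* Q such that C + QM ≡ 0 (mod 2^32) */
    uint64_t t = (c + (uint64_t)q * m) >> 32;  /* exact division by 2^32            */
    return (t >= m) ? (uint32_t)(t - m) : (uint32_t)t;
}

int main(void)
{
    uint32_t m = 2147483659u;                  /* an odd modulus, assumed for the demo */
    uint32_t m_prime = neg_inv32(m);
    uint32_t a = 123456789u, b = 987654321u;

    /* Montgomery form a' = a * 2^32 mod m (computed with plain % for brevity) */
    uint32_t am = (uint32_t)(((uint64_t)a << 32) % m);
    uint32_t bm = (uint32_t)(((uint64_t)b << 32) % m);

    uint32_t prod_m = redc((uint64_t)am * bm, m, m_prime);  /* = a*b*2^32 mod m       */
    uint32_t prod   = redc(prod_m, m, m_prime);             /* leave Montgomery form  */

    printf("%u\n", prod);                                           /* (a*b) mod m    */
    printf("%llu\n", (unsigned long long)(((uint64_t)a * b) % m));  /* cross-check    */
    return 0;
}
```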

Multi-precision arithmetic may be accelerated on massively parallel devices, such as GPUs, by exploiting the RNS. Under this system, numbers are represented by their remainders when divided by the set of co-primes r_0, r_1, ..., r_{n−1} that form the RNS basis. Additions, subtractions and multiplications modulo R = ∏_{i=0}^{n−1} r_i can be implemented with O(1) time complexity, since these operations can be performed independently for each modulus of the set, whereas operations such as reduction by another modulus, integer division and integer comparison are generally complex. A Montgomery-like algorithm may be used to convert operations modulo M to this representation, by redefining Q as the value that satisfies C + QM ≡ 0 (mod R). Since C + QM is divisible by R, it cannot be represented modulo R, and therefore an extra base, with modulus R' = ∏_{i=0}^{n−1} r'_i > R and R' co-prime to R, is required. Using this approach, C = AB is first computed in both bases, and Q ≡ −C·M^{-1} (mod R) is computed in the first base, so that the result is reduced modulo R. Afterwards, Q is extended to the second base, and T = (C + QM)·R^{-1} is computed there. Finally, the result is extended back to the first base, so that the output of the algorithm may be re-used as the input of further operations. There are several approaches to perform base extensions, which will be analysed in Section 3.2.5.
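A minimal RNS illustration is given below, using the toy base {3, 5, 7} (an assumption made for the example, not a base used in the thesis): numbers are stored as residues, multiplied channel by channel, and converted back with the Chinese Remainder Theorem. The base extension required by the RNS Montgomery algorithm is not shown.

```c
/* Toy RNS representation and channel-wise multiplication, illustration only. */
#include <stdint.h>
#include <stdio.h>

#define N 3
static const uint64_t base[N] = { 3, 5, 7 };   /* pairwise co-prime moduli, R = 105 */

static void to_rns(uint64_t x, uint64_t out[N])
{
    for (int i = 0; i < N; i++) out[i] = x % base[i];
}

static uint64_t mod_pow(uint64_t b, uint64_t e, uint64_t m)
{
    uint64_t r = 1; b %= m;
    for (; e; e >>= 1) { if (e & 1) r = r * b % m; b = b * b % m; }
    return r;
}

/* CRT reconstruction: x = sum_i x_i * (R/r_i) * ((R/r_i)^{-1} mod r_i) mod R */
static uint64_t from_rns(const uint64_t x[N])
{
    uint64_t R = 1, acc = 0;
    for (int i = 0; i < N; i++) R *= base[i];
    for (int i = 0; i < N; i++) {
        uint64_t Ri  = R / base[i];
        uint64_t inv = mod_pow(Ri % base[i], base[i] - 2, base[i]); /* r_i prime here */
        acc = (acc + x[i] * Ri % R * inv) % R;
    }
    return acc;
}

int main(void)
{
    uint64_t a[N], b[N], c[N];
    to_rns(8, a);
    to_rns(9, b);
    for (int i = 0; i < N; i++)
        c[i] = a[i] * b[i] % base[i];   /* each channel is independent */
    printf("%llu\n", (unsigned long long)from_rns(c));  /* 72 = 8*9, since 72 < R */
    return 0;
}
```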

2.2 Embedded Computer Architecture

In recent times, embedded systems, and in particular autonomous mobile devices, have grown in importance. They commonly contain one or more programmable processing cores, which establish an Instruction Set Architecture (ISA) for interfacing algorithms with the underlying hardware. Often, real-time execution is a requisite of an embedded application, although in some cases only some segments of the application have restrictions on the execution time. There is also the need to minimise power consumption and memory usage. With lower power consumption, autonomous devices require smaller batteries, while the need for cooling the device is also reduced. At the same time, the minimisation of memory size reduces not only the cost of the circuit but also power consumption. Embedded processors are designed taking the aforementioned features into consideration, and when optimising code for these platforms it is crucial to understand their intrinsic characteristics.

2.2.1 Central Processing Unit

In this section, several architectural developments are described, with a focus on the approaches used to extract parallelism on CPUs. With parallelism, performance is increased by taking advantage of the enhancement of Very-Large-Scale Integration (VLSI) technology [38]. A more detailed description of computer architecture can be found in [11]; herein, the attention is focused on the following architectural techniques: pipelining, instruction set extensions, and multiple instruction issue.

1. Pipelining


Figure 2.3: Example of Instruction Execution using Pipelining

Classically, computers repeatedly performed the Von Neumann cycle, corresponding to the following steps: i) Fetch an instruction from memory; ii) Decode the instruction; iii) Fetch the operands required by the instruction; iv) Execute the instruction; v) Write back the result.

It is possible to increase the throughput of instructions by overlapping the execution steps of different instructions, with pipelining, as depicted in Figure 2.3. However, this limits the range of instructions that it is possible to implement, as they need to be split into the same number of stages and last similar amounts of time. The effectiveness of this technique has led to the massive employment of Reduced Instruction Set Computer (RISC) architectures, whose ISA meets these constraints.

The further splitting of each of these steps is called super-pipelining. However, when an instruction requires the result of a previous instruction that is still being processed, there is a need to introduce stalls, which increasingly degrade performance as the depth of the super-pipeline grows. As such, the application of this technique is limited by the parallelism that can be extracted from the target application, as well as by physical phenomena.

2. Complex Instructions

Whereas with pipelining it is possible to achieve higher frequencies, implementing more complex instructions, which enable, for example, the processing of a larger number of operands, increases the amount of work that is done per clock cycle. There are two techniques that can be applied to achieve this effect: Single Instruction Multiple Data (SIMD) and Single Instruction Multiple Operations (SIMO). SIMD technologies are the most commonly available among embedded devices, and are therefore the focus from now on, while SIMO is most frequently implemented on Very Long Instruction Word (VLIW) architectures.

SIMD extensions to the architecture and the instruction set make it possible to operate on special registers as vectors of elements of the same data type, with instructions operating on these lanes simultaneously by performing the same operation in all of them. These registers usually have a larger capacity than those used for general processing, as depicted in Figure 2.4 (a short NEON example is given at the end of this subsection).

3. Multiple Instruction Issue

Apart from exploiting more powerful instructions, several computer systems, denominated superscalar, issue more than one instruction per clock cycle, and many provide multiple processing cores that run independently and in parallel.

To take advantage of the multiple cores, the programmer may define several instruction streams, which are processed by these Multiple Instruction Multiple Data (MIMD) architectures.


Figure 2.4: SIMD Operation for 4 Lanes [2]

These systems often have separate control structures, but share part of the memory system. In order to exploit their capacities, it is necessary to explicitly identify dependencies in the application and to establish communication and synchronisation between the different instruction and data streams.
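As a concrete illustration of the 4-lane operation in Figure 2.4, the following small sketch uses ARM NEON intrinsics (written for this text, not one of the thesis listings): four 32-bit lanes are loaded, then multiplied and accumulated with a single instruction each.

```c
/* 4-lane NEON multiply-accumulate, illustration only (compile for an ARM target
 * with NEON, e.g. -mfpu=neon on ARMv7). */
#include <arm_neon.h>
#include <stdio.h>

int main(void)
{
    uint32_t a[4] = { 1, 2, 3, 4 };
    uint32_t b[4] = { 10, 20, 30, 40 };
    uint32_t c[4] = { 5, 5, 5, 5 };
    uint32_t r[4];

    uint32x4_t va = vld1q_u32(a);          /* load 4 lanes into a 128-bit register */
    uint32x4_t vb = vld1q_u32(b);
    uint32x4_t vc = vld1q_u32(c);

    uint32x4_t vr = vmlaq_u32(vc, va, vb); /* r = c + a * b, on all 4 lanes at once */
    vst1q_u32(r, vr);

    printf("%u %u %u %u\n", r[0], r[1], r[2], r[3]);  /* 15 45 95 165 */
    return 0;
}
```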

2.2.2 Graphics Processing Unit

Graphics Processing Units (GPUs) grew in popularity during the end of the 1990s, enabling a realistic, high-performance 3D experience. They traditionally provided a hard-wired, fixed-function implementation of the graphics pipeline, responsible for transforming vertexes and texturing pixels. It was not long until this implementation started to be replaced by increasingly programmable units, which introduced greater flexibility [39].

Graphics computing requires extremely high arithmetic throughput, while tolerating considerable latency, since images are often displayed only once every 16 milliseconds [39]. The increasing computational power available through GPUs has instilled research on general-purpose computing on these platforms.

Most conceptual ideas of GPU design can be found in [40], where the NVidia Tesla architecture is described. Herein, a synopsis is presented, and the main blocks of a typical GPU architecture are depicted in Figure 2.5: multiple small shader cores, hosting multiple thread warps and featuring pipelined SIMD arithmetic units. These features are described in the following text.

1. Core Size Reduction

Most CPUs employ a variety of techniques in order to reduce the latency of algorithms. These include out-of-order execution, branch prediction, memory pre-fetching and large data caches. As stated before, GPU computation generally tolerates the delays mitigated by these techniques and, as such, it is possible to remove part of the hardware associated with them, in order to free circuit area and employ a larger number of cores.

2. SIMD Processing

In order to further reduce the complexity of managing instructions across many Arithmetic-Logic Units (ALUs), it is common to group instruction streams into SIMD warps, as depicted in the Shader Core module in Figure 2.5. With this architecture, each warp of threads executes the same instruction simultaneously, on different data, in parallel scalar pipelines. It should be noted that, since the control structure is shared by all the elements of the warp, there is a degradation in performance whenever a branch occurs; as the same instruction is executed by all threads, there is a need to introduce stalls in the threads that should not execute that instruction.

3. Latency Hiding

The removal of large caches, as well as the memory pre-fetching module leads to a larger memoryaccess time, especially when accessing large amounts of data. This is aggravated by the fact that


Figure 2.5: GPU Modules Diagram [3]

memory requests can take several hundreds of cycles due to contention in the interconnectionnetwork and row-activate and precharge overheads at the DRAM. The downtime related to thesestalls can be avoided by interleaving the processing of many warps, as represented by the WarpScheduler in Figure 2.5.

2.2.3 Heterogeneous Systems

The development and testing of algorithms in this thesis were performed on platforms with characteristics similar to the embedded systems used in current mobile equipment. The main development platform is an ODROID-XU+E [41]. This platform features a Cortex-A15 quad-core CPU and a Cortex-A7 quad-core CPU; the two quad-cores cannot be used simultaneously. While the A15 features a superscalar, out-of-order execution pipeline, running at up to 1.6 GHz, the A7 employs a partial dual-issue, in-order execution pipeline, running at up to 1.2 GHz. The NEON SIMD extensions are available on all cores, as well as a 32 kB L1 instruction cache and a 32 kB L1 data cache. Whereas the A15 has a 2 MB L2 unified cache, the A7 has a 512 kB L2 unified cache. The board also features a PowerVR Series5XT SGX544 MP3 GPU, which has 3 shader cores operating at 600 MHz. Each core can host up to 16 instruction streams, which execute independently on 4 ALUs. It employs another level of SIMD parallelism, whereby 4 floating-point operands can be processed simultaneously by each ALU.

The NEON extensions are supported by a 128-bit SIMD architecture that provides 32 integer registers, each 64 bits wide (from a programmer's point of view, they can also be seen as 16 registers of 128 bits). NEON instructions treat these registers as vectors of elements of the same data type, performing the same operation on all elements of a vector simultaneously.

Due to limitations of the previous GPU device, namely the lack of synchronisation between threads, another development board was used for the RNS Montgomery multiplication algorithm. This development platform is a SYS6440, based on Qualcomm technologies. It deploys a quad-core Krait CPU, running at up to 1.7 GHz, with an architecture similar to that of the A15, and a Qualcomm Adreno 320 GPU. There is little publicly available information on the architecture of this GPU, but it is thought to have 4 Compute Units, each with 32 Processing Elements, resulting in 128 processing elements [42], running at 400 MHz.


2.3 Summary

In this section, cryptography was introduced, and it was explained how public-key cryptography dif-fers from private-key cryptography: whereas the latter provides an efficient way to exchange data in aconcealed manner, the former solves the problem of sharing keys over unreliable networks and enablesthe implementation of digital signatures. These characteristics are exploited by protocols such as PGP,SSL and SSH, and implemented using cryptosystems such as RSA, DSA, and ECC. The mathematicalproblems that underpin these systems were discussed, as well as how the cryptosystems operate.

Various methods of performing multi-precision modular multiplication were also introduced. This operation is the most time-consuming for public-key cryptosystems, and therefore their performance is highly dependent on efficient implementations of modular multiplication.

Furthermore, the most common characteristics of CPU and GPU architectures were explained. Forthe CPU, it was shown how instruction throughput was increased using pipelining. This technique islimited by the parallelism that can be extracted from the target application and physical phenomena. Af-terwards, it was stated how the introduction of more complex instructions, such as SIMD ISA extensions,and issuing multiple instructions simultaneously may be used to increase performance. For the GPU,system designers generally focus on providing platforms with high throughput, by designing a largeamount of small cores, which employ the SIMD processing principle and interleaving thread executionto hide memory access latency. The GPU platform performance may be limited when highly divergentor highly memory intensive tasks are executed. Finally, the development platforms, which employ thesesystems were also presented.


3 Multi-Precision Arithmetic

Contents
3.1 Multi-Precision Multiplication
    3.1.1 Operand-Scanning Method
    3.1.2 Product-Scanning Method
    3.1.3 Operand-Caching Method
3.2 Multi-Precision Modular Multiplication
    3.2.1 Classic Reduction
    3.2.2 Barrett Reduction
    3.2.3 Montgomery Reduction
    3.2.4 Parallel Execution of a Single Modular Multiplication
    3.2.5 RNS-based Modular Multiplication
3.3 Implementation and Experimental Results
    3.3.1 Multi-precision Multiplication
    3.3.2 Multi-Precision Modular Multiplication
3.4 Summary


In this chapter, parallel algorithms are proposed for implementing the arithmetic operations that support public-key cryptosystems. Since cryptographic operands are hundreds or thousands of bits wide, and processors typically only operate on 16, 32 or 64 bits at a time, multi-precision arithmetic is required. Initially, several multi-precision multiplication algorithms are analysed, with an emphasis on how they can be parallelised using SIMD extensions. Afterwards, several multi-precision modular multiplication algorithms are proposed, and several approaches to exploit SIMD and multithreading parallelism are considered. Finally, the performance of the proposed algorithms is analysed in order to support design options for the cryptosystems presented in the next chapter.

3.1 Multi-Precision Multiplication

The security strength of most public-key cryptosystems is supported on the algorithmic complexityof factorisation and computation of discrete logarithms. In order to prevent the disclosure of secrets ina feasible amount of time, these problems are applied to large numbers – represented by hundreds orthousands of bits. As such, it is crucial to have efficient implementations of multi-precision operations,namely multiplication.

In this section, several methods are presented to compute multi-precision multiplication. Sub-quadratic methods, such as the Karatsuba [43], Toom-Cook [44] and Fast Fourier Transform (FFT)-based [45] algorithms, are not considered, since they only present significant speedups for numbers larger than those usually used for public-key cryptography, and require more resources and memory accesses than those considered for embedded devices [4].

3.1.1 Operand-Scanning Method

The most straightforward way to compute the multiplication of s-word operands A and B is the Operand-Scanning method. This method can be implemented using two nested loops: at iteration i, the outer loop loads the value of A[i], whereas the inner loop loads the value of B[j], for j = 0, ..., s − 1, and multiplies it by A[i]. The partial product is then accumulated in the intermediate result column C[i + j], along with the carry from the previous column. Figure 3.1 illustrates how the partial products are sequentially computed. This method uses a row-wise approach, depicted in the figure by arrows representing the computational flow.

The high regularity of memory accesses fits the caching system well. The number of memory accesses has a complexity of O(3s²), since at each step C[i + j] and B[j] need to be loaded and C[i + j] needs to be stored after the partial result has been computed. However, if the register file is large enough to store the entire C array throughout the algorithm, only O(s²) memory accesses are needed.

3.1.2 Product-Scanning Method

In contrast to the previous method, the Product-Scanning method computes multi-precision multiplication using a column-wise approach. It was proposed by Comba [13] as a way to reduce the number of memory accesses. Since multiplication columns in cryptography do not usually have more than 2^w partial products, a 3-word accumulator makes it possible to store the sum of all partial products of a column without storing intermediate results in memory, since s < 2^{3w}/2^{2w} = 2^w, where w denotes the word bit-length. After one column has been processed, the first word of the accumulator is stored in memory as part of the final result and the accumulator is right-shifted one word. Computation then continues on the next column, so that the carry is propagated.
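To make the column-wise accumulation concrete, the following minimal sequential sketch (illustrative only, not the thesis implementation) computes a Product-Scanning multiplication in C for 32-bit words, keeping the 3-word column accumulator as a 64-bit value plus an extra carry word:

#include <stdint.h>

/* Sequential Product-Scanning (Comba) multiplication: c[0..2s-1] = a*b.
 * The running column sum is kept in a 96-bit accumulator (acc, acc_hi). */
void mul_comba(uint32_t *c, const uint32_t *a, const uint32_t *b, int s)
{
    uint64_t acc = 0;     /* low 64 bits of the column sum */
    uint32_t acc_hi = 0;  /* third accumulator word (overflow) */

    for (int k = 0; k < 2 * s - 1; k++) {
        int i_min = (k < s) ? 0 : k - s + 1;
        int i_max = (k < s) ? k : s - 1;
        for (int i = i_min; i <= i_max; i++) {
            uint64_t p = (uint64_t)a[i] * b[k - i];
            acc += p;
            if (acc < p)          /* carry out of the low 64 bits */
                acc_hi++;
        }
        c[k] = (uint32_t)acc;                          /* emit column word  */
        acc = (acc >> 32) | ((uint64_t)acc_hi << 32);  /* shift accumulator */
        acc_hi = 0;
    }
    c[2 * s - 1] = (uint32_t)acc;  /* top word of the product */
}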


Figure 3.1: Operand-Scanning Method Illustrated for 8-word (s = 8) Large Operands [4]

Figure 3.2: Product-Scanning Method Illustrated for 8-word (s = 8) Large Operands [4]

The procedure is represented in Figure 3.2. At each step, it suffices to load A[i] and B[j], so the total number of memory accesses has a complexity of O(2s²). If both operands fit in the register file during the execution of the algorithm, the total number of memory accesses is reduced to O(4s), since it is only necessary to load A and B and later store C.

The product scanning method fits SIMD parallelism well. For an n-lane SIMD technology, where eachlane has a 2w-bit width that is twice that of each of the words of A and B, it is possible to use 2 ×2w-bitwide registers as the accumulator for a column. Then, n columns may be processed simultaneously,by performing n× w-bit multiplications in parallel, accumulating the product on the 2w-bit registers, andadding the w most significant bits to the register holding the carry. This method is represented in Figure3.3, for n = 2, where the dashed lines indicate multiplications performed in parallel.

3.1.3 Operand-Caching Method

The Operand-Caching method tries to reduce the number of memory accesses even further, throughthe re-usage of the operands in the registers. It follows a column-wise approach, divided by rows [4].

The procedure is illustrated in Figure 3.4, and can be described as follows:

1. The multiplication rhombus is divided into r = ⌊s/e⌋ rows plus an initialisation block denoted by b_init. e should be chosen such that f = 2e + 3 registers are available to store 2e operand words and a 3-word accumulator;

2. binit is computed using one of the previous methods;

3. Each row p = r − 1, · · · , 0 is processed in four sub-steps:


Figure 3.3: Parallel Product-Scanning Method Illustrated for 8-word (s = 8) Large Operands

Figure 3.4: Operand-Caching Method Illustrated for 8-words Operands and e = 3 [4]

3.1. The values of A[i], for i = pe, ..., (p + 1)e − 1, and B[j], for j = 0, ..., e − 1, are loaded. The e columns belonging to sub-step 1 are then processed using the Product-Scanning method;

3.2. During the processing of sub-step 2, the operandsA[i] are kept the same, while one word ofBcan be discarded and another one has to be loaded from one column to the next. The partialproducts are added using a multiply-accumulate approach, by adding the partial products tothe previously computed values of C. s− e(p+ 1) columns are processed;

3.3. Sub-step 3 is similar to sub-step 2, except that operands B[j] are kept the same, while thewords of A kept in registers need to be updated at each new column;

3.4. The last sub-step computes the remaining partial products. At this point, all the requiredoperands are stored in registers.

binit can be discarded for evaluating the complexity of memory accesses. At each row, during thefirst sub-step, 2e loads are required, and e stores to save the intermediate results. As sub-steps 2 and 3are processed, 2 loads are required per column processed (a new operand has to be loaded, as well asthe partial C result) and 1 store. Finally, for the last part, e − 1 partial results are stored. Therefore thetotal number of memory accesses is:

Σ_{p=0}^{r−1} (3e + 2 × (s − e(p + 1)) × 3 + e − 1) = −3er² + (6s + e − 1)r    (3.1)

Since r = ⌊s/e⌋, it is possible to infer the complexity as O(6sr − 3er²) = O(6s²/e − 3s²/e) = O(3s²/e).

Parallelisation of the Operand-Caching method follows a reasoning similar to the Product-Scanning method: for an n-lane SIMD technology, after having divided the multiplication rhombus into r rows, n


columns can be processed at a time. For n = 2, choosing even values for s and e yields better results,since all sub-steps, except for sub-step 4, will also contain an even number of columns: the first sub-stepprocesses e columns, which is an even number, and sub-steps 2 and 3 s − e(p + 1) columns, which isalso an even number, since it results from the subtraction of 2 even numbers. When multiplying twonumbers, the length of which is an odd number of words, it is always possible to pad them with zeroes,so that multiplication takes place with an even s.

3.2 Multi-Precision Modular Multiplication

Modular multiplication is the basis of modular exponentiation, which is the core operation of the RSAand the DSA cryptosystems. It is also featured on cryptosystems based on ECs, also being the mosttime-consuming operation of these systems. As such, the efficiency of most cryptosystems is dependenton the implementation of modular multiplication.

Two integers, A andB, are said to be congruent moduloM , denoted asA ≡ B( mod M), ifM divides(A − B). The integer M is called the modulo of the congruence. They satisfy the following properties[37]:

1. A ≡ B(modM), iff A and B leave the same remainder when divided by M ;

2. A ≡ A(modM);

3. If A ≡ B(modM) then B ≡ A(modM);

4. If A ≡ B(modM) and B ≡ C(modM), then A ≡ C(modM);

5. If A ≡ A1 (mod M) and B ≡ B1 (mod M), then A + B ≡ A1 + B1 (mod M) and AB ≡ A1B1 (mod M).

Finally, the multiplicative inverse of A modulo M , is an integer A−1, such that AA−1 ≡ 1(modM). Ais invertible iff GCD(A,M) = 1. This value can be computed using the extended Euclidean algorithm[30], which, given A and M , computes X and Y , such that XA + YM = GCD(A,M). If A and M

are co-prime, XA + YM ≡ XA ≡ GCD(A,M) ≡ 1(modM). The multiplicative inverse can also becomputed using Euler’s theorem [30].
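As a small illustration of the extended Euclidean computation of the inverse, the following single-precision C sketch (illustrative only, assuming GCD(A, M) = 1; the thesis itself operates on multi-precision values) returns A⁻¹ mod M:

#include <stdint.h>

/* Modular inverse via the extended Euclidean algorithm: returns X such that
 * A*X ≡ 1 (mod M), assuming GCD(A, M) = 1. */
int64_t mod_inverse(int64_t a, int64_t m)
{
    int64_t old_r = a, r = m;   /* remainders */
    int64_t old_x = 1, x = 0;   /* coefficients of a */

    while (r != 0) {
        int64_t q = old_r / r;
        int64_t t;
        t = old_r - q * r; old_r = r; r = t;
        t = old_x - q * x; old_x = x; x = t;
    }
    /* old_r == GCD(a, m); old_x satisfies a*old_x + m*y = GCD(a, m) */
    return ((old_x % m) + m) % m;
}

For instance, mod_inverse(2501, 3571) returns 2473, the inverse used in Example 3.2.1 below.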

It is common to use the set Z ∈ [0,M [ as the set of representatives for numbers congruent to Z

modulo M . The process of determining Z ∈ [0,M [ for a number larger than M is called reduction.Example 3.2.1 depicts some of the modular operations referred before.

Example 3.2.1 Example of operations modulo 3571

2001 × 2501 ≡ 1530 (mod 3571)
2001 + 2501 ≡ 931 (mod 3571)
−2501 ≡ 1070 (mod 3571)
2001 − 2501 ≡ 2001 + 1070 ≡ 3071 (mod 3571)
2501⁻¹ ≡ 2473 (mod 3571)
2001/2501 ≡ 2001 × 2473 ≡ 2638 (mod 3571)

3.2.1 Classic Reduction

Classically, modular multiplication is computed by first performing the long multiplication of operands A and B, dividing the product by the modulo M, and finally obtaining the remainder of the division, as described in Algorithm 3.1.


Algorithm 3.1: Classic Modular Multiplication
Z ← AB
Q ← ⌊Z/M⌋
Z ← Z − QM

The reduction of the product depicted on Example 3.2.1, using the classic algorithm, is instantiatedin Example 3.2.2.

Example 3.2.2 Example of classic reduction modulo 3571

Z ← 2001 × 2501 = 5004501
Q ← ⌊5004501/3571⌋ = 1401
Z ← 5004501 − 1401 × 3571 = 1530

3.2.2 Barrett Reduction

The long division used to compute Q is a cumbersome operation. Barrett [16] suggested pre-computing a scaled estimate of the modulo's reciprocal, so that the value of Q can be estimated using only multiplication and shift operations. By pre-computing µ = ⌊2^{2n}/M⌋, such that 2^{n−1} < M < 2^n, modular multiplication can be performed as in Algorithm 3.2. The estimation error will never be greater than 3 (3.2), and therefore the while cycle will iterate at most 3 times.

Algorithm 3.2: Barrett Modular Multiplication
Z ← AB
Q ← ⌊⌊Z/2^n⌋ · µ / 2^n⌋
Z ← Z − QM
while Z ≥ M do
    Z ← Z − M
end

⌊Z/M⌋ ≤ ⌊(⌊Z/2^n⌋ + 1)(⌊2^{2n}/M⌋ + 1)/2^n⌋
     ≤ ⌊⌊Z/2^n⌋⌊2^{2n}/M⌋/2^n + (⌊Z/2^n⌋ + ⌊2^{2n}/M⌋ + 1)/2^n⌋
     ≤ ⌊⌊Z/2^n⌋⌊2^{2n}/M⌋/2^n + ((2^n − 1) + 2^{n+1} + 1)/2^n⌋
     ≤ Q + 3    (3.2)

The following example shows how the Barrett algorithm performs reduction. Base 10 was usedinstead of 2, for simplicity.

Example 3.2.3 Example of Barrett reduction modulo 3571

n ← 4
µ ← ⌊10^8/3571⌋ = 28003
Z ← 2001 × 2501 = 5004501
Q ← ⌊⌊5004501/10⁴⌋ × 28003/10⁴⌋ = ⌊500 × 28003/10⁴⌋ = 1400
Z ← 5004501 − 1400 × 3571 = 5101
Z > 3571 ⇒ Z ← 5101 − 3571 = 1530
Z < 3571 ⇒ No more subtractions required
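The following single-precision C sketch mirrors Algorithm 3.2 for a small modulus (illustrative only; it assumes a, b < M, 2^{n−1} < M < 2^n, M small enough for the intermediate products to fit in 64 bits, and a pre-computed mu = ⌊2^{2n}/M⌋):

#include <stdint.h>

/* Barrett reduction of z = a*b modulo m, for a modulus of up to 16 bits. */
uint32_t barrett_mulmod(uint32_t a, uint32_t b, uint32_t m,
                        uint32_t n, uint64_t mu)
{
    uint64_t z = (uint64_t)a * b;
    uint64_t q = ((z >> n) * mu) >> n;   /* estimate of floor(z / m) */
    uint64_t r = z - q * m;              /* off by at most 3*m       */
    while (r >= m)
        r -= m;
    return (uint32_t)r;
}

For M = 3571, n = 12 and mu = ⌊2^24/3571⌋ = 4698, barrett_mulmod(2001, 2501, 3571, 12, 4698) first estimates q = 1400 and, after one subtraction, returns 1530, matching Example 3.2.3.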

3.2.3 Montgomery Reduction

Later, Montgomery [15] presented the idea that, instead of zeroing out the most significant bits of Z, Q could be computed such that, when multiplied by M and added to Z, it would zero out the least significant bits.


This operation is then followed by a shift, effectively replacing long division by an inexpensive logic shift. The value of Q that satisfies this condition is given by Z + QM ≡ 0 (mod R) ⇔ Q ≡ −M⁻¹Z (mod R), where R = 2^n. After pre-computing the value of M′ ≡ −M⁻¹ (mod R), where R > M and R is co-prime to M, Algorithm 3.3 can be used to determine the value of Z ≡ ABR⁻¹ (mod M). Generally, R is chosen to be a power of two, so that division by R may be implemented with a logic shift.

Algorithm 3.3: Montgomery Modular Multiplication
Z ← AB
Q ← M′Z (mod R)
Z ← Z + QM
Z ← Z/R
if Z ≥ M then
    Z ← Z − M
end

To exploit this operation, numbers should be represented in the Montgomery domain: A is represented as AR mod M and B as BR mod M. Algorithm 3.3 computes multiplication in this domain, since (AR)(BR)R⁻¹ ≡ (AB)R (mod M), i.e., the result is the Montgomery representation of AB. The value of Z will always be less than 2M (3.3), and therefore a single conditional subtraction suffices to ensure that Z < M.

Z = (AB + QM)/R < (RM + RM)/R = 2M    (3.3)

The following example depicts this algorithm. R is chosen to be a power of 10 for illustrative purposes.

Example 3.2.4 Example of Montgomery reduction modulo 3571

n ← 4
R ← 10⁴
M′ ← 1669 (1669 × 3571 = 5959999 = 596 × 10⁴ − 1)
A ← 2001 × 10⁴ mod 3571 = 1687
B ← 2501 × 10⁴ mod 3571 = 2287
Z ← 1687 × 2287 = 3858169
Q ← 8169 × 1669 mod 10⁴ = 4061
Z ← (Z + QM)/R = (3858169 + 4061 × 3571) × 10⁻⁴ = 18360000 × 10⁻⁴ = 1836
Z < 3571 ⇒ No subtraction required
Z ← 1836 × 10⁻⁴ mod 3571 = 1836 × 596 mod 3571 = 1530

The final subtraction may be omitted by choosing R > 4M and letting numbers in the Montgomery domain lie in the interval [0, 2M[, since [12]:

0 ≤ ZR = AB + QM ≤ 4M² + RM < RM + RM = 2RM    (3.4)
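For reference, a single-word instance of Algorithm 3.3 can be written as the following C sketch (illustrative only; it assumes an odd modulus m < 2^31 so that Z + QM fits in 64 bits, R = 2^32, inputs already in Montgomery form and smaller than m, and a pre-computed m_prime = −m⁻¹ mod 2^32):

#include <stdint.h>

/* Montgomery multiplication for a single-word odd modulus m, with R = 2^32.
 * Returns a*b*R^(-1) mod m, i.e. the Montgomery form of the product. */
uint32_t mont_mulmod(uint32_t a, uint32_t b, uint32_t m, uint32_t m_prime)
{
    uint64_t z = (uint64_t)a * b;                  /* Z = AB              */
    uint32_t q = (uint32_t)z * m_prime;            /* Q = M'Z mod R       */
    uint64_t t = (z + (uint64_t)q * m) >> 32;      /* (Z + QM) / R        */
    if (t >= m)                                    /* single final subtraction */
        t -= m;
    return (uint32_t)t;
}

Multi-precision variants interleave this word-level step over the s words of the operands, as discussed in the following subsection.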

Montgomery Multiplication Algorithms

This dissertation considers two methods to perform multi-precision Montgomery multiplication, namely: i) the Separated Operand Scanning (SOS) [17] method; and ii) the Finely Integrated Operand Scanning (FIOS) [17][12] method.

These methods are graphically represented on Figure 3.5, where the arrowed boxes represent iter-ative loops. Using the SOS method [17], the product of s-words large operands A and B, which arerepresented by a number of bits typically much higher than the word length of most processors (w), isfirst incrementally computed in the top loop of Figure 3.5(a). The product result is stored in variable Z


(a) SOS (b) FIOS (c) FIOS2

Figure 3.5: Montgomery Multiplication methods: multiplication and reduction loops organisation.

and, afterwards, the reduction process takes place in the bottom loop, by computing (3.5) (cf. Algorithm 3.3).

Z ← (Z + Q·M)/R    (3.5)

In (3.5), Z is calculated iteratively, by computing Q word by word and setting Z ← Z + Q × M × 2^{iw} at each iteration i: Q is set to Q ← Z[i] × M′[0] mod 2^w, Z[i] corresponds to the (i + 1)th word of Z, and M′[0] to the least significant word of M′. The division by R = 2^{sw} can be performed by ignoring the s least significant words.

The FIOS method [17], depicted in Figure 3.5(b), implements the Montgomery multiplication by integrating the multiplication and reduction parts of Algorithm 3.3 in the same loop. When using this method, Z[0] has to be calculated before entering the inner loop, so that the value of Q can be computed. The computation performed by the outer loop at each iteration corresponds to

Z ← (Z + A × B[i] + Q × M) >> w    (3.6)

where i stands for the iteration number, starting from i = 0, w is the word size, Q is set to Q ← (Z[0] + A[0] × B[i]) × M′[0] mod 2^w, and B[i] corresponds to the (i + 1)th word of B. The operator ">>" designates a logic right shift of w bits.

Although the loop fusion that the FIOS method allows for may seem beneficial, it might be counter-balanced by the requirement of performing carry propagation in its inner cycle. This latter problem can be mitigated by breaking the inner cycle into two, using a method similar to the one described in [12], which allows the carry propagation operation to be minimised. This variant of the FIOS method, the Finely Integrated Operand Scanning Version 2 (FIOS2) method in Figure 3.5(c), consists in separating the product and reduction steps into different inner cycles, by first computing Z ← Z + A × B[i] × 2^{iw}, and only afterwards Z ← Z + Q × M × 2^{iw}.
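A sequential sketch of the SOS organisation of Figure 3.5(a) is given below (illustrative C code with w = 32; the names and the handling of the final conditional subtraction, which is omitted here, are not taken from the thesis implementation):

#include <stdint.h>

/* Sequential SOS Montgomery multiplication sketch, R = 2^(32*s).
 * t must hold 2*s + 1 words and be zeroed by the caller; m_prime0 is the
 * least significant word of M' = -M^(-1) mod R. */
void mont_sos(uint32_t *t, const uint32_t *a, const uint32_t *b,
              const uint32_t *m, uint32_t m_prime0, int s)
{
    /* multiplication loop: t = a * b (operand scanning) */
    for (int i = 0; i < s; i++) {
        uint64_t carry = 0;
        for (int j = 0; j < s; j++) {
            uint64_t sum = t[i + j] + (uint64_t)a[i] * b[j] + carry;
            t[i + j] = (uint32_t)sum;
            carry = sum >> 32;
        }
        t[i + s] = (uint32_t)carry;
    }
    /* reduction loop: add q*m*2^(32*i) so that the low s words become zero */
    for (int i = 0; i < s; i++) {
        uint32_t q = t[i] * m_prime0;          /* q = t[i] * M'[0] mod 2^32 */
        uint64_t carry = 0;
        for (int j = 0; j < s; j++) {
            uint64_t sum = t[i + j] + (uint64_t)q * m[j] + carry;
            t[i + j] = (uint32_t)sum;
            carry = sum >> 32;
        }
        for (int j = i + s; carry != 0; j++) { /* propagate remaining carry */
            uint64_t sum = (uint64_t)t[j] + carry;
            t[j] = (uint32_t)sum;
            carry = sum >> 32;
        }
    }
    /* result (before the final conditional subtraction) is in t[s .. 2s] */
}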

Example 3.2.5 shows how reduction is performed using the aforementioned methods. For simplicity, it is assumed that each word lies in the range [0, 100[, and whereas previously R was set to be a power of 2, R is now a power of 100. Furthermore, variables are indexed for base 100, that is, if A is an s-word operand, A[i] is the least non-negative integer value that satisfies A = Σ_{i=0}^{s−1} A[i]·100^i, for i ∈ {0, 1, ..., s − 1}.


Example 3.2.5 Modular reduction using incremental Montgomery methods modulo 3571

n ← 2
R ← 100²
M′ ← −M⁻¹ mod 100 = 69 (69 × 3571 = 246399 = 2464 × 100 − 1)
A ← 2001 × 10⁴ mod 3571 = 1687
B ← 2501 × 10⁴ mod 3571 = 2287
Z ← 0
The 3 methods are now independently used to compute the same result.

SOS
Multiply:
Z ← Z + A × B[0] = 0 + 1687 × 87 = 146769
Z ← Z + A × B[1] × 100 = 146769 + 1687 × 22 × 100 = 3858169
Reduce:
Q ← Z[0] × M′ mod 100 = 69 × 69 mod 100 = 61
Z ← Z + Q × M = 3858169 + 61 × 3571 = 4076000
Q ← Z[1] × M′ mod 100 = 60 × 69 mod 100 = 40
Z ← Z + Q × M × 100 = 4076000 + 40 × 3571 × 100 = 18360000
Final Shift:
Z ← Z × 10⁻⁴ = 18360000 × 10⁻⁴ = 1836

FIOS
Evaluation of Q:
Q ← (Z[0] + A[0] × B[0]) × M′ mod 100 = (0 + 87 × 87) × 69 mod 100 = 61
Multiply and Reduce:
Z ← (Z + A × B[0] + Q × M)/100 = (0 + 1687 × 87 + 61 × 3571)/100 = 364600/100 = 3646
Evaluation of Q:
Q ← (Z[0] + A[0] × B[1]) × M′ mod 100 = (46 + 87 × 22) × 69 mod 100 = 40
Multiply and Reduce:
Z ← (Z + A × B[1] + Q × M)/100 = (3646 + 1687 × 22 + 40 × 3571)/100 = 183600/100 = 1836


Figure 3.6: SIMD operation for 4 lanes

Example 3.2.5 (continuation) Modular reduction using incremental Montgomery methods modulo 3571

FIOS2
Evaluation of Q:
Q ← (Z[0] + A[0] × B[0]) × M′ mod 100 = (0 + 87 × 87) × 69 mod 100 = 61
Multiply:
Z ← Z + A × B[0] = 0 + 1687 × 87 = 146769
Reduce:
Z ← Z + Q × M = 146769 + 61 × 3571 = 364600
Evaluation of Q:
Q ← (Z[1] + A[0] × B[1]) × M′ mod 100 = (46 + 87 × 22) × 69 mod 100 = 40
Multiply:
Z ← Z + A × B[1] × 100 = 364600 + 1687 × 22 × 100 = 4076000
Reduce:
Z ← Z + Q × M × 100 = 4076000 + 40 × 3571 × 100 = 18360000
Final Shift:
Z ← Z × 10⁻⁴ = 18360000 × 10⁻⁴ = 1836

Z < 3571 ⇒ No subtraction required
Z ← 1836 × 10⁻⁴ mod 3571 = 1836 × 596 mod 3571 = 1530

SIMD extensions can be used to perform multiple Montgomery multiplications in parallel. For the parallel versions of SOS, FIOS, and FIOS2, words were stored in interleaved memory: considering k parallel channels, and that A_j, for j ∈ [0, k − 1], is representable by s words, such that A_j = Σ_{i=0}^{s−1} A_{j,i}·2^{iw}, their disposition in memory is A_{k−1,s−1} A_{k−2,s−1} ... A_{0,s−1} ... A_{k−1,0} A_{k−2,0} ... A_{0,0}. This organisation allows k w-bit words to be loaded into the SIMD registers using a single instruction, after which a readjustment process makes it possible to multiply two w-bit words and accumulate the 2w-bit result in a target register. Carries can be extracted by right-shifting the result by w bits, and the product by a logic AND operation. If the result is similarly represented in memory, a readjustment process must follow, so that it can be stored in memory afterwards. This process is illustrated in Figure 3.6, for k = 4.
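The interleaved disposition can be produced with a simple rearrangement routine such as the following sketch (illustrative names only, not the thesis code):

#include <stdint.h>

/* Interleave k operands of s 32-bit words each so that word i of every
 * operand is contiguous in memory: dst[i*k + j] = src[j][i]. With this
 * layout, one SIMD load fetches word i of k different operands at once. */
void interleave_operands(uint32_t *dst, const uint32_t *const *src,
                         int k, int s)
{
    for (int i = 0; i < s; i++)          /* word index           */
        for (int j = 0; j < k; j++)      /* operand (lane) index */
            dst[i * k + j] = src[j][i];
}

For k = 4 and w = 32, the four words with the same index then occupy 128 contiguous bits, so a single 128-bit SIMD load fetches one word of each of the four operands.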

Furthermore, modular multiplication throughput may be increased by having several cores computingparallel SIMD Montgomery multiplications. This technique is depicted in Figure 3.7, where n coresexploit a 4-lanes SIMD architecture to perform 4n multiplications in parallel.


Figure 3.7: Multi-Core architecture with 4-lanes SIMD engines

3.2.4 Parallel Execution of a Single Modular Multiplication

Considering a 2^{n/k} radix, by splitting the operands into k sections, modular multi-precision multiplication may be performed using (3.7), where the equivalence corresponds to the transform h = i + j.

Z ≡ Σ_{i=0}^{k−1} Σ_{j=0}^{k−1} A[i]B[j]·2^{(i+j)n/k} ≡ Σ_{h=0}^{2k−2} Σ_{i=max{0,h−k+1}}^{min{k−1,h}} A[i]B[h − i]·2^{d_{i,j}} (mod M);   d_{i,j} = (i + j)n/k = hn/k    (3.7)

A column of the result is defined as the value C_{d_{i,j}} = Σ_{i=max{0,h−k+1}}^{min{k−1,h}} A[i]B[h − i], with weight d_{i,j} = hn/k. Thus, the computation of a multi-precision multiplication may be shared among 2k − 1 processors if each computes one column of the result. Furthermore, each processor may estimate the value of Q required to reduce the value of 2^{d_{i,j}}·C_{d_{i,j}}. This technique, however, would lead to an unbalanced load, as processors assigned weightier columns would need to compute larger Q values. As such, a Montgomery-like representation should be used to balance the computational load, by multiplying the result by 2^{−n/2}.

Using the Montgomery-like representation, each column is computed using the same equation as before, but its weight is redefined as d_{i,j} = hn/k − n/2, and Z is computed as Z ≡ Σ_{h=0}^{2k−2} 2^{d_{i,j}}·C_{d_{i,j}} (mod M). Using this approach, processors with a negative weight may apply the Montgomery reduction algorithm, whereas the weightier reductions are now replaced with simpler ones and should be performed using the Barrett reduction algorithm.

For columns with weight d_{i,j} less than zero, the Montgomery reduction is applied and, since the Montgomery domain has changed, the Q value should now be computed using M′ ≡ −M⁻¹ (mod 2^{n/2}). For columns with weight satisfying the condition 0 ≤ d_{i,j} ≤ n − 2n/k, the result of computing C_{d_{i,j}}·2^{d_{i,j}} will always be less than k·2^n, and therefore at most 2k final subtractions by M will be required to fully reduce the column if Q is set to zero, for 2^{n−1} < M < 2^n. Finally, when d_{i,j} > n − 2n/k, approximately t_{i,j} = d_{i,j} + 2n/k − n bits need to be zeroed out from the column by subtracting QM, so that the result is less than 2^n, in contrast to what happened when classical Barrett reduction was applied and n bits needed to be zeroed out. Therefore, Q is rewritten as stated in (3.8).

Q = ⌊⌊C_{d_{i,j}}·2^{d_{i,j}}/2^n⌋ · ⌊2^{n+t_{i,j}}/M⌋ / 2^{t_{i,j}}⌋    (3.8)

Taking into consideration that d_{i,j} ≤ 3n/2 − 2n/k, the value of ⌊2^{n+t_{i,j}}/M⌋ is maximised by µ = ⌊2^{3n/2}/M⌋, and therefore only the latter value needs to be pre-computed for the Barrett reduction scheme. The resulting algorithm is presented in Algorithm 3.4, where the i, j subscript has been omitted. After the values of C_d and their corresponding Q_d are computed, the Q_d values are summed into Q, and the result Z = (Σ_d C_d·2^{n/2+d} − QM)·2^{−n/2} is produced. Finally, some subtractions may be required to fully reduce the result.

Example 3.2.6 presents a case of the algorithm application. Similarly to the previous example, base100 was used instead of 2, for the reduction process.


Algorithm 3.4: k-ary Modular Multiplication
I ← {−n/2, −n/2 + n/k, ..., 3n/2 − 2n/k}
foreach d ∈ I do
    C_d ← Σ_{(i+j)n/k−n/2=d} A[i]B[j]
    if d < 0 then
        Q_d ← −(C_d·M′ mod 2^{−d})·2^{n/2+d}
    else if d > n − 2n/k then
        t ← d + 2n/k − n
        Q_d ← ⌊⌊C_d·2^{d−n}⌋·⌊µ·2^{t−n/2}⌋/2^t⌋·2^{n/2}
    else
        Q_d ← 0
    end
end
Q ← Σ_d Q_d
Z ← (Σ_d C_d·2^{n/2+d} − QM)·2^{−n/2}
while Z ≥ M do
    Z ← Z − M
end

Example 3.2.6 Modular reduction using the k-ary Modular Multiplication modulo 3571

n ← 2
k ← 2
M′ ← 69 (69 × 3571 = 246399 = 2464 × 100 − 1)
µ ← ⌊100³/3571⌋ = 280
A ← 2001 × 100 mod 3571 = 124
B ← 2501 × 100 mod 3571 = 130

The 2k − 1 = 3 processors simultaneously execute:

Processor 1 (d = −1):
C_{−1} ← A[0] × B[0] = 24 × 30 = 720
Q_{−1} ← −(C_{−1} × M′ mod 100) × 100^{n/2+d} = −(720 × 69 mod 100) × 100⁰ = −80

Processor 2 (d = 0):
C_0 ← A[0] × B[1] + A[1] × B[0] = 24 × 1 + 30 × 1 = 54
Q_0 ← 0

Processor 3 (d = 1):
C_1 ← A[1] × B[1] = 1 × 1 = 1
t ← d + 2 × n/k − n = 1 + 2 × 2/2 − 2 = 1
Q_1 ← ⌊⌊C_d × 100^{d−n}⌋ × ⌊µ × 100^{t−n/2}⌋/100^t⌋ × 100^{n/2} = ⌊⌊1 × 100⁻¹⌋ × ⌊280 × 100⁰⌋/100¹⌋ × 100¹ = 0

Q ← Σ_d Q_d = −80


Z ← (Σ_d C_d·100^{n/2+d} − QM)·100^{−n/2}
  = ((720 + 54 × 100 + 1 × 100²) − (−80) × 3571) × 100⁻¹
  = (16120 + 285680) × 100⁻¹
  = 301800 × 100⁻¹
  = 3018
Z < 3571 ⇒ No subtraction required
Z ← 3018 × 100⁻¹ mod 3571 = 3018 × 2464 mod 3571 = 1530

3.2.5 RNS-based Modular Multiplication

GPUs, which enable the execution of a large number of threads, are suitable devices for RNS-based arithmetic. In this system, numbers are represented according to a base R = {r_0, r_1, ..., r_{n−1}} of pairwise co-prime moduli. An integer A is represented by the vector (a_0, a_1, ..., a_{n−1}), where a_i = A mod r_i, for i = 0, ..., n − 1.

The Chinese Remainder Theorem (CRT) [30] ensures the uniqueness of this representation for integers within the range 0 ≤ A < R, where R = Π_{i=0}^{n−1} r_i. Furthermore, it provides a formula to convert the value of A back from its residue representation (3.9).

A = Σ_{i=0}^{n−1} (a_i·R_i⁻¹ mod r_i)·R_i (mod R) = Σ_{i=0}^{n−1} (a_i·R_i⁻¹ mod r_i)·R_i − kR    (3.9)

where R_i = R/r_i and R_i⁻¹ mod r_i is the multiplicative inverse of R_i modulo r_i. As stated in (3.9), reduction modulo R can be substituted by the subtraction of a multiple of R.

The main advantage of RNS is that addition, subtraction and multiplication of operands modulo R are performed component-wise, that is:

A + B mod R = (a_0 + b_0 mod r_0, ..., a_{n−1} + b_{n−1} mod r_{n−1})
A − B mod R = (a_0 − b_0 mod r_0, ..., a_{n−1} − b_{n−1} mod r_{n−1})
A × B mod R = (a_0 × b_0 mod r_0, ..., a_{n−1} × b_{n−1} mod r_{n−1})    (3.10)

Since each element is independently computed, if n threads execute in parallel, the computationtakes the time of a single operation. As such, modular reduction is usually performed using a Montgomery-like algorithm.
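The following C sketch illustrates the component-wise arithmetic of (3.10) and the CRT reconstruction of (3.9) for small, toy-sized bases (illustrative names only, not the GPU kernel code; all values are assumed to fit in 64 bits):

#include <stdint.h>

/* c = a * b in RNS: each channel is independent, so n threads could each
 * execute one iteration of this loop in parallel. */
void rns_mul(uint64_t *c, const uint64_t *a, const uint64_t *b,
             const uint64_t *r, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = (a[i] * b[i]) % r[i];
}

/* CRT reconstruction, cf. equation (3.9):
 * A = sum_i ((a_i * Ri_inv_i mod r_i) * R_i) mod R. */
uint64_t rns_to_int(const uint64_t *a, const uint64_t *r,
                    const uint64_t *R_i, const uint64_t *Ri_inv,
                    uint64_t R, int n)
{
    uint64_t acc = 0;
    for (int i = 0; i < n; i++) {
        uint64_t xi = (a[i] * Ri_inv[i]) % r[i];   /* ξ_i in the text  */
        acc = (acc + (xi % R) * (R_i[i] % R)) % R; /* accumulate mod R */
    }
    return acc;
}

With the base {239, 241} used in Example 3.2.7 below, R_i = {241, 239} and R_i⁻¹ mod r_i = {120, 120}, and rns_to_int applied to the residues (137, 127) recovers 1332, matching the CRT step of that example.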

As in the classical Montgomery algorithm, after calculating AB, Q is computed such that AB + QM is a multiple of R. Unfortunately, this result is equal to zero modulo R, and therefore a second base, R̃ = {r̃_0, ..., r̃_{n−1}}, is required, such that R̃ = Π_{i=0}^{n−1} r̃_i > R and R̃ is co-prime to R. In order to avoid code divergences, for example on a GPU, the cardinality of R and R̃ should be the same. After performing the base extension of Q to R̃ and computing AB + QM, the division of the result by R reduces to the product of AB + QM and the modular inverse R⁻¹ mod R̃, since the former value is divisible by R: (AB + QM)R⁻¹ = (k_1R)R⁻¹ = k_1 + k_2R̃ ⇔ (AB + QM)R⁻¹ mod R̃ = k_1. Finally, the result is extended back to R, so that it may be employed in further multiplications. The resulting algorithm is depicted in Algorithm 3.5.

Algorithm 3.5: RNS Montgomery Multiplication
s_i ≡ a_i·b_i mod r_i, i ∈ [0, n[
s̃_i ≡ ã_i·b̃_i mod r̃_i, i ∈ [0, n[
q_i ≡ s_i·(−M⁻¹) mod r_i, i ∈ [0, n[
Extend Q to R̃
t̃_i ≡ (s̃_i + q̃_i × M)·R⁻¹ mod r̃_i, i ∈ [0, n[
Extend T to R

There are several approaches to perform base extension. The earliest approach uses the Szabo-Tanaka [46] algorithm, which is based on the Mixed Radix System (MRS). However, due to the recursive


nature of the MRS, it is not suitable for GPU implementations. On the other hand, base extension may be performed by evaluating (3.9) for each of the moduli of the second base. In that case, the base extension of A in R to R̃ is performed as:

ã_i = (Σ_{j=0}^{n−1} (a_j·R_j⁻¹ mod r_j)·R_j − kR) mod r̃_i,   i ∈ [0, n[    (3.11)

In this thesis, Bajard's approach to the first extension was used [47], where k is set to 0, and therefore the Montgomery algorithm produces the final result with an offset that is a multiple of M: (AB + (Q + xR)M)R⁻¹ = (AB + QM)R⁻¹ + xM ≡ ABR⁻¹ (mod M), where x < n. For the second extension, for operands with fewer than 2048 bits, the method proposed by Kawamura [25] was used, which is based on a fixed-point approach for the exact computation of the value of k. For operands whose width was equal to 2048 bits, Kawamura's approach was no longer valid for implementation on GPUs with restrictions, such as the Adreno 320 GPU, and therefore a modification to the algorithm is proposed. It was not possible to operate on larger numbers due to the device limitations, even though the proposed approach remains valid. Another way to compute k was proposed by Shenoy [24], which requires the use of a redundant modulo. This approach greatly increases the number of divergences of the algorithm, and therefore does not suit SIMD parallelism, namely on GPU devices.

Kawamura’s approach to base extension is based on the fact that when (3.9) is divided by R, we get:

A/R = Σ_{i=0}^{n−1} ξ_{a_i}/r_i − k    (3.12)

where ξ_{a_i} = (a_i·R_i⁻¹ mod r_i). Since A/R < 1, the value of k can be computed as:

k = ⌊Σ_{i=0}^{n−1} ξ_{a_i}/r_i⌋    (3.13)

The previous equation requires, however, expensive divisions by the values of the base. In order to reduce the complexity of the computation, Kawamura proposes replacing the values of r_i by 2^l, where 2^{l−1} < r_i < 2^l, enabling computation using a fixed-point approach. With this approximation, k is given by:

k̂ = ⌊Σ_{i=0}^{n−1} ξ_{a_i}/2^l + α⌋    (3.14)

where α is a correction term. Suitable values for α will now be discussed. By defining ε_{r_i} = (2^l − r_i)/2^l and ε_m = max{ε_{r_i} | 0 ≤ i < n}, the following equation can be derived:

Σ_{i=0}^{n−1} ξ_{a_i}/2^l = Σ_{i=0}^{n−1} ξ_{a_i}(1 − ε_{r_i})/r_i ≥ (1 − ε_m)·Σ_{i=0}^{n−1} ξ_{a_i}/r_i > Σ_{i=0}^{n−1} ξ_{a_i}/r_i − nε_m    (3.15)

Furthermore, since 2^l > r_i, then

Σ_{i=0}^{n−1} ξ_{a_i}/2^l < Σ_{i=0}^{n−1} ξ_{a_i}/r_i    (3.16)

If 0 ≤ nε_m ≤ α < 1 and 0 ≤ A < (1 − α)R, the previous equations result in:

Σ_{i=0}^{n−1} ξ_{a_i}/r_i − nε_m < Σ_{i=0}^{n−1} ξ_{a_i}/2^l < Σ_{i=0}^{n−1} ξ_{a_i}/r_i
⇔ (k + A/R) − nε_m < Σ_{i=0}^{n−1} ξ_{a_i}/2^l < (k + A/R)
⇔ (k + A/R) − nε_m + α < Σ_{i=0}^{n−1} ξ_{a_i}/2^l + α < (k + A/R) + α
⇔ k < Σ_{i=0}^{n−1} ξ_{a_i}/2^l + α < k + 1    (3.17)


and therefore k̂ = ⌊Σ_{i=0}^{n−1} ξ_{a_i}/2^l + α⌋ = k.

It was verified that, as the operand size increased, it was not possible to find suitable bases such that the previous conditions were met. In fact, as more moduli were appended to the RNS base, the error ε_m grew and the value of nε_m became greater than one. Whereas for hardware implementations it is possible to increase the value of l, the same does not hold for software implementations on the GPU, due to hardware restrictions. In order to enable the computation of k for larger sizes, a new approach is herein proposed: instead of approximating 1/r_i by 1/2^l, the value of ⌊2^g/r_i⌋ is pre-computed and k is evaluated as depicted in (3.18).

k̂ = ⌊Σ_{i=0}^{n−1} (ξ_{a_i}/2^g)·⌊2^g/r_i⌋ + α⌋    (3.18)

The following equation enables the evaluation of the error of the approximation:

(ξ_{a_i}/2^g)·⌊2^g/r_i⌋ > (ξ_{a_i}/2^g)·(2^g/r_i − 1) = ξ_{a_i}/r_i − ξ_{a_i}/2^g    (3.19)

Using the same reasoning as in (3.17), it can be proved that if 0 ≤ n·r_m/2^g ≤ α < 1, where r_m = max{r_i | 0 ≤ i < n}, and 0 ≤ A < (1 − α)R, then k̂ = k.

The value of R should be large enough to accommodate the extension error introduced by the Bajard extension, so that the output of the algorithm may be used in further iterations.

The conditions k < n, AB < RM, and Q < (n + 1)M lead to T < (n + 2)M. In order to use the Montgomery algorithm within an exponentiation algorithm, the T value might be reused as either A or B, and therefore the following condition must be satisfied:

AB < RM ⇒ (n + 2)²M < R    (3.20)

Since T < (n + 2)M, in order to produce the exact value of k in the second base extension, the condition 0 ≤ nε_m ≤ α < 1 or 0 ≤ n·r_m/2^g ≤ α < 1 must be met, when using the first or the second approach to perform the extension, respectively. Moreover, the following condition must also be taken into account when determining an appropriate value for α:

(n + 2)M/R̃ < 1 − α    (3.21)
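A minimal fixed-point sketch of the evaluation of k in (3.18) is shown below (illustrative only: inv_r[i] = ⌊2^g/r_i⌋ is pre-computed, α is supplied scaled by 2^g, and the accumulator is assumed to fit in 64 bits):

#include <stdint.h>

/* Fixed-point evaluation of k from equation (3.18). */
uint32_t base_extension_k(const uint32_t *xi,    /* ξ_i = a_i*R_i^-1 mod r_i */
                          const uint32_t *inv_r, /* floor(2^g / r_i)         */
                          int n, unsigned g, uint64_t alpha_scaled)
{
    uint64_t acc = alpha_scaled;                /* α * 2^g                 */
    for (int i = 0; i < n; i++)
        acc += (uint64_t)xi[i] * inv_r[i];      /* ξ_i * floor(2^g / r_i)  */
    return (uint32_t)(acc >> g);                /* k = floor(acc / 2^g)    */
}

The choice of g is constrained by the condition 0 ≤ n·r_m/2^g ≤ α < 1 discussed above.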

Example 3.2.7 shows how modular multiplication is performed using the described approach.

Example 3.2.7 Modular reduction using the RNS-based Modular Multiplication modulo3571

R = {239, 241}, R̃ = {251, 257}
R̃ = 251 × 257 = 64507 > R = 239 × 241 = 57599 > (2 + n)²M = (2 + 2)² × 3571 = 57136
GCD(R, R̃) = 1, GCD(M, R) = 1, GCD(M, R̃) = 1
A ≡ 2001 × 57599 ≡ 1574 (mod 3571)
B ≡ 2501 × 57599 ≡ 959 (mod 3571)
(−M⁻¹ mod r_i) = (222, 126), (M mod r̃_i) = (57, 230), (R⁻¹ mod r̃_i) = (228, 199), (R_i⁻¹ mod r_i) = (120, 120)
(R̃_i⁻¹ mod r̃_i) = (42, 214), (R_i mod r̃_j) = ((241, 241), (239, 239)), (R̃_i mod r_j) = ((18, 16), (12, 10))
(R̃ mod r_i) = (216, 160)
(A mod r_i) = (140, 128), (A mod r̃_i) = (68, 32), (B mod r_i) = (3, 236), (B mod r̃_i) = (206, 188)


vadd       Adds corresponding elements in two vectors and stores the result in the destination vector;
veor       Performs a bitwise exclusive OR between two vectors and stores the result in the destination register;
vldX/vstX  vldX loads X-element structures from memory, interleaving them in the target registers; vstX does the opposite;
vmlal      Multiplies corresponding elements in two vectors and adds the products to the corresponding destination vector elements, which are twice as long as the sources;
vshr       Shifts all elements in a vector and places the results in the destination vector;
vzip       Interleaves the elements of two vectors;
vsra       Right-shifts each element in a vector and accumulates the result in the target vector;
vext       Extracts the top elements of the first vector and the bottom elements of the second, concatenates them, and stores the result in the destination register.

Table 3.1: NEON Instructions Summary

(S ≡ AB mod r_i) = (181, 83), (S̃ ≡ AB mod r̃_i) = (203, 105)
(ξ_Q ≡ S·(−M⁻¹)·R_i⁻¹ mod r_i) = (15, 73)
(Q̃ ≡ Σ_j (ξ_Q)_j·(R_j mod r̃_i) mod r̃_i) = (15 × 241 + 73 × 239 mod 251, 15 × 241 + 73 × 239 mod 257) = (229, 245)
(ξ_T ≡ (S̃ + Q̃M)·R⁻¹·R̃_i⁻¹ mod r̃_i) = ((203 + 229 × 57) × 228 × 42 mod 251, (105 + 245 × 230) × 199 × 214 mod 257) = (222, 35)
Here, k is evaluated using (3.18), but 2^g is replaced by 10^g, with g = 3 and α = 2 × 257/1000 = 0.514, with (n + 2)M/R̃ ≈ 0.221 < 1 − α = 0.486.
k = ⌊10⁻³ × (222 × ⌊1000/251⌋ + 35 × ⌊1000/257⌋) + 0.514⌋ = 1
(T ≡ Σ_j (ξ_T)_j·(R̃_j mod r_i) − kR̃ mod r_i) = (222 × 18 + 35 × 12 − 1 × 216 mod 239, 222 × 16 + 35 × 10 − 1 × 160 mod 241) = (137, 127)
The CRT can now be used to compute the value of T.
T ≡ 137 × 120 × 241 + 127 × 120 × 239 ≡ 1332 mod 57599
As expected, T·R⁻¹ ≡ 1332 × 725 ≡ 1530 mod 3571

3.3 Implementation and Experimental Results

Several standard Application Programming Interfaces (APIs) and technologies were used in thisthesis in order to exploit different levels of data parallelism, namely, the NEON technology for SIMDparallelism, and OpenMP and OpenCL for coarser-grain parallelism:

• SIMD parallelism was exploited using the NEON [48] technology, which is an ISA extension to theARM processors. It features vectors of at most 128-bits, and enables the simultaneous processingof its lanes. These instructions can be accessed through the use of C intrinsics or by writingassembly code. Table 3.1 summarises the most relevant instructions for cryptography.

• The OpenMP API [49] establishes a set of clauses that enable the creation of instruction streams and the distribution of workload among them. For the purpose of implementing modular multiplication, only the directive #pragma omp parallel for was used, which creates a set of threads, typically as many as there are cores, and splits the for-loop so that each thread handles a subset of the loop iterations. There is an implicit synchronisation point at the end of that loop (a minimal usage sketch is shown after this list).

• In order to access the GPU, OpenCL [50] was used. This framework establishes a uniform language based on C for writing programs that execute across heterogeneous platforms, such as CPUs and GPUs, among others.
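As a usage sketch of the OpenMP directive referred to above (hypothetical function names, not the thesis code), a batch of interleaved SIMD Montgomery multiplications could be distributed over the cores as follows, so that n cores with 4-lane SIMD engines process 4n multiplications concurrently:

#include <stdint.h>

/* Placeholder for a routine that multiplies LANES interleaved operand sets
 * at once using NEON; not a function from the thesis code. */
void mont_mul_simd(uint32_t *z, const uint32_t *a, const uint32_t *b,
                   const uint32_t *m, uint32_t m_prime0, int s);

#define LANES 4

/* Distribute groups of SIMD Montgomery multiplications over the cores;
 * compile with -fopenmp so the pragma takes effect. */
void mont_mul_batch(uint32_t *z, const uint32_t *a, const uint32_t *b,
                    const uint32_t *m, uint32_t m_prime0, int s, int batches)
{
    #pragma omp parallel for
    for (int i = 0; i < batches; i++)
        mont_mul_simd(&z[i * s * LANES], &a[i * s * LANES],
                      &b[i * s * LANES], m, m_prime0, s);
}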

3.3.1 Multi-precision Multiplication

The implementation details of the aforementioned multi-precision multiplication techniques will nowbe described in detail. Moreover, their performances are experimentally evaluated.

Operand-Scanning Method

The Operand-Scanning method was implemented using a sequential method based on the Algorithm3.6. The outer loop scans through each word of operand a, while the inner loop accumulates the productof a[i] and b in t[i+j]. It is requisite that t is zeroed out before the function is called.

Algorithm 3.6: Operand-Scanning Method C Implementation

void mul1(uint32_t *t, uint32_t *a, uint32_t *b, int s) {
    uint64_t S, C;
    int i, j;

    for (i = 0; i < s; i++) {
        C = 0;
        for (j = 0; j < s; j++) {
            S = t[i + j] + (uint64_t)a[i] * b[j] + C;
            t[i + j] = S;
            C = S >> 32;
        }
        t[i + j] = C;
    }
}

Product-Scanning Method

The Product-Scanning method was programmed using NEON intrinsics, whereby two product columns are processed at a time. The value of a product column is Σ_{i=max{0,h−s+1}}^{min{s−1,h}} A[i]B[h − i], for h ∈ [0, 2s − 2], where s corresponds to the number of words of A and B, in a similar fashion to how it was defined in Section 3.2.4.

The code is depicted in Algorithm 3.7. Firstly, the vdup instruction duplicates the 0x00000000ffffffff value into the mask vector, and the C and S vectors are zeroed. Afterwards, the two k-loops iterate over the columns of the result t. In the first loop, the i-loop traverses each pair of columns, two lines at a time, which requires loading two entries of the multiplier (a[i] and a[i+1]) and three entries of the multiplicand (the first vld instruction loads b[k-i] and b[k-i+1] and the second b[k-i-1] and b[k-i]). Their products are then added to the C and S registers, which together act as an accumulator, through the use of the ACC macro – the k column accumulates the a[i]×b[k-i] and a[i+1]×b[k-i-1] products and the k + 1 column the a[i]×b[k-i+1] and a[i+1]×b[k-i] products.

The ACC(b, a, i) macro multiplies the i lane of vector a by the two entries of vector b, and theresults are added to the contents of S. The most significant halves of the words of S are then added toC, by calling the vsra instruction, and cleared from the original vector, by using vand on S and mask.


After having processed the words related to a column, the final result is stored in memory. The valueof t[k] corresponds to the first lane of S, and t[k+1] to the addition of the second lane of S, the first ofC and the a[i+1]×b[0] product. The accumulator is finally shifted, using the vsetq lane u64 intrinsicto establish the content of each word of the accumulator.

The second k-loop is similar to the first, but it processes the second half of columns with the rolesof a and b exchanged. This behaviour is the one depicted in Figure 3.3, where on the first half of therhombus partial products are paired up on lines of constant A[i], but on the second half partial productsare grouped with constant B[j].

Algorithm 3.7: Product-Scanning Method NEON Implementation

#define ACC(b, a, i) \
    S = vmlal_lane_u32(S, b, a, i); \
    C = vsraq_n_u64(C, S, 32); \
    S = vandq_u64(S, mask);

void mul2(uint32_t *t, uint32_t *a, uint32_t *b, int s) {
    int k, i;
    uint64x2_t S, C;
    uint64_t S1, C1;
    uint64x2_t mask;
    uint32x2_t va, vb1, vb2;

    mask = vdupq_n_u64(0x00000000ffffffff);
    S = veorq_u64(S, S);
    C = veorq_u64(C, C);
    for (k = 0; k < s; k += 2) {
        for (i = 0; i <= k - 2; i += 2) {
            va = vld1_u32(&a[i]);
            vb1 = vld1_u32(&b[k - i]);
            vb2 = vld1_u32(&b[k - (i + 1)]);
            ACC(vb1, va, 0);
            ACC(vb2, va, 1);
        }
        va = vld1_u32(&a[i]);
        vb1 = vld1_u32(&b[0]);
        ACC(vb1, va, 0);

        S1 = vgetq_lane_u64(S, 0);
        t[k] = S1;
        C1 = vgetq_lane_u64(C, 0) + (S1 >> 32);
        S1 = vgetq_lane_u64(S, 1) + C1 + (uint64_t)a[i + 1] * b[0];
        t[k + 1] = S1;
        C1 = vgetq_lane_u64(C, 1) + (S1 >> 32);

        S = vsetq_lane_u64(C1 & 0xffffffff, S, 0);
        S = vsetq_lane_u64(0, S, 1);
        C = vsetq_lane_u64(C1 >> 32, C, 0);
        C = vsetq_lane_u64(0, C, 1);
    }

    for (; k < 2 * s; k += 2) {
        for (i = s - 2; i > k - s; i -= 2) {
            va = vld1_u32(&b[i]);
            vb1 = vld1_u32(&a[k - i - 1]);
            vb2 = vld1_u32(&a[k - i]);
            ACC(vb1, va, 1);
            ACC(vb2, va, 0);
        }

        S1 = vgetq_lane_u64(S, 0) + (uint64_t)a[s - 1] * b[k - s + 1];
        t[k] = S1;
        C1 = vgetq_lane_u64(C, 0) + (S1 >> 32);
        S1 = vgetq_lane_u64(S, 1) + C1;
        t[k + 1] = S1;
        C1 = vgetq_lane_u64(C, 1) + (S1 >> 32);

        S = vsetq_lane_u64(C1 & 0xffffffff, S, 0);
        S = vsetq_lane_u64(0, S, 1);
        C = vsetq_lane_u64(C1 >> 32, C, 0);
        C = vsetq_lane_u64(0, C, 1);
    }
}

Operand-Caching Method

The NEON technology provides 32×64-bit wide registers. By selecting e = 24, 24 registers willbe allocated to store 48×32-bit words (half of these correspond to one operand, and the other half tothe other operand), 2×128-bit wide registers will act as an accumulator, and a 64-bit register and a128-bit register will be required for auxiliary operations, namely the temporary storage of operands andthe storage of a constant which works as a bitwise mask. Thus, 31×64-bits wide register are used intotal, almost filling up the register file. An excerpt of the C implementation code is shown in Algorithm3.8. It depicts sub-section 2, where, at each iteration, the contents of the a registers are firstly movedin a queue fashion, since there are values that are no longer required and new values have to beloaded. Afterwards, the accumulation steps of two multiplication columns take place. It should be notedthat the vext instruction is used to map the values of the two-worded registers rai = (rai,1, rai,0) and


rai+1 = (rai+1,1, rai+1,0) to tmp = (rai,1, rai+1,0), in order to be multiplied by a word of b. Finally, thesection of code responsible for storing the result and shifting the accumulator follows a similar rationaleto that of Algorithm 3.7.

Algorithm 3.8: Operand-Caching Method NEON Implementation

/* Part 2 */
for (i = 0; i < (n - (p + 1) * 24) / 2; i++) {
    if (i > 0) {
        ra0 = ra1;   ra1 = ra2;   ra2 = ra3;   ra3 = ra4;
        ra4 = ra5;   ra5 = ra6;   ra6 = ra7;   ra7 = ra8;
        ra8 = ra9;   ra9 = ra10;  ra10 = ra11; ra11 = ra12;
    }

    ra12 = vld1_u32(&a[24 + 2 * i]);
    ACC(ra12, rb0, 0);
    tmp = vext_u32(ra11, ra12, 1);
    ACC(tmp, rb0, 1);
    ACC(ra11, rb1, 0);
    tmp = vext_u32(ra10, ra11, 1);
    ACC(tmp, rb1, 1);
    ACC(ra10, rb2, 0);
    tmp = vext_u32(ra9, ra10, 1);
    ACC(tmp, rb2, 1);
    ACC(ra9, rb3, 0);
    tmp = vext_u32(ra8, ra9, 1);
    ACC(tmp, rb3, 1);
    ACC(ra8, rb4, 0);
    tmp = vext_u32(ra7, ra8, 1);
    ACC(tmp, rb4, 1);
    ACC(ra7, rb5, 0);
    tmp = vext_u32(ra6, ra7, 1);
    ACC(tmp, rb5, 1);
    ACC(ra6, rb6, 0);
    tmp = vext_u32(ra5, ra6, 1);
    ACC(tmp, rb6, 1);
    ACC(ra5, rb7, 0);
    tmp = vext_u32(ra4, ra5, 1);
    ACC(tmp, rb7, 1);
    ACC(ra4, rb8, 0);
    tmp = vext_u32(ra3, ra4, 1);
    ACC(tmp, rb8, 1);
    ACC(ra3, rb9, 0);
    tmp = vext_u32(ra2, ra3, 1);
    ACC(tmp, rb9, 1);
    ACC(ra2, rb10, 0);
    tmp = vext_u32(ra1, ra2, 1);
    ACC(tmp, rb10, 1);
    ACC(ra1, rb11, 0);
    tmp = vext_u32(ra0, ra1, 1);
    ACC(tmp, rb11, 1);

    S1 = (uint64_t)t[(p + 1) * 24 + 2 * i] + vgetq_lane_u64(S, 0);
    t[(p + 1) * 24 + 2 * i] = S1;
    C1 = vgetq_lane_u64(C, 0) + (S1 >> 32);
    S1 = vgetq_lane_u64(S, 1) + C1 + t[(p + 1) * 24 + 2 * i + 1];
    t[(p + 1) * 24 + 2 * i + 1] = S1;
    C1 = vgetq_lane_u64(C, 1) + (S1 >> 32);

    S = vsetq_lane_u64(C1 & 0xffffffff, S, 0);
    S = vsetq_lane_u64(0, S, 1);
    C = vsetq_lane_u64(C1 >> 32, C, 0);
    C = vsetq_lane_u64(0, C, 1);
}

Experimental Results

The three multi-precision multiplication methods presented in this section, in Algorithms 3.6, 3.7 and 3.8, were run on the A15 core of the ODROID-XU+E board (see Section 2.2.3), operating at a frequency of 1.6 GHz, and the execution times, obtained using the POSIX clock_gettime function, for different operand widths, are shown in Table 3.2. These results were obtained by taking the average execution time over 256 executions. The platform was running Ubuntu 13.10 and the code was compiled with gcc 4.8.1, with the -O3 flag. Results for 256 and 512 bits were omitted for the Operand-Caching method since the multiplication would fit entirely in the b_init section and would therefore not benefit from SIMD parallelism. The speedups of the Product-Scanning and Operand-Caching methods, presented in Algorithms 3.7 and 3.8, respectively, were computed, using the sequential Operand-Scanning method, described in Algorithm 3.6, as reference.

The memory access pattern of the Operand-Scanning method is more regular, since the operandsare sequentially loaded, and therefore better fits the cache system. This fact is observed in the results,which show that this method is the fastest for small operands. Notwithstanding, as the operand widthincreases, the parallelism supported on SIMD instructions proves to be effective for both the Product-


Execution Time [clock cycles]
Number of bits                           256     512    1024    2048    4096
Operand-Scanning (Algorithm 3.6)         401   1,968   6,070  22,667  86,864
NEON Product-Scanning (Algorithm 3.7)    614   1,609   5,244  18,819  70,489
NEON Operand-Caching (Algorithm 3.8)       -       -   5,683  21,872  80,435

Speed-Up (Reference: Algorithm 3.6)
NEON Product-Scanning                   0.65    1.22    1.16    1.20    1.23
NEON Operand-Caching                       -       -    1.07    1.04    1.08

Table 3.2: Multi-Precision NEON Multiplication Performance

Scanning and the Operand-Caching methods. The data rearrangements required by the Operand-Caching method seem to be detrimental to the overall performance of the multiplication and, in this case, it is still beneficial to keep operands in lower-level caches and support a higher number of loads. On less effective cache systems, one can expect the Operand-Caching method to outperform Product-Scanning. Larger speedups are expected for wider SIMD technologies, since they enable the processing of more columns in parallel.

3.3.2 Multi-Precision Modular Multiplication

In this section, the implementation details of the SIMD-parallel algorithms for performing multipleSOS, FIOS and FIOS2 multiplications simultaneously are presented. These are based on the imple-mentation of [12], which uses the SSE2 SIMD extensions to the Pentium 4 processor to implement theFIOS2 algorithm. The SSE2 extensions offer 128-bit registers and a set of instructions to operate onthem [51].

Conversely, the SIMD modular multiplication algorithm proposed in [19] uses a 2-lane SIMD approach to simultaneously compute the values of AB and QM, which correspond to the product of the two operands and the multiplication required to reduce it, respectively. This approach has limited scalability. For instance, the most recent Intel SIMD engine, AVX2, provides instructions to compute products on 4×64-bit lanes, but not on 2×128-bit lanes. It is expected that most SIMD architectures, as their width increases, will evolve in the same direction. The cost of introducing 2×128-bit multiplication instructions in the AVX2 technology, for instance, would possibly be a decrease in its operating frequency.

Further optimisations can be performed if the implementation is to be used for moduli with a particular structure. Whereas the algorithms presented in this work are general and work for most cryptographic operations, in [18] a SIMD modular multiplication algorithm specific to the NIST-standard primes [33] is proposed.

Montgomery Multi-Precision Multiplication

The SOS, FIOS and FIOS2 Montgomery multiplication methods were implemented using both se-quential and SIMD-parallel code. For brevity, only the latter are depicted herein, as the former are easilydeduced from these using the code in Algorithm 3.6 as reference.

For the SOS algorithm, it was noticed that the most burdensome operation was the vector multiplica-tion and accumulation (as in the inner loops of Figure 3.5(a)). As such, this operation was implementedas a C function, using NEON assembly instructions, as depicted in Algorithm 3.9.

In this code, SIMD_SIZE was defined to be 4, the number of 32-bit integers that fit in each SIMD register. One starts by loading the value of vb into the 128-bit SIMD register q0 (which corresponds to the two 64-bit registers d0 and d1) and setting the carries, cl and ch, to 0 (in this section, %q0 stands for the q-register corresponding to argument 0).


Algorithm 3.9: NEON Vector Multiply and Accumulate

void vmac(uint32_t (*vt)[SIMD_SIZE], uint32_t (*va)[SIMD_SIZE], uint32_t vb[SIMD_SIZE],
          int s, uint64x2_t *pcl, uint64x2_t *pch) {
    uint64x2_t cl = *pcl, ch = *pch;

    asm("vld1.32 {d0, d1}, [%4]\n\t"
        "veor %q0, %q0, %q0\n\t"
        "veor %q1, %q1, %q1\n\t"
        "mov r0, #0\n\t"
        "mov r1, #16\n\t"
        "tag1:\n\t"
        "vld1.32 {d2, d3}, [%3], r1\n\t"
        "vld1.32 {d4, d5}, [%2]\n\t"
        "veor q3, q3, q3\n\t"
        "vzip.32 q2, q3\n\t"
        "add r0, r0, #1\n\t"
        "vmlal.u32 q2, d2, d0\n\t"
        "vmlal.u32 q3, d3, d1\n\t"
        "vaddq.i64 q2, %q0\n\t"
        "vaddq.i64 q3, %q1\n\t"
        "vshr.u64 %q0, q2, #32\n\t"
        "vshr.u64 %q1, q3, #32\n\t"
        "vzip.32 q2, q3\n\t"
        "vst2.32 {d4, d6}, [%2], r1\n\t"
        "cmp r0, %5\n\t"
        "blt tag1\n\t"
        : "=&w" (cl), "=&w" (ch)
        : "r" (vt), "r" (va), "r" (vb), "r" (s)
        : "memory", "cc", "q0", "q1", "q2", "q3", "r0", "r1");

    *pch = ch;
    *pcl = cl;
}


After some initializations, a loop begins and, inside it, the values pointed to by va and vt are loaded into the registers q1 and q2. The contents of q2 are rearranged through the use of vzip and spread across q2 and q3, so that the 32 most significant bits of each lane are set to zero. The operation t ← t + a*b is then performed by the vmlal instruction (thus the result of this operation is stored in q2 and q3). Afterwards, carries from previous iterations are added, by calling vaddq on q2/q3 and cl/ch, and updated, by shifting the result 32 bits and storing it in cl/ch. It should be noted that the values of va and vt are incremented by 16 on each iteration (since SIMD_SIZE × 4 B = 16 B) and that the loop is executed s times. Therefore, at the end, the result of the multiply-and-accumulate operation is stored in the 4 × s words pointed to by vt, and in the registers associated with the carry (cl and ch).
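For illustration only, the same per-lane multiply-and-accumulate step can also be expressed with NEON intrinsics. The sketch below is not the thesis code (that is the inline assembly of Algorithm 3.9); it merely assumes the lane layout described above and, as in Algorithm 3.9, clears the carries on entry.

#include <arm_neon.h>
#include <stdint.h>
#define SIMD_SIZE 4

/* Sketch: for i in [0, s[, vt[i] <- low 32 bits of (vt[i] + va[i]*vb + carry), per lane */
void vmac_sketch(uint32_t (*vt)[SIMD_SIZE], uint32_t (*va)[SIMD_SIZE],
                 const uint32_t vb[SIMD_SIZE], int s,
                 uint64x2_t *pcl, uint64x2_t *pch)
{
    uint32x4_t b  = vld1q_u32(vb);
    uint64x2_t cl = vdupq_n_u64(0), ch = vdupq_n_u64(0);   /* carries reset, as with veor */

    for (int i = 0; i < s; i++) {
        uint32x4_t a = vld1q_u32(va[i]);
        uint32x4_t t = vld1q_u32(vt[i]);
        /* widen t to 64 bits per lane, accumulate a*b and the previous carries */
        uint64x2_t lo = vaddq_u64(vmlal_u32(vmovl_u32(vget_low_u32(t)),
                                            vget_low_u32(a),  vget_low_u32(b)),  cl);
        uint64x2_t hi = vaddq_u64(vmlal_u32(vmovl_u32(vget_high_u32(t)),
                                            vget_high_u32(a), vget_high_u32(b)), ch);
        cl = vshrq_n_u64(lo, 32);                            /* upper halves become carries */
        ch = vshrq_n_u64(hi, 32);
        vst1q_u32(vt[i], vcombine_u32(vmovn_u64(lo), vmovn_u64(hi)));  /* keep lower halves */
    }
    *pcl = cl;
    *pch = ch;
}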

This function was then used to implement the inner loops of the SOS method, whose core is described in Algorithm 3.10. In this code, vn corresponds to the value of the modulo (M), and vnlinha to the symmetric of its inverse modulo 2^w (M'[0]). When performing carry propagation (mp_addition_C), it would be very cumbersome to test at each iteration whether the carry generated for all vectors was equal to zero, as it would always require transferring data from the SIMD registers to the general-purpose registers. Because of this, in contrast to the sequential version of the algorithm, the number of iterations for carry propagation was established beforehand whenever this procedure was required. Furthermore, no efficient way was found to perform the final conditional subtraction required by the Montgomery algorithm. As such, the subtraction was always performed and the selection of either the minuend or the difference was run sequentially.

Algorithm 3.10: SOS Main Loops

for (i = 0; i < s; ++i) {
    /* t = t + a*b[i] * 2^(w*i) */
    vmac((uint32_t (*)[SIMD_SIZE]) vt[i], va, vb[i], s, &cl, &ch);
    /* t[i+s] = c */
    vsave(vt[i+s], cl, ch);
}

for (i = 0; i < s; ++i) {
    /* m = t[i] * n' mod 2^w */
    memset((void *) vm, 0x00, SIMD_SIZE * 4);
    vmac(vm, (uint32_t (*)[SIMD_SIZE]) vt[i], *vnlinha, 1, &cl, &ch);

    /* t = t + m * n * 2^(w*i) */
    vmac((uint32_t (*)[SIMD_SIZE]) vt[i], vn, *vm, s, &cl, &ch);

    /* ADD(t[i+s], C) */
    mp_addition_C(vt, i+s, 2*s+1, &cl, &ch);
}

The FIOS method was implemented using a similar rationale, and the code for the inner loop described in Figure 3.5(b) is the one in Algorithm 3.11. It starts by loading the values of vt[j] and va[j] into suml and q0, respectively (%e2 stands for the least significant 64 bits of argument 2 and %f2 for its most significant 64 bits). Then the values of suml are spread amongst suml and sumh using vmovl instructions, so that the 32 most significant bits of the lanes are zero. The values of suml and sumh then accumulate the product of va[j] and b (which corresponds to b[i]) and the carry. Carries are afterwards updated through a shift of the result, and carry propagation takes place. In the end, the value of vn[j], which corresponds to the modulo, is loaded into q0, carries are masked out of registers suml and sumh, and these latter registers accumulate the value of m × n[j] (%P6 and %P7 correspond to the 64-bit registers of ml and mh); m is computed before stepping into the loop, and corresponds to the value that satisfies t + mn ≡ 0 (mod 2^w). In the end, carries are updated for later usage and the resulting sum is stored in vt[j-1].


Algorithm 3.11: FIOS Inner Loop

for (j = 1; j < s; ++j) {
    asm("vld1.32 {%e2, %f2}, [%5]\n\t"
        "vld1.32 {d0, d1}, [%4]\n\t"
        "vmovl.u32 %q3, %f2\n\t"
        "vmovl.u32 %q2, %e2\n\t"
        "vmlal.u32 %q2, d0, %e6\n\t"
        "vmlal.u32 %q3, d1, %f6\n\t"
        "vaddq.i64 %q2, %q0\n\t"
        "vaddq.i64 %q3, %q1\n\t"
        "vshr.u64 %q0, %q2, #32\n\t"
        "vshr.u64 %q1, %q3, #32\n\t"
        : "+w" (cl), "+w" (ch), "=&w" (suml), "=&w" (sumh)
        : "r" (va[j]), "r" (vt[j]), "w" (b)
        : "cc", "q0");

    /* ADD(t[j+1], C) */
    mp_addition_C(vt, j+1, s+2, &cl, &ch);

    asm("vld1.32 {d0, d1}, [%4]\n\t"
        "vand %q2, %q5\n\t"
        "vand %q3, %q5\n\t"
        "vmlal.u32 %q2, d0, %P6\n\t"
        "vmlal.u32 %q3, d1, %P7\n\t"
        "vshr.u64 %q0, %q2, #32\n\t"
        "vshr.u64 %q1, %q3, #32\n\t"
        : "=&w" (cl), "=&w" (ch), "+w" (suml), "+w" (sumh)
        : "r" (vn[j]), "w" (mask), "w" (ml), "w" (mh)
        : "cc", "q0");

    /* S */
    vsave(vt[j-1], suml, sumh);
}

The code for the modified version of FIOS is depicted in Algorithm 3.12. This algorithm starts by computing the value of vm, which will be multiplied by the modulo when the reduction step is performed. Afterwards, the product of va and vb[i] is computed, using the vmac function described in Algorithm 3.9. To the result is then added the product of vm and the modulo. This process iterates for s words.

Algorithm 3.12: FIOS2 Main Loop

for (i = 0; i < s; ++i) {
    /* vm <- (va[0]*vb[i] + vt[i]) * v'[0] mod 2^w */
    asm("vld1.32 {d0, d1}, [%0]\n\t"
        "vld1.32 {d2, d3}, [%1]\n\t"
        "vld1.32 {d4, d5}, [%2]\n\t"
        "vld1.32 {d6, d7}, [%3]\n\t"
        "vmla.i32 q0, q1, q2\n\t"
        "vmul.i32 q0, q0, q3\n\t"
        "vst1.32 {d0, d1}, [%4]\n\t"
        :
        : "r" (vt[i]), "r" (va[0]), "r" (vb[i]), "r" (*vnlinha), "r" (*vm)
        : "memory", "cc", "q0", "q1", "q2", "q3");

    vmac((uint32_t (*)[SIMD_SIZE]) vt[i], va, vb[i], s, &cl, &ch);
    mp_addition_C(vt, i+s, i+s+2, &cl, &ch);

    vmac((uint32_t (*)[SIMD_SIZE]) vt[i], vn, *vm, s, &cl, &ch);
    mp_addition_C(vt, i+s, i+s+2, &cl, &ch);
}

Beyond SIMD parallelism, multithreading was exploited by using the OpenMP #pragma omp parallel for directive prior to a for-loop. If the for-loop features N/4 iterations, and within its body calls one of the SIMD modular multiplication algorithms, OpenMP shares the computation of the N/4 calls among the cores. Each core, at each moment, computes 4 multiplications in parallel, using SIMD parallelism. Hence, N multiplications are performed in total, as sketched below.
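A minimal sketch of this arrangement follows; simd_montmul_4 is a hypothetical name standing in for one of the 4-lane SIMD routines above, and N is assumed to be a multiple of 4.

#pragma omp parallel for
for (int i = 0; i < N / 4; i++) {
    /* each iteration performs 4 independent Montgomery multiplications in SIMD lanes;
       OpenMP distributes the N/4 iterations among the available cores */
    simd_montmul_4(t[i], a[i], b[i], n, nprime, s);
}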

Additionally, several SOS Montgomery multiplications were performed in parallel using the PowerVR Series5XT SGX544 MP3 GPU. This platform supports 16 threads per shader core, and each thread has access to a floating-point SIMD device, which executes 4 operations in parallel. This architecture enables several levels of parallelism: each shader core executes independently of the others, the computation of the 16 threads is interleaved so as to reduce downtime and maximise resource usage of each shader core, and each thread may use 4-way SIMD instructions to process floating-point values. While the first two levels were exploited by executing many instances of the same algorithm, numbers were encoded using the mantissa of floating-point values in order to take advantage of the SIMD parallelism at the last level. In order to avoid overflow and reduce memory bandwidth, each 24-bit mantissa was used to represent two 12-bit words, interleaved using the approach described in Section 3.2.3. Illustratively, to represent the 24-bit numbers A, B, C and D, the following representation would be used when storing them in memory: A[0] + B[0] × 2^12, C[0] + D[0] × 2^12, A[1] + B[1] × 2^12, C[1] + D[1] × 2^12, where A[0] and A[1] denote the 12 least and most significant bits of A, respectively. Taking the word A[0] + B[0] × 2^12 as an example, since A[0] < 2^12 and B[0] < 2^12, the value A[0] + B[0] × 2^12 does not overflow the 24 bits of the mantissa, and the representations of A[0] and B[0] do not overlap.
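The packing arithmetic can be illustrated with the following helpers; pack12 and unpack12 are hypothetical names used only for this sketch and correspond to the digit arithmetic performed by the mystore/myload wrappers described next.

#include <math.h>

/* two 12-bit digits share a 24-bit mantissa: value = lo + hi * 2^12 */
static float pack12(float lo, float hi)   { return lo + hi * 4096.0f; }

static void unpack12(float packed, float *lo, float *hi)
{
    *hi = floorf(packed / 4096.0f);        /* upper 12-bit digit */
    *lo = packed - *hi * 4096.0f;          /* lower 12-bit digit */
}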

The resulting kernel, in OpenCL for the GPU, is depicted in Algorithm 3.13. The get_global_id(0) function returns a unique identification number for each thread. This value is used to determine which values to load from memory, so that each thread computes a different multiplication. Firstly, the contents of t are cleared, and afterwards the product of the operands a and b is computed in the first i-loop; the float2 and float4 data types represent vectors of 2 and 4 floating-point values, respectively. The vstore2 function stores its first float2 argument at the memory position pointed to by the third argument, with an offset of 2 times the second argument. In contrast, vload2 loads the value pointed to by the last argument, with an offset of 2 times the first argument. The functions mystore and myload were created to encapsulate those functions and to perform the arithmetic required to convert the float4 data types to the aforementioned memory arrangement. Given the (x, y, z, w) floating-point vector, mystore computes (x + y × 2^12, z + w × 2^12) and stores the result in memory. Conversely, myload loads an (x, y) vector and computes (x mod 2^12, ⌊x/2^12⌋, y mod 2^12, ⌊y/2^12⌋).

Inside the j-loop, when S holds the value of va*vb + C + vt, the 12 most significant bits of each word of this operation are stored in C by multiplying the result by 2^-12 (vinv) and using the floor function. Afterwards, vt keeps the least significant bits of S through the operation S + C*vmbase, where vmbase takes the value of -2^12. The second i-loop reduces the result by performing an operation similar to that of the second loop in Algorithm 3.10. Finally, the result is right-shifted by s words by repeatedly calling vload2 and vstore2.

Parallel Execution of a Single Modular Multiplication

The k-ary modular multiplication method was implemented using OpenMP directives to distribute the workload among the processing cores, and the resulting code is shown in Algorithm 3.14. This algorithm also makes use of the SIMD Product-Scanning method to accelerate multiplications.

Firstly, memory space is allocated for the c and q variables. Afterwards, each multiplication column is processed in parallel inside the mulColumn function. This function starts by calling the macc routine, which multiplies and accumulates the product of two operands, for each partial product of the column, in c. Subsequently, the value of q is computed, using either mmod, which computes multiplication modulo a power of two, or macc, depending on whether Montgomery or Barrett reduction is to be used, respectively. It should be noted that most shifts are implicit, through the use of indexes (for instance, &c[n-d] corresponds to the value of c right-shifted by n-d words) or computed using memmove. When all threads return from the mulColumn function, they are synchronised. The q values are then added, using the addS function, which adds signed operands. The c values are also added, taking their weights into account. The value of the product of q and the modulo is subtracted from the final value of c, using msub if q is positive, or macc otherwise. Finally, an n/2-word right shift is taken implicitly, when using memcpy to copy the result to the final buffer.


Algorithm 3.13: SOS SIMD OpenCL Implementation

kernel void mm(global float *t, global float *a, global float *b,
               global float *n, global float *np, int s) {

    int id, tid, i, j;
    float4 S, C;
    float4 vnp = myload(get_global_id(0), np);
    float2 zero = (float2)(0.0, 0.0);

    id = get_global_id(0) * s;
    tid = 2 * id;
    for (i = 0; i < 2*s; i++) vstore2(zero, tid + i, t);

    for (i = 0; i < s; i++) {
        float4 va = myload(id + i, a);
        C = (float4)(0.0, 0.0, 0.0, 0.0);
        for (j = 0; j < s; j++) {
            float4 vb = myload(id + j, b);
            float4 vt = myload(tid + i + j, t);
            S = va*vb + C + vt;
            C = floor(S * vinv);
            vt = S + C*vmbase;
            mystore(vt, tid + i + j, t);
        }
        mystore(C, tid + i + s, t);
    }

    for (i = 0; i < s; i++) {
        float4 vt = myload(tid + i, t);
        float4 vq = vt * vnp;
        C = floor(vq * vinv);
        vq += C*vmbase;
        C = (float4)(0.0, 0.0, 0.0, 0.0);
        for (j = 0; j < s; j++) {
            float4 vn = myload(id + j, n);
            float4 vt = myload(tid + i + j, t);
            S = vq*vn + C + vt;
            C = floor(S * vinv);
            vt = S + C*vmbase;
            mystore(vt, tid + i + j, t);
        }
        addcarry(C, tid + i + s, t, tid + 2*s);
    }

    for (i = 0; i < s; i++) {
        float2 vt = vload2(tid + s + i, t);
        vstore2(vt, tid + i, t);
    }
}


RNS-based Modular Multiplication

In order to reduce the amount of pre-computed variables, and therefore the memory requisites of the Montgomery RNS algorithm, the approach by Antao was followed [52], and the following definitions were employed:

• µi.x = −M^−1 Ri^−1 mod ri

• µi.y = M Ri^−2 mod ri

• µi.z = R^−1 Ri mod ri

• µi.w = R mod ri

• Operands A in base R are stored as ξ_ai = A Ri^−1 mod ri. It should be noted that the results in this base are not required to retrieve the final results, and therefore this format does not change the final output.

Using this approach, Algorithm 3.5 may be implemented as stated in Algorithm 3.15. The evaluation of k may be performed using either Kawamura's approximation, where 1/ri is approximated by 1/2^l, or the proposed approximation, which uses the value of ⌊2^b/ri⌋. The former and the latter approaches are depicted in Algorithms 3.16 and 3.17, respectively, where integer arithmetic is used.

The remainder operation, usually denoted by % in C, is an expensive operation which should be avoided when performing computations on the GPU. As such, modular reductions of z in channel ri were performed using Algorithm 3.18, which is based on the fact that 2^l − ri ≡ 2^l (mod ri). It should be noted that the operation min(z, z − ri) is performed using unsigned arithmetic. The while loop used therein was unrolled, taking into consideration the maximum value to be reduced and the minimum value in R, so that divergences between threads were avoided.


Algorithm 3.14: k-ary Modular Multiplication OpenMP Implementation

void mulColumn(int id, uint32_t *c, int *sign, uint32_t *q, uint32_t *a, uint32_t *b,
               uint32_t *nprime, uint32_t *mu, int n, int k) {
    int i, j, d;
    int ratio = n / k;

    for (i = MAX(0, id-k+1); i <= MIN(id, k-1); i++) {
        j = id - i;
        macc(c, &a[i*ratio], &b[j*ratio], ratio, ratio);
    }

    d = id * ratio - n/2;
    *sign = 0;

    if (d < 0) {
        mmod(&q[n/2+d], c, nprime, -d);
        *sign = -1;
    } else if (d > (n - 2*ratio)) {
        int t = d + 2*ratio - n;
        if (n-d >= 0) {
            macc(q, &c[n-d], &mu[n/2-t], 2*ratio+1-n+d, n/2+1-n/2+t);
        } else {
            macc(&q[d-n], c, &mu[n/2-t], 2*ratio+1-n+d, n/2+1-n/2+t);
        }
        memset(q, 0, t * sizeof(uint32_t));
        memmove(&q[n/2-t], &q[0], (n+2-n/2+t) * sizeof(uint32_t));
        memset(q, 0, (n/2-t) * sizeof(uint32_t));
        *sign = 1;
    }
}

void mulP(uint32_t *t, uint32_t *a, uint32_t *b, uint32_t *mu, uint32_t *nprime,
          uint32_t *mod, int n, int k) {
    int N = 2*k - 1;
    uint32_t **c, **q;
    int *sign;
    int i, j;

    c = ALLOC_TYPE(N, uint32_t *);
    MEM_CHECK(c);
    q = ALLOC_TYPE(N, uint32_t *);
    MEM_CHECK(q);

    c[0] = ALLOC_TYPE(N * (2*n+2), uint32_t);
    MEM_CHECK(c[0]);
    q[0] = ALLOC_TYPE(N * (n+2), uint32_t);
    MEM_CHECK(q[0]);
    sign = ALLOC_TYPE(N, int);
    MEM_CHECK(sign);

    for (i = 1; i < N; i++) {
        c[i] = c[i-1] + (2*n+2);
        q[i] = q[i-1] + (n+2);
    }

    #pragma omp parallel for
    for (i = 0; i < N; i++)
        mulColumn(i, c[i], &sign[i], q[i], a, b, nprime, mu, n, k);

    for (i = 1; i < N; i += i)
        for (j = 0; j < N-i; j += (i+i))
            addS(q[j], &sign[j], q[j+i], &sign[j+i], n+2);

    for (i = 1; i < N; i += i)
        for (j = 0; j < N-i; j += (i+i))
            add(&c[j][i*n/k], &c[j][i*n/k], c[j+i], 2*n+2 - i*n/k);

    if (sign[0] > 0)
        msub(c[0], q[0], mod, n+2, n);
    else
        macc(c[0], q[0], mod, n+2, n);

    while (cmp(&c[0][n/2], mod, n+2) >= 0)
        sub(&c[0][n/2], &c[0][n/2], mod, n+2);

    memcpy(t, &c[0][n/2], n * sizeof(uint32_t));

    free(c[0]);
    free(q[0]);
    free(c);
    free(q);
    free(sign);
}

Algorithm 3.15: Implementation of RNS Montgomery Multiplication
    ξ_qi ≡ ai bi µi.x mod ri,  i ∈ [0, n[
    ξ_ti = (Σ_{j=0}^{n−1} ξ_qj Mj µi.y + ξ_ai ξ_bi) µi.z mod ri
    Evaluation of k
    ti = Σ_{j=0}^{n−1} ξ_tj Mj − k µi.w mod ri

Algorithm 3.16: Evaluation of k
    sum ← α × 2^l
    k ← 0
    for i ∈ [0, n[ do
        sum ← sum + ξi
        k ← k + (sum >> l)
        sum ← sum & (2^l − 1)
    end


Algorithm 3.17: 2nd Evaluation of k
    sum ← α × 2^g
    k ← 0
    for i ∈ [0, n[ do
        sum ← sum + ξi × ⌊2^g/ri⌋
        k ← k + (sum >> g)
        sum ← sum & (2^g − 1)
    end


Algorithm 3.18: GPU Modular Reduction
    while z ≥ 2 × ri do
        zL = z & (2^l − 1)
        zH = z >> l
        z = zL + (2^l − ri) zH
    end
    return min(z, z − ri)
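In C, the reduction of Algorithm 3.18 may be sketched as follows; this is an illustration only, assuming l-bit moduli and 32-bit unsigned arithmetic (in the OpenCL kernel the loop is unrolled a fixed number of times, as discussed above).

#include <stdint.h>

static uint32_t reduce_channel(uint32_t z, uint32_t r, unsigned l)
{
    while (z >= 2u * r) {
        uint32_t zL = z & ((1u << l) - 1);
        uint32_t zH = z >> l;
        z = zL + ((1u << l) - r) * zH;     /* uses 2^l - r ≡ 2^l (mod r) */
    }
    uint32_t d = z - r;                     /* wraps around when z < r */
    return d < z ? d : z;                   /* min(z, z - r) in unsigned arithmetic */
}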

The final code was implemented in OpenCL and is listed in Algorithm 3.19, wherein 16-bit moduli were used. This algorithm starts by accessing memory in a vectorised manner: loc_mu corresponds to µi, loc_x to ai and ξ_ai, loc_y to bi and ξ_bi, and loc_mi to the ri of the two bases. Afterwards, w[local_id] is computed, which corresponds to ξ_qi, by calling the mod_mul function twice; this function performs the multiplication of two 16-bit operands and uses Algorithm 3.18 to reduce the result using 32-bit arithmetic. The barrier(CLK_LOCAL_MEM_FENCE) function call creates a synchronisation point, where threads are stopped until all w values have been flushed to the local memory. The inner_product function call that follows computes the value of Σ_{j=0}^{n−1} ξ_qj Mj mod ri, which is afterwards multiplied by µi.y, added to ξ_ai ξ_bi, and the result is multiplied by µi.z modulo ri. The value of k is then evaluated after another synchronisation, using either Algorithm 3.16 or 3.17. Which method to use is determined at compile time, through the INV macro. It should be noted that for the second method the value of b = 30 was selected, and that mad24((uint)inv[i], (uint)w[i+NUM_CHANNELS], sum) computes the product of inv[i] and w[i+NUM_CHANNELS], using their 24 least significant bits, and adds the 32-bit result to the 32 bits of sum. mad24 executes faster than using the product and addition operators with integers, and it was used not only in the second method for base extension, but also for computing the modular multiplication (mod_mul). The computed k value is stored in local memory and, after another synchronisation point, the operation ti = Σ_{j=0}^{n−1} ξ_tj Mj − k µi.w mod ri is completed and ti and ξ_ti are stored in global memory in a vectorised fashion.

Experimental Results

The SIMD versions of the Montgomery multiplication, which include the final subtraction for generality, were compiled with gcc 4.8.1, with the -O3 flag, run on the 1.6 GHz A15 quad-core of the ODROID-XU+E platform, operated by Ubuntu 13.10, and timed with the POSIX clock_gettime function. The conditional move that follows the subtraction, corresponding to the selection of either the minuend or the difference as the result, was always performed sequentially.
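A harness along the following lines reproduces the measurement methodology; it is only a sketch, in which run_one is a hypothetical callback wrapping the multiplication under test and the conversion to clock cycles assumes the 1.6 GHz operating frequency.

#include <time.h>

static double avg_ns_per_call(void (*run_one)(void), int reps)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < reps; i++)
        run_one();                               /* one (set of) modular multiplication(s) */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / reps;                            /* multiply by 1.6 to obtain clock cycles */
}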

The execution times per multiplication are shown in Table 3.3, and result from taking the average execution time over 1,024 multiplications. One can conclude that the SOS method was the fastest sequential method, whereas FIOS2 was the fastest parallel version. As such, these methods were elected as references for defining the speedup of the SIMD parallelisation on the ARM processor; results for this metric are also presented in Table 3.3.


Algorithm 3.19: OpenCL Implementation of RNS Montgomery Multiplication

uint inner_product(local ushort w[NUM_CHANNELS], global ushort Mjmi[NUM_CHANNELS],
                   uint c, uint mod) {
    uint sum = 0;
    for (int i = 0; i < NUM_CHANNELS; i++)
        sum = mod_add(mod_mul((uint)w[i], (uint)Mjmi[i], c, mod), sum, mod);
    return sum;
}

kernel void mm(global ushort x[],
               global ushort y[],
               global ushort z[],
               global ushort mu[],
               global ushort M1jm2i[NUM_CHANNELS*NUM_CHANNELS],
               global ushort M2jm1i[NUM_CHANNELS*NUM_CHANNELS],
               global ushort mi[2*NUM_CHANNELS],
#if INV==0
               local ushort w[2*NUM_CHANNELS])
#else
               local ushort w[2*NUM_CHANNELS],
               global ushort inv[NUM_CHANNELS])
#endif
{
    uint local_id = get_local_id(0);
    uint offset = NUM_CHANNELS * get_group_id(0);
    uint4 loc_mu = convert_uint4(vload4(local_id + offset, mu));
    uint2 loc_x  = convert_uint2(vload2(local_id + offset, x));
    uint2 loc_y  = convert_uint2(vload2(local_id + offset, y));
    uint2 loc_mi = convert_uint2(vload2(local_id, mi));

    uint2 loc_c = (uint2)(BASE) - loc_mi;

    w[local_id] = mod_mul(mod_mul(loc_x.x, loc_y.x, loc_c.x, loc_mi.x),
                          loc_mu.x, loc_c.x, loc_mi.x);

    barrier(CLK_LOCAL_MEM_FENCE);

    uint csi = inner_product(w, &M1jm2i[local_id*NUM_CHANNELS], loc_c.y, loc_mi.y);
    csi = mod_mul(loc_mu.y, csi, loc_c.y, loc_mi.y);
    csi = mod_add(mod_mul(loc_x.y, loc_y.y, loc_c.y, loc_mi.y), csi, loc_mi.y);
    csi = mod_mul(loc_mu.z, csi, loc_c.y, loc_mi.y);

    w[local_id + NUM_CHANNELS] = csi;

    barrier(CLK_LOCAL_MEM_FENCE);

    local uint k;
    if (local_id == 0) {
        uint sum = ALPHA;
        k = 0;
#if INV==0
        for (int i = 0; i < NUM_CHANNELS; i++) {
            sum += w[i+NUM_CHANNELS];
            k += (sum >> 16);
            sum &= 0xffff;
        }
#else
        for (int i = 0; i < NUM_CHANNELS; i++) {
            sum = mad24((uint)inv[i], (uint)w[i+NUM_CHANNELS], sum);
            k += (sum >> 30);
            sum &= 0x3fffffff;
        }
#endif
    }

    barrier(CLK_LOCAL_MEM_FENCE);

    uint zi = inner_product(&w[NUM_CHANNELS], &M2jm1i[local_id*NUM_CHANNELS],
                            loc_c.x, loc_mi.x);
    zi = mod_sub(zi, mod_mul(k, loc_mu.w, loc_c.x, loc_mi.x), loc_mi.x);

    ushort2 loc_z = (ushort2)((ushort)zi, (ushort)csi);
    vstore2(loc_z, local_id + offset, z);
}


Number of bits                 256      512      1024     2048      4096
Execution Time [clock cycles] - Serial
  SOS                          8,398    11,984   25,101   77,886    259,179
  FIOS                         8,416    14,264   34,067   111,157   411,476
  FIOS2                        13,638   15,056   29,106   83,659    296,955
Execution Time [clock cycles] - 1-core SIMD parallel
  SOS                          2,011    4,740    16,929   59,163    227,900
  FIOS                         2,640    14,704   96,984   664,947   5,006,096
  FIOS2                        1,964    4,206    14,051   50,265    203,553
  Speedup                      4.3      2.8      1.8      1.5       1.3
Execution Time [clock cycles] - 4-core SIMD parallel
  SOS                          1,139    1,365    4,118    15,234    66,748
  FIOS                         955      3,832    25,442   168,909   1,466,107
  FIOS2                        1,120    1,293    3,864    12,928    49,636
  Speedup                      7.5      9.3      6.5      6.0       5.2

Table 3.3: Obtained performance for implementing Montgomery multiplications on the ARM processor


As one can observe from the experimental results presented in Table 3.3, the FIOS method is the slowest when parallelism is exploited. Although one could expect the contrary, since it only possesses one outer and one inner loop, this method requires a high number of memory accesses, due to the requirement of performing carry propagation inside the inner loop. The impact of this is aggravated by the fact that carry propagation can be inefficient using SIMD extensions. Nevertheless, the second version of the algorithm suits SIMD parallelism particularly well, since it limits carry propagation, which reduces the amount of memory accesses. The single-core parallel version of FIOS2 presents a speedup of up to 4.3 for the NEON technology.

It should be noted that the sequential version is already accelerated by the superscalar execution of the A15 processor. As such, several instructions are already processed simultaneously, and therefore the total number of cycles it takes to compute a modular multiplication is smaller than one would expect. Even though, theoretically, the 4-lane SIMD execution of multiple operations should provide fourfold the performance of sequential code, the fact that the latter already runs partly in parallel limits the speedup for larger operands.

Main memory access is a critical aspect when employing multithreading parallelism. In fact, contention on memory might severely damage performance on symmetric multiprocessing platforms. Since data is loaded to the cache in blocks of 512 bits on the ARM processors, better cache efficiency is achieved as the operand width increases, since there is greater data reuse. This characteristic translates into a better relative speedup as the operand width increases, since memory contention is decreased.

The OpenCL Montgomery multiplication program was run on the PowerVR Series5XT SGX544 MP3 GPU in order to perform 192 multiplications in parallel. The number of multiplications is related to the fact that the GPU has 3 cores, each core may host 16 threads, and each thread can perform 4 parallel multiplications. The OpenCL library was available for the Android Operating System (OS) 4.2.2 and, therefore, it was run using the Java Native Interface (JNI), which enables Java code to call native applications. The SOS method was also run sequentially on the CPU using the JNI, for comparison. The code was compiled with the arm-2010q1-202-arm-none-linux-gnueabi cross-compiler, under Oracle JDK 1.6.0 u33, Android SDK r20.0.3, with API 18, and Android NDK r9.


Number of bits                 256    512    1024   2048
Execution Time [µs]
  GPU SOS                      42     118    477    1,937
  SOS                          10     32     111    296
  Speed-Up                     0.2    0.3    0.2    0.2
  Serial SOS from Table 3.3    5.2    7.5    15.7   48.7

Table 3.4: Multi-Precision Montgomery PowerVR Series5XT SGX544 MP3 GPU Multiplication Performance

Number of bits                   256      512      1024     2048     4096
Execution Times [clock cycles]
  k-ary Modular Multiplication   12,249   14,652   22,302   47,006   133,556
  SOS                            8,398    11,984   25,101   77,886   259,179
  Speed-Up                       0.69     0.82     1.13     1.67     1.94

Table 3.5: k-ary Multi-Precision Modular Multiplication Performance for 3 Cores

Table 3.4 shows the average execution time taken over 1,920 multiplications for the GPU and 256 multiplications for the CPU SOS method, using the POSIX gettimeofday function.

Firstly, the SOS method shows a degradation of performance in comparison to the values obtained in Table 3.3. This may be related to the scheduling system of Android, which might have allocated that task to an A7 core, which has fewer computational resources and operates at a lower frequency. Secondly, the GPU is greatly outperformed by the sequential version of SOS. There are several factors that contribute to the large execution times: the SOS method features a large number of branch instructions, which fit CPUs better; it is a very memory-intensive task and, therefore, the number of tasks may not be enough to completely hide memory access latency; the lack of SIMD integer instructions leads to a decrease in performance; and the GPU features a memory device dedicated to images, which it was not possible to exploit, since its use implied that no branch instructions could be used.

It should be noticed that, even though the PowerVR GPU was ineffective at computing several Montgomery multiplications, its use might still be beneficial. Since this computation may be performed asynchronously, during the time that the GPU is performing modular multiplications the CPU is free and may be used to perform other tasks. The PowerVR GPU platform lacks synchronisation between threads, which prohibits the use of the RNS modular multiplication algorithm.

The k-ary modular multiplication method, which uses the SIMD Product-Scanning method to perform multi-precision multiplications, was tested using three 1.6 GHz A15 cores of the ODROID-XU+E, by setting k = 2. It was not possible to exploit all four A15 cores, due to the fact that the method uses 2k − 1 cores. For systems with more cores, a higher value of k could be used so that 2k − 1 cores would be exploited, and better performance could presumably be achieved. The code was compiled with gcc 4.8.1, with the -O3 flag, and run under Ubuntu 13.10. 256 multiplications were executed and the execution time was measured using the clock_gettime function. Table 3.5 shows the average execution time for different operand widths. The sequential SOS method times were replicated from Table 3.3, for comparison purposes.

The experimental results show that the cost of software parallelism outweighs the additional computation power for small operands. This cost is associated with the creation of threads, their synchronisation, and the extra memory operations required (creation of buffers and memory transfers). As the width of the operands increases, the k-ary modular multiplication undoubtedly attains the best performance, as the arithmetic complexity dominates the parallelism overhead. It was possible to achieve a maximum speedup of almost 2.


Number of bits                          256    512    1024   2048
Execution Time [µs] for Serial version
  SOS                                   3.7    12.6   46.0   176.6
Execution Time [µs] for GPU version
  RNS                                   6.3    14.1   28.2   84.2
  Speedup                               0.6    0.9    1.6    2.1

Table 3.6: Obtained performance from the execution of the Montgomery multiplication algorithm on the Adreno GPU, and the sequential version on the 1.7 GHz Krait 300 ARM Cortex-A15 based CPU

The RNS OpenCL Montgomery multiplication program was run on the Qualcomm Adreno 320 GPU. It was executed under the Android OS 4.2.2, using the JNI to enable native code execution. The code was compiled with the arm-2010q1-202-arm-none-linux-gnueabi cross-compiler, under Oracle JDK 1.6.0 u33, Android SDK r20.0.3, with API 18, and Android NDK r9. 2,000 multiplications were enqueued, and the total time was obtained using the OpenCL event profiling interface. The average time per multiplication is presented in Table 3.6. Furthermore, the sequential version of the Montgomery multiplication was implemented using the SOS method and run on the CPU, using the same interface. Execution times for the serial version were measured using the POSIX gettimeofday function, and the average time is presented in the same table, for comparison.

Even though the RNS Montgomery multiplication should theoretically present a linear time complexity, there are implementation factors that make the computation take longer. As the operand width increases, new elements need to be added to the RNS set, and the difference between 2^l and the minimum element increases. As such, more loop iterations of Algorithm 3.18 are required to reduce multiplications. Furthermore, for 2048 bits, the second approach to base extension was employed, due to mathematical constraints, which is less efficient than the first. Despite these issues, its lower complexity makes it worthwhile to employ the RNS on this GPU device for operands whose width is greater than or equal to 1024 bits. A maximum speedup of 2.1 was achieved.

3.4 Summary

In this section, several multi-precision arithmetic algorithms, which play a crucial role in the efficiency of public-key cryptographic systems, were presented and analysed. Different methods to implement multi-precision multiplication were discussed, namely the Operand-Scanning, Product-Scanning and Operand-Caching methods. SIMD-parallel algorithms were developed for the latter two multiplication methods. Experimental results showed that, even though the Operand-Caching method accessed memory fewer times, it was outperformed by the Product-Scanning multiplication, which featured fewer data rearrangements. A maximum speedup of 1.23 was obtained for the Product-Scanning method on the A15 CPU. One can expect higher speedups to be achieved for wider SIMD technologies.

Multi-precision modular multiplication algorithms were also introduced. Different approaches to perform Montgomery multiplication were presented, namely the SOS, FIOS and FIOS2 methods, and SIMD parallelism was exploited to perform multiple simultaneous operations. The limitations of the SIMD extensions, which do not provide an instruction to test whether the carry generated from a sum is zero, had a particularly negative impact on the FIOS method. On the other hand, the SOS and FIOS2 algorithms proved to be efficient at increasing the throughput of modular multiplications, not only when SIMD parallelism was exploited but also for multithreading parallelism. Speedups of up to 4.3 and 9.3 were obtained for the single- and quad-core versions of the SIMD FIOS2 algorithm on the A15 platform, respectively. When the same approach was applied to the PowerVR Series5XT SGX544 MP3 GPU, it did not produce satisfactory results, presumably due to its architectural characteristics, which are not appropriate for highly memory-intensive and divergent code.


The k-ary modular multiplication method also proved to be effective for accelerating modular multiplication, by splitting the multiplication and modular reduction computations over multiple cores. Another level of parallelism was introduced, to further accelerate these computations, through the use of the SIMD Product-Scanning method. The experimental results show that the cost of software parallelism outweighs the additional computation power for small operands. Notwithstanding, as the width of the operands increases, the k-ary modular multiplication attains the best performance, as the arithmetic complexity dominates the parallelism overhead. It achieved at most nearly twofold the performance of the sequential algorithm, when three cores were exploited. For systems with more cores, a higher value of k could be used so that 2k − 1 cores would be exploited, and higher speedups could presumably be achieved.

Finally, the RNS was employed to harness the power of the Adreno 320 GPU. It was shown that the linear time complexity of the RNS-based modular multiplication algorithm was effective at accelerating this operation for large operands. A maximum speedup of 2.1 was obtained.


4 Cryptosystems and Experimental Assessment

Contents
4.1 Modular Exponentiation .................................................. 48
4.2 Elliptic Curve Cryptosystems ............................................ 49
4.3 Implementation Details and Experimental Results ........................ 52
4.4 System Application ...................................................... 55
4.5 Summary ................................................................. 60


In this chapter, the processes used for ciphering and deciphering messages using the RSA, DSA and EC cryptosystems are described. The properties of the cryptosystems are discussed first. Then, the parallel algorithms presented in the previous chapter are used to efficiently implement modular exponentiation and EC operations. The developed algorithms are experimentally assessed and conclusions are drawn. Finally, the feasibility of the proposed approaches is tested by presenting a tool based on them, which produces and verifies digital signatures.

4.1 Modular Exponentiation

With RSA, the encryption of a message P and the deciphering of the cryptogram C can be done using the procedure described in Table 2.1, where modular exponentiation plays a crucial role. Similarly, DSA uses modular exponentiation to generate and verify digital signatures, as depicted in Algorithms 2.1 and 2.2. The operation t ≡ a^b mod n can be implemented using the 2^k-ary method represented in Algorithm 4.1, by replacing ⊙ with the modular multiplication function and O with the multiplicative identity. If the Montgomery multiplication algorithm is used, this value is R (mod M); if the k-ary modular multiplication algorithm is used instead, the multiplicative identity takes the value of 2^(n/2) (mod M).

Algorithm 4.1: Generic 2^k-ary method
    input: b = Σ_{i=0}^{s−1} bi 2^(ki)

    W[1] ← a
    for i ← 2 to 2^k − 1 do
        W[i] ← W[i−1] ⊙ W[1]
    end
    t ← O
    for i ← s−1 to 0 do
        for j ← 0 to k − 1 do
            t ← t ⊙ t
        end
        if bi ≠ 0 then
            t ← t ⊙ W[bi]
        end
    end

Illustratively, if k = 1 and classic modular reduction were used, the operation a^100 mod n would take place as (((((a^2 × a)^2)^2)^2 × a)^2)^2 mod n.

RSA

Herein, when performing an X-bit RSA operation, both an X-bit modulo and an X-bit exponent are considered. In the results sections, timings are measured for random X-bit values for the sake of generality. However, there are some optimisations that can be introduced in order to reduce the complexity of ciphering and deciphering cryptograms using RSA.

When deciphering a cryptogram, using the private key d, the user may split the computation of P ≡ C^d (mod n) over two operations modulo p and q, using the CRT. The use of multi-prime RSA [53][54] develops this concept to improve performance, namely by defining n = ∏_{i=0}^{r−1} pi, and dividing the computation of P over r channels. As such, it is beneficial to study the parallel computation of several modular exponentiations.

Using the CRT, the cryptogram C may be deciphered, for multi-modulo RSA, by performing the following computations:


P0 ≡ C^d (mod p0)
P1 ≡ C^d (mod p1)
...
P_{r−1} ≡ C^d (mod p_{r−1})          (4.1)

and reconstructing P mod n using (3.9). If the parallel implementation of modular exponentiation allows for the use of different exponents for each modulo, the value of d may be replaced by d_{pi} ≡ e^−1 (mod (pi − 1)), for the modulo pi.
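As a worked illustration for the two-prime case (r = 2), one standard recombination (Garner's method) is shown below; it is given only as an example and is not necessarily the exact form of (3.9).

\begin{aligned}
P_p &\equiv C^{d_p} \pmod{p}, \qquad P_q \equiv C^{d_q} \pmod{q},\\
P   &\equiv P_q + q\left(\bigl(P_p - P_q\bigr)\bigl(q^{-1} \bmod p\bigr) \bmod p\right) \pmod{n}.
\end{aligned}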

The need for the simultaneous computation of several ciphering operations may also arise when a server wishes to identify itself to many users [55]. Additionally, since users have the freedom to choose any value as their public key, it is typical to choose the value of e = 1 + 2^16, as it has a low Hamming weight, which means that its binary representation has few ones, making it more computationally efficient while not compromising security [56]. Therefore, an e-commerce server may significantly improve its performance by computing several parallel modular exponentiations, using the same public-key or private-key exponent, but different messages and possibly different public-key moduli.

For instance, if a server wishes to cipher multiple messages on a processor with 2 cores and a 4-lane SIMD engine, it may compute 8 modular exponentiations with the same exponent and modulus. It may also use the SIMD engine of a single core for the computation of a single modular exponentiation on a 4-prime RSA cryptosystem, using the same value of d across all moduli. Also, it may compute 4 modular exponentiations a_j^d mod n, j ∈ {0, .., 3}, using the 2 cores and the SIMD engine, whereby each core is attributed a value of i ∈ {0, 1} and simultaneously computes the modular exponentiations a_j^{d_{pi}} mod pi, j ∈ {0, .., 3}, with exponent d_{pi} ≡ e^−1 (mod (pi − 1)).

4.2 Elliptic Curve Cryptosystems

An elliptic curve E is defined over a finite field F by an equation, named the Weierstrass equation, of the form:

y^2 + axy + by = x^3 + cx^2 + dx + e,  where a, b, c, d, e ∈ F          (4.2)

together with a point at infinity O. If F does not have characteristic 2 or 3¹, the equation can be transformed into the simplified Weierstrass equation, which is defined as:

y^2 = x^3 + ax + b          (4.3)

For cryptographic applications, all elliptic curves are defined over finite fields. However, for illustrative purposes, we refer the reader to Figure 2.2, where a curve is plotted for real numbers, alongside one for a prime field.

It is possible to build an algebraic group over ECs. The addition operation over this group has a geometrical interpretation, which will now be outlined, and is illustrated in Figure 2.2: given points A and B, the line AB that intersects both points is drawn. Except for the case where AB is vertical, the line will intersect the EC a third time, in point −C. Point C, which corresponds to the addition of points A and B, is obtained by computing the point that is vertically symmetric to −C. Points C and −C are additive inverses of one another; adding them geometrically results in a vertical line that extends to the point at infinity, O. Point doubling, which corresponds to adding A to A, is performed by drawing a tangent line to the EC at A, and determining the vertically symmetric point to where the line crosses the EC another time. It is possible that this line does not intersect the curve more than one time, extending to infinity. In this case, the point is its own additive inverse, that is, A + A = O.

¹ If repeatedly adding the multiplicative identity 1 to itself in a field never gives 0, then that field is said to have characteristic 0. Otherwise, there is a prime number p such that pn = 0, for all n ∈ F, and p is called the characteristic of that field.

The mathematical formulae for point addition can be deduced by a similar rationale. In order to compute the addition C = (x3, y3) of two points, A = (x1, y1) and B = (x2, y2), for A ≠ B, one starts by determining the equation of the line that crosses those two points, y = αx + β; α can be computed as α = (y1 − y2)/(x1 − x2) and β as β = y1 − αx1. In order to find the coordinates of C, this is plugged into the equation of the curve: (αx + β)^2 = x^3 + ax + b. Since there are at most three roots of the cubic equation, and it is known that x1 and x2 are two of the roots, x3 must equal the third root. It can be shown that the sum of the roots of a monic polynomial is equal to minus the coefficient of the second-to-highest power, and therefore x3 = α^2 − x1 − x2. Finally, y3 can be computed as y3 = −(αx3 + β) = α(x1 − x3) − y1. When A = B, point addition can be computed similarly, except that α is now computed as the derivative dy/dx at A. Implicit differentiation of (4.3) leads to α = (3x1^2 + a)/(2y1).
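Summarising the derivation above in a single expression (a restatement only, not new material):

\alpha =
\begin{cases}
\dfrac{y_1 - y_2}{x_1 - x_2}, & A \neq B,\\[2ex]
\dfrac{3x_1^2 + a}{2y_1},     & A = B,
\end{cases}
\qquad
x_3 = \alpha^2 - x_1 - x_2, \qquad
y_3 = \alpha\,(x_1 - x_3) - y_1.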

When points are represented by the (x, y) tuple, where x, y ∈ F, it is said that they are represented using affine coordinates. As previously seen, point addition using the simplified Weierstrass form of the curve requires 2 modular multiplications, 1 squaring and 1 inversion. Field inversion is computationally demanding, and should be avoided whenever possible. Converting the points to an alternative representation allows point addition and doubling to be performed without modular inversions, and therefore a significant performance improvement is attained.

To present the alternative coordinate representations, it is important to introduce the concept of projective space. A projective plane is the set of equivalence classes of triples (X, Y, Z) (not all elements equal to zero), where two triples are said to be equivalent if they are scalar multiples of one another. By introducing this concept to elliptic curves, every point may have multiple representations of the form (X, Y, Z). For example, (X, Y, Z) ∼ (2X, 2Y, 2Z) ∼ (nX, nY, nZ), n ∈ Z\{0}, where ∼ denotes equivalence. If we choose the points lying on the plane Z = 1 as the representatives, the relationship between an EC defined over the affine plane and the projective plane can be described by the transform:

x = X/Z,  y = Y/Z          (4.4)

A variant of the representation in (4.4), often called Jacobian coordinates, is another alternative representation, which provides faster computation of point doubling. The conversion from (X, Y, Z) Jacobian coordinates to (x, y) affine coordinates can be performed as:

x = X/Z^2,  y = Y/Z^3          (4.5)

The resulting EC equation is:

Y^2 = X^3 + aXZ^4 + bZ^6          (4.6)

For cryptographic applications, it is possible to consider that the addend of a point addition is normalised, i.e. Z2 = 1. Notice that the denominator of α, when performing the point addition of two different points, can now be written as:

x1 − x2 = X1/Z1^2 − X2 = H/Z1^2,  H = X1 − Z1^2 X2          (4.7)

It is clear that, if Z3 = HZ1, then X3 = (HZ1)^2 x3 can be computed without performing modular inversions. If we define U1 = X2 Z1^2, S1 = Y2 Z1^3 and R = Y1 − S1, point addition can be derived as:

X3 = Z3^2 x3 = (Y1 − S1)^2 − H^2 (X1 + X2 Z1^2)
   = R^2 − H^2 (X1 − X2 Z1^2 + 2 X2 Z1^2) = R^2 − H^2 (H + 2 U1)
   = R^2 − H^3 − 2 U1 H^2          (4.8)


Similar results hold for Y3, and for point doubling. The formulae that result thereof are presented in Algorithm 4.2.

Algorithm 4.2: Point Addition Formulae for Jacobian Coordinates
    input : P1, P2 ∈ EC(a, b, Fp)
    output: P3 = (X3, Y3, Z3) = P1 + P2

    if P1 ≠ O, P2 ≠ O, P1 ≠ P2 and Y1 ≢ −Y2 Z1^3 then
        U1 ≡ X2 Z1^2;  S1 ≡ Y2 Z1^3;  H ≡ X1 − U1;  R ≡ Y1 − S1
        X3 ≡ R^2 − H^3 − 2 U1 H^2
        Y3 ≡ R (U1 H^2 − X3) − S1 H^3;  Z3 ≡ H Z1
    end
    else if (P1 ≠ O, P2 ≠ O, P1 ≠ P2 and Y1 ≡ −Y2 Z1^3)
         or (P1 ≠ O, P1 = P2 and Y1 ≡ 0) then
        P3 = O
    end
    else if P1 ≠ O and P1 = P2 and Y1 ≢ 0 then
        S ≡ 4 X1 Y1^2;  M ≡ 3 X1^2 + a Z1^4
        X3 ≡ M^2 − 2S;  Y3 ≡ M (S − X3) − 8 Y1^4;  Z3 ≡ 2 Y1 Z1
    end
    else if P1 = O then
        P3 = P2
    end

ECC relies on point multiplication for encrypting messages. For a scalar integer s and a point P1 on an EC, the point multiplication P3 of P1 by s is defined as P3 = [s]P1 = P1 + ... + P1 (s times). This operation can be implemented using Algorithm 4.1 to produce the result t ← [b]a, by replacing ⊙ with point addition.

If the "Double and Add" algorithm is used to implement point multiplication, which corresponds to the method described in Algorithm 4.1 by setting k = 1, the memory requirements are reduced, and the formulae provided for the addition of points in Algorithm 4.2 are valid if the Z coordinate of the point P takes the value 1. For larger values of k, the use of Algorithm 4.2 implies the normalisation of the vector W after it is computed, by performing the operation (X, Y, Z) ← (X/Z^2, Y/Z^3, 1) for each entry.

The representation of points in modified Jacobian coordinates, which uses those same formulae but where points have an extra coordinate Z^2 (so they are represented as (X, Y, Z, Z^2)), allows for the use of the scheduling proposed in [20]. This scheduling is presented in Table 4.1 (where only multiplications are shown), exploits 3 parallel multipliers, and therefore nicely fits SIMD parallelism.
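As an illustration, a point in modified Jacobian coordinates can be stored as sketched below; WORDS is an assumed compile-time operand size, not a name taken from the thesis code.

#include <stdint.h>
#define WORDS 8                  /* e.g. 8 x 32-bit words for a 256-bit field */

typedef struct {
    uint32_t X[WORDS];
    uint32_t Y[WORDS];
    uint32_t Z[WORDS];
    uint32_t Z2[WORDS];          /* cached Z^2, the extra coordinate */
} mod_jacobian_point;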

Similarly to modular exponentiation, it may be useful for an e-commerce server to compute several point multiplication operations simultaneously if it wishes to identify itself to multiple users at the same time, using ECC. A user may also wish to perform signature generation or verification of multiple messages, using several cores to compute [ki]gi, where ki and gi take different values for each used thread.

Point Addition
Multiplier #1       Multiplier #2        Multiplier #3
U1 ← X2 Z1^2        Z1 Z1^2              –
H^2                 S1 ← Y2 Z1^3         Z3 ← Z1 H
U1 H^2              H H^2                R^2
S1 H^3              R (U1 H^2 − X3)      Z3^2

Point Doubling
Multiplier #1       Multiplier #2        Multiplier #3
Y1^2                X1^2                 (Z1^2)^2
M^2                 S ← 4 X1 Y1^2        Z3 ← 2 Y1 Z1
(Y1^2)^2            M (S − T)            Z3^2

Table 4.1: Scheduling for point arithmetic in modified Jacobian coordinates



4.3 Implementation Details and Experimental Results

The 2^k-ary method, used for both modular exponentiation and EC point multiplication, was implemented based on the code of Algorithm 4.3, where k is denoted by N. In this excerpt of code, the T class corresponds to an encapsulation of either an integer array, in the case of modular multiplication, or an EC point. The value of res should be initialized to the identity of the mul operation: the modular multiplicative identity if modular exponentiation is performed, or O when performing EC point multiplication. Firstly, the values of a, a^2, a^3, ..., a^(2^N − 1) are stored in the M array (corresponding to W in Algorithm 4.1). Afterwards, the i and j loops iterate over the b array and, at each iteration, N of its bits are extracted, from the most significant to the least. After this operation, res is set to res^(2^N) × a^index, where index is equal to the value of the extracted bits.

Algorithm 4.3: Generic 2k-ary Method C Implementation

void exponentiate(T &res, T &a, uint32_t *b, int s, void (*mul)(T&, T&, T&)) {
    int twoToN = 1 << N;
    T *M = new T[twoToN];
    M[1] = a;

    for (int i = 2; i < twoToN; i++)
        mul(M[i], M[i-1], a);

    for (int i = s-1; i >= 0; i--) {
        for (int j = (sizeof(uint32_t) << 3) - N; j >= 0; j -= N) {
            int index = (((twoToN-1) << j) & b[i]) >> j;

            for (int k = 0; k < N; k++)
                mul(res, res, res);

            if (index != 0)
                mul(res, res, M[index]);
        }
    }
}

The function mul can be implemented using one of the methods described in Section 3.2, or the C translation of the formulae in Algorithm 4.2. In that case, for the use of 3 parallel multipliers, data has to be interleaved as stated in Section 3.2.3. Hence, the wrapper function summarised in Algorithm 4.4 was implemented for the FIOS2 method described in Chapter 3.

The variables n and nprime correspond to the prime number that characterises the finite field currently being used, and to the symmetric of its inverse modulo 2^w, respectively. The arrays a, b and t play the role of buffers for the inputs and outputs of the Montgomery multiplication.

It is worth noting that several 2^k-ary methods may be performed in parallel, for example by applying the #pragma omp parallel for directive.

The algorithms tested and evaluated in this section, except for the RNS-based modular exponentiation, were run on the A15 quad-core, with the NEON ISA extensions, integrated in the ODROID-XU+E platform, operated by Ubuntu 13.10, and were compiled with gcc 4.8.1, with the -O3 flag. The RNS-based modular exponentiation was tested on the SYS6440 platform, operated by the Android OS 4.2.2, and native code was compiled with the arm-2010q1-202-arm-none-linux-gnueabi cross-compiler, under Oracle JDK 1.6.0 u33, Android SDK r20.0.3, with API 18, and Android NDK r9.


Algorithm 4.4: Montgomery Multiplication Wrapper

// s is the number of words occupied by each of the inputs
for (i = 0; i < s; i++) {
    a[i][0] = a0[i];
    a[i][1] = a1[i];
    a[i][2] = a2[i];

    b[i][0] = b0[i];
    b[i][1] = b1[i];
    b[i][2] = b2[i];
}

fios2(t, a, b, n, nprime, s);

for (i = 0; i < s; i++) {
    t0[i] = t[i][0];
    t1[i] = t[i][1];
    t2[i] = t[i][2];
}

Number of bits                         256    512     1024     2048      4096
Execution Time [×10^3 clock cycles]
  SOS RSA                              918    4,297   28,340   210,204   1,630,790
Execution Time [×10^3 clock cycles]
  1-core NEON FIOS2 RSA                665    2,920   23,264   137,860   1,065,217
  Speedup                              1.4    1.5     1.2      1.5       1.5
Execution Time [×10^3 clock cycles]
  4-core NEON FIOS2 RSA                127    704     4,726    34,766    282,472
  Speedup                              7.2    6.1     6.0      6.0       5.8

Table 4.2: Multi-Precision NEON Montgomery Exponentiation Performance

Modular Exponentiation

In this section, firstly, the applicability of the SIMD multiplication algorithms is tested for the RSA cryptosystem, by performing many operations with the same exponent in parallel. The need for such computation may arise, for example, when an e-commerce server wishes to encrypt a set of different messages with the same encryption exponent, when authenticating itself to several users, or when decrypting several messages with the same public-key exponent.

Modular exponentiation was implemented using a quaternary method, whose pseudo-code is depicted in Algorithm 4.1, by setting k = 2. This value of k reduces the number of cycles necessary to compute the exponentiation, while not increasing the set-up time (corresponding to the computation of the vector W) significantly.

The execution times of the modular exponentiation operation, experimentally obtained as the average over 256 exponentiations, are presented in Table 4.2. The technique presented in Section 3.2.3 of increasing R was applied as a way to avoid performing the final subtraction of the Montgomery multiplication, since the penalty was not significant. The enhancement of performance follows a tendency similar to that of the Montgomery multiplication as the operand size increases. However, the control dependencies introduced to perform the modular exponentiation limit the achieved speedup. Incidentally, the incremental nature of the FIOS2 method requires that the result and the operands be placed in different memory positions, so that the operands remain constant during the operation. This leads to the need to move data between buffers, which reduces the effectiveness of the SIMD parallelism.

The k-ary modular multiplication algorithm was also applied to enhance modular exponentiation. This operation might be used in both the DSA and the RSA cryptosystems, for the process of ciphering and deciphering messages.


Number of bits                                   256       512      1024      2048       4096
SOS RSA, execution time [×10^3 clock cycles]     918     4,297    28,340   210,204  1,630,790
k-Ary RSA [×10^3 clock cycles]                 4,052    10,232    31,600   130,384    742,145
Speedup (SOS/k-Ary)                              0.2       0.4       0.9       1.6        2.2

Table 4.3: Multi-Precision k-Ary Method Exponentiation Performance, for 3 Cores

Number of bits                        256        512       1024       2048
Serial (SOS) execution time [µs]  1,681.8    9,953.5   69,605.6  498,702.2
GPU (RNS) execution time [µs]     2,777.4    8,553.4   31,806.9  229,526.5
Speedup                               0.6        1.2        2.2        2.2

Table 4.4: Execution time [µs] obtained from the execution of the modular exponentiation algorithm on the Adreno GPU, and the sequential version on the 1.7 GHz Krait 300 ARM Cortex-A15 based CPU

Table 4.3 shows the performance for different operand widths; the operation was implemented using the quaternary method on the ODROID-XU+E platform. It is clear that for small operands the overhead associated with the creation, synchronisation and destruction of threads outweighs the arithmetic operations, leading to poor experimental results. Notwithstanding, the method presents a satisfactory scalability, leading to significant speedups for larger operands.

There is a trend towards increasing the key length of cryptosystems based on modular exponentiation. In fact, Debian's guide for the creation of GNU Privacy Guard keys recommends the creation of 4096-bit keys, Fedora uses 4096-bit keys for signing its software packages, and the CACert.org project uses 4096 bits for its root and intermediate keys [57]. Hence there is a need to perform these operations as efficiently as possible, and the k-ary method meets that demand.

Furthermore, one may conclude that the use of the k-ary modular exponentiation method is most beneficial when only a single exponentiation is to be performed, especially for large operands. When multiple modular exponentiations are required, the previously analysed SIMD Montgomery-based method provides better performance.

When implementing the RNS version of modular exponentiation, the CPU processed Algorithm 4.1, and the modular multiplications were enqueued on the GPU for execution. It was implemented using the "square and multiply" method, which corresponds to Algorithm 4.1 when k = 1, and run on the SYS6440 platform, using JNI to run native code. 100 modular exponentiations were performed, both serially and on the GPU, their execution times were measured using the POSIX gettimeofday function, and the average time is presented in Table 4.4.
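The timing procedure may be sketched as follows (an illustrative harness, not the exact benchmarking code used):

#include <sys/time.h>

/* Average wall-clock time, in microseconds, of `runs` executions of op(). */
double average_us(void (*op)(void), int runs)
{
    struct timeval t0, t1;
    int i;

    gettimeofday(&t0, NULL);
    for (i = 0; i < runs; i++)
        op();
    gettimeofday(&t1, NULL);

    double elapsed = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
    return elapsed / runs;
}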

Despite the fact that the speedups are similar to the ones presented in Table 3.6, there is an added level of parallelism, which increases their value. Whereas for the sequential version of the algorithm the CPU both controls the modular exponentiation execution and performs the modular multiplications, for the GPU version the CPU controls the modular exponentiation execution and the GPU performs the modular multiplications. With this approach, the enhanced control features of the CPU are exploited at the same time as the enhanced parallel features of the GPU. Speedups greater than 1 are achieved for operand widths greater than or equal to 512 bits, and a maximum speedup of 2.2 is observed for 2048 bits.

Elliptic Curve Point Multiplication

The aforementioned EC point multiplication algorithm, which uses 3 SIMD multipliers, was implemented on the ODROID-XU+E board. It was tested for the P-192, P-224, and P-256 ECs [33], whose characteristic prime numbers have 192, 224, and 256 bits, respectively.


Execution time [×10^3 clock cycles]    P-192    P-224    P-256
Serial version                         7,340    9,445   12,368
1-core NEON                            5,718    7,577   10,152
Speedup                                  1.3      1.2      1.2
4-core NEON                            1,894    2,414    3,132
Speedup                                  3.9      3.9      3.9

Table 4.5: Performance of the EC point multiplication on the ARM processor

Figure 4.1: GUI Algorithm Stack

The obtained performance is reported in Table 4.5, and results from taking the average time over 256 multiplications.

The obtained speedups are in accordance with what was expected. Although three multiplications are performed in parallel by each core most of the time, other operations also need to take place, such as additions, comparisons and subtractions, and overhead is introduced by the need to interleave the operands of the multiplication and de-interleave the result, so that the operands fit the memory layout described in Section 3.2.3. These operations also increase the memory bandwidth usage, which limits the scalability of the algorithm.

4.4 System Application

In order to assess the applicability of both the modular exponentiation and the EC point multiplication algorithms, a tool was developed. This tool was embedded in two Graphical User Interfaces (GUIs), which allow the DSA and the ECDSA algorithms to be easily executed and experience with these cryptosystems to be evaluated and consolidated. In Figure 4.1, the algorithm stack of the tool is presented. When performing operations for the DSA cryptosystem, the 2^k-ary exponentiation method is used, which relies on the k-ary modular multiplication algorithm at its core. The k-ary algorithm, in turn, depends on the SIMD product-scanning multiplication algorithm. On the other hand, the ECDSA algorithm performs point multiplication using the 2^k-ary method, and exploits the SIMD Montgomery multiplication algorithm to perform point addition and doubling.

Firstly, DSA was implemented with the k-ary exponentiation method, for the ODROID-XU+E platform, using the wxWidgets [58] framework under Linux. Figure 4.2 depicts the final interface.

The creation of a signature implies the selection of the (L, N) key pair length from the combo box (see Figure 4.2). The application supports the (1024, 160), (2048, 224) and (3072, 256) lengths, which are standardised in [33]. Afterwards, the user must fill in the p, q, and g public parameter fields, which are L, N, and L bits wide, respectively. As described in Section 2.1, these parameters should fulfil the following requirements: p and q are primes such that p - 1 is a multiple of q, g^q ≡ 1 (mod p), and g^x ≢ 1 (mod p) for 0 < x < q.


Figure 4.2: DSA Form

Parameter   Value
p   0xa65feaab511c61e33df38fdddaf03b59b6f25e1fa4de57e5cf00ae478a855dda4f3638d38bb00ac4af7d8414c3fb36e04fbdf3d3166712d43b421bfa757e85694ad27c48f396d03c8bce8da58db5b82039f35dcf857235c2f1c73b2226a361429190dcb5b6cd0edfb0ff6933900b02cecc0ce69274d8dae7c694804318d6d6b9
q   0xb5afd2f93246b1efcd1f3a7c240c1e9e21a3630b
g   0x007bbd2c5dc917a5e08b9c2f80a49fb63fcd5c0578ba701e254fe3530dedd3b6680a6e5afb3280b53f154028bafff73d1ba0fdb0004b9eb0dbf24b295bf2a356913cd1c0be03c5103a1da8b73e7670b56d716ed5547af67b5061311eea245e2e5c337843cbc135b9b9c18775d5d56cfda31b747e2449861adf3b3f727189c0a3
x   0x2070b3223dba372fde1c0ffc7b2e3b498b260614
y   0x87c9b20aaef34afcbd6ffb5509e7cb3b43f8bec56ba74ad089d2ac2659b9fa8f895d51b59891f0a5afe8b2e11ae133ac16529ffc031eedf7834f6c1bce2604c4e5cc750df577d29c08f0a6e4f7e190d21b683fb6e08f4d9ea6ea1f03d7720cea0a97c03969118dea97d3efc30d0dcd80495cf2ea84eac1b44fb3d2b8e25e0bd8

Table 4.6: DSA Example Parameters

As an example, based on [59], the values of p, q and g depicted in Table 4.6 were selected.
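These conditions can be checked with a multi-precision library; a minimal sketch using GMP (an assumption made here for illustration, not the library used by the tool):

#include <gmp.h>

/* Returns non-zero when p, q and g satisfy the DSA parameter requirements. */
int dsa_params_ok(const mpz_t p, const mpz_t q, const mpz_t g)
{
    mpz_t t;
    int ok;

    mpz_init(t);
    mpz_sub_ui(t, p, 1);                       /* t = p - 1 */
    ok = mpz_probab_prime_p(p, 25) &&          /* p prime   */
         mpz_probab_prime_p(q, 25) &&          /* q prime   */
         mpz_divisible_p(t, q);                /* q | p - 1 */

    if (ok) {
        mpz_powm(t, g, q, p);                  /* g^q mod p must equal 1 */
        /* with q prime, g != 1 then guarantees g has order exactly q,
           i.e. g^x != 1 (mod p) for 0 < x < q */
        ok = (mpz_cmp_ui(t, 1) == 0) && (mpz_cmp_ui(g, 1) > 0);
    }
    mpz_clear(t);
    return ok;
}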

The message to be signed should be written in the Message text box, and the private key in the x text box. x should be randomly chosen such that 0 < x < q. Figure 4.3 depicts the generation of the signature of the ASCII-encoded "abc" message for the x private key presented in Table 4.6. After pressing the "sign" button in Figure 4.3, the message is hashed using the SHA-1 [34] algorithm, producing the following value: z = 0xa9993e364706816aba3e25717850c26c9cd0d89d.
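This hash can be reproduced with OpenSSL's SHA1 routine; a minimal sketch (assuming libcrypto is available, which the tool itself does not require):

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

/* Prints the SHA-1 digest of "abc"; link with -lcrypto. */
int main(void)
{
    const char *msg = "abc";
    unsigned char digest[SHA_DIGEST_LENGTH];
    int i;

    SHA1((const unsigned char *) msg, strlen(msg), digest);

    printf("z = 0x");
    for (i = 0; i < SHA_DIGEST_LENGTH; i++)
        printf("%02x", digest[i]);
    printf("\n");   /* expected: a9993e364706816aba3e25717850c26c9cd0d89d */
    return 0;
}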

Then, a random k value is generated such that 0 < k < q. This value should not be disclosed and should only be used for a single message, or else the private key might be compromised. The incorrect use of the value k has led to issues in the past: the implementation of the ECDSA used by Sony to sign the software used on the PlayStation 3 platform featured a constant value of k. Using this fact, it was possible to compute the value of the private key from the values of two signatures [60]. The randomly generated k value for the tool example in Figure 4.3 is 0x680bbdc87647f3c382902d2f58d2754b39bca874.


Figure 4.3: DSA Signature Generation

The computation of the signature (r, s) takes place afterwards. r is computed by first determining the value of g^k mod p:

g^k mod p = 0x374fec828ee7123e6383b122cb44a92ae23e6055b8c5f2509c05cc5770242259659e74a0a5630180748a9781c6053ec91f5d9d46d3187c934b45254414f5979f7a16fa0440bd51707341a0f34c40cb699a0f0b7ade12a971a6c1654b90be863bc7de06548c068358aee909734161ac15ee5772186d4e23b35980dddbb6f407da

And then reducing it modulo q:

r = (g^k mod p) mod q = 0x4ce4ba02c819bccafcc2ea2155940504033a04f6

The value of s can then be computed using the expression s = (z + xr) · k^-1 mod q:

s = 0x0cdc88e99a1bef07d6e70c59d2babbc00788d8df
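For illustration, the whole (r, s) computation can be sketched with GMP (an assumption made here; the tool relies on the thesis's own multi-precision routines):

#include <gmp.h>

/* DSA signing: r = (g^k mod p) mod q, s = (z + x*r) * k^-1 mod q. */
void dsa_sign(mpz_t r, mpz_t s, const mpz_t p, const mpz_t q, const mpz_t g,
              const mpz_t x, const mpz_t z, const mpz_t k)
{
    mpz_t kinv;
    mpz_init(kinv);

    mpz_powm(r, g, k, p);        /* r = g^k mod p           */
    mpz_mod(r, r, q);            /* r = (g^k mod p) mod q   */

    mpz_invert(kinv, k, q);      /* kinv = k^-1 mod q       */
    mpz_mul(s, x, r);            /* s = x*r                 */
    mpz_add(s, s, z);            /* s = z + x*r             */
    mpz_mul(s, s, kinv);         /* s = (z + x*r) * k^-1    */
    mpz_mod(s, s, q);            /* s = ... mod q           */

    mpz_clear(kinv);
}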

The signature verification is performed by first selecting the key length, and then inputting the public parameters, the message and the signature. The value of the public key corresponds to y = g^x mod p, and takes the value presented in Table 4.6.

Subsequently, the message is hashed, producing the same result z as previously. The value of v, which corresponds to a reconstruction of r, is computed based on the value of y and on the fact that g^k ≡ g^((z+xr)·s^-1) ≡ g^u1 · y^u2 (mod p), where u1 = z · s^-1 mod q and u2 = r · s^-1 mod q, using the formula:

v = (g^u1 · y^u2 mod p) mod q
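A matching verification sketch, again written with GMP purely for illustration:

#include <gmp.h>

/* Returns non-zero when the signature (r, s) over the hash z verifies. */
int dsa_verify(const mpz_t p, const mpz_t q, const mpz_t g, const mpz_t y,
               const mpz_t z, const mpz_t r, const mpz_t s)
{
    mpz_t w, u1, u2, v, t;
    int ok;

    mpz_inits(w, u1, u2, v, t, NULL);
    mpz_invert(w, s, q);                      /* w = s^-1 mod q        */
    mpz_mul(u1, z, w);  mpz_mod(u1, u1, q);   /* u1 = z * s^-1 mod q   */
    mpz_mul(u2, r, w);  mpz_mod(u2, u2, q);   /* u2 = r * s^-1 mod q   */

    mpz_powm(v, g, u1, p);                    /* v = g^u1 mod p        */
    mpz_powm(t, y, u2, p);                    /* t = y^u2 mod p        */
    mpz_mul(v, v, t);
    mpz_mod(v, v, p);                         /* v = g^u1 * y^u2 mod p */
    mpz_mod(v, v, q);                         /* v = (...) mod q       */

    ok = (mpz_cmp(v, r) == 0);                /* accept when v == r    */
    mpz_clears(w, u1, u2, v, t, NULL);
    return ok;
}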

In Figure 4.4, the verification process was successful, and the message "Signature Ok" was produced.

If the message or the signature were tampered with, the tool would produce the message "Signature Rejected". In the case of Figure 4.5, the ASCII-encoded "abcd" message produces the following hash: z = 0x81fe8bfe87576c3ecb22426f8e57847382917acf. The value of v that derives from it, v = 0x6ce740db0cb120f903f61b5b179f41d0c40055c9, does not match the value of r, and therefore the signature is invalid.

The second developed GUI refers to the ECDSA. This application employs one of the three ECs nist192, nist224 or nist256, described in [33] and selected by the user, in order to produce signatures.

In the example of Figure 4.6, curve nist192 is selected. This curve is characterised by the a and b values defined in (4.3) and by the p value that characterises the prime field, which are presented in Table 4.7. The standard g base point, from which most computations are derived, and the corresponding q value, which satisfies [q]g = O and [x]g ≠ O for 0 < x < q, are presented in the same table. Affine coordinates were used to represent g.


Figure 4.4: DSA Signature Verification

Figure 4.5: DSA Signature Rejection


Parameter   Value
p   0xfffffffffffffffffffffffffffffffeffffffffffffffff
a   0xfffffffffffffffffffffffffffffffefffffffffffffffc
b   0x64210519e59c80e70fa7e9ab72243049feb8deecc146b9b1
g   (0x188da80eb03090f67cbf20eb43a18800f4ff0afd82ff1012; 0x07192b95ffc8da78631011ed6b24cdd573f977a11e794811)
q   0xffffffffffffffffffffffff99def836146bc9b1b4d22831
x   0x7891686032fd8057f636b44b1f47cce564d2509923a7465b
y   (0xfba2aac647884b504eb8cd5a0a1287babcc62163f606a9a2; 0xdae6d4cc05ef4f27d79ee38b71c9c8ef4865d98850d84aa5)

Table 4.7: ECDSA Example Parameters

Figure 4.6: ECDSA Signature Generation

Furthermore, for the computation of the "abc" message signature, the x private key presented in the same table was used.

The value of k was randomly generated, such that 0 < k < q, and was equal to 0x14b9c0e1680bbdc87647f3c382902d2f58d2754b39bca874.

The computation of the r-value of the signature involves the computation of [k]g, in a fashion similar to that of the computation of g^k mod p in DSA. As such, this computation is performed using the SIMD 2^k-ary EC point multiplication algorithm. The resulting point of that operation, in affine coordinates, is:

[k]g = (0xc1b73ff1c5434f0e368dcc4bdad9fcc03a4d07ebaf475125; 0x7f3d140d26180a3dd33ed87bb5bf39481028421f4cc29033)

r takes the value of the x-coordinate of that point, reduced modulo q: 0xc1b73ff1c5434f0e368dcc4bdad9fcc03a4d07ebaf475125.

Finally, the hash z of the "abc" message is computed using SHA-1, as in the previous example, and s is calculated as s = (z + xr) · k^-1 mod q: s = 0xf98f4286d87cfdaab67ca4fdae522b4619fec66a26c01ce.

The verification of the signature is performed by selecting the same curve, inserting the value of the signature pair and the y public key. The public key is computed by taking y = [x]g, and results in the point presented in Table 4.7.

Finally, the value of v is computed by reducing the x-coordinate of the result of the operation [u1]g + [u2]y modulo q, where u1 = z · s^-1 mod q and u2 = r · s^-1 mod q, analogously to the procedure of DSA.
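For illustration, this step can be sketched with OpenSSL's EC API, whose EC_POINT_mul computes [u1]g + [u2]y in a single call; this is an assumption made here for clarity and not the thesis's SIMD implementation:

#include <openssl/ec.h>
#include <openssl/bn.h>
#include <openssl/obj_mac.h>

/* v = x-coordinate of [u1]g + [u2]y, reduced modulo q (P-192; link with -lcrypto). */
int ecdsa_v(BIGNUM *v, const BIGNUM *u1, const BIGNUM *u2,
            const EC_POINT *y, const BIGNUM *q)
{
    int ok = 0;
    BN_CTX *ctx = BN_CTX_new();
    EC_GROUP *group = EC_GROUP_new_by_curve_name(NID_X9_62_prime192v1);
    EC_POINT *sum = EC_POINT_new(group);
    BIGNUM *xcoord = BN_new();

    /* sum = [u1]g + [u2]y, where g is the group generator */
    if (EC_POINT_mul(group, sum, u1, y, u2, ctx) &&
        EC_POINT_get_affine_coordinates(group, sum, xcoord, NULL, ctx) &&
        BN_mod(v, xcoord, q, ctx))
        ok = 1;

    BN_free(xcoord);
    EC_POINT_free(sum);
    EC_GROUP_free(group);
    BN_CTX_free(ctx);
    return ok;
}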


Figure 4.7: ECDSA Signature Verification

If the value of v matches the value of r, as in the example of Figure 4.7, the message "Signature Ok" is produced; otherwise, the message "Signature Rejected" is displayed.

4.5 Summary

This chapter discussed the implementation of cryptosystems based on modular exponentiation and on EC point multiplication. Firstly, an algorithm was presented that can be applied to implement both types of cryptosystems. Afterwards, it was discussed how the simultaneous execution of multiple modular exponentiations may be advantageous for the RSA cryptosystem, namely by providing a way for a server to identify itself to multiple users or to decode multiple messages with the same public-key exponent. The ciphering process, with the private key, may also be split into multiple computations, and it was shown that multi-prime RSA further develops this concept. It was also shown how SIMD parallelism may be exploited to enhance EC point multiplication.

The use of the SIMD FIOS2 method for performing modular exponentiation was satisfactory, producing speedups of up to 1.5 and 7.2 for the single- and quad-core versions of the algorithm, respectively, running on the ARM A15 platform. It should be noted that the control dependencies and the need to move data between buffers, introduced with the exponentiation algorithm, limited the achieved speedup. The k-ary modular multiplication method was also exploited to perform modular exponentiation, using 3 threads. For small operands the management of threads outweighed the computational complexity of the arithmetic operations; for larger operands it produced speedups of up to 2.2 on the same platform. It can be concluded that this method is most beneficial when a single exponentiation is to be performed, especially for large operands. Additionally, the RNS-based modular multiplication algorithm was exploited to perform modular exponentiation, and enabled the use of a second level of parallelism: whereas the CPU controlled the exponentiation execution, the GPU performed the modular multiplications. Speedups of up to 2.2 were obtained for this method on the SYS6440 device.

The EC point multiplication algorithm was also tested. It used 3 parallel SIMD multipliers and produced speedups of up to 1.3 on a single A15 core. Even though 3 multipliers were used simultaneously most of the time, overhead was introduced by the need to interleave the operands and de-interleave the result for each modular multiplication. Furthermore, other operations such as additions and subtractions, which were not parallelised, also took place. When multithreading was employed over the 4 A15 cores to increase the throughput of the operations, speedups of 3.9 were obtained.


Finally, the feasibility of the aforementioned approaches was confirmed by developing a tool that creates and verifies digital signatures for the DSA and ECDSA cryptosystems.


5 Comparative Evaluation of General Purpose and Embedded Processors

Contents
5.1 Montgomery Multiplication
    5.1.1 RSA Exponentiation
5.2 Summary


Platform             ODROID-XU+E
Processor            Exynos5 Octa: Cortex(TM)-A15 quad-core      Intel(R) Core(TM) i7 4770K
                     and Cortex(TM)-A7 quad-core
Frequency [MHz]      1600 (A15)                                  3500
Main Memory [MB]     1,747                                       31,925
SIMD Extensions      NEON (128b)                                 AVX2 (256b), SSE4.1 (128b)
Operating System     Linux 3.4.84 armv7l                         SUSE Linux 3.7.10-1.16-desktop
Compiler             gcc 4.8.1                                   icc 13.1.3
Compilation Flags    -O3 -mfpu=neon -fopenmp                     -O3 -ipo -xhost -fopenmp

Table 5.1: Experimental Setups

In this chapter, the relative efficiency of SIMD and multithreading parallelism is evaluated on both embedded and general purpose processors. With ARM processor architectures dominating the market of embedded systems, while x86 processors are commonly used in servers, it is of practical interest to compare how well SIMD extensions and multi-core architectures enhance cryptographic operations on both types of systems. Generally, whereas embedded processors aim to provide low-power, area-constrained platforms with more modest computational resources, general purpose processors employ advanced features to extract more parallelism from applications and execute instructions as quickly as possible, attaining greater performance.

The dichotomy between power-consumption optimisation and high computational performance has led to technologies such as ARM's big.LITTLE, which is featured in the ODROID-XU+E platform. This technology deploys two sets of cores: while the A15 quad-core targets high performance, the A7 quad-core provides modest performance with lower power consumption, at the expense of greater circuit area. As such, when performing computationally demanding operations, like cryptographic operations, or real-time applications, the A15 cores are active and are able to execute them more quickly than the A7 cores. During the rest of the time the A7 cores are active, increasing battery life.

The same reasoning can be applied to web servers. The employment of a big.LITTLE-like architecture on server platforms enables the use of high-performance processors for heavier workloads, and the use of power-optimised processors when server usage is low. This allows not only the power consumption to be reduced, but also the need to cool the devices. As such, embedded devices, which apply similar techniques, are threatening x86 server dominance, and there are current projects to design high-performance computing systems based on ARM embedded processors [61].

Herein, the execution times of the two platforms are not directly compared; instead, the focus is on how well SIMD and multithreading parallelism accelerate Montgomery multiplication and RSA modular exponentiation on each of them. With these tests, it is possible to assess which features of the two platforms are the most beneficial for cryptography.

The Intel(R) Core(TM) i7 4770K processor is used as an archetype for the general purpose platforms. The characteristics of this platform are presented in Table 5.1, as well as those of the ODROID-XU+E, for comparison. It should be noted that even though it features 4 physical cores, they are hyper-threaded [62] and therefore correspond to 8 logical cores.
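This can be confirmed at run time; a minimal check (compiled with -fopenmp) is sketched below, where on the hyper-threaded i7 both queries report 8 logical processors:

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main(void)
{
    /* logical processors visible to the OS and to the OpenMP runtime */
    printf("logical processors: %ld\n", sysconf(_SC_NPROCESSORS_ONLN));
    printf("OpenMP max threads: %d\n", omp_get_max_threads());
    return 0;
}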


Operation                           NEON     SSE4.1      AVX2
Load                                vld1     movntdqa    vmovntdqa
Readjustment                        vzip     pshufd      vpshufd
Addition                            vadd     paddq       vpaddq
Multiplication                      vmul     pmuludq     vpmuludq
Multiplication and Accumulation     vmlal    NA          NA
Store                               vst2     movdqa      vmovdqa

Table 5.2: SIMD instructions adopted for the ARM and the Intel processors

5.1 Montgomery Multiplication

The sequential and SIMD Montgomery multiplication algorithms, described in Chapter 3, were implemented for the Intel processor. The developed code was based on the one described in [12]. It can be translated from the NEON code depicted in Section 3.3.2 using the instructions presented in Table 5.2.

For illustrative purposes, the code for the AVX2 version of vector multiply and accumulate is shown in Algorithm 5.1, and the implementation of the FIOS2 main loop in Algorithm 5.2. Therein, variables ymm0, ymm1, ..., ymm6 represent 256-bit registers. Firstly, in Algorithm 5.1, ymm4 and ymm5, which hold the carries, are cleared using the xor operation. Afterwards, inside the j-loop, variables va[j] and vb are loaded into the ymm0 and ymm1 registers. The contents of the lanes are rearranged and spread among ymm0, ymm1, ymm2 and ymm3, so that the 32 most significant bits of each 64-bit lane are equal to zero. The words of the operands are multiplied, added to t[i+j], and the carries are accumulated. Then, a shuffle operation is used so that the carries of the partial result are stored in the 32 least significant bits of ymm1 and ymm3. This allows the carries to be extracted using a logic and operation. The registers are then re-shuffled, and the unpackhi instruction rearranges the data, preparing it to be stored.

The FIOS2 multiplication and reduction loop is implemented in Algorithm 5.2 by first computing the value of vm, which corresponds to the value that should be multiplied by the modulus and added to the product in order to clear its 32 least significant bits. This value is computed by performing arithmetic modulo 2^32, using the add and mullo instructions which, when applied to 32-bit lanes, compute the 32 least significant bits of the result. Afterwards, multiplication and reduction take place by calling the vmac routine to compute

vt <- vt + va × vb[i] × 2^(wi),

performing carry propagation with the mp_addition_c function, computing

vt <- vt + vm × vn × 2^(wi)

(where vn corresponds to the modulus), and finally performing carry propagation again.

The resulting code for the FIOS2, FIOS and SOS methods was compiled using icc, with the -O3, -ipo and -xhost flags, and the resulting applications were tested and timed using the rdtsc instruction. The average execution times, taken over 4096 multiplications, are presented in Table 5.3. The speedups for the ARM processor were replicated in Figure 5.1 for comparison with the SSE4.1 and AVX2 technologies.
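The cycle-count measurement can be sketched as follows (illustrative only; details such as serialising the counter reads are omitted):

#include <x86intrin.h>

/* Average number of cycles per call of op() over `runs` executions,
   using the processor's time-stamp counter. */
unsigned long long average_cycles(void (*op)(void), int runs)
{
    unsigned long long start, end;
    int i;

    start = __rdtsc();
    for (i = 0; i < runs; i++)
        op();
    end = __rdtsc();

    return (end - start) / runs;
}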

The results demonstrate the prominent performance of the SOS and FIOS2 algorithms, which outperform the FIOS method. The poor performance of FIOS is related to the memory system, and is aggravated when SIMD parallelism is exploited, for both systems. As stated previously, this method requires a high number of memory accesses, due to the repeated need to perform carry propagation inside the inner loop.


Algorithm 5.1: AVX2 Vector Multiply and Accumulate

/* vt = vt + va*vb; ymm4, ymm5 hold carries on exit */
void vmac(int (*vt)[SIMD_SIZE], int (*va)[SIMD_SIZE], int vb[SIMD_SIZE], int s)
{
    int j;
    /* c = 0 */
    ymm4 = _mm256_xor_si256(ymm4, ymm4);
    ymm5 = _mm256_xor_si256(ymm5, ymm5);

    for (j = 0; j < s; ++j) {
        /* a[j]*b[i] */
        ymm0 = _mm256_stream_load_si256((__m256i *) va[j]);
        ymm1 = _mm256_stream_load_si256((__m256i *) vb);
        ymm2 = _mm256_shuffle_epi32(ymm0, 0xB1);
        ymm3 = _mm256_shuffle_epi32(ymm1, 0xB1);

        ymm0 = _mm256_mul_epu32(ymm0, ymm1);
        ymm2 = _mm256_mul_epu32(ymm2, ymm3);

        /* t[i+j] + a[i]*b[j] + c */
        ymm1 = _mm256_stream_load_si256((__m256i *) vt[j]);
        ymm3 = _mm256_shuffle_epi32(ymm1, 0xB1);

        ymm1 = _mm256_and_si256(ymm1, ymm6);
        ymm3 = _mm256_and_si256(ymm3, ymm6);

        ymm1 = _mm256_add_epi64(ymm1, ymm0);
        ymm1 = _mm256_add_epi64(ymm1, ymm4);

        ymm3 = _mm256_add_epi64(ymm3, ymm2);
        ymm3 = _mm256_add_epi64(ymm3, ymm5);

        ymm1 = _mm256_shuffle_epi32(ymm1, 0xB1);
        ymm3 = _mm256_shuffle_epi32(ymm3, 0xB1);

        /* c */
        ymm4 = _mm256_and_si256(ymm1, ymm6);
        ymm5 = _mm256_and_si256(ymm3, ymm6);

        /* t[i+j] = s */
        ymm1 = _mm256_shuffle_epi32(ymm1, 0xD8);
        ymm3 = _mm256_shuffle_epi32(ymm3, 0xD8);

        ymm1 = _mm256_unpackhi_epi32(ymm1, ymm3);

        _mm256_store_si256((__m256i *) vt[j], ymm1);
    }
}


Algorithm 5.2: AVX2 FIOS2 Main Loop

for (i = 0; i < s; ++i) {
    /* vm <- (va[0]*vb[i] + vt[i]) * vnlinha[0] mod 2^w */
    ymm0 = _mm256_stream_load_si256((__m256i *) vt[i]);
    ymm1 = _mm256_stream_load_si256((__m256i *) va[0]);
    ymm2 = _mm256_stream_load_si256((__m256i *) vb[i]);
    ymm3 = _mm256_stream_load_si256((__m256i *) vnlinha[0]);
    ymm1 = _mm256_mullo_epi32(ymm1, ymm2);
    ymm0 = _mm256_add_epi32(ymm0, ymm1);
    ymm0 = _mm256_mullo_epi32(ymm0, ymm3);

    _mm256_store_si256((__m256i *) vm[0], ymm0);

    /* vt <- vt + va*vb[i] * 2^(wi) */
    vmac((int (*)[SIMD_SIZE]) vt[i], va, vb[i], s);
    /* ADD(t, i+s, i+s+2) */
    mp_addition_c(vt, i+s, i+s+2);

    /* vt <- vt + vm*vn * 2^(wi) */
    vmac((int (*)[SIMD_SIZE]) vt[i], vn, vm[0], s);
    /* ADD(t, i+s, i+s+2) */
    mp_addition_c(vt, i+s, i+s+2);
}

Number of bits                            256      512     1024      2048       4096
Execution Time for the serial versions [clock cycles]
SOS                                     1,530    3,362    8,172    27,492    101,168
FIOS                                    2,547    5,041   18,243    67,801    273,958
FIOS2                                     675    2,026    6,920    26,934    102,130
Execution Time for the 1-core SIMD parallel versions [clock cycles]
SSE4.1 SOS                                408    1,282    4,521    17,339     68,104
SSE4.1 FIOS                               779    4,206   27,764   203,566    153,683
SSE4.1 FIOS2                              347    1,077    3,820    14,204     54,567
AVX2 SOS                                  378    1,071    3,623    13,058     49,757
AVX2 FIOS                                 565    2,640   16,233   114,294    836,364
AVX2 FIOS2                                333      998    3,277    11,457     43,892
Speedup
SOS/1-core SSE4.1 FIOS2                   4.4      3.1      2.1       1.9        1.9
SOS/1-core AVX2 FIOS2                     4.6      3.4      2.5       2.4        2.3
Execution Time for the 4-core SIMD parallel versions [clock cycles]
4-core SSE4.1 FIOS2                       188      451    1,418     5,103     20,217
4-core AVX2 FIOS2                         123      265      777     2,749     10,482
Speedup
SOS/4-core SSE4.1 FIOS2                   8.1      7.4      5.8       5.4        5.0
SOS/4-core AVX2 FIOS2                    12.4     12.7     10.5      10.0        9.7

Table 5.3: Multi-Precision SSE4.1 and AVX2 Montgomery Multiplication Performance


Figure 5.1: Relative performance comparison for the execution of Montgomery multiplication using NEON, SSE4.1 and AVX2 technologies. (a) Single-core versions; (b) quad-core versions. [Plots of speedup versus operand bit-length, from 256 to 4096 bits.]

The second version of the FIOS SIMD method was the fastest method for both the Intel and the ARM processors, and is therefore expected to fit most SIMD technologies effectively. It presented a maximum speedup of 4.6 for the single-core AVX2 version.

Furthermore, by comparing the speedups attained for both processors, it is noticeable that the NEON extensions for the ARM architecture are able to produce a similar improvement in performance for smaller operands when compared to SSE4.1 and AVX2. This is due to the availability of the vmlal instruction, which fuses integer multiplication and addition in a single instruction, especially benefits Montgomery multiplication, and has no equivalent in the latter technologies. The same does not hold for larger operands, presumably due to the different memory systems, which, in the case of the ARM processor, hinder the obtained speedup.

It should be noted that whereas the SSE4.1 and NEON technologies process 128 bits at a time, AVX2 processes 256 bits, and is therefore capable of producing greater speedups. The 1-core AVX2 version of Montgomery multiplication was profiled using gprof, which reported that over 90% of the computation was related to the vector multiply and accumulate operation (corresponding to the inner loops in Figure 3.5(c)). This corresponds to useful computation, which allows one to infer that the poor scalability of the system may be related to the memory system.

Concerning memory usage in the multithreaded Montgomery multiplication, since data is loaded to the cache in blocks of 512 bits on the Intel processor, similarly to what happens on the ARM processor, a better cache efficiency is achieved as the operand width increases. Furthermore, since AVX2 processes 256 bits at once, more cache data is reused, when compared to both NEON and SSE4.1, and therefore multithreading parallelism is able to produce a better relative speedup.

The Intel processor employs the hyper-threading technology, which allows the simultaneous execution of 8 threads on 4 cores. Even though each pair of threads shares execution resources, such as the SIMD engine, this allows the pipeline stages to be better filled, which results in a better performance enhancement when compared to the A15 processor.

5.1.1 RSA Exponentiation

The algorithm for modular exponentiation introduces a large number of conditional instructions, which should hinder SIMD performance, presumably due to mispredicted branches.


Number of bits                 256         512        1024        2048         4096
Execution Time for the serial version [clock cycles]
Serial                     344,992   1,702,372  11,420,907  86,510,112  666,074,144
Execution Time for the 1-core SIMD parallel versions [clock cycles]
SSE4.1                     212,233   1,184,521   8,088,449  62,261,412  464,262,199
AVX2                       119,968     697,986   4,209,286  33,138,687  247,722,523
Speedup
Serial/1-core SSE4.1           1.6         1.4         1.4         1.4          1.4
Serial/1-core AVX2             2.9         2.4         2.7         2.6          2.7
Execution Time for the 4-core SIMD parallel versions [clock cycles]
SSE4.1                      62,504     316,923   2,028,522  14,898,854  112,599,638
AVX2                        33,068     171,194   1,146,998   8,195,936   57,935,720
Speedup
Serial/4-core SSE4.1           5.5         5.4         5.6         5.8          5.9
Serial/4-core AVX2            10.4         9.9        10.0        10.6         10.5

Table 5.4: Execution time [clock cycles] obtained from the execution of the modular exponentiation algorithm on the Intel processor

In fact, as SIMD instructions generally have larger latencies than scalar ones, there is a greater penalty for mispredicted branches, since it takes longer to fill the pipeline stages. For instance, whereas the scalar mulq instruction takes 3 cycles to execute [63], its AVX2 counterpart, vpmuludq, takes 5 [51] on the Intel i7 Haswell architecture. In [64], several ARM Cortex-A7 instructions were timed, and whereas the scalar smlal instruction took 3 cycles, the NEON vmlal took 4. From the experimental results obtained for the Intel processor (Table 5.4), this effect is more significant for small operands, which have lower computational complexity, producing smaller increases in performance when compared to the Montgomery multiplication algorithm alone. This effect is also noticeable on the ARM processor. The speedup results obtained for the modular exponentiation algorithm on this latter processor have been replicated in Figure 5.2, for comparison.

The results clearly show that wider SIMD engines, namely the AVX2 technology, are very effective at increasing the throughput of cryptographic operations. As more operands are processed in parallel, there is greater data reuse, and therefore better cache efficiency is achieved. This is particularly effective when multithreading is employed.

5.2 Summary

In this chapter, the relative performance of SIMD engines on embedded and general-purpose devices was assessed. Initially, it was discussed how embedded technology, which favours low energy consumption and smaller circuit areas, is currently competing for the server market.

The Intel i7 4770K processor was chosen as an archetype for the general-purpose devices, and several multi-precision Montgomery multiplication algorithms were implemented for the SSE4.1 and AVX2 technologies. Whereas SSE4.1 processes 128 bits at a time, similarly to what happens with the NEON technology on the A15 processor, AVX2 processes 256 bits.

It was shown that the NEON technology compared favourably to SSE4.1. This was attributed to the vmlal instruction, which especially benefits Montgomery multiplication, by performing integer multiplication and addition in a single instruction, and has no equivalent in SSE4.1. The AVX2 technology outperformed the aforementioned SIMD technologies by processing more operands simultaneously, but its performance was limited by the memory system. Despite this, it achieved a better cache efficiency than SSE4.1 and NEON, which was most valuable when multithreading parallelism was exploited.


Figure 5.2: Relative performance comparison for the execution of modular exponentiation using NEON, SSE4.1 and AVX2 technologies. (a) Single-core versions; (b) quad-core versions. [Plots of speedup versus operand bit-length, from 256 to 4096 bits.]

When the modular exponentiation algorithm was introduced, the additional control dependencies decreased the performance enhancement of SIMD parallelism. Nevertheless, SIMD parallelism was still effective for both systems. In particular, the results showed that wider SIMD engines, namely the AVX2 technology, are very effective at increasing the throughput of cryptographic operations.


6 Conclusion

Contents
6.1 Summary and Overall Conclusions
6.2 Future Work


6.1 Summary and Overall Conclusions

The main objectives of this thesis were the enhancement of modular arithmetic through the exploitation of SIMD and multithreading parallelism, namely on embedded systems, and the experimental evaluation of the benefits of using this type of parallelism for cryptography. Algorithms were developed and thoroughly tested for public-key cryptosystems, which are very computationally demanding. Moreover, a tool was developed and tested for the generation and verification of message signatures, in order to confirm the feasibility of the proposed approaches.

Three multi-precision multiplication algorithms were analysed in order to find ways to enhance their execution using SIMD parallelism. The product-scanning and operand-caching methods were more prone to parallelisation; whereas the latter requires fewer memory accesses, the former requires less reorganisation of data after it has been loaded into registers. The SIMD versions of these two methods were developed and experimentally evaluated against the implementation of the sequential operand-scanning method, which is the commonest approach to perform multi-precision multiplication. The SIMD product-scanning method executed up to 1.23 times faster on the ARM A15 CPU than the sequential operand-scanning method, outperforming the operand-caching method, which shows that it was preferable to keep the operands in lower-level caches and accept a higher number of loads, so that fewer data rearrangements were needed.

Afterwards, several methods to compute Montgomery modular multiplication, mostly used in cryptography, were considered in order to develop parallel algorithms. These algorithms exploit SIMD extensions to perform several modular multiplications in parallel. Whereas the SOS method performs multiplication and reduction in two separate loops, each with another inner loop, the FIOS method implements modular multiplication in a single loop, featuring another inner loop. FIOS2 is a variant of the FIOS method where the inner loop is split into two. Their implementation involved an unorthodox data memory storage. Additionally, it was noticed that carry propagation could be irksome when using SIMD extensions, as no efficient way was found to test at each iteration whether the produced carry was zero. This led to great differences in performance, and the FIOS method, which featured a large amount of carry propagations, was greatly outperformed by the SOS and FIOS2 methods. Of these, the latter was the most efficient, with a speedup of up to 4.3 for the 1-thread version of the algorithm running on the A15 CPU. When multithreading parallelism was employed, the throughput increased dramatically, but it was limited by contention in accessing the higher levels of cache up to the main memory. However, as the operand sizes increased, since there was a greater reuse of data in the cores' caches, this effect was mitigated, and the relative speedups increased. A maximum speedup of 9.3 was achieved for the SIMD FIOS2 method implemented on a 4-core ARM A15 system.

Furthermore, ways to extract parallelism from a single modular multiplication were investigated, and the k-ary method was used to develop a multithreaded Montgomery multiplication algorithm. This approach was modified in order to use a second level of parallelism, and the SIMD product-scanning method was employed to compute partial results. The cost of exploiting multithreading parallelism, which involves the creation, synchronisation and destruction of threads, is much larger than that of SIMD parallelism, and, for smaller operands, outweighed the arithmetic complexity. Nevertheless, as the operand width increased, this method proved to be effective, producing speedups of almost 2 for the multiplication of two 4096-bit operands, when 3 A15 cores were used.

The implementation of the parallel Montgomery multiplication algorithm, which exploits SIMD parallelism, on the PowerVR Series5XT SGX544 MP3 GPU was ineffective, producing speedups lower than 1. This platform has 3 shader cores, and each core can host up to 16 instruction streams that execute independently on 4 ALUs. The ineffectiveness was due to the fact that the code featured many branch instructions and was very memory intensive. Notwithstanding, its use implies that the CPU is free to perform other tasks, and it may therefore still be beneficial.


On the other hand, the Adreno 320 GPU, which supports synchronisation between threads, enabled the implementation of an RNS-based modular multiplication algorithm.

It was shown how to obtain efficient code for this parallel algorithm, which features few divergences and therefore suits GPU architectures. The code made use of the Kawamura approximation algorithm [25], and it was shown how this approach may be modified to support wider operands on this platform. Speedups of up to 2.1 were obtained when compared to the sequential execution on the Krait 300 ARM A15-based CPU.

The aforementioned modular multiplication methods were then used in this thesis to implement RSA, DSA and ECC core operations, namely modular exponentiation and EC point multiplication. Firstly, the 2^k-ary method was presented as a generic method to implement both of these operations. When k takes the value of a small positive integer, this method presents low memory usage and medium performance, and therefore nicely fits embedded devices. When the value of k is increased, performance also increases at the expense of more memory consumption.

Several approaches were presented through which the exploitation of parallel cryptographic operations may be useful, and some examples follow. The use of multi-prime RSA allows the deciphering of a message to take place over several moduli, and the result may afterwards be reconstructed using the CRT. A server may also wish to identify itself to several users, and may employ multiple modular exponentiations or EC point multiplications for that effect. It may also employ SIMD parallelism to decipher multiple RSA messages, since it is common to set e = 1 + 2^16 as the public-key value, due to its low Hamming weight. When employing multithreading parallelism, each core may also produce or verify different ECDSA signatures.

The parallel execution of multiple RSA operations on the A15 platform produced speedups which followed a tendency similar to those of SIMD Montgomery multiplication. However, the additional conditional branches led to a decrease in the performance enhancement, due to mispredicted branches. Whereas speedups of up to 1.5 were obtained for the execution on a single core, a maximum speedup of 7.2 was obtained for the quad-core version. On the other hand, the modular exponentiation algorithm supported on the k-ary modular multiplication was able to produce speedups of up to 2.2. These were all tested on a quad-core A15 system. It can be concluded that whereas the k-ary method is most effective when a single exponentiation is to be performed, especially for large operands, SIMD Montgomery multiplication attains a better performance when multiple exponentiations are required.

When RNS-based modular multiplication was applied to exponentiation, the execution was split between the CPU and the GPU. The CPU controls the exponentiation execution and enqueues modular multiplications on the GPU. This way, the enhanced control features of the CPU are exploited at the same time as the GPU's parallel capabilities. The approach resulted in speedups of up to 2.2, when the code was executed on the Krait 300 CPU and Adreno 320 GPU, in comparison with the sequential version computed on the CPU.

For the ECC, it was shown that the use of projective coordinates allows cumbersome operations (modular inversions) to be avoided. A variant of projective coordinates was presented, Jacobian coordinates, which allows SIMD parallelism to be exploited more effectively, whilst still avoiding the computation of modular inverses. The scheduling used for point multiplication applied 3 parallel multipliers at the expense of interleaving the operands, so that the SIMD algorithms could be employed, and afterwards de-interleaving the result. The developed algorithm was tested for three standard curves, P-192, P-224 and P-256, and speedups of up to 1.3 were obtained for the single-core version, while a maximum speedup of 3.9 was obtained for the quad-core implementation on the same platform.

Two GUIs were developed in this thesis to facilitate the use of DSA and ECDSA, in order to generate and verify digital signatures. They are both supported on the wxWidgets framework. When performing DSA operations, the 2^k-ary exponentiation method is used, exploiting the k-ary modular multiplication algorithm for the execution of modular exponentiation.


The k-ary method, in turn, uses the SIMD product-scanning method to perform multi-precision multiplications. Contrastingly, the ECDSA operations are underpinned by the 2^k-ary EC point multiplication algorithm, which uses the SIMD point addition and doubling algorithms at its core to perform point multiplication.

In the scope of this thesis, the SIMD and multithreading performance enhancement for embedded and general purpose systems was compared, and the quad-core A15 platform and the quad-core i7 processor were used as archetypes for each type of device. Whereas the vmlal instruction, which performs integer multiplication and addition, favours the execution of the Montgomery multiplication algorithm on the ARM processor, the memory system limited the parallelism effectiveness. The Intel processor lacks an instruction similar to vmlal, but has a wider SIMD technology, which improves the algorithm throughput, and deploys the hyper-threading technology, which enhances multithreading parallelism.

6.2 Future Work

The field of cryptography has become increasingly important as the market for mobile embedded devices has grown fiercely. Research in this field is more important than ever, as the need to provide secure communications increases on devices with limited computational resources and a limited power budget. Given this last fact, power consumption might inspire some of the future work, in order to test the performance-power ratio of the multiple approaches to modular arithmetic.

From this thesis, some issues related to embedded system architectures suggest possible improvements. The introduction of SIMD instructions which test whether all bits of a vector are set to zero would enhance not only the parallel execution of multiple cryptographic operations, but also applications which require the use of multi-precision arithmetic, by accelerating the process of carry propagation. Moreover, if the SIMD engine provided a single-precision modular reduction instruction, the RNS system could easily be deployed on CPUs, enhancing multi-precision modular multiplication through the use of the CRT and a Montgomery-like representation.


References

[1] Nick Sullivan. A (relatively easy to understand) primer on elliptic curve cryptography. http://arstechnica.com/security/2013/10/a-relatively-easy-to-understand-primer-on-elliptic-curve-cryptography/2/, October 2013.

[2] ARM. Neon. http://www.arm.com/products/processors/technologies/neon.php, 2014.

[3] W.W.L. Fung, I. Sham, G. Yuan, and T.M. Aamodt. Dynamic warp formation: Efficient MIMD control flow on SIMD graphics hardware. ACM Transactions on Architecture and Code Optimization (TACO), 6(2):1–37, 2009.

[4] Michael Hutter and Erich Wenger. Fast multi-precision multiplication for public-key cryptography on embedded microprocessors. In Cryptographic Hardware and Embedded Systems - CHES 2011, 13th International Workshop, Nara, Japan, September 28 - October 1, 2011, Proceedings, volume 6917 of Lecture Notes in Computer Science, pages 459–474. Springer, 2011.

[5] David Kahn. The Codebreakers: The Comprehensive History of Secret Communication from Ancient Times to the Internet. Scribner, rev sub edition, December 1996.

[6] Whitfield Diffie and Martin E. Hellman. New directions in cryptography, 1976.

[7] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM, 21(2):120–126, February 1978.

[8] Taher El Gamal. A public key cryptosystem and a signature scheme based on discrete logarithms. In Proceedings of CRYPTO 84 on Advances in Cryptology, pages 10–18, New York, NY, USA, 1985. Springer-Verlag New York, Inc.

[9] Neal Koblitz. Elliptic curve cryptosystems. Mathematics of Computation, 48(177):203–209, January 1987.

[10] Victor S. Miller. Use of elliptic curves in cryptography. In Lecture Notes in Computer Sciences; 218 on Advances in cryptology—CRYPTO 85, pages 417–426, New York, NY, USA, 1986. Springer-Verlag New York, Inc.

[11] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 3 edition, 2003.

[12] D. Page and N.P. Smart. Parallel cryptographic arithmetic using a redundant Montgomery representation. IEEE Transactions on Computers, 53(11):1474–1482, Nov 2004.

[13] Paul G. Comba. Exponentiation cryptosystems on the IBM PC. IBM Systems Journal, 29(4):526–538, 1990.


[14] Nils Gura, Arun Patel, Arvinderpal Wander, Hans Eberle, and Sheueling Chang Shantz. Comparing elliptic curve cryptography and RSA on 8-bit CPUs. In Marc Joye and Jean-Jacques Quisquater, editors, CHES, volume 3156 of Lecture Notes in Computer Science, pages 119–132. Springer, 2004.

[15] Peter L. Montgomery. Modular multiplication without trial division. Mathematics of Computation, 44(170), 1985.

[16] Paul Barrett. Implementing the Rivest Shamir and Adleman public key encryption algorithm on a standard digital signal processor. In Proceedings on Advances in cryptology—CRYPTO '86, pages 311–323, London, UK, 1987. Springer-Verlag.

[17] Cetin Koc, Tolga Acar, and Burton Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, June 1996.

[18] Krishna Chaitanya Pabbuleti, Deepak Hanamant Mane, Avinash Desai, Curt Albert, and Patrick Schaumont. SIMD acceleration of modular arithmetic on contemporary embedded platforms. In HPEC, pages 1–6. IEEE, 2013.

[19] Joppe W. Bos, Peter L. Montgomery, Daniel Shumow, and Greg Zaverucha. Montgomery multiplication using vector instructions. In Selected Areas in Cryptography 2013 (SAC 2013). Springer, September 2013.

[20] Kazumaro Aoki, Fumitaka Hoshino, Tetsutaro Kobayashi, and Hiroaki Oguro. Elliptic curve arithmetic using SIMD. In Proceedings of the 4th International Conference on Information Security, ISC '01, pages 235–247, London, UK, 2001. Springer-Verlag.

[21] P. Giorgi, L. Imbert, and T. Izard. Parallel modular multiplication on multi-core processors. In Computer Arithmetic (ARITH), 2013 21st IEEE Symposium on, pages 135–142, April 2013.

[22] Marcelo Kaihara and Naofumi Takagi. Bipartite modular multiplication method. IEEE Transactions on Computers, 57(2):157–164, 2008.

[23] Jean-Claude Bajard, Laurent-Stéphane Didier, and Peter Kornerup. An RNS Montgomery modular multiplication algorithm. IEEE Transactions on Computers, 47(7):766–776, 1998.

[24] P. P. Shenoy and R. Kumaresan. Fast base extension using a redundant modulus in RNS. IEEE Trans. Comput., 38(2):292–297, February 1989.

[25] Shinichi Kawamura, Masanobu Koike, Fumihiko Sano, and Atsushi Shimbo. Cox-Rower architecture for fast parallel Montgomery multiplication. In Proc. EUROCRYPT 2000, LNCS 1807, pages 523–538, Berlin, Heidelberg, 2000. Springer-Verlag.

[26] M.I. Aziz and S. Akbar. Introduction to cryptography. In Microelectronics, 2005. ICM 2005. The 17th International Conference on, pages 144–147, December 2005.

[27] Thomas Jonson. Public-key cryptography: PGP, SSL, and SSH. 2000.

[28] Andre Zuquete. Seguranca em Redes Informaticas. FCA, 3 edition, 2010.

[29] Samuel Antao. Portable Embedded Systems: Efficient Units for Data Processing and Cryptography. MSc thesis, Instituto Superior Tecnico - TU-Lisbon, 2008.

[30] Neal Koblitz. A Course in Number Theory and Cryptography. Springer-Verlag New York, Inc., New York, NY, USA, 1987.


[31] J. Jonsson and B. Kaliski. Public-key cryptography standards (PKCS) #1: RSA cryptography specifications version 2.1, 2003.

[32] Mihir Bellare and Phillip Rogaway. The exact security of digital signatures - how to sign with RSA and Rabin. In Proceedings of the 15th Annual International Conference on Theory and Application of Cryptographic Techniques, EUROCRYPT'96, pages 399–416, Berlin, Heidelberg, 1996. Springer-Verlag.

[33] National Institute of Standards and Technology. FIPS PUB 186-4: Digital Signature Standard (DSS). National Institute for Standards and Technology, Gaithersburg, MD, USA, July 2013. Supersedes FIPS 186-3.

[34] National Institute of Standards and Technology. FIPS PUB 180-4: Secure Hash Standard (SHS). National Institute for Standards and Technology, Gaithersburg, MD, USA, March 2012. Supersedes FIPS 180-3.

[35] Elisabeth Oswald. Introduction to elliptic curve cryptography. July 2005.

[36] Arjen K. Lenstra. Key length, 2004.

[37] Alfred J. Menezes, Paul C. Van Oorschot, Scott A. Vanstone, and R. L. Rivest. Handbook of applied cryptography, 1997.

[38] Henk Corporaal. Microprocessor Architectures: From VLIW to TTA. John Wiley & Sons, December 1997.

[39] David Luebke and Greg Humphreys. How GPUs work. Computer, 40(2):96–100, 2007.

[40] Erik Lindholm, John Nickolls, Stuart Oberman, and John Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, March 2008.

[41] Marcus Hahnel and Hermann Hartig. Heterogeneity by the numbers: A study of the ODROID XU+E big.LITTLE platform. In 6th Workshop on Power-Aware Computing and Systems (HotPower 14), Broomfield, CO, October 2014. USENIX Association.

[42] Sergejs Cuhrajs. Boost the performance of your Android app with OpenCL. http://developer.sonymobile.com/knowledge-base/tutorials/android_tutorial/boost-the-performance-of-your-android-app-with-opencl/, November 2013.

[43] A. Karatsuba and Y. Ofman. Multiplication of multidigit numbers on automata. Soviet Physics Doklady, 7:595, January 1963.

[44] Andrei L. Toom. The complexity of a scheme of functional elements realizing the multiplication of integers. Soviet Mathematics Doklady, 3:714–716, 1963.

[45] Arnold Schonhage and Volker Strassen. Schnelle Multiplikation großer Zahlen. Computing, 7(3):281–292, September 1971.

[46] Nicholas S. Szabo and Richard I. Tanaka. Residue Arithmetic and Its Applications to Computer Technology. McGraw-Hill Book Company, New York, 1967.

[47] Jean-Claude Bajard and Laurent Imbert. A full RNS implementation of RSA. IEEE Trans. Comput., 53:769–774, June 2004.

[48] ARM. Cortex-A Series Programmer's Guide, version 1.0. http://xilasz.free.fr/Android/neon/ARM%20Cortex-A%20Series%20Programmer%92s%20Guide.pdf, 2011.


[49] The OpenMP specification for parallel programming. http://www.openmp.org, 2014.

[50] John E. Stone, David Gohara, and Guochun Shi. OpenCL: A parallel programming standard for heterogeneous computing systems. IEEE Des. Test, 12(3):66–73, May 2010.

[51] Intel Corporation. Intel intrinsics guide. https://software.intel.com/sites/landingpage/IntrinsicsGuide/.

[52] Samuel Antao, Jean-Claude Bajard, and Leonel Sousa. Elliptic curve point multiplication on GPUs. In François Charot, Frank Hannig, Jürgen Teich, and Christophe Wolinski, editors, ASAP, pages 192–199. IEEE, 2010.

[53] Dan Boneh and Hovav Shacham. Fast variants of RSA. CryptoBytes, 5:1–9, 2002.

[54] M. Jason Hinek. On the security of multi-prime RSA, 2008.

[55] IBM. SSL authentication. http://publib.boulder.ibm.com/infocenter/tivihelp/v5r1/topic/com.ibm.itim.infocenter.doc/cpt/cpt_ic_security_ssl_authent.html.

[56] William Stallings. Cryptography and Network Security: Principles and Practice. Pearson Education, 3rd edition, 2002.

[57] Daniel Pocock. RSA key sizes: 2048 or 4096 bits? http://danielpocock.com/rsa-key-sizes-2048-or-4096-bits, June 2013.

[58] wxWidgets. wxWidgets cross-platform GUI library. http://wxwidgets.org/.

[59] National Institute of Standards and Technology. Multiple Examples of DSA. National Institute for Standards and Technology, Gaithersburg, MD, USA, July 2003.

[60] dlevere. How the ECDSA algorithm works (PS3). http://gamehacking.org/vb/threads/6152-How-the-ECDSA-algorithm-works-%28PS3%29, January 2012.

[61] Nikola Rajovic, Alejandro Rico, Nikola Puzovic, Chris Adeniyi-Jones, and Alex Ramirez. Tibidabo: Making the case for an ARM-based HPC system. Future Generation Computer Systems, 2013. DOI: http://dx.doi.org/10.1016/j.future.2013.07.013.

[62] Intel Corporation. Intel hyper-threading technology. http://www.intel.com/content/www/us/en/architecture-and-technology/hyper-threading/hyper-threading-technology.html.

[63] Torbjorn Granlund. Instruction latencies and throughput for AMD and Intel x86 processors. https://gmplib.org/~tege/x86-timing.pdf, July 2014.

[64] Hardwarebug. Cortex-A7 instruction cycle timings. http://hardwarebug.org/2014/05/15/cortex-a7-instruction-cycle-timings/, May 2014.
