compact hardware implementations of chacha, blake, … · 2013. 3. 3. · to 19) and a final...

17
Compact Hardware Implementations of ChaCha, BLAKE, Threefish, and Skein on FPGA Nuray At * , Jean-Luc Beuchat , Eiji Okamoto , ˙ Ismail San * , and Teppei Yamazaki * Department of Electrical and Electronics Engineering, Anadolu University, Eskis ¸ehir, Turkey Email: {nat, isan}@anadolu.edu.tr ELCA Informatique SA, Av. de la Harpe 22–24, Case postale 519, 1001 Lausanne, Switzerland Email: [email protected] Graduate School of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, Japan Email: [email protected], [email protected] Abstract—The cryptographic hash functions BLAKE and Skein are built from the ChaCha stream cipher and the tweakable Threefish block cipher, respectively. Interestingly enough, they are based on the same arithmetic operations, and the same design philosophy allows one to design lightweight coprocessors for hashing and encryption. The key element of our approach is to take advantage of the parallelism of the algorithms to deeply pipeline our Arithmetic an Logic Units, and to avoid data dependencies by interleaving independent tasks. We show for instance that a fully autonomous implementation of BLAKE and ChaCha on a Xilinx Virtex-6 device occupies 144 slices and three memory blocks, and achieves competitive throughputs. In order to offer the same features, a coprocessor implementing Skein and Threefish requires a substantial higher slice count. I. I NTRODUCTION The cryptographic hash functions BLAKE [1] and Skein [2] are built from the ChaCha stream cipher [3] and the tweakable Threefish block cipher [2], respectively. It is therefore tempting to design compact unified hardware architectures able to hash and encrypt a message. Such processors are for instance valuable for constrained environments, where some security protocols mainly rely on cryptographic hash functions [4]. Furthermore, as emphasized by Kerckhof et al., “fully unrolled and pipelined architectures may sometimes hide a part of the algorithms’ complexity that is better revealed in compact im- plementations” [5]. In order to have a deeper understanding of the computational efficiency of ChaCha, BLAKE, Threefish, and Skein, we extend here our work presented in [6], [7] and propose novel lightweight coprocessors. The key element of our approach is to take advantage of the parallelism of the algorithms to deeply pipeline our Arithmetic an Logic Units (ALUs), and to avoid data dependencies by interleaving independent tasks. Throughout this article, all operands are w-bit unsigned integers and the following notation is adopted: : addition modulo 2 w ; : subtraction modulo 2 w ; : bitwise AND; : bitwise OR; : bitwise exclusive OR; k: rotation by k bits to the right; k: rotation by k bits to the left. The rest of the article is organized as follows: after a brief overview of Threefish (Section II), Skein (Section III), ChaCha (Section IV), and BLAKE (Section V), we describe our design philosophy and compact hardware implementations (Section VI). We discuss our implementation results on several Xilinx Field-Programmable Gate Arrays (FPGAs) in Sec- tion VII and conclude in Section VIII. II. THE THREEFISH BLOCK CIPHER The design philosophy of Threefish is that “a larger num- ber of simple rounds is more secure than fewer complex rounds” [2]. The key schedule can be computed in a few clock cycles, which is an important consideration in order to build a compression function from a block cipher. Threefish operates entirely on unsigned 64-bit integers and involves only three operations: rotation of k bits to the left, bitwise exclusive OR, and addition modulo 2 64 . Therefore, the plaintext P and the cipher key K are converted to N w 64-bit words. Note that the number of words N w and the number of rounds N r depend on the key size (Table I). The size of a plaintext block is given by N b =8 · N w bytes. Table I NUMBER OF ROUNDS OF THREEFISH FOR DIFFERENT KEY SIZES. Key size # 64-bit words # rounds Block size [bits] Nw Nr N b [bytes] 256 4 72 32 512 8 72 64 1024 16 80 128 The key schedule generates the subkeys from a block cipher key K =(k 0 ,k 1 ,...,k Nw-1 ) and a 128-bit tweak T =(t 0 ,t 1 ). K and T are extended with one parity word (Algorithm 1, lines 1 and 2). Each subkey is a combination of N w words of the extended key, two words of the extended tweak, and a counter s (Algorithm 1, lines 5 to 9). Note that the extended key and the extended tweak are rotated by one word position between two consecutive subkeys.

Upload: others

Post on 03-Sep-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Compact Hardware Implementations of ChaCha,BLAKE, Threefish, and Skein on FPGA

Nuray At∗, Jean-Luc Beuchat†, Eiji Okamoto‡, Ismail San∗, and Teppei Yamazaki‡∗Department of Electrical and Electronics Engineering, Anadolu University, Eskisehir, Turkey

Email: {nat, isan}@anadolu.edu.tr†ELCA Informatique SA, Av. de la Harpe 22–24, Case postale 519, 1001 Lausanne, Switzerland

Email: [email protected]‡Graduate School of Systems and Information Engineering,

University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8573, JapanEmail: [email protected], [email protected]

Abstract—The cryptographic hash functions BLAKE andSkein are built from the ChaCha stream cipher and the tweakableThreefish block cipher, respectively. Interestingly enough, theyare based on the same arithmetic operations, and the samedesign philosophy allows one to design lightweight coprocessorsfor hashing and encryption. The key element of our approachis to take advantage of the parallelism of the algorithms todeeply pipeline our Arithmetic an Logic Units, and to avoid datadependencies by interleaving independent tasks. We show forinstance that a fully autonomous implementation of BLAKE andChaCha on a Xilinx Virtex-6 device occupies 144 slices and threememory blocks, and achieves competitive throughputs. In orderto offer the same features, a coprocessor implementing Skein andThreefish requires a substantial higher slice count.

I. INTRODUCTION

The cryptographic hash functions BLAKE [1] and Skein [2]are built from the ChaCha stream cipher [3] and the tweakableThreefish block cipher [2], respectively. It is therefore temptingto design compact unified hardware architectures able to hashand encrypt a message. Such processors are for instancevaluable for constrained environments, where some securityprotocols mainly rely on cryptographic hash functions [4].Furthermore, as emphasized by Kerckhof et al., “fully unrolledand pipelined architectures may sometimes hide a part of thealgorithms’ complexity that is better revealed in compact im-plementations” [5]. In order to have a deeper understanding ofthe computational efficiency of ChaCha, BLAKE, Threefish,and Skein, we extend here our work presented in [6], [7]and propose novel lightweight coprocessors. The key elementof our approach is to take advantage of the parallelism ofthe algorithms to deeply pipeline our Arithmetic an LogicUnits (ALUs), and to avoid data dependencies by interleavingindependent tasks.

Throughout this article, all operands are w-bit unsignedintegers and the following notation is adopted:• �: addition modulo 2w;• �: subtraction modulo 2w;• ∧: bitwise AND;• ∨: bitwise OR;• ⊕: bitwise exclusive OR;• ≫ k: rotation by k bits to the right;

• ≪ k: rotation by k bits to the left.The rest of the article is organized as follows: after a briefoverview of Threefish (Section II), Skein (Section III), ChaCha(Section IV), and BLAKE (Section V), we describe ourdesign philosophy and compact hardware implementations(Section VI). We discuss our implementation results on severalXilinx Field-Programmable Gate Arrays (FPGAs) in Sec-tion VII and conclude in Section VIII.

II. THE THREEFISH BLOCK CIPHER

The design philosophy of Threefish is that “a larger num-ber of simple rounds is more secure than fewer complexrounds” [2]. The key schedule can be computed in a few clockcycles, which is an important consideration in order to builda compression function from a block cipher.

Threefish operates entirely on unsigned 64-bit integers andinvolves only three operations: rotation of k bits to the left,bitwise exclusive OR, and addition modulo 264. Therefore, theplaintext P and the cipher key K are converted to Nw 64-bitwords. Note that the number of words Nw and the numberof rounds Nr depend on the key size (Table I). The size of aplaintext block is given by Nb = 8 ·Nw bytes.

Table INUMBER OF ROUNDS OF THREEFISH FOR DIFFERENT KEY SIZES.

Key size # 64-bit words # rounds Block size[bits] Nw Nr Nb [bytes]256 4 72 32512 8 72 64

1024 16 80 128

The key schedule generates the subkeys from a blockcipher key K = (k0, k1, . . . , kNw−1) and a 128-bit tweakT = (t0, t1). K and T are extended with one parity word(Algorithm 1, lines 1 and 2). Each subkey is a combinationof Nw words of the extended key, two words of the extendedtweak, and a counter s (Algorithm 1, lines 5 to 9). Note thatthe extended key and the extended tweak are rotated by oneword position between two consecutive subkeys.

Page 2: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Table IIPERMUTATIONS USED BY THE SKEIN FUNCTIONS (REPRINTED FROM [2]).

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

Nw = 4 0 3 2 1π(i) Nw = 8 2 1 4 7 6 5 0 3

Nw = 16 0 9 2 13 6 11 4 15 10 7 12 3 14 5 8 1

Algorithm 1 Key schedule of Threefish.Input: A block cipher key K = (k0, k1, . . . , kNw−1);

a tweak T = (t0, t1); the constant C240 =1BD11BDAA9FC1A22.

Output: Nr/4 + 1 subkeys ks,0, ks,1, . . . , ks,Nw−1, where0 ≤ s ≤ Nr/4.

1. kNw← C240 ⊕

Nw−1⊕i=0

ki;

2. t2 ← t⊕ t1;3. for s← 0 to Nr/4 do4. for i← 0 to Nw − 4 do5. ks,i ← k(s+i) mod (Nw+1);6. end for7. ks,Nw−3 ← k(s+Nw−3) mod (Nw+1) � ts mod 3;8. ks,Nw−2 ← k(s+Nw−2) mod (Nw+1) � t(s+1) mod 3;9. ks,Nw−1 ← k(s+Nw−1) mod (Nw+1) � s;

10. end for11. return ks,0, ks,1, . . . , ks,Nw−1, where 0 ≤ s ≤ Nr/4;

A series of Nr rounds (Figure 1 and Algorithm 2, lines 4to 19) and a final subkey addition (Algorithm 2, line 21) areapplied to produce the ciphertext. The core of a round isthe simple non-linear mixing function Mixd,j (Algorithm 2,lines 13 and 14). It consists of an addition, a rotation by aconstant Rd mod 8,j (repeated every eight rounds and definedin [2, Table 4]), and a bitwise exclusive OR. A word permu-tation π(i) (defined in Table II) is then applied to obtain theoutput of the round (Algorithm 2, line 17). Furthermore, asubkey is injected every four rounds (Algorithm 2, line 7).

Figure 2 describes a decryption round of Threefish-256.It consists of the inverse word permutation followed by theinverse MIX functions. Note that subkeys are injected inreverse order.

III. THE SKEIN FAMILY OF HASH FUNCTIONS

The Unique Block Iteration (UBI) chaining mode allows oneto build a compression function out of a tweakable encryptionfunction. Let M be a message of arbitrary length up to 299−8bits. If the number of bits in M is not a multiple of 8, weappend a bit 1 followed by a (possibly empty) string of 0’s.This step guarantees that M contains NM bytes. Then, we padM with p zero bytes so that NM+p is a multiple of the blocksize Nb. We can now split M into Nb-byte blocks M0, . . . ,Mk−1, where k = (NM +p)/Nb. Each block Mi is processedwith a unique tweak value Ti encoding how many bytes havebeen processed so far, a type field (see [2] for details), and

k1,0

v4,1v4,0 v4,3v4,2

e4,0 e4,1 e4,2 e4,3

v5,1v5,0 v5,3v5,2

k1,2 k1,3k1,1

R4,1 ≪≪

f4,0 f4,1 f4,2 f4,3

R4,0

Mix

4,0

Mix

4,1

Perm

ute

Figure 1. One of the 72 encryption rounds of Threefish-256.

Perm

ute

e4,0 e4,1 e4,2 e4,3

≫R4,1R4,0 ≫

k1,2 k1,3k1,1k1,0

v4,1v4,0 v4,3v4,2

f4,0 f4,1 f4,2 f4,3

v5,1v5,0 v5,3v5,2

InvM

ix4,0

InvM

ix4,1

Figure 2. One of the 72 decryption rounds of Threefish-256.

two bits specifying whether it is the first and/or last block.The UBI chaining mode is computed as:

H0 ← G,Hi+1 ←Mi ⊕ E(Hi, Ti,Mi),

where G is a starting value of Nb bytes.In this work, we consider the normal hashing mode and

refer the reader to [2] for a description of Skein-MAC andtree hashing with Skein. Skein is built on three invocations ofUBI (Figure 3):• Define a 32-byte configuration string C that contains the

length of the digest size (in bits), a schema identifier,

Page 3: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Algorithm 2 Encryption with the Threefish block cipher.Input: A plaintext block P = (p0, p1, . . . , pNw−1); Nr/4+1

subkeys ks,0, ks,1, . . . , ks,Nw−1, where 0 ≤ s ≤ Nr/4;4Nw rotation constants Ri,j , where 0 ≤ i ≤ 7 and 0 ≤j ≤ Nw/2.

Output: A ciphertext block C = (c0, c1, . . . , cNw−1).1. for i← 0 to Nw − 1 do2. v0,i ← pi;3. end for4. for d← 0 to Nr − 1 do5. for i← 0 to Nw − 1 do6. if d mod 4 = 0 then7. ed,i ← vd,i � kd/4,i; (Key injection)8. else9. ed,i ← vd,i; (Rename)

10. end if11. end for12. for j ← 0 to Nw/2− 1 do13. fd,2j ← ed,2j � ed,2j+1; (Mixd,j)14. fd,2j+1 ← fd,2j ⊕ (ed,2j+1≪ Rd mod 8,j);15. end for16. for i← 0 to Nw − 1 do17. vd+1,i ← fd,π(i); (Permute)18. end for19. end for20. for i← 0 to Nw − 1 do21. ci ← vNr,i � kNr/4,i; (Key injection)22. end for23. return C = (c0, c1, . . . , cNw−1);

and a version number [2, Table 7]. Compute the Nb-byteblock G0:

G0 ← UBI(0, C, Tcfg2120).

Note that G0 only depends on the digest size and caneasily be precomputed.

• The message is then processed as follows:

G1 ← UBI(G0,M, Tmsg2120).

• A third call to UBI is required to achieve hashing-appropriate randomness:

H ← UBI(G1, 0, Tout2120).

This transform allows one to produce arbitrary digestsizes (up to 264 bits). If a single output block H isnot enough, one can use Threefish in counter mode toproduce the digest.

IV. THE CHACHA STREAM CIPHER

The ChaCha family of stream ciphers was designed byBernstein [3] to improve the diffusion per round of Salsa20 [8],while preserving the encryption rate. ChaCha operates on 32-bit words, and expands a 256-bit key (k0, . . . , k7) and a 64-bitnonce (IV0, IV1) into a 270-byte stream. A b-byte message is

M0 M1C M2 0

G1← UBI(G0,M, Tmsg2120)

G0 G1

H ← UBI(G1, 0, Tout2120)

H

G0← UBI(0, C, Tcfg2120)

0

Type: Cfg

Tweak: Tweak: Tweak: Tweak:Type: Msg Type: Msg Type: Out

Len: 128

First: 0

Last: 0

Len: 170

First: 0

Last: 1

Configuration block Output transformMessage

Tweak:Type: Msg

Len: 64

First: 1

Last: 0

Figure 3. Skein in normal hashing mode.

then encrypted (or decrypted) by XORing it with the first bbytes of the stream.

ChaCha generates the stream by blocks of 64 bytes. In orderto process the ith block, ChaCha acts on a 4× 4 matrix M of32-bit integers defined as follows:

m0 m1 m2 m3

m4 m5 m6 m7

m8 m9 m10 m11

m12 m13 m14 m15

=

c0 c1 c2 c3k0 k1 k2 k3k4 k5 k6 k7t0 t1 IV0 IV1

,

where• c0 = 61707865, c1 = 3320646E, c2 = 79622D32,

and c3 = 6B206574 are predefined constants;• t = (t0,t1) is a 64-bit counter encoding the index i (i.e.i = 232t1 + t0).

ChaCha transforms the matrix M through a series of Nrrounds (Algorithm 3). The algorithm is based on a nonlinearoperation called quarter-round function and described by Al-gorithm 4. Matrix M is copied into matrix V . Then, the even-and odd-numbered rounds of ChaCha apply the quarter-roundfunction to each row and northwest-to-southeast diagonal of V ,respectively. Eventually, a new block of the stream is generatedby adding V to the original matrix M (Algorithm 3, line 15),and the block counter is incremented (Algorithm 3, lines 17to 20).

Bernstein proposed 8-, 12-, and 20-round variants ofChaCha. Aumasson et al. introduced a novel method fordifferential cryptanalysis of ChaCha and broke the 7-roundvariant [9]. Ishiguro et al. [10], [11] improved the attack andconcluded that Salsa20 and Chacha “are not presently underthreat”.

V. THE BLAKE FAMILY OF HASH FUNCTIONS

The BLAKE family combines three previously studiedcomponents, chosen by Aumasson et al. for their complemen-tarity [1]: the iteration mode HAIFA, the internal structure ofthe hash function LAKE, and a modified version of Bernstein’sstream cipher ChaCha as compression function. BLAKE is afamily of four hash functions, namely BLAKE-224, BLAKE-256, BLAKE-384, and BLAKE-512 (Table III). The maindifferences lie in the length of words w, the number of roundsNr, and in some constants involved in the algorithm. In the

Page 4: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Table IIIPROPERTIES OF THE BLAKE HASH FUNCTIONS.

AlgorithmWord size w Message Block Digest Salt # rounds Rotation distances

[bits] [bits] [bits] [bits] [bits] Nr δ0 δ1 δ2 δ3

BLAKE-224 32 < 264 512 224 128 14 16 12 8 7BLAKE-256 32 < 264 512 256 128 14 16 12 8 7BLAKE-384 64 < 2128 1024 384 256 16 32 25 16 11BLAKE-512 64 < 2128 1024 512 256 16 32 25 16 11

Algorithm 3 Computation of a 64-byte block of the streamof ChaCha.Input: A key, a nonce, and a block counter stored in a matrix

M .Output: A 64-byte block of the stream.

1. for i← 0 to 15 do2. v[i]← m[i];3. end for4. for i← 0 to Nr/2− 1 do5. QUARTERROUND(v0, v4, v8, v12);6. QUARTERROUND(v1, v5, v9, v13);7. QUARTERROUND(v2, v6, v10, v14);8. QUARTERROUND(v3, v7, v11, v15);9. QUARTERROUND(v0, v5, v10, v15);

10. QUARTERROUND(v1, v6, v11, v12);11. QUARTERROUND(v2, v7, v8, v13);12. QUARTERROUND(v3, v4, v9, v14);13. end for14. for i← 0 to 15 do15. v[i]← v[i]�m[i];16. end for17. m12 ← m12 � 1;18. if m12 = 0 then19. m13 ← m13 � 1;20. end if21. Return M and V ;

Algorithm 4 The ChaCha quarter-round function.Input: Four 32-bit integers a, b, c, and d.Output: QUARTERROUND(a, b, c, d).

1. a← a� b;2. d← (d⊕ a)≪ 16;3. c← c� d;4. b← (b⊕ c)≪ 12;5. a← a� b;6. d← (d⊕ a)≪ 8;7. c← c� d;8. b← (b⊕ c)≪ 7;9. Return a, b, c, and d;

following, we denote by BLAKE-n the algorithm with an n-bitdigest.

BLAKE-n involves only two arithmetic operations: theaddition modulo 2w of two w-bit unsigned integers and

the bitwise exclusive OR of two w-bit words. The latter issometimes followed by a rotation of δj bits to the right. Thefour possible rotation distances depend on the digest size andare defined in Table III. The compression function of BLAKE-n produces a new chain value h′ = h′0, . . . , h

′7 from a message

block m = m0, . . . ,m15, a chain value h = h0, . . . , h7, asalt s = s0, . . . , s3, a counter t = t0, t1, and 16 constants cidefined in [1, p. 8]. This process consists of three steps. First, a16-word internal state v = v0, . . . , v15 is initialized as follows:

v0 v1 v2 v3v4 v5 v6 v7v8 v9 v10 v11v12 v13 v14 v15

h0 h1 h2 h3h4 h5 h6 h7

s0 ⊕ c0 s1 ⊕ c1 s2 ⊕ c2 s3 ⊕ c3t0 ⊕ c4 t0 ⊕ c5 t1 ⊕ c6 t1 ⊕ c7

.

Then, a series of Nr rounds is performed. Each of themconsists of a transformation of the internal state v based onthe Gi function described by Algorithm 5, where σr denotes apermutation of {0, . . . , 15} parametrized by the round index r(see Table IV). A column step updates the four columns ofmatrix v as follows: G0(v0, v4, v8, v12), G1(v1, v5, v9, v13),G2(v2, v6, v10, v14), and G3(v3, v7, v11, v15). Note that eachcall to Gi updates a distinct column of matrix v. Sincewe focus on compact implementations of BLAKE in thiswork, we interleave the computation of G0, G1, G2, andG3. This approach allows us to design an ALU withfour pipeline stages and to achieve high clock frequencies.Then, a diagonal step updates the four diagonals of v:G4(v0, v5, v10, v15), G5(v1, v6, v11, v12), G6(v2, v7, v8, v13),and G7(v3, v4, v9, v14). Here again, each call to Gi modifiesa distinct diagonal of the matrix, allowing us to interleave thecomputation of G4, G5, G6, and G7.

At the end of the last round, a new chain value h′ =h′0, . . . , h

′7 is computed from the internal state v and the

previous chain value h (finalization step):

h′0 ← h0 ⊕ s0 ⊕ v0 ⊕ v8, h′4 ← h4 ⊕ s0 ⊕ v4 ⊕ v12,h′1 ← h1 ⊕ s1 ⊕ v1 ⊕ v9, h′5 ← h5 ⊕ s1 ⊕ v5 ⊕ v13,h′2 ← h2 ⊕ s2 ⊕ v2 ⊕ v10, h′6 ← h6 ⊕ s2 ⊕ v6 ⊕ v14,h′3 ← h3 ⊕ s3 ⊕ v3 ⊕ v11, h′7 ← h7 ⊕ s3 ⊕ v7 ⊕ v15.

In order to guarantee that the length ` of a message isa multiple of the block size b, Aumasson et al. define thefollowing padding scheme [1]:

Page 5: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Table IVPERMUTATIONS OF {0, . . . , 15} USED BY THE BLAKE FUNCTIONS (REPRINTED FROM [1]).

i 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

σ0(i) 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15σ1(i) 14 10 4 8 9 15 13 6 1 12 0 2 11 7 5 3σ2(i) 11 8 12 0 5 2 15 13 10 14 3 6 7 1 9 4σ3(i) 7 9 3 1 13 12 11 14 2 6 5 10 4 0 15 8σ4(i) 9 0 5 7 2 4 10 15 14 1 11 12 6 8 3 13σ5(i) 2 12 6 10 0 11 8 3 4 13 7 5 15 14 1 9σ6(i) 12 5 1 15 14 13 4 10 0 7 6 3 9 2 8 11σ7(i) 13 11 7 14 12 1 3 9 5 0 15 4 8 6 2 10σ8(i) 6 15 14 9 11 3 0 8 12 2 13 7 1 4 10 5σ9(i) 10 2 8 4 7 6 1 5 15 11 9 14 3 12 13 0

Algorithm 5 The Gi function.Input: A function index i and four w-bit integers a, b, c, and

d.Output: Gi(a, b, c, d).

1. a← a� b;2. a← a� (mσr(2i) ⊕ cσr(2i+1));3. d← (d⊕ a)≫ δ0;4. c← c� d;5. b← (b⊕ c)≫ δ1;6. a← a� b;7. a← a� (mσr(2i+1) ⊕ cσr(2i));8. d← (d⊕ a)≫ δ2;9. c← c� d;

10. b← (b⊕ c)≫ δ3;

• append a bit 1 followed by a sufficient number of 0 bitssuch that the length is congruent to b − 2w − 1 modulob;

• a padding bit followed by the 2w-bit unsigned big-endianrepresentation of ` is then added; in the case of BLAKE-256 and BLAKE-512, the padding bit is equal to 1;otherwise, it is set to 0.

The hash can now be computed iteratively (Algorithm 6):the padded message is divided into N 16-word blocksm(0), . . . ,m(N−1) and the chain value h(0) is set to the sameinitial value as SHA-n. The counter t(i) denotes the number ofmessage bits in m(0), . . . ,m(i) (i.e. excluding padding bits).Note that, if the last block contains only padding bits, thent(N−1) is set to zero. The message digest consists of the nleast significant bits of the output h(N).

Algorithm 6 Iterated hash.Input: A padded message split into N 16-word blocks and a

salt s.Output: A n-bit digest.

1.(h(0)0 , . . . , h

(0)7

)← (IV0, . . . , IV7);

2. for i← 0 to N − 1 do3. h(i+1) ← compress

(h(i),m(i), s, t(i)

);

4. end for5. return h(N);

VI. HARDWARE IMPLEMENTATION

All of our architectures consist of a register file organizedinto w-bit words and implemented by means of dual-portedmemory, an ALU, and a control unit (Figure 4). The userloads messages, plaintext blocks or ciphertext blocks intoport A. A few control bits allows her to select the algorithmand the desired level of security. When the coprocessors arehashing or encrypting a message, the intermediates results arealways written to port B. In the following, we assume that ourcoprocessors are provided with padded messages. A hardwarewrapper interface for BLAKE, Skein, and several other hashfunctions comprising communication and padding is describedin [12].

Register file (dual-ported memory)

AddrA

AddrB

Pipelined

WeA

Arithmetic and

ctrlk−1:0

Logic Unit

Control Unit

User InterfacePo

rtB

Port

A

Data

WeB

Figure 4. General architecture of our coprocessors.

We follow here the design strategy outlined in [6], [7], [13]–[15]. The first step consists in defining the minimal instructionset to implement a block cipher and a hash function. Then, anin-depth study of the scheduling allows us to build the ALUand organize the data in the register file. During this step, we• try to minimize the number of control bits to keep the

instruction memory as compact as possible;• take advantage of FPGA specifics to optimize the slice

count;

Page 6: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

• identify the available parallelism and pipeline the ALUaccordingly.

Eventually, we design the control unit. The instruction memoryis automatically generated by a C program. In order to keep theinstruction ROM as compact as possible, our C program is ableto compress the code, and to generate the VHDL descriptionof the decompression unit.

A. Arithmetic and Logic Units for Threefish and Skein

Our first ALU implements Threefish encryption and Skein.In the following, Ri denotes a 64-bit register. Figure 5 illus-trates our scheduling of the two mixing functions Mix4,0 andMix4,1 of the fifth round of Threefish-256:

• The operand e4,1 is loaded in register R1; at the sametime, we start the computation of e4,1 ≪ R4,0; thisoperation requires three clock cycles and intermediateresults are stored in R4, R5, and R6.

• Then, e4,0 is loaded in register R2; the content of R1 isnot modified (i.e. R1 must be controlled by an enablesignal).

• We execute the instruction R3 ← R1 � R2 and obtainf4,0.

• R3 and R6 contain f4,0 and e4,1 ≪ R4,0, respectively.The instruction R3← R3⊕R6 allows us to compute f4,1.

e4,3e4,1 e4,3

e4,2

e≪14,3

e≪14,3

e≪334,3

f4,3f4,1 f4,2

R3←

R3⊕

R6

R3←

R1�

R2

R3←

R3⊕

R6

R3←

R1�

R2

Time [clock cycles]

Port A

R1

R2

e4,1

e4,0

e4,1

R6

R3 f4,0

R4

R5

e≪14,1

e≪94,1

e4,2e4,0 e4,3

e≪254,1

Figure 5. Computation of Mix4,0 and Mix4,1 (Threefish-256).

We schedule Mix4,1 as soon as e4,0 has been read, andmanage to keep the pipeline continuously busy. In summary,our ALU must be able to carry out any rotation of a 64-bitword and to perform the following operation (Figure 6):

R3←

{R1� R2 when ctrl10 = 0,R3⊕ R6 otherwise,

(1)

where ctrl10 denotes a control bit. Let us define two 64-bitoperands a and b such that:

(a, b) =

{(R1,R2) when ctrl10 = 0,(R3,R6) otherwise.

It is well-known that a � b = (a ∨ b) � (a ∧ b) and a ⊕ b =(a ∨ b)� (a ∧ b) [16]. Thus, Equation (1) can be rewritten asfollows:

R3← (a ∨ b)� ((a ∧ b)⊕ ctrl10)� ctrl10. (2)

Figure 7 describes the implementation of Equation (2) on aVirtex-6 device. Since there is a single control signal to choosethe arithmetic operation and to select a and b, Equation (2) in-volves only five variables, and is advantageously implementedby 64 LUT6 2 primitives and dedicated carry logic.

0

or

1

0

1

1

0

1

0

From

port

BFr

ompo

rtA

Topo

rtB

0

1

R1

ctrl9ctrl6 ctrl8 ctrl10

≪0,16

,32

or48

≪0,1,2

or3

ctrl1:0

≪0,4,8

or12

R3

R4 R5 R6

R2

ctrl3:2

ctrl7

ctrl5:4

Figure 6. Arithmetic and logic unit for Threefish encryption.

11

0

1

0

0

1

0

0

1

10

1Fr

omR6

ToR3

(carry0 = ctrl10)

ctrl10

(aj ∧ bj)⊕ ctrl10

bj aj ∨ bj

bj aj ∨ bj

5-input LUT

5-input LUT

O6

O5

LUT6 2

carryj

carryj+1

aj

aj

From

R3

From

R1

From

R2

Figure 7. Computation of R3← R1�R2 or R3← R3⊕R6 on a Virtex-6device.

In order to reduce the number of operands stored in theregister file, we interleave the key schedule (Algorithm 1) andthe encryption process (Algorithm 2). This approach allows usto generate the subkeys on-the-fly. It is however necessary tocompute t2 and kNw before the first key injection. The easiest

Page 7: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

way to compute t2 would be to load t0 and t1 in registersR1 and R2, respectively, and to execute the instruction R3←R1⊕R2. Unfortunately, this solution requires one more controlbit to select the inputs of the arithmetic operator, and it is notpossible to implement the multiplexers and the adder on thesame LUT6 2 primitive anymore. Since the critical path ofour coprocessor is located in the 64-bit adder, an extra levelof LUTs would decrease the clock frequency. However, weare able to compute t2 using only the functionalities definedby Equation (1). Since t2 = (t0 � 0) ⊕ (t1 ≪ 0), it sufficesto execute the following instructions:

R4← t1≪ 0,R1← t0, R2← 0, R5← R4≪ 0,R3← R1� R2, R6← R5≪ 0,R3← R3⊕ R6.

This approach assumes that we can read simultaneously twovalues from the register file. Thanks to the multiplexer con-trolled by ctrl7, we can load data from port A or port B intoregister R2 (Figure 6). A similar strategy allows us to computekNw .

The implementation of the key injection is more straight-forward. Note that the multiplexers controlled by ctrl6 andctrl8 allow us to bypass the register file and to use the contentof R3 as an input to the ALU. Let us consider for instancethe first key injection of Threefish-256: e0,2 is defined asp2 � k0,2 = p2 � k2 � t1 and is computed as follows:

R1← k2, R2← t1,R3← R1� R2R1← R3, R2← p2,R3← R1� R2.

Figures 8 and 9 describe how we schedule the instructions ofThreefish-256.

The UBI chaining mode can be combined with the final keyinjection of Threefish encryption. It suffices to modify line 21of Algorithm 2 as follows:

eNr,i ← vNr,i � kNr/4,i;ci ← eNr,i ⊕ pi.

The only difference between this operation and the mixingfunction MIXd,j is that no permutation is applied to the secondoperand of the bitwise exclusive OR.

The inverse of the MIX function being purely sequential,Threefish decryption has less parallelism than encryption. Wesuggest to modify our ALU as follows to fully support bothencryption and decryption (Figure 10):• The inverse of the Mix function and the inverse of the key

injection require a subtraction modulo 264. Our modifiedALU is therefore able to perform a new operation: R3←R1�R2. Because of the additional control bit required toselect the operation, it is not possible to implement ourarithmetic operator by means of 64 LUT6 2 anymore.

Thus, the slice count and the critical path are expectedto increase.

• The output of the inverse Mix function is provided eitherby the arithmetic operator (e.g. e4,0 on Figure 2) or therotation unit (e.g. e4,1 on Figure 2). The multiplexercontrolled by ctrl17 allows us to select the word we storein the register file.

• Since the inverse of the Mix function is sequential, wehave to perform the rotation in a single clock cycle.We suggest to take advantage of the SRL16E primitiveavailable on Xilinx devices to implement a FIFO whosedepth is dynamically adjusted according to the algorithmselected by the user: one and three stages for decryptionand encryption, respectively.

01

0

0

1

0

10

1

1

0

1

1

0

From

port

AFr

ompo

rtB

3- or 1-stage FIFO (encryption or decryption)

oror

Topo

rtB

R1

ctrl7 ctrl10

R4 R5 R6

R3

R2

ctrl12

ctrl6

ctrl13

ctrl14

≪i

ctrl16:15

ctrl9

ctrl17

ctrl8 ctrl5:0 ctrl11

Figure 10. Arithmetic and logic unit for Threefish encryption and decryption.

B. Arithmetic and Logic Units for BLAKE and ChaCha

Let us consider the Gi function of BLAKE-n to define theinstruction set of our coprocessors. Since we focus on compactcoprocessors for the BLAKE family in this article, we performa single step of Algorithm 5 at each clock cycle. We willshow later that the input operand b is already stored in aninternal register of our ALU when we start the computationof Gi(a, b, c, d). Therefore, each operation involves the resultof the previous one, and our ALU will include a feedbackmechanism to bypass the register file of the coprocessor.

Assume that the w-bit word computed by the ALU is storedin register R5, and denote by RFA and RFB the operandsprovided by the register file. From the data flow diagram ofAlgorithm 5, we easily identify three operations (Figure 11):

1) Save the content of R5 in the register file and computeR5← R5� RFA.

2) Compute R5← R5� (RFA ⊕ RFB).3) Save the content of R5 in the register file and compute

R5← (R5⊕ RFA)≫ δj .Recall now that the four calls to Gi in a column step or a

diagonal step can be computed in parallel. In order to keep thecritical path as short as possible, we suggest to design an ALU

Page 8: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

v4,1

@v4,2

@e0,2

R3← R3⊕ R6R3← R1� R2

Permute

Port A

Port B

Output B

Output A

Address A

Address B

Input B

Reg

iste

rFile

AL

U

R1

R2

R6

R3

Rename

v4,2

@v4,2

e3,2

t2

k1

1

k4

e4,2

@k3

e1,0

Second round

Permute

Port A

Port B

Output B

Output A

Address A

Address B

Input B

Reg

iste

rFile

AL

U

R1

R2

R6

R3

Rename

Fourth round

@e3,0

e≪53,1

e3,0

e≪232,1

f2,1

v3,3

e3,3

f2,0

v3,0

e3,0

e2,0

e2,1

e≪571,3

f1,3

v2,1

e2,1

f1,2

v2,2

@e2,2

e2,2

e2,2

e3,3

v3,2

@k1

k1

e3,1

@e3,0

e2,1

@e1,0

f0,2

v1,2

e1,1

e1,2

e1,3 e1,3

v1,1

f0,3

e1,2

e≪160,3

Third round Fifth round

@0

@C240

@k2 @k3

0

C240

k2 k3

0

C240

k1 k2 k3k0

C240

C240

@k4

k4

k4

k1 k2 k3k0 0

k1 k3k0 0

@k0

k2

k1k0

@k1

t1

0

0

t0

@t2@0

@t0@t1

0

t0t1

t0

t1

t0

t2

@p3

@k0

k0

p0 p3

k0 k3

p0 p3

k0 k3

p0 p3

@k3

@e0,3

e0,3

@e0,0

e0,0 e0,2

@p1

p1

p1

t0

k1

t0

k1

e0,3 k0,2e0,0 e0,2 k0,1

t0

k1k0,2

p2t1

k2

t1

k2 p2

t1

k2 p2

@p2

@t0

@k1

@t1

@k2

@p0

k3

e≪140,1

f0,1

v1,3

e1,3

e0,3

e0,2

@e0,0

e1,0

f0,0

v1,0

e1,0

e0,1 e0,1k0,1

@e0,3 @e0,2

e0,3

e0,3 e0,2

e0,2

e0,3

e0,1

e0,1

p1

@e1,0

e0,0

e0,0

e0,0

Computation of t2 and k4, and first key injection

First roundSecond key injection

t1

k3

t1

t1

k2

k2

k3

k4 k3

e4,3 − 1e4,0

t2 1

k1,2

@e4,2

e4,3 e4,2

@e4,3

e4,3 e4,2

@e4,2

e4,2e4,3

e4,2

@e4,0

@e5,0

e5,0

e≪254,1

v5,3

e5,3

f4,0

v5,0

e5,0

e4,0

e4,0

e4,1k1,1

v4,2

v4,2 k2

e4,0

@t1

@k2

@e4,0

@v4,1

e4,0

f4,1

t2 1

e4,3 − 1 v4,2

k1,2

v4,1

k1,1v4,3

v4,1

v4,1

@e4,3

e4,3

e4,1 e4,3 e4,3e4,1

f1,0

v2,0

e1,0

e1,0

e2,0

f1,1

v2,3

e2,3

e≪521,1

e1,1 e1,1

f2,2

@e2,2

e2,2

e3,2

f2,3

v3,1

e3,1

e≪402,3

e2,3e2,3

e2,2

e2,2

f3,1

v4,3

f3,0

e3,0

e3,0

e3,1

e3,0

v4,0

@k4

@1@t2

v4,0

k1

k4

e≪373,3

f3,3

v4,1

f3,2

v4,2

e3,3

@v4,1

t2

Figure 8. Scheduling of Threefish-256 encryption. @d denotes the address of the 64-bit word d in the register file. “Rename” and “Permute” refer to lines 9and 17 of Algorithm 2, respectively.

with four pipeline stages and to interleave the computationof four Gi functions (Figure 12). The heart of the ALU isthe arithmetic operator performing the addition or the bitwiseXOR of two w-bit words described in Section VI-A. Ouroperator computes:

R3←

{R1� R2 when ctrl2 = 0,R1⊕ R2 otherwise

= (R1 ∨ R2)� ((R1 ∧ R2)⊕ ctrl2)� ctrl2,

where• R1 stores the data provided by the register file. Since

a flip-flop is always associated with a LUT, we canperform some simple pre-processing without increasingthe number of slices of the ALU: a control bit ctrl0selects either RFA or RFA ⊕ RFB . This allows us tocompute mσr(2i)⊕ cσr(2i+1) and mσr(2i+1)⊕ cσr(2i) forfree (Algorithm 5, lines 2 and 7).

• R2 almost always stores the result of a previous operation.However, we have to disable the feedback mechanismduring the initialization step: the computation of v8 ←

s0 ⊕ c0 involves for instance only two words stored inthe register file. An array of AND gates controlled byctrl1 allows us to force the second operand to zero insuch cases.

If needed, the content of register R3 is then rotated to the leftin two steps. Our implementation is based on the followingobservation:

R3≫ δi = (R3≫ (δi − δ3))≫ δ3,

where 0 ≤ i ≤ 3. At first glance, this design choice may lookawkward. However, it will allow us to easily build a unifiedprocessor for the BLAKE family. The key point is that thecontent of R3 is copied into R5 when the three control bitsctrl5:3 are equal to 0 (Figure 12).

Note that the pipeline has three possible configurations,denoted by À, Á, and  in Figure 12:

À In order to minimize the area of our ALU, we can inserta w-bit register after the first stage of the rotation. Sincethe latter involves w LUTs, there is no hardware overheadon a Virtex-6 device.

Page 9: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

@t2

k0,1

e68,0 e68,2

e68,0 e68,2 e67,2

e69,3 e69,0 e69,2

e67,0 e67,2

k1 k2 k3

@t0

@k2 @k4 @k0 @k3

e68,0e68,0

t2

t1

t0 k0

C240

k1 k2 k3

t1

k0

C240

k1 k2 k3

t0

t0

t1

t2

k0

C240

k4

k0

t1

k1

18

k0

t1

k1

18

k0

t1

k1

18

k0,2

c2

c2Port A

Port B

Output B

Output A

Address A

Address B

Input B

Reg

iste

rFile

AL

U

R1

R2

R6

R3

Rename

Port A

Port B

Output B

Output A

Address A

Address B

Input B

Reg

iste

rFile

AL

U

R1

R2

R6

R3

Rename

k0,3

c3

t0

k4 c0

k3

c3 k4 c0

t0 k3

c2 c3 k4 c0

t0 k3k0,3k0,2

f∗71,1f72,3

e71,2

e71,2

k0,1

e71,1

e71,0

e71,1 e71,3

e71,2

f72,2

v71,2

e71,2

e71,1

v71,1

e71,1

f72,0 f72,1

v71,0 v71,3

e71,3e71,0

e71,0

e71,1

e71,2

f71,1

f∗71,3

v70,3

e70,3

e70,3

e68,0 e68,2

v68,0 v68,2

f69,0

f69,3

f69,2

e68,2

f68,1e68,1

f∗68,1

e68,3

f68,3

f∗68,3 f68,0

e68,3

f68,2

temp2

temp2

InversePermute

PermuteInverse

v68,3 v68,1

v68,3 v68,1

v68,3 v68,1

k17,3 k17,1

k17,3 k17,1

e67,0

e67,0 e67,2

e67,1 e67,3

e67,0 e67,2

e67,0 e67,2

e67,0 e67,2

e67,0

e67,1 e67,3

e67,1 e67,3

e67,1 e67,3

e69,0

f69,1

f∗69,3

v68,3

e68,3

e68,3

v70,0

f71,0 f71,2

e70,0 e70,2

v70,2

e70,0 e70,2

e70,3

e70,3

e70,0 e70,2

e70,0

e70,0

e70,3

e70,0

e70,2

e70,2

v69,3

e69,3

v69,2

e69,2

e69,0 e69,2

e69,3

f∗69,1

f67,1

f∗67,3

f∗70,1

f70,1

f∗70,3 f70,2

e69,1

e65,3

f65,0

e64,3

e66,0

e66,3 e66,2e66,0

v66,0 v66,0 v66,2

e66,2e66,0e66,0

f∗67,1

f67,3

f67,0 f67,2

e66,1

f∗66,1

e65,3

e66,3

e66,2

e66,3 e66,0 e66,2

e66,0

v65,3

e65,0 e65,2

v65,2v65,0

f66,0 f66,2f∗66,3

f66,1 f66,3 e65,1

e66,3 e66,0 e66,2

e65,3 e65,0 e65,2

e65,3 e65,0 e65,2

c1

c1

c1

e71,2

e71,2

e69,3 e69,0 e69,2

e69,3 e69,0 e69,2

e71,0

v69,0

e69,0

f70,3

f70,0

e70,2

e69,2

e68,3 e68,0 e68,2

e67,2 e66,2

e71,0

e71,0 e71,2

e70,1f71,3

v68,1 v68,3

v68,1 v68,3

v68,2 v68,0

v68,2

v68,0

v64,3

e64,3 e64,0

v64,0

f∗65,1 f∗65,3

f65,1 f65,3

e65,0 e65,2 e65,0 e65,2

e65,2e65,0e65,3

e64,0

t0

k2 k4 k0 k3

17 t2

k2 k4 k0 k3

t217

k2 k4t0 17

k0 k3

t2

v67,0 v67,2 v67,3 v67,1

e67,0 e67,2 e67,1 e67,3

First round Second round Third round

Seventh roundSixth roundFifth roundFourth round

Second key injection

Computation of t2 and k4, and first key injection

R3← R1⊕ R3R3← R1⊕ R2

@k2 @k0 @k1@k1@k0@t0

@t1 @C240

@k3

@t2 @18@t1

@c2 @c3

@k4

@k4

@t0

@c0

@k3

@c1

@e71,1@e71,2

@e71,2

@e70,3 @e68,3@e70,0 @e70,2 @e69,2@e71,0 @e71,2 @e69,0@e69,3

@e68,0 @e68,2 @e67,0@17

@v68,3 @v68,1

@e67,2

@e67,0 @e67,2

@e67,2@e67,3@e67,1 @e67,0 @e66,3 @e66,0 @e66,2 @e65,0@e65,3 @e65,2@v68,3@v68,1 @e64,3 @e64,0

Figure 9. Scheduling of Threefish-256 decryption.

Á The addition modulo 2w can be computed in two clockcycles. Let a = alow+2

w2 ahigh and b = blow+2

w2 bhigh. We

store ahigh and bhigh in two w2 -bit registers, and compute

a sum word slow and a carry bit such that 2w2 c+ slow =

alow + blow. A flip-flop and a w2 -bit register store c and

slow, respectively. The most significant bits of the sumare then given by shigh = ahigh + bhigh + c. This approachallows us to reduce the worst-case carry path at the priceof three w

2 -bit registers and a flip-flop. Routing a signal from a memory block to a slice is

sometimes expensive in terms of wire delay. If the criticalpath is located between the register file and register R1,this pipeline configuration will help boosting the clockfrequency. The output data path of a Virtex-6 memoryblock has an optional internal pipeline register. Therefore,the only hardware overhead is the w-bit register betweenR5 and the array of AND gates controlled by ctrl1.

In order to avoid pipeline bubbles between column anddiagonal steps, it suffices to process the four calls to Gi ofthe diagonal step in the following order: G7, G4, G5, and G6.We check for instance that the ALU outputs the new value of

v4 (last instruction of G0) at time τ+3. If we load v3 from theregister file, we can start the computation of G7 at time τ +4(Figure 13). We easily check that this scheduling also avoidspipeline bubbles between a diagonal step and a column step(Figure 14). Since each call to Gi involves ten instructions,we need 80 clock cycles to perform a round of BLAKE-n.

Our first architecture can be modified to support the fouralgorithms of the BLAKE family (Figure 15). The 64-bitdatapath is built out of two 32-bit datapaths, thus allowingus to perform a single 64-bit operation or two 32-bit opera-tions at each clock cycle. The mode of operation is selectedaccording to an additional control bit ctrl6, the latter beingprovided by the user. The ALU includes two 32-bit adders.Let alow, ahigh, blow, bhigh, slow, and shigh denote unsigned 32-bitintegers. When the user chooses BLAKE-224 or BLAKE-256(ctrl6 = 0), two messages are processed in parallel and theALU performs two 32-bit additions:

slow ← alow + blow + ctrl2,shigh ← ahigh + bhigh + ctrl2.

When the coprocessor executes BLAKE-384 or BLAKE-512

Page 10: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

3

≫ δ0 ≫ δ2

≫ δ1

3

2

1

3

1

2

b

d

mσr(2i+1)

≫ δ3 b

cσr(2i)

a

d

mσr(2i)

a

c

d

c c c

cσr(2i+1)

aa

bb b b

1

2From R5

From port B

From port A To R5

3

1

≫ δi To R5From port A

To port BFrom R5

3

d

From port A

From R5 To port B

To R5 1

Figure 11. Implementation of the Gi function of BLAKE-n by means of three instructions. R5 denotes an internal register of the ALU.

ctrl1

LUT6 2

Á≫

δ 3

≫(δ

2−δ 3)

≫(δ

1−δ 3)

≫(δ

0−δ 3)

1

0R5

À

LUT6

1

0

1

0

1

0

ctrl3

R4

Topo

rtB

Á

R30

1R1

ctrl0

Â

Â

From

port

AFr

ompo

rtB

R2

Â

ctrl4 ctrl5ctrl2

Figure 12. Arithmetic and Logic Unit for BLAKE-n.

(ctrl6 = 1), the ALU carries out a 64-bit addition. The firstadder generates the least significant bits of the sum and a carrybit c such that:

232 · c+ slow = alow + blow + ctrl2.

The second adder computes the most significant bits of thesum:

shigh ← ahigh + bhigh + c.

We use the rotation unit of our first processor to deal withBLAKE-224 and BLAKE-256. Note that the content of R3 isalways copied into R6 when ctrl5:3 = (000)2. Thus, we sharethis datapath between all algorithms of the BLAKE family,and need only 64 LUTs to implement the rotation unit of

BLAKE-384 and BLAKE-512. When ctrl5 is equal to one,ctrl4:3 encodes the index i of the rotation distance δi (Table V).Consequently, we can use the same instruction flow for allalgorithms and select the width of the datapath according toctrl6. Note that the three pipeline configurations defined forour first coprocessor are also available here.

The QUARTERROUND function of ChaCha requires onlytwo of the instructions we defined for the Gi function. Thus,the design of a ChaCha coprocessor is rather straightforward(Figure 16). Since it is not necessary to compute RFA⊕RFBanymore, the ALU has a single 32-bit input. The only difficultyis to increment the 64-bit counter 232m13+m12 (Algorithm 3,lines 17 to 20). Assume that the constant 1 is stored in the

Page 11: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Â

Ã

À

Á

Pipe

line

stag

es

τ + 5

1st step of G5:

τ τ + 6τ + 4τ + 3τ + 2τ + 1

Time [clock cycles]

R4← R3≫ 0

R3← R1⊕ R2

v9← v9� v13

9th step of G1:

v10← v10� v14

9th step of G2:

v3← v3� v4

1st step of G7:

R5← v4 = R4≫ 7

v0← v0� v5

1st step of G4:

v4← (v4 ⊕ v8)≫ 7

10th step of G0:

v6← (v6 ⊕ v10)≫ 7

10th step of G2:

R2← R5 = v8

R1← v4

v5← (v5 ⊕ v9)≫ 7

10th step of G1:

v11← v11� v15

9th step of G3:

v7← (v7 ⊕ v11)≫ 7

10th step of G3:

v1← v1� v6

R2← R5 = v9

R3← R1⊕ R2

R5← v5 = R4≫ 7

R1← v5

R4← R3≫ 0

R1← v6

R2← R5 = v10

R3← R1⊕ R2

R4← R3≫ 0

R5← v6 = R4≫ 7

R3← R1⊕ R2

R5← v7 = R4≫ 7

R1← v7

R2← R5 = v11

R4← R3≫ 0 R4← R3≫ 0

R3← R1� R2

R1← v3

R2← R5 = v4

R1← v0

R2← R5 = v5

R3← R1� R2

R1← v1

R2← R5 = v6

R4← R3≫ 0

R5← v11 = R4

R3← R1� R2

R5← v10 = R4

R4← R3≫ 0

R5← v9 = R4

Figure 13. Avoiding pipeline bubbles between a column step and a diagonal step.

Á

Â

ÃPipe

line

stag

es

Time [clock cycles]τ + 1 τ + 2 τ + 3 τ + 4 τ + 5 τ + 6τ

À

R4← R3≫ 0

R3← R1⊕ R2

R5← v4 = R4≫ 7

R2← R5 = v9

R1← v4

v4← (v4 ⊕ v9)≫ 7

10th step of G7:

10th step of G4:

R2← R5 = v10

R5← v5 = R4≫ 7

R1← v5

R3← R1⊕ R2

R4← R3≫ 0

v5← (v5 ⊕ v10)≫ 7

10th step of G5:

R5← v6 = R4≫ 7

R1← v6

R2← R5 = v11

R3← R1⊕ R2

R4← R3≫ 0

v6← (v6 ⊕ v11)≫ 7

10th step of G6:

R1← v7

R2← R5 = v8

R4← R3≫ 0

R3← R1⊕ R2

v7← (v7 ⊕ v8)≫ 7

R5← v7 = R4≫ 7

R3← R1� R2

1st step of G0:

v0← v0� v4

R1← v0

R2← R5 = v4

R4← R3≫ 0

R1← v1

R3← R1� R2

R2← R5 = v5

1st step of G1:

v1← v1� v5 v2← v2� v6

R1← v2

R2← R5 = v6

1st step of G2:

9th step of G6:

R5← v8 = R4

v8← v8� v13

R3← R1� R2

R4← R3≫ 0

v11← v11� v12

R5← v11 = R4

R4← R3≫ 0

9th step of G5:9th step of G4:

v10← v10� v15

R5← v10 = R4

Figure 14. Avoiding pipeline bubbles between a diagonal step and a column step.

Table VROTATION DISTANCES OF THE UNIFIED BLAKE COPROCESSOR.

ctrl5:3 Rot. dist. BLAKE-224/256 BLAKE-384/512(000)2 0 R6 ← R3 (common datapath)(100)2 δ0 R6 ← R3 ≫ 7 R6 ← R3 ≫ 11(101)2 δ1 R6 ← R3 ≫ 8 R6 ← R3 ≫ 16(110)2 δ2 R6 ← R3 ≫ 12 R6 ← R3 ≫ 25(111)2 δ3 R6 ← R3 ≫ 16 R6 ← R3 ≫ 32

register file. The control bit ctrl0 allows us to disable thefeedback mechanism and to load the constant 0 in registerR2. Execute the following instructions:

R1← 1 R2← 0,R3← R1⊕ R2,R4← R3,R5← R4,R1← m12 R2← R5.

Registers R1 and R2 store m12 and the constant 1, respectively.Note that the output carry of the 32-bit adder can now be storedin a flip-flop F. Furthermore, when ctrl2 is set to one, our ALUperforms an “add with carry” instruction. We can now computem12 + 1, save the output carry in F , and increment m13 if

necessary:

(F,R3)← R1� R2 R1← m13 R2← 0,R3← R1� R2� F R4← R3,

Register file← R4 (m12) R4← R3,Register file← R4 (m13).

Three pipeline configurations are again available. The secondone needs specific attention: since the adder is pipelined,the computation of m12 � 1 requires two clock cycles. It istherefore mandatory to introduce a NOP before loading m13

into register R1.It is of course possible to build a unified coprocessor for

ChaCha and the BLAKE family (Figure 17). A new controlbit ctrl7 allows the user to select the mode of operation ofthe ALU: encryption or hashing. Since the coprocessor has a64-bit datapath to support BLAKE-384 and BLAKE-512, it ispossible to encrypt two messages in parallel with ChaCha.

C. Register Files and Control Units

We will consider our unified coprocessor for the BLAKEand ChaCha algorithms to describe how we design our controlunits. The same approach can easily be applied to the othercoprocessors considered in this work. Virtex-6 FPGAs embedseveral configurable memory blocks that can for instance

Page 12: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

01

≫25

≫32

≫16

01

10

1

01

0

10

≫11

1 0 10

01

10

32 bits

R4

À

Á

Á

Á

ctrl6

Á

≫1

≫9

0

1R1

ctrl0

Â

Â

ctrl1Fr

ompo

rtA

LUT6 2

ctrl2From

port

B

R2

ctrl3

Â

1 bit 64 bits

ctrl5ctrl4

ctrl6

LUT6

≫5

R6

Topo

rtB

≫7

R3

R5

À

b hig

h

b low

alo

w

c

s hig

h

ahi

gh

s low

Figure 15. Unified Arithmetic and Logic Unit for the BLAKE family.

carry

ctrl3

ctrl4ctrl0

ctrl1 ctrl2

LUT6

F

Á

R3

≪7

≪1

≪9

1

0

À

1

0

1

0

ctrl5

R4

Topo

rtB

≪5

R5

1 0

R1

R2

From

port

A Â

 LUT6 2

1

0

Output

Figure 16. Arithmetic and Logic Unit for ChaCha.

store 1024 36-bit words or 2048 18-bit words. Our controlunit mainly consists of a program counter that addressesan instruction memory implemented by means of a memoryblock.

A straightforward way to deal with the permutations in-volved in the BLAKE family is to unroll the round loop.Table VI summarizes the number of instructions required bythe algorithms supported by our coprocessor if we follow thisapproach. Note that it suffices to store the code of BLAKE-384/512 and 20-round ChaCha (2039 instructions): a simplefinite-state machine allows us to jump to the finalization step

Table VINUMBER OF INSTRUCTIONS OF THE ALGORITHMS OF THE BLAKE AND

CHACHA FAMILIES.

Algorithm # instructionsBLAKE-224/256 1184BLAKE-384/512 13448-round ChaCha 31112-round ChaCha 43920-round ChaCha 695

when the desired number of rounds has been performed1. The1It is possible to reduce the size of the code by storing the table defining the

permutation of {0 . . . , 15} parametrized by the round index r (Table IV) andby generating the addresses of mσr(2i) and cσr(2i+1) on-the-fly. However,this approach would require a more complex control unit. As long as themicro-code fits into a single block of memory, there is no need to try toreduce the number of instructions.

Page 13: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

1

1

0 0

1

0

01

1

10

1

1

0

0

01

1

0

0

0

1

11

0

≫11

01

10

≫16

≫32

≫25

10

0

1 0

From

port

B

ctrl4

LUT6

ctrl5

1 bit 32 bits 64 bits

oror

ctrl6ctrl8

Á

LUT6 2

ctrl0 ctrl2

R5

ctrl6

≪8

≪16

R6

≪7

≪12

≫1

≫9

ctrl3

≫5 ≫

7

R4

À

À

R7

Topo

rtB

ctrl7

R3

Á

À

Á ÁÁ

ctrl1

ÂR2

R10

1

Â

Â

From

port

A

Figure 17. Unified Arithmetic and Logic Unit for the BLAKE and ChaCha families.

main challenge is therefore to define control words of at most18 bits in order to implement our instruction memory by meansof a single memory block. A clever organization of the registerfile (Figure 18) and a simple compression algorithm allowsus to achieve this goal. Two blocks of dual-ported memoryconfigured as 256 entries of 32 bits store the message, thechaining value, the constants, and all the intermediate variablesof BLAKE and ChaCha. Thus, our coprocessor requires 26control bits (Figure 19a):

• 8 address bits and a write enable signal for port A of theregister file;

• 8 address bits and a write enable signal for port B of theregister file;

• 8 control bits for the ALU.

Two control bits are provided by the user: ctrl7 allows herto select between BLAKE and ChaCha, and ctrl6 specifiesthe configuration of the datapath (2× 32 bits or 64 bits). Ourorganization of the data in the register file enables us to definea 20-bit instruction:

• The most significant address bit depends on the algorithmbeing executed, and is therefore provided by the user.

• We use ports A and B to load new data (message, salt, andcounter) and save the intermediate variables computedby the ALU, respectively. Consequently, the write enablesignal of port A is also given by the user.

• Let us denote by a7:0 the eight address bits of ports

A. Note that a6 is equal to one only when we read aninitial vector and assume that the digest size is selectedaccording to an additional control bit ctrl8. The addressbit a5 is computed as follows:

a5 ←

{a5 when a6 = 0,ctrl8 otherwise.

Thanks to this simple mechanism, the instruction flowdoes not depend on the digest size. Initial vectors arealways read from port A.

• Since the initial vectors are neither modified nor readfrom port B, the second most significant address bit isalways equal to zero.

Consequently, we can store 20-bit words in the instructionmemory (Figure 19b). We designed a simple compressionalgorithm to encode the write enable signal of port B and thesix control bits ctrl5:0 by means of five bits. A C programgenerates the content of the instruction memory and theVHDL description of the decompression circuit. The latterinvolves only seven 5-input LUTs, and stores the controlbits of the ALU and the write enable signal of port B ina register. Because of this pipeline stage, it is necessary togenerate the write enable signal one clock cycle in advancewhen we have to store a word in the register file. Our Cprogram takes this parameter into account and organizes thecontrol bits in the instruction memory according to the pipeline

Page 14: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

v0, . . . , v15

c0, . . . , c15

s0, . . . , s3

151

144

167

152

183

168

187

184

189

188

199

192

h0, . . . , h7

v0, . . . , v15

c0, . . . , c15

s0, . . . , s3

t0 and t1

IV0, . . . , IV7 (BLAKE-384)

143

128

m0, . . . , m15

64 bits

103

96IV0, . . . , IV7 (BLAKE-256) IV0, . . . , IV7 (BLAKE-256)

231

224IV0, . . . , IV7 (BLAKE-256)

127

104

Unused

23

16h0, . . . , h7

v0, . . . , v1539

24

c0, . . . , c1555

40

59

56s0, . . . , s3

32 bits 32 bits

191

190Unused

256

232

Unused

95

72Unused

223

200Unused

15

0

71

64IV0, . . . , IV7 (BLAKE-224)

63

6261

60

0 and 1 (32-bit constants)

m0, . . . , m15

t0 and t1

IV0, . . . , IV7 (BLAKE-224)

0 and 1 (32-bit constants)

m0, . . . , m15

t0 and t1

h0, . . . , h7

Figure 18. Register file of the unified coprocessor for the BLAKE and ChaCha families.

WeB ctrl7 ctrl6 ctrl5 ctrl4 ctrl3 ctrl2 ctrl1 ctrl0

Port A Arithmetic and logic unit

ctrl6ctrl6 0

Port B

AddrA (7 bits) WeB ctrl4ctrl5 ctrl1 ctrl0ctrl2ctrl3

Port A Arithmetic and logic unit

AddrB (6 bits)

Port B

(a) Address and control bits of our unified coprocessor for BLAKE and ChaCha

(b) Address and control bits provided by the control unit

AddrA (8 bits) WeA AddrB (8 bits)

Figure 19. From 26- to 18-bit instructions. Shaded cells denote control bits provided by the user.

configuration. Then, it generates the compressed instructionmemory. Figure 20 describes the instruction flow for the firstpipeline configuration of our coprocessor:

• As explained above, the write enable signal is generatedone clock cycle in advance to take the internal pipelinestage of the decompression unit into account.

• All inputs of the register file are registered, and the twocontrol bits ctrl0 and ctrl1 must therefore be generatedone clock cycle after the addresses. We take advantageof the latency of our decompression unit to synchronizethe control signals.

We followed the same approach to build our control unitsfor Threefish and Skein. The register file is organized into 64-bit words, and stores a plaintext block, an internal state (ed,i,where 0 ≤ i ≤ Nw − 1), an extended block cipher key, anextended tweak, the constant C240, and all possible values ofs involved in the key schedule (Figure 21). Thanks to thisapproach, the word permutation π(i) and the word rotation ofthe key schedule are conveniently implemented by addressingthe register file accordingly. Since the round constants repeatevery eight rounds (Algorithm 2, line 14), we decided to unrolleight iterations of the main loop of Threefish (Algorithm 2,lines 4 to 19). The rotation constants Rd,i are included inthe microcode executed by the control unit. Note that ourregister file is designed for Threefish-1024 (i.e. Nw = 16and Nr = 80). It is therefore straightforward to implement

the two other variants of the algorithm on our architecture.The number of clock cycles required for Threefish encryptionand decryption according to the key size is summarized inTable VII. Because of the output transform, k+1 invocationsof the tweakable encryption function are necessary to hash ak-block message with Skein. There is a latency of 5 clockcycles between two consecutive Threefish encryption. Thus,the throughput of Skein is given by:

throughput =8 ·Nb · k · f

(k + 1) · latency of Threefish with UBI + 5 · k,

where f denotes the clock frequency.

72

56

95

76

Unused

48

63 127

117

Unused

32

116

96

e′i,0, . . . , e′i,7

Unused49

0, 1, . . . , 19, 20

55

p0, . . . , p15

ei,0, . . . , ei,15

k0, . . . , k16

15

0

71

64

31

e′i,8, . . . , e′i,15

t0, t1, t2, C2407516

Figure 21. Register file of our Threefish and Skein architectures.

Page 15: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

WeB

ctrl5

ctrl4 ctrl3

ctrl2

ctrl0ctrl1WeBAddrB (6 bits)AddrA (7 bits)

ctrl4ctrl5

ctrl3

ctrl2

ctrl1 ctrl0

Port A Port B Arithmetic and logic unit

Compressed opcode

Compression algorithm

AddrB (6 bits)AddrA (7 bits)

Figure 20. Generation of the compressed instruction memory.

Table VIINUMBER OF INSTRUCTIONS OF THE ALGORITHMS OF THE THREEFISH

FAMILY.

Algorithm # instructionsThreefish-256 encryption 490

Threefish-256 encryption with UBI 501Threefish-256 decryption 469Threefish-512 encryption 860

Threefish-512 encryption with UBI 882Threefish-512 decryption 1092

Threefish-1024 encryption 1874Threefish-1024 encryption with UBI 1912

Threefish-1024 decryption 2351

VII. RESULTS AND COMPARISONS

We captured our architecture in the VHDL language andprototyped our coprocessors on a Xilinx Virtex-6 FPGA withaverage speedgrade. Tables VIII and IX summarize our place-and-route results measured with ISE 14.2. Note that weconsidered the least favorable case, where the message consistsof a single block, to compute the throughput of Skein.

Most of the architectures described in the open literaturefocus on a single level of security (Table X). We tookadvantage of the intrinsic parallelism of BLAKE to interleavethe computation of four instances of the Gi function. Thanks tothis approach, we designed an ALU with four pipeline stagesand achieved higher clock speeds than the coprocessors listedin Table X. A careful scheduling allowed us to totally avoidpipeline bubbles and memory collisions. We also addressedFPGA-specific issues and described how to share slices be-tween addition and bitwise exclusive OR of two operands.We followed the same strategy to design our coprocessorsfor Threefish and Skein. As a consequence, our coprocessorsprovide the end-user with hashing and encryption at all levelsof security, while offering a better area–time trade-off.

We report in Figure 22 the latest lightweight implementationresults of several cryptographic hash functions. Besides ourcoprocessors for BLAKE-512 and Skein-512-512, we selectedGrøstl [14], JH [5], SHA-2-512 [21], and SHA-3-512 (Keccak[r = 1024, c = 576]) [13]. In this context, BLAKE isobviously the best choice for lightweight implementations onFPGA. Since our unified architecture for the BLAKE family(Figure 15) requires less than 100 Virtex-6 slices, BLAKE

0

100

200

300

400

500

0 50 100 150 200 250 300

Area [Virtex-6 slices]

BLAKE-512

Grøstl-512

JH-512

SHA-3-512

Skein 512-512

SHA-2-512

Throughput [Mbit/s]

Figure 22. Compact implementations of several cryptographic hash functionson Virtex-6 FPGAs (512-bit digests).

is also an excellent candidate for cryptographic coprocessorssupporting several levels of security.

We already proposed lightweight implementations of ECHO& AES [15] and Grøstl & AES2 [14] (Table XI). According toour results, the unified coprocessor for BLAKE and ChaChaoffers the best area–time trade-off. However, given that allsymmetric cryptographic functions (including authenticatedencryption) can be efficiently implemented with Keccak, wewould get the following figures with a unified architecturebased on [13]:

• hashing with arbitrary length at a security level of 256bits: 501 Mbits/s;

• authenticated encryption at a security level of 256 bits:more than 501 Mbits/s (the generic security of keyedsponges allows one to use less capacity than for hashing,hence a larger rate and a proportionally larger through-put) [24].

2Note that Jarvinen [22] proposed the first unified coprocessor for AES-128 (encryption and key expansion) and Grøstl-256. Recently, Rogawski& Gaj [23] designed a parallel coprocessor for Grøstl-based HMAC andAES in the counter mode. Both architectures are optimized for high-speedimplementations, and it is therefore difficult to make a comparison with ourlightweight coprocessors.

Page 16: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Table VIIIPLACE-AND-ROUTE RESULTS FOR OUR THREEFISH AND SKEIN COPROCESSORS ON A VIRTEX-6 FPGA (XC6VLX75T-2). ALL DESIGNS REQUIRE THREE

MEMORY BLOCKS.

Supported Area Freq.Throughput [Mbits/s]

Algorithms [Slices] [MHz] Skein-256-256 Skein-512-512Threefish-256 Threefish-512 Threefish-1024Enc. Dec. Enc. Dec. Enc. Dec.

Threefish encryption 145 294 – – 153 – 175 – 160 –Threefish 277 267 – – 139 145 158 125 145 116

Skein and150 295 75 85 154 – 175 – 161 –Threefish encryption

Skein and Threefish 292 279 70 80 145 152 166 130 152 121

Table IXPLACE-AND-ROUTE RESULTS FOR OUR CHACHA AND BLAKE COPROCESSORS ON A VIRTEX-6 FPGA (XC6VLX75T-2).

Throughput [Mbits/s]Supported Pipeline Area # block Freq. BLAKE-224 BLAKE-384

8-round 12-round 20-roundalgorithms config. [slices] RAMs [MHz] and andChaCha ChaCha ChaChaBLAKE-256 BLAKE-512

¬ 49 2 362 – – 595 422 266ChaCha ­ 77 2 316 – – 520 368 232

® 77 2 345 – – 569 403 254

BLAKE-224 and¬ 47 2 338 146 – – – –

BLAKE-256­ 49 2 341 147 – – – –® 50 2 349 150 – – – –

BLAKE-384 and¬ 79 3 331 – 252 – – –

BLAKE-512­ 91 3 331 – 252 – – –® 91 3 329 – 250 – – –

BLAKE¬ 94 3 312 2× 134 237 – – –

(all levels of security)­ 126 3 332 2× 143 252 – – –® 129 3 343 2× 148 261 – – –

BLAKE and ChaCha¬ 144 3 335 2× 144 255 2× 551 2× 390 2× 246

(all levels of security)­ 156 3 289 2× 124 220 2× 475 2× 337 2× 212® 168 3 304 2× 131 231 2× 500 2× 354 2× 223

VIII. CONCLUSION

The stream cipher ChaCha, the block cipher Threefish,and the hash functions BLAKE and Skein are based on thesame arithmetic operations. In this work, we showed thatthe same design philosophy allows one to design lightweightcoprocessors for hashing and encryption. The key element ofour approach is to take advantage of the parallelism of thealgorithms to:• deeply pipeline the ALU to achieve a high clock fre-

quency;• avoid data dependencies by interleaving independent

tasks.Furthermore, we described how to design compact controlunits thanks to a careful organization of the register file, loopunrolling, and a simple compression algorithm. Our archi-tectures are mainly designed for embedded systems. Thus,it would be interesting to conduct side-channel and faultinjection attacks in future work.

Our results show that BLAKE and ChaCha are excellentcandidates for lightweight coprocessors. However, since allsymmetric cryptographic functions can be implemented bymeans of keyed sponges, we are planning to design hardwarearchitectures based on the new SHA-3 algorithm. According

to the preliminary results reported in [13], SHA-3 couldoutperform BLAKE and ChaCha.

ACKNOWLEDGEMENTS

The authors would like to thank Daniel J. Bernstein and RayCheung for their valuable comments. This work was partiallysupported by the Japanese Society of Promotion of Science(JSPS) through the A3 Foresight Program (Research on NextGeneration Internet and Network Security). Additionally theauthors would like to acknowledge Xilinx and the XilinxUniversity Program for its generous donation of materials interms of design tools.

REFERENCES

[1] J.-P. Aumasson, L. Henzen, W. Meier, and R. Phan, “SHA-3 proposalBLAKE (version 1.4),” Jan. 2011, available at http://www.131002.net/blake.

[2] N. Ferguson, S. Lucks, B. Schneier, D. Whiting, M. Bellare, T. Kohno,J. Callas, and J. Walker, “The skein hash function family (version 1.3),”Oct. 2010, available at http://www.skein-hash.info.

[3] D. Bernstein, “ChaCha, a variant of Salsa20,” Jan. 2008, available athttp://cr.yp.to/papers.html#chacha.

Page 17: Compact Hardware Implementations of ChaCha, BLAKE, … · 2013. 3. 3. · to 19) and a final subkey addition (Algorithm 2, line 21) are applied to produce the ciphertext. The core

Table XCOMPACT IMPLEMENTATIONS OF BLAKE AND SKEIN ON VIRTEX-5 AND VIRTEX-6 FPGAS. THE THROUGHPUT IS COMPUTED FOR A ONE-BLOCK

MESSAGE.

Supported algorithm(s) FPGAArea 36k memory Frequency Throughput

[slices] blocks [MHz] [Mbits/s]Latif et al. [17]‡ Skein-256-256 xc5vlx110-3 821 Not specified 119 1610

Jungk [18]‡ Skein-512-256 xc5v 555 – 271 237

Jungk [19]‡ Skein-512-256 xc6v 406 – 318 277Kaps et al. [20] Skein-512-256 xc6vlx75t-1 207 1 166 17Kaps et al. [20] Skein-512-256 xc6vlx75t-1 193 – 193 21

Kerckhof et al. [5]‡ Skein-512-512 xc6vlx75t-1 240 – 160 179

Aumasson et al. [1] BLAKE-256 xc5vlx110 390 – 91 412Jungk [19] BLAKE-256 xc6v 235 – 231 518Jungk [19] BLAKE-256 xc6v 404 – 185 823Kaps et al. [20] BLAKE-256 xc6vlx75t-1 163 1 197 327Kaps et al. [20] BLAKE-256 xc6vlx75t-1 166 – 268 445Aumasson et al. [1] BLAKE-512 xc5vlx110 939 – 59 468Kerckhof et al. [5] BLAKE-512 xc6vlx75t-1 192 – 240 183

†Without output transformation ‡Single call to Threefish-512

Table XIPLACE-AND-ROUTE RESULTS FOR HASHING AND AES ENCRYPTION ON A VIRTEX-6 FPGA (XC6VLX75T-2).

FPGAArea Frequency Throughput [Mbits/s]

[slices] [MHz] AES-128 AES-192 AES-256 256-bit digest 512-bit digestAES & ECHO [15] 155 397 219 186 161 92 48AES & Grøstl [14] 169 393 217 184 159 92 69

[4] J. Zhai, C. Park, and G.-N. Wang, “Hash-based RFID security protocolusing randomly key-changed identification procedure,” in ComputationalScience and Its Applications–ICCSA 2006, ser. Lecture Notes in Com-puter Science, M. Gavrilova, O. Gervasi, V. Kumar, C. K. Tan, D. Taniar,A. Lagana, Y. Mun, and H. Choo, Eds., no. 3983. Springer, 2006, pp.296–305.

[5] S. Kerckhof, F. Durvaux, N. Veyrat-Charvillon, F. Regazzoni,G. Meurice de Dormale, and F.-X. Standaert, “Compact FPGA imple-mentations of the five SHA-3 finalists,” in Proceedings of the ECRYPTII Hash Workshop, 2011.

[6] N. At, J.-L. Beuchat, and I. San, “Compact implementation of Threefishand Skein on FPGA,” in Proceedings of the Fifth IFIP InternationalConference on New Technologies, Mobility and Security–NTMS 2012,A. Levi, M. Badra, M. Cesana, M. Ghassemian, O. Gurbuz, N. Jabeur,M. Klonowski, A. Mana, S. Sargento, and S.Zeadally, Eds. IEEEeXpress Conference Publishing, 2012.

[7] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, “Compact implementationsof BLAKE-32 and BLAKE-64 on FPGA,” in Proceedings of the2010 International Conference on Field-Programmable Technology–FPT 2010, J. Bian, Q. Zhou, and K. Zhao, Eds. IEEE Press, 2010, pp.170–177.

[8] D. Bernstein, “The Salsa20 family of stream ciphers,” Dec. 2007,available at http://cr.yp.to/snuffle/salsafamily-20071225.pdf.

[9] J.-P. Aumasson, S. Fischer, S. Khazaei, W. Meier, and C. Rechberger,“New features of Latin dances: Analysis of Salsa, ChaCha, and Rumba,”in Fast Software Encryption–FSE 2008, ser. Lecture Notes in ComputerScience, K. Nyberg, Ed., vol. 5086. Springer, 2008, pp. 470–488.

[10] T. Ishiguro, S. Kiyomoto, and Y. Miyake, “Latin dances revisited:New analytic results of Salsa20 and ChaCha,” in Information andCommunications Security–ICICS 2011, ser. Lecture Notes in ComputerScience, S. Qing, W. Susilo, G. Wang, and D. Liu, Eds., vol. 7043.Springer, 2011, pp. 255–266.

[11] T. Ishiguro, “Modified version of ”Latin dances revisited: New analyticresults of Salsa20 and ChaCha”,” 2012, cryptology ePrint Archive,Report 2012/065.

[12] B. Baldwin, A. Byrne, L. Lu, M. Hamilton, N. Hanley, M. O’Neill, andW. Marnane, “A hardware wrapper for the SHA-3 hash algorithms,”2010, cryptology ePrint Archive, Report 2010/124.

[13] I. San and N. At, “Compact Keccak hardware architecture for dataintegrity and authentication on FPGAs,” Information Security Journal:A Global Perspective, vol. 21, no. 5, pp. 231–242, 2012.

[14] N. At, J.-L. Beuchat, E. Okamoto, I. San, and T. Yamazaki, “A low-areaunified hardware architecture for the AES and the cryptographic hashfunction Grøstl,” 2012, cryptology ePrint Archive, Report 2012/535.

[15] J.-L. Beuchat, E. Okamoto, and T. Yamazaki, “A low-area unifiedhardware architecture for the AES and the cryptographic hash functionECHO,” Journal of Cryptographic Engineering, vol. 1, no. 2, pp. 101–121, 2011.

[16] H. Warren, Hacker’s Delight. Addison-Wesley, 2002.[17] K. Latif, M. Tariq, A. Aziz, and A. Mahboob, “Efficient hardware

implementation of secure hash algorithm (SHA-3) finalist - Skein,” inProceedings of the International Conference on Computer, Communica-tion, Control and Automation–3CA2011, 2011.

[18] B. Jungk, “Compact implementations of Grøstl, JH and Skein forFPGAs,” in Proceedings of the ECRYPT II Hash Workshop, 2011.

[19] ——, “Evaluation of compact FPGA implementations for all SHA-3finalists,” in The Third SHA-3 Candidate Conference, Mar. 2012.

[20] J.-P. Kaps, P. Yalla, K. Surapathi, B. Habib, S. Vadlamudi, and S. Gu-rung, “Lightweight implementations of SHA-3 finalists on FPGAs,” inThe Third SHA-3 Candidate Conference, Mar. 2012.

[21] H. Technology, “FULL DATASHEET–Tiny hash core family for XilinxFPGA,” revision 2.0 (11/06/2010).

[22] K. Jarvinen, “Sharing resources between AES and the SHA-3 secondround candidates Fugue and Grøstl,” in The Second SHA-3 CandidateConference, Aug. 2010.

[23] M. Rogawski and K. Gaj, “A high-speed unified hardware architecturefor AES and the SHA-3 candidate Grøstl,” in Proceedings of the 15thEuromicro Conference on Digital System Design, Sep. 2012.

[24] The Keccak Team, Personal communication, Sep. 2012.