efficient implementation of sha-3 hash function on 8-bit avr … · 2020. 11. 27. · efficient...

Efficient Implementation of SHA-3 Hash Function on 8-bit AVR-based Sensor Nodes

YoungBeom Kim, Hojin Choi, Seog Chung SeoCryptography Optimization & Application Lab,

Department of Information Security, Cryptology, and Mathematics, Kookmin University

Cryptography Optimization & Application Lab

Contents

• Introduction

• Memory optimization

• Chaining optimization methodology

• Experimental result

• Conclusions


Introduction


Some Context• Hash Function provides data integrity

• Fatal reverse attack has been filed against the existing SHA-2 Family

• The importance and demand of SHA-3 is increasing

• No single implementation method is more efficient than all others on ever possible platforms

• Existing efficient designs are usually hardware or specific architectures (Parallel system) oriented

• SHA-3 is a core algorithm used in MAC, digest, digital signature, DRBG, PQC, and so on.

• General software optimization method for various platforms is an important issue

• As 5G industry increases, a efficient implementation method of SHA-3 in embedded devices is important.


Overview of SHA-3• Keccak algorithm selected to be next-generation hash function in SHA-3 competition held by NIST

• SHA-3 based on Sponge structure• Absorbing Process : Compressing message and updating internal state by 𝑓𝑓-function• Squeezing Process : Computing digest

Fig. 1: Overview of Sponge structure


Overview of SHA-3• 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 of 𝑓𝑓-function is a three-dimensional 𝑥𝑥 × 𝑦𝑦 × 𝑧𝑧 matrix

• Row 𝑥𝑥 and Column 𝑦𝑦 are both fixed to five• Consisting of 25 lanes

Fig. 2: State of SHA-3


Overview of SHA-3• 𝜃𝜃 process

• XOR each bit in 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 with parties of two columns • XORing sum of columns ((𝑥𝑥 − 1),𝑧𝑧) and ((𝑥𝑥 + 1),(𝑧𝑧 − 1))

Alg. 1: Algorithm of 𝜃𝜃 process Fig. 3: Overview of 𝜃𝜃 process


Overview of SHA-3• 𝜋𝜋 process

• Rearranging the positions of the lanes• Not changing value of lanes

Fig. 4: Overview of 𝜋𝜋 process Fig. 5: Detail Structure of 𝜋𝜋 process

Alg. 2: Algorithm of 𝜋𝜋 process


Overview of SHA-3• 𝜌𝜌 process

• Right-rotating the bits of each lane as much as offset • Not changing position of lanes• Implemented in combination with 𝜋𝜋 process in standard implementation method

Fig. 5: Overview of 𝜌𝜌 process Alg. 3: Algorithm of 𝜌𝜌 process


Overview of SHA-3• 𝜒𝜒 process

• XORing each bit with a nonlinear function of two other bits in its row• Operating in row form

• 𝜄𝜄 process• XORing Round-constant and S[12] of 𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠• Operating for single lane

Fig. 6: Overview of 𝜒𝜒 process Alg. 4: Algorithm of 𝜒𝜒 process


Standard Method• The standard implementation method of SHA-3 follows as: 𝜃𝜃 → 𝜋𝜋~𝜌𝜌 → 𝜒𝜒~𝜄𝜄• Combing 𝜋𝜋 process and 𝜌𝜌 process into 𝜋𝜋~𝜌𝜌 process• Accessing 7 times to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 during 𝑓𝑓-function

• When b = 1600, 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 is 200 bytes and 𝑓𝑓-function comprise 24 round • Requiring 168 memory access to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 during 𝑓𝑓-function

• Memory access cause higher overhead than arithmetic and logical operations in low-end-processor

StandardMethod Initial 𝜽𝜽 𝜽𝜽 process 𝝅𝝅~𝝆𝝆 process 𝝌𝝌~𝜾𝜾 process Total Access

Load O O O O7 times

Store X O O O

Table. 1: Number of memory access to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 in Standard Method


Memory Optimization


Memory Optimization• Proposed implementation method of SHA-3 follows as: 𝜃𝜃~𝜌𝜌 (𝜋𝜋) → 𝜒𝜒~𝜄𝜄• Implementing 𝝅𝝅 process implicitly in 𝜃𝜃~𝜌𝜌 process• Combing 𝜃𝜃 process and 𝜌𝜌 process into 𝜃𝜃 ~𝜌𝜌 process• Accessing 5 times to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 during 𝑓𝑓-function

• Requiring 120 memory access to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 during 𝑓𝑓-function• Proposed Method : 120 < Standard Method 168

• Reducing memory access twice compared to the standard implementation method

ProposedMethod Initial 𝜽𝜽 𝜽𝜽~𝝆𝝆 process 𝝅𝝅 process 𝝌𝝌~𝜾𝜾 process Total Access

Load O O X (Implicitly) O5 times

Store X O X (Implicitly) O

Table. 2: Number of memory access to 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 in Proposed Method


Memory Optimization• 𝜽𝜽 and 𝝆𝝆 process execute independent operation for lane

• Appling 𝝆𝝆 process before storing in 𝜽𝜽 process

• Appling 𝝅𝝅 process implicitly when updating state (store)• 𝝅𝝅 process is a rearrange process for each lane• 𝝅𝝅 process can be executed implicitly

• Memory address translation operation occurs only once• 𝜽𝜽 and 𝝅𝝅~ 𝝆𝝆 process require twice translation in standard method• Standard Method : 𝜽𝜽 → 𝝅𝝅~𝝆𝝆; twice• Proposed Method : 𝜽𝜽~𝝆𝝆 𝝅𝝅 ; once Fig. 7: Overview of Proposed Method


Chaining optimization methodology


Target Platforms

• 8-bit AVR MCUs• ATmega 128• Popularly used in WSNs (Wireless Sensor Networks)

• Spec of ATmega 128• Flash Memory : 128 KB• SRAM : 4KB• EEPROM : 4KB• 32 8-bit general-purpose registers

Fig. 8: ATmega 128


Register Scheduling• The generally used parameter is b = 1600, where the state is 200 bytes• R8-R15 and R16-R23 hold two lanes• R2-R5 are used to translate the memory address

• Initial 𝜽𝜽 and lanes of 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠

Fig. 9: Register Scheduling for Proposed Method in 8-bit AVR MCUs


• To apply 𝝅𝝅 process implicitly, we propose a Chaining optimization methodology in 8-bit AVR• Data Load to register (𝜽𝜽 process) Memory translation in register (𝝅𝝅 process) Data Store to Memory (𝝆𝝆 process)• 𝜽𝜽~𝝆𝝆 (𝝅𝝅) process uses R8-R15, R16-R23 alternately we call it “Chain Implementation”

• 𝑆𝑆𝑠𝑠𝑠𝑠𝑠𝑠𝑠𝑠 (200 bytes) cannot be held in the register operating lane unit• Here, memory address translation (cost 𝛼𝛼) is occurred in each process• Combining 𝜽𝜽~𝝆𝝆 process, memory address translation cost reduced to two times (4 2)

StandardMethod Initial 𝜽𝜽 𝜽𝜽 process

𝝅𝝅~𝝆𝝆process

Load O + 𝛼𝛼 O + 𝛼𝛼 O + 𝛼𝛼

Store X O O + 𝛼𝛼

ProposedMethod Initial 𝜽𝜽

𝜽𝜽~𝝆𝝆process 𝝅𝝅 process

Load O + 𝛼𝛼 O + 𝛼𝛼 X (Implicitly)

Store X O X (Implicitly)




Fig. 11: Proposed ImplementationTemp : Empty

State : S’ [0]


Temp : S’[4]

State : S’ [0]

Fig. 11: Proposed Implementation



Temp : S’[4]

State : S’[14]




Temp : S’[17]

State : S’[14]



Temp : S’[17]

State : S’[15]



Temp : S’[8]

State : S’[5]



Experimental Result


• 25.7% performance improvement over Balasch et al.’s implementation• Our Work is the fastest implementation of SHA-3 in 8-bit AVR microcontroller• Narrowing the difference in performance by about two times compared to the SHA-2 Family

• Existing implementations have nearly three times the difference in performance

Experimental Result

Reference Algorithm LanguageLength of message byte

50 byte 100 byte 500 byte

This Work SHA-3 (256-bit) Asm 2667(+25.1%)1333

(+25.7%)1073

(+25.0%)

Otte et al. SHA-3 (256-bit) C, Asm 12854 6427 1672

Balasch et al. SHA-3 (256-bit) Asm 3560( - )1795 ( - )

1432( - )

Balasch et al. SHA-256 Asm 672 668 532

Balasch et al. Blake (256-bit) Asm 714 708 562

Balasch et al. Photon (256-bit) Asm 9723 7892 4788

Table. 3: Performance of SHA-3 by hash rate (CPB), when hashing a byte of various message in 8-bit AVR


Conclusion


• We introduced a new generic fast implementation method of SHA-3

• Proposed Method not requires a lookup table or additional operations

• Proposed Chaining optimization methodology of SHA-3 is the fastest implementation

• Our Work is efficiently applicable in PQC, DRBG, MAC, and so on

• Our Work is a generic method that can be a applied to various platforms

Conclusion


Question?Contact me : darania @ kookmin.ac.kr


Thank You~Cryptography Optimization & Application Lab

Efficient Implementation of SHA-3 Hash Function on 8-bit AVR-based Sensor Nodes�ContentsIntroductionSome ContextOverview of SHA-3Overview of SHA-3Overview of SHA-3Overview of SHA-3Overview of SHA-3Overview of SHA-3Standard MethodMemory OptimizationMemory OptimizationMemory OptimizationChaining optimization methodologyTarget PlatformsRegister SchedulingChaining optimization methodologyChaining optimization methodologyChaining optimization methodologyChaining optimization methodology슬라이드 번호 22슬라이드 번호 23Chaining optimization methodologyExperimental ResultExperimental ResultConclusionConclusionQuestion?Thank You~

efficient implementation of sha-3 hash function on 8-bit avr … · 2020. 11. 27. · efficient...

Documents