digital signature using md5 algorithm hardware acceleration final presentation students: eyal mendel...

Post on 25-Dec-2015

Digital signature using MD5 algorithm Hardware Acceleration

Final Presentation

Students: Eyal Mendel & Aleks Dyskin

Instructor: Evgeny Fiksman

High Speed Digital Systems Laboratory

Agenda

Introduction

HW/SW System Design

Performance Evaluation

Conclusions & Summary

Introduction

Project Goals

Evaluation of the C-to-FPGA technique

Case study: MD5 algorithm

Tool: ASC – A Stream Compiler

Hardware Accelerator Design & Implementation

MD5 Goals/Usage

Goal: The MD5 (Message Digest 5) algorithm is intended for digital signature applications, where a large file must be "compressed" in a secure manner before being encrypted with a private (secret) key under a public-key cryptosystem.

Usage: MD5 is widely used as a cryptographic hash function. As the Internet standard RFC 1321, MD5 has been employed in a wide variety of security applications and is commonly used to check the integrity of files.

MD5 steps (1)

Step 1: Append Padding Bits

The message is "padded" so that its length (in bits) is congruent to 448, modulo 512.

Step 2: Append Length

A 64-bit representation of b (the length of the message before the padding bits were added) is appended to the result of the previous step.
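Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration following RFC 1321, not the project's C/ASC code; the function name `md5_pad` is an assumption made for this example.

```python
import struct

def md5_pad(message: bytes) -> bytes:
    """Return the message with MD5 padding and the 64-bit length appended."""
    # b = original length in bits, kept modulo 2**64 as RFC 1321 specifies.
    bit_len = (8 * len(message)) & 0xFFFFFFFFFFFFFFFF
    # Step 1: append a single 1 bit (byte 0x80), then 0 bits until the length
    # is congruent to 448 modulo 512 bits, i.e. 56 modulo 64 bytes.
    padded = message + b"\x80"
    padded += b"\x00" * ((56 - len(padded)) % 64)
    # Step 2: append b as a 64-bit little-endian integer.
    return padded + struct.pack("<Q", bit_len)

if __name__ == "__main__":
    for msg in (b"a", b"message digest", b"x" * 56):
        print(len(msg), "bytes ->", len(md5_pad(msg)), "bytes after padding")
```

The result is always a whole number of 512-bit (64-byte) blocks, which is what the processing step consumes.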

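MD5 steps (2)

Step 3: Initialize MD buffer

a = 0x67452301; b = 0xefcdab89; c = 0x98badcfe; d = 0x10325476

Step 4-5: Process message in 16-word blocks and Output

Auxiliary functions used in step 4:

$$F(x, y, z) = (x \wedge y) \vee (\neg x \wedge z)$$

$$G(x, y, z) = (x \wedge z) \vee (y \wedge \neg z)$$

$$H(x, y, z) = x \oplus y \oplus z$$

$$I(x, y, z) = y \oplus (x \vee \neg z)$$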
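The four auxiliary functions and the processing of one 16-word block can be sketched as below. This is a plain software model following RFC 1321, not the ASC hardware description; the names `process_block`, `rotl` and `digest_hex` are assumptions for this example, and the T-table is generated from the sine function as the RFC defines it.

```python
import math
import struct

MASK = 0xFFFFFFFF  # keep every intermediate value within 32 bits

def F(x, y, z): return (x & y) | (~x & z)
def G(x, y, z): return (x & z) | (y & ~z)
def H(x, y, z): return x ^ y ^ z
def I(x, y, z): return y ^ ((x | ~z) & MASK)

def rotl(x, s):
    """Rotate a 32-bit value left by s bits."""
    return ((x << s) | (x >> (32 - s))) & MASK

# T-table: T[i] = floor(2^32 * |sin(i + 1)|), 64 entries of 32 bits each.
T = [int(2**32 * abs(math.sin(i + 1))) & MASK for i in range(64)]
# Per-step rotation amounts for the four rounds.
S = ([7, 12, 17, 22] * 4 + [5, 9, 14, 20] * 4 +
     [4, 11, 16, 23] * 4 + [6, 10, 15, 21] * 4)
INIT = (0x67452301, 0xEFCDAB89, 0x98BADCFE, 0x10325476)

def process_block(state, block):
    """Step 4: update (a, b, c, d) with one 64-byte block."""
    x = struct.unpack("<16I", block)  # sixteen little-endian 32-bit words
    a, b, c, d = state
    for i in range(64):
        if i < 16:
            f, k = F(b, c, d), i
        elif i < 32:
            f, k = G(b, c, d), (5 * i + 1) % 16
        elif i < 48:
            f, k = H(b, c, d), (3 * i + 5) % 16
        else:
            f, k = I(b, c, d), (7 * i) % 16
        tmp = (a + f + T[i] + x[k]) & MASK
        a, d, c, b = d, c, b, (b + rotl(tmp, S[i])) & MASK
    return tuple((s + v) & MASK for s, v in zip(state, (a, b, c, d)))

def digest_hex(state):
    """Step 5: output a, b, c, d, low-order byte first."""
    return struct.pack("<4I", *state).hex()

if __name__ == "__main__":
    # The empty message pads to a single block: 0x80 followed by 63 zero bytes.
    empty_block = b"\x80" + b"\x00" * 63
    print(digest_hex(process_block(INIT, empty_block)))
```

The 64 steps inside `process_block` are the part that the deck moves into hardware; padding and output stay in software.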

ASC Overview

• ASC (A Stream Compiler) simplifies the exploration of hardware accelerators by turning the hardware design task into a software design process that uses only 'gcc' and 'make' to obtain a hardware netlist.

• A single C++ program with custom types and operators is the only syntax needed.

• ASC provides the complete environment and implements all the protocols needed for communication between the HW module and the CPU.

SW Model Evaluation (1)

• Maximum speedup in the ideal case (the accelerated part is assumed to take 0 sec to evaluate):

$$SU_{max} = \frac{1 - 0.49}{1 - 0.49 - 0.33} = \frac{0.51}{0.18} \approx 2.83$$

Accelerated part: 0.33

• The evaluation of the finish stage was done for the worst case, i.e. the append_bits step is performed. In the general case, append_bits is performed only once per file/string.

• All measurements were taken on a Xilinx PowerPC.

SW Model Evaluation (2)

For a large number of chunks, the total speedup will be:

$$\lim_{n \to \infty} SU_{total} = \lim_{n \to \infty} \frac{(n-1)\,T_{sw1} + T_{sw2}}{(n-1)\,T_{hw1} + T_{hw2}} = \lim_{n \to \infty} \frac{\left(1 - \frac{1}{n}\right) T_{sw1} + \frac{1}{n}\,T_{sw2}}{\left(1 - \frac{1}{n}\right) T_{hw1} + \frac{1}{n}\,T_{hw2}} = \frac{T_{sw1}}{T_{hw1}}$$

Where:

• n is the number of chunks

• T_sw1, T_hw1 are the average times of not_last chunk execution

• T_sw2, T_hw2 are the average times of the last chunk execution
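The limit can be checked numerically with a small sketch. The timing values below are hypothetical, chosen only to illustrate the behaviour, and are not measurements from the project; `total_speedup` is a name invented for this example.

```python
def total_speedup(n, t_sw1, t_hw1, t_sw2, t_hw2):
    """Ratio of total SW time to total HW time for a message of n chunks.

    The first n - 1 chunks cost t_sw1 / t_hw1 each; the last chunk
    (which includes append_bits) costs t_sw2 / t_hw2.
    """
    t_sw = (n - 1) * t_sw1 + t_sw2
    t_hw = (n - 1) * t_hw1 + t_hw2
    return t_sw / t_hw

if __name__ == "__main__":
    # Hypothetical per-chunk times in usec (assumptions, not measured data).
    timings = dict(t_sw1=30.0, t_hw1=10.0, t_sw2=90.0, t_hw2=66.0)
    for n in (1, 10, 100, 10_000):
        print(n, round(total_speedup(n, **timings), 4))
    # As n grows, the printed value approaches t_sw1 / t_hw1.
```

With one chunk the speedup is dominated by the last-chunk times; as n grows, the last chunk's contribution vanishes and the ratio tends to T_sw1 / T_hw1.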

System High-Level

Serial communication manager between PC and M310 board

This module serves as input/output of the system, starting and finishing the process.

Manages MD5 hardware interface.

SW reference module for comparison

Step 4 implementation

SW/HW System Design

SW/HW algorithm flow

HW Accelerator insights

Basic structure of the hardware module after the initial design “on paper”:

Processing Unit

Detailed explanation of one process cycle:

The process cycle runs 16 times per 512-bit input (16 × 32 bits = 512 bits).

Function Masking

Problem: which result is relevant for a given ‘i’?
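One way to picture the masking idea is the software model below: all four auxiliary functions are evaluated in parallel, and an all-ones or all-zeros mask derived from the step index picks the relevant result. The exact mask generation used in the hardware module is not shown in this transcript, so the mapping from `i` to a round (`i // 16`, with `i` in 0..63) and the name `select_result` are assumptions.

```python
MASK32 = 0xFFFFFFFF

def aux_all(x, y, z):
    """Evaluate F, G, H and I for the same inputs, as parallel HW units would."""
    f = (x & y) | (~x & z)
    g = (x & z) | (y & ~z)
    h = x ^ y ^ z
    i_ = y ^ ((x | ~z) & MASK32)
    return f, g, h, i_

def select_result(i, x, y, z):
    """Keep only the result that belongs to step i (assumed range 0..63)."""
    current_round = i // 16  # 0 -> F, 1 -> G, 2 -> H, 3 -> I (assumption)
    out = 0
    for k, result in enumerate(aux_all(x, y, z)):
        mask = MASK32 if k == current_round else 0  # all ones or all zeros
        out |= result & mask                        # AND with mask, OR together
    return out

if __name__ == "__main__":
    b, c, d = 0xEFCDAB89, 0x98BADCFE, 0x10325476
    for step in (0, 16, 32, 48):
        print(step, hex(select_result(step, b, c, d)))
```

The AND/OR structure behaves like a multiplexer, so no result has to be computed conditionally.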

T-Table access (1)

Every process cycle we need to fetch 32 × 4 = 128 bits from the T-table.

a. Problem: ASC supports only 32-bit-wide memories.

b. Using a 2-port BRAM results in 2 clock cycles.
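The 2-clock-cycle figure follows from simple port arithmetic, sketched below. The helper name `fetch_cycles` is made up for this example, and the model assumes each memory port delivers one word per clock cycle.

```python
import math

def fetch_cycles(bits_needed, word_bits, ports):
    """Clock cycles needed to read bits_needed bits from a memory that
    delivers one word_bits-wide word per port per cycle (assumed model)."""
    words = math.ceil(bits_needed / word_bits)
    return math.ceil(words / ports)

if __name__ == "__main__":
    # 4 x 32 bits per process cycle, 32-bit memory, dual-port BRAM.
    print(fetch_cycles(bits_needed=128, word_bits=32, ports=2))
    # The same read through a single port, for comparison.
    print(fetch_cycles(bits_needed=128, word_bits=32, ports=1))
```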

T-Table Access (2)

Performance Evaluation

HW Module Performance

One data process of 512 bits takes 680 ns (@ clock_freq = 100 MHz):

S_CYCLE = 4 clock cycles

S_LOOP = 16+1

$$\frac{S\_CYCLE \times S\_LOOP}{clock\_freq} = \frac{68\ \text{clock cycles}}{100\ \text{MHz}} = 680\ \text{ns}$$
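The same arithmetic as a short script, using only the figures from this slide. The throughput line is a derived value (512 bits per processed block), not a number reported in the presentation.

```python
S_CYCLE = 4        # clock cycles per process cycle
S_LOOP = 16 + 1    # loop count, as given on the slide
CLOCK_HZ = 100e6   # 100 MHz

cycles = S_CYCLE * S_LOOP                # clock cycles per 512-bit block
period_ns = 1e9 / CLOCK_HZ               # one clock period in nanoseconds
latency_ns = cycles * period_ns          # time to process one block
throughput_mbit_s = 512 / latency_ns * 1e3  # bits per ns converted to Mbit/s

if __name__ == "__main__":
    print(cycles, "clock cycles")
    print(latency_ns, "ns per 512-bit block")
    print(round(throughput_mbit_s, 1), "Mbit/s (derived)")
```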

Measurements (1)

String | SW Init. | SW Append | Finish_SW | SW Total | HW Init. | HW Append | Finish_HW | HW Total
‘a’ | 2.1 | 6.68 | 91.14 | 99.92 | 2.1 | 6.68 | 66.3+0.68=66.98 | 75.76
‘Aleks’ | 2.1 | 8.58 | 89.62 | 100.3 | 2.1 | 8.58 | 64.1+0.68=64.78 | 75.46
‘message digest’ | 2.1 | 13.1 | 86.2 | 101.4 | 2.1 | 13.1 | 57.2+0.68=57.88 | 73.08
All 56-byte strings | 2.1 | 8.77 | 73.24 | 84.11 | 2.1 | 8.77 | 50.1+0.68=50.78 | 61.65

All times are in usec.

Finish_SW = append_bits_SW + Process_SW + Output_SW

Finish_HW = append_bits_SW + Process_HW + Output_SW

Average speedup (HW vs. SW) = 1.34998 times
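The average can be reproduced from the totals in the table. The sketch below assumes the reported figure is the arithmetic mean of the four per-string SW/HW ratios; the variable names are chosen for this example.

```python
# (string, SW total, HW total), all in usec, taken from the table above.
totals = [
    ("a", 99.92, 75.76),
    ("Aleks", 100.3, 75.46),
    ("message digest", 101.4, 73.08),
    ("All 56-byte strings", 84.11, 61.65),
]

# Per-string speedup is the SW total divided by the HW total.
speedups = [sw / hw for _, sw, hw in totals]
average_speedup = sum(speedups) / len(speedups)

if __name__ == "__main__":
    for (name, _, _), su in zip(totals, speedups):
        print(f"{name}: {su:.3f}")
    print(f"average: {average_speedup:.5f}")
```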

Measurements (2)

String | SW Append bits | SW Process | SW Output | HW Append bits | HW Process | HW Output
‘a’ | 64.1 | 24.84 | 2.2 | 64.1 | 0.68 | 2.2
‘Aleks’ | 62 | 25.52 | 2.1 | 62 | 0.68 | 2.1
‘message digest’ | 55 | 29 | 2.2 | 55 | 0.68 | 2.2
All 56-byte strings | 47.9 | 23.14 | 2.2 | 47.9 | 0.68 | 2.2

All times are in usec. Columns show the breakdown of the Finish stage in software and in hardware.

Conclusions & Summary

Conclusions (1)

• x1.35 speedup with the HW implementation (worst case). The expected speedup in the ideal case for one chunk is:

$$\frac{1}{1 - 0.31} \approx 1.45$$

• A speedup larger than 1.35 can be achieved for large inputs with many chunks, where append_bits is evaluated only for the last chunk. In that case the ideal speedup is 2.83, while the measurements reach a speedup of ~2.75 (graph on the next slide).

• The ASC tool proved able to implement complicated hardware modules with only a few software commands, and its code is easy to read.

Conclusions (2): Speed Up Prediction

$$T_{software} = (n-1)\,T_{1s} + T_{2s}$$

$$T_{hardware} = (n-1)\,T_{1h} + T_{2h}$$

$$Su_{total} = \frac{(n-1) \cdot su_2 + su_1}{n}$$

Where:

• T1s, T1h are the average times of not_last chunk execution

• T2s, T2h are the average times of the last chunk execution

• su2 is the speedup for a not_last chunk

• su1 is the speedup for the last chunk

• n is the number of chunks
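The prediction formula can be evaluated for a growing number of chunks. The sketch plugs in the two speedups quoted in the conclusions (1.35 for the last chunk, 2.83 for a not_last chunk); the function name `predicted_speedup` and the chosen values of n are assumptions for this example.

```python
def predicted_speedup(n, su_last=1.35, su_not_last=2.83):
    """Su_total = ((n - 1) * su2 + su1) / n, with su1 the last-chunk speedup
    and su2 the not_last-chunk speedup."""
    return ((n - 1) * su_not_last + su_last) / n

if __name__ == "__main__":
    for n in (1, 2, 5, 10, 50, 1000):
        print(n, round(predicted_speedup(n), 3))
```

The predicted speedup starts at the single-chunk value and saturates at the not_last-chunk speedup, so it stays bounded however many chunks are processed.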

Summary

• We learned ASC: design approach, debugging and synthesis process.

• We showed the feasibility of an MD5 implementation with ASC:

• Implementation of the algorithm from pseudo code to hardware

• Masking mechanism

• Parallel processing and mux-ing of the appropriate result

• Overcoming the limitations of the hardware by a creative approach (memory imp.)

• Flow control

• Project goals were partially achieved

• The file version was not implemented

Further Work

• Further acceleration can be reached using a pipelined architecture.

• Further development of the file version.

The End

Thank you for your time.
