
Codesigned On-Chip Logic Minimization

Roman Lysecky & Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine

This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.

Introduction (On-chip Logic Minimization)

[Figure: System-on-chip block diagram. Labels: ARM7, MEM, DMA, On-chip Minimizer (Proc., MEM, I$, D$).]

1. Initialize Minimizer
2. Execute Minimizer
3. Indicate Completion

On-Chip Minimization Applications (IP Routing Table Reduction)

[Figure: Routing lookup. The destination IP of an incoming packet (138.23.16.9) is looked up in the routing table; the longest prefix match, 138.23.16.x, selects Port 7.]

Prefix         Next hop
125.x.x.x      Port 3
138.23.x.x     Port 5
138.23.16.x    Port 7
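The lookup in the routing example can be sketched in software as a linear scan that keeps the longest matching prefix. This is only an illustration of longest prefix matching, not the router hardware discussed on the slide; the `Route` type and the `ipv4`, `prefix_mask`, and `lookup_port` names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing entry: a prefix, its length in bits, and a next hop. */
typedef struct {
    uint32_t prefix;  /* network address, most significant octet first */
    unsigned length;  /* number of leading bits that must match (0..32) */
    int      port;    /* next hop */
} Route;

/* Pack four octets into one 32-bit address. */
uint32_t ipv4(unsigned a, unsigned b, unsigned c, unsigned d)
{
    return ((uint32_t)a << 24) | ((uint32_t)b << 16) | ((uint32_t)c << 8) | (uint32_t)d;
}

/* Mask with the top `length` bits set; length 0 matches everything. */
static uint32_t prefix_mask(unsigned length)
{
    assert(length <= 32);
    return length == 0 ? 0u : 0xFFFFFFFFu << (32 - length);
}

/* Return the port of the longest matching prefix, or -1 if nothing matches. */
int lookup_port(const Route *table, size_t count, uint32_t dest)
{
    int best_port = -1;
    int best_len = -1;
    for (size_t i = 0; i < count; i++) {
        uint32_t mask = prefix_mask(table[i].length);
        if ((dest & mask) == (table[i].prefix & mask) &&
            (int)table[i].length > best_len) {
            best_len = (int)table[i].length;
            best_port = table[i].port;
        }
    }
    return best_port;
}
```

With the three entries from the slide (125.x.x.x, 138.23.x.x, 138.23.16.x), looking up 138.23.16.9 matches the last two prefixes and selects the 24-bit one, so the result is Port 7.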

IP routing table reduction
• Routing tables of large network routers have over 30,000 entries
• Fast IP routing lookup is difficult without using large hardware resources

Ternary CAM (McAuley & Francis, 1993)
• TCAM can be used to perform routing table lookup in a single cycle
• Requires large hardware resources and high power consumption

Mask Extension (Liu, 2002)
• Uses two-level logic minimization to reduce the size of the routing table
• Good results, but did not consider off-chip communication
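To give a feel for how logic minimization shrinks a table, the sketch below merges two ternary entries that share a next hop and differ in exactly one cared-for bit. It is a simplified illustration of a single minimization step, not Liu's mask extension algorithm; the `TernaryEntry` type and `try_merge` function are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical ternary entry: mask bit 1 = "care", 0 = "don't care". */
typedef struct {
    uint32_t value;
    uint32_t mask;
    int      port;
} TernaryEntry;

/* Merge a and b into *out if they have the same port and mask and their
 * values differ in exactly one cared-for bit. Returns 1 on success. */
int try_merge(const TernaryEntry *a, const TernaryEntry *b, TernaryEntry *out)
{
    assert(a && b && out);
    if (a->port != b->port || a->mask != b->mask)
        return 0;

    uint32_t diff = (a->value ^ b->value) & a->mask;
    /* diff must be a power of two: exactly one differing bit. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return 0;

    out->mask  = a->mask & ~diff;     /* the differing bit becomes don't care */
    out->value = a->value & out->mask;
    out->port  = a->port;
    return 1;
}
```

For example, 138.23.16.x and 138.23.48.x with the same next hop collapse into one entry whose mask is 0xFFFFDF00. The merged mask is no longer a contiguous prefix, which a ternary CAM can still store.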

On-Chip Minimization Applications (Access Control List Reduction)

Access Control List (ACL)
• Used to restrict IP traffic through network routers
• ACL size can range anywhere from 300 (UCR CS&E Dept.) to 10,000 (AOL)
• Common use is to block a particular protocol or port number to avoid attacks such as Denial of Service attacks

ACL Minimization
• Similar approach as used for IP routing table reduction
• However, the order of the list must be preserved

ACL Input Format

Type | Protocol | In IP | Out Port | In Port | Out IP | Action
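A short sketch shows why list order matters when an ACL is evaluated with first-match semantics. The rule layout here is a hypothetical simplification (one IP field only, not the full input format), and the default deny when no rule matches is an assumption.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { ACL_DENY = 0, ACL_PERMIT = 1 } AclAction;

/* Hypothetical, simplified rule: match on a masked IP only. */
typedef struct {
    uint32_t  ip_value;
    uint32_t  ip_mask;   /* 1 bits must match */
    AclAction action;
} AclRule;

/* First matching rule decides; assumed default is deny. */
AclAction acl_eval(const AclRule *rules, size_t count, uint32_t ip)
{
    assert(rules != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        if ((ip & rules[i].ip_mask) == (rules[i].ip_value & rules[i].ip_mask))
            return rules[i].action;
    }
    return ACL_DENY;
}
```

If a list denies 138.23.16.x first and then permits 138.23.x.x, a packet from 138.23.16.9 is denied; swapping the two rules permits it. A minimizer that reorders or merges overlapping rules with different actions would therefore change the ACL's behavior.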

On-Chip Minimization Applications (Dynamic Hardware/Software Partitioning)

Dynamic hardware/software partitioning (JIT compilation for FPGAs)
• Dynamically detects frequently executed loops and re-implements the software loops using on-chip configurable logic
• Requires logic synthesis tools to be embedded on-chip

[Figure: Warp Processor block diagram. Labels: MIPS/ARM, I$, D$, Profiler, Configurable Logic, Dynamic Partitioning Module.]

ROCM

On-chip Logic Minimization Requirements
• Limited data and instruction memory available
• Quality of results must still be close to optimal
• Execution time should remain reasonable

On-chip Logic Minimization Goal
• Focus on developing an on-chip logic minimization tool that produces acceptable results with reasonable increases in execution time while using limited memory resources

ROCM: Riverside On-Chip Minimizer
• Two-level minimization tool
• Uses a combination of approaches from Espresso-II (Brayton, et al. 1984) and Presto (Svoboda & White, 1979)
• Eliminates the need to compute the off-set, to reduce memory usage
• Uses a single expand phase instead of multiple iterations
• On average only 2% larger than the optimal solution
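The idea of expanding without an off-set can be sketched as follows: a literal is raised to don't care only if the enlarged cube is still covered by the on-set. This is a minimal sketch, not ROCM's implementation. The 2-bit cube encoding is an assumption, all names are hypothetical, and the coverage test is a brute-force enumeration of minterms standing in for a real tautology check.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding: 2 bits per input, input v at bits 2v+1..2v.
 * 01 = input must be 0, 10 = input must be 1, 11 = don't care. */
static int cube_contains(uint64_t cube, uint32_t minterm, unsigned n)
{
    for (unsigned v = 0; v < n; v++) {
        unsigned bit = (minterm >> v) & 1u;
        unsigned lit = (unsigned)(cube >> (2 * v)) & 3u;
        if (!(lit & (1u << bit)))
            return 0;
    }
    return 1;
}

/* Brute-force stand-in for a tautology check: is every minterm of
 * `cube` contained in some cube of the on-set? No off-set is needed. */
static int covered_by(uint64_t cube, const uint64_t *on, size_t count, unsigned n)
{
    assert(n <= 16);
    for (uint32_t m = 0; m < (1u << n); m++) {
        if (!cube_contains(cube, m, n))
            continue;
        int hit = 0;
        for (size_t i = 0; i < count && !hit; i++)
            hit = cube_contains(on[i], m, n);
        if (!hit)
            return 0;
    }
    return 1;
}

/* One expand pass: try to raise each literal of each cube in `cover`. */
void expand_once(uint64_t *cover, size_t count,
                 const uint64_t *on, size_t on_count, unsigned n)
{
    for (size_t i = 0; i < count; i++) {
        for (unsigned v = 0; v < n; v++) {
            uint64_t raised = cover[i] | (3ULL << (2 * v));
            if (raised != cover[i] && covered_by(raised, on, on_count, n))
                cover[i] = raised;
        }
    }
}
```

For f = x0·x1' + x0·x1, both cubes expand to x0 in one pass. Removing the resulting duplicate or redundant cubes is a separate step that this sketch leaves out.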

ROCM Results (Performance/Memory Usage)

[Figure: Three bar charts comparing Espresso-Exact, Espresso-II, ROCM, and ROCM (ARM7): execution time (seconds), code memory (KB), and data memory (KB). Platforms: 500 MHz Sun Ultra60 and 40 MHz ARM7 (Triscend A7).]

• ROCM executing on 40 MHz ARM7 requires less than 1 second
• Small code size of only 22 kilobytes
• Average data memory usage of only 1 megabyte

Codesign ROCM (Hardware Coprocessor)

Function/Loop   Total Instr.   Loop Instr.   Loop Time %   Loop Size %   Speedup Bound
DoesInter       15401          34            42.7%         0.2%          1.7
SetLit          15401          23            5.7%          0.2%          1.1
GetLit          15401          8             8.8%          0.1%          1.1
IsCov           15401          67            3.4%          0.4%          1.0
Tautology.1     15401          67            1.7%          0.4%          1.0
Cofactor.1      15401          57            28.5%         0.4%          1.4
Overall         15401          256           90.7%         1.7%          10.8

• Customized ROCM enables us to develop an efficient hardware coprocessor
• Profiled the execution of ROCM-32 and ROCM-128 using the ARM port of the SimpleScalar simulator
• Determined critical loops/functions that are suitable for implementation in hardware
• Identified six critical kernels that comprised 91% of the total execution time but only 2% of the code size
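The Speedup Bound column of the profiling table appears consistent with Amdahl's law: if a kernel takes a fraction f of execution time and is made infinitely fast, the overall speedup is at most 1 / (1 - f). The slides do not name the formula, so this is an inference from the numbers; `speedup_bound` is a hypothetical helper.

```c
#include <assert.h>

/* Upper bound on overall speedup when a fraction of execution time
 * (0 <= fraction < 1) is reduced to zero: 1 / (1 - fraction). */
double speedup_bound(double fraction)
{
    assert(fraction >= 0.0 && fraction < 1.0);
    return 1.0 / (1.0 - fraction);
}
```

For DoesInter, `speedup_bound(0.427)` is about 1.75, and for all six kernels together, `speedup_bound(0.907)` is about 10.75, in line with the tabulated 1.7 and 10.8.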

Codesign ROCM (Minimization Coprocessor)

[Figure: On-chip minimizer block diagram. Labels: ARM7, MEM, Min. Coproc. The minimization coprocessor sits behind a Proc/Mem interface (data, addr) and contains the DoesIntersect, IsCov, GetLit, SetLit, Tautology.1, and Cofactor.1 kernels.]

[Figure: DoesIntersect datapath. Labels: inputs aImpl, dImpl, numLits; shifters (<<, << 1); 32 (odd) and 32 (even); widths 64, 64, 5; comparison == 0; output retVal.]
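A software analogue of the DoesIntersect kernel can be written with a few bitwise operations, which hints at why it maps well to hardware. The sketch assumes a 2-bit-per-literal cube encoding (01 = complemented, 10 = uncomplemented, 11 = don't care) packed into 64 bits. The slides do not state the encoding, so the representation and the function name `does_intersect` are assumptions, not the coprocessor's actual datapath.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed encoding: literal v occupies bits 2v+1..2v of a 64-bit word.
 * 01 = complemented, 10 = uncomplemented, 11 = don't care, 00 = empty.
 * Two cubes intersect iff no literal of their bitwise AND is empty. */
int does_intersect(uint64_t a, uint64_t b, unsigned num_lits)
{
    assert(num_lits >= 1 && num_lits <= 32);

    /* One marker bit (the even bit) per literal in use. */
    uint64_t used = (num_lits == 32) ? ~0ULL : ((1ULL << (2 * num_lits)) - 1);
    uint64_t even = used & 0x5555555555555555ULL;

    uint64_t both = a & b;
    /* Fold each pair's odd bit onto its even bit: set iff the pair is not 00. */
    uint64_t nonempty = (both | (both >> 1)) & even;

    return nonempty == even;
}
```

With two literals, the cube x0·x1' (binary 0110) and the cube x0' (binary 1101) share no minterm, so the function returns 0. The cube x0·x1' and the cube x1' (binary 0111) do intersect, so it returns 1.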

Codesign ROCM Results (Execution Time)

[Figure: Bar chart of execution time (s), axis 0.0 to 1.8, for IP Routing Table Reduction (32), ACL Reduction (128), and Logic Synthesis (128). Series: SW (75 MHz) and HW/SW (200/75 MHz).]

• Average speedup of 7.8

Codesign ROCM Results (Energy Consumption)

[Figure: Bar chart of energy consumption (mJ), axis 0 to 90, for IP Routing Table Reduction (32), ACL Reduction (128), and Logic Synthesis (128). Series: SW (75 MHz) and HW/SW (200/75 MHz).]

• Average energy reduction of 59.2%


• Software modifications were required to achieve speedup of 7.8
• Data structures/algorithms not suitable for hardware implementation
• Reorganized data structures
• Customized width of data items
• Eliminated memory allocation within critical regions
• Not automated with current hardware/software partitioning tools


Original C Code

// Move to HW: this loop is 28.5% of total exec. time
for (i = 0; i < F->numImplicants; i++) {
    if (!DoesIntersect(implicant, xj)) continue;

    for (k = 0; k < xj->numLiterals; k++) {
        // determine coImplicant...
    }

    // Only 3.5% of total exec. time, but requires dynamic memory allocation
    AddImplicant(cofactor, &coImplicant);
}


Modified C Code

// determine size of cofactor initially
cofactorSize = 0;
for (i = 0; i < F->numImplicants; i++) {
    if (!DoesIntersect(implicant, xj)) continue;
    cofactorSize++;
}

// allocate all memory outside of main loop
cofactor->implicants = malloc(…);

for (i = 0; i < F->numImplicants; i++) {
    if (!DoesIntersect(implicant, xj)) continue;

    // additional initialization code needed for each iteration
    coImplicant = &(cofactor->implicants[index++]);

    for (k = 0; k < xj->numLiterals; k++) {
        ...
    }
}
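The count-then-allocate pattern from the modified code can be sketched as a self-contained function. This is an illustration under stated assumptions, not the ROCM source: cubes are packed 2 bits per literal into a `uint64_t` (01 = complemented, 10 = uncomplemented, 11 = don't care), the cofactor of an intersecting cube c against x is computed as c OR NOT x, and the `Cover` type and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical cover: an array of packed cubes. */
typedef struct {
    uint64_t *implicants;
    size_t    numImplicants;
} Cover;

/* Bits occupied by num_lits literals (2 bits each). */
static uint64_t used_bits(unsigned num_lits)
{
    assert(num_lits >= 1 && num_lits <= 32);
    return (num_lits == 32) ? ~0ULL : ((1ULL << (2 * num_lits)) - 1);
}

/* Cubes intersect iff no literal of their AND is 00. */
static int does_intersect(uint64_t a, uint64_t b, unsigned num_lits)
{
    uint64_t even = used_bits(num_lits) & 0x5555555555555555ULL;
    uint64_t both = a & b;
    return ((both | (both >> 1)) & even) == even;
}

/* Cofactor of F with respect to cube xj. Returns 0 on success and
 * -1 on allocation failure. The caller frees out->implicants. */
int cofactor_two_pass(const Cover *F, uint64_t xj, unsigned num_lits, Cover *out)
{
    size_t count = 0, index = 0;

    /* Pass 1: determine the size of the cofactor. */
    for (size_t i = 0; i < F->numImplicants; i++)
        if (does_intersect(F->implicants[i], xj, num_lits))
            count++;

    /* Single allocation, outside the main loop. */
    out->numImplicants = count;
    out->implicants = count ? malloc(count * sizeof *out->implicants) : NULL;
    if (count && !out->implicants)
        return -1;

    /* Pass 2: fill preallocated slots; no allocation in the critical loop. */
    for (size_t i = 0; i < F->numImplicants; i++) {
        if (!does_intersect(F->implicants[i], xj, num_lits))
            continue;
        out->implicants[index++] = (F->implicants[i] | ~xj) & used_bits(num_lits);
    }
    return 0;
}
```

The cost of this restructuring is that the intersection test runs twice per implicant. In exchange, the main loop contains only fixed-width bit operations and array writes, which is the kind of code that can be moved to hardware.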

Conclusions & Future Work
• Developed codesigned on-chip logic minimization
• Performance improvement of nearly 8X compared to earlier software-only implementation
• Energy reduction of almost 60%
• New directions in hardware/software partitioning
• Designer effort was required to rewrite algorithms and fine-tune data structures
• Could better hardware/software partitioning tools automate this?
