codesigned on-chip logic minimization
Post on 07-Jan-2016
Codesigned On-Chip Logic Minimization
Roman Lysecky & Frank Vahid*
Department of Computer Science and Engineering
University of California, Riverside
*Also with the Center for Embedded Computer Systems, UC Irvine
This work was supported in part by the National Science Foundation, the Semiconductor Research Corporation, and a Department of Education GAANN fellowship.
Introduction (On-chip Logic Minimization)
[Figure: System-on-chip block diagram. Labels: ARM7, I$, D$, MEM, DMA, On-chip Minimizer (Proc., MEM).]
1. Initialize Minimizer
2. Execute Minimizer
3. Indicate Completion
On-Chip Minimization Applications (IP Routing Table Reduction)
[Figure: Routing table lookup. The destination IP of an incoming IP packet (138.23.16.9) is looked up in the routing table; the longest prefix match selects Port 7.]

Prefix         Next hop
125.x.x.x      Port 3
138.23.x.x     Port 5
138.23.16.x    Port 7
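The lookup on this slide can be sketched in software as a linear longest-prefix-match scan. This is only an illustration under assumptions: the `Route` struct, the `IP4` macro, and the `lookup` function are hypothetical names, and the prefix lengths (8, 16, and 24 bits) are inferred from the `x` octets shown in the table. A router would normally use a TCAM or a trie rather than a linear scan.

```c
#include <stdint.h>
#include <stddef.h>

/* Pack four octets into one 32-bit IPv4 address (hypothetical helper). */
#define IP4(a, b, c, d) \
    (((uint32_t)(a) << 24) | ((uint32_t)(b) << 16) | ((uint32_t)(c) << 8) | (uint32_t)(d))

/* One routing-table entry: prefix bits, prefix length, next-hop port. */
typedef struct {
    uint32_t prefix;  /* prefix, left-aligned in the 32-bit address */
    int      length;  /* number of significant leading bits, 0..32 */
    int      port;    /* next-hop port */
} Route;

/* The three entries from the slide; lengths are assumed from the x octets. */
static const Route slide_table[] = {
    { IP4(125, 0, 0, 0),   8,  3 },
    { IP4(138, 23, 0, 0),  16, 5 },
    { IP4(138, 23, 16, 0), 24, 7 },
};

/* Return the port of the longest matching prefix, or -1 if nothing matches. */
static int lookup(const Route *table, size_t n, uint32_t dest)
{
    int best_len = -1;
    int best_port = -1;
    for (size_t i = 0; i < n; i++) {
        /* A zero-length prefix matches everything; this also avoids a shift by 32. */
        uint32_t mask = table[i].length == 0
                            ? 0u
                            : 0xFFFFFFFFu << (32 - table[i].length);
        if ((dest & mask) == (table[i].prefix & mask) && table[i].length > best_len) {
            best_len = table[i].length;
            best_port = table[i].port;
        }
    }
    return best_port;
}
```

Calling `lookup(slide_table, 3, IP4(138, 23, 16, 9))` matches both the 16-bit and the 24-bit entries and returns the port of the longer one.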
IP routing table reduction
• Routing tables of large network routers have over 30,000 entries
• Fast IP routing lookup is difficult without using large hardware resources
Ternary CAM (McAuley & Francis, 1993)
• TCAM can be used to perform routing table lookup in a single cycle
• Requires large resources and large power consumption
Mask Extension (Liu, 2002)
• Uses two-level logic minimization to reduce the size of the routing table
• Good results but did not consider off-chip communication
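To illustrate why two-level logic minimization can shrink a routing table, the sketch below shows one elementary minimization step on ternary (value, mask) entries: two entries with the same next hop and the same mask whose values differ in exactly one cared-for bit are merged into one entry with that bit as a don't-care. This is not the Mask Extension algorithm or ROCM itself. `TernaryEntry` and `try_merge` are hypothetical names, and a real reduction would also have to preserve lookup priority among overlapping entries.

```c
#include <stdint.h>
#include <stdbool.h>

/* A TCAM-style entry: a bit is compared only where mask has a 1. */
typedef struct {
    uint32_t value;  /* bit values, meaningful only where mask is 1 */
    uint32_t mask;   /* 1 = bit must match, 0 = don't care */
    int      port;   /* next-hop port */
} TernaryEntry;

/*
 * Try to merge a and b into *out. Succeeds only if both entries have the
 * same port and mask and differ in exactly one cared-for bit.
 */
static bool try_merge(const TernaryEntry *a, const TernaryEntry *b, TernaryEntry *out)
{
    if (a->port != b->port || a->mask != b->mask)
        return false;

    uint32_t diff = (a->value ^ b->value) & a->mask;
    /* Exactly one differing bit means diff is a nonzero power of two. */
    if (diff == 0 || (diff & (diff - 1)) != 0)
        return false;

    out->mask = a->mask & ~diff;        /* the differing bit becomes don't-care */
    out->value = a->value & out->mask;  /* clear bits that are no longer compared */
    out->port = a->port;
    return true;
}
```

For example, 138.23.16.x and 138.23.17.x with the same port differ only in the lowest bit of the third octet, so they collapse into a single entry with a 23-bit mask.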
On-Chip Minimization Applications (Access Control List Reduction)
Access Control List (ACL)
• Used to restrict IP traffic through network routers
• ACL size can range anywhere from 300 (UCR CS&E Dept.) to 10,000 (AOL)
• Common use is to block a particular protocol or port number to avoid attacks such as Denial of Service attacks
ACL Minimization
• Similar approach as used for IP routing table reduction
• However, order of the list must be preserved
ACL Input Format
Type | Protocol | In IP | In Port | Out IP | Out Port | Action
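The ordering constraint follows from first-match evaluation: the first rule that matches a packet decides its fate. The sketch below is a simplified illustration, not the ACL format above. The `AclRule` struct matches only on protocol and port, `ANY` is a hypothetical wildcard, and the implicit deny at the end of the list is an assumption.

```c
#include <stddef.h>

#define ANY (-1)  /* hypothetical wildcard for a rule field */

typedef enum { DENY = 0, PERMIT = 1 } Action;

/* Simplified rule: matches on protocol number and port only. */
typedef struct {
    int    protocol;  /* IP protocol number, or ANY */
    int    port;      /* port number, or ANY */
    Action action;
} AclRule;

/* Return the action of the first matching rule (first match wins). */
static Action acl_decide(const AclRule *acl, size_t n, int protocol, int port)
{
    for (size_t i = 0; i < n; i++) {
        int proto_ok = acl[i].protocol == ANY || acl[i].protocol == protocol;
        int port_ok = acl[i].port == ANY || acl[i].port == port;
        if (proto_ok && port_ok)
            return acl[i].action;
    }
    return DENY;  /* assumed implicit deny when no rule matches */
}
```

With the list "deny TCP port 23, then permit all TCP", a TCP packet to port 23 is denied. Swapping the two rules permits the same packet, so a minimizer may only merge or drop rules when the first-match result is unchanged for every packet.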
On-Chip Minimization Applications (Dynamic Hardware/Software Partitioning)
Dynamic hardware/software partitioning (JIT compilation for FPGAs)
• Dynamically detects frequently executed software loops and re-implements them using on-chip configurable logic
• Requires logic synthesis tools to be embedded on-chip
[Figure: Warp Processor block diagram. Labels: MIPS/ARM, I$, D$, Profiler, Configurable Logic, Dynamic Partitioning Module.]
ROCM
On-chip Logic Minimization Requirements
• Limited data and instruction memory available
• Quality of results must still be close to optimal
• Execution time should remain reasonable
On-chip Logic Minimization Goal
• Focus on developing an on-chip logic minimization tool that produces acceptable results with reasonable increases in execution time while using limited memory resources
ROCM – Riverside On-Chip Minimizer
• Two-level minimization tool
• Utilizes a combination of approaches from Espresso-II (Brayton, et al. 1984) and Presto (Svoboda & White, 1979)
• Eliminates the need to compute the off-set to reduce memory usage
• Utilizes a single expand phase instead of multiple iterations
• On average only 2% larger than optimal solution
ROCM Results (Performance/Memory Usage)
[Figure: Three bar charts comparing Espresso-Exact, Espresso-II, ROCM, and ROCM (ARM7): Exec. Time (seconds), Code Mem (KB), and Data Mem (KB). Platforms: 500 MHz Sun Ultra60 and 40 MHz ARM7 (Triscend A7).]
• ROCM executing on 40 MHz ARM7 requires less than 1 second
• Small code size of only 22 kilobytes
• Average data memory usage of only 1 megabyte
Codesign ROCM (Hardware Coprocessor)

Function/Loop   Total Instr.   Loop Instr.   Loop Time%   Loop Size%   Speedup Bound
DoesInter       15401          34            42.7%        0.2%         1.7
SetLit          15401          23            5.7%         0.2%         1.1
GetLit          15401          8             8.8%         0.1%         1.1
IsCov           15401          67            3.4%         0.4%         1.0
Tautology.1     15401          67            1.7%         0.4%         1.0
Cofactor.1      15401          57            28.5%        0.4%         1.4
Overall         15401          256           90.7%        1.7%         10.8

• Customized ROCM enables us to develop an efficient hardware coprocessor
• Profiled the execution of ROCM-32 and ROCM-128 using the ARM port of the SimpleScalar simulator
• Determined critical loops/functions that are suitable for implementation in hardware
• Identified six critical kernels that comprised 91% of the total execution time but only 2% of the code size
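The Speedup Bound column is consistent with the usual ideal-speedup estimate: if a kernel taking fraction f of the execution time were made infinitely fast, the overall speedup could be at most 1 / (1 - f). That the slide used exactly this formula is an assumption, although the values agree with the column when rounded to one decimal place. `speedup_bound` is a hypothetical helper name.

```c
/*
 * Upper bound on overall speedup when a region that takes `fraction`
 * of the execution time (0 <= fraction < 1) is reduced to zero time.
 */
static double speedup_bound(double fraction)
{
    return 1.0 / (1.0 - fraction);
}
```

For the Overall row, `speedup_bound(0.907)` is about 10.75, which rounds to the 10.8 in the table. For DoesInter, `speedup_bound(0.427)` is about 1.75.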
Codesign ROCM (Minimization Coprocessor)
[Figure: On-Chip Minimizer block diagram. Labels: ARM7, MEM, Min. Coproc., addr, data. The Minimization Coprocessor contains a Proc/Mem Interface and the kernels DoesIntersect, IsCov, GetLit, SetLit, Tautology.1, and Cofactor.1.]
[Figure: DoesIntersect datapath. Labels: aImpl, dImpl, numLits, <<, << 1, 32 (odd), 32 (even), 64, 64, 5, == 0, retVal.]
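A software analogue of an intersection test can be sketched as follows. The encoding is an assumption, since the slides do not state it: each literal takes two bits of a 64-bit word (01 = complemented, 10 = uncomplemented, 11 = don't care, 00 = empty), so 32 literals fit in one word. Two implicants intersect when no literal's two-bit field becomes 00 after a bitwise AND. `does_intersect` is a hypothetical name and is not claimed to match the datapath in the figure.

```c
#include <stdint.h>
#include <stdbool.h>

/* Mask selecting the even-numbered bit of every two-bit literal field. */
#define EVEN_BITS 0x5555555555555555ULL

/*
 * Return true if implicants a and b share at least one minterm.
 * Literal i occupies bits 2i and 2i+1; only the first num_lits literals count.
 */
static bool does_intersect(uint64_t a, uint64_t b, unsigned num_lits)
{
    uint64_t both = a & b;
    /* Bit 2i of nonempty is set when literal i's field in `both` is not 00. */
    uint64_t nonempty = (both | (both >> 1)) & EVEN_BITS;
    /* Select the fields of the literals in use; avoid a shift by 64. */
    uint64_t used = num_lits >= 32
                        ? EVEN_BITS
                        : (EVEN_BITS & ((1ULL << (2 * num_lits)) - 1));
    /* Any used literal with an empty field means the intersection is empty. */
    return (~nonempty & used) == 0;
}
```

With three literals, the implicant x0·x1' encodes as 0x36 and x0 as 0x3E. Their AND leaves every field nonzero, so they intersect. Against x0' (0x3D), the field for x0 becomes 00 and the test fails.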
Codesign ROCM Results (Execution Time)
[Figure: Bar chart of execution time (s) for IP Routing Table Reduction (32), ACL Reduction (128), and Logic Synthesis (128), comparing SW (75 MHz) with HW/SW (200/75 MHz).]
• Average speedup of 7.8
Codesign ROCM Results (Energy Consumption)
[Figure: Bar chart of energy consumption (mJ) for IP Routing Table Reduction (32), ACL Reduction (128), and Logic Synthesis (128), comparing SW (75 MHz) with HW/SW (200/75 MHz).]
• Average energy reduction of 59.2%
Codesign ROCM (Minimization Coprocessor)
• Software modifications were required to achieve speedup of 7.8
• Data structures/algorithms not suitable for hardware implementation
  • Reorganized data structures
  • Customized width of data items
  • Eliminated memory allocation within critical regions
• Not automated with current hardware/software partitioning tools
Codesign ROCM (Minimization Coprocessor)
Original C Code

    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        for (k = 0; k < xj->numLiterals; k++) {
            // determine coImplicant...
        }
        AddImplicant(cofactor, &coImplicant);
    }

• Loop to move to HW: 28.5% of total exec. time
• AddImplicant(cofactor, &coImplicant): only 3.5% of total exec. time, but requires dynamic memory allocation
Codesign ROCM (Minimization Coprocessor)
Modified C Code

    // determine size of cofactor initially
    cofactorSize = 0;
    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        cofactorSize++;
    }
    // allocate all memory outside of main loop
    cofactor->implicants = malloc(…);

    for (i = 0; i < F->numImplicants; i++) {
        if (!DoesIntersect(implicant, xj)) continue;
        // additional initialization code needed for each iteration
        coImplicant = &(cofactor->implicants[index++]);
        for (k = 0; k < xj->numLiterals; k++) {
            ...
        }
    }
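The count-then-allocate pattern used in the modified code can be shown in a self-contained form. The sketch below is a generic illustration, not ROCM code: `IntList`, `keep`, and `filter_prealloc` are hypothetical names, and `keep` merely stands in for a test such as DoesIntersect. The first pass counts the surviving items, one `malloc` sizes the output, and the second pass fills it without allocating inside the loop.

```c
#include <stdlib.h>
#include <stddef.h>

/* Output container, filled by filter_prealloc (hypothetical type). */
typedef struct {
    int   *items;
    size_t count;
} IntList;

/* Hypothetical stand-in for a test such as DoesIntersect: keep even values. */
static int keep(int v)
{
    return v % 2 == 0;
}

/* Copy the kept items of in[0..n) into out. Returns 0 on success, -1 on failure. */
static int filter_prealloc(const int *in, size_t n, IntList *out)
{
    size_t size = 0;
    size_t index = 0;

    /* Pass 1: determine the size of the result. */
    for (size_t i = 0; i < n; i++)
        if (keep(in[i]))
            size++;

    /* Allocate all memory once, outside the main loop. */
    out->count = 0;
    out->items = size ? malloc(size * sizeof *out->items) : NULL;
    if (size && !out->items)
        return -1;

    /* Pass 2: the main loop only writes into preallocated storage. */
    for (size_t i = 0; i < n; i++) {
        if (!keep(in[i]))
            continue;
        out->items[index++] = in[i];
    }
    out->count = index;
    return 0;
}
```

The cost is that the selection test runs twice, which is the kind of trade-off the slides describe: extra software work in exchange for a main loop with no dynamic allocation that is easier to move to hardware.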
Conclusions & Future Work
• Developed codesigned on-chip logic minimization
• Performance improvement of nearly 8X compared to earlier software only implementation
• Energy reduction of almost 60%
• New directions in hardware/software partitioning
• Designer effort was required to rewrite algorithms and fine tune data structures
• Could better hardware/software partitioning tools automate this?