Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer, Department of Computing, Imperial College London. HOT CHIPS 18. Slide transcript.
1
Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing, Imperial College London
HOT CHIPS 18
2
Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area delay product reduction
6. current and future work + summary
3
1. Instruction-set extensible processors
● base processor + custom logic
– partition data-flow graphs into custom instructions
[Figure: base processor datapath, with custom logic attached alongside the ALU and register file; data in and data out paths shown]
4
Previous work
● many techniques, e.g.
– Atasu et al. (DAC 03)
– Goodwin and Petkov (CASES 03)
– Clark et al. (MICRO 03, HOT CHIPS 04)
● current challenges
– optimality and robustness of heuristics
– complete tool chain: application to silicon
– research infrastructure for custom processor design
5
2. Custom processor research at Imperial
● focus on effective optimization techniques
– e.g. Integer Linear Programming (ILP)
● complete tool-chain
– high-level descriptions to custom processor silicon
● open infrastructure for research in
– custom processor synthesis
– automatic customization techniques
● current tools
– optimizing compiler (Trimaran) for custom CPUs
– custom processor synthesis tool
6
Application to custom processor flow
[Flow diagram] Application Source (C) → Template Generation → Template Selection (under an Area Constraint) → Generate Custom Unit and Generate Base CPU → Processor Description → ASIC Tools → Area, Timing
7
Custom instruction model
[Figure: custom instruction datapath. The register file feeds the input ports through an input register; pipeline registers separate multi-cycle stages; results return through an output register to the output ports.]
8
3. Optimal instruction identification
● minimize schedule length of program data-flow graphs (DFGs)
● subject to constraints
– convexity: ensure feasible schedules
– fixed processor critical path: pipeline for multi-cycle instructions
– fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
a. template generation
b. template selection
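The convexity constraint can be checked directly on a candidate subgraph. A minimal sketch in Python (the graph encoding and function name are illustrative, not from the authors' tool): a template is convex when no value leaves it and later re-enters it, which would make a single-cycle schedule infeasible.

```python
def is_convex(dfg, template):
    """A template is convex iff no path between two template nodes
    passes through a node outside the template.
    dfg: node -> list of successor nodes; template: set of nodes."""
    # values that leave the template: external successors of its nodes
    frontier = [s for n in template for s in dfg.get(n, ())
                if s not in template]
    seen, stack = set(frontier), list(frontier)
    while stack:
        n = stack.pop()
        if n in template:
            return False  # an escaped value re-enters the template
        for s in dfg.get(n, ()):
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True
```

For the diamond graph a→{b,c}→d, the template {a, b} is convex, but {a, d} is not, because the value of a reaches d only through the external nodes b and c.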
9
a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)
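The three steps above can be sketched as an iterative loop. As a stand-in for the ILP solve, this sketch scores every feasible subgraph exhaustively (viable only for toy graphs); the node encoding, the one-cycle-per-node gain model, and the default I/O limits are illustrative assumptions, not the authors' formulation.

```python
from itertools import combinations

def predecessors(dfg):
    """Invert a DFG given as node -> list of successors."""
    preds = {n: set() for n in dfg}
    for n, succs in dfg.items():
        for s in succs:
            preds[s].add(n)
    return preds

def valid(dfg, preds, nodes, max_in, max_out):
    """Feasibility of one template: I/O bandwidth plus convexity."""
    ins = {p for n in nodes for p in preds[n] if p not in nodes}
    outs = {n for n in nodes
            if not dfg[n] or any(s not in nodes for s in dfg[n])}
    if len(ins) > max_in or len(outs) > max_out:
        return False
    # convexity: no value may leave the template and re-enter it
    stack = [s for n in nodes for s in dfg[n] if s not in nodes]
    seen = set(stack)
    while stack:
        m = stack.pop()
        if m in nodes:
            return False
        for s in dfg[m]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True

def best_template(dfg, max_in, max_out):
    """Step 1, as exhaustive search instead of an ILP solve.
    Gain model: one cycle per node, fused template takes one cycle."""
    preds = predecessors(dfg)
    best, gain = None, 0
    for k in range(2, len(dfg) + 1):
        for nodes in map(set, combinations(dfg, k)):
            if valid(dfg, preds, nodes, max_in, max_out):
                g = len(nodes) - 1
                if g > gain:
                    best, gain = nodes, g
    return best, gain

def collapse(dfg, nodes, name):
    """Step 2: replace the chosen template by a single DFG node."""
    new = {name: sorted({s for n in nodes for s in dfg[n]} - nodes)}
    for n, succs in dfg.items():
        if n not in nodes:
            new[n] = sorted({name if s in nodes else s for s in succs})
    return new

def generate(dfg, max_in=4, max_out=1):
    """Step 3: solve, collapse, repeat while the objective is > 0."""
    templates = []
    while True:
        t, gain = best_template(dfg, max_in, max_out)
        if gain == 0:
            return templates
        templates.append(t)
        dfg = collapse(dfg, t, "T%d" % len(templates))
```

On a four-node chain the whole chain is fused into one template; a node fanning out to two leaves is rejected under a one-output limit but accepted under two, showing the bandwidth constraint at work.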
10
b. Template selection
● determine isomorphism classes
– find templates that can be implemented using the same instruction
– calculate speed-up potential of each class
● solve Knapsack problem using ILP
– maximize speedup within area constraint
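Selection within the area budget is a 0/1 knapsack over isomorphism classes. The slides solve it with ILP; an equivalent dynamic-programming sketch (class list, gains, and unit-area budget below are made-up numbers for illustration):

```python
def select_templates(classes, area_budget):
    """0/1 knapsack: pick isomorphism classes maximizing speed-up
    within the area budget.
    classes: list of (cycles_saved, area_units) per class.
    Returns (best total cycles saved, set of chosen class indices)."""
    best = [0] * (area_budget + 1)
    chosen = [set() for _ in range(area_budget + 1)]
    for i, (gain, area) in enumerate(classes):
        # walk budgets downwards so each class is used at most once
        for a in range(area_budget, area - 1, -1):
            if best[a - area] + gain > best[a]:
                best[a] = best[a - area] + gain
                chosen[a] = chosen[a - area] | {i}
    return best[area_budget], chosen[area_budget]
```

For example, three classes saving 60, 100 and 120 cycles with areas 3, 5 and 4 units under a budget of 8 units select classes 0 and 2, saving 180 cycles.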
11
Optimizing compilation flow
[Flow diagram] Application in C/C++ → Impact Front-end → CDFG Formation → a) Template Generation → b) Template Selection → Instruction Replacement → Scheduling, Reg. Allocation (Elcor Backend) → Assembly Code and Statistics. MDES Generation configures the back end. Data Bandwidth Constraints feed both template generation and selection; Area Constraints feed selection. Synopsys Synthesis of the selected templates (VHDL) returns Area, feeding the estimated Gain back into selection.
12
4. Application-specific processor synthesis
● design space exploration framework
– Processor Component Library
– specialized structural description
● prototype: MIPS integer instruction set
– custom instructions
– flexible micro-architecture
● evaluate using actual implementation
– timing and area
13
Processor synthesis flow
[Flow diagram] Custom data paths from the compiler are combined with the Processor Component Library (pipeline description, parameters): merging, adding state registers, and attaching the processor interface (data in/out, stall control) across the FE, EX, M and W pipeline stages produce the Custom Processor.
14
Implementation
● based on Python scripts
– structural meta-language for processors
– combine RTL (Verilog/VHDL) IP blocks
– module generators for custom units
● generate 100s of designs automatically
– ASIC processor cores
– complete system on FPGA: CPU + memory + I/O
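A module generator in such a Python flow can be imagined as a small function that emits RTL text. A hypothetical sketch (module shape, 32-bit port widths, and the chained-binary-op pattern are all assumptions, not the authors' actual tool):

```python
def custom_unit_verilog(name, ops):
    """Emit Verilog for a single-cycle custom unit chaining binary ops:
    result = (((in0 ops[0] in1) ops[1] in2) ...). Illustrative only."""
    n_in = len(ops) + 1
    ports = ", ".join("in%d" % i for i in range(n_in))
    lines = ["module %s(input [31:0] %s," % (name, ports),
             "            output [31:0] result);"]
    prev = "in0"
    for i, op in enumerate(ops, 1):
        # one wire per intermediate value of the fused operation
        lines.append("  wire [31:0] t%d = %s %s in%d;" % (i, prev, op, i))
        prev = "t%d" % i
    lines.append("  assign result = %s;" % prev)
    lines.append("endmodule")
    return "\n".join(lines)
```

Calling `custom_unit_verilog("fused", ["+", "^"])` yields a three-input module computing `(in0 + in1) ^ in2`, the kind of fused data-flow template the compiler selects.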
15
5. Results
● cryptography benchmarks: C source
– AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
– 0.225 mm² area, 200 MHz clock speed
– single-issue processor
– register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
– Synopsys DC and Cadence SoC Encounter
– also synthesized to Xilinx FPGA for testing
16
AES Decryption Processor
[Chip layout] 130nm CMOS, 200 MHz, 0.307 mm²
35% area cost (mostly one instruction)
76% cycle reduction
18
Execution time
[Chart: normalised number of cycles (0 to 120) vs. area constraint in ripple-carry-adder equivalents (0 to 70), for AES decrypt, AES encrypt, DES, MD5 and SHA]
Custom instruction bandwidth, in the order listed: 4 inputs/1 output, 4 inputs/1 output, 4 inputs/4 outputs, 4 inputs/2 outputs, 4 inputs/1 output
Annotated cycle reductions: 76%, 63%, 43%
Register file in all cases: 2 input ports, 1 output port
19
Timing
• 48% of designs meet timing at 200MHz without manual optimization
[Chart: timing slack in ns (-3.5 to 0.5) vs. cell area (50000 to 120000) for the generated designs]
20
Area (for maximum speedup)
[Bar chart: cell area and chip area (0 to 500000) for aesdec, aesenc, des, md5, sha and the base CPU]
Annotated area overheads relative to the base CPU: 35%, 28%, 42%, 93%, 23%
21
6. Current and future work
● support memory access in custom instructions
– automate data partitioning for memory access
– automate SIMD load/store instructions for state registers
● use architectural techniques, e.g. shadow registers
– improve bandwidth without additional register file ports
● study trade-offs for VLIW style
– multiple register file ports
– multiple issue and custom instructions
● extend compiler: e.g. ILP model for cyclic graphs
– adapt software pipelining for hardware
22
Summary
● complete flow from C to custom processor
● automatic instruction set extension
– based on integer linear programming
– optimize schedule length under constraints
● application-specific processor synthesis
– complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
– 3x area-delay product reduction
● max speedup: 23% to 93% area overhead
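The 3x figure can be reproduced from the AES-decrypt numbers on slides 15 and 16, assuming delay scales with cycle count since both cores run at the same 200 MHz clock:

```python
base_area = 0.225    # mm^2, base MIPS core (slide 15)
custom_area = 0.307  # mm^2, AES decrypt processor (slide 16)
cycle_ratio = 1.0 - 0.76  # 76% cycle reduction, same clock frequency
# area-delay product ratio: base (area x cycles) over custom
adp_reduction = (base_area * 1.0) / (custom_area * cycle_ratio)
print(round(adp_reduction, 2))  # about 3x
```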