Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer, Department of Computing, Imperial College London. HOT CHIPS 18. Slide transcript.
1
Towards Optimal Custom Instruction Processors
Wayne Luk, Kubilay Atasu, Rob Dimond and Oskar Mencer
Department of Computing, Imperial College London
HOT CHIPS 18
2
Overview
1. background: extensible processors
2. design flow: C to custom processor silicon
3. instruction selection: bandwidth/area constraints
4. application-specific processor synthesis
5. results: 3x area delay product reduction
6. current and future work + summary
3
1. Instruction-set extensible processors
● base processor + custom logic
– partition data-flow graphs into custom instructions
[Figure: base processor datapath, with custom logic attached alongside the ALU and register file; data in and data out paths shown]
4
Previous work
● many techniques, e.g.
– Atasu et al. (DAC 03)
– Goodwin and Petkov (CASES 03)
– Clark et al. (MICRO 03, HOT CHIPS 04)
● current challenges
– optimality and robustness of heuristics
– complete tool chain: application to silicon
– research infrastructure for custom processor design
5
2. Custom processor research at Imperial
● focus on effective optimization techniques
– e.g. Integer Linear Programming (ILP)
● complete tool-chain
– high-level descriptions to custom processor silicon
● open infrastructure for research in
– custom processor synthesis
– automatic customization techniques
● current tools
– optimizing compiler (Trimaran) for custom CPUs
– custom processor synthesis tool
6
Application to custom processor flow
[Flow diagram] Application Source (C) → Template Generation → Template Selection (under an Area Constraint) → Generate Custom Unit and Generate Base CPU → Processor Description → ASIC Tools → Area, Timing
7
Custom instruction model
[Figure: custom instruction datapath. The register file feeds the input ports through an input register; pipeline registers separate multi-cycle stages; results return through an output register to the output ports.]
8
3. Optimal instruction identification
● minimize schedule length of program data-flow graphs (DFGs)
● subject to constraints
– convexity: ensure feasible schedules
– fixed processor critical path: pipeline for multi-cycle instructions
– fixed data bandwidth: limited by register file ports
● steps: based on Integer Linear Programming (ILP)
a. template generation
b. template selection
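The convexity constraint can be checked directly on a candidate subgraph. A minimal sketch in Python (the graph encoding and function name are illustrative, not from the authors' tool): a template is convex when no value leaves it and later re-enters it, which would make a single-cycle schedule infeasible.

```python
def is_convex(dfg, template):
    """A template is convex iff no path between two template nodes
    passes through a node outside the template.
    dfg: node -> list of successor nodes; template: set of nodes."""
    # values that leave the template: external successors of its nodes
    frontier = [s for n in template for s in dfg.get(n, ())
                if s not in template]
    seen, stack = set(frontier), list(frontier)
    while stack:
        n = stack.pop()
        if n in template:
            return False  # an escaped value re-enters the template
        for s in dfg.get(n, ()):
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True
```

For the diamond graph a→{b,c}→d, the template {a, b} is convex, but {a, d} is not, because the value of a reaches d only through the external nodes b and c.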
9
a. Template generation
1. Solve ILP for DFG to generate a template
2. Collapse template to a single DFG node
3. Repeat while (objective > 0)
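The three steps above can be sketched as an iterative loop. As a stand-in for the ILP solve, this sketch scores every feasible subgraph exhaustively (viable only for toy graphs); the node encoding, the one-cycle-per-node gain model, and the default I/O limits are illustrative assumptions, not the authors' formulation.

```python
from itertools import combinations

def predecessors(dfg):
    """Invert a DFG given as node -> list of successors."""
    preds = {n: set() for n in dfg}
    for n, succs in dfg.items():
        for s in succs:
            preds[s].add(n)
    return preds

def valid(dfg, preds, nodes, max_in, max_out):
    """Feasibility of one template: I/O bandwidth plus convexity."""
    ins = {p for n in nodes for p in preds[n] if p not in nodes}
    outs = {n for n in nodes
            if not dfg[n] or any(s not in nodes for s in dfg[n])}
    if len(ins) > max_in or len(outs) > max_out:
        return False
    # convexity: no value may leave the template and re-enter it
    stack = [s for n in nodes for s in dfg[n] if s not in nodes]
    seen = set(stack)
    while stack:
        m = stack.pop()
        if m in nodes:
            return False
        for s in dfg[m]:
            if s not in seen:
                seen.add(s)
                stack.append(s)
    return True

def best_template(dfg, max_in, max_out):
    """Step 1, as exhaustive search instead of an ILP solve.
    Gain model: one cycle per node, fused template takes one cycle."""
    preds = predecessors(dfg)
    best, gain = None, 0
    for k in range(2, len(dfg) + 1):
        for nodes in map(set, combinations(dfg, k)):
            if valid(dfg, preds, nodes, max_in, max_out):
                g = len(nodes) - 1
                if g > gain:
                    best, gain = nodes, g
    return best, gain

def collapse(dfg, nodes, name):
    """Step 2: replace the chosen template by a single DFG node."""
    new = {name: sorted({s for n in nodes for s in dfg[n]} - nodes)}
    for n, succs in dfg.items():
        if n not in nodes:
            new[n] = sorted({name if s in nodes else s for s in succs})
    return new

def generate(dfg, max_in=4, max_out=1):
    """Step 3: solve, collapse, repeat while the objective is > 0."""
    templates = []
    while True:
        t, gain = best_template(dfg, max_in, max_out)
        if gain == 0:
            return templates
        templates.append(t)
        dfg = collapse(dfg, t, "T%d" % len(templates))
```

On a four-node chain the whole chain is fused into one template; a node fanning out to two leaves is rejected under a one-output limit but accepted under two, showing the bandwidth constraint at work.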
10
b. Template selection
● determine isomorphism classes
– find templates that can be implemented using the same instruction
– calculate speed-up potential of each class
● solve Knapsack problem using ILP
– maximize speedup within area constraint
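Selection within the area budget is a 0/1 knapsack over isomorphism classes. The slides solve it with ILP; an equivalent dynamic-programming sketch (class list, gains, and unit-area budget below are made-up numbers for illustration):

```python
def select_templates(classes, area_budget):
    """0/1 knapsack: pick isomorphism classes maximizing speed-up
    within the area budget.
    classes: list of (cycles_saved, area_units) per class.
    Returns (best total cycles saved, set of chosen class indices)."""
    best = [0] * (area_budget + 1)
    chosen = [set() for _ in range(area_budget + 1)]
    for i, (gain, area) in enumerate(classes):
        # walk budgets downwards so each class is used at most once
        for a in range(area_budget, area - 1, -1):
            if best[a - area] + gain > best[a]:
                best[a] = best[a - area] + gain
                chosen[a] = chosen[a - area] | {i}
    return best[area_budget], chosen[area_budget]
```

For example, three classes saving 60, 100 and 120 cycles with areas 3, 5 and 4 units under a budget of 8 units select classes 0 and 2, saving 180 cycles.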
11
Optimizing compilation flow
[Flow diagram] Application in C/C++ → Impact Front-end → CDFG Formation → a) Template Generation → b) Template Selection → Instruction Replacement → Scheduling, Reg. Allocation (Elcor Backend) → Assembly Code and Statistics. MDES Generation configures the back end. Data Bandwidth Constraints feed both template generation and selection; Area Constraints feed selection. Synopsys Synthesis of the selected templates (VHDL) returns Area, feeding the estimated Gain back into selection.
12
4. Application-specific processor synthesis
● design space exploration framework
– Processor Component Library
– specialized structural description
● prototype: MIPS integer instruction set
– custom instructions
– flexible micro-architecture
● evaluate using actual implementation
– timing and area
13
Processor synthesis flow
[Flow diagram] Custom data paths from the compiler are combined with the Processor Component Library (pipeline description, parameters): merging, adding state registers, and attaching the processor interface (data in/out, stall control) across the FE, EX, M and W pipeline stages produce the Custom Processor.
14
Implementation
● based on Python scripts
– structural meta-language for processors
– combine RTL (Verilog/VHDL) IP blocks
– module generators for custom units
● generate 100s of designs automatically
– ASIC processor cores
– complete system on FPGA: CPU + memory + I/O
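A module generator in such a Python flow can be imagined as a small function that emits RTL text. A hypothetical sketch (module shape, 32-bit port widths, and the chained-binary-op pattern are all assumptions, not the authors' actual tool):

```python
def custom_unit_verilog(name, ops):
    """Emit Verilog for a single-cycle custom unit chaining binary ops:
    result = (((in0 ops[0] in1) ops[1] in2) ...). Illustrative only."""
    n_in = len(ops) + 1
    ports = ", ".join("in%d" % i for i in range(n_in))
    lines = ["module %s(input [31:0] %s," % (name, ports),
             "            output [31:0] result);"]
    prev = "in0"
    for i, op in enumerate(ops, 1):
        # one wire per intermediate value of the fused operation
        lines.append("  wire [31:0] t%d = %s %s in%d;" % (i, prev, op, i))
        prev = "t%d" % i
    lines.append("  assign result = %s;" % prev)
    lines.append("endmodule")
    return "\n".join(lines)
```

Calling `custom_unit_verilog("fused", ["+", "^"])` yields a three-input module computing `(in0 + in1) ^ in2`, the kind of fused data-flow template the compiler selects.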
15
5. Results
● cryptography benchmarks: C source
– AES decrypt, AES encrypt, DES, MD5, SHA
● 4/5 stage pipelined MIPS base processor
– 0.225 mm² area, 200 MHz clock speed
– single-issue processor
– register file with 2 input ports, 1 output port
● processors synthesized to 130nm library
– Synopsys DC and Cadence SoC Encounter
– also synthesized to Xilinx FPGA for testing
16
AES Decryption Processor
[Chip layout] 130nm CMOS, 200 MHz, 0.307 mm²
35% area cost (mostly one instruction)
76% cycle reduction
18
Execution time
[Chart: normalised number of cycles (0 to 120) vs. area constraint in ripple-carry-adder equivalents (0 to 70), for AES decrypt, AES encrypt, DES, MD5 and SHA]
Custom instruction bandwidth, in the order listed: 4 inputs/1 output, 4 inputs/1 output, 4 inputs/4 outputs, 4 inputs/2 outputs, 4 inputs/1 output
Annotated cycle reductions: 76%, 63%, 43%
Register file in all cases: 2 input ports, 1 output port
19
Timing
• 48% of designs meet timing at 200MHz without manual optimization
[Chart: timing slack in ns (-3.5 to 0.5) vs. cell area (50000 to 120000) for the generated designs]
20
Area (for maximum speedup)
[Bar chart: cell area and chip area (0 to 500000) for aesdec, aesenc, des, md5, sha and the base CPU]
Annotated area overheads relative to the base CPU: 35%, 28%, 42%, 93%, 23%
21
6. Current and future work
● support memory access in custom instructions
– automate data partitioning for memory access
– automate SIMD load/store instructions for state registers
● use architectural techniques, e.g. shadow registers
– improve bandwidth without additional register file ports
● study trade-offs for VLIW style
– multiple register file ports
– multiple issue and custom instructions
● extend compiler: e.g. ILP model for cyclic graphs
– adapt software pipelining for hardware
22
Summary
● complete flow from C to custom processor
● automatic instruction set extension
– based on integer linear programming
– optimize schedule length under constraints
● application-specific processor synthesis
– complete flow: permits real hardware evaluation
● up to 76% reduction in execution cycles
– 3x area-delay product reduction
● max speedup: 23% to 93% area overhead
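The 3x figure can be reproduced from the AES-decrypt numbers on slides 15 and 16, assuming delay scales with cycle count since both cores run at the same 200 MHz clock:

```python
base_area = 0.225    # mm^2, base MIPS core (slide 15)
custom_area = 0.307  # mm^2, AES decrypt processor (slide 16)
cycle_ratio = 1.0 - 0.76  # 76% cycle reduction, same clock frequency
# area-delay product ratio: base (area x cycles) over custom
adp_reduction = (base_area * 1.0) / (custom_area * cycle_ratio)
print(round(adp_reduction, 2))  # about 3x
```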