hot chips 16august 24, 2004 optimode: programmable accelerator engines through retargetable...

18
Hot Chips 16 August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott Mahlke CCCP Research Group University of Michigan http://cccp.eecs.umich.edu Krisztián Flautner, Koen Van Nieuwenhove ARM Limited

Post on 21-Dec-2015

220 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

OptimoDE: Programmable Accelerator Engines Through Retargetable Customization

Nathan Clark, Hongtao Zhong,Kevin Fan, Scott Mahlke

CCCP Research GroupUniversity of Michigan

http://cccp.eecs.umich.edu

Krisztián Flautner,Koen Van Nieuwenhove

ARM Limited

Page 2: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

OptimoDE Overview• OptimoDE

– A configurable VLIW-styled Data Engine architecture– Targeted at intensive data processing

• Characteristics– Very wide performance envelope

• Power / area / speed tradeoff• Exploiting parallelism in applications

– Unlimited data path configuration options– User extensible through ISA customization

• Semi-automatic design system– User-in-the-loop design, retargetable compiler toolchain

Page 3: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

OptimoDE in a System On Chip

SDRAM

SoC

AH

B B

us M

atrix

ARMCPU

DMAController

InterruptController

SRAM

I/O

SDRAMController

Mem

ory

Con

trol

Data Engine

DA

TA

1

MEM

CTRL

DA

TA

2

FIFO

switch

Memory

Dat

a M1

M2

M

S

S S

M

S S M

Page 4: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

OptimoDE Architecture Model• Functional Units

– ALU, ACU, Multipliers– Custom

• Memory– RAM ( asynch / synch )– ROM

• I/O ports– addressable– handshake protocol

• Registers– Register files

• Interconnect– Direct connection– Shared bus

• Controller

• All layers required• Intra-layer configuration

interconnect

InterconnectInterconnect

Controller

C

ControllerController

interconnect

… …

Function unitsFunction units MemoriesMemories

regs regs regs…regs

I/O portsI/O ports

RegistersRegisters

Page 5: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Design Toolchain

A.tmp

DesignDE DEvelop

Librarian

saveISS

set target

001010110011010110110111

load

run / profile

A.inc

A.inc.c1

UserLibrary

instantiate

#include …

main(){

… = dct();

}

create_resource

xxxxxxxdctxxxxx

A.inc

A.inc.c2

A.inc

A.inc.c3

A Evaluation

OptimoDELibraryOptimoDE

LibraryUser

Library

A Definition

LIFETIMELIFETIME

LOADLOAD

Page 6: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Compiler Toolchain

*1

+2

*4

*5

+6

+7

*8

INPUT

OUTPUT

+3

*1

+2

+3

012345

+2

*1

*4

*3

+6

*8

+7

*5

*9

+10

C SourceDescription

DEvelop

Micro-code

check map compile

Analysis feedback

Syntax checks

Dataflow analysis

Match architecture

and dataflow graph

Optimize code and

register use

Page 7: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

32-point DCT Microarchitecture

inport

outportacu_1 acu_2 acu_3

romram_1 ram_2 alu_1 alu_2

imm_1 imm_2 control

• 2 Custom FUs, 2 RAM, 1 ROM, 3 ACU, 2 I/O ports• Designer responsible for creating custom units manually

Page 8: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Retargetable Customization• Prototype 2 technologies in OptimoDE

– Automated ISA customization– Retargetable customization to an “application-area”

• Customizing for 1 application– Programmability Nominally programmable– Critical problem – Cannot sustain performance

across similar applications– How well does a custom ISA generalize

• 5 encryption algorithms, create custom design for each• Average loss >80% verses native [MICRO, 2003]

– Proactive generalization creates a retargetable design

Page 9: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Creating Custom Instructions

• Candidate discovery– Identify customization

opportunities• Examine program DFG• Partition DFG at:

– Memory operations– Unprofitable edges

• Enumerate candidate subgraphs within each partition

Page 10: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Grouping and Selection

• Group candidate subgraphs with same structure

Group 4

Group 2

Group 1

Group 3

Group 4Group 3

Group 1 Group 2

• Estimate performance and cost for each group

Cost: 0.5 AddersGain: 1,000 Cycles

Cost: 1 AdderGain: 2,500 Cycles

Cost: 2 AddersGain: 10,000 Cycles

Cost: 1 AdderGain: 1,500 Cycles

• Greedily select groups to implement in hardware subject to budget

Page 11: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

• Wildcard – multiple functionality at nodes

Input 2

Input 1

0xFF

0x8, 0x40x4

>>

|,&

+,-

Output

Proactively Generalize Groups

• Cost-effectively extend group functionality to enable reuse

Input 2

Input 1

0xFF

0x8

>>

|

+

Output

Input 2

Input 1

0xFF

0x8, 0x40x4

>>

|,&

+,-

Output

• Subsumed – configurable interconnect to bypass nodes

Page 12: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Native Speedups

0

1

2

3

4

5

6

7

8

9

3Des AES Blowfish Md5 Rc4 SHA

Speedup

ARM 926EJ (200 MHz)OptimoDE (333 MHz, 5 Issue: 3 ALU, 1 Mem, 1 Brn)OptimoDE + 1 CFU (333 MHz)

Page 13: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

0

1

2

3

4

5

6

7

3des-3des3des-Blowfish

3des-Rc43des-AES3des-Sha

Blowfish-3desBlowfish-Blowfish

Blowfish-Rc4Blowfish-AESBlowfish-ShaRc4-3des

Rc4-Blowfish

Rc4-Rc4Rc4-AESRc4-ShaAES-3desAES-Blowfish

AES-Rc4AES-AESAES-ShaSha-3desSha-Blowfish

Sha-Rc4Sha-AESSha-Sha

Speedup

Base

Generalized

0

1

2

3

4

5

6

7

3des-3des3des-Blowfish

3des-Rc43des-AES3des-Sha

Blowfish-3desBlowfish-Blowfish

Blowfish-Rc4Blowfish-AESBlowfish-ShaRc4-3des

Rc4-Blowfish

Rc4-Rc4Rc4-AESRc4-ShaAES-3desAES-Blowfish

AES-Rc4AES-AESAES-ShaSha-3desSha-Blowfish

Sha-Rc4Sha-AESSha-Sha

Speedup

Base

Generalized

Importance of Generalization

0

1

2

3

4

5

6

7

3des-3des3des-Blowfish

3des-Rc43des-AES3des-Sha

Blowfish-3desBlowfish-Blowfish

Blowfish-Rc4Blowfish-AESBlowfish-ShaRc4-3des

Rc4-Blowfish

Rc4-Rc4Rc4-AESRc4-ShaAES-3desAES-Blowfish

AES-Rc4AES-AESAES-ShaSha-3desSha-Blowfish

Sha-Rc4Sha-AESSha-Sha

Speedup

Base

Generalized

Key: application run – application designed for

Page 14: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

Rc4 Rc4, Sha Blowfish, Rc4, Sha 3des, Blowfish,Rc4, Sha

3des, AES,Blowfish, Rc4, Sha

Apps Designed For

Fraction of Native Speedup Attained

3des

AES

Blowfish

Rc4

Sha

Md5

Mean

Designing for a Domain

Page 15: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Case Study - Md5

0

1

2

3

4

5

6

7

8

9

10

0 2 4 6 8 10

CFU Cost Budget (Relative to 32-bit ripple-carry adder)

Speedup

Page 16: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

OptimoDE Design for this Point

Input 4

Input 1

Input 3

Input 2

+

+

+

<<

0x5

>>

0x1B

|

Output

^

&

^

Input 1Input 2

Input 3

Output

ALU 1 ALU 2 CFUSRAMACU

RF RFRF RF …

Control Memory

Page 17: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Die Area Breakdown

ALU 1 ALU 2 CFUSRAMACU

RF RFRF RF …

Control Memory

Die impact is artificially large because of naïve implementation

OptimoDE = 5.5 mm2 in 0.13 ARM 926EJ = 5.0 mm2 in 0.13

Reg. File and Interconnect

9%

CFU2%

Control Memory

11%

Baseline Design

78%

Page 18: Hot Chips 16August 24, 2004 OptimoDE: Programmable Accelerator Engines Through Retargetable Customization Nathan Clark, Hongtao Zhong, Kevin Fan, Scott

Hot Chips 16 August 24, 2004

Conclusions

• OptimoDE – Configurable VLIW-style data engine architecture– Automated tools for implementing embedded signal

and data processing solutions

• Automatic retargetable customization– Customized design combined with cost-effective

generalization– Performance programmability - Performance stability

across a family of similar applications