Page 1: Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks, Harvard University

A Pre-RTL, Power-Performance Accelerator Simulator Enabling Large Design Space Exploration of Customized Architectures

Yakun Sophia Shao, Brandon Reagen, Gu-Yeon Wei, David Brooks

Harvard University

Page 2: Beyond Homogeneous Parallelism

[Figure: three classes of compute engines: General-Purpose Cores (CPU), Programmable Accelerators (DSP, GPU), and Application-Specific Accelerators (ASIP, ASIC). Axis labels: Flexibility, Programmability, Energy Efficiency, Design Cost.]

Page 3: Today’s SoC

[Figure: OMAP 4 SoC.]

Page 4: Today’s SoC

[Figure: OMAP 4 SoC block diagram. Labeled blocks: ARM Cores, GPU, DSP, DSP; System Bus, two Secondary Buses, Tertiary Bus; DMA (two instances); SD, USB, Audio, Video, Face, Imaging, USB.]

Page 5: Today’s SoC

[Figure: Apple A7. CPU + L2$ + GPU: 39%; Other Blocks: 61%.]

[Figure: Harvard VLSI-ARCH Group SoC Tapeout.]

Page 6: Today’s SoC

[Figure: SoC block diagram. Labeled blocks: CPU (two), GPU/DSP, Buses, Mem Interface, and nine Acc (accelerator) blocks.]

Page 7: Future Accelerator-Centric Architectures

[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators.]

Flexibility, Design Cost, Programmability

• How to decompose an application into accelerators?
• How to rapidly design lots of accelerators?
• How to design and manage the shared resources?

Page 8: Aladdin: A pre-RTL, Power-Performance Accelerator Simulator

[Figure: Aladdin block diagram.
Inputs: Unmodified C-Code; Accelerator Design Parameters (e.g., # FU, mem. BW).
Outputs: Power/Area; Performance.
Other labels: Accelerator Specific Datapath; Private L1/Scratchpad; Shared Memory/Interconnect Models.]

Design Cost, Flexibility, Programmability

• “Design Assistant”: Understand Algorithmic-HW Design Space before RTL
• “Accelerator Simulator”: Design Accelerator-Rich SoC Fabrics and Memory Systems

Page 9: Future Accelerator-Centric Architecture

[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators.]

Page 10: Future Accelerator-Centric Architecture

[Figure: Big Cores, Small Cores, GPU/DSP, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators.]

Aladdin can rapidly evaluate a large design space of accelerator-centric architectures.

Page 11: Aladdin Overview

[Figure: Aladdin flow diagram.
Inputs: C Code; Acc Design Parameters.
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG.
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Power/Area Models.
Outputs: Power/Area; Performance.
Other label: Activity.]

DDDG: Dynamic Data Dependence Graph

Page 12: Aladdin Overview

[Figure: Aladdin flow diagram.
Inputs: C Code; Acc Design Parameters.
Optimization Phase: C Code -> Optimistic IR -> Initial DDDG -> Idealistic DDDG.
Realization Phase: Program Constrained DDDG -> Resource Constrained DDDG -> Power/Area Models.
Outputs: Power/Area; Performance.
Other label: Activity.]

Page 13: From C to Design Space

C Code:
for(i=0; i<N; ++i) c[i] = a[i] + b[i];

Page 14: From C to Design Space: IR Dynamic Trace

C Code:
for(i=0; i<N; ++i) c[i] = a[i] + b[i];

IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…

Page 15: From C to Design Space: Initial DDDG

C Code:
for(i=0; i<N; ++i) c[i] = a[i] + b[i];

IR Trace:
0. r0=0 //i = 0
1. r4=load(r0 + r1) //load a[i]
2. r5=load(r0 + r2) //load b[i]
3. r6=r4 + r5
4. store(r0 + r3, r6) //store c[i]
5. r0=r0 + 1 //++i
6. r4=load(r0 + r1) //load a[i]
7. r5=load(r0 + r2) //load b[i]
8. r6=r4 + r5
9. store(r0 + r3, r6) //store c[i]
10. r0=r0 + 1 //++i
…

[Figure: Initial DDDG. Nodes in trace order: 0. i=0; 1. ld a, 2. ld b, 3. +, 4. st c; 5. i++; 6. ld a, 7. ld b, 8. +, 9. st c; 10. i++; 11. ld a, 12. ld b, 13. +, 14. st c.]
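The step from IR trace to Initial DDDG can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Aladdin's implementation: the tuple encoding of trace entries and the names `dynamic_trace` and `build_dddg` are hypothetical, only register read-after-write dependences are tracked, and memory dependences are ignored. Each node receives an edge from the most recent earlier instruction that wrote one of its source registers.

```python
def dynamic_trace(n_iters):
    """Return the slide's IR trace as (dest, sources, text) tuples.

    Node ids follow the slide numbering: 0 is i=0, 5 and 10 are ++i.
    r1, r2, r3 hold the base addresses of a, b, c and are never written.
    """
    trace = [("r0", [], "r0 = 0")]
    for k in range(n_iters):
        if k > 0:
            trace.append(("r0", ["r0"], "r0 = r0 + 1"))
        trace.append(("r4", ["r0", "r1"], "r4 = load(r0 + r1)"))
        trace.append(("r5", ["r0", "r2"], "r5 = load(r0 + r2)"))
        trace.append(("r6", ["r4", "r5"], "r6 = r4 + r5"))
        trace.append((None, ["r0", "r3", "r6"], "store(r0 + r3, r6)"))
    return trace


def build_dddg(trace):
    """Return the set of (producer, consumer) edges for true register dependences."""
    last_writer = {}  # register name -> id of the node that last wrote it
    edges = set()
    for node, (dest, sources, _text) in enumerate(trace):
        for reg in sources:
            if reg in last_writer:  # live-in registers have no producer node
                edges.add((last_writer[reg], node))
        if dest is not None:
            last_writer[dest] = node
    return edges


if __name__ == "__main__":
    trace = dynamic_trace(2)
    for src, dst in sorted(build_dddg(trace)):
        print(f"{src}. {trace[src][2]}  ->  {dst}. {trace[dst][2]}")
```

For two iterations the sketch yields 13 edges, including (0, 5) and (5, 6): the loads of the second iteration wait on the index update.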

Page 16: From C to Design Space: Idealistic DDDG

C Code:
for(i=0; i<N; ++i) c[i] = a[i] + b[i];

[Figure: Idealistic DDDG. Index nodes: 0. i=0, 5. i++, 10. i++. Iteration nodes: 1. ld a, 2. ld b, 3. +, 4. st c; 6. ld a, 7. ld b, 8. +, 9. st c; 11. ld a, 12. ld b, 13. +, 14. st c.]

Page 17: From C to Design Space: Optimization Phase: C->IR->DDDG

• Include application-specific customization strategies.
• Node-Level:
– Bit-width Analysis
– Strength Reduction
– Tree-height Reduction
• Loop-Level:
– Remove dependences between loop index variables
• Memory Optimization:
– Memory-to-Register Conversion
– Store-Load Forwarding
– Store Buffer
• Extensible
– e.g. Model CAM accelerator by matching nodes in DDDG
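The loop-level item above, removing dependences between loop index variables, can be illustrated with a small Python sketch. It rests on assumptions that are not in the slides: every node takes one time step, the graph is hand-built for the vector-add loop, and the rule "drop any edge whose two endpoints are both index nodes" is a simplification chosen for illustration rather than Aladdin's actual algorithm. The names `vector_add_dddg`, `idealize`, and `depth` are hypothetical.

```python
INDEX_OPS = {"i=0", "i++"}


def vector_add_dddg(n_iters):
    """Return (ops, edges) for c[i] = a[i] + b[i]; node ids follow the trace order."""
    ops, edges = ["i=0"], []
    index = 0  # id of the node that last produced the loop index
    for k in range(n_iters):
        if k > 0:
            ops.append("i++")
            edges.append((index, len(ops) - 1))  # index chain: i=0 -> i++ -> i++ ...
            index = len(ops) - 1
        base = len(ops)
        ops += ["ld a", "ld b", "+", "st c"]
        edges += [(index, base), (index, base + 1),        # loads need the index
                  (base, base + 2), (base + 1, base + 2),  # add needs both loads
                  (base + 2, base + 3)]                    # store needs the sum
    return ops, edges


def idealize(ops, edges):
    """Drop edges that connect one loop-index node to another."""
    return [(s, d) for s, d in edges
            if not (ops[s] in INDEX_OPS and ops[d] in INDEX_OPS)]


def depth(ops, edges):
    """Number of nodes on the longest dependence chain (unit latency per node)."""
    level = [1] * len(ops)
    # Edges always point from a lower id to a higher id, so visiting them in
    # order of destination guarantees the source level is already final.
    for s, d in sorted(edges, key=lambda e: e[1]):
        level[d] = max(level[d], level[s] + 1)
    return max(level)


if __name__ == "__main__":
    ops, edges = vector_add_dddg(3)
    print("initial depth:   ", depth(ops, edges))
    print("idealistic depth:", depth(ops, idealize(ops, edges)))
```

In this sketch the initial graph's depth grows with the iteration count because each `i++` waits on the previous one, while the idealized graph's depth stays constant: every iteration can start at once.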

Page 18: From C to Design Space: One Design

Acc Design Parameters: Memory BW <= 2, 1 Adder

[Figure: the Idealistic DDDG for four loop iterations (nodes 0. i=0 through 19. st c) next to its cycle-by-cycle schedule under these parameters. A Resource Activity panel shows the MEM and + units in use along the Cycle axis.]

Page 19: From C to Design Space: Another Design

Acc Design Parameters: Memory BW <= 4, 2 Adders

[Figure: the Idealistic DDDG for four loop iterations (nodes 0. i=0 through 19. st c) next to its cycle-by-cycle schedule under these parameters. A Resource Activity panel shows the MEM and + units in use along the Cycle axis.]

Page 20: From C to Design Space: Realization Phase: DDDG->Estimates

• Constrain the DDDG with program and user-defined resource constraints
• Program Constraints
– Control Dependence
– Memory Ambiguation
• Resource Constraints
– Loop-level Parallelism
– Loop Pipelining
– Memory Ports
– # of FUs (e.g., adders, multipliers)
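How resource constraints turn one idealistic DDDG into different cycle counts can be sketched with a greedy list scheduler. The code below is an illustration under explicit assumptions, not Aladdin's scheduler: every operation takes one cycle, loads and stores share the memory bandwidth limit, index nodes consume no limited resource, and ready nodes are issued in node-id order. The names `idealistic_dddg` and `schedule` are hypothetical.

```python
# Which limited resource each operation occupies; "idx" nodes use none.
RESOURCE = {"ld": "mem", "st": "mem", "add": "add"}


def idealistic_dddg(n_iters):
    """Vector-add DDDG with no edges between index nodes (5 nodes per iteration)."""
    ops, edges = [], []
    for _ in range(n_iters):
        base = len(ops)
        ops += ["idx", "ld", "ld", "add", "st"]
        edges += [(base, base + 1), (base, base + 2),
                  (base + 1, base + 3), (base + 2, base + 3),
                  (base + 3, base + 4)]
    return ops, edges


def schedule(ops, edges, mem_bw, adders):
    """Return the number of cycles needed to issue every node."""
    if mem_bw < 1 or adders < 1:
        raise ValueError("each resource needs at least one unit")
    preds = [[] for _ in ops]
    for src, dst in edges:
        preds[dst].append(src)
    capacity = {"mem": mem_bw, "add": adders}
    issued = {}  # node id -> cycle in which it was issued
    cycle = 0
    while len(issued) < len(ops):
        used = {"mem": 0, "add": 0}
        for node, op in enumerate(ops):
            if node in issued:
                continue
            # A node is ready only if every predecessor issued in an earlier cycle.
            if any(issued.get(p, cycle) >= cycle for p in preds[node]):
                continue
            res = RESOURCE.get(op)
            if res is not None:
                if used[res] >= capacity[res]:
                    continue  # resource exhausted this cycle, retry next cycle
                used[res] += 1
            issued[node] = cycle
        cycle += 1
    return cycle


if __name__ == "__main__":
    ops, edges = idealistic_dddg(4)
    for mem_bw, adders in [(2, 1), (4, 2)]:
        print(f"Memory BW <= {mem_bw}, {adders} adder(s):",
              schedule(ops, edges, mem_bw, adders), "cycles")
```

With four iterations the sketch needs 8 cycles for (Memory BW <= 2, 1 Adder) and 5 cycles for (Memory BW <= 4, 2 Adders). These counts follow from the sketch's assumptions and are not read off the slides.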

Page 21: From C to Design Space: Power-Performance per Design

[Figure: plot of Power against Cycle with two design points: (Memory BW <= 4, 2 Adders) and (Memory BW <= 2, 1 Adder).]

Page 22: From C to Design Space: Design Space of an Algorithm

[Figure: plot of Power against Cycle.]

Page 23: Aladdin Validation

[Figure: validation flow. Labels: C Code, Aladdin, Verilog, ModelSim, Activity, Design Compiler. Outputs: Power/Area, Performance.]

Page 24: Aladdin Validation

[Figure: validation flow. Labels: C Code, Aladdin, RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Activity, Design Compiler. Outputs: Power/Area, Performance.]

Page 25: Aladdin Validation

Page 26: Aladdin Validation

Page 27: Aladdin enables rapid design space exploration for accelerators.

[Figure: validation flow. Labels: C Code, Aladdin, RTL Designer, HLS C Tuning, Vivado HLS, Verilog, ModelSim, Activity, Design Compiler. Outputs: Power/Area, Performance. Time annotations on the diagram: 7 mins, 52 hours.]

Page 28: Aladdin enables pre-RTL simulation of accelerators with the rest of the SoC.

[Figure: SoC diagram. Components: GPU, Big Cores, Small Cores, Shared Resources, Memory Interface, Sea of Fine-Grained Accelerators. Simulators named alongside: GPGPU-Sim, MARSSx86..., XIOSim…, Cacti/Orion2, DRAMSim2.]

Page 29: Modeling Accelerators in a SoC-like Environment

[Figure: system diagrams built from blocks labeled Acc, Core, Cache, and Memory.]

Page 30: Aladdin: A pre-RTL, Power-Performance Accelerator Simulator

• Architectures with 1000s of accelerators will be radically different; new design tools are needed.
• Aladdin enables rapid design space exploration of future accelerator-centric platforms.
• You can find Aladdin at http://vlsiarch.eecs.harvard.edu/aladdin