
CDSC CHP Prototyping

Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat,

Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou


Accelerator-Rich Architectures: ARC, CHARM, BiN

[Figure: tiled accelerator-rich CMP. Legend: C = Core, $2 = L2 Bank, A = Accelerator, ABM = Accelerator & BiN Manager, plus a router symbol.]

Goals

Implement the proposed architecture features and their support in the prototype system
Architecture proposals
• Accelerator-rich CMPs
• CHARM
• Hybrid cache
• Buffer-in-NUCA, etc.
Bridge the different thrusts in CDSC

Server-Class Platform: HC-1ex Architecture

Xeon Quad Core LV5408
• 40W TDP
Tesla C1060
• 100GB/s off-chip bandwidth
• 200W TDP
4 XC6vlx760 FPGAs
• 80GB/s off-chip bandwidth
• 90W design power

Drawbacks of Commodity Systems

Limited ability to customize from the architecture point of view
Board-level integration rather than chip-level integration
Commodity systems can only reach a certain level; further innovation is needed

CHP Prototyping Plan

Create the working hardware and software
Use the FPGA Extensible Processing Platform (EPP) as the platform
• Reuse existing FPGA IPs as much as possible
Work in multiple phases

Target Platforms: Xilinx ML605 and Zynq

Zynq: dual-core A9 with programmable logic
ML605: Virtex-6-based board

CHP Prototyping Phases

ARC Implementation
Phase 1: Basic platform
• Accelerator and software GAM
Phase 2: Adding modularity using available IP
• E.g. Xilinx DMAC IP
Phase 3: First step toward BiN
• Shared buffer
• Customized modules (e.g. DMA controller, plug-n-play accelerator)
Phase 4: System enhancement
• Crossbar
• AXI implementation

CHARM Implementation

ARC Phase 1 Goals

Set up a basic environment
• Multi-core + simple accelerators + OS
• Understand the system interactions in more detail
Simple controller as GAM (global accelerator manager)
• Supports system-level sharing of multiple accelerators of the same type
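The sharing role of the GAM can be pictured as a small reservation table. The sketch below is an illustration only: the class name `SoftwareGAM`, its methods, and the FIFO wait queue are assumptions, not the prototype's actual interface. The accelerator types mirror the vecadd/vecsub accelerators of the Phase 1 example system.

```python
from collections import deque

class SoftwareGAM:
    """Toy global accelerator manager: hands out free accelerators by type."""

    def __init__(self, inventory):
        # inventory maps an accelerator type to its number of instances,
        # e.g. {"vecadd": 2, "vecsub": 2} (hypothetical configuration).
        self.free = {t: deque(range(n)) for t, n in inventory.items()}
        self.waiting = {t: deque() for t in inventory}

    def reserve(self, core_id, acc_type):
        """Return an instance id, or None if the core has to wait."""
        if self.free[acc_type]:
            return self.free[acc_type].popleft()
        self.waiting[acc_type].append(core_id)
        return None

    def release(self, acc_type, acc_id):
        """Free an instance; return (core_id, acc_id) if a waiter takes it."""
        if self.waiting[acc_type]:
            return self.waiting[acc_type].popleft(), acc_id
        self.free[acc_type].append(acc_id)
        return None

gam = SoftwareGAM({"vecadd": 2, "vecsub": 2})
first = gam.reserve(core_id=0, acc_type="vecadd")
second = gam.reserve(core_id=1, acc_type="vecadd")
third = gam.reserve(core_id=2, acc_type="vecadd")   # both busy: core 2 waits
handoff = gam.release("vecadd", first)               # passed on to core 2
```

Keeping one free list per type is what lets several cores share accelerators of the same type without knowing which instance they will get.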

ARC Phase 1 Example System Diagram

[Figure: block diagram. Components: Microblaze-0 (Linux with MMU); Microblaze-1 (GAM; bare-metal, no MMU); AXI4 (xbar); AXI4lite (bus); DDR3; timer, uart, and mutex; two vecadd and two vecsub accelerators; four FSL links; Mailbox (vecadd) and Mailbox (vecsub).]

ARC Phase-2 Goals

Implement a system similar to the original ARC design
• GAM, accelerator, DMA controller, SPM
Add modularity using available IP
• E.g. Xilinx DMAC IP

ARC Phase-2 Architecture

ARC Phase-2 Performance and Power Results

Benchmarking kernel: [Equation: y(i) defined in terms of x(i) and the constants 2 through 9, for i = 0...4096]

Results

Platform | Runtime (us) | Power (W) | EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz | 1,746 | 2 | 17,570X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core | 562 | 80 | 1,365X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core | 10,061 | 65 | 94X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU | 852,163 | 72 | 1X
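The gain column of the Phase-2 results can be cross-checked with a few lines of arithmetic. Assuming the gain is measured against the UltraSPARC T1 row, the listed values (17,570X, 1,365X, 94X) coincide with the ratio of power × runtime, that is, energy; an energy-delay product would multiply by runtime once more. The dictionary keys are shorthand introduced here.

```python
# (runtime in us, power in W) copied from the Phase-2 results
platforms = {
    "CHP prototype (ML605)": (1746, 2),
    "2x Xeon E5405": (562, 80),
    "Xeon 5150": (10061, 65),
    "UltraSPARC T1": (852163, 72),
}

def energy(runtime_us, power_w):
    # Energy in microjoules: power (W) times runtime (us)
    return power_w * runtime_us

baseline = energy(*platforms["UltraSPARC T1"])
energy_gain = {
    name: round(baseline / energy(*row)) for name, row in platforms.items()
}
```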

ARC Phase-2 Runtime Breakdown

[Figure: timeline (us, 0 to 700) with lanes Core, GAM, ACC, DMAC, covering pages P0 to P3. Labeled events: reservation request sent; parameter sent; reservation succeeded; Pages 0 to 3 translated; task done; Acc freed; GAM reserves acc; GAM passes parameter; Acc wrapper partitions task; DMAC wrapper requests Page 0; DMAC transfers Pages 0 to 3; Acc computes; Acc done; GAM passes done signal. One interval is annotated 11.91 us.]
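The partitioning step in this timeline can be sketched as splitting a virtual address range at page boundaries, so that each piece needs exactly one translation. The 4 KiB page size, the 4-byte element size, the start addresses, and the function name `split_into_pages` are assumptions for illustration; the slides do not state them.

```python
PAGE_SIZE = 4096  # bytes; assumed, not stated in the slides

def split_into_pages(vaddr, length, page_size=PAGE_SIZE):
    """Split [vaddr, vaddr + length) into chunks that never cross a page.

    Returns a list of (virtual_page_number, offset_in_page, chunk_length).
    """
    chunks = []
    end = vaddr + length
    while vaddr < end:
        offset = vaddr % page_size
        chunk = min(page_size - offset, end - vaddr)
        chunks.append((vaddr // page_size, offset, chunk))
        vaddr += chunk
    return chunks

# 4096 elements of 4 bytes each, starting on a page boundary (hypothetical)
aligned = split_into_pages(vaddr=0x10000, length=4096 * 4)
# The same buffer shifted by 16 bytes spills into a fifth page
unaligned = split_into_pages(vaddr=0x10010, length=4096 * 4)
```

Under these assumed sizes the aligned buffer occupies four pages, which would be consistent with the P0 to P3 labels in the timeline.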

ARC Phase-2 Area Breakdown

Slice Logic Utilization
Number of Slice Registers: 45,283 out of 301,440: 15%
Number of Slice LUTs: 40,749 out of 150,720: 27%
• Number used as logic: 32,505 out of 150,720: 21%
• Number used as Memory: 5,248 out of 58,400: 8%

Slice Logic Distribution:
Number of occupied Slices: 17,621 out of 37,680: 46%
Number of LUT Flip Flop pairs used: 54,323
• Number with an unused Flip Flop: 14,617 out of 54,323: 26%
• Number with an unused LUT: 13,574 out of 54,323: 24%
• Number of fully used LUT-FF pairs: 26,132 out of 54,323: 48%

[Figure: labeled blocks: DMAC wrapper; AXI; Microblaze (Linux); Microblaze (GAM); DRAM Controller; Ethernet; Ethernet DMA; DMAC; AXILite; Accelerator.]

ARC Phase-3 Goals

First step toward BiN:
  Shared buffer
Designing our customized modules
  Customized DMA-controller
  • Handles batch TLB misses
  Plug-n-play accelerator design
  • Making the interface general enough at least for a class of accelerators
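One way to read "handles batch TLB misses" is that the DMA controller collects every untranslated page of a request and asks the core for all of them in one round trip, instead of one round trip per miss. The sketch below illustrates that reading only; `BatchTLB`, `translate_batch`, and the dictionary standing in for the core's page table are hypothetical.

```python
class BatchTLB:
    """Toy TLB that resolves all misses of a request in one batch."""

    def __init__(self, lookup_pages):
        # lookup_pages: callable taking a list of virtual page numbers and
        # returning {vpn: ppn}; it stands in for the core's page-table walk.
        self.lookup_pages = lookup_pages
        self.entries = {}
        self.round_trips = 0

    def translate_batch(self, vpns):
        # dict.fromkeys removes duplicates while keeping request order
        missing = [v for v in dict.fromkeys(vpns) if v not in self.entries]
        if missing:
            self.entries.update(self.lookup_pages(missing))
            self.round_trips += 1  # one request to the core per batch
        return [self.entries[v] for v in vpns]

page_table = {vpn: 0x800 + vpn for vpn in range(16, 32)}  # made-up mapping
tlb = BatchTLB(lambda vpns: {v: page_table[v] for v in vpns})

ppns = tlb.translate_batch([16, 17, 18, 19])   # four misses, one round trip
again = tlb.translate_batch([18, 19, 20])      # only page 20 is missing
```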

ARC Phase-3 Architecture

A partial realization of the proposed accelerator-rich CMP onto Xilinx ML605 (Virtex-6)
Global accelerator manager (GAM) for accelerator sharing
Shared on-chip buffers: many more accelerators than buffer bank resources
Virtual addressing in the accelerators, accelerator virtualization
Virtual addressing DMA, with on-demand TLB filling from the core
No network-on-chip, no buffer sharing with cache, no customized instruction in the core

[Figure: block diagram. Components: ACC0 to ACC3 with ACC wrappers 0 to 3; DMAC0 to DMAC3; Buffer0 to Buffer3 on AXI_B0 to AXI_B3; IOMMU; GAM; Core; AXI; AXILite; Mailbox 0 and Mailbox 1 (labels Core-GAM and Core-IOMMU); DRAM; Mutex, INTC, MDM, Timer, UART, Ethernet. Legend: bus master, bus slave; AXI bus, AXILite bus, FSL, AXIStream.]

Performance and Power Results

Benchmarking kernel: [Equation: y(i) defined in terms of x(i) and the constants 2 through 9, for i = 0...4096]

Results

Platform | Runtime (us) | Power (W) | EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz | 1,802 | 2 | 8,050,786X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core | 562 | 80 | 2,069,261X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core | 10,061 | 65 | 7,947X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU | 852,163 | 72 | 1X
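The EDP gains listed here follow from the runtime and power columns. The sketch below assumes that EDP is computed as power × runtime² and that the UltraSPARC T1 row is the baseline; both assumptions reproduce the listed values after rounding. The dictionary keys are shorthand introduced here.

```python
# (runtime in us, power in W) copied from the Phase-3 results
platforms = {
    "CHP prototype (ML605)": (1802, 2),
    "2x Xeon E5405": (562, 80),
    "Xeon 5150": (10061, 65),
    "UltraSPARC T1": (852163, 72),
}

def edp(runtime_us, power_w):
    # Energy-delay product: energy (power x time) multiplied by time again
    return power_w * runtime_us ** 2

baseline = edp(*platforms["UltraSPARC T1"])
edp_gain = {
    name: round(baseline / edp(*row)) for name, row in platforms.items()
}
```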

Impact of Communication & Computation Overlapping

[Figure: two timelines (us, 0 to 2200) with lanes Core, GAM, ACC, DMAC, comparing "Pipelined Communication & Computation" with "No pipeline". Labeled events include: reservation request sent; parameter sent; GAM reserves Acc; GAM passes parameter; IOMMU requests Pages 0-4 and Pages 5-9; page translations; DMAC page transfers; Acc computation; task done; Acc freed. The difference between the two runs is annotated as 19%.]
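The benefit of overlapping can be estimated with a two-stage pipeline model: while the accelerator computes on page k, the DMAC already transfers page k+1. The per-page times below are placeholders rather than measurements from the prototype, the function names are illustrative, and the model assumes enough buffer space that transfers never stall.

```python
def serial_time(transfer, compute):
    """No overlap: every page is transferred, then computed, in turn."""
    return sum(transfer) + sum(compute)

def pipelined_time(transfer, compute):
    """Overlap the transfer of page k+1 with the computation on page k."""
    xfer_done = 0   # time at which the DMAC finishes the current page
    comp_done = 0   # time at which the accelerator finishes the current page
    for t, c in zip(transfer, compute):
        xfer_done += t                              # transfers run back to back
        comp_done = max(comp_done, xfer_done) + c   # wait for data, then compute
    return comp_done

transfer = [100] * 4   # hypothetical us per page
compute = [30] * 4     # hypothetical us per page

t_serial = serial_time(transfer, compute)
t_pipe = pipelined_time(transfer, compute)
saving = 1 - t_pipe / t_serial
```

In this model the saving is bounded by the shorter of the two stages, so a transfer-dominated run gains at most the total compute time minus one page of computation.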

Overhead of Buffer Sharing: Bank Access Contention (1)

[Figure: two timelines (us, 0 to 2200; lanes Core, GAM, ACC, DMAC), one per allocation listed below, with per-page translation, transfer, and compute events labeled. The runtime difference between the two is annotated as 3.2%.]

The 4 logical buffers are allocated to 4 separate buffer banks
The 4 logical buffers are allocated to 1 buffer bank

Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time.
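The stated reason can be illustrated with a round-based toy model: each access spends some time in an AXI transaction, which can overlap across masters, and then briefly occupies its buffer bank, which serializes accesses to the same bank. The cycle counts and the function `total_time` are hypothetical and are not meant to reproduce the measured overhead; they only show why sharing one bank costs little when the bus transaction dominates.

```python
def total_time(bank_of, accesses, t_axi, t_bank):
    """Round-based toy model of masters accessing buffer banks.

    bank_of[m] is the bank used by master m. Each access spends t_axi on
    the bus (overlappable across masters) and then t_bank occupying its
    bank (serialized per bank).
    """
    ready = [0] * len(bank_of)   # time each master may issue its next access
    bank_free = {}               # time at which each bank becomes idle
    for _ in range(accesses):
        arrivals = sorted((ready[m] + t_axi, m) for m in range(len(bank_of)))
        for arrive, m in arrivals:
            start = max(arrive, bank_free.get(bank_of[m], 0))
            bank_free[bank_of[m]] = start + t_bank
            ready[m] = start + t_bank
    return max(ready)

# Hypothetical costs in cycles: the bus transaction dominates the bank access
separate = total_time([0, 1, 2, 3], accesses=100, t_axi=20, t_bank=1)
shared = total_time([0, 0, 0, 0], accesses=100, t_axi=20, t_bank=1)
overhead = shared / separate - 1
```

With these numbers the four masters fall into a staggered pattern after the first round and rarely wait for the shared bank afterwards.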

Overhead of Buffer Sharing: Bank Access Contention (2)

[Figure: two timelines (us, 0 to 2300; lanes Core, GAM, ACC, DMAC), one per allocation listed below. Labeled events include: parameter sent; GAM reserves; IOMMU requests Pages 0-4 and Pages 5-9; Acc computation; task done; Acc freed. The runtime difference between the two is annotated as 2.7%.]

The 4 logical buffers are allocated to 4 separate buffer banks
The 4 logical buffers are allocated to 1 buffer bank

Area Breakdown

Slice Logic Utilization
Number of Slice Registers: 105,969 out of 301,440: 35%
Number of Slice LUTs: 93,755 out of 150,720: 62%
• Number used as logic: 80,410 out of 150,720: 53%
• Number used as Memory: 7,406 out of 58,400: 12%

Slice Logic Distribution:
Number of occupied Slices: 32,779 out of 37,680: 86%
Number of LUT Flip Flop pairs used: 112,772
• Number with an unused Flip Flop: 25,037 out of 112,772: 22%
• Number with an unused LUT: 19,017 out of 112,772: 16%
• Number of fully used LUT-FF pairs: 68,718 out of 112,772: 60%

[Figure: labeled blocks: Microblaze0 (Linux); Microblaze1 (GAM); AXI-DDR; DDR Controller; Ethernet DMA; Ethernet; Accelerator (Sum of 10 SQRTs); IOMMU; Buffer Selectors; DMAC0 to DMAC3; AXILite; AXI-BUF0 to AXI-BUF3; BUF0-CRTL to BUF3-CRTL.]

Phase-4 ARC Goals

Finding bottlenecks and system enhancement
Communication bottleneck
• Crossbar design instead of AXI bus
• Speed up the AXI non-burst implementation

Accelerator Memory System Design

Crossbar
• In addition to what was previously proposed, now supports partial configuration
• Partial configuration will not affect working LCAs
• Passed on-board test
Hierarchical DMACs
• Data transfer between main memory and shared buffer banks
• The number of buffer banks can be large
• Want to keep the AXI bus size
• Hierarchical DMACs and buses
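A hierarchical arrangement can be pictured as a two-level lookup: a top-level select picks the DMAC, and that DMAC's local bus picks one of its banks. The sketch assumes nine banks split evenly over three DMACs (the figure labels buffer banks up to 9 and DMAC1 to DMAC3, but the actual grouping is not stated), and `route_bank` is a hypothetical helper.

```python
NUM_DMACS = 3        # DMAC1..DMAC3 as labeled in the figure
BANKS_PER_DMAC = 3   # assumed even split of nine banks

def route_bank(bank):
    """Map a 1-based buffer bank number to (dmac, local_port), both 1-based."""
    if not 1 <= bank <= NUM_DMACS * BANKS_PER_DMAC:
        raise ValueError(f"no such buffer bank: {bank}")
    dmac, port = divmod(bank - 1, BANKS_PER_DMAC)
    return dmac + 1, port + 1

routes = {bank: route_bank(bank) for bank in range(1, 10)}

# Ports seen by the main AXI bus versus a flat design with one port per bank
main_bus_ports = NUM_DMACS
flat_bus_ports = NUM_DMACS * BANKS_PER_DMAC
```

Under this assumed grouping the main bus sees one port per DMAC rather than one per bank, which is the sense in which the hierarchy keeps the AXI bus size in check as banks are added.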

[Figure: accelerator memory system. Components: IOMMU; buffer banks (1 to 4 and 9 labeled); AXI buses; DMAC1 to DMAC3; Select-bit Receiver; GAM; Main AXI bus to DDR; LCA1 to LCA4; OC core.]

Crossbar Results
