1
CDSC CHP Prototyping
Yu-Ting Chen, Jason Cong, Mohammad Ali Ghodrat,
Muhuan Huang, Chunyue Liu, Bingjun Xiao, Yi Zou
[Figure: accelerator-rich CMP floorplan. Legend entries: Core (C), L2 Bank ($2), Router, Accelerator (A), Accelerator & BiN Manager (ABM).]
2
Accelerator-Rich Architectures: ARC, CHARM, BiN
[Figure: the accelerator-rich CMP floorplan from slide 1, repeated.]
3
Goals
Implement the architecture features and support in the prototype system
Architecture proposals
• Accelerator-rich CMPs
• CHARM
• Hybrid cache
• Buffer-in-NUCA, etc.
Bridge different thrusts in CDSC
4
Server-Class Platform: HC-1ex Architecture
Xeon Quad Core LV5408
• 40 W TDP
Tesla C1060
• 100 GB/s off-chip bandwidth
• 200 W TDP
4 XC6vlx760 FPGAs
• 80 GB/s off-chip bandwidth
• 90 W design power
5
Drawback of the Commodity Systems
Limited ability to customize at the architecture level
Board-level integration rather than chip-level integration
Commodity systems can only reach a certain level; further innovation is needed
6
CHP Prototyping Plan
Create the working hardware and software
Use FPGA Extensible Processing Platform (EPP) as the platform
• Reuse existing FPGA IPs as much as possible
Work in multiple phases
7
Target Platforms: Xilinx ML605 and Zynq
Zynq: dual-core ARM Cortex-A9 with programmable logic
ML605: Virtex-6-based board
8
CHP Prototyping Phases
ARC Implementation
Phase 1: Basic platform
• Accelerator and software GAM
Phase 2: Adding modularity using available IP
• E.g. Xilinx DMAC IP
Phase 3: First step toward BiN
• Shared buffer
• Customized modules (e.g. DMA controller, plug-n-play accelerator)
Phase 4: System enhancement
• Crossbar
• AXI implementation
CHARM Implementation
9
ARC Phase 1 Goals
Setting up a basic environment
• Multi-core + simple accelerators + OS
• Understanding the system interactions in more detail
Simple controller as GAM (global accelerator manager)
• Supports system-level sharing of multiple accelerators of the same type
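The sharing role of the GAM can be illustrated with a minimal Python sketch. It is a toy model, not the prototype's firmware: the class `SoftwareGAM`, its `reserve` and `release` methods, and the instance counts are hypothetical, and only the `vecadd` and `vecsub` type names are borrowed from the Phase 1 example system.

```python
from collections import defaultdict


class SoftwareGAM:
    """Toy global accelerator manager: hands out free instances per type."""

    def __init__(self, instances_per_type):
        # One free list per accelerator type, e.g. {"vecadd": [0, 1]}.
        self.free = {t: list(range(n)) for t, n in instances_per_type.items()}
        self.busy = defaultdict(set)

    def reserve(self, acc_type):
        """Return a free instance index, or None if all instances are busy."""
        pool = self.free.get(acc_type, [])
        if not pool:
            return None
        idx = pool.pop(0)
        self.busy[acc_type].add(idx)
        return idx

    def release(self, acc_type, idx):
        """Return a previously reserved instance to the free list."""
        if idx not in self.busy[acc_type]:
            raise ValueError(f"{acc_type}[{idx}] is not reserved")
        self.busy[acc_type].remove(idx)
        self.free[acc_type].append(idx)


# Assumed configuration: two instances of each accelerator type.
gam = SoftwareGAM({"vecadd": 2, "vecsub": 2})
first = gam.reserve("vecadd")
second = gam.reserve("vecadd")
third = gam.reserve("vecadd")   # both instances busy, request is refused
gam.release("vecadd", first)
fourth = gam.reserve("vecadd")  # the freed instance is handed out again
```

In this model a refused request simply returns `None`; a real manager might queue the request instead, which the slides do not specify.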
10
ARC Phase 1 Example System Diagram
[Figure: block diagram. Components: Microblaze-0 (Linux with MMU); Microblaze-1 (GAM; bare-metal, no MMU); AXI4 (xbar); AXI4lite (bus); DDR3; timer, uart, mutex; Mailbox (vecadd) and Mailbox (vecsub); two vecadd and two vecsub accelerators; FSL links.]
11
ARC Phase-2 Goals
Implementing a system similar to the original ARC design
• GAM, Accelerator, DMA-Controller, SPM
Adding modularity using available IP
• E.g. Xilinx DMAC IP
12
ARC Phase-2 Architecture
ARC Phase-2 Performance and Power Results
Benchmarking kernel:
[Equation not recoverable from the transcript: y(i) as a function of x(i), for i = 0...4096]
Results
Platform | Runtime (us) | Power (W) | EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz | 1,746 | 2 | 17,570X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core | 562 | 80 | 1,365X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core | 10,061 | 65 | 94X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU | 852,163 | 72 | 1X
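As a cross-check on the Phase-2 results, the gain column can be recomputed from the runtime and power columns. The sketch below assumes energy = power × runtime at constant power and normalizes to the UltraSPARC T1 row. Under that assumption the listed gains (17,570X, 1,365X, 94X) match the energy ratio. The dictionary keys and function name are illustrative.

```python
# (runtime in us, power in W), copied from the Phase-2 results
phase2 = {
    "CHP prototype (ML605)": (1746, 2),
    "2x Xeon E5405": (562, 80),
    "Xeon 5150": (10061, 65),
    "UltraSPARC T1": (852163, 72),
}


def energy_uj(runtime_us, power_w):
    """Energy in microjoules, assuming constant power over the run."""
    return runtime_us * power_w


baseline_energy = energy_uj(*phase2["UltraSPARC T1"])
energy_gain = {
    name: baseline_energy / energy_uj(*row) for name, row in phase2.items()
}

for name, gain in energy_gain.items():
    print(f"{name}: {gain:,.0f}X")
```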
ARC Phase-2 Runtime Breakdown
[Figure: timeline, time axis in us (0 to 700), lanes Core, GAM, ACC, DMAC, four pages P0 to P3. Labeled events include: reservation request sent; GAM reserves acc; reservation succeeded; parameter sent; GAM passes parameter; acc wrapper partitions task; DMAC wrapper requests Page 0 to Page 3; Page 0 to Page 3 translated; DMAC transfers Page 0 to Page 3; acc computes; acc done; GAM passes done signal; task done; acc freed. One interval is annotated 11.91 us.]
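The "Acc wrapper partitions task" step and the per-page DMAC requests in the timeline suggest that a task is split along page boundaries, so that each piece needs a single address translation. A minimal sketch of such a split follows. The 4 KiB page size, the example addresses, and the helper `split_into_pages` are assumptions made for illustration and are not stated in the slides.

```python
PAGE_SIZE = 4096  # assumed page size in bytes


def split_into_pages(vaddr, nbytes, page_size=PAGE_SIZE):
    """Split [vaddr, vaddr + nbytes) into chunks that never cross a page.

    Returns a list of (virtual_page_number, offset_in_page, length) tuples.
    """
    chunks = []
    addr, remaining = vaddr, nbytes
    while remaining > 0:
        offset = addr % page_size
        length = min(page_size - offset, remaining)
        chunks.append((addr // page_size, offset, length))
        addr += length
        remaining -= length
    return chunks


# A page-aligned 16 KiB buffer maps onto exactly four pages.
aligned = split_into_pages(0x10000000, 4 * PAGE_SIZE)
# The same size starting 256 bytes into a page touches five pages.
unaligned = split_into_pages(0x10000100, 4 * PAGE_SIZE)
```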
ARC Phase-2 Area Breakdown
Slice Logic Utilization
Number of Slice Registers: 45,283 out of 301,440: 15%
Number of Slice LUTs: 40,749 out of 150,720: 27%
• Number used as logic: 32,505 out of 150,720: 21%
• Number used as Memory: 5,248 out of 58,400: 8%
Slice Logic Distribution:
Number of occupied Slices: 17,621 out of 37,680: 46%
Number of LUT Flip Flop pairs used: 54,323
• Number with an unused Flip Flop: 14,617 out of 54,323: 26%
• Number with an unused LUT: 13,574 out of 54,323: 24%
• Number of fully used LUT-FF pairs: 26,132 out of 54,323: 48%
[Figure: FPGA floorplan with regions labeled DMAC wrapper, AXI, Microblaze (Linux), Microblaze (GAM), DRAM Controller, Ethernet, Ethernet DMA, DMAC, AXILite, Accelerator.]
ARC Phase-3 Goals
First step toward BiN:
• Shared buffer
Designing our customized modules
Customized DMA-controller
• Handles batch TLB misses
Plug-n-play accelerator design
• Making the interface general enough for at least a class of accelerators
ARC Phase-3 Architecture
A partial realization of the proposed accelerator-rich CMP on the Xilinx ML605 (Virtex-6)
• Global accelerator manager (GAM) for accelerator sharing
• Shared on-chip buffers: many more accelerators than buffer bank resources
• Virtual addressing in the accelerators, accelerator virtualization
• Virtual-addressing DMA, with on-demand TLB filling from the core
• No network-on-chip, no buffer sharing with cache, no customized instruction in the core
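On-demand TLB filling with batched misses can be pictured with a small Python model. This is a sketch under stated assumptions, not the prototype's IOMMU: the class `ToyIOMMU`, the dictionary standing in for the core's page table, and the batch size of 5 (chosen to mirror the "Pages 0-4" and "Pages 5-9" labels in the later timelines) are all hypothetical.

```python
class ToyIOMMU:
    """Toy IOMMU: on a TLB miss, asks the core for a batch of translations."""

    def __init__(self, page_table, batch_size=5):
        self.page_table = page_table  # stands in for the core's page table
        self.batch_size = batch_size  # assumed number of pages per request
        self.tlb = {}
        self.core_requests = 0        # how many times the core was interrupted

    def translate(self, vpn):
        """Return the physical frame for a virtual page number.

        Raises KeyError if the page is not mapped in the page table.
        """
        if vpn not in self.tlb:
            # Miss: one request to the core fills several consecutive entries.
            self.core_requests += 1
            for page in range(vpn, vpn + self.batch_size):
                if page in self.page_table:
                    self.tlb[page] = self.page_table[page]
        return self.tlb[vpn]


# Ten mapped pages; the frame numbers are arbitrary example values.
page_table = {vpn: 0x100 + vpn for vpn in range(10)}
iommu = ToyIOMMU(page_table)
frames = [iommu.translate(vpn) for vpn in range(10)]
```

With a batch size of 5, a sequential walk over ten pages reaches the core twice instead of ten times, which is the effect batching is meant to have.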
[Figure: Phase-3 block diagram. Components: ACC0 to ACC3 with ACC wrappers 0 to 3; DMAC0 to DMAC3; Buffer0 to Buffer3 on AXI_B0 to AXI_B3; IOMMU; GAM; Core; DRAM; AXI and AXILite buses; Mailbox 0 and Mailbox 1; links labeled Core-GAM and Core-IOMMU; Mutex, INTC, MDM, Timer, UART, Ethernet. Legend: bus master, bus slave; AXI bus, AXILite bus, FSL, AXIStream.]
Performance and Power Results
Benchmarking kernel:
[Equation not recoverable from the transcript: y(i) as a function of x(i), for i = 0...4096]
Results
Platform | Runtime (us) | Power (W) | EDP (energy-delay product) gain
CHP prototype on Xilinx FPGA ML605 @ 100MHz | 1,802 | 2 | 8,050,786X
2x Quad-core Intel Xeon CPU E5405 x64 @ 2.00GHz, 1 FPU per core | 562 | 80 | 2,069,261X
Dual-core Intel Xeon CPU 5150 x32 @ 2.66GHz, 1 FPU per core | 10,061 | 65 | 7,947X
16-Core UltraSPARC T1 @ 1.2 GHz, 1 shared FPU | 852,163 | 72 | 1X
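The EDP gain column here can be reproduced from the other two columns. This sketch assumes EDP = power × runtime² (energy times delay) and normalizes to the UltraSPARC T1 row; the names are illustrative.

```python
# (runtime in us, power in W), copied from the Phase-3 results
phase3 = {
    "CHP prototype (ML605)": (1802, 2),
    "2x Xeon E5405": (562, 80),
    "Xeon 5150": (10061, 65),
    "UltraSPARC T1": (852163, 72),
}


def edp(runtime_us, power_w):
    """Energy-delay product: (power * runtime) * runtime."""
    return power_w * runtime_us ** 2


baseline_edp = edp(*phase3["UltraSPARC T1"])
edp_gain = {name: baseline_edp / edp(*row) for name, row in phase3.items()}

for name, gain in edp_gain.items():
    print(f"{name}: {gain:,.0f}X")
```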
Impact of Communication & Computation Overlapping
[Figure: two timelines, time axis in us (0 to 2200), lanes Core, GAM, ACC, DMAC, captioned "Pipelined Communication & Computation" and "No pipeline". Labeled events include: reservation request sent; GAM reserves acc; reservation succeeded; parameter sent; GAM passes parameter; acc wrapper partitions task; IOMMU requests Pages 0-4 and Pages 5-9; pages translated (in batches in one panel, page by page for Pages 0 to 7 in the other); DMAC transfers; acc computation; GAM passes done signal; task done; acc freed. Annotated difference between the two panels: 19%.]
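A simple timing model illustrates why overlapping communication with computation shortens a task. The sketch assumes every page has the same transfer time and the same compute time, and it ignores address translation, write-back, and GAM overhead. The functions and the example numbers are hypothetical and are not meant to reproduce the 19% figure.

```python
def sequential_time(n_pages, t_transfer, t_compute):
    """No pipeline: each page is transferred, then computed, one at a time."""
    return n_pages * (t_transfer + t_compute)


def pipelined_time(n_pages, t_transfer, t_compute):
    """Two-stage pipeline: page k+1 is transferred while page k is computed."""
    if n_pages == 0:
        return 0
    # Fill the pipeline, run at the pace of the slower stage, then drain.
    return t_transfer + (n_pages - 1) * max(t_transfer, t_compute) + t_compute


# Hypothetical per-page costs in microseconds.
seq = sequential_time(8, 100, 60)
pipe = pipelined_time(8, 100, 60)
saving = 1 - pipe / seq
print(f"sequential {seq} us, pipelined {pipe} us, saving {saving:.0%}")
```

In this model the saving grows as the two stage times approach each other, and it vanishes when one stage is negligible.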
Overhead of Buffer Sharing: Bank Access Contention (1)
[Figure: two timelines, time axis in us (0 to 2200), lanes Core, GAM, ACC, DMAC, with per-page translation, transfer and compute steps for Pages 0 to 7. Annotated difference between the two panels: 3.2%.]
Panel captions:
• The 4 logical buffers are allocated to 4 separate buffer banks
• The 4 logical buffers are allocated to 1 buffer bank
Reason: the AXI bus allows masters to issue transactions simultaneously, and the AXI transaction time dominates the buffer access time
Overhead of Buffer Sharing: Bank Access Contention (2)
[Figure: two timelines, time axis in us (0 to 2200 in one panel, 0 to 2300 in the other), lanes Core, GAM, ACC, DMAC, with batched steps: IOMMU requests Pages 0-4 and Pages 5-9, pages translated, DMAC transfers, acc computation. Annotated difference between the two panels: 2.7%.]
Panel captions:
• The 4 logical buffers are allocated to 4 separate buffer banks
• The 4 logical buffers are allocated to 1 buffer bank
Area Breakdown
Slice Logic Utilization
Number of Slice Registers: 105,969 out of 301,440: 35%
Number of Slice LUTs: 93,755 out of 150,720: 62%
• Number used as logic: 80,410 out of 150,720: 53%
• Number used as Memory: 7,406 out of 58,400: 12%
Slice Logic Distribution:
Number of occupied Slices: 32,779 out of 37,680: 86%
Number of LUT Flip Flop pairs used: 112,772
• Number with an unused Flip Flop: 25,037 out of 112,772: 22%
• Number with an unused LUT: 19,017 out of 112,772: 16%
• Number of fully used LUT-FF pairs: 68,718 out of 112,772: 60%
[Figure: FPGA floorplan with regions labeled Microblaze0 (Linux), Microblaze1 (GAM), AXI-DDR, DDR Controller, Ethernet, Ethernet DMA, Accelerator (Sum of 10 SQRTs), IOMMU, Buffer Selectors, DMAC0 to DMAC3, AXILite, AXI-BUF0 to AXI-BUF3, BUF0-CRTL to BUF3-CRTL.]
Phase-4 ARC Goals
Finding bottlenecks and system enhancement
Communication bottleneck
• Crossbar design instead of AXI-bus
• Speed-up AXI non-burst implementation
24
Crossbar
• In addition to what was previously proposed, now supports partial configuration
• Will not affect working LCAs
• Passed on-board test
Hierarchical DMACs
• Data transfer between main memory and shared buffer banks
• # of buffer banks can be large
• Want to keep AXI bus size
• Hierarchical DMACs and buses
Accelerator Memory System Design
[Figure: block diagram. Components: IOMMU; buffer banks 1 to 4 and buffer bank 9; AXI buses; DMAC1 to DMAC3; Select-bit Receiver; GAM; main AXI bus to DDR; LCA1 to LCA4; OC core.]
25
Crossbar Results