
The Maven Vector-Thread Architecture

Yunsup Lee¹, Rimas Avizienis¹, Alex Bishara¹, Richard Xia¹, Derek Lockhart², Christopher Batten², Krste Asanovic¹

¹Parallel Computing Laboratory, UC Berkeley
²Computer Systems Laboratory, Cornell University

1 Motivation

Future manycore processors will be energy-constrained, and thus the primary metric for evaluating these architectures will be their energy efficiency. In this work, we investigate new architectural and microarchitectural mechanisms which enable a wider array of applications to be mapped to energy-efficient vector units.

2 Architectural Patterns

Three different architectural patterns (excluding subword-SIMD and SIMT) were evaluated in terms of their performance, energy efficiency, and area. Maven is a new vector-thread architecture, which is based on a vector-SIMD architecture, adds minimal hardware to support irregular DLP well, and is considerably simpler to implement than previous vector-thread designs.
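To make the distinction between regular and irregular DLP concrete, here is a minimal Python sketch. Both kernels are illustrative assumptions chosen for this example and are not taken from the poster. In each, every element can be processed independently, but only the second has data-dependent control flow.

```python
def saxpy(a, x, y):
    """Regular DLP: every element performs the same fixed sequence of operations."""
    return [a * xi + yi for xi, yi in zip(x, y)]


def collatz_steps(values):
    """Irregular DLP: elements are independent, but each one runs a
    data-dependent number of loop iterations with a data-dependent branch."""
    out = []
    for v in values:
        steps = 0
        while v != 1:
            # The branch direction depends on the current data value.
            v = v // 2 if v % 2 == 0 else 3 * v + 1
            steps += 1
        out.append(steps)
    return out
```

The first loop maps directly onto vector instructions. The second is the kind of loop where elements diverge, which is the case the poster says Maven supports with minimal added hardware.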

3 Maven Programming Methodology

4 Maven Tile Microarchitecture

We focus on comparing the various architectural design patterns with respect to a single data-parallel tile. Example tiles are a MIMD tile, a vector tile with four single-lane cores, or one four-lane core. To manage complexity of many design points, we developed a library of parameterized synthesizable RTL components.
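The evaluation figures identify each design point with a compact label such as mimd-c4r32 or vt-c1v4r256+bi+2s. The sketch below parses such labels into a small record. The label grammar is an assumption inferred from the labels themselves: c appears to give cores per tile, v lanes per core, r the number of physical registers, and the + suffixes (b, bi, 2s, d, mc) presumably select the banking, per-bank integer ALU, 2-stack, density-time, and memory-coalescing options named in the figure titles. The names `TileConfig` and `parse_label` are hypothetical.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Assumed grammar: <pattern>-c<cores>[v<lanes>][r<regs>][+option]...
LABEL_RE = re.compile(
    r"^(?P<pattern>[a-z]+)-c(?P<cores>\d+)"
    r"(?:v(?P<lanes>\d+))?(?:r(?P<regs>\d+))?"
    r"(?P<opts>(?:\+[a-z0-9]+)*)$"
)


@dataclass
class TileConfig:
    pattern: str                   # e.g. "mimd", "vsimd", "vt"
    cores: int                     # cores per tile
    lanes_per_core: Optional[int]  # None when the label has no v field
    regs: Optional[int]            # None when the label has no r field
    options: List[str]             # e.g. ["bi", "2s"]


def parse_label(label: str) -> TileConfig:
    """Parse a full tile label; raises ValueError on anything else."""
    m = LABEL_RE.match(label)
    if m is None:
        raise ValueError(f"unrecognized tile label: {label!r}")
    lanes, regs = m.group("lanes"), m.group("regs")
    options = [o for o in m.group("opts").split("+") if o]
    return TileConfig(
        pattern=m.group("pattern"),
        cores=int(m.group("cores")),
        lanes_per_core=int(lanes) if lanes else None,
        regs=int(regs) if regs else None,
        options=options,
    )
```

For example, `parse_label("vt-c1v4r256+bi+2s")` yields one core with four lanes, 256 registers, and the options `bi` and `2s`. Abbreviated axis labels such as vsimd+bi do not carry a core count and are rejected by this sketch.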

Tile Configurations

Maven Core Microarchitecture

5 Evaluation Framework

Toolflow

An Example VLSI Layout

[Figure: example VLSI layout of a tile; labeled blocks: Core 0, Core 1, Core 2, Core 3, D$, XBAR.]

6 Evaluation Results

We first compare tile configurations based on their cycle time and area before exploring the impact of various microarchitectural optimizations. We then compare the implementation efficiency and performance of the Maven VT pattern against the MIMD and vector-SIMD patterns for the six application kernels.
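The result plots use normalized axes such as "Normalized Tasks / Sec" and "Normalized Tasks / Second / Area". The sketch below shows one plausible way such values could be computed from raw measurements. The choice of baseline, the `Measurement` type, and the numbers in the usage note are placeholders for illustration, not data from the poster.

```python
from dataclasses import dataclass


@dataclass
class Measurement:
    tasks_per_sec: float  # raw throughput of one design point
    area_mm2: float       # tile area of that design point


def normalized_perf(m: Measurement, base: Measurement) -> float:
    """Throughput relative to the baseline configuration."""
    return m.tasks_per_sec / base.tasks_per_sec


def normalized_perf_per_area(m: Measurement, base: Measurement) -> float:
    """Throughput per unit area relative to the baseline configuration."""
    return (m.tasks_per_sec / m.area_mm2) / (base.tasks_per_sec / base.area_mm2)
```

With made-up values, `Measurement(250.0, 2.5)` against a baseline of `Measurement(100.0, 2.0)` gives a normalized performance of 2.5 but an area-normalized performance of 2.0, since the faster design is also larger.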

Area & Cycle Time

[Figure: area breakdown per tile configuration (mimd-c4, vsimd+bi, vt-c4v1, vt-c4v1+bi, vt-c1v4+bi+2s, at register counts up to r256 and with the +b, +bi, +2s, +d, and +mc options); components: ctrl, reg, mem, fp, int, cp, i$, d$.]

Configuration            Power (Statistical / Simulated)   Total Area (mm²)   Cycle Time (ns)
mimd-c4r32               149 / 137-181                     3.7                1.10
mimd-c4r64               216 / 130-247                     4.0                1.13
mimd-c4r128              242 / 124-261                     4.2                1.19
mimd-c4r256              299 / 221-298                     4.7                1.27
vsimd-c4v1r256+bi        396 / 213-331                     5.6                1.37
vsimd-c1v4r256+bi        224 / 137-252                     3.9                1.46
vt-c4v1r256              428 / 162-318                     6.3                1.47
vt-c4v1r256+b            404 / 147-271                     5.6                1.31
vt-c4v1r256+bi           445 / 172-298                     5.9                1.32
vt-c4v1r256+bi+2s        409 / 225-304                     5.9                1.32
vt-c4v1r256+bi+2s+d      410 / 168-300                     5.9                1.36
vt-c1v4r256+bi+2s        205 / 111-167                     3.9                1.42
vt-c1v4r256+bi+2s+mc     223 / 118-173                     4.0                1.42

Performance and Energy Efficiency

Impact of Additional Physical Registers, Intra-Lane Regfile Banking, and Additional Per-Bank Integer ALUs

[Figure: results plotted against Normalized Tasks / Sec for mimd-c4, vt-c4v1, vt-c4v1+b, and vt-c4v1+bi at register counts r32 to r256; breakdown legend: ctrl, reg, mem, fp, int, cp, i$, d$, leak.]

Impact of Density-Time Execution and Stack-Based Convergence Schemes / Impact of Memory Coalescing

[Figure: two plots against Normalized Tasks / Sec. Convergence schemes: FIFO, FIFO+dt, 1-stack, 1-stack+dt, 2-stack, 2-stack+dt, cmv, cmv+2-stack+dt. Memory access: vec ld/st, uT ld/st, uT ld/st + mem coalescing.]

Implementation Efficiency and Performance for MIMD, vector-SIMD, and VT Patterns Running Application Kernels

[Figure: six panels, one per application kernel (viterbi, radix sort, kmeans, dither, physics sim., string search); x-axis: Normalized Tasks / Second / Area; point labels include r32, mcore/mlane, and mlane.]

7 Conclusions

1. The Maven vector-thread architecture is more area and energy efficient than MIMD architectures on regular DLP and (surprisingly) on irregular DLP

2. The Maven vector-thread architecture is a promising alternative to traditional vector-SIMD architectures, providing greater efficiency and easier programmability

3. Using real RTL implementations and a standard ASIC toolflow is necessary to compare energy-optimized future architectures

For more details, see our ISCA '11 paper below or use the QR code on the right with a barcode reader. "Exploring the Tradeoffs between Programmability and Efficiency in Data-Parallel Accelerators"

This work was supported in part by Microsoft (Award #024263) and Intel (Award #024894, equipment donations) funding and by matching funding from U.C. Discovery (Award #DIG07-10227). The authors acknowledge and thank Jiongjia Fang and Ji Kim for their help writing application kernels, Christopher Celio for his help writing Maven software and developing the vector-SIMD instruction set, and Hidetaka Aoki for his early feedback on the Maven microarchitecture.

Yunsup Lee - 577C Soda Hall, Berkeley, CA 94720 - yunsup@cs.berkeley.edu - http://www.cs.berkeley.edu/~yunsup
