TRANSCRIPT
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson,
Koushik Sen, John Wawrzynek
[email protected] http://aspire.eecs.berkeley.edu
UC Berkeley Future Application Drivers!
2
UC Berkeley Compute Energy “Iron Law”
§ When power is constrained, need better energy efficiency for more performance
§ Where performance is constrained (real-time), want better energy efficiency to lower power
Improving energy efficiency is a critical goal for all future systems and workloads
3
Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)
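As a units check with illustrative numbers (not from the talk): a processor drawing 10 Joules/Second at an efficiency of 2×10⁹ Tasks/Joule delivers

$$10\ \tfrac{\mathrm{J}}{\mathrm{s}} \times 2\times 10^{9}\ \tfrac{\mathrm{Tasks}}{\mathrm{J}} = 2\times 10^{10}\ \tfrac{\mathrm{Tasks}}{\mathrm{s}},$$

so at a fixed power budget, performance improves only as much as tasks/joule does.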
UC Berkeley Good News: Moore’s Law Continues!
4
“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965
UC Berkeley
Bad News: Dennard (Voltage) Scaling Over!
5
• Voltage scaling slowed drastically
• Asymptotically approaching threshold voltage
Why did we hit a power/cooling wall?
The good old days of Dennard scaling vs. today, now that Dennard scaling is dead:
Dynamic power density = Ng × Cload × V² × f, where Ng = CMOS gates/unit area, Cload = capacitive load/CMOS gate, f = clock frequency, V = supply voltage
[Figure: Dennard vs. post-Dennard scaling of each factor; data courtesy S. Borkar/Intel 2011]
Energy scaling ended in 2005 [Moore, ISSCC Keynote, 2003]
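The arithmetic behind the wall, as a sketch of the standard Dennard argument (with S > 1 the linear scale factor per generation; textbook reasoning, not text recovered from the slide): power density is

$$P \;=\; N_g\, C_{load}\, V^2 f.$$

Under Dennard scaling, $N_g \to S^2 N_g$, $C_{load} \to C_{load}/S$, $V \to V/S$, $f \to S f$, so $P \to S^2 \cdot S^{-1} \cdot S^{-2} \cdot S\, P = P$: constant power density. Post-Dennard, with V stuck near threshold, the $S^{-2}$ factor disappears and the same shrink gives $P \to S^2 P$: the power/cooling wall.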
UC Berkeley
1st Impact of End of Scaling: End of Sequential Processor Era!
6
UC Berkeley Parallelism: A one-time gain
Use more, slower cores for better energy efficiency. Either
§ simpler cores, or
§ run cores at lower Vdd/frequency
§ Even simpler general-purpose microarchitectures? - Limited by smallest sensible core
§ Even lower Vdd/frequency? - Limited by Vdd/Vt scaling, errors (see the sketch below)
§ Now what?
7
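The energy reasoning behind this slide, assuming dynamic power dominates (standard CMOS argument, not from the slide text): energy per operation scales roughly as

$$E_{op} \;\propto\; C_{load} V^2,$$

and a core clocked at lower f can run at lower V, so two cores at half frequency and reduced supply match the original tasks/second at fewer joules/task. The gain is one-time because V cannot drop much below Vt and cores cannot get simpler than the smallest sensible core.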
UC Berkeley
[Muller, ARM CTO, 2009]
2nd Impact of End of Scaling: “Dark Silicon”
Cannot switch all transistors at full frequency!
8
No savior device technology is on the horizon. Future energy-efficiency innovations must be above the transistor level.
UC Berkeley The End of General-Purpose Processors?
§ Most computing happens in specialized, heterogeneous processors - Can be 100-1000X more efficient than a general-purpose processor
§ Challenges: - Hardware design costs - Software development costs
9
NVIDIA Tegra2
UC Berkeley
The Real Scaling Challenge: Communication
As transistors become smaller and cheaper, communication dominates performance and energy
10
All scales:
§ Across chip
§ Up and down memory hierarchy
§ Chip-to-chip
§ Board-to-board
§ Rack-to-rack
UC Berkeley ASPIRE: From Better to Best
§ What is the best we can do? - For a fixed target technology (e.g., 7nm)
§ Can we prove a bound?
§ Can we design an implementation approaching the bound?
⇒ Provably Optimal Implementations
11
Specialize and optimize communication and computation across the whole stack from
applications to hardware
UC Berkeley
Communication-Avoiding Algorithms: Algorithm Cost Measures
1. Arithmetic (FLOPS)
2. Communication: moving data between
- levels of a memory hierarchy (sequential case)
- processors over a network (parallel case)
12
[Figure: sequential case as CPU + cache + DRAM; parallel case as multiple CPU + DRAM nodes connected by a network]
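Together these two measures give the runtime model used on the strong-scaling slide below (γt, βt, αt are the per-flop, per-word, and per-message costs defined there):

$$T \;=\; \gamma_t F \;+\; \beta_t W \;+\; \alpha_t S,$$

where F counts flops, W counts words moved, and S counts messages.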
UC Berkeley Modeling Runtime & Energy
13
UC Berkeley A few examples of speedups
§ Matrix multiplication
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication
§ QR decomposition (used in least squares, data mining, …)
- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
- Up to 13x on Tesla C2050 / Fermi, for 110K x 100
- Up to 4x on grid of 4 cities (Dongarra, Langou et al)
- “Infinite speedup” for out-of-core on PowerPC laptop: LAPACK thrashed virtual memory, didn’t finish
§ Eigenvalues of band symmetric matrices
- Up to 17x on Intel Gainestown, 8 cores, vs MKL 10.0 (up to 1.9x sequential)
§ Iterative sparse linear equation solvers (GMRES)
- Up to 4.3x on Intel Clovertown, 8 cores
§ N-body (direct particle interactions with cutoff distance)
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.
14
UC Berkeley Modeling Energy: Dynamic
15
UC Berkeley Modeling Energy: Memory Retention
16
UC Berkeley Modeling Energy: Background Power
17
UC Berkeley Energy Lower Bounds
18
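The bodies of these four modeling slides are figures, but the model they build up can be reconstructed from the notation defined on the strong-scaling slide below: per processor, energy is charged for dynamic work, for retaining memory, and for background (leakage) power over the runtime T:

$$E_{proc} \;=\; \underbrace{\gamma_e F + \beta_e W + \alpha_e S}_{\text{dynamic}} \;+\; \underbrace{\delta_e M\,T}_{\text{memory retention}} \;+\; \underbrace{\varepsilon_e T}_{\text{background power}}.$$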
UC Berkeley
Early Result: Perfect Strong Scaling in Time and Energy
§ Every time you add a processor, use its memory M too
§ Start with minimal number of procs: PM = 3n²
§ Increase P by factor c ⇒ total memory increases by factor c
§ Notation for timing model:
- γt, βt, αt = secs per flop, per word moved, per message of size m

$$T(cP) \;=\; \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m\,M^{1/2}}\right] \;=\; \frac{T(P)}{c}$$

§ Notation for energy model:
- γe, βe, αe = Joules for the same operations
- δe = Joules per word of memory used per sec
- εe = Joules per sec for leakage, etc.

$$E(cP) \;=\; cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] + \delta_e M\,T(cP) + \varepsilon_e T(cP)\right\} \;=\; E(P)$$

§ Perfect scaling extends to n-body, Strassen, …
[IPDPS, 2013]
19
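The claimed invariance is quick to verify: substituting T(cP) = T(P)/c, the processor count c cancels in every term of the energy expression,

$$E(cP) \;=\; n^3\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] \;+\; P\,(\delta_e M + \varepsilon_e)\,T(P) \;=\; E(P),$$

so runtime drops by a factor of c at constant total energy.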
UC Berkeley C-A Algorithms Not Just for HPC
§ In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …
§ Initial results on lower bounds for database join algorithms
20
UC Berkeley
From C-A Algorithms to Provably Optimal Systems?
§ 1) Prove lower bounds on communication for a computation
§ 2) Develop an algorithm that achieves the lower bound on a system
§ 3) Find that communication time/energy cost is >90% of the resulting implementation's cost
§ 4) We know we're within 10% of optimal! (see the arithmetic below)
§ Supporting technique: optimizing the software stack and compute engines to reduce compute costs and expose unavoidable communication costs
21
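The arithmetic behind steps 3 and 4: if an implementation's measured cost C is at least 90% communication, and that communication achieves the proven lower bound B, then any implementation costs at least B, so

$$C_{opt} \;\ge\; B \;\ge\; 0.9\,C \quad\Longrightarrow\quad C \;\le\; \tfrac{1}{0.9}\,C_{opt} \;\approx\; 1.11\,C_{opt},$$

i.e., within roughly 10% of optimal, as the slide rounds it.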
UC Berkeley
ESP: An Applications Processor Architecture for ASPIRE
§ Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
§ It is well known how to customize hardware engines for a specific task
§ The ESP challenge is using specialized engines for general-purpose code
22
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
UC Berkeley ESP: Ensembles of Specialized Processors
§ General-purpose hardware: flexible but inefficient
§ Fixed-function hardware: efficient but inflexible
§ Par Lab insight: patterns capture common operations across many applications, each with a unique communication & computation structure
§ Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs
§ Bet: will give us efficiency plus flexibility - Any given core can have a different mix of these engines depending on workload
23
UC Berkeley Par Lab: Motifs common across apps
24
[Figure: applications (audio recognition, object recognition, scene analysis) decompose into Berkeley View “dwarfs” or motifs: dense, graph, sparse, …]
UC Berkeley
25
[Figure: Par Lab apps vs. computing domains, showing motif (née “dwarf”) popularity (red hot / blue cool)]
UC Berkeley
Architecting Parallel Software
Identify the Software Structure:
• Pipe-and-Filter
• Agent-and-Repository
• Event-Based
• Bulk Synchronous
• Map-Reduce
• Layered Systems
• Arbitrary Task Graphs
• Puppeteer
• Model-View-Controller
Identify the Key Computations:
• Graph Algorithms
• Dynamic Programming
• Dense/Sparse Linear Algebra
• Un/Structured Grids
• Graphical Models
• Finite State Machines
• Backtrack / Branch-and-Bound
• N-Body Methods
• Circuits
• Spectral Methods
• Monte Carlo
UC Berkeley Mapping Software to ESP: Specializers
§ Capture desired functionality at a high level, using patterns in a productive high-level language
§ Use pattern-specific compilers (specializers) with autotuners to produce efficient low-level code (a sketch of the idea follows the figure below)
§ ASP specializer infrastructure, open-source download
27
[Figure: applications (audio recognition, object recognition, scene analysis) map onto Berkeley View “dwarfs” or motifs (dense, graph, sparse, …); specializers with SEJITS implementations and autotuning emit glue, dense, sparse, graph, and ESP code targeting the matching engines (ILP, dense, sparse, graph) of an ESP core]
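To make the specializer idea concrete, here is a minimal Scala sketch with hypothetical names; it illustrates the pattern-plus-autotuner structure, and is not the ASP API. A specializer exposes several low-level variants of one pattern, plus a trivial autotuner that times them and keeps the fastest:

trait Specializer[In, Out] {
  // Pattern-specific code variants to tune over.
  def candidates(problem: In): Seq[() => Out]
  // A trivial autotuner: run each variant, keep the fastest.
  def best(problem: In): Out =
    candidates(problem).map { v =>
      val t0 = System.nanoTime(); val r = v(); (System.nanoTime() - t0, r)
    }.minBy(_._1)._2
}

// Example: a dense dot-product pattern with two variants.
object Dot extends Specializer[(Array[Double], Array[Double]), Double] {
  def candidates(p: (Array[Double], Array[Double])) = {
    val (x, y) = p
    Seq(
      () => x.zip(y).map { case (a, b) => a * b }.sum, // concise, allocates
      () => {                                          // imperative, no allocation
        var s = 0.0; var i = 0
        while (i < x.length) { s += x(i) * y(i); i += 1 }
        s
      }
    )
  }
}

Calling Dot.best((xs, ys)) picks whichever variant ran faster on this machine; a real specializer would instead generate and cache tuned low-level code per pattern instance.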
UC Berkeley
Replacing Fixed Accelerators with Programmable Fabric
§ Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
§ The fabric challenge is retaining extreme energy efficiency while remaining programmable
28
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
UC Berkeley Strawman Fabric Architecture
29
[Figure: strawman fabric, a tiled array of interconnected M/A/R blocks]
§ Will never have a C compiler
§ Only programmed using pattern-based DSLs
§ More dynamic, less static than earlier approaches
§ Dynamic dataflow-driven execution
§ Dynamic routing
§ Large memory support
UC Berkeley “Agile Hardware” Development
§ Current hardware design is slow and arduous
§ But we now have a huge design space to explore
§ How to examine many design points efficiently?
§ Build parameterized generators, not point designs!
§ Adopt and adapt best practices from Agile Software:
- Complete LVS/DRC-clean physical design of the current version every ~two weeks (“tape-in”)
- Incremental feature addition
- Test & verification as the first step
30
UC Berkeley
Chisel: Constructing Hardware In a Scala Embedded Language
§ Embed a hardware-description language in Scala, using Scala's extension facilities
§ A hardware module is just a data structure in Scala
§ Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from the same hardware representation
§ Full power of Scala for writing hardware generators
- Object-oriented: factory objects, traits, overloading, etc.
- Functional: higher-order functions, anonymous functions, currying
- Compiles to JVM: good performance, Java interoperability
31
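A minimal sketch of what this looks like, written in present-day chisel3 syntax (which postdates this talk, so details differ from the 2012-era Chisel): a module is an ordinary Scala class, parameters are constructor arguments, and Scala's higher-order functions elaborate into hardware.

import chisel3._

// A width-parameterized adder: one Scala class, a family of circuits.
class Adder(w: Int) extends Module {
  val io = IO(new Bundle {
    val a   = Input(UInt(w.W))
    val b   = Input(UInt(w.W))
    val sum = Output(UInt((w + 1).W))
  })
  io.sum := io.a +& io.b // +& widens by one bit to keep the carry
}

// A generator: sum n inputs with a chain of adders built by Scala's
// reduce, all unrolled into hardware at elaboration time.
class AdderTree(n: Int, w: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(Vec(n, UInt(w.W)))
    val out = Output(UInt(w.W))
  })
  io.out := io.in.reduce(_ + _) // + truncates to w bits here
}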
UC Berkeley Chisel Design Flow!
32
Chisel Program (Scala/JVM)
→ C++ code → C++ Compiler → Software Simulator
→ FPGA Verilog → FPGA Tools → FPGA Emulation
→ ASIC Verilog → ASIC Tools → GDS Layout
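A driver then selects the backend. The entry-point API has changed across Chisel versions, so treat this as an illustrative chisel3-era sketch rather than the exact flow of the slides:

import chisel3.stage.ChiselStage

object Elaborate extends App {
  // Emit synthesizable Verilog (for the FPGA/ASIC paths above) from
  // the AdderTree generator sketched earlier.
  (new ChiselStage).emitVerilog(new AdderTree(n = 4, w = 16))
}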
UC Berkeley Chisel is much more than an HDL
§ The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL
§ But Chisel can be extended above with domain-specific languages (e.g., signal processing) for the fabric
§ Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)
§ Only ~6,000 lines of code in the current version, including libraries!
§ BSD-licensed open source at: chisel.eecs.berkeley.edu
33
UC Berkeley
Many processor tapeouts in a few years with a small group (45nm, 28nm)
34
[Die photos: clock test site, SRAM test site, DC-DC test site; processor site with CORE0/VC0 through CORE3/VC3, 512KB L2, and VFIXED test sites]
UC Berkeley Resilient Circuits & Modeling
§ Future scaled technologies have high variability, but we want to run with the lowest possible margins to save energy
§ Significant increase in soft errors; need resilient systems
§ Technology modeling to determine the tradeoff between MTBF and energy per task for logic, SRAM, & interconnect
35
Techniques to reduce operating voltage can be worse for energy due to a rapid rise in errors
UC Berkeley
[Summary figure: Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency, spanning software down to hardware.
Applications (audio recognition, object recognition, scene analysis)
→ Computational and structural patterns (dense, graph, sparse, …; pipe & filter, map-reduce, …)
→ Communication-avoiding algorithms (C-A GEMM, C-A BFS, C-A SpMV)
→ Specializers with SEJITS implementations and autotuning, emitting glue, dense, sparse, graph, and ESP code
→ ESP (Ensembles of Specialized Processors) architecture: ILP, dense, sparse, and graph engines in an ESP core, with local stores + DMA and hardware cache coherence
→ Hardware generators using the Chisel HDL
→ Implementation technologies (ASIC SoC, FPGA computer), with validation/verification via C++ simulation and FPGA emulation, and deep HW/SW design-space exploration]
36
UC Berkeley ASPIRE Project
§ Initial $15.6M/5.5-year funding from the DARPA PERFECT program
- Started 9/28/2012
- Located in Par Lab space + BWRC
§ Looking for industrial affiliates (see Krste!)
§ Open House today, 5th floor Soda Hall
37
Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.