TRANSCRIPT
Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency
Elad Alon, Krste Asanovic (Director), Jonathan Bachrach, Jim Demmel, Armando Fox, Kurt Keutzer, Borivoje Nikolic, David Patterson,
Koushik Sen, John Wawrzynek
[email protected] http://aspire.eecs.berkeley.edu
UC Berkeley Future Application Drivers!
2
UC Berkeley Compute Energy “Iron Law”
§ When power is constrained, need better energy efficiency for more performance
§ Where performance is constrained (real-time), want better energy efficiency to lower power
Improving energy efficiency is a critical goal for all future systems and workloads
3
Performance (Tasks/Second) = Power (Joules/Second) × Energy Efficiency (Tasks/Joule)
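As a units check with illustrative numbers (not from the talk): a processor drawing 10 Joules/Second at an efficiency of 2×10⁹ Tasks/Joule delivers

$$10\ \tfrac{\mathrm{J}}{\mathrm{s}} \times 2\times 10^{9}\ \tfrac{\mathrm{Tasks}}{\mathrm{J}} = 2\times 10^{10}\ \tfrac{\mathrm{Tasks}}{\mathrm{s}},$$

so at a fixed power budget, performance improves only as much as tasks/joule does.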
UC Berkeley Good News: Moore’s Law Continues!
4
“Cramming more components onto integrated circuits”, Gordon E. Moore, Electronics, 1965
UC Berkeley
Bad News: Dennard (Voltage) Scaling Over!
5
• Voltage scaling slowed drastically
• Asymptotically approaching threshold voltage
Why did we hit a power/cooling wall?
The good old days of Dennard scaling vs. today, now that Dennard scaling is dead:
Dynamic power density = Ng × Cload × V² × f, where Ng = CMOS gates/unit area, Cload = capacitive load/CMOS gate, f = clock frequency, V = supply voltage
[Figure: Dennard vs. post-Dennard scaling of each factor; data courtesy S. Borkar/Intel 2011]
Energy scaling ended in 2005 [Moore, ISSCC Keynote, 2003]
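The arithmetic behind the wall, as a sketch of the standard Dennard argument (with S > 1 the linear scale factor per generation; textbook reasoning, not text recovered from the slide): power density is

$$P \;=\; N_g\, C_{load}\, V^2 f.$$

Under Dennard scaling, $N_g \to S^2 N_g$, $C_{load} \to C_{load}/S$, $V \to V/S$, $f \to S f$, so $P \to S^2 \cdot S^{-1} \cdot S^{-2} \cdot S\, P = P$: constant power density. Post-Dennard, with V stuck near threshold, the $S^{-2}$ factor disappears and the same shrink gives $P \to S^2 P$: the power/cooling wall.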
UC Berkeley
1st Impact of End of Scaling: End of Sequential Processor Era!
6
UC Berkeley Parallelism: A one-time gain
Use more, slower cores for better energy efficiency. Either
§ simpler cores, or
§ run cores at lower Vdd/frequency
§ Even simpler general-purpose microarchitectures? - Limited by smallest sensible core
§ Even lower Vdd/frequency? - Limited by Vdd/Vt scaling, errors (see the sketch below)
§ Now what?
7
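The energy reasoning behind this slide, assuming dynamic power dominates (standard CMOS argument, not from the slide text): energy per operation scales roughly as

$$E_{op} \;\propto\; C_{load} V^2,$$

and a core clocked at lower f can run at lower V, so two cores at half frequency and reduced supply match the original tasks/second at fewer joules/task. The gain is one-time because V cannot drop much below Vt and cores cannot get simpler than the smallest sensible core.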
UC Berkeley
[Muller, ARM CTO, 2009]
2nd Impact of End of Scaling: “Dark Silicon”
Cannot switch all transistors at full frequency!
8
No savior device technology is on the horizon. Future energy-efficiency innovations must be above the transistor level.
UC Berkeley The End of General-Purpose Processors?
§ Most computing happens in specialized, heterogeneous processors - Can be 100-1000X more efficient than a general-purpose processor
§ Challenges: - Hardware design costs - Software development costs
9
NVIDIA Tegra2
UC Berkeley
The Real Scaling Challenge: Communication
As transistors become smaller and cheaper, communication dominates performance and energy
10
All scales:
§ Across chip
§ Up and down memory hierarchy
§ Chip-to-chip
§ Board-to-board
§ Rack-to-rack
UC Berkeley ASPIRE: From Better to Best
§ What is the best we can do? - For a fixed target technology (e.g., 7nm)
§ Can we prove a bound?
§ Can we design an implementation approaching the bound?
⇒ Provably Optimal Implementations
11
Specialize and optimize communication and computation across the whole stack from
applications to hardware
UC Berkeley
Communication-Avoiding Algorithms: Algorithm Cost Measures
1. Arithmetic (FLOPS)
2. Communication: moving data between
- levels of a memory hierarchy (sequential case)
- processors over a network (parallel case)
12
[Figure: sequential case as CPU + cache + DRAM; parallel case as multiple CPU + DRAM nodes connected by a network]
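Together these two measures give the runtime model used on the strong-scaling slide below (γt, βt, αt are the per-flop, per-word, and per-message costs defined there):

$$T \;=\; \gamma_t F \;+\; \beta_t W \;+\; \alpha_t S,$$

where F counts flops, W counts words moved, and S counts messages.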
UC Berkeley Modeling Runtime & Energy
13
UC Berkeley A few examples of speedups
§ Matrix multiplication
- Up to 12x on IBM BG/P for n=8K on 64K cores; 95% less communication
§ QR decomposition (used in least squares, data mining, …)
- Up to 8x on 8-core dual-socket Intel Clovertown, for 10M x 10
- Up to 6.7x on 16-proc. Pentium III cluster, for 100K x 200
- Up to 13x on Tesla C2050 / Fermi, for 110K x 100
- Up to 4x on grid of 4 cities (Dongarra, Langou et al)
- “Infinite speedup” for out-of-core on PowerPC laptop: LAPACK thrashed virtual memory, didn’t finish
§ Eigenvalues of band symmetric matrices
- Up to 17x on Intel Gainestown, 8 cores, vs MKL 10.0 (up to 1.9x sequential)
§ Iterative sparse linear equation solvers (GMRES)
- Up to 4.3x on Intel Clovertown, 8 cores
§ N-body (direct particle interactions with cutoff distance)
- Up to 10x on Cray XT-4 (Hopper), 24K particles on 6K procs.
14
UC Berkeley Modeling Energy: Dynamic
15
UC Berkeley Modeling Energy: Memory Retention
16
UC Berkeley Modeling Energy: Background Power
17
UC Berkeley Energy Lower Bounds
18
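The bodies of these four modeling slides are figures, but the model they build up can be reconstructed from the notation defined on the strong-scaling slide below: per processor, energy is charged for dynamic work, for retaining memory, and for background (leakage) power over the runtime T:

$$E_{proc} \;=\; \underbrace{\gamma_e F + \beta_e W + \alpha_e S}_{\text{dynamic}} \;+\; \underbrace{\delta_e M\,T}_{\text{memory retention}} \;+\; \underbrace{\varepsilon_e T}_{\text{background power}}.$$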
UC Berkeley
Early Result: Perfect Strong Scaling in Time and Energy
§ Every time you add a processor, use its memory M too
§ Start with minimal number of procs: PM = 3n²
§ Increase P by factor c ⇒ total memory increases by factor c
§ Notation for timing model:
- γt, βt, αt = secs per flop, per word moved, per message of size m

$$T(cP) \;=\; \frac{n^3}{cP}\left[\gamma_t + \frac{\beta_t}{M^{1/2}} + \frac{\alpha_t}{m\,M^{1/2}}\right] \;=\; \frac{T(P)}{c}$$

§ Notation for energy model:
- γe, βe, αe = Joules for the same operations
- δe = Joules per word of memory used per sec
- εe = Joules per sec for leakage, etc.

$$E(cP) \;=\; cP\left\{\frac{n^3}{cP}\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] + \delta_e M\,T(cP) + \varepsilon_e T(cP)\right\} \;=\; E(P)$$

§ Perfect scaling extends to n-body, Strassen, …
[IPDPS, 2013]
19
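The claimed invariance is quick to verify: substituting T(cP) = T(P)/c, the processor count c cancels in every term of the energy expression,

$$E(cP) \;=\; n^3\left[\gamma_e + \frac{\beta_e}{M^{1/2}} + \frac{\alpha_e}{m\,M^{1/2}}\right] \;+\; P\,(\delta_e M + \varepsilon_e)\,T(P) \;=\; E(P),$$

so runtime drops by a factor of c at constant total energy.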
UC Berkeley C-A Algorithms Not Just for HPC
§ In ASPIRE, apply to other key application areas: machine vision, databases, speech recognition, software-defined radio, …
§ Initial results on lower bounds for database join algorithms
20
UC Berkeley
From C-A Algorithms to Provably Optimal Systems?
§ 1) Prove lower bounds on communication for a computation
§ 2) Develop an algorithm that achieves the lower bound on a system
§ 3) Find that communication time/energy cost is >90% of the resulting implementation's cost
§ 4) We know we're within 10% of optimal! (see the arithmetic below)
§ Supporting technique: optimizing the software stack and compute engines to reduce compute costs and expose unavoidable communication costs
21
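The arithmetic behind steps 3 and 4: if an implementation's measured cost C is at least 90% communication, and that communication achieves the proven lower bound B, then any implementation costs at least B, so

$$C_{opt} \;\ge\; B \;\ge\; 0.9\,C \quad\Longrightarrow\quad C \;\le\; \tfrac{1}{0.9}\,C_{opt} \;\approx\; 1.11\,C_{opt},$$

i.e., within roughly 10% of optimal, as the slide rounds it.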
UC Berkeley
ESP: An Applications Processor Architecture for ASPIRE
§ Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
§ It is well known how to customize hardware engines for a specific task
§ The ESP challenge is using specialized engines for general-purpose code
22
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
UC Berkeley ESP: Ensembles of Specialized Processors
§ General-purpose hardware: flexible but inefficient
§ Fixed-function hardware: efficient but inflexible
§ Par Lab insight: patterns capture common operations across many applications, each with a unique communication & computation structure
§ Build an ensemble of specialized engines, each individually optimized for a particular pattern but collectively covering application needs
§ Bet: will give us efficiency plus flexibility - Any given core can have a different mix of these engines depending on workload
23
UC Berkeley Par Lab: Motifs common across apps
24
[Figure: applications (audio recognition, object recognition, scene analysis) decompose into Berkeley View “dwarfs” or motifs: dense, graph, sparse, …]
UC Berkeley
25
[Figure: Par Lab apps vs. computing domains, showing motif (née “dwarf”) popularity (red hot / blue cool)]
UC Berkeley
Architecting Parallel Software
Identify the Software Structure:
• Pipe-and-Filter
• Agent-and-Repository
• Event-Based
• Bulk Synchronous
• Map-Reduce
• Layered Systems
• Arbitrary Task Graphs
• Puppeteer
• Model-View-Controller
Identify the Key Computations:
• Graph Algorithms
• Dynamic Programming
• Dense/Sparse Linear Algebra
• Un/Structured Grids
• Graphical Models
• Finite State Machines
• Backtrack / Branch-and-Bound
• N-Body Methods
• Circuits
• Spectral Methods
• Monte Carlo
UC Berkeley Mapping Software to ESP: Specializers
§ Capture desired functionality at a high level, using patterns in a productive high-level language
§ Use pattern-specific compilers (specializers) with autotuners to produce efficient low-level code (a sketch of the idea follows the figure below)
§ ASP specializer infrastructure, open-source download
27
[Figure: applications (audio recognition, object recognition, scene analysis) map onto Berkeley View “dwarfs” or motifs (dense, graph, sparse, …); specializers with SEJITS implementations and autotuning emit glue, dense, sparse, graph, and ESP code targeting the matching engines (ILP, dense, sparse, graph) of an ESP core]
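To make the specializer idea concrete, here is a minimal Scala sketch with hypothetical names; it illustrates the pattern-plus-autotuner structure, and is not the ASP API. A specializer exposes several low-level variants of one pattern, plus a trivial autotuner that times them and keeps the fastest:

trait Specializer[In, Out] {
  // Pattern-specific code variants to tune over.
  def candidates(problem: In): Seq[() => Out]
  // A trivial autotuner: run each variant, keep the fastest.
  def best(problem: In): Out =
    candidates(problem).map { v =>
      val t0 = System.nanoTime(); val r = v(); (System.nanoTime() - t0, r)
    }.minBy(_._1)._2
}

// Example: a dense dot-product pattern with two variants.
object Dot extends Specializer[(Array[Double], Array[Double]), Double] {
  def candidates(p: (Array[Double], Array[Double])) = {
    val (x, y) = p
    Seq(
      () => x.zip(y).map { case (a, b) => a * b }.sum, // concise, allocates
      () => {                                          // imperative, no allocation
        var s = 0.0; var i = 0
        while (i < x.length) { s += x(i) * y(i); i += 1 }
        s
      }
    )
  }
}

Calling Dot.best((xs, ys)) picks whichever variant ran faster on this machine; a real specializer would instead generate and cache tuned low-level code per pattern instance.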
UC Berkeley
Replacing Fixed Accelerators with Programmable Fabric
§ Future server and mobile SoCs will have many fixed-function accelerators and a general-purpose programmable multicore
§ The fabric challenge is retaining extreme energy efficiency while remaining programmable
28
Intel Ivy Bridge (22nm)
Qualcomm Snapdragon MSM8960 (28nm)
UC Berkeley Strawman Fabric Architecture
29
[Figure: strawman fabric, a tiled array of interconnected M/A/R blocks]
§ Will never have a C compiler
§ Only programmed using pattern-based DSLs
§ More dynamic, less static than earlier approaches
§ Dynamic dataflow-driven execution
§ Dynamic routing
§ Large memory support
UC Berkeley “Agile Hardware” Development
§ Current hardware design is slow and arduous
§ But we now have a huge design space to explore
§ How to examine many design points efficiently?
§ Build parameterized generators, not point designs!
§ Adopt and adapt best practices from Agile Software:
- Complete LVS/DRC-clean physical design of the current version every ~two weeks (“tape-in”)
- Incremental feature addition
- Test & verification as the first step
30
UC Berkeley
Chisel: Constructing Hardware In a Scala Embedded Language
§ Embed a hardware-description language in Scala, using Scala's extension facilities
§ A hardware module is just a data structure in Scala
§ Different output routines can generate different types of output (C, FPGA-Verilog, ASIC-Verilog) from the same hardware representation
§ Full power of Scala for writing hardware generators
- Object-oriented: factory objects, traits, overloading, etc.
- Functional: higher-order functions, anonymous functions, currying
- Compiles to JVM: good performance, Java interoperability
31
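A minimal sketch of what this looks like, written in present-day chisel3 syntax (which postdates this talk, so details differ from the 2012-era Chisel): a module is an ordinary Scala class, parameters are constructor arguments, and Scala's higher-order functions elaborate into hardware.

import chisel3._

// A width-parameterized adder: one Scala class, a family of circuits.
class Adder(w: Int) extends Module {
  val io = IO(new Bundle {
    val a   = Input(UInt(w.W))
    val b   = Input(UInt(w.W))
    val sum = Output(UInt((w + 1).W))
  })
  io.sum := io.a +& io.b // +& widens by one bit to keep the carry
}

// A generator: sum n inputs with a chain of adders built by Scala's
// reduce, all unrolled into hardware at elaboration time.
class AdderTree(n: Int, w: Int) extends Module {
  val io = IO(new Bundle {
    val in  = Input(Vec(n, UInt(w.W)))
    val out = Output(UInt(w.W))
  })
  io.out := io.in.reduce(_ + _) // + truncates to w bits here
}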
UC Berkeley Chisel Design Flow!
32
Chisel Program (Scala/JVM)
→ C++ code → C++ Compiler → Software Simulator
→ FPGA Verilog → FPGA Tools → FPGA Emulation
→ ASIC Verilog → ASIC Tools → GDS Layout
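A driver then selects the backend. The entry-point API has changed across Chisel versions, so treat this as an illustrative chisel3-era sketch rather than the exact flow of the slides:

import chisel3.stage.ChiselStage

object Elaborate extends App {
  // Emit synthesizable Verilog (for the FPGA/ASIC paths above) from
  // the AdderTree generator sketched earlier.
  (new ChiselStage).emitVerilog(new AdderTree(n = 4, w = 16))
}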
UC Berkeley Chisel is much more than an HDL
§ The base Chisel system allows you to use the full power of Scala to describe the RTL of a design, then generate Verilog or C++ output from the RTL
§ But Chisel can be extended above with domain-specific languages (e.g., signal processing) for the fabric
§ Importantly, Chisel can also be extended below with new backends or to add new tools or features (e.g., quantum computing circuits)
§ Only ~6,000 lines of code in the current version, including libraries!
§ BSD-licensed open source at: chisel.eecs.berkeley.edu
33
UC Berkeley
Many processor tapeouts in a few years with a small group (45nm, 28nm)
34
[Die photos: clock test site, SRAM test site, DC-DC test site; processor site with CORE0/VC0 through CORE3/VC3, 512KB L2, and VFIXED test sites]
UC Berkeley Resilient Circuits & Modeling
§ Future scaled technologies have high variability, but we want to run with the lowest possible margins to save energy
§ Significant increase in soft errors; need resilient systems
§ Technology modeling to determine the tradeoff between MTBF and energy per task for logic, SRAM, & interconnect
35
Techniques to reduce operating voltage can be worse for energy due to a rapid rise in errors
UC Berkeley
[Summary figure: Algorithms and Specializers for Provably Optimal Implementations with Resiliency and Efficiency, spanning software down to hardware.
Applications (audio recognition, object recognition, scene analysis)
→ Computational and structural patterns (dense, graph, sparse, …; pipe & filter, map-reduce, …)
→ Communication-avoiding algorithms (C-A GEMM, C-A BFS, C-A SpMV)
→ Specializers with SEJITS implementations and autotuning, emitting glue, dense, sparse, graph, and ESP code
→ ESP (Ensembles of Specialized Processors) architecture: ILP, dense, sparse, and graph engines in an ESP core, with local stores + DMA and hardware cache coherence
→ Hardware generators using the Chisel HDL
→ Implementation technologies (ASIC SoC, FPGA computer), with validation/verification via C++ simulation and FPGA emulation, and deep HW/SW design-space exploration]
36
UC Berkeley ASPIRE Project
§ Initial $15.6M/5.5-year funding from the DARPA PERFECT program
- Started 9/28/2012
- Located in Par Lab space + BWRC
§ Looking for industrial affiliates (see Krste!)
§ Open House today, 5th floor Soda Hall
37
Research funded by DARPA Award Number HR0011-12-2-0016. Approved for public release; distribution is unlimited. The content of this presentation does not necessarily reflect the position or the policy of the US government and no official endorsement should be inferred.