RAMP for Exascale · ramp.eecs.berkeley.edu/publications/yelick... · Architecture Paths to Exascale...
TRANSCRIPT
NERSC Overview
NERSC represents science needs
• Over 3,000 users, 400 projects, 500 code instances
• Over 1,600 publications in 2009
• Time is used by university researchers (65%), DOE Labs (25%), and others
1-Petaflop Hopper system, late 2010
• High application performance
• Nodes: two 12-core AMD processors
• Low-latency Gemini interconnect
Science at NERSC
• Fusion: simulations of fusion devices at ITER scale
• Combustion: new algorithms (AMR) coupled to experiments
• Energy storage: catalysis for improved batteries and fuel cells
• Capture & sequestration: EFRCs
• Materials: for solar panels and other applications
• Climate modeling: work with users on scalability of cloud-resolving models
• Nano devices: new single-molecule switching element
Algorithm Diversity
[Table: numerical methods used by each science area. Methods (columns): dense linear algebra; sparse linear algebra; spectral methods (FFTs); particle methods; structured grids; unstructured or AMR grids. Science areas (rows): accelerator science, astrophysics, chemistry, climate, combustion, fusion, lattice gauge, material science. Source: NERSC qualitative in-depth analysis of methods by science area.]
Numerical Methods at NERSC
[Bar chart: for each numerical method, the % of projects and the % of allocation (scale 0–50%) that use it.]
• Caveat: survey data from ERCAP requests, based on PI input
• Allocation is based on hours allocated to projects that use the method
NERSC Interest in Exascale
[Chart: peak Teraflop/s (log scale, 10 to 10^7) vs. year (2006–2020), following the Top500 trend through programming eras: COTS/MPP + MPI; COTS/MPP + MPI (+ OpenMP); GPU CUDA/OpenCL or manycore (BG/Q, R); Exascale + ???. NERSC milestones: Franklin (N5), 19 TF sustained / 101 TF peak; Franklin (N5) + QC, 36 TF sustained / 352 TF peak; Hopper (N6), >1 PF peak; NERSC-7, 10 PF peak; NERSC-8, 100 PF peak; NERSC-9, 1 EF peak.]
Danger: dragging users into a local optimum for programming
Exascale is really about Energy Efficient Computing
At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• 1 exaflop in 2018 at 20 MW is the DOE target
[Chart: projected system power, 2005–2020, under "usual scaling" vs. the DOE goal.]
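The gap between the two scaling paths is easy to quantify. A minimal sketch, assuming the slide's $1M-per-MW figure is an annual operating cost (the slides do not state the period):

```python
COST_PER_MW = 1.0  # $M per MW of sustained draw (annual, by assumption)

def annual_power_cost(megawatts):
    """Annual electricity cost in $M at the slide's $1M/MW rate."""
    return megawatts * COST_PER_MW

usual_scaling = annual_power_cost(200)  # exaflop with "usual" scaling
doe_target = annual_power_cost(20)      # exaflop at the DOE 20 MW target
savings = usual_scaling - doe_target    # $M at stake every year
```

Under that assumption, hitting the 20 MW target rather than 200 MW saves $180M per year, which is why power, not peak flops, drives the design.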
The Challenge
• Power is the leading constraint in HPC system design
• How do you build an exascale system without building a nuclear power plant next to your HPC center?
• How can you ensure the systems will be balanced for a reasonable science workload?
• How do you make it "programmable"?
Architecture Paths to Exascale
• Leading technology paths (swim lanes):
  – Multicore: maintain complex cores, and replicate (x86 and Power7)
  – Manycore/embedded: use many simpler, low-power cores from embedded (BlueGene)
  – GPU/accelerator: use highly specialized processors from the gaming space (NVidia Fermi, Cell)
• Risks in swim-lane selection:
  – Select too soon: users cannot follow
  – Select too late: fall behind the performance curve
  – Select incorrectly: subject users to multiple disruptive changes
• Users must be deeply engaged in this process: cannot leave this up to vendors alone
Green Flash: Overview John Shalf, PI
• We present an alternative approach to developing systems that serve the needs of scientific computing
• Choose our science target first to drive design decisions
• Leverage new technologies driven by the consumer market
• Auto-tune software for performance, productivity, and portability
• Use hardware-accelerated architectural emulation to rapidly prototype designs (auto-tune the hardware too!)
• Requires a holistic approach: must innovate algorithms, software, and hardware together (co-design)
Goal: achieve 100x energy-efficiency improvement over the mainstream HPC approach
System Balance
• If you pay 5% more to double the FPUs and get a 10% improvement, it's a win (despite lowering your % of peak performance)
• If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though % of peak looks better)
• Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
• We have a fixed budget
  – Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
  – Balance involves balancing your checkbook and balancing your power budget
  – Requires application co-design to make the right trade-offs
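The two trade-offs above can be checked with performance-per-budget arithmetic. A minimal sketch, with the baseline cost and performance normalized to 1 (an assumption; only the ratios come from the slide):

```python
def perf_per_cost(perf, cost):
    """Figure of merit under a fixed budget: delivered performance
    per unit of cost (dollars or watts)."""
    return perf / cost

baseline = perf_per_cost(1.00, 1.00)
# Double the FPUs: +5% cost, +10% performance -> a win.
double_fpus = perf_per_cost(1.10, 1.05)
# Double the memory-BW spend: 2x cost, +35% performance -> a net loss.
double_bw = perf_per_cost(1.35, 2.00)
```

Note that the winning option *lowers* the sustained-to-peak ratio while the losing one raises it, which is exactly the slide's point about that metric.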
The Complexity of Tradeoffs in Exascale System Design
[Diagram: the space of feasible systems is the intersection of several envelopes: the exascale performance envelope, the 20 MW power envelope, the $200M cost envelope, and the bytes/core envelope.]
Computational Requirements
• ~2 million horizontal subdomains
• 100 terabytes of memory
  – 5 MB of memory per subdomain
• ~20 million total subdomains
  – 20 PF sustained (200 PF peak)
  – Nearest-neighbor communication
• New discretization for the climate model: the CSU icosahedral code
• Must maintain 1000x faster than real time for practical climate simulation
Seismic Migration
• Reconstruct the earth’s subsurface
• Focus on exploration, which requires a 10 km survey depth
• Studies the velocity contrast between different materials below the surface
Seismic RTM Algorithm
• Explicit finite-difference method used to approximate the wave equation
• 8th-order stencil code
• Typical survey size is 20 x 30 x 10 km
  – Runtime on current clusters is ~1 month
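The slides name an 8th-order stencil but do not show it. A minimal 1-D sketch of the kind of central-difference operator an RTM kernel applies along each axis, using the standard 8th-order second-derivative weights (the project's actual 3-D kernel is not reproduced here):

```python
import numpy as np

# Standard 8th-order central-difference weights for d2u/dx2.
C = [-205.0/72.0, 8.0/5.0, -1.0/5.0, 8.0/315.0, -1.0/560.0]

def laplacian_1d(u, dx):
    """Apply the 8th-order second-derivative stencil to interior points;
    a halo of 4 points on each side is left untouched."""
    n = len(u)
    lap = np.zeros_like(u)
    lap[4:n-4] = C[0] * u[4:n-4]
    for k in range(1, 5):
        lap[4:n-4] += C[k] * (u[4+k:n-4+k] + u[4-k:n-4-k])
    return lap / dx**2

def wave_step(u_prev, u, c, dt, dx):
    """One explicit leapfrog step of the 1-D wave equation u_tt = c^2 u_xx."""
    return 2.0 * u - u_prev + (c * dt) ** 2 * laplacian_1d(u, dx)
```

The 4-point halo on each side is what drives the nearest-neighbor communication pattern when the domain is split across nodes.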
Low Power Design Principles
• Small is beautiful
  – Large array of simple, easy-to-verify cores
  – Embrace the embedded market
• Slower clock frequencies for a cubic power improvement
• Emphasis on performance per watt
• Reduce waste by not adding features that are not advantageous to the science
• Parallel, manycore processors are the path to energy efficiency
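The "cubic power improvement" follows from the dynamic-power relation P ≈ C·V²·f: when supply voltage can scale roughly linearly with frequency, P scales as f³. A minimal sketch of the resulting trade (idealized scaling, ignoring leakage and the limits of voltage scaling):

```python
def dynamic_power(f, f_nominal=1.0, p_nominal=1.0):
    """Dynamic power under the idealized assumption that voltage
    scales linearly with frequency, so P ~ f^3."""
    return p_nominal * (f / f_nominal) ** 3

# One core at half the clock: 1/8 the dynamic power, 1/2 the throughput.
p_half = dynamic_power(0.5)
# Two such cores recover the original throughput at 1/4 the power:
# the manycore argument for energy efficiency in one line.
p_two_half_cores = 2 * dynamic_power(0.5)
```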
Science Optimized Processor Design
• Make it programmable:
  – Hardware support for PGAS
    • Local store mapped to the global address space
    • Direct DMA between local stores to bypass the cache
  – Logical network topology looks like a full crossbar
    • Optimized for small transfers
• On-chip interconnect optimized for the problem's communication pattern
• Directly expose locality for optimized memory movement
Off-chip memory performance is not projected to improve much. Energy efficiency will require careful management of data locality: it is important to know when you are on-chip and when data is off-chip!
Vertical Locality Management
• Movement of data up and down the cache hierarchy
  – Cache virtualizes the notion of on-chip vs. off-chip
  – Software-managed memory (local store) is hard to program (Cell)
• Virtual local store
  – Use a conventional cache for portability
  – Use software-managed memory only for performance-critical code
  – Repartition as needed
Horizontal Locality Management
• Movement of data between processors
  – 10x lower latency and 10x higher bandwidth on-chip
  – Need to minimize the distance of horizontal data movement
• Encode horizontal locality into the memory address
  – Hardware hierarchy where high-order bits encode the cabinet and low-order bits encode chip-level distance
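The slides do not give a concrete address layout; a minimal sketch of the idea with hypothetical field widths, where the high-order bits name the cabinet and the low-order bits stay chip-local:

```python
# Hypothetical address layout: [cabinet | node | chip-local offset].
# All field widths here are illustrative, not from the slides.
CHIP_BITS = 28      # chip-local byte offset
NODE_BITS = 10      # node within a cabinet
CABINET_BITS = 8    # cabinet

def encode(cabinet, node, offset):
    """Pack locality information into a single global address."""
    assert cabinet < (1 << CABINET_BITS)
    assert node < (1 << NODE_BITS)
    assert offset < (1 << CHIP_BITS)
    return (cabinet << (NODE_BITS + CHIP_BITS)) | (node << CHIP_BITS) | offset

def locality_of(addr_a, addr_b):
    """Classify distance between two addresses from high-order bits alone,
    so hardware (or a runtime) can route without a lookup table."""
    if addr_a >> CHIP_BITS == addr_b >> CHIP_BITS:
        return "same chip"
    if addr_a >> (NODE_BITS + CHIP_BITS) == addr_b >> (NODE_BITS + CHIP_BITS):
        return "same cabinet"
    return "remote cabinet"
```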
Application Analysis - Climate
• Analyze each loop within the climate code
  – Extract temporal reuse and bandwidth requirements
• Use traces to determine cache-size and DRAM-bandwidth requirements
• Ensure the memory hierarchy can support the application
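Cache-size requirements can be read off an address trace via LRU stack (reuse) distances: an access hits in a fully associative LRU cache of capacity C exactly when its distance is below C. A minimal sketch of that analysis (not the project's actual tooling):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """For each access in an address trace, return the number of
    distinct addresses touched since the previous access to the same
    address (None on first touch). An access with distance d hits in
    a fully associative LRU cache of capacity > d."""
    stack = OrderedDict()          # addresses, most recently used last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(None)
            stack[addr] = True
    return dists
```

Histogramming these distances over a loop's trace gives the cache size needed to capture its temporal reuse; misses beyond that size become DRAM-bandwidth demand.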
Application Optimization - Climate
• Original code:
  – 160 KB cache requirement
  – < 50% FP instructions
• Tuned code:
  – 1 KB cache requirement
  – > 85% FP instructions
Loop optimization resulted in a 160x reduction in cache size and a 2x increase in execution speed.
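The slides do not show the climate loop itself; below is a generic sketch of the kind of transformation involved, fusing two sweeps and blocking them so each value is produced and consumed while a small block is still cache-resident (the working set shrinks from the full array to one block):

```python
import numpy as np

def two_pass(a):
    """Two separate sweeps: the full-size temporary t must round-trip
    through the memory hierarchy, inflating the cache requirement."""
    t = a * 2.0          # pass 1 writes a full-size temporary
    return t + 1.0       # pass 2 reads it all back

def fused(a, block=256):
    """Fused, blocked sweep: the working set is one block, not the
    whole array, so a tiny cache suffices."""
    out = np.empty_like(a)
    for i in range(0, len(a), block):
        blk = a[i:i + block]
        out[i:i + block] = blk * 2.0 + 1.0   # one trip through cache
    return out
```

The fused form also raises the fraction of FP instructions per byte moved, which is the second effect the slide reports.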
Co-design Advantage - Seismic
[Bar chart: co-design performance advantage for seismic RTM, in MPoints/sec (scale 0–4500), comparing 8-core Nehalem, Fermi, and Green Wave.]
Co-design Advantage - Seismic
[Bar chart: co-design power advantage for seismic RTM, in MPoints/Watt (scale 0–160), comparing 8-core Nehalem, Fermi, and Tensilica GreenWave.]
Extending to General Stencil Codes
• Generalized co-tuning framework being developed
• Co-tuning framework applied to multiple architectures
  – Manycore and GPU support
• Significant advantage over tuning only HW or SW
  – ~3x power and area advantage gained
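Co-tuning can be pictured as one joint search over hardware and software knobs rather than two separate searches. A minimal sketch with a toy, entirely illustrative cost model (none of these numbers come from the project):

```python
import itertools

def energy(cache_kb, tile):
    """Toy energy-per-point model: cache area costs power, small tiles
    cost loop overhead, and tiles that overflow the cache spill traffic
    to DRAM. Purely illustrative."""
    working_set = tile * tile * 8 / 1024.0        # KB held by one tile
    spill = max(0.0, working_set - cache_kb)      # traffic past the cache
    return cache_kb * 0.05 + 1.0 / tile + spill * 0.2

def cotune(cache_opts, tile_opts):
    """Joint search over the hardware x software design space."""
    return min(itertools.product(cache_opts, tile_opts),
               key=lambda p: energy(*p))

cache_opts = [16, 32, 64, 128]   # hypothetical hardware knob
tile_opts = [8, 16, 32, 64]      # hypothetical software knob
best = cotune(cache_opts, tile_opts)
```

Tuning either knob alone fixes the other at a possibly poor value; the joint search can never do worse, and in non-separable cost landscapes like this one it does strictly better.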
RAMP Infrastructure
• FPGA emulation is critical for HW/SW co-design
  – Enables full-application benchmarking
  – Serves as an autotuning target
  – Provides a feedback path between application developers and HW architects
• RAMP gateware and methodology are needed to glue the processors together
Green Flash Impact
• Clear demonstration of improved performance per watt on multiple scientific codes
  – X improvement over current architectures
  – Future HPC systems are power limited
  – Green Flash methodology provides a path to exascale
• DOE has responded by establishing exascale co-design centers around the nation
  – Green Flash and RAMP have played a key role in effecting this shift
  – DOE funding for ISIS, CoDEX, and application co-design centers
Co-Design is Critical for Exascale
• Major changes ahead for computing, including HPC
  – Redesign algorithms to minimize communication (communication-avoiding/optimal)
  – Develop programming models that allow for communication control
  – Feed back to architecture designs
    • How much data-parallelism, local stores, etc.
  – Develop science applications
  – Reduce risk in the Exascale program
• IP issues are critical to the success of co-design
Co-Design Before its Time
• Green Flash demoed during SC '09
• CSU atmospheric model ported to the low-power core design
  – Dual-core Tensilica processors running the atmospheric model at 25 MHz
  – MPI routines ported to the custom Tensilica interconnect
• Memory and processor stats available for performance analysis
• Emulation performance advantage
  – 250x speedup over a purely functional software simulator
• Actual code running, not a representative benchmark
[Figure: icosahedral mesh used for algorithm scaling]