RAMP for Exascale · ramp.eecs.berkeley.edu/publications/yelick... · Architecture Paths to Exascale...
TRANSCRIPT
NERSC Overview
NERSC represents science needs
• Over 3,000 users, 400 projects, 500 code instances
• Over 1,600 publications in 2009
• Time is used by university researchers (65%), DOE Labs (25%), and others
1-Petaflop Hopper system, late 2010
• High application performance
• Nodes: two 12-core AMD processors
• Low-latency Gemini interconnect
Science at NERSC
• Fusion: simulations of fusion devices at ITER scale
• Combustion: new algorithms (AMR) coupled to experiments
• Energy storage: catalysis for improved batteries and fuel cells
• Capture & sequestration: EFRCs
• Materials: for solar panels and other applications
• Climate modeling: work with users on scalability of cloud-resolving models
• Nano devices: new single-molecule switching element
Algorithm Diversity
[Table: numerical methods used by each science area. Methods (columns): dense linear algebra; sparse linear algebra; spectral methods (FFTs); particle methods; structured grids; unstructured or AMR grids. Science areas (rows): accelerator science, astrophysics, chemistry, climate, combustion, fusion, lattice gauge, material science. Source: NERSC qualitative in-depth analysis of methods by science area.]
Numerical Methods at NERSC
[Bar chart: for each numerical method, the % of projects and the % of allocation (scale 0–50%) that use it.]
• Caveat: survey data from ERCAP requests, based on PI input
• Allocation is based on hours allocated to projects that use the method
NERSC Interest in Exascale
[Chart: peak Teraflop/s (log scale, 10 to 10^7) vs. year (2006–2020), following the Top500 trend through programming eras: COTS/MPP + MPI; COTS/MPP + MPI (+ OpenMP); GPU CUDA/OpenCL or manycore (BG/Q, R); Exascale + ???. NERSC milestones: Franklin (N5), 19 TF sustained / 101 TF peak; Franklin (N5) + QC, 36 TF sustained / 352 TF peak; Hopper (N6), >1 PF peak; NERSC-7, 10 PF peak; NERSC-8, 100 PF peak; NERSC-9, 1 EF peak.]
Danger: dragging users into a local optimum for programming
Exascale is really about Energy Efficient Computing
At $1M per MW, energy costs are substantial:
• 1 petaflop in 2010 will use 3 MW
• 1 exaflop in 2018 is possible in 200 MW with "usual" scaling
• 1 exaflop in 2018 at 20 MW is the DOE target
[Chart: projected system power, 2005–2020, under "usual scaling" vs. the DOE goal.]
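The gap between the two scaling paths is easy to quantify. A minimal sketch, assuming the slide's $1M-per-MW figure is an annual operating cost (the slides do not state the period):

```python
COST_PER_MW = 1.0  # $M per MW of sustained draw (annual, by assumption)

def annual_power_cost(megawatts):
    """Annual electricity cost in $M at the slide's $1M/MW rate."""
    return megawatts * COST_PER_MW

usual_scaling = annual_power_cost(200)  # exaflop with "usual" scaling
doe_target = annual_power_cost(20)      # exaflop at the DOE 20 MW target
savings = usual_scaling - doe_target    # $M at stake every year
```

Under that assumption, hitting the 20 MW target rather than 200 MW saves $180M per year, which is why power, not peak flops, drives the design.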
The Challenge
• Power is the leading constraint in HPC system design
• How do you build an exascale system without building a nuclear power plant next to your HPC center?
• How can you ensure the systems will be balanced for a reasonable science workload?
• How do you make it "programmable"?
Architecture Paths to Exascale
• Leading technology paths (swim lanes):
  – Multicore: maintain complex cores, and replicate (x86 and Power7)
  – Manycore/embedded: use many simpler, low-power cores from embedded (BlueGene)
  – GPU/accelerator: use highly specialized processors from the gaming space (NVidia Fermi, Cell)
• Risks in swim-lane selection:
  – Select too soon: users cannot follow
  – Select too late: fall behind the performance curve
  – Select incorrectly: subject users to multiple disruptive changes
• Users must be deeply engaged in this process: cannot leave this up to vendors alone
Green Flash: Overview John Shalf, PI
• We present an alternative approach to developing systems that serve the needs of scientific computing
• Choose our science target first to drive design decisions
• Leverage new technologies driven by the consumer market
• Auto-tune software for performance, productivity, and portability
• Use hardware-accelerated architectural emulation to rapidly prototype designs (auto-tune the hardware too!)
• Requires a holistic approach: must innovate algorithms, software, and hardware together (co-design)
Goal: achieve 100x energy-efficiency improvement over the mainstream HPC approach
System Balance
• If you pay 5% more to double the FPUs and get a 10% improvement, it's a win (despite lowering your % of peak performance)
• If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though % of peak looks better)
• Real example: we can give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system
• We have a fixed budget
  – Sustained-to-peak FLOP rate is the wrong metric if FLOPs are cheap
  – Balance involves balancing your checkbook and balancing your power budget
  – Requires application co-design to make the right trade-offs
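The two trade-offs above can be checked with performance-per-budget arithmetic. A minimal sketch, with the baseline cost and performance normalized to 1 (an assumption; only the ratios come from the slide):

```python
def perf_per_cost(perf, cost):
    """Figure of merit under a fixed budget: delivered performance
    per unit of cost (dollars or watts)."""
    return perf / cost

baseline = perf_per_cost(1.00, 1.00)
# Double the FPUs: +5% cost, +10% performance -> a win.
double_fpus = perf_per_cost(1.10, 1.05)
# Double the memory-BW spend: 2x cost, +35% performance -> a net loss.
double_bw = perf_per_cost(1.35, 2.00)
```

Note that the winning option *lowers* the sustained-to-peak ratio while the losing one raises it, which is exactly the slide's point about that metric.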
The Complexity of Tradeoffs in Exascale System Design
[Diagram: the space of feasible systems is the intersection of several envelopes: the exascale performance envelope, the 20 MW power envelope, the $200M cost envelope, and the bytes/core envelope.]
Computational Requirements
• ~2 million horizontal subdomains
• 100 terabytes of memory
  – 5 MB of memory per subdomain
• ~20 million total subdomains
  – 20 PF sustained (200 PF peak)
  – Nearest-neighbor communication
• New discretization for the climate model: the CSU icosahedral code
• Must maintain 1000x faster than real time for practical climate simulation
Seismic Migration
• Reconstruct the earth’s subsurface
• Focus on exploration, which requires a 10 km survey depth
• Studies the velocity contrast between different materials below the surface
Seismic RTM Algorithm
• Explicit finite-difference method used to approximate the wave equation
• 8th-order stencil code
• Typical survey size is 20 x 30 x 10 km
  – Runtime on current clusters is ~1 month
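The slides name an 8th-order stencil but do not show it. A minimal 1-D sketch of the kind of central-difference operator an RTM kernel applies along each axis, using the standard 8th-order second-derivative weights (the project's actual 3-D kernel is not reproduced here):

```python
import numpy as np

# Standard 8th-order central-difference weights for d2u/dx2.
C = [-205.0/72.0, 8.0/5.0, -1.0/5.0, 8.0/315.0, -1.0/560.0]

def laplacian_1d(u, dx):
    """Apply the 8th-order second-derivative stencil to interior points;
    a halo of 4 points on each side is left untouched."""
    n = len(u)
    lap = np.zeros_like(u)
    lap[4:n-4] = C[0] * u[4:n-4]
    for k in range(1, 5):
        lap[4:n-4] += C[k] * (u[4+k:n-4+k] + u[4-k:n-4-k])
    return lap / dx**2

def wave_step(u_prev, u, c, dt, dx):
    """One explicit leapfrog step of the 1-D wave equation u_tt = c^2 u_xx."""
    return 2.0 * u - u_prev + (c * dt) ** 2 * laplacian_1d(u, dx)
```

The 4-point halo on each side is what drives the nearest-neighbor communication pattern when the domain is split across nodes.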
Low Power Design Principles
• Small is beautiful
  – Large array of simple, easy-to-verify cores
  – Embrace the embedded market
• Slower clock frequencies for a cubic power improvement
• Emphasis on performance per watt
• Reduce waste by not adding features that are not advantageous to the science
• Parallel, manycore processors are the path to energy efficiency
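The "cubic power improvement" follows from the dynamic-power relation P ≈ C·V²·f: when supply voltage can scale roughly linearly with frequency, P scales as f³. A minimal sketch of the resulting trade (idealized scaling, ignoring leakage and the limits of voltage scaling):

```python
def dynamic_power(f, f_nominal=1.0, p_nominal=1.0):
    """Dynamic power under the idealized assumption that voltage
    scales linearly with frequency, so P ~ f^3."""
    return p_nominal * (f / f_nominal) ** 3

# One core at half the clock: 1/8 the dynamic power, 1/2 the throughput.
p_half = dynamic_power(0.5)
# Two such cores recover the original throughput at 1/4 the power:
# the manycore argument for energy efficiency in one line.
p_two_half_cores = 2 * dynamic_power(0.5)
```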
Science Optimized Processor Design
• Make it programmable:
  – Hardware support for PGAS
    • Local store mapped to the global address space
    • Direct DMA between local stores to bypass the cache
  – Logical network topology looks like a full crossbar
    • Optimized for small transfers
• On-chip interconnect optimized for the problem's communication pattern
• Directly expose locality for optimized memory movement
Off-chip memory performance is not projected to improve much. Energy efficiency will require careful management of data locality: it is important to know when you are on-chip and when data is off-chip!
Vertical Locality Management
• Movement of data up and down the cache hierarchy
  – Cache virtualizes the notion of on-chip vs. off-chip
  – Software-managed memory (local store) is hard to program (Cell)
• Virtual local store
  – Use a conventional cache for portability
  – Use software-managed memory only for performance-critical code
  – Repartition as needed
Horizontal Locality Management
• Movement of data between processors
  – 10x lower latency and 10x higher bandwidth on-chip
  – Need to minimize the distance of horizontal data movement
• Encode horizontal locality into the memory address
  – Hardware hierarchy where high-order bits encode the cabinet and low-order bits encode chip-level distance
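The slides do not give a concrete address layout; a minimal sketch of the idea with hypothetical field widths, where the high-order bits name the cabinet and the low-order bits stay chip-local:

```python
# Hypothetical address layout: [cabinet | node | chip-local offset].
# All field widths here are illustrative, not from the slides.
CHIP_BITS = 28      # chip-local byte offset
NODE_BITS = 10      # node within a cabinet
CABINET_BITS = 8    # cabinet

def encode(cabinet, node, offset):
    """Pack locality information into a single global address."""
    assert cabinet < (1 << CABINET_BITS)
    assert node < (1 << NODE_BITS)
    assert offset < (1 << CHIP_BITS)
    return (cabinet << (NODE_BITS + CHIP_BITS)) | (node << CHIP_BITS) | offset

def locality_of(addr_a, addr_b):
    """Classify distance between two addresses from high-order bits alone,
    so hardware (or a runtime) can route without a lookup table."""
    if addr_a >> CHIP_BITS == addr_b >> CHIP_BITS:
        return "same chip"
    if addr_a >> (NODE_BITS + CHIP_BITS) == addr_b >> (NODE_BITS + CHIP_BITS):
        return "same cabinet"
    return "remote cabinet"
```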
Application Analysis - Climate
• Analyze each loop within the climate code
  – Extract temporal reuse and bandwidth requirements
• Use traces to determine cache-size and DRAM-bandwidth requirements
• Ensure the memory hierarchy can support the application
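Cache-size requirements can be read off an address trace via LRU stack (reuse) distances: an access hits in a fully associative LRU cache of capacity C exactly when its distance is below C. A minimal sketch of that analysis (not the project's actual tooling):

```python
from collections import OrderedDict

def reuse_distances(trace):
    """For each access in an address trace, return the number of
    distinct addresses touched since the previous access to the same
    address (None on first touch). An access with distance d hits in
    a fully associative LRU cache of capacity > d."""
    stack = OrderedDict()          # addresses, most recently used last
    dists = []
    for addr in trace:
        if addr in stack:
            keys = list(stack)
            dists.append(len(keys) - 1 - keys.index(addr))
            stack.move_to_end(addr)
        else:
            dists.append(None)
            stack[addr] = True
    return dists
```

Histogramming these distances over a loop's trace gives the cache size needed to capture its temporal reuse; misses beyond that size become DRAM-bandwidth demand.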
Application Optimization - Climate
• Original code:
  – 160 KB cache requirement
  – < 50% FP instructions
• Tuned code:
  – 1 KB cache requirement
  – > 85% FP instructions
Loop optimization resulted in a 160x reduction in cache size and a 2x increase in execution speed.
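The slides do not show the climate loop itself; below is a generic sketch of the kind of transformation involved, fusing two sweeps and blocking them so each value is produced and consumed while a small block is still cache-resident (the working set shrinks from the full array to one block):

```python
import numpy as np

def two_pass(a):
    """Two separate sweeps: the full-size temporary t must round-trip
    through the memory hierarchy, inflating the cache requirement."""
    t = a * 2.0          # pass 1 writes a full-size temporary
    return t + 1.0       # pass 2 reads it all back

def fused(a, block=256):
    """Fused, blocked sweep: the working set is one block, not the
    whole array, so a tiny cache suffices."""
    out = np.empty_like(a)
    for i in range(0, len(a), block):
        blk = a[i:i + block]
        out[i:i + block] = blk * 2.0 + 1.0   # one trip through cache
    return out
```

The fused form also raises the fraction of FP instructions per byte moved, which is the second effect the slide reports.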
Co-design Advantage - Seismic
[Bar chart: co-design performance advantage for seismic RTM, in MPoints/sec (scale 0–4500), comparing 8-core Nehalem, Fermi, and Green Wave.]
Co-design Advantage - Seismic
[Bar chart: co-design power advantage for seismic RTM, in MPoints/Watt (scale 0–160), comparing 8-core Nehalem, Fermi, and Tensilica GreenWave.]
Extending to General Stencil Codes
• Generalized co-tuning framework being developed
• Co-tuning framework applied to multiple architectures
  – Manycore and GPU support
• Significant advantage over tuning only HW or SW
  – ~3x power and area advantage gained
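Co-tuning can be pictured as one joint search over hardware and software knobs rather than two separate searches. A minimal sketch with a toy, entirely illustrative cost model (none of these numbers come from the project):

```python
import itertools

def energy(cache_kb, tile):
    """Toy energy-per-point model: cache area costs power, small tiles
    cost loop overhead, and tiles that overflow the cache spill traffic
    to DRAM. Purely illustrative."""
    working_set = tile * tile * 8 / 1024.0        # KB held by one tile
    spill = max(0.0, working_set - cache_kb)      # traffic past the cache
    return cache_kb * 0.05 + 1.0 / tile + spill * 0.2

def cotune(cache_opts, tile_opts):
    """Joint search over the hardware x software design space."""
    return min(itertools.product(cache_opts, tile_opts),
               key=lambda p: energy(*p))

cache_opts = [16, 32, 64, 128]   # hypothetical hardware knob
tile_opts = [8, 16, 32, 64]      # hypothetical software knob
best = cotune(cache_opts, tile_opts)
```

Tuning either knob alone fixes the other at a possibly poor value; the joint search can never do worse, and in non-separable cost landscapes like this one it does strictly better.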
RAMP Infrastructure
• FPGA emulation is critical for HW/SW co-design
  – Enables full-application benchmarking
  – Serves as an autotuning target
  – Provides a feedback path between application developers and HW architects
• RAMP gateware and methodology are needed to glue the processors together
Green Flash Impact
• Clear demonstration of improved performance per watt on multiple scientific codes
  – X improvement over current architectures
  – Future HPC systems are power limited
  – Green Flash methodology provides a path to exascale
• DOE has responded by establishing exascale co-design centers around the nation
  – Green Flash and RAMP have played a key role in effecting this shift
  – DOE funding for ISIS, CoDEX, and application co-design centers
Co-Design is Critical for Exascale
• Major changes ahead for computing, including HPC
  – Redesign algorithms to minimize communication (communication-avoiding/optimal)
  – Develop programming models that allow for communication control
  – Feed back to architecture designs
    • How much data-parallelism, local stores, etc.
  – Develop science applications
  – Reduce risk in the Exascale program
• IP issues are critical to the success of co-design
Co-Design Before its Time
• Green Flash demoed during SC '09
• CSU atmospheric model ported to the low-power core design
  – Dual-core Tensilica processors running the atmospheric model at 25 MHz
  – MPI routines ported to the custom Tensilica interconnect
• Memory and processor stats available for performance analysis
• Emulation performance advantage
  – 250x speedup over a purely functional software simulator
• Actual code running, not a representative benchmark
[Figure: icosahedral mesh used for algorithm scaling]