TRANSCRIPT
HPC 2012 Cetraro – Technology Trends in HPC
Dr.-Ing. Frank Baetke, June 25th, 2012
TOP500
TOP500 – Entry Distribution by Vendors
Vendors – November 2011 – absolute entry counts:
– HP Total: 142
– IBM: 223
– Others: 39
– Cray Inc.: 27
– SGI: 17
– Dell: 14
– Oracle (Sun): 10
– Bull: 15
– Appro: 13
Green500
Green500 – The Idea
• TOP500 systems and power efficiency: the ongoing discussion about power consumption and energy efficiency has an impact on most procurements and RFIs today.
• Two professors from the Dept. of Computer Science at Virginia Tech, Blacksburg – Dr. Wu-chun Feng and Dr. Kirk W. Cameron – came up with the idea of a "Green TOP500" list to complement the "classic" TOP500 list. More details can be found in several presentations, mainly:
– Green Computing for a Clean Tomorrow
– Global Climate Warming? Yes ... in the Machine Room
– Making a Case for a Green500 List
• Available at www.green500.org
Green500 Metric and Issues
• The list contains the same entries as the "classic" TOP500 list, but ordered by energy efficiency. The original names, the Rmax values (TFLOP/s obtained with the Linpack N×N test) and the original rank are retained in the Green500.
• The Green500 is the TOP500 list ordered by PpW (performance per Watt):
PpW = Performance [Rmax] / Power [measured or reported]
so the system with the best PpW value is at number 1, the worst at number 500 (a small numeric sketch follows below).
• Performance is based on the Rmax value from the original TOP500 list
• HP's Tsubame 2, listed as #5 in the TOP500, is listed as #10 in the Green500 but remains the largest "standards-based system" in the list
• The really critical question is how to measure "power". The list defines it as the power consumed during execution of the Linpack N×N test. HP has measured such data for only a few system configurations. The critical question remains: how to measure power consumption correctly? This is a much more complex task than most would assume!
• Kudos to Nathalie Bates, who seems to be making real progress here!
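A minimal sketch of the PpW ranking rule (the entries below are illustrative placeholders, not published list data):

```python
# Rank systems by PpW = Rmax / Power, as the Green500 does.
# The entries are illustrative placeholders, not real list data.
systems = [
    {"name": "System A", "rmax_tflops": 1192.0, "power_kw": 1398.6},
    {"name": "System B", "rmax_tflops": 800.0,  "power_kw": 1600.0},
    {"name": "System C", "rmax_tflops": 250.0,  "power_kw": 200.0},
]

for s in systems:
    # MFLOPS per Watt: (TFLOP/s * 1e6) / (kW * 1e3)
    s["ppw_mflops_w"] = s["rmax_tflops"] * 1e6 / (s["power_kw"] * 1e3)

for rank, s in enumerate(sorted(systems, key=lambda s: -s["ppw_mflops_w"]), 1):
    print(f"#{rank} {s['name']}: {s['ppw_mflops_w']:.0f} MFLOPS/W")
```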
Graph500
Graph500 – An Unusual Benchmark
Graph500 – A Few Details ...
• The benchmark consists of a set of kernels from graph algorithms, the metric is TEPS (traversed edges per second), and the list is the Graph500 (a sketch of both kernels follows below).
• The intent of this benchmark problem ("Search") is to develop a compact application that has multiple analysis techniques (multiple kernels) accessing a single data structure representing a weighted, undirected graph. In addition to a kernel to construct the graph from the input tuple list, there is one additional computational kernel to operate on the graph.
• This benchmark includes a scalable data generator which produces edge tuples containing the start vertex and end vertex for each edge. The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels. The second kernel performs a breadth-first search of the graph. Both kernels are timed.
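A minimal sketch of the two timed kernels, assuming a toy edge-tuple list (the real benchmark uses a scalable Kronecker-style generator and vastly larger graphs):

```python
# Sketch of the two timed Graph500 steps: graph construction from an
# edge-tuple list, then a breadth-first search. The edge list is a toy.
from collections import defaultdict, deque
import time

def build_graph(edge_tuples):
    """Kernel 1: construct an undirected graph from (start, end) tuples."""
    adj = defaultdict(list)
    for u, v in edge_tuples:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def bfs(adj, root):
    """Kernel 2: breadth-first search; returns parents and edges scanned."""
    parent = {root: root}
    traversed = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            traversed += 1
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent, traversed

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]  # toy input tuples
adj = build_graph(edges)
t0 = time.perf_counter()
_, traversed = bfs(adj, root=0)
teps = traversed / (time.perf_counter() - t0)  # traversed edges per second
print(f"TEPS: {teps:.3e}")
```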
Graph500 – Nov 2011 List (just 49 entries in total)
Combining them ... ??
Combining all 500 Lists – Tsubame 2 wins!
HPCC
HPCC – HPC Challenge – the motivation: overcoming the limitations of Linpack N×N (HPL)
Summary and Trends
• TOP500 will maintain very high visibility – but very few understand the metric behind ALL entries (commercial sites)
• The Green500 will stay behind, but energy efficiency will become a key metric – all the data are in the TOP500 anyway. Issue: how to measure "power" is key!
• The HPL algorithms will be changed (Jack D. et al.), as larger systems would require days to finish a single run
• Graph500 might become an "add-on" for big data – but only for the few who understand its meaning and metric
• HPCC provides good insight but doesn't produce a nice ranking (but that's what most sites or funding agencies want – a key reason for being eager to get onto the TOP500)
AN UGLY ROOM?
TSUBAME 2.0 Layout (200 sqm for main compute nodes)
NO, ONE OF THE MOST EFFICIENT PETASCALE SYSTEMS!
A Generic Petascale System: 2.4 PFLOPS peak (HPL 1.4 PFLOPS)
1.4 MW max power – greenest production petascale system – but it won't scale to EFLOPS!
SCALE-OUT PRODUCT LINES FOR HPC
DL-Series
• Design center: rack
• Design focus: versatility & value
• Application: general purpose
• Management: essential and advanced management
• Density optimized for the data center

BL-Series
• Design center: blade enclosure in rack
• Design focus: integrated & optimized, maximum redundancy
• Application: general purpose / private cloud / scale-out
• Management: HP Insight Dynamics – advanced management, accelerated service delivery & change in minutes
• Shared infrastructure for accelerated service delivery

SL-Series
• Design center: rack
• Design focus: cost & features optimized for extreme scale-out
• Application: Web 2.0 / cloud / scale-out
• Management: home-grown management; basic management via IPMI/DCMI
• Extreme scale-out datacenters with lean management
HP Project Moonshot Infrastructure Strategy
Designed to unleash the promise of emerging extreme low-energy servers

Traditional x86 scale-out servers:
• 10s of servers per rack
• Processor-specific distributed servers
• Devices and instances proliferate with each server added

Moonshot project architecture:
• 1,000s of servers per rack
• Processor-neutral infrastructure
• Federated infrastructure scales seamlessly with additional servers
Breakthrough Savings and Simplicity
Energy, cost and space savings move the industry to a new architecture

Traditional x86: 400 servers, 10 racks, 20 switches, 1,600 cables, 91 kilowatts, $3.3M
HP 'Redstone': 1,600 servers, 1/2 rack, 2 switches, 41 cables, 9.9 kilowatts, $1.2M
Savings: 89% less energy, 94% less space, 63% less cost, 97% less complexity

Select hyperscale Web and data analytics applications show tremendous promise
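The savings percentages follow directly from the two configurations; a quick check (the slide rounds the cost and space figures slightly differently):

```python
# Verify the savings claims from the raw numbers on the slide.
traditional = {"cost_usd": 3.3e6, "power_kw": 91.0, "racks": 10.0, "cables": 1600}
redstone    = {"cost_usd": 1.2e6, "power_kw": 9.9,  "racks": 0.5,  "cables": 41}

for key in traditional:
    saving = 1 - redstone[key] / traditional[key]
    print(f"{key}: {saving:.0%} less")
# power: 89%, cables: 97%; cost and racks come out 64% and 95%,
# which the slide rounds to 63% and 94%.
```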
EXA-SCALE
The Exascale Power Challenge
• The current top-performing supercomputer achieves 10.5 Petaflops at 12.7 MW
• For Exascale we need ~100x performance at ~2x power. Evolutionary paths will not yield a 50x energy-efficiency improvement (a back-of-the-envelope check follows below)
• Four complementary approaches:
– Improvements in component technologies
– Architecture enhancements
– Improved power monitoring and control
– Changes in programming models
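A back-of-the-envelope check of that efficiency gap, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the "50x" efficiency claim.
current_flops = 10.5e15                 # 10.5 PFLOPS at ...
current_power = 12.7e6                  # ... 12.7 MW
eff_now = current_flops / current_power          # ~0.83 GFLOPS/W

target_flops = 100 * current_flops               # ~1 EFLOPS (100x)
target_power = 2 * current_power                 # ~25 MW (~2x)
eff_target = target_flops / target_power         # ~41 GFLOPS/W

print(f"now:    {eff_now / 1e9:.2f} GFLOPS/W")
print(f"target: {eff_target / 1e9:.2f} GFLOPS/W")
print(f"needed: {eff_target / eff_now:.0f}x more energy efficient")  # 50x
```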
Key metrics – the "famous" DOE chart
Exascale Performance Targets
[Chart: DOE 20 MW target vs. power-constrained commodity systems – log-scale performance (1 to 1000) over the years 2008-2020, with "Commodity" and "Stretch" curves and 10x / 2x gap annotations]
Principal Architectural Directions
• There are currently two main directions one can observe in large system architectures:
– A) HPC cluster with accelerators – highly heterogeneous
– B) Massively parallel processing system – weakly heterogeneous
• Systems belonging to category A today exhibit a higher energy efficiency (Tsubame 2 is a top example – the #1 Green500 production system)
• New applications should be written to run efficiently on both architectures. But what is the right programming paradigm?
Is an Exascale system large enough to handle even the most challenging problems – finally?
No – wrong question: the available system size determines the maximum problem size an engineer or scientist can address!
• Often it is thought to be the other way around
• With larger systems one can shift to more complex simulations – which certainly leads to new insights and also significant savings
• But "ensemble simulations" will be more important than individual "grand challenge" runs – and will generate monster data sets!
• Systems can be in classical setups, remote centers or up in the "Cloud". Still, you need to transfer the data – see the issue above.
EXA-SCALE @ HP-LABS
Interconnect
• Integrated CMOS nanophotonics
• 16 x 25 Gbps per fiber
Storage
• Memristor
• New memory hierarchies
Processor
• Many-core processor
• On-package & stacked DRAM
• ~10 TFLOPS
Electromechanical design
• Rack-as-a-chassis
• ~256 nodes per rack
• 50-75 kW per rack
• Cold-plate cooling
• ~2.5 Petaflops per rack
Fabric
• High-radix switches
• Optical I/O, CMOS core
• Low-diameter topologies (HyperX)
Overall system – 1 EXAFLOP
• ~100,000 nodes
• ~400 racks
• 20-30 MW
• Vertical & horizontal power capping
Node architecture
• Single-CPU node
• Silicon MCM
• Photonic interconnect
[Node diagram: CPU die with photonic transceivers, surrounded by DRAM and NV memory stacks, with WDM links to the network and microgrid power & cooling. Underlining in the original slide marked areas HP is investing in.]
A Strawman Exascale System
• 100,000 10-TFLOP compute nodes (or 1,000,000 1-TFLOP processors)
• 32 to 64 Petabytes of DRAM
• NVRAM capacity of at least 4x DRAM
• 400 GBytes/s of network bandwidth per node (derived per-node ratios follow below the diagram)
[System diagram: compute nodes and system nodes on parallel independent data networks, with a control-and-monitoring network and an external gateway]
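The headline numbers imply the following per-node ratios (derived arithmetic; the byte-per-flop figures are computed here, not quoted on the slide):

```python
# Derived ratios for the strawman system above (illustrative arithmetic).
nodes = 100_000
flops_per_node = 10e12            # 10 TFLOPS per node
dram_total = 48e15                # midpoint of the 32-64 PB range
net_bw_per_node = 400e9           # 400 GB/s per node

print(f"peak:           {nodes * flops_per_node / 1e18:.1f} EFLOPS")
print(f"DRAM per node:  {dram_total / nodes / 1e12:.2f} TB")
print(f"bytes per flop: {dram_total / (nodes * flops_per_node):.3f}")
print(f"network B/flop: {net_bw_per_node / flops_per_node:.2f}")
```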
A Strawman Exascale System
• Single chip, highly parallel CPU
• Stacked or on-substrate "near" memory
• DRAM or NVRAM "far" memory
• Integrated network interface
• Multiple photonic links for off-node communications
[Node assembly diagram: CPU with NICs (12/24), memory controllers, DRAM stacks and NVRAM stacks, and photonic interfaces to the data network connections]
SILICON PHOTONICS & INTERCONNECT
• Integrated photonics essential to meet power and bandwidth targets
• Two variations on the technology
– Direct modulation – hybrid ring laser
– Indirect modulation – silicon microring resonators
– Target <1 pJ/bit latch to latch, any distance
• High-radix photonic router
– Direct optical connection to the router
– Packet switching in CMOS (not optical switching)
– Challenge is to yield the part for wider applications
• Network topologies
– Minimise hop count for power, reliability and low latency
– HyperX network – scaling characteristics of a folded Clos, engineering characteristics of a mesh
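A rough feel for why HyperX keeps hop counts low: each dimension is fully connected, so the switch-to-switch diameter equals the number of dimensions. A toy model with illustrative parameters:

```python
# Toy HyperX model: switches form an L-dimensional array with full
# connectivity inside each dimension, so the switch-to-switch diameter
# is just L (at most one hop per dimension). Parameters are illustrative.
def hyperx_stats(dim_sizes, terminals_per_switch):
    switches = 1
    for s in dim_sizes:
        switches *= s
    radix = sum(s - 1 for s in dim_sizes) + terminals_per_switch
    diameter = len(dim_sizes)   # one hop per dimension, worst case
    return switches, radix, diameter

# Example: 3-D HyperX of 16x16x16 switches, 16 terminals each
switches, radix, diameter = hyperx_stats([16, 16, 16], 16)
print(f"switches: {switches}, endpoints: {switches * 16}")
print(f"switch radix: {radix}, network diameter: {diameter} hops")
```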
NV MEMORY & STORAGE
• Architecture
– First-level storage all solid state
– Direct-attached solid-state storage for scratch files, local checkpoints, data staging
– Remove the necessity for a DRAM working copy of NVRAM: radically improved bandwidth, byte addressability
– Disk still the lowest-cost bulk storage
• Device technology
– Multiple candidate technologies – we believe memristor has significant advantages
– The mainstream market is flash replacement
Can user hints improve power efficiency?
Characteristics of applications 1
[Charts: normalized Power, Total Energy and Run time vs. CPU clock (1.6-2.4 GHz) for MiniMD and for Streams]
• Floating-point-limited applications are most efficient at maximum clock speed
• Memory-bound applications are more efficient at reduced clock rates
• Applications have distinct phases – allow user hints to aid power management (a toy model follows below)
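A toy model of why the two workload classes diverge under clock scaling (the constants are illustrative assumptions, not measurements):

```python
# Toy model of node energy under clock scaling. All constants are
# illustrative assumptions: dynamic CPU power scales roughly with f^2,
# and STATIC stands in for leakage plus the rest of the node
# (memory, network, fans), which does not scale with the clock.
BASE_F = 2.4   # GHz, maximum clock
STATIC = 1.2   # non-scaling node power, relative units

def energy(f_ghz, compute_fraction):
    # Compute-bound work speeds up with the clock; memory-bound work does not.
    runtime = compute_fraction * (BASE_F / f_ghz) + (1 - compute_fraction)
    power = STATIC + (f_ghz / BASE_F) ** 2
    return power * runtime

for f in (1.6, 2.0, 2.4):
    print(f"{f:.1f} GHz   FP-limited: {energy(f, 0.9):.2f}"
          f"   memory-bound: {energy(f, 0.1):.2f}")
# FP-limited work pays the static power for longer at low clocks;
# memory-bound work saves energy because its runtime barely grows.
```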
Characteristics of applications 2 – need applications to be power-aware!
[Chart: AMG2007 – courtesy Sandia National Labs]
Still a Challenge: EESI Conclusions (European Exascale Software Initiative)
Software and Management
• Exascale systems will require on the order of 100,000 servers
• This forces you to rethink how servers are designed: you need to "co-design" the power, cooling, networking, storage, management, even the software applications in concert with the server itself -> Moonshot is the first step in this direction
• HP is already delivering important management building blocks with Gen8 iLO4, Insight CMU, etc.
• These systems will accelerate the "explosion of data" we are already seeing
• It will be impossible to "checkpoint" the "system" to local storage
• Just forget storing data somewhere in the "Cloud" – unless you compress by several orders of magnitude – but you can't! (a quick estimate follows below)
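To see why, a quick estimate of checkpoint and transfer times for the strawman machine's DRAM (the bandwidth figures are assumptions for illustration):

```python
# Why system-wide checkpoints stop working: time to dump the strawman
# machine's DRAM. Both I/O bandwidths are illustrative assumptions.
dram = 48e15                      # ~48 PB (midpoint of 32-64 PB)
fs_bw = 10e12                     # 10 TB/s parallel file system (assumed)
wan_bw = 100e9 / 8                # 100 Gb/s WAN link to "the Cloud" (assumed)

print(f"to parallel FS: {dram / fs_bw / 3600:.1f} hours per checkpoint")
print(f"to the Cloud:   {dram / wan_bw / 86400:.0f} days per transfer")
```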
Storage
• Traditional HPC file systems like Lustre will continue to improve and have their place in HPC systems
• HPC will benefit from "big data" technologies – see Vertica, Autonomy, et al.
• Vendors with experience in huge commercial installations might have an advantage!
• Data explosion: "silent data corruption" will become a bigger issue than it is today!
Another consequence:
• Exascale systems also mean:
– a petascale system in a box
– 200 k€ or 250 k$ and 20 kW
• Huge impact for those academic and industrial structures – including SMEs – that will be able to take advantage of Exascale technology.
HP Labs Exascale Expertise
• Fabrics
• Networks
• Protocols
• NVM (e.g., memristor)
• Photonics
• Architecture
THANK YOU