TRANSCRIPT
HPC 2012 Cetraro – Technology Trends in HPC
Dr.-Ing. Frank Baetke, June 25th, 2012
TOP500
TOP500 – Entry Distribution by Vendors
Vendors – November 2011 – absolute entry counts:
– HP Total: 142
– IBM: 223
– Others: 39
– Cray Inc.: 27
– SGI: 17
– Dell: 14
– Oracle (Sun): 10
– Bull: 15
– Appro: 13
Green500
Green500 – The Idea
• TOP500 systems and power efficiency: the ongoing discussion about power consumption and energy efficiency has an impact on most procurements and RFIs today.
• Two professors from the Dept. of Computer Science at Virginia Tech, Blacksburg – Dr. Wu-chun Feng and Dr. Kirk W. Cameron – came up with the idea of a "Green TOP500" list to complement the "classic" TOP500 list. More details can be found in several presentations, mainly:
– Green Computing for a Clean Tomorrow
– Global Climate Warming? Yes ... in the Machine Room
– Making a Case for a Green500 List
• Available at www.green500.org
Green500 Metric and Issues
• The list contains the same entries as the "classic" TOP500 list, but ordered by energy efficiency. The original names, the Rmax values (TFLOP/s obtained with the Linpack N×N test) and the original rank are retained in the Green500.
• The Green500 is the TOP500 list ordered by PpW (performance per Watt):
PpW = Performance [Rmax] / Power [measured or reported]
so the system with the best PpW value is at number 1, the worst at number 500 (a small numeric sketch follows below).
• Performance is based on the Rmax value from the original TOP500 list
• HP's Tsubame 2, listed as #5 in the TOP500, is listed as #10 in the Green500 but remains the largest "standards-based system" in the list
• The really critical question is how to measure "power". The list defines it as the power consumed during execution of the Linpack N×N test. HP has measured such data for only a few system configurations. The critical question remains: how to measure power consumption correctly? This is a much more complex task than most would assume!
• Kudos to Nathalie Bates, who seems to be making real progress here!
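A minimal sketch of the PpW ranking rule (the entries below are illustrative placeholders, not published list data):

```python
# Rank systems by PpW = Rmax / Power, as the Green500 does.
# The entries are illustrative placeholders, not real list data.
systems = [
    {"name": "System A", "rmax_tflops": 1192.0, "power_kw": 1398.6},
    {"name": "System B", "rmax_tflops": 800.0,  "power_kw": 1600.0},
    {"name": "System C", "rmax_tflops": 250.0,  "power_kw": 200.0},
]

for s in systems:
    # MFLOPS per Watt: (TFLOP/s * 1e6) / (kW * 1e3)
    s["ppw_mflops_w"] = s["rmax_tflops"] * 1e6 / (s["power_kw"] * 1e3)

for rank, s in enumerate(sorted(systems, key=lambda s: -s["ppw_mflops_w"]), 1):
    print(f"#{rank} {s['name']}: {s['ppw_mflops_w']:.0f} MFLOPS/W")
```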
Graph500
Graph500 – An Unusual Benchmark
Graph500 – A Few Details ...
• The benchmark consists of a set of kernels from graph algorithms, the metric is TEPS (traversed edges per second), and the list is the Graph500 (a sketch of both kernels follows below).
• The intent of this benchmark problem ("Search") is to develop a compact application that has multiple analysis techniques (multiple kernels) accessing a single data structure representing a weighted, undirected graph. In addition to a kernel to construct the graph from the input tuple list, there is one additional computational kernel to operate on the graph.
• This benchmark includes a scalable data generator which produces edge tuples containing the start vertex and end vertex for each edge. The first kernel constructs an undirected graph in a format usable by all subsequent kernels. No subsequent modifications are permitted to benefit specific kernels. The second kernel performs a breadth-first search of the graph. Both kernels are timed.
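A minimal sketch of the two timed kernels, assuming a toy edge-tuple list (the real benchmark uses a scalable Kronecker-style generator and vastly larger graphs):

```python
# Sketch of the two timed Graph500 steps: graph construction from an
# edge-tuple list, then a breadth-first search. The edge list is a toy.
from collections import defaultdict, deque
import time

def build_graph(edge_tuples):
    """Kernel 1: construct an undirected graph from (start, end) tuples."""
    adj = defaultdict(list)
    for u, v in edge_tuples:
        adj[u].append(v)
        adj[v].append(u)
    return adj

def bfs(adj, root):
    """Kernel 2: breadth-first search; returns parents and edges scanned."""
    parent = {root: root}
    traversed = 0
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            traversed += 1
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent, traversed

edges = [(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)]  # toy input tuples
adj = build_graph(edges)
t0 = time.perf_counter()
_, traversed = bfs(adj, root=0)
teps = traversed / (time.perf_counter() - t0)  # traversed edges per second
print(f"TEPS: {teps:.3e}")
```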
Graph500 – Nov 2011 List (just 49 entries in total)
Combining them ... ??
Combining all 500 Lists – Tsubame 2 wins!
HPCC
HPCC – HPC Challenge – the motivation: overcoming the limitations of Linpack N×N (HPL)
Summary and Trends
• TOP500 will maintain very high visibility – but very few understand the metric behind ALL entries (commercial sites)
• The Green500 will stay behind, but energy efficiency will become a key metric – all the data are in the TOP500 anyway. Issue: how to measure "power" is key!
• The HPL algorithms will be changed (Jack D. et al.), as larger systems would require days to finish a single run
• Graph500 might become an "add-on" for big data – but only for the few who understand its meaning and metric
• HPCC provides good insight but doesn't produce a nice ranking (but that's what most sites or funding agencies want – a key reason for being eager to get onto the TOP500)
AN UGLY ROOM?
TSUBAME 2.0 Layout (200 sqm for main compute nodes)
NO, ONE OF THE MOST EFFICIENT PETASCALE SYSTEMS!
A Generic Petascale System: 2.4 PFLOPS peak (HPL 1.4 PFLOPS)
1.4 MW max power – greenest production petascale system – but it won't scale to EFLOPS!
SCALE-OUT PRODUCT LINES FOR HPC
DL-Series
• Design center: rack
• Design focus: versatility & value
• Application: general purpose
• Management: essential and advanced management
• Density optimized for the data center

BL-Series
• Design center: blade enclosure in rack
• Design focus: integrated & optimized, maximum redundancy
• Application: general purpose / private cloud / scale-out
• Management: HP Insight Dynamics – advanced management, accelerated service delivery & change in minutes
• Shared infrastructure for accelerated service delivery

SL-Series
• Design center: rack
• Design focus: cost & features optimized for extreme scale-out
• Application: Web 2.0 / cloud / scale-out
• Management: home-grown management; basic management via IPMI/DCMI
• Extreme scale-out datacenters with lean management
HP Project Moonshot Infrastructure Strategy
Designed to unleash the promise of emerging extreme low-energy servers

Traditional x86 scale-out servers:
• 10s of servers per rack
• Processor-specific distributed servers
• Devices and instances proliferate with each server added

Moonshot project architecture:
• 1,000s of servers per rack
• Processor-neutral infrastructure
• Federated infrastructure scales seamlessly with additional servers
Breakthrough Savings and Simplicity
Energy, cost and space savings move the industry to a new architecture

Traditional x86: 400 servers, 10 racks, 20 switches, 1,600 cables, 91 kilowatts, $3.3M
HP 'Redstone': 1,600 servers, 1/2 rack, 2 switches, 41 cables, 9.9 kilowatts, $1.2M
Savings: 89% less energy, 94% less space, 63% less cost, 97% less complexity

Select hyperscale Web and data analytics applications show tremendous promise
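The savings percentages follow directly from the two configurations; a quick check (the slide rounds the cost and space figures slightly differently):

```python
# Verify the savings claims from the raw numbers on the slide.
traditional = {"cost_usd": 3.3e6, "power_kw": 91.0, "racks": 10.0, "cables": 1600}
redstone    = {"cost_usd": 1.2e6, "power_kw": 9.9,  "racks": 0.5,  "cables": 41}

for key in traditional:
    saving = 1 - redstone[key] / traditional[key]
    print(f"{key}: {saving:.0%} less")
# power: 89%, cables: 97%; cost and racks come out 64% and 95%,
# which the slide rounds to 63% and 94%.
```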
EXA-SCALE
The Exascale Power Challenge
• The current top-performing supercomputer achieves 10.5 Petaflops at 12.7 MW
• For Exascale we need ~100x performance at ~2x power. Evolutionary paths will not yield a 50x energy-efficiency improvement (a back-of-the-envelope check follows below)
• Four complementary approaches:
– Improvements in component technologies
– Architecture enhancements
– Improved power monitoring and control
– Changes in programming models
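A back-of-the-envelope check of that efficiency gap, using only the numbers quoted above:

```python
# Back-of-the-envelope check of the "50x" efficiency claim.
current_flops = 10.5e15                 # 10.5 PFLOPS at ...
current_power = 12.7e6                  # ... 12.7 MW
eff_now = current_flops / current_power          # ~0.83 GFLOPS/W

target_flops = 100 * current_flops               # ~1 EFLOPS (100x)
target_power = 2 * current_power                 # ~25 MW (~2x)
eff_target = target_flops / target_power         # ~41 GFLOPS/W

print(f"now:    {eff_now / 1e9:.2f} GFLOPS/W")
print(f"target: {eff_target / 1e9:.2f} GFLOPS/W")
print(f"needed: {eff_target / eff_now:.0f}x more energy efficient")  # 50x
```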
Key metrics – the "famous" DOE chart
Exascale Performance Targets
[Chart: DOE 20 MW target vs. power-constrained commodity systems – log-scale performance (1 to 1000) over the years 2008-2020, with "Commodity" and "Stretch" curves and 10x / 2x gap annotations]
Principal Architectural Directions
• There are currently two main directions one can observe in large system architectures:
– A) HPC cluster with accelerators – highly heterogeneous
– B) Massively parallel processing system – weakly heterogeneous
• Systems belonging to category A today exhibit a higher energy efficiency (Tsubame 2 is a top example – the #1 Green500 production system)
• New applications should be written to run efficiently on both architectures. But what is the right programming paradigm?
Is an Exascale system large enough to handle even the most challenging problems – finally?
No – wrong question: the available system size determines the maximum problem size an engineer or scientist can address!
• Often it is thought to be the other way around
• With larger systems one can shift to more complex simulations – which certainly leads to new insights and also significant savings
• But "ensemble simulations" will be more important than individual "grand challenge" runs – and will generate monster data sets!
• Systems can be in classical setups, remote centers or up in the "Cloud". Still, you need to transfer the data – see the issue above.
EXA-SCALE @ HP-LABS
Interconnect
• Integrated CMOS nanophotonics
• 16 x 25 Gbps per fiber
Storage
• Memristor
• New memory hierarchies
Processor
• Many-core processor
• On-package & stacked DRAM
• ~10 TFLOPS
Electromechanical design
• Rack-as-a-chassis
• ~256 nodes per rack
• 50-75 kW per rack
• Cold-plate cooling
• ~2.5 Petaflops per rack
Fabric
• High-radix switches
• Optical I/O, CMOS core
• Low-diameter topologies (HyperX)
Overall system – 1 EXAFLOP
• ~100,000 nodes
• ~400 racks
• 20-30 MW
• Vertical & horizontal power capping
Node architecture
• Single-CPU node
• Silicon MCM
• Photonic interconnect
[Node diagram: CPU die with photonic transceivers, surrounded by DRAM and NV memory stacks, with WDM links to the network and microgrid power & cooling. Underlining in the original slide marked areas HP is investing in.]
A Strawman Exascale System
• 100,000 10-TFLOP compute nodes (or 1,000,000 1-TFLOP processors)
• 32 to 64 Petabytes of DRAM
• NVRAM capacity of at least 4x DRAM
• 400 GBytes/s of network bandwidth per node (derived per-node ratios follow below the diagram)
[System diagram: compute nodes and system nodes on parallel independent data networks, with a control-and-monitoring network and an external gateway]
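The headline numbers imply the following per-node ratios (derived arithmetic; the byte-per-flop figures are computed here, not quoted on the slide):

```python
# Derived ratios for the strawman system above (illustrative arithmetic).
nodes = 100_000
flops_per_node = 10e12            # 10 TFLOPS per node
dram_total = 48e15                # midpoint of the 32-64 PB range
net_bw_per_node = 400e9           # 400 GB/s per node

print(f"peak:           {nodes * flops_per_node / 1e18:.1f} EFLOPS")
print(f"DRAM per node:  {dram_total / nodes / 1e12:.2f} TB")
print(f"bytes per flop: {dram_total / (nodes * flops_per_node):.3f}")
print(f"network B/flop: {net_bw_per_node / flops_per_node:.2f}")
```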
A Strawman Exascale System
• Single chip, highly parallel CPU
• Stacked or on-substrate "near" memory
• DRAM or NVRAM "far" memory
• Integrated network interface
• Multiple photonic links for off-node communications
[Node assembly diagram: CPU with NICs (12/24), memory controllers, DRAM stacks and NVRAM stacks, and photonic interfaces to the data network connections]
SILICON PHOTONICS & INTERCONNECT
• Integrated photonics essential to meet power and bandwidth targets
• Two variations on the technology
– Direct modulation – hybrid ring laser
– Indirect modulation – silicon microring resonators
– Target <1 pJ/bit latch to latch, any distance
• High-radix photonic router
– Direct optical connection to the router
– Packet switching in CMOS (not optical switching)
– Challenge is to yield the part for wider applications
• Network topologies
– Minimise hop count for power, reliability and low latency
– HyperX network – scaling characteristics of a folded Clos, engineering characteristics of a mesh
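A rough feel for why HyperX keeps hop counts low: each dimension is fully connected, so the switch-to-switch diameter equals the number of dimensions. A toy model with illustrative parameters:

```python
# Toy HyperX model: switches form an L-dimensional array with full
# connectivity inside each dimension, so the switch-to-switch diameter
# is just L (at most one hop per dimension). Parameters are illustrative.
def hyperx_stats(dim_sizes, terminals_per_switch):
    switches = 1
    for s in dim_sizes:
        switches *= s
    radix = sum(s - 1 for s in dim_sizes) + terminals_per_switch
    diameter = len(dim_sizes)   # one hop per dimension, worst case
    return switches, radix, diameter

# Example: 3-D HyperX of 16x16x16 switches, 16 terminals each
switches, radix, diameter = hyperx_stats([16, 16, 16], 16)
print(f"switches: {switches}, endpoints: {switches * 16}")
print(f"switch radix: {radix}, network diameter: {diameter} hops")
```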
NV MEMORY & STORAGE
• Architecture
– First-level storage all solid state
– Direct-attached solid-state storage for scratch files, local checkpoints, data staging
– Remove the necessity for a DRAM working copy of NVRAM: radically improved bandwidth, byte addressability
– Disk still the lowest-cost bulk storage
• Device technology
– Multiple candidate technologies – we believe memristor has significant advantages
– The mainstream market is flash replacement
Can user hints improve power efficiency?
Characteristics of applications 1
[Charts: normalized Power, Total Energy and Run time vs. CPU clock (1.6-2.4 GHz) for MiniMD and for Streams]
• Floating-point-limited applications are most efficient at maximum clock speed
• Memory-bound applications are more efficient at reduced clock rates
• Applications have distinct phases – allow user hints to aid power management (a toy model follows below)
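A toy model of why the two workload classes diverge under clock scaling (the constants are illustrative assumptions, not measurements):

```python
# Toy model of node energy under clock scaling. All constants are
# illustrative assumptions: dynamic CPU power scales roughly with f^2,
# and STATIC stands in for leakage plus the rest of the node
# (memory, network, fans), which does not scale with the clock.
BASE_F = 2.4   # GHz, maximum clock
STATIC = 1.2   # non-scaling node power, relative units

def energy(f_ghz, compute_fraction):
    # Compute-bound work speeds up with the clock; memory-bound work does not.
    runtime = compute_fraction * (BASE_F / f_ghz) + (1 - compute_fraction)
    power = STATIC + (f_ghz / BASE_F) ** 2
    return power * runtime

for f in (1.6, 2.0, 2.4):
    print(f"{f:.1f} GHz   FP-limited: {energy(f, 0.9):.2f}"
          f"   memory-bound: {energy(f, 0.1):.2f}")
# FP-limited work pays the static power for longer at low clocks;
# memory-bound work saves energy because its runtime barely grows.
```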
Characteristics of applications 2 – need applications to be power-aware!
[Chart: AMG2007 – courtesy Sandia National Labs]
Still a Challenge: EESI Conclusions (European Exascale Software Initiative)
Software and Management
• Exascale systems will require on the order of 100,000 servers
• This forces you to rethink how servers are designed: you need to "co-design" the power, cooling, networking, storage, management, even the software applications in concert with the server itself -> Moonshot is the first step in this direction
• HP is already delivering important management building blocks with Gen8 iLO4, Insight CMU, etc.
• These systems will accelerate the "explosion of data" we are already seeing
• It will be impossible to "checkpoint" the "system" to local storage
• Just forget storing data somewhere in the "Cloud" – unless you compress by several orders of magnitude – but you can't! (a quick estimate follows below)
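To see why, a quick estimate of checkpoint and transfer times for the strawman machine's DRAM (the bandwidth figures are assumptions for illustration):

```python
# Why system-wide checkpoints stop working: time to dump the strawman
# machine's DRAM. Both I/O bandwidths are illustrative assumptions.
dram = 48e15                      # ~48 PB (midpoint of 32-64 PB)
fs_bw = 10e12                     # 10 TB/s parallel file system (assumed)
wan_bw = 100e9 / 8                # 100 Gb/s WAN link to "the Cloud" (assumed)

print(f"to parallel FS: {dram / fs_bw / 3600:.1f} hours per checkpoint")
print(f"to the Cloud:   {dram / wan_bw / 86400:.0f} days per transfer")
```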
Storage
• Traditional HPC file systems like Lustre will continue to improve and have their place in HPC systems
• HPC will benefit from "big data" technologies – see Vertica, Autonomy, et al.
• Vendors with experience in huge commercial installations might have an advantage!
• Data explosion: "silent data corruption" will become a bigger issue than it is today!
Another consequence:
• Exascale systems also mean:
– a petascale system in a box
– 200 k€ or 250 k$ and 20 kW
• Huge impact for those academic and industrial structures – including SMEs – that will be able to take advantage of Exascale technology.
HP Labs Exascale Expertise
• Fabrics
• Networks
• Protocols
• NVM (e.g., memristor)
• Photonics
• Architecture
THANK YOU