TRANSCRIPT
Supercomputers
Special Course of Computer Architecture
H.Amano
Supercomputers: Contents
• What are supercomputers?
• Architecture of Supercomputers
• Representative supercomputers
• Exa-Scale supercomputer project
Defining Supercomputers
• High-performance computers mainly for scientific computation.
• A huge amount of computation for biochemistry, physics, astronomy, meteorology, etc.
• Very expensive: developed and managed with national funds.
• High-level techniques are required to develop and manage them.
• The USA, Japan and China compete for the top-1 supercomputer.
• A large amount of national funding is used, so they tend to make political news.
  → For example, in Japan, the supercomputer project became a target of the budget review in Dec. 2009.
FLOPS
• Floating-Point Operations Per Second
• Floating-point number: (mantissa) × 2^(exponent)
• Double precision: 64 bit; single precision: 32 bit.
• The IEEE standard defines the format and rounding.

Field widths (sign / exponent / mantissa):
  Single: 1 / 8 / 23 bits
  Double: 1 / 11 / 52 bits
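As a sketch of the field layout, the bits of a double-precision value can be unpacked with Python's standard library (the helper name is ours):

```python
import struct

def fields(x: float):
    """Unpack a double into its (sign, exponent, mantissa) bit fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63
    exponent = (bits >> 52) & 0x7FF      # 11 exponent bits
    mantissa = bits & ((1 << 52) - 1)    # 52 mantissa bits
    return sign, exponent, mantissa

# 1.0 is stored as (-1)^0 x 1.0 x 2^(1023-1023)
print(fields(1.0))   # (0, 1023, 0)
```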
The range of performance
• 10^6: M (Mega), one million (100万)
• 10^9: G (Giga), one billion (10億)
• 10^12: T (Tera), one trillion (1兆)
• 10^15: P (Peta), one quadrillion (1000兆)
• 10^18: E (Exa), 100京
• 10 PFLOPS = 1京 (10^16) operations per second in Japanese → the name 「K」 (京) comes from it.

Examples:
• iPhone 4S: 140 MFLOPS
• High-end PC: 50-80 GFLOPS
• Powerful GPU: Tera-FLOPS class
• Supercomputers: 10 TFLOPS-30 PFLOPS, growing at a ratio of 1.9 times/year
How to select the top 1?
• Top500/Green500: performance executing Linpack
  • Linpack is a kernel for matrix computation.
  • Scale free.
  • Performance centric.
• Gordon Bell Prize
  • Peak performance, price/performance, special achievement.
• HPC Challenge
  • Global HPL matrix computation: computation.
  • Global RandomAccess: random memory access: communication.
  • EP STREAM per system: heavy-load memory access: memory performance.
  • Global FFT: a complicated problem requiring both memory and communication performance.
• Nov.: ACM/IEEE Supercomputing Conference
  • Top500, Gordon Bell Prize, HPC Challenge, Green500
• Jun.: International Supercomputing Conference
  • Top500, Green500
  • This year, now going on in Frankfurt.
From SACSIS2012 Invited Speech.
Top 500, July 2015 (the same as 2014; the top 5 have not changed since June 2013)

1. Tianhe-2 (天河) (China), National University of Defence Technology
   Intel Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi 31S1P
   Cores: 3,120,000   Performance: 33,862.7 TFLOPS (peak 54,902.4)   Power: 17,808 kW
2. Titan (USA), DOE/SC/Oak Ridge National Lab.
   Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x
   Cores: 550,640   Performance: 17,590 TFLOPS (peak 27,112.5)   Power: 8,209 kW
3. Sequoia (USA), DOE/NNSA/LLNL
   BlueGene/Q, Power BQC 16C 1.6GHz
   Cores: 1,572,864   Performance: 17,173.2 TFLOPS (peak 20,132.7)   Power: 7,890 kW
4. K (京) (Japan), RIKEN AICS
   SPARC64 VIIIfx 2.0GHz, Tofu interconnect, Fujitsu
   Cores: 705,024   Performance: 10,510 TFLOPS (peak 11,280)   Power: 12,659.9 kW
5. Mira (USA), DOE/SC/Argonne National Lab.
   BlueGene/Q, Power BQC 1.6GHz
   Cores: 786,432   Performance: 8,586.6 TFLOPS (peak 10,066.3)   Power: 3,945 kW
Top 500, July 2016

1. TaihuLight (太湖之光) (China), National Supercomputing Center in Wuxi
   Sunway (神威) SW26010, NRCPC
   Cores: 10,649,600   Performance: 93,014.6 TFLOPS   Power: 15,371 kW
2. Tianhe-2 (天河) (China), National University of Defence Technology
   Intel Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi 31S1P
   Cores: 3,120,000   Performance: 33,862.7 TFLOPS (peak 54,902.4)   Power: 17,808 kW
3. Titan (USA), DOE/SC/Oak Ridge National Lab.
   Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini interconnect, NVIDIA K20x
   Cores: 550,640   Performance: 17,590 TFLOPS (peak 27,112.5)   Power: 8,209 kW
4. Sequoia (USA), DOE/NNSA/LLNL
   BlueGene/Q, Power BQC 16C 1.6GHz
   Cores: 1,572,864   Performance: 17,173.2 TFLOPS (peak 20,132.7)   Power: 7,890 kW
5. K (京) (Japan), RIKEN AICS
   SPARC64 VIIIfx 2.0GHz, Tofu interconnect, Fujitsu
   Cores: 705,024   Performance: 10,510 TFLOPS (peak 11,280)   Power: 12,659.9 kW

TaihuLight took first place, the first change at the top in three years.
Sunway TaihuLight (太湖之光)
• Based on a Chinese original processor, the ShenWei (神威) SW26010 with 260 cores.
• Homogeneous type using a dedicated processor.
• High energy efficiency: 3rd place in the Green500.
Homogeneous vs. Heterogeneous
• Homogeneous
  • A special multi-core processor is used.
    • Sequoia, Mira: BlueGene/Q
    • K: SPARC64 VIIIfx
    • TaihuLight: ShenWei
  • Easy programming.
  • Wide target applications.
  • 5/6-dimensional torus.
• Heterogeneous: CPU + accelerators
  • Tianhe-2: Intel MIC (Xeon Phi)
  • Titan: GPU (NVIDIA Kepler)
  • Highly energy efficient if Kepler is used.
  • Programming is difficult.
  • Target applications must make use of the accelerator.
  • Infiniband + fat tree.
Green 500, Nov. 2014

1. L-CSC, Intel Xeon E5 + AMD FirePro, GSI Helmholtz Center: 5271.81 MFLOPS/W, 57.15 kW total
2. Suiren, Xeon E5 + PEZY-SC, KEK (High Energy Accelerator Research Organization): 4945.63 MFLOPS/W, 37.83 kW
3. TSUBAME-KFC, Intel Xeon E5 + NVIDIA K20x, Infiniband FDR, Tokyo Institute of Technology: 4447.58 MFLOPS/W, 35.39 kW
4. Storm 1, Xeon E5 + NVIDIA K20, Cray Inc.: 3962.73 MFLOPS/W, 44.54 kW
5. Wilkes, Dell T620 cluster, Intel Xeon E5 + NVIDIA K20, Infiniband FDR, Cambridge University: 3631.70 MFLOPS/W, 52.62 kW

Most of the top 5 systems are accelerated with NVIDIA Kepler K20 GPUs coupled with Intel Xeon CPUs.
PEZY-SC is an original accelerator.
Green500 2016
• Still being checked, but the best three were announced.

1. Shoubu, RIKEN/PEZY: 6.7774 GFLOPS/W
2. Satsuki, RIKEN/PEZY: 6.195 GFLOPS/W
3. TaihuLight, NSC in Wuxi: 6.051 GFLOPS/W
PEZY-SC [Torii2015]
• A host I/F and inter-processor I/F, plus two ARM control cores.
• A chip holds 4 Prefectures; each Prefecture has a 2MB L3 cache, a 4x4 array of Cities, and DDR4 channels.
• A City has an SFU and a 2x2 array of Villages with a 64KB L2 D-cache.
• A Village has 4 PEs; each pair of PEs shares a 2KB L1 D-cache.
• A 3-level hierarchical MIMD manycore: 4 PE x 4 (Village) x 16 (City) x 4 (Prefecture) = 1,024 PE
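The PE count stated above can be checked directly:

```python
# Hierarchy sizes from the slide: PEs per Village, Villages per City,
# Cities per Prefecture, Prefectures per chip.
pe, villages, cities, prefectures = 4, 4, 16, 4
total_pe = pe * villages * cities * prefectures
print(total_pe)  # 1024
```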
Homogeneous supercomputers
• NUMA-style multiprocessors.
• A remote DMA mechanism is provided, but coherent caches are not supported.
• Dedicated RISC CPUs are used.
  • PowerPC-based multicore: BlueGene/Q
  • SPARC-based multicore: K
• Special network: a multi-dimensional torus.
  • 6-dimensional torus: K
  • 5-dimensional torus: BlueGene/Q
Why are supercomputers so fast?
× Because they use a high-frequency clock.
The clock frequency of high-end PCs grew about 40%/year (Alpha 21064: 150MHz in 1992; Pentium 4: 3.2GHz; Nehalem: 3.3GHz in 2008), but the speed-up of the clock saturated in 2003 because of power and heat dissipation. The clock frequencies of K (2GHz) and Sequoia (1.6GHz) are lower than those of common PCs, and GPU clocks are lower still (Fermi: 1.3GHz, Kepler: 732MHz).
NUMA without coherent cache
Nodes 0-3, each with processors which can work independently, are connected to the shared memory through an interconnection network.
Remote DMA (user level)
A system call sets up the transfer through the kernel agent, but the data itself moves at user level: the protocol engine in the local node's network interface reads the sender buffer (data source) and sends it to the protocol engine of the remote node's network interface, which writes it into the destination buffer (data sink) without kernel involvement.
IBM's BlueGene/Q
• Successor of Blue Gene/L and Blue Gene/P.
• Sequoia consists of BlueGene/Q.
• 18 Power processors (16 computational, 1 control and 1 redundant) and network interfaces are provided on a chip.
• The intra-chip interconnection is a crossbar switch.
• 5-dimensional mesh/torus.
• 1.6GHz clock.
Supercomputer 「K」
• SPARC64 VIIIfx chip: 8 cores, a shared L2 cache and an interconnect controller, attached to memory.
• Tofu interconnect: a 6-D torus/mesh with an RDMA mechanism; NUMA or UMA+NORMA.
• 4 nodes/board, 24 boards/rack, 96 nodes/rack.
SACSIS2012 invited speech
Racks of K with the water cooling system.
6-dimensional torus: Tofu
• A 3-ary 4-cube: addresses 0***, 1***, 2***.
• A 3-ary 5-cube: addresses 0****, 1****, 2****.
• A k-ary n-cube has degree 2n and diameter (k-1)×n.
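The degree and diameter formulas can be checked against the examples above (a minimal sketch):

```python
def degree(n):
    """Each of the n dimensions adds two links (+ and -)."""
    return 2 * n

def diameter(k, n):
    """At most (k-1) hops per dimension, as on the slide."""
    return (k - 1) * n

# The 3-ary 4-cube and 3-ary 5-cube from the slide
print(degree(4), diameter(3, 4))   # 8 8
print(degree(5), diameter(3, 5))   # 10 10
```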
Heterogeneous supercomputers
• Accelerators:
  • GPUs have been introduced earlier in this lecture.
  • The most recent Kepler is highly energy efficient.
  • The Intel MIC (Xeon Phi) used in Tianhe-2 is a many-core accelerator which runs the x86 ISA.
• Infiniband with a fat tree is mostly used for the interconnection network.
Kepler K20
• All top 14 machines in the Green500 use the K20m/c/X as their accelerator.
  • K20m with a fan: stand-alone workstations.
  • K20c without a fan: for rack mounting.
  • K20X: high performance.
• 732MHz operational clock, lower than the previous Fermi (1.3GHz).
• 2688 single-precision CUDA cores (3.94 TFLOPS).
• 896 double-precision CUDA cores (1.31 TFLOPS).
• A highly energy-efficient accelerator.
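The quoted TFLOPS figures follow from cores x clock x 2, assuming a fused multiply-add counts as two floating-point operations per cycle:

```python
clock = 732e6                    # Hz
sp_cores, dp_cores = 2688, 896
sp_peak = sp_cores * clock * 2   # FMA = 2 FLOPs per core per cycle
dp_peak = dp_cores * clock * 2
print(sp_peak / 1e12, dp_peak / 1e12)  # about 3.94 and 1.31 TFLOPS
```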
Intel Xeon Phi (MIC: Many Integrated Core)
• An accelerator, but it can also run in stand-alone mode.
• x86-compatible instruction set.
• 60-80 cores with 512-bit SIMD instructions: 8 double-precision operations can be executed per cycle.
• 1.1GHz clock, more than 1 TFLOPS per card.
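The "more than 1 TFLOPS" claim can be reproduced from these figures (a sketch assuming 60 cores and an FMA counted as two operations):

```python
cores = 60
clock = 1.1e9                 # Hz
lanes = 512 // 64             # 8 doubles fit in a 512-bit SIMD register
flops_per_cycle = lanes * 2   # assuming FMA counts as two operations
peak = cores * clock * flops_per_cycle
print(peak / 1e12)            # just over 1 TFLOPS per card
```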
Xeon Phi Microarchitecture
All cores, each with its own L2 cache, are connected with the tag directories (TD) and the GDDR memory controllers through a ring interconnect. All L2 caches are kept coherent with directory-based management, so the Xeon Phi is classified as CC (Cache Coherent) NUMA. Of course, all cores are multithreaded and provide 512-bit SIMD instructions.
Peak performance vs. Linpack performance
(Plot of peak against Linpack PetaFLOPS for K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China) and Tsubame (Japan).)
The difference is large in machines with accelerators (GPUs), while homogeneous machines come close to their peak. The accelerator type is energy efficient.
Arithmetic Intensity
• The number of floating-point calculations per byte of data read.
From Hennessy & Patterson's textbook.
Roof Line model
• Performance versus arithmetic intensity.
• At low arithmetic intensity the machine is memory bound; at high intensity it is computing bound.
From Hennessy & Patterson's textbook.
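The roofline itself is just the minimum of two bounds; a minimal sketch (the peak and bandwidth numbers are illustrative, not from any specific machine):

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline: performance is capped by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative machine: 1000 GFLOPS peak, 100 GB/s memory bandwidth.
print(attainable_gflops(1000, 100, 2))    # memory bound: 200 GFLOPS
print(attainable_gflops(1000, 100, 50))   # computing bound: 1000 GFLOPS
```

The "ridge point" where the two bounds meet (peak/bandwidth, here at intensity 10) separates memory-bound from compute-bound problems.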
Which is a good accelerator?
(Plot of floating calculations (GFLOPS/sec) against arithmetic intensity.) An ideal accelerator gives high performance for problems with strong arithmetic intensity, and middle performance for problems across a wide range of arithmetic intensity.
Example of the roofline model of GPUs and multi-cores, from Hennessy & Patterson's textbook.
Infiniband
• Point-to-point direct serial interconnection.
• Uses 8b/10b coding.
• Various topologies can be supported.
• Multicasting and atomic transactions are supported.
• Maximum throughput:

        SDR        DDR        QDR
  1X    2 Gbit/s   4 Gbit/s   8 Gbit/s
  4X    8 Gbit/s   16 Gbit/s  32 Gbit/s
  12X   24 Gbit/s  48 Gbit/s  96 Gbit/s
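The table follows from the per-lane signaling rates and the 8b/10b coding overhead (signaling rates assumed here: 2.5/5/10 Gbit/s per lane for SDR/DDR/QDR):

```python
# Per-lane signaling rates (Gbit/s); 8b/10b coding leaves 80% for data.
signal = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}

def data_rate(gen, lanes):
    return signal[gen] * 0.8 * lanes

for lanes in (1, 4, 12):
    print(lanes, [data_rate(g, lanes) for g in ("SDR", "DDR", "QDR")])
# reproduces the table: 1X -> 2/4/8, 4X -> 8/16/32, 12X -> 24/48/96 Gbit/s
```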
Fat Tree
Myrinet-Clos is actually a type of fat tree.
Myrinet-Clos (1/2)
• 128 nodes (Clos128)
Why Top 1?
• Top 1 is just a measure of matrix computation.
• Linpack is a weak-scaling benchmark with high arithmetic intensity.
• Top 1 of the Green500, the Gordon Bell Prize, and top 1 of each HPC Challenge program → all such machines are valuable. TV and newspapers focus too much on the Top500.
• However, most top-1 computers also got the Gordon Bell Prize and HPC Challenge top 1.
  • K and Sequoia.
• The impact of being Top 1 is great!
Why Exa-scale supercomputers?
• The ratio of the serial part becomes small for large-scale problems.
• Linpack is a scale-free benchmark: serial execution part 1 day + parallel execution part 10 years
  → 1 day + 1 day: a big impact.
• Are there any big programs which cannot be solved by K but can be solved by Exa-scale supercomputers?
  • The number of such programs will decrease.
  • Can we find new areas of application?
• It is important that such big computing power is open for research.
Amdahl's law
Serial part 1%, parallel part 99%, accelerated by parallel processing:
  speedup = 1 / (0.01 + 0.99/p)
50 times with 100 cores, 91 times with 1000 cores.
If there is even a small serial execution part, the performance improvement is limited.
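The two speedup figures can be reproduced from the formula:

```python
def speedup(serial_fraction, p):
    """Amdahl's law: 1 / (s + (1 - s)/p) for p cores."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)

print(round(speedup(0.01, 100)))    # 50 times with 100 cores
print(round(speedup(0.01, 1000)))   # 91 times with 1000 cores
```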
Strong Scaling vs. Weak Scaling
• Strong scaling: the size of the problem (the size of the treated data) is fixed.
  • It is difficult to improve performance because of Amdahl's law.
• Weak scaling: the size of the problem is scaled along with the size of the computer.
  • Linpack is a weak-scaling benchmark.
• Discussion: for the evaluation of computer architecture, weak scaling can be misleading! Extremely large supercomputers are developed for extremely large problems.
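The weak-scaling view is often summarized by Gustafson's law (not named on the slide); a sketch contrasting it with Amdahl's fixed-size speedup, using an illustrative 1% serial fraction:

```python
def amdahl(s, p):
    """Strong scaling: fixed problem size, speedup 1/(s + (1-s)/p)."""
    return 1.0 / (s + (1.0 - s) / p)

def gustafson(s, p):
    """Weak scaling: the problem grows with p, scaled speedup s + (1-s)p."""
    return s + (1.0 - s) * p

p = 1000
print(amdahl(0.01, p))     # limited to about 91
print(gustafson(0.01, p))  # scaled speedup about 990
```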
Japanese Exa-scale computer
• The Japanese national project for an exa-scale computer has started.
• RIKEN is developing an exa-scale (post-peta) computer by 2020.
• The architecture is now being planned.
• For exa-scale, 70,000,000 cores are needed: the budget limitation is severer than the technical limits.
Motivation and limitation
• Integrates computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.
• A flagship and a symbol.
• No computer development remains in Japan other than supercomputers.
• A super computing power is open for peaceful research.
• It is a tool which makes impossible analyses possible.
• What needs infinite computing power?
• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?
• Does a floating-point-centric supercomputer built to solve Linpack as fast as possible really fit the demand?
Look at the Exa-scale computer project!
Should we develop floating-point-computation-centric supercomputers?
• What do people want a big supercomputer to do?
  • Finding new medicines: pattern matching.
  • Simulation of earthquakes; meteorology for analyzing global warming.
  • Big data.
  • Artificial intelligence.
• Most of them are not suitable for floating-point-centric supercomputers.
• "Supercomputers for big data" or "super-cloud computers" might be required.
What is WSC?
• WSC (Warehouse-Scale Computing)
  • Google, Amazon, Yahoo, etc.
  • An extremely large cluster with more than 50,000 nodes.
  • Consists of economical homogeneous components.
  • Reliability is mainly kept by software.
  • Power supply and the cooling system are important design factors.
• Cloud computing is supported by such WSCs.
A WSC is not a simple big data center
• Homogeneous structure
  • In a data center, various types of clusters are used; software and application packages are also various.
  • In a WSC, homogeneous tailored hardware is used; software is custom-made or free software.
• Cost
  • In a data center, the largest cost is often the people who maintain it.
  • In a WSC, the server hardware is the greatest cost.
• WSCs are more like supercomputers than the clusters in data centers.
Figure 6.5 Hierarchy of switches in a WSC. (Based on Figure 1.2 of Barroso and Hölzle [2009].)
Figure 6.8 The Layer 3 network used to link arrays together and to the Internet [Greenberg et al. 2009]. Some WSCs use a
separate border router to connect the Internet to the datacenter Layer 3 switches.
Figure 6.19 Google customizes a standard 1AAA container: 40 x 8 x 9.5 feet (12.2 x 2.4 x 2.9 meters). The servers are stacked
up to 20 high in racks that form two long rows of 29 racks each, with one row on each side of the container. The cool aisle goes
down the middle of the container, with the hot air return being on the outside. The hanging rack structure makes it easier to repair
the cooling system without removing the servers. To allow people inside the container to repair components, it contains safety
systems for fire detection and mist-based suppression, emergency egress and lighting, and emergency power shut-off. Containers
also have many sensors: temperature, airflow pressure, air leak detection, and motion-sensing lighting. A video tour of the
datacenter can be found at http://www.google.com/corporate/green/datacenters/summit.html. Microsoft, Yahoo!, and many others
are now building modular datacenters based upon these ideas but they have stopped using ISO standard containers since the size is
inconvenient.
Exercise
• A target program: serial computation part: 1; parallel computation part: N^3
• K: 700,000 cores
• Exa: 70,000,000 cores
• What N makes Exa 10 times faster than K?
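One way to explore the exercise numerically, modeling execution time simply as serial part + parallel part / cores (a sketch, not the only valid model):

```python
def t(n, cores):
    """Execution time: serial part 1, parallel part N^3 spread over cores."""
    return 1 + n**3 / cores

def ratio(n):
    """How many times faster Exa (7e7 cores) is than K (7e5 cores)."""
    return t(n, 700_000) / t(n, 70_000_000)

# find the smallest N where Exa is at least 10 times faster than K
n = 1
while ratio(n) < 10:
    n += 1
print(n)
```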
Japanese supercomputers
• K supercomputer: a homogeneous, scalar-type, massively parallel computer.
• Earth Simulator: vector computers; the difference between peak and Linpack performance is small.
• TIT's Tsubame: a lot of GPUs are used; an energy-efficient supercomputer.
• Nagasaki University's DEGIMA: a lot of GPUs are used; a hand-made supercomputer with high cost-performance; Gordon Bell Prize cost-performance winner.
• GRAPE projects: dedicated SIMD supercomputers for astronomy; various versions won the Gordon Bell Prize.
SACSIS2012 Invited Speech
The Earth Simulator
640 nodes (Node 0 ... Node 639); each node has 8 vector processors (0-7) sharing a 16GB memory. Nodes are connected by the interconnection network (16GB/s x 2).
Peak performance: 40 TFLOPS.
Pipeline processing
(Stages 1-6.) Each stage sends its result and receives its input every clock cycle, so N stages give N times the performance. Data dependencies cause RAW hazards and degrade the performance. If a large array is processed, many stages can work efficiently.
Vector computers
The classic style of supercomputer since the Cray-1; the Earth Simulator is also a vector supercomputer. Operands a0, a1, a2, ... and b0, b1, b2, ... stream from the vector registers into a pipelined multiplier and adder, computing X[i] = A[i]*B[i]; Y = Y + X[i], with one element entering the pipeline every cycle.
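The multiply-accumulate that the figures animate can be sketched in plain Python (scalar code standing in for the pipelined vector units; the data values are illustrative):

```python
A = [1.0, 2.0, 3.0, 4.0]
B = [10.0, 20.0, 30.0, 40.0]

# X[i] = A[i] * B[i]; Y = Y + X[i], one element per pipeline beat
X = [a * b for a, b in zip(A, B)]
Y = 0.0
for x in X:
    Y += x
print(Y)   # 300.0
```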
Data transfer methods for vector computers
• Stride data access
  • Data located a certain gap apart are accessed continuously.
  • Used for sparse-matrix and flexible matrix computation.
• Gather/Scatter
  • Gather: data distributed in the memory system are loaded into the vector registers with a continuous access.
  • Scatter: the results in the vector registers are distributed into memory with a continuous access.
• Both functions need many memory banks: a powerful memory system is essential for vector machines.
• A cache is not so efficient, but it is still effective in reducing the start-up time.
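Stride, gather and scatter can be illustrated with plain Python lists standing in for memory and a vector register (the indices and values are illustrative):

```python
memory = list(range(16))

# Stride access: elements a fixed gap apart (here every 4th word)
stride = memory[0::4]                      # [0, 4, 8, 12]

# Gather: scattered elements loaded contiguously into a vector register
index = [3, 7, 2, 9]
vreg = [memory[i] for i in index]          # [3, 7, 2, 9]

# Scatter: vector-register results written back to scattered locations
for i, v in zip(index, [x * 10 for x in vreg]):
    memory[i] = v
print(stride, vreg, memory[3], memory[7])
```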
The Earth Simulator: a simple NUMA.
TIT's Tsubame: a well-balanced supercomputer with GPUs.
Nagasaki Univ's DEGIMA.