supercomputers - keio university › comparc › super.pdf · homogeneous vs. heterogeneous...

Supercomputers Special Course of Computer Architecture H.Amano


Post on 25-Jun-2020


Page 1: Supercomputers - Keio University › comparc › super.pdf · Homogeneous vs. Heterogeneous •Homogeneous •Special multi-core is used. •Sequoia, Mira: BlueGene Q •K: SPARC

Supercomputers

Special Course of Computer Architecture

H.Amano


Supercomputers: Contents

• What are supercomputers?

• Architecture of Supercomputers

• Representative supercomputers

• Exa-Scale supercomputer project


Defining Supercomputers

•High performance computers mainly for scientific computation.

• Huge amounts of computation for biochemistry, physics, astronomy, meteorology, etc.

• Very expensive: developed and managed with national funds.

• High-level techniques are required to develop and manage them.

• The USA, Japan and China compete for the No. 1 supercomputer.

• Because a large amount of national funding is used, they tend to become political news. → For example, in Japan, the supercomputer project became a target of the budget review in Dec. 2009.


FLOPS

• Floating-point Operations Per Second

• Floating-point number: (mantissa) × 2^(exponent)

• Double precision: 64 bits; single precision: 32 bits.

• The IEEE 754 standard defines the format and rounding.

Format | sign | exponent (index) | mantissa
Single | 1 bit | 8 bits | 23 bits
Double | 1 bit | 11 bits | 52 bits
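
The field layout can be checked with a minimal Python sketch. It is only an illustration: the function name `decompose_double` is hypothetical, and the code assumes the IEEE 754 double format (1 sign bit, 11 exponent bits stored with a bias of 1023, 52 mantissa bits).

```python
import struct

def decompose_double(x):
    """Split a Python float (IEEE 754 double) into its three bit fields."""
    # Reinterpret the 8 bytes of the double as an unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                    # 1 bit
    exponent = (bits >> 52) & 0x7FF      # 11 bits, stored with a bias of 1023
    mantissa = bits & ((1 << 52) - 1)    # 52 bits, the leading 1 is implicit
    return sign, exponent, mantissa

sign, exponent, mantissa = decompose_double(-6.5)
# -6.5 = -1.625 x 2^2, so the unbiased exponent is 2.
print(sign, exponent - 1023, 1 + mantissa / 2**52)
```

For normal numbers the value is recovered as (-1)^sign × (1 + mantissa / 2^52) × 2^(exponent - 1023).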


The range of performance

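[Figure: performance scale from 10^6 to 10^18 FLOPS]

10^6: M (Mega), 1 million
10^9: G (Giga), 1 billion
10^12: T (Tera), 1 trillion
10^15: P (Peta), 1,000 trillion
10^18: E (Exa), 1 million trillion

10 PFLOPS = 10^16 operations per second = 1 京 (kei) in Japanese
→ The name 「K」 comes from it.

iPhone 4S: 140 MFLOPS
High-end PC: 50-80 GFLOPS
Powerful GPU: TFLOPS class
Supercomputers: 10 TFLOPS to 30 PFLOPS
Growth rate: 1.9 times/year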


How to select top 1?

• Top500/Green500: performance of executing Linpack
  • Linpack is a kernel for matrix computation.
  • Scale free
  • Performance centric.

• Gordon Bell Prize
  • Peak Performance, Price/Performance, Special Achievement

• HPC Challenge
  • Global HPL: matrix computation: computation performance
  • Global Random Access: random memory access: communication performance
  • EP stream per system: heavy-load memory access: memory performance
  • Global FFT: complicated problem requiring both memory and communication performance.

• Nov.: ACM/IEEE Supercomputing Conference
  • Top500, Gordon Bell Prize, HPC Challenge, Green500

• Jun.: International Supercomputing Conference
  • Top500, Green500
  • This year's conference is now under way in Frankfurt.


From SACSIS2012 Invited Speech.


Name | Development | Hardware | Cores | Performance TFLOPS (peak) | Power (kW)
Tianhe-2 (天河) (China) | National University of Defence Technology | Intel Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 | 33862.7 (54902.4) | 17808
Titan (USA) | DOE/SC/Oak Ridge National Lab. | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini Interconnect, NVIDIA K20x | 550640 | 17590 (27112.5) | 8209
Sequoia (USA) | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.6GHz | 1572864 | 17173.2 (20132.7) | 7890
K (京) (Japan) | RIKEN AICS | SPARC64 VIIIfx 2.0GHz, Tofu Interconnect, Fujitsu | 705024 | 10510 (11280) | 12659.9
Mira (USA) | DOE/SC/Argonne National Lab. | BlueGene/Q, Power BQC 1.6GHz | 786432 | 8586.6 (10066.3) | 3945

Top500, July 2015 (the same as 2014)

The top 5 have not changed since June 2013.
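
Dividing the Linpack performance by the power gives the energy efficiency that Green500 ranks. The following is a minimal sketch that uses only the figures listed in the table above; the variable and function names are illustrative.

```python
# (Linpack TFLOPS, power in kW) taken from the table above.
machines = {
    "Tianhe-2": (33862.7, 17808),
    "Titan": (17590, 8209),
    "Sequoia": (17173.2, 7890),
    "K": (10510, 12659.9),
    "Mira": (8586.6, 3945),
}

def gflops_per_watt(tflops, kilowatts):
    """TFLOPS/kW equals GFLOPS/W because both units scale by 1000."""
    return tflops / kilowatts

efficiency = {name: gflops_per_watt(*v) for name, v in machines.items()}
for name, value in sorted(efficiency.items(), key=lambda kv: -kv[1]):
    print(f"{name:10s} {value:.2f} GFLOPS/W")
```

With these figures the BlueGene/Q machines and Titan come out at roughly 2 GFLOPS/W, while K is below 1 GFLOPS/W.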


Name | Development | Hardware | Cores | Performance TFLOPS (peak) | Power (kW)
TaihuLight (太湖之光) | National Supercomputing Center in Wuxi | ShenWei (神威), NRCPC | 10649600 | 93014.6 | 15371
Tianhe-2 (天河) (China) | National University of Defence Technology | Intel Xeon E5-2692 12C 2.2GHz, TH Express-2, Intel Xeon Phi 31S1P | 3120000 | 33862.7 (54902.4) | 17808
Titan (USA) | DOE/SC/Oak Ridge National Lab. | Cray XK7, Opteron 6274 16C 2.2GHz, Cray Gemini Interconnect, NVIDIA K20x | 550640 | 17590 (27112.5) | 8209
Sequoia (USA) | DOE/NNSA/LLNL | BlueGene/Q, Power BQC 16C 1.6GHz | 1572864 | 17173.2 (20132.7) | 7890
K (京) (Japan) | RIKEN AICS | SPARC64 VIIIfx 2.0GHz, Tofu Interconnect, Fujitsu | 705024 | 10510 (11280) | 12659.9

Top500, July 2016

TaihuLight took first place: the first change at the top in 3 years.


Sunway TaihuLight (太湖之光)

• Based on an original Chinese processor, the ShenWei (神威) SW26010, with 260 cores.

• Homogeneous type using a dedicated processor.

• High energy efficiency: 3rd place in the Green500.


Homogeneous vs. Heterogeneous

• Homogeneous
  • A special multi-core processor is used.
  • Sequoia, Mira: BlueGene Q
  • K: SPARC64 VIIIfx
  • TaihuLight: ShenWei
  • Easy programming
  • Wide range of target applications
  • 5/6-dimensional torus

• Heterogeneous
  • CPU + accelerators
  • Tianhe-2: Intel MIC (Xeon Phi)
  • Titan: GPU (NVIDIA Kepler)
  • Highly energy efficient if Kepler is used.
  • Programming is difficult
  • The target application must make use of the accelerator.
  • Infiniband + fat tree


Rank | Machine | Place | MFLOPS/W | Total kW
1 | L-CSC, Intel Xeon E5 + AMD FirePro | GSI Helmholtz Center | 5271.81 | 57.15
2 | Suiren, Xeon E5 + PEZY-SC | KEK (High Energy Accelerator Research Organization) | 4945.63 | 37.83
3 | TSUBAME-KFC, Intel Xeon E5 + NVIDIA K20x, Infiniband FDR | Tokyo Institute of Technology | 4447.58 | 35.39
4 | Storm 1, Xeon E5 + NVIDIA K20 | Cray Inc. | 3962.73 | 44.54
5 | Wilkes, Dell T620 Cluster, Intel Xeon E5 + NVIDIA K20, Infiniband FDR | Cambridge University | 3631.70 | 52.62

Green500, Nov. 2014

Three of the top 5 systems (TSUBAME-KFC, Storm 1 and Wilkes) are accelerated with NVIDIA Kepler K20 GPUs, and all five are coupled with Intel Xeon CPUs.

PEZY-SC is an original accelerator.


Green500 2016

• The full list is still being checked, but the top three have been announced.

Rank | Name | Location | GFLOPS/W
1 | Shoubu | Riken/Pezy | 6.7774
2 | Satsuki | Riken/Pezy | 6.195
3 | TaihuLight | National Supercomputing Center in Wuxi | 6.051


PEZY-SC [Torii2015]

[Figure: PEZY-SC block diagram. The chip has a host I/F and inter-processor I/F, ARM x2 and 4 Prefectures. Each Prefecture has a 2MB L3 cache and 4x4 Cities, with two DDR4 blocks beside it. Each City has an SFU, 2x2 Villages and a 64KB L2 D cache. Each Village has 4 PEs, with a 2KB L1 D cache per pair of PEs.]

3-level hierarchical MIMD manycore: 4 PE x 4 (Village) x 16 (City) x 4 (Prefecture) = 1,024 PE


Homogeneous supercomputers

• NUMA-style multiprocessors.
  • A remote DMA mechanism is provided, but coherent caches are not supported.

• Dedicated CPUs (RISC) are used.
  • PowerPC-based multicore: BlueGene Q
  • SPARC-based multicore: K

• Special network: multi-dimensional torus
  • 6-dimensional torus: K
  • 5-dimensional torus: BlueGene Q


Why are supercomputers so fast?

× Because they use a high-frequency clock

[Figure: clock frequency of high-end PCs (100MHz to 1GHz and above) versus year (1992, 2000, 2008), growing at 40%/year. Plotted points: Alpha21064 150MHz, Pentium4 3.2GHz, Nehalem 3.3GHz, K 2GHz, Sequoia 1.6GHz, Fermi 1.3GHz, Kepler 732MHz.]

The speed-up of the clock saturated in 2003 because of power and heat dissipation.

The clock frequency of K and Sequoia is lower than that of common PCs.


NUMA without coherent cache

[Figure: Nodes 0 to 3, each a processor which can work independently, connected through an interconnection network to a shared memory.]


Remote DMA (user level)

[Figure: a local node (user level: data source, sender; kernel level: kernel agent, buffer; host I/F; network interface with protocol engine) and a remote node (network interface with protocol engine, buffer, data sink). Two paths are labeled: "System Call" and "RDMA User Level".]


IBM’s BlueGene Q

• Successor of Blue Gene L and Blue Gene P.

• Sequoia consists of BlueGene Q nodes.

• 18 Power processor cores (16 computational, 1 control and 1 redundant) and network interfaces are provided on a chip.

• The on-chip interconnection is a crossbar switch.

• 5-dimensional mesh/torus

• 1.6GHz clock.


Supercomputer 「K」

[Figure: SPARC64 VIIIfx chip with 8 cores, an L2 cache and memory; an interconnect controller links the chip to the Tofu interconnect (6-D torus/mesh).]

4 nodes/board

24 boards/rack

96 nodes/rack

RDMA mechanism

NUMA or UMA+NORMA


SACSIS2012 Invited speech


SACSIS2012 invited speech


water cooling system


Racks of K


6 dimensional torus

Tofu


3-ary 4-cube

0***

1***

2***


3-ary 5-cube

0****

1****

2****

Degree: 2*n

Diameter: (k-1)*n
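
The two formulas can be checked by brute force on small networks. The sketch below is an illustration under stated assumptions, not part of the original slides: the function names are hypothetical, and it builds a k-ary n-cube both without wraparound links (a mesh) and with them (a torus), then measures degree and diameter by breadth-first search. In this sketch the diameter (k-1)*n is reproduced by the version without wraparound links; with wraparound links the measured diameter is n*floor(k/2).

```python
from collections import deque
from itertools import product

def neighbors(node, k, wraparound):
    """Yield the neighbors of `node` in a k-ary n-cube (n = len(node))."""
    for dim, digit in enumerate(node):
        for step in (-1, 1):
            d = digit + step
            if wraparound:
                d %= k                  # torus: the ends of each dimension are linked
            elif not 0 <= d < k:
                continue                # mesh: no link beyond the edge
            if d != digit:
                yield node[:dim] + (d,) + node[dim + 1:]

def degree_and_diameter(k, n, wraparound):
    """Return (maximum degree, diameter) measured on the whole network."""
    nodes = list(product(range(k), repeat=n))
    max_degree = max(len(set(neighbors(v, k, wraparound))) for v in nodes)
    diameter = 0
    for src in nodes:                   # BFS from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            v = queue.popleft()
            for w in neighbors(v, k, wraparound):
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
        diameter = max(diameter, max(dist.values()))
    return max_degree, diameter

print(degree_and_diameter(3, 4, wraparound=False))  # 3-ary 4-cube, mesh links only
print(degree_and_diameter(3, 4, wraparound=True))   # 3-ary 4-cube, torus
```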


Heterogeneous supercomputers

• Accelerators:
  • GPUs have been introduced in this lecture.
  • The most recent Kepler is highly energy efficient.
  • Intel MIC (Xeon Phi), used in Tianhe-2, is a many-core accelerator which runs the x86 ISA.

• Infiniband with a fat tree is mostly used for the interconnection network.


Kepler K20

• All top 14 machines in Green500 use K20m/c/x as their accelerators.
  • K20c with a fan: for stand-alone workstations
  • K20m without a fan: for rack mounting
  • K20X: high performance

• 732MHz operational clock
  • lower than the previous Fermi (1.3GHz)

• 2688 single-precision CUDA cores (3.94 TFLOPS)

• 896 double-precision CUDA cores (1.31 TFLOPS)

• Highly energy-efficient accelerators


Intel Xeon Phi(MIC: Many Integrated Core)

• An accelerator, but it can also run in stand-alone mode.

• x86-compatible instruction set.

• 60-80 cores, 512-bit SIMD instructions
  • 8 double-precision operations can be executed in a cycle.

• 1.1GHz clock, more than 1 TFLOPS per card
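
The peak figures quoted for the K20X and the Xeon Phi follow from units × operations per cycle × clock. A minimal sketch of that arithmetic is given below. It assumes that every unit completes one fused multiply-add (counted as 2 floating-point operations) per cycle, which the slides do not state, and it takes the 60-core end of the Xeon Phi range; the function name is illustrative.

```python
def peak_tflops(units, flops_per_unit_per_cycle, clock_ghz):
    """Peak performance = units x FLOPs per unit per cycle x clock."""
    return units * flops_per_unit_per_cycle * clock_ghz / 1000.0

FMA = 2  # assumption: one fused multiply-add per cycle counts as 2 FLOPs

k20x_single = peak_tflops(2688, FMA, 0.732)   # single-precision CUDA cores
k20x_double = peak_tflops(896, FMA, 0.732)    # double-precision CUDA cores
# Xeon Phi: 512-bit SIMD = 8 double-precision lanes per core.
xeon_phi_double = peak_tflops(60, 8 * FMA, 1.1)

print(f"K20X single: {k20x_single:.2f} TFLOPS")
print(f"K20X double: {k20x_double:.2f} TFLOPS")
print(f"Xeon Phi double: {xeon_phi_double:.2f} TFLOPS")
```

Under these assumptions the results match the slide figures: 3.94 and 1.31 TFLOPS for the K20X, and slightly more than 1 TFLOPS for a 60-core Xeon Phi card.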


Xeon Phi Microarchitecture

[Figure: 8 cores, each with an L2 cache, 8 tag directories (TD) and 4 GDDR memory controllers (GDDR MC) placed around a ring.]

All cores are connected through the ring interconnect. All L2 caches are kept coherent with directory-based management.

So, Xeon Phi is classified as CC (Cache Coherent) NUMA.

Of course, all cores are multithreaded, and provide 512-bit SIMD instructions.


[Figure: peak performance vs. Linpack performance in PFLOPS for K (Japan), Tianhe (天河, China), Jaguar (USA), Nebulae (China) and Tsubame (Japan), distinguishing homogeneous machines from machines using GPUs.]

The difference between peak and Linpack performance is large in machines with accelerators.

The accelerator type is energy efficient.


Arithmetic Intensity

• The number of floating-point operations per byte of data read from memory

From Hennessy & Patterson’s textbook


Roofline model

• Performance versus arithmetic intensity

Memory bound

Compute bound

From Hennessy & Patterson’s textbook
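
The roofline itself is one formula: attainable performance is the smaller of the peak floating-point performance and memory bandwidth × arithmetic intensity. A minimal sketch follows; the machine figures (1000 GFLOPS, 100 GB/s) are hypothetical values chosen for illustration, not data from the slides.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, arithmetic_intensity):
    """Roofline: min(compute roof, memory roof at this arithmetic intensity)."""
    return min(peak_gflops, bandwidth_gb_s * arithmetic_intensity)

PEAK_GFLOPS = 1000.0     # hypothetical accelerator
BANDWIDTH_GB_S = 100.0   # hypothetical memory bandwidth

# The ridge point is the intensity where the two roofs meet.
ridge = PEAK_GFLOPS / BANDWIDTH_GB_S

for intensity in (0.5, 2.0, ridge, 50.0):
    gflops = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GB_S, intensity)
    bound = "memory bound" if intensity < ridge else "compute bound"
    print(f"{intensity:5.1f} FLOP/byte -> {gflops:7.1f} GFLOPS ({bound})")
```

A kernel to the left of the ridge point is limited by memory bandwidth; to the right of it, by the arithmetic units.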


Which is a good accelerator?

[Figure: roofline plot. Vertical axis: floating-point performance (GFLOPS); horizontal axis: arithmetic intensity. Three curves are labeled as follows.]

Ideal accelerator

High performance for problems with high arithmetic intensity

Middle performance for problems over a wide range of arithmetic intensity


Example of the roofline model of GPUs and multi-cores

From Hennessy & Patterson’s textbook


Infiniband

• Point-to-point direct serial interconnection.

• Using 8b/10b code.

• Various types of topologies can be supported.

• Multicasting/atomic transactions are supported.

• The maximum throughput:

Width | SDR | DDR | QDR
1X | 2Gbit/s | 4Gbit/s | 8Gbit/s
4X | 8Gbit/s | 16Gbit/s | 32Gbit/s
12X | 24Gbit/s | 48Gbit/s | 96Gbit/s
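
The throughput table can be reproduced from the 8b/10b coding: for every 10 bits on the wire, 8 are data. The sketch below assumes per-lane signalling rates of 2.5, 5 and 10 Gbit/s for SDR, DDR and QDR; these rates are not given on the slide and are an assumption of this example, as are the names used.

```python
# Per-lane signalling rates in Gbit/s. These are not on the slide;
# they are the commonly quoted InfiniBand values, assumed here.
SIGNALLING_GBIT_S = {"SDR": 2.5, "DDR": 5.0, "QDR": 10.0}
LINK_WIDTHS = {"1X": 1, "4X": 4, "12X": 12}

def data_rate_gbit_s(generation, width):
    """Usable data rate: 8b/10b sends 10 line bits for every 8 data bits."""
    return SIGNALLING_GBIT_S[generation] * LINK_WIDTHS[width] * 8 / 10

for width in LINK_WIDTHS:
    row = [f"{data_rate_gbit_s(g, width):g}Gbit/s" for g in SIGNALLING_GBIT_S]
    print(width, *row)
```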


Fat Tree

Myrinet-Clos is actually a type of Fat-tree


Myrinet-Clos(1/2)

• 128nodes(Clos128)


Why Top1?

• Top1 is just a measure of matrix computation performance.
  • Linpack is a weak scaling benchmark with high arithmetic intensity.

• Top1 of Green500, Gordon Bell Prize, Top1 of each HPC Challenge program
  → All of these machines are valuable.
  TV and newspapers focus too much on the Top500.

• However, most Top1 computers also won the Gordon Bell Prize and the HPC Challenge Top1.
  • K and Sequoia

• The impact of Top1 is great!


Why Exa-scale supercomputers?

• The ratio of the serial part becomes small for large-scale problems.
  • Linpack is a scale-free benchmark.
  • Serial execution part 1 day + parallel execution part 10 years → 1 day + 1 day: a big impact.

• Are there any big programs which cannot be solved by K but can be solved by exa-scale supercomputers?
  • The number of such programs will decrease.

• Can we find new application areas?

• It is important that such large computing power is open to researchers.


Amdahl’s law

Serial part: 1%; parallel part: 99%

The parallel part is accelerated by parallel processing with p cores.

Execution time: 0.01 + 0.99/p, so the speedup is 1 / (0.01 + 0.99/p)

50 times with 100 cores, 91 times with 1000 cores

Even if the serial execution part is small, it limits the performance improvement.
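
The two speedup figures can be reproduced with a short sketch of Amdahl's law. The function name is illustrative; the 1% serial fraction is the one used on the slide.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup when only the parallel fraction is divided among the cores."""
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)

for cores in (100, 1000, 1_000_000):
    print(cores, round(amdahl_speedup(0.01, cores), 2))
```

However many cores are added, the speedup stays below 1 / 0.01 = 100.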


Strong Scaling vs. Weak Scaling

• Strong scaling:
  • The size of the problem (the size of the treated data) is fixed.
  • It is difficult to improve performance because of Amdahl’s law.

• Weak scaling:
  • The size of the problem is scaled with the size of the computer.
  • Linpack is a weak scaling benchmark.

• Discussion
  • For the evaluation of computer architecture, weak scaling is misleading!
  • Extremely large supercomputers are developed for extremely large problems.


Japanese Exa-scale computer

• A Japanese national project for an exa-scale computer has started.

• RIKEN is developing an exa-scale computer (post-peta computer), to be completed by 2020.

• The architecture is now being planned.

• For exa-scale, 70,000,000 cores are needed.
  • The budget limit is more severe than the technical limit.


Motivation and limitation

• Integrated computer technologies including architecture, hardware, software, dependability techniques, semiconductors and applications.

• Flagship and symbol.

• No computers other than supercomputers remain in Japan.

• Such computing power is open to peaceful research.

• It is a tool which makes impossible analyses possible.

• What needs infinite computing power?

• Is it a Japanese supercomputer if all cores and accelerators are made in the USA?

• Does a floating-point-centric supercomputer built to solve Linpack as fast as possible really fit the demand?

Look at the exa-scale computer project!


Should we develop floating-point-centric supercomputers?

• What do people want big supercomputers to do?
  • Finding new medicines: pattern matching.
  • Simulation of earthquakes, meteorology for analyzing global warming.
  • Big data
  • Artificial intelligence

• Most of them are not suitable for floating-point-centric supercomputers.

• “Supercomputers for big data” or “super-cloud computers” might be required.


What is WSC?

• WSC (Warehouse Scale Computing)
  • Google, Amazon, Yahoo, etc.
  • An extremely large cluster with more than 50,000 nodes
  • Consisting of economical homogeneous components.
  • Reliability is mainly maintained by software.
  • The power supply and cooling system are important design factors.

• Cloud computing is supported by such WSCs.


WSC is not a simple big data-center

• Homogeneous structure
  • In a data center, various types of clusters are used. Software and application packages also vary.
  • In a WSC, homogeneous tailored hardware is used. Software is custom made or free software.

• Cost
  • In a data center, the largest cost is often the people who maintain it.
  • In a WSC, the server hardware is the greatest cost.

• A WSC resembles a supercomputer rather than the clusters in a data center.


Figure 6.5 Hierarchy of switches in a WSC. (Based on Figure 1.2 of Barroso and Hölzle [2009].)


Figure 6.8 The Layer 3 network used to link arrays together and to the Internet [Greenberg et al. 2009]. Some WSCs use a separate border router to connect the Internet to the datacenter Layer 3 switches.


Figure 6.19 Google customizes a standard 1AAA container: 40 x 8 x 9.5 feet (12.2 x 2.4 x 2.9 meters). The servers are stacked up to 20 high in racks that form two long rows of 29 racks each, with one row on each side of the container. The cool aisle goes down the middle of the container, with the hot air return being on the outside. The hanging rack structure makes it easier to repair the cooling system without removing the servers. To allow people inside the container to repair components, it contains safety systems for fire detection and mist-based suppression, emergency egress and lighting, and emergency power shut-off. Containers also have many sensors: temperature, airflow pressure, air leak detection, and motion-sensing lighting. A video tour of the datacenter can be found at http://www.google.com/corporate/green/datacenters/summit.html. Microsoft, Yahoo!, and many others are now building modular datacenters based upon these ideas but they have stopped using ISO standard containers since the size is inconvenient.


Exercise

• A target program: serial computation part: 1; parallel computation part: N^3

• K: 700,000 cores

• Exa: 70,000,000 cores

• What N makes Exa 10 times faster than K?
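
One possible way to set up the exercise numerically is sketched below. It assumes that the parallel part is divided perfectly among the cores while the serial part stays at 1, so the run time on c cores is 1 + N^3 / c; this model and all names in the code are assumptions of the sketch, not the official solution.

```python
K_CORES = 700_000
EXA_CORES = 70_000_000

def run_time(n, cores):
    """Serial part 1 plus parallel part n^3 shared equally by all cores."""
    return 1.0 + n**3 / cores

def exa_speedup_over_k(n):
    return run_time(n, K_CORES) / run_time(n, EXA_CORES)

def find_n(target=10.0, lo=1.0, hi=1e6):
    """Bisection: the speedup grows monotonically with n (from 1 toward 100)."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if exa_speedup_over_k(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

n = find_n()
print(round(n, 2), round(exa_speedup_over_k(n), 3))
```

Under this model the condition reduces to N^3 = 7,000,000. For smaller problems the serial part dominates and the gain from the 100 times larger machine shrinks toward 1.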


Japanese supercomputers

• K supercomputer
  • Homogeneous scalar-type massively parallel computer.

• Earth Simulator
  • Vector computer
  • The difference between peak and Linpack performance is small.

• TIT’s Tsubame
  • A lot of GPUs are used. Energy-efficient supercomputer.

• Nagasaki University’s DEGIMA
  • A lot of GPUs are used. Hand-made supercomputer. High cost-performance. Gordon Bell Prize winner for cost-performance.

• GRAPE projects
  • Dedicated SIMD supercomputers for astronomy. Various versions won the Gordon Bell Prize.


SACSIS2012 Invited Speech


The Earth Simulator

[Figure: Nodes 0 to 639. Each node has 8 vector processors (0 to 7) sharing a 16GB memory. The nodes are connected by an interconnection network (16GB/s x 2).]

Peak performance: 40 TFLOPS


Pipeline processing

[Figure: a pipeline of stages 1 to 6]

Each stage receives its input and sends its result every clock cycle.

N stages = N times performance

Data dependencies cause RAW hazards and degrade the performance.

If a large array is processed, many stages can work efficiently.
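
Why a large array keeps the stages busy can be seen from a simple cycle count. The sketch below assumes an ideal pipeline with independent inputs (no RAW hazards) and one cycle per stage; the function names are illustrative.

```python
def pipeline_cycles(stages, items):
    """Cycles to push `items` independent inputs through `stages` stages."""
    # The first result appears after `stages` cycles, then one result per cycle.
    return stages + items - 1

def pipeline_speedup(stages, items):
    """Speedup over sending each item through all stages one at a time."""
    return (stages * items) / pipeline_cycles(stages, items)

for items in (1, 10, 1000):
    print(items, pipeline_cycles(6, items), round(pipeline_speedup(6, items), 2))
```

With 6 stages the speedup is 1 for a single item, 4 for 10 items and close to 6 for 1000 items, which is the "N stages = N times performance" limit.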


Vector computers

[Figure: vector registers supply a0, a1, a2, ... and b0, b1, b2, ... to a pipelined multiplier followed by an adder.]

X[i] = A[i] * B[i]

Y = Y + X[i]

The classic style of supercomputer since the Cray-1.

The Earth Simulator is also a vector supercomputer.


Vector computers

[Figure, three animation frames of the same multiplier/adder pipeline computing X[i] = A[i] * B[i], Y = Y + X[i]: first a0 and b0 enter the multiplier from the vector registers, then a1 and b1 follow; in the last frame a9, b9, a10 and b10 are in the multiplier while the products x0 and x1 reach the adder.]


Data transfer methods for vector computers

• Stride data access
  • Data items located at a fixed interval (stride) are accessed consecutively.
  • Used for sparse matrix/flexible matrix computation.

• Gather/Scatter
  • Gather: data distributed in the memory system are loaded into the vector registers as one continuous access.
  • Scatter: the results in the vector registers are distributed into the memory as one continuous access.

• Both functions need a large number of memory banks.
  • Powerful memory is essential for vector machines.
  • Cache is not so efficient, but still effective to reduce the start-up time.
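
The three access patterns can be illustrated with a software sketch. This is only a model of what the addresses look like: a Python list stands in for memory, the function names are hypothetical, and a real vector machine performs these accesses in hardware across its memory banks.

```python
def strided_load(memory, base, stride, length):
    """Load `length` elements starting at `base`, `stride` elements apart."""
    return [memory[base + i * stride] for i in range(length)]

def gather(memory, indices):
    """Load elements from arbitrary addresses into one vector."""
    return [memory[i] for i in indices]

def scatter(memory, indices, vector):
    """Store the elements of a vector to arbitrary addresses."""
    for i, value in zip(indices, vector):
        memory[i] = value

# A 3x4 matrix stored row by row: reading a column is a stride-4 access.
memory = list(range(12))
column1 = strided_load(memory, base=1, stride=4, length=3)
picked = gather(memory, [10, 2, 7])
scatter(memory, [0, 11], [-1, -2])
print(column1, picked, memory)
```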


The Earth Simulator

Simple NUMA


TIT’s Tsubame

Well balanced

supercomputer with GPUs


Nagasaki Univ’s DEGIMA


GRAPE-DR

Kei Hiraki “GRAPE-DR”

http://www.fpl.org (FPL2007)