many-core computingbal/college13/class1-intro-2k13_topublish.pdf · #10 in top500 list – june...

MANY-CORE COMPUTING
Ana Lucia Varbanescu, UvA
Original slides: Rob van Nieuwpoort, eScience Center
7-Oct-2013


Schedule

1. Introduction, performance metrics & analysis
2. Programming: basics (10-10-2013)
3. Programming: advanced (14-10-2013)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (17-10-2013)


What are many-cores?

¨  From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”

• In this course:
  - Multi-core/many-core CPUs
  - (GP)GPUs


What are many-cores

• How many is many?
  - Several tens of cores
• How are they different from multi-core CPUs?
  - Non-uniform memory access (NUMA)
  - Private memories
  - Network-on-chip
• Examples
  - Multi-core CPUs (48-core AMD Magny-Cours)
  - Graphics Processing Units (GPUs)
    * GPGPU = general purpose programming on GPUs
  - Server processors (Sun Niagara)
  - HPC processors
    * Cell B.E. (PlayStation 3)
    * Intel Xeon Phi (aka Intel MIC, formerly Larrabee)


Today’s Topics

• Why do many-cores exist?
• History
• Hardware introduction
• Performance model: Arithmetic Intensity and Roofline


Why many-cores?

• Moore’s law
• Many-cores in real-life

Moore’s Law

• Gordon Moore (co-founder of Intel) observed in 1965 that the number of components on a chip was doubling roughly every year; in 1975 he revised the rate to a doubling every two years. The often-quoted 18 months is a later popularization.

“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase....” Electronics Magazine 1965

Transistor Counts (Intel)

Impact of device shrinking

• Assume transistor size shrinks by a factor of x!
• Transistors per unit area: up by x*x
• Die size?
  - Assume the same
• Clock rate?
  - May go up by x because wires are shorter
• Raw computing power?
  - Programs could go x*x*x times faster
• In reality?
  - Power consumption, memory, parallelism impose stricter bounds!
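The idealized scaling argument on this slide can be sketched in a few lines of Python. The function name `ideal_scaling` and the sample factor of 2 are illustrative assumptions, not part of the slides, and the sketch deliberately ignores the power, memory, and parallelism limits mentioned in the last bullet.

```python
def ideal_scaling(x: float) -> dict:
    """Idealized gains when transistor size shrinks by a factor x (same die size)."""
    density = x * x                # transistors per unit area: up by x*x
    clock = x                      # clock rate may go up by x (shorter wires)
    raw_compute = density * clock  # upper bound on raw computing power: x*x*x
    return {"density": density, "clock": clock, "raw_compute": raw_compute}


if __name__ == "__main__":
    # Hypothetical shrink factor, chosen only for illustration.
    for name, gain in ideal_scaling(2.0).items():
        print(f"{name}: {gain:g}x")
```

These are upper bounds only; real gains stay below them for the reasons listed under "In reality?".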


Revolution in Processors

• Chip density is continuing to increase about 2x every 2 years
• BUT
  - Clock speed is not
  - ILP is not
  - Power is not

New ways to use transistors

• Parallelism on-chip: multi-core processors
• “Multicore revolution”
  - Every machine will soon be a parallel machine
  - What about performance?
• Can applications use this parallelism?
  - Do they have to be rewritten from scratch?
• Will all programmers have to be parallel programmers?
  - New programming models are needed
  - Try to hide complexity from most programmers

Top500 [1/4]

• State of the art in HPC (top500.org)
  - Trial for all new HPC architectures

[Figure callouts: “Accelerated!” (twice), “195 cores/node!”]

Top500 [2/4]

• Performance is dominated by multi-/many-cores
  - Multi-core CPUs
  - Accelerators

Top500 [3/4]

• Accelerators?
  - Relatively low numbers
  - High performance impact


China's Tianhe-1A

#10 in top500 list – June 2013 (#1 in Top500 in November 2010)

4.701 pflops peak

2.566 pflops max

14,336 Xeon X5670 processors

7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores


China's Tianhe-2

#1 in Top500 – June 2013

54.902 pflops peak

33.862 pflops max

16,000 nodes = 16,000 x (2 x Xeon IvyBridge + 3 x Xeon Phi)

= 3,120,000 cores (=> 195 cores/node)
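The core counts and the gap between peak and measured performance on the two Tianhe slides can be checked with a short Python sketch. The names `SYSTEMS` and `efficiency` are illustrative assumptions; the sketch also assumes that "max" is the measured Linpack result (Rmax in Top500 terms) and "peak" the theoretical maximum.

```python
# Figures copied from the Tianhe-1A and Tianhe-2 slides (in PFLOPS).
SYSTEMS = {
    "Tianhe-1A": {"peak": 4.701, "max": 2.566},
    "Tianhe-2": {"peak": 54.902, "max": 33.862},
}


def efficiency(peak_pflops: float, max_pflops: float) -> float:
    """Fraction of the theoretical peak that was actually reached."""
    return max_pflops / peak_pflops


tianhe1a_gpu_cores = 7168 * 448               # GPUs x cores per GPU
tianhe2_cores_per_node = 3_120_000 // 16_000  # total cores / number of nodes

if __name__ == "__main__":
    print("Tianhe-1A GPU cores:", tianhe1a_gpu_cores)
    print("Tianhe-2 cores per node:", tianhe2_cores_per_node)
    for name, s in SYSTEMS.items():
        print(f"{name}: {efficiency(s['peak'], s['max']):.0%} of peak")
```

Both machines reach well under their theoretical peak, which is the motivation for the performance analysis later in the lecture.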

Top500: prediction

GPUs vs. Top500

Why do we need many-cores?

[Figure: chart comparing NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) with Intel CPUs (3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad)]

Why do we need many-cores?

Power efficiency

Graphics in 1980

Graphics in 2000

Realism of modern GPUs

http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s

Courtesy techradar.com


Why do we need many-cores?

• Performance
  - Large scale parallelism
• Power Efficiency
  - Use transistors more efficiently
• Price (GPUs)
  - Game market is huge, bigger than Hollywood
  - Mass production, economy of scale
  - “spotty teenagers” pay for our HPC needs!
• Prestige
  - Reach ExaFLOP by 2019

History

Multi-core @ Intel

GPGPU History

• Current generation: NVIDIA Kepler
  - 7.1B transistors
  - More cores, more parallelism, more performance

Timeline (1995 to 2010), transistor counts (xtors):
• RIVA 128: 3M xtors
• GeForce® 256: 23M xtors
• GeForce 3: 60M xtors
• GeForce FX: 125M xtors
• GeForce 8800: 681M xtors
• “Fermi”: 3B xtors

GPGPU History

• Use Graphics primitives for HPC
  - Ikonas [England 1978]
  - Pixel Machine [Potmesil & Hoffert 1989]
  - Pixel-Planes 5 [Rhoades, et al. 1992]
• Programmable shaders, around 1998
  - DirectX / OpenGL
  - Map application onto graphics domain!
• GPGPU
  - Brook (2004), CUDA (2007), OpenCL (Dec 2008), ...


CUDA C/C++ Continuous Innovation

Timeline (2007 to 2010): July 07, Nov 07, April 08, Aug 08, July 09, Nov 09, Mar 10

CUDA Toolkit 1.0
• C Compiler
• C Extensions
• Single Precision
• BLAS
• FFT
• SDK 40 examples

CUDA Toolkit 1.1
• Win XP 64
• Atomics support
• Multi-GPU support

CUDA Toolkit 2.0
• Double Precision
• Compiler Optimizations
• Vista 32/64
• Mac OSX
• 3D Textures
• HW Interpolation

CUDA Toolkit 2.3
• DP FFT
• 16-32 Conversion intrinsics
• Performance enhancements

CUDA Toolkit 3.0
• C++ inheritance
• Fermi support
• Tools updates
• Driver / RT interop

Tools
• CUDA Visual Profiler 2.2
• cuda-gdb HW Debugger
• Parallel Nsight Beta

Cuda Tools

• Parallel Nsight Visual Studio
• Visual Profiler For Linux
• cuda-gdb For Linux

Another GPGPU history

GPUs @ AMD

Multi-core @ AMD

Multi-core @ AMD

GPU @ ARM

Many-core hardware

Choices …

• Core type(s):
  - Fat or slim?
  - Vectorized (SIMD)?
  - Homogeneous or heterogeneous?
• Number of cores:
  - Few or many?
• Memory
  - Shared-memory or distributed-memory?
• Parallelism
  - Instruction-level parallelism, threads, vectors, …


A taxonomy

• Based on “field-of-origin”:
  - General-purpose
    * Intel, AMD
  - Graphics Processing Units (GPUs)
    * NVIDIA, ATI
  - Gaming/Entertainment
    * Sony/Toshiba/IBM
  - Embedded systems
    * Philips/NXP, ARM
  - Servers
    * Oracle, IBM, Intel
  - High Performance Computing
    * Intel, IBM, …


General Purpose Processors

• Architecture
  - Few fat cores
  - Vectorization (SSE, AVX)
  - Homogeneous
  - Stand-alone
• Memory
  - Shared, multi-layered
  - Per-core cache and shared cache
• Programming
  - Processes (OS Scheduler)
  - Message passing
  - Multi-threading
  - Coarse-grained parallelism


Server-side

• General-purpose-like with more hardware threads
  - Lower performance per thread
  - High throughput
• Examples
  - Sun Niagara II
    * 8 cores x 8 threads
  - IBM POWER7
    * 8 cores x 4 threads
  - Intel SCC
    * 48 cores, all can run their own OS


Graphics Processing Units

• Architecture
  - Hundreds/thousands of slim cores
  - Homogeneous
  - Accelerator
• Memory
  - Very complex hierarchy
  - Both shared and per-core
• Programming
  - Off-load model
  - Many fine-grained symmetrical threads
  - Hardware scheduler


Cell/B.E.

• Architecture
  - Heterogeneous
  - 8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
• Memory
  - Per-core memory, network-on-chip
• Programming
  - User-controlled scheduling
  - 6 levels of parallelism, all under user control
  - Fine- and coarse-grain parallelism


Xeon Phi

• Architecture
  - ~60 homogeneous cores
    * 4 threads per core
  - x86 architecture
• Memory
  - Per-core caches (L1, L2)
    * Coherence
  - UMA [?]
• Programming
  - SPMD/MPMD
  - Fine- and coarse-grain parallelism (vector processing and threads, respectively)


Take home message

• Variety of platforms
  - Core types & counts
  - Memory architecture & sizes
  - Parallelism layers & types
  - Scheduling
• Open questions:
  - Why so many?
  - How many platforms do we need?
  - Can any application run on any platform?

Hardware performance metrics

Hardware Performance metrics

• Clock frequency [GHz] = absolute hardware speed
  - Memories, CPUs, interconnects
• Operational speed [GFLOPs]
  - Operations per cycle
• Memory bandwidth [GB/s]
  - Differs a lot between different memories on chip
• Power [Watt]
• Derived metrics
  - FLOP/Byte, FLOP/Watt


Theoretical peak performance

• Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
• Examples from DAS-4:
  - Intel Core i7 CPU
    * 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
  - NVIDIA GTX 580 GPU
    * 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
  - ATI HD 6970
    * 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
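The peak formula and the three DAS-4 examples can be reproduced with a small Python sketch. The function name `peak_gflops` and its parameter names are illustrative assumptions; for the GTX 580 the sketch treats the 16 SMs x 32 cores as the core count and uses a vector width of 1, matching how the slide writes that product.

```python
def peak_gflops(chips: int, cores: int, vector_width: int,
                flops_per_cycle: int, clock_ghz: float) -> float:
    """Theoretical peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency."""
    return chips * cores * vector_width * flops_per_cycle * clock_ghz


# Parameters copied from the slide.
EXAMPLES = {
    "Intel Core i7": (2, 4, 4, 2, 2.4),
    "NVIDIA GTX 580": (1, 16 * 32, 1, 2, 1.544),  # 16 SMs * 32 cores
    "ATI HD 6970": (1, 24 * 16, 4, 2, 0.880),     # 24 SIMD engines * 16 cores
}

if __name__ == "__main__":
    for name, params in EXAMPLES.items():
        print(f"{name}: {peak_gflops(*params):.0f} GFLOPs")
```

Keeping each factor as a separate argument makes it easy to see which hardware choice (more cores, wider vectors, higher clock) contributes to the peak.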


DRAM Memory bandwidth

• Throughput = memory bus frequency [GHz] * transfers per cycle * bus width [bits]
  - Memory clock != CPU clock!
  - The result is in Gbit/s; divide by 8 for GB/s
• Examples:
  - Intel Core i7 DDR3: 1.333 * 2 * 64 / 8 = 21 GB/s
  - NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 / 8 = 192 GB/s
  - ATI HD 6970 GDDR5: 1.375 * 4 * 256 / 8 = 176 GB/s
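The bandwidth calculation, including the division by 8 that turns bits into bytes, can be sketched as follows. The function and parameter names are illustrative assumptions; the input numbers are the ones on the slide.

```python
def dram_bandwidth_gbs(bus_clock_ghz: float, per_cycle: int,
                       bus_width_bits: int) -> float:
    """Peak DRAM bandwidth in GB/s: frequency * per-cycle factor * bus width / 8."""
    gbits_per_second = bus_clock_ghz * per_cycle * bus_width_bits
    return gbits_per_second / 8  # 8 bits per byte


# (memory bus frequency in GHz, per-cycle factor, bus width in bits)
MEMORIES = {
    "Intel Core i7 DDR3": (1.333, 2, 64),
    "NVIDIA GTX 580 GDDR5": (1.002, 4, 384),
    "ATI HD 6970 GDDR5": (1.375, 4, 256),
}

if __name__ == "__main__":
    for name, params in MEMORIES.items():
        print(f"{name}: {dram_bandwidth_gbs(*params):.0f} GB/s")
```

Note that the wide GDDR5 buses (384 and 256 bits) contribute more to the GPU advantage than the memory clock does.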


Memory bandwidths

• On-chip memory can be orders of magnitude faster
  - Registers, shared memory, caches, …
  - E.g., AMD HD 7970 L1 cache achieves 2 TB/s
• Other memories: depends on the interconnect
  - Intel’s technology: QPI (Quick Path Interconnect)
    * 25.6 GB/s
  - AMD’s technology: HT3 (Hyper Transport 3)
    * 19.2 GB/s
  - Accelerators: PCI-e 2.0
    * 8 GB/s


Power

• Chip manufacturers specify Thermal Design Power (TDP)
• We can measure dissipated power
  - Whole system
  - Typically (much) lower than TDP
• Power efficiency
  - FLOPS / Watt
• Examples (with theoretical peak and TDP)
  - Intel Core i7: 154 / 160 = 1.0 GFLOPs/W
  - NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W
  - ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
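Power efficiency is a single division of theoretical peak by TDP, which the following Python sketch makes explicit. The function name `gflops_per_watt` is an illustrative assumption; the peak and TDP values are taken from the slide.

```python
def gflops_per_watt(peak_gflops: float, tdp_watts: float) -> float:
    """Power efficiency computed from theoretical peak and TDP."""
    return peak_gflops / tdp_watts


# (theoretical peak in GFLOPs, TDP in Watt), as listed on the slide.
DEVICES = {
    "Intel Core i7": (154, 160),
    "NVIDIA GTX 580": (1581, 244),
    "ATI HD 6970": (2703, 250),
}

if __name__ == "__main__":
    for name, (peak, tdp) in DEVICES.items():
        print(f"{name}: {gflops_per_watt(peak, tdp):.1f} GFLOPs/W")
```

Because measured power is typically lower than TDP and measured performance lower than peak, these figures are only a rough basis for comparison.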


Summary

Platform          Cores  Threads/ALUs  GFLOPS  Bandwidth [GB/s]  FLOPs/Byte
Sun Niagara 2         8            64    11.2              76           0.1
IBM bg/p              4             8    13.6              13.6         1.0
IBM Power 7           8            32   265                68           3.9
Intel Core i7         4            16    85                25.6         3.3
AMD Barcelona         4             8    37                21.4         1.7
AMD Istanbul          6             6    62.4              25.6         2.4
AMD Magny-Cours      12            12   125                25.6         4.9
Cell/B.E.             8             8   205                25.6         8.0
NVIDIA GTX 580       16           512  1581               192           8.2
NVIDIA GTX 680        8          1536  3090               192          16.1
AMD HD 6970         384          1536  2703               176          15.4
AMD HD 7970          32          2048  3789               264          14.4
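The FLOPs/Byte column is derived from the two columns before it: peak GFLOPS divided by memory bandwidth in GB/s. A minimal Python sketch, using a few rows copied from the table (the names `ROWS` and `flops_per_byte` are illustrative assumptions):

```python
def flops_per_byte(gflops: float, bandwidth_gbs: float) -> float:
    """Ratio of peak compute to peak memory bandwidth."""
    return gflops / bandwidth_gbs


# (platform, peak GFLOPS, bandwidth in GB/s), copied from the summary table.
ROWS = [
    ("Sun Niagara 2", 11.2, 76),
    ("Intel Core i7", 85, 25.6),
    ("Cell/B.E.", 205, 25.6),
    ("NVIDIA GTX 580", 1581, 192),
    ("AMD HD 7970", 3789, 264),
]

if __name__ == "__main__":
    for name, gflops, bw in ROWS:
        print(f"{name}: {flops_per_byte(gflops, bw):.1f} FLOPs/Byte")
```

A higher ratio means an application must perform more operations per byte fetched from memory before the compute units, rather than the memory, become the limit.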


Absolute hardware performance

• Only achieved in the optimal conditions:
  - Processing units 100% used
  - All parallelism 100% exploited
  - All data transfers at maximum bandwidth
• In real life
  - No application is like this
  - Can we reason about “real” performance?

Performance analysis

Operational Intensity and the Roofline model

An Example

• I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve the performance.
• Metrics I will judge candidates by:
  - How fast can the application be?
    * Execution time => what the users are interested in!
  - How many times faster can you make it?
    * Speed-up => use the best possible sequential performance as the baseline
  - How do I know I should choose you?
    * Achievable performance => reason about how far the measured performance is from what is achievable
    * Depends on application, hardware, and dataset!
  - Is this architecture a good one to use?
    * Utilization => did I really need this hardware?
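Two of these metrics reduce to simple ratios, sketched below in Python. Only the 2.5-hour baseline comes from the slide; the function names, the 3-minute parallel run time, and the 400 GFLOPs achieved on a 1581 GFLOPs device are hypothetical values chosen for illustration.

```python
def speedup(t_sequential_s: float, t_parallel_s: float) -> float:
    """How many times faster the parallel version is.

    For a fair number, t_sequential_s should be the best possible
    sequential implementation, not a slow legacy run.
    """
    return t_sequential_s / t_parallel_s


def utilization(achieved_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical hardware peak that is actually used."""
    return achieved_gflops / peak_gflops


if __name__ == "__main__":
    baseline_s = 2.5 * 3600  # 2.5 hours on the old laptop (from the slide)
    parallel_s = 3 * 60      # hypothetical many-core run time
    print(f"speed-up: {speedup(baseline_s, parallel_s):.0f}x")
    # Hypothetical achieved performance on a 1581 GFLOPs device.
    print(f"utilization: {utilization(400, 1581):.0%}")
```

A large speed-up against a slow baseline can coexist with low utilization, which is why both metrics are listed.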

Questions? Comments?

• For questions, comments, suggestions, … :

[email protected]