TRANSCRIPT
MANY-CORE COMPUTING Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, eScience Center 7-Oct-2013
Schedule
1. Introduction, performance metrics & analysis
2. Programming: basics (10-10-2013)
3. Programming: advanced (14-10-2013)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (17-10-2013)
2
What are many-cores?
¨ From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
¨ In this course:
¤ Multi-core/many-core CPUs ¤ (GP)GPUs
3
What are many-cores?
¨ How many is many? ¤ Several tens of cores
¨ How are they different from multi-core CPUs? ¤ Non-uniform memory access (NUMA) ¤ Private memories ¤ Network-on-chip
¨ Examples ¤ Multi-core CPUs (48-core AMD Magny-Cours) ¤ Graphics Processing Units (GPUs)
n GPGPU = general purpose programming on GPUs ¤ Server processors (Sun Niagara) ¤ HPC processors
n Cell B.E. (PlayStation 3) n Intel Xeon Phi (aka Intel MIC, formerly Larrabee)
4
Today’s Topics
¨ Why do many-cores exist? ¨ History ¨ Hardware introduction ¨ Performance model:
Arithmetic Intensity and Roofline
5
Moore’s law Many-cores in real-life
Why many-cores? 6
Moore’s Law
¤ Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor count of semiconductor chips would double roughly every year; the estimate was later revised to about every two years.
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase....” Electronics Magazine 1965
7
Transistor Counts (Intel) 8
Impact of device shrinking
¨ Assume transistor size shrinks by a factor of x
¨ Transistors per unit area: up by x*x
¨ Die size? ¤ Assume it stays the same
¨ Clock rate? ¤ May go up by x, because wires are shorter
¨ Raw computing power? ¤ Programs could run x*x*x times faster
¨ In reality? ¤ Power consumption, memory, and parallelism impose stricter bounds!
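The idealized scaling argument above can be written out as a toy calculation. A sketch in Python (the function name is mine; the x, x*x, and x*x*x factors are the slide's idealized model, not real-chip behavior):

```python
# Idealized device-shrink model from the slide (a toy calculation, not
# a prediction about real chips, which hit power and memory limits).
def ideal_scaling(x):
    """x = linear shrink factor of a transistor between generations."""
    transistors = x * x                # per unit area: quadratic growth
    clock = x                          # shorter wires: clock up by x
    raw_compute = transistors * clock  # ideal combined gain: x^3
    return transistors, clock, raw_compute

t, c, r = ideal_scaling(1.4)
print(f"{t:.2f}x transistors, {c:.2f}x clock, {r:.2f}x raw compute")
```

For x = 1.4 (a typical per-generation shrink), this gives roughly 2x the transistors and 2.7x the ideal raw compute, which is the gap the "in reality" bullet says we never actually see.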
9
Revolution in Processors
¨ Chip density is continuing to increase about 2x every 2 years
¨ BUT ¤ Clock speed is not ¤ ILP is not ¤ Power is not
10
New ways to use transistors 11
¨ Parallelism on-chip: multi-core processors
¨ “Multicore revolution” ¤ Every machine will soon be a parallel machine ¤ What about performance?
¨ Can applications use this parallelism? ¤ Do they have to be rewritten from scratch?
¨ Will all programmers have to be parallel programmers? ¤ New programming models are needed ¤ Try to hide complexity from most programmers
Top500 [1/4] 12
¨ State of the art in HPC (top500.org) ¤ Trial for all new HPC architectures
[Top500 chart: several accelerated systems highlighted ("Accelerated!"); one at 195 cores/node]
Top500 [2/4] 13
¨ Performance is dominated by multi-/many-cores ¤ Multi-core CPUs ¤ Accelerators
Top500 [3/4] 14
¨ Accelerators ? ¤ Relatively low numbers ¤ High performance impact
China's Tianhe-1A
#10 in top500 list – June 2013 (#1 in Top500 in November 2010)
4.701 PFLOPS peak
2.566 PFLOPS max
14,336 Xeon X5670 processors
7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores
15
China's Tianhe-2
#1 in Top500 – June 2013
54.902 PFLOPS peak
33.862 PFLOPS max
16,000 nodes = 16,000 x (2 x Xeon IvyBridge + 3 x Xeon Phi)
= 3,120,000 cores (=> 195 cores/node)
16
Top500: prediction 17
GPUs vs. Top500 18
[chart: peak performance over time, NVIDIA GPU generations (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual-Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad)]
Why do we need many-cores? 19
Why do we need many-cores? 20
Power efficiency 21
Graphics in 1980 22
Graphics in 2000 23
Realism of modern GPUs
http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s
Courtesy techradar.com
24
Why do we need many-cores?
¨ Performance ¤ Large scale parallelism
¨ Power Efficiency ¤ Use transistors more efficiently
¨ Price (GPUs) ¤ Game market is huge, bigger than Hollywood ¤ Mass production, economy of scale ¤ “spotty teenagers” pay for our HPC needs!
¨ Prestige ¤ Reach ExaFLOP by 2019
25
History 26
Multi-core @ Intel 27
GPGPU History 28
¨ Current generation: NVIDIA Kepler ¤ 7.1 billion transistors ¤ More cores, more parallelism, more performance
[timeline 1995–2010: RIVA 128 (3M xtors) → GeForce 256 (23M xtors) → GeForce 3 (60M xtors) → GeForce FX (125M xtors) → GeForce 8800 (681M xtors) → “Fermi” (3B xtors)]
GPGPU History 29
¨ Use Graphics primitives for HPC ¤ Ikonas [England 1978] ¤ Pixel Machine [Potmesil & Hoffert 1989] ¤ Pixel-Planes 5 [Rhoades, et al. 1992]
¨ Programmable shaders, around 1998 ¤ DirectX / OpenGL ¤ Map application onto graphics domain!
¨ GPGPU ¤ Brook (2004), Cuda (2007), OpenCL (Dec 2008), ...
CUDA C/C++ Continuous Innovation
¨ CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
¨ CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
¨ CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
¨ CUDA Visual Profiler 2.2; cuda-gdb HW debugger
¨ CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
¨ CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi support, tools updates, driver/RT interop; Parallel Nsight Beta
30
Cuda Tools 31
¨ Parallel Nsight (Visual Studio)
¨ Visual Profiler (for Linux)
¨ cuda-gdb (for Linux)
Another GPGPU history 32
GPUs @ AMD 33
Multi-core @ AMD 34
Multi-core @ AMD 35
GPU @ ARM 36
Many-core hardware 37
Choices …
¨ Core type(s): ¤ Fat or slim ? ¤ Vectorized (SIMD) ? ¤ Homogeneous or heterogeneous?
¨ Number of cores: ¤ Few or many ?
¨ Memory ¤ Shared-memory or distributed-memory?
¨ Parallelism ¤ Instruction-level parallelism, threads, vectors, …
38
A taxonomy
¨ Based on “field-of-origin”:
¤ General-purpose: Intel, AMD
¤ Graphics Processing Units (GPUs): NVIDIA, ATI
¤ Gaming/entertainment: Sony/Toshiba/IBM
¤ Embedded systems: Philips/NXP, ARM
¤ Servers: Oracle, IBM, Intel
¤ High Performance Computing: Intel, IBM, …
39
General Purpose Processors
¨ Architecture ¤ Few fat cores ¤ Vectorization (SSE, AVX) ¤ Homogeneous ¤ Stand-alone
¨ Memory ¤ Shared, multi-layered ¤ Per-core cache and shared cache
¨ Programming ¤ Processes (OS Scheduler) ¤ Message passing ¤ Multi-threading ¤ Coarse-grained parallelism
40
Server-side
¨ General-purpose-like, with more hardware threads ¤ Lower performance per thread ¤ Higher throughput
¨ Examples ¤ Sun Niagara II
n 8 cores x 8 threads
¤ IBM POWER7 n 8 cores x 4 threads
¤ Intel SCC n 48 cores, all can run their own OS
41
Graphics Processing Units
¨ Architecture ¤ Hundreds/thousands of slim cores ¤ Homogeneous ¤ Accelerator
¨ Memory ¤ Very complex hierarchy ¤ Both shared and per-core
¨ Programming ¤ Off-load model ¤ Many fine-grained symmetrical threads ¤ Hardware scheduler
42
Cell/B.E.
¨ Architecture ¤ Heterogeneous ¤ 8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
¨ Memory ¤ Per-core memory, network-on-chip
¨ Programming ¤ User-controlled scheduling ¤ 6 levels of parallelism, all under user control ¤ Fine- and coarse-grain parallelism
43
Xeon Phi
¨ Architecture ¤ ~60 homogeneous cores
n 4 threads per core ¤ x86 architecture
¨ Memory ¤ Per-core caches (L1,L2)
n Coherence ¤ UMA [?]
¨ Programming ¤ SPMD/MPMD ¤ Fine- and coarse-grain parallelism (vector processing and threads, respectively)
44
Take home message
¨ Variety of platforms ¤ Core types & counts ¤ Memory architecture & sizes ¤ Parallelism layers & types ¤ Scheduling
¨ Open questions: ¤ Why so many? ¤ How many platforms do we need? ¤ Can any application run on any platform?
45
Hardware performance metrics 46
Hardware Performance metrics
¨ Clock frequency [GHz] = absolute hardware speed ¤ Memories, CPUs, interconnects
¨ Operational speed [GFLOPs] ¤ Operations per cycle
¨ Memory bandwidth [GB/s] ¤ differs a lot between different memories on chip
¨ Power [Watt]
¨ Derived metrics ¤ FLOP/Byte, FLOP/Watt
47
Theoretical peak performance
¨ Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
¨ Examples from DAS-4: ¤ Intel Core i7 CPU
n 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
¤ NVIDIA GTX 580 GPU n 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
¤ ATI HD 6970 n 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
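The peak formula is easy to check mechanically. A small Python sketch (the function name is mine; the hardware numbers are the DAS-4 examples from the slide):

```python
# peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency (GHz)
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    return chips * cores * vector_width * flops_per_cycle * clock_ghz

print(peak_gflops(2, 4, 4, 2, 2.4))          # Intel Core i7: ~154 GFLOPS
print(peak_gflops(1, 16 * 32, 1, 2, 1.544))  # NVIDIA GTX 580: ~1581 GFLOPS
print(peak_gflops(1, 24 * 16, 4, 2, 0.880))  # ATI HD 6970: ~2703 GFLOPS
```

Note how the GPUs reach their peaks through core count (512 and 384 "cores") rather than clock frequency.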
48
DRAM Memory bandwidth
¨ Throughput = memory bus frequency * bits per cycle * bus width ¤ Memory clock != CPU clock! ¤ The result is in Gbit/s; divide by 8 for GB/s
¨ Examples: ¤ Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s ¤ NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s ¤ ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
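The same check works for the bandwidth formula; a Python sketch (function name is mine; clock, transfers-per-cycle, and bus-width numbers are the slide's):

```python
# throughput = bus frequency (GHz) * transfers per cycle * bus width (bits),
# giving Gbit/s; divide by 8 for GB/s.
def bandwidth_gbs(bus_ghz, transfers_per_cycle, bus_width_bits):
    return bus_ghz * transfers_per_cycle * bus_width_bits / 8

print(bandwidth_gbs(1.333, 2, 64))    # Intel Core i7 DDR3: ~21 GB/s
print(bandwidth_gbs(1.002, 4, 384))   # NVIDIA GTX 580 GDDR5: ~192 GB/s
print(bandwidth_gbs(1.375, 4, 256))   # ATI HD 6970 GDDR5: 176.0 GB/s
```

The factor 2 vs. 4 reflects DDR (two transfers per cycle) vs. GDDR5 (effectively four).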
49
Memory bandwidths
¨ On-chip memory can be orders of magnitude faster ¤ Registers, shared memory, caches, … ¤ E.g., AMD HD 7970 L1 cache achieves 2 TB/s
¨ Other memories: depends on the interconnect ¤ Intel’s technology: QPI (Quick Path Interconnect)
n 25.6 GB/s
¤ AMD’s technology: HT3 (Hyper Transport 3) n 19.2 GB/s
¤ Accelerators: PCI-e 2.0 n 8 GB/s
50
Power
¨ Chip manufacturers specify a Thermal Design Power (TDP) ¨ We can measure dissipated power
¤ Whole system ¤ Typically (much) lower than TDP
¨ Power efficiency ¤ FLOPS / Watt
¨ Examples (with theoretical peak and TDP) ¤ Intel Core i7: 154 / 160 = 1.0 GFLOPs/W ¤ NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W ¤ ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
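The derived metric can be sketched the same way (function name is mine; peak and TDP figures are the slide's, and the result is an upper bound since real dissipation is typically below TDP and real GFLOPS below peak):

```python
# Power efficiency = theoretical peak GFLOPS / TDP in watts.
def gflops_per_watt(peak_gflops, tdp_watts):
    return peak_gflops / tdp_watts

print(gflops_per_watt(154, 160))    # Intel Core i7: ~1.0 GFLOPs/W
print(gflops_per_watt(2703, 250))   # ATI HD 6970: ~10.8 GFLOPs/W
```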
51
Summary
Platform          Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2         8            64    11.2                76         0.1
IBM BG/P              4             8    13.6              13.6         1.0
IBM POWER7            8            32     265                68         3.9
Intel Core i7         4            16      85              25.6         3.3
AMD Barcelona         4             8      37              21.4         1.7
AMD Istanbul          6             6    62.4              25.6         2.4
AMD Magny-Cours      12            12     125              25.6         4.9
Cell/B.E.             8             8     205              25.6         8.0
NVIDIA GTX 580       16           512    1581               192         8.2
NVIDIA GTX 680        8          1536    3090               192        16.1
AMD HD 6970         384          1536    2703               176        15.4
AMD HD 7970          32          2048    3789               264        14.4
Absolute hardware performance
¨ Only achieved in the optimal conditions: ¤ Processing units 100% used ¤ All parallelism 100% exploited ¤ All data transfers at maximum bandwidth
¨ In real life ¤ No application is like this ¤ Can we reason about “real” performance?
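The roofline model introduced next gives one way to reason about "real" performance: attainable performance is capped by either the compute peak or by operational intensity times memory bandwidth, whichever is lower. A minimal sketch (function name is mine; the 2 FLOP/Byte kernel is a hypothetical example):

```python
# Roofline bound on attainable performance.
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gbs):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# GTX 580 (1581 GFLOPS peak, 192 GB/s): a hypothetical kernel doing
# 2 FLOPs per byte moved is memory-bound, far below the compute peak.
print(attainable_gflops(2.0, 1581, 192))   # 384.0 GFLOPS
```

Only kernels whose operational intensity exceeds the machine's FLOPs/Byte ratio (8.2 for the GTX 580 in the summary table) can hope to reach the compute peak.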
53
Operational Intensity and the Roofline model
Performance analysis 54
An Example 55
¨ I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve the performance.
¨ Metrics I will judge candidates by: ¤ How fast can the application be:
n Execution time => what the users are interested in! ¤ How many times faster can you make it:
n Speed-up => use the best possible sequential performance ¤ How do I know I should choose you?
n Achievable performance => reason how far the performance is n Depends on application, hardware, and dataset!
¤ Is this architecture a good one to use? n Utilization => did I really need this hardware?
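Two of these metrics reduce to simple ratios; a toy calculation in Python (all concrete numbers below are hypothetical illustrations, not results):

```python
# Speed-up and utilization as defined in the example above.
def speedup(best_sequential_s, parallel_s):
    """How many times faster than the best sequential run."""
    return best_sequential_s / parallel_s

def utilization(achieved_gflops, peak_gflops):
    """Fraction of the hardware's theoretical peak actually used."""
    return achieved_gflops / peak_gflops

print(speedup(2.5 * 3600, 300))   # 2.5h sequential vs 5 min parallel: 30.0x
print(utilization(200, 1581))     # e.g. 200 of 1581 GFLOPS: ~13%
```

Note that speed-up must be measured against the best sequential version, not a naive one, and that a high speed-up with low utilization suggests the hardware was oversized for the job.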