TRANSCRIPT
MANY-CORE COMPUTING Ana Lucia Varbanescu, UvA Original slides: Rob van Nieuwpoort, eScience Center 7-Oct-2013
Schedule
1. Introduction, performance metrics & analysis
2. Programming: basics (10-10-2013)
3. Programming: advanced (14-10-2013)
4. Case study: LOFAR telescope with many-cores, by Rob van Nieuwpoort (17-10-2013)
2
What are many-cores?
¨ From Wikipedia: “A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient — largely because of issues with congestion in supplying instructions and data to the many processors.”
¨ In this course:
¤ Multi-core/many-core CPUs ¤ (GP)GPUs
3
What are many-cores?
¨ How many is many? ¤ Several tens of cores
¨ How are they different from multi-core CPUs? ¤ Non-uniform memory access (NUMA) ¤ Private memories ¤ Network-on-chip
¨ Examples ¤ Multi-core CPUs (48-core AMD Magny-Cours) ¤ Graphics Processing Units (GPUs)
n GPGPU = general purpose programming on GPUs ¤ Server processors (Sun Niagara) ¤ HPC processors
n Cell B.E. (PlayStation 3) n Intel Xeon Phi (aka Intel MIC, formerly Larrabee)
4
Today’s Topics
¨ Why do many-cores exist? ¨ History ¨ Hardware introduction ¨ Performance model:
Arithmetic Intensity and Roofline
5
Moore’s law Many-cores in real-life
Why many-cores? 6
Moore’s Law
¤ Gordon Moore (co-founder of Intel) predicted in 1965 that the transistor count of semiconductor chips would double roughly every year; the estimate was later revised to about every two years.
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year ... Certainly over the short term this rate can be expected to continue, if not to increase....” Electronics Magazine 1965
7
Transistor Counts (Intel) 8
Impact of device shrinking
¨ Assume transistor size shrinks by a factor of x
¨ Transistors per unit area: up by x*x
¨ Die size? ¤ Assume it stays the same
¨ Clock rate? ¤ May go up by x, because wires are shorter
¨ Raw computing power? ¤ Programs could run x*x*x times faster
¨ In reality? ¤ Power consumption, memory, and parallelism impose stricter bounds!
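The idealized scaling argument above can be written out as a toy calculation. A sketch in Python (the function name is mine; the x, x*x, and x*x*x factors are the slide's idealized model, not real-chip behavior):

```python
# Idealized device-shrink model from the slide (a toy calculation, not
# a prediction about real chips, which hit power and memory limits).
def ideal_scaling(x):
    """x = linear shrink factor of a transistor between generations."""
    transistors = x * x                # per unit area: quadratic growth
    clock = x                          # shorter wires: clock up by x
    raw_compute = transistors * clock  # ideal combined gain: x^3
    return transistors, clock, raw_compute

t, c, r = ideal_scaling(1.4)
print(f"{t:.2f}x transistors, {c:.2f}x clock, {r:.2f}x raw compute")
```

For x = 1.4 (a typical per-generation shrink), this gives roughly 2x the transistors and 2.7x the ideal raw compute, which is the gap the "in reality" bullet says we never actually see.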
9
Revolution in Processors
¨ Chip density is continuing to increase about 2x every 2 years
¨ BUT ¤ Clock speed is not ¤ ILP is not ¤ Power is not
10
New ways to use transistors 11
¨ Parallelism on-chip: multi-core processors
¨ “Multicore revolution” ¤ Every machine will soon be a parallel machine ¤ What about performance?
¨ Can applications use this parallelism? ¤ Do they have to be rewritten from scratch?
¨ Will all programmers have to be parallel programmers? ¤ New programming models are needed ¤ Try to hide complexity from most programmers
Top500 [1/4] 12
¨ State of the art in HPC (top500.org) ¤ Trial for all new HPC architectures
[Top500 chart: several accelerated systems highlighted ("Accelerated!"); one at 195 cores/node]
Top500 [2/4] 13
¨ Performance is dominated by multi-/many-cores ¤ Multi-core CPUs ¤ Accelerators
Top500 [3/4] 14
¨ Accelerators ? ¤ Relatively low numbers ¤ High performance impact
China's Tianhe-1A
#10 in top500 list – June 2013 (#1 in Top500 in November 2010)
4.701 PFLOPS peak
2.566 PFLOPS max
14,336 Xeon X5670 processors
7168 Nvidia Tesla M2050 GPUs x 448 cores = 3,211,264 cores
15
China's Tianhe-2
#1 in Top500 – June 2013
54.902 PFLOPS peak
33.862 PFLOPS max
16,000 nodes = 16,000 x (2 x Xeon IvyBridge + 3 x Xeon Phi)
= 3,120,000 cores (=> 195 cores/node)
16
Top500: prediction 17
GPUs vs. Top500 18
[chart: peak performance over time, NVIDIA GPU generations (NV30, NV40, G70, G80, GT200, T12) vs. Intel CPUs (3GHz Dual-Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad)]
Why do we need many-cores? 19
Why do we need many-cores? 20
Power efficiency 21
Graphics in 1980 22
Graphics in 2000 23
Realism of modern GPUs
http://www.youtube.com/watch?v=bJDeipvpjGQ&feature=player_embedded#t=49s
Courtesy techradar.com
24
Why do we need many-cores?
¨ Performance ¤ Large scale parallelism
¨ Power Efficiency ¤ Use transistors more efficiently
¨ Price (GPUs) ¤ Game market is huge, bigger than Hollywood ¤ Mass production, economy of scale ¤ “spotty teenagers” pay for our HPC needs!
¨ Prestige ¤ Reach ExaFLOP by 2019
25
History 26
Multi-core @ Intel 27
GPGPU History 28
¨ Current generation: NVIDIA Kepler ¤ 7.1 billion transistors ¤ More cores, more parallelism, more performance
[timeline 1995–2010: RIVA 128 (3M xtors) → GeForce 256 (23M xtors) → GeForce 3 (60M xtors) → GeForce FX (125M xtors) → GeForce 8800 (681M xtors) → “Fermi” (3B xtors)]
GPGPU History 29
¨ Use Graphics primitives for HPC ¤ Ikonas [England 1978] ¤ Pixel Machine [Potmesil & Hoffert 1989] ¤ Pixel-Planes 5 [Rhoades, et al. 1992]
¨ Programmable shaders, around 1998 ¤ DirectX / OpenGL ¤ Map application onto graphics domain!
¨ GPGPU ¤ Brook (2004), Cuda (2007), OpenCL (Dec 2008), ...
CUDA C/C++ Continuous Innovation
¨ CUDA Toolkit 1.0 (July 07): C compiler, C extensions, single precision, BLAS, FFT, SDK with 40 examples
¨ CUDA Toolkit 1.1 (Nov 07): Win XP 64, atomics support, multi-GPU support
¨ CUDA Toolkit 2.0 (Aug 08): double precision, compiler optimizations, Vista 32/64, Mac OS X, 3D textures, HW interpolation
¨ CUDA Visual Profiler 2.2; cuda-gdb HW debugger
¨ CUDA Toolkit 2.3 (July 09): DP FFT, 16-32 conversion intrinsics, performance enhancements
¨ CUDA Toolkit 3.0 (Mar 10): C++ inheritance, Fermi support, tools updates, driver/RT interop; Parallel Nsight Beta
30
Cuda Tools 31
¨ Parallel Nsight (Visual Studio)
¨ Visual Profiler (for Linux)
¨ cuda-gdb (for Linux)
Another GPGPU history 32
GPUs @ AMD 33
Multi-core @ AMD 34
Multi-core @ AMD 35
GPU @ ARM 36
Many-core hardware 37
Choices …
¨ Core type(s): ¤ Fat or slim ? ¤ Vectorized (SIMD) ? ¤ Homogeneous or heterogeneous?
¨ Number of cores: ¤ Few or many ?
¨ Memory ¤ Shared-memory or distributed-memory?
¨ Parallelism ¤ Instruction-level parallelism, threads, vectors, …
38
A taxonomy
¨ Based on “field-of-origin”:
¤ General-purpose: Intel, AMD
¤ Graphics Processing Units (GPUs): NVIDIA, ATI
¤ Gaming/entertainment: Sony/Toshiba/IBM
¤ Embedded systems: Philips/NXP, ARM
¤ Servers: Oracle, IBM, Intel
¤ High Performance Computing: Intel, IBM, …
39
General Purpose Processors
¨ Architecture ¤ Few fat cores ¤ Vectorization (SSE, AVX) ¤ Homogeneous ¤ Stand-alone
¨ Memory ¤ Shared, multi-layered ¤ Per-core cache and shared cache
¨ Programming ¤ Processes (OS Scheduler) ¤ Message passing ¤ Multi-threading ¤ Coarse-grained parallelism
40
Server-side
¨ General-purpose-like, with more hardware threads ¤ Lower performance per thread ¤ Higher throughput
¨ Examples ¤ Sun Niagara II
n 8 cores x 8 threads
¤ IBM POWER7 n 8 cores x 4 threads
¤ Intel SCC n 48 cores, all can run their own OS
41
Graphics Processing Units
¨ Architecture ¤ Hundreds/thousands of slim cores ¤ Homogeneous ¤ Accelerator
¨ Memory ¤ Very complex hierarchy ¤ Both shared and per-core
¨ Programming ¤ Off-load model ¤ Many fine-grained symmetrical threads ¤ Hardware scheduler
42
Cell/B.E.
¨ Architecture ¤ Heterogeneous ¤ 8 vector-processors (SPEs) + 1 trimmed PowerPC (PPE)
¨ Memory ¤ Per-core memory, network-on-chip
¨ Programming ¤ User-controlled scheduling ¤ 6 levels of parallelism, all under user control ¤ Fine- and coarse-grain parallelism
43
Xeon Phi
¨ Architecture ¤ ~60 homogeneous cores
n 4 threads per core ¤ x86 architecture
¨ Memory ¤ Per-core caches (L1,L2)
n Coherence ¤ UMA [?]
¨ Programming ¤ SPMD/MPMD ¤ Fine- and coarse-grain parallelism (vector processing and threads, respectively)
44
Take home message
¨ Variety of platforms ¤ Core types & counts ¤ Memory architecture & sizes ¤ Parallelism layers & types ¤ Scheduling
¨ Open questions: ¤ Why so many? ¤ How many platforms do we need? ¤ Can any application run on any platform?
45
Hardware performance metrics 46
Hardware Performance metrics
¨ Clock frequency [GHz] = absolute hardware speed ¤ Memories, CPUs, interconnects
¨ Operational speed [GFLOPs] ¤ Operations per cycle
¨ Memory bandwidth [GB/s] ¤ differs a lot between different memories on chip
¨ Power [Watt]
¨ Derived metrics ¤ FLOP/Byte, FLOP/Watt
47
Theoretical peak performance
¨ Peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency
¨ Examples from DAS-4: ¤ Intel Core i7 CPU
n 2 chips * 4 cores * 4-way vectors * 2 FLOPs/cycle * 2.4 GHz = 154 GFLOPs
¤ NVIDIA GTX 580 GPU n 1 chip * 16 SMs * 32 cores * 2 FLOPs/cycle * 1.544 GHz = 1581 GFLOPs
¤ ATI HD 6970 n 1 chip * 24 SIMD engines * 16 cores * 4-way vectors * 2 FLOPs/cycle * 0.880 GHz = 2703 GFLOPs
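The peak formula is easy to check mechanically. A small Python sketch (the function name is mine; the hardware numbers are the DAS-4 examples from the slide):

```python
# peak = chips * cores * vectorWidth * FLOPs/cycle * clockFrequency (GHz)
def peak_gflops(chips, cores, vector_width, flops_per_cycle, clock_ghz):
    return chips * cores * vector_width * flops_per_cycle * clock_ghz

print(peak_gflops(2, 4, 4, 2, 2.4))          # Intel Core i7: ~154 GFLOPS
print(peak_gflops(1, 16 * 32, 1, 2, 1.544))  # NVIDIA GTX 580: ~1581 GFLOPS
print(peak_gflops(1, 24 * 16, 4, 2, 0.880))  # ATI HD 6970: ~2703 GFLOPS
```

Note how the GPUs reach their peaks through core count (512 and 384 "cores") rather than clock frequency.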
48
DRAM Memory bandwidth
¨ Throughput = memory bus frequency * bits per cycle * bus width ¤ Memory clock != CPU clock! ¤ The result is in Gbit/s; divide by 8 for GB/s
¨ Examples: ¤ Intel Core i7 DDR3: 1.333 * 2 * 64 = 21 GB/s ¤ NVIDIA GTX 580 GDDR5: 1.002 * 4 * 384 = 192 GB/s ¤ ATI HD 6970 GDDR5: 1.375 * 4 * 256 = 176 GB/s
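The same check works for the bandwidth formula; a Python sketch (function name is mine; clock, transfers-per-cycle, and bus-width numbers are the slide's):

```python
# throughput = bus frequency (GHz) * transfers per cycle * bus width (bits),
# giving Gbit/s; divide by 8 for GB/s.
def bandwidth_gbs(bus_ghz, transfers_per_cycle, bus_width_bits):
    return bus_ghz * transfers_per_cycle * bus_width_bits / 8

print(bandwidth_gbs(1.333, 2, 64))    # Intel Core i7 DDR3: ~21 GB/s
print(bandwidth_gbs(1.002, 4, 384))   # NVIDIA GTX 580 GDDR5: ~192 GB/s
print(bandwidth_gbs(1.375, 4, 256))   # ATI HD 6970 GDDR5: 176.0 GB/s
```

The factor 2 vs. 4 reflects DDR (two transfers per cycle) vs. GDDR5 (effectively four).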
49
Memory bandwidths
¨ On-chip memory can be orders of magnitude faster ¤ Registers, shared memory, caches, … ¤ E.g., AMD HD 7970 L1 cache achieves 2 TB/s
¨ Other memories: depends on the interconnect ¤ Intel’s technology: QPI (Quick Path Interconnect)
n 25.6 GB/s
¤ AMD’s technology: HT3 (Hyper Transport 3) n 19.2 GB/s
¤ Accelerators: PCI-e 2.0 n 8 GB/s
50
Power
¨ Chip manufacturers specify a Thermal Design Power (TDP) ¨ We can measure dissipated power
¤ Whole system ¤ Typically (much) lower than TDP
¨ Power efficiency ¤ FLOPS / Watt
¨ Examples (with theoretical peak and TDP) ¤ Intel Core i7: 154 / 160 = 1.0 GFLOPs/W ¤ NVIDIA GTX 580: 1581 / 244 = 6.5 GFLOPs/W ¤ ATI HD 6970: 2703 / 250 = 10.8 GFLOPs/W
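The derived metric can be sketched the same way (function name is mine; peak and TDP figures are the slide's, and the result is an upper bound since real dissipation is typically below TDP and real GFLOPS below peak):

```python
# Power efficiency = theoretical peak GFLOPS / TDP in watts.
def gflops_per_watt(peak_gflops, tdp_watts):
    return peak_gflops / tdp_watts

print(gflops_per_watt(154, 160))    # Intel Core i7: ~1.0 GFLOPs/W
print(gflops_per_watt(2703, 250))   # ATI HD 6970: ~10.8 GFLOPs/W
```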
51
Summary
Platform          Cores  Threads/ALUs  GFLOPS  Bandwidth (GB/s)  FLOPs/Byte
Sun Niagara 2         8            64    11.2                76         0.1
IBM BG/P              4             8    13.6              13.6         1.0
IBM POWER7            8            32     265                68         3.9
Intel Core i7         4            16      85              25.6         3.3
AMD Barcelona         4             8      37              21.4         1.7
AMD Istanbul          6             6    62.4              25.6         2.4
AMD Magny-Cours      12            12     125              25.6         4.9
Cell/B.E.             8             8     205              25.6         8.0
NVIDIA GTX 580       16           512    1581               192         8.2
NVIDIA GTX 680        8          1536    3090               192        16.1
AMD HD 6970         384          1536    2703               176        15.4
AMD HD 7970          32          2048    3789               264        14.4
Absolute hardware performance
¨ Only achieved in the optimal conditions: ¤ Processing units 100% used ¤ All parallelism 100% exploited ¤ All data transfers at maximum bandwidth
¨ In real life ¤ No application is like this ¤ Can we reason about “real” performance?
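The roofline model introduced next gives one way to reason about "real" performance: attainable performance is capped by either the compute peak or by operational intensity times memory bandwidth, whichever is lower. A minimal sketch (function name is mine; the 2 FLOP/Byte kernel is a hypothetical example):

```python
# Roofline bound on attainable performance.
def attainable_gflops(flops_per_byte, peak_gflops, bandwidth_gbs):
    return min(peak_gflops, flops_per_byte * bandwidth_gbs)

# GTX 580 (1581 GFLOPS peak, 192 GB/s): a hypothetical kernel doing
# 2 FLOPs per byte moved is memory-bound, far below the compute peak.
print(attainable_gflops(2.0, 1581, 192))   # 384.0 GFLOPS
```

Only kernels whose operational intensity exceeds the machine's FLOPs/Byte ratio (8.2 for the GTX 580 in the summary table) can hope to reach the compute peak.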
53
Operational Intensity and the Roofline model
Performance analysis 54
An Example 55
¨ I am the CEO of SmartSoftwareSolutions. I have an application that runs on my old Pentium laptop in 2.5 hours. I want to hire you to use many-cores to improve the performance.
¨ Metrics I will judge candidates by: ¤ How fast can the application be:
n Execution time => what the users are interested in! ¤ How many times faster can you make it:
n Speed-up => use the best possible sequential performance ¤ How do I know I should choose you?
n Achievable performance => reason how far the performance is n Depends on application, hardware, and dataset!
¤ Is this architecture a good one to use? n Utilization => did I really need this hardware?
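Two of these metrics reduce to simple ratios; a toy calculation in Python (all concrete numbers below are hypothetical illustrations, not results):

```python
# Speed-up and utilization as defined in the example above.
def speedup(best_sequential_s, parallel_s):
    """How many times faster than the best sequential run."""
    return best_sequential_s / parallel_s

def utilization(achieved_gflops, peak_gflops):
    """Fraction of the hardware's theoretical peak actually used."""
    return achieved_gflops / peak_gflops

print(speedup(2.5 * 3600, 300))   # 2.5h sequential vs 5 min parallel: 30.0x
print(utilization(200, 1581))     # e.g. 200 of 1581 GFLOPS: ~13%
```

Note that speed-up must be measured against the best sequential version, not a naive one, and that a high speed-up with low utilization suggests the hardware was oversized for the job.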