Introduction to Heterogeneous Computing for HPC
TRANSCRIPT
![Page 1: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/1.jpg)
1
Introduction to Heterogeneous Computing for High Performance Computing
Presented by
Supasit Kajkamhaeng
![Page 2: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/2.jpg)
2
Definition [IDC, 2011]1
"The term high-performance computing refers to all technical computing servers and clusters used to solve problems that are computationally intensive or data intensive efficiently, reliably, and quickly."
http://www.elseptimoarte.net/peliculas/kung-fu-panda-2-2285.html
http://www.prweb.com/releases/cfd/simulation/prweb1891174.htm
http://smu.edu/catco/research/drug-design-a35.html
http://www.drroyspencer.com/2009/07/how-do-climate-models-work/
High Performance Computing
![Page 3: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/3.jpg)
3
HPC Applications
HPC Infrastructure
Processors, Memories, Storage, Networks
HPC Overview
![Page 4: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/4.jpg)
4
A form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). [Almasi and Gottlieb, 1989]2
[Diagram: Sequential computing — one problem processed by a single CPU executing one stream of instructions. Parallel computing — the problem divided into tasks, each with its own instruction stream running on its own CPU.]
Parallel Computing
![Page 5: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/5.jpg)
5
Classes of parallel computers
Multicore processor: a processor that includes multiple execution units ("cores").
Cluster [Webopedia computer dictionary, 2007]3: a group of linked computers working together so closely that in many respects they form a single computer, to improve performance and/or availability over that provided by a single computer.
etc.
Parallel Computing (2)
![Page 6: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/6.jpg)
6
Advantages
More processors: reduce computing time
More memory: make large-scale jobs doable
Problems & Challenges
Complex programming models: difficult development
Complex infrastructures: complicated architecture and deployment
Parallel Computing (3)
![Page 7: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/7.jpg)
7
Why do HPC applications need more and more computing power?
Race against time: solve problems in the shortest time possible
Precision improvement: in the same amount of time, results can be computed with greater precision
At present, the computing-power limit can be gauged from the performance of the most powerful computer systems in use today: the Top500 Supercomputing Sites (www.top500.org)
Increasing Computing Power Requirement
![Page 8: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/8.jpg)
8
What is the Top500?4 [www.top500.org]
The Top500 lists the 500 fastest computer systems in use today.
The collection was started in 1993 and has been updated every 6 months since then.
The best Linpack benchmark performance achieved is used as the performance measure for ranking the computers.
Top500 Supercomputing Sites
![Page 9: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/9.jpg)
9
#1 (Nov 2011): 10.51 PF
Top500 Performance Development
![Page 10: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/10.jpg)
10 [www.top500.org]
Top500 Architecture
![Page 11: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/11.jpg)
11
One of the challenges is to improve the performance (measured in flops) of HPC systems.
"The worldwide high-performance computing (HPC) market is already more than three years into the petascale era (June 2008-present) and is looking to make the thousandfold leap into the exascale era before the end of this decade." [IDC, Nov 2011]1
Improvement factors for performance development:
System costs (flops/dollar)
Space and compute density requirements (flops/square foot)
Energy costs for computation (flops/watt)
Goal: more flops/dollar, flops/square foot, flops/watt [IDC, Nov 2011]1
Challenges & Factors of HPC System Development
![Page 12: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/12.jpg)
12
The performance of many powerful HPC systems is no longer produced by CPUs alone.
Source of performance on HPC systems, present and future:
Present
Tianhe-1A: #2 in the Top500 list (Nov 2011), 2.566 PFLOPS (Rmax); 14,336 Xeon X5670 CPUs, 7,168 Tesla M2050 GPUs, 2,048 NUDT FT1000 heterogeneous processors [http://www.nscc-tj.gov.cn]5
Jaguar: #3 in the Top500 list (Nov 2011), 1.759 PFLOPS (Rmax); 36K AMD Opteron CPUs
Future
Titan (2013): 20-30 PFLOPS (Rpeak); 18,000 AMD Opteron CPUs, 18,000 Tesla GPUs [IDC, Nov 2011]1
![Page 13: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/13.jpg)
13
Application Code
Definition [IDC, 2011]1
"Heterogeneous computing refers to the use of multiple types of processors, typically CPUs in combination with GPUs or other accelerators, within the same HPC system."
CPU + Accelerator (NVIDIA GPU, AMD GPU, Intel MIC)
Heterogeneous Computing for HPC
![Page 14: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/14.jpg)
14
Characteristic of CPU Design from the HPC’s Perspective
Main point of most HPC application codes:
Lots of floating-point calculations (operations): "A frequently used sequence of operations in computer graphics, linear algebra, and scientific applications is to multiply two numbers, adding the product to a third number, for example, D = A x B + C (multiply-add (MAD) instruction)" [NVIDIA, 2009]6
Lots of parallelism: large data sets can be processed in parallel with the massively multithreaded SIMD (Single Instruction, Multiple Data) model
![Page 15: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/15.jpg)
15
CPUs are fundamentally designed for single-thread performance rather than energy efficiency [Steve Scott, November 2011]7
Fast clock rates with deep pipelines
Data and instruction caches optimized for latency
Superscalar issue with out-of-order execution
Dynamic conflict detection
Lots of prediction and speculative execution
Lots of instruction overhead per operation
Less than 2% of chip power today goes to flops
Characteristic of CPU Design from the HPC’s Perspective (2)
![Page 16: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/16.jpg)
16
Characteristic of CPU Design from the HPC’s Perspective (3)
[Peter N. Glaskowsky, 2009]8
![Page 17: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/17.jpg)
17
Characteristic of CPU Design from the HPC’s Perspective (4)
[Peter N. Glaskowsky, 2009]8
![Page 18: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/18.jpg)
18
Accelerator
Definition [S. Patel and W.Hwu, 2008]9
“An accelerator is a separate architectural substructure (on the same chip, or on a different die) that is architected using a different set of objectives than the base processor, where these objectives are derived from the needs of a special class of applications.”
“Through this manner of design, the accelerator is tuned to provide higher performance at lower cost, or at lower power, or with less development effort than with the general-purpose base hardware.”
![Page 19: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/19.jpg)
19
Accelerator (2)
Example
Intel x87 floating-point (math) coprocessors10,11,12,13
During the 1980s and the early 1990s
A separate floating point coprocessor (Intel 8087, 80187, 80287, 80387, 80487) for the 80x86 line of microprocessors
“Later Intel processors (introduced after the 486DX) did not use a separate floating point coprocessor (integrated the floating point hardware on the main processor chip)”
http://en.wikipedia.org/wiki/File:80386with387.JPG
![Page 20: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/20.jpg)
20
Accelerator (3)
Example
Graphics Processing Unit (GPU)14
“A GPU is a specialized circuit designed to rapidly manipulate and alter memory in such a way so as to accelerate the building of images in a frame buffer intended for output to a display.”
“A GPU can be present on a video card, or it can be on the motherboard or on the CPU die.”
“Modern GPUs are very efficient at manipulating computer graphics, and their highly parallel structure makes them more effective than general-purpose CPUs for algorithms where processing of large blocks of data is done in parallel.”
GPU Computing
![Page 21: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/21.jpg)
21
GPU Computing (GPGPU)
Definition [NVIDIA, 2011]15
“GPU computing or GPGPU is the use of a GPU (graphics processing unit) to do general purpose scientific and engineering computing.”
“The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing computing model.”
“The sequential part of the application runs on the CPU and the computationally-intensive part is accelerated by the GPU.”
http://www.nvidia.com/docs/IO/65513/gpu-computing-feature.jpg
![Page 22: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/22.jpg)
22
Graphics Pipeline
These are the various stages in the typical pipeline of a modern graphics processing unit (GPU). (Illustration courtesy of NVIDIA Corporation.)
The most computationally demanding stages (especially the pixel shader stage) exhibit lots of data parallelism (well suited to parallel hardware)
![Page 23: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/23.jpg)
23
Evolution of GPU Architecture
A fixed-function graphics pipeline
Programmable parts (vertex and pixel shaders) of the graphics pipeline (a programmable engine surrounded by supporting fixed-function units, programmed through graphics languages like OpenGL, DirectX, and Cg)
A unified graphics & compute architecture (all programmable units in the graphics pipeline share a single programmable hardware unit, with added support for high-level languages like C, C++, and Fortran)
[Owens et al., 2008]16
GPU Computing
![Page 24: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/24.jpg)
24
CUDA
Compute Unified Device Architecture [NVIDIA, 2011]17
“CUDA is NVIDIA’s parallel computing architecture. It enables dramatic increases in computing performance by harnessing the power of the GPU.”
Fermi's 16 SMs are positioned around a common L2 cache. Each SM is a vertical rectangular strip that contains an orange portion (scheduler and dispatch), a green portion (execution units), and light blue portions (register file and L1 cache).
Fermi
[NVIDIA, 2009]6
SM
![Page 25: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/25.jpg)
25
CUDA (1)
[NVIDIA, 2009]6
Fermi SM
![Page 26: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/26.jpg)
26
CUDA (2)
[Wikipedia, 2011]18
![Page 27: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/27.jpg)
27
CUDA (3)
[NVIDIA]
![Page 28: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/28.jpg)
28
CUDA (4)
![Page 29: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/29.jpg)
29
CUDA (5)
![Page 30: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/30.jpg)
30
CPU and GPU Comparison
Tianhe-1A
#2 in the Top500 list (November 2011), 2.566 PFLOPS (Rmax)
14,336 Xeon X5670 processors, 7,168 Tesla M2050 GPUs, 2,048 NUDT FT1000 heterogeneous processors
| Processor | Double-Precision FLOPS (Peak) | Power Consumption |
|---|---|---|
| Intel Xeon X5670 | 70.392 GFLOPS | 95 W TDP |
| NVIDIA Tesla M2050 | 515 GFLOPS | 225 W TDP |
![Page 31: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/31.jpg)
31
Conclusion
HPC applications need more and more computing power to solve problems that are compute- and data-intensive.
Heterogeneous computing (such as CPU+GPU) helps deliver more cost-effective and energy-efficient performance (flops/dollar, flops/square foot, flops/watt) for applications that need it than using CPUs alone.
![Page 32: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/32.jpg)
32
1. International Data Corporation (IDC). November, 2011. IDC Executive Brief - Heterogeneous Computing: A New Paradigm for the Exascale Era.
2. G. S. Almasi and A. Gottlieb. 1989. Highly Parallel Computing. Benjamin-Cummings publishers, Redwood City, CA.
3. What is clustering?. Webopedia computer dictionary. Retrieved on November 7, 2007.
4. Top500 Supercomputing Sites. www.top500.org. Retrieved December 2011.
5. NSCC-TJ National Supercomputing Center in Tianjin. www.nscc-tj.gov.cn. Retrieved December 2011.
6. NVIDIA. 2009. NVIDIA's Next Generation CUDA Compute Architecture: Fermi. V1.1.
7. Steve Scott. November 15, 2011. Why the Future of HPC will be Green. SC'11.
8. Peter N. Glaskowsky. September, 2009. NVIDIA’s Fermi: The First Complete GPU Computing Architecture.
Reference
![Page 33: Introduction to heterogeneous_computing_for_hpc](https://reader034.vdocument.in/reader034/viewer/2022051818/54bac6e84a79590c2b8b46bc/html5/thumbnails/33.jpg)
33
9. S. Patel and W. Hwu. 2008. Guest Editors’ Introduction: Accelerator Architectures. IEEE Micro 28(4): 4-12 (2008).
10. x87. en.wikipedia.org/wiki/X87. Retrieved December 2011.
11. Coprocessor. en.wikipedia.org/wiki/Coprocessor. Retrieved December 2011.
12. Intel 8087. en.wikipedia.org/wiki/Intel_8087. Retrieved December 2011.
13. x87 info you need to know!. http://coprocessor.cpu-info.com/index2.php?mainid=Copro&tabid=1&page=1. Retrieved December 2011.
14. Graphics Processing Unit. en.wikipedia.org/wiki/Graphics_processing_unit. Retrieved December 2011.
15. NVIDIA. 2011. What is GPU Computing?. www.nvidia.com/object/GPU_Computing.html. Retrieved December 2011.
16. J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips. 2008. GPU Computing. Proceedings of the IEEE, Vol. 96, No. 5, May 2008.
17. NVIDIA. 2011. What is CUDA?. developer.nvidia.com/what-cuda. Retrieved December 2011.
18. CUDA. en.wikipedia.org/wiki/CUDA. Retrieved December 2011.
Reference (2)