ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Dec 31, 2012
Emergence of GPU systems and clusters for general purpose High
Performance Computing
These notes will introduce:
• The development of GPU devices from the 1970s to the present day
• Their use in high performance computers today
CPU-GPU architecture evolution: 1970s - 1980s
Co-processors -- a very old idea that appeared in the 1970s and 1980s, with floating-point co-processors attached to microprocessors that did not then have floating-point capability.
These coprocessors simply executed floating point instructions that were fetched from memory.
Around the same time, there was interest in providing hardware support for displays, especially with the increasing use of graphics and PC games.
This led to graphics processing units (GPUs) attached to the CPU to create the video display.
[Early design: CPU and its memory, with an attached graphics card driving the display]
Pipelined programmable GPU: dedicated pipeline (late 1990s - early 2000s)
By the late 1990s, graphics chips needed to support 3-D graphics, especially for games and for graphics APIs such as DirectX and OpenGL.
Graphics chips generally had a pipeline structure, with individual stages performing specialized operations and finally loading the frame buffer for display.
Individual stages may have access to graphics memory for storing intermediate computed data.
[Dedicated pipeline: input stage → vertex shader stage → geometry shader stage → rasterizer stage → pixel shading stage → frame buffer, with individual stages having access to graphics memory]
Graphics Processing Units (GPUs): Brief History
[Timeline, 1970 - 2010, roughly in order:]
• Atari 8-bit computer text/graphics chip
• IBM PC Professional Graphics Controller card
• S3 graphics cards - single-chip 2D accelerator
• OpenGL graphics API
• Hardware-accelerated 3D graphics
• DirectX graphics API
• PlayStation
• GPUs with programmable shading
• Nvidia GeForce 3 (2001) with programmable shading
• General-purpose computing on graphics processing units (GPGPUs)
• GPU Computing
Source of information: http://en.wikipedia.org/wiki/Graphics_Processing_Unit
NVIDIA products
NVIDIA Corp. is the leader in GPUs for high performance computing:
[Timeline, 1993 - 2010:]
• NVIDIA established (1993) by Jen-Hsun Huang, Chris Malachowsky, and Curtis Priem
• NV1
• GeForce 1, GeForce 2 series, GeForce FX series
• GeForce 8 series: GeForce 8800 card / G80 chip - NVIDIA's first GPU with general-purpose processors
• GeForce 200 series: GTX260/275/280/285/295
• GeForce 400 series: GTX460/465/470/475/480/485
• Quadro
• Tesla GPU computing products: C870, S870, C1060, S1070, C2050, … (the Tesla 2050 GPU has 448 thread processors)
• Fermi architecture
• Kepler architecture (2011)
• Maxwell architecture (2013)
http://en.wikipedia.org/wiki/GeForce
GeForce 6 Series Architecture (2004-5)
[Figure from GPU Gems 2, Copyright 2005 by NVIDIA Corporation]
General-Purpose GPU designs
High performance pipelines call for high-speed (IEEE) floating point operations.
People tried to use GPU cards to speed up scientific computations.
Known as GPGPU (general-purpose computing on graphics processing units) -- difficult to do with specialized graphics pipelines, but possible.
By the mid 2000s, it was recognized that individual stages of the graphics pipeline could be implemented by a more general-purpose processor core (although with a data-parallel paradigm).
NVIDIA G80 chip / GeForce 8800 card (2006)
First GPU for high performance computing as well as graphics.
Unified processors that could perform vertex, geometry, pixel, and general computing operations.
Could now write programs in C rather than graphics APIs.
Single-instruction, multiple-thread (SIMT) programming model.
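To make the SIMT model concrete, here is a minimal sketch of a CUDA kernel (an illustrative example, not taken from these notes; the names vecAdd, a, b, c are hypothetical). Every thread executes the same kernel code but works on a different array element, selected from its block and thread indices.

// Hypothetical SIMT kernel: one thread computes one element of c = a + b.
__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
    if (i < n)                                       // guard the partial last block
        c[i] = a[i] + b[i];
}

Launching this kernel with at least n threads gives the data-parallel behaviour the hardware expects.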
GPU performance gains over CPUs
[Chart: peak GFLOPS from 2002 to 2009 - NVIDIA GPUs (NV30, NV40, G70, G80, GT200, T12) versus Intel CPUs (3 GHz Dual Core P4, 3 GHz Core2 Duo, 3 GHz Xeon Quad, Westmere)]
Source © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Evolving GPU design: NVIDIA Fermi architecture (announced Sept 2009)*
• Data-parallel, single-instruction multiple-data operation ("stream" processing)
• Up to 512 cores ("stream processing engines", SPEs), organized as 16 groups (streaming multiprocessors), each having 32 cores
• 3 GB or 6 GB GDDR5 memory
• Many innovations including L1/L2 caches, unified device memory addressing, ECC memory, …
• First implementation: Tesla 20 series (single-chip C2050/2070, 4-chip S2050/2070), a roughly 3-billion-transistor chip. Number of cores limited by power considerations; the C2050 has 448 cores.
* Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi, NVIDIA, 2008
NVIDIA Kepler architecture and GPUs (2012)
Many major new features over the earlier Fermi architecture – we will look at them later in the course.
GeForce 600 series card introduced early 2012.
GTX 680 has 1536 cores, 195 watts. Introduced March 2012.
GTX 690 has two dies, 3072 cores (2 x 1536 cores), 300 watts. Introduced April 2012.
CUDA Compute Capability 3.0 - see next.
http://en.wikipedia.org/wiki/GeForce_600_Series http://www.tomshardware.com/news/Nvidia-Kepler-GK104-GeForce-GTX-670-680,14691.html
GK104 chip with 1536 cores
Tesla K20 GPU computing modules - Kepler architecture, introduced November 2012
K20 – 2496 thread processors (cores); K20X – 2688 thread processors (cores)
K20: 2496 FP32 cores, 832 FP64 cores
Wattage: 225 watts
Compute capability: 3.5
GFLOPS: single precision 3519 / 4106, double precision 1173
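As an aside, the compute capability and multiprocessor count of whichever GPU is installed can be checked at run time with the CUDA runtime API; the following is a small sketch (an illustrative program, not part of the original notes).

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);                    // number of CUDA-capable GPUs

    for (int d = 0; d < count; d++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);         // fill in the device's properties
        printf("Device %d: %s\n", d, prop.name);
        printf("  Compute capability: %d.%d\n", prop.major, prop.minor);
        printf("  Multiprocessors:    %d\n", prop.multiProcessorCount);
        printf("  Global memory:      %zu MB\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}

On a K20, for example, the reported compute capability should be 3.5, matching the figure above.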
Titan Supercomputer - Oak Ridge National Laboratory, Oak Ridge, Tenn.
World's fastest computer as of Nov 2012 - No. 1 rank on the TOP500 list
18,688 NVIDIA Tesla K20X GPUs
20 petaflops
Upgraded from the Jaguar supercomputer: 10 times faster and 5 times more energy efficient than the 2.3-petaflops Jaguar system, while occupying the same floor space.
http://nvidianews.nvidia.com/Releases/NVIDIA-Powers-Titan-World-s-Fastest-Supercomputer-For-Open-Scientific-Research-8a0.aspx#source=pr
CUDA (Compute Unified Device Architecture)
• Architecture and programming model introduced by NVIDIA in 2007
• Enables GPUs to execute programs written in C.
• Within C programs, call SIMT "kernel" routines that are executed on the GPU.
• CUDA syntax extensions to C identify a routine as a kernel (a minimal example is sketched after this list).
• Very easy to learn, although getting the highest possible execution performance requires an understanding of the hardware architecture.
• Version 3 introduced in 2009 – the one we have been using
• Current version 4 introduced in 2011 – significant additions including "unified virtual addressing" – a single address space across GPU and host, see later.
• We will go into CUDA in detail later and gain programming experience.
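As a preview of the kernel idea and the CUDA syntax extension mentioned above, below is a minimal, self-contained sketch (illustrative names such as scale and d_x, not taken from these notes). The __global__ qualifier marks the routine as a kernel, and the <<<blocks, threads>>> syntax launches it on the GPU from ordinary C host code.

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Kernel: __global__ marks a routine that runs on the GPU, one copy per thread.
__global__ void scale(float *x, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] *= factor;
}

int main(void)
{
    const int n = 1024;
    size_t bytes = n * sizeof(float);

    float *h_x = (float *)malloc(bytes);              // host copy of the data
    for (int i = 0; i < n; i++)
        h_x[i] = (float)i;

    float *d_x;
    cudaMalloc(&d_x, bytes);                          // device (GPU) copy
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    scale<<<blocks, threadsPerBlock>>>(d_x, 2.0f, n); // SIMT kernel launch

    cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);
    printf("h_x[10] = %.1f\n", h_x[10]);              // expect 20.0

    cudaFree(d_x);
    free(h_x);
    return 0;
}

Compiled with nvcc, the launch creates n threads (rounded up to a whole number of blocks), each running scale() on its own element.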
UNC-C CUDA Teaching Center
2010: NVIDIA Corp. selected the UNC-Charlotte Department of Computer Science to be a CUDA Teaching Center, kindly providing GPU equipment and TA support.
2011: NVIDIA kindly provided 50 GTX 480 GPU cards valued at $15,000 as continuing support for the CUDA Teaching Center.
2012: NVIDIA donated a K20!
Our course materials are posted on NVIDIA's corporate site next to those from Stanford and other top schools.
http://developer.nvidia.com/cuda-training
Questions