Workshop 9: General purpose computing using GPUs: Developing a hands-on
undergraduate course on CUDA programming
SIGCSE 2011 - The 42nd ACM Technical Symposium on Computer Science Education
Wednesday March 9, 2011, 7:00 pm - 10:00 pm
Dr. Barry Wilkinson, University of North Carolina Charlotte
Dr. Yaohang Li, Old Dominion University
SIGCSE 2011 Workshop 9 intro.ppt © 2010 B. Wilkinson Modification date: Feb 22, 2011
Agenda
GPU performance gains over CPUs
[Figure: GFLOPS (axis 0 to 1400) versus date (9/22/2002 to 7/27/2009) for NVIDIA GPUs and Intel CPUs. NVIDIA GPU points: NV30, NV40, G70, G80, GT200, T12. Intel CPU points: 3GHz Dual Core P4, 3GHz Core2 Duo, 3GHz Xeon Quad, Westmere.]
Source: © David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL Spring 2010, University of Illinois, Urbana-Champaign
Emergence of GPU systems for General Purpose High Performance Computing
GPUs have developed from graphics cards into a platform for HPC
GPUs are being designed with that application in mind
Very significant performance improvements on scientific code
http://www.hpcwire.com/blogs/New-China-GPGPU-Super-Outruns-Jaguar-105987389.html
A hot topic to teach
http://www.nvidia.com/object/cuda_courses_and_map.html
Taught at Illinois, Stanford, MIT, Harvard, Duke, Chapel Hill, UNC-C, …
Taught at graduate level and now moving into undergraduate level
GPU Course for High Performance Computing
Concerned with using Graphics Processing Units (GPUs) for high performance computing
Not graphics. A programming course.
Uses CUDA (Compute Unified Device Architecture), an architecture and programming model introduced by NVIDIA in 2007
C-based. Easy to learn.
NVIDIA products
NVIDIA Corp. is the leader in GPUs for high performance computing:
[Timeline figure, 1993 to 2010. Source: http://en.wikipedia.org/wiki/GeForce]
•1993: NVIDIA established by Jen-Hsun Huang, Chris Malachowsky, Curtis Priem
•Graphics products: NV1, GeForce 1, GeForce 2 series, GeForce FX series, GeForce 8 series, GeForce 200 series (GTX260/275/280/285/295), GeForce 400 series (GTX460/465/470/475/480/485)
•GeForce 8800 (G80): NVIDIA's first GPU with general purpose processors
•Tesla: C870, S870, C1060, S1070, C2050, …
•Tesla 2050 GPU has 448 thread processors
•Quadro
•Architectures: Fermi, Kepler (2011), Maxwell (2013)
•CUDA
Programming Model
GPUs historically designed for creating image data for displays.
Involves manipulating image picture elements (pixels), often applying the same operation to each pixel.
SIMD (Single Instruction Multiple Data) model - An efficient mode of operation in which the same operation is done on each data element at the same time.
GPUs use a thread version of SIMD called Single Instruction Multiple Thread (SIMT).
GPU’s SIMT Programming Model
GPUs use very lightweight threads to achieve high parallel performance and to hide memory latency
Multiple threads each execute the same instruction sequence.
Very large number of threads (tens of thousands) possible on GPUs.
Threads are mapped onto the available processors on the GPU (hundreds of processors), all executing the same program sequence
More on the program model shortly
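The SIMT idea above can be sketched in plain Python. This is only an emulation written for illustration, not CUDA code and not the workshop's material: the names `launch` and `scale_pixels` are hypothetical, and the "threads" run one after another rather than in parallel. What it shows is the key property of the model: every thread runs the same instruction sequence and uses its own thread ID to pick the data element it works on.

```python
def launch(kernel, num_threads, *args):
    """Emulate a SIMT launch: every thread runs the same kernel code.

    On a GPU the threads would run in parallel on the available processors;
    here they run one after another, which gives the same result because
    the threads do not depend on each other.
    """
    for thread_id in range(num_threads):
        kernel(thread_id, *args)


def scale_pixels(thread_id, pixels, factor, out):
    """Kernel (hypothetical): each thread brightens exactly one pixel."""
    out[thread_id] = min(255, pixels[thread_id] * factor)


pixels = [10, 60, 100, 200]
result = [0] * len(pixels)
launch(scale_pixels, len(pixels), pixels, 2, result)
print(result)  # [20, 120, 200, 255]
```

Because no thread reads a value written by another thread, the order in which the threads run does not matter, which is what lets a GPU schedule them freely across its processors.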
Programming applications using SIMT model
Matrix operations -- very amenable to SIMT
•Same operations done on different elements of matrices
Some “embarrassingly” parallel computations such as Monte Carlo calculations
•Monte Carlo calculations use random selections that are independent of each other
Data manipulations
•Some sorting can be done quite efficiently
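As an illustration of why Monte Carlo calculations fit this model, here is a minimal Python sketch that estimates pi from random points. It is an assumed example, not taken from the workshop: the function names, the per-thread seeding scheme, and the choice of pi as the quantity to estimate are all hypothetical. Each emulated thread draws its own independent random samples, and the per-thread counts are combined only at the end.

```python
import random


def count_hits(thread_id, samples_per_thread):
    """Work done by one thread: count random points inside the unit quarter circle."""
    rng = random.Random(thread_id)  # assumed scheme: one seed per thread
    hits = 0
    for _ in range(samples_per_thread):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            hits += 1
    return hits


def estimate_pi(num_threads, samples_per_thread):
    """Combine the per-thread counts; on a GPU this step would be a reduction."""
    total_hits = sum(count_hits(t, samples_per_thread) for t in range(num_threads))
    return 4.0 * total_hits / (num_threads * samples_per_thread)


pi_estimate = estimate_pi(num_threads=64, samples_per_thread=1000)
print(pi_estimate)
```

The only communication between threads is the final sum, so the bulk of the work needs no synchronization at all.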
Computer system used for workshop at UNC-Charlotte
[Diagram: coit-grid01.uncc.edu – coit-grid06.uncc.edu connected through a switch]
•coit-grid01: system to log onto first
•coit-grid01-4: each has dual Xeon processors (3.4 GHz), 8 GB main memory
•coit-grid05: four quad-core Xeon processors (2.93 GHz), 64 GB main memory, 1.2 TB disk
•All users’ home directories on coit-grid05 (NFS)
•coit-grid06: NVIDIA Tesla GPU (448-core Fermi); only available directly from on campus
Guest accounts on computer systems
Account details consist of an account name and an ssh password.
Log on first to coit-grid01 and then from there to coit-grid06
Files needed for hands-on sessions provided in each account.
More details in hands-on session write-ups
Use PuTTY or WinSCP if using Windows
coit-grid01.uncc.edu
User interface for accessing the systems and forwarding X11 graphics
Not needed for workshop
[Screenshot: Xclock running on client PC, on coit-grid01.uncc.edu, and on coit-grid06.uncc.edu, to make sure all X servers are running; Xterm running on client PC, logged onto coit-grid06.uncc.edu; WinSCP running on client PC, connected to coit-grid01.uncc.edu]
Heat distribution problem (Solving Laplace’s equation)
Simple implementation
800 x 800 points, 50000 iterations
Speed-up = 16.57
[Figure label: Fireplace]
Graphics forwards to client computer (PC)
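The slide does not say which numerical method the implementation uses. A common simple choice for Laplace's equation is Jacobi iteration, and the sketch below assumes it. The grid size, temperatures, and the position of the hot "fireplace" segment are made-up illustration values, far smaller than the 800 x 800 points and 50000 iterations quoted above, and the function names are hypothetical.

```python
def jacobi_step(grid):
    """One iteration: each interior point becomes the average of its four neighbours."""
    n = len(grid)
    new = [row[:] for row in grid]  # boundary values are copied unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


def solve(n, iterations, hot=100.0, cold=20.0):
    """Square region at `cold` with a hot segment (the assumed fireplace) on the top edge."""
    grid = [[cold] * n for _ in range(n)]
    for j in range(n // 4, 3 * n // 4):
        grid[0][j] = hot
    for _ in range(iterations):
        grid = jacobi_step(grid)
    return grid


room = solve(n=16, iterations=200)
print(round(room[1][8], 2), round(room[14][8], 2))
```

Within one iteration every interior point is computed only from the previous grid, so all points can be updated independently. That is why this kind of problem maps naturally onto one GPU thread per point.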
N Body problem
Video
Questions
Next
Basic CUDA programming
Intro to 1st hands-on session