Carlo del Mundo
Department of Electrical and Computer Engineering

Ubiquitous Parallelism
Are You Equipped To Code For Multi- and Many-Core Platforms?

Agenda
• Introduction/Motivation
• Why Parallelism? Why now?
• Survey of Parallel Hardware
• CPUs vs. GPUs
• Conclusion
• How Can I Start?

Talk Goal
• Encourage undergraduates to answer the call to the era of parallelism
  • Education
  • Software Engineering

Why Parallelism? Why now?
• You've already been exposed to parallelism
  • Bit Level Parallelism
  • Instruction Level Parallelism
  • Thread Level Parallelism

Why Parallelism? Why now?
• Single-threaded performance has plateaued
  • Silicon Trends
  • Power Consumption
  • Heat Dissipation

Why Parallelism? Why now?
• Issue: Power & Heat
• Good: It is cheaper to add more cores, but each core is slower
• Bad: Breaks the hardware/software contract

Why Parallelism? Why now?
• Hardware/Software Contract
  • Maintain backwards compatibility with existing codes

Personal Mobile Device Space
• iPhone 5: 2 CPU cores / 3 GPU cores
• Galaxy S3: 4 CPU cores / 4 GPU cores

Desktop Space
• AMD Opteron 6272: 16 CPU cores
• Rare To Have “Single Core” CPU
• Clock Speeds < 3.0 GHz
  • Power Wall
  • Heat Dissipation

Desktop Space
• AMD Radeon 7970: 2048 GPU Cores
• General Purpose
• Power Efficient
• High Performance
• Not All Problems Can Be Done on GPU

Warehouse Space (HokieSpeed)
• Each node:
  • 2x Intel Xeon 5645 (6 cores each)
  • 2x NVIDIA C2050 (448 GPU cores each)
• 209 nodes
• Total: 2508 CPU cores, 187264 GPU cores
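
The HokieSpeed totals follow directly from the per-node figures. A minimal Python sketch of that arithmetic (the constant and variable names are illustrative, not from the talk):

```python
# Per-node figures as listed on the HokieSpeed slide.
NODES = 209
CPUS_PER_NODE = 2
CORES_PER_CPU = 6      # Intel Xeon 5645
GPUS_PER_NODE = 2
CORES_PER_GPU = 448    # NVIDIA C2050

# Total cores = nodes * devices per node * cores per device.
cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
gpu_cores = NODES * GPUS_PER_NODE * CORES_PER_GPU

print(cpu_cores, "CPU cores")
print(gpu_cores, "GPU cores")
```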

Convergence in Computing
• Three Classes:
  • Warehouse
  • Desktop
  • Personal Mobile Device
• Main Criteria
  • Power, Performance, Programmability

What is the GPU?
• GPU: Boeing 747
• Capacity: 605 passengers
• Top Speed: 570 mph

CPU vs. GPU

                      Capacity (passengers)   Speed (mph)   Throughput (passengers * mph)
“CPU” Fighter Jet     2                       2200          4400
“GPU” 747             452                     555           250,860
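
The throughput column is the product of the other two columns. A small Python sketch of the analogy, using the figures from the table (the function and variable names are made up for illustration):

```python
def throughput(capacity_passengers, speed_mph):
    """Throughput as defined in the table: passengers * mph."""
    return capacity_passengers * speed_mph

# "CPU": few passengers, very fast. "GPU": many passengers, slower.
fighter_jet = throughput(2, 2200)
boeing_747 = throughput(452, 555)

print("Fighter jet:", fighter_jet)
print("747:", boeing_747)
print("747 moves more passenger-miles per hour:", boeing_747 > fighter_jet)
```

The point of the analogy is that the slower vehicle wins on throughput because it carries far more at once.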

CPUs, GPUs, APUs
• How to handle parallelism?
• How to extract performance?
• Can I just throw processors at a problem?

CPUs, GPUs, APUs
• Multi-threading (2-16 threads)
• Massive multi-threading (100,000+)
• Depends on Your Problem

How Can I Start?
• CUDA Programming
  • You most likely have a CUDA-enabled GPU if you have a recent NVIDIA card

How Can I Start?
• CPU or GPU Programming
  • Use OpenCL (your laptop could potentially run it)

How Can I Start?
• Undergraduate research
• Senior/Grad Courses:
  • CS 4234 – Parallel Computation
  • CS 5510 – Multiprocessor Programming
  • ECE 4504/5504 – Computer Architecture
  • CS 5984 – Advanced Computer Graphics

In Summary …
• Parallelism is here to stay
• How does this affect you?
• How fast is fast enough?
• Are we content with current computer performance?

Thank you!
• Carlo del Mundo
• Senior, Computer Engineering
• Website: http://filebox.vt.edu/users/cdel/
• E-mail: [email protected]
Previous Internships @

OpenCL
• Heterogeneous programming model that caters to several kinds of devices (CPUs, GPUs, APUs)

Comparisons

                      pthreads    MPI            CUDA          OpenCL
Number of Threads     2-16        --             100,000+      2 – 100,000+
Platform              CPU only    Any Platform   NVIDIA Only   Any Platform
Productivity†         Easy        Medium         Hard          Hard
Parallelism through   Threads     Messages       Threads       Threads

† Productivity is subjective and draws from my experiences

Vector Add
• Serial
  • Loop N times
  • N cycles†
• Parallel
  • Assume you have N cores
  • 1 cycle†

† Assume 1 add = 1 cycle
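
The cycle counts above can be sketched in a few lines of Python. This is only a model of the slide's cost assumption (1 add = 1 cycle), not a parallel implementation. The helper names `vector_add` and `cycles` are hypothetical, and the ceiling formula for core counts between 1 and N is an added assumption that agrees with the slide at both ends.

```python
import math

def vector_add(a, b):
    """Element-wise add. Every output element depends only on a[i] and b[i],
    so all N adds are independent and could run at the same time."""
    return [x + y for x, y in zip(a, b)]

def cycles(n, cores=1):
    """Cycles needed for n independent adds, assuming 1 add = 1 cycle and
    the work is split evenly across cores (assumption for 1 < cores < n)."""
    return math.ceil(n / cores)

n = 8
a = list(range(n))
b = [10] * n

print(vector_add(a, b))
print("serial cycles:", cycles(n))             # one core: N cycles
print("parallel cycles:", cycles(n, cores=n))  # N cores: 1 cycle
```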

Matrix Multiplication
• Embarrassingly Parallel
• Let L be the length of each side
  • L^2 elements, each element requires L multiplies and L adds
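
To make the operation count concrete, here is a minimal Python sketch of the naive algorithm for square matrices that counts multiplies and adds as it goes. The function name `matmul_with_counts` is hypothetical, and the count assumes each output element starts from an accumulator of 0, which gives L adds per element as on the slide.

```python
def matmul_with_counts(a, b):
    """Naive L x L matrix multiply that also counts multiplies and adds."""
    size = len(a)
    c = [[0] * size for _ in range(size)]
    multiplies = 0
    adds = 0
    for i in range(size):
        for j in range(size):
            # c[i][j] depends only on row i of a and column j of b, so all
            # L^2 output elements are independent of one another.
            acc = 0
            for k in range(size):
                acc += a[i][k] * b[k][j]
                multiplies += 1
                adds += 1
            c[i][j] = acc
    return c, multiplies, adds

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
m = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
product, multiplies, adds = matmul_with_counts(identity, m)
print(product)
print(multiplies, adds)  # L^2 elements * L operations each = L^3
```

Because no output element needs another's result, each one could in principle be assigned to its own thread.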

Performance
• Operations/Second (FLOPS)
• Power (W)
• Throughput (# things/unit time)
• FLOPS/W

Puss In Boots
• “Renders that took hours now take minutes”
  - Ken Museth, Effects R&D Supervisor, DreamWorks Animation

Computational Finance
• Black-Scholes
  – A PDE that governs the price of an option, essentially “eliminating” risk

Genome Sequencing
• Knowledge of the human genome can provide insights into new medicine and biotechnology
  • E.g.: genetic engineering, hybridization

Why Should You Care?
• Trends:
  • CPU Core Counts Double Every 2 years
    • 2006 – 2 cores, AMD Athlon 64 X2
    • 2010 – 8-12 cores, AMD Magny-Cours
  • Power Wall
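
As a quick check of the doubling trend, the sketch below projects core counts forward from the 2006 data point. The function name `projected_cores` is hypothetical, and it assumes a clean doubling at every full 2-year step, which is a simplification of the trend stated on the slide.

```python
def projected_cores(base_year, base_cores, year, doubling_period=2):
    """Core count if cores double once per full doubling period."""
    doublings = (year - base_year) // doubling_period
    return base_cores * 2 ** doublings

# Starting point from the slide: 2 cores in 2006 (AMD Athlon 64 X2).
for year in (2006, 2008, 2010):
    print(year, projected_cores(2006, 2, year))
```

Under this assumption the projection for 2010 is 8 cores, the low end of the 8-12 range listed for Magny-Cours.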