carlo del mundo department of electrical and computer engineering ubiquitous parallelism are you...

Carlo del Mundo
Department of Electrical and Computer Engineering

Ubiquitous Parallelism
Are You Equipped To Code For Multi- and Many-Core Platforms?

Agenda
• Introduction/Motivation
• Why Parallelism? Why now?
• Survey of Parallel Hardware
• CPUs vs. GPUs
• Conclusion
• How Can I Start?

2

Talk Goal
• Encourage undergraduates to answer the call to the era of parallelism
• Education
• Software Engineering

3

Why Parallelism? Why now?
• You’ve already been exposed to parallelism
  • Bit Level Parallelism
  • Instruction Level Parallelism
  • Thread Level Parallelism

4

Why Parallelism? Why now?
• Single-threaded performance has plateaued
• Silicon Trends
• Power Consumption
• Heat Dissipation

5

Why Parallelism? Why now?

6

Power Chart: P = CV^2F
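A minimal sketch of why the relation P = CV^2F favors more, slower cores. The values below are hypothetical and normalized (they are not taken from the chart), and the assumption that a core running at half the frequency can also run at 80% of the voltage is for illustration only.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic switching power: P = C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency

# Hypothetical baseline: one core at normalized C = V = F = 1.
one_fast_core = dynamic_power(1.0, 1.0, 1.0)

# Hypothetical alternative: two cores, each at half the frequency
# and (assumed) 80% of the baseline voltage.
two_slow_cores = 2 * dynamic_power(1.0, 0.8, 0.5)

print("one fast core :", one_fast_core)
print("two slow cores:", two_slow_cores)
```

Under these assumptions the two-core design offers the same aggregate clock rate for less power, but only software that actually uses both cores benefits.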

7

Heat Chart (Feature Size)

8

Why Parallelism? Why now?
• Issue: Power & Heat
• Good: Cheaper to have more cores, but slower
• Bad: Breaks hardware/software contract

9

Why Parallelism? Why now?
• Hardware/Software Contract
  • Maintain backwards-compatibility with existing codes

10

Why Parallelism? Why now?

11



12

Personal Mobile Device Space

13

iPhone 5 Galaxy S3


14

2 CPU cores/3 GPU cores

iPhone 5 Galaxy S3


15



iPhone 5 Galaxy S3

Desktop Space

16


17

16 CPU cores

AMD Opteron 6272

• Rare To Have “Single Core” CPU

• Clock Speeds < 3.0 GHz
  • Power Wall
  • Heat Dissipation


18

2048 GPU Cores

AMD Radeon 7970

• General Purpose
• Power Efficient
• High Performance

• Not All Problems Can Be Done on GPU

Warehouse Space (HokieSpeed)

19

• Each node:
  • 2x Intel Xeon 5645 (6 cores each)
  • 2x NVIDIA C2050 (448 GPU cores each)


20




• 209 nodes


21



2508 CPU cores, 187264 GPU cores
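The totals follow directly from the per-node figures on the HokieSpeed slides. A short sketch of the arithmetic, with constant names chosen here for illustration:

```python
# Per-node figures as stated on the HokieSpeed slides.
NODES = 209
CPUS_PER_NODE = 2
CORES_PER_CPU = 6
GPUS_PER_NODE = 2
CORES_PER_GPU = 448

total_cpu_cores = NODES * CPUS_PER_NODE * CORES_PER_CPU
total_gpu_cores = NODES * GPUS_PER_NODE * CORES_PER_GPU

print(total_cpu_cores, "CPU cores")
print(total_gpu_cores, "GPU cores")
```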

All Spaces

22

Convergence in Computing
• Three Classes:
  • Warehouse
  • Desktop
  • Personal Mobile Device
• Main Criteria
  • Power, Performance, Programmability

23



24

What is a CPU?
• CPU
  • SR-71 Jet
• Capacity
  • 2 passengers
• Top Speed
  • 2200 mph

25

What is the GPU?
• GPU
  • Boeing 747
• Capacity
  • 605 passengers
• Top Speed
  • 570 mph

26

CPU vs. GPU

27

                     Capacity (passengers)   Speed (mph)   Throughput (passengers * mph)
“CPU” Fighter Jet    2                       2200          4400
“GPU” 747            452                     555           250,860
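The throughput column is simply capacity times speed. A small sketch using the table's own numbers (the function and variable names are illustrative):

```python
def throughput(passengers, mph):
    """Throughput as defined in the table: passengers * mph."""
    return passengers * mph

fighter_jet = throughput(2, 2200)   # the "CPU": fast, but carries little
jumbo_jet = throughput(452, 555)    # the "GPU": slower, but carries a lot

print("CPU (fighter jet):", fighter_jet)
print("GPU (747):        ", jumbo_jet)
print("ratio:            ", jumbo_jet / fighter_jet)
```

The jet still wins on latency (one passenger arrives sooner); the 747 wins once there are enough passengers to fill it.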

CPU Architecture
• Latency Oriented (Speculation)

28

GPU Architecture

29

APU = CPU + GPU
• Accelerated Processing Unit
• Both CPU + GPU on the same die

30

CPUs, GPUs, APUs
• How to handle parallelism?
• How to extract performance?
• Can I just throw processors at a problem?

31

CPUs, GPUs, APUs
• Multi-threading (2-16 threads)
• Massive multi-threading (100,000+)
• Depends on Your Problem

32



33

How Can I Start?
• CUDA Programming
  • You most likely have a CUDA enabled GPU if you have a recent NVIDIA card

34

How Can I Start?
• CPU or GPU Programming
  • Use OpenCL (your laptop could potentially run it)

35

How Can I Start?
• Undergraduate research
• Senior/Grad Courses:
  • CS 4234 – Parallel Computation
  • CS 5510 – Multiprocessor Programming
  • ECE 4504/5504 – Computer Architecture
  • CS 5984 – Advanced Computer Graphics

36

In Summary …
• Parallelism is here to stay
• How does this affect you?
• How fast is fast enough?
• Are we content with current computer performance?

37

Thank you!
• Carlo del Mundo
• Senior, Computer Engineering
• Website: http://filebox.vt.edu/users/cdel/
• E-mail: [email protected]

38

Previous Internships @

Appendix

39

Programming Models
• pthreads

• MPI

• CUDA

• OpenCL

40

pthreads
• A UNIX API to create and destroy threads

41

MPI
• A communications protocol
• “Send and Receive” messages between nodes

42

CUDA
• Massive multi-threading (100,000+)
• Thread-level parallelism

43

OpenCL
• Heterogeneous programming model that caters to several devices (CPUs, GPUs, APUs)

44

Comparisons

                      pthreads   MPI            CUDA          OpenCL
Number of Threads     2-16       --             100,000+      2 – 100,000+
Platform              CPU only   Any Platform   NVIDIA Only   Any Platform
Productivity†         Easy       Medium         Hard          Hard
Parallelism through   Threads    Messages       Threads       Threads

† Productivity is subjective and draws from my experiences

Parallel Applications
• Vector Add

• Matrix Multiplication

46

Vector Add

47

+

=

Vector Add
• Serial
  • Loop N times
  • N cycles†
• Parallel
  • Assume you have N cores
  • 1 cycle†

48

+

=

† Assume 1 add = 1 cycle
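A minimal sketch of the serial and parallel versions of vector add. It uses Python's `concurrent.futures` thread pool as a stand-in for the N cores; the function names and the default of 4 workers are assumptions made for illustration. In standard CPython the global interpreter lock keeps these threads from speeding up the arithmetic, so the point here is the decomposition: every element is independent, so the index range can be split among workers with no coordination.

```python
from concurrent.futures import ThreadPoolExecutor

def vector_add_serial(a, b):
    """One loop, N iterations: roughly N 'cycles' if one add is one cycle."""
    out = [0] * len(a)
    for i in range(len(a)):
        out[i] = a[i] + b[i]
    return out

def vector_add_parallel(a, b, workers=4):
    """Split the index range into chunks; each worker adds its own chunk."""
    n = len(a)
    out = [0] * n
    chunk = max(1, (n + workers - 1) // workers)  # ceiling division, at least 1

    def add_chunk(start):
        # Each worker writes to a disjoint slice of out, so no locking is needed.
        for i in range(start, min(start + chunk, n)):
            out[i] = a[i] + b[i]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(add_chunk, range(0, n, chunk)))
    return out

a = list(range(10))
b = [10 * x for x in range(10)]
print(vector_add_serial(a, b))
print(vector_add_parallel(a, b))
```

With N workers and one element each, this is the idealized "1 cycle" case from the slide; with fewer workers, each handles about N / workers elements.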

Matrix Multiplication

49

A

B

C


50

A

B

C


51

A

B

C

Matrix Multiplication
• Embarrassingly Parallel
• Let L be the length of each side
  • L^2 elements, each element requires L multiplies and L adds
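The operation count can be checked with a naive triple loop. This is a sketch for square matrices stored as lists of lists; the function name is illustrative, and the add count treats accumulating into an initial zero as an add, which matches the "L adds" figure.

```python
def matmul_with_counts(a, b):
    """Naive L x L matrix multiply that also counts multiplies and adds."""
    size = len(a)
    c = [[0] * size for _ in range(size)]
    multiplies = 0
    adds = 0
    for i in range(size):          # each (i, j) pair is independent of the others,
        for j in range(size):      # which is what makes the problem easy to parallelize
            acc = 0
            for k in range(size):
                acc += a[i][k] * b[k][j]
                multiplies += 1
                adds += 1
            c[i][j] = acc
    return c, multiplies, adds

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
c, multiplies, adds = matmul_with_counts(a, b)
print(c, multiplies, adds)
```

For side length L this gives L^2 output elements and L^3 multiplies and L^3 adds in total.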

52

Performance
• Operations/Second (FLOPS)

• Power (W)

• Throughput (# things/unit time)

• FLOPS/W

53

Puss In Boots

54

• Renders that took hours now take minutes
  - Ken Museth, Effects R&D Supervisor, DreamWorks Animation

Computational Finance
• Black-Scholes
  – A PDE which governs the price of an option essentially “eliminating” risk
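As an illustration of why this workload parallelizes well, here is a sketch that prices European call options with the well-known closed-form solution of the Black-Scholes PDE. The slide names only the PDE, so the choice of the closed-form call formula, the function names, and the sample contracts are all assumptions made for this example.

```python
import math

def norm_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes_call(spot, strike, rate, volatility, years):
    """Closed-form price of a European call option."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * volatility ** 2) * years) / (
        volatility * math.sqrt(years)
    )
    d2 = d1 - volatility * math.sqrt(years)
    return spot * norm_cdf(d1) - strike * math.exp(-rate * years) * norm_cdf(d2)

# Hypothetical contracts: (spot, strike, rate, volatility, years).
options = [
    (100.0, 100.0, 0.05, 0.2, 1.0),
    (100.0, 110.0, 0.05, 0.2, 1.0),
]

# Each option is priced independently, so this loop could be one GPU thread per option.
prices = [black_scholes_call(*option) for option in options]
print(prices)
```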

55

Genome Sequencing
• Knowledge of the human genome can provide insights into new medicine and biotechnology
  • E.g.: genetic engineering, hybridization

56

Applications

57

Why Should You Care?

• Trends:
  • CPU Core Counts Double Every 2 years
    • 2006 – 2 cores, AMD Athlon 64 X2
    • 2010 – 8-12 cores, AMD Magny Cours

• Power Wall
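The doubling trend can be written as a one-line projection. This is a sketch under the slide's stated assumption of one doubling every 2 years; the function name is illustrative, and the trend is only checked against the two data points given above.

```python
def projected_cores(base_cores, base_year, year, doubling_period_years=2):
    """Core count if it doubles once every doubling_period_years."""
    return base_cores * 2 ** ((year - base_year) / doubling_period_years)

# Starting from the 2006 data point (2 cores), project forward to 2010.
cores_2010 = projected_cores(2, 2006, 2010)
print(cores_2010)
```

Two doublings from 2 cores gives 8, the low end of the 8-12 core range listed for 2010.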

58

Then And Now
• Today’s state-of-the-art hardware is yesterday’s supercomputer
• 1998 – Intel TFLOPS supercomputer
  • 1.8 trillion floating point ops / sec (1.8 TFLOPS)
• 2008 – AMD Radeon 4870 GPU x2
  • 2.4 trillion floating point ops / sec (2.4 TFLOPS)

59
