Trial Lecture: The Use of GPUs for High-Performance Computing
12 October 2010
Magnus Jahre


Page 1

Trial Lecture

The Use of GPUs for High-Performance Computing

12. October 2010

Magnus Jahre

Page 2

Graphics Processors (GPUs)

• Modern computers are graphics intensive

• Advanced 3D graphics require a significant amount of computation

Graphics Card (Source: nvidia.com)

Solution: Add a Graphics Processor (GPU)

Page 3

High-Performance Computing

High-Performance Computing (HPC)

General Purpose Programming on GPUs (GPGPU)

Efficient use of computers for computationally intensive problems in science or engineering

[Figure: applications plotted by Processing Demand vs. Communication Demand: Office Applications, Computational Computer Architecture, Weather Forecasting, Climate Modeling, Dynamic Molecular Simulation; third dimension: Main Memory Capacity]

Page 4

Outline

• GPU Evolution

• GPU Programming

• GPU Architecture

• Achieving High GPU Performance

• Future Trends

• Conclusions

Page 5

GPU EVOLUTION

Page 6

First GPUs: Fixed Hardware

[Blythe 2008]

Pipeline: Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations
Inputs: Vertex Data, Texture Maps; Outputs: Depth Buffer, Color Buffer

Page 7

Programmable Shaders

Pipeline: Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations
Inputs: Vertex Data, Texture Maps; Outputs: Depth Buffer, Color Buffer

Motivation: More flexible graphics processing

Page 8

GPGPU with Programmable Shaders

Pipeline: Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations
Inputs: Vertex Data, Texture Maps; Outputs: Depth Buffer, Color Buffer

Use a graphics library to gain access to the GPU

Encode data as color values

The effect of the fixed-function stages must be accounted for

Page 9

Functional Unit Utilization

[Figure: the pipeline with its dedicated Vertex Processing and Fragment Processing units shown twice to contrast their utilization, alongside Rasterization, Framebuffer Operations, Vertex Data, Texture Maps, Depth Buffer, and Color Buffer]

Page 10

Functional Unit Utilization

[Figure: utilization of the Vertex Processing and Fragment Processing units under a vertex-intensive shader, a fragment-intensive shader, and a unified shader]

Page 11

Unified Shader Architecture

• Exploit parallelism
  – Data parallelism
  – Task parallelism

• Data parallel processing (SIMD/SIMT)

• Hide memory latencies

• High bandwidth

Architecture naturally supports GPGPU

[Figure: four clusters of streaming processors (SPs), each with its own memory, coordinated by a Thread Scheduler and connected through an Interconnect to On-Chip Memory or Cache and to Off-Chip DRAM Memory]

Page 12

GPU PROGRAMMING

Page 13

GPGPU Tool Support

[Timeline 2000-2010: the Programmable Shaders era is followed by Unified Shaders; GPGPU tools appear in sequence: Sh, PeakStream, Accelerator, GPU++, CUDA, OpenCL; the number of GPU papers at Supercomputing grows]

Page 14

Compute Unified Device Architecture (CUDA)

• Most code is normal C++ code

• Code to run on GPU organized in kernels

• CPU sets up and manages computation

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}

Page 15

Thread/Data Organization

• Hierarchical thread organization
  – Grid
  – Block
  – Thread

• A block can have a maximum of 512 threads

• 1D, 2D and 3D mappings possible

[Figure: a 2D grid of blocks (0,0)-(1,2) and a 1D grid of blocks (0)-(1)]

Page 16

Vector Addition Example

[Figure: vectors A and B are copied from CPU main memory to GPU global memory, the SPs (with local memory) compute C, and C is copied back to main memory]

A collection of concurrently processed threads is called a warp

Page 17

Terminology: Warp

Page 18

Vector Addition Profile

• Only 11% of GPU time is used to add vectors

• The arithmetic intensity of the problem is too low

• Overlapping data copy and computation could help

[Chart: %GPU time: vector_add 11%, memcpyHtoD 58%, memcpyDtoH 32%]

Hardware: NVIDIA NVS 3100M

Page 19

Will GPUs Save the World?

• Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]

• GPGPU has provided nice speedups for problems that fit the architecture

• Metric challenge: The practitioner needs performance per developer hour

Page 20

GPU ARCHITECTURE

Page 21

NVIDIA Tesla Architecture

Figure reproduced from [Lindholm et al.; 2008]

Page 22

Control Flow

• The threads in a warp execute the same instruction

• Branching is efficient if all threads in a warp branch in the same direction

• Divergent branches within a warp cause serial execution of both paths

[Figure: at an IF statement, the condition-true threads and the condition-false threads of a warp execute as two serialized groups]

Page 23

Modern DRAM Interfaces

• Maximize bandwidth with 3D organization

• Repeated requests to the row buffer are very efficient

[Figure: DRAM organized as banks of rows and columns; the row address loads a row into the Row Buffer and the column address selects within it]

Page 24

Access Coalescing

• Global memory accesses from all threads in a half-warp are combined into a single memory transaction

• All memory elements in a segment are accessed

• Segment size can be halved if only the lower or upper half is used

Assumes Compute Capability 1.2 or higher

[Figure: threads 0-7 issue 4-byte accesses; addresses 128-156 fall in one aligned segment and form a single transaction, while addresses 112-124 fall in a second segment and require a second transaction]

Page 25

Bank Conflicts

• Memory banks can service requests independently

• Bank conflict: more than one thread accesses a bank concurrently

• Strided access patterns can cause bank conflicts

[Figure: threads 0-7 mapped onto memory banks 0-7]

Stride-two accesses give a 2-way bank conflict

Page 26

NVIDIA Fermi

• Next-generation computing chip from NVIDIA

• Aims to alleviate important bottlenecks
  – Improved double-precision floating-point support
  – Cache hierarchy
  – Concurrent kernel execution

• More problems can be solved efficiently on a GPU

Figure reproduced from [NVIDIA; 2010]

Page 27

ACHIEVING HIGH GPU PERFORMANCE

Page 28

Which problems fit the GPU model?

• Fine-grained data parallelism available
• Sufficient arithmetic intensity
• Sufficiently regular data access patterns

It’s all about organizing data

Optimized memory system use enables high performance

Page 29

Increase Computational Intensity

• Memory types:
  – On-chip shared memory: Small and fast
  – Off-chip global memory: Large and slow

• Technique: Tiling
  – Choose the tile size such that it fits in the shared memory
  – Increases locality by reducing reuse distance

[Figure: A × B = C; a tile of A and a tile of B are reused across the computation of each tile of C]

Page 30

Memory Layout

• Exploit coalescing to achieve high bandwidth

• Linear access necessary

• Solution: Tiling

A × B = C (assume row-major storage)

[Figure: one matrix is traversed along rows (coalesced) and the other along columns (not coalesced)]

Page 31

Avoid Branching Inside Warps

Assume 2 threads per warp

[Figure: two reduction trees over 8 elements with warps W1-W4; in one schedule every iteration diverges, in the other only one iteration diverges]

Page 32

Automation

• Thread resource usage must be balanced with the number of concurrent threads [Ryoo et al., PPoPP 2008]
  – Avoid saturation
  – Sweet spot will vary between devices
  – Sweet spot varies with problem sizes

• Auto-tuning 3D FFT [Nukada et al.; SC 2009]
  – Balance resource consumption vs. parallelism with kernel radix and ordering
  – Best number of thread blocks chosen automatically
  – Inserts padding to avoid shared memory bank conflicts

Page 33

Case Study: Dynamic Molecular Simulation with NAMD

Simulates the interaction of atoms according to the laws of atomic physics and quantum chemistry [Phillips; SC 2009]

Page 34

Key Performance Enablers

• Careful division of labor between GPU and CPU
  – GPU: Short-range non-bonded forces
  – CPU: Long-range electrostatic forces and coordinate updates

• Overlap CPU and GPU execution through asynchronous kernel execution

• Use event recording to track progress in asynchronously executing streams

[Phillips et al., SC2008]

Page 35

CPU/GPU Cooperation in NAMD

[Phillips et al., SC2008]

[Figure: CPU and GPU timelines over Remote, Local, and Update phases; forces (f) flow back from the GPU and coordinates (x) are sent as each phase completes]

Page 36

Challenges

• Completely restructuring legacy software systems is prohibitive

• Batch processing software is unaware of GPUs

• Interoperability issues with pinning main memory pages for DMA

[Phillips et al., SC2008]

Page 37

FUTURE TRENDS

Page 38

Accelerator Integration

• Industry moves toward integrating CPUs and GPUs on the same chip
  – AMD Fusion [Brookwood; 2010]
  – Intel Sandy Bridge (fixed-function GPU)

• Are other accelerators appropriate?
  – Single-chip Heterogeneous Computing: Does the future include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]

AMD Fusion: reproduced from [Brookwood; 2010]

Page 39

Vector Addition Revisited

Start-up and shut-down data transfers are the main bottleneck

Fusion eliminates these overheads by storing values in the on-chip cache

Using accelerators becomes more feasible

Page 40

Memory System Scalability

• Current CPU bottlenecks:
  – Number of pins on a chip grows slowly
  – Off-chip bandwidth grows slowly

• Integration only helps if there is sufficient on-chip cooperation to avoid significant increase in bandwidth demand

• Conflicting requirements:
  – GPU: High bandwidth, not latency sensitive
  – CPU: High bandwidth, can be latency sensitive

Page 41

CONCLUSIONS

Page 42

Conclusions

• GPUs can offer a significant speedup for problems that fit the model

• Tool support and flexible architectures increase the number of problems that fit the model

• CPU/GPU on-chip integration can reduce GPU start-up overheads

Page 43

Thank You

Visit our website: http://research.idi.ntnu.no/multicore/

Page 44

References

• Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
• Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
• NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
• AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
• Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master Thesis; NTNU; 2010
• Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
• Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
• Programming Graphics Processing Units (GPUs); Bakke; Master Thesis; NTNU; 2009
• Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
• Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
• NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
• Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008

Page 45

EXTRA SLIDES

Page 46

Complexity-Effective Memory Access Scheduling

• On-chip interconnect may interleave requests from different thread processors

• Row locality is destroyed

• Solution: Order-preserving interconnect arbitration policy and in-order scheduling

[Yuan et al., MICRO 2009]

[Figure: queue of requests Req 0 Row A, Req 0 Row B, Req 1 Row A, Req 1 Row B; in-order scheduling needs a row switch between nearly every request, while out-of-order scheduling groups same-row requests and needs only one row switch]

Goal: the performance of out-of-order scheduling with the lower complexity of in-order scheduling