
1

Trial Lecture

The Use of GPUs for High-Performance Computing

12 October 2010

Magnus Jahre

2

Graphics Processors (GPUs)

• Modern computers are graphics intensive

• Advanced 3D graphics require a significant amount of computation

Graphics Card (Source: nvidia.com)

Solution: Add a Graphics Processor (GPU)

3

High-Performance Computing

High-Performance Computing (HPC): efficient use of computers for computationally intensive problems in science or engineering

General-Purpose Programming on GPUs (GPGPU)

[Figure: application domains plotted by processing demand versus communication demand — office applications, computational computer architecture, weather forecasting, climate modeling, and dynamic molecular simulation; a third dimension is main memory capacity]

4

Outline

• GPU Evolution

• GPU Programming

• GPU Architecture

• Achieving High GPU Performance

• Future Trends

• Conclusions

5

GPU EVOLUTION

6

First GPUs: Fixed Hardware

[Blythe 2008]

[Figure: fixed-function graphics pipeline — Vertex Data and Texture Maps feed Vertex Processing, then Rasterization, Fragment Processing, and Framebuffer Operations, which write the Depth Buffer and Color Buffer]

7

Programmable Shaders

[Figure: the same graphics pipeline, with the Vertex Processing and Fragment Processing stages now programmable; Rasterization and Framebuffer Operations remain fixed-function]

Motivation: More flexible graphics processing

8

GPGPU with Programmable Shaders

[Figure: the graphics pipeline as used for GPGPU — general-purpose computation is expressed through the programmable vertex and fragment stages]

Use a graphics library to gain access to the GPU

Use color values to encode data

The effect of the fixed-function stages must be accounted for

9

Functional Unit Utilization

[Figure: the graphics pipeline with dedicated vertex processing and fragment processing units; how busy each unit is depends on the workload]

10

Functional Unit Utilization

[Figure: functional unit utilization under a vertex-intensive shader, a fragment-intensive shader, and a unified shader architecture]

11

Unified Shader Architecture

• Exploit parallelism
  – Data parallelism
  – Task parallelism

• Data parallel processing (SIMD/SIMT)

• Hide memory latencies

• High bandwidth

Architecture naturally supports GPGPU

[Figure: unified shader architecture — a thread scheduler dispatches work over an interconnect to clusters of streaming processors (SP), each cluster with its own memory, backed by on-chip memory or cache and off-chip DRAM memory]

12

GPU PROGRAMMING

13

GPGPU Tool Support

[Figure: timeline 2000–2010 — the shift from programmable shaders to unified shaders, the appearance of GPGPU tools (Sh, PeakStream, Accelerator, GPU++, CUDA, OpenCL), and the number of GPU papers at the Supercomputing conference over the same period]

14

Compute Unified Device Architecture (CUDA)

• Most code is normal C++ code

• Code to run on the GPU is organized in kernels

• The CPU sets up and manages the computation

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}
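The main() above elides the host-side setup. A minimal sketch of what it typically involves is shown below; the allocation, initialization, and cleanup details are assumptions for illustration, not part of the lecture.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    const int N = 512;
    const size_t bytes = N * sizeof(float);

    // Allocate and initialize host arrays.
    float* a_h = (float*)malloc(bytes);
    float* b_h = (float*)malloc(bytes);
    float* c_h = (float*)malloc(bytes);
    for (int i = 0; i < N; i++) { a_h[i] = (float)i; b_h[i] = 2.0f * i; }

    // Allocate device arrays and copy the inputs to GPU global memory.
    float *a_d, *b_d, *c_d;
    cudaMalloc((void**)&a_d, bytes);
    cudaMalloc((void**)&b_d, bytes);
    cudaMalloc((void**)&c_d, bytes);
    cudaMemcpy(a_d, a_h, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, bytes, cudaMemcpyHostToDevice);

    // Launch one block of N threads, then copy the result back.
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    cudaMemcpy(c_h, c_d, bytes, cudaMemcpyDeviceToHost);
    printf("c[10] = %f\n", c_h[10]);

    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);
    free(a_h); free(b_h); free(c_h);
    return 0;
}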

15

Thread/Data Organization

• Hierarchical thread organization
  – Grid
  – Block
  – Thread

• A block can have a maximum of 512 threads

• 1D, 2D and 3D mappings possible

[Figure: a 2D grid of blocks (0,0)–(1,2) and a 1D grid of blocks (0)–(1)]
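To make the hierarchy concrete, a kernel combines the block and thread indices into a global index. The 2D example below is an assumed illustration; the kernel name and matrix layout are not from the lecture.

// Hypothetical 2D example: each thread scales one element of a width x height matrix.
__global__ void scale(float* data, int width, int height, float factor) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;  // column index in the grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;  // row index in the grid
    if (x < width && y < height)
        data[y * width + x] *= factor;
}

// Host-side launch: 16 x 16 = 256 threads per block (within the 512-thread limit),
// with enough blocks in the grid to cover the whole matrix.
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scale<<<grid, block>>>(data_d, width, height, 2.0f);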

16

Vector Addition Example

[Figure: the input vectors A and B are copied from CPU main memory to GPU global memory, the streaming processors compute the result vector C, and C is copied back to main memory; each group of SPs has its own local memory]

17

Terminology: Warp

A collection of concurrently processed threads is called a warp

18

Vector Addition Profile

• Only 11% of GPU time is used to add vectors

• The arithmetic intensity of the problem is too low

• Overlapping data copy and computation could help

[Figure: breakdown of GPU time — vector_add 11%, memcpyHtoD 58%, memcpyDtoH 32%]

Hardware: NVIDIA NVS 3100M
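One way to realize the suggested overlap is to split the vectors into chunks and pipeline the copies and kernel launches through CUDA streams. The sketch below is an assumed illustration: it presumes a vector_add variant that computes a global index from blockIdx and threadIdx, host buffers allocated with cudaMallocHost (page-locked, so asynchronous copies can overlap with execution), and a length divisible by the chunk size.

void add_in_chunks(const float* a_h, const float* b_h, float* c_h,
                   float* a_d, float* b_d, float* c_d, int n) {
    const int CHUNK = 1 << 20;
    cudaStream_t streams[2];
    for (int i = 0; i < 2; i++) cudaStreamCreate(&streams[i]);

    for (int c = 0; c < n / CHUNK; c++) {
        cudaStream_t s = streams[c % 2];
        int off = c * CHUNK;
        size_t bytes = CHUNK * sizeof(float);
        // Queue copy-in, compute, and copy-out for this chunk in one stream;
        // work queued in the other stream can overlap with it.
        cudaMemcpyAsync(a_d + off, a_h + off, bytes, cudaMemcpyHostToDevice, s);
        cudaMemcpyAsync(b_d + off, b_h + off, bytes, cudaMemcpyHostToDevice, s);
        vector_add<<<CHUNK / 256, 256, 0, s>>>(a_d + off, b_d + off, c_d + off);
        cudaMemcpyAsync(c_h + off, c_d + off, bytes, cudaMemcpyDeviceToHost, s);
    }
    for (int i = 0; i < 2; i++) cudaStreamSynchronize(streams[i]);
}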

19

Will GPUs Save the World?

• Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]

• GPGPU has provided nice speedups for problems that fit the architecture

• Metric challenge: The practitioner needs performance per developer hour

20

GPU ARCHITECTURE

21

NVIDIA Tesla Architecture

Figure reproduced from [Lindholm et al.; 2008]

22

Control Flow

• The threads in a warp share the same instruction

• Branching is efficient if all threads in a warp branch in the same direction

• Divergent branches within a warp cause serial execution of both paths

[Figure: on a divergent IF, the warp first executes the path of the condition-true threads and then the path of the condition-false threads]
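As a small assumed illustration (expensive_a and expensive_b stand for two arbitrary __device__ functions; they are not from the lecture): branching on a per-thread value diverges within a warp, while branching on a per-warp quantity does not.

__global__ void branch_example(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Divergent: even and odd lanes of the same warp take different paths,
    // so the warp executes both paths serially.
    if (i % 2 == 0) out[i] = expensive_a(in[i]);
    else            out[i] = expensive_b(in[i]);

    // Warp-uniform: all threads of a warp (warp size assumed to be 32) take
    // the same path, so each warp executes only one of the two paths.
    if ((i / 32) % 2 == 0) out[i] = expensive_a(in[i]);
    else                   out[i] = expensive_b(in[i]);
}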

23

Modern DRAM Interfaces

• Maximize bandwidth with 3D organization

• Repeated requests to the row buffer are very efficient

[Figure: DRAM organized in banks of rows and columns — a row address loads a row into the row buffer, and a column address selects data within the row buffer]

24

Access Coalescing

• Global memory accesses from all threads in a half-warp are combined into a single memory transaction

• All memory elements in a segment are accessed

• Segment size can be halved if only the lower or upper half is used

Assumes Compute Capability 1.2 or higher

[Figure: threads 0–7 accessing consecutive words in the address range 112–156 — aligned, consecutive accesses (addresses 128–156) are combined into a single transaction, while a pattern that also touches addresses 112–124 requires an additional transaction]
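A sketch of the two access patterns in code (an assumed illustration, not the lecture's example):

__global__ void coalescing_example(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Coalesced: thread i reads element i, so the threads of a half-warp touch
    // one contiguous, aligned segment and are served by a single transaction.
    out[i] = in[i];

    // Not coalesced: a stride of 32 floats spreads the half-warp's reads over
    // many segments, turning one transaction into many (assumes "in" is large
    // enough for the strided indices).
    out[i] += in[i * 32];
}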

25

Bank Conflicts

• Memory banks can service requests independently

• Bank conflict: more than one thread accesses the same bank concurrently

• Strided access patterns can cause bank conflicts

[Figure: threads 0–7 mapped onto banks 0–7 — with stride-one accesses each thread hits a different bank]

Stride-two accesses give a 2-way bank conflict
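In code, the stride-two case looks like the following assumed sketch (16 shared memory banks and per-half-warp conflict resolution, as on compute capability 1.x; launched with a single block of 256 threads):

__global__ void bank_example(float* out) {
    __shared__ float tile[512];
    tile[threadIdx.x] = (float)threadIdx.x;
    tile[threadIdx.x + 256] = 0.0f;   // initialize the upper half as well
    __syncthreads();

    // Stride one: consecutive threads hit consecutive banks -- no conflict.
    float ok = tile[threadIdx.x];

    // Stride two: threads 0 and 8 of a half-warp both hit bank 0, threads 1 and 9
    // both hit bank 1, and so on -- a 2-way bank conflict.
    float conflict = tile[threadIdx.x * 2];

    out[threadIdx.x] = ok + conflict;
}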

26

NVIDIA Fermi

• Next-generation computing chip from NVIDIA

• Aims to alleviate important bottlenecks
  – Improved double-precision floating-point support
  – Cache hierarchy
  – Concurrent kernel execution

• More problems can be solved efficiently on a GPU

Figure reproduced from [NVIDIA; 2010]

27

ACHIEVING HIGH GPU PERFORMANCE

28

Which problems fit the GPU model?

• Fine-grained data parallelism available
• Sufficient arithmetic intensity
• Sufficiently regular data access patterns

It’s all about organizing data

Optimized memory system use enables high performance

29

Increase Computational Intensity

• Memory types:
  – On-chip shared memory: small and fast
  – Off-chip global memory: large and slow

• Technique: Tiling
  – Choose the tile size such that it fits in the shared memory
  – Increases locality by reducing the reuse distance

[Figure: tiled matrix multiplication A × B = C — a tile of A and a tile of B are loaded once and reused for many elements of C]
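A sketch of the tiling technique for matrix multiplication (an assumed illustration with 16 x 16 tiles and a square matrix whose dimension n is a multiple of the tile size; not the lecture's own code):

#define TILE 16

// Each block computes one TILE x TILE tile of C; each thread computes one element.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];   // tile of A staged in fast on-chip shared memory
    __shared__ float Bs[TILE][TILE];   // tile of B staged in fast on-chip shared memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; t++) {
        // Load one tile of A and one tile of B from slow global memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each loaded element is reused TILE times from shared memory,
        // raising the ratio of arithmetic to global memory traffic.
        for (int k = 0; k < TILE; k++)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * n + col] = sum;
}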

30

Memory Layout

• Exploit coalescing to achieve high bandwidth

• Linear access necessary

• Solution: Tiling

[Figure: matrix multiplication A × B = C with row-major storage — threads reading consecutive elements along a row access consecutive addresses (coalesced), while threads reading down a column do not (not coalesced)]

31

Avoid Branching Inside Warps

[Figure: two ways to sum 8 values in a reduction tree, assuming 2 threads per warp (W1–W4) — with a naive thread-to-element mapping every iteration has divergent warps, while a mapping that keeps the active threads contiguous diverges in only one iteration]
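The standard way to get the "one iteration diverges" behavior in a shared-memory reduction (a sketch, not the lecture's code) is to keep the active threads in a contiguous range instead of scattering them with a growing stride:

// Assumes blocks of 256 threads; each block sums 256 elements of "in".
__global__ void reduce_sum(const float* in, float* out) {
    __shared__ float s[256];
    s[threadIdx.x] = in[blockIdx.x * 256 + threadIdx.x];
    __syncthreads();

    // Divergent variant (avoid): if (threadIdx.x % (2 * stride) == 0) ...
    // scatters the active threads across all warps, so every warp diverges
    // in every iteration.

    // Better: the active threads 0..stride-1 form a contiguous range, so whole
    // warps are either fully active or fully idle until stride drops below
    // the warp size.
    for (int stride = 128; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) out[blockIdx.x] = s[0];
}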

32

Automation

• Thread resource usage must be balanced with the number of concurrent threads [Ryoo et al., PPoPP 2008]
  – Avoid saturation
  – The sweet spot varies between devices
  – The sweet spot varies with problem size

• Auto-tuning the 3D FFT [Nukada et al., SC 2009]
  – Balances resource consumption against parallelism through kernel radix and ordering
  – Chooses the best number of thread blocks automatically
  – Inserts padding to avoid shared memory bank conflicts
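As a toy illustration of the auto-tuning idea (an assumed sketch, not Nukada et al.'s system; some_kernel stands for whatever kernel is being tuned), the host can time a sweep of launch configurations and keep the fastest:

// Hypothetical sweep over thread-block sizes using CUDA events for timing.
float best_ms = 1e30f;
int best_threads = 0;
for (int threads = 64; threads <= 512; threads *= 2) {
    int blocks = (N + threads - 1) / threads;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    some_kernel<<<blocks, threads>>>(data_d, N);   // kernel under tuning (assumed)
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    if (ms < best_ms) { best_ms = ms; best_threads = threads; }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
}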

33

Case Study: Dynamic Molecular Simulation with NAMD

Simulate the interaction of atoms due to the laws of atomic physics and quantum chemistry [Phillips; SC2009]

34

Key Performance Enablers

• Careful division of labor between GPU and CPU
  – GPU: short-range non-bonded forces
  – CPU: long-range electrostatic forces and coordinate updates

• Overlap CPU and GPU execution through asynchronous kernel execution

• Use event recording to track progress in asynchronously executing streams

[Phillips et al., SC2008]
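In CUDA terms, the last two points might look like the following assumed sketch (the kernel and CPU function names are hypothetical, not NAMD's actual code):

cudaStream_t stream;
cudaEvent_t remote_done;
cudaStreamCreate(&stream);
cudaEventCreate(&remote_done);

// Queue the GPU force kernels asynchronously ...
compute_remote_forces<<<grid, block, 0, stream>>>(/* ... */);
cudaEventRecord(remote_done, stream);       // mark progress in the stream
compute_local_forces<<<grid, block, 0, stream>>>(/* ... */);

// ... while the CPU does its share of the work concurrently.
integrate_coordinates_on_cpu(/* ... */);

// Poll the event to learn whether the remote forces are ready to be sent off.
if (cudaEventQuery(remote_done) == cudaSuccess) {
    // Remote results are available; communicate them to the other nodes.
}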

35

CPU/GPU Cooperation in NAMD

[Phillips et al., SC2008]

[Figure: timeline of overlapped CPU and GPU work within one timestep — remote forces, local forces, and the coordinate update, with forces (f) and coordinates (x) exchanged between CPU and GPU]

36

Challenges

• Completely restructuring legacy software systems is prohibitive

• Batch processing software is unaware of GPUs

• Interoperability issues with pinning main memory pages for DMA

[Phillips et al., SC2008]

37

FUTURE TRENDS

38

Accelerator Integration

• The industry is moving towards integrating CPUs and GPUs on the same chip
  – AMD Fusion [Brookwood; 2010]
  – Intel Sandy Bridge (fixed-function GPU)

• Are other accelerators appropriate?
  – Single-chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]

AMD Fusion (figure reproduced from [Brookwood; 2010])

39

Vector Addition Revisited

Start-up and shut-down data transfers are the main bottleneck

Fusion eliminates these overheads by storing values in the on-chip cache

Using accelerators becomes more feasible

40

Memory System Scalability

• Current CPU bottlenecks:
  – The number of pins on a chip grows slowly
  – Off-chip bandwidth grows slowly

• Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in bandwidth demand

• Conflicting requirements:
  – GPU: high bandwidth, not latency sensitive
  – CPU: high bandwidth, can be latency sensitive

41

CONCLUSIONS

42

Conclusions

• GPUs can offer a significant speedup for problems that fit the model

• Tool support and flexible architectures increase the number of problems that fit the model

• CPU/GPU on-chip integration can reduce GPU start-up overheads

43

Thank You

Visit our website: http://research.idi.ntnu.no/multicore/

44

References

• Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
• Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
• NVIDIA's Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
• AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
• Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master Thesis; NTNU; 2010
• Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
• Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
• Programming Graphics Processing Units (GPUs); Bakke; Master Thesis; NTNU; 2009
• Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
• Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
• NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
• Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008

45

EXTRA SLIDES

46

Complexity-Effective Memory Access Scheduling

• The on-chip interconnect may interleave requests from different thread processors

• Row locality is destroyed

• Solution: an order-preserving interconnect arbitration policy and in-order scheduling

[Yuan et al., MICRO 2009]

[Figure: DRAM request timeline for the queue (Req 0 Row A, Req 0 Row B, Req 1 Row A, Req 1 Row B) — a schedule that interleaves rows pays a row switch between almost every request, while one that groups requests to the same row needs far fewer row switches]

Achieves the performance of out-of-order scheduling with less complex in-order scheduling
