name: kaiyong zhao supervisor: dr. x. -w chu. background & related work multiple-precision...

Name: Kaiyong ZhaoSupervisor: Dr. X. -W Chu

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

0

20

40

60

80

100

120

2003 2004 2005 2006 2007M

em

ory

ba

nd

wid

th (

GB

/s)

GPU

CPUG80 Ultra

G80

G71

NV40

NV30 Hapertown

W oodcrestPrescott EENorthwood

•Computing Capability

•Memory Bandwidth

L2

FB

SP SP

L1

TF

Th

rea

d P

roc

es

so

r

Vtx Thread Issue

Setup / Rstr / ZCull

Geom Thread Issue Pixel Thread Issue

Input Assembler

Host

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

SP SP

L1

TF

L2

FB

L2

FB

L2

FB

L2

FB

L2

FB

Streaming Multiprocessor (SM)

Streaming Processor (SP)

CUDA: CPU + GPU CParallel Computing modal

Single instruction Multiple Thread (SIMT)All threads run the same function(1000s threads on the fly)Each core deal with different data

Hidden the IO by multiple-threads(more than 1000s threads)Speed up Computing ／ IO Translation Coalesce the IO one time When half warp thread access neighboring data1 cycle@GPU vs. ~1000 cycles@CPU

C = vectorA * Matrix B % prime

There is no cache for global memory on G80/G200

Constant memory & texture memory have little cache

IO latency400-600 clock cycles

This is the bottle neckKey to Optimization!

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

(Device) Grid

ConstantMemory

TextureMemory

GlobalMemory

Block (0, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Block (1, 0)

Shared Memory

LocalMemory

Thread (0, 0)

Registers

LocalMemory

Thread (1, 0)

Registers

Host

Global memory access by threads in a half-warp can be coalesced

When the words accessed by all threads lie in the same segment of size equal to:

32 bytes if all threads access 8-bit words64 bytes if all threads access 16-bit words 128 bytes if all threads access 32-bit or 64-bit words

Any pattern of addresses requested by the half-warp

Including patterns where multiple threads access the same address

Address 0

Thread 0

Address 4

Address …

Address 116

Address 120

Address 124

Address 128

Address …

Address 172

Address 176

Address 180

Address 184

Address 188

Address 252

Thread 1

Thread 2

Thread 3

Thread …

Thread 14

Thread 15

…

Segment 0 (128B) Segment 1 (128B)

Reduced to 32B Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.

CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz

Summary

name: kaiyong zhao supervisor: dr. x. -w chu. background & related work multiple-precision...

Documents