name: kaiyong zhao supervisor: dr. x. -w chu. background & related work multiple-precision...
Post on 20-Jan-2016
216 views
TRANSCRIPT
Name: Kaiyong ZhaoSupervisor: Dr. X. -W Chu
0
20
40
60
80
100
120
2003 2004 2005 2006 2007M
em
ory
ba
nd
wid
th (
GB
/s)
GPU
CPUG80 Ultra
G80
G71
NV40
NV30 Hapertown
W oodcrestPrescott EENorthwood
0
20
40
60
80
100
120
2003 2004 2005 2006 2007M
em
ory
ba
nd
wid
th (
GB
/s)
GPU
CPUG80 Ultra
G80
G71
NV40
NV30 Hapertown
W oodcrestPrescott EENorthwood
•Computing Capability
•Memory Bandwidth
L2
FB
SP SP
L1
TF
Th
rea
d P
roc
es
so
r
Vtx Thread Issue
Setup / Rstr / ZCull
Geom Thread Issue Pixel Thread Issue
Input Assembler
Host
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
SP SP
L1
TF
L2
FB
L2
FB
L2
FB
L2
FB
L2
FB
Streaming Multiprocessor (SM)
Streaming Processor (SP)
CUDA: CPU + GPU CParallel Computing modal
Single instruction Multiple Thread (SIMT)All threads run the same function(1000s threads on the fly)Each core deal with different data
Hidden the IO by multiple-threads(more than 1000s threads)Speed up Computing / IO Translation Coalesce the IO one time When half warp thread access neighboring data1 cycle@GPU vs. ~1000 cycles@CPU
C = vectorA * Matrix B % prime
There is no cache for global memory on G80/G200
Constant memory & texture memory have little cache
IO latency400-600 clock cycles
This is the bottle neckKey to Optimization!
(Device) Grid
ConstantMemory
TextureMemory
GlobalMemory
Block (0, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Host
(Device) Grid
ConstantMemory
TextureMemory
GlobalMemory
Block (0, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Block (1, 0)
Shared Memory
LocalMemory
Thread (0, 0)
Registers
LocalMemory
Thread (1, 0)
Registers
Host
Global memory access by threads in a half-warp can be coalesced
When the words accessed by all threads lie in the same segment of size equal to:
32 bytes if all threads access 8-bit words64 bytes if all threads access 16-bit words 128 bytes if all threads access 32-bit or 64-bit words
Any pattern of addresses requested by the half-warp
Including patterns where multiple threads access the same address
Address 0
Thread 0
Address 4
Address …
Address 116
Address 120
Address 124
Address 128
Address …
Address 172
Address 176
Address 180
Address 184
Address 188
Address 252
Thread 1
Thread 2
Thread 3
Thread …
Thread 14
Thread 15
…
Segment 0 (128B) Segment 1 (128B)
Reduced to 32B Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, 128 bytes for 32-, 64- and 128-bit data.
C = vectorA * Matrix B % prime
CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz
C = vectorA * Matrix B % prime
CPU: Intel® Core™ i7 CPU 860 @ 2.80 GHz (single thread)GPU: XFX GTX280, 1.24 GHz
Summary