Lecture 18, University of California, San Diego (Lec18.ppt, Scott Baden, created 12/2/2008)
Lecture 18
The future of High Performance Computing:
Technological Trends
12/2/08 Scott B. Baden / CSE 160 / Fall 2008 2
Announcements
• Final Exam Review on Thursday
• Practice Questions
• A5 due in section on Friday
Today’s lecture
• Technological Trends
• Case studies:
  Blue Gene/L
  STI Cell Broadband Engine
  Nvidia Tesla
Technological Trends
IBM Blue Gene
STI Cell Broadband Engine
Nvidia Tesla
Blue Gene
• IBM-US Dept. of Energy collaboration
• First generation: Blue Gene/L
  64K dual-processor nodes: 180 (360) TeraFlops peak (1 TeraFlop = 1,000 GigaFlops)
• Low power: relatively slow PowerPC 440 processors, small memory (256 MB)
• High performance interconnect
Next Generation: Blue Gene/P
http://www.redbooks.ibm.com/redbooks/SG247287
• Largest installation at Argonne National Lab: 163,840 cores, 4-way nodes, PowerPC 450 (850 MHz), 2 GB memory per node
• Peak performance: 13.6 Gflops/node = 557 TeraFlops total = 0.56 Petaflops
Blue Gene/P Interconnect
• 3D toroidal mesh (end-around connections)
  5.1 GB/sec bidirectional bandwidth per node
  1.7/2.6 TB/sec bisection bandwidth (72K cores)
  0.5 µs to 5 µs latency (1 hop to farthest node); 3 µs to 10 µs in MPI
• Rapid combining-broadcast network: 1-way latency 1.3 µs (5 µs in MPI)
  Also used to move data between I/O and compute nodes
• Low-latency barrier and interrupt network: one way 0.65 µs (1.6 µs in MPI)
Image courtesy of IBM: http://www.research.ibm.com/journal/rd/492/gara4.gif
Compute nodes
http://www.redbooks.ibm.com/redbooks/SG247287
• Six connections to the torus network @ 425 MB/sec/link (duplex)
• Three connections to the global collective network @ 850 MB/sec/link
• Network routers are embedded within the processor
Processor sets and I/O
Image courtesy of IBM
• Each I/O node services a set of compute nodes
Programming modes
• Virtual node: each node runs 4 MPI processes, 1 per core
  Memory and the torus network are shared by all processes; shared memory is available between processes
• Dual node: each node runs 2 MPI processes, 1 or 2 threads/process
• Symmetric multiprocessing: each node runs 1 MPI process, up to 4 threads
Image courtesy of IBM
Technological Trends
IBM Blue Gene/L
STI Cell Broadband Engine
Nvidia Tesla
Cell Overview
• Chip multiprocessor for multithreaded applications
• Adapted PowerPC: 8 SPEs + 1 PPE
• Shared address space
• SIMD (VMX)
• Software-managed resources on the SPE
• Optimize the common case
Cell BE Block diagram
http://domino.watson.ibm.com/comm/research.nsf/pages/r.arch.innovation.html
[Diagram: the SPEs, a 64-bit Power architecture core, the memory controller, and the interface controller joined by a coherent on-chip bus]
Block diagram showing detail
[Diagram: 8 SPEs, each with 4 ALUs, 4 FPUs, a 128 × 128-bit register file, and 256 KB of local memory; a PPE containing a 64-bit SMT Power core (2-way in-order superscalar) with L1 instruction/data caches and a 512 KB L2; all connected by the EIB to the DMA and I/O controllers]
Courtesy Sam Sandbote
SPE - Synergistic Processing Element
• 128-bit × 128-entry register file
• 256 KB local store
• Software-managed data transfers
  Rambus XDR DRAM interface: 3.2 GB/sec/SPE, 25.6 GB/s aggregate
• DMA ⇔ local store via EIB
• EIB: 4 rings × 128 bits wide, 200 GB/sec
  Multiple simultaneous requests: 16 × 16 KB
  Priority: DMA, load/store, fetch
Image courtesy Keith O'Conor, http://isg.cs.tcd.ie/oconork/presentations/CellBroadbandEngine.ppt
Usage scenarios
Image courtesy Keith O'Conor, http://isg.cs.tcd.ie/oconork/presentations/CellBroadbandEngine.ppt
Pipeline Multithreaded Function offload
Technological Trends
Blue Gene/L
STI Cell Broadband Engine
Nvidia Tesla
NVIDIA GeForce GTX 280
• Hierarchically organized clusters of streaming multiprocessors
  240 cores @ 1.296 GHz; peak performance 933.12 Gflops
• Parallel computing or graphics mode
• 1 GB memory (frame buffer)
• 512-bit memory interface @ 132 GB/s
Streaming processing cluster
• Each cluster contains 3 streaming multiprocessors
• Each multiprocessor contains 8 cores that share a local memory
• Each core supports scalar processing
• Double precision is 10× slower than single precision
Streaming Multiprocessor
• 8 cores (Streaming Processors), each containing a 32-bit fused multiply-adder
• Each SM contains one 64-bit fused multiply-adder
• 2 Super Function Units (SFUs), each containing 2 fused multiply-adders
• 16 KB shared memory
• 16K registers
[Diagram: a Streaming Multiprocessor with instruction fetch/dispatch, instruction and data L1 caches, 8 SPs, 2 SFUs, and shared memory]
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
Detail inside the Streaming Multiprocessor
Courtesy H. Goto
Die pictures of GTX 280
Dual-Core Penryn vs. GTX 280; the GTX 280 has 1.4B transistors
CUDA
• Programming environment with extensions to C
• Model: SPMD execution; serial code on the CPU launches a sequence of multi-threaded kernels on the "device" (GPU)
• Threads are extremely lightweight
[Diagram: execution alternates between CPU serial code and GPU parallel kernels]
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
CUDA
• Hierarchical thread organization
• The basic computational unit is the grid
• Grids are composed of thread blocks
  A block is a cooperative group of threads that synchronize and share data via fast on-chip shared memory
  Threads in different blocks cannot communicate except through slow global memory
  Blocks within a grid are virtualized; the number of threads per block and the number of blocks are configurable
• Parallel calls to a kernel specify the layout:
  kernelFunction <<< gridDims, blockDims >>> (args)
• The compiler will re-arrange loads to hide latencies
CUDA Memory Architecture
[Diagram: the CPU and its main memory sit beside the grid (graphics card), which holds global and constant memory; each block, e.g. Block (0,0) and Block (1,0), has its own shared memory, and each thread within a block has private registers]
Block scheduling (8800)
[Diagram: the host launches Kernel 1 as Grid 1 (a 3 × 2 array of blocks) and Kernel 2 as Grid 2 on the device; one block, Block (1,1), is expanded to show its 5 × 3 array of threads]
Courtesy of NVIDIA
Threads are assigned to an SM in units of blocks, up to 8 blocks per SM
Warp scheduling (8800)
[Diagram: the SM's multithreaded warp scheduler interleaves instructions from different warps over time, e.g. warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, warp 8 instruction 12, warp 3 instruction 96]
Courtesy David Kirk/NVIDIA and Wen-mei Hwu/UIUC
• Blocks contain multiple warps, each a group of SIMD threads
• Half-warps are the unit of scheduling (16 threads currently)
• Hardware-scheduled warps hide latency (zero scheduling overhead)
• The scheduler looks for an eligible warp with all operands ready
• All threads in a warp execute the same instruction
• On a branch, each taken path is executed serially, with non-participating threads disabled
• Many warps are needed to hide memory latency
• Registers are shared by all threads in a block
CUDA Coding Example
Programming issues
• Branches serialize execution within a warp
• Registers are dynamically partitioned across the blocks resident on a Streaming Multiprocessor
• Tradeoff: more blocks with fewer threads, or more threads with fewer blocks
  Locality: want small blocks of data (and hence more plentiful warps) that fit into fast memory
  Register consumption
  Scheduling: hide latency