![Page 1: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/1.jpg)
Final Review
Dr. Bernard Chen, Ph.D.
University of Central Arkansas
Fall 2010
![Page 2: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/2.jpg)
Overcome Data Hazards with Dynamic Scheduling
Key idea: allow instructions behind a stall to proceed.
DIV F0 <- F2/F4
ADD F10 <- F0+F8
SUB F12 <- F8-F14
![Page 3: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/3.jpg)
Overcome Data Hazards with Dynamic Scheduling
Key idea: allow instructions behind a stall to proceed.
DIV F0 <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8
![Page 4: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/4.jpg)
Overcome Data Hazards with Dynamic Scheduling
Key idea: allow instructions behind a stall to proceed.
DIV F0 <- F2/F4
SUB F12 <- F8-F14
ADD F10 <- F0+F8
Enables out-of-order execution and allows out-of-order completion (e.g., SUB can finish before the long-running DIV).
In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue).
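The effect of letting SUB slip past the stalled ADD can be sketched with a toy timing model. Everything numeric here is an assumption made for illustration: one instruction issues per cycle, functional units are unlimited, and the latencies (DIV 10 cycles, ADD and SUB 2 cycles) are invented. The helper names `Instr` and `schedule` are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    dest: str
    srcs: tuple
    latency: int  # assumed execution latency in cycles

def schedule(program, dynamic):
    """Return the cycle in which each instruction's result becomes available.

    Instructions issue in order, one per cycle. With dynamic=False an
    instruction may not begin before the one ahead of it has begun, so a
    stalled instruction blocks everything behind it. Name (WAR/WAW) hazards
    are ignored; the slide's three instructions have none.
    """
    ready_at = {}   # register -> cycle in which its value is available
    finish = {}
    prev_start = 0
    for issue_cycle, ins in enumerate(program, start=1):
        operands_ready = max((ready_at.get(r, 0) for r in ins.srcs), default=0)
        start = max(issue_cycle, operands_ready)
        if not dynamic:
            start = max(start, prev_start + 1)
        prev_start = start
        finish[ins.name] = start + ins.latency
        ready_at[ins.dest] = finish[ins.name]
    return finish

program = [
    Instr("DIV", "F0", ("F2", "F4"), 10),
    Instr("ADD", "F10", ("F0", "F8"), 2),
    Instr("SUB", "F12", ("F8", "F14"), 2),
]
in_order = schedule(program, dynamic=False)
out_of_order = schedule(program, dynamic=True)
print("in-order:    ", in_order)
print("out-of-order:", out_of_order)
```

Under these assumptions SUB waits behind ADD in the in-order case, while in the dynamic case it completes before DIV does, which is the out-of-order completion the slide mentions.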
![Page 5: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/5.jpg)
Overcome Data Hazards with Dynamic Scheduling
However, dynamic execution creates WAR and WAW hazards and makes exceptions harder to handle.
Name dependence: two instructions use the same register or memory location, called a name, but there is no flow of data between the instructions associated with that name.
There are 2 versions of name dependence: anti-dependence (WAR) and output dependence (WAW).
![Page 6: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/6.jpg)
WAR
InstrJ writes an operand before InstrI reads it.
If this anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7
![Page 7: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/7.jpg)
WAW
InstrJ writes an operand before InstrI writes it.
If this output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7
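As a rough illustration, the dependences in the two examples above can be found by comparing destination and source registers of an earlier and a later instruction. The functions `parse` and `classify_dependences` are hypothetical helpers written for this sketch, and the three-operand `op dest,src1,src2` text format is assumed from the slide examples.

```python
def parse(text):
    """Split 'sub r4,r1,r3' into (dest, (src1, src2))."""
    _op, regs = text.split()
    dest, *srcs = regs.split(",")
    return dest, tuple(srcs)

def classify_dependences(earlier, later):
    """List the dependences of `later` on `earlier` (program order)."""
    dest1, srcs1 = parse(earlier)
    dest2, srcs2 = parse(later)
    found = []
    if dest1 in srcs2:
        found.append("RAW")   # true data dependence
    if dest2 in srcs1:
        found.append("WAR")   # anti-dependence
    if dest1 == dest2:
        found.append("WAW")   # output dependence
    return found

# WAR slide: J overwrites r1, which I still needs to read.
print(classify_dependences("sub r4,r1,r3", "add r1,r2,r3"))  # ['WAR']
# WAW slide: I and J both write r1.
print(classify_dependences("sub r1,r4,r3", "add r1,r2,r3"))  # ['WAW']
# In both slides K reads the r1 produced by J.
print(classify_dependences("add r1,r2,r3", "mul r6,r1,r7"))  # ['RAW']
```

Whether a dependence becomes an actual hazard depends on the pipeline; this sketch only detects the dependence itself.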
![Page 8: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/8.jpg)
Thread-level parallelism (TLP)
Thread: a process with its own instructions and data.
A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program.
Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
(Ch4: Data Level Parallelism: Perform identical operations on data, and lots of data)
![Page 9: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/9.jpg)
New Approach: Multithreaded Execution
Multithreading: multiple threads share the functional units of 1 processor via overlapping.
The processor must duplicate the independent state of each thread, e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table.
![Page 10: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/10.jpg)
New Approach: Multithreaded Execution
When to switch?
Alternate instructions per thread (fine grain)
When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
![Page 11: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/11.jpg)
Fine-Grained Multithreading
Switches between threads on each instruction, causing the execution of multiple threads to be interleaved.
Usually done in a round-robin fashion, skipping any stalled threads.
The CPU must be able to switch threads every clock.
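The round-robin selection with stall skipping can be sketched as follows. This is a toy model, not a description of any real processor: the function name `fine_grained_schedule` and the `stalled` map (cycle number to the set of thread ids stalled in that cycle) are assumptions made for the example.

```python
def fine_grained_schedule(num_threads, cycles, stalled):
    """Pick one thread per cycle, round-robin, skipping stalled threads.

    Returns a list with the chosen thread id for each cycle, or None for a
    cycle in which every thread is stalled (an idle issue slot).
    """
    order = []
    next_thread = 0
    for cycle in range(cycles):
        blocked = stalled.get(cycle, set())
        chosen = None
        for offset in range(num_threads):
            candidate = (next_thread + offset) % num_threads
            if candidate not in blocked:
                chosen = candidate
                break
        order.append(chosen)
        if chosen is not None:
            next_thread = (chosen + 1) % num_threads
    return order

# Three threads; thread 1 is stalled during cycles 1 and 2.
print(fine_grained_schedule(3, 6, {1: {1}, 2: {1}}))  # [0, 2, 0, 1, 2, 0]
```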
![Page 12: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/12.jpg)
Coarse-Grained Multithreading
Switches threads only on costly stalls, such as L2 cache misses.
Advantages:
• Relieves the need for very fast thread switching
• Doesn't slow down a thread, since instructions from other threads are issued only when that thread encounters a costly stall
![Page 13: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/13.jpg)
Coarse-Grained Multithreading
Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs.
• Since the CPU issues instructions from 1 thread, when a stall occurs the pipeline must be emptied or frozen
• The new thread must fill the pipeline before instructions can complete
Because of this start-up overhead, coarse-grained multithreading is better for reducing the penalty of high-cost stalls, where pipeline refill << stall time.
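The "pipeline refill << stall time" condition can be made concrete with a deliberately simple cost model. The function `idle_cycles_per_stall`, the switch threshold, and all cycle counts below are assumptions chosen for illustration; real designs decide when to switch in hardware-specific ways.

```python
def idle_cycles_per_stall(stall_cycles, refill_cycles, switch_threshold):
    """Toy model: issue cycles lost to one stall under coarse-grained multithreading."""
    if stall_cycles >= switch_threshold:
        # Costly stall: switch threads and pay only the pipeline refill.
        return refill_cycles
    # Short stall: stay on the same thread and wait it out.
    return stall_cycles

for stall in (3, 20, 200):
    lost = idle_cycles_per_stall(stall, refill_cycles=10, switch_threshold=50)
    print(f"stall={stall:>3} cycles -> {lost} idle cycles")
```

In this model a 200-cycle stall costs only the 10-cycle refill, while the 3-cycle and 20-cycle stalls are absorbed in full, which is why the approach pays off mainly for long stalls.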
![Page 14: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/14.jpg)
Multithreaded Categories
[Figure legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5]
![Page 15: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/15.jpg)
Multithreaded Categories
[Figure: issue slots over time (processor cycle) for Superscalar, Fine-Grained, and Coarse-Grained (2 clock cycle) designs. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, idle slot]
![Page 16: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/16.jpg)
Flynn’s Taxonomy
M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, 1900-1909, Dec. 1966.
• Single Instruction Single Data (SISD) (Uniprocessor)
• Single Instruction Multiple Data (SIMD) (single PC/Server)
• Multiple Instruction Single Data (MISD) (????)
• Multiple Instruction Multiple Data (MIMD) (Clusters, SMP servers)
![Page 17: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/17.jpg)
Back to Basics
“A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.”
Parallel Architecture = Computer Architecture + Communication Architecture
2 classes of multiprocessors WRT memory:
1. Centralized Memory Multiprocessor
• < few dozen processor chips
• Small enough to share a single, centralized memory
2. Physically Distributed-Memory Multiprocessor
• Larger number of chips and cores
• BW demands require memory to be distributed among processors
![Page 18: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/18.jpg)
![Page 19: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/19.jpg)
![Page 20: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/20.jpg)
2 Models for Communication and Memory Architecture
The first kind: communication occurs through a shared address space.
Centralized memory multiprocessors use this type of communication and are called symmetric shared memory multiprocessors.
![Page 21: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/21.jpg)
2 Models for Communication and Memory Architecture
The first kind: communication occurs through a shared address space.
Even physically separate memories can be addressed as one logically shared address space, meaning that a memory reference can be made by any processor to any memory location (assuming it has the access rights).
These multiprocessors are called distributed shared memory (DSM) multiprocessors.
![Page 22: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/22.jpg)
2 Models for Communication and Memory Architecture
1. Communication occurs through a shared address space (via loads and stores): shared memory multiprocessors, either
• symmetric shared memory (centralized memory MP)
• distributed shared memory (distributed memory MP)
2. Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed memory MP)
![Page 23: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/23.jpg)
Multiprocessor Performance
Amdahl’s Law
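A minimal sketch of the standard form of Amdahl's Law, speedup = 1 / ((1 - f) + f / n). The names are chosen here for illustration: `parallel_fraction` (f) is the fraction of execution time that can be parallelized, and `processors` (n) is the number of processors it is spread across. The sample fractions and processor counts are arbitrary.

```python
def amdahl_speedup(parallel_fraction, processors):
    """Overall speedup when `parallel_fraction` of the work runs on `processors` CPUs."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if processors < 1:
        raise ValueError("processors must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)

for n in (2, 4, 16, 1024):
    print(f"f=0.80, n={n:>4}: speedup = {amdahl_speedup(0.80, n):.2f}")
```

With 80% of the work parallelizable, the speedup approaches but never reaches 1 / 0.2 = 5, no matter how many processors are added.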
![Page 24: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/24.jpg)
2 Classes of Cache Coherence Protocols
1. Snooping: every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept
2. Directory based: the sharing status of a block of physical memory is kept in just one location, the directory
![Page 25: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/25.jpg)
Snooping
Write through: the information is written to both the block in the cache and the block in the lower-level memory.
Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed.
![Page 26: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/26.jpg)
Snooping (write back)

| Time | Processor activity | Bus activity | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X |
|---|---|---|---|---|---|
| 0 | | | | | 0 |
| 1 | CPU A reads X | Cache miss for X | 0 | | 0 |
| 2 | CPU B reads X | Cache miss for X | 0 | 0 | 0 |
| 3 | CPU A writes 1 to X | Invalidation for X | 1 | | 0 |
| 4 | CPU B reads X | Cache miss for X | 1 | 1 | 1 |
![Page 27: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/27.jpg)
Snooping (write through)

| Time | Processor activity | Bus activity | Contents of CPU A’s cache | Contents of CPU B’s cache | Contents of memory location X |
|---|---|---|---|---|---|
| 0 | | | | | 0 |
| 1 | CPU A reads X | Cache miss for X | 0 | | 0 |
| 2 | CPU B reads X | Cache miss for X | 0 | 0 | 0 |
| 3 | CPU A writes 1 to X | Invalidation for X | 1 | | 1 |
| 4 | CPU B reads X | Cache miss for X | 1 | 1 | 1 |
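The two traces differ only in when memory sees the new value, which a small simulation can make explicit. This is a simplified sketch of a write-invalidate snooping scheme for a single one-word location X; the class `SnoopingBus`, its methods, and the `run` helper are hypothetical names, and real protocols track more states than this.

```python
class SnoopingBus:
    """Toy write-invalidate snooping model for one memory location X."""

    def __init__(self, cpus, write_through, initial=0):
        self.memory = initial
        self.write_through = write_through
        self.cache = {cpu: None for cpu in cpus}  # None = no valid copy
        self.dirty_owner = None                   # used by write back only

    def read(self, cpu):
        if self.cache[cpu] is None:               # cache miss goes on the bus
            if self.dirty_owner is not None:      # owner supplies data, memory updated
                self.memory = self.cache[self.dirty_owner]
                self.dirty_owner = None
            self.cache[cpu] = self.memory
        return self.cache[cpu]

    def write(self, cpu, value):
        for other in self.cache:                  # invalidation broadcast
            if other != cpu:
                self.cache[other] = None
        self.cache[cpu] = value
        if self.write_through:
            self.memory = value                   # memory updated immediately
        else:
            self.dirty_owner = cpu                # memory updated later

    def snapshot(self):
        """(cache contents per CPU..., memory contents)."""
        return tuple(self.cache.values()) + (self.memory,)

def run(write_through):
    bus = SnoopingBus(cpus=("A", "B"), write_through=write_through)
    trace = []
    bus.read("A")
    trace.append(bus.snapshot())
    bus.read("B")
    trace.append(bus.snapshot())
    bus.write("A", 1)
    trace.append(bus.snapshot())
    bus.read("B")
    trace.append(bus.snapshot())
    return trace

print("write back:   ", run(write_through=False))
print("write through:", run(write_through=True))
```

Each tuple is (CPU A's cache, CPU B's cache, memory) after steps 1 to 4, with `None` standing for an empty or invalidated entry; the only difference between the two runs is the memory value after step 3.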
![Page 28: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/28.jpg)
Directory-Based Cache Coherence Protocols
To implement the operations, a directory must track the state of each cache block:
• Shared (S): one or more processors have the block cached, and the value is up to date
• Uncached (U): no processor has a copy of the cache block
• Modified/Exclusive (E): exactly one processor has a copy of the cache block. That processor is called the owner of the block
![Page 29: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/29.jpg)
Directory-based Protocol
[Figure: CPU 0, CPU 1, and CPU 2, each with a cache, a memory, and a directory, connected by an interconnection network. Memory: X = 7. No cache holds X. Directory entry for X: state U, bit vector 0 0 0.]
![Page 30: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/30.jpg)
CPU 0 Reads X
[Figure: Memory: X = 7. CPU 0's cache: X = 7. Directory entry for X: state S, bit vector 1 0 0.]
![Page 31: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/31.jpg)
CPU 2 Reads X
[Figure: Memory: X = 7. Caches of CPU 0 and CPU 2: X = 7. Directory entry for X: state S, bit vector 1 0 1.]
![Page 32: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/32.jpg)
CPU 0 Writes 6 to X
[Figure: Memory: X = 7. CPU 0's cache: X = 6. Directory entry for X: state E, bit vector 1 0 0.]
![Page 33: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/33.jpg)
CPU 1 Reads X
[Figure: Memory: X = 6. Caches of CPU 0 and CPU 1: X = 6. Directory entry for X: state S, bit vector 1 1 0.]
![Page 34: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/34.jpg)
CPU 2 Writes 5 to X (Write back)
[Figure: Memory: X = 6. CPU 2's cache: X = 5. Directory entry for X: state E, bit vector 0 0 1.]
![Page 35: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/35.jpg)
CPU 0 Writes 4 to X
[Figure: Memory: X = 5. CPU 0's cache: X = 4. Directory entry for X: state E, bit vector 1 0 0.]
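The sequence of directory states walked through on these slides can be replayed with a small model. This is a sketch under simplifying assumptions: one block X, a write-back policy, and invalidation of all other copies on a write. The class `Directory` and its methods are hypothetical names, not part of any real protocol implementation.

```python
class Directory:
    """Toy directory entry for one block X shared by several CPUs."""

    def __init__(self, num_cpus, value):
        self.memory = value
        self.state = "U"                  # U, S, or E
        self.sharers = [0] * num_cpus     # bit vector, one bit per CPU
        self.cache = [None] * num_cpus    # None = no copy in that cache

    def _write_back_owner(self):
        # In state E memory may be stale, so fetch the owner's value first.
        if self.state == "E":
            owner = self.sharers.index(1)
            self.memory = self.cache[owner]

    def read(self, cpu):
        if self.cache[cpu] is None:       # read miss
            self._write_back_owner()
            self.cache[cpu] = self.memory
            self.sharers[cpu] = 1
            self.state = "S"
        return self.cache[cpu]

    def write(self, cpu, value):
        already_owner = self.state == "E" and self.sharers[cpu] == 1
        if not already_owner:
            self._write_back_owner()
            for other in range(len(self.cache)):
                if other != cpu:          # invalidate every other copy
                    self.cache[other] = None
                    self.sharers[other] = 0
            self.sharers[cpu] = 1
            self.state = "E"
        self.cache[cpu] = value

    def snapshot(self):
        return self.state, tuple(self.sharers), self.memory

d = Directory(num_cpus=3, value=7)
steps = [("read", 0, None), ("read", 2, None), ("write", 0, 6),
         ("read", 1, None), ("write", 2, 5), ("write", 0, 4)]
history = [d.snapshot()]
for op, cpu, value in steps:
    if op == "read":
        d.read(cpu)
    else:
        d.write(cpu, value)
    history.append(d.snapshot())
    print(f"CPU {cpu} {op:<5} -> state, bit vector, memory = {d.snapshot()}")
```

Each snapshot is (directory state, bit vector, memory contents of X), and the printed sequence can be compared step by step with the figures above.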
![Page 36: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/36.jpg)
Evaluating Switch Topologies
• Diameter: distance between the farthest two nodes
• Bisection width: min. number of edges in a cut which roughly divides a network into two halves; determines the min. bandwidth of the network
• Degree: number of edges per node; a constant-degree board can be mass produced
• Constant edge length? (yes/no)
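Two of these metrics, diameter and degree, are easy to compute for a concrete network with a breadth-first search. The sketch below builds a 2-D mesh without wraparound links as an example; the function names and the 4 x 4 size are assumptions made for illustration, and processors and switches are treated as the same kind of node.

```python
from collections import deque

def mesh_2d(rows, cols):
    """Adjacency lists for a rows x cols mesh without wraparound links."""
    adj = {(r, c): [] for r in range(rows) for c in range(cols)}
    for (r, c) in adj:
        for dr, dc in ((1, 0), (0, 1)):   # link to the node below and to the right
            neighbor = (r + dr, c + dc)
            if neighbor in adj:
                adj[(r, c)].append(neighbor)
                adj[neighbor].append((r, c))
    return adj

def eccentricity(adj, source):
    """Longest shortest-path distance from `source`, found by BFS."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return max(dist.values())

def diameter(adj):
    return max(eccentricity(adj, node) for node in adj)

def max_degree(adj):
    return max(len(neighbors) for neighbors in adj.values())

mesh = mesh_2d(4, 4)
print("diameter:", diameter(mesh))      # 6 (corner to opposite corner)
print("max degree:", max_degree(mesh))  # 4 (interior nodes)
```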
![Page 37: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/37.jpg)
2-D Mesh Network
![Page 38: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/38.jpg)
Binary Tree Network
![Page 39: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/39.jpg)
Hypercube
2 x 2 x … x 2 mesh
[Figure: 4-dimensional hypercube with 16 nodes labeled 0000 through 1111]
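With binary node labels like those in the figure, two hypercube nodes are adjacent exactly when their labels differ in one bit. A minimal sketch of that rule follows; the helper names `hypercube_neighbors` and `hops` are hypothetical.

```python
def hypercube_neighbors(node, dimensions):
    """Neighbors of `node`: flip each of the `dimensions` label bits in turn."""
    return [node ^ (1 << bit) for bit in range(dimensions)]

def hops(a, b):
    """Minimum number of links between a and b: the count of differing bits."""
    return bin(a ^ b).count("1")

neighbors = hypercube_neighbors(0b0110, 4)
print([format(n, "04b") for n in neighbors])  # ['0111', '0100', '0010', '1110']
print(hops(0b0000, 0b1111))                   # 4
```

Because the farthest pair of nodes differs in every bit, the diameter of a d-dimensional hypercube is d, and every node has degree d.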
![Page 40: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/40.jpg)
Hypercubes Illustrated
![Page 41: Final Review](https://reader034.vdocument.in/reader034/viewer/2022051316/56815c2a550346895dc9ffd4/html5/thumbnails/41.jpg)
Butterfly Network
[Figure: butterfly network with 8 processor nodes (0 to 7) and four ranks of switches, Rank 0 to Rank 3, with each switch labeled (rank, column) from 0,0 to 3,7]