18-741 advanced computer architecture lecture 1: intro and basics
TRANSCRIPT
![Page 1: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/1.jpg)
18-742 Spring 2011Parallel Computer Architecture
Lecture 4: Multi-core
Prof. Onur MutluCarnegie Mellon University
![Page 2: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/2.jpg)
Research Project Project proposal due: Jan 31
Project topics
Does everyone have a topic? Does everyone have a partner? Does anyone have too many partners?
2
![Page 3: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/3.jpg)
Last Lecture Programming model vs. architecture
Message Passing vs. Shared Memory Data Parallel Dataflow
Generic Parallel Machine
3
![Page 4: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/4.jpg)
Readings for Today Required:
Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA
2005. Suleman et al., “Accelerating Critical Section Execution with Asymmetric
Multi-Core Architectures,” ASPLOS 2009. Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip
Multiprocessors,” ISCA 2007.
Recommended: Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996. Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip
Multiprocessing,” ISCA 2000. Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE
Micro 2005. Amdahl, “Validity of the single processor approach to achieving large scale
computing capabilities,” AFIPS 1967.
4
![Page 5: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/5.jpg)
Reviews Due Today (Jan 21)
Seitz, “The Cosmic Cube,” CACM 1985. Suleman et al., “Accelerating Critical Section Execution with
Asymmetric Multi-Core Architectures,” ASPLOS 2009.
Due Next Tuesday (Jan 25) Papamarcos and Patel, “A low-overhead coherence solution
for multiprocessors with private cache memories,” ISCA 1984. Kelm et al., “Cohesion: a hybrid memory model for
accelerators,” ISCA 2010.
Due Next Friday (Jan 28) Suleman et al., “Data Marshaling for Multi-Core
Architectures,” ISCA 2010.
5
![Page 6: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/6.jpg)
Recap of Last Lecture
6
![Page 7: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/7.jpg)
Review: Shared Memory vs. Message Passing Loosely coupled multiprocessors
No shared global memory address space Multicomputer network
Network-based multiprocessors Usually programmed via message passing
Explicit calls (send, receive) for communication
Tightly coupled multiprocessors Shared global memory address space Traditional multiprocessing: symmetric multiprocessing (SMP) Existing multi-core processors, multithreaded processors Programming model similar to uniprocessors except
(multitasking uniprocessor) Operations on shared data require synchronization
7
![Page 8: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/8.jpg)
Programming Models vs. Architectures Five major models
Shared memory Message passing Data parallel (SIMD) Dataflow Systolic
Hybrid models?
8
![Page 9: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/9.jpg)
Scalability, Convergence, and Some Terminology
9
![Page 10: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/10.jpg)
Scaling Shared Memory Architectures
10
![Page 11: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/11.jpg)
Interconnection Schemes for Shared Memory Scalability dependent on interconnect
11
![Page 12: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/12.jpg)
UMA/UCA: Uniform Memory or Cache Access
• All processors have the same uncontended latency to memory• Latencies get worse as system grows• Symmetric multiprocessing (SMP) ~ UMA with bus interconnect
P rocessor P rocessor P rocessor. . .
. . .
latencylong
M ain M em orycontention in m em ory banks
Interconnection N etworkcontention in netw ork
![Page 13: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/13.jpg)
Uniform Memory/Cache Access+ Data placement unimportant/less important (easier to optimize code and
make use of available memory space)- Scaling the system increases latencies- Contention could restrict bandwidth and increase latency
P rocessor P rocessor P rocessor. . .
. . .
latencylong
M ain M em orycontention in m em ory banks
Interconnection N etworkcontention in netw ork
![Page 14: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/14.jpg)
Example SMP Quad-pack Intel Pentium Pro
14
![Page 15: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/15.jpg)
How to Scale Shared Memory Machines? Two general approaches
Maintain UMA Provide a scalable interconnect to memory Downside: Every memory access incurs the round-trip
network latency
Interconnect complete processors with local memory NUMA (Non-uniform memory access)
Local memory faster than remote memory Still needs a scalable interconnect for accessing remote
memory Not on the critical path of local memory access
15
![Page 16: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/16.jpg)
NUMA/NUCA: NonUniform Memory/Cache Access
• Shared memory as local versus remote memory+ Low latency to local memory- Much higher latency to remote memories
. . .
Interconnection N etw ork
contenti on i n netw ork
. . .
l a tencyl ong
M em ory
P rocessor
M em ory
P rocessor
M em ory
P rocessor
shortl a tency
+ Bandwidth to local memory may be higher- Performance very sensitive to data placement
![Page 17: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/17.jpg)
Example NUMA MAchines Sun Enterprise Server Cray T3E
17
![Page 18: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/18.jpg)
Convergence of Parallel Architectures Scalable shared memory architecture is similar to
scalable message passing architecture Main difference: is remote memory accessible with
loads/stores?
18
![Page 19: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/19.jpg)
Historical Evolution: 1960s & 70s• Early MPs
– Mainframes– Small number of processors– crossbar interconnect– UMA
Processor
MemoryMemoryMemoryMemoryMemoryMemoryMemoryMemory
Processor
Processor
Processor
corssbar
![Page 20: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/20.jpg)
Historical Evolution: 1980s• Bus-Based MPs
– enabler: processor-on-a-board– economical scaling– precursor of today’s SMPs– UMA
MemoryMemoryMemoryMemory
Proc
cache
Proc
cache
Proc
cache
Proc
cache
![Page 21: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/21.jpg)
Historical Evolution: Late 80s, mid 90s• Large Scale MPs (Massively Parallel Processors)
– multi-dimensional interconnects– each node a computer (proc + cache + memory)– both shared memory and message passing versions– NUMA– still used for “supercomputing”
![Page 22: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/22.jpg)
Historical Evolution: Current Chip multiprocessors (multi-core) Small to Mid-Scale multi-socket CMPs
One module type: processor + caches + memory Clusters/Datacenters
Use high performance LAN to connect SMP blades, racks
Driven by economics and cost Smaller systems => higher volumes Off-the-shelf components
Driven by applications Many more throughput applications (web servers) … than parallel applications (weather prediction) Cloud computing
![Page 23: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/23.jpg)
Historical Evolution: Future Cluster/datacenter on a chip?
Heterogeneous multi-core?
Bounce back to small-scale multi-core?
???
23
![Page 24: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/24.jpg)
Multi-Core Processors
24
![Page 25: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/25.jpg)
Moore’s Law
25
Moore, “Cramming more components onto integrated circuits,” Electronics, 1965.
![Page 26: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/26.jpg)
Multi-Core Idea: Put multiple processors on the same die.
Technology scaling (Moore’s Law) enables more transistors to be placed on the same die area
What else could you do with the die area you dedicate to multiple processors? Have a bigger, more powerful core Have larger caches in the memory hierarchy Simultaneous multithreading Integrate platform components on chip (e.g., network
interface, memory controllers)
26
![Page 27: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/27.jpg)
Why Multi-Core? Alternative: Bigger, more powerful single core
Larger superscalar issue width, larger instruction window, more execution units, large trace caches, large branch predictors, etc
+ Improves single-thread performance transparently to programmer, compiler
- Very difficult to design (Scalable algorithms for improving single-thread performance elusive)
- Power hungry – many out-of-order execution structures consume significant power/area when scaled. Why?
- Diminishing returns on performance - Does not significantly help memory-bound application
performance (Scalable algorithms for this elusive)
27
![Page 28: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/28.jpg)
Large Superscalar vs. Multi-Core Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
28
![Page 29: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/29.jpg)
Multi-Core vs. Large Superscalar Multi-core advantages
+ Simpler cores more power efficient, lower complexity, easier to design and replicate, higher frequency (shorter wires, smaller structures)
+ Higher system throughput on multiprogrammed workloads reduced context switches
+ Higher system throughput in parallel applications
Multi-core disadvantages- Requires parallel tasks/threads to improve performance
(parallel programming)- Resource sharing can reduce single-thread performance- Shared hardware resources need to be managed- Number of pins limits data supply for increased demand
29
![Page 30: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/30.jpg)
Large Superscalar vs. Multi-Core Olukotun et al., “The Case for a Single-Chip
Multiprocessor,” ASPLOS 1996.
Technology push Instruction issue queue size limits the cycle time of the
superscalar, OoO processor diminishing performance Quadratic increase in complexity with issue width
Large, multi-ported register files to support large instruction windows and issue widths reduced frequency or longer RF access, diminishing performance
Application pull Integer applications: little parallelism? FP applications: abundant loop-level parallelism Others (transaction proc., multiprogramming): CMP better fit
30
![Page 31: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/31.jpg)
Why Multi-Core? Alternative: Bigger caches
+ Improves single-thread performance transparently to programmer, compiler
+ Simple to design
- Diminishing single-thread performance returns from cache size. Why?
- Multiple levels complicate memory hierarchy
31
![Page 32: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/32.jpg)
Cache vs. Core
32
Time
Num
ber o
f Tra
nsis
tors
Cache
Microprocessor
![Page 33: 18-741 Advanced Computer Architecture Lecture 1: Intro and Basics](https://reader033.vdocument.in/reader033/viewer/2022052515/5891a98e1a28abb4238bcd61/html5/thumbnails/33.jpg)
Why Multi-Core? Alternative: (Simultaneous) Multithreading
+ Exploits thread-level parallelism (just like multi-core)+ Good single-thread performance when there is a single thread+ No need to have an entire core for another thread+ Parallel performance aided by tight sharing of caches
- Scalability is limited: need bigger register files, larger issue width (and associated costs) to have many threads complex with many threads
- Parallel performance limited by shared fetch bandwidth- Extensive resource sharing at the pipeline and memory system
reduces both single-thread and parallel application performance
33