18-447 Lecture 21: Parallel Architecture Overview
TRANSCRIPT
18‐447‐S18‐L21‐S1, James C. Hoe, CMU/ECE/CALCM, ©2018
18-447 Lecture 21: Parallel Architecture Overview
James C. Hoe
Department of ECE
Carnegie Mellon University
Housekeeping
• Your goal today
– see the diverse landscape of parallel computer architectures
– set the context for focused topics to come
• Notices
– Final Exam, Thursday, 5/10, 1pm~4pm
– Handout #15: HW5 (on Canvas)
– Lab 4 status check "this week"
• Readings
– P&H Ch 6
Midterm 2 Histogram
[histogram figure: counts 0-7 over score bins >0, >4, >8, ..., >52; grade cutoffs: A'ish ≥47, B'ish ≥39, C'ish ≥31, D'ish ≥23]
Midterm 2 Statistics

          P1        P2    P3      P4    P5    P6     Total
          locality  BP    energy  3C    draw  table
points    (6)       (10)  (11)    (10)  (8)   (10)   (55)
mean      4.3       7.9   6.6     7.2   6.0   7.4    39.3
std dev.  1.8       2.7   2.4     3.0   1.4   3.8    8.2
max       6         10    10      10    8     10     51
median    4         8     7       8     6     10     41
mode      4         10    6       10    7     10     41
min       0         0     0       0     3     0      24
Problem 8
[scatter plot: guessed score vs. actual score, both axes 0 to 50]
A Non-Parallel "Architecture"
• Memory holds both program and data
– instructions and data in a linear memory array
– instructions can be modified as data
• Sequential instruction processing
1. program counter (PC) identifies current instruction
2. fetch instruction from memory
3. update some state (e.g. PC and memory) as a function of current state (according to instruction)
4. repeat
[figure: linear memory array 0 1 2 3 4 5 . . . indexed by the program counter]
Dominant paradigm since its invention
Something Inherently Parallel
• Consider a von Neumann program
– What is the significance of the program order?
– What is the significance of the storage locations?
• Dataflow program
– instruction ordering implied by data dependence
– instruction specifies who receives the result
– instruction executes when operands received
– no program counter, no intermediate state

v := a + b ;
w := b * 2 ;
x := v - w ;
y := v + w ;
z := x * y ;

[dataflow graph: inputs a and b feed + and *2 nodes; their results v and w feed - and + nodes; those results x and y feed a final * node producing z; dataflow figure and example from Arvind]
More Conventionally Parallel
[figure: processors P1, P2, P3, ..., Pn sharing a common Memory]
Do you naturally think parallel or sequential?
Big Picture First: Flynn's Taxonomy

                       Single Instruction Stream         Multiple Instruction Stream
Single Data Stream     SISD: your vanilla uniprocessor   MISD: DB query??
Multiple Data Stream   SIMD: many PEs following a        MIMD: fully independent
                       common instruction stream/        programs/control-flows working
                       control-flow on different data    in parallel (collaborating SISDs?)
SIMD vs. MIMD (an abstract and general depiction)
[figure: SIMD shows a single cntrl unit driving several alu+data lanes; MIMD shows a separate cntrl unit per alu+data pair; for both the control and the data, the question is "together or separate?"]
Data Parallelism
• Abundant in matrix operations and scientific/numerical applications
• Example: DAXPY/LINPACK (inner loop of Gaussian elimination and matrix-mult)
– Y and X are vectors
– same operations repeated on each Y[i] and X[i]
– no data dependence across iterations

for (i = 0; i < N; i++) {
    Y[i] = a*X[i] + Y[i];
}

i.e., the vector operation Y = a*X + Y

How to exploit data parallelism?
Data Parallel Execution (short vector)
• Instantiate k copies of the hardware unit foo to process k iterations of the loop in parallel

for (i = 0; i < N; i++) {
    C[i] = foo(A[i], B[i]);
}

[figure: eight parallel foo units consume a0..a7 and b0..b7 and produce c0..c7 in a single step]
Data Parallelism SIMD (long vector)
• Build a deeply pipelined (high-frequency) version of foo()

for (i = 0; i < N; i++) {
    C[i] = foo(A[i], B[i]);
}

[figure: elements a0..a4 and b0..b4 stream into a deep pipeline, consumed 1 element per cycle; results c0..c4 become available many cycles later]
Can combine concurrency and pipelining at the same time too
SIMD Data-Parallel
• Parallel Bubble Sort (assume one number per PE)

repeat
    if (even node)
        if (my num > right neighbor's num)
            switch them
    else // odd node
        do nothing
    if (odd node)
        if (my num > right neighbor's num)
            switch them
    else // even node
        do nothing

• If-then-else implemented as predicated operations (L10-S25)
[figure: one cntrl unit driving several alus over a shared data array]
MIMD Parallel Execution
e.g., Barrel Processor [HEP, Smith]
• Each cycle, select a "ready" thread from scheduling pool
– only one instruction per thread in flight at once
– on a long latency stall, remove the thread from scheduling
• Simpler and faster pipeline implementation since
– no data dependence, hence no stall or forwarding
– no penalty in making pipeline deeper
[pipeline diagram: instructions InstT1..InstT8 from eight threads enter the pipeline in successive cycles, each advancing through stages A..H, with no two instructions from the same thread in flight at once]
A Spotty Tour of the MP Universe
Variety in the Details
• Scale, technology, application . . . .
• Work distribution
– granularity of parallelism (how finely is work divided): whole programs down to bits
– regularity: all "nodes" look the same and look out to the same environment
– static vs. dynamic, e.g., load-balancing
• Communication
– message-passing vs. shared memory
– granularity of communication: words to pages
– interconnect and interface design/performance
SIMD: Vector Machines
• Vector data type and regfile
• Deeply pipelined fxn units
• Matching high-perf load-store units and multi-banked memory
• E.g., Cray 1, circa 1976
– 64 x 64-word vector RF
– 12 pipelines, 12.5ns
– ECL 4-input NAND and SRAM (no caches!!)
– 2x25-ton cooling system
– 250 MIPS peak for ~10M 1970$
[Figure from H&P CAaQA, COPYRIGHT 2007 Elsevier. ALL RIGHTS RESERVED.]
SIMD: Big-Irons
• Sea of PEs on a regular grid
– synchronized common cntrl
– direct access to local mem
– nearest-neighbor exchanges
– special support for broadcast, reduction, etc.
• E.g., Thinking Machines CM-2
– 1000s of bit-sliced PEs lock-step controlled by a common sequencer
– "hypercube" topology
– special external I/O nodes
[figure: grid of PE+M nodes, each with a router R, nearest-neighbor connected and attached to a Front End]
SIMD: Modern Renditions
• Intel SSE (Streaming SIMD Extension), 1999
– 16 x 128-bit "vector" registers, 4 floats or 2 doubles
– SIMD instructions: ld/st, arithmetic, shuffle, bitwise
– SSE4 with true full-width operations; Core i7 does up to 4 sp-mult & 4 sp-add per cyc per core (24 GFLOPS @3GHz)
• AVX 2 doubles the above (over 1 TFLOPS/chip)
• "GP"GPUs . . . (next slide)

Simple hardware, big perf numbers, but only if massively data-parallel app!!
8+ TFLOPs Nvidia GP104 GPU
• 20 Streaming Multiprocs
– 128 SIMD lanes per SM
– 1 mul, 1 add per lane
– 1.73 GHz (boosted)
• Performance
– 8874 GFLOPs
– 320 GB/sec
– 180 Watt

How many FLOPs per Watt? How many FLOPs per DRAM byte accessed?
[NVIDIA GeForce GTX 1080 Whitepaper]
What 1 TFLOP Meant in 1996
• ASCI Red, 1996: world's first 1-TFLOP computer!!
– $50M, 1600 ft² system
– ~10K 200MHz PentiumPro's
– ~1 TByte DRAM (total)
– 500kW to power up + 500kW to cool down
• Advanced Simulation and Computing Initiative
– how to know if nuclear stockpile still good if you can't blow one up to find out?
– requires ever more expensive simulation as stockpile ages
– Red 1.3 TFLOPS 1996; Blue Mountain/Pacific 4 TFLOPS 1998; White 12 TFLOPS 2000; Purple 100 TFLOPS 2005; . . . Cray Titan 27 PFLOPS . . .
MIMD: Message Passing
• Private address space and memory per processor
• Parallel threads on different processors communicate by explicit sending and receiving of messages
[figure: processors P0, P1, P2, ..., Pi, Pj, Pk, each with a private memory M and a network interface unit (NIU) with tx/rx FIFOs, connected by an Interconnect]
MIMD Message Passing Systems (by network interface placement)
• Beowulf Clusters (I/O bus)
– Linux PCs connected by Ethernet
• High-Performance Clusters (I/O bus)
– stock workstations/servers but exotic interconnects, e.g., Myrinet, HIPPI, Infiniband, etc.
• Supers (memory bus)
– stock CPUs on custom platform
– e.g., Cray XT5 ("fastest" in 2011, 224K AMD Opterons)
• Inside the CPU
– single-instruction send/receive
– e.g., iAPX 432 (1981), Transputers (80s)
[figure: a CPU (ALU, RF, cache) on a memory bus (GB/sec) with main memory, an I/O bridge, and disks on an I/O bus; X marks the candidate network-interface attachment points]
MIMD Shared Memory: Symmetric Multiprocessors (SMPs)
• Symmetric means
– identical procs connected to common memory
– all procs have equal access to system (mem & I/O)
– OS can schedule any process on any processor
• Uniform Memory Access (UMA)
– processor/memory connected by bus or crossbar
– all processors have equal memory access performance to all memory locations
– caches need to stay coherent
[figure: two CPUs (ALU, RF, cache) on a shared memory bus (GB/sec) with main memory, plus an I/O bridge to disks on an I/O bus]
MIMD Shared Memory: Big Irons (Distributed Shared Memory)
• UMA hard to scale due to concentration of BW
• Large scale SMPs have distributed memory with non-uniform memory access (NUMA)
– "local" memory pages (faster to access)
– "remote" memory pages (slower to access)
– cache-coherence still possible but complicated
• E.g., SGI Origin 2000
– up to 512 CPUs and 512GB DRAM ($40M)
– 48 128-CPU systems were collectively the 2nd fastest computer (3 TFLOPS) in 1999
[figure: PE+mem nodes, each with a directory (dir) and NIU, connected by a network]
MIMD Shared Memory: It Is Everywhere Now!
• General-purpose "multicore" processors implement SMP (though not UMA) on a single chip
• Moore's Law scaling in number of cores
Intel Xeon e5345 [Figure from P&H CO&D, COPYRIGHT 2009 Elsevier. ALL RIGHTS RESERVED.]
Today’s New Normal
[https://software.intel.com/en‐us/articles/intel‐xeon‐processor‐scalable‐family‐technical‐overview]
Intel® Xeon® E5‐2600
Today’s Exotic
Microsoft Catapult [MICRO 2016, Caulfield, et al.]
Google TPU [Hotchips 2017, Jeff Dean]
Remember how we got here . . . .
[figure: one Big Core, labeled 1970~2005, next to a grid of many little cores, labeled 2005~??]
Top500 – Nov 2017
Screenshot http://www.top500.org/lists/2017/11/
Top500: Architecture
Screenshot http://www.top500.org (performance share)
Top500: Application Area
Screenshot http://www.top500.org (performance share)
Top500: Moore’s Law?
Using parallel computers has always been harder than building them!
screenshot http://www.top500.org
Green500 – Nov 2017
Screenshot http://www.top500.org/green500/lists/2017/11/