Dark Silicon & Modern Computer Architecture
Hung-Wei Tseng
Power vs. Energy
• Power is the direct contributor of "heat"
  • Packaging of the chip
  • Heat dissipation cost
• Power = P_dynamic + P_static
• Energy = Power × Execution Time
  • The electricity bill and battery life are related to energy!
  • Lower power does not necessarily mean better battery life if the processor slows down the application too much
Dennardian vs. Leakage-Limited Scaling
• Given a scaling factor S

| Parameter | Relation | Classical Scaling (Dennardian) | Leakage Limited (Dennardian Broken) |
| --- | --- | --- | --- |
| Power budget | | 1 | 1 |
| Chip size | | 1 | 1 |
| Vdd (supply voltage) | | 1/S | 1 |
| Vt (threshold voltage) | | 1/S | 1 |
| tox (oxide thickness) | | 1/S | 1/S |
| W, L (transistor dimensions) | | 1/S | 1/S |
| Cgate (gate capacitance) | WL/tox | 1/S | 1/S |
| Isat (saturation current) | WVdd/tox | 1/S | 1 |
| F (device frequency) | Isat/(Cgate·Vdd) | S | S |
| D (devices/area) | 1/(WL) | S² | S² |
| p (device power) | Isat·Vdd | 1/S² | 1 |
| P (chip power) | D·p | 1 | S² |
| U (utilization) | 1/P | 1 | 1/S² |
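The last three rows of the table can be multiplied out directly. The sketch below is illustrative only: it encodes the scaling relations above for a scaling factor S and shows why chip power (and hence utilization) diverges once voltage stops scaling.

```python
# Sketch of the two scaling regimes from the table, for a scaling factor S.
def chip_power_classical(S):
    device_power = (1 / S) ** 2    # p ~ Isat * Vdd, and both scale as 1/S
    density = S ** 2               # D ~ 1/(WL)
    return density * device_power  # P = D * p = 1: chip power stays in budget

def chip_power_leakage_limited(S):
    device_power = 1.0             # Vdd no longer scales, so p stays ~1
    density = S ** 2
    return density * device_power  # P = S^2: chip power outgrows the budget

S = 2
print(chip_power_classical(S))        # 1.0
print(chip_power_leakage_limited(S))  # 4.0, so utilization U = 1/P = 1/S^2
```

With S = 2, classical scaling keeps chip power constant while leakage-limited scaling quadruples it, which is exactly the U = 1/S² utilization collapse in the table.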
Dark Silicon and the End of Multicore Scaling
H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam and D. Burger University of Washington, University of Wisconsin—Madison, University of Texas at Austin,
Microsoft Research
Power consumption to light up all transistors

(Figure: two 10×10 transistor grids. Under Dennardian scaling, each of the 100 transistors draws 0.5 W, so lighting the whole chip costs ~50 W. When Dennard scaling breaks, each transistor still draws ~1 W; lighting all 100 would cost 100 W. Within the same ~50 W budget, only a 7×7 block of 49 transistors can be on (≈49 W); the rest must stay off (~0 W). That unpowered region is "dark" silicon.)
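The arithmetic behind the figure is a one-liner. This sketch uses the slide's illustrative numbers (a 50 W budget, 100 transistors at 1 W each after Dennard scaling breaks); they are not measurements of any real chip.

```python
# How much of a leakage-limited chip can be powered on at once?
power_budget_w = 50.0       # assumed fixed chip power budget
transistor_count = 100      # illustrative 10x10 grid from the slide
watts_per_device = 1.0      # post-Dennard: per-device power no longer scales

max_on = int(power_budget_w // watts_per_device)
dark_fraction = 1 - max_on / transistor_count
print(max_on, dark_fraction)  # 50 devices on, 50% of the chip dark
```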
What happens if power doesn't scale with process technologies?
• Suppose we can cram more transistors into the same chip area (Moore's law continues), but the power consumption per transistor stays the same. If we keep the same power budget while packing more transistors into the same area, how many of the following statements are true?
  ① The power consumption per chip will increase
  ② The power density of the chip will increase
  ③ Given the same power budget, we may not be able to power on all chip area if we maintain the same clock rate
  ④ Given the same power budget, we may have to lower the clock rate of circuits to power on all chip area
A. 0 B. 1 C. 2 D. 3 E. 4
Clock rate improvement is limited nowadays
Solutions/trends in dark silicon era
Trends in the Dark Silicon Era
• Aggressive dynamic voltage/frequency scaling
• Throughput oriented — slower, but more cores
• Just let it dark — activate part of the circuits, but not all
• From general-purpose to domain-specific — ASICs
Aggressive dynamic frequency scaling
More cores per chip, slower per core
Dynamic/Active Power
• The power consumption due to the switching of transistor states
• Dynamic power:

P_dynamic ∼ α × C × V² × f × N

  • α: average switches per cycle
  • C: capacitance
  • V: voltage
  • f: frequency, usually linear in V
  • N: the number of transistors
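The V² term (with f itself roughly linear in V) is why voltage scaling pays off cubically. A minimal sketch of the relation above, with purely illustrative constants:

```python
# P_dynamic ~ alpha * C * V^2 * f * N; the constants below are assumptions
# for illustration, not parameters of any real processor.
def dynamic_power(alpha, C, V, f, N):
    return alpha * C * V**2 * f * N

base = dynamic_power(alpha=0.1, C=1e-15, V=1.0, f=2.0e9, N=1e9)
# Drop voltage by 20%, and frequency with it (f scales ~linearly in V):
scaled = dynamic_power(alpha=0.1, C=1e-15, V=0.8, f=1.6e9, N=1e9)
print(scaled / base)  # ~0.512: a 20% voltage cut roughly halves dynamic power
```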
Demo
• You may use `cat /proc/cpuinfo` to see all the details of your processors
• You may add `| grep MHz` to see the frequencies of your cores
  • Only very few of them are on the boosted frequency
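What `grep MHz` surfaces can also be parsed programmatically. This sketch works on a made-up `/proc/cpuinfo` excerpt (the frequencies and the 3000 MHz "boost" threshold are assumptions for illustration); on a Linux machine you could read the real file instead.

```python
# Parse "cpu MHz" lines the way `cat /proc/cpuinfo | grep MHz` exposes them.
sample = """\
cpu MHz\t\t: 3700.002
cpu MHz\t\t: 1200.054
cpu MHz\t\t: 1200.031
cpu MHz\t\t: 1199.998
"""

freqs = [float(line.split(":")[1]) for line in sample.splitlines()
         if line.startswith("cpu MHz")]
boosted = [f for f in freqs if f > 3000]  # assumed boost threshold
print(len(freqs), len(boosted))  # 4 cores reported, only 1 on a boosted clock
```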
How about static power?

Static/Leakage Power
• The power consumption due to leakage — transistors do not turn all the way off even when not switching
• Becomes the dominant factor in the most advanced process technologies

P_leakage ∼ N × V × e^(−Vt)

  • N: number of transistors
  • V: voltage
  • Vt: threshold voltage at which the transistor conducts (begins to switch)
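The exponential in the relation above is the catch: lowering Vt to keep transistors fast makes leakage blow up. A minimal sketch with arbitrary units, showing the trend only:

```python
import math

# P_leakage ~ N * V * e^(-Vt); units are arbitrary, only the ratio matters.
def leakage_power(N, V, Vt):
    return N * V * math.exp(-Vt)

high_vt = leakage_power(N=1e9, V=1.0, Vt=0.5)
low_vt = leakage_power(N=1e9, V=1.0, Vt=0.3)
print(low_vt / high_vt)  # e^0.2 ~ 1.22: a 0.2 drop in Vt adds ~22% leakage
```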
Slower, but more
Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction
Rakesh Kumar, Keith Farkas, Norm P. Jouppi, Partha Ranganathan, Dean M. Tullsen
University of California, San Diego and HP Labs
Areas of different processor generations
• You can fit about 5 EV5 cores within the same area as one EV6
• If you would build a quad-core EV6, you can use the same area to
  • build a 20-core EV5, or
  • 3 EV6 + 5 EV5
Energy-delay
• Energy × delay = Power × ET × ET = Power × ET² (ET: execution time)
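The energy-delay product rewards cores that are efficient without being too slow. This sketch compares two hypothetical cores; the power and runtime numbers are assumptions for illustration, not measurements of EV5/EV6.

```python
# EDP = Power * ET^2, from the identity Energy * delay = Power * ET * ET.
def edp(power_w, exec_time_s):
    return power_w * exec_time_s ** 2

big_core = edp(power_w=80, exec_time_s=1.0)     # fast but power-hungry
little_core = edp(power_w=10, exec_time_s=2.0)  # half the speed, 1/8 the power
print(big_core, little_core)  # 80.0 vs 40.0: the slower core wins on EDP here
```

Note the squared delay term: a core that is 2× slower must use less than 1/4 the power to come out ahead on EDP, which is why "spending more time on older cores" does not always improve energy-delay.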
Single-ISA heterogeneous CMP
• Regarding "Single-ISA Heterogeneous Multi-Core Architectures", how many of the following statements is/are correct?
  ① You need to recompile and optimize the binary for each core architecture to exploit the thread-level parallelism in this architecture
  ② For a program with limited thread-level parallelism, a single-ISA heterogeneous CMP would deliver better or at least the same level of performance as a homogeneous CMP
  ③ For a program with rich thread-level parallelism, a single-ISA heterogeneous CMP would deliver better or at least the same level of performance as a homogeneous CMP built with older-generation cores
  ④ Spending more time on older-generation cores would always lead to better energy-delay
A. 0 B. 1 C. 2 D. 3 E. 4
4 EV6 vs. 20 EV5 vs. 3 EV6 + 5 EV5
ARM's big.LITTLE architecture
4 EV6 vs. 20 EV5 vs. 3 EV6 + 5 EV5
Xeon Phi
The rise of GPUs

An Overview of Kepler GK110 and GK210 Architecture

Kepler GK110 was built first and foremost for Tesla, and its goal was to be the highest-performing parallel computing microprocessor in the world. GK110 not only greatly exceeds the raw compute horsepower delivered by previous-generation GPUs, but it does so efficiently, consuming significantly less power and generating much less heat output.

GK110 and GK210 are both designed to provide fast double-precision computing performance to accelerate professional HPC compute workloads; this is a key difference from the NVIDIA Maxwell GPU architecture, which is designed primarily for fast graphics performance and single-precision consumer compute tasks. While the Maxwell architecture performs double-precision calculations at a rate of 1/32 that of single-precision calculations, the GK110 and GK210 Kepler-based GPUs are capable of performing double-precision calculations at a rate of up to 1/3 of single-precision compute performance.

Full Kepler GK110 and GK210 implementations include 15 SMX units and six 64-bit memory controllers. Different products will use different configurations; for example, some products may deploy 13 or 14 SMXs. Key features of the architecture that will be discussed below in more depth include:
• The new SMX processor architecture
• An enhanced memory subsystem, offering additional caching capabilities, more bandwidth at each level of the hierarchy, and a fully redesigned and substantially faster DRAM I/O implementation
• Hardware support throughout the design to enable new programming model capabilities
• GK210 expands upon GK110's on-chip resources, doubling the available register file and shared memory capacities per SMX

(Diagram labels: SMX (streaming multiprocessor), thread scheduler, GPU global memory, high-bandwidth memory controllers)
Streaming Multiprocessor (SMX) Architecture

The Kepler GK110/GK210 SMX unit features several architectural innovations that make it the most powerful multiprocessor we've built for double-precision compute workloads.

SMX: 192 single-precision CUDA cores, 64 double-precision units, 32 special function units (SFUs), and 32 load/store (LD/ST) units.

Each of these cores performs the same operation, but each is also a "thread": a total of 16 × 12 = 192 cores!
Just let it dark
NVIDIA's Turing Architecture
Programming in the Turing Architecture

```c
// Use tensor cores
cublasErrCheck(cublasSetMathMode(cublasHandle, CUBLAS_TENSOR_OP_MATH));

// Make the inputs 16-bit
convertFp32ToFp16 <<< (MATRIX_M * MATRIX_K + 255) / 256, 256 >>>
    (a_fp16, a_fp32, MATRIX_M * MATRIX_K);
convertFp32ToFp16 <<< (MATRIX_K * MATRIX_N + 255) / 256, 256 >>>
    (b_fp16, b_fp32, MATRIX_K * MATRIX_N);

// Call GEMM
cublasErrCheck(cublasGemmEx(cublasHandle, CUBLAS_OP_N, CUBLAS_OP_N,
    MATRIX_M, MATRIX_N, MATRIX_K,
    &alpha,
    a_fp16, CUDA_R_16F, MATRIX_M,
    b_fp16, CUDA_R_16F, MATRIX_K,
    &beta,
    c_cublas, CUDA_R_32F, MATRIX_M,
    CUDA_R_32F, CUBLAS_GEMM_DFALT_TENSOR_OP));
```
NVIDIA's Turing Architecture
You can only use one of these types of ALUs at a time, not all of them
The rise of ASICs
Say we want to implement a[i] += a[i+1]*20
• This is what we need in RISC-V in each iteration:

ld   X1, 0(X0)
ld   X2, 8(X0)
add  X3, X31, #20
mul  X2, X2, X3
add  X1, X1, X2
sd   X1, 0(X0)

(Pipeline diagram: the six instructions flow through the IF, ID, EX, MEM, and WB stages, overlapped one cycle apart.)
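For reference, the computation those six instructions perform per iteration is just the plain loop below; this sketch exists only to make the fixed function explicit before the datapath is specialized for it.

```python
# a[i] += a[i+1] * 20, applied across an array, one iteration per element.
def update(a):
    for i in range(len(a) - 1):
        a[i] += a[i + 1] * 20   # load a[i] and a[i+1], multiply, add, store
    return a

print(update([1, 2, 3, 4]))  # [41, 62, 83, 4]
```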
This is what you need for these instructions

(Figure: the full 5-stage pipelined datapath: PC, instruction memory, register file, sign-extend, ALU, data memory, branch adders, control signals, the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, plus the forwarding and hazard-detection units.)
Specialize the circuit

(Figure: the same pipelined datapath, with the instruction-fetch stage set aside.)

We don't need instruction fetch, given it's a fixed function
Specialize the circuit

(Figure: the datapath with fetch, most of the register file, and the complex control/decode logic set aside.)

We don't need instruction fetch, given it's a fixed function
We don't need this many registers, complex control, or decode
Specialize the circuit

(Figure: the datapath further reduced, with the branch logic and hazard detection removed as well.)

We don't need instruction fetch, given it's a fixed function
We don't need this many registers, complex control, or decode
We don't need ALUs, branches, hazard detection…
Specialize the circuit

(Figure: the minimal remaining datapath for the fixed function.)

We don't need instruction fetch, given it's a fixed function
We don't need this many registers, complex control, or decode
We don't need big ALUs, branches, hazard detection…
Rearranging the datapath

(Figure: a specialized pipeline for the fixed function: two data-memory reads (addresses 0 and 8 off the base) feed a multiplier (×20) and an adder, whose sum is written back to data memory, with registers between the stages.)

ld   X1, 0(X0)
ld   X2, 8(X0)
add  X3, X31, #20
mul  X2, X2, X3
add  X1, X1, X2
sd   X1, 0(X0)
The pipeline for a[i] += a[i+1]*20

(Figure: the specialized pipeline streaming successive iterations: a[0] += a[1]*20, a[1] += a[2]*20, a[2] += a[3]*20, a[3] += a[4]*20.)

Each stage can still be as fast as the pipelined processor, but each stage is now doing the work of the original six instructions.
A Cloud-Scale Acceleration Architecture
Adrian Caulfield, Eric Chung, Andrew Putnam, Hari Angepat, Jeremy Fowers, Michael Haselman, Stephen Heil, Matt Humphrey, Puneet Kaur, Joo-Young Kim, Daniel Lo, Todd Massengill, Kalin Ovtcharov, Michael Papamichael, Lisa Woods, Sitaram Lanka, Derek Chiou, Doug Burger
Microsoft
FPGA
• Field-Programmable Gate Array
  • An array of lookup tables (LUTs)
  • Reconfigurable wires, i.e. the interconnects between LUTs
  • Registers
• A LUT
  • Accepts a few inputs
  • Has SRAM memory cells that store all possible outputs
  • Generates outputs according to the given inputs
• As a result, you may use FPGAs to emulate any kind of gate or logic combination, and create an ASIC-like processor
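The LUT idea above can be sketched in a few lines: a 2-input LUT is a 4-entry truth table held in SRAM, and "configuring" the FPGA means filling in those tables. This is a software analogy, not how the bitstream actually encodes it.

```python
# A 2-input LUT modeled as a stored truth table.
def make_lut(truth_table):
    # truth_table[i] is the output for input bits (a, b) encoded as i = 2a + b
    def lut(a, b):
        return truth_table[(a << 1) | b]
    return lut

xor_gate = make_lut([0, 1, 1, 0])   # configure the cell as XOR...
and_gate = make_lut([0, 0, 0, 1])   # ...or reconfigure the same cell as AND
print(xor_gate(1, 0), and_gate(1, 0))  # 1 0
```

The same physical cell becomes whatever gate its table says, which is exactly why an FPGA can emulate arbitrary logic.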
Configurable cloud

(Figure: datacenter racks in which each 2-socket CPU server carries an FPGA acceleration board placed between its NIC and the top-of-rack (TOR) switch; the FPGAs form a hardware acceleration plane, separate from the traditional software (CPU) server plane, running services such as SQL, deep neural networks, web search ranking, and GFT offload.)

• Interconnected FPGAs form a separate plane of computation
• Can be managed and used independently from the CPU
Gen2 shell
• Foundation for all accelerators
  • Includes PCIe, networking, and DDR IP
  • Common, well-tested platform for development
• Lightweight Transport Layer
  • Reliable FPGA-to-FPGA networking
  • Ack/Nack protocol, retransmit buffers
  • Optimized for lossless networks
  • Minimized resource usage

(Figure: the shell block diagram: x8 PCIe and a DMA engine toward the server NIC, 40G MACs toward the top-of-rack switch, a DDR3 controller with 4 GB DDR3, config flash, JTAG/temperature/I2C, an internal network switch and router, the Lightweight Transport Layer, and "Role" slots for user logic.)
Use cases
• Local: great service acceleration
• Infrastructure: fastest cloud network
• Remote: reconfigurable app fabric (DNNs)
Tail latencies
• Tail latency == 1 in X servers being slow
• Why is this bad? Each user request now needs several servers, so the chance of experiencing the tail is much higher
• If 99% of a server's responses take 10 ms, but 1% take 1 second:
  • If the user only needs one server, the mean is OK
  • If the user needs 100 partitions from 100 servers, 63% of the requests take more than 1 second
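The 63% figure above falls out of a one-line probability calculation: a fanned-out request is slow unless every one of its servers is fast.

```python
# If each server is slow with probability p, a request touching `fanout`
# independent servers hits at least one slow server with prob 1 - (1-p)^fanout.
p_slow = 0.01
fanout = 100
p_request_slow = 1 - (1 - p_slow) ** fanout
print(round(p_request_slow, 3))  # 0.634, i.e. ~63% of requests see the tail
```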
5-day bed-level latency
• Lower & more consistent 99.9th-percentile tail latency
• In production for years

(Figure: five days of data, normalized query load and 99.9th-percentile query latency for software vs. local FPGA ranking; the 99.9% FPGA latency curve stays low and flat while the 99.9% software latency tracks the load.)

Even at 2× query load, accelerated ranking has lower latency than software at any load
Accelerated networking
• Software-defined networking
  • Generic Flow Table (GFT) rule-based packet rewriting
  • 10× latency reduction vs. software; CPU load now < 1 core
  • 25 Gb/s throughput at 25 μs latency, the fastest cloud network
• Capable of 40 Gb line-rate encrypt and decrypt
  • On Haswell, AES GCM-128 costs 1.26 cycles/byte[1] (5+ 2.4 GHz cores to sustain 40 Gb/s)
  • CBC and other algorithms are more expensive
  • AES CBC-128-SHA1 is 11 μs in FPGA vs. 4 μs on CPU (1500 B packet)
  • Higher latency, but significant CPU savings

(Figure: packets flow from the 40G NIC through the FPGA's GFT table and crypto block, e.g. "Flow: 1.2.3.1->1.3.4.1, 62362->80; Action: Decap, DNAT, Rewrite, Meter", before reaching the VMs.)
Shared DNN
• Economics: consolidation
  • Most accelerators have more throughput than a single host requires
  • Share excess capacity, use fewer instances
  • Frees up FPGAs for other services
• DNN accelerator
  • Sustains 2.5× busy clients in a microbenchmark before queuing delay drives latency up

(Figure: six software clients each drawing ~20% of a shared pool of FPGAs; hardware latency, normalized to a local FPGA, stays flat for the average, 95th, and 99th percentiles up to an oversubscription of ~2.5 remote clients per FPGA.)
MS' "Configurable Clouds"
• Regarding MS' configurable clouds that are powered by FPGAs, please identify how many of the following are correct:
  ① Each FPGA is dedicated to one machine
  ② Each FPGA is connected through a network that is separate from the datacenter network
  ③ FPGAs can deliver shorter average latency for AES-CBC-128-SHA1 encryption and decryption than Intel's high-end processors
  ④ FPGA-accelerated search queries are always faster than a pure software-based datacenter
A. 0 B. 1 C. 2 D. 3 E. 4
Why FPGAs?
• Which of the following is the main reason why Microsoft adopts FPGAs instead of the alternatives chosen by their rivals?
A. Cost B. Performance C. Scalability D. Flexibility E. Easier to program
Why FPGA?
!63
Flexible
• Local, infrastructure and remote acceleration
• Gen1 showed significant gains even for complex services (~2x for Bing)
• Needs a clear benefit for the majority of servers: infrastructure
• Economics must work
• What works at small scale doesn’t always work at hyperscale, and vice versa
• Little tolerance for superfluous costs
• Minimize complexity and risk in deployment and maintenance
• Must be flexible
• Support simple, local accelerators and complex, shared accelerators at the same time
• Rapid deployment of new protocols, algorithms and services across the cloud
!64
Summary: What makes a configurable cloud?
In-Datacenter Performance Analysis of a Tensor Processing Unit
N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan,
R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon
Google Inc.
!65
What TPU looks like
!70
TPU Floorplan
!71
TPU Block diagram
!72
• Regarding TPUs, please identify how many of the following statements are correct.
① TPU is optimized for highly accurate matrix multiplications
② TPU is designed for dense matrices, not for sparse matrices
③ A majority of TPU’s area is used by memory buffers
④ All TPU instructions are equally long
A. 0 B. 1 C. 2 D. 3 E. 4
!73
TPU (Tensor Processing Unit)
Experimental setup
!74
Performance/Rooflines
!75
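The roofline comparison on this slide can be reproduced in a few lines: attainable throughput is the minimum of peak compute and memory bandwidth times operational intensity. The peak and bandwidth constants below are the commonly cited TPU-v1 figures (92 8-bit TOPS, 34 GB/s DDR3); treat them as illustrative rather than exact.

```python
# Roofline model sketch. PEAK and BW are the commonly cited TPU-v1
# figures, used here only for illustration.
def roofline(peak_ops: float, mem_bw: float, intensity: float) -> float:
    """Attainable ops/s = min(peak compute, bandwidth * operational intensity)."""
    return min(peak_ops, mem_bw * intensity)

PEAK = 92e12   # ~92 TOPS (8-bit) peak compute
BW = 34e9      # ~34 GB/s DDR3 memory bandwidth

for oi in (1, 100, 10_000):   # operational intensity: ops per byte moved
    print(f"intensity {oi:>6}: {roofline(PEAK, BW, oi) / 1e12:.2f} TOPS")
```

With these constants the ridge point sits around 2,700 ops/byte: low-intensity kernels (MLPs, LSTMs) are memory-bound, while high-intensity CNNs can approach the compute peak.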
Tail latency
!76
What NVIDIA says
!77
https://blogs.nvidia.com/blog/2017/04/10/ai-drives-rise-accelerated-computing-datacenter/
!78
• Fallacy: NN inference applications in data centers value throughput as much as response time.
• Fallacy: The K80 GPU architecture is a good match to NN inference — GPU is throughput oriented
• Pitfall: For NN hardware, Inferences Per Second (IPS) is an inaccurate summary performance metric — it’s simply the inverse of the complexity of the typical inference in the application (e.g., the number, size, and type of NN layers)
• Fallacy: The K80 GPU results would be much better if Boost mode were enabled — Boost mode increased the clock rate by a factor of up to 1.6—from 560 to 875 MHz—which increased performance by 1.4X, but it also raised power by 1.3X. The net gain in performance/Watt is 1.1X, and thus Boost mode would have a minor impact on LSTM1
• Fallacy: CPU and GPU results would be comparable to the TPU if we used them more efficiently or compared to newer versions.
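The Boost-mode arithmetic quoted above can be checked directly: 875/560 is about a 1.56x clock increase, and 1.4x performance over 1.3x power gives roughly a 1.1x gain in performance/Watt.

```python
# Sanity check of the K80 Boost-mode numbers quoted in the fallacy above.
clock_gain = 875 / 560            # clock increase: ~1.56x
perf_gain = 1.4                   # reported speedup
power_gain = 1.3                  # reported power increase
perf_per_watt_gain = perf_gain / power_gain   # net gain: ~1.08x, rounds to 1.1x

print(f"clock: {clock_gain:.2f}x, perf/W: {perf_per_watt_gain:.2f}x")
```

The net ~1.1x confirms the paper's point: Boost mode buys little in performance/Watt, so it would have had only a minor impact on the comparison.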
!79
Fallacies & Pitfalls
• Pitfall: Architects have neglected important NN tasks.
• CNNs constitute only about 5% of Google’s representative NN workload; more attention should be paid to MLPs and LSTMs. History repeats: many architects once concentrated on floating-point performance when most mainstream workloads turned out to be dominated by integer operations.
• Pitfall: Performance counters added as an afterthought for NN hardware.
• Fallacy: After two years of software tuning, the only path left to increase TPU performance is hardware upgrades.
• Pitfall: Being ignorant of architecture history when designing a domain-specific architecture
• Systolic arrays
• Decoupled access/execute
• CISC instructions
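A systolic array — the first history lesson in the list, and the heart of the TPU's matrix unit — is a grid of multiply-accumulate cells through which operands flow in lockstep. The toy simulation below is my own sketch of an output-stationary systolic matrix multiply (not the TPU's actual weight-stationary design): A values flow right, B values flow down, and each cell accumulates its output element.

```python
# Toy output-stationary systolic-array matrix multiply (a sketch, not the
# TPU's actual weight-stationary design). A flows rightward, B flows
# downward; cell (i, j) accumulates sum_k A[i][k] * B[k][j].
def systolic_matmul(A, B):
    n = len(A)
    C = [[0] * n for _ in range(n)]          # per-cell accumulators
    a_reg = [[0] * n for _ in range(n)]      # A value held in each cell
    b_reg = [[0] * n for _ in range(n)]      # B value held in each cell
    for t in range(3 * n - 2):               # cycles until the array drains
        new_a = [[0] * n for _ in range(n)]
        new_b = [[0] * n for _ in range(n)]
        for i in range(n):
            for j in range(n):
                # Operands arrive from the left / top neighbor; edge cells
                # read skewed inputs (row i delayed by i, column j by j).
                a = a_reg[i][j - 1] if j > 0 else (
                    A[i][t - i] if 0 <= t - i < n else 0)
                b = b_reg[i - 1][j] if i > 0 else (
                    B[t - j][j] if 0 <= t - j < n else 0)
                C[i][j] += a * b             # multiply-accumulate
                new_a[i][j], new_b[i][j] = a, b
        a_reg, b_reg = new_a, new_b
    return C

print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
# → [[19, 22], [43, 50]]
```

The input skew guarantees that the A and B operands meeting in cell (i, j) at any cycle always carry the same index k, so only neighbor-to-neighbor wires are needed — the property that makes systolic designs so dense and energy-efficient.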
!80
Fallacies & Pitfalls
Final words
!81
• Computer architecture is more important than you can ever imagine
• Being a “programmer” is easy; you need to know a lot about architecture to be a “performance programmer”
• Branch prediction
• Cache
• Multicore era — to get your multithreaded program correct and performing well, you need to take care of coherence and consistency
• We’re now in the “dark silicon era”
• Single-core isn’t getting any faster
• Multi-core doesn’t scale anymore
• We will see more and more ASICs
• You need to write more “system-level” programs to use these new ASICs
• Interested in the latest architecture/systems research? Join EE260 next quarter!
!82
Conclusion
• Final Exam next Monday @ 8am
• Homework #4 due tonight
• iEval submission — attach your “confirmation” screen to get an extra/bonus homework
• Office hours for Hung-Wei this week — MWF 1p-2p
!83
Announcement
• Top — Yu-Ching Hu • Runner-up — Irfan Ahmed Vezhapillil Aboobacker • Honorable mention — Abenezer Wudenhe
!84
Helper thread contest
Thank you all for this great quarter!