Computer Architecture ELEC3441
2nd Semester, 2017-18, Dr. Hayden Kwok-Hay So
Department of Electrical and Electronic Engineering, University of Hong Kong
Computer Performance
2nd sem. '17-18 ENGG3441 - HS 2
How do you measure performance of a computer?
How do you make a computer fast?
Ways to Measure Performance
■ Execution time (response time) ≠ throughput
  • Execution time: the time to finish a task
  • Throughput: the number of tasks finished per unit time
■ We will focus on execution time in this course
Relative Performance
■ Define the performance of a computer as
$$\text{Performance} = \frac{1}{\text{ExecutionTime}}$$
■ Computer B is n times faster than Computer A if:
$$n = \frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B}$$
Quick Check
■ Computer A finishes a task in 5 s; Computer B finishes the same task in 4 s. Which one is faster, and by how much?
$$\frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B} = \frac{5}{4} = 1.25$$
Computer B is 1.25 times faster than Computer A
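The quick check can be reproduced with a few lines of Python. This is only a sketch of the definition n = ExecutionTime_A / ExecutionTime_B; the function name `relative_performance` is made up for illustration.

```python
def relative_performance(exec_time_a: float, exec_time_b: float) -> float:
    """Return n such that Computer B is n times faster than Computer A."""
    if exec_time_a <= 0 or exec_time_b <= 0:
        raise ValueError("execution times must be positive")
    # Performance = 1 / ExecutionTime, so the ratio of performances
    # is the inverse ratio of execution times.
    return exec_time_a / exec_time_b


n = relative_performance(5.0, 4.0)  # A takes 5 s, B takes 4 s
print(f"Computer B is {n:.2f} times faster than Computer A")
```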
Ways to Measure Execution Time
■ Wall clock time (elapsed time)
  • The total time a user experiences that a computer takes to finish a task
  • Includes OS overhead, I/O, idle time, time shared with other users
■ CPU time
  • The time spent on a user task in the CPU
  • User CPU time + OS CPU time
  • Does not include I/O, time spent by other users, etc.
■ Focus on CPU time in this course

$ time shasum afile
132ecc0e19eec19d5dc775752efeac280cecebdc  afile
real 0m20.177s
user 0m12.835s
sys 0m1.786s
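In the `time` output above, the CPU time is user + sys = 12.835 s + 1.786 s = 14.621 s, while the wall clock time is 20.177 s. The same distinction can be observed from inside a program. The sketch below uses Python's standard `time.perf_counter` (wall clock) and `time.process_time` (user + system CPU time of the current process); the helper names `measure` and `busy_then_idle` are hypothetical.

```python
import time


def measure(task):
    """Run task() once and return (wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()  # wall clock time
    cpu_start = time.process_time()   # user + system CPU time of this process
    task()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return wall, cpu


def busy_then_idle():
    sum(i * i for i in range(200_000))  # CPU work: counts toward both clocks
    time.sleep(0.1)                     # idle time: counts toward wall clock only


wall, cpu = measure(busy_then_idle)
print(f"wall = {wall:.3f} s, cpu = {cpu:.3f} s")
```

The sleep stands in for I/O or idle time, which is why the wall clock figure comes out larger than the CPU figure.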
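How can we determine the CPU time needed to execute a program?
$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$
The Iron Law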
CPU Time – Step 1
■ Most modern CPUs are synchronous digital systems
■ The time needed to finish executing a task is the number of cycles needed for that task multiplied by the cycle time.
$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$
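As a small numeric illustration of this formula, the sketch below converts a cycle count into seconds. The cycle count and the clock frequency used in the example are hypothetical values, not figures from the slides.

```python
def cpu_time_seconds(cycle_count: int, clock_hz: float) -> float:
    """CPUTime = CycleCount / ClockFrequency."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycle_count / clock_hz


def cycle_time_seconds(clock_hz: float) -> float:
    """CycleTime is the reciprocal of the clock frequency."""
    return 1.0 / clock_hz


# Hypothetical task: 1,000,000 cycles on a 50 MHz clock.
t = cpu_time_seconds(1_000_000, 50e6)
print(f"CPU time = {t * 1e3:.1f} ms")
```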
Digital system design review…
Synchronous Sequential Circuits
■ A synchronous sequential circuit contains exactly one clock signal
■ All state elements are connected to the same clock signal
  • ⇒ the state of the entire circuit is updated at the same time
■ Common form of synchronous sequential circuits:
[Figure: block diagram with labels input, Comb Logic, clk, output]
Clock Signal
■ A clock signal is a particularly important signal in a synchronous sequential circuit
  • It controls the action of all DFFs
■ A clock signal toggles between ‘0’ and ‘1’ periodically
■ The frequency of the toggling determines the maximum speed of the circuit
  • E.g., in the accumulator example earlier, the output S cannot change faster than the clock frequency
[Figure: timing diagram of the accumulator. Input X: x0, x1, x2. Output S: 0, x0, x0 + x1, x0 + x1 + x2. One clock period of clk is marked.]
$$\frac{1}{\text{clock period}} = \text{clock frequency}$$
E.g., an Intel CPU runs at 3 GHz, mobile phone processors at 1 GHz, and the lab FPGA board at 50 MHz.
Timing in Synchronous Circuits
■ In a synchronous sequential circuit, signal changes occur only at clock edges
■ All signals are therefore synchronized to change values right after a clock edge
■ In the example circuit, the correct value of y must be available BEFORE the next clock edge
  • Avoid glitches
[Figure: circuit with signals a, b, c, d and output y, between registers clocked by clk]
Timing in Synchronous Circuits
■ In general, the propagation delay through the combinational logic between any two registers must be shorter than the clock period
■ The longest such path is called the critical path of the circuit
■ The critical path determines the maximum clock speed
[Figure: combinational logic with signals a, b, x, y between registers clocked by clk; annotation from the glitch example: stable before clock edge]
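The relationship between the critical path and the maximum clock speed can be sketched as follows. The path delays are hypothetical, and the model is simplified: it takes the clock period to equal the longest combinational delay and ignores register setup and clock-to-output times.

```python
def max_clock_hz(path_delays_ns):
    """Return (max clock frequency in Hz, critical path delay in ns).

    Simplifying assumption: the clock period only has to cover the
    longest register-to-register combinational delay.
    """
    if not path_delays_ns:
        raise ValueError("need at least one path delay")
    critical_path_ns = max(path_delays_ns)  # the longest path is the critical path
    return 1e9 / critical_path_ns, critical_path_ns


# Hypothetical delays of three register-to-register paths, in nanoseconds.
f_max, critical = max_clock_hz([2.0, 3.5, 5.0])
print(f"critical path = {critical} ns, max clock = {f_max / 1e6:.0f} MHz")
```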
CPU Time – Step 1 – Summary
$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$
■ To improve performance:
  1. Increase clock frequency
  2. Reduce cycle count
■ Increase clock frequency ⇒ shorter critical path ⇒ less work accomplished in one cycle ⇒ more cycles needed
  • Engineers need to trade off between the two
How many cycles does it take to finish a program?
CPU Time – Step 2 – Cycles Per Instruction (CPI)
■ Program A has 2000 instructions, and each instruction takes 2 cycles to finish. How many cycles does it take to complete Program A?
■ Program B has 3000 instructions. 2000 of them take 2 cycles and 1000 of them take 1 cycle. How many cycles does the program take to finish?
$$\text{CycleCount} = \text{InstructionCount} \times \text{CyclesPerInstruction}$$
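Applying the formula to the two programs gives the following worked answers. For Program B the formula is applied to each group of instructions separately and the results are summed.

```latex
\begin{align*}
\text{CycleCount}_A &= 2000 \times 2 = 4000 \\
\text{CycleCount}_B &= 2000 \times 2 + 1000 \times 1 = 5000
\end{align*}
```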
Average CPI
■ In general, different machine instructions may take different amounts of time to complete.
■ Assuming n classes of instructions, the total clock cycle count is:
$$\text{ClockCycles} = \sum_{i=1}^{n} \text{CPI}_i \times \text{InstructionCount}_i$$
■ Weighted average CPI:
$$\text{CPI} = \frac{\text{CycleCount}}{\text{InstructionCount}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{InstructionCount}_i}{\text{InstructionCount}}$$
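The two sums can be written directly in Python. The sketch below applies them to Program B from the earlier question (2000 instructions at 2 cycles, 1000 instructions at 1 cycle); the function names and the list-of-pairs representation are choices made for this illustration.

```python
def total_cycles(classes):
    """classes: list of (cpi_i, instruction_count_i) pairs."""
    return sum(cpi * count for cpi, count in classes)


def average_cpi(classes):
    """Weighted average CPI: each class CPI weighted by its relative frequency."""
    total_instructions = sum(count for _, count in classes)
    return sum(cpi * (count / total_instructions) for cpi, count in classes)


program_b = [(2, 2000), (1, 1000)]
print("cycles:", total_cycles(program_b))
print(f"average CPI: {average_cpi(program_b):.3f}")
```

Both forms agree: the average CPI equals the total cycle count divided by the total instruction count.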
CPI Example (1)
Class        C1    C2    C3
Cycles       1     4     8
Compiler J   100   50    100
■ The ISA of computer A includes 3 classes of instructions that take different numbers of cycles to complete. A program P is compiled using compiler J, resulting in the instruction counts shown above.
■ What is the average CPI of the compiled program?
CPI Example (2)
Class        C1    C2    C3
Cycles       1     4     8
Compiler J   100   50    100
Compiler K   350   100   50
■ A newer compiler K was developed to compile the same program P, resulting in the instruction counts shown above.
■ What is the average CPI of the compiled program using compiler K?
Ans: 2.3
Which compiler was better…?
CPI Example (3)
Class        C1    C2    C3    #instr   #cycle   CPI
Cycles       1     4     8
Compiler J   100   50    100   250      1100     4.4
Compiler K   350   100   50    500      1150     2.3
■ Observation:
  • Compiler J results in higher CPI
  • Compiler K uses more instructions
■ But most importantly:
Compiler J uses fewer cycles ⇒ shorter run time ⇒ better
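The table can be recomputed with a short script, which also makes the point that a lower CPI alone does not decide the winner. The dictionary layout and the `summarize` helper are illustrative choices; the numbers are taken from the table.

```python
CYCLES_PER_CLASS = {"C1": 1, "C2": 4, "C3": 8}
INSTRUCTION_COUNTS = {
    "J": {"C1": 100, "C2": 50, "C3": 100},
    "K": {"C1": 350, "C2": 100, "C3": 50},
}


def summarize(counts):
    """Return (#instr, #cycle, CPI) for one compiler's instruction counts."""
    instructions = sum(counts.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in counts.items())
    return instructions, cycles, cycles / instructions


for compiler, counts in INSTRUCTION_COUNTS.items():
    instr, cycles, cpi = summarize(counts)
    print(f"Compiler {compiler}: #instr={instr} #cycle={cycles} CPI={cpi:.1f}")

# With the same clock, fewer total cycles means shorter run time.
best = min(INSTRUCTION_COUNTS, key=lambda c: summarize(INSTRUCTION_COUNTS[c])[1])
print("Fewest cycles: compiler", best)
```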
Number of Instructions
How many instructions are there in the following code?
    a = 0
    b = a + 1
    c = a + b
    b = c + b
If CPI = 1, how many cycles does it take to complete?
# of instr: 4
# of cycles: 4
Number of Instructions
How many instructions are there in the following code?
          i = 0
    loop: a = a + 1
          i = i + 1
          if i < 10 goto loop
If CPI = 1, how many cycles does it take to complete?
# of STATIC instructions: 4
# of DYNAMIC instructions: 1 + 3 * 10 = 31
# of cycles: 31
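The dynamic count can be checked by simulating the loop and incrementing a counter for every executed instruction. This is a sketch only: the starting value of `a` is not given on the slide and is assumed to be 0 here, and the function name is hypothetical.

```python
def count_dynamic_instructions(limit: int = 10) -> int:
    """Execute the 4-instruction loop and count executed instructions."""
    executed = 0
    a = 0  # assumed initial value; the slide does not specify it

    i = 0
    executed += 1          # i = 0
    while True:
        a = a + 1
        executed += 1      # loop: a = a + 1
        i = i + 1
        executed += 1      # i = i + 1
        executed += 1      # if i < limit goto loop
        if not (i < limit):
            break
    return executed


print(count_dynamic_instructions())  # 1 + 3 * 10 executed instructions
```

The static count stays at 4 regardless of `limit`; only the dynamic count grows with the number of iterations.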
Number of Instructions
To compute: r = a×b
How many instructions are there in the following code?
    r = 0
    for (i=b; i>0; i=i-1)
        r = r + a
# of DYNAMIC instructions: ≈ 3b
# of cycles: ≈ 3b

    r = a * b
# of instructions: 1
# of cycles: 1 (?)
Dynamic # of instructions can be data dependent.
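To see the data dependence concretely, the sketch below multiplies by repeated addition and counts executed operations. The exact count depends on how the loop is translated into machine instructions; this model assumes one instruction each for the two initializations, every loop test, every addition, and every decrement, which gives 3b + 3 and matches the ≈ 3b estimate.

```python
def multiply_by_addition(a: int, b: int):
    """Compute a * b (b >= 0) by repeated addition.

    Returns (result, executed), where executed counts operations under
    the assumed one-instruction-per-operation model described above.
    """
    if b < 0:
        raise ValueError("b must be non-negative")
    executed = 0
    r = 0
    executed += 1              # r = 0
    i = b
    executed += 1              # i = b
    while True:
        executed += 1          # test i > 0 (also counted when it fails)
        if not (i > 0):
            break
        r = r + a
        executed += 1          # r = r + a
        i = i - 1
        executed += 1          # i = i - 1
    return r, executed


for b in (1, 10, 100):
    result, executed = multiply_by_addition(7, b)
    print(f"b={b}: result={result}, executed={executed}")
```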
Instruction Count & CPI
■ The number of instructions in a program depends on
  • Nature of the application
  • Compiler techniques
  • Types of instructions available in the ISA
■ Average cycles per instruction depends on
  • CPU microarchitecture
  • ISA (CISC vs RISC)
  • The current running state of the CPU
■ Different instructions may have different CPI
  • Average CPI is affected by the instruction mix
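Combining All – The Iron Law
$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$
Factors affecting each term:
  • # of instructions / program: Algorithm, Language, Compiler, ISA
  • # of cycles / instruction: Language, Compiler, ISA, Microarchitecture
  • time / cycle: ISA, Hardware design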
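The Iron Law is a product of three factors, which the sketch below evaluates for the Compiler J and Compiler K figures from the earlier CPI example (250 instructions at CPI 4.4, and 500 instructions at CPI 2.3). The 1 ns cycle time is a hypothetical value chosen only to produce concrete numbers.

```python
def cpu_time(instruction_count: int, cpi: float, cycle_time_s: float) -> float:
    """Iron Law: (instructions/program) x (cycles/instruction) x (time/cycle)."""
    return instruction_count * cpi * cycle_time_s


CYCLE_TIME_S = 1e-9  # hypothetical 1 ns cycle (1 GHz clock)

time_j = cpu_time(250, 4.4, CYCLE_TIME_S)
time_k = cpu_time(500, 2.3, CYCLE_TIME_S)
print(f"Compiler J: {time_j * 1e9:.0f} ns")
print(f"Compiler K: {time_k * 1e9:.0f} ns")
```

Changing any one factor in isolation is rarely possible in practice; the factor lists above show that the same design decision often moves two or three terms at once.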
CISC vs RISC
■ CISC: Complex Instruction Set Computer
■ RISC: Reduced Instruction Set Computer
■ CISC and RISC are two different computer design strategies:
[Figure: ISAs arranged between CISC and RISC; labels shown: VAX, x86, PA-RISC, SPARC, MIPS, RISC-V, Alpha, ARM]
CISC
■ ISA includes complex instructions
  • E.g., VAX has a POLY instruction that evaluates polynomials in hardware
■ Includes complex addressing modes
  • Mem-reg; mem-mem; indirect; relative; double-indirect, ...
■ Hardware implements complex instructions using multiple clock cycles
  • Microcode
■ One promise of a CISC ISA is that it allows shorter compiled code and makes compilers easier to write.
  • Still relevant today in embedded systems
■ Drawbacks:
  • Less attractive as compiler techniques improve
  • Complex hardware ⇒ slow
RISC
■ ISA specifies simple instructions
  • Mostly register-register transfer
  • Simple addressing modes
■ Simpler hardware design
  • Allows hardware optimization
  • Faster hardware overall
  • Allows easy pipelining
■ Simple ISA allows compiler optimization
■ Generated code is generally longer
■ Most (if not all) ISAs introduced after the 1980s are RISC
RISC vs CISC – Iron Law
$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$
Microarchitecture                  CPI   Cycle Time
CISC                               >1    short
RISC – single cycle unpipelined    1     long
RISC – pipelined                   1     short
Amdahl’s Law Review
■ Describes the overall speedup of a system due to a speed improvement that applies to a portion of the system.
■ Let P be the portion of the system that can be sped up by a factor of S, with 0 ≤ P ≤ 1
■ Amdahl’s Law states that the overall speedup is:
$$\text{speedup} = \frac{1}{(1 - P) + \frac{P}{S}}$$
■ E.g.: P = 50%, S = 100 ⇒ speedup = 1.98x
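The formula translates directly into a small function. The sketch below reproduces the P = 50%, S = 100 example; the function name `amdahl_speedup` is chosen for this illustration.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the system is sped up by factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must satisfy 0 <= p <= 1")
    if s <= 0:
        raise ValueError("s must be positive")
    # (1 - p) is the part that is unchanged; p / s is the improved part.
    return 1.0 / ((1.0 - p) + p / s)


print(f"P=0.5, S=100: speedup = {amdahl_speedup(0.5, 100):.2f}x")
```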
Amdahl’s Law Example
Class      C1    C2    C3
Cycles     1     4     8
# instr    200   130   60
# cycles   200   520   480
■ Q1: A new implementation of C3 halves its execution length to 4 cycles. How much improvement in performance can be achieved?
$$P = \frac{480}{200 + 520 + 480} = 0.4, \quad S = 2$$
$$\Rightarrow \text{speedup} = \frac{1}{(1 - 0.4) + 0.4 / 2} = 1.25$$
Amdahl’s Law Example
Class      C1    C2    C3
Cycles     1     4     8
# instr    200   130   60
# cycles   200   520   480
■ Q2: Which instruction class, when its cycle count is reduced by half, will result in the most performance improvement?
  • Largest CPI?
  • Most used?
  • Most cycles used?
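One way to explore Q2 is to apply Amdahl’s Law to each class in turn with S = 2, using the "# cycles" row of the table to obtain P. This is a sketch of that calculation; it treats halving C1 (from 1 cycle) purely as arithmetic, without asking whether such an implementation is feasible.

```python
CYCLES_USED = {"C1": 200, "C2": 520, "C3": 480}  # "# cycles" row of the table


def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the run time is sped up by factor s."""
    return 1.0 / ((1.0 - p) + p / s)


total = sum(CYCLES_USED.values())
speedups = {
    cls: amdahl_speedup(cycles / total, 2.0)  # P = share of all cycles
    for cls, cycles in CYCLES_USED.items()
}

for cls, speedup in speedups.items():
    print(f"halving {cls}: speedup = {speedup:.3f}")

best = max(speedups, key=speedups.get)
print("largest improvement:", best)
```

Under these numbers the class that consumes the most cycles gives the largest gain, rather than the class with the largest CPI or the most instructions.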
Amdahl’s Law Implications
■ In most applications, only a portion of the computation can be sped up
  • Improved hardware designs
  • Parallelization
■ Amdahl’s Law ⇒ max speedup is limited by P
  • If only a small portion of the program can be sped up, then it doesn’t matter how large S is
Can we get to a speedup of 10 with P = 0.9?
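A short derivation, offered as one way to answer the question: taking the limit of Amdahl’s Law as S grows shows that 10 is the upper bound for P = 0.9, and it is approached but never reached for any finite S.

```latex
\begin{align*}
\text{speedup}(S) &= \frac{1}{(1 - 0.9) + \frac{0.9}{S}} = \frac{1}{0.1 + \frac{0.9}{S}} < 10
  \quad \text{for all finite } S > 0 \\
\lim_{S \to \infty} \text{speedup}(S) &= \frac{1}{1 - P} = \frac{1}{0.1} = 10 \\
\text{e.g. } S = 100 &: \quad \frac{1}{0.1 + 0.009} \approx 9.17
\end{align*}
```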
Benchmark Programs
■ A benchmark suite is a set of programs used to compare processor performance
■ Needs to be representative of typical workloads
■ Kernel vs whole application
  • Recall Amdahl’s Law
■ Avoid over-optimization for a specific benchmark
■ SPEC benchmarks
  • Several benchmark suites commonly used in computer architecture research
  • E.g., SPEC CPU2006
SPEC CPU Benchmark
■ Programs used to measure performance
  • Supposedly typical of actual workload
■ Standard Performance Evaluation Corp (SPEC)
  • Develops benchmarks for CPU, I/O, Web, …
■ SPEC CPU2006
  • Elapsed time to execute a selection of programs
  • Negligible I/O, so focuses on CPU performance
  • Normalize relative to reference machine
  • Summarize as geometric mean of performance ratios
  • CINT2006 (integer) and CFP2006 (floating-point)
$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
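The geometric mean of the execution time ratios can be computed as below. The ratios in the example are made-up numbers, not SPEC results; each ratio is assumed to be reference time divided by measured time for one benchmark program.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive ratios."""
    if not ratios:
        raise ValueError("need at least one ratio")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    return math.prod(ratios) ** (1.0 / len(ratios))


# Hypothetical execution time ratios for three benchmark programs.
ratios = [10.0, 20.0, 40.0]
print(f"geometric mean = {geometric_mean(ratios):.2f}")
```

For long lists of ratios, summing logarithms and exponentiating the average avoids overflow in the product; the direct form is kept here because it mirrors the formula.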
CINT2006 for Intel Core i7 920
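Matrix-Matrix Multiplication
$$\begin{bmatrix} a_{0,0} & \cdots & a_{0,N-1} \\ \vdots & \ddots & \vdots \\ a_{N-1,0} & \cdots & a_{N-1,N-1} \end{bmatrix} \times \begin{bmatrix} b_{0,0} & \cdots & b_{0,N-1} \\ \vdots & \ddots & \vdots \\ b_{N-1,0} & \cdots & b_{N-1,N-1} \end{bmatrix} = \begin{bmatrix} & \vdots & \\ \cdots & \sum_{k=0}^{N-1} a_{i,k} b_{k,j} & \cdots \\ & \vdots & \end{bmatrix}$$
    r[i][j] = 0
    for (k=0; k<N; k++)
        r[i][j] += a[i][k] * b[k][j]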
Matrix-Matrix Multiplication
$$\begin{bmatrix} & \vdots & \\ \cdots & \sum_{k=0}^{N-1} a_{i,k} b_{k,j} & \cdots \\ & \vdots & \end{bmatrix}$$
    for (i=0; i<N; i++)
      for (j=0; j<N; j++)
        r[i][j] = 0
        for (k=0; k<N; k++)
          r[i][j] += a[i][k] * b[k][j]
Total number of instructions: $N^3$ [×, +, assignment]
■ If all instructions have CPI = 1, then the time to complete is ~$N^3$ cycles.
■ What are the factors that will make this run faster/slower?
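The triple loop can be run as-is in Python with a counter on the innermost statement, which confirms that the multiply-add-assign step executes N^3 times. This is a sketch for square matrices stored as lists of lists; the counter and function name are additions for illustration.

```python
def matmul_counting(a, b):
    """Multiply two N x N matrices and count inner-loop executions."""
    n = len(a)
    r = [[0] * n for _ in range(n)]  # r[i][j] = 0 for every element
    inner_steps = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                r[i][j] += a[i][k] * b[k][j]  # one ×, one +, one assignment
                inner_steps += 1
    return r, inner_steps


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, steps = matmul_counting(a, b)
print(result, steps)  # steps equals N**3 for N = 2
```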
Breadth-First Search
■ Given a graph and a root node (r), visit all nodes reachable from r in the order of their hop distances from r.
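A minimal breadth-first search can be sketched with a queue, as below. The graph, its node names, and the adjacency-list representation are hypothetical; the slide does not prescribe a data structure.

```python
from collections import deque


def bfs_order(graph, root):
    """Return nodes reachable from root, ordered by hop distance from root.

    graph: dict mapping each node to a list of its neighbours.
    """
    visited = {root}
    order = []
    queue = deque([root])
    while queue:
        node = queue.popleft()       # FIFO order keeps closer nodes first
        order.append(node)
        for neighbour in graph.get(node, []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(neighbour)
    return order


# Hypothetical graph: "x" points to "r" but is not reachable from "r".
graph = {
    "r": ["a", "b"],
    "a": ["c"],
    "b": ["c", "d"],
    "c": [],
    "d": [],
    "x": ["r"],
}
print(bfs_order(graph, "r"))
```

Unlike the matrix multiplication loop, the number of instructions executed here depends on the shape of the input graph, not only on its size.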
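CPI Example
■ Computer A: Cycle Time = 250 ps, CPI = 2.0
■ Computer B: Cycle Time = 500 ps, CPI = 1.2
■ Same ISA
■ Which is faster, and by how much?
$$\text{CPU Time}_A = \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A = I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps}$$
$$\text{CPU Time}_B = \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B = I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps}$$
A is faster…
$$\frac{\text{CPU Time}_B}{\text{CPU Time}_A} = \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2$$
…by this much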
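CPI in More Detail
■ If different instruction classes take different numbers of cycles
$$\text{Clock Cycles} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{Instruction Count}_i)$$
■ Weighted average CPI
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$
Relative frequency: $\frac{\text{Instruction Count}_i}{\text{Instruction Count}}$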
CPI Example
■ Alternative compiled code sequences using instructions in classes A, B, C
Class              A    B    C
CPI for class      1    2    3
IC in sequence 1   2    1    2
IC in sequence 2   4    1    1
■ Sequence 1: IC = 5
  • Clock Cycles = 2×1 + 1×2 + 2×3 = 10
  • Avg. CPI = 10/5 = 2.0
■ Sequence 2: IC = 6
  • Clock Cycles = 4×1 + 1×2 + 1×3 = 9
  • Avg. CPI = 9/6 = 1.5
Performance Summary
The BIG Picture
$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$
“Iron Law” of computers
■ Performance depends on
  • Algorithm: affects IC, possibly CPI
  • Programming language: affects IC, CPI
  • Compiler: affects IC, CPI
  • Instruction set architecture: affects IC, CPI, Tc
And in conclusion…
■ The study of computer architecture allows us to construct better computer systems
  • Performance, power
■ Computer architecture is a study that crosses software and hardware
■ We will use RISC-V as the main ISA for class work, but the design principles are applicable to other computer designs
■ The “Iron Law” determines the performance of a CPU
■ ISA, microarchitecture, compilers, and hardware technology all play a role in determining CPU performance
Acknowledgements
■ These slides contain material developed and copyrighted by:
  • Arvind (MIT)
  • Krste Asanovic (MIT/UCB)
  • Joel Emer (Intel/MIT)
  • James Hoe (CMU)
  • John Kubiatowicz (UCB)
  • David Patterson (UCB)
■ MIT material derived from course 6.823
■ UCB material derived from course CS152, CS252