relative performance! - university of hong kongelec3441/sp18/handouts/l02-perf-4up.pdf2nd semester,...

Computer Architecture ELEC3441

2nd Semester, 2017-18 Dr. Hayden Kwok-Hay So

Department of Electrical and Electronic Engineering

Computer Performance

2nd sem. '17-18 ENGG3441 - HS 2

How do you measure performance of a computer?

How do you make a computer fast?

Ways to Measure Performance

•  Execution time (response time) ≠ Throughput
•  We will focus on execution time in this course

Execution Time:  Time to finish a task
Throughput:      Number of tasks finished per unit time

Relative Performance

•  Define the performance of a computer as

$$\text{Performance} = \frac{1}{\text{ExecutionTime}}$$

•  Computer B is n times faster than Computer A if:

$$n = \frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B}$$

Quick Check

•  Computer A finishes a task in 5s, Computer B finishes the same task in 4s. Which one is faster, and by how much?

$$\frac{\text{Performance}_B}{\text{Performance}_A} = \frac{\text{ExecutionTime}_A}{\text{ExecutionTime}_B} = \frac{5}{4} = 1.25$$

Computer B is 1.25 times faster than Computer A
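The quick check can be reproduced in a few lines of Python. This is only a sketch of the ratio defined on the previous slide; the function name `relative_performance` is made up for illustration.

```python
def relative_performance(time_a: float, time_b: float) -> float:
    """Return n such that computer B is n times faster than computer A.

    n = Performance_B / Performance_A = ExecutionTime_A / ExecutionTime_B
    """
    if time_a <= 0 or time_b <= 0:
        raise ValueError("execution times must be positive")
    return time_a / time_b


# Quick Check values: A takes 5 s, B takes 4 s.
n = relative_performance(5.0, 4.0)
print(f"Computer B is {n:.2f} times faster than Computer A")
```

A result below 1 would mean that B is the slower machine.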

Ways to Measure Execution Time

•  Wall Clock Time (Elapsed Time)
   •  The total time a user experiences that a computer takes to finish a task
   •  Includes OS overhead, I/O, idle time, time shared with other users
•  CPU Time
   •  The time spent on a user task in the CPU
   •  User CPU + OS CPU time
   •  Does not include I/O, time spent by other users, etc.
•  Focus on CPU Time in this course

$ time shasum afile
132ecc0e19eec19d5dc775752efeac280cecebdc  afile

real    0m20.177s
user    0m12.835s
sys     0m1.786s
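In the `time` output, `real` is the wall clock time, while `user` + `sys` (12.835 s + 1.786 s = 14.621 s) is the CPU time. The same distinction can be observed from inside a program. The sketch below uses Python's standard `time.perf_counter` (wall clock) and `time.process_time` (user + system CPU time of the current process); the helper name `timed_sha1` is hypothetical, and SHA-1 is chosen only because it is the default of `shasum`.

```python
import hashlib
import time


def timed_sha1(data: bytes):
    """Hash data with SHA-1 and report (digest, wall seconds, CPU seconds)."""
    wall_start = time.perf_counter()   # wall clock (elapsed) time
    cpu_start = time.process_time()    # user + system CPU time of this process
    digest = hashlib.sha1(data).hexdigest()
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return digest, wall_seconds, cpu_seconds


digest, wall, cpu = timed_sha1(b"x" * 10_000_000)
print(f"digest {digest}")
print(f"wall   {wall:.4f} s")
print(f"cpu    {cpu:.4f} s")
```

For an in-memory workload like this one the two numbers are usually close; they diverge when the task waits on I/O or shares the machine with other processes.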


How can we determine CPU time needed to execute a program?

$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

The Iron Law

CPU Time – Step 1

•  Most modern CPUs are synchronous digital systems
•  The time needed to finish executing a task is the number of cycles needed for that task multiplied by the cycle time.

$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$

Digital system design review…

Synchronous Sequential Circuits

•  A synchronous sequential circuit contains exactly 1 clock signal
•  All state elements are connected to the same clock signal
   •  ⇒ the state of the entire circuit is updated at the same time
•  Common form of synchronous sequential circuits:

[Figure: input and output connected through blocks of combinational logic separated by registers, all registers driven by the same clk]

Clock Signal

•  A clock signal is a particularly important signal in a synchronous sequential circuit
   •  It controls the action of all DFFs
•  A clock signal toggles between ‘0’ and ‘1’ periodically
•  The frequency of the toggling determines the maximum speed of the circuit
   •  E.g.: in the accumulator example earlier, the output S cannot change faster than the clock frequency

[Timing diagram: over successive clock periods the input X takes the values x0, x1, x2 while the output S takes the values 0, x0, x0 + x1, x0 + x1 + x2; one clock period is marked on clk]

$$\text{clock frequency} = \frac{1}{\text{clock period}}$$

E.g. an Intel CPU runs at 3 GHz, mobile phone processors at 1 GHz, the lab FPGA board at 50 MHz

Timing in Synchronous Circuits

[Figure: example circuit with inputs a, b, c, d and output y between registers driven by clk, with the clk waveform]

•  In a synchronous sequential circuit, signal changes occur only at clock edges
•  All signals are therefore synchronized to change values right after a clock edge
•  In the above example, need to make sure the correct value of y is available BEFORE the next clock edge
   •  Avoid glitches

Timing in Synchronous Circuits

•  In general, the propagation delay through the combinational logic between any two registers must be shorter than the clock period
•  The longest such path is called the critical path of the circuit
•  The critical path determines the maximum clock speed

[Figure: combinational logic between registers driven by clk, with signals a, b, x, y; timing diagram from the glitch example showing the signal stable before the clock edge]
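The relation between critical path and maximum clock speed can be sketched numerically. The path delays below are hypothetical, and the model is deliberately simplified: it uses only the combinational propagation delay and ignores register setup time, clock-to-output delay and clock skew.

```python
def max_clock_frequency_hz(path_delays_ns):
    """Return the highest clock frequency allowed by the slowest path.

    path_delays_ns: propagation delays (in ns) of the combinational paths
    between registers. The longest one is the critical path, and the clock
    period cannot be shorter than it.
    """
    if not path_delays_ns:
        raise ValueError("at least one path delay is required")
    critical_path_ns = max(path_delays_ns)
    return 1.0 / (critical_path_ns * 1e-9)


# Hypothetical delays for three register-to-register paths.
delays_ns = [1.2, 2.5, 0.8]
f_max = max_clock_frequency_hz(delays_ns)
print(f"critical path = {max(delays_ns)} ns, f_max = {f_max / 1e6:.0f} MHz")
```

Shortening any path other than the 2.5 ns one leaves the maximum frequency unchanged, which is why design effort concentrates on the critical path.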

CPU Time – Step 1 – Summary

$$\text{CPUTime} = \text{CycleCount} \times \text{CycleTime} = \frac{\text{CycleCount}}{\text{ClockFrequency}}$$

•  To improve performance:
   1. Increase clock frequency
   2. Reduce cycle count
•  Increase clock freq ⇒ shorter critical path ⇒ less work accomplished in 1 cycle ⇒ more cycles needed
   •  Engineers need to trade off between the two

How many cycles does it take to finish a program?

CPU Time – Step 2 – Cycles Per Instruction (CPI)

•  Program A has 2000 instructions, and each instruction takes 2 cycles to finish. How many cycles does it take to complete Program A?
•  Program B has 3000 instructions. 2000 of them take 2 cycles and 1000 of them take 1 cycle. How many cycles does the program take to finish?

$$\text{CycleCount} = \text{InstructionCount} \times \text{CyclesPerInstruction}$$
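The two questions above can be worked out by summing instruction count times cycles per instruction over each group of instructions. A minimal sketch (the function name `cycle_count` and the list-of-pairs representation are illustrative choices, not part of the course material):

```python
def cycle_count(instruction_mix):
    """Total cycles for a list of (instruction_count, cycles_per_instruction)."""
    return sum(count * cpi for count, cpi in instruction_mix)


program_a = [(2000, 2)]             # 2000 instructions, 2 cycles each
program_b = [(2000, 2), (1000, 1)]  # 2000 at 2 cycles, 1000 at 1 cycle

print("Program A:", cycle_count(program_a), "cycles")  # 2000 * 2
print("Program B:", cycle_count(program_b), "cycles")  # 2000 * 2 + 1000 * 1
```

Program B executes more instructions than Program A, yet its cycle count grows by only 1000 because the extra instructions are the cheap ones.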

Average CPI

•  In general, different machine instructions may take different amounts of time to complete.
•  Assuming n classes of instructions, the total number of clock cycles is:

$$\text{ClockCycle} = \sum_{i=1}^{n} \text{CPI}_i \times \text{InstructionCount}_i$$

•  Weighted average CPI:

$$\text{CPI} = \frac{\text{CycleCount}}{\text{InstructionCount}} = \sum_{i=1}^{n} \text{CPI}_i \times \frac{\text{InstructionCount}_i}{\text{InstructionCount}}$$

CPI Example (1)

Class         C1    C2    C3
Cycles         1     4     8
Compiler J   100    50   100

•  The ISA of computer A includes 3 classes of instructions that take different numbers of cycles to complete. A program P is compiled using compiler J, resulting in the utilization above.
•  What is the average CPI of the compiled program?

CPI Example (2)

Class         C1    C2    C3
Cycles         1     4     8
Compiler J   100    50   100
Compiler K   350   100    50

•  A newer compiler K was developed to compile the same program P, resulting in the utilization above.
•  What is the average CPI of the compiled program using compiler K?

Ans: 2.3

Which compiler was better…?

CPI Example (3)

Class         C1    C2    C3   #instr   #cycle   CPI
Cycles         1     4     8
Compiler J   100    50   100      250     1100   4.4
Compiler K   350   100    50      500     1150   2.3

•  Observation:
   •  Compiler J results in higher CPI
   •  Compiler K uses more instructions
•  But most importantly:

Compiler J uses fewer cycles ⇒ shorter run time ⇒ better
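The numbers in the table can be recomputed from the per-class cycle counts. The sketch below uses the class data from the slides; the names `CYCLES_PER_CLASS` and `summarize` are illustrative.

```python
# Cycles needed by each instruction class (from the table).
CYCLES_PER_CLASS = {"C1": 1, "C2": 4, "C3": 8}


def summarize(instruction_counts):
    """Return (instruction count, cycle count, average CPI) for one compiler."""
    instructions = sum(instruction_counts.values())
    cycles = sum(CYCLES_PER_CLASS[cls] * n for cls, n in instruction_counts.items())
    return instructions, cycles, cycles / instructions


compiler_j = {"C1": 100, "C2": 50, "C3": 100}
compiler_k = {"C1": 350, "C2": 100, "C3": 50}

for name, counts in (("J", compiler_j), ("K", compiler_k)):
    instructions, cycles, cpi = summarize(counts)
    print(f"Compiler {name}: {instructions} instr, {cycles} cycles, CPI = {cpi:.1f}")
```

Compiler K wins on CPI but loses on total cycles, which is the quantity that determines run time when the clock is the same.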

Number of Instructions

How many instructions are there in the following code?

    a = 0
    b = a + 1
    c = a + b
    b = c + b

If CPI = 1, how many cycles does it take to complete?

# of instr: 4
# of cycles: 4


          i = 0
    loop: a = a + 1
          i = i + 1
          if i < 10 goto loop

If CPI = 1, how many cycles does it take to complete?

# of STATIC instructions: 4
# of DYNAMIC instructions: 1 + 3 * 10 = 31
# of cycles: 31
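The dynamic instruction count can be checked by stepping through the loop and counting every executed instruction. This is a sketch that mirrors the four static instructions one by one; the initial value of `a` is not given on the slide and is assumed to be 0 here (it does not affect the count).

```python
def count_dynamic_instructions(limit: int = 10) -> int:
    """Execute the loop and count how many instructions actually run."""
    executed = 0

    i = 0                 # i = 0
    executed += 1
    a = 0                 # assumed initial value, not counted as an instruction

    while True:
        a = a + 1         # loop: a = a + 1
        executed += 1
        i = i + 1         # i = i + 1
        executed += 1
        executed += 1     # if i < limit goto loop (the branch itself)
        if not (i < limit):
            break

    return executed


print("dynamic instructions:", count_dynamic_instructions(10))
```

With CPI = 1 the cycle count equals this dynamic count, not the static count of 4.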


To compute: r = a×b

    r = 0
    for (i=b; i>0; i=i-1)
        r = r + a

# of DYNAMIC instructions: ≈ 3b
# of cycles: ≈ 3b

    r = a * b

# of instructions: 1
# of cycles: 1 (?)

Dynamic # of instructions can be data dependent.

Instruction Count & CPI

•  The number of instructions in a program depends on
   •  Nature of the application
   •  Compiler techniques
   •  Types of instructions available in the ISA
•  Average cycles per instruction depends on
   •  CPU microarchitecture
   •  ISA (CISC vs RISC)
   •  The current running state of the CPU
•  Different instructions may have different CPI
   •  Average CPI is affected by the instruction mix

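Combining All – The Iron Law

$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

Factors affecting each term:

•  # of instructions / program: Algorithm, Language, Compiler, ISA
•  # of cycles / instruction: Language, Compiler, ISA, Microarchitecture
•  time / cycle: ISA, Hardware design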

CISC vs RISC

•  CISC: Complex Instruction Set Computer
•  RISC: Reduced Instruction Set Computer
•  CISC and RISC are two different computer design strategies:

[Figure: ISAs placed between CISC and RISC: VAX, x86, PA-RISC, SPARC, MIPS, RISC-V, Alpha, ARM]

CISC

•  ISA includes complex instructions
   •  E.g. VAX has a POLY instruction that evaluates a polynomial in hardware
•  Includes complex addressing modes
   •  Mem-reg; mem-mem; indirect; relative; double-indirect...
•  Hardware implements complex instructions using multiple clock cycles
   •  microcode
•  One promise of a CISC ISA is that it allows shorter compiled code and makes the compiler simpler.
   •  Still relevant today in embedded systems
•  Drawbacks:
   •  Less attractive as compiler techniques improve
   •  Complex hardware ⇒ slow

RISC

•  ISA specifies simple instructions
   •  Mostly register-register transfer
   •  Simple addressing modes
•  Simpler hardware design
   •  Allows hardware optimization
   •  Faster hardware overall
   •  Allows easy pipelining
•  Simple ISA allows compiler optimization
•  Generated code is generally longer
•  Most (if not all) ISAs designed after the 80s are RISC

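RISC vs CISC – Iron Law

$$\text{CPUTime} = \frac{\text{\# of instructions}}{\text{program}} \times \frac{\text{\# of cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}$$

Microarchitecture                    CPI    Cycle Time
CISC                                 >1     short
RISC – single cycle unpipelined      1      long
RISC – pipelined                     1      short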

Amdahl’s Law Review

•  Describes the overall speedup of a system due to a speed improvement that applies to a portion of the system.
•  Let P be the portion of the system that can be sped up by a factor of S,
•  Amdahl’s Law states that the overall speedup is:

$$\text{speedup} = \frac{1}{(1 - P) + \dfrac{P}{S}}, \qquad 0 \le P \le 1$$

•  E.g.: P = 50%, S = 100 ⇒ speedup = 1.98x
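Amdahl's Law translates directly into a one-line function. The sketch below follows the formula on this slide, with P as the fraction that can be sped up and S as the speedup of that fraction; the function name `amdahl_speedup` and the argument checks are additions for illustration.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must satisfy 0 <= p <= 1")
    if s <= 0.0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - p) + p / s)


# Example from the slide: P = 50%, S = 100.
print(f"speedup = {amdahl_speedup(0.5, 100):.2f}x")
```

Even with a 100-fold improvement on half of the work, the untouched half keeps the overall speedup just under 2.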

Amdahl’s Law Example

Class       C1    C2    C3
Cycles       1     4     8
# instr    200   130    60
# cycles   200   520   480

•  Q1: a new implementation of C3 reduces its execution length by half to 4 cycles. How much improvement in performance can be achieved?

$$P = \frac{480}{200 + 520 + 480} = 0.4, \qquad S = 2$$

$$\Rightarrow \text{speedup} = \frac{1}{(1 - 0.4) + 0.4 / 2} = 1.25$$

Amdahl’s Law Example

Class       C1    C2    C3
Cycles       1     4     8
# instr    200   130    60
# cycles   200   520   480

•  Q2: Which instruction class, when its cycle count is reduced by half, will result in the most performance improvement?
   •  Largest CPI?
   •  Most used?
   •  Most cycles used?
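One way to explore Q2 is to halve the cycles of each class in turn and compare total cycle counts before and after. The sketch below takes this direct route rather than the Amdahl formula (the two give the same result); the data comes from the table and the names are illustrative.

```python
CYCLES_PER_INSTR = {"C1": 1, "C2": 4, "C3": 8}
INSTR_COUNT = {"C1": 200, "C2": 130, "C3": 60}


def speedup_if_halved(target: str) -> float:
    """Speedup obtained when the cycles spent in class `target` are halved."""
    old_total = sum(CYCLES_PER_INSTR[c] * INSTR_COUNT[c] for c in CYCLES_PER_INSTR)
    saved = CYCLES_PER_INSTR[target] * INSTR_COUNT[target] / 2
    return old_total / (old_total - saved)


speedups = {cls: speedup_if_halved(cls) for cls in CYCLES_PER_INSTR}
for cls, value in speedups.items():
    print(f"halve {cls}: speedup = {value:.3f}")

best = max(speedups, key=speedups.get)
print("largest improvement:", best)
```

The class that accounts for the most cycles gives the largest gain, not the class with the largest CPI or the one used most often.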

Amdahl’s Law Implications

•  In most applications, only a portion of the computation can be sped up, e.g. by
   •  improved hardware designs
   •  parallelization
•  Amdahl’s Law ⇒ max speedup is limited by P
   •  If only a small portion of the program can be sped up, then it doesn’t matter how large S is

Can we get to a speedup of 10 with P=0.9?
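A quick numerical experiment helps with this question. As S grows, the P/S term vanishes and the speedup approaches 1 / (1 − P), which is 10 for P = 0.9. The sketch below evaluates the formula for a few arbitrarily chosen values of S.

```python
def amdahl_speedup(p: float, s: float) -> float:
    """Overall speedup when a fraction p of the work is sped up by a factor s."""
    return 1.0 / ((1.0 - p) + p / s)


P = 0.9
limit = 1.0 / (1.0 - P)  # value approached as S grows without bound

for s in (10, 100, 1_000, 1_000_000):  # arbitrary sample values of S
    print(f"S = {s:>9}: speedup = {amdahl_speedup(P, s):.4f}")

print(f"limit as S grows: {limit:.1f}")
```

The speedup gets arbitrarily close to 10 but stays below it for every finite S, because the remaining 10% of the work still takes its original time.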

Benchmark Programs

•  A benchmark suite is a set of programs used to compare processor performance
•  Needs to be representative of a typical workload
•  Kernel vs whole application
   •  Recall Amdahl’s Law
•  Avoid over-optimization for a specific benchmark
•  SPEC benchmark
   •  Several benchmark suites commonly used in computer architecture research
   •  E.g. SPEC CPU2006

SPEC CPU Benchmark

•  Programs used to measure performance
   •  Supposedly typical of actual workload
•  Standard Performance Evaluation Corp (SPEC)
   •  Develops benchmarks for CPU, I/O, Web, …
•  SPEC CPU2006
   •  Elapsed time to execute a selection of programs
      •  Negligible I/O, so focuses on CPU performance
   •  Normalize relative to reference machine
   •  Summarize as geometric mean of performance ratios
      •  CINT2006 (integer) and CFP2006 (floating-point)

$$\sqrt[n]{\prod_{i=1}^{n} \text{Execution time ratio}_i}$$
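The geometric mean of the per-benchmark ratios can be computed as follows. This is a sketch: the ratios in the example are hypothetical and are not SPEC results, and the sum-of-logarithms form is used only to avoid overflow when many ratios are multiplied.

```python
import math


def geometric_mean(ratios):
    """n-th root of the product of n positive execution time ratios."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be positive")
    # exp(mean(log r_i)) equals (r_1 * r_2 * ... * r_n) ** (1 / n)
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical ratios (reference time / measured time) for four programs.
example_ratios = [12.0, 18.5, 9.3, 25.1]
print(f"summary score = {geometric_mean(example_ratios):.2f}")
```

A useful property of the geometric mean is that the ranking of two machines does not depend on which machine is chosen as the reference.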

CINT2006 for Intel Core i7 920


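Matrix-Matrix Multiplication

$$
\begin{bmatrix}
a_{0,0} & \cdots & a_{0,N-1} \\
\vdots & \ddots & \vdots \\
a_{N-1,0} & \cdots & a_{N-1,N-1}
\end{bmatrix}
\times
\begin{bmatrix}
b_{0,0} & \cdots & b_{0,N-1} \\
\vdots & \ddots & \vdots \\
b_{N-1,0} & \cdots & b_{N-1,N-1}
\end{bmatrix}
=
\begin{bmatrix}
 & \vdots & \\
\cdots & \sum_{k=0}^{N-1} a_{i,k} b_{k,j} & \cdots \\
 & \vdots &
\end{bmatrix}
$$

    r[i][j] = 0;
    for (k=0; k<N; k++)
        r[i][j] += a[i][k] * b[k][j];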

Matrix-Matrix Multiplication

$$
=
\begin{bmatrix}
 & \vdots & \\
\cdots & \sum_{k=0}^{N-1} a_{i,k} b_{k,j} & \cdots \\
 & \vdots &
\end{bmatrix}
$$

    for (i=0; i<N; i++)
        for (j=0; j<N; j++) {
            r[i][j] = 0;
            for (k=0; k<N; k++)
                r[i][j] += a[i][k] * b[k][j];
        }

Total number of instructions: N^3 [×, +, assignment]

•  If all instructions have CPI=1, then the time to complete is ~N^3 cycles.
•  What are the factors that will make this run faster/slower?
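The N^3 figure can be confirmed by counting how many times the innermost multiply-accumulate statement runs. The sketch below is a plain Python version of the triple loop with a counter added; it counts only the innermost statement and ignores loop overhead such as index increments and branches.

```python
def matmul_with_count(a, b):
    """Multiply two N x N matrices and count the multiply-accumulate steps."""
    n = len(a)
    r = [[0] * n for _ in range(n)]
    mac_steps = 0
    for i in range(n):
        for j in range(n):
            for k in range(n):
                r[i][j] += a[i][k] * b[k][j]  # one ×, one +, one assignment
                mac_steps += 1
    return r, mac_steps


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
result, steps = matmul_with_count(a, b)
print(result, "multiply-accumulate steps:", steps)
```

Doubling N multiplies the step count by 8, so the instruction count term of the Iron Law dominates quickly for large matrices.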

Breadth-First Search

•  Given a graph and a root node (r), visit all nodes reachable from r in the order of their hop distances from r.


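CPI Example

•  Computer A: Cycle Time = 250ps, CPI = 2.0
•  Computer B: Cycle Time = 500ps, CPI = 1.2
•  Same ISA
•  Which is faster, and by how much?

$$
\begin{aligned}
\text{CPU Time}_A &= \text{Instruction Count} \times \text{CPI}_A \times \text{Cycle Time}_A \\
&= I \times 2.0 \times 250\,\text{ps} = I \times 500\,\text{ps} && \text{(A is faster…)} \\
\text{CPU Time}_B &= \text{Instruction Count} \times \text{CPI}_B \times \text{Cycle Time}_B \\
&= I \times 1.2 \times 500\,\text{ps} = I \times 600\,\text{ps} \\
\frac{\text{CPU Time}_B}{\text{CPU Time}_A} &= \frac{I \times 600\,\text{ps}}{I \times 500\,\text{ps}} = 1.2 && \text{(…by this much)}
\end{aligned}
$$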
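The comparison of computers A and B is an application of the Iron Law with the same instruction count on both machines. In the sketch below the instruction count is a hypothetical value; it cancels out in the ratio, so any positive number gives the same answer.

```python
def cpu_time_ps(instruction_count: float, cpi: float, cycle_time_ps: float) -> float:
    """Iron Law: CPU time = instruction count × CPI × cycle time."""
    return instruction_count * cpi * cycle_time_ps


I = 1_000_000  # hypothetical instruction count, identical for A and B (same ISA)

time_a = cpu_time_ps(I, cpi=2.0, cycle_time_ps=250)
time_b = cpu_time_ps(I, cpi=1.2, cycle_time_ps=500)

print(f"CPU time A = {time_a:.3e} ps")
print(f"CPU time B = {time_b:.3e} ps")
print(f"A is {time_b / time_a:.1f} times faster than B")
```

Computer B has the better CPI, but its cycle time is twice as long, so the product of the two terms favours A.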

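CPI in More Detail

•  If different instruction classes take different numbers of cycles

$$\text{Clock Cycles} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \text{Instruction Count}_i \right)$$

•  Weighted average CPI

$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \sum_{i=1}^{n} \left( \text{CPI}_i \times \frac{\text{Instruction Count}_i}{\text{Instruction Count}} \right)$$

where the fraction Instruction Count_i / Instruction Count is the relative frequency of class i.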

CPI Example

•  Alternative compiled code sequences using instructions in classes A, B, C

Class               A    B    C
CPI for class       1    2    3
IC in sequence 1    2    1    2
IC in sequence 2    4    1    1

•  Sequence 1: IC = 5
   •  Clock Cycles = 2×1 + 1×2 + 2×3 = 10
   •  Avg. CPI = 10/5 = 2.0
•  Sequence 2: IC = 6
   •  Clock Cycles = 4×1 + 1×2 + 1×3 = 9
   •  Avg. CPI = 9/6 = 1.5

Performance Summary

The BIG Picture

$$\text{CPU Time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

“Iron Law” of computers

•  Performance depends on
   •  Algorithm: affects IC, possibly CPI
   •  Programming language: affects IC, CPI
   •  Compiler: affects IC, CPI
   •  Instruction set architecture: affects IC, CPI, Tc

And in conclusion…

•  The study of computer architecture allows us to construct better computer systems
   •  Performance, power
•  Computer architecture is a study that crosses software and hardware
•  We will use RISC-V as the main ISA for class work, but the design principles are applicable to other computer designs
•  The “Iron Law” determines the performance of a CPU
•  ISA, microarchitecture, compilers, and hardware technology all play a role in determining CPU performance

Acknowledgements

•  These slides contain material developed and copyrighted by:
   •  Arvind (MIT)
   •  Krste Asanovic (MIT/UCB)
   •  Joel Emer (Intel/MIT)
   •  James Hoe (CMU)
   •  John Kubiatowicz (UCB)
   •  David Patterson (UCB)
•  MIT material derived from course 6.823
•  UCB material derived from course CS152, CS252
