CpE 442 Intro. To Computer Architecture, community.wvu.edu/~hhammar/cpe242_files/review.pdf

CS455/CpE 442 Intro. To Computer Architecture Review for Term Exam

Upload: trankhanh

Post on 06-Feb-2018


CS455/CpE 442 Intro. To Computer Architecture

Review for Term Exam


The Role of Performance

• Text 3rd Edition, Chapter 4
• Main focus topics

– Compare the performance of different architectures or architectural variations in executing a given application

– Determine the CPI for an executable application on a given architecture

– HW1 solutions, 2.11, 2.12, 2.13


• Q2.13 [10] <§§2.2-2.3> Consider two different implementations, M1 and M2, of the same instruction set. There are three classes of instructions (A, B, and C) in the instruction set. M1 has a clock rate of 400 MHz, and M2 has a clock rate of 200 MHz. The average number of cycles per instruction (CPI) for each class of instruction on M1 and M2 is given in the following table:

Class   CPI on M1   CPI on M2   Instruction mix for C1   Instruction mix for C2   Instruction mix for C3
A       4           2           30%                      30%                      50%
B       6           4           50%                      20%                      30%
C       8           3           20%                      50%                      20%

• i. Using C1 on both M1 and M2, how much faster can the makers of M1 claim that M1 is compared with M2?

• ii. Using C2 on both M1 and M2, how much faster can the makers of M2 claim that M2 is compared with M1?

• iii. If you purchase M1 which of the three compilers would you choose?

• iv. If you purchase M2 which of the three compilers would you choose?


Sol.

Using the C1 compiler:
M1: average CPI = 0.3*4 + 0.5*6 + 0.2*8 = 5.8
    CPU time per instruction = CPI / clock rate = 5.8 / (400*10^6) = 0.0145*10^-6 s
M2: average CPI = 0.3*2 + 0.5*4 + 0.2*3 = 3.2
    CPU time per instruction = 3.2 / (200*10^6) = 0.016*10^-6 s
Thus, M1 is 0.016 / 0.0145 = 1.10 times as fast as M2.

Using the C2 compiler (same method):
M1: CPU time per instruction = 0.016*10^-6 s
M2: CPU time per instruction = 0.0145*10^-6 s
Thus, M2 is 0.016 / 0.0145 = 1.10 times as fast as M1.

Using the third-party compiler (C3):
M1: CPU time per instruction = 0.0135*10^-6 s
M2: CPU time per instruction = 0.014*10^-6 s
Thus, M1 is 0.014 / 0.0135 = 1.04 times as fast as M2.

The third-party compiler is the superior product regardless of machine purchase, because it gives the lowest CPU time on both M1 and M2. M1 is the machine to purchase when using the third-party compiler.
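The arithmetic in this solution can be checked with a short script. The following is a minimal sketch in Python; the dictionary and function names are illustrative and not part of the course material, and C3 is taken to be the third-party compiler.

```python
# Clock rates in Hz and per-class CPI for each machine (from the Q2.13 table).
CLOCK_HZ = {"M1": 400e6, "M2": 200e6}
CPI = {"M1": {"A": 4, "B": 6, "C": 8}, "M2": {"A": 2, "B": 4, "C": 3}}

# Instruction mix produced by each compiler (fraction of instructions per class).
MIX = {
    "C1": {"A": 0.30, "B": 0.50, "C": 0.20},
    "C2": {"A": 0.30, "B": 0.20, "C": 0.50},
    "C3": {"A": 0.50, "B": 0.30, "C": 0.20},
}


def average_cpi(machine, compiler):
    """Weighted average cycles per instruction for one machine/compiler pair."""
    return sum(MIX[compiler][c] * CPI[machine][c] for c in "ABC")


def time_per_instruction(machine, compiler):
    """Average seconds per instruction = average CPI / clock rate."""
    return average_cpi(machine, compiler) / CLOCK_HZ[machine]


if __name__ == "__main__":
    for compiler in MIX:
        t1 = time_per_instruction("M1", compiler)
        t2 = time_per_instruction("M2", compiler)
        print(f"{compiler}: M1 {t1 * 1e9:.1f} ns, M2 {t2 * 1e9:.1f} ns, "
              f"M2/M1 = {t2 / t1:.2f}")
```

A ratio M2/M1 above 1 means M1 is faster for that compiler; below 1 means M2 is faster.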


The Instruction Set Architecture

• Text, Ch. 2
• Compare instruction set architectures based on their complexity (instruction format, number of operands, addressing modes, operations supported)
• Instruction set architecture types
– Register-to-register
– Register-to-memory
– Memory-to-memory
• HW2 solutions


2.51 Suppose we have made the following measurements of average CPI for instructions:

Instruction          Average CPI
Arithmetic           1.0 clock cycles
Data Transfer        1.4 clock cycles
Conditional Branch   1.7 clock cycles
Jump                 1.2 clock cycles

Compute the effective CPI for MIPS. Average the instruction frequencies for SPEC2000int and SPEC2000fp in figure 2.48 to obtain the instruction mix.

Class           CPI   Avg. Freq (int & fp)   CxF
Arithmetic      1.0   0.36                   0.36
Data Transfer   1.4   0.375                  0.525
Cond. Branch    1.7   0.12                   0.204
Jump            1.2   0.03                   0.036
Effective CPI                                1.125

The effective CPI for MIPS is 1.125. This seems inaccurate because the table does not include a CPI for logical operations, so the listed frequencies sum to only 0.885 rather than 1.
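The weighted sum behind this result can be reproduced in a few lines. This is a small sketch in Python using only the CPI and frequency values listed above; the names `ROWS`, `effective_cpi`, and `covered_fraction` are illustrative.

```python
# (instruction class, average CPI, average frequency over SPEC2000int and SPEC2000fp)
ROWS = [
    ("Arithmetic", 1.0, 0.36),
    ("Data Transfer", 1.4, 0.375),
    ("Cond. Branch", 1.7, 0.12),
    ("Jump", 1.2, 0.03),
]


def effective_cpi(rows):
    """Sum of CPI x frequency over the listed instruction classes."""
    return sum(cpi * freq for _, cpi, freq in rows)


def covered_fraction(rows):
    """Fraction of the instruction mix that the listed classes account for."""
    return sum(freq for _, _, freq in rows)


if __name__ == "__main__":
    print(f"effective CPI    = {effective_cpi(ROWS):.3f}")
    print(f"covered fraction = {covered_fraction(ROWS):.3f}")
```

The second function makes the gap visible: the listed classes do not cover the whole instruction mix, which is why the computed CPI is only an approximation.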


The Processor: Data Path and Control

• Text, Ch. 5
• The data path organization: functional units and their interconnections needed to support the instruction set.
• The control unit design
– Hardwired vs microprogramming design
• HW3 and HW4


Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp2  JMPReg
R-type   1       0       0         1         0        0         0       1       0       0
lw       0       1       1         1         1        0         0       0       0       0
sw       x       1       x         0         0        1         0       0       0       0
beq      x       0       x         0         0        0         1       0       1       0
jr       x       x       x         0         x        0         0       x       x       1


Instr    RegDst  ALUSrc  MemtoReg  RegWrite  MemRead  MemWrite  Branch  ALUOp1  ALUOp2  LUICtr
R-type   1       0       0         1         0        0         0       1       0       0
lw       0       1       1         1         1        0         0       0       0       0
sw       x       1       x         0         0        1         0       0       0       0
beq      x       0       x         0         0        0         1       0       1       0
lui      0       x       x         1         x        0         0       x       x       1


The concept of the “critical path” , the longest possible path in the machine, was introduced in 5.4 on page 315. Based on your understanding of the single-cycle implementation, show which units can tolerate more delays (i.e. are not on the critical path), and which units can benefit from hardware optimization. Quantify your answers taking the same numbers presented on page 315.

The longest path is that of the load instruction (instruction memory, register file, ALU, data memory, register file), so the units on it can benefit from hardware optimization.

Using the numbers from pg. 315:
Mem units: 200ps
ALU & Adders: 100ps
Register File: 50ps

Critical path = 200+50+100+200+50 = 600ps (for lw)

The path between the adders and the PC can tolerate more delay because it does not lie on the critical path. Any unit on the critical path (instruction memory, register file, ALU, data memory) would benefit from hardware optimization, since this would make the critical path shorter.
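The critical-path sum can be sketched in Python as below. The unit latencies and the lw path come from the solution above; the unit lists for R-type, sw, and beq are assumptions based on the usual single-cycle MIPS datapath and are not stated on this slide, and all names are illustrative.

```python
# Unit latencies in picoseconds (from the solution above).
LATENCY_PS = {"memory": 200, "alu": 100, "register_file": 50}

# Units each instruction class passes through, in order.
# Only the lw path is given in the slide; the others are assumed.
PATHS = {
    "lw": ["memory", "register_file", "alu", "memory", "register_file"],
    "sw": ["memory", "register_file", "alu", "memory"],
    "R-type": ["memory", "register_file", "alu", "register_file"],
    "beq": ["memory", "register_file", "alu"],
}


def path_delay(instr, latency=LATENCY_PS):
    """Total delay in ps of the units an instruction passes through."""
    return sum(latency[unit] for unit in PATHS[instr])


def critical_instruction(latency=LATENCY_PS):
    """Instruction class whose path sets the clock period."""
    return max(PATHS, key=lambda instr: path_delay(instr, latency))


if __name__ == "__main__":
    for instr in PATHS:
        print(f"{instr:7s} {path_delay(instr)} ps")
    print("critical:", critical_instruction())
```

Passing a modified `latency` dictionary (for example, a faster ALU) shows how much an optimization of one unit shortens the clock period.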


[Figure: state diagram for the LDI instruction; surviving labels: IorD=0, MemRead, (pc=pc+4 cont.)]


Micro-program for LDI


Pipelined Architectures

• Text, Ch. 6
• Stages of a pipelined data path
• Pipeline hazards
• Pipelined performance: number of cycles to execute a code segment (and the effective CPI); look for dependencies in sequences involving lw and branch instructions (delay cycles)
• HW5

6.22
lw  $4, 100($2)
sub $6, $4, $3
add $2, $3, $5

number of cycles = k + (n-1) + delay cycles = 5+2+1 = 8
eff. CPI = #cycles / #instructions = 8/3
k = no. of stages, n = no. of instructions

The one delay cycle comes from the load-use dependence on $4 between lw and sub.
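The cycle-count formula used for 6.22 can be written as a small helper. This is a sketch in Python; the function names are illustrative, and the number of delay cycles must still be found by hand from the dependencies in the code segment.

```python
def pipeline_cycles(stages, instructions, delay_cycles=0):
    """Cycles to run a code segment: k + (n - 1) + delay cycles."""
    return stages + (instructions - 1) + delay_cycles


def pipeline_effective_cpi(stages, instructions, delay_cycles=0):
    """Effective CPI = total cycles / number of instructions."""
    return pipeline_cycles(stages, instructions, delay_cycles) / instructions


if __name__ == "__main__":
    # 6.22: 5-stage pipeline, 3 instructions, 1 delay cycle
    print(pipeline_cycles(5, 3, 1))
    print(round(pipeline_effective_cpi(5, 3, 1), 2))
```

For long instruction sequences the k - 1 fill cycles become negligible, so the effective CPI approaches 1 plus the average number of delay cycles per instruction.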


The Memory Hierarchy

• Text, Ch. 7
• The levels of memory hierarchy, and the principle of locality
• Cache Design, direct-mapped, fully associative and set associative
– Cache access, factors affecting the miss rate, and the miss penalty
• Virtual memory, address map, page tables, and the TLB
• HW6


1 KB Direct Mapped Cache with 32 B Blocks

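[Figure: the 32-bit address is split into Cache Tag (bits 31-10, Ex: 0x50), Cache Index (bits 9-5, Ex: 0x01), and Byte Select (bits 4-0, Ex: 0x00). Each of the 32 cache entries (index 0 to 31) holds a Valid Bit and a Cache Tag, stored as part of the cache “state”, plus 32 bytes of Cache Data: entry 0 holds Byte 0 to Byte 31, entry 1 holds Byte 32 to Byte 63, and entry 31 holds Byte 992 to Byte 1023.]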
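For a 1 KB direct-mapped cache with 32 B blocks there are 32 blocks, so a 32-bit address splits into a 5-bit byte select, a 5-bit cache index, and a 22-bit tag. The following Python sketch shows that split; the function name and the sample address 0x14020 are illustrative, with the address chosen so that it reproduces the example field values on the slide (tag 0x50, index 0x01, byte select 0x00).

```python
CACHE_BYTES = 1024
BLOCK_BYTES = 32
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES        # 32 blocks

# Field widths; sizes are assumed to be powers of two.
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1     # 5
INDEX_BITS = NUM_BLOCKS.bit_length() - 1       # 5


def split_address(addr):
    """Split a byte address into (tag, cache index, byte select)."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


if __name__ == "__main__":
    tag, index, byte_select = split_address(0x14020)
    print(hex(tag), hex(index), hex(byte_select))
```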


And yet Another Extreme Example: Fully Associative

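[Figure: fully associative cache with 32 B blocks. The address is split into Cache Tag (27 bits long, bits 31-5) and Byte Select (bits 4-0, Ex: 0x01); there is no Cache Index. Every entry holds a Valid Bit, a Cache Tag, and 32 bytes of Cache Data (Byte 0 to Byte 31, Byte 32 to Byte 63, ...). The X marks beside the entries presumably denote the tag comparators, one per entry.]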


Review: 4-way set associative


HW6 Problem 1

• 32 bit address space, 32Kbytes cache
– Direct-mapped cache (32 byte blocks)
Byte select = 5 bits (lowest-order bits 0-4)
Number of blocks = 32768 bytes / 32 bytes per block = 1024
Cache index = block address modulo 1024, which needs log2(1024) = 10 bits (low order, after byte select)
Tag = 32 - byte select - cache index = 32 - 5 - 10 = 17 bits (high order)
– 8 way set associative cache (16 byte blocks), 8 blocks / set
Byte select for 16 byte blocks = 4 bits
Sets = 32768 bytes / 128 bytes per set = 256 sets
Cache index = block address modulo 256 sets, which needs log2(256) = 8 bits
Tag = 32 - 8 - 4 = 20 bits
– Fully associative cache (128 byte blocks)
Byte select = 7 bits
Cache index does not exist because blocks in memory can be placed in any cache entry
Tag = 32 - 7 = 25 bits
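The three cases follow one rule: byte-select bits come from the block size, index bits from the number of sets, and the tag takes whatever remains of the address. A minimal Python sketch of that rule is below; `cache_fields` and its parameters are illustrative names, and all sizes are assumed to be powers of two.

```python
def cache_fields(cache_bytes, block_bytes, ways, address_bits=32):
    """Return (tag_bits, index_bits, byte_select_bits).

    ways = 1 models a direct-mapped cache; ways = number of blocks
    models a fully associative cache (a single set, so no index bits).
    """
    num_blocks = cache_bytes // block_bytes
    num_sets = num_blocks // ways
    byte_select_bits = block_bytes.bit_length() - 1   # log2(block size)
    index_bits = num_sets.bit_length() - 1            # log2(number of sets)
    tag_bits = address_bits - index_bits - byte_select_bits
    return tag_bits, index_bits, byte_select_bits


if __name__ == "__main__":
    size = 32 * 1024
    print(cache_fields(size, 32, 1))              # direct mapped
    print(cache_fields(size, 16, 8))              # 8-way set associative
    print(cache_fields(size, 128, size // 128))   # fully associative
```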


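Problem 7.46

word ReadDirectMappedCache(address a)
{
    static Entry cache[CACHE_SIZE_IN_WORDS];
    Entry *e = &cache[a.index];      /* pointer, so the update stays in the cache */
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: fill the entry from memory */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    return e->data;
}

Modified to the following for multi-word blocks

word ReadDirectMappedCache(address a)
{
    static Entry cache[CACHE_SIZE_IN_BLOCKS];
    Entry *e = &cache[a.index];      /* one entry per block */
    if (e->valid == FALSE || e->tag != a.tag) {
        /* miss: load_from_memory is assumed to fetch the whole block here */
        e->valid = TRUE;
        e->tag = a.tag;
        e->data = load_from_memory(a);
    }
    return e->data[a.word_index];    /* select the requested word in the block */
}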
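The multi-word version can be exercised with a small runnable model. The sketch below is in Python rather than the pseudocode's C-like notation; the cache geometry (8 blocks of 4 words), the word-addressed `memory` list, and the `misses` counter are assumptions added for illustration and are not part of Problem 7.46.

```python
WORDS_PER_BLOCK = 4   # assumed block size in words
NUM_BLOCKS = 8        # assumed number of cache entries


class DirectMappedCache:
    def __init__(self, memory):
        self.memory = memory                  # word-addressed backing store
        self.valid = [False] * NUM_BLOCKS
        self.tags = [0] * NUM_BLOCKS
        self.data = [[0] * WORDS_PER_BLOCK for _ in range(NUM_BLOCKS)]
        self.misses = 0

    def read(self, word_addr):
        # Split the word address into word index, cache index, and tag.
        word_index = word_addr % WORDS_PER_BLOCK
        block_addr = word_addr // WORDS_PER_BLOCK
        index = block_addr % NUM_BLOCKS
        tag = block_addr // NUM_BLOCKS
        if not self.valid[index] or self.tags[index] != tag:
            # Miss: bring in the whole block that contains the word.
            start = block_addr * WORDS_PER_BLOCK
            self.data[index] = self.memory[start:start + WORDS_PER_BLOCK]
            self.valid[index] = True
            self.tags[index] = tag
            self.misses += 1
        return self.data[index][word_index]


if __name__ == "__main__":
    cache = DirectMappedCache(list(range(100, 228)))
    for addr in (5, 6, 37, 5):
        print(addr, cache.read(addr), cache.misses)
```

Reading words 5 and 6 touches the same block, so only the first access misses; word 37 maps to the same cache index with a different tag and evicts that block.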