ee 660: computer architecture lecture 1: introduction and

115
EE 660: Computer Architecture Lecture 1: Introduction and Instruction Set Architectures Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa 1 Based on the slides of Prof. David Wentzlaff

Upload: others

Post on 20-Apr-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: EE 660: Computer Architecture Lecture 1: Introduction and

EE 660: Computer Architecture

Lecture 1: Introduction and Instruction Set Architectures

Yao Zheng Department of Electrical Engineering

University of Hawaiʻi at Mānoa

1 Based on the slides of Prof. David Wentzlaff

Page 2: EE 660: Computer Architecture Lecture 1: Introduction and

What is Computer Architecture? Application

2

Page 3: EE 660: Computer Architecture Lecture 1: Introduction and

What is Computer Architecture?

Physics

Application

3

Page 4: EE 660: Computer Architecture Lecture 1: Introduction and

What is Computer Architecture?

Physics

Application

Gap too large to bridge in one step

4

Page 5: EE 660: Computer Architecture Lecture 1: Introduction and

What is Computer Architecture?

In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies

Physics

Application

Gap too large to bridge in one step

5

Page 6: EE 660: Computer Architecture Lecture 1: Introduction and

What is Computer Architecture?

In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using manufacturing technologies

Physics

Application

Gap too large to bridge in one step

6

Page 7: EE 660: Computer Architecture Lecture 1: Introduction and

Abstractions in Modern Computing Systems

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application

7

Page 8: EE 660: Computer Architecture Lecture 1: Introduction and

Abstractions in Modern Computing Systems

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application

Computer Architecture(EE 660)

8

Page 9: EE 660: Computer Architecture Lecture 1: Introduction and

Computer Architecture is Constantly Changing

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application Application Requirements:

• Suggest how to improve architecture• Provide revenue to fund development

Technology Constraints: • Restrict what can be done efficiently• New technologies make new arch

possible

9

Page 10: EE 660: Computer Architecture Lecture 1: Introduction and

Computer Architecture is Constantly Changing

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application Application Requirements:

• Suggest how to improve architecture• Provide revenue to fund development

Technology Constraints: • Restrict what can be done efficiently• New technologies make new arch

possible

Architecture provides feedback to guide application and technology research directions

10

Page 11: EE 660: Computer Architecture Lecture 1: Introduction and

Computers Then…

IAS Machine. Design directed by John von Neumann. First booted in Princeton NJ in 1952 Smithsonian Institution Archives (Smithsonian Image 95-06151) 11

Page 12: EE 660: Computer Architecture Lecture 1: Introduction and

Computers Now

12

Robots

Supercomputers Automobiles

Laptops

Set-top boxes

Smart phones

Servers Media Players

Sensor Nets

Routers

Cameras Games

Page 13: EE 660: Computer Architecture Lecture 1: Introduction and

[from Kurzweil]

Major Technology Generations Bipolar

nMOS

CMOS

pMOS

Relays

Vacuum Tubes

Electromechanical

13

Page 14: EE 660: Computer Architecture Lecture 1: Introduction and

Sequential Processor Performance

14 From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.

Page 15: EE 660: Computer Architecture Lecture 1: Introduction and

Sequential Processor Performance

RISC

15 From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.

Page 16: EE 660: Computer Architecture Lecture 1: Introduction and

Sequential Processor Performance

RISC

Move to multi-processor

16 From Hennessy and Patterson Ed. 5 Image Copyright © 2011, Elsevier Inc. All rights Reserved.

Page 17: EE 660: Computer Architecture Lecture 1: Introduction and

Course Structure

• Recommended Readings• In-Lecture Questions• Problem Sets

– Very useful for exam preparation

– Peer Evaluation

• Midterm• Final Project

17

Page 18: EE 660: Computer Architecture Lecture 1: Introduction and

Course Content Digital Sys. & Computer Design (EE 361)

Computer Organization • Basic Pipelined

Processor

~50,000 Transistors

Photo of Berkeley RISC I, © University of California (Berkeley) 18

Page 19: EE 660: Computer Architecture Lecture 1: Introduction and

Course Content Computer Architecture (EE 660)

Intel Nehalem Processor, Original Core i7, Image Credit Intel: http://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg

19

Page 20: EE 660: Computer Architecture Lecture 1: Introduction and

Intel Nehalem Processor, Original Core i7, Image Credit Intel: http://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg

~700,000,000 Transistors 20

Course Content Computer Architecture (EE 660)

Page 21: EE 660: Computer Architecture Lecture 1: Introduction and

Intel Nehalem Processor, Original Core i7, Image Credit Intel: http://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg

~700,000,000 Transistors

EE 361 Processor

21

Course Content Computer Architecture (EE 660)

Page 22: EE 660: Computer Architecture Lecture 1: Introduction and

Intel Nehalem Processor, Original Core i7, Image Credit Intel: http://download.intel.com/pressroom/kits/corei7/images/Nehalem_Die_Shot_3.jpg

~700,000,000 Transistors

• Instruction Level Parallelism– Superscalar– Very Long Instruction Word (VLIW)

• Long Pipelines (PipelineParallelism)

• Advanced Memory and Caches• Data Level Parallelism

– Vector– GPU

• Thread Level Parallelism– Multithreading– Multiprocessor– Multicore– Manycore

22

EE 361 Processor

Course Content Computer Architecture (EE 660)

Page 23: EE 660: Computer Architecture Lecture 1: Introduction and

Architecture vs. Microarchitecture

“Architecture”/Instruction Set Architecture:• Programmer visible state (Memory & Register)• Operations (Instructions and how they work)• Execution Semantics (interrupts)• Input/Output• Data Types/SizesMicroarchitecture/Organization:• Tradeoffs on how to implement ISA for some metric

(Speed, Energy, Cost)• Examples: Pipeline depth, number of pipelines, cache

size, silicon area, peak power, execution ordering, buswidths, ALU widths

23

Page 24: EE 660: Computer Architecture Lecture 1: Introduction and

Software Developments

24

up to 1955 Libraries of numerical routines - Floating point operations- Transcendental functions- Matrix manipulation, equation solvers, . . .

1955-60 High level Languages - Fortran 1956 Operating Systems -

- Assemblers, Loaders, Linkers, Compilers - Accounting programs to keep track of

usage and charges

Page 25: EE 660: Computer Architecture Lecture 1: Introduction and

Software Developments

25

up to 1955 Libraries of numerical routines - Floating point operations- Transcendental functions- Matrix manipulation, equation solvers, . . .

1955-60 High level Languages - Fortran 1956 Operating Systems -

- Assemblers, Loaders, Linkers, Compilers - Accounting programs to keep track of

usage and charges

Machines required experienced operators

• Most users could not be expected to understand these programs, much less write them

• Machines had to be sold with a lot of resident software

Page 26: EE 660: Computer Architecture Lecture 1: Introduction and

Compatibility Problem at IBM

26

By early 1960’s, IBM had 4 incompatible lines of computers!

701 7094650 7074702 70801401 7010

Each system had its own • Instruction set• I/O system and Secondary Storage:

magnetic tapes, drums and disks • assemblers, compilers, libraries,...• market niche business, scientific, real time, ...

Page 27: EE 660: Computer Architecture Lecture 1: Introduction and

Compatibility Problem at IBM

27

By early 1960’s, IBM had 4 incompatible lines of computers!

701 7094650 7074702 70801401 7010

Each system had its own • Instruction set• I/O system and Secondary Storage:

magnetic tapes, drums and disks • assemblers, compilers, libraries,...• market niche business, scientific, real time, ...

Page 28: EE 660: Computer Architecture Lecture 1: Introduction and

Compatibility Problem at IBM

28

By early 1960’s, IBM had 4 incompatible lines of computers!

701 7094650 7074702 70801401 7010

Each system had its own • Instruction set• I/O system and Secondary Storage:

magnetic tapes, drums and disks • assemblers, compilers, libraries,...• market niche business, scientific, real time, ...

IBM 360

Page 29: EE 660: Computer Architecture Lecture 1: Introduction and

IBM 360 : Design Premises Amdahl, Blaauw and Brooks, 1964

29

• The design must lend itself to growth and successormachines

• General method for connecting I/O devices• Total performance - answers per month rather than bits per

microsecond programming aids• Machine must be capable of supervising itself without

manual intervention• Built-in hardware fault checking and locating aids to reduce

down time• Simple to assemble systems with redundant I/O devices,

memories etc. for fault tolerance• Some problems required floating-point larger than 36 bits

Page 30: EE 660: Computer Architecture Lecture 1: Introduction and

30

IBM 360: A General-Purpose Register (GPR) Machine

• Processor State– 16 General-Purpose 32-bit Registers

• may be used as index and base register• Register 0 has some special properties

– 4 Floating Point 64-bit Registers– A Program Status Word (PSW)

• PC, Condition codes, Control flags

• A 32-bit machine with 24-bit addresses– But no instruction contains a 24-bit address!

• Data Formats– 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words

Page 31: EE 660: Computer Architecture Lecture 1: Introduction and

31

IBM 360: A General-Purpose Register (GPR) Machine

• Processor State– 16 General-Purpose 32-bit Registers

• may be used as index and base register• Register 0 has some special properties

– 4 Floating Point 64-bit Registers– A Program Status Word (PSW)

• PC, Condition codes, Control flags

• A 32-bit machine with 24-bit addresses– But no instruction contains a 24-bit address!

• Data Formats– 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double-words

The IBM 360 is why bytes are 8-bits long today!

Page 32: EE 660: Computer Architecture Lecture 1: Introduction and

32

IBM 360: Initial Implementations Model 30 . . . Model 70

Storage 8K - 64 KB 256K - 512 KB Datapath 8-bit 64-bitCircuit Delay 30 nsec/level 5 nsec/levelLocal Store Main Store Transistor RegistersControl Store Read only 1sec Conventional circuits

IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models. Milestone: The first true ISA designed as portable hardware-software interface!

Page 33: EE 660: Computer Architecture Lecture 1: Introduction and

33

IBM 360: Initial Implementations Model 30 . . . Model 70

Storage 8K - 64 KB 256K - 512 KB Datapath 8-bit 64-bitCircuit Delay 30 nsec/level 5 nsec/levelLocal Store Main Store Transistor RegistersControl Store Read only 1sec Conventional circuits

IBM 360 instruction set architecture (ISA) completely hid the underlying technological differences between various models. Milestone: The first true ISA designed as portable hardware-software interface!

With minor modifications it still survives today!

Page 34: EE 660: Computer Architecture Lecture 1: Introduction and

IBM 360: 47 years later… The zSeries z11 Microprocessor

• 5.2 GHz in IBM 45nm PD-SOI CMOS technology• 1.4 billion transistors in 512 mm2

• 64-bit virtual addressing– original S/360 was 24-bit, and S/370 was 31-bit extension

• Quad-core design• Three-issue out-of-order superscalar pipeline• Out-of-order memory accesses• Redundant datapaths

– every instruction performed in two parallel datapaths andresults compared

• 64KB L1 I-cache, 128KB L1 D-cache on-chip• 1.5MB private L2 unified cache per core, on-chip• On-Chip 24MB eDRAM L3 cache• Scales to 96-core multiprocessor with 768MB of

shared L4 eDRAM

[ IBM, Kevin Shum, HotChips, 2010] Image Credit: IBM

Courtesy of International Business Machines Corporation, © International

Business Machines Corporation. 34

Page 35: EE 660: Computer Architecture Lecture 1: Introduction and

Same Architecture Different Microarchitecture

AMD Phenom X4 • X86 Instruction Set• Quad Core• 125W• Decode 3 Instructions/Cycle/Core• 64KB L1 I Cache, 64KB L1 D Cache• 512KB L2 Cache• Out-of-order• 2.6GHz

Intel Atom • X86 Instruction Set• Single Core• 2W• Decode 2 Instructions/Cycle/Core• 32KB L1 I Cache, 24KB L1 D Cache• 512KB L2 Cache• In-order• 1.6GHz

Image Credit: Intel

Image Credit: AMD 35

Page 36: EE 660: Computer Architecture Lecture 1: Introduction and

Different Architecture Different Microarchitecture

AMD Phenom X4 • X86 Instruction Set• Quad Core• 125W• Decode 3 Instructions/Cycle/Core• 64KB L1 I Cache, 64KB L1 D Cache• 512KB L2 Cache• Out-of-order• 2.6GHz

IBM POWER7 • Power Instruction Set• Eight Core• 200W• Decode 6 Instructions/Cycle/Core• 32KB L1 I Cache, 32KB L1 D Cache• 256KB L2 Cache• Out-of-order• 4.25GHz

Image Credit: IBM Image Credit: AMD

Courtesy of International Business Machines Corporation, © International Business Machines Corporation.

36

Page 37: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

37

Page 38: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

38

ALU

Page 39: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

39

ALU

Mem

ory

Page 40: EE 660: Computer Architecture Lecture 1: Introduction and

Proc

esso

r

Where Do Operands Come from And Where Do Results Go?

40

ALU

Mem

ory

Page 41: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

41

Page 42: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

42

… TOS

ALU

Proc

esso

r

Mem

ory

Stack

Page 43: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

43

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

Stack Accumulator

Page 44: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

44

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Page 45: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

45

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Register- Register

ALU

Proc

esso

r

Mem

ory

Page 46: EE 660: Computer Architecture Lecture 1: Introduction and

Where Do Operands Come from And Where Do Results Go?

46

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Register- Register

0 1 2 or 3 Number Explicitly Named Operands:

ALU

Proc

esso

r

Mem

ory

2 or 3

Page 47: EE 660: Computer Architecture Lecture 1: Introduction and

Stack-Based Instruction Set Architecture (ISA)

• Burrough’s B5000 (1960)• Burrough’s B6700• HP 3000• ICL 2900• Symbolics 3600Modern• Inmos Transputer• Forth machines• Java Virtual Machine• Intel x87 Floating Point Unit

47

… TOS

ALU

Proc

esso

r

Mem

ory

Page 48: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

48

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Page 49: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

49

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Page 50: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

50

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack

Page 51: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

51

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack push a

a

Page 52: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

52

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

Page 53: EE 660: Computer Architecture Lecture 1: Introduction and

b

Evaluation of Expressions

53

(a + b * c) / (a + d * c - e) /

+

* + a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

push b

Page 54: EE 660: Computer Architecture Lecture 1: Introduction and

b

Evaluation of Expressions

54

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

Page 55: EE 660: Computer Architecture Lecture 1: Introduction and

c b

Evaluation of Expressions

55

(a + b * c) / (a + d * c - e) /

+

* + a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

push c

Page 56: EE 660: Computer Architecture Lecture 1: Introduction and

c b

Evaluation of Expressions

56

(a + b * c) / (a + d * c - e) /

+

* + a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

Page 57: EE 660: Computer Architecture Lecture 1: Introduction and

c b

Evaluation of Expressions

57

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

multiply

* b * c

Page 58: EE 660: Computer Architecture Lecture 1: Introduction and

c b

Evaluation of Expressions

58

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

Evaluation Stack a

b * c

Page 59: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

59

a

(a + b * c) / (a + d * c - e) /

+

* + a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

add Evaluation Stack

b * c

Page 60: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

60

a

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

add

+

Evaluation Stack

b * c

Page 61: EE 660: Computer Architecture Lecture 1: Introduction and

Evaluation of Expressions

61

a

(a + b * c) / (a + d * c - e) /

+

* +a e

-

a c

d c

* b

Reverse Polish a b c * + a d c * + e - /

add

+

Evaluation Stack

b * c a + b * c

Page 62: EE 660: Computer Architecture Lecture 1: Introduction and

Hardware organization of the stack

• Stack is part of the processor state stack must be bounded and small

number of Registers,not the size of main memory

• Conceptually stack is unboundeda part of the stack is included in the

processor state; the rest is kept in themain memory

62

Page 63: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Operations and Implicit Memory References

• Suppose the top 2 elements of the stack are keptin registers and the rest is kept in the memory.

Each push operation 1 memory reference pop operation 1 memory reference

No Good! • Better performance by keeping the top N

elements in registers, and memory references aremade only when register stack overflows orunderflows.

Issue - when to Load/Unload registers ?

63

Page 64: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Operations and Implicit Memory References

• Suppose the top 2 elements of the stack are keptin registers and the rest is kept in the memory.

Each push operation 1 memory reference pop operation 1 memory reference

No Good! • Better performance by keeping the top N

elements in registers, and memory references aremade only when register stack overflows orunderflows.

Issue - when to Load/Unload registers ?

64

Page 65: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Operations and Implicit Memory References

• Suppose the top 2 elements of the stack are keptin registers and the rest is kept in the memory.

Each push operation 1 memory reference pop operation 1 memory reference

No Good! • Better performance by keeping the top N

elements in registers, and memory references aremade only when register stack overflows orunderflows.

Issue - when to Load/Unload registers ?

65

Page 66: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Size and Memory References

66

program stack (size = 2) memory refs push a R0 a push b R0 R1 b push c R0 R1 R2 c, ss(a) * R0 R1 sf(a) + R0 push a R0 R1 a push d R0 R1 R2 d, ss(a+b*c) push c R0 R1 R2 R3 c, ss(a) * R0 R1 R2 sf(a) + R0 R1 sf(a+b*c) push e R0 R1 R2 e,ss(a+b*c) - R0 R1 sf(a+b*c) / R0

a b c * + a d c * + e - /

Page 67: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Size and Memory References

67

program stack (size = 2) memory refs push a R0 a push b R0 R1 b push c R0 R1 R2 c, ss(a) * R0 R1 sf(a) + R0 push a R0 R1 a push d R0 R1 R2 d, ss(a+b*c) push c R0 R1 R2 R3 c, ss(a) * R0 R1 R2 sf(a) + R0 R1 sf(a+b*c) push e R0 R1 R2 e,ss(a+b*c) - R0 R1 sf(a+b*c) / R0

a b c * + a d c * + e - /

4 stores, 4 fetches (implicit)

Page 68: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Size and Expression Evaluation

68

program stack (size = 4) push a R0 push b R0 R1 push c R0 R1 R2 * R0 R1 + R0 push a R0 R1 push d R0 R1 R2 push c R0 R1 R2 R3 * R0 R1 R2 + R0 R1 push e R0 R1 R2 - R0 R1 / R0

a b c * + a d c * + e - /

Page 69: EE 660: Computer Architecture Lecture 1: Introduction and

Stack Size and Expression Evaluation

69

program stack (size = 4) push a R0 push b R0 R1 push c R0 R1 R2 * R0 R1 + R0 push a R0 R1 push d R0 R1 R2 push c R0 R1 R2 R3 * R0 R1 R2 + R0 R1 push e R0 R1 R2 - R0 R1 / R0

a b c * + a d c * + e - /

a and c are “loaded” twice

not the best use of registers!

Page 70: EE 660: Computer Architecture Lecture 1: Introduction and

Machine Model Summary

70

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

… M

emor

y

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Register- Register

ALU

Proc

esso

r

Mem

ory

Page 71: EE 660: Computer Architecture Lecture 1: Introduction and

Machine Model Summary

71

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

… M

emor

y

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Register- Register

ALU

Proc

esso

r

Mem

ory

C = A + B

Page 72: EE 660: Computer Architecture Lecture 1: Introduction and

Machine Model Summary

72

… TOS

ALU

Proc

esso

r

Mem

ory

ALU

Proc

esso

r

… M

emor

y

ALU

Proc

esso

r

Mem

ory

… Stack Accumulator

Register- Memory

Register- Register

ALU

Proc

esso

r

Mem

ory

C = A + B

Push A Push B Add Pop C

Load A Add B Store C

Load R1, A Add R3, R1, B Store R3, C

Load R1, A Load R2, B Add R3, R1, R2 Store R3, C

Page 73: EE 660: Computer Architecture Lecture 1: Introduction and

Classes of Instructions • Data Transfer

– LD, ST, MFC1, MTC1, MFC0, MTC0• ALU

– ADD, SUB, AND, OR, XOR, MUL, DIV, SLT, LUI• Control Flow

– BEQZ, JR, JAL, TRAP, ERET• Floating Point

– ADD.D, SUB.S, MUL.D, C.LT.D, CVT.S.W,• Multimedia (SIMD)

– ADD.PS, SUB.PS, MUL.PS, C.LT.PS• String

– REP MOVSB (x86)73

Page 74: EE 660: Computer Architecture Lecture 1: Introduction and

Addressing Modes: How to Get Operands from Memory

74

Addressing Mode

Instruction Function

Register Add R4, R3, R2 Regs[R4] <- Regs[R3] + Regs[R2] **

Immediate Add R4, R3, #5 Regs[R4] <- Regs[R3] + 5 **

Displacement Add R4, R3, 100(R1) Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1]]

Register Indirect

Add R4, R3, (R1) Regs[R4] <- Regs[R3] + Mem[Regs[R1]]

Absolute Add R4, R3, (0x475) Regs[R4] <- Regs[R3] + Mem[0x475]

Memory Indirect

Add R4, R3, @(R1) Regs[R4] <- Regs[R3] + Mem[Mem[R1]]

PC relative Add R4, R3, 100(PC) Regs[R4] <- Regs[R3] + Mem[100 + PC]

Scaled Add R4, R3, 100(R1)[R5] Regs[R4] <- Regs[R3] + Mem[100 + Regs[R1] + Regs[R5] * 4]

** May not actually access memory!

Page 75: EE 660: Computer Architecture Lecture 1: Introduction and

Data Types and Sizes • Types

– Binary Integer– Binary Coded Decimal (BCD)– Floating Point

• IEEE 754• Cray Floating Point• Intel Extended Precision (80-bit)

– Packed Vector Data– Addresses

• Width– Binary Integer (8-bit, 16-bit, 32-bit, 64-bit)– Floating Point (32-bit, 40-bit, 64-bit, 80-bit)– Addresses (16-bit, 24-bit, 32-bit, 48-bit, 64-bit)

75

Page 76: EE 660: Computer Architecture Lecture 1: Introduction and

ISA Encoding Fixed Width: Every Instruction has same width • Easy to decode(RISC Architectures: MIPS, PowerPC, SPARC, ARM…)Ex: MIPS, every instruction 4-bytesVariable Length: Instructions can vary in width• Takes less space in memory and caches(CISC Architectures: IBM 360, x86, Motorola 68k, VAX…)Ex: x86, instructions 1-byte up to 17-bytesMostly Fixed or Compressed:• Ex: MIPS16, THUMB (only two formats 2 and 4 bytes)• PowerPC and some VLIWs (Store instructions compressed,

decompress into Instruction Cache(Very) Long Instruction Word: • Multiple instructions in a fixed width bundle• Ex: Multiflow, HP/ST Lx, TI C6000 76

Page 77: EE 660: Computer Architecture Lecture 1: Introduction and

ISA Encoding Fixed Width: Every Instruction has same width • Easy to decode (RISC Architectures: MIPS, PowerPC, SPARC, ARM…) Ex: MIPS, every instruction 4-bytes Variable Length: Instructions can vary in width • Takes less space in memory and caches (CISC Architectures: IBM 360, x86, Motorola 68k, VAX…) Ex: x86, instructions 1-byte up to 17-bytes Mostly Fixed or Compressed: • Ex: MIPS16, THUMB (only two formats 2 and 4 bytes) • PowerPC and some VLIWs (Store instructions compressed,

decompress into Instruction Cache (Very) Long Instruction Word: • Multiple instructions in a fixed width bundle • Ex: Multiflow, HP/ST Lx, TI C6000 77

Page 78: EE 660: Computer Architecture Lecture 1: Introduction and

x86 (IA-32) Instruction Encoding

78

Immediate Displacement Scale, Index, Base ModR/M Opcode Instruction

Prefixes

0,1,2, or 4 bytes

0,1,2, or 4 bytes

1 byte (if needed)

1 byte (if needed)

1,2, or 3 bytes

Up to four Prefixes (1 byte each)

x86 and x86-64 Instruction Formats Possible instructions 1 to 18 bytes long

Page 79: EE 660: Computer Architecture Lecture 1: Introduction and

MIPS64 Instruction Encoding

79 Image Copyright © 2011, Elsevier Inc. All rights Reserved.

Page 80: EE 660: Computer Architecture Lecture 1: Introduction and

Real World Instruction Sets

80

Arch Type # Oper # Mem Data Size # Regs Addr Size Use

Alpha Reg-Reg 3 0 64-bit 32 64-bit Workstation

ARM Reg-Reg 3 0 32/64-bit 16 32/64-bit Cell Phones, Embedded

MIPS Reg-Reg 3 0 32/64-bit 32 32/64-bit Workstation, Embedded

SPARC Reg-Reg 3 0 32/64-bit 24-32 32/64-bit Workstation

TI C6000 Reg-Reg 3 0 32-bit 32 32-bit DSP

IBM 360 Reg-Mem 2 1 32-bit 16 24/31/64 Mainframe

x86 Reg-Mem 2 1 8/16/32/64-bit

4/8/24 16/32/64 Personal Computers

VAX Mem-Mem 3 3 32-bit 16 32-bit Minicomputer

Mot. 6800 Accum. 1 1/2 8-bit 0 16-bit Microcontroler

Page 81: EE 660: Computer Architecture Lecture 1: Introduction and

Why the Diversity in ISAs?

Technology Influenced ISA • Storage is expensive, tight encoding important• Reduced Instruction Set Computer

– Remove instructions until whole computer fits on die• Multicore/Manycore

– Transistors not turning into sequential performanceApplication Influenced ISA • Instructions for Applications

– DSP instructions• Compiler Technology has improved

– SPARC Register Windows no longer needed– Compiler can register allocate effectively

81

Page 82: EE 660: Computer Architecture Lecture 1: Introduction and

Recap

82

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application

Computer Architecture (EE 660)

Page 83: EE 660: Computer Architecture Lecture 1: Introduction and

Recap

• ISA vs Microarchitecture• ISA Characteristics

– Machine Models– Encoding– Data Types– Instructions– Addressing Modes

83

Physics

Devices

Circuits Gates

Register-Transfer Level Microarchitecture

Instruction Set Architecture

Operating System/Virtual Machines Programming Language

Algorithm Application

Page 84: EE 660: Computer Architecture Lecture 1: Introduction and

Computer Architecture Lecture 1

Next Class: Microcode and Review of Pipelining

84

Page 85: EE 660: Computer Architecture Lecture 1: Introduction and

1

Agenda

• MicrocodedMicroarchitectures• Pipeline Review– Pipelining Basics– Structural Hazards– Data Hazards– Control Hazards

Page 86: EE 660: Computer Architecture Lecture 1: Introduction and

2

Agenda

• MicrocodedMicroarchitectures• Pipeline Review– Pipelining Basics– Structural Hazards– Data Hazards– Control Hazards

Page 87: EE 660: Computer Architecture Lecture 1: Introduction and

3

WhatHappensWhentheProcessorisToo Large?

Page 88: EE 660: Computer Architecture Lecture 1: Introduction and

4

WhatHappensWhentheProcessorisToo Large?

• TimeMultiplex Resources!

Page 89: EE 660: Computer Architecture Lecture 1: Introduction and

Microcontrol Unit MauriceWilkes, 1954

Embed the controllogic state table ina memory array

Matrix A Matrix B

Decoder

Next state

op conditional code flip-flop

µ address

Control lines toALU, MUXs, Registers

Firstusedin EDSAC-2,completed 1958

Memory

5

Page 90: EE 660: Computer Architecture Lecture 1: Introduction and

Microcoded Microarchitecture

Memory (RAM)

µcontroller(ROM)

Datapath

Data Addr

busy? zero?

opcode

enMem MemWrt

holds fixedmicrocode instructions

holds user program written in macrocode

instructions (e.g., x86, MIPS, etc.)

6

Page 91: EE 660: Computer Architecture Lecture 1: Introduction and

ABus-basedDatapathfor RISC

Microinstruction: register to register transfer (17 control signals)

enMem

MA

addr

Memory

data

busy ldMA

MemWrt

Bus 32

A B

OpSel ldA

bcompare? ldB

ALU

2

RegWrt enReg

data

rs1rs2rd32(PC)1(Link)

RegSel

addr 32 GPRs+ PC ...

32-bit Reg

3

rs1rs2rd

ImmSel

IR

Opcode ldIR

Imm ALUExt control

enALUenImm

2

7

Page 92: EE 660: Computer Architecture Lecture 1: Introduction and

8

Agenda

• MicrocodedMicroarchitectures• Pipeline Review– Pipelining Basics– Structural Hazards– Data Hazards– Control Hazards

Page 93: EE 660: Computer Architecture Lecture 1: Introduction and

AnIdeal Pipeline

• Allobjectsgothroughthesame stages• Nosharingofresourcesbetweenanytwo stages• Propagationdelaythroughallpipelinestagesis equal• Schedulingofatransactionenteringthepipelineisnotaffectedbythetransactionsinother stages

stage 1

stage 2

stage3

stage 4

10

Page 94: EE 660: Computer Architecture Lecture 1: Introduction and

AnIdeal Pipeline

• Allobjectsgothroughthesame stages• Nosharingofresourcesbetweenanytwo stages• Propagationdelaythroughallpipelinestagesis equal• Schedulingofatransactionenteringthepipelineisnotaffectedbythetransactionsinother stages

• Theseconditionsgenerallyholdforindustryassemblylines,butinstructionsdependoneachothercausingvarious hazards

stage 1

stage 2

stage3

stage 4

10

Page 95: EE 660: Computer Architecture Lecture 1: Introduction and

UnpipelinedDatapathfor MIPS

clk

WBSrcRegWrite MemWrite

rdataDataMemorywdata

weaddr

RegDst BSrcExtSelOpCode

z

OpSel

0x4

AddAdd

clk

zero?

clk

addrinst

Inst.Memory

PC

wers1rs2

rd1wswd rd2GPRs

ImmExt

ALU

ALUControl

31

PCSrcbrrindjabspc+4

11

Page 96: EE 660: Computer Architecture Lecture 1: Introduction and

SimplifiedUnpipelined Datapath

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2GPRs

PC

12

Page 97: EE 660: Computer Architecture Lecture 1: Introduction and

Pipelined Datapath

Clockperiodcanbereducedbydividingtheexecutionofaninstructionintomultiple cycles

write-backphase

fetchphase

executephase

decode& register-fetch phase

memoryphase

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDM probably)14

Page 98: EE 660: Computer Architecture Lecture 1: Introduction and

Pipelined Control

write-backfetch

phaseexecutephase

decode& register-fetch phase

memoryphase

we addr

rdataDataMemorywdata

ALU

ImmExt

0x4Add

addrrdata

Inst. Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

phaseClockperiodcanbereducedbydividingtheexecutionof aninstructionintomultiple cycles

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDM probably)14

Page 99: EE 660: Computer Architecture Lecture 1: Introduction and

Pipelined Control

write-backfetch

phaseexecutephase

decode& register-fetch phase

memoryphase

we addr

rdataDataMemorywdata

ALU

ImmExt

0x4Add

addrrdata

Inst. Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

phaseClockperiodcanbereducedbydividingtheexecutionof aninstructionintomultiple cycles

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDM probably)15

Page 100: EE 660: Computer Architecture Lecture 1: Introduction and

Pipelined Control

Clockperiodcanbereducedbydividingtheexecutionofaninstructionintomultiple cycles

write-backphase

fetchphase

executephase

decode& register-fetch phase

memoryphase

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDM probably)

However,CPIwillincreaseunlessinstructionsare pipelined 17

Page 101: EE 660: Computer Architecture Lecture 1: Introduction and

Pipelined Control

write-backfetch

phaseexecutephase

decode& register-fetch phase

memoryphase

we addr

rdataDataMemorywdata

ALU

ImmExt

0x4Add

addrrdata

Inst. Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

phaseClockperiodcanbereducedbydividingtheexecutionof aninstructionintomultiple cycles

tC>max{tIM,tRF,tALU,tDM,tRW}(=tDM probably)

However,CPIwillincreaseunlessinstructionsare pipelined 18

Page 102: EE 660: Computer Architecture Lecture 1: Introduction and

10

“IronLaw”ofProcessor PerformanceTime = Instructions Cycles Time

Program Program * Instruction * Cycle

Microarchitecture CPI cycle timeMicrocoded >1 shortSingle-cycle unpipelined 1 longPipelined 1 short

–Instructionsperprogramdependsonsourcecode,compilertechnology,andISA

–Cyclesperinstructions(CPI)dependsuponthe ISAandthe microarchitecture

–Timepercycledependsuponthemicroarchitectureandthebase technology

Page 103: EE 660: Computer Architecture Lecture 1: Introduction and

20

“IronLaw”ofProcessor PerformanceTime = Instructions Cycles Time

Program Program * Instruction * Cycle–Instructionsperprogramdependsonsourcecode,compilertechnology,andISA

–Cyclesperinstructions(CPI)dependsuponthe ISAandthe microarchitecture

–Timepercycledependsuponthemicroarchitectureandthebase technology

Microarchitecture CPI cycle timeMicrocoded >1 shortSingle-cycle unpipelined 1 longPipelined 1 shortMulti-cycle,unpipelined control

Page 104: EE 660: Computer Architecture Lecture 1: Introduction and

20

“IronLaw”ofProcessor PerformanceTime = Instructions Cycles Time

Program Program * Instruction * Cycle

Microarchitecture CPI cycle timeMicrocoded >1 shortSingle-cycle unpipelined 1 longPipelined 1 shortMulti-cycle,unpipelined control >1 short

–Instructionsperprogramdependsonsourcecode,compilertechnology,andISA

–Cyclesperinstructions(CPI)dependsuponthe ISAandthe microarchitecture

–Timepercycledependsuponthemicroarchitectureandthebase technology

Page 105: EE 660: Computer Architecture Lecture 1: Introduction and

CPI ExamplesTime

7 cycles

Inst 1

5 cycles

Inst 2

10 cycles

Inst 3

Microcodedmachine

Inst 1 Inst 2 Inst 3

3instructions,22cycles, CPI=7.33Unpipelinedmachine

3instructions,3cycles, CPI=1Pipelinedmachine

3instructions,3cycles, CPI=1

21

Page 106: EE 660: Computer Architecture Lecture 1: Introduction and

22

Technology Assumptions

Thus,thefollowingtimingassumptionis reasonable

• Asmallamountofveryfastmemory (caches)backedupbyalarge,slower memory

• FastALU(atleastfor integers)• MultiportedRegisterfiles (slower!)

tIM » tRF » tALU » tDM » tRW

A5-stagepipelinewillbethefocusofourdetailed design

- somecommercialdesignshaveover30 pipelinestagestodoaninteger add!

Page 107: EE 660: Computer Architecture Lecture 1: Introduction and

Pipeline Diagrams

write-backphase

fetchphase

executephase

decode& register-fetch phase

memoryphase

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

Weneedsomewaytoshowmultiplesimultaneoustransactionsinbothspaceand time

23

Page 108: EE 660: Computer Architecture Lecture 1: Introduction and

PipelineDiagrams:Transactionsvs. Time

25

write-backphase

fetchphase

executephase

decode& register-fetch phase

memoryphase

we addr

rdataDataMemorywdata

ALU

ImmExt

0x4Add

addrrdata

Inst. Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

time t0 t1 t2 t3 t4 t5 t6 t7 . . . .instruction1 IF1 ID1 EX1 MA1 WB1instruction2 IF2 ID2 EX2instruction3 IF3 ID3instruction4 IF4instruction5

MA2EX3ID4IF5

WB2MA3 WB3EX4 MA4 WB4ID5 EX5 MA5 WB5

Page 109: EE 660: Computer Architecture Lecture 1: Introduction and

PipelineDiagrams:Spacevs. Time

26

write-backphase

fetchphase

executephase

decode& register-fetch phase

memoryphase

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2GPRs

IRPC

HardwiredController

t5 t6 t7 . . . .t0 t1 t2 t3 I1 I2 I3 I4

I1 I2 I3I1 I2

I1

time IF ID EX MA WB

t4I5I4I3I2I1

I5I4 I5I3 I4 I5I2 I3 I4 I5

Res

ourc

es

Page 110: EE 660: Computer Architecture Lecture 1: Introduction and

26

InstructionsInteractWithEach Otherin Pipeline

• Structural Hazard: An instruction in thepipeline needs a resource being used byanother instruction in the pipeline

• DataHazard:Aninstructiondependsonadatavalueproducedbyanearlier instruction

• ControlHazard:Whetherornotaninstructionshouldbeexecuteddependsonacontroldecisionmadebyanearlier instruction

Page 111: EE 660: Computer Architecture Lecture 1: Introduction and

27

Agenda

• MicrocodedMicroarchitectures• Pipeline Review– Pipelining Basics– Structural Hazards– Data Hazards– Control Hazards

Page 112: EE 660: Computer Architecture Lecture 1: Introduction and

28

OverviewofStructural Hazards• Structuralhazardsoccurwhentwoinstructions needthesamehardwareresourceatthesame time

• Approachestoresolvingstructural hazards– Schedule:Programmerexplicitlyavoidsschedulinginstructionsthatwouldcreatestructural hazards

– Stall:Hardwareincludescontrollogicthatstalls untilearlierinstructionisnolongerusingcontended resource

– Duplicate:Addmorehardwaretodesignsothateachinstructioncanaccessindependentresourcesatthesametime

• Simple5-stageMIPSpipelinehasnostructural hazardsspecificallybecauseISAwasdesignedthat way

Page 113: EE 660: Computer Architecture Lecture 1: Introduction and

ExampleStructural Hazard:Unified Memory

WBIF EXID MEM

rdataDataMemorywdata

we addrALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2 GPRs

IRPC

30

Page 114: EE 660: Computer Architecture Lecture 1: Introduction and

ExampleStructural Hazard:Unified Memory

WBIF EXID MEM

Unified Memory

rdataDataMemorywdata

we addrALU

Imm Ext

0x4

addrrdata

Inst. Memory

rs2rd1

wswd rd2 GPRs

Addwe

rs1

IRPC

addr

rdatawdata

30

Page 115: EE 660: Computer Architecture Lecture 1: Introduction and

ExampleStructural Hazard:2-Cycle Memory

WBIF EXID M M

adddata

DatMe orywd a

we rramat

E

ALU

ImmExt

0x4Add

addrrdata

Inst.Memory

wers1 rs2

rd1wswd rd2 GPRs

IRPC

M0Stage

31

M1Stage