intel and faer’s reach to teach
Post on 11-Jan-2016
Intel and FAER’s Reach to Teach: A Program on Computer Architecture
Part 1: PIPELINED PROCESSORS
R. Govindarajan and Matthew Jacob, SERC, Indian Institute of Science, Bangalore

Pipelined Processor Architecture
1. Terminology and assumptions
2. Review: Computer organization; Data representation
3. Pipelined processor architecture
4. ILP (Instruction Level Parallelism) processor architecture

What is Computer Architecture?
“Architecture” in the English dictionary:
• the art and science of designing and building habitable structures (structures → computer systems; inhabitants → computer programs)
• a structure, or structures collectively
• a style and method of design and construction (e.g., Moghul architecture)
Computer architecture: the study of computer structures; design, evaluation, description

Computer Architect vs Computer Designer vs Logic Designer
• Computer Architect develops the Instruction Set Architecture (ISA: description of the instructions that are allowed and the semantics of what each instruction does when executed) and the computer system architecture
• Computer Designer develops the detailed machine organization (blocks, specifications, testing)
• Logic Designer implements these blocks

Basics: Computer Organization
[Figure: block diagram of a computer. Blocks shown: CPU (CU, ALU, Registers), MMU, Cache, Memory, Bus, I/O devices]
REGISTERS
• General Purpose: Integer Registers, FP Registers
• Special Purpose: Program Counter, Stack Pointer, Link Register, Instruction Register
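
Basics: Laws, Principles, Rules
Amdahl’s Law: The performance improvement to be gained from using some faster mode of execution is limited by the fraction of time the slower mode is used
$$\text{Speedup} = \frac{\text{Time}_{\text{before}}}{\text{Time}_{\text{after}}} = \frac{1}{(1 - f) + \dfrac{f}{p}}$$
where f is the fraction of the original execution time that can use the faster mode and p is the speedup of that mode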
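As a worked illustration of Amdahl's Law, the sketch below evaluates the speedup for a fraction f of the original time that can use a mode p times faster. The function name and the sample numbers are hypothetical, chosen only for illustration.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Overall speedup when a fraction f of the original time runs p times faster."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must lie in [0, 1]")
    if p <= 0:
        raise ValueError("p must be positive")
    # Time_after / Time_before = (1 - f) + f / p
    return 1.0 / ((1.0 - f) + f / p)


# Hypothetical numbers: 80% of the time can use a mode that is 10x faster.
print(round(amdahl_speedup(0.8, 10.0), 2))   # 3.57
# Even an arbitrarily fast mode cannot beat 1 / (1 - f) = 5.
print(round(amdahl_speedup(0.8, 1e12), 2))   # 5.0
```

The second call shows the point of the law: the 20% of time spent in the slower mode caps the overall speedup at 5, however large p becomes.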

Principle of Locality of Reference
• A program property: programs tend to reuse instructions and data
• 90-10 rule: 90% of execution time is spent in 10% of the code
• Temporal locality: recently accessed things are likely to be accessed again in the near future
• Spatial locality: things whose addresses are close in space tend to be accessed close together in time

General Principle of Locality
• Denning SJCC 1972, Blevin & Ramamurthy IEEE Trans Comp 1976
• During any interval of time, resource demands are non-uniformly distributed
• Correlation between immediate past and immediate future resource demand patterns tends to be high, and correlation between disjoint resource demand patterns tends to 0 as the distance between them tends to infinity
Correlation: direction and strength of the linear relationship between 2 random variables

‘Moore’s Law’
[Figure: Performance (log scale, 1 to 1000) versus Time (1980 to 2000), with one curve for CPU and one for DRAM]
• µProc: 60%/yr. (2X/1.5 yr)
• Memory (DRAM): 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: grows 50% / year
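The figure of roughly 50% per year for the gap follows from dividing the two growth rates. A minimal sketch of that arithmetic, using only the 60% and 9% annual rates quoted on the slide (the variable and function names are illustrative):

```python
cpu_growth = 1.60    # µProc performance: 60% per year
dram_growth = 1.09   # DRAM performance: 9% per year

# Ratio of CPU to DRAM performance after one more year.
gap_growth = cpu_growth / dram_growth
print(f"gap grows by a factor of {gap_growth:.2f} per year")  # 1.47


def gap_after(years: int) -> float:
    """Relative CPU/DRAM gap after the given number of years, starting from 1."""
    return gap_growth ** years
```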

Background: Data Representation
Binary, bit, Byte
Commonly used representations are:
• Character data: ASCII code
• Signed integer data: 2s complement (also 1s complement, sign-magnitude)
• Real data: floating point (example: IEEE single precision floating point standard)

2s Complement Representation
The n bit quantity
$$x_{n-1}\, x_{n-2} \ldots x_0$$
(where x_0 is the least significant bit) represents the signed integer value
$$-x_{n-1}\, 2^{n-1} + \sum_{i=0}^{n-2} x_i\, 2^i$$
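The 2s complement value can be evaluated directly: the most significant bit carries weight -2^(n-1) and every other bit i carries weight 2^i. The sketch below does this; the function name and the bit-string input format are assumptions made for illustration.

```python
def twos_complement_value(bits: str) -> int:
    """Value of an n bit 2s complement string, most significant bit first."""
    if not bits or any(b not in "01" for b in bits):
        raise ValueError("bits must be a non-empty string of 0s and 1s")
    n = len(bits)
    # x[i] is the bit of weight 2**i, so x[0] is the least significant bit.
    x = [int(b) for b in reversed(bits)]
    return -x[n - 1] * 2 ** (n - 1) + sum(x[i] * 2 ** i for i in range(n - 1))


print(twos_complement_value("0101"))  # 5
print(twos_complement_value("1011"))  # -5
print(twos_complement_value("1000"))  # -8
```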

IEEE Floating Point Representation
A 32 bit value (s, f, e), where s is a sign bit, f is a 23 bit fraction and e an 8 bit exponent, evaluates to
$$(-1)^s \times 1.f \times 2^{e-127}$$
• Normalized form
• Special forms (zero, infinity, NaN, denormals)
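To make the normalized form concrete, the sketch below splits a 32 bit pattern into sign, exponent and fraction and evaluates (-1)^s × 1.f × 2^(e-127). The function name and the sample bit pattern are illustrative; special forms are rejected rather than handled.

```python
import struct


def decode_single(word: int) -> float:
    """Evaluate a normalized IEEE single precision bit pattern field by field."""
    s = (word >> 31) & 0x1           # sign bit
    e = (word >> 23) & 0xFF          # 8 bit biased exponent
    f = word & 0x7FFFFF              # 23 bit fraction
    if e == 0 or e == 0xFF:
        raise ValueError("special form (zero, denormal, infinity or NaN)")
    significand = 1.0 + f / 2 ** 23  # the implicit leading 1 gives 1.f
    return (-1) ** s * significand * 2.0 ** (e - 127)


word = 0xC0A00000                    # sign 1, exponent 129, fraction 0.25
print(decode_single(word))           # -5.0
# Cross-check against the library's interpretation of the same 32 bits.
print(struct.unpack(">f", struct.pack(">I", word))[0])  # -5.0
```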

Instruction Set Architecture
• Description of the machine from the view of the programmer/compiler
• Example: Intel x86 ISA
Includes specification of
1. The different kinds of instructions available (instruction set)
2. How operands are specified (addressing modes)
3. What each instruction looks like (instruction format)

Kinds of Instructions
1. Arithmetic/logical instructions
• Add, subtract, multiply, divide, compare (int/fp)
• Or, and, not, xor
• Shift (left/right, arithmetic/logical), rotate
2. Data transfer instructions
• Load (move data value to a register from memory)
• Store (move data value to a memory location from a register)
• Move
3. Control transfer instructions
• Jump, conditional branch, function call, return
4. Other instructions
• Example: halt

Operand Addressing Modes
• Operands to an instruction
  • Source: input value to instruction
  • Destination: where result is to go
• Addressing Mode
  • How the location of operand is specified
• An operand can be either
  • in a memory location
  • in a register

Addressing Modes
How the location of operands is specified
• Register Direct: operand is in a register, add R1, R2, R3
• Immediate: operand is part of the instrn, add R1, R1, #4
• Register indirect: operand is in memory, with a register specifying the memory address, add R1, R2, (R3)
• Base-Displacement: memory addr. is the sum of base (reg.) and offset, add R1, 8(R3)
• Absolute: memory addr. is specified in the instrn
• Indexed: addr. is the sum of base + index
• Others (auto increment/decrement, PC relative)
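The addressing modes above differ only in how the operand's location is computed. A minimal sketch of operand fetch for five of them follows; the register and memory contents, the mode names and the `operand` function are all hypothetical.

```python
regs = {"R2": 7, "R3": 100}          # hypothetical register contents
mem = {100: 11, 108: 22, 200: 33}    # hypothetical memory contents (address -> value)


def operand(mode: str, reg: str = "", const: int = 0) -> int:
    """Return the operand value selected by the given addressing mode."""
    if mode == "register_direct":      # add R1, R2, R3
        return regs[reg]
    if mode == "immediate":            # add R1, R1, #4
        return const
    if mode == "register_indirect":    # add R1, R2, (R3)
        return mem[regs[reg]]
    if mode == "base_displacement":    # add R1, 8(R3)
        return mem[regs[reg] + const]
    if mode == "absolute":             # address given in the instruction
        return mem[const]
    raise ValueError(f"unknown mode: {mode}")


print(operand("register_indirect", reg="R3"))           # 11
print(operand("base_displacement", reg="R3", const=8))  # 22
```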

Terms: Byte addressable
• Memory: a sequence of locations, each containing some information, referenced by an address
• Address Space: memory address space, register address space
• Addressability: how much data in a location? Example: in byte-addressable memory, each location contains 8 bits (1 byte)
• Word: data in a set of contiguous locations
• Word Length: maximum data accessed in a single fetch

Terms: Byte ordering, Alignment
Address (in dec):  400  401  402  403  404  405  406  407
Data (in hex):      1A   C8   B2   46   F0   8C   DF   1E
Word at 400:
• Big Endian byte ordering: 1AC8 B246 = 0001 1010 1100 1000 1011 0010 0100 0110 (decimal: 449,360,454)
• Little Endian byte ordering: 46B2 C81A = 0100 0110 1011 0010 1100 1000 0001 1010 (decimal: 1,186,121,754)
Word aligned: at a word boundary
• Word at 400 is word aligned
• Word at 402 is not, but it is short-word aligned
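The two decimal values can be checked by reading the same four bytes under each byte ordering. The sketch assumes the word at address 400 holds the bytes 1A C8 B2 46 in increasing address order, a 4 byte word and a 2 byte short word; `is_aligned` is an illustrative helper.

```python
data = bytes([0x1A, 0xC8, 0xB2, 0x46])   # bytes at addresses 400..403

big = int.from_bytes(data, byteorder="big")
little = int.from_bytes(data, byteorder="little")
print(f"{big:#010x} {big:,}")        # 0x1ac8b246 449,360,454
print(f"{little:#010x} {little:,}")  # 0x46b2c81a 1,186,121,754


def is_aligned(address: int, size_in_bytes: int) -> bool:
    """An item is aligned when its address is a multiple of its size."""
    return address % size_in_bytes == 0


print(is_aligned(400, 4), is_aligned(402, 4), is_aligned(402, 2))  # True False True
```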

ISA Example: MIPS32 ISA
• Registers: 32 integer GPRs (R0, R1, …, R31)
  • R0 is hardwired to 0
  • R31 is implicitly used by the jal instruction
  • HI and LO: special purpose registers used implicitly by multiply and divide instructions
• Addressing modes
  • Register direct
  • Base displacement (by loads and stores)
  • Immediate
  • Absolute (by jump instructions)
  • PC relative (by branch instructions)

MIPS32 ISA
Instruction   Mnemonic                            Example             Meaning
Data Transfer Instructions
Load          LB, LBU, LH, LHU, LUI, LW           lw R2, 4(R3)        R2 ← Mem[R3 + 4]
Store         SB, SH, SW                          sb R2, -8(R4)       Mem[R4 - 8] ← R2
Int. ALU Instructions
Add           ADD, ADDI, ADDIU                    add R1, R2, R3      R1 ← R2 + R3
Subtract      SUB, SUBU                           sub R1, R2, R3      R1 ← R2 - R3
Multiply      MULT, MULTU                         mult R1, R2         LO ← LSW(R1 * R2); HI ← MSW(R1 * R2)
Divide        DIV, DIVU                           div R1, R2          LO ← R1 div R2; HI ← R1 mod R2
Logical       AND, ANDI, OR, ORI, NOR, XOR, XORI  ori R1, R2, #0xF0   R1 ← R2 | ZE(0xF0)
Shift         SLL, SLLV, SRA, SRL                 srl R1, R2, #4      R1 ← 0000 || (R2)31-4
Comparison    SLT, SLTI, SLTU                     slti R1, R2, #16    R1 ← 1 if R2 < SE(16), 0 otherwise
(SE: sign extension; ZE: zero extension; LSW/MSW: least/most significant word; ||: concatenation)

MIPS32 ISA (contd.)
Instruction         Mnemonic                           Example        Meaning
Control Transfer Instructions
Conditional Branch  BEQ, BGEZ, BLTZ, BLEZ, BGTZ, BNE   bltz R2, -16   PC ← PC - 12 if R2 < 0
Jump                J, JR                              j <target>     PC ← (PC)31-28 || target || 00
Jump & Link         JAL, JALR                          jalr R2        R31 ← PC + 8; PC ← R2
System Call         SYSCALL                            syscall
Notation we will use for instructions:
Opcode Destination, Source1, Source2
Example: ADD R1, R2, R3
ADD R1 ← R2, R3
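The meanings given for bltz and j can be checked with a little address arithmetic. This is a sketch that follows the slide's notation, not the exact MIPS32 encoding: it assumes the branch offset is a byte offset added to the address of the next instruction (PC + 4), and that the jump target is a 26 bit word address. The function names and the sample PC are hypothetical.

```python
def branch_target(pc: int, offset: int) -> int:
    """PC relative: the offset is added to the address of the next instruction."""
    return (pc + 4 + offset) & 0xFFFFFFFF


def jump_target(pc: int, target: int) -> int:
    """Absolute: top 4 bits of PC || 26 bit target || 00."""
    return (pc & 0xF0000000) | ((target & 0x03FFFFFF) << 2)


pc = 0x00400020
print(hex(branch_target(pc, -16)))      # 0x400014, i.e. PC - 12
print(hex(jump_target(pc, 0x0100040)))  # 0x400100
```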

Steps in Instruction Processing
1. Fetch the instruction from memory
• Get the instruction whose address is in the PC (Program Counter) from memory into the IR (Instruction Register)
• Increment PC
2. Decode the instruction
• Understand instruction, addressing modes, etc.
• Calculate effective addresses of the operands to the instruction and fetch the operand values
3. Execute the instruction
• Do the required operation
4. Write back the result of the instruction

Timeline of events
PC to memory → Instruction in IR → PC++; Decode → Op1 effective address calculation → Op1 fetched → Op2 effective address calculation → Op2 fetched → Operation done → Write result
Processor/Memory speed disparity: 2-3 orders of magnitude

Assumptions
• Activity is overlapped in time where possible. PC increment and instruction fetch? Instruction decode and effective address calc?
• Load-store ISA: the only instructions that take operands from memory are loads & stores
• Main memory delays are not typically seen by the instruction processor: cache memories (more on this in a later lecture)
• Register file with 2 read ports and 1 write port

Processor cycle time: time required to do one of
• Cache memory access
• Register access + some logic (like decode)
• ALU operation
Instruction can be processed in 3-5 cycles
• Jump: IFetch, Decode/OpFetch, DoOp
• ALU: IFetch, Decode/OpFetch, DoOp, WriteReg
• Load: IFetch, Decode, EffAddr, Cache, WriteReg

Performance of Processor
• Which is more important?
  • execution time of an instruction, or
  • throughput of instruction execution (number of instructions executed per unit time)
• Cycles per instruction (CPI)
  • In our example, CPI is between 3 and 5
• Objective of Pipelining
  • To improve CPI; make it close to 1
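Where the CPI falls between 3 and 5 depends on the instruction mix. The sketch below takes the per-class cycle counts from the example above (jump 3, ALU 4, load 5) and weights them by a mix that is purely hypothetical; `average_cpi` is an illustrative name.

```python
cycles = {"jump": 3, "alu": 4, "load": 5}      # cycle counts from the example above
mix = {"jump": 0.2, "alu": 0.5, "load": 0.3}   # hypothetical instruction mix


def average_cpi(cycles: dict, mix: dict) -> float:
    """Weighted average of cycles per instruction over the instruction mix."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix fractions must sum to 1")
    return sum(mix[kind] * cycles[kind] for kind in mix)


print(round(average_cpi(cycles, mix), 2))  # 4.1
```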

Steps in Instruction Processing
1. Instruction Fetch (IF): instruction is fetched from memory and PC is incremented
2. Instruction Decode (ID): instruction is decoded and register operands fetched
3. Execute (EX): execute if arithmetic operation; else, calculate effective address
4. Memory operation (MEM): if Load/Store, do memory access
5. Write Back (WB): write back computed value to destination register

Pipelining
time (cycle)   1    2    3    4    5    6    7    8
i              IF   ID   EX   MEM  WB
i + 1               IF   ID   EX   MEM  WB
i + 2                    IF   ID   EX   MEM  WB
i + 3                         IF   ID   EX   MEM  WB
• Instruction execution time: 5 cycles
• Instruction execution throughput: 1 instruction per cycle
• It may not always be possible for instructions to progress through the pipeline in this way
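The effect on CPI can be worked out from the pipeline diagram: the first instruction takes 5 cycles and, with no hazards, each later instruction completes one cycle after its predecessor. A minimal sketch under that ideal assumption (function names are illustrative):

```python
def unpipelined_cycles(n: int, stages: int = 5) -> int:
    """Each instruction runs all its stages before the next one starts."""
    return n * stages


def pipelined_cycles(n: int, stages: int = 5) -> int:
    """Ideal pipeline: fill time for the first instruction, then one per cycle."""
    if n == 0:
        return 0
    return stages + (n - 1)


for n in (4, 1000):
    cycles = pipelined_cycles(n)
    # instructions, total cycles, effective CPI, speedup over the unpipelined case
    print(n, cycles, round(cycles / n, 3), round(unpipelined_cycles(n) / cycles, 2))
# 4 8 2.0 2.5
# 1000 1004 1.004 4.98
```

For long instruction sequences the effective CPI approaches 1, while each individual instruction still takes 5 cycles.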
Pipeline Hazards
Hazard: a situation that prevents the next instruction of the program from executing during its designated clock cycle
1. Structural hazard: Happens due to request for the same hardware resource by 2 or more instructions at the same time
2. Data hazard: Happens when one instruction depends on the result of previous instruction that is still in the pipeline
3. Control hazard: Happens due to control transfer instructions
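
1. Structural Hazards
MEM & IF need to use memory
cycle                        1    2    3    4    5    6    7    8    9
i      LW R3 ← mem[8(R2)]    IF   ID   EX   MEM  WB
i + 1                             IF   ID   EX   MEM  WB
i + 2                                  IF   ID   EX   MEM  WB
i + 3                                       B    IF   ID   EX   MEM  WB
(B: bubble. The IF of i + 3 in cycle 4 conflicts with the MEM of the LW, so i + 3 is fetched one cycle later)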

2. Data Hazards
cycle                     1    2    3    4    5    6    7    8
i      add R3 ← R1, R2    IF   ID   EX   MEM  WB
i + 1  sub R4 ← R3, R8         IF   B    B    ID   EX   MEM  WB
(B: bubble. The sub needs R3, which the add writes only in its WB stage)

A Data Hazard Solution
Interlock: hardware that detects data dependency and stalls dependent instructions
time →
instr   0    1    2      3      4    5    6
ADD     IF   ID   EX     MEM    WB
SUB          IF   stall  stall  ID   EX   MEM
OR                stall  stall  IF   ID   EX
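The number of stall cycles an interlock inserts depends on how far the dependent instruction is from its producer. The sketch below assumes the 5-stage pipeline of this lecture, no forwarding, operands read in ID, and that a register written in WB can be read by an ID in the same cycle (as the SUB row in the table suggests). The names are illustrative.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def interlock_stalls(distance: int) -> int:
    """Stall cycles for a consumer issued `distance` instructions after its producer."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    producer_wb = STAGES.index("WB")             # producer is fetched in cycle 0
    consumer_id = distance + STAGES.index("ID")  # consumer is fetched in cycle `distance`
    return max(0, producer_wb - consumer_id)


for d in (1, 2, 3):
    print(d, interlock_stalls(d))
# 1 2
# 2 1
# 3 0
```

A distance of 1 reproduces the two stalls of the SUB in the table; three or more intervening slots need no stall.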

Another Data Hazard Solution
Forwarding or Bypassing: forward the result to EX as soon as it is available
add R3 ← R1, R2    IF   ID   EX   MEM  WB
sub R5 ← R3, R4         IF   ID   EX   MEM
or  R7 ← R3, R6              IF   ID   EX

Other Data Hazards Solutions
• Delayed loads
  • Require that the instruction that uses the load value be separated from the load instruction
• Instruction Scheduling
  • Reorder instructions so that dependent instructions are far enough apart
  • Compile time vs run time instruction scheduling

Instruction Scheduling
Before Scheduling: 2 stalls (following load)
LW R3 ← 0(R1)
ADDI R5 ← R3, #1        (1 stall)
ADD R2 ← R2, R3
LW R13 ← 0(R11)
ADD R12 ← R13, R3       (1 stall)
After Scheduling: 0 stalls
LW R3 ← 0(R1)
LW R13 ← 0(R11)
ADDI R5 ← R3, #1
ADD R2 ← R2, R3
ADD R12 ← R13, R3
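The stall counts for the two orderings can be reproduced with a simple check for load-use pairs. This is a sketch that assumes forwarding, so the only stall is one cycle when an instruction uses a loaded value immediately after the load. The tuple encoding and function name are illustrative.

```python
# Each instruction: (opcode, destination register, source registers)
before = [
    ("LW",   "R3",  ["R1"]),
    ("ADDI", "R5",  ["R3"]),
    ("ADD",  "R2",  ["R2", "R3"]),
    ("LW",   "R13", ["R11"]),
    ("ADD",  "R12", ["R13", "R3"]),
]
# The scheduled order moves the second load up, next to the first.
after = [before[0], before[3], before[1], before[2], before[4]]


def load_use_stalls(program) -> int:
    """Count instructions that read a register loaded by the instruction just before."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        opcode, dest, _ = prev
        if opcode == "LW" and dest in cur[2]:
            stalls += 1
    return stalls


print(load_use_stalls(before), load_use_stalls(after))  # 2 0
```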

3. Control Hazards
[Figure: pipeline timing for BEQZ R3, out and the instructions that follow it]
• While the branch is in the pipeline: fetch instrn. (i + 1) or from target?
• Bubbles (B) are inserted until the branch condition & target are resolved
• Branch resolved; appropriate instruction correctly fetched
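The cost of these bubbles can be estimated by adding the average number of stall cycles per instruction to the ideal CPI of 1. The branch frequency and penalty below are hypothetical values chosen for illustration, not figures from the lecture, and `cpi_with_branch_stalls` is an illustrative name.

```python
def cpi_with_branch_stalls(branch_fraction: float, penalty_cycles: int,
                           base_cpi: float = 1.0) -> float:
    """Effective CPI when every branch costs a fixed number of bubble cycles."""
    if not 0.0 <= branch_fraction <= 1.0:
        raise ValueError("branch_fraction must lie in [0, 1]")
    return base_cpi + branch_fraction * penalty_cycles


# Hypothetical: 20% of instructions are branches, each costing 2 bubbles.
print(round(cpi_with_branch_stalls(0.2, 2), 2))  # 1.4
```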

Lecture Summary
• Computer architecture is the study of computer structures; design, evaluation, description
• It builds on a background of computer organization, the study of how data can be represented and manipulated
• Pipelined processors improve program execution time (instruction execution throughput) by overlapping in time the execution of many instructions

Next Week
• Instruction Level Parallelism (ILP) and how it is exploited by current processors to improve program execution time even more