ECE 4100/6100 Advanced Computer Architecture
Lecture 6: Instruction Fetch
Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology
Instruction Supply Issues
• Fetch throughput defines max performance that can be achieved in later stages
• Superscalar processors need to supply more than 1 instruction per cycle
• Instruction supply is limited by
  – Misalignment of multiple instructions in a fetch group
  – Changes of flow (interrupting instruction supply)
  – Memory latency and bandwidth
(Figure: Instruction Fetch Unit feeding an instruction buffer, which feeds the Execution Core)
Aligned Instruction Fetching (4 instructions)
(Figure: one 64B I-cache line A0–A15 interleaved across four banks (00, 01, 10, 11), each with its own row decoder. With PC = ..xx000000 and a 16B fetch group, one row access in cycle n pulls out four instructions at a time.)
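The index arithmetic behind the figure can be sketched in a few lines of Python. The names (`LINE`, `aligned_fetch`) and the list layout are illustrative, assuming 4-byte instructions and a 64B line as on the slide.

```python
# Hedged sketch: aligned fetch from a banked I-cache line. One 64B line
# holds 16 4-byte instructions A0..A15; one row access across the four
# banks returns a 16B fetch group of 4 consecutive instructions.

LINE = [f"A{i}" for i in range(16)]   # one I-cache line

def aligned_fetch(pc):
    """Return the 4-instruction fetch group for a 16B-aligned PC."""
    offset = (pc % 64) // 4           # instruction index within the line
    row = offset // 4                 # which row the decoders select
    return LINE[4 * row : 4 * row + 4]

# PC = ..xx000000 selects row 0, i.e. A0..A3, in a single cycle n.
```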
Misaligned Fetch
(Figure: same banked I-cache, but PC = ..xx001000 starts the fetch group mid-row. Each bank selects its own row, and a rotating network restores program order, still within cycle n, as in the IBM RS/6000.)
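A sketch of the misaligned case: each bank independently picks the row holding the instruction it must supply, and a rotating network restores program order. The per-bank index formula is an illustrative reconstruction, assuming the fetch group still fits in one line (the split-line case is the next slide).

```python
# Hedged sketch: misaligned fetch with per-bank row selection and a
# rotating network (IBM RS/6000 style). Names are illustrative.

BANKS = 4
LINE = [f"A{i}" for i in range(16)]   # one 64B I-cache line, 4B insts

def misaligned_fetch(pc):
    start = (pc % 64) // 4            # first instruction index
    out = []
    for bank in range(BANKS):
        # Each bank picks the row that holds the instruction it owns
        # within this fetch group (assumes start + 4 <= 16, i.e. no
        # split across cache lines).
        idx = start + ((bank - start) % BANKS)
        out.append(LINE[idx])         # bank outputs, in bank order
    rot = start % BANKS               # rotating network restores order
    return out[rot:] + out[:rot]
```

For PC = ..001000 (start = A2) the banks deliver A4 A5 A2 A3, and the rotation yields A2 A3 A4 A5.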
Split Cache Line Access
(Figure: with PC = ..xx111000 the 16B fetch group spans cache lines A and B, so the fetch must be broken down into 2 physical accesses: line A in cycle n, line B in cycle n+1.)
Split Cache Line Access Miss
(Figure: the same split access, but cache line B misses; the first half is read in cycle n and the second half only arrives in cycle n+X, once the miss is serviced.)
High Bandwidth Instruction Fetching
(Figure: control-flow graph of basic blocks BB1–BB7)
• Wider issue → more instruction feed needed
• Major challenge: fetch more than one non-contiguous basic block per cycle
• Enabling techniques (branch prediction is a given):
  – Predication
  – Branch alignment based on profiling
  – Other hardware solutions
Predication Example
• Convert control dependency into data dependency
• Enlarge basic block size
  – More room for scheduling
  – No fetch disruption

Source code:
  if (a[i+1] > a[i])
      a[i+1] = 0;
  else
      a[i] = 0;

Typical assembly:
      lw  r2, [r1+4]
      lw  r3, [r1]
      blt r3, r2, L1
      sw  r0, [r1]
      j   L2
  L1: sw  r0, [r1+4]
  L2:

Assembly w/ predication:
      lw  r2, [r1+4]
      lw  r3, [r1]
      sgt pr4, r2, r3
  (p4)  sw r0, [r1+4]
  (!p4) sw r0, [r1]
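The same transformation can be modeled in Python, as a sketch: both guarded stores always "execute", and the predicate, not a branch, decides which value lands.

```python
# Hedged sketch: predication modeled as data selection. Both
# assignments always run (no control flow), mirroring the two
# predicated stores; the predicate decides which one takes effect.

def predicated(a, i):
    p = a[i + 1] > a[i]               # sgt pr4, r2, r3
    a[i + 1] = 0 if p else a[i + 1]   # (p4)  sw r0, [r1+4]
    a[i]     = a[i] if p else 0       # (!p4) sw r0, [r1]
    return a
```

With no branch in the stream, fetch is never disrupted and the whole block can be scheduled as one unit.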
Collapse Buffer [ISCA 95]
• To fetch multiple (often non-contiguous) instructions
• Use interleaved BTB to enable multiple branch predictions
• Align instructions in the predicted sequential order
• Use banked I-cache for multiple line access
Collapsing Buffer
(Figure: the fetch PC indexes an interleaved BTB and two I-cache banks; an interchange switch reorders the bank outputs, and a collapsing circuit compacts the valid instructions.)
Collapsing Buffer Mechanism
(Figure: the interleaved BTB produces fetch addresses A and E; bank routing sends them to the two cache banks, which deliver lines A B C D and E F G H. The interchange switch restores the order A B C D E F G H, and the collapsing circuit uses the valid instruction bits to squeeze out D, F, and H, yielding A B C E G.)
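The collapsing step itself reduces to compaction under a valid-bit mask. A minimal sketch (names illustrative):

```python
# Hedged sketch: the collapsing circuit compacts the interchange-switch
# output using per-instruction valid bits, squeezing out instructions
# that the predicted path skips over.

def collapse(insts, valid):
    """Keep only instructions whose valid bit is set, preserving order."""
    return [x for x, v in zip(insts, valid) if v]
```

With the slide's example (A B C D E F G H, where D, F, and H are invalid), the output is A B C E G.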
High Bandwidth Instruction Fetching
(Figure: control-flow graph of basic blocks BB1–BB7)
• To fetch more, we need to cross multiple basic blocks (and/or multiple cache lines)
• Requires multiple branch predictions per cycle
Multiple Branch Predictor [YehMarrPatt ICS’93]
• Pattern History Table (PHT) design to support multiple branch prediction (MBP)
• Based on global history only
(Figure: the Branch History Register (BHR, bits bk…b1) indexes the PHT for the primary prediction p1; the secondary prediction extends the history with p1, and the tertiary prediction with p1 and p2. The PHT and BHR are updated with the actual outcomes.)
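A sketch of how three predictions per cycle can come out of one global PHT, in the spirit of the Yeh/Marr/Patt design: the second and third lookups speculatively shift the earlier predictions into the history. Table size, counter width, and update policy here are illustrative, not taken from the paper.

```python
# Hedged sketch: multiple branch predictions from one global Pattern
# History Table of 2-bit counters, indexed by a K-bit history.

K = 4                                   # global history bits (assumed)
PHT = [2] * (1 << K)                    # 2-bit counters, init weakly taken

def predict(bhr):
    """Return (primary, secondary, tertiary) taken/not-taken guesses."""
    mask = (1 << K) - 1
    p1 = PHT[bhr] >= 2                  # primary prediction
    h2 = ((bhr << 1) | p1) & mask       # history extended with p1
    p2 = PHT[h2] >= 2                   # secondary prediction
    h3 = ((h2 << 1) | p2) & mask        # ... and with p2
    p3 = PHT[h3] >= 2                   # tertiary prediction
    return p1, p2, p3

def update(bhr, taken):
    """Train the indexed counter; return the new (real) history."""
    PHT[bhr] = min(3, PHT[bhr] + 1) if taken else max(0, PHT[bhr] - 1)
    return ((bhr << 1) | int(taken)) & ((1 << K) - 1)
```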
Multiple Branch Prediction
• Fetch address could be retrieved from BTB
• Predicted path: BB1 → BB2 → BB5
• How to fetch BB2 and BB5? BTB?
  – Can’t: the branch PCs of br1 and br2 are not available when the MBP is made
  – Use a BAC design
(Figure: BB1 ends in br1, whose outcome is the 2nd-level prediction (T to BB2, F to BB3); BB2 and BB3 end in br2, whose outcome is the 3rd-level prediction, leading to BB4–BB7. Only the fetch address of BB1 (the br0 primary prediction) comes from the BTB entry.)
Branch Address Cache
• Use a Branch Address Cache (BAC): keep 6 possible fetch addresses for 2 more predictions
• br: 2 bits for branch type (cond, uncond, return)
• V: single valid bit (to indicate if the sequence hits a branch)
• To make one more level of prediction:
  – Need to cache another 8 addresses (i.e., 14 addresses total)
  – 464 bits per entry = (23+3)*1 + (30+3)*(2+4) + 30*8
(Figure: a BAC entry, indexed by the fetch address from the BTB, holds a 23-bit tag with V/br bits, the 30-bit taken and not-taken target addresses, and the four 2nd-level addresses T-T, T-N, N-T, N-N; 212 bits per fetch address entry.)
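A sketch of how a BAC entry might be consumed: given the current fetch address and the two extra predictions, it yields the next two fetch addresses before either branch has even been fetched. The dict layout and the example addresses are made up for illustration.

```python
# Hedged sketch: a Branch Address Cache entry caching the targets for
# the next two prediction levels. Field names follow the slide; the
# addresses below are illustrative.

BAC = {
    0x100: {                      # indexed by the current fetch address
        "T": 0x200, "N": 0x140,   # 2nd-level fetch addresses
        "TT": 0x300, "TN": 0x240, # 3rd-level fetch addresses
        "NT": 0x180, "NN": 0x160,
    }
}

def next_fetch_addrs(fetch_pc, p1, p2):
    """Return the two follow-on fetch addresses for predictions p1, p2."""
    e = BAC[fetch_pc]
    lvl2 = e["T"] if p1 else e["N"]
    lvl3 = e[("T" if p1 else "N") + ("T" if p2 else "N")]
    return lvl2, lvl3
```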
Caching Non-Consecutive Basic Blocks
• Goal: high fetch bandwidth + low latency
(Figure: in a conventional instruction cache, BB1–BB5 sit in scattered locations and must be fetched separately; a trace cache lays BB1 BB2 BB3 BB4 BB5 out in linear memory locations so they can be fetched together.)
Trace Cache
• Cache dynamic non-contiguous instructions (traces)
• Cross multiple basic blocks
• Need to predict multiple branches (MBP)
(Figure: for the dynamic sequence A B C D E F G H I J spread over several cache lines, a conventional I$ fetch takes 5 cycles, a collapsing-buffer fetch takes 3 cycles, and a trace cache ($T) fetch returns the whole trace in 1 cycle.)
Trace Cache [Rotenberg Bennett Smith MICRO‘96]
• Cache at most (in the original paper):
  – M branches (M = 3 in all follow-up TC studies, due to MBP)
  – N instructions (N = 16 in all follow-up TC studies)
• Fall-thru address is used if the last branch is predicted not taken
(Figure: a trace cache line holds a tag, branch flags, a branch mask, the fall-thru address, and the taken address, plus up to N instructions spanning M branches. A line fill buffer builds lines on a T.C. miss; on a T.C. hit, N instructions are delivered at once. Example: branch flags “10” mean the 1st branch is taken and the 2nd not taken; mask “11,1” means “11” (three branches) and the trailing “1” indicates the trace ends with a branch.)
Trace Hit Logic
(Figure: for the fetch address A, the line’s tag is compared against A to match the first block, and the branch flags (“10”) are compared against the multi-branch predictor’s outcomes, gated by the branch mask (“11,1”), to match the remaining blocks. The conditional AND of both checks signals a trace hit, and the last prediction selects the next fetch address between the fall-thru address X and the target address Y.)
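The hit check can be sketched directly. The tuple layout (tag, interior branch flags, fall-thru, target) is an illustrative simplification of the real line format.

```python
# Hedged sketch: trace-cache hit logic. A stored trace hits only if the
# fetch address matches the tag AND the multi-branch predictor agrees
# with the branch directions recorded inside the trace.

def trace_hit(line, fetch_pc, preds):
    """preds: this cycle's predictions (True = taken), covering every
    branch in the trace, the final one included."""
    tag, flags, fall_thru, target = line
    # Cond. AND: tag match plus direction match for interior branches.
    if fetch_pc != tag or list(flags) != list(preds[:len(flags)]):
        return False, None
    # On a hit, the last prediction picks the next fetch address.
    nxt = target if preds[len(flags)] else fall_thru
    return True, nxt
```

With the slide's entry (tag A, flags "10", fall-thru X, target Y), predictions T, N for the interior branches give a hit, and the third prediction selects between Y (taken) and X (fall-thru).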
Trace Cache Example
(Figure: loop of basic blocks A (5 insts), B (6 insts), C (12 insts), D (4 insts); A branches to B or C, both fall into D, and D either loops back to A or exits.)
BB traversal path: ABDABDACDABDACDABDAC
Trace cache (5 lines, 16 instructions each):
  A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 D1 D2 D3 D4
  A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11
  C12 D1 D2 D3 D4 A1 A2 A3 A4 A5
  B1 B2 B3 B4 B5 B6 D1 D2 D3 D4 A1 A2 A3 A4 A5
  C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C12 D1 D2 D3 D4
A trace ends on: Cond 1: 3 branches; Cond 2: trace cache line filled; Cond 3: exit.
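The fill behavior can be simulated. Block sizes and the traversal path are taken from the slide; the greedy construction loop itself is an illustrative reconstruction. Running it reproduces the five distinct lines shown, with a partial trace holding only C12 left over at the exit.

```python
# Hedged sketch: trace construction for the slide's example.
# A trace ends after M = 3 branches (one per basic block here), when
# the N = 16-entry line fills mid-block, or at program exit.

SIZE = {"A": 5, "B": 6, "C": 12, "D": 4}   # insts per basic block
PATH = list("ABDABDACDABDACDABDAC")        # BB traversal path
N, M = 16, 3                               # line capacity, max branches

def build_traces(path):
    traces, i, carry = [], 0, 0            # carry = insts already taken
    while i < len(path):                   # from a block split earlier
        trace, insts, brs = [], 0, 0
        while i < len(path) and brs < M:
            size = SIZE[path[i]] - carry   # remaining insts of block
            take = min(size, N - insts)
            trace.append((path[i], take))
            insts += take
            if take < size:                # line filled mid-block
                carry += take
                break
            carry, brs, i = 0, brs + 1, i + 1
            if insts == N:
                break
        traces.append(trace)
    return traces

unique = []
for t in build_traces(PATH):
    if t not in unique:                    # distinct trace cache lines
        unique.append(t)
```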
Trace Cache Example (cont.)
After the fifth line (C1 … C12 D1 … D4) is written, the trace cache is full.
Trace Cache Example (cont.)
For the rest of the traversal: how many hits? What is the utilization?
Redundancy
• Duplication
  – Each instruction appears only once in the I-cache
  – The same instruction appears many times in the TC
• Fragmentation
  – If 3 BBs < 16 instructions
  – If a multiple-target branch (e.g., return, indirect jump, or trap) is encountered, stop trace construction
  – Empty slots → wasted resources
• Example
  – A loop of four BBs, A (6 insts), B (4), C (6), D (3), is broken up into the traces (ABC), (BCD), (CDA), (DAB)
  – (ABC) = 16 insts, (BCD) = 13 insts, (CDA) = 15 insts, (DAB) = 13 insts
  – Each instruction is duplicated 3 times
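The example's arithmetic can be checked directly; the block sizes (A = 6, B = 4, C = 6, D = 3) are read off the slide.

```python
# Hedged sketch: counting the duplication in the slide's example, a
# 4-block loop stored as the four traces (ABC), (BCD), (CDA), (DAB).

SIZE = {"A": 6, "B": 4, "C": 6, "D": 3}
TRACES = ["ABC", "BCD", "CDA", "DAB"]

# Trace lengths: 16, 13, 15, 13 instructions.
lens = [sum(SIZE[b] for b in t) for t in TRACES]

# Every block appears in 3 of the 4 traces, so each instruction is
# stored 3 times in the TC vs. once in a conventional I-cache.
appearances = {b: sum(t.count(b) for t in TRACES) for b in SIZE}
```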
Indexability
• TC saved traces (EAC) and (BCD)
• Path: (EAC) to (D)
  – Cannot index interior block (D)
  – Can cause duplication
• Need partial matching
  – (BCD) is cached, but only (BC) is needed
(Figure: control-flow graph with blocks A–G; the trace cache holds the lines (E A C) and (B C D).)
Pentium 4 (NetBurst) Trace Cache
(Figure: the front-end BTB, iTLB, and prefetcher fetch from the L2 cache into the decoder; decoded instructions fill the Trace $, which has its own BTB and feeds rename, execute, etc.)
• No I$ !!
• Stores decoded instructions
• Trace-based prediction (predict next-trace, not next-PC)