Download - Lecture 19: Core Design
![Page 1: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/1.jpg)
1
Lecture 19: Core Design
• Today: issue queue, ILP, clock speed, ILP innovations
![Page 2: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/2.jpg)
Wakeup Logic
rdyL rdyRtagRtagL
or = = or
tag1 tagIW
…
rdyL rdyRtagRtagL
.
.
.
.
.
.
![Page 3: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/3.jpg)
Selection Logic
Issue window
req grant
anyreq enable
enable
Arbiter cell
• For multiple FUs, will need sequential selectors
![Page 4: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/4.jpg)
4
Structure Complexities
• Critical structures: register map tables, issue queue, LSQ, register file, register bypass
• Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs)
• Conflict between the desire to increase IPC and clock speed
• Can achieve both if we use large structures and deep pipelining; but, some structures can’t be easily pipelined and long-latency structures can also hurt IPC
![Page 5: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/5.jpg)
5
Deep Pipelines
• What does it mean to have 2-cycle wakeup 2-cycle bypass 2-cycle regread
![Page 6: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/6.jpg)
Scaling Options
20-IQ
40Regs
F
F
F
F
20-IQ
40Regs
F
F
F
F
2-cycle wakeup2-cycle regread2-cycle bypass
15-IQ
30Regs
F
F
F
15-IQ
30Regs
F
F
F
15-IQ
30Regs
F
F
F
Pipeline Scaling
Capacity Scaling
Replicated Capacity Scaling
![Page 7: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/7.jpg)
7
Recent Trends
• Not much change in structure capacities
• Not much change in cycle time
• Pipeline depths have become shorter (circuit delays have reduced); this is good for energy efficiency
• Optimal performance is observed at about 50 pipeline stages (we are currently at ~20 stages for energy reasons)
• Deep pipelines improve parallelism (helps if there’s ILP); Deep pipelines increase the gap between dependent instructions (hurts when there is little ILP)
![Page 8: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/8.jpg)
ILP Limits Wall 1993
![Page 9: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/9.jpg)
9
Techniques for High ILP
• Better branch prediction and fetch (trace cache) cascading branch predictors?• More physical registers, ROB, issue queue, LSQ two-level regfile/IQ?• Higher issue width clustering?• Lower average cache hierarchy access time• Memory dependence prediction• Latency tolerance techniques: ILP, MLP, prefetch, runahead, multi-threading
![Page 10: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/10.jpg)
Impact of Mem-Dep Prediction
• In the perfect model, loads only wait for conflicting stores; in naïve model, loads issue speculatively and must be squashed if a dependence is later discovered
From Chrysos and Emer, ISCA’98
![Page 11: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/11.jpg)
Clustering
Reg-rename &Instr steer
IQ
Regfile
F F
IQ
Regfile
F F
r1 r2 + r3r4 r1 + r2r5 r6 + r7r8 r1 + r5
p21 p2 + p3p22 p21 + p2p42 p21
p41 p56 + p57p43 p42 + p41
40 regs in each cluster
r1 is mapped to p21 and p42 – will influence steering and instr commit – on average, only 8 replicated regs
![Page 12: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/12.jpg)
2Bc-gskew Branch Predictor
Address
Address+History
BIM
Meta
G1
G0
Pred
Vote
44 KB; 2-cycle access; used in the Alpha 21464
![Page 13: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/13.jpg)
Rules
• On a correct prediction if all agree, no update if they disagree, strengthen correct preds and chooser
• On a misprediction update chooser and recompute the prediction
on a correct prediction, strengthen correct preds on a misprediction, update all preds
![Page 14: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/14.jpg)
Runahead Mutlu et al., HPCA’03
TraceCache
CurrentRename
IssueQRegfile (128)
CheckpointedRegfile (32)
RetiredRename
ROB
FUs L1 DRunahead
Cache
When the oldest instruction is a cache miss, behave like itcauses a context-switch:
• checkpoint the committed registers, rename table, return address stack, and branch history register• assume a bogus value and start a new thread• this thread cannot modify program state, but can prefetch
![Page 15: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/15.jpg)
Memory Bottlenecks
• 128-entry window, real L2 0.77 IPC
• 128-entry window, perfect L2 1.69
• 2048-entry window, real L2 1.15
• 2048-entry window, perfect L2 2.02
• 128-entry window, real L2, runahead 0.94
![Page 16: Lecture 19: Core Design](https://reader036.vdocument.in/reader036/viewer/2022082818/5681306f550346895d964f28/html5/thumbnails/16.jpg)
16
Title
• Bullet