TRANSCRIPT
11/12/2013 - Out-of-the-Box Computing - Patents pending
IEEE-SVC 2013/11/12
Drinking from the Firehose
Cool and cold transfer prediction in the Mill™ CPU Architecture
The Mill Architecture
Transfer prediction - without delay

New with the Mill:
• Run-ahead prediction: prediction before code is loaded
• Explicit prefetch prediction: no wasted instruction loads
• Automatic profiling: prediction in cold code
What is prediction?
Prediction is a micro-architecture mechanism to smooth the flow of instructions in today’s slow-memory and long-pipeline CPUs.
Like caches, the prediction mechanism and its success or failure are invisible to the program.
Present prediction methods work quite well in small, regular benchmarks run on bare machines.
They break down when code has irregular flow of control, and when processes are started or switched frequently.
Except in their performance and power impact.
The Mill CPU
The Mill is a new general-purpose commercial CPU family.
The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite.
This talk will explain:
• the problems that prediction is intended to alleviate
• how conventional prediction works
• the Mill CPU's novel approach to prediction
Talks in this series
1. Encoding
2. The Belt
3. Cache hierarchy
4. Prediction
5. Metadata and speculation
6. Specification
7. …
You are here: talk 4, Prediction.
Slides and videos of other talks are at:
ootbcomp.com/docs
Caution
Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist.
The reality is more complicated.
Branches vs. pipelines
if (I == 0) F(); else G();
Do we call F() or G()?

    load I
    eql  0
    brfl lab
    call F
    ...
lab:
    call G
    ...

Pipeline stages: cache - decode - schedule - execute
32 cycles (Intel Pentium 4 Prescott)
5 cycles (Mill)
Branches vs. pipelines
if (I == 0) F(); else G();
Pipeline stages: cache - decode - schedule - execute

    load I
    eql 0
    brfl
    stall (x6)
    call G

More stall than work!
Guess: call G (correct)

    load I
    eql 0
    brfl
    call G
    inst (x7)

Guess right? No stall!
So we guess…

if (I == 0) F(); else G();

Guess: call F (wrong)

    load I
    eql 0
    brfl
    call F
    inst (x7)

Guess wrong? Mispredict stalls!
So we guess…

if (I == 0) F(); else G();

Fix the prediction: call G

    (wrong-path instructions discarded)
    stall (x7)
    call G
    inst (x7)

Finally!
How the guess works

if (I == 0) F(); else G();

    load I
    eql 0
    brfl lab
    call F
    ...
lab:
    call G
    ...
How the guess works

if (I == 0) F(); else G();

While "load I / eql 0 / brfl" are still in the pipeline, the guess "call F" is already being fetched and decoded.
How the guess works

if (I == 0) F(); else G();

The guess ("call F") comes from a branch history table, consulted while the branch itself is still in flight.
How the guess works

if (I == 0) F(); else G();

    load I
    eql 0
    brfl
    stall (x2)
    call G
    inst (x3)

(guess supplied by the branch history table)
Many fewer stalls!
So what’s it cost?
When (as is typical):
• one instruction in eight is a branch
• the predictor guesses right 95% of the time
• the mispredict penalty is 15 cycles

then predict failure wastes 8.5% of cycles.
Simplest fix is to lower the miss penalty.
Shorten the pipeline!
The Mill pipeline is five cycles, not 15, so a Mill misprediction wastes only 3% of cycles.
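The arithmetic behind these percentages can be checked with a short sketch. This is a hedged model, assuming one cycle per instruction plus the stated penalty per mispredicted branch:

```python
def mispredict_waste(branch_rate, hit_rate, penalty_cycles):
    """Fraction of all cycles lost to branch mispredictions, assuming
    one cycle per instruction plus the penalty for each missed guess."""
    penalty_per_inst = branch_rate * (1.0 - hit_rate) * penalty_cycles
    return penalty_per_inst / (1.0 + penalty_per_inst)

# Conventional 15-cycle penalty vs. the Mill's 5-cycle penalty.
conventional = mispredict_waste(1 / 8, 0.95, 15)  # ~8.6%; the slide rounds to 8.5%
mill = mispredict_waste(1 / 8, 0.95, 5)           # ~3%
```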
The catch - cold code
The guess is based on prior history with the branch. What happens if there is no prior history?

Cold code means a random 50-50 guess.

In cold code:
• one instruction in eight is a branch
• the predictor guesses right 50% of the time
• the mispredict penalty is 15 cycles

so predict failure wastes 48% of cycles (23% on a Mill).
Ouch!
But wait – it gets worse!
Cold code means no relevant Branch History contents.
It also means no relevant cache contents.
[diagram: the pipeline (cache - decode - schedule - execute), backed by the branch history table and DRAM]

A mispredicted branch costs 15 cycles; an instruction fetch that must go to DRAM costs 300+ cycles.
Miss cost in cold code
In cold code, when:
• one instruction in eight is a branch
• the predictor guesses right 50% of the time
• the mispredict penalty is 15 cycles
• the cache miss penalty is 300 cycles
• a cache line is 64 bytes, i.e. 16 instructions

cold misses waste 96% of cycles (94% on a Mill).
Ouch!
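One plausible back-of-envelope model for these figures follows. The talk does not spell out its exact accounting; this version simply charges a full 300-cycle miss per 16-instruction line on top of the mispredict cost, and lands within a point or two of the quoted numbers:

```python
def cold_waste(branch_rate, hit_rate, mispredict_penalty,
               miss_penalty, insts_per_line):
    """Fraction of cycles wasted in cold code: half the guesses fail,
    and every cache line must be fetched from DRAM."""
    per_inst = (branch_rate * (1.0 - hit_rate) * mispredict_penalty
                + miss_penalty / insts_per_line)
    return per_inst / (1.0 + per_inst)

print(round(cold_waste(1 / 8, 0.5, 15, 300, 16), 3))  # conventional
print(round(cold_waste(1 / 8, 0.5, 5, 300, 16), 3))   # Mill
```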
What to do?
Use bigger cache lines? Internal fragmentation means no gain.
Fetch more lines per miss? Cache thrashing means no gain.
Nothing technical works very well.
What to do?
Choose short benchmarks! No problem when the benchmark is only a thousand instructions.
Blame the software! Code bloat is a software vendor problem, not a CPU problem.
Blame the memory vendor! Memory speed is a memory vendor problem, not a CPU problem.
This approach works.
(for some value of “works”)
Fundamental problems
Don’t know how much to load from DRAM. The Mill knows how much will execute.
Can’t spot branches until loaded and decoded. The Mill knows where branches are, in unseen code.
Can’t predict spotted branches without history. The Mill can predict in never-executed code.
The rest of the talk shows how the Mill does this.
Extended Basic Blocks (EBBs)
[diagram: the program counter follows a branch into an EBB, then along a chain of EBBs]

The Mill groups code into Extended Basic Blocks (EBBs): single-entry, multiple-exit sequences of instructions.
Branches can only target EBB entry points; it is not possible to jump into the middle of an EBB.
Execution flows through a chain of EBBs.
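The entry-point rule can be sketched as a toy model (addresses, lengths, and exits are invented for illustration):

```python
# Toy model of the EBB rule: control transfers may target only EBB
# entry points, never an instruction in the middle of an EBB.
ebbs = {
    0x100: {"length": 6, "exits": [0x200, 0x300]},  # single entry, two possible exits
    0x200: {"length": 4, "exits": [0x300]},
    0x300: {"length": 3, "exits": []},
}

def valid_branch_target(addr):
    # Legal if and only if addr is some EBB's entry point.
    return addr in ebbs

assert valid_branch_target(0x200)      # an entry point: fine
assert not valid_branch_target(0x102)  # mid-EBB: not a legal target
```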
Predicting EBBs
With an EBB organization, you don’t have to predict each branch. Only one of the possibly many branches will pass control out of the EBB - so predict which one.

If control enters here, predict that control will exit there.

The Mill predicts exits, not branches.
Representing exits

Code is sequential in memory, and is held in cache lines, which are also sequential.
Representing exits
There is one EBB entry point and one predicted exit point; the prediction is represented as the difference between them.
Representing exits
Rather than a byte or instruction count, the Mill predicts:
• the number of cache lines to the exit (here, line count 2)
• the number of instructions in the last line (here, inst count 3)
Representing exits
Predictions contain:
• the line count (2) and inst count (3), as above
• the offset of the transfer target from the entry point (0xabcd)
• the kind of transfer: jump, return, inner call, or outer call
“When we enter the EBB: fetch two lines, decode from the entry through the third instruction in the second line, and then jump to (entry+0xabcd)”
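Collected into one record, a prediction might look like this sketch (the field names and layout are hypothetical; the talk gives only the contents):

```python
from dataclasses import dataclass

@dataclass
class ExitPrediction:
    line_count: int     # cache lines to fetch for this EBB
    inst_count: int     # instructions to decode in the last line
    target_offset: int  # transfer target, relative to the EBB entry
    kind: str           # "jump", "return", "inner call", or "outer call"

    def next_entry(self, entry_addr):
        """Where execution goes when the predicted exit is taken."""
        return entry_addr + self.target_offset

p = ExitPrediction(line_count=2, inst_count=3, target_offset=0xabcd, kind="jump")
assert p.next_entry(0x1000) == 0x1000 + 0xabcd
```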
The Exit Table
[example prediction: line count 2, inst count 3, target 0xabcd, kind jump]
The Exit Table
Predictions are stored in the hardware Exit Table.

The Exit Table:
• is direct-mapped, with victim buffers
• is keyed by the EBB entry address and history info
• has check bits to detect collisions
• can use any history-based algorithm

Capacity varies by Mill family member.
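A minimal sketch of how check bits catch collisions in a direct-mapped table follows. The table size is illustrative, and the mixing of history info into the key is simplified away:

```python
TABLE_SIZE = 1024   # illustrative; real capacity varies by family member
CHECK_MASK = 0xFF   # a few high key bits stored to detect collisions

table = [None] * TABLE_SIZE   # each slot holds (check_bits, prediction)

def _index(key):
    return key % TABLE_SIZE, (key // TABLE_SIZE) & CHECK_MASK

def store(entry_addr, prediction):
    slot, check = _index(entry_addr)
    table[slot] = (check, prediction)

def lookup(entry_addr):
    slot, check = _index(entry_addr)
    hit = table[slot]
    if hit is not None and hit[0] == check:
        return hit[1]
    return None   # a miss, or a collision caught by the check bits

store(0x4123, "pred-A")
assert lookup(0x4123) == "pred-A"
assert lookup(0x8123) is None   # same slot, different check bits: detected
```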
Exit chains
Starting with an entry point, the Mill can chain through successive predictions without actually looking at the code.
Probe the Exit Table, using the entry address (123) as the key, returning the keyed prediction (offset +17).
Exit chains

Add the offset to the EBB entry address to get the next EBB entry address: 123 + 17 = 140.
Exit chains

Rinse and repeat: 140 + (-42) = 98, and so on.

Repeat until:
• there is no prediction in the table
• an entry is seen again (a loop)
• the chain has gone as far as you wanted
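The chaining walk on the last three slides can be sketched directly, using the offsets shown (entry 123 with predicted exit offset +17, entry 140 with offset -42):

```python
exit_table = {123: 17, 140: -42}   # entry address -> predicted exit offset

def chain(entry, max_steps=8):
    """Follow predictions until there is no entry, a loop, or the limit."""
    path, seen = [entry], {entry}
    while entry in exit_table and len(path) < max_steps:
        entry += exit_table[entry]   # next EBB entry address
        if entry in seen:            # seen before: a loop, stop chaining
            break
        path.append(entry)
        seen.add(entry)
    return path

assert chain(123) == [123, 140, 98]   # 123 + 17 = 140; 140 - 42 = 98; 98 misses
```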
Prefetch
Predictions chained from the Exit Table are handed to the Prefetcher, which prefetches the predicted lines from the cache/DRAM hierarchy.

A prefetch cannot fault or trap; instead, it stops the chaining.
Prefetches are low priority, using idle cycles to memory.
The Prediction Cache
After prefetch, chained predictions are stored in the Prediction Cache.

The Prediction Cache is small, fast, and fully associative.
Chaining from the Exit Table stops if a prediction is found to be already in the Prediction Cache, typically indicating a loop.
Chaining continues in the cache, possibly looping; a miss resumes from the Exit Table.
The Fetcher
Predictions are chained from the Prediction Cache (following loops) on to the Fetcher.
The Fetcher
Lines are fetched from the regular cache hierarchy into a microcache attached to the decoder.
The Decoder
Prediction chains end at the Decoder, which also receives a stream of the corresponding cache lines from the Microcache.
The result is that the Decoder has a queue of predictions, and another queue of the matching cache lines, that are kept continuously full and available. It can decode down the predicted path at the full 30+ instructions per cycle speed.
Timing
[timing diagram: Exit Table (3 cycles) -> Prediction Cache (2 cycles) -> Prefetcher/Fetcher (2 cycles) -> Microcache/Decoder (2 cycles); part of this path is the mispredict penalty; vertically aligned units work in parallel]
Once started, the predictor can sustain one prediction every three cycles from the Exit Table.
Fundamental problems redux
Don’t know how much to load from DRAM. The Mill knows how much will execute.
Can’t spot branches until loaded and decoded. The Mill knows where branches are, in unseen code.
Can’t predict spotted branches without history. The Mill can predict in never-executed code.
Prediction feedback
All predictors use feedback from execution experience to alter predictions to track changing program behavior.
[feedback path: Execute -> Exit Table]

If a prediction was wrong, then it can be changed to predict what actually did happen.
The Exit Table contents reflect the current history for all contained predictions.
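In its simplest form, feedback is just an overwrite. The slide only says a wrong prediction "can be changed"; real predictors may add hysteresis, and the table contents here are invented:

```python
exit_table = {0x100: 0x40}   # entry address -> predicted exit offset (made up)

def report_exit(entry_addr, actual_offset):
    """Execution feedback: make the table say what actually happened."""
    if exit_table.get(entry_addr) != actual_offset:
        exit_table[entry_addr] = actual_offset

report_exit(0x100, 0x80)           # the prediction was wrong...
assert exit_table[0x100] == 0x80   # ...so now it predicts what happened
```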
“All contained predictions”?
Not one prediction for each EBB in the program? No!
The tables are much too small to hold predictions for all EBBs.

In a conventional branch predictor, each prediction is built up over time with increasing experience of the particular branch.

But if the CPU is switched to another process, the predictions are thrown away and overwritten. Every process switch is followed by a period of poor predictions while experience is built up again.
A second source of predictions
Like others, the Mill builds predictions from experience.
However, it has a second source: the program load module.
[diagram: the program load module (code, static data, predictions) feeding the Exit Table, which feeds decode]

The load module is used when there is no experience: missing predictions are read from the load module.
But there’s a catch…
Loading a prediction from DRAM (or even L2 cache) takes much longer than a mispredict penalty!
By the time it’s loaded we no longer need it!
Solution: load bunches of likely-needed predictions
But – what predictions are likely-needed?
Likely-needed predictions
Should we load on a misprediction? No. We have a prediction - it's just wrong.
Should we load on a missing prediction? No. It may be only a rarely-taken path that aged out of the table.
We should bulk-load only when entering a whole new region of program activity that we haven't been to before (or not recently), and may stay in for a while, or re-enter.
Like a function.
Likely-needed predictions
The Mill bulk-loads the predictions of a function when the call finds no prediction for the entry EBB.
    int main() {
        phase1();
        phase2();
        phase3();
        return 0;
    }

Each call triggers loading of the predictions for the code of that function.
Program phase-change
At a phase change (or on re-entering code that was swapped out long enough ago):

1. Recognize when a chain or misprediction leads to a call for which there is no Exit Table entry.
2. Bulk-load the predictions for the function.
3. Start the prediction chain in the called function.
4. Chaining will prefetch the predicted code path.
5. Execute as fast as the code comes in.

Overall delay: one load time for the first predictions, plus one load time for the initial code prefetch - two loads total; everything after that happens in parallel.

Vs. conventional: one code load time per branch.
Where does the load module get its predictions?

The compiler can perfectly predict EBBs that contain no conditional branches: their calls, returns and jumps are known statically.
A profiler can measure conditional behavior. But instrumenting the load module changes the behavior.
So the Mill does it for you.
The Exit Table hardware logs experience with predictions. Post-processing of the log updates the load module. Log info is available for JITs and optimizers.
Mill programs get faster every time they run.
The fine print
Newly-compiled predictions assume every EBB will execute to the final transfer. This policy causes all cache lines of the EBB to be prefetched, improving performance at the expense of loading unused lines. Later experience corrects the line counts.
When experience shows that an EBB in a function is almost never entered (often error code) then it is omitted from the bulk load list, saving Exit Table space and memory traffic.
Fundamental problem summary
Don’t know how much to load from DRAM.
Mill knows how much will execute.
Can’t spot branches until loaded and decoded.
Mill knows where the exits are.
Can’t predict spotted branches without history.
Mill can predict in never-executed code.
Mill programs get faster every time they run.