TRANSCRIPT
11/12/2013 - Out-of-the-Box Computing - Patents pending
IEEE-SVC 2013/11/12
Drinking from the Firehose
Cool and cold transfer prediction in the Mill™ CPU Architecture
The Mill Architecture
Transfer prediction - without delay

New with the Mill:
• Run-ahead prediction: prediction before code is loaded
• Explicit prefetch prediction: no wasted instruction loads
• Automatic profiling: prediction in cold code
What is prediction?
Prediction is a micro-architecture mechanism to smooth the flow of instructions in today’s slow-memory and long-pipeline CPUs.
Like caches, the prediction mechanism and its success or failure are invisible to the program.
Present prediction methods work quite well in small, regular benchmarks run on bare machines.
They break down when code has irregular flow of control, and when processes are started or switched frequently.
Except in their performance and power impact.
The Mill CPU
The Mill is a new general-purpose commercial CPU family.
The Mill has a 10x single-thread power/performance gain over conventional out-of-order superscalar architectures, yet runs the same programs, without rewrite.
This talk will explain:
• the problems that prediction is intended to alleviate
• how conventional prediction works
• the Mill CPU's novel approach to prediction
Talks in this series
1. Encoding
2. The Belt
3. Cache hierarchy
4. Prediction
5. Metadata and speculation
6. Specification
7. …
You are here: talk 4, Prediction.
Slides and videos of other talks are at:
ootbcomp.com/docs
Caution
Gross over-simplification!
This talk tries to convey an intuitive understanding to the non-specialist.
The reality is more complicated.
Branches vs. pipelines
if (I == 0) F(); else G();
Do we call F() or G()?

    load I
    eql  0
    brfl lab
    call F
    ...
lab:
    call G
    ...

Pipeline stages: cache - decode - schedule - execute
32 cycles (Intel Pentium 4 Prescott)
5 cycles (Mill)
Branches vs. pipelines
if (I == 0) F(); else G();
Pipeline stages: cache - decode - schedule - execute

    load I
    eql 0
    brfl
    stall (x6)
    call G

More stall than work!
Guess: call G (correct)

    load I
    eql 0
    brfl
    call G
    inst (x7)

Guess right? No stall!
So we guess…

if (I == 0) F(); else G();

Guess: call F (wrong)

    load I
    eql 0
    brfl
    call F
    inst (x7)

Guess wrong? Mispredict stalls!
So we guess…

if (I == 0) F(); else G();

Fix the prediction: call G

    (wrong-path instructions discarded)
    stall (x7)
    call G
    inst (x7)

Finally!
How the guess works

if (I == 0) F(); else G();

    load I
    eql 0
    brfl lab
    call F
    ...
lab:
    call G
    ...
How the guess works

if (I == 0) F(); else G();

While "load I / eql 0 / brfl" are still in the pipeline, the guess "call F" is already being fetched and decoded.
How the guess works

if (I == 0) F(); else G();

The guess ("call F") comes from a branch history table, consulted while the branch itself is still in flight.
How the guess works

if (I == 0) F(); else G();

    load I
    eql 0
    brfl
    stall (x2)
    call G
    inst (x3)

(guess supplied by the branch history table)
Many fewer stalls!
So what’s it cost?
When (as is typical):
• one instruction in eight is a branch
• the predictor guesses right 95% of the time
• the mispredict penalty is 15 cycles

then predict failure wastes 8.5% of cycles.
Simplest fix is to lower the miss penalty.
Shorten the pipeline!
The Mill pipeline is five cycles, not 15, so a Mill misprediction wastes only 3% of cycles.
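The arithmetic behind these percentages can be checked with a short sketch. This is a hedged model, assuming one cycle per instruction plus the stated penalty per mispredicted branch:

```python
def mispredict_waste(branch_rate, hit_rate, penalty_cycles):
    """Fraction of all cycles lost to branch mispredictions, assuming
    one cycle per instruction plus the penalty for each missed guess."""
    penalty_per_inst = branch_rate * (1.0 - hit_rate) * penalty_cycles
    return penalty_per_inst / (1.0 + penalty_per_inst)

# Conventional 15-cycle penalty vs. the Mill's 5-cycle penalty.
conventional = mispredict_waste(1 / 8, 0.95, 15)  # ~8.6%; the slide rounds to 8.5%
mill = mispredict_waste(1 / 8, 0.95, 5)           # ~3%
```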
The catch - cold code
The guess is based on prior history with the branch. What happens if there is no prior history?

Cold code means a random 50-50 guess.

In cold code:
• one instruction in eight is a branch
• the predictor guesses right 50% of the time
• the mispredict penalty is 15 cycles

so predict failure wastes 48% of cycles (23% on a Mill).
Ouch!
But wait – it gets worse!
Cold code means no relevant Branch History contents.
It also means no relevant cache contents.
[diagram: the pipeline (cache - decode - schedule - execute), backed by the branch history table and DRAM]

A mispredicted branch costs 15 cycles; an instruction fetch that must go to DRAM costs 300+ cycles.
Miss cost in cold code
In cold code, when:
• one instruction in eight is a branch
• the predictor guesses right 50% of the time
• the mispredict penalty is 15 cycles
• the cache miss penalty is 300 cycles
• a cache line is 64 bytes, i.e. 16 instructions

cold misses waste 96% of cycles (94% on a Mill).
Ouch!
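One plausible back-of-envelope model for these figures follows. The talk does not spell out its exact accounting; this version simply charges a full 300-cycle miss per 16-instruction line on top of the mispredict cost, and lands within a point or two of the quoted numbers:

```python
def cold_waste(branch_rate, hit_rate, mispredict_penalty,
               miss_penalty, insts_per_line):
    """Fraction of cycles wasted in cold code: half the guesses fail,
    and every cache line must be fetched from DRAM."""
    per_inst = (branch_rate * (1.0 - hit_rate) * mispredict_penalty
                + miss_penalty / insts_per_line)
    return per_inst / (1.0 + per_inst)

print(round(cold_waste(1 / 8, 0.5, 15, 300, 16), 3))  # conventional
print(round(cold_waste(1 / 8, 0.5, 5, 300, 16), 3))   # Mill
```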
What to do?
Use bigger cache lines? Internal fragmentation means no gain.
Fetch more lines per miss? Cache thrashing means no gain.
Nothing technical works very well.
What to do?
Choose short benchmarks! No problem when the benchmark is only a thousand instructions.
Blame the software! Code bloat is a software vendor problem, not a CPU problem.
Blame the memory vendor! Memory speed is a memory vendor problem, not a CPU problem.
This approach works.
(for some value of “works”)
Fundamental problems
Don’t know how much to load from DRAM. The Mill knows how much will execute.
Can’t spot branches until loaded and decoded. The Mill knows where branches are, in unseen code.
Can’t predict spotted branches without history. The Mill can predict in never-executed code.
The rest of the talk shows how the Mill does this.
Extended Basic Blocks (EBBs)
[diagram: the program counter follows a branch into an EBB, then along a chain of EBBs]

The Mill groups code into Extended Basic Blocks (EBBs): single-entry, multiple-exit sequences of instructions.
Branches can only target EBB entry points; it is not possible to jump into the middle of an EBB.
Execution flows through a chain of EBBs.
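The entry-point rule can be sketched as a toy model (addresses, lengths, and exits are invented for illustration):

```python
# Toy model of the EBB rule: control transfers may target only EBB
# entry points, never an instruction in the middle of an EBB.
ebbs = {
    0x100: {"length": 6, "exits": [0x200, 0x300]},  # single entry, two possible exits
    0x200: {"length": 4, "exits": [0x300]},
    0x300: {"length": 3, "exits": []},
}

def valid_branch_target(addr):
    # Legal if and only if addr is some EBB's entry point.
    return addr in ebbs

assert valid_branch_target(0x200)      # an entry point: fine
assert not valid_branch_target(0x102)  # mid-EBB: not a legal target
```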
Predicting EBBs
With an EBB organization, you don’t have to predict each branch. Only one of the possibly many branches will pass control out of the EBB - so predict which one.

If control enters here, predict that control will exit there.

The Mill predicts exits, not branches.
Representing exits

Code is sequential in memory, and is held in cache lines, which are also sequential.
Representing exits
There is one EBB entry point and one predicted exit point; the prediction is represented as the difference between them.
Representing exits
Rather than a byte or instruction count, the Mill predicts:
• the number of cache lines to the exit (here, line count 2)
• the number of instructions in the last line (here, inst count 3)
Representing exits
Predictions contain:
• the line count (2) and inst count (3), as above
• the offset of the transfer target from the entry point (0xabcd)
• the kind of transfer: jump, return, inner call, or outer call
“When we enter the EBB: fetch two lines, decode from the entry through the third instruction in the second line, and then jump to (entry+0xabcd)”
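Collected into one record, a prediction might look like this sketch (the field names and layout are hypothetical; the talk gives only the contents):

```python
from dataclasses import dataclass

@dataclass
class ExitPrediction:
    line_count: int     # cache lines to fetch for this EBB
    inst_count: int     # instructions to decode in the last line
    target_offset: int  # transfer target, relative to the EBB entry
    kind: str           # "jump", "return", "inner call", or "outer call"

    def next_entry(self, entry_addr):
        """Where execution goes when the predicted exit is taken."""
        return entry_addr + self.target_offset

p = ExitPrediction(line_count=2, inst_count=3, target_offset=0xabcd, kind="jump")
assert p.next_entry(0x1000) == 0x1000 + 0xabcd
```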
The Exit Table
[example prediction: line count 2, inst count 3, target 0xabcd, kind jump]
The Exit Table
Predictions are stored in the hardware Exit Table.

The Exit Table:
• is direct-mapped, with victim buffers
• is keyed by the EBB entry address and history info
• has check bits to detect collisions
• can use any history-based algorithm

Capacity varies by Mill family member.
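A minimal sketch of how check bits catch collisions in a direct-mapped table follows. The table size is illustrative, and the mixing of history info into the key is simplified away:

```python
TABLE_SIZE = 1024   # illustrative; real capacity varies by family member
CHECK_MASK = 0xFF   # a few high key bits stored to detect collisions

table = [None] * TABLE_SIZE   # each slot holds (check_bits, prediction)

def _index(key):
    return key % TABLE_SIZE, (key // TABLE_SIZE) & CHECK_MASK

def store(entry_addr, prediction):
    slot, check = _index(entry_addr)
    table[slot] = (check, prediction)

def lookup(entry_addr):
    slot, check = _index(entry_addr)
    hit = table[slot]
    if hit is not None and hit[0] == check:
        return hit[1]
    return None   # a miss, or a collision caught by the check bits

store(0x4123, "pred-A")
assert lookup(0x4123) == "pred-A"
assert lookup(0x8123) is None   # same slot, different check bits: detected
```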
Exit chains
Starting with an entry point, the Mill can chain through successive predictions without actually looking at the code.
Probe the Exit Table, using the entry address (123) as the key, returning the keyed prediction (offset +17).
Exit chains

Add the offset to the EBB entry address to get the next EBB entry address: 123 + 17 = 140.
Exit chains

Rinse and repeat: 140 + (-42) = 98, and so on.

Repeat until:
• there is no prediction in the table
• an entry is seen again (a loop)
• the chain has gone as far as you wanted
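The chaining walk on the last three slides can be sketched directly, using the offsets shown (entry 123 with predicted exit offset +17, entry 140 with offset -42):

```python
exit_table = {123: 17, 140: -42}   # entry address -> predicted exit offset

def chain(entry, max_steps=8):
    """Follow predictions until there is no entry, a loop, or the limit."""
    path, seen = [entry], {entry}
    while entry in exit_table and len(path) < max_steps:
        entry += exit_table[entry]   # next EBB entry address
        if entry in seen:            # seen before: a loop, stop chaining
            break
        path.append(entry)
        seen.add(entry)
    return path

assert chain(123) == [123, 140, 98]   # 123 + 17 = 140; 140 - 42 = 98; 98 misses
```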
Prefetch
Predictions chained from the Exit Table are handed to the Prefetcher, which prefetches the predicted lines from the cache/DRAM hierarchy.

A prefetch cannot fault or trap; instead, it stops the chaining.
Prefetches are low priority, using idle cycles to memory.
The Prediction Cache
After prefetch, chained predictions are stored in the Prediction Cache.

The Prediction Cache is small, fast, and fully associative.
Chaining from the Exit Table stops if a prediction is found to be already in the Prediction Cache, typically indicating a loop.
Chaining continues in the cache, possibly looping; a miss resumes from the Exit Table.
The Fetcher
Predictions are chained from the Prediction Cache (following loops) on to the Fetcher.
The Fetcher
Lines are fetched from the regular cache hierarchy into a microcache attached to the decoder.
The Decoder
Prediction chains end at the Decoder, which also receives a stream of the corresponding cache lines from the Microcache.
The result is that the Decoder has a queue of predictions, and another queue of the matching cache lines, that are kept continuously full and available. It can decode down the predicted path at the full 30+ instructions per cycle speed.
Timing
[timing diagram: Exit Table (3 cycles) -> Prediction Cache (2 cycles) -> Prefetcher/Fetcher (2 cycles) -> Microcache/Decoder (2 cycles); part of this path is the mispredict penalty; vertically aligned units work in parallel]
Once started, the predictor can sustain one prediction every three cycles from the Exit Table.
Fundamental problems redux
Don’t know how much to load from DRAM. The Mill knows how much will execute.
Can’t spot branches until loaded and decoded. The Mill knows where branches are, in unseen code.
Can’t predict spotted branches without history. The Mill can predict in never-executed code.
Prediction feedback
All predictors use feedback from execution experience to alter predictions to track changing program behavior.
[feedback path: Execute -> Exit Table]

If a prediction was wrong, then it can be changed to predict what actually did happen.
The Exit Table contents reflect the current history for all contained predictions.
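In its simplest form, feedback is just an overwrite. The slide only says a wrong prediction "can be changed"; real predictors may add hysteresis, and the table contents here are invented:

```python
exit_table = {0x100: 0x40}   # entry address -> predicted exit offset (made up)

def report_exit(entry_addr, actual_offset):
    """Execution feedback: make the table say what actually happened."""
    if exit_table.get(entry_addr) != actual_offset:
        exit_table[entry_addr] = actual_offset

report_exit(0x100, 0x80)           # the prediction was wrong...
assert exit_table[0x100] == 0x80   # ...so now it predicts what happened
```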
“All contained predictions”?
Not one prediction for each EBB in the program? No!
The tables are much too small to hold predictions for all EBBs.

In a conventional branch predictor, each prediction is built up over time with increasing experience of the particular branch.

But if the CPU is switched to another process, the predictions are thrown away and overwritten. Every process switch is followed by a period of poor predictions while experience is built up again.
A second source of predictions
Like others, the Mill builds predictions from experience.
However, it has a second source: the program load module.
[diagram: the program load module (code, static data, predictions) feeding the Exit Table, which feeds decode]

The load module is used when there is no experience: missing predictions are read from the load module.
But there’s a catch…
Loading a prediction from DRAM (or even L2 cache) takes much longer than a mispredict penalty!
By the time it’s loaded we no longer need it!
Solution: load bunches of likely-needed predictions
But – what predictions are likely-needed?
Likely-needed predictions
Should we load on a misprediction? No. We have a prediction - it's just wrong.
Should we load on a missing prediction? No. It may be only a rarely-taken path that aged out of the table.
We should bulk-load only when entering a whole new region of program activity that we haven't been to before (or not recently), and may stay in for a while, or re-enter.
Like a function.
Likely-needed predictions
The Mill bulk-loads the predictions of a function when the call finds no prediction for the entry EBB.
    int main() {
        phase1();
        phase2();
        phase3();
        return 0;
    }

Each call triggers loading of the predictions for the code of that function.
Program phase-change
At a phase change (or on re-entering code that was swapped out long enough ago):

1. Recognize when a chain or misprediction leads to a call for which there is no Exit Table entry.
2. Bulk-load the predictions for the function.
3. Start the prediction chain in the called function.
4. Chaining will prefetch the predicted code path.
5. Execute as fast as the code comes in.

Overall delay: one load time for the first predictions, plus one load time for the initial code prefetch - two loads total; everything after that happens in parallel.

Vs. conventional: one code load time per branch.
Where does the load module get its predictions?

The compiler can perfectly predict EBBs that contain no conditional branches: their calls, returns and jumps are known statically.
A profiler can measure conditional behavior. But instrumenting the load module changes the behavior.
So the Mill does it for you.
The Exit Table hardware logs experience with predictions. Post-processing of the log updates the load module. Log info is available for JITs and optimizers.
Mill programs get faster every time they run.
The fine print
Newly-compiled predictions assume every EBB will execute to the final transfer. This policy causes all cache lines of the EBB to be prefetched, improving performance at the expense of loading unused lines. Later experience corrects the line counts.
When experience shows that an EBB in a function is almost never entered (often error code) then it is omitted from the bulk load list, saving Exit Table space and memory traffic.
Fundamental problem summary
Don’t know how much to load from DRAM.
Mill knows how much will execute.
Can’t spot branches until loaded and decoded.
Mill knows where the exits are.
Can’t predict spotted branches without history.
Mill can predict in never-executed code.
Mill programs get faster every time they run.