Computer Architecture 2008 – Advanced Topics 1
Computer Architecture
Advanced Topics
Pentium® M Processor
Intel’s first processor designed for mobility
– Achieve best performance at given power and thermal constraints
– Achieve longest battery life
                Banias                  Dothan
Transistors     77M                     140M
Process         130 nm                  90 nm
Die size        84 mm²                  85 mm²
Peak power      24.5 W                  21 W
Frequency       1.7 GHz                 2.1 GHz
L1 cache        32KB I$ + 32KB D$       32KB I$ + 32KB D$
L2 cache        1MB                     2MB

Performance per Watt
Mobile’s smaller form factor decreases the power budget
– Power generates heat, which must be dissipated to keep transistors within the allowed temperature
– Limits the processor’s peak power consumption
Change the target
– Old target: get max performance
– New target: get max performance at a given power envelope
    Performance per Watt
Performance via frequency increase
– Power = C·V²·f, but increasing f also requires increasing V
– X% performance costs 3X% power
    Assume performance is linear with frequency
A power-efficient feature: better than 1:3 performance : power
– Otherwise it is better to just increase frequency
– All Banias u-arch features (aimed at performance) are power efficient
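
The 1:3 rule above can be checked with a few lines of arithmetic. The sketch below assumes that voltage must scale in proportion to frequency (a simplification, not a statement from the slides), so that Power = C·V²·f grows with the cube of frequency. The function names are illustrative.

```python
def relative_power(freq_ratio):
    """Relative power when frequency is scaled by freq_ratio.

    Assumes P = C * V^2 * f and that V is scaled in proportion to f
    (an assumption for illustration), so power grows as f cubed.
    """
    voltage_ratio = freq_ratio  # assumption: V tracks f linearly
    return voltage_ratio ** 2 * freq_ratio


def power_cost_percent(perf_gain_percent):
    """Power increase (%) for a performance gain (%), assuming
    performance is linear with frequency."""
    freq_ratio = 1.0 + perf_gain_percent / 100.0
    return (relative_power(freq_ratio) - 1.0) * 100.0


for gain in (1, 5, 10):
    print(f"{gain}% performance -> {power_cost_percent(gain):.1f}% power")
```

For small gains the cost is close to 3X%; for larger gains the cubic term makes it somewhat higher, which is why a feature must beat 1:3 to be preferable to a frequency increase.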

Higher Performance vs. Longer Battery Life
Processor average power is <10% of the platform
– The processor reduces power in periods of low processor activity
– The processor enters lower power states in idle periods
Average power includes low-activity periods and idle time
– Typical: 1W – 3W
Max power is limited by heat dissipation
– Typical: 20W – 100W
Decision
– Optimize for performance when active
– Optimize for battery life when idle
[Figure: platform power breakdown (pie chart)]
Display (panel + inverter)  33%
CPU                         10%
Power supply                10%
Intel® MCH                   9%
Misc.                        8%
GFX                          8%
HDD                          8%
CLK                          5%
Intel® ICH                   3%
DVD                          2%
LAN                          2%
Fan                          2%

Static Power
The power consumed by a processor consists of
– Active power: used to switch transistors
– Static power: leakage of transistors under voltage
Static power is a function of
– Number of transistors and their type
– Operating voltage
– Die temperature
Leakage is growing dramatically in new process technologies
Pentium® M reduces static power consumption
– The L2 cache is built with low-leakage transistors (2/3 of the die transistors)
    Low-leakage transistors are slower, increasing cache access latency
    The significant power saved justifies the small performance loss
– Enhanced SpeedStep® technology
    Reduces voltage and temperature on low processor activity

Less is More
Fewer instructions per task
– Advanced branch prediction reduces the number of wrong instructions executed
– SSE instructions reduce the number of instructions architecturally
Fewer uops per instruction
– Uop fusion
– Dedicated stack engine
Fewer transistor switches per micro-op
– Efficient bus
– Various lower-level optimizations
Less energy per transistor switch
– Enhanced SpeedStep® technology
Power-awareness top to bottom

Improved Branch Predictor
Pentium® M employs best-in-class branch prediction
– Bimodal predictor, Global predictor, Loop detector
– Indirect branch predictor
Reduces number of wrong instructions executed
– Saves energy spent executing wrong instructions
Loop predictor
– Analyzes branches for loop behavior
    Moving in one direction (taken or NT) a fixed number of times
    Ended with a single movement in the opposite direction
– Detect exact loop count
– Loop predicted accurately
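[Figure: loop detector entry with Count, Limit and Prediction fields; Count is incremented (+1) or reset (0) and compared (=) against Limit]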
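
The loop-detector behavior described above can be sketched as a small counter-based predictor. This is a toy model for a single branch, not the Pentium M implementation: the class name, the `body_direction` parameter and the unbounded counter are all assumptions made for illustration.

```python
class LoopPredictor:
    """Toy loop predictor for one branch (illustrative only)."""

    def __init__(self, body_direction=True):
        # Direction the branch takes on every iteration except the last.
        self.body_direction = body_direction
        self.count = 0      # consecutive body-direction outcomes seen so far
        self.limit = None   # learned trip count; None until the first loop exit

    def predict(self):
        # Predict the exit direction once Count reaches the learned Limit.
        if self.limit is not None and self.count == self.limit:
            return not self.body_direction
        return self.body_direction

    def update(self, taken):
        if taken == self.body_direction:
            self.count += 1
        else:
            self.limit = self.count  # iterations that preceded the exit
            self.count = 0           # start counting the next loop execution


def count_mispredictions(predictor, outcomes):
    """Run the predictor over a sequence of branch outcomes."""
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop whose branch is taken 5 times and then falls through, run 3 times.
outcomes = ([True] * 5 + [False]) * 3
print(count_mispredictions(LoopPredictor(), outcomes))
```

In this model only the first loop exit is mispredicted; once the limit is learned, later executions of the same loop are predicted exactly.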

Indirect Branch Predictor
Indirect jumps are widely used in object-oriented code (C++, Java)
Targets are data dependent
– Resolved at execution, hence a high misprediction penalty
Initially, allocate indirect branch only in target array (TA)
– If TA mispredicts, allocate in iTA according to global history
    Multiple targets allocated for a given branch
– Indirects with a single target are predicted by TA, saving iTA space
Use iTA if TA indicates indirect branch + iTA hits
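[Figure: the branch IP indexes the Target Array (TA); the branch IP together with global history indexes the iTA. The predicted target is taken from the iTA when the TA indicates an indirect branch and the iTA hits, otherwise from the TA]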
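
The TA/iTA allocation policy above can be sketched as follows. This is a simplified model under stated assumptions: both structures are unbounded Python dictionaries rather than finite tagged arrays, the global history is passed in as an opaque value, and the class and method names are hypothetical.

```python
class IndirectPredictor:
    """Toy two-level indirect branch predictor (illustrative only)."""

    def __init__(self):
        self.ta = {}    # branch IP -> target (one target per branch)
        self.ita = {}   # (branch IP, global history) -> target

    def predict(self, ip, history):
        if ip not in self.ta:
            return None  # TA miss: no prediction
        # Use the iTA only when it hits; otherwise fall back to the TA.
        return self.ita.get((ip, history), self.ta[ip])

    def update(self, ip, history, actual_target):
        predicted = self.predict(ip, history)
        if ip not in self.ta:
            # First encounter: allocate in the TA only.
            self.ta[ip] = actual_target
        elif predicted != actual_target:
            # Misprediction: allocate in the iTA under the current history.
            self.ita[(ip, history)] = actual_target


pred = IndirectPredictor()
# A branch with a single target never consumes iTA space.
for _ in range(3):
    pred.update(0x10, "A", 100)
# A branch whose target depends on history gets one iTA entry per extra target.
pred.update(0x20, "A", 1)
pred.update(0x20, "B", 2)
print(pred.predict(0x20, "A"), pred.predict(0x20, "B"), len(pred.ita))
```

The point of the design is visible in the example: single-target indirect branches are served by the TA alone, and iTA entries are spent only on branches that actually mispredict.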

Dedicated Stack Engine
PUSH, POP, CALL, RET update ESP (add or sub an offset)
– Without the stack engine, each update uses a dedicated add uop
Track the ESP offset at the front end
– ID maintains the offset in ESP_delta (+/- Osize)
– Eliminates the need for uops updating ESP
– Patch displacements of stack operations
In some cases, the actual ESP value is needed
– For example: add eax, esp, 3
– A sync uop is inserted before the instruction if ESP_delta != 0
    ESP = ESP + ESP_delta
– Reset ESP_delta
ESP_delta recovered on jump misprediction

ESP Tracking Example
Instruction   Without stack engine    With stack engine         ESP_delta (Δ)
(start)                                                         Δ = 0
PUSH eax      ESP = ESP - 4           STORE [ESP-4], EAX        Δ = Δ - 4 = -4
              STORE [ESP], EAX
PUSH ebx      ESP = ESP - 4           STORE [ESP-8], EBX        Δ = Δ - 4 = -8
              STORE [ESP], EBX
INC eax       EAX = ADD EAX, 1        EAX = ADD EAX, 1          Δ = -8
INC esp       ESP = ADD ESP, 1        ESP = SUB ESP, 8 (sync)   Δ = 0
                                      ESP = ADD ESP, 1          Δ = 0
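
The ESP tracking example can be reproduced with a short simulation of the front-end bookkeeping. This is a sketch, not the real decoder: it handles only PUSH, POP and INC, it uses a fixed operand size of 4 bytes, and the function names and uop text format are invented for illustration.

```python
def stack_addr(delta):
    """Format a stack address with the displacement patched by ESP_delta."""
    return "[ESP]" if delta == 0 else f"[ESP{delta:+d}]"


def translate(instructions, osize=4):
    """Translate (op, reg) pairs into uops while tracking ESP_delta.

    Supports only 'push', 'pop' and 'inc' (an illustrative subset).
    Returns the uop list and the final ESP_delta.
    """
    delta = 0
    uops = []
    for op, reg in instructions:
        if op == "push":
            delta -= osize  # tracked at the front end, no ESP-update uop
            uops.append(f"STORE {stack_addr(delta)}, {reg}")
        elif op == "pop":
            uops.append(f"{reg} = LOAD {stack_addr(delta)}")
            delta += osize
        elif op == "inc":
            if reg == "ESP" and delta != 0:
                # The actual ESP value is needed: insert a sync uop first.
                sync = "SUB" if delta < 0 else "ADD"
                uops.append(f"ESP = {sync} ESP, {abs(delta)}")
                delta = 0
            uops.append(f"{reg} = ADD {reg}, 1")
        else:
            raise ValueError(f"unsupported op: {op}")
    return uops, delta


program = [("push", "EAX"), ("push", "EBX"), ("inc", "EAX"), ("inc", "ESP")]
uops, delta = translate(program)
for uop in uops:
    print(uop)
```

The four instructions produce five uops here, against six in the conventional translation, and the only ESP update is the sync uop forced by the explicit use of ESP.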

Uop Fusion
The Instruction Decoder breaks an instruction into uops
– A conventional uop consists of a single operation operating on two sources
An instruction requires multiple uops when
– the instruction operates on more than two sources, or
– the nature of the operation requires a sequence of operations
Uop fusion: in some cases the decoder fuses 2 uops into one uop
– A short field is added to the uop to support fusing of specific uop pairs
Uop fusion reduces the number of uops by 10%
– Increases performance by effectively widening rename and retire bandwidth
– More instructions can be decoded by all decoders
The same task is accomplished by processing fewer uops
– Decreases the energy required to complete a given task

A 2-uop Load-Op
[Figure: the decoder splits add eax,[ebp+4*esi+8] into separate LD and OP uops; the scheduler dispatches LD to the MEU and OP to the ALU]
    LD: tmp = load[ebp+4*esi+8]
    OP: eax = eax + tmp
Load-op with 3 register operands
Decoded into 2 uops
– LD: read data from mem
– OP: reg ← reg op data
The LD and OP are inherently serial
– OP dispatched only when LD completes

A 1-uop Load-Op
[Figure: the decoder emits add eax,[ebp+4*esi+8] as a single fused LD + OP uop; the scheduler dispatches the LD part to the cache and then the OP part to the ALU]
    eax = eax + load[ebp+4*esi+8]
Decoded into 1 uop
– The fused uop has a 3rd source: a new field in the uop holds the index register
– Increases decode BW
– Increases alloc BW and ROB/RS effective size
Dispatched twice
– OP dispatched after LD
The fused uop retires after both LD and OP complete
– Increases retire BW
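
The difference between the 2-uop and 1-uop decodings can be shown with a toy decoder. The instruction encoding below (an `(op, dest, src)` tuple, with a bracketed `src` marking a memory operand) and the textual uop format are assumptions made for this sketch, not the processor's internal representation.

```python
def decode(instructions, fuse=False):
    """Decode (op, dest, src) instructions into a list of uops.

    A src written as "[...]" is a memory operand, making the
    instruction a load-op. With fuse=True a load-op becomes one
    fused uop; otherwise it becomes a LD uop followed by an OP uop.
    """
    uops = []
    for op, dest, src in instructions:
        if src.startswith("["):
            if fuse:
                uops.append(f"{dest} = {op}({dest}, load{src})")
            else:
                uops.append(f"tmp = load{src}")
                uops.append(f"{dest} = {op}({dest}, tmp)")
        else:
            uops.append(f"{dest} = {op}({dest}, {src})")
    return uops


program = [("add", "eax", "[ebp+4*esi+8]"), ("sub", "ebx", "ecx")]
print(len(decode(program)), "uops without fusion")
print(len(decode(program, fuse=True)), "uops with fusion")
```

Only the uop count seen by decode, allocation and retirement changes; the fused uop is still dispatched as a load followed by the operation.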

Enhanced SpeedStep™ Technology
The “Basic” SpeedStep™ Technology had
– 2 operating points
– Non-transparent switch
The “Enhanced” version provides
– Multiple voltage/frequency operating points
    The Pentium M processor 1.6GHz operation ranges from 600MHz @ 0.956V to 1.6GHz @ 1.484V
– Transparent switch
– Frequent switches
Benefits
– Higher power efficiency
    2.7X lower frequency: 2X performance loss, >2X energy gain
– Outstanding battery life
– Excellent thermal management
[Figure: Voltage, Frequency, Power. Frequency (GHz) and typical power (Watts) plotted against voltage (0.8 V to 1.6 V). Across the operating range frequency changes by 2.7X and power by 6.1X; efficiency ratio = 2.3]
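
The ratios in the figure can be approximated from the two operating points quoted above. The sketch below uses the dynamic-power model Power = C·V²·f from earlier in the deck and ignores static power, so it is an estimate under that assumption; the figure's 6.1X and 2.3 are typical measured values and differ slightly.

```python
def dynamic_power_ratio(v_hi, f_hi, v_lo, f_lo):
    """Ratio of dynamic power between two operating points,
    using P = C * V^2 * f (the capacitance C cancels out)."""
    return (v_hi / v_lo) ** 2 * (f_hi / f_lo)


# Operating points of the Pentium M 1.6GHz quoted in the text.
V_HI, F_HI = 1.484, 1.6   # volts, GHz
V_LO, F_LO = 0.956, 0.6   # volts, GHz

freq_ratio = F_HI / F_LO
power_ratio = dynamic_power_ratio(V_HI, F_HI, V_LO, F_LO)
# Power drops faster than frequency; the quotient is the efficiency gain.
efficiency_ratio = power_ratio / freq_ratio

print(f"frequency ratio  {freq_ratio:.2f}X")
print(f"power ratio      {power_ratio:.2f}X")
print(f"efficiency ratio {efficiency_ratio:.2f}")
```

Under this model the efficiency ratio reduces to the square of the voltage ratio, which is why lowering voltage together with frequency saves energy per task and not just power.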
Trace Cache (Pentium® 4 Processor)

Trace Cache
Decoding several IA-32 instructions per clock at high frequency is difficult
– Instructions have variable length and many different options
– Takes several pipe stages
    Adds to the branch misprediction penalty
Trace cache: caches uops of previously decoded instructions
– Decoding is only needed for instructions that miss the TC
The TC is the primary (L1) instruction cache
– Holds 12K uops
– 8-way set associative with LRU replacement
The TC has its own branch predictor (Trace BTB)
– Predicts branches that hit in the TC
– Directs where instruction fetching needs to go next in the TC

Traces
Instruction cache fetch bandwidth is limited to a basic block
– Cannot provide instructions across a taken branch in the same cycle
The TC builds traces: program-ordered sequences of uops
– Allows the target of a branch to be included in the same TC line as the branch itself
Traces have variable length
– Broken into trace lines, six uops per trace line
– There can be many trace lines in a single trace
[Figure: trace lines containing jmp uops, with arrows labeled "Jump into the line" and "Jump out of the line"]
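
Breaking a trace into trace lines can be sketched in a few lines. The input is assumed to be the uops in executed (program) order, so a taken jump and its target are simply adjacent entries; the function name and the string representation of uops are illustrative.

```python
UOPS_PER_LINE = 6  # from the text: six uops per trace line


def build_trace_lines(trace):
    """Split a program-ordered uop sequence into fixed-size trace lines.

    The final line may hold fewer than UOPS_PER_LINE uops.
    """
    return [trace[i:i + UOPS_PER_LINE]
            for i in range(0, len(trace), UOPS_PER_LINE)]


# Executed order: two uops, a taken jump, then the uops at the jump target.
trace = ["a0", "a1", "jmp T", "t0", "t1", "t2", "t3", "t4"]
for line in build_trace_lines(trace):
    print(line)
```

The first line holds both `jmp T` and the uops at its target, which a conventional instruction cache could not deliver in the same fetch.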

Hyper-Threading Technology (Pentium® 4 Processor)
Based on “Hyper-Threading Technology Architecture and Micro-architecture”, Intel Technology Journal

Thread-Level Parallelism
Multiprocessor systems have been used for many years
– There are known techniques to exploit multiprocessors
Software trends
– Applications consist of multiple threads or processes that can be executed in parallel on multiple processors
Thread-level parallelism (TLP): threads can be from
– the same application
– different applications running simultaneously
– operating system services
Increasing single-thread performance becomes harder
– and is less and less power efficient
Chip Multi-Processing (CMP)
– Two (or more) processors are put on a single die

Multi-Threading
Multi-threading: a single processor executes multiple threads
Time-slice multithreading
– The processor switches between software threads after a fixed period
– Can effectively minimize the effects of long latencies to memory
Switch-on-event multithreading
– Switch threads on long-latency events such as cache misses
– Works well for server applications that have many cache misses
A deficiency of both time-slice MT and switch-on-event MT
– They do not cover for branch mispredictions and long dependencies
Simultaneous multi-threading (SMT)
– Multiple threads execute on a single processor simultaneously without switching
– Makes the most effective use of processor resources
    Maximizes performance vs. transistor count and power

Hyper-Threading (HT) Technology
HT is SMT
– Makes a single processor appear as 2 logical processors (threads)
Each thread keeps its own architectural state
– General-purpose registers
– Control and machine state registers
Each thread has its own interrupt controller
– Interrupts sent to a specific logical processor are handled only by it
OS views logical processors (threads) as physical processors
– Schedules threads to logical processors as in a multiprocessor system
From a micro-architecture perspective
– Threads share a single set of physical resources
    Caches, execution units, branch predictors, control logic, and buses

Two Important Goals
When one thread is stalled, the other thread can continue to make progress
– Independent progress is ensured by either
    Partitioning buffering queues and limiting the number of entries each thread can use
    Duplicating buffering queues
A single active thread running on a processor with HT runs at the same speed as without HT
– Partitioned resources are recombined when only one thread is active

Front End
Each thread manages its own next-instruction pointer
Threads arbitrate TC access every cycle (ping-pong)
– If both want to access the TC, access is granted in alternating cycles
– If one thread is stalled, the other thread gets the full TC bandwidth
TC entries are tagged with thread ID
– Dynamically allocated as needed
– Allows one logical processor to have more entries than the other
[Figure: front-end pipeline for a TC hit and for a TC miss]

Front End (cont.)
Branch prediction structures are either duplicated or shared
– The return stack buffer is duplicated
– Global history is tracked for each thread
– The large global history array is shared
    Entries are tagged with a logical processor ID
Each thread has its own ITLB
Both threads share the same decoder logic
– If only one needs the decode logic, it gets the full decode bandwidth
– The state needed by the decoder is duplicated
Uop queue is hard partitioned
– Allows both logical processors to make independent forward progress regardless of FE stalls (e.g., TC miss) or EXE stalls

Out-of-order Execution
ROB and MOB are hard partitioned
– Enforces fairness and prevents deadlocks
Allocator ping-pongs between the threads
– A thread is selected for allocation if
    Its uop queue is not empty
    Its buffers (ROB, RS) are not full
    It is the thread’s turn, or the other thread cannot be selected
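
The allocator's selection rule can be written as a small function. This is a sketch for two threads: the `ThreadState` fields and the rule that the turn passes to the thread that was not just selected are assumptions chosen to give ping-pong behavior, not details taken from the slides.

```python
from dataclasses import dataclass


@dataclass
class ThreadState:
    uop_queue_len: int   # uops waiting in this thread's uop queue
    buffers_full: bool   # True if the thread's ROB/RS share is exhausted

    def eligible(self):
        # Conditions from the text: uop queue not empty, buffers not full.
        return self.uop_queue_len > 0 and not self.buffers_full


def select_thread(threads, turn):
    """Pick the thread to allocate for this cycle.

    Returns (selected index or None, turn for the next cycle).
    The thread whose turn it is goes first; the other thread is
    chosen only if the first cannot be selected.
    """
    for candidate in (turn, 1 - turn):
        if threads[candidate].eligible():
            return candidate, 1 - candidate
    return None, turn


threads = [ThreadState(3, False), ThreadState(3, False)]
turn, picks = 0, []
for _ in range(4):
    selected, turn = select_thread(threads, turn)
    picks.append(selected)
print(picks)
```

When both threads are eligible the picks alternate; when one is stalled, the other is selected every cycle and receives the full allocation bandwidth.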

Out-of-order Execution (cont.)
Registers are renamed to a shared physical register pool
– Store results until retirement
After allocation and renaming, uops are placed in one of 2 queues
– Memory instruction queue and general instruction queue
    The two queues are hard partitioned
– Uops are read from the queues and sent to the schedulers using ping-pong
The schedulers are oblivious to threads
– Schedule uops based on dependencies and execution resource availability
    Regardless of their thread
– Uops from the two threads can be dispatched in the same cycle
– To avoid deadlock and ensure fairness
    Limit the number of active entries a thread can have in each scheduler’s queue
Forwarding logic compares physical register numbers
– Forwards results to other uops without thread knowledge

Out-of-order Execution (cont.)
The memory subsystem is largely thread oblivious
– L1 data cache, L2 cache, and L3 cache are thread oblivious
    All use physical addresses
– DTLB is shared
    Each DTLB entry includes a thread ID as part of the tag
Retirement ping-pongs between threads
– If one thread is not ready to retire uops, all retirement bandwidth is dedicated to the other thread

Single-task And Multi-task Modes
MT-mode (Multi-task mode)
– Two active threads, with some resources partitioned as described earlier
ST-mode (Single-task mode)
– There are two flavors of ST-mode
    single-task thread 0 (ST0): only thread 0 is active
    single-task thread 1 (ST1): only thread 1 is active
– Resources that were partitioned in MT-mode are recombined to give the single active logical processor use of all of the resources
Moving the processor between modes
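[Figure: mode state diagram with states ST0, ST1, MT and Low Power]
– MT → ST0: thread 1 executes HALT
– MT → ST1: thread 0 executes HALT
– ST0 → Low Power: thread 0 executes HALT
– ST1 → Low Power: thread 1 executes HALT
– Interrupt: reverse transitions (wakes a halted logical processor)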
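
The mode transitions can be modeled by tracking which logical processors are active and deriving the mode from that set. This is a sketch: the function names are hypothetical, and treating an interrupt as waking exactly the logical processor it targets is an assumption about the "Interrupt" arcs in the diagram.

```python
def mode(active):
    """Map the set of active logical processors to a mode name."""
    if active == {0, 1}:
        return "MT"
    if active == {0}:
        return "ST0"
    if active == {1}:
        return "ST1"
    return "LowPower"


def halt(active, thread):
    """The given thread executes HALT and becomes inactive."""
    return active - {thread}


def interrupt(active, thread):
    """An interrupt targeted at the given thread wakes it (assumption)."""
    return active | {thread}


state = {0, 1}
print(mode(state))                 # both threads active
state = halt(state, 1)
print(mode(state))                 # only thread 0 active
state = halt(state, 0)
print(mode(state))                 # no thread active
state = interrupt(state, 1)
print(mode(state))                 # thread 1 woken
```

Deriving the mode from the active set keeps the four HALT transitions and their reverse interrupt transitions consistent without listing each arc by hand.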

Operating System And Applications
An HT processor appears to the OS and application SW as 2 processors
– The OS manages logical processors as it does physical processors
The OS should implement two optimizations:
Use HALT if only one logical processor is active
– Allows the processor to transition to either the ST0 or ST1 mode
– Otherwise the OS would execute, on the idle logical processor, a sequence of instructions that repeatedly checks for work to do
– This so-called “idle loop” can consume significant execution resources that could otherwise be used by the other active logical processor
On a multi-processor system
– Schedule threads to logical processors on different physical processors before scheduling multiple threads to the same physical processor
– Allows SW threads to use different physical resources when possible
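
The second optimization amounts to an ordering of the available logical processors. The sketch below is a minimal placement routine, not an operating-system scheduler; the function name and the `(physical, logical)` slot representation are assumptions made for illustration.

```python
def assign_threads(num_threads, num_physical, logical_per_physical=2):
    """Return one (physical, logical) slot per software thread.

    Slots are ordered so that the first logical processor of every
    physical processor is used before any second logical processor.
    """
    slots = [(phys, logical)
             for logical in range(logical_per_physical)
             for phys in range(num_physical)]
    if num_threads > len(slots):
        raise ValueError("more threads than logical processors")
    return slots[:num_threads]


# Two HT-enabled physical processors, three software threads.
print(assign_threads(3, 2))
```

With two threads on two physical processors, each thread gets a processor to itself; only the third thread has to share physical resources with another.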