© 2012 IBM Corporation
POWERTM Processor Design & Methodology Directions
Joshua Friedrich – Senior Technical Staff Member, IBM Server & Technology Group
June 6, 2012
© 2012 IBM Corporation2
CMOS Microprocessor Trends, The First ~25 Years ( Good old days )Log(P
erf
orm
an
ce)
Single Thread
More active transistors, higher frequency
20051980
© 2012 IBM Corporation3
Characteristics of Single Thread Era
0
1
2
3
4
5
6
POWER4 POWER5 POWER6
Fre
qu
en
cy
(G
Hz
)
0
100
200
300
400
500
POWER4 POWER5 POWER6
TX
s p
er
co
re
Dennard Scaling
Exponential Frequency Growth
Optical Scaling / Node Migration
Expanding uArch Complexity
© 2012 IBM Corporation4
Single Thread Era Icon: POWER6
� 5+ GHz operation, >790M transistors, 341mm2 die
� 65nm SOI with 10 levels of Cu interconnect
� 2 superscalar, SMT cores with 7 instruction dispatch & 9 execution units
� 8 MB Level-2 cache + support for 32MB L3
� 2 memory controllers, Two-tier SMP Fabric
� Same pipeline depth & power at 2x frequency versus POWER5
2 MB L2
2 MB L2
2 MB L2
2 MB L2
L2 Dir
L2 Dir
Mem.Cntl.
Mem.Cntl.
SMP Fabric
L
3
CO
NTROL
LER
LSU
IFU /IDU
VMX
FXU
BF
U
DFU
RU
Core 1
© 2012 IBM Corporation5
Single Thread Era EDA: Transistor Analysis & Optimization
EinsTLT: Static timing analysis of complex circuits
EinsTuner: Transistor Level timing optimization
clkin
w2
w3
w0
fbk
Cycle
fbk = 0
evaluate precharge
fbk = 1
w2 = 11st Timing
2nd Timing
w3_int
0
200
400
600
800
1000
1200
-6 -4 -2 0 2 4 6 8
Slack (ps)# o
f p
ath
s
Pre-Tuning
Post-Tuning
© 2012 IBM Corporation6
End of an Era: The Power Wall
0.010.11
0.001
0.01
0.1
1
10
100
1000
Gate Length (microns)
Active Power
Passive Power
1994 2004
Po
wer
Den
sit
y (
W/c
m2) Air Cooling limit
Source: B. Meyerson (IBM) Source: S. Borkar (Intel)
Inability to scale Tox & lower voltage resulted in a power wall for single thread performance
© 2012 IBM Corporation7
Microprocessor TrendsLog(P
erf
orm
ance)
More active transistors, higher frequency
Single Thread
More active transistors, higher frequency
20051980
© 2012 IBM Corporation8
CMOS Scaling Continued
0%
20%
40%
60%
80%
100%
180nm 130nm 90nm 65nm 45nm 32nm
Gain by Traditional Scaling Gain by Innovation
0%
20%
40%
60%
80%
100%
180nm 130nm 90nm 65nm 45nm 32nm
Gain by Traditional Scaling Gain by Innovation
WL
BOX
DeepTrench
Cap
BL
Node
PassingWL
4.0
um
18fFStorage Node
WL
3fF BL (32 Cells)
Node
High-K Metal Gate
© 2012 IBM Corporation9
Microprocessor TrendsLog(P
erf
orm
ance)
Multi-Core
More active transistors, higher frequency
Single Thread
More active transistors, higher frequency
20051980
© 2012 IBM Corporation10
POWER Processors Began the Multi-Core / Multi-Thread Era
Power 4
2001
Introduces Dual core
Power 5
2004
Dual Core
Introduces SMT (4 threads)
Power 6
2007
Dual Core – 4 threads
Enhances SMT Efficiency
FX0
FX1
FP0
FP1
LS0
LS1
BRX
CRL
Superscalar, out of order
Cycles
FX0
FX1
FP0
FP1
LS0
LS1
BRX
CRL
2 Way SMT
FX0
FX1
FP0
FP1
LS0
LS1
BRX
CRL
2 Way SMT (Enhanced)
Thread 1 Executing
Thread 0 Executing
No Thread Executing
© 2012 IBM Corporation11
POWER7 Processor Chip
� 567mm2, 45nm SOI w/ eDRAM
� 1.2B transistors– Equivalent function of 2.7B
(eDRAM)
� Eight processor cores w/ 4 way SMT– 32 threads / chip
� 32MB on chip eDRAM shared L3
� Dual DDR3 Memory Controllers w/ 100GB/s sustained Memory bandwidth
� Scalability up to 32 Sockets– 360GB/s SMP bandwidth/chip– 20,000 coherent operations in flight
FX0FX1FP0FP1LS0LS1BRXCRL
4 Way SMT
Thread 1 Executing
Thread 0 Executing
No Thread Executing
Thread 3 Executing
Thread 2 Executing
© 2012 IBM Corporation12
Multi-core Era EDA: Power Analysis & Reduction
Unstructured Clock Buffer & Latch
Structured Clocking Construct
15% clock power reduction
s0
s1
s2
<0.14 , 0.75>
<0.13 , 0.19>
<0.14 , 0.25>
<0.14 , 0.75> <0.14 , 0.25>
BEFORE
Format: <SF,SP>
s0
s1
s2
< 0.06 , 0.06 >
<0.14 , 0.25>
<0.14 , 0.75><0.14 , 0.75>
AFTER
<0.14 , 0.75>∆Power < 0
5% data power reduction
© 2012 IBM Corporation13
Multi-core Era EDA: Cycle Simulation
0
50
100
150
200
250
300
350
Runtime Memory Use
CycleSim
EventSim
25x
10x
H a r d w a r e A c c e le r a t o r T h r o u g h p u t
0
5
1 0
1 5
2 0
2 5
3 0
3 5
4 0
4 5
5 0
W e e k s
Tri
llio
n S
im C
ycle
s
0
2 0
4 0
6 0
8 0
1 0 0
1 2 0
Min
ute
s r
eal
tim
e @
3.5
GH
z
Mostly core only model Larger chip Models
(� slower sim speed)
Introduction of next
Generation AcceleratorPrior Gen Project
High water mark
© 2012 IBM Corporation14
Multi-Core Era Limiters`
1
10
100
90 65 45 32 22 14 10
Technology Node
log
(p
erf
orm
an
ce
)
Ideal Growth
Likely Multi-Core Path
Technology complexity & rising costs
Power
SW parallelism
Socket BW
© 2012 IBM Corporation15
Multi-core evolution
Compute Throughput Potential
Multi-core evolution
Socket Throughput Limitation(Power, memory bandwidth)
Need to Amplify Effective
Socket BandwidthTo Achieve Potential
POWER’s Multi-Core Advantage
© 2012 IBM Corporation16 Multi-core evolution
Compute Throughput Potential
Socket Throughput Limitation(Power, memory bandwidth)
EDRAM = large, low power cache
High bandwidth memory buffer
Low-Power Off-Chip SignalingTechnology
Coherence Innovation to
minimize socket-to-socket communication
POWER Architecture & Technology Will Extend Multi-Core Gains
© 2012 IBM Corporation17
Microprocessor TrendsLog(P
erf
orm
ance)
More active transistors, higher frequency
Multi-Core
More active transistors, higher frequency
Single Thread
More active transistors, higher frequency
2015(?)20051980
© 2012 IBM Corporation18
Microprocessor Trends: The Next EraLog(P
erf
orm
ance) Designer-Driven Innovation
(System-level tech, Heterogeneity, HW-SW Opt.)
More active transistors, higher frequency
Multi-Core
More active transistors, higher frequency
Single Thread
More active transistors, higher frequency
2015(?)20051980
© 2012 IBM Corporation19
Innovation Era: System Level Technologies
3D Stacking with Through Silicon Vias
FPGA Accelerators Flash Memory / SSDSilicon Photonics
Single Processor–Memory Socket
© 2012 IBM Corporation20
Innovation Era: Heterogeneous Integration
uP
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
IO Hub
1/10GbE
Service Processor
PCIE
Slots
PCIE
Slots
PCIE
Slots
PCIE
Slots
PCIE
Slots
RAID Controller
Power ManagementuController
VRMs
Simple Server (2010)
uP
Single VRM
Disk
Simple Server (Future)
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
DIMM
PCIE
Slots
PCIE
Slots
PCIE
Slots
PCIE
Slots
PCIE
Slots
10/40GbE
Internal VRMs
IO Subsystem
Disk
Service Subsystem
© 2012 IBM Corporation21
Innovation Era: Hardware-Software Co-Optimization
Benefit of Specialized HW
Acceleration
0
20
40
60
80
100
Base w/ HW Accel
Encryption
Application
Profiling Hardware
Original program
Dynamically Optimized program
Software
Optimizer
ExecutionEngine
Dynamic Re-compilation with Profile Data
Specialized HW acceleration viable in many areas
•Graphics•Compression•Cryptography•More….
© 2012 IBM Corporation22
Designer Time: Technology-Driven Eras vs. Innovation Era
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Designer
Time
ValidationTools
Implement
Plan
Innovate
0%
10%
20%
30%
40%
50%
60%
70%
80%
90%
100%
Designer
Time
ValidationTools
Implement
Plan
Innovate
Technology Focused Innovation Focused
© 2012 IBM Corporation23
Manage Technology Complexity: Double Patterning
Need for Double / Triple Patterning
0
50
100
150
32nm 20nm Future
Technology Node
Pit
ch
(n
m)
Device Pitch
Single Exposure Limit
Metal Pitch
Double Patterning Limit
EUV?
© 2012 IBM Corporation24
Manage Technology Complexity: Extreme Variability
% Delay Variation vs. Device Size
0
10
20
30
40
0 2.5 5
% Delay Variation vs. Operating Voltage
0
10
20
30
40
50
60
0.5 0.7 0.9 1.1 1.3
3
Single Patterned RC Distribution
Double Patterned RC Distribution
Variability grows rapidly w/ each technology generation.
Increasing variation as devices shrink
Increasing variation as voltages drop
© 2012 IBM Corporation25
Productivity & TAT: Merge ASIC Productivity with Custom Performance
� Custom Processor Methodology
–Manual, custom design
–Tool support for Array / analog
–Transistor level analysis
–Clock mesh (low skew)
–Able to engineer wires & placement
–Small macros
� ASIC Methodology
–Synthesis Dominated
–Blackbox array / analog
–Gate level analysis
–Clock trees (high skew)
–Auto-place / auto-route only
–Large blocks
Custom performance w/ ASIC productivity
© 2012 IBM Corporation26
Productivity: Reduce Custom Design
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
# of Customs over Time
custom-like data flow alignment
91% density
Each color indicates cells associated w/ a different logical bit.
>10x reduction over 5 designs
Datapath Synthesis Results
© 2012 IBM Corporation27
Productivity: Reduce # of Design Partitions
60 logic macros, 25 customs, 14 unique arrays/RFs
1 macro, 0 customs, 9 unique arrays / RFsReduced area & power; equal cycle time
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
# of Macros over Time
Target
© 2012 IBM Corporation28
TAT: Gate-Level Analysis, Hierarchical Abstraction & Multi-Threading
0
50
hrs.
Base Cleanup Coarse
Parallelism
Hierarchical
abstracts
Multi-
threading
Projected Chip Timing Runtime
�Fast macro and global analysis tools allow designers to iterate more quickly resulting in reduced implementation time
�Gate level analysis provides orders-of-magnitude speedups over transistor level analysis
�Hierarchical abstraction & multi-threading can provide 5-10x gains in chip level analysis
Arb
itra
ry U
nit
s
Runtime Accuracy
Macro Signoff Runtime
TX Level
Gate Level
© 2012 IBM Corporation29
Productivity: Retiming
�Significant designer effort spent in optimizing cycle boundaries
�Retiming enables synthesis to optimally locate latches to balance
timing/area/power
� Invention is required to seamlessly handle verification implications
.
.
.latch
Doesn’t meet cycle time
Area/Power too high
Optimal
© 2012 IBM Corporation30
Productivity: Traditional Design Flow
HDL Specification•Function, test, pervasive, power, etc.
Synthesis
Physical Netlist
Model Build
Verification (all types)
Logic Designer
Subject Experts
requirements
© 2012 IBM Corporation31
Productivity: Aspect-Oriented Design Flow
Functional HDL
Power Spec
Pervasive Spec
Synthesis
Physical Netlist
Test Spec.
Model Build
Functional Sim
Model Build
Pervasive Verification
…
DFT Analysis
Expert designers
…
© 2012 IBM Corporation32
Productivity: Aspect-Oriented Design Flow
Functional HDL
Power Spec
Pervasive Spec
Synthesis
Physical Netlist
Test Spec.
Model Build
Functional Sim
Model Build
Pervasive Verification
…
DFT Analysis
Expert designers
Automation
…
© 2012 IBM Corporation33
Multi-Core Era Integration: Homogeneous IP
Standard Input Spec(HDL, ABS, assertions)
Construction Tools
Methodology Compliant IP
Macro Signoff Tools
Abstraction Formats
Global Analysis
Homogenous Design
Limited # of development groups
Strictly enforced design styles / rules with full methodology support
© 2012 IBM Corporation34
Innovation Era Integration: Heterogeneous IP
Standard Input Spec(HDL, ABS, assertions)
Construction Tools
Methodology Compliant IP
Signoff Tools
Abstraction Formats
Global Analysis
Heterogeneous Design
Multiple IP sources to obtain best-of-breed component designs
Limited control on design styles
Multiple entry points into design flow
Alternate Input Formats
Alternate Input Formats
Foreign (non-compliant) IP
Foreign (non-compliant) IP
Foreign abstractions
Foreign abstractions
© 2012 IBM Corporation35
Summary
�Moore’s law continues…..
�However, we will need MORE INNOVATION, not just more cores to leverage it
�EDA tools need to provide MORE PRODUCTIVITY to enable designer innovation