Download - POWER TM Processor Design & Methodology Directions

© 2012 IBM Corporation

POWERTM Processor Design & Methodology Directions

Joshua Friedrich – Senior Technical Staff Member, IBM Server & Technology Group

June 6, 2012

© 2012 IBM Corporation2

CMOS Microprocessor Trends, The First ~25 Years ( Good old days )Log(P

erf

orm

an

ce)

Single Thread

More active transistors, higher frequency

20051980


Characteristics of Single Thread Era

0

1

2

3

4

5

6

POWER4 POWER5 POWER6

Fre

qu

en

cy

(G

Hz

)

0

100

200

300

400

500

POWER4 POWER5 POWER6

TX

s p

er

co

re

Dennard Scaling

Exponential Frequency Growth

Optical Scaling / Node Migration

Expanding uArch Complexity


Single Thread Era Icon: POWER6

� 5+ GHz operation, >790M transistors, 341mm2 die

� 65nm SOI with 10 levels of Cu interconnect

� 2 superscalar, SMT cores with 7 instruction dispatch & 9 execution units

� 8 MB Level-2 cache + support for 32MB L3

� 2 memory controllers, Two-tier SMP Fabric

� Same pipeline depth & power at 2x frequency versus POWER5

2 MB L2

2 MB L2

2 MB L2

2 MB L2

L2 Dir

L2 Dir

Mem.Cntl.

Mem.Cntl.

SMP Fabric

L

3

CO

NTROL

LER

LSU

IFU /IDU

VMX

FXU

BF

U

DFU

RU

Core 1


Single Thread Era EDA: Transistor Analysis & Optimization

EinsTLT: Static timing analysis of complex circuits

EinsTuner: Transistor Level timing optimization

clkin

w2

w3

w0

fbk

Cycle

fbk = 0

evaluate precharge

fbk = 1

w2 = 11st Timing

2nd Timing

w3_int

0

200

400

600

800

1000

1200

-6 -4 -2 0 2 4 6 8

Slack (ps)# o

f p

ath

s

Pre-Tuning

Post-Tuning


End of an Era: The Power Wall

0.010.11

0.001

0.01

0.1

1

10

100

1000

Gate Length (microns)

Active Power

Passive Power

1994 2004

Po

wer

Den

sit

y (

W/c

m2) Air Cooling limit

Source: B. Meyerson (IBM) Source: S. Borkar (Intel)

Inability to scale Tox & lower voltage resulted in a power wall for single thread performance


Microprocessor TrendsLog(P

erf

orm

ance)


Single Thread


20051980


CMOS Scaling Continued

0%

20%

40%

60%

80%

100%

180nm 130nm 90nm 65nm 45nm 32nm

Gain by Traditional Scaling Gain by Innovation

0%

20%

40%

60%

80%

100%

180nm 130nm 90nm 65nm 45nm 32nm

Gain by Traditional Scaling Gain by Innovation

WL

BOX

DeepTrench

Cap

BL

Node

PassingWL

4.0

um

18fFStorage Node

WL

3fF BL (32 Cells)

Node

High-K Metal Gate



erf

orm

ance)

Multi-Core


Single Thread


20051980


POWER Processors Began the Multi-Core / Multi-Thread Era

Power 4

2001

Introduces Dual core

Power 5

2004

Dual Core

Introduces SMT (4 threads)

Power 6

2007

Dual Core – 4 threads

Enhances SMT Efficiency

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

Superscalar, out of order

Cycles

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

2 Way SMT

FX0

FX1

FP0

FP1

LS0

LS1

BRX

CRL

2 Way SMT (Enhanced)

Thread 1 Executing

Thread 0 Executing

No Thread Executing


POWER7 Processor Chip

� 567mm2, 45nm SOI w/ eDRAM

� 1.2B transistors– Equivalent function of 2.7B

(eDRAM)

� Eight processor cores w/ 4 way SMT– 32 threads / chip

� 32MB on chip eDRAM shared L3

� Dual DDR3 Memory Controllers w/ 100GB/s sustained Memory bandwidth

� Scalability up to 32 Sockets– 360GB/s SMP bandwidth/chip– 20,000 coherent operations in flight

FX0FX1FP0FP1LS0LS1BRXCRL

4 Way SMT

Thread 1 Executing

Thread 0 Executing

No Thread Executing

Thread 3 Executing

Thread 2 Executing


Multi-core Era EDA: Power Analysis & Reduction

Unstructured Clock Buffer & Latch

Structured Clocking Construct

15% clock power reduction

s0

s1

s2

<0.14 , 0.75>

<0.13 , 0.19>

<0.14 , 0.25>

<0.14 , 0.75> <0.14 , 0.25>

BEFORE

Format: <SF,SP>

s0

s1

s2

< 0.06 , 0.06 >

<0.14 , 0.25>

<0.14 , 0.75><0.14 , 0.75>

AFTER

<0.14 , 0.75>∆Power < 0

5% data power reduction


Multi-core Era EDA: Cycle Simulation

0

50

100

150

200

250

300

350

Runtime Memory Use

CycleSim

EventSim

25x

10x

H a r d w a r e A c c e le r a t o r T h r o u g h p u t

0

5

1 0

1 5

2 0

2 5

3 0

3 5

4 0

4 5

5 0

W e e k s

Tri

llio

n S

im C

ycle

s

0

2 0

4 0

6 0

8 0

1 0 0

1 2 0

Min

ute

s r

eal

tim

e @

3.5

GH

z

Mostly core only model Larger chip Models

(� slower sim speed)

Introduction of next

Generation AcceleratorPrior Gen Project

High water mark


Multi-Core Era Limiters`

1

10

100

90 65 45 32 22 14 10

Technology Node

log

(p

erf

orm

an

ce

)

Ideal Growth

Likely Multi-Core Path

Technology complexity & rising costs

Power

SW parallelism

Socket BW


Multi-core evolution

Compute Throughput Potential

Multi-core evolution

Socket Throughput Limitation(Power, memory bandwidth)

Need to Amplify Effective

Socket BandwidthTo Achieve Potential

POWER’s Multi-Core Advantage

© 2012 IBM Corporation16 Multi-core evolution

Compute Throughput Potential

Socket Throughput Limitation(Power, memory bandwidth)

EDRAM = large, low power cache

High bandwidth memory buffer

Low-Power Off-Chip SignalingTechnology

Coherence Innovation to

minimize socket-to-socket communication

POWER Architecture & Technology Will Extend Multi-Core Gains



erf

orm

ance)


Multi-Core


Single Thread


2015(?)20051980


Microprocessor Trends: The Next EraLog(P

erf

orm

ance) Designer-Driven Innovation

(System-level tech, Heterogeneity, HW-SW Opt.)


Multi-Core


Single Thread


2015(?)20051980


Innovation Era: System Level Technologies

3D Stacking with Through Silicon Vias

FPGA Accelerators Flash Memory / SSDSilicon Photonics

Single Processor–Memory Socket


Innovation Era: Heterogeneous Integration

uP

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

IO Hub

1/10GbE

Service Processor

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

RAID Controller

Power ManagementuController

VRMs

Simple Server (2010)

uP

Single VRM

Disk

Simple Server (Future)

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

DIMM

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

PCIE

Slots

10/40GbE

Internal VRMs

IO Subsystem

Disk

Service Subsystem


Innovation Era: Hardware-Software Co-Optimization

Benefit of Specialized HW

Acceleration

0

20

40

60

80

100

Base w/ HW Accel

Encryption

Application

Profiling Hardware

Original program

Dynamically Optimized program

Software

Optimizer

ExecutionEngine

Dynamic Re-compilation with Profile Data

Specialized HW acceleration viable in many areas

•Graphics•Compression•Cryptography•More….


Designer Time: Technology-Driven Eras vs. Innovation Era

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Designer

Time

ValidationTools

Implement

Plan

Innovate

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

Designer

Time

ValidationTools

Implement

Plan

Innovate

Technology Focused Innovation Focused


Manage Technology Complexity: Double Patterning

Need for Double / Triple Patterning

0

50

100

150

32nm 20nm Future

Technology Node

Pit

ch

(n

m)

Device Pitch

Single Exposure Limit

Metal Pitch

Double Patterning Limit

EUV?


Manage Technology Complexity: Extreme Variability

% Delay Variation vs. Device Size

0

10

20

30

40

0 2.5 5

% Delay Variation vs. Operating Voltage

0

10

20

30

40

50

60

0.5 0.7 0.9 1.1 1.3

3

Single Patterned RC Distribution

Double Patterned RC Distribution

Variability grows rapidly w/ each technology generation.

Increasing variation as devices shrink

Increasing variation as voltages drop


Productivity & TAT: Merge ASIC Productivity with Custom Performance

� Custom Processor Methodology

–Manual, custom design

–Tool support for Array / analog

–Transistor level analysis

–Clock mesh (low skew)

–Able to engineer wires & placement

–Small macros

� ASIC Methodology

–Synthesis Dominated

–Blackbox array / analog

–Gate level analysis

–Clock trees (high skew)

–Auto-place / auto-route only

–Large blocks

Custom performance w/ ASIC productivity


Productivity: Reduce Custom Design

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

# of Customs over Time

custom-like data flow alignment

91% density

Each color indicates cells associated w/ a different logical bit.

>10x reduction over 5 designs

Datapath Synthesis Results


Productivity: Reduce # of Design Partitions

60 logic macros, 25 customs, 14 unique arrays/RFs

1 macro, 0 customs, 9 unique arrays / RFsReduced area & power; equal cycle time

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

# of Macros over Time

Target


TAT: Gate-Level Analysis, Hierarchical Abstraction & Multi-Threading

0

50

hrs.

Base Cleanup Coarse

Parallelism

Hierarchical

abstracts

Multi-

threading

Projected Chip Timing Runtime

�Fast macro and global analysis tools allow designers to iterate more quickly resulting in reduced implementation time

�Gate level analysis provides orders-of-magnitude speedups over transistor level analysis

�Hierarchical abstraction & multi-threading can provide 5-10x gains in chip level analysis

Arb

itra

ry U

nit

s

Runtime Accuracy

Macro Signoff Runtime

TX Level

Gate Level


Productivity: Retiming

�Significant designer effort spent in optimizing cycle boundaries

�Retiming enables synthesis to optimally locate latches to balance

timing/area/power

� Invention is required to seamlessly handle verification implications

.

.

.latch

Doesn’t meet cycle time

Area/Power too high

Optimal


Productivity: Traditional Design Flow

HDL Specification•Function, test, pervasive, power, etc.

Synthesis

Physical Netlist

Model Build

Verification (all types)

Logic Designer

Subject Experts

requirements


Productivity: Aspect-Oriented Design Flow

Functional HDL

Power Spec

Pervasive Spec

Synthesis

Physical Netlist

Test Spec.

Model Build

Functional Sim

Model Build

Pervasive Verification

…

DFT Analysis

Expert designers

…


Productivity: Aspect-Oriented Design Flow

Functional HDL

Power Spec

Pervasive Spec

Synthesis

Physical Netlist

Test Spec.

Model Build

Functional Sim

Model Build

Pervasive Verification

…

DFT Analysis

Expert designers

Automation

…


Multi-Core Era Integration: Homogeneous IP

Standard Input Spec(HDL, ABS, assertions)

Construction Tools

Methodology Compliant IP

Macro Signoff Tools

Abstraction Formats

Global Analysis

Homogenous Design

Limited # of development groups

Strictly enforced design styles / rules with full methodology support


Innovation Era Integration: Heterogeneous IP

Standard Input Spec(HDL, ABS, assertions)

Construction Tools

Methodology Compliant IP

Signoff Tools

Abstraction Formats

Global Analysis

Heterogeneous Design

Multiple IP sources to obtain best-of-breed component designs

Limited control on design styles

Multiple entry points into design flow

Alternate Input Formats

Alternate Input Formats

Foreign (non-compliant) IP

Foreign (non-compliant) IP

Foreign abstractions

Foreign abstractions


Summary

�Moore’s law continues…..

�However, we will need MORE INNOVATION, not just more cores to leverage it

�EDA tools need to provide MORE PRODUCTIVITY to enable designer innovation

Download - POWER TM Processor Design & Methodology Directions

Top Related