University of Michigan EnA-HPC -- September 7, 2011

Near-Threshold Computing: Reclaiming Moore’s Law

Dr. Ronald G. Dreslinski Research Fellow

University of Michigan – Ann Arbor

Motivation

[Figure: transistors (100,000's), power (W), performance (GOPS), and efficiency (GOPS/W) versus year, 1985 to 2020, on a logarithmic axis from 0.001 to 1,000,000]

• Limits on heat extraction
• Limits on energy-efficiency of operations
• Stagnates performance growth

Motivation

[Figure: transistors (100,000's), power (W), performance (GOPS), and efficiency (GOPS/W) versus year, 1985 to 2020, on a logarithmic axis from 0.001 to 1,000,000, divided around c. 2000 into the "Era of High Performance Computing" and the "Era of Energy-Efficient Computing"]

With the help of some better thermal management…
Goal: To increase energy-efficiency of operations
Result: Continue scaling trends that fueled the computing revolution

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

Dark Silicon, the emerging dilemma: more and more gates can fit on a die, but not all can be turned on at the same time

• Environmental Concerns
• Form factor vs. Battery Life
• Power Density Limitations

Circuit supply voltages are no longer scaling… Power does not decrease at the same rate that transistor count increases.

A = gate area, scaling as 1/s²
C = capacitance, scaling < 1/s
Dynamic power dominates

[Figure annotations: Stagnant, Shrinking]
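The scaling argument on this slide can be sketched numerically. The snippet below is a minimal illustration, assuming the standard dynamic power relation P = C·V²·f and taking capacitance to scale as exactly 1/s (the slide only bounds it). The function name and the scaling factor s = 1.4 are hypothetical and not taken from the talk.

```python
def power_density_scale(s, v_scale, f_scale):
    """Relative change in dynamic power density after shrinking by factor s.

    Assumptions (illustrative, not from the talk): dynamic power is
    P = C * V^2 * f, capacitance C scales as 1/s, and gate area A scales
    as 1/s^2, so power density is P / A.
    """
    c_scale = 1.0 / s          # capacitance shrinks with feature size
    a_scale = 1.0 / s ** 2     # area shrinks quadratically
    p_scale = c_scale * v_scale ** 2 * f_scale
    return p_scale / a_scale


s = 1.4  # hypothetical per-generation scaling factor

# Classic scaling: supply voltage drops as 1/s and frequency rises as s.
classic = power_density_scale(s, v_scale=1.0 / s, f_scale=s)

# Stagnant supply voltage with the same frequency increase.
stagnant = power_density_scale(s, v_scale=1.0, f_scale=s)

print(f"classic scaling:  power density x{classic:.2f}")
print(f"stagnant voltage: power density x{stagnant:.2f}")
```

Under these assumptions power density stays flat when voltage scales with feature size, and grows as s² once the supply voltage stops scaling, which is one way to read the "not all gates can be turned on" dilemma.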

Today: Super-Vth, High Performance, Power Constrained

[Figure: energy/operation and log(delay) versus supply voltage (0, Vth, Vnom), showing normalized power, energy, and performance; the super-Vth region is marked with Core i7, 3+ GHz, 0.5 mW/MHz]

Energy per operation is the key metric for efficiency.
Goal: same performance, low energy per operation

Subthreshold Design

[Figure: energy/operation and log(delay) versus supply voltage (0, Vth, Vnom), with super-Vth and sub-Vth regions; annotations: 500 – 1000X, 12-16X]

Operating in the sub-threshold region gives us huge power gains at the expense of performance. OK for sensors!

Evolution of Subthreshold Designs

Phoenix 2 Design (2010)
- 0.18 µm CMOS
- Commercial ARM M3 Core
- Used to investigate:
  • Energy harvesting
  • Power management
- 37.4 µW/MHz

Subliminal 2 Design (2007)
- 0.13 µm CMOS
- Used to investigate process variation
- 3.5 µW/MHz

Subliminal 1 Design (2006)
- 0.13 µm CMOS
- Used to investigate existence of Vmin
- 2.60 µW/MHz

Phoenix 1 Design (2008)
- 0.18 µm CMOS
- Used to investigate sleep current
- 2.8 µW/MHz / 30pW sleep power
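The µW/MHz figures above are energy per clock cycle in disguise: one µW/MHz is 10⁻⁶ J/s divided by 10⁶ cycles/s, or 1 pJ per cycle. The short sketch below makes that conversion explicit and compares the listed designs with the 0.5 mW/MHz (500 µW/MHz) Core i7 figure quoted earlier. The helper name and dictionary are illustrative only.

```python
def uw_per_mhz_to_pj_per_cycle(uw_per_mhz):
    """Convert a uW/MHz rating into picojoules per clock cycle."""
    watts = uw_per_mhz * 1e-6           # power while running at 1 MHz
    cycles_per_second = 1e6             # 1 MHz
    joules_per_cycle = watts / cycles_per_second
    return joules_per_cycle * 1e12      # J -> pJ


# Values as listed on the slides (uW/MHz).
designs = {
    "Subliminal 1 (2006)": 2.60,
    "Subliminal 2 (2007)": 3.5,
    "Phoenix 1 (2008)": 2.8,
    "Phoenix 2 (2010)": 37.4,
    "Core i7 (0.5 mW/MHz)": 500.0,
}

for name, rating in designs.items():
    print(f"{name:22s} {uw_per_mhz_to_pj_per_cycle(rating):7.1f} pJ/cycle")
```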

Near-Threshold Computing (NTC)

[Figure: energy/operation and log(delay) versus supply voltage (0, Vth, Vnom), with super-Vth and sub-Vth regions; annotations: ~10X, ~50-100X, ~2X, ~6-8X]

Near-Threshold Computing (NTC):
• >60X power reduction
• 6-8X energy reduction
• Invest portion of extra transistors from scaling to overcome barriers
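The numbers on this slide fit together if average power is taken as energy per operation divided by time per operation. The sketch below assumes the ~10X annotation is the delay increase at NTC, and that ~2X and ~50-100X are the further energy and delay steps from NTC into sub-Vth. That reading of the figure is an assumption, though it reproduces the 12-16X and 500 – 1000X annotations on the Subthreshold Design slide.

```python
def power_reduction(energy_reduction, slowdown):
    """Power = energy / time, so power falls by energy_reduction * slowdown."""
    return energy_reduction * slowdown


ntc_energy = (6, 8)    # 6-8X energy reduction (from the slide)
ntc_slowdown = 10      # ~10X delay increase (assumed reading of the figure)

# Power reduction at NTC: low and high end of the energy range.
ntc_power = tuple(power_reduction(e, ntc_slowdown) for e in ntc_energy)

# Cumulative sub-Vth figures, assuming a further ~2X energy reduction
# and a further ~50-100X delay increase beyond NTC.
sub_energy = tuple(e * 2 for e in ntc_energy)
sub_delay = tuple(ntc_slowdown * d for d in (50, 100))

print("NTC power reduction:", ntc_power)
print("sub-Vth energy reduction:", sub_energy)
print("sub-Vth delay increase:", sub_delay)
```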

Silicon Verification of Trends

Phoenix 2 Design [Seok’11], 180nm design
• 1.8V -> 700mV
• ~10x NTC performance loss
• ~7x NTC energy reduction

[Figure: Phoenix 2 processor, from Seok, ISSCC 2011]

NTC – Opportunities and Challenges

Challenges:
• Low Voltage Memory
  - New SRAM designs
  - Robustness analysis at near-threshold
• Variation
  - Razor [Ernst’03] and other in-situ delay monitoring
  - Adaptive body biasing
• Performance Loss
  - Many-core designs to improve parallelism
  - Core boosting to improve single thread performance

Opportunities:
• New architectures
• Optimized processes
• 3D integration: fewer thermal restrictions

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

Minimum Energy SRAM

• SRAM has a lower activity rate than logic
• VDD for minimum energy operation (VMIN) is therefore higher for SRAM than for logic
• Running logic at VMIN for SRAM has a small energy penalty with increased performance

[Figure: leakage, dynamic, and total energy curves]
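Why a lower activity rate pushes VMIN upward can be illustrated with a toy energy model. The sketch below is not the model used in the talk. It assumes dynamic energy proportional to activity × Vdd², leakage energy proportional to off-current × Vdd × cycle time, and a smooth EKV-style interpolation for drive current. All constants and names are hypothetical, and the search starts at 0.15 V on the assumption that circuits are not functional below that.

```python
import math

N_VT = 1.5 * 0.026   # slope factor times thermal voltage in volts (assumed)
VTH = 0.4            # threshold voltage in volts (assumed)
LEAK_WEIGHT = 20.0   # lumped weight of leaking vs. switching devices (assumed)


def on_current(vgs):
    """Smooth sub-/super-threshold drive current, arbitrary units."""
    return math.log(1.0 + math.exp((vgs - VTH) / (2.0 * N_VT))) ** 2


def energy_per_op(vdd, activity):
    """Dynamic plus leakage energy per operation, arbitrary units.

    Dynamic: activity * Vdd^2.
    Leakage: off-current * Vdd * cycle time, with cycle time ~ Vdd / on-current.
    """
    dynamic = activity * vdd ** 2
    leakage = LEAK_WEIGHT * vdd ** 2 * on_current(0.0) / on_current(vdd)
    return dynamic + leakage


def vmin(activity, v_lo=0.15, v_hi=1.2, step=0.001):
    """Grid search for the supply voltage that minimizes energy per operation."""
    n = int(round((v_hi - v_lo) / step))
    candidates = [v_lo + i * step for i in range(n + 1)]
    return min(candidates, key=lambda v: energy_per_op(v, activity))


v_logic = vmin(activity=0.10)   # hypothetical activity rate for logic
v_sram = vmin(activity=0.01)    # hypothetical, lower activity rate for SRAM
print(f"logic VMIN ~ {v_logic:.2f} V, SRAM VMIN ~ {v_sram:.2f} V")
```

In this model the leakage term matters relatively more when activity is low, so the energy minimum moves to a higher supply voltage, matching the qualitative claim on the slide.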

New NTC Architectures

[Figure: block diagrams of cores and L1 caches connected through a bus / switched network to the next level memory, including a clustered organization in which several cores share an L1]

Key Insight:
• SRAM is run at a higher VDD than cores with little energy penalty, allowing caches to operate faster than the core

Design Levers:
• Operating Voltage
• L1 Size
• Number of Cores per Cluster
• Number of Clusters

L1 Cache Size Tradeoff

[Figure: two core / L1 / L2 hierarchies with different L1 sizes]

A larger L1 gives:
• Decreased miss rate
• Higher energy/access
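The tradeoff can be made concrete with a small sweep over L1 sizes. The sketch below is illustrative only: it assumes L1 access energy grows with the square root of capacity, the miss rate falls with the inverse square root (a common rule of thumb), and every miss costs one L2 access. All constants and function names are hypothetical, not results from the talk.

```python
L2_ENERGY = 50.0  # energy per L2 access, arbitrary units (assumed)


def l1_energy(size_kb):
    """Energy per L1 access; assumed to grow as sqrt(size), 1.0 at 8 kB."""
    return (size_kb / 8.0) ** 0.5


def l1_miss_rate(size_kb):
    """Miss rate; assumed to fall as 1/sqrt(size), 10% at 8 kB."""
    return 0.10 * (size_kb / 8.0) ** -0.5


def energy_per_access(size_kb):
    """Average memory energy per access: L1 cost plus expected L2 cost."""
    return l1_energy(size_kb) + l1_miss_rate(size_kb) * L2_ENERGY


sizes = [8, 16, 32, 64, 128, 256]
best = min(sizes, key=energy_per_access)

for size in sizes:
    print(f"{size:4d} kB: {energy_per_access(size):.2f}")
print("energy-optimal L1 size under these assumptions:", best, "kB")
```

Small caches pay for frequent L2 accesses and large caches pay on every hit, so the energy-optimal size sits in between.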

Results – Energy Optimal L1 Size (Single Core)

• Energy dependency on L1 size
• Trade-off between L1 and L2 access

Clustering Tradeoffs

[Figure: four CPUs with four private L1s over a shared L2, compared with four CPUs sharing two L1s over a shared L2]

Tradeoffs:
+ Clustered sharing
- Cluster conflict
- New bus
- L1 speed


Energy Optimal Cluster-based CMP (Fixed Die Size)


Full Space Analysis

Various Scaling Methods

• Baseline
  - Single CPU @ 233MHz
• Simple CMP
  - One core per L1
  - Vdd scaling
• Proposed cluster-based CMP
  - Multiple cores per L1
  - Vdd scaling

[Figure: normalized energy/operation (0 to 1), broken down into L2, L1, and core, for Uniprocessor, CMP w/ DVFS, and NTC. Annotations: 38%, 53%, 71%; "4 Cores, 4 L1's"; "2 Cores/Cluster, 3 Clusters"]
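The three percentages on the chart are consistent with successive savings that compound. The sketch below assumes 38% is the saving of the simple CMP over the uniprocessor baseline and 53% is the saving of the clustered NTC design over the simple CMP. That assignment is an interpretation of the chart labels, not something the slide states.

```python
def combined_saving(first, second):
    """Total fractional saving when a second saving applies to what remains."""
    return 1.0 - (1.0 - first) * (1.0 - second)


cmp_over_baseline = 0.38   # assumed: simple CMP vs. uniprocessor
ntc_over_cmp = 0.53        # assumed: clustered NTC vs. simple CMP

total = combined_saving(cmp_over_baseline, ntc_over_cmp)
print(f"clustered NTC vs. baseline: {total:.1%}")
```

Under that reading the combined saving comes to about 71%, matching the third annotation.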

Energy Optima for SPLASH2

• Cluster based architecture with Vdd and Vth scaling
• Optimal cluster size is 2 for most of the apps
  - rad chooses non-clustered CMP
• Average: 74% over baseline, 55% over simple CMP

app   nc   k   L1 size/kB   energy savings over baseline   energy savings over simple CMP
cho   3    2   64           70.8%                          52.8%
fft   2    2   32           72.6%                          68.5%
fmm   8    2   128          79.7%                          41.6%
luc   3    2   32           77.8%                          64.4%
lun   2    2   64           69.2%                          58.0%
rad   16   1   128          84.2%                          35.1%
ray   3    2   128          65.1%                          54.9%
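The quoted averages can be checked against the table with a plain arithmetic mean. The sketch below does only that; the dictionary layout is illustrative. The baseline column reproduces the 74% figure. The simple-CMP column comes out near 53.6%, slightly under the quoted 55%, so the slide may have used a different averaging method or rounding.

```python
# (savings over baseline %, savings over simple CMP %) per benchmark,
# copied from the table.
savings = {
    "cho": (70.8, 52.8),
    "fft": (72.6, 68.5),
    "fmm": (79.7, 41.6),
    "luc": (77.8, 64.4),
    "lun": (69.2, 58.0),
    "rad": (84.2, 35.1),
    "ray": (65.1, 54.9),
}

avg_baseline = sum(b for b, _ in savings.values()) / len(savings)
avg_simple_cmp = sum(c for _, c in savings.values()) / len(savings)

print(f"mean savings over baseline:   {avg_baseline:.1f}%")
print(f"mean savings over simple CMP: {avg_simple_cmp:.1f}%")
```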

Energy Optima w/ Performance Requirements

• Cluster based approach provides best savings
• Traditional approach only saves energy at high end

[Figure annotations: 53%, 32%, 20%]

Outline
• Define a new region of operation, Near-Threshold Computing
• Explore new architectures enabled by key insights of computing in the NTC region
• Present an initial design of a 3D stacked NTC system, Centip3De

A Closer Look at Wafer-Level Stacking

[Figure: wafer cross-section, illustration from Bob Patti, Tezzaron. Labels: Dielectric (SiO2/SiN); Gate Poly; STI (Shallow Trench Isolation); Oxide; Silicon; W (Tungsten contact & via); Al (M1 – M5); Cu (M6, Top Metal); "Super-Contact"]


Next, Stack a Second Wafer & Thin:

Then, Stack a Third Wafer:

[Figure labels: 3rd wafer; 2nd wafer; 1st wafer: controller]

Centip3De – 3D NTC Prototype

[Figure: layer stack. Labels: Logic - B, Logic - B, Logic - A, Logic - A; DRAM Sense/Logic – Bond Routing; DRAM, DRAM; F2F Bond (twice)]

Centip3De Design
• 130nm, 7-layer 3D-stacked chip
• 128 ARM M3 cores
• 150 mm²

Design Scaling and Power Breakdowns: NTC Centip3De System

• 1.9 GOPS (3.8 GOPS in Boost)
  - Max 1 IPC per core
  - 128 cores
  - 15 MHz
• 130 mW (691 mW in Boost)
• 14.6 GOPS/W (5.5 in Boost)

[Figure: power breakdown charts for the raytracing benchmark, split into Cores, I-Caches, D-Caches, and DRAM. NTC mode power (mW): 42, 2.9, 7.0, 39. Boosted mode power (mW): 336, 28, 67, 45]

• Naïve scaling to 22nm yields ~200 GOPS/W
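The headline numbers on this slide follow from simple arithmetic, sketched below. Peak throughput is taken as cores × clock × maximum IPC, and efficiency as throughput divided by power. The variable names are illustrative, and the boost-mode core count and clock are not given on the slide, so only the boost efficiency is checked.

```python
cores = 128
clock_hz = 15e6      # 15 MHz in NTC mode
max_ipc = 1.0        # at most one instruction per cycle per core

# Peak NTC-mode throughput in giga-operations per second.
peak_gops = cores * clock_hz * max_ipc / 1e9

# Efficiency from the rounded figures quoted on the slide.
ntc_gops_per_watt = 1.9 / 0.130      # 1.9 GOPS at 130 mW
boost_gops_per_watt = 3.8 / 0.691    # 3.8 GOPS at 691 mW

print(f"peak NTC throughput: {peak_gops:.2f} GOPS")
print(f"NTC efficiency:      {ntc_gops_per_watt:.1f} GOPS/W")
print(f"boost efficiency:    {boost_gops_per_watt:.1f} GOPS/W")
```

Boost mode doubles throughput but costs more than five times the power, which is why its efficiency drops to roughly a third of the NTC figure.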

Conclusions

• Observed voltage scaling and thermal limits reducing the gains of Moore’s Law
• Defined a new computational operating region: Near Threshold Computing
• Leveraged key insights of NTC for new clustered architectures
• Initial ideas of a 3D integrated NTC system, Centip3De

Related References

• Ronald G. Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, Trevor Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, Special Issue on Ultra-Low Power Circuit Technology, Vol. 98, No. 2, February 2010, pg. 253-266.

• Bo Zhai, Ronald G. Dreslinski, Trevor Mudge, David Blaauw, Dennis Sylvester, “Energy Efficient Near-threshold Chip Multi-processing,” ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), August 2007, Best Paper Nomination.

• Dan Ernst, Shidhartha Das, Seokwoo Lee, David Blaauw, Todd Austin, Trevor Mudge, Nam Sung Kim, Krisztian Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation,” IEEE Micro, Vol. 24, No. 6, November-December 2004, pg. 10-20.

• Mingoo Seok, Dongsuk Jeon, Chaitali Chakrabarti, David Blaauw, Dennis Sylvester, “A 0.27V, 30MHz, 17.7nJ/transform 1024-pt complex FFT core with super-pipelining,” IEEE International Solid-State Circuits Conference (ISSCC), February 2011, to appear.


Backup

Logic vs. Memory

• To maintain the same robustness at low voltages, SRAM cell sizes need to be increased to compensate for the effects of process variation
• Increased size leads to higher energy consumption and longer interconnects


Proposed Parallel Architecture

Energy Optimal Vth Selection

• When Vth is very high
  - Energy optimal Vdd is independent of Vth
  - Free performance gain without consuming more energy
• As Vth reduces
  - Circuit operates faster
  - More leakage, more energy consumption per switching
• Choose Vth
  - Body bias
  - Dopant implant