near-threshold computing: reclaiming moore’s la · university of michigan ena-hpc -- september 7,...

1 University of Michigan EnA-HPC -- September 7, 2011

Near-Threshold Computing: Reclaiming Moore’s Law

Dr. Ronald G. Dreslinski Research Fellow

University of Michigan – Ann Arbor

100000

1000000

1985 1990 1995 2000 2005 2010 2015 2020

Transistors (100,000's)

Power (W)

Performance (GOPS)

Efficiency (GOPS/W)

Limits on heat extrac6on

Limits on energy-‐efficiency of opera6ons

Stagnates performance growth

Motivation

100000

1000000

1985 1990 1995 2000 2005 2010 2015 2020

Transistors (100,000's)

Power (W)

Performance (GOPS)

Efficiency (GOPS/W)

Era of High Performance Compu6ng Era of Energy-‐Efficient Compu6ng c. 2000

With the help of some beBer thermal management…

Goal: To increase energy-‐efficiency of operaGons

Result: Con6nue scaling trends that fueled the compu6ng

revolu6on

Motivation

Outline   Define a new region of operation, Near-Threshold

Computing

  Explore new architectures enabled by key insights of computing in the NTC region

  Present an initial design of a 3D stacked NTC system, Centip3De

Dark Silicon—The emerging dilemma: More and more gates can fit on a die,

but not all can be turned on at the same time

Environmental Concerns

Form factor vs. Battery Life

Power Density Limitations Circuit supply

voltages are no longer scaling…

Power does not decrease at the same rate that transistor count

increases

A = gate area scaling 1/s2

C = capacitance scaling < 1/s Dynamic dominates

Stagnant

Shrinking

Today: Super-Vth, High Performance, Power Constrained

Super-Vth

Supply Voltage 0 Vth Vnom

Core i7

3+ GHz 0.5 mW/MHz

Normalized Power, Energy, & Performance Energy per operation is the key metric for

efficiency. Goal: same performance, low energy per operation

Subthreshold Design

Super-Vth Sub-Vth

500 – 1000X

12-16X

Operating in the sub-threshold gives us huge power gains at the expense of performance OK for sensors!

Evolution of Subthreshold Designs

Phoenix 2 Design (2010) - 0.18 µm CMOS -Commercial ARM M3 Core -Used to investigate:

• Energy harvesting • Power management

-37.4 µW/MHz

Subliminal 2 Design (2007) -0.13 µm CMOS -Used to investigate process variation -3.5 µW/MHz

Subliminal 1 Design (2006) -0.13 µm CMOS -Used to investigate existence of Vmin -2.60 µW/MHz

Phoneix 1 Design (2008) - 0.18 µm CMOS -Used to investigate sleep current -2.8 µW/MHz / 30pW sleep power

Near-Threshold Computing (NTC)

Super-Vth Sub-Vth

~10X ~50-100X

Near-Threshold Computing (NTC): • >60X power reduction • 6-8X energy reduction

•  Invest portion of extra transistors from scaling to overcome barriers

Silicon Verification of Trends

Phoenix 2 Design [Seok’11] 180nm Design

1.8V -> 700mV ~10x NTC Performance Loss ~7x NTC Energy Reduction

Seok ISSCC 2011

Phoenix 2 Processor

NTC – Opportunities and Challenges

  Challenges:   Low Voltage Memory

  New SRAM designs   Robustness analysis at near-threshold

  Variation   Razor [Ernst’03] and other in-situ delay monitoring   Adaptive body biasing

  Performance Loss   Many-core designs to improve parallelism   Core boosting to improve single thread performance

  Opportunities:   New architectures   Optimized Processes   3D Integration – less thermal restrictions

Computing

Minimum Energy SRAM

  SRAM has a lower activity rate than logic   VDD for minimum energy operation (VMIN) is higher   Running logic at VMIN for SRAM has a small energy penalty

with increased performance

Leakage

Dynamic

Cluster

Key Insight: •  SRAM is run at a higher VDD than cores with little energy

penalty, allowing caches to operate faster than the core

Cluster Cluster Cluster

Core Core Core Core

New NTC Architectures

BUS / Switched Network

Next Level Memory

BUS / Switched Network

Next Level Memory

L1 L1 L1 L1

Design Levers: •  Operating Voltage •  L1 Size •  Number of Cores per Cluster •  Number of Clusters

L1 Cache Size Tradeoff

Decreased Miss Rate

Higher Energy/Access

Results – Energy Optimal L1 Size (Single Core)

  Energy dependency on L1 size   Trade-off between L1 and L2 access

Clustering Tradeoffs

CPU CPU CPU CPU

L1 L1 L1 L1

CPU CPU CPU CPU

Tradeoffs ----------------------- + Clustered Sharing - Cluster Conflict - New Bus - L1 Speed

Energy Optimal Cluster-based CMP (Fixed Die Size)

Full Space Analysis

Uniprocessor CMP w/ DVFS

L2 L1 Core

Various Scaling Methods

  Baseline   Single CPU @

233MHz

  Simple CMP   One core per L1   Vdd scaling

  Proposed cluster-based CMP  Multiple cores per L1   Vdd scaling

71% 4 Cores 4 L1’s

2 Cores/Cluster 3 Clusters

21 University of Michigan EnA-HPC -- September 7, 2011 -21-

Energy Optima for SPLASH2   Cluster based architecture with Vdd and Vth scaling

  Optimal cluster size is 2 for most of the apps

 Rad choose non-clustered CMP   Average: 74% over baseline, 55% over simple CMP

nc k L1 size/kB energy savings over baseline

energy savings over simple CMP

Cho 3 2 64 70.8% 52.8%

Fft 2 2 32 72.6% 68.5%

fmm � 8� 2� 128� 79.7%� 41.6%

luc � 3� 2� 32� 77.8%� 64.4%

lun� 2� 2� 64� 69.2%� 58.0%

rad� 16� 1� 128� 84.2%� 35.1%

ray� 3� 2� 128� 65.1%� 54.9%

Energy Optima w/ Performance Requirements

  Cluster based approach provides best savings   Traditional approach only saves energy at high end

Computing

A Closer Look at Wafer-Level Stacking

Dielectric(SiO2/SiN) Gate Poly STI (Shallow Trench Isolation)

Silicon

W (Tungsten contact & via) Al (M1 – M5) Cu (M6, Top Metal)

“Super-Contact”

Illustration from Bob Patti, Tezzaron

Next, Stack a Second Wafer & Thin:

3rd wafer

2nd wafer

1st wafer: controller

Then, Stack a Third Wafer:

Centip3De – 3D NTC Prototype

Logic - B Logic - B Logic - A

DRAM Sense/Logic – Bond Routing DRAM DRAM

F2F Bond

Logic - A

Centip3De Design • 130nm, 7-Layer 3D-Stacked Chip • 128 - ARM M3 Cores • 150mm2

  1.9 GOPS (3.8 GOPS in Boost)   Max 1 IPC per core   128 Cores   15 MHz

  130 mW (691mW in Boost)   14.6 GOPS/W (5.5 in Boost)

Design Scaling and Power Breakdowns NTC Centip3De System

2.9 7.0

NTC Mode Power (mW)

Cores I-Caches D-Caches DRAM

Boosted Mode Power (mW)

Raytracing Benchmark

  Naïve Scaling to 22nm yields ~200GOPS/W

  Observed Voltage Scaling and Thermal Limits reducing the gains of Moore’s Law

  Defined a new computational operating region: Near Threshold Computing

  Leveraged key insights of NTC for new clustered architectures

  Initial ideas of a 3D integrated NTC system, Centip3De

Conclusions

Related References •  Ronald G. Dreslinski, Michael Wieckowski, David Blaauw, Dennis Sylvester, Trevor Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, Special Issue on Ultra-Low Power Circuit Technology, Vol. 98, No. 2, February 2010, pg. 253 – 266.

•  Bo Zhai, Ronald G. Dreslinski, Trevor Mudge, David Blaauw, Dennis Sylvester, “Energy Efficent Near-threshold Chip Multi-processing,” ACM/IEEE International Symposium on Low-Power Electronics and Design (ISLPED), August 2007, Best Paper Nomination.

•  Dan Ernst, Shidhartha Das, Seokwoo Lee, David Blaauw, Todd Austin, Trevor Mudge, Nam Sung Kim, Krisztian Flautner, “Razor: Circuit-Level Correction of Timing Errors for Low-Power Operation”, IEEE, Vol. 24, No. 6, November-December 2004, pg. 10-20.

• Mingoo Seok, Dongsuk Jeon, Chaitali Chakrabarti, David Blaauw, Dennis Sylvester, “A 0.27V, 30MHz, 17.7nJ/transform 1024-pt complex FFT core with super-pipelining,” IEEE International Solid-State Circuits Conference (ISSCC), February 2011, to appear

Backup

Logic vs. Memory   To maintain same robustness at low voltages SRAM cell sizes needs

to be increased to compensate effects of process variation   Increased size leads to higher energy consumption, and longer

interconnects

Proposed Parallel Architecture

Energy Optimal Vth Selection

  Vth is very high   Energy optimal Vdd is

independent of Vth   Free performance gain

without consuming more energy

  As Vth reduces   Circuit operates faster   More leakage, more energy

consumption per switching

  Choose Vth   Body bias   Dopant implant

near-threshold computing: reclaiming moore’s la · university of michigan ena-hpc -- september 7,...

Documents

data-intensive distributed...

ineffective and effective way to find out latency...

category: design & engineering - 03 · 2020-03-17 ·...

evolving(the(architecture((...

pcr (polymerase chain reaction) - bioinformatics graz ·...

· xls file · web view6/2/2014 100000 0 0. 6/2/2014 100000...

[xls] · web view100000 100000. 100000 100000. 100000...

energy resource potential of gas hydrates on the north...

introduction to goms - rensselaer polytechnic institute min...

[xls] · web view2/1/2016 100000 100000 0 2/5/2016 1000000...

varmebehandling - orbit.dtu.dk · dtu fødevareinstituttet...

november 2017 - digital tv europedigital tv europe...

numero spares 1000000

scanned with camscanner ctr annexure-a.pdf · cooldrinks...

sdl e2).pdf · 2019. 7. 22. · s-p curve 100000 1000000...

[xls] · web view1 7/2/2018 1000000 100000. 2 7/2/2018...

computing on the lunatic fringe - university of texas at...

ministry of corporate affairs - government of...

ministry of corporate affairs - government of...

· xls file · web view4/1/2014 500000 500000 0. 4/1/2014...