
Realizing a High-performance, Power Efficient ARM Cortex-A57 Processor Implementation at 16nm

Aniket M. Saha, Product Manager, ARM CPU Group

Rahul Deokar, Product Director, Cadence Digital


ARM® Cortex®-A current portfolio

Performance

High Efficiency

Cortex-A9

Cortex-A15, Cortex-A17, Cortex-A12

Cortex-A57

Cortex-A7, Cortex-A5, Cortex-A53

Cortex-A8

V7-A: Premium performance with mid-range area & power

V8-A, 64-bit: Highest single-thread performance CPU

V7-A: High-performance 32-bit CPU with enterprise-class feature set

Highest-efficiency V8-A CPU, 64-bit support, big.LITTLE compatible

Highest-efficiency V7-A CPU, big.LITTLE compatible

Smallest & lowest-power V7-A CPU

ARM Cortex-A57 CPU: High-end Product for Mobile and Enterprise

Highest single-thread performance today
Out-of-order, multi-issue pipeline

Improved efficiency with latest revision
2+ GHz in sub 750mW in 16nm

Fault tolerance and scalability
ECC support and ARM AMBA® 5 CHI interfaces

Mature: proven in 28nm down to 16nm
Multiple platforms tested in silicon

Attractive in new markets
Automotive and aerospace applications
Industrial and defense applications

Leading 64-/32-bit performance for mobile applications

High-end Smartphone, Mobile Computing, Automotive IVI, Enterprise

ARM Cortex-A57: High Performance ARMv8-A mobile processor

Significant advancements in power efficiency
>20% power efficiency improvement from Cortex-A15
Optimal performance for the smartphone power envelope in 16nm FF

big.LITTLE compatible for extended dynamic range of operation
High single-threaded performance for big.LITTLE systems
Low power enabling maximum performance in mobile thermal limit

Large performance increase across integer, memory-streaming and browser benchmarks

[Figure: Cortex-A57 efficiency improvement over Cortex-A15 (r2 and r3) for Entry, Mid-Range and Premium configurations, each with CCI-400; size labels as extracted: 2MB/1MB, 1MB/1MB, 512k/1MB]

ARM Cortex-A57 Market Position

Market | Suggested CPU Configuration | Notes
Premium Mobile | 2-4x Cortex-A57 + 2-4x Cortex-A53 | 64-bit big.LITTLE. Performance and efficiency.
Digital TV, Home Server, Gaming Consoles | 2-4x Cortex-A57 | General-purpose and media performance. Intensive streaming, media, graphics and compute workloads.
Wireless Infrastructure | 4/8/16/32 cores with CCN-504 and beyond | 4G, LTE. Control plane processing. Optimized for many-core.
Server | 8/16 cores with CCN-504 | Data Tier and Application/Business Tier. Highest performance. Robust & highly reliable.

[Figure: CPU clusters with L2 caches connected through a Cache Coherent Network. Image credit: http://www.dreamstime.com/royalty-free-stock-photo-web-hosting-server-security-image11190055]

ARM Cortex-A57 in big.LITTLE

Cortex-A53 (LITTLE)
Simple, in-order, 8-stage pipeline
Performance better than today’s high-end smartphones
Most energy-efficient applications processor from ARM

Cortex-A57 (big)
Complex, out-of-order, multi-issue pipeline
Up to 3x the performance of today’s high-end superphones
Highest performance in mobile power envelope

[Figure: pipeline diagram with labels Queue, Issue, Integer]

ARM Artisan® Physical IP for TSMC 16FFLL

Complete Physical IP platform
Alpha IP available, additional alpha and beta releases ongoing; detailed schedule and deliveries are partner-driven
EAC releases start in Summer 2014

Logic Libraries: High-Density, High-Performance, Ultra-High Density, Power Management, ECO Kits
POP IP: Skrymir / ARM Mali™-T760, Cortex-A57/A53, Others: TBD
Memory Compilers: 9+ Memory Compilers, Multi-Periphery Options, Low Vdd Assist Features, Extensive Feature Set
Interface: GPIO 1.8V

FinFET Optimized

Preliminary, Platform Content and Features may change


ARM POP™ IP: Complete Cortex-A CPU & Mali™ GPU Implementation Solution from ARM

Processor Optimized Physical IP – Fast Cache Instances + High Performance Kit

Full Implementation Knowledge Transfer

Customized POP IP Reference Flow Methodology (scripts)

ARM Implementation Support

Full flexibility in CPU configuration

Best PPA, low risk with short time-to-market

ARM POP IP for Core-Hardening Acceleration

SoC design is getting more complex
64-bit processors in a smartphone are here

SoC design cycles are getting shorter
A new high-end, trend-setting smartphone or tablet is released every year

NRE cost of SoC design is increasing
More engineering & computing resources needed

You want to minimize your risk!

POP IP delivers a complete solution, including a customized reference flow to make complex designs easy

POP IP includes a comprehensive implementation user guide to transfer knowledge from ARM experts to you

Implementation support is also available

POP IP provides a proven roadmap to achieve your implementation goals

Minimize your risk by leveraging ARM’s implementation expertise

ARM Cortex-A57 POP IP Offerings on TSMC 16FFLL

Parameter | Application #1 | Application #2 | Application #3
Target Market | Low Cost Mobile | High-end Mobile (big)/Enterprise | Power Optimized
Optimization Target | Max Performance | Max Performance With Crypto | Max Performance in a Power Budget
Configuration: No of CPU | MP4 | MP4 | MP4
Configuration: L1/L2 | L2 = 2MB | L2 = 2MB | L2 = 1MB
Configuration: ECC | L1 & L2 | L1 & L2 | L1 & L2
Library Used: Track Height | High Density | High Performance | High Density
Library Used: Vt Options used | SVt, LVt, ULVt | SVt, LVt, ULVt | SVt, LVt
Power Gating | Yes | Yes | Yes
Power Domains | 1 per CPU + 1 for L2D | 1 per CPU + 1 for L2D | 1 per CPU + 1 for L2D

Three different optimized POP IP implementations for Cortex-A57 on 16FFLL


ARM Cortex-A57 in ARM big.LITTLE - Silicon Proof

PPA targets and CPU configuration set based on big.LITTLE smartphone application type

Switched power domains and UD/OD supported

Designed to support silicon correlation testing

First ARM Cortex-A57 on TSMC 16nm FinFET
Made possible by close collaboration

Cortex-A57 configuration
Single 64-bit CPU
L1 data cache 32kB
L1 instruction cache 48kB
L2 cache size 512kB with ECC
AMBA® 4 ACE
Clock-gating power management

Implementation from RTL-to-GDS in six months
ARM Artisan® standard cells
TSMC memory macros, I/O, phase-locked loop (PLL)
Early flow development to identify and resolve EDA challenges

First milestone in collaboration to optimize ARMv8 designs on TSMC FinFET

[Figure: labels CPU, non-CPU]

© 2014 Cadence Design Systems, Inc. All rights reserved.

Cadence RTL2signoff high-performance design

RCP: Physical-aware synthesis
Enables predictable correlation to implementation

EDI GigaOpt, CCOpt, GigaPlace
Multi-threaded, concurrent electrical/physical/PPA driven

Tempus, Voltus, Quantus
Path-based accuracy, signoff timing/power closure, 10X faster

Consistently faster TTM (weeks vs. months)
Better PPA (20% on average)

[Figure: High-performance (ARM) design benchmarks, PPA % gain. Designs: Design 1 (CA57 @20nm), Design 2 (CA57 @16FF), Design 3 (V8 64bit @16FF), Design 4 (CA15 @28nm). Callouts: Exceeds 2GHz; CDNS POR; CDNS POR @16nm; 2X Better TNS; 17% Better Power; 12% Better Power; 18% Better Utilization]


RC/RCP: 16/14nm advanced node correlation during synthesis
Native extraction, CCS/ECSM, Layer (NDR) support

Native R/C Extraction
Congestion-based capacitance in PhysIOPT

Current Source Modeling
CCS/ECSM for pin caps and Ceff
Accurate correlation to EDIS, Tempus, etc.

Layer Modeling
NDR support shared with EDIS
Layer support throughout flow

Accurate R/C extraction required for physically aware timing models at advanced process nodes

Current and layer (NDR) modeling interacts heavily with route length estimates that dictate timing

[Figure: Metal layer stack by node (180nm, 45nm, 20nm); callout: Wider, Faster Wires]


New EDI System GigaPlace
Next-generation placement technology

Better PPA, Utilization & Faster Design Closure

GigaPlace Analytical Placement Engine
Electrical-driven (Slack/MMMC/skew/power)
Optimization-driven (Gate sizing/buffering)
Physical-driven (Topology/layer/color/pin-access)

Concurrent, multi-objective, massively-parallel algorithm
Integrated and correlated with Tempus and GigaOpt
Advanced node (16/14/10nm) color-aware technology

5% better wirelength, 5% better leakage, 3% better utilization, 2X better TNS

[Figure: placement objectives Slack, Wire-length, Congestion]


GigaPlace Slack-driven Placement

Traditional Placement: Timing-driven Placement
“Lightly” integrated: Placement and Timer coupled through Net Weighting
Solves:
• Overlap
• Wirelength
Timing ~ wirelength scaling
Wirelength ≠ Slack
• Poor correlation with GigaOpt

GigaPlace: Slack-driven Placement
“Tightly” integrated: GigaPlace and Timer
Solves:
• Overlap
• Wirelength
• Slack
Slack driven by:
• Gate delay
• False/multi-cycle paths
• Layer assignment
• Congestion timing effects
Correlates with GigaOpt
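
To make the net-weighting approach of traditional timing-driven placement concrete, here is a minimal sketch of a slack-weighted wirelength cost. It is an illustration only, not Cadence's algorithm: the `Net` class, the linear weighting formula and the `max_boost` parameter are all assumptions introduced for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Net:
    name: str
    pins: List[Tuple[float, float]]  # (x, y) pin locations
    slack: float                     # worst timing slack seen on this net, in ns


def hpwl(pins: List[Tuple[float, float]]) -> float:
    """Half-perimeter wirelength of the bounding box around the pins."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def net_weight(slack: float, clock_period: float, max_boost: float = 4.0) -> float:
    """Hypothetical weighting: 1.0 for nets that meet timing, rising linearly
    with negative slack up to 1.0 + max_boost."""
    if slack >= 0.0:
        return 1.0
    criticality = min(1.0, -slack / clock_period)
    return 1.0 + max_boost * criticality


def weighted_wirelength(nets: List[Net], clock_period: float) -> float:
    """Cost a net-weighting placer would minimize: sum of weight * HPWL."""
    return sum(net_weight(n.slack, clock_period) * hpwl(n.pins) for n in nets)
```

The timer only influences this cost through a per-net scalar, so the placer still treats timing as scaled wirelength. Effects the slide lists as slack drivers (gate delay, false/multi-cycle paths, layer assignment, congestion) are invisible to it, which is the gap a slack-driven formulation is meant to close.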


GigaPlace customer case study
2M instances

13.2 “Enhanced” Flow: net weighting for slack, region guides for timing
VS
GigaPlace: no net weighting, no region guides

[Figure: placement plots with labels 4073 Fanout Cone Reg, Pipeline Reg, Critical Path Reg, To 4033 Reg File]

Post-route results (timing columns given as r2r / I/O):

Flow | WNS | TNS | VP | Density | #DRC | Leakage | %LSL
13.2 enhanced place | -0.16 / -0.27 | -222.7 / -398.0 | 7765 / 11659 | 86 | 324 | 2.02 mW | 6.6%
GigaPlace | -0.04 / -0.14 (**) | -7.2 / -76.6 | 1068 / 2775 | 77.3 | 69 | 1.32 mW | 2.7%

GigaPlace:
Better pipeline placement
Less module splitting
No placement constraints!
Better TNS / WNS
5.5% Better wirelength
10% Better density
40% Better leakage
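
The timing columns in this case study (WNS, TNS, VP) are standard summary metrics over path slacks. As a reminder of how they relate, here is a small sketch that derives them from a list of endpoint slacks. The function name and the sample slack values are made up for illustration, and the sketch assumes one worst path per endpoint.

```python
from typing import Dict, List


def timing_summary(endpoint_slacks: List[float]) -> Dict[str, float]:
    """Summarize endpoint slacks into WNS, TNS and violating-path count.

    WNS: worst (most negative) slack, 0.0 if nothing violates.
    TNS: sum of all negative slacks.
    VP:  number of endpoints with negative slack.
    """
    violating = [s for s in endpoint_slacks if s < 0.0]
    return {
        "wns": min(violating, default=0.0),
        "tns": float(sum(violating)),
        "vp": len(violating),
    }


if __name__ == "__main__":
    # Illustrative slacks only, not taken from the case study.
    print(timing_summary([0.12, -0.04, -0.10, 0.03, -0.01]))
```

A design can have a small WNS and still a large TNS when many paths fail by a little, which is why the table reports all three numbers side by side.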


GigaOpt power-driven optimization

Avoids local minima to achieve globally optimal PPA
All transforms are leakage-aware
New concurrent leakage and dynamic power optimization
Up to 50% leakage power reduction

[Figure: power (mWatt) across designs]


GigaOpt MMMC Acceleration

[Figure: full circuit timing graph with CPU1, CPU2 ... CPU<n>; transforms: Gate size, Buffer net, Split gate, Bubble push, Merge gate]

MMMC Dynamic View Compression
Sub-linear speedup with increasing # MMMC views

[Figure: MMMC Acceleration TAT Gain. PreCTS runtime gain (TAT gain, axis 0 to 3.5) and # Setup views (axis 0 to 35) for Design A through Design I. Data labels as extracted: TAT gain 2.4, 2.0, 2.3, 1.3, 1.2, 1.5, 1.7, 1.4, 3.2; setup views 30, 12, 16, 4, 9, 6, 15, 4, 15]

2-3X TAT Gain


Clock Concurrent Optimization (CCOpt)
Natively integrated CCOpt and full-flow CCOpt CTS

[Figure: Netlist-to-GDS flows on a Common Timing Engine. Block labels: GigaOpt, Placer, CCOpt, NanoRoute; Placement, Clock Tree Synthesis, Clock/Data-path Opt, NanoRoute, Post-route Clock ECO]

2-3X faster vs. scripted
Better hold awareness, fence regions, halo and multi-corner support

[Figure: runtime across designs]

1.5X better runtime
5% better freq, 3% better area
15% better WNS, 36% better TNS


Next-generation extraction solution

• Next-generation Cadence® Quantus™ QRC Extraction Solution
− Up to 5X faster performance for single and multi-corner extraction runs
− Scalable to 100s of CPUs/machines
− Best-in-class accuracy/performance down to FinFET
• New random-walk based field solver, Quantus FS
• Fully certified at TSMC for 16nm FinFET


Quantus QRC Extraction outperforms the competition
Up to 5X faster with linear scalability for digital designs

Customer | Node | Size | # of Corners | # of CPUs | Quantus (hrs) | Competition (hrs) | Ratio | Scalability (x2 CPUs)
A | 20nm | 39M | 13 | 32 | 6.0 | 15.0 | 2.5X | 4.3X
B | 20nm | 71M | 1 | 32 | 5.6 | 15.12 | 2.7X | 4.6X
C | 20nm | 17M | 1 | 32 | 2.2 | 6.6 | 3X | 5.1X
D | 28nm | 6.1M | 1 | 32 | 6.7 | 20.1 | 3X |
E | 28nm | 56.8M | 1 | 16 | 16.5 | 72.0 | 4.4X | 7.4X
F | 28nm | 57M | 1 | 16 | 9.5 | 15.5 | 1.6X | 2.8X
G | 28nm | 2.5M | 1 | 2 | 3.1 | 8.0 | 2.6X |
G | 28nm | 2.5M | 4 | 2 | 5.0 | 16.0 | 3.2X |
H | 28nm | 1.3M | 4 | 4 | 0.5 | 6.5 | 13X |
I | 28nm | 52.9M | 5 | 64 | 10.28 | 32.42 | 3.2X | 5.4X
Average of all designs using Quantus QRC Extraction Solution | | | | | | | ~4X | ~5X
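
The Ratio column appears to be the competitor runtime divided by the Quantus runtime. The sketch below recomputes it under that assumption from the hours in the table, taking 3.1 hrs for the first design G row. The `RUNS` list and the `speedups` helper are names introduced here for illustration.

```python
from statistics import mean
from typing import Dict, List, Tuple

# (customer, Quantus hours, competition hours), transcribed from the table.
RUNS: List[Tuple[str, float, float]] = [
    ("A", 6.0, 15.0),
    ("B", 5.6, 15.12),
    ("C", 2.2, 6.6),
    ("D", 6.7, 20.1),
    ("E", 16.5, 72.0),
    ("F", 9.5, 15.5),
    ("G (1 corner)", 3.1, 8.0),
    ("G (4 corners)", 5.0, 16.0),
    ("H", 0.5, 6.5),
    ("I", 10.28, 32.42),
]


def speedups(runs: List[Tuple[str, float, float]]) -> Dict[str, float]:
    """Runtime ratio per run: competition hours / Quantus hours."""
    return {name: competition / quantus for name, quantus, competition in runs}


if __name__ == "__main__":
    ratios = speedups(RUNS)
    for name, ratio in ratios.items():
        print(f"{name}: {ratio:.1f}X")
    print(f"mean ratio: {mean(ratios.values()):.1f}X")
```

Under this reading the per-run ratios match the Ratio column to one decimal place, and their unweighted mean comes out near 3.9X, consistent with the "~4X" average row. The single 13X result for design H pulls that mean up noticeably.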


Tempus Timing Signoff Optimization (TSO) capabilities

Feature Tempus

Built-in signoff delay and SI analysis

Distributed or concurrent MMMC

Physically aware optimization

Legalized / DRC clean placement directives

Hierarchical or flat ECO generation

Optimized MMMC timing graph for fast and high capacity optimization

Graph or path based optimization

Common timing engine within implementation

Power domain aware

Master / Clone support

Tempus TSO Data Flow

[Figure: closure loop between place and route and Tempus TSO. Place and route provides the physical view (LEF/DEF); Tempus TSO runs distributed MMMC delay calculation and STA, then physically aware optimization (hold, DRV, setup, leakage), and returns a physically aware ECO; timing closed in 2-3 iterations]

Input Files
• Technology data
• Design data
• Physical data (can load EDI DB)

Tempus Optimization
• Buffering, Vth Swapping and Sizing
• Hold timing violations
• Setup timing violations
• Design Rule Violations (max_cap/max_tran)
• Leakage power reduction

Output Reports
• Detailed reporting on all ECOs being performed
• Detailed diagnostic report on remaining violations
• Standard format ECO file
• Final timing summary reports

Address the timing closure challenges introduced by increased analysis complexity and capacity!


Setup Fixing – 7x faster, 3x Better QoR

A57 CPU:
• 1.6M instances
• 3 Hold views and 3 Setup views
• High speed core with challenging timing targets

Fixing Mode | Initial Setup (WNS, TNS, # Vp) | Setup fixing (runtime) | Memory usage | # added buffers, # resized instances | Final Setup after PnR (WNS, TNS, # Vp)
Tempus 13.2 | -0.088ns, -99ns, 6474 | 7h | 17Gb | 424 buffers, 47164 resize | -0.088ns, -98ns, 6073
Tempus 14.1 | -0.088ns, -99ns, 6474 | 1h04 | 15Gb | 270 buffers, 11580 resize | -0.088ns, -93ns, 5670

Tempus 14.1 Setup optimization
• many times faster
• more efficient in number of ECOs being done
• leading to equal or better QoR
• using less memory

No impact on Hold timing

7X runtime reduction
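
The runtime and ECO-count claims can be checked with simple arithmetic on the reported figures (7h versus 1h04 of setup fixing, and the buffer and resize counts for each release). The helper names below are introduced for this example only.

```python
def to_minutes(hours: int, minutes: int = 0) -> int:
    """Convert an h:mm runtime to minutes."""
    return 60 * hours + minutes


def runtime_speedup() -> float:
    """Tempus 13.2 setup-fixing runtime (7h) over Tempus 14.1 runtime (1h04)."""
    return to_minutes(7) / to_minutes(1, 4)


def eco_reduction() -> float:
    """Total ECO operations (added buffers + resized instances), 13.2 over 14.1."""
    ecos_13_2 = 424 + 47164
    ecos_14_1 = 270 + 11580
    return ecos_13_2 / ecos_14_1


if __name__ == "__main__":
    print(f"runtime speedup: {runtime_speedup():.2f}x")
    print(f"ECO count ratio: {eco_reduction():.2f}x")
```

The runtime ratio works out to about 6.6x, which the slide rounds to 7X, and the newer release reaches its result with roughly a quarter of the ECO operations.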


A57 QoR ramp-up with v14.1 (2+ GHz ARM)
Leading foundry FinFET node

[Figure: frequency (85%, 90%, 95%, 100% of goal) versus timeline. Timeline marks: Flow setup, Dec 15th, Jan 29th, Tapeout. Phase labels: Focus on performance, Ensure hold closure, Enable MM/MC. Callout: Tapeout at 100% goal]

Collaborate to build towards goal

< 3-month ramp: rapid, predictable convergence

Cadence focusing on ARM PPA leadership in 2014

Conclusion

ARM Cortex-A57 provides continued scaling for high performance and low power

Highest single threaded performance in ARM Cortex-A portfolio

Scalability to 32 cores and beyond with ARM CCN line of products

Additional efficiency for mobile through ARM big.LITTLE combination with ARM Cortex-A53

ARM has successfully implemented Cortex-A57 on 16FFLL

Silicon proven IP available

Latest Cadence EDA tools used for Cortex-A57 implementation on TSMC 16FF node

ARM POP IP offers a comprehensive implementation solution

Accelerated Cortex-A57 core-hardening in 16FFLL

Best in Class PPA

Fastest time-to-market


Q&A
