1
Realizing a High-performance, Power Efficient ARM Cortex-A57 Processor Implementation at 16nm
Aniket M. Saha, Product Manager, ARM CPU Group
Rahul Deokar, Product Director, Cadence Digital
2
ARM® Cortex®-A current portfolio
Performance
High Efficiency
Cortex-A9
Cortex-A15, Cortex-A17, Cortex-A12
Cortex-A57
Cortex-A7, Cortex-A5, Cortex-A53
Cortex-A8
V7-A: Premium performance with mid-range area & power
V8-A, 64-bit: Highest single-thread performance CPU
V7-A: High-performance 32-bit CPU with enterprise-class feature set
Highest-efficiency V8-A CPU: 64-bit support, big.LITTLE compatible
Highest-efficiency V7-A CPU: big.LITTLE compatible
Smallest & lowest power v7-A CPU
3
ARM Cortex-A57 CPU: High-end Product for Mobile and Enterprise
Leading 64-/32-bit performance for mobile applications
Highest single-thread performance today
Out-of-order, multi-issue pipeline
Improved efficiency with latest revision
2+ GHz at under 750 mW in 16nm
Fault tolerance and scalability: ECC support and ARM AMBA® 5 CHI interfaces
Mature: proven in 28nm down to 16nm; multiple platforms tested in silicon
Attractive in new markets: automotive and aerospace applications; industrial and defense applications
Target segments: High-end Smartphone, Mobile Computing, Automotive IVI, Enterprise
4
ARM Cortex-A57: High Performance ARMv8-A mobile processor
Significant advancements in power efficiency: >20% power efficiency improvement from Cortex-A15
Optimal performance for the smartphone power envelope in 16nm FF
big.LITTLE compatible for extended dynamic range of operation
High single-threaded performance for big.LITTLE systems
Low power enabling maximum performance in mobile thermal limit
Large performance increase across integer, memory-streaming and browser benchmarks
[Figure: Premium, Mid-Range and Entry configurations, each built around CCI-400; L2 cache labels as extracted: 2MB/1MB, 1MB/1MB, 512k/1MB]
[Chart: Cortex-A57 efficiency improvement over Cortex-A15; series: Cortex-A15 r2, Cortex-A15 r3, Cortex-A57]
5
ARM Cortex-A57 Market Position
Market | Suggested CPU Configuration | Notes
Premium Mobile | 2-4x Cortex-A57 + 2-4x Cortex-A53 | 64-bit big.LITTLE; performance and efficiency
Digital TV, Home Server, Gaming Consoles | 2-4x Cortex-A57 | General-purpose and media performance; intensive streaming, media, graphics and compute workloads
Wireless Infrastructure | 4/8/16/32 cores with CCN-504 and beyond | 4G, LTE; control plane processing; optimized for many-core
Server | 8/16 cores with CCN-504 | Data tier and application/business tier; highest performance; robust and highly reliable
[Diagrams: CPU clusters with L2 caches connected through a Cache Coherent Network]
6
ARM Cortex-A57 in big.LITTLE
Cortex-A53 (LITTLE)
Simple, in-order, 8-stage pipeline
Performance better than today’s high-end smartphones
Most energy-efficient applications processor from ARM
Cortex-A57 (big)
Complex, out-of-order, multi-issue pipeline
Up to 3x the performance of today’s high-end superphones
Highest performance in mobile power envelope
[Pipeline diagram labels: Queue, Issue, Integer]
7
ARM Artisan® Physical IP for TSMC 16FFLL
Complete Physical IP platform
Alpha IP available; additional alpha and beta releases ongoing; detailed schedule and deliveries are partner-driven
EAC releases start in Summer 2014
Logic Libraries
POP IP Interface
High-Density, High-Performance, Ultra-High Density
Power Management
ECO Kits
Skrymir, ARM Mali™-T760
Cortex-A57/A53
Memory Compilers
9+ Memory Compilers
Multi-Periphery Options
Low Vdd Assist Features
Extensive Feature Set
Preliminary, Platform Content and Features may change
FinFET Optimized
Others: TBD
GPIO 1.8V
8
ARM POP™ IP: Complete Cortex-A CPU & Mali™ GPU Implementation Solution from ARM
Processor Optimized Physical IP – Fast Cache Instances + High Performance Kit
Full Implementation Knowledge Transfer
Customized POP IP Reference Flow Methodology (scripts)
ARM Implementation Support
Full flexibility in CPU configuration
Best PPA, low risk with short time-to-market
9
ARM POP IP for Core-Hardening Acceleration
SoC design is getting more complex
64-bit processors in a smartphone are here
SoC design cycles are getting shorter
A new high-end, trend-setting smartphone or tablet is released every year
NRE cost of SoC design is increasing
More engineering & computing resources needed
You want to minimize your risk!
POP IP delivers a complete solution, including a customized reference flow to make complex designs easy
POP IP includes a comprehensive implementation user guide to transfer knowledge from ARM experts to you
Implementation support is also available
POP IP provides a proven roadmap to achieve your implementation goals
Minimize your risk by leveraging ARM’s implementation expertise
10
ARM Cortex-A57 POP IP Offerings on TSMC 16FFLL
Item | Application #1 | Application #2 | Application #3
Target Market | Low Cost Mobile | High-end Mobile (big)/Enterprise | Power Optimized
Optimization Target | Max Performance | Max Performance With Crypto | Max Performance in a Power Budget
Configuration: No of CPU | MP4 | MP4 | MP4
Configuration: L1/L2 | L2 = 2MB | L2 = 2MB | L2 = 1MB
Configuration: ECC | L1 & L2 | L1 & L2 | L1 & L2
Library Used: Track Height | High Density | High Performance | High Density
Library Used: Vt Options used | SVt, LVt, ULVt | SVt, LVt, ULVt | SVt, LVt
Power Gating | Yes | Yes | Yes
Power Domains | 1 per CPU + 1 for L2D | 1 per CPU + 1 for L2D | 1 per CPU + 1 for L2D
Three different optimized POP IP implementations for Cortex-A57 on 16FFLL
11
ARM Cortex-A57 in ARM big.LITTLE - Silicon Proof
PPA targets and CPU configuration set based on big.LITTLE smartphone application type
Switched power domains and UD/OD supported
Designed to support silicon correlation testing
12
Cortex-A57 configuration
Single 64-bit CPU
L1 data cache 32kB
L1 instruction cache 48kB
L2 cache size 512kB with ECC
AMBA® 4 ACE
Clock-gating power management
Implementation from RTL to GDS in six months
ARM Artisan® standard cells
TSMC memory macros, I/O, phase-locked loop (PLL)
Early flow development to identify and resolve EDA challenges
First milestone in collaboration to optimize ARMv8 designs on TSMC FinFET
First ARM Cortex-A57 on TSMC 16nm FinFET
Made possible by close collaboration
CPU non-CPU
13 © 2014 Cadence Design Systems, Inc. All rights reserved.
Cadence RTL2signoff high-performance design
RCP: physical-aware synthesis; enables predictable correlation to implementation
EDI GigaOpt, CCOpt, GigaPlace: multi-threaded, concurrent, electrical/physical/PPA driven
Tempus, Voltus, Quantus: path-based accuracy, signoff timing/power closure, 10X faster
Consistently faster TTM (weeks vs. months)
Better PPA (20% on average)
[Chart: High-performance (ARM) design benchmarks. Y axis: PPA % gain. Designs: Design 1 (CA57 @20nm), Design 2 (CA57 @16FF), Design 3 (V8 64bit @16FF), Design 4 (CA15 @28nm). Callouts: exceeds 2GHz; 2X better TNS; 17% better power; 12% better power; 18% better utilization; CDNS POR; CDNS POR @16nm]
14
RC/RCP: 16/14nm advanced-node correlation during synthesis
Native extraction, CCS/ECSM, layer (NDR) support
Native R/C extraction: congestion-based capacitance in PhysIOPT
Current source modeling: CCS/ECSM for pin caps and Ceff; accurate correlation to EDIS, Tempus, etc.
Layer modeling: NDR support shared with EDIS; layer support throughout flow
Accurate R/C extraction is required for physically aware timing models at advanced process nodes
Current and layer (NDR) modeling interacts heavily with the route length estimates that dictate timing
[Figure: Metal layer stack by node (180nm, 45nm, 20nm); callout: wider, faster wires]
15
New EDI System GigaPlace: next-generation placement technology
Better PPA, utilization and faster design closure
GigaPlace analytical placement engine
Electrical-driven (slack/MMMC/skew/power)
Optimization-driven (gate sizing/buffering)
Physical-driven (topology/layer/color/pin-access)
Concurrent, multi-objective, massively-parallel algorithm
Integrated and correlated with Tempus and GigaOpt
Advanced node (16/14/10nm) color-aware technology
5% better wirelength, 5% better leakage, 3% better utilization, 2X better TNS
[Diagram labels: Slack, Wire-length, Congestion]
16
GigaPlace Slack-driven Placement
Traditional placement: timing-driven placement
“Lightly” integrated: net weighting passed between placement and timer
Solves:
• Overlap
• Wirelength
Timing ~ wirelength scaling
Wirelength ≠ Slack
• Poor correlation with GigaOpt
GigaPlace: slack-driven placement
“Tightly” integrated: GigaPlace and timer
Solves:
• Overlap
• Wirelength
• Slack
Slack driven by:
Gate delay
False/multi-cycle paths
Layer assignment
Congestion timing effects
Correlates with GigaOpt
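The "Wirelength ≠ Slack" point can be illustrated with a toy model. This is a minimal sketch and not GigaPlace's algorithm: the delay model (a fixed gate delay plus a per-unit wire delay), the function names and every number below are assumptions chosen only for illustration.

```python
def hpwl(pins):
    """Half-perimeter wirelength of one net, given its (x, y) pin locations."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def path_slack(gate_delay, net_pins, required=1.0, wire_delay_per_unit=0.01):
    """Slack = required time - (gate delay + wire delay); a hypothetical linear model."""
    return required - (gate_delay + wire_delay_per_unit * hpwl(net_pins))


def evaluate(placement):
    """Return (total wirelength, worst slack) for a list of (gate_delay, net_pins) paths."""
    total_wl = sum(hpwl(pins) for _, pins in placement)
    worst_slack = min(path_slack(gate, pins) for gate, pins in placement)
    return total_wl, worst_slack


# Two paths: one with fast gates (0.2) and one with slow gates (0.8).
# Placement A gives the long wire to the slow path; placement B swaps the wires.
placement_a = [(0.2, [(0, 0), (10, 0)]), (0.8, [(0, 0), (30, 0)])]
placement_b = [(0.2, [(0, 0), (30, 0)]), (0.8, [(0, 0), (10, 0)])]

wl_a, wns_a = evaluate(placement_a)
wl_b, wns_b = evaluate(placement_b)
print(f"A: wirelength={wl_a}, worst slack={wns_a:+.2f}")
print(f"B: wirelength={wl_b}, worst slack={wns_b:+.2f}")
```

Both placements have the same total wirelength, yet only one meets timing, because the gate delay on each path decides how much wire delay it can absorb. A wirelength-only objective cannot tell the two apart.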
17
GigaPlace customer case study: 2M instances
13.2 “Enhanced” flow (net weighting for slack, region guides for timing) vs. GigaPlace (no net weighting, no region guides)
[Placement plots: 4073 Fanout Cone Reg; Pipeline Reg; Critical Path Reg; To 4033 Reg File]
Post-Route | WNS (r2r / I/O) | TNS (r2r / I/O) | VP (r2r / I/O) | Density | #DRC | Leakage (%LSL)
13.2 enhanced place | -0.16 / -0.27 | -222.7 / -398.0 | 7765 / 11659 | 86 | 324 | 2.02 mW (6.6%)
GigaPlace | -0.04 / -0.14 (**) | -7.2 / -76.6 | 1068 / 2775 | 77.3 | 69 | 1.32 mW (2.7%)
GigaPlace results:
Better pipeline placement
Less module splitting
No placement constraints!
Better TNS / WNS
5.5% better wirelength
10% better density
40% better leakage
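The headline percentages can be checked against the post-route table with a short calculation. This is a sketch under one assumption: that "better" means the relative reduction from the 13.2 enhanced flow to GigaPlace. The helper name `pct_reduction` is hypothetical, and the values are copied from the table.

```python
def pct_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Post-route values from the case-study table (magnitudes for TNS).
baseline = {"density": 86.0, "drc": 324, "leakage_mw": 2.02, "tns_r2r_abs": 222.7}
gigaplace = {"density": 77.3, "drc": 69, "leakage_mw": 1.32, "tns_r2r_abs": 7.2}

reductions = {k: pct_reduction(baseline[k], gigaplace[k]) for k in baseline}
for metric, value in reductions.items():
    print(f"{metric}: {value:.1f}% lower with GigaPlace")
```

By this measure density comes out at about 10%, matching the slide. The leakage row gives roughly 35%, so the 40% headline may be computed on a different basis than the two mW values shown.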
18
GigaOpt power-driven optimization
Avoids local minima to achieve globally optimal PPA
All transforms are leakage-aware
[Chart: mWatt by design]
New concurrent leakage and dynamic power optimization
Up to 50% leakage power reduction
19
GigaOpt MMMC Acceleration
Full circuit timing graph; CPU1, CPU2, ... CPU<n>
Transforms: gate size, buffer net, split gate, bubble push, merge gate
MMMC Dynamic View Compression
Sub-linear speedup with increasing # MMMC views
[Chart: MMMC Acceleration TAT Gain. PreCTS runtime gain (TAT gain, axis 0 to 3.5) and # setup views (axis 0 to 35) for Design A through Design I. Data labels in extraction order: TAT gain 2.4, 2.0, 2.3, 1.3, 1.2, 1.5, 1.7, 1.4, 3.2; setup views 30, 12, 16, 4, 9, 6, 15, 4, 15]
2-3X TAT Gain
20
Clock Concurrent Optimization (CCOpt)
Natively integrated CCOpt and full-flow CCOpt CTS
[Flow diagram: Netlist, Placement (GigaOpt Placer), Clock Tree Synthesis and Clock/Data-path Opt (CCOpt), NanoRoute, Post-route Clock ECO, NanoRoute, GDS; Common Timing Engine shared across steps]
2-3X faster vs. scripted
Better hold awareness, fence regions, halo and multi-corner support
[Chart: Runtime by design; 1.5X better runtime]
5% better freq, 3% better area
15% better WNS, 36% better TNS
21
Next-generation extraction solution
• Next-generation Cadence® Quantus™ QRC Extraction Solution
− Up to 5X faster performance for single and multi-corner extraction runs
− Scalable to 100s of CPUs/machines
− Best-in-class accuracy and performance, down to FinFET
• New random-walk based field solver, Quantus FS
• Fully certified at TSMC for 16nm FinFET
22
Quantus QRC Extraction outperforms the competition
Up to 5X faster with linear scalability for digital designs
Customer | Node | Size | # of Corners | # of CPUs | Quantus (hrs) | Competition (hrs) | Ratio | Scalability (x2 CPUs)
A | 20nm | 39M | 13 | 32 | 6.0 | 15.0 | 2.5X | 4.3X
B | 20nm | 71M | 1 | 32 | 5.6 | 15.12 | 2.7X | 4.6X
C | 20nm | 17M | 1 | 32 | 2.2 | 6.6 | 3X | 5.1X
D | 28nm | 6.1M | 1 | 32 | 6.7 | 20.1 | 3X |
E | 28nm | 56.8M | 1 | 16 | 16.5 | 72.0 | 4.4X | 7.4X
F | 28nm | 57M | 1 | 16 | 9.5 | 15.5 | 1.6X | 2.8X
G | 28nm | 2.5M | 1 | 2 | 3.1 | 8.0 | 2.6X |
  |  |  | 4 | 2 | 5.0 | 16.0 | 3.2X |
H | 28nm | 1.3M | 4 | 4 | 0.5 | 6.5 | 13X |
I | 28nm | 52.9M | 5 | 64 | 10.28 | 32.42 | 3.2X | 5.4X
Average of all designs using Quantus QRC Extraction Solution | | | | | | | ~4X | ~5X
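The Ratio column is the competition runtime divided by the Quantus runtime. A minimal sketch of that arithmetic follows, with the runtimes copied from the table. Reading customer G's first Quantus runtime as 3.1 hours (the source shows "3..1") is an assumption, and the variable names are hypothetical.

```python
from statistics import mean, median

# (customer, Quantus hours, competition hours), copied from the table above.
runs = [
    ("A", 6.0, 15.0), ("B", 5.6, 15.12), ("C", 2.2, 6.6), ("D", 6.7, 20.1),
    ("E", 16.5, 72.0), ("F", 9.5, 15.5),
    ("G, 1 corner", 3.1, 8.0),   # 3.1 is an assumed reading of "3..1"
    ("G, 4 corners", 5.0, 16.0),
    ("H", 0.5, 6.5), ("I", 10.28, 32.42),
]

# Speedup ratio per run: competition runtime over Quantus runtime.
ratios = {name: competition / quantus for name, quantus, competition in runs}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}X")

mean_ratio = mean(ratios.values())
median_ratio = median(ratios.values())
print(f"mean {mean_ratio:.2f}X, median {median_ratio:.2f}X")
```

The mean lands close to the slide's ~4X average. The 13X run for customer H pulls it up, which is why the median comes out lower.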
23
Tempus Timing Signoff Optimization (TSO) capabilities
Feature Tempus
Built-in signoff delay and SI analysis
Distributed or concurrent MMMC
Physically aware optimization
Legalized / DRC clean placement directives
Hierarchical or flat ECO generation
Optimized MMMC timing graph for fast and high capacity optimization
Graph or path based optimization
Common timing engine within implementation
Power domain aware
Master / Clone support
[Diagram: Tempus TSO loop. Place and route; physical view (LEF/DEF); distributed MMMC delay calculation and STA; physically aware optimization (hold, DRV, setup, leakage); physically aware ECO; 2-3 iterations; timing closed]
Input Files
• Technology data
• Design data
• Physical data
(Can load EDI DB)
Tempus Optimization
• Buffering, Vth Swapping and Sizing
• Hold timing violations
• Setup timing violations
• Design Rule Violations (max_cap/max_tran)
• Leakage power reduction
Output Reports
• Detailed reporting on all ECOs being performed
• Detailed diagnostic report on remaining violations
• Standard format ECO file
• Final timing summary reports
Tempus TSO Data Flow
Addresses the timing closure challenges introduced by increased analysis complexity and capacity!
24
Setup Fixing – 7x faster, 3x Better QoR
A57 CPU:
• 1.6M instances
• 3 Hold views and 3 Setup views
• High speed core with challenging timing targets
Fixing Mode | Initial Setup (WNS, TNS, # Vp) | Setup fixing (runtime) | Memory usage | # added buffers, # resized instances | Final Setup after PnR (WNS, TNS, # Vp)
Tempus 13.2 | -0.088ns, -99ns, 6474 | 7h | 17Gb | 424 buffers, 47164 resize | -0.088ns, -98ns, 6073
Tempus 14.1 | -0.088ns, -99ns, 6474 | 1h04 | 15Gb | 270 buffers, 11580 resize | -0.088ns, -93ns, 5670
Tempus 14.1 setup optimization
• many times faster
• more efficient in number of ECOs being done
• leading to equal or better QoR
• using less memory
No impact on Hold timing
7X runtime reduction
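The runtime and ECO-count claims follow from the table by simple division. This is a minimal sketch that assumes "7h" means exactly 7 hours and "1h04" means 1 hour 4 minutes; the helper and variable names are hypothetical.

```python
def to_minutes(hours, minutes=0):
    """Convert an hours/minutes runtime into minutes."""
    return 60 * hours + minutes


# Setup-fixing runtimes from the table (assumed readings of "7h" and "1h04").
runtime_13_2 = to_minutes(7)
runtime_14_1 = to_minutes(1, 4)
speedup = runtime_13_2 / runtime_14_1

# Total ECO operations: added buffers plus resized instances.
ecos_13_2 = 424 + 47164
ecos_14_1 = 270 + 11580
eco_ratio = ecos_13_2 / ecos_14_1

print(f"runtime speedup: {speedup:.2f}X")
print(f"ECO operations: {ecos_13_2} vs {ecos_14_1} ({eco_ratio:.2f}X fewer)")
```

Under these readings the speedup is a little over 6.5X, which the slide rounds to 7X, and Tempus 14.1 applies roughly a quarter as many ECO operations.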
25
A57 QoR ramp-up with v14.1 (2+ GHz ARM)
Leading foundry FinFET node
[Chart: Freq vs. Timeline. Milestones: 85%, 90%, 95%, 100%; dates marked: flow setup, Dec 15th, Jan 29th, tapeout at 100% goal. Phase labels: Focus on performance, Ensure hold closure, Enable MM/MC]
Collaborate to build towards goal
< 3-month ramp: rapid, predictable convergence
Cadence focusing on ARM PPA leadership in 2014
26
Conclusion
ARM Cortex-A57 provides continued scaling for high performance and low power
Highest single threaded performance in ARM Cortex-A portfolio
Scalability to 32 cores and beyond with ARM CCN line of products
Additional efficiency for mobile through ARM big.LITTLE combination with ARM Cortex-A53
ARM has successfully implemented Cortex-A57 on 16FFLL
Silicon proven IP available
Latest Cadence EDA tools used for Cortex-A57 implementation on TSMC 16FF node
ARM POP IP offers a comprehensive implementation solution
Accelerated Cortex-A57 core-hardening in 16FFLL
Best in Class PPA
Fastest time-to-market
27
Q&A