pushing 3ghz performance of 7nm arm neoverse n1 cpu using … · 2019. 10. 22. · neoverse n1 cpu...
Pushing 3GHz Performance of 7nm Arm Neoverse N1 CPU using the Cadence Digital Flow
2 © 2019 Cadence Design Systems, Inc. All rights reserved.
Agenda
• Arm® Neoverse™ N1 CPU - Introduction and Context
• Arm Neoverse N1 CPU Implementation Overview in SoC Tapeout
• Advanced Implementation/Signoff Techniques
• Neoverse N1 CPU Rapid Adoption Kit Overview
• Stylus Infrastructure
• Summary and Conclusions
• Q & A
© 2018 Arm Limited
1st Generation System
Scalable High-End Compute from Edge, to Cloud Hyperscale, to Supercomputing
The cloud-to-edge foundation for a world of 1 trillion connected devices
[Figure: CPU roadmap. Recoverable labels: 2014-2017; 2018; Arm Cortex®-A72; Arm Cortex-A75; Cortex-A55; Cosmos; Neoverse N1 CPU; Neoverse N1 CPU 2.8GHz; Neoverse E1 CPU; Data plane optimized; SMT; Cloud to edge performance and efficiency; DDR4; 7nm]
Neoverse N1 CPU – Transforming Hyperscale and Edge Compute
Configuration          16nm            7nm
Performance            36.4 SPECint2k6 1T @ 2.8GHz, 64C system
Freq (Vnom-Vmax)       2.5GHz – 3.1GHz
Power (core)           319 mW/GHz      195 mW/GHz
Ptot (2.6GHz, 105C)    1450 mW         650 mW
Area (core, 1MB)       3.64 mm²        1.47 mm²

Notes:
64C System: CMN-600, 32xMP2, 1MB L2 @ 2.8GHz, 64MB SLC @ 2GHz, 8xDDR4-3200
Performance: bare-metal SimPts emulation measurement in 64C system
Freq: 16nm Vnom=0.8V, Vmax=1V; 7nm Vnom=0.75V, Vmax=1V
Power: AArch64 Dhrystone dynamic power, leakage at Tj=105C
Area: post-scaled in mm²
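As a quick worked check of the 16nm versus 7nm comparison, the sketch below computes the percentage reduction for the three rows that list a value for both nodes. The numbers are copied from the table; the dictionary keys and the helper name are illustrative only.

```python
# Values copied from the 16nm vs 7nm comparison table: (16nm, 7nm).
metrics = {
    "power_mw_per_ghz": (319.0, 195.0),
    "ptot_mw_2p6ghz_105c": (1450.0, 650.0),
    "area_mm2_core_1mb": (3.64, 1.47),
}

def reduction_pct(value_16nm, value_7nm):
    """Percentage reduction going from the 16nm value to the 7nm value."""
    return (1.0 - value_7nm / value_16nm) * 100.0

reductions = {name: reduction_pct(*pair) for name, pair in metrics.items()}

for name, pct in reductions.items():
    print(f"{name}: {pct:.1f}% lower at 7nm")
```

Under these figures the 7nm core draws roughly 39% less power per GHz, about 55% less total power at 2.6GHz and 105C, and occupies about 60% less area.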
[Block diagram: Arm Neoverse N1 CPU Processor. Recoverable labels:]
• Core 1: Armv8.2-A 32b/64b CPU; NEON™ SIMD engine 2x128b; Crypto extensions; 64k I-Cache w/parity; 64k D-Cache w/ECC
• Private L2 cache (512kB~1MB) w/ECC
• DSU
• SCU
• ACP
• Peripheral Port
• Optional Shared L3 (1, 2, 4MB) w
• Async Bridges
• ARM CoreSight™ Multicore Debug and Trace
• 256b AMBA5 CHI / 2x128b AMBA4 ACE
• Direct Connect to CMN®-600 Mesh CHI
Arm’s 1st generation purpose-built server class CPU
• Delivering 3X compute density advantage over x86 and +25% efficiency gain over Cortex-A72/A75
• New uArch introducing I-cache coherency, 1MB private L2, and direct connect to CMN-600 mesh
• Full Armv8.2 architectural support with SPE, 48b PA
Linux comparison, similar DRAM bandwidth per thread resourcing, turbo off, using GCC7 compiler
Neoverse N1 CPU Introduces Direct-Connect Connectivity
• Targeted at systems which benefit from runtime allocation of a shared system cache, and a large low latency L2 cache private to each core
• Removes shared L3 cluster cache rather than provisioning cache to small subsets of cores
• Reduces load-to-use DRAM latency (3.5ns for 2GHz CMN-600 system)
• Neoverse N1 CPU adds a larger 1MB private L2 cache option with 11-cycle load-to-use, over the 512kB option with 9-cycle load-to-use
• Supports CHI.C protocol which optimizes system buffer usage in the CMN-600 mesh
• MP2 cluster is deployed in 1st generation of Neoverse N1 SoC
[Figure: MP Clustering versus Direct Connect. Recoverable labels: Shared L3$, SF; MP2 Cluster; MP1; CHI; CMN-600 CAL; System Cache Slice]
CAL = Core Aggregation Layer (CMN-600 product component)
Copyright © 2019 Arm TechCon, All rights reserved.
Neoverse N1 CPU Implementation Overview in SoC Tapeout
Neoverse N1 CPU Implementation Closure
Area: 1.3mm x 1.24mm, 1MB L2
Performance across corners for Neoverse N1 CPU
Voltage   Process   Temp   Extraction           Target (GHz)   Achieved (GHz, PBA)
1.0V      TT        85C    typical              3.0            MET
0.9V      SSGNP     125C   Cworst_CCworst_T     2.2            MET
                           RCworst_CCworst_T    2.2            MET
0.75V     TT        85C    typical              2.3            2.27
0.675V    SSGNP     125C   Cworst_CCworst_T     1.7            1.68
                           RCworst_CCworst_T    1.7            1.69
Timing at the ssgnp-m40c-nom corners was also analyzed and comfortably met the 1.5GHz baseline target.
TSMC 7FF PDK 1.0, H360 library, ULVT-C8 with multi-vt backfill
Metal Stack: 13m_1x1xa1ya5y2yy2z, using up to M11 for signal routing
Cadence Tempus PBA timing
Dynamic Voltage Frequency Scaling
Unswitched power grid
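To put the three corners that fell slightly short of target into timing terms, the sketch below converts each target and achieved frequency into a clock period and reports the difference in picoseconds. The frequencies come from the corner table; reading the difference as the remaining shortfall on the limiting path is an assumption of this example, not a figure from the slides.

```python
# (corner, target GHz, achieved GHz) for the corners not marked MET.
corners = [
    ("0.75V TT 85C typical", 2.3, 2.27),
    ("0.675V SSGNP 125C Cworst_CCworst_T", 1.7, 1.68),
    ("0.675V SSGNP 125C RCworst_CCworst_T", 1.7, 1.69),
]

def period_gap_ps(target_ghz, achieved_ghz):
    """Extra clock period (ps) needed at the achieved frequency versus the target."""
    return 1000.0 / achieved_ghz - 1000.0 / target_ghz

gaps = {name: period_gap_ps(target, achieved) for name, target, achieved in corners}

for name, gap in gaps.items():
    print(f"{name}: {gap:.1f} ps short of the target period")
```

All three gaps are only a few picoseconds, which is consistent with the corners being reported as near misses rather than failures.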
Month        Description         Freq      Notes
T0           Initial runs        2.5GHz    Most modules were critical
T0+1         Floorplan updates   2.6GHz    Finalized floorplan based on CPG squarish floorplan
T0+2         Fences/Regions      2.75GHz   Similar set of fences and regions to CPG
T0+3         Final Build         2.8GHz    Flow tweaks; changes to optimization iterations etc.
T0+4 (mid)   Out of Innovus      2.91GHz   20 paths to 3GHz
T0+4 (end)   Final TOC db        3.0GHz    Hand ECOs, mainly fixing SI issues that provided the uplift
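The monthly progression can also be read as clock period recovered at each milestone. The following sketch uses only the frequencies from the schedule table; the variable names are illustrative.

```python
# (milestone, frequency in GHz) from the schedule table.
milestones = [
    ("T0", 2.5),
    ("T0+1", 2.6),
    ("T0+2", 2.75),
    ("T0+3", 2.8),
    ("T0+4 (mid)", 2.91),
    ("T0+4 (end)", 3.0),
]

def period_ps(freq_ghz):
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz

# Period recovered between consecutive milestones.
steps = [
    (name, period_ps(prev_freq) - period_ps(freq))
    for (_, prev_freq), (name, freq) in zip(milestones, milestones[1:])
]

total_gain_pct = (milestones[-1][1] / milestones[0][1] - 1.0) * 100.0
total_period_cut_ps = period_ps(milestones[0][1]) - period_ps(milestones[-1][1])

for name, cut in steps:
    print(f"{name}: period shortened by {cut:.1f} ps")
print(f"overall: +{total_gain_pct:.0f}% frequency, {total_period_cut_ps:.1f} ps shorter period")
```

Going from 2.5GHz to 3.0GHz is a 20% frequency uplift, which corresponds to removing about 67 ps from a 400 ps cycle.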
Neoverse N1 CPU Implementation Flow
[Flow diagram, recovered as a list of stages]
Inputs: Design RTL, Libraries, LEF Files, QRC Tech Files, DEF/Floorplan, UPF, SDC, Bounds, Blockages
• Synthesis: RTL Synthesis to Gates (Cadence Genus™)
– Scan Insertion/Compression
– Incremental Optimization
– Mapping/Multibit Flop Insertion
– Multibit Level Shifter Insertion
– Output: Netlist/ScanDEF
• Place and Route Setup (Cadence Innovus™)
– Power Grid Construction
– Floorplan Initialization
– MMMC Setup
– Additional Constraints: Route Guides, NDR, CTS Constraints
• Placement
– Timing-Aware Placement
– Scan Reordering
– Early Clock Flow
– PRECTS Opt with Useful Skew
• Clock Tree Synthesis
– CTS with Useful Skew Enabled
– Clock Tree Routing w/via pillar insertion
– POSTCTS Setup Opt
– POSTCTS Hold Fixing
• Signal Routing
– Timing-aware Signal Routing
• Post-Route Optimization
– AVMF-enabled Post-route Setup Opt (Incremental Post-route Optimization)
– AVMF-enabled Post-route Hold Fixing
– ECO Via-Pillar insertion
• Signoff RC Extraction (Cadence Quantus™)
– Inputs: Libraries, LEF files
– Multi-Corner RC Extraction
• STA (Cadence Tempus)
– Input: Signoff Constraints (SDC)
– MCMM
– STA Signoff Settings
– Graph-Based STA
– Path-Based STA on Failing Paths
• Path-Based Signoff Optimization/ECO (Cadence Innovus/Tempus™)
– Setup, Hold, DRV, Power Recovery
Data passed between stages: SPEF, Netlist, DEF, ECO DB
Outputs:
• Timing/Power Optimized Netlist
• SDF, Timing, and Physical Models
• Ready for downstream signoff analysis and tapeout closure steps (IR/EM, PV, annotated sim, functional ECO etc.)
All 7nm Arm Cortex-A76 uArch and flow recipe collateral is applicable to the Neoverse N1 CPU, but some more advanced techniques were introduced to expedite the aggressive frequency push and reach tapeout closure within 5 months.
Advanced Implementation and Signoff Techniques
7nm Key Challenge – Mitigating Metal and Via Resistance
• Metal and via resistance is significantly higher than in 16FF, especially on the double pattern layers
– Affects both the signal routing and power grid design
• NDRs on double pattern layers are too expensive for signal routing. Must be used for power grid only.
– Ex: 40nm M3 blocks 5 tracks
• Critical nets must get to non-double pattern layers as quickly as possible
– Cadence GigaOpt™ will layer promote critical nets preroute
• Moving up layers does incur a penalty because of the via resistance
– 7nm introduces newly constructed via pillar
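The reason a via pillar reduces the layer-promotion penalty can be sketched with simple resistor arithmetic: several via cuts in parallel at each level divide the resistance of that level. The snippet below is a first-order model only. The per-cut resistance, the number of levels and the number of parallel cuts are hypothetical values chosen for illustration, and the metal segments between vias are ignored.

```python
def via_stack_resistance(r_via_ohm, levels, vias_per_level=1):
    """Resistance of a via stack: `levels` via levels in series, each level
    built from `vias_per_level` identical cuts in parallel."""
    if levels < 1 or vias_per_level < 1:
        raise ValueError("levels and vias_per_level must be at least 1")
    return levels * r_via_ohm / vias_per_level

R_VIA_OHM = 20.0   # hypothetical resistance of one via cut
LEVELS = 4         # hypothetical number of via levels to reach an upper layer

single_stack = via_stack_resistance(R_VIA_OHM, LEVELS)
pillar = via_stack_resistance(R_VIA_OHM, LEVELS, vias_per_level=4)

print(f"single via stack: {single_stack:.0f} ohm")
print(f"via pillar (4 cuts per level): {pillar:.0f} ohm")
```

With these made-up numbers the pillar cuts the stack resistance by the number of parallel cuts, and the current through each cut falls by the same factor.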
7nm Power Grid Optimization for High-Performance Core
Opportunistic post-route M1 stitch insertion is introduced for HP designs like the Neoverse N1 CPU. The Innovus™ Implementation System supports automated stitch insertion, which was used here.
Merits
• Cells have more placement flexibility because M3 straps are disconnected from the stitch locations, so M3 straps can be placed much closer together to enable the tightest grid
• IR drop is substantially improved for 3+GHz designs
• Disjoint M2 strap takes advantage of 7FF EM rules
7nm Key Challenge – Signal EM Limitation for High-Speed Core
• M0, M1, and M2 have EM issues
• M0 affects the density of the power grid
• M2 EM issues restricted the usage of the large drive strength cells that the Neoverse N1 CPU needs for performance uplift
• Redundant M1/2 output pins are specifically created in these cells to limit the max current in M0 when used together along with …...
• A via pillar alleviates the noticeable signal EM in HPC designs by defining a stacked via array, using default shapes, from the disjoint output pins to the upper metal layers
• Defined in LEF format and Tcl commands to enable its use inside Innovus™
– Can be assigned to specific lib cells, route types (i.e. clocks), or instances
– Can be automatically inserted based on timing criticality and routability
• Although pre-route optimization understands via pillars and can optimize against them, deploying them only for clock routing or post-route data ECO is recommended
Best Practice of Via Pillar Enablement for CPU Clock
• Via pillar applied on top clock net only
– Expand the usable buffer list to include much larger driving strength accordingly
– Cell padding may be necessary
– NDR and shielding for upper layers are needed as usual
Post-route comparison (without via pillar vs. with via pillar)
• Comparable average clock latency
• Marginally increased max latency
• Substantially reduced local skew and clock cell edge rate
• Improved post-route timing metrics
Sophisticated Electromigration Control
Arm-proprietary EMG (EM model generator) provides an evolved methodology that controls signal electromigration more elegantly and justifies the design margin
• The Neoverse N1 CPU and SoC is the first 7FF test chip to deploy it successfully in both the implementation and signoff stages
• Fully supported by both the Innovus™ and Tempus™ solutions for DRV reporting and fixing
Arm-supplied methods to control electromigration in a design

Option (1): Liberty max_capacitance
Description:
• Keeps load within timing tables across PVT corners
• Timing tables indexed by input transition time (the same for all cells at a given PVT) and output load (capacitance)
Measure: Transition <= max index
Advantages:
• Ease of use
Disadvantages:
• Timing constraint, not EM constraint
• Values set indirectly via max transition per architecture, per PVT
• Not frequency or design specific

Option (2): AF*DF tables in User Guide
Description:
• Keeps below wire EM limit in worst cell, constrained by transition time limit defined as a fraction of the clock period
• Users select a max transition based on their design-specific voltage domain, Vt and L
Measure: Javg,peak,rms <= EM limit
Advantages:
• More accurate than #1
• Can save area and power
Disadvantages:
• Pessimism can be significant for low-drive cells
• May also turn out to be optimistic for high-drive cells at maximized frequency if the applied fraction is inadequately eased
• Large lookup table referenced to determine appropriate max transition across corners

Option (3): (option name not captured in the transcript)
Description:
• Keeps load below wire EM limit in each cell, indexed by frequency
• Removes design-level constraint on output edge rate due to EM
Measure: Javg,peak,rms <= EM limit
Advantages:
• Most accurate and design-specific
• Maximum area and power savings
Disadvantages:
• User must run utility for all NLM, CCS, ECSM PVTs needed for their design
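Option (2) constrains the transition time to a fraction of the clock period. As a minimal sketch of that arithmetic, the snippet below derives a max transition for two of the target frequencies mentioned earlier in the deck. The fraction used here (0.15) is a hypothetical placeholder, since the real value comes from the AF*DF tables and is not given in the slides.

```python
def max_transition_ps(freq_ghz, fraction):
    """Max transition (ps) expressed as a fraction of the clock period."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1")
    return fraction * 1000.0 / freq_ghz

FRACTION = 0.15  # hypothetical; the real fraction is looked up per voltage domain, Vt and L

limits = {freq: max_transition_ps(freq, FRACTION) for freq in (2.2, 3.0)}

for freq, limit in limits.items():
    print(f"{freq} GHz: max transition {limit:.1f} ps")
```

The same fraction yields a tighter absolute limit at higher frequency, which is one way to see why a design-level transition limit becomes restrictive as the clock is pushed toward 3GHz.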
Extraction Correlation: Fill-Less - AVMF - Real Colored Fill
• AVMF (Advanced Virtualized Metal Fill) is introduced to shrink the noticeable 7nm parasitic RC miscorrelation that comes from color-agnostic extraction with fill
• Mature support in Innovus™ post-route optimization based on signoff extraction
• Successfully deployed in the Neoverse N1 CPU implementation for the 7FF SoC tapeout, effectively reducing timing ECO iterations
Neoverse N1 CPU Power Integrity Recap
[Figure: Static IR heatmap/histogram]
Power Analysis

Analysis Type: Static IR
Results:
• Clean under 4% DIV threshold
Notes:
• 2.17% effective drop at MP2 cluster level
• Using 3x3 plocs to emulate SoC-level FC bump distribution

Analysis Type: Vector-based Dynamic IR
Results:
• Clean under 8% worstavg threshold
• 17% worst instantaneous drop over Neoverse N1 CPU with max_pwr64 vector on FFG corner
• 16.45% worst instantaneous drop over Neoverse N1 CPU with Dhrystone64 vector on FFG corner
Notes:
• Memory internal violations are deemed bogus (worked around by less layer demotion)
• A handful of clock cells are out of bound on worst instantaneous drop: 14% for max_pwr64 and 10.5% for dhrystone64

Analysis Type: Power EM
Results: Clean

Analysis Type: Signal EM
Results: Clean
Notes:
• No Avg/Peak violation with 3-year continuous operation
• A handful of worst RMS violations, under the 150% limit
• Temperature: IR@125c, EM@105c
• Process Corner: Power Maximum FFG
• Voltage: 0.825v/1.05v/1.8v (VDD/VDD_CPU/VDDA for TPIP PVT sensor)
• RC Corner: typical
• The analysis is performed at DSU MP2 cluster level and further validated in SoC-level with the actual package RLC
• 8% on-die EIV threshold for TWF-based worstavg and 10% for worst instantaneous drop on clock cells
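The percentages above can be turned into millivolts and checked against the stated thresholds. The sketch below does this for the results that have an explicit limit in the slides (4% for static IR, 10% for worst instantaneous drop on clock cells). Referencing every percentage to the 1.05V VDD_CPU rail is an assumption of this example.

```python
VDD_CPU_V = 1.05  # assumed reference rail for all percentages in this sketch

# (check, measured drop %, threshold %) taken from the power integrity recap.
checks = [
    ("static IR, MP2 cluster", 2.17, 4.0),
    ("clock cells, max_pwr64, worst instantaneous", 14.0, 10.0),
    ("clock cells, dhrystone64, worst instantaneous", 10.5, 10.0),
]

def drop_mv(percent, supply_v=VDD_CPU_V):
    """Convert a percentage drop into millivolts on the given supply."""
    return percent / 100.0 * supply_v * 1000.0

report = [(name, drop_mv(pct), pct <= limit) for name, pct, limit in checks]

for name, millivolts, within_limit in report:
    status = "within limit" if within_limit else "over limit"
    print(f"{name}: {millivolts:.1f} mV ({status})")
```

On that assumption the static drop is under 23 mV, while the two clock-cell outliers sit above the 10% bound, matching the "out of bound" note in the table.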
Dynamic Rail Analysis
[Figures: Dhrystone64 heatmap/histogram; Max power 64 heatmap/histogram]
More Signoff Gimmicks
• IR drop impact on timing can cause 7nm silicon failures
• Extra blanket OCV derates account for voltage and temperature variation. They are derived from the SOCV lookup table, based on the worst average instance-based dynamic on-die voltage threshold at the power dominant corner, with interpolation for the STA corners, in lieu of static IR
The advancement of SOCV statistical wire derating
• Similar in context, in that voltage outside the characterized PVT bound should not be derated; it is irrational to derate outside the range of the maximum global-local shifted extraction
– A different single-sided BEOL derate should be applied to these extraction corners, specifically for hold analysis only
• Double-sided wire derates are still necessary with the remaining global and global-local shift tightened corners, in order to account for local on-die parasitic differences between launch/data and capture paths
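To illustrate what a double-sided derate does to a timing check, the sketch below applies a late factor to the launch/data path and an early factor to the capture path, which is the textbook flat-OCV form of setup and hold slack. It is not the SOCV statistical method described above, and every delay and derate value is hypothetical.

```python
def setup_slack_ps(period, launch_clk, data, capture_clk, t_setup, late=1.0, early=1.0):
    """Setup slack with the launch/data path slowed (late) and the capture path sped up (early)."""
    arrival = (launch_clk + data) * late
    required = period + capture_clk * early - t_setup
    return required - arrival

def hold_slack_ps(launch_clk, data, capture_clk, t_hold, late=1.0, early=1.0):
    """Hold slack with the launch/data path sped up (early) and the capture path slowed (late)."""
    arrival = (launch_clk + data) * early
    required = capture_clk * late + t_hold
    return arrival - required

PERIOD_PS = 1000.0 / 3.0   # 3GHz clock
LATE, EARLY = 1.05, 0.95   # hypothetical double-sided derates

# Hypothetical long path for setup and short path for hold, all delays in ps.
setup_nominal = setup_slack_ps(PERIOD_PS, 100.0, 200.0, 100.0, 10.0)
setup_derated = setup_slack_ps(PERIOD_PS, 100.0, 200.0, 100.0, 10.0, LATE, EARLY)
hold_nominal = hold_slack_ps(100.0, 20.0, 100.0, 5.0)
hold_derated = hold_slack_ps(100.0, 20.0, 100.0, 5.0, LATE, EARLY)

print(f"setup slack: {setup_nominal:.1f} ps nominal, {setup_derated:.1f} ps derated")
print(f"hold slack:  {hold_nominal:.1f} ps nominal, {hold_derated:.1f} ps derated")
```

In this toy case a 5% double-sided derate costs 20 ps of setup slack and 11 ps of hold slack, which shows why derating beyond what the extraction corners already cover is expensive at a 333 ps cycle.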
Challenge / Key Takeaways
Technology 7nm
• Developing libraries/flows on new node
• SI impact on signals and clocks even with NDR
• IR drop closure and more accurate impact analysis on timing
• Static timing and margining revisited
Constraint verification/management
• Clock-domain crossings (CDCs)
– No effective approach was available to check and fix failing CDC paths for the Neoverse N1 CPU bridge with the MP2 DSU cluster; convoluted scripting was unavoidable
• Modal vs modeless constraints
– Validating the modeless CPU constraints carried considerable work overhead because of DFT complexity
Neoverse N1 CPU Rapid Adoption Kit Overview
What is a Cadence Rapid Adoption Kit?
• Complete Cadence RTL-to-GDS digital implementation flow
– Example flow scripts
– Example floorplan
– Application note explaining how to set up the RAK
– Application notes showing how the flow works
• Customized to deliver the power and performance goals associated with specific Arm IP
[Diagram: RAK contents feeding the RTL-to-GDS Flow]
RAK inputs: IP RTL, IP Cons, Cell Lib, Example Floorplan, Example Flow, App Notes
Tools in the RTL-to-GDS Flow:
• Genus™ (RTL Synthesis)
• Innovus™ (Implementation)
• Tempus™/Quantus™ (Signoff STA)
• Pegasus™ (LVS/DRC/DFM)
• Conformal® (Formal Eq)
• Modus (BIST/ATPG)
• Joules™ (RTL Power)
• Voltus™ (Sign-Off IR/EM)
Stylus: Basis for Neoverse N1 CPU RAK
• Enables script reuse
• Consistent behavior for core commands
• Key reports and metrics share similar formats
• Improve data exchange between products through common objects and commands
Streamline synthesis to signoff flow by providing a consistent set of commands,
objects, reports, metrics, and flows across tools
[Diagram: a Common User Interface spanning the Design Creation, Design Implementation and Signoff tools]
• Innovus™ Implementation System
• Stratus™ High Level Synthesis
• Genus™ RTL Synthesis
• Conformal® LEC, ECO, LP
• Modus™ Test Solution
• Joules™ RTL Power
• Quantus™ Signoff Extraction
• Tempus™ Signoff STA
• Pegasus™ Verification System
• Voltus™ Signoff Power
Stylus: Components
• Common User Interface (Common UI): improved usability and productivity
– Common Commands Across Tools
– Common Reports and Logs
– Common Initialization Sequences
– Common Database Access
– Common GUI Across Tools
• Unified Metrics: holistic metrics from synthesis to signoff
• Flow Kits: quickly capture and deploy digital flows
– Flow Process
– Tool Directives
– Design Inputs
Describing Flow Content with YAML
• Base Recipe
– Flow content required to create a DRC design (base + wire)
– Physical: floorplanning, PG insertion, place, routing, metal fill, via fill
– STA
• PPA Recipe
– Synthesis optimizations
– Early Clock Flow CTS techniques
– Route optimizations
– Skew optimizations
• Tool options via a flow step are inserted into the base recipe to improve the PPA
[Diagram labels: Base Recipe @ Process Node; PPA Recipe; Extreme Effort]
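The layering of a PPA recipe on top of a base recipe can be pictured as a per-step overlay. The sketch below is a conceptual illustration in Python, not the Stylus YAML schema. The step names follow the flow steps used elsewhere in the deck, and the option strings are placeholders rather than real tool options.

```python
from copy import deepcopy

# Base recipe: every flow step exists, with no extra tool options yet.
base_recipe = {
    "init_floorplan": [],
    "place": [],
    "cts": [],
    "route": [],
    "postroute": [],
}

# PPA recipe: placeholder option labels attached to the steps they tune.
ppa_recipe = {
    "place": ["placeholder_early_clock_flow"],
    "cts": ["placeholder_useful_skew"],
    "route": ["placeholder_route_optimization"],
}

def apply_ppa_recipe(base, ppa):
    """Return a new recipe with the PPA options appended to the matching base steps."""
    merged = deepcopy(base)
    for step, options in ppa.items():
        if step not in merged:
            raise KeyError(f"PPA recipe targets unknown step: {step}")
        merged[step].extend(options)
    return merged

tuned_recipe = apply_ppa_recipe(base_recipe, ppa_recipe)

for step, options in tuned_recipe.items():
    print(step, options)
```

Keeping the base recipe untouched and producing a merged copy mirrors the idea that the base flow stays valid for the process node while PPA tuning is applied on top.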
flowtool(submit flow to run)
# Each flow step wraps the tool commands for one stage
create_flow_step -name init_floorplan {
  read_def [get_flow_config def_files]
}
create_flow_step -name place {
  place_opt_design
}
create_flow_step -name cts {
  ccopt_design
}
create_flow_step -name route {
  route_design
}
create_flow_step -name postroute {
  opt_design -postroute
}
# The flow lists the steps in execution order
create_flow -tool innovus {
  init_floorplan place cts route postroute
}
flow_kit(flow language)
Stylus
The Flow Kit
• Flow generation from within the tool
• Organizes the flow (create_flow) and flow steps (create_flow_step) into an executable task that flowtool (a make replacement) uses to run and manipulate the flow
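The relationship between flow steps, a flow and a runner can be mimicked in a few lines of Python. This is a conceptual stand-in only: the function names echo create_flow_step and create_flow for readability, but the code does not call or reproduce the Cadence tools, and the resume behaviour shown is simply an assumption about what a make-like runner would offer.

```python
# Hypothetical miniature of a flow kit: a registry of named steps plus an ordered flow.
steps = {}

def create_flow_step(name, body):
    """Register a callable under a step name."""
    steps[name] = body

def create_flow(order):
    """Check that every step named in the flow was defined, then return the order."""
    missing = [name for name in order if name not in steps]
    if missing:
        raise ValueError(f"undefined flow steps: {missing}")
    return list(order)

def run_flow(flow, start_at=None):
    """Run the steps in order; with start_at, skip everything before that step."""
    first = 0 if start_at is None else flow.index(start_at)
    log = []
    for name in flow[first:]:
        steps[name](log)
    return log

for step_name in ("init_floorplan", "place", "cts", "route", "postroute"):
    # Each stand-in body only records its own name instead of calling a tool.
    create_flow_step(step_name, lambda log, n=step_name: log.append(n))

flow = create_flow(["init_floorplan", "place", "cts", "route", "postroute"])
full_run = run_flow(flow)
resumed_run = run_flow(flow, start_at="cts")

print(full_run)
print(resumed_run)
```

Validating step names when the flow is created catches a flow that refers to a step that was never defined, before any long-running stage starts.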
How to Get a Cadence Arm-Specific RAK
cadence.com RAK web page provides interface to request RAKs
https://www.cadence.com/go/arm-rak
Cadence Arm IP Rapid Adoption Kit
• Complete Cadence® RTL-to-GDS digital implementation flow
• Tuned for the power and performance objectives of different Arm IP
• Enables designers to rapidly deploy an optimized Cadence flow for Arm IP
• Requested using a simple form on cadence.com
• Simplifies implementation of Arm IP products
[Diagram: tools in the RAK flow, grouped as Design Implementation, Signoff-driven Implementation and Signoff Analysis]
• Genus™ (RTL Synthesis)
• Innovus™ (Implementation): Place/CTS/Optimization/Routing
• Quantus™ (Extraction)
• Tempus™ (STA)
• Voltus™ (Sign-off IR/EM)
• Conformal® (Formal Eq)
© 2019 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of
Cadence Design Systems, Inc. All other trademarks are the property of their respective owners.