CONFIDENTIAL 1
Triple Core Lock Step (TCLS) ARM FOR SPACE
Xabier Iturbe, Emre Ozer & Balaji Venu
Toby Proctor & Alex Robinson
ARM Research, Cambridge
CONFIDENTIAL 2
Agenda
• Big Picture - What is TCLS trying to address?
• Introduction to Cortex-R CPU family
• TCLS overview
  ◦ Introduction
  ◦ TCLS results
  ◦ Architecture, Implementation
  ◦ PPA comparison
• Status of the IP
• Future work
• Discussions
CONFIDENTIAL 3
Disclaimer
• The TCLS system IP is a research project initiated by ARM Research, not by a product division
• This is not a committed product
• We built on top of the Cortex-R5 product
CONFIDENTIAL 4
What is TCLS trying to address?
Radiation hardening by process
• State-of-the-art technique used in the community to guard against:
  ◦ Total Ionising Dose (TID) up to 300 krad(Si) [1]
  ◦ Single Event Latch-up (SEL) up to 110 MeV·cm²/mg [1]
  ◦ Upsets & Transients (SEU, SET) up to 120 MeV [2]
• SEUs are non-destructive, but can cause:
  ◦ Mission critical failures
  ◦ Non-deterministic behavior
  ◦ Silent Data Corruption
• TCLS solution makes CPUs fully resilient to SEUs / soft errors
[1] http://www.voragotech.com/products/VA10820
[2] https://www.aerospaceonline.com/doc/how-rad-hard-do-you-need-the-changing-approac-0001
CONFIDENTIAL 5
Resiliency to rescue
Make it ultra reliable: Rad hard by process/technology
Choose parts with Rad hard by architecture/design features
Make the design more resilient thus improving reliability
Addressing reliability concern in COTS parts of cubesats
CONFIDENTIAL 6
Background Introduction to Cortex-R CPUs
CONFIDENTIAL 7
ARM Cortex® Processor Profiles in Automotive
Three architecture variants profiled for the different application sectors
• Cortex-R processors
  ◦ Actuation, fast control
  ◦ Fast response / Real-time control
  ◦ Extended Functional Safety
• Cortex-M processors
  ◦ MCUs, IoT, sensors, motors
  ◦ RTOS
  ◦ DSP
  ◦ Smallest footprint / lowest power
• Cortex-A processors
  ◦ Computation, robotics, computer-vision
  ◦ Linux®, QNX
  ◦ Higher performance
ARMv8-R
CONFIDENTIAL 9
Cortex-R5 CPU
Designed for functional safety (e.g. ISO 26262) in automotive, shipping for over a decade
Fault-tolerant Features
• Dual-core lockstep operation (DCLS) for ASIL-D
  ◦ Detects transient and hard faults in CPUs
  ◦ Fail safe capable (e.g. reset & restart, checkpoint & roll-back)
• ECC in Caches/TCMs and AXI/AHB busses
[Figure: DCLS block diagram. Labels: CPU0, CPU1, Shared I$/TCM ECC, Shared D$/TCM ECC, Checker, CPU Output Ports, Flag Error]
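The dual-core lockstep checker described on this slide can be illustrated with a minimal sketch. The `CpuOutputs` fields and the `dcls_check` function are hypothetical names chosen for illustration; the real checker compares hardware output ports every cycle, and its port list is not given in the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuOutputs:
    """Hypothetical snapshot of one CPU's output ports for a single cycle."""
    address: int
    write_data: int
    write_enable: bool


def dcls_check(cpu0: CpuOutputs, cpu1: CpuOutputs) -> bool:
    """Return True (error flag raised) when the two lockstep CPUs disagree."""
    return cpu0 != cpu1


# Both CPUs run the same code; a single bit flip in one of them shows up
# as a mismatch on the compared outputs.
good = CpuOutputs(address=0x2000_0000, write_data=0xCAFE, write_enable=True)
flipped = CpuOutputs(address=0x2000_0000, write_data=0xCAFE ^ (1 << 3), write_enable=True)

no_error = dcls_check(good, good)
error_flag = dcls_check(good, flipped)
print(no_error, error_flag)
```

With only two copies the checker can detect a divergence but cannot tell which CPU is wrong, which is why DCLS is listed as fail safe (reset & restart, checkpoint & roll-back) rather than fail functional.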
CONFIDENTIAL 10
TCLS Overview
CONFIDENTIAL 11
TCLS ARM for Space
Introduction
• Collaborative project funded by European Commission H2020 Space program
• Public Website: http://www.tcls-arm-for-space.eu/
Objectives
• Understand fail functional design requirements and principles under heavy SEU scenarios
• Assess the fail functional design using radiation-tolerant STM65nm technology
• Trade-offs of TCLS vs. DCLS solutions vs. single core rad-hard solution
[Figure: partner logos, Partners (Project Leader)]
CONFIDENTIAL 12
Project outcomes
1) TCLS System Arch Specification (v7R & v8R)
2) Implemented in 28nm tech, PPA comparison
3) Soft error failure rate analysis
Soft Error Failure Rate: 5.4%
Source: X. Iturbe, B. Venu, and E. Ozer: "Soft error vulnerability assessment of the real-time safety-related ARM Cortex-R5 CPU", International Symposium on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT’16)
CONFIDENTIAL 13
TCLS CPU Subsystem Architecture
• 3 CPUs in lockstep operation
• Shared memory protected by ECC
• No modifications to the CPU RTL
• TCLS Assist Unit - system-level fault handling
• Fail-functional operation
Source: X. Iturbe, B. Venu, E. Ozer and S. Das: "A Triple Core Lock-Step (TCLS) ARM Cortex-R5 Processor for Safety-Critical and Ultra-Reliable Applications", International Conference on Dependable Systems and Networks (DSN'16), June/July 2016.
[Figure: TCLS CPU Subsystem]
Recovery Time is FAST - 2500 CPU cycles or 5.5µs @450MHz
Correctable Error Sequence
1) Detect Divergence
2) Interrupt (FIQ) All 3 CPUs
3) Go to SAVE state: Push out Arch State to TCM
4) Reset All 3 CPUs
5) Restart All 3 CPUs
6) Go to RESTORE state: Pop in Arch State from TCM
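The correctable error sequence can be sketched in software as a majority vote followed by a save/reset/restore cycle. This is only an illustration under stated assumptions. The step ordering is inferred from the step names on the slide. Voting each register on its way out to the TCM is an assumption. `Step`, `majority_vote` and `recover` are hypothetical names, not part of the TCLS IP.

```python
from collections import Counter
from enum import Enum, auto


class Step(Enum):
    DETECT_DIVERGENCE = auto()
    INTERRUPT_FIQ = auto()
    SAVE_STATE_TO_TCM = auto()
    RESET_CPUS = auto()
    RESTART_CPUS = auto()
    RESTORE_STATE_FROM_TCM = auto()


def majority_vote(values):
    """Return (voted_value, diverged) for the outputs of three lockstep CPUs."""
    value, count = Counter(values).most_common(1)[0]
    if count < 2:
        raise RuntimeError("no two CPUs agree: error is not correctable")
    return value, count < 3


def recover(arch_states):
    """Resynchronise three architectural-state dicts (register name -> value)."""
    log = [Step.DETECT_DIVERGENCE, Step.INTERRUPT_FIQ]
    # SAVE: assumed here that every register is voted before it reaches the TCM.
    tcm = {reg: majority_vote([s[reg] for s in arch_states])[0]
           for reg in arch_states[0]}
    log.append(Step.SAVE_STATE_TO_TCM)
    log += [Step.RESET_CPUS, Step.RESTART_CPUS]
    # RESTORE: all three CPUs reload the same voted state.
    restored = [dict(tcm) for _ in range(3)]
    log.append(Step.RESTORE_STATE_FROM_TCM)
    return restored, log


# CPU2 has suffered an upset in r0; the other two CPUs outvote it.
states = [{"r0": 1, "pc": 0x100}, {"r0": 1, "pc": 0x100}, {"r0": 9, "pc": 0x100}]
value, diverged = majority_vote([s["r0"] for s in states])
restored, log = recover(states)
print(value, diverged, [step.name for step in log])
```

Unlike the two-CPU case, the third copy lets the system identify the correct value and continue, which is what the slides call fail-functional operation.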
CONFIDENTIAL 14
System Level Challenges
[Figure: system block diagram. Labels: CPU cluster (CPU0), Interconnect, ACE, DRAM, SRAM, Accelerator, DMA, UART / GPIO / Peripherals]
CONFIDENTIAL 15
System Level Challenges
• TCLS solves SEU vulnerability from the CPU point of view
• System IPs still need to be protected
• TCLS enables the concept of “Reliable Root Node”
• Build fault tolerant system leveraging the “Reliable Root Node”
• Popular techniques:
  ◦ Scrubbing
  ◦ BIST (increase fault coverage)
  ◦ Implement System IPs using Rad hard process
  ◦ VORAGO HARDSIL technology
[Figure: system block diagram. Labels: TCLS cluster (CPU0 x3), TCLS assist, Interconnect, ACE, DRAM, SRAM, Accelerator, DMA, UART / GPIO / Peripherals]
CONFIDENTIAL 16
Soft error resiliency (TCLS v/s single core)
IP name | Flip Flop count | Num of vulnerable flops
R5 single core | 18500 | 1K (5.4%)
IP name | Flip Flop count | Num of vulnerable flops
TCLS config 3 R5 CPUs | 3*18500 | 0
TCLS assist unit | 7414 | 81 (1.1%), easily reduced to 0% post local TMR of flops
CPU fully resilient against SEUs / soft errors using TCLS technology
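The figures in this comparison can be cross-checked with a few lines of arithmetic. The sketch below assumes the "1K" entry is simply 5.4% of the 18500 flops, as the table suggests. The combined TCLS flop count is derived here and does not appear on the slide.

```python
R5_FLOPS = 18_500          # flip-flop count of a single Cortex-R5 (from the table)
R5_FAILURE_RATE = 0.054    # 5.4% soft error failure rate (from the table)
ASSIST_FLOPS = 7_414       # flip-flops in the TCLS assist unit
ASSIST_VULNERABLE = 81     # vulnerable flops in the assist unit before local TMR

# 5.4% of 18500 flops is what the table rounds to "1K".
r5_vulnerable = round(R5_FLOPS * R5_FAILURE_RATE)

# 81 out of 7414 flops is the 1.1% quoted for the assist unit.
assist_pct = 100 * ASSIST_VULNERABLE / ASSIST_FLOPS

# Derived, not on the slide: total flops in a 3-CPU TCLS configuration.
tcls_total_flops = 3 * R5_FLOPS + ASSIST_FLOPS

print(r5_vulnerable, round(assist_pct, 1), tcls_total_flops)
```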
CONFIDENTIAL 17
TCLS status & plan
CONFIDENTIAL 18
Status of the IP
• Porting it to V2M-MPS3 platform
• Implement POC TCLS system on Xilinx Kintex-7 FPGA
• Write bare metal code to demonstrate TCLS functionality
• Invite universities & industry to develop around TCLS
• Target further Cubesat events
[Figure: TCLS IP on Xilinx FPGA, V2M-MPS3 platform]
CONFIDENTIAL 19
The plan
• HW deliverables: TCLS RTL / FPGA Bitstream
• SW deliverables
  ◦ SW Driver
  ◦ Application example
  ◦ Startup guide
• White paper creation
• Promotion at events
• Official channel of contact
• Co-ordinate the ARM-ESA CubeSat Challenge with ESA
• Silicon Partners, Space Agencies, Space & Avionics industry, Universities
CONFIDENTIAL 20
TCLS Executive Summary
System Architecture solution with the following benefits
• CPU fully resilient against SEUs / soft errors
  ◦ Fault injection experiments estimate a failure rate of 5.4% for Cortex-R5; TCLS makes it 0%
• Complementary technology to existing state-of-the-art rad-hard process
• Resynchronization time upon a soft error is 2500 clock cycles (v7R arch): 5.5µs @450MHz (state of the art: 1ms [3])
• Scrub feature further reduces non-deterministic behavior (i.e., real time)
• PPA comparable against single core rad hard technology with the above advantages
[3] http://www.ddc-web.com/Products/Microelectronics/images/documents/SCS750_rev8_r6.pdf
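The resynchronization figure follows directly from the cycle count and the clock frequency. The helper name `cycles_to_microseconds` is hypothetical, and the ratio against the 1 ms reference point is derived here rather than quoted from the slides.

```python
def cycles_to_microseconds(cycles: int, clock_hz: float) -> float:
    """Convert a cycle count at a given clock frequency into microseconds."""
    return cycles / clock_hz * 1e6


# 2500 cycles at 450 MHz, the resynchronization cost quoted in the summary.
resync_us = cycles_to_microseconds(2500, 450e6)

# Ratio against the 1 ms (1000 µs) figure cited as state of the art in [3].
ratio_vs_1ms = 1000.0 / resync_us

print(f"{resync_us:.2f} us, about {ratio_vs_1ms:.0f}x shorter than 1 ms")
```

The exact quotient is slightly above the 5.5µs shown on the slide, which appears to truncate rather than round.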
CONFIDENTIAL 21
References
Publications
• Triple Core Lock Step (TCLS) ARM FOR SPACE. Xabier Iturbe, Balaji Venu, Emre Ozer. https://indico.esa.int/indico/event/148/session/10/contribution/71/material/slides/0.pdf
• A Triple Core Lock-Step (TCLS) ARM® Cortex®-R5 Processor for Safety-Critical and Ultra-Reliable Applications. Xabier Iturbe, Balaji Venu, Emre Ozer, Shidhartha Das. http://ieeexplore.ieee.org/abstract/document/7575387/
• Soft error vulnerability assessment of the real-time safety-related ARM Cortex-R5 CPU. Xabier Iturbe, Balaji Venu, Emre Ozer. http://ieeexplore.ieee.org/abstract/document/7684076/
• A Fail-Functional Automotive CPU Subsystem Architecture for Mitigating Single Point of Failures. Balaji Venu, Emre Ozer, Xabier Iturbe, Alex Robinson. https://www.researchgate.net/publication/310480363_A_Fail-Functional_Automotive_CPU_Subsystem_Architecture_for_Mitigating_Single_Point_of_Failures
CONFIDENTIAL 22
Thank you
CONFIDENTIAL 23
Backup
CONFIDENTIAL 24
Reliability estimation of R5 soft error failure rate
error propagation analysis (RAS talk earlier this year)
CONFIDENTIAL 25
TCLS Fault Injection work
Component | # Seq. Elements | % of CPU
PFU | 1,029 | 6%
MPU | 1,065 | 7%
LSU | 1,545 | 10%
CACHES | 3,841 | 25%
  CACHE-LOGIC | 189 | 1%
  DCACHE_16kB | 604 | 4%
  ICACHE_16kB | 494 | 4%
  CACHE_STB | 681 | 4%
  CACHE-AXIM | 1,873 | 12%
DPU | 8,398 | 52%
  DPU_BR | 220 | 1%
  DPU_CPSR | 442 | 3%
  DPU_CTL | 1,218 | 8%
  DPU_DE | 597 | 4%
  DPU_LDST | 205 | 1%
  DPU_REGBANK | 974 | 6%
  DPU_FPU | 1,663 | 10%
  DPU_FREGBANK | 1,130 | 7%
  DPU_CP | 777 | 5%
TOTAL | 15,878 | 100%
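The component counts can be checked for internal consistency. The sketch below assumes that the CACHE* rows are sub-blocks of CACHES and the DPU_* rows are sub-blocks of DPU; that grouping is inferred from the names and is not stated on the slide.

```python
# Counts copied from the table; the parent/child grouping is an assumption.
TOP_LEVEL = {"PFU": 1029, "MPU": 1065, "LSU": 1545, "CACHES": 3841, "DPU": 8398}

CACHES = {
    "CACHE-LOGIC": 189, "DCACHE_16kB": 604, "ICACHE_16kB": 494,
    "CACHE_STB": 681, "CACHE-AXIM": 1873,
}

DPU = {
    "DPU_BR": 220, "DPU_CPSR": 442, "DPU_CTL": 1218, "DPU_DE": 597,
    "DPU_LDST": 205, "DPU_REGBANK": 974, "DPU_FPU": 1663,
    "DPU_FREGBANK": 1130, "DPU_CP": 777, "DPU_DP": 1172,
}

TOTAL = 15878  # TOTAL row of the table

caches_sum = sum(CACHES.values())
dpu_sum = sum(DPU.values())
total_sum = sum(TOP_LEVEL.values())

print(caches_sum == TOP_LEVEL["CACHES"], dpu_sum == TOP_LEVEL["DPU"], total_sum == TOTAL)
```

Under that grouping the sub-block counts add up to their parent rows, and the five top-level rows add up to the TOTAL row.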
CONFIDENTIAL 26
Fault Injection (FI) Methodology
Fault Error Failure:
• FI in all Sequential Elements (FFs & Memory cells)
• 10 iterations of 7 benchmarks of EEMBC AutoBench
• Benchmark execution time divided into 64 equally-sized intervals
• FI instant randomly chosen within each interval
• Failure rate = (# of exp which resulted in error) / (total # of exp)
• Total # of exp = 16K * 64 * 7 ~= 8 Million
[Figure: timeline per benchmark. Initialization Routines, then Interval 1 to Interval 64 spanning the benchmark execution time (10 iterations)]
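The interval-based choice of injection instants and the failure rate formula can be sketched as follows. `injection_instants` and `failure_rate` are hypothetical helper names, the cycle counts are made up for illustration, and "16K" is taken as 16,000.

```python
import random


def injection_instants(exec_cycles: int, intervals: int = 64, seed: int = 0):
    """Pick one random injection cycle inside each equally-sized interval."""
    rng = random.Random(seed)
    width = exec_cycles // intervals
    return [rng.randrange(i * width, (i + 1) * width) for i in range(intervals)]


def failure_rate(errors: int, experiments: int) -> float:
    """Failure rate = experiments that resulted in an error / total experiments."""
    return errors / experiments


# Illustrative numbers only: a benchmark run of 640,000 cycles.
instants = injection_instants(exec_cycles=640_000)

# Experiment count taken literally from the slide's product: 16K * 64 * 7.
total_experiments = 16_000 * 64 * 7

print(len(instants), total_experiments, failure_rate(54, 1000))
```

Taken literally, the product 16K * 64 * 7 comes to roughly 7.2 million experiments, which the slide quotes as ~8 Million.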
CONFIDENTIAL 27
Soft-Error Rate (SER)
Failure Rate: 5.4%
[Figure: stacked bar chart, 0% to 100%, one bar per benchmark (ttsprk01, aifftr01, matrix01, tblook01, canrdr01, rspeed01, a2time01, puwmod01). Legend: OTHERS, DPU-CP, PFU, LSU, CACHE-AXIM, DPU-CTL, CACHE-STB, DPU-DE, DPU-FREGBANK, DPU-REGBANK]