Copyright © 2006 UCI ACES Laboratory http://www.cecs.uci.edu/~aces
Mitigating Soft Error Failures for Multimedia Applications by Selective Data Protection
Kyoungwoo Lee (1), Aviral Shrivastava (2), Ilya Issenin (1), Nikil Dutt (1), and Nalini Venkatasubramanian (3)
(1) ACES Lab. and (3) DSM Lab., University of California at Irvine
(2) Compiler and Microarchitecture Lab., Arizona State University
CASES’06 #2 Copyright © 2006 UCI ACES Laboratory http://www.cecs.uci.edu/~aces
Soft Errors – Major Concern for Reliability
- Soft errors cause failures
  - Transient faults in electronic devices
  - A program can crash, give wrong output, go into an infinite loop, etc.
- Causes of soft errors
  - Poor system design
  - Random noise or signal-integrity problems such as crosstalk
  - Radiation-induced
    - Alpha particles, neutrons, protons, etc.
    - Dominant contributor to soft errors
    - Radiation cannot be completely shielded, e.g. a neutron can pass through 5 feet of concrete
Radiation-induced soft errors are dominant

The Phenomenon of Radiation-Induced Soft Errors
[Figure: radiation strikes a transistor between source and drain and generates positive and negative charges, which flips the stored bit value.]

Impact of Soft Errors
- Soft Error Rate (SER)
  - FIT: how many failures occur in one billion hours
  - Mean Time To Failure (MTTF)
- Examples
  - Cellphone with 4 Mbit of low-power SRAM @ 1,000 FIT per Mbit: MTTF = 28 years
  - Laptop PC with 256 MB of DRAM @ 600 FIT per Mbit: MTTF = 1 month
  - Router farm with 100 Gbit of SRAM @ 600 FIT per Mbit: MTTF = 17 hours
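The MTTF figures above follow from the FIT definition. A minimal sketch, assuming failure rates add up linearly across megabits and taking 1 MB = 8 Mbit and 1 Gbit = 1,000 Mbit (the function name is illustrative and not from the slides):

```python
HOURS_PER_YEAR = 24 * 365

def mttf_hours(fit_per_mbit: float, size_mbit: float) -> float:
    """MTTF in hours for a memory of size_mbit megabits.

    One FIT is one failure per 10**9 hours, so the whole memory fails
    fit_per_mbit * size_mbit times per 10**9 hours.
    """
    total_fit = fit_per_mbit * size_mbit
    return 1e9 / total_fit

cellphone = mttf_hours(1_000, 4)        # 4 Mbit of SRAM
laptop = mttf_hours(600, 256 * 8)       # 256 MB of DRAM = 2,048 Mbit
router = mttf_hours(600, 100 * 1_000)   # 100 Gbit of SRAM

print(f"cellphone: {cellphone / HOURS_PER_YEAR:.1f} years")
print(f"laptop:    {laptop / 24:.0f} days")
print(f"router:    {router:.1f} hours")
```

Under these assumptions the three results come out at roughly 28.5 years, 34 days, and 16.7 hours, in line with the rounded values on the slide.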

Soft Errors on an Increase
- Soft errors increase exponentially due to technology scaling
  - 0.18 µm: 1,000 FIT per Mbit of SRAM
  - 0.13 µm: 10,000 to 100,000 FIT per Mbit of SRAM
- Voltage scaling
  - Voltage scaling increases SER significantly
$$\mathrm{SER} \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right), \quad \text{where } Q_{critical} = C \times V$$
Soft errors are a main design concern!
[Hazucha et al., IEEE] P. Hazucha and C. Svensson. Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594, 2000.
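To see why lowering C and V raises the SER so sharply, the model can be evaluated as a ratio, which cancels N_flux and CS. This is only a sketch: the operating point Q_critical / Q_S = 10 and the scaling fractions are made-up illustration values, not numbers from the slides.

```python
import math

def relative_ser(c_ratio: float, v_ratio: float, qcrit_over_qs: float) -> float:
    """SER after scaling divided by SER before scaling.

    Uses SER ~ N_flux * CS * exp(-Q_critical / Q_S) with Q_critical = C * V,
    holding N_flux, CS and Q_S fixed. c_ratio and v_ratio are the new
    capacitance and voltage as fractions of the old ones; qcrit_over_qs is
    the old Q_critical / Q_S (hypothetical value supplied by the caller).
    """
    old_exponent = -qcrit_over_qs
    new_exponent = -qcrit_over_qs * c_ratio * v_ratio
    return math.exp(new_exponent - old_exponent)

# 10 % lower supply voltage only, then also 20 % lower capacitance.
print(relative_ser(1.0, 0.9, 10.0))
print(relative_ser(0.8, 0.9, 10.0))
```

Because Q_critical sits inside the exponent, a modest linear reduction in C or V multiplies the SER rather than adding to it.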

Soft Errors in Caches are Important
- Soft errors in memory are much more important than in combinational logic
  - Strong temporal masking in combinational logic
  - Most upsets in memory manifest as soft errors
  - Only 11 % of soft errors occur in combinational logic
- Redundancy techniques are popular for memories
  - ECC-based solutions
  - Not applicable for caches
  - Very sensitive to performance and power overheads
- Caches are most vulnerable to soft errors
  - Caches occupy the majority of the area in processors (can be more than 50 %)
  - Intel Itanium II (0.18 µm): more than 50 % of the area
Need to minimize failures due to soft errors in caches

ECC Protection
- ECC (Error Correcting Codes) is a popular technique to protect memory from soft errors
- But it has high overheads in terms of area, performance, and power
  - e.g. SEC-DED Hamming Code (32, 6)
  - Performance by up to 95 % [Li et al., MTDT ’05]
  - Energy by up to 22 % [Phelan, ARM ’03]
  - Area by more than 18 % [Phelan, ARM ’03]
[Figure: an unprotected cache next to a protected cache, which stores ECC bits alongside the data and adds coding and decoding steps.]
ECC protection for caches is expensive!
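One way to get a feel for the storage side of this overhead is to count Hamming check bits. The sketch below assumes that "(32, 6)" means 32 data bits with 6 check bits, which is an interpretation of the slide rather than something it states; the overheads quoted above also include logic, so the numbers are not directly comparable.

```python
def sec_check_bits(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single error correction)."""
    r = 0
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r

def storage_overhead(data_bits: int, ded: bool = False) -> float:
    """Check-bit storage as a fraction of the data bits.

    ded=True adds the one extra overall parity bit commonly used to extend
    a single-error-correcting code to double error detection.
    """
    check_bits = sec_check_bits(data_bits) + (1 if ded else 0)
    return check_bits / data_bits

print(sec_check_bits(32))
print(storage_overhead(32))
print(storage_overhead(32, ded=True))
```

For a 32-bit word this gives 6 check bits, i.e. 18.75 % extra storage, or 21.875 % with the additional parity bit.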

Problem Statement
- Dual optimization
  - Reduce failures due to soft errors in caches
  - Minimize power and performance overheads
Outline
Motivation and Problem Statement
Related Work
Our Solution
Experiments
Conclusion

Related Work in Combating Soft Errors
- Process technology solutions
  - Hardening: [Baze et al., IEEE Trans. on Nuclear Science ’00]
  - SOI: [O. Musseau, IEEE Trans. on Nuclear Science ’96]
  - Process complexity, yield loss, and substrate cost
- Microarchitectural solutions for caches
  - Cache scrubbing: [Mukherjee et al., PRDC ’04]
  - Low-power cache: [Li et al., ISLPED ’04]
  - Area-efficient protection: [Kim et al., DATE ’06]
  - Multiple-bit correction: [Neuberger et al., TODAES ’03]
  - Cache size selection: [Cai et al., ASP-DAC ’06]
  - High overheads in terms of power, performance, and area
- Our solution
  - Compiler-based microarchitectural technique
  - Provides protection from soft errors while minimizing the power, performance, and area overheads

Outline
Motivation and Problem Statement
Related Work
Our Solution
  Observation
  Software Support
  Architectural Support
Experiments
Conclusion

Observation
- Memory is divided into pages
- Suppose you could protect pages from soft errors independently
[Figure: the application's N KB of data memory is split into N pages of 1 KB each (1, 2, ..., K, ..., N); for page K, 1,000 simulations with random error injection yield the number of failures.]
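The profiling experiment described above can be sketched as a loop over pages. Everything named here is hypothetical: `run_with_fault` stands in for a real simulator run with an error injected into one page, and `toy_run` is a made-up application in which only page 0 is critical.

```python
import random
from typing import Callable, List

def profile_pages(run_with_fault: Callable[[int, random.Random], bool],
                  n_pages: int, runs: int = 1000, seed: int = 0) -> List[float]:
    """Return the observed failure rate of each page.

    run_with_fault(page, rng) runs the application once with a random
    error injected into `page` and returns True if the run failed.
    """
    rng = random.Random(seed)
    rates = []
    for page in range(n_pages):
        failures = sum(run_with_fault(page, rng) for _ in range(runs))
        rates.append(failures / runs)
    return rates

# Toy stand-in for a simulator: page 0 holds control data, the rest pixels.
def toy_run(page: int, rng: random.Random) -> bool:
    return rng.random() < (0.9 if page == 0 else 0.01)

rates = profile_pages(toy_run, n_pages=4)
critical = [page for page, rate in enumerate(rates) if rate > 0.5]
print(rates, critical)
```

Pages whose failure rate stays near zero are the candidates for going without protection.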

Observation
- For a multimedia application (susan)
  - Failure: the application crashes, goes into an infinite loop, or produces a broken image-file header, a wrong image size, etc.
  - Loss in Quality of Service is not a failure
[Figure: Profiling Failure Rates (Susan Edges). x-axis: memory page number (1 to 81); y-axis: failure rate (0 to 1).]
Not all pages are equally important!

Data Partitioning
- Failure Critical (FC) data
  - Loop bounds, loop iterators, branch decision variables, etc.
  - An error may result in a failure
- Failure Non-Critical (FNC) data
  - Multimedia data (e.g. image pixel bits)
  - An error may not cause failures, only a loss in QoS
- Our approach for multimedia applications: simple data partitioning
  - All multimedia data is FNC
  - Everything else is FC
  - The user marks the FNC (multimedia) data, which is very simple to do
Sample code (FNC: the multimedia array MM; FC: e.g. the branch condition and the loop iterator):
  …
  if ( condition ) {
    for ( loop = 1; loop < 64 ; loop++ ) {
      local = MM[loop] / ( 2*constant );
      MM[loop] = min( 127, max( -127, MM[loop] ) );
    }
  }
  …

Size of Failure Critical and Failure Non-Critical Data
[Figure: Composition of FC and FNC data. Stacked bars (0 % to 100 % of data) of FC data and FNC data for Susan Edges, Susan Corners, Susan Smoothing, G721 Encoder, G721 Decoder, ADPCM Encoder, ADPCM Decoder, H263 Encoder, and AVERAGE; one bar is annotated 54 %.]
- On average, 50 % of pages are FNC
- Should be able to reduce ECC overheads by half

HPC (Horizontally Partitioned Caches)
- HPC
  - More than one cache at the same level of the hierarchy
  - Each page in memory is mapped to exactly one cache
  - The mini cache is typically smaller than the main cache
- Originally proposed to separate stack data and array data
  - Performance improvements
- But also very effective in reducing energy consumption [Shrivastava et al., CASES ’05]
  - Performance improvements
[Figure: a processor (e.g. Intel XScale) whose pipeline feeds an HPC made of a main cache and a mini cache; a page mapping selects the cache, and a memory controller connects both caches to memory.]

PPC (Partially Protected Caches)
- We propose Partially Protected Caches
  - Main cache and mini cache
  - The mini cache is protected from soft errors
- The compiler maps data to the two caches
  - Map FNC data to the unprotected main cache
  - Map FC data to the protected mini cache
- The intuition is to provide protection to only the FC data
[Figure: the HPC processor diagram shown next to the PPC version, in which the pipeline feeds an unprotected main cache and a protected mini cache; the page mapping sends FNC pages in memory to the main cache and FC pages to the mini cache.]
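The mapping step can be sketched as a page table built from the ranges the user marks as multimedia data. This is an illustration only: the 1 KB page size is borrowed from the profiling slide, the function names are made up, and keeping partially FNC pages in the protected cache is a conservative choice of this sketch, not something the slides specify.

```python
PAGE_SIZE = 1024  # bytes; assumed, matching the 1 KB pages used for profiling

def build_page_map(fnc_ranges, n_pages):
    """Map each page to 'main' (unprotected) or 'mini' (protected).

    fnc_ranges holds (start, end) byte ranges marked as FNC data, with end
    exclusive. A page goes to the main cache only if it lies entirely
    inside an FNC range; every other page is treated as FC.
    """
    page_map = ["mini"] * n_pages
    for start, end in fnc_ranges:
        first = -(-start // PAGE_SIZE)  # first page fully inside the range
        last = end // PAGE_SIZE         # one past the last full page
        for page in range(first, min(last, n_pages)):
            page_map[page] = "main"
    return page_map

def cache_for(address, page_map):
    """Return which cache serves the given byte address."""
    return page_map[address // PAGE_SIZE]

page_map = build_page_map([(2048, 6144)], n_pages=8)
print(page_map)
print(cache_for(3000, page_map), cache_for(100, page_map))
```

Since each page maps to exactly one cache, a lookup is a single index into the table.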

Outline
Motivation
Related Work
Partially Protected Caches and Selective Data Protection
Experiments
  Experimental Framework
  Results
Conclusion

Data Cache Configurations
- Configuration 1, Unsafe cache configuration (traditional): unprotected cache
  - High failures, high performance, low energy
- Configuration 2, Safe cache configuration (traditional): protected cache with ECC coding and decoding
  - Low failures, low performance, high energy
- Configuration 3, PPC cache configuration (proposed): unprotected cache plus a protected cache
  - Low failures, high performance, low energy

Experimental Framework
[Figure: flow of the experimental framework.]
- Applications (MiBench, MediaBench): image (SUSAN), audio (ADPCM, G.721), video (H.263)
- The compiler (gcc) produces the executable and, informed of the multimedia data, the page mapping
- Configurations: UNSAFE (no protection for FNC and FC), SAFE (protection for FNC and FC), PPC (selective protection)
- Cache simulator (SimpleScalar) with accelerated soft error injection
- CACTI and synthesis (Synopsys) of the Hamming code
- Report: failure rate, runtime, energy

Experimental Results 1
- Effectiveness of our approach: Selective Data Protection using the PPC architecture
- Data cache similar to Intel XScale
  - Unsafe: 32 KB (no protection) data cache
  - Safe: 32 KB (protection) data cache
  - PPC: 32 KB (no protection) and 2 KB (protection) data caches
- Data cache configuration
  - 32-byte line size, 4-way set-associative, FIFO replacement
- Soft error injection
  - Randomly inject soft errors every cycle if the data in the cache is valid
  - Accelerated Soft Error Rate (SER)
    - Base SER = 1e-9 per cycle per 1 KB of data cache
  - Multiple-Bit Errors (MBE) and Single-Bit Errors (SBE)
    - SER for MBE is 100 times less than SER for SBE
- Metrics
  - Reliability in terms of failure rates
    - Number of failures in 1,000 runs
  - Performance
    - System performance: number of processor cycles + data cache accesses + main memory accesses
  - Energy consumption
    - System energy: processor energy + data cache energy (protected and unprotected) + main memory bus energy + main memory access energy
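The injection scheme can be sketched as one random draw per cycle. The acceleration factor `accel` is a hypothetical parameter, the split into SBE and MBE rates is one possible reading of the slide, and the check that the targeted cache data is valid is left out for brevity.

```python
import random

BASE_SER = 1e-9       # per cycle per 1 KB of data cache (from the slide)
MBE_RATIO = 1 / 100   # MBE rate is 100 times lower than the SBE rate

def inject(cycles, cache_kb, accel, rng):
    """Return a list of (cycle, kind) soft error events.

    The accelerated per-cycle rate BASE_SER * accel * cache_kb is used as
    the SBE probability and one hundredth of it as the MBE probability
    (an assumption of this sketch).
    """
    p_sbe = min(1.0, BASE_SER * accel * cache_kb)
    p_mbe = p_sbe * MBE_RATIO
    events = []
    for cycle in range(cycles):
        draw = rng.random()
        if draw < p_mbe:
            events.append((cycle, "MBE"))
        elif draw < p_mbe + p_sbe:
            events.append((cycle, "SBE"))
    return events

events = inject(cycles=100_000, cache_kb=32, accel=1e5, rng=random.Random(1))
print(len(events), sum(kind == "MBE" for _, kind in events))
```

A larger cache sees proportionally more events per cycle, which is why the rate is quoted per KB.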

Failure Rate
[Figure: Failure Rate. Normalized failure rate (log scale, 1.0E-04 to 1.0E+01) of Unsafe, Safe, and PPC for Susan Edges, Susan Corners, Susan Smoothing, G721 Encoder, G721 Decoder, ADPCM Encoder, ADPCM Decoder, H263 Encoder, and AVERAGE.]
Normalized failure rate: ratio of the failure rate of each configuration to that of the Unsafe configuration
Failure rate of PPC is close to that of Safe

Performance
[Figure: Runtime. Normalized runtime (0 to 1.6) of Unsafe, Safe, and PPC for the eight benchmarks and their average.]
Normalized runtime: ratio of the runtime of each configuration to that of the Unsafe configuration
- PPC has performance close to Unsafe
  - On average, PPC has a 32 % runtime reduction compared to Safe
  - PPC has only 1 % performance overhead compared to Unsafe
Note: our paper in CASES ’06 has more conservative numbers due to a mistake in the performance calculations for a couple of benchmarks.

Energy Consumption
[Figure: Energy Consumption. Normalized energy consumption (0 to 1.8) of Unsafe, Safe, and PPC for the eight benchmarks and their average.]
Normalized energy consumption: ratio of the energy consumption of each configuration to that of the Unsafe configuration
- PPC has energy consumption close to Unsafe
  - On average, PPC has a 29 % energy reduction compared to Safe
  - PPC has a 10 % energy consumption overhead compared to Unsafe

Experimental Results 2
- Design space exploration
  - Various cache configurations
    - Cache size: 512 bytes to 32 KB in powers of 2
    - Set associativity: direct-mapped, 4-way, 32-way
- Metrics
  - Reliability in terms of failure rates
  - Performance
  - Energy consumption

Results 2: Design Space Exploration
- Failure rate of PPC is close to that of Safe
- Performance and energy consumption of PPC are close to those of Unsafe
[Figure: three panels for Susan Edges comparing Unsafe, Safe, and PPC against cache size (bytes, log scale): failure rate (1.0E-04 to 1.0E+00), runtime (cycles, 5.0E+06 to 1.2E+07), and energy consumption (nJ, 4.0E+06 to 1.3E+07).]
PPC keeps failure rate, performance, and power between those of Safe and Unsafe

Conclusion
- Soft errors are a major design concern for system reliability
- We propose Partially Protected Caches and Selective Data Protection for multimedia applications
- Our approach compared to the Safe configuration
  - Comparable failure rates
  - 32 % performance improvement
  - 29 % energy saving
- Our approach works across cache configurations
- Future work
  - Selective Data Protection for general applications
  - Selective Data Protection in other components such as logic
Thanks!
Any Questions?
kyoungwl@ics.uci.edu
Backup Slides

Focus
- Who is your audience?
  - CASES ’06: compiler people
  - High-level presentation, more focus on compiler approaches
- What is the strong motivation?
  - Soft errors and their high-overhead protection
  - Specific motivation
- What is our problem?
  - Soft error protection in caches for multimedia applications
- What is our contribution?
  - Selective Data Protection without losing reliability
  - Key idea
  - Clear experimental framework
- Every slide should have at least a tiny picture that visualizes and supports the content
  - Unified $ and HPC $

Outline
Soft Errors in Cache
  ECC protection: high overheads in terms of performance and power
Selective Data Protection in HPC
  Reduce overheads with comparable reliability
  Multimedia applications
Experiments
  Experimental Framework
  Results
Conclusion

Strong Motivation
- Soft errors are critical
- ECC is expensive
- Not all data are equally critical to failures
  - Multimedia is a good example
Soft Error
R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004

Soft Errors vs. Hard Errors
- Randomly radiation-induced Single Event Effects (SEE)
- Transient faults vs. permanent faults
- The probability of soft errors is up to 100x higher than that of hard errors

SER formula
$$\mathrm{SER} \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right)$$
- N_flux: intensity of the neutron flux
- CS: area of the cross section of the node
- Q_S: charge collection efficiency
- Q_critical: the minimum charge required for a cell to retain data
  - Q_critical = C × V, where C is the capacitance and V is the supply voltage

Soft Error is Critical
- High integration
  - High integration potentially raises soft errors [Mastipuram et al., EDN ’04]
  - e.g. cellphone with 4 Mbit of low-power SRAM at 1,000 FIT per Mbit: 28 years MTTF
  - e.g. laptop PC with 256 MB of DRAM at 600 FIT per Mbit: one month MTTF
  - e.g. router farm with 100 Gbit of SRAM at 600 FIT per Mbit: 17 hours MTTF
[Mastipuram et al., EDN ’04] R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004.

Soft Errors Increase with Technology Advances
- Soft errors are affected by [Hazucha et al., IEEE]:
  - Process technology
    - Shrinking increases SER exponentially
    - e.g. 1,000 FIT per Mbit of SRAM in 0.18 µm; 10,000 to 100,000 FIT per Mbit of SRAM in 0.13 µm [Mastipuram et al., EDN ’04]
  - Voltage scaling
    - Voltage scaling increases SER significantly
  - Radiation intensity
    - Latitude and altitude
    - e.g. 10 to 100 times higher SER in flight than on the ground
[Figure: a 0.18 µm transistor and a 0.13 µm transistor (source, drain); C and V decrease with scaling.]
$$\mathrm{SER} \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right), \quad \text{where } Q_{critical} = C \times V$$
Soft errors are a main design concern!
[Hazucha et al., IEEE] P. Hazucha and C. Svensson. Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate. IEEE Trans. on Nuclear Science, 47(6):2586–2594, 2000.

Soft Error is Critical
- High integration
  - Raises SE linearly
- Process technology
  - Shrinking decreases Q_critical and increases SER exponentially
  - e.g. 1,000 FIT per Mbit of SRAM in 0.18 µm; 10,000 to 100,000 FIT per Mbit of SRAM in 0.13 µm [Mastipuram et al., EDN ’04]
- Voltage scaling
  - Voltage scaling decreases Q_critical and increases SER exponentially
- Latitude and altitude
  - 10 to 100 times higher SER in flight than on the ground
  - e.g. potentially, a laptop PC with 256 MB of memory on an airplane at 35,000 ft: 5 hours MTTF [Mastipuram et al., EDN ’04]
[Figure: a 0.18 µm transistor and a 0.13 µm transistor (source, drain), with C and V decreasing; N_flux versus SER, annotated with 1 month MTTF and 5 hours MTTF.]
$$\mathrm{SER} \propto N_{flux} \times CS \times \exp\left(-\frac{Q_{critical}}{Q_S}\right), \quad \text{where } Q_{critical} = C \times V$$
Soft errors are a main design concern!
R. Mastipuram and E. C. Wee. Soft Errors’ Impact on System Reliability. EDN online, Sep 2004.

Soft Errors in Caches are Important
- Core: combinational logic
  - Robust structure
  - Masking (e.g. logical, electrical, and temporal masking)
  - Only 10 % of soft errors occur in combinational logic
- Main memory: DRAM
  - An upset in memory is not masked
  - SER is not increasing with technology generations
- Cache: SRAM
  - An upset is not masked
  - SER is increasing significantly with technology generations
  - Occupies most of the processor area
  - The cache affects performance and power consumption significantly
  - Intel Itanium II (0.18 µm): more than 50 % of the area
[Figure: DRAM SER and SRAM SER.]
Sources:
Robert Baumann, Soft Errors in Advanced Computer Systems, IEEE Design and Test of Computers, 2005.
S. Mitra, N. Seifert, M. Zhang, Q. Shi, and K. S. Kim, Robust System Design with Built-In Soft-Error Resilience, IEEE Computer, 2005.
Richard Loft, Supercomputing Challenges at the National Center for Atmospheric Research.

Most Effective Protection: ECC
- ECC (Error Correcting Codes): information redundancy
  - Code the data and store extra control data
  - Decode the data and detect/correct errors in it
- High overheads in terms of area, performance, and power
  - e.g. SEC-DED (Single Error Correction and Double Error Detection) for cache (or SRAM): Hamming Codes (32, 6)
  - Performance by up to 95 %
  - Energy by up to 22 %
  - Area by more than 18 %
[Figure: an unprotected cache next to a protected cache, which stores control bits alongside the data and adds coding and decoding steps.]
ECC protection for every cache access is too expensive!
[Li et al., MTDT ’05] J.-F. Li and Y.-J. Huang. An Error Detection and Correction Scheme for RAMs with Partial-Write Function. In MTDT ’05, pages 115–120, 2005.
[Phelan, ARM ’03] R. Phelan. Addressing Soft Errors in ARM Core-based Designs. Technical report, ARM, 2003.

[Image-only slides: Power PC 4, Pentium 4, Intel Duo.]
Cache Miss Rates of FC and FNC data
Miss Rates of FC Data and FNC Data
1.0E+02
1.0E+03
1.0E+04
1.0E+05
1.0E+06
128 256 512 1024 2048 4096 8192 16384 32768 65536
Cache Size (bytes)
Nu
mb
er
of
Mis
se
s
FNC data FC data

Benchmarks
- MiBench
  - Image processing: Susan Edges, Susan Corners, Susan Smoothing
  - Audio codec: ADPCM Encoder/Decoder
- MediaBench
  - Audio codec: G.721 Encoder/Decoder
- PeaCE (Ptolemy extension as Codesign Environment)
  - H.263 Video Encoder

Failures
- Cannot open the output of multimedia processing
  - No output
  - Incorrect output name
  - Wrong header
  - Different output size
- Crash
- Infinite loop

Performance
- Unsafe: Num_Inst + Access*2 + Miss*25
- Safe: Num_Inst + Access*3 + Miss*25
- HPC: Num_Inst + 2*(Main_Access + Mini_Access) + 25*(Main_Miss + Mini_Miss)
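The three runtime models translate directly into code. The coefficients come from the slide; the instruction, access, and miss counts in the example are made-up numbers chosen only to show how the formulas are applied.

```python
def runtime_unsafe(num_inst, access, miss):
    """Cycles for the Unsafe configuration: 2 per access, 25 per miss."""
    return num_inst + 2 * access + 25 * miss

def runtime_safe(num_inst, access, miss):
    """Cycles for the Safe configuration: 3 per access, 25 per miss."""
    return num_inst + 3 * access + 25 * miss

def runtime_hpc(num_inst, main_access, mini_access, main_miss, mini_miss):
    """Cycles for the HPC configuration, as written on the slide."""
    return (num_inst
            + 2 * (main_access + mini_access)
            + 25 * (main_miss + mini_miss))

# Hypothetical counts for one run.
unsafe = runtime_unsafe(1_000_000, 300_000, 10_000)
safe = runtime_safe(1_000_000, 300_000, 10_000)
hpc = runtime_hpc(1_000_000, 200_000, 100_000, 8_000, 4_000)
print(unsafe, safe, hpc)
print(round(safe / unsafe, 3))
```

Dividing each result by the Unsafe runtime gives the normalized runtime used in the result charts.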

Energy Consumption
- Energy consumption of the whole system
  - Processor pipeline: 0.67 nJ per cycle
  - Cache: nJ from CACTI
  - Memory: 42.19 nJ per access
  - Off-chip bus
- Tools: CACTI and Synopsys Design Compiler
$$\begin{aligned} E ={} & \{(A_{SEprone} \times E_{SEprone}) + A_{SEprotected} \times (E_{SEprotected} + E_{dec}) + (W_{SEprotected} \times E_{cod})\} \\ & + \{(M_{SEprone} + M_{SEprotected}) \times (E_{bus} + E_{mem})\} \\ & + \{(M_{SEprotected} \times E_{cod}) + (R_{SEprotected} \times E_{dec})\} \\ & + \{E_{proc} \times (A_{SEprotected} + A_{SEprone})\} \end{aligned}$$

Clear Problem Definition and Intensive Experiments
- The problem should be clear and very specific
  - Selective Data Protection in HPC for multimedia applications
- Our strength is the experimental framework and extensive experiments
  - Detailed presentation of our simulation environments and benchmarks
  - Experimental sets
    - Effects of our approach in terms of power, performance, and reliability
    - Design space exploration

Problem Definition
- Configurations
  - Unsafe
  - Safe
  - HPC
- Our interest lies in mitigating failures due to soft errors, rather than in decreasing soft errors

Failure Rate
[Figure: Failure Rate. Normalized failure rate (log scale, 1.0E-04 to 1.0E+01) of Unsafe, Safe, and PPC for the eight benchmarks and their average.]
- PPC provides reliability comparable to Safe
  - On average, both have 45 times fewer failures than Unsafe
Normalized failure rate: ratio of the failure rate of each configuration to that of the Unsafe configuration

Performance
[Figure: Runtime. Normalized runtime (0 to 1.6) of Unsafe, Safe, and PPC for the eight benchmarks and their average.]
- PPC removes the performance overhead of Safe
  - On average, PPC has a 32 % runtime reduction compared to Safe
  - PPC has only 1 % performance overhead compared to Unsafe
Normalized runtime: ratio of the runtime of each configuration to that of the Unsafe configuration
Note: our paper in CASES ’06 has more conservative numbers due to a mistake in the performance calculations for a couple of benchmarks.

Energy Consumption
[Figure: Energy Consumption. Normalized energy consumption (0 to 1.8) of the three configurations for the eight benchmarks and their average.]
- PPC has less energy overhead than Safe
  - On average, PPC has a 29 % energy reduction compared to Safe
  - PPC has a 10 % energy consumption overhead compared to Unsafe
Normalized energy consumption: ratio of the energy consumption of each configuration to that of the Unsafe configuration

Runtime
[Figure: Runtime. Normalized runtime (0 to 1.6) of Unsafe, Safe, and HPC for Susan Edges, Susan Corners, Susan Smoothing, G721 Encoder, G721 Decoder, ADPCM Encoder, ADPCM Decoder, H263 Encoder, and AVERAGE.]

Failure Rate
[Figure: Failure Rate (Main = 32 KB, mini = 2 KB). Failure rate (0 to 1) of Unsafe, Safe, and HPC for Susan Edges, Susan Corners, Susan Smoothing, G721 Encoder, G721 Decoder, ADPCM Encoder, ADPCM Decoder, and H263 Encoder.]

QoS
[Figure: Quality of Service. Normalized PSNR (0 to 5) of Unsafe, Safe, and HPC for the eight benchmarks and their average.]
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{MAX^2}{MSE}\right), \quad \text{where MSE is the Mean Squared Error}$$
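As a worked instance of the PSNR definition, the sketch below compares a clean sample sequence with one in which a single sample has been corrupted. The default MAX of 255 assumes 8-bit samples, which the slide does not state, and the sample values are made up.

```python
import math

def psnr(reference, degraded, max_value=255):
    """PSNR in dB between two equal-length, non-empty sample sequences.

    max_value is the largest possible sample value (255 assumes 8-bit
    samples; this is an assumption of the sketch).
    """
    if not reference or len(reference) != len(degraded):
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, degraded)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10 * math.log10(max_value ** 2 / mse)

clean = [10, 20, 30, 40]
noisy = [10, 20, 30, 44]  # last sample differs by one flipped bit (value 4)
print(round(psnr(clean, noisy), 2))
```

A bit flip in pixel data lowers the PSNR but still yields a usable output, which is the sense in which such data is failure non-critical.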

Results 2: Design Space Exploration
- Failure rate
[Figure: Failure Rate (Susan Edges). Failure rate (log scale, 1.0E-04 to 1.0E+00) of Unsafe, Safe, and PPC versus cache size (bytes, 1.0E+02 to 1.0E+07).]
- Failure rates increase with cache size and then saturate
- Failure rate of PPC is close to that of Safe

Results 2: Design Space Exploration
- Performance
[Figure: Runtime (Susan Edges). Runtime (cycles, 5.0E+06 to 1.2E+07) of Unsafe, Safe, and PPC versus cache size (bytes, 1.0E+02 to 1.0E+05).]
- Performance of PPC is close to that of Unsafe (32 % reduction compared to Safe)

Results 2: Design Space Exploration
- Energy consumption
[Figure: Energy Consumption (Susan Edges). Energy (nJ, 4.0E+06 to 1.3E+07) of Unsafe, Safe, and PPC versus cache size (bytes, 1.0E+02 to 1.0E+05).]
- Miss rate reduction and high access power cost
- Energy consumption of PPC lies between Safe and Unsafe (24 % reduction compared to Safe)

QoS
[Figure: Quality of Service (Susan Edges). PSNR (Peak Signal to Noise Ratio, 10 to 50) of Unsafe, Safe, and HPC versus cache size (bytes, 1.0E+02 to 1.0E+05); one point is annotated 2KB/512B.]

Area
[Figure: Cache Area. Area (square cm, log scale, 1.0E-03 to 1.0E+00) of Unsafe, Safe, and HPC versus total cache size (bytes, 1.0E+02 to 1.0E+06).]

Composite Metric
- LOG(Failure_Rate) * Performance * Energy
[Figure: Composite Metric (Susan Edges). Failure Rate * Runtime * Energy (log scale, 1.0E+10 to 1.0E+13) of Unsafe, Safe, and HPC versus cache size (bytes, 1.0E+02 to 1.0E+05); one point is annotated 2KB/512B.]