Quantitative Analysis of Fault-Tolerant RapidIO-based Network Architectures
David Bueno, April 27, 2006
HCS Research Laboratory, ECE Department, University of Florida
Motivation
- Gain further insight into the strengths/weaknesses of the proposed architectures
- Quantify power, size/cost, fault isolation, and fault tolerance while maintaining a fixed level of performance
- Provide flexible, weighted evaluation criteria that other users may modify to fit their needs
- Avoid excessive complexity by using fair heuristics to estimate power, size/cost, fault isolation, and fault tolerance
Evaluation Criteria Overview

Power
- Very important in nearly every embedded system
- Evaluate power based on the number of active ports under no-fault conditions
- Conservatively assume multiplexer ports use 50% of the power of a full RapidIO switch port (much less logic needed; they only need to multiplex and repeat the LVDS signal)

Size/cost
- Consider size/cost to be determined by the total number of network pins across all chips in the network fabric
- Fairest way to treat serial/parallel RapidIO pin-count considerations
- Means multiplexer chips are costly due to their high pin count

Fault isolation
- Measure of how much a fault affects other components in the system
- The classic approach of fully redundant networks provides near-perfect fault isolation
- Measured as the average number of switches that must be rerouted in the event of a switch fault, assuming a fault may occur in any active switch with equal likelihood
- Ideally, switches should be unaware of and unaffected by faults in the system

Fault tolerance
- Most important metric for this work
- Calculate the expected number of switches that may fail in a given system before a performance loss greater than 5% occurs in the corner-turn application
- Corner turn selected due to its high level of network stress and its relevance to real-world signal processing applications
- Failure of multiplexer devices is not explicitly considered analytically, but must be discussed
FT Calculation
- Calculation of most entries is trivial (e.g., number of network pins); the FT calculation is slightly more complex and is explained here for completeness
- F = expected number of switch failures tolerated before a loss of connectivity to any endpoint or a 5% drop in performance of our corner-turn application:

  F = \sum_{i=1}^{N} i \cdot P_i \cdot (1 - S_{i-1})

- S_n = probability that a system failure occurred with any number of faults up to and including n:

  S_n = \sum_{i=1}^{n} P_i

  where N = number of switches in the system and P_i = probability of a system failure after exactly i faults
- The equation for F is derived from the classical definition of an expected value: the probability of system failure with a given number of faults equals the probability of system failure with exactly that number of faults (P_i), multiplied by the probability that the system has not previously failed with any smaller number of faults (1 - S_{i-1})
- Since lower scores are better in our evaluation, the reciprocal of the expected number of faults is taken prior to normalization (the reciprocal is not shown in Table 8)
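These two formulas can be transcribed directly into a short sketch; the function name and the list representation of the P_i values are my own, not part of the original analysis:

```python
def expected_faults_tolerated(p):
    """Expected number of switch faults tolerated before system failure.

    p is a list where p[i-1] holds P_i, the probability of a system
    failure after exactly i faults, for i = 1..N.
    Implements F = sum_{i=1..N} i * P_i * (1 - S_{i-1}),
    with S_n = sum_{i=1..n} P_i.
    """
    f = 0.0
    s_prev = 0.0  # S_{i-1}: cumulative failure probability before fault i
    for i, p_i in enumerate(p, start=1):
        f += i * p_i * (1.0 - s_prev)
        s_prev += p_i  # advance the running sum to S_i
    return f
```

For example, a system that always survives the first fault but always fails on the second (P_1 = 0, P_2 = 1) yields F = 2.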
Weights and Scoring System

Weights
- Power and size/cost are very important to a space system and are each weighted at 1.0
- FT is the primary focus of this work and also encompasses performance for our purposes; weighted at 2.0
- Fault isolation is weighted at 0.5, since it is based on a simple metric (rerouted switches) that was only a small focus of our investigation

Scoring
- Prior to weighting, scores for each system are normalized, with the best system having a score of 1.0 (lower scores are better)
- Fault isolation is a special case, since the fully redundant baseline has "perfect" fault isolation with 0 switches rerouted in the event of a single fault
- Allow the data to be normalized to the next-best system and give the baseline a score of 0
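The normalization and weighting scheme can be sketched as follows. The raw values come from the results table; the exact normalization rules (ratio to the best system, reciprocal-style normalization for fault tolerance where higher raw values are better, and next-best normalization for fault isolation) are my inference from the slides, and small rounding differences versus the published totals are expected:

```python
WEIGHTS = {"power": 1.0, "size": 1.0, "iso": 0.5, "ft": 2.0}

# Raw values per architecture: (active ports, total network pins,
# avg. rerouted switches, expected switch faults tolerated)
RAW = {
    "baseline":         (96, 7680, 0.0, 2.0),
    "rfs":              (128, 10240, 2.67, 2.37),
    "rfs_extra":        (128, 12160, 2.67, 3.95),
    "rfs_serial":       (96, 3072, 5.33, 2.37),
    "rfs_extra_serial": (96, 3584, 5.33, 3.95),
    "ftc":              (96, 7200, 6.0, 2.89),
}

def total_score(name):
    power, size, iso, ft = RAW[name]
    best_power = min(r[0] for r in RAW.values())  # lower raw is better
    best_size = min(r[1] for r in RAW.values())   # lower raw is better
    # Fault isolation: the baseline's "perfect" 0 is excluded, so all
    # systems are normalized to the next-best value (baseline scores 0).
    next_best_iso = min(r[2] for r in RAW.values() if r[2] > 0)
    best_ft = max(r[3] for r in RAW.values())     # higher raw is better
    return (WEIGHTS["power"] * power / best_power
            + WEIGHTS["size"] * size / best_size
            + WEIGHTS["iso"] * iso / next_best_iso
            + WEIGHTS["ft"] * best_ft / ft)       # reciprocal: lower = better
```

Running this reproduces the table's totals to within rounding, e.g. about 7.45 for the baseline Clos network.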
Quantitative Results and Analysis
- Lower normalized scores are better; the total score is the sum of the normalized scores after weighting
- Most architectures had similar power consumption, with mux-based architectures at a slight disadvantage due to the extra powered devices
- Large differences in size/cost arise from the widely varied ways of providing FT
  - Serial RIO architectures have the edge due to their low pin count and lack of muxes
  - FTC provides a promising compromise between the other alternatives due to its number of muxes
- The fault isolation metric of the serial and FTC solutions suffers due to the additional switch reconfigurations needed (rather than mux reconfigurations)
  - Muxes in the other architectures may provide additional fault isolation and are trivial to reconfigure
- All architectures provide better FT than the baseline
  - Extra-switch core networks with a redundant first stage may withstand nearly 4 faults
  - Adding 1 core switch actually increases expected FT by more than 1 switch
- Overall, the serial RIO-based architectures scored the best (lowest), with the FTC network providing an interesting compromise for parallel solutions in terms of all factors except fault isolation
Each metric cell lists raw value / normalized score.

| Category | Power (active ports) | Size/Cost (total network pins) | Fault Isolation (avg. rerouted switches) | Fault Tolerance (number of switch faults) | Total Score |
|---|---|---|---|---|---|
| Weight | 1.0 | 1.0 | 0.5 | 2.0 | |
| Baseline Clos Network | 96 / 1.0 | 7680 / 2.5 | 0 / 0 | 2.0 / 1.98 | 7.45 |
| Redundant First Stage Network | 128 / 1.33 | 10240 / 3.33 | 2.67 / 1.0 | 2.37 / 1.67 | 8.51 |
| Redundant First Stage Network with Extra-switch Core | 128 / 1.33 | 12160 / 3.96 | 2.67 / 1.0 | 3.95 / 1.0 | 7.79 |
| Redundant First Stage Network (Serial RIO) | 96 / 1.0 | 3072 / 1.0 | 5.33 / 2.0 | 2.37 / 1.67 | 6.34 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 96 / 1.0 | 3584 / 1.17 | 5.33 / 2.0 | 3.95 / 1.0 | 5.17 |
| RapidIO Fault-Tolerant Clos Network | 96 / 1.0 | 7200 / 2.34 | 6 / 2.25 | 2.89 / 1.37 | 7.20 |
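As a consistency check on the size/cost column, the pin totals can be reproduced from the switch-port and mux counts in the summary of architectural characteristics. The per-port pin counts here (roughly 40 pins per parallel RapidIO port, 16 per 4x serial port) and the convention of counting all mux ports (e.g., 12 ports for an 8:4 mux, 15 for a 10:5 mux, 5 for a 4:1 mux) are my inference, chosen to match the table, not figures stated in the slides:

```python
# Assumed per-port pin counts, inferred to reproduce the size/cost column:
PARALLEL_PINS = 40  # 8-bit parallel LVDS RapidIO port (assumption)
SERIAL_PINS = 16    # 4x serial RapidIO port (assumption)

def network_pins(total_switch_ports, mux_count, mux_ports, serial=False):
    """Total network pins = (switch ports + all mux ports) * pins per port."""
    per_port = SERIAL_PINS if serial else PARALLEL_PINS
    return (total_switch_ports + mux_count * mux_ports) * per_port
```

Under these assumptions the function reproduces every raw size/cost entry, e.g. 192 parallel switch ports with no muxes gives the baseline's 7680 pins.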
Supplementary Information
Summary of Basic Architectural Characteristics

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Baseline Clos Network | 12 | 12 | 24 | 8 | 192 | 0 | 0 | 0 |
| Redundant First Stage Network | 12 | 8 | 20 | 8 | 160 | 8 (8:4) | 0 | 8 |
| Redundant First Stage Network with Extra-switch Core | 12 | 9 | 21 | 8 | 184 | 8 (10:5) | 0 | 8 |
| Redundant First Stage Network (Serial RIO) | 12 | 8 | 20 | 8 | 192 | 0 | 4 | 8 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 12 | 9 | 21 | 8 | 224 | 0 | 4 | 8 |
| RapidIO Fault-Tolerant Clos Network | 12 | 3 | 15 | 8 | 140 | 8 (4:1) | 5 | 8 |
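The per-level reroute counts in this table relate to the averaged fault-isolation values in the results table as a switch-count-weighted mean. A sketch, assuming 8 first-level and 4 second-level active switches (this split is my assumption; it reproduces the published averages):

```python
def avg_rerouted(reroute_level1, reroute_level2, n1=8, n2=4):
    """Average switches rerouted per fault, with a fault equally likely
    in any of the n1 + n2 active switches."""
    return (n1 * reroute_level1 + n2 * reroute_level2) / (n1 + n2)
```

For instance, the Redundant First Stage Network's 0 first-level / 8 second-level reroutes average to about 2.67, matching its fault-isolation raw score.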
Baseline Clos Network
- Non-blocking architecture supporting 32 RapidIO endpoints
- FT accomplished by completely duplicating the network (redundant network not shown)
- Withstands 1 switch fault while maintaining full connectivity
Redundant First Stage Network
- Similar to the baseline, but the first level has switch-by-switch failover using components that multiplex 8 RapidIO links down to 4
- Must consider muxes as a potential point of failure
- Second-level FT handled by redundant-paths routing
- Full connectivity maintained as long as 1 of 4 switches remains functional
- Could also supplement with a redundant second level using switch-by-switch failover, at the cost of more complex multiplexing circuitry
- Muxes may present a single point of failure, so processor-level redundancy may be needed
Redundant First Stage Network: Extra-Switch Core
- Adds an additional core switch to the redundant first stage network
- Switch may be left inactive and used in the event of a fault
- Second-level FT handled by redundant-paths routing
- Requires switches with at least 9 ports in the first level and 8 ports in the second level
- Multiplexers must be 10:5 rather than 8:4
Redundant First Stage Network: No Muxes
- Muxes add additional complexity and may be a point of failure
- May be challenging to build LVDS mux components
- Design requires 16-port switches in the backplane, but only 8 active ports are needed per switch
- High port-count switches will be enabled through space-qualified serial RapidIO
- For future serial RIO, assume Honeywell HX5000 SerDes with 3.125 GHz x 4 lanes (possible per Honeywell High-Speed Data Networking Tech. data sheet, June '05)
- Roughly equivalent to 16-bit, 312.5 MHz DDR parallel RIO
- For this research, parallel RIO clock rates are used for a fair comparison
Redundant First Stage Network: No Muxes + Extra-Switch Core
- Combines the methodologies of the previous two architectures shown
- Requires 9-port switches in the first level and 16-port switches in the second level
- Realistically attainable using serial RIO
- Availability of a 32-port serial switch would greatly simplify the design (1-switch network!)
- Preferred FT approach would tend towards the "redundant network" approach for fabrics of these sizes
Fault-Tolerant Clos Network
- Architecture studied at NJIT in the 1990s, adapted here for RapidIO
- Uses multiplexers (4:1) for more efficient redundancy in the first level
- Only requires 1 redundant switch for every 4 switches in the first stage
- Multiplexer components are no longer a potential single point of failure for the connectivity of any processors
- Has an additional switch in the second level, similar to the other architectures shown
- Requires 9-port switches in the first level and 10-port switches in the second level
- A 24-endpoint version is possible using only 8-port switches and 3:1 muxes
- Can withstand 1 first-level fault on either half of the network with no loss in functionality or performance
- Compromise versus the fully-redundant first-stage approaches in terms of FT and size/weight/cost