Quantitative Analysis of Fault-Tolerant RapidIO-based Network Architectures
David Bueno, April 27, 2006
HCS Research Laboratory, ECE Department, University of Florida
Motivation
- Gain further insight into the strengths/weaknesses of the proposed architectures
- Quantify power, size/cost, fault isolation, and fault tolerance while maintaining a fixed level of performance
- Provide flexible, weighted evaluation criteria that other users may modify to fit their needs
- Avoid excessive complexity by using fair heuristics to estimate power, size/cost, fault isolation, and fault tolerance
Evaluation Criteria Overview

Power
- Very important in nearly every embedded system
- Evaluate power based on the number of active ports under no-fault conditions
- Conservatively assume multiplexer ports use 50% of the power of a full RapidIO switch port (much less logic needed; they only need to multiplex and repeat the LVDS signal)

Size/cost
- Consider size/cost to be determined by the total number of network pins across all chips in the network fabric
- Fairest way to treat serial/parallel RapidIO pin-count considerations
- Means multiplexer chips are costly due to their high pin count

Fault isolation
- Measure of how much a fault affects other components in the system
- The classic approach of fully redundant networks provides near-perfect fault isolation
- Measured as the average number of switches that must be rerouted in the event of a switch fault, assuming a fault may occur in any active switch with equal likelihood
- Ideally, switches should be unaware of and unaffected by faults in the system

Fault tolerance
- Most important metric for this work
- Calculate the expected number of switches that may fail in a given system before a performance loss greater than 5% occurs in the corner-turn application
- Corner turn selected due to its high level of network stress and its relevance to real-world signal processing applications
- Failure of multiplexer devices is not explicitly considered analytically, but must be discussed
FT Calculation
- Calculation of most entries is trivial (e.g., number of network pins); the FT calculation is slightly more complex and is explained here for completeness
- F = expected number of switch failures tolerated before a loss of connectivity to any endpoint or a 5% drop in performance of our corner-turn application:

  F = \sum_{i=1}^{N} i \cdot P_i \cdot (1 - S_{i-1})

- S_n = probability that a system failure occurred with any number of faults up to and including n:

  S_n = \sum_{i=1}^{n} P_i

  where N = number of switches in the system and P_i = probability of a system failure after exactly i faults
- The equation for F is derived from the classical definition of an expected value: the probability of system failure with a given number of faults equals the probability of system failure with exactly that number of faults (P_i), multiplied by the probability that the system has not previously failed with any smaller number of faults (1 - S_{i-1})
- Since lower scores are better in our evaluation, the reciprocal of the expected number of faults is taken prior to normalization (the reciprocal is not shown in Table 8)
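These two formulas can be transcribed directly into a short sketch; the function name and the list representation of the P_i values are my own, not part of the original analysis:

```python
def expected_faults_tolerated(p):
    """Expected number of switch faults tolerated before system failure.

    p is a list where p[i-1] holds P_i, the probability of a system
    failure after exactly i faults, for i = 1..N.
    Implements F = sum_{i=1..N} i * P_i * (1 - S_{i-1}),
    with S_n = sum_{i=1..n} P_i.
    """
    f = 0.0
    s_prev = 0.0  # S_{i-1}: cumulative failure probability before fault i
    for i, p_i in enumerate(p, start=1):
        f += i * p_i * (1.0 - s_prev)
        s_prev += p_i  # advance the running sum to S_i
    return f
```

For example, a system that always survives the first fault but always fails on the second (P_1 = 0, P_2 = 1) yields F = 2.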
Weights and Scoring System

Weights
- Power and size/cost are very important to a space system and are each weighted at 1.0
- FT is the primary focus of this work and also encompasses performance for our purposes; weighted at 2.0
- Fault isolation is weighted at 0.5, since it is based on a simple metric (rerouted switches) that was only a small focus of our investigation

Scoring
- Prior to weighting, scores for each system are normalized, with the best system having a score of 1.0 (lower scores are better)
- Fault isolation is a special case, since the fully redundant baseline has "perfect" fault isolation with 0 switches rerouted in the event of a single fault
- Allow the data to be normalized to the next-best system and give the baseline a score of 0
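The normalization and weighting scheme can be sketched as follows. The raw values come from the results table; the exact normalization rules (ratio to the best system, reciprocal-style normalization for fault tolerance where higher raw values are better, and next-best normalization for fault isolation) are my inference from the slides, and small rounding differences versus the published totals are expected:

```python
WEIGHTS = {"power": 1.0, "size": 1.0, "iso": 0.5, "ft": 2.0}

# Raw values per architecture: (active ports, total network pins,
# avg. rerouted switches, expected switch faults tolerated)
RAW = {
    "baseline":         (96, 7680, 0.0, 2.0),
    "rfs":              (128, 10240, 2.67, 2.37),
    "rfs_extra":        (128, 12160, 2.67, 3.95),
    "rfs_serial":       (96, 3072, 5.33, 2.37),
    "rfs_extra_serial": (96, 3584, 5.33, 3.95),
    "ftc":              (96, 7200, 6.0, 2.89),
}

def total_score(name):
    power, size, iso, ft = RAW[name]
    best_power = min(r[0] for r in RAW.values())  # lower raw is better
    best_size = min(r[1] for r in RAW.values())   # lower raw is better
    # Fault isolation: the baseline's "perfect" 0 is excluded, so all
    # systems are normalized to the next-best value (baseline scores 0).
    next_best_iso = min(r[2] for r in RAW.values() if r[2] > 0)
    best_ft = max(r[3] for r in RAW.values())     # higher raw is better
    return (WEIGHTS["power"] * power / best_power
            + WEIGHTS["size"] * size / best_size
            + WEIGHTS["iso"] * iso / next_best_iso
            + WEIGHTS["ft"] * best_ft / ft)       # reciprocal: lower = better
```

Running this reproduces the table's totals to within rounding, e.g. about 7.45 for the baseline Clos network.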
Quantitative Results and Analysis
- Lower normalized scores are better; the total score is the sum of the normalized scores after weighting
- Most architectures had similar power consumption, with mux-based architectures at a slight disadvantage due to the extra powered devices
- Large differences in size/cost arise from the widely varied ways of providing FT
  - Serial RIO architectures have the edge due to their low pin count and lack of muxes
  - FTC provides a promising compromise between the other alternatives due to its number of muxes
- The fault isolation metric of the serial and FTC solutions suffers due to the additional switch reconfigurations needed (rather than mux reconfigurations)
  - Muxes in the other architectures may provide additional fault isolation and are trivial to reconfigure
- All architectures provide better FT than the baseline
  - Extra-switch core networks with a redundant first stage may withstand nearly 4 faults
  - Adding 1 core switch actually increases expected FT by more than 1 switch
- Overall, the serial RIO-based architectures scored the best (lowest), with the FTC network providing an interesting compromise for parallel solutions in terms of all factors except fault isolation
Each metric cell lists raw value / normalized score.

| Category | Power (active ports) | Size/Cost (total network pins) | Fault Isolation (avg. rerouted switches) | Fault Tolerance (number of switch faults) | Total Score |
|---|---|---|---|---|---|
| Weight | 1.0 | 1.0 | 0.5 | 2.0 | |
| Baseline Clos Network | 96 / 1.0 | 7680 / 2.5 | 0 / 0 | 2.0 / 1.98 | 7.45 |
| Redundant First Stage Network | 128 / 1.33 | 10240 / 3.33 | 2.67 / 1.0 | 2.37 / 1.67 | 8.51 |
| Redundant First Stage Network with Extra-switch Core | 128 / 1.33 | 12160 / 3.96 | 2.67 / 1.0 | 3.95 / 1.0 | 7.79 |
| Redundant First Stage Network (Serial RIO) | 96 / 1.0 | 3072 / 1.0 | 5.33 / 2.0 | 2.37 / 1.67 | 6.34 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 96 / 1.0 | 3584 / 1.17 | 5.33 / 2.0 | 3.95 / 1.0 | 5.17 |
| RapidIO Fault-Tolerant Clos Network | 96 / 1.0 | 7200 / 2.34 | 6 / 2.25 | 2.89 / 1.37 | 7.20 |
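As a consistency check on the size/cost column, the pin totals can be reproduced from the switch-port and mux counts in the summary of architectural characteristics. The per-port pin counts here (roughly 40 pins per parallel RapidIO port, 16 per 4x serial port) and the convention of counting all mux ports (e.g., 12 ports for an 8:4 mux, 15 for a 10:5 mux, 5 for a 4:1 mux) are my inference, chosen to match the table, not figures stated in the slides:

```python
# Assumed per-port pin counts, inferred to reproduce the size/cost column:
PARALLEL_PINS = 40  # 8-bit parallel LVDS RapidIO port (assumption)
SERIAL_PINS = 16    # 4x serial RapidIO port (assumption)

def network_pins(total_switch_ports, mux_count, mux_ports, serial=False):
    """Total network pins = (switch ports + all mux ports) * pins per port."""
    per_port = SERIAL_PINS if serial else PARALLEL_PINS
    return (total_switch_ports + mux_count * mux_ports) * per_port
```

Under these assumptions the function reproduces every raw size/cost entry, e.g. 192 parallel switch ports with no muxes gives the baseline's 7680 pins.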
Supplementary Information
Summary of Basic Architectural Characteristics

| Architecture | Active Switches | Standby Switches | Total Switches | Active Ports per Switch | Total Switch Ports | Mux Count | Switches to Reroute (1st-level fault) | Switches to Reroute (2nd-level fault) |
|---|---|---|---|---|---|---|---|---|
| Baseline Clos Network | 12 | 12 | 24 | 8 | 192 | 0 | 0 | 0 |
| Redundant First Stage Network | 12 | 8 | 20 | 8 | 160 | 8 (8:4) | 0 | 8 |
| Redundant First Stage Network with Extra-switch Core | 12 | 9 | 21 | 8 | 184 | 8 (10:5) | 0 | 8 |
| Redundant First Stage Network (Serial RIO) | 12 | 8 | 20 | 8 | 192 | 0 | 4 | 8 |
| Redundant First Stage Network with Extra-switch Core (Serial RIO) | 12 | 9 | 21 | 8 | 224 | 0 | 4 | 8 |
| RapidIO Fault-Tolerant Clos Network | 12 | 3 | 15 | 8 | 140 | 8 (4:1) | 5 | 8 |
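The per-level reroute counts in this table relate to the averaged fault-isolation values in the results table as a switch-count-weighted mean. A sketch, assuming 8 first-level and 4 second-level active switches (this split is my assumption; it reproduces the published averages):

```python
def avg_rerouted(reroute_level1, reroute_level2, n1=8, n2=4):
    """Average switches rerouted per fault, with a fault equally likely
    in any of the n1 + n2 active switches."""
    return (n1 * reroute_level1 + n2 * reroute_level2) / (n1 + n2)
```

For instance, the Redundant First Stage Network's 0 first-level / 8 second-level reroutes average to about 2.67, matching its fault-isolation raw score.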
Baseline Clos Network
- Non-blocking architecture supporting 32 RapidIO endpoints
- FT accomplished by completely duplicating the network (redundant network not shown)
- Withstands 1 switch fault while maintaining full connectivity
Redundant First Stage Network
- Similar to the baseline, but the first level has switch-by-switch failover using components that multiplex 8 RapidIO links down to 4
- Must consider muxes as a potential point of failure
- Second-level FT handled by redundant-paths routing
- Full connectivity maintained as long as 1 of 4 switches remains functional
- Could also supplement with a redundant second level using switch-by-switch failover, at the cost of more complex multiplexing circuitry
- Muxes may present a single point of failure, so processor-level redundancy may be needed
Redundant First Stage Network: Extra-Switch Core
- Adds an additional core switch to the redundant first stage network
- Switch may be left inactive and used in the event of a fault
- Second-level FT handled by redundant-paths routing
- Requires switches with at least 9 ports in the first level and 8 ports in the second level
- Multiplexers must be 10:5 rather than 8:4
Redundant First Stage Network: No Muxes
- Muxes add additional complexity and may be a point of failure
- May be challenging to build LVDS mux components
- Design requires 16-port switches in the backplane, but only 8 active ports are needed per switch
- High port-count switches will be enabled through space-qualified serial RapidIO
- For future serial RIO, assume Honeywell HX5000 SerDes with 3.125 GHz x 4 lanes (possible per Honeywell High-Speed Data Networking Tech. data sheet, June '05)
- Roughly equivalent to 16-bit, 312.5 MHz DDR parallel RIO
- For this research, parallel RIO clock rates are used for a fair comparison
Redundant First Stage Network: No Muxes + Extra-Switch Core
- Combines the methodologies of the previous two architectures shown
- Requires 9-port switches in the first level and 16-port switches in the second level
- Realistically attainable using serial RIO
- Availability of a 32-port serial switch would greatly simplify the design (1-switch network!)
- Preferred FT approach would tend towards the "redundant network" approach for fabrics of these sizes
Fault-Tolerant Clos Network
- Architecture studied at NJIT in the 1990s, adapted here for RapidIO
- Uses multiplexers (4:1) for more efficient redundancy in the first level
- Only requires 1 redundant switch for every 4 switches in the first stage
- Multiplexer components are no longer a potential single point of failure for the connectivity of any processors
- Has an additional switch in the second level, similar to the other architectures shown
- Requires 9-port switches in the first level and 10-port switches in the second level
- A 24-endpoint version is possible using only 8-port switches and 3:1 muxes
- Can withstand 1 first-level fault on either half of the network with no loss in functionality or performance
- Compromise versus the fully-redundant first-stage approaches in terms of FT and size/weight/cost