CS294 Project
Virtual and Redundant Switches
IRAM Retreat – Winter 2001
Sam Williams
Outline
• Motivation
• Existing Products
• Arrayed Commodity Switches
• Adding Redundancy
• Optimizing
• Generalization
• Conclusions
Motivation
• Cost of switches grows very quickly: O(Ports²) for crossbar-based designs. Additionally, address tables and buffers must grow.
• Industry-leading MTBF for a single switch is about 50K hours, and typical is perhaps only 25K.
• Modular Switches provide redundancy for management and power, but not the data transport fabric.
• MTTR is typically over 1 hour
• Can the money saved by cascading commodity switches be applied towards improved performance or redundancy?
• The goals are to improve the MTBF, improve performance, and simplify the work that must be done to replace a failed switch.
Existing Products
• Existing modular aggregators can merge several smaller switches (modules) into a single large virtual switch.
• In this case, each 36 port switch module has a pair of gigabit uplinks to the switching fabric, which has either 6 or 24 gigabit ports (full duplex)
• Redundancy is also provided for management modules, fans, and power supplies.
• However, redundancy is not provided for the switching modules or the switching fabric. If the switching fabric fails, the entire device fails; if an individual switching module fails, only that sub-network fails.
• Management modules can assign priority to improve performance for critical activity.
[Diagram: 3Com Switch 4007 - management module; 4 x 36-port switching modules, each with 2 gigabit uplinks; 120 Gbps backplane (16 Gbps used). Logical view: switching fabric with 24 internal gigabit ports]
Existing Products (Analysis)
• The cost analysis here is based on use of either 18 or 48 Gbps switching fabrics, 36-port switching modules, and either a 7- or 13-bay chassis.
• Performance is the slowdown in the time to send from every node to every other node, compared to a true n*36 port switch.
• MTBF is for a failure of any part of the network.
• MTTR was at least 1 hour.
• Repair cost is about $4000/failure – modularization helps to keep this low, but yearly maintenance cost will grow with the number of ports
[Charts vs. number of ports (0-500): cost ($0-$70,000), slowdown compared to an N-port switch (0.5-2.0), and MTBF in days (0-400)]
Examples of failure
• If a switching module fails, each of the attached nodes/sub-networks is now disconnected from all other nodes. This is the more likely case.
• If the switching fabric fails, each of the switches is disconnected from the others, but nodes attached to the same switch can still communicate with each other.
Examples of failure (continued)
• Redundancy allows for this failure, with reduced performance.
• These are not commodity switches, and are considerably more expensive.
• However, in this case, the failure does cause a network split.
• This is the more likely case, so why not allow the extra switch to be used to cover any other switch's failure?
• Could be extended to nodes, but then you pay double for NICs and ports.
Virtual switch from commodity switches
• Although without the management functions and performance, cheaper virtual switches can be built by nothing more than cascading them.
• This is based on 5-, 8-, 16-, and 24-port switches, each with the last port MDI type, and from 5 different companies.
• Performance is poor since the uplinks are only 100Mbps
• Adding a second uplink port only moderately alleviates this deficiency
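The uplink bottleneck can be estimated with a simple oversubscription model (a back-of-envelope sketch of my own; the formula and parameter choices are assumptions, not the deck's actual analysis): under all-to-all traffic, the fraction of each node's traffic that must leave its switch is remote/(total - 1), and the slowdown is roughly the offered cross-switch load divided by the uplink capacity.

```python
def all_to_all_slowdown(ports, switches, port_mbps, uplink_mbps):
    """Rough slowdown vs. a monolithic switch under all-to-all traffic.

    Each node spreads its traffic evenly over the other nodes, so the
    fraction crossing the uplink is remote/(total - 1).  The uplink is
    the bottleneck once the aggregate cross traffic exceeds its rate.
    """
    total = ports * switches
    remote_fraction = (total - ports) / (total - 1)
    cross_load = ports * port_mbps * remote_fraction  # Mbps offered to the uplink
    return max(1.0, cross_load / uplink_mbps)

# 24-port commodity switches with a single 100 Mbps uplink: heavily oversubscribed
print(all_to_all_slowdown(ports=24, switches=4, port_mbps=100, uplink_mbps=100))
```

With a gigabit uplink instead, the same formula gives a slowdown under 2, consistent with the qualitative improvement of the mid-range design on the next slide.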
[Charts vs. number of ports (0-600): cost ($0-$14,000), slowdown (0-25), and MTBF in days (0-200)]
Virtual switch from mid-range switches
• By using switches more suited to this design (higher-speed uplinks), we can improve performance.
• These designs use an 8- or 24-port switch at the bottom, each with 1 or 2 gigabit uplink modules, and a 4-, 8-, or 12-port gigabit switch at the top.
• The gigabit uplinks and gigabit switches drive the cost to at least twice that of the commodity solution, but with 10x better performance.
• Performance is near that of a monolithic switch if 2 uplinks are used.
• Compared to the packaged solution, it is about half the cost with slightly lower performance, but no management functionality.
[Charts vs. number of ports (0-300): cost ($0-$25,000) and slowdown (0.5-2.5)]
Port Virtualization for Redundancy
• The re-mapping stage is much simpler than a full n*m port switch. Essentially, each of the m n-bit busses is mapped to one of the k n-bit internal busses, which are connected directly to the switches.
• For this example, each of the 4 groups of 8 virtual ports is mapped to one of the 5 groups of physical ports. The uplinks of the first-stage switches are routed back into one of the top-level switches.
• An even simpler solution, for single redundancy, would be to map each group either directly or to the spare.
• In this design the single point of failure is the re-mapping block, since the first- and second-level switches have redundancy.
• So for the example below, MTBF is improved by about 50% (from 208 days to 347 days)
[Diagram: 4 groups of 8 virtual ports feeding a port re-mapping block, with extra switches for redundancy]
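The map-direct-or-to-spare scheme can be sketched as a tiny lookup table (a hypothetical illustration; the group names and single named spare are my assumptions, not the actual hardware): each virtual port group normally maps to its own physical group, and on failure only that group is redirected to the spare.

```python
def build_mapping(groups, failed=None, spare="spare"):
    """Map each virtual port group to a physical switch group.

    Normally the mapping is the identity; if one group's switch has
    failed, that group alone is redirected to the single spare.
    """
    mapping = {g: g for g in groups}  # direct mapping by default
    if failed is not None:
        if failed not in mapping:
            raise ValueError("unknown group")
        mapping[failed] = spare       # redirect only the failed group
    return mapping

groups = [0, 1, 2, 3]                    # 4 groups of 8 virtual ports
print(build_mapping(groups))             # healthy: identity mapping
print(build_mapping(groups, failed=2))   # group 2 redirected to the spare
```

The point of the sketch is that the per-group state is one table entry, which is why this mapper is so much cheaper than a full n*m switch.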
Operation (Homogeneous switches)
• In this somewhat rigid example, there are 6 bays: 4 map either directly or to the spare, one is a switching-fabric slot, and one is a slot for the redundant switch, which can replace either of the other two classes.
• In this case, the switching fabric switch failed, and the uplink ports were remapped to the spare.
• At this point the admin must replace the failed switch. If any other switch fails before this, the network will be partially split.
Operation - continued
• In this case, one of the first-level switches failed. Instead of those nodes losing connection to the rest of the network, they are remapped to the spare.
• Once again, the admin must replace the failed switch. If any other switch fails before this, the network will be partially split.
• If instead the spare itself had gone down, it would need to be replaced to restore redundancy.
Port Virtualization for Higher Performance
• The previous performance analysis was based on “1-to-all” messaging.
• However, it is likely that network access patterns can be broken into groups of high inter-node communication.
• Thus monitoring can be performed, and the network can be periodically partitioned into activity groups.
• Create a graph based on bandwidth used between nodes, then use something like Kernighan–Lin partitioning to separate it into a number of partitions equal to the number of first-stage switches (a power of 2).
• The re-mapping stage is only slightly simpler than a full n*m port switch (no buffers, never any contention, etc…)
[Diagram: logical view of 8 switches divided into partitions 1-4, with 3 switches reserved as spares; switch 2 failed, and the network was repartitioned]
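The partitioning step can be sketched with a simplified Kernighan–Lin-style bisection (a greedy variant of my own, not the deck's actual algorithm): starting from an arbitrary split, repeatedly swap the pair of nodes whose exchange most reduces the bandwidth crossing the cut. Recursing on each half yields a power-of-2 number of partitions.

```python
def cost_to(node, group, w):
    """Total measured bandwidth between `node` and every node in `group`."""
    return sum(w.get(frozenset((node, v)), 0) for v in group if v != node)

def kl_bisect(a, b, w):
    """Greedy Kernighan-Lin-style bisection of the node sets a and b.

    Repeatedly swap the (x, y) pair with the largest positive gain,
    where gain = D(x) + D(y) - 2*w(x, y) and D(v) is external minus
    internal bandwidth.  Stops once no swap reduces the cut.
    """
    a, b = set(a), set(b)
    while True:
        best_gain, best_pair = 0, None
        for x in a:
            dx = cost_to(x, b, w) - cost_to(x, a, w)
            for y in b:
                dy = cost_to(y, a, w) - cost_to(y, b, w)
                gain = dx + dy - 2 * w.get(frozenset((x, y)), 0)
                if gain > best_gain:
                    best_gain, best_pair = gain, (x, y)
        if best_pair is None:
            return a, b
        x, y = best_pair
        a.remove(x); a.add(y)   # each swap strictly reduces the cut,
        b.remove(y); b.add(x)   # so the loop terminates

# Nodes 0,1 and 2,3 each exchange heavy traffic; start from a bad split.
w = {frozenset((0, 1)): 10, frozenset((2, 3)): 10,
     frozenset((0, 2)): 1, frozenset((1, 3)): 1}
print(kl_bisect([0, 2], [1, 3], w))  # heavy talkers end up together
```

The real Kernighan–Lin algorithm also explores locked swap sequences to escape local minima; the greedy version above keeps the sketch short.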
Performance / Availability
• MTTR for aggregators was typically over an hour. This is on top of the time to detect the failure.
• By automating recovery, the downtime can be significantly reduced
• This is dependent on timely detection of a failed switch, which could be handled via packet injection.
• Once the failing switch is determined, a new mapping can quickly be determined.
• For the performance-optimizing case, satisfying connectivity is the top priority; a previously scheduled performance repartitioning can be done later.
[Timelines of performance vs. time: (1) hard fail, fail detected, switches adapt, then repartition for performance; (2) hard fail, fail detected, switches adapt; (3) hard fail, admin notices and fixes the failure, switches adapt]
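The automated recovery loop can be sketched as a small controller (a hypothetical sketch; the probe mechanism, class, and names are illustrative assumptions, not the deck's design): an injected-packet probe detects a dead switch, the mapper immediately redirects its port group to a spare to restore connectivity, and the performance repartition is deferred.

```python
class RecoveryController:
    """Sketch of the detect-then-remap loop: restore connectivity first,
    defer performance repartitioning until later."""

    def __init__(self, switches, spares):
        self.mapping = {s: s for s in switches}  # port group -> physical switch
        self.spares = list(spares)
        self.repartition_pending = False

    def probe(self, alive):
        """`alive` is the set of switches that answered injected packets."""
        for group, phys in self.mapping.items():
            if phys not in alive:
                self.fail_over(group)

    def fail_over(self, group):
        if not self.spares:
            raise RuntimeError("no spares left: network is partially split")
        self.mapping[group] = self.spares.pop(0)  # restore connectivity now
        self.repartition_pending = True           # optimize placement later

ctrl = RecoveryController(switches=["sw1", "sw2", "sw3"], spares=["spare1"])
ctrl.probe(alive={"sw1", "sw3"})   # sw2 missed its probe
print(ctrl.mapping["sw2"])         # -> spare1
```

Keeping the failover path trivial (pop a spare, rewrite one table entry) is what lets the adapted-switches point on the timeline arrive in seconds rather than the hour-plus manual MTTR.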
Generalization
• Use homogeneous switches. There is a mapping layer which maps physical to virtual ports. This can range from a simple 1-to-2 mapping to a complex 1-to-n mapping with performance monitoring and repartitioning. Performance can be gained by using some faster switches where needed.
[Diagram: switches 1-8 plus extra switches for redundancy or extra performance, above a monitor and port re-mapping block]
#  Description                                                Fails  Performance  Cost
0  Array of switches                                          0      Low          N switches
1  Single redundancy                                          1      Low          N switches + 1 + trivial mapper
2  R-way redundancy                                           R      Low          N switches + R + general mapper
3  Array of switches with partitioning                        0      Adaptive     N switches + expensive mapper
4  R-way redundancy with partitioning                         R      Adaptive     N switches + R + expensive mapper
5  R-way redundancy with partitioning and total utilization   R      Adaptive     N switches + R + expensive mapper
Conclusion
• It is possible to make a larger virtual switch out of smaller switches, and still get reasonable performance.
• With little additional hardware and a monitoring agent, it is possible to make it fault tolerant, with several spare switches which can be automatically swapped in. The simple case costs ~O(Spares * Ports); more complex designs make it O(Ports²).
• With a very simple, but large switch, it is possible to also optimize for performance by balancing network bandwidth among switches in the pool. This is a much more costly solution.
• A generalization would provide a pool of switches connected by the port mapper, and some or none reserved as spares.
• Both of these concepts and their functionality could be integrated into a single ASIC, or even implemented on a network processor.
Future Work
• How do switches fail? This determines the failure-detection method.
• Implementation of a type 1 or 2 switch would be possible given the relatively simple mapper.
• A type 3, 4, or 5 switch would require a complex ASIC, which could instead be replaced with a network processor and software.