Overlay Networks
Xiaohui (Helen) Gu
Overlay Networks
• Virtual application-level network built on top of the physical network
[Figure: a service overlay network layered over the physical network]
Why Overlays?
• Improve IP-layer network services
  – Resilience
• Overcome IP service deployment challenges
  – Multicast
Resilient Overlay Networks (RONs)
• RONs seek to quickly detect and respond to network failures
• IP routing recovery can take minutes or even hours
Resilient Overlay Networks (RONs)
• RONs seek to quickly detect and respond to network failures
  – Network nodes participate in a limited-size overlay network
  – Overlay nodes cooperate with one another to forward data on behalf of any other node in the RON
  – RON detects problems by aggressively probing the paths connecting its nodes
  – RON nodes exchange information about the quality of paths among themselves, and build forwarding tables based on a variety of path metrics
    • Latency, packet loss, and available throughput
Resilient Overlay Networks Goals
• Failure detection and recovery in less than 20 seconds
• Tighter integration of routing and path selection with the application
• Expressive policy routing
Active Probing
• RON probes every other node once every PROBE_INTERVAL, plus a random jitter of up to 1/3 PROBE_INTERVAL
• A probe not returned within PROBE_TIMEOUT is considered lost
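The probe schedule above can be sketched as follows. The numeric value of PROBE_INTERVAL is an assumption for illustration; the slide names the constant but not its value.

```python
import random

PROBE_INTERVAL = 12.0  # seconds; assumed value, not given on the slide

def next_probe_delay():
    """Time until the next probe to a peer: PROBE_INTERVAL plus a random
    jitter of up to 1/3 PROBE_INTERVAL. The jitter keeps probes from
    different nodes from synchronizing into bursts."""
    return PROBE_INTERVAL + random.uniform(0, PROBE_INTERVAL / 3)
```

The resulting delay always falls between PROBE_INTERVAL and 4/3 of PROBE_INTERVAL.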
Link-State Dissemination
• RON nodes disseminate their performance metrics to the other nodes every ROUTING_INTERVAL
• This information is sent over the RON overlay
• The only time a RON node has incomplete information about another node is when it is completely cut off from the overlay
Outage Detection
• On the loss of a probe, several consecutive probes spaced PROBE_TIMEOUT apart are sent out
• If OUTAGE_THRESH probes elicit no response, the path is considered “dead”
• If even one probe gets a response, the high-frequency probing is cancelled
• Paths experiencing outages are rated on their packet-loss history
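The outage-detection rule above is easy to state in code. OUTAGE_THRESH's value is an assumption; the slide names the constant only.

```python
OUTAGE_THRESH = 3  # assumed value; the slide names the constant but not its value

def path_dead(responses):
    """Given the True/False outcomes of the OUTAGE_THRESH high-frequency
    probes sent after a lost probe, declare the path dead only if *none*
    of them elicited a response; a single response cancels the check."""
    return not any(responses[:OUTAGE_THRESH])
```

For example, three silent probes mark the path dead, while one response in the batch keeps it alive.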
Latency and Loss Rate
• Latency is the round-trip time calculated from the probes
  – Latency = A × Latency + (1 − A) × New Sample
  – A is chosen to be 0.9
  – Overall path latency is the SUM of the individual virtual-link latencies
• Loss rate is the average of the last k = 100 probe samples
  – If losses are assumed independent, the overall path success probability is the PRODUCT of the individual virtual-link success probabilities (path loss rate = 1 − ∏(1 − lossᵢ))
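The latency EWMA and the two aggregation rules can be sketched directly from the slide's definitions:

```python
A = 0.9  # smoothing factor from the slide

def update_latency(current, sample):
    """EWMA latency estimate: Latency = A * Latency + (1 - A) * NewSample."""
    return A * current + (1 - A) * sample

def path_latency(link_latencies):
    """Overall path latency is the SUM of the virtual-link latencies."""
    return sum(link_latencies)

def path_loss_rate(link_loss_rates):
    """Assuming independent losses, the path success probability is the
    product of the per-link success probabilities, so the path loss rate
    is one minus that product."""
    success = 1.0
    for p in link_loss_rates:
        success *= (1.0 - p)
    return 1.0 - success
```

For instance, two links each losing 10% of packets give a path loss rate of 1 − 0.9 × 0.9 = 19%, not 1%.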
Throughput
• Throughput is estimated using the TCP throughput equation
  – p is the one-way packet loss probability
    • Estimated as half of the measured two-way packet loss probability
  – rtt is the end-to-end round-trip time
• Throughput cannot be aggregated across virtual links
  – To simplify the selection of throughput-optimized paths, only one intermediate node is considered
• An indirect path is only chosen if it improves throughput by at least 50%
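A sketch of the throughput score and the 50% rule. The simplified Mathis-style form √1.5 / (rtt·√p) is an assumption here; the slide only says a TCP throughput equation is used with one-way loss p estimated as half the two-way loss.

```python
import math

def throughput_score(rtt, two_way_loss):
    """Simplified TCP throughput estimate, roughly sqrt(3/2) / (rtt * sqrt(p))
    in packets per second (the exact equation in RON is an assumption here).
    The one-way loss p is estimated as half the measured two-way loss."""
    p = two_way_loss / 2.0
    return math.sqrt(1.5) / (rtt * math.sqrt(p))

def prefer_indirect(direct_score, indirect_score):
    """An indirect path is chosen only if it improves throughput by 50%."""
    return indirect_score > 1.5 * direct_score
```

Note that because scores are inversely proportional to rtt, halving the round-trip time doubles the estimated throughput.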
Experiment
• The raw measurement data consists of probe packets
• To probe, each RON node independently repeated the following steps:
  – Pick a random node j
  – Pick a probe type from {direct, latency, loss} using round-robin
  – Send the probe to j
  – Delay for a random interval between 1 and 2 seconds
Results
• Two distinct datasets
  – RON1
    • 64 hours between 3/21/2001 and 3/23/2001
    • 12 nodes with 132 distinct paths
    • Traverses 36 different ASes and 74 distinct inter-AS links
  – RON2
    • 85 hours between 5/7/2001 and 5/11/2001
    • 16 nodes with 240 distinct paths
    • Traverses 50 ASes and 118 distinct inter-AS links
Summary
• Resilient overlay networks can greatly improve the reliability of the Internet
• RON was able to overcome 100% (RON1) and 60% (RON2) of the several hundred observed outages
• RON takes 18 seconds on average to detect and recover from a fault
• RON can substantially improve loss rate, latency and TCP throughput
• Forwarding packets via at most one intermediate node is sufficient for fault recovery and latency improvements
Supporting Multicast on the Internet
• At which layer of the Internet architecture should multicast be implemented: the network (IP) layer or the application layer?
Unicast Transmission
[Figure: unicast transmission from a source to end systems at Gatech, CMU, Stanford, and Berkeley via routers]
IP Multicast
• Key architectural decision: add support for multicast in the IP layer
• No duplicate packets
• Highly efficient bandwidth usage
[Figure: routers with multicast support delivering a single copy per link to Berkeley, Gatech, Stanford, and CMU]
Key Concerns with IP Multicast
• Scalability with number of groups
  – Routers maintain per-group state
  – Analogous to per-flow state for QoS guarantees
  – Aggregation of multicast addresses is complicated
• Supporting higher-level functionality is difficult
  – IP multicast is a best-effort multi-point delivery service
  – End systems are responsible for handling higher-level functionality
  – Reliability and congestion control for IP multicast are complicated
• Deployment is difficult and slow
  – ISPs are reluctant to turn on IP multicast
End System Multicast
[Figure: end systems (CMU, Stan1, Stan2, Berk1, Berk2, Gatech) at Stanford, Berkeley, CMU, and Gatech form an overlay tree; routers forward only unicast packets]
Potential Benefits
• Scalability
  – Routers do not maintain per-group state
  – End systems do, but they participate in very few groups
• Easier to deploy
• Potentially simplifies support for higher-level functionality
  – Leverage the computation and storage of end systems
  – For example, for buffering packets, transcoding, and ACK aggregation
  – Leverage existing solutions for unicast congestion control and reliability
Performance Concerns
• Duplicate packets: bandwidth wastage
  – Multiple copies of the same packet may traverse a single physical link
• Increased delay
  – For example, the delay from CMU to Berk1 increases when packets are relayed through intermediate end systems
[Figure: overlay trees over CMU, Stan1, Stan2, Berk1, Berk2, Gatech illustrating duplicate packets and increased delay]
What is an efficient overlay tree?
• Ideally,
  – The delay between the source and receivers is small
  – The number of redundant packets on any physical link is low
• Heuristics:
  – Every member in the tree has a small degree
  – Degree is chosen to reflect the bandwidth of the member's connection to the Internet
[Figure: three overlays over CMU, Stan1, Stan2, Berk1, Berk2, Gatech — an “efficient” overlay, a high-degree (unicast-like) overlay, and a high-latency overlay]
Why is self-organization hard?
• Dynamic changes in group membership
  – Members may join and leave dynamically
  – Members may die
• Limited knowledge of network conditions
  – Members do not know the delay to each other when they join
  – Members probe each other to learn network-related information
  – The overlay must self-improve as more information becomes available
• Dynamic changes in network conditions
  – Delay between members may vary over time due to congestion
Narada Design
• Step 1: construct the mesh
  – “Mesh”: a richer overlay that may have cycles and includes all group members
  – Members have low degrees
  – Shortest-path delay between any pair of members along the mesh is small
• Step 2: construct data-delivery trees
  – Source-rooted shortest-delay spanning trees of the mesh
  – Constructed using well-known routing algorithms
  – Members have low degrees
  – Small delay from source to receivers
[Figure: group members (CMU, Berk1, Berk2, Gatech, Stan1, Stan2) connected first by a mesh, then by a spanning tree of the mesh]
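Step 2 can be illustrated with a centralized sketch. Narada actually computes these trees distributedly with a routing protocol (a distance-vector variant, per the Components slide); the Dijkstra version below only shows the result it converges to, on an assumed mesh represented as an adjacency dict.

```python
import heapq

def shortest_delay_tree(mesh, source):
    """Source-rooted shortest-delay spanning tree of the mesh.
    `mesh` maps each member to {neighbor: link delay}; returns a
    {member: parent} map describing the data-delivery tree."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, delay in mesh[u].items():
            nd = d + delay
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent
```

On a triangle mesh where A–B costs 1, B–C costs 2, and A–C costs 4, the tree rooted at A reaches C through B (total delay 3) rather than over the direct 4-delay link.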
Narada Components
• Mesh management:
  – Ensures the mesh remains connected in the face of membership changes
• Mesh optimization:
  – Distributed heuristics for ensuring that the shortest-path delay between members along the mesh is small
• Spanning tree construction:
  – Routing algorithms for constructing data-delivery trees
  – Shortest-path routing algorithm (e.g., distance vector)
Optimizing Mesh Quality
• Members periodically probe other members at random
• A new link is added if:
  Utility gain of adding the link > Add Threshold
• Members periodically monitor existing links
• An existing link is dropped if:
  Cost of dropping the link < Drop Threshold
[Figure: a poor overlay topology over Berk1, Stan1, Stan2, CMU, Gatech1, Gatech2]
The terms defined
• Utility gain of adding a link is based on
  – The number of members to which routing delay improves
  – How significant the improvement in delay to each member is
• Cost of dropping a link is based on
  – The number of members to which routing delay increases
• Add/Drop thresholds are functions of:
  – The member's estimate of the group size
  – The member's current and maximum degree in the mesh
Desirable properties of the heuristics
• Stability: a dropped link will not be immediately re-added
• Partition avoidance: a partition of the mesh is unlikely to be caused by dropping any single link
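The terms above can be sketched as follows. The slide only says utility reflects how many members improve and by how much; weighting each member by its *relative* delay improvement is an assumption of this sketch, and the threshold values are left to the caller.

```python
def utility_gain(current_delay, new_delay):
    """Utility of adding a link: for each member whose routing delay
    would improve, accumulate a term that grows with the significance
    of the improvement (here, the relative improvement; the exact
    weighting is an assumption)."""
    gain = 0.0
    for m in current_delay:
        if new_delay[m] < current_delay[m]:
            gain += (current_delay[m] - new_delay[m]) / current_delay[m]
    return gain

def should_add(current_delay, new_delay, add_threshold):
    """Add the probed link only if its utility gain exceeds the threshold."""
    return utility_gain(current_delay, new_delay) > add_threshold

def drop_cost(current_delay, delay_without_link):
    """Cost of dropping a link: the number of members to which routing
    delay would increase."""
    return sum(1 for m in current_delay
               if delay_without_link[m] > current_delay[m])
```

For example, a link that halves the delay to one member contributes 0.5 to the utility gain, so whether it is added depends on the member's current add threshold.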
[Figure: probing examples over Berk1, Stan2, CMU, Gatech1, Stan1, Gatech2 — a probed link whose delay improvement to Stan1 and CMU is only marginal is not added, while a link that significantly improves delay to CMU and Gatech1 is added]
[Figure: a link used by Berk1 only to reach Gatech2 (and vice versa) is dropped, yielding an improved mesh]
Narada Evaluation
• Simulation experiments
• Evaluation of an implementation on the Internet
• Performance metrics:
  – Delay between members using Narada
  – Stress, defined as the number of identical copies of a packet that traverse a physical link
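The stress metric is straightforward to compute given a mapping from overlay hops to physical links. The representation below (overlay edges as member pairs, `physical_path` as a callback returning the underlying physical links) is an assumption for illustration.

```python
from collections import Counter

def link_stress(overlay_edges, physical_path):
    """Stress of each physical link: the number of identical copies of a
    packet that traverse it. Each overlay edge is one unicast hop between
    two members; every physical link on that hop carries one copy."""
    stress = Counter()
    for a, b in overlay_edges:
        for link in physical_path(a, b):
            stress[link] += 1
    return stress
```

For example, if CMU unicasts separately to Stan1 and Stan2 and both hops share the same physical link, that link has stress 2 (as in the slide's figure).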
[Figure: example overlays over CMU, Stan1, Stan2, Berk1, Berk2, Gatech — one physical link carries two copies of each packet (Stress = 2), and the delay from CMU to Berk1 increases]
Factors affecting performance
• Topology model
  – Waxman variant
  – Mapnet: connectivity modeled after several ISP backbones
  – ASMap: based on inter-domain Internet connectivity
• Topology size
  – Between 64 and 1024 routers
• Group size
  – Between 16 and 256
• Fanout range
  – The number of neighbors each member tries to maintain in the mesh
Delay in typical run
[Figure: delay under Narada falls between 1x and 4x the unicast delay; Waxman topology with 1024 routers and 3145 links, group size 128, fanout range <3–6> for all members]
Stress in typical run
[Figure: stress for naive unicast, native multicast, and Narada; Narada achieves a 14-fold reduction in worst-case stress]
Variation with group size
[Figure: Waxman model with 1024 routers and 3145 links, fanout range <3–6>]
Summary
• Overlay networks are useful in practice
  – Easily deployable: application-level
  – Flexible: virtual architecture
  – Improve on IP-layer routing inefficiency
  – Enable value-added communication functions such as multicast