cs234 – application layer multicasting tuesdays, thursdays 3:30-4:50p.m. ics 243 prof. nalini...
TRANSCRIPT
CS234 – Application Layer Multicasting
Tuesdays, Thursdays 3:30-4:50p.m.ICS 243
Prof. Nalini [email protected]
(with slides adapted from Dr. Animesh Nandi and Dr.
Ayman Abdel-Hamid)
Groupware applicationsCollaborative applications
require 1-many communications Video conference Interactive distributed simulations Stock exchange (NYSE) News dissemination, file updates Online games
Dynamic peer groups Dynamic membership (group size) Non-hierarchical
Multicasting © Dr. Ayman Abdel-Hamid 3
Logical Multipoint Communications
•Two basic logical organizationsRooted: hierarchy (perhaps just two levels) that structures communicationsNon-rooted: peer-to-peer (no distinguished nodes)
•Different structure could apply to control and data “planes”
Control plane determines how multipoint session is createdData plane determines how data is transferred between hosts in the multipoint session
Multicasting © Dr. Ayman Abdel-Hamid 4
Logical Multipoint CommunicationsControl Plane
•The control plane manages creation of a multipoint sessionRooted control plane
One member of the session is the root, c_rootOther members are the leafs, c_leafsNormally c_root establishes a session
Root connects to one or more c_leafsc_leafs join c_root after session established
Non-rooted control planeAll members are the same (c_leafs)Each leaf adds itself to the session
Multicasting © Dr. Ayman Abdel-Hamid 5
Logical Multipoint CommunicationsData PlaneThe data plane is concerned with data transfer
•Rooted data planeSpecial root member, d_rootOther members are leafs, d_leafsData transferred between d_leafs and d_roots
d_leaf to d_rootd_root to d_leaf
There is no direct communication between d_leafs•Non-rooted data plane
No special members, all are d_leafsEvery d_leafs communicate with all d_leafs
Multicasting© Dr. Ayman Abdel-Hamid,
CS4254 Spring 2006 6
Multicast Communication•Multicast abstraction is peer-to-peer
Application-level multicastNetwork-level multicast
Requires router support (multicast-enabled routers)Multicast provided at network protocol level IP multicast
Multicasting © Dr. Ayman Abdel-Hamid 7
Multicast Communication
•Transport mechanism and network layer must support multicast
•Internet multicast limited to UDP (not TCP)
Unreliable: No acknowledgements or other error recovery schemes (perhaps at application level)
Connectionless: No connection setup (although there is routing information provided to multicast-enabled routers)
Datagram: Message-based multicast
IP multicast Using unicast to replicate packets not efficient thus, IP multicast needed Built around the notion of group of hosts:
Senders and receivers need not know about each other Sender simply sends packets to “logical” group address No restriction on number or location of receivers
Applications may impose limits Normal, best-effort delivery semantics of IP
• Unlike broadcast, allows a proper subset of hosts to participate
• Standards: IP Multicast (RFC 1112)
Multicasting © Dr. Ayman Abdel-Hamid 9
IP Multicast•IP supports multicasting
Uses only UDP, not TCPSpecial IP addresses (Class D) identify multicast groupsInternet Group Management Protocol (IGMP) to provide group routing informationMulticast-enabled routers selectively forward multicast datagramsIP TTL field limits extent of multicast
•Requires underlying network and adapter to support broadcast or, preferably, multicast
Ethernet (IEEE 802.3) supports multicast
Multicasting © Dr. Ayman Abdel-Hamid 10
IP Multicast: Group Address•How to identify the receivers of a multicast datagram?•How to address a datagram sent to these receivers?
Each multicast datagram to carry the IP addresses of all recipients? Not scalable for large number of recipients Use address indirection
A single identifier used for a group of receivers
Multicasting © Dr. Ayman Abdel-Hamid 11
IP Multicast: IGMP Protocol•RFC 3376 (IGMP v3): operates between a host and its directly attached router
•host informs its attached router that an application running on the host wants to join or leave a specific multicast group
•another protocol is required to coordinate multicast routers throughout the Internet network layer multicast routing algorithms
•Network layer multicast IGMP and multicast routing protocols
•IGMP enables routers to populate multicast routing tables
•Carried within an IP datagram
Multicasting© Dr. Ayman Abdel-Hamid,
CS4254 Spring 2006 12
IP Multicast: IGMP Protocol
•Joining a groupHost sends group report when the first process joins a given groupApplication requests join, service provider (end-host) sends report
•Maintaining table at the routerMulticast router periodically queries for group informationHost (service provider) replies with an IGMP report for each groupHost does not notify router when the last process leaves a group this is discovered through the lack of a report for a query
Multicasting© Dr. Ayman Abdel-Hamid,
CS4254 Spring 2006 13
IP Multicast: Multicast Routing•Multicast routers do not maintain a list of individual members of each host group•Multicast routers do associate zero or more host group addresses with each interface
Multicasting© Dr. Ayman Abdel-Hamid,
CS4254 Spring 2006 14
IP Multicast: Multicast Routing•Multicast router maintains table of multicast groups that are active on its networks•Datagrams forwarded only to those networks with group members
Multicasting © Dr. Ayman Abdel-Hamid 15
IP Multicast: Multicast Routing
•How multicast routers route traffic amongst themselves to ensure delivery of group traffic?
Find a tree of links that connects all of the routers that have attached hosts belonging to the multicast group
Group-shared treesSource-based trees
Shared Tree
Source Trees
Multicasting© Dr. Ayman Abdel-Hamid,
CS4254 Spring 2006 16
MBONE: Internet Multicast Backbone
•The MBone is a virtual network on top of the Internet (section B.2)
Routers that support IP multicast
IP tunnels between such routers and/or subnets
Issues with IP Multicast IP Multicast islands
ISPs are not always interested in mutual cooperation. Forwarding IP Unicast
traffic is “easier” (stateless) than forwarding IP Multicast. Billing Issues
Violates ISP input-rate-based billing model No incentive for ISPs to enable multicast!
No indication of group size (again needed for billing) Security Issues
ISPs fear the additional traffic induced through IP Multicast (DoS attacks – join high traffic volume Multicast groups).
Multicast islands are interconnected with tunnels: MBONE. Manual setup of tunnels, Administrative overhead.
Result: Usability is reduced
Application Layer Multicast A promising alternative: ALM
Attempts to mimic multicast routing at the application level
Instead of high-speed routers, use programmable end hosts as interior nodes in the ALM system
ALM SystemALM System
ALM Structure
ALM Structure
Dissemination Method
Dissemination Method
Data Dissemination in ALM Concerns
What is target data? Streaming, Large files, Notification
Messages How fast?
Tolerant to delay, emergency case How reliably? How efficiently?
Overhead, Bandwidth
What you expect with ALM
System perspective Utilize all available bandwidth
User perspective Perfect continuity Low startup delay Low lag
Design space of CEMALM
Data Plane Control Plane
Tree
Forest/
Multiple
Tree
Mesh
Hybrid
(Tree with Mesh
)
SAAR
[17]
Random
Network
Data Plane : Tree Loop free path Capacity of each tree link >= the
streaming rate Push based forwarding
Low latency
Unable to utilize the forwarding bandwidth of all participating nodes
Only interior nodes contribute Highly sensitive node failure
Subtrees are significantly affected by a node failure
Require sophisticated recovery techniques
ESM (End-System Multicast)[2], Overcast[3], ZIGZAG[4], NICE[5]
Points of failure Points on an ALM structure which
can disturb message dissemination behind these points Each node of a single tree structure is
a point of failure
Is failure recovery feasible? Tree repair takes time which causes delay
Tens of seconds Detection of failure takes time
Ex) TCP timeover, hearbeat duration Tree reconstruction cost is high
Eager techniques that frequently monitor failure – significant overhead
Goal: Reliable dissemination without failure recovery?
Increasing Reliability Increasing the number of parents
N parents ALM structure SplitStream[2003], K-DAG [2004], Multicast Forest
[2005] N parents ALM endures N-1 failures among
its parents But, each node may get N-1 redundant
messages
Data Plane : Forest/Multiple Tree
K trees K loop free paths Data is separated into multiple (K) chunks
Each chunk is disseminated in one of the trees Capacity of each tree link >= the streaming
rate/K Each node can have more children than a single tree Average depth of trees is lower than a single tree Reduce delay and increase fault resilience
High Reliability Sending same data to K paths Endure K-1 failures
SplitStream[6], CoopNet[7], Chunkyspread[8]
Random Multicast Forest Use multiple multicast tree ( K ) Each node gets k parents Sometimes it can have the same
parents for different trees Cannot endure K-1 failures with some
probability
h
g
i
S
a
b c
d
e f
K-DAG (Directed Acyclic Graph )
Each node selects each parent randomly Select “M” closest candidate nodes ( M > K ) Select K parents randomly among the M
Each node gets K parents Still it can not guarantee message delivery
hg i
S
a b c
d e
f
SplitStream[6] Forest based dissemination Basic idea
Split the stream into K stripes (with MDC coding)
For each stripe create a multicast tree such that the forest
Contains interior-node-disjoint trees Respects nodes’ individual bandwidth constraints
SplitStream : MDC coding Multiple Description coding
Fragments a single media stream into M substreams (M ≥ 2 )
K packets are enough for decoding (K < M)
Less than K packets can be used to approximate content
Useful for multimedia (video, audio) but not for other data
Cf) erasure coding for large data file
SplitStream : Interior-node-disjoint tree
Each node in a set of trees is interior node in at most one tree and leaf node in the other trees.
Each substream is disseminated over subtrees
S
a
b c
d
e f h
g
i
a
b c h
g
i
d
e f
d
e f
a
b c h
g
i
ID =0x… ID =1x… ID =2x…
Data Plane : Mesh
A Random Graph Node’s degree is proportional to the node’s
forwarding bandwidth Minimum degree : ensure the mesh remains connected
in the presence of churn Gossip based forwarding
Buffer (b) : most recently received blocks Every r seconds, nodes exchange the information of
missing blocks and buffered blocks. Randomly request an available block for a missing block
CoolStreaming[9], Chainsaw[10], PRIME[11], PULSE[12], CREW[13]
Pros and Cons of Mesh
Fully utilize bandwidth of all nodes High Failure resilience High overhead Long Delay/start up
Gossip-based Broadcast
Probabilistic Approach with Good Fault Tolerant Properties Choose a destination node, uniformly at random, and send it the message After Log(N) rounds, all nodes will have the message w.h.p. Requires N*Log(N) messages in total Needs a ‘random sampling’ service
Usually implemented as Rebroadcast ‘fanout’ times Using UDP: Fire and Forget
BiModal Multicast (99), Lpbcast (DSN 01), Rodrigues’04 (DSN), Brahami ’04, Verma’06
(ICDCS), Eugster’04 (Computer), Koldehofe’04, Periera’03
Data Plane : Hybrid
Tree/Forest + Mesh Tree/Forest : Push-based
dissemination Low delay
Mesh : Pull-based dissemination Instant failure recovery
Bullet[14], PRM[15], FaReCast[16]
Bullet
Layers a mesh on top of an overlay tree to increase overall bandwidth
Basic Idea Use a tree as a basis In addition, each node continuously
looks for peers to download from In effect, the overlay is a tree combined
with a random network (mesh)
PRM (Probabilistic Resilient Multicasting)
Tree-based Multicasting with a few of random mesh links.
Cross-Link Redundancy Add (random) cross links to the
forwarding topology For Reliability
Recovery Mechanism of PRM Basic assumption
Nodes maintain a set of mesh neighbors Ephemeral forwarding – Reactive
Upon detecting a disconnection, the head of disconnected node tries to obtain the stream from one of the mesh neighbors
Before data plane recovery Randomized forwarding – Proactive
A node forwards the stream to a random mesh neighbors with very small probability (1~3%)
Case Study : FaReCast for Flash Dissemination
Early Warning System Earthquake Early Warning System
Earthquake Event : Rare event, but its spreading speed is very fast
Give the recipients a little more time to prepare for the disaster
A simple alert-message should be delivered to all recipients within very short time duration
41
Catastrophe Resilience Use-Case
Earthquakes may damage infrastructure in unpredictable and multiple ways
Power, Telecommunication, Servers
If earthquake disrupts the major routers/links in LA
Can assume 10-25% of all overlay nodes in California will be “down”
Usually, people have studied node failures on smaller scale: Fault Tolerance
We are looking at scenario where a large fraction of the network is simultaneously affected : Catastrophe Tolerance
Design systems that are resilient to 50% simultaneous node failures
Characteristic of Recipients
The number of Recipients is huge Hundreds of thousands recipients may
be interested in this warning messages Recipients are unreliable
Normal machine administrated by normal users
Free will of join/leave Internal lagging for forwardingNeed to consider the reliable message dissemination to all online
recipients
Need to consider the reliable message dissemination to all online
recipients
( http://neic.usgs.gov/neis/eqlists/eqstats.html )
Event characteristicNumber of Earthquakes in the United States for 2000 - 2009
Located by the US Geological Survey National Earthquake Information Center
Magnitude 200
0200
1200
2200
32004 2005 2006 2007 2008 2009
8.0 to 9.9 0 0 0 0 0 0 0 0 0 0
7.0 to 7.9 0 1 1 2 0 1 0 1 0 0
6.0 to 6.9 6 5 4 7 2 4 7 9 9 0
5.0 to 5.9 63 41 63 54 25 47 51 72 73 5
4.0 to 4.9 281 290 536 541 284 345 346 366 446 49
3.0 to 3.9 917 842 1535130
31364 1475 1213 1137 1483 171
2.0 to 2.9 660 646 1228 704 1336 1738 1145 1173 1570 247
1.0 to 1.9 0 2 2 2 1 2 7 11 14 4
0.1 to 0.9 0 0 0 0 0 0 1 0 0 0
No Magnitude 415 434 507 333 540 73 13 22 20 6
Total234
22261 3876
2946
3552 3685 * 2783 2791 * 3615 * 482
The major earthquake (Magnitude>6) is very rare event.
Need to consider the maintenance overhead of the early warning system.
The major earthquake (Magnitude>6) is very rare event.
Need to consider the maintenance overhead of the early warning system.
44/32
Forest-based M2M ALM structure
Multiple Fan-In – increase reliability with rich path diversityMultiple Fan-Out – for efficient and fast dissemination↑Path Diversity, ↑ reliability, ↑speed
Multiple Fan-In – increase reliability with rich path diversityMultiple Fan-Out – for efficient and fast dissemination↑Path Diversity, ↑ reliability, ↑speed
45/32
Properties of M2M ALM structure
Fan-in constraint Each node has Fi distinct parents (except level 0/1) All Parents of a node are on the previous level
Level : length in overlay from a root node
Root node reliability Loop-Free transmission
No loop in parent-to-children dissemination All paths from root to node have same length – data
reaches from diff. parents closeby in time.
Parent set uniqueness Parent sets of any two nodes at a level (level ≥2) differ
by at least one parent. Increases path diversity, more reliable delivery
46/32
Realizing parent set uniqueness
A
D
Level 0
Level 1
Level 2
Fan-out factor = 2Fan-in = 2
Fan-out factor = 2Fan-in = 2
B C
E
?
F
G H
I
J K
2)( ioL
oL FFFN
NL = 4
NL = 6
Bypassing Problem with Simple Multicasting Method
a
d
b c
g
e
i
f
h
d
i
2 parents ALM structure2 parents ALM structure (1) Bypassing Problem: Type 1- The message may bypass an alive nodes, when its all parents fail- A child (node h) of a bypassed node ( node d ) can receive the message from other parents
(1) Bypassing Problem: Type 1- The message may bypass an alive nodes, when its all parents fail- A child (node h) of a bypassed node ( node d ) can receive the message from other parents(2) Bypassing Problem: Type 2-A child of a bypassed node can not receive the message, when its other parents fail
(2) Bypassing Problem: Type 2-A child of a bypassed node can not receive the message, when its other parents fail
Probability of Bypassing
Probability of bypassing (type 1)- Pf : probability that a node fails- n parents : means there are ‘n’ paths from which the message can come
Probability of bypassing (type 1)- Pf : probability that a node fails- n parents : means there are ‘n’ paths from which the message can come
n
jjPf
1
Probability of bypassing (type 2)- Multiply ‘Pf’ of its parents to the probability of bypassing- It is smaller than type 1, but still can happen
Probability of bypassing (type 2)- Multiply ‘Pf’ of its parents to the probability of bypassing- It is smaller than type 1, but still can happen
Prevent Bypassing
Use Child-to-Parents relationship General dissemination of tree/forest
ALM structures uses only Parent-to-Children relationship
Get additional ‘m’ paths, where ‘m’ is the number of children
m
jj
n
jj PfPf
11
Backup Dissemination
a
d
b c
g
e
i
f
h
d
i
2 parents ALM structure2 parents ALM structure
d
i
Each node can expect the duplicated messages
Each node can expect the duplicated messages
A node does not get duplicated message within some time, it sends the message to its parents which did not send a message
A node does not get duplicated message within some time, it sends the message to its parents which did not send a message
Limitation of Backup Dissemination
a
d
b c
g
e f
h
d
i
2 parents ALM structure2 parents ALM structure Node g, h and i are leaf nodes having no backup
paths from children
Node g, h and i are leaf nodes having no backup
paths from children
j
Node d and h can get the message by the
Backup Dissemination
Node d and h can get the message by the
Backup Dissemination
Still node i can not get the message
Leaf nodes are problem
Still node i can not get the message
Leaf nodes are problem
h
d
Leaf-to-Leaf Dissemination
a
d
b c
g
e f
h
d
i
2 parents ALM structure2 parents ALM structure
j
h
d
d
d
f
Each leaf node has links to other leaf nodes sharing the same
parents
Each leaf node has links to other leaf nodes sharing the same
parents
i
A leaf node does not get duplicated message
within some time from a specific parent, it sends
the message to this parent as well as its
children
A leaf node does not get duplicated message
within some time from a specific parent, it sends
the message to this parent as well as its
children
Control Plane : Random Network Bounce Protocol in CREW[13]
A kind of K-DAG Random walk through a tree to get new
parents Construct DAG with random parents
S
a b
c d e f
g
NeighborRequest
Control Plane : SAAR
Isolated control plane to support efficient and flexible selection of data dissemination peers
Multiple overlays share a control overlay Reduce overhead and improve scalability
Anycast primitive used to build different overlays
References[1] Understanding the design tradeoffs for cooperative streaming multicast, In Max
Planck Institute for Software Systems Technical Report MPI-SWS-2009-002.[2] Early Experience with an Internet Broadcast System Based on Overlay Multicast.
In Proceedings of USENIX Annual Technical Conference 2004.[3] Overcast: Reliable Multicasting with an Overlay Network. In Proceedings of OSDI
2000.[4] ZIGZAG: An efficient peer-to-peer scheme for media streaming. In Proceedings of
INFOCOM 2003.[5] Scalable Application Layer Multicast. In Proceedings of SIGCOMM 2002.[6] Splitstream: High-bandwidth multicast in a cooperative environment. In
Proceedings of SOSP 2003.[7] Distributing streaming media content using cooperative networking. In
Proceedings of NOSSDAV 2002.[8] Chunkyspread: Heterogeneous unstructured tree-based peer-to-peer multicast.
In Proceedings of ICNP 2006.[9] Coolstreaming/DONet: A data-driven overlay network for peer-to-peer live
media streaming. In Proceedings of INFOCOM 2005.[10] Chainsaw: Eliminating trees from overlay multicast. In Proceedings of IPTPS 2005.[11] PRIME: Peer-to-peer Receiver-drIven MEsh-based Streaming. In Proceedings of
INFOCOM 2007.[12] PULSE: An adaptive, incentive-based, unstructured p2p live streaming system.
In Proceedings of IEEE Transactions on Multimedia, Vol. 9, Nov. 2007.[13] CREW: A Gossip-based Flash-Dissemination System. In Proceedings of ICDCS
2006.[14] Bullet:High Bandwidth Data Dissemination Using an Overlay Mesh. In
Proceedings of SOSP 2003.[15] Resilient multicast using overlays. In Proceedings of ACM SIGMETRICS 2003.[16] FaReCast: Fast, Reliable Application Layer Multicast for Flash Dissemination. In
Proceedings of Middleware 2010.[17] SAAR: A shared control plane for overlay multicast. In Proceedings of NSDI 2007.
Continuity
T-C : the proportion of the streamed bits that have arrived at the receiving node within T seconds of their first
transmission by the source
T-C : the proportion of the streamed bits that have arrived at the receiving node within T seconds of their first
transmission by the source
Mesh achieves almost perfect continuity
after T=45
Mesh achieves almost perfect continuity
after T=45Trees are fast, but lose
continuity without recoveries
Trees are fast, but lose continuity without
recoveries
Effect of stream rate
With high stream rate (e.g. Ri=1.2), tree loses continuity substantially.
With high stream rate (e.g. Ri=1.2), tree loses continuity substantially.
Reducing Mesh Lag
By decreasing the streaming block size By decreasing swarming interval (r)
By decreasing the streaming block size By decreasing swarming interval (r)
Overhead!!
Closer view of buffer size and swarming interval
Swarming overhead hampers delivery delay of blocks.
Swarming overhead hampers delivery delay of blocks.
Improving Tree Continuity
Recovery mechanisms improve continuity, but incur delays
Recovery mechanisms improve continuity, but incur delays