Source: people.cs.vt.edu/~butta/docs/sc03condorSlides.pdf
A Self-Organizing Flock of Condors
Ali Raza Butt, Rongmei Zhang, Y. Charlie Hu
{butta,rongmei,ychu}@purdue.edu
The need for sharing compute cycles

• Scientific applications
  – Complex, large data sets
• Specialized hardware
  – Expensive
• Modern workstations
  – Powerful resources
  – Available in large numbers
  – Underutilized

Harness the idle cycles of networks of workstations
Condor: High throughput computing
• Cost-effective idle-cycle sharing
• Job management facilities
  – Scheduling, checkpointing, migration
• Resource management
  – Policy specification/enforcement
• Solves real problems world-wide
  – Condor pools with 1200+ machines, 100+ researchers @ Purdue
Sharing across pools: Flocking

[Figure: pre-configured resource sharing between Condor pools, each run by its own central manager, linked by flocking]
Flocking
• Static flocking requires
  – Pre-configuration
  – A priori knowledge of all remote pools
• Does not support dynamic resources
Our contribution: Peer-to-peer based dynamic flocking
• Automated remote Condor pool discovery
• Dynamic resource management
  – Support dynamic membership
  – Support changing local policies
Agenda
• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions
Overlay Networks
P2P networks are self-organizing overlay networks without central control
[Figure: nodes (N) at Sites 1–4, connected through ISP1, ISP2, and ISP3, forming a self-organizing overlay]
Advantages of structured p2p networks
• Scalable
• Self-organizing
• Fault-tolerant
• Locality-aware
• Simple to deploy
• Many implementations available
  – E.g., Pastry, Tapestry, Chord, CAN…
Pastry: locality-aware p2p substrate
• 128-bit circular identifier space
  – Unique random nodeIds
  – Message keys

• Routing: a message is routed reliably to the node with nodeId numerically closest to the key

• Routing distance in the overlay < 2 × routing distance in IP

[Figure: circular identifier space]
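Pastry's prefix routing can be sketched as follows. This is a hedged illustration, not the FreePastry API: the function names and the routing-table layout are assumptions, and nodeIds are modeled as hex-digit strings.

```python
# A minimal sketch of Pastry-style prefix routing over hex-digit nodeIds.
# Names (shared_prefix_len, next_hop) and the table layout are illustrative.

def shared_prefix_len(a, b):
    """Number of leading digits that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def next_hop(local_id, key, routing_table):
    """Forward toward the node whose id is numerically closest to the key.

    routing_table[p][d] holds the nodeId of a neighbor sharing a p-digit
    prefix with local_id and having digit d at position p. Each hop strictly
    lengthens the shared prefix, so routing takes O(log N) hops. Returns
    None when no better hop exists (delivery terminates locally).
    """
    p = shared_prefix_len(local_id, key)
    if p == len(key):
        return None  # local node owns the key exactly
    return routing_table.get(p, {}).get(key[p])
```

Because every hop lengthens the matched prefix by at least one digit, the hop count is logarithmic in the network size.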
Agenda
• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions
Step 1: P2p organization of Condor pools
• Participating central managers join an overlay
  – Just need to know a single remote pool
• P2p provides self-organization
  – Pools can reach each other through the overlay
  – Pools can join/leave at any time
P2p organized central managers

[Figure: central managers form the overlay; each manages the resources of its own pool]
Step 2: Disseminating resource information

• Announcements to nearby pools
  – Contain pool status information
  – Leverage the locality-aware routing table
    • Routing table has O(log N) entries matching increasingly long prefixes of the local nodeId
  – Soft state
    • Periodically refreshed
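The soft-state behavior above can be sketched as a small table that is refreshed by incoming announcements and silently expires stale entries. Class and method names are assumptions, not the actual poolD implementation.

```python
# Sketch of soft-state announcement handling: entries are periodically
# refreshed by re-announcements and are dropped once they go stale.
import time

class AnnouncementTable:
    def __init__(self, ttl_seconds=60.0):
        self.ttl = ttl_seconds
        self._pools = {}  # pool address -> (status, last refresh timestamp)

    def receive(self, pool, status, now=None):
        """Record or refresh an announcement from a remote pool."""
        self._pools[pool] = (status, time.time() if now is None else now)

    def fresh_pools(self, now=None):
        """Drop stale entries and return the still-fresh announcements."""
        now = time.time() if now is None else now
        self._pools = {p: (s, t) for p, (s, t) in self._pools.items()
                       if now - t <= self.ttl}
        return {p: s for p, (s, t) in self._pools.items()}
```

Soft state means a departed or failed pool needs no explicit teardown message: its entry simply ages out after one TTL.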
Resource announcements
[Figure: announcements reach pools that are physically close to the announcing pool]
Step 3: Enable dynamic flocking

• Central managers flock with nearby pools
  – Use knowledge gained from resource announcements
  – Implement local policies
  – Support dynamic reconfiguration
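A hedged sketch of that flocking decision: from the fresh announcements, pick the nearest pools that advertise idle machines and that the local policy admits. The tuple format, function name, and the use of RTT as the locality metric are assumptions for illustration.

```python
# Illustrative flocking decision: nearest willing pools with idle capacity,
# filtered by a local policy predicate. Not actual poolD code.

def choose_flock_targets(announcements, policy_allows, max_targets=3):
    """announcements: iterable of (pool, idle_machines, rtt_ms) tuples.

    Returns up to max_targets pool names, nearest (lowest RTT) first.
    """
    candidates = [(rtt, pool)
                  for pool, idle, rtt in announcements
                  if idle > 0 and policy_allows(pool)]
    return [pool for _, pool in sorted(candidates)[:max_targets]]
```

Because the candidate list is rebuilt from soft-state announcements, the flock reconfigures itself automatically as pools come, go, or change policy.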
Interactions between central managers

[Figure: locality-aware flocking among the p2p-organized central managers and their pools' resources]
Matchmaking
• Orthogonal to flocking
• Condor matchmaking within a pool is unchanged
• The p2p approach affects the flocking decisions only
Are we discovering enough pools?
• Only a subset of nearby pools is reached using the Pastry routing table

• Solution: multi-hop, TTL-based announcement forwarding
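The TTL-based forwarding can be sketched as a bounded flood: each pool re-forwards an announcement to its own routing-table neighbors until the hop budget is spent, deduplicating pools already reached. The graph model and function name are illustrative assumptions.

```python
# Sketch of TTL-limited announcement forwarding over routing-table links.

def reach_with_ttl(origin, neighbors_of, ttl):
    """Return the set of nodes reachable from origin within ttl forwarding hops.

    neighbors_of(node) yields the routing-table neighbors that node forwards
    announcements to; already-seen nodes are not forwarded to again.
    """
    reached = {origin}
    frontier = {origin}
    for _ in range(ttl):
        frontier = {n for node in frontier for n in neighbors_of(node)} - reached
        reached |= frontier
    return reached
```

A larger TTL discovers more nearby pools at the cost of more announcement traffic, so the TTL is the knob trading discovery coverage against overhead.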
Agenda
• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions
Software
• Implemented as a daemon: poolD
  – Leverages FreePastry 1.3 from Rice
  – Runs on central managers
  – Manages self-organized Condor pools

• Condor version 6.4.7

• Interfaced to Condor configuration control
Software architecture

[Figure: the "Condor p2p extension" sits between the p2p network and Condor. A p2p module contains the Announcement Manager; a Condor module contains the Policy Manager and the Flocking Manager; the modules interact through query services and configuration.]
Agenda
• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions
Evaluation
• Measured results
  – Effect of flocking on job throughput
    • Time spent in queue
  – Four pools, three compute machines each
  – Synthetic job trace
Job trace
• Sequence
  – 100 (issue time T, job length L) pairs
  – Intervals (Tn − Tn−1) and lengths L drawn from a uniform distribution on [1, 17]
  – Designed to keep a single machine busy
  – Random overload/idle periods

• Trace
  – One or more job sequences merged together
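The generator above can be sketched as follows. The exact trace format is an assumption; what is taken from the slide is that both inter-arrival gaps and job lengths are drawn uniformly from [1, 17] and that a trace merges one or more 100-job sequences.

```python
# Sketch of the synthetic job-trace generator: per-sequence uniform gaps and
# lengths, with several sequences merged into one issue-time-ordered trace.
import random

def make_sequence(n_jobs=100, lo=1.0, hi=17.0, rng=None):
    """Return n_jobs (issue_time, job_length) pairs for one job sequence."""
    rng = rng or random.Random()
    t, seq = 0.0, []
    for _ in range(n_jobs):
        t += rng.uniform(lo, hi)              # gap T_n - T_{n-1}
        seq.append((t, rng.uniform(lo, hi)))  # job length L
    return seq

def make_trace(n_sequences, **kwargs):
    """Merge several job sequences into a single trace, ordered by issue time."""
    jobs = [job for _ in range(n_sequences) for job in make_sequence(**kwargs)]
    return sorted(jobs)
```

Since one sequence roughly saturates one machine, merging k sequences models a pool offered k machines' worth of load, with natural overload and idle periods.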
PlanetLab experimental setup

[Figure: four Condor pools (A, B, C, D) deployed on PlanetLab at U.C. Berkeley, Interxion (Germany), Rice, and Columbia, linked by dynamic flocking]
Time spent in queue
| Pool | No. of sequences in trace | Without flocking: mean | min | max | With flocking: mean | min | max |
|------|---------------------------|------------------------|------|--------|---------------------|------|-------|
| A | 2 | 1.76 | 0.03 | 14.32 | 20.15 | 0.03 | 72.10 |
| B | 2 | 3.30 | 0.08 | 19.85 | 32.68 | 0.13 | 63.70 |
| C | 3 | 46.58 | 0.03 | 97.17 | 38.68 | 0.10 | 64.48 |
| D | 5 | 284.91 | 0.25 | 557.55 | 28.37 | 0.10 | 58.38 |
| Overall | 12 | 131.20 | 0.03 | 557.55 | 30.30 | 0.03 | 72.10 |
Simulations
• 1000 Condor pools
• GT-ITM transit-stub model
  – 50 transit domains
  – 1000 stub domains

• Size of pool: uniform distribution [25, 225]

• Number of sequences in trace: uniform distribution [25, 225]
Cumulative distribution of locality
Total job completion time: without flocking
Total job completion time: with flocking
Agenda
• Background: peer-to-peer networks
• Proposed scheme
• Implementation
• Evaluation
• Conclusions
Conclusions

• Design and implementation of a self-organizing flock of Condors
  – Scalability
  – Fault tolerance
  – Locality awareness, which yields flocking with nearby resources
  – Local sharing policies enforced

• P2p mechanisms provide an effective substrate for discovery and management of dynamic resources over the wide-area network
Questions?
What about security?

• Authenticated pools/users
  – Enforced by the policy manager
  – Accountability
• Restricted access
  – Limited privileges, e.g., UNIX user nobody
  – Condor libraries
• Controlled execution environment
  – Sandboxing
  – Process cleanups on job completion
• Intrusion detection