MyOps
An Operational Framework for PlanetLab Deployments
1
Outline
o Objective of MyOps
o Current status
o Future ideas
o Questions at any time
2
Example of Feedback
3
Objective : Close Operational Cycle
• System - Provides service (slice)
• Monitoring - Feedback from running system
• Operator - Interpret feedback into tasks
• Management - Control running system
4
Challenges: Break-down
• System may not deliver service
• Monitoring may not observe useful metrics
• Operator may not know
o how to interpret observations
o how to control the system
o what the service goals are
• Management may not control system
5
Requirements for Operational Systems
• Satisfy Minimal Conditions
1. Physical Integrity
2. Interconnectivity
3. Controllable
4. Provide a Service
• Two requirements
o Reliably reach the final condition
o When failures occur, repair or report automatically
• Two approaches in MyOps
1. Precise bootstrap stages (not discussed)
2. Operational monitoring & management in platform
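A minimal sketch of how the four minimal conditions could be checked in order, so that a failure is reported at the earliest unmet condition. The identifier names and the idea that each condition presumes the earlier ones are assumptions for illustration, not part of MyOps.

```python
from typing import Dict, List, Optional

# Order follows the slide; treating it as a dependency order is an assumption.
MINIMAL_CONDITIONS: List[str] = [
    "physical_integrity",
    "interconnectivity",
    "controllable",
    "provides_service",
]


def first_unmet_condition(observed: Dict[str, bool]) -> Optional[str]:
    """Return the first minimal condition that is not satisfied, or None."""
    for name in MINIMAL_CONDITIONS:
        # A missing observation is treated as an unmet condition.
        if not observed.get(name, False):
            return name
    return None
```

A node that is powered and reachable but not yet controllable would be reported as failing "controllable", which tells the operator where repair should start.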
6
System: PlanetLab Slices
7
Monitoring Types
Open-loop monitoring
• Identify the unknown
• More information, fine-grained
Operational monitoring (closed-loop)
• Correctness
• Less information, coarse-grained
• Actionable
8
Management Types
Open-loop management
• Bootstrap/Deploy from the ground up
• Inefficient, coarse-grained
• No feedback
Operational management (closed-loop)
• Tweak the system to correct behavior
• More efficient, fine-grained
9
Example
• Observe: Node is Off-line
• Control: Attempt to Power-On
• Observe: Node is On-line but Failed to boot
• Observe: Failed to boot Error
• Control: Create ticket & Send email to local contact
• Time passes
• Control: Disable slice creation
• Observe: Local contact responds
• Observe: Node is Powered-on and Running
• Control: Re-enable slice creation
• Control: Close ticket
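The observe/control sequence in this example can be sketched as a small state machine. This is an illustration only: `NodeCase`, `step`, and the observation and action strings are hypothetical names, and "no_response" stands in for the "Time passes" step.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeCase:
    """Hypothetical per-node record; field names are illustrative."""
    ticket_open: bool = False
    slice_creation_enabled: bool = True
    power_on_attempted: bool = False
    actions: List[str] = field(default_factory=list)


def step(case: NodeCase, observation: str) -> None:
    """Map one observation to the control actions from the example."""
    if observation == "offline" and not case.power_on_attempted:
        case.power_on_attempted = True
        case.actions.append("attempt_power_on")
    elif observation == "failed_to_boot" and not case.ticket_open:
        case.ticket_open = True
        case.actions.append("create_ticket_and_email_contact")
    elif (observation == "no_response" and case.ticket_open
          and case.slice_creation_enabled):
        case.slice_creation_enabled = False
        case.actions.append("disable_slice_creation")
    elif observation == "running":
        if not case.slice_creation_enabled:
            case.slice_creation_enabled = True
            case.actions.append("reenable_slice_creation")
        if case.ticket_open:
            case.ticket_open = False
            case.actions.append("close_ticket")
        case.power_on_attempted = False
```

Feeding the observations "offline", "failed_to_boot", "no_response", "running" in that order produces the same five control actions as the slide, in the same order.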
10
History of PlanetLab Operations
Open-loop Monitoring with Open-loop Management
• Collect fine-grained statistics using CoMon
• Act with coarse-grained operations (e.g. Reinstall)
• Manual bridge between the two
Moving towards Closed-loop Operations
• Collect targeted metrics
• Take directed, problem-specific actions
• Automate actions based on policy
11
PlanetLab Operations
• Close the monitor/management cycle
• Direct automation of common operations
• Indirect through remote contacts and incentives
12
MyOps Architecture
• Collection from Node
• Translated by policy to Automated action
13
MyOps Architecture
• Collection from Node
• Send notice to Local contact to take action
14
MyOps Architecture
• When there is no response
• Indirect influence with incentives
15
Collection
• Operational monitoring of specific targets, such as:
o Boot status, Filesystem status
o DNS - internal and external
o RPMs
o System services, etc.
• Periodic collection
o Coarse-grained collection at a human-timescale
o Time-series of events and status
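One way to picture the "time-series of events and status" is to collapse periodic samples into change events for a single target. The sketch below assumes a hypothetical `Sample` record with a timestamp and a dictionary of target statuses; none of these names come from MyOps.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Sample:
    """One periodic observation of a node (hypothetical layout)."""
    timestamp: int           # seconds since some epoch
    status: Dict[str, str]   # e.g. {"boot": "ok", "dns": "failed"}


def status_changes(history: List[Sample], key: str) -> List[Tuple[int, str]]:
    """Collapse periodic samples into (timestamp, new_value) events for one target."""
    events: List[Tuple[int, str]] = []
    previous = None
    for sample in sorted(history, key=lambda s: s.timestamp):
        # A target missing from a sample is recorded as "unknown".
        value = sample.status.get(key, "unknown")
        if value != previous:
            events.append((sample.timestamp, value))
            previous = value
    return events
```

With hourly samples, a node that reports "ok" twice and then "failed" yields two events rather than three samples, which keeps the series coarse-grained and easy to reason about.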
16
Policy
• Constraints over a time-series of events
• To satisfy a constraint
o Automated action
o Send notice
o Apply incentive
• Policy defines
o Preferred status of system
o Frequency of actions
o Magnitude of incentives
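A policy of this kind can be sketched as a preferred status, a minimum interval between actions, and an ordered list of responses. The `Policy` fields, the `next_action` function, and the idea of walking the responses in order are assumptions made for illustration, not the MyOps implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Policy:
    preferred_status: str          # preferred status of the system
    min_action_interval: int       # seconds between actions (frequency of actions)
    escalation: Tuple[str, ...]    # ordered responses, mildest first


def next_action(policy: Policy,
                events: List[Tuple[int, str]],
                actions_taken: int,
                last_action_time: Optional[int],
                now: int) -> Optional[str]:
    """Return the next response, or None if the constraint holds or it is too soon."""
    if not events or events[-1][1] == policy.preferred_status:
        return None  # constraint already satisfied, or nothing observed yet
    if (last_action_time is not None
            and now - last_action_time < policy.min_action_interval):
        return None  # respect the configured action frequency
    # Stay on the last (strongest) response once the list is exhausted.
    index = min(actions_taken, len(policy.escalation) - 1)
    return policy.escalation[index]
```

Here `events` is a list of (timestamp, status) pairs, most recent last. A policy with escalation ("automated_action", "send_notice", "apply_incentive") would first try an automated fix, then notify, then apply an incentive, never acting more often than the configured interval.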
17
Automation
• Automatic correction of common bootstrap problems
o Communication errors with MyPLC
o Corrupt filesystem repair
o Retry when state is unknown
o PCU Reboot
o Reinstall
• Automation Notices
o Bad disk
o Minimal hardware
o Bad DNS
o Bad node configuration
18
Notices & Incentives
• Notices are indirect paths to node management
o Node down / online / specific problem (e.g. DNS, disk)
o Site down / online
o Privilege reduced / restored
o PCU errors
• The incentives on MyPLC
o Sites 10 slices
o Disable slice creation
o Disable running slices
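The relation between incentives and the "privilege reduced / restored" notices can be sketched as follows. The day thresholds are invented for illustration and do not come from the slides, and only the two clearly stated incentives are modeled.

```python
from typing import List, Tuple

# (days without response, incentive); the thresholds are assumed values.
INCENTIVE_STEPS: List[Tuple[int, str]] = [
    (7, "disable_slice_creation"),
    (14, "disable_running_slices"),
]


def incentives_for(days_without_response: int) -> List[str]:
    """Return every incentive whose threshold has been reached."""
    return [name for threshold, name in INCENTIVE_STEPS
            if days_without_response >= threshold]


def notices_for_change(before: List[str], after: List[str]) -> List[str]:
    """Emit a notice when the set of active incentives grows or shrinks."""
    if len(after) > len(before):
        return ["privilege reduced"]
    if len(after) < len(before):
        return ["privilege restored"]
    return []
```

Comparing the incentives active on consecutive days gives the moment at which a site should be told that its privileges changed, in either direction.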
19
Validation of Notices & Incentives
[Figure: periods labeled A, B, C, D, E; annotations: Notice Bug Fix, Kernel Bug Fix, Fix2]
20
Time to Restore Down Node (all issues)
21
Future Ideas
• Generalize Configuration
• Collect from multiple sources
• Expose policy
• Act on multiple targets
• Self-monitoring
• Positive Incentives
• Special access to services
• Additional resources (Slices, Bandwidth, CPU, etc.)
22
Time to Reply (when there is a reply)
23