MyOps: An Operational Framework for PlanetLab Deployments
TRANSCRIPT
![Page 1: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/1.jpg)
MyOps
An Operational Framework for PlanetLab Deployments
1
![Page 2: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/2.jpg)
Outline
o Objective of MyOps
o Current status
o Future ideas
o Questions at any time
2
![Page 3: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/3.jpg)
Example of Feedback
3
![Page 4: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/4.jpg)
Objective : Close Operational Cycle
• System - Provides service (slice)
• Monitoring - Feedback from running system
• Operator - Interpret feedback into tasks
• Management - Control running system
4
![Page 5: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/5.jpg)
Challenges: Break-down
• System may not deliver service
• Monitoring may not observe useful metrics
• Operator may not know
  o how to interpret observations
  o how to control the system
  o what the service goals are
• Management may not control system
5
![Page 6: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/6.jpg)
Requirements for Operational Systems
• Satisfy Minimal Conditions
  1. Physical Integrity
  2. Interconnectivity
  3. Controllable
  4. Provide a Service
• Two requirements
  o Reliably reach the final condition
  o When failures occur, repair or report automatically
• Two approaches in MyOps
  1. Precise bootstrap stages (not discussed)
  2. Operational monitoring & management in platform
6
![Page 7: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/7.jpg)
System: PlanetLab Slices
7
![Page 8: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/8.jpg)
Monitoring Types
Open-loop monitoring
• Identify the unknown
• More information, fine-grained
Operational monitoring (closed-loop)
• Correctness
• Less information, coarse-grained
• Actionable
8
![Page 9: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/9.jpg)
Management Types
Open-loop management
• Bootstrap/Deploy from the ground up
• Inefficient, coarse-grained
• No feed-back
Operational management (closed-loop)
• Tweak the system to correct behavior
• More efficient, fine-grained
9
![Page 10: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/10.jpg)
Example
• Observe: Node is Off-line
• Control: Attempt to Power-on
• Observe: Node is On-line but Failed to boot
• Observe: Failed to boot Error
• Control: Create ticket & Send email to local contact
• Time passes
• Control: Disable slice creation
• Observe: Local contact responds
• Observe: Node is Powered-on and Running
• Control: Re-enable slice creation
• Control: Close ticket
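The observe/control sequence on this slide can be read as a small state machine. The following is a minimal sketch under stated assumptions, not the MyOps implementation: the `NodeRecord` class, the observation strings, the action names, and the `waited_too_long` flag are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    """Hypothetical per-node bookkeeping; not the MyOps data model."""
    ticket_open: bool = False
    slice_creation_enabled: bool = True
    log: list = field(default_factory=list)


def control(record, observation, waited_too_long=False):
    """Map one observation to the control steps listed on the slide."""
    actions = []
    if observation == "offline":
        actions.append("power_on")
    elif observation == "boot_failed":
        if not record.ticket_open:
            # First failure report: open a ticket and email the local contact.
            record.ticket_open = True
            actions.append("create_ticket_and_email_contact")
        elif waited_too_long and record.slice_creation_enabled:
            # Time has passed without a fix: apply the incentive.
            record.slice_creation_enabled = False
            actions.append("disable_slice_creation")
    elif observation == "running":
        # Node is healthy again: undo the incentive, then close the ticket.
        if not record.slice_creation_enabled:
            record.slice_creation_enabled = True
            actions.append("enable_slice_creation")
        if record.ticket_open:
            record.ticket_open = False
            actions.append("close_ticket")
    record.log.extend(actions)
    return actions
```

Feeding the observations in the order shown on the slide (`"offline"`, `"boot_failed"`, `"boot_failed"` with `waited_too_long=True`, `"running"`) yields the same control steps in the same order.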
10
![Page 11: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/11.jpg)
History of PlanetLab Operations
Open-loop Monitoring with Open-loop Management
• Collect fine-grained statistics using CoMon
• Act with coarse-grained operations (e.g. Reinstall)
• Manual bridge between the two
Moving towards Closed-loop Operations
• Collect targeted metrics
• Take directed, problem-specific actions
• Automate actions based on policy
11
![Page 12: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/12.jpg)
PlanetLab Operations
• Close the monitor/management cycle
• Direct automation of common operations
• Indirect through remote contacts and incentives
12
![Page 13: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/13.jpg)
MyOps Architecture
• Collection from Node
• Translated by policy to Automated action
13
![Page 14: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/14.jpg)
MyOps Architecture
• Collection from Node
• Send notice to Local contact to take action
14
![Page 15: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/15.jpg)
MyOps Architecture
• When there is no response
• Indirect influence with incentives
15
![Page 16: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/16.jpg)
Collection
• Operational monitoring of specific targets, such as:
  o Boot status, Filesystem status
  o DNS - internal and external
  o RPMs
  o System services, etc.
• Periodic collection
  o Coarse-grained collection at a human-timescale
  o Time-series of events and status
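One way to picture a "time-series of events and status" is to collapse periodic status samples into change events. This is only an illustrative sketch: the `(time, status)` tuple layout and the `to_events` helper are assumptions, not part of MyOps.

```python
def to_events(samples):
    """Collapse periodic (time, status) samples into status-change events.

    `samples` is assumed to be sorted by time. Consecutive samples with the
    same status are merged, so each event marks when a status first appeared.
    """
    events = []
    for time, status in samples:
        if not events or events[-1][1] != status:
            events.append((time, status))
    return events
```

For example, five samples in which a node reports `"boot"`, `"boot"`, `"down"`, `"down"`, `"boot"` reduce to three events, one per status change.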
16
![Page 17: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/17.jpg)
Policy
• Constraints over a time-series of events
• To satisfy a constraint
  o Automated action
  o Send notice
  o Apply incentive
• Policy defines
  o Preferred status of system
  o Frequency of actions
  o Magnitude of incentives
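A policy of this kind can be sketched as thresholds applied to how long a node has been out of its preferred status. The sketch below is hypothetical: the `Policy` fields, the default day counts, and the returned action names are assumptions chosen for illustration, and it models escalation only, not the frequency of actions or the magnitude of incentives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical policy; the thresholds are illustrative values."""
    preferred_status: str = "online"
    notice_after_days: int = 2
    incentive_after_days: int = 14


def days_out_of_state(events, now, policy):
    """Days since the node left its preferred status (0 if it is in it).

    `events` is a list of (day, status) tuples sorted by day.
    """
    if not events or events[-1][1] == policy.preferred_status:
        return 0
    start = events[-1][0]
    # Walk backwards to the first event of the current bad streak.
    for day, status in reversed(events):
        if status == policy.preferred_status:
            break
        start = day
    return now - start


def decide(events, now, policy):
    """Choose the response to the constraint violation, escalating with time."""
    down = days_out_of_state(events, now, policy)
    if down == 0:
        return "none"
    if down < policy.notice_after_days:
        return "automated_action"
    if down < policy.incentive_after_days:
        return "send_notice"
    return "apply_incentive"
```

The escalation order (automated action, then notice, then incentive) follows the order of the list on the slide; the specific day counts are placeholders.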
17
![Page 18: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/18.jpg)
Automation
• Automatic correction of common bootstrap problems
  o Communication errors with MyPLC
  o Corrupt filesystem repair
  o Retry when state is unknown
  o PCU Reboot
  o Reinstall
• Automation Notices
  o Bad disk
  o Minimal hardware
  o Bad DNS
  o Bad node configuration
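The split between problems corrected automatically and problems that produce a notice can be sketched as a simple routing table. The problem labels below are hypothetical identifiers derived from the two lists on this slide, and `route` is an illustrative helper rather than a MyOps function.

```python
# Hypothetical labels for problems the slide lists as automatically corrected.
AUTOMATED = {"myplc_communication_error", "corrupt_filesystem", "unknown_state"}

# Hypothetical labels for problems the slide lists as producing a notice.
NEEDS_NOTICE = {"bad_disk", "minimal_hardware", "bad_dns", "bad_node_configuration"}


def route(problem):
    """Return how a diagnosed problem is handled in this sketch."""
    if problem in AUTOMATED:
        return "automate"
    if problem in NEEDS_NOTICE:
        return "notify"
    # Anything not in either list is left for an operator to classify.
    return "unclassified"
```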
18
![Page 19: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/19.jpg)
Notices & Incentives
• Notices are indirect paths to node management
  o Node down / online / specific problem (e.g. DNS, disk)
  o Site down / online
  o Privilege reduced / restored
  o PCU errors
• The incentives on MyPLC
  o Sites 10 slices
  o Disable slice creation
  o Disable running slices
19
![Page 20: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/20.jpg)
Validation of Notices & Incentives
[Figure: periods labeled A, B, C, D, E; annotations: "Notice Bug Fix", "Kernel Bug Fix Fix2"]
20
![Page 21: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/21.jpg)
Time to Restore Down Node (all issues)
21
![Page 22: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/22.jpg)
Future Ideas
• Generalize Configuration
• Collect from multiple sources
• Expose policy
• Act on multiple targets
• Self-monitoring
• Positive Incentives
• Special access to services
• Additional resources (Slices, Bandwidth, CPU, etc.)
22
![Page 23: MyOps An Operational Framework for PlanetLab Deployments 1](https://reader036.vdocument.in/reader036/viewer/2022062516/56649da15503460f94a8d310/html5/thumbnails/23.jpg)
Time to Reply (when there is a reply)
23