openstack operations quick ramp-up and survival guide · 2019-02-26 · openstack operations quick...

OpenStack Operations Quick Ramp-up and Survival Guide Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe

Post on 13-Mar-2020

OpenStack Operations Quick Ramp-up and Survival Guide
Joshua Guan, Operations Engineer, IBM Bluemix Private Cloud, @joshuakwan
Fan He, Architect, IBM Bluemix Private Cloud, @fancyhe

Joshua Guan, Operations Lead, IBM Bluemix Private Cloud

Fan He, Cloud Architect, IBM Bluemix Private Cloud

A Little Bit of Background …
• Bluemix Private Cloud is IBM’s private cloud as a service, based on OpenStack
• Bluemix Private Cloud landed in China to support IBM’s cloud business there
• We were building an OpenStack Operations Team from scratch

Agenda
• Define an OpenStack Operations Team
  • Operating Model
  • Processes
  • Tooling
  • Teaming
• Tooling Integration
• Cliché: OpenStack upgrade, HA, Live Migration

Operating OpenStack is like …

You thought you would work like this

And, Welcome to the real world

Define an OpenStack Operations Team

Operating Model
• How the cloud services are offered
• What is the SLA
• Collaboration with Business Partners, Data Centers and backend teams, etc.

Processes
• Operation Tiers
• Escalation Levels
• Incident Management
• Change Management
• Shifts
• Onboard & Offboard
• …

Tooling
• Monitoring
• Collaboration
• Cloud Management
• Knowledge Base
• Security
• Customer Support

Teaming
• Roles and Responsibilities
• Shift Model

Operating Model

[Diagram. Entities: Customers, Support Entry Points, OpenStack Service Offering, Service Level Agreement, OpenStack Operations, Data Center, Business Partner, Development Team. Relationship labels: use, consume, complies, operates, route, collaborate/escalate.]

Processes: Operation Tiers
(Process areas: Operation Tiers, Escalation Flows, Incident Management, Change Management, Shifts, Security)
• Roles
• Responsibilities

Tier  Role                   Responsibilities
1     Support                First line of defense
2     Operations             Deploy, upgrade, admin
3     OpenStack Engineering  Build the product
3     Network Engineering    Undercloud networks

Processes: Escalation Flows
• How tickets/alerts/incidents go between different tiers

[Diagram: customer, Tier 1, Tier 2, and three Tier 3 boxes]
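The escalation path can be sketched in a few lines of Python. This is only an illustration, assuming a strictly linear hand-off from customer to Tier 1 to Tier 2, and a Tier 3 team picked by problem domain. The team names come from the tiers table; the `domain` keys and the function name are hypothetical.

```python
from typing import Optional

# Tier 3 teams as named in the tiers table; the dictionary keys are
# hypothetical labels chosen for this sketch.
TIER3_TEAMS = {
    "openstack": "OpenStack Engineering",
    "network": "Network Engineering",
}


def escalate(current_owner: str, domain: str) -> Optional[str]:
    """Return who receives the ticket next, or None once it is at Tier 3."""
    if current_owner == "customer":
        return "Tier 1 Support"
    if current_owner == "Tier 1 Support":
        return "Tier 2 Operations"
    if current_owner == "Tier 2 Operations":
        # Tier 2 chooses the engineering team that owns the failing domain.
        return TIER3_TEAMS[domain]
    return None
```

A real flow would also carry the ticket history and priority along with each hand-off; the sketch only shows the routing decision.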

Processes: Incident Management

Definition                Example
Priority level            P0, P1, P2
Incident definition       OpenStack node failure, data center network interruption
Management activities     RFO, Outage Track
Response time             Immediate, 15min, 1hr
Update interval           Every 30min
Communication method      Customer ticket, email, statuspage.io
Escalation to leadership  1hr
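The incident policy above can be written down as data so that tooling can compute deadlines. The sketch below assumes the three response times map to P0, P1 and P2 in order, and that the 30-minute update interval and 1-hour leadership escalation apply to every priority; the slide does not say either explicitly. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Assumption: "Immediate, 15min, 1hr" lines up with P0, P1, P2 in that order.
RESPONSE_TIME = {
    "P0": timedelta(0),
    "P1": timedelta(minutes=15),
    "P2": timedelta(hours=1),
}
UPDATE_INTERVAL = timedelta(minutes=30)     # "Every 30min"
LEADERSHIP_ESCALATION = timedelta(hours=1)  # "Escalation to leadership: 1hr"


@dataclass
class Incident:
    priority: str
    opened_at: datetime

    def respond_by(self) -> datetime:
        """Latest time for the first response."""
        return self.opened_at + RESPONSE_TIME[self.priority]

    def next_update_due(self, last_update: datetime) -> datetime:
        """When the next customer-facing update is due."""
        return last_update + UPDATE_INTERVAL

    def needs_leadership(self, now: datetime) -> bool:
        """True once the incident has been open for the escalation window."""
        return now - self.opened_at >= LEADERSHIP_ESCALATION
```

For example, a P1 incident opened at 09:00 would need a first response by 09:15 and would be escalated to leadership if still open at 10:00.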

Processes: Change Management
• Different types of changes
• How the change will be rolled out
• When the change will be rolled out
• Review and approval
• Customer communication

Processes: Shifts

[Diagram: shift timeline along a Time axis, with repeating bands labelled at-work, on-call primary, and on-call secondary]

Processes: Security
• Security Compliance Activities
• Health Check
• Patch Reporting
• Vulnerability Scanning
• Continuous Business Need

Tooling: Monitoring
(OpenStack Operations tooling areas: Monitoring, Collaboration, Cloud Management, Knowledge Base, Security, Customer Support)
• Monitoring
• Alerting
• Log Aggregation
• Dashboard

Tooling: Collaboration
• Chat
• File Sharing
• Project Kanban
• Shift Management

Tooling: Cloud Management
• CMDB
• Asset Management
• Change Management
• Incident Management

Tooling: Knowledge Base
• Internal Wiki/Runbooks
• Product Documents for Customers

Tooling: Security
• Access Management
• Security Compliance Management
• Health Checking
• Patch Reporting
• Vulnerability Scanning

Tooling: Customer Support
• Ticketing System
• Customer Chat
• Customer Satisfaction
• Cloud Level Maintenance Communication
• Site Level Maintenance Communication

Teaming

Service Level Agreement

Service Availability

Shift Model

Teaming
• 24x7 Availability
• Spread the pain
• Eliminate interruptions as far as possible

[Diagram: shift timeline along a Time axis. Operators on shift handle triage, with repeating bands labelled at-work, on-call primary, and on-call secondary. SME On-call 1, SME On-call 2, and SME On-call 3 each have a primary and a secondary.]
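One way to read the shift diagram is as a rotation in which three groups take turns being at-work, on-call primary and on-call secondary. The sketch below assumes exactly three groups and a simple round-robin; the group names and the function are hypothetical, and the real shift model may differ.

```python
from typing import Dict, List

# Role names are the labels used in the shift diagram.
ROLES = ["at-work", "on-call primary", "on-call secondary"]


def roles_for_slot(groups: List[str], slot: int) -> Dict[str, str]:
    """Assign one role per group for a given time slot.

    Assumes len(groups) == len(ROLES), so every slot has exactly one
    group at work, one primary on call, and one secondary on call.
    """
    if len(groups) != len(ROLES):
        raise ValueError("this sketch expects one group per role")
    n = len(ROLES)
    return {group: ROLES[(i - slot) % n] for i, group in enumerate(groups)}
```

With groups A, B and C, slot 0 puts A at work with B as primary and C as secondary; in slot 1 the at-work role moves to B, and in slot 2 to C, which is one way to "spread the pain" evenly.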

Tooling Integration
• A lot of screens to watch
• A lot of systems to work on
• A lot of interruptions
• Use your tools to “kill” them

Tooling Integration
As a good start: Kill “context switch” – work on a single platform

Tooling Integration
What’s next: Kill “all interruptions” – workflow automation across platforms
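Workflow automation across platforms usually means that one event, such as a monitoring alert, triggers a chain of actions in other tools without a person copying data between screens. The sketch below is a minimal in-memory illustration: the `Workflow` class, the step names, and the `log` list standing in for real ticketing and chat systems are all hypothetical.

```python
from typing import Callable, Dict, List

Alert = Dict[str, str]
Step = Callable[[Alert], None]


class Workflow:
    """Runs registered steps in order against a single alert."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def step(self, fn: Step) -> Step:
        # Used as a decorator: registers the function and returns it unchanged.
        self.steps.append(fn)
        return fn

    def run(self, alert: Alert) -> None:
        for fn in self.steps:
            fn(alert)


log: List[str] = []  # stand-in for the ticketing and chat platforms
wf = Workflow()


@wf.step
def open_ticket(alert: Alert) -> None:
    alert["ticket"] = f"TICKET-{len(log) + 1}"
    log.append(f"ticket {alert['ticket']} opened for {alert['host']}")


@wf.step
def notify_chat(alert: Alert) -> None:
    log.append(f"chat: {alert['summary']} ({alert['ticket']})")
```

Calling `wf.run({"host": "compute-01", "summary": "nova-compute down"})` would open a ticket and then post the summary with the ticket reference, so the operator sees both in one place.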

Cliché – Where BOOOOOM Happens
• Implementations & Operations: Change management
• The Practices of Upgrade
• The Story of HA
• The Myth of Live Migration

Change management
• “Infrastructure as Code”
• Incoming change requests
  • Customer initiated requirements
  • Internal enhancements roll out
  • Compliance
• Change planning for Consistency
  • Priorities
  • Dependencies
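Planning changes by priorities and dependencies can be illustrated as a dependency-respecting ordering in which, among the changes that are ready to run, the highest-priority one goes first. This is a sketch only, assuming a lower number means higher priority; the function and the change names in the usage note are hypothetical.

```python
import heapq
from typing import Dict, List, Set


def plan_changes(priority: Dict[str, int],
                 depends_on: Dict[str, Set[str]]) -> List[str]:
    """Order changes so dependencies run first.

    Among changes whose dependencies are all done, the one with the
    lowest priority number is scheduled next.
    """
    remaining = {c: set(depends_on.get(c, ())) for c in priority}
    ready = [(p, c) for c, p in priority.items() if not remaining[c]]
    heapq.heapify(ready)
    order: List[str] = []
    while ready:
        _, change = heapq.heappop(ready)
        order.append(change)
        for other, deps in remaining.items():
            if change in deps:
                deps.remove(change)
                if not deps:
                    heapq.heappush(ready, (priority[other], other))
    if len(order) != len(priority):
        raise ValueError("dependency cycle or unknown dependency")
    return order
```

For instance, if a security patch has the top priority but depends on a kernel update, the kernel update is scheduled first, then the patch, then any lower-priority independent change.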

OpenStack Upgrade
• Prerequisites: deployment automation
  • Consistency – cloud configurations in CMDB
  • Idempotency – code to run OpenStack upgrade
• Upgrade process design
• Upgrade orchestration
• Repeatable success & minimum disruption

Reference: Upgrading OpenStack: A Best Practices Guide
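Idempotency here means that running the upgrade code a second time leaves the cloud in the same state and makes no further changes. The sketch below shows the idea on a plain dictionary, assuming the desired settings would come from something like the CMDB; the function name, keys and values are illustrative, not taken from any real deployment tool.

```python
from typing import Dict


def ensure_settings(current: Dict[str, str], desired: Dict[str, str]) -> bool:
    """Bring `current` in line with `desired`.

    Returns True only if something had to change, so a second run
    against the same desired state is a no-op that returns False.
    """
    changed = False
    for key, value in desired.items():
        if current.get(key) != value:
            current[key] = value
            changed = True
    return changed
```

Because the function compares before it writes, an interrupted upgrade can simply be re-run: settings already applied are skipped and only the remaining ones are changed.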

Let’s talk about High Availability…
• Architecture decisions for HA
  • Eliminate SPOF; Non-disruptive upgrade; Load Balancing; …
  • Inherent availability = MTTF / (MTTF + MTTR)
• HA’s “dark side” for cloud operations
  • Recovery with HA resetting
  • Complexity’s impact on recovery time
• Mitigation plan
  • Built-in monitoring for HA mechanism
  • Recovery automation
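The availability formula makes the "dark side" concrete: if added HA complexity lengthens recovery, availability drops even when failures are no more frequent. The figures below are hypothetical and chosen only to show the effect.

```python
def inherent_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a component is up: MTTF / (MTTF + MTTR)."""
    if mttf_hours < 0 or mttr_hours < 0 or mttf_hours + mttr_hours == 0:
        raise ValueError("MTTF and MTTR must be non-negative and not both zero")
    return mttf_hours / (mttf_hours + mttr_hours)


# Hypothetical figures: same failure rate, but recovery takes four times
# longer because the HA setup is harder to reset.
simple = inherent_availability(1000, 1)
complex_ha = inherent_availability(1000, 4)
```

With a mean time to failure of 1000 hours, a 1-hour recovery gives roughly 99.9% availability, while a 4-hour recovery gives roughly 99.6%, which is why monitoring the HA mechanism and automating recovery matter.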

Live Migration?
• Does “nova live-migrate” work?
• Manage customer expectations
• Abuse prevention
• Limited appropriate scenarios
• Automation with caution
• Integration with pre & post-verification routine

Reference: Live Migration is a Perk, not a Panacea
@kiwik http://kiwik.github.io/openstack/2015/05/23/Nova-Live-Migration-Workflow/
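"Automation with caution" and the pre and post-verification routine can be sketched as a wrapper that refuses to migrate when a pre-check fails and flags the instance when a post-check fails. The sketch does not call Nova; `migrate` and the checks are hypothetical callables that an operator would supply.

```python
from typing import Callable, Iterable

Check = Callable[[str], bool]


def migrate_with_checks(server: str,
                        migrate: Callable[[str], None],
                        pre_checks: Iterable[Check],
                        post_checks: Iterable[Check]) -> str:
    """Run pre-checks, migrate, then run post-checks; return a status string."""
    failed = [check.__name__ for check in pre_checks if not check(server)]
    if failed:
        # Do not touch the instance if any precondition is not met.
        return "skipped: " + ", ".join(failed)
    migrate(server)
    failed = [check.__name__ for check in post_checks if not check(server)]
    if failed:
        return "needs attention: " + ", ".join(failed)
    return "ok"
```

Typical pre-checks might confirm that the target host has capacity and that the instance is in a migratable state; post-checks might confirm that the instance is reachable on its new host.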

11:25 Kickoff with Todd Moore, IBM Vice President, Open Technology

11:30 OpenStack for Beginners
      Shamail Tahir • Tyler Britten

12:15 The Open Cloud: A Platform of Possibilities
      Jesse Proudman • Azmir Mohamed

2:15 Don’t Just Take Our Word for It: Use Cases from Materna & AT&T
     Armin von Dolenga (Materna) • Jacob Caspi (AT&T)

3:05 Part 1 - Designing Effective Microservices
     Manuel Silveyra

3:55 Part 2 - Deploying Infrastructure Foundations
     Shaun Murikami • Andrew Bodine

5:05 Part 3 - Delivering Application Microservices
     Daniel Krook

5:55 Part 4 - Directing Deployments with DevOps
     Megan Kostick • Michael Brewer • Manuel Silveyra

Microservices on the Open Cloud

Enterprise Perspectives

4:30 Join Brad Topol and the Interop Challenge Vendors for refreshments

The Open Cloud: Delivering Solutions with Choice, October 26th, CCIB Room 116

Thank You