next generation of devops aiops in practice @baidu · • next generation of devops –aiops •...

26
Next Generation of DevOps AIOps in Practice @Baidu Xianping Qu & Jingjing Ha

Upload: others

Post on 21-May-2020

15 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Next Generation of DevOpsAIOps in Practice @Baidu

Xianping Qu & Jingjing Ha

Page 2: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

About Baidu

Page 3: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Agenda

• History of Baidu SRE team• Next generation of DevOps – AIOps• Best practice based on AIOps• Future

Page 4: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Tools (2007-2009)

Dev OPQA

• DevOps– Deployment– Monitoring– Budgeting– Consulting– …

• Problems– Human labor

Page 5: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Systems (2009-2012)

Operation System

Dev OPQA

• Building operation systems– Service management system– Monitoring system– Deployment system– Traffic scheduling system– Naming service– …

• Problems– Human labor(GUI, configure)

Page 6: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Platforms (2012-2014)

Operation Service

Dev OPQAO

peration Platform

• Building operation platforms– API– Configurable– Executable– …

• Problems– Reusability– Scalability

Page 7: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Standardization (2014-)

• Building operation standards– Unified language– Unified method– Unified solution– …

• Problems– Need a brain

Services, IDCsClusters, Servers

Page 8: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

AIOps (2014-)

• Intelligent Operation Platforms– Development framework– Big data– Algorithm

• Data mining, machine learning…

Page 9: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Development framework

User Code 1 User Code 2 User Code …

User Code Management

Develop

Kit

Operation Abstraction Layer

Interface

Driver Runtime Environment

Cloud/Paas Scheduler

Develop Tool-chain

IDE

Build

Debug

Simulation

Test

Prof

Page 10: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Operation Knowledge Database

product

Meta data Metric data Event data

app service instance

host IDC

person

network

Unified data

model

Data source

Dataproduction

process

mapping

service management model metaDB,TSDB,eventDB

mining

API & View

cleaning

feedback

throughput latency

cpu mem io

...

anomaly

root cause

change

remediation

...

raw data

authority, quota

controlling

bandwidth

error

disk

rtt

...

resulting data

structuraldata

calculating

Management platforms Monitoring platforms Operation platforms

Page 11: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Solution

Operation Knowledge Database

Deployment Management Incident Management

General Components and Tools(transmission、storage、scheduling…)

Operation Development Framework ( SDK, RE )

Ops Algorithm Development

Solution Development

Ops Platform Development

Anomaly detection

Traffic scheduling

Root cause analysis

Trend forecasting

Other data mining &machine learning

algorithms

Page 12: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Best practice based on AIOps• Incident Management

– Single cluster stop-loss by traffic shifting

• Deploy Management– Unattended deployment with automated checker

• Consulting– ChatBots do Consultation

Page 13: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

When will a failure occur?• Infrastructure issue • Program defects

• Change exception • Dependent service unavailable

Page 14: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

How to stop loss?

• Limited Failure in one cluster

– Deployment isolation

– Dependency decoupling

– Reduce global risk

When single cluster fails do perform traffic shifting

• Capacity redundancy

– Availability and cost trade-off

– N+M redundancy

– Service degradation

Page 15: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Two layer traffic shifting @Baidu

Data Center1

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Data Center2

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Data Center3

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

set

User set

Dependent Service cluster

Dependent Service cluster

Dependent Service cluster

Internet User set

shift traffic between user set and the edge node

shift traffic between front end and back end

BGW: Baidu Gate Way, layer-4 load balancer

BFE: Baidu Front End, layer-7 load balancer

Page 16: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Two layer traffic shifting @Baidu

• shift traffic between user set and edge node

– 10 minute to shift 80% traffic to the healthy edge node because

of DNS caching in the client side and ISP side

• shift traffic between front end and back end

– 10 second to shift 100% traffic to the healthy backend by

changing BFE’s routing configuration

Page 17: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Shift traffic between front end and back end

Concerns:• Service Capacity• Intranet bandwidth

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

set

User set

Dependent Service cluster

Dependent Service cluster

Dependent Service cluster

Internet User set

Scenarios:• Web service cluster • Dependent service cluster• Internal network switch

Data Center1 Data Center2 Data Center3

Page 18: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Shift traffic between user set and the edge node

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

set

User set

Dependent Service cluster

Dependent Service cluster

Dependent Service cluster

Internet User set

Concerns:• Bandwidth• BGW/BFE Capacity• Delay

Scenarios :• BGW & BFE• External network switch

Data Center1 Data Center2 Data Center3

Page 19: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Shift traffic between user set and the back end

Web Service cluster

BFE Cluster L7 LB

BGWL4 LB

Web Service cluster

BFE Cluster L7 LB

BGW L4 LB

Web Service cluster

BFE ClusterL7 LB

BGWL4 LB

User set

User set User

set

User set

Dependent Service cluster

Dependent Service cluster

Dependent Service cluster

Internet User set

Concerns:• Bandwidth• BGW/BFE Capacity• Delay• Service Capacity• Intranet bandwidth

Scenarios::• Entire data center failure

Data Center1 Data Center2 Data Center3

Page 20: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Single cluster stop-loss before AIOPs

Manual decision making

Perception Judgment Execution

Scattered Scripts

• Depends on personal experience

• Partial information

Traditional monitoring

• Handcrafted anomaly detection

• Lots of False Negatives and False Positives

• Poor code quality• Low availability

• Critical moment is unreliable• Wrong or slow decision making • No real-time feedback leads

to more serious failures

Page 21: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Single cluster stop-loss after AIOPs

Automated decision making Standard framework

• Rely on Algorithm Platform • Global information

Intelligent monitoring

• Intelligent anomaly detection• Abnormal event stored in

Operation Knowledge Database

• High precision and recall

• Development framework• Deployment framework

• High development efficiency• High availability

• Accurate and quick decision making

• Feedback control

Perception Judgment Execution

Page 22: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

The architecture of single cluster stop-loss

Abnormalevent DB

Network monitoring

Service Health Control

BFE Configuration Management

ServiceDegradation

VIP Health Control

DNS ConfigurationManagement

Internal Traffic Load balancer

Capacity DB

Metric DB

Operation Knowledge Database

Businessmonitoring

Applicationmonitoring

Execution

Judgment

Perception

Anomaly detectionAlgorithm

Traffic shifting

Algorithm

Algorithm Platform

Page 23: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Unattended deployment with automated checker

beginpre-check

pipe line

deploycheck

post-check

end

deploycheck

…confirm

Person Robot

confirm

confirm

confirm

Manual multiple metric dashboards check

Manual single anomaly event dashboard check

Automated API check and notify

Intelligent Anomaly detection

Operation Knowledge Database

Check process optimization

Page 24: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

ChatBots do consultation • Key points of building ChatBots

query example intention slots

xx模块从昨晚到现在有上线么? Change query Module : xx;Time:从昨晚到现在

今天xx模块有全流量上线么? Change query Module : xx; Time : 今天 ;Stage:全流量

• Change consultation scenario

1. Accumulate manually labeled query

2. Train an NLP model to understand the questions offline

3. Translate natural language questions into structured questions

4. Query operation knowledge database

5. Display results on SRE Service desk

Page 25: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

Future

• Dynamicresourceallocation• Capacitymanagement• Identificationofperformanceproblems• …

Page 26: Next Generation of DevOps AIOps in Practice @Baidu · • Next generation of DevOps –AIOps • Best practice based on AIOps • Future. Tools (2007-2009) Dev QA OP • DevOps –

[email protected]@baidu.com