AI Driven Day2 Operation Lai Kwai Seng Technical Solution Architect, Cisco Systems

Post on 24-Jul-2020


Page 1

AI Driven Day2 Operation

Lai Kwai Seng

Technical Solution Architect, Cisco Systems

Page 2

Agenda

© 2019 Cisco and/or its affiliates. All rights reserved. Cisco Public

• Introduction to Data Center Telemetry

• Data Center Telemetry Use Cases

• Operationalizing Telemetry

• Network Insights Resources

• Network Insights Advisor

• Network Assurance

• Key Takeaways

Page 3

How to manage the network?

syslog, SNMP, CLI:

• Hard to operationalize

• Incomplete

• Unstructured

• Device-specific

• Slow

Page 4

Network Telemetry Frees the Data

As Much Useful Data As Efficiently as Possible

Where Data Is Created: Sensing & measurement

Where Data Is Useful: Storage & analysis

Page 5

Key Telemetry Characteristics

Efficient Delivery

Tool-Chain consumption and Integration

Structure and Automation

Data-model Driven

Consistent format

Push not Pull

Analytics-ready Data

UDP
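
The characteristics listed on this slide (push rather than pull, a data-model-driven consistent format, analytics-ready data) can be illustrated with a minimal Python sketch. The record fields, the path strings, and the `Collector` class are hypothetical and do not correspond to any Cisco API; a real deployment would stream over a network transport rather than an in-process callback.

```python
import json
from dataclasses import dataclass, asdict
from typing import Callable, List


@dataclass
class TelemetryRecord:
    """One sample in a fixed, model-defined shape (hypothetical fields)."""
    node: str
    path: str          # model path naming what was measured (made up here)
    timestamp_ms: int
    value: float


class Collector:
    """Stands in for a telemetry receiver; it only stores what is pushed."""

    def __init__(self) -> None:
        self.received: List[dict] = []

    def on_push(self, payload: str) -> None:
        # Every record has the same structure, so it can be parsed and
        # stored without device-specific text scraping.
        self.received.append(json.loads(payload))


def push_samples(records: List[TelemetryRecord], sink: Callable[[str], None]) -> None:
    """Device side: serialize each record and push it to the subscriber."""
    for rec in records:
        sink(json.dumps(asdict(rec)))


collector = Collector()
push_samples(
    [
        TelemetryRecord("leaf-1", "sys/cpu/utilization", 1000, 37.5),
        TelemetryRecord("leaf-1", "sys/memory/used-percent", 1000, 61.0),
    ],
    collector.on_push,
)
```

The device decides when to send and the collector never polls, which is the essence of "push not pull"; the shared record shape is what makes the data usable by downstream tools without per-device parsers.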

Page 6

Why This Matters Now

What hasn’t changed: Use Cases

• Network Health

• Anomaly detection

• Troubleshooting / Remediation

• SLAs, Performance Tuning

• Capacity Planning

• Security

What has changed: Trends and Capabilities

• Real-time statistics

• Centralized / Software-defined

• Speed

• Scale

Page 7

Data Center Visibility Use Cases

Network Health

• CPU and memory utilization

• Forwarding table utilization

• Protocol state and events

• Environmental data

Path and Latency Measurement

• End-to-end visibility

• Path tracing over time

• Flow latency monitoring

Network Performance

• Interface utilization

• Buffer monitoring

• Microburst detection

• Drop event correlation
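
As one illustration of the network performance use cases above, microburst detection can be sketched as a scan over finely sampled buffer-occupancy data. This is a simplified sketch, not how any Cisco product implements it: the function name, the 0.8 occupancy threshold, and the 1000 microsecond duration limit are assumptions chosen for the example.

```python
from typing import List, Tuple


def find_microbursts(
    samples: List[Tuple[int, float]],
    threshold: float = 0.8,
    max_duration_us: int = 1000,
) -> List[Tuple[int, int]]:
    """Return (start_us, end_us) spans where occupancy stayed at or above
    `threshold` for no longer than `max_duration_us`.

    `samples` is a time-ordered list of (timestamp_us, occupancy) pairs,
    with occupancy given as a fraction of the buffer (0.0 to 1.0).
    """
    bursts = []
    start = None
    last_high = None
    for ts, occ in samples:
        if occ >= threshold:
            if start is None:
                start = ts
            last_high = ts
        elif start is not None:
            # Occupancy fell back below the threshold: close the span and
            # keep it only if it was short enough to count as a burst.
            if last_high - start <= max_duration_us:
                bursts.append((start, last_high))
            start = None
    # Close a span that is still open at the end of the trace.
    if start is not None and last_high - start <= max_duration_us:
        bursts.append((start, last_high))
    return bursts


trace = [(0, 0.10), (100, 0.85), (200, 0.95), (300, 0.20),
         (400, 0.15), (500, 0.90), (600, 0.10)]
bursts = find_microbursts(trace)
sustained = find_microbursts([(0, 0.9), (2000, 0.9), (3000, 0.1)])
```

Long periods above the threshold are deliberately excluded, since sustained congestion is a different condition from a short burst that averaged interface counters would hide.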

Page 8

System Info and Environmentals

Are my switches healthy?

• CPU

• Memory

• TCAM

• Power

• Temperature

Page 9

Protocol State and Events

Is routing working as expected?

[Figure: OSPF process state (Process ID 10, Router ID 10.1.1.1, Area 0.0.0.0), OSPF interfaces, and a chart of OSPF routes over time with an "Alert: Neighbor Lost!" event.]

Page 10

Monitoring Buffer Utilization and Drops

Incast or other oversubscription

Packet drops!

I see queue drops – but who’s affected?!

Page 11

Network Path and Latency Measurement

Application performance is slow between Server A & Server B!

Server A Server B

Page 12

Network Insights Resources - Customer Benefits

Capability areas: Statistics, Environmental Monitoring, Flow Analytics, Event Analytics, Endpoint Analytics

• Resource Utilization

• Fabric-Wide Capacity Planning, Trend Monitoring

• Troubleshoot Application Latency

• Identify Traffic/Protocol behavior

• Identify/Predict Failing Devices

• Avoid Environmental (CPU, Power, Memory, Fan, Storage) Related Failures

• Identify Subtle Path-Related issues

• Track endpoint details and moves

Operations

Page 13

NIR Architecture

Components shown in the diagram:

• Telemetry Sources: ACI/NX-OS (Hardware & Software)

• Telemetry Collectors

• Message Bus (Kafka)

• Anomaly & Correlation Engines

• Data Lake Connector

• Data Lake

• REST APIs

• REST Client

• NIR GUI

Page 14

Correlation Engine

Correlate normalized telemetry data streams from Transformation Receiver

LLDP

Buffer and Queue stats

Flow details

End-to-end Flow Path

End-to-end Path Latency

Buffer Occupancy and drops along Flow Path

Correlation based on timestamp and matching 5-tuple

Pipelines

Configs
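
The correlation described on this slide (matching records by timestamp and 5-tuple to derive an end-to-end path, path latency, and drops along the path) can be sketched in Python as follows. The `FlowRecord` fields, the one-second window, and the assumption of synchronized switch clocks are all hypothetical; the sketch does not reflect the internal design of the NIR correlation engine.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple

# src IP, dst IP, src port, dst port, protocol
FiveTuple = Tuple[str, str, int, int, str]


class FlowRecord(NamedTuple):
    switch: str
    flow: FiveTuple
    timestamp_us: int   # when this switch observed the flow
    queue_drops: int    # drops reported for the flow's egress queue


def correlate(records: List[FlowRecord], window_us: int = 1_000_000) -> Dict[FiveTuple, dict]:
    """Group records by 5-tuple, order hops by timestamp, and summarise."""
    by_flow: Dict[FiveTuple, List[FlowRecord]] = defaultdict(list)
    for rec in records:
        by_flow[rec.flow].append(rec)

    result = {}
    for flow, recs in by_flow.items():
        recs.sort(key=lambda r: r.timestamp_us)
        first = recs[0].timestamp_us
        # Keep only observations that fall inside one correlation window.
        hops = [r for r in recs if r.timestamp_us - first <= window_us]
        result[flow] = {
            "path": [r.switch for r in hops],
            "latency_us": hops[-1].timestamp_us - hops[0].timestamp_us,
            "drops": {r.switch: r.queue_drops for r in hops if r.queue_drops},
        }
    return result


flow = ("10.0.0.1", "10.0.0.2", 40000, 443, "tcp")
records = [
    FlowRecord("spine-1", flow, 1_000_020, 0),
    FlowRecord("leaf-1", flow, 1_000_000, 0),
    FlowRecord("leaf-2", flow, 1_000_045, 3),
]
summary = correlate(records)[flow]
```

Here the summary names the path leaf-1, spine-1, leaf-2 and attributes the drops to leaf-2, which is the kind of answer the "who's affected?" question on the earlier buffer slide calls for.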

Page 15

Operational Intelligence Engine for Network Insights

• Dynamic Correlation: Correlate information across data sources

• Proactive Alerts: See problems before end users do and alert

• Failure Prediction & Corrective Action: Ability to predict failure and provide corrective action

• Intelligent Insights: Ability to discover information with ease

Increase Availability and Performance

Page 16

Network Insights Advisor - Customer Benefits

Areas covered:

• Software/Hardware Recommendations, Workarounds

• Known Bugs/PSIRTs

• EOL/EOS, Field Notices, SMUs

• Version, Scale Limits/Hardening Check, Configuration

• Anomalies: Forwarding State Check, Network Anomaly Detection, unknown runtime and config anomalies

Benefits:

• Avoid multiple TAC calls

• Significant CAPEX and OPEX savings

• Remove complexity

• Avoid outages

• Faster deployment times

• Keep network up to date

• Adhere to Cisco policies and recommendations

• Prevent traffic black holing

• Avoid downtimes

Page 17

Network Insights Advisor Targeted Use Cases

Proactive supportability insights

Fabric wide analysis

• Dashboard: "Give me a summary of issues"

• Advisories: Provides advisories based on anomalies, bugs, PSIRTs and field notices. Measure upgrade impact

• Anomalies: hardening checks, scale checks

• Bugs and PSIRTs: Known bugs and vulnerabilities in the system

Page 18

First, We Need Data!

Network provides (on-prem data, collected at a user-specified interval):

• Running config of all devices

• “show tech” from all devices (including APIC)

Cisco provides (cloud data, every 24h):

• Best practices updates

• PSIRTs, FNs, EOS/EOL

• Software release notifications

• Digitized signatures of known defects

NIA sits between the network and Cisco and receives both data sets.

Page 19

Use Case – Notify About Issues: Known Bugs

Workflow between Fabric, NIA and Insight DB (weekly sync):

1 Monitor

2 Detect

3 Alert / Inform

4 Implement

Detected: CSCDT2396 SAL1820SDRE

Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)

Stages: Detect, Alert, Remediate

Page 20

Network Insights issue detection

NIA – Core: Hardening Check, Signature Matching, Advisory Services, Storage

NIA – GUI

Data Sources: Tech Support and ‘show run’ collection

Interacting with Cisco Services via NIA-PROXY

Bugs/PSIRTs detection: Tech supports from the switch are collected and matched with signatures of external known caveats

Hardening Check: Hardening guide is digitized into signatures and matched with show run from each switch

Insights DB: Updated periodically with signatures from the cloud
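
The idea of digitizing a hardening guide into signatures and matching them against each switch's running configuration can be sketched with regular expressions. The `Signature` structure, the rule IDs, and the two sample rules are invented for illustration; the actual NIA signature format is not described in this deck.

```python
import re
from typing import Dict, List, NamedTuple


class Signature(NamedTuple):
    sig_id: str
    description: str
    pattern: str       # regular expression evaluated against the config text
    must_match: bool   # True: line is required; False: line must be absent


def check_config(running_config: str, signatures: List[Signature]) -> List[str]:
    """Return the IDs of signatures that the configuration violates."""
    violations = []
    for sig in signatures:
        found = re.search(sig.pattern, running_config, flags=re.MULTILINE) is not None
        if found != sig.must_match:
            violations.append(sig.sig_id)
    return violations


# Hypothetical hardening rules, not taken from any published guide.
SIGNATURES = [
    Signature("HARD-001", "Telnet server should be disabled", r"^feature telnet$", False),
    Signature("HARD-002", "SSH server should be enabled", r"^feature ssh$", True),
]

# Stand-ins for the 'show run' output collected from each switch.
config_by_switch: Dict[str, str] = {
    "leaf-1": "hostname leaf-1\nfeature ssh\n",
    "leaf-2": "hostname leaf-2\nfeature telnet\n",
}

report = {name: check_config(cfg, SIGNATURES) for name, cfg in config_by_switch.items()}
```

Keeping the rules as data rather than code is what allows them to be refreshed periodically from the cloud without changing the matching logic.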

Page 21

Use Case – Notify Me About Recommended Releases

Workflow between Fabric, NIA and Insight DB:

1 Monitor

2 Identify Switches

3 Alert / Inform (Push Notification)

4 Implement

Notifications

Affected devices: 3

Leaf 1, Leaf 2, Leaf 3

With BUG ID: XYZ

Recommend: Upgrade S/W to NXOS 7.0(3)I7(3)

Stages: Detect, Alert, Remediate

Affected devices, S/W Notify
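
The "identify switches, then alert" steps of this workflow amount to matching each switch's running software version against a list of known bugs. The sketch below assumes a simple inventory of switch name to version and a made-up mapping of bug XYZ to affected releases; none of these structures come from the NIA product.

```python
from typing import Dict, List, NamedTuple, Set


class BugSignature(NamedTuple):
    bug_id: str
    affected_versions: Set[str]   # hypothetical list of affected releases
    recommended_version: str


def build_notifications(inventory: Dict[str, str], bugs: List[BugSignature]) -> List[dict]:
    """Match each switch's running version against the known-bug list."""
    notes = []
    for bug in bugs:
        affected = sorted(
            name for name, version in inventory.items()
            if version in bug.affected_versions
        )
        if affected:
            notes.append({
                "bug_id": bug.bug_id,
                "affected_devices": affected,
                "recommend": f"Upgrade S/W to NXOS {bug.recommended_version}",
            })
    return notes


# Illustrative inventory; which releases are affected is an assumption.
inventory = {
    "Leaf 1": "7.0(3)I7(1)",
    "Leaf 2": "7.0(3)I7(1)",
    "Leaf 3": "7.0(3)I7(2)",
    "Leaf 4": "7.0(3)I7(3)",
}
bugs = [BugSignature("XYZ", {"7.0(3)I7(1)", "7.0(3)I7(2)"}, "7.0(3)I7(3)")]
notifications = build_notifications(inventory, bugs)
```

With this sample data the single notification lists Leaf 1, Leaf 2 and Leaf 3 as affected, mirroring the example shown on the slide.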

Page 22

Network Assurance Engine: How it Works

• Data Collection: Capture DC-wide intent, policy, control/state across forwarding & security

• Formal Modeling of Network: Precise mathematical models that codify Cisco’s 30+ years of networking and cross-customer domain knowledge

• Continuous Analysis: Models verify that the network operates per intent and accurately tell what is wrong, where, why, the impact and how to fix it

Reactive Troubleshooting to Proactive Operations - continuously, network wide

Page 23

Continuous Assurance Workflows

• Compliance Analysis: Is my network compliant with governance rules?

• Epoch Delta Analysis: Did something change in my network?

• Connectivity Analysis: Can A talk to B?

Page 24

Smart Events & Compliance Score for Compliance

COMPLIANCE SATISFIED SMART EVENT

• Identify compliant policy

• Identify requirements satisfied

• Identify compliant EPGs

COMPLIANCE VIOLATED SMART EVENT

• Identify non-compliant policy

• Identify requirements violated

• Identify non-compliant EPGs

COMPLIANCE SCORE

Page 25

Epoch Delta Analysis

Correlated Ad hoc Analysis Workflow

4 Qs, correlated answers…

• What changed?

• Who was impacted?

• Was it due to config changes?

• What happened as a result?

Use Cases

• Change Management

• Root-cause analysis

• Migration

• Maintenance Upgrades

• Capacity Management

Before / Baseline

After / Current
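
The "what changed?" question in an epoch delta comes down to comparing a baseline snapshot with a current one. The following sketch treats each epoch as a plain dictionary of object name to settings; the snapshot layout and object names are assumptions for illustration and not the Network Assurance Engine's data model.

```python
from typing import Any, Dict


def epoch_delta(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, dict]:
    """Compare two snapshots keyed by object name."""
    added = {k: after[k] for k in after.keys() - before.keys()}
    removed = {k: before[k] for k in before.keys() - after.keys()}
    changed = {
        k: {"before": before[k], "after": after[k]}
        for k in before.keys() & after.keys()
        if before[k] != after[k]
    }
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical snapshots taken before and after a maintenance window.
baseline = {
    "epg-web": {"bridge_domain": "bd-1", "contracts": ["web-to-app"]},
    "epg-app": {"bridge_domain": "bd-2", "contracts": ["web-to-app", "app-to-db"]},
    "epg-old": {"bridge_domain": "bd-9", "contracts": []},
}
current = {
    "epg-web": {"bridge_domain": "bd-1", "contracts": ["web-to-app"]},
    "epg-app": {"bridge_domain": "bd-2", "contracts": ["web-to-app"]},
    "epg-db": {"bridge_domain": "bd-3", "contracts": ["app-to-db"]},
}
delta = epoch_delta(baseline, current)
```

Answering who was impacted and whether a config change caused it would require joining this delta with event and audit data, which the sketch leaves out.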

Page 26

Health Delta - Summary

Change in the health of the Fabric

Page 27

Epoch Delta Workflow – Policy Delta

Impact, Change, Operator

• What got impacted? (Details of impact, if any)

• Who made the changes?

• What has changed?

Page 28

Forwarding Connectivity Analysis

Use Cases

• Forwarding Communication Issues across entire fabric

• Visibility into Route Leakage

• Visibility into Fabric Communication with External Network

• Policy and Forwarding Inconsistencies

Page 29

Key Takeaways

• Nexus leads the industry in telemetry capabilities

• Combination of software and hardware streaming provides the deepest level of network visibility

• Platforms for consuming, analyzing and visualizing telemetry data are available or being developed for both ACI and standalone NX-OS

• Both Cisco turnkey solutions and custom/third-party integrations exist today