network telemetry and analytics in the age of big data · 1059 root 1 132 0 24344k 18936k run...

14
NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA RUTURAJ PATHAK/Senior Product Manager/INVENTEC

Upload: others

Post on 10-Jun-2020

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

N E T WO R K T E L E M E T RY A N D A N A LY T I C S I N T H E AG E O F B I G DATA

RUTURAJ PATHAK/Senior Product Manager/INVENTEC

Page 2: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

• NIDC Submissions for OCP Certification

– Specifications Accepted by OCP as of January 2016

• 10/40G: D6254QSBP and D6254QSBX

– http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS6072QS

• 100G: D7032Q28BP and D7032Q28BX

– http://www.opencompute.org/wiki/Networking/SpecsAndDesigns#Inventec_DCS7032Q28

2

Open Computing Project (OCP) Participation

OCP Accepted!

D6254QSBP and D6254QSBX D7032Q28BP and D7032Q28BX

Inventec Confidential

Inventec is a Platinum Member !!!

Page 3: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Inventec Confidential 3

Open Architecture

NETWORK APPS MONITORING APPS MANAGEMENT APPS

REST APIs REST APIs

KPI, SLA, Capacity

RealtimeSecurity

Resource discoveryReconciliation, Real time Provisioning

SAI

Page 4: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Inventec Confidential 4

RUDIMENTARY TELEMETRY

root> show chassis alarms1 alarms currently activeAlarm time Class Description2014-07-29 07:27:12 UTC Minor Host 0 Temperature Warm -------------------------------------------------------<Syslog Messages>Jul 29 07:26:47 chassisd[1387]: CHASSISD_SNMP_TRAP6: SNMP trap generated: Over Temperature! Red Alarm2014-07-29 08:07:50 UTC Major Host 0 Temperature Hot <Syslog Messages>CHASSISD_RE_OVER_TEMP_WARNING: Routing Engine 0 temperature (73 C) over 72 degrees C, platform will shut down in 240 seconds if condition persists

Debugging based on such information is difficult

Page 5: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

5

TELEMETRY today

Checking for CPU events and setting up notifications may not work

•CLI sessions are not closed gracefully on the router. In this case, one would see mgd running high on CPU, starving the kernel of CPU cycles.1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization•One way to address this issue is to kill the mgd processes eating up the CPU.

•'Sampling' is enabled on the router. This sometimes leads to high kernel CPU; to address this, reduce the rate at which you are sampling on the router.

Page 6: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

6

TELEMETRY issues

▪ Quite coarse data granularity

▪ SNMP polling puts lot of load on CPU and has severe scaling issues

▪ CLI Scripts break and need frequent changes

▪ Even IPFIX Flow sampling misses important information

▪ No Data Correlation

▪ Reactive, yet no information or hint given on root cause

SNMP SYSLOG CLI Scripts

A shift in the way we optimize and diagnose the networks is required

Page 7: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Inventec Confidential 7

SDN TELEMETRY & ANALYTICS- Agent Based

PROGRAM

AUTOMATIC/MANUAL

TELEMETRY

VISUALIZE

ControllabilityObservability

Closed loop feedback

SDN Controller

GUI

AGENTNOS

SAI

SDK

OPEN API

HOST CPU

Page 8: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Inband Network Telemetry with P4

Inventec Confidential8

insert or modify packet headers with custom metadata

Page 9: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

9

Discovering information

I

▪ This fan speed increase is in response to abnormal behavior in chip…

▪ This switch A is seeing more congestion @ 3PM because of …

▪ Bit error rate is increasing on interface x/y/z due to … Packet loss in 3 min.

Do we need a Crystal Ball to answer the above questions and ACT on it?

Optimize Network for Data Center SLAs

▪ Latency▪ Network Jitter▪ Packet Loss▪ Bandwidth Guarantees

Page 10: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

10

Why Deep learning now?

SDN is the key enabler!

▪ Plethora of Data available

▪ Lot of cheap compute power and storage available now

▪ More layers of NN are required to solve complex problems

▪ Introduction of GPUs : Perfect for matrix multiplication

▪ NN Algorithms have matured and can be scaled

▪ Lot of research is being conducted in this field

NN can be used to control nonlinear dynamic system

Page 11: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

11

Deep Reinforcement Learning

• Learning Behaviors and skills• No modeling • Sequence of decisions is necessary• Actions have consequences• Environment Stateful

RL Agent

Reward (network delay)State Observation (link bandwidth)

Action (spine-leaf link weight)

Environment(Network)

Sequence of states and actions: s0, a0, r0, sT-1, aT-1, rT-1, sT, rT

Used for non linear complex multidimensional systemsTransition Function : P (st+1, rt) | st, at)

Page 12: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Inventec Confidential 12

SDN TELEMETRY & ANALYTICS- Deep Learning

PROGRAM

AUTOMATIC/MANUAL

TELEMETRY

VISUALIZE

ControllabilityObservability

Closed loop feedback

SDN Controller

GUI

AGENTNOS

SAI

SDK

REST API

HOST CPU

DEEP LEARNING

Page 13: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization

Conclusion

• Can we design proactive networks

• Can we get predictive insights

• Can we do Risk Mitigation

• Can we do Anomaly Detection

• Can we make networks more efficient?

AI/Robots will be Omnipresent by 2025

13

Monitoring Topology

Management

Maintenance

Performance

Location

Functional Data

Data Center

Costing

Opinion: Yes

LEARN

ANALYZE

PATTERNSANTICIPATE

PROBLEMS

SUGGEST

Page 14: NETWORK TELEMETRY AND ANALYTICS IN THE AGE OF BIG DATA · 1059 root 1 132 0 24344K 18936K RUN 405.0H 43.75% mgd 26275 root 1 132 0 24344K 18936K RUN 353.5H 43.75% mgd CPU utilization