using machine learning with wide area networks (wans) · • google’s dnn platform tensorflow...

43
Using Machine Learning with Wide Area Networks (WANs) Dr. Mariam Kiran ANTG Research Group Energy Sciences Network (ESnet) Lawrence Berkeley National Lab Summer Student 2017 Talk 1

Upload: others

Post on 21-May-2020

4 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Using Machine Learning with Wide Area Networks (WANs)Dr. Mariam Kiran

ANTG Research GroupEnergy Sciences Network (ESnet)

Lawrence Berkeley National Lab

Summer Student 2017 Talk

1

Page 2: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Agenda

• Machine learning (ML) is everywhere: – Is it the Next Hype or substantial?

• But what exactly in ML versus AI (versus deep learning)?

• Some thoughts on Networks with additional ML support

• The future? Current projects:– Project 1: ‘Talking’ to Networks– Project 2: Self-healing Networks

2

Page 3: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

ML Riding the Wave

3

ML at the peak

Page 4: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Is Machine Learning New?

Page 5: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Is it New?

• NOPE!

• Science Fiction has been exploring it for ages

• Brought the main ideas around AI and human interactions

5

Page 6: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

ML in the 90s…

• Computers becoming smarter:– Playing Chess with your

Computer– Making games difficult for

players

• A number of innovations in multiple areas of– Speech recognition– Image recognition– Robotics– Philosophy of mind

6

Page 7: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

What is the Difference between AI, ML and DL

• Turing‘s paper “Can Machines Think!” – Turing Test : Exhibit human-like intelligence • Recently seen in movies

• Machine learning is an approach to achieve AI – spam filters, HR• Deep learning is one of the techniques for ML:

• Recent advances due to GPU and HPC processing (previously very slow, too much data, need training to work)

• Mainly for image and speech recognition – commercial apps

Nvidia blog

Page 8: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

AI Tree (subset is defined as ML techniques)

AI

Optimization technique

Many more….

Expert systems

Fuzzy systems

Neural Networks

Evolutionary algorithms (Genetic algorithms, evolutionary strategies, etc)

Swarm intelligence (ant colony, particle swarm, more)

Deep belief networks

Deep boltzman networks

Convolutional networks

Stacked autoencoders

Networks : graph algorithm (routing –shortest path)

Where ever learning involved (training):

ML

Page 9: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Trying out Deep Learning Libraries• Google’s DNN platform TensorFlow used to tag unlabeled videos,

recognize images with 70% accuracy and predict Gmail replies• Scikit-learn good for learning, python library• NERSC for data analysis in simulations, e.g. climate image analysis

• HPC innovation: analyze massive data sets, quick training • Model and data parallelism

Toolkit Language Use Processing capabilityCaffe C++ Images and video Distributed

(HPC, GPU)TensorFlow Python Images, regression, video, text,

speechDistributed(HPC, GPU)

Theano Python Images Distributed(HPC, GPU)

Torch Lua Images and speech Distributed(HPC, GPU)

Page 10: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Choosing Algorithms for Specific Problems Deep neural network

Input Data Applied for Variants

Feed forward neural network

Hierarchical data representations

• General classification• Clustering• anomaly finding• feature extraction

• Deep belief networks (uses restricted boltzman machine for activation function)

• Convolutional neural networks

Recurrent neural network

Sequential data representation (i.e. time series data)

Sequential learning, especially useful when time relationship exists.

Long short term memory (LTSM) used for speech translation.

• There are many variants of DNNs. Papers and researchers in each specific DNN• DeepMind used Deep Q-learning for Attari and Go

• Action-pairs based on learned data.

Page 11: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

12

Each algorithm is chosen depending on data being explored

and problem being explored(some 50% accuracy, others 80%

accuracy)

Page 12: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Also Choose Techniques based on ‘Learning’

Machine learning

Knowledge based reasoning

Probabilistic reasoning

Data-driven reasoning

Ontology based reasoning

Kalman filter

Hidden Markov models

Unsupervised

Semi supervised

Supervised Decision trees

Rule based systems (expert systems)

Bayesian networks

Support vector machines

K-means

Deep boltzmanUsing statistical

classification (need training)

Page 13: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

What would Networks do with Machine learning?

Range of possibilities of ML algorithms

• Problem

• Data

• Processing/Experience

Page 14: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

ESnet Background

• R&E networks for science (CERN, LHC, and more)• Provide reliable robust network connections to enable science workflows• Investigate research and techniques to help build better networks• Guarantees for our scientists for network needs (users)

16

Page 15: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

For example: Many Actors, softwares, data, etc ….

- 19 -

HipGISAXS & RMC

GISAXS

Slot-die printing of Organic photovoltaics

Borrowed from E Dart

Page 16: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

ESnet Team continually engaged

• Science workflows (using NSI or OSCARS)– Multi-domain provisioning

• Transfer tools and protocols (using Globus) (TCP research)– Ease of use, Reliable

• R&E Networks support big data oriented services (using ScienceDMZ)– Dedicated Bandwidth on demand, loss free– Isolation– Monitoring (perfSonar)– Network virtualization

• Network research– Virtualization, SDN, switches, routers, etc

20

Designing for

• Specific science cases

• End-users

• Network engineers

Teams:

Science Engagement, Software, Infrastructure, SDN Testbed, etc

Page 17: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

A Day in the Life of a Packet

21

Page 18: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Traffic Volume Growing Exponentially

22

203020171990

Page 19: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Managing Multiple Sites

23

• Different traffic requirements

• Quality of service, bandwidth, speed, time-based deliveries, etc

• Reliability and heterogeneity

Page 20: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Automating Networks using ML

• Predict traffic peak times• Find anomalies for security threats• Optimize how paths are currently

utilized• Predict link failures• Understand or predict user behavior

and network use

• These are Core Network Research Problems!

Page 21: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Most uses of Machine Learning in Networks (IETF forums)

• Network Security– Normal and outlier behaviors in traffic

• Change or predict possible behavior– This <QoS value> will cause this <event Y> with probability <P>

• Bug detection– Software or hardware faults

• WAN path optimization– Anticipate congestion – Divert traffic to alternate paths

Page 22: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Problem becomes quite Complex

• WAN are complete system with multiple layers

• Focus on specific case studies

• Over 2000 papers in the area

Page 23: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

WANS are complex!

Page 24: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Depends on what we use ML for…

User traffic data User traffic(directed flows)

28

WAN Topology(traffic engineering)

(flow-level, traffic prediction, adaptation, path optimization, link failure)Infrastructure traffic data

(Packet-level, queues, TCP, UDP)

Infrastructure-level modifications (Switches, deployment, etc)

Page 25: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

What is Most Published?

• Most ML techniques used for classification (of traffic) and prediction (failures)

• Recent Google papers have been most influential: – B4, Jupiter, BwE, etc. (data center to user-based provisioning)

• Network tools enhanced by embedding informed decisions such as traffic awareness for: – Forming topologies, optimum path finding– Improve path utilizations depending on traffic

0

10

20

30

40

50

60

User Traffic Traffic Engineering Packet-levelimprovements

Optimizinginfrastructure

ML Non-ML

No.

of p

aper

s (2

010-

2017

)

Page 26: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Contribute to Research (tools, software, papers)

• Simple graph optimization algorithms still largely used (non-ML)

• Room for unexplored ML techniques

• Game theory also seems a promising area (local vs global)

• Our applications are more complex, more actors and requirements– We need to develop network techniques catering to science workflows

30

Global optimum

Local optimum

Page 27: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Current Solutions being explored:Two projects using MLIntent-driven networks: INDIRA

Self-healing networks

• Studying ML to understand how we can apply to our problems

• Start with simple algorithms and progress to more complex tasks

• Depends on the Goals we want to achieve

Page 28: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Intent-driven networks (user-level)

• Focuses on improving user network interaction

• Intent-based network– Solution: development of the INDIRA tool– Use of semantic ontology– Network provisioning and rendering of commands– Current state – interaction with multiple tools (NSI, globus, and more)

32

Page 29: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Setting the Stage

33

R&EESnetnetworks

DoEuniversities

instruments

facilities

scientistscientist

Networkengineer

Networkengineer

Networkengineer

Networkengineer

scientist

scientist

I want to watch a movie tonight on netflix

scientist

I want to see my real time high resolution big data

visualization

I want to stream the big data directly into the cache

of my super computer

• Applications have complex workloads • Network behavior tailored for my application ‘intent’• Difficult to fulfill these diverse set of needs

• Learning curve is huge and complex• Difficult to specify needs in ‘english’• Specify in high-level language, portable, multi-domain

Page 30: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Setting Up Paths for Individual: Intent• Traffic paths provisioned with basic QoS values, what if this is optimized for ‘end-users’

• Rather than running the following command: (setting up a link with QoS)

• Networks can understand users: “Tell me what do you want!”– Example:

Scientist> Can you set up a connection between Berkeley and Argonne.

Network> Do you want guaranteed bandwidth?

Scientist> Sure!

Network> OK! Ill get this setup for you................................. You’re all set!

34

./onsa reserveprovision -g urn:uuid:6e1f288a-5a26-4ad8-a9bc-eb91785cee15 -d es.net:2013::bnl-mr2:xe-1_2_0:+#1000 -s es.net:2013::lbl-mr2:xe-9_3_0:+#1000 -b 5096 -a 2016-11-13T09:00:00 -e 2017-04-04T17:00:00 -u https://nsi-aggr-west.es.net:443/nsi-v2/ConnectionServiceProvider -p es.net:2013:nsa:nsi-aggr-west -r canada.eh:2016:nsa:requester -h 198.128.151.17 -o 8443 -l /etc/hostcert/muclient.crt -k /etc/hostcert/muclient.key -i /etc/ssl/certs/ -y -x -z -v -q;

Page 31: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Introducing iNDIRA... “Hello! Im iNDIRA”

Language processing to take intent input

• Automate rendering into network commands like bandwidth, time schedule, topology

• Optimize the network• Return success or failure to user

• Understand English (e.g. transfer, connect)• Check conditions • Ask for any further details• Check conflicts and permissions

NLP, OWL, “AI” Network engineering

Renderer translates intent

intent

Network state

iNDIRA

(Intelligent Network Deployment Intent Renderer Application)

“ I want to send data to my SuperComputer at NERSC by 5:00pm today”

“ Ok ill reconfigure the network to make this possible!”

Page 32: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Indira’s working: top level view

• Intent parser engine• ML to optimize decisions• Knowledge base

– Project profiles– Semantics – Network services and

conditions

• Backend details:– Multi-layer provisioning– Multi-tool connectivity– No platform dependence

(multiple programming languages)

36

User interface

Phys

ical

la

yer

Virt

ual

laye

r

Page 33: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Negotiate with Users

37

INDIRA> Tell me what want the network to do? Hint “I want to [connect] [disconnect] servicesScientist> I want to transfer files from LBL to ANLINDIRA> Do you have any time constraints with the transfer from LBL to ANL?Scientist> No

INDIRA> Hello and welcome! Tell me your project name?Scientist> BobINDIRA> Is ‘Bob’ your project username? Y/NScientist> NINDIRA> What is your project usernameScientist> Project1

INDIRA> Ok. Give me the link to the file to be moved. Scientist> project_test@LBL/datasource1 to project2@ANL/datasetINDIRA> Do you have any bandwidth requirements?Scientist> Yes Maximum INDIRA> Sorry, you have only 5Mb/s allowable to use. Do you want to use all of this?Scientist>NoINDIRA> Thank you. Please contact the network administrator to allow you more bandwidth, as you don’t have permission to go above 5Mbps

Page 34: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Negotiate with Users

38

INDIRA> Tell me what want the network to do? Hint “I want to [connect] [disconnect] servicesScientist> I want to transfer files from LBL to ANLINDIRA> Do you have any time constraints with the transfer from LBL to ANL?Scientist> No

INDIRA> Hello and welcome! Tell me your project name?Scientist> BobINDIRA> Is ‘Bob’ your project username? Y/NScientist> NINDIRA> What is your project usernameScientist> Project1

INDIRA> Ok. Give me the link to the file to be moved. Scientist> project_test@LBL/datasource1 to project2@ANL/datasetINDIRA> Do you have any bandwidth requirements?Scientist> NoINDIRA> Thankyou. I am configuring your transfer to start ‘now’ at 5GB/s…

….....

Congratulations....All Done!

Page 35: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Exploiting ML • Frontend AI:

– Rule based reasoning and pattern matching from ontologies (RDF/OWL)– Learn usage patterns so iNDIRA can make better suggestions (pattern recog.)

• Backend AI:– Find available links– Check user requirements are met– Optimize paths provisioned using ML (graph optimization)– Learn optimal network behaviors (pattern recognition)

• Work across multiple network layers

• No changes done to the multiple tools attached:– iNDIRA can ’converse’ with users and translate their needs to NSI/Globus

– Room to explore other ML techniques to improve iNDIRA’s performance

39

Page 36: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Towards a Self-healing Network

Project 2: Initial phases

Page 37: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Building Self-healing Networks (infrastructure)

• ML algorithms explored across multiple layers– Behavior forecast and

anomaly using DL– Simple classification

for traffic patterns• Recovery phase:

Reactive or rule based systems (if.. then)

• Bring it all together to solve one problem

• Objective: Keeping network as ‘stable’ as possible

41

LBL

FNL

ANL

CRN

Telemetry

Learning phase(training data sets, classification, etc)

Real-time monitoring Recovery phase

(action-plan)

Anomaly detection

Behavior forecasting

Page 38: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

High-level Research Questions

• What does it mean for a network to be ‘Stable’?

• What tools or devices we have control over to help automate recovery?

• What are the suitable machine learning algorithms for us?– Real-time anomaly detection: Need quick response– Data collection issues: 1s versus 30s intervals

• How often do we do training?– Networks are extremely dynamic, we want to limit processing for

optimum results

• Cost of processing data, quicken processing so that we can react quickly?

• This is a long term goal!

42

Complex Problem that needs to be broken down and solved as individual

parts!

Page 39: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Leveraging Existing Research

• Classification and Anomaly detection have extensively been studied.

• Exploiting or coupled with more techniques to understand – Network behavior– Network-user interactions or Science use cases

• Working with multiple data sets to find relationships

• Understanding what can be automated or not?

• Computational aspect: Cloud, HPC and GPU processing times…

43

Page 40: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Many Data Sets

• Perfsonar logs: throughput, loss, utilization

• SNMP data: ingress, outgress

• Netflow data: More packet data (TCP, UDP, etc)

• And much more…

44

LBL

FNL

ANL

CRN

Detecting performance anomalies in real-time

Cannot predict traffic patterns yet….

A. Chabbra results

Page 41: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Building Self-healing Networks

• Bring it all together to solve one problem like a puzzle

• Objective: Keeping network as ‘stable’ as possible

45

LBL

FNL

ANL

CRN

Telemetry

Learning phase(training data sets, classification, etc)

Real-time monitoring Recovery phase

(action-plan)

Anomaly detection

Behavior forecasting

Page 42: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Summary: The Future?

• If you are really interested: “2001: A Space Odyssey”• Abstract:

– HAL 9000 says “There is a 70% chance a link will fail in the next 48 hours”– Humans reply “No our results don’t show that”– HAL 9000 says “Human error!”

• So who was correct?

• AI has advanced and shown some promise: Used in a lot of applications in various fields• Solve some current problems:

– Our focus is on automation and optimization• ESnet is leading efforts in network (research, data, tools, expertise and complex apps)• NERSC are leading machine learning with HPC, software, scaling methods, more • Combining techniques (and algos) to advance research in explored:

– New areas in network and perhaps even more

46

Page 43: Using Machine Learning with Wide Area Networks (WANs) · • Google’s DNN platform TensorFlow used to tag unlabeled videos, recognize images with 70% accuracy and predict Gmail

Contact

• Thankyou!

• Summer internships/students

• Feel free to reach out for more information/collaboration/ideas:

• <[email protected]>

47