networking in the ‘jetson age’
TRANSCRIPT
Networking in the ‘Jetson Age’
1
Inder Monga, Director, Energy Sciences NetworkDivision Director, Scientific Networking, LBNL
ESnet: Department of Energy’s high-performance science network facility
ESnet provides connectivity toall of the DOE labs, experiment sites, &
supercomputers
Our vision:
Scientific progress will be completely unconstrained by the physical location of instruments, people, computational resources, or data.
3
Our vision:
Scientific progress will be completely unconstrained by the physical location of instruments, people, computational resources, or data.
7
Data over distance
• Transfers over a shared network are not predictable• Best-effort delivery can also mean worst-effort delivery
Challenge: Predictable Network Transfers at scale
Chard K, Dart E, Foster I, Shifflett D, Tuecke S, Williams J. (2018) The
Modern Research Data Portal: a design pattern for networked, data-
intensive science. PeerJ Computer
Science4:e144 https://doi.org/10.7717/peerj-cs.144
Nine orders of
variability for Petrel
Application Workflow
Agent
Failure on a network element, problem fixed
15 Gbps P2P service between Caltech and Fermilab Instantiated
20 Gbps available for a P2P service between Caltech and Fermilab
Endpoint Listing
Challenge: Applications cannot ‘dialogue’ with the network
Please provide a listing of all available provisioning
Endpoints
What is the maximum bandwidth available for a
P2P service between Caltech and Fermilab?
Request 20 Gbps P2P service between Caltech and
Fermilab. If 20 Gbps not available 10Gbps is ok.
Service is not working, please check status
Please provide a listing of all available provisioning
Endpoints
Endpoint Listing
What is the maximum bandwidth available for a
P2P service between Caltech and Fermilab?
20 Gbps available for a P2P service between Caltech and Fermilab
Request 20 Gbps P2P service between Caltech and
Fermilab. If 20 Gbps not available 10Gbps is ok.
15 Gbps P2P service between Caltech and Fermilab Instantiated
Service is not working, please check status
Failure on a network element, problem fixed
Machine Learning applied to network telemetry data –learn, understand and optimize
7/30/20188
Training input
Training output
Sliding Window
Predicting traffic per link/siteMethod: Deep Machine Learning (Recurrent Neural Network) for time-series data
• Predict anomalies (or peaks) accurately 15 minutes in future
• Graph showing 1 minutes prediction in future
7/30/20189
Language processing to take intent input
• Automate rendering into network commands like bandwidth, time schedule, topology
• Optimize the network• Return success or failure to user
• Understand English (e.g. transfer, connect)• Check conditions, conflicts and permissions• ML in Natural Language Processing for
intelligent negotiation with user
NLP, OWL, “AI” Network configuration
Renderer translates intent
intent
Network state
iNDIRA
“ I want to send data to my SuperComputer at NERSC by 5:00pm today”
“ Ok ill reconfigure the network to make this possible!”
Machine Learning applied to dialogue with applications –understand the intent
Vision (or a challenge): A cognitive network
7/30/201811
Network can:
Assimilates internal and external information (e.g. usage, maintenance schedules, component MTF, driver’s schedule, etc.)
Anticipates trips based on routines and disruptions (e.g. scheduled maintenance)
Adapts route and departure time due to road, (current and expected) traffic, and weather conditions
Machine Learning applied to network telemetry data –learn, understand and optimize
7/30/201812
Anomalies in link performance Method used:
ARIMA
Classifying flows across DOE sites Method: Gaussian
Mixture Models
Predicting traffic topologies across DOE sites Method: Markov Models
ESnet is built to handle science’s ‘big’ data whose traffic patterns differ dramatically from the Internet
13
Science ‘big’ data
VideoCloud Apps
Internet
Applications care about throughput, not bandwidth
Measured (TCP Reno) Measured (HTCP) Theoretical (TCP Reno) Measured (no loss)
14
. See Eli Dart, Lauren Rotman, Brian Tierney, Mary Hester, and Jason Zurawski. The Science DMZ: A Network Design Pattern for Data-
Intensive Science. In Proceedings of the IEEE/ACM Annual SuperComputing Conference (SC13), Denver CO, 2013.
Metro Area
Local(LAN)
Regional
Continental
International
20x performance gap due to 1 packet lost in 22000 packets (0.0046%)