Network Management
Richard Mortier
Microsoft Research, Cambridge
(Guest lecture, Digital Communications II)
Overview
Introduction Abstractions IP network components IP network management protocols Pulling it all together An alternative approach
Overview
Introduction What’s it all about then?
Abstractions IP network components IP network management protocols Pulling it all together An alternative approach
What is network management? One point-of-view: a large field full of acronyms
EMS, TMN, NE, CMIP, CMISE, OSS, AN.1, TL1, EML, FCAPS, ITU, ...
(Don’t ask me what all of those mean, I don’t care!) From question.com:
In 1989, a random of the journalistic persuasion asked hacker Paul Boutin “What do you think will be the biggest problem in computing in the 90s?” Paul's straight-faced response: “There are only 17,000 three-letter acronyms.” (To be exact, there are 26^3 = 17,576.)
Will ignore most of them
What is network management? Computer networks are considered to have
three operating timescales Data: packet forwarding [ μs, ms ] Control: flows/connections [ secs, mins ] Management: aggregates, networks [ hours,days ]
…so we’re concerned with “the network” rather than particular devices
Standardization is key!
Overview
Introduction Abstractions
ISO FCAPS, TMN EMS, ATM IP network components IP network management protocols Pulling it all together An alternative approach
ISO FCAPS: functional separation Fault
Recognize, isolate, correct, log faults Configuration
Collect, store, track configurations Accounting
Collect statistics, bill users, enforce quotas Performance
Monitor trends, set thresholds, trigger alarms Security
Identify, secure, manage risks
TMN EMS: administrative separation Telecommunications Management Network Element Management System
“...simple but elegant...” (!) (my emphasis)
NEL: network elements (switches, transmission systems) EML: element management (devices, links) NML: network management (capacity, congestion) SML: service management (SLAs, time-to-market) BML: business management (RoI, market share, blah)
The B-ISDN reference model
Asynchronous Transfer Mode “cube” See IAP lectures, maybe
Plane management… The whole network
…vs layer management Specific layers
Topology Configuration Fault Operations Accounting Performance
management plane
user planecontrol plane
higher layers
ATM layer
physical layer
plane m
anagem
ent
higher layers
ATM adaptation layer
layer managem
ent
Network management
Models of general communication networks Tend to be quite abstract and exceedingly tedious! Many practitioners still seem excited about OO
programming, WIMP interfaces, etc …probably because implementation is hard due to so
many excessively long and complex standards! My view: basic “need-to-know” requirements are
1. What should be happening? [ c ]
2. What is happening? [ f, p, a ]
3. What shouldn’t be happening? [ f, s ]
4. What will be happening? [ p, a ]
Network management
We’ll concentrate on IP networks Still acronym city: ICMP, SNMP, MIB, RFC Sample size: 102 routers, 105 hosts
We’ll concentrate on the network core Routers, not hosts
We’ll ignore “service management” DNS, AD, file stores, etc
Overview
Introduction Abstractions IP network components
IP primer, router configuration IP network management protocols Pulling it all together An alternative approach
IP primer (you probably know all this) Destination-routed packets – no connections
Time-to-live field: allow removal of looping packets Routers forward packets based on routeing tables
Tables populated by routeing protocols Routers and protocols operate independently
…although protocols aim to build consistent state
RFCs ~= standards Often much looser semantics than e.g. ISO, ITU standards Compare for example OSPF [RFC2327] and IS-IS
[RFC1142, RFC1195], two link-state routeing protocols
So, how do you build an IP network?1. Buy (lease) routers
2. Buy (lease) fibre
3. Connect them all together
4. Configure routers appropriately
5. Configure end-systems appropriately
Assume you’ve done 1–3 and someone else is doing 5…
Router configuration
Initialization Name the router, setup boot options, setup authentication options
Configure interfaces Loopback, ethernet, fibre, ATM Subnet/mask, filters, static routes Shutdown (or not), queueing options, full/half duplex
Configure routeing protocols (OSPF, BGP, IS-IS, …) Process number, addresses to accept routes from, networks to advertise
Access lists, filters, ... Numeric id, permit/deny, subnet/mask, protocol, port
Route-maps, matching routes rather than data traffic Other configuration aspects: traps, syslog, etc
Router configuration fragmentshostname FOOBAR!boot system flash slot0:a-boot-image.binboot system flash bootflash:logging buffered 100000 debugginglogging console informationalaaa new-model aaa authentication login default tacacs local aaa authentication login consoleport none aaa authentication ppp default if-needed tacacs aaa authorization network tacacs !ip tftp source-interface Loopback0no ip domain-lookupip name-server 10.34.56.78!ip multicast-routingip dvmrp route-limit 7000ip cef distributed
interface Loopback0 description router-1.network.corp.com ip address 10.65.21.43 255.255.255.255!interface FastEthernet0/0/0 description Link to New York ip address 10.65.43.21 255.255.255.128 ip access-group 175 in ip helper-address 10.65.12.34 ip pim sparse-mode ip cgmp ip dvmrp accept-filter 98 neighbor-list 99 full-duplex!interface FastEthernet4/0/0 no ip address ip access-group 183 in ip pim sparse-mode ip cgmp shutdown full-duplex
router ospf 2 log-adjacency-changes passive-interface FastEthernet0/0/0 passive-interface FastEthernet0/1/0 passive-interface FastEthernet1/0/0 passive-interface FastEthernet1/1/0 passive-interface FastEthernet2/0/0 passive-interface FastEthernet2/1/0 passive-interface FastEthernet3/0/0 network 10.65.23.45 0.0.0.255 area 1.0.0.0 network 10.65.34.56 0.0.0.255 area 1.0.0.0 network 10.65.43.0 0.0.0.127 area 1.0.0.0
access-list 24 remark Mcast ACLaccess-list 24 permit 239.255.255.254access-list 24 permit 224.0.1.111access-list 24 permit 239.192.0.0 0.3.255.255access-list 24 permit 232.192.0.0 0.3.255.255access-list 24 permit 224.0.0.0 0.0.0.255access-list 1011 deny 0000.0000.0000 ffff.ffff.ffff ffff.ffff.ffff 0000.0000.0000 0xD1 2 eq 0x42access-list 1011 permit 0000.0000.0000 ffff.ffff.ffff 0000.0000.0000 ffff.ffff.ffff
tftp-server slot1:some-other-image.bintacacs-server host 10.65.0.2tacacs-server key xxxxxxxxrmon event 1 trap Trap1 description "CPU Utilization>75%" owner configrmon event 2 trap Trap2 description "CPU Utilization>95%" owner config
Lots of quite large and fragile text files 00s/000s routers, 00s/000s lines per config Errors are hard to find and have non-obvious results Router configuration also editable on-line
How to keep track of them all? Naming schemes, directory hierarchies, CVS ssh upload and atomic commit to router Perhaps even a database
State of the art is pretty basic Few tools to check consistency Generally generate configurations from templates and have
human-intensive process to control access to running configs Topic of current research [Feamster et al]
Router configuration
this counts as quite advanced!
Overview
Introduction Abstractions IP network components IP network management protocols
ICMP, SNMP, Netflow Pulling it all together An alternative approach
ICMP
Internet Control Message Protocol [RFC792] IP protocol #1 In-band “control”
Variety of message types echo/echo reply [ PING (packet internet groper) ] time exceeded [ TRACEROUTE ] destination unreachable, redirect source quench
Ping (Packet INternet Groper) Test for liveness
…also used to measure (round-trip) latency Send ICMP echo Valid IP host [RFC1122, RFC1123] must reply with
ICMP echo response
Subnet PING? Useful but often not available/deprecated “ACK” implosion could be a problem RFCs ~= standards
Traceroute
Which route do my packets take to their destination? Send UDP packets with increasing time-to-live values Compliant IP host must respond with ICMP “time exceeded” Triggers each host along path to so respond
Not quite that simple One router, many IP addresses: which source address?
Router control processor, inbound or outbound interface Routes often asymmetric, so return path != outbound path Routes change
Do we want full-mesh host-host routes anyway?! Size of data set, amount of probe traffic This is topology, what about load on links?
SNMP
Protocol to manage information tables at devices Provides get, set, trap, notify operations
get, set: read, write values trap: signal a condition (e.g. threshold exceeded) notify: reliable trap
Complexity mostly in the MIB design Some standard tables, but many vendor specific Non-critical, so often tables populated incorrectly Many tens of MIBs (thousands of lines) per device Different versions, different data, different semantics
Yet another configuration tracking problem Inter-relationships between MIBs
IPFIX
IETF working group Export of flow based data out of IP network devices Developing suitable protocol based on Cisco NetFlow™ v9 [RFC3954, RFC3955]
Statistics reporting Setup template Send data records matching template
Many variables Packet/flow counters, rule matches, quite flexible
Overview
Introduction Abstractions IP network components IP network management protocols Pulling it all together
Network mapping, statistics gathering, control An alternative approach
An hypothetical NMS
GUI around ICMP (ping, traceroute), SNMP, etc Recursive host discovery
Broadcast ping, ARP, default gateway: start somewhere Recursively SNMP query for known hosts/connected networks Ping known hosts to test liveness Iterate
Display topology: allow “drill-down” to particular devices Configure and monitor known devices
Trap, Netflow™, syslog message destinations Counter thresholds, CPU utilization threshold, fault reporting Particular faults or fault patterns Interface statistics and graphs
An hypothetical NMS
All very straightforward? No, not really A lot of software engineering: corner cases, traceroute
interpretation, NATs, etc MIBs may contain rubbish Can only view inside your network anyway
Efficiency Rate pacing discovery traffic: ping implosion/explosion SNMP overloading router CPUs
Tunnelled, encrypted protocols becoming prevalent Using NMSs also not straightforward
How to setup “correct” thresholds? How to decide when something “bad” has happened? How to present (or even interpret) reams and reams of data?
Overview
Introduction Abstractions IP network components IP network management protocols Pulling it all together An alternative approach
From the edges…
Edge-based network management platform Collect flow information from hosts, and Combine with topology information from routeing protocols
Enable visualization, analysis, simulation, control
Avoid problems of not-quite-standard interfaces Management support is typically ‘non-critical’ (i.e. buggy )
and not extensively tested for inter-operability Do the work where resources are plentiful
Hosts have lots of cycles and little traffic (relatively) Protocol visibility: see into tunnels, IPSec, etc
ENMA
System outline
Control
Packets
Flows
Routeingprotocol
Topology
VisualizeSimulate
Simulator
Distributeddatabase
Traffic matrix Set of routes
srcs
dsts
routes
Pictures of current topology and traffic Routes+flows+forwarding rules BIG PICTURE
In fact, where did my traffic go yesterday? Keep historical data for capacity planning, etc
A platform for anomaly detection Historical data suggests “normality,” live
monitoring allows anomalies to be detected
Where is my traffic going today?
Where might my traffic go tomorrow? Plug into a simulator back-end
Discrete event simulator, flow allocation solver Run multiple ‘what-if’ scenarios
…failures …reconfigurations …technology deployments
E.g. “What happens if we coalesce all the Exchange servers in one data-centre?”
Where should my traffic be going? Close the loop: compute link weights to
implement policy goals Recompute on order of hours/days
Allows more dynamic policies Modify network configuration to track e.g. time of
day load changes Make network more efficient (~cheaper)?
Where are we now?
Three major components Flow collection Route collection Distributed database
Building prototypes, simulating system
Data collection
Flow collection Hosts track active flows
Using low overhead event posting infrastructure, ETW Built prototype device driver provider & user-space consumer
Used packet traces for feasibility study on (client, server) Peaks at (165, 5667) live and (39, 567) active flows per sec
Route collection OSPF is link-state: passively collect link state adverts Extension of my work at Sprint (for IS-IS and BGP); also
been done at AT&T (NSDI’04 paper)
The distributed database
Logically contains 1. Traffic flow matrix (bandwidths), {srcs} × {dsts}2. …each entry annotated with current route from src to dst
N.B. src/dst might be e.g. (IP end-point, application) Large dynamic data set suggests aggregation
Related work { distributed, continuous query, temporal } databases Sensor networks
Potential starting points: Astrolabe or SDIMS (SIGCOMM’04) Where/what/how much to aggregate?
Is data read- or write-dominated? Which is more dynamic, flow or topology data? Can the system successfully self-tune?
The distributed database
Construct traffic matrix from flow monitoring Hosts can supply flows they source and sink Only need a subset of this data to get complete traffic matrix
Construct topology from route collection OSPF supplies topology → routes
Wish to be able to answer queries like “Who are the top-10 traffic generators?”
Easy to aggregate, don’t care about topology “What is the load on link l ?”
Can aggregate from hosts, but need to know routes “What happens if we remove links {l…m} ?”
Interaction between traffic matrix, topology, even flow control
The distributed database
Building simulation model OSPF data gives topology, event list, routes Simple load model to start with (load ~ # subnets) Precedence matrix (from SPF) reduces flow-data query set
Can we do as well/better than e.g. NetFlow? Accuracy/coverage trade-off
How should we distribute the DB? Just OSPF data? Just flow data? A mixture?
How many levels of aggregation? How many nodes do queries touch?
What sort of API is suitable? Example queries for sample applications
Summary
Introduction What is network management?
Abstractions ISO FCAPS, TMN EMS, ATM
IP network components IP, routers, configurations
IP network management protocols ICMP, SNMP, etc
Pulling it all together Outline of a network management system
An alternative approach: from the edges
The end
Questions Answers?
http://www.cisco.com/ http://www.routergod.com/ http://www.ietf.org/ http://ipmon.sprintlabs.com/pyrt/ http://www.nanog.org/
Internet routeing
Q: how to get a packet from node to destination?
A1: advertise all reachable destinations and apply a consistent cost function (distance vector)
A2: learn network topology and compute consistent shortest paths (link state) Each node (1) discovers and advertises adjacencies;
(2) builds link state database; (3) computes shortest paths A1, A2: Forward to next-hop using longest-prefix-
match
OSPF (~link state routeing)
Q: how to route given packet from any node to destination?
A: learn network topology; compute shortest paths
For each node Discover adjacencies (~immediate neighbours); advertise Build link state database (~network topology) Compute shortest paths to all destination prefixes Forward to next-hop using longest-prefix-match (~most
specific route)
BGP (~path vector routeing)
Q: how to route given packet from any node to destination? A: neighbours tell you destinations they can reach; pick cheapest
option
For each node Receive (destination, cost, next-hop) for all destinations known to
neighbour Longest-prefix-match among next-hops for given destination Advertise selected (destination, cost+, next-hop') for all known
destinations Selection process is complicated Routes can be modified/hidden at all three stages
General mechanism for application of policy