Open source tools for optimizing
your peering infrastructure
@ DE-CIX TechMeeting 2018-06-06
by Daniel Czerwonk
• Software / Network Engineer at Mauve Mailorder Software
• Head of Network Freifunk Essen e.V.
• AS44821 (Mauve), AS206356 (Freifunk Essen e.V.),
AS202739 (routing-rocks)
• birdwatcher and bio-routing contributor
• Twitter: @dan_nrw
• Github: https://github.com/czerwonk
• LinkedIn: https://www.linkedin.com/in/czerwonk/
Who is this guy? About me…
Our journey starts late 2016
A new networking setup is about to
be built
But before that:
Let’s talk about monitoring…
• Very small operations team
• Freifunk Essen should demand even less ops effort
• Identify trends/anomalies early
• Capacity planning (beware of retention)
• Source for alerting
• Starting point for traffic engineering, etc.
• Basis for a post mortem (in case of an outage)
• Dashboard to give a quick overview when needed
Why is monitoring important for me?
So, let’s build a monitoring system…
• Prometheus to collect metrics
• Grafana to visualize metrics
• Alertmanager with Pushover integration for alerting
• Everything Ansible managed
What I wanted…
• Bird routing daemon
• JunOS running on a few EX series switches
• Host metrics from bare metal software router machines (statistics, resources)
• External network latencies (RIPE ATLAS, etc.)
What I wanted to scrape?
What I found…
In 2016…
Metric             Solution                                   Problem
bird               -                                          no exporter available
JunOS              snmp_exporter                              complex configuration, bad performance
Host metrics       node_exporter                              -
Network latencies  blackbox_exporter with external probe VMs  bad coverage, only one request per scrape
• Official Prometheus project
• On Linux hosts (e.g. Routers)
• Network interface metrics
• Resource consumption: CPU load, RAM usage, Disk space
• Interrupts / context switches
• License: Apache 2.0
• Source: https://github.com/prometheus/node_exporter
node_exporter
At least we got the host metrics covered.
And the rest?
I had to solve that…
So I started to write some
exporters…
• Performance is a key feature
• Need for concurrent processing
• Single binary / no dependencies
• Easy installation via go get …
• Existing client API for Prometheus
• Love writing code in golang in my spare time
Which programming language?
I chose golang:
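To illustrate the point about concurrent processing, here is a minimal sketch using only the Go standard library. It queries several targets in parallel with goroutines and renders the results as Prometheus text exposition lines. The names `probe`, `collect` and `demo_value` are hypothetical and not taken from any of the exporters; a real exporter would normally use the existing Prometheus client API instead of formatting the lines by hand.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// probe is a hypothetical stand-in for a slow query against one device.
type probe func(target string) float64

// collect runs the probe against every target concurrently and returns
// the results as Prometheus text exposition lines, sorted by target.
func collect(targets []string, p probe) string {
	results := make(map[string]float64, len(targets))
	var mu sync.Mutex
	var wg sync.WaitGroup

	for _, t := range targets {
		wg.Add(1)
		go func(target string) {
			defer wg.Done()
			v := p(target) // the slow part runs in parallel
			mu.Lock()
			results[target] = v
			mu.Unlock()
		}(t)
	}
	wg.Wait()

	// Sort the target names so the output order is stable.
	names := make([]string, 0, len(results))
	for n := range results {
		names = append(names, n)
	}
	sort.Strings(names)

	var b strings.Builder
	for _, n := range names {
		fmt.Fprintf(&b, "demo_value{target=%q} %g\n", n, results[n])
	}
	return b.String()
}

func main() {
	// The probe here just returns the length of the target name.
	out := collect([]string{"router1", "router2"}, func(t string) float64 {
		return float64(len(t))
	})
	fmt.Print(out)
}
```

Because every target gets its own goroutine, one slow device delays the scrape only by its own response time rather than by the sum of all of them.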
Milestones to an exporter suite (2016, 2017, 2018)
• bird_exporter: Bird 1.x, later support for bird 2.x
• atlas_exporter: RIPE ATLAS, RIPE Labs article
• junos_exporter: Juniper JunOS using SNMP, later replaced SNMP by SSH
• ping_exporter: ICMP probing
• mikrotik-exporter: RouterOS
• Started late 2016
• Communicates with bird via socket
• Bird 1.x and 2.x supported
• Protocols: BGP, OSPFv2, OSPFv3, Kernel, Static, Device, Direct
• License: MIT
• Source: https://github.com/czerwonk/bird_exporter
bird_exporter
bird_exporter
bird_protocol_prefix_import_count{proto=~"BGP|OSPFv3",ip_version="6"}
count(bird_protocol_up{proto="BGP"} == 1)
• BGP session state metrics
• BGP message counts (received, sent, withdrawn, etc.)
• Prefix counts for all supported protocols (imported, exported, filtered, etc.)
• OSPFv2/OSPFv3 neighbour counts
• Protocol uptime
bird_exporter - Features
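As a rough illustration of how session state can be derived from bird's protocol table, the sketch below parses text in the column layout of `show protocols` (name, proto, table, state, since, info). This is an assumption-laden simplification, not the parser used by bird_exporter: reading from the control socket and handling its reply codes is omitted, the sample text is made up, and the `since` column is assumed to be a single token. The label set on the printed metric is reduced for brevity.

```go
package main

import (
	"fmt"
	"strings"
)

// protocol holds the columns of one row of the protocol table.
type protocol struct {
	Name  string
	Proto string
	Table string
	Up    bool
	Info  string
}

// sample is a made-up excerpt in the assumed column layout.
const sample = `name     proto    table    state  since       info
kernel1  Kernel   master   up     2018-05-01
bgp1     BGP      master   up     2018-05-01  Established
bgp2     BGP      master   start  2018-05-01  Active`

// parseProtocols splits each line into whitespace-separated columns.
// The header line and lines with fewer than five columns are skipped.
func parseProtocols(output string) []protocol {
	var protos []protocol
	for _, line := range strings.Split(output, "\n") {
		f := strings.Fields(line)
		if len(f) < 5 || f[0] == "name" {
			continue
		}
		protos = append(protos, protocol{
			Name:  f[0],
			Proto: f[1],
			Table: f[2],
			Up:    f[3] == "up",
			Info:  strings.Join(f[5:], " "), // empty when there is no info column
		})
	}
	return protos
}

func main() {
	for _, p := range parseProtocols(sample) {
		up := 0
		if p.Up {
			up = 1
		}
		fmt.Printf("bird_protocol_up{name=%q,proto=%q} %d\n", p.Name, p.Proto, up)
	}
}
```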
• Started early 2018
• Replacement for RRD-based smokeping
• For ICMP, also a replacement for blackbox_exporter, which lacks loss
detection
• Based on go-ping by Digineo: https://github.com/digineo/go-ping
• License: MIT
• Source: https://github.com/czerwonk/ping_exporter
ping_exporter
ping_exporter
ping_rtt_mean_ms{ip_version="6"}
ping_loss_percent{ip_version="4"}
• Sends and aggregates multiple ICMP ECHO requests
• Roundtrip metrics (current, best, worst)
• Simple way to detect loss
• Supports multiple targets
• DNS refresh ensures the correct IP is measured when DNS is changed
• Only ICMP support at the moment
• Warning: ICMP is not user traffic so keep that in mind when trying to interpret these
metrics
ping_exporter - Features
• Started early 2017
• Metrics by requesting measurement results from RIPE ATLAS
• Useful to get an outside view from various other networks
• License: LGPL3 (since the binding used is under this license)
• Source: https://github.com/czerwonk/atlas_exporter
• More info:
https://labs.ripe.net/Members/daniel_czerwonk/using-ripe-atlas-measurement-results-in-prometheus-with-atlas_exporter
atlas_exporter
atlas_exporter
avg(atlas_ping_avg_latency{ip_version="4"}) by (asn)
avg(atlas_traceroute_hops{ip_version="4"}) by (asn)
• Ping (success, min/max/avg latency, dups, size)
• Traceroute (success, hop count, rtt)
• NTP (delay, derivation, ntp version)
• DNS (success, rtt)
• HTTP (return code, rtt, http version, header size, body size)
• SSL Certificates (alert, rtt)
atlas_exporter - Features
• Started late 2017
• snmp_exporter did not perform as required
• First implementation using a simple set of SNMP OIDs
• Early 2018: reimplementation using SSH and XML RPC representation
• Alternative to Juniper's OpenNTI since telemetry is only supported on newer
versions of JunOS and hardware
• License: MIT
• Source: https://github.com/czerwonk/junos_exporter
junos_exporter
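To give an idea of what working with the XML representation looks like, the sketch below decodes interface byte counters from an XML document with Go's `encoding/xml`. The sample document is a heavily simplified, hand-written assumption about the reply structure, and the type and function names are hypothetical; running the command over SSH is omitted, and this is not the code used in junos_exporter.

```go
package main

import (
	"encoding/xml"
	"fmt"
)

// rpcReply maps the parts of the (assumed) reply that are of interest.
type rpcReply struct {
	Interfaces []physicalInterface `xml:"interface-information>physical-interface"`
}

// physicalInterface holds the name and byte counters of one interface.
type physicalInterface struct {
	Name    string `xml:"name"`
	RxBytes uint64 `xml:"traffic-statistics>input-bytes"`
	TxBytes uint64 `xml:"traffic-statistics>output-bytes"`
}

// sample is a simplified, hand-written stand-in for a device reply.
const sample = `<rpc-reply>
  <interface-information>
    <physical-interface>
      <name>ge-0/0/0</name>
      <traffic-statistics>
        <input-bytes>1200</input-bytes>
        <output-bytes>3400</output-bytes>
      </traffic-statistics>
    </physical-interface>
  </interface-information>
</rpc-reply>`

// parseInterfaces decodes the XML document into interface records.
func parseInterfaces(data []byte) ([]physicalInterface, error) {
	var reply rpcReply
	if err := xml.Unmarshal(data, &reply); err != nil {
		return nil, err
	}
	return reply.Interfaces, nil
}

func main() {
	ifaces, err := parseInterfaces([]byte(sample))
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, i := range ifaces {
		fmt.Printf("%s rx=%d tx=%d\n", i.Name, i.RxBytes, i.TxBytes)
	}
}
```

Compared with walking individual SNMP OIDs, one structured reply per command yields all counters for all interfaces at once.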
• Interfaces (bytes transmitted/received, errors, drops)
• Routes (per table, by protocol)
• Alarms (count)
• BGP (message count, prefix counts per peer, session state)
• OSPFv2, OSPFv3 (number of neighbours)
• Interface diagnostics (optical signals)
• ISIS (number of adjacencies, total number of routers)
• Environment (temperatures)
• Routing engine statistics
junos_exporter - Features
• Contribution to existing project
• Only interface and resource metrics at this point
• Added several other features
• License: BSD3
• Source: https://github.com/nshttpd/mikrotik-exporter
mikrotik-exporter
• Interface metrics (RX bytes, TX bytes, drops, errors, etc.)
• BGP session states
• BGP message counts (updates, withdraws)
• DHCP leases
• DHCPv6 bindings
• Optical diagnostics
• IPv4/IPv6 pool counts
• System resources (memory, CPU load, etc.)
• Prefix counts per protocol (in RIB)
mikrotik-exporter - Features
Dashboard examples
How to combine several exporters?
Mauve Network Overview
Mauve Routing
Alerting
When and how?
How to alert?
What the SRE book has taught us:
https://landing.google.com/sre/book/chapters/monitoring-distributed-systems.html
How to alert? A few examples…
Port saturation:
Upstream session down:
Thank you for your attention.
Special thanks to everyone who contributed to my projects!