Jacopo Nardiello - Monitoring Cloud-Native Applications with Prometheus - Codemotion Milan 2017
TRANSCRIPT
Monitoring Cloud-Native applications with Prometheus
Jacopo Nardiello
CODEMOTION MILAN - SPECIAL EDITION 10 – 11 NOVEMBER 2017
Jacopo Nardiello
SIGHUP Founder & DevOps Engineer
@jnardiello
~ whoami
~ ./stuff_I_poke_around_with
- Linux
- Kubernetes (cluster lifecycles and workload scheduling in general)
- The Cloud™ (VMs and containers + other people's computers)
- golang
- More devops toys FTW! (CI/CD, Ansible, etc.)
What exactly is “Cloud-Native”?
Cloud-Native is NOT The Cloud™
At its root, Cloud Native is structuring teams, culture and technology to utilize automation and architectures to manage complexity and unlock velocity.
Joe Beda
There’s a Copernican revolution happening in infrastructure
A fundamental shift: from VM-based, mutable infrastructures to highly dynamic, immutable ones
The path to Cloud-Native Architectures
Why Containers
- A new infrastructural unit
- Atomic deployments
- Very small footprint, superfast scaling
Why Orchestrators
- Sandboxed environment
- Computers take over the scheduling
- Automatic healthchecks and self-healing
Cloud-Native is challenging
Cloud-Native monitoring with Prometheus
Overview: What is Prometheus?
Community-driven, open-source monitoring and alerting framework.
- Time series database for instrumentation, metrics collection, storage and querying
- Alerting entity
- Integrated tools for metrics exposure
Overview: A bit of context around Prometheus
Started in 2012 as a SoundCloud internal project
Second project to join CNCF after Kubernetes
Overview: Focus
Operational systems monitoring
Dynamic cloud environments
Core features
● Powerful no-SQL query language, PromQL
● Time series data model
● Optimized to be efficient
● Operational & architectural simplicity
Monitoring model: Pull
Prometheus pulls (scrapes) metrics from the /metrics endpoints exposed by its targets
Prometheus Architecture
The Architecture behind Prometheus
Prometheus core
- Service discovery and targets definition
- Metrics scraping
- Time series database
- Alerts and recording rules
- Alerting evaluation
- Metrics query
Alertmanager
- Alerting & silencing
- Dispatching notifications to different channels
Exporters & SDKs
Formatting metrics so they are exported in the format Prometheus expects
- Either exporters (Node, RabbitMQ, MySQL, etc.)
- Or SDKs to export application metrics
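What an SDK or exporter does can be sketched in a few lines of Python: keep counters in memory and render them in the text format on a /metrics endpoint. This is only an illustrative sketch using the standard library. The names render_counter, MetricsHandler, serve and REQUESTS are hypothetical, label values are not escaped, and a real application would more likely use an official client library.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-memory counter: label pairs -> current value.
REQUESTS = {(("handler", "/messages"), ("method", "POST")): 3}


def render_counter(name, help_text, samples):
    """Render one counter in the Prometheus text exposition format.

    `samples` maps a tuple of (label, value) pairs to the counter value.
    Simplification: label values are assumed to need no escaping.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in sorted(samples.items()):
        if labels:
            rendered = ",".join(f'{k}="{v}"' for k, v in labels)
            lines.append(f"{name}{{{rendered}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current counter values on GET /metrics."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_counter(
            "api_http_requests_total", "Total HTTP requests.", REQUESTS
        ).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Start the endpoint; Prometheus would then scrape http://host:port/metrics."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()
```

Calling serve() blocks and exposes the endpoint locally; the port is an arbitrary choice for the example.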
Prometheus Basics
Prom Server configuration
- CLI flags for the immutable daemon
- Config file defines scraping targets, instances and jobs
prometheus.yml

global:
  scrape_interval: 1m
  scrape_timeout: 30s
  external_labels:
    cluster: "test-cluster"

rule_files:
  - rules/rules.yml

# Scraping targets
scrape_configs:
  - job_name: 'some-service'
    static_configs:
      - targets:
          - <host> or <dns>
        labels:
          app: "some-service"
/metrics

# HELP hash_seconds Time taken to create hashes
# TYPE hash_seconds histogram
hash_seconds_bucket{code="200",le="1"} 2
hash_seconds_bucket{code="200",le="2.5"} 2
hash_seconds_bucket{code="200",le="5"} 2
hash_seconds_bucket{code="200",le="10"} 2
hash_seconds_bucket{code="200",le="+Inf"} 2
hash_seconds_sum{code="200"} 9.370800000000002e-05
hash_seconds_count{code="200"} 2
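To make the histogram output above easier to read, here is a small Python sketch that parses it and derives the average observation from the _sum and _count series. The parser (parse_samples and its two regular expressions) is a hypothetical, deliberately simplified one: it ignores escaping, timestamps and metric names containing colons.

```python
import re

# Simplified patterns: metric name, optional {labels}, then the value.
SAMPLE_RE = re.compile(r'^(\w+)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

EXPOSITION = """\
# HELP hash_seconds Time taken to create hashes
# TYPE hash_seconds histogram
hash_seconds_bucket{code="200",le="1"} 2
hash_seconds_bucket{code="200",le="2.5"} 2
hash_seconds_bucket{code="200",le="5"} 2
hash_seconds_bucket{code="200",le="10"} 2
hash_seconds_bucket{code="200",le="+Inf"} 2
hash_seconds_sum{code="200"} 9.370800000000002e-05
hash_seconds_count{code="200"} 2
"""


def parse_samples(text):
    """Map (metric name, sorted label pairs) to the sample value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = tuple(sorted(LABEL_RE.findall(raw_labels or "")))
        samples[(name, labels)] = float(value)
    return samples


samples = parse_samples(EXPOSITION)
code_200 = (("code", "200"),)
count = samples[("hash_seconds_count", code_200)]
total = samples[("hash_seconds_sum", code_200)]
average = total / count  # mean duration of the observed hashes, in seconds

# Buckets are cumulative: le="1" already holds every observation,
# so both hashes took at most one second.
fastest_bucket = samples[("hash_seconds_bucket", (("code", "200"), ("le", "1")))]
```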
Data model & querying
api_http_requests_total{method="POST", handler="/messages"}
- Label-based data model
- Each label and combination of labels is a dimension where we can filter and aggregate exported data
- Changing, adding or removing a label will create a new time series
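The last point can be illustrated with a toy in-memory store in Python, where a series is identified by its metric name plus its full label set. SeriesStore and series_key are hypothetical names for this sketch and say nothing about how the Prometheus storage engine is actually implemented.

```python
def series_key(name, labels):
    """A series is identified by the metric name and the complete label set."""
    return (name, frozenset(labels.items()))


class SeriesStore:
    """Toy store: one list of (timestamp, value) samples per series."""

    def __init__(self):
        self.series = {}

    def append(self, name, labels, timestamp, value):
        key = series_key(name, labels)
        self.series.setdefault(key, []).append((timestamp, value))


store = SeriesStore()
base = {"method": "POST", "handler": "/messages"}

# Same name and labels: both samples land in the same series.
store.append("api_http_requests_total", base, 1, 10.0)
store.append("api_http_requests_total", base, 2, 12.0)

# One extra label: this is a different, brand new series.
store.append("api_http_requests_total", {**base, "code": "200"}, 2, 12.0)

series_count = len(store.series)
```

This is also why labels with many distinct values (user IDs, for example) multiply the number of series to store.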
PromQL & label-based queries
http_requests_total
  returns all time series of the metric http_requests_total
http_requests_total{code="200",method="get"}
  returns the time series of http_requests_total for successful requests (code 200) with method get
http_requests_total{code="200",method="get"}[5m]
  returns a range vector: the samples of the last 5 minutes for each matching series
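The three selectors can be mimicked in Python over a tiny hand-made dataset, assuming equality matchers only. The functions select and range_vector and the sample data are hypothetical, and the window boundaries are a simplification of what Prometheus does.

```python
def select(series_list, name, **equals):
    """Return the series of metric `name` whose labels equal every given matcher."""
    return [
        s for s in series_list
        if s["labels"].get("__name__") == name
        and all(s["labels"].get(k) == v for k, v in equals.items())
    ]


def range_vector(selected, now, window):
    """Keep, per series, only the samples inside the window ending at `now`."""
    return [
        {"labels": s["labels"],
         "samples": [(t, v) for t, v in s["samples"] if now - window <= t <= now]}
        for s in selected
    ]


# Hypothetical data: timestamps in seconds, counter values as floats.
SERIES = [
    {"labels": {"__name__": "http_requests_total", "code": "200", "method": "get"},
     "samples": [(0, 1.0), (200, 5.0), (400, 9.0)]},
    {"labels": {"__name__": "http_requests_total", "code": "500", "method": "get"},
     "samples": [(0, 0.0), (400, 1.0)]},
]

# http_requests_total
all_series = select(SERIES, "http_requests_total")
# http_requests_total{code="200",method="get"}
ok_gets = select(SERIES, "http_requests_total", code="200", method="get")
# http_requests_total{code="200",method="get"}[5m], evaluated at t=400
last_5m = range_vector(ok_gets, now=400, window=300)
```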
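PromQL & label-based queries
http_requests_total{status!~"^4..$"}
  Selecting with regexes: !~ is a negative matcher, so this returns all time series whose status is not a 4xx code (use =~ to select only the 4xx ones)
sum(rate(http_requests_total[5m])) by (job)
  Applying functions: rate() computes the per-second rate of increase over a 5-minute range vector, and sum ... by (job) adds those rates up for each job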
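A rough Python sketch of what these two queries compute may help. The `!~` matcher keeps a series only when the whole label value does not match the regex, and the aggregation sums per-series rates grouped by a label. The helpers are hypothetical, simple_rate ignores counter resets and the extrapolation Prometheus applies, and Python's re module stands in for the RE2 syntax Prometheus uses.

```python
import re
from collections import defaultdict


def not_regex_match(value, pattern):
    """Semantics of !~ : true when the full label value does not match."""
    return re.fullmatch(pattern, value) is None


def simple_rate(samples):
    """Per-second increase between the first and last sample (no reset handling)."""
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(series_list, label):
    """Rough equivalent of sum(rate(...)) by (label)."""
    totals = defaultdict(float)
    for s in series_list:
        if len(s["samples"]) >= 2:  # a rate needs at least two samples
            totals[s["labels"].get(label, "")] += simple_rate(s["samples"])
    return dict(totals)


# Hypothetical 5-minute range vector (timestamps in seconds).
RANGE_VECTOR = [
    {"labels": {"job": "api", "status": "200"}, "samples": [(0, 0.0), (300, 300.0)]},
    {"labels": {"job": "api", "status": "404"}, "samples": [(0, 10.0), (300, 160.0)]},
    {"labels": {"job": "web", "status": "200"}, "samples": [(0, 0.0), (300, 600.0)]},
]

# http_requests_total{status!~"^4..$"}: the 404 series is dropped.
non_4xx = [s for s in RANGE_VECTOR if not_regex_match(s["labels"]["status"], "^4..$")]

# sum(rate(http_requests_total[5m])) by (job)
per_job = sum_rate_by(RANGE_VECTOR, "job")
```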
Prometheus web interface
Visualization
Plotting and graphing are outside the scope of Prometheus.
Use Grafana
Alerting
Rules
- Evaluated by the Prometheus server on a regular basis
- If a certain query matches a condition, the alert is triggered
ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }
Rule syntax valid until Prometheus 1.8
Starting from Prometheus v2, this syntax has been replaced by standard YAML (the structure stays the same)
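The FOR 5m clause means the condition must hold on every evaluation for five minutes before the alert actually fires; until then it is only pending. A minimal Python sketch of that state machine follows, with a hypothetical alert_states function and a hand-written evaluation timeline. It models only the semantics described here, not the Prometheus rule engine itself.

```python
def alert_states(evaluations, for_seconds):
    """Return the alert state after each evaluation.

    `evaluations` is a time-ordered list of (timestamp, condition_is_true).
    """
    states = []
    active_since = None
    for timestamp, condition in evaluations:
        if not condition:
            active_since = None  # condition cleared: the timer resets
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        if timestamp - active_since >= for_seconds:
            states.append("firing")
        else:
            states.append("pending")
    return states


# `up == 0` evaluated every 60 seconds; the instance briefly recovers at t=120.
timeline = [
    (0, True), (60, True), (120, False),
    (180, True), (240, True), (300, True),
    (360, True), (420, True), (480, True),
]
states = alert_states(timeline, for_seconds=300)
```

In this timeline the alert fires only at t=480, five minutes after the outage that began at t=180; the short blip before it never leaves the pending state.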
Alert Dispatching
The job of the Alertmanager is to dispatch alerts to the right channel according to their severity
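Conceptually, severity-based dispatching is a lookup from alert labels to a notification channel, with silences checked first. The Python sketch below uses hypothetical receiver names (pager, chat, email) and a hypothetical pick_receiver function. The real Alertmanager expresses this as a routing tree in its own configuration file, which is not reproduced here.

```python
# Hypothetical mapping from severity label to notification channel.
ROUTES = {"critical": "pager", "warning": "chat"}
DEFAULT_RECEIVER = "email"


def pick_receiver(alert_labels, silences=()):
    """Return the channel for an alert, or None if a silence matches it.

    Each silence is a dict of label=value pairs that must all match.
    """
    for silence in silences:
        if silence and all(alert_labels.get(k) == v for k, v in silence.items()):
            return None
    return ROUTES.get(alert_labels.get("severity"), DEFAULT_RECEIVER)


alert = {"alertname": "InstanceDown", "severity": "critical", "instance": "node-1"}
receiver = pick_receiver(alert)
silenced = pick_receiver(alert, silences=[{"instance": "node-1"}])
```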
Cloud-Native monitoring
Service discovery
Scraping statically defined targets is not very useful
kubernetes_sd_config
Native integration for Kubernetes environments
- Prometheus is aware of running in a Kubernetes cluster
- Automatically retrieves scraping targets such as nodes, pods and containers from the k8s API
More integrations (many more…)
- ec2_sd_config
- azure_sd_config
- openstack_sd_config
- gce_sd_config
- kubernetes_sd_config
- consul_sd_config
- dns_sd_config
- file_sd_config
- marathon_sd_config
- nerve_sd_config
- triton_sd_config
- static_config
Re-labeling
- Relabeling is a very powerful mechanism that allows us to further manipulate the labels coming from the targets.
- It’s a very effective way to take the targets discovered through an API and apply sophisticated targeting strategies (e.g. manipulating addresses or ports, filtering a subset of targets, etc.)
A quick configuration example:

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
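The keep action in this example can be sketched in Python: the values of the source labels are joined with a separator (";" by default), and a target survives only if the regex matches the whole joined string. The keep function and the pod data are hypothetical, and Python's re module stands in for the RE2 engine Prometheus uses.

```python
import re


def keep(targets, source_labels, regex, separator=";"):
    """Keep only the targets whose joined source label values fully match `regex`."""
    pattern = re.compile(regex)
    kept = []
    for labels in targets:
        # A missing label counts as an empty string.
        value = separator.join(labels.get(name, "") for name in source_labels)
        if pattern.fullmatch(value):
            kept.append(labels)
    return kept


SCRAPE = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"

# Hypothetical pods as returned by service discovery.
discovered = [
    {"__address__": "10.0.0.1:8080", SCRAPE: "true"},
    {"__address__": "10.0.0.2:8080", SCRAPE: "false"},
    {"__address__": "10.0.0.3:8080"},  # pod without the annotation
]

scraped = keep(discovered, [SCRAPE], "true")
```

Only the pod annotated with prometheus.io/scrape: "true" remains a scraping target, so opting in to monitoring becomes a matter of annotating the pod.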
Demo Time!