jacopo nardiello - monitoring cloud-native applications with prometheus - codemotion milan 2017

Monitoring Cloud-Native applications with Prometheus

Jacopo Nardiello

CODEMOTION MILAN - SPECIAL EDITION 10 – 11 NOVEMBER 2017

Jacopo Nardiello
SIGHUP Founder & DevOps Engineer
@jnardiello

~ whoami

~ ./stuff_I_poke_around_with

- Linux
- Kubernetes (cluster lifecycles and workload scheduling in general)
- The Cloud™ (VMs and Containers + other people's computers)
- golang
- More devops toys FTW! (CI/CDs, Ansible, etc.)

What is exactly “Cloud-Native”?

Cloud-Native is NOT The Cloud™

At its root, Cloud Native is structuring teams, culture and technology to utilize automation and architectures to manage complexity and unlock velocity.

Joe Beda

There’s a Copernican revolution happening in infrastructure

A fundamental shift: from VM-based, mutable infrastructures to highly dynamic, immutable ones

The path to Cloud-Native Architectures

Why Containers

- A new infrastructural unit
- Atomic deployments
- Very small footprint, superfast scaling

Why Orchestrators

- Sandboxed environment
- Computers take over the scheduling
- Automatic healthchecks and self-healing

Cloud-Native is challenging

Cloud-Native monitoring with Prometheus

Overview: What is Prometheus?

Community-driven, open-source monitoring and alerting framework.

- Time series database for instrumentation, metrics collection, storage and querying
- Alerting entity
- Integrated tools for metrics exposure

Overview: A bit of context around Prometheus

Started in 2012 as a SoundCloud internal project

Second project to join CNCF after Kubernetes

Overview: Focus

Operational systems monitoring

Dynamic cloud environments

Core features

● Powerful no-sql query language, PromQL
● Time series data model
● Optimized to be efficient
● Operational & architectural simplicity

Monitoring model: Pull

Prometheus pulls (scrapes) metrics from the /metrics endpoints exposed by its targets
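To make the pull model concrete, here is a minimal sketch of a single scrape pass. It is an illustration, not Prometheus code: the names `scrape_once` and `http_fetch` and the target addresses are hypothetical. The only Prometheus behaviour it mirrors is that the server initiates the HTTP request and records `up` as 1 or 0 depending on whether the scrape succeeded.

```python
from urllib.request import urlopen


def http_fetch(url, timeout=30):
    """Fetch a URL and return the body as text (default fetcher)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def scrape_once(targets, fetch=http_fetch):
    """Pull /metrics from every target; the monitored service never pushes."""
    results = {}
    for target in targets:
        url = f"http://{target}/metrics"
        try:
            results[target] = {"up": 1, "body": fetch(url)}
        except OSError:
            # Connection errors and timeouts mark the target as down.
            results[target] = {"up": 0, "body": ""}
    return results


# Usage with a fake fetcher, so the sketch runs without any network access.
def fake_fetch(url):
    if "down" in url:
        raise ConnectionError("connection refused")
    return "demo_metric 1\n"


res = scrape_once(["svc-ok:8080", "svc-down:8080"], fetch=fake_fetch)
print({target: r["up"] for target, r in res.items()})
```

Passing the fetcher as a parameter keeps the loop testable; a real scraper would also honour the configured scrape interval and timeout.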

Prometheus Architecture

The Architecture behind Prometheus

Prometheus core

- Service discovery and targets definition
- Metrics scraping
- Time series database
- Alerts and Recording rules
- Alerting evaluation
- Metrics query

Alertmanager

- Alerting & silencing
- Dispatching notifications to different channels

Exporters & SDKs

Formatting metrics to be exported in the expected Prometheus format

- Either exporters (Node, RabbitMQ, MySQL, etc.)
- Or SDKs to export application metrics
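As a rough illustration of what an exporter or SDK does on the way out, the sketch below renders one counter in the Prometheus text format. The helper names (`render_counter`, `escape_label_value`), the help text and the sample value are made up for the example; a real application would normally rely on an official client library instead.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: dict) -> str:
    """Render one counter family; `samples` maps tuples of label pairs to values."""
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples.items():
        label_str = ",".join(f'{k}="{escape_label_value(v)}"' for k, v in labels)
        selector = f"{{{label_str}}}" if label_str else ""
        lines.append(f"{name}{selector} {value}")
    return "\n".join(lines) + "\n"


text = render_counter(
    "api_http_requests_total",
    "Total HTTP requests handled.",  # hypothetical help text
    {(("method", "POST"), ("handler", "/messages")): 3},
)
print(text)
```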

Prometheus Basics

Prom Server configuration

- CLI flags for the immutable daemon

- Config file defines scraping targets, instances and jobs

prometheus.yml

global:
  scrape_interval: 1m
  scrape_timeout: 30s
  external_labels:
    cluster: "test-cluster"

rule_files:
  - rules/rules.yml

# Scraping targets
scrape_configs:
  - job_name: 'some-service'
    static_configs:
      - targets: ['<host> or <dns>']
        labels:
          app: "some-service"

/metrics

# HELP hash_seconds Time taken to create hashes
# TYPE hash_seconds histogram
hash_seconds_bucket{code="200",le="1"} 2
hash_seconds_bucket{code="200",le="2.5"} 2
hash_seconds_bucket{code="200",le="5"} 2
hash_seconds_bucket{code="200",le="10"} 2
hash_seconds_bucket{code="200",le="+Inf"} 2
hash_seconds_sum{code="200"} 9.370800000000002e-05
hash_seconds_count{code="200"} 2
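Histogram buckets are cumulative: each `le` bucket counts every observation less than or equal to its bound, which is why all buckets above read 2 (both hashes took far less than a second). The sketch below rebuilds that layout from raw observations. The function name and the observation values are hypothetical; only the metric name and the bucket bounds come from the output above.

```python
import math


def histogram_samples(name, bounds, observations):
    """Return (sample_name, value) pairs for one histogram without extra labels."""
    samples = []
    for le in [*bounds, math.inf]:
        # Cumulative count: every observation <= this upper bound.
        count = sum(1 for v in observations if v <= le)
        le_str = "+Inf" if math.isinf(le) else f"{le:g}"
        samples.append((f'{name}_bucket{{le="{le_str}"}}', count))
    samples.append((f"{name}_sum", sum(observations)))
    samples.append((f"{name}_count", len(observations)))
    return samples


observations = [0.5, 2.0, 4.0]  # hypothetical durations, in seconds
samples = histogram_samples("hash_seconds", [1, 2.5, 5, 10], observations)
for sample_name, value in samples:
    print(sample_name, value)
```

Dividing `_sum` by `_count` gives the average observed duration.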

Data model & querying

api_http_requests_total{method="POST", handler="/messages"}

- Label-based data model
- Each label and combination of labels is a dimension where we can filter and aggregate exported data
- Changing, adding or removing a label will create a new time series
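A small sketch of why a label change creates a new time series: if a series is identified by its metric name plus its full label set, any different label set is a different key. The in-memory `series` store and the `record` helper are hypothetical and say nothing about how Prometheus stores data internally.

```python
from collections import defaultdict

# Series identity = (metric name, set of label pairs) -> list of (timestamp, value).
series = defaultdict(list)


def record(name, labels, timestamp, value):
    """Append a sample to the series identified by name + label set."""
    key = (name, frozenset(labels.items()))
    series[key].append((timestamp, value))


base = {"method": "POST", "handler": "/messages"}
record("api_http_requests_total", base, 1, 10)
record("api_http_requests_total", base, 2, 12)  # same labels: same series
# Adding a label (here a hypothetical "code") yields a different series.
record("api_http_requests_total", {**base, "code": "200"}, 2, 5)

print(len(series))  # 2
```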

PromQL & Label based queries

http_requests_total
All time series related to the metric http_requests_total

http_requests_total{code="200",method="get"}
Time series related to successful requests with method get for the metric http_requests_total

http_requests_total{code="200",method="get"}[5m]
Returns a range vector (the last 5 minutes of samples for each matching series)
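Label selection can be sketched as a filter over label sets. PromQL has four matcher operators (`=`, `!=`, `=~`, `!~`), and its regexes are fully anchored, which `re.fullmatch` approximates here. Python's regex dialect is not identical to the one Prometheus uses, and the `matches` helper and the sample series are hypothetical.

```python
import re


def matches(labels, matchers):
    """Return True if a label set satisfies every (label, op, value) matcher."""
    for name, op, value in matchers:
        actual = labels.get(name, "")  # a missing label behaves like an empty value
        if op == "=":
            ok = actual == value
        elif op == "!=":
            ok = actual != value
        elif op == "=~":
            ok = re.fullmatch(value, actual) is not None
        elif op == "!~":
            ok = re.fullmatch(value, actual) is None
        else:
            raise ValueError(f"unknown operator: {op}")
        if not ok:
            return False
    return True


all_series = [
    {"__name__": "http_requests_total", "code": "200", "method": "get"},
    {"__name__": "http_requests_total", "code": "404", "method": "get"},
    {"__name__": "http_requests_total", "code": "200", "method": "post"},
]

selected = [s for s in all_series if matches(s, [("code", "=", "200"), ("method", "=", "get")])]
not_4xx = [s for s in all_series if matches(s, [("code", "!~", "4..")])]
print(len(selected), len(not_4xx))  # 1 2
```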

PromQL & Label based queries

http_requests_total{status!~"^4..$"}
Selecting all time series whose status is not a 4xx code, using a negative regex match

sum(rate(http_requests_total[5m])) by (job)
Applying functions: here we take the per-second rate over a 5m range vector, then sum the results grouped by job
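The shape of `sum(rate(...)) by (job)` can be sketched in two steps: a per-series rate, then a grouped sum. This is a deliberate simplification, since the real `rate()` also handles counter resets and extrapolates to the window boundaries. The helper names and sample data are hypothetical.

```python
from collections import defaultdict


def simple_rate(samples):
    """Per-second increase between the first and last sample of the window.

    Simplified: no counter-reset handling, no extrapolation.
    """
    (t0, v0), (t1, v1) = samples[0], samples[-1]
    return (v1 - v0) / (t1 - t0)


def sum_rate_by(range_vector, label):
    """Sum the per-series rates, grouped by the value of one label."""
    totals = defaultdict(float)
    for labels, samples in range_vector:
        totals[labels[label]] += simple_rate(samples)
    return dict(totals)


# Each entry: (labels, [(timestamp_seconds, counter_value), ...]) over a 5m window.
range_vector = [
    ({"job": "api", "instance": "a"}, [(0, 0), (300, 600)]),
    ({"job": "api", "instance": "b"}, [(0, 100), (300, 400)]),
    ({"job": "web", "instance": "c"}, [(0, 0), (300, 150)]),
]

result = sum_rate_by(range_vector, "job")
print(result)  # {'api': 3.0, 'web': 0.5}
```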

Prometheus web interface

Visualization

Plotting and graphing are outside the scope of Prometheus.

Use Grafana

Alerting

Rules
- Evaluated by the Prometheus server on a regular basis
- If a certain query matches a condition, the alert is triggered

ALERT InstanceDown
  IF up == 0
  FOR 5m
  LABELS { severity = "critical" }
  ANNOTATIONS {
    summary = "Instance {{ $labels.instance }} down",
    description = "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.",
  }

Until Prometheus 1.8

This syntax changed to standard YAML starting with Prometheus v2 (the structure stays the same)
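The `FOR 5m` clause means the condition must hold continuously across evaluations before the alert fires; until then it is only pending. Below is a minimal sketch of that state machine, assuming one evaluation per minute. The `alert_states` helper and the evaluation data are hypothetical; the state names inactive, pending and firing are the ones Prometheus uses.

```python
def alert_states(evaluations, for_seconds):
    """Map (timestamp, condition_true) evaluations to alert states."""
    states = []
    active_since = None
    for ts, condition in evaluations:
        if not condition:
            # Any evaluation where the condition is false resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = ts
        held = ts - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical evaluations of `up == 0`, one per minute (timestamps in seconds).
evaluations = [
    (0, False), (60, True), (120, True), (180, False),
    (240, True), (300, True), (360, True), (420, True), (480, True), (540, True),
]
states = alert_states(evaluations, for_seconds=300)
print(states)
```

The short outage at 60s to 120s never fires because the target recovers at 180s; the second outage fires only once it has lasted the full 5 minutes.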

Alert Dispatching

The job of the Alertmanager is to dispatch alerts to the right channel according to their severity
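Dispatching by severity boils down to matching alert labels against an ordered list of routes. The sketch below is only an approximation: Alertmanager actually evaluates a routing tree with grouping and inhibition, and the receiver names (`pager`, `chat`, `email`) and the `route` helper are hypothetical.

```python
def route(alert, routes, default_receiver):
    """Return the receiver of the first route whose labels all match the alert."""
    for match_labels, receiver in routes:
        if all(alert["labels"].get(k) == v for k, v in match_labels.items()):
            return receiver
    return default_receiver


routes = [
    ({"severity": "critical"}, "pager"),
    ({"severity": "warning"}, "chat"),
]

critical = {"labels": {"alertname": "InstanceDown", "severity": "critical"}}
unlabeled = {"labels": {"alertname": "SomethingOdd"}}

print(route(critical, routes, "email"))   # pager
print(route(unlabeled, routes, "email"))  # email
```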

Cloud-Native monitoring

Service discovery

Scraping statically defined targets is not very useful

kubernetes_sd_config

Native integration for Kubernetes environments

- Prometheus is aware of running in a Kubernetes cluster
- Automatically retrieves scraping targets such as nodes, pods and containers from the k8s API

More integrations (many more…)

- ec2_sd_config
- azure_sd_config
- openstack_sd_config
- gce_sd_config
- kubernetes_sd_config
- consul_sd_config
- dns_sd_config
- file_sd_config
- marathon_sd_config
- nerve_sd_config
- triton_sd_config
- static_config

Re-labeling

- Relabeling is a very powerful mechanism that allows us to further manipulate the labels of discovered targets.
- It’s a very effective way to take the targets returned by an API and apply sophisticated targeting strategies (e.g. manipulating addresses or ports, filtering a subset of targets, etc.).

A quick configuration example:

- job_name: 'kubernetes-pods'
  kubernetes_sd_configs:
    - role: pod
  relabel_configs:
    - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
      action: keep
      regex: true
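What the `keep` action does to the discovered pods can be sketched as a filter: join the values of the source labels (Prometheus uses `;` as the default separator), then keep only targets where the anchored regex matches. The `keep` helper and the pod addresses are hypothetical; only the meta label name comes from the configuration above.

```python
import re

ANNOTATION = "__meta_kubernetes_pod_annotation_prometheus_io_scrape"


def keep(targets, source_labels, regex, separator=";"):
    """Keep only targets whose joined source label values fully match the regex."""
    kept = []
    for labels in targets:
        value = separator.join(labels.get(name, "") for name in source_labels)
        if re.fullmatch(regex, value):
            kept.append(labels)
    return kept


discovered = [
    {"__address__": "10.0.0.1:8080", ANNOTATION: "true"},
    {"__address__": "10.0.0.2:8080", ANNOTATION: "false"},
    {"__address__": "10.0.0.3:8080"},  # pod without the annotation
]

scraped = keep(discovered, [ANNOTATION], "true")
print([t["__address__"] for t in scraped])  # ['10.0.0.1:8080']
```

Only pods explicitly annotated for scraping survive, so the annotation works as an opt-in switch.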

Demo Time!

Thank you. Questions?

We are [email protected]

@jnardiello
