volta: logging, metrics, and monitoring as a service


Page 1: Volta: Logging, Metrics, and Monitoring as a Service

Volta: Logging, Metrics and Monitoring as a Service

LN Renganarayana

Technical Director / Architect

Cloud Platform Engineering

[email protected]

twitter: @lrengan

Jan 7, 2015

Volta / Cloud Platform Engineering, Symantec

Page 2: Volta: Logging, Metrics, and Monitoring as a Service

Outline

• Motivation: data and events are the foundation of business

• Why build a (new) Service?

• What have we built: a (near) real-time data analytics pipeline

• The journey and lessons learned

• Looking ahead: Volta next gen


Page 3: Volta: Logging, Metrics, and Monitoring as a Service

Data and events : the foundation

Picture: “Devops with S for sharing”, Patrick Debois

• which features to build?

• what is a good pricing model?

• how fast can I build?

• what is the perf of my code?

• how is the service?

• what is my capacity?

• what is my current usage?

Page 4: Volta: Logging, Metrics, and Monitoring as a Service

Why build a (new) service?


Page 5: Volta: Logging, Metrics, and Monitoring as a Service

Why build a service?


Picture: Jim Nisbet & Philip O’Toole AWS re:Invent 2013 Loggly presentation

Page 6: Volta: Logging, Metrics, and Monitoring as a Service

Single place for events across the stack

[Diagram: the stack that Volta collects events from]

• Bare Metal

• IaaS (OpenStack)

• Platform Services: BP, SP, KV, OBS

• Symantec Services & Apps

• Common Services: Volta, Identity Manager, CI / CD

Page 7: Volta: Logging, Metrics, and Monitoring as a Service

Volta : Design Goals

• Design for both Developers and Ops

– Make it extremely simple to capture events

– Provide powerful search and visualization tools

• Secure, multi-tenant: we are Symantec, so security comes first

• Scalable: elastically scale with load

• Highly available: Volta is the eyes & ears of Operations

• One system for logs, metrics, monitoring & other events

• Build with open source tools, and build for open sourcing

Page 8: Volta: Logging, Metrics, and Monitoring as a Service

What we have built ...

A (near) real-time data analytics pipeline


Page 9: Volta: Logging, Metrics, and Monitoring as a Service

Volta Client View

[Diagram: client view of Volta]

• Sources: App, Platform Services, Infrastructure

• Push metrics: the app writes its metrics directly

• Pull metrics: SNMP, Vars and JMX expose metrics

• Volta Shipper: ships VM logs

• Volta: receives metrics and log events; Alerts & Config UI

Push: StatsD, metrics extension for OpenStack

Pull: CollectD. Shipper: logstash, moving to Heka
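The push path can be illustrated with the plain StatsD line protocol (`<name>:<value>|<type>` sent over UDP). This is a minimal sketch, not Volta's client: the host, the port (8125 is only the usual StatsD default) and the function names are assumptions made for illustration.

```python
import socket


def statsd_datagram(name: str, value: float, metric_type: str) -> bytes:
    """Format one StatsD line: <name>:<value>|<type>."""
    # "c" = counter, "g" = gauge, "ms" = timer in the StatsD line protocol.
    if metric_type not in {"c", "g", "ms"}:
        raise ValueError(f"unsupported StatsD type: {metric_type}")
    return f"{name}:{value}|{metric_type}".encode("ascii")


def push_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port are placeholders."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_datagram(name, value, metric_type), (host, port))
```

An application would call something like `push_metric("myapp.requests", 1, "c")` on each request; because UDP is connectionless, a slow or missing collector does not block the caller.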

Page 10: Volta: Logging, Metrics, and Monitoring as a Service

Volta Under the Hood

[Architecture diagram]

• Client App / Service with log & metrics shipper: sends log, metric & alert events

• Kafka cluster: knode1, knode2, knode3, ... knodeN

• Storm cluster: Authentication, Validation, Alerts Processing; Quota & Policy

• Redis

• Alerts: email & callbacks

• Metrics Store: InfluxDB

• Log Store: ElasticSearch

• Load Balancer

• Front End Cluster (s1, s2, s3, s4 ... sn): Multi-tenancy and Kibana, Grafana Proxies

• Keystone

Page 11: Volta: Logging, Metrics, and Monitoring as a Service

The Ingest Pipeline

[Diagram: Client App / Service with log & metrics shipper, VIP, Kafka cluster (knode1, knode2, knode3, ... knodeN); log, metric & alert events]

• Kafka: replicated, fault-tolerant, persistent message queue

• LogTopic, MetricTopic, AlertTopic

• each topic is split into partitions

• per-topic retention policy
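How an event might be routed to a topic and a partition can be sketched as follows. The topic names come from the slide; everything else is an assumption for illustration: the `event_kind` values, the use of the tenant id as key, and the CRC32-modulo rule, which is not Kafka's own partitioner.

```python
import zlib
from typing import Tuple

# Topic names are from the slide; the keys "log", "metric", "alert" are hypothetical.
TOPICS = {"log": "LogTopic", "metric": "MetricTopic", "alert": "AlertTopic"}


def route(event_kind: str, key: str, num_partitions: int) -> Tuple[str, int]:
    """Pick the topic by event kind and a partition by a stable hash of the key."""
    topic = TOPICS[event_kind]
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    partition = zlib.crc32(key.encode("utf-8")) % num_partitions
    return topic, partition
```

Keying by tenant keeps one tenant's events in a single partition, which preserves their order but can create hot partitions for very busy tenants.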

Page 12: Volta: Logging, Metrics, and Monitoring as a Service

Event processing and storage

[Diagram: log, metric & alert events flow into the Storm cluster (Authentication, Validation, Alerts Processing; Quota & Policy), which uses Redis, sends alerts by email & callbacks, and writes to the Metrics Store (InfluxDB) and the Log Store (ElasticSearch)]

• alert rules

• [tenantid, apikey] pairs

• Per tenant per day index

• Index typed fields

• Quota and retention policy

• Tenant id prefixed time series names

• Continuous queries do rollups

• Retention policy through rollups
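The two naming rules above ("per tenant per day index" and "tenant id prefixed time series names") can be sketched like this. The exact formats, `<tenant>-YYYY.MM.DD` and `<tenant>.<metric>`, are assumptions; the slides do not give them.

```python
from datetime import datetime, timezone


def log_index_name(tenant_id: str, timestamp: str) -> str:
    """Index name for one tenant and one UTC day (format is hypothetical)."""
    # Parses timestamps shaped like the samples: 2014-08-06T19:17:43.000Z
    ts = datetime.strptime(timestamp, "%Y-%m-%dT%H:%M:%S.%fZ")
    ts = ts.replace(tzinfo=timezone.utc)
    return f"{tenant_id}-{ts:%Y.%m.%d}"


def tenant_series_name(tenant_id: str, metric_name: str) -> str:
    """Time series name prefixed with the tenant id (separator is hypothetical)."""
    return f"{tenant_id}.{metric_name}"
```

One index per tenant per day makes retention cheap: expiring old data means dropping whole indices rather than deleting individual documents.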

Page 13: Volta: Logging, Metrics, and Monitoring as a Service

Multi-tenancy Proxy & UI

[Diagram: Load Balancer in front of the Front End Cluster (s1, s2, s3, s4 ... sn): Multi-tenancy and Kibana, Grafana Proxies; Keystone; Redis; Metrics Store (InfluxDB); Log Store (ElasticSearch)]

• Intercepts and rewrites queries to ES and InfluxDB

• Enforces multi-tenancy (visibility of events to users)
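Query rewriting for multi-tenancy can be sketched as wrapping the user's query in a tenant filter before it reaches the store. This is an illustration, not Volta's proxy code: it assumes events carry a `tenant_id` field (as the metric samples do) and uses the `bool` / `filter` form of the Elasticsearch query DSL, which may differ from what the ElasticSearch version in use at the time required.

```python
def scope_to_tenants(user_query: dict, tenant_ids: list) -> dict:
    """Return a copy of the request that only matches the allowed tenants."""
    scoped = dict(user_query)  # keep other keys such as "size" or "sort"
    scoped["query"] = {
        "bool": {
            # The user's own query, or match_all when none was given.
            "must": [user_query.get("query", {"match_all": {}})],
            # The filter the user cannot override.
            "filter": [{"terms": {"tenant_id": sorted(tenant_ids)}}],
        }
    }
    return scoped
```

The list of allowed tenants would come from the authenticated user's tenant and group memberships, so that group members can see events across tenants.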

Page 14: Volta: Logging, Metrics, and Monitoring as a Service

Security and Multi-tenancy model

• Authentication with Keystone backed by LDAP

– user authentication for Query API and UI

• Multi tenancy with users and groups

– Events have tenant id and apikey

• Cross tenant correlation

– group membership used for cross-tenant event visibility / correlation

• Dashboard sharing


Page 15: Volta: Logging, Metrics, and Monitoring as a Service

Retention Policy : Log Events

• ElasticSearch allows powerful querying, but it comes at a cost

– Store only logs that help operate and troubleshoot the service

– Use appropriate debug levels (not INFO)

• Fixed quota: 350 GB or 500 GB

• When a tenant reaches its quota limit, Volta deletes 20% of the tenant's old logs to free up space

• With careful use of the quota, logs can be retained for many days

• Volta can retain logs longer for special tenants who need to store them for compliance / audit
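The quota rule can be sketched as "once the quota is reached, drop the oldest daily indices until about 20% of the stored volume is freed". The whole-index granularity and the function below are assumptions; the slide does not say how the 20% is selected.

```python
def indices_to_delete(index_sizes: dict, quota_bytes: int, percent: int = 20) -> list:
    """Pick the oldest daily indices to drop once a tenant hits its quota.

    index_sizes maps an ISO date string (YYYY-MM-DD) to the index size in bytes.
    """
    used = sum(index_sizes.values())
    if used < quota_bytes:
        return []  # still under quota, nothing to delete
    freed, victims = 0, []
    for day in sorted(index_sizes):  # ISO dates sort chronologically
        if freed * 100 >= used * percent:  # integer math avoids float rounding
            break
        victims.append(day)
        freed += index_sizes[day]
    return victims
```

Because whole indices are dropped, the amount actually freed is at least the target and can overshoot it when a single day holds a lot of data.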

Page 16: Volta: Logging, Metrics, and Monitoring as a Service

Metric Events: Retention Policy and Rollups

Naming scheme & retention policies

Naming scheme:

host + “.” + name + “.” + type_if_avail + “.” + retention_period

Retention periods: 1 day, 1 week, 1 month, and 3 months

Names for the example:

● default 1 day: lmm-dev-bastion.memory.used_

● 1 week: lmm-dev-bastion.memory.used_1w

● 1 month: lmm-dev-bastion.memory.used_1m

● 3 months: lmm-dev-bastion.memory.used_3m

Rollup precision:

● default 1 day: user defined (highest)

● 1 week: metrics aggregated to 1 minute

● 1 month: metrics aggregated to 5 minutes

● 3 months: metrics aggregated to 1 hour

Sample for metric from collectd:

{
  "@version": "1",
  "@timestamp": "2014-08-06T19:17:43.000Z",
  "host": "lmm-dev-bastion",
  "name": "memory",
  "collectd_type": "memory",
  "type_instance": "used",
  "value": 341884928,
  "tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
  "apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}
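The naming and rollup rules on this slide can be sketched as below. Two things are assumptions: the series name follows the slide's examples, which put an underscore before the retention suffix (the scheme line shows a dot), and the rollup averages each bucket, since the slide only says "aggregated".

```python
from collections import defaultdict
from statistics import mean

# Suffixes and bucket widths as listed on the slide.
RETENTION_SUFFIX = {"1 day": "", "1 week": "1w", "1 month": "1m", "3 months": "3m"}
ROLLUP_SECONDS = {"1 week": 60, "1 month": 300, "3 months": 3600}


def rollup_series_name(host, name, type_instance, retention):
    """Build a name such as lmm-dev-bastion.memory.used_1w."""
    base = ".".join(part for part in (host, name, type_instance) if part)
    return f"{base}_{RETENTION_SUFFIX[retention]}"


def rollup(points, bucket_seconds):
    """Average (epoch_seconds, value) points per fixed-width time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

For instance, `rollup(points, ROLLUP_SECONDS["1 week"])` would produce the 1-minute series stored under the `_1w` name.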

Page 17: Volta: Logging, Metrics, and Monitoring as a Service

Alerts : Email and Callbacks

• Alerts can be set using the Alert UI or the REST API

• Alerts can be sent by email or posted to a webhook (REST endpoint)

• Webhooks provide a good mechanism for integration with external automation and UIs

• Alerts on Log events

– User specifies an alert template using regular expressions to match

– Can match one or more fields from a Log event

– Simple and complex expressions

• Alerts on Metric events

– User specifies an alert template using comparison operators

– Can match one or more fields from the Metric event

– Simple and complex expressions
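The two kinds of alert templates can be sketched as a regular-expression match over log fields and a comparison over a metric field. The template shapes and function names here are hypothetical; Volta's actual alert format is not shown in the slides.

```python
import operator
import re

# Comparison operators an alert template might allow (assumed set).
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}


def log_alert_matches(event: dict, patterns: dict) -> bool:
    """True when every listed field matches its regular expression."""
    return all(
        re.search(regex, str(event.get(field, ""))) is not None
        for field, regex in patterns.items()
    )


def metric_alert_matches(event: dict, field: str, op: str, threshold) -> bool:
    """True when the metric field satisfies the comparison."""
    return OPS[op](event[field], threshold)
```

A log rule such as `{"severity": "^ERROR$", "message": "disk failure"}` matches on two fields at once, while a metric rule like `("value", ">", 300_000_000)` is a single cheap comparison per event.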

Page 18: Volta: Logging, Metrics, and Monitoring as a Service

Current deployment

• Multiple deployments : on bare KVM nodes, on OpenStack VMs

– On KVM nodes: 40+ VMs, 80+ TB storage, many large memory nodes

– Components are deployed in clustered mode for HA

– Some with active/active replication, some with active/passive

• Use by Platform and Infrastructure Services

– Tens of thousands of events per second (we have seen around 160 K events/sec)

– Hundreds of GBs of data collected and indexed per day

– Queries currently come from Kibana and Grafana; in the future, from APIs

Page 19: Volta: Logging, Metrics, and Monitoring as a Service

The Journey and Lessons ...


Page 20: Volta: Logging, Metrics, and Monitoring as a Service

Log, metrics and alerts

• log events

– insist on good severity levels

– enforce quotas to induce behavior change

– watch out for large messages (zip lines from stdout/stderr)

• metric events

– keep users aware of rollups (granularity)

• alerts

– watch out for overly simple rules: they cause alert floods

– watch out for complex regexes: they drain performance and memory

– encourage metrics-based alerts: this is what scales

Page 21: Volta: Logging, Metrics, and Monitoring as a Service

Kafka, ES and Storm

• Kafka

– retention policy vs storage space: do the math with ingest & processing rate

– if you are not using auto-rebalance of leaders, keep an eye on the leaders

• Storm

– smaller topologies: easy to update and optimize

– match the number of Kafka spouts to the consumer parallelism (number of partitions)

– tune the number of executor threads for optimal performance

• ElasticSearch

– aggregate your writes

– heap size <= 32 GB, turn off swap

– benefits hugely from high IOPS: use SSDs if you can
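"Do the math" for Kafka retention can be made concrete with a back-of-the-envelope formula: disk needed = ingest rate × average event size × retention time × replication factor. The 500-byte event size, 24-hour retention and replication factor of 3 below are illustrative assumptions; only the 160 K events/sec figure comes from the deployment slide.

```python
def kafka_storage_bytes(events_per_sec, avg_event_bytes, retention_hours,
                        replication_factor):
    """Raw disk needed to hold a topic for its retention window."""
    seconds = retention_hours * 3600
    return events_per_sec * avg_event_bytes * seconds * replication_factor


# 160 K events/sec at an assumed 500 bytes each, kept 24 h, replicated 3x.
needed = kafka_storage_bytes(160_000, 500, 24, 3)
print(needed)  # 20736000000000 bytes, about 20.7 TB
```

The estimate ignores compression and index overhead, so it is a starting point for sizing rather than a precise figure.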

Page 22: Volta: Logging, Metrics, and Monitoring as a Service

Using Open Source Software : Joy and Frustrations

• Be ready for constant upgrades

– for bug fixes

– to get cool new features: Grafana, Kibana

– for stability, cool stats and visualization: Storm

• InfluxDB clustering is still maturing

– temporary HA solution (write to 2+ influxDBs)

– waiting for 0.9 release with better clustering


Page 23: Volta: Logging, Metrics, and Monitoring as a Service

Eat your own Dog Food

• Volta was a cobbler’s child for a while …

– did not use any system to aggregate logs and metrics!

• Now we are using Volta to collect its logs and metrics

– send logs and metrics from one Volta instance to another

– sending to the same instance is an interesting one!

• Important metrics:

– ingest rate, Storm processing rate, ES / InfluxDB write latency

– end to end latency of events


Page 24: Volta: Logging, Metrics, and Monitoring as a Service

Synthetic Transactions and Tracking SLAs

• Goal: track Service level metrics

– availability to users / business

– latency for operations to users

• Use Synthetic Transactions that exercise a sequence of APIs

– measure success / failure rates

– measure end to end latency

– collect, trend and alert on these
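A synthetic transaction can be sketched as a small runner that executes a sequence of API calls, stops at the first failure, and reports success plus end-to-end latency. The step callables are placeholders for real API calls; nothing here is Volta's own tooling.

```python
import time


def run_synthetic_transaction(steps):
    """Run (name, callable) steps in order; stop at the first failure."""
    start = time.monotonic()
    failed_step = None
    for name, step in steps:
        try:
            step()
        except Exception:
            failed_step = name  # remember where the transaction broke
            break
    return {
        "success": failed_step is None,
        "failed_step": failed_step,
        "latency_seconds": time.monotonic() - start,
    }
```

Running this on a schedule and shipping each result as a metric event gives the success rate and latency series that can then be trended and alerted on.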


Page 25: Volta: Logging, Metrics, and Monitoring as a Service

Deployment & Ops : automate, automate, automate …

• Volta is a collection of services

– use separate repos, deploy small changes

• Lots of configuration parameters: manage consistency

– performance is very sensitive to the values

– e.g., heap size, number of workers, etc.

• Performance benchmarking

– needs to be done for each environment

• CI and deployment pipeline

Page 26: Volta: Logging, Metrics, and Monitoring as a Service

Volta next gen


Page 27: Volta: Logging, Metrics, and Monitoring as a Service

Volta Next Gen

• Open source Volta

• Refactor Storm

– Split into separate metric and log topologies, and batch writes

• Move ES and InfluxDB to higher iops storage (SSDs?)

• Multi-DC support via stream duplication

• Archival into Swift / HDFS

• Anomaly detection using CEP / Storm

• HTTP REST API in front of Kafka

• Deployment automation using OpenStack Murano


Page 28: Volta: Logging, Metrics, and Monitoring as a Service

Thank you!

Questions, Comments, Suggestions?


We are interested in Open Sourcing & Collaborating on Volta.

Interested?

And we are hiring. Interested? [email protected]

twitter: @lrengan

Page 29: Volta: Logging, Metrics, and Monitoring as a Service

Backup Slides


Page 30: Volta: Logging, Metrics, and Monitoring as a Service

LMM Metrics Data Model

Mandatory fields:

● name : name of the metric. LMM uses this to store the metrics, and you will use it in queries: select "value" from "load"

● value : value of the metric at a given time

● @timestamp : time stamp

● host : host name or any other id

● tenant_id : tenant id (keystone)

● apikey : LMM apikey

Sample for metric from collectd:

{
  "@version": "1",
  "@timestamp": "2014-07-30T00:16:59.000Z",
  "name": "cpu",
  "host": "demo.symcpe.net",
  "plugin_instance": "0",
  "collectd_type": "cpu",
  "type_instance": "interrupt",
  "value": 0,
  "tenant_id": "db5ca8e4c8514fad9f98dbc4d648ee87",
  "apikey": "26d85ae3-1e10-4ce4-837a-7a1c8dfc67fb"
}

Collectd : name of the plugin becomes the name of the metric. E.g.: cpu or memory

StatsD : the user's metric name concatenated with the metric type by a dot. E.g.: myapp.counter or myapp.gauge

Reserved fields: time, sequence_number

Special field: type_instance
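A client-side check against this data model might look like the sketch below. The field lists come from the slide; treating reserved fields as "must not be set by the sender" is an assumption, and the function is hypothetical rather than part of LMM.

```python
MANDATORY_FIELDS = ("name", "value", "@timestamp", "host", "tenant_id", "apikey")
RESERVED_FIELDS = ("time", "sequence_number")


def validate_metric(event: dict) -> list:
    """Return a list of problems; an empty list means the event looks valid."""
    problems = [f"missing field: {f}" for f in MANDATORY_FIELDS if f not in event]
    # Assumption: reserved names are set by the store, not by the sender.
    problems += [f"reserved field used: {f}" for f in RESERVED_FIELDS if f in event]
    return problems
```

Validating before shipping keeps malformed events out of the pipeline, where they would otherwise fail later during validation in Storm.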