Clickhouse as long term storage for metrics, events, logs from K8s

Posted on 25-May-2020

Page 1

Clickhouse as long term storage for metrics, events, logs from K8s

Platform Team

Page 2

About us

● Platform architecture
● Operations and maintenance
● Troubleshooting
● Deployments

* and "dancing with tambourines" (coaxing stubborn systems into working) is about us too :)

Page 3

www.exness.com

About

Page 4

Agenda

● Introduction
● Long time ago :)
● 1st implementation (Rancher)
● 2nd implementation (K8s)
● Questions

Page 5

About: what we have now

● 2 datacenters
● 500+ services
● 2500+ containers
● Metrics: 10K rps, up to 200K+ rps
● Logs: 2K rps, up to 100K+ rps

Page 6

Clickhouse in production now

● 3 clusters

● 10+ servers

● 200+ cores, 1 TB RAM, 20+ TB SSD

Introduction

Page 7

Clickhouse in production now

● Easy to replicate

● Easy to shard

● Easy to use

● Easy to manage

● Nice support

● K8S Operator :)

Introduction

Page 8

Questions for “full” Clickhouse

A lot of new versions ???
Access rights so simple
ZooKeeper only ?
Clouds ?
UI access ? (Tabix, Superset)

1. ZooKeeper replacement => etcd
2. Cloud message brokers: Kinesis, Pub/Sub
3. Authorization: LDAP, SSO
4. Prometheus metrics exporter
5. GraphQL interface out of the box
6. Clickhouse as Prometheus long-term storage
7. Auto retention policy
8. Detach/Drop/Freeze parts (not only a partition as a whole)

Page 11

Logs

Page 12

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 20+ services

Page 13

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 1 instance, max 20K rps
● 20+ services

Page 14

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 14 days only
● 1 instance, max 20K rps
● 20+ services

Page 15

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 14 days only
● 1 instance, max 20K rps
● 200+ services ???

Page 16

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 14 days only
● 1 instance, max 20K rps
● 200+ services ???
● Throttling ?
● Maintenance ?
● Long term solution for 1 year

Page 17

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 14 days only
● 1 instance, max 20K rps
● 2000+ services ???
● Throttling ?
● Maintenance ?
● Long term solution for 1 year

Page 18

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper)]

● 14 days only
● 1 instance, max 200K rps ???
● 2000+ services ???
● Throttling ?
● Maintenance ?
● Long term solution for 1 year

Page 19

Long time ago :)

[Diagram: Service #1 and Service #2 on a host send logs to Graylog + Elastic and metrics to Graphite (Whisper); the whole setup is marked with a "?"]

● 14 days only
● 1 instance, max 200K rps ???
● 2000+ services ???
● Throttling ?
● Maintenance ?
● Long term solution for 1 year

Page 20

Disadvantages

● UDP: Throttling ?
● Rate: 100K+ rps, how ?
● Elastic: huge load and a lot of resources
● Graylog: Java, fixed load on each node
● Retention policy: 1 year or more

Page 21

1st implementation (Rancher)

Page 22

1st implementation (Rancher)

[Diagram. Components: Host with Service #1, Service #2 and Syslog-ng (logs, metrics); Kafka; Fluentd; Graphite Web + Graphouse with Clickhouse (short & long term); Graylog; Elastic]

Page 23

1st implementation (disadvantages)

● Lack of throttling on service side

● No support for tags (Graphite, Grafana, Graphouse)

● Custom message format (hard to understand)

● Syslog-ng (single point of failure on host, module is written in Java)

● Fluentd pitfalls (offsets, unsupported groups, configuration)

● Graphite Web is too slow (issue with long term queries)

● Graphouse is fast (but still written in Java, consumes memory)

Next trend: K8s...

Page 24

2nd implementation (K8s)

[Diagram. Components: Pod #1 and Pod #2, each with a Service and a Collector; Kafka; Consumer; Clickhouse (long term); Prometheus (short term); Graylog; Elastic. Flow labels: "logs, metrics, events"; "new logs, metrics, events"; "old logs"; "all"; "old"; "new"]

Page 25

Transformation by Consumer

[Diagram: Collector message → Kafka topics → Consumer → Clickhouse * tables]

● Metrics topic → Metrics table
● Logs topic → Logs table
● Events topic → Events table
● Tags topic → Tags table

* K8s namespace-related database with its own metrics, logs, events (ReplicatedMergeTree) and tags (ReplicatedReplacingMergeTree)
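A rough sketch of what the footnote's layout could look like, assuming one database per K8s namespace. It only builds DDL text. The column lists, the ZooKeeper path layout, the `updated` version column and the `build_namespace_ddl` helper are illustrative assumptions, not taken from the slides; only the two engine names come from the footnote.

```python
from typing import List

# Assumed minimal column sets; the real tables are not shown in the slides.
COLUMNS = {
    "metrics": "ts DateTime, name String, value Float64, tags_id UInt64",
    "logs": "ts DateTime, level String, message String, tags_id UInt64",
    "events": "ts DateTime, kind String, payload String, tags_id UInt64",
    "tags": "tags_id UInt64, tags String, updated DateTime",
}


def build_namespace_ddl(namespace: str) -> List[str]:
    """Return CREATE statements for one namespace-scoped database (hypothetical helper)."""
    # '-' is common in namespace names but not valid in an unquoted identifier.
    db = namespace.replace("-", "_")
    statements = [f"CREATE DATABASE IF NOT EXISTS {db}"]
    for table, columns in COLUMNS.items():
        if table == "tags":
            # Replacing engine: rows with the same sorting key collapse on merge,
            # keeping the row with the highest 'updated' value.
            engine, version_arg, order_by = "ReplicatedReplacingMergeTree", ", updated", "tags_id"
        else:
            engine, version_arg, order_by = "ReplicatedMergeTree", "", "ts"
        # {shard} and {replica} are left as literal macros for the server to expand.
        zk_path = f"/clickhouse/tables/{{shard}}/{db}/{table}"
        statements.append(
            f"CREATE TABLE IF NOT EXISTS {db}.{table} ({columns}) "
            f"ENGINE = {engine}('{zk_path}', '{{replica}}'{version_arg}) "
            f"ORDER BY {order_by}"
        )
    return statements
```

Calling `build_namespace_ddl("team-a")` yields one `CREATE DATABASE` statement followed by four `CREATE TABLE` statements, which could then be sent through any Clickhouse client.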

Table-supported JSON
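The Consumer's transformation step might be sketched as follows. This is not the team's Go implementation: the `namespace` field, the lowercase topic names and the `build_inserts` helper are assumptions, and it reads "table-supported JSON" as rows delivered in Clickhouse's `JSONEachRow` input format.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One table per topic, as on the slide; the lowercase names are assumed.
TOPIC_TO_TABLE = {
    "metrics": "metrics",
    "logs": "logs",
    "events": "events",
    "tags": "tags",
}


def build_inserts(messages: Iterable[Tuple[str, bytes]]) -> Dict[str, str]:
    """Group (topic, raw JSON payload) pairs into one INSERT per target table.

    Each payload is assumed to be a JSON object whose 'namespace' field
    selects the Clickhouse database.
    """
    rows: Dict[str, List[str]] = defaultdict(list)
    for topic, payload in messages:
        table = TOPIC_TO_TABLE.get(topic)
        if table is None:
            continue  # unknown topic: skipped in this sketch
        record = json.loads(payload)
        database = record.pop("namespace").replace("-", "_")
        # JSONEachRow expects one JSON object per line.
        rows[f"{database}.{table}"].append(json.dumps(record, sort_keys=True))
    return {
        target: f"INSERT INTO {target} FORMAT JSONEachRow\n" + "\n".join(lines)
        for target, lines in rows.items()
    }
```

Batching per table rather than inserting row by row matches how MergeTree tables are usually fed: fewer, larger inserts create fewer parts to merge.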

Page 26

2nd implementation (advantages)

● No single point of failure (collector on board)

● Fast and robust (Collector & Consumer are written in Go)

● Throttling on service side

● Standard message format & version support (Telegraf)

● Tags support across logs, metrics and events

● Graylog + Elastic for logs, Prometheus for metrics (short term)

● Clickhouse for long term (metrics, logs, events)
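The slides do not say how throttling on the service side is done. One common way to cap a sender's message rate is a token bucket, sketched here purely as an illustration; the `TokenBucket` class and its parameters are hypothetical and not taken from the Collector.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow roughly `rate` messages per second, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock          # injectable so the bucket can be tested
        self.tokens = float(burst)  # start with a full bucket
        self.last = clock()

    def allow(self) -> bool:
        """Return True if one message may be sent now, False if it should be dropped or delayed."""
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

A sender would call `allow()` before each message and drop, sample or buffer the message when it returns `False`, so a noisy service cannot flood the pipeline.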

Page 27

Thank you! Questions?
