OpenTSDB for monitoring @ Criteo
TRANSCRIPT
Nathaniel Braun
Thursday, April 28th, 2016
2 | Copyright © 2016 Criteo
• Overview of Hadoop @ Criteo
• Our experimental cluster
• Rationale for OpenTSDB
• Stabilizing & scaling OpenTSDB
• OpenTSDB to the rescue in practice
Hitchhiker's guide to this presentation
Overview of Hadoop @ Criteo
Overview of Hadoop @ Criteo
Tokyo TY5 – PROD AS
Sunnyvale SV6 – PROD NA
Hong Kong HK5 – PROD CN
Paris PA4 – PROD / PREPROD
Paris PA3 – PREPROD / EXP
Amsterdam AM5 – PROD
Criteo’s 8 Hadoop clusters – running CDH Community Edition
AM5: main production cluster
• In use since 2011
• Running CDH3 initially, CDH4 currently
• 1118 DataNodes
• 13 400+ compute cores
• 39 PB of raw disk storage
• 105 TB of RAM capacity
• 40 TB of data imported every day, mostly through HTTPFS
• 100 000+ jobs run daily
Overview of Hadoop @ Criteo – Production AM5
PA4: comparable to AM5, with fewer machines
• Migration done in Q4 2015 – H1 2016
• Running CDH5
• 650+ DataNodes
• 15 600+ compute cores
• 54 PB of raw disk storage
• 143 TB of RAM capacity
• Huawei servers (AM5 is HP-based)
Overview of Hadoop @ Criteo – Production PA4
Criteo has 3 local production Hadoop clusters
• Sunnyvale (SV6): 20 nodes
• Tokyo (TY5): 35 nodes
• Hong Kong (HK5): 20 nodes
Overview of Hadoop @ Criteo – Production local clusters
Criteo has 3 preproduction Hadoop clusters
• Preprod PA3: 54 nodes, running CDH4
• Preprod PA4: 42 nodes, running CDH5
• Experimental: 53 nodes, running CDH5
Overview of Hadoop @ Criteo – Preproduction clusters
Overview of Hadoop @ Criteo – Usage
Types of jobs running on our clusters
• Cascading jobs, mostly for joins between different types of logs (e.g. displays & clicks)
• Pure Map/Reduce jobs for recommendation, Hadoop streaming jobs for learning
• Scalding jobs for analytics
• Hive queries for Business Intelligence
• Spark jobs on CDH5
Overview of Hadoop @ Criteo – Special considerations
• Kerberos for security
• High-availability on NameNodes and ResourceManager (CDH5 only)
• Infrastructure installed & maintained with Chef
Overview of Hadoop @ Criteo
How can we monitor this complex infrastructure and the services running on top of it?
Our experimental cluster
• Useful for testing infrastructure changes without impacting users (no SLA)
• Test environment for new technologies
• HBase
o Natural joins
o OpenTSDB for metrology & monitoring
o hRaven for detailed job data (not used anymore)
• Spark, now in production @ PA4
Our experimental cluster – Purpose
• Based on Google BigTable paper
• Integrated with the Hadoop stack
• Stores data in rows sorted by row key
• Uses regions as an ordered set of rows
• Regions sharded by row key bounds
• Regions managed by Region servers, collocated with DataNodes (data is stored on HDFS)
• Oversized regions are split into two regions
• Values are stored in columns, with no fixed schema (unlike an RDBMS)
• Columns grouped in column families
Our experimental cluster – HBase features
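The region mechanics above (rows sorted by key, regions as key ranges, oversized regions splitting in two) can be sketched as follows. This is a hypothetical illustration in Ruby, not HBase code; the key ranges and rows are made up:

```ruby
# Hypothetical sketch of the region mechanics described above (not HBase code).
# A region covers a half-open row-key range [start_key, end_key) and holds
# rows sorted by key; a nil end_key means "to the end of the keyspace".
Region = Struct.new(:start_key, :end_key, :rows) do
  def covers?(row_key)
    row_key >= start_key && (end_key.nil? || row_key < end_key)
  end

  # An oversized region is split into two regions at a middle key.
  def split
    keys = rows.keys.sort
    mid = keys[keys.size / 2]
    left  = Region.new(start_key, mid, rows.select { |k, _| k < mid })
    right = Region.new(mid, end_key, rows.select { |k, _| k >= mid })
    [left, right]
  end
end

# Locating the region for a row key is just a lookup over key bounds.
def region_for(regions, row_key)
  regions.find { |r| r.covers?(row_key) }
end
```

In real HBase the key-range-to-region mapping lives in a meta table and region servers host the regions; the sketch only shows the row-key-bounds idea.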
Our experimental cluster – HBase architecture

Row key (user UID) | CF0: user                       | CF1: event
                   | C0: IP | C2: browser | C3: e-mail    | C0: time | C1: type | C2: web site
AAA                | value  | Firefox     | NULL          |          | Click    | Client #0
BBB                | value  | Chrome      | NULL          |          | Click    | Client #0
CCC                | value  | Chrome      | [email protected] |          | Display  | Client #1
DDD                | value  | IE          | NULL          |          | Sales    | Client #2
EEE                | value  | IE          | NULL          |          | Display  | Client #0
FFF                | value  | IE          | NULL          |          | Display  | Client #3
∙∙∙                | ∙∙∙    | ∙∙∙         | ∙∙∙           | ∙∙∙      | ∙∙∙      | ∙∙∙
XXX                | value  | Firefox     | NULL          |          | Sales    | Client #4
YYY                | value  | Chrome      | NULL          |          | Bid      | Client #5
ZZZ                | value  | Opera       | [email protected]  |          | Click    | Client #5

Diagram annotations: contiguous row-key ranges form regions (R0, R1, …, R5); region servers RS1 and RS2 each host several regions, with the data itself stored on HDFS.
HBase on the experimental cluster
• 50 region servers
• 44 000+ regions
• ~90 000 requests / second from OpenTSDB
Our experimental cluster – HBase @ Criteo
Rationale for OpenTSDB
Metrics to monitor:
• CPU load
• Processes & threads
• RAM available/reserved
• Free/used disk space
• Network statistics
• Sockets open/closed
• Open connections with their statuses
• Network traffic
Rationale for using OpenTSDB – Infrastructure monitoring
Rationale for using OpenTSDB – Service monitoring
YARN: NodeManagers, ResourceManagers
HDFS: DataNodes, NameNodes, JournalNodes
ZooKeeper, Kerberos
HBase
Kafka, Storm
Huge diversity of services!
• Diversity
• Many types of nodes & services
• Must be simple to extend with new metrics
• Scale
• > 2 500 servers
• ~ 90 000 requests / second
• Storage
• Keep fine-grained resolution (down to the minute, at least)
• Long-term storage for analysis & investigation
Rationale for using OpenTSDB – Scale
• Suits the problem well: “Hadoop for monitoring Hadoop”
• Designed for time series: HBase schema optimized for time series queries
• Scalable and resilient, thanks to HBase
• Easily extensible: writing a data collector is easy
• Simple to query
Rationale for using OpenTSDB – Solution
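To illustrate "writing a data collector is easy": a tcollector collector is just a script that prints one data point per line to stdout, as `metric timestamp value tag1=v1 ...`, and tcollector forwards those lines to a TSD. A minimal sketch; the hardcoded load value is illustrative (a real collector would read it from /proc/loadavg):

```ruby
# Minimal tcollector-style collector sketch: emit one data point per line,
# "metric timestamp value tag1=v1 ...", on stdout.
require 'socket'

def emit(metric, value, tags = {})
  tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  line = "#{metric} #{Time.now.to_i} #{value} #{tag_str}".rstrip
  puts line
  line
end

# Example data point; the value is hardcoded so the sketch runs anywhere
# (a real collector would read /proc/loadavg on Linux).
emit('proc.loadavg.15min', 15, 'host' => Socket.gethostname)
```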
Rationale for using OpenTSDB – Easy to query
require 'net/http'
require 'json'

uri = URI.parse("http://0.rtsd.hpc.criteo.preprod:4242/api/query")
http = Net::HTTP.start(uri.hostname, uri.port)
http.read_timeout = 300

params = {
  'start' => '2016/04/21-10:00:00',
  'end'   => '2016/04/21-12:00:00',
  'queries' => [{
    'aggregator' => 'min',
    'downsample' => '5m-min',
    'metric'     => 'hadoop.resourcemanager.queuemetrics.root.AllocatedMB',
    'tags'       => {
      'cluster' => 'ams',
      'host'    => 'rm.hpc.criteo.prod'
    }
  }]
}

request = Net::HTTP::Post.new(uri.path, 'Content-Type' => 'application/json')
request.body = params.to_json
response = http.request(request)
Rationale for using OpenTSDB – Practical UI
UI screenshot, annotated with its inputs: metric, time range, tag keys/values, aggregator
• OpenTSDB consists of Time Series Daemons (TSDs) and tcollectors
• Some TSDs are used for writing, others for reading, while tcollectors collect metrics
• TSDs are stateless
• TSDs use asyncHBase to scale
• Quiz: what are the advantages?
Rationale for using OpenTSDB – Design
1. Clients never interact with HBase directly
2. Simple protocol → easy to use & extend
3. No state, no synchronization → great scalability
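On the "simple protocol" point: write TSDs accept plain-text `put` lines over a TCP connection, so any client can emit points in a few lines of code. A sketch; the endpoint in the comment is a placeholder, not a real Criteo address:

```ruby
# Sketch of OpenTSDB's telnet-style write protocol: a client connects to any
# write TSD (they are stateless) and sends plain-text "put" lines.
require 'socket'

def put_line(metric, timestamp, value, tags)
  tag_str = tags.map { |k, v| "#{k}=#{v}" }.join(' ')
  "put #{metric} #{timestamp} #{value} #{tag_str}"
end

def send_point(host, port, metric, value, tags)
  TCPSocket.open(host, port) do |sock|
    sock.puts put_line(metric, Time.now.to_i, value, tags)
  end
end

# Placeholder endpoint for illustration only:
# send_point('tsd.example.com', 4242, 'proc.loadavg.15min', 15, 'host' => 'nn0')
```

Because TSDs keep no state, the sending side can pick any write TSD without coordination, which is exactly advantage 3 above.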
• Metrics consist of:
• metric name
• UNIX timestamp
• value (64-bit integer or single-precision floating-point value)
• tags (key-value pairs) specific to that metric instance
• Tags are useful for aggregations on time series
proc.loadavg.15min 1461781436 15 host=0.namenode.hpc.criteo.prod
• Charts: 15-minute load average with the count aggregator (a proxy for machine count)
• Quiz: what is the chart below?
Rationale for using OpenTSDB – Metrics
Chart labels: proc.loadavg.15min; proc.loadavg.15min, cluster=*
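To illustrate how tags drive aggregation (e.g. the count aggregator as a proxy for machine count): grouping points by the tags you keep, and applying an aggregator per timestamp, is the core of a time series query. A hypothetical sketch; the data points and helper are made up, not OpenTSDB code:

```ruby
# Hypothetical illustration of OpenTSDB-style tag aggregation (not OpenTSDB
# code). Each data point: [metric, timestamp, value, tags]. Aggregating with
# "count" across the host tag yields the number of machines reporting at each
# timestamp: a proxy for live machine count.
points = [
  ['proc.loadavg.15min', 1461781436, 15.0, { 'host' => 'nn0', 'cluster' => 'ams' }],
  ['proc.loadavg.15min', 1461781436, 3.2,  { 'host' => 'dn1', 'cluster' => 'ams' }],
  ['proc.loadavg.15min', 1461781436, 7.9,  { 'host' => 'dn2', 'cluster' => 'par' }],
]

# Group by the tag we keep (here: cluster), dropping the host tag,
# then apply the aggregator per (group, timestamp).
def aggregate(points, keep_tag, aggregator)
  groups = Hash.new { |h, k| h[k] = [] }
  points.each do |_metric, ts, value, tags|
    groups[[tags[keep_tag], ts]] << value
  end
  groups.transform_values { |vs| aggregator.call(vs) }
end

count = ->(vs) { vs.size }
p aggregate(points, 'cluster', count)  # counts per (cluster, timestamp)
```

Swapping `count` for `min`, `max`, or an average gives the other aggregators the UI exposes.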
• A single data table (split into regions), named tsdb
• Row key: <metric_uid><timestamp><tagk1><tagv1>[...<tagkN><tagvN>]
• timestamp is rounded down to the hour
• This schema groups data from the same metric & time bucket close together (HBase sorts rows by row key)
• Assumption: queries filter first on time range, then metric, then tags, in that order of preference
• Tag keys are sorted lexicographically
• Tags should be limited in number, because they are part of the row key (usually fewer than 5 tags)
• Values are stored in columns
• Column name: 2 or 4 bytes. For 2 bytes:
• Encodes an offset of up to 3 600 seconds → 2^12 = 4096 → 12 bits
• 4 bits left for format/type
• Other tables store metadata and name ↔ ID mappings
Rationale for using OpenTSDB – HBase schema
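A sketch of this row-key and 2-byte column-qualifier layout. The 3-byte UID values are made up for illustration; real OpenTSDB assigns UIDs through its UID table:

```ruby
# Sketch of the row key and 2-byte column qualifier layout described above.
# The 3-byte UIDs are illustrative; real OpenTSDB assigns them in tsdb-uid.
def row_key(metric_uid, timestamp, tag_uids)
  base_time = timestamp - (timestamp % 3600)     # round down to the hour
  key = metric_uid.dup
  key << [base_time].pack('N')                   # 4-byte big-endian timestamp
  tag_uids.sort.each { |k, v| key << k << v }    # tag keys sorted lexicographically
  key
end

# 2-byte qualifier: 12 bits of seconds offset within the hour + 4 bits of
# format/type flags.
def qualifier(offset_seconds, flags)
  [(offset_seconds << 4) | flags].pack('n')
end

metric_uid = "\x00\x00\x01".b
tag_uids = { "\x00\x00\x01".b => "\x00\x00\x02".b }  # tagk UID => tagv UID
key = row_key(metric_uid, 1461781436, tag_uids)
q = qualifier(1461781436 % 3600, 0b0111)
```

With one tag pair, the key is 3 + 4 + 3 + 3 = 13 bytes; adding tags lengthens the key, which is why row key size varies across rows.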
Rationale for using OpenTSDB – HBase schema
Hexadecimal representation of a row key, with two tags
Sorted row keys for the same metric (UID 000001)
Note: row key size varies across rows, because of tags
Rationale for using OpenTSDB – Statistics
Quiz: what should we look for?
367 513 metrics
30 tag keys (!)
86 194 tag values
Stabilizing & scaling OpenTSDB
OpenTSDB was hard to scale at first. What problem can you see?
Scaling OpenTSDB
We’re missing data points
• Analyze all the layers of the system
• Logs are your friends
• Change parameters one by one, not all at once
• Measure, change, deploy, measure. Rinse, repeat
Scaling OpenTSDB – Lessons learned
Varnish & OpenResty save the day
Scaling OpenTSDB – Nifty trick
Read path, replicated three times:
OpenResty (rewrites POST → GET) → Varnish (cache + load balancing) → RTSD (read OpenTSDB)
By default Varnish does not cache POST requests, so OpenResty rewrites the read queries into GETs that Varnish can then cache and load-balance across the read TSDs.
OpenTSDB to the rescue in practice
OpenTSDB to the rescue in practice – Easier to use than logs
Two NameNode failovers in one night!
• Hard to spot: in the morning, nothing has changed
• Would be impossible to see with daily aggregation
• Trivia: we fixed the tcollector to get that metric
hadoop.namenode.fsnamesystem.tag.HAState
OpenTSDB to the rescue in practice – Investigation
hadoop.nodemanager.direct.TotalCapacity
Chart annotations: huge memory capacity spike; node not reporting points; another huge spike; no data
OpenTSDB to the rescue in practice – Superimpose charts
hadoop.nodemanager.direct.TotalCapacity / hadoop.nodemanager.jvmmetrics.GcTimeMillis
Chart annotations: service restart (configuration change); service restart (OOM)
Log extract: NodeManager configured with 192 GB physical memory allocated to containers, which is more than 80% of the total physical memory available (89 GB)
OpenTSDB to the rescue in practice – Hiccups
hadoop.nodemanager.direct.TotalCapacity / hadoop.nodemanager.jvmmetrics.GcTimeMillis
Chart annotations: OpenTSDB problem (not node-specific); node probably dead
OpenTSDB to the rescue in practice – NameNode rescue
hadoop.namenode.fsnamesystem.BlocksTotal
Chart annotations: file deletion; file deletion; file creation
OpenTSDB to the rescue in practice – NameNode rescue
hadoop.namenode.fsnamesystem.BlocksTotal / hadoop.namenode.fsnamesystem.FilesTotal
Chart annotation: slope
Be careful about the scale!
OpenTSDB to the rescue in practice – NameNode rescue
hadoop.namenode.fsnamesystemstate.NumLiveDataNodes
Quiz: what is this pattern?
• Answer: NameNode checkpoint
• Note: done at regular intervals
• Trivia: never do a failover during a checkpoint!
OpenTSDB to the rescue in practice – NameNode rescue
hadoop.namenode.fsnamesystemstate.NumLiveDataNodes
Quiz: what is the problem?
• Answer: no NameNode checkpoint → no FS image!
• Follow-up: the standby NameNode could not start up after a failover, because its FS image was too old
Criteo ♥ BigData
- Very accessible: only 50 euros, which will be given to charity
- Speakers from leading organizations: Google, Spotify, Mesosphere, Criteo …
https://www.eventbrite.co.uk/e/nabdc-not-another-big-data-conference-registration-24415556587
Criteo is hiring!
http://labs.criteo.com/