scaling graphite at yelp

…And Metrics For All Paul O’Connor github.com/pauloconnor 2015-05-19

Upload: paul-oconnor

Post on 21-Apr-2017

TRANSCRIPT

Page 1: Scaling Graphite At Yelp

…And Metrics For All

Paul O’Connor
github.com/pauloconnor
2015-05-19

Page 2: Scaling Graphite At Yelp

About Yelp

Founded: 2004
Monthly Active Users: ~142 Million
Non-US Monthly Users: ~31 Million
Reviews: ~77 Million
Local Businesses: 2.1 Million
Territories: Available in 31 countries

Page 3: Scaling Graphite At Yelp

What are metrics?

Name Value

Page 4: Scaling Graphite At Yelp

What are metrics?

Name Value Timestamp

Page 5: Scaling Graphite At Yelp

What are metrics?

Name             Value      Timestamp
server1.load.1m  28.826667  1431950640

Page 6: Scaling Graphite At Yelp

What are metrics?

Name             Value      Timestamp
server1.load.1m  28.826667  1431950640
server1.load.1m  29.188333  1431950700
server1.load.1m  29.231667  1431950760
server1.load.1m  29.083333  1431950820
server1.load.1m  29.710000  1431950880

Page 7: Scaling Graphite At Yelp

What are metrics?

Name             Value      Timestamp
server1.load.1m  28.826667  1431950640
server1.load.1m  29.188333  1431950700
server1.load.1m  29.231667  1431950760
server1.load.1m  29.083333  1431950820
server1.load.1m  29.710000  1431950880
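A minimal Python sketch of the name/value/timestamp triple shown above, assuming one whitespace-separated metric per line. The `Metric` type and `parse_metric` helper are illustrative names, not part of Graphite. It parses the five samples from the slide and shows that they are 60 seconds apart.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class Metric(NamedTuple):
    name: str
    value: float
    timestamp: int  # seconds since the Unix epoch


def parse_metric(line: str) -> Metric:
    """Split one 'name value timestamp' line into typed fields."""
    name, value, timestamp = line.split()
    return Metric(name, float(value), int(timestamp))


# The five samples from the slide.
SAMPLES = """\
server1.load.1m 28.826667 1431950640
server1.load.1m 29.188333 1431950700
server1.load.1m 29.231667 1431950760
server1.load.1m 29.083333 1431950820
server1.load.1m 29.710000 1431950880
"""

metrics = [parse_metric(line) for line in SAMPLES.splitlines()]

# Distinct gaps between consecutive samples, in seconds.
intervals = {b.timestamp - a.timestamp for a, b in zip(metrics, metrics[1:])}

# The first timestamp as a UTC datetime.
first_seen = datetime.fromtimestamp(metrics[0].timestamp, tz=timezone.utc)
```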

Page 8: Scaling Graphite At Yelp

Graphite Components

• Carbon:
  • relay
  • cache
  • aggregator
• Whisper
• Web app

Page 9: Scaling Graphite At Yelp

Carbon Relay

• Deals with 2 things
  • Replication
  • Sharding

Page 10: Scaling Graphite At Yelp

Relay Methods

• Rules

  [replicate]
  pattern = ^services\.ads\..+
  servers = 10.1.2.3, 10.2.2.3
  continue = true

• Consistent Hashing
  • Defines a sharding strategy across multiple backends
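To illustrate the idea behind consistent hashing, here is a simplified hash ring in Python. It is a sketch of the general technique, not carbon's own implementation: the `HashRing` class, the MD5-based hash, the replica count, and the `cache-*` node names are all assumptions made for the example.

```python
import bisect
import hashlib


class HashRing:
    """Simplified consistent-hash ring (illustrative, not carbon's code)."""

    def __init__(self, nodes, replicas=100):
        # Each node gets `replicas` points on the ring to even out the load.
        self._ring = sorted(
            (self._hash(f"{node}:{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._keys = [point for point, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

    def get_node(self, metric_name):
        """Return the node owning the first ring point after the metric's hash."""
        index = bisect.bisect(self._keys, self._hash(metric_name))
        return self._ring[index % len(self._ring)][1]


# Hypothetical cache nodes.
ring = HashRing(["cache-a", "cache-b", "cache-c"])
owner = ring.get_node("server1.load.1m")
```

The property that makes this useful for sharding is that adding a node only moves the metrics that the new node takes over; every other metric stays on the cache it was already on.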

Page 11: Scaling Graphite At Yelp

Carbon Cache

• Receives metrics and persists them to disk
• Writes based on storage schemas

Page 12: Scaling Graphite At Yelp

Storage Schemas

• Details retention rates for storing metrics

[databases_10sec_1year]
pattern = ^servers\.db.*$
retentions = 10s:7d,1m:30d,5m:90d,30m:365d
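As a worked example, the retention string above can be turned into a number of points per archive and an approximate file size per metric. This is a Python sketch: the helper names are hypothetical, and the byte sizes (16 bytes of metadata, 12 bytes per archive header, 12 bytes per point) are assumptions based on my understanding of the Whisper file layout.

```python
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}

# Assumed Whisper layout sizes, in bytes.
METADATA_BYTES = 16
ARCHIVE_INFO_BYTES = 12
POINT_BYTES = 12


def to_seconds(spec):
    """Convert a value such as '10s' or '7d' to seconds."""
    return int(spec[:-1]) * UNIT_SECONDS[spec[-1]]


def parse_retentions(retentions):
    """Return a list of (seconds_per_point, number_of_points) per archive."""
    archives = []
    for archive in retentions.split(","):
        precision, duration = archive.strip().split(":")
        step = to_seconds(precision)
        archives.append((step, to_seconds(duration) // step))
    return archives


def whisper_file_size(archives):
    """Approximate on-disk size of one metric file, in bytes."""
    points = sum(count for _, count in archives)
    return METADATA_BYTES + ARCHIVE_INFO_BYTES * len(archives) + POINT_BYTES * points


archives = parse_retentions("10s:7d,1m:30d,5m:90d,30m:365d")
size_bytes = whisper_file_size(archives)
```

Under these assumptions the schema stores 147,120 points per metric, or roughly 1.7 MB per file, which is worth multiplying by the number of metrics matching the pattern before choosing retentions.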

Page 13: Scaling Graphite At Yelp

Storage Aggregation

• Rules for aggregating data to lower-precision retentions

[all_min]
pattern = \.min$
xFilesFactor = 0.1
aggregationMethod = min
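A small Python sketch of what these two settings mean when higher-precision points are rolled up into one lower-precision point. The `roll_up` function is illustrative rather than Whisper's code; it assumes missing points are represented as `None` and that a roll-up is written only when the fraction of known points is at least `xFilesFactor`.

```python
def roll_up(points, x_files_factor=0.1, method=min):
    """Aggregate one window of points, or return None if too few are known."""
    known = [p for p in points if p is not None]
    if not known or len(known) / len(points) < x_files_factor:
        return None  # not enough data: the lower-precision point stays empty
    return method(known)


# Six 10-second points rolled into one 1-minute point (hypothetical values).
sparse_window = [None, None, None, None, None, 3.0]
rolled = roll_up(sparse_window, x_files_factor=0.1, method=min)
```

With `xFilesFactor = 0.1`, a single known point out of six (about 17%) is enough to produce a value; with a stricter factor such as 0.5 the same window would roll up to nothing.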

Page 14: Scaling Graphite At Yelp

Carbon Aggregator

• Buffers metrics before forwarding to carbon cache
• Roll up metrics based on rules

Page 15: Scaling Graphite At Yelp

Aggregation Rules

• Not to be confused with storage aggregation
• Tells the carbon aggregator what to aggregate and how

output_template (frequency) = method input_pattern

<env>.applications.<app>.all.requests (60) = sum <env>.applications.<app>.*.requests

prod.applications.apache.www01.requests
prod.applications.apache.www02.requests
prod.applications.apache.www03.requests
prod.applications.apache.www04.requests
prod.applications.apache.www05.requests

prod.applications.apache.all.requests
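The rule above can be sketched in Python to show how the five per-host metrics collapse into one. This is a simplified illustration, not the carbon aggregator's implementation: `rule_to_regex` and `aggregate_sum` are hypothetical helpers, the request counts are made up, and the 60-second buffering interval is left out.

```python
import re
from collections import defaultdict


def rule_to_regex(input_pattern):
    """Turn '<env>.applications.<app>.*.requests' into a compiled regex."""
    parts = []
    for part in input_pattern.split("."):
        if part.startswith("<") and part.endswith(">"):
            parts.append(f"(?P<{part[1:-1]}>[^.]+)")  # named field
        elif part == "*":
            parts.append("[^.]+")  # any single path component
        else:
            parts.append(re.escape(part))
    return re.compile(r"\.".join(parts) + "$")


def aggregate_sum(input_pattern, output_template, datapoints):
    """Sum the values of matching metrics under the templated output name."""
    regex = rule_to_regex(input_pattern)
    totals = defaultdict(float)
    for name, value in datapoints:
        match = regex.match(name)
        if match:
            output = output_template
            for field, text in match.groupdict().items():
                output = output.replace(f"<{field}>", text)
            totals[output] += value
    return dict(totals)


# Hypothetical request counts for one interval.
datapoints = [
    ("prod.applications.apache.www01.requests", 10.0),
    ("prod.applications.apache.www02.requests", 20.0),
    ("prod.applications.apache.www03.requests", 30.0),
    ("prod.applications.apache.www04.requests", 40.0),
    ("prod.applications.apache.www05.requests", 50.0),
]
totals = aggregate_sum(
    "<env>.applications.<app>.*.requests",
    "<env>.applications.<app>.all.requests",
    datapoints,
)
```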

Page 16: Scaling Graphite At Yelp

Whisper

• Fixed size database
• Allows for roll ups
• Allows for backfilling data

Page 17: Scaling Graphite At Yelp

Web App

• Django based app for rendering graphs

Page 18: Scaling Graphite At Yelp

Putting it all together

• Carbon cache listening on port 2003
• Write to disk
• Listen with web
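For illustration, a client could send one metric to a carbon cache on port 2003 roughly as follows. This Python sketch assumes the plaintext line format of name, value, and timestamp terminated by a newline; the function names and the `127.0.0.1` default host are hypothetical.

```python
import socket
import time


def format_metric(name, value, timestamp=None):
    """Build one plaintext line: 'name value timestamp' plus a newline."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{name} {value} {int(timestamp)}\n"


def send_metric(line, host="127.0.0.1", port=2003, timeout=5.0):
    """Open a TCP connection to carbon and send a single metric line."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(line.encode("ascii"))


line = format_metric("server1.load.1m", 28.826667, 1431950640)
# send_metric(line)  # uncomment when a carbon cache is listening on port 2003
```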

Page 19: Scaling Graphite At Yelp

Getting more complicated

• Carbon relay using consistent hashing to multiple caches
• Individual caches responsible for specific metrics

Page 20: Scaling Graphite At Yelp

More Relays

• Use HAProxy to load balance between relays
• Run more relay processes to make use of more CPU cores

Page 21: Scaling Graphite At Yelp

Even more relays

• Useful for sending metrics to other locations

Page 22: Scaling Graphite At Yelp

Replicate the metrics

• Duplicate your metrics for backup and redundancy

Page 23: Scaling Graphite At Yelp

More caches instead

• Consistent hash across multiple nodes

Page 24: Scaling Graphite At Yelp

Where does the aggregator fit?

• The aggregator uses a lot of CPU. Put it on its own node.

Page 25: Scaling Graphite At Yelp

Scaling further

• Use nodes for particular functions:
  • Use forwarding relay nodes solely to forward
  • Have consistent hashing nodes
  • Have aggregation nodes

Page 26: Scaling Graphite At Yelp
Page 27: Scaling Graphite At Yelp
Page 28: Scaling Graphite At Yelp

Getting your data back out

• Graphite Dashboard
• Third Party Dashboard
  • We use Grafana http://grafana.org/
  • Graphite-api https://github.com/brutasse/graphite-api
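Dashboards like these typically fetch data over HTTP from Graphite's render endpoint. The Python sketch below only builds such a URL, assuming the usual `target`, `from`, and `format` query parameters; `graphite.example.com` and the `render_url` helper are hypothetical.

```python
from urllib.parse import urlencode


def render_url(base_url, target, since="-1h", fmt="json"):
    """Build a render URL for one target over a relative time range."""
    query = urlencode({"target": target, "from": since, "format": fmt})
    return f"{base_url.rstrip('/')}/render?{query}"


# Hypothetical host; the metric name is the one used earlier in the deck.
url = render_url("http://graphite.example.com", "server1.load.1m")
```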

Page 29: Scaling Graphite At Yelp
Page 30: Scaling Graphite At Yelp

Tips

• Aggregate before ingestion
• Control the metrics that can be sent
• Metrics are a gas - they expand to fill all available room
• Use a C implementation of carbon
• Use the latest webapp
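One way to control which metrics can be sent is a regex allowlist applied before metrics are accepted. The Python sketch below shows the idea only; `ALLOWED_PATTERNS` and `is_allowed` are hypothetical names, and the two patterns are borrowed from the config examples earlier in the deck.

```python
import re

# Only metric names matching one of these patterns are accepted.
ALLOWED_PATTERNS = [
    re.compile(r"^servers\.db.*$"),
    re.compile(r"^services\.ads\..+"),
]


def is_allowed(metric_name, patterns=ALLOWED_PATTERNS):
    """Return True if the metric name matches any allowlist pattern."""
    return any(pattern.search(metric_name) for pattern in patterns)


incoming = ["services.ads.clicks", "servers.db1.load", "tmp.debug.scratch"]
accepted = [name for name in incoming if is_allowed(name)]
```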

Page 31: Scaling Graphite At Yelp

Optimize your dashboard queries

• services.biz_app.*.*.timers.pyramid_uwsgi_metrics_tweens_*.p99
  • 2154 results
  • 35 seconds to just find these files on disk
  • Running functions against these results
  • Timeout after a minute
  • Dashboard automatically refreshing every 10 seconds
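The numbers above compound: a query that outlives the refresh interval is re-issued before the previous one finishes. A rough Python estimate follows, assuming every refresh starts a new request and earlier requests are not cancelled until they complete or time out; `requests_in_flight` is a hypothetical helper.

```python
import math


def requests_in_flight(query_seconds, refresh_seconds, timeout_seconds):
    """Estimate overlapping requests per open dashboard for one query."""
    effective = min(query_seconds, timeout_seconds)
    return math.ceil(effective / refresh_seconds)


# A query that runs until the one-minute timeout, refreshed every 10 seconds.
overlapping = requests_in_flight(query_seconds=60, refresh_seconds=10, timeout_seconds=60)

# Even the 35-second file lookup alone overlaps several refreshes.
lookup_only = requests_in_flight(query_seconds=35, refresh_seconds=10, timeout_seconds=60)
```

Under these assumptions a single open dashboard keeps about six copies of the query running at once, each expanding to 2154 series.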

Page 32: Scaling Graphite At Yelp
Page 33: Scaling Graphite At Yelp

What’s the Future?

• InfluxDB
• Cassandra
• Third party

Page 34: Scaling Graphite At Yelp

We’re hiring!
http://www.yelp.com/careers
Hiring SREs in Dublin, London, New York, San Francisco