metrics-driven engineering at etsy

Metrics-Driven Engineering at Etsy
Mike Brittain (@mikebrittain)

Post on 08-May-2015

TRANSCRIPT

Page 2: Metrics-Driven Engineering at Etsy

Logs, Graphs, Trends, and Correlations

Page 3: Metrics-Driven Engineering at Etsy

Making Decisions

Page 4: Metrics-Driven Engineering at Etsy

How many visitors are using this thing?

Page 5: Metrics-Driven Engineering at Etsy

Can we deploy that to 100% of our visitors?

Page 6: Metrics-Driven Engineering at Etsy

Did we make it faster?

Page 7: Metrics-Driven Engineering at Etsy

Did I just break something?

Page 8: Metrics-Driven Engineering at Etsy

Q. Who makes the graphs?

A. Well, the Ops team manages the network, racks the servers, installed the monitoring tools, wears the pagers, blah, blah, blah...

Page 9: Metrics-Driven Engineering at Etsy

(but...) Engineers build the application.

Page 10: Metrics-Driven Engineering at Etsy

Dev + Ops

Page 11: Metrics-Driven Engineering at Etsy

Access

Page 12: Metrics-Driven Engineering at Etsy

Yes No

Page 13: Metrics-Driven Engineering at Etsy

“Engineers are too busy meeting our product deadlines.”

Page 14: Metrics-Driven Engineering at Etsy

Here’s the big secret...

Page 15: Metrics-Driven Engineering at Etsy

Cacti (network, SNMP)
Ganglia (machines)
Graphite (application)
Splunk (log analysis, nightly reports)
Nagios (alerting)

Page 16: Metrics-Driven Engineering at Etsy

Logging

Page 17: Metrics-Driven Engineering at Etsy

Logger::log_error("User login failed. Reason: $msg for $username", "login");

Page 18: Metrics-Driven Engineering at Etsy

web0054 [Fri Mar 04 16:27:48 2011] [info] [login] User login failed. Reason: wrong password for ...

Page 23: Metrics-Driven Engineering at Etsy

Logster

Page 24: Metrics-Driven Engineering at Etsy

Forked from ganglia-logtailer...

- Daemon mode (only cron mode)
+ Support for Graphite
+ Simplified parsing scripts

Page 25: Metrics-Driven Engineering at Etsy

web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Help me, Rhonda.
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Heeeeeeellllllllllllllppppp!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0001 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0201 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0034 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web1101 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0201 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0002 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling.
web0089 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.
web1101 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0055 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0034 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0087 [04:28:54 2011] [fatal] [client 10.101.x.x] Sky is falling.
web0002 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!
web0201 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!
web0077 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!
web0355 [04:28:54 2011] [warning] [client 10.101.x.x] Oh nooooooooooo
web0052 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [error] [client 10.101.x.x] Gaaaaahhh!!!
web0003 [04:28:54 2011] [error] [client 10.101.x.x] You've been eaten by a grue.
web0066 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!
web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Sky is falling

Page 26: Metrics-Driven Engineering at Etsy

Fatals Errors Warnings
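
The three series named on this slide can be derived by counting severity tags in error-log lines like the ones on the previous slide. What follows is a minimal Python sketch of that counting step, not Logster's actual parser interface: the regex, the `count_levels` name, and the sample lines are assumptions chosen to match the log format shown in this talk.

```python
import re
from collections import Counter

# Matches the bracketed severity tag, e.g. "[error]". The pattern is an
# assumption based on the sample log lines shown in this talk.
LEVEL_RE = re.compile(r"\[(fatal|error|warning)\]")


def count_levels(lines):
    """Return a Counter mapping severity level to number of matching lines."""
    counts = Counter()
    for line in lines:
        match = LEVEL_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# A few lines in the format shown on the slides (illustrative sample).
sample = [
    "web0001 [04:28:54 2011] [error] [client 10.101.x.x] Oh noooooo!",
    "web0001 [04:28:54 2011] [warning] [client 10.101.x.x] Gaaaaahhh!",
    "web0055 [04:28:54 2011] [fatal] [client 10.101.x.x] Gaaaaahhh!!!",
    "web0020 [04:28:54 2011] [error] [client 10.101.x.x] Sky is falling.",
]
counts = count_levels(sample)
print(counts["fatal"], counts["error"], counts["warning"])
```

Run once per interval over only the newly appended lines, counts like these become one data point per series.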

Page 27: Metrics-Driven Engineering at Etsy

StatsD

Page 28: Metrics-Driven Engineering at Etsy

StatsD::increment("logins.success");

StatsD::timing("gearman.time", $msec);
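
Calls like these end up as small text datagrams sent to the StatsD daemon over UDP, using the `name:value|type` line format (`c` for counters, `ms` for timers). The sketch below shows that wire format in Python rather than the PHP client from the slides; the function names and the default host are assumptions, and 8125 is StatsD's usual default port.

```python
import socket


def format_counter(name, delta=1):
    """Build a StatsD counter packet, e.g. 'logins.success:1|c'."""
    return f"{name}:{delta}|c"


def format_timing(name, msec):
    """Build a StatsD timer packet, e.g. 'gearman.time:320|ms'."""
    return f"{name}:{msec}|ms"


def send(packet, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; host and port here are assumptions."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(packet.encode("ascii"), (host, port))
    finally:
        sock.close()


print(format_counter("logins.success"))
print(format_timing("gearman.time", 320))
```

Because UDP is connectionless, the application does not wait on the metrics daemon, which keeps instrumentation cheap on the request path.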

Page 29: Metrics-Driven Engineering at Etsy

StatsD::timing("gearman.time", $msec);

90th pct

average

lower
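
The three lines on this graph (lower, average, 90th percentile) are aggregates computed over the timings collected in each interval. A rough Python sketch of that aggregation follows, using the nearest-rank percentile definition; StatsD's own calculation may differ in detail, and the `summarize` name and the sample values are illustrative assumptions.

```python
def summarize(timings_ms, pct=90):
    """Return lower bound, average, and pct-th percentile (nearest rank)."""
    values = sorted(timings_ms)
    if not values:
        raise ValueError("need at least one timing")
    # Nearest-rank percentile: ceil(pct/100 * n), done in integer arithmetic.
    rank = max(1, (pct * len(values) + 99) // 100)
    return {
        "lower": values[0],
        "average": sum(values) / len(values),
        "upper_pct": values[rank - 1],
    }


# Ten made-up timings in milliseconds.
stats = summarize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
print(stats)
```

Plotting the percentile next to the average shows when a slow tail is hiding behind a healthy-looking mean.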

Page 30: Metrics-Driven Engineering at Etsy

Ad hoc

name value timestamp\n

Page 31: Metrics-Driven Engineering at Etsy

echo "events.deploy.site 1 `date +%s`" | nc graphite.etsycorp.com 2003
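
The same ad hoc write can be made from application code by opening a TCP connection to the Graphite listener and sending one `name value timestamp\n` line. Below is a minimal Python sketch of that; the function names and the timeout are assumptions, and port 2003 is taken from the shell one-liner on this slide.

```python
import socket
import time


def format_line(name, value, timestamp=None):
    """Build one plaintext metric line: 'name value timestamp\\n'."""
    if timestamp is None:
        timestamp = int(time.time())  # seconds since the epoch, like `date +%s`
    return f"{name} {value} {timestamp}\n"


def send_line(line, host, port=2003):
    """Send one metric line over TCP; the host is supplied by the caller."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


# Formatting only; nothing is sent until send_line() is called.
print(format_line("events.deploy.site", 1, 1299274068), end="")
```

A deploy script could call `send_line(format_line("events.deploy.site", 1), host)` as its last step so that every deploy leaves a mark on the graphs.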

Page 32: Metrics-Driven Engineering at Etsy

Trends + Events

target=drawAsInfinite(events.deploy.site)

Page 33: Metrics-Driven Engineering at Etsy

What Happened?

Page 34: Metrics-Driven Engineering at Etsy

16,000 metrics in Graphite (plus 32,000 metrics in Ganglia)

Page 35: Metrics-Driven Engineering at Etsy

Dashboards

Page 36: Metrics-Driven Engineering at Etsy

Dashboards

Mix & Match

Page 37: Metrics-Driven Engineering at Etsy

<a href="http://graphite.etsycorp.com/render?from=-1hours&width=800&height=600&title=File+or+Script+Not+Found&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"> <img src="http://graphite.etsycorp.com/render?from=-1hours&width=280&height=220&title=File+or+Script+Not+Found&hideLegend=1&yMin=0&target=webs.errorLog.notExist&target=drawAsInfinite%28deploys.config.production%29&target=drawAsInfinite%28deploys.web.production%29&target=drawAsInfinite%28deploys.search.production%29&target=drawAsInfinite%28deploys.imagestorage.other%29&colorList=%2300cc00,%230000ff,%23ff0000,%23006633,%23cc6600"></a>

Hard

Page 38: Metrics-Driven Engineering at Etsy

$g = new Graphite($time);
$g->setTitle('File Not Found');
$g->addMetric('webs.errorLog.notExist', '#00cc00');
$g->showDeploys(true);
echo $g->getDashboardHTML(280, 220);

Easy
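
The "easy" version works because a helper assembles the long render URL from a handful of arguments. The internals of the `Graphite` class are not shown in the talk, so the following is only a Python sketch of the idea: the `render_url` function, its parameters, and the example host are hypothetical, while the query parameters (`from`, `width`, `height`, `title`, `yMin`, `target`) are the ones visible in the "hard" URL.

```python
from urllib.parse import urlencode


def render_url(base, title, metric, deploy_metrics,
               width=280, height=220, since="-1hours"):
    """Build a Graphite render URL with deploy events overlaid on one metric."""
    params = [
        ("from", since),
        ("width", width),
        ("height", height),
        ("title", title),
        ("yMin", 0),
        ("target", metric),
    ]
    # Each deploy series is drawn as a vertical line across the graph.
    params += [("target", f"drawAsInfinite({m})") for m in deploy_metrics]
    return f"{base}/render?{urlencode(params)}"


# Hypothetical host; metric names are taken from the slides.
url = render_url(
    "http://graphite.example.com",
    "File Not Found",
    "webs.errorLog.notExist",
    ["deploys.web.production"],
)
print(url)
```

Wrapping the URL construction this way means an engineer adding a graph to a dashboard only has to know the metric name.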

Page 39: Metrics-Driven Engineering at Etsy

20 dashboards by 25 engineers

Page 40: Metrics-Driven Engineering at Etsy

Application health correlated with events

Page 41: Metrics-Driven Engineering at Etsy

High-level visibility

Page 42: Metrics-Driven Engineering at Etsy

Low MTTD

Page 43: Metrics-Driven Engineering at Etsy

Validation

Page 44: Metrics-Driven Engineering at Etsy

Confidence

Page 45: Metrics-Driven Engineering at Etsy

codeascraft.etsy.com
github.com/etsy/statsd
github.com/etsy/logster
bitbucket.org/maplebed/ganglia-logtailer