Iterative dashboards & monitors
CARMEL HINKS | SOFTWARE ENGINEER | ATLASSIAN
Us, with our hands down
You build it, you run it
I addressed all operational concerns
Nice! We are finished forever!
Past Present
For now…
Agenda
Iterative… what?
Setting some context
Deciding what to measure
Verifying your metrics
Keeping up with change
Summary
Iterative… what?
Dashboards Monitors Metrics
Metric: A measure of a software characteristic
Analytic..?
Metric: What are our systems doing?
Analytic: What are our users doing?
Dashboard: A visualisation of your metrics
Monitor: An alert against one or more metrics
Operational health
What went wrong?
When did it go wrong?
Why did it go wrong?
Setting some context
Multi-tenant
Database
App App App
Queue
What about cross-region latency?
What about scale?
What about progressive rollouts?
What about outage blast radius?
What about data sovereignty?
What about noisy neighbours?
Shard
Sign me up to Jira!
Provisioning pipeline
[Diagram: three shards, each with a database, three app nodes and a queue]
Shard Service
[Diagram: a growing number of shards, each with a database, three app nodes and a queue]
Europe Australia USA
Shard Service
70% 20% 90% 5%
Shard A
Database
App App App
Queue
Metrics from the shards, about the shards
Wait…
Understanding what to measure is hard
What worked today, may not work tomorrow
Keeping everyone & everything up to date is hard
Deciding what to measure
It is a capital mistake to theorise before one has data
SHERLOCK HOLMES
Measure nothing
Metrics aren’t verified before going live
First incident is going to SUCK
This isn’t a solution, it’s a deferral
Measure everything
Expensive (time, money & resources)
Lots of noise
Does not scale
Measure stuff out of the box
SHARD SERVICE DASHBOARD
IF CPU > 80% for over 5 minutes THEN page
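The out-of-the-box rule above can be read as a sustained-threshold monitor. The following is a minimal sketch of that reading, not the alerting tool's actual behaviour: the class name `ThresholdMonitor`, the `(timestamp, value)` sample format and the reset-on-dip behaviour are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass(frozen=True)
class ThresholdMonitor:
    """Hypothetical monitor: page when a metric stays above a threshold too long."""

    threshold: float = 80.0        # CPU %, taken from the rule on the slide
    window_seconds: int = 5 * 60   # "for over 5 minutes"

    def should_page(self, samples: Iterable[Tuple[int, float]]) -> bool:
        """`samples` are (unix_timestamp, value) pairs in ascending time order."""
        breach_started: Optional[int] = None
        for timestamp, value in samples:
            if value > self.threshold:
                if breach_started is None:
                    breach_started = timestamp
                # Strictly "over" the window, matching the wording of the rule.
                if timestamp - breach_started > self.window_seconds:
                    return True
            else:
                # Assumption: any sample at or below the threshold resets the clock.
                breach_started = None
        return False
```

A short spike or a breach interrupted by a dip does not page under this sketch; only an uninterrupted breach longer than the window does.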
Think of idea
Design service
Build service (MVP)
Reach operational maturity
Release
Iterate service
Iterate service
Sign me up to Jira!
[Diagram: three shards, each with a database, three nodes and a queue]
What questions do we want to be able to answer with our operational resources?
TAKING A STEP BACK
Shard Service
Performs the selection of a suitable shard based on geographical location and dynamic capacity metrics
Synchronous, HTTP-facing
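To make the description above concrete, here is a minimal sketch of selecting a shard by region and capacity. The talk does not describe the real selection logic, so everything here is an assumption: the `Shard` type, the `select_shard` function, the 0.8 capacity limit and the "least loaded shard in the requested region" rule.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Shard:
    name: str
    region: str
    capacity_used: float  # fraction of capacity in use, 0.0 to 1.0


def select_shard(
    shards: Sequence[Shard],
    region: str,
    max_capacity_used: float = 0.8,  # hypothetical limit, not from the talk
) -> Optional[Shard]:
    """Return the least loaded shard in `region`, or None if none is suitable."""
    candidates = [
        shard
        for shard in shards
        if shard.region == region and shard.capacity_used < max_capacity_used
    ]
    if not candidates:
        return None  # corresponds to "There are no suitable shards"
    return min(candidates, key=lambda shard: shard.capacity_used)
```

Returning `None` rather than raising keeps the "no suitable shard" case explicit, which makes it easy to count as an application metric.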
Shard Service
Requests slow down significantly
Requests are accepted, but then fail
Requests start being rejected
There are no suitable shards
Incorrect shards were selected
There is insufficient data to make decisions
Infrastructure metrics
Application metrics
Infrastructure health: Useful metrics tied to components in your tech stack
Application health: Useful metrics tied to the domain of your application
Infrastructure metrics + Application metrics =
Infrastructure metrics
Latency
Memory utilisation
Load balancer errors
Application metrics
Shard capacity
Errors logged
Shard selection reason
Metrics about the shards, from Shard Service
Confluence Apdex Count metric: current vs target (by shard)
Jira Apdex Count metric: current vs target (by shard)
Database utilisation metric: current vs target (by shard)
Provisioning failures metric: current vs target (by shard)
Average remaining capacity by shard (Jira Apdex)
Average remaining capacity by shard (Confluence Apdex)
Top selected region (internal)
Top selected region (AWS)
Top selection reasons
Top selected shards
SHARD SERVICE INFRASTRUCTURE
Monitors
Region capacity exhausted
Surge in errors logged
Shard capacity exhausted
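The two capacity monitors above could be evaluated from per-shard capacity metrics roughly as follows. This is a sketch under stated assumptions: the function names, the 0.95 limit and the rule that a region is exhausted only when every shard in it is exhausted are illustrative, not taken from the talk.

```python
from typing import Dict, List


def exhausted_shards(capacity_used: Dict[str, float], limit: float = 0.95) -> List[str]:
    """Shards whose used capacity is at or above the (hypothetical) limit."""
    return sorted(name for name, used in capacity_used.items() if used >= limit)


def exhausted_regions(
    shard_regions: Dict[str, str],
    capacity_used: Dict[str, float],
    limit: float = 0.95,
) -> List[str]:
    """Regions in which every shard is at or above the limit (an assumed definition)."""
    by_region: Dict[str, List[float]] = {}
    for shard, region in shard_regions.items():
        by_region.setdefault(region, []).append(capacity_used[shard])
    return sorted(
        region
        for region, used_values in by_region.items()
        if all(used >= limit for used in used_values)
    )
```

Both checks rely on application metrics (shard capacity as seen by Shard Service), which infrastructure metrics such as CPU or memory would not reveal.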
How can you… Figure out what to measure?
What questions do you want to answer?
Why does your service exist (what are its roles and responsibilities)?
What does it look like for those roles and responsibilities to degrade?
How can you verify whether or not such a degradation is occurring?
Verifying your metrics
Everything is right.
As time went on…
Fine… or exploding: Never checked operational health unless it was on fire
Noisy alerts: Frequent & un-actionable
Things changed: Because, you know, agile
Elastic load balancer
Shard Service Node 1
Shard Service Node 2
Shard Service Node 3
Latency
Load balancer errors
Healthy hosts
Application load balancer (same three nodes, same three metrics)
Our team
Team who could actually fix the problem
What level of service you can commit to offer
SERVICE LEVEL OBJECTIVE
E.g. 99.99% of requests should succeed
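As a worked example of the 99.99% objective above, the sketch below checks a request count against the SLO and computes how many more failures the period can absorb. The function names are hypothetical, and exact fractions are used so that the arithmetic does not depend on floating-point rounding.

```python
from fractions import Fraction

# 99.99% of requests should succeed (the example objective from the slide).
SLO = Fraction(9999, 10000)


def meets_slo(succeeded: int, total: int, slo: Fraction = SLO) -> bool:
    """True when the observed success ratio is at or above the objective."""
    if total == 0:
        return True  # assumption: no traffic counts as meeting the objective
    return Fraction(succeeded, total) >= slo


def error_budget_remaining(succeeded: int, total: int, slo: Fraction = SLO) -> int:
    """Failures still allowed in this period; negative once the SLO is breached."""
    allowed_failures = int(total * (1 - slo))  # rounds down to whole requests
    actual_failures = total - succeeded
    return allowed_failures - actual_failures
```

At one million requests, a 99.99% objective allows 100 failures, so 40 failures leave a budget of 60 and 150 failures overspend it by 50.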
We were not alone
Process dedicated to regularly reviewing, discussing and iterating on operational health
TECHOPS
TechOps
Develop measurable goals
Collect data
Prepare a report
Meet and discuss
Repeat and iterate
TechOps for everyone!
Goal: Reduce the number of noisy alerts
Data: Alerts received in the past week
87 alerts received in the past week
87 low priority alerts
Report: Alerts, dashboard screenshots, incidents…
Meet & discuss: Actionable? Discoverable? Useful?
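The "collect data" and "prepare a report" steps above can be sketched as a small weekly tally of alerts. The `Alert` fields and the `summarise_alerts` function are hypothetical; in practice the alerts would come from whatever paging or monitoring tool the team uses, and the "actionable" flag would be filled in during the review meeting.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Alert:
    monitor: str      # name of the monitor that fired
    priority: str     # "high" or "low"
    actionable: bool  # judged during review: could someone act on it?


def summarise_alerts(alerts: Sequence[Alert], top: int = 3) -> Dict[str, object]:
    """Summarise one week of alerts for the review meeting."""
    by_priority = Counter(alert.priority for alert in alerts)
    # Monitors that fired without anyone being able to act are the noisy ones.
    noisy = Counter(alert.monitor for alert in alerts if not alert.actionable)
    noisiest: List[Tuple[str, int]] = noisy.most_common(top)
    return {
        "total": len(alerts),
        "by_priority": dict(by_priority),
        "noisiest": noisiest,
    }
```

Tracking the same summary week over week gives the team a measurable view of whether the goal of fewer noisy alerts is being met.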
ALERTS (ALL SERVICES, STAGING + PRODUCTION)
[Chart: alerts per week, week 1 to week 21, y-axis 0 to 100; series: Total, High priority, Low priority]
CASE #2 - RELIABILITY INCREASE
[Chart; series: Total, High priority, Low priority]
CASE #3 - ALERT REDUCTION
How can you… Verify you’re measuring the right things?
Review your operational resources!
…frequently
Keeping up with change
Shard Service
70% 20% 90% 5%
Shard A
Database
App App App
Queue
What’s the big deal?
[Dashboard: the Shard Service panels listed earlier, most of them one metric broken down by shard]
Panel per metric
Slow, error prone, forgettable
{ }
Shard Service repository
{ } Application
{ } Test
{ }
Dashboard tool
Marge’s Service
Homer’s Service
Shard Service
Operational resources as code
Discoverable
Front of mind
Version control
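A minimal sketch of what "operational resources as code" could look like: a dashboard definition kept next to the application code, a check that can run alongside the tests, and a stable JSON rendering that diffs cleanly in version control. The dictionary layout, the query strings and the function names are assumptions for illustration and do not correspond to any particular dashboard tool's format.

```python
import json
from typing import Dict, List

# Hypothetical dashboard definition; panel titles are taken from the slides,
# the query strings are made up.
SHARD_SERVICE_DASHBOARD: Dict = {
    "title": "Shard Service",
    "panels": [
        {"title": "Top selected shards", "query": "top(shard_service.selected_shard, 10)"},
        {"title": "Top selection reasons", "query": "top(shard_service.selection_reason, 10)"},
    ],
}


def validate_dashboard(dashboard: Dict) -> List[str]:
    """Return a list of problems; an empty list means the definition looks sane."""
    problems: List[str] = []
    if not dashboard.get("title"):
        problems.append("dashboard has no title")
    seen = set()
    for panel in dashboard.get("panels", []):
        title = panel.get("title", "")
        if not title:
            problems.append("panel has no title")
        elif title in seen:
            problems.append(f"duplicate panel title: {title}")
        seen.add(title)
        if not panel.get("query"):
            problems.append(f"panel '{title}' has no query")
    return problems


def render(dashboard: Dict) -> str:
    """Serialise with sorted keys so changes show up as small, reviewable diffs."""
    return json.dumps(dashboard, indent=2, sort_keys=True)
```

Because the definition lives in the repository, a change to a metric name and the matching change to its panel can be reviewed in the same pull request.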
Going a step further…
Operational resources as code
286 Dashboards
JIRA SHARD DASHBOARD (SUBSET)
Cue, templates
This can be solved at the platform level
SERGEJS SINICA, ATLASSIAN SENIOR DEVELOPER
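With hundreds of near-identical per-shard dashboards, a template keeps them consistent. The sketch below generates one dashboard definition per shard from a single list of panel templates. The panel titles echo the slides, but the query syntax, the `PANEL_TEMPLATES` name and both functions are hypothetical and not tied to any real templating system.

```python
from typing import Dict, List, Sequence, Tuple

# (panel title, query template) pairs; "{shard}" is filled in per shard.
# The query syntax is invented for illustration.
PANEL_TEMPLATES: List[Tuple[str, str]] = [
    ("Jira Apdex: current vs target", "avg(jira.apdex) where shard = '{shard}'"),
    ("Database utilisation: current vs target", "avg(db.utilisation) where shard = '{shard}'"),
    ("Provisioning failures: current vs target", "sum(provisioning.failures) where shard = '{shard}'"),
]


def shard_dashboard(shard: str) -> Dict:
    """Build one dashboard definition for a single shard."""
    return {
        "title": f"Jira shard dashboard: {shard}",
        "panels": [
            {"title": title, "query": query.format(shard=shard)}
            for title, query in PANEL_TEMPLATES
        ],
    }


def all_dashboards(shards: Sequence[str]) -> List[Dict]:
    """Build a dashboard per shard; adding a panel template updates every one."""
    return [shard_dashboard(shard) for shard in shards]
```

Adding or fixing a panel then becomes a one-line change to the template rather than an edit repeated across every dashboard in the UI.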
91 respondents who owned 1 to 10+ services
83.5% maintained operational resources through the UI
67% kept their team up to date via “tribal knowledge”
23.1% satisfied with their existing process
“We've been saying for eons that we should put our monitors and dashboards into code, but the task is too big to start so we don't do it. Over time the job just gets bigger and bigger and less likely to get done :P”
ATLASSIAN DEVELOPER
Introducing Sauron
The “all-seeing eye” for dashboards & monitors
Dashboard tool
Shard Service
Marge’s Service
Homer’s Service
Sauron
Export my dashboard!
{ }
Application repository
Monitors
Dashboards
Screenboards
shard-service operations
> 50 services adopted Sauron
How can you… Help your team keep up to date with change?
Define operational resources in code
Summary
Select: Learn what questions you want to answer
Verify: Review, review, review
Keep up: Define all the things in code
Thank you!