statscraft 2015: monitoring using riemann - moshe zada
TRANSCRIPT
In the pipeline
Intro
Basic alerts
- Implement simple state machine
- Throttled alert
- Ignore spikes
Visualize
- Stream to ELK
- Event enrichment
- Showoff
Back to tests
- Maintenance mode
- Heartbeat alerts
Aggregation
- Sum / count / max over a batch of events
- Monitor browser JavaScript
riemann @ forter
Who am I
Moshe Zada
Problem solver @ Forter
Responsible for the entire monitoring, CI and CD stack, among other things
And where do I work
We can catch 80% of online thieves before they even get
to checkout
Tech
Forter's low-latency stack
Using Storm and Spark for transaction stream processing
Couchbase, Elasticsearch, Redis, MySQL as datastores
Immutable images
Using ELK for visibility
Basic Concepts
Who is behind riemann?
This dude
aphyr - Kyle Kingsbury
The one from "Call Me Maybe"
Works at Stripe
Events
Events are just structs, and in Riemann they are treated as immutable maps.
message Event {
  optional int64 time = 1;
  optional string state = 2;
  optional string service = 3;
  optional string host = 4;
  optional string description = 5;
  repeated string tags = 7;
  optional float ttl = 8;
  repeated Attribute attributes = 9;
  optional sint64 metric_sint64 = 13;
  optional double metric_d = 14;
  optional float metric_f = 15;
}
message Attribute {
  required string key = 1;
  optional string value = 2;
}
Sample event
Collectd event
{
  "service": "prod-redis-n01 Free memory",
  "host": "10.0.0.1",
  "description": "total memory free in bytes",
  "state": null,
  "ttl": 60,
  "metric": 1024,
  "tags": ["collectd", "redis", "infra"]
}
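Since events are treated as immutable maps, the collectd event can be handled with ordinary Clojure map functions. A minimal sketch in plain Clojure, no Riemann required; the "warning" state is an illustrative value, not something collectd sends:

```clojure
;; The collectd sample event written as a plain Clojure map.
(def event
  {:host        "10.0.0.1"
   :service     "prod-redis-n01 Free memory"
   :description "total memory free in bytes"
   :state       nil
   :ttl         60
   :metric      1024
   :tags        ["collectd" "redis" "infra"]})

;; assoc returns a new map; the original event is left untouched.
;; "warning" is a made-up state used only for this illustration.
(def flagged (assoc event :state "warning"))

(println (:state event))    ; the original still has no state
(println (:state flagged))  ; the copy carries the new state
```

This is why a stream can hand the same event to many children safely: no child can change what the others see.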
The index
The index is a table of the current state of all services
tracked by Riemann.
key                    event
10.0.0.1-redis-free    { .."metric":"5", "service":"redis-free".. }
10.0.0.2-cache-miss    { .."metric":"6", "service":"cache-miss".. }
10.0.0.2-cache-hit     { .."metric":"6", "service":"cache-hit".. }
TTL
Events entered into the index have a :ttl field, which indicates how long the event is valid for.
{"service": "foobar", "ttl": 60, "state": "pass"} -> "index"
After 60 seconds:
{"service": "foobar", "ttl": 60, "state": "expired"} -> "index"
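The index and TTL behaviour can be modelled in a few lines of plain Clojure. This is a toy model of the idea only, not Riemann's index implementation; the function names and the explicit :time field (seconds) are assumptions made for the sketch:

```clojure
;; Toy index: a map from [host service] to the latest event.
(defn index-event
  "Store `event` under its [host service] key, replacing any older entry."
  [index event]
  (assoc index [(:host event) (:service event)] event))

(defn expired?
  "True when more than :ttl seconds have passed since the event's :time."
  [now event]
  (> (- now (:time event)) (:ttl event)))

(defn expire
  "Remove stale entries. Returns the pruned index plus the removed events,
  re-labelled with the expired state."
  [index now]
  (let [stale (filter #(expired? now %) (vals index))]
    {:index   (reduce #(dissoc %1 [(:host %2) (:service %2)]) index stale)
     :expired (mapv #(assoc % :state "expired") stale)}))

(def idx
  (index-event {} {:host "10.0.0.1" :service "foobar"
                   :state "pass" :ttl 60 :time 0}))

(println (expire idx 30))  ; within the TTL: nothing expires
(println (expire idx 61))  ; past the TTL: the entry is removed and reported
```

A service that keeps sending events keeps refreshing its entry; one that goes silent eventually shows up as expired, which is what heartbeat-style alerts rely on.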
Probes and tests
Simple test
merchantSanity
Riemann forwards to PagerDuty only events whose state has changed
{
  "service": "prod-gateway-n01 MerchantSanity system test",
  "host": "10.0.0.2",
  "description": "Check forters merchants api",
  "state": "failure",
  "ttl": 60,
  "metric": 0,
  "tags": ["test", "merchantSanity"]
}
Flow
"probe machine" --> "riemann" --> "pagerduty"
The code behind
(tagged "merchantSanity"
  (pagerduty-test-dispatch "asdasdad"))

(defn pagerduty-test-dispatch
  "Constructs a pagerduty stream which resolves and
  triggers alerts based on test failure"
  [key]
  ; use the service key passed in by the caller
  (let [pd (pagerduty key)]
    ; only events whose state differs from the previous one get through
    (changed-state
      (where (state "ok") (:resolve pd))
      (where (state "failure") (:trigger pd)))))
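What "only events whose state changed" means can be shown without Riemann. A minimal sketch in plain Clojure for a single host and service (Riemann's changed-state tracks each host/service pair separately); state-changes and the sample data are hypothetical names for this illustration:

```clojure
(defn state-changes
  "Keep only the events whose :state differs from the state of the event
  before them. `init` is the state assumed before the first event."
  [init events]
  (let [previous (cons init (map :state events))]
    (for [[prev ev] (map vector previous events)
          :when (not= prev (:state ev))]
      ev)))

;; Five probe results; only two of them are transitions.
(def probe-results
  [{:state "ok"      :time 1}
   {:state "ok"      :time 2}
   {:state "failure" :time 3}
   {:state "failure" :time 4}
   {:state "ok"      :time 5}])

(println (map :time (state-changes "ok" probe-results)))
```

Only the events at time 3 (ok to failure) and time 5 (failure to ok) survive, so PagerDuty would see one trigger and one resolve instead of five notifications.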
Throttle alerts
Sometimes, when things break, they submit a ton of events.
; If the state changed
(changed-state {:init "passed"}
  ; and the state is passed - resolve
  (where (state "passed")
    (:resolve pd)))

; If the state of the event is failed
(where (state "failed")
  ; group by host and service fields
  ; pass only one event in 60 seconds
  (by [:host :service]
    (throttle 1 60
      (:trigger pd))))
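The effect of passing one event per 60 seconds can be sketched over a list of timestamped events. This is a plain Clojure model of the idea, not Riemann's throttle implementation; throttle-events and the :time field (seconds) are assumptions for the illustration, and it handles one host/service group at a time:

```clojure
(defn throttle-events
  "Let at most `n` events through per window of `dt` seconds. A new window
  opens with the first event that arrives after the previous one has ended."
  [n dt events]
  (:passed
   (reduce (fn [{:keys [window-start sent] :as acc} ev]
             (let [t (:time ev)]
               (cond
                 ;; No window yet, or the current one is over: start a new one.
                 (or (nil? window-start) (>= (- t window-start) dt))
                 (-> acc
                     (assoc :window-start t :sent 1)
                     (update :passed conj ev))

                 ;; Still inside the window with budget left.
                 (< sent n)
                 (-> acc
                     (update :sent inc)
                     (update :passed conj ev))

                 ;; Over budget: drop the event.
                 :else acc)))
           {:window-start nil :sent 0 :passed []}
           events)))

;; A broken service reporting "failed" every 10 seconds for two minutes.
(def failures
  (for [t (range 0 130 10)] {:state "failed" :time t}))

(println (map :time (throttle-events 1 60 failures)))
```

Thirteen failure events become three triggers, at 0, 60 and 120 seconds.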
Monitoring Infra - ignore spikes
collectd gathers CPU info from our instances
If more than 30% failed: trigger
(defn pagerduty-probe-dispatch [key]
  ...
  (fixed-time-window 120
    ...
    ; assoc takes key/value pairs, not a map
    (assoc (first events)
           :metric fraction
           :state (condp < fraction
                    0.3 "failed"
                    0.05 "warning"
                    "passed"))
    (pagerduty-test-dispatch key)))
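The slide elides how fraction is computed. A minimal sketch in plain Clojure, assuming it is the share of events in the window whose state is "failed"; summarize-window is a hypothetical name, and only the thresholds come from the slide:

```clojure
(defn summarize-window
  "Collapse one window of events into a single event whose :metric is the
  fraction of failed events (an assumption of this sketch) and whose :state
  follows the thresholds on the slide. Returns nil for an empty window."
  [events]
  (when (seq events)
    (let [failed   (count (filter #(= "failed" (:state %)) events))
          fraction (/ failed (double (count events)))]
      (assoc (first events)
             :metric fraction
             ;; (condp < fraction 0.3 ...) tests (< 0.3 fraction) first
             :state (condp < fraction
                      0.3  "failed"
                      0.05 "warning"
                      "passed")))))

;; Ten samples in one window, two of them failed.
(def window
  (concat (repeat 8 {:service "cpu" :state "passed"})
          (repeat 2 {:service "cpu" :state "failed"})))

(println (:state (summarize-window window)))
```

Two failures out of ten give a fraction of 0.2, which lands in the "warning" band: a short spike is recorded but does not page anyone.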
Usage
(tagged "merchantSanity"
  (pagerduty-test-dispatch "3adab5c52e1511e5a"))
(tagged-all ["collectd", "cpu"]
  (pagerduty-probe-dispatch "4a6b58212e1511e5b" 120))
Visualize
Stream to ELK
(where (and (not (tagged-any ["kibanaIgnore"]))
            (not (state "expired")))
  (logstash {:host "127.0.0.1"
             :pool-size 20
             :claim-timeout 0.2}))
Where can I find my events?
branch : prod
role : nimbus
deploytime : 2015-07-19T1918
{
  "service": "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent",
  "host": "ip-10-139-118-128",
  "metric": 100,
  "tags": ["collectd"],
  "time": "2015-07-19T16:45:58.000Z",
  "ttl": 240,
  "plugin": "df"
}
So let's split the service field!
Usage
(where (and (not (tagged-any ["kibanaIgnore"]))
            (not (state "expired")))
  (enrich
    (logstash {:host "127.0.0.1"
               :pool-size 20
               :claim-timeout 0.2})))
Enrich
(defn enrich
  "Parse environment settings from service name prefix"
  [& children]
  (apply smap
    (fn stream [event]
      (let [; a regex literal, so the backslashes need no string escaping
            regex #"^(.*?\-feature|prod)\-([\w\-]+)\-instance\-(\w+\-\w+\-\w+).(.*)"
            [all branch role deploytime subservice] (re-find regex (:service event))
            ; str turns a missing :sessionId into "" so re-find cannot throw
            is-test (not (nil? (re-find #"^(1234|5678)" (str (:sessionId event)))))]
        (assoc event
               :env (str branch "-" deploytime)
               :branch branch
               :deploytime deploytime
               :role role
               :subservice subservice
               :test is-test)))
    children))
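The regular expression can be tried on its own, outside Riemann. A minimal sketch in plain Clojure using the pattern from the slide; parse-service and service-pattern are hypothetical names, and the input is the sample collectd service name:

```clojure
;; Same pattern as in enrich, written as a regex literal.
(def service-pattern
  #"^(.*?\-feature|prod)\-([\w\-]+)\-instance\-(\w+\-\w+\-\w+).(.*)")

(defn parse-service
  "Split a service name into its parts, or return nil when it does not match."
  [service]
  (when-let [[_ branch role deploytime subservice]
             (re-find service-pattern service)]
    {:branch     branch
     :role       role
     :deploytime deploytime
     :subservice subservice
     :env        (str branch "-" deploytime)}))

(println (parse-service "prod-nimbus-instance-2015-07-19T1918 df-mnt/percent"))
(println (parse-service "some-unrelated-service"))  ; no match
```

Returning nil for non-matching names makes it easy to pass such events through unchanged instead of attaching empty fields.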
Enrich
{
  "service": "prod-nimbus-instance-2015-07-19T1918/df-mnt/percent_bytes-free",
  "env": "prod-2015-07-19T1918",
  "branch": "prod",
  "deploytime": "2015-07-19T1918",
  "role": "nimbus",
  "subservice": "df-mnt/percent_bytes-free",
  "host": "ip-10-139-118-128",
  "metric": 100
}
Back to tests
Maintenance Mode
Send a "maintenance-mode" event
Riemann queries its own index for the "maintenance-mode" event; if it exists, the alert is ignored
Enable:
{
  "service": "prod-2015-07-19T1918 maintenance-mode",
  "ttl": 120,
  "state": "active"
}
And usage:
(where (and (state "failed")
            (not (maintenance-mode (str (:env event) " maintenance-mode"))))
  (:trigger pd))
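The slides do not show how maintenance-mode is implemented. A minimal sketch of the decision in plain Clojure, assuming the index can be viewed as a map from service name to its latest event; maintenance-mode? and should-trigger? are hypothetical names, not Riemann functions:

```clojure
(defn maintenance-mode?
  "True when `index` holds an active maintenance event for the event's :env.
  `index` is a plain map from service name to event (an assumption)."
  [index event]
  (= "active"
     (:state (get index (str (:env event) " maintenance-mode")))))

(defn should-trigger?
  "A failed event pages someone only when its environment is not in
  maintenance mode."
  [index event]
  (and (= "failed" (:state event))
       (not (maintenance-mode? index event))))

(def index
  {"prod-2015-07-19T1918 maintenance-mode"
   {:service "prod-2015-07-19T1918 maintenance-mode" :ttl 120 :state "active"}})

(def failure {:env "prod-2015-07-19T1918" :state "failed"})

(println (should-trigger? index failure))  ; maintenance is on
(println (should-trigger? {} failure))     ; no maintenance event in the index
```

Because the maintenance event carries a TTL of 120 seconds, it has to be re-sent to stay active; once the sender stops, the entry expires and alerting resumes by itself.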
Heartbeat alerts
(defn pagerduty-cron-expiration
  "Constructs a pagerduty stream which resolves
  and triggers alerts based on event expiration"
  [key]
  (let [pd (custom-pagerduty key)]
    ; an expired event means the cron job stopped reporting
    (where (expired? event)
      (with {:state "failed"
             :description "TTL Expired. Check that the cron service"}
        (pagerduty-test-dispatch key))
      (else
        (pagerduty-test-dispatch key)))))