operating consul as an early adopter
TRANSCRIPT
a talk
Nelson Elhage, @nelhage
Operating Consul As an Early Adopter
This Talk
• consul @ Stripe
• War Stories
• Lessons Learned
Consul at Stripe
The Good, The Bad, The Outages
Why Consul?
• Early 2014
• Stripe Infra gaining complexity
• Nightmarish in-house service registry
• Host lists distributed via puppet
Why Consul?
• Wanted a better service/host store
• consul had everything baked in
• Decided to do some test deployments
Initial Rollout
• Rolled out across all servers
• (started with bake-in in QA)
• No clients at all
What Could Go Wrong?
• We worried about memory leaks
Our First Production Issue
• Noticed one node taking >100MB RAM
• (others all <50MB)
• Reached out to armon for advice
• Bug in the stats framework:
• https://github.com/armon/go-metrics/commit/02567bbc4f518a43853d262b651a3c8257c3f141
Started Adding Clients
• Hooked into our deploy tool
• Kept a manual emergency fallback
• Generated LB config from consul
• Noticed a surprising rate of errors
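Hooking a deploy tool into consul generally means dropping a service definition on each host for the local agent to register. A hypothetical example in the classic agent config format (service name, port, and check command are invented for illustration):

```json
{
  "service": {
    "name": "api",
    "port": 8080,
    "check": {
      "script": "curl -sf http://localhost:8080/healthz",
      "interval": "10s"
    }
  }
}
```

Dropped into the agent's config directory (e.g. `/etc/consul.d/`), a file like this registers the service and attaches a health check, while the flatfile fallback stays untouched alongside it.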
Raft Instability
• Seeing >1 failover/minute
• Reached out to Armon
• “Try 0.3”
• “consul is not optimized for spinning disk”
Rolling out 0.3
• Roll to QA first
• Nothing works!
• Check logs: TLS verification errors
Rolling out 0.3
• 0.3 changed TLS verification to check the cert name
• Change our SSL issuing to add SANs
• 2014/06/16 16:52:57 [ERR] raft: Failed to make RequestVote RPC to 10.100.29.175:8300: x509: certificate is valid for [remote host], not [local host]
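Adding SANs at issuance time usually means a few extra lines in the CA's signing config. A hypothetical openssl.cnf fragment (the hostnames are invented; the exact names a given consul version matches against depend on its verification logic):

```ini
[ v3_req ]
subjectAltName = @alt_names

[ alt_names ]
DNS.1 = node1.internal.example.com
DNS.2 = server.dc1.consul
```

With SANs present, x509 verification can match any listed name instead of only the certificate's CN.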
0.3 TLS Woes
• Whoops! consul was checking the remote cert against the local node name
• armon> we just use "demo.consul.io" as the CN for all of them
• 0.3 essentially completely broke TLS
0.3.1
• I wrote a patch (since merged) to restore the 0.2 behavior
• Rolled forward to 0.3.1
• Upgraded to SSD-backed servers
Increasing Rollout
• Switched various operational tools from flatfile to consul
• Main app started using consul at startup
Consensus is Hard
consul-template
• Generating haproxy config using consul-template
• https://github.com/hashicorp/consul-template/issues/168 – `consul-template` takes O(N²) time with N services
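The kind of template involved can be sketched in a few lines; a hypothetical haproxy.ctmpl fragment (backend and service names invented):

```
# hypothetical haproxy.ctmpl fragment: one server line per healthy instance
backend api
  balance roundrobin{{ range service "api" }}
  server {{ .Node }} {{ .Address }}:{{ .Port }} check{{ end }}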
consul-template
• Got that fixed, turned it on
• consul immediately fell over
• multiple elections/minute
• 2M allocations/minute
consul-template
• Service Watches churn when any service changes health state
• Watching services on a large cluster → self-DDoS
consul-template
• We use `consul-template -once` in cron now
• Worse latency, but it works reliably
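Running in `-once` mode turns the long-lived watch into a periodic render. A hypothetical crontab entry for this setup (paths and reload command invented):

```
# re-render haproxy config once a minute instead of holding a blocking watch
* * * * * consul-template -once -template "/etc/haproxy/haproxy.ctmpl:/etc/haproxy/haproxy.cfg:service haproxy reload"
```

`-once` renders each template a single time and exits, so the churn-amplifying watches never get established; the trade-off is that config changes lag by up to one cron interval.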
consul for leader election
• Our data team wanted a leader-election primitive
• Built on top of consul, cribbing example code
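The pattern being cribbed is Consul's session-based locking: create a session, then PUT a key with `?acquire=` and check who won. A rough Python sketch against a local agent (the session and KV endpoints are Consul's HTTP API; the helper names and key are our own invention):

```python
import json
import urllib.request

CONSUL = "http://127.0.0.1:8500"

def acquire_path(key, session_id):
    # Pure helper: the KV URL whose PUT returns true only for the winner.
    return "/v1/kv/%s?acquire=%s" % (key, session_id)

def put_json(path, payload=None):
    # PUT to the local agent and decode the JSON body it returns.
    data = None if payload is None else json.dumps(payload).encode()
    req = urllib.request.Request(CONSUL + path, data=data, method="PUT")
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

def try_become_leader(key):
    # Sessions must be released or allowed to expire when done --
    # leaking one per attempt is exactly the kind of bug that bit us.
    session_id = put_json("/v1/session/create", {"TTL": "15s"})["ID"]
    is_leader = put_json(acquire_path(key, session_id))
    return session_id, is_leader
```

Only one session can hold the key at a time, so whichever caller's `acquire` PUT returns true is the leader until its session is destroyed or its TTL lapses.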
Sometime Later…
goroutine leak
• consul would rapidly eat all memory
• larger heap → large GC pauses → raft instability
• manually restarted cluster 1/day
goroutine leak
• Reached out to Armon
• Very helpful in debugging
• Found several unrelated memory leaks
goroutine leak
• Tried to figure out what changed
• Eventually correlated to a session leak in our leader-election code
goroutine leak
• Fixed our leader-election code
• New policy: no non-discovery uses of consul
consul DNS
• Increasingly reliant on consul for internal discovery
• Unhappy at exposure to periodic instability
• Still have fallbacks, but outages remain painful
consul DNS
• Solution: Use consul-template to compile consul DNS to a zone file
• Serve that out of a normal DNS server
• Refresh every 15s
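The compile step can be sketched as a consul-template template that emits one A record per healthy instance into a zone file an ordinary DNS server can load (the zone layout and TTL here are invented for illustration):

```
; hypothetical consul.zone.ctmpl fragment, re-rendered every 15s
{{ range services }}{{ $name := .Name }}{{ range service $name }}
{{ $name }}.service.consul.  15  IN  A  {{ .Address }}{{ end }}{{ end }}
```

The outer `services` loop walks the catalog and the inner `service` call lists healthy instances, so the rendered zone tracks consul's view of the world while request-path lookups only ever hit the DNS server's cache.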
Current Status
• Run consul everywhere
• Register all services
• Request-path lookups hit cached DNS
• Operational tools use HTTP interface
• Also generate config from consul-template
Final Stability Note
• consul 0.5.2 fixed our memory leaks
• consul has been quite stable for us of late
• consul-template watches still don’t scale
• 0.6 should help
Lessons Learned
being an early adopter without bringing down the site
(too many times)
Expect It To Be Rough
Monitoring, Monitoring, Monitoring
(graph all the things)
Incremental Rollout
Limit Scope
Isolation
Upgrade Aggressively
Get To Know Upstream
Be Willing to Dive In
Questions?