@EdMcBane 7 lessons learned building HP/HA systems
Never gonna give you up
Never gonna let you down
@EdMcBane
Francesco Degrassi
Enthusiastic yet pragmatic Lean Software Developer.
Uppish and cynical nihilist from time to time.
Lean Software Development
Continuous Delivery - High availability - Scale-up
Security sensitive & high uncertainty domains
The challenge
● Primary European client
● Innovative service for the consumer market
● Non-trivial userbase (400K+ users)
● High request rate
● Low latency requirement (<< RTT)
What we built
Make your assumptions explicit
and keep testing them
Do not eat yellow snow
What did we learn?
#1 Make your assumptions explicit
and keep challenging them
Issues
● failure to properly estimate
● failure to reassess performance goals
● losing track of assumptions and implications
#2 Performance &
Availability are not extra features
Challenges
● Support for required failover modes
● Support for required scale-out/scale-up modes
● Operability in general
○ and monitoring in particular
● Most important of all, avoiding complexity
#3 Keep things simple
and do not reinvent the wheel
Everything should be made as simple as possible, but not simpler
— Albert Einstein
LESS(1)                  General Commands Manual                  LESS(1)

NAME
    less - opposite of more

SYNOPSIS
    less -?
    less --help
    less -V
    less --version
    less [-[+]aABcCdeEfFgGiIJKLmMnNqQrRsSuUVwWX~]
         [-b space] [-h lines] [-j line] [-k keyfile]
         [-{oO} logfile] [-p pattern] [-P prompt] [-t tag]
         [-T tagsfile] [-x tab,...] [-y lines] [-[z] lines]
         [-# shift] [+[+]cmd] [--] [filename]...
    (See the OPTIONS section for alternate option syntax with long option names.)

DESCRIPTION
    Less is similar to more(1), but has many more features. Less does not have to read the entire input file before starting, so with large input files it starts up faster than text editors like vi(1). Less uses termcap (or terminfo on some systems), so it can run on

Manual page less(1) line 1 (press h for help or q to quit)
In our case...
● Everything was good with the single core scenario
SO_REUSEPORT
For TCP, SO_REUSEPORT allows multiple listener sockets to be bound to the same port.
Received packets are distributed to multiple sockets bound to the same port using a 4-tuple hash.
With SO_REUSEPORT the distribution is uniform.
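As a rough illustration of the option described above, the sketch below opens several TCP listeners on one port from a single process. It assumes a Linux host where Python exposes `socket.SO_REUSEPORT`. The helper names `make_listener` and `open_listeners` are made up for this example and are not from the talk. In a real deployment each listener would typically live in its own worker process.

```python
import socket


def make_listener(host: str, port: int) -> socket.socket:
    """Create a TCP listener with SO_REUSEPORT set before bind()."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # The option must be set on every socket that shares the port.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    sock.bind((host, port))
    sock.listen(128)
    return sock


def open_listeners(count: int, host: str = "127.0.0.1") -> list:
    """Open `count` listeners that all share one kernel-chosen port."""
    first = make_listener(host, 0)  # port 0: let the kernel pick a free port
    port = first.getsockname()[1]
    listeners = [first]
    for _ in range(count - 1):
        listeners.append(make_listener(host, port))
    return listeners


if __name__ == "__main__":
    listeners = open_listeners(4)
    print("shared port:", listeners[0].getsockname()[1])
    for sock in listeners:
        sock.close()
```

Without the `setsockopt` call, the second `bind()` on the same port would fail with "Address already in use".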
Suggestions
● Prefer open source solutions
○ when things break, you want to be able to fix them
● Be skeptical
○ pick any software, chances are it is crap
○ +1 for open source, you can “peek under the hood”
● Do not use tools you do not fully understand
○ or as I’d rather say...
#4 Be wary of cargo-cult
software engineering
TCP_TW_RECYCLE
Enable fast recycling TIME-WAIT sockets. Default value is 0. It should not be changed without advice/request of technical experts.
Linux will drop any segment from the remote host whose timestamp is not strictly bigger than the latest recorded timestamp
TCP_TW_RECYCLE + NAT = MADNESS
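To see why the combination with NAT goes wrong, here is a toy model of the per-host timestamp rule quoted above. It is a simplified sketch, not the kernel's actual logic: the function name, the client names, the IP address and the timestamp values are all invented for illustration. The key assumption is that two clients behind one NAT gateway share a source IP but have unrelated timestamp clocks.

```python
def accepted_segments(segments):
    """Apply the rule 'drop unless the timestamp is strictly bigger than
    the latest one recorded for this remote host' to a list of
    (source_ip, client, timestamp) tuples."""
    latest_by_host = {}
    accepted = []
    for source_ip, client, timestamp in segments:
        last = latest_by_host.get(source_ip)
        if last is not None and timestamp <= last:
            continue  # segment dropped: timestamp is not strictly bigger
        latest_by_host[source_ip] = timestamp
        accepted.append((client, timestamp))
    return accepted


NAT_IP = "203.0.113.7"  # documentation-range address, hypothetical gateway
segments = [
    (NAT_IP, "laptop", 5000),
    (NAT_IP, "phone", 120),    # different clock, same public IP
    (NAT_IP, "laptop", 5010),
    (NAT_IP, "phone", 130),
]

print(accepted_segments(segments))
# → [('laptop', 5000), ('laptop', 5010)]
```

In this model the server sees both clients as one remote host, so the client with the "older" clock is silently ignored.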
#5 High Availability is much more than just redundancy
Impact
Frequency
Time to recover
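One rough way to see how these three factors combine is a back-of-the-envelope calculation. The model below is an assumption made for illustration, not something stated in the slides: failures are treated as independent, and each one affects a fixed fraction of users for a fixed recovery time. The function names and the sample figures are hypothetical.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def expected_downtime_minutes(failures_per_year: float,
                              minutes_to_recover: float,
                              impact_fraction: float) -> float:
    """User-weighted downtime per year: frequency x time to recover x impact."""
    return failures_per_year * minutes_to_recover * impact_fraction


def availability(downtime_minutes: float) -> float:
    """Fraction of the year during which the service is usable."""
    return 1 - downtime_minutes / MINUTES_PER_YEAR


# Hypothetical figures: one failure a month, 30 minutes to recover,
# each failure affecting half of the users.
downtime = expected_downtime_minutes(12, 30, 0.5)
print(downtime)  # → 180.0
print(f"{availability(downtime):.4%}")
```

Halving any one of the three inputs halves the downtime, which is why redundancy alone (which mostly reduces impact) is only part of the picture.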
Failure impact
● Redundant hardware
● Redundant software components
But there’s more!
● Graceful degradation
● Incremental rollouts
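Graceful degradation, one of the ways to limit failure impact listed above, can be sketched as a wrapper that serves a cheaper answer when the primary path fails. This is only a minimal illustration. The wrapper `with_fallback` and the two recommendation functions are hypothetical and do not describe the system from the talk.

```python
def with_fallback(primary, fallback):
    """Return a callable that tries `primary` and, if it raises,
    serves the result of `fallback` instead of failing the request."""
    def call(*args, **kwargs):
        try:
            return primary(*args, **kwargs)
        except Exception:
            # Degrade: a less precise answer beats no answer at all.
            return fallback(*args, **kwargs)
    return call


def personalized_recommendations(user_id):
    # Stand-in for a backend that is currently down.
    raise TimeoutError("recommendation backend unavailable")


def popular_items(user_id):
    # Static, precomputed result that needs no backend call.
    return ["item-1", "item-2"]


recommend = with_fallback(personalized_recommendations, popular_items)
print(recommend(42))
# → ['item-1', 'item-2']
```

A production version would usually also log the failure and bound how long the primary call may take, so that the degraded path is visible to monitoring.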
Failure frequency
But then also:
● proven technology
● high quality hardware
● automation (to avoid errors)
Time to recover
● Effective monitoring
○ realtime
○ reliable
○ understandable
○ thorough
○ meaningful
○ actionable
● Rollback / rollforward
● Automation (for speed)
Our response plan goes something like this...
AaaaaAAaaaah
...but be prepared to improvise
“Processes designed for ordinary times are not resilient in a crisis and need to be changed.”
Dave Snowden
Easier said than done
“No, improvising is wonderful. But, the thing is that you cannot improvise unless you know exactly what you're doing.”
Christopher Walken
Improvisation requires
● In house expertise
● Lots and lots of experience
● Developers on call
● Practice (drills, e.g. chaos monkeys)
Also from Walken...
“At its best, life is completely unpredictable.”
“Everybody has to be a little lucky, I think.”
“I try not to worry about things I can't do anything about.”
#6 Embrace diversity
#7 Monitoring is essential
… and we can do way better
No one size fits all
● “Monitor everything”, like “100% test coverage”, is a nice slogan, nothing more.
● Each environment requires a slightly different solution
● Balance between data availability, cost and ability to keep it actionable
We are doing logging wrong
● Unstructured
● Inconsistent
● Poor defaults
● Complex, obscure components
● A huge waste of computing power
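As a contrast to the unstructured, inconsistent logs criticized above, here is a minimal sketch of structured logging with Python's standard `logging` module. It emits one JSON object per event with a fixed set of keys. The `JsonFormatter` class, the `fields` convention passed through `extra`, the logger name and the field names are all assumptions made for this example, not the setup used in the talk.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object on one line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "event": record.getMessage(),
        }
        # Structured fields arrive via `extra={"fields": {...}}`.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    logger = logging.getLogger("frontend")  # hypothetical component name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


stream = io.StringIO()
log = build_logger(stream)
log.info("request_served", extra={"fields": {"request_id": "r-123", "latency_ms": 4}})
print(stream.getvalue().strip())
```

Because every line parses as JSON with predictable keys, downstream tools can filter on `request_id` or aggregate `latency_ms` without fragile regular expressions.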
We need a complete overview
● Logs
● Metrics
● Alerts
● Together, coherent, cross-referenced
○ correlating different stores poses challenges
“Human beings, who are almost unique in having the ability to learn from the experience of others, are also remarkable for their apparent disinclination to do so.”
Douglas Adams