reliability patterns for distributed applications
![Page 1: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/1.jpg)
Reliability Patterns for Distributed Applications
Andrew Hamilton
![Page 2: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/2.jpg)
Reliability Patterns for Web Applications
Andrew Hamilton
![Page 3: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/3.jpg)
$ whoami
![Page 4: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/4.jpg)
$ whoami
Site Reliability Engineer
Development and Operations but NOT a DevOps Engineer
Developer productivity
Zefr, Prevoty, Twitter, Eucalyptus, CSUN PTG
![Page 5: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/5.jpg)
What is reliability?
![Page 6: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/6.jpg)
What is reliability?
Your application working when your users need it
A user’s #1 unstated feature request
Your application telling you when things aren’t working and being able to fix things quickly
![Page 7: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/7.jpg)
Reliability does not completely remove failure
![Page 8: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/8.jpg)
Reliability does not completely remove failure
Failure will happen no matter what you do
Perfection is not an attainable goal
Deal with failure gracefully and reduce the impact of failures
Reduce the chance of failure by building repeatable and reliable automated processes
![Page 9: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/9.jpg)
Where should you begin?
![Page 10: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/10.jpg)
Where should you begin? Build your app
Build packages for your code (zips/tarballs, RPMs/Debs, containers)
Automate builds with a CI environment (Jenkins, TravisCI)
![Page 11: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/11.jpg)
Where should you begin? Test your app
Automate testing of your app
Unit tests should be easy to run and quick (< 10m)
Functional tests can take longer and can become less reliable
Manual testing still has a place, but keep it to a minimum
![Page 12: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/12.jpg)
Where should you begin? App deployment
Automate the entire process from VM/Container setup to app deployment
Make it multi-environment (dev, stage, prod)
Make it one command
Needs to be repeatable and reliable
![Page 13: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/13.jpg)
Where should you begin? Configuration
App configurations should be easy to change
Don’t hardcode values that should be configurable
12-factor apps
Config files (YAML, JSON, key:value)
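The configuration advice above can be sketched as a small loader that layers defaults, an optional JSON file, and environment variables. This is only an illustration: the `load_config` name, the `APP_` prefix, and the settings in `DEFAULTS` are assumptions made for this example, not something from the talk.

```python
import json
import os

# Hypothetical defaults; a real app would list its own settings here.
DEFAULTS = {"listen_port": 8080, "log_level": "info", "db_host": "localhost"}


def load_config(path=None, environ=None):
    """Merge defaults, an optional JSON file, and APP_* environment variables."""
    environ = os.environ if environ is None else environ
    config = dict(DEFAULTS)

    # Values from a JSON config file override the defaults.
    if path is not None and os.path.exists(path):
        with open(path) as fh:
            config.update(json.load(fh))

    # Environment variables override the file, in the 12-factor style.
    for key, default in DEFAULTS.items():
        raw = environ.get("APP_" + key.upper())
        if raw is not None:
            # Coerce to the type of the default so ports stay integers.
            config[key] = type(default)(raw)
    return config


if __name__ == "__main__":
    print(load_config(environ={"APP_LISTEN_PORT": "9090"}))
```

Nothing is hardcoded at the call sites: changing a port or log level becomes a deploy-time decision rather than a code change.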
![Page 14: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/14.jpg)
Where should you begin? DevOps
Communication is key for reliability
Make sure that people in development and operations know what’s happening with your app
![Page 15: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/15.jpg)
But really this isn’t enough...
![Page 16: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/16.jpg)
Where should you begin? DevOps
Communication is key for reliability
Make sure that people in development, operations, product management, testing, security, design, marketing, and management know what’s happening with your app
Make sure that other teams know when something is happening that may affect their app
![Page 17: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/17.jpg)
What’s next?
![Page 18: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/18.jpg)
What’s next? Logging
Find a logging format and standardize on it
Try to find an easy-to-understand, structured logging format
Make sure your logger is leveled (Debug, Info, Error, Panic)
Expect to use log messages at 3am
![Page 19: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/19.jpg)
What’s next? Logging
func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        log.Print(err) // Don’t do this!
    }
}
![Page 20: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/20.jpg)
What’s next? Logging
func myFunc() {
    rtn, err := doSomething(val1, val2)
    if err != nil {
        log.Printf("doSomething call failed in myFunc: %s", err)
    }
}
![Page 21: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/21.jpg)
What’s next? Logging
time=2012:11:24T17:32:23.3435 type=error func=myFunc host=host1 line=4 msg="doSomething call failed in myFunc: Error marshaling JSON"
{
    "time": "2012:11:24T17:32:23.3435",
    "host": "host1",
    "type": "error",
    "func": "myFunc",
    "line": 4,
    "msg": "doSomething call failed in myFunc: Error marshaling JSON"
}
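One way to get JSON lines like the example above is a custom formatter on top of Python's standard `logging` module. The sketch below is an assumption about how this could be done, not the speaker's setup: the `JSONFormatter` and `make_logger` names are hypothetical, and the field names simply mirror the slide.

```python
import io
import json
import logging
import socket


class JSONFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        return json.dumps({
            "time": self.formatTime(record),
            "host": socket.gethostname(),
            "type": record.levelname.lower(),
            "func": record.funcName,
            "line": record.lineno,
            "msg": record.getMessage(),
        })


def make_logger(stream, name="app"):
    """Build a leveled logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JSONFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def my_func(logger):
    # Say what failed and where, not just the bare error.
    logger.error("doSomething call failed in my_func: %s", "Error marshaling JSON")


if __name__ == "__main__":
    buf = io.StringIO()
    my_func(make_logger(buf))
    print(buf.getvalue(), end="")
```

Because every line is a self-describing object, a log aggregator can filter on `type` or `func` without fragile regular expressions.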
![Page 22: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/22.jpg)
What’s next? Aggregate Logging
One place to view all of your app’s logs
With structured logging you can pull metrics out of your logs
ELK stack - Elasticsearch, Logstash, Kibana
Splunk
![Page 23: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/23.jpg)
What’s next? Monitoring
https://twitter.com/sadserver/status/689588269047132160
![Page 24: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/24.jpg)
What’s next? Monitoring
Needs to be relatively real time (sub 15s)
Start with standard metrics on all requests (counts, latencies)
Add more metrics where you need them
Create a dashboard with important info
statsd/Graphite/Grafana, Prometheus, DataDog, Netuitive
Nagios is not sufficient for application monitoring
![Page 25: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/25.jpg)
What’s next? Monitoring
![Page 26: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/26.jpg)
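What’s next? Monitoring
# Assumes: from time import time; from flask import g; statsd is a DogStatsD-style client
@app.before_request
def before_request():
    g.request_time = time()

@app.after_request
def after_request(response):
    total_time = (time() - g.request_time) * 1000  # latency in milliseconds
    statsd.timing("app.latency", total_time, ["name:app"], 1)
    statsd.increment("app.request", 1, ["name:app", "status_code:{0}".format(response.status_code)], 1)
    return response  # Flask requires after_request handlers to return the response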
![Page 27: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/27.jpg)
What’s next? Alerting
Uses the monitoring system’s data to make sure the app is healthy
Sends out emails to the on-call dev or ops when issues occur
Creating good alerts requires knowledge of the app
PagerDuty, BigPanda, VictorOps
Area that still needs some work
![Page 28: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/28.jpg)
What’s next? Remove state
State is something like session information
Move it to an external store all servers can access
Memory-based stores are the norm (memcached, Redis)
Allows you to horizontally scale your app behind a load balancer
![Page 29: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/29.jpg)
What’s next? Have more than 1 of everything
You need more than one instance of your service
It shouldn’t just be a primary/backup either
Remove your single points of failure as quickly as possible
![Page 30: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/30.jpg)
What’s next? Retries and backoff
Things can fail from time to time
Resending a request can be helpful
Be careful not to DDoS another app because it went down and came back
Exponential backoff is good
![Page 31: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/31.jpg)
What’s next? Retries and backoff
import time

def my_func(val1, val2):
    data = None
    err = None
    for n in range(10):
        data, err = get_data(val1, val2)
        if err is None:
            break
        time.sleep((2**n) / 1000)  # sleep for 2^n milliseconds

    if err is not None:
        return None, err

    return do_something(data)
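The loop above retries on a fixed schedule, so many clients that failed together will also retry together. A common refinement, not shown in the talk, is to cap the delay and add random jitter. The sketch below assumes exception-based errors, and the names `backoff_delays` and `call_with_retries` are hypothetical.

```python
import random
import time


def backoff_delays(retries, base=0.001, cap=1.0):
    """Return the delay ceiling in seconds before each retry: base * 2^n, capped."""
    return [min(cap, base * (2 ** n)) for n in range(retries)]


def call_with_retries(func, retries=5, base=0.001, cap=1.0, sleep=time.sleep):
    """Call func until it succeeds or the retries run out, then re-raise."""
    delays = backoff_delays(retries, base, cap)
    for attempt in range(retries + 1):
        try:
            return func()
        except Exception:
            if attempt == retries:
                raise  # out of retries, surface the last error
            # Jitter: sleep a random amount up to the backoff ceiling so that
            # many clients do not hammer a recovering service in lockstep.
            sleep(random.uniform(0, delays[attempt]))
```

The `sleep` parameter exists only so the wait can be replaced in tests. In production the default `time.sleep` is used.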
![Page 32: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/32.jpg)
I’m bored! What’s cool?
![Page 33: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/33.jpg)
I’m bored! What’s cool? Canary deploys
“Canary in the coal mine”
Deploy new code to a single instance
Watch that instance with your monitoring stack
Add more new instances, remove old instances gradually
Helps ensure that a release is good before it takes all traffic
Can be automated
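As a rough illustration of how the canary steps above might be automated, the sketch below compares a canary's error rate with the old fleet's and decides how many new instances to run next. The thresholds, the doubling schedule, and both function names are assumptions chosen for the example. Real inputs would come from your monitoring stack.

```python
def canary_is_healthy(baseline, canary, max_error_ratio=1.5, min_requests=100):
    """Compare a canary's error rate with the old fleet's.

    baseline and canary are dicts with "requests" and "errors" counts.
    """
    if canary["requests"] < min_requests:
        return False  # not enough traffic yet to judge
    baseline_rate = baseline["errors"] / max(baseline["requests"], 1)
    canary_rate = canary["errors"] / canary["requests"]
    # A small absolute floor keeps a zero-error baseline from failing every canary.
    allowed = max(baseline_rate * max_error_ratio, 0.001)
    return canary_rate <= allowed


def next_canary_count(current, total):
    """Double the number of new instances after each healthy step, up to the fleet size."""
    return min(total, max(1, current * 2))
```

A deploy script could call `canary_is_healthy` after each step and either continue with `next_canary_count` or roll back.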
![Page 34: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/34.jpg)
I’m bored! What’s cool? Microservices
The Unix philosophy brought to apps
Each service does only one thing
Requires a good build and deployment system
Requires monitoring, logging, alerting
Monolith → microservices
![Page 35: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/35.jpg)
I’m bored! What’s cool? Feature flags
Allows features to be turned on and off inside the code base
Start off with a configuration file
Make sure to read the configuration into memory
Can be left in after testing or removed
Can be dynamic eventually
![Page 36: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/36.jpg)
I’m bored! What’s cool? Feature flags
def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    # run some code
    pass
![Page 37: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/37.jpg)
I’m bored! What’s cool? Feature flags
def my_func():
    rtn = do_something()
    print(rtn)

def do_something():
    # new code added here...YOLO
    pass
![Page 38: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/38.jpg)
I’m bored! What’s cool? Feature flags
import os

ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    if ff["do_something_ver"] == 2:
        rtn = do_something_2()
    else:
        rtn = do_something()
    print(rtn)

def do_something():
    # run some code
    pass

def do_something_2():
    # new way to do something
    pass
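The slide leaves `read_config` undefined. A minimal sketch of what it might look like follows, assuming the flags live in a JSON file that is read into memory once at startup. The `defaults` parameter and the stub return values are additions for illustration only.

```python
import json
import os


def read_config(path, defaults=None):
    """Read feature flags from a JSON file into a plain dict, once, at startup."""
    flags = dict(defaults or {})
    try:
        with open(path) as fh:
            flags.update(json.load(fh))
    except FileNotFoundError:
        pass  # no flag file: fall back to the defaults (the old code path)
    return flags


def do_something():
    return "old"  # stand-in for the existing behavior


def do_something_2():
    return "new"  # stand-in for the new behavior


def my_func(ff):
    # .get() keeps the old behavior when the flag is missing entirely.
    if ff.get("do_something_ver") == 2:
        return do_something_2()
    return do_something()


if __name__ == "__main__":
    ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))
    print(my_func(ff))
```

Defaulting to the old path when the file or flag is missing means a bad or absent flag file cannot switch new code on by accident.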
![Page 39: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/39.jpg)
I’m bored! What’s cool? Dark deploys
Test new features and functionality with real users
They won’t know that anything has changed
Runs both the old and new code and compares their output
Great when concurrency is easy
Feature flags can be useful
![Page 40: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/40.jpg)
I’m bored! What’s cool? Dark deploys
ff = read_config(os.getenv("FLAGS_CONF", "flags.json"))

def my_func():
    rtn = do_something()

    if ff["run_do_something_2"]:
        rtn2 = do_something_2()
        if rtn != rtn2:
            log.error("do_something and do_something_2 do not match! {0} != {1}".format(rtn, rtn2))

    print(rtn)
![Page 41: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/41.jpg)
I’m bored! What’s cool? Loose coupling
Graceful degradation
Services continue to run when dependency services fail
Output might not be complete but will be as complete as possible
Third party apps with issues won’t take down your app
Important for both backend and frontend
Common with data stores
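As a small sketch of graceful degradation, the function below builds a response from one required dependency and one optional one, and still returns something useful when the optional call fails. The page structure and every name in it are hypothetical.

```python
def build_profile_page(user_id, fetch_profile, fetch_recommendations):
    """Assemble a response, degrading gracefully if an optional dependency fails."""
    # The profile is required: let its errors propagate to the caller.
    page = {"profile": fetch_profile(user_id)}
    try:
        page["recommendations"] = fetch_recommendations(user_id)
        page["degraded"] = False
    except Exception:
        # The optional dependency is down: return what we have and say so.
        page["recommendations"] = []
        page["degraded"] = True
    return page
```

Marking the response as degraded lets the frontend hide the missing section instead of showing an error page.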
![Page 42: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/42.jpg)
I’m bored! What’s cool? Circuit breakers
Keep track of issues with external services and short circuit calls to them
Design pattern that’s becoming more popular
Netflix Hystrix -- Java
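To make the pattern concrete, here is a minimal circuit breaker sketch. It is not Hystrix's API or implementation: the class name, the thresholds, and the injectable `clock` are assumptions chosen to keep the example small and testable.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # seconds to wait before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of calling the struggling service.
                raise CircuitOpenError("circuit open, call skipped")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Callers would wrap each external request, for example `breaker.call(fetch_user, user_id)` with a hypothetical `fetch_user`, and treat `CircuitOpenError` as a signal to serve a degraded response.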
![Page 43: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/43.jpg)
I’m bored! What’s cool? Chaos engineering
Inject faults into your production traffic to test your app
Tests how your apps truly cope with issues before they happen
Helps make sure that devs and ops understand the app
Only runs during business hours
![Page 44: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/44.jpg)
Reliability doesn’t magically happen!
![Page 45: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/45.jpg)
Reliability doesn’t magically happen
It must be worked on
It must be prioritized properly and not just assumed to happen organically
![Page 46: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/46.jpg)
Further reading
![Page 47: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/47.jpg)
Further reading
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation (Humble and Farley)
http://www.amazon.com/Continuous-Delivery-Deployment-Automation-Addison-Wesley/dp/0321601912
![Page 48: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/48.jpg)
Further reading
The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Vol 2 (Limoncelli, Chalup, Hogan)
http://www.amazon.com/Practice-Cloud-System-Administration-Distributed/dp/032194318X
![Page 49: Reliability Patterns for Distributed Applications](https://reader031.vdocument.in/reader031/viewer/2022030210/58a48ad91a28ab58738b668b/html5/thumbnails/49.jpg)
Further reading
http://martinfowler.com/
http://www.devopsweekly.com/ (weekly newsletter of articles)
https://blog.cloudflare.com/
https://blog.twitter.com/engineering
http://highscalability.com/