devoxx2017
TRANSCRIPT
![Page 1: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/1.jpg)
#DevoxxUS
Architecting for failures in microservices:
Patterns and lessons learned
Bhakti Mehta
@bhakti_mehta
![Page 2: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/2.jpg)
INTRODUCTION
➤ Platform@Atlassian
➤ In the past Platform Lead at BlueJeans Network
➤ Worked at Sun Microsystems/Oracle for 13 years
➤ Committer to numerous open source projects including GlassFish Application Server
![Page 3: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/3.jpg)
MY RECENT BOOK
![Page 4: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/4.jpg)
PREVIOUS BOOK
![Page 5: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/5.jpg)
ATLASSIAN
![Page 6: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/6.jpg)
Microservices
![Page 7: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/7.jpg)
PATH TO MICROSERVICES
➤ Advantages
➤ Simplicity
➤ Isolation of problems
➤ Scale up and scale down
➤ Easy deployment
➤ Polyglotism and heterogeneity
![Page 8: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/8.jpg)
Sounds great!!
![Page 9: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/9.jpg)
In reality…
![Page 10: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/10.jpg)
MONOLITHS TO MICROSERVICES
![Page 11: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/11.jpg)
RESILIENT SYSTEM
➤ Processes transactions, even when there are transient impulses, persistent stresses
➤ Functions even when there are component failures disrupting normal processing
➤ Accepts failures will happen
➤ Design for crumple zones
![Page 12: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/12.jpg)
RESILIENT SYSTEM
How should the customer perceive you?
Be the duck
Behave normally in the face of outages, even when the system is not performing as expected
![Page 13: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/13.jpg)
RESILIENT SYSTEM
How does the system need to function? Heal quickly, before customers notice
![Page 14: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/14.jpg)
KINDS OF FAILURES
➤ Challenges at scale
➤ Integration point failures
➤ Network errors
➤ Semantic errors
➤ Slow responses
➤ Outright hang
➤ GC issues
![Page 15: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/15.jpg)
THE NEW WAY OF LIFE
You build it, you run it! You own it, you plan for it!
![Page 16: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/16.jpg)
![Page 17: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/17.jpg)
PERFECT STORM
![Page 18: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/18.jpg)
THINGS THAT WENT WRONG
➤ Bad node in load balancer group
➤ Deployment of new code
➤ Gradual increase in latency
➤ Abuse by clients
➤ Not enough prod-like data in staging
➤ No easy way to trigger stale/lenient fallbacks
➤ Too few alerts
![Page 19: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/19.jpg)
LESSONS LEARNED
Errors can be frequent, but latencies are consequential!
![Page 20: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/20.jpg)
ACTION PLAN
➤ Circuit breakers
➤ Fallback (lenient acceptable values)
➤ Predictive caching
➤ Reduce surface area by clients
➤ Load tests
➤ Failure injection testing
➤ Monitor
➤ Alerts
Phases: development time, before a deploy, post deploy
![Page 21: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/21.jpg)
The more you sweat on the field, the less you bleed in war!
![Page 22: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/22.jpg)
RESILIENCY PLANNING STAGE 1
➤ When developing code
➤ Avoiding Cascading failures
➤ Circuit breaker
➤ Timeouts
➤ Retry
➤ Bulkhead
➤ Cache optimisations
➤ Avoid malicious clients
➤ Rate limiting
![Page 23: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/23.jpg)
RESILIENCY PLANNING STAGE 2
➤ Planning for dealing with failures before deploy to prod
➤ Load test
➤ A/B test
➤ Longevity
➤ Dark launch features
![Page 24: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/24.jpg)
RESILIENCY PLANNING STAGE 3
➤ Watching out for failures after deploy to prod
➤ Health check
➤ Metrics
![Page 25: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/25.jpg)
![Page 26: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/26.jpg)
CASCADING FAILURES
Caused by chain reactions
For example:
One node in a load-balanced group fails
The other nodes need to pick up its work
Eventually performance degrades across the whole group
![Page 27: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/27.jpg)
HYSTRIX: CIRCUIT BREAKER PATTERN
• Fault tolerance pattern as a library
• Automatic fail fast
• Automatic fail over
• Metrics: circuit breaker open, calls/sec, execution time (median, 90th, 95th, 99th percentile)
• If a command has had a high failure rate in the last 10 seconds, it is unlikely to succeed now
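The fail-fast idea on this slide can be sketched in a few lines. This is a minimal illustration in Python, not Hystrix's API: the class name, the 50% threshold, the minimum call count and the cooldown are all assumptions chosen for the example; only the 10-second window comes from the slide.

```python
import time
from collections import deque


class CircuitBreaker:
    """Fail fast when the recent failure rate of a command is too high."""

    def __init__(self, window_seconds=10.0, failure_threshold=0.5,
                 min_calls=5, cooldown_seconds=5.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.failure_threshold = failure_threshold
        self.min_calls = min_calls
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock
        self.results = deque()   # (timestamp, succeeded) pairs inside the window
        self.opened_at = None    # None means the circuit is closed

    def _failure_rate(self, now):
        # Drop results that have fallen out of the rolling window.
        while self.results and now - self.results[0][0] > self.window_seconds:
            self.results.popleft()
        if len(self.results) < self.min_calls:
            return 0.0           # too few calls to judge
        failures = sum(1 for _, ok in self.results if not ok)
        return failures / len(self.results)

    def call(self, command, fallback):
        now = self.clock()
        trial = False
        if self.opened_at is not None:
            if now - self.opened_at < self.cooldown_seconds:
                return fallback()    # open: fail fast, skip the command
            trial = True             # cooldown over: let one call through
        try:
            value = command()
        except Exception:
            self.results.append((now, False))
            if trial or self._failure_rate(now) >= self.failure_threshold:
                self.opened_at = now # open (or re-open) the circuit
            return fallback()
        if trial:
            self.opened_at = None    # trial succeeded: close the circuit
            self.results.clear()
        self.results.append((now, True))
        return value
```

A caller would wrap each remote call as `breaker.call(fetch_profile, lambda: cached_profile)`, where both names are placeholders for the real command and its lenient fallback value.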
![Page 28: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/28.jpg)
TIMEOUTS PATTERN
![Page 29: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/29.jpg)
RETRY PATTERN AND TIMEOUTS
➤ Retry on network failures, timeouts, or server errors
➤ Helps with transient errors such as dropped connections or server failover
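A retry helper along these lines might look like the sketch below. It assumes the wrapped operation enforces its own timeout and signals it with `TimeoutError`; the function name, attempt count and delay values are illustrative. The random jitter keeps many clients from retrying in lockstep after a shared failure.

```python
import random
import time


def retry(operation, attempts=3, base_delay=0.1, max_delay=2.0,
          retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call operation(); retry transient failures with jittered backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise    # out of attempts: surface the last error
            # Exponential backoff capped at max_delay, with full jitter.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))
```

Only errors listed in `retry_on` are retried; anything else propagates immediately, so a semantic error such as a bad request is not repeated.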
![Page 30: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/30.jpg)
BULKHEAD
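A bulkhead partitions resources so that one failing dependency cannot exhaust them for everyone else. One common way to do this is to cap the number of concurrent calls per dependency; the sketch below assumes that approach, and the class name and limit are illustrative.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency; reject when the cap is hit."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, command, fallback):
        # Do not queue: if every slot is busy, answer with the fallback
        # so callers are not tied up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            return fallback()
        try:
            return command()
        finally:
            self._slots.release()
```

Giving each downstream service its own `Bulkhead` instance means a hang in one service fills only its own slots.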
![Page 31: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/31.jpg)
RATE LIMITING
![Page 32: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/32.jpg)
RATE LIMITING
➤ Restricting the number of requests that can be made by a client
➤ Client can be identified based on the access token used
➤ Additionally clients can be identified based on IP address
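The per-client limit described above can be sketched with a token bucket, one of several possible algorithms and not necessarily the one used in the talk. The key is whatever identifies the client (an access token or an IP address); the rate and burst values are assumptions for the example.

```python
import time


class RateLimiter:
    """Token bucket per client key (an access token or an IP address)."""

    def __init__(self, rate_per_second, burst, clock=time.monotonic):
        self.rate = rate_per_second
        self.burst = burst
        self.clock = clock
        self.buckets = {}   # client key -> (tokens, last refill time)

    def allow(self, client_key):
        now = self.clock()
        tokens, last = self.buckets.get(client_key, (self.burst, now))
        # Refill in proportion to elapsed time, never above the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens < 1:
            self.buckets[client_key] = (tokens, now)
            return False    # a service would typically answer HTTP 429 here
        self.buckets[client_key] = (tokens - 1, now)
        return True
```

Each client gets its own bucket, so one abusive client runs dry without affecting the others.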
![Page 33: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/33.jpg)
CACHE OPTIMIZATIONS
Getting from first level cache
Getting from second level cache
Getting from the DB
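The lookup order on this slide can be sketched as follows. The dictionaries stand in for real caches (for instance an in-process cache and a shared one), and `load_from_db` is a hypothetical loader supplied by the caller.

```python
class TieredCache:
    """Look in the first-level cache, then the second level, then the DB."""

    def __init__(self, second_level, load_from_db):
        self.first_level = {}             # in-process, fastest
        self.second_level = second_level  # stand-in for a shared cache
        self.load_from_db = load_from_db  # slowest, source of truth

    def get(self, key):
        """Return (value, tier that answered)."""
        if key in self.first_level:
            return self.first_level[key], "first level"
        if key in self.second_level:
            value = self.second_level[key]
            self.first_level[key] = value  # promote for the next lookup
            return value, "second level"
        value = self.load_from_db(key)
        self.second_level[key] = value
        self.first_level[key] = value
        return value, "db"
```

Returning the tier alongside the value is only for illustration; it makes it easy to count how often each level answers.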
![Page 34: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/34.jpg)
TALE OF THE NEVER LEAVING CACHE ENTRIES
➤ Longer TTL
➤ Not evicted soon enough
➤ Bottlenecks
➤ Failures
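One way to keep entries from overstaying is to give each one an expiry time and to support explicit invalidation. The sketch below assumes a simple evict-on-read policy; the class name and TTL value are illustrative.

```python
import time


class TTLCache:
    """Cache whose entries expire after ttl_seconds or on invalidation."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self.entries = {}   # key -> (value, expires_at)

    def put(self, key, value):
        self.entries[key] = (value, self.clock() + self.ttl)

    def get(self, key, default=None):
        item = self.entries.get(key)
        if item is None:
            return default
        value, expires_at = item
        if self.clock() >= expires_at:
            del self.entries[key]   # expired: evict on read
            return default
        return value

    def invalidate(self, key):
        # Call this when the underlying data changes, rather than
        # waiting for the TTL to run out.
        self.entries.pop(key, None)
```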
![Page 35: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/35.jpg)
LOGGING BEST PRACTICES
➤ Use a detailed, consistent pattern across service logs
➤ Obfuscate sensitive data
➤ Identify caller or initiator as part of logs
➤ Do not log payloads
➤ Request tracing across services
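Several of these practices can be combined in one logging filter. The sketch below uses Python's standard `logging` module; the service name `orders`, the caller name, and the choice of email addresses as the sensitive data to obfuscate are assumptions for the example.

```python
import io
import logging
import re

# Illustrative pattern for one kind of sensitive data.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+")


class RequestContextFilter(logging.Filter):
    """Attach request id and caller to each record and redact emails."""

    def __init__(self, request_id, caller):
        super().__init__()
        self.request_id = request_id
        self.caller = caller

    def filter(self, record):
        record.request_id = self.request_id
        record.caller = self.caller
        # Redact after argument substitution so values passed as
        # logging arguments are covered too.
        record.msg = EMAIL.sub("<redacted-email>", record.getMessage())
        record.args = None
        return True


def build_logger(stream, request_id, caller):
    # One logger per request keeps the sketch short.
    logger = logging.getLogger("orders." + request_id)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(
        "%(levelname)s service=orders request_id=%(request_id)s "
        "caller=%(caller)s %(message)s"))
    handler.addFilter(RequestContextFilter(request_id, caller))
    logger.addHandler(handler)
    return logger
```

Passing the same request id on to downstream services lets their logs be joined into one trace of the request.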
![Page 36: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/36.jpg)
RESILIENCE PLANNING STAGE 2
➤ Before deploy
➤ Load testing
➤ Longevity testing
➤ Capacity planning
![Page 37: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/37.jpg)
LOAD TESTING
➤ Ensure that you test for load on APIs
➤ Plan for longevity testing
![Page 38: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/38.jpg)
CAPACITY PLANNING
➤ Anticipate growth
➤ Design for handling exponential growth
![Page 39: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/39.jpg)
RESILIENCE PLANNING STAGE 3
➤ After deploy
➤ Health check
➤ Metrics and Monitoring
➤ Phased rollout of features
![Page 40: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/40.jpg)
Health Check
![Page 41: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/41.jpg)
HEALTH CHECK
➤ Memory
➤ CPU
➤ Threads
➤ Error rate
➤ If any of the checks exceeds a threshold, send an alert
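The threshold rule above can be sketched as a small function. The metric names and limits are hypothetical, and `send_alert` stands in for whatever paging or notification hook a team uses.

```python
def health_check(readings, thresholds):
    """Return the names of checks whose reading exceeds its threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if readings.get(name, 0) > limit)


def run_checks(readings, thresholds, send_alert):
    """Alert once per breached check and report overall status."""
    breached = health_check(readings, thresholds)
    for name in breached:
        send_alert(f"{name} at {readings[name]} exceeds {thresholds[name]}")
    return "unhealthy" if breached else "healthy"
```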
![Page 42: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/42.jpg)
Metrics and Monitoring
![Page 43: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/43.jpg)
METRICS
➤ Response times, throughput
➤ Identify slow running DB queries
➤ GC rate and pause duration
➤ Garbage collection can cause slow responses
➤ Monitor unusual activity
➤ Create alerts when thresholds are exceeded
➤ Run books for actions to be taken on alerts
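Response-time metrics are usually summarized as an average plus percentiles, since the average alone hides slow outliers. The sketch below uses the nearest-rank method, one of several percentile definitions; the function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Average and common percentiles for a batch of latencies."""
    return {
        "avg": sum(latencies_ms) / len(latencies_ms),
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```

An alert rule can then compare, say, `summarize(batch)["p95"]` against a threshold instead of the average.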
![Page 44: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/44.jpg)
Thoughts of the on-call person paged at 3 am, debugging an issue in your code
![Page 45: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/45.jpg)
MONITORING
[Diagram labels: Monitoring server, Environment, CHECKS, ALERTS]
![Page 46: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/46.jpg)
SAVED BY THE METRICS AND ALERTS
➤ MaxDBConnection alert
➤ CPU Utilisation spiking up
➤ Analysed slow running queries
➤ Some SELECT queries were taking very long: average 718 ms, 95th percentile 2030 ms
➤ The cause, initially unidentified, was a bug fix that introduced pagination; the ORDER BY clause needed to match a function-based index
![Page 47: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/47.jpg)
ROLLOUT OF NEW FEATURES
➤ Phasing rollout of new features
➤ Dark launch features
➤ Have a way to turn features off if not behaving as expected
➤ Alerts and more alerts!
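A phased rollout with an off switch can be sketched with a percentage-based feature flag. Hashing the user id into a stable bucket is one common approach and is assumed here; the flag name and percentages are illustrative.

```python
import hashlib


class FeatureFlag:
    """Percentage rollout with a kill switch."""

    def __init__(self, name, rollout_percent=0, enabled=True):
        self.name = name
        self.rollout_percent = rollout_percent  # 0 to 100
        self.enabled = enabled                  # set False to turn it off

    def _bucket(self, user_id):
        # Stable bucket in [0, 100): the same user always lands in the
        # same bucket, so raising the percentage only adds users.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id):
        return self.enabled and self._bucket(user_id) < self.rollout_percent
```

Raising `rollout_percent` in steps phases the feature in, and flipping `enabled` to `False` turns it off for everyone if it misbehaves.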
![Page 48: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/48.jpg)
AWS S3 OUTAGE
➤ S3 outage in US East
➤ A number of services were affected
➤ Third-party services we depend on had degraded performance
➤ Lots of key takeaways from this
![Page 49: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/49.jpg)
Cheat sheet
A Alerts
B Bulkheads
C Circuit breakers
D Data obfuscation
E Eventual consistency
F Fallbacks & Hystrix
G GC settings
H Health checks
I Injecting failure
J Jitter with retries
K Key invalidations
L Logging
M Metrics & monitoring
N Network latencies
O Optimizing queries
P Phased rollouts
Q Queues bounded
R Run books
S Staged deployments
T Timeouts
![Page 50: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/50.jpg)
TAKEAWAY
➤ Inevitability of failures
➤ Expect systems will fail
➤ Failure prevention: plan for failures. Not if, but when
➤ Automate
Keep Calm and Cloud On!
![Page 51: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/51.jpg)
REFERENCES
➤ https://commons.wikimedia.org/wiki/File:Bulkhead_PSF.png
➤ http://www.constructionlawtoday.com/uploads/image/Expect-Delays-sign(1).jpg
➤ http://cdn.idigitaltimes.com/sites/idigitaltimes.com/files/2016/04/27/wolverinex-menapocalpse.jpg
➤ https://www.freevector.com/uploads/vector/preview/13242/FreeVector-Swimming-Duck.jpg
➤ http://weknowyourdreams.com/image.php?pic=/images/happiness/happiness-04.jpg
➤ http://www.fitnessandpower.com/wp-content/uploads/2013/10/military-fitness.jpg
➤ http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2010/10/speed-limit-change-sign-resized_2.jpg
➤ https://www.askideas.com/media/51/Funny-Grumpy-Cat-Some-People-Just-Need-A-Hug-Around-The-Neck-With-A-Rope-Image.jpg
➤ https://www.flickr.com/photos/skynoir/ Beer in hand: skynoir/Flickr/Creative Commons License
![Page 52: Devoxx2017](https://reader031.vdocument.in/reader031/viewer/2022030310/58f9ad18760da3da068b94d1/html5/thumbnails/52.jpg)
#DevoxxUS
Questions
@bhakti_mehta