GlueCon 2011: Jeff Malek from BigDoor
TRANSCRIPT
@JPMALEK
04/13/2023 1
Retrospective from a startup built in the cloud : top 3 big lessons
from the AWS outage on
04.21.2011 plus 4,369 other smaller ones
What a country : entrepreneurial resiliency
“robust systems: highly fault-tolerant, on or off grid. e.g.: our culture wrt entrepreneurs,
AWS, the BD API”
(true story)
Boom
good to be home!
Go Buffs
me: previous startup
teams in 3 countries
highly transactional system
MS tech: IIS/MS SQL Server
co-located, leased/owned hardware
0% in cloud
$75M/yearly rev
me: current startup
systems 100% on AWS
99% free/open-source software
standing on the shoulders of giants
fault tolerance: 3 to 47 important failearnings
and 4,369 less important ones
in the context of our startup, of course
YMMV depending on velocity
Ruger
The Ruger Fault Equivalency
time = money
fault tolerance = time² - risk tolerance
Also known as:
'Fast, good and cheap: pick two'
system design philosophy:
leverage proven, open-source tech in the cloud
to build a scalable, reliable, secure operational foundation
quickly
So how do you achieve the right level of fault tolerance in the cloud?
3 tenets
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication
who here has used AWS?
Tenet #1
prepare a fault-tolerant foundation with scripted repeatability
aka automation
from the start: script the non-interactive install of your tools and OS
custom AMI
Debian: great package management
based on Eric Hammond’s work: http://alestic.com/
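A minimal sketch of what scripting a non-interactive install on Debian might look like. The slides do not show the actual scripts, so the helper names (`noninteractive_env`, `build_install_command`) and the package list are assumptions for illustration only.

```python
import os

def noninteractive_env(base_env=None):
    """Return a copy of the environment that tells debconf not to prompt."""
    env = dict(os.environ if base_env is None else base_env)
    env["DEBIAN_FRONTEND"] = "noninteractive"
    return env

def build_install_command(packages):
    """Build an apt-get command that installs packages without prompting."""
    if not packages:
        raise ValueError("at least one package is required")
    # -y answers yes to prompts; sorting keeps the command reproducible.
    return ["apt-get", "install", "-y"] + sorted(set(packages))

# Hypothetical package list, not taken from the talk.
# To actually run it (as root), something like:
#   subprocess.run(command, env=noninteractive_env(), check=True)
pkgs = ["nginx", "mysql-server", "python"]
command = build_install_command(pkgs)
```

Building the command as data rather than running it inline keeps the install step easy to log, review, and replay on every new instance.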
which will allow you to script the setup/tear-down of your stack
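One way to picture scripted setup/tear-down: register components in dependency order, bring them up in that order, and tear them down in reverse. This is a sketch under assumptions; the `Stack` class and the component names are hypothetical, and real setup functions would call the cloud provider's API.

```python
class Stack:
    """Bring components up in order and tear them down in reverse order."""

    def __init__(self):
        self._steps = []    # (name, setup, teardown) in registration order
        self._running = []  # names of components currently up
        self.log = []       # record of what happened, for auditing

    def add(self, name, setup, teardown):
        self._steps.append((name, setup, teardown))

    def up(self):
        for name, setup, _ in self._steps:
            try:
                setup()
            except Exception:
                # Roll back whatever already started, then re-raise.
                self.down()
                raise
            self._running.append(name)
            self.log.append("up:" + name)

    def down(self):
        teardowns = {name: td for name, _, td in self._steps}
        while self._running:
            name = self._running.pop()
            teardowns[name]()
            self.log.append("down:" + name)

# Hypothetical components with no-op setup and teardown.
stack = Stack()
for component in ["database", "app-server", "load-balancer"]:
    stack.add(component, setup=lambda: None, teardown=lambda: None)
stack.up()
stack.down()
```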
which will allow you to script system tests
integrity (3-4K tests)
performance (30-40K tests)
load, capacity (2-4M requests)
A/B system test results : MySQL Percona Upgrade
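An A/B comparison of two scripted test runs can be as simple as comparing median latencies before and after a change. The sketch below assumes that metric; the function name `compare_runs` and the latency numbers are invented for illustration and are not the results shown in the talk.

```python
from statistics import median

def compare_runs(baseline_ms, candidate_ms):
    """Return the percent change in median latency from baseline to candidate."""
    base = median(baseline_ms)
    cand = median(candidate_ms)
    if base == 0:
        raise ValueError("baseline median must be non-zero")
    return (cand - base) / base * 100.0

# Made-up latencies in milliseconds, for illustration only.
before_upgrade = [100, 120, 110, 130, 90]
after_upgrade = [80, 95, 88, 99, 85]
change = compare_runs(before_upgrade, after_upgrade)
```

A negative result means the candidate run was faster than the baseline.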
That’s how 1 person
set up and managed a network
comprised of 90+/- server instances for 1.5 years
while serving various other roles without having to leave their chair
try that with real hardware
Tenet #2
SPOF Elimination
We don’t need no stinkin single points of failure.
SPOF Examples:
Cloud Provider
Region
Zone
Load Balancer
App Server
Database
Fred
Cloud Provider fail-over?
e.g. AWS –> Rackspace
Region fail-over?
e.g. us-east -> us-west within AWS
Nah.
Zone fail-over?
Yes.
US-WEST: zones A, B, C, D
US-EAST: zones A, B, C, D
Zone fail-over best practices:
are you using auto-scaling?
no: distribute server instances evenly between 2 or more zones
yes: trigger scaling on network I/O or custom metrics
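For the no-auto-scaling case, distributing instances evenly can be sketched as a round-robin assignment across zones. The function name `distribute`, the instance names, and the zone labels are hypothetical placeholders, not real AWS identifiers.

```python
def distribute(instance_ids, zones):
    """Assign instances to zones round-robin so counts differ by at most one."""
    if len(zones) < 2:
        raise ValueError("use at least two zones to avoid a zone-level SPOF")
    placement = {zone: [] for zone in zones}
    for index, instance in enumerate(instance_ids):
        placement[zones[index % len(zones)]].append(instance)
    return placement

# Hypothetical instance names and zone labels.
instances = ["app-%d" % n for n in range(5)]
placement = distribute(instances, ["zone-a", "zone-b"])
```

With an odd number of instances, one zone carries one extra instance, so losing either zone still leaves roughly half the capacity running.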
Load-balancer (ELB), app server, database fail-over?
Yes.
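As a rough illustration of the fail-over idea at the app or database tier, a caller can try a list of endpoints in order and use the first one that responds. This is a client-side sketch only and does not describe how ELB works internally; `call_with_failover`, `fake_request`, and the endpoint names are assumptions made for the example.

```python
def call_with_failover(endpoints, request):
    """Try each endpoint in order; return (endpoint, result) for the first success."""
    errors = []
    for endpoint in endpoints:
        try:
            return endpoint, request(endpoint)
        except ConnectionError as exc:
            errors.append((endpoint, str(exc)))
    raise ConnectionError("all endpoints failed: %r" % errors)

# Hypothetical endpoints; the primary is simulated as unreachable.
def fake_request(endpoint):
    if endpoint == "db-primary":
        raise ConnectionError("primary unreachable")
    return "ok"

used, result = call_with_failover(["db-primary", "db-replica"], fake_request)
```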
So it’s actually all about reduction of the right SPOFs for
your business context
Just adding the ability to fail-over and have backups within a region is huge!
Probably enough for most.
What about Fred?
Tenet #3
Clear-Cut Communication
transparency is soooo 2010
During an outage, communicating the right things at the right time:
hard.
But not that hard.
Three Tenets Revisited
Tenet #1: Scripted Repeatability
Tenet #2: SPOF Elimination
Tenet #3: Clear-Cut Communication
Notes