cloud operations bootcamp: culture - jesse robbins
Today
‣ Operations is Culture
‣ Failure Happens
‣ The OODA Loop
‣ Do Fire Drills
Operations is Culture
“You don’t choose the moment, the moment chooses you.
You only get to choose how prepared you are when it does.” -Fire Chief Mike Burtch
Cloud Operations is the ability to consistently create and deploy reliable software to an
unreliable platform that scales horizontally.
http://radar.oreilly.com/2007/10/operations-is-a-competitive-ad.html
“It’s not my code, it’s your machines!”
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
Spock
‣ Little bit weird
‣ Sits closer to the boss
‣ Thinks too hard
Scotty
‣ Pulls levers & turns knobs
‣ Easily excited
‣ Yells a lot in emergencies
Copyright © 2010 Opscode, Inc - All Rights Reserved
No fingerpointing
http://www.flickr.com/photos/rocketjim54/2955889085/
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
Fingerpointyness (over time)
problem!!! argggh!
‣ freaking out, not talking, finding fault
‣ blaming, covering ass
‣ whining, hiding, hurt egos
‣ figuring it out
‣ fixing things
‣ fixed
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
Being productive (over time)
problem!!! argggh!
‣ figuring it out
‣ fixing things
‣ fixed
‣ feeling guilty
‣ move on with life
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr
This will be on the test: FAILURE HAPPENS!
Good Book!
"Catastrophic Potential" adapted from Normal Accidents by Charles Perrow
Created by Jesse Robbins
[Figure: Catastrophic Potential. Axes: Complexity (Simple to Complex) and Coupling (Loose to Tight). One region is labeled "KEEP OUT!!!"]
define: Nines (roughly), as allowed downtime per year
99%        5256 min (3.65 days)
99.9%      526 min (8.8 hours)
99.99%     53 min
99.999%    5 min
99.9999%   30 seconds
99.99999%  3 seconds
99.9% × 99.9% × 99.9% = 99.7%
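The nines arithmetic above can be reproduced with a short sketch. This is only an illustration, not part of the original deck: the function names are made up here, a year is taken as 365 days, and the chained figure assumes the three components fail independently and must all be up at once.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Allowed downtime in minutes per year at a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


def serial_availability(*availabilities_percent: float) -> float:
    """Availability (percent) of a chain where every component must be up.

    Assumes failures are independent, so the individual availabilities multiply.
    """
    result = 1.0
    for a in availabilities_percent:
        result *= a / 100
    return result * 100


if __name__ == "__main__":
    for nines in (99, 99.9, 99.99, 99.999, 99.9999, 99.99999):
        print(f"{nines}% -> {downtime_minutes_per_year(nines):.2f} min/year")
    print(f"three 99.9% parts in series: {serial_availability(99.9, 99.9, 99.9):.1f}%")
```

The point of the chained figure is that every dependency you add in series spends some of your nines, even when each part looks good on its own.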
Internet Routing... won’t.
http://radar.oreilly.com/2008/10/sprint-blocking-cogent-network.html
#googlefail
YOU
Continuous Power... isn’t
365 Main SF
364.96 Main SF
http://radar.oreilly.com/2007/07/failure-happens-a-summary-of-t.html
Failure happens
A single datacenter is the problem
• Since they all fail at some point
Recovery procedures after failure
• Power was gone ~45 minutes
• Most services took hours to come back
• Some unnamed ones more than 12 hours
Geography is a Single Point of Failure
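A rough way to see why a single site is the weak point: if a service can run from any one of several sites, it is only down when all of them are down together. The sketch below is an illustration under a strong assumption, namely that sites fail independently of each other. The function name and the 99% figure are hypothetical, and sites that share a region or a provider will not be independent in practice.

```python
def parallel_availability(site_availability_percent: float, sites: int) -> float:
    """Availability (percent) when any one of `sites` sites can serve traffic.

    Assumes site failures are independent (a hypothetical best case), so the
    service is down only when every site is down at the same time.
    """
    if sites < 1:
        raise ValueError("sites must be at least 1")
    site_failure = 1 - site_availability_percent / 100
    return (1 - site_failure ** sites) * 100


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} site(s) at 99% each: {parallel_availability(99.0, n):.4f}%")
```

Under that independence assumption a second site turns two nines into four. Anything the sites have in common, such as geography, eats into that gain.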
Providers are baskets too.
Failure Happens.
Anyone promising otherwise is either foolish or lying
(or both).
OODA: Observe, Orient, Decide, Act
http://en.wikipedia.org/wiki/OODA_loop
http://www.flickr.com/photos/dnorman/2678090600
http://www.slideshare.net/jallspaw/10-deploys-per-day-dev-and-ops-cooperation-at-flickr