Agility Requires Safety
Every startup has the same story:
“We don’t have time for best practices.”
You can’t go faster by being reckless
Think of cars on a highway
What happens if everyone jams down on the gas?
To go fast, a car needs not only a powerful engine…
But also powerful brakes.
As well as seat belts, airbags, bumpers, and auto-pilot
For cars and for software, speed is limited by safety
What are the seat belts, brakes, and self-driving cars of software?
This talk is about safety mechanisms
That make it possible to build software quickly
Founder of Atomic Squirrel (atomic-squirrel.net)
Outline
1. Brakes
2. Bulkheads
3. Autopilot
4. Safety catch
5. Speedometer
6. Warning lights
7. Seat belt
1. Brakes
Good brakes stop your car before you run into something
Continuous integration stops buggy code before it goes into production
Imagine your goal is to build the International Space Station
Each team designs and builds their component in isolation
You launch everything into space and hope it all comes together
I thought the Russians were going to build the bathrooms?
Weren’t the French supposed to do the wiring?
Everyone is using the metric system, right?
Teams working for a long time with incorrect assumptions
Finding this out when you’re in outer space is too late
This is the result of “late integration”
Lots of teams working in isolation on separate branches
Before attempting a massive merge at the very end
MERGE CONFLICT
The alternative is “continuous integration”
Where everyone regularly merges their work
The most common approach is trunk-based development
Everyone works on a single branch (trunk)
That can’t possibly scale to a lot of developers, can it?
Uses trunk-based development for 1,000+ developers
Uses trunk-based development for 4,000+ developers
Uses trunk-based development for 20,000+ developers
Wouldn’t you have merge conflicts all the time?
If you merge (commit) regularly, conflicts are rare.
And those that happen are from a day of work—not months.
Commit early and often.
Small commits are easier to merge, test, revert, review
Wouldn’t there constantly be broken code in trunk?
Not if you run a self-testing build after every commit
It should compile your code and run your automated tests
If a build fails, a developer must fix it ASAP or revert the commit
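A minimal sketch of a self-testing build, assuming each step is a command that exits non-zero on failure. The step names and the two tiny stand-in commands are hypothetical; a real build would call your compiler and test runner instead.

```python
import subprocess
import sys


def run_build(steps):
    """Run each build step in order and stop at the first failure.

    `steps` is a list of (name, command) pairs, where command is an
    argument list for subprocess.run. Returns (ok, failed_step_name).
    """
    for name, command in steps:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            # A red build: report which step broke so it can be fixed or reverted.
            return False, name
    return True, None


# Hypothetical steps standing in for "compile the code" and "run the tests".
PASSING_BUILD = [
    ("compile", [sys.executable, "-c", "compile('x = 1', 'app.py', 'exec')"]),
    ("test", [sys.executable, "-c", "assert 1 + 1 == 2"]),
]
BROKEN_BUILD = [
    ("compile", [sys.executable, "-c", "compile('x = 1', 'app.py', 'exec')"]),
    ("test", [sys.executable, "-c", "assert 1 + 1 == 3"]),
]
```

A CI server would call something like run_build after every commit and notify the author when it returns a failed step.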
Of course, this depends on having good automated tests
Tests give you the confidence to make changes quickly
JUnit version 4.11
...
Time: 6.063
OK (259 tests)
How long would it take you to do 259 tests manually?
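For a sense of what one of those automated tests looks like, here is a small sketch using Python's built-in unittest module. The function under test, apply_discount, is a made-up example and not part of the talk.

```python
import unittest


def apply_discount(price_cents, percent):
    """Return the price after a percentage discount, in whole cents."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents - (price_cents * percent) // 100


class ApplyDiscountTest(unittest.TestCase):
    def test_no_discount(self):
        self.assertEqual(apply_discount(1000, 0), 1000)

    def test_half_off(self):
        self.assertEqual(apply_discount(1000, 50), 500)

    def test_rejects_invalid_percent(self):
        with self.assertRaises(ValueError):
            apply_discount(1000, 150)


def run_suite():
    """Run the suite programmatically and return the unittest result."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ApplyDiscountTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test takes a fraction of a second, so the whole suite can run on every commit.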
What should you test?
Everything!
It’s a trade-off between:
1. Likelihood of bugs
2. Cost of bugs
3. Cost of testing
Likelihood of bugs is higher for complex code and large teams
Cost of bugs is higher for some systems (payments, security)
Cost of tests is higher for integration and UI tests
“Without continuous integration, your software is broken until somebody proves it works, usually during a testing or integration stage.
With continuous integration, your software is proven to work (assuming a sufficiently comprehensive set of automated tests) with every new change—and you know the moment it breaks and can fix it immediately.”
2. Bulkheads
Ships have bulkheads to try to contain flooding to one area.
You can split up a codebase to contain problems to one area.
Code is the enemy: the more you have, the slower you go
Project size (lines of code)    Bug density (bugs per thousand lines of code)
< 2K                            0 – 25
2K – 16K                        0 – 40
16K – 64K                       0.5 – 50
64K – 512K                      2 – 70
> 512K                          4 – 100
As the code grows, the number of bugs grows even faster
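To make the arithmetic concrete, the sketch below turns the bug density table into a lookup. It takes the second row as 2K to 16K so that the rows are contiguous, and it assigns boundary sizes to the larger bucket, which is an arbitrary choice.

```python
# (upper bound on project size in lines, low bugs/KLOC, high bugs/KLOC).
# None means "no upper bound". Boundary sizes fall into the larger bucket.
BUG_DENSITY_TABLE = [
    (2_000, 0.0, 25.0),
    (16_000, 0.0, 40.0),
    (64_000, 0.5, 50.0),
    (512_000, 2.0, 70.0),
    (None, 4.0, 100.0),
]


def estimated_bug_range(lines_of_code):
    """Return (low, high) estimated bug counts for a codebase of this size."""
    kloc = lines_of_code / 1000
    for upper, low, high in BUG_DENSITY_TABLE:
        if upper is None or lines_of_code < upper:
            return low * kloc, high * kloc
    raise AssertionError("unreachable: the last row has no upper bound")
```

For example, a 10,000-line project tops out at 40 x 10 = 400 bugs, while a 100,000-line project tops out at 70 x 100 = 7,000: ten times the code, but 17.5 times the worst-case bug count.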
“Software development doesn't happen in a chart, an IDE, or a design tool; it happens in your head.”
The mind can only handle so much complexity at once
One solution is to break the code into multiple codebases
Instead of depending on the source of another module
/moduleA
/moduleB /moduleC /moduleD
/moduleE
You depend on a versioned artifact from that module
moduleA-0.3.1.jar
moduleB-3.1.0.jar moduleC-9.8.0.jar moduleD-1.4.3.jar
moduleE-0.5.6.jar
This provides isolation from changes in other modules
You already do this: guava-18.0.jar
jquery-2.2.0.js
Advantages of artifacts:
1. Isolation
2. Decoupling
3. Faster builds
Disadvantages of artifacts:
1. Dependency hell
2. No continuous integration
3. Hard to make global changes
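One way to picture "dependency hell" is two modules that each pin a different version of the same artifact. The sketch below detects that situation. The dependency map is hypothetical: it reuses the module names from the slides, but the version that moduleC requests for moduleE is invented for illustration.

```python
from collections import defaultdict

# Hypothetical dependency declarations: each module pins artifact versions.
DEPENDENCIES = {
    "moduleA": {"moduleB": "3.1.0", "moduleC": "9.8.0"},
    "moduleB": {"moduleE": "0.5.6"},
    "moduleC": {"moduleE": "0.4.0"},  # disagrees with moduleB about moduleE
}


def find_version_conflicts(dependencies):
    """Return {artifact: {version: [requesting modules]}} for every artifact
    that is requested at more than one version."""
    requested = defaultdict(lambda: defaultdict(list))
    for module, pins in dependencies.items():
        for artifact, version in pins.items():
            requested[artifact][version].append(module)
    return {
        artifact: {version: sorted(mods) for version, mods in versions.items()}
        for artifact, versions in requested.items()
        if len(versions) > 1
    }
```

Real build tools resolve or report such conflicts for you; the point here is only that the problem grows with the number of independently versioned artifacts.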
Another option is to break the codebase into services
In a monolith, you use function calls within one process
A.a()
B.b() C.c() D.d()
E.e()
With services, you pass messages between processes
http://A/a
http://B/b http://C/c http://D/d
http://E/e
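The difference between the two styles can be sketched with the standard library alone. Below, the same logic b() is called once as a plain function and once over HTTP. The JSON payload, the local port, and the helper name call_b_over_http are assumptions made for this example.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def b():
    """The business logic, identical in both styles."""
    return {"result": "b"}


class ServiceB(BaseHTTPRequestHandler):
    """Exposes b() over HTTP at the path /b."""

    def do_GET(self):
        if self.path == "/b":
            body = json.dumps(b()).encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def call_b_over_http():
    """Start service B on a free local port, call it once, then shut it down."""
    server = HTTPServer(("127.0.0.1", 0), ServiceB)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        # Bypass any configured proxy so the request stays on localhost.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        url = "http://127.0.0.1:%d/b" % server.server_port
        with opener.open(url, timeout=5) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
```

Both calls return the same data, but the HTTP version adds serialization, a network hop, a timeout, and new failure modes that the caller has to handle.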
Advantages of services:
1. Technology agnostic
2. Scalability
3. Isolation
Disadvantages of services:
1. Operational overhead
2. Performance overhead
3. I/O, error handling
4. Backwards compatibility
5. Hard to make global changes
3. Autopilot
Autopilot prevents accidents caused by human error
Automated deployments prevent accidents caused by human error
Deploying code can be painful
“If it hurts, do it more often.” – Martin Fowler
The deployment process should be:
That means you should never deploy or configure manually
> ssh [email protected]
   __|  __|  __|
   _|  (   \__ \   Amazon ECS-Optimized Amazon Linux AMI 2015.09.d
 ____|\___|____/
[ec2-user ~]$ sudo yum install ruby
Don’t do this
Or this
Instead, automate everything
The gold standard is the blue-green deployment
Let’s say you have version 0.0.1 of your app deployed
First, deploy version 0.0.2 on a duplicate set of servers
If everything looks good, switch the load balancer over to 0.0.2
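The three steps above can be sketched as a toy model. LoadBalancer, the pool names "blue" and "green", and the health_check callback are all hypothetical stand-ins for real infrastructure.

```python
class LoadBalancer:
    """Toy load balancer that sends all traffic to one named server pool."""

    def __init__(self, pools, live):
        self.pools = pools  # e.g. {"blue": "0.0.1", "green": None}
        self.live = live    # name of the pool currently receiving traffic

    def idle_pool(self):
        return next(name for name in self.pools if name != self.live)

    def live_version(self):
        return self.pools[self.live]


def blue_green_deploy(balancer, new_version, health_check):
    """Deploy new_version to the idle pool; switch traffic only if healthy.

    Returns True if traffic was switched, False if the old version stays live.
    """
    target = balancer.idle_pool()
    balancer.pools[target] = new_version  # step 1: deploy alongside the old version
    if not health_check(new_version):     # step 2: verify before switching
        return False
    balancer.live = target                # step 3: flip the load balancer
    return True
```

Because the old pool is left untouched, rolling back is just a matter of pointing the load balancer at it again.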
Four main categories of deployment automation tools:
1. Configuration management: Chef, Puppet, Ansible, Salt
- name: Install httpd and php
  yum: name={{ item }} state=present
  with_items:
    - httpd
    - php

- name: start httpd
  service: name=httpd state=started enabled=yes

- name: Copy the code from repository
  git: repo={{ repository }} dest=/var/www/html/
Imperative scripts to configure servers and deploy code
2. Provisioning tools: Terraform, CloudFormation, Heat
resource "aws_instance" "example" {
  ami           = "ami-b960b1d"
  instance_type = "t2.micro"
}

resource "aws_eip" "ip" {
  instance   = "${aws_instance.example.id}"
  depends_on = ["aws_instance.example"]
}
Declarative templates that define your infrastructure
3. Virtual machines: VMWare, VirtualBox, Packer, Vagrant
{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "ami-de0d9eb7",
    "instance_type": "m1.medium",
    "ami_name": "example-packer-ami-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get -y update",
      "sudo apt-get -y install httpd php"
    ]
  }]
}
Images of configured servers
4. Containers: Docker, rkt, LXD
FROM ubuntu:12.04
RUN apt-get update && apt-get install -y apache2 php
ENV APACHE_RUN_USER www-data
ENV APACHE_LOG_DIR /var/log/apache2
EXPOSE 80
CMD ["/usr/sbin/apache2", "-D", "FOREGROUND"]
Lightweight images of configured servers
These tools allow you to define your infrastructure as code
That way, you can version it, review it, test it, and reuse it.
4. Safety catch
Elisha Otis demoing elevator free-fall safety in 1854
The safety elevator patent
The safety catches are locked by default
Only an intact cable can unlock the latches
This elevator provides safety by default
Feature toggles provide safety by default
New feature, part 1
New feature, part 2
New feature, part 3
If a large new feature takes many commits, wouldn’t a user see it in an unfinished state?
<section id="new-section">
  <!-- Code for new section -->
</section>
<section id="original-section">
  <!-- Code for original section -->
</section>
Let’s say you were adding a new section to your website.
<% if toggles.enabled("new-section") %>
  <section id="new-section">
    <!-- Code for new section -->
  </section>
<% end %>
<section id="original-section">
  <!-- Code for original section -->
</section>
Wrap new code in a conditional that looks up a feature toggle
Toggles are off by default, so users won’t see unfinished work
development:
  feature_toggles:
    new-section: true
production:
  feature_toggles:
    new-section: false
You can enable feature toggles in a config file.
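A minimal sketch of the lookup side, assuming the config file has already been parsed into a dictionary. The FeatureToggles class is hypothetical; only the enabled(name) call mirrors the template code in the slides.

```python
class FeatureToggles:
    """Looks up toggles for one environment; unknown toggles are off."""

    def __init__(self, config, environment):
        env_config = config.get(environment, {})
        self._toggles = env_config.get("feature_toggles", {})

    def enabled(self, name):
        # Default to False so unfinished work stays hidden unless switched on.
        return bool(self._toggles.get(name, False))


# The same settings as the config file, already parsed into a dict.
CONFIG = {
    "development": {"feature_toggles": {"new-section": True}},
    "production": {"feature_toggles": {"new-section": False}},
}
```

The important design choice is the default: a toggle that is missing or misspelled evaluates to off, which is the "locked by default" behavior of the safety catch.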
> curl http://feature.toggles/
{
  "development": { "new-section": true },
  "production": { "new-section": false }
}
Or you could create a web service for feature toggles.
> curl http://feature.toggles/?user=123
{
  "development": { "new-section": "A" },
  "production": { "new-section": "B" }
}
It could return different, complex values for each user.
And provide a web UI for configuring toggles.
This allows you to quickly turn features on or off.
<% if toggles.get("new-section") == "A" %>
  <section id="new-section-bucket-a">
    <!-- Code for new section, version A -->
  </section>
<% elsif toggles.get("new-section") == "B" %>
  <section id="new-section-bucket-b">
    <!-- Code for new section, version B -->
  </section>
<% end %>
This allows A/B testing
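The talk does not say how a toggle service picks a bucket for each user. One common approach, sketched here as an assumption, is to hash the toggle name together with the user ID so that the same user always lands in the same bucket. The function name assign_bucket is hypothetical.

```python
import hashlib


def assign_bucket(toggle_name, user_id, buckets=("A", "B")):
    """Deterministically map a user to one bucket for a given toggle."""
    key = ("%s:%s" % (toggle_name, user_id)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer, then pick a bucket.
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]
```

Including the toggle name in the hash keeps bucket assignments for one experiment independent of those for another.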
5. Speedometer
A speedometer tells you how fast you’re driving
Monitoring tells you how your product is performing
“If you can’t measure it, you can’t fix it.” – David Henke
There are many types of monitoring
Availability metrics: is my product up or down?
Useful tools: Keynote, Pingdom, Uptime Robot, Route53
Business metrics: what are my users doing in the product?
Useful tools: Google Analytics, KISSMetrics, Mixpanel
Application metrics: how is my application performing?
Useful tools: New Relic, CloudWatch, Datadog
127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326
64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846
127.0.0.1 - - [28/Jul/2006:10:22:04] "GET / HTTP/1.0" 200 2216
64.242.88.10 - - [07/Mar/2004:16:06:51] "GET /twiki/bin/Twiki/" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02] "GET /mailman HTTP/1.1" 200 6291
127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218
192.168.2.20 - - [28/Jul/2006:10:27:10] "GET /cgi-bin/try HTTP/1.0" 200 3395
64.242.88.10 - - [07/Mar/2004:16:11:58] "GET /twiki/bin/view/" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55] "GET /twiki HTTP/1.1" 200 5253
Log files are also a form of application-level monitoring
Useful tools: Loggly, Logstash, Papertrail, Sumo Logic
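As a sketch of how log lines become metrics, the code below parses access log lines in the layout shown in the slides and counts responses by status code. The regular expression is written for this simplified layout only (timestamps without a time zone) and is an assumption, not a general-purpose parser.

```python
import re
from collections import Counter

# host, two ignored fields, [timestamp], "request", status, response size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)

SAMPLE_LOG = """\
127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326
64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846
127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218
"""


def count_statuses(log_text):
    """Count responses per HTTP status code, skipping unparseable lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("status")] += 1
    return counts


def error_rate(log_text):
    """Fraction of parsed requests that returned a 4xx or 5xx status."""
    counts = count_statuses(log_text)
    total = sum(counts.values())
    errors = sum(n for status, n in counts.items() if status[0] in "45")
    return errors / total if total else 0.0
```

The log tools listed above do this kind of aggregation at scale and add search and dashboards on top.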
Server metrics: how is my server performing?
Useful tools: Nagios, Icinga, Munin, collectd, CloudWatch
6. Warning lights
Warning lights notify you if something is wrong
Alerting systems notify you if something is wrong
You can’t look at metrics 24/7. Alerting systems can.
Useful tools: PagerDuty, VictorOps
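A rough sketch of what an alerting rule does, assuming a simple policy: notify when the error rate over the last N requests reaches a threshold, and notify only once per incident. The class name, the window size, and the notify callback are hypothetical; hosted alerting tools offer far richer rules and routing.

```python
from collections import deque


class ErrorRateAlert:
    """Fires a notification when the error rate over the last `window`
    requests reaches `threshold`. Fires once per incident."""

    def __init__(self, window, threshold, notify):
        self.recent = deque(maxlen=window)
        self.threshold = threshold
        self.notify = notify
        self.firing = False

    def record(self, is_error):
        self.recent.append(bool(is_error))
        if len(self.recent) < self.recent.maxlen:
            return  # not enough data yet
        rate = sum(self.recent) / len(self.recent)
        if rate >= self.threshold and not self.firing:
            self.firing = True
            self.notify("error rate %.0f%% over last %d requests"
                        % (rate * 100, len(self.recent)))
        elif rate < self.threshold:
            self.firing = False  # recovered, so re-arm the alert
```

Suppressing repeat notifications while an alert is already firing matters in practice: a pager that goes off on every failed request quickly gets ignored.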
For a full list of monitoring and alerting tools, see:
hello-startup.net/resources
7. Seat belt
Seat belts help you survive crashes
High availability helps you survive crashes
Stateless servers: multiple instances, multiple zones
Load balancer routes around server or zone outages
Auto-recovery mechanism brings server back after outage
Stateful servers: multiple instances, multiple zones
Replication to one or more standby servers
Load balancer switches to standby server in case of outage
Auto-recovery mechanism brings server back after outage
Test your recovery process regularly.
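The "routes around outages" behavior can be sketched as round-robin selection that skips unhealthy servers. FailoverBalancer, the server names, and the is_healthy callback are hypothetical; a real load balancer would run health checks in the background rather than on every request.

```python
import itertools


class FailoverBalancer:
    """Round-robin over servers, skipping any that fail the health check."""

    def __init__(self, servers, is_healthy):
        self.servers = list(servers)
        self.is_healthy = is_healthy
        self._cycle = itertools.cycle(self.servers)

    def next_server(self):
        # Try each server at most once per call.
        for _ in range(len(self.servers)):
            server = next(self._cycle)
            if self.is_healthy(server):
                return server
        raise RuntimeError("no healthy servers available")
```

A quick way to test the recovery process is to mark servers as down, as a drill, and confirm that traffic keeps flowing to the remaining zone.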
Speed is limited by safety
Two cars can drive at 80mph in opposite directions safely…
Because of two yellow lines
It’s worth the time to put these safety mechanisms in place
For more info, see
Hello, Startup
hello-startup.net
Questions?
Image credits
F1 racecar: Takayuki Suzuki
Highway traffic: Oran Viriyincy
Car accident: ER24 EMS (Pty) Ltd.
Road: Nicolas Raymond
BMW: Andy Durst
Self-driving car: Steve Jurvetson
Bus: Roland Tanglao
Tail lights: Tony Webster
USS South Dakota: Wikimedia
Crash test dummy: Wikimedia
Elisha Otis: Wikimedia
Otis Elevator: Wikimedia
Speedometer: Dawn Hopkins
Dashboard lights: Jim Larrison
Seat belt: Wikimedia
Google repo stats: Rachel Potvin
ISS: Wikimedia
Fire: Pete
Martin Fowler: Wikimedia