Agility requires safety

Every startup has the same story:

“We don’t have time for best practices.”

You can’t go faster by being reckless

Think of cars on a highway

What happens if everyone jams down on the gas?

To go fast, a car needs not only a powerful engine…

But also powerful brakes.

As well as seat belts, airbags, bumpers, and auto-pilot

For cars and for software, speed is limited by safety

What are the seat belts, brakes, and self-driving cars of software?

This talk is about safety mechanisms

That make it possible to build software quickly

I’m Yevgeniy Brikman

ybrikman.com

http://www.ybrikman.com/

Founder of

Atomic Squirrel

atomic-squirrel.net

http://www.atomic-squirrel.net/?ref=startup-ideas-talk

PAST LIVES

Author of Hello, Startup

hello-startup.net

http://www.hello-startup.net/?ref=ideas-talk

Outline

1. Brakes
2. Bulkheads
3. Autopilot
4. Safety catch
5. Speedometer
6. Warning lights
7. Seat belt

Good brakes stop your car before you run into something

Continuous integration stops buggy code before it goes into production

Imagine your goal is to build the International Space Station

Each team designs and builds their component in isolation

You launch everything into space and hope it all comes together

I thought the Russians were going to build the bathrooms?

Weren’t the French supposed to do the wiring?

Everyone is using the metric system, right?

Teams working for a long time with incorrect assumptions

Finding this out when you’re in outer space is too late

This is the result of “late integration”

Lots of teams working in isolation on separate branches

Before attempting a massive merge at the very end

MERGE CONFLICT

The alternative is “continuous integration”

Where everyone regularly merges their work

The most common approach is trunk-based development

Everyone works on a single branch (trunk)

That can’t possibly scale to a lot of developers, can it?

Uses trunk-based development for 1,000+ developers

Wouldn’t you have merge conflicts all the time?

If you merge (commit) regularly, conflicts are rare.

And those that happen are from a day of work—not months.

Commit early and often.

Small commits are easier to merge, test, revert, review

Wouldn’t there constantly be broken code in trunk?


Not if you run a self-testing build after every commit



It should compile your code and run your automated tests



If a build fails, a developer must fix it ASAP or revert the commit
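A self-testing build can be sketched as a small script that a CI server runs after every commit. This is only an illustration, not the API of any particular CI tool: the function name `run_build` and the commands in `EXAMPLE_STEPS` are assumptions chosen for the example.

```python
import subprocess
import sys


def run_build(steps):
    """Run each build step in order and stop at the first failure.

    `steps` is a list of (name, command) pairs, where command is an
    argument list for subprocess.run. Returns True if every step passed.
    """
    for name, command in steps:
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"BUILD FAILED at step '{name}': fix it or revert the commit")
            return False
        print(f"step '{name}' passed")
    print("BUILD OK")
    return True


# Hypothetical steps for a Python project: byte-compile the sources,
# then run the automated tests. A real project would use its own commands.
EXAMPLE_STEPS = [
    ("compile", [sys.executable, "-m", "compileall", "-q", "."]),
    ("test", [sys.executable, "-m", "unittest", "discover"]),
]
```

A CI job would call `run_build(EXAMPLE_STEPS)` and mark the commit as broken when it returns False.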


Of course, this depends on having good automated tests

Tests give you the confidence to make changes quickly

JUnit version 4.11

...

Time: 6.063

OK (259 tests)

How long would it take you to do 259 tests manually?

What should you test?

Everything!

It’s a trade-off between:

1. Likelihood of bugs
2. Cost of bugs
3. Cost of testing

Likelihood of bugs is higher for complex code and large teams

Cost of bugs is higher for some systems (payments, security)

Cost of tests is higher for integration and UI tests

“Without continuous integration, your software is broken until somebody proves it works, usually during a testing or integration stage.

With continuous integration, your software is proven to work (assuming a sufficiently comprehensive set of automated tests) with every new change—and you know the moment it breaks and can fix it immediately.”



Ships have bulkheads to try to contain flooding to one area.

You can split up a codebase to contain problems to one area.

Code is the enemy: the more you have, the slower you go

Project size (lines of code)    Bug density (bugs per thousand lines of code)
< 2K                            0 – 25
2K – 16K                        0 – 40
16K – 64K                       0.5 – 50
64K – 512K                      2 – 70
> 512K                          4 – 100

As the code grows, the number of bugs grows even faster

“Software development doesn't happen in a chart, an IDE, or a design tool; it happens in your head.”

The mind can only handle so much complexity at once

One solution is to break the code into multiple codebases

Instead of depending on the source of another module

/moduleA

/moduleB /moduleC /moduleD

/moduleE

You depend on a versioned artifact from that module

moduleA-0.3.1.jar

moduleB-3.1.0.jar moduleC-9.8.0.jar moduleD-1.4.3.jar

moduleE-0.5.6.jar

This provides isolation from changes in other modules


You already do this: guava-18.0.jar

jquery-2.2.0.js

Advantages of artifacts:

1. Isolation
2. Decoupling
3. Faster builds

Disadvantages of artifacts:

1. Dependency hell
2. No continuous integration
3. Hard to make global changes
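One way to see where "dependency hell" comes from is to check whether two modules pin different versions of the same artifact. The sketch below is illustrative only: `parse_artifact`, `find_conflicts`, and the `moduleA-0.4.0.jar` version in the usage note are hypothetical, and real build tools resolve dependencies with far more elaborate rules.

```python
import re

# Matches names such as "moduleA-0.3.1.jar" or "jquery-2.2.0.js".
ARTIFACT_PATTERN = re.compile(
    r"^(?P<name>.+)-(?P<version>\d+(?:\.\d+)*)\.(?:jar|js)$"
)


def parse_artifact(filename):
    """Split 'moduleA-0.3.1.jar' into ('moduleA', (0, 3, 1))."""
    match = ARTIFACT_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a versioned artifact: {filename}")
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version


def find_conflicts(requirements):
    """Return {artifact name: set of versions} for artifacts that
    different modules pin at different versions.

    `requirements` maps a module name to the artifact filenames it needs.
    """
    versions_seen = {}
    for artifacts in requirements.values():
        for filename in artifacts:
            name, version = parse_artifact(filename)
            versions_seen.setdefault(name, set()).add(version)
    return {name: vs for name, vs in versions_seen.items() if len(vs) > 1}
```

For example, if moduleB needs `moduleA-0.3.1.jar` and moduleC needs a hypothetical `moduleA-0.4.0.jar`, `find_conflicts` reports moduleA with both versions, which is the situation a team then has to untangle by hand.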

Another option is to break the codebase into services

In a monolith, you use function calls within one process

A.a()

B.b() C.c() D.d()

E.e()

With services, you pass messages between processes

http://A/a

http://B/b http://C/c http://D/d

http://E/e
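The difference between the two styles can be sketched with the Python standard library. Everything here is an assumption made for illustration: the function `a`, the `ServiceA` handler, and the `/a` path simply mirror the `A.a()` and `http://A/a` labels above.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


def a():
    """The monolith version: a plain in-process function call."""
    return "result of a"


class ServiceA(BaseHTTPRequestHandler):
    """The service version: the same logic exposed at GET /a."""

    def do_GET(self):
        if self.path == "/a":
            body = a().encode("utf-8")
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)

    def log_message(self, format, *args):
        pass  # keep the example output quiet


def call_service_a(port):
    """Call service A over HTTP. Unlike a(), this can fail or time out."""
    connection = http.client.HTTPConnection("127.0.0.1", port, timeout=2)
    try:
        connection.request("GET", "/a")
        return connection.getresponse().read().decode("utf-8")
    finally:
        connection.close()


def demo():
    """Start the service on a free local port and compare both call styles."""
    server = HTTPServer(("127.0.0.1", 0), ServiceA)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return a(), call_service_a(server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
```

Both calls return the same value, but the service call crosses a process boundary, so it needs serialization, a timeout, and error handling that the function call never had to think about.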

Advantages of services:

1. Technology agnostic
2. Scalability
3. Isolation

Disadvantages of services:

1. Operational overhead
2. Performance overhead
3. I/O, error handling
4. Backwards compatibility
5. Hard to make global changes



Autopilot prevents accidents caused by human error

Automated deployments prevent accidents caused by human error

Deploying code can be painful

“If it hurts, do it more often.” – Martin Fowler

The deployment process should be:

That means you should never deploy or configure manually

> ssh [email protected]

   __|  __|  __|
   _|  (   \__ \   Amazon ECS-Optimized Amazon Linux AMI 2015.09.d
 ____|\___|____/

[ec2-user ~]$ sudo yum install ruby

Don’t do this

Or this

Instead, automate everything

The gold standard is the blue-green deployment

Let’s say you have version 0.0.1 of your app deployed

First, deploy version 0.0.2 on a duplicate set of servers

If everything looks good, switch the load balancer over to 0.0.2
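The switch-over step can be sketched in a few lines. This is a toy model under stated assumptions, not a real load balancer API: `LoadBalancer`, `blue_green_deploy`, and the `is_healthy` callback are hypothetical names.

```python
class LoadBalancer:
    """Toy load balancer that sends all traffic to one live environment."""

    def __init__(self, live):
        self.live = live

    def switch_to(self, environment):
        self.live = environment


def blue_green_deploy(load_balancer, new_environment, is_healthy):
    """Switch traffic to new_environment only if it passes the health check.

    Returns the previously live environment, so it can be kept around
    for a quick rollback or torn down later.
    """
    previous = load_balancer.live
    if not is_healthy(new_environment):
        raise RuntimeError(
            f"{new_environment} failed its health check; staying on {previous}"
        )
    load_balancer.switch_to(new_environment)
    return previous
```

Because the old version keeps running until the switch, a failed health check leaves users on the version that was already working.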

Four main categories of deployment automation tools:

1. Configuration management: Chef, Puppet, Ansible, Salt

- name: Install httpd and php
  yum: name={{ item }} state=present
  with_items:
    - httpd
    - php

- name: start httpd
  service: name=httpd state=started enabled=yes

- name: Copy the code from repository
  git: repo={{ repository }} dest=/var/www/html/

Imperative scripts to configure servers and deploy code

2. Provisioning tools: Terraform, CloudFormation, Heat

resource "aws_instance" "example" {
  ami           = "ami-b960b1d"
  instance_type = "t2.micro"
}

resource "aws_eip" "ip" {
  instance   = "${aws_instance.example.id}"
  depends_on = ["aws_instance.example"]
}

Declarative templates that define your infrastructure

3. Virtual machines: VMWare, VirtualBox, Packer, Vagrant

{
  "builders": [{
    "type": "amazon-ebs",
    "source_ami": "ami-de0d9eb7",
    "instance_type": "m1.medium",
    "ami_name": "example-packer-ami-{{timestamp}}"
  }],
  "provisioners": [{
    "type": "shell",
    "inline": [
      "sudo apt-get -y update",
      "sudo apt-get -y install httpd php"
    ]
  }]
}

Images of configured servers

4. Containers: Docker, rkt, LXD

FROM ubuntu:12.04

RUN apt-get update && apt-get install -y apache2 php

ENV APACHE_RUN_USER www-data
ENV APACHE_LOG_DIR /var/log/apache2

EXPOSE 80

CMD ["/usr/sbin/apache2", "-D", "FOREGROUND"]

Lightweight images of configured servers

These tools allow you to define your infrastructure as code

That way, you can version it, review it, test it, and reuse it.



Elisha Otis demoing elevator free-fall safety in 1854

The safety elevator patent

The safety catches are locked by default

Only an intact cable can unlock the latches

This elevator provides safety by default

Feature toggles provide safety by default

New feature, part 1

New feature, part 2

New feature, part 3

If a large new feature takes many commits, wouldn’t a user see it in an unfinished state?

<section id="new-section">
</section>
<section id="original-section">
</section>

Let’s say you were adding a new section to your website.

<% if toggles.enabled("new-section") %>
  <section id="new-section">
  </section>
<% end %>
<section id="original-section">
</section>

Wrap new code in a conditional that looks up a feature toggle

Toggles are off by default, so users won’t see unfinished work

development:
  feature_toggles:
    new-section: true

production:
  feature_toggles:
    new-section: false

You can enable feature toggles in a config file.
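A toggle lookup that is off by default can be sketched as follows. The `FeatureToggles` class and the `CONFIG` dictionary are assumptions for illustration; `CONFIG` stands in for the parsed contents of a config file like the one above.

```python
# Stand-in for the parsed config file: environment -> feature_toggles -> flag.
CONFIG = {
    "development": {"feature_toggles": {"new-section": True}},
    "production": {"feature_toggles": {"new-section": False}},
}


class FeatureToggles:
    """Looks up feature toggles for one environment."""

    def __init__(self, config, environment):
        self._toggles = config.get(environment, {}).get("feature_toggles", {})

    def enabled(self, name):
        # Unknown or missing toggles are off: safety by default.
        return self._toggles.get(name, False) is True
```

With this sketch, `FeatureToggles(CONFIG, "production").enabled("new-section")` is False, and so is any toggle name that was never configured.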

> curl http://feature.toggles/

{
  "development": { "new-section": true },
  "production": { "new-section": false }
}

Or you could create a web service for feature toggles.

> curl http://feature.toggles/?user=123

{
  "development": { "new-section": "A" },
  "production": { "new-section": "B" }
}

It could return different, complex values for each user.

And provide a web UI for configuring toggles.

This allows you to quickly turn features on or off.

<% if toggles.get("new-section") == "A" %>
  <section id="new-section-bucket-a">
  </section>
<% elsif toggles.get("new-section") == "B" %>
  <section id="new-section-bucket-b">
  </section>
<% end %>

This allows A/B testing
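One common way to decide which bucket a user sees is to hash the user ID, so the same user always lands in the same bucket. The talk does not say how buckets are assigned, so this is an assumed approach, and `assign_bucket` is a hypothetical helper.

```python
import hashlib


def assign_bucket(toggle_name, user_id, buckets=("A", "B")):
    """Deterministically map a user to one of the buckets for a toggle.

    Hashing the toggle name together with the user ID keeps assignments
    stable across requests and independent between experiments.
    """
    key = f"{toggle_name}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(buckets)
    return buckets[index]
```

A toggle service could call `assign_bucket("new-section", 123)` when answering a request for user 123 and return the resulting bucket as the toggle value.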



A speedometer tells you how fast you’re driving

Monitoring tells you how your product is performing

“If you can’t measure it, you can’t fix it.” – David Henke

There are many types of monitoring

Availability metrics: is my product up or down?

Useful tools: Keynote, Pingdom, Uptime Robot, Route53

Business metrics: what are my users doing in the product?

Useful tools: Google Analytics, KISSMetrics, Mixpanel

Application metrics: how is my application performing?

Useful tools: New Relic, CloudWatch, Datadog

127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326
64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846
127.0.0.1 - - [28/Jul/2006:10:22:04] "GET / HTTP/1.0" 200 2216
64.242.88.10 - - [07/Mar/2004:16:06:51] "GET /twiki/bin/Twiki/" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02] "GET /mailman HTTP/1.1" 200 6291
127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218
192.168.2.20 - - [28/Jul/2006:10:27:10] "GET /cgi-bin/try HTTP/1.0" 200 3395
64.242.88.10 - - [07/Mar/2004:16:11:58] "GET /twiki/bin/view/" 200 7352
64.242.88.10 - - [07/Mar/2004:16:20:55] "GET /twiki HTTP/1.1" 200 5253

Log files are also a form of application-level monitoring

Useful tools: Loggly, Logstash, Papertrail, Sumo Logic
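To show how log files turn into a metric, here is a minimal sketch that counts HTTP status codes in access-log lines shaped like the ones above. The regular expression and the name `count_statuses` are assumptions for this example; the log tools listed above do this at scale and with far more robust parsing.

```python
import re
from collections import Counter

# host, two ignored fields, [timestamp], "request", status, response size
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def count_statuses(lines):
    """Count how often each HTTP status code appears; skip unparseable lines."""
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match:
            counts[match.group("status")] += 1
    return counts


SAMPLE = [
    '127.0.0.1 - - [10/Oct/2000:13:55:36] "GET /apache_pb.gif HTTP/1.0" 200 2326',
    '64.242.88.10 - - [07/Mar/2004:16:05:49] "GET /twiki/bin/ HTTP/1.1" 401 12846',
    '127.0.0.1 - - [28/Jul/2006:10:27:32] "GET /hidden/ HTTP/1.0" 404 7218',
]
```

Calling `count_statuses(SAMPLE)` gives one count each for 200, 401, and 404; a rising share of 4xx or 5xx codes is the kind of signal worth graphing.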

Server metrics: how is my server performing?

Useful tools: Nagios, Icinga, Munin, collectd, CloudWatch



Warning lights notify you if something is wrong

Alerting systems notify you if something is wrong

You can’t look at metrics 24/7. Alerting systems can.
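At its simplest, an alerting system compares the latest metrics against thresholds and notifies someone when a rule is violated. The sketch below assumes a hypothetical `check_alerts` function and made-up metric names; it is not how PagerDuty or any other product is configured.

```python
def check_alerts(metrics, rules):
    """Return one message per rule whose metric is missing or too high.

    `metrics` maps metric names to their latest values.
    `rules` maps metric names to the maximum acceptable value.
    """
    alerts = []
    for metric_name, threshold in rules.items():
        value = metrics.get(metric_name)
        if value is None:
            # A metric that stopped reporting is itself worth an alert.
            alerts.append(f"{metric_name}: no data")
        elif value > threshold:
            alerts.append(f"{metric_name}: {value} exceeds threshold {threshold}")
    return alerts
```

A scheduler would run this check every minute or so and forward any returned messages to whoever is on call.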

Useful tools: PagerDuty, VictorOps

For a full list of monitoring and alerting tools, see:

hello-startup.net/resources

http://www.hello-startup.net/resources/monitoring



Seat belts help you survive crashes

High availability helps you survive crashes

Stateless servers: multiple instances, multiple zones

Load balancer routes around server or zone outages

Auto-recovery mechanism brings server back after outage
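Routing around an outage can be sketched as round-robin selection that skips servers marked unhealthy. `HealthAwareBalancer` and the server names in the usage note are hypothetical; a real load balancer would also run the health checks itself.

```python
import itertools


class HealthAwareBalancer:
    """Round-robin over servers, skipping any that are marked down."""

    def __init__(self, servers):
        self._servers = list(servers)
        self._healthy = set(self._servers)
        self._cycle = itertools.cycle(self._servers)

    def mark_down(self, server):
        self._healthy.discard(server)

    def mark_up(self, server):
        # Called by the auto-recovery mechanism once the server is back.
        if server in self._servers:
            self._healthy.add(server)

    def next_server(self):
        # Look at each server at most once per call.
        for _ in range(len(self._servers)):
            candidate = next(self._cycle)
            if candidate in self._healthy:
                return candidate
        raise RuntimeError("no healthy servers available")
```

With servers `zone-a-1` and `zone-b-1`, marking `zone-a-1` down sends every request to `zone-b-1` until `mark_up` brings the first server back into rotation.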

Stateful servers: multiple instances, multiple zones

Replication to one or more standby servers

Load balancer switches to standby server in case of outage

Auto-recovery mechanism brings server back after outage

Test your recovery process regularly.



Speed is limited by safety

Two cars can drive at 80mph in opposite directions safely…

Because of two yellow lines

It’s worth the time to put these safety mechanisms in place

For more info, see

Hello, Startup

hello-startup.net

http://www.hello-startup.net/?ref=agility-safety-talk

Questions?

Image credits

F1 racecar: Takayuki Suzuki
Highway traffic: Oran Viriyincy
Car accident: ER24 EMS (Pty) Ltd.
Road: Nicolas Raymond
BMW: Andy Durst
Self-driving car: Steve Jurvetson
Bus: Roland Tanglao
Tail lights: Tony Webster
USS South Dakota: Wikimedia
Crash test dummy: Wikimedia
Elisha Otis: Wikimedia
Otis Elevator: Wikimedia
Speedometer: Dawn Hopkins
Dashboard lights: Jim Larrison
Seat belt: Wikimedia
Google repo stats: Rachel Potvin
ISS: Wikimedia
Fire: Pete
Martin Fowler: Wikimedia

https://flic.kr/p/hfxEVY

https://flic.kr/p/7SgJnu

https://flic.kr/p/7YMVXJ

https://flic.kr/p/pQoBvb

https://flic.kr/p/ovdGTT

https://flic.kr/p/dtNMk2

https://flic.kr/p/dwbBRN

https://flic.kr/p/pJamhX

https://commons.wikimedia.org/wiki/File:USS_South_Dakota_(BB-57)_under_construction,_1_April_1940.jpg

https://commons.wikimedia.org/wiki/File:V08383P339.jpg

https://en.wikipedia.org/wiki/File:Elisha_OTIS_1854.jpg

https://en.wikipedia.org/wiki/File:ElevatorPatentOtis1861.jpg

https://flic.kr/p/wnQ3p

https://flic.kr/p/cXMK9o

https://commons.wikimedia.org/wiki/File:Aircraft_Seatbelt.jpg

https://www.youtube.com/watch?v=W71BTkUbdqE

https://commons.wikimedia.org/wiki/File:ISS_configuration_2015-05_en.svg

https://flic.kr/p/arMbFf

https://commons.wikimedia.org/wiki/File:Webysther_20150414193208_-_Martin_Fowler.jpg