stop worrying about prodweb001 and start loving i-98fb9856 (arc201) | aws re:invent 2013

They Don't Hug Back! Or Why You Need To Stop Worrying About prodweb001 And Start Loving i-98fb9856

Chris Munns, Amazon Web Services

November 13, 2013

Why are we here? Old-school IT practices continue to weigh us down in the cloud. We need a way out.

“Everything now is a programmable resource. There are no physical things anymore. Things that you needed to do by walking to the datacenter, by hugging your servers, and believe me I’ve hugged servers enough in my life. They DO NOT hug you back.” - Dr. Werner Vogels (Re:Invent 2012)

“But I love my servers!” - You (now)

https://secure.flickr.com/photos/schluesselbein/4157426778/

“They hate you, actually, I honestly believe that they hate you. At least that is how they behaved towards me.” - Dr. Werner Vogels (Re:Invent 2012)

“But I love my servers!” “Well now I’m kind of sad.” - You (now)

https://secure.flickr.com/photos/bensonkua/2687804310/

So where does server hugging come from?

NAMING THEM

https://secure.flickr.com/photos/quinnanya/4464205726

Why do we name them? Because we have to know where to find them. Where do we need to find them?

https://secure.flickr.com/photos/arthur-caranta/2925352521

Or here?

https://secure.flickr.com/photos/arthur-caranta/2925352521

IF THIS THING IS OUT OF TAPE, YOU HAD A REALLY BAD DAY.

https://secure.flickr.com/photos/stephendotcarter/6587082437

Why did we need to find them in person? Because we HAD to fix them. WHY?

We fixed them because:
• Dead servers == dead space
• Dead space == wasted $$$
• Dead servers == worse performance
• Worse performance == lost $$$

So where else does server hugging come from?

SERVERS != OUR PETS

https://secure.flickr.com/photos/thegirlsny/3877243166/

What we name our pets
• Greek gods: Zeus, Thor, Hercules…
• Elements: Hydrogen, Helium, Lithium…
• Comic book heroes: Superman, Ironman…
• Musicians, Cities, Countries, Movies
• Prodweb01, Prodapi01…
• Web01.prod, Web01.test…
• Tacotruck01
• P1cfw01v03

P1cfw01v03 https://secure.flickr.com/photos/75898532@N00/3243666946/

EC2 EC2

P1cfw01v03 https://secure.flickr.com/photos/verylastexcitingmoment/3118396767/

Waking when they cry:

*** Nagios ***
Notification Type: PROBLEM
Service: Web CPU
Host: web03.example.com
Address: 10.167.10.51
State: CRITICAL
Date/Time: Thu Oct 24 08:14:13 UTC 2013
Additional Info: CRITICAL – CPU LOAD 29

Hugging server babies and you
• Is the site performing worse?
• Are your customers impacted?
• How impacted are they?
• What are the other 20 web instances doing?
• Did I really need to wake up at 4am for this?
• If a server uses 100% of its CPU, should I care?
• If this server is bad, how much work is there in fixing it?
• Is there something custom about this server?

Server hugging bad practices
• “Pet-ting” – caring about a server’s “name,” its well-being, its individual status
• “Snowflakes” – unique hosts in a common pool
• “Model T-ing” – hand-built one-off servers
• “Names In Stone” – overuse of host names as a source of truth

In short, a lot of old-school, dated habits are being carried over to cloud infrastructure. Once you’ve brought them to the cloud, you lose out on many of its benefits, such as:
• Dynamic scale up/down
• Self-healing infrastructures
• Increased flexibility
• Automation

https://secure.flickr.com/photos/tolomea/5113266973/

Letting go involves moving forward with some of the best of what AWS can offer you in terms of services and how you can work with them in some pretty incredible ways.

Letting go and loving the new way

• Using Auto Scaling for everything
• ENIs and EIPs
• Tags are the new DNS
• Deployment tools
• Host-based configuration
• Service registries

Sleeping through Infrastructure Recovery

https://secure.flickr.com/photos/dominiqs/331702231

The things that should never wake you up

• High CPU usage on anything
• High memory usage on anything
• Thread/process exhaustion
• Filled disks
• Not running software
• Failed instances

Metrics:

Common actions taken when paged

1. Look at logs

2. Look at graphs

3. Reboot/restart related application/instance

Steps 1 and 2 both amount to looking at past data.

Why do this manually?

[Chart: traffic to our site vs. capacity provisioned manually]

[Chart: traffic to our site vs. capacity provisioned with Auto Scaling]

STONITH: “Shoot the other node in the head”

Don’t be afraid to kill a node that has something wrong with it as a resolution to failure!

With Auto Scaling it’s fine!

STONITH

[Diagram sequence, components in order of appearance: AWS Cloud, Virtual Private Cloud, three Availability Zones, Web Instances, Internet Gateway, ELB, Auto Scaling Group (min=3), CloudWatch, Amazon SNS, Amazon SQS, Watcher Instance, EC2 API.]
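One plausible reading of the diagram sequence above: a CloudWatch alarm publishes to Amazon SNS, the notification lands in an Amazon SQS queue, and the watcher instance reads it and terminates the offending instance through the EC2 API, leaving Auto Scaling to launch a replacement. The sketch below covers only the watcher's decision step, in plain Python. The message layout (an SNS envelope whose `Message` field holds alarm JSON with `NewStateValue` and `Trigger.Dimensions`) is an assumption about the notification format, and `terminate` is a hypothetical callable standing in for a real EC2 API call.

```python
import json


def instance_to_shoot(sqs_body):
    """Return the instance ID named by an ALARM notification, else None.

    Assumes sqs_body is an SNS envelope (JSON) whose "Message" field is
    itself a JSON string describing a CloudWatch alarm.
    """
    envelope = json.loads(sqs_body)
    alarm = json.loads(envelope["Message"])
    if alarm.get("NewStateValue") != "ALARM":
        return None  # OK / INSUFFICIENT_DATA: nothing to shoot
    for dim in alarm.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "InstanceId":
            return dim.get("value")
    return None


def handle_message(sqs_body, terminate):
    """Terminate the alarmed instance via the injected `terminate` callable.

    `terminate` is hypothetical; in practice it would wrap an EC2 API call.
    Returns the instance ID that was terminated, or None.
    """
    instance_id = instance_to_shoot(sqs_body)
    if instance_id is not None:
        terminate(instance_id)
    return instance_id


def example_body(state="ALARM", instance_id="i-98fb9856"):
    """Build a sample notification body for local experimentation."""
    alarm = {
        "AlarmName": "web-cpu-high",  # made-up alarm name
        "NewStateValue": state,
        "Trigger": {"Dimensions": [{"name": "InstanceId", "value": instance_id}]},
    }
    return json.dumps({"Type": "Notification", "Message": json.dumps(alarm)})
```

Passing `terminate` in as an argument keeps the decision logic testable without touching a real account; notifications in any state other than ALARM are ignored.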

Auto Scaling for everything!
• You can use Auto Scaling for singular instances that don’t scale up or down – min = 1, max = 1
• Auto Scaling lets you specify multiple Availability Zones, even if you only need a single host – gives you multi-AZ failover
• Auto Scaling supports notifications on instance creation/termination
  – Useful for configuring other resources, bootstrapping, and provisioning
• Auto Scaling is free!
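The "group of one" idea above can be sketched as a CloudFormation resource built in Python. The resource type `AWS::AutoScaling::AutoScalingGroup` and its `MinSize`, `MaxSize`, `AvailabilityZones`, and `LaunchConfigurationName` properties are standard CloudFormation; the logical name, zone names, and launch configuration name are placeholders invented for this example.

```python
import json


def single_instance_group(launch_config_name, zones):
    """Return a CloudFormation resource for an Auto Scaling group of one.

    min = max = 1 means the group never scales, but a failed instance is
    replaced, possibly in another of the listed Availability Zones.
    """
    if not zones:
        raise ValueError("at least one Availability Zone is required")
    return {
        "Type": "AWS::AutoScaling::AutoScalingGroup",
        "Properties": {
            "AvailabilityZones": list(zones),
            "LaunchConfigurationName": launch_config_name,
            "MinSize": "1",
            "MaxSize": "1",
        },
    }


# Placeholder names, for illustration only.
resource = single_instance_group("AppLaunchConfig", ["us-west-2a", "us-west-2b"])
template_fragment = json.dumps({"AppGroup": resource}, indent=2, sort_keys=True)
```

Listing more than one zone is what provides the multi-AZ failover mentioned above, even though only one instance ever runs.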

Auto Scaling for everything!

• Make use of the user data or configuration management tools to do things like:
  – Re-attaching an Amazon Elastic Block Store (EBS) volume with application data
  – Re-attaching an Elastic Network Interface (ENI)
  – Updating service registries
  – Updating DNS
  – Notifying other reliant applications of the new host
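As a rough sketch of the first item, a boot-time script could look up the data volume by its tags and attach it to the new instance. This assumes a boto 2-style EC2 connection exposing `get_all_volumes(filters=...)` and `attach_volume(volume_id, instance_id, device)`; the tag values and device name are hypothetical. The connection is passed in as an argument so the function can be tried against a stand-in object.

```python
def reattach_data_volume(conn, instance_id, tags, device="/dev/sdf"):
    """Find one EBS volume by its tags and attach it to instance_id.

    `conn` is assumed to be a boto 2-style EC2 connection (or any object
    with the same two methods). Returns the ID of the attached volume.
    """
    # Build tag filters in the 'tag:Key' form used by the EC2 API.
    filters = dict(("tag:" + key, value) for key, value in tags.items())
    volumes = conn.get_all_volumes(filters=filters)
    if len(volumes) != 1:
        # Refuse to guess if the tags match zero or several volumes.
        raise LookupError("expected exactly 1 volume, found %d" % len(volumes))
    volume_id = volumes[0].id
    conn.attach_volume(volume_id, instance_id, device)
    return volume_id
```

Failing loudly when the tags match anything other than exactly one volume avoids attaching the wrong data to a freshly launched host.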

Elastic Network Interfaces/Elastic IPs

ENI:
• Add additional interfaces to an instance
• One or more secondary private IP addresses
• Has its own MAC address
• Can have Security Groups assigned
• Tag-able
• Free

EIP:
• A static public IP address
• Can be assigned to either an instance or an ENI
• Doesn’t replace private IP
• Small hourly charge when not attached to an instance

Elastic Network Interfaces

Attaching multiple network interfaces to an instance is useful when you want to:
• Create a management network.
• Use network and security appliances in your Amazon Virtual Private Cloud (VPC).
• Create dual-homed instances with workloads/roles on distinct subnets.
• Create a low-budget, high-availability solution.

Healing a single instance

[Diagram sequence, components in order of appearance: AWS Cloud, EC2 API, AWS CloudFormation, Virtual Private Cloud, Availability Zone, Internet Gateway, NAT Instance, App Instance, Auto Scaling Group, Elastic Network Interface, EBS Volume.]

Healing a single instance

"myENI" : {
  "Type" : "AWS::EC2::NetworkInterface",
  "Properties" : {
    "Tags": [{"Key":"Name","Value":"AppENI"}, {"Key":"Project","Value":"Blog"}],
    "Description": "Blog One Off App Server ENI.",
    "SubnetId": "subnet-d2286cb9",
    "PrivateIpAddress": "192.168.11.100"
  }
}

Healing a single instance

import boto.ec2
import boto.utils

# Connect to the EC2 API
conn = boto.ec2.connect_to_region('us-west-2')

# Find the right ENI by its tags
myfilters = {'tag:Name': 'AppENI', 'tag:Project': 'Blog'}
myEni = conn.get_all_network_interfaces(filters=myfilters)

# Attach the ENI to this instance as a second interface
myInstance = boto.utils.get_instance_metadata()['instance-id']
conn.attach_network_interface(myEni[0].id, myInstance, device_index=1, dry_run=False)

https://secure.flickr.com/photos/cambodia4kidsorg/260004685

Use tags as a source of “truth” in your infrastructure

DNS bad. Tags good.

DNS
• 30-year-old technology
• Only tells us a single thing about a host: a hostname-to-IP mapping
• Potential for split brain/broken replicas
• Caching issues, caching issues, caching issues

Tags
• Set by you the user, held in AWS and available via APIs
• Key:Value is totally up to you
• Can have several per resource
• Free to implement and query

DNS bad. Tags good.

DNS
Web03.example.com:
– 10.167.10.51

Tags
i-933f81a4:
– Name:Web
– Env:Prod
– Project:Blog
– Owner:BobSmith
– aws:autoscaling:groupName:ProdBlogWebsASG
– aws:cloudformation:stack-name:BlogSiteProd
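To make the contrast concrete, here is a small sketch of selecting hosts by what they are rather than what they are called. The `tag:Key` filter form matches the one used in the deck's boto ENI example; the in-memory `fleet` dictionary is a stand-in for the EC2 API, and the second instance ID is made up for illustration.

```python
def tag_filters(**tags):
    """Turn Env='Prod' into {'tag:Env': 'Prod'}, the EC2 API filter form."""
    return dict(("tag:" + key, value) for key, value in tags.items())


def matches(instance_tags, **wanted):
    """True if every wanted key/value pair is present on the instance."""
    return all(instance_tags.get(k) == v for k, v in wanted.items())


# A tiny in-memory stand-in for the EC2 API, using the slide's example tags.
fleet = {
    "i-933f81a4": {"Name": "Web", "Env": "Prod", "Project": "Blog", "Owner": "BobSmith"},
    "i-00000001": {"Name": "Web", "Env": "Test", "Project": "Blog"},  # hypothetical
}

# "Give me the production web hosts", with no hostname involved.
prod_web = sorted(i for i, tags in fleet.items() if matches(tags, Name="Web", Env="Prod"))
```

With a real connection, the dictionary returned by `tag_filters(Name="Web", Env="Prod")` could be passed as the `filters` argument of a describe-style call instead of filtering locally.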

Tags as a source of truth

• Tie various resources together
• Billing reports
• IAM resource-level permissions
• Build automation
• Deploy automation
• Security resource grouping

Stop hand-crafting servers!

https://secure.flickr.com/photos/ndrwfgg/115898387

Use automation!

https://secure.flickr.com/photos/genewolf/147722350

AWS management tools

AWS Elastic Beanstalk • AWS OpsWorks • AWS CloudFormation

Higher-level services (convenience) at one end, do it yourself (control) at the other

Host-based configuration management

Fabric

Host-based configuration management

• All more or less accomplish the same things
  – File configuration, package/software installation, user management, run commands, interface with OS, process management
• All have their own syntax that isn’t too dissimilar
• Some rely on agents, some are agentless
• Use HBCM alongside one of the tools from the previous slide
• Spend the time required to learn them
• Can’t scale easily without HBCM

“I don’t have time to learn Chef!?”

“I wrote custom shell scripts instead!”

https://secure.flickr.com/photos/45909111@N00/9374169461/

Go visit the AWS & Partner exhibits and ask for more

Making Use of Service Registries

https://secure.flickr.com/photos/fringedbenefit/9178086713

https://secure.flickr.com/photos/smartfinn/2651755337/

NOT THAT KINDA REGISTRY!

https://secure.flickr.com/photos/smartfinn/2651755337/

“A service registry is one of the fundamental pieces of service-oriented architecture (SOA) for achieving reuse. It refers to a place in which service providers can impart information about their offered services and potential clients can search for services.” - www.architecturejournal.net, Sept 2009

Service registry workflow

1. A new instance boots.
2. It registers itself with our “service registry.”
3. Changes to the service registry kick off changes on other systems related to the new instance.
4. Other instances now know about our new instance.
5. On instance termination, instance is deregistered, and other instances remove it from use.
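The workflow above can be sketched with a toy in-memory registry. This is not any particular product's API; the class and method names are invented for illustration, and a real registry would be a networked, replicated service.

```python
class ServiceRegistry(object):
    """A toy in-memory registry: services map to {instance_id: address}."""

    def __init__(self):
        self._services = {}
        self._watchers = []

    def watch(self, callback):
        """Call callback(service, instances) after every change (step 3)."""
        self._watchers.append(callback)

    def register(self, service, instance_id, host, port):
        """Step 2: a freshly booted instance announces itself."""
        self._services.setdefault(service, {})[instance_id] = (host, port)
        self._notify(service)

    def deregister(self, service, instance_id):
        """Step 5: a terminated instance is removed from use."""
        self._services.get(service, {}).pop(instance_id, None)
        self._notify(service)

    def lookup(self, service):
        """Step 4: return a copy of the current instances for a service."""
        return dict(self._services.get(service, {}))

    def _notify(self, service):
        for callback in self._watchers:
            callback(service, self.lookup(service))
```

A short usage example: create a registry, add a watcher that rewrites a load balancer configuration, then call `register("web", "i-98fb9856", "10.0.0.5", 80)` from the instance's boot script and `deregister` from its shutdown hook.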

Service registry examples:

• Zookeeper
• MuleSoft Anypoint Service Registry
• Netflix Eureka
• IBM WebSphere Service Registry and Repository
• Airbnb SmartStack

Zookeeper “is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.” – zookeeper.apache.org

– leader election
– group membership
– configuration maintenance
– event notification
– locking
– priority queue mechanism

Zookeeper

[Diagram: AWS Cloud, Virtual Private Cloud, three Availability Zones, Zookeeper Instances, Worker Instance, Leader Host.]

Enough from me!

Customer Story: Airbnb SmartStack Martin Rhoads

Martin Rhoads SRE @ Airbnb November 13, 2013

Airbnb SmartStack Helping you build Service Oriented Architectures

not at Re:Invent

Intros

Igor Serebryany
+ SRE at Airbnb since 2012
+ Built datacenter automation at SingleHop
+ Scientific computing at University of Chicago
+ Hobbies: welding, biking, long walks on the beach

This guy is even more bearded than the last!

Intros

Martin Rhoads
+ SRE at Airbnb
+ User of AWS since 2006
+ First 10 employees at RightScale
+ Previously worked at Cloudscaling deploying OpenStack at Tier1s and Telcos
+ BioInformatics at UCSB
+ Obsessed with making things easier

SmartStack Helping you build SOA

What are you trying to sell me?

Why do I need SOA?

+ The definitive way to scale your architecture
+ Allow different people to work on different code without stepping on toes
+ Separate deployment schedules
+ Separate machine and data requirements
+ Fail separately -- so you can have graceful degradation

How SOA happens When customers love a service very, very much...

Here’s how it ends up A certain kind of fun

To sum up

1. Services help you scale
2. SOA is an architecture style designed around services
3. A SOA is hard to manage
4. SmartStack makes managing SOA a breeze

What is SmartStack? And how does it help?

1. Service(s) you want to deliver
2. Zookeeper registry to track everything
3. Nerve checks health and updates Zookeeper
4. Synapse routes between services

[Diagram labels: SERVICE, ZOOKEEPER, NERVE, SYNAPSE, MONORAIL, MOBILE WEB]

+ /production/monorail/services/i-1234567 => {'host': '1.2.3.4', 'port': 5678}
+ /production/mobile_web/services/i-0abcdef => {'host': '5.6.7.8', 'port': 5678}

haproxy: at the core of synapse

We get myriad benefits from haproxy
+ Stable and well-tested
+ Performs in-process connectivity checks
+ Great introspection and logging
+ Lots of load-balancing algorithms (RR, least-conn)
+ Somewhat dynamically reconfigurable (stats socket)
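To illustrate the general idea of turning registry contents into haproxy configuration, here is a minimal sketch. It is not Synapse's actual code or output format; the function name is invented, and the input shape simply mirrors the host/port registry values shown earlier.

```python
def render_backend(service, entries):
    """Render registry entries as an haproxy `backend` section.

    entries maps an instance ID to {'host': ..., 'port': ...}.
    Servers are sorted by instance ID so the output is stable.
    """
    lines = ["backend %s" % service, "    balance roundrobin"]
    for instance_id in sorted(entries):
        entry = entries[instance_id]
        lines.append(
            "    server %s %s:%d check" % (instance_id, entry["host"], entry["port"])
        )
    return "\n".join(lines) + "\n"


config = render_backend("monorail", {"i-1234567": {"host": "1.2.3.4", "port": 5678}})
```

Stable ordering matters in a setup like this because the rendered text can then be compared with the previous version, and haproxy only needs to be touched when the set of backends actually changed.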

To Recap: SmartStack in action

Why SmartStack?
+ Abstraction and DRY
+ Automatic failure detection
+ Introspection
+ Distributed by design

Abstraction

+ The same code in the same language is always doing discovery/registration

+ Your application doesn’t know about nerve/synapse -- it only knows about its dependencies

+ Always consistent across your infrastructure

You don’t have to wake up

Automatic Failure Handling

+ Bad backends are automatically taken out of rotation
+ Useful during both problems and routine maintenance/deploys
+ Push-based => very rapid detection; avoid those little blips
+ haproxy even routes around network partitions!
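The "taken out of rotation" behavior can be sketched as a small state machine that registers a backend while its health checks pass and removes it after repeated failures. This is an illustration only, not Nerve's implementation; the class names and the rise/fall thresholds are assumptions made for the example.

```python
class SetRegistry(object):
    """Minimal stand-in registry backed by a set."""

    def __init__(self):
        self.members = set()

    def register(self, name):
        self.members.add(name)

    def deregister(self, name):
        self.members.discard(name)


class HealthReporter(object):
    """Register a backend while its checks pass; remove it when they fail.

    `registry` is any object with register(name) and deregister(name).
    """

    def __init__(self, name, registry, rise=2, fall=2):
        self.name = name
        self.registry = registry
        self.rise = rise      # consecutive passes needed to enter rotation
        self.fall = fall      # consecutive failures needed to leave rotation
        self.in_rotation = False
        self._streak = 0      # run of results that disagree with our state

    def report(self, check_passed):
        """Feed in one health check result; return the rotation state."""
        if check_passed == self.in_rotation:
            self._streak = 0  # result agrees with current state
            return self.in_rotation
        self._streak += 1
        needed = self.rise if check_passed else self.fall
        if self._streak >= needed:
            self.in_rotation = check_passed
            self._streak = 0
            if check_passed:
                self.registry.register(self.name)
            else:
                self.registry.deregister(self.name)
        return self.in_rotation
```

Requiring a short run of consistent results before changing state is a common way to avoid flapping on a single slow check; the same mechanism lets a deploy script take a host out of rotation simply by making its health check fail.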

See what’s REALLY going on

Introspection

Leverage the power of haproxy
+ status page that lets you see local state
+ lots of available integrations to gather global state
+ world-class logging for large-scale analysis

No central point of failure

Distributed by Design

+ Traffic flows directly between boxes -- no routing layer
+ Even if SmartStack is stopped or broken, haproxy keeps traffic flowing
+ Zookeeper helps to avoid common pitfalls (like different backends in different network segments)

How SmartStack has changed Airbnb

The Impact

+ Services using SmartStack
+ Requests per second
+ LOC deleted
+ Engineers using SmartStack

2K 3K 30

Ben: “SmartStack is great! It helped me to discover services – and quit smoking”

Phillippe: “Distributed computing? And all this time I thought everything was running on one machine”

Spike : “Nerve and Synapse have greatly simplified my life as an application developer, and have enabled me to launch our first Node.js services with very little ops overhead.”

Barbara: “I love it!”

Sean: “Smart Stack has made deployment of new java services a matter of beer and 20 lines of ruby”

Our engineers love SmartStack

Future Direction Is this project, like, done...?

1. Better resiliency: more graceful handling of zookeeper edge cases
2. Better testing: improve on the current integration test suite
3. Dynamic registration: for services running on Mesos et al.
4. A push API for nerve: allow services to communicate upcoming downtime
5. An auto-scaling layer: use nerve information to determine load levels

I’m sold! How do I get started?

Getting Started

install Vagrant

git clone https://github.com/airbnb/smartstack-cookbook.git

vagrant up

Where is the code?

https://github.com/airbnb/nerve.git

https://github.com/airbnb/synapse.git

AWS re:Invent Pub Crawl

Join the AWS Startup Team this evening at the AWS Pub Crawl
When: Wednesday November 13, 5:30pm - 7:30pm
Where: Canaletto at The Venetian, 2nd Floor
Who Will Be There: Startups, the AWS Startup Team, Startup Launch Companies, and AWS re:Invent Hackathon winners

Startup Spotlight Sessions with Dr. Werner Vogels
Thurs. Nov 14, Marcello Room 4406

SPOT 203 – Fireside Chats – Startup Founders, 1:30-2:30pm
– Eliot Horowitz, CTO of MongoDB
– Jeff Lawson, CEO of Twilio
– Valentino Volonghi, Chief Architect of AdRoll

SPOT 204 – Fireside Chats – Startup Influencers, 3:00-4:00pm
– Albert Wenger, Managing Partner at Union Square Ventures
– David Cohen, Founder and CEO of TechStars

SPOT 101 – Startup Launches, 4:15-5:15pm
– 5 companies powered by AWS launching at AWS re:Invent 2013

We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
