Automating Life in the Cloud
Joshua Buss, Matthew Kemp & Cody Ray
"Add more features!"
"This widget is too slow!"
"No more downtime!"
"We're losing potential customers in Asia!"
Use Case 0: Scalability and Reliability
Designing for the Cloud
Focus on scaling applications horizontally.
Use Case 0: Scalability and Reliability
Scalability
Wikipedia definition:
SOA as an architecture relies on service-orientation as its fundamental design principle. If a service presents a simple interface that abstracts away its underlying complexity, users can access independent services without knowledge of the service's platform implementation.

Layman's terms:
A complex system is broken into simple components that are able to interact with each other (and possibly outside sources).
Use Case 0: Scalability and Reliability
Service Oriented Architecture
What is a Service in SOA?
An independent unit that's composable with other components.
Use Case 0: Scalability and Reliability

(Diagram: layered services: Presentation (web, API, etc.), Business Logic, Data Access, Data Stores.)
Use Case 0: Scalability and Reliability
Services at BrightTag
(Diagram: BrightTag services: tagserve, stathub, datahub, ui, and their databases.)
When should you split services up?
Use Case 0: Scalability and Reliability
Service Division of Labor
Keep failures self-contained.
Use Case 0: Scalability and Reliability
Design for Failure
Release It! by Michael Nygard is a great resource for stability patterns.
(Diagram: the full stack: stathub, datahub, tagserve, ui, and their databases.)
Use Case 0: Scalability and Reliability
Redundancy at BrightTag

Run a full stack in each region.
(Diagram: a complete stack of tagserve, stathub, datahub, ui, and databases duplicated in each region.)
Services communicate over HTTP.

This makes it possible to use standard tools and components without extra effort.
Use Case 0: Scalability and Reliability
Load Balancers
Changes need to be allowed, but compatibility needs to be maintained.
Use Case 0: Scalability and Reliability
Backwards Compatibility
Some data needs to be available in all regions, while keeping inter-region communication to a minimum.
Use Case 1: Inter-Region Communication
Cross-Region Data Replication
Google's BigTable data model on Amazon's Dynamo infrastructure.
Use Case 1: Inter-Region Communication
What is Cassandra?
Use Case 1: Inter-Region Communication
Cassandra Token Ring
East:
  cassandra01 [0-63]
  cassandra02 [64-127]
  cassandra03 [128-191]
  cassandra04 [192-255]
West:
  cassandra01 [1-64]
  cassandra02 [65-128]
  cassandra03 [129-192]
  cassandra04 [193-0]
Key hashes to 157?
Use Case 1: Inter-Region Communication
How Cassandra Writes
East:
  cassandra01 [0-63]
  cassandra02 [64-127]
  cassandra03 [128-191]
  cassandra04 [192-255]
West:
  cassandra01 [1-64]
  cassandra02 [65-128]
  cassandra03 [129-192]
  cassandra04 [193-0]
The write goes to cassandra03 in each region: 157 falls in [128-191] (East) and [129-192] (West).
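The range lookup above can be sketched in a few lines of Python. This is an illustration of the idea, not Cassandra's actual partitioner; the ring layout copies the slide's 0-255 keyspace.

```python
from bisect import bisect_right

# Ring layout copied from the slide: each node owns the token range
# starting at its lower bound, over a 0-255 keyspace.
EAST_RING = [(0, "cassandra01"), (64, "cassandra02"),
             (128, "cassandra03"), (192, "cassandra04")]

def owner(ring, token):
    """Return the node whose token range contains the given token."""
    starts = [start for start, _node in ring]
    return ring[bisect_right(starts, token) - 1][1]

print(owner(EAST_RING, 157))  # -> cassandra03
```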
Cross-region messaging over HTTPS with compression.
Use Case 1: Inter-Region Communication
Cross-Region Messaging (Hiveway)
(Diagram: messages flow in both directions between a local hiveway and a remote hiveway.)
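A minimal sketch of the compression side of that idea, assuming JSON message batches (the message shape and batching are illustrative; the slide only says HTTPS plus compression):

```python
import gzip
import json

def encode_batch(messages):
    """Serialize and gzip a batch of messages for the cross-region hop."""
    return gzip.compress(json.dumps(messages).encode("utf-8"))

def decode_batch(payload):
    """Inverse of encode_batch, run by the receiving hiveway."""
    return json.loads(gzip.decompress(payload).decode("utf-8"))

# Repetitive batches compress well, which is the point of the gzip hop.
batch = [{"key": "visitor:42", "event": "page_view"}] * 100
payload = encode_batch(batch)
assert decode_batch(payload) == batch
assert len(payload) < len(json.dumps(batch))
```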
Use Case 2: Zero Downtime Builds
Smooth Code Pushes
Easy migrations and upgrade path.
Can be more expensive.
Use Case 2: Zero Downtime Builds
Mirror Environment Cutover
More complicated migrations and upgrades.
Longer deploy window.
Usually cheaper.
Use Case 2: Zero Downtime Builds
Rolling Deploy
for region in regions:
    for app in apps:
        for server in region:
            if app on server:
                maintenance app
                scp new code to <deployment_tag> dir
                symlink app/current to app/<deployment_tag>
                restart app
                wait for healthy
Use Case 2: Zero Downtime Builds
Fabric Pseudocode
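The pseudocode above can be made concrete as a plan generator; in the real script each step would be a Fabric task run over SSH. The region/server layout below is invented for illustration.

```python
# Invented inventory for illustration; the real one comes from the
# cloud provider's API.
REGIONS = {
    "us-east-1": {"web01": ["tagserve", "ui"], "web02": ["tagserve"]},
    "eu-west-1": {"web03": ["tagserve"]},
}

def deploy_plan(regions, apps, tag):
    """Yield rolling-deploy steps, one server at a time, mirroring the
    nested loops in the pseudocode."""
    for region, servers in regions.items():
        for app in apps:
            for server, roles in servers.items():
                if app in roles:
                    yield (server, app, "maintenance")
                    yield (server, app, "push " + tag)
                    yield (server, app, "symlink current -> " + tag)
                    yield (server, app, "restart")
                    yield (server, app, "wait for healthy")

steps = list(deploy_plan(REGIONS, ["tagserve"], "r42"))
print(steps[0])  # -> ('web01', 'tagserve', 'maintenance')
```

Because only one server leaves rotation at a time, the load balancer keeps serving from the rest, which is what makes the deploy zero-downtime.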
Use Case 2: Zero Downtime Builds
Health Checks at BrightTag
Standardized health checks across services.

$ curl -si 'http://service/bthc'
HTTP/1.1 204 No Content

$ curl -si 'http://service/bthc?action=maint'
HTTP/1.1 500 Internal Server Error
Connection: close
Content-Length: 5

MAINT
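A stdlib-only sketch of a /bthc-style endpoint (BrightTag's real services differ; the in-memory maintenance flag is an assumption, and a production check would also probe the service's own dependencies):

```python
from wsgiref.util import setup_testing_defaults
from urllib.parse import parse_qs

STATE = {"maintenance": False}

def bthc_app(environ, start_response):
    """WSGI app: 204 when healthy, 500 MAINT when in maintenance."""
    qs = parse_qs(environ.get("QUERY_STRING", ""))
    action = qs.get("action", [None])[0]
    if action == "maint":
        STATE["maintenance"] = True
    elif action == "resume":
        STATE["maintenance"] = False
    if STATE["maintenance"]:
        body = b"MAINT"
        start_response("500 Internal Server Error",
                       [("Content-Length", str(len(body))),
                        ("Connection", "close")])
        return [body]
    start_response("204 No Content", [])
    return [b""]

def call(query):
    """Drive the WSGI app directly and return its status line."""
    environ = {}
    setup_testing_defaults(environ)
    environ["QUERY_STRING"] = query
    captured = {}
    def start_response(status, headers):
        captured["status"] = status
    list(bthc_app(environ, start_response))
    return captured["status"]

print(call(""))              # -> 204 No Content
print(call("action=maint"))  # -> 500 Internal Server Error
```

Any non-2xx answer is enough for a load balancer to pull the server out of rotation, which is what the deploy script relies on.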
Use Case 2: Zero Downtime Builds
Keeping an Eye on the Pulse

At-a-glance environment health.
Provide multiple modes of operation.
Use Case 2: Zero Downtime Builds
Runtime Controls
Use Case 3: Generating /etc/hosts
Connectivity
Use Case 3: Generating /etc/hosts
What is Zerg?
(Flask + libcloud = Zerg)
DRIVER_MAPPING = {
    "dev": {
        "office": get_driver(Provider.EUCALYPTUS)(
            DEV_ID, secret=DEV_KEY, host="openmaster",
            port=8773, secure=False, path="/services/Cloud")
    },
    "prod": {
        "us-east-1": get_driver(Provider.EC2_US_EAST)(PROD_ID, PROD_KEY),
        "eu-west-1": get_driver(Provider.EC2_EU_WEST)(PROD_ID, PROD_KEY)
    }
}
@app.route("/hosts/<env>/<region>")
def hosts(env, region):
    nodes = DRIVER_MAPPING[env][region].list_nodes()
    return str([node.extra['private_dns'] for node in nodes])
Use Case 3: Generating /etc/hosts
Flask and libcloud Working Together
@app.route("/etchosts/<env>/<region>")
def etchosts(env, region):
    driver = DRIVER_MAPPING[env][region]
    sorted_nodes = sorted((node.name, node.private_ips, node.public_ips)
                          for node in driver.list_nodes())
    hosts = [{'private_ip': private_ips[0], 'name': name, 'public_ip': public_ips[0]}
             for (name, private_ips, public_ips) in sorted_nodes]
    response = render_template('etc_hosts.txt', hosts=hosts)
    return Response(response, content_type='text/plain')
Template:

# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
{% for host in hosts %}
{{ "%-21s%-21s# External: %s"|format(host.private_ip, host.name, host.public_ip) }}
{%- endfor %}
Use Case 3: Generating /etc/hosts
The Zerg Code
$ curl -s 'http://zerg/etchosts/prod/eu-west-1'
# The following lines are desirable for IPv6 capable hosts
::1 ip6-localhost ip6-loopback
10.0.0.10 server01 # External: 123.123.123.123
10.0.0.11 server02 # External: 123.123.123.124
10.0.0.12 server03 # External: 123.123.123.125
10.0.0.13 server04 # External: 123.123.123.126
10.0.0.14 server05 # External: 123.123.123.127
10.0.0.15 server06 # External: 123.123.123.128
Use Case 3: Generating /etc/hosts
The Zerg HTTP Response
# Set variables
read -r -d '' STATIC_HOSTS << static_hosts
# The following lines are included by default
127.0.0.1 localhost
# DO NOT EDIT THIS COMMENT - everything after this line is managed by zerg!
static_hosts

cp /etc/hosts ${TMPDIR}/old_hosts
grep -B 5000000 '# DO NOT' ${TMPDIR}/old_hosts >> ${TMPDIR}/static_hosts
cp ${TMPDIR}/static_hosts ${TMPDIR}/new_hosts
wget -qO- "http://${ZERG_IP}/etchosts/${E}/${R}" >> ${TMPDIR}/new_hosts &&
if [[ $(diff ${TMPDIR}/new_hosts /etc/hosts | wc -l | awk '{print $1}') -lt 7 ||
      ${FORCE} == '--force' ]]; then
    cp ${TMPDIR}/new_hosts /etc/hosts
fi
Use Case 3: Generating /etc/hosts
The bash update_hosts.sh script
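The script's safety check (apply only small diffs unless forced) is a pattern worth keeping when automation rewrites a critical file. A Python sketch of the same guard, with an invented threshold in the spirit of the script's "fewer than 7 diff lines" rule:

```python
import difflib

def safe_to_apply(old_lines, new_lines, max_changes=6, force=False):
    """Refuse a large automated rewrite of a critical file unless forced."""
    changed = [line for line in difflib.ndiff(old_lines, new_lines)
               if line.startswith(("+ ", "- "))]
    return force or len(changed) <= max_changes

old = ["10.0.0.%d server%02d" % (10 + i, i + 1) for i in range(10)]
new = old + ["10.0.0.20 server11"]
print(safe_to_apply(old, new))             # one added host: True
print(safe_to_apply(old, []))              # file wiped out: False
print(safe_to_apply(old, [], force=True))  # explicit override: True
```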
Update timing is tricky to get right.
Too important to leave completely autonomous.
Use Case 4: Generating Load Balancer Configuration
Configuring Load Balanced Services
Need a rock-solid foundation to deploy onto.
Use Case 4: Generating Load Balancer Configuration
Consistency > *
Set environment per-instance: /etc/puppet/puppet.conf
Symlink /etc/puppet/environments/ on master to various git checkouts of the source:
$ cd /etc/puppet/environments
$ ln -s ~/src/puppet/prod_stable prod_stable
$ ln -s ~/src/puppet/dev_stable dev_stable
$ ln -s ~/src/puppet/dev_test dev_test
Use cron to keep all branches up-to-date
Use Case 4: Generating Load Balancer Configuration
Single Puppet Master
Each environment has its own branch.
Make a new branch for every new feature.
Merge into a test branch to test.
Merge into stable.
Use Case 4: Generating Load Balancer Configuration
Source Controlled Puppet Configs
"APP_DEFS": {
    "zerg": {"type": "http", "healthcheck": {"port": 19999, "resource": "/zerghealth"}},
    "awesome": {"type": "http", "healthcheck": {"port": 20000, "resource": "/ahc"}, "frontend": "10080"},
    "haproxy_awesome": {"type": "http", "healthcheck": {"port": 20001, "resource": "/"}},
    "foo": {"type": "http", "healthcheck": {"port": 20002, "resource": "/"}, "frontend": "10081"},
    "mashed_potatoes": {"type": "http", "healthcheck": {"port": 20003, "resource": "/"}, "frontend": "10082"},
    "haproxy_foo": {"type": "http", "healthcheck": {"port": 20004, "resource": "/hc"}},
    "thehardproblem": {"type": "http", "healthcheck": {"port": 20006, "resource": "/"}},
    "redis": {"type": "tcp", "healthcheck": {"port": 20007, "resource": "/rhc"}},
    "dataserver": {"type": "http", "healthcheck": {"port": 20008, "resource": "/"}, "frontend": "10083"},
    "itshards": {"type": "http", "healthcheck": {"port": 20009, "resource": "/"}},
    "devnull": {"type": "http", "healthcheck": {"port": 20010, "resource": "/hc"}}
}
Use Case 4: Load Balancer Configs
The App Definitions in Zerg
@app.route("/haproxy/<env>/<region>/<type>")
def haproxy(env, region, type):
    instances = get_region_manifest(region)
    apps = {}
    for app in APP_DEFS[env]:
        if 'frontend' in APP_DEFS[env][app]:
            app_object = {
                'servers': [],
                'backend_port': APP_DEFS[env][app]['healthcheck']['port'],
                'frontend_port': APP_DEFS[env][app]['frontend']
            }
            for server in instances:
                if app in instances[server]['roles']:
                    app_object['servers'].append(
                        {'name': server, 'details': instances[server]})
            apps[app] = app_object
    return render_template('haproxy_%s_%s_%s.txt' % (env, region, type),
                           vips=apps)
Use Case 4: Load Balancer Configs
The Zerg Code
global
    blah blah
defaults
    blah blah

frontend dataserver_vip
    bind *:{{ vips.dataserver.frontend_port }}
    default_backend dataserver

frontend mashed_potatoes_vip
    bind *:{{ vips.mashed_potatoes.frontend_port }}
    default_backend mashed_potatoes

backend dataserver
    balance roundrobin
    {%- for server in vips.dataserver.servers %}
    server {{ server['name'] }} {{ server.details['private ip'] }}:{{ vips.dataserver.backend_port }} check
    {%- endfor %}

backend mashed_potatoes
    balance roundrobin
    {%- for server in vips.mashed_potatoes.servers %}
    server {{ server['name'] }} {{ server.details['private ip'] }}:{{ vips.mashed_potatoes.backend_port }} check
    {%- endfor %}
Use Case 4: Load Balancer Configs
The Zerg Flask Template
$ curl -s http://zerg/haproxy/<env>/<region>/<type>
globals and defaults blah blah

frontend dataserver_vip
    bind *:10083
    default_backend dataserver

frontend mashed_potatoes_vip
    bind *:10082
    default_backend mashed_potatoes

backend dataserver
    blah blah options
    server dataserv01 10.0.0.28:20008 check
    server dataserv02 10.0.0.29:20008 check

backend mashed_potatoes
    blah blah options
    server taters01 10.0.0.30:20003 check
    server taters02 10.0.0.31:20003 check
Use Case 4: Load Balancer Configs
The Zerg HTTP Response
Use Case 4: Load Balancer Configs
The Config Workflow
Large changes to templates (human) -> Git (ops) -> Zerg (generation) -> Script (human) -> Git (puppet) -> Servers
$ ./update_haproxy.sh <env> <region> <service>
** Git is clean and in sync with origin.. now waiting for zerg http response..
[prod_stable 012345] [puppet] Haproxy Auto-Commit for <env> <region> <service>
 1 files changed, 2 insertions(+), 2 deletions(-)
** Template pulled and committed
** Here is the diff from origin to the new version:
diff --git a/modules/haproxy/templates/haproxy_<env>_<region>_<service>_cfg.erb b/modules/haproxy/templates/haproxy_<env>_<region>_<service>_cfg.erb
--- a/modules/haproxy/templates/haproxy_prod_us-east-1_tagserve_cfg.erb
+++ b/modules/haproxy/templates/haproxy_prod_us-east-1_tagserve_cfg.erb
- server oldandslow01 10.0.0.23:20003 check
- server oldandslow02 10.0.0.24:20003 check
+ server taters01 10.0.0.30:20003 check
+ server taters02 10.0.0.31:20003 check
** Do you want to push this change? (y/n) y
blah blah successful git push message
** Commit successfully pushed to origin
** All done!
Use Case 4: Load Balancer Configs
The bash update_haproxy.sh script
Alerting, Monitoring & Visualization

Use Case 5: Dashboards & Alerting
What's really going on?
Identify metrics that act as signals.
Add alerts after every incident.
Use Case 5: Dashboards & Alerting
What to monitor?
Use Case 5: Dashboards & Alerting
Metric Polling at BrightTag
(Diagram: in each region, mpoller instances poll tagserve, haproxy, datahub, redis, and cassandra, and feed the results into graphite/carbon.)
Storage of historical metrics allows for trending and comparisons.
Aggregation is performed on data retrieval via the webapp.
Use Case 5: Dashboards & Alerting
Graphite
Expose a "metrics" service per region.
Enables a flexible topology.
Use Case 5: Dashboards & Alerting
Branches and Leaves
Use Case 5: Dashboards & Alerting
Metric Aggregation at BrightTag
(Diagram: the dashboard queries a metrics service in each region; each metrics service aggregates tagserve, haproxy, datahub, redis, and cassandra data.)
Use Case 5: Dashboards & Alerting
Realtime Numbers Across Regions
Requests are farmed out to each metrics service.
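A sketch of that fan-out, with fake per-region responses standing in for HTTP calls to each region's metrics service (the endpoints and numbers are invented):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-region endpoints; real code would GET these URLs.
REGION_ENDPOINTS = {
    "us-east-1": "http://metrics-use1/counter/requests",
    "eu-west-1": "http://metrics-euw1/counter/requests",
}

# Stand-in for HTTP responses from each region's metrics service.
FAKE_RESPONSES = {"us-east-1": 1200, "eu-west-1": 800}

def fetch_metric(region):
    # Real code would request REGION_ENDPOINTS[region] over HTTP here.
    return FAKE_RESPONSES[region]

def global_total(regions):
    """Query every region in parallel and aggregate at the caller."""
    with ThreadPoolExecutor(max_workers=len(regions)) as pool:
        return sum(pool.map(fetch_metric, regions))

print(global_total(list(REGION_ENDPOINTS)))  # -> 2000
```

Querying the regions in parallel keeps the dashboard's latency close to the slowest single region rather than the sum of all of them.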
Use Case 5: Dashboards & Alerting
Visualization

Different visualizations tell you different things.
Tattle allows us to alert on any metric in Graphite.
Alerting is done per region.
Use Case 5: Dashboards & Alerting
Alerting
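A sketch of the kind of check a Graphite-backed alerter like Tattle performs (the (value, timestamp) datapoint shape follows Graphite's render API; the threshold logic and series below are invented, not Tattle's actual rules):

```python
def breached(datapoints, threshold, min_hits=3):
    """Alert only when several recent points exceed the threshold, so a
    single noisy sample (or a None gap) does not page anyone."""
    values = [v for v, _ts in datapoints if v is not None]
    return sum(1 for v in values if v > threshold) >= min_hits

# Made-up series of (value, timestamp) pairs with one missing sample.
series = [(120, 1), (450, 2), (None, 3), (480, 4), (510, 5)]
print(breached(series, threshold=400))              # -> True
print(breached(series, threshold=400, min_hits=4))  # -> False
```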
Fabric is push, puppet is pull.

Businesses don't move as fast as infrastructure changes, but configs have to stay up to date all the time.
(/etc/hosts)  (systempoller.py)  (mashed_potatoes.env)  (dataserver.war)
puppet <=====================================> fabric
(real-time up-to-date)  (moderately up-to-date)  (weekly)
Deployment
Fabric vs Puppet
You have to go with what the cloud provider offers.
Not always ideal for every workload.
Designing for the Cloud
Virtual Machines
There Are No Silver Bullets
(but if you find one, let us know)
Questions?