Deep Dive on Microservices and Amazon ECS

Kuldeep Chowhan, Expedia Inc. July 13th, 2016. © 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Upload: amazon-web-services

Post on 06-Apr-2017

TRANSCRIPT

Page 1: Deep Dive on Microservices and Amazon ECS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Kuldeep Chowhan, Expedia Inc.

July 13th 2016

Deep Dive on Microservices and Amazon ECS

Page 2: Deep Dive on Microservices and Amazon ECS

The inside scoop on Expedia’s internal cloud deployment tool

Page 3: Deep Dive on Microservices and Amazon ECS

Kuldeep Chowhan

Principal Engineer @ Expedia Inc.

@307redirect

Page 4: Deep Dive on Microservices and Amazon ECS

Expedia Inc.

One of the world’s leading travel companies

Page 5: Deep Dive on Microservices and Amazon ECS

Expedia’s AWS Regions

Page 6: Deep Dive on Microservices and Amazon ECS

ECS Production Clusters – Serving 200 applications

• 14 instances: 56 apps (+ 19 canaries)

• 17 instances: 78 apps (+ 25 canaries)

• 35 instances: 107 apps (+ 23 canaries)

• 5 instances: 7 apps (+ 4 canaries)

Charts produced with c3vis: github.com/ExpediaDotCom/c3vis

Page 7: Deep Dive on Microservices and Amazon ECS

Expedia’s internal cloud deployment tool

Page 8: Deep Dive on Microservices and Amazon ECS

Primer – Internal Cloud Deployment Tool

Page 9: Deep Dive on Microservices and Amazon ECS

Primer – Supported Templates

Page 10: Deep Dive on Microservices and Amazon ECS

Primer Architecture

Page 11: Deep Dive on Microservices and Amazon ECS

Continuous Delivery to ECS with Primer

Page 12: Deep Dive on Microservices and Amazon ECS

Private Docker Registry Hosted on AWS EC2

Automated push to the registry using Jenkins docker plugin

nodejs

Page 13: Deep Dive on Microservices and Amazon ECS

Dockerfile for nodejs Primer template

Page 14: Deep Dive on Microservices and Amazon ECS

Demo

Page 15: Deep Dive on Microservices and Amazon ECS

ECS Cluster Management

Page 16: Deep Dive on Microservices and Amazon ECS

Expedia’s ECS Base AMI

• Based on Amazon’s ECS Optimized AMI

  • e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”

• CloudFormation userdata runs at launch time to set up:

  • OS Hardening

  • Security

  • Network configuration

  • Log forwarding

  • Cron job: Push EC2 statistics and custom metrics

  • Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
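The last bullet above can be sketched in Python. This is a minimal illustration, not Expedia's actual userdata: it assumes an `ecs` object that behaves like `boto3.client("ecs")`, and that task definitions named `cadvisor` and `docker-cleanup` already exist (the names are borrowed from the slide and are hypothetical as task definition names). The cluster name and container instance ARN are supplied by the caller.

```python
# Hypothetical task definition names, borrowed from the tool names on the slide.
DAEMON_TASK_DEFINITIONS = ("cadvisor", "docker-cleanup")


def start_daemon_tasks(ecs, cluster, container_instance_arn,
                       task_definitions=DAEMON_TASK_DEFINITIONS):
    """Start one copy of each daemon task on a specific container instance.

    `ecs` is expected to behave like boto3.client("ecs"). Unlike run_task,
    start_task places the task on the container instances you name.
    """
    started = []
    for task_definition in task_definitions:
        response = ecs.start_task(
            cluster=cluster,
            taskDefinition=task_definition,
            containerInstances=[container_instance_arn],
        )
        # start_task reports placement problems in "failures" instead of raising.
        if response.get("failures"):
            raise RuntimeError(
                f"could not start {task_definition}: {response['failures']}"
            )
        started.extend(task["taskArn"] for task in response.get("tasks", []))
    return started
```

Passing the client in as an argument keeps the function easy to exercise with a stub before pointing it at a real cluster.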

Page 17: Deep Dive on Microservices and Amazon ECS

Zero-Downtime Instance Replacement

• Uses a Lambda function to avoid production outages during a rolling update of cluster instances

• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events

• Lambda deregisters the instance from the ECS cluster

• Lambda also sends a heartbeat to the ASG to keep the instance in the Terminating:Wait state for 30 minutes

• This is generally enough time for ECS to reschedule any tasks that are part of a service onto another instance

• Downsides:

  • Tasks can be rescheduled onto another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced

  • 30 minutes is a long time for old containers to stay registered in the services’ ELBs. Any deploys during that time can cause confusion about why old and new versions of a service are both running behind the ELB

  • The ECS agent pulls Docker images serially, so launching a batch of new tasks can take a while
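The Lambda flow described above might look roughly like the following Python sketch. It is an illustration under stated assumptions, not the code from the talk: it assumes the SNS message is an Auto Scaling lifecycle-hook notification (with `LifecycleTransition`, `EC2InstanceId`, `LifecycleHookName`, `AutoScalingGroupName` and `LifecycleActionToken` fields), that `ecs` and `autoscaling` behave like the corresponding boto3 clients, and that the cluster name is passed in by the caller. The function names are hypothetical.

```python
import json

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def find_container_instance_arn(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance ID, or None.

    Only the first page of results is inspected, which is enough for
    clusters of up to 100 instances.
    """
    arns = ecs.list_container_instances(cluster=cluster)["containerInstanceArns"]
    if not arns:
        return None
    described = ecs.describe_container_instances(
        cluster=cluster, containerInstances=arns
    )
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance["containerInstanceArn"]
    return None


def handle_termination(event, ecs, autoscaling, cluster):
    """Deregister terminating instances from ECS and extend their wait state."""
    handled = []
    for record in event["Records"]:
        message = json.loads(record["Sns"]["Message"])
        if message.get("LifecycleTransition") != TERMINATING:
            continue  # ignore test notifications and launch events
        instance_id = message["EC2InstanceId"]
        arn = find_container_instance_arn(ecs, cluster, instance_id)
        if arn is not None:
            # force=True deregisters even though tasks are still running,
            # so the service scheduler starts replacements elsewhere.
            ecs.deregister_container_instance(
                cluster=cluster, containerInstance=arn, force=True
            )
        # The heartbeat keeps the instance in Terminating:Wait for longer.
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
            InstanceId=instance_id,
        )
        handled.append((instance_id, arn))
    return handled
```

In a real Lambda, a thin `handler(event, context)` would create the two boto3 clients once and call `handle_termination`.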

Page 18: Deep Dive on Microservices and Amazon ECS

[Diagram legend: Old Instance, New Instance, Terminating:Wait, Active Task, Relocated Task, “Ghost” Task]

Page 39: Deep Dive on Microservices and Amazon ECS

Detecting & Remediating Broken Instances?

• Custom CloudWatch metrics

  • How long does “docker images” take? Alarm if longer than 4 seconds for 5 minutes

  • How long does “docker ps” take? Alarm if longer than 4 seconds for 5 minutes

  • Is the ECS agent running? Alarm if not for 5 minutes

• Manual remediation based on email alert

  • Run “evict_instance” script

  • Terminates instance via ASG, which allows the Lambda to deregister the instance and pause termination

  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
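The timing metrics above could be collected with something like the following Python sketch. It is only an illustration: the namespace `Custom/ECS` and the metric names are hypothetical, `cloudwatch` is assumed to behave like `boto3.client("cloudwatch")`, and the alarm itself (longer than 4 seconds for 5 minutes) is assumed to be configured separately in CloudWatch.

```python
import subprocess
import time


def time_command(argv, timeout=30.0):
    """Return the seconds `argv` took to run.

    A command that hangs past `timeout` or cannot be started is reported
    as `timeout` seconds, so a wedged Docker daemon shows up as "slow".
    """
    started = time.monotonic()
    try:
        subprocess.run(argv, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, timeout=timeout, check=False)
    except (subprocess.TimeoutExpired, OSError):
        return timeout
    return time.monotonic() - started


def build_datum(metric_name, seconds, instance_id):
    """Build one MetricData entry in the shape put_metric_data expects."""
    return {
        "MetricName": metric_name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Value": seconds,
        "Unit": "Seconds",
    }


def publish_docker_timings(cloudwatch, instance_id,
                           namespace="Custom/ECS", timeout=30.0):
    """Time `docker images` and `docker ps`, then publish both timings."""
    data = [
        build_datum("DockerImagesSeconds",
                    time_command(["docker", "images"], timeout), instance_id),
        build_datum("DockerPsSeconds",
                    time_command(["docker", "ps"], timeout), instance_id),
    ]
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=data)
    return data
```

Run from cron once a minute, this would give CloudWatch one data point per period to evaluate against the 4-second threshold.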

Page 40: Deep Dive on Microservices and Amazon ECS

Analyze Cluster-Wide Issues?

• Centralised Logging

• Forward instance logs to Splunk:

• /var/log/cfn-*

• /var/log/ecs*

• Query with timechart

Page 41: Deep Dive on Microservices and Amazon ECS

Auto-Scaling ECS Host Instances

• Scale Up:

  • CPU Reservation across entire cluster > 70% for 5 minutes

  or

  • Memory Reservation across entire cluster > 60% for 5 minutes

• Scale Down:

  • CPU Reservation < 20% for 5 minutes

  or

  • Memory Reservation < 40% for 5 minutes
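The thresholds above can be written out as a small decision function. This Python sketch is an illustration only: it assumes one reservation sample per minute, so "for 5 minutes" becomes "the last five samples", and it assumes scale-up wins when both rules match at once (for example, low CPU reservation but high memory reservation), which the slide does not specify.

```python
def sustained(samples, predicate, periods=5):
    """True if the most recent `periods` samples all satisfy `predicate`."""
    recent = samples[-periods:]
    return len(recent) == periods and all(predicate(value) for value in recent)


def scaling_decision(cpu_reservation, memory_reservation, periods=5):
    """Return "scale_up", "scale_down" or "none" for per-minute samples.

    Both arguments are lists of cluster-wide reservation percentages,
    oldest first. Scale-up is checked first (an assumption) so that a
    cluster short on one resource is never shrunk.
    """
    if (sustained(cpu_reservation, lambda v: v > 70, periods)
            or sustained(memory_reservation, lambda v: v > 60, periods)):
        return "scale_up"
    if (sustained(cpu_reservation, lambda v: v < 20, periods)
            or sustained(memory_reservation, lambda v: v < 40, periods)):
        return "scale_down"
    return "none"


print(scaling_decision([75, 80, 72, 71, 90], [50] * 5))  # scale_up
print(scaling_decision([10] * 5, [50] * 5))              # scale_down
```

Because the scale-down rule uses "or", a cluster with low CPU reservation can shrink while memory reservation is still moderate, which is one reason to scale down carefully.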

Page 42: Deep Dive on Microservices and Amazon ECS

Lessons Learnt

• Use Immutable Servers with CloudFormation

• Suspend ASG Processes During CFN Rolling Update

• Scale Down Carefully

Page 43: Deep Dive on Microservices and Amazon ECS

Future Work

• Auto Scaling at task level

• Bulk Instance Replacement

• Workload Profiles

• Treat Clusters as Cattle

Page 44: Deep Dive on Microservices and Amazon ECS

Thank you!

Kuldeep Chowhan

[email protected]

@307redirect

Q & A