Deep Dive on Microservices and Amazon ECS

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Kuldeep Chowhan, Expedia Inc.

July 13th 2016

Deep Dive on Microservices and Amazon ECS

The inside scoop on Expedia’s internal cloud deployment tool

Kuldeep Chowhan

Principal Engineer @ Expedia Inc.

@307redirect

Expedia Inc.

One of the world’s leading travel companies

Expedia’s AWS Regions

ECS Production Clusters – Serving 200 applications

14 instances: 56 apps (+ 19 canaries) 17 instances: 78 apps (+ 25 canaries)

35 instances: 107 apps (+ 23 canaries) 5 instances: 7 apps (+ 4 canaries)

Charts produced with c3vis: github.com/ExpediaDotCom/c3vis


Expedia’s internal cloud deployment tool

Primer – Internal Cloud Deployment Tool

Primer – Supported Templates

Primer Architecture

Continuous Delivery to ECS with Primer

Private Docker Registry Hosted on AWS EC2

Automated push to the registry using Jenkins docker plugin

nodejs

Dockerfile for nodejs Primer template

ECS Cluster Management

Expedia’s ECS Base AMI

• Based on Amazon’s ECS Optimized AMI

  • e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”

• CloudFormation userdata runs at launch time to set up:

  • OS Hardening

  • Security

  • Network configuration

  • Log forwarding

  • Cron job: Push EC2 statistics and custom metrics

  • Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
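The last step above (running per-host helper tasks with ‘start-task’) could look roughly like the following sketch. It assumes an ECS client object such as the one returned by boto3.client("ecs") is passed in, and that task definitions named "cadvisor" and "docker-cleanup" are already registered; those names and the helper function are illustrative assumptions, not Expedia's actual userdata.

```python
def start_host_tasks(ecs, cluster, container_instance_arn,
                     task_definitions=("cadvisor", "docker-cleanup")):
    """Start one copy of each helper task on a specific container instance.

    `ecs` is any object exposing the boto3 ECS client's start_task call.
    The task definition names are assumed to be registered already.
    """
    started = []
    for task_definition in task_definitions:
        # start_task places the task on the named instance, bypassing the scheduler
        response = ecs.start_task(
            cluster=cluster,
            taskDefinition=task_definition,
            containerInstances=[container_instance_arn],
        )
        started.extend(task["taskArn"] for task in response.get("tasks", []))
    return started
```

Unlike a service, tasks started this way are pinned to the given instance, which is why they suit per-host agents such as monitoring and cleanup.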

Zero-Downtime Instance Replacement

• Uses a Lambda to avoid outages in production during a cluster instance rolling update

• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events

• Lambda deregisters the instance from the ECS cluster

• Lambda also sends a heartbeat to the ASG to keep the instance in Terminating:Wait state for 30mins

• This is generally enough to allow ECS to reschedule any tasks that are part of a service to another instance

• Downsides:

  • Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced

  • 30 minutes is a long time for old containers to remain registered in the services' ELBs. Any deploy during that window can cause confusion about why old and new versions of a service are both running behind the ELB

  • The ECS agent pulls Docker images serially, so launching a batch of new tasks can take a while

[Diagram: an Old Instance in Terminating:Wait and a New Instance, with Active, Relocated, and “Ghost” Tasks]
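A minimal sketch of the Lambda logic described on this slide, written under several assumptions: the SNS message is an Auto Scaling lifecycle hook notification carrying LifecycleTransition, EC2InstanceId, AutoScalingGroupName, LifecycleHookName and LifecycleActionToken fields; the clients are boto3-style ECS and Auto Scaling clients passed in by the caller; and the cluster has at most 100 instances, so pagination is skipped. The function names are hypothetical, and the slides do not show how the 30-minute wait is sustained, so only a single heartbeat is sent here.

```python
import json

TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance, or None."""
    arns = ecs.list_container_instances(cluster=cluster).get("containerInstanceArns", [])
    if not arns:
        return None
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for container_instance in described["containerInstances"]:
        if container_instance["ec2InstanceId"] == ec2_instance_id:
            return container_instance["containerInstanceArn"]
    return None


def handle_event(event, ecs, autoscaling, cluster):
    """Deregister terminating instances from ECS and send one ASG heartbeat each."""
    handled = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("LifecycleTransition") != TERMINATING:
            continue  # ignore launch and test notifications
        instance_id = message["EC2InstanceId"]
        arn = find_container_instance(ecs, cluster, instance_id)
        if arn is not None:
            # force=True deregisters even while tasks are still running,
            # letting ECS reschedule service tasks onto other instances
            ecs.deregister_container_instance(
                cluster=cluster, containerInstance=arn, force=True
            )
        # The heartbeat extends the time the instance stays in Terminating:Wait
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
        )
        handled.append(instance_id)
    return handled
```

In a real Lambda, a thin handler(event, context) wrapper would create the two boto3 clients once and call handle_event with the cluster name taken from configuration.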

Detecting & Remediating Broken Instances?

• Custom CloudWatch metrics

  • How long does “docker images” take? Alarm if longer than 4 seconds for 5mins

  • How long does “docker ps” take? Alarm if longer than 4 seconds for 5mins

  • Is the ecs agent running? Alarm if not for 5mins

• Manual remediation based on email alert

  • Run “evict_instance” script

  • Terminates instance via ASG – allows Lambda to deregister and pause termination

  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
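The custom timing metrics above could be collected along these lines. This is a sketch only: the metric names, the InstanceId dimension, and the choice to report the timeout value when a command hangs or cannot be run are assumptions, not details from the slides.

```python
import subprocess
import time


def time_command(argv, timeout=30.0):
    """Return how many seconds a command took, or `timeout` if it hung or failed to start."""
    start = time.monotonic()
    try:
        subprocess.run(
            argv,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
            check=False,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Treat a hung or missing docker binary as the worst case so the alarm fires
        return timeout
    return time.monotonic() - start


def metric_datum(name, seconds, instance_id):
    """Build one entry in the shape CloudWatch put_metric_data expects in MetricData."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Value": seconds,
        "Unit": "Seconds",
    }
```

A cron job could call time_command(["docker", "ps"]) and time_command(["docker", "images"]), then pass the resulting datums to the CloudWatch client's put_metric_data under a namespace of its choosing. The 4-second, 5-minute alarm would be defined on those metrics in CloudWatch itself.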

Analyze Cluster-Wide Issues?

• Centralised Logging

• Forward instance logs to Splunk:

• /var/log/cfn-*

• /var/log/ecs*

• Query with timechart

Auto-Scaling ECS Host Instances

• Scale Up:

  • CPU Reservation across entire cluster > 70% for 5mins

  or

  • Memory Reservation across entire cluster > 60% for 5mins

• Scale Down:

  • CPU Reservation < 20% for 5mins

  or

  • Memory Reservation < 40% for 5mins
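To make the thresholds concrete, here is a small sketch of the decision they imply. It assumes each list holds the reservation percentages sampled over the 5-minute window and that a breach must hold for every sample. Giving scale-up precedence when both rules match (for example low CPU but high memory reservation) is an assumption of this sketch; the slides do not say how that conflict is resolved.

```python
def sustained(samples, predicate):
    """True when there is at least one sample and every sample satisfies the predicate."""
    return bool(samples) and all(predicate(value) for value in samples)


def scaling_decision(cpu_reservation, memory_reservation):
    """Return "scale_up", "scale_down", or "hold" for one 5-minute window of samples."""
    if (sustained(cpu_reservation, lambda v: v > 70)
            or sustained(memory_reservation, lambda v: v > 60)):
        return "scale_up"
    if (sustained(cpu_reservation, lambda v: v < 20)
            or sustained(memory_reservation, lambda v: v < 40)):
        return "scale_down"
    return "hold"
```

For example, scaling_decision([10] * 5, [80] * 5) returns "scale_up" here even though the CPU rule alone would scale down, which illustrates why the "or" on the scale-down side needs care.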

Lessons Learnt

• Use Immutable Servers with CloudFormation

• Suspend ASG Processes During CFN Rolling Update

• Scale Down Carefully

Future Work

• Auto Scaling at task level

• Bulk Instance Replacement

• Workload Profiles

• Treat Clusters as Cattle

Thank you!

Kuldeep Chowhan

@307redirect

Q & A
