© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Kuldeep Chowhan, Expedia Inc.
July 13th 2016
Deep Dive on Microservices and
Amazon ECS
The inside scoop on Expedia’s internal cloud deployment tool
Kuldeep Chowhan
Principal Engineer @ Expedia Inc.
@307redirect
Expedia Inc.
One of the world’s leading travel companies
Expedia’s AWS Regions
ECS Production Clusters – Serving 200 applications
14 instances: 56 apps (+ 19 canaries)
17 instances: 78 apps (+ 25 canaries)
35 instances: 107 apps (+ 23 canaries)
5 instances: 7 apps (+ 4 canaries)
Charts produced with c3vis: github.com/ExpediaDotCom/c3vis
Expedia’s internal cloud
deployment tool
Primer – Internal Cloud Deployment Tool
Primer – Supported Templates
Primer Architecture
Continuous Delivery to ECS with Primer
Private Docker Registry Hosted on AWS EC2
Automated push to the registry using Jenkins docker plugin
nodejs
Dockerfile for nodejs Primer template
Demo
ECS Cluster Management
Expedia’s ECS Base AMI
• Based on Amazon’s ECS Optimized AMI
  • e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”
• CloudFormation userdata runs at launch time to set up:
  • OS Hardening
  • Security
  • Network configuration
  • Log forwarding
  • Cron job: Push EC2 statistics and custom metrics
  • Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
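The last bullet (starting per-instance tasks with ‘start-task’) could look roughly like the sketch below. It is a minimal illustration, not Expedia's actual userdata: the function names are hypothetical, the task definition names simply reuse ‘cadvisor’ and ‘docker-cleanup’ from the slide, and `ecs` is assumed to be a boto3 ECS client (or anything with a compatible `start_task` method). The metadata body is assumed to come from the ECS agent's local introspection endpoint (`http://localhost:51678/v1/metadata`).

```python
import json


def parse_agent_metadata(body):
    """Extract the cluster name and container instance ARN from the
    JSON returned by the ECS agent introspection endpoint."""
    meta = json.loads(body)
    return meta["Cluster"], meta["ContainerInstanceArn"]


def start_instance_tasks(ecs, cluster, container_instance_arn,
                         task_definitions=("cadvisor", "docker-cleanup")):
    """Start one copy of each task definition on a single container instance.

    `ecs` is assumed to behave like boto3.client("ecs"). StartTask places the
    task on the named instance directly instead of using the scheduler.
    """
    started = []
    for task_def in task_definitions:
        response = ecs.start_task(
            cluster=cluster,
            taskDefinition=task_def,
            containerInstances=[container_instance_arn],
        )
        started.extend(task["taskArn"] for task in response.get("tasks", []))
    return started
```

On a real instance, the metadata body would be fetched once the ECS agent is up and the result passed to `start_instance_tasks` together with a real client; both steps are left out here so the sketch stays free of network calls.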
Zero-Downtime Instance Replacement
• Uses a Lambda function to avoid production outages during a rolling update of cluster instances
• The Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• The Lambda deregisters the instance from the ECS cluster
• The Lambda also sends a heartbeat to the ASG to keep the instance in the Terminating:Wait state for 30 minutes
• This is generally enough time for ECS to reschedule any tasks that are part of a service onto another instance
• Downsides:
  • Tasks can be rescheduled onto another old instance in the ASG that is itself about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
  • 30 minutes is a long time for old containers to stay registered in the services' ELBs. Any deploy during that window can cause confusion about why old and new versions of a service are running behind the same ELB
  • The ECS agent pulls Docker images serially, so launching a batch of new tasks can take a while
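A minimal sketch of what such a Lambda might do, assuming the SNS message is an Auto Scaling lifecycle hook notification (fields `LifecycleTransition`, `EC2InstanceId`, `LifecycleHookName`, `AutoScalingGroupName`, `LifecycleActionToken`). The function names, the `force=True` choice and the single-page instance lookup are assumptions made for illustration; the slides do not show the real implementation. `ecs` and `autoscaling` are assumed to behave like the corresponding boto3 clients.

```python
import json

# Transition name used by Auto Scaling lifecycle hook notifications.
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Return the ECS container instance ARN for an EC2 instance, or None.

    Assumes the cluster fits in one page of results (no pagination handling).
    """
    arns = ecs.list_container_instances(cluster=cluster).get("containerInstanceArns", [])
    if not arns:
        return None
    described = ecs.describe_container_instances(cluster=cluster, containerInstances=arns)
    for instance in described["containerInstances"]:
        if instance["ec2InstanceId"] == ec2_instance_id:
            return instance["containerInstanceArn"]
    return None


def handle_termination(event, ecs, autoscaling, cluster):
    """Deregister terminating instances from ECS and send a heartbeat to the ASG."""
    handled = []
    for record in event.get("Records", []):
        message = json.loads(record["Sns"]["Message"])
        if message.get("LifecycleTransition") != TERMINATING:
            continue  # ignore test notifications and other transitions
        arn = find_container_instance(ecs, cluster, message["EC2InstanceId"])
        if arn is not None:
            # force=True (an assumption) allows deregistration while tasks still run
            ecs.deregister_container_instance(cluster=cluster, containerInstance=arn, force=True)
        # The heartbeat extends the time the instance stays in Terminating:Wait
        autoscaling.record_lifecycle_action_heartbeat(
            LifecycleHookName=message["LifecycleHookName"],
            AutoScalingGroupName=message["AutoScalingGroupName"],
            LifecycleActionToken=message["LifecycleActionToken"],
        )
        handled.append(message["EC2InstanceId"])
    return handled
```

How the 30-minute window is reached (hook timeout versus repeated heartbeats) is not stated in the slides, so the sketch sends a single heartbeat per event and leaves that configuration out.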
[Diagram sequence: Old Instance (Terminating:Wait) and New Instance. Legend: Active Task, Relocated Task, “Ghost” Task]
Detecting & Remediating Broken Instances?
• Custom CloudWatch metrics
  • How long does “docker images” take? Alarm if longer than 4 seconds for 5 minutes
  • How long does “docker ps” take? Alarm if longer than 4 seconds for 5 minutes
  • Is the ECS agent running? Alarm if not for 5 minutes
• Manual remediation based on email alert
  • Run “evict_instance” script
  • Terminates instance via ASG – allows Lambda to deregister and pause termination
  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
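The timing metrics above could be collected along the lines of the sketch below, for example from the cron job on each instance. This is an illustration under assumptions: the metric names, the `Custom/ECSHost` namespace and the function names are hypothetical, and `cloudwatch` is assumed to behave like a boto3 CloudWatch client. A command that hangs is reported as taking the full timeout, so it still trips a "longer than 4 seconds" alarm.

```python
import subprocess
import time


def time_command(argv, timeout=60.0):
    """Run a command and return its wall-clock duration in seconds.

    A command that exceeds `timeout` is killed and reported as `timeout`.
    """
    start = time.monotonic()
    try:
        subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
                       timeout=timeout, check=False)
    except subprocess.TimeoutExpired:
        return timeout
    return time.monotonic() - start


def build_metric(name, seconds, instance_id):
    """Build one CloudWatch metric datum tagged with the EC2 instance ID."""
    return {
        "MetricName": name,
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Unit": "Seconds",
        "Value": round(seconds, 3),
    }


def publish_timings(cloudwatch, instance_id, commands, namespace="Custom/ECSHost"):
    """Time each command in `commands` (metric name -> argv) and publish the results."""
    data = [build_metric(name, time_command(argv), instance_id)
            for name, argv in commands.items()]
    cloudwatch.put_metric_data(Namespace=namespace, MetricData=data)
    return data
```

A call such as `publish_timings(client, instance_id, {"DockerPsSeconds": ["docker", "ps"], "DockerImagesSeconds": ["docker", "images"]})` would cover the two Docker checks; the ECS agent check would need a separate probe.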
Analyze Cluster-Wide Issues?
• Centralised Logging
• Forward instance logs to Splunk:
• /var/log/cfn-*
• /var/log/ecs*
• Query with timechart
Auto-Scaling ECS Host Instances
• Scale Up:
  • CPU Reservation across entire cluster > 70% for 5 minutes
  or
  • Memory Reservation across entire cluster > 60% for 5 minutes
• Scale Down:
  • CPU Reservation < 20% for 5 minutes
  or
  • Memory Reservation < 40% for 5 minutes
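The thresholds above can be written down as a small decision function, plus a helper that builds CloudWatch alarm parameters for the cluster-level `CPUReservation` and `MemoryReservation` metrics in the `AWS/ECS` namespace. This is a sketch under assumptions: the function names, the alarm naming scheme, the `Average` statistic and the single 300-second evaluation period are illustrative choices, not taken from the slides. Because both rules use "or", a cluster can satisfy scale-up and scale-down at once (for example low CPU but high memory reservation); the sketch resolves that in favour of scaling up.

```python
def scaling_decision(cpu_reservation, memory_reservation):
    """Return "scale_up", "scale_down" or "hold" for cluster-wide reservation percentages."""
    if cpu_reservation > 70 or memory_reservation > 60:
        return "scale_up"  # checked first so scale-up wins when both rules match
    if cpu_reservation < 20 or memory_reservation < 40:
        return "scale_down"
    return "hold"


def reservation_alarm(cluster, metric, operator, threshold, action_arn):
    """Build keyword arguments for a CloudWatch PutMetricAlarm call.

    `metric` is "CPUReservation" or "MemoryReservation"; `operator` is a
    CloudWatch comparison operator such as "GreaterThanThreshold".
    """
    return {
        "AlarmName": f"{cluster}-{metric}-{operator}-{threshold}",  # hypothetical naming
        "Namespace": "AWS/ECS",
        "MetricName": metric,
        "Dimensions": [{"Name": "ClusterName", "Value": cluster}],
        "Statistic": "Average",   # assumption
        "Period": 300,            # one 5-minute period (assumption)
        "EvaluationPeriods": 1,
        "ComparisonOperator": operator,
        "Threshold": float(threshold),
        "AlarmActions": [action_arn],
    }
```

The returned dictionary is meant to be passed as keyword arguments to a boto3 CloudWatch client's `put_metric_alarm`, with `action_arn` pointing at the scaling policy to trigger.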
Lessons Learnt
• Use Immutable Servers with CloudFormation
• Suspend ASG Processes During CFN Rolling Update
• Scale Down Carefully
Future Work
• Auto Scaling at task level
• Bulk Instance Replacement
• Workload Profiles
• Treat Clusters as Cattle