TRANSCRIPT
Docker Cluster Management with ECS
Matt Callanan · [email protected] · linkedin.com/in/matthewcallanan · @mcallana
© 2016 Expedia Group Australia and New Zealand. All rights reserved.
Table of Contents
• How Do We Bootstrap Instances?
• Rolling Update with AutoScaling Group
• How Do We Update Cluster Instances?
• How Do We Detect & Remediate Broken Instances?
• How Do We Analyse Cluster-Wide Issues?
• How Do We Auto-Scale?
• Lessons Learned
• Future Work
Production Clusters – Serving 200 applications
• 14 instances: 56 apps (+ 19 canaries)
• 17 instances: 78 apps (+ 25 canaries)
• 35 instances: 107 apps (+ 23 canaries)
• 5 instances: 7 apps (+ 4 canaries)
Charts produced with c3vis: github.com/ExpediaDotCom/c3vis
How Do We Bootstrap Instances?
How Do We Bootstrap Instances?
• Based on Amazon’s ECS Optimized AMI
  • e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”
• CloudFormation userdata runs at launch time to set up:
  • Networking
  • Security
  • Log forwarding
  • Cron job: Push EC2 statistics and custom metrics
  • Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
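The per-instance daemons above are pinned with ECS ‘start-task’, which (unlike ‘run-task’) bypasses the scheduler and places a task on a named container instance. A minimal sketch, assuming boto3 and hypothetical task family names; the boto3 import is deferred so the request-building logic stands alone:

```python
def build_start_task_request(cluster, container_instance_arn, task_family):
    """Build the StartTask request that pins a task to one instance.

    StartTask places the task on the named container instance directly,
    which is what makes it suitable for per-instance daemons like
    cadvisor and docker-cleanup.
    """
    return {
        "cluster": cluster,
        "taskDefinition": task_family,
        "containerInstances": [container_instance_arn],
    }


def start_instance_daemons(cluster, container_instance_arn):
    import boto3  # deferred: only needed when actually talking to AWS

    ecs = boto3.client("ecs")
    for family in ("cadvisor", "docker-cleanup"):  # hypothetical family names
        ecs.start_task(**build_start_task_request(cluster, container_instance_arn, family))
```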
Rolling Update with AutoScaling Group
[Animation: rolling update with an AutoScaling Group. Old instances enter Terminating:Wait while active tasks are relocated to new instances, leaving “ghost” tasks behind. Legend: Old Instance, New Instance, Terminating:Wait; Active Task, Relocated Task, “Ghost” Task]
How Do We Update Cluster Instances?
Zero-‐Downtime Instance Replacement
• Uses a Lambda to avoid outages in production during a cluster instance rolling update
• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• Lambda deregisters the instance from the ECS cluster
• Lambda also sends a heartbeat to the ASG to keep the instance in Terminating:Wait state for 30mins
  • This is generally enough to allow ECS to reschedule any tasks that are part of a service to another instance
• Downsides:
  • Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
  • 30mins is a long time for old containers to still be registered in the services’ ELBs. Any deploys during that time can cause confusion around why old and new versions of a service are running behind the ELB
  • ECS agent pulls Docker containers serially, so it can take a while to launch a bunch of new tasks
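The Lambda described above can be sketched as follows. This is a minimal illustration, assuming the termination event arrives via an ASG lifecycle hook published to SNS; the cluster name and message fields are assumptions, and boto3 is only imported inside the handler so the parsing logic stands alone:

```python
import json


def parse_lifecycle_message(sns_record):
    """Decode the ASG lifecycle notification carried in an SNS record."""
    return json.loads(sns_record["Sns"]["Message"])


def find_container_instance(ecs, cluster, ec2_instance_id):
    """Map an EC2 instance id to its ECS container-instance ARN
    using the cluster query language filter."""
    arns = ecs.list_container_instances(
        cluster=cluster,
        filter="ec2InstanceId == '{}'".format(ec2_instance_id),
    )["containerInstanceArns"]
    return arns[0]


def handler(event, context):
    import boto3  # deferred so the parsing logic above is testable offline

    ecs = boto3.client("ecs")
    asg = boto3.client("autoscaling")
    cluster = "my-cluster"  # hypothetical; the real Lambda would read this from config
    for record in event["Records"]:
        msg = parse_lifecycle_message(record)
        arn = find_container_instance(ecs, cluster, msg["EC2InstanceId"])
        # Deregistering makes ECS relocate service tasks to other instances
        ecs.deregister_container_instance(cluster=cluster, containerInstance=arn, force=True)
        # The heartbeat keeps the instance in Terminating:Wait while tasks drain
        asg.record_lifecycle_action_heartbeat(
            LifecycleHookName=msg["LifecycleHookName"],
            AutoScalingGroupName=msg["AutoScalingGroupName"],
            LifecycleActionToken=msg["LifecycleActionToken"],
        )
```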
How Do We Detect & Remediate Broken Instances?
How Do We Detect & Remediate Broken Instances?
• Custom CloudWatch metrics
  • How long does “docker images” take? Alarm if longer than 4 seconds for 5mins
  • How long does “docker ps” take? Alarm if longer than 4 seconds for 5mins
  • Is the ECS agent running? Alarm if not for 5mins
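The metric collection above amounts to timing a Docker CLI call from cron and pushing the sample to CloudWatch. A minimal sketch (the metric name and threshold handling are illustrative; the alarm itself lives in CloudWatch, which fires only after the threshold is breached for 5mins):

```python
import subprocess
import time


def measure_cmd_seconds(cmd):
    """Time a command the way the cron job times 'docker ps' / 'docker images'."""
    start = time.monotonic()
    subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def is_slow(seconds, threshold=4.0):
    # Classifies a single sample; the CloudWatch alarm adds the
    # "for 5 minutes" condition on top of these samples.
    return seconds > threshold


# On an instance, the cron job would push the sample, e.g.:
#   cloudwatch.put_metric_data(Namespace="EcsCluster",
#       MetricData=[{"MetricName": "DockerPsSeconds",
#                    "Value": measure_cmd_seconds(["docker", "ps"])}])
```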
• Manual remediation based on email alert
  • Run “evict_instance” script
  • Terminates instance via ASG, which allows the Lambda to deregister and pause termination
  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
How Do We Analyse Cluster-Wide Issues?
How Do We Analyse Cluster-Wide Issues?
• Centralised Logging
  • Forward instance logs to Splunk:
    • /var/log/cfn-*
    • /var/log/ecs*
• Query with timechart
How Do We Auto-Scale?
How Do We Auto-Scale?
• Scale Up:
  • CPU Reservation across entire cluster > 70% for 5mins, or
  • Memory Reservation across entire cluster > 60% for 5mins
• Scale Down:
  • CPU Reservation < 20% for 5mins, or
  • Memory Reservation < 40% for 5mins
Lessons Learned
Lesson #1
Use Immutable Servers with CloudFormation
Lesson #1: Use Immutable Servers with CloudFormation
• cfn-update is dangerous if you don’t know what you’re doing
• Problem:
  • Rolled out a change that configures an extra docker EBS volume on new instances
  • cfn-update ran simultaneously on all old instances
  • Simultaneously restarted docker and deleted /var/lib/docker on all old instances: 5mins prod outage
• Solution:
  • Removed cfn-update from userdata
  • Rename the launch configuration every time to force CFN’s ASG Rolling Update even for minor config changes
  • Changed our mentality by renaming our “update” command to “replace_instances”
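The launch-configuration rename can be as simple as a version or timestamp suffix: CloudFormation only triggers the ASG rolling update when the launch configuration resource is replaced, and a fresh name forces that replacement on every deploy. A minimal sketch (the base name is illustrative):

```python
import time


def versioned_launch_config_name(base_name):
    """Give the launch configuration a fresh name on every deploy.

    CFN triggers the ASG rolling update only when the launch configuration
    is replaced; a timestamp suffix forces replacement even for minor
    config changes (the 'replace_instances' mentality).
    """
    return f"{base_name}-{int(time.time())}"
```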
Lesson #2
Suspend ASG Processes During CFN Rolling Update
Lesson #2: Suspend ASG Processes During CFN Rolling Update
• CFN and ASG are independent services
• Problem:
  • Changed ASG from 1 to 2 subnets as part of a CFN update
  • ASG instantly tries to launch n/2 instances in the new subnet
  • Meanwhile CFN is waiting for 1 signal at a time; times out; rolls back
• Solution:
  • Suspend processes with CFN Update Policy:
    • 'AlarmNotification'
    • 'HealthCheck'
    • 'ReplaceUnhealthy'
    • 'AZRebalance'
CloudFormation Auto-Scaling Group Update Policy
UpdatePolicy: {
  AutoScalingRollingUpdate: {
    MinInstancesInService: current_desired_capacity,
    MaxBatchSize: '1',
    PauseTime: 'PT30M',
    WaitOnResourceSignals: 'true',
    SuspendProcesses: ['AlarmNotification',
                       'HealthCheck',
                       'ReplaceUnhealthy',
                       'AZRebalance']
  }
}
Lesson #3
Don’t Use CloudFormation for Rolling Updates
Lesson #3: Don’t Use CloudFormation for Rolling Updates
• CFN interaction with ASG is too unreliable• Problem:• CFN timed out after not receiving a signal from instance created by ASG• AWS support explained there was an issue with the Auto Scaling service for 3hrs that caused
CloudFormation to experience increased latency when creating, updating and deleting stacks in us-‐east-‐1
• Solution:• Replace CFN rolling update with programmatic logic• Include health checks• Include deregistration logic
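The programmatic replacement can be sketched as a batched loop with an explicit health gate between batches, instead of trusting CFN/ASG signalling. The callback names here are illustrative, not the actual script:

```python
def replacement_batches(old_instance_ids, batch_size=1):
    """Split the old instances into the batches a rolling update walks through."""
    for i in range(0, len(old_instance_ids), batch_size):
        yield old_instance_ids[i:i + batch_size]


def replace_instances(old_instance_ids, deregister, healthy, batch_size=1):
    """Programmatic rolling update skeleton.

    deregister(instance_id): pull the instance out of the ECS cluster
                             (and eventually terminate it via the ASG)
    healthy():               cluster-level health check run between batches
    """
    for batch in replacement_batches(old_instance_ids, batch_size):
        for instance_id in batch:
            deregister(instance_id)
        if not healthy():
            raise RuntimeError("cluster unhealthy; aborting rolling update")
```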
Lesson #4
Scale Down Carefully
Lesson #4: Scale Down Carefully
• Problem:
  1. ASG scales up due to high Memory Reservation
  2. 5mins later ASG scales down due to low CPU Reservation
  3. Repeat from #1
• Solution:
  • Fix scaling dimensions
    • Scale up when either CPU or Memory Reservation is high
    • Scale down only when both are low
  • Tightly control cpu / mem reservations per service
    • Match equal ratios of instance type resources
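The fixed policy above comes down to an OR on the way up and an AND on the way down. A minimal sketch of that decision, using the thresholds from the auto-scaling slide (classifying a single sample; the "for 5 minutes" condition lives in the CloudWatch alarms):

```python
def scale_decision(cpu_reservation, mem_reservation):
    """Classify one sample of cluster-wide reservation percentages.

    Scale up when EITHER dimension is high; scale down only when BOTH
    are low. Scaling down on one low dimension while the other is still
    high just re-triggers the scale-up and the cluster flip-flops.
    """
    if cpu_reservation > 70 or mem_reservation > 60:
        return "up"
    if cpu_reservation < 20 and mem_reservation < 40:
        return "down"
    return "hold"
```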
Future Work
Future Work: “Bulk Instance Replacement”
• Bulk Instance Replacement
  • 1 canary instance
    • Increment DesiredCapacity / MaxSize
    • Add 1 instance to ASG and Cluster
    • Monitor / Test
  • Replace N-1 instances
    • Suspend Processes
    • Add n-1 instances to ASG
    • Deregister / Terminate old n-1 instances
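One detail the steps above rely on: DesiredCapacity can never exceed MaxSize, so adding canary and bulk instances means growing both together. A small sketch of that capacity arithmetic (the step names are illustrative, loosely mirroring the slide):

```python
def grown_capacity(desired, max_size, add):
    """Grow DesiredCapacity by `add`, raising MaxSize if needed,
    since an ASG rejects DesiredCapacity > MaxSize."""
    new_desired = desired + add
    return new_desired, max(max_size, new_desired)


def bulk_replacement_plan(n):
    """Yield the phases of the bulk-replacement idea for an n-instance cluster."""
    yield ("add-canary", 1)   # grow by 1, let the canary join the cluster
    yield ("verify", 1)       # monitor / test the canary
    yield ("add", n - 1)      # add the remaining new instances in one batch
    yield ("retire-old", n)   # deregister / terminate the old instances
```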
Future Work: “Workload Profiles”
• Predictable resource reservation
• Workload Profiles
  • Opinionated resource sizings based on equal CPU / Memory ratio of instance type resources
  • App owners cannot specify cpu / mem; they can only choose from preset profiles
• Downsides:
  • Ties cluster to instance type family
• Example:
  • For “m4” family…
Profile CPU (Cores) Memory (GiB)
Tiny 0.25 1
Small 0.5 2
Medium 1 4
Large 2 8
X.Large 4 16
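The table above can be expressed in the units an ECS task definition actually uses: 1024 cpu units per vCPU and memory in MiB. A sketch of the profile lookup (profile keys are illustrative):

```python
# Profiles from the table, in ECS task-definition units:
# 1024 cpu units per vCPU; memory in MiB (1 GiB = 1024 MiB).
WORKLOAD_PROFILES = {
    "tiny":   {"cpu": 256,  "memory": 1024},
    "small":  {"cpu": 512,  "memory": 2048},
    "medium": {"cpu": 1024, "memory": 4096},
    "large":  {"cpu": 2048, "memory": 8192},
    "xlarge": {"cpu": 4096, "memory": 16384},
}


def reservation_for(profile):
    """App owners pick a profile name; cpu/mem cannot be set directly.
    Every profile keeps the same 1 vCPU : 4 GiB ratio as the m4 family,
    so tasks pack onto instances without stranding either resource."""
    return WORKLOAD_PROFILES[profile]
```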
Future Work: Treat Clusters as Cattle
• Automate all manual aspects of cluster updates
• Building confidence in our automated checks
  • Are there enough IP addresses in target subnets?
  • Is there enough EBS volume space for N instances?
  • Are there enough instances of desired instance type available?
• Packer for building AMIs• Jenkins Pipeline for rolling out with confidence