Docker Cluster Management with ECS

Docker Cluster Management with ECS Matt Callanan [email protected] linkedin.com/in/matthewcallanan @mcallana © 2016 Expedia Group Australia and New Zealand. All rights reserved.

Upload: matt-callanan

Post on 15-Apr-2017


Page 1: Docker Cluster Management with ECS

Docker Cluster Management with ECS

Matt Callanan
[email protected]
linkedin.com/in/matthewcallanan
@mcallana

© 2016 Expedia Group Australia and New Zealand. All rights reserved.

Table of Contents

• How Do We Bootstrap Instances?
• Rolling Update with AutoScaling Group
• How Do We Update Cluster Instances?
• How Do We Detect & Remediate Broken Instances?
• How Do We Analyse Cluster-Wide Issues?
• How Do We Auto-Scale?
• Lessons Learned
• Future Work

Production Clusters: Serving 200 applications

• 14 instances: 56 apps (+ 19 canaries)
• 17 instances: 78 apps (+ 25 canaries)
• 35 instances: 107 apps (+ 23 canaries)
• 5 instances: 7 apps (+ 4 canaries)

Charts produced with c3vis: github.com/ExpediaDotCom/c3vis

How Do We Bootstrap Instances?

• Based on Amazon’s ECS Optimized AMI
  • e.g. “amzn-ami-2016.03.b-amazon-ecs-optimized”
• CloudFormation userdata runs at launch time to set up:
  • Networking
  • Security
  • Log forwarding
  • Cron job: Push EC2 statistics and custom metrics
  • Run ‘cadvisor’ and ‘docker-cleanup’ as ECS Tasks on each instance (using ‘start-task’)
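
The last bullet can be sketched as follows. This is a minimal illustration, not the team's actual userdata: it only builds the request parameters that the ECS StartTask operation accepts (`cluster`, `taskDefinition`, `containerInstances`). The task definition names, cluster name, and ARN are assumptions made up for the example.

```python
from typing import Dict, List, Sequence

# Hypothetical task definition names; the slide only names the two tools.
DAEMON_TASK_DEFINITIONS = ("cadvisor", "docker-cleanup")


def build_start_task_requests(
    cluster: str,
    container_instance_arn: str,
    task_definitions: Sequence[str] = DAEMON_TASK_DEFINITIONS,
) -> List[Dict]:
    """Build one StartTask request per daemon task, pinned to a single instance."""
    return [
        {
            "cluster": cluster,
            "taskDefinition": task_definition,
            # StartTask places the task on exactly the instances listed here.
            "containerInstances": [container_instance_arn],
        }
        for task_definition in task_definitions
    ]


# Made-up cluster name and ARN, for illustration only.
requests = build_start_task_requests(
    "example-cluster",
    "arn:aws:ecs:us-east-1:123456789012:container-instance/example",
)
for request in requests:
    # With boto3 installed: boto3.client("ecs").start_task(**request)
    print(request["taskDefinition"], "->", request["containerInstances"][0])
```

Using StartTask rather than a service keeps placement explicit: each new instance starts its own copy of the helper tasks at boot.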

Rolling Update with AutoScaling Group

[Animated diagram of a rolling update. Legend: instances are shown as Old Instance, New Instance, or Terminating:Wait; tasks are shown as Active Task, Relocated Task, or “Ghost” Task.]

How Do We Update Cluster Instances?

Zero-Downtime Instance Replacement

• Uses a Lambda to avoid outages in production during a cluster instance rolling update
• Lambda is triggered by AutoScaling EC2_INSTANCE_TERMINATE SNS events
• Lambda deregisters the instance from the ECS cluster
• Lambda also sends a heartbeat to the ASG to keep the instance in Terminating:Wait state for 30mins
  • This is generally enough to allow ECS to reschedule any tasks that are part of a service to another instance
• Downsides:
  • Tasks can get rescheduled to another old instance in the ASG that is about to be replaced, so tasks can get bumped from instance to instance until all instances are replaced
  • 30mins is a long time for old containers to still be registered in the services' ELBs. Any deploys during that time can cause confusion around why old and new versions of a service are running behind the ELB
  • ECS agent pulls Docker images serially, so it can take a while to launch a bunch of new tasks
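
A rough sketch of the Lambda's decision logic is shown below. It assumes the SNS message is an Auto Scaling lifecycle hook notification (fields `EC2InstanceId`, `AutoScalingGroupName`, `LifecycleHookName`) and that the function is re-invoked periodically with the elapsed time; both are assumptions, as is the shape of the returned action list. No AWS calls are made, the code only plans them.

```python
import json
from typing import Dict, List

DRAIN_WINDOW_SECS = 30 * 60  # the 30mins from the slide


def parse_lifecycle_message(sns_event: Dict) -> Dict[str, str]:
    """Pull the lifecycle hook fields out of the first SNS record of a Lambda event."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    return {
        "instance_id": message["EC2InstanceId"],
        "asg_name": message["AutoScalingGroupName"],
        "hook_name": message["LifecycleHookName"],
    }


def plan_actions(details: Dict[str, str], elapsed_secs: int) -> List[Dict[str, str]]:
    """Decide which AWS calls one invocation should make (nothing is called here)."""
    actions: List[Dict[str, str]] = []
    if elapsed_secs == 0:
        # First invocation: take the instance out of the ECS cluster so the
        # scheduler moves its service tasks elsewhere. The real call needs the
        # container instance ARN, which has to be looked up from the EC2 id.
        actions.append({"call": "ecs.deregister_container_instance",
                        "instance_id": details["instance_id"]})
    if elapsed_secs < DRAIN_WINDOW_SECS:
        # Keep the instance in Terminating:Wait while tasks are rescheduled.
        actions.append({"call": "autoscaling.record_lifecycle_action_heartbeat",
                        "asg_name": details["asg_name"],
                        "hook_name": details["hook_name"],
                        "instance_id": details["instance_id"]})
    return actions


# Made-up event for illustration.
example_event = {"Records": [{"Sns": {"Message": json.dumps({
    "EC2InstanceId": "i-0123456789abcdef0",
    "AutoScalingGroupName": "example-asg",
    "LifecycleHookName": "example-terminate-hook",
})}}]}
details = parse_lifecycle_message(example_event)
first_actions = plan_actions(details, elapsed_secs=0)
```

Once the drain window has passed the plan is empty, which lets the lifecycle hook time out and the termination proceed.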

[Animated diagram of the instance replacement sequence. Legend: instances are shown as Old Instance, New Instance, or Terminating:Wait; tasks are shown as Active Task, Relocated Task, or “Ghost” Task.]

How Do We Detect & Remediate Broken Instances?

• Custom CloudWatch metrics
  • How long does “docker images” take? Alarm if longer than 4 seconds for 5mins
  • How long does “docker ps” take? Alarm if longer than 4 seconds for 5mins
  • Is the ECS agent running? Alarm if not for 5mins
• Manual remediation based on email alert
  • Run “evict_instance” script
  • Terminates instance via ASG, which allows Lambda to deregister and pause termination
  • aws autoscaling terminate-instance-in-auto-scaling-group --region $REGION --instance-id $INSTANCE_ID --no-should-decrement-desired-capacity
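
The timing metrics above could be collected along these lines. This is a sketch under assumptions: one sample is taken per minute, and the alarm rule is evaluated locally here only to make the "longer than 4 seconds for 5mins" condition concrete (in the setup described, CloudWatch evaluates the alarm). The function names are made up for the example.

```python
import subprocess
import time
from typing import List, Sequence


def time_command(argv: Sequence[str]) -> float:
    """Return wall-clock seconds taken by a command, e.g. ["docker", "ps"]."""
    start = time.monotonic()
    subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL, check=False)
    return time.monotonic() - start


def should_alarm(samples: List[float], threshold_secs: float = 4.0, periods: int = 5) -> bool:
    """True if each of the last `periods` samples exceeded the threshold."""
    recent = samples[-periods:]
    return len(recent) == periods and all(sample > threshold_secs for sample in recent)


# Five slow one-minute samples in a row trip the alarm; one fast sample resets it.
slow_for_five_minutes = should_alarm([0.3, 4.5, 5.1, 6.0, 4.2, 7.9])
recovered = should_alarm([4.5, 5.1, 0.4, 4.2, 7.9])
```

A cron job would call `time_command(["docker", "ps"])` each minute and push the result as a custom metric.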

How Do We Analyse Cluster-Wide Issues?

• Centralised Logging
  • Forward instance logs to Splunk:
    • /var/log/cfn-*
    • /var/log/ecs*
• Query with timechart

How Do We Auto-Scale?

• Scale Up:
  • CPU Reservation across entire cluster > 70% for 5mins
  or
  • Memory Reservation across entire cluster > 60% for 5mins
• Scale Down:
  • CPU Reservation < 20% for 5mins
  or
  • Memory Reservation < 40% for 5mins
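
To make "reservation across entire cluster" concrete, the sketch below computes it as reserved capacity divided by registered capacity, summed over all instances, and applies the scale-up thresholds from the slide. The dictionary keys are hypothetical simplifications of what ECS reports per container instance.

```python
from typing import Dict, List

SCALE_UP_CPU_PCT = 70.0
SCALE_UP_MEMORY_PCT = 60.0


def reservation_percent(instances: List[Dict[str, int]], resource: str) -> float:
    """Cluster-wide reservation: reserved capacity as a percentage of registered capacity."""
    registered = sum(instance[f"registered_{resource}"] for instance in instances)
    remaining = sum(instance[f"remaining_{resource}"] for instance in instances)
    if registered == 0:
        return 0.0
    return 100.0 * (registered - remaining) / registered


def should_scale_up(instances: List[Dict[str, int]]) -> bool:
    """Scale up when either dimension is above its threshold."""
    return (reservation_percent(instances, "cpu") > SCALE_UP_CPU_PCT
            or reservation_percent(instances, "memory") > SCALE_UP_MEMORY_PCT)


# Two made-up instances: 4096 CPU units and 16384 MiB each.
busy_cluster = [
    {"registered_cpu": 4096, "remaining_cpu": 1024, "registered_memory": 16384, "remaining_memory": 8192},
    {"registered_cpu": 4096, "remaining_cpu": 1024, "registered_memory": 16384, "remaining_memory": 8192},
]
print(reservation_percent(busy_cluster, "cpu"), should_scale_up(busy_cluster))
```

Because the metric is based on reservations rather than utilisation, it reflects what the scheduler can still place, not how busy the containers are.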

Lessons Learned

Lesson #1: Use Immutable Servers with CloudFormation

• cfn-update is dangerous if you don’t know what you’re doing
• Problem:
  • Rolled out a change that configures an extra Docker EBS volume on new instances
  • cfn-update ran simultaneously on all old instances
  • Simultaneously restarted Docker and deleted /var/lib/docker on all old instances, causing a 5mins prod outage
• Solution:
  • Removed cfn-update from userdata
  • Rename the launch configuration every time to force CFN’s ASG Rolling Update even for minor config changes
  • Changed our mentality by renaming our “update” command to “replace_instances”
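
One way to "rename the launch configuration every time" is to append a timestamp to the resource's name when the template is generated. The sketch below is an assumption about how that could be done, not the team's tooling; the base name is made up. Only letters and digits are used so the result stays valid as a CloudFormation logical ID.

```python
import time
from typing import Optional


def launch_config_name(base: str, now: Optional[float] = None) -> str:
    """Return a launch configuration name that changes on every rollout.

    `now` is seconds since the epoch; when omitted the current time is used.
    """
    stamp = time.strftime("%Y%m%d%H%M%S", time.gmtime(now))
    return f"{base}{stamp}"


# Hypothetical base name; a fixed timestamp is passed here so the result is repeatable.
name = launch_config_name("EcsLaunchConfig", now=0)
print(name)
```

Since the name differs on each rollout, CloudFormation sees a new resource and replaces every instance instead of updating any in place.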

Lesson #2: Suspend ASG Processes During CFN Rolling Update

• CFN and ASG are independent services
• Problem:
  • Changed ASG from 1 to 2 subnets as part of CFN update
  • ASG instantly tries to launch n/2 instances in the new subnet
  • Meanwhile CFN is waiting for 1 signal at a time, times out, and rolls back
• Solution:
  • Suspend processes with CFN Update Policy:
    • 'AlarmNotification'
    • 'HealthCheck'
    • 'ReplaceUnhealthy'
    • 'AZRebalance'

CloudFormation Auto-Scaling Group Update Policy

UpdatePolicy: {
  AutoScalingRollingUpdate: {
    MinInstancesInService: current_desired_capacity,
    MaxBatchSize: '1',
    PauseTime: 'PT30M',
    WaitOnResourceSignals: 'true',
    SuspendProcesses: ['AlarmNotification', 'HealthCheck', 'ReplaceUnhealthy', 'AZRebalance']
  }
}
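
For readers who generate templates from Python, the same policy can be produced as a plain dictionary and serialised to JSON. This is only a sketch of the snippet above; how `current_desired_capacity` is obtained is not shown in the slides, so here it is simply passed in.

```python
import json
from typing import Dict


def rolling_update_policy(current_desired_capacity: int) -> Dict:
    """Build the UpdatePolicy attribute for the Auto Scaling group resource."""
    return {
        "AutoScalingRollingUpdate": {
            # Keep the whole fleet in service while one instance at a time is replaced.
            "MinInstancesInService": str(current_desired_capacity),
            "MaxBatchSize": "1",
            "PauseTime": "PT30M",
            "WaitOnResourceSignals": "true",
            "SuspendProcesses": [
                "AlarmNotification",
                "HealthCheck",
                "ReplaceUnhealthy",
                "AZRebalance",
            ],
        }
    }


# Example with a made-up desired capacity of 14 instances.
policy = rolling_update_policy(14)
print(json.dumps({"UpdatePolicy": policy}, indent=2))
```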

Lesson #3: Don’t Use CloudFormation for Rolling Updates

• CFN interaction with ASG is too unreliable
• Problem:
  • CFN timed out after not receiving a signal from an instance created by the ASG
  • AWS support explained there was an issue with the Auto Scaling service for 3hrs that caused CloudFormation to experience increased latency when creating, updating and deleting stacks in us-east-1
• Solution:
  • Replace CFN rolling update with programmatic logic
  • Include health checks
  • Include deregistration logic
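
The "programmatic logic" is not shown in the slides. The sketch below is one possible shape for it, with every AWS interaction passed in as a callable so the control flow can be read (and tested) on its own. All function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, List


def replace_instances(
    old_instance_ids: Iterable[str],
    launch_replacement: Callable[[], str],
    is_healthy: Callable[[str], bool],
    deregister: Callable[[str], None],
    terminate: Callable[[str], None],
    wait: Callable[[], None] = lambda: None,
    max_health_checks: int = 30,
) -> List[str]:
    """Replace instances one at a time, touching an old one only after its
    replacement has passed a health check."""
    replaced: List[str] = []
    for old_id in old_instance_ids:
        new_id = launch_replacement()
        for _ in range(max_health_checks):
            if is_healthy(new_id):
                break
            wait()  # e.g. sleep between polls in a real rollout
        else:
            # Stop the rollout rather than remove capacity we could not replace.
            raise RuntimeError(f"{new_id} never became healthy; {old_id} left untouched")
        deregister(old_id)  # drain from the ECS cluster first
        terminate(old_id)
        replaced.append(old_id)
    return replaced
```

Unlike a CFN rolling update that waits on a single signal, a failed health check here halts the loop with the old instance still serving.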

Lesson #4: Scale Down Carefully

• Problem:
  1. ASG scales up due to high Memory Reservation
  2. 5mins later ASG scales down due to low CPU Reservation
  3. Repeat from #1
• Solution:
  • Fix scaling dimensions
    • Scale up when either CPU or Memory Reservation is high
    • Scale down only when both are low
  • Tightly control cpu / mem reservations per service
    • Match equal ratios of instance type resources
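
The oscillation and its fix can be shown with a few lines of Python. The thresholds are the ones from the auto-scaling slide; the function names are made up, and the 5-minute evaluation window is left out to keep the comparison simple.

```python
def naive_scale_down(cpu_pct: float, mem_pct: float) -> bool:
    """Original rule: scale down when EITHER reservation is low."""
    return cpu_pct < 20 or mem_pct < 40


def scaling_decision(cpu_pct: float, mem_pct: float) -> str:
    """Fixed rule: scale up on either dimension, scale down only when both are low."""
    if cpu_pct > 70 or mem_pct > 60:
        return "scale_up"
    if cpu_pct < 20 and mem_pct < 40:
        return "scale_down"
    return "hold"


# A memory-heavy cluster: memory reservation is high while CPU reservation is low.
cpu_pct, mem_pct = 15.0, 65.0
print(naive_scale_down(cpu_pct, mem_pct))   # the old rule wants to remove capacity
print(scaling_decision(cpu_pct, mem_pct))   # while memory still demands more
```

With the original rules both alarms are satisfied at once for this cluster, which is the loop described in the problem list.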

Future Work: “Bulk Instance Replacement”

• Bulk Instance Replacement
  • 1 canary instance
    • Increment DesiredCapacity / MaxSize
    • Add 1 instance to ASG and Cluster
    • Monitor / Test
  • Replace N-1 instances
    • Suspend Processes
    • Add N-1 instances to ASG
    • Deregister / Terminate old N-1 instances

Future Work: “Workload Profiles”

• Predictable resource reservation
• Workload Profiles
  • Opinionated resource sizings based on equal CPU / Memory ratio of instance type resources
  • App owners cannot specify cpu / mem; they can only choose from preset profiles
• Downsides:
  • Ties cluster to instance type family
• Example:
  • For “m4” family…

Profile   CPU (Cores)   Memory (GiB)
Tiny      0.25          1
Small     0.5           2
Medium    1             4
Large     2             8
X.Large   4             16
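
As an illustration of how such profiles might be applied, the table can be turned into a lookup that yields the values an ECS container definition expects: `cpu` in CPU units (1024 per core) and `memory` in MiB. The lookup function and its name are assumptions; only the profile sizes come from the table.

```python
from typing import Dict, Tuple

CPU_UNITS_PER_CORE = 1024  # ECS reserves CPU in units of 1/1024 of a core
MIB_PER_GIB = 1024

# (cores, GiB) per profile, copied from the table for the "m4" family.
PROFILES: Dict[str, Tuple[float, int]] = {
    "Tiny": (0.25, 1),
    "Small": (0.5, 2),
    "Medium": (1, 4),
    "Large": (2, 8),
    "X.Large": (4, 16),
}


def container_resources(profile: str) -> Dict[str, int]:
    """Translate a profile name into ECS container definition values."""
    cores, gib = PROFILES[profile]
    return {"cpu": int(cores * CPU_UNITS_PER_CORE), "memory": gib * MIB_PER_GIB}


for profile_name in PROFILES:
    print(profile_name, container_resources(profile_name))
```

Every profile keeps the same 1 core to 4 GiB ratio, so CPU and memory fill up together as tasks are placed.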

Future Work: Treat Clusters as Cattle

• Automate all manual aspects of cluster updates
• Building confidence in our automated checks
  • Are there enough IP addresses in target subnets?
  • Is there enough EBS volume space for N instances?
  • Are there enough instances of desired instance type available?
• Packer for building AMIs
• Jenkins Pipeline for rolling out with confidence
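
The first of the automated checks could look something like this sketch. It assumes one IP address per instance and that new instances are spread evenly across the target subnets; both are simplifications. `AvailableIpAddressCount` is the field EC2 reports for each subnet, while the function name and sample data are made up.

```python
from typing import Dict, List


def enough_ip_addresses(subnets: List[Dict], new_instances: int) -> bool:
    """Check that every target subnet can absorb its share of new instances."""
    if not subnets:
        return False
    # Ceiling division: the largest share any one subnet may have to take.
    per_subnet = -(-new_instances // len(subnets))
    return all(subnet["AvailableIpAddressCount"] >= per_subnet for subnet in subnets)


# Made-up subnets shaped like entries from an EC2 DescribeSubnets response.
target_subnets = [
    {"SubnetId": "subnet-aaaa", "AvailableIpAddressCount": 10},
    {"SubnetId": "subnet-bbbb", "AvailableIpAddressCount": 3},
]
print(enough_ip_addresses(target_subnets, new_instances=6))
print(enough_ip_addresses(target_subnets, new_instances=8))
```

Running a check like this before a rollout turns a mid-update launch failure into an early, cheap refusal to start.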

Q & A

Thanks!

Any Questions?

Matt Callanan
[email protected]
linkedin.com/in/matthewcallanan
@mcallana