aws re:invent 2016: running batch jobs on amazon ecs (con310)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Asha Chakrabarty, Senior Solutions Architect, AWS Will White, Engineering Lead, Mapbox December 1, 2016 Running Batch Processes on ECS CON310

Upload: amazon-web-services

Post on 16-Apr-2017

TRANSCRIPT

Page 1: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Asha Chakrabarty, Senior Solutions Architect, AWS

Will White, Engineering Lead, Mapbox

December 1, 2016

Running Batch Processes on ECS

CON310

Page 2: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

What to Expect from the Session

• Understand the challenges of running batch processes

• Why Amazon ECS for Batch?

• Architectural Design Patterns

• Best Practices

• Mapbox and Amazon ECS

Page 3: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Challenges of Running Batch Workloads

• Typically resource intensive

• Time constraint for completion

• Potential impact to concurrent batch jobs

• Scaling infrastructure resources

• Ensuring effective resource utilization and cost savings

• Fragile and unreliable

Page 4: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

What Batch Workloads Need

• Reliable

• Easy Development

• Easy Deployment

• High Efficiency

• Low Ops Load

• Cost Effective

Page 5: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Why ECS for Batch Processing?

Page 6: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Cluster Management Made Easy

Nothing to run

Complete state

Control and monitoring

Scale

Page 7: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Performance at Scale

Page 8: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Flexible Container Placement

Applications

Batch jobs

Multiple schedulers

Page 9: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Designed for Use with Other AWS Services

Elastic Load Balancing

Amazon Elastic Block Store

Amazon Virtual Private Cloud

AWS Identity and Access Management

AWS CloudTrail

Page 10: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Security

Your own EC2 instances in a VPC, with all of its security features, provide a high level of isolation.

Page 11: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Key Concepts

Page 12: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Tasks Containers

Clusters Container Instances

Page 13: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Tasks Containers

Clusters Container Instances

Page 14: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Task: A grouping of related containers

Nginx Web Server Rails Application

MySQL Database

Log Collector

Page 15: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Task Definition

{ "family" : "my-website",
  "version" : "1.0",
  "containers" : [
    <<CONTAINER DEFINITIONS>>
  ]
}

Page 16: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Tasks Containers

Clusters Container Instances

Page 17: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Container Definition

Names and identifies your image

Includes default runtime attributes for your container

• Environment Variables

• Port Mappings

• Container entry point and commands

• Resource constraints

• Etc.

Page 18: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Example

{ "name" : "webServer",
  "image" : "nginx:latest",
  "cpu" : 512,
  "memory" : 128,
  "portMappings" : [ { "containerPort" : 9443, "hostPort" : 443 } ],
  "links" : ["rails"],
  "essential" : true
}
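The two snippets above can be combined into a single registration payload. The following is a minimal sketch, not the presenters' code: it assumes the real RegisterTaskDefinition field names (`family` and `containerDefinitions`; ECS assigns revision numbers itself, so the slide's simplified "version" and "containers" keys are dropped). The `rails` container and its image name are hypothetical, added only so that the "links" entry resolves.

```python
import json


def build_task_definition(family, container_definitions):
    """Return keyword arguments shaped for the ECS RegisterTaskDefinition call."""
    names = {c["name"] for c in container_definitions}
    for container in container_definitions:
        # A link must point at another container in the same task definition.
        for target in container.get("links", []):
            if target not in names:
                raise ValueError(
                    f"{container['name']} links to unknown container {target!r}"
                )
    return {"family": family, "containerDefinitions": container_definitions}


# Container definition from the slide.
web_server = {
    "name": "webServer",
    "image": "nginx:latest",
    "cpu": 512,
    "memory": 128,
    "portMappings": [{"containerPort": 9443, "hostPort": 443}],
    "links": ["rails"],
    "essential": True,
}

# Hypothetical second container so the "rails" link resolves.
rails = {
    "name": "rails",
    "image": "example/rails-app:latest",  # assumed image name
    "cpu": 256,
    "memory": 256,
    "essential": True,
}

task_definition = build_task_definition("my-website", [web_server, rails])
print(json.dumps(task_definition, indent=2))

# With boto3 installed and credentials configured, the same dict could be sent as:
#   boto3.client("ecs").register_task_definition(**task_definition)
```

Validating links locally is a design choice for the sketch; ECS performs its own validation when the definition is registered.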

Page 19: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Tasks Containers

Clusters Container Instances

Page 20: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Cluster

Provides a pool of resources for your Tasks

A grouping of Container Instances

Starts empty, dynamically scalable

Page 21: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Tasks Containers

Clusters Container Instances

Page 22: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Container Instance

EC2 instance on which Tasks are scheduled

We provide an ECS-optimized AMI, or you can install the lightweight ECS agent yourself

Registers into cluster upon launch

Different EC2 instance types for variety in resource pool

Page 23: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Architectural Design Patterns

Page 24: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Trigger Batch Processing with Lambda

[Diagram labels: Amazon S3 bucket (source), S3 object, AWS Lambda, ecs:RunTask, Amazon ECS running Task A on container instances in an Auto Scaling group across two Availability Zones, Amazon S3 bucket (target), Amazon CloudWatch, AWS CloudTrail]
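One way to read this pattern is as a Lambda function that turns each S3 object notification into an ecs:RunTask call. The sketch below is an illustration under assumptions, not the presenters' implementation: the cluster, task definition, and container names and the `SOURCE_BUCKET` / `SOURCE_KEY` variable names are hypothetical, and the optional `ecs_client` parameter exists only so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

CLUSTER = "batch-cluster"        # hypothetical cluster name
TASK_DEFINITION = "batch-task"   # hypothetical task definition family
CONTAINER_NAME = "worker"        # hypothetical container name in that task


def build_run_task_requests(event):
    """Translate an S3 event notification into RunTask keyword arguments."""
    requests = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        requests.append({
            "cluster": CLUSTER,
            "taskDefinition": TASK_DEFINITION,
            "count": 1,
            "overrides": {
                "containerOverrides": [{
                    "name": CONTAINER_NAME,
                    "environment": [
                        {"name": "SOURCE_BUCKET", "value": bucket},
                        {"name": "SOURCE_KEY", "value": key},
                    ],
                }]
            },
        })
    return requests


def handler(event, context, ecs_client=None):
    """Lambda entry point: start one ECS task per uploaded object."""
    if ecs_client is None:
        import boto3  # available in the Lambda Python runtime
        ecs_client = boto3.client("ecs")
    responses = [ecs_client.run_task(**req) for req in build_run_task_requests(event)]
    return {"started": len(responses)}
```

Passing the bucket and key as environment overrides keeps the task definition generic, so the same definition can process any object.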

Page 25: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Fleet of Workers with ECS and SQS

[Diagram labels: Amazon S3, DynamoDB, Amazon Kinesis, AWS Lambda, SQS queue, ecs:RunTask, Amazon ECS running Task A on container instances in an Auto Scaling group across two Availability Zones, Amazon CloudWatch, AWS CloudTrail]
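Inside each worker task, the queue-driven pattern usually reduces to a receive, process, delete loop. The following is a minimal sketch assuming a boto3-style SQS client is passed in; `drain_queue`, `process`, and `max_batches` are hypothetical names chosen for illustration.

```python
def drain_queue(sqs_client, queue_url, process, max_batches=1):
    """Poll the queue, run process(body) per message, delete on success.

    Returns the number of messages processed and deleted.
    """
    handled = 0
    for _ in range(max_batches):
        response = sqs_client.receive_message(
            QueueUrl=queue_url,
            MaxNumberOfMessages=10,  # SQS maximum per receive call
            WaitTimeSeconds=20,      # long polling
        )
        # The "Messages" key is absent when the queue returned nothing.
        for message in response.get("Messages", []):
            try:
                process(message["Body"])
            except Exception:
                # Do not delete: the message becomes visible again after
                # its visibility timeout and can be retried.
                continue
            sqs_client.delete_message(
                QueueUrl=queue_url,
                ReceiptHandle=message["ReceiptHandle"],
            )
            handled += 1
    return handled
```

Deleting only after successful processing gives at-least-once behavior, so `process` should tolerate seeing the same message twice.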

Page 26: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Long-running Batch Jobs

• Utilize Spot Instances

• EC2 Spot Blocks for Defined-Duration Workloads

• ECS event stream for CloudWatch Events

• Service Scaling and Monitoring

[Diagram labels: Amazon ECS running Task A, Task B, and Task C on container instances in an Auto Scaling group across two Availability Zones, Amazon CloudWatch, AWS CloudTrail]
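The ECS event stream delivers task state changes to CloudWatch Events, which is useful when Spot capacity disappears mid-job. The sketch below shows one possible consumer, assuming the "ECS Task State Change" event shape (`detail.lastStatus`, `detail.containers[].exitCode`); the function name and the choice to treat a missing exit code as a failure are assumptions made for illustration.

```python
# Event pattern that a CloudWatch Events rule could use to select these events.
EVENT_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
}


def failed_containers(event):
    """Return names of containers in a stopped task that did not exit with 0."""
    detail = event.get("detail", {})
    if event.get("detail-type") != "ECS Task State Change":
        return []
    if detail.get("lastStatus") != "STOPPED":
        return []
    # A container with no exit code never exited normally (for example,
    # its instance went away), so it is counted as failed here.
    return [
        container["name"]
        for container in detail.get("containers", [])
        if container.get("exitCode") != 0
    ]
```

A handler built on this could requeue the job or raise an alarm whenever the returned list is non-empty.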

Page 27: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Best Practices

• Store state and inputs, outputs in S3 or another datastore

• Minimize dependencies between task definitions (they should be independent of each other)

• Use Spot Instances and Spot fleets for long-running batch jobs

• Monitor cluster state with ECS APIs

• Share pools of resources

• Auto Scaling, VPC, IAM, scheduled Reserved Instances
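As an illustration of monitoring cluster state with the ECS APIs, the sketch below summarizes a DescribeClusters response. It assumes a boto3-style ECS client is passed in; `summarize_cluster` and the "pending tasks mean the cluster may need capacity" heuristic are hypothetical and not from the talk.

```python
def summarize_cluster(ecs_client, cluster_name):
    """Return instance and task counts for one cluster via DescribeClusters."""
    response = ecs_client.describe_clusters(clusters=[cluster_name])
    clusters = response.get("clusters", [])
    if not clusters:
        raise LookupError(f"cluster {cluster_name!r} not found")
    cluster = clusters[0]
    summary = {
        "instances": cluster["registeredContainerInstancesCount"],
        "running": cluster["runningTasksCount"],
        "pending": cluster["pendingTasksCount"],
    }
    # Assumed heuristic: tasks stuck in PENDING suggest a lack of capacity.
    summary["may_need_capacity"] = summary["pending"] > 0
    return summary
```

With boto3 available, `summarize_cluster(boto3.client("ecs"), "my-cluster")` would be the real call, where "my-cluster" stands in for an actual cluster name.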

Page 28: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

ECS at Mapbox

Page 30: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Maps

Directions Geocoding

Mobile

Developer tools

Analysis

Page 47: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

3 billion probes = 100 million miles per day

Page 48: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Similar pattern for batch processing

• EC2 instances

• SQS queue

• Error handling / reporting

Page 49: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Introducing Watchbot

Page 50: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

What is watchbot?

A library to help run a highly scalable AWS service that performs data processing tasks in response to external events.

You provide the messages and the logic to process them, while Watchbot makes sure that your processing task is run at least once for each message.

Page 51: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

https://github.com/mapbox/ecs-watchbot

Page 52: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

[Diagram labels: ECS Cluster, SQS, Watcher Container, Running Tasks]

Page 53: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Your task can do anything you want!

• Your task can be anything that works in Docker

• Use any language

• Environment variables as input

• bash exit codes to indicate success/failure/retry

• Do any I/O

• Save outputs to S3 or DynamoDB

Page 54: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Environment Variables

Name                              Description
Subject                           the message's subject
Message                           the message's body
MessageId                         the message's ID defined by SQS
SentTimestamp                     the time the message was sent
ApproximateFirstReceiveTimestamp  the time the message was first received
ApproximateReceiveCount           the number of times the message has been received
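A task can pick these values up with ordinary environment lookups. This is a minimal sketch in Python (any language would do); `read_message` is a hypothetical helper, and converting the receive count to an integer is a convenience assumed here.

```python
import os

# Variable names as listed on the slide.
FIELDS = (
    "Subject",
    "Message",
    "MessageId",
    "SentTimestamp",
    "ApproximateFirstReceiveTimestamp",
    "ApproximateReceiveCount",
)


def read_message(environ=os.environ):
    """Collect the message fields from the task's environment."""
    message = {name: environ.get(name) for name in FIELDS}
    count = message["ApproximateReceiveCount"]
    if count is not None:
        # Environment values are strings; the count is more useful as an int.
        message["ApproximateReceiveCount"] = int(count)
    return message
```

A receive count above 1 tells the task that it is looking at a retry, which can matter for work that is not idempotent.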

Page 55: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Messages

• Use any format as long as your task is equipped to handle it

• JSON can capture more complex

Page 56: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Exit Codes

Exit code  Description             Outcome
0          completed successfully  message is removed from the queue without notification
3          rejected the message    message is removed from the queue and a notification is sent
4          no-op                   message is returned to the queue without notification
other      failure                 message is returned to the queue and a notification is sent
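The table can be expressed as a small lookup, together with a task-side function that chooses an exit code. This sketch only mirrors the table and is not Watchbot's own code: the `ready` field in the message body and the function names are hypothetical.

```python
import json

EXIT_SUCCESS = 0
EXIT_REJECT = 3
EXIT_NOOP = 4


def outcome(exit_code):
    """Map an exit code to (removed_from_queue, notification_sent)."""
    if exit_code == EXIT_SUCCESS:
        return (True, False)
    if exit_code == EXIT_REJECT:
        return (True, True)
    if exit_code == EXIT_NOOP:
        return (False, False)
    return (False, True)  # any other code counts as a failure


def choose_exit_code(body):
    """Decide how a task should exit for a given message body."""
    try:
        payload = json.loads(body)
    except ValueError:
        return EXIT_REJECT  # malformed input will never succeed, so drop it
    if not isinstance(payload, dict):
        return EXIT_REJECT
    if not payload.get("ready", True):
        return EXIT_NOOP    # hypothetical "not ready yet" flag: retry later
    return EXIT_SUCCESS
```

In a real task the result would be passed to `sys.exit(...)` as the last step, so the process exit status carries the decision.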

Page 57: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

More features!

• Logging - write logs to CloudWatch LogGroup

• Send alarms to SNS

• Reduce mode - tracks progress of distributed tasks and runs a reduce task when everything finishes

Page 58: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Why not Lambda?

Watchbot is similar in many regards to AWS Lambda, but is more configurable, more focused on data processing, and not subject to several of Lambda's limitations.

• Full control over execution environment allows you to install anything you want

• No limits on execution time

• No memory limits

• No concurrency limits or account-wide throttling

• Trade-off: no DynamoDB Streams or Kinesis support

Page 59: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Gotcha: EBS Boot

• ECS-optimized instances are only available as EBS boot AMIs, so consider rolling your own instance store AMI

• EBS is more expensive - especially if you are running many instances on Spot

• Slower than ephemeral disks

Page 60: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Gotcha: EBS Boot

Page 61: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Demo!

Page 62: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

https://github.com/mapbox/ecs-telephone

Page 63: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

14 Data Processing Services

3500 Peak Container Instances

500 million Compute Hours Used This Year

Page 64: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Thank you!

Page 65: AWS re:Invent 2016: Running Batch Jobs on Amazon ECS (CON310)

Remember to complete your evaluations!