Amazon ECS at Coursera: A unified execution framework while defending against untrusted code


Upload: brennan-saeta

Post on 18-Jan-2017


Page 1: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Frank Chen, Coursera
Brennan Saeta, Coursera

October 2015

CMP406

Amazon ECS at Coursera
Powering a general-purpose near-line execution microservice, while defending against untrusted code

Page 2: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

What to Expect from the Session

• Techniques for a unified near-line, batch, and scheduled micro-service powered by Amazon ECS

• Security vulnerabilities and countermeasures when running untrusted code in Docker with Amazon ECS

• Reasons to modify the Amazon ECS agent

Page 3: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Session Outline

• Introduction to Coursera
• Near-line, batch and scheduled job execution framework
  • Motivations and background
  • Amazon ECS benefits and limitations
  • Iguazú and its architecture
• Evaluating programming assignments
  • System requirements
  • Security threat model
  • Attacks and defenses

Page 7: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Education at Scale

15 million learners worldwide

2.5 million course completions

1,300+ courses

125+ partners

Page 8: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

A unified execution framework

Page 9: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Batch Processing Enables…

Reporting

Instructor Reports
• Grade exports
• Learner demographics
• Course progress statistics

Internal Reports
• Business metrics
• Payments reconciliation

Page 10: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Scheduled Processing Enables…

Marketing
• Recommendation emails
• Targeted marketing / reactivation emails

Page 11: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Nearline Processing Enables…

Pedagogical Innovations
• Peer-review matching & analysis
• Auto-graded programming assignments

Page 12: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

The early days…

January 2012

Page 13: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Bad Old Days of Batch Processing @ Coursera

Cascade
• PHP-based job runner
• Originally ran in screen sessions
• Polled APIs for new jobs
• Forced restarts on a regular basis due to unidentified memory leaks
• Fragile and unreliable

The early days…

Page 14: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Bad Old Days of Batch Processing @ Coursera

Saturn
• Scala scheduled batch job runner
• Powered by Quartz Scheduler library
• Better than Cascade, but…
• All jobs ran on same JVM, causing interference

The not-so early days?

Page 15: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Looking for something better…

Page 16: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

What We Wanted

• Reliable
• Easy Development
• Easy Deployment
• High Efficiency
• Low Ops Load
• Cost Effective

Page 22: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

What Else Did We Look At?

Home-grown Tech

• Tried, but proved to be unreliable

• Difficult to handle coordination and synchronization

• Powerful, but hard to productionize

• Needs developers with experience

• Designed for GCE first

• Not a managed service, higher Ops load

Page 23: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Amazon ECS to the Rescue

AWS re:Invent 2014 – Dr. Werner Vogels introducing Amazon ECS

Screenshot from https://www.youtube.com/watch?v=LE5uBqNp2Ds by Amazon Web Services

Page 24: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Amazon ECS to the Rescue

Little maintenance

Integrated with rest of AWS

Easy to develop for

Page 27: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

However…

Amazon ECS is a great building block, but we still need to build tools around it for our purposes.

Page 28: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

What We Built: Iguazú

Marissa Strniste (https://www.flickr.com/photos/mstrniste/5999464924) CC-BY-2.0

• Batch Job Scheduler for Amazon ECS
  • Immediately
  • Deferred (run once at X time)
  • Scheduled recurring (cron-like)
• Programmatically accessible internally via our standard APIs and clients
• Named for Iguazú falls
  • World’s largest waterfall by volume
  • We hope Iguazú handles a similar volume of jobs
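
The three invocation modes listed above (immediate, deferred, recurring) could be modeled roughly as follows. This is a minimal sketch, not Iguazú's actual data model: the `Invocation` class, the `next_run` function, and the use of a fixed interval in place of a real cron expression are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Invocation:
    """Hypothetical job request; field names are illustrative only."""
    job_name: str
    run_at: Optional[datetime] = None   # deferred: run once at this time
    every: Optional[timedelta] = None   # recurring: stand-in for a cron expression


def next_run(inv: Invocation, now: datetime,
             last_run: Optional[datetime] = None) -> Optional[datetime]:
    """Return when the job should next run, or None if it is finished."""
    if inv.every is not None:
        # Recurring jobs start right away, then repeat at a fixed interval.
        return now if last_run is None else last_run + inv.every
    if last_run is not None:
        return None  # immediate and deferred jobs run exactly once
    if inv.run_at is not None:
        return max(inv.run_at, now)  # a deferred time in the past runs now
    return now  # immediate
```

A scheduler loop would poll `next_run` for each stored invocation and hand due jobs to the backend; how Iguazú actually divides that work between its scheduler and backend is not described in the slides.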

Page 29: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Iguazú: Architecture

[Architecture diagram; labeled components: Users, Devs, Services, Iguazú Frontend, Iguazú Admin, Iguazú Scheduler, Iguazú Backend, Cassandra, SQS, ECS API, ECS Workers]

Page 37: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Developing Iguazú Jobs

class Job extends AbstractJob with StrictLogging {
  override val reservedCpu = 1024 // 1 CPU core
  override val reservedMemory = 1024 // 1 GB RAM

  def run(parameters: JsValue) = {
    logger.info("I am running my job! ")
    expensiveComputationHere()
  }
}

Page 38: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Running Jobs from Other Services

// invoking a job with one function call
// from another service via Naptime RPC/REST framework

val invocationId = IguazuJobInvocationClient
  .create(IguazuJobInvocationRequest(
    jobName = "exportQuizGrades",
    parameters = quizParams))

Page 39: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Iguazú: Developer / Ops User Interface

Page 40: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Deploying Jobs

Easy Deployment
1. Developers merge into master. Done!

Jenkins Build Steps:
2. Builds zip package from master
3. Prepares Docker image with zip file
4. Pushes image into Docker registry
5. Registers updated jobs with Amazon ECS API
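
Step 5 amounts to registering an ECS task definition per job. The sketch below only builds the request payload, using the 1024 CPU units and 1024 MiB of memory from the job example earlier in the deck. The function name and the image URL are hypothetical, and how Coursera's Jenkins step actually calls the API is not shown in the slides.

```python
def task_definition_payload(job_name, image, cpu_units=1024, memory_mib=1024):
    """Build a RegisterTaskDefinition request body for one job (illustrative)."""
    return {
        "family": job_name,
        "containerDefinitions": [
            {
                "name": job_name,
                "image": image,
                "cpu": cpu_units,      # ECS CPU units; 1024 units is one core
                "memory": memory_mib,  # hard memory limit in MiB
                "essential": True,
            }
        ],
    }


# Hypothetical image name; with boto3 installed, the payload could be sent as
# boto3.client("ecs").register_task_definition(**payload).
payload = task_definition_payload(
    "exportQuizGrades", "registry.example.com/iguazu-jobs:latest")
```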

Page 41: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Logs

• Logs are in /var/lib/docker/containers/*

• Upload into log analysis service (Sumologic)

• Wrapper prints out job name and job ID at the start for easy searching

• Good enough for now
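
A wrapper of the kind described above might look like this minimal sketch. The banner format and the names `banner` and `run_with_banner` are assumptions; the slides do not show the real wrapper.

```python
import sys


def banner(job_name, job_id):
    """One searchable line that identifies the run in aggregated logs."""
    return f"iguazu job_name={job_name} job_id={job_id}"


def run_with_banner(job_name, job_id, job, out=sys.stdout):
    """Print the banner first, then run the job callable and return its result."""
    print(banner(job_name, job_id), file=out, flush=True)
    return job()
```

Because the banner is the first line the container writes, searching the log service for the job ID finds the start of that run's output.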

Page 42: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Metrics

• Using third-party metrics collector (Datadog)

• Metrics for both jobs and container instances

• So long as the worker machines can talk to the Internet, things will work out pretty well

Page 43: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Since April 2015…

65 jobs in production

>1000 runs per day

44 different scheduled jobs

Page 44: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Evaluating Programming Assignments

Page 47: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Programming Assignments at Coursera

Page 48: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

The Security Challenge

Compiling and running untrusted, arbitrary code in Amazon EC2

Would you like to compile and run C code from random people on the Internet on your servers?

Page 49: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

1st Generation System

• Class graders in separate AWS acct
• Custom grader systems on cloud providers
• Course grader under the instructor’s desk

[Diagram labels: Learners, Coursera Servers, Queue Service]

Page 50: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

1st Generation System: Weaknesses

• No Auto Scaling
• No standard security
• Graders crashed

Page 53: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Design Goals

• Cost Savings
• No Maintenance
• Near Real-time
• Secure Infrastructure

Page 57: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Threat Model

Prevent submitted code from:
• impacting the evaluation of other submissions
• disrupting the grading environment (e.g., DoS)
• affecting the rest of the Coursera learning platform

Additional goals:
• Minimize exfiltration of information
  • Test cases, solutions, etc…
• Minimize risk of submissions changing own scores
• Avoid turning into bitcoin miners or part of botnet

Page 58: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Threat Model - Assumptions

• Run arbitrary binaries

• Instructor grading scripts may have vulnerabilities
  • ∴ Grading code is untrusted

• Unknown vulnerabilities in Docker and Linux name-spacing and/or container implementation

Page 59: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attack / Vulnerability Classes

Divided into 2 main categories:
• Assuming basic containers are secure, prevent any negative impacts of running arbitrary code.

• Assuming basic container technology is vulnerable, mitigate negative impacts as much as possible.

Page 60: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

What We Built: GrID

Patrick Hoesly (https://www.flickr.com/photos/zooboing/5665221326/) CC-BY-2.0

• Service + architecture for grading programming assignments
• Builds on Amazon ECS and Iguazú
• Named for Tron’s “digital frontier”
  • Backronym: Grading Inside Docker

Page 61: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

High-level GrID Architecture

[Architecture diagram; labeled components: Learners, GrID, Iguazú, S3 Bucket, ECS APIs, Grading Machines, VPC Firewalls; account boundaries: Coursera Production Account, Coursera GrID Grading Account]

Page 65: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: Resource Exhaustion

Defenses:
• Docker / CGroups:
  • CPU quotas
  • Memory limits
  • Swap limits
• Hard timeouts for container execution
• btrfs limits
  • file system storage quotas
• IOPS throttling
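
The cgroup-based limits in this list map onto standard `docker run` flags. The helper below is a sketch that only assembles those flags. The function name and default values are assumptions, and the deck does not say which exact settings GrID uses or how they are passed through the ECS agent.

```python
def resource_limit_flags(cpu_fraction, memory_mb, swap_mb=0, cpu_period_us=100_000):
    """Build `docker run` flags for CPU quota, memory, and swap limits."""
    quota_us = int(cpu_fraction * cpu_period_us)
    return [
        f"--cpu-period={cpu_period_us}",
        f"--cpu-quota={quota_us}",   # CPU time allowed per period, in microseconds
        f"--memory={memory_mb}m",
        # --memory-swap is memory plus swap combined, so the default
        # swap_mb=0 leaves the container with no swap at all.
        f"--memory-swap={memory_mb + swap_mb}m",
    ]
```

For example, `resource_limit_flags(0.5, 512)` caps a grader at half a core and 512 MB with no swap. Hard execution timeouts, btrfs quotas, and IOPS throttling need separate mechanisms and are not covered by this sketch.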

Page 66: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: Kernel Resource Exhaustion

Defenses:
• Open file limits per container (nofile)

• nproc Process limits

• Limit kernel memory per cgroup

• Limit execution time
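
The `nofile` and `nproc` limits are ordinary Linux rlimits, which Docker can set per container through its `--ulimit` flag. As a sketch of the underlying mechanism only, the code below applies the same kind of limits to a child process with Python's standard `resource` module (Unix only). The function name and default values are assumptions, and kernel memory limits are a cgroup setting that this sketch does not cover.

```python
import resource
import subprocess
import sys


def run_limited(argv, max_open_files=64, max_processes=None,
                cpu_seconds=10, wall_timeout_s=30):
    """Run argv with rlimits applied in the child just before exec."""

    def apply_limits():
        # Runs in the child between fork and exec, so the parent is unaffected.
        resource.setrlimit(resource.RLIMIT_NOFILE, (max_open_files, max_open_files))
        resource.setrlimit(resource.RLIMIT_CPU, (cpu_seconds, cpu_seconds))
        if max_processes is not None:
            resource.setrlimit(resource.RLIMIT_NPROC, (max_processes, max_processes))

    # The wall-clock timeout is enforced by the parent, independent of CPU time.
    return subprocess.run(argv, preexec_fn=apply_limits, capture_output=True,
                          text=True, timeout=wall_timeout_s)


if __name__ == "__main__":
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NOFILE))"
    print(run_limited([sys.executable, "-c", probe]).stdout.strip())
```

The child sees the lowered open-file limit and cannot raise it again, because the hard limit is lowered together with the soft one.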

Page 67: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: Network attacks

Attacks:
• Bitcoin mining
• DoS attacks on third-party systems
• Access Amazon S3 and other AWS APIs

Defense:
• Deny network access

Page 68: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Modifying the ECS Agent: Network Modes

• NetworkDisabled too restrictive
  • Some graders require local loopback
  • Feature also deprecated
• --net=none + deny net_admin + audit network
  • Isolation via Docker creating an independent network stack for each container

• github.com/coursera/amazon-ecs-agent

Page 69: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: Namespace / Container Vulnerabilities

• AppArmor & Mandatory Access Control
  • Required modifying the Amazon ECS Agent
  • Allows auditing or denying access to a variety of subsystems
• Drop capabilities
  • No need for NET_BIND_SERVICE, CAP_FOWNER

• No root within container
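
Expressed as plain `docker run` options, the measures on this slide and the `--net=none` mode from the previous one look roughly like the sketch below. The profile name `grid-grader`, the user `nobody`, and the function name are hypothetical. According to the slides, Coursera applied these settings through a modified ECS agent rather than by invoking Docker directly.

```python
def isolation_flags(apparmor_profile, user="nobody",
                    drop_caps=("NET_BIND_SERVICE", "FOWNER")):
    """Build `docker run` flags for network, user, MAC, and capability isolation."""
    flags = [
        "--net=none",          # private network stack with loopback only
        f"--user={user}",      # no root within the container
        # Recent Docker releases spell this apparmor=<profile>; releases from
        # around the time of the talk used apparmor:<profile>.
        "--security-opt", f"apparmor={apparmor_profile}",
    ]
    flags += [f"--cap-drop={cap}" for cap in drop_caps]
    return flags
```

`isolation_flags("grid-grader")` yields a flag list that can be spliced into a `docker run` command ahead of the image name.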

Page 70: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: Root escalations within the container

• We modify instructor grader images before allowing them to be run

  • Clears setuid
  • Inserts C wrapper to drop privileges from root and redirect stdin/stdout/stderr
• Required Amazon ECS Agent modification
  • Grant root privileges
  • Map Docker socket into Docker containers to run Docker in Docker!
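
The "clears setuid" step can be pictured as a walk over an unpacked image filesystem that strips the setuid bit from regular files. The sketch below shows the idea. The function name is hypothetical, clearing setgid as well is an assumption, and the slides do not describe how Coursera's image rewriting is actually implemented.

```python
import os
import stat


def clear_setuid_bits(root):
    """Strip setuid/setgid bits from regular files under root; return changed paths."""
    special_bits = stat.S_ISUID | stat.S_ISGID
    cleared = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)  # lstat so symlinks are never followed
            if not stat.S_ISREG(st.st_mode):
                continue
            if st.st_mode & special_bits:
                os.chmod(path, stat.S_IMODE(st.st_mode) & ~special_bits)
                cleared.append(path)
    return sorted(cleared)
```

Without setuid binaries in the image, a process that the wrapper has already dropped to an unprivileged user has one fewer route back to root.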

Page 71: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Attacks: If all else fails…

• Utilizes VPC security measures to further restrict network access
  • No public internet access
  • Security group to restrict inbound/outbound access
  • Network flow logs for auditing
• Separate AWS account
• Run in an Auto Scaling group
  • Regularly terminate all grading EC2 instances

Page 72: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Other Security Measures

• Utilize AWS CloudTrail for audit logs

• Third-party security monitoring (Threat Stack)

• No one should log in, so any TTY is an alert

• Penetration testing by third-party red team (Synack)

Page 73: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Technique: Co-process

• Environment has no network, but has to get submissions in and results out
• Python co-process watches Amazon ECS / Docker
• Python co-process then:
  • Mounts a shared folder containing submission
  • Reads back the grade from the shared folder after container exits
  • Monitors and cleans up
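
The co-process flow above can be sketched as follows. Everything specific here is an assumption made for illustration: the file names `submission` and `grade.json`, the `/shared` mount point, and the function names. The real co-process watches Amazon ECS / Docker rather than launching containers itself.

```python
import json
import shutil
import subprocess
import tempfile
import uuid
from pathlib import Path


def docker_runner(image, shared_dir, timeout_s):
    """Run a grader container with no network and the shared folder mounted."""
    name = f"grader-{uuid.uuid4().hex}"
    try:
        subprocess.run(
            ["docker", "run", "--rm", "--net=none", "--name", name,
             "-v", f"{shared_dir}:/shared", image],
            timeout=timeout_s, check=False)
    except subprocess.TimeoutExpired:
        # The timeout only stops the docker client, so kill the container too.
        subprocess.run(["docker", "kill", name], check=False)


def grade_submission(image, submission, run_container=docker_runner, timeout_s=300):
    """Stage a submission, run the grader, read the grade back, then clean up."""
    shared = Path(tempfile.mkdtemp(prefix="grid-"))
    try:
        (shared / "submission").write_bytes(submission)
        run_container(image, str(shared), timeout_s)
        grade_file = shared / "grade.json"
        if not grade_file.exists():
            return None  # grader crashed or timed out before writing a grade
        return json.loads(grade_file.read_text())
    finally:
        shutil.rmtree(shared, ignore_errors=True)
```

Passing the container runner in as a parameter keeps the file hand-off logic testable without Docker. The grade file comes from untrusted grader code, so a real implementation would also validate what it reads back.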

Page 74: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Future Improvements

• Priority queues for different grading priorities

  • Re-grades vs on-demand grades
• Better instructor tooling
  • Automated “unit-testing” for new graders
  • Better simulation of production environment on instructor machines
• Support scheduling GPUs

Page 75: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Lessons Learned

• Run the latest kernels
  • Latest security patches
  • btrfs wedging on older kernels
  • Default Ubuntu 14.04 kernel not new enough!
• Carefully monitor disk usage
  • Docker-in-docker can’t clean up after itself (yet).
• Reliable deploy tooling pays for itself

Page 76: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Related Sessions

Also from Coursera:
• BDT404 - Building and Managing Large-Scale ETL Data Flows with AWS Data Pipeline and Dataduct - Friday

Containers and Amazon ECS:
• CMP302 - Amazon EC2 Container Service: Distributed Applications at Scale – Next timeslot in Venetian H

Page 77: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Thank you!

Questions?

Also, we are hiring!
www.coursera.org/jobs

tech.coursera.org

Brennan Saeta
github/saeta

@[email protected]

Frank Chen
github/frankchn

@[email protected]

Page 78: Amazon ECS at Coursera: A unified execution framework while defending against untrusted code

Remember to complete your evaluations!