(CMP404) Cloud Rendering at Walt Disney Animation Studios

Post on 12-Apr-2017


TRANSCRIPT

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Usman Shakeel, Amazon Web Services
Kevin Constantine, Walt Disney Animation Studios

October 2015

CMP404 Cloud Rendering at Walt Disney Animation Studios

Who is using AWS for rendering?

1. Visual Effects and Animation
2. Marketing
3. Theme Parks
4. Manufacturing
5. Gaming
6. Life Sciences
7. Engineering and Architecture

Let’s make a film in the cloud…

VFX/Animation Rendering - workflow components

Compositing Modeling Rendering

Asset management

Collaboration and task management

The challenge of making a film

On-premises capacity

Rendering in the cloud

The cloud lets you scale fast and get outputs faster.

Initial project on-boarding artwork

A tale of two customers

                             A boutique studio                   Walt Disney Animation Studios
On-premises hardware         No or very little investment        A significant investment
Licenses                     Limited                             Unlimited
Project structure            Project based, from other studios   Internal customers/projects
Budget constraints           Time and resources                  Time and resources
Compute needs                Large scale                         Very large scale
Infrastructure efficiencies  No or very little                   On-premises infrastructure optimized for rendering workload
Cloud model                  Mostly all-in                       Mostly hybrid
Security                     Mandated by customers               Required due to high-value assets

They both ask us the same thing…

•  The ability to spin up thousands of cores on demand
•  …without any upfront investment
•  …and leveraging the most up-to-date configurations
•  A project-based “disposable” infrastructure
•  …with flexible licensing / utility / by the hour

They both tell us the same thing…

≤ $0.01 per core/hour

Access to thousands of cores whenever needed

No upfront investments in infrastructure

Easier collaboration

Ecosystem of software providers

Access to large memory configs to do 6K/10K renders

Project based “disposable” infrastructure

…when the rubber meets the road!

•  Shared FS everywhere
•  Latency
•  Large datasets
•  Lots of instances

{Data/Content}

Rendering in the Cloud

Rendering in the Cloud - State of the Union: Scale at a very cheap price

EC2 Spot

Leveraging Spot successfully today requires some effort

•  Build stateless, distributed, scalable applications
•  Choose which instance types fit your workload the best
•  Ingest price feed data for AZs and regions
•  Make run-time decisions on which Spot pools to launch in based on price and volatility
•  Manage interruptions
•  Monitor and manage market prices across AZs and instance types
•  Manage the capacity footprint in the fleet
•  And all of this while you don’t know where the capacity is
•  Serve your customers
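One of the steps above, choosing Spot pools "based on price and volatility", can be sketched roughly as follows. This is only an illustration: the scoring rule (mean price plus a multiple of the standard deviation), the function name `rank_spot_pools`, and the sample prices are all assumptions, not something shown in the talk.

```python
from statistics import mean, pstdev


def rank_spot_pools(price_history, max_price, volatility_weight=1.0):
    """Rank (instance_type, az) pools from most to least attractive.

    price_history maps a pool key to a list of recent prices in USD/hour.
    A pool is skipped when its latest price is already above max_price.
    The score is an assumed heuristic: mean price + weight * std deviation.
    """
    scored = []
    for pool, prices in price_history.items():
        if not prices or prices[-1] > max_price:
            continue
        score = mean(prices) + volatility_weight * pstdev(prices)
        scored.append((score, pool))
    scored.sort()
    return [pool for _, pool in scored]


# Made-up sample data, for illustration only.
history = {
    ("r3.2xlarge", "us-west-2a"): [0.10, 0.11, 0.10],  # cheap and stable
    ("r3.2xlarge", "us-west-2b"): [0.08, 0.30, 0.09],  # cheap but spiky
    ("r3.4xlarge", "us-west-2a"): [0.20, 0.21, 0.90],  # currently above the cap
}
best_pools = rank_spot_pools(history, max_price=0.50)
```

In a real pipeline the price lists would come from the Spot price feed rather than a literal dict, and the weighting between price and volatility would need tuning against observed interruptions.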

Spot Fleet

Instead of writing all that code to manage Spot instances, simply specify:

•  Target Capacity – The number of EC2 instances that you want in your fleet.

•  Maximum Bid Price – The maximum bid price that you are willing to pay.

•  Launch Specifications – Number and types of instances, AMI ID, VPC, subnets or AZs, etc.

•  IAM Fleet Role – The name of an IAM role. It must allow Amazon EC2 to terminate instances on your behalf.
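The four fields listed above map onto the request structure that Boto3 accepts for a Spot Fleet. The sketch below only assembles that structure; the helper name `build_spot_fleet_config`, the AMI, subnet, and role identifiers, and the bid value are placeholders chosen for illustration.

```python
def build_spot_fleet_config(target_capacity, max_bid_usd, fleet_role_arn, launch_specs):
    """Assemble the four pieces of a Spot Fleet request into one dict."""
    if target_capacity < 1:
        raise ValueError("target_capacity must be at least 1")
    return {
        "TargetCapacity": target_capacity,
        "SpotPrice": f"{max_bid_usd:.2f}",  # the API takes the price as a string
        "IamFleetRole": fleet_role_arn,
        "LaunchSpecifications": launch_specs,
    }


# Placeholder identifiers, not real resources.
launch_specs = [
    {"ImageId": "ami-00000000", "InstanceType": itype, "SubnetId": "subnet-00000000"}
    for itype in ("r3.2xlarge", "r3.4xlarge", "r3.8xlarge")
]
config = build_spot_fleet_config(
    target_capacity=20,
    max_bid_usd=0.50,
    fleet_role_arn="arn:aws:iam::123456789012:role/example-fleet-role",
    launch_specs=launch_specs,
)
# With credentials configured, the dict could be submitted with:
#   boto3.client("ec2").request_spot_fleet(SpotFleetRequestConfig=config)
```

Keeping the request as plain data makes it easy to review or log before anything is launched.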

Spot Fleet Example – Instance Weighting

•  Say your workload needs at least 60 GB of memory
•  You want capacity to complete 20 units of work
•  Choices:
    •  r3.2xlarge (61.0 GB, 8 vCPUs) = 1 unit of 20
    •  r3.4xlarge (122.0 GB, 16 vCPUs) = 2 units of 20
    •  r3.8xlarge (244.0 GB, 32 vCPUs) = 4 units of 20

Spot Fleet gives you an option to bid for all of these instance types.
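The weighting arithmetic in this example can be reproduced in a few lines. This is a minimal sketch that assumes a weight is simply the number of whole 60 GB units that fit in an instance's memory; the function names are illustrative.

```python
import math

UNIT_MEMORY_GB = 60  # the workload needs at least 60 GB per unit of work
TARGET_UNITS = 20    # capacity wanted, in units of work

# Memory sizes as listed in the example above.
INSTANCE_MEMORY_GB = {"r3.2xlarge": 61.0, "r3.4xlarge": 122.0, "r3.8xlarge": 244.0}


def weight(memory_gb, unit_memory_gb=UNIT_MEMORY_GB):
    """Number of whole work units one instance can hold."""
    return int(memory_gb // unit_memory_gb)


def instances_needed(instance_weight, target_units=TARGET_UNITS):
    """Instances of a single type required to reach the target capacity."""
    return math.ceil(target_units / instance_weight)


weights = {name: weight(mem) for name, mem in INSTANCE_MEMORY_GB.items()}
counts = {name: instances_needed(w) for name, w in weights.items()}
```

With these inputs the weights come out as 1, 2, and 4, so the 20 units could be met by 20, 10, or 5 instances of the respective types, or by any mix that adds up to 20.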

AWS cloud scale is “large”

•  10s/100s/1000s/10000s cores on demand in the cloud
•  A “large” (Disney Animation Studios) renderfarm: 55,000 cores
•  In this demo: ~40,000 vCPUs on EC2 Spot Market

Rendering in the Cloud - State of the Union: Scale at a very cheap price

•  BYOL
•  SaaS
•  AWS Marketplace
•  Elastic licensing models

Thinkbox Deadline Usage Based Licensing

•  Render nodes pull metered licenses from a cloud-based license server
•  Usage is tracked per minute
•  Bulk minutes will be available via Thinkbox’s online store
•  Store will eventually host 3rd party licensing (Nuke, V-Ray, etc.)

Autodesk Maya

Rendering in the Cloud - State of the Union: Licensing at Cloud Scale

Rendering in the Cloud - State of the Union: Hydrating the Cloud Renderfarm

Amazon S3 as the source of truth for your content/data

•  On AWS Marketplace/SaaS (Aspera, Signiant, FileCatalyst, ExpeDat)
•  Amazon S3 Multipart Upload

Direct to Shared File Systems

•  Amazon EFS throughput scales linearly with storage
•  Lustre can hydrate from an S3 bucket
•  Avere can be fronted to Amazon S3 or an on-premises NAS

+ AWS Direct Connect

EFS S3 Multipart
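For the S3 multipart route, each large file is split into parts that can be uploaded in parallel. The sketch below only plans the byte ranges. It relies on two documented S3 limits (a 5 MiB minimum for every part except the last, and at most 10,000 parts per upload), while the 64 MiB preferred part size and the name `plan_parts` are assumptions.

```python
import math

MIN_PART = 5 * 1024**2  # S3 minimum part size (the last part may be smaller)
MAX_PARTS = 10_000      # S3 maximum number of parts in one multipart upload


def plan_parts(file_size, preferred_part=64 * 1024**2):
    """Return (part_number, offset, length) tuples that cover the file."""
    if file_size <= 0:
        raise ValueError("file_size must be positive")
    # Grow the part size when the file is too big for MAX_PARTS preferred parts.
    part_size = max(preferred_part, MIN_PART, math.ceil(file_size / MAX_PARTS))
    parts = []
    offset = 0
    number = 1
    while offset < file_size:
        length = min(part_size, file_size - offset)
        parts.append((number, offset, length))
        offset += length
        number += 1
    return parts


parts = plan_parts(200 * 1024**2)  # a 200 MiB file
```

Each tuple could then be handed to a separate worker that reads that byte range and uploads it, which is what lets many clients share the transfer.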

Rendering in the Cloud - State of the Union: Shared FileSystem Everywhere (some ideas)

[Architecture diagram. Labels: Shared Storage; On-prem Storage; FXT on-prem; AWS Direct Connect; Storage Cache; Amazon S3; Lustre on EC2; Avere on EC2; EFS; Hydrate workers; EC2 Spot.]

Rendering in the Cloud - State of the Union: NFS/CIFS (Content/Data Share) Everywhere (some ideas)

Elastic File System

•  Designed to support petabyte-scale file systems
•  Throughput scales linearly with storage
•  Same latency spec across each AZ
•  Thousands of concurrent NFS connections
•  Works great for large I/O sizes
•  Pay only for what you use, not what you provision
•  Managed with multi-copy durability

EFS

Rendering in the Cloud - State of the Union: Move the Graphic Artist to the Cloud …

•  NVIDIA GPU-based EC2 instances
•  Teradici PCoIP
•  Frame, Otoy
•  Windows and Linux (VNC+VirtualGL)

Rendering in the Cloud - State of the Union: Managing your “disposable” infrastructure

Launch a CloudFormation stack with all the infrastructure resources for a specific project

Automatically scale the stack as appropriate

AMI

CloudFormation Template

CloudFormation Terminate Template
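A per-project stack of this kind can be described by a small generated template. The following is a sketch under stated assumptions: the resource names, the choice of one S3 bucket plus one security group, and the `project` tag are illustrative and are not taken from the talk.

```python
import json


def project_template(project):
    """Build a minimal CloudFormation template for one disposable project."""
    tags = [{"Key": "project", "Value": project}]
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Disposable render infrastructure for project {project}",
        "Resources": {
            # Hypothetical resources; a real stack would hold far more.
            "ContentBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {"Tags": tags},
            },
            "RenderNodeSecurityGroup": {
                "Type": "AWS::EC2::SecurityGroup",
                "Properties": {
                    "GroupDescription": f"Render nodes for {project}",
                    "Tags": tags,
                },
            },
        },
    }


template_body = json.dumps(project_template("demo"), indent=2)
# The JSON string could be passed as TemplateBody to the CloudFormation
# create_stack call, and the whole project removed later with delete_stack.
```

Because every resource belongs to one stack, tearing down the project is a single delete operation rather than a hunt for leftovers.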

Rendering in the Cloud - State of the Union: The Crown Jewels

•  AWS alignment with the latest MPAA cloud based application guidelines for content security – August 2015

•  VPC private endpoint for Amazon S3 – enables a true private workflow capability

•  Encryption & key management capabilities

•  Amazon Glacier Vault for high-value media/originals
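The private S3 workflow mentioned above is commonly enforced with a bucket policy that denies any request not arriving through a given VPC endpoint. The sketch below builds such a policy document; the bucket name and endpoint ID are placeholders, and the helper name `vpce_only_policy` is an assumption.

```python
import json


def vpce_only_policy(bucket, vpce_id):
    """Deny all S3 access to the bucket unless it comes via the VPC endpoint."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyAccessOutsideVpcEndpoint",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                # aws:sourceVpce is the condition key for the VPC endpoint ID.
                "Condition": {"StringNotEquals": {"aws:sourceVpce": vpce_id}},
            }
        ],
    }


# Placeholder names, for illustration only.
policy_json = json.dumps(vpce_only_policy("example-render-assets", "vpce-00000000"))
```

A policy like this also blocks console access from outside the VPC, so it is worth testing on a scratch bucket first.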

Rendering in the Cloud - A Sample Architecture (All in Cloud Pipeline)

[Architecture diagram. Labels: 3D Modeler; Modeling Dumb Client; Remote App Visualization (AppStream or Teradici running on a G2 instance); Pipeline and License Manager (Pipeline Manager running on EC2); Shared Storage; Storage Cache; Amazon S3; Avere on EC2; EFS; Hydrate workers; Scalable Renderfarm on EC2 (EC2 Spot); On-Prem Storage; AWS Direct Connect.]

Rendering in the Cloud - A Sample Architecture (A Hybrid Pipeline)

Cloud renderfarm as an extension of on-prem renderfarm

[Architecture diagram. Labels: On-premises Renderfarm; On-Prem Storage; FXT on-prem; Pipeline and License Manager (also manages the cloud renderfarm); AWS Direct Connect; Shared Storage; Storage Cache; Amazon S3; Avere on EC2; EFS; Hydrate workers; Scalable Renderfarm on EC2 (EC2 Spot).]

Let’s make a real film in the cloud…

Disney Animation Renderfarm

[Diagram. Labels: WDAS Data Center (Renderfarm, Avere FXT cluster, Storage); two Remote Data Centers (each with a Renderfarm and an Avere FXT cluster); San Francisco; Los Angeles; Burbank; Artists; Redundant 10Gb.]

Disney Animation’s Environment

•  90% Red Hat Enterprise Linux 6, 8% Mac OS X
•  1Gb/s Ethernet to clients, 10Gb/s to most servers
•  Clients are bursty, not generally bandwidth constrained
•  Major applications:
    •  Hyperion (GI renderer)
    •  Maya
    •  Houdini
    •  Nuke
    •  Coda (scheduler)
•  NFS v3 everywhere
    •  5-7 petabytes
    •  500 TB working set
    •  100 TB/week of data churn
    •  Global namespace
    •  Lots of metadata operations
    •  Serve everything out of RAM/SSD
•  Renderfarm footprint
    •  55,000-core renderfarm
    •  1.1 million render hours per day
    •  200,000-400,000 tasks per day
•  Typical render
    •  8-16 threads, 64 GB
    •  3-5 hours per task
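The footprint figures above imply a rough utilization for the farm. The calculation below is a back-of-the-envelope sketch, and it assumes that a "render hour" means one core busy for one hour, which the slides do not state.

```python
CORES = 55_000                    # renderfarm size from the slide
RENDER_HOURS_PER_DAY = 1_100_000  # render hours per day from the slide
HOURS_PER_DAY = 24

# Assumption: one render hour = one core busy for one hour.
capacity_core_hours = CORES * HOURS_PER_DAY
utilization = RENDER_HOURS_PER_DAY / capacity_core_hours
```

Under that assumption the farm can supply 1,320,000 core-hours per day, so 1.1 million render hours corresponds to a utilization of roughly 83%.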

Disney Animation Renderfarm

[Diagram. The on-premises environment (WDAS Data Center and two Remote Data Centers, each with a Renderfarm and an Avere FXT cluster; San Francisco, Los Angeles, Burbank; Artists; Redundant 10Gb) shown together with a virtual private cloud in Oregon containing an Avere vFXT, Spot Instances, and EFS. Link label: 10Gb Primary, 1Gb backup.]

Mostly Automated Deployment

•  Pre-built EBS-backed AMI
    •  Heavily customized RHEL
•  Python/Boto3
    •  Pass in how many resources and the minimum instance size
    •  Calculates resource weights
    •  Needs to calculate pricing
•  User-Data
    •  RAIDs ephemeral disks if available for scratch space
    •  Integrates with on-premises environment (DNS, asset inventory, Puppet)
    •  Creates EC2 tags
    •  Runs Puppet to pick up changes since AMI build time
    •  Joins the render queue and asks for work
•  Scale-up/down still a manual process
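The "calculates resource weights" step can be sketched as below. This is not the studio's script: the rule used here (a resource is a fixed CPU and RAM slice, and an instance's weight is how many whole slices fit on both dimensions) and all names are assumptions. The r3 sizes come from the earlier weighting slide; the m4 sizes are the published specifications for those types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gb: float


CATALOG = [
    InstanceType("r3.2xlarge", 8, 61.0),
    InstanceType("r3.4xlarge", 16, 122.0),
    InstanceType("r3.8xlarge", 32, 244.0),
    InstanceType("m4.4xlarge", 16, 64.0),
    InstanceType("m4.10xlarge", 40, 160.0),
]


def resource_weights(catalog, cpu, ram_gb):
    """Map instance name to the number of (cpu, ram_gb) resources it can hold.

    Instance types too small to hold even one resource are left out.
    """
    weights = {}
    for itype in catalog:
        fit = min(itype.vcpus // cpu, int(itype.memory_gb // ram_gb))
        if fit >= 1:
            weights[itype.name] = fit
    return weights


# Hypothetical resource size: 8 cores and 60 GB of RAM.
weights = resource_weights(CATALOG, cpu=8, ram_gb=60)
```

Taking the minimum across CPU and memory means a CPU-heavy type such as m4.10xlarge is weighted by its memory, the scarcer dimension for this resource size.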

Spot Fleet Deployment

Core Count

./aws_spot_fleet_request -p reinvent --cpu 8 --ram 64 -m 4.7 -c 1500

Spot Fleet Pricing

•  Target Price 1
    •  $0.47/resource for the 40,000-core run
•  Target Price 2
    •  $0.16/resource for 16,000 cores
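To compare these target prices with a per-core figure, the per-resource price has to be divided by the cores in a resource. The sketch below assumes one resource is 8 cores (taken from the `--cpu 8` argument in the command shown earlier) and that the prices are per resource per hour. Neither assumption is confirmed by the slides.

```python
CORES_PER_RESOURCE = 8  # assumption, from "--cpu 8" in the fleet request command


def price_per_core_hour(price_per_resource):
    """Convert a per-resource hourly price into a per-core hourly price."""
    return price_per_resource / CORES_PER_RESOURCE


def fleet_cost_per_hour(cores, price_per_resource):
    """Hourly cost of a fleet of the given core count at the target price."""
    return (cores / CORES_PER_RESOURCE) * price_per_resource


per_core_1 = price_per_core_hour(0.47)       # 40,000-core target
per_core_2 = price_per_core_hour(0.16)       # 16,000-core target
fleet_1 = fleet_cost_per_hour(40_000, 0.47)
fleet_2 = fleet_cost_per_hour(16_000, 0.16)
```

Under those assumptions the two targets work out to about $0.059 and $0.02 per core-hour, or roughly $2,350 and $320 per hour for the whole fleet.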

Cloud Rendering Benchmarks

Benchmarks: On Premises vs. the Cloud

[Bar chart: stream triad, disk read, and disk write results (axis 0 to 120) for On Prem, r3.4xlarge, r3.8xlarge, m4.4xlarge, m4.10xlarge, and cr1.8xlarge. Higher is better.]

EFS Hydration

Single Node

50 Clients – multi-threaded file copy

Average Open Latency

Average Read Latency

[Chart: time (µs), axis 0 to 700, against render processes (100, 500, 800, 1200, 2400, 4000). Series: Mid-TierA, Mid-TierB, Mid-TierC, Archive, EFS.]

Rendering in the Cloud vs. On-Premises

[Chart: render time (s), axis 0 to 30,000, by frame number (1 to 90). Series: EC2/EFS and On Prem. Lower is better.]

Lessons Learned

•  Use as many different instance types as you can, especially older generations

•  Think about ways to modify your workload

•  Use every Availability Zone

•  Check your limits, especially your Amazon EBS limit and VPC setup (address space)

•  Resource-oriented bidding

•  Diversified allocation

•  Benchmark your workload and set pricing accordingly

•  Set ONLY realistic pricing that you are willing to pay

•  Don’t be afraid to ask AWS for help with pre-planning your run
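The VPC address-space check in the list above is easy to automate before a large run. This is a minimal sketch: the subnet CIDRs and instance counts are made-up examples, and the only AWS-specific fact used is that five addresses in every subnet are reserved.

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet


def usable_addresses(cidr):
    """Addresses in a subnet that instances can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def fits(instance_count, subnet_cidrs):
    """True when the subnets together can address every instance."""
    return sum(usable_addresses(c) for c in subnet_cidrs) >= instance_count


# Hypothetical layout: one /20 subnet in each of three Availability Zones.
subnets = ["10.0.0.0/20", "10.0.16.0/20", "10.0.32.0/20"]
enough_for_5000 = fits(5000, subnets)
enough_in_one_slash_24 = fits(5000, ["10.0.0.0/24"])
```

A /24 leaves only 251 usable addresses, which is why a fleet of thousands of instances needs the address space planned up front.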

Conclusion

•  Cloud rendering on AWS - State of the Union: is getting stronger …
•  Rendering forecast: partly cloudy with a chance of all in the cloud…
•  Future research
    •  Storage hydration: distribute across many clients to saturate the EFS throughput
    •  Storage for processing: read freely and lump the writes (for shared FS performance)
    •  Latency is killer: atomic workflows within a single AZ/region; caching appliances
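"Lump the writes" can be illustrated with a small buffering wrapper that collects many small writes and sends them to the shared file system as a few large ones. This is a sketch only: the class name, the flush threshold, and the in-memory sink standing in for a file on shared storage are all assumptions.

```python
import io


class LumpedWriter:
    """Collect small writes and pass them to the sink in large chunks."""

    def __init__(self, sink, flush_bytes=4 * 1024 * 1024):
        self._sink = sink
        self._flush_bytes = flush_bytes
        self._chunks = []
        self._pending = 0
        self.flush_count = 0  # how many writes actually reached the sink

    def write(self, data):
        self._chunks.append(data)
        self._pending += len(data)
        if self._pending >= self._flush_bytes:
            self.flush()

    def flush(self):
        if self._pending:
            self._sink.write(b"".join(self._chunks))
            self._chunks.clear()
            self._pending = 0
            self.flush_count += 1


sink = io.BytesIO()  # stands in for a file on the shared file system
writer = LumpedWriter(sink, flush_bytes=1024)
for _ in range(100):
    writer.write(b"x" * 100)  # 100 small writes of 100 bytes each
writer.flush()
```

Here the 100 small writes reach the sink as 10 larger ones. Python's own buffered file objects behave similarly, so in practice opening the output with a large `buffering` value may be enough.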

Relevant talks

Remember to complete your evaluations!

Thank you!
