© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Usman Shakeel, Amazon Web Services
Kevin Constantine, Walt Disney Animation Studios
October 2015
CMP404: Cloud Rendering at Walt Disney Animation Studios
Who is using AWS for rendering?
1 Visual Effects and Animation
2 Marketing
3 Theme Parks
4 Manufacturing
5 Gaming
6 Life Sciences
7 Engineering and Architecture
Let’s make a film in the cloud…
VFX/Animation Rendering - Workflow Components
• Modeling
• Rendering
• Compositing
• Asset management
• Collaboration and task management
The challenge of making a film
• On-premises capacity
• Rendering in the cloud: the cloud gives you the capability to scale fast and get your outputs faster
Initial project on-boarding artwork
A tale of two customers

                              A boutique studio                       Walt Disney Animation Studios
On-premises hardware          No or very little investment            A significant investment
Licenses                      Limited                                 Unlimited
Project structure             Project-based work from other studios   Internal customers/projects
Budget constraints            Time and resources                      Time and resources
Compute needs                 Large scale                             Very large scale
Infrastructure efficiencies   No or very little                       On-premises infrastructure optimized for the rendering workload
Cloud model                   Mostly all-in                           Mostly hybrid
Security                      Mandated by customers                   Required due to high-value assets
They both ask us for the same thing…
• The ability to spin up thousands of cores on demand
• …without any upfront investment
• …leveraging the most up-to-date configurations
• A project-based "disposable" infrastructure
• …with flexible, utility-style, by-the-hour licensing
They both tell us the same thing…
• ≤ $0.01 per core-hour
• Access to thousands of cores whenever needed
• No upfront investments in infrastructure
• Easier collaboration
• An ecosystem of software providers
• Access to large-memory configurations for 6K/10K renders
• Project-based "disposable" infrastructure
…when the rubber meets the road!
• Shared file system everywhere
• Latency
• Large datasets
• Lots of instances
• {Data/Content}
Rendering in the Cloud
Rendering in the Cloud - State of the Union
Scale at a very cheap price
EC2 Spot
Leveraging Spot successfully today requires some effort:
• Build stateless, distributed, scalable applications
• Choose which instance types fit your workload best
• Ingest price-feed data for AZs and Regions (a price-polling sketch follows)
• Make runtime decisions on which Spot pools to launch in, based on price and volatility
• Manage interruptions
• Monitor and manage market prices across AZs and instance types
• Manage the capacity footprint in the fleet
• And all of this while you don't know where the capacity is
• Serve your customers
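A minimal sketch, assuming boto3, of the kind of price-feed polling that do-it-yourself approach implies; the instance types and region below are illustrative, not from the talk:

```python
import boto3
from datetime import datetime, timedelta

# Illustrative only: poll recent Spot prices so a scheduler can pick
# the cheapest pools before launching render nodes.
ec2 = boto3.client("ec2", region_name="us-west-2")

resp = ec2.describe_spot_price_history(
    InstanceTypes=["r3.4xlarge", "r3.8xlarge"],   # candidate render-node types
    ProductDescriptions=["Linux/UNIX"],
    StartTime=datetime.utcnow() - timedelta(hours=1),
)

# Keep only the most recent price per (AZ, instance type) pool.
latest = {}
for p in resp["SpotPriceHistory"]:
    key = (p["AvailabilityZone"], p["InstanceType"])
    if key not in latest or p["Timestamp"] > latest[key]["Timestamp"]:
        latest[key] = p

for (az, itype), p in sorted(latest.items(), key=lambda kv: float(kv[1]["SpotPrice"])):
    print(f"{az} {itype}: ${p['SpotPrice']}/hr")
```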
Spot Fleet
Instead of writing all that code to manage Spot instances, simply specify:
• Target Capacity – The number of EC2 instances that you want in your fleet.
• Maximum Bid Price – The maximum bid price that you are willing to pay.
• Launch Specifications – # of and types of instances, AMI ID, VPC, subnets or AZs, etc.
• IAM Fleet Role – The name of an IAM role. It must allow Amazon EC2 to terminate instances on your behalf.
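A hedged sketch of how those four inputs map onto a boto3 RequestSpotFleet call; the AMI, subnet, role ARN, and numbers are placeholders, not values from the talk:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# Placeholder IDs/ARNs -- substitute your own AMI, subnets, and fleet role.
response = ec2.request_spot_fleet(
    SpotFleetRequestConfig={
        "TargetCapacity": 1500,        # desired fleet capacity
        "SpotPrice": "4.70",           # maximum bid price per unit of capacity
        "IamFleetRole": "arn:aws:iam::123456789012:role/my-spot-fleet-role",
        "LaunchSpecifications": [
            {
                "ImageId": "ami-12345678",
                "InstanceType": "r3.4xlarge",
                "SubnetId": "subnet-aaaa1111",
            },
        ],
    }
)
print(response["SpotFleetRequestId"])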
Spot Fleet Example - Instance Weighting
Say your workload needs at least 60 GB of memory, and you want capacity to complete 20 units of work. Choices:
• r3.2xlarge (61.0 GB, 8 vCPUs) = 1 unit of 20
• r3.4xlarge (122.0 GB, 16 vCPUs) = 2 units of 20
• r3.8xlarge (244.0 GB, 32 vCPUs) = 4 units of 20
One option is to bid for all of these instance types:
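Sketched as boto3 launch specifications (the AMI and subnet IDs are placeholders; WeightedCapacity encodes the 1/2/4 units per instance):

```python
# Weighted launch specifications for the 60 GB / 20-unit example above.
launch_specifications = [
    {"ImageId": "ami-12345678", "SubnetId": "subnet-aaaa1111",
     "InstanceType": "r3.2xlarge", "WeightedCapacity": 1.0},
    {"ImageId": "ami-12345678", "SubnetId": "subnet-aaaa1111",
     "InstanceType": "r3.4xlarge", "WeightedCapacity": 2.0},
    {"ImageId": "ami-12345678", "SubnetId": "subnet-aaaa1111",
     "InstanceType": "r3.8xlarge", "WeightedCapacity": 4.0},
]
```

Plugged into the SpotFleetRequestConfig shown earlier with TargetCapacity=20 (and, optionally, AllocationStrategy="diversified"), the fleet can mix these types to reach the 20 units of work.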
AWS cloud scale is "large"
• 10s/100s/1,000s/10,000s of cores on demand in the cloud
• A "large" renderfarm (Walt Disney Animation Studios): 55,000 cores
• In this demo: ~40,000 vCPUs on the EC2 Spot market
Rendering in the Cloud - State of the Union
Licensing at Cloud Scale
• BYOL
• SaaS
• AWS Marketplace
• Elastic licensing models
Thinkbox Deadline Usage-Based Licensing
• Render nodes pull metered licenses from a cloud-based license server
• Usage is tracked per minute
• Bulk minutes will be available via Thinkbox's online store
• The store will eventually host 3rd-party licensing (Nuke, VRay, etc.)
Autodesk Maya
Rendering in the Cloud - State of the Union
Hydrating the Cloud Renderfarm
Amazon S3 as the source of truth for your content/data
• Transfer tools on AWS Marketplace/SaaS (Aspera, Signiant, File Catalyst, Expedat)
• Amazon S3 multipart upload
Direct to shared file systems
• Amazon EFS throughput scales linearly with storage
• Lustre can hydrate from an S3 bucket
• Avere can front Amazon S3 or an on-premises NAS
+ AWS Direct Connect
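A minimal boto3 sketch of the multipart-upload path; the bucket name, file path, and tuning values below are assumptions, not from the talk:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Multipart settings are illustrative; tune part size/concurrency for your link.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,   # 64 MB parts
    max_concurrency=16,                     # parallel part uploads
)

# Upload one large scene asset; S3 is the source of truth the farm hydrates from.
s3.upload_file("/prod/shots/sq100_sh010.tar", "my-render-assets-bucket",
               "shots/sq100_sh010.tar", Config=config)
```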
Rendering in the Cloud - State of the Union
Shared File System Everywhere (some ideas)
[Architecture diagram: on-premises storage and an on-prem Avere FXT connected over AWS Direct Connect to a storage cache in AWS; shared storage options include Amazon S3, Lustre on EC2, Avere on EC2, and EFS; hydrate workers fill a renderfarm on EC2 Spot from the shared storage]
Rendering in the Cloud - State of the Union
NFS/CIFS (Content/Data Share) Everywhere (some ideas)
Amazon Elastic File System (EFS)
• Designed to support petabyte-scale file systems
• Throughput scales linearly with storage
• Same latency spec across each AZ
• Thousands of concurrent NFS connections
• Works great for large I/O sizes
• Pay only for what you use, not what you provision
• Managed, with multi-copy durability
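A sketch of provisioning such a file system with boto3 (the subnet and security group IDs are placeholders); render nodes would then mount it over NFS:

```python
import boto3

efs = boto3.client("efs", region_name="us-west-2")

# Create the shared file system the render nodes will mount.
fs = efs.create_file_system(CreationToken="renderfarm-shared-fs")

# One mount target per AZ/subnet hosting render instances (IDs are placeholders).
# In practice, wait for the file system to become "available" first.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-aaaa1111",
    SecurityGroups=["sg-bbbb2222"],
)
print("File system:", fs["FileSystemId"])
```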
Rendering in the Cloud - State of the Union
Move the Graphic Artist to the Cloud…
• NVIDIA GPU-based EC2 instances
• Teradici PCoIP
• Frame, Otoy
• Windows and Linux (VNC + VirtualGL)
Rendering in the Cloud - State of the Union
Managing your "disposable" infrastructure
• Launch a CloudFormation stack with all the infrastructure resources for a specific project
• Automatically scale the stack as appropriate
[Diagram: AMI + CloudFormation template used to create the project stack; CloudFormation terminate used to dispose of it]
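A hedged sketch of that lifecycle with boto3; the stack name, template URL, and parameters are placeholders for a per-project template:

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-west-2")

# Stand up all project infrastructure from one template at project start...
cfn.create_stack(
    StackName="project-reinvent-renderfarm",
    TemplateURL="https://s3.amazonaws.com/my-templates/renderfarm.template",
    Parameters=[{"ParameterKey": "TargetCapacity", "ParameterValue": "1500"}],
    Capabilities=["CAPABILITY_IAM"],   # template creates IAM roles for the fleet
)

# ...and tear the whole "disposable" stack down when the project wraps.
cfn.delete_stack(StackName="project-reinvent-renderfarm")
```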
Rendering in the Cloud - State of the Union
The Crown Jewels
• AWS alignment with the latest MPAA cloud-based application guidelines for content security (August 2015)
• VPC private endpoint for Amazon S3 enables a true private workflow capability
• Encryption and key management capabilities
• Amazon Glacier vaults for high-value media/originals
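Two of those controls sketched with boto3 (the VPC, route table, bucket, and KMS key identifiers are placeholders): a private S3 endpoint so asset traffic stays inside the VPC, and a KMS-encrypted upload for a high-value original:

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")
s3 = boto3.client("s3", region_name="us-west-2")

# Private VPC endpoint for S3: asset traffic stays off the public internet.
ec2.create_vpc_endpoint(
    VpcId="vpc-aaaa1111",
    ServiceName="com.amazonaws.us-west-2.s3",
    RouteTableIds=["rtb-bbbb2222"],
)

# Server-side encryption with a customer-managed KMS key for a high-value asset.
with open("/prod/originals/sq100_sh010_plate.exr", "rb") as f:
    s3.put_object(
        Bucket="my-render-assets-bucket",
        Key="originals/sq100_sh010_plate.exr",
        Body=f,
        ServerSideEncryption="aws:kms",
        SSEKMSKeyId="alias/render-assets-key",
    )
```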
Rendering in the Cloud - A Sample Architecture (All-in-Cloud Pipeline)
[Architecture diagram: on-prem storage reaches AWS over AWS Direct Connect; a storage cache (Avere on EC2) and Amazon S3 feed EFS shared storage; hydrate workers fill a scalable renderfarm on EC2 Spot; a pipeline and license manager runs on EC2; a 3D modeler works from a thin modeling client via remote application visualization (AppStream or Teradici on a G2 instance)]
Rendering in the Cloud - A Sample Architecture (A Hybrid Pipeline)
[Architecture diagram: an on-premises renderfarm, storage, and Avere FXT connect over AWS Direct Connect to a storage cache (Avere on EC2) and Amazon S3; EFS provides shared storage; hydrate workers fill a scalable renderfarm on EC2 Spot; the on-premises pipeline and license manager also manages the cloud renderfarm, which acts as an extension of the on-prem renderfarm]
Let’s make a real film in the cloud…
Disney Animation Renderfarm
[Diagram: the WDAS data center (renderfarm, Avere FXT cluster, storage) and two remote data centers (each with a renderfarm and Avere FXT cluster) across Burbank, Los Angeles, and San Francisco, linked by redundant 10Gb connections and serving the artists]
Disney Animation's Environment
• 90% Red Hat Enterprise Linux 6, 8% Mac OS X
• 1Gb/s Ethernet to clients, 10Gb/s to most servers
• Clients are bursty, not generally bandwidth constrained
• Major applications: Hyperion (GI renderer), Maya, Houdini, Nuke, Coda (scheduler)
Disney Animation's Environment
• NFS v3 everywhere
  • 5-7 petabytes, 500 TB working set, 100 TB/week of data churn
  • Global namespace
  • Lots of metadata operations
  • Serve everything out of RAM/SSD
• Renderfarm footprint
  • 55,000-core renderfarm
  • 1.1 million render hours per day
  • 200,000-400,000 tasks per day
• Typical render
  • 8-16 threads, 64 GB
  • 3-5 hours per task
Disney Animation Renderfarm
[Diagram: the same on-premises environment (WDAS and remote data centers with renderfarms and Avere FXT clusters, redundant 10Gb links, artists) extended with a virtual private cloud in Oregon containing an Avere vFXT cluster, Spot instances, and EFS, reached over a 10Gb primary and 1Gb backup connection]
Mostly Automated Deployment
• Pre-built EBS-backed AMI
  • Heavily customized RHEL
• Python/Boto3 tooling
  • Pass in how many resources you need and the minimum instance size
  • Calculates resource weights (sketched below)
  • Needs to calculate pricing
• User data
  • RAIDs ephemeral disks, if available, for scratch space
  • Integrates with the on-premises environment (DNS, asset inventory, Puppet)
  • Creates EC2 tags
  • Runs Puppet to pick up changes since AMI build time
  • Joins the render queue and asks for work
• Scale-up/down is still a manual process
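A hedged sketch of the resource-weight calculation such a wrapper might do; the instance catalog and function below are assumptions for illustration, not Disney Animation's actual tool:

```python
# Hypothetical helper: given a job's per-resource CPU/RAM needs, compute
# Spot Fleet WeightedCapacity values so one "resource" equals one render slot.
INSTANCE_CATALOG = {
    # instance type: (vCPUs, RAM in GB) -- illustrative subset
    "r3.4xlarge": (16, 122),
    "r3.8xlarge": (32, 244),
    "m4.10xlarge": (40, 160),
}

def resource_weights(cpu_per_resource, ram_per_resource):
    weights = {}
    for itype, (vcpus, ram) in INSTANCE_CATALOG.items():
        # Each instance holds as many resources as both CPU and RAM allow.
        slots = min(vcpus // cpu_per_resource, ram // ram_per_resource)
        if slots >= 1:
            weights[itype] = slots
    return weights

# e.g. 8 vCPUs + 64 GB per render slot, as in the example invocation below
print(resource_weights(cpu_per_resource=8, ram_per_resource=64))
# -> {'r3.4xlarge': 1, 'r3.8xlarge': 3, 'm4.10xlarge': 2}
```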
Spot Fleet Deployment
[Chart: core count over time during the Spot Fleet deployment]
./aws_spot_fleet_request -p reinvent --cpu 8 --ram 64 -m 4.7 -c 1500
Spot Fleet Pricing
• Target price 1: $0.47/resource for the 40,000-core run
• Target price 2: $0.16/resource for 16,000 cores
Cloud Rendering Benchmarks
Benchmarks: On-Premises vs. the Cloud
[Chart: relative performance (0-120) for stream triad, disk read, and disk write across on-prem, r3.4xlarge, r3.8xlarge, m4.4xlarge, m4.10xlarge, and cr1.8xlarge; higher is better]
EFS Hydration
[Chart: hydration throughput for a single node vs. 50 clients performing a multi-threaded file copy]
[Chart: average open latency and average read latency (µs) vs. number of render processes (100-4,000), comparing Mid-Tier A, Mid-Tier B, Mid-Tier C, Archive, and EFS]
Rendering in the Cloud vs. On-Premises
[Chart: render time (s) per frame, frames 1-90, for EC2/EFS vs. on-prem; lower is better]
Lessons Learned
• Use as many different instance types as you can, especially older generations
• Think about ways to modify your workload
• Use every Availability Zone
• Check your limits, especially your Amazon EBS limits and VPC setup (address space)
• Use resource-oriented bidding
• Use a diversified allocation strategy
• Benchmark your workload and set pricing accordingly
• Set only realistic prices that you are actually willing to pay
• Don't be afraid to ask AWS for help or to pre-plan your run with them
Conclusion
• Cloud rendering on AWS - State of the Union: it keeps getting stronger…
• Rendering forecast: partly cloudy, with a chance of all-in the cloud…
• Future research
  • Storage hydration: distribute across many clients to saturate EFS throughput
  • Storage for processing: read freely and lump the writes (for shared FS performance)
  • Latency is a killer: keep atomic workflows within a single AZ/region; use caching appliances
Relevant talks
Remember to complete your evaluations!
Thank you!