ILM - Pipeline in the Cloud
TRANSCRIPT
Who are we?
★ Jim Vanns
★ Aaron Carey
Production Engineers at ILM London
VFX Pipeline in the Cloud
Experiments with Mesos and Docker
Nomenclature, glossary and other big words
★ VFX: Visual Effects
★ Pipeline: Data -> Process -> Data, repeat!
★ Show: Film
★ Sequence: A thematically linked series of (continuous) scenes!
★ Shot: An uninterrupted portion of the sequence
★ Frame: A single image in time
★ Asset: A character or building etc.
What is a VFX pipeline?
★ Film Scan
★ Roto
★ 3D
★ FX
★ Comp
★ Lighting
What VFX isn’t
★ Rendering and Sims are our ‘Big Data’
★ We’re not crunching analytics in real-time
★ Rendering != MapReduce
★ Apps run on hardware, not in a browser
★ We’re not here to re-write a renderer (not yet...)
Where does the cloud meet VFX?
What’s in it for us?
★ Reducing Capital Expenditure
★ Potentially reducing overheads
★ Flexibility
★ Giving power back to developers
VFX Studio Infrastructure
★ Render Farm
★ Database
★ Storage
★ Workstations
Render Farm
First, what is rendering!?
★ Take a virtual 3D representation of a scene
  ○ 3D Models
  ○ Textures
  ○ Light sources
  ○ Static backgrounds (plates)
★ Place a virtual camera in the scene
★ Compute the 2D image that the camera will see
★ Repeat the process for each frame
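The core of that "compute the 2D image" step can be sketched as a pinhole-camera projection. This is a minimal illustration, not any ILM renderer's code; the function name and camera convention (at the origin, looking down -z) are assumptions for the example.

```python
# Minimal sketch of the heart of rendering: projecting a 3D scene point
# onto the 2D image plane of a pinhole camera at the origin looking
# down the -z axis. Illustrative only.

def project(point, focal_length=1.0):
    """Perspective-project a 3D point (x, y, z) to 2D image coordinates."""
    x, y, z = point
    if z >= 0:
        raise ValueError("point is behind the camera")
    # Similar triangles: image coordinates scale with focal_length / depth.
    scale = focal_length / -z
    return (x * scale, y * scale)

# A point 2 units in front of the camera, offset 1 unit right and up,
# lands halfway out on the image plane.
print(project((1.0, 1.0, -2.0)))  # (0.5, 0.5)
```

A real renderer repeats this (plus shading, lighting and sampling) for every pixel of every frame, which is why rendering dominates a studio's compute budget.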
Rendering in the cloud
★ Low hanging fruit
★ Already happening
★ Typical Farm 30-50k procs
★ Managed by specialist software (Tractor/Deadline/in-house etc)
★ VFX has been doing clustered computing for decades
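What makes rendering such easy clustered work is that frames are independent, so a farm manager can chunk a shot's frame range into tasks. A hedged sketch of that chunking (frame numbers and chunk size are invented, and real farm software like Tractor or Deadline does far more):

```python
# Sketch of frame-based farm parallelism: a shot's frame range is split
# into independent chunks, each rendered by a separate farm process.

def chunk_frames(start, end, chunk_size):
    """Split the inclusive frame range [start, end] into task chunks."""
    frames = list(range(start, end + 1))
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

tasks = chunk_frames(1001, 1010, 4)
print(tasks)
# [[1001, 1002, 1003, 1004], [1005, 1006, 1007, 1008], [1009, 1010]]
```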
What’s next?
Mesos
★ Open Source framework for scheduling
★ Already used at massive scale
★ NOT a job scheduler
★ We can concentrate on the scheduling logic
★ Support for task isolation/containment (eg Docker)
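"Concentrate on the scheduling logic" means Mesos hands your framework resource offers and your code decides which tasks fit where. The dicts below are a toy stand-in for that decision, not the real Mesos framework API (the 8024 MB figure echoes the Marathon example later in the deck):

```python
# Toy model of a Mesos framework's scheduling logic: Mesos sends
# resource offers; the framework accepts only offers that satisfy a
# task's requirements. Real offers/tasks are protobufs, not dicts.

def fits(offer, task):
    """Accept an offer only if it covers the task's cpu/mem needs."""
    return offer["cpus"] >= task["cpus"] and offer["mem"] >= task["mem"]

offers = [{"host": "node1", "cpus": 2, "mem": 4096},
          {"host": "node2", "cpus": 8, "mem": 16384}]
render_task = {"cpus": 4, "mem": 8024}

accepted = [o["host"] for o in offers if fits(o, render_task)]
print(accepted)  # ['node2']
```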
Automating our Mesos cluster with Docker and Ansible
★ Goals: Quick - Easy - Repeatable
★ Didn’t want to spend time fighting our config manager (or each other)
★ Be able to deploy a virtual studio from scratch in under an hour (including provisioning, building software, deploying, configuration)
★ Run multiple versions of the infrastructure at the same time (in the same availability zone/network)
★ If something is typed in the terminal, we want to automate and version it
Docker + Ansible was the answer
Automating our Mesos cluster with Ansible
★ Heavily using tags and variables in Ansible
★ Cloud agnostic: Some modification of GCE inventory and launch modules
★ Example: Creating a multi-host dynamic Zookeeper configuration

- name: Append the zookeeper server entries
  lineinfile: dest=/etc/zookeeper/conf/zoo.cfg insertafter=EOF line="server.{{hostvars[item]['zkid']}}={{hostvars[item]['ansible_eth0']['ipv4']['address']}}:2888:3888"
  with_items: "{{ groups['tag_zookeeper_server_' + consul_domain] }}"
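To make the templating concrete, here is what that task produces in zoo.cfg. The host IDs and addresses below are invented for illustration; the string template mirrors the `line=` expression in the playbook:

```python
# Illustration of the zoo.cfg lines the Ansible task above appends,
# one per host in the zookeeper server group. Hostvars are invented.

hostvars = {
    "zk-1": {"zkid": 1, "ip": "10.0.0.11"},
    "zk-2": {"zkid": 2, "ip": "10.0.0.12"},
    "zk-3": {"zkid": 3, "ip": "10.0.0.13"},
}

lines = ["server.{zkid}={ip}:2888:3888".format(**h) for h in hostvars.values()]
print("\n".join(lines))
# server.1=10.0.0.11:2888:3888
# server.2=10.0.0.12:2888:3888
# server.3=10.0.0.13:2888:3888
```

Ports 2888 and 3888 are ZooKeeper's peer and leader-election ports; each ensemble member needs an entry for every other member, which is why the playbook loops over the whole group.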
Service Discovery in Mesos
★ No control over where a service or render runs
★ Services may move hosts
★ Can’t guarantee hosts will have same IP
★ Options:
  ○ Mesos-DNS
  ○ Homegrown (etcd etc)
  ○ Consul
Mesos and Consul
★ What is Consul?
★ Every host runs an agent
★ All DNS lookups on a host go to its agent
★ Consul servers outside the Mesos cluster
★ Mesos-Consul automates service registry
★ Can be used for services outside the cluster
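The net effect is a registry keyed by service name, resolvable through DNS names of the form `<service>.service.consul`. A toy model of that behaviour (the service name matches the registry example on the next slide; addresses are invented, and real Consul of course does health checking, replication and DNS rather than a dict lookup):

```python
# Toy model of Consul-style service discovery: services register with
# the local agent, and "<name>.service.consul" lookups resolve to the
# registered instances wherever they currently run.

registry = {}

def register(name, address, port):
    registry.setdefault(name, []).append((address, port))

def resolve(dns_name):
    """Resolve a <service>.service.consul name to its instances."""
    suffix = ".service.consul"
    service = dns_name[:-len(suffix)] if dns_name.endswith(suffix) else dns_name
    return registry.get(service, [])

register("docker-registry", "10.0.0.5", 5000)
print(resolve("docker-registry.service.consul"))  # [('10.0.0.5', 5000)]
```

This is what Mesos-Consul automates for tasks inside the cluster: when Marathon moves a service to another host, the registration (and therefore the DNS answer) follows it.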
Example - Static service outside the cluster

$ ssh -i mykey.pem [email protected]
$ docker run -d -p 5000:5000 --restart=always \
    -e REGISTRY_STORAGE_S3_ACCESSKEY \
    -e REGISTRY_STORAGE_S3_SECRETKEY \
    -e REGISTRY_STORAGE_S3_REGION \
    -e REGISTRY_STORAGE=s3 \
    registry:2.1
$ curl -H "Content-Type: application/json" -X PUT \
    -d '{ "Name": "docker-registry", "Tags": ["docker-registry", "v2"], "Port": 5000 }' \
    http://127.0.0.1:8500/v1/agent/service/register
Example - Static service outside the cluster

- name: Run docker registry container
  docker:
    name: docker-registry
    image: registry:2.1
    state: started
    ports:
      - "5000:5000"
    restart_policy: always
    env:
      REGISTRY_STORAGE_S3_ACCESSKEY:
      REGISTRY_STORAGE_S3_SECRETKEY:
      REGISTRY_STORAGE_S3_REGION:
      REGISTRY_STORAGE_S3_BUCKET:
      REGISTRY_STORAGE: s3

- name: Register registry with consul
  uri:
    url: http://127.0.0.1:8500/v1/agent/service/register
    method: PUT
    body: '{ "Name": "docker-registry", "Tags": [ "docker-registry", "v2" ], "Port": 5000 }'
    body_format: json
Example - Launching a service on marathon

- name: Submit maya container to marathon
  hosts: "tag_build_docker_{{ consul_domain }}"
  gather_facts: False
  tasks:
    - name: Submit maya job to marathon
      uri:
        url: http://marathon:8080/v2/apps
        method: POST
        status_code: 201,409
        body: '{
          "args": [],
          "container": {
            "type": "DOCKER",
            "docker": {
              "network": "BRIDGE",
              "portMappings": [
                { "containerPort": 5901, "hostPort": 0, "protocol": "tcp" }
              ],
              "image": "docker-registry:5000/studio-local-base/maya",
              "forcePullImage": true,
              "parameters": [
                { "key": "env", "value": "DISPLAY" },
                { "key": "device", "value": "/dev/dri/card0" },
                { "key": "device", "value": "/dev/nvidia0" },
                { "key": "device", "value": "/dev/nvidiactl" }
              ]
            },
            "volumes": [
              { "containerPath": "/tmp/.X11-unix/X0", "hostPath": "/tmp/.X11-unix/X0", "mode": "RW" }
            ]
          },
          "id": "maya",
          "instances": 1,
          "cpus": 4,
          "mem": 8024,
          "constraints": [ ["gfx", "CLUSTER", "gpu"] ]
        }'
        body_format: json
Studio Services
Studio Service Structure
Studio Service Deployment
Database
★ Sites (eg. London, San Francisco, Singapore etc.)
★ Departments
★ Shows (film)
★ Sequences
★ Shots
★ Tasks
★ Assets
★ Data
Modelling studio relationships
Challenges
★ New technologies
  ○ Graph database
  ○ Query language/APIs
  ○ Distributed storage engine
★ Complexity (both in the data modelling and system)
★ Adoption/Approval
Storage
Cloud Storage Pros and Cons
★ Managed
★ No more tape archives/backups

But...
★ Getting data into the cloud is expensive
★ Getting data into the cloud is slooow
Is there another way?
Work in Progress...
★ Applications need a POSIX filesystem interface
★ Can we cache cloud storage?
  ○ EFS
  ○ Avere
  ○ Homegrown
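The caching idea, in its simplest form, is a read-through cache between the POSIX-speaking application and the slow object store, so repeated reads of the same asset only cross the wire once. A minimal sketch, with a dict standing in for the cloud bucket and an invented plate path:

```python
# Hedged sketch of read-through caching in front of cloud storage:
# a cache miss fetches from the (slow, expensive) object store; a hit
# is served locally. The "cloud" dict and path are illustrative only.

cloud = {"/show/seq/shot/plate.0001.exr": b"pixels..."}
cache = {}
cloud_reads = 0

def read(path):
    global cloud_reads
    if path not in cache:          # miss: one round-trip to the cloud
        cloud_reads += 1
        cache[path] = cloud[path]
    return cache[path]             # hit: served from the local cache

read("/show/seq/shot/plate.0001.exr")
read("/show/seq/shot/plate.0001.exr")
print(cloud_reads)  # 1 -- the second read never touched the cloud
```

Products like Avere (and a homegrown equivalent) are essentially this idea plus write-back, eviction and a POSIX/NFS front end.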
Can we create content entirely in the cloud?
Workstations
Can we create content entirely in the Cloud?
★ Applications require OpenGL
★ OpenGL requires hardware
★ Hardware needs drivers
Can we do this in Docker?
Dockerising OpenGL Applications
★ NVIDIA drivers must match the host version exactly
★ Driver inside the container must not install kernel module
★ Container requires access to GPU device and X Server
Running an OpenGL Docker application

docker run \
  -it \
  -v /tmp/.X11-unix:/tmp/.X11-unix:rw \
  --device=/dev/dri/card0 \
  --device=/dev/nvidia0 \
  --device=/dev/nvidiactl \
  -e DISPLAY \
  docker-registry:5000/studio-local-base/maya
Scheduling a VFX app on Mesos in the cloud
★ Must use custom Mesos resources/attributes to only schedule on GPU machines
★ Cloud machines have no monitor
★ Remote desktop apps will forward GL calls to the client machine
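Attribute-based placement works by starting GPU agents with an attribute (e.g. `gfx:gpu`) and letting a constraint such as the `["gfx", "CLUSTER", "gpu"]` in the Marathon example earlier filter offers. A toy sketch of that filter; the offer dicts and hostnames are invented:

```python
# Sketch of constraining placement with Mesos agent attributes: only
# offers from agents advertising gfx=gpu are eligible for GPU tasks.
# (Real Mesos offers carry attributes as protobufs, not dicts.)

offers = [
    {"host": "cpu-node-1", "attributes": {}},
    {"host": "gpu-node-1", "attributes": {"gfx": "gpu"}},
]

def gpu_eligible(offers):
    """Hosts whose offers satisfy the gfx=gpu attribute constraint."""
    return [o["host"] for o in offers if o["attributes"].get("gfx") == "gpu"]

print(gpu_eligible(offers))  # ['gpu-node-1']
```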
Using VirtualGL
★ Intercepts GLX calls on the host
★ Calls forwarded to 2nd (local) X Server
★ GPU computation is done on the GPU and output forwarded to the 2D (VNC) X Server
3D X server setup

/etc/X11/xorg.conf

Section "Device"
    Identifier "Device0"
    Driver     "nvidia"
    VendorName "NVIDIA Corporation"
    BoardName  "GRID K520"
    BusID      "PCI:0:3:0"
EndSection

Section "Screen"
    Identifier   "Screen0"
    Device       "Device0"
    Monitor      "Monitor0"
    DefaultDepth 24
    Option       "UseDisplayDevice" "None"
    SubSection "Display"
        Depth 24
    EndSubSection
EndSection
$ lspci
00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 01)
00:02.0 VGA compatible controller: Cirrus Logic GD 5446
00:03.0 VGA compatible controller: NVIDIA Corporation GK104GL [GRID K520] (rev a1)
00:1f.0 Unassigned class [ff80]: XenSource, Inc. Xen Platform Device (rev 01)
Demo
We’re Hiring