AWS re:Invent 2013: Scalable Media Processing in the Cloud


DESCRIPTION

Presentation from AWS re:Invent 2013. See the session video here: http://www.youtube.com/watch?v=MjZdiDotRU8 The presentation is in two parts: (1) an introduction to moving media workloads to the cloud, and (2) a deep dive on how the BBC moved their playout to the cloud.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

Scalable Media Processing

Phil Cluff, British Broadcasting Corporation

David Sayed, Amazon Web Services

November 13, 2013

Agenda

• Media workflows
• Where AWS fits
• Cloud media processing approaches
• BBC iPlayer in the cloud

Media Workflows

[Diagram] Inputs: Featurettes, Interviews, 2D Movie, 3D Movie, Archive Materials, Stills. Outputs: Networks, Theatrical, DVD/BD, Online, Mobile Apps, Archive, MSOs.

Media Workflow


Where AWS Fits Into Media Processing

Amazon Web Services

Ingest → Index → Process → Package → Protect → QC → Auth. → Track → Playback

Media Asset Management

Analytics and Monetization

Media Processing Approaches

3 Phases

Cloud Media Processing Approaches

Phase 1: Lift processing from the premises and shift to the cloud

Lift and Shift

[Diagram] On premises: Media Processing Operation (OS, Storage) → moved as-is onto EC2: Media Processing Operation (OS, Storage).

The Problem with Lift and Shift

[Diagram] A monolithic Media Processing Operation (OS, Storage) on EC2, bundling: Ingest Operation, Post-processing, Export, Workflow, Parameters.

Cloud Media Processing Approaches: Phase 2

Phase 1: Lift processing from the premises and shift to the cloud

Phase 2: Refactor and optimize to leverage cloud resources

Refactor and Optimization Opportunities

“Deconstruct monolithic media processing operations”

– Ingest
– Atomic media processing operation
– Post-processing
– Export
– Workflow
– Parameters

Refactoring and Optimization Example

[Diagram] API calls to SWF coordinate the workflow: Source S3 Bucket → EC2 + EBS worker instances (×3) → Output S3 Bucket.

Cloud Media Processing Approaches

Phase 1: Lift processing from the premises and shift to the cloud

Phase 2: Refactor and optimize to leverage cloud resources

Phase 3: Decomposed, modular cloud-native architecture

Decomposition and Modularization Ideas for Media Processing

• Decouple *everything* that is not part of atomic media processing operation

• Use managed services where possible for workflow, queues, databases, etc.

• Manage
– Capacity
– Redundancy
– Latency
– Security

BBC iPlayer in the Cloud

AKA “Video Factory”

Phil Cluff
Principal Software Engineer & Team Lead
BBC Media Services

• The UK’s biggest video & audio on-demand service
– And it’s free!

• Over 7 million requests every day
– ~2% of overall consumption of BBC output

• Over 500 unique hours of content every week
– Available immediately after broadcast, for at least 7 days

• Available on over 1000 devices including
– PC, iOS, Android, Windows Phone, Smart TVs, Cable Boxes…

• Both streaming and download (iOS, Android, PC)

• 20 million app downloads to date

Sources: BBC iPlayer Performance Pack August 2013
http://www.bbc.co.uk/blogs/internet/posts/Video-Factory

Video: “Where Next?”

What Is Video Factory?

• Complete in-house rebuild of ingest, transcode, and delivery workflows for BBC iPlayer

• Scalable, message-driven cloud-based architecture

• The result of 1 year of development by ~18 engineers

And here they are!

Why Did We Build Video Factory?

• Old system
– Monolithic
– Slow
– Couldn’t cope with spikes
– Mixed ownership with third party

• Video Factory
– Highly scalable, reliable
– Completely elastic transcode resource
– Complete ownership

Why Use the Cloud?

• Background of 6 channels, spikes up to 24 channels, 6 days a week
• A perfect pattern for an elastic architecture

Off-Air Transcode Requests for 1 week

Video Factory – Architecture

• Entirely message driven
– Amazon Simple Queuing Service (SQS)
– Some Amazon Simple Notification Service (SNS)
– We use lots of classic message patterns

• ~20 small components
– Singular responsibility – “Do one thing, and do it well”
– Share libraries if components do things that are alike; control bloat
– Components have contracts of behavior – easy to test

Video Factory – Workflow

[Diagram] SDI Broadcast Video Feed (×24) → Broadcast Encoder → RTP Chunker → Amazon S3 (Mezzanine) → Time Addressable Media Store; Playout Data Feed → Live Ingest Logic; Transcode Abstraction Layer → Amazon Elastic Transcoder / Elemental Cloud → Amazon S3 (Distribution Renditions); plus DRM, QC, Editorial Clipping, and MAM. Flows shown: Mezzanine, Playout Video, Transcoded Video, Metadata, SMPTE Timecode.

Detail

• Mezzanine video capture
• Transcode abstraction
• Eventing demonstration

Mezzanine Video Capture

Mezzanine Capture

[Diagram] SDI Broadcast Video Feed (×24) → Broadcast Grade Encoder → MPEG2 Transport Stream (H.264) on RTP Multicast (30 MB HD / 10 MB SD) → RTP Chunker → MPEG2 Transport Stream (H.264) Chunks → Chunk Uploader → Amazon S3 (Mezzanine Chunks) → Chunk Concatenator → Amazon S3 (Mezzanine, 3 GB HD / 1 GB SD). Control Messages and SMPTE Timecode flow alongside.

Concatenating Chunks

• Build file using Amazon S3 multipart requests
– 10 GB Mezzanine file constructed in under 10 seconds

• Amazon S3 multipart APIs are very helpful
– Component only makes REST API calls
– Small instances still give very high performance

• Be careful – Amazon S3 isn’t immediately consistent when dealing with multipart-built files
– Mitigated with rollback logic in message-based applications
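The concatenation step can be sketched as a pure planning helper that maps each captured chunk to an S3 multipart part number while checking S3's multipart limits. This is an illustrative sketch, not the BBC's actual code; the real component would then issue UploadPartCopy REST calls for each planned part.

```python
# Hypothetical planner for "concatenating" chunk objects via S3 multipart.
# S3 requires every part except the last to be at least 5 MiB, and allows
# at most 10,000 parts per multipart upload.

MIN_PART = 5 * 1024 * 1024   # S3 minimum size for all non-final parts
MAX_PARTS = 10_000           # S3 part-count limit per upload

def plan_parts(chunk_sizes):
    """Map each chunk to a 1-based S3 part number, validating limits."""
    if len(chunk_sizes) > MAX_PARTS:
        raise ValueError("too many chunks for one multipart upload")
    for size in chunk_sizes[:-1]:            # the last part may be smaller
        if size < MIN_PART:
            raise ValueError("non-final part below the 5 MiB S3 minimum")
    return [(i + 1, size) for i, size in enumerate(chunk_sizes)]

# Three 30 MB chunks plus a smaller final chunk plan cleanly:
parts = plan_parts([30 * 1024 * 1024] * 3 + [7 * 1024 * 1024])
print(len(parts))  # 4
```

Because only part metadata moves over the API, even small instances can assemble a multi-gigabyte file quickly, matching the "10 GB in under 10 seconds" figure above.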

By Numbers – Mezzanine Capture

• 24 channels
– 6 HD, 18 SD
– 16 TB of Mezzanine data every day per capture

• 200,000 chunks every day
– And Amazon S3 has never lost one
– That’s ~2 (UK) billion RTP packets every day… per capture

• Broadcast grade resiliency
– Several data centers / 2 copies each

Transcode Abstraction

• Abstract away from single supplier
– Avoid vendor lock-in
– Choose suppliers based on performance, quality, and broadcaster-friendly feature sets
– BBC: Elemental Cloud (GPU), Amazon Elastic Transcoder, in-house for subtitles

• Smart routing & smart bundling
– Save money on non-time-critical transcode
– Save time & money by bundling together “like” outputs

• Hybrid cloud friendly
– Route a baseline of transcode to local encoders, and spike to cloud

• Who has the next game changer?
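The "smart routing" idea above can be sketched as a tiny dispatch function. The backend names follow the talk, but the routing rules themselves are invented for illustration; the real router would weigh cost, deadline, and output bundling.

```python
# Illustrative transcode router for the abstraction layer.
# The job fields ("output", "time_critical") are assumed names.

def route(job):
    if job.get("output") == "subtitles":
        return "in-house subtitle backend"   # BBC-specific processing
    if job.get("time_critical"):
        return "Elemental Cloud"             # GPU-backed, fastest turnaround
    return "Amazon Elastic Transcoder"       # cheaper for non-urgent work

print(route({"output": "h264", "time_critical": True}))  # Elemental Cloud
```

Keeping this decision in one place is what makes swapping in a future "game changer" backend a routing change rather than a workflow rewrite.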

Transcode Abstraction

[Diagram] Transcode Request → Transcode Router → Amazon Elastic Transcoder Backend (REST to Amazon Elastic Transcoder), Elemental Backend (SQS to Elemental Cloud), and Subtitle Extraction Backend. Backends read Amazon S3 (Mezzanine) and write Amazon S3 (Distribution Renditions).

Transcode Abstraction - Future

[Diagram] Transcode Request → Transcode Router → Amazon Elastic Transcoder Backend, Elemental Backend, Subtitle Extraction Backend, and an Unknown Future Backend X (?) reached over REST. Backends read Amazon S3 (Mezzanine) and write Amazon S3 (Distribution Renditions).

Example – A Simple Elastic Transcoder Backend

[Diagram] XML Transcode Request → Get Message from Queue → Unmarshal and Validate Message → Initialize Transcode (POST to Amazon Elastic Transcoder) → Wait for SNS Callback over HTTP (POST via SNS) → XML Transcode Status Message. The whole flow runs inside a single SQS message transaction.
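The happy path of that backend can be sketched as a short handler: take a message, unmarshal and validate it, then kick off the transcode before waiting for the callback. Everything here is a stand-in: the queue, the request shape, and the transcoder call are not real SQS or Elastic Transcoder APIs, and JSON stands in for the XML messages the talk describes.

```python
import json

# Sketch of the backend's message transaction (names are invented).

def handle(raw_message, start_transcode):
    job = json.loads(raw_message)               # unmarshal the message
    if "input_key" not in job:                  # validate it
        raise ValueError("invalid transcode request")
    job_id = start_transcode(job["input_key"])  # initialize the transcode
    return job_id                               # caller then waits for the
                                                # SNS callback over HTTP

job_id = handle('{"input_key": "mezzanine/ep1.ts"}', lambda key: "job-001")
print(job_id)  # job-001
```

Because the SQS message stays in flight until the whole transaction completes, a crash at any step simply makes the message visible again for a retry.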

Example – Add Error Handling

[Diagram] The same flow, with error handling added: a Bad Message Queue, a Fail Queue, and a Dead Letter Queue hang off the processing steps, still inside the SQS message transaction.

Example – Add Monitoring Events

[Diagram] The same flow again, now with Monitoring Events emitted from each processing step, still inside the SQS message transaction.

BBC eventing framework

• Key-value pairs pushed into Splunk
– Business-level events, e.g.:
• Message consumed
• Transcode started
– System-level events, e.g.:
• HTTP call returned status 404
• Application’s heap size
• Unhandled exception

• Fixed model for “context” data
– Identifiable workflows, grouping of events; transactions
– Saves us a LOT of time diagnosing failures
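A minimal sketch of what such an event line might look like, assuming Splunk's conventional key="value" format; the field names and context shape here are invented, not the BBC's actual schema.

```python
# Hypothetical Splunk-friendly event formatter with a fixed "context"
# block (workflow / transaction ids) prepended to every event.

def event(name, context, **fields):
    pairs = {"event": name, **context, **fields}
    return " ".join(f'{k}="{v}"' for k, v in pairs.items())

ctx = {"workflow_id": "wf-123", "transaction_id": "tx-9"}
print(event("transcode_started", ctx, backend="elemental"))
```

Because every component stamps the same context fields, Splunk can group all events for one workflow or transaction, which is what makes failure diagnosis fast.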

Component Development – General Development & Architecture

• Java applications
– Run inside Apache Tomcat on m1.small EC2 instances
– Run at least 3 of everything
– Autoscale on queue depth

• Built on top of the Apache Camel framework
– A platform for building message-driven applications
– Reliable, well-tested SQS backend
– Camel route builders Java DSL
– Full of messaging patterns

• Developed with Behavior-Driven Development (BDD) & Test-Driven Development (TDD)
– Cucumber

• Deployed continuously
– Many times a day, 5 days a week
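"Autoscale on queue depth, run at least 3 of everything" can be expressed as a back-of-envelope scaling rule. The per-instance throughput figure below is an assumption for illustration, not a number from the talk.

```python
import math

# Hypothetical queue-depth autoscaling rule: scale out with backlog,
# but never drop below the fleet minimum of 3 instances.

def desired_instances(queue_depth, jobs_per_instance=10, minimum=3):
    return max(minimum, math.ceil(queue_depth / jobs_per_instance))

print(desired_instances(0))   # 3  (never below the minimum)
print(desired_instances(85))  # 9
```

A rule like this is what lets transcode capacity track the 6-to-24-channel off-air spikes described earlier.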

Error Handling Messaging Patterns

• We use several message patterns
– Bad message queue
– Dead letter queue
– Fail queue

• Key concept
– Never lose a message
– A message is either in-flight, done, or in an error queue somewhere

• All require human intervention for the workflow to continue
– Not necessarily a bad thing

Message Patterns – Bad Message Queue

• Wrapped in a message wrapper which contains context
• Never retried
• Very rare in production systems
• Implemented as an exception handler on the route builder

When it happens: the message doesn’t unmarshal to the object it should, OR it unmarshals but doesn’t meet our validation rules.

Message Patterns – Dead Letter Queue

• Message is an exact copy of the input message
• Retried several times before being put on the DLQ
• Can be common, even in production systems
• Implemented as a bean in the route builder for SQS

When it happens: we tried processing the message a number of times, and something we weren’t expecting went wrong each time.

Message Patterns – Fail Queue

• Wrapped in a message wrapper that contains context
• Requires some level of knowledge of the system to be retried
• Often evolves from understanding the causes of DLQ’d messages
• Implemented as an exception handler on the route builder

When it happens: something I knew could go wrong went wrong.
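The three error-queue patterns can be summarized as a single routing decision. The exception classes and retry limit below are invented for the sketch; in the real system the distinction comes from Camel exception handlers and SQS redrive policy rather than a function like this.

```python
# Illustrative routing of a failed message to the three error queues.

class BadMessageError(Exception):
    """Unmarshalling or validation failed - never retried."""

class KnownFailure(Exception):
    """An anticipated failure - needs operator knowledge to retry."""

def route_failure(exc, attempts, max_retries=3):
    if isinstance(exc, BadMessageError):
        return "bad message queue"   # wrapped with context, never retried
    if isinstance(exc, KnownFailure):
        return "fail queue"          # wrapped with context
    if attempts >= max_retries:
        return "dead letter queue"   # exact copy of the input message
    return "retry"                   # unexpected failure, try again

print(route_failure(KnownFailure(), attempts=1))  # fail queue
```

Whatever branch is taken, the message ends up either in flight, done, or parked in a queue an operator can inspect, which is the "never lose a message" guarantee.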

Demonstration – Eventing Framework

Questions?

philip.cluff@bbc.co.uk
dsayed@amazon.com

@GeneticGenesis
@dsayed

Please give us your feedback on this presentation

As a thank you, we will select prize winners daily for completed surveys!

MED302
