PORTABLE BATCH AND STREAM DATA PROCESSING WITH APACHE BEAM William Vambenepe, Google @vambenepe

Post on 11-Oct-2019

SPEAKERS INFO

WILLIAM VAMBENEPE

Group Product Manager

Data Processing and Analytics

Google Cloud Platform

@vambenepe

Open source (top-level Apache project)

Portable

Unifies batch and stream

Cloud-native

Built on 15 years of large-scale data processing at Google

You don’t need to be a developer to benefit from Beam

APACHE BEAM: THE KEY TO MODERN DATA PROCESSING

[Figure: Google data processing technologies shown on the slide: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Cloud Dataflow, Apache Beam]

THE EVOLUTION OF DATA PIPELINES

BEAM = Batch + StrEAM

Progressive evolution from batch to stream

- Stream as the new default

Cost/perf trade-offs without re-architecting

- Just turn the knob

ML: data preparation consistency between training & scoring

- Same pipeline to train in batch and score in stream

BENEFIT OF BATCH / STREAM UNIFICATION
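
The "same pipeline to train in batch and score in stream" point can be illustrated without any Beam dependency. The following is a minimal sketch in plain Python, not Beam SDK code: the `prepare` function, the record fields, and the `live_events` source are hypothetical names invented for illustration. It shows one data-preparation step consuming both a bounded list and an unbounded generator.

```python
import itertools
from typing import Iterable, Iterator


def prepare(records: Iterable[dict]) -> Iterator[dict]:
    """Shared preparation step (hypothetical): drop invalid rows, derive one feature."""
    for rec in records:
        if rec["amount"] < 0:
            continue  # same filtering rule for training and scoring
        yield {"user": rec["user"], "amount_cents": int(round(rec["amount"] * 100))}


# Batch: a bounded, historical dataset used for training.
historical = [
    {"user": "a", "amount": 1.5},
    {"user": "b", "amount": -2.0},
    {"user": "c", "amount": 0.25},
]
training_rows = list(prepare(historical))


# Stream: an unbounded source used for scoring (simulated by an endless generator).
def live_events() -> Iterator[dict]:
    for i in itertools.count():
        yield {"user": f"u{i}", "amount": i * 0.5}


# Only the first three results are taken, because the stream never ends.
scoring_rows = list(itertools.islice(prepare(live_events()), 3))

print(training_rows)
print(scoring_rows)
```

Because both paths call the same `prepare` function, a change to the preparation logic applies to training and scoring at once, which is the consistency benefit the slide describes.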

PROCESSING TIME VS. EVENT TIME
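
To make the distinction concrete, here is a small sketch in plain Python (not Beam SDK code). The `Event` class and the sample timestamps are assumptions made up for illustration. Each record carries the time it happened (event time) and the time the pipeline observed it (processing time), and the gap between the two is the skew.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    key: str
    value: int
    event_time: int       # seconds: when the event actually happened
    processing_time: int  # seconds: when the pipeline observed it


events = [
    Event("user1", 5, event_time=10, processing_time=12),
    Event("user1", 7, event_time=65, processing_time=66),
    Event("user1", 3, event_time=50, processing_time=130),  # delayed, arrives out of order
]


def skew(e: Event) -> int:
    """Delay between when an event happened and when it was observed."""
    return e.processing_time - e.event_time


# The order the pipeline sees differs from the order in which things happened.
arrival_order = sorted(events, key=lambda e: e.processing_time)
occurrence_order = sorted(events, key=lambda e: e.event_time)
skews = [skew(e) for e in arrival_order]

print([e.value for e in arrival_order])
print([e.value for e in occurrence_order])
print(skews)
```

The third event happened before the second one but was observed more than a minute later, so any result grouped by event time has to tolerate late, out-of-order arrivals.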

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

THE BEAM MODEL: ASKING THE RIGHT QUESTIONS
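
The first two questions can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not Beam SDK code: the 60-second window size, the `window_start` helper, and the sample events are hypothetical. "What" is a per-key sum, and "where" is a fixed window in event time. The sketch leaves "when" and "how" aside by computing the result only after all input is known.

```python
from collections import defaultdict

WINDOW_SIZE = 60  # seconds; an arbitrary size chosen for this sketch


def window_start(event_time: int, size: int = WINDOW_SIZE) -> int:
    """WHERE: map an event timestamp to the start of its fixed window."""
    return event_time - event_time % size


def sum_per_window(events):
    """WHAT: sum values per (key, window). events: iterable of (key, value, event_time)."""
    totals = defaultdict(int)
    for key, value, event_time in events:
        totals[(key, window_start(event_time))] += value
    return dict(totals)


# Arrival order does not matter here: grouping depends only on event time.
events = [
    ("user1", 5, 10),
    ("user1", 7, 65),
    ("user1", 3, 50),  # arrives after the event at t=65 but belongs to window 0
    ("user2", 4, 59),
]
result = sum_per_window(events)
print(result)
```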

The Beam Model: WHAT is being computed?

The Beam Model: WHERE in event time?

The Beam Model: WHEN in processing time?

The Beam Model: HOW do refinements relate?
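
The "when" and "how" questions can be sketched for a single window in plain Python. This is a simplified simulation, not Beam SDK code: the `run_window` function, the tuple encoding of arrivals, and the firing policy (one on-time result when the watermark passes the end of the window, then one result per late element) are assumptions chosen for illustration. The `accumulating` flag controls how a later result relates to an earlier one.

```python
def run_window(arrivals, window_end, accumulating):
    """Simulate results emitted for one event-time window [0, window_end).

    arrivals: list of ("data", value) or ("watermark", event_time) tuples,
              given in processing-time order.
    WHEN: emit once when the watermark reaches window_end, then again for
          every late element.
    HOW:  accumulating=True emits the running total each time;
          accumulating=False emits only what arrived since the last result.
    """
    emitted = []
    total = 0      # everything seen so far in this window
    pending = 0    # values seen since the last emitted result
    closed = False
    for kind, x in arrivals:
        if kind == "watermark":
            if not closed and x >= window_end:
                closed = True  # on-time result
                emitted.append(total if accumulating else pending)
                pending = 0
        else:
            total += x
            pending += x
            if closed:  # late element: refine the earlier result
                emitted.append(total if accumulating else pending)
                pending = 0
    return emitted


arrivals = [("data", 5), ("data", 3), ("watermark", 60), ("data", 4)]
accumulated = run_window(arrivals, window_end=60, accumulating=True)
discarded = run_window(arrivals, window_end=60, accumulating=False)
print(accumulated)
print(discarded)
```

With accumulation, each new result replaces the previous one. Without it, a consumer has to add the results together to obtain the full total.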

What results are calculated?

Where in event time are results calculated?

When in processing time are results materialized?

How do refinements of results relate?

THE BEAM MODEL: ASKING THE RIGHT QUESTIONS

PORTABLE

Write once, run anywhere

The Beam Model: the abstractions at the core of Apache Beam

Choice of API: Users write their pipelines in a language that’s familiar and integrated with their other tooling

Choice of Runtime: Users choose the right runner for their current needs -- on-prem / cloud, open source / not, fully managed / not

Scalability for Developers: Clean APIs allow developers to contribute modules independently

[Diagram: layered stack. Language A, B, and C SDKs; the Beam Model; Runners 1, 2, and 3; the Beam Model; Languages A, B, and C]

BEAM VISION: MIX AND MATCH SDKS AND RUNTIMES
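
The separation between pipeline definition and execution can be sketched in plain Python. This is a toy model, not the Beam API: the `Pipeline`, `Runner`, `LocalRunner`, and `StagewiseRunner` classes are hypothetical names invented for illustration. The pipeline is described once, and two runners with different execution strategies produce the same result.

```python
from abc import ABC, abstractmethod
from typing import Callable, Iterable, List


class Pipeline:
    """Runner-independent description of work: an ordered list of element-wise steps."""

    def __init__(self) -> None:
        self.steps: List[Callable] = []

    def apply(self, fn: Callable) -> "Pipeline":
        self.steps.append(fn)
        return self  # allow chaining


class Runner(ABC):
    """Anything that knows how to execute a Pipeline."""

    @abstractmethod
    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        ...


class LocalRunner(Runner):
    """Hypothetical runner: pushes each element through all steps in turn."""

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        out = []
        for x in data:
            for step in pipeline.steps:
                x = step(x)
            out.append(x)
        return out


class StagewiseRunner(Runner):
    """Hypothetical runner: finishes each step over the whole dataset before the next."""

    def run(self, pipeline: Pipeline, data: Iterable) -> list:
        current = list(data)
        for step in pipeline.steps:
            current = [step(x) for x in current]
        return current


# The pipeline is written once, with no reference to how it will be executed.
pipeline = Pipeline().apply(str.strip).apply(str.upper).apply(len)
data = [" beam ", "runner"]

results = {type(r).__name__: r.run(pipeline, data) for r in (LocalRunner(), StagewiseRunner())}
print(results)
```

Swapping the runner changes how the work is scheduled, not what the pipeline computes, which is the property that lets users pick a runtime to fit their current needs.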

APACHE SPARK

Open-source cluster-computing framework

Large ecosystem of APIs and tools

Runs on premise or in the cloud

APACHE FLINK

Open-source distributed data processing engine

High-throughput and low-latency stream processing

Runs on premise or in the cloud

EXAMPLE BEAM RUNNERS

GOOGLE CLOUD DATAFLOW

Fully-managed service for batch and stream data processing

Provides dynamic auto-scaling, monitoring tools, and tight integration with Google Cloud Platform

[Diagram: Google Cloud data pipeline with the stages Capture, Process (Stream and Batch), Store, Analyze, and Use. Components shown: GA 360, Adwords, DoubleClick, YouTube, Firebase, Stackdriver, Cloud Pub/Sub, Storage Transfer Service, Cloud Dataflow, Cloud Dataproc, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), BigQuery Analytics, SQL, Cloud Datalab, ML Engine, real-time analytics, real-time dashboard, real-time alerts, etc.]

BEAM ON GOOGLE CLOUD: SERVERLESS DATA PROCESSING

Streaming 101 and 102: The World Beyond Batch

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101

https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102

MORE INFO

Apache Beam: https://beam.apache.org

Google Cloud Platform: https://cloud.google.com

The Dataflow Model paper from VLDB 2015: http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf

THANK YOU!