Apache Beam - Big Data Paris
TRANSCRIPT
Who am I?
Jean-Baptiste Onofre <[email protected]> <[email protected]>
@jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache Projects from system integration & container (Karaf, Camel, ActiveMQ, Archiva,
Aries, ServiceMix, …) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, …)
Apache Beam origin
[Diagram: Google internal technologies (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow, whose model was open-sourced as Apache Beam]
Beam model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
Customizing What / Where / When / How:
1. Classic batch
2. Windowed batch
3. Streaming
4. Streaming + accumulation
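As a hedged sketch (names like "input" and the window and trigger values are illustrative, not from the talk), the four questions map onto the Beam Java SDK roughly like this:

PCollection<KV<String, Integer>> scores = input
    // Where: fixed 2-minute event-time windows
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // When: materialize when the watermark passes the end of the window
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardMinutes(10))
        // How: later refinements accumulate with earlier panes
        .accumulatingFiredPanes())
    // What: sum the integers per key
    .apply(Sum.integersPerKey());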
What is Apache Beam?
1. Unified model (Batch + strEAM)
What / Where / When / How
2. SDKs (Java, Python, ...) & DSLs (Scala, …)
3. Runners for Existing Distributed Processing
Backends (Google Dataflow, Spark, Flink, …)
4. IOs: Data store Sources / Sinks
Apache Beam vision
[Diagram: Beam Java, Beam Python, and other language SDKs feed into the Beam Model (pipeline construction); the Beam Model Fn Runners layer hands pipelines to execution engines such as Apache Flink, Apache Spark, and Cloud Dataflow]
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK/DSL writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
Apache Beam - SDKs & DSLs
SDKs
API based on the Beam Model
1. Current:
a. Java
b. Python
2. Future (possible) SDKs:
Go, Ruby, etc.
DSLs
Domain-Specific Languages based on the
Beam Model:
1. Current:
• Scio (Scala API)
2. Future (ideas):
• Streaming SQL (Calcite)
• Machine Learning
• Complex Event Processing
Apache Beam SDK concepts
1. Pipeline - data processing job as a directed graph of transformations
2. PCollection - the data inside a pipeline
3. PTransform - a transformation step in the pipeline
a. IO transforms - Read from a Source or Write to a Sink.
b. Core transforms - common transformation provided (ParDo, GroupByKey, …)
c. Composite transforms - combine multiple transforms
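A minimal sketch of how these three concepts fit together (the file path is a placeholder):

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());   // Pipeline
PCollection<String> lines =
    pipeline.apply(TextIO.Read.from("/tmp/input.txt"));                 // PCollection produced by an IO transform
PCollection<Long> count = lines.apply(Count.globally());                // core transform
pipeline.run();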
Apache Beam - Pipeline
[Diagram: data processing pipeline, executed via a Beam runner: Read PTransform (source) → PTransform → PTransform → Write PTransform (sink)]
Apache Beam - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
2. Each element in a PCollection has a Timestamp (commonly set by the IO Source)
3. Each PCollection has a Coder to support different data serializations
4. Bounded (batch) or Unbounded (streaming), depending on the IO Source
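A short sketch of these properties (the values are illustrative):

PCollection<String> words = pipeline
    .apply(Create.of("hello", "beam"))           // bounded, in-memory PCollection
    .setCoder(StringUtf8Coder.of());             // explicit Coder for serialization
// Timestamps are usually set by the IO Source, but can also be assigned explicitly:
PCollection<String> stamped =
    words.apply(WithTimestamps.of((String word) -> new Instant(0)));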
Apache Beam - PTransform
1. PTransforms are operations that transform data
2. They receive one or multiple PCollections and produce one or multiple PCollections
3. They must be Serializable
4. They should be thread-compatible (if you create your own threads, you must synchronize them)
5. Idempotency is not required, but recommended
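A hedged sketch of a custom element-wise transform via ParDo; the DoFn, and any fields it captures, must be Serializable because instances are shipped to the workers:

static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // May emit zero, one, or many outputs per input element.
    for (String word : c.element().split("\\s+")) {
      c.output(word);
    }
  }
}
// Usage: PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));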
Apache Beam - IO Transforms
1. IOs read/write data as PCollections (Source/Sink)
2. Support Bounded and/or Unbounded PCollections
3. Extensible API to create custom sources & sinks
4. Deal with timestamps, watermarks, deduplication, and read/write parallelism
Apache Beam - Current IOs
Ready
File
Avro
Google Cloud Storage
BigQuery
BigTable
DataStore
HDFS
Elasticsearch
HBase
MQTT
JDBC
Mongo / GridFS
JMS
Kafka
Kinesis
WIP
Hive
Cassandra
Redis
RabbitMQ
...
Apache Beam - Pipeline with IO Example
public static void main(String[] args) {
  // Create a pipeline parameterized by command line flags, e.g. --runner
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(KafkaIO.read().withBootstrapServers(servers)
          .withTopics(topics))                        // Read input
   .apply(new YourFancyFn())                          // Do some processing
   .apply(ElasticsearchIO.write().withAddress(esServer)
          .withIndex(index).withType(type));          // Write output
  // Run the pipeline.
  p.run();
}
Apache Beam - Programming model in the SDK
Grouping
GroupByKey
Combine -> Reduce
Sum
Count
Min
Max
Mean
...
Element-wise
ParDo
MapElements
FlatMapElements
Filter
WithKeys
Keys
Values
Windowing/Triggers
FixedWindows
GlobalWindows
SlidingWindows
Sessions
AfterWatermark
AfterProcessingTime
AfterPane
...
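As a hedged sketch tying an element-wise transform to a grouping transform (the "user,score" line format and the "lines" input are made up for illustration):

PCollection<KV<String, Integer>> userScores = lines
    .apply(MapElements.via(new SimpleFunction<String, KV<String, Integer>>() {
      @Override
      public KV<String, Integer> apply(String line) {
        String[] parts = line.split(",");
        return KV.of(parts[0], Integer.parseInt(parts[1]));  // element-wise: parse
      }
    }));
PCollection<KV<String, Double>> meanPerUser =
    userScores.apply(Mean.perKey());                         // grouping/combine: mean per user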
Apache Beam - Example - GDELT Events by location
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline
    .apply("GDELTFile", TextIO.Read.from(options.getInput()))
    // Extract location from the fields.
    .apply("ExtractLocation", ParDo.of(...))
    // Count events per location.
    .apply("CountPerLocation", Count.<String>perElement())
    // Reformat KV as a String.
    .apply("StringFormat", MapElements.via(...))
    // Write to result files.
    .apply("Results", TextIO.Write.to(options.getOutput()));
// Run the batch pipeline.
pipeline.run();
Apache Beam - Runners / Execution Engines
Runners “translate” the code to a target runtime (the runner itself doesn’t provide the runtime)
Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
As a result, runners can run on premise (for example, on your local Flink cluster) or in a public cloud (for example, on Google Cloud Dataproc or Amazon EMR)
Apache Beam is focused on treating runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability
Runners
Google Cloud Dataflow (managed, NoOps)
Apache Flink
Apache Spark
Apache Apex
Apache MapReduce
Apache Gearpump
Apache Beam Direct Runner (local)
Apache Karaf (WIP)
Same code, different runners & runtimes
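To make "same code, different runners & runtimes" concrete, here is a minimal hedged sketch: the runner is selected at launch time through pipeline options (the runner names shown are the usual Beam identifiers; the matching runner dependency must be on the classpath):

public static void main(String[] args) {
  // e.g. --runner=DirectRunner (local), --runner=FlinkRunner, --runner=SparkRunner
  PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
  Pipeline p = Pipeline.create(options);
  // ... apply the same transforms, unchanged for every runner ...
  p.run();
}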
Apache Beam - Use cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets
Batch pipelines can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
Streaming pipelines can focus on real-time processing on a record-by-record basis
Real use cases
Data processing, both batch and stream processing
Real-time event processing from IoT devices
Fraud detection, ...
Why Apache Beam?
1. Portable - The same code runs with different runners (runner-agnostic) and backends, on premise, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel
Growing the Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community through active, open involvement, so that Beam is part of the larger OSS ecosystem
Learn More!
Apache Beam: http://beam.apache.org
Join the Beam user and dev mailing lists at beam.apache.org
Follow @ApacheBeam on Twitter