Apache Beam - Big Data Paris
TRANSCRIPT
Who am I?
Jean-Baptiste Onofre <[email protected]> <[email protected]>
@jbonofre | http://blog.nanthrax.net
Member of the Apache Software Foundation
Fellow/Software Architect at Talend
PMC on ~20 Apache Projects from system integration & container (Karaf, Camel, ActiveMQ, Archiva,
Aries, ServiceMix, …) to big data (Beam, CarbonData, Falcon, Gearpump, Lens, …)
Apache Beam origin
[Diagram: Google internal technologies (MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel) led to Google Cloud Dataflow, whose model was open-sourced as Apache Beam]
Beam model: asking the right questions
What results are calculated?
Where in event time are results calculated?
When in processing time are results materialized?
How do refinements of results relate?
Customizing What / Where / When / How:
1. Classic batch
2. Windowed batch
3. Streaming
4. Streaming + accumulation
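As a hedged sketch (names like "input" and the window and trigger values are illustrative, not from the talk), the four questions map onto the Beam Java SDK roughly like this:

PCollection<KV<String, Integer>> scores = input
    // Where: fixed 2-minute event-time windows
    .apply(Window.<KV<String, Integer>>into(FixedWindows.of(Duration.standardMinutes(2)))
        // When: materialize when the watermark passes the end of the window
        .triggering(AfterWatermark.pastEndOfWindow())
        .withAllowedLateness(Duration.standardMinutes(10))
        // How: later refinements accumulate with earlier panes
        .accumulatingFiredPanes())
    // What: sum the integers per key
    .apply(Sum.integersPerKey());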
What is Apache Beam?
1. Unified model (Batch + strEAM)
What / Where / When / How
2. SDKs (Java, Python, ...) & DSLs (Scala, …)
3. Runners for Existing Distributed Processing
Backends (Google Dataflow, Spark, Flink, …)
4. IOs: Data store Sources / Sinks
Apache Beam vision
[Diagram: Beam Java, Beam Python, and other language SDKs feed into the Beam Model (pipeline construction); the Beam Model Fn Runners layer hands pipelines to execution engines such as Apache Flink, Apache Spark, and Cloud Dataflow]
1. End users: who want to write pipelines in a language that’s familiar.
2. SDK/DSL writers: who want to make Beam concepts available in new languages.
3. Runner writers: who have a distributed processing environment and want to support Beam pipelines.
Apache Beam - SDKs & DSLs
SDKs
API based on the Beam Model
1. Current:
a. Java
b. Python
2. Future (possible) SDKs:
Go, Ruby, etc.
DSLs
Domain-Specific Languages based on the
Beam Model:
1. Current:
• Scio (Scala API)
2. Future (ideas):
• Streaming SQL (Calcite)
• Machine Learning
• Complex Event Processing
Apache Beam SDK concepts
1. Pipeline - data processing job as a directed graph of transformations
2. PCollection - the data inside a pipeline
3. PTransform - a transformation step in the pipeline
a. IO transforms - Read from a Source or Write to a Sink.
b. Core transforms - common transformation provided (ParDo, GroupByKey, …)
c. Composite transforms - combine multiple transforms
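A minimal sketch of how these three concepts fit together (the file path is a placeholder):

Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.create());   // Pipeline
PCollection<String> lines =
    pipeline.apply(TextIO.Read.from("/tmp/input.txt"));                 // PCollection produced by an IO transform
PCollection<Long> count = lines.apply(Count.globally());                // core transform
pipeline.run();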
Apache Beam - Pipeline
[Diagram: data processing pipeline, executed via a Beam runner: Read PTransform (source) → PTransform → PTransform → Write PTransform (sink)]
Apache Beam - PCollection
1. A PCollection is immutable, does not support random access to elements, and belongs to a Pipeline
2. Each element in a PCollection has a Timestamp (commonly set by the IO Source)
3. Each PCollection has a Coder to support different data serializations
4. Bounded (batch) or Unbounded (streaming), depending on the IO Source
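A short sketch of these properties (the values are illustrative):

PCollection<String> words = pipeline
    .apply(Create.of("hello", "beam"))           // bounded, in-memory PCollection
    .setCoder(StringUtf8Coder.of());             // explicit Coder for serialization
// Timestamps are usually set by the IO Source, but can also be assigned explicitly:
PCollection<String> stamped =
    words.apply(WithTimestamps.of((String word) -> new Instant(0)));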
Apache Beam - PTransform
1. PTransforms are operations that transform data
2. They receive one or multiple PCollections and produce one or multiple PCollections
3. They must be Serializable
4. They should be thread-compatible (if you create your own threads, you must synchronize them)
5. Idempotency is not required, but recommended
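A hedged sketch of a custom element-wise transform via ParDo; the DoFn, and any fields it captures, must be Serializable because instances are shipped to the workers:

static class ExtractWordsFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // May emit zero, one, or many outputs per input element.
    for (String word : c.element().split("\\s+")) {
      c.output(word);
    }
  }
}
// Usage: PCollection<String> words = lines.apply(ParDo.of(new ExtractWordsFn()));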
Apache Beam - IO Transforms
1. IOs read/write data as PCollections (Source/Sink)
2. Support Bounded and/or Unbounded PCollections
3. Extensible API to create custom sources & sinks
4. Deal with timestamps, watermarks, deduplication, and read/write parallelism
Apache Beam - Current IOs
Ready
File
Avro
Google Cloud Storage
BigQuery
BigTable
DataStore
HDFS
Elasticsearch
HBase
MQTT
JDBC
Mongo / GridFS
JMS
Kafka
Kinesis
WIP
Hive
Cassandra
Redis
RabbitMQ
...
Apache Beam - Pipeline with IO Example
public static void main(String[] args) {
  // Create a pipeline parameterized by command line flags, e.g. --runner
  Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
  p.apply(KafkaIO.read().withBootstrapServers(servers)
          .withTopics(topics))                        // Read input
   .apply(new YourFancyFn())                          // Do some processing
   .apply(ElasticsearchIO.write().withAddress(esServer)
          .withIndex(index).withType(type));          // Write output
  // Run the pipeline.
  p.run();
}
Apache Beam - Programming model in the SDK
Grouping
GroupByKey
Combine -> Reduce
Sum
Count
Min
Max
Mean
...
Element-wise
ParDo
MapElements
FlatMapElements
Filter
WithKeys
Keys
Values
Windowing/Triggers
FixedWindows
GlobalWindows
SlidingWindows
Sessions
AfterWatermark
AfterProcessingTime
AfterPane
...
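As a hedged sketch tying an element-wise transform to a grouping transform (the "user,score" line format and the "lines" input are made up for illustration):

PCollection<KV<String, Integer>> userScores = lines
    .apply(MapElements.via(new SimpleFunction<String, KV<String, Integer>>() {
      @Override
      public KV<String, Integer> apply(String line) {
        String[] parts = line.split(",");
        return KV.of(parts[0], Integer.parseInt(parts[1]));  // element-wise: parse
      }
    }));
PCollection<KV<String, Double>> meanPerUser =
    userScores.apply(Mean.perKey());                         // grouping/combine: mean per user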
Apache Beam - Example - GDELT Events by location
Pipeline pipeline = Pipeline.create(options);
// Read events from a text file and parse them.
pipeline
    .apply("GDELTFile", TextIO.Read.from(options.getInput()))
    // Extract location from the fields.
    .apply("ExtractLocation", ParDo.of(...))
    // Count events per location.
    .apply("CountPerLocation", Count.<String>perElement())
    // Reformat KV as a String.
    .apply("StringFormat", MapElements.via(...))
    // Write to result files.
    .apply("Results", TextIO.Write.to(options.getOutput()));
// Run the batch pipeline.
pipeline.run();
Apache Beam - Runners / Execution Engines
Runners “translate” the code to a target runtime (the runner itself doesn’t provide the runtime)
Many runners are tied to other top-level Apache projects, such as Apache Flink and Apache Spark
As a result, runners can run on premise (for example, on your local Flink cluster) or in a public cloud (for example, on Google Cloud Dataproc or Amazon EMR)
Apache Beam is focused on treating runners as a top-level use case (with APIs, support, etc.) so runners can be developed with minimal friction for maximum pipeline portability
Runners
Google Cloud Dataflow (managed, NoOps)
Apache Flink
Apache Spark
Apache Apex
Apache MapReduce
Apache Gearpump
Apache Beam Direct Runner (local)
Apache Karaf (WIP)
Same code, different runners & runtimes
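To make "same code, different runners & runtimes" concrete, here is a minimal hedged sketch: the runner is selected at launch time through pipeline options (the runner names shown are the usual Beam identifiers; the matching runner dependency must be on the classpath):

public static void main(String[] args) {
  // e.g. --runner=DirectRunner (local), --runner=FlinkRunner, --runner=SparkRunner
  PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
  Pipeline p = Pipeline.create(options);
  // ... apply the same transforms, unchanged for every runner ...
  p.run();
}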
Apache Beam - Use cases
Apache Beam is a great choice for both batch and stream processing and can handle bounded and unbounded datasets
Batch pipelines can focus on ETL/ELT, catch-up processing, daily aggregations, and so on
Streaming pipelines can focus on real-time processing on a record-by-record basis
Real use cases
Data processing, both batch and stream processing
Real-time event processing from IoT devices
Fraud detection, ...
Why Apache Beam?
1. Portable - The same code runs with different runners (runner-agnostic) and backends, on premise, in the cloud, or locally
2. Unified - Same unified model for batch and stream processing
3. Advanced features - Event windowing, triggering, watermarking, lateness, etc.
4. Extensible model and SDK - Extensible API; can define custom sources to read and write in parallel
Growing the Beam Community
Collaborate - Beam is becoming a community-driven effort with participation from many organizations and contributors
Grow - We want to grow the Beam ecosystem and community through active, open involvement, so that Beam is part of the larger OSS ecosystem
Learn More!
Apache Beam: http://beam.apache.org
Join the Beam user and dev mailing lists at beam.apache.org
Follow @ApacheBeam on Twitter