DataTorrent Presentation @ Big Data Application Meetup

Introduction to Open Source Unified Streaming and Fast Batch Platform: Apache Apex (incubating)

Thomas Weise <[email protected]>, Dec 2nd, 2015



© 2015 DataTorrent

Apex Platform Overview

Apache Malhar Library

Native Hadoop Integration

• YARN is the resource manager

• HDFS is used for storing any persistent state

Application Programming Model

A Stream is a sequence of data tuples.

An Operator takes one or more input streams, performs computations, and emits one or more output streams.

• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library
• An Operator has many instances that run in parallel, and each instance is single-threaded

A Directed Acyclic Graph (DAG) is made up of operators and streams.
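The operator/stream model described above can be sketched in plain Java. This is a minimal illustration, not the Apex API: the `Operator` interface, the `filter` and `enrich` helpers, and the `DagSketch` class are hypothetical names invented for this example, and the pipeline runs in a single thread rather than on a cluster.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Function;
import java.util.function.Predicate;

// Illustrative model only; these types are hypothetical and are not the Apex API.
final class DagSketch {

    /** An operator consumes one tuple at a time and emits tuples on its output stream. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Emits only the tuples that satisfy the predicate (a "filtered stream"). */
    static <T> Operator<T, T> filter(Predicate<T> keep) {
        return (tuple, output) -> {
            if (keep.test(tuple)) {
                output.accept(tuple);
            }
        };
    }

    /** Emits one transformed tuple per input tuple (an "enriched stream"). */
    static <I, O> Operator<I, O> enrich(Function<I, O> fn) {
        return (tuple, output) -> output.accept(fn.apply(tuple));
    }

    /** Wires source -> filter -> enrich -> sink as a linear DAG and runs it. */
    static List<String> run(List<Integer> source) {
        Operator<Integer, Integer> evens = filter(n -> n % 2 == 0);
        Operator<Integer, String> label = enrich(n -> "even:" + n);
        List<String> sink = new ArrayList<>();
        for (Integer tuple : source) {
            // Each Consumer passed as `output` plays the role of a stream between two operators.
            evens.process(tuple, filtered -> label.process(filtered, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList(1, 2, 3, 4)));
    }
}
```

Because each operator only sees its input tuple and an output callback, the same operator logic could be placed in a different position in the graph without change, which is the property the DAG model relies on.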

[Figure: Directed Acyclic Graph (DAG). Operators exchange tuples over streams labeled Filtered Stream, Enriched Stream, and Output Stream.]

Application Specification

Partitioning and Scaling Out

• Operators can be dynamically scaled
• Flexible stream splits
• Parallel partitioning
• MxN partitioning
• Unifiers
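As a rough illustration of the partitioning and unifier bullets above, the sketch below splits a stream of words across N partitions by key hash, lets each partition count independently, and then merges the partial counts in a unifier step. All names (`PartitionSketch`, `partitionFor`, `countPerPartition`, `unify`) are hypothetical, and hash-based routing is an assumption made for this example; it is not meant to describe how Apex itself assigns tuples to partitions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative model only; names are hypothetical and are not the Apex API.
final class PartitionSketch {

    /** Routes a key to one of n partitions by hash, so equal keys always land in the same partition. */
    static int partitionFor(String key, int n) {
        return Math.floorMod(key.hashCode(), n);
    }

    /** Each partition counts only the words routed to it, independently of the others. */
    static List<Map<String, Integer>> countPerPartition(List<String> words, int n) {
        List<Map<String, Integer>> partitions = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partitions.add(new HashMap<>());
        }
        for (String word : words) {
            partitions.get(partitionFor(word, n)).merge(word, 1, Integer::sum);
        }
        return partitions;
    }

    /** The unifier merges the partial results of all partitions into a single result. */
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((key, count) -> merged.merge(key, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "a", "c", "b", "a");
        System.out.println(unify(countPerPartition(words, 3)));
    }
}
```

The unifier uses the same merge function as the partitions, so the final result does not depend on how many partitions were used.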

Advanced Windowing Support

• Application window
• Sliding window and tumbling window
• Checkpoint window
• No artificial latency
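The difference between tumbling and sliding windows can be shown with a small sketch. It assumes, as a simplification, that the stream has already been cut into small fixed-size windows, each summarized by one integer, and that a larger window spans `size` of them. The class and method names are hypothetical and are not part of the Apex API.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative model only; names are hypothetical and are not the Apex API.
final class WindowSketch {

    /** Tumbling: non-overlapping groups of `size` windows; a trailing partial group is dropped. */
    static List<Integer> tumblingSums(int[] perWindow, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= perWindow.length; start += size) {
            sums.add(sum(perWindow, start, size));
        }
        return sums;
    }

    /** Sliding: groups of `size` windows that advance by one window at a time, so they overlap. */
    static List<Integer> slidingSums(int[] perWindow, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= perWindow.length; start++) {
            sums.add(sum(perWindow, start, size));
        }
        return sums;
    }

    private static int sum(int[] values, int start, int size) {
        int total = 0;
        for (int i = start; i < start + size; i++) {
            total += values[i];
        }
        return total;
    }

    public static void main(String[] args) {
        int[] perWindow = {1, 2, 3, 4, 5, 6};
        System.out.println(tumblingSums(perWindow, 2)); // [3, 7, 11]
        System.out.println(slidingSums(perWindow, 2));  // [3, 5, 7, 9, 11]
    }
}
```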

Guarantees and Performance

Stateful Fault Tolerance

• Supported out of the box
– Application state
– Application master state
– No data loss
• Automatic recovery
• Lunch test
• Buffer server

Processing Semantics

• At least once
• At most once
• Exactly once

Data Locality

• Stream locality for placement of operators
– Rack local: distributed deployment
– Node local: data does not traverse NIC
– Container local: data doesn’t need to be serialized
– Thread local: operators run in same thread
• Data locality
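The three processing semantics listed on this slide differ in what a downstream consumer sees after a failure. The sketch below is a toy simulation under stated assumptions, not a description of Apex internals: windows are numbered from 1, state was checkpointed after window `checkpoint`, the operator fails after delivering window `failAt`, and in at-most-once mode it resumes at window `resumeAt`. Exactly-once is modeled here by replaying from the checkpoint and skipping window ids that were already delivered, which is one possible technique chosen for illustration. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy simulation only; names and the recovery model are assumptions, not the Apex API.
final class SemanticsSketch {

    enum Mode { AT_LEAST_ONCE, AT_MOST_ONCE, EXACTLY_ONCE }

    /** Returns the window ids a downstream consumer receives, in delivery order. */
    static List<Integer> deliveredWindows(Mode mode, int last, int checkpoint, int failAt, int resumeAt) {
        List<Integer> delivered = new ArrayList<>();
        // Before the failure, windows 1..failAt are delivered normally.
        for (int w = 1; w <= failAt; w++) {
            delivered.add(w);
        }
        switch (mode) {
            case AT_LEAST_ONCE:
                // Replay everything after the checkpoint: nothing is lost, some windows repeat.
                for (int w = checkpoint + 1; w <= last; w++) {
                    delivered.add(w);
                }
                break;
            case EXACTLY_ONCE:
                // Replay from the checkpoint, but skip windows that were already delivered.
                for (int w = checkpoint + 1; w <= last; w++) {
                    if (w > failAt) {
                        delivered.add(w);
                    }
                }
                break;
            case AT_MOST_ONCE:
                // No replay: windows between failAt and resumeAt are lost.
                for (int w = resumeAt; w <= last; w++) {
                    delivered.add(w);
                }
                break;
        }
        return delivered;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " " + deliveredWindows(mode, 6, 2, 4, 6));
        }
    }
}
```

With six windows, a checkpoint after window 2, a failure after window 4, and a resume at window 6, at-least-once delivers windows 3 and 4 twice, at-most-once loses window 5, and exactly-once delivers each window once.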

Dynamic Updates

• Dynamic topology updates
– Properties of operators can be changed
– New operators can be added

Data Processing Pipeline Example: App Builder

Data Processing Pipeline Example: Logical Plan

Data Processing Pipeline Example: Physical Plan

Data Processing Pipeline Example: Real Time Visualization

Resources

• Apache Apex Community Page - http://apex.incubator.apache.org/
• Apache Apex LinkedIn Group

End

Thank You!