real-time big data processing with datatorrent rts · apache apex unified batch and stream...

Post on 28-May-2020


Apache Apex Unified Batch and Stream Processing for Big Data

Milind Barve

Nov. 03, 2015

Project History

• Project development started in 2012 at DataTorrent

• Open-sourced in July 2015

• Apache Apex started incubation in August 2015

Project Status

Mentor List

• Ted Dunning: Apache Member, MapR

• Alan Gates: Apache Member, Hortonworks

• Taylor Goetz: Apache Member, Hortonworks

• Justin Mclean: Apache Member, Class Software

• Chris Nauroth: Apache Member, Hortonworks

• Hitesh Shah: Apache Member, Hortonworks

Apex In Apache Incubation Stage

Apache Apex (Incubating) Committer List

Over 50 committers already… and growing…

What we will serve you today…

– Batch & Streaming: two worlds collide?

– Apex Engine: all the nerdy features

– Questions, you still have some?

– Develop your first app on Apex…

[Figure: Lambda Architecture. Labels: master dataset, Batch Layer (batch view), Speed Layer (real time view), Serving Layer, query.]

[Figure: Apex Real-time Unified Architecture. Labels: master dataset, incremental dataset, Aggregate Layer, Incremental Layer, aggregate query, rolling query, Aggregate View, Incremental View.]

Apex Platform Overview Enterprise Edition

Apache Apex-Malhar

Directed Acyclic Graph (DAG)

Application Programming Model

• A Stream is a sequence of data tuples

• An Operator takes one or more input streams, performs computations & emits one or more output streams

• Each Operator is YOUR custom business logic in Java, or a built-in operator from our open source library

• An Operator can have many instances that run in parallel, and each instance is single-threaded

• Directed Acyclic Graph (DAG) is made up of operators and streams
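The bullets above can be illustrated with a toy model in plain Java. This is only a sketch of the idea (tuples flowing through single-threaded operators wired into a small graph) and does not use the Apex API; the names `MiniDag`, `Operator`, `SPLITTER` and `LENGTH` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Toy model of operators connected by streams; hypothetical names, not the Apex API. */
class MiniDag {
    /** An operator consumes one tuple at a time and may emit zero or more tuples. */
    interface Operator<I, O> {
        void process(I tuple, Consumer<O> output);
    }

    /** Splits a line into words and emits one tuple per word. */
    static final Operator<String, String> SPLITTER = (line, out) -> {
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.accept(word);
            }
        }
    };

    /** Emits the length of each incoming word. */
    static final Operator<String, Integer> LENGTH = (word, out) -> out.accept(word.length());

    /** Wires input -> SPLITTER -> LENGTH -> sink and pushes every tuple through. */
    static List<Integer> run(List<String> lines) {
        List<Integer> sink = new ArrayList<>();
        for (String line : lines) {
            SPLITTER.process(line, word -> LENGTH.process(word, sink::add));
        }
        return sink;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("apex unifies batch", "and streaming")));
    }
}
```

In a real application the platform, not a loop in `run`, would move tuples between operators and decide where each operator instance executes.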

[Figure: Application Programming Model. Labels: Operator (four shown), Tuple, Output Stream.]

[Figure: Apex Component Overview. Labels: Hadoop Edge Node, DT RTS Management Server, CLI, REST API, Hadoop Node, YARN Container, Apex App Master, Streaming Container, Thread1 to Thread-N, Op1, Op2, Op3. Legend: Part of Community Edition.]

Apex Engine

Core Features

Native Hadoop Integration

• YARN is the resource manager

• HDFS used for storing any persistent state

Partitioning and Scaling Out

Partitioning & Scaling built-in

• Operators can be statically/dynamically scaled

• Flexible Streams split

• Parallel partitioning

• MxN partitioning

• Unifiers
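As a rough illustration of partitions and unifiers, the plain-Java sketch below spreads tuples over several partitions, lets each partition compute a partial result, and then merges the partial results in a unifier step. The round-robin routing and the names `PartitionSketch`, `countPerPartition` and `unify` are assumptions chosen for the example, not Apex behavior or API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy partition-and-unify word count; hypothetical names, not the Apex API. */
class PartitionSketch {
    /** Routes tuples round-robin to n partitions; each partition counts the words it receives. */
    static List<Map<String, Integer>> countPerPartition(List<String> words, int n) {
        List<Map<String, Integer>> partials = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            partials.add(new HashMap<>());
        }
        for (int i = 0; i < words.size(); i++) {
            partials.get(i % n).merge(words.get(i), 1, Integer::sum);
        }
        return partials;
    }

    /** Unifier step: merges the partial counts from all partitions into one result. */
    static Map<String, Integer> unify(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> merged.merge(word, count, Integer::sum));
        }
        return merged;
    }

    public static void main(String[] args) {
        List<String> words = List.of("a", "b", "a", "c", "a");
        System.out.println(unify(countPerPartition(words, 3)));
    }
}
```

Because the same word can land in more than one partition here, no single partition holds the full count; the unifier is what makes the combined output correct.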

Advanced Windowing Support

• Application window

• Sliding window and tumbling window

• Checkpoint window

• No artificial latency
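The slide does not define tumbling and sliding windows, so the sketch below assumes the common meaning: a tumbling window aggregates consecutive, non-overlapping groups of values, while a sliding window aggregates overlapping groups and advances one step at a time. It is plain Java over a list of per-window values; `WindowSketch`, `tumbling` and `sliding` are illustrative names, not Apex API.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy window aggregation over a list of per-window values; hypothetical names, not the Apex API. */
class WindowSketch {
    /** Sums non-overlapping groups of `size` values; a trailing partial group is dropped. */
    static List<Integer> tumbling(List<Integer> values, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= values.size(); start += size) {
            sums.add(sum(values, start, size));
        }
        return sums;
    }

    /** Sums every run of `size` consecutive values, advancing one value at a time. */
    static List<Integer> sliding(List<Integer> values, int size) {
        List<Integer> sums = new ArrayList<>();
        for (int start = 0; start + size <= values.size(); start++) {
            sums.add(sum(values, start, size));
        }
        return sums;
    }

    private static int sum(List<Integer> values, int start, int size) {
        int total = 0;
        for (int i = start; i < start + size; i++) {
            total += values.get(i);
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> values = List.of(1, 2, 3, 4, 5, 6);
        System.out.println(tumbling(values, 3)); // [6, 15]
        System.out.println(sliding(values, 3));  // [6, 9, 12, 15]
    }
}
```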

Stateful Fault Tolerance

• Supported out of the box

– Application state

– Application master state

– No data loss

• Automatic recovery

• Lunch test

• Buffer server

Processing Semantics

• AT_LEAST_ONCE (default): Windows are processed at least once

• AT_MOST_ONCE: Windows are processed at most once

– During recovery, all downstream operators are fast-forwarded to the window of the latest checkpoint

• EXACTLY_ONCE: Windows are processed exactly once

– Checkpoint every window

– Checkpointing becomes blocking
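To make the three modes concrete, here is a deliberately simplified simulation in plain Java. It assumes a single operator that fails while processing window `failAt`, and counts how many times each window is fully processed. The recovery rules are assumptions made for illustration only: AT_LEAST_ONCE replays from the window after the last checkpoint, AT_MOST_ONCE skips the interrupted window, and EXACTLY_ONCE checkpoints every window. `SemanticsSketch` and `processedCounts` are illustrative names, not Apex API.

```java
import java.util.Arrays;

/** Toy recovery simulation for the three processing modes; simplified assumptions, not the Apex engine. */
class SemanticsSketch {
    enum Mode { AT_LEAST_ONCE, AT_MOST_ONCE, EXACTLY_ONCE }

    /**
     * Returns counts[w] = number of times window w (1-based) is fully processed
     * when the operator fails while processing window failAt and then recovers.
     */
    static int[] processedCounts(Mode mode, int totalWindows, int checkpointInterval, int failAt) {
        int[] counts = new int[totalWindows + 1]; // index 0 is unused
        // EXACTLY_ONCE is modeled as checkpointing after every window.
        int interval = (mode == Mode.EXACTLY_ONCE) ? 1 : checkpointInterval;

        // Before the failure, windows 1..failAt-1 complete; window failAt is interrupted.
        for (int w = 1; w < failAt; w++) {
            counts[w]++;
        }

        // Last checkpoint taken at or before the last completed window.
        int lastCheckpoint = ((failAt - 1) / interval) * interval;

        // AT_MOST_ONCE skips ahead past the interrupted window; the other modes replay.
        int resumeFrom = (mode == Mode.AT_MOST_ONCE) ? failAt + 1 : lastCheckpoint + 1;
        for (int w = resumeFrom; w <= totalWindows; w++) {
            counts[w]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        for (Mode mode : Mode.values()) {
            System.out.println(mode + " " + Arrays.toString(processedCounts(mode, 6, 3, 5)));
        }
    }
}
```

With six windows, a checkpoint every three windows and a failure in window 5, this model processes window 4 twice under AT_LEAST_ONCE, never completes window 5 under AT_MOST_ONCE, and processes every window once under EXACTLY_ONCE, at the cost of a checkpoint per window.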

Compute Locality

• Data locality

• Stream locality for placement of operators

– Rack local – Distributed deployment

– Node local – Data does not traverse NIC

– Container local – Data doesn’t need to be serialized

– Thread local – Operators run in same thread

Dynamic Updates

• Dynamic topology updates

– Properties of operators can be changed

– New operators

• Upcoming

– Update attributes

© 2014 DataTorrent Confidential – Do Not Distribute

For more Info …

• Mailing List: dev@apex.incubator.apache.org

• Apache Apex: http://apex.apache.org/

• GitHub

ᵒ Apex Core: http://github.com/apache/incubator-apex-core

ᵒ Apex Malhar: http://github.com/apache/incubator-apex-malhar

• DataTorrent: http://www.datatorrent.com

Thank You

Please send your questions to milind@datatorrent.com
