Introduction to Apache Apex - CODS 2016
TRANSCRIPT
Apache Apex - Stream Processing
● YARN-Native - Uses the Hadoop YARN framework for resource negotiation
● Highly Scalable - Scales statically as well as dynamically
● Highly Performant - Can reach single digit millisecond end-to-end latency
● Fault Tolerant - Automatically recovers from failures - without manual intervention
● Stateful - Guarantees that no state will be lost
● Easily Operable - Exposes an easy API for developing Operators (part of an
application) and Applications
Project History
● Project development started in 2012 at DataTorrent
● Open-sourced in July 2015
● Apache Apex started incubation in August 2015
● 50+ committers from Apple, GE, Capital One, DirecTV, Silver Spring Networks,
Barclays, Ampool and DataTorrent
● Mentors from Class Software, MapR and Hortonworks
● Soon to be a top level Apache project
Apex Platform Overview
An Apex Application is a DAG (Directed Acyclic Graph)
● A DAG is composed of vertices (Operators) and edges (Streams)
● A Stream is a sequence of data tuples that connects operators at end-points called Ports
● An Operator takes one or more input streams, performs computations and emits one or more output streams
● Each operator is either the user's business logic or a built-in operator from our open source library
● An operator may have multiple instances that run in parallel
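To make the DAG model concrete, here is a minimal Python sketch. It is not the Apex API (which is Java); the `Dag` class and its method names are hypothetical and only model operators as vertices and streams as edges, processing tuples in topological order.

```python
from collections import defaultdict, deque


class Dag:
    """Toy DAG: vertices are operators (plain functions), edges are streams."""

    def __init__(self):
        self.operators = {}               # name -> function(tuple) -> list of tuples
        self.streams = defaultdict(list)  # upstream name -> downstream names

    def add_operator(self, name, fn):
        self.operators[name] = fn

    def add_stream(self, upstream, downstream):
        self.streams[upstream].append(downstream)

    def topological_order(self):
        # Kahn's algorithm; fails if the graph contains a cycle.
        indegree = {name: 0 for name in self.operators}
        for downstreams in self.streams.values():
            for d in downstreams:
                indegree[d] += 1
        ready = deque(n for n, deg in indegree.items() if deg == 0)
        order = []
        while ready:
            n = ready.popleft()
            order.append(n)
            for d in self.streams[n]:
                indegree[d] -= 1
                if indegree[d] == 0:
                    ready.append(d)
        if len(order) != len(self.operators):
            raise ValueError("graph has a cycle, so it is not a DAG")
        return order

    def run(self, source, tuples):
        """Feed tuples into `source`; return what each operator emitted."""
        inbox = defaultdict(list)
        inbox[source] = list(tuples)
        emitted = {}
        for name in self.topological_order():
            out = []
            for t in inbox[name]:
                out.extend(self.operators[name](t))
            emitted[name] = out
            for d in self.streams[name]:
                inbox[d].extend(out)
        return emitted
```

For example, wiring `reader -> splitter -> upper` (a line reader, a word splitter and an upper-casing operator) and feeding two lines into `reader` produces the upper-cased words at the last operator. A real Apex application runs operators concurrently in separate containers; this sketch runs them one after another for clarity.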
Hadoop 1.0 vs 2.0 - YARN
Apex as a YARN Application
● YARN (Hadoop 2.0) moves resource management out of MapReduce into
a more generic Resource Management Framework.
● Apex uses YARN for resource management
and HDFS for persistent storage
Support for Windowing
● Apex splits incoming tuples into finite time slices - Streaming Windows
○ Transparent to the user
○ Apex Default = 500 ms
● Checkpointing and book-keeping done at Streaming window boundary
● Applications may need to perform computations in windows - Application Windows
○ Specified as a multiple of Streaming Window size
○ Callbacks to user operator logic
■ beginWindow(long windowId)
■ endWindow()
○ Example - An application which computes some aggregates and emits them every minute. Here
application window size = 60 secs = 120 Streaming Windows at the default of 500 ms
● Sliding and Tumbling Application windows are supported natively
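The relation between the two window types is simple arithmetic, sketched below in Python. This is an illustration, not Apex code: `MinuteSum` and `run_tumbling` are hypothetical names that only mimic the `beginWindow` / `endWindow` callback pattern. With a 500 ms streaming window a 60 second application window spans 120 streaming windows; with a 2 second streaming window it spans 30.

```python
STREAMING_WINDOW_MS = 500  # Apex default stated in the slides


def streaming_windows_per_app_window(app_window_ms,
                                     streaming_window_ms=STREAMING_WINDOW_MS):
    """An application window must be a whole multiple of the streaming window."""
    if app_window_ms % streaming_window_ms != 0:
        raise ValueError("application window must be a multiple of the streaming window")
    return app_window_ms // streaming_window_ms


class MinuteSum:
    """Hypothetical operator: sums values and emits once per application window."""

    def __init__(self):
        self.total = 0
        self.emitted = []

    def begin_window(self, window_id):
        self.total = 0              # reset the aggregate at the window start

    def process(self, value):
        self.total += value

    def end_window(self):
        self.emitted.append(self.total)  # emit the aggregate at the window end


def run_tumbling(operator, values_per_streaming_window, windows_per_app_window):
    """Invoke the callbacks once per (tumbling) application window."""
    for i, values in enumerate(values_per_streaming_window):
        if i % windows_per_app_window == 0:
            operator.begin_window(i // windows_per_app_window)
        for v in values:
            operator.process(v)
        if (i + 1) % windows_per_app_window == 0:
            operator.end_window()
    return operator.emitted
```

The sketch covers tumbling windows only; a sliding window would keep the per-streaming-window partial results and combine the most recent ones at each boundary.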
Buffer Server
● Staging area for outgoing tuples
● Downstream operators connect to upstream Buffer Server to subscribe for tuples
● Plays a role in recovery by replaying data to the downstream operator from a
particular checkpoint
● Spooling to disk is also supported
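The replay role of the Buffer Server can be pictured with a small Python model. This is a sketch under assumptions, not the actual implementation: the class and method names are hypothetical, everything is kept in memory, and the caller is assumed to ask for the first window after its checkpoint.

```python
class BufferServer:
    """Toy staging area: keeps published tuples grouped by window id."""

    def __init__(self):
        self._windows = {}  # window_id -> list of tuples published in that window

    def publish(self, window_id, tup):
        """Called by the upstream operator for every outgoing tuple."""
        self._windows.setdefault(window_id, []).append(tup)

    def replay_from(self, window_id):
        """Yield (window_id, tuple) for every buffered window >= window_id, in order."""
        for wid in sorted(self._windows):
            if wid >= window_id:
                for tup in self._windows[wid]:
                    yield wid, tup
```

A downstream operator restored from a checkpoint taken at window 1 would call `replay_from(2)` and receive the tuples of windows 2, 3, ... in their original window order.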
Fault Tolerance - Checkpointing
● During checkpointing all operator state is written to HDFS asynchronously
● This is decentralized and happens independently for each operator
● If all operators in the DAG have checkpointed a particular window, then that window
is said to be committed and all previous checkpoints are purged
Operator              O1   O2   O3   O4
Checkpoint #           3    3    3    2
Checkpoint Window #  180  180  180  120
Committed Checkpoint # = 2
Committed Window # = 120
Checkpoint Window = 60 Streaming Windows
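The committed window in the figure follows from a simple rule: it is the latest window that every operator has checkpointed, i.e. the minimum of the operators' latest checkpoint windows. A minimal Python sketch of that rule (function names are hypothetical, not part of Apex):

```python
def committed_window(latest_checkpoint_window):
    """latest_checkpoint_window maps operator name -> window id of its latest checkpoint."""
    if not latest_checkpoint_window:
        raise ValueError("need at least one operator")
    # A window is committed only once ALL operators have checkpointed it.
    return min(latest_checkpoint_window.values())


def purgeable_checkpoints(checkpoint_windows, committed):
    """Checkpoints older than the committed one are no longer needed."""
    return [w for w in checkpoint_windows if w < committed]
```

With the figure's values (O1, O2 and O3 at window 180, O4 at window 120) the committed window is 120, which is checkpoint number 2 at 60 streaming windows per checkpoint.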
Recovery
● Apex Application Master detects the failure
of an operator based on missing heartbeats
from the operator or if windows are
not progressing
● All operators downstream of the failed
operator are restarted from the last
committed checkpoint to recover
their state.
● Data is replayed from the same checkpoint
by the Buffer Server
● Recovery is automatic and does not require
manual intervention.
Scalability - Partitioning
● Operators can be “replicated” (partitioned) into
multiple instances to cope with high speed
input streams.
● Can be specified at Application launch time
● User can control the distribution of tuples to
downstream partitions.
● A Unifier is added automatically to merge the tuples from the partitions
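The partition-then-unify pattern can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, the key-hash distribution is just one possible way to spread tuples, and the partitions run sequentially here instead of in parallel.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Stable key -> partition index (crc32 is deterministic across runs)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def count_words(words):
    """Per-partition business logic: a partial word count."""
    return Counter(words)


def unify(partials):
    """Unifier: merge the partial results of all partitions into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def partitioned_word_count(words, num_partitions):
    # Distribute tuples so that equal keys always reach the same partition.
    buckets = [[] for _ in range(num_partitions)]
    for w in words:
        buckets[partition_for(w, num_partitions)].append(w)
    return unify(count_words(b) for b in buckets)
```

The unified result is the same regardless of the number of partitions, which is the property the Unifier has to preserve.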
Scalability - Dynamic scaling
● Auto scaling is also supported. The number of partitions may automatically increase or
decrease based on the incoming load. This can be customized by the user
● User has to define the trigger for auto scaling:
○ Example - Increase partitions if latency goes above 100 ms.
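A user-defined trigger of this kind might look like the following Python sketch. Only the 100 ms threshold comes from the example above; the function name, the scale-down threshold and the partition bounds are assumptions made for illustration.

```python
def next_partition_count(current, latency_ms,
                         high_ms=100.0,   # from the example: scale up above 100 ms
                         low_ms=20.0,     # assumed scale-down threshold
                         min_partitions=1,
                         max_partitions=16):
    """Return the partition count to use after observing the given latency."""
    if latency_ms > high_ms:
        return min(current + 1, max_partitions)
    if latency_ms < low_ms:
        return max(current - 1, min_partitions)
    return current
```

Keeping a gap between the two thresholds avoids flapping between partition counts when latency hovers around a single limit.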
Apex Processing Semantics
● AT_LEAST_ONCE (default): Windows are processed at least once
● AT_MOST_ONCE: Windows are processed at most once
○ During recovery, all downstream operators are fast-forwarded to the window of latest checkpoint
● EXACTLY_ONCE: Windows are processed exactly once
○ Checkpoint every window
○ Checkpointing becomes blocking
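One way to compare the three modes is by which windows an operator has to process again after a failure. The Python sketch below is a simplification, not Apex code: the function name and window ids are made up, and for EXACTLY_ONCE it assumes a checkpoint exists for the window right before the failed one.

```python
def windows_to_replay(mode, last_checkpoint_window, failed_window):
    """Windows the restarted operator processes again, given where it failed."""
    if mode == "AT_LEAST_ONCE":
        # Restore the checkpoint, then replay everything after it.
        # Windows finished before the failure are therefore processed twice.
        return list(range(last_checkpoint_window + 1, failed_window + 1))
    if mode == "AT_MOST_ONCE":
        # Nothing is replayed; the operator is fast-forwarded and the
        # windows in between are skipped.
        return []
    if mode == "EXACTLY_ONCE":
        # Assumption: every window is checkpointed, so only the window
        # that was in flight at failure time has to be redone.
        return [failed_window]
    raise ValueError("unknown processing mode: " + mode)
```

For a checkpoint at window 120 and a failure in window 123, AT_LEAST_ONCE replays windows 121 to 123, AT_MOST_ONCE replays none, and EXACTLY_ONCE redoes only window 123.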
Apex Guarantees
● Apex guarantees no loss of data and computational state - state is checkpointed
periodically
● Automatic recovery ensures that processing resumes from where it left off
● Order of incoming data is guaranteed to be maintained
○ Not applicable in case of partitioning of operators
● Events in a window are always replayed in the same window in case of failures
Application Specification
1. Add Operators
2. Add Streams
Logical and Physical DAGs
Apex Malhar Library
Decision Making in < 2ms
1. Performance requirements
a. A system which can provide very low latency for decision making (40 ms)
b. Ability to handle large volumes of data and ever-changing rules (1,000 events per 20 ms burst)
c. 99.5% uptime, which is about 1.8 days of downtime in a year
➔ Apex achieved:
◆ 2 ms latency against the requirement of 40 ms
◆ Was able to handle 2,000 events burst against the requirement of 1,000 events burst, at a net rate of 70,000 events/s
◆ 99.9995% uptime against the requirement of 99.5% uptime
2. Relevant Roadmap
3. Enterprise grade
4. Have a healthy and diverse community and committers, i.e. not controlled by one vendor
Talk Slides: http://www.slideshare.net/ilganeli/nextgen-decision-making-in-under-2ms
DataTorrent Blog: https://www.datatorrent.com/blog/next-gen-decision-making-in-under-2-milliseconds/
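The uptime figures translate into allowed downtime with simple arithmetic, sketched here in Python (the function name and the 365-day year are assumptions). At 99.5% uptime the allowance is 0.5% of a year, roughly 1.8 days; at 99.9995% it is under three minutes.

```python
def downtime_days_per_year(uptime_percent, days_per_year=365):
    """Allowed downtime in days for a given uptime percentage."""
    return (100.0 - uptime_percent) / 100.0 * days_per_year


def downtime_minutes_per_year(uptime_percent, days_per_year=365):
    """Same allowance expressed in minutes."""
    return downtime_days_per_year(uptime_percent, days_per_year) * 24 * 60
```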
Decision making in < 2ms (contd.)
● Comparison finally boiled down to
○ Apache Storm
○ Apache Flink
○ Apache Apex
● Some problems in Storm and Flink among others
○ Nimbus is a single point of failure
○ Bolts / Spouts / Operators share a JVM. Hard to debug
○ No dynamic topologies
○ Restarting entire topologies in case of failures
Resources
● Mailing List
○ Developers [email protected]
○ Users [email protected]
● Apache Apex http://apex.apache.org/
● Github
○ Apex Core: http://github.com/apache/incubator-apex-core
○ Apex Malhar: http://github.com/apache/incubator-apex-malhar
● DataTorrent: http://www.datatorrent.com
● Twitter - Follow @ApacheApex: https://twitter.com/apacheapex
● Facebook https://www.facebook.com/ApacheApex/
● Meetup http://www.meetup.com/topics/apache-apex
● Startup Program - Free Enterprise License for Startups, Universities, Non-Profits