
Università degli Studi di Roma "Tor Vergata"
Dipartimento di Ingegneria Civile e Ingegneria Informatica

DSP Frameworks

Course: Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), A.Y. 2017/18

Valeria Cardellini

DSP frameworks we consider
• Apache Storm (with lab)
• Twitter Heron
  – Also from Twitter, and compatible with Storm
• Apache Spark Streaming (with lab)
  – Discretizes each stream into small batches and processes them (micro-batch processing)
• Apache Flink
• Apache Samza
• Cloud-based frameworks
  – Google Cloud Dataflow
  – Amazon Kinesis Streams


Apache Storm
• Open-source, real-time, scalable streaming system
• Provides an abstraction layer to execute DSP applications
• Initially developed by Twitter
• Topology
  – DAG of spouts (sources of streams) and bolts (operators and data sinks)
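As a concrete sketch of how such a topology is wired (assuming hypothetical SentenceSpout, SplitSentenceBolt, and WordCountBolt implementations and the org.apache.storm API of Storm 1.x):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // spout: the source of the stream (hypothetical class)
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // bolt: splits sentences into words; input tuples are partitioned randomly
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");
        // bolt: counts words; tuples with the same "word" value reach the same task
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));
        // run locally for testing
        new LocalCluster().submitTopology("word-count", new Config(),
                builder.createTopology());
    }
}

The shuffle and fields groupings used here are exactly the stream groupings discussed next.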


Stream grouping in Storm
• Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
• Shuffle grouping
  – Randomly partitions the tuples
• Fields grouping
  – Hashes on a subset of the tuple attributes


Stream grouping in Storm
• All grouping (i.e., broadcast)
  – Replicates the entire stream to all the consumer tasks
• Global grouping
  – Sends the entire stream to a single task of a bolt
• Direct grouping
  – The producer of the tuple decides which task of the consumer will receive the tuple
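In the Storm API, a grouping is declared when a bolt subscribes to a stream; a sketch reusing the builder from the word-count example above (bolt classes hypothetical):

// replicate every tuple of "split" to all tasks of the metrics bolt
builder.setBolt("metrics", new MetricsBolt(), 3)
       .allGrouping("split");

// send the entire stream to a single task of the ranker bolt
builder.setBolt("ranker", new RankerBolt())
       .globalGrouping("count");

// the producer picks the consumer task explicitly via emitDirect(taskId, ...)
builder.setBolt("router", new RouterBolt(), 2)
       .directGrouping("split");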


Storm architecture


•  Master-worker architecture

Storm components: Nimbus and ZooKeeper
• Nimbus
  – The master node
  – Clients submit topologies to it
  – Responsible for distributing and coordinating the topology execution
• ZooKeeper
  – Nimbus uses a combination of the local disk(s) and ZooKeeper to store state about the topology


Storm components: worker

[Figure: a worker process (Java process) runs one or more executors (threads), each executing one or more tasks]

• Task: operator instance
  – The actual work for a bolt or a spout is done in the task
• Executor: smallest schedulable entity
  – Executes one or more tasks related to the same operator
• Worker process: Java process running one or more executors
• Worker node: computing resource, a container for one or more worker processes
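These levels map onto configuration knobs; a sketch, assuming the word-count topology above:

Config conf = new Config();
conf.setNumWorkers(2);  // 2 worker processes (JVMs) host the topology

// parallelism hint = 4 executors (threads) for this bolt,
// running 8 tasks in total, i.e., 2 task instances per executor
builder.setBolt("count", new WordCountBolt(), 4)
       .setNumTasks(8)
       .fieldsGrouping("split", new Fields("word"));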


Storm components: supervisor
• Each worker node runs a supervisor
• The supervisor:
  – receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
  – sends a periodic heartbeat to Nimbus (through ZooKeeper)
  – advertises the topologies that it is currently running, and any vacancies available to run more topologies


Twitter Heron
• Real-time, distributed, fault-tolerant stream processing engine from Twitter
• Developed as the direct successor of Storm
  – Released as open source in 2016: https://twitter.github.io/heron/
  – De facto stream data processing engine inside Twitter
• Goal of overcoming Storm's performance, reliability, and other shortcomings
• Compatibility with Storm
  – API compatible with Storm: no code change is required for migration


Heron: in common with Storm
• Same terminology as Storm
  – Topology, spout, bolt
• Same stream groupings
  – Shuffle, fields, all, global
• Example: WordCount topology (expressed with the same code as in Storm; see the sketch in the Apache Storm section)


Heron: design goals
• Isolation
  – Process-based topologies rather than thread-based
  – Each process runs in isolation (easy debugging, profiling, and troubleshooting)
• Resource constraints
  – Safe to run in shared infrastructure: topologies use only their initially allocated resources and never exceed those bounds
• Compatibility
  – Fully API and data model compatible with Storm


Heron: design goals
• Backpressure
  – Built-in rate control mechanism to ensure that topologies can self-adjust when components lag
  – Heron dynamically adjusts the rate at which data flows through the topology using backpressure
• Performance
  – Higher throughput and lower latency than Storm
  – Enhanced configurability to fine-tune potential latency/throughput trade-offs
• Semantic guarantees
  – Support for both at-most-once and at-least-once processing semantics
• Efficiency
  – Minimum possible resource usage


Heron topology architecture
• Master-worker architecture
• One Topology Master (TM)
  – Manages a topology throughout its entire lifecycle
• Multiple Containers
  – Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager
  – A Heron Instance is a process that handles a single task of a spout or bolt
  – Containers communicate with the TM to ensure that the topology forms a fully connected graph



Heron topology architecture
• Stream Manager (SM): routing engine for data streams
  – Each Heron Instance connects to its local SM, while all of the SMs in a given topology connect to one another to form a network
  – Responsible for propagating backpressure

Topology submit sequence

[Figure: topology submit sequence]

Self-adaptation in Heron
• Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
• Phases in Dhalion:
  – Symptom detection (backpressure, skew, …)
  – Diagnosis generation
  – Resolution
• Adaptation actions: parallelism changes

Heron environment
• Heron supports deployment on Apache Mesos
• Can also run on Mesos using Apache Aurora as a scheduler, or using a local scheduler

Batch processing vs. stream processing
• Batch processing is just a special case of stream processing

Batch processing vs. stream processing
• Batched/stateless: scheduled in batches
  – Short-lived tasks (Hadoop, Spark)
  – Distributed streaming over batches (Spark Streaming)
• Dataflow/stateful: continuous, scheduled once (Storm, Flink, Heron)
  – Long-lived task execution
  – State is kept inside tasks

Native vs. non-native streaming

[Figure: native vs. non-native streaming]

Apache Flink
• Distributed data flow processing system
• One common runtime for DSP applications and batch processing applications
  – Batch processing applications run efficiently as special cases of DSP applications
• Integrated with many other projects in the open-source data processing ecosystem
• Derives from the Stratosphere project by TU Berlin, Humboldt University, and the Hasso Plattner Institute
• Supports a Storm-compatible API

Flink: software stack

[Figure: Flink software stack]

• On top: libraries with high-level APIs for different use cases, still in beta

Flink: programming model
• Data stream
  – An unbounded, partitioned, immutable sequence of events
• Stream operators
  – Stream transformations that generate new output data streams from input ones

Flink: some features
• Supports stream processing and windowing with event-time semantics
  – Event time makes it easy to compute over streams where events arrive out of order or delayed
• Exactly-once semantics for stateful computations
• Highly flexible streaming windows

Flink: some features
• Continuous streaming model with backpressure
• Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources

Flink: APIs and libraries
• Streaming data applications: DataStream API
  – Supports functional transformations on data streams, with user-defined state and flexible windows
  – Example: how to compute a sliding histogram of word occurrences over a data stream of texts

WindowWordCount in Flink's DataStream API
• Sliding time window of 5 s length and 1 s trigger interval
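A minimal sketch of such a WindowWordCount, assuming a socket text source on localhost:9999 and the Flink 1.x DataStream API of that period (timeWindow with size and slide):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)
            // tokenize each line into (word, 1) pairs
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    out.collect(new Tuple2<>(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)                                      // partition by word
            .timeWindow(Time.seconds(5), Time.seconds(1))  // 5 s window, 1 s slide
            .sum(1);                                       // count per window

        counts.print();
        env.execute("WindowWordCount");
    }
}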

Flink: APIs and libraries
• Batch processing applications: DataSet API
  – Supports a wide range of data types beyond key/value pairs, and a wealth of operators

Core loop of the PageRank algorithm for graphs
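A sketch of that core loop using the DataSet API's bulk iterations, assuming pagesWithRanks holds (pageId, rank) tuples, adjacency holds (pageId, neighbor ids) tuples, and DAMPING, numPages, and maxIterations are given:

// iterate the rank computation up to maxIterations times
IterativeDataSet<Tuple2<Long, Double>> iteration =
        pagesWithRanks.iterate(maxIterations);

DataSet<Tuple2<Long, Double>> newRanks = iteration
        // join each page's current rank with its outgoing-edge list
        .join(adjacency).where(0).equalTo(0)
        // distribute the rank evenly over the page's neighbors
        .flatMap((Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>> joined,
                  Collector<Tuple2<Long, Double>> out) -> {
            Long[] neighbors = joined.f1.f1;
            double share = joined.f0.f1 / neighbors.length;
            for (Long n : neighbors) out.collect(new Tuple2<>(n, share));
        })
        .returns(Types.TUPLE(Types.LONG, Types.DOUBLE))
        // sum the partial ranks arriving at each page
        .groupBy(0).aggregate(Aggregations.SUM, 1)
        // apply the damping factor
        .map(rank -> new Tuple2<>(rank.f0,
                DAMPING * rank.f1 + (1 - DAMPING) / numPages))
        .returns(Types.TUPLE(Types.LONG, Types.DOUBLE));

// feed the new ranks back in; the loop stops after maxIterations
DataSet<Tuple2<Long, Double>> finalRanks = iteration.closeWith(newRanks);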

Flink: program optimization
• Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and to decide when intermediate data should be cached

Flink: control events
• Control events: special events injected into the data stream by operators
• Periodically, the data source injects checkpoint barriers into the data stream, dividing the stream into a pre-checkpoint and a post-checkpoint part
  – More coarse-grained approach than Storm: acks sequences of records instead of individual records
• Watermarks signal the progress of event time within a stream partition
• Flink does not provide ordering guarantees after any form of stream repartitioning or broadcasting
  – Dealing with out-of-order records is left to the operator implementation
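A sketch of how an application turns these control events on in the Flink 1.x APIs, assuming env is the StreamExecutionEnvironment, events is an input DataStream, and the Event type and 5-second out-of-orderness bound are hypothetical:

// inject checkpoint barriers at the sources every 10 seconds
env.enableCheckpointing(10_000);

// let event time (i.e., watermarks), not arrival time, drive windowing
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// emit watermarks that tolerate up to 5 seconds of out-of-order events
DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Event e) {
                return e.timestamp;  // event time carried in the record
            }
        });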

Flink: fault tolerance
• Based on Chandy-Lamport distributed snapshots
• Lightweight mechanism
  – Allows maintaining high throughput rates while providing strong consistency guarantees at the same time

Flink: performance and memory management
• High performance and low latency
• Memory management
  – Flink implements its own memory management inside the JVM

Flink: architecture
• The usual master-worker architecture

Flink: architecture
• Master (Job Manager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.
• Workers (Task Managers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams
  – Workers use task slots to control the number of tasks they accept
  – Each task slot represents a fixed subset of the worker's resources

Flink: application execution
• Jobs are expressed as data flows
• The job graph is transformed into the execution graph
• The execution graph contains the information needed to schedule and execute a job

Flink: infrastructure
• Designed to run on large-scale clusters with many thousands of nodes
• Provides support for YARN and Mesos

A recent need
• A common need for many companies
  – Run both batch and stream processing
• Alternative solutions
  – Lambda architecture
  – Unified frameworks: Spark, Flink, Samza
  – Unified programming model: Apache Beam

Lambda architecture
• Data-processing design pattern to integrate batch and real-time processing
• A streaming framework is used to process real-time events while, in parallel, a batch framework (e.g., Hadoop/Spark) is deployed to process the entire dataset (perhaps periodically)
• Results from the two parallel pipelines are then merged

Source: https://voltdb.com/products/alternatives/lambda-architecture

Lambda architecture: example
• Lambda architecture used at LinkedIn before Samza development

Lambda architecture
• Pros:
  – Flexibility in the choice of frameworks
• Cons:
  – Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone
  – Overhead of developing and managing multiple source codes
    • The logic in each fork evolves over time, and keeping the forks in sync involves duplicated and complex manual effort, often in different languages

Unified frameworks
• Use a unified (Lambda-less) design for processing both real-time and batch data using the same data structure
• Spark, Flink, and Samza follow this trend

Apache Beam
• A new layer of abstraction
• Provides an advanced unified programming model
  – Allows defining batch and streaming data processing pipelines that run on any supported execution engine (for now: Flink, Spark, Google Cloud Dataflow)
  – Java, Python, and Go as programming languages
• Translates the data processing pipeline defined by the user in the Beam program into the API of the chosen distributed processing engine
• Developed by Google and released as an open-source top-level Apache project
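A minimal Beam pipeline sketch in Java (file names hypothetical); the same code runs on any supported engine, selected via the --runner option (e.g., --runner=FlinkRunner or --runner=DataflowRunner):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
    public static void main(String[] args) {
        // the runner is chosen via command-line options, not in the code
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via(line -> Arrays.asList(line.split("\\W+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("word-counts"));

        p.run().waitUntilFinish();
    }
}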


Apache Samza
• A distributed system for stateful and fault-tolerant stream processing
  – A unified framework for batch and stream processing: similarly to Flink, streams are first-class citizens and batch is a special case of streaming
  – In production at LinkedIn since 2014
• Uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management

Apache Samza
• Why stateful and fault-tolerant processing? User profiles, email digests, aggregate counts, …
• Example: Email Digestion System
  – Production application running at LinkedIn, using Samza to digest updates into one email

Apache Samza: features
• Provides distributed and scalable data processing with:
  – Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, AWS Kinesis)
  – Efficient state management: local state (in memory or on disk) partitioned among tasks (rather than using a remote data store) and incremental checkpointing (more efficient than full-state checkpointing)
  – Unified processing API for stream and batch processing
    • Differently from Flink
  – Flexible deployment models

Apache Samza: architecture
• Samza is made up of three layers:
  – A streaming layer: Kafka
  – An execution layer: YARN
  – A processing layer: the Samza API

Apache Samza: architecture
• Samza, YARN, and Kafka interaction
  – RM: YARN ResourceManager
  – NM: YARN NodeManager
  – AM: ApplicationMaster
• The Samza client uses YARN to run a Samza job
• YARN starts and supervises one or more SamzaContainers; the processing code (using the StreamTask API) runs inside those containers
• The input and output for the Samza StreamTasks come from Kafka brokers

Apache Samza: running an application
• Example: count the number of page views aggregated by user ID
• Two Samza jobs: one to group messages by user ID, the other to do the counting

Apache Samza: state management
• Common approach (e.g., in Storm) to deal with large amounts of state: use a remote data store (e.g., Redis)
• Samza approach: keep state local to each node and make it robust to machine failures by replicating state changes across multiple machines

Apache Samza: high-level API
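A minimal sketch using the Samza 0.13-era high-level API (the stream names, the string-valued messages, and the filtering predicate are hypothetical):

import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.operators.StreamGraph;

public class PageViewFilter implements StreamApplication {
    @Override
    public void init(StreamGraph graph, Config config) {
        // input: page-view events from a Kafka topic
        MessageStream<String> pageViews =
                graph.getInputStream("page-views", (String k, String v) -> v);

        // output: keyed by the message itself (illustrative only)
        OutputStream<String, String, String> filtered =
                graph.getOutputStream("filtered-page-views", m -> m, m -> m);

        // drop events from invalid users, forward the rest
        pageViews.filter(msg -> !msg.contains("invalid-user"))
                 .sendTo(filtered);
    }
}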


Apache Samza: example application
• Samza job to find the top-k trending tags
  – DAG for the logical representation of the application


Towards strict delivery guarantees
• Most frameworks provide at-least-once delivery guarantees (e.g., Storm, Samza)
  – For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results
• Flink, Storm with Trident, and Google's MillWheel offer stronger delivery guarantees (i.e., exactly-once)
  – Exactly-once, low-latency stream processing in MillWheel works as follows:
    • The record is checked against de-duplication data from previous deliveries; duplicates are discarded
    • User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
    • Pending changes are committed to the backing store
    • Senders are acked
    • Pending downstream productions are sent

Source: MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.

DSP in the Cloud
• Data streaming systems are also offered as Cloud services
  – Amazon Kinesis Data Streams
  – Google Cloud Dataflow
  – IBM Streaming Analytics
  – Microsoft Azure Stream Analytics
• Abstract the underlying infrastructure and support dynamic scaling of the computing resources
• Appear to execute in a single data center

Google Cloud Dataflow
• Fully managed data processing service, supporting both stream and batch execution of pipelines
  – Transparently handles resource lifetime and can dynamically provision resources to minimize latency while maintaining high utilization efficiency
  – On-demand and auto-scaling
• Provides a unified programming model and a managed service for developing and executing a wide range of data processing patterns, including ETL, batch computation, and continuous computation
  – Apache Beam model
• Provides exactly-once processing
  – MillWheel is Google's internal version of Cloud Dataflow

Google Cloud Dataflow
• Seamlessly integrates with other Google cloud services
  – Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery
• Apache Beam SDKs, available in Java and Python
  – Enable developers to implement custom extensions and choose other execution engines

Amazon Kinesis Data Streams
• Allows building applications that process and analyze streaming data

Kinesis Data Streams: components
• Data record
  – Unit of data stored by Kinesis
• Stream: group of data records
  – Data producers write data records to Kinesis streams
  – Data records in the stream are distributed into shards
• Shard: sequence of data records in a stream
  – The number of shards in a stream is specified when the stream is created
    • It can then be increased or decreased as needed, but only manually
  – Base unit of capacity: up to 1 MB/s of written data and 2 MB/s of read data
  – A partition key is used to group data by shard within a stream
  – Shards are also the unit of service pricing: http://amzn.to/2szRTkG
  – Data records are stored in shards temporarily (24 hours by default)
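For instance, a producer using the AWS SDK for Java v1 writes a record under a partition key, which is hashed to pick the target shard (the stream name, userId, and payloadBytes are hypothetical):

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

PutRecordRequest request = new PutRecordRequest()
        .withStreamName("page-views")
        .withPartitionKey(userId)               // hashed to pick the shard
        .withData(ByteBuffer.wrap(payloadBytes));

kinesis.putRecord(request);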

Kinesis Data Streams: consuming data
• Kinesis Data Streams is used to capture streaming data
• An application reads data from a Kinesis stream as data records, then uses the Kinesis Client Library (KCL) for the processing logic
  – The KCL takes care of: load balancing across multiple EC2 instances, responding to instance failures, checkpointing processed records, and reacting to re-sharding (which adjusts the number of shards)
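A sketch of the processing hook the KCL expects, using the v1 client library's IRecordProcessor interface (the per-record logic is left as a comment):

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

public class PageViewProcessor implements IRecordProcessor {
    @Override
    public void initialize(InitializationInput input) {
        // called once per shard lease, before any records arrive
    }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        for (Record record : input.getRecords()) {
            // application-specific processing of each data record
        }
        try {
            // checkpoint so the KCL resumes here after a failure or re-shard
            input.getCheckpointer().checkpoint();
        } catch (Exception e) {
            // retry or log; checkpointing can fail during shutdown
        }
    }

    @Override
    public void shutdown(ShutdownInput input) {
        // called when the lease is lost or the shard is closed
    }
}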


References
• Wingerath et al., "Real-time stream processing for Big Data" (overview of DSP frameworks), 2016. http://bit.ly/2rUhXXG
• Kulkarni et al., "Twitter Heron: Stream processing at scale", ACM SIGMOD '15. http://bit.ly/2rUXkux
• Carbone et al., "Apache Flink: Stream and batch processing in a single engine", Bulletin of the IEEE Comp. Soc. Tech. Comm. on Data Engineering, 2015. http://bit.ly/2sYzoGb
• Noghabi et al., "Samza: Stateful scalable stream processing at LinkedIn", Proc. VLDB Endowment, 2017. https://bit.ly/2LushvF
• Akidau et al., "MillWheel: Fault-tolerant stream processing at Internet scale", VLDB '13. http://bit.ly/2rE7Fa3
