
Università degli Studi di Roma "Tor Vergata"
Dipartimento di Ingegneria Civile e Ingegneria Informatica

DSP Frameworks

Course: Sistemi e Architetture per Big Data (Systems and Architectures for Big Data), A.Y. 2017/18

Valeria Cardellini

DSP frameworks we consider
• Apache Storm (with lab)
• Twitter Heron
  – Also from Twitter, and compatible with Storm
• Apache Spark Streaming (with lab)
  – Discretizes each stream into small batches and processes them (micro-batch processing)
• Apache Flink
• Apache Samza
• Cloud-based frameworks
  – Google Cloud Dataflow
  – Amazon Kinesis Streams


Apache Storm
• Open-source, real-time, scalable streaming system
• Provides an abstraction layer to execute DSP applications
• Initially developed by Twitter
• Topology
  – DAG of spouts (sources of streams) and bolts (operators and data sinks)
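As a concrete sketch of how such a topology is wired (assuming hypothetical SentenceSpout, SplitSentenceBolt, and WordCountBolt implementations and the org.apache.storm API of Storm 1.x):

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class WordCountTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();
        // spout: the source of the stream (hypothetical class)
        builder.setSpout("sentences", new SentenceSpout(), 2);
        // bolt: splits sentences into words; input tuples are partitioned randomly
        builder.setBolt("split", new SplitSentenceBolt(), 4)
               .shuffleGrouping("sentences");
        // bolt: counts words; tuples with the same "word" value reach the same task
        builder.setBolt("count", new WordCountBolt(), 4)
               .fieldsGrouping("split", new Fields("word"));
        // run locally for testing
        new LocalCluster().submitTopology("word-count", new Config(),
                builder.createTopology());
    }
}

The shuffle and fields groupings used here are exactly the stream groupings discussed next.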


Stream grouping in Storm
• Data parallelism in Storm: how are streams partitioned among multiple tasks (threads of execution)?
• Shuffle grouping
  – Randomly partitions the tuples
• Fields grouping
  – Hashes on a subset of the tuple attributes


Stream grouping in Storm
• All grouping (i.e., broadcast)
  – Replicates the entire stream to all the consumer tasks
• Global grouping
  – Sends the entire stream to a single task of a bolt
• Direct grouping
  – The producer of the tuple decides which task of the consumer will receive the tuple
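In the Storm API, a grouping is declared when a bolt subscribes to a stream; a sketch reusing the builder from the word-count example above (bolt classes hypothetical):

// replicate every tuple of "split" to all tasks of the metrics bolt
builder.setBolt("metrics", new MetricsBolt(), 3)
       .allGrouping("split");

// send the entire stream to a single task of the ranker bolt
builder.setBolt("ranker", new RankerBolt())
       .globalGrouping("count");

// the producer picks the consumer task explicitly via emitDirect(taskId, ...)
builder.setBolt("router", new RouterBolt(), 2)
       .directGrouping("split");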


Storm architecture


•  Master-worker architecture

Storm components: Nimbus and ZooKeeper
• Nimbus
  – The master node
  – Clients submit topologies to it
  – Responsible for distributing and coordinating the topology execution
• ZooKeeper
  – Nimbus uses a combination of the local disk(s) and ZooKeeper to store state about the topology


Storm components: worker

[Figure: a worker process (Java process) runs one or more executors (threads), each executing one or more tasks]

• Task: operator instance
  – The actual work for a bolt or a spout is done in the task
• Executor: smallest schedulable entity
  – Executes one or more tasks related to the same operator
• Worker process: Java process running one or more executors
• Worker node: computing resource, a container for one or more worker processes
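These levels map onto configuration knobs; a sketch, assuming the word-count topology above:

Config conf = new Config();
conf.setNumWorkers(2);  // 2 worker processes (JVMs) host the topology

// parallelism hint = 4 executors (threads) for this bolt,
// running 8 tasks in total, i.e., 2 task instances per executor
builder.setBolt("count", new WordCountBolt(), 4)
       .setNumTasks(8)
       .fieldsGrouping("split", new Fields("word"));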


Storm components: supervisor
• Each worker node runs a supervisor
• The supervisor:
  – receives assignments from Nimbus (through ZooKeeper) and spawns workers based on the assignment
  – sends a periodic heartbeat to Nimbus (through ZooKeeper)
  – advertises the topologies that it is currently running, and any vacancies available to run more topologies


Twitter Heron
• Real-time, distributed, fault-tolerant stream processing engine from Twitter
• Developed as the direct successor of Storm
  – Released as open source in 2016: https://twitter.github.io/heron/
  – De facto stream data processing engine inside Twitter
• Goal of overcoming Storm's performance, reliability, and other shortcomings
• Compatibility with Storm
  – API compatible with Storm: no code change is required for migration


Heron: in common with Storm
• Same terminology as Storm
  – Topology, spout, bolt
• Same stream groupings
  – Shuffle, fields, all, global
• Example: WordCount topology (expressed with the same code as in Storm; see the sketch in the Apache Storm section)


Heron: design goals
• Isolation
  – Process-based topologies rather than thread-based
  – Each process runs in isolation (easy debugging, profiling, and troubleshooting)
• Resource constraints
  – Safe to run in shared infrastructure: topologies use only their initially allocated resources and never exceed those bounds
• Compatibility
  – Fully API and data model compatible with Storm


Heron: design goals
• Backpressure
  – Built-in rate control mechanism to ensure that topologies can self-adjust when components lag
  – Heron dynamically adjusts the rate at which data flows through the topology using backpressure
• Performance
  – Higher throughput and lower latency than Storm
  – Enhanced configurability to fine-tune potential latency/throughput trade-offs
• Semantic guarantees
  – Support for both at-most-once and at-least-once processing semantics
• Efficiency
  – Minimum possible resource usage


Heron topology architecture
• Master-worker architecture
• One Topology Master (TM)
  – Manages a topology throughout its entire lifecycle
• Multiple Containers
  – Each Container runs multiple Heron Instances, a Stream Manager, and a Metrics Manager
  – A Heron Instance is a process that handles a single task of a spout or bolt
  – Containers communicate with the TM to ensure that the topology forms a fully connected graph



Heron topology architecture
• Stream Manager (SM): routing engine for data streams
  – Each Heron Instance connects to its local SM, while all of the SMs in a given topology connect to one another to form a network
  – Responsible for propagating backpressure

Topology submit sequence

[Figure: topology submit sequence]

Self-adaptation in Heron
• Dhalion: framework on top of Heron to autonomously reconfigure topologies to meet throughput SLOs, scaling resource consumption up and down as needed
• Phases in Dhalion:
  – Symptom detection (backpressure, skew, …)
  – Diagnosis generation
  – Resolution
• Adaptation actions: parallelism changes

Heron environment
• Heron supports deployment on Apache Mesos
• Can also run on Mesos using Apache Aurora as a scheduler, or using a local scheduler

Batch processing vs. stream processing
• Batch processing is just a special case of stream processing

Batch processing vs. stream processing
• Batched/stateless: scheduled in batches
  – Short-lived tasks (Hadoop, Spark)
  – Distributed streaming over batches (Spark Streaming)
• Dataflow/stateful: continuous, scheduled once (Storm, Flink, Heron)
  – Long-lived task execution
  – State is kept inside tasks

Native vs. non-native streaming

[Figure: native vs. non-native streaming]

Apache Flink
• Distributed data flow processing system
• One common runtime for DSP applications and batch processing applications
  – Batch processing applications run efficiently as special cases of DSP applications
• Integrated with many other projects in the open-source data processing ecosystem
• Derives from the Stratosphere project by TU Berlin, Humboldt University, and the Hasso Plattner Institute
• Supports a Storm-compatible API

Flink: software stack

[Figure: Flink software stack]

• On top: libraries with high-level APIs for different use cases, still in beta

Flink: programming model
• Data stream
  – An unbounded, partitioned, immutable sequence of events
• Stream operators
  – Stream transformations that generate new output data streams from input ones

Flink: some features
• Supports stream processing and windowing with event-time semantics
  – Event time makes it easy to compute over streams where events arrive out of order or delayed
• Exactly-once semantics for stateful computations
• Highly flexible streaming windows

Flink: some features
• Continuous streaming model with backpressure
• Flink's streaming runtime has natural flow control: slow data sinks backpressure faster sources

Flink: APIs and libraries
• Streaming data applications: DataStream API
  – Supports functional transformations on data streams, with user-defined state and flexible windows
  – Example: how to compute a sliding histogram of word occurrences over a data stream of texts

WindowWordCount in Flink's DataStream API
• Sliding time window of 5 s length and 1 s trigger interval
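A minimal sketch of such a WindowWordCount, assuming a socket text source on localhost:9999 and the Flink 1.x DataStream API of that period (timeWindow with size and slide):

import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.util.Collector;

public class WindowWordCount {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        DataStream<Tuple2<String, Integer>> counts = env
            .socketTextStream("localhost", 9999)
            // tokenize each line into (word, 1) pairs
            .flatMap((String line, Collector<Tuple2<String, Integer>> out) -> {
                for (String word : line.toLowerCase().split("\\W+")) {
                    out.collect(new Tuple2<>(word, 1));
                }
            })
            .returns(Types.TUPLE(Types.STRING, Types.INT))
            .keyBy(0)                                      // partition by word
            .timeWindow(Time.seconds(5), Time.seconds(1))  // 5 s window, 1 s slide
            .sum(1);                                       // count per window

        counts.print();
        env.execute("WindowWordCount");
    }
}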

Flink: APIs and libraries
• Batch processing applications: DataSet API
  – Supports a wide range of data types beyond key/value pairs, and a wealth of operators

Core loop of the PageRank algorithm for graphs
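A sketch of that core loop using the DataSet API's bulk iterations, assuming pagesWithRanks holds (pageId, rank) tuples, adjacency holds (pageId, neighbor ids) tuples, and DAMPING, numPages, and maxIterations are given:

// iterate the rank computation up to maxIterations times
IterativeDataSet<Tuple2<Long, Double>> iteration =
        pagesWithRanks.iterate(maxIterations);

DataSet<Tuple2<Long, Double>> newRanks = iteration
        // join each page's current rank with its outgoing-edge list
        .join(adjacency).where(0).equalTo(0)
        // distribute the rank evenly over the page's neighbors
        .flatMap((Tuple2<Tuple2<Long, Double>, Tuple2<Long, Long[]>> joined,
                  Collector<Tuple2<Long, Double>> out) -> {
            Long[] neighbors = joined.f1.f1;
            double share = joined.f0.f1 / neighbors.length;
            for (Long n : neighbors) out.collect(new Tuple2<>(n, share));
        })
        .returns(Types.TUPLE(Types.LONG, Types.DOUBLE))
        // sum the partial ranks arriving at each page
        .groupBy(0).aggregate(Aggregations.SUM, 1)
        // apply the damping factor
        .map(rank -> new Tuple2<>(rank.f0,
                DAMPING * rank.f1 + (1 - DAMPING) / numPages))
        .returns(Types.TUPLE(Types.LONG, Types.DOUBLE));

// feed the new ranks back in; the loop stops after maxIterations
DataSet<Tuple2<Long, Double>> finalRanks = iteration.closeWith(newRanks);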

Flink: program optimization
• Batch programs are automatically optimized to exploit situations where expensive operations (like shuffles and sorts) can be avoided, and to decide when intermediate data should be cached

Flink: control events
• Control events: special events injected into the data stream by operators
• Periodically, the data source injects checkpoint barriers into the data stream, dividing the stream into a pre-checkpoint and a post-checkpoint part
  – More coarse-grained approach than Storm: acks sequences of records instead of individual records
• Watermarks signal the progress of event time within a stream partition
• Flink does not provide ordering guarantees after any form of stream repartitioning or broadcasting
  – Dealing with out-of-order records is left to the operator implementation
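A sketch of how an application turns these control events on in the Flink 1.x APIs, assuming env is the StreamExecutionEnvironment, events is an input DataStream, and the Event type and 5-second out-of-orderness bound are hypothetical:

// inject checkpoint barriers at the sources every 10 seconds
env.enableCheckpointing(10_000);

// let event time (i.e., watermarks), not arrival time, drive windowing
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);

// emit watermarks that tolerate up to 5 seconds of out-of-order events
DataStream<Event> withTimestamps = events.assignTimestampsAndWatermarks(
        new BoundedOutOfOrdernessTimestampExtractor<Event>(Time.seconds(5)) {
            @Override
            public long extractTimestamp(Event e) {
                return e.timestamp;  // event time carried in the record
            }
        });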

Flink: fault tolerance
• Based on Chandy-Lamport distributed snapshots
• Lightweight mechanism
  – Allows maintaining high throughput rates while providing strong consistency guarantees at the same time

Flink: performance and memory management
• High performance and low latency
• Memory management
  – Flink implements its own memory management inside the JVM

Flink: architecture
• The usual master-worker architecture

Flink: architecture
• Master (Job Manager): schedules tasks, coordinates checkpoints, coordinates recovery on failures, etc.
• Workers (Task Managers): JVM processes that execute tasks of a dataflow, and buffer and exchange the data streams
  – Workers use task slots to control the number of tasks they accept
  – Each task slot represents a fixed subset of the worker's resources

Flink: application execution
• Jobs are expressed as data flows
• The job graph is transformed into the execution graph
• The execution graph contains the information needed to schedule and execute a job

Flink: infrastructure
• Designed to run on large-scale clusters with many thousands of nodes
• Provides support for YARN and Mesos

A recent need
• A common need for many companies
  – Run both batch and stream processing
• Alternative solutions
  – Lambda architecture
  – Unified frameworks: Spark, Flink, Samza
  – Unified programming model: Apache Beam

Lambda architecture
• Data-processing design pattern to integrate batch and real-time processing
• A streaming framework is used to process real-time events while, in parallel, a batch framework (e.g., Hadoop/Spark) is deployed to process the entire dataset (perhaps periodically)
• Results from the two parallel pipelines are then merged

Source: https://voltdb.com/products/alternatives/lambda-architecture

Lambda architecture: example
• Lambda architecture used at LinkedIn before Samza development

Lambda architecture
• Pros:
  – Flexibility in the choice of frameworks
• Cons:
  – Implementing and maintaining two separate frameworks for batch and stream processing can be hard and error-prone
  – Overhead of developing and managing multiple source codes
    • The logic in each fork evolves over time, and keeping the forks in sync involves duplicated and complex manual effort, often in different languages

Unified frameworks
• Use a unified (Lambda-less) design for processing both real-time and batch data using the same data structure
• Spark, Flink, and Samza follow this trend

Apache Beam
• A new layer of abstraction
• Provides an advanced unified programming model
  – Allows defining batch and streaming data processing pipelines that run on any supported execution engine (for now: Flink, Spark, Google Cloud Dataflow)
  – Java, Python, and Go as programming languages
• Translates the data processing pipeline defined by the user in the Beam program into the API of the chosen distributed processing engine
• Developed by Google and released as an open-source top-level Apache project
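A minimal Beam pipeline sketch in Java (file names hypothetical); the same code runs on any supported engine, selected via the --runner option (e.g., --runner=FlinkRunner or --runner=DataflowRunner):

import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Count;
import org.apache.beam.sdk.transforms.FlatMapElements;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class BeamWordCount {
    public static void main(String[] args) {
        // the runner is chosen via command-line options, not in the code
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        p.apply(TextIO.read().from("input.txt"))
         .apply(FlatMapElements.into(TypeDescriptors.strings())
                .via(line -> Arrays.asList(line.split("\\W+"))))
         .apply(Count.perElement())
         .apply(MapElements.into(TypeDescriptors.strings())
                .via(kv -> kv.getKey() + ": " + kv.getValue()))
         .apply(TextIO.write().to("word-counts"));

        p.run().waitUntilFinish();
    }
}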


Apache Samza
• A distributed system for stateful and fault-tolerant stream processing
  – A unified framework for batch and stream processing: similarly to Flink, streams are first-class citizens and batch is a special case of streaming
  – In production at LinkedIn since 2014
• Uses Kafka for messaging and YARN to provide fault tolerance, processor isolation, security, and resource management

Apache Samza
• Why stateful and fault-tolerant processing? User profiles, email digests, aggregate counts, …
• Example: Email Digestion System
  – Production application running at LinkedIn, using Samza to digest updates into one email

Apache Samza: features
• Provides distributed and scalable data processing with:
  – Configurable and heterogeneous data sources and sinks (e.g., Kafka, HDFS, AWS Kinesis)
  – Efficient state management: local state (in memory or on disk) partitioned among tasks (rather than using a remote data store) and incremental checkpointing (more efficient than full-state checkpointing)
  – Unified processing API for stream and batch processing
    • Differently from Flink
  – Flexible deployment models

Apache Samza: architecture
• Samza is made up of three layers:
  – A streaming layer: Kafka
  – An execution layer: YARN
  – A processing layer: the Samza API

Apache Samza: architecture
• Samza, YARN, and Kafka interaction
  – RM: YARN ResourceManager
  – NM: YARN NodeManager
  – AM: ApplicationMaster
• The Samza client uses YARN to run a Samza job
• YARN starts and supervises one or more SamzaContainers; the processing code (using the StreamTask API) runs inside those containers
• The input and output for the Samza StreamTasks come from Kafka brokers

Apache Samza: running an application
• Example: count the number of page views aggregated by user ID
• Two Samza jobs: one to group messages by user ID, the other to do the counting

Apache Samza: state management
• Common approach (e.g., in Storm) to deal with large amounts of state: use a remote data store (e.g., Redis)
• Samza approach: keep state local to each node and make it robust to machine failures by replicating state changes across multiple machines

Apache Samza: high-level API
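A minimal sketch using the Samza 0.13-era high-level API (the stream names, the string-valued messages, and the filtering predicate are hypothetical):

import org.apache.samza.application.StreamApplication;
import org.apache.samza.config.Config;
import org.apache.samza.operators.MessageStream;
import org.apache.samza.operators.OutputStream;
import org.apache.samza.operators.StreamGraph;

public class PageViewFilter implements StreamApplication {
    @Override
    public void init(StreamGraph graph, Config config) {
        // input: page-view events from a Kafka topic
        MessageStream<String> pageViews =
                graph.getInputStream("page-views", (String k, String v) -> v);

        // output: keyed by the message itself (illustrative only)
        OutputStream<String, String, String> filtered =
                graph.getOutputStream("filtered-page-views", m -> m, m -> m);

        // drop events from invalid users, forward the rest
        pageViews.filter(msg -> !msg.contains("invalid-user"))
                 .sendTo(filtered);
    }
}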


Apache Samza: example application
• Samza job to find the top-k trending tags
  – DAG for the logical representation of the application


Towards strict delivery guarantees
• Most frameworks provide at-least-once delivery guarantees (e.g., Storm, Samza)
  – For stateful non-idempotent operators such as counting, at-least-once delivery guarantees can give incorrect results
• Flink, Storm with Trident, and Google's MillWheel offer stronger delivery guarantees (i.e., exactly-once)
  – Exactly-once, low-latency stream processing in MillWheel works as follows:
    • The record is checked against de-duplication data from previous deliveries; duplicates are discarded
    • User code is run for the input record, possibly resulting in pending changes to timers, state, and productions
    • Pending changes are committed to the backing store
    • Senders are acked
    • Pending downstream productions are sent

Source: MillWheel: Fault-Tolerant Stream Processing at Internet Scale, VLDB 2013.

DSP in the Cloud
• Data streaming systems are also offered as Cloud services
  – Amazon Kinesis Data Streams
  – Google Cloud Dataflow
  – IBM Streaming Analytics
  – Microsoft Azure Stream Analytics
• Abstract the underlying infrastructure and support dynamic scaling of the computing resources
• Appear to execute in a single data center

Google Cloud Dataflow
• Fully managed data processing service, supporting both stream and batch execution of pipelines
  – Transparently handles resource lifetime and can dynamically provision resources to minimize latency while maintaining high utilization efficiency
  – On-demand and auto-scaling
• Provides a unified programming model and a managed service for developing and executing a wide range of data processing patterns, including ETL, batch computation, and continuous computation
  – Apache Beam model
• Provides exactly-once processing
  – MillWheel is Google's internal version of Cloud Dataflow

Google Cloud Dataflow
• Seamlessly integrates with other Google cloud services
  – Cloud Storage, Cloud Pub/Sub, Cloud Datastore, Cloud Bigtable, and BigQuery
• Apache Beam SDKs, available in Java and Python
  – Enable developers to implement custom extensions and choose other execution engines

Amazon Kinesis Data Streams
• Allows building applications that process and analyze streaming data

Kinesis Data Streams: components
• Data record
  – Unit of data stored by Kinesis
• Stream: group of data records
  – Data producers write data records to Kinesis streams
  – Data records in the stream are distributed into shards
• Shard: sequence of data records in a stream
  – The number of shards in a stream is specified when the stream is created
    • It can then be increased or decreased as needed, but only manually
  – Base unit of capacity: up to 1 MB/s of written data and 2 MB/s of read data
  – A partition key is used to group data by shard within a stream
  – Shards are also the unit of service pricing: http://amzn.to/2szRTkG
  – Data records are stored in shards temporarily (24 hours by default)
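For instance, a producer using the AWS SDK for Java v1 writes a record under a partition key, which is hashed to pick the target shard (the stream name, userId, and payloadBytes are hypothetical):

import java.nio.ByteBuffer;
import com.amazonaws.services.kinesis.AmazonKinesis;
import com.amazonaws.services.kinesis.AmazonKinesisClientBuilder;
import com.amazonaws.services.kinesis.model.PutRecordRequest;

AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

PutRecordRequest request = new PutRecordRequest()
        .withStreamName("page-views")
        .withPartitionKey(userId)               // hashed to pick the shard
        .withData(ByteBuffer.wrap(payloadBytes));

kinesis.putRecord(request);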

Kinesis Data Streams: consuming data
• Kinesis Data Streams is used to capture streaming data
• An application reads data from a Kinesis stream as data records, then uses the Kinesis Client Library (KCL) for the processing logic
  – The KCL takes care of: load balancing across multiple EC2 instances, responding to instance failures, checkpointing processed records, and reacting to re-sharding (which adjusts the number of shards)
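A sketch of the processing hook the KCL expects, using the v1 client library's IRecordProcessor interface (the per-record logic is left as a comment):

import com.amazonaws.services.kinesis.clientlibrary.interfaces.v2.IRecordProcessor;
import com.amazonaws.services.kinesis.clientlibrary.types.InitializationInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ProcessRecordsInput;
import com.amazonaws.services.kinesis.clientlibrary.types.ShutdownInput;
import com.amazonaws.services.kinesis.model.Record;

public class PageViewProcessor implements IRecordProcessor {
    @Override
    public void initialize(InitializationInput input) {
        // called once per shard lease, before any records arrive
    }

    @Override
    public void processRecords(ProcessRecordsInput input) {
        for (Record record : input.getRecords()) {
            // application-specific processing of each data record
        }
        try {
            // checkpoint so the KCL resumes here after a failure or re-shard
            input.getCheckpointer().checkpoint();
        } catch (Exception e) {
            // retry or log; checkpointing can fail during shutdown
        }
    }

    @Override
    public void shutdown(ShutdownInput input) {
        // called when the lease is lost or the shard is closed
    }
}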


References
• Wingerath et al., "Real-time stream processing for Big Data" (overview of DSP frameworks), 2016. http://bit.ly/2rUhXXG
• Kulkarni et al., "Twitter Heron: Stream processing at scale", ACM SIGMOD '15. http://bit.ly/2rUXkux
• Carbone et al., "Apache Flink: Stream and batch processing in a single engine", Bulletin of the IEEE Comp. Soc. Tech. Comm. on Data Engineering, 2015. http://bit.ly/2sYzoGb
• Noghabi et al., "Samza: Stateful scalable stream processing at LinkedIn", Proc. VLDB Endowment, 2017. https://bit.ly/2LushvF
• Akidau et al., "MillWheel: Fault-tolerant stream processing at Internet scale", VLDB '13. http://bit.ly/2rE7Fa3
