scalabledata ingestion for stream processingand...

19
Scalable Data Ingestion for Stream Processing and Beyond Gabriel Antoniu Joint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez 9th JLESC workshop, UTK, Knoxville, April 15, 2019

Upload: others

Post on 28-May-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Scalable Data Ingestion for Stream Processing and Beyond

Gabriel AntoniuJoint work with Ovidiu Marcu, Alexandru Costan, Maria S. Pérez

9th JLESC workshop, UTK, Knoxville, April 15, 2019

Page 2: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Velocity

Data in motion

FluidDynamic

Volume

Data at rest

StationaryStatic

2

From Big Data to Fast Data

Page 3: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Correctness Exact results Approximate results

Latency High-latency Low-latency

Cost Stateless Stateful

Batch Streaming

3

Page 4: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

State of the art until recently:Lambda Architectures

Historical events

Real-time events

Exact historical

model

Approximate real-time

model

Periodic queries

Continuous queries

Batch processing

Stream processing

Results &

Actions

What?

Why?

4

Page 5: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

The streaming pipeline: latency happens

Unified batch and stream processing

Ingest delay (write latency)

Throughput(read latency)

Network delay or unavailable Backlog

Poor storage design

Starved resources

Hardware failure

DATATRANSFER

Edge Cloud

5

Page 6: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

What is ingestion ?

•Collect data from various sources → producers

•Deliver them for processing / storage → consumers

•Optionally: buffer, log, pre-process

Ingestion determines the processing performance6

Page 7: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

State of the art: Apache Kafka

Limitations• Scalability• Data duplication

400 nodes, peak 3.2M events/s50 nodes, average 200K events/s

7

Page 8: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

The KerA approach to ingestion • Scalability → Dynamic partitioning• Enables seamless elasticity

• Data duplication → Unified ingestion and storage • Support for both

• Streams (unbounded data)

• Objects (bounded data)

8

Page 9: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Zoom: scalability

Each partition is statically associated with one consumer: limited scalability

Producers Brokers Consumers

Partitions

Partitions

Partitions

9

Page 10: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

KerA: dynamic partitioning

10

• Streamlets: logical stream containers; #streamlets > #brokers• Groups: created and processed dynamically; maximum #active groups per broker

Streamlets Streamlets

Streamlets Streamlets

Streamlets Streamlets

GroupsGroups

GroupsGroups

Groups

Page 11: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Increased network and storage overheads 11

Zoom: data duplication

Page 12: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

KerA: unified ingestion and storage

KerA

INGESTIONBrokers

Acquire Push/Pulldata access

Streams

Objects

Common data model for streams and objects

STORAGEBackups

Move less data, process them faster

12

Page 13: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Evaluating scalability

1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106

4 8 16 32

Aver

age

Aggr

egat

ed

Clie

nt T

hrou

ghpu

t (re

cord

s/s)

Clients Number

KeraProdKeraConsKafkaProdKafkaCons

1.0*1053.0*1055.0*1057.0*1059.0*1051.1*1061.3*1061.5*1061.7*1061.9*1062.1*106

4 8 12 16

Ave

rage

Agg

rega

ted

Clie

nt T

hrou

ghpu

t (re

cord

s/s)

Nodes Number

KeraProdKeraConsKafkaProdKafkaCons

64 clients, 32 partitions, 1MB request size, 100B records

# Brokers

2x better throughputwith 75% less resources

Vertical Horizontal

8x10x

# Clients

4 brokers, 32 partitions, 128KB request size, 100B records

21*105

19*105

17*105

15*105

13*105

11*105

9*105

7*105

5*105

3*105

1*105

Thro

ughp

ut

21*105

19*105

17*105

15*105

13*105

11*105

9*105

7*105

5*105

3*105

1*105

13

Page 14: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

14

Our vision: hybrid analytics architecture

Present data

Stream processing

Past data

Historicalmodel

Real-timemodel

ComputationalModel

Batch processing

FuturemodelSimulation

Proactive control

Continuous update

In situ processing

In transit processing

HybridAnalytics

Page 15: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Hybrid analytics: processing architecture

DATAfrom the

Real World

DATAfrom the

Hypothetical World …

Simulation (e.g., digital twin)

Computation

In situ pre-processingof simulation data

Sensor

In situ streampre-processingof sensor data

Hybrid (stream + batch) in transit processing

(data in-motion + data at-rest)

Historicaldata

BetterDecisionLearning

15

Data processing

Page 16: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

16

Hybrid analytics architecture

DATAfrom the

Real World

DATAfrom the

Hypothetical World …

Simulation (e.g., digital twin)

Computation

In situ pre-processingof simulation data

Sensor

In situ streampre-processingof sensor data

BetterDecisionLearning

Hybrid (stream + batch) in transit processing

(data in-motion + data at-rest)

Historicaldata

Postdoc (ANR OverFlow project)• Investigating Edge vs. Cloud

computing trade-offs for stream processing

• Methodology for benchmarking Edge processing frameworks

Ph.D. (to hire)• Uniform Cloud and Edge stream

processing for Fast Data analytics

Pedro Silva

Page 17: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Hybrid analytics architecture

DATAfrom the

Real World

DATAfrom the

Hypothetical World …

Simulation (e.g., digital twin)

Computation

In situ pre-processingof simulation data

Sensor

In situ streampre-processingof sensor data

Historicaldata

BetterDecisionLearning

Hybrid (stream + batch) in transit processing

(data in-motion + data at-rest) Research Engineer (Inria ADT project)• Enable support for in situ Big Data

analytics• Elastic allocation of dedicated resources

(cores/nodes)Ovidiu Marcu

Page 18: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Hybrid analytics architecture

DATAfrom the

Real World

DATAfrom the

Hypothetical World …

Simulation (e.g., digital twin)

Computation

In situ pre-processingof simulation data

Sensor

In situ streampre-processingof sensor data

BetterDecisionLearning

Hybrid (stream + batch) in transit processing

(data in-motion + data at-rest)

KerA+++seamless integration with in situ/in transit

+large state management

Historicaldata

Ovidiu Marcu

Startup (ZettaFlow)• Low and consistent latency

(lightweight offset indexing, independent memory management)

• Model applications not partitioning/stream storage

Ph.D. (Inria IPL project)• HPC – Big Data processing

convergence• Bridge in situ/in transit and

stream/batch processing

H2020 project in preparation

Page 19: ScalableData Ingestion for Stream Processingand Beyondicl.utk.edu/jlesc9/files/STA1.2/jlesc9_antoniu.pdfHybrid analytics architecture DATA from the Real World DATA Hypothetical World

Thank You!