stream me up, scotty: experiences of integrating event-driven approaches into … · 2019-03-13 ·...

Stream me up, Scotty: Experiences of integrating event-driven approaches into analytic data platforms Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH Confluent Streaming Workshop Cologne / Hamburg, November 2018


Post on 20-May-2020


Page 1: Stream me up, Scotty: Experiences of integrating event-driven approaches into analytic data platforms

Dr. Dominik Benz, Head of Machine Learning Engineering, inovex GmbH

Confluent Streaming Workshop Cologne / Hamburg, November 2018

Page 2

› Integrate existing (batch) data sources?
› Check consistency with data sources?
› Build realtime data visualizations?

https://flic.kr/p/5eQA7e
https://flic.kr/p/bpFt7U

Page 3: Stream me up ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary

Page 4: A typical analytic data platform

Diagram labels:
› Stages: raw, processed, datahub, analysis; ingress / egress
› Cross-cutting: scheduling, orchestration, metadata; user access, system integration, development
› Components: Flat files, Databases, APIs, ...; Batch Processing (Spark, Hive, ..); (Hive) Tables; Airflow, Hive Metastore; SQL, Notebooks (Zeppelin, ..)

Page 5: A typical (?) streaming data platform

Diagram labels:
› Stages: raw, processed, datahub, analysis; ingress / egress
› Cross-cutting: scheduling, orchestration, metadata; user access, system integration, development
› Components: Input Data (Streams); Kafka Connect; Stream Processing (Kafka Streams, Nifi, ..); (Kafka) Topics, KTables, ..; (Confluent) Schema Registry; KSQL

Page 6: Stream me up ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary

Page 7: Integrating web tracking

Diagram labels: company website; tracking pixel; tracking service; raw tracking data

Page 8: Integrating web tracking: setup / constraints

› Hortonworks-based platform, including Nifi and Confluent Platform
› Apache Airflow: established scheduling / workflow tool, integrated into monitoring, alerting, ..
› Tracking Service: currently batch-oriented API (request data, get download links, ..), but click event stream planned
› Developers / Analysts with mixed background w.r.t. programming skills

Page 9: Apache Nifi in a Nutshell

› drag-and-drop visual definition of data pipelines
› various built-in connectors (file, stream, database, service, ...)
› event-based processing paradigm
› built-in queues, data provenance, backpressure handling, registry, ...
› focus: ingest & lightweight (!) transformation
› not a complex event processor (like Kafka Streams, Flink, Spark Streaming, ...)
› integrated into HDP stack

Page 10: Apache Airflow in a nutshell

› Python library to define & schedule batch workflows
› programmatic specification of a „DAG“ (= tasks + dependencies)
› clean handling of job run metadata (success, duration, ..)
› developed by Airbnb, open-sourced 2015
› built-in standard operators (bash, hive, spark, kubernetes, ..)
› easily extensible (custom operators, ..)
› once used -> never Oozie again :-)
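To make "DAG = tasks + dependencies" concrete, here is a minimal sketch in plain Python. It deliberately does not use Airflow's own API; the task names and the `run` helper are illustrative assumptions. It only shows the idea of declaring dependencies, deriving a valid run order, and recording per-task run metadata.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Each task maps to the set of tasks it depends on (its upstream tasks).
# Task names are illustrative only, not taken from a real pipeline.
dag = {
    "trigger_download": set(),
    "check_status": {"trigger_download"},
    "fetch_download_links": {"check_status"},
    "store_data": {"fetch_download_links"},
}


def execution_order(dependencies):
    """Return one valid run order in which every task follows its upstream tasks."""
    return list(TopologicalSorter(dependencies).static_order())


def run(dependencies, actions):
    """Run tasks in dependency order and record a status per task."""
    metadata = {}
    for task in execution_order(dependencies):
        try:
            actions[task]()
            metadata[task] = "success"
        except Exception as exc:
            metadata[task] = f"failed: {exc}"
            break  # stop the run; remaining tasks are not started
    return metadata
```

A scheduler such as Airflow adds what this sketch leaves out: scheduling, retries, persistence of run metadata, and a UI.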

Page 11: Integrating web tracking: options

Diagram labels: tracking service; tracking data

Options and their aspects (+ / -):

Airflow only
  + integrated into monitoring, ..
  + job status handling, reloading
  - not prepared for future stream API
  - handling file content complicated

Unified Abstraction (e.g. Apache Beam)
  + one model for batch / stream ingest
  - comparatively high entry barrier

Nifi only
  + visual pipeline definition
  + easy handling of file content
  + event-based paradigm
  + operators available
  - custom status handling, reloading

Kafka-Connect
  + fault-tolerant
  + scalable setup
  - custom connector coding
  - custom status handling, reloading

Page 12: Integrating web tracking: chosen solution – Airflow + Nifi

› Combines advantages of Airflow & Nifi
› Prepared for future streaming API
› Integrated into monitoring, alerting, ..
› Status handling / reloading easy

Diagram labels: tracking service; trigger (hourly) download; check status (sensors); trigger, fetch download links; download, process, store data
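The "check status (sensors)" step can be sketched as a polling loop. This is only an illustration under stated assumptions: `FakeTrackingService`, its methods, the `tracking.example` URL, and the `hand_over` callback (standing in for whatever passes the links on to the download / process / store step) are all hypothetical, and the real tracking API and the exact split between Airflow and Nifi are not given on the slide.

```python
import time


class FakeTrackingService:
    """Hypothetical stand-in for the batch-oriented tracking API."""

    def __init__(self, ready_after_polls):
        self._polls_left = ready_after_polls

    def request_export(self, hour):
        # Ask the service to prepare the data of one hour; returns a job id.
        return f"job-{hour}"

    def is_ready(self, job_id):
        # The export becomes ready after a fixed number of status checks.
        self._polls_left -= 1
        return self._polls_left <= 0

    def download_links(self, job_id):
        return [f"https://tracking.example/{job_id}/part-0.csv.gz"]


def wait_until_ready(service, job_id, poke_interval=0.0, max_pokes=10):
    """Sensor-style check: poll the job status until it is ready or give up."""
    for _ in range(max_pokes):
        if service.is_ready(job_id):
            return True
        time.sleep(poke_interval)
    return False


def hourly_run(service, hour, hand_over):
    """Trigger the export, wait for it, then hand the links to the next step."""
    job_id = service.request_export(hour)
    if not wait_until_ready(service, job_id):
        raise TimeoutError(f"export {job_id} not ready")
    links = service.download_links(job_id)
    hand_over(links)
    return links
```

Because each hourly run is a separate unit with an explicit success or failure, reloading a missing hour amounts to running the same steps again for that hour.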

Page 13: Stream me up ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary

Page 14: Checking consistency: Customer Consent

Diagram labels: customer portal; grants / revokes consent; Customer (consent) database; stores consent; kafka; consent event; writes consent to hive; in sync?

https://flic.kr/p/9yHuk8

Page 15: Checking consistency: setup / constraints

› Analysts need up-to-date version of customer consent information in platform
› Hard correctness requirements (especially regarding revoked consent)
› Continuous monitoring of correctness
› Alerting in case of differences

Page 16: Checking Consistency: Statistics Events

Diagram labels: customer portal; kafka

› use existing channel (kafka)
› source injects periodic „statistics events“ into the stream with a defined measure point (in time)

Example events on the stream, ordered by time:

{type:GRANT, cid:12, ts:2018-10-01 11:00:00 ..}
{type:GRANT, cid:10, ts:2018-10-01 11:01:00 ..}
{type:REVOK, cid:09, ts:2018-10-01 11:01:05 ..}
{type=STAT, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}
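A minimal sketch of the source side, assuming events are plain dictionaries: the generator passes every consent event through unchanged and injects a STAT event at regular intervals. Several details are assumptions made for illustration: the single `num_consent` counter (the slide shows `num_consent_v1` / `num_consent_v2` without defining them), injecting by event count rather than by wall-clock time, and using the last event's timestamp as the measure point. A real source would more likely take the statistics from its own database.

```python
from collections import Counter


def with_statistics_events(events, every=3):
    """Yield all events unchanged and inject a STAT event after every `every` events."""
    counts = Counter()
    for i, event in enumerate(events, start=1):
        # Keep a running count of currently granted consents (hypothetical statistic).
        if event["type"] == "GRANT":
            counts["num_consent"] += 1
        elif event["type"] == "REVOK":
            counts["num_consent"] -= 1
        yield event
        if i % every == 0:
            # The measure point tells the consumer up to which time to count.
            yield {
                "type": "STAT",
                "measure_ts": event["ts"],
                "stats": dict(counts),
            }
```

Since the statistics travel through the same channel as the data, they arrive in order with the events they describe, and no second integration path is needed.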

Page 17: Checking Consistency: Evaluate Statistics Event

› perform count on target side (Hive) up to $measurePoint
› compare counts
› counts = simple plausibility check, but more elaborate checks (hashes) are conceivable

Statistics event from the stream:

{type=STAT, measure_ts=2018-10-01 11:01:20, stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}

Counts on the target side, compared with the event (in sync?):

{measure_ts=2018-10-01 11:01:20, hive_stats={num_consent_v1:72625, num_consent_v2: 6252, ..}}

Diagram labels: Customer (consent) database
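The evaluation step can be sketched as follows. In-memory rows stand in for the Hive table, and the row layout (a `stat` field naming the counter a row contributes to, plus a `ts` string) is a hypothetical schema chosen for illustration. Timestamps in the `YYYY-MM-DD HH:MM:SS` form compare correctly as strings.

```python
def count_up_to(rows, measure_ts):
    """Count target-side rows per statistic, using only rows up to the measure point."""
    counts = {}
    for row in rows:
        if row["ts"] <= measure_ts:
            counts[row["stat"]] = counts.get(row["stat"], 0) + 1
    return counts


def differences(stat_event, rows):
    """Return {name: (expected, actual)} for every statistic that does not match."""
    actual = count_up_to(rows, stat_event["measure_ts"])
    return {
        name: (expected, actual.get(name, 0))
        for name, expected in stat_event["stats"].items()
        if actual.get(name, 0) != expected
    }
```

An empty result means source and target agree at the measure point; a non-empty result is what would feed the alerting mentioned in the setup.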

Page 18: Stream me up ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary

Page 19: Realtime visualizations: Online Shop Purchases

Diagram labels: online shop; JMS; purchase event; normalization, filtering, aggregation, ..; realtime dashboard

https://flic.kr/p/9yHuk8

Page 20: Realtime visualizations: setup / constraints

› Goal: timely insights into various purchase aspects (items bought last 5 min, ..)
› flexible / configurable frontend (time window, aggregation dimension, ..)
› scalable to 100s / 1000s of dashboard users
› low latency of dashboard backend

Page 21: Realtime visualizations: components / options

Layers: transport layer; service backend; service API; processing

Options (source in each case: JMS):
› Kafka-connect → Kafka → Kafka-streams → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot
› Nifi → Kafka → Tranquility → Druid → Spring Boot
› Nifi → Kafka → Kafka-connect → HBase → Phoenix / JDBC → Spring Boot

Aggregation approaches shown: aggregation during processing; aggregation at query-time; built-in, configurable aggregation

Page 22: Realtime visualizations: chosen solution

JMS → Nifi → Kafka → Tranquility → Druid → Spring Boot

› Druid: time series database with focus on
  › realtime ingestion, good Kafka integration
  › „slice-and-dice“ queries
  › distributed scale-out architecture
› Event processing kept simple in Nifi
  › mainly cleaning, transformation
  › aggregation is pushed down to Druid
› But: yet another distributed system .. :-(
  › Experiences good so far, but needs work / skills
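What "aggregation at query time" buys the dashboard can be sketched in a few lines. This is not Druid's query API; it is a plain-Python illustration, with hypothetical event fields (`ts`, `item`, `country`), of the idea that cleaned events are stored as they arrive and the time window and aggregation dimension are chosen only when a user asks.

```python
from collections import Counter
from datetime import datetime, timedelta


def aggregate(events, now, window, dimension):
    """Count purchases per value of `dimension` within [now - window, now]."""
    start = now - window
    return Counter(
        event[dimension]
        for event in events
        if start <= event["ts"] <= now
    )
```

The same stored events answer "items bought in the last 5 min" and "purchases per country in the last 10 min" without changing the processing pipeline, which is why the event processing upstream can stay limited to cleaning and transformation.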

Page 23: Stream me up ..

› Analytic (Streaming) Data Platforms
› Integrating existing (batch) data sources
› Checking consistency
› Building realtime visualizations
› Wrap up & Summary

Page 24: The human factor ..

› Technology moves from batch to stream – what about people?
› Analysts‘ world = often batch world
  › tooling centered around static datasets
  › can (and must) be generated from streams
  › but: education towards stream / event-based thinking necessary!
› Incremental / stream-based data exchange = paradigm shift
  › efforts / commitment „from both ends“ necessary

https://flic.kr/p/f2Wx6t

Page 25: Stream me up, Scotty ..

The future is event-based, but on the way:

› Existing batch-oriented APIs
  › use (scheduled) event-based tools for easier later migration
› Checking consistency
  › inject plausibility checks into data stream
› Realtime visualizations
  › Druid + Kafka powerful and flexible combination
› Don‘t forget the human in the loop!

Page 26: Thank you

Dr. Dominik Benz
[email protected]

inovex GmbH
Park Plaza
Ludwig-Erhard-Allee 6
76131 Karlsruhe