Page 1: Continuous data processing with Kinesis at Snowplow2014.adattarhazforum.hu/letoltes/2014dwforum/snowplow_alex_dean_parallel.pdf•Co-founders Alex Dean and Yali Sassoon met at OpenX,

Continuous data processing with Kinesis at Snowplow

Budapest DW Forum 2014

Agenda today

1. Introduction to Snowplow

2. Our batch data flow & use cases

3. Why are we excited about Kinesis?

4. Adding Kinesis support to Snowplow

5. Questions

Introduction to Snowplow

Snowplow is an open-source web and event analytics platform, first version released in early 2012

• Co-founders Alex Dean and Yali Sassoon met in 2008 at OpenX, the open-source ad technology business

• After leaving OpenX, Alex and Yali set up Keplar, a niche digital product and analytics consultancy

• We released Snowplow as a skunkworks prototype at the start of 2012: github.com/snowplow/snowplow

• We started working full time on Snowplow in summer 2013

We wanted to take a fresh approach to web analytics

• Your own web event data -> in your own data warehouse

• Your own event data model

• Slice / dice and mine the data in highly bespoke ways to answer your specific business questions

• Plug in the broadest possible set of analysis tools to drive value from your data

[Diagram: Data pipeline; Data warehouse; Analyse your data in any analysis tool]

And we saw the potential of new “big data” technologies and services to solve these problems in a scalable, low-cost manner

These tools make it possible to capture, transform, store and analyse all your granular, event-level data, so you can perform any analysis

Amazon EMR, Amazon S3, CloudFront, Amazon Redshift

Early on, we made a crucial decision: Snowplow should be composed of a set of loosely coupled subsystems

1. Trackers: Generate event data from any environment

2. Collectors: Log raw events from trackers

3. Enrich: Validate and enrich raw events

4. Storage: Store enriched events ready for analysis

5. Analytics: Analyze enriched events

A, B, C, D (the links between adjacent subsystems) = Standardised data protocols

These turned out to be critical to allowing us to evolve our technology stack
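
As a rough illustration of what loose coupling through data protocols buys you, here is a minimal Python sketch. It is not Snowplow's code: the function names, the JSON field names (`e`, `url`, `uid`) and the `event_name` enrichment are all assumptions made for this example. The point is only that each subsystem hands plain data to the next, so any one stage can be swapped out without touching the others.

```python
import json
from typing import Dict, List


def tracker(url: str, user_id: str) -> str:
    """Tracker: emit one page-view event as a JSON payload (hypothetical fields)."""
    return json.dumps({"e": "pv", "url": url, "uid": user_id})


def collector(payloads: List[str]) -> List[str]:
    """Collector: log raw payloads unchanged, skipping empty requests."""
    return [p for p in payloads if p.strip()]


def enrich(raw_lines: List[str]) -> List[Dict[str, str]]:
    """Enrich: validate each raw line, then add a readable event name."""
    enriched = []
    for line in raw_lines:
        event = json.loads(line)
        if not {"e", "url", "uid"} <= event.keys():
            continue  # invalid event; this sketch simply skips it
        event["event_name"] = "page_view" if event["e"] == "pv" else "unknown"
        enriched.append(event)
    return enriched


def store(events: List[Dict[str, str]], table: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Storage: append enriched events to a table (a list stands in for a warehouse)."""
    table.extend(events)
    return table


# Wire the stages together; only the data passed between them is shared.
raw_log = collector([tracker("/home", "u1"), "", tracker("/pricing", "u2")])
warehouse = store(enrich(raw_log), [])
print([row["event_name"] for row in warehouse])
```

Because `enrich` only depends on the shape of the raw lines, the collector could be replaced (say, batch for streaming) as long as it keeps emitting the same payloads.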

Our batch data flow & use cases

By spring 2013 we had arrived at a relatively stable batch-based processing architecture

[Diagram: the Snowplow Hadoop data pipeline. Labels: Website / webapp; JavaScript event tracker; CloudFront-based event collector or Clojure-based event collector; Amazon S3; Scalding-based enrichment on Hadoop; Amazon Redshift / PostgreSQL]

What did people start using Snowplow for?

Warehousing their web event data

Agile aka ad hoc analytics

To enable…

Marketing attribution modelling

Customer lifetime value calculations

Customer churn prediction

RTB fraud detection

Email product recs

These use cases tended to be characterized by a few important traits

Traits:

1. They use data collected over long time periods

2. They demand ongoing & hands-on involvement from a BA / data scientist

3. They tend not to elicit synchronous/deterministic responses

Examples: Agile aka ad hoc analytics; Marketing attribution modelling; RTB fraud detection

So why did we get excited about Kinesis?

A quick history lesson: the three eras of business data processing

1. The classic era, 1996+

2. The hybrid era, 2005+

3. The unified era, 2013+

For more see http://snowplowanalytics.com/blog/2014/01/20/the-three-eras-of-business-data-processing/

The classic era, 1996+

[Diagram, OWN DATA CENTER: narrow data silos (CMS, CRM, E-comm, ERP), each with a low-latency local loop; point-to-point connections and a nightly batch ETL process; a data warehouse labelled high latency, wide data coverage and full data history; management reporting]

The hybrid era, 2005+

[Diagram: CLOUD VENDOR / OWN DATA CENTER and SAAS VENDOR #1, #2 and #3, containing narrow data silos with low-latency local loops (Search, E-comm, CRM, Email marketing, ERP, CMS, Web analytics); APIs and bulk exports; processing paths labelled stream processing (product recs), micro-batch processing (systems monitoring), batch processing (data warehouse, management reporting) and batch processing (Hadoop, ad hoc analytics); latency labels: low latency, low latency, high latency, high latency]

The unified era, 2013+

[Diagram: CLOUD VENDOR / OWN DATA CENTER and SAAS VENDOR #1 and #2, containing narrow data silos with some low-latency local loops (Search, E-comm, CRM, Email marketing, ERP, CMS); streaming APIs / web hooks; a unified log labelled low latency, wide data coverage and few days' data history; archiving into Hadoop, labelled high latency, wide data coverage and full data history; an event stream labelled low latency; consumers: systems monitoring, product recs, ad hoc analytics, management reporting, fraud detection, churn prevention, APIs]

The unified log is Kinesis (or Kafka)
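
To make the unified log idea concrete, here is a minimal in-memory Python sketch. It is only a stand-in: `UnifiedLog`, `ConsumingApp`, `read_from` and `poll` are names invented for this example and are not the Kinesis or Kafka APIs. It assumes the two properties that matter here: the log is append-only and ordered, and each consuming app keeps its own read position, so one app reading records does not take them away from another.

```python
from typing import List


class UnifiedLog:
    """Append-only, ordered list of records (hypothetical stand-in for a stream)."""

    def __init__(self) -> None:
        self._records: List[str] = []

    def append(self, record: str) -> int:
        """Add a record and return its position in the log."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, position: int, limit: int = 10) -> List[str]:
        """Return up to `limit` records starting at `position`; nothing is deleted."""
        return self._records[position:position + limit]


class ConsumingApp:
    """A consumer that remembers how far it has read, independently of other apps."""

    def __init__(self, name: str, log: UnifiedLog) -> None:
        self.name = name
        self.log = log
        self.position = 0

    def poll(self, limit: int = 10) -> List[str]:
        batch = self.log.read_from(self.position, limit)
        self.position += len(batch)
        return batch


log = UnifiedLog()
for record in ["page_view:/home", "add_to_basket:sku-1", "page_view:/checkout"]:
    log.append(record)

monitoring = ConsumingApp("systems-monitoring", log)
product_recs = ConsumingApp("product-recs", log)

seen_by_monitoring = monitoring.poll()       # reads everything so far
first_for_recs = product_recs.poll(limit=1)  # reads at its own pace
rest_for_recs = product_recs.poll()          # still sees the remaining records
```

Both apps end up seeing all three records, in order, even though they read at different speeds.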

We asked: can we implement Snowplow on top of Kinesis?

What kinds of use cases can we support if we implement Snowplow on top of Kinesis?

Populating a unified log with your company’s event streams

In-session product recs

To enable…

Holistic systems monitoring

In-game difficulty tuning

In-session upselling

Ad retargeting & RTB

… anything requiring low latency response / holistic view of our data!

Adding Kinesis support to Snowplow

Where we are heading with our Kinesis architecture

[Diagram: Snowplow Trackers; Scala Stream Collector; Raw event stream; Enrich Kinesis app; Bad raw events stream; Enriched event stream; S3 sink Kinesis app; S3; Redshift sink Kinesis app; Redshift]

This is where we are today

[Diagram: Snowplow Trackers; Scala Stream Collector; Raw event stream; Enrich Kinesis app; Bad raw events stream; Enriched event stream; S3 sink Kinesis app; S3; Redshift sink Kinesis app; Redshift]

What have we and the Snowplow community learnt about Kinesis and continuous data processing so far?

1. One stream feeding many consuming apps is unexpected for many people (a legacy of old MQs?)

2. Think of Kinesis apps as distributed Unix commands with streams mapping on to stdin, stderr, stdout

3. Build more complex systems by chaining simple Kinesis apps – the Kinesis stream is a really powerful primitive for continuous data flows

4. Scalability and elasticity are going to be much bigger challenges than in our batch flow
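
Points 2 and 3 can be sketched in a few lines of Python. This is an in-memory illustration under stated assumptions, not the Snowplow apps or the Kinesis API: plain lists stand in for streams, and `enrich_app`, `sink_app` and the `url` / `enriched` fields are hypothetical names. Each app takes an input stream (stdin), writes its results to one output stream (stdout) and its rejects to another (stderr), and apps are chained by making one app's stdout the next app's stdin.

```python
import json
from typing import List

Stream = List[str]  # hypothetical in-memory stand-in for a Kinesis stream


def enrich_app(stdin: Stream, stdout: Stream, stderr: Stream) -> None:
    """Read raw events; write enriched events to stdout and rejects to stderr."""
    for raw in stdin:
        try:
            event = json.loads(raw)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            stderr.append(raw)
            continue
        if not isinstance(event, dict) or "url" not in event:
            stderr.append(raw)
            continue
        event["enriched"] = True
        stdout.append(json.dumps(event, sort_keys=True))


def sink_app(stdin: Stream, stdout: Stream, stderr: Stream) -> None:
    """Downstream app chained onto the enriched stream; here it just keeps the URLs."""
    for line in stdin:
        try:
            stdout.append(json.loads(line)["url"])
        except (ValueError, KeyError, TypeError):
            stderr.append(line)


raw_events: Stream = ['{"url": "/home"}', "not json", '{"url": "/pricing"}']
enriched_events: Stream = []
bad_raw_events: Stream = []
stored_urls: Stream = []
sink_errors: Stream = []

# Chain the two apps: the enriched stream is stdout for one and stdin for the next.
enrich_app(raw_events, enriched_events, bad_raw_events)
sink_app(enriched_events, stored_urls, sink_errors)
```

Keeping the bad rows in their own stream, rather than dropping them, means another app can be attached to that stream later without changing the enrichment step.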

Questions?

http://snowplowanalytics.com

https://github.com/snowplow/snowplow

@snowplowdata

To talk offline – @alexcrdean on Twitter or [email protected]