Data pipelines from zero

Lars Albertsson
Data architect @ Schibsted
www.mapflat.com

Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)

Presentation goals
Overview of data pipelines for analytics / data products
Target audience: Big data starters
Overview of necessary components
Base recipe
  In vicinity of state-of-practice
  Baseline for comparing design proposals
Subjective best practices
  Technology suggestions, (alternatives)

Data product anatomy

[Diagram: Ingress, ETL and Egress stages. Labels: Service, DB, Unified log, Cluster storage, Pipeline, Job, Dataset, Export, Business intelligence.]

Event collection

Unified log
  Immutable events
  Append-only
  Source of truth
  Kafka (Kinesis, Google Pub/Sub)
Log to cluster storage: Secor, Camus
Cluster storage
  HDFS (NFS, S3, Google CS, C*)

[Diagram labels: Service; Unreliable; Unreliable; Reliable, write available.]

Immediate handoff to append-only replicated log.
Don’t manipulate, shuffle, sort, demux. Add timestamps.
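
The hand-off rule can be sketched as follows. This is a minimal illustration, assuming a local append-only file stands in for the replicated log (Kafka or similar in production). The names `append_event`, `read_events` and `events.log` are hypothetical. The collector adds a timestamp, appends the event untouched, and does nothing else.

```python
import json
import os
import tempfile
import time


def append_event(log_path, event, now=time.time):
    """Append one event to an append-only log, adding a collection timestamp.

    The payload is stored untouched: no sorting, demuxing or filtering here.
    """
    record = {"collected_at": now(), "payload": event}
    # Mode "a" only ever appends; existing records are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")


def read_events(log_path):
    """Read all records back in the order they were appended."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log]


# Usage: two events are handed off, with fixed clocks for reproducibility.
log_path = os.path.join(tempfile.mkdtemp(), "events.log")
append_event(log_path, {"type": "pageview", "url": "/a"}, now=lambda: 1.0)
append_event(log_path, {"type": "pageview", "url": "/b"}, now=lambda: 2.0)
events = read_events(log_path)
```

Anything smarter (sorting, splitting by event type) is left to downstream jobs, which can be rerun if they turn out to be wrong.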

Database state collection
Do: Read snapshots, event conversion tools (Aegisthus, Bottled Water)
Careful: Dump replicated slave
Don’t: Use API, dump live master

[Diagram labels: Service, DB, DB backup, Service. Cluster storage: HDFS (NFS, S3, Google CS, C*).]

Datasets

hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
    _SUCCESS
    part-00000.json
    part-00001.json

Path annotations:
  red: privacy level
  pageviews: dataset class
  v1: schema version
  country=se/year=2015/month=11/day=4: instance parameters, Hive convention
  _SUCCESS: seal
  part-00000.json, part-00001.json: partitions

Hadoop + Hive name conventions
Instance = class + parameters, same schema
Immutable
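
The naming convention above can be sketched as a small path builder. This is an illustration only: the function names are hypothetical, and it assumes that `red` in the example path is the privacy level and that date parts are written without zero padding, as in the example.

```python
from datetime import date


def dataset_path(privacy, dataset_class, schema_version, country, day):
    """Build the directory of one dataset instance from class + parameters."""
    # Hive-style key=value partition directories, unpadded as in the example path.
    params = f"country={country}/year={day.year}/month={day.month}/day={day.day}"
    return f"hdfs://{privacy}/{dataset_class}/v{schema_version}/{params}/"


def is_sealed(file_names):
    """An instance is complete, and from then on immutable, once _SUCCESS exists."""
    return "_SUCCESS" in file_names


path = dataset_path("red", "pageviews", 1, "se", date(2015, 11, 4))
```

Consumers would check `is_sealed` on a directory listing before reading, so that half-written instances are never picked up.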

Pipelines
Dataset “build system”
  Input will be missing
  Jobs will fail
  Jobs will have bugs
Dataset = function([inputs], code)
  Deterministic, idempotent
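
A minimal sketch of a deterministic, idempotent job, assuming local directories stand in for cluster storage. `build_dataset` and the file layout are hypothetical. The output is written to a scratch directory and renamed into place, so a failed run never leaves a sealed but incomplete dataset.

```python
import json
import os
import shutil
import tempfile


def build_dataset(input_paths, output_dir, transform):
    """Build output_dir as a pure function of the inputs and the transform."""
    if os.path.exists(os.path.join(output_dir, "_SUCCESS")):
        return False  # Already built: rerunning is a no-op (idempotent).
    records = []
    for path in sorted(input_paths):  # Fixed input order keeps output deterministic.
        with open(path, encoding="utf-8") as f:
            records.extend(json.loads(line) for line in f)
    # Write to a scratch directory first, then move it into place.
    parent = os.path.dirname(output_dir) or "."
    os.makedirs(parent, exist_ok=True)
    tmp_dir = tempfile.mkdtemp(dir=parent)
    with open(os.path.join(tmp_dir, "part-00000.json"), "w", encoding="utf-8") as out:
        for record in transform(records):
            out.write(json.dumps(record, sort_keys=True) + "\n")
    open(os.path.join(tmp_dir, "_SUCCESS"), "w").close()  # Seal the dataset.
    shutil.rmtree(output_dir, ignore_errors=True)  # Drop leftovers of failed runs.
    os.rename(tmp_dir, output_dir)
    return True


# Usage: the second call finds the sealed output and does nothing.
root = tempfile.mkdtemp()
src = os.path.join(root, "in.json")
with open(src, "w", encoding="utf-8") as f:
    f.write('{"n": 1}\n{"n": 2}\n')
out_dir = os.path.join(root, "doubled")
double = lambda rows: [{"n": r["n"] * 2} for r in rows]
first = build_dataset([src], out_dir, double)
second = build_dataset([src], out_dir, double)
```

To rebuild after a bug fix, delete the output directory and run the job again; the same inputs and code give the same dataset.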

[Diagram labels: Unified log; Cluster storage; Pristine, immutable datasets; Intermediate; Derived, regenerable.]

Workflow manager
Luigi, (Airflow, Oozie)
Dataset “build tool”
Build when input is available
Backfill for previous failures
Rebuild for bugs
=> Eventual correctness
DSL describes DAG
Includes egress
Data retention, privacy audit
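
The “build tool” behaviour can be sketched in a few lines of plain Python. This is not Luigi's API: the `Task` class and `build` function are hypothetical stand-ins, and local files stand in for datasets. The point is that the same loop, run repeatedly, builds what it can and backfills once missing input arrives.

```python
import os
import tempfile
from datetime import date, timedelta


class Task:
    """One dataset instance: knows its output path, its inputs, and how to build it."""

    def __init__(self, root, name, day, requires=(), external=False):
        self.path = os.path.join(root, name, day.isoformat())
        self.requires = list(requires)
        self.external = external  # External data arrives by itself, e.g. via ingress.

    def complete(self):
        return os.path.exists(self.path)

    def run(self):
        os.makedirs(os.path.dirname(self.path), exist_ok=True)
        with open(self.path, "w", encoding="utf-8") as out:
            out.write("built from: " + ", ".join(t.path for t in self.requires))


def build(task):
    """Build a task and its missing dependencies; return True if it is complete."""
    if task.complete():
        return True
    if task.external:
        return False  # Input is missing; try again on the next scheduler run.
    if not all([build(dep) for dep in task.requires]):
        return False
    task.run()
    return True


def report(root, day):
    """A two-node DAG: a report that depends on that day's raw pageviews."""
    raw = Task(root, "pageviews", day, external=True)
    return Task(root, "report", day, requires=[raw])


root = tempfile.mkdtemp()
days = [date(2015, 11, 1) + timedelta(days=i) for i in range(3)]
os.makedirs(os.path.join(root, "pageviews"))
for day in days[:2]:  # Only the first two days of raw input have arrived.
    open(os.path.join(root, "pageviews", day.isoformat()), "w").close()

first_pass = [build(report(root, day)) for day in days]
# The late input arrives; rerunning the same loop backfills the gap.
open(os.path.join(root, "pageviews", days[2].isoformat()), "w").close()
second_pass = [build(report(root, day)) for day in days]
```

Deleting a derived output and rerunning the loop gives the “rebuild for bugs” case with no extra code.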

Batch processing MVP
Start simple, lean, end-to-end, without Hadoop/Spark
  Serial jobs on pool of machines + work queue
  Downsample to fit one machine if necessary
  (Local Spark, Scalding, Crunch, Akka reactive streams)
Get end-to-end workflows in production for trial
Integration test end-to-end semantics
Ensure developer productivity - code/test cycle
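
One way to downsample so that a job fits on one machine is to sample by user rather than by event. The sketch below assumes events carry a `user` field (a hypothetical name) and uses a stable hash, since Python's built-in `hash` for strings varies between processes.

```python
import hashlib


def keep_user(user_id, sample_percent):
    """Deterministically keep roughly sample_percent % of users."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < sample_percent


def downsample(events, sample_percent):
    """Keep all events of the sampled users, so per-user semantics survive."""
    return [e for e in events if keep_user(e["user"], sample_percent)]


# Usage: 1000 synthetic users with 5 events each, sampled at 10 %.
events = [{"user": f"user-{i % 1000}", "page": i} for i in range(5000)]
sample = downsample(events, 10)
```

Because the choice depends only on the user id, every job and every rerun sees the same subset, which keeps downstream joins and integration tests consistent.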

Processing at scale
Parallelise jobs only when forced to do so
  Spark, (Hadoop + Scalding / Crunch)
  Avoid: Vanilla MapReduce, non-JVM
Most jobs fit in single machine
  Big complexity + performance win

Schemas
Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
There is always a schema, implicit or explicit
Schema on read
  Dynamic typing, quick schema changes
Schema on write
  Static typing possible
Use schema on read for analytics.
Incompatible change? New dataset class.
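
A minimal sketch of schema on read, assuming JSON lines as the storage format. The `PageView` fields are hypothetical. The stored records carry no enforced schema; the reader applies one, tolerating fields that were added later and ignoring fields it does not know.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageView:
    """The reader's view of a pageview record (hypothetical fields)."""
    url: str
    user: str
    referrer: Optional[str] = None  # Added later; old records lack it.


def read_pageviews(lines):
    """Apply the schema when reading: pick known fields, ignore unknown ones."""
    for line in lines:
        raw = json.loads(line)
        yield PageView(url=raw["url"], user=raw["user"], referrer=raw.get("referrer"))


stored = [
    '{"url": "/a", "user": "u1"}',
    '{"url": "/b", "user": "u2", "referrer": "/a", "experiment": "x"}',
]
views = list(read_pageviews(stored))
```

A compatible change (a new optional field) needs only a reader update. Renaming or retyping `url` would break old readers, which is the case for a new dataset class.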

Egress datasets
Serving
  Cassandra, denormalised
Export & Analytics
  SQL
  Workbenches (Zeppelin)
  (Elasticsearch, proprietary OLAP)

Parting words
Keep things simple. Batch, few components & little state.
Don’t drop incoming data.
Focus on developer code, test, debug cycle - end to end.
Expect, tolerate human error.
Harmony with technical ecosystems - follow tech leaders.
Scalability only when necessary.
Plan early: Privacy, retention, audit, schema evolution.

Bonus slides

Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in

Data pipelines example

[Diagram with columns Raw and Derived. Dataset labels: Users, Pageviews, Sales, Sales reports, Views with demographics, Sales with demographics, Conversion analytics.]

Data pipelines team organisation
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss

Conway’s law

“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”

Better organise to match desired design, then.
