Data pipelines from zero


TRANSCRIPT

Page 1: Data pipelines from zero

Data pipelines from zero
Lars Albertsson
Data architect @ Schibsted
www.mapflat.com

Page 2: Data pipelines from zero

Who's talking?
- Swedish Institute of Computer Science (test tools)
- Sun Microsystems (very large machines)
- Google (Hangouts, productivity)
- Recorded Future (NLP startup)
- Cinnober Financial Tech. (trading systems)
- Spotify (data processing & modelling)
- Schibsted (data processing & modelling)

Page 3: Data pipelines from zero

Presentation goals
- Overview of data pipelines for analytics / data products
- Target audience: big data starters
- Overview of necessary components; a base recipe
- In the vicinity of state of practice; a baseline for comparing design proposals
- Subjective best practices: technology suggestions, (alternatives)

Page 4: Data pipelines from zero

Data product anatomy

(diagram: ingress from services into a unified log and cluster storage; ETL pipelines of jobs build datasets; egress to databases, services, export, and business intelligence)

Page 5: Data pipelines from zero

Event collection

- Cluster storage: HDFS (NFS, S3, Google CS, C*)
- Unified log: immutable events, append-only, the source of truth
- Kafka (Kinesis, Google Pub/Sub); Secor, Camus ship the log to cluster storage
- Services and their delivery paths are unreliable; the replicated log is reliable and write-available
- Immediate handoff to the append-only replicated log. Don't manipulate, shuffle, sort, or demux. Add timestamps. (Sketch below.)
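To make the handoff concrete, here is a minimal sketch of a collector, assuming the kafka-python client and a hypothetical pageviews topic: it stamps a receive time, appends to the replicated log, and does nothing else.

    import json
    import time

    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(
        bootstrap_servers="kafka:9092",
        acks="all",  # acknowledged only after a replicated write
        value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    )

    def collect(raw_event: dict) -> None:
        """Immediate handoff: add a timestamp and append; no other processing."""
        raw_event["collected_at"] = time.time()  # when the collector received it
        producer.send("pageviews", raw_event)    # no manipulation, sorting, demuxing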

Page 6: Data pipelines from zero

Database state collection

- Do: read snapshots, use event conversion tools (Aegisthus, Bottled Water)
- Careful: dump a replicated slave
- Don't: use the service API, dump the live master

(diagram: services write to databases; DB backups flow into cluster storage: HDFS (NFS, S3, Google CS, C*))

Page 7: Data pipelines from zero

Datasets

hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
    _SUCCESS
    part-00000.json
    part-00001.json

- Hadoop + Hive naming conventions
- Instance = class + parameters, same schema
- Immutable

The path encodes: privacy level (red), dataset class (pageviews), schema version (v1), instance parameters in Hive convention (country=se/year=2015/month=11/day=4), the seal (_SUCCESS), and the partitions (part-*). See the sketch below.
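A small sketch of the naming convention, with hypothetical helper names: an instance path is built from privacy level, dataset class, schema version, and Hive-style parameters, and is sealed with _SUCCESS once all partitions are written.

    import posixpath

    def dataset_path(root, privacy, cls, version, **params):
        """Instance = class + parameters: <root><privacy>/<class>/v<n>/k1=v1/..."""
        partition = "/".join(f"{k}={v}" for k, v in params.items())
        return posixpath.join(root, privacy, cls, f"v{version}", partition)

    path = dataset_path("hdfs://", "red", "pageviews", 1,
                        country="se", year=2015, month=11, day=4)
    # -> hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4
    # A job writes part-* files under this path, then creates _SUCCESS to seal it.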

Page 8: Data pipelines from zero

Pipelines

- Dataset "build system"
- Input will be missing
- Jobs will fail
- Jobs will have bugs
- Dataset = function([inputs], code)
- Deterministic, idempotent (see the sketch below)

(diagram: the unified log feeds pristine, immutable datasets in cluster storage; intermediate and derived datasets are regenerable)
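As a sketch of the "dataset = function([inputs], code)" discipline (all names hypothetical): the runner skips sealed outputs, writes into a scratch directory, and publishes with an atomic rename, so reruns and backfills are idempotent.

    import os
    import shutil

    def build_dataset(job_fn, input_paths, output_path):
        """Idempotent build: skip if sealed, write via temp dir + atomic rename."""
        if os.path.exists(os.path.join(output_path, "_SUCCESS")):
            return  # already built; rebuilding is a no-op
        tmp = output_path + ".tmp"
        shutil.rmtree(tmp, ignore_errors=True)   # discard leftovers of failed runs
        os.makedirs(tmp)
        job_fn(input_paths, tmp)                 # deterministic: inputs + code only
        open(os.path.join(tmp, "_SUCCESS"), "w").close()  # seal the instance
        shutil.rmtree(output_path, ignore_errors=True)    # clear unsealed partials
        os.rename(tmp, output_path)              # atomic publish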

Page 9: Data pipelines from zero

Workflow manager

- Luigi, (Airflow, Oozie)
- Dataset "build tool"
- Build when input is available
- Backfill for previous failures
- Rebuild for bugs => eventual correctness
- DSL describes the DAG (sketch below)
- Includes egress (e.g. to a DB)
- Data retention, privacy audit
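A minimal Luigi sketch of the DAG DSL, with hypothetical task and path names: each task declares its inputs via requires() and its dataset instance via output(), and the scheduler builds whatever is missing, which is what makes backfill work.

    import luigi

    class Pageviews(luigi.ExternalTask):
        """Pristine input dataset, produced by event collection."""
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"data/pageviews/v1/day={self.date}/_SUCCESS")

    class ViewsWithDemographics(luigi.Task):
        """Derived, regenerable dataset = function([pageviews], code)."""
        date = luigi.DateParameter()

        def requires(self):
            return Pageviews(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"data/views_demo/v1/day={self.date}/_SUCCESS")

        def run(self):
            # Real enrichment logic goes here; writing output() seals the dataset.
            with self.output().open("w") as seal:
                seal.write("")

Running, say, `luigi --module pipeline ViewsWithDemographics --date 2015-11-04 --local-scheduler` builds any missing upstream instances first, then this one.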

Page 10: Data pipelines from zero

Batch processing MVP

- Start simple, lean, end-to-end, without Hadoop/Spark
- Serial jobs on a pool of machines + a work queue (sketch below)
- Downsample to fit one machine if necessary
- (Local Spark, Scalding, Crunch, Akka reactive streams)
- Get end-to-end workflows into production for trial
- Integration test end-to-end semantics
- Ensure developer productivity: the code/test cycle
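A sketch of the MVP shape with Python's standard library (all names hypothetical): workers pull whole jobs off a shared queue and run them serially, so the "cluster" is just a pool of identical workers and no job is parallelised internally.

    import multiprocessing as mp

    def worker(queue):
        while True:
            job = queue.get()
            if job is None:        # poison pill: shut down
                return
            fn, args = job
            fn(*args)              # one serial job at a time per worker

    def count_views(day):
        print(f"counting views for {day}")

    if __name__ == "__main__":
        queue = mp.Queue()
        pool = [mp.Process(target=worker, args=(queue,)) for _ in range(4)]
        for p in pool:
            p.start()
        for day in ("2015-11-03", "2015-11-04"):
            queue.put((count_views, (day,)))
        for p in pool:
            queue.put(None)        # one pill per worker
        for p in pool:
            p.join()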

Page 11: Data pipelines from zero

Processing at scale

- Parallelise jobs only when forced to do so
- Spark, (Hadoop + Scalding / Crunch); sketch below
- Avoid: vanilla MapReduce, non-JVM frameworks
- Most jobs fit in a single machine: a big complexity + performance win
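A local-mode Spark job, assuming pyspark is installed and reusing the hypothetical pageviews dataset from earlier; scaling out later only changes the master URL. (The sketches here use Python for consistency; the talk's advice favours JVM frameworks, and the Scala equivalent has the same shape.)

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[*]")          # one machine; point at a cluster later
             .appName("views_by_country")
             .getOrCreate())

    views = spark.read.json("data/pageviews/v1/day=2015-11-04/")
    (views.groupBy("country").count()
          .write.json("data/views_by_country/v1/day=2015-11-04"))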

Page 12: Data pipelines from zero

Schemas

- Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
- There is always a schema, implicit or explicit
- Schema on read: dynamic typing, quick schema changes
- Schema on write: static typing possible
- Use schema on read for analytics (sketch below).
- Incompatible change? New dataset class.
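A schema-on-read sketch in plain Python, with hypothetical field names: the bytes on disk are schemaless JSON, and each reader imposes its own expectations, so producers can add fields freely; an incompatible change would go into a new pageviews/v2 class instead.

    import json

    def read_pageview(line: str) -> dict:
        """Schema on read: the reader decides the shape at read time."""
        raw = json.loads(line)
        return {
            "country": raw["country"],          # required by this reader
            "user_id": raw.get("user_id"),      # optional: absent in older events
            "ts": float(raw["collected_at"]),   # types coerced here, not on write
        }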

Page 13: Data pipelines from zero

Egress datasets

- Serving: Cassandra, denormalised (sketch below)
- Export & analytics: SQL, workbenches (Zeppelin), (Elasticsearch, proprietary OLAP)
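A sketch of the serving egress, assuming the DataStax cassandra-driver and a hypothetical views_by_country table: the pipeline writes fully denormalised rows so the serving path is a single-partition lookup.

    from cassandra.cluster import Cluster  # pip install cassandra-driver

    def export(rows):
        """Egress job step: push denormalised results into the serving store."""
        session = Cluster(["cassandra-host"]).connect("analytics")
        insert = session.prepare(
            "INSERT INTO views_by_country (day, country, views) VALUES (?, ?, ?)")
        for row in rows:
            session.execute(insert, (row["day"], row["country"], row["views"]))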

Page 14: Data pipelines from zero

Parting words

- Keep things simple: batch, few components & little state.
- Don't drop incoming data.
- Focus on the developer code, test, debug cycle, end to end.
- Expect, and tolerate, human error.
- Stay in harmony with technical ecosystems: follow the tech leaders.
- Scalability only when necessary.
- Plan early: privacy, retention, audit, schema evolution.

Page 15: Data pipelines from zero

Bonus slides


Page 16: Data pipelines from zero

Cloud or not?

+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in

Page 17: Data pipelines from zero

Data pipelines example

(diagram: raw datasets (users, pageviews, sales) feed derived datasets: views with demographics, sales with demographics, sales reports, and conversion analytics)

Page 18: Data pipelines from zero

Data pipelines team organisation

- Form teams that are driven by business cases & need
- Forward-oriented -> filters implicitly applied
- Beware of: duplication, tech chaos/autonomy, privacy loss

Page 19: Data pipelines from zero

Conway’s law

“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”

Better, then, to organise to match the desired design.