Data pipelines from zero
Lars Albertsson
Data architect @ Schibsted
www.mapflat.com
Who’s talking?
Swedish Institute of Computer Science (test tools)
Sun Microsystems (very large machines)
Google (Hangouts, productivity)
Recorded Future (NLP startup)
Cinnober Financial Tech. (trading systems)
Spotify (data processing & modelling)
Schibsted (data processing & modelling)
Presentation goals
Overview of data pipelines for analytics / data products
Target audience: Big data starters
Overview of necessary components
Base recipe
  In vicinity of state-of-practice
  Baseline for comparing design proposals
Subjective best practices
Technology suggestions, (alternatives)
Data product anatomy
Diagram labels: Service; DB; Ingress; Unified log; Cluster storage; ETL; Pipeline (Dataset, Job); Egress; Export; Business intelligence
Event collection
Unified log
  Immutable events
  Append-only
  Source of truth
Kafka (Kinesis, Google Pub/Sub)
Immediate handoff to append-only replicated log.
Don’t manipulate, shuffle, sort, demux. Add timestamps.
Diagram labels: Service; Unreliable (twice); Reliable, write available; Secor, Camus; Cluster storage: HDFS (NFS, S3, Google CS, C*)
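As a minimal sketch of the "immediate handoff, add timestamps" rule, the snippet below wraps each incoming event in an envelope and appends it to a log without touching the payload. A local JSON-lines file stands in for the replicated log (Kafka in the slide); `append_event`, `read_events`, and the `collected_at` field are hypothetical names chosen for illustration.

```python
import json
import os
import tempfile
import time


def append_event(log_path, raw_event, now=None):
    """Wrap an event in an envelope with a collection timestamp and append it.

    The payload is stored untouched: no filtering, sorting, or demultiplexing.
    """
    envelope = {
        "collected_at": time.time() if now is None else now,
        "payload": raw_event,
    }
    # Mode "a" only ever appends; existing records are never rewritten.
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(envelope, sort_keys=True) + "\n")


def read_events(log_path):
    """Read back all envelopes in the order they were appended."""
    with open(log_path, encoding="utf-8") as log:
        return [json.loads(line) for line in log]


# Demo: a temporary file plays the role of the unified log (an assumption).
demo_log = os.path.join(tempfile.mkdtemp(), "events.jsonl")
append_event(demo_log, {"type": "pageview", "url": "/start"}, now=1446595200.0)
append_event(demo_log, {"type": "pageview", "url": "/news"}, now=1446595201.5)
events = read_events(demo_log)
```

Any cleaning, splitting by event type, or sorting would then happen in downstream jobs that read the log, so the collected events stay available as the source of truth.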
Database state collection
Do: Read snapshots, event conversion tools (Aegisthus, Bottled Water)
Careful: Dump replicated slave
Don’t: Use API, dump live master
Diagram labels: Service; DB; DB backup; Cluster storage: HDFS (NFS, S3, Google CS, C*)
Datasets
hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
  _SUCCESS
  part-00000.json
  part-00001.json
Hadoop + Hive name conventions
Instance = class + parameters, same schema
Immutable
Path annotations: privacy level (red); dataset class (pageviews); schema version (v1); instance parameters, Hive convention (country=se/year=2015/month=11/day=4); seal (_SUCCESS); partitions (part-*.json)
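The naming convention can be sketched as a small path builder. This assumes that, in the example path, `red` is the privacy level, `pageviews` the dataset class, and `v1` the schema version; `dataset_instance_path` and `is_sealed` are hypothetical helpers, not part of Hadoop or Hive.

```python
def dataset_instance_path(privacy_level, dataset_class, schema_version, parameters):
    """Build the directory path for one dataset instance.

    `parameters` is an ordered list of (name, value) pairs, written in Hive
    partition style as name=value path segments.
    """
    segments = [dataset_class, f"v{schema_version}"]
    segments += [f"{name}={value}" for name, value in parameters]
    return f"hdfs://{privacy_level}/" + "/".join(segments) + "/"


def is_sealed(file_names):
    """An instance counts as complete only once the _SUCCESS marker exists."""
    return "_SUCCESS" in file_names


path = dataset_instance_path(
    "red", "pageviews", 1,
    [("country", "se"), ("year", 2015), ("month", 11), ("day", 4)],
)
```

Consumers would check the seal before reading, so a half-written instance is never mistaken for a finished one.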
Pipelines
Dataset “build system”
Input will be missing
Jobs will fail
Jobs will have bugs
Dataset = function([inputs], code)
Deterministic, idempotent
Diagram labels: Unified log; Cluster storage; Pristine, immutable datasets; Intermediate; Derived, regenerable
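To illustrate "Dataset = function([inputs], code)", here is a minimal sketch of a deterministic, idempotent job. The job logic (`count_views_per_url`) and the local directory layout are assumptions made for the example; the `_SUCCESS` seal follows the dataset convention shown earlier.

```python
import json
import os
import tempfile


def count_views_per_url(pageviews):
    """Pure function of its input: the same pageviews always give the same counts."""
    counts = {}
    for view in pageviews:
        counts[view["url"]] = counts.get(view["url"], 0) + 1
    return counts


def build_dataset(output_dir, pageviews):
    """Write the derived dataset; do nothing if it is already sealed."""
    seal = os.path.join(output_dir, "_SUCCESS")
    if os.path.exists(seal):
        return False  # Already built: re-running has no further effect.
    os.makedirs(output_dir, exist_ok=True)
    counts = count_views_per_url(pageviews)
    part = os.path.join(output_dir, "part-00000.json")
    with open(part, "w", encoding="utf-8") as out:
        # Sorted keys make the output bytes independent of input order.
        json.dump(counts, out, sort_keys=True)
    # The seal is written last, so a crashed run is never taken for a finished one.
    open(seal, "w").close()
    return True


views = [{"url": "/a"}, {"url": "/b"}, {"url": "/a"}]
out_dir = os.path.join(tempfile.mkdtemp(), "views_per_url")
first_run = build_dataset(out_dir, views)   # builds the dataset
second_run = build_dataset(out_dir, views)  # no-op, output already sealed
```

Because the output depends only on the inputs and the code, a derived dataset can be deleted and regenerated at any time, for example after a bug fix.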
Workflow manager
Luigi, (Airflow, Oozie)
Dataset “build tool”
Build when input is available
Backfill for previous failures
Rebuild for bugs
=> Eventual correctness
DSL describes DAG
Includes egress
Data retention, privacy audit
Diagram label: DB
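The "build tool" behaviour can be sketched with a toy dependency resolver. This is not the Luigi, Airflow, or Oozie API; `Task` and `build` are hypothetical, and an in-memory set stands in for cluster storage. The dataset names are borrowed from the pipeline example in the bonus slides.

```python
class Task:
    """A toy build step: it is complete when its output dataset exists."""

    def __init__(self, name, requires=(), external=False):
        self.name = name
        self.requires = list(requires)
        self.external = external  # External inputs arrive from outside; we cannot build them.


def build(task, storage, log):
    """Make sure `task`'s dataset exists, building missing upstream datasets first.

    `storage` is the set of dataset names that already exist; `log` collects
    the names built during this call. Returns True if the dataset exists afterwards.
    """
    if task.name in storage:
        return True   # Already built, nothing to do.
    if task.external:
        return False  # Input has not arrived yet; a later run will retry.
    # Evaluate every dependency (no short-circuit) so all buildable inputs get built.
    ready = [build(upstream, storage, log) for upstream in task.requires]
    if not all(ready):
        return False
    storage.add(task.name)
    log.append(task.name)
    return True


pageviews = Task("pageviews", external=True)
users = Task("users", external=True)
views_with_demographics = Task("views_with_demographics", requires=[pageviews, users])
conversion = Task("conversion_analytics", requires=[views_with_demographics])

storage = {"pageviews"}  # The users snapshot is late.
first_log = []
first_ok = build(conversion, storage, first_log)    # Nothing can be built yet.

storage.add("users")     # The missing input arrives.
second_log = []
second_ok = build(conversion, storage, second_log)  # Backfills both derived datasets.

third_log = []
third_ok = build(conversion, storage, third_log)    # Everything exists: no work.
```

Running the same build repeatedly until it succeeds is what gives eventual correctness; a rebuild after a bug fix amounts to removing the affected datasets from storage and running it again.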
Batch processing MVP
Start simple, lean, end-to-end, without Hadoop/Spark
Serial jobs on pool of machines + work queue
Downsample to fit one machine if necessary
(Local Spark, Scalding, Crunch, Akka reactive streams)
Get end-to-end workflows in production for trial
Integration test end-to-end semantics
Ensure developer productivity - code/test cycle
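One possible way to "downsample to fit one machine" is to keep a fixed fraction of users, chosen by hashing the user id. Sampling by user is an assumption of this sketch, not something the slide prescribes; `keep_in_sample` is a hypothetical helper.

```python
import hashlib


def keep_in_sample(user_id, percent):
    """Deterministically keep roughly `percent` % of users.

    Hashing the user id (rather than drawing random numbers) means that a user
    is either always in or always out, so joins between sampled datasets line up.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


events = [{"user": f"user-{i}", "url": "/start"} for i in range(10000)]
sample = [e for e in events if keep_in_sample(e["user"], 10)]
```

Because the selection is a pure function of the user id, re-running the job on the same input yields the same sample, which keeps the downstream datasets deterministic.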
Processing at scale
Parallelise jobs only when forced to do so
Spark, (Hadoop + Scalding / Crunch)
Avoid: Vanilla MapReduce, non-JVM
Most jobs fit in single machine
Big complexity + performance win
Schemas
Storage formats: JSON, Avro, Parquet, Protobuf, Thrift
There is always a schema, implicit or explicit
Schema on read
  Dynamic typing, quick schema changes
Schema on write
  Static typing possible
Use schema on read for analytics.
Incompatible change? New dataset class.
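A minimal sketch of schema on read with JSON records: the stored lines carry no enforced schema, and the reader applies its own expectations when parsing. The field names (`url`, `country`, `referrer`) and the `read_pageview` helper are made up for the example.

```python
import json


def read_pageview(line):
    """Parse one JSON record, applying the reader's expectations at read time."""
    record = json.loads(line)
    return {
        "url": record["url"],                          # Required: fails loudly if absent.
        "country": record.get("country", "unknown"),   # Optional: older records lack it.
    }


lines = [
    '{"url": "/start"}',
    '{"url": "/news", "country": "se", "referrer": "/start"}',
]
pageviews = [read_pageview(line) for line in lines]
```

Producers can add fields such as `referrer` without breaking this reader, which is the "quick schema changes" property. A change the reader cannot tolerate, such as removing `url`, is the kind of incompatible change that would call for a new dataset class.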
Egress datasets
Serving
  Cassandra, denormalised
Export & Analytics
  SQL
  Workbenches (Zeppelin)
  (Elasticsearch, proprietary OLAP)
Parting words
Keep things simple. Batch, few components & little state.
Don’t drop incoming data.
Focus on developer code, test, debug cycle - end to end.
Expect, tolerate human error.
Harmony with technical ecosystems - follow tech leaders.
Scalability only when necessary.
Plan early: Privacy, retention, audit, schema evolution.
Bonus slides
Cloud or not?
+ Operations
+ Security
+ Responsive scaling
- Development workflows
- Privacy
- Vendor lock-in
Data pipelines example
Diagram labels, raw datasets: Users; Pageviews; Sales
Diagram labels, derived datasets: Views with demographics; Sales with demographics; Sales reports; Conversion analytics
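As a sketch of one step in this example, the derived dataset "Views with demographics" could be produced by joining the raw Pageviews and Users datasets. The record fields (`user_id`, `age`, `country`) are assumptions for illustration; the slide only names the datasets.

```python
def views_with_demographics(pageviews, users):
    """Join each pageview with the demographics of the user who made it."""
    users_by_id = {user["user_id"]: user for user in users}
    joined = []
    for view in pageviews:
        # Assumption: views from unknown users are kept, with empty demographics.
        user = users_by_id.get(view["user_id"], {})
        joined.append({
            "url": view["url"],
            "age": user.get("age"),
            "country": user.get("country"),
        })
    return joined


users = [{"user_id": "u1", "age": 34, "country": "se"}]
pageviews = [
    {"user_id": "u1", "url": "/start"},
    {"user_id": "u2", "url": "/news"},
]
derived = views_with_demographics(pageviews, users)
```

The raw inputs are left untouched, so the derived dataset can be regenerated whenever the join logic changes.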
Data pipelines team organisation
Form teams that are driven by business cases & need
Forward-oriented -> filters implicitly applied
Beware of: duplication, tech chaos/autonomy, privacy loss
Conway’s law
“Organizations which design systems ... are constrained to produce designs which are copies of the communication structures of these organizations.”
Better organise to match desired design, then.