www.mapflat.com

Test strategies for data processing pipelines

Lars Albertsson, independent consultant (Mapflat)
Øyvind Løkling, Schibsted Products & Technology

Posted 16 Apr 2017 by lars-albertsson


Who’s talking?
● Swedish Institute of Computer Science (test & debug tools)
● Sun Microsystems (large machine verification)
● Google (Hangouts, productivity)
● Recorded Future (NLP startup) (data integrations)
● Cinnober Financial Tech. (trading systems)
● Spotify (data processing & modelling, productivity)
● Schibsted Products & Tech (data processing & modelling)
● Mapflat (independent data engineering consultant)


Agenda
● Data applications from a test perspective
● Testing stream processing products
● Testing batch processing products
● Data quality testing

Main focus is functional regression testing

Prerequisites: Backend dev testing, basic data experience


Test value

For data-centric applications, in this order:
● Productivity
  ○ Move fast without breaking things
● Fast experimentation
  ○ 10% good ideas, 90% bad
● Data quality
  ○ Challenging, and more important than...
● Technical quality
  ○ Technical failure => ops hassle, stale data

Streaming data product anatomy

[Diagram: services and a DB feed ingress topics on a pub/sub unified log; stream processing jobs form pipelines over topics; egress goes to a DB, subscribing services, export, and business intelligence.]

Test concepts

[Diagram: a test framework (e.g. JUnit, Scalatest), driven from IDEs and build tools, feeds test input through a seam into the system under test (SUT); the SUT together with 3rd party components (e.g. a DB) forms the test fixture, wrapped in the test harness; a test oracle checks the output.]

Test scopes

[Diagram: nested scopes - unit, class, component, integration, system / acceptance.]

● Pick a stable seam
● Small scope
  ○ Fast?
  ○ Easy, simple?
● Large scope
  ○ Real app value?
  ○ Slow, unstable?
● Maintenance, cost
  ○ Pick few SUTs

Data-centric application properties

● Output = function(input, code)
  ○ No external factors => deterministic
● Pipeline and job endpoints are stable
  ○ Correspond to business value
● Internal abstractions are volatile
  ○ Reslicing in different dimensions is common

Data-centric app test properties

● Output = function(input, code)
  ○ Perfect for test!
  ○ Avoid: external service calls, wall clock
● Pipeline/job edges are suitable seams
  ○ Focus on large tests
● Internal seams => high maintenance, low value
  ○ Omit unit tests, mocks, dependency injection!
● Long pipelines cross teams
  ○ Need for end-to-end tests, but a culture challenge

Suitable seams, streaming

[Diagram: the streaming anatomy again - pub/sub topics, jobs, and pipelines - with seams at the pipeline/job edges: ingress topics on one side; egress DB, services, export, and business intelligence on the other.]

Streaming SUT, example harness

[Diagram: 1. IDE / Gradle runs 2, 7. Scalatest; 3. Docker hosts the fixture (Kafka topics, DB); 4. Spark Streaming jobs are the SUT; 5. test input is pushed to Kafka; 6. the test oracle polls the DB. IDE, CI, and debug integration.]

Test lifecycle

1. Start fixture containers
2. Await fixture ready
3. Allocate test case resources
4. Start jobs
5. Push input data to Kafka
6. While (!done && !timeout) { pollDatabase(); sleep(1ms) }
7. While (moreTests) { Goto 3 }
8. Tear down fixture

For absence tests, send dummy sync messages at the end.
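Step 6, the polling oracle, can be sketched as follows. This is a minimal illustration, not the talk's actual harness code; poll_database and predicate are hypothetical placeholders for harness-specific functions.

```python
import time

def await_output(poll_database, predicate, timeout_s=30.0, interval_s=0.001):
    """Poll the egress database until the oracle predicate holds or we time out.

    poll_database() returns the current egress records; predicate(records)
    decides whether the expected output has arrived. Both are placeholders
    for harness-specific code.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        records = poll_database()
        if predicate(records):
            return records
        time.sleep(interval_s)
    raise AssertionError("Timed out waiting for expected output")
```

A short sleep between polls keeps latency low while the streaming job catches up; the timeout turns a hung pipeline into a clear test failure rather than a stuck build.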

Input generation

● Input & output is denormalised & wide
● Fields are frequently changed
  ○ Additions are compatible
  ○ Modifications are incompatible => new, similar data type
● Static test input, e.g. JSON files
  ○ Unmaintainable
● Input generation routines
  ○ Robust to changes, reusable
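An input generation routine can look like this sketch. The page-view record shape is hypothetical and purely illustrative; the point is that tests override only the fields they care about, so schema additions do not break them.

```python
import uuid

def make_page_view(user_id=None, country="se", url="http://example.com/", **overrides):
    """Generate one denormalised test record with sensible defaults.

    Callers override only the fields their test cares about; new fields
    in the real schema only require a change here, not in every test.
    """
    record = {
        "event_id": str(uuid.uuid4()),
        "user_id": user_id or str(uuid.uuid4()),
        "country": country,
        "url": url,
    }
    record.update(overrides)
    return record

def make_page_views(n, **overrides):
    """Generate n records sharing the given overrides."""
    return [make_page_view(**overrides) for _ in range(n)]
```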

Test oracles

● Compare with expected output
● Check only the fields relevant for the test
  ○ Robust to field changes
  ○ Reusable for new, similar types
● Tip: use lenses
  ○ JSON: JsonPath (Java), Play JSON (Scala)
  ○ Case classes: Monocle
● Express invariants for each data type
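A field-subset oracle captures the lens idea in plain Python; JsonPath and Monocle play this role in Java and Scala. This is an illustrative sketch, not library code.

```python
def assert_fields(actual, expected_subset):
    """Oracle that checks only the fields relevant to the test.

    Comparing a subset, rather than exact record equality, keeps the
    oracle robust to unrelated field additions and reusable for new,
    similar record types.
    """
    for key, want in expected_subset.items():
        assert key in actual, f"missing field: {key}"
        assert actual[key] == want, f"{key}: {actual[key]!r} != {want!r}"
```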

Batch data product anatomy

[Diagram: ingress imports from services, DBs, and a unified log into cluster storage (the data lake); ETL datasets, jobs, and pipelines refine the data; egress exports to a DB, services, and business intelligence.]

Datasets

● Pipeline equivalent of objects
● Dataset class == homogeneous records, open-ended
  ○ Compatible schema
  ○ E.g. MobileAdImpressions
● Dataset instance = dataset class + parameters
  ○ Immutable
  ○ Finite set of homogeneous records
  ○ E.g. MobileAdImpressions(hour="2016-02-06T13")

Directory datasets

hdfs://red/pageviews/v1/country=se/year=2015/month=11/day=4/
    _SUCCESS
    part-00000.json
    part-00001.json

● The path encodes privacy level (red), dataset class (pageviews), schema version (v1), and instance parameters in Hive convention (country=..., year=..., ...)
● The _SUCCESS file seals the dataset; the part files hold the partitions
● Some tools, e.g. Spark, understand Hive name conventions
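The path convention above can be sketched as a small builder; this is an illustrative helper, not part of any tool's API.

```python
def dataset_path(root, dataset, version, **partitions):
    """Build a Hive-convention directory path for a dataset instance.

    Partition parameters are rendered as key=value path segments in the
    order given, matching the layout tools like Spark understand.
    """
    segments = [f"{k}={v}" for k, v in partitions.items()]
    return "/".join([root, dataset, f"v{version}"] + segments) + "/"
```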

Batch processing

● outDatasets = code(inDatasets)
● Components that scale up
  ○ Spark, (Flink, Scalding, Crunch)
● And scale down
  ○ Local mode
  ○ Most jobs fit in one machine

Gradual refinement
1. Wash: time shuffle, dedup, ...
2. Decorate: geo, demographic, ...
3. Domain model: similarity, clusters, ...
4. Application model: recommendations, ...

Workflow manager

● Dataset "build tool"
● Run a job instance when
  ○ input is available
  ○ output is missing
● Backfills for previous failures
● DSL describes dependencies
● Includes ingress & egress
● Suggested: Luigi / Airflow

DSL DAG example (Luigi)

class ClientActions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [Actions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 24)] + \
               [UserDB(date=self.hour.date)]
    ...

class ClientSessions(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientActions(hour=self.hour - timedelta(hours=h))
                for h in range(0, 3)]
    ...

class SessionsABResults(SparkSubmitTask):
    hour = DateHourParameter()
    def requires(self):
        return [ClientSessions(hour=self.hour),
                ABExperiments(hour=self.hour)]
    def output(self):
        return HdfsTarget("hdfs://production/red/ab_sessions/v1/" +
                          "{:year=%Y/month=%m/day=%d/hour=%H}".format(self.hour))
    ...

[Diagram: dataset instances and job (aka Task) classes - Actions (hourly) and UserDB feed "time shuffle, user decorate" into ClientActions (daily); "form sessions" yields ClientSessions; "A/B compare" against A/B experiments yields the A/B session evaluation, SessionsABResults.]

● Expressive, embedded DSL - a must for ingress, egress
  ○ Avoid weak DSL tools: Oozie, AWS Data Pipeline

Batch processing testing

● Omit collection in the SUT
  ○ Technically different
● Avoid clusters
  ○ Slow tests
● Seams
  ○ Between jobs
  ○ At the end of pipelines

[Diagram: the SUT spans the data lake and the job pipeline, ending at the artifact of business value, e.g. a service index.]

Batch test frameworks - don't

● Spark, Scalding, Crunch variants
● Seam == internal data structure
  ○ Omits I/O - a common bug source
● Vendor lock-in
  ○ When switching batch framework, you need tests for protection
  ○ A test rewrite is an unnecessary burden

Testing single job

Standard Scalatest harness:
1. Generate input at file://test_input/
2. Run the job f() in local mode
3. Verify output at file://test_output/ with a predicate p()

Don't commit test files - expensive to maintain. Generate / verify with code.
Runs well in CI / from the IDE.
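The generate / run / verify flow can be sketched like this. It is a minimal illustration: job_fn stands in for a batch job run in local mode, and oracle for the output predicate; both are hypothetical placeholders.

```python
import json
import os
import shutil
import tempfile

def run_job_test(job_fn, input_records, oracle):
    """Generate input with code, run the job in local mode, verify output.

    job_fn(input_dir, output_dir) is a placeholder for running the real
    job (e.g. Spark local mode); oracle(records) checks the output.
    """
    workdir = tempfile.mkdtemp()
    input_dir = os.path.join(workdir, "test_input")
    output_dir = os.path.join(workdir, "test_output")
    os.makedirs(input_dir)
    try:
        # 1. Generate input with code, not committed files
        with open(os.path.join(input_dir, "part-00000.json"), "w") as f:
            for record in input_records:
                f.write(json.dumps(record) + "\n")
        # 2. Run the job in local mode
        job_fn(input_dir, output_dir)
        # 3. Verify output with the oracle
        with open(os.path.join(output_dir, "part-00000.json")) as f:
            records = [json.loads(line) for line in f]
        oracle(records)
    finally:
        shutil.rmtree(workdir)
```

Because everything lives in a temporary directory and the job runs locally, the test needs no cluster and runs the same way in CI and from the IDE.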

Testing pipelines - two options

A: Standard Scalatest harness with a custom multi-job test
1. Generate input at file://test_input/
2. Run a test job with the sequence of jobs
3. Verify output with a predicate p()
+ Runs in CI
+ Runs in the IDE
+ Quick setup
- Multi-job maintenance

B: Customised workflow manager setup
+ Tests workflow logic
+ More authentic
- Workflow manager setup for testability
- Difficult to debug
- Dataset handling with Python

● Both can be extended with egress DBs

Testing with cloud services

● PaaS components do not work locally
  ○ Cloud providers should provide fake implementations
  ○ Exceptions: Kubernetes, Cloud SQL, (S3)
● Integrate the PaaS service as a fixture component
  ○ Distribute access tokens, etc.
  ○ Pay $ or $$$

Quality testing variants

● Functional regression
  ○ Binary, key to productivity
● Golden set
  ○ Extreme inputs => obvious output
  ○ No regressions tolerated
● (Saved) production data input
  ○ Individual regressions ok
  ○ Weighted sum must not decline
  ○ Beware of privacy
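The weighted-sum rule can be sketched as follows. Metric names and weights are illustrative assumptions, not from the talk.

```python
def quality_declined(baseline, current, weights):
    """Check whether a weighted sum of quality metrics has declined.

    With saved production data, individual metrics may regress as long
    as the weighted sum does not drop below the baseline.
    """
    def score(metrics):
        return sum(weights[name] * metrics[name] for name in weights)
    return score(current) < score(baseline)
```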

Obtaining quality metrics

● Processing tool (Spark/Hadoop) counters
  ○ Odd code path => bump counter
● Dedicated quality assessment pipelines
  ○ Reuse test oracle invariants in production

[Diagram: a quality assessment job reads pipeline output and pushes metrics to a DB.]

Quality testing in the process

● Binary, self-contained checks (code ∆)
  ○ Validate in CI
● Relative vs history (data ∆)
  ○ E.g. large drops
  ○ Precondition for publishing a dataset
● Push aggregates to a DB
  ○ Standard ops: monitor, alert

Computer program anatomy

[Diagram: a process reads input data (files, HID, devices) into RAM - variables, functions, lookup structures, execution paths - and writes output data (files, windows).]

Data pipeline = yet another program

Don't veer from best practices:
● Regression testing
● Design: separation of concerns, modularity, etc.
● Process: CI/CD, code review, static analysis tools
● Avoid anti-patterns: global state, hard-coded locations, duplication, ...

In data engineering, slipping is in the culture... :-(
● Mix in solid backend engineers, document the "golden path"

Top anti-patterns

1. Test as an afterthought, or in production
   Data processing applications are well suited for test!
2. Static test input in version control
3. Exact expected output as test oracle
4. Unit testing volatile interfaces
5. Using mocks & dependency injection
6. Tool-specific test frameworks
7. Calling services / databases from (batch) jobs
8. Using wall clock time
9. Embedded fixture components

Further resources. Questions?

http://www.slideshare.net/lallea/data-pipelines-from-zero-to-solid
http://www.mapflat.com/lands/resources/reading-list
http://www.slideshare.net/mathieu-bastian/the-mechanics-of-testing-large-data-pipelines-qcon-london-2016
http://www.slideshare.net/hkarau/effective-testing-for-spark-programs-strata-ny-2015
https://spark-summit.org/2014/wp-content/uploads/2014/06/Testing-Spark-Best-Practices-Anupama-Shetty-Neil-Marshall.pdf

Bonus slides

Unified log

● Replicated append-only log
● Pub / sub with history
● Decoupled producers and consumers
  ○ In source/deployment
  ○ In space
  ○ In time
● Recovers from link failures
● Replay on transformation bug fix

Credits: Confluent

Stream pipelines

● Parallelised jobs
● Read / write to Kafka
● View egress
  ○ Serving index
  ○ SQL / cubes for analytics
● Stream egress
  ○ Services subscribe to a topic
  ○ E.g. REST post for export

Credits: Apache Samza

The data lake

Unified log + snapshots:
● Immutable datasets
● Raw, unprocessed
● Source of truth from a batch processing perspective
● Kept as long as permitted
● Technically homogeneous
  ○ Except for raw imports

[Diagram: cluster storage holding the data lake.]

Batch pipelines

● Things will break
  ○ Input will be missing
  ○ Jobs will fail
  ○ Jobs will have bugs
● Datasets must be rebuilt
● Determinism, idempotency
● Backfill missing / failed
● Eventual correctness

[Diagram: cluster storage data lake with pristine, immutable datasets; intermediate datasets; and derived, regenerable datasets.]

Batch job

Job == function([input datasets]): [output datasets]
● No orthogonal concerns
  ○ Invocation
  ○ Scheduling
  ○ Input / output location
● Testable
● No other input factors, no side-effects
● Ideally: atomic, deterministic, idempotent
● Necessary for audit
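The "job == function" discipline can be sketched like this. The dedup logic is an illustrative stand-in for real job code; the point is that output depends only on the inputs, so reruns and backfills are deterministic.

```python
def dedup_job(input_datasets):
    """A batch job as a pure function from input datasets to output datasets.

    No wall clock, no service calls, no hard-coded locations: the same
    inputs always produce the same outputs, which makes the job testable,
    idempotent, and safe to backfill.
    """
    (records,) = input_datasets
    seen, output = set(), []
    for record in records:
        key = record["event_id"]
        if key not in seen:
            seen.add(key)
            output.append(record)
    return [output]
```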

Data pipelines

● Form teams that are driven by business cases & need
● Forward-oriented -> filters implicitly applied
● Beware of: duplication, tech chaos/autonomy, privacy loss

Data platform, pipeline chains

● Common data infrastructure
● Productivity, privacy, end-to-end agility, complexity
● Beware: producer-consumer disconnect

Ingress / egress representation

Larger variation:
● Single file
● Relational database table
● Cassandra column family, other NoSQL
● BI tool storage
● BigQuery, Redshift, ...

Egress datasets are also atomic and immutable. E.g. write a full DB table / column family, switch the service to use it, never change it.

Egress datasets

● Serving
  ○ Precomputed user query answers
  ○ Denormalised
  ○ Cassandra, (many)
● Export & analytics
  ○ SQL (single node / Hive, Presto, ...)
  ○ Workbenches (Zeppelin)
  ○ (Elasticsearch, proprietary OLAP)
● BI / analytics tool needs change frequently
  ○ Prepare to redirect pipelines

Deployment

[Diagram: an Hg/git repo holds the Luigi DSL, jars, and config, packaged as my-pipe-7.tar.gz and stored on HDFS; the Luigi daemon and each worker install it with:

    > pip install my-pipe-7.tar.gz

All that a pipeline needs, installed atomically.]

Redundant cron schedule, higher frequency + backfill (Luigi range tools):

    * 10 * * * bin/my_pipe_daily \
        --backfill 14