Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn, Apr 2017
TRANSCRIPT
The Data Driven Network
Kapil Surlaker, Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das
Gobblin team, 26th Apr 2017
Big Data Meetup
github.com/linkedin/gobblin @ApacheGobblin gitter.im/gobblin
Data Integration: Key Requirements
- Source, sink diversity
- Batch + streaming
- Data quality
So, we built Gobblin.
[Diagram: source and sink connectors, including SFTP, JDBC, REST]
Simplifying Data Integration
- Hundreds of TB per day
- Thousands of datasets
- ~30 different source systems
- 80%+ of data ingest at LinkedIn
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation underway
Other Open Source Systems in this Space
Sqoop, Flume, Falcon, Nifi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate:
- Most are tied to a specific execution model (batch / stream)
- Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)
Gobblin: Under the Hood
Gobblin: The Logical Pipeline
WorkUnit: A logical unit of work, typically bounded, but not necessarily.
Examples:
- Kafka topic: LoginEvent, partition: 10, offsets: 10-200
- HDFS folder: /data/Login, file: part-0.avro
- Hive dataset: Tracking.Login, date-partition=mm-dd-yy-hh
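As a concrete illustration, here is how the Kafka example above might be described as a WorkUnit. This is a minimal sketch, assuming Gobblin's gobblin.source.workunit.WorkUnit API (createEmpty() plus the setProp methods inherited from State); the property keys are hypothetical, not the ones the real Kafka source uses.

import gobblin.source.workunit.WorkUnit;

public class WorkUnitExample {
  // Illustrative only: hypothetical property keys describing
  // "topic LoginEvent, partition 10, offsets 10-200" as a WorkUnit.
  public static WorkUnit loginEventPartition10() {
    WorkUnit workUnit = WorkUnit.createEmpty();
    workUnit.setProp("kafka.topic", "LoginEvent");
    workUnit.setProp("kafka.partition", 10);
    workUnit.setProp("kafka.low.watermark", 10L);   // first offset to pull
    workUnit.setProp("kafka.high.watermark", 200L); // last offset; a streaming WorkUnit would leave this unbounded
    return workUnit;
  }
}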
Source: A provider of WorkUnits
(typically a system like Kafka, HDFS etc.)
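A sketch of what implementing a source might look like, assuming Gobblin's gobblin.source.Source interface (getWorkunits / getExtractor / shutdown); the one-WorkUnit-per-partition split and the property keys are illustrative.

import java.util.ArrayList;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

// Hypothetical source that creates one WorkUnit per partition of a system.
public class MySource implements Source<String, byte[]> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    List<WorkUnit> workUnits = new ArrayList<>();
    int partitions = state.getPropAsInt("my.source.num.partitions", 1);
    for (int i = 0; i < partitions; i++) {
      WorkUnit workUnit = WorkUnit.createEmpty();
      workUnit.setProp("my.source.partition", i);
      workUnits.add(workUnit);
    }
    return workUnits;
  }

  @Override
  public Extractor<String, byte[]> getExtractor(WorkUnitState state) {
    return new MyExtractor(state); // see the Extractor sketch in the next section
  }

  @Override
  public void shutdown(SourceState state) {
    // release any connections held while creating WorkUnits
  }
}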
Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes them to the destination
Ends when the WorkUnit is exhausted of records
(assigned to a thread in a thread pool, a mapper in MapReduce, etc.)
Extractor: A provider of records, given a WorkUnit
Connects to the data source
Deserializes records
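Continuing the sketch above, an extractor for one WorkUnit might look like this, assuming Gobblin's gobblin.source.extractor.Extractor interface (getSchema / readRecord / getExpectedRecordCount / getHighWatermark); the body is a placeholder, not a real connector.

import java.io.IOException;

import gobblin.configuration.WorkUnitState;
import gobblin.source.extractor.DataRecordException;
import gobblin.source.extractor.Extractor;

// Hypothetical extractor: connects to the source for one WorkUnit and
// deserializes records until the WorkUnit is exhausted.
public class MyExtractor implements Extractor<String, byte[]> {

  public MyExtractor(WorkUnitState state) {
    // open a connection to the partition named in the WorkUnit
  }

  @Override
  public String getSchema() throws IOException {
    return "\"bytes\""; // schema of the records being extracted
  }

  @Override
  public byte[] readRecord(byte[] reuse) throws DataRecordException, IOException {
    // return the next deserialized record, or null when the WorkUnit is exhausted
    return null;
  }

  @Override
  public long getExpectedRecordCount() { return 0; }

  @Override
  public long getHighWatermark() { return 0; }

  @Override
  public void close() throws IOException {
    // close the connection to the source
  }
}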
Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, schema projection, encryption)
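A minimal converter sketch, assuming Gobblin's gobblin.converter.Converter base class (convertSchema / convertRecord); the upper-casing transform is illustrative. Returning an Iterable from convertRecord is what makes converters 1:N rather than 1:1.

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;
import gobblin.converter.DataConversionException;
import gobblin.converter.SchemaConversionException;
import gobblin.converter.SingleRecordIterable;

// Hypothetical 1:1 converter that upper-cases string records.
public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return inputSchema; // schema is unchanged
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    // one record in, one record out; a 1:N converter would return more (or fewer)
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}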
Quality Checker: Checks whether the quality of the output is satisfactory
- Row-level (e.g. time value check)
- Task-level (e.g. audit check, schema compatibility)
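A sketch of a row-level check, assuming Gobblin's gobblin.qualitychecker.row.RowLevelPolicy base class with an executePolicy method returning a PASSED/FAILED Result; the timestamp check itself is a placeholder.

import java.io.IOException;

import gobblin.configuration.State;
import gobblin.qualitychecker.row.RowLevelPolicy;

// Hypothetical row-level policy: fail records without a usable time value.
// How a FAILED result is handled depends on the configured Type
// (e.g. fail the task, or divert the record to an error file).
public class TimeValuePolicy extends RowLevelPolicy {

  public TimeValuePolicy(State state, RowLevelPolicy.Type type) {
    super(state, type);
  }

  @Override
  public Result executePolicy(Object record) {
    boolean hasTimestamp = record.toString().contains("time"); // placeholder check
    return hasTimestamp ? Result.PASSED : Result.FAILED;
  }

  @Override
  public void close() throws IOException {}
}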
Writer: Writes to the destination
Manages the connection to the destination and serializes records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter
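A toy synchronous writer sketch, assuming Gobblin's gobblin.writer.DataWriter interface (write / commit / cleanup / recordsWritten / bytesWritten); a real writer would hold a connection to the destination.

import java.io.IOException;

import gobblin.writer.DataWriter;

// Hypothetical writer that "serializes" records to stdout and counts them.
public class ConsoleWriter implements DataWriter<String> {
  private long records = 0;
  private long bytes = 0;

  @Override
  public void write(String record) throws IOException {
    System.out.println(record); // serialize the record to the destination
    records++;
    bytes += record.length();
  }

  @Override
  public void commit() throws IOException {
    // make staged output visible; no-op for stdout
  }

  @Override
  public void cleanup() throws IOException {
    // remove any staging state; no-op for stdout
  }

  @Override
  public long recordsWritten() { return records; }

  @Override
  public long bytesWritten() { return bytes; }

  @Override
  public void close() throws IOException {}
}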
Publisher: Finalizes / commits the data
Used for destinations that support atomicity
(e.g. move a tmp staging directory to the final output directory on HDFS)
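A sketch of the atomic-commit pattern, assuming Gobblin's gobblin.publisher.DataPublisher base class (initialize / publishData / publishMetadata); the staging-to-final move is described in comments rather than implemented.

import java.io.IOException;
import java.util.Collection;

import gobblin.configuration.State;
import gobblin.configuration.WorkUnitState;
import gobblin.publisher.DataPublisher;

// Hypothetical publisher: tasks write to a staging location, and the
// publisher moves it to the final location only after all tasks succeed.
public class StagingDirPublisher extends DataPublisher {

  public StagingDirPublisher(State state) {
    super(state);
  }

  @Override
  public void initialize() throws IOException {}

  @Override
  public void publishData(Collection<? extends WorkUnitState> states) throws IOException {
    // rename the tmp staging directory to the final output directory here,
    // then mark each task COMMITTED so its watermark is saved
    for (WorkUnitState state : states) {
      state.setWorkingState(WorkUnitState.WorkingState.COMMITTED);
    }
  }

  @Override
  public void publishMetadata(Collection<? extends WorkUnitState> states) throws IOException {}

  @Override
  public void close() throws IOException {}
}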
Gobblin: The Logical Pipeline, Stateful
[Diagram: at the start of a run the pipeline loads configuration and previous watermarks from a State Store (HDFS, S3, MySQL, ZK, …); at the end it saves the new watermarks.]
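Wiring up the state store takes a couple of lines of configuration. A sketch for the HDFS case, using the state.store.fs.uri and state.store.dir keys; the URI and path here are illustrative.

# Keep job state and watermarks in HDFS (illustrative URI and path)
state.store.fs.uri=hdfs://namenode:8020
state.store.dir=/gobblin/state-store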
Gobblin: Pipeline Specification

# Pipeline name, description
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

# Source + configuration
source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5
wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace":"example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

# Converter
converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia

# Writer + configuration
writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

# Publisher
data.publisher.type=gobblin.publisher.BaseDataPublisher
Gobblin: Pipeline Deployment
One spec, multiple environments: the same pipeline specification runs on bare metal, AWS, Azure, or a VM
- Small: one box (Standalone: single instance)
- Medium: static cluster (Standalone Cluster, Hadoop YARN / MR)
- Large: elastic cluster (AWS EC2)
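In practice the spec stays the same and only the launcher changes per environment; a sketch using the launcher.type key (LOCAL for a single instance, MAPREDUCE for a Hadoop cluster):

# Same spec; choose the execution environment at launch time
launcher.type=LOCAL
#launcher.type=MAPREDUCE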
Execution Model: Batch versus Streaming
Batch: determine work, acquire slots, run, checkpoint, repeat
+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup and checkpoint costs dominate if "micro-batching"
Execution Model: Batch versus Streaming
Streaming: determine work streams, run continuously, checkpoint periodically
+ Low latency
- Higher cost, because it is harder to provision capacity accurately
- More sophistication needed to deal with change
Execution Model Scorecard
[Table: typical pipelines (JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis), each scored against the Batch and Streaming execution models.]
Can we run in both models using the same system?
Gobblin: The Logical Pipeline
Pipeline Stages: Start
Batch: determine work
Streaming: determine work (unbounded WorkUnits)
Pipeline Stages: Run
Batch: acquire slots, run
Streaming: run continuously, checkpoint periodically, shut down gracefully
[Diagram: tasks notify a Watermark Manager as records are acked; the manager persists watermarks to State Storage and commits outstanding watermarks on shutdown.]
Pipeline Stages: End
Batch: checkpoint, commit
Streaming: do nothing (NoOpPublisher)
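To make the streaming run stage concrete, here is a schematic of the loop described above: extract continuously, write, and let a background thread commit watermarks periodically. This is plain-JDK illustration code, not Gobblin's actual runtime.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Schematic streaming task loop, not Gobblin's actual runtime code.
public class StreamingTaskLoop {
  private final AtomicLong lastAckedWatermark = new AtomicLong(-1);
  private volatile boolean shutdownRequested = false;

  void run() {
    // commit watermarks periodically (cf. streaming.watermark.commitIntervalMillis)
    ScheduledExecutorService committer = Executors.newSingleThreadScheduledExecutor();
    committer.scheduleAtFixedRate(this::commitWatermark, 2000, 2000, TimeUnit.MILLISECONDS);

    while (!shutdownRequested) {
      long offset = extractAndWriteNextRecord(); // extractor -> converters -> writer
      lastAckedWatermark.set(offset);            // record progress once the write is acked
    }

    // graceful shutdown: stop the committer and commit the final watermark
    committer.shutdown();
    commitWatermark();
  }

  void commitWatermark() {
    long watermark = lastAckedWatermark.get();
    if (watermark >= 0) {
      // persist the watermark to the state store (e.g. ZooKeeper)
    }
  }

  long extractAndWriteNextRecord() { return 0; } // placeholder
}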
Enabling Streaming Mode
task.executionMode = streaming
Same environments as batch: Standalone (single instance), Standalone Cluster, AWS, Hadoop (YARN / MR)
A Streaming Pipeline Spec: Kafka 2 Kafka

# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

# Pipeline name, description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Converter, configuration: sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher

# Execution mode, watermark storage configuration
task.executionMode=STREAMING

# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000
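The sampling step in this spec is the converter 1:N contract put to work: returning an empty Iterable drops a record. A sketch of how such a converter might look (illustrative, not the actual SamplingConverter source; it builds on the converter sketch earlier):

import java.util.Random;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;
import gobblin.converter.DataConversionException;
import gobblin.converter.EmptyIterable;
import gobblin.converter.SchemaConversionException;
import gobblin.converter.SingleRecordIterable;

// Illustrative pass-through sampler: emits each record with probability
// converter.sample.ratio, drops it otherwise (an empty Iterable = 0 records).
public class MySamplingConverter extends Converter<String, String, byte[], byte[]> {
  private final Random random = new Random();

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return inputSchema;
  }

  @Override
  public Iterable<byte[]> convertRecord(String outputSchema, byte[] inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    double ratio = workUnit.getPropAsDouble("converter.sample.ratio", 1.0);
    return this.random.nextDouble() < ratio
        ? new SingleRecordIterable<>(inputRecord)
        : new EmptyIterable<byte[]>();
  }
}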
Gobblin Streaming: Cluster View
Cluster of processes
Apache Helix: work-unit assignment, fault tolerance, reassignment
[Diagram: a Cluster Master coordinates Workers 1-3 through Helix; the workers consume from the stream source and write to the sink (Kafka, HDFS, …).]
Active Workstreams in Gobblin
- Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications, which compile down to physical pipeline specs
- Global Throttling: throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode); generic, so it can be used outside Gobblin
- Metadata-driven: integration with a Metadata Service (cf. WhereHows); policy-driven replication, permissions, encryption, etc.
Roadmap
- Final LinkedIn Gobblin 0.10.0 release
- Apache Incubator code donation and release
- More streaming runtimes: integration with Apache Samza and LinkedIn Brooklin
- GDPR compliance: data purge for Hadoop and other systems
- Security improvements: credential storage, secure specs
Gobblin Team @ LinkedIn