Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetup @ LinkedIn, Apr 2017
TRANSCRIPT
The Data Driven Network
Kapil Surlaker, Director of Engineering
Bridging Batch and Streaming Data Integration with Gobblin
Shirshanka Das
Gobblin team, 26th Apr 2017
Big Data Meetup
github.com/linkedin/gobblin @ApacheGobblin gitter.im/gobblin
Data Integration: Key Requirements
- Source, sink diversity
- Batch + streaming
- Data quality
So, we built Gobblin.
[Diagram: source and sink connectors, including SFTP, JDBC, REST]
Simplifying Data Integration
- Hundreds of TB per day
- Thousands of datasets
- ~30 different source systems
- 80%+ of data ingest at LinkedIn
Open source @ github.com/linkedin/gobblin
Adopted by LinkedIn, Intel, Swisscom, Prezi, PayPal, CERN, NerdWallet and many more…
Apache incubation underway
Other Open Source Systems in this Space
Sqoop, Flume, Falcon, Nifi, Kafka Connect
Flink, Spark, Samza, Apex
Similar in pieces, dissimilar in aggregate:
- Most are tied to a specific execution model (batch / stream)
- Most are tied to a specific implementation or ecosystem (Kafka, Hadoop, etc.)
Gobblin: Under the Hood
Gobblin: The Logical Pipeline
WorkUnit: A logical unit of work, typically bounded, but not necessarily.
Examples:
- Kafka topic: LoginEvent, partition: 10, offsets: 10-200
- HDFS folder: /data/Login, file: part-0.avro
- Hive dataset: Tracking.Login, date-partition=mm-dd-yy-hh
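As a concrete illustration, here is how the Kafka example above might be described as a WorkUnit. This is a minimal sketch, assuming Gobblin's gobblin.source.workunit.WorkUnit API (createEmpty() plus the setProp methods inherited from State); the property keys are hypothetical, not the ones the real Kafka source uses.

import gobblin.source.workunit.WorkUnit;

public class WorkUnitExample {
  // Illustrative only: hypothetical property keys describing
  // "topic LoginEvent, partition 10, offsets 10-200" as a WorkUnit.
  public static WorkUnit loginEventPartition10() {
    WorkUnit workUnit = WorkUnit.createEmpty();
    workUnit.setProp("kafka.topic", "LoginEvent");
    workUnit.setProp("kafka.partition", 10);
    workUnit.setProp("kafka.low.watermark", 10L);   // first offset to pull
    workUnit.setProp("kafka.high.watermark", 200L); // last offset; a streaming WorkUnit would leave this unbounded
    return workUnit;
  }
}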
Source: A provider of WorkUnits
(typically a system like Kafka, HDFS etc.)
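A sketch of what implementing a source might look like, assuming Gobblin's gobblin.source.Source interface (getWorkunits / getExtractor / shutdown); the one-WorkUnit-per-partition split and the property keys are illustrative.

import java.util.ArrayList;
import java.util.List;

import gobblin.configuration.SourceState;
import gobblin.configuration.WorkUnitState;
import gobblin.source.Source;
import gobblin.source.extractor.Extractor;
import gobblin.source.workunit.WorkUnit;

// Hypothetical source that creates one WorkUnit per partition of a system.
public class MySource implements Source<String, byte[]> {

  @Override
  public List<WorkUnit> getWorkunits(SourceState state) {
    List<WorkUnit> workUnits = new ArrayList<>();
    int partitions = state.getPropAsInt("my.source.num.partitions", 1);
    for (int i = 0; i < partitions; i++) {
      WorkUnit workUnit = WorkUnit.createEmpty();
      workUnit.setProp("my.source.partition", i);
      workUnits.add(workUnit);
    }
    return workUnits;
  }

  @Override
  public Extractor<String, byte[]> getExtractor(WorkUnitState state) {
    return new MyExtractor(state); // see the Extractor sketch in the next section
  }

  @Override
  public void shutdown(SourceState state) {
    // release any connections held while creating WorkUnits
  }
}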
Task: A unit of execution that operates on a WorkUnit
Extracts records from the source, writes them to the destination
Ends when the WorkUnit is exhausted of records
(assigned to a thread in a thread pool, a mapper in MapReduce, etc.)
Extractor: A provider of records, given a WorkUnit
Connects to the data source
Deserializes records
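Continuing the sketch above, an extractor for one WorkUnit might look like this, assuming Gobblin's gobblin.source.extractor.Extractor interface (getSchema / readRecord / getExpectedRecordCount / getHighWatermark); the body is a placeholder, not a real connector.

import java.io.IOException;

import gobblin.configuration.WorkUnitState;
import gobblin.source.extractor.DataRecordException;
import gobblin.source.extractor.Extractor;

// Hypothetical extractor: connects to the source for one WorkUnit and
// deserializes records until the WorkUnit is exhausted.
public class MyExtractor implements Extractor<String, byte[]> {

  public MyExtractor(WorkUnitState state) {
    // open a connection to the partition named in the WorkUnit
  }

  @Override
  public String getSchema() throws IOException {
    return "\"bytes\""; // schema of the records being extracted
  }

  @Override
  public byte[] readRecord(byte[] reuse) throws DataRecordException, IOException {
    // return the next deserialized record, or null when the WorkUnit is exhausted
    return null;
  }

  @Override
  public long getExpectedRecordCount() { return 0; }

  @Override
  public long getHighWatermark() { return 0; }

  @Override
  public void close() throws IOException {
    // close the connection to the source
  }
}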
Converter: A 1:N mapper of input records to output records
Multiple converters can be chained
(e.g. Avro <-> JSON, schema projection, encryption)
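A minimal converter sketch, assuming Gobblin's gobblin.converter.Converter base class (convertSchema / convertRecord); the upper-casing transform is illustrative. Returning an Iterable from convertRecord is what makes converters 1:N rather than 1:1.

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;
import gobblin.converter.DataConversionException;
import gobblin.converter.SchemaConversionException;
import gobblin.converter.SingleRecordIterable;

// Hypothetical 1:1 converter that upper-cases string records.
public class UpperCaseConverter extends Converter<String, String, String, String> {

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return inputSchema; // schema is unchanged
  }

  @Override
  public Iterable<String> convertRecord(String outputSchema, String inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    // one record in, one record out; a 1:N converter would return more (or fewer)
    return new SingleRecordIterable<>(inputRecord.toUpperCase());
  }
}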
Quality Checker: Checks whether the quality of the output is satisfactory
- Row-level (e.g. time value check)
- Task-level (e.g. audit check, schema compatibility)
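A sketch of a row-level check, assuming Gobblin's gobblin.qualitychecker.row.RowLevelPolicy base class with an executePolicy method returning a PASSED/FAILED Result; the timestamp check itself is a placeholder.

import java.io.IOException;

import gobblin.configuration.State;
import gobblin.qualitychecker.row.RowLevelPolicy;

// Hypothetical row-level policy: fail records without a usable time value.
// How a FAILED result is handled depends on the configured Type
// (e.g. fail the task, or divert the record to an error file).
public class TimeValuePolicy extends RowLevelPolicy {

  public TimeValuePolicy(State state, RowLevelPolicy.Type type) {
    super(state, type);
  }

  @Override
  public Result executePolicy(Object record) {
    boolean hasTimestamp = record.toString().contains("time"); // placeholder check
    return hasTimestamp ? Result.PASSED : Result.FAILED;
  }

  @Override
  public void close() throws IOException {}
}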
Writer: Writes to the destination
Manages the connection to the destination and serializes records
Sync / Async
e.g. FsWriter, KafkaWriter, CouchbaseWriter
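A toy synchronous writer sketch, assuming Gobblin's gobblin.writer.DataWriter interface (write / commit / cleanup / recordsWritten / bytesWritten); a real writer would hold a connection to the destination.

import java.io.IOException;

import gobblin.writer.DataWriter;

// Hypothetical writer that "serializes" records to stdout and counts them.
public class ConsoleWriter implements DataWriter<String> {
  private long records = 0;
  private long bytes = 0;

  @Override
  public void write(String record) throws IOException {
    System.out.println(record); // serialize the record to the destination
    records++;
    bytes += record.length();
  }

  @Override
  public void commit() throws IOException {
    // make staged output visible; no-op for stdout
  }

  @Override
  public void cleanup() throws IOException {
    // remove any staging state; no-op for stdout
  }

  @Override
  public long recordsWritten() { return records; }

  @Override
  public long bytesWritten() { return bytes; }

  @Override
  public void close() throws IOException {}
}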
Publisher: Finalizes / commits the data
Used for destinations that support atomicity
(e.g. move a tmp staging directory to the final output directory on HDFS)
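A sketch of the atomic-commit pattern, assuming Gobblin's gobblin.publisher.DataPublisher base class (initialize / publishData / publishMetadata); the staging-to-final move is described in comments rather than implemented.

import java.io.IOException;
import java.util.Collection;

import gobblin.configuration.State;
import gobblin.configuration.WorkUnitState;
import gobblin.publisher.DataPublisher;

// Hypothetical publisher: tasks write to a staging location, and the
// publisher moves it to the final location only after all tasks succeed.
public class StagingDirPublisher extends DataPublisher {

  public StagingDirPublisher(State state) {
    super(state);
  }

  @Override
  public void initialize() throws IOException {}

  @Override
  public void publishData(Collection<? extends WorkUnitState> states) throws IOException {
    // rename the tmp staging directory to the final output directory here,
    // then mark each task COMMITTED so its watermark is saved
    for (WorkUnitState state : states) {
      state.setWorkingState(WorkUnitState.WorkingState.COMMITTED);
    }
  }

  @Override
  public void publishMetadata(Collection<? extends WorkUnitState> states) throws IOException {}

  @Override
  public void close() throws IOException {}
}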
Gobblin: The Logical Pipeline, Stateful
[Diagram: at the start of a run the pipeline loads configuration and previous watermarks from a State Store (HDFS, S3, MySQL, ZK, …); at the end it saves the new watermarks.]
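Wiring up the state store takes a couple of lines of configuration. A sketch for the HDFS case, using the state.store.fs.uri and state.store.dir keys; the URI and path here are illustrative.

# Keep job state and watermarks in HDFS (illustrative URI and path)
state.store.fs.uri=hdfs://namenode:8020
state.store.dir=/gobblin/state-store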
Gobblin: Pipeline Specification

# Pipeline name, description
job.name=PullFromWikipedia
job.group=Wikipedia
job.description=A getting started example for Gobblin

# Source + configuration
source.class=gobblin.example.wikipedia.WikipediaSource
source.page.titles=LinkedIn,Wikipedia:Sandbox
source.revisions.cnt=5
wikipedia.api.rooturl=https://en.wikipedia.org/w/api.php
wikipedia.avro.schema={"namespace":"example.wikipedia.avro",…"null"]}]}
gobblin.wikipediaSource.maxRevisionsPerPage=10

# Converter
converter.classes=gobblin.example.wikipedia.WikipediaConverter
extract.namespace=gobblin.example.wikipedia

# Writer + configuration
writer.destination.type=HDFS
writer.output.format=AVRO
writer.partitioner.class=gobblin.example.wikipedia.WikipediaPartitioner

# Publisher
data.publisher.type=gobblin.publisher.BaseDataPublisher
Gobblin: Pipeline Deployment
One spec, multiple environments: the same pipeline specification runs on bare metal, AWS, Azure, or a VM
- Small: one box (Standalone: single instance)
- Medium: static cluster (Standalone Cluster, Hadoop YARN / MR)
- Large: elastic cluster (AWS EC2)
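In practice the spec stays the same and only the launcher changes per environment; a sketch using the launcher.type key (LOCAL for a single instance, MAPREDUCE for a Hadoop cluster):

# Same spec; choose the execution environment at launch time
launcher.type=LOCAL
#launcher.type=MAPREDUCE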
Execution Model: Batch versus Streaming
Batch: determine work, acquire slots, run, checkpoint, repeat
+ Cost-efficient, deterministic, repeatable
- Higher latency
- Setup and checkpoint costs dominate if "micro-batching"
Execution Model: Batch versus Streaming
Streaming: determine work streams, run continuously, checkpoint periodically
+ Low latency
- Higher cost, because it is harder to provision capacity accurately
- More sophistication needed to deal with change
Execution Model Scorecard
[Table: typical pipelines (JDBC <-> HDFS, Kafka -> HDFS, HDFS -> Kafka, Kafka <-> Kinesis), each scored against the Batch and Streaming execution models.]
Can we run in both models using the same system?
Gobblin: The Logical Pipeline
Pipeline Stages: Start
Batch: determine work
Streaming: determine work (unbounded WorkUnits)
Pipeline Stages: Run
Batch: acquire slots, run
Streaming: run continuously, checkpoint periodically, shut down gracefully
[Diagram: tasks notify a Watermark Manager as records are acked; the manager persists watermarks to State Storage and commits outstanding watermarks on shutdown.]
Pipeline Stages: End
Batch: checkpoint, commit
Streaming: do nothing (NoOpPublisher)
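To make the streaming run stage concrete, here is a schematic of the loop described above: extract continuously, write, and let a background thread commit watermarks periodically. This is plain-JDK illustration code, not Gobblin's actual runtime.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Schematic streaming task loop, not Gobblin's actual runtime code.
public class StreamingTaskLoop {
  private final AtomicLong lastAckedWatermark = new AtomicLong(-1);
  private volatile boolean shutdownRequested = false;

  void run() {
    // commit watermarks periodically (cf. streaming.watermark.commitIntervalMillis)
    ScheduledExecutorService committer = Executors.newSingleThreadScheduledExecutor();
    committer.scheduleAtFixedRate(this::commitWatermark, 2000, 2000, TimeUnit.MILLISECONDS);

    while (!shutdownRequested) {
      long offset = extractAndWriteNextRecord(); // extractor -> converters -> writer
      lastAckedWatermark.set(offset);            // record progress once the write is acked
    }

    // graceful shutdown: stop the committer and commit the final watermark
    committer.shutdown();
    commitWatermark();
  }

  void commitWatermark() {
    long watermark = lastAckedWatermark.get();
    if (watermark >= 0) {
      // persist the watermark to the state store (e.g. ZooKeeper)
    }
  }

  long extractAndWriteNextRecord() { return 0; } // placeholder
}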
Enabling Streaming Mode
task.executionMode = streaming
Same environments as batch: Standalone (single instance), Standalone Cluster, AWS, Hadoop (YARN / MR)
A Streaming Pipeline Spec: Kafka 2 Kafka

# A sample pull file that copies an input Kafka topic and
# produces to an output Kafka topic with sampling

# Pipeline name, description
job.name=Kafka2KafkaStreaming
job.group=Kafka
job.description=This is a job that runs forever, copies an input Kafka topic to an output Kafka topic
job.lock.enabled=false

# Source, configuration
source.class=gobblin.source….KafkaSimpleStreamingSource
gobblin.streaming.kafka.topic.key.deserializer=org.apache.kafka.common.serialization.StringDeserializer
gobblin.streaming.kafka.topic.value.deserializer=org.apache.kafka.common.serialization.ByteArrayDeserializer
gobblin.streaming.kafka.topic.singleton=test
kafka.brokers=localhost:9092

# Converter, configuration: sample 10% of the records
converter.classes=gobblin.converter.SamplingConverter
converter.sample.ratio=0.10

# Writer, configuration
writer.builder.class=gobblin.kafka.writer.KafkaDataWriterBuilder
writer.kafka.topic=test_copied
writer.kafka.producerConfig.bootstrap.servers=localhost:9092
writer.kafka.producerConfig.value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer

# Publisher
data.publisher.type=gobblin.publisher.NoopPublisher

# Execution mode, watermark storage configuration
task.executionMode=STREAMING

# Configure watermark storage for streaming
#streaming.watermarkStateStore.type=zk
#streaming.watermarkStateStore.config.state.store.zk.connectString=localhost:2181
# Configure watermark commit settings for streaming
#streaming.watermark.commitIntervalMillis=2000
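The sampling step in this spec is the converter 1:N contract put to work: returning an empty Iterable drops a record. A sketch of how such a converter might look (illustrative, not the actual SamplingConverter source; it builds on the converter sketch earlier):

import java.util.Random;

import gobblin.configuration.WorkUnitState;
import gobblin.converter.Converter;
import gobblin.converter.DataConversionException;
import gobblin.converter.EmptyIterable;
import gobblin.converter.SchemaConversionException;
import gobblin.converter.SingleRecordIterable;

// Illustrative pass-through sampler: emits each record with probability
// converter.sample.ratio, drops it otherwise (an empty Iterable = 0 records).
public class MySamplingConverter extends Converter<String, String, byte[], byte[]> {
  private final Random random = new Random();

  @Override
  public String convertSchema(String inputSchema, WorkUnitState workUnit)
      throws SchemaConversionException {
    return inputSchema;
  }

  @Override
  public Iterable<byte[]> convertRecord(String outputSchema, byte[] inputRecord,
      WorkUnitState workUnit) throws DataConversionException {
    double ratio = workUnit.getPropAsDouble("converter.sample.ratio", 1.0);
    return this.random.nextDouble() < ratio
        ? new SingleRecordIterable<>(inputRecord)
        : new EmptyIterable<byte[]>();
  }
}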
Gobblin Streaming: Cluster View
Cluster of processes
Apache Helix: work-unit assignment, fault tolerance, reassignment
[Diagram: a Cluster Master coordinates Workers 1-3 through Helix; the workers consume from the stream source and write to the sink (Kafka, HDFS, …).]
Active Workstreams in Gobblin
- Gobblin as a Service: a global orchestrator with a REST API for submitting logical flow specifications, which compile down to physical pipeline specs
- Global Throttling: throttling capability to ensure Gobblin respects quotas globally (e.g. API calls, network bandwidth, Hadoop NameNode); generic, so it can be used outside Gobblin
- Metadata-driven: integration with a Metadata Service (cf. WhereHows); policy-driven replication, permissions, encryption, etc.
Roadmap
- Final LinkedIn Gobblin 0.10.0 release
- Apache Incubator code donation and release
- More streaming runtimes: integration with Apache Samza and LinkedIn Brooklin
- GDPR compliance: data purge for Hadoop and other systems
- Security improvements: credential storage, secure specs
Gobblin Team @ LinkedIn