gobblin: unifying data ingestion for hadoop


GOBBLIN: UNIFYING DATA INGESTION FOR HADOOP

Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker, Shirshanka Das, Chavdar Botev

Data Analytics Infrastructure @ LinkedIn


Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Details
• Gobblin in Production @ LinkedIn
• Future Work
• Q&A


Data Ingestion Challenges @ LinkedIn

BIG engineering and operational COST!

Data Sources

Data Types

Operational Pain


Pre-Gobblin Era

[Diagram: data sources connected to separate ingestion pipelines (Pipeline #1, #2, #3, #4, #5, ... #n). Source labels: OLTP; Tracking; Databases (Oracle/Espresso); Snapshot and delta file dumps; Kafka; Databus Changes; External Partner Data; REST; JDBC; SOAP; ...]


The Gobblin Era

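[Diagram: the same data sources as on the previous slide, with no separate pipelines shown. Source labels: OLTP; Tracking; Databases (Oracle/Espresso); Snapshot and delta file dumps; Kafka; Databus Changes; External Partner Data; REST; JDBC; SOAP; ...]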


Requirements
• Multi-platform and Scalability Support
• Rich Source Integration
• Centralized State Management
• Operability
• Extensibility
• Self Service


Architecture Overview
• Constructs for Building Ingestion Flows: Source, Extractor, Converter, Quality Checker, Writer, Publisher
• WorkUnit / Task
• Execution Runtime: Task Executor, Task State Tracker, Job Launcher, Job Scheduler
• Deployment Mode: Standalone, Hadoop MR, Yarn
• state store, compaction, retention mgmt., monitoring
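
As a rough illustration of how the constructs listed on this slide might chain together, here is a minimal Python sketch. It assumes a flow of Source, WorkUnit, Extractor, Converter, Writer, Quality Checker and Publisher; every class and function name is hypothetical and is not Gobblin's actual Java API.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class WorkUnit:
    """Hypothetical unit of work: a named slice of the source data."""
    name: str
    records: List[dict]


@dataclass
class TaskResult:
    """Hypothetical task outcome: staged records plus the check verdict."""
    work_unit: str
    written: List[dict] = field(default_factory=list)
    passed_quality_check: bool = False


def run_task(
    work_unit: WorkUnit,
    convert: Callable[[dict], dict],
    quality_check: Callable[[List[dict]], bool],
) -> TaskResult:
    """Extract, convert and stage records, then run a task-level check."""
    result = TaskResult(work_unit=work_unit.name)
    for record in work_unit.records:            # extractor: pull records one by one
        result.written.append(convert(record))  # converter, then writer (staging)
    result.passed_quality_check = quality_check(result.written)
    return result


def publish(results: Iterable[TaskResult]) -> List[dict]:
    """Publisher: expose only the output of tasks that passed their checks."""
    published: List[dict] = []
    for result in results:
        if result.passed_quality_check:
            published.extend(result.written)
    return published


# Source: split the input into work units (here, two fixed slices).
work_units = [
    WorkUnit("wu-1", [{"id": 1}, {"id": 2}]),
    WorkUnit("wu-2", [{"id": 3}]),
]
results = [
    run_task(
        wu,
        convert=lambda r: {**r, "ingested": True},
        quality_check=lambda rows: len(rows) > 0,
    )
    for wu in work_units
]
output = publish(results)
```

Keeping each stage as a plain callable is only one way to model the pluggability the slide implies; a real deployment would also need the runtime pieces (executor, state tracker, launcher, scheduler) that this sketch leaves out.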


Case Study: Kafka Ingestion

[Diagram: KafkaAvroSource creates three WorkUnits. Each WorkUnit is processed by a KafkaAvroExtractor and a KafkaConverter and produces Avro output through the TimePartitionedAvroWriter.]
• WorkUnit 1 (Topic 1, Partition 1)
• WorkUnit 2 (Topic 1, Partition 2)
• WorkUnit 3 (Topic 1, Partitions 1 & 2)

AuditCountQualityChecker

TimePartitionedDataPublisher: /kafka/topic/hourly/yyyy/mm/dd/hh/*.avro

Compaction: /kafka/topic/daily/yyyy/mm/dd/*.avro
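
The hourly and daily directory layouts shown on this slide can be sketched as a pair of path builders plus a grouping step for compaction. This is a minimal illustration that assumes records carry a timestamp; the function names and the `compaction_plan` helper are hypothetical, not part of Gobblin.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List


def hourly_path(topic: str, ts: datetime) -> str:
    """Directory for one hour of a topic, e.g. /kafka/<topic>/hourly/yyyy/mm/dd/hh."""
    return f"/kafka/{topic}/hourly/{ts:%Y/%m/%d/%H}"


def daily_path(topic: str, ts: datetime) -> str:
    """Directory for one compacted day of a topic, e.g. /kafka/<topic>/daily/yyyy/mm/dd."""
    return f"/kafka/{topic}/daily/{ts:%Y/%m/%d}"


def compaction_plan(topic: str, timestamps: Iterable[datetime]) -> Dict[str, List[str]]:
    """Group hourly directories under the daily directory they compact into."""
    plan = defaultdict(set)
    for ts in timestamps:
        plan[daily_path(topic, ts)].add(hourly_path(topic, ts))
    return {day: sorted(hours) for day, hours in plan.items()}


plan = compaction_plan(
    "topic",
    [datetime(2015, 9, 5, 12, 30), datetime(2015, 9, 5, 13, 5)],
)
```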


Case Study: Database Ingestion

[Diagram: JdbcSource creates three WorkUnits, each shown with the label WorkUnit 1 [2015090512, 2015090514). Each WorkUnit is processed by a JdbcExtractor, a ToAvroConverter (input labeled Row) and a SnapshotAvroWriter.]

Schema Compatibility & Count Quality Checker

SnapshotDataPublisher: /database/table/incremental/snapshot-ts/*.avro

Compaction: /database/table/full/snapshot-ts/*.avro
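
The WorkUnit labels on this slide look like half-open time ranges in `yyyyMMddHH` form. As a minimal sketch under that assumption, a source could split one extraction window into such ranges as follows; `partition_watermarks` and its parameters are hypothetical names, not Gobblin's API.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

WATERMARK_FORMAT = "%Y%m%d%H"  # assumed from labels such as 2015090512


def partition_watermarks(low: str, high: str, hours_per_unit: int) -> List[Tuple[str, str]]:
    """Split [low, high) into consecutive half-open intervals of at most hours_per_unit hours."""
    if hours_per_unit <= 0:
        raise ValueError("hours_per_unit must be positive")
    start = datetime.strptime(low, WATERMARK_FORMAT)
    end = datetime.strptime(high, WATERMARK_FORMAT)
    step = timedelta(hours=hours_per_unit)
    intervals: List[Tuple[str, str]] = []
    while start < end:
        stop = min(start + step, end)  # the last interval may be shorter
        intervals.append(
            (start.strftime(WATERMARK_FORMAT), stop.strftime(WATERMARK_FORMAT))
        )
        start = stop
    return intervals


intervals = partition_watermarks("2015090512", "2015090518", hours_per_unit=2)
```

Half-open intervals mean the upper bound of one WorkUnit is the lower bound of the next, so no row is extracted twice and none is skipped.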



Case Study – Filtering Sensitive Data

[Diagram: Source, WorkUnit, Extractor, Converter and Quality Checker, then Fork and Branching on the question "Has Sensitive Data?"]
• no: Writer
• yes: Sensitive Data Filtering Converter, then Writer

DataPublisher
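
One way to read the fork on this slide is as a per-record branch: records without sensitive fields go straight to a writer, while the rest pass through a filtering converter first. The sketch below assumes that reading; the field names in `SENSITIVE_FIELDS` and all function names are hypothetical.

```python
from typing import Dict, Iterable, List, Set

SENSITIVE_FIELDS: Set[str] = {"email", "ssn"}  # hypothetical field names


def has_sensitive_data(record: dict) -> bool:
    """True if the record carries any field listed as sensitive."""
    return any(name in record for name in SENSITIVE_FIELDS)


def filter_sensitive(record: dict) -> dict:
    """Converter for the 'yes' branch: drop sensitive fields, keep the rest."""
    return {k: v for k, v in record.items() if k not in SENSITIVE_FIELDS}


def fork(records: Iterable[dict]) -> Dict[str, List[dict]]:
    """Route each record to one of two branches, each with its own writer."""
    branches: Dict[str, List[dict]] = {"no": [], "yes": []}
    for record in records:
        if has_sensitive_data(record):
            branches["yes"].append(filter_sensitive(record))
        else:
            branches["no"].append(record)
    return branches


branches = fork([
    {"id": 1, "country": "US"},
    {"id": 2, "email": "a@example.com", "country": "DE"},
])
```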


Data Quality Checking

[Diagram: Record-level Policies, Writer, Task-level Policies, Publisher; failure outcomes shown: Quarantine, Fail Task]

Quality Checkers
- Per record or per task.
- Policy driven
- Composable
  ~ Schema compatibility
  ~ Audit check
  ~ Sensitive fields
  ~ Required fields
  ~ Unique key
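
The policy-driven, composable checks listed above can be sketched as small functions combined in a list. This minimal example assumes that a record failing a record-level policy is quarantined and that a failed task-level policy fails the task; the policy implementations and names are hypothetical.

```python
from typing import Callable, List, Tuple

Record = dict
RecordPolicy = Callable[[Record], bool]


def required_fields(*names: str) -> RecordPolicy:
    """Record-level policy: every named field must be present and not None."""
    return lambda record: all(record.get(name) is not None for name in names)


def unique_key(key: str) -> RecordPolicy:
    """Record-level policy: the key value must not have been seen before."""
    seen = set()

    def check(record: Record) -> bool:
        value = record.get(key)
        if value in seen:
            return False
        seen.add(value)
        return True

    return check


def apply_record_policies(
    records: List[Record], policies: List[RecordPolicy]
) -> Tuple[List[Record], List[Record]]:
    """Return (passed, quarantined); a record must satisfy every policy to pass."""
    passed, quarantined = [], []
    for record in records:
        target = passed if all(policy(record) for policy in policies) else quarantined
        target.append(record)
    return passed, quarantined


class TaskFailed(Exception):
    """Raised when a task-level policy is violated."""


def audit_count_check(expected: int, actual: int) -> None:
    """Task-level policy: the number of written records must match the expected count."""
    if expected != actual:
        raise TaskFailed(f"expected {expected} records, wrote {actual}")


passed, quarantined = apply_record_policies(
    [{"id": 1, "name": "a"}, {"id": 1, "name": "b"}, {"name": "c"}],
    [required_fields("id", "name"), unique_key("id")],
)
```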


State and Metadata Mgmt.

State Store
- Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks)
  ~ Carried over between job runs
- Default impl: serializes job/task states into files, one per run.
- Allows other implementations that conform to the interface to be plugged in.

[Diagram, EXAMPLE: job run #1, job run #2 and job run #3 pass watermarks (SEP2, SEP3) to one another through the State Store.]
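
To make the "one file per run" default and the pluggable interface concrete, here is a minimal Python sketch. It assumes JSON files and a two-method interface; `StateStore`, `FileStateStore`, `run_job` and the job name are hypothetical and do not reproduce Gobblin's own interface.

```python
import json
import tempfile
from pathlib import Path
from typing import List, Optional, Protocol


class StateStore(Protocol):
    """Hypothetical interface; any conforming implementation can be plugged in."""

    def put(self, job: str, run_id: int, state: dict) -> None: ...

    def latest(self, job: str) -> Optional[dict]: ...


class FileStateStore:
    """File-backed store: serializes the state of each job run into its own file."""

    def __init__(self, root: Path) -> None:
        self.root = root

    def put(self, job: str, run_id: int, state: dict) -> None:
        job_dir = self.root / job
        job_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded run ids keep file names in run order when sorted.
        (job_dir / f"{run_id:06d}.json").write_text(json.dumps(state))

    def latest(self, job: str) -> Optional[dict]:
        files = sorted((self.root / job).glob("*.json"))
        return json.loads(files[-1].read_text()) if files else None


def run_job(store: StateStore, job: str, run_id: int, high_watermark: str) -> Optional[str]:
    """Start from the previous run's high watermark, then record the new one."""
    previous = store.latest(job)
    low_watermark = previous["high_watermark"] if previous else None
    store.put(job, run_id, {"low_watermark": low_watermark, "high_watermark": high_watermark})
    return low_watermark


with tempfile.TemporaryDirectory() as tmp:
    store = FileStateStore(Path(tmp))
    starts: List[Optional[str]] = [
        run_job(store, "kafka-ingest", 1, "SEP2"),
        run_job(store, "kafka-ingest", 2, "SEP3"),
        run_job(store, "kafka-ingest", 3, "SEP4"),
    ]
```

The three runs loosely mirror the slide's example: each run picks up where the previous one stopped, because the watermark is carried over through the store.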


Metrics / Events and Alerting

[Diagram: a tree of MetricContexts feeding a MetricReporter and an EventReporter. Kafka MetricContext (20); Topic 1 MetricContext (12) and Topic 2 MetricContext (8); Partition 1 MetricContext (6) and Partition 2 MetricContext (6).]

Metrics / Events Collection and Reporting
- Metrics for ingestion progress
  ~ supports tagging
  ~ real-time aggregation
- Events for major milestones
  ~ “fire-and-forget”
- Various built-in metric / event reporters
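
The tagging and real-time aggregation described above can be pictured as a tree of contexts in which every update also rolls up to the ancestors. The sketch below assumes that behaviour and reuses the slide's numbers; the `MetricContext` class here is a hypothetical stand-in, not Gobblin's implementation.

```python
from typing import Dict, Optional


class MetricContext:
    """Hypothetical metric context: a node in a tree that aggregates into its parent."""

    def __init__(
        self,
        name: str,
        parent: Optional["MetricContext"] = None,
        tags: Optional[Dict[str, str]] = None,
    ) -> None:
        self.name = name
        self.parent = parent
        # A child inherits its parent's tags and may add or override its own.
        self.tags: Dict[str, str] = {**(parent.tags if parent else {}), **(tags or {})}
        self.counters: Dict[str, int] = {}

    def inc(self, metric: str, amount: int = 1) -> None:
        """Increment the counter here and in every ancestor context."""
        node: Optional["MetricContext"] = self
        while node is not None:
            node.counters[metric] = node.counters.get(metric, 0) + amount
            node = node.parent

    def count(self, metric: str) -> int:
        return self.counters.get(metric, 0)


kafka = MetricContext("Kafka", tags={"source": "kafka"})
topic1 = MetricContext("Topic 1", parent=kafka, tags={"topic": "1"})
topic2 = MetricContext("Topic 2", parent=kafka, tags={"topic": "2"})
partition1 = MetricContext("Partition 1", parent=topic1, tags={"partition": "1"})
partition2 = MetricContext("Partition 2", parent=topic1, tags={"partition": "2"})

partition1.inc("records", 6)
partition2.inc("records", 6)
topic2.inc("records", 8)
```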


Running Modes

Standalone

Runs in a single JVM; tasks run in a thread pool.

Scale-out with MapReduce

Each job run launches a MR job, using mappers as containers to run tasks.

Scale-out with General Distributed Resource Manager (YARN, *in progress)

Supports long-running continuous ingestion, with better resource utilization and SLA guarantees.
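
For the standalone mode, "tasks run in a thread pool" inside a single process can be sketched with Python's standard `concurrent.futures`. This is only an analogy for the single-JVM mode described above; `run_standalone`, `count_records` and the sample work units are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_standalone(
    work_units: Sequence[T], task: Callable[[T], R], max_threads: int = 4
) -> List[R]:
    """Run one task per work unit in a shared thread pool and collect the results in order."""
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        return list(pool.map(task, work_units))


def count_records(work_unit: List[dict]) -> int:
    """Stand-in task: a real task would extract, convert and write the records."""
    return len(work_unit)


counts = run_standalone(
    [[{"id": 1}, {"id": 2}], [{"id": 3}], []],
    count_records,
    max_threads=2,
)
```

In the scale-out modes the same tasks would instead be handed to mappers or to containers managed by a resource manager, which this sketch does not attempt to model.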


Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
  – Internal sources → HDFS
    • Kafka, MySQL, Dropbox, etc.
  – External sources → HDFS
    • Salesforce, Google Analytics, S3, etc.
  – HDFS → HDFS
    • Closed member data purging
  – Egress from HDFS (future work)
• Data volume
  – Over a dozen data sources,
  – thousands of datasets,
  – tens of TBs,
  … daily.


Future Work
• Gobblin on Yarn (alpha-release)
• Real-time elastic ingestion
• Integration with
  – Apache Sqoop: using Sqoop connectors
  – Logstash: log ingestion
  – Morphlines: using Morphline transformation
  – Apache Spark


Conclusions
• Pain of maintaining multiple ingestion pipelines
• Gobblin to the rescue!
• Data quality assurance and centralized state management
• Gobblin in production for a wide range of data sources
• Continuous real-time ingestion


ACKNOWLEDGEMENT

Pradhan Cadabam, Shrikanth Shankar, Suvodeep Pyne, Ray Ortigas, Henry Cai, Kenneth Goodhope, Erik Krogen


Thanks.

Github: https://github.com/linkedin/gobblin
Documentation: https://github.com/linkedin/gobblin/wiki
User Group: https://groups.google.com/forum/#!forum/gobblin-users
