Gobblin: Unifying Data Ingestion for Hadoop
Lin Qiao, Yinan Li, Sahil Takiar, Ziyang Liu, Narasimha Veeramreddy, Min Tu, Ying Dai, Issac Buenrostro, Kapil Surlaker, Shirshanka Das, Chavdar Botev
Data Analytics Infrastructure @ LinkedIn
Agenda
• Why Gobblin?
• Gobblin Overview
• Case Studies
• Gobblin in Detail
• Gobblin in Production @ LinkedIn
• Future Work
• Q&A
Data Ingestion Challenges @ LinkedIn
BIG engineering and operational COST!
Data Sources
Data Types
Operational Pain
Pre-Gobblin Era
[Diagram: each source and transport has its own ingestion pipeline (Pipeline #1 through Pipeline #n).]
OLTP: Databases (Oracle/Espresso), Snapshot and delta file dumps, Databus Changes
Tracking: Kafka
External Partner Data: REST, JDBC, SOAP, ...
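The Gobblin Era
[Diagram: the same sources and transports, with the separate pipelines replaced by a single framework.]
OLTP: Databases (Oracle/Espresso), Snapshot and delta file dumps, Databus Changes
Tracking: Kafka
External Partner Data: REST, JDBC, SOAP, ...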
Requirements
• Multi-platform and Scalability Support
• Rich Source Integration
• Centralized State Management
• Operability
• Extensibility
• Self Service
Architecture Overview
[Diagram: layered architecture.]
Constructs for Building Ingestion Flows: Source, Extractor, Converter, Qlty. Chker., Writer, Publisher
WorkUnit / Task
Execution Runtime: Task Executor, Task State Tracker, Job Launcher, Job Scheduler
Deployment Mode: Standalone, Hadoop MR, Yarn
Also shown: state store, compaction, retention mgmt., monitoring
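To make the construct chain on this slide concrete, here is a minimal Python sketch of a job that turns WorkUnits into tasks and pushes records through extractor, converter, quality checker, writer, and publisher steps. Gobblin itself is a Java framework; every class and function name below (`WorkUnit`, `Task`, `publish`, `run_job`) is hypothetical and only mirrors the labels on the slide, not Gobblin's actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class WorkUnit:
    """A unit of work handed out by a source (hypothetical shape)."""
    name: str
    records: List[dict]


@dataclass
class Task:
    """Runs one WorkUnit through extractor -> converter -> checker -> writer."""
    work_unit: WorkUnit
    convert: Callable[[dict], dict]
    check: Callable[[dict], bool]
    staged: List[dict] = field(default_factory=list)

    def extract(self) -> Iterable[dict]:
        # Extractor: pull the records that belong to this work unit.
        return iter(self.work_unit.records)

    def run(self) -> None:
        for record in self.extract():
            converted = self.convert(record)   # Converter
            if self.check(converted):          # Quality checker (per record)
                self.staged.append(converted)  # Writer (staging area)


def publish(tasks: List[Task]) -> Dict[str, List[dict]]:
    """Publisher: expose staged output once all tasks have run."""
    return {t.work_unit.name: t.staged for t in tasks}


def run_job(work_units: List[WorkUnit]) -> Dict[str, List[dict]]:
    # Illustrative converter and check: parse "value" as int, keep non-negatives.
    tasks = [
        Task(wu,
             convert=lambda r: {**r, "value": int(r["value"])},
             check=lambda r: r["value"] >= 0)
        for wu in work_units
    ]
    for task in tasks:
        task.run()
    return publish(tasks)
```

Keeping each construct as a small pluggable step is what lets one runtime serve very different sources; only the extractor and converter need to change per source in this sketch.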
Case Study: Kafka Ingestion
[Diagram: KafkaAvroSource creates three WorkUnits. Each task comprises KafkaAvroExtractor, KafkaConverter, TimePartitionedAvroWriter (Avro), and AuditCountQualityChecker. TimePartitionedDataPublisher publishes the output.]
WorkUnit 1 (Topic 1, Partition 1)
WorkUnit 2 (Topic 1, Partition 2)
WorkUnit 3 (Topic 1, Partitions 1 & 2)
Output: /kafka/topic/hourly/yyyy/mm/dd/hh/*.avro
Compaction: /kafka/topic/daily/yyyy/mm/dd/*.avro
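The hourly and daily paths on this slide suggest that output is bucketed by record time. The sketch below shows one way such directories could be derived from a record timestamp. The function names, the use of UTC, and the reading of `topic` in the path as the topic name are assumptions for illustration, not details confirmed by the slides.

```python
from datetime import datetime, timezone


def hourly_dir(topic: str, epoch_seconds: int) -> str:
    """Directory for hourly output, mirroring /kafka/topic/hourly/yyyy/mm/dd/hh."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)  # assumption: UTC
    return f"/kafka/{topic}/hourly/{ts:%Y/%m/%d/%H}"


def daily_dir(topic: str, epoch_seconds: int) -> str:
    """Directory for compacted daily output, mirroring /kafka/topic/daily/yyyy/mm/dd."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)  # assumption: UTC
    return f"/kafka/{topic}/daily/{ts:%Y/%m/%d}"
```

Under this layout, compaction for a given day only has to read the 24 hourly directories that share the same `yyyy/mm/dd` prefix.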
Case Study: Database Ingestion
[Diagram: JdbcSource creates three WorkUnits, each labeled WorkUnit 1 [2015090512, 2015090514) in the transcript. Each task comprises JdbcExtractor, ToAvroConverter (data label: Row), SnapshotAvroWriter, and SchemaCompatibility & Count Qlty. Chker. SnapshotDataPublisher publishes the output.]
Output: /database/table/incremental/snapshot-ts/*.avro
Compaction: /database/table/full/snapshot-ts/*.avro
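The WorkUnit labels above look like half-open time ranges at hour granularity. As a sketch under that assumption, the following shows how a range such as [2015090512, 2015090514) could be split into consecutive WorkUnit ranges. The `split_range` helper and the `yyyyMMddHH` interpretation are hypothetical and are not taken from Gobblin's code.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

FMT = "%Y%m%d%H"  # assumption: labels such as 2015090512 mean yyyyMMddHH


def split_range(low: str, high: str, hours_per_unit: int) -> List[Tuple[str, str]]:
    """Split the half-open interval [low, high) into consecutive sub-intervals."""
    if hours_per_unit <= 0:
        raise ValueError("hours_per_unit must be positive")
    start = datetime.strptime(low, FMT)
    end = datetime.strptime(high, FMT)
    step = timedelta(hours=hours_per_unit)
    units = []
    while start < end:
        stop = min(start + step, end)  # the last unit may be shorter
        units.append((start.strftime(FMT), stop.strftime(FMT)))
        start = stop
    return units
```

Half-open intervals mean adjacent WorkUnits never overlap and never leave a gap, so the next run can start exactly at the previous run's upper bound.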
Case Study: Filtering Sensitive Data
[Diagram: Source → Extractor (WorkUnit) → Converter and Quality Checker → Fork and Branching → "Has Sensitive Data?"]
no: Writer
yes: Sensitive Data Filtering Converter → Writer
Output is published by the DataPublisher.
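One possible reading of the branching in this diagram is sketched below: records that carry sensitive fields pass through a filtering converter before being written, while other records are written unchanged. The function names, the per-record routing, and the representation of records as dictionaries are all assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set

Record = Dict[str, object]


def strip_sensitive(record: Record, sensitive: Set[str]) -> Record:
    """Filtering converter: drop sensitive fields from a record."""
    return {k: v for k, v in record.items() if k not in sensitive}


def fork_and_write(records: Iterable[Record], sensitive: Set[str]) -> Dict[str, List[Record]]:
    """Route each record to one of two writers, filtering on the 'yes' branch."""
    writers: Dict[str, List[Record]] = {"plain": [], "filtered": []}
    for record in records:
        if not sensitive.isdisjoint(record):  # "Has Sensitive Data?" -> yes
            writers["filtered"].append(strip_sensitive(record, sensitive))
        else:                                 # -> no
            writers["plain"].append(record)
    return writers
```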
Data Quality Checking
[Diagram: Record-level Policies → Writer → Task-level Policies → Publisher. Failure outcomes shown: Quarantine, Fail Task.]
Quality Checkers
- Per record or per task.
- Policy driven
- Composable
~ Schema compatibility
~ Audit check
~ Sensitive fields
~ Required fields
~ Unique key
State and Metadata Mgmt.
State Store
- Stores runtime metadata, e.g., checkpoints (a.k.a. watermarks)
~ Carried over between job runs
- Default impl: serializes job/task states into files, one per run.
- Allows other implementations that conform to the interface to be plugged in.
[Diagram: EXAMPLE. The State Store sits between job run #1, job run #2, and job run #3; watermarks SEP2 and SEP3 are carried from one run to the next.]
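The file-per-run behavior described on this slide can be sketched as follows. This is not Gobblin's state store implementation; the class name, the JSON format, and the file naming scheme are assumptions chosen to show how a watermark written by one run becomes the starting point of the next.

```python
import json
import os
import tempfile
from typing import Optional


class FileStateStore:
    """Keeps one JSON file per job run under a directory (hypothetical layout)."""

    def __init__(self, root: str) -> None:
        self.root = root
        os.makedirs(root, exist_ok=True)

    def put(self, run_id: int, state: dict) -> None:
        # Zero-padded run ids make lexicographic order match run order.
        path = os.path.join(self.root, f"run-{run_id:06d}.json")
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(state, fh)

    def latest(self) -> Optional[dict]:
        files = sorted(f for f in os.listdir(self.root) if f.endswith(".json"))
        if not files:
            return None
        with open(os.path.join(self.root, files[-1]), encoding="utf-8") as fh:
            return json.load(fh)


def run_job(store: FileStateStore, run_id: int, new_high: str) -> dict:
    """Start from the previous run's high watermark and record the new one."""
    previous = store.latest()
    low = previous["high_watermark"] if previous else None
    state = {"low_watermark": low, "high_watermark": new_high}
    store.put(run_id, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        store = FileStateStore(root)
        run_job(store, 1, "SEP2")
        print(run_job(store, 2, "SEP3"))
```

Any other backend could replace `FileStateStore` as long as it offers the same `put` and `latest` operations, which is the pluggability the slide describes.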
Metrics / Events and Alerting
[Diagram: MetricContext tree with counts. Kafka MetricContext (20) has children Topic 1 MetricContext (12) and Topic 2 MetricContext (8); Topic 1 has children Partition 1 MetricContext (6) and Partition 2 MetricContext (6). Also shown: MetricReporter, EventReporter.]
Metrics / Events Collection and Reporting
- Metrics for ingestion progress
~ supports tagging
~ real-time aggregation
- Events for major milestones
~ "fire-and-forget"
- Various built-in metric / event reporters
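The counts in the diagram (6 + 6 = 12 and 12 + 8 = 20) are consistent with child contexts rolling up into their parents. The sketch below illustrates that kind of real-time aggregation with inherited tags. It is a simplified stand-in, not Gobblin's metrics API; the `MetricContext` class here and its `inc` method are hypothetical.

```python
from typing import Dict, Optional


class MetricContext:
    """A node in a context tree; counter updates propagate to all ancestors."""

    def __init__(self, name: str, parent: Optional["MetricContext"] = None,
                 tags: Optional[Dict[str, str]] = None) -> None:
        self.name = name
        self.parent = parent
        # Child contexts inherit the parent's tags and may add their own.
        self.tags: Dict[str, str] = {**(parent.tags if parent else {}), **(tags or {})}
        self.count = 0

    def inc(self, n: int = 1) -> None:
        # Update this context and every ancestor so parents stay current.
        node: Optional["MetricContext"] = self
        while node is not None:
            node.count += n
            node = node.parent
```

Updating ancestors at write time keeps reads cheap: a reporter can read the top-level count without walking the tree.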
Running Modes
Standalone
Runs in a single JVM; tasks run in a thread pool.
Scale-out with MapReduce
Each job run launches a MR job, using mappers as containers to run tasks.
Scale-out with General Distributed Resource Manager: YARN (*in progress)
Supports long-running continuous ingestion, with better resource utilization and SLA guarantees.
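The standalone mode described above, one process with tasks on a thread pool, has a direct analogue in most languages. The following is a rough Python analogy rather than Gobblin's JVM implementation; `run_standalone` and its `pool_size` parameter are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_standalone(tasks: List[Callable[[], str]], pool_size: int = 4) -> List[str]:
    """Run every task in one process on a fixed-size thread pool."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        futures = [pool.submit(task) for task in tasks]
        # result() blocks until the task finishes and re-raises its exception, if any.
        return [f.result() for f in futures]
```

In the MapReduce mode the same task list would instead be divided among mappers, so the task abstraction stays the same while the container changes.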
Gobblin in Production @ LinkedIn
• In production since 2014
• Usages
  – Internal sources → HDFS
    • Kafka, MySQL, Dropbox, etc.
  – External sources → HDFS
    • Salesforce, Google Analytics, S3, etc.
  – HDFS → HDFS
    • Closed member data purging
  – Egress from HDFS (future work)
• Data volume
  – Over a dozen data sources,
  – thousands of datasets,
  – tens of TBs, … daily.
Future Work
• Gobblin on Yarn (alpha-release)
• Real-time elastic ingestion
• Integration with
  – Apache Sqoop: using Sqoop connectors
  – Logstash: log ingestion
  – Morphlines: using Morphline transformation
  – Apache Spark
Conclusions
• Pain of maintaining multiple ingestion pipelines
• Gobblin to the rescue!
• Data quality assurance and centralized state management
• Gobblin in production for a wide range of data sources
• Continuous real-time ingestion
ACKNOWLEDGEMENT
Pradhan Cadabam, Shrikanth Shankar, Suvodeep Pyne, Ray Ortigas, Henry Cai, Kenneth Goodhope, Erik Krogen
Thanks.
Github: https://github.com/linkedin/gobblin
Documentation: https://github.com/linkedin/gobblin/wiki
User Group: https://groups.google.com/forum/#!forum/gobblin-users