Building Spark Streaming Pipelines with Cask Hydrator
Gokul Gunasekaran
Software Engineer, Cask Data
Aug 31, 2016
Cask, CDAP, Cask Hydrator and Cask Tracker are trademarks or registered trademarks of Cask Data. Apache Spark, Spark, the Spark logo, Apache Hadoop, Hadoop and the Hadoop logo are trademarks or registered trademarks of the Apache Software Foundation. All other trademarks and registered trademarks are the property of their respective owners.
cask.co
INGEST: any data from any source, in real-time and batch
BUILD: drag-and-drop ETL/ELT pipelines that run on Hadoop
EGRESS: any data to any destination, in real-time and batch
Data Pipeline: provides the ability to automate complex workflows that involve fetching data, performing non-trivial transformations, and deriving and serving insights from the data
Flight Data Analysis Use Case
Noise due to low flight paths is a common problem. We want to find the affected airports around the country using data from flight sensors placed around airports.
Challenge:
✦ Hadoop ETL pipeline(s) stitched together using hard-to-maintain, brittle scripts
✦ Few developers with expertise in Hadoop components (HDFS, MapReduce, Spark, YARN, HBase, Kafka, Hive)
✦ Hard to debug and validate, resulting in frequent failures in production environments
Demo
Fetch flight sensor data from Kafka and find the affected airport areas
• Sensors push flight data (altitude, velocity, etc.) into Kafka
• Fetch data from Kafka and batch events every minute
• Group the data by airport code and compute the average altitudes
• Select the airport areas where the average altitude is below a threshold
• Write the filtered airport codes to HDFS
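The group, average, and filter steps above can be sketched for a single one-minute batch in plain Python. This is only an illustration of the logic, not Hydrator or Spark code: the function name, the threshold value, and the sample readings are all hypothetical, and it assumes the "filtered" airports are the ones whose average altitude falls below the threshold, since those are the affected ones.

```python
from collections import defaultdict

# Hypothetical threshold (same units as the altitude field); not from the talk.
ALTITUDE_THRESHOLD = 350.0


def low_altitude_airports(batch, threshold=ALTITUDE_THRESHOLD):
    """Return {airport_code: average_altitude} for airports whose average
    altitude in this batch is below the threshold."""
    totals = defaultdict(lambda: [0.0, 0])  # airport -> [altitude sum, count]
    for airport, altitude in batch:
        totals[airport][0] += altitude
        totals[airport][1] += 1
    # Group-by result: one average per airport code.
    averages = {code: s / n for code, (s, n) in totals.items()}
    # Keep only the airports that fall below the threshold.
    return {code: avg for code, avg in averages.items() if avg < threshold}


# One illustrative batch of (airport code, altitude) pairs.
batch = [("SAN", 400), ("SFO", 300), ("SAN", 500), ("SFO", 200)]
affected = low_altitude_airports(batch)
print(affected)
```

In the pipeline itself these steps would be separate aggregation and filter stages, with an HDFS sink receiving the result of each batch.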
Flight Data from sensor
1472669109, SAN, 400, 200
1472669109, SFO, 300, 400
…
Fields: Timestamp, Destination Airport Code, Altitude, Velocity
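A minimal sketch of turning one sensor line into named fields, assuming comma-separated records with exactly the four fields listed above and assuming the first sample record reads `1472669109, SAN, 400, 200`. The `FlightReading` type and `parse_reading` function are hypothetical names. The timestamp is treated as epoch seconds, which is an assumption, and the units of altitude and velocity are not stated in the talk.

```python
from typing import NamedTuple


class FlightReading(NamedTuple):
    timestamp: int    # assumed to be epoch seconds
    airport: str      # destination airport code
    altitude: float   # units not stated in the source
    velocity: float   # units not stated in the source


def parse_reading(line: str) -> FlightReading:
    """Parse one 'timestamp, airport, altitude, velocity' record."""
    parts = [p.strip() for p in line.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 fields, got {len(parts)}: {line!r}")
    ts, airport, altitude, velocity = parts
    return FlightReading(int(ts), airport, float(altitude), float(velocity))


reading = parse_reading("1472669109, SAN, 400, 200")
print(reading)
```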
Hydrator Studio
✦ Drag-and-drop GUI for visual data pipeline creation
✦ Rich library of pre-built sources, transforms, and sinks for data ingestion and ETL use cases
✦ Separation of pipeline creation from the execution framework (MapReduce, Spark, Spark Streaming, etc.)
✦ Hadoop-native and Hadoop distro-agnostic
Hydrator Data Pipeline
✦ Captures metadata, audit, and lineage information, visualized using Cask Tracker
✦ Pre- and post-run notifications, centralized metrics and log collection for ease of operation
✦ Simple Java API to build your own sources, transforms, and sinks with class loader isolation
✦ Spark ML-based plugins and Python transforms for data scientists
Out-of-the-box Integrations
✦ Elasticsearch, Cassandra, Kafka, SFTP, JMS and many more sources and sinks
✦ De-duplicate, Group By Aggregation, Row Denormalizer and other transforms
Custom Plugins
✦ Implement your own batch (or streaming) source, transform, and sink plugins using a simple Java API
Data Lake, Fraud Detection, Recommendation Engine, Sensor Data Analytics, Customer 360
Hydrator, Tracker
CASK DATA APP PLATFORM
Hadoop ecosystem, 50 different projects
Top 6 Hadoop distributions
Pipeline Implementation
Diagram: Logical Pipeline -> Planner -> Physical Workflow -> MR/Spark Executions (running on CDAP)
✦ Planner converts the logical pipeline to a physical execution plan
✦ Optimizes and bundles functions into one or more MR/Spark jobs, or into a Spark Streaming job in the case of a realtime pipeline
✦ CDAP is the runtime environment where all the components of the data pipeline are executed
✦ CDAP provides centralized log and metrics collection, transactions, lineage, and audit information
Hydrator Realtime Data Pipeline
✦ Generates micro-batches of data at regular intervals
✦ Supports sliding windows, aggregations, various transforms, joins and ML
✦ Checkpointing of pipeline state is coming soon
Streaming Source
✦ Uses Spark DStreams (Discretized Streams)
✦ Generates a new RDD every batch interval (a pipeline property, e.g. 10 sec)
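The idea of one collection per batch interval can be simulated in plain Python. This sketch does not use the Spark API: `micro_batches` is a hypothetical helper, and it buckets events by a timestamp carried in each event as a stand-in for arrival time, whereas a real streaming source batches whatever arrives during each interval.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp, payload) events into consecutive batches of
    `batch_interval` seconds, including empty batches for quiet intervals."""
    buckets = defaultdict(list)
    for timestamp, payload in events:
        buckets[timestamp // batch_interval].append(payload)
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    # Emit every interval in order, as a stream would, even when it is empty.
    return [buckets.get(i, []) for i in range(first, last + 1)]


# Illustrative events: (seconds since start, payload).
events = [(0, "a"), (3, "b"), (12, "c"), (35, "d")]
batches = micro_batches(events, batch_interval=10)
print(batches)
```

Each inner list plays the role of one RDD in the discretized stream, and the downstream stages run once per list.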
Windowing
✦ Sliding window is defined by size and slide interval, both of which are multiples of the batch interval
In this example, size = 3 and slide = 2 (both in batch intervals)
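A small sketch of how a window with size = 3 and slide = 2 groups batches, written in plain Python rather than against the Spark API. The `sliding_windows` helper is hypothetical, and it assumes that only full windows are emitted, starting once the first `size` batches are available. How a real stream aligns its first windows may differ.

```python
def sliding_windows(batches, size, slide):
    """Return the contents of each full window over a list of batches.

    `size` and `slide` are counted in batches, i.e. multiples of the
    batch interval.
    """
    windows = []
    for end in range(size, len(batches) + 1, slide):
        # Flatten the last `size` batches into one window.
        window = [event for batch in batches[end - size:end] for event in batch]
        windows.append(window)
    return windows


# Seven illustrative batches holding one event each.
batches = [[1], [2], [3], [4], [5], [6], [7]]
windows = sliding_windows(batches, size=3, slide=2)
print(windows)
```

Because the slide is smaller than the size, consecutive windows overlap by one batch, so the same event can contribute to two window results.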
ODBC Connector for BI Tools
✦ Explore CDAP Streams and Datasets from popular BI tools using the CDAP ODBC connector
Upcoming capabilities
* Checkpointing capability in Spark Streaming (HYDRATOR-378)
* More ML and other plugins
Thank you!
[email protected]
@CaskData
github.com/caskdata/cdap
github.com/caskdata/hydrator-plugins
Questions?
Self-Service Data Ingestion and ETL for Data Lakes
Built for Production on CDAP
Rich Drag-and-Drop User Interface
Open Source & Highly Extensible