abdw17-lightning talks track-simplifying big data ingestion challenge
TRANSCRIPT
Simplifying Big Data Ingestion Challenge
Apex Big Data World 2017
© 2017 DataTorrent Confidential – Do Not Distribute
DataTorrent Vision - Productize Big Data
• Big Data is neither Productized nor Operationalized
• Total Cost of Ownership (TCO) = Time to Develop + Time to Launch + Cost of ongoing Operations
• Provide a Product to ...
  • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
  • Support Dev, Test, Prod cycle to Launch Apps quickly
  • Manage and Visualize Applications for Operability
Speaker: Ashwin Chandra Putta
Product Manager, DataTorrent
Committer for Apache Apex
Previous experience in Oracle, Propel IT
@ashwinchandrap
Speaker: Yogi / Devendra Vyavahare
Engineer @ DataTorrent
Committer for Apache Apex
Previous experience in Bio-informatics, Web applications
@yogidevendra
Big Data Ecosystem: Where Apex & DataTorrent fit
[Diagram: Data Sources (Sensor Data, Social Media, Web Servers, App Servers, Click Streams) → Oper1 → Oper2 → Oper3 on Hadoop (YARN + HDFS) → Real-time Analytics & Visualizations, Data Visualization]
Apex as a framework
[Diagram: Browser → Web Server → Logs → Kafka → Kafka Input (logs) → Decompress, Parse, Filter → Dimensions Aggregate → Kafka]
• Variety of sources: IoT, Kafka, files, social media etc.
• Variety of sinks: Kafka, files, databases etc.
  * Supports low-latency real-time visualizations as well
• Unbounded and continuous data streams
• Batch support: batch processed as stream
• In-memory processing with temporal window boundaries
• Stateful operations: Aggregation, Rules etc. → Analytics
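The pipeline on this slide (decompress, parse, filter, then aggregate within window boundaries) can be illustrated with a small library-free Python sketch. This is not Apex code: the 60-second window width, the JSON log format, and the `ts` and `page` field names are all assumptions made for the example.

```python
import gzip
import json
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical window width, not an Apex default


def decompress(blob: bytes) -> str:
    """Stand-in for the Decompress stage: gunzip one log record."""
    return gzip.decompress(blob).decode("utf-8")


def parse(line: str) -> dict:
    """Stand-in for the Parse stage: one JSON object per log line (assumed format)."""
    return json.loads(line)


def keep(event: dict) -> bool:
    """Stand-in for the Filter stage: drop events with no page."""
    return bool(event.get("page"))


def aggregate(blobs):
    """Count events per (window start, page); the dict plays the role of operator state."""
    counts = defaultdict(int)
    for blob in blobs:
        event = parse(decompress(blob))
        if not keep(event):
            continue
        # Align the event timestamp to the start of its tumbling window.
        window_start = event["ts"] - event["ts"] % WINDOW_SECONDS
        counts[(window_start, event["page"])] += 1
    return dict(counts)


def encode(ts: int, page: str) -> bytes:
    """Helper that builds a compressed sample log record."""
    return gzip.compress(json.dumps({"ts": ts, "page": page}).encode("utf-8"))


logs = [encode(5, "/home"), encode(42, "/home"), encode(61, "/cart"), encode(70, "")]
result = aggregate(logs)
print(result)
```

Two events fall in the window starting at 0, one in the window starting at 60, and the event with an empty page is filtered out. In a real streaming engine the state would also need to be checkpointed for fault tolerance, which this sketch leaves out.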
DataTorrent RTS as a Product
[Diagram: product stack]
• Industry Solutions: FinServ Fraud Prevention, Adtech Optimization, Customer Experience Optimization
• Real-time Data Readiness
• Application Templates: Ingestion/Replication, Anomaly Detection, Query/OLAP, Machine Scoring
• Ease of Use: Management & Monitoring, Built-in Visualization, Rapid Development, Batch Support, High-level API
• Building Blocks: Apex-Malhar Operator Library (Ingest | Enrich | Analyze | Query | Alert | Automate | Output)
• Processing Core: Apache Apex (Stream & Batch Processing | Fault Tolerance | Scale & Performance)
• Big Data Platform: Hadoop 2.0 – YARN + HDFS
• Infrastructure: On Premises (Cloudera, MapR, Hortonworks), Cloud (AWS, Azure, Google)
App Templates – why?
• Repeatable application patterns
• App level code reuse for general purpose operators

Pattern | Use Case | Sources | Processors | Sinks
Data Integration | Disaster Recovery, Cluster Backup, Cloud Backup, Raw data ingestion | HDFS, Kafka, JDBC, S3 | → | HDFS, S3
Data Ingestion | Ingestion: Dedup and Enrich for 360 views | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra
Data Ingestion | Ingestion: Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS
Analytics | Fraud Scoring, Anomaly Detection | Kafka | H2O or Custom | HDFS
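The Parser → Deduper → Enricher → Formatter chain from the table can be sketched as a chain of Python generators, one per stage. This is an illustration of the pattern only, not the App Template code: the `id,name` input format, the `region` lookup, and the pipe-delimited output are assumptions chosen for the example.

```python
import csv


def parser(lines):
    """Parse 'id,name' CSV lines into dicts (assumed input format)."""
    for row in csv.reader(lines):
        yield {"id": row[0], "name": row[1]}


def deduper(records):
    """Drop records whose id was already seen; the set is in-memory state."""
    seen = set()
    for rec in records:
        if rec["id"] not in seen:
            seen.add(rec["id"])
            yield rec


def enricher(records, lookup):
    """Attach a region from a lookup table; ids without an entry get 'unknown'."""
    for rec in records:
        yield {**rec, "region": lookup.get(rec["id"], "unknown")}


def formatter(records):
    """Render each record as a pipe-delimited output line."""
    for rec in records:
        yield "|".join([rec["id"], rec["name"], rec["region"]])


def run(lines, lookup):
    """Wire the four stages together in the order given in the table."""
    return list(formatter(enricher(deduper(parser(lines)), lookup)))


source = ["1,alice", "2,bob", "1,alice"]
regions = {"1": "emea"}  # hypothetical enrichment data
output = run(source, regions)
print(output)
```

Each stage only depends on the shape of the records it receives, which is what makes a stage reusable across templates. Custom business logic would slot in as one more generator in the chain.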
Bridging the TTV Gap – App Templates
Big Data App Templates
• Pre-built templates certified for scalability and durability
• Quick to import, configure and launch
• Easy to add custom business logic
Bridging the TTV Gap - Infrastructure
Delivered through key infrastructural components:
Component | Why?
App Hub | App Template delivery mechanism
Cloud Integration | Easy install and run on cloud infrastructures
Schema Support | Reuse templates for multiple use cases without code change
Open Source Code | Template code accessible via GitHub to add custom logic
App Metrics and Visualizations | Dashboard for key application and operational metrics
App Hub – App Template Repository
• Central repository for big data applications
• Available on RTS and DataTorrent website
• App Templates delivered through App Hub
• Tested, published and maintained by DataTorrent
App Hub on DataTorrent RTS
App Hub on DataTorrent website
Visit: https://www.datatorrent.com/apphub/#/
App Templates benefits
Ease of use
• Quickly import and launch app template applications
• Easily add business logic by adding custom operators
Time to market and TCO
• Reduces time to production drastically
• Reduces cost of operations in production
Real-time Visualizations
• Real-time visualizations of operational metrics such as throughput, latency etc.
• Real-time visualizations of application data such as number of files processed, amount of data transferred etc.
App Template Demo
• Kafka to HDFS Application Template
• Kinesis to S3 Application Template with AWS install script
Check out more apps at: https://www.datatorrent.com/apphub/
Roadmap
• Schema propagation
  • Apply schema once at the input operator
  • No code changes for changing schema
• Visualizations – widgets on app data
  • Metrics such as size of data moved, lines per file, number of errors
  • Custom user defined metrics using Apex auto-metrics
• Cloud Ecosystem Solutions
  • AWS Templates
  • Azure integration
  • Google Cloud integration
• Analytics Templates
• Vertical Industry Solutions
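The schema propagation idea (declare the schema once at the input, then change it without touching code) might look roughly like the following Python sketch. The JSON schema layout, the `make_parser` and `total` helpers, and the field names are hypothetical; they are not the DataTorrent schema format or API.

```python
import json

# Maps a declared type name to a Python cast (assumed type vocabulary).
CASTS = {"string": str, "int": int, "float": float}


def make_parser(schema_json: str, delimiter: str = ","):
    """Build a line parser from a schema declared once at the input."""
    fields = json.loads(schema_json)["fields"]

    def parse(line: str) -> dict:
        values = line.split(delimiter)
        if len(values) != len(fields):
            raise ValueError(f"expected {len(fields)} fields, got {len(values)}")
        return {f["name"]: CASTS[f["type"]](v) for f, v in zip(fields, values)}

    return parse


def total(records, field):
    """Downstream step: refers to fields by name only, never by position."""
    return sum(rec[field] for rec in records)


schema_v1 = '{"fields": [{"name": "user", "type": "string"}, {"name": "bytes", "type": "int"}]}'
schema_v2 = (
    '{"fields": [{"name": "user", "type": "string"}, '
    '{"name": "host", "type": "string"}, {"name": "bytes", "type": "int"}]}'
)

parse_v1 = make_parser(schema_v1)
parse_v2 = make_parser(schema_v2)
total_v1 = total([parse_v1("ann,10"), parse_v1("bob,5")], "bytes")
total_v2 = total([parse_v2("ann,web1,10"), parse_v2("bob,web2,5")], "bytes")
print(total_v1, total_v2)
```

Adding the `host` column in `schema_v2` changes only the schema text. The downstream `total` step is unchanged because it looks fields up by name.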
Questions?