abdw17-lightning talks track-simplifying big data ingestion challenge

Simplifying Big Data Ingestion Challenge

Apex Big Data World 2017

© 2017 DataTorrent Confidential – Do Not Distribute

DataTorrent Vision - Productize Big Data

• Big Data is neither Productized nor Operationalized

• Total Cost of Ownership (TCO) = Time to Develop + Time to Launch + Cost of ongoing Operations

• Provide a Product to ...

  • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability

  • Support Dev, Test, Prod cycle to Launch Apps quickly

  • Manage and Visualize Applications for Operability


Speaker: Ashwin Chandra Putta

Product Manager, DataTorrent

Committer for Apache Apex

Previous experience in Oracle, Propel IT

@ashwinchandrap

https://twitter.com/ashwinchandrap



Speaker: Yogi / Devendra Vyavahare

Engineer @ DataTorrent

Committer for Apache Apex

Previous experience in Bio-informatics, Web applications

@yogidevendra

https://twitter.com/yogidevendra



Big Data Ecosystem: Where Apex & DataTorrent fit

[Diagram: data sources (sensor data, social media, web servers, app servers, click streams) flow through a chain of operators (Oper1, Oper2, Oper3) running on Hadoop (YARN + HDFS) and feed real-time analytics and data visualization.]


Apex as a framework

[Diagram: Browser → Web Server → Logs → Kafka → Kafka Input (logs) → Decompress, Parse, Filter → Dimensions Aggregate → Kafka]

• Variety of sources – IoT, Kafka, files, social media etc.

• Variety of sinks – Kafka, files, databases etc.*

• Unbounded and continuous data streams

• Batch support, batch processed as stream

• In-memory processing with temporal window boundaries

• Stateful operations: Aggregation, Rules etc. → Analytics

* Supports low latency real-time visualizations as well
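
The "temporal window boundaries" and "stateful operations" points above can be illustrated with a small sketch. It is plain Python, not the Apex API: the event shape, the `window_seconds` parameter and the `aggregate_by_window` helper are assumptions made for illustration, and the sample events are made up.

```python
from collections import defaultdict


def aggregate_by_window(events, window_seconds=60):
    """Count events per (window start, key) for a stream of (timestamp, key) pairs.

    Hypothetical helper: a tumbling-window count whose state is kept in memory.
    """
    counts = defaultdict(int)
    for timestamp, key in events:
        # Align each event to the start of its fixed-size (tumbling) window.
        window_start = int(timestamp // window_seconds) * window_seconds
        counts[(window_start, key)] += 1
    return dict(counts)


# Made-up sample events: (seconds since start, page requested)
events = [(0, "/home"), (30, "/home"), (59, "/cart"), (60, "/home"), (125, "/cart")]
result = aggregate_by_window(events, window_seconds=60)
```

Here the dictionary of counts plays the role of operator state, and the window start acts as the temporal boundary at which results could be emitted downstream.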


DataTorrent RTS as a Product

[Architecture diagram, top to bottom]

• Industry Solutions: Real-time Data Readiness, FinServ Fraud Prevention, Adtech Optimization, Customer Experience Optimization

• Application Templates: Ingestion/Replication, Anomaly Detection, Query/OLAP, Machine Scoring

• Ease of Use: Management & Monitoring, Built-in Visualization, Rapid Development, Batch Support, High-level API

• Building Blocks: Apex-Malhar Operator Library (Ingest | Enrich | Analyze | Query | Alert | Automate | Output)

• Processing Core: Apache Apex (Stream & Batch Processing | Fault Tolerance | Scale & Performance)

• Big Data Platform: Hadoop 2.0 – YARN + HDFS

• Infrastructure: On Premises (Cloudera, MapR, Hortonworks), Cloud (AWS, Azure, Google)


App Templates – why?

• Repeatable application patterns

• App level code reuse for general purpose operators

Pattern | Use Case | Sources | Processors | Sinks

Data Integration | Disaster Recovery, Cluster Backup, Cloud Backup, Raw data ingestion | HDFS, Kafka, JDBC, S3 | → | HDFS, S3

Data Ingestion | Ingestion: Dedup and Enrich for 360 views | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra

Data Ingestion | Ingestion: Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS

Analytics | Fraud Scoring, Anomaly Detection | Kafka | H2O or Custom | HDFS
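
The Parser → Deduper → Enricher → Formatter chain listed above can be sketched as a set of composable steps. This is a plain-Python illustration of the pattern, not code from the app templates: the function names, the JSON input format, the `id` and `country` fields and the lookup table are all assumptions.

```python
import json


def parse(lines):
    """Parser: turn raw JSON lines into dicts, skipping malformed input."""
    for line in lines:
        try:
            yield json.loads(line)
        except json.JSONDecodeError:
            continue


def dedupe(records, key="id"):
    """Deduper: drop records whose key has already been seen."""
    seen = set()
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            yield record


def enrich(records, lookup):
    """Enricher: attach a region from a lookup table (None if unknown)."""
    for record in records:
        yield {**record, "region": lookup.get(record["country"])}


def format_csv(records, fields):
    """Formatter: render each record as one comma-separated line."""
    for record in records:
        yield ",".join(str(record[f]) for f in fields)


# Made-up input: one duplicate and one malformed line.
raw = [
    '{"id": 1, "country": "US"}',
    '{"id": 1, "country": "US"}',
    'not json',
    '{"id": 2, "country": "IN"}',
]
regions = {"US": "AMER", "IN": "APAC"}
output = list(format_csv(enrich(dedupe(parse(raw)), regions), ["id", "country", "region"]))
```

Each step only depends on the records it receives, which is what makes a stage reusable across patterns: swapping the formatter or adding a filter does not touch the other stages.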


Bridging the TTV Gap – App Templates

Big Data App Templates

• Pre-built templates certified for scalability and durability

• Quick to import, configure and launch

• Easy to add custom business logic


Bridging the TTV Gap - Infrastructure

Delivered through key infrastructural components:

Component | Why?

App Hub | App Template delivery mechanism

Cloud Integration | Easy install and run on cloud infrastructures

Schema Support | Reuse templates for multiple use cases without code change

Open Source Code | Template code accessible via GitHub to add custom logic

App Metrics and Visualizations | Dashboard for key application and operational metrics


App Hub – App Template Repository

• Central repository for big data applications

• Available on RTS and DataTorrent website

• App Templates delivered through App Hub

• Tested, published and maintained by DataTorrent


App Hub on DataTorrent RTS


App Hub on DataTorrent website

Visit: https://www.datatorrent.com/apphub/#/



App Templates benefits

Ease of use

• Quickly import and launch app template applications

• Easily add business logic by adding custom operators

Time to market and TCO

• Reduces time to production drastically

• Reduces cost of operations in production

Real-time Visualizations

• Real-time visualizations of operational metrics such as throughput, latency etc.

• Real-time visualizations of application data such as number of files processed, amount of data transferred etc.


App Template Demo

• Kafka to HDFS Application Template

• Kinesis to S3 Application Template with AWS install script

Check out more apps at: https://www.datatorrent.com/apphub/


Roadmap

• Schema propagation

  • Apply schema once at the input operator

  • No code changes for changing schema

• Visualizations – widgets on app data

  • Metrics such as size of data moved, lines per file, number of errors

  • Custom user defined metrics using Apex auto-metrics

• Cloud Ecosystem Solutions

  • AWS Templates

  • Azure integration

  • Google Cloud integration

• Analytics Templates

• Vertical Industry Solutions
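
The schema propagation idea (apply the schema once at the input, change it without touching downstream code) can be sketched as follows. This is a plain-Python illustration under assumptions, not the planned Apex feature: the schema format (an ordered list of field name and type pairs) and both function names are hypothetical.

```python
def apply_schema(line, schema, delimiter=","):
    """Input step: convert one delimited line into a typed record using the schema."""
    values = line.split(delimiter)
    return {name: cast(value) for (name, cast), value in zip(schema, values)}


def to_output(record):
    """Downstream step: written against records, not against a fixed field list."""
    return "|".join(f"{name}={value}" for name, value in record.items())


# Hypothetical schemas: the second adds a field, with no change to to_output.
schema_v1 = [("user", str), ("amount", float)]
schema_v2 = [("user", str), ("amount", float), ("currency", str)]

out_v1 = to_output(apply_schema("alice,12.5", schema_v1))
out_v2 = to_output(apply_schema("alice,12.5,USD", schema_v2))
```

Only the schema passed to the input step differs between the two runs; the downstream step picks up the new field from the record itself.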


Questions?
