Simplifying Big Data Ingestion Challenge Apex Big Data World 2017

Upload: datatorrent

Post on 12-Apr-2017


Page 1: ABDW17-Lightning Talks track-Simplifying Big Data Ingestion Challenge

Simplifying Big Data Ingestion Challenge

Apex Big Data World 2017

Page 2

© 2017 DataTorrent Confidential – Do Not Distribute

DataTorrent Vision - Productize Big Data

• Big Data is neither Productized nor Operationalized

• Total Cost of Ownership (TCO) = Time to Develop + Time to Launch + Cost of ongoing Operations

• Provide a Product to ...
  • Build Applications Rapidly with Simple Interfaces, Pre-Built Apps, Code Reuse & Debuggability
  • Support Dev, Test, Prod cycle to Launch Apps quickly
  • Manage and Visualize Applications for Operability

Page 3

Speaker: Ashwin Chandra Putta

Product Manager, DataTorrent

Committer for Apache Apex

Previous experience in Oracle, Propel IT

@ashwinchandrap

Page 4

Speaker: Yogi / Devendra Vyavahare

Engineer @ DataTorrent

Committer for Apache Apex

Previous experience in Bio-informatics, Web applications

@yogidevendra

Page 5

Big Data Ecosystem: Where Apex & DataTorrent fit

[Diagram: data sources (sensor data, social media, web servers, app servers, click streams) feed operators Oper1, Oper2 and Oper3 running on Hadoop (YARN + HDFS); the outputs are real-time analytics & visualizations and data visualization.]

Page 6

Apex as a framework

[Diagram: example log-processing pipeline. Labels: Browser, Web Server, Logs, Kafka, Kafka Input (logs), Decompress/Parse/Filter, Dimensions Aggregate, Kafka.]

Variety of sources - IoT, Kafka, files, social media etc.

Variety of sinks - Kafka, files, databases etc.

* Supports low-latency real-time visualizations as well

Unbounded and continuous data streams

Batch support, batch processed as stream

In-memory processing with temporal window boundaries

Stateful operations: Aggregation, Rules etc. → Analytics
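
To make "in-memory processing with temporal window boundaries" and "batch processed as stream" concrete, here is a minimal sketch in plain Python. It is not the Apex API: the function name `windowed_counts`, the `(timestamp, key)` event shape, and the one-second window are assumptions made for illustration, and it assumes events arrive in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical event shape: (timestamp in seconds, key to aggregate on).
Event = Tuple[float, str]


def windowed_counts(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[int, Dict[str, int]]]:
    """Yield (window_index, per-key counts) whenever a window boundary is crossed.

    State is held in memory only and reset at each boundary.
    Assumes events arrive in timestamp order.
    """
    current_window = None
    counts: Dict[str, int] = defaultdict(int)
    for timestamp, key in events:
        window = int(timestamp // window_seconds)
        if current_window is not None and window != current_window:
            # Boundary crossed: emit the finished window and start fresh state.
            yield current_window, dict(counts)
            counts = defaultdict(int)
        current_window = window
        counts[key] += 1
    if current_window is not None:
        # A bounded (batch) input simply ends; flush the last open window.
        yield current_window, dict(counts)


log_events = [(0.2, "GET"), (0.7, "POST"), (0.9, "GET"), (1.1, "GET"), (2.5, "POST")]
results = list(windowed_counts(log_events, window_seconds=1.0))
for window_index, per_key in results:
    print(window_index, per_key)
```

The same generator handles an unbounded iterator and a finite list, which is the sense in which a batch can be treated as a stream that happens to end.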

Page 7

DataTorrent RTS as a Product

[Diagram: layered product stack. Recoverable labels:]

Industry Solutions: FinServ Fraud Prevention, Adtech Optimization, Customer Experience Optimization

Application Templates: Ingestion/Replication

Other labels: Real-time Data Readiness, Anomaly Detection, Query/OLAP, Machine Scoring

Building Blocks: Apex-Malhar Operator Library (Ingest | Enrich | Analyze | Query | Alert | Automate | Output)

Processing Core: Apache Apex (Stream & Batch Processing | Fault Tolerance | Scale & Performance)

Big Data Platform: Hadoop 2.0 – YARN + HDFS

Infrastructure: On Premises (Cloudera, MapR, Hortonworks), Cloud (AWS, Azure, Google)

Ease of Use: Management & Monitoring, Built-in Visualization, Rapid Development, Batch Support, High-level API

Page 8

App Templates – why?

• Repeatable application patterns

• App level code reuse for general purpose operators

Pattern | Use Case | Sources | Processors | Sinks

Data Integration | Disaster Recovery, Cluster Backup, Cloud Backup, Raw data ingestion | HDFS, Kafka, JDBC, S3 | → | HDFS, S3

Data Ingestion | Ingestion: Dedup and Enrich for 360 views | HDFS, JDBC, Kafka | Parser → Deduper → Enricher → Formatter | HDFS, Cassandra

Data Ingestion | Ingestion: Merge & Transform Data Streams | Kafka, JDBC, File | Stream Merge → Transform → Filter → Enricher | HDFS

Analytics | Fraud Scoring, Anomaly Detection | Kafka | H2O or Custom | HDFS
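
The "Dedup and Enrich" pattern above chains Parser → Deduper → Enricher → Formatter. The following is a rough, self-contained Python sketch of that chain, written as generators so each stage consumes the previous one record by record. It does not use Apex or Malhar operators; the function names, the JSON input format, the `id` and `country` fields, and the in-memory lookup table are all assumptions for illustration.

```python
import json
from typing import Dict, Iterable, Iterator, List


def parse(lines: Iterable[str]) -> Iterator[dict]:
    """Parser: turn raw JSON lines into records, skipping malformed input."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if isinstance(record, dict):
            yield record


def dedup(records: Iterable[dict], key: str) -> Iterator[dict]:
    """Deduper: drop records whose key value has already been seen."""
    seen = set()
    for record in records:
        value = record.get(key)
        if value in seen:
            continue
        seen.add(value)
        yield record


def enrich(
    records: Iterable[dict], lookup: Dict[str, str], key: str, field: str
) -> Iterator[dict]:
    """Enricher: add a field looked up from reference data."""
    for record in records:
        yield {**record, field: lookup.get(record.get(key), "unknown")}


def format_csv(records: Iterable[dict], columns: List[str]) -> Iterator[str]:
    """Formatter: render each record as a comma-separated line for the sink."""
    for record in records:
        yield ",".join(str(record.get(column, "")) for column in columns)


raw_lines = [
    '{"id": 1, "country": "US"}',
    '{"id": 1, "country": "US"}',  # duplicate, dropped by dedup
    "not json",                    # malformed, dropped by parse
    '{"id": 2, "country": "IN"}',
]
regions = {"US": "Americas", "IN": "APAC"}  # hypothetical reference data

output = list(
    format_csv(
        enrich(dedup(parse(raw_lines), key="id"), regions, key="country", field="region"),
        columns=["id", "country", "region"],
    )
)
print(output)
```

Because every stage has the same shape (records in, records out), a custom business-logic step can be inserted between any two stages without touching the others.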

Page 9

Bridging the TTV Gap – App Templates

Big Data App Templates

• Pre-built templates certified for scalability and durability

• Quick to import, configure and launch

• Easy to add custom business logic

Page 10

Bridging the TTV Gap - Infrastructure

Delivered through key infrastructural components:

Component | Why?

App Hub | App Template delivery mechanism

Cloud Integration | Easy install and run on cloud infrastructures

Schema Support | Reuse templates for multiple use cases without code change

Open Source Code | Template code accessible via GitHub to add custom logic

App Metrics and Visualizations | Dashboard for key application and operational metrics

Page 11

App Hub – App Template Repository

• Central repository for big data applications

• Available on RTS and DataTorrent website

• App Templates delivered through App Hub

• Tested, published and maintained by DataTorrent

Page 12

App Hub on DataTorrent RTS

Page 13

App Hub on DataTorrent website

Visit: https://www.datatorrent.com/apphub/#/

Page 14

App Templates benefits

Ease of use

• Quickly import and launch app template applications

• Easily add business logic by adding custom operators

Time to market and TCO

• Reduces time to production drastically

• Reduces cost of operations in production

Real-time Visualizations

• Real-time visualizations of operational metrics such as throughput, latency etc.

• Real-time visualizations of application data such as number of files processed, amount of data transferred etc.

Page 15

App Template Demo

• Kafka to HDFS Application Template

• Kinesis to S3 Application Template with AWS install script

Check out more apps at: https://www.datatorrent.com/apphub/

Page 16

Roadmap

• Schema propagation
  • Apply schema once at the input operator
  • No code changes for changing schema

• Visualizations – widgets on app data
  • Metrics such as size of data moved, lines per file, number of errors
  • Custom user-defined metrics using Apex auto-metrics

• Cloud Ecosystem Solutions
  • AWS Templates
  • Azure integration
  • Google Cloud integration

• Analytics Templates

• Vertical Industry Solutions
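
One way to picture the schema propagation item ("apply schema once at the input operator", "no code changes for changing schema") is a pipeline whose stages take the schema as data instead of hard-coding field names. The sketch below is plain Python and only illustrates the idea; it is not how DataTorrent implements the feature, and the schema format, function names, and sample fields are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema format: ordered (field name, type converter) pairs,
# declared once and handed to every stage that needs it.
Schema = List[Tuple[str, Callable[[str], object]]]


def parse_with_schema(line: str, schema: Schema, delimiter: str = ",") -> Dict[str, object]:
    """Input stage: split a delimited line and convert each value per the schema."""
    values = line.split(delimiter)
    if len(values) != len(schema):
        raise ValueError(f"expected {len(schema)} fields, got {len(values)}")
    return {name: cast(value.strip()) for (name, cast), value in zip(schema, values)}


def format_with_schema(record: Dict[str, object], schema: Schema, delimiter: str = "|") -> str:
    """Output stage: emit fields in schema order without naming any of them in code."""
    return delimiter.join(str(record[name]) for name, _ in schema)


# Use case 1: payment records.
payment_schema: Schema = [("user_id", int), ("amount", float), ("city", str)]
payment = parse_with_schema("42, 19.5, Pune", payment_schema)
payment_line = format_with_schema(payment, payment_schema)

# Use case 2: sensor readings. Only the schema changes; the functions do not.
sensor_schema: Schema = [("sensor", str), ("reading", float)]
sensor = parse_with_schema("t1,3.0", sensor_schema)
sensor_line = format_with_schema(sensor, sensor_schema)

print(payment_line)
print(sensor_line)
```

Swapping `payment_schema` for `sensor_schema` reuses the same two functions unchanged, which mirrors the "reuse templates for multiple use cases without code change" goal stated for Schema Support.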

Page 17

Questions?