spark seattle meetup - breaking etl barrier with spark streaming

Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming. Santosh Sahoo, Architect at Concur

Upload: santosh-sahoo

Post on 16-Aug-2015


TRANSCRIPT

Page 1: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming

Santosh Sahoo
Architect at Concur

Page 2: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

About us

Concur (now part of SAP) provides travel and expense management services to businesses.

The Data Insights team is building solutions that give customers access to data, visualization, and reporting.

Page 3: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Stack so far..

App -> OLTP -> ETL -> OLAP -> Report

Page 4: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Numbers

7K OLTP database sources
14K OLAP reporting DBs
28K ETL jobs
300M rows (compacted), 2B row changes
Only ~20 failures a night

Page 5: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Batch ETL challenges

Scheduled (high latency)
Processing time
Hard to scale
Not fault tolerant
Monolithic
High maintenance

Page 6: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Moving forward

Scheduled (high latency) -> Streaming, real time
Hard to scale -> Scalable
Monolithic -> Modular
Not fault tolerant -> Fault tolerant
ACID consistent, normalized -> Eventual consistency
High maintenance (single tenant) -> Reduce maintenance overhead (multi tenant)

Page 7: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Streaming Data Pipeline

Source -> Flow Manager -> Streaming Processor -> Storage -> Reporting

Applications

Mobile Devices

Sensors

IoT - Internet of Things

Database log scraping

Alert

Message Queues

Kafka

Flume

Azure Event Hubs

AWS Kinesis

HDFS

Storm

Spark Streaming

Azure Stream Analytics

Samza

Flink

RDBMS

NoSQL

HDFS

Redshift

Custom App D3

Tableau

Cognos

Excel

Page 8: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Spark Streaming

What? A data processing framework to build scalable, fault-tolerant streaming applications.

Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
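
The "reuse the same code" point can be illustrated without Spark at all. The following is a minimal pure-Python sketch, not the Spark API: the record shape (expense type, amount) and the function names `total_by_type`, `run_batch`, and `run_stream` are hypothetical and chosen only for illustration. The aggregation logic is written once and applied both to a full historical dataset and to a sequence of small micro-batches.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical record shape: (expense_type, amount)
Event = Tuple[str, float]


def total_by_type(events: Iterable[Event]) -> Counter:
    """Business logic written once: sum amounts per expense type."""
    totals: Counter = Counter()
    for expense_type, amount in events:
        totals[expense_type] += amount
    return totals


def run_batch(history: List[Event]) -> Counter:
    """Batch mode: apply the logic to the full historical dataset."""
    return total_by_type(history)


def run_stream(micro_batches: Iterable[List[Event]], state: Counter) -> Counter:
    """Streaming mode: apply the same logic to each micro-batch
    and merge the result into the running state."""
    for batch in micro_batches:
        state.update(total_by_type(batch))  # Counter.update adds values
    return state


if __name__ == "__main__":
    state = run_batch([("hotel", 100.0), ("taxi", 20.0)])
    state = run_stream([[("taxi", 5.0)], [("hotel", 50.0), ("meal", 12.0)]], state)
    print(dict(state))
```

Seeding the streaming state with the batch result is also a rough picture of joining a stream against historical data.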

Page 9: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Demo….

Page 10: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Kafka - Flow Management

No-nonsense logging
100K/s throughput vs 20K for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best-in-class integration with Spark
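
Log compaction is the item most relevant to replicating database row changes, so here is a toy model of the idea in plain Python. It is a sketch under simplifying assumptions, not Kafka's implementation: the `compact` function and the record shape are hypothetical, and delete markers (tombstones, modeled as a `None` value) are dropped immediately, whereas Kafka retains them for a configurable period before removal.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical record shape: (key, value); value None marks a delete (tombstone)
Record = Tuple[str, Optional[str]]


def compact(log: List[Record]) -> List[Record]:
    """Keep only the most recent record per key, in original log order."""
    last_index: Dict[str, int] = {}
    for i, (key, _) in enumerate(log):
        last_index[key] = i  # later occurrences overwrite earlier ones
    return [
        record
        for i, record in enumerate(log)
        # Survivor = latest record for its key, and not a tombstone
        if last_index[record[0]] == i and record[1] is not None
    ]


if __name__ == "__main__":
    changes = [
        ("row1", "a"),
        ("row2", "b"),
        ("row1", "c"),   # supersedes ("row1", "a")
        ("row2", None),  # deletes row2
        ("row3", "d"),
    ]
    print(compact(changes))
```

Under this model a long history of row changes collapses to one record per live key, which is the same relationship the Numbers slide draws between row changes and compacted rows.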

Page 11: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Spark Streaming Architecture

[Diagram labels: Driver / Master, Workers, Executors, Receiver, Source, data blocks D1-D4, Replication (D1, D2), WAL, DataStore, Task]

DStream - Discretized Stream of RDDs
RDD - Resilient Distributed Dataset
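
"Discretized" means the continuous stream is cut into small batches, one per batch interval, and each batch is then processed like an ordinary dataset. A minimal sketch of that slicing in plain Python follows; it is not Spark code, and `discretize` and the (timestamp, payload) event shape are hypothetical names used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def discretize(events: List[Tuple[float, str]], batch_interval: float) -> List[List[str]]:
    """Group (timestamp, payload) events into consecutive fixed-width batches."""
    if not events:
        return []
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        # Index of the batch interval this event falls into
        buckets[int(timestamp // batch_interval)].append(payload)
    first, last = min(buckets), max(buckets)
    # Emit one batch per interval, including empty ones, so batches stay
    # aligned with wall-clock intervals
    return [buckets.get(i, []) for i in range(first, last + 1)]


if __name__ == "__main__":
    stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
    for i, batch in enumerate(discretize(stream, batch_interval=1.0)):
        print(i, batch)
```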

Page 12: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Optimized Direct Kafka API

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
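
The idea behind the direct approach described in the linked post is that each batch is defined by a range of Kafka offsets per partition, rather than by whatever a long-running receiver happened to buffer. The toy model below sketches that bookkeeping in plain Python. It is an assumption-laden illustration, not the Spark or Kafka API: `OffsetRange` here is a local dataclass, and `plan_batch` and `read_range` are hypothetical helpers operating on an in-memory dictionary that stands in for a partitioned log.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class OffsetRange:
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(consumed: Dict[int, int], latest: Dict[int, int]) -> List[OffsetRange]:
    """Decide what the next batch covers: everything between the offsets
    already consumed and the latest offsets available, per partition."""
    return [
        OffsetRange(p, consumed.get(p, 0), latest[p])
        for p in sorted(latest)
        if latest[p] > consumed.get(p, 0)  # skip partitions with no new data
    ]


def read_range(log: Dict[int, List[str]], offsets: OffsetRange) -> List[str]:
    """Read exactly the records named by the offset range."""
    return log[offsets.partition][offsets.from_offset:offsets.until_offset]


if __name__ == "__main__":
    log = {0: ["a", "b", "c"], 1: ["x", "y"]}
    plan = plan_batch(consumed={0: 1, 1: 2}, latest={0: 3, 1: 2})
    for offsets in plan:
        print(offsets, read_range(log, offsets))
```

Because a batch is fully described by its offset ranges, a failed batch can be recomputed by re-reading the same ranges from the log, which is why this design does not need to copy incoming data into a separate write-ahead log.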

Page 13: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Architecture

Page 14: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Reporting Next Gen Architecture

[Diagram labels: OLTP (Expense, Travel, TTX, API); Import (FTP, HTTP, SMTP); Protobuf, JSON; Service bus; Broker (Kafka); Stream Processor (Spark); HDFS; Tachyon; Hive / Spark SQL; OLAP; HANA; HANA OLAP; Replication; Load balance, Failover; Normalization; Extract, Compensate; Data {Quality, Correction, Analytics}; Migrate method; API/SQL; Reporting (Cognos, Tableau ?); unlabeled markers "P" and "C"]

Page 15: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Can Spark Streaming survive Chaos Monkey?

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Page 16: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

QnA

Page 17: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

concur.com/en-us/careers

We are hiring

Page 18: Spark Seattle meetup - Breaking ETL barrier with Spark Streaming

Thank you!