Spark Seattle Meetup - Breaking ETL Barrier with Spark Streaming

Posted on 16-Aug-2015

Category: Data & Analytics

TRANSCRIPT

Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming

Santosh Sahoo, Architect at Concur

About us

Concur (now part of SAP) provides travel and expense management services to businesses.

The Data Insights team is building solutions that give customers access to data, visualization, and reporting.

Stack so far..

App → OLTP → ETL → OLAP → Report

Numbers

7K OLTP database sources

14K OLAP reporting DBs

28K ETL jobs

300M rows (compacted), 2B row changes

Only ~20 failures a night

Batch ETL challenges

Scheduled (high latency)

Processing time

Hard to scale

Not fault tolerant

Monolithic

High maintenance

Moving forward

Scheduled (high latency) → Streaming, real time

Hard to scale → Scalable

Monolithic → Modular

Not fault tolerant → Fault tolerant

ACID consistent, normalized → Eventual consistency

High maintenance (single tenant) → Reduced maintenance overhead (multi tenant)

Streaming Data Pipeline

Source → Flow Manager → Streaming Processor → Storage → Reporting

Applications

Mobile Devices

Sensors

IoT - Internet of Things

Database log scraping

Alert

Message Queues

Kafka

Flume

Azure Event Hubs

AWS Kinesis

HDFS

Storm

Spark Streaming

Azure Stream Analytics

Samza

Flink

RDBMS

NoSQL

HDFS

Redshift

Custom App D3

Tableau

Cognos

Excel

Spark Streaming

What? A data processing framework to build scalable, fault-tolerant streaming applications.

Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
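To make the "reuse the same code" point concrete, here is a minimal pure-Python sketch of the idea. It does not use the Spark API; the record schema (customer id, expense amount) and the function names total_by_customer, run_batch and run_streaming are hypothetical, chosen only for illustration. The same aggregation function is applied once to a historical dataset and then to each incoming micro-batch, whose results are merged into state seeded from the historical totals.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical record schema: (customer_id, expense_amount)
Record = Tuple[str, float]


def total_by_customer(records: Iterable[Record]) -> Dict[str, float]:
    """Business logic written once: sum amounts per customer."""
    totals: Dict[str, float] = defaultdict(float)
    for customer_id, amount in records:
        totals[customer_id] += amount
    return dict(totals)


def run_batch(history: List[Record]) -> Dict[str, float]:
    """Batch mode: apply the logic to the full historical dataset."""
    return total_by_customer(history)


def run_streaming(
    history_totals: Dict[str, float],
    micro_batches: Iterable[List[Record]],
) -> Iterator[Dict[str, float]]:
    """Streaming mode: apply the same logic to every micro-batch and merge
    the result into running state seeded from the historical totals."""
    state = dict(history_totals)
    for batch in micro_batches:
        for customer_id, delta in total_by_customer(batch).items():
            state[customer_id] = state.get(customer_id, 0.0) + delta
        # Emit a snapshot of the state after each micro-batch
        yield dict(state)


if __name__ == "__main__":
    history = [("a", 10.0), ("b", 5.0)]
    incoming = [[("a", 1.0)], [("b", 2.0), ("c", 3.0)]]
    for snapshot in run_streaming(run_batch(history), incoming):
        print(snapshot)
```

In Spark itself the shared logic would be expressed as transformations on RDDs; the sketch only mirrors the structure of that reuse.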

Demo….

Kafka - Flow Management

No-nonsense logging

100K/s throughput vs 20K/s for RabbitMQ

Log compaction

Durable persistence

Partition tolerance

Replication

Best-in-class integration with Spark
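Log compaction is the feature that lets billions of row changes collapse into the latest state per row. The following is a simplified pure-Python model of the idea, not Kafka's implementation: the LogEntry layout and the compact function are hypothetical, and real Kafka retains delete markers (tombstones) for a configurable period before dropping them, whereas this sketch drops them immediately.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical log entry: (offset, key, value); a value of None is a delete marker
LogEntry = Tuple[int, str, Optional[str]]


def compact(log: List[LogEntry]) -> List[LogEntry]:
    """Keep only the most recent entry for each key, preserving offset order.

    Keys whose latest entry is a delete marker are removed entirely.
    """
    latest: Dict[str, LogEntry] = {}
    for entry in log:
        _, key, _ = entry
        latest[key] = entry  # later entries overwrite earlier ones
    survivors = [e for e in latest.values() if e[2] is not None]
    return sorted(survivors, key=lambda e: e[0])


if __name__ == "__main__":
    changes = [
        (0, "row1", "v1"),
        (1, "row2", "v1"),
        (2, "row1", "v2"),   # supersedes offset 0
        (3, "row2", None),   # deletes row2
        (4, "row3", "v1"),
    ]
    print(compact(changes))
```

A consumer that replays the compacted log from the beginning sees the final value of every surviving key without processing each intermediate change.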

Spark Streaming Architecture

[Diagram labels: Source, Receiver, Driver, Master, three Workers each with an Executor, data blocks D1 to D4 with Replication, WAL, DataStore, Task]

DStream - Discretized Stream of RDDs

RDD - Resilient Distributed Dataset
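A rough way to picture "discretized" is that a continuous stream is cut into fixed-length time intervals, and each interval's records form one batch (one RDD in Spark). The sketch below models only that slicing in plain Python; the Event layout and the discretize function are hypothetical and are not part of Spark.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical event: (timestamp in seconds, payload)
Event = Tuple[float, str]


def discretize(events: List[Event], batch_interval: float) -> List[List[str]]:
    """Group events into consecutive fixed-length intervals.

    Returns one list of payloads per interval, from the first non-empty
    interval to the last, including empty intervals in between.
    """
    if batch_interval <= 0:
        raise ValueError("batch_interval must be positive")
    if not events:
        return []
    buckets: Dict[int, List[str]] = defaultdict(list)
    for timestamp, payload in events:
        buckets[int(timestamp // batch_interval)].append(payload)
    first, last = min(buckets), max(buckets)
    return [buckets.get(i, []) for i in range(first, last + 1)]


if __name__ == "__main__":
    stream = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (3.5, "d")]
    for index, batch in enumerate(discretize(stream, batch_interval=1.0)):
        print(index, batch)
```

Each inner list stands in for the RDD produced for one batch interval; an interval with no events still yields an (empty) batch.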

Optimized Direct Kafka API

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html
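As described in the linked post, the direct approach defines each batch by a range of Kafka offsets per partition instead of receiving data through a receiver. The sketch below is a simplified model of that planning step in plain Python. The OffsetRange tuple, the plan_batch function and the max_per_partition cap are illustrative names, not the Spark API, and the topic name "expenses" is made up.

```python
from typing import Dict, List, NamedTuple


class OffsetRange(NamedTuple):
    """Hypothetical description of the slice of one partition read by one batch."""
    topic: str
    partition: int
    from_offset: int   # inclusive
    until_offset: int  # exclusive


def plan_batch(
    topic: str,
    consumed: Dict[int, int],
    latest: Dict[int, int],
    max_per_partition: int,
) -> List[OffsetRange]:
    """Decide which offsets the next batch should read from each partition.

    consumed: next offset to read per partition (missing partitions start at 0)
    latest:   current end offset per partition
    """
    ranges: List[OffsetRange] = []
    for partition in sorted(latest):
        start = consumed.get(partition, 0)
        end = min(latest[partition], start + max_per_partition)
        if end > start:  # skip partitions with nothing new
            ranges.append(OffsetRange(topic, partition, start, end))
    return ranges


if __name__ == "__main__":
    plan = plan_batch(
        "expenses",
        consumed={0: 100, 1: 50},
        latest={0: 130, 1: 50, 2: 5},
        max_per_partition=20,
    )
    for offset_range in plan:
        print(offset_range)
```

Because a batch is fully described by its offset ranges, a failed batch can be recomputed by re-reading the same ranges from Kafka.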

Architecture

[Diagram labels: OLTP; Reporting (Cognos, Tableau ?); Stream Processor (Spark); HDFS; Import (FTP, HTTP, SMTP); Protobuf, JSON; Broker (Kafka); Hive/Spark SQL; OLAP; Load balance, Failover; HANA; HANA OLAP; Replication; Service bus; Normalization; Extract; Compensate; Data {Quality, Correction, Analytics}; Migrate method; API/SQL; Expense, Travel, TTX, API]

Reporting Next Gen Architecture

Tachyon

Can Spark Streaming survive Chaos Monkey?

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Q&A

concur.com/en-us/careers

We are hiring

Thank you!
