Breaking ETL barrier with Real-time reporting using Kafka, Spark Streaming

Upload: santosh-sahoo

Post on 13-Aug-2015

TRANSCRIPT

Page 1: Realtime Reporting using Spark Streaming

Breaking ETL barrier with Real-time reporting

using Kafka, Spark Streaming

Page 2: Realtime Reporting using Spark Streaming

About us

Concur (now part of SAP) provides travel and expense management services to businesses.

Page 3: Realtime Reporting using Spark Streaming

Data Insights

A team that is building solutions to provide customer access to data, visualization and reporting.

Expense
Travel
Invoice

Page 4: Realtime Reporting using Spark Streaming

About me

Santosh Sahoo
Principal Architect III, Data Insights

Page 5: Realtime Reporting using Spark Streaming

Stack so far..

App -> OLTP -> ETL -> OLAP -> Report

Page 6: Realtime Reporting using Spark Streaming

Numbers

7K OLTP database sources
14K OLAP reporting databases
28K ETL jobs
2B row changes
300M rows (compacted)
Only ~20 failures a night

Page 7: Realtime Reporting using Spark Streaming

Traditional ETL challenges

Scheduled (high latency)
Hard to scale
Failover and recovery
Monolithic-ness
Spaghetti (logic + SQL)

Page 8: Realtime Reporting using Spark Streaming

Moving forward

Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual consistency

Page 9: Realtime Reporting using Spark Streaming

Streaming Data Pipeline

Source -> Flow Management -> Processor -> Storage -> Querying

Page 10: Realtime Reporting using Spark Streaming

Data Source

Event bus for business events
Log scraping
Transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog)
Change Data Capture
Application messaging/JMS
Micro batching (high watermark, change tracking)
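The high-watermark style of micro batching mentioned above can be sketched in plain Scala. This is a minimal model, not the team's implementation: the `Row` shape, the `modifiedAt` column, and the `nextBatch` helper are all assumptions made for illustration.

```scala
object HighWatermark {
  // Hypothetical row shape: a primary key, a payload, and a last-modified
  // timestamp (epoch millis) maintained by the source database.
  final case class Row(id: Long, payload: String, modifiedAt: Long)

  // Returns the rows changed after `watermark`, oldest first, together with
  // the watermark to use for the next poll.
  def nextBatch(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changed = table.filter(_.modifiedAt > watermark).sortBy(_.modifiedAt)
    val newWatermark = changed.lastOption.map(_.modifiedAt).getOrElse(watermark)
    (changed, newWatermark)
  }
}
```

Each poll passes in the watermark returned by the previous one, so only new changes are extracted. A timestamp watermark can miss rows that commit late with an older timestamp, which is one reason a monotonically increasing version column or database change tracking is often preferred.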

Page 11: Realtime Reporting using Spark Streaming

Kafka - Flow Management

No-nonsense logging
100K/s throughput vs 20K/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best-in-class integration with Spark
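Log compaction is what lets a stream of row changes shrink to the latest state per key. The following is a simplified in-memory model of that idea, not Kafka's implementation; the `Record` type and `compact` function are hypothetical names, and a `None` value stands in for a delete marker (tombstone).

```scala
object LogCompaction {
  // Hypothetical record: position in the log, a key, and an optional value.
  // A value of None models a tombstone (a delete for that key).
  final case class Record(offset: Long, key: String, value: Option[String])

  // Keeps only the newest record per key, drops keys whose newest record
  // is a tombstone, and returns the survivors in offset order.
  def compact(log: Seq[Record]): Seq[Record] =
    log.groupBy(_.key)
      .values
      .map(_.maxBy(_.offset))
      .filter(_.value.isDefined)
      .toSeq
      .sortBy(_.offset)
}
```

In this model a consumer that replays the compacted log still ends up with the final value of every live key, which is the property a reporting store needs when rebuilding its state.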

Page 12: Realtime Reporting using Spark Streaming

Columnar Storage

Optimized for analytic query performance
Vertical partitioning
Column projection
Compression
Loosely coupled schema

HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
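Vertical partitioning and column projection can be illustrated with a small in-memory sketch. The `Expense` record and its fields are assumptions for the example, and real columnar formats add encoding and compression on top of this layout.

```scala
object ColumnarSketch {
  // Row-oriented shape, as an OLTP table would hold it (hypothetical fields).
  final case class Expense(id: Long, employee: String, amount: Double)

  // Column-oriented shape: one vector per column (vertical partitioning).
  final case class ExpenseColumns(
      ids: Vector[Long],
      employees: Vector[String],
      amounts: Vector[Double])

  def toColumns(rows: Seq[Expense]): ExpenseColumns =
    ExpenseColumns(
      rows.map(_.id).toVector,
      rows.map(_.employee).toVector,
      rows.map(_.amount).toVector)

  // Column projection: the aggregate reads only the `amounts` vector and
  // never touches the other columns.
  def totalAmount(columns: ExpenseColumns): Double = columns.amounts.sum
}
```

An analytic query such as a sum over one column scans a single contiguous vector here, whereas the row layout would force it to read every field of every record.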

Page 13: Realtime Reporting using Spark Streaming

Hadoop/HDFS

Pro - Scale
Con - Latency

Page 14: Realtime Reporting using Spark Streaming

Spark Streaming

What? A data processing framework to build scalable fault-tolerant streaming applications.

Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.
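The code-reuse argument can be shown without Spark itself. In this plain-Scala sketch (all names are hypothetical), the business logic is written once against a collection and then used both over the full history and per micro-batch. In Spark the same shape would apply to a function over RDDs that is also passed to `DStream.transform`.

```scala
object SharedLogic {
  final case class Expense(employee: String, amount: Double)

  // Business logic written once, against a plain collection.
  def totalsByEmployee(expenses: Seq[Expense]): Map[String, Double] =
    expenses.groupBy(_.employee).map { case (employee, xs) =>
      employee -> xs.map(_.amount).sum
    }

  // Streaming use: apply the same function to one micro-batch and fold the
  // result into the running state.
  def updateState(state: Map[String, Double], batch: Seq[Expense]): Map[String, Double] =
    totalsByEmployee(batch).foldLeft(state) { case (acc, (employee, amount)) =>
      acc.updated(employee, acc.getOrElse(employee, 0.0) + amount)
    }
}
```

Folding `updateState` over the micro-batches should give the same totals as calling `totalsByEmployee` on the full history, which is the consistency a shared code path buys.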

Page 15: Realtime Reporting using Spark Streaming

Spark Streaming Architecture

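[Diagram: a Driver (Master) hands tasks to Executors running on Worker nodes. A Receiver on one executor reads from the Source and stores incoming data blocks (D1, D2, D3, D4); blocks are replicated to another executor and written to a WAL in the data store.]

DStream - Discretized Stream, a sequence of RDDs
RDD - Resilient Distributed Dataset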
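The "discretized" part of a DStream can be modelled as cutting a continuous flow of events into fixed-interval batches. The sketch below is a plain-Scala model with hypothetical names. Unlike Spark Streaming, it groups events by a timestamp carried in the event (Spark assigns data to batches by arrival time) and it emits nothing for empty intervals.

```scala
object Discretize {
  final case class Event(timestampMs: Long, payload: String)

  // Groups events into consecutive batches of `batchIntervalMs`. Each batch
  // is keyed by the start time of its interval, and batches are returned in
  // time order.
  def toBatches(events: Seq[Event], batchIntervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / batchIntervalMs * batchIntervalMs)
      .toSeq
      .sortBy(_._1)
}
```

Each resulting batch plays the role of one RDD in the stream, so any batch-style computation can be applied to it unchanged.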

Page 16: Realtime Reporting using Spark Streaming

Optimized Direct Kafka API

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

Page 17: Realtime Reporting using Spark Streaming

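How

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")

val topics = Set("sometopic", "anothertopic")

// streamingContext is an existing StreamingContext; the result is a DStream of (key, value) pairs.
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)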
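The direct approach works by having the driver decide, for each batch, which range of offsets every Kafka partition contributes. The following plain-Scala model sketches that planning step. `TopicPartition`, `OffsetRange`, and `planBatch` are local stand-ins written for this example, not the Spark or Kafka classes, and starting unknown partitions at offset 0 is a simplifying assumption.

```scala
object DirectOffsets {
  final case class TopicPartition(topic: String, partition: Int)

  // Half-open range [fromOffset, untilOffset) to read from one partition.
  final case class OffsetRange(tp: TopicPartition, fromOffset: Long, untilOffset: Long)

  // Plans one batch: for every partition, read from the last consumed offset
  // up to the latest available offset. Partitions with nothing new are skipped.
  def planBatch(
      consumed: Map[TopicPartition, Long],
      latest: Map[TopicPartition, Long]): Seq[OffsetRange] =
    latest.toSeq
      .map { case (tp, until) => OffsetRange(tp, consumed.getOrElse(tp, 0L), until) }
      .filter(r => r.untilOffset > r.fromOffset)
      .sortBy(r => (r.tp.topic, r.tp.partition))
}
```

Because a batch is fully described by its offset ranges, a failed batch can be recomputed by reading the same ranges again, so replay does not depend on keeping a separate copy of received data.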

Page 18: Realtime Reporting using Spark Streaming

Architecture

Page 19: Realtime Reporting using Spark Streaming

High level view

App -> OLTP -> Kafka -> Spark Streaming -> OLAP -> Reporting App

Page 20: Realtime Reporting using Spark Streaming

Complete Architecture

[Diagram; labels recovered from the slide, grouped by apparent role]

Sources: Expense, Travel, TTX, API; OLTP; Service bus; Import (FTP, HTTP, SMTP)
Extract: Sqoop snapshot; Extract, Compensate
Broker: Kafka (P, C); Protobuf, JSON
Stream processor: Spark (also listed: Samza, Storm, Flink); Tachyon
Archive: Flume, Camus; HDFS; Pig/Hive/MR - Normalization; Hive/Spark SQL
Storage: HANA (replication, load balance, failover, standby)
Reporting: Cognos, Tableau ?; API/SQL
Other labels: Data {Quality, Correction, Analytics}; Migrate method

Page 21: Realtime Reporting using Spark Streaming

Can Spark Streaming survive Chaos Monkey?

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Page 22: Realtime Reporting using Spark Streaming

Lambda Architecture

Lambda architecture is a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
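At query time, a Lambda-style system typically combines a precomputed batch view with a small real-time view that covers data the batch layer has not processed yet. A minimal plain-Scala sketch of that merge, assuming both views hold additive counts per key (the names and the `Long` counts are hypothetical):

```scala
object LambdaQuery {
  // Merges the batch view with the real-time (speed) view by summing the
  // counts for keys present in both and keeping keys present in only one.
  def merge(batchView: Map[String, Long], speedView: Map[String, Long]): Map[String, Long] =
    speedView.foldLeft(batchView) { case (acc, (key, count)) =>
      acc.updated(key, acc.getOrElse(key, 0L) + count)
    }
}
```

Summation only works for additive metrics. Other aggregates, such as distinct counts, need a merge function suited to them.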

Page 23: Realtime Reporting using Spark Streaming

Demo….

Page 24: Realtime Reporting using Spark Streaming

QnA

Page 25: Realtime Reporting using Spark Streaming

concur.com/en-us/careers

We are hiring

Page 26: Realtime Reporting using Spark Streaming

Thank you!