Real-time reporting using Spark Streaming

Post on 13-Aug-2015

Category:

Technology

Breaking the ETL barrier with real-time reporting using Kafka and Spark Streaming

About us

Concur (now part of SAP) provides travel and expense management services to businesses.

Data Insights

A team that is building solutions to provide customer access to data, visualization and reporting.

Expense
Travel
Invoice

About me

Santosh Sahoo
Principal Architect III, Data Insights

Stack so far..

[Diagram: App, OLTP, ETL, OLAP, Report]

Numbers

7K OLTP database sources
14K OLAP reporting databases
28K ETL jobs
2B row changes
300M rows (compacted)
Only ~20 failures a night

Traditional ETL challenges

Scheduled (high latency)
Hard to scale
Failover and recovery
Monolithic-ness
Spaghetti (logic + SQL)

Moving forward

Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual consistency

Streaming Data Pipeline

Source
Flow Management
Processor
Storage
Querying

Data Source

Event bus for business events
Log scraping
Transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog)
Change data capture
Application messaging/JMS
Micro-batching (high-watermark, change tracking)
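As an illustration of the high-watermark style of micro-batching listed above, here is a minimal in-memory sketch. The `Row` shape, the `version` column and the `HighWatermark.poll` helper are all hypothetical names chosen for this example; a real source would run the equivalent filter as a SQL query against the OLTP database and persist the watermark between polls.

```scala
// Hypothetical row shape: a key, a payload, and a monotonically increasing
// version column (for example a change-tracking version or a modified stamp).
final case class Row(id: Int, value: String, version: Long)

object HighWatermark {
  // Returns the rows changed since `watermark`, ordered by version, together
  // with the new watermark to store for the next poll. If nothing changed,
  // the watermark is returned unchanged.
  def poll(table: Seq[Row], watermark: Long): (Seq[Row], Long) = {
    val changed = table.filter(_.version > watermark).sortBy(_.version)
    val next = changed.lastOption.map(_.version).getOrElse(watermark)
    (changed, next)
  }
}
```

Each poll produces one micro-batch; calling `poll` again with the returned watermark yields only rows changed after the previous batch.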

Kafka - Flow Management

No-nonsense logging
100K/s throughput vs. 20K/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best-in-class integration with Spark
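Log compaction is what lets a stream of row changes collapse to one record per row. The sketch below simulates the idea in plain Scala and is not Kafka's implementation: the `Change` type and `Compaction.compact` are hypothetical names, and the handling of delete markers (tombstones) is left out.

```scala
// Hypothetical change record: the row key, its new value, and the log offset.
final case class Change(key: String, value: String, offset: Long)

object Compaction {
  // Keep only the most recent change for every key, in offset order.
  def compact(log: Seq[Change]): Seq[Change] =
    log.groupBy(_.key).values.map(_.maxBy(_.offset)).toSeq.sortBy(_.offset)
}
```

A consumer that replays the compacted log still ends up with the latest state of every key, which is why a compacted topic can serve as a snapshot of the source table.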

Columnar Storage

Optimized for analytic query performance
Vertical partitioning
Column projection
Compression
Loosely coupled schema

HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
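To make column projection and compression concrete, here is a small sketch of a column-oriented layout in plain Scala. The `Expense` record, the `ExpenseColumns` layout and the `Columnar` helpers are hypothetical and only illustrate the idea; none of the stores listed above is implemented this way in detail.

```scala
// Hypothetical row-oriented record.
final case class Expense(id: Int, employee: String, amount: Double, currency: String)

// Column-oriented layout: one vector per column instead of one record per row.
final case class ExpenseColumns(
  id: Vector[Int],
  employee: Vector[String],
  amount: Vector[Double],
  currency: Vector[String]
)

object Columnar {
  def fromRows(rows: Seq[Expense]): ExpenseColumns =
    ExpenseColumns(
      rows.map(_.id).toVector,
      rows.map(_.employee).toVector,
      rows.map(_.amount).toVector,
      rows.map(_.currency).toVector
    )

  // Column projection: the aggregate reads only the `amount` column.
  def totalAmount(cols: ExpenseColumns): Double = cols.amount.sum

  // Run-length encoding, a simple compression scheme that pays off when
  // a column holds long runs of repeated values.
  def runLength[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((v, n) :: tail, x) if v == x => (v, n + 1) :: tail
      case (acc, x)                      => (x, 1) :: acc
    }.reverse
}
```

An analytic query such as a sum over `amount` never touches the other columns, and a low-cardinality column such as `currency` compresses well because equal values sit next to each other.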

Hadoop/HDFS

Pro: Scale
Con: Latency

Spark Streaming

What? A data processing framework to build scalable fault-tolerant streaming applications.

Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.

Spark Streaming Architecture

[Diagram: Source, Receiver, Driver/Master, three Workers each running an Executor, data blocks D1 to D4, Replication, WAL, DataStore, Task]

DStream: Discretized Stream, a sequence of RDDs
RDD: Resilient Distributed Dataset
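The "discretized" part of a DStream can be pictured as chopping a continuous stream into fixed-interval batches, each of which is then processed like a small batch job. The sketch below is a plain-Scala simulation under that assumption and does not use the Spark API; `Event`, `arrivalMs` and the `Discretize` helpers are hypothetical names.

```scala
// Hypothetical event with the time (in ms) at which it arrived.
final case class Event(arrivalMs: Long, payload: String)

object Discretize {
  // Assign each event to the batch interval containing its arrival time.
  // The result is keyed by the interval's start time and ordered by it.
  def batches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events.groupBy(e => e.arrivalMs / intervalMs * intervalMs).toSeq.sortBy(_._1)

  // An ordinary batch function; it can run over one micro-batch
  // or over an entire historical dataset without change.
  def countByPayload(events: Seq[Event]): Map[String, Int] =
    events.groupBy(_.payload).map { case (k, v) => k -> v.size }
}
```

Applying `countByPayload` to each element of `batches(...)` mirrors how the same logic can serve both streaming and batch processing.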

Optimized Direct Kafka API

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

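How

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// streamingContext is an existing StreamingContext; the direct API reads from the brokers listed here
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)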

Architecture

[Diagram: App, OLTP, Kafka, Spark Streaming, OLAP, Reporting App]

High level view

[Diagram components: OLTP; service bus; import (FTP, HTTP, SMTP); Kafka broker (Protobuf, JSON); stream processor (Spark, Samza, Storm, Flink); Tachyon; archive (Flume, Camus); HDFS; Sqoop snapshot; Pig/Hive/MR normalization; extract, compensate; Hive/Spark SQL; HANA (replication, load balance, failover, standby); data quality, correction, analytics; migrate method; API/SQL; reporting (Cognos, Tableau?); Expense, Travel, TTX, API]

Complete Architecture

Can Spark Streaming survive Chaos Monkey?

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Lambda Architecture

Lambda architecture is a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
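A minimal sketch of the pattern, assuming a single cutoff timestamp that separates what the batch layer has already processed from what only the streaming layer has seen. The `Txn` type and the `Lambda` helpers are hypothetical names for illustration, written in plain Scala rather than against Spark or Kafka.

```scala
// Hypothetical transaction: customer, amount, and event timestamp.
final case class Txn(customer: String, amount: Long, ts: Long)

object Lambda {
  private def totals(txns: Seq[Txn]): Map[String, Long] =
    txns.groupBy(_.customer).map { case (c, group) => c -> group.map(_.amount).sum }

  // Batch layer: complete but stale, covers everything before `cutoff`.
  def batchView(all: Seq[Txn], cutoff: Long): Map[String, Long] =
    totals(all.filter(_.ts < cutoff))

  // Speed layer: covers only what arrived at or after `cutoff`.
  def speedView(all: Seq[Txn], cutoff: Long): Map[String, Long] =
    totals(all.filter(_.ts >= cutoff))

  // Serving layer: a query merges both views.
  def query(batch: Map[String, Long], speed: Map[String, Long], customer: String): Long =
    batch.getOrElse(customer, 0L) + speed.getOrElse(customer, 0L)
}
```

When the next batch run completes, the cutoff moves forward and the speed view for the covered period can be discarded.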

Demo

Q&A

concur.com/en-us/careers

We are hiring

Thank you!
