Real-time reporting using Spark Streaming

Breaking ETL barrier with Real-time reporting

using Kafka, Spark Streaming

About us

Concur (now part of SAP) provides travel and expense management services to businesses.

Data Insights

A team that is building solutions to provide customer access to data, visualization and reporting.
Expense
Travel
Invoice

About me

Santosh Sahoo
Principal Architect III, Data Insights

Stack so far..

OLAP
Report
ETL

Numbers

7K OLTP database sources
14K OLAP reporting DBs
28K ETL jobs
2B row changes
300M rows (compacted)
Only ~20 failures a night

Traditional ETL challenges

Scheduled (high latency)
Hard to scale
Failover and recovery
Monolithic-ness
Spaghetti (logic + SQL)

Moving forward

Streaming, real time
Scalable
Highly available
Reduce maintenance overhead
Eventual consistency

Streaming Data Pipeline

Source
Flow Management
Processor
Storage

Querying

Data Source

Event bus for business events
Log scraping
Transaction log scraping (Oracle GoldenGate, MySQL binlog, MongoDB oplog, Postgres Bottled Water, SQL Server fn_dblog)
Change Data Capture
Application messaging/JMS
Micro batching (high watermark, change tracking)
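To make the high-watermark style of micro batching concrete, here is a minimal sketch in plain Scala. It is not Concur's implementation: the `ChangedRow` shape, the `version` column and the `HighWatermarkPoller` name are assumptions for illustration, and the in-memory `Seq` stands in for a source table. It assumes `version` is unique and increases with every change (for example a row version or sequence number).

```scala
// Hypothetical row shape: `version` is assumed unique and increasing per change.
final case class ChangedRow(id: Long, payload: String, version: Long)

object HighWatermarkPoller {
  /** Returns up to `maxRows` rows changed after `watermark`,
    * together with the new watermark to persist for the next poll. */
  def nextBatch(table: Seq[ChangedRow], watermark: Long, maxRows: Int): (Seq[ChangedRow], Long) = {
    val batch = table
      .filter(_.version > watermark) // only rows changed since the last poll
      .sortBy(_.version)             // oldest change first
      .take(maxRows)                 // bound the size of each micro batch
    // An empty batch leaves the watermark where it was.
    val newWatermark = if (batch.isEmpty) watermark else batch.last.version
    (batch, newWatermark)
  }
}
```

Each poll picks up where the previous one stopped, so a failed poll can be retried from the last persisted watermark without losing changes.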

Kafka - Flow Management

No-nonsense logging
100K/s throughput vs 20K/s for RabbitMQ
Log compaction
Durable persistence
Partition tolerance
Replication
Best-in-class integration with Spark
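Log compaction is the feature that lets a stream of row changes shrink to one record per key. The sketch below models the idea in plain Scala; it is a simplification, not Kafka's implementation (real compaction runs per partition in the background and retains delete markers for a configurable time). `ChangeEvent` and `LogCompaction` are hypothetical names.

```scala
// Hypothetical event shape: a `value` of None models a delete marker (tombstone).
final case class ChangeEvent(key: String, value: Option[String], offset: Long)

object LogCompaction {
  /** Keeps only the latest event per key and drops keys whose latest event is a tombstone. */
  def compact(log: Seq[ChangeEvent]): Seq[ChangeEvent] =
    log
      .groupBy(_.key)             // all events for one key
      .values
      .map(_.maxBy(_.offset))     // the newest event wins
      .filter(_.value.isDefined)  // a trailing tombstone removes the key
      .toSeq
      .sortBy(_.offset)           // preserve log order among survivors
}
```

A consumer that replays the compacted log still ends up with the latest state of every key, which is what a reporting store needs.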

Columnar Storage

Optimized for analytic query performance.
Vertical partitioning
Column projection
Compression
Loosely coupled schema

HBase
AWS Redshift
Parquet
ORC
Postgres (Citus)
SAP HANA
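A small plain-Scala sketch of why column projection and compression help analytic queries. It does not reflect how any of the listed stores lay out data on disk; `ExpenseRow`, `ExpenseColumns` and `ColumnarLayout` are hypothetical names, and run-length encoding is just one example of a column-friendly compression scheme.

```scala
// Row-oriented shape: one object per record.
final case class ExpenseRow(id: Long, category: String, amount: Double)

// Column-oriented shape: one vector per attribute (vertical partitioning).
final case class ExpenseColumns(ids: Vector[Long], categories: Vector[String], amounts: Vector[Double])

object ColumnarLayout {
  def toColumns(rows: Seq[ExpenseRow]): ExpenseColumns =
    ExpenseColumns(rows.map(_.id).toVector, rows.map(_.category).toVector, rows.map(_.amount).toVector)

  /** Column projection: the aggregate reads only the `amounts` vector. */
  def totalAmount(cols: ExpenseColumns): Double = cols.amounts.sum

  /** Run-length encoding: repeated neighbouring values collapse into (value, count) pairs. */
  def runLengthEncode[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, count) :: rest, next) if value == next => (value, count + 1) :: rest
      case (acc, next)                                     => (next, 1) :: acc
    }.reverse
}
```

Because values of the same type and meaning sit next to each other, a low-cardinality column such as `categories` tends to compress far better than the same data stored row by row.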

Hadoop/HDFS

Pro - Scale
Con - Latency

Spark Streaming

What? A data processing framework to build scalable fault-tolerant streaming applications.
Why? It lets you reuse the same code for batch processing, join streams against historical data, or run ad-hoc queries on stream state.

Spark Streaming Architecture

[Diagram: Source, Receiver, Worker, Executor, Driver, Master, Replication, DataStore]

DStream - Discretized Stream of RDDs
RDD - Resilient Distributed Dataset
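The "discretized" part can be pictured as chopping an unbounded stream into small batches, each handled like an ordinary batch dataset. The sketch below models that in plain Scala without Spark; `ArrivedEvent` and `Discretize` are hypothetical names, `arrivalMs` stands for the time the event was received, and unlike Spark Streaming this model emits nothing for empty intervals.

```scala
// Hypothetical event shape: `arrivalMs` is when the event was received.
final case class ArrivedEvent(arrivalMs: Long, value: String)

object Discretize {
  /** Groups events into consecutive batches of `batchIntervalMs`, keyed by batch start time. */
  def toBatches(events: Seq[ArrivedEvent], batchIntervalMs: Long): Seq[(Long, Seq[ArrivedEvent])] =
    events
      .groupBy(e => (e.arrivalMs / batchIntervalMs) * batchIntervalMs) // floor to the batch start
      .toSeq
      .sortBy(_._1)

  /** Ordinary batch logic; it can run on one micro batch or on a full historical dataset. */
  def countByValue(batch: Seq[ArrivedEvent]): Map[String, Int] =
    batch.groupBy(_.value).map { case (value, hits) => value -> hits.size }
}
```

Applying `countByValue` to every batch from `toBatches` mirrors the idea that the same code serves both streaming and batch processing.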

Optimized Direct Kafka API

https://databricks.com/blog/2015/03/30/improvements-to-kafka-integration-of-spark-streaming.html

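How

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

// Brokers to read from directly; streamingContext is an existing StreamingContext
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092,anotherhost:9092")
val topics = Set("sometopic", "anothertopic")
val kafkaStream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, kafkaParams, topics)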
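As described in the linked Databricks post, the direct approach has the driver decide, for every batch, which offset range to read from each Kafka partition. The plain-Scala sketch below models only that planning step; it is not Spark's code, and `PartitionId`, `BatchRange`, `DirectOffsetPlanner` and `maxPerPartition` are hypothetical names.

```scala
// Hypothetical names for illustration; Spark's own classes differ.
final case class PartitionId(topic: String, partition: Int)

// Half-open range: fromOffset is included, untilOffset is excluded.
final case class BatchRange(id: PartitionId, fromOffset: Long, untilOffset: Long) {
  def size: Long = untilOffset - fromOffset
}

object DirectOffsetPlanner {
  /** Plans one batch: for each partition, read from the last consumed offset
    * up to the latest available offset, capped at `maxPerPartition` messages. */
  def nextRanges(
      consumed: Map[PartitionId, Long],
      latest: Map[PartitionId, Long],
      maxPerPartition: Long): Seq[BatchRange] =
    latest.toSeq
      .map { case (id, end) =>
        val from = consumed.getOrElse(id, 0L) // unseen partitions start at offset 0
        BatchRange(id, from, math.min(end, from + maxPerPartition))
      }
      .filter(_.size > 0) // skip partitions with nothing new
      .sortBy(r => (r.id.topic, r.id.partition))
}
```

Since every batch is defined by explicit offset ranges, a failed batch can be recomputed by reading the same ranges from Kafka again.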

Architecture

Kafka, Spark Streaming, OLAP, Reporting App

High level view

[Diagram labels]
Reporting
Cognos, Tableau ?
Archive, Flume
Stream Processor: Spark, Samza, Storm, Flink
HDFS Import
Tachyon
Standby
Protobuf, Json
Broker
Hive/Spark SQL
Load balance, Failover
HANA, HANA
Replication
Service bus
Sqoop Snapshot
Pig/Hive/MR - Normalization
Extract, Compensate
Data {Quality, Correction, Analytics}
Migrate method
API/SQL
Expense, Travel
TTX, API

Complete Architecture

Can Spark Streaming survive Chaos Monkey?

http://techblog.netflix.com/2015/03/can-spark-streaming-survive-chaos-monkey.html

Lambda Architecture

Lambda architecture is a data-processing pattern designed to handle massive quantities of data by taking advantage of both batch- and stream-processing methods.
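A minimal plain-Scala sketch of the query-time merge in a lambda architecture: a batch view holds totals up to the last batch run, and a real-time view holds the increments seen since then. `LambdaQuery` and the map-based views are assumptions for illustration, not the design shown in this deck.

```scala
object LambdaQuery {
  /** Serving-layer merge: batch totals plus the increments that arrived after the last batch run. */
  def merged(batchView: Map[String, Long], realtimeView: Map[String, Long]): Map[String, Long] =
    (batchView.keySet ++ realtimeView.keySet).map { key =>
      key -> (batchView.getOrElse(key, 0L) + realtimeView.getOrElse(key, 0L))
    }.toMap
}
```

When the next batch run completes, its output replaces the batch view and the matching real-time increments are discarded, so any error in the streaming path is eventually corrected by the batch path.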

Demo

concur.com/en-us/careers

We are hiring

Thank you!
