Productizing a Spark & Cassandra Based Solution in Telecom Brij Bhushan Ravat Chief Architect, Voucher Server - Charging System

Upload: spark-summit

Post on 06-Jan-2017

Page 1: Spark Summit EU talk by Brij Bhushan Ravat

Productizing a Spark & Cassandra Based Solution in Telecom

Brij Bhushan Ravat Chief Architect, Voucher Server - Charging System

Page 2: Spark Summit EU talk by Brij Bhushan Ravat

Productizing a Spark and Cassandra Based Solution in Telecom | Public | © Ericsson AB 2016 | 2016-09-25 | Page 2

Agenda

1 What is productizing?
2 A brief on the product – Voucher Server
3 Challenges & evolution
4 O&M challenges
5 Wrap up

Page 3: Spark Summit EU talk by Brij Bhushan Ravat

Evolution of a product

› The Need: Tony Stark in captivity
› The Design: Invention of Arc Reactor
› The Prototype: Armor – Mark 1
› The Solution: Ironman
› The Support Mechanism: J.A.R.V.I.S.
› Productized Solution: Ironman & IronPatriot

Page 4: Spark Summit EU talk by Brij Bhushan Ravat

Productizing

› First phase of development:
  – It involves designing a solution for a need
  – Prototyping it as a proof of concept
  – Completing the solution with all required features & good performance

› Second phase of development:
  – Building a mechanism to support usage of the solution
  – So that every customer can use the solution with the same accuracy

› ‘Productizing’ is to take a solution to a level where anyone can:
  – buy the solution off-the-shelf
  – install it by himself/herself, and
  – use it by himself/herself

Page 5: Spark Summit EU talk by Brij Bhushan Ravat

Evolution of VS

› The Need: Market forces
› The Design: Change in database
› The Prototype: NGVS Prototype
› The Solution: Next Generation VS
› The Support Mechanism: Strong Support
› Productized Solution: Multiple deployments

Page 7: Spark Summit EU talk by Brij Bhushan Ravat

Voucher Server

› Generates ‘Activation Codes’
  – An ‘activation code’ corresponds to a currency & a value

› Handles top-up requests of pre-paid subscribers
  – Interprets the value of an ‘activation code’
  – Adds the corresponding value to the balance amount of the subscriber

Page 8: Spark Summit EU talk by Brij Bhushan Ravat

Voucher Server (VS)

[Figure: Voucher Server interactions. Parties shown: Operator, Print Shop (voucher codes), Retail Shop (printed vouchers), Subscribers, Account. Numbered flows: 1 voucher generation, 2 voucher loading, 3 voucher state change, 4 subscribers voucher purge, 5 generate report. Other labels: voucher lookup (activation code), value.]

Page 10: Spark Summit EU talk by Brij Bhushan Ravat

Design Change: Oracle to Cassandra

Old VS solution (Oracle based):
• Overall a good solution
• but limited capacity (one cluster – 300M vouchers)
• and scaling is not immune to hotspots

[Figure: first-digit-driven message routing. Refill requests are split by first digit: 0 - 4 go to Cluster 1 (Voucher Pool 1, Oracle RAC), 5 - 9 go to Cluster 2 (Voucher Pool 2, Oracle RAC).]

Next Generation VS solution (Cassandra based):
• Highly scalable solution with good performance
• No hotspots
• but performance of report generation is a constraint

[Figure: round-robin message routing. Refill requests are spread over the VS instances of a single cluster, with a single voucher pool in one Cassandra cluster.]
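
To make the hotspot argument concrete, here is a minimal Python sketch that compares the two routing schemes. The function names, the target names and the skewed sample of codes are illustrative assumptions, not taken from the product.

```python
from collections import Counter
from itertools import cycle


def route_by_first_digit(code: str) -> str:
    """Old scheme: first digit 0-4 goes to cluster 1, 5-9 to cluster 2."""
    return "cluster-1" if code[0] in "01234" else "cluster-2"


def route_round_robin(codes, targets):
    """New scheme: spread requests evenly, regardless of the code content."""
    rotation = cycle(targets)
    return [next(rotation) for _ in codes]


# Hypothetical burst of refill requests: most codes happen to start with 0-4.
codes = ["1234567890"] * 8 + ["9876543210"] * 2

first_digit_load = Counter(route_by_first_digit(c) for c in codes)
round_robin_load = Counter(route_round_robin(codes, ["vs-1", "vs-2"]))

print(first_digit_load)   # load follows the skew in the codes
print(round_robin_load)   # load is even
```

With content-based routing the load on each cluster follows whatever skew the incoming codes have; round robin removes that dependency, which is only possible once every VS instance can reach the whole voucher pool.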

Page 11: Spark Summit EU talk by Brij Bhushan Ravat

Report Generation: Feature

[Figure: Voucher Server and Analytics Solution, with the operations Refill, Load, State change, Purge and Generate Report.]

Report:
• Unavailable: 170M
• Available: 80M
• Used: 30M
• Damaged: 13M
• Stolen: 2M

Voucher state distribution:
• Count vouchers in each state
• Voucher reconciliation
• Taking stock of inventory

Export voucher data:
• Fraud detection
• Market prediction

Challenges:
• Cassandra by itself does not provide a function to count fields w.r.t. their value
• Report generation requires a full-table scan (takes 5-6 hours)
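
The state-distribution report boils down to a group-by-and-count over every voucher row. A minimal Python sketch of that computation follows; the row layout and the sample rows are made up for illustration.

```python
from collections import Counter

# Hypothetical rows returned by a full-table scan: (activation_code, state).
vouchers = [
    ("1001", "available"),
    ("1002", "used"),
    ("1003", "available"),
    ("1004", "damaged"),
    ("1005", "unavailable"),
    ("1006", "available"),
]


def state_distribution(rows):
    """Count vouchers per state; every row has to be visited once."""
    return Counter(state for _, state in rows)


report = state_distribution(vouchers)
for state, count in sorted(report.items()):
    print(f"{state}: {count}")
```

Because each row must be read to learn its state, the cost grows with the size of the table, which is why the scan dominates the report time.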

Page 12: Spark Summit EU talk by Brij Bhushan Ravat

Report Generation: Design (w/o Spark)

[Figure: Web GUI in front of a Cassandra cluster of five nodes (Node 1 to Node 5).]

Design approach:
• Web GUI triggers report generation task on Java process
• Java process interfaces with Cassandra cluster and fetches voucher data from all nodes
• After full-table scan, entire data is loaded in a single Java process of a node
• The Java process writes the voucher data in a report file

Alternative design:
• Implement distributed computing
• Use a framework for distributed computing

Page 13: Spark Summit EU talk by Brij Bhushan Ravat

Report Generation: New Design

[Figure: Web GUI and Cassandra cluster.]

New design approach:
• Web GUI triggers report generation task on Java process
• Java process triggers a Spark job
• Spark spawns Spark executors on each node
• Each Spark executor is assigned tasks such that its local Cassandra instance has all its required tokens (leveraging data locality)
• Each Spark executor generates a report file for the data it has fetched

Benefits:
• Minimum network latency
• Therefore, better performance
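
The idea of the new design can be sketched in plain Python, without Spark: each worker touches only the rows stored on its own node and produces a partial report. The node names, the row layout and the final merge of partial counts are assumptions made for this illustration; the slide itself only says that each executor writes a report file for its own data.

```python
from collections import Counter

# Hypothetical cluster: each node owns the rows of its token ranges.
# Keys are node names; values are the (activation_code, state) rows stored locally.
local_rows = {
    "node-1": [("1001", "available"), ("1002", "used")],
    "node-2": [("2001", "available"), ("2002", "stolen")],
    "node-3": [("3001", "used"), ("3002", "available")],
}


def run_executor(node: str) -> Counter:
    """Stand-in for a Spark executor: reads only its own node's rows
    and produces a partial report for them."""
    return Counter(state for _, state in local_rows[node])


# One partial report per node; no voucher row crosses the network.
partial_reports = {node: run_executor(node) for node in local_rows}

# Only the small per-state counters are combined at the end.
total = sum(partial_reports.values(), Counter())
print(total)
```

Compared with the single Java process of the earlier design, the bulk data stays where it is stored and only small summaries move between machines.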

Page 14: Spark Summit EU talk by Brij Bhushan Ravat

Data locality: Spark-Cassandra Connector

[Figure: software stack across three nodes. Java (VS business logic) sits on the Spark Cassandra Connector – Java wrapper v 1.2 and the Spark Cassandra Connector v 1.2. Each node runs a Spark Executor v 1.2 alongside Apache Cassandra v 2.0.14; several Spark partitions are drawn per executor, above the Cassandra partition of the local Cassandra instance.]

Page 16: Spark Summit EU talk by Brij Bhushan Ravat

Key O&M challenges

› Cassandra repair
› Lagging compaction
› Handling real-time traffic along with Spark jobs

Page 17: Spark Summit EU talk by Brij Bhushan Ravat

Cassandra Repair

Replication Factor: 3, Consistency Level: Quorum

[Figure: three replicas (Node 1, Node 2, Node 3) holding values A and B, with a mutation drop marked.]

What causes inconsistency
• Node-2 receives db-request (update value to B)
• Node-2 becomes coordinator
  • applies changes in its replica
  • sends message to Node-1 & Node-3
• Node-1 applies db-update (from value A to B)
• Node-3 fails to receive the message (mutation drop)
  • On Node-3, value remains ‘A’

Consequences of inconsistency (if not fixed)
• If ‘consistency level’ is QUORUM for writes as well as reads, there is no impact as long as all 3 nodes are up & running
• Once a node with the latest value goes down, read requests can fail
• In some cases, purged data can also reappear

How to fix inconsistencies
• Schedule ‘repair jobs’ to run periodically (e.g. weekly)

Challenges
• Lack of awareness about ‘repair’ among customers
• At times, customers fail to run repair for prolonged periods
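
Why purged data can reappear is easiest to see in a toy model. The Python sketch below is a deliberate simplification (no timeouts, no read repair, hypothetical node and row names): a delete is stored as a tombstone, one replica misses it, and the tombstones are later dropped without a repair having run.

```python
from typing import Optional

TOMBSTONE = object()  # marker that stands for "this row was deleted"

# (value, write timestamp) per replica; None means the replica holds nothing.
replicas = {
    "node-1": (TOMBSTONE, 2),
    "node-2": (TOMBSTONE, 2),
    "node-3": ("voucher-123", 1),   # missed the delete (mutation drop)
}


def read(nodes) -> Optional[str]:
    """Newest write among the contacted replicas wins; a tombstone reads as deleted."""
    cells = [replicas[n] for n in nodes if replicas[n] is not None]
    if not cells:
        return None
    value, _ = max(cells, key=lambda cell: cell[1])
    return None if value is TOMBSTONE else value


def expire_tombstones():
    """Compaction may drop tombstones once gc_grace_seconds has passed."""
    for node, cell in replicas.items():
        if cell is not None and cell[0] is TOMBSTONE:
            replicas[node] = None


before = read(["node-2", "node-3"])   # tombstone is newer, so the row stays deleted
expire_tombstones()                   # no repair ran within gc_grace_seconds
after = read(["node-2", "node-3"])    # only node-3's stale copy is left

print(before, after)
```

A repair that runs before the tombstones expire would have copied the delete to node-3, which is why the repair interval is usually discussed relative to `gc_grace_seconds`.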

Page 18: Spark Summit EU talk by Brij Bhushan Ravat

Lagging compaction (C* sstable)

Replication Factor: 3, Consistency Level: Quorum

[Figure: Node 1, Node 2 and Node 3, each holding several sstables (S).]

Why is compaction required?
• A batch of updates is flushed into an sstable
  • sstables are immutable
  • Therefore, every batch of updates makes a new sstable
• In each Cassandra node, sstable compaction runs in the background
  • It merges individual sstables into a single sstable
  • It frees up disk space consumed by multiple entries of the same record

What is lagging compaction?
• In STCS (Size Tiered Compaction Strategy), by default Cassandra requires 4 sstables of similar size to initiate compaction
• An unusual burst of db operations in a short time results in an unusually large sstable
• Such large sstables are not selected for compaction as long as there is no other sstable of similar size
• Recurrence of this phenomenon causes high disk utilization

How to fix lagging compaction?
• Tune compaction to run on as few as 2 sstables, even of varied sizes

Challenges
• Customers ignore the increasing disk utilization until it reaches 85%
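
The selection rule can be illustrated with a simplified Python model of size-tiered bucketing. It is a sketch, not Cassandra's implementation: it ignores options such as `min_sstable_size` and `max_threshold`, and the sstable sizes and the tuned values are invented. The defaults used (`min_threshold` 4, `bucket_low` 0.5, `bucket_high` 1.5) are the documented STCS defaults.

```python
def bucket_sstables(sizes_mb, bucket_low=0.5, bucket_high=1.5):
    """Group sstables whose size lies within [bucket_low, bucket_high] x the bucket average."""
    buckets = []
    for size in sorted(sizes_mb):
        for bucket in buckets:
            average = sum(bucket) / len(bucket)
            if bucket_low * average <= size <= bucket_high * average:
                bucket.append(size)
                break
        else:
            # No existing bucket is similar enough: start a new one.
            buckets.append([size])
    return buckets


def compaction_candidates(sizes_mb, min_threshold=4, **bucket_options):
    """Buckets that hold enough sstables to trigger a compaction."""
    return [b for b in bucket_sstables(sizes_mb, **bucket_options) if len(b) >= min_threshold]


# Hypothetical node: four regular flushes plus two sstables from unusual bursts.
sizes = [100, 110, 95, 105, 5000, 9000]

default_candidates = compaction_candidates(sizes)
tuned_candidates = compaction_candidates(sizes, min_threshold=2, bucket_low=0.4, bucket_high=2.0)

print(default_candidates)
print(tuned_candidates)
```

With the defaults only the four small sstables qualify and the two large ones are left alone; lowering the threshold to 2 and widening the size window lets the large ones be compacted together as well.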

Page 19: Spark Summit EU talk by Brij Bhushan Ravat

Real time traffic + Spark batch

At peak time of operations
• Resources are dedicated to traffic of refill requests
  • which are processed in real time
• Other activities performed are those which take very few resources, like voucher lookup, etc.

[Figure: peak-time distribution between real-time & batch processing: real-time refill requests next to admin & other functions.]

During off-peak of operations
• Traffic of refill requests is lower
• This window is utilized for admin functions which require more resources
• However, Spark jobs overload the system so much that even off-peak traffic starts timing out
• This requires tuning of ‘spark.executor.cores’

[Figure: off-peak distribution between real-time & batch processing: real-time refill requests next to Spark-based report generation.]
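
One way to reason about the tuning is as a core budget per node. The helper below is a hypothetical sketch: the function name and the sizing numbers are assumptions, and how `spark.executor.cores` takes effect depends on the Spark version and cluster manager in use.

```python
def spark_core_budget(node_cores: int, reserved_for_traffic: int) -> dict:
    """Cap Spark so that some cores on every node stay free for refill traffic."""
    if not 0 < reserved_for_traffic < node_cores:
        raise ValueError("reserve at least one core and leave at least one for Spark")
    return {"spark.executor.cores": str(node_cores - reserved_for_traffic)}


# Hypothetical sizing: 16-core nodes, 10 cores kept for Cassandra and real-time traffic.
off_peak_conf = spark_core_budget(node_cores=16, reserved_for_traffic=10)
print(off_peak_conf)
```

The resulting dictionary can be passed to whatever mechanism submits the Spark job; the right reservation has to be found by measuring refill response times while a report runs.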

Page 20: Spark Summit EU talk by Brij Bhushan Ravat

Solution

› For ‘Cassandra repair’ monitor
  – ‘Time elapsed since last successful Cassandra-repair’ w.r.t. ‘GC Grace Seconds’ for each node in Cassandra cluster

› For ‘lagging compaction’ monitor
  – Disk utilization
  – sstable count
  – Number of sstables read per table

› Handling real-time traffic along with Spark jobs
  – Monitor real-time traffic throughput & response time
  – Monitor CPU utilization
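
A monitoring tool along these lines could start as a handful of threshold checks per node. The Python sketch below is illustrative only: the `NodeStats` fields, the disk and sstable thresholds and the sample values are assumptions; 864000 seconds is Cassandra's default `gc_grace_seconds`.

```python
from dataclasses import dataclass

GC_GRACE_SECONDS = 864_000     # Cassandra's default gc_grace_seconds (10 days)
DISK_ALERT_FRACTION = 0.70     # assumed threshold, chosen to warn well before 85%


@dataclass
class NodeStats:
    name: str
    seconds_since_repair: int
    disk_used_fraction: float
    sstable_count: int


def alerts(node: NodeStats, max_sstables: int = 50) -> list:
    """Return human-readable warnings for one Cassandra node."""
    found = []
    if node.seconds_since_repair >= GC_GRACE_SECONDS:
        found.append(f"{node.name}: last successful repair is older than gc_grace_seconds")
    if node.disk_used_fraction >= DISK_ALERT_FRACTION:
        found.append(f"{node.name}: disk utilization at {node.disk_used_fraction:.0%}")
    if node.sstable_count > max_sstables:
        found.append(f"{node.name}: {node.sstable_count} sstables, compaction may be lagging")
    return found


healthy = NodeStats("node-1", seconds_since_repair=3 * 86_400, disk_used_fraction=0.40, sstable_count=12)
neglected = NodeStats("node-2", seconds_since_repair=12 * 86_400, disk_used_fraction=0.82, sstable_count=90)

for node in (healthy, neglected):
    print(alerts(node))
```

The point of such checks is to surface the problem to the operator early, instead of relying on customers to notice a missed repair or a filling disk.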

Page 22: Spark Summit EU talk by Brij Bhushan Ravat

Summary of challenges (of productizing)

› First phase (technical issues)
  – Need for high scalability
    › use Cassandra
  – Need for efficiency in full-table scan
    › use Spark for report generation (leveraging data locality)

› Second phase (O&M issues)
  – Lack of admin skills among end-users leads to:
    › issues with Cassandra repair & compaction
    › issues with running real-time traffic alongside batch processing jobs

Page 23: Spark Summit EU talk by Brij Bhushan Ravat

Solution for the challenges

› Active development
  – Tune Cassandra & Spark to handle all usual as well as unusual usage patterns
  – Along with core business logic, develop tools that facilitate monitoring of vital statistics related to Cassandra & Spark operations
  – Regularly train the ‘support team’ to strengthen their knowledge of Cassandra & Spark

› Strong support
  – which can augment end-users’ knowledge, and
  – which can partner with end-users for smooth operations of Cassandra & Spark

Page 24: Spark Summit EU talk by Brij Bhushan Ravat

Strong support

[Figure: DEVELOPER, SUPPORT and CUSTOMER. Captions: "Train support team on O&M of Cassandra & Spark"; "Partner with end-user for smooth operations of Cassandra & Spark".]

Page 25: Spark Summit EU talk by Brij Bhushan Ravat

Strong Support helps R&D handle multiple deployments

[Figure: R&D and Support, surrounded by Customer 1 to Customer 11.]
