Apache Spark: The Next Gen Toolset for Big Data Processing
DESCRIPTION
The Spark project from Apache (spark.apache.org) is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing for orders-of-magnitude improvements in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode Big Data processor and depends on disk-based files. Spark improves on this and supports real-time and interactive processing, in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine learning
4. Performance characteristics of Spark
TRANSCRIPT
Prajod Vettiyattil
Architect, Open source
Wipro
in.linkedin.com/in/prajod
@prajods
Apache Spark: The Next Gen Toolset for Big Data Processing
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams
Open Source India Nov 2014 Bangalore
• Big Data
• Hadoop stack and its limitations
• Spark: An overview
• Streaming, GraphX and MLlib
• Performance characteristics of Spark
Agenda
• Data too huge for normal systems
• 3 Vs: Volume, Variety, Velocity
• Storage challenge
• Analysis challenge
• Query results take hours, days or months
Big Data
The Big Data Analysis Triad
• Batch
• Interactive
• Streaming
The Hadoop stack
• Distributed data processing
• Fault tolerant
• Processes petabyte-scale data sets
• Ecosystem tools
• Hive, HBase
• Pig
• Storm
• Hadoop
• Map
• Reduce
• Shuffle, partition, sort
• HDFS
Hadoop: Data flow
Map side (high disk I/O on map nodes):
• Input data files feed the Map tasks
• Map output is buffered in memory
• Output is partitioned for the target reducers
• Each partition is sorted by key, with potential spill to disk
• All partitions are merged and written to disk
Reduce side (high disk I/O on reduce nodes):
• HTTP fetch of map output from the map nodes
• Merge sort over merge rounds 1, 2, … N
• Reduce produces the output
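The map, partition, sort, and reduce steps above can be sketched in miniature. This is pure Python for illustration, not Hadoop code; the function names are ours, and a real job would spill and merge through disk files rather than in-memory lists:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a word-count mapper would."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def partition_and_sort(pairs, num_reducers):
    """Partition each pair by key hash to a target reducer,
    then sort each partition by key (as the map side does before spilling)."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    for b in buckets:
        b.sort(key=lambda kv: kv[0])
    return buckets

def reduce_phase(bucket):
    """Reduce: merge the sorted pairs and sum counts per key."""
    counts = defaultdict(int)
    for key, value in bucket:
        counts[key] += value
    return dict(counts)

lines = ["spark hadoop spark", "hadoop hdfs"]
result = {}
for bucket in partition_and_sort(map_phase(lines), num_reducers=2):
    result.update(reduce_phase(bucket))
# result == {"spark": 2, "hadoop": 2, "hdfs": 1}
```

Note that every boundary in this sketch (map output, sorted partitions, merged reducer input) is a disk write in real Hadoop, which is the source of the high disk I/O called out above.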
• Batch mode only
• Covers only the batch layer of the Lambda pattern
• No real-time processing
• No support for repetitive queries
• Poor fit for iterative algorithms
• No interactive data querying
• Poor support for distributed memory
Limitations of Hadoop
Spark: An overview
• “Over time, fewer projects will use MapReduce, and more will use Spark”
• Doug Cutting, creator of Hadoop
• New architecture: scale better and simplify
• In memory processing for Big Data
• Cached intermediate data sets
• Multi-step DAG based execution
• Resilient Distributed Datasets (RDDs)
• The core innovation in Spark
Spark Ecosystem tools
Apache Spark (core), with:
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
• SparkR
• BlinkDB
• Shark
• Bagel
DAG Execution Engine
Example operations in a multi-step job DAG: Map, Collect, Filter, Map, Reduce, Sort, Collect
DAG = Directed Acyclic Graph
• Resilient Distributed Datasets
• Features
• Read only
• Fault tolerance without replication
• Uses data lineage for recovery
• Low network I/O
• Partitions (slices) enable parallel tasks
RDD
Disk → Transform 1 → RDD 1 → Transform 2 → RDD 2 (each RDD is divided into data partitions)
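The lineage-based recovery mentioned above can be sketched like this. It is pure Python, not Spark's RDD API; the class and method names are ours. Each dataset remembers its parent and the transformation that produced it, so a lost partition is recomputed rather than restored from a replica:

```python
class MiniRDD:
    """Toy RDD: partitions plus the lineage (parent, transform) to rebuild them."""
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions      # list of lists (one list per partition)
        self.parent = parent
        self.transform = transform        # function applied per element

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return MiniRDD(new_parts, parent=self, transform=fn)

    def recover(self, i):
        """Recompute partition i from the parent via lineage (no replication)."""
        self.partitions[i] = [self.transform(x) for x in self.parent.partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None              # simulate losing a partition on a node
doubled.recover(1)                        # rebuild only that partition from lineage
print(doubled.partitions)                 # [[2, 4], [6, 8]]
```

Only the lost partition is recomputed, which is why lineage gives fault tolerance with low network I/O compared to replicating every dataset.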
Lambda architecture pattern
• Used for Lambda architecture implementation
• Batch layer
• Speed layer
• Serving layer
Input feeds both the batch and speed layers; data consumers query the serving layer, which merges the batch and real-time views.
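A minimal sketch of the pattern (pure Python, illustrative names and numbers): the batch layer precomputes views over all historical data, the speed layer covers only the events since the last batch run, and the serving layer answers queries by merging the two:

```python
# View precomputed by the batch layer over all historical data.
batch_view = {"clicks": 1000}

# Delta maintained by the speed layer for recent, not-yet-batched events.
speed_view = {"clicks": 42}

def query(metric):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))  # 1042
```

Spark's appeal here is that the same engine can implement both layers: batch jobs for the batch layer and Spark Streaming for the speed layer.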
Spark Streaming
• For stream processing in Spark
• Real time data
• e.g., live Twitter streams
• Discretized streams (DStreams)
• Micro batches
• Sequence of RDDs
Discretized Streams
Input → Spark Streaming (batches of x seconds) → Spark → Output
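The micro-batch model can be sketched as follows. This is pure Python, not Spark Streaming's API; the function name is ours. The stream is chopped into fixed time intervals, and each interval becomes a small batch (an RDD, in Spark) that the ordinary batch engine processes:

```python
def discretize(events, interval):
    """Group (timestamp, value) events into micro-batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Timestamped events; with a 2-second interval they form three micro-batches.
events = [(0.1, "a"), (1.5, "b"), (2.3, "c"), (4.0, "d"), (5.9, "e")]
micro_batches = discretize(events, interval=2)
print(micro_batches)                    # [['a', 'b'], ['c'], ['d', 'e']]
print([len(b) for b in micro_batches])  # per-batch counts: [2, 1, 2]
```

The batch interval bounds the latency: an event waits at most one interval before being processed, which is why DStreams give near real-time rather than per-event latency.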
Why Spark Streaming
• Near real time processing (0.5 – 2 sec latency)
• Parallel recovery of lost nodes and stragglers
• Implementation of Lambda architecture
• Single engine for batch and stream
• Not suited for very low latency requirements (e.g., ~100 ms)
Apache Storm vs Spark Streaming
Feature            Spark Streaming                       Storm
Processing model   Micro-batching                        Event stream processing
Message delivery   Inherently fault tolerant;            At least once, at most once,
options            exactly-once delivery                 exactly once
Flexibility        Coarse-grained transformations        Fine-grained transformations
Implemented in     Scala                                 Clojure
Development cost   Common platform for both              Only stream; separate setup
                   batch and stream                      for batch
Applicability      Machine learning, interactive         Near real-time analytics,
                   analytics, near real-time analytics   natural language processing
GraphX & MLlib
• Data-parallel vs graph-parallel processing
• e.g., Wikipedia search vs Facebook connection search, PageRank
• Spark MLlib implements high quality machine learning algorithms
• Iterative Algorithm Paradigm
• Leverage Spark’s in memory data sets
x(t+1) = f(x(t))
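The update x(t+1) = f(x(t)) is the common shape of iterative ML algorithms (gradient descent, PageRank, k-means all fit it). A minimal sketch of the loop, using a simple fixed-point update as a stand-in for f; in Spark, the data set that f reads would stay cached in memory across iterations instead of being reloaded from disk each time:

```python
def iterate(f, x, steps):
    """Apply x <- f(x) repeatedly, as an iterative algorithm does."""
    for _ in range(steps):
        x = f(x)
    return x

# Example f: the Babylonian update, whose fixed point is sqrt(2).
f = lambda x: 0.5 * (x + 2.0 / x)
result = iterate(f, 1.0, 10)
print(round(result, 6))  # 1.414214
```

Hadoop MapReduce runs each such iteration as a fresh job with full disk I/O, which is exactly the workload where Spark's cached intermediate data sets pay off.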
Performance characteristics
Performance of Spark
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
Graph courtesy: spark.apache.org
Hadoop vs Spark
               Hadoop        Spark (100 TB *)   Spark World Record (1 PB)
Data size      102.5 TB      100 TB             1000 TB
Elapsed time   72 mins       23 mins            234 mins
# Nodes        2100          206                190
# Cores        50400         6592               6080
# Reducers     10,000        29,000             250,000
Rate           1.42 TB/min   4.27 TB/min        4.27 TB/min
Rate/node      0.67 GB/min   20.7 GB/min        22.5 GB/min
Data courtesy: databricks.com
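The per-node rates in the table follow from dividing the cluster rate by the node count. A quick arithmetic check (pure Python; this assumes 1 TB = 1000 GB and that the published figures are rounded):

```python
# (run, rate in TB/min, node count, published GB/min per node)
rows = [
    ("Hadoop 102.5 TB", 1.42, 2100, 0.67),
    ("Spark 100 TB",    4.27,  206, 20.7),
    ("Spark 1 PB",      4.27,  190, 22.5),
]
for name, rate_tb_min, nodes, published in rows:
    per_node = rate_tb_min * 1000 / nodes  # GB/min per node, 1 TB = 1000 GB assumed
    print(f"{name}: {per_node:.2f} GB/min/node (published {published})")
```

The roughly 30x jump in per-node rate from Hadoop to Spark, on about a tenth of the nodes, is the headline of this comparison.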
1 TB performance test: data per sec
1 TB performance test: data rate vs RAM size
Apache Spark
•New architecture
•RDD, DAG
• In memory processing
•MapReduce and more
•GraphX
•MLlib
•Spark streaming
Summary
Ecosystem tools
•SparkR
•BlinkDB
•Storm
Spark performance
•GBs per second
•RAM to data size
• Inflexion point
Questions
Prajod Vettiyattil
Architect, Open source
Wipro
@prajods
in.linkedin.com/in/prajod
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams