Apache Spark: The Next Gen Toolset for Big Data Processing
DESCRIPTION
The Spark project from Apache (spark.apache.org) is the next generation of Big Data processing systems. It uses a new architecture and in-memory processing for orders-of-magnitude improvements in performance. Some would call it the successor to the Hadoop set of tools. Hadoop is a batch-mode Big Data processor and depends on disk-based files. Spark improves on this and supports real-time and interactive processing, in addition to batch processing.
Table of contents:
1. The Big Data triangle
2. Hadoop stack and its limitations
3. Spark: An overview
3.a. Spark Streaming
3.b. GraphX: Graph processing
3.c. MLlib: Machine learning
4. Performance characteristics of Spark
TRANSCRIPT
Prajod Vettiyattil
Architect, Open source
Wipro
in.linkedin.com/in/prajod
@prajods
Apache Spark: The Next Gen Toolset for Big Data Processing
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams
Open Source India Nov 2014 Bangalore
• Big Data
• Hadoop stack and its limitations
• Spark: An overview
• Streaming, GraphX and MLlib
• Performance characteristics of Spark
Agenda
• Data too huge for normal systems
• 3 Vs: Volume, Variety, Velocity
• Storage challenge
• Analysis challenge
• Query results take hours, days or months
Big Data
The Big Data Analysis Triad
• Batch
• Interactive
• Streaming
The Hadoop stack
• Distributed data processing
• Fault tolerant
• Processes petabyte-scale data sets
• Ecosystem tools
• Hive, HBase
• Pig
• Storm
• Hadoop
• Map
• Reduce
• Shuffle, partition, sort
• HDFS
Hadoop: Data flow
Map side (high disk I/O on map nodes):
• Input data files feed the Map tasks
• Map output is buffered in memory
• Output is partitioned for the target reducers
• Each partition is sorted by key, with potential spill to disk
• All partitions are merged and written to disk
Reduce side (high disk I/O on reduce nodes):
• HTTP fetch of map output from the map nodes
• Merge sort over merge rounds 1, 2, … N
• Reduce produces the output
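The map, partition, sort, and reduce steps above can be sketched in miniature. This is pure Python for illustration, not Hadoop code; the function names are ours, and a real job would spill and merge through disk files rather than in-memory lists:

```python
from collections import defaultdict

def map_phase(lines):
    """Map: emit (word, 1) pairs, as a word-count mapper would."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def partition_and_sort(pairs, num_reducers):
    """Partition each pair by key hash to a target reducer,
    then sort each partition by key (as the map side does before spilling)."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[hash(key) % num_reducers].append((key, value))
    for b in buckets:
        b.sort(key=lambda kv: kv[0])
    return buckets

def reduce_phase(bucket):
    """Reduce: merge the sorted pairs and sum counts per key."""
    counts = defaultdict(int)
    for key, value in bucket:
        counts[key] += value
    return dict(counts)

lines = ["spark hadoop spark", "hadoop hdfs"]
result = {}
for bucket in partition_and_sort(map_phase(lines), num_reducers=2):
    result.update(reduce_phase(bucket))
# result == {"spark": 2, "hadoop": 2, "hdfs": 1}
```

Note that every boundary in this sketch (map output, sorted partitions, merged reducer input) is a disk write in real Hadoop, which is the source of the high disk I/O called out above.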
• Batch mode only
• Covers only the batch layer of the Lambda pattern
• No real-time processing
• No support for repetitive queries
• Poor fit for iterative algorithms
• No interactive data querying
• Poor support for distributed memory
Limitations of Hadoop
Spark: An overview
• “Over time, fewer projects will use MapReduce, and more will use Spark”
• Doug Cutting, creator of Hadoop
• New architecture: scale better and simplify
• In memory processing for Big Data
• Cached intermediate data sets
• Multi-step DAG based execution
• Resilient Distributed Datasets (RDDs)
• The core innovation in Spark
Spark Ecosystem tools
Apache Spark (core), with:
• Spark SQL
• Spark Streaming
• MLlib
• GraphX
• SparkR
• BlinkDB
• Shark
• Bagel
DAG Execution Engine
Example operations in a multi-step job DAG: Map, Collect, Filter, Map, Reduce, Sort, Collect
DAG = Directed Acyclic Graph
• Resilient Distributed Datasets
• Features
• Read only
• Fault tolerance without replication
• Uses data lineage for recovery
• Low network I/O
• Partitions (slices) enable parallel tasks
RDD
Disk → Transform 1 → RDD 1 → Transform 2 → RDD 2 (each RDD is divided into data partitions)
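The lineage-based recovery mentioned above can be sketched like this. It is pure Python, not Spark's RDD API; the class and method names are ours. Each dataset remembers its parent and the transformation that produced it, so a lost partition is recomputed rather than restored from a replica:

```python
class MiniRDD:
    """Toy RDD: partitions plus the lineage (parent, transform) to rebuild them."""
    def __init__(self, partitions, parent=None, transform=None):
        self.partitions = partitions      # list of lists (one list per partition)
        self.parent = parent
        self.transform = transform        # function applied per element

    def map(self, fn):
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return MiniRDD(new_parts, parent=self, transform=fn)

    def recover(self, i):
        """Recompute partition i from the parent via lineage (no replication)."""
        self.partitions[i] = [self.transform(x) for x in self.parent.partitions[i]]

base = MiniRDD([[1, 2], [3, 4]])
doubled = base.map(lambda x: x * 2)
doubled.partitions[1] = None              # simulate losing a partition on a node
doubled.recover(1)                        # rebuild only that partition from lineage
print(doubled.partitions)                 # [[2, 4], [6, 8]]
```

Only the lost partition is recomputed, which is why lineage gives fault tolerance with low network I/O compared to replicating every dataset.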
Lambda architecture pattern
• Used for Lambda architecture implementation
• Batch layer
• Speed layer
• Serving layer
Input feeds both the batch and speed layers; data consumers query the serving layer, which merges the batch and real-time views.
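A minimal sketch of the pattern (pure Python, illustrative names and numbers): the batch layer precomputes views over all historical data, the speed layer covers only the events since the last batch run, and the serving layer answers queries by merging the two:

```python
# View precomputed by the batch layer over all historical data.
batch_view = {"clicks": 1000}

# Delta maintained by the speed layer for recent, not-yet-batched events.
speed_view = {"clicks": 42}

def query(metric):
    """Serving layer: merge the batch view with the real-time delta."""
    return batch_view.get(metric, 0) + speed_view.get(metric, 0)

print(query("clicks"))  # 1042
```

Spark's appeal here is that the same engine can implement both layers: batch jobs for the batch layer and Spark Streaming for the speed layer.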
Spark Streaming
• For stream processing in Spark
• Real time data
• e.g., live Twitter streams
• Discretized streams (DStreams)
• Micro batches
• Sequence of RDDs
Discretized Streams
Input → Spark Streaming (batches of x seconds) → Spark → Output
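The micro-batch model can be sketched as follows. This is pure Python, not Spark Streaming's API; the function name is ours. The stream is chopped into fixed time intervals, and each interval becomes a small batch (an RDD, in Spark) that the ordinary batch engine processes:

```python
def discretize(events, interval):
    """Group (timestamp, value) events into micro-batches of `interval` seconds."""
    batches = {}
    for ts, value in events:
        batches.setdefault(int(ts // interval), []).append(value)
    return [batches[k] for k in sorted(batches)]

# Timestamped events; with a 2-second interval they form three micro-batches.
events = [(0.1, "a"), (1.5, "b"), (2.3, "c"), (4.0, "d"), (5.9, "e")]
micro_batches = discretize(events, interval=2)
print(micro_batches)                    # [['a', 'b'], ['c'], ['d', 'e']]
print([len(b) for b in micro_batches])  # per-batch counts: [2, 1, 2]
```

The batch interval bounds the latency: an event waits at most one interval before being processed, which is why DStreams give near real-time rather than per-event latency.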
Why Spark Streaming
• Near real time processing (0.5 – 2 sec latency)
• Parallel recovery of lost nodes and stragglers
• Implementation of Lambda architecture
• Single engine for batch and stream
• Not suited for very low latency requirements (e.g., ~100 ms)
Apache Storm vs Spark Streaming
Feature            Spark Streaming                       Storm
Processing model   Micro-batching                        Event stream processing
Message delivery   Inherently fault tolerant;            At least once, at most once,
options            exactly-once delivery                 exactly once
Flexibility        Coarse-grained transformations        Fine-grained transformations
Implemented in     Scala                                 Clojure
Development cost   Common platform for both              Only stream; separate setup
                   batch and stream                      for batch
Applicability      Machine learning, interactive         Near real-time analytics,
                   analytics, near real-time analytics   natural language processing
GraphX & MLlib
• Data-parallel vs graph-parallel processing
• e.g., Wikipedia search vs Facebook connection search, PageRank
• Spark MLlib implements high quality machine learning algorithms
• Iterative Algorithm Paradigm
• Leverage Spark’s in memory data sets
x(t+1) = f(x(t))
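The update x(t+1) = f(x(t)) is the common shape of iterative ML algorithms (gradient descent, PageRank, k-means all fit it). A minimal sketch of the loop, using a simple fixed-point update as a stand-in for f; in Spark, the data set that f reads would stay cached in memory across iterations instead of being reloaded from disk each time:

```python
def iterate(f, x, steps):
    """Apply x <- f(x) repeatedly, as an iterative algorithm does."""
    for _ in range(steps):
        x = f(x)
    return x

# Example f: the Babylonian update, whose fixed point is sqrt(2).
f = lambda x: 0.5 * (x + 2.0 / x)
result = iterate(f, 1.0, 10)
print(round(result, 6))  # 1.414214
```

Hadoop MapReduce runs each such iteration as a fresh job with full disk I/O, which is exactly the workload where Spark's cached intermediate data sets pay off.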
Performance characteristics
Performance of Spark
• Up to 100x faster than Hadoop MapReduce in memory
• Up to 10x faster on disk
Graph courtesy: spark.apache.org
Hadoop vs Spark
               Hadoop        Spark (100 TB *)   Spark World Record (1 PB)
Data size      102.5 TB      100 TB             1000 TB
Elapsed time   72 mins       23 mins            234 mins
# Nodes        2100          206                190
# Cores        50400         6592               6080
# Reducers     10,000        29,000             250,000
Rate           1.42 TB/min   4.27 TB/min        4.27 TB/min
Rate/node      0.67 GB/min   20.7 GB/min        22.5 GB/min
Data courtesy: databricks.com
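The per-node rates in the table follow from dividing the cluster rate by the node count. A quick arithmetic check (pure Python; this assumes 1 TB = 1000 GB and that the published figures are rounded):

```python
# (run, rate in TB/min, node count, published GB/min per node)
rows = [
    ("Hadoop 102.5 TB", 1.42, 2100, 0.67),
    ("Spark 100 TB",    4.27,  206, 20.7),
    ("Spark 1 PB",      4.27,  190, 22.5),
]
for name, rate_tb_min, nodes, published in rows:
    per_node = rate_tb_min * 1000 / nodes  # GB/min per node, 1 TB = 1000 GB assumed
    print(f"{name}: {per_node:.2f} GB/min/node (published {published})")
```

The roughly 30x jump in per-node rate from Hadoop to Spark, on about a tenth of the nodes, is the headline of this comparison.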
1 TB performance test: data per sec
1 TB performance test: data rate vs RAM size
Apache Spark
•New architecture
•RDD, DAG
• In memory processing
•MapReduce and more
•GraphX
•MLlib
•Spark streaming
Summary
Ecosystem tools
•SparkR
•BlinkDB
•Storm
Spark performance
•GBs per second
•RAM to data size
• Inflexion point
Questions
Prajod Vettiyattil
Architect, Open source
Wipro
@prajods
in.linkedin.com/in/prajod
Namitha M S
Architect, Advanced Technologies
Wipro
in.linkedin.com/in/namithams