Lightning Fast Big Data Analytics using Apache Spark
DESCRIPTION
Lightning Fast Big Data Analytics using Apache Spark
----------------------------------------------------------------------------------
Hadoop gives you a great (actually revolutionary) mechanism for storing large datasets in a highly fault-tolerant and highly available storage system (HDFS), and the ability to process these mammoth datasets using its massively parallel and distributed processing framework (MapReduce). It was built for batch processing, where analysts and programmers submit a series of jobs to crunch very large structured/unstructured datasets and then wait for results before performing further analysis. But one of the few things Hadoop is criticized for is its speed and lack of interactivity (mainly because its user base has grown tremendously, and people always demand more, especially when it comes to speed). Spark is an open-source system that can run on top of your existing HDFS and can provide up to 100x faster (almost interactive) in-memory analytics than MapReduce.

Topics that will be covered:
- Quick Introduction of Hadoop & its Limitations
- Introduction of Spark
- Spark Architecture
- Programming Model of Spark
- Demo
- Spark Use Cases

TRANSCRIPT
India Big Data Week 2014
Lightning Fast Big Data Analytics using Apache Spark
Manish Gupta, Solutions Architect – Product Engineering and Development
30th Jan 2014 - Delhi
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
What is Hadoop?
It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way.
It also provides a programming model/framework (MapReduce) for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion.
[Diagram: Input is stored in HDFS; parallel Map tasks feed Reduce tasks, which produce the Output.]
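To make the programming model concrete, here is the shape of MapReduce sketched with plain Scala collections (word count, the classic example). This is illustrative only, not Hadoop API code:

// The shape of MapReduce, sketched with Scala collections (not Hadoop API code):
// a map phase emits (key, value) pairs, a shuffle groups them by key,
// and a reduce phase folds each group.
val input = Seq("to be or not", "to be")
val mapped = input.flatMap(_.split(" ")).map(word => (word, 1))                   // map phase
val shuffled = mapped.groupBy { case (word, _) => word }                          // shuffle: group by key
val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // reduce phase
// reduced: Map(to -> 2, be -> 2, or -> 1, not -> 1)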
Slow due to replication, serialization, and disk IO
Inefficient for:
• Iterative algorithms (Machine Learning, Graphs & Network Analysis)
• Interactive Data Mining (R, Excel, Ad-hoc Reporting, Searching)
Limitations of Map Reduce
[Diagram: an iterative job on Hadoop: each iteration (iter. 1, iter. 2, …) starts with an HDFS read and ends with an HDFS write, so every Map/Reduce pass moves data from Input to Output through disk.]
Approach: Leverage Memory?
Memory bus >> disk & SSDs
Many datasets fit into memory
1TB = 1 billion records @ 1 KB
Memory capacity also follows Moore's Law:
A single 8 GB stick of RAM is about $80 right now. In 2021, you'd be able to buy a single stick of RAM that contains 64 GB for the same price.
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark
Open source; originally developed in the AMPLab at UC Berkeley.
Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).
Designed for running iterative algorithms & interactive analytics.
Highly compatible with Hadoop's storage APIs.
- Can run on your existing Hadoop cluster setup.
Developers can write driver programs using multiple programming languages.
"A big data analytics cluster-computing framework written in Scala."
…
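As a taste of what a driver program looks like, here is a minimal sketch in Scala (Spark 0.9-era API); the master URL and HDFS path are placeholders for your own cluster:

import org.apache.spark.SparkContext

object MinimalDriver {
  def main(args: Array[String]) {
    // Placeholders: point these at your own cluster manager and HDFS namenode.
    val sc = new SparkContext("spark://master:7077", "MinimalDriver")
    val lines = sc.textFile("hdfs://namenode:8020/data/input")  // reads from existing HDFS
    println("Line count: " + lines.count())
    sc.stop()
  }
}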
Spark
[Architecture diagram: the Spark Driver (Master) works with a Cluster Manager to run Spark Workers co-located with HDFS DataNodes; each worker holds an in-memory cache over the HDFS blocks on its node.]
Spark
[Diagram, for contrast: on Hadoop, every iteration (iter. 1, iter. 2, …) begins with an HDFS read and ends with an HDFS write.]
Spark
[Diagram: on Spark, the Input is read from HDFS once; iter. 1, iter. 2, … then run on in-memory data.]

Not tied to the 2-stage Map Reduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly

[Chart: logistic regression in Hadoop and Spark.]
Spark
A simple analytical operation:
1) Select count(*) from pagecounts

val pagecount = sc.textFile("/wiki/pagecounts")
pagecount.count()

2) Select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1

val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect()
Shark
HIVE on SPARK = SHARK
A large-scale data warehouse system, just like Apache Hive.
Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
Built on top of Spark (thus a faster execution engine).
Provides in-memory materialized tables (cached tables), and cached tables use columnar storage instead of row storage.
Column storage:
1   | 2   | 3
ABC | XYZ | PPP
4.1 | 3.5 | 6.4

Row storage:
1 | ABC | 4.1
2 | XYZ | 3.5
3 | PPP | 6.4
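A minimal sketch of the difference, using the table above and plain Scala structures (illustrative only, not Shark's internal format):

// Row storage: one record object per row.
case class Row(id: Int, name: String, score: Double)
val rowStore = Seq(Row(1, "ABC", 4.1), Row(2, "XYZ", 3.5), Row(3, "PPP", 6.4))

// Column storage: one array per column. Scanning a single column touches
// less data and compresses better, which is why cached tables use it.
val ids    = Array(1, 2, 3)
val names  = Array("ABC", "XYZ", "PPP")
val scores = Array(4.1, 3.5, 6.4)
val avgScore = scores.sum / scores.length   // touches only the `scores` column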
Shark
[Diagram, HIVE architecture: a Client (CLI, JDBC) talks to the Driver, whose SQL Parser, Query Optimizer and Physical Plan Execution run on Map Reduce, backed by the Meta store and HDFS.]
Shark
[Diagram, SHARK architecture: the same stack, except the Driver gains a Cache Mgr. and the Physical Plan Execution runs on Spark instead of Map Reduce, still backed by the Meta store and HDFS.]
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark Programming Model
The user (developer) writes a driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
[Diagram: the Driver Program (SparkContext) talks to a Cluster Manager; each Worker Node runs an Executor with a cache and multiple Tasks.]
Spark Programming Model
The user (developer) writes a driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
Driver Program
RDD (Resilient Distributed Dataset)
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators.
RDD
Programming interface: a programmer can perform 3 types of operations:

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (controlled by the Storage Level).
• Examples: persist(), cache() (see the combined sketch below)
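A short sketch tying the three operation types together, assuming the sc of the Spark shell (the HDFS path is a placeholder):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://namenode:8020/data/logs")   // base RDD
val errors = logs.filter(_.contains("ERROR"))              // transformation: lazy, nothing runs yet
errors.persist(StorageLevel.MEMORY_AND_DISK)               // persistence: keep in RAM, spill to disk
val n = errors.count()                                     // action: triggers the actual computation
val firstFive = errors.take(5)                             // later actions reuse the persisted data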
Spark
How Spark works:
• RDD: a parallel collection with partitions.
• A user application creates RDDs, transforms them, and runs actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a series of Tasks (one Task for each Partition); the sketch below shows one way to inspect the DAG.
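One way to see the DAG that Spark builds is RDD.toDebugString, which prints the lineage of an RDD (the path below is a placeholder):

val counts = sc.textFile("hdfs://namenode:8020/wiki/pagecounts")
  .map(_.split("\t"))
  .map(r => (r(0), 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)   // prints the chain of RDDs, with the shuffle boundary visible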
Spark
Example:
sc.textFile(“/wiki/pagecounts”) RDD[String]
textFile
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))

textFile → RDD[String] → map → RDD[Array[String]]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)] → collect → Array[(String, Int)]
Spark
Execution Plan:
textFile → map → map → reduceByKey → collect

The logical plan above gets compiled by the DAG scheduler into a plan comprising Stages, as shown next.
Spark
Execution Plan:
textFile → map → map → reduceByKey → collect

Stage 1: textFile, map, map, partial reduceByKey
Stage 2: final reduceByKey, collect

Stages are sequences of RDDs that don't have a shuffle in between.
Spark
textFile → map → map → reduceByKey → collect

Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
Spark
Stage Execution (Stage 1):
[Diagram: one Task per partition (Task 1, Task 2, ...)]
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all this happens internally (you don't need to do anything).
Spark
Task Execution:
A Task is the fundamental unit of execution in Spark.
[Diagram: over time, a task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output).]
Spark
Spark Executor (Slaves):
[Diagram: each executor core (Core 1, Core 2, Core 3) runs a pipeline of tasks, each task fetching input, executing, and writing output, so many tasks run concurrently per executor.]
Spark
Summary of Components
Task : The fundamental unit of execution in Spark
Stage : Set of Tasks that run in parallel
DAG : Logical Graph of RDD operations
RDD : Parallel dataset with partitions
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Example & Demo
Cluster details:
• 6 m1.xlarge EC2 nodes (64-bit, 4 vCPUs, 15 GB RAM each)
• 1 machine is the master node
• 5 machines are worker nodes
Example & Demo
Dataset: Wiki page view stats: 20 GB of webpage view counts; 3 days' worth of data.

Format: <date_time> <project_code> <page_title> <num_hits> <page_size>

Base RDD of all wiki pages:
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

Transformed RDD of all English pages (cached):
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()   // first count scans HDFS and fills the cache
englishPages.count()   // second count is served from memory
Example & Demo
Dataset: Wiki page view stats: 20 GB of webpage view counts; 3 days' worth of data.

Format: <date_time> <project_code> <page_title> <num_hits> <page_size>

Select date, sum(pageviews) from pagecounts group by date:
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(3).toInt))
  .reduceByKey(_ + _, 1)
  .collect.foreach(println)

Select date, count(distinct pageURL) from pagecounts group by date:
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(2)))
  .distinct()
  .countByKey()
  .foreach(println)

Select distinct(datetime) from pagecounts order by datetime:
englishPages.map(line => line.split(" "))
  .map(line => (line(0), 1))
  .distinct()
  .sortByKey()
  .collect()
  .foreach(println)
Example & Demo
Dataset: Network datasets (directed and bi-directed graphs):
• One small Facebook social network: 127 nodes (friends), 1668 edges (friendships); a bi-directed graph.
• Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks); a directed graph.
Example & Demo
Page Rank Calculation:
• Estimates node importance.
• Each directed link from A -> B is a vote for B from A.
• The more links to a page, the more important the page is.
• When a page with higher PR points to something, its vote weighs more.

The algorithm (implemented in the Scala code below):
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.
Example & Demo
Scala code:

var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
2 seconds
38 seconds

Page Rank | Page URL
761.1985177 | google
455.7028756 | google/about.html
259.6052388 | google/privacy.html
192.7257649 | google/jobs/
144.0349154 | google/support
134.1566312 | google/terms_of_service.html
130.3546324 | google/intl/en/about.html
123.4014613 | google/imghp
120.0661165 | google/accounts/Login
118.6884515 | google/intl/en/options/
112.2309539 | google/preferences
108.8375347 | google/sitemap.html
106.9724799 | google/press/
105.822426 | google/language_tools
105.1554798 | google/support/toolbar/
99.97741309 | google/maps
97.90651416 | google/advanced_search
90.7910291 | google/intl/en/services/
90.70522689 | google/intl/en/ads/
87.4353413 | google/adsense/
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark Current Users & Roadmap
Source: Apache - Powered By Spark
Roadmap
Conclusion
Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.
Suitable for scenarios where sufficient memory is available in your cluster.
It provides an integrated framework for advanced analytics like graph processing, stream processing, machine learning, etc. This simplifies integration.
Its community is expanding and development is happening very aggressively.
It's comparatively newer than Hadoop and still has relatively few users.
Topic: Lightning Fast Big Data Analytics using Apache Spark
Organized by UNICOM Trainings & Seminars Pvt. Ltd.
Speaker name: MANISH GUPTA
Email ID: [email protected]
Thank You
Backup Slides
Spark Internal Components
[Diagram: Spark core (Operators, Block manager, Scheduler, Networking, Accumulators, Broadcast) with pluggable pieces on top: Interpreter, Hadoop I/O, Mesos backend, Standalone backend.]
In-Memory
But what if I run out of memory?
[Chart: iteration time (s) vs. % of working set in memory; performance degrades gracefully as less of the working set fits in memory.]

% of working set in memory | Iteration time (s)
Cache disabled | 68.8
25% | 58.1
50% | 40.7
75% | 29.7
Fully cached | 11.5
Benchmarks
AMPLab performed quantitative and qualitative comparisons of 4 systems: Hive, Impala, Redshift and Shark.
Done on the Common Crawl Corpus dataset: 81 TB in size, consisting of 3 tables: Page Rankings, User Visits, Documents.
Data was partitioned so that each node had: 25 GB of User Visits, 1 GB of Rankings, 30 GB of Web Crawl (documents).
Source: https://amplab.cs.berkeley.edu/benchmark/#
Benchmarks
Benchmarks: Hardware Configuration
Benchmarks
• Redshift outperforms the others for on-disk data.
• Shark and Impala outperform Hive by 3-4x.
• For larger result sets, Shark outperforms Impala.
Benchmarks
• Redshift's columnar storage outperforms every time.
• Shark in-memory is 2nd best in all cases.
Benchmarks
• Redshift's bigger cluster has an advantage.
• Shark and Impala are competitive with each other.
Benchmarks
• Impala & Redshift don't support UDFs.
• Shark outperforms Hive.
Roadmap
Spark
In the last 6 months of 2013