![Page 1: Lightning Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/1.jpg)
www.unicomlearning.com
India Big Data Week 2014: Lightning Fast Big Data Analytics using Apache Spark
www.bigdatainnovation.org
Manish Gupta, Solutions Architect, Product Engineering and Development
30th Jan 2014 - Delhi
![Page 2: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/2.jpg)
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
![Page 4: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/4.jpg)
What is Hadoop?

It is open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way (HDFS).

It also provides a programming model/framework (MapReduce) for processing these large datasets in a massively parallel, fault-tolerant and data-locality-aware fashion.

[Diagram: Input → Map tasks → Reduce tasks → Output, over HDFS]
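The map/reduce flow above can be sketched with plain Scala collections (illustrative only; a real job runs distributed on Hadoop, and the data here is made up):

```scala
// Word count, the canonical MapReduce example, mimicked with local collections.
val input = Seq("big data", "fast big data")

// Map phase: each record emits (word, 1) pairs.
val mapped = input.flatMap(_.split(" ").map(word => (word, 1)))

// Shuffle + Reduce phase: group pairs by key and sum the counts.
val counts = mapped.groupBy(_._1).map { case (word, pairs) => (word, pairs.map(_._2).sum) }

println(counts("data"))  // 2
```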
![Page 5: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/5.jpg)
Limitations of MapReduce

Slow due to replication, serialization, and disk IO.

Inefficient for:
• Iterative algorithms (Machine Learning, Graphs & Network Analysis)
• Interactive Data Mining (R, Excel, Ad hoc Reporting, Searching)

[Diagram: each iteration (iter. 1, iter. 2, …) reads its input from HDFS and writes its output back to HDFS, so a chain of Map/Reduce jobs pays the disk-IO cost repeatedly]
![Page 6: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/6.jpg)
Approach: Leverage Memory?

• Memory bus >> disk & SSDs
• Many datasets fit into memory: 1 TB ≈ 1 billion records @ 1 KB each
• Memory capacity also follows Moore's Law: a single 8 GB stick of RAM is about $80 right now; by 2021 you should be able to buy a 64 GB stick for the same price.
![Page 9: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/9.jpg)
Spark

"A big data analytics cluster-computing framework written in Scala."

• Open source, originally developed in the AMPLab at UC Berkeley.
• Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).
• Designed for running iterative algorithms & interactive analytics.
• Highly compatible with Hadoop's storage APIs: it can run on your existing Hadoop cluster setup.
• Developers can write driver programs using multiple programming languages.
![Page 10: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/10.jpg)
Spark

[Diagram: a Spark Driver (master) talks to a Cluster Manager; each HDFS Datanode runs a Spark Worker with a cache over the local HDFS blocks]
![Page 11: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/11.jpg)
Spark

[Diagram: the MapReduce iteration pattern again: each iteration (iter. 1, iter. 2, …) reads from and writes to HDFS]
![Page 12: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/12.jpg)
Spark

Not tied to the 2-stage MapReduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly

[Diagram: the input is read from HDFS once; iterations (iter. 1, iter. 2, …) then run over in-memory data. Chart: logistic regression in Hadoop vs. Spark]
![Page 13: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/13.jpg)
Spark

A simple analytical operation:

```scala
// 1: Select count(*) from pagecounts
val pagecount = spark.textFile("/wiki/pagecounts")
pagecount.count()

// 2: Select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1
val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect
```
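The same filter/map/reduce logic can be checked locally with plain Scala collections on a few hypothetical sample lines (this is not the Spark API, just the same computation in miniature):

```scala
// Hypothetical pagecount records: <date_time> <project_code> <page_title> <num_hits> <page_size>
val pagecounts = Seq(
  "20140101-000000 en Main_Page 42 5000",
  "20140101-000000 en Spark 7 1200",
  "20140101-000000 fr Accueil 3 800"
)

// filter: keep English pages only (field 2 == "en")
val englishPages = pagecounts.filter(_.split(" ")(1) == "en")

// map: key by field 1 (date_time), value = num_hits
val keyValues = englishPages.map { line =>
  val fields = line.split(" ")
  (fields(0), fields(3).toInt)
}

// reduceByKey equivalent: group by key and sum the values
val summed = keyValues.groupBy(_._1).map { case (k, vs) => (k, vs.map(_._2).sum) }

println(summed("20140101-000000"))  // 49
```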
![Page 14: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/14.jpg)
Shark

HIVE on SPARK = SHARK
• A large-scale data warehouse system, just like Apache Hive.
• Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
• Built on top of Spark (thus a faster execution engine).
• Provides in-memory materialized tables (cached tables).
• Cached tables use columnar storage instead of row storage.

Row storage:        Column storage:
1 ABC 4.1           1   2   3
2 XYZ 3.5           ABC XYZ PPP
3 PPP 6.4           4.1 3.5 6.4
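The row-vs-column trade-off can be sketched in plain Scala (hypothetical data; real columnar engines also add compression and skipping, which this sketch omits):

```scala
// Row storage: one object per record; scanning one column still touches whole rows.
case class Record(id: Int, name: String, value: Double)
val rows = Seq(Record(1, "ABC", 4.1), Record(2, "XYZ", 3.5), Record(3, "PPP", 6.4))
val rowSum = rows.map(_.value).sum

// Column storage: one array per column; aggregating a column reads only that array.
val ids = Array(1, 2, 3)
val names = Array("ABC", "XYZ", "PPP")
val values = Array(4.1, 3.5, 6.4)
val colSum = values.sum

println(rowSum == colSum)  // true: same data, different layout
```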
![Page 15: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/15.jpg)
Shark

[Diagram: Hive architecture. A client (CLI, JDBC) talks to the Driver, which runs the SQL Parser, Query Optimizer, Physical Plan and Execution stages against the Metastore, executing via MapReduce over HDFS]
![Page 16: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/16.jpg)
SHARK

[Diagram: Shark architecture. The same client, Driver, SQL Parser, Query Optimizer, Physical Plan and Metastore as Hive, but execution runs on Spark with a Cache Manager instead of MapReduce]
![Page 19: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/19.jpg)
Spark Programming Model

The user (developer) writes a driver program:

```scala
val sc = new SparkContext
val rdd = sc.textFile("hdfs://…")
rdd.filter(…)
rdd.cache()
rdd.count()
rdd.map(…)
```

[Diagram: the driver program's SparkContext talks to a Cluster Manager, which schedules Tasks onto Executors (each with a cache) on Worker Nodes, reading blocks from HDFS Datanodes]
![Page 20: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/20.jpg)
Spark Programming Model

The driver program builds RDDs (Resilient Distributed Datasets):
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators
![Page 21: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/21.jpg)
RDD

Programming interface: the programmer can perform 3 types of operations:

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (storage level).
• Examples: persist(), cache()
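The laziness of transformations can be illustrated with Scala's own lazy views (an analogy only; Spark implements laziness differently, via the lineage DAG):

```scala
// A view defers the map, just as an RDD transformation defers work until an action.
var evaluated = 0
val doubled = (1 to 10).view.map { x => evaluated += 1; x * 2 }

println(evaluated)  // 0: the "transformation" has not run yet

val total = doubled.sum  // the "action" forces evaluation
println(evaluated)  // 10: every element was now processed
println(total)      // 110
```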
![Page 22: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/22.jpg)
Spark

How Spark works:
• RDD: a parallel collection with partitions.
• A user application creates RDDs, transforms them, and runs actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a series of Tasks (one Task per Partition).
![Page 23: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/23.jpg)
Spark

Example:

```scala
sc.textFile("/wiki/pagecounts")
```

[DAG: textFile → RDD[String]]
![Page 24: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/24.jpg)
Spark

Example:

```scala
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
```

[DAG: textFile → RDD[String] → map → RDD[Array[String]]]
![Page 25: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/25.jpg)
Spark

Example:

```scala
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
```

[DAG: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)]]
![Page 26: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/26.jpg)
Spark

Example:

```scala
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
```

[DAG: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)]]
![Page 27: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/27.jpg)
Spark

Example:

```scala
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()
```

[DAG: textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)] → collect → Array[(String, Int)]]
![Page 28: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/28.jpg)
Spark

Execution Plan:

[DAG: textFile → map → map → reduceByKey → collect]

The logical plan above gets compiled by the DAG scheduler into a plan comprising stages, as follows.
![Page 29: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/29.jpg)
Spark

Execution Plan:

[DAG: Stage 1 = textFile → map → map → partial reduceByKey; Stage 2 = final reduceByKey → collect]

Stages are sequences of RDDs that don't have a shuffle in between.
![Page 30: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/30.jpg)
Spark

Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
![Page 31: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/31.jpg)
Spark

Stage Execution:

[Diagram: Stage 1 fans out into one Task per partition]

• Create a task for each partition in the new RDD
• Serialize the task
• Schedule and ship tasks to slaves

And all this happens internally (you don't need to do anything).
![Page 32: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/32.jpg)
Spark

Task Execution:

A Task is the fundamental unit of execution in Spark.

[Diagram: over time, a task fetches its input (from HDFS or an RDD), executes, and writes its output (to HDFS, an RDD, or intermediate shuffle output)]
![Page 33: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/33.jpg)
Spark

Spark Executor (Slaves):

[Diagram: each executor runs multiple tasks concurrently, one pipeline of fetch input → execute task → write output per core (Core 1, Core 2, Core 3)]
![Page 34: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/34.jpg)
Spark

Summary of components:
• Task: the fundamental unit of execution in Spark
• Stage: a set of Tasks that run in parallel
• DAG: logical graph of RDD operations
• RDD: parallel dataset with partitions
![Page 37: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/37.jpg)
Example & Demo

Cluster details:
• 6 m1.xlarge EC2 nodes
• 1 master node, 5 worker nodes
• 64-bit, 4 vCPUs, 15 GB RAM each
![Page 38: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/38.jpg)
Example & Demo

Dataset: Wiki page view stats
• 20 GB of webpage view counts
• 3 days' worth of data
• Format: <date_time> <project_code> <page_title> <num_hits> <page_size>

```scala
// Base RDD over all wiki pages
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

// Transformed RDD for all English pages (cached)
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()  // first count materializes the cache
englishPages.count()  // second count hits the cache
```
![Page 39: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/39.jpg)
Example & Demo

Dataset: Wiki page view stats (as above).

```scala
// Select date, sum(pageviews) from pagecounts group by date
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(3).toInt))
  .reduceByKey(_ + _, 1)
  .collect
  .foreach(println)

// Select date, count(distinct pageURL) from pagecounts group by date
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(2)))
  .distinct()
  .countByKey()
  .foreach(println)

// Select distinct(datetime) from pagecounts order by datetime
englishPages.map(line => line.split(" "))
  .map(line => (line(0), 1))
  .distinct()
  .sortByKey()
  .collect()
  .foreach(println)
```
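The third query's distinct-and-sort logic, checked locally with plain Scala collections on hypothetical lines:

```scala
// Hypothetical records; field 1 is the datetime.
val lines = Seq(
  "20140101-010000 en A 1 10",
  "20140101-000000 en B 2 20",
  "20140101-010000 en C 3 30"
)

// select distinct(datetime) ... order by datetime
val datetimes = lines.map(_.split(" ")(0)).distinct.sorted

datetimes.foreach(println)  // 20140101-000000, then 20140101-010000
```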
![Page 40: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/40.jpg)
Example & Demo

Dataset: network datasets (directed and bi-directed graphs)
• A small Facebook social network: 127 nodes (friends), 1668 edges (friendships), bi-directed graph
• Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks), directed graph
![Page 41: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/41.jpg)
Example & Demo

PageRank calculation:
• Estimates node importance.
• Each directed link A -> B is a vote for B from A.
• The more links to a page, the more important the page is.
• When a page with a higher PR points to something, its vote weighs more.

Algorithm:
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute (rank of p) / (number of neighbors of p) to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.
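The three steps above, run on a tiny hypothetical graph with plain Scala collections (the talk's actual run uses Spark RDDs on the Google site network):

```scala
// Tiny hypothetical graph: A -> B, A -> C, B -> C, C -> A
val links = Map("A" -> Seq("B", "C"), "B" -> Seq("C"), "C" -> Seq("A"))

// 1. Start each page at a rank of 1.
var ranks = links.keys.map(_ -> 1.0).toMap

for (_ <- 1 to 50) {
  // 2. Each page p contributes rank(p) / #neighbors(p) to its neighbors.
  val contribs = links.toSeq.flatMap { case (page, outs) =>
    outs.map(dest => (dest, ranks(page) / outs.size))
  }
  // 3. New rank = 0.15 + 0.85 * sum of received contributions.
  ranks = contribs.groupBy(_._1).map { case (page, cs) =>
    page -> (0.15 + 0.85 * cs.map(_._2).sum)
  }
}

// C receives votes from both A and B, so it ends up ranked above B.
ranks.toSeq.sortBy(-_._2).foreach { case (p, r) => println(f"$p : $r%.4f") }
```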
![Page 42: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/42.jpg)
Example & Demo

Scala code:

```scala
var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
```
![Page 43: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/43.jpg)
![Page 44: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/44.jpg)
2 seconds
![Page 45: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/45.jpg)
38 seconds

Page Rank : Page URL
761.1985177 : google
455.7028756 : google/about.html
259.6052388 : google/privacy.html
192.7257649 : google/jobs/
144.0349154 : google/support
134.1566312 : google/terms_of_service.html
130.3546324 : google/intl/en/about.html
123.4014613 : google/imghp
120.0661165 : google/accounts/Login
118.6884515 : google/intl/en/options/
112.2309539 : google/preferences
108.8375347 : google/sitemap.html
106.9724799 : google/press/
105.822426 : google/language_tools
105.1554798 : google/support/toolbar/
99.97741309 : google/maps
97.90651416 : google/advanced_search
90.7910291 : google/intl/en/services/
90.70522689 : google/intl/en/ads/
87.4353413 : google/adsense/
![Page 47: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/47.jpg)
Spark Current Users & Roadmap
Source: Apache - Powered By Spark
![Page 48: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/48.jpg)
Roadmap
![Page 49: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/49.jpg)
Conclusion

• Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.
• Suitable for scenarios where sufficient memory is available in your cluster.
• It provides an integrated framework for advanced analytics like graph processing, stream processing and machine learning, which simplifies integration.
• Its community is expanding and development is happening very aggressively.
• It's comparatively newer than Hadoop and has relatively few users so far.
![Page 50: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/50.jpg)
Thank You

Speaker: Manish Gupta
Email: [email protected]
Organized by UNICOM Trainings & Seminars Pvt. Ltd.
www.unicomlearning.com | www.bigdatainnovation.org
![Page 51: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/51.jpg)
![Page 52: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/52.jpg)
Backup Slides
![Page 53: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/53.jpg)
Spark Internal Components

[Diagram: Spark core (operators, block manager, scheduler, networking, accumulators, broadcast) sits beneath the interpreter, on top of Hadoop I/O, a Mesos backend and a standalone backend]
![Page 54: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/54.jpg)
In-Memory

But what if I run out of memory?

[Chart: iteration time (s) vs. % of working set in memory]
Cache disabled: 68.8 s
25% cached: 58.1 s
50% cached: 40.7 s
75% cached: 29.7 s
Fully cached: 11.5 s
![Page 55: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/55.jpg)
Benchmarks

AMPLab performed quantitative and qualitative comparisons of 4 systems: Hive, Impala, Redshift and Shark.

Done on the Common Crawl Corpus dataset:
• 81 TB in size
• Consists of 3 tables: Page Rankings, User Visits, Documents
• Data was partitioned such that each node had: 25 GB of User Visits, 1 GB of Rankings, 30 GB of web crawl (documents)

Source: https://amplab.cs.berkeley.edu/benchmark/#
![Page 56: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/56.jpg)
Benchmarks
![Page 57: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/57.jpg)
Benchmarks: Hardware Configuration
![Page 58: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/58.jpg)
Benchmarks
• Redshift outperforms for on-disk data.• Shark and Impala outperform Hive by 3-4X.• For larger result-sets, Shark outperforms Impala.
![Page 59: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/59.jpg)
Benchmarks
• Redshift columnar storage outperforms every time.• Shark in-memory is 2nd best in all cases.
![Page 60: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/60.jpg)
Benchmarks

• Redshift's bigger cluster has an advantage.
• Shark and Impala are competitive.
![Page 61: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/61.jpg)
Benchmarks

• Impala & Redshift don't have UDFs.
• Shark outperforms Hive.
![Page 62: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/62.jpg)
Roadmap
![Page 63: Lightening Fast Big Data Analytics using Apache Spark](https://reader033.vdocument.in/reader033/viewer/2022061223/54c6542f4a7959b1098b4658/html5/thumbnails/63.jpg)
Spark

In the last 6 months of 2013