Lightning Fast Big Data Analytics using Apache Spark
DESCRIPTION
Lightning Fast Big Data Analytics using Apache Spark
----------------------------------------------------------------------------------
Hadoop gives you a great (actually revolutionary) mechanism for storing large datasets in a highly fault-tolerant and highly available storage system (HDFS), and the ability to process these mammoth datasets using its massively parallel and distributed processing framework (MapReduce). It was built for batch processing, where analysts and programmers submit a series of jobs to crunch very large structured/unstructured datasets and then wait for results before performing further analysis. But one of the few things Hadoop is criticized for is its speed and lack of interactivity (mainly because its user base has grown tremendously, and people always demand more, especially when it comes to speed). Spark is an open-source system that can run on top of your existing HDFS and can provide up to 100x faster (almost interactive) in-memory analytics than MapReduce.

Topics that will be covered:
- Quick Introduction of Hadoop & its Limitations
- Introduction of Spark
- Spark Architecture
- Programming Model of Spark
- Demo
- Spark Use Cases

TRANSCRIPT
India Big Data Week 2014
Lightning Fast Big Data Analytics using Apache Spark
Manish Gupta, Solutions Architect – Product Engineering and Development
30th Jan 2014 - Delhi
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
What is Hadoop?
It's open-source software for distributed storage of large datasets on commodity-class hardware in a highly fault-tolerant, scalable and flexible way.
It also provides a programming model/framework (MapReduce) for processing these large datasets in a massively parallel, fault-tolerant and data-location-aware fashion.
[Diagram: Input is stored in HDFS; parallel Map tasks feed Reduce tasks, which produce the Output.]
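To make the programming model concrete, here is the shape of MapReduce sketched with plain Scala collections (word count, the classic example). This is illustrative only, not Hadoop API code:

// The shape of MapReduce, sketched with Scala collections (not Hadoop API code):
// a map phase emits (key, value) pairs, a shuffle groups them by key,
// and a reduce phase folds each group.
val input = Seq("to be or not", "to be")
val mapped = input.flatMap(_.split(" ")).map(word => (word, 1))                   // map phase
val shuffled = mapped.groupBy { case (word, _) => word }                          // shuffle: group by key
val reduced = shuffled.map { case (word, pairs) => (word, pairs.map(_._2).sum) }  // reduce phase
// reduced: Map(to -> 2, be -> 2, or -> 1, not -> 1)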
Slow due to replication, serialization, and disk IO
Inefficient for:
• Iterative algorithms (Machine Learning, Graphs & Network Analysis)
• Interactive Data Mining (R, Excel, Ad-hoc Reporting, Searching)
Limitations of Map Reduce
[Diagram: an iterative job on Hadoop: each iteration (iter. 1, iter. 2, …) starts with an HDFS read and ends with an HDFS write, so every Map/Reduce pass moves data from Input to Output through disk.]
Approach: Leverage Memory?
Memory bus >> disk & SSDs
Many datasets fit into memory
1TB = 1 billion records @ 1 KB
Memory capacity also follows Moore's Law:
A single 8 GB stick of RAM is about $80 right now. In 2021, you'd be able to buy a single stick of RAM that contains 64 GB for the same price.
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark
Open source; originally developed in the AMPLab at UC Berkeley.
Provides in-memory analytics, which is faster than Hadoop/Hive (up to 100x).
Designed for running iterative algorithms & interactive analytics.
Highly compatible with Hadoop's storage APIs.
- Can run on your existing Hadoop cluster setup.
Developers can write driver programs using multiple programming languages.
"A big data analytics cluster-computing framework written in Scala."
…
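As a taste of what a driver program looks like, here is a minimal sketch in Scala (Spark 0.9-era API); the master URL and HDFS path are placeholders for your own cluster:

import org.apache.spark.SparkContext

object MinimalDriver {
  def main(args: Array[String]) {
    // Placeholders: point these at your own cluster manager and HDFS namenode.
    val sc = new SparkContext("spark://master:7077", "MinimalDriver")
    val lines = sc.textFile("hdfs://namenode:8020/data/input")  // reads from existing HDFS
    println("Line count: " + lines.count())
    sc.stop()
  }
}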
Spark
[Architecture diagram: the Spark Driver (Master) works with a Cluster Manager to run Spark Workers co-located with HDFS DataNodes; each worker holds an in-memory cache over the HDFS blocks on its node.]
Spark
[Diagram, for contrast: on Hadoop, every iteration (iter. 1, iter. 2, …) begins with an HDFS read and ends with an HDFS write.]
Spark
[Diagram: on Spark, the Input is read from HDFS once; iter. 1, iter. 2, … then run on in-memory data.]

Not tied to the 2-stage Map Reduce paradigm:
1. Extract a working set
2. Cache it
3. Query it repeatedly

[Chart: logistic regression in Hadoop and Spark.]
Spark
A simple analytical operation:
1) Select count(*) from pagecounts

val pagecount = sc.textFile("/wiki/pagecounts")
pagecount.count()

2) Select Col1, sum(Col4) from pagecounts where Col2 = "en" group by Col1

val englishPages = pagecount.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()
val englishTuples = englishPages.map(line => line.split(" "))
val englishKeyValues = englishTuples.map(line => (line(0), line(3).toInt))
englishKeyValues.reduceByKey(_ + _, 1).collect()
Shark
HIVE on SPARK = SHARK
A large-scale data warehouse system, just like Apache Hive.
Highly compatible with Hive (HQL, metastore, serialization formats, and UDFs).
Built on top of Spark (thus a faster execution engine).
Provides in-memory materialized tables (cached tables), and cached tables use columnar storage instead of row storage.
Column storage:
1   | 2   | 3
ABC | XYZ | PPP
4.1 | 3.5 | 6.4

Row storage:
1 | ABC | 4.1
2 | XYZ | 3.5
3 | PPP | 6.4
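A minimal sketch of the difference, using the table above and plain Scala structures (illustrative only, not Shark's internal format):

// Row storage: one record object per row.
case class Row(id: Int, name: String, score: Double)
val rowStore = Seq(Row(1, "ABC", 4.1), Row(2, "XYZ", 3.5), Row(3, "PPP", 6.4))

// Column storage: one array per column. Scanning a single column touches
// less data and compresses better, which is why cached tables use it.
val ids    = Array(1, 2, 3)
val names  = Array("ABC", "XYZ", "PPP")
val scores = Array(4.1, 3.5, 6.4)
val avgScore = scores.sum / scores.length   // touches only the `scores` column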
Shark
[Diagram, HIVE architecture: a Client (CLI, JDBC) talks to the Driver, whose SQL Parser, Query Optimizer and Physical Plan Execution run on Map Reduce, backed by the Meta store and HDFS.]
Shark
[Diagram, SHARK architecture: the same stack, except the Driver gains a Cache Mgr. and the Physical Plan Execution runs on Spark instead of Map Reduce, still backed by the Meta store and HDFS.]
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark Programming Model
The user (developer) writes a driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
[Diagram: the Driver Program (SparkContext) talks to a Cluster Manager; each Worker Node runs an Executor with a cache and multiple Tasks.]
Spark Programming Model
The user (developer) writes a driver program:

sc = new SparkContext
rDD = sc.textFile("hdfs://…")
rDD.filter(…)
rDD.cache
rDD.count
rDD.map
Driver Program
RDD (Resilient Distributed Dataset)
• Immutable data structure
• In-memory (explicitly)
• Fault tolerant
• Parallel data structure
• Controlled partitioning to optimize data placement
• Can be manipulated using a rich set of operators.
RDD
Programming interface: a programmer can perform 3 types of operations:

Transformations
• Create a new dataset from an existing one.
• Lazy in nature: they are executed only when some action is performed.
• Examples: map(func), filter(func), distinct()

Actions
• Return a value to the driver program, or export data to a storage system, after performing a computation.
• Examples: count(), reduce(func), collect(), take()

Persistence
• For caching datasets in memory for future operations.
• Option to store on disk, in RAM, or mixed (controlled by the Storage Level).
• Examples: persist(), cache() (see the combined sketch below)
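A short sketch tying the three operation types together, assuming the sc of the Spark shell (the HDFS path is a placeholder):

import org.apache.spark.storage.StorageLevel

val logs = sc.textFile("hdfs://namenode:8020/data/logs")   // base RDD
val errors = logs.filter(_.contains("ERROR"))              // transformation: lazy, nothing runs yet
errors.persist(StorageLevel.MEMORY_AND_DISK)               // persistence: keep in RAM, spill to disk
val n = errors.count()                                     // action: triggers the actual computation
val firstFive = errors.take(5)                             // later actions reuse the persisted data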
Spark
How Spark works:
• RDD: a parallel collection with partitions.
• A user application creates RDDs, transforms them, and runs actions.
• This results in a DAG (Directed Acyclic Graph) of operators.
• The DAG is compiled into stages.
• Each stage is executed as a series of Tasks (one Task for each Partition); the sketch below shows one way to inspect the DAG.
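One way to see the DAG that Spark builds is RDD.toDebugString, which prints the lineage of an RDD (the path below is a placeholder):

val counts = sc.textFile("hdfs://namenode:8020/wiki/pagecounts")
  .map(_.split("\t"))
  .map(r => (r(0), 1))
  .reduceByKey(_ + _)
println(counts.toDebugString)   // prints the chain of RDDs, with the shuffle boundary visible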
Spark
Example:
sc.textFile(“/wiki/pagecounts”) RDD[String]
textFile
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))

textFile → RDD[String] → map → RDD[Array[String]]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)]
Spark
Example:
sc.textFile("/wiki/pagecounts")
  .map(line => line.split("\t"))
  .map(r => (r(0), r(1).toInt))
  .reduceByKey(_ + _, 3)
  .collect()

textFile → RDD[String] → map → RDD[Array[String]] → map → RDD[(String, Int)] → reduceByKey → RDD[(String, Int)] → collect → Array[(String, Int)]
Spark
Execution Plan:
textFile → map → map → reduceByKey → collect

The logical plan above gets compiled by the DAG scheduler into a plan comprising Stages, as shown next.
Spark
Execution Plan:
textFile → map → map → reduceByKey → collect

Stage 1: textFile, map, map, partial reduceByKey
Stage 2: final reduceByKey, collect

Stages are sequences of RDDs that don't have a shuffle in between.
Spark
textFile → map → map → reduceByKey → collect

Stage 1:
1. Read HDFS split
2. Apply both the maps
3. Start partial reduce
4. Write shuffle data

Stage 2:
1. Read shuffle data
2. Final reduce
3. Send result to driver program
Spark
Stage Execution (Stage 1):
[Diagram: one Task per partition (Task 1, Task 2, ...)]
• Create a task for each Partition in the new RDD
• Serialize the Task
• Schedule and ship Tasks to Slaves
And all this happens internally (you don't need to do anything).
Spark
Task Execution:
A Task is the fundamental unit of execution in Spark.
[Diagram: over time, a task fetches input (from HDFS or an RDD), executes, and writes output (to HDFS, an RDD, or intermediate shuffle output).]
Spark
Spark Executor (Slaves):
[Diagram: each executor core (Core 1, Core 2, Core 3) runs a pipeline of tasks, each task fetching input, executing, and writing output, so many tasks run concurrently per executor.]
Spark
Summary of Components
Task : The fundamental unit of execution in Spark
Stage : Set of Tasks that run in parallel
DAG : Logical Graph of RDD operations
RDD : Parallel dataset with partitions
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Example & Demo
Cluster details:
• 6 m1.xlarge EC2 nodes (64-bit, 4 vCPUs, 15 GB RAM each)
• 1 machine is the master node
• 5 machines are worker nodes
Example & Demo
Dataset: Wiki page view stats: 20 GB of webpage view counts; 3 days' worth of data.

Format: <date_time> <project_code> <page_title> <num_hits> <page_size>

Base RDD of all wiki pages:
val allPages = sc.textFile("/wiki/pagecounts")
allPages.take(10).foreach(println)
allPages.count()

Transformed RDD of all English pages (cached):
val englishPages = allPages.filter(_.split(" ")(1) == "en")
englishPages.cache()
englishPages.count()   // first count scans HDFS and fills the cache
englishPages.count()   // second count is served from memory
Example & Demo
Dataset: Wiki page view stats: 20 GB of webpage view counts; 3 days' worth of data.

Format: <date_time> <project_code> <page_title> <num_hits> <page_size>

Select date, sum(pageviews) from pagecounts group by date:
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(3).toInt))
  .reduceByKey(_ + _, 1)
  .collect.foreach(println)

Select date, count(distinct pageURL) from pagecounts group by date:
englishPages.map(line => line.split(" "))
  .map(line => (line(0).substring(0, 8), line(2)))
  .distinct()
  .countByKey()
  .foreach(println)

Select distinct(datetime) from pagecounts order by datetime:
englishPages.map(line => line.split(" "))
  .map(line => (line(0), 1))
  .distinct()
  .sortByKey()
  .collect()
  .foreach(println)
Example & Demo
Dataset: Network datasets (directed and bi-directed graphs):
• One small Facebook social network: 127 nodes (friends), 1668 edges (friendships); a bi-directed graph.
• Google's internal site network: 15713 nodes (web pages), 170845 edges (hyperlinks); a directed graph.
Example & Demo
Page Rank Calculation:
• Estimates node importance.
• Each directed link from A -> B is a vote for B from A.
• The more links to a page, the more important the page is.
• When a page with higher PR points to something, its vote weighs more.

The algorithm (implemented in the Scala code below):
1. Start each page at a rank of 1.
2. On each iteration, have page p contribute (rank of p) / (no. of neighbors of p) to its neighbors.
3. Set each page's rank to 0.15 + 0.85 × contribs.
Example & Demo
Scala code:

var iters = 100
val lines = sc.textFile("/dataset/google/edges.csv", 1)
val links = lines.map { s =>
  val parts = s.split("\t")
  (parts(0), parts(1))
}.distinct().groupByKey().cache()
var ranks = links.mapValues(v => 1.0)
for (i <- 1 to iters) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    val size = urls.size
    urls.map(url => (url, rank / size))
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
val output = ranks.map(l => (l._2, l._1)).sortByKey(false).map(l => (l._2, l._1))
output.take(20).foreach(tup => println(tup._2 + " : " + tup._1))
2 seconds
38 seconds

Page Rank | Page URL
761.1985177 | google
455.7028756 | google/about.html
259.6052388 | google/privacy.html
192.7257649 | google/jobs/
144.0349154 | google/support
134.1566312 | google/terms_of_service.html
130.3546324 | google/intl/en/about.html
123.4014613 | google/imghp
120.0661165 | google/accounts/Login
118.6884515 | google/intl/en/options/
112.2309539 | google/preferences
108.8375347 | google/sitemap.html
106.9724799 | google/press/
105.822426 | google/language_tools
105.1554798 | google/support/toolbar/
99.97741309 | google/maps
97.90651416 | google/advanced_search
90.7910291 | google/intl/en/services/
90.70522689 | google/intl/en/ads/
87.4353413 | google/adsense/
Agenda Of The Talk:
Hadoop – A Quick Introduction
An Introduction To Spark & Shark
Spark – Architecture & Programming Model
Example & Demo
Spark Current Users & Roadmap
Spark Current Users & Roadmap
Source: Apache - Powered By Spark
Roadmap
Conclusion
Because of in-memory processing, computations are very fast. Developers can write iterative algorithms without writing out a result set after each pass through the data.
Suitable for scenarios where sufficient memory is available in your cluster.
It provides an integrated framework for advanced analytics like graph processing, stream processing, machine learning, etc. This simplifies integration.
Its community is expanding and development is happening very aggressively.
It's comparatively newer than Hadoop and still has relatively few users.
Topic: Lightning Fast Big Data Analytics using Apache Spark
Organized by UNICOM Trainings & Seminars Pvt. Ltd.
Speaker name: MANISH GUPTA
Email ID: [email protected]
Thank You
Backup Slides
Spark Internal Components
[Diagram: Spark core (Operators, Block manager, Scheduler, Networking, Accumulators, Broadcast) with pluggable pieces on top: Interpreter, Hadoop I/O, Mesos backend, Standalone backend.]
In-Memory
But what if I run out of memory?
[Chart: iteration time (s) vs. % of working set in memory; performance degrades gracefully as less of the working set fits in memory.]

% of working set in memory | Iteration time (s)
Cache disabled | 68.8
25% | 58.1
50% | 40.7
75% | 29.7
Fully cached | 11.5
Benchmarks
AMPLab performed quantitative and qualitative comparisons of 4 systems: Hive, Impala, Redshift and Shark.
Done on the Common Crawl Corpus dataset: 81 TB in size, consisting of 3 tables: Page Rankings, User Visits, Documents.
Data was partitioned so that each node had: 25 GB of User Visits, 1 GB of Rankings, 30 GB of Web Crawl (documents).
Source: https://amplab.cs.berkeley.edu/benchmark/#
Benchmarks
Benchmarks: Hardware Configuration
Benchmarks
• Redshift outperforms the others for on-disk data.
• Shark and Impala outperform Hive by 3-4x.
• For larger result sets, Shark outperforms Impala.
Benchmarks
• Redshift's columnar storage outperforms every time.
• Shark in-memory is 2nd best in all cases.
Benchmarks
• Redshift's bigger cluster has an advantage.
• Shark and Impala are competitive with each other.
Benchmarks
• Impala & Redshift don't support UDFs.
• Shark outperforms Hive.
Roadmap
Spark
In the last 6 months of 2013