Lecture 09: Cluster Computing with MapReduce & Spark


Source: web.csie.ndhu.edu.tw/showyang/DistrSys2018f/09MapReduceSpark.pdf

CSIE52400/CSIEM0140 Distributed Systems Lecture 09: MapReduce & Spark

Note 1

CSIE52400/CSIEM0140 Distributed Systems
Lecture 09: Cluster Computing with MapReduce & Spark

What is MapReduce?
Data-parallel programming model for clusters of commodity machines
• Designed for scalability and fault-tolerance
Pioneered by Google
• Processes 20 PB of data per day
Popularized by the open-source Hadoop project
• Used by Yahoo!, Facebook, Amazon, …

CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 2


What is MapReduce Used for?
At Google:
• Index building for Google Search
• Article clustering for Google News
• Statistical machine translation
At Yahoo!:
• Index building for Yahoo! Search
• Spam detection for Yahoo! Mail
At Facebook:
• Data mining
• Ad optimization
• Spam detection


What is MapReduce Used for?
In research:
• Analyzing Wikipedia conflicts (PARC)
• Natural language processing (CMU)
• Bioinformatics (Maryland)
• Particle physics (Nebraska)
• Ocean climate simulation (Washington)
• <Your application here>


MapReduce History
The foundation stone: “The Google File System” by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, 2003 (19th ACM Symposium on Operating Systems Principles).
The paper that started everything: “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat, 2004 (6th Symposium on Operating System Design and Implementation).


MapReduce History
Shortly after the MapReduce paper, open-source pioneers Doug Cutting and Mike Cafarella started working on a MapReduce implementation to solve the scalability problem of Nutch (an open-source search engine).
Over the course of a few months, Cutting and Cafarella built up the underlying file system and processing framework that would become Hadoop (in Java).


MapReduce History
In 2006, Cutting went to work with Yahoo!. They spun out the storage and processing parts of Nutch to form Hadoop (named after Cutting’s son’s stuffed elephant).
Over time, and with heavy investment by Yahoo!, Hadoop eventually became a top-level Apache Foundation project.


MapReduce Today
Today, numerous independent people and organizations contribute to Hadoop. Every new release adds functionality and boosts performance.
Several other open-source projects have been built with Hadoop at their core, and this list is continually growing.
Some of the more popular ones: Pig (programming tool), Hive (warehousing), HBase (NoSQL DB), Mahout (machine learning), and ZooKeeper (distributed systems and services).


Why MapReduce?
Problem: Lots of data!
Example: Word frequencies in Web pages
This is how the world’s first search engine (Archie) was done.
The world’s first search engine vs. a search engine from Stanford called Google.


Word Frequencies in Pages
20+ billion web pages x 20 KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
• 4 months to read the web
• 1,000+ hard drives to store the web
Even more: to do something with the data
• Compute the word frequencies for each word in each website
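As a sanity check, the single-machine read time follows from the slide's own figures (assumed inputs: 20 billion pages at 20 KB each, a mid-range disk rate of 33 MB/s):

```python
# Back-of-envelope check of the slide's figures (assumed inputs:
# 20 billion pages x 20 KB each, one disk reading ~33 MB/s).
corpus_bytes = 20e9 * 20e3            # ~400 TB
disk_rate = 33e6                      # bytes/sec, middle of the 30-35 MB/s range
seconds = corpus_bytes / disk_rate
months = seconds / (30 * 24 * 3600)
print(round(months, 1))               # between 4 and 5 months to read the web once
```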


Basic Solution: Spread the work over many machines
Same problem with 1,000 machines: < 3 hours
New problems: extra programming work
• communication and coordination
• recovering from machine failure
• status reporting
• debugging
• optimization
• locality
This work repeats for every problem you want to solve.


MapReduce Design Goals
1. Scalability to large data volumes:
• Scan 100 TB on 1 node @ 50 MB/s = 24 days
• Scan on 1000-node cluster = 35 minutes
• => 1000s of machines, 10,000s of disks
2. Cost-efficiency:
• Commodity machines (cheap, but unreliable)
• Commodity network
• Automatic fault-tolerance (fewer administrators)
• Easy to use (fewer programmers)
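The scan times in goal 1 can be verified with a quick calculation (the slide rounds its figures up slightly):

```python
# Recomputing the scan times in goal 1 (the slide rounds up slightly).
data_bytes = 100e12                   # 100 TB
node_rate = 50e6                      # 50 MB/s per node
one_node_days = data_bytes / node_rate / 86400
cluster_minutes = data_bytes / (1000 * node_rate) / 60
print(round(one_node_days), round(cluster_minutes))   # ~23 days, ~33 minutes
```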


Computing Clusters
Many racks of computers, thousands of machines per cluster
Limited bisection bandwidth between racks


Typical Hadoop Cluster
[Figure: an aggregation switch above per-rack switches; 1 gigabit links within a rack, 10 gigabit links between racks]
• 40 nodes/rack, up to 4000 nodes in cluster
• 1 Gbps bandwidth within rack, 10 Gbps out of rack
• Node specs (Yahoo terasort): 8 x 2 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf


Hadoop Cluster


Implications of Computing Environment
Single-thread performance doesn’t matter
• Large problems and total throughput/$ are more important than peak performance
Stuff breaks
• More nodes imply a higher probability of breaking down
“Ultra-reliable” hardware doesn’t really help
• At large scales, super-fancy reliable hardware still fails, albeit less often
• Software still needs to be fault-tolerant
• Commodity machines without fancy hardware give better perf/$


Challenges
1. Cheap nodes fail, especially if you have many
• Mean time between failures (MTBF) for 1 node = 3 years
• MTBF for 1000 nodes = 1 day
• Solution: Build fault-tolerance into the system
2. Commodity network = low bandwidth
• Solution: Push computation to the data
3. Programming distributed systems is hard
• Solution: Data-parallel programming model: users write “map” & “reduce” functions, the system distributes work and handles faults
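The MTBF figures in challenge 1 follow directly from dividing one node's MTBF by the node count (assuming independent failures):

```python
# Why 1000 cheap nodes see a failure roughly every day: with independent
# failures, the cluster's MTBF is one node's MTBF divided by the node count.
node_mtbf_days = 3 * 365              # one node: ~3 years
cluster_mtbf_days = node_mtbf_days / 1000
print(round(cluster_mtbf_days, 1))    # ~1.1 days
```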


MapReduce
A simple programming model that applies to many large-scale computing problems
Hides messy details of distributed programs behind the MapReduce runtime library:
• Automatic parallelization
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Robustness
• Improvements to core library


MapReduce Basics
Semantics borrowed from functional programming (FP) languages
FP languages are usually stateless, which is very good for parallelism
• No need to worry about synchronization
Even without MapReduce, many programming languages like Ruby and Python provide such semantics
• Simplicity
• Chance for implicit optimization


Functional Abstractions Hide Parallelism
The ideas of functions, mapping, and reducing are from functional programming languages (e.g., Lisp)
Map()
• In FP: [1, 2, 3, 4] -(*2)-> [2, 4, 6, 8]
• Processes a key/value pair to generate intermediate key/value pairs
Reduce()
• In FP: [1, 2, 3, 4] -(sum)-> 10
• Merges all intermediate values associated with the same key
Both Map and Reduce are easy to parallelize
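The two FP examples above map directly onto Python's built-in `map` and `functools.reduce`:

```python
from functools import reduce

doubled = list(map(lambda x: x * 2, [1, 2, 3, 4]))    # [1,2,3,4] -(*2)-> [2,4,6,8]
total = reduce(lambda a, b: a + b, [1, 2, 3, 4])      # [1,2,3,4] -(sum)-> 10
print(doubled, total)                                 # [2, 4, 6, 8] 10
```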


MapReduce in Functions
Data type: key-value records
Map function:
(K_in, V_in) -> list(K_inter, V_inter)
Reduce function:
(K_inter, list(V_inter)) -> list(K_out, V_out)


Programming Model
1. Read a lot of data
2. Map: extract something you care about from each record
3. Shuffle and Sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
The outline stays the same; map and reduce change to fit the problem


Programming Model: More Specifically
Programmer specifies two primary methods:
map(k, v) -> <k', v'>*
reduce(k', <v'>*) -> <k'', v''>*
All v' with the same k' are reduced together in order
Can also specify:
partition(k', total partitions) -> partition for k'
■ often a simple hash of the key
■ allows reduce operations for different k' to be parallelized
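A minimal partitioner in this spirit can be sketched in Python (`partition` here is a hypothetical helper, not Hadoop's actual API; Python's built-in `hash` stands in for the job's key hash):

```python
# 'partition' is a hypothetical stand-in for the pluggable partitioner;
# Python's built-in hash() plays the role of the job's key hash.
def partition(key, total_partitions):
    return hash(key) % total_partitions

p = partition("the", 4)
assert partition("the", 4) == p   # same key always lands in the same partition
print(0 <= p < 4)                 # True: every key maps to a valid reducer index
```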


The Word Count Example

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))
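The pseudocode above leaves the shuffle implicit. The whole pipeline can be simulated in plain Python (`shuffle` here is an illustrative helper standing in for what the framework does between map and reduce):

```python
from collections import defaultdict

def mapper(line):                      # emit (word, 1) for every word
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):                    # group values by key, as the framework does
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reducer(key, values):              # sum the counts for one word
    return (key, sum(values))

lines = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
pairs = [kv for line in lines for kv in mapper(line)]
counts = dict(reducer(k, vs) for k, vs in shuffle(pairs).items())
print(counts["the"], counts["fox"], counts["quick"])   # 3 2 1
```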


Word Count Execution

Input (three map tasks):
  “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”

Map output:
  (the,1) (quick,1) (brown,1) (fox,1) | (the,1) (fox,1) (the,1) (ate,1) (mouse,1) | (how,1) (now,1) (brown,1) (cow,1)

Shuffle & Sort groups values by key; two reduce tasks produce:
  Reduce 1: (brown,2) (fox,2) (how,1) (now,1) (the,3)
  Reduce 2: (ate,1) (cow,1) (mouse,1) (quick,1)

Pipeline: Input -> Map -> Shuffle & Sort -> Reduce -> Output


Problems with Hadoop MapReduce
When doing iterative computation:
○ Bad performance due to replication & disk I/O
○ Even worse: communication overheads in the distributed file system


Problems with Hadoop MapReduce
MapReduce greatly simplified big data analysis, but when it got popular, users wanted more:
• More complex, iterative multi-pass analytics (e.g. ML, graph)
• More interactive ad-hoc queries
Requires intensive disk I/O
• Intermediate data is always written to local disk, causing poor performance
• Solution: Apache Spark’s in-memory computing


Solution: Keep the Data in Memory
Apache Spark’s in-memory computing
o 10-100X faster than disk
o Sharing at memory speed


The Working Set Idea
Peter Denning, “The Working Set Model for Program Behavior”, Communications of the ACM, May 1968.
• http://dl.acm.org/citation.cfm?id=363141
Idea: conventional programs generally exhibit a high degree of locality, returning to the same data over and over again.
The operating system, virtual memory system, compiler, and microarchitecture are all designed around this assumption!
Exploiting this observation makes programs run 100X faster than simply using plain old main memory in the obvious way.


Spark
• Exploits the working set idea
• Fast and expressive cluster computing system, interoperable with Apache Hadoop
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Scala, Java, Python
  • Interactive shell
• A subsystem of BDAS (Berkeley Data Analytics Stack)
Up to 100X faster (2-10X on disk); often 5X less code


Spark “on the radar”
2008 – Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
2009 – Spark example built for Nexus (a common substrate for cluster computing) -> Mesos (a distributed systems kernel)
2011 – “Spark is 2 years ahead of anything at Google” – Conviva (a company for online video optimization and analytics) seeing good results with Spark
2012 – Yahoo! working with Spark/Shark (now Spark SQL)
2013 – The project was donated to the Apache Software Foundation
2014 – Spark became a Top-Level Apache Project
2014 – Won the Daytona GraySort 100 TB Benchmark
2016 – New world record on the CloudSort Benchmark: sorting 100 TB using $144.22 USD (3X better than the previous record)


Spark Today
Spark 2.1.1 released on May 02, 2017
Most active open source project in Big Data:
• 1000+ contributors
• 250+ supporting organizations
• Commercial support
• Many success stories …


Berkeley Data Analytics Stack (BDAS)
• An open source software stack that integrates software components built by the AMPLab to make sense of Big Data
• The AMPLab was launched in Jan 2011
• Goal: next generation of analytics data stack for industry & research
  • Berkeley Data Analytics Stack (BDAS)
  • Released as open source


The Berkeley AMPLab
• Funding & sponsors
  • Government
  • Industry


Berkeley Data Analytics Stack
• Goals:
  • Easy to combine batch, streaming, and interactive computations
  • Easy to develop sophisticated algorithms
  • Compatible with the existing open source ecosystem (Hadoop/HDFS)


Berkeley Data Analytics Stack


BDAS Main Components
• Three main components:
  • Mesos: a distributed systems kernel and resource manager that provides efficient resource isolation and sharing across distributed applications, or frameworks
  • Tachyon: a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks
  • Spark: a cluster computing system that aims to make specific kinds of computing (data analytics, ad-hoc queries) fast


Spark Stack


Spark Use Case Reference


Spark Stack


Community of Spark


Goals of Spark
• Low latency (interactive) queries on historical data: enable faster decisions
  • E.g., identify why a site is slow and fix it
• Iterative analytics: graph processing, machine learning
  • E.g., PageRank, MaxFlow, K-Means


The Working Set Idea
Should identify which datasets to access.
Load datasets into memory, use them multiple times.
Keep newly created data in memory until explicitly told to store it.
Master-worker architecture: the master (driver) contains the main logic, and the workers simply keep data in memory and apply functions to the distributed data.
The master knows where data is located, so it can exploit locality.
The driver is written in a functional programming language (Scala) which can be easily parallelized.
Can use other languages: Python, Java, R, …


Approach
• Aggressive use of memory
  • Memory transfer rate >> disk transfer rate
  • Memory density (capacity) still grows with Moore’s Law
  • Many datasets already fit into memory
    • The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
    • E.g., 1 TB = 1 billion records @ 1 KB each


Approach
[Figure: comparison chart (Microsoft’s MapReduce)]


Approach
• Trade between result accuracy and response times
• Why?
  • In-memory processing does not guarantee interactive query processing
    • E.g., ~10s of seconds just to scan 512 GB of RAM!
    • The gap between memory capacity and transfer rate is increasing
• Trade between response time, quality, and cost


Challenges for In-Memory Computation
● Provide distributed memory abstractions for clusters to support apps with working sets
● Retain the attractive properties of MapReduce:
  ○ Fault tolerance (for crashes & stragglers)
  ○ Data locality
  ○ Scalability


Challenge: Fault Tolerance
● Existing in-memory storage systems have interfaces based on fine-grained updates
  ○ Read/write to cells
  ○ E.g. databases, key-value stores, distributed memory
● Requires replicating data or logs for fault tolerance
  ○ Very inefficient & expensive under Big Data
Challenge: How to design a distributed memory abstraction that is both fault-tolerant and efficient?
Solution: Augment the data flow model with RDDs


Spark: Runtime Architecture
● Driver: the Spark program
● Workers: compute and store distributed data


Resilient Distributed Datasets (RDD)
● Distributed data abstraction
  ○ for in-memory computation on large clusters
● Read-only, partitioned records
  ○ The only way to “write” is to create a new RDD
  ○ Partitions are scattered over the cluster
● Only coarse-grained operations are allowed
  ○ map, join, filter, …
  ○ operate on the whole dataset
(The reasons for these designs will be discussed later.)


Resilient Distributed Datasets (RDD)
● Lazy evaluation
  ○ Two types of operations on RDDs: transformations & actions
  ○ Transformations: create a new dataset from an existing one
    ■ do not compute right away, but add a record to the lineage
    ■ only computed when an action requires a result
  ○ Actions: return a value after a computation on the dataset
    ■ an action executes all operations in the lineage
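Lazy evaluation can be illustrated with a toy class that records transformations in a lineage and only executes them when an action is called (a sketch of the idea, not Spark's real API):

```python
# Transformations only append to the lineage; the action replays it.
class ToyRDD:
    def __init__(self, data, lineage=()):
        self._data = data
        self.lineage = lineage                      # recorded, not executed

    def map(self, f):                               # transformation: lazy
        return ToyRDD(self._data, self.lineage + (("map", f),))

    def filter(self, f):                            # transformation: lazy
        return ToyRDD(self._data, self.lineage + (("filter", f),))

    def collect(self):                              # action: run the lineage now
        out = self._data
        for op, f in self.lineage:
            out = [f(x) for x in out] if op == "map" else [x for x in out if f(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
print(len(rdd.lineage))                             # 2: recorded, nothing computed yet
print(rdd.collect())                                # [6, 8]
```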


Transformations
[Figure: transformations over immutable data]


Operations on RDD


Transformations/Actions


Generality of RDDs
We conjecture that Spark’s combination of data flow with RDDs unifies many proposed cluster programming models
• General data flow models: MapReduce, Dryad, SQL
• Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing
Instead of specialized APIs for one type of app, give users first-class control of distributed datasets


Lineage
Records of operations to the data (RDDs)
• Similar to logs
Maintained by the master node
• Centralized metadata


Lineage: Progress of Computation
● Each RDD consists of partitions
  ○ The detailed lineage structure is a DAG


Lineage: Lazy Evaluation
● Partitions of RDDs are not necessarily in RAM
  ○ Only cached partitions are preserved
[Figure: only the dark rectangles are cached partitions]


Fault Tolerance using Lineage
● An RDD can only be created (written) from
  ○ stable storage
  ○ other RDDs
● Only coarse-grained operations
=> Lost partitions can be re-computed efficiently
=> Less information to maintain
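Lineage-based recovery can be sketched in plain Python: when a partition is lost, only that partition is recomputed by replaying the recorded coarse-grained operations (an illustrative model, not Spark code):

```python
# Each partition can be rebuilt from stable storage plus the lineage,
# so losing a worker costs one recomputation, not a replica.
source = {0: [1, 2], 1: [3, 4], 2: [5, 6]}            # stable storage, per partition
lineage = [("map", lambda x: x * 10), ("filter", lambda x: x > 10)]

def compute(part_id):
    data = source[part_id]
    for op, f in lineage:
        data = [f(x) for x in data] if op == "map" else [x for x in data if f(x)]
    return data

cached = {p: compute(p) for p in source}
del cached[1]                  # the worker holding partition 1 crashes
cached[1] = compute(1)         # replay the lineage for that partition only
print(cached[1])               # [30, 40]
```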


Spark: Runtime Architecture
[Figure: workers hold RDD partitions; transformations (filter, map, …) run on the workers; actions (collect, …) return results to the driver]


Other Issue: Dealing with Stragglers
● Speculative execution
  ○ Observe the progress of the tasks of a job
  ○ Launch duplicates of those tasks that are slower
  ○ It then becomes a race between the original and the speculative copies
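The race between an original task and its speculative copy can be modeled in a few lines of Python (the 2x-median straggler heuristic and the backup cost are assumptions chosen for illustration):

```python
import statistics

task_times = {"t1": 10, "t2": 11, "t3": 60}    # t3 is the straggler
median = statistics.median(task_times.values())
backup_time = 12                               # assumed cost of a fresh duplicate

finish = {}
for task, original in task_times.items():
    if original > 2 * median:                  # simple straggler heuristic
        finish[task] = min(original, backup_time)   # race: original vs copy
    else:
        finish[task] = original

print(max(finish.values()))                    # 12: the job no longer waits for 60
```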


RDDs vs. DSM


Other Issue: Dependency
Narrow dependency: 1/N-to-1
● Execution can be pipelined
● Faster to recompute
Wide dependency: N-to-N


Other Issue: Memory Management
● Problem:
  ○ Some RDDs (partitions) are too large to store in some worker’s memory
  ○ These RDDs are costly to re-compute
● Solution: use hard disks
  ○ Swap RDDs out under an LRU eviction policy
  ○ Users can set persistence priorities on RDDs


Other Issue: Optimization
● Persistence
  ○ Users can indicate which RDDs they will reuse => save them in memory rather than recomputing them
● Partitioning
  ○ Utilize data locality to optimize transformations
  ○ Similar to the partition function in MapReduce when mapping
  ○ e.g. partition URLs by domain name


Programming Model
Resilient Distributed Datasets
• HDFS files, “parallelized” Scala collections
• Can be transformed with map and filter
• Can be cached across parallel operations
Parallel operations
• foreach, reduce, collect, …
Shared variables
• Accumulators (add-only)
• Broadcast variables (read-only)


Resilient Distributed Datasets
● In Spark, an RDD is represented by a Scala object. There are four ways to construct an RDD:
  ○ From a file in a shared filesystem, such as HDFS
  ○ From a Scala collection (e.g., an array)
  ○ By transforming an existing RDD
  ○ By changing the persistence of an existing RDD; RDDs by default are lazy and ephemeral
    ■ cache: hint that the data should be cached after the first time it is computed
    ■ save: save the dataset to a distributed file system (HDFS)


Parallel Operations
● Several parallel operations can be performed on RDDs
  ○ reduce: combines dataset elements using an associative function to produce a result at the driver program
  ○ collect: sends all elements of the dataset to the driver program
  ○ foreach: passes each element through a user-provided function


Functional Programming and Statelessness
● Spark uses Scala, a functional programming language which runs on the JVM
● Recall from the MapReduce session: the stateless properties of functional programming languages are good for parallelization
● That’s why RDDs are built from these semantics.


Example: Log Error Counting

To count the lines containing errors in a large log file stored in HDFS:

val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)

Both errs and ones are lazy RDDs that are never materialized. They can be made persistent with:

val cachedErrs = errs.cache()

_ means “the default thing that should go here.”
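The same lazy pipeline can be sketched in plain Python with generators (an illustration only, not Spark code; the sample log lines are made up):

```python
# Generators, like lazy RDDs, do no work until a result is demanded.
log_lines = [
    "INFO starting up",
    "ERROR disk full",
    "WARN low memory",
    "ERROR network timeout",
]

errs = (line for line in log_lines if "ERROR" in line)  # like filter(_.contains("ERROR"))
ones = (1 for _ in errs)                                # like map(_ => 1)
count = sum(ones)                                       # like reduce(_+_); forces evaluation
print(count)  # -> 2
```

Until `sum` runs, neither generator has scanned a single line, mirroring how errs and ones are never materialized.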


Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")             // base RDD
errors = lines.filter(_.startsWith("ERROR"))     // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                    // cached RDD

cachedMsgs.filter(_.contains("foo")).count       // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

[Figure: the driver ships tasks to workers; each worker reads one HDFS block, caches its partition of messages, and returns results to the driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)


RDDs Revisited

An RDD is an immutable, partitioned, logical collection of records
• Need not be materialized; instead contains the information to rebuild the dataset from stable storage
• Partitioning can be based on a key in each record (using hash or range partitioning)
• Built using bulk transformations on other RDDs
• Can be cached for future reuse

RDD Fault Tolerance

RDDs maintain lineage information that can be used to reconstruct exactly the lost partitions.

Ex: cachedMsgs = textFile(...).filter(_.contains("error"))
                              .map(_.split('\t')(2))
                              .cache()

Lineage: HdfsRDD (path: hdfs://…) → FilteredRDD (func: contains(...)) → MappedRDD (func: split(…)) → CachedRDD
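The recovery idea can be sketched in a few lines of Python (hypothetical helper names, not the Spark API): a lost partition is rebuilt by re-reading stable storage and replaying the recorded transformations in order.

```python
def read_hdfs_partition(path):
    # stand-in for re-reading one partition from stable storage (HDFS)
    return ["ok line", "error\tworker1\tmsg1", "error\tworker2\tmsg2"]

# lineage: the ordered transformations that produced the lost partition
lineage = [
    lambda part: [l for l in part if "error" in l],   # FilteredRDD
    lambda part: [l.split("\t")[2] for l in part],    # MappedRDD
]

def recompute(path, lineage):
    part = read_hdfs_partition(path)
    for transform in lineage:
        part = transform(part)
    return part

print(recompute("hdfs://...", lineage))  # -> ['msg1', 'msg2']
```

Only the lost partition is replayed; healthy partitions are untouched, which is why lineage is so much cheaper than replication.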


Benefits of RDD Model

• Consistency is easy due to immutability
• Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• High performance with in-memory computation
• Despite being restricted, the model seems applicable to a broad variety of applications

Fault Recovery Test

[Chart: iteration time over 10 iterations. The first iteration takes 119 s, normal iterations take 56–59 s, and the iteration in which the failure happens takes 81 s while the lost partitions are recomputed from lineage.]

Behavior with Increasing Cache

[Chart: iteration time vs. % of working set in cache: 69 s with cache disabled, 58 s at 25%, 41 s at 50%, 30 s at 75%, and 12 s when fully cached.]

Spark in Java and Scala

Java API:

JavaRDD<String> lines = spark.textFile(…);
JavaRDD<String> errors = lines.filter(
  new Function<String, Boolean>() {
    public Boolean call(String s) {
      return s.contains("ERROR");
    }
  });
errors.count();

Scala API:

val lines = spark.textFile(…)
val errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count

Which Language to Use?

• Standalone programs can be written in any of them, but the console supports only Python & Scala
• Python developers: can stay with Python for both
• Java developers: consider using Scala for the console (to learn the API)
• Performance: Java/Scala will be faster (statically typed), but Python can do well for numerical work with NumPy

Languages Used with Spark


Scala Cheat Sheet

Variables:
var x: Int = 7
var x = 7     // type inferred
val y = "hi"  // read-only

Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x   // last line returned
}

Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)  // => Array(3, 4, 5)
nums.map(x => x + 2)         // => same
nums.map(_ + 2)              // => same
nums.reduce((x, y) => x + y) // => 6
nums.reduce(_ + _)           // => 6

Java interop:
import java.net.URL
new URL("http://cnn.com").openStream()

More details: scala-lang.org

Learning Spark

• Spark can be installed on Windows.
• Use the Spark interpreter (spark-shell or pyspark) to learn
  ○ Special Scala and Python consoles for cluster users
• Runs in local mode on 1 thread by default, but this can be controlled with the MASTER environment variable:

$ MASTER=local ./spark-shell              # local, 1 thread
$ MASTER=local[2] ./spark-shell           # local, 2 threads
$ MASTER=spark://host:port ./spark-shell  # Spark standalone cluster

First Step: SparkContext

• Main entry point to Spark functionality
• Created for you in Spark shells as the variable sc
• In standalone programs, you’d make your own with:

val sc = new SparkContext(master, appName, [sparkHome], [jars])
// or
val sc = new SparkContext(conf)

where:
• master: cluster URL, or local / local[N]
• appName: the app name
• sparkHome: Spark install path on the cluster
• jars: list of JARs with app code (to ship)

Creating RDDs

// Turn a local collection into an RDDval data = Array(1, 2, 3, 4, 5)val distData = sc.parallelize(data)// sc.parallelize(Array(1, 2, 3, 4))

// Load text file from local FS, HDFS, or S3val distFile = sc.textFile("data.txt")// sc.textFile(“directory/*.txt”)// sc.textFile(“hdfs://namenode:9000/path/file”)

// Use any existing Hadoop InputFormatsc.hadoopFile(keyClass, valClass, inputFmt, conf)

CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 84

Page 43: Lecture 09: Cluster Computing with MapReduce & Sparkweb.csie.ndhu.edu.tw/showyang/DistrSys2018f/09MapReduceSpark.pdf · Lecture 09: Cluster Computing with MapReduce & Spark What is

CSIE52400/CSIEM0140 Distributed Systems Lecture 09: MapReduce & Spark

Note 43

Basic Transformations

val nums = sc.parallelize(Array(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x*x)            // => {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(x => x % 2 == 0)  // => {4}

// Map each element to zero or more others
nums.flatMap(x => 0 to x-1)                 // => {0, 0, 1, 0, 1, 2}
// (0 to x-1 is the sequence of numbers 0, 1, …, x-1)

Basic Actions (in Python)

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()  # => [1, 2, 3]

# Return first K elements
nums.take(2)    # => [1, 2]

# Count number of elements
nums.count()    # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")

Working with Key-Value Pairs

Spark’s “distributed reduce” transformations act on RDDs of key-value pairs.

Python: pair = (a, b)
        pair[0]  # => a
        pair[1]  # => b

Scala:  val pair = (a, b)
        pair._1  // => a
        pair._2  // => b

Java:   Tuple2 pair = new Tuple2(a, b);  // scala.Tuple2
        pair._1  // => a
        pair._2  // => b

Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side.
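The semantics of these two operations can be mimicked in plain Python (a sketch of the behavior, not of how Spark implements the shuffle):

```python
from collections import defaultdict
from functools import reduce

pets = [("cat", 1), ("dog", 1), ("cat", 2)]

def group_by_key(pairs):
    # gather all values sharing a key, like groupByKey
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(groups)

def reduce_by_key(pairs, f):
    # fold each key's values with f, like reduceByKey
    return {k: reduce(f, vs) for k, vs in group_by_key(pairs).items()}

print(group_by_key(pets))                       # -> {'cat': [1, 2], 'dog': [1]}
print(reduce_by_key(pets, lambda x, y: x + y))  # -> {'cat': 3, 'dog': 1}
```

Note that this sketch materializes every group first; the point of reduceByKey (and its map-side combiners) is precisely to avoid that.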


Spark for MapReduce

The MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))

Example: WordCount

// Create RDD from HDFS
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
// The “map” and “reduce” imply parallelism

Data flow:
  “to be or”  → “to” “be” “or”  → (to, 1) (be, 1) (or, 1)
  “not to be” → “not” “to” “be” → (not, 1) (to, 1) (be, 1)
  reduceByKey ⇒ (be, 2) (not, 1) (or, 1) (to, 2)
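The same data flow can be traced step by step in plain Python (illustration only; a real Spark job distributes each stage across the cluster):

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

words = [w for line in lines for w in line.split(" ")]  # flatMap
pairs = [(w, 1) for w in words]                         # map
counts = defaultdict(int)
for w, n in pairs:                                      # reduceByKey(_ + _)
    counts[w] += n

print(sorted(counts.items()))
# -> [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```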


WordCount Complete App

import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}

Example: Logistic Regression

Goal: find the best line separating two sets of points

[Figure: two classes of points, a random initial line, and the target separating line.]

Example: Logistic Regression

• An iterative classification algorithm to find a hyperplane w that best separates two sets of points
• A popular binary classifier in machine learning
• Gradient Descent:
  ○ ITERATIVELY minimizes the error by computing the gradient over all data points
  ○ The computation across data points can be parallelized
  ○ But the iterative nature is another bottleneck

Logistic Regression Algorithm

w = random(D)  // D-dimensional vector

for i from 1 to ITERATIONS do {
  // Compute gradient
  g = 0  // D-dimensional zero vector
  for every data point (yn, xn) do {  // the data set is very big!
    // xn is a vector, yn is +1 or -1
    g += yn * xn / (1 + exp(yn * w * xn))
  }
  w -= LEARNING_RATE * g
}
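A runnable Python sketch of this loop on synthetic 2-D data (the data, learning rate, and iteration count are all illustrative; the sign convention follows the serial version, which descends the logistic-loss gradient):

```python
import math
import random

random.seed(1)
D, ITERATIONS, LEARNING_RATE = 2, 50, 0.01

# synthetic points: label +1 above the line x0 + x1 = 0, -1 below
points = []
for _ in range(200):
    x = (random.uniform(-1, 1), random.uniform(-1, 1))
    y = 1 if x[0] + x[1] > 0 else -1
    points.append((x, y))

w = [0.0] * D
for _ in range(ITERATIONS):
    g = [0.0] * D
    for (x, y) in points:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        s = (1 / (1 + math.exp(-y * dot)) - 1) * y   # per-point gradient factor
        for d in range(D):
            g[d] += s * x[d]
    w = [wi - LEARNING_RATE * gi for wi, gi in zip(w, g)]

# fraction of points classified correctly by sign(w . x)
accuracy = sum(
    1 for (x, y) in points
    if (1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1) == y
) / len(points)
```

The inner loop over every data point is the expensive part; it is exactly what Spark parallelizes across partitions.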


Serial Version

// Read points from a text file
val points = readData(...)
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- points) {
    val s = (1/(1 + exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= gradient
}

Spark Version

// Read points from a text file and cache them
val points = sc.textFile(...).map(parsePoint).cache()
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  var gradient = sc.accumulator(new Vector(D))
  for (p <- points) {  // runs in parallel
    val s = (1/(1 + exp(-p.y*(w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= LEARNING_RATE * gradient
}

Spark: FP Version

// Read points from a text file and cache them
val points = sc.textFile(...).map(parsePoint).cache()
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1/(1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= LEARNING_RATE * gradient
}

Some Spark Features

• for (p <- points) {body} is equivalent to points.foreach(p => {body}), and is therefore an invocation of Spark’s parallel foreach operation
• An accumulator allows results of tasks running on the cluster to be accumulated using operators like +=
• Only the driver program can read an accumulator’s value

Logistic Regression Performance

[Chart: 127 s per iteration on Hadoop, vs. Spark: 174 s for the first iteration (loading the data) and 6 s for each further iteration (cached data).]

Example: PageRank

• Use PageRank as a Spark example
• A good example of a more complex algorithm: multiple stages of map & reduce
• Benefits from Spark’s in-memory computation: multiple iterations over the same data

Basic Idea

Give pages ranks based on the links to them
• Links from many pages → high rank
• A link from a high-rank page → high rank

PageRank Algorithm

1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page’s rank to 0.15 + 0.85 × contribs

[Figure: a four-page graph; every page starts with rank 1.0.]

[Figure: first iteration. Each page sends rank/|neighbors| (0.5 or 1.0) along its outgoing links.]

[Figure: ranks after the first iteration: 1.85, 1.0, 0.58, 0.58.]

[Figure: second iteration. Contributions of 0.29 to 1.85 flow along the links again.]

[Figure: ranks after the second iteration: 1.72, 1.31, 0.58, 0.39, and so on.]

[Figure: final state. The ranks converge to 1.44, 1.37, 0.73, 0.46.]

Spark Program (Scala)

val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile(...)
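The loop can be traced in plain Python on the four-page demo graph used later (google links to yahoo, msn, and adobe, which all link back); this only illustrates the algorithm, not Spark itself:

```python
from collections import defaultdict

links = {
    "google": ["yahoo", "msn", "adobe"],
    "yahoo":  ["google"],
    "msn":    ["google"],
    "adobe":  ["google"],
}
ranks = {url: 1.0 for url in links}

for _ in range(30):
    contribs = defaultdict(float)
    for url, neighbors in links.items():          # like join + flatMap
        for dest in neighbors:
            contribs[dest] += ranks[url] / len(neighbors)
    # like reduceByKey(_ + _) followed by mapValues(0.15 + 0.85 * _)
    ranks = {url: 0.15 + 0.85 * c for url, c in contribs.items()}
```

google, being linked from every other page, converges to the highest rank (about 1.92), while the other three settle near 0.69.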


PageRank Example

("MapR",    List("Baidu", "Blogger")),
("Baidu",   List("MapR")),
("Blogger", List("Google", "Baidu")),
("Google",  List("MapR"))

PageRank Example

val links = sc.parallelize(List(
  ("MapR",    List("Baidu", "Blogger")),
  ("Baidu",   List("MapR")),
  ("Blogger", List("Google", "Baidu")),
  ("Google",  List("MapR")))).persist()

var ranks = links.mapValues(v => 1.0)

PageRank Example

val contributions = links.join(ranks).flatMap {
  case (url, (links, rank)) =>
    links.map(dest => (dest, rank / links.size))
}

PageRank Example

ranks = contributions.reduceByKey((x, y) => x + y)
                     .mapValues(v => 0.15 + 0.85 * v)

PageRank Example

[Figure: a four-page web graph with google, msn, adobe, and yahoo; google links to the other three, which link back to it.]

Spark Program Execution

links RDD:
  (google, Array("yahoo", "msn", "adobe"))
  (yahoo,  Array("google"))
  (msn,    Array("google"))
  (adobe,  Array("google"))

ranks RDD:
  (google, 0.25)  (yahoo, 0.25)  (msn, 0.25)  (adobe, 0.25)

  ↓ join

links.join(ranks) RDD:
  (google, (Array(...), 0.25))  (msn, (Array(...), 0.25))
  (adobe,  (Array(...), 0.25))  (yahoo, (Array(...), 0.25))

  ↓ flatMap

contribs RDD:
  (yahoo, 0.0833…)  (msn, 0.0833…)  (adobe, 0.0833…)
  (google, 0.25)  (google, 0.25)  (google, 0.25)

  ↓ reduceByKey

PageRank Performance

[Chart: iteration time vs. number of machines. Hadoop: 171 s on 30 machines, 80 s on 60 machines; Spark: 23 s and 14 s respectively.]

Spark Execution


Solution: Controlled Partitioning

• Network bandwidth is ~100× as expensive as memory bandwidth
• Pre-partition the links RDD so that links for URLs with the same hash code are on the same node
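A toy Python illustration of the co-location property (3 partitions assumed; the toy hash stands in for Spark's hash partitioner):

```python
NUM_PARTITIONS = 3

def partition_of(key):
    # stand-in for hash partitioning: the same key always maps
    # to the same partition, hence the same node
    return sum(ord(c) for c in key) % NUM_PARTITIONS

# If links and ranks use the same partitioner, every join key is local:
urls = ["google", "yahoo", "msn", "adobe"]
links_home = {u: partition_of(u) for u in urls}   # node holding links for u
ranks_home = {u: partition_of(u) for u in urls}   # node holding rank for u
assert all(links_home[u] == ranks_home[u] for u in urls)  # no shuffle needed
```

Because both RDDs agree on where each URL lives, the join never has to move the large links partitions across the network.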


Controlled Partitioning


New Execution

[Execution diagram with the links RDD pre-partitioned.]

PageRank Performance

Why partitioning helps so much: the links RDD is much bigger in bytes than the ranks RDD, so keeping it in place and shipping only the small ranks data saves most of the network traffic.

PageRank Demo

• Input data: simple.dat
  google: yahoo msn adobe
  yahoo: google
  msn: google
  adobe: google

• numberIterations = 30, usePartitioner = false

PageRank Demo

• numberIterations = 30, usePartitioner = true
• numberIterations = 45, usePartitioner = false

PageRank Demo

• numberIterations = 45, usePartitioner = true

Example: Alternating Least Squares (ALS)

• ALS is used for collaborative filtering, such as predicting u users’ ratings for m movies based on their movie-rating history.
• Each user and each movie has a k-dimensional feature vector.
• A user’s rating of a movie is the dot product of the user’s feature vector and the movie’s.
• Let M be an m × k matrix and U be a k × u matrix of feature vectors; the rating matrix R can then be approximated as M × U.

ALS Algorithm

• ALS algorithm:
  1. Initialize M to a random value.
  2. Optimize U given M to minimize the error on R.
  3. Optimize M given U to minimize the error on R.
  4. Repeat steps 2 and 3 until convergence.
• All steps need R. It is helpful to make R a broadcast variable so that it is not re-sent to each node on each step.

ALS Program in Spark

val Rb = spark.broadcast(R)
for (i <- 1 to ITERATIONS) {
  U = spark.parallelize(0 until u)
           .map(j => updateUser(j, Rb, M)).collect()
  M = spark.parallelize(0 until m)
           .map(j => updateMovie(j, Rb, U)).collect()
}
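A tiny rank-1 (k = 1) ALS in plain Python on exactly rank-1 toy data (all names and values here are assumed for illustration); each half-step is the closed-form least-squares solution for one side while the other is held fixed:

```python
# movie features (m = 3) and user features (u = 2), with k = 1
M_true = [1.0, 2.0, 3.0]
U_true = [0.5, 1.5]
R = [[mi * uj for uj in U_true] for mi in M_true]  # ratings R = M x U

M = [1.0, 1.0, 1.0]  # arbitrary initialization
U = [1.0, 1.0]
for _ in range(10):
    # optimize U given M, then M given U (least squares per column/row)
    U = [sum(M[i] * R[i][j] for i in range(3)) / sum(mi * mi for mi in M)
         for j in range(2)]
    M = [sum(U[j] * R[i][j] for j in range(2)) / sum(uj * uj for uj in U)
         for i in range(3)]

error = sum((M[i] * U[j] - R[i][j]) ** 2 for i in range(3) for j in range(2))
print(error)  # converges to ~0 on rank-1 data
```

The learned M and U differ from the true factors by a scale constant, but their product reproduces R, which is all ALS requires.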


Spark Implementation Overview

• Spark runs on the Mesos cluster manager, letting it share resources with Hadoop & other apps
• Can read from any Hadoop input source (e.g., HDFS)
• Only ~6000 lines of Scala code, thanks to building on Mesos

[Figure: Spark, Hadoop, and MPI running side by side on Mesos across the cluster nodes.]

Language Integration

• Scala closures are Serializable Java objects
  ○ Serialize on the driver, load & run on workers
• Not quite enough:
  ○ Nested closures may reference the entire outer scope
  ○ May pull in non-Serializable variables not used inside
  ○ Solution: bytecode analysis + reflection
• Shared variables are implemented using a custom serialized form (e.g., a broadcast variable contains a pointer to a BitTorrent tracker)

Interactive Spark

• Modified the Scala interpreter to allow Spark to be used interactively from the command line
• Required two changes:
  ○ Modified wrapper-code generation so that each “line” typed has references to objects for its dependencies
  ○ Place generated classes in a distributed filesystem
• Enables in-memory exploration of big data

New Spark APIs

• RDDs are good but tend to be too low-level for beginners.
• The newer DataFrame (Spark 1.3) and Dataset (Spark 1.6) APIs come to the rescue.
• Spark 2.0 and later even unify DataFrame and Dataset to simplify access.
• Choose the one that fits your application.

Conclusion

• By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
• RDDs provide:
  ○ Lineage info for fault recovery and debugging
  ○ Adjustable in-memory caching
  ○ Locality-aware parallel operations
• Spark can be the basis of a suite of batch and interactive data-analysis tools

A6: Spark Exercises

• Implement the PageRank algorithm with Spark and provide suitable input to test it.
• Given a set of house-owner records in the format:
    OwnerID, HouseID, Zip, Value
  write a Spark program to compute the average house value for each zip code.


A6: Spark Exercises

• Write a Spark program to compute the inverted index of a set of documents. More specifically, given a set of (DocumentID, text) pairs, output a list of (word, (doc1, doc2, …)) pairs.

Due date: self-exercise