Lecture 09: Cluster Computing with MapReduce & Spark
TRANSCRIPT
CSIE52400/CSIEM0140 Distributed Systems
What is MapReduce?
Data-parallel programming model for clusters of commodity machines
• Designed for scalability and fault tolerance
Pioneered by Google
• Processes 20 PB of data per day
Popularized by the open-source Hadoop project
• Used by Yahoo!, Facebook, Amazon, …
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 2
What is MapReduce Used for? At Google:
• Index building for Google Search • Article clustering for Google News • Statistical machine translation
At Yahoo!: • Index building for Yahoo! Search • Spam detection for Yahoo! Mail
At Facebook: • Data mining • Ad optimization • Spam detection
What is MapReduce Used for?
In research: • Analyzing Wikipedia conflicts (PARC) • Natural language processing (CMU) • Bioinformatics (Maryland) • Particle physics (Nebraska) • Ocean climate simulation (Washington) • <Your application here>
MapReduce History The foundation stone: The Google File System
by Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung in 2003. (19th ACM Symposium on Operating Systems Principles)
The paper that started everything – MapReduce: Simplified Data Processing on Large Clusters by Jeffrey Dean and Sanjay Ghemawat in 2004. (6th Symposium on Operating System Design and Implementation)
MapReduce History Shortly after the MapReduce paper,
open source pioneers Doug Cutting and Mike Cafarella started working on a MapReduce implementation to solve the scalability problem of Nutch (an open source search engine)
Over the course of a few months, Cutting and Cafarella built up the underlying file systems and processing framework that would become Hadoop (in Java)
MapReduce History In 2006, Cutting went to work with Yahoo. They spun out the storage and processing parts
of Nutch to form Hadoop (named after Cutting’s son’s stuffed elephant).
Over time and heavy investment by Yahoo!, Hadoop eventually became a top-level Apache Foundation project.
MapReduce Today
Today, numerous independent people and organizations contribute to Hadoop.
Every new release adds functionality and boosts performance.
Several other open source projects have been built with Hadoop at their core, and the list is continually growing.
Some of the more popular ones: Pig (dataflow programming), Hive (data warehousing), HBase (NoSQL database), Mahout (machine learning), and ZooKeeper (coordination for distributed systems and services).
Why MapReduce?
Problem: lots of data!
Example: word frequencies in Web pages
This is how the world’s first search engine (Archie) worked
World’s first search engine vs. a search engine from Stanford called Google
Word Frequencies in Pages
20+ billion web pages × 20 KB = 400+ terabytes
One computer can read 30-35 MB/sec from disk
• 4 months to read the web
• 1,000+ hard drives to store the web
Even more to do something with the data:
• Compute the word frequencies for each word in each website
Basic Solution: Spread the Work over Many Machines
Same problem with 1000 machines: < 3 hours
New problems: extra programming work
• communication and coordination
• recovering from machine failure
• status reporting
• debugging
• optimization
• locality
This work repeats for every problem you want to solve
MapReduce Design Goals
1. Scalability to large data volumes:
• Scan 100 TB on 1 node @ 50 MB/s = 24 days
• Scan on a 1000-node cluster = 35 minutes
• => 1000s of machines, 10,000s of disks
2. Cost-efficiency:
• Commodity machines (cheap, but unreliable)
• Commodity network
• Automatic fault tolerance (fewer administrators)
• Easy to use (fewer programmers)
Computing Clusters
Many racks of computers, thousands of machines per cluster
Limited bisection bandwidth between racks
Typical Hadoop Cluster
Aggregation switch (10 gigabit links) above rack switches (1 gigabit links)
40 nodes/rack, up to 4000 nodes per cluster
1 Gbps bandwidth within a rack, 10 Gbps out of a rack
Node specs (Yahoo! terasort): 8 × 2 GHz cores, 8 GB RAM, 4 disks (≈ 4 TB)
Image from http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/YahooHadoopIntro-apachecon-us-2008.pdf
Hadoop Cluster
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 15
Implications of the Computing Environment
Single-thread performance doesn’t matter
• Large problems and total throughput/$ are more important than peak performance
Stuff breaks
• More nodes imply a higher probability of breaking down
“Ultra-reliable” hardware doesn’t really help
• At large scale, super-fancy reliable hardware still fails, albeit less often
• Software still needs to be fault-tolerant
• Commodity machines without fancy hardware give better performance/$
Challenges
1. Cheap nodes fail, especially if you have many
• Mean time between failures (MTBF) for 1 node = 3 years
• MTBF for 1000 nodes = 1 day
• Solution: build fault tolerance into the system
2. Commodity network = low bandwidth
• Solution: push computation to the data
3. Programming distributed systems is hard
• Solution: data-parallel programming model: users write “map” & “reduce” functions, the system distributes work and handles faults
MapReduce
A simple programming model that applies to many large-scale computing problems
Hides the messy details of distributed programs behind the MapReduce runtime library:
• Automatic parallelization
• Load balancing
• Network and disk transfer optimization
• Handling of machine failures
• Robustness
• Improvements to the core library benefit all users
MapReduce Basics
Semantics borrowed from functional programming (FP) languages
FP languages are usually stateless, which is very good for parallelism
• No need to worry about synchronization
Even without MapReduce, many programming languages such as Ruby and Python provide these semantics
• Simplicity
• Opportunities for implicit optimization
Functional Abstractions Hide Parallelism
The ideas of functions, mapping, and reducing come from functional programming languages (e.g., Lisp)
Map()
• In FP: [1, 2, 3, 4] - (*2) -> [2, 4, 6, 8]
• Processes a key/value pair to generate intermediate key/value pairs
Reduce()
• In FP: [1, 2, 3, 4] - (sum) -> 10
• Merges all intermediate values associated with the same key
Both Map and Reduce are easy to parallelize
CSIE52400/CSIEM0140 Distributed Systems Lecture 09: MapReduce & Spark
Note 11
MapReduce in Functions
Data type: key-value records
Map function:
(Kin, Vin) -> list(Kinter, Vinter)
Reduce function:
(Kinter, list(Vinter)) -> list(Kout, Vout)
Programming Model
1. Read a lot of data
2. Map: extract something you care about from each record
3. Shuffle and sort
4. Reduce: aggregate, summarize, filter, or transform
5. Write the results
The outline stays the same; map and reduce change to fit the problem
CSIE52400/CSIEM0140 Distributed Systems Lecture 09: MapReduce & Spark
Note 12
Programming Model: More Specifically
Programmer specifies two primary methods:
map(k, v) -> <k', v'>*
reduce(k', <v'>*) -> <k'', v''>*
All v' with the same k' are reduced together, in order
Can also specify:
partition(k', total partitions) -> partition for k'
■ often a simple hash of the key
■ allows reduce operations for different k' to be parallelized
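A minimal sketch of the simple-hash partitioner described above (a plain-Python illustration, not the Hadoop implementation; `zlib.crc32` is used instead of Python’s salted built-in `hash` so results are stable across runs):

```python
import zlib

def partition(key: str, num_partitions: int) -> int:
    """Assign a key to a reduce partition via a simple hash of the key."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Every occurrence of the same key lands in the same partition,
# so a single reducer sees all values for that key.
assert partition("the", 4) == partition("the", 4)
print(partition("the", 4), partition("fox", 4))
```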
The Word Count Example

def mapper(line):
    for word in line.split():     # emit (word, 1) for every word in the line
        output(word, 1)           # output() is the framework's emit function

def reducer(key, values):
    output(key, sum(values))      # sum all counts for this word
Word Count Execution
Input (three splits): “the quick brown fox” | “the fox ate the mouse” | “how now brown cow”
Map: each split goes to a mapper, which emits (word, 1) pairs, e.g. (the, 1), (quick, 1), (brown, 1), (fox, 1), …
Shuffle & Sort: intermediate pairs are grouped by key across mappers
Reduce: two reducers sum the counts per key
Output: (ate, 1), (brown, 2), (cow, 1), (fox, 2), (how, 1), (mouse, 1), (now, 1), (quick, 1), (the, 3)
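The word count execution can be simulated end-to-end in plain Python (a single-process sketch of the map -> shuffle & sort -> reduce dataflow, not a distributed implementation):

```python
from collections import defaultdict

def mapper(line):
    for word in line.split():
        yield (word, 1)                  # emit intermediate (key, value) pairs

def shuffle(pairs):
    groups = defaultdict(list)           # group intermediate values by key
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())        # sort phase: keys in order

def reducer(key, values):
    return (key, sum(values))            # merge all values for one key

splits = ["the quick brown fox", "the fox ate the mouse", "how now brown cow"]
intermediate = [pair for line in splits for pair in mapper(line)]
result = dict(reducer(k, vs) for k, vs in shuffle(intermediate))
print(result["the"], result["brown"], result["quick"])  # 3 2 1
```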
Problems with Hadoop MapReduce When doing iterative computation
○ Bad performance due to replication & disk I/O
○ Even worse: Communication overheads in the distributed file system
Problems with Hadoop MapReduce
MapReduce greatly simplified big data analysis
But when it got popular, users wanted more:
• More complex, iterative multi-pass analytics (e.g. ML, graph)
• More interactive ad-hoc queries
These require intensive disk I/O
• Intermediate data is always written to local disk, causing poor performance
• Solution: Apache Spark’s in-memory computing
Solution: Keep the Data in Memory
Apache Spark’s in-memory computing
○ 10-100X faster than disk
○ Sharing at memory speed
The Working Set Idea Peter Denning, “The Working Set Model for Program
Behavior”, Communications of the ACM, May 1968.• http://dl.acm.org/citation.cfm?id=363141
Idea: conventional programs generally exhibit a high degree of locality, returning to the same data over and over again.
Operating systems, virtual memory systems, compilers, and microarchitectures are designed around this assumption!
Exploiting this observation makes programs run 100X faster than simply using plain old main memory in the obvious way.
Spark
• Exploits the working set idea
• Fast and expressive cluster computing system, interoperable with Apache Hadoop
• Improves efficiency through:
  • In-memory computing primitives
  • General computation graphs
• Improves usability through:
  • Rich APIs in Scala, Java, Python
  • Interactive shell
• A subsystem of BDAS (Berkeley Data Analytics Stack)
Up to 100× faster (2-10× on disk); often 5× less code
Spark “on the Radar”
2008 – Yahoo! Hadoop team collaboration with Berkeley AMP/RAD Lab begins
2009 – Spark example built for Nexus (a common substrate for cluster computing) -> Mesos (a distributed systems kernel)
2011 – “Spark is 2 years ahead of anything at Google”
     – Conviva (a company for online video optimization and analytics) seeing good results with Spark
2012 – Yahoo! working with Spark/Shark (now Spark SQL)
2013 – The project was donated to the Apache Software Foundation
2014 – Spark became a top-level Apache project
2014 – Won the Daytona GraySort 100 TB benchmark
2016 – New world record on the CloudSort benchmark: sorting 100 TB for $144.22 USD (3× better than the previous record)
Spark Today
Spark 2.1.1 released on May 02, 2017
Most active open source project in Big Data.
1000+ contributors
250+ supporting organizations
Commercial support
Many success stories …
Berkeley Data Analytics Stack (BDAS)
• An open source software stack that integrates software components built by the AMPLab to make sense of Big Data
• The AMPLab was launched in January 2011
• Goal: next-generation analytics data stack for industry & research
  • Berkeley Data Analytics Stack (BDAS)
  • Released as open source
The Berkeley AMPLab
• Funding & Sponsor
• Government
• Industry
Berkeley Data Analytics Stack• Goals:
• Easy to combine batch, streaming, and interactive computations
• Easy to develop sophisticated algorithms• Compatible with existing open source ecosystem
(Hadoop/HDFS)
Berkeley Data Analytics Stack
BDAS Main Components
• Three main components:
• Mesos: a distributed systems kernel and resource manager that provides efficient resource isolation and sharing across distributed applications, or frameworks
• Tachyon: a memory-centric distributed storage system enabling reliable data sharing at memory speed across cluster frameworks
• Spark: a cluster computing system that aims to make specific kinds of computing (data analytics, ad-hoc queries) fast
Spark Stack
Spark Use Case Reference
Spark Stack
Community of Spark
Goals of Spark
• Low-latency (interactive) queries on historical data: enable faster decisions
  • E.g., identify why a site is slow and fix it
• Iterative analytics: graph processing, machine learning
  • E.g., PageRank, MaxFlow, K-Means
The Working Set Idea
Should identify which datasets to access
Load datasets into memory and use them multiple times
Keep newly created data in memory until explicitly told to store it
Master-worker architecture: the master (driver) contains the main logic, and the workers simply keep data in memory and apply functions to the distributed data
The master knows where data is located, so it can exploit locality
The driver is written in a functional programming language (Scala), which can be easily parallelized
Can use other languages: Python, Java, R, …
Approach
• Aggressive use of memory
  • Memory transfer rate >> disk transfer rate
  • Memory density (capacity) still grows with Moore’s Law
• Many datasets already fit into memory
  • The inputs of over 90% of jobs in Facebook, Yahoo!, and Bing clusters fit into memory
  • E.g., 1 TB = 1 billion records @ 1 KB each
Approach
(Microsoft’s MapReduce)
Approach
• Trade between result accuracy and response time
• Why?
  • In-memory processing does not guarantee interactive query processing
  • E.g., ~10s of seconds just to scan 512 GB of RAM!
  • The gap between memory capacity and transfer rate keeps increasing
• Trade off among response time, quality, and cost
Challenges for In-Memory Computation
● Provide distributed memory abstractions for clusters to support apps with working sets
● Retain the attractive properties of MapReduce:
○ Fault tolerance (for crashes & stragglers)
○ Data locality
○ Scalability
Challenge: Fault Tolerance
● Existing in-memory storage systems have interfaces based on fine-grained updates
○ Reads/writes to individual cells
○ E.g., databases, key-value stores, distributed shared memory
● This requires replicating data or logs for fault tolerance
○ Very inefficient & expensive at Big Data scale
Challenge: how to design a distributed memory abstraction that is both fault-tolerant and efficient?
Solution: augment the data flow model with RDDs
Spark: Runtime Architecture
● Driver: Spark program● Worker: Compute and store distributed data
Resilient Distributed Datasets (RDD)
● Distributed data abstraction
○ for in-memory computation on large clusters
● Read-only, partitioned collections of records
○ The only way to “write” is to create a new RDD
○ Partitions are scattered over the cluster
● Only coarse-grained operations are allowed
○ map, join, filter, …
○ operate on the whole dataset
(The reasons for these design choices are discussed later.)
Resilient Distributed Datasets (RDD)
● Lazy evaluation
○ Two types of operations on RDDs: transformations & actions
○ Transformations: create a new dataset from an existing one
  ■ Do not compute right away; the operation is only recorded in the lineage
  ■ Only computed when an action requires a result
○ Actions: return a value after a computation on the dataset
  ■ An action executes all operations recorded in the lineage
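Lazy evaluation can be sketched in plain Python: transformations only record an operation in a lineage list, and nothing runs until an action is called (an illustration of the idea, not Spark’s implementation):

```python
class LazyDataset:
    def __init__(self, data, lineage=None):
        self.data = data
        self.lineage = lineage or []     # operations recorded, not yet executed

    def map(self, fn):                   # transformation: just extend the lineage
        return LazyDataset(self.data, self.lineage + [("map", fn)])

    def filter(self, pred):              # transformation: just extend the lineage
        return LazyDataset(self.data, self.lineage + [("filter", pred)])

    def collect(self):                   # action: replay the whole lineage now
        records = self.data
        for op, fn in self.lineage:
            if op == "map":
                records = [fn(r) for r in records]
            else:
                records = [r for r in records if fn(r)]
        return records

ds = LazyDataset([1, 2, 3, 4]).map(lambda x: x * 2).filter(lambda x: x > 4)
# Nothing has been computed yet; only the lineage has grown:
print(len(ds.lineage))   # 2
print(ds.collect())      # [6, 8]
```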
Transformations
Immutable data
Operations on RDD
Transformations/Actions
Generality of RDDs
We conjecture that Spark’s combination of data flow with RDDs unifies many proposed cluster programming models
• General data flow models: MapReduce, Dryad, SQL
• Specialized models for stateful apps: Pregel (BSP), HaLoop (iterative MR), Continuous Bulk Processing
Instead of specialized APIs for one type of app, give users first-class control of distributed datasets
Lineage
Records of operations on the data (RDDs)
• Similar to logs
Maintained by the master node
• Centralized metadata
Lineage: Progress of Computation
● Each RDD consists of partitions
○ The detailed lineage structure is a DAG
Lineage: Lazy Evaluation
● Partitions of RDDs are not necessarily in RAM
○ Only cached partitions are preserved
(In the figure, only the dark rectangles are cached partitions)
Fault Tolerance using Lineage
● An RDD can only be created (written) from
○ stable storage, or
○ other RDDs
● Only coarse-grained operations
=> Lost partitions can be re-computed efficiently
=> Less information to maintain
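Lineage-based recovery can be illustrated in a few lines of Python: each dataset remembers only its parent and the coarse-grained function that produced it, so a lost partition is rebuilt by re-applying that function (a toy model of the mechanism, not Spark code):

```python
class RDDPartition:
    def __init__(self, parent, fn, source=None):
        self.parent, self.fn, self.source = parent, fn, source
        self.cached = None               # may be evicted or lost at any time

    def compute(self):
        if self.cached is not None:
            return self.cached
        if self.parent is None:          # base partition: re-read stable storage
            data = list(self.source)
        else:                            # derived: recompute from parent lineage
            data = [self.fn(x) for x in self.parent.compute()]
        self.cached = data
        return data

base = RDDPartition(None, None, source=[1, 2, 3])
doubled = RDDPartition(base, lambda x: x * 2)
print(doubled.compute())   # [2, 4, 6]
doubled.cached = None      # simulate losing the in-memory partition
print(doubled.compute())   # [2, 4, 6]  rebuilt from lineage, no replica needed
```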
Spark: Runtime Architecture
RDD partitions
Filter, map...
Collect...
Other Issue: Dealing with Stragglers
● Speculative execution
○ Observe the progress of the tasks of a job
○ Launch duplicates of the tasks that are slower
○ It then becomes a race between the original and the speculative copies
RDDs vs. DSM
Other Issue: Dependency
Narrow dependency: 1/N-to-1
● Execution can be pipelined
● Faster to recompute
Wide dependency: N-to-N
Other Issue: Memory Management
● Problem:
○ Some RDDs (partitions) are too large to fit in a worker’s memory
○ These RDDs are costly to re-compute
● Solution: use hard disks
○ Swap RDD partitions out under an LRU eviction policy
○ Users can set a persistence priority on RDDs
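The LRU eviction policy mentioned above can be sketched with Python’s `OrderedDict` (an illustration of the policy only; Spark’s actual memory manager is more involved, and the partition names below are made up):

```python
from collections import OrderedDict

class PartitionCache:
    """Keep at most `capacity` partitions in memory, evicting the least recently used."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)      # mark as most recently used
            return self.cache[key]
        return None                          # miss: caller must recompute via lineage

    def put(self, key, partition):
        self.cache[key] = partition
        self.cache.move_to_end(key)
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)   # evict the least recently used entry

cache = PartitionCache(2)
cache.put("rdd1-p0", [1, 2]); cache.put("rdd2-p0", [3, 4])
cache.get("rdd1-p0")                 # touch rdd1-p0, so rdd2-p0 is now the LRU
cache.put("rdd3-p0", [5, 6])         # evicts rdd2-p0
print(cache.get("rdd2-p0"))          # None -> would be recomputed from lineage
```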
Other Issue: Optimization
● Persistence
○ Users can indicate which RDDs they will reuse => save them in memory rather than recomputing them
● Partitioning
○ Utilize data locality to optimize transformations
○ Similar to the partition function in MapReduce when mapping
○ E.g., partition URLs by domain name
Programming Model
Resilient Distributed Datasets
• HDFS files, “parallelized” Scala collections
• Can be transformed with map and filter
• Can be cached across parallel operations
Parallel operations
• foreach, reduce, collect, …
Shared variables
• Accumulators (add-only)
• Broadcast variables (read-only)
Resilient Distributed Datasets
● In Spark, an RDD is represented by a Scala object. There are four ways to construct an RDD:
○ From a file in a shared filesystem, such as HDFS
○ From a Scala collection (e.g., an array)
○ By transforming an existing RDD
○ By changing the persistence of an existing RDD; RDDs by default are lazy and ephemeral
  ■ cache: hint that the data should be kept in memory after the first time it is computed
  ■ save: save the dataset to a distributed file system (HDFS)
Parallel Operations
● Several parallel operations can be performed on RDDs
○ reduce: combines dataset elements using an associative function to produce a result at the driver program
○ collect: sends all elements of the dataset to the driver program
○ foreach: passes each element through a user-provided function
Functional Programming and Statelessness
● Spark uses Scala, a functional programming language that runs on the JVM
● Recall from the MapReduce session: the stateless properties of functional programming languages are good for parallelization
● That’s why RDDs are built from these semantics
Example: Log Error Counting

To count the lines containing errors in a large log file stored in HDFS:

val file = spark.textFile("hdfs://...")
val errs = file.filter(_.contains("ERROR"))
val ones = errs.map(_ => 1)
val count = ones.reduce(_+_)

Both errs and ones are lazy RDDs that are never materialized. They can be made persistent with
val cachedErrs = errs.cache()
In Scala, _ means “the default thing that should go here.”
Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")              // base RDD
errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()                     // cached RDD

cachedMsgs.filter(_.contains("foo")).count        // parallel operation
cachedMsgs.filter(_.contains("bar")).count
. . .

(Figure: the driver sends tasks to workers, each reading a block of the file; results return to the driver, and messages are cached in each worker’s memory.)
Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)
RDDs Revisited
An RDD is an immutable, partitioned, logical collection of records
• Need not be materialized; rather, it contains the information needed to rebuild the dataset from stable storage
Partitioning can be based on a key in each record (using hash or range partitioning)
Built using bulk transformations on other RDDs
Can be cached for future reuse
RDD Fault Tolerance
RDDs maintain lineage information that can be used to reconstruct the exact lost partitions

Ex: cachedMsgs = textFile(...).filter(_.contains("error")).map(_.split('\t')(2)).cache()

Lineage chain: HdfsRDD (path: hdfs://…) -> FilteredRDD (func: contains(...)) -> MappedRDD (func: split(…)) -> CachedRDD
Benefits of the RDD Model
• Consistency is easy due to immutability
• Inexpensive fault tolerance (log lineage rather than replicating/checkpointing data)
• Locality-aware scheduling of tasks on partitions
• High performance with in-memory computation
• Despite being restricted, the model seems applicable to a broad variety of applications
Fault Recovery Test
(Chart: iteration time in seconds over 10 iterations: 119, 57, 56, 58, 58, 81, 57, 59, 57, 59. A failure happens during iteration 6, raising its time to 81 s while lost partitions are recomputed; later iterations return to normal.)
Behavior with Increasing Cache
(Chart: iteration time in seconds vs. % of the working set in cache: cache disabled = 69 s, 25% = 58 s, 50% = 41 s, 75% = 30 s, fully cached = 12 s.)
Spark in Java and Scala

Java API:

JavaRDD<String> lines = spark.textFile(…);
errors = lines.filter(new Function<String, Boolean>() {
  public Boolean call(String s) {
    return s.contains("ERROR");
  }
});
errors.count()

Scala API:

val lines = spark.textFile(…)
errors = lines.filter(s => s.contains("ERROR"))
// can also write filter(_.contains("ERROR"))
errors.count
Which Language to Use?
Standalone programs can be written in any of them, but the console supports only Python & Scala
Python developers: can stay with Python for both
Java developers: consider using Scala for the console (to learn the API)
Performance: Java/Scala will be faster (statically typed), but Python can do well for numerical work with NumPy
Languages Used with Spark
Scala Cheat Sheet

Variables:
var x: Int = 7
var x = 7          // type inferred
val y = "hi"       // read-only

Functions:
def square(x: Int): Int = x*x
def square(x: Int): Int = {
  x*x              // last line returned
}

Collections and closures:
val nums = Array(1, 2, 3)
nums.map((x: Int) => x + 2)   // => Array(3, 4, 5)
nums.map(x => x + 2)          // => same
nums.map(_ + 2)               // => same
nums.reduce((x, y) => x + y)  // => 6
nums.reduce(_ + _)            // => 6

Java interop:
import java.net.URL
new URL("http://cnn.com").openStream()

More details: scala-lang.org
Learning Spark
Spark can be installed on Windows
Use the Spark interpreter (spark-shell or pyspark) to learn
• Special Scala and Python consoles for cluster use
Runs in local mode on 1 thread by default, but this can be controlled with the MASTER environment variable:
$ MASTER=local ./spark-shell              # local, 1 thread
$ MASTER=local[2] ./spark-shell           # local, 2 threads
$ MASTER=spark://host:port ./spark-shell  # Spark standalone cluster
First Step: SparkContext
Main entry point to Spark functionality
Created for you in Spark shells as the variable sc
In standalone programs, you make your own with:

val sc = new SparkContext(master, appName, [sparkHome], [jars])
// or
val sc = new SparkContext(conf)

Parameters: cluster URL (or local / local[N]), app name, Spark install path on the cluster, and a list of JARs with app code (to ship)
Creating RDDs

// Turn a local collection into an RDD
val data = Array(1, 2, 3, 4, 5)
val distData = sc.parallelize(data)
// sc.parallelize(Array(1, 2, 3, 4))

// Load a text file from local FS, HDFS, or S3
val distFile = sc.textFile("data.txt")
// sc.textFile("directory/*.txt")
// sc.textFile("hdfs://namenode:9000/path/file")

// Use any existing Hadoop InputFormat
sc.hadoopFile(keyClass, valClass, inputFmt, conf)
Basic Transformations

val nums = sc.parallelize(Array(1, 2, 3))

// Pass each element through a function
val squares = nums.map(x => x*x)            // => {1, 4, 9}

// Keep elements passing a predicate
val even = squares.filter(x => x % 2 == 0)  // => {4}

// Map each element to zero or more others
nums.flatMap(x => 0 to x-1)                 // => {0, 0, 1, 0, 1, 2}
// (0 to x-1 is the sequence of numbers 0, 1, …, x-1)
Basic Actions (in Python)

nums = sc.parallelize([1, 2, 3])

# Retrieve RDD contents as a local collection
nums.collect()  # => [1, 2, 3]

# Return first K elements
nums.take(2)    # => [1, 2]

# Count number of elements
nums.count()    # => 3

# Merge elements with an associative function
nums.reduce(lambda x, y: x + y)  # => 6

# Write elements to a text file
nums.saveAsTextFile("hdfs://file.txt")
Working with Key-Value Pairs
Spark’s “distributed reduce” transformations act on RDDs of key-value pairs

Python: pair = (a, b)
        pair[0]  # => a
        pair[1]  # => b

Scala:  val pair = (a, b)
        pair._1  // => a
        pair._2  // => b

Java:   Tuple2 pair = new Tuple2(a, b);  // scala.Tuple2
        pair._1  // => a
        pair._2  // => b
Some Key-Value Operations

pets = sc.parallelize([("cat", 1), ("dog", 1), ("cat", 2)])

pets.reduceByKey(lambda x, y: x + y)
# => {(cat, 3), (dog, 1)}

pets.groupByKey()
# => {(cat, Seq(1, 2)), (dog, Seq(1))}

pets.sortByKey()
# => {(cat, 1), (cat, 2), (dog, 1)}

reduceByKey also automatically implements combiners on the map side
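The semantics of reduceByKey and groupByKey can be reproduced with plain Python dictionaries (an illustration only; it ignores partitioning and the map-side combining mentioned above):

```python
from collections import defaultdict

def group_by_key(pairs):
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)              # collect all values per key
    return dict(groups)

def reduce_by_key(fn, pairs):
    out = {}
    for k, v in pairs:
        out[k] = fn(out[k], v) if k in out else v   # fold values per key
    return out

pets = [("cat", 1), ("dog", 1), ("cat", 2)]
print(reduce_by_key(lambda x, y: x + y, pets))  # {'cat': 3, 'dog': 1}
print(group_by_key(pets))                       # {'cat': [1, 2], 'dog': [1]}
print(sorted(pets))                             # [('cat', 1), ('cat', 2), ('dog', 1)]
```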
Spark for MapReduce

The MapReduce data flow can be expressed using RDD transformations:

res = data.flatMap(rec => myMapFunc(rec))
          .groupByKey()
          .map((key, vals) => myReduceFunc(key, vals))

Or with combiners:

res = data.flatMap(rec => myMapFunc(rec))
          .reduceByKey(myCombiner)
          .map((key, val) => myReduceFunc(key, val))
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 89
Example: WordCount
// Create RDD from HDFS
val file = sc.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
// The "map" and "reduce" imply parallelism

Data flow (two input lines):
  "to be or"  → "to","be","or"  → (to,1),(be,1),(or,1)
  "not to be" → "not","to","be" → (not,1),(to,1),(be,1)
After the shuffle and reduce: (be,2),(not,1) and (or,1),(to,2)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 90
WordCount Complete App

import spark.SparkContext
import spark.SparkContext._

object WordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext("local", "WordCount", args(0), Seq(args(1)))
    val lines = sc.textFile(args(2))
    lines.flatMap(_.split(" "))
         .map(word => (word, 1))
         .reduceByKey(_ + _)
         .saveAsTextFile(args(3))
  }
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 91
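The WordCount dataflow can be traced in plain Python to see what each stage produces (a sketch of the semantics only; real Spark distributes every stage across the cluster):

```python
from collections import defaultdict

lines = ["to be or", "not to be"]

# flatMap: split each line into words
words = [w for line in lines for w in line.split(" ")]

# map: emit a (word, 1) pair for each word
pairs = [(w, 1) for w in words]

# shuffle + reduceByKey: sum the counts per word
counts = defaultdict(int)
for w, n in pairs:
    counts[w] += n

print(dict(counts))  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```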
Example: Logistic Regression
Goal: find best line separating two sets of points
(Figure: two point sets in the plane; a random initial line is iteratively adjusted toward the target separating line)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 92
Example: Logistic Regression

An iterative classification algorithm to find a hyperplane w that best separates two sets of points.

A popular binary classifier in machine learning.

Gradient Descent
○ ITERATIVELY minimizes the error by computing the gradient over all data points
○ The computation over data points can be parallelized
○ But the iterative nature is another bottleneck
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 93
Logistic Regression Algorithm

w = random(D)  // D-dimensional vector
for i from 1 to ITERATIONS do {
  // Compute gradient
  g = 0  // D-dimensional zero vector
  for every data point (yn, xn) do {  // the set of data points is very big!
    // xn is a vector, yn is +1 or -1
    g += yn * xn / (1 + exp(yn * w * xn))
  }
  w -= LEARNING_RATE * g
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 94
Serial Version

// Read points from a text file
val points = readData(...)
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  var gradient = Vector.zeros(D)
  for (p <- points) {
    val s = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= gradient
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 95
Spark Version

// Read points from a text file and cache them
val points = sc.textFile(...).map(parsePoint).cache()
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  var gradient = sc.accumulator(new Vector(D))
  for (p <- points) {  // Run in parallel
    val s = (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
    gradient += s * p.x
  }
  w -= LEARNING_RATE * gradient
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 96
Spark: FP Version

// Read points from a text file and cache them
val points = sc.textFile(...).map(parsePoint).cache()
// Initialize w to a random D-dimensional vector
var w = Vector.random(D)
// Run multiple iterations to update w
for (i <- 1 to ITERATIONS) {
  val gradient = points.map(p =>
    (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
  ).reduce(_ + _)
  w -= LEARNING_RATE * gradient
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 97
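The gradient-descent loop above can be checked with a small pure-Python sketch. The 1-D data set, `iterations`, and `learning_rate` below are made-up illustrations, and w starts at zero instead of a random vector so the run is deterministic:

```python
import math

def train(points, D, iterations=100, learning_rate=0.1):
    """Batch gradient descent with the slide's update: s = (1/(1+e^{-y*(w.x)}) - 1) * y."""
    w = [0.0] * D                      # deterministic start instead of Vector.random(D)
    for _ in range(iterations):
        gradient = [0.0] * D
        for y, x in points:            # y is +1 or -1, x is a D-dimensional point
            dot = sum(wi * xi for wi, xi in zip(w, x))
            s = (1.0 / (1.0 + math.exp(-y * dot)) - 1.0) * y
            for d in range(D):
                gradient[d] += s * x[d]
        for d in range(D):
            w[d] -= learning_rate * gradient[d]
    return w

# Two linearly separable 1-D clusters (made-up data)
pts = [(+1, [2.0]), (+1, [3.0]), (-1, [-2.0]), (-1, [-3.0])]
w = train(pts, D=1)
# Every point ends up on the correct side of the learned hyperplane
assert all(y * sum(wi * xi for wi, xi in zip(w, x)) > 0 for y, x in pts)
```

The inner loop over points is exactly the part Spark parallelizes; the outer loop over iterations is the sequential bottleneck that caching the points RDD makes cheap.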
Some Spark Features

for (p <- points) { body } is equivalent to points.foreach(p => { body }), and is therefore an invocation of Spark's parallel foreach operation

An accumulator allows results of tasks running on the cluster to be combined using operators like +=

Only the driver program can read an accumulator's value
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 98
Logistic Regression Performance

(Figure: Hadoop takes 127 s per iteration; with Spark the first iteration takes 174 s, but further iterations take only 6 s each thanks to in-memory caching)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 99
Example: PageRank
• Use PageRank as a Spark example
• Good example of a more complex algorithm
  • Multiple stages of map & reduce
• Benefits from Spark's in-memory computation
  • Multiple iterations over the same data
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 100
Basic Idea
Give pages ranks based on links to them
• Links from many pages -> high rank
• Links from a high-rank page -> high rank
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 101
PageRank Algorithm
(Figure: four-page link graph, every page starting at rank 1.0)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 102
1. Start each page at a rank of 1
2. On each iteration, have page p contribute rank_p / |neighbors_p| to its neighbors
3. Set each page's rank to 0.15 + 0.85 × contribs
PageRank Algorithm
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 103
(Figure: iteration 1 — each page, at rank 1.0, sends a contribution of rank/|neighbors| (1 or 0.5) along each out-link)
PageRank Algorithm
(Figure: ranks after iteration 1 — 0.58, 1.0, 1.85, 0.58)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 104
PageRank Algorithm
(Figure: iteration 2 — pages at ranks 0.58, 1.0, 1.85, 0.58 send contributions of 0.58, 0.29, 0.29, 0.5, 1.85, 0.5 along their out-links)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 105
PageRank Algorithm
(Figure: ranks after iteration 2 — 0.39, 1.72, 1.31, 0.58, and so on …)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 106
PageRank Algorithm
Final state:
(Figure: converged ranks — 0.46, 1.37, 1.44, 0.73)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 107
Spark Program (Scala)
val links = // RDD of (url, neighbors) pairs
var ranks = // RDD of (url, rank) pairs

for (i <- 1 to ITERATIONS) {
  val contribs = links.join(ranks).flatMap {
    case (url, (links, rank)) =>
      links.map(dest => (dest, rank / links.size))
  }
  ranks = contribs.reduceByKey(_ + _)
                  .mapValues(0.15 + 0.85 * _)
}
ranks.saveAsTextFile(...)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 108
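The loop above can be mirrored in plain Python to check what ranks it computes on a small graph; the four-page link data below is made up for illustration:

```python
def pagerank(links, iterations=20):
    """links: dict url -> list of neighbor urls. Mirrors the slide's update rule."""
    ranks = {url: 1.0 for url in links}
    for _ in range(iterations):
        contribs = {url: 0.0 for url in links}
        for url, neighbors in links.items():        # join of links with ranks
            for dest in neighbors:                  # flatMap: send rank/|neighbors|
                contribs[dest] += ranks[url] / len(neighbors)
        # reduceByKey (the sums above) + mapValues(0.15 + 0.85 * _)
        ranks = {url: 0.15 + 0.85 * c for url, c in contribs.items()}
    return ranks

links = {"google": ["yahoo", "msn", "adobe"],
         "yahoo": ["google"], "msn": ["google"], "adobe": ["google"]}
r = pagerank(links)
# google is linked by three pages, so it ends up with the highest rank
assert r["google"] > r["yahoo"] == r["msn"] == r["adobe"]
```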
PageRank Example
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 109
("MapR",    List("Baidu", "Blogger")),
("Baidu",   List("MapR")),
("Blogger", List("Google", "Baidu")),
("Google",  List("MapR"))
PageRank Example
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 110
val links = sc.parallelize(List(
  ("MapR",    List("Baidu", "Blogger")),
  ("Baidu",   List("MapR")),
  ("Blogger", List("Google", "Baidu")),
  ("Google",  List("MapR")))).persist()

var ranks = links.mapValues(v => 1.0)
PageRank Example
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 111
val contributions = links.join(ranks).flatMap {
  case (url, (links, rank)) =>
    links.map(dest => (dest, rank / links.size))
}
PageRank Example
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 112
val ranks = contributions.reduceByKey((x, y) => x + y)
                         .mapValues(v => 0.15 + 0.85 * v)
PageRank Example
(Figure: link graph over google, msn, adobe, and yahoo)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 113
Spark Program Execution

links-RDD:
  (google, [Ljava.lang.String;@1771f11)
  (yahoo,  [Ljava.lang.String;@19897e4)
  (msn,    [Ljava.lang.String;@11c228f)
  (adobe,  [Ljava.lang.String;@20f065)

ranks-RDD:
  (google, 0.25)  (yahoo, 0.25)  (msn, 0.25)  (adobe, 0.25)

(join)

links.join(ranks)-RDD:
  (google, ([Ljava.lang.String;@df1177,  0.25))
  (msn,    ([Ljava.lang.String;@f3b9d3,  0.25))
  (adobe,  ([Ljava.lang.String;@12cd143, 0.25))
  (yahoo,  ([Ljava.lang.String;@15e8963, 0.25))

(flatMap)

contribs-RDD:
  (yahoo,  0.08333333333333333)
  (msn,    0.08333333333333333)
  (adobe,  0.08333333333333333)
  (google, 0.25)  (google, 0.25)  (google, 0.25)

(reduceByKey → the next iteration's ranks)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 114
PageRank Performance
Iteration time (s) vs. number of machines:

              30 machines    60 machines
  Hadoop          171             80
  Spark            23             14

CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 115
Spark Execution
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 116
Solution: Controlled Partitioning

• Network bandwidth is ~100× as expensive as memory bandwidth
• Pre-partition the links RDD so that links for URLs with the same hash code are on the same node
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 117
Controlled Partitioning
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 118
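Why pre-partitioning helps can be sketched in plain Python: if links and ranks are split with the same hash partitioner, every join key lands in a single partition, so the join needs no cross-network shuffle. The node count and data below are illustrative:

```python
import zlib

def partition_of(key, num_nodes):
    """Deterministic hash partitioner: the same URL always lands on the same node."""
    return zlib.crc32(key.encode()) % num_nodes

def partition(pairs, num_nodes):
    """Split (key, value) pairs into num_nodes partitions by key hash."""
    parts = [[] for _ in range(num_nodes)]
    for k, v in pairs:
        parts[partition_of(k, num_nodes)].append((k, v))
    return parts

links = [("google", ["yahoo"]), ("yahoo", ["google"]), ("msn", ["google"])]
ranks = [("google", 1.0), ("yahoo", 1.0), ("msn", 1.0)]

link_parts = partition(links, 4)
rank_parts = partition(ranks, 4)

# Each key's link list and rank sit in the same partition, so every node
# can join its own slice locally -- no data crosses the network.
for i in range(4):
    assert {k for k, _ in link_parts[i]} == {k for k, _ in rank_parts[i]}
```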
New Execution
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 119
PageRank Performance
Why it helps so much: the links RDD is far bigger in bytes than the ranks RDD, so it pays to keep links in place and move only ranks
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 120
PageRank Demo
• Input data: simple.dat
    google: yahoo msn adobe
    yahoo:  google
    msn:    google
    adobe:  google
• numberIterations = 30, usePartitioner = false
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 121
PageRank Demo• numberIterations = 30, usePartitioner = true
• numberIterations = 45, usePartitioner = false
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 122
PageRank Demo
numberIterations = 45, usePartitioner = true
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 123
Example: Alternating Least Squares (ALS)

ALS is used for collaborative filtering, such as predicting u users' ratings for m movies based on their movie-rating history.

Each user and each movie has a k-dimensional feature vector. A user's rating for a movie is the dot product of the user's feature vector with the movie's.

Let M be an m × k matrix and U be a k × u matrix of feature vectors; the rating matrix R can then be represented as M × U.
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 124
ALS Algorithm

ALS algorithm:
1. Initialize M to a random value.
2. Optimize U given M to minimize error on R.
3. Optimize M given U to minimize error on R.
4. Repeat steps 2 and 3 until convergence.

All steps need R. It is helpful to make R a broadcast variable so that it is not re-sent to each node on each step.
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 125
ALS Program in Spark
val Rb = spark.broadcast(R)
for (i <- 1 to ITERATIONS) {
  U = spark.parallelize(0 until u)
           .map(j => updateUser(j, Rb, M))
           .collect()
  M = spark.parallelize(0 until m)
           .map(j => updateMovie(j, Rb, U))
           .collect()
}
}
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 126
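For intuition, the alternating updates can be sketched in pure Python for the rank-1 case (k = 1), where each least-squares step has a closed form; the 2 × 2 rating matrix below is made up, and the fixed (non-random) initialization keeps the run deterministic:

```python
def als_rank1(R, iterations=20):
    """Alternating least squares for R ≈ m · uᵀ with scalar features (k = 1)."""
    m_rows, u_cols = len(R), len(R[0])
    M = [1.0] * m_rows     # step 1: fixed init instead of random, for determinism
    U = [0.0] * u_cols
    for _ in range(iterations):
        # step 2: optimize U given M (closed-form least squares per column of R)
        U = [sum(M[i] * R[i][j] for i in range(m_rows)) /
             sum(M[i] ** 2 for i in range(m_rows)) for j in range(u_cols)]
        # step 3: optimize M given U (per row of R)
        M = [sum(U[j] * R[i][j] for j in range(u_cols)) /
             sum(U[j] ** 2 for j in range(u_cols)) for i in range(m_rows)]
    return M, U

R = [[2.0, 4.0], [1.0, 2.0]]   # rank-1 matrix: second row = 0.5 × first row
M, U = als_rank1(R)
for i in range(2):
    for j in range(2):
        assert abs(M[i] * U[j] - R[i][j]) < 1e-6   # M·Uᵀ reconstructs R
```

In the Spark version, each `updateUser`/`updateMovie` call solves one of these per-row/per-column least-squares problems, and broadcasting R avoids shipping the whole matrix with every task.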
Spark Implementation Overview

Spark runs on the Mesos cluster manager, letting it share resources with Hadoop & other apps

Can read from any Hadoop input source (e.g. HDFS)

~6000 lines of Scala code, thanks to building on Mesos

(Figure: Spark, Hadoop, and MPI frameworks running side by side on Mesos across the cluster nodes)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 127
Language Integration

Scala closures are Serializable Java objects
• Serialize on driver, load & run on workers

Not quite enough
• Nested closures may reference the entire outer scope
• May pull in non-Serializable variables not used inside
• Solution: bytecode analysis + reflection

Shared variables are implemented using a custom serialized form (e.g. a broadcast variable contains a pointer to a BitTorrent tracker)
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 128
Interactive Spark

Modified the Scala interpreter to allow Spark to be used interactively from the command line

Required two changes:
• Modified wrapper-code generation so that each "line" typed has references to objects for its dependencies
• Place generated classes in a distributed filesystem

Enables in-memory exploration of big data
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 129
New Spark APIs
RDDs are good but tend to be too low-level for beginners.

The newer DataFrame (Spark 1.3) and Dataset (Spark 1.6) APIs come to the rescue.

Spark 2.0 and later even unify DataFrame and Dataset to simplify access.

Choose the one that fits your application.
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 130
Conclusion

By making distributed datasets a first-class primitive, Spark provides a simple, efficient programming model for stateful data analytics
RDDs provide:
• Lineage info for fault recovery and debugging
• Adjustable in-memory caching
• Locality-aware parallel operations
Spark can be the basis of a suite of batch and interactive data analysis tools
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 131
A6: Spark Exercises

Implement the PageRank algorithm with Spark and provide suitable input to test it.

Given a set of house-owner records in the format:
  OwnerID, HouseID, Zip, Value
write a Spark program to compute the average house value for each zip code.
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 132
A6: Spark Exercises
Write a Spark program to compute the inverted index of a set of documents. More specifically, given a set of (DocumentID, text) pairs, output a list of (word, (doc1, doc2, …)) pairs.
Due date: none — these are self-exercises
CSIE52400/CSIEM0140 Distributed Systems MapReduce & Spark 133