Spark: Cluster Computing with Working Sets -- Aaron 2013/03/28


Page 1:

Spark: Cluster Computing with Working Sets

--Aaron 2013/03/28

Page 2:

Overview

• Motivation
• Spark
• Resilient Distributed Datasets (RDD)
• Shared Variables
• Experiments

Page 3:

Motivation

Page 4:

Hadoop

• Advantages: scalability, fault tolerance

• Disadvantage: acyclic data flow, which makes it unsuitable for iterative algorithms

Page 5:

Disadvantage of Hadoop (1)

• Iterative jobs: Many common machine learning algorithms apply a function repeatedly to the same dataset to optimize a parameter.

• Interactive analytics: When querying large datasets interactively, a user should be able to load a dataset of interest into memory across a number of machines and query it repeatedly.

Page 6:

Disadvantage of Hadoop (2)

• In most current frameworks, the only way to reuse data between computations (e.g., between two MapReduce jobs) is to write it to an external stable storage system, e.g., a distributed file system.

• This incurs substantial overheads due to data replication, disk I/O, and serialization, which can dominate application execution times.

Page 7:

Spark

Page 8:

Programming Model

• Driver program: implements the high-level control flow of the application and launches various operations in parallel (see the skeleton after this list).

• Resilient Distributed Datasets (RDD)

• Parallel operations: invoked by passing a function to apply on a dataset.

• Shared variables: can be used in functions running on the cluster.
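The pieces above fit together in a driver program. Below is a minimal sketch, assuming the current SparkConf-based API (the original paper passed the master URL and application name straight to the SparkContext constructor); the application name and master URL are placeholders.

import org.apache.spark.{SparkConf, SparkContext}

object ExampleDriver {
  def main(args: Array[String]): Unit = {
    // The driver program owns the SparkContext and implements the control flow.
    val conf = new SparkConf().setAppName("spark-example").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Define RDDs, launch parallel operations, and create shared variables here.

    sc.stop()
  }
}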

Page 9:

Framework of Spark

Page 10:

Resilient Distributed Datasets (RDD)

Page 11:

Resilient Distributed Datasets

• A resilient distributed dataset (RDD) is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost.

• The elements of an RDD need not exist in physical storage; instead, a handle to an RDD contains enough information to compute the RDD starting from data in reliable storage. This means that RDDs can always be reconstructed if nodes fail.

Page 12:

Changing the Persistence of an RDD

• By default, RDDs are lazy and ephemeral.

• Users can alter the persistence of an RDD through two actions (sketched below):

– Cache action: hints that the RDD should be kept in memory after the first time it is computed, because it will be reused.

– Save action: evaluates the dataset and writes it to a distributed filesystem such as HDFS.
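A minimal sketch of both actions, assuming the SparkContext sc from the driver sketch above; the HDFS paths are placeholders, and saveAsTextFile is used where the paper calls the action save.

// Build an RDD lazily; nothing is computed yet.
val errors = sc.textFile("hdfs://namenode/logs")
  .filter(line => line.contains("ERROR"))

// Cache action: hint that the dataset should stay in memory once computed.
val cachedErrors = errors.cache()

// Save action: force evaluation and write the result to HDFS.
cachedErrors.saveAsTextFile("hdfs://namenode/output/errors")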

Page 13:

Operations on RDD (1)

• Transformations (create an RDD): operations on either (1) data in stable storage or (2) other RDDs.

• Examples of transformations include map, filter, and join.
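A brief sketch of these transformations, again assuming sc; the input path, record format, and the small ranks dataset are made up for illustration.

// Transformations build new RDDs lazily, from stable storage or from other RDDs.
val lines = sc.textFile("hdfs://namenode/pages")                  // placeholder input
val sizes = lines.map(line => (line.split("\t")(0), line.length)) // map: (url, page size)
val large = sizes.filter { case (_, size) => size > 1000 }        // filter: keep big pages
val ranks = sc.parallelize(Seq(("a.com", 0.9), ("b.com", 0.4)))   // small illustrative RDD
val joined = large.join(ranks)                                    // join: (url, (size, rank))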

Page 14:

Operations on RDD (2)

• Actions: operations that return a value to the application or export data to a storage system.

• Examples of actions include count, collect, and save.
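Continuing the sketch above, the corresponding actions (saveAsTextFile stands in for the paper's save):

val n = joined.count()                              // returns a value to the driver
val local = joined.collect()                        // ships every element to the driver
joined.saveAsTextFile("hdfs://namenode/out/joined") // exports data to a storage system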

Page 15:

Operations on RDD (3)

• Programmers can call a persist method to indicate which RDDs they want to reuse in future operations.

• Spark keeps persistent RDDs in memory by default, but it can spill them to disk if there is not enough RAM.

• Users can set a persistence priority on each RDD to specify which in-memory data should spill to disk first.
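A sketch of explicit persistence. Note that released Spark versions expose storage levels rather than the numeric spill priority described here; MEMORY_AND_DISK below gives the spill-to-disk behaviour.

import org.apache.spark.storage.StorageLevel

// Keep the RDD in memory, spilling partitions to disk if RAM runs short.
val persisted = joined.persist(StorageLevel.MEMORY_AND_DISK)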

Page 16:

Example 1 of RDD

• These datasets will be stored as a chain of objects capturing the lineage of each RDD. Each dataset object contains a pointer to its parent and information about how the parent was transformed.
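The slide's example itself is not reproduced in the transcript. As a stand-in, here is the log-mining example from the Spark paper, which builds exactly this kind of lineage chain; the comments name the dataset objects roughly as the paper's figure does, and the path is a placeholder.

val file = sc.textFile("hdfs://namenode/logs")   // HdfsTextFile
val errs = file.filter(_.contains("ERROR"))      // FilteredDataset, parent: file
val cachedErrs = errs.cache()                    // CachedDataset, parent: errs
val ones = cachedErrs.map(_ => 1)                // MappedDataset, parent: cachedErrs
val count = ones.reduce(_ + _)                   // action: number of ERROR lines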

Page 17:

Lineage Chain of Example 1

Page 18:

Example 2 of RDD

Page 19:

Lineage Chain of Example 2

Page 20:

RDD Objects

• Each RDD object implements the same simple interface, which consists of five pieces of information (sketched below):
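A sketch of that common interface, with names following the published RDD papers; exact signatures in any given Spark release differ, and the helper types here are minimal stand-ins so the sketch is self-contained.

trait Partition { def index: Int }
trait Dependency[T]
trait Partitioner

// The five pieces of information every RDD exposes.
abstract class SketchRDD[T] {
  def partitions: Seq[Partition]                    // 1. a set of partitions
  def dependencies: Seq[Dependency[_]]              // 2. dependencies on parent RDDs
  def iterator(p: Partition): Iterator[T]           // 3. how to compute a partition's elements
  def preferredLocations(p: Partition): Seq[String] // 4. placement hints (e.g. HDFS block hosts)
  def partitioner: Option[Partitioner]              // 5. metadata on hash/range partitioning
}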

Page 21:

Examples of RDD Operations

• HDFS files (lines):

– partitions(): returns one partition for each block of the file (with the block's offset stored in each Partition object)

– preferredLocations: gives the nodes the block is on

– iterator: reads the block

Page 22:

Dependencies between RDDs (1)

• Narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD (1:1). Map leads to a narrow dependency.

• Wide dependencies: multiple child partitions may depend on a single partition of the parent RDD (1:N). Join leads to wide dependencies.

Page 23:
Page 24:

Dependencies between RDDs (2)

• Narrow dependencies allow for pipelined execution on one cluster node, which can compute all the parent partitions. For example, one can apply a map followed by a filter on an element-by-element basis.

• Wide dependencies require data from all parent partitions to be available and to be shuffled across the nodes using a MapReduce-like operation.

• Recovery after a node failure is more efficient with narrow dependencies than with wide dependencies, since only the lost parent partitions need to be recomputed (and they can be recomputed in parallel).
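An illustrative sketch of the two dependency types (the data is made up): map keeps each output partition tied to a single input partition, while join generally shuffles records from many parent partitions.

val clicks = sc.parallelize(Seq(("a.com", 1), ("b.com", 1), ("a.com", 1)))
val ranks = sc.parallelize(Seq(("a.com", 0.9), ("b.com", 0.4)))

// Narrow dependency: each output partition depends on exactly one parent partition.
val weighted = clicks.map { case (url, n) => (url, n * 2) }

// Wide dependency: join shuffles data so that equal keys from both parents meet.
val clickRanks = weighted.join(ranks)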

Page 25:

Parallel Operations

• reduce: Combines dataset elements using an associative function to produce a result at the driver program.

• collect: Sends all elements of the dataset to the driver program.
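A small self-contained illustration of both operations (the numbers are made up):

val nums = sc.parallelize(1 to 1000)

val sumOfSquares = nums.map(x => x * x).reduce(_ + _) // combined with an associative function at the driver
val multiples = nums.filter(_ % 100 == 0).collect()   // all matching elements sent to the driver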

Page 26:

Shared Variables

Page 27:

Shared Variables

• Programmers invoke operations like map, filter, and reduce by passing closures (functions) to Spark. These closures can refer to variables in the scope where they are created; normally, when Spark runs a closure on a worker node, these variables are copied to the worker.

• However, Spark also lets programmers create two restricted types of shared variables to support two simple but common usage patterns.

Page 28:

Broadcast Variables

• When one creates a broadcast variable b with a value v, v is saved to a file in a shared file system. The serialized form of b is a path to this file. When b’s value is queried on a worker node, Spark first checks whether v is in a local cache, and reads it from the file system if it isn’t.
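A minimal usage sketch with sc.broadcast (the lookup table is made up):

// Ship a read-only lookup table to the workers once, instead of inside every closure.
val rankTable = Map("a.com" -> 0.9, "b.com" -> 0.4)
val bRanks = sc.broadcast(rankTable)

val scored = sc.parallelize(Seq("a.com", "b.com", "c.com"))
  .map(url => (url, bRanks.value.getOrElse(url, 0.0)))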

Page 29:

Accumulators

• Each accumulator is given a unique ID when it is created. When the accumulator is saved, its serialized form contains its ID and the “zero” value for its type.

• On the workers, a separate copy of the accumulator is created for each thread that runs a task using thread-local variables, and is reset to zero when a task begins. After each task runs, the worker sends a message to the driver program containing the updates it made to various accumulators.
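A usage sketch with the current longAccumulator API (the paper-era API was sc.accumulator(initialValue)); the data and record format are made up.

// Count malformed records on the side while running the main computation.
val badRecords = sc.longAccumulator("badRecords")

val parsed = sc.parallelize(Seq("1", "2", "oops", "4")).flatMap { s =>
  try Some(s.toInt)
  catch { case _: NumberFormatException => badRecords.add(1); None }
}

parsed.count()            // run an action so the tasks actually execute
println(badRecords.value) // per-task updates have been merged at the driver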

Page 30:

Experiments

Page 31:

Results (1)

• Dataset: 29 GB

• Environment: 20 "m1.xlarge" EC2 nodes with 4 cores each

• Algorithm: logistic regression (sketched below)
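A sketch of the logistic regression used in this experiment, in the style of the paper's example; the input path, record format, feature count D, and iteration count are placeholders.

import scala.util.Random

case class DataPoint(x: Array[Double], y: Double) // y is +1 or -1

def parse(line: String): DataPoint = {            // assumed format: "y x1 x2 ... xD"
  val v = line.split(' ').map(_.toDouble)
  DataPoint(v.tail, v.head)
}

val D = 10
val points = sc.textFile("hdfs://namenode/points").map(parse).cache() // reused every iteration

var w = Array.fill(D)(2 * Random.nextDouble() - 1) // random initial weights

for (_ <- 1 to 10) {
  val grad = points.map { p =>
    val dot = w.zip(p.x).map { case (wi, xi) => wi * xi }.sum
    val s = (1.0 / (1.0 + math.exp(-p.y * dot)) - 1.0) * p.y
    p.x.map(_ * s)
  }.reduce((a, b) => a.zip(b).map { case (ai, bi) => ai + bi })
  w = w.zip(grad).map { case (wi, gi) => wi - gi }
}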

Page 32:

Results (2)

Page 33:

Results (3)

Page 34:

Thank You!!!