Source: ranger.uta.edu/~sjiang/CSE6350-spring-18/13-spark-report.pdf

Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Q&A Report by Xiaofeng Wu (Student ID: 1001556282)

Summary This paper proposes RDDs, a new abstraction for large-scale data analytics. A Resilient Distributed Dataset (RDD), implemented in the Spark framework, is an abstraction of an immutable, partitioned collection of elements that can be operated on in parallel.

It provides an efficient, general-purpose, and fault-tolerant abstraction for sharing data in cluster applications. Coarse-grained transformations allow lost data to be recovered efficiently via lineage, and the abstraction is expressive enough for a wide range of parallel applications, such as iterative computations and interactive queries.

Current large-scale data analytics frameworks, such as MapReduce and Dryad, are not well suited to iterative algorithms or interactive data analysis tools. This is because sharing data between computations requires writing it to stable storage, which incurs substantial overheads from disk I/O, network bandwidth, serialization, and data replication.

Existing solutions to this problem are not expressive enough for most workloads:

1. Specialized frameworks, such as Pregel (a system for iterative graph computations that keeps intermediate data in memory) and HaLoop (which offers an iterative MapReduce interface). They do not provide abstractions for more general reuse, e.g., letting a user load several datasets into memory and run ad-hoc queries across them.

2. Existing in-memory storage on clusters, e.g., distributed shared memory, key-value stores, databases, and Piccolo. These offer only a low-level programming interface: (1) just reads and updates to table cells; (2) providing fault tolerance requires copying large amounts of data over the cluster network; (3) they must either replicate too much data across machines or log updates across machines.

Resilient distributed datasets are represented as objects that can be persisted in memory or on disk. The RDD class contains the basic operations, such as map, filter, and persist. Operations on RDDs are (1) transformations, which build new RDDs, and (2) actions, which compute and return results.
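The transformation/action split can be illustrated with a minimal, hypothetical sketch in plain Python (this is not Spark's actual implementation; the MiniRDD class and its methods are invented for illustration): transformations only record work to be done, while actions force the computation.

```python
# Hypothetical sketch of lazy transformations vs. eager actions (not Spark code).
class MiniRDD:
    def __init__(self, compute):
        self._compute = compute            # thunk that produces the elements

    @staticmethod
    def from_list(data):
        return MiniRDD(lambda: list(data))

    # Transformations: build a new MiniRDD; nothing is computed yet.
    def map(self, f):
        return MiniRDD(lambda: [f(x) for x in self._compute()])

    def filter(self, p):
        return MiniRDD(lambda: [x for x in self._compute() if p(x)])

    # Actions: force evaluation and return a result to the caller.
    def collect(self):
        return self._compute()

    def count(self):
        return len(self._compute())

lines = MiniRDD.from_list(["ERROR a", "INFO b", "ERROR c"])
errors = lines.filter(lambda l: l.startswith("ERROR"))   # transformation: lazy
print(errors.count())                                    # action: computes -> 2
```

In real Spark the same division holds: map and filter return new RDDs immediately, and only an action such as count or collect triggers execution of the whole chain.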

Questions

(1) “...individual RDDs are immutable...” What does it mean by being “immutable”? What benefits does this property of RDD bring?

Solution:

1) An RDD is read-only, which is what “immutable” means: once created, it cannot be modified.

2) RDDs are designed to be immutable to make it easy to describe lineage graphs (information about how a dataset was derived from other datasets). If one partition is lost, it can be recovered via the lineage graph by recomputing it from its parent partitions.


(2) When an RDD is being created (new data are being written into it), can the data in the RDD be read for computing before the RDD is completely created?

Solution:

No. There should be read and write barriers between RDDs that have a lineage dependency. Otherwise, the computation for the next iteration would be derived from stale or incomplete data, which can produce incorrect results.

(3) “This allows them to efficiently provide fault tolerance by logging the transformations used to build a dataset (its lineage) rather than the actual data.” “To achieve fault tolerance efficiently, RDDs provide a restricted form of shared memory, based on coarse-grained transformations rather than fine-grained updates to shared state.” Why does using RDDs help to provide efficient fault tolerance? Or why do coarse-grained transformations help with efficiency?

Solution:

Logging coarse-grained transformations of RDDs is a good fit for many parallel applications, because these applications naturally apply the same operation to multiple data items. RDDs provide an interface based on coarse-grained transformations (e.g., map, filter, and join) that apply the same operation to many data items at once. This allows the system to provide fault tolerance efficiently by logging the transformations used to build a dataset (its lineage) rather than the data itself. To recover, we only need to recompute the lost part of the dataset by re-running the logged operations, so lost data can be recovered, often quite quickly, without requiring costly replication.
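The idea can be sketched in a few lines of plain Python (a hypothetical illustration, not Spark code): because the same logged operation built every partition, a lost partition is recovered by re-applying that operation to its parent partition alone, with no data replication.

```python
# Hypothetical sketch of lineage-based recovery (not Spark code).
def build_partition(parent_partition, op):
    """Apply the logged coarse-grained transformation to one parent partition."""
    return [op(x) for x in parent_partition]

parent = [[1, 2], [3, 4], [5, 6]]        # a source dataset with 3 partitions
op = lambda x: x * 10                    # the logged transformation (lineage)

derived = [build_partition(p, op) for p in parent]
derived[1] = None                        # simulate losing partition 1

# Recovery: re-run the logged transformation on the parent partition only.
if derived[1] is None:
    derived[1] = build_partition(parent[1], op)

print(derived)   # [[10, 20], [30, 40], [50, 60]]
```

Only the transformation (a few bytes of lineage) is logged, not the partition contents, which is why this is much cheaper than replicating the data across machines.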

(4) “In addition, programmers can call a persist method to indicate which RDDs they want to reuse in future operations.” What’s the consequence if a user does not explicitly request persistence of an RDD?

Solution:

By default, Spark will not keep an RDD in memory if the programmer does not explicitly request persistence. As a result, the RDD is recomputed from its lineage each time an action uses it, and the data cannot be reused efficiently across parallel operations.
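The cost difference can be made concrete with a small hypothetical sketch in plain Python (the Dataset class and its persist/materialize methods are invented for illustration; this is not Spark's API): without persistence every action re-runs the computation, with persistence the first action caches the result.

```python
# Hypothetical sketch of persist() vs. recompute-on-every-action (not Spark code).
class Dataset:
    def __init__(self, compute):
        self._compute = compute
        self._cache = None
        self._persisted = False

    def persist(self):
        self._persisted = True
        return self

    def materialize(self):        # stands in for running an action
        if self._persisted:
            if self._cache is None:
                self._cache = self._compute()   # computed once, then cached
            return self._cache
        return self._compute()                  # recomputed on every action

evals = {"n": 0}
def expensive():
    evals["n"] += 1               # count how often the lineage is re-run
    return [1, 2, 3]

ds = Dataset(expensive)
ds.materialize(); ds.materialize()
print(evals["n"])                 # 2: recomputed for each action

ds2 = Dataset(expensive).persist()
before = evals["n"]
ds2.materialize(); ds2.materialize()
print(evals["n"] - before)        # 1: computed once, then served from cache
```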

(5) Explain Figure 1 about a lineage graph.

Solution:


Question and data: This is an example of searching terabytes of logs in the Hadoop filesystem (HDFS) to find the cause of errors in a web service. The procedure is to load the error messages from a log into memory, then interactively search for various patterns.

First, create a base RDD representing the log file in HDFS:

    lines = spark.textFile(“hdfs://…”)

Next, apply a transformation, since we are only interested in lines starting with “ERROR”:

    errors = lines.filter(_.startsWith(“ERROR”))

We can apply further transformations, e.g., keeping only errors that mention “HDFS”:

    errors.filter(_.contains(“HDFS”))

If the lines are tab-separated fields, we can use map to extract the field that holds the actual error message (here field index 3):

    messages = errors.map(_.split(‘\t’)(3))

Finally, we tell the system to persist just the messages RDD so it fits in memory:

    messages.persist()

After we launch the action count, the master looks at where the data resides in HDFS, and the workers compute and send their results back to the master.

Along the way, the workers will store in memory any partitions of the persisted RDDs that they build.


The next time we launch another action, e.g., searching for “bar”, the results come back much faster because the data is already in memory.

(figure adopted from RDD’s conference presentation)

The dataset is built from individual partitions. If one of them is lost, we can recover it by recomputing that piece of data from the source in HDFS.

Song Jiang
Note: very good