Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing

Paper: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation (NSDI’12). USENIX Association, USA, 2.

RDDs are a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner.

Issues motivating Spark

Frameworks such as MapReduce and Dryad lack abstractions for leveraging distributed memory: the only way to reuse data between computations is to write it to external stable storage, which incurs substantial overhead from data replication, disk I/O, and serialization. This makes them inefficient for two classes of applications that reuse intermediate results across computations: iterative algorithms (e.g. PageRank, K-means, logistic regression) and interactive data mining.

Definitions

RDDs are read-only, partitioned, in-memory parallel data structures that can be created only through deterministic transformations on data in stable storage or on other RDDs. Each RDD stores metadata about its lineage, which is enough to recompute its partitions from data in stable storage after a failure. A transformation (e.g. map, filter, join) is a lazy operation that defines a new RDD; actions (e.g. count, collect, save) are operations that return a value to the application or export data to stable storage. RDDs are computed lazily, only when first used in an action. Spark provides a language-integrated API in which each RDD is an object and transformations are invoked as methods on these objects.
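A minimal sketch of this API in Scala, mirroring the paper's log-mining example; it assumes a SparkContext named sc (e.g. from the Spark shell), a placeholder HDFS path, and tab-separated log lines with the time in field 4:

    val lines  = sc.textFile("hdfs://...")            // RDD backed by a file in stable storage
    val errors = lines.filter(_.startsWith("ERROR"))  // transformation: lazy, only defines an RDD
    errors.persist()                                  // hint: keep errors in memory once computed

    // Nothing has been computed yet. The first action forces evaluation of the
    // lineage (textFile -> filter) and materializes errors in memory.
    val n = errors.count()                            // action: returns a value to the driver

    // Later actions reuse the persisted partitions instead of re-reading the file.
    val times = errors.filter(_.contains("HDFS"))
                      .map(_.split('\t')(3))          // assumed log format: time is field 4
                      .collect()                      // action: returns an array to the driver

Because evaluation is deferred to actions, Spark can pipeline the transformations and avoid materializing intermediate results.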

Why not use a DSM (Distributed Shared Memory) system instead?

RDDs provide several advantages over DSM. DSM allows fine-grained writes to arbitrary memory locations, so fault tolerance requires expensive checkpointing and rollback; RDDs can be written only through coarse-grained transformations, so a lost partition can be recomputed cheaply from its lineage without checkpointing or rolling back the whole program. Because RDDs are immutable, the runtime can also mitigate stragglers by running backup copies of slow tasks, as in MapReduce, and the scheduler can place tasks close to their data. Finally, partitions that do not fit in RAM can be spilled to disk, degrading gracefully instead of thrashing.

Representing RDDs

To support a rich set of transformations, each RDD is represented through a common graph-based interface that exposes five pieces of information: a set of partitions (atomic pieces of the dataset), a set of dependencies on parent RDDs, a function for computing the RDD from its parents, and metadata about its partitioning scheme and preferred locations for its partitions.
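A sketch of that interface in Scala, following Table 3 of the paper; the names are the paper's, not necessarily the current Spark codebase's, and the helper traits are stubbed for self-containment:

    trait Partition                                   // an atomic piece of the dataset
    trait Dependency[T]                               // an edge to a parent RDD
    trait Partitioner                                 // hash/range partitioning scheme

    trait RDD[T] {
      def partitions: Seq[Partition]                  // 1. the set of partitions
      def dependencies: Seq[Dependency[_]]            // 2. dependencies on parent RDDs
      def iterator(p: Partition,                      // 3. compute partition p, given
                   parents: Seq[Iterator[_]]): Iterator[T] // iterators over its parents
      def preferredLocations(p: Partition): Seq[String]    // 4. locality hints (e.g. HDFS block hosts)
      def partitioner: Option[Partitioner]            // 5. partitioning metadata, if any
    }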

In a job with many iterations, it may be necessary to reliably replicate some versions of an RDD, since recomputing a long lineage chain after a failure is expensive; RDDs therefore let users control both persistence and partitioning. Specifying persistence (with replication) leads to efficient fault recovery. Spark provides three options for storing persistent RDDs: in-memory storage as deserialized Java objects (fastest), in-memory storage as serialized data (more memory-efficient), and on-disk storage. Specifying a consistent partitioning across iterations (e.g. hash-partitioning both sides of a repeated join by key) minimizes communication between nodes and improves run time.
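A sketch of these hints on the paper's PageRank example, written against the public Spark API (HashPartitioner and StorageLevel are real Spark classes; the input path and line format are placeholder assumptions):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // Assumed input line: "url outlink1 outlink2 ...". Partition the link table
    // once and keep it in memory, so every iteration reuses the same layout.
    val links = sc.textFile("hdfs://.../links")
      .map { line => val f = line.split("\\s+"); (f.head, f.tail.toSeq) }
      .partitionBy(new HashPartitioner(100))          // consistent partitioning across iterations
      .persist(StorageLevel.MEMORY_ONLY)              // alternatives: MEMORY_ONLY_SER, DISK_ONLY

    var ranks = links.mapValues(_ => 1.0)             // inherits links' partitioner

    for (_ <- 1 to 10) {
      // links and ranks are co-partitioned, so this join shuffles neither input.
      val contribs = links.join(ranks).values.flatMap {
        case (outs, rank) => outs.map(d => (d, rank / outs.size))
      }
      ranks = contribs.reduceByKey(new HashPartitioner(100), _ + _)
    }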

Challenge: how should dependencies between RDDs be represented in this graph-based interface? Dependencies can be classified into two types: narrow dependencies, where each partition of the parent RDD is used by at most one partition of the child RDD (e.g. map, filter), and wide dependencies, where multiple child partitions may depend on each parent partition (e.g. groupByKey, and joins whose inputs are not co-partitioned).

This classification matters for two reasons. First, narrow dependencies allow pipelined execution on one cluster node, computing parent and child partitions element by element (e.g. a map followed by a filter), whereas wide dependencies require data from all parent partitions to be shuffled across nodes, as in MapReduce. Second, recovery after a node failure is more efficient with narrow dependencies: only the lost parent partitions must be recomputed, and this can happen in parallel on different nodes. With wide dependencies, a single failed node can invalidate some partition of every ancestor RDD, forcing a complete re-execution.
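A conceptual sketch of the two dependency types in Scala; the class names are illustrative, not the exact Spark hierarchy, and RDD and Partitioner are stubbed for self-containment:

    trait RDD[T]
    trait Partitioner

    sealed trait Dependency[T] { def parent: RDD[T] }

    // Narrow: each parent partition feeds at most one child partition, so the
    // child can be computed by pipelining on one node (map, filter, union).
    case class Narrow[T](parent: RDD[T],
                         parentsOf: Int => Seq[Int])  // child partition -> parent partitions
      extends Dependency[T]

    // Wide: every child partition may read from every parent partition,
    // requiring a MapReduce-style shuffle (groupByKey, non-co-partitioned join).
    case class Wide[K, V](parent: RDD[(K, V)], partitioner: Partitioner)
      extends Dependency[(K, V)]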

This common interface makes it easy to implement new transformations without modifying the scheduler. Spark is implemented in about 14,000 lines of Scala and runs over the Mesos cluster manager. The scheduler assigns tasks to machines based on data locality. For wide dependencies, intermediate records are materialized on the nodes holding the parent partitions, much as MapReduce materializes map outputs, which simplifies fault recovery.
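One way to see the stage structure the scheduler derives from a lineage graph is RDD.toDebugString (a real Spark API), which prints the dependency chain with its shuffle boundaries; a small word-count sketch with a placeholder input path:

    val counts = sc.textFile("hdfs://...")
      .flatMap(_.split("\\s+"))                       // narrow: pipelined within one stage
      .map((_, 1))                                    // narrow: same stage
      .reduceByKey(_ + _)                             // wide: shuffle starts a new stage
    println(counts.toDebugString)                     // prints the lineage and its stages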

Benefits of the RDD abstraction

Evaluation

Strengths & Weaknesses

Strengths

Weaknesses

Follow-Up