Resilient
Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster
Computing
Paper: Matei Zaharia,
Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy
McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012.
Resilient distributed datasets: a fault-tolerant abstraction for
in-memory cluster computing. In Proceedings of the 9th USENIX conference
on Networked Systems Design and Implementation (NSDI’12). USENIX
Association, USA, 2.
RDDs are a distributed memory abstraction that lets programmers
perform in-memory computations on large clusters in a fault-tolerant manner.
Issues motivating Spark
- To reuse data between computations, frameworks such as MapReduce write the data to an external stable storage system, incurring data replication, disk I/O, and serialization costs. This is bad for applications requiring data reuse, e.g. iterative machine learning and graph algorithms (such as logistic regression and PageRank) & interactive data mining tools (a sketch of such an iterative workload follows this list). Cluster computing frameworks at the time provided abstractions for accessing a cluster’s computational resources but lacked abstractions for leveraging distributed memory.
- The second challenge is defining a programming interface that can provide fault tolerance efficiently. Interfaces based on fine-grained updates to mutable state require replicating data or logging updates across machines for fault tolerance, which means copying large amounts of data over the network. Instead, RDDs provide fault tolerance by logging the coarse-grained transformations (map, filter, join, etc.) used to build a dataset rather than the data itself.
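As a concrete illustration of the data reuse these frameworks lack, here is a minimal sketch of an iterative logistic regression job on Spark, modeled on the paper's motivating examples; the input path, the 1-D feature, and all names below are illustrative, not taken from the paper.

    import org.apache.spark.{SparkConf, SparkContext}

    object LRSketch {
      // One training example: a single scalar feature and a +/-1 label (kept 1-D for brevity).
      case class Point(x: Double, y: Double)

      def main(args: Array[String]): Unit = {
        // Local mode for illustration only.
        val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))

        // Parse the text file once and keep the parsed points in memory across iterations.
        val points = sc.textFile("points.txt")            // illustrative input path
          .map { line =>
            val cols = line.split(",")
            Point(cols(0).toDouble, cols(1).toDouble)
          }
          .persist()                                      // reuse in memory instead of re-reading from disk

        var w = 0.0                                       // initial weight
        for (_ <- 1 to 10) {
          // Each iteration is a map + reduce over the cached points; no disk I/O after the first pass.
          val gradient = points
            .map(p => (1.0 / (1.0 + math.exp(-p.y * w * p.x)) - 1.0) * p.y * p.x)
            .reduce(_ + _)
          w -= gradient
        }
        println(s"Final w = $w")
        sc.stop()
      }
    }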
Definitions
RDDs are read-only, in-memory, parallel, partitioned data structures that can only be created through deterministic transformations on data in stable storage or on other RDDs. An RDD stores metadata about its lineage so that its partitions can be (re)computed from data in stable storage. A transformation is a lazy operation that defines a new RDD. Actions are operations that return a value to the application or materialize data to stable storage, e.g. count, collect, save. RDDs are computed lazily, only when used in an action. Spark provides a language-integrated API in Scala where each RDD is an object and transformations are methods on these objects.
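Concretely, the split between lazy transformations and eager actions looks roughly as follows in the Scala API, modeled on the paper's console log-mining example (assuming a SparkContext sc is in scope, as in the interactive interpreter; the path and filter patterns are illustrative):

    // Transformations are lazy: these lines only record lineage, nothing is computed yet.
    val lines  = sc.textFile("hdfs://.../logs")           // RDD backed by data in stable storage
    val errors = lines.filter(_.startsWith("ERROR"))      // derived RDD
    errors.persist()                                      // hint: keep this RDD in memory once computed

    // Actions force evaluation of the lineage above.
    val numErrors = errors.count()                                 // returns a value to the driver
    val hdfsLines = errors.filter(_.contains("HDFS")).collect()    // materializes matching lines at the driver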
Why
not use a DSM (Distributed Shared Memory) system instead?
RDDs provide several advantages compared with DSM systems.
- Applications in DSM systems read & write to arbitrary locations in the global address space. RDDs allow only bulk writes, and are hence suited to batch analytics. This makes fault tolerance efficient: instead of checkpointing & rolling back the entire in-memory application state, Spark recovers a lost partition by recomputing it from the RDD’s lineage metadata.
- Spark can run backup copies of slow (straggler) tasks because RDDs are immutable. Backup tasks are hard to implement in DSM systems, as two copies of the same task would update the same memory locations and could leave the state inconsistent.
- The Spark runtime can schedule tasks based on data locality in RDDs to improve performance, which is not possible in DSM.
- Graceful degradation: when memory is insufficient, partitions that do not fit can be stored on disk & provide performance similar to existing data-parallel systems.
Representing RDDs
RDDs are represented in a graph-based manner to support a rich set of
transformations. Each RDD exposes five pieces of information:
- set of partitions
- set of dependencies on parent RDDs
- function for computing the dataset based on its parents
- metadata about its partitioning scheme (hash/range partitioned)
- metadata about data placement (fast access due to data
locality)
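A simplified Scala sketch of this common interface, paraphrasing the paper's description (the trait and member names here are illustrative, not the exact classes of the Spark codebase):

    trait Partition                        // an atomic piece of the dataset
    trait Dependency                       // a link to (partitions of) a parent RDD
    trait Partitioner                      // hash/range partitioning scheme

    trait RDDInterface[T] {
      def partitions: Seq[Partition]                          // 1. set of partitions
      def dependencies: Seq[Dependency]                       // 2. dependencies on parent RDDs
      def compute(split: Partition): Iterator[T]              // 3. compute a partition from its parents
      def partitioner: Option[Partitioner]                    // 4. partitioning-scheme metadata, if any
      def preferredLocations(split: Partition): Seq[String]   // 5. data-placement metadata (preferred nodes)
    }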
In a job with many iterations, it may be necessary to reliably replicate some versions of an RDD to reduce fault recovery times. RDDs therefore let users control both the persistence and the partitioning of the RDDs. Specifying persistence leads to efficient fault recovery. Spark provides three options for storing persistent RDDs: in-memory storage as deserialized Java objects, in-memory storage as serialized data, and on-disk storage. Specifying a consistent partitioning across iterations can reduce communication across nodes and improve run-time.
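In the Spark API these choices are expressed roughly as below (a sketch: dataset and rawLinks stand for arbitrary RDDs, rawLinks being a key-value RDD, and the StorageLevel constants follow later Spark releases rather than necessarily the NSDI prototype):

    import org.apache.spark.HashPartitioner
    import org.apache.spark.storage.StorageLevel

    // Persistence: pick one of the three storage options per RDD.
    val cached = dataset.persist(StorageLevel.MEMORY_ONLY)   // in-memory, deserialized Java objects
    // dataset.persist(StorageLevel.MEMORY_ONLY_SER)          // in-memory, serialized (more compact)
    // dataset.persist(StorageLevel.DISK_ONLY)                 // on-disk storage

    // Partitioning: hash-partition a key-value RDD once so that operations joining
    // against it in every iteration do not reshuffle it each time.
    val links = rawLinks.partitionBy(new HashPartitioner(100)).persist()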
Challenge: How to represent dependencies between RDDs in
this graph-based interface? We can classify dependencies into two
types.
- narrow dependencies: each partition of the parent RDD is used by at most one partition of the child RDD
- wide dependencies: each partition of the parent RDD may be used by multiple partitions of the child RDD (both kinds are illustrated in the sketch after the next list)
Two reasons for this classification.
- narrow dependencies allow pipelined execution on one node, which can compute all the relevant parent partitions (e.g. a map followed by a filter, element by element), whereas wide dependencies require data from all parent partitions to be available & shuffled across nodes in a MapReduce-like fashion.
- fault recovery is faster with a narrow dependency because only the
lost parent partitions need to be recomputed, whereas for wide
dependencies, a single failed node might cause loss of some partition
from all ancestors of an RDD, requiring a complete re-execution.
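For example (a sketch; nums, pairs, and other stand for illustrative RDDs, with pairs and other being key-value RDDs):

    import org.apache.spark.HashPartitioner

    // Narrow dependencies: each child partition depends on at most one partition of its
    // parent, so map and filter can be pipelined on the same node.
    val squares = nums.map(x => x * x)
    val evens   = squares.filter(_ % 2 == 0)

    // Wide dependency: a child partition may need records from every parent partition,
    // forcing a MapReduce-style shuffle across nodes.
    val grouped = pairs.groupByKey()

    // A join is narrow when both parents are already hash-partitioned the same way,
    // and wide otherwise (one reason controlling partitioning matters).
    val part   = new HashPartitioner(8)
    val joined = pairs.partitionBy(part).join(other.partitionBy(part))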
Checkpointing to stable storage is useful for fault recovery of RDDs
with long lineage chains and wide dependencies. This can avoid full
recomputations. For RDDs with short lineage chains, checkpointing may
not be worthwhile. The decision of which RDDs to checkpoint is left to the user.
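In later Spark releases checkpointing is exposed roughly as below; the paper describes it as a flag passed to persist, so the exact call in the NSDI prototype may differ, and the directory and the pageRanks RDD are illustrative:

    // Tell Spark where to write checkpoints in stable storage (e.g. HDFS).
    sc.setCheckpointDir("hdfs://.../checkpoints")    // illustrative directory

    // pageRanks stands for an RDD with a long lineage chain and wide dependencies,
    // e.g. the ranks after many PageRank iterations.
    pageRanks.checkpoint()     // the user decides that this RDD is worth checkpointing
    pageRanks.count()          // the checkpoint is written when an action forces computation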
This graph-based interface makes it easy to implement new
transformations without interfering with the scheduler. Spark is
implemented in about 14,000 lines of Scala, and runs over the Mesos
cluster manager. The scheduler assigns tasks to machines based on data
locality. For wide dependencies, the intermediate records are
materialized on the nodes holding parent partitions to simplify fault
recovery, much as MapReduce materializes map outputs.
Benefits of the RDD
abstraction
- Spark supports interactive data mining thanks to its low-latency in-memory design. Two changes were made to the Spark interpreter: class shipping & modified code generation. Class shipping lets worker nodes fetch the bytecode for the classes created on each interpreter line, served by the interpreter over HTTP. The code generation logic is also modified to trace the object graph so that the classes wrapping variables defined on previous lines are shipped to the workers. With these changes, users can query big datasets interactively from the interpreter.
- Many parallel programs naturally apply the same operation to many records, which makes them easy to express with coarse-grained transformations. RDDs can express many cluster programming models: MapReduce, DryadLINQ, SQL, Pregel, HaLoop, and interactive data mining. This allows users to compose these models in one program & share data between them, e.g. run a MapReduce operation to build a graph, then run Pregel on it. For example, MapReduce can be expressed using the flatMap and groupByKey operations in Spark (see the sketch after this list).
- Logging the lineage of the RDDs created during a job can be useful for a debugger, as it has virtually zero recording overhead compared to traditional replay debuggers for general distributed systems, which capture the order of events across multiple nodes.
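For instance, word count written as a MapReduce job over an RDD of text lines might look like this (a sketch; lines, mapFn, and reduceFn are illustrative names, with lines being an RDD[String]):

    // The map phase becomes flatMap, and the shuffle + reduce phase becomes
    // groupByKey followed by a map over each key's list of values.
    def mapFn(line: String): Seq[(String, Int)] =
      line.split("\\s+").toSeq.map(word => (word, 1))

    def reduceFn(word: String, counts: Iterable[Int]): (String, Int) =
      (word, counts.sum)

    val counts = lines
      .flatMap(mapFn)
      .groupByKey()
      .map { case (k, vs) => reduceFn(k, vs) }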
Evaluation
- The paper evaluates Spark and RDDs on Amazon EC2 (nodes with 4 cores and 15 GB of RAM), and shows that Spark outperforms Hadoop by up to 20x in iterative machine learning and graph applications, thanks to avoiding disk I/O and deserialization costs by storing data in memory as Java objects. Spark can query a 1 TB dataset interactively with latencies of 5-7 seconds, whereas querying it from disk took 170s.
- They compare the performance of Hadoop, HadoopBinMem, and Spark with
logistic regression and k-means. HadoopBinMem is a Hadoop deployment
that converts the input data into a low-overhead binary format in the
first iteration to eliminate text parsing in later iterations, and
stores it in an in-memory HDFS instance. The results show that sharing data via RDDs greatly speeds up later iterations; Spark’s advantage even over HadoopBinMem comes from:
- the minimum overhead of the Hadoop software stack
- the overhead of HDFS while serving data
- the deserialization cost of converting binary records into usable in-memory Java objects
- They also compared the performance of Spark with Hadoop for PageRank using a 54 GB Wikipedia dump & achieved a 7.4x speedup over Hadoop on 30 nodes by combining in-memory storage with consistent partitioning across iterations.
Strengths & Weaknesses
Strengths
- The paper provides a useful data sharing abstraction through RDDs
that wasn’t provided by cluster computing frameworks at the time. This
enables broader classes of applications such as iterative machine
learning algorithms and interactive data mining. Frameworks such as
MapReduce provided abstractions for accessing computational resources in
a cluster, but not distributed memory. For reusing data between
computations, these frameworks relied on writing to external stable
storage incurring overheads due to replication, disk I/O and
serialization costs. By making RDDs immutable, the authors solve the consistency issue by design.
- The Spark system allows a general-purpose programming language,
Scala, to be used at interactive speeds for in-memory data mining on
clusters. Indeed, the authors let users run Spark interactively from the
interpreter to query big datasets.
- A key advantage of RDDs is efficient fault recovery, achieved by avoiding checkpointing of the entire in-memory state, in contrast to Distributed Shared Memory (DSM) systems. Users can specify the persistence of RDDs, and only the lineage graphs of the RDDs are logged, which are many orders of magnitude smaller than the entire application state. Another advantage is leveraging data locality to send computations to the nodes holding the data.
- The transformations and actions supported are rich enough to express
a wide range of batch analytics applications, even though it might seem
counter-intuitive as only coarse-grained transformations are supported
on RDDs. This required recognizing the common computations employed in
batch analytics applications and abstracting them out to support them on
RDDs. This expressivity of the RDD abstraction is a big part of the
reason for Spark’s popularity. Indeed, the paper shows that RDDs
can easily express a wide range of parallel applications in a few lines
of code.
Weaknesses
- Spark is best suited for batch analytics, but not for asynchronous applications requiring fine-grained updates to shared state, for example the storage system for a web application or an incremental web crawler. These applications are better served by systems with traditional update logging and data checkpointing mechanisms, e.g. databases, RAMCloud, Percolator, and Piccolo.
- Spark relies on the user to specify persistence and to optimize data placement, which makes it less accessible to a broader audience. Indeed, the authors mention that they are investigating automatic checkpointing mechanisms: the Spark scheduler knows the size of each dataset as well as the time it took to compute, so it should be able to select an optimal set of RDDs to checkpoint to minimize system recovery time. Leveraging data locality could likewise be handled automatically by having the scheduler place tasks on the right nodes.
Follow-Up
- Some immediate follow-up ideas to extend Spark would be:
- tolerating scheduler failures by replicating the RDD lineage
graph
- automatic checkpointing (mentioned above), determined by the
scheduler
- automatic data placement mechanisms, determined by the
scheduler
- Another extension of RDDs could be to let users control the granularity of the data partitions in an RDD, to support an even broader class of applications beyond batch analytics. It would then be possible to combine asynchronous applications that make fine-grained updates to shared state with batch analytics applications. This would make RDDs closer to a general Distributed Shared Memory (DSM) system by allowing reads & writes to individual memory locations instead of just bulk writes. It would likely require more expensive fault tolerance mechanisms, but some applications might make the tradeoff worthwhile. It would also bring the advantages of RDDs (straggler mitigation through backup tasks, work placement leveraging data locality, and graceful degradation when out of memory) to DSM-style systems.