PRISM: Rethinking the RDMA Interface for Distributed Systems

Matthew Burke, Sowmya Dharanipragada, Shannon Joyner, Adriana Szekeres, Jacob Nelson, Irene Zhang, and Dan R. K. Ports. 2021. PRISM: Rethinking the RDMA Interface for Distributed Systems. In ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP ’21), October 26–29, 2021, Virtual Event, Germany. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3477132.3483587

The past few years have seen a continual increase in network bandwidth and as signaled by Moore’s law, CPU speeds have become to stagnate. It has become critical to reduce the involvement of the CPU when it comes to networking. RDMA achieves this by providing a standard accelerated interface to access remote memory directly through the network. Following these hardware trends, recent years have seen a plethora of distributed systems work that have been redesigned to utilize RDMA. These include key-value stores, distributed transactions, & replicated storage systems.

The RDMA interface has two kinds of ops:

Two-sided ops: they have standard message passing semantics. Here the client machine’s CPU invokes an RDMA send op & the app running on the remote host invokes an RDMA receive op.
One-sided ops: allow the client to read/write directly to a remote memory region with no additional CPU involvement on the remote host.

They both come with their own tradeoffs:

one-sided ops are faster & more CPU efficient but their interface is restrictive allowing only reads & writes to pre-registered memory buffers.
two-sided ops offer richer semantics but are often slower and less CPU efficient.

Motivating Example

Many implementations of remote data structures use indirection. For example, hash table used in Pilaf (ATC’13) and FaRM (NSDI’14) key-value stores, two state-of-the-art RDMA systems, is a hash-based indexed data structure that maintains pointers to the actual value stored inside the key-value store. A get op could be implemented using RDMA as follows:

we could issue the get as a standard two-sided RPC that is processed by the remote CPU and return the value to the client. This only takes 1 network roundtrip, but would consume CPU cycles on the remote host.
we could issue 1 one-sided RDMA read to get the address of where the value is located, and another one-sided RDMA read to actually retrieve the value. There’s no CPU involved, but we need at least 2 network roundtrips to process a get. The number of network roundtrips would increase in case of hash collisions.

Additional network roundtrips can make one-sided implementations slower. What can we do about this?

Forgo the use of one-sided ops completely by relying on RPCs. But this requires giving up CPU efficiencies that one-sided ops come with. This is done in FaSST (OSDI’16), HERD (SIGCOMM’14).
Allow apps to install their own custom one-sided logic at the NIC. But this comes with significant deployment & security challenges. Done in StRoM (EuroSys’20), RMC (HotNets’20).
Standard Extensions to the RDMA Interface.

PRISM: Primitives for Remote Interaction with System Memory

This work considers standard extensions to the RDMA interface. How to design the API?

The primitives must be generic.
Primitives must express the application logic concisely and generalize to a large class of apps.
Simple enough to be implemented on multiple platforms.

They present PRISM: Primitives for Remote Interaction with System Memory based on:

feasible to implement across different network stacks.
implementable reusing existing RDMA mechanisms like retransmission & access control.

The PRISM API: list of proposed extensions to the traditional RDMA interface

Indirect Read/Write/CAS: The target address specified by the operation can instead be interpreted as a pointer to the actual target.
Enhanced Compare & Swap: Extends RDMA CAS to provide support for arithmetic comparisons (>,<) during the compare phase.
Allocation: Allows memory allocation on the data-plane from a pre-registered pool of memory.
Operation Chaining: Allows for the execution of a chain of other PRISM primitives at the NIC.

This is chosen based on their applicability to a large class of systems as well as their feasibility to implement across different platforms.

Indirect Reads with PRISM

Going back to the example, we can issue an indirect read PRISM call. The target address specified by the indirect read is interpreted as an address of a pointer. This pointer is then dereferenced to obtain the value that is returned to the client.

only one network roundtrip
no CPU involved in the process

More complicated patterns of indirection can be implemented using operation chaining.

Prototyped PRISM primitives on two different network stacks:

Google Snap inspired Software RDMA prototype that partitions some CPU cores dedicated for packet processing.
SmartNIC prototype using Mellanox BlueField-1 SmartNIC to execute the PRISM primitive.
Also extensively discussed how the PRISM API could be added to a NIC ASIC while reusing underlying mechanisms that already exist in today’s NICs.

Three case studies to demonstrate the benefits of PRISM:

PRISM-KV: a key-value store that implements both read & write ops using the one-sided PRISM API.
PRISM-TX: a transactional storage system that implements a timestamp-based optimistic concurrency control protocol using PRISM’s primitives.
PRISM-RS: a replicated storage system that implements the ABD quorum replication protocol.

Each of these apps are widely used in practice. The proposed extensions allows us to efficiently implement all of these apps entirely in terms of one-sided PRISM primitives.

Two kinds of benefits to using PRISM:

Ops can take fewer network roundtrips with PRISM
- Get() in Pilaf, a KV-Store.
- TxRead() in FaRM, a Transactional Storage System.
- ReadPhase() in ABD, a Replicated Block Store.
Ops that previously used traditional RPCs can be implemented as one-sided PRISM ops.
- Put() in Pilaf
- TxCommit() in FaRM
- WritePhase() in ABD

Transaction Processing with PRISM - PRISM-TX

FaRM is a state-of-the-art transaction processing sytem over RDMA. We consider a storage system where data is partitioned among multiple servers and clients group their ops into transactions. During the execution phase, they either read or write data from different servers. Reads use one-sided RDMA and writes are buffered locally in the execution phase. After all the operations finish executing, the transaction enters the commit phase. FaRM uses a variant of the two-phase commit protocol:

Lock: the first phase is obtain locks for the write set.
Validate: if the lock phase succeeds, we validate the read set to ensure that the objects that have been read haven’t changed since the execution phase.
Update & Unlock: if the first two phases are successful, updates take effect & all locks are released.

In FaRM, the Update phase & Lock phase have to use RPCs but the read set can be validated using one-sided RDMA reads.

Can we build something better than FaRM?

The execution phase of FaRM needs no further improvement since all communication that occurs between client & servers here uses only one-sided ops. There is scope for improvement in the commit phase. We can replace the RPCs in the Lock & Update phases with the PRISM API. But we need to make some modifications first.

In FaRM, data is stored in a hash table that’s accessible via RDMA & each object is associated with a version number & a lock. This layout wasn’t designed to allow non-local changes to the version numbers. In order to be able to prepare and commit transactions with no CPU involvement, these have to be accessible to clients via RDMA. So, we couple the version number with other metadata in the hash table’s index so that they are available to the client.
Version numbers must be client-generated since the ops are entirely one-sided. For this purpose, we use timestamps generated by loosely-synchronized clocks. For each object, we store timestamp of the latest reads, writes & commits that occurred to that object, followed by a pointer to it’s current value.

With these changes, all the OCC checks in the commit phase can use PRISM’s Enhanced CAS primitives with no additional CPU involvement.

At the beginning of the commit phase a timestamp is chosen such that it is larger than all the timestamps recorded in the read set during the execution phase.
Locking for the write set is done by issuing an Enhanced CAS op on the write timestamp. If this PRISM Enhanced CAS succeeds, it is equivalent to obtaining a lock on this key & prevent any concurrent readers or writers with a lower timestamp from succeeding.
In the validation phase, we validate the read set by ensuring that the write timestamp hasn’t changed. This ensures that there are no concurrent writers with the intent to write and commit. Then, we update the read timestamp. This ensures that no other concurrent transaction with a lower timestamp that reads or writes to this object after this point can commit.
If all the r/w validation checks succeed, in the update phase, we install the write by using a chain of PRISM’s Allocate & Indirect Write primitives.

The primitives enable a new class of one-sided OCC mechanisms.

They compared PRISM-TX with their implementation of the FaRM protocol using both hardware & software RDMA implementations for FaRM. PRISM-TX outperforms FaRM in both throughput and latency with a lower message complexity enabled by PRISM’s primitives, about 5 micros faster than FaRM, & reached about a 1M more txns/s before saturating the network.

PRISM significantly expands the design space for RDMA-based applications by offering a middle ground between the restrictive RDMA R/W interface and the full-generality of RPC communication.

Strengths

This paper presents an extended API for access to remote memory. PRISM expands the design space for network-accelerated apps, making it possible to build new types of apps with minimal server-side CPU involvement, demonstrated by their key-value store, replicated storage, and distributed transaction apps.
Coming up with design of an RDMA interface that satisfies the need of RDMA-based applications requires understanding the network access patterns of different apps & in-depth understanding of current RDMA capabilities to extend them for maximal expressibility without requiring complexity and losing scalability (on different current & future platforms).

Weaknesses

Not a particular weakness, but extending or proposing a new API interface for a powerful technology such as the RDMA to be supported in commercially-available hardware NICs would require testing a prototype’s performance on a larger number of applications to convince the NIC manufacturers. The paper demonstrates that three sophisticated applications - key-value stores, replicated block storage, and distributed transactions - can be implemented entirely using the PRISM interface & achieve latency and throughput benefits. However, these applications might not be representative of the overall class of RDMA-based applications. There is also a huge software development overhead of porting current applications to a new interface.

Future Work

Implementing the PRISM API in a RDMA-capable hardware NIC to prove their statement that their proposal reuses existing microarchitectural mechanisms.
Investigating mechanisms for memory allocation & management primitives for RDMA NICs. Management of client-allocated memory is challenging, with specific memory management policies left to the discretion of the client application.
Explore other communication patterns in apps or how apps can benefit from alternative patterns. For example HyperLoop implements a different form of chaining - with a specially crafted RDMA request, a sender can cause a receiver to initiate a RDMA request to a third machine.