The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems

Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, Jacob Nelson, Omar S. Navarro Leija, Ashlie Martinez, Jing Liu, Anna Kornfeld Simpson, Sujay Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury, Anirudh Badam. 2021. The Demikernel Datapath OS Architecture for Microsecond-scale Datacenter Systems. In ACM SIGOPS 28th Symposium on Operating Systems Principles (SOSP ’21), October 26–29, 2021, Virtual Event, Germany. ACM, New York, NY, USA, 17 pages. https://doi.org/10.1145/3477132.3483569

Datacenter systems and I/O devices now run at single-digit microsecond latencies, requiring ns-scale OSes. Traditional kernel-based OSes impose unaffordable overhead, so kernel-bypass systems eliminate the OS kernel from the I/O datapath.

Library OSes split OS functionality in two: protection stays in the OS kernel, while management moves into user-level library OSes that can better meet custom app needs. Kernel-bypass architectures go further and offload protection, along with some OS management, into the I/O devices. What is missing is a portable OS architecture for microscale kernel-bypass systems.

This paper proposes Demikernel, a general-purpose datapath OS replacement that meets the needs of microscale systems:

1. Support Heterogeneous OS Offloads

Today’s datapath architectures are ad hoc: different kernel-bypass libraries offer different OS features atop kernel-bypass devices, and different devices offload different OS features. For example, DPDK provides a raw NIC interface, while RDMA implements a network protocol with congestion control and ordered, reliable transmission. A datapath OS needs to flexibly accommodate these heterogeneous kernel-bypass devices.

Solution: A portable datapath API & flexible OS architecture

Each Demikernel datapath OS works with a legacy control-path OS kernel and consists of several interchangeable libOSes implementing a new high-level datapath API called PDPIX (a POSIX extension to support microscale kernel-bypass I/O). PDPIX centers on an I/O queue abstraction instead of POSIX's pipe-based one. Each libOS offloads OS features to the device when possible (e.g. the RDMA libOS offloads the network stack to the RDMA NIC). Each Demikernel libOS:

PDPIX is a POSIX extension tuned for kernel-bypass I/O:
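To make the queue-centric style concrete, here is a minimal sketch of what programming against an I/O queue might look like. It is a toy in-memory model, not the real PDPIX interface: the class ToyLibOS, its method names, and the token scheme are all hypothetical, and the sketch assumes that each pop returns exactly one previously pushed message rather than an arbitrary slice of a byte stream.

```python
from collections import deque
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyLibOS:
    """Toy in-memory stand-in for a libOS (hypothetical, not real PDPIX)."""
    queues: dict = field(default_factory=dict)    # queue descriptor -> messages
    pending: dict = field(default_factory=dict)   # token -> (operation, qd)
    _ids: count = field(default_factory=count)    # source of fresh integer ids

    def queue(self) -> int:
        """Create an empty I/O queue and return its descriptor."""
        qd = next(self._ids)
        self.queues[qd] = deque()
        return qd

    def push(self, qd: int, data: bytes) -> int:
        """Enqueue one whole message; return a token instead of blocking."""
        self.queues[qd].append(data)
        token = next(self._ids)
        self.pending[token] = ("push", qd)
        return token

    def pop(self, qd: int) -> int:
        """Request one message; the data is delivered later through wait()."""
        token = next(self._ids)
        self.pending[token] = ("pop", qd)
        return token

    def wait(self, token: int):
        """Complete the operation named by token and return its result."""
        op, qd = self.pending[token]
        result = None
        if op == "pop":
            if not self.queues[qd]:
                # A real datapath would keep polling; the toy just reports it.
                raise BlockingIOError("no message available yet")
            result = self.queues[qd].popleft()
        del self.pending[token]
        return result


libos = ToyLibOS()
qd = libos.queue()
libos.wait(libos.push(qd, b"hello"))
libos.wait(libos.push(qd, b"world"))
first = libos.wait(libos.pop(qd))
second = libos.wait(libos.pop(qd))
```

In this toy, the two pushed messages come back as two separate units, which is the property that distinguishes a queue of messages from a pipe-like byte stream.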

2. Coordinate Zero-Copy Memory Access

Zero-copy I/O is critical for latency. Kernel-bypass zero-copy I/O requires 2 types of memory access coordination:

Solution: DMA-capable heap with Use-After-Free protection

Three new features for zero-copy memory coordination:
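One way to picture use-after-free protection on a shared I/O heap is reference counting: the app and the datapath each hold a reference, and the buffer is only released once both are done. The sketch below is a toy model under that assumption; the notes above do not say which mechanism is used, and ToyDmaHeap with its method names is hypothetical.

```python
class ToyDmaHeap:
    """Toy heap whose buffers survive until app AND datapath release them."""

    def __init__(self):
        self.data = {}    # buffer id -> backing bytes
        self.refs = {}    # buffer id -> reference count
        self._next_id = 0

    def alloc(self, size: int) -> int:
        """Allocate a buffer; the app holds the first reference."""
        buf = self._next_id
        self._next_id += 1
        self.data[buf] = bytearray(size)
        self.refs[buf] = 1
        return buf

    def io_begin(self, buf: int) -> None:
        """Datapath takes its own reference before the device uses the buffer."""
        self.refs[buf] += 1

    def io_complete(self, buf: int) -> None:
        """Datapath drops its reference once the I/O has finished."""
        self._release(buf)

    def free(self, buf: int) -> None:
        """App drops its reference; memory is reclaimed only if nobody else holds one."""
        self._release(buf)

    def is_live(self, buf: int) -> bool:
        return buf in self.data

    def _release(self, buf: int) -> None:
        self.refs[buf] -= 1
        if self.refs[buf] == 0:
            del self.refs[buf]
            del self.data[buf]


heap = ToyDmaHeap()
buf = heap.alloc(64)
heap.io_begin(buf)                      # zero-copy send is now in flight
heap.free(buf)                          # app frees early
live_during_io = heap.is_live(buf)      # buffer must still exist
heap.io_complete(buf)                   # device is done with the buffer
live_after_io = heap.is_live(buf)       # now it can be reclaimed
```

The point of the sketch is the ordering: the app's early free does not pull memory out from under an in-flight zero-copy operation.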

3. Multiplex and Schedule the CPU at microscale

Existing kernel-level abstractions like processes and threads are too coarse-grained for microscale scheduling: they consume entire cores for hundreds of microseconds. Recent schedulers place app workers on a microscale, per-I/O basis but still use coarse-grained abstractions for OS work (distributed scheduling). The scheduler should instead multiplex app work and datapath OS tasks on a single thread.

Solution: Coroutines
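The idea of multiplexing app work and datapath OS work on one thread can be sketched with Python generators standing in for coroutines. This is only an illustration of cooperative interleaving: the task names, the round-robin policy, and the in-memory "NIC" queue are assumptions of the sketch, not the paper's scheduler.

```python
from collections import deque


def datapath_task(rx, inbox, log):
    """Hypothetical OS task: move packets from the 'NIC' queue to the app."""
    while rx:
        pkt = rx.popleft()
        inbox.append(pkt)
        log.append(f"os:rx {pkt}")
        yield  # give up the core after one small unit of work


def app_worker(inbox, expected, log):
    """Hypothetical app task: handle packets as they become available."""
    handled = 0
    while handled < expected:
        if inbox:
            log.append(f"app:handle {inbox.popleft()}")
            handled += 1
        yield  # yield whether or not there was work to do


def run(tasks):
    """Round-robin over the coroutines on a single thread until all finish."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)           # resume the task until its next yield
        except StopIteration:
            continue             # task finished; do not reschedule it
        ready.append(task)


rx = deque(["p1", "p2"])
inbox = deque()
log = []
run([datapath_task(rx, inbox, log), app_worker(inbox, 2, log)])
```

Each task runs for one short step and then yields, so OS work and app work alternate at a much finer grain than a thread time slice would allow.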

The authors compare Demikernel to 2 kernel-bypass applications (testpmd and perftest) and 3 kernel-bypass libraries (eRPC, Shenango, and Caladan). testpmd (an L2 packet forwarder) and perftest (which measures RDMA NIC send and recv latency) are included with the DPDK and RDMA SDKs respectively.

The authors show results with 4 different microscale kernel-bypass systems:

Strengths

Weaknesses

This paper gives a design and implementation of a general-purpose datapath OS for microsecond-scale kernel-bypass apps. There is huge potential for future work, but the paper itself does not seem to be missing anything and accomplishes what it set out to do.

Future Work

Each Demikernel OS feature represents a rich area for future work: