The Demikernel Datapath OS Architecture for Microsecond-scale
Datacenter Systems
Irene Zhang, Amanda Raybuck, Pratyush Patel, Kirk Olynyk, Jacob Nelson,
Omar S. Navarro Leija, Ashlie Martinez, Jing Liu, Anna Kornfeld Simpson,
Sujay Jayakar, Pedro Henrique Penna, Max Demoulin, Piali Choudhury,
Anirudh Badam. 2021. The Demikernel Datapath OS Architecture for
Microsecond-scale Datacenter Systems. In ACM SIGOPS 28th Symposium on
Operating Systems Principles (SOSP ’21), October 26–29, 2021, Virtual
Event, Germany. ACM, New York, NY, USA, 17 pages.
https://doi.org/10.1145/3477132.3483569
Datacenter systems and I/O devices now run at single-digit
microsecond latencies, requiring nanosecond-scale OSes. Traditional
kernel-based OSes impose unaffordable overhead, so kernel-bypass systems
eliminate the OS kernel from the I/O datapath.
Library OSes separate protection into the OS kernel and management
into the user-level library OSes to better meet custom app needs.
Kernel-bypass architectures offload protection into I/O devices, along
with some OS management. We need a portable OS architecture for
microscale kernel-bypass systems.
This paper proposes Demikernel, which offers a general-purpose
datapath OS replacement that meets the needs of microscale systems:
1. Support Heterogeneous OS Offloads
Today’s datapath architectures are ad hoc: different kernel-bypass
libraries offer different OS features atop kernel-bypass devices, and
different devices offload different OS features. E.g. DPDK provides a
raw NIC interface, while RDMA implements a network protocol with
congestion control & ordered, reliable transmission. We need to flexibly
accommodate heterogeneous kernel-bypass devices.
Solution: A portable datapath API & flexible OS architecture
Each Demikernel datapath OS works with a legacy control path OS
kernel & consists of several interchangeable libOSes implementing a
new high-level datapath API called PDPIX (a POSIX extension to support
microscale kernel-bypass I/O). PDPIX centers around an I/O queue
abstraction instead of pipe-based POSIX. OS features are offloaded to
the device when possible (e.g. the RDMA libOS offloads the network stack
to the RDMA NIC). Each Demikernel libOS supports a single kernel-bypass
I/O device type (DPDK, SPDK, or RDMA) and contains:
- an I/O processing stack for the device
- a libOS-specific memory allocator
- a centralized coroutine scheduler
PDPIX is a POSIX extension tuned for kernel-bypass I/O:
- accept returns a queue descriptor instead of a file descriptor.
- push & pop submit & receive I/O (both net & storage).
  - Input is an array of scatter-gather memory pointers, which avoids
  buffering & poor tail latencies.
  - Both are non-blocking & return a qtoken that apps use to fetch the
  completion later (using the wait_* lib calls).
  - PDPIX requires all I/O buffers to come from the DMA-capable heap,
  which provides use-after-free (UAF) protection.
- wait_* solves POSIX’s epoll inefficiencies:
  - wait_* directly returns the data from the op, so the app can process
  it immediately.
  - wait_* wakes only the worker waiting on the specific qtoken when the
  I/O completes.
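The queue-and-qtoken flow described above can be modeled in a few lines. The following is a minimal Python sketch, not the real PDPIX interface: the class name ToyLibOS, the in-memory loopback queue, and every method signature are assumptions made for illustration. It shows two properties from the list above: push and pop return a qtoken immediately, and wait returns the completed operation's data for one specific qtoken. Unlike a real blocking wait, this toy wait returns None when the operation has not completed.

```python
import itertools
from collections import deque


class ToyLibOS:
    """Toy, in-memory model of a PDPIX-style queue API (hypothetical names)."""

    def __init__(self):
        self._next_qt = itertools.count(1)   # qtoken generator
        self._queues = {}                    # queue descriptor -> deque of sgas
        self._pending_pops = {}              # qtoken -> queue descriptor
        self._done = {}                      # qtoken -> completed result

    def queue(self):
        """Create an I/O queue and return its queue descriptor."""
        qd = len(self._queues)
        self._queues[qd] = deque()
        return qd

    def push(self, qd, sga):
        """Submit a scatter-gather array (list of bytes); return a qtoken."""
        qt = next(self._next_qt)
        self._queues[qd].append(list(sga))
        self._done[qt] = ("push", None)      # loopback: completes at once
        return qt

    def pop(self, qd):
        """Request the next incoming sga; non-blocking, returns a qtoken."""
        qt = next(self._next_qt)
        self._pending_pops[qt] = qd
        return qt

    def _poll(self):
        """Complete pending pops, oldest first, whose queue has data."""
        for qt, qd in sorted(self._pending_pops.items()):
            if self._queues[qd]:
                self._done[qt] = ("pop", self._queues[qd].popleft())
                del self._pending_pops[qt]

    def wait(self, qt):
        """Return the result for this qtoken only, or None if not complete."""
        self._poll()
        return self._done.pop(qt, None)


libos = ToyLibOS()
qd = libos.queue()
pop_a = libos.pop(qd)            # two workers wait on separate qtokens
pop_b = libos.pop(qd)
libos.push(qd, [b"hello, ", b"world"])

result_a = libos.wait(pop_a)     # the data arrives with the completion
result_b = libos.wait(pop_b)     # no second message, so nothing is delivered
```

Only the worker holding pop_a receives the message; the worker holding pop_b is not woken for an event it cannot consume, which is the contrast with epoll drawn above.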
2. Coordinate Zero-Copy Memory Access
Zero-copy I/O is critical for latency. Kernel-bypass zero-copy I/O
requires 2 types of memory access coordination:
- The I/O device’s IOMMU needs to perform address translation, which
requires coordination with the CPU’s IOMMU and TLB. To avoid page faults
and ensure address mappings stay fixed during I/O, devices need
designated DMA-capable memory, pinned in the OS kernel.
- Coordination among the app, the I/O stack and the kernel-bypass
device. For example, the TCP stack might send memory to the NIC; if the
network loses the packet & the app has modified or freed the memory in
the meantime, the TCP stack cannot retransmit.
Solution: DMA-capable heap with Use-After-Free protection
Three new features for zero-copy memory coordination:
- portable API with I/O memory buffer ownership semantics (in PDPIX)
  - apps pass ownership to the datapath OS when invoking I/O & do not
  receive ownership back until the I/O completes.
- zero-copy, DMA-capable heap
  - libOSes replace the app’s memory allocator to back the heap with
  DMA-capable memory in a device-specific way.
  - Each libOS uses a device-specific, modified Hoard (a pool-based
  memory allocator) for memory management. Hoard’s memory pools, called
  superblocks, are allocated with memory from the DPDK mempool for the
  DMA-capable heap.
  - zero-copy I/O offers an improvement only for buffers over 1 kB.
- UAF protection
  - Demikernel libOS allocators do this with reference counting.
  - no write-protection: Demikernel can’t afford to protect apps from
  modifying in-use buffers.
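The reference-counting idea behind UAF protection can be illustrated with a small model. This is a minimal Python sketch under stated assumptions, not Demikernel's allocator: DmaBuffer, ToyStack, and their methods are hypothetical names, and "reclaimed" is just a flag standing in for returning memory to the heap. It shows that a buffer freed by the app while I/O is in flight stays alive until the I/O stack drops its own reference.

```python
class DmaBuffer:
    """Toy buffer from a hypothetical DMA-capable heap, with a refcount."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1                # the application's reference
        self.reclaimed = False

    def incref(self):
        self.refs += 1

    def decref(self):
        self.refs -= 1
        if self.refs == 0:
            self.reclaimed = True    # only now may the heap reuse the memory


class ToyStack:
    """Toy I/O stack that keeps sent buffers until they are acknowledged."""

    def __init__(self):
        self.unacked = []

    def push(self, buf):
        buf.incref()                 # the libOS takes its own reference
        self.unacked.append(buf)

    def retransmit(self):
        """Re-read every unacknowledged buffer, as a TCP stack would."""
        return [bytes(b.data) for b in self.unacked]

    def ack_all(self):
        for buf in self.unacked:
            buf.decref()             # I/O complete: drop the libOS reference
        self.unacked.clear()


stack = ToyStack()
buf = DmaBuffer(b"payload")
stack.push(buf)
buf.decref()                         # app frees the buffer while I/O is in flight
alive_during_io = not buf.reclaimed  # the libOS reference keeps it alive
resent = stack.retransmit()          # still readable for retransmission
stack.ack_all()
reclaimed_after_ack = buf.reclaimed  # memory is released once I/O completes
```

Nothing here stops the app from writing to buf.data while it is in flight, which mirrors the missing write-protection noted above.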
3. Multiplex and Schedule the CPU at microscale
Existing kernel-level abstractions like processes & threads are
too coarse-grained for microscale scheduling: they consume entire cores
for hundreds of microseconds. Recent schedulers schedule app workers on
a microscale per-I/O basis but still use coarse-grained abstractions for
OS work (distributed scheduling). The scheduler should instead multiplex
app work & datapath OS tasks on a single thread.
Solution: Coroutines
- Kernel-bypass scheduling happens on a per-I/O basis, but POSIX’s
epoll & select have a “thundering herd” issue: they cannot deliver
events to just one worker. PDPIX introduces wait, which lets app workers
wait on specific I/O requests.
- Coroutines encapsulate OS & app work. They are lightweight, with
low-cost context switches, and well-suited for state-machine-based async
event handling. No need for expensive global state management.
- A centralized coroutine scheduler: the datapath cannot afford
interrupts, so scheduling is cooperative, and coroutines yield after a
few microseconds or less.
- Three states of coroutines: running, runnable, and blocked. The
scheduler checks only runnable ones, since many coroutines are blocked
on infrequent I/O events.
- Three types of coroutines:
  - one fast-path I/O processing coroutine for each I/O stack, which
  polls for I/O.
  - many background coroutines for other I/O stack work:
    - sending outgoing packets
    - retransmitting lost packets
    - sending pure acks
    - managing connection close state transitions
  - one app coroutine per blocked qtoken for the app worker.
- Single-threaded Demikernel libOS I/O stacks share the app thread and
aim for run-to-completion: the fast-path coroutine processes incoming
data, finds the blocked qtoken, schedules the app coroutine & processes
any outgoing messages before moving on to the next I/O. The fast-path
coroutine yields after every n polls to let other I/O stacks &
background work run.
- Using Rust (memory-safety!), a coroutine context switch costs just 12
cycles, since Rust compiles coroutines to regular function calls. For
nanoscale scheduling, the scheduler finds runnable coroutines by
applying Lemire’s algorithm, which uses x86’s tzcnt instruction, to
“waker blocks”.
- Multiplexes net & storage I/O stacks by splitting the fast-path
coroutine between polling DPDK devices & SPDK completion queues in a
round-robin manner for fair-share CPU cycle allocation.
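The scheduling scheme in the list above can be sketched with Python generators standing in for Rust coroutines. This is an illustrative model under stated assumptions, not Demikernel's scheduler: ToyScheduler, set_bits, and the "yield"/"block" protocol are hypothetical, and a single Python integer plays the role of a waker block. The bit scan isolates the lowest set bit with mask & -mask, the same idea that a tzcnt-based scan relies on, so the scheduler visits only runnable coroutines and never touches blocked ones.

```python
def set_bits(mask):
    """Yield the indices of the set bits in mask, lowest first.

    mask & -mask isolates the lowest set bit. On x86 its index would come
    from tzcnt; here int.bit_length() stands in for that instruction.
    """
    while mask:
        lowest = mask & -mask
        yield lowest.bit_length() - 1
        mask ^= lowest


class ToyScheduler:
    """Single-threaded cooperative scheduler (hypothetical sketch)."""

    def __init__(self):
        self.coroutines = []     # slot index -> generator (None when finished)
        self.runnable = 0        # bit i set => coroutine i is runnable

    def spawn(self, gen, runnable=True):
        self.coroutines.append(gen)
        idx = len(self.coroutines) - 1
        if runnable:
            self.wake(idx)
        return idx

    def wake(self, idx):
        self.runnable |= 1 << idx

    def run_once(self):
        """Resume each currently runnable coroutine once; skip blocked ones."""
        for idx in set_bits(self.runnable):
            self.runnable &= ~(1 << idx)     # blocked until woken again
            try:
                if next(self.coroutines[idx]) == "yield":
                    self.wake(idx)           # still has work: stay runnable
            except StopIteration:
                self.coroutines[idx] = None  # finished


log = []


def fast_path(sched, app_idx, packets):
    """Fast-path coroutine: process one packet, wake the app, then yield."""
    for pkt in packets:
        log.append(f"fast-path: {pkt}")
        sched.wake(app_idx)      # the app's blocked qtoken has completed
        yield "yield"


def app_worker():
    """App coroutine: handle one request, then block on the next qtoken."""
    while True:
        log.append("app: handle request")
        yield "block"


sched = ToyScheduler()
app = sched.spawn(app_worker(), runnable=False)   # starts blocked on I/O
sched.spawn(fast_path(sched, app, ["p0", "p1"]))
for _ in range(4):
    sched.run_once()
```

Each pass snapshots the runnable bits before resuming anything, so a coroutine woken during a pass runs on the next pass. This toy does not model the round-robin split between DPDK and SPDK polling or the n-poll yield budget.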
The authors compare Demikernel to 2 kernel-bypass applications,
testpmd and perftest, and 3 kernel-bypass libraries: eRPC, Shenango,
and Caladan. testpmd (an L2 packet forwarder) & perftest (which measures
RDMA NIC send and recv latency) are included with the DPDK & RDMA SDKs
respectively.
The authors show results with 4 different microscale kernel-bypass
systems:
- Echo Application:
  - Demikernel’s API semantics & memory management let Demikernel’s
  echo server implementation process messages without allocating or
  copying memory on the I/O processing path.
  - Demikernel portably achieves competitive microsecond latencies,
  ns-scale I/O processing, run-to-completion & zero-copy for
  networking & storage.
- UDP Relay Server:
  - Demikernel makes kernel-bypass easier to use for programmers who
  are not kernel-bypass experts.
- Redis In-memory Distributed Cache:
  - Demikernel lets Redis correctly implement zero-copy I/O from its
  heap with no code changes.
  - Demikernel provides existing microscale apps portable kernel-bypass
  network & storage access with low overhead.
- TxnStore Distributed Transactional Storage:
  - Compared to custom solutions, Demikernel simplifies the coordination
  needed to support zero-copy I/O.
  - Demikernel improves performance for higher-latency microscale
  datacenter apps compared to a naive custom RDMA implementation.
Strengths
- Demikernel is a first step towards datapath OSes for microscale
kernel-bypass apps. The work solves the big problems of portability,
programmability, and performance in designing architectures for
kernel-bypass systems.
  - Portability: Demikernel is portable across heterogeneous networking
  and storage devices, and the architecture can also accommodate future
  programmable devices.
  - Programmability: The authors extend the POSIX API to optimize and
  simplify microscale kernel-bypass I/O, exposing a friendly API to app
  programmers.
  - Performance: Demikernel datapath OSes have a per-I/O budget of less
  than 1 microsecond for I/O processing & other OS services.
- Considerable engineering effort went into implementing multiple
libOSes, with Linux as well as Windows as the legacy OS for the control
path, and three libOSes, one each for RDMA, DPDK & SPDK devices. The
authors also integrate the network & storage libOSes, and developed a
POSIX libOS to test and develop Demikernel apps without kernel-bypass
hardware. This is a testament to the flexibility of the Demikernel
architecture.
Weaknesses
This paper gives a design and implementation of a general-purpose
datapath OS for microsecond-scale kernel-bypass apps. There is huge
potential for future work, but this paper does not seem to be short of
anything, and accomplishes what it set out to do.
Future Work
Each Demikernel OS feature represents a rich area for future
work:
- Investigate what semantics a microscale storage stack might supply.
Their current implementation, Cattree, maps PDPIX queue abstraction onto
an abstract log for SPDK devices. It is a minimal storage stack with few
features and works well for logging-based applications. But, we might
need to layer more complex storage systems above it.
- Investigate efficient microscale memory management with memory
allocators. As they mention, one can use more modern memory allocators
like mimalloc. Also, since the datapath OS is aware of the memory access
patterns of the application, it is possible to do I/O-aware memory
scheduling.
- Scaling the coroutine design to multiple cores is not trivial - the
Demikernel libOSes will need to be carefully designed to avoid shared
state across cores.
- Demikernel currently does not eliminate all zero-copy coordination
& we might need datapath OS features for more explicit memory
ownership. Also, one can investigate an affordable way to provide write
protection in addition to UAF for buffers in use.