Snap: a Microkernel Approach to Host Networking

Michael Marty, Marc de Kruijf, et al. 2019. Snap: a Microkernel Approach to Host Networking. In ACM SIGOPS 27th Symposium on Operating Systems Principles (SOSP ’19), October 27–30, 2019, Huntsville, ON, Canada. ACM, New York, NY, USA, 15 pages. https://doi.org/10.1145/3341301.3359657
Host networking needs are evolving due to:
- continuous capacity growth, which demands new approaches to edge switching & bandwidth management
- the rise of cloud computing, which demands rich virtualization features
- high-performance distributed systems, which demand efficient & low-latency communication
Reasons for moving network functions from kernel-space to
user-space:
- developing kernel code is slow & only a few engineers could do it
- feature releases via kernel module reloads covered only a subset of functionality & required disconnecting apps
- the more common case required machine reboots, which meant draining the machine of running apps
- a change to the kernel-based stack took 1-2 months to deploy, whereas a new Snap release is rolled out on a weekly basis
- the broad generality of Linux made optimization difficult & defied vertical integration efforts, which were easily broken by upstream changes
This paper presents a microkernel-inspired approach to host
networking called Snap. Snap is a userspace networking system with
flexible modules that implement various networking functions,
including:
- edge packet switching
- virtualization for cloud platforms
- traffic shaping policy enforcement
- high-performance reliable messaging
- RDMA-like service
Systems design principles for Snap:
- Snap implements host networking functions as an ordinary Linux
userspace process (microkernel-inspired approach)
- Retains centralized resource allocation & management benefits of
monolithic kernel
- High rate of feature development with transparent software
upgrades
- Improves accounting & isolation by accurately attributing both CPU & memory consumed on behalf of apps to those apps, using Linux kernel interfaces to charge CPU & memory to app containers
- Leverages user-space security tools like memory access sanitizers,
fuzz testers
- Interoperable with existing kernel network functions & app
thread schedulers
- Implements custom kernel packet injection driver for packet
processing through both Snap & Linux kernel networking stack
- Implements a custom CPU scheduler without requiring apps to adopt new runtimes
- Encapsulates packet processing functions (data plane ops) into composable units called “engines” (see the sketch below)
- stateful, single-threaded tasks scheduled & run by a Snap engine scheduling runtime, using lock-free communication over memory-mapped regions shared with their inputs & outputs
- enables modular CPU scheduling
- engines are organized into groups sharing a common scheduling
discipline
- incremental & minimally-disruptive state transfer during
upgrades
- examples: packet processing for network virtualization, pacing, rate
limiting, stateful network transport like “Pony Express”
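To make the engine abstraction concrete, here is a minimal sketch in C++ (the names Engine, SpscRing & ForwardingEngine are invented for illustration; the paper does not publish Snap's actual interfaces): each engine is a stateful, single-threaded unit that a scheduling runtime polls repeatedly, and engines exchange work over lock-free single-producer/single-consumer rings standing in for the shared memory-mapped regions.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <optional>

// Stand-in for a memory-mapped region shared between an engine and its input
// or output: a lock-free single-producer/single-consumer ring.
template <typename T, std::size_t N>
class SpscRing {
 public:
  bool Push(const T& item) {
    std::size_t head = head_.load(std::memory_order_relaxed);
    std::size_t next = (head + 1) % N;
    if (next == tail_.load(std::memory_order_acquire)) return false;  // full
    buf_[head] = item;
    head_.store(next, std::memory_order_release);
    return true;
  }
  std::optional<T> Pop() {
    std::size_t tail = tail_.load(std::memory_order_relaxed);
    if (tail == head_.load(std::memory_order_acquire)) return std::nullopt;  // empty
    T item = buf_[tail];
    tail_.store((tail + 1) % N, std::memory_order_release);
    return item;
  }
 private:
  std::array<T, N> buf_{};
  std::atomic<std::size_t> head_{0};
  std::atomic<std::size_t> tail_{0};
};

struct Packet {
  std::uint16_t len = 0;
  std::uint8_t data[1500] = {};
};

// Hypothetical engine interface: a stateful, single-threaded unit of packet
// processing. A scheduling runtime calls Poll() repeatedly and uses the
// amount of work done to drive its scheduling decisions.
class Engine {
 public:
  virtual ~Engine() = default;
  virtual int Poll() = 0;  // returns number of items processed; 0 == idle
};

// Example engine that simply moves packets from an input ring to an output
// ring, stopping early if the downstream ring is full.
class ForwardingEngine : public Engine {
 public:
  ForwardingEngine(SpscRing<Packet, 1024>* in, SpscRing<Packet, 1024>* out)
      : in_(in), out_(out) {}
  int Poll() override {
    int work = 0;
    while (auto pkt = in_->Pop()) {
      // NOTE: a real engine would retain the packet on downstream backpressure
      // rather than dropping it; this sketch just stops the pass.
      if (!out_->Push(*pkt)) break;
      ++work;
    }
    return work;
  }
 private:
  SpscRing<Packet, 1024>* in_;
  SpscRing<Packet, 1024>* out_;
};
```

A runtime built on this shape could, for instance, treat a run of zero-work polls as idleness; Snap's actual scheduling signals (interrupt notification, periodic queueing-delay sampling) are covered in the scheduling modes below.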
- Provides support for OSI L4 & L5 functions through Pony Express
transport
- A Pony Express engine services incoming packets, interacts with apps, runs state machines to advance messaging & one-sided ops, and generates outgoing packets
- Interface similar to RDMA capable “smart” NIC
- Transparently leverages stateless h/w offload capabilities in
emerging NICs
- includes Intel I/OAT DMA device to offload memory copy ops
- another example: end-to-end invariant CRC32 calculation over each
packet
- Given zero-copy capability, NIC NUMA node locality & locality
within the transport layer are together more important than locality
with the app thread, & hence Pony Express runs transport processing
in a thread separate from the app, and instead shares CPU with other
engines & other transport processing
- Ability to rapidly deploy new versions of Pony Express significantly
aided development & tuning of congestion control
- Ability to easily update & change wire protocols
- Just-in-time generation of packets based on availability of NIC
transmit descriptor slots ensures no per-packet queueing in the
engine
- One-sided ops don’t involve app code on the remote destination & execute to completion within the Pony Express engine
- Supports rich operations such as “indirect read”, which determines the actual memory target to access from local app-filled indirection tables, and “scan & read” - naive RDMA benefits disappear in these cases (see the sketch below)
- Provides a mechanism to create message streams to avoid HoL blocking
- Flow control is a mix of receiver-driven buffer posting and a shared buffer pool managed using credits
- Flow control for one-sided ops relies on congestion control & CPU scheduling mechanisms rather than higher-level mechanisms
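As an illustration of how a one-sided op like “indirect read” can execute to completion inside the engine, here is a sketch with made-up types (IndirectReadRequest, IndirectionEntry, ReadReply; none of these come from the paper): the engine resolves the real memory target from an app-filled indirection table and builds the reply without ever involving an app thread on the destination.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical wire-level request for a one-sided "indirect read": the
// initiator names a slot in a remote, application-filled indirection table
// rather than a raw remote address.
struct IndirectReadRequest {
  std::uint64_t op_id;
  std::uint32_t table_index;  // slot in the remote app's indirection table
};

// Entry the remote application filled in ahead of time; shared with Snap.
struct IndirectionEntry {
  const std::uint8_t* addr;  // actual memory target to read
  std::uint32_t length;
};

struct ReadReply {
  std::uint64_t op_id;
  std::vector<std::uint8_t> payload;  // sketch only; real code would use
                                      // pre-registered buffers, not a vector
};

// Executes entirely within the (single-threaded) transport engine on the
// destination host: no application thread is scheduled to serve the read.
class IndirectReadHandler {
 public:
  explicit IndirectReadHandler(std::vector<IndirectionEntry>* table)
      : table_(table) {}

  bool Handle(const IndirectReadRequest& req, ReadReply* reply) {
    if (req.table_index >= table_->size()) return false;  // would NACK
    const IndirectionEntry& e = (*table_)[req.table_index];
    reply->op_id = req.op_id;
    reply->payload.resize(e.length);
    std::memcpy(reply->payload.data(), e.addr, e.length);
    return true;  // reply is handed to the packet-generation path
  }

 private:
  std::vector<IndirectionEntry>* table_;  // filled by the local application
};
```

With plain RDMA hardware, resolving the indirection would take an extra round trip (first read the table entry, then read the target), which is why the benefits of naive RDMA disappear for such ops.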
- Highly tuned Snap & Pony Express transport for performance to
minimize I/O overhead
- 3x better transport processing efficiency than baseline Linux
- Supporting RDMA-like functionality at 5M IOPS/core
- Transparent upgrades
- Gives the ability to release new Snap versions without disrupting running apps
- During an upgrade, the running version serializes all state to an intermediate format stored in memory shared with the new version, migrating engines one at a time, each in its entirety
- Two-phase migration technique: a brownout phase to perform preparatory background transfer & a blackout phase when the network stack is unavailable (see the sketch below)
- Targeted blackout period is 200 ms or less
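A rough sketch of the engine-at-a-time, two-phase migration might look like the following (EngineState, SharedStateRegion & the method names are assumptions; the paper does not describe the serialization format): the brownout phase snapshots state in the background while the engine keeps running, and the blackout phase quiesces the engine, serializes the final delta into memory shared with the new process, and hands over.

```cpp
#include <chrono>
#include <cstdint>
#include <string>
#include <vector>

// Invented placeholder for an engine's serialized state (flows, sequence
// numbers, congestion-control state, ...), written into memory that is
// shared between the old and new Snap processes.
struct EngineState {
  std::string name;
  std::vector<std::uint8_t> bytes;
};

struct SharedStateRegion {
  std::vector<EngineState> engines;  // stand-in for a shared-memory segment
};

class MigratableEngine {
 public:
  virtual ~MigratableEngine() = default;
  virtual std::string Name() const = 0;
  // Brownout: incremental snapshot while the engine keeps processing packets.
  virtual EngineState SnapshotWhileRunning() = 0;
  // Blackout: stop processing and serialize the final (small) delta.
  virtual EngineState Quiesce() = 0;
};

// Migrates engines one at a time, each in its entirety, so only one engine's
// traffic is blacked out at any moment (target blackout per the paper: 200ms).
void MigrateAll(std::vector<MigratableEngine*>& engines,
                SharedStateRegion* shared) {
  for (MigratableEngine* e : engines) {
    EngineState warm = e->SnapshotWhileRunning();  // brownout phase
    (void)warm;  // the new process could pre-load this in the background
    auto blackout_start = std::chrono::steady_clock::now();
    shared->engines.push_back(e->Quiesce());       // blackout phase
    // The new Snap process now deserializes this state, takes over the
    // engine's queues, and resumes packet processing; blackout ends here.
    auto blackout = std::chrono::steady_clock::now() - blackout_start;
    (void)blackout;
  }
}
```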
- Customizable emphasis between scheduling latency, performance
isolation & CPU efficiency of engines. Snap supports three broad
categories of scheduling modes for engine groups:
- Dedicating cores
- engines are pinned to dedicated hyperthreads on which no other work
can run
- doesn’t allow CPU utilization to scale in proportion to load
- minimizes latency via spin polling
- static provisioning can leave the system strained under load or overprovisioned
- Spreading engines
- scales CPU consumption in proportion to load to minimize scheduling
tail latency
- binds each engine to a unique thread that schedules when active
& blocks on interrupt notification (triggered from either NIC or
app) when idle
- not subject to scheduling delays caused by multiplexing multiple
engines onto a small number of cores
- leverages MicroQuanta, an internal real-time kernel scheduling class; engines bypass the default Linux CFS kernel scheduler & can quickly be scheduled to an available core upon interrupt delivery
- a MicroQuanta thread runs for a configurable runtime out of every period time units, with the remaining CPU time available to other CFS-scheduled tasks
- interrupts are subject to system-level interference effects, like being delivered to a core in a low-power sleep state or one in the midst of running non-preemptible kernel code
- Compacting engines
- collapses work onto as few cores as possible, combining the scaling advantages of interrupt-driven execution with the cache efficiency of dedicating cores (see the sketch below)
- but relies on periodic polling of engine queueing delays to detect load imbalance instead of instantaneous interrupt signals
- the delay from polling engines, plus the delay needed for statistical confidence in the queueing estimate, plus the delay of handing off an engine to another core, can together exceed the latency of interrupt signaling
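To contrast the modes, here is a toy rebalancing pass in the spirit of the compacting scheduler only (the thresholds & types are invented; dedicating cores would simply pin & spin-poll, and spreading would give each engine its own interrupt-woken MicroQuanta thread): engines stay packed on as few cores as possible, periodic sampling of queueing delay detects overload, and an overloaded core sheds an engine to a fresh core.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Minimal view of an engine for scheduling purposes (hypothetical API).
struct SchedEngine {
  int id = 0;
  std::uint64_t queueing_delay_us = 0;  // sampled periodically, not per-packet
};

struct Core {
  std::vector<SchedEngine*> engines;  // engines currently run on this core
};

// One periodic rebalancing pass of a compacting scheduler: keep engines on as
// few cores as possible, but when a core's sampled queueing delay exceeds a
// threshold, hand one engine off to a fresh core. Thresholds are invented.
void CompactingRebalance(std::vector<Core>& cores,
                         std::uint64_t overload_threshold_us = 50) {
  // 1. Spread: shed an engine from any overloaded core onto an idle core.
  for (Core& c : cores) {
    std::uint64_t delay = 0;
    for (SchedEngine* e : c.engines) delay += e->queueing_delay_us;
    if (delay > overload_threshold_us && c.engines.size() > 1) {
      for (Core& other : cores) {
        if (other.engines.empty()) {
          other.engines.push_back(c.engines.back());
          c.engines.pop_back();
          break;
        }
      }
    }
  }
  // 2. Compact: merge two lightly loaded cores so the freed core can go idle
  //    (better cache locality and CPU efficiency).
  for (std::size_t i = 0; i < cores.size(); ++i) {
    for (std::size_t j = i + 1; j < cores.size(); ++j) {
      std::uint64_t di = 0, dj = 0;
      for (SchedEngine* e : cores[i].engines) di += e->queueing_delay_us;
      for (SchedEngine* e : cores[j].engines) dj += e->queueing_delay_us;
      if (!cores[j].engines.empty() && di + dj < overload_threshold_us / 2) {
        cores[i].engines.insert(cores[i].engines.end(),
                                cores[j].engines.begin(),
                                cores[j].engines.end());
        cores[j].engines.clear();
      }
    }
  }
}
```

The key contrast with spreading is the signal: periodic polling of queueing delay rather than an instantaneous interrupt, which is why the detection-plus-handoff delay can exceed interrupt-signaling latency.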
The paper shows the following results:
- Performance measured between a pair of machines connected to the same ToR switch, each with an Intel Skylake processor & a 100Gbps NIC
- Baseline single-stream TCP throughput is 22Gbps using about 1.2 cores of CPU, & Snap/Pony delivers 38Gbps using about 1.05 cores
- With a 5000B MTU, Snap/Pony single-core throughput increases to over 67Gbps, & enabling I/OAT receive copy offload (with zero-copy tx) increases it to over 80Gbps
- Average RTT latency with TCP is 23 microsecs, & Snap/Pony delivers 18 microsecs. A spin-polling configuration reduces Snap/Pony's RTT latency to less than 10 microsecs, while busy-polling sockets in Linux reduce TCP latency to 18 microsecs
- Both Snap engine schedulers (“spreading engines” and “compacting engines”) succeed in scaling CPU consumption in proportion to load & show a sub-linear increase in CPU consumption due to batching efficiencies. Snap is 3x more efficient than TCP at high offered loads due to:
- copy reduction
- avoiding fine-grained synchronization
- 5000B vs 4096B MTU difference between Snap/Pony & TCP
- hotter instruction & data caches in case of the compacting
scheduler
- While the Snap compacting scheduler offers the best CPU efficiency, the spreading scheduler has the best tail latency
- The spreading scheduler relies on interrupts to wake from idle, which is prone to system-level latency contributors - in those cases compacting engines provide the best latency
- NIC generated interrupt might target a core in deep power-saving
C-state
- Production machines run complex antagonists that can affect
schedulability of even a MicroQuanta thread
- Conventional RPC stacks written on standard TCP sockets (gRPC) see
less than 100,000 IOPS/core, but Snap/Pony can provide up to 5M
IOPS/core with custom batched indirect read ops
- Transparent upgrades with median blackout of 250ms
Strengths
- This paper presents the design & architecture of Snap, a
widely-deployed, microkernel-inspired system for host networking. They
also describe a communication stack based on this microkernel approach
& transparent upgrade abilities without draining apps from running
machines. Snap has been running in production for 3 years (at the time
of the paper) supporting communication needs of several critical
systems. This shows that the design principles that they followed were
successful & battle-hardened for Google-scale needs. They had
already experimented with various in-application transport designs &
found that the drawbacks exceeded the strengths:
- They prioritize fast release schedules of the network stack which
required them to decouple it from apps & the kernel for the
transparent upgrades.
- Most of the in-app transports required a spin-polling transport
thread with a provisioned core in every app to ameliorate scheduling
unpredictabilities. But this is impractical because it is common to run
dozens of apps on a single machine.
- They design three scheduling disciplines which can be customized depending on the engine & its required emphasis between CPU efficiency & latency: dedicating cores, spreading engines &
compacting engines. The abstraction of engine is helpful for both the
purpose of scheduling (through engine groups) as well as for supporting
transparent upgrades. This is a strength of the Snap design.
Weaknesses
- Much of the Snap scheduling mechanism relies on their
internally-developed kernel scheduling class called MicroQuanta which
provides a flexible way to share cores between latency-sensitive Snap
engine tasks & other tasks, and also an internally-developed driver
for efficiently moving packets between Snap and the kernel. Without any access to such internal tooling, one can only speculate about their functionality & take the authors' word for it. This makes the knowledge accessible to the community only at a high level, without a deeper understanding.
- The paper doesn’t go into the details of how memory management is implemented with Snap, which is a significant challenge for non-kernel networking.
Future Work
- Memory Mapping & Management: Pony Express registers some app-shared memory with the NIC for zero-copy transmit, but does so selectively, and some apps copy from app heap memory into bounce-buffer memory shared & registered with Snap, and vice versa. This means extra copies which could be avoided. Designing a custom system call to translate & temporarily pin memory, which would be cheaper than a memory copy, could improve performance. In general, memory mapping & management with non-kernel networking is a significant challenge.
- Dynamic CPU Scaling: Kernel TCP networking usage is very bursty in practice - sometimes consuming upwards of a dozen cores over short bursts. Perhaps there is a way to combine the CPU efficiency benefits of compacting engines with the better tail latency of spreading engines at high Snap loads to achieve the best of both worlds. Scheduler design for host networking is a challenge.
- Rebalancing Across Engines: One can consider fine-grained rebalancing of flows across engines using mechanisms such as work stealing (see the sketch below).
- Stateful Offloads: They currently leverage stateless offloads, and do not discuss stateful h/w offloads in emerging NICs.
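Purely as an illustration of the work-stealing idea mentioned under “Rebalancing Across Engines” (nothing like this exists in Snap as described): each engine could own a deque of runnable flows, with idle peers stealing from the tail.

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <vector>

// Hypothetical unit of fine-grained rebalancing: a flow's pending work.
struct Flow { int id = 0; };

// Each engine owns a deque of runnable flows. The owner pops from the front;
// idle peers steal from the back, keeping contention on opposite ends.
// (A mutex is used for simplicity; a real design would likely be lock-free.)
class FlowQueue {
 public:
  void Push(Flow f) {
    std::lock_guard<std::mutex> l(mu_);
    q_.push_back(f);
  }
  std::optional<Flow> PopFront() {
    std::lock_guard<std::mutex> l(mu_);
    if (q_.empty()) return std::nullopt;
    Flow f = q_.front();
    q_.pop_front();
    return f;
  }
  std::optional<Flow> StealBack() {
    std::lock_guard<std::mutex> l(mu_);
    if (q_.size() < 2) return std::nullopt;  // leave the owner some work
    Flow f = q_.back();
    q_.pop_back();
    return f;
  }
 private:
  std::mutex mu_;
  std::deque<Flow> q_;
};

// An idle engine scans its peers and steals one flow, if any peer has spare
// work. A real design would also need to migrate per-flow transport state.
std::optional<Flow> TrySteal(std::vector<FlowQueue*>& peers) {
  for (FlowQueue* p : peers) {
    if (auto f = p->StealBack()) return f;
  }
  return std::nullopt;
}
```

A real design would also have to migrate per-flow transport state & preserve packet ordering across the handoff, which is where the hard part of this future work lies.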