Network Requirements for Resource Disaggregation

Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han,
Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. 2016. Network
requirements for resource disaggregation. In Proceedings of the 12th
USENIX Conference on Operating Systems Design and Implementation
(OSDI '16). USENIX Association, USA, 249–264.
Recent industry trends suggest a paradigm shift toward a disaggregated
datacenter (DDC) architecture, in which each resource type is built as a
standalone resource blade and the blades are interconnected by a network
fabric. This work derives the minimum latency and bandwidth requirements
that the network in a DDC must provide to avoid degrading app-level
performance, and it explores the feasibility of meeting these requirements
with existing system designs and commodity networks, in the context of
10 workloads spanning 7 open-source systems:
- Hadoop
- Spark
- GraphLab
- Timely dataflow
- Spark Streaming
- memcached
- HERD
- SparkSQL
The workloads can be classified into two classes based on their
performance characteristics.
- Class A: Hadoop Wordcount, Hadoop Sort, GraphLab, & memcached
- Class B: Spark Wordcount, Spark Sort, Timely, SparkSQL BDB, HERD
The work makes the following assumptions:
- Hardware Architecture:
- Partial CPU-memory disaggregation: Each CPU blade retains some
amount of local memory that acts as a cache for the remote memory
dedicated to the cores on that blade.
- Cache coherence domain is limited to a single compute blade: An
external network fabric is unlikely to support the latency &
bandwidth requirements for inter-CPU cache coherence.
- Resource Virtualization: Each resource blade must support
virtualization of its resources for logical aggregations into
higher-level abstractions such as VMs or containers.
- Scope of disaggregation: Either rack or datacenter-scale. This is a
“design knob”. In a hypothetical DC-scale disaggregated system,
resources assigned to a single VM could be selected from anywhere in the
DC.
- Network designs: This depends on the assumed scope of
disaggregation. Existing DDC prototypes use specialized, proprietary
network designs.
- System Architecture:
- System abstractions for logical resource aggregations: VMs. Resource
allocation & scheduling can be implemented in terms of VMs. This depends
on the scope of disaggregation: for rack-scale, a VM is assigned resources
from within a single rack; for DC-scale, a VM is assigned resources
from anywhere in the DC.
- Hardware organization: “Mixed” organization in which each rack hosts
a mix of different types of resource blades, as opposed to a
“segregated” organization in which a rack is populated with a single
resource type. This leads to more uniform communication patterns &
permits optimizations to localize communication.
- Page-level remote memory access: CPU blades access remote memory at
page granularity (4KB in x86), rather than the cache-line granularity
(64B in x86) typical of memory access between CPU and DRAM. This exploits
spatial locality in memory access patterns (see the back-of-envelope
sketch after this list).
- Block-level distributed data placement: Apps in a DDC read/write
large files at sector granularity (512B). The disk block address
space is range-partitioned into blocks that are uniformly distributed
across disk blades.
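A back-of-envelope sketch (my own, not from the paper) of why page-granularity remote access is reasonable: the fixed per-access network latency dwarfs the serialization time of either a 64B cache line or a 4KB page, so fetching a whole page amortizes that latency across spatially local accesses. The link speeds and the 3 µs latency figure come from the paper's stated requirements.

```python
# Back-of-envelope: serialization time vs. fixed per-access latency for
# remote memory reads at cache-line (64 B) and page (4 KB) granularity.
LINKS_BPS = {"40 Gbps": 40e9, "100 Gbps": 100e9}
LATENCY_US = 3.0  # per-access end-to-end latency (the paper's tighter, Class B target)

for link_name, bps in LINKS_BPS.items():
    for label, size_bytes in [("64B cache line", 64), ("4KB page", 4096)]:
        serialize_us = size_bytes * 8 / bps * 1e6   # transmission time in microseconds
        total_us = LATENCY_US + serialize_us
        print(f"{link_name}, {label}: serialize {serialize_us:.2f} us, total {total_us:.2f} us")
```

The fixed latency dominates either transfer (a 4KB page serializes in about 0.8 µs at 40 Gbps), so fetching a whole page costs little more than fetching a cache line while letting many subsequent accesses hit the blade-local memory.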
The design knobs in the study are:
- the amount of local memory on compute blades
- the scope of disaggregation (either rack-scale or DC-scale)
For I/O traffic to storage devices, the current latency & b/w
requirements allow consolidating this traffic onto the network fabric with
low performance impact, assuming a 40 Gbps or 100 Gbps network.
The dominant impact on app performance comes from CPU-memory
disaggregation.
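A rough sanity check of why storage traffic is the easy case (the device access times below are common ballpark figures I am assuming, not numbers from the paper): the extra network delay added by disaggregation is small relative to the storage device's own access time.

```python
# Network overhead added to one disaggregated storage access, compared with
# assumed ballpark device access times (not taken from the paper).
NET_LATENCY_US = 5.0       # end-to-end fabric latency, in the paper's target range
LINK_BPS = 40e9            # 40 Gbps link
SECTOR_BYTES = 512

net_overhead_us = NET_LATENCY_US + SECTOR_BYTES * 8 / LINK_BPS * 1e6

device_access_us = {"HDD (~10 ms seek + rotate)": 10_000.0, "SSD (~100 us)": 100.0}
for device, access_us in device_access_us.items():
    share = 100 * net_overhead_us / access_us
    print(f"{device}: network adds {net_overhead_us:.2f} us ({share:.2f}% of device access time)")
```

Under these assumptions the fabric adds only a few percent even for a fast SSD, which is consistent with the paper's conclusion that storage disaggregation is not the bottleneck.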
The work compares the performance of apps in a server-centric
architecture to their performance in the disaggregated context; this
represents the worst case for potential degradation, since the apps are
not rewritten for disaggregation. To emulate remote memory accesses, they
implement a special swap device backed by some amount of physical memory
rather than disk, intercept all page faults, and inject artificial delays
to emulate network round-trip latency and bandwidth for each paging
operation. This does not model queueing delays.
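A minimal sketch of the injected delay, as I reconstruct it from the description above (this is not the authors' code): each 4KB paging operation is charged a fixed round-trip latency plus the page's transmission time at the emulated link bandwidth, with no queueing term.

```python
def emulated_paging_delay_us(latency_us: float, bandwidth_gbps: float,
                             page_bytes: int = 4096) -> float:
    """Artificial delay injected per paging operation in the emulation:
    a fixed end-to-end round-trip latency plus the page's transmission
    time. Queueing delay is deliberately not modeled."""
    transmit_us = page_bytes * 8 / (bandwidth_gbps * 1e9) * 1e6
    return latency_us + transmit_us

# Example at the paper's Class A target point (5 us latency, 40 Gbps links).
print(f"{emulated_paging_delay_us(5.0, 40):.2f} us per 4KB page")  # ~5.82 us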
The work shows the following:
For Class A apps, a network with end-to-end latency of 5 µs
& b/w of 40 Gbps is sufficient to keep the performance penalty within 5%.
Class B apps require network latencies of 3 µs & b/w of 40-100 Gbps
to keep the performance penalty within 10%. Performance for Class B apps
is very sensitive to network latency.
- They show that the estimated end-to-end latencies fail to meet the
target latencies for either Class A or Class B apps. The overhead of
network software is the key barrier to realizing disaggregation with
current networking technologies. They consider two potential
optimizations to reduce latencies (and how to integrate them):
- Using RDMA: RDMA bypasses packet processing in the kernel,
eliminating the OS overheads.
- Using NIC integration: Integrating NIC functions closer to the
CPU reduces the overheads associated with copying data to/from the
NIC.
- Assuming RDMA + NIC integration, Class A jobs can be scheduled
on blades distributed across the DC, while Class B jobs will need to be
scheduled within a rack.
- New links & switches would bring cross-rack latency down enough
to enable true DC-scale disaggregation (along with RDMA + NIC
integration).
- Managing network congestion to achieve zero or close-to-zero queueing
is essential.
Increasing b/w beyond 40 Gbps offers little improvement in
app-level performance. This requirement is easily met: 40 Gbps links are
available in commodity DC switches & server NICs.
Increasing the amount of local memory beyond 40% of the app's memory
footprint has little to no impact on performance degradation. This
requirement is feasible: it is compatible with demonstrated h/w prototypes.
Overall memory b/w is well correlated with the resulting performance
degradation. Thus, an app's memory b/w requirement serves as a rough
indicator of its expected degradation under disaggregation, without
requiring any instrumentation.
The work then evaluates the impact of queueing delay. They collect a
remote memory access trace from their emulation instrumentation, a network
access trace using tcpdump, and a disk access trace using blktrace. They
then translate these traces into network flows in their simulated
disaggregated cluster. They consider five protocols and use simulation
to evaluate their network-layer performance, measured as flow completion
time (FCT), under the generated traffic workloads (a sketch of the
slowdown metric follows the protocol list):
- TCP
- DCTCP, which leverages ECN (Explicit Congestion Notification)
- pFabric, prioritizes short flows
- pHost, emulates pFabric but using scheduling at the end hosts
- Fastpass, a centralized scheduler that schedules every packet
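For context, here is a small sketch of the slowdown metric typically reported in such FCT comparisons: a flow's measured completion time divided by its ideal completion time on an unloaded network. The flow size, RTT, and measured FCT below are hypothetical values for illustration, not the paper's traces.

```python
def ideal_fct_us(flow_bytes: int, link_gbps: float, prop_rtt_us: float) -> float:
    """Best-case FCT on an idle network: propagation delay plus
    serialization of the flow at line rate."""
    return prop_rtt_us + flow_bytes * 8 / (link_gbps * 1e9) * 1e6

def slowdown(measured_fct_us: float, flow_bytes: int,
             link_gbps: float = 40.0, prop_rtt_us: float = 2.0) -> float:
    """Ratio of measured FCT to the ideal FCT (>= 1.0)."""
    return measured_fct_us / ideal_fct_us(flow_bytes, link_gbps, prop_rtt_us)

# Hypothetical example: a 4KB remote-memory flow that took 12 us to complete.
print(f"slowdown = {slowdown(12.0, 4096):.1f}x")  # ~4.3x
```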
They find the following:
- The relative ordering of the protocols by mean slowdown is consistent
with prior results, but the absolute values here are higher, due to
differences between this traffic workload and those used in the original
measurements.
- pFabric performs best, and they use pFabric FCTs as the memory
access times (the artificial delays) in the emulation.
- Including queueing delay does have a non-trivial impact on
performance degradation. 100 Gbps links are both required and sufficient
to contain the performance impact of queueing delay.
The authors also built a kernel-space RDMA block device driver that
serves as a swap device, so the local CPU can swap to remote memory
instead of disk (a small sketch of the request-merging idea follows this
list).
- They batch block requests sent to the RDMA NIC.
- They merge requests with contiguous addresses into a single large
request.
- They allow asynchronous RDMA requests: a data structure keeps track
of outstanding requests & notifies the upper layer immediately as each
request completes.
- The software overhead of swapping is 2.46 microseconds on a
commodity Linux desktop. This suggests that low-latency access to remote
memory can be realized.
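The authors' driver is kernel C; the following user-space Python sketch only illustrates the request-merging idea (block size, data structure, and function names are my own), coalescing block requests with contiguous addresses so a batch of small requests becomes fewer, larger RDMA work requests.

```python
from dataclasses import dataclass
from typing import List

BLOCK_SIZE = 4096  # one swap page per block request (assumed for illustration)

@dataclass
class BlockRequest:
    offset: int  # byte offset within the remote memory region
    length: int  # request length in bytes

def merge_contiguous(requests: List[BlockRequest]) -> List[BlockRequest]:
    """Coalesce requests whose address ranges are contiguous into fewer,
    larger requests before handing them to the RDMA NIC."""
    merged: List[BlockRequest] = []
    for req in sorted(requests, key=lambda r: r.offset):
        if merged and merged[-1].offset + merged[-1].length == req.offset:
            merged[-1].length += req.length  # extend the previous request
        else:
            merged.append(BlockRequest(req.offset, req.length))
    return merged

# Three contiguous pages plus one distant page -> two RDMA requests.
batch = [BlockRequest(i * BLOCK_SIZE, BLOCK_SIZE) for i in (0, 1, 2, 10)]
print([(r.offset, r.length) for r in merge_contiguous(batch)])
# [(0, 12288), (40960, 4096)]
```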
Strengths
- This paper is an early work on network requirements for
disaggregation and the earliest effort to evaluate the minimum
bandwidth & latency requirements that the network in disaggregated
datacenters must provide to keep app-level performance degradation
within 5%. The authors also discuss the feasibility of meeting these
requirements with existing network designs & technologies.
- They use a clever combination of emulation, simulation &
implementation for this study. This is challenging because there is no
real disaggregated environment to measure, and it is even harder in an
academic setting. They use an extensive suite of 10 workloads (left
unmodified from their server-centric implementations) spanning 7
open-source systems.
Weaknesses
- Approaches such as Fastpass, which introduce a centralized scheduler
that schedules every packet, are impractical in datacenter settings. The
authors also optimistically assume that the scheduler's decision logic
incurs no overhead. These assumptions will likely not hold in a practical
setting.
- Their evaluation results might be affected by testbed-induced
interference. Even though they run experiments on an Amazon EC2 cluster
and place it in a Virtual Private Cloud to limit interference from other
instances, some testbed-induced bias may remain in the results. That
said, this is about the best one can do for such a study.
Future Work
There are numerous directions for future work:
- Generalization of the results in the paper to other systems and
workloads.
- Design and implementation of a “disaggregation-aware” scheduler.
- In particular, based on the class of application (Class A or Class
B), the scheduler should allocate resources at the appropriate scope of
disaggregation (rack-scale or DC-scale) for maximum app-level
performance (see the placement sketch at the end of this section).
- Adoption of low-latency network software & hardware at endpoints and
in the fabric.
- For example, high-bandwidth links, high-radix switches, and optical
switches.
- Creation of new programming models that exploit disaggregation.
- One direction is for apps to exploit the availability of low-latency
access to large pools of remote memory.
- Another direction is improving performance through better
resource utilization. The CPU-to-memory utilization ratio for tasks in
today's DCs varies by three orders of magnitude; since one can "bin pack"
at a larger scale, DDCs should achieve more efficient statistical
multiplexing.
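As a rough illustration of the disaggregation-aware scheduler direction above (entirely hypothetical; only the Class A / Class B split and the rack-vs-DC scopes come from the paper's findings), a policy could place latency-sensitive Class B jobs within a single rack while allowing Class A jobs to draw blades from anywhere in the datacenter.

```python
from enum import Enum

class AppClass(Enum):
    A = "class_a"  # tolerant of ~5 us remote-memory latency; can be placed DC-wide
    B = "class_b"  # latency-sensitive; keep its resource blades within one rack

def placement_scope(app_class: AppClass) -> str:
    """Hypothetical policy: map an application's class to the scope of
    disaggregation its resources should be drawn from."""
    return "datacenter-scale" if app_class is AppClass.A else "rack-scale"

# Example job classifications following the paper's workload classes.
for job, cls in [("hadoop-sort", AppClass.A), ("spark-sql-bdb", AppClass.B)]:
    print(f"{job}: allocate blades at {placement_scope(cls)}")
```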