High-Resolution
Measurement of Data Center Microbursts
Paper: Qiao
Zhang, Vincent Liu, Hongyi Zeng, and Arvind Krishnamurthy. 2017.
High-resolution measurement of data center microbursts. In Proceedings
of the 2017 Internet Measurement Conference (IMC ’17). Association for
Computing Machinery, New York, NY, USA, 78–85.
DOI:https://doi.org/10.1145/3131365.3131375
This study explores fine-grained behavior of a large production data
center using extremely high-resolution measurements (10-100
microseconds) of rack-level traffic.
Earlier work on data center traffic are either on the scale of
minutes or are heavily sampled. The two approaches taken in earlier
works are:
- Packet sampling (e.g. Facebook samples every 1 in 30,000)
- Coarse-grained counters (e.g. SNMP collection in minute-scale)
Coarse-grained measurements can inform us of long-term network
behavior and communication patterns, but fail to provide insight into
many important behaviors such as congestion. This study developed a
custom high-resolution counter collection framework on top of the data
center’s in-house switch platform, and then analyzes various counters
(including packet counters and buffer utilization statistics) from
Top-of-Rack (ToR) switches in multiple clusters running multiple
applications.
High-resolution counter
collection
Modern switches include relatively powerful general-purpose
multi-core CPUs in addition to their switching ASICs. The ASICs are
responsible for packet processing, and need to maintain many counters.
The CPUs handle control plane logic. The CPU can poll the switch’s local
counters at extremely low latency, and batch the samples before sending
them to a distributed collector service that is fine-grained and
scalable.
The study focusses on 3 sets of counters. For each, they manually
determine the minimum sampling interval possible while maintaining ~1%
sampling loss.
- Byte count: Measures the cumulative number of bytes sent/received
per switch port, to calculate the throughput. Sampling interval: 25
microseconds.
- Packet size: Histogram of the packet sizes sent/received at each
switch port. The ASIC bins packets into several buckets. Sampling
interval: 25 microseconds.
- Peak buffer utilization: Take the peak utilization since the last
measurement, and reset the counter after reading. Sampling interval: 50
microseconds.
Data set
Machines are organized into racks and connected to a ToR switch via
10 Gbps Ethernet links. Each ToR is connected to an aggregation layer of
“fabric” switches via 40 or 100 Gbps links. The “fabric” switches are
connected to a “spine” switches. This study focusses on the ToR
switches.
Each server machine have a single role (application):
- Web: Receive web requests and assemble a dynamic web page using data
from many remote sources.
- Cache: Serve as an in-memory cache of data used by the web servers.
Some are leaders handling cache coherency, and some are followers,
serving read requests.
- Hadoop: For offline analysis and data mining.
An entire rack is dedicated to each of these roles. Measuring at a
ToR level, the results can isolate behavior of different classes of
applications.
Port-level behavior
They studied the fine-grained behavior of individual ports. A
switch’s egress link is hot if, for the measurement period, its
utilization exceeds 50%. An unbroken sequence of hot samples indicates a
burst. They choose to define a burst by throughput rather than the
buffer utilization as buffers are often shared and dynamically carved,
making pure byte counts a more deterministic measure of burstiness.
Findings:
- High utilization is short lived. The 90th percentile burst
duration is < 200 microseconds. Congestion events observed by less
granular measurements are likely collection of smaller microbursts.
- Bursts are correlated. The high-utilization intervals tend
to be correlated.
- Fine-grained measurements are needed to capture bursty behavior
accurately. The 25 microsecond measurement granularity is itself
too coarse as over 60% of Web and Cache bursts terminated within that
period.
- Inter-burst periods have a much longer tail than burst
durations. Most interburst periods are small, particularly for
Cache and Web racks where 40% of inter-burst periods last less than 100
microseconds, but when idle periods are persistent, the inter-burst
periods last for 100s of milliseconds.
- Bursty periods tend to include more large packets than
non-bursty periods. Hadoop sees mostly full-MTU packets. The
material packet-level difference between packets inside and outside
bursts suggests that bursts at the ToR layer are often a result of
application-behavior changes, rather than random collisions.
- Different applications have different utilization patterns, but
all are extremely long tailed. This suggests that when bursts
occur, they are generally intense.
Cross-port behavior
They studied the synchronized behavior of switch ports. Each switch’s
port can be split into 2 classes: uplinks and downlinks. The uplinks
connect the rack to the rest of the data center, and modulo network
failures, they are symmetric in both capacity and reliability. Downlinks
connect to individual servers, which all serve similar role.
ToR switches use Equal-Cost MultiPath (ECMP) to spread load over each
of their 4 uplinks. ECMP configurations introduce at least 2 sources of
potential imbalance in order to avoid TCP reordering:
- ECMP operates on the level of flows, rather than packets
- ECMP uses consistent hashing, which cannot guarantee optimal
balance
Findings:
- Uplinks are unbalanced at small timescales. The
instantaneous efficacy of load balancing impact drop- and
latency-sensitive protocols like RDMA and TIMELY. The imbalance (Mean
Absolute Deviation, MAD, of the four uplink utilizations) is large (p50
over 25%) at small timescales (40 microseconds). This indicates
flow-level load balancing can be inefficient in the short term.
- The interconnect does not add significant variance. The
ingress and egress traffic for the ToR switch exhibits a similar pattern
(MAD of egress vs ingress traffic).
- Downlink utilization balance depends on application. Web
servers run stateless services that are entirely driven by user
requests, so correlation is zero (balanced). Subsets of the Cache
servers show very strong correlation with one another (imbalanced). This
is because their requests are initiated in groups from web servers, and
hence those subsets are involved in the same scatter-gather
requests.
- Direction of bursts depends on application. For Web and
Hadoop racks, hot downlinks are more frequent due to high fan-in where
many servers send to a single destination. Cache servers have more
frequent hot uplinks than downlinks because of two properties: (1) they
exhibit simple response-to-request communication pattern, and (2) cache
responses are much larger than the requests.
- Peak buffer occupancy depends on application. Hadoop puts
significantly more stress on ToR buffers than either Web or Cache racks,
and buffer occupancy scales with the number of hot ports. Buffer
occupancy levels off for high numbers of hot ports.
Strengths & Weaknesses
Strengths
- This study has implications for network measurement, design and
evaluation of new network protocols and architectures. As network
bandwidth continues to rise in data centers, the timescale of network
events will decrease accordingly. It is essential to understand
high-resolution behavior of these networks. At these small timescales,
traffic is extremely bursty, load is unbalanced, and different
applications have different behavior.
- The study required non-trivial engineering work to implement a
low-latency sampling framework built on top of an in-house switch
platform. It required manually determining the sampling intervals with
minimal sampling loss. Along with the engineering, the study also
rigorously establishes correlation between consecutive bursts using the
likelihood ratio test, and theoretically rejects the hypothesis that
burst arrivals are Poisson.
Weaknesses
- The paper presents the numbers and findings for fine-grained
port-level & synchronized network traffic behavior extensively, but
does not put much effort in providing reasoning for the findings
extensively. The results presented in the paper gives insight into the
degree of the problem in practice, but does not attempt to solve any of
the issues presented.
- No comparisons with long-term traffic behavior were made. It would
have been more interesting to see the comparison with similar studies
made with coarse-grained measurements, on the same deployment.
Implications & Follow-On
Avenues for possible future work:
- load balancing. Load balancing on microflows rather than
5-tuples - splitting a flow as soon as the inter-packet gap is long
enough to guarantee no reordering. This is viable as inter-burst periods
typically exceed end-to-end latencies. However, faster networks may
decrease this gap.
- congestion control. Traditional CC algorithms either react
to packet drops, RTT variation, or ECN as congestion signal. All of
these signals require at least RTT/2 to arrive at sender, and protocol
might take multiple RTTs to adapt. But large number of microbursts are
shorter than single RTT. Buffering is viable to handle momentary
congestion but lower-latency congestion signals may be required.
- pacing. TCP pacing was one of the original mechanisms that
prevented bursty traffic, but has been rendered ineffective through new
features like segmentation offload, interrupt coalescing. Recent pacing
proposals may be worth considering either at the hardware or software
level.
Other follow-on ideas:
- Domain knowledge suggests than application-level demand and traffic
patterns are a significant contributor to bursts. But to study this as a
cause of burstiness would require correlating and synchronizing switch
and end hosts measurements at a microsecond level.
- The study focusses on ToR switches, and leaves the study of other
network tiers (“fabric” and “spine” switches) to future work. It would
be interesting to see the network traffic behavior at these
aggregate-level switches.