High-Resolution Measurement of Data Center Microbursts

Paper: Qiao Zhang, Vincent Liu, Hongyi Zeng, and Arvind Krishnamurthy. 2017. High-resolution measurement of data center microbursts. In Proceedings of the 2017 Internet Measurement Conference (IMC ’17). Association for Computing Machinery, New York, NY, USA, 78–85. DOI:https://doi.org/10.1145/3131365.3131375

This study explores the fine-grained behavior of a large production data center using extremely high-resolution measurements (10-100 microseconds) of rack-level traffic.

Earlier work on data center traffic is either on the scale of minutes or heavily sampled. Prior studies take one of two approaches:

- Coarse-grained counter collection, on the scale of minutes or longer.
- Sampled packet traces, which capture only a small fraction of packets.

Coarse-grained measurements can inform us of long-term network behavior and communication patterns, but they fail to provide insight into many important behaviors such as congestion. This study develops a custom high-resolution counter collection framework on top of the data center’s in-house switch platform, then analyzes various counters (including packet counters and buffer utilization statistics) from Top-of-Rack (ToR) switches in multiple clusters running multiple applications.

High-resolution counter collection

Modern switches include relatively powerful general-purpose multi-core CPUs in addition to their switching ASICs. The ASICs handle packet processing and maintain many counters; the CPUs handle control-plane logic. The CPU can poll the switch’s local counters at extremely low latency and batch the samples before sending them to a scalable distributed collector service, yielding fine-grained measurements without overwhelming the collection path.
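The poll-and-batch structure can be sketched as follows. This is a minimal illustration, not the paper's implementation: `read_asic_counters`, the 25 µs interval, and the batch size of 1000 are all assumptions chosen for the example.

```python
import time

# Assumed parameters: the paper polls different counters at different
# minimum intervals (tens of microseconds); these values are illustrative.
POLL_INTERVAL_S = 25e-6
BATCH_SIZE = 1000

def read_asic_counters():
    # Stand-in for a low-latency read of ASIC counter registers over
    # the switch's local bus; returns a snapshot of raw counter values.
    return {"bytes": 0, "buffer_cells": 0}

def collect(n_samples, send_batch, batch_size=BATCH_SIZE):
    """Poll counters n_samples times, batching timestamped samples so the
    CPU issues one RPC to the collector per batch rather than per sample."""
    batch = []
    for _ in range(n_samples):
        batch.append((time.perf_counter(), read_asic_counters()))
        if len(batch) == batch_size:
            send_batch(batch)
            batch = []
    if batch:
        send_batch(batch)  # flush the partial final batch
```

Batching is what makes the design scalable: the per-sample cost stays on the switch CPU, while the collector sees amortized, bulk transfers.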

The study focuses on three sets of counters. For each, the authors manually determine the minimum sampling interval that keeps sample loss at roughly 1%.

Data set

Machines are organized into racks and connected to a ToR switch via 10 Gbps Ethernet links. Each ToR is connected to an aggregation layer of “fabric” switches via 40 or 100 Gbps links, and the fabric switches are in turn connected to “spine” switches. This study focuses on the ToR switches.

Each server machine has a single role (application):

- Web
- Cache
- Hadoop

An entire rack is dedicated to each of these roles. Because measurement is done at the ToR level, the results isolate the behavior of different classes of applications.

Port-level behavior

They studied the fine-grained behavior of individual ports. A switch egress link is hot if, for a given measurement interval, its utilization exceeds 50%; an unbroken sequence of hot samples constitutes a burst. They define bursts by throughput rather than buffer occupancy because buffers are often shared and dynamically carved, making pure byte counts a more deterministic measure of burstiness.
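The hot/burst definition above reduces to simple arithmetic over per-interval byte counts. A minimal sketch, assuming a 10 Gbps link and a 25 µs sampling interval (the exact intervals in the paper vary by counter):

```python
LINE_RATE_BPS = 10e9   # 10 Gbps ToR downlink
INTERVAL_S = 25e-6     # assumed 25 us sampling interval
HOT_THRESHOLD = 0.5    # a sample is "hot" above 50% utilization

def utilization(byte_count):
    """Fraction of line rate consumed during one sampling interval."""
    return byte_count * 8 / (LINE_RATE_BPS * INTERVAL_S)

def find_bursts(byte_counts):
    """Return (start_index, length) for each maximal run of hot samples,
    i.e. each burst under the paper's throughput-based definition."""
    bursts, start = [], None
    for i, b in enumerate(byte_counts):
        if utilization(b) > HOT_THRESHOLD:
            if start is None:
                start = i          # a new burst begins
        elif start is not None:
            bursts.append((start, i - start))  # burst ended at i-1
            start = None
    if start is not None:
        bursts.append((start, len(byte_counts) - start))
    return bursts
```

At these rates a full interval carries 31,250 bytes, so any sample above 15,625 bytes counts as hot; a burst's length is simply its number of consecutive hot intervals.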

Findings:

Cross-port behavior

They studied the synchronized behavior of switch ports. Each switch’s ports can be split into two classes: uplinks and downlinks. The uplinks connect the rack to the rest of the data center and, modulo network failures, are symmetric in both capacity and reliability. Downlinks connect to individual servers, which all serve a similar role.

ToR switches use Equal-Cost MultiPath (ECMP) routing to spread load over each of their 4 uplinks. Because ECMP hashes traffic at the flow level to avoid TCP reordering, it introduces at least two sources of potential imbalance:
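The flow-level hashing that causes this imbalance can be illustrated with a toy model. This is a sketch, not the switch's actual hash function: the MD5-based `ecmp_uplink` and the flow tuples are stand-ins, but the structure shows why per-flow placement deviates from a perfectly even split when large flows collide or flow sizes vary.

```python
import hashlib

N_UPLINKS = 4  # each ToR has 4 uplinks, per the study's topology

def ecmp_uplink(five_tuple):
    """Map every packet of a flow to the same uplink (avoiding TCP
    reordering) by hashing its 5-tuple. MD5 here is an illustrative
    stand-in for the ASIC's hash."""
    digest = hashlib.md5(repr(five_tuple).encode()).digest()
    return digest[0] % N_UPLINKS

def offered_load(flows):
    """Total bytes placed on each uplink, given (five_tuple, nbytes)
    pairs. Hash collisions of large flows and variation in flow sizes
    both skew this away from an even load/N_UPLINKS split."""
    load = [0] * N_UPLINKS
    for five_tuple, nbytes in flows:
        load[ecmp_uplink(five_tuple)] += nbytes
    return load
```

Running `offered_load` over a handful of flows of unequal size makes the imbalance visible: the totals per uplink rarely match, even though each individual flow stays on one path.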

Findings:

Strengths & Weaknesses

Strengths

Weaknesses

Implications & Follow-On

Avenues for possible future work:

Other follow-on ideas: