SIMON: A Simple and Scalable Method for Sensing, Inference and Measurement in DCNs

Yilong Geng, Shiyu Liu, Zi Yin, Ashish Naik, Balaji Prabhakar, Mendel Rosenblum, and Amin Vahdat. 2019. SIMON: a simple and scalable method for sensing, inference and measurement in data center networks. In Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation (NSDI’19). USENIX Association, USA, 549–564.

SIMON is about measuring accurate queue sizes in the network from the edge of the data center network.

Network measurement is useful for:

The main challenges are:

Most solutions seek to make effective trade-offs among these conflicting requirements.

Two classes of network measurement methods:

SIMON falls in edge-based measurements:

They observed that queue sizes sampled at 1-ms sampling rate captures almost all fluctuations of the queue. The noise (jitter) is small relative the the queue sizes and hence reconstructing this (& a per-flow decomposition of it) is much simpler & scalable. They try to first reconstruct the overall queueing delays & then break down the queueing delay into individual flows. They also try to get a link utilization breakdown to individual flows too.

Some use cases of the 1-ms average queues:

The 1 ms averaging operation is actually passing the queueing signal into a low pass filter with a cut-off frequency of 500 Hz. If we compute the power spectral density of the 1 us signal & the 1 ms signal, for the 1 us signal, the majority of the power of the signal concentrates in the low frequency (< 500 Hz) part, and the 1 ms average signal captures almost all the low-frequency part of the 1 us signal. The 1 ms signal preserves over 97.5% of the power of the 1 us signal. For 10G networks, if we increase the averaging interval, we start to quickly lose power & if the averaging interval is smaller than 1 ms, we don’t have much gain. For 1G & 40G networks, the sweet spot averaging intervals are 10 ms, and 250 us respectively. So the ideal averaging interval is inversely proportional to the link speed of the network: with higher-speed links, the queues will vary faster & we want to reconstruct the queues more frequently. This averaging interval is called the reconstruction interval.

Support server A sends a packet to server B through the network, and the two servers collect the tx & rx timestamps of the packet. Assuming clocks are synchronized, these two timestamps would tell us the one-way delay of the packet. Our goal is to breakdown the one-way delay into the individual hops in the network. Suppose q1, q2, …, q5 are the 5 queues passed by the packet in the network: RX1 - TX1 = q1 + q2 + q3 + q4 + q5 + n1, where n1 is a noise term which models the variable switching delays in the switches as well as the propagation delays on the wires.

So with this single packet, we have 1 equation for 5 variables. We can gather many packets & write down all the equations and try to solve for the queues.

The inputs to the algorithm are:

From the 5-tuples, we can know the path taken by the packet, and from the timestamps we can know the one-way delay of the packet. We have:

one way delay = sum of queueing delay over all hops + propagation delay

If we combine all packets, we get a linear system: D = AQ + N, D is a vector of 1-way delays, A is a matrix specifying which packets pass through which queues, Q is a vector of queueing delays. In factory networks, this linear system is always undetermined - we don’t have enough equations. Since congestion in network is sparse, the Q vector is sparse, and so we can use the Lasso estimator to estimate the Q.

Their first attempt was to reconstruct the queueing delays using the data packets which are the packets generated by the apps. Sometimes the reconstruction matches the ground truth pretty well, however other times, the algorithm is attributing the data into the wrong queues. This is because with only data packets there’s not enough equations (paths) compared to the variables (queues). Even though there’s a large number of data packets, many of them are passing through the same path because packets from the same flow go through the same path. So the path coverage in the network is not enough.

They introduce a probe mesh:

We can use a small number of packets to create a dense coverage of paths in the network.

Their second attempt was to reconstruct with only probe timestamps. The reconstruction matches ground truths well and the average reconstruction error is 5.2 KB, which is 3-4 MTU-sized packets.

Lasso solver is iterative, which means it takes time to solve the problem in a large network. They used a Neural Network to speed up the speed of the Lasso. Initially, we use Lasso to do the reconstruction, so as the probe timestamps come in, we feed the probe timestamps to both Lasso & the neural network, and then we can use the output of Lasso recon result to train the NN. After we have trained the NN, when new probe timestamps come in, we can then use the NN output as a recon result. NN is much faster than Lasso since it can be sped up by GPUs. NN gives us 5k-10k times the speedup without losing much accuracy.

We want to find:

Now they use data packet timestamps too. Now we know the queueing delays for queues in the network. If we know the tx timestamp of a data packet, we can simulate this data packet and know where this packet is at any point of time. If we do this for every data packet, we have a full-on reconstruction of the queueing dynamics in the network. This is costly because it is a per-packet operation. In reality, we only need to use the byte counts per flow to do this reconstruction. Reconstruction is accurate to a couple of MTU sized packets. They also found that reconstruction of the link utilizations has RMSE within 1%.

They have deployed and tested this algorithm in Google Jupiter 40G testbed with 20 racks and 3-layer network and also much larger network with 10G & 40G links, access to 12 out of 100s of racks & a 5-layer network. At Stanford, they used a testbed with 1G links, 128 servers and 2-layer network. They used NetFPGAs for verification, to get queueing delay ground truths from the network: the idea is that we use 1 FPGA as two timestampers: a packet will first go to the FPGA, take a timestamp of it, then the packet will go into the switch and come out of the switch and go into the FPGA again for a second timestamp. Using the two timestamps, we can know the ground truth queueing delay. They used 2 FPGA and got ground truths for 4 different queues in the testbed. They show that the Lasso recon matches the ground truth very well. In the other testbeds, they use a cross-validation scheme to verify that their algorithm is giving the correct results.

Strengths

Future Work