The goals in designing a datacenter network are low latency and high throughput. These goals interact badly with each other. How close can we get to achieving them all simultaneously?
Data sender: For minimal latency, start sending at line rate (blast without even a connection-setup handshake). No prior handshake: zero RTT setup. What could go wrong?
Modern datacenters have some variant of a Clos topology. A Clos topology provides enough bandwidth (capacity) for everyone to send at line rate, so the problem is not lack of capacity. The problem comes from how the flows are routed. With per-flow ECMP, flow collisions cause lots of loss, and it's hard to use all the capacity.
Solution: If every data sender starts at line rate and sprays packets equally across all paths to the destination, there are no flow collisions and the full capacity of the network core can be used. The downside is that spraying causes lots of reordering, which is a burden for the transport protocol.
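Here is a toy sketch (my own illustration, not from the paper) of why per-packet spraying balances the core better than per-flow ECMP; the path and flow counts are made up:

```python
import random
from collections import Counter

NUM_PATHS = 8        # equal-cost core paths between a pair of ToR switches (assumed)
NUM_FLOWS = 8        # long-lived flows, each with many packets (assumed)
PKTS_PER_FLOW = 1000

# Per-flow ECMP: every packet of a flow follows the path chosen by hashing the flow ID.
ecmp_load = Counter()
for flow in range(NUM_FLOWS):
    path = hash(("flow", flow)) % NUM_PATHS
    ecmp_load[path] += PKTS_PER_FLOW

# Per-packet spraying: each packet independently picks a core path.
spray_load = Counter()
for flow in range(NUM_FLOWS):
    for _ in range(PKTS_PER_FLOW):
        spray_load[random.randrange(NUM_PATHS)] += 1

print("ECMP  per-path load:", sorted(ecmp_load.values(), reverse=True))
print("Spray per-path load:", sorted(spray_load.values(), reverse=True))
# ECMP typically leaves some paths idle while others carry several flows (collisions);
# spraying keeps every path near the average, at the cost of packet reordering.
```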
The other problem is that if many flows send to the same destination, we get incast near the receiver.
At 10Gbps it takes 7.2us to serialize a 9KB packet. If there is no queueing, then latency is dominated by serialization. A small control packet (like an ACK) can traverse the whole network in < 7.2us.
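A quick back-of-the-envelope check of those numbers (the 64-byte control-packet size below is my assumption):

```python
LINK_RATE_BPS = 10e9        # 10 Gbps link
PACKET_BYTES  = 9000        # 9KB jumbo frame

serialization_s = PACKET_BYTES * 8 / LINK_RATE_BPS
print(f"{serialization_s * 1e6:.1f} us per 9KB packet")        # ~7.2 us

# A small control packet (assumed ~64 bytes) serializes in ~51 ns, so even across
# several hops it crosses the fabric in well under 7.2 us when queues stay near-empty.
control_s = 64 * 8 / LINK_RATE_BPS
print(f"{control_s * 1e9:.0f} ns per 64B control packet")
```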
Question: What should a switch do when the outgoing link is overloaded (like with incast)?
Packet Trimming is a middle ground: when a queue fills, trim off the payload and forward just the header. There is no metadata loss; the receiver knows exactly what was sent, even with reordering. (Lossless metadata, but data loss.) To keep latency low, a data queue of only 8 packets is enough. When the data queue overflows, the payload is trimmed (dropped) from arriving packets and the header is forwarded in a priority queue (priority forwarded). This lets the receiver find out as soon as possible that the data packet didn't make it. ACKs and other control packets are also priority forwarded, which enables fast retransmissions.
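A minimal sketch of that switch-port behavior (my own illustration, not the paper's switch code; packet fields are made up):

```python
from collections import deque

DATA_QUEUE_LIMIT = 8   # packets, per the text

class TrimmingPort:
    def __init__(self):
        self.data_q = deque()      # full data packets, low priority
        self.control_q = deque()   # trimmed headers, ACKs, NACKs, pulls: high priority

    def enqueue(self, pkt):
        if pkt["type"] != "data":
            # Control packets are always priority forwarded.
            self.control_q.append(pkt)
        elif len(self.data_q) < DATA_QUEUE_LIMIT:
            self.data_q.append(pkt)
        else:
            # Queue overflow: trim the payload, keep only the header, and
            # priority-forward it so the receiver learns of the loss quickly.
            header = {k: v for k, v in pkt.items() if k != "payload"}
            header["trimmed"] = True
            self.control_q.append(header)

    def dequeue(self):
        # The control queue is served ahead of the data queue.
        if self.control_q:
            return self.control_q.popleft()
        if self.data_q:
            return self.data_q.popleft()
        return None
```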
Let's send an incast to the destination. A queue builds at the ToR switch, a packet gets trimmed, and the receiver's retransmission request is priority forwarded. The sender retransmits, and the retransmission arrives before the queue has completely drained.
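A rough timeline shows why the retransmission beats the queue drain; the hop count and control-packet serialization time below are my assumptions, not measurements from the paper:

```python
SERIALIZE_DATA_US    = 7.2    # 9KB packet at 10 Gbps, from above
SERIALIZE_CONTROL_US = 0.1    # trimmed header / retransmit request, rounded up (assumed)
HOPS_EACH_WAY        = 3      # sender <-> receiver path length in a small Clos (assumed)

queue_drain_us = 8 * SERIALIZE_DATA_US                        # ~58 us to empty a full data queue
control_rtt_us = 2 * HOPS_EACH_WAY * SERIALIZE_CONTROL_US     # header out + request back
retransmit_us  = control_rtt_us + HOPS_EACH_WAY * SERIALIZE_DATA_US  # resent data in flight

print(f"queue drain ~{queue_drain_us:.0f} us, retransmission arrives ~{retransmit_us:.0f} us")
# With headers and requests priority forwarded, the resent packet shows up long
# before the ToR queue has emptied, so the receiver's link never goes idle.
```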
So, we just start, spray and trim.
This gives us low queuing latency, full use of the core capacity, and fast detection and retransmission of trimmed packets. But it also means that during a sustained incast the senders keep blasting at line rate, the receiver's link stays overloaded, and trimming keeps happening; nothing yet controls the rate at which traffic arrives at the receiver.
We can solve this by decoupling the ack clock: separate the acking from the clocking.
When an incast starts at an NDP receiver, for every incoming packet or header we add a pull packet to the pull queue. When the receiver's link is overloaded, headers start arriving and a pull queue builds. Pull packets are sent at the rate we want the incoming packets to arrive, so the receiver gets control of its incoming traffic. After the first RTT, packets arrive at line rate. If it wants, the receiver can also prioritize traffic from one sender over another.
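A minimal sketch of the receiver's pull loop under these assumptions (class and field names are mine, not the paper's):

```python
import time
from collections import deque

LINK_RATE_BPS = 10e9
MTU_BYTES     = 9000
PULL_INTERVAL = MTU_BYTES * 8 / LINK_RATE_BPS   # one pull per packet-time: ~7.2 us

class NdpReceiver:
    def __init__(self):
        self.pull_q = deque()    # one pull entry per data packet or trimmed header received

    def on_packet(self, pkt):
        if pkt.get("trimmed"):
            self.ack(pkt, nack=True)    # tell the sender to retransmit this packet
        else:
            self.ack(pkt, nack=False)
        # Either way, queue a pull so the sender may send one more packet later.
        self.pull_q.append(pkt["flow"])

    def pull_loop(self):
        # Drain the pull queue at the rate we want data to arrive on our link:
        # one pull per packet serialization time keeps arrivals at line rate
        # without rebuilding a queue at the ToR.
        while self.pull_q:
            flow = self.pull_q.popleft()
            self.send_pull(flow)
            time.sleep(PULL_INTERVAL)

    def ack(self, pkt, nack):      # placeholder for the real control-packet send path
        pass
    def send_pull(self, flow):     # placeholder for the real pull-packet send path
        pass
```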
The sender sends at line rate for the first RTT, and then it stops. After that, the receiver controls transmission by sending pulls: pull packets from the receiver clock out data, triggering retransmissions or the sending of new data.
The senders send at line rate for the first RTT. The small queue fills. Packets are trimmed. After the first RTT, packets are pulled and no more trimming occurs.
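And a matching sketch of the sender side (again illustrative, with made-up names):

```python
from collections import deque

class NdpSender:
    def __init__(self, packets, first_rtt_window):
        self.to_send   = deque(packets)   # new data, in order
        self.to_resend = deque()          # packets NACKed after trimming
        # First RTT: no handshake, no prior credit -- just send at line rate.
        for _ in range(min(first_rtt_window, len(self.to_send))):
            self.transmit(self.to_send.popleft())

    def on_nack(self, pkt):
        # A trimmed header reached the receiver; queue the packet for retransmission.
        self.to_resend.append(pkt)

    def on_pull(self):
        # After the first RTT the receiver's pulls clock transmissions:
        # retransmissions take precedence over new data.
        if self.to_resend:
            self.transmit(self.to_resend.popleft())
        elif self.to_send:
            self.transmit(self.to_send.popleft())

    def transmit(self, pkt):              # placeholder for the real send path
        pass
```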
So NDP is start, spray, trim, pull. This gives us low queuing latency, full use of the Clos core, fast loss recovery, and receiver control over incast traffic.
The authors implemented NDP in htsim (a packet-level simulator), in Linux hosts with DPDK, in a software switch, in a NetFPGA-based hardware switch, and in P4.

Evaluations:
Incast is not hard if we don't care what happens to the neighbors. Suppose we have one long-lived flow going to node A and a 64-flow incast going to node B, where B is on the same switch as A but a different port. With DCTCP, when the incast flows arrive, the neighboring long flow dies and takes a while to recover. With lossless Ethernet it's a lot better, but the long flow still suffers bad collateral damage: there is so much incoming traffic that pauses propagate up to the next switch in the topology, pausing the long flow along with the short flows. NDP causes minimal collateral damage to neighboring flows: there is a tiny one-RTT glitch in the long flow, and in the very next RTT it is back to full rate.
Receiver pulling allows the receiver to prioritize flows it cares about.
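For example, a receiver could keep its pull queue ordered by per-flow priority; this is just an illustration of the idea, not the paper's mechanism:

```python
import heapq

class PriorityPullQueue:
    def __init__(self, flow_priority):
        self.flow_priority = flow_priority   # e.g. {"query-flow": 0, "backup-flow": 1} (assumed)
        self.heap = []
        self.seq = 0                         # tie-breaker keeps FIFO order within a priority

    def add(self, flow):
        heapq.heappush(self.heap, (self.flow_priority.get(flow, 10), self.seq, flow))
        self.seq += 1

    def next_pull(self):
        # The next pull goes to the highest-priority flow with packets outstanding,
        # so its data arrives first while lower-priority senders wait.
        return heapq.heappop(self.heap)[2] if self.heap else None
```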
NDP also works well with:
NDP does not work well: