A Scalable, Commodity Data Center Network Architecture

Paper: Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data center network architecture. SIGCOMM Comput. Commun. Rev. 38, 4 (October 2008), 63–74. DOI:https://doi.org/10.1145/1402946.1402967

In 2008, data center networks typically consisted of a tree of routers and switches, with progressively more specialized and higher-end IP switches/routers toward the top of the hierarchy. Even so, the resulting topologies supported only about 50% of the aggregate bandwidth available at the edge of the network, while still incurring tremendous cost.

There were two high-level choices for building a communication fabric for large-scale clusters:

- Leverage specialized hardware and communication protocols (e.g., InfiniBand or Myrinet), which scale well but are expensive and do not natively run TCP/IP.
- Leverage commodity Ethernet switches and routers, which are cheap and TCP/IP-compatible but scale poorly in aggregate bandwidth (or cost) as the cluster grows.

The paper shows that interconnecting commodity switches in a fat-tree architecture achieves the full bisection bandwidth of clusters consisting of tens of thousands of nodes. For example, 48-port Ethernet switches are capable of providing full bandwidth to up to 27,648 hosts.

Current Data Center Network Topologies

Based on best practices as of 2008.

Topology

Consisted of 2- or 3-level trees of switches or routers.

Two types of switches are used:

- Edge: commodity GigE switches at the leaves of the tree (e.g., top of rack), connecting directly to hosts.
- Aggregation/core: larger, more expensive enterprise-class 10 GigE switches higher in the hierarchy.

Oversubscription

Oversubscription is the ratio of the worst-case achievable aggregate bandwidth among the end hosts to the total bisection bandwidth of a particular communication topology. Typical designs are oversubscribed by a factor of 2.5:1 (400 Mbps available per GigE host) to 8:1 (125 Mbps per host) for cost reasons.
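
For GigE hosts, the per-host bandwidths in parentheses follow directly from the ratios:

$$
\frac{1\,\text{Gbps}}{2.5} = 400\ \text{Mbps},
\qquad
\frac{1\,\text{Gbps}}{8} = 125\ \text{Mbps}.
$$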

Multi-path Routing

Delivering full bandwidth between arbitrary hosts in large clusters requires a multi-rooted tree (multiple core switches), so most enterprise core switches support ECMP (Equal-Cost Multi-Path) routing. Without a multi-rooted topology, a single-rooted design built around a 128-port 10 GigE core switch can support only 1,280 nodes at 1:1 oversubscription.
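
The 1,280-node limit is just the aggregate capacity of that single core switch divided among GigE hosts:

$$
128 \times 10\,\text{Gbps} = 1{,}280\,\text{Gbps}
\quad\Rightarrow\quad
1{,}280 \text{ GigE hosts at } 1{:}1 \text{ oversubscription}.
$$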

ECMP performs static load splitting among flows. This suffers from:

- It does not account for flow bandwidth, so two or more large, long-lived flows can collide on the same output port and bottleneck each other.
- Implementations at the time limited the multiplicity of paths to 8–16, often less diversity than a large cluster requires.
- Routing table entries grow multiplicatively with the number of paths considered, increasing cost and lookup latency.

Cost

Clos Networks/Fat-Trees

In 1953, Charles Clos of Bell Labs designed a network topology that delivers high bandwidth to many end devices by interconnecting smaller commodity (telephone) switches. The fat-tree is a special instance of the Clos topology.

A $k$-ary fat tree has:

- $k$ pods, each containing two layers (edge and aggregation) of $k/2$ switches each.
- Each $k$-port edge switch connects to $k/2$ hosts and to $k/2$ aggregation switches in its pod.
- Each aggregation switch connects to $k/2$ edge switches and to $k/2$ core switches.
- $(k/2)^2$ core switches, each connecting to one aggregation switch in each of the $k$ pods.

A fat-tree built with $k$-port switches supports $\frac{k^3}{4}$ hosts.
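
As a sanity check on these counts, here is a small sketch (mine, not from the paper) that tabulates the elements of a $k$-ary fat tree. For $k = 48$ it reproduces the 27,648-host figure quoted earlier, and the $(k/2)^2$ count reappears later as the number of shortest paths between hosts in different pods.

```python
def fat_tree_counts(k: int) -> dict:
    """Element counts for a fat tree built from k-port switches (k even)."""
    half = k // 2
    return {
        "pods": k,
        "edge_switches": k * half,         # k/2 edge switches per pod
        "aggregation_switches": k * half,  # k/2 aggregation switches per pod
        "core_switches": half ** 2,        # (k/2)^2 core switches
        "hosts": k ** 3 // 4,              # k/2 hosts per edge switch
        "paths_between_pods": half ** 2,   # (k/2)^2 shortest paths across pods
    }

print(fat_tree_counts(48))
# {'pods': 48, 'edge_switches': 1152, 'aggregation_switches': 1152,
#  'core_switches': 576, 'hosts': 27648, 'paths_between_pods': 576}
```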

Advantages:

- All switching elements are identical, cheap commodity switches; no expensive high-end routers are needed at the upper layers.
- The topology is rearrangeably non-blocking: for any communication pattern there is some set of paths that saturates all the bandwidth available to the end hosts (full bisection bandwidth).

Challenges:

- Plain IP/Ethernet forwarding would concentrate traffic for a destination onto a single path, so routing must be extended to spread traffic over the available paths while remaining compatible with Ethernet, IP, and TCP.
- Wiring complexity: the topology requires a very large number of inter-switch cables.

Architecture

Achieving maximum bisection bandwidth (1:1) requires spreading outgoing traffic from any pod as evenly as possible among the core switches. There are $(k/2)^2$ shortest paths between any two hosts on different pods. Routing needs to take advantage of this path redundancy.

Addressing

All addresses are allocated within the private 10.0.0.0/8 block. Pod switches are addressed 10.pod.switch.1 (with pod and switch in $[0, k-1]$, switches numbered left to right, bottom to top), core switches are addressed 10.k.j.i (with $(j, i)$ the switch's coordinates in the $(k/2) \times (k/2)$ core grid), and hosts are addressed 10.pod.switch.ID (with ID in $[2, k/2 + 1]$). This might seem like a waste of address space, but it simplifies building the routing tables and actually scales to 4,200,000 hosts ($k=255$)!
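
A minimal sketch of these address formats (the helper functions and their names are hypothetical; only the 10.pod.switch.ID / 10.k.j.i formats come from the paper):

```python
def pod_switch_addr(pod: int, switch: int) -> str:
    """Pod switch: 10.pod.switch.1, with pod and switch in [0, k-1]."""
    return f"10.{pod}.{switch}.1"

def core_switch_addr(k: int, j: int, i: int) -> str:
    """Core switch: 10.k.j.i, with (j, i) its position in the (k/2) x (k/2) core grid."""
    return f"10.{k}.{j}.{i}"

def host_addr(pod: int, switch: int, host_id: int) -> str:
    """Host: 10.pod.switch.ID, with ID in [2, k/2 + 1]."""
    return f"10.{pod}.{switch}.{host_id}"

# Example for k = 4: the second host on edge switch 0 of pod 1.
print(host_addr(pod=1, switch=0, host_id=3))   # 10.1.0.3
```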

Two-Level Routing Table
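
Each switch's table has a small set of first-level prefix entries; an entry either terminates with an output port (used for intra-pod destinations) or points to a secondary table matched on the low-order, host-ID byte of the destination address, which spreads inter-pod traffic across the upward ports. Below is a minimal lookup sketch under those assumptions (the data structures and function names are mine), with an example table for an upper-layer switch in pod 2 of a $k = 4$ fat tree:

```python
# Two-level lookup sketch: a prefix entry either gives an output port directly
# or points to a suffix table matched on the low-order (host ID) byte.

def lookup(prefix_table, dst_ip: str) -> int:
    octets = [int(x) for x in dst_ip.split(".")]
    for prefix, length, result in prefix_table:       # sorted longest prefix first
        if matches_prefix(octets, prefix, length):
            if isinstance(result, int):
                return result                          # terminating prefix -> port
            for suffix_byte, port in result:           # secondary (suffix) table
                if octets[3] == suffix_byte:
                    return port
    raise LookupError("no matching entry")

def matches_prefix(octets, prefix, length) -> bool:
    ip_bits = int.from_bytes(bytes(octets), "big")
    pref_bits = int.from_bytes(bytes(int(x) for x in prefix.split(".")), "big")
    mask = ((1 << length) - 1) << (32 - length) if length else 0
    return (ip_bits & mask) == (pref_bits & mask)

# Example table for an upper-layer switch in pod 2 (k = 4).
table = [
    ("10.2.0.0", 24, 0),                  # intra-pod subnet -> downward port 0
    ("10.2.1.0", 24, 1),                  # intra-pod subnet -> downward port 1
    ("0.0.0.0", 0, [(2, 2), (3, 3)]),     # inter-pod: spread on host ID byte
]
print(lookup(table, "10.2.1.2"))          # 1 (stays inside the pod)
print(lookup(table, "10.0.1.2"))          # 2 (host ID 2 -> upward port 2)
print(lookup(table, "10.0.1.3"))          # 3 (host ID 3 -> upward port 3)
```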

Routing Algorithm

Flow Classification

This is an optional dynamic routing technique, an alternative to the two-level routing above. It performs flow classification with dynamic port reassignment in pod switches to overcome local congestion (e.g., two flows competing for the same output port while another equal-cost port sits idle).

A flow is a sequence of packets with the same entries for a subset of fields of the packet headers (e.g. source and destination IP addresses, destination port).

Pod switches perform flow classification with two goals:

- Recognize subsequent packets of the same flow and forward them on the same outgoing port (avoiding packet reordering).
- Periodically reassign a minimal number of flow output ports to minimize the disparity between the aggregate flow capacity of the different ports.
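
A rough sketch of the idea (the specific heuristics and names are mine, not the paper's exact algorithm): new flows go to the least-loaded upward port, later packets of a flow stick to that port, and a periodic pass moves a flow off the most-loaded port.

```python
class FlowClassifier:
    """Toy model of dynamic flow-to-port assignment in a pod switch."""

    def __init__(self, upward_ports):
        self.flow_to_port = {}                      # flow tuple -> upward port
        self.port_load = {p: 0 for p in upward_ports}

    def forward(self, src_ip, dst_ip, dst_port, nbytes):
        flow = (src_ip, dst_ip, dst_port)
        if flow not in self.flow_to_port:
            # New flow: assign it to the currently least-loaded upward port.
            self.flow_to_port[flow] = min(self.port_load, key=self.port_load.get)
        port = self.flow_to_port[flow]              # same flow -> same port (no reordering)
        self.port_load[port] += nbytes
        return port

    def rebalance(self):
        """Periodic pass: move one flow from the most- to the least-loaded port
        to reduce the disparity between aggregate port loads."""
        hi = max(self.port_load, key=self.port_load.get)
        lo = min(self.port_load, key=self.port_load.get)
        for flow, port in self.flow_to_port.items():
            if port == hi:
                self.flow_to_port[flow] = lo
                return
```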

Flow Scheduling

The distribution of transfer times and burst lengths of Internet traffic is long-tailed: a small number of large, long-lived flows carry much of the data. Routing these large flows plays the most important role in determining the achievable bisection bandwidth of a network, so we want to schedule them to minimize overlap with one another. In the paper, edge switches detect flows that grow beyond a threshold and notify a central scheduler, which assigns each large flow a non-conflicting path.
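
A toy model of such a scheduler (my own simplification, not the paper's algorithm): it reserves the core-to-pod links used by each large inter-pod flow and places a new large flow only on a core whose links to both pods are still free.

```python
class CentralScheduler:
    """Toy central scheduler: reserve core<->pod links for large inter-pod flows."""

    def __init__(self, num_core_switches: int):
        self.cores = range(num_core_switches)
        self.reserved = set()        # (core, pod) links already claimed by a large flow

    def place_large_flow(self, src_pod: int, dst_pod: int):
        for core in self.cores:
            if (core, src_pod) not in self.reserved and (core, dst_pod) not in self.reserved:
                self.reserved.add((core, src_pod))
                self.reserved.add((core, dst_pod))
                return core          # route the flow through this core switch
        return None                  # no non-conflicting core available right now

sched = CentralScheduler(num_core_switches=4)
print(sched.place_large_flow(0, 1))  # 0
print(sched.place_large_flow(0, 2))  # 1 (core 0's link to pod 0 is already reserved)
```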

Fault Tolerance

We can leverage the path redundancy between any two hosts. Each switch maintains a Bidirectional Forwarding Detection (BFD) session with each of its neighbors to detect link and switch failures. Redundancy at the edge layer is needed to tolerate edge-switch failures.

This mechanism is reversed when failed links and switches come back up and reestablish their BFD sessions.

Power and Heat Issues

Implementation, Evaluation & Results

Packaging

Increased wiring overhead is inherent to the fat-tree topology. The paper presents a packaging approach in the context of a 27,648-node cluster built from 48-port GigE switches, where each pod consists of 576 machines and 48 switches. Suppose each host occupies 1 RU (rack unit) and each rack accommodates 48 machines. There are 576 core switches in total.
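
For $k = 48$ these counts follow directly from the fat-tree structure:

$$
\begin{aligned}
\text{hosts per pod} &= (k/2)^2 = 24^2 = 576, \\
\text{switches per pod} &= k = 48, \\
\text{core switches} &= (k/2)^2 = 576, \\
\text{total hosts} &= 48 \text{ pods} \times 576 = 27{,}648.
\end{aligned}
$$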

Cables move in sets of 12 from pod to pod and in sets of 48 from racks to pod switches. This presents an opportunity for cable packaging to reduce wiring complexity.

The 12 racks of a pod can be placed around the pod switch rack in two dimensions to reduce cable lengths. The 48 pods can be laid out in a 7×7 grid to reduce inter-pod cabling distance while supporting standard cable lengths and packaging.

Strengths & Weaknesses

Strengths

Weaknesses

Follow-on