A Scalable, Commodity Data Center Network Architecture
Paper: Mohammad Al-Fares,
Alexander Loukissas, and Amin Vahdat. 2008. A scalable, commodity data
center network architecture. SIGCOMM Comput. Commun. Rev. 38, 4 (October
2008), 63–74. DOI:https://doi.org/10.1145/1402946.1402967
In 2008, data center networks consisted of a tree of routers and
switches, with progressively more specialized and higher-end IP
switches/routers higher up in the hierarchy. This resulted in topologies
that supported only 50% of the aggregate bandwidth available at the
edge of the network, while still incurring tremendous cost.
There were two high-level choices for building the communication fabric
of large-scale clusters:
- Leverage specialized hardware and communication protocols such as
InfiniBand or Myrinet. Drawbacks:
  - More expensive than commodity Ethernet.
  - Not natively compatible with TCP/IP applications.
- Leverage commodity Ethernet switches and routers, which support
familiar management infrastructure and work with unmodified applications,
operating systems, and hardware. But aggregate cluster bandwidth scales
poorly with cluster size. Drawbacks:
  - Oversubscription to reduce costs.
  - Use of non-commodity 10 Gbps (in 2008) switches and routers at the
higher levels to reduce oversubscription.
  - Typical single-path IP routing along trees of interconnected
switches means that overall cluster bandwidth is limited by the
bandwidth available at the root.
The paper shows that interconnecting commodity switches in a
fat-tree architecture achieves the full bisection bandwidth of clusters
consisting of tens of thousands of nodes: a fat-tree built from 48-port
Ethernet switches can provide full bandwidth to up to 27,648 hosts. The
design goals are:
- Scalable interconnection bandwidth: any host can communicate with any
other host at the full bandwidth of its local network interface.
- Economies of scale: make cheap, off-the-shelf Ethernet switches the
basis for large-scale data center networks, just as commodity PCs became
the basis for cluster computing.
- Backward compatibility: compatible with hosts running IP/Ethernet,
with minimal modifications to existing switches.
Current Data Center Network Topologies
Based on best practices as of 2008.
Topology
Consisted of 2- or 3-level trees of switches or routers.
- A 3-tier design has a core tier in the tree root, an aggregation
tier in the middle, and an edge tier at the leaves. Supports ~10,000
hosts.
- A 2-tier design has only the core tier and the edge tier. Supports
5000-8000 hosts.
Two types of switches are used:
- 48-port GigE with four 10 GigE uplinks at the edge
- 128-port 10 GigE switches at the core
Oversubscription
Oversubscription is the ratio of the worst-case achievable aggregate
bandwidth among the end hosts to the total bisection bandwidth of a
particular communication topology. Typical designs are oversubscribed by
a factor of 2.5:1 (400 Mbps per host) to 8:1 (125 Mbps per host) for
cost reasons.
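As a worked check of how the ratio maps to per-host bandwidth (assuming GigE host links, as in the paper):

$$
\text{per-host bandwidth} = \frac{1\,\text{Gbps}}{\text{oversubscription ratio}}, \qquad
\frac{1\,\text{Gbps}}{2.5} = 400\,\text{Mbps}, \qquad
\frac{1\,\text{Gbps}}{8} = 125\,\text{Mbps}.
$$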
Multi-path Routing
Delivering full bandwidth between arbitrary hosts in large clusters
requires a multi-rooted tree (multiple core switches), so most
enterprise core switches support ECMP (Equal-Cost Multi-Path) routing.
Without ECMP, a single-rooted tree with a 128-port 10 GigE core switch
and 1:1 oversubscription can support only 1,280 nodes.
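The 1,280-node limit follows directly from the core switch's aggregate capacity (a back-of-the-envelope check, assuming GigE host links):

$$
\frac{128 \times 10\,\text{Gbps}}{1\,\text{Gbps per host}} = 1{,}280 \text{ hosts at 1:1 oversubscription.}
$$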
ECMP performs static load splitting among flows. This suffers
from:
- It does not account for flow bandwidth in allocation decisions,
leading to oversubscription even for simple communication patterns.
- ECMP implementations limited path multiplicity to 8-16 (in 2008);
larger data centers require more paths for higher bisection bandwidth.
- The number of routing table entries grows multiplicatively with the
number of paths, which increases cost and lookup latency.
Cost
- The switching hardware to interconnect 20,000 hosts with full bandwidth
(1:1) among all hosts cost USD 37,000,000 (2008). A fat-tree built from
only 48-port GigE switches would support 27,648 hosts at 1:1
oversubscription for USD 8,640,000 (2008).
- The fat-tree topology has no requirement for higher-speed uplinks.
- It was not even technically feasible in 2008 to build a 27,648-node
cluster from 10 GigE switches in a traditional hierarchy; even if it
were, it would cost over USD 690,000,000 (2008).
Clos Networks/Fat-Trees
In 1953, Charles Clos of Bell Labs designed a network topology that
delivers high bandwidth to many end devices by interconnecting smaller
commodity telephone switches. The fat-tree is a special instance of the
Clos topology.
A $k$-ary fat-tree has:
- $k$ pods, each containing two layers of $k/2$ switches (an edge layer
and an aggregation layer)
  - each $k$-port switch in the lower (edge) layer is connected to
$k/2$ hosts and to all $k/2$ upper-layer (aggregation) switches
- $(k/2)^2$ $k$-port core switches
  - each core switch has its $i$th port connected to pod $i$, such that
consecutive ports in the aggregation layer of each pod switch are
connected to core switches on strides of $k/2$.
A fat-tree built with $k$-port switches supports $\frac{k^3}{4}$
hosts.
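A minimal sketch (in Python, with function and key names of my own choosing) that computes the component counts implied by these formulas for a given switch port count $k$:

```python
def fat_tree_size(k: int) -> dict:
    """Component counts for a k-ary fat-tree built from k-port switches.

    k must be even (each switch splits its ports half down, half up).
    """
    assert k % 2 == 0, "k must be even"
    return {
        "pods": k,
        "edge_switches": k * (k // 2),          # k/2 per pod
        "aggregation_switches": k * (k // 2),   # k/2 per pod
        "core_switches": (k // 2) ** 2,
        "hosts": (k ** 3) // 4,                 # k/2 hosts per edge switch
    }

# Example: the 48-port case discussed in the paper.
print(fat_tree_size(48))
# {'pods': 48, 'edge_switches': 1152, 'aggregation_switches': 1152,
#  'core_switches': 576, 'hosts': 27648}
```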
Advantages:
- All switching elements can be homogeneous (though this is not
required), enabling the use of cheap commodity switches even at the core.
- Fat-trees are rearrangeably non-blocking: for arbitrary
communication patterns, there is some set of paths that will saturate
all the bandwidth available to the end hosts (1:1 oversubscription). In
practice, we might need to trade this off against avoiding TCP packet
reordering at end hosts.
- All hosts connected to the same edge switch form their own subnet
(this traffic is switched, not routed).
Challenges:
- IP/Ethernet networks typically use a single routing path between
source and destination. The paper describes a simple extension to IP
forwarding that leverages the high fan-out of fat-trees.
- Significant wiring complexity. The paper presents packaging and
placement techniques to mitigate this.
Architecture
Achieving maximum bisection bandwidth (1:1) requires spreading
outgoing traffic from any pod as evenly as possible among the core
switches. There are $(k/2)^2$ shortest paths between any two hosts on
different pods. Routing needs to take advantage of this path
redundancy.
Addressing
- All IP addresses are allocated within the $10.0.0.0/8$ block.
- Pod switches are allocated $10.pod.switch.1$, where $pod$ is the pod
number in $[0,k-1]$ (from left to right) and $switch$ is the switch's
position in $[0,k-1]$ (from left to right, bottom to top).
- Core switches are allocated $10.k.j.i$, where $j$ and $i$ denote the
switch's coordinates in the $(k/2)^2$ core switch grid (each in
$[1,k/2]$, starting from the top-left).
- Host addresses are $10.pod.switch.ID$, where $ID$ is the host's
position in its subnet, in $[2,k/2+1]$, from left to right.
This might seem like a waste of address space, but it simplifies building
routing tables and still scales to roughly 4.2 million hosts ($k=255$)!
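A minimal sketch (Python; the generator and role names are hypothetical) that enumerates the addresses this scheme assigns for a given $k$:

```python
def fat_tree_addresses(k: int):
    """Yield (role, address) pairs under the paper's 10.0.0.0/8 scheme."""
    assert k % 2 == 0
    half = k // 2
    for pod in range(k):
        for switch in range(k):              # 0..k/2-1 edge, k/2..k-1 aggregation
            yield ("pod-switch", f"10.{pod}.{switch}.1")
        for switch in range(half):           # hosts hang off edge switches only
            for host_id in range(2, half + 2):
                yield ("host", f"10.{pod}.{switch}.{host_id}")
    for j in range(1, half + 1):             # core switch grid coordinates
        for i in range(1, half + 1):
            yield ("core-switch", f"10.{k}.{j}.{i}")

# Example for k = 4: 16 pod switches, 16 hosts, 4 core switches.
from collections import Counter
counts = Counter(role for role, _ in fat_tree_addresses(4))
print(dict(counts))  # {'pod-switch': 16, 'host': 16, 'core-switch': 4}
```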
Two-Level Routing Table
- Routing tables are modified to allow a two-level prefix lookup for
even distribution of traffic. An entry in the main routing table may
have an additional pointer to a small secondary table of $(suffix, port)$
entries. Entries in the primary table are left-handed (prefix matches),
and entries in the secondary table are right-handed (suffix matches).
- If the longest-matching-prefix search yields a non-terminating prefix
(one pointing to a secondary table), then the longest-matching suffix in
that secondary table is found and used.
- Can be implemented efficiently in H/W using Content-Addressable
Memory (CAM). A CAM can perform parallel searches among all its entries
in a single clock cycle. A Ternary CAM (TCAM) can store don’t
care bits in addition to matching 0s and 1s in particular
positions, making it suitable for storing variable length prefixes found
in routing tables. But CAMs have low storage density, are very power
hungry, and are expensive per bit. In this application, we need just $k$
entries, each 32 bits wide.
- Left-handed (prefix) entries are stored in numerically smaller
addresses and right-handed (suffix) entries in numerically larger
addresses, and the output of the CAM is encoded so that the entry with
the numerically smallest matching address wins.
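A minimal software sketch (Python; the table contents are illustrative, not taken from the paper) of the two-level lookup: try the longest matching prefix first, and if it is non-terminating, fall back to the matching suffix in its secondary table:

```python
import ipaddress

# Primary table: (prefix, terminating_port_or_None, secondary_table_or_None).
# Secondary table: list of (host_id_suffix, port). Illustrative values only.
PRIMARY = [
    ("10.2.0.0/24", 0, None),                 # terminating: subnet in this pod -> port 0
    ("10.2.1.0/24", 1, None),                 # terminating: subnet in this pod -> port 1
    ("0.0.0.0/0",   None, [(2, 2), (3, 3)]),  # non-terminating default prefix
]

def two_level_lookup(dst: str) -> int:
    addr = ipaddress.ip_address(dst)
    # Longest-matching prefix over the primary (left-handed) entries.
    best = max(
        (e for e in PRIMARY if addr in ipaddress.ip_network(e[0])),
        key=lambda e: ipaddress.ip_network(e[0]).prefixlen,
    )
    prefix, port, secondary = best
    if port is not None:                      # terminating prefix
        return port
    # Non-terminating: match on the host ID (right-handed suffix, last octet).
    host_id = int(dst.split(".")[-1])
    for suffix, out_port in secondary:
        if host_id == suffix:
            return out_port
    raise LookupError("no matching suffix")

print(two_level_lookup("10.2.1.2"))  # intra-pod -> terminating prefix, port 1
print(two_level_lookup("10.7.1.3"))  # inter-pod -> default prefix, suffix 3 -> port 3
```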
Routing Algorithm
- Pod switches: the edge and aggregation layers act as
filtering traffic diffusers: they have terminating prefixes to the
subnets in their own pod. For other, outgoing inter-pod traffic, the pod
switches have a default $/0$ prefix with a secondary table matching on
host IDs (a sketch of table generation follows this list):
  - This causes traffic to be spread evenly upward among the outgoing
links to the core switches.
  - It also causes subsequent packets to the same destination host to
follow the same path, avoiding packet reordering.
- Core switches: contain only terminating $/16$ prefixes pointing to
their destination pods.
- No switch in the network contains a table with more than $k$
first-level prefixes or $k/2$ second-level suffixes.
- If two flows from the same source subnet are destined to the same
destination subnet, single-path IP routing would route both flows along
the same path, eliminating the fan-out benefits of the fat-tree.
Two-level routing (diffusing on destination host ID) avoids this.
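A minimal sketch (Python; the port numbering and the suffix-to-port mapping are my own simplifications, not the paper's exact table-generation algorithm) of how an aggregation switch's two-level table could be built: terminating prefixes for the pod's own subnets, plus a default prefix whose secondary table spreads inter-pod traffic by host ID:

```python
def pod_switch_tables(k: int, pod: int):
    """Two-level routing tables for an aggregation switch of one pod.

    Returns (primary, secondary): primary maps prefix -> port or the marker
    "SECONDARY"; secondary maps host-ID suffix -> upward port. Ports are
    illustrative: 0..k/2-1 point down, k/2..k-1 point up to the core.
    """
    half = k // 2
    primary = {}
    # Terminating /24 prefixes for each edge subnet in this pod -> downward port.
    for subnet in range(half):
        primary[f"10.{pod}.{subnet}.0/24"] = subnet
    # Non-terminating default prefix -> consult the secondary (suffix) table.
    primary["0.0.0.0/0"] = "SECONDARY"
    # Spread inter-pod traffic over the k/2 upward ports by destination host ID.
    secondary = {host_id: half + (host_id - 2) % half for host_id in range(2, half + 2)}
    return primary, secondary

primary, secondary = pod_switch_tables(k=4, pod=1)
print(primary)    # {'10.1.0.0/24': 0, '10.1.1.0/24': 1, '0.0.0.0/0': 'SECONDARY'}
print(secondary)  # {2: 2, 3: 3}
```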
Flow Classification
This is an optional dynamic routing technique, an alternative to the
two-level routing above. It performs flow classification with dynamic
port reassignment in pod switches to overcome cases of local congestion
(two flows competing for the same output port even though another
equal-cost port is available).
A flow is a sequence of packets with the same entries for a subset of
packet header fields (e.g., source and destination IP addresses and
destination port).
Pod switches perform flow classification (sketched below):
- Recognize subsequent packets of the same flow and forward them on the
same outgoing port. This avoids packet reordering.
- Periodically reassign a minimal number of flow output ports to
minimize the disparity between the aggregate flow capacity of different
ports. This ensures a fair distribution of flows over the
upward-pointing ports in the face of dynamically changing flow sizes.
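A minimal sketch (Python; the data structures and the rebalance heuristic are assumptions of mine, not the paper's switch implementation) of hash-based flow-to-port pinning with periodic reassignment from the most- to the least-loaded upward port:

```python
class FlowClassifier:
    """Pin each flow to one upward port; periodically rebalance port loads."""

    def __init__(self, upward_ports):
        self.ports = list(upward_ports)
        self.flow_port = {}    # (src, dst) -> assigned upward port
        self.flow_rate = {}    # (src, dst) -> estimated rate (Gbps)

    def forward(self, src, dst, est_rate_gbps):
        flow = (src, dst)
        if flow not in self.flow_port:                 # new flow: hash to a port
            self.flow_port[flow] = self.ports[hash(flow) % len(self.ports)]
        self.flow_rate[flow] = est_rate_gbps
        return self.flow_port[flow]                    # same flow -> same port (no reordering)

    def port_load(self, port):
        return sum(r for f, r in self.flow_rate.items() if self.flow_port[f] == port)

    def rebalance(self):
        """Move one flow from the most- to the least-loaded port to shrink the gap."""
        hot = max(self.ports, key=self.port_load)
        cold = min(self.ports, key=self.port_load)
        for flow, port in self.flow_port.items():
            if port == hot:
                self.flow_port[flow] = cold            # later packets take the new port
                break

clf = FlowClassifier(upward_ports=[2, 3])
clf.forward("10.0.0.2", "10.3.1.2", est_rate_gbps=0.9)
clf.forward("10.0.0.3", "10.3.1.3", est_rate_gbps=0.9)
clf.rebalance()   # if both flows hashed to the same port, one moves to the other
```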
Flow Scheduling
The distribution of transfer times and burst lengths of Internet
traffic is long-tailed, and routing large, long-lived flows plays the
most important role in determining the achievable bisection bandwidth of
a network. We therefore want to schedule large flows so that they
minimally overlap with one another.
- Edge switches: initially assign a new flow to the least-loaded port,
and send periodic notifications to a central scheduler specifying the
source and destination of all active large flows. Each notification is a
request to place that flow on an uncontended path.
- Central scheduler (replicated): tracks all active large flows and
tries to assign them non-conflicting paths where possible. It linearly
searches through the $(k/2)^2$ core switches (each corresponding to one
equal-cost path) to find one whose path components do not include a
reserved link, and it garbage-collects old flows and reservations. (A
sketch follows below.)
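A minimal sketch (Python; the link naming and reservation structure are assumptions of mine) of the central scheduler's linear search over core switches for a path with no reserved links:

```python
class FlowScheduler:
    """Assign each large inter-pod flow to a core switch whose path is unreserved."""

    def __init__(self, k: int):
        self.k = k
        self.reserved = set()          # reserved link identifiers
        self.placements = {}           # (src, dst) -> core switch index

    def _links(self, core: int, src_pod: int, dst_pod: int):
        # The path via a given core switch uses one aggregation-core link on
        # each side; identify them by (direction, pod, core) (a simplification).
        return {("up", src_pod, core), ("down", dst_pod, core)}

    def place(self, src: str, dst: str):
        src_pod, dst_pod = int(src.split(".")[1]), int(dst.split(".")[1])
        for core in range((self.k // 2) ** 2):          # linear search over cores
            links = self._links(core, src_pod, dst_pod)
            if not (links & self.reserved):             # no reserved link on this path
                self.reserved |= links
                self.placements[(src, dst)] = core
                return core
        return None                                      # no uncontended path found

    def expire(self, src: str, dst: str):
        """Garbage-collect a finished flow and release its reservations."""
        core = self.placements.pop((src, dst), None)
        if core is not None:
            src_pod, dst_pod = int(src.split(".")[1]), int(dst.split(".")[1])
            self.reserved -= self._links(core, src_pod, dst_pod)

sched = FlowScheduler(k=4)
print(sched.place("10.0.0.2", "10.3.1.3"))   # -> 0 (first uncontended core)
print(sched.place("10.0.1.2", "10.3.0.2"))   # -> 1 (core 0's links are now reserved)
```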
Fault Tolerance
We can leverage the path redundancy between any two hosts. Each
switch maintains a Bidirectional Forwarding Detection (BFD) session with
each of its neighbors to detect link and switch failures. Redundancy at
the edge layer is needed to tolerate edge switch failures.
- Failure of an edge-to-aggregation link affects three traffic types:
  - Traffic type 1: outgoing inter- and intra-pod traffic originating at
the edge switch: this switch sets the cost of the broken link to
infinity and never assigns it to new flows.
  - Traffic type 2: intra-pod traffic using the aggregation switch as an
intermediary: this switch broadcasts a tag notifying all other edge
switches in the same pod not to use the broken link.
  - Traffic type 3: inter-pod traffic arriving at the aggregation
switch: this switch broadcasts a tag to all core switches not to use the
broken link. The core switches mirror the tag to the aggregation
switches in other pods so that they too avoid the broken link.
- Failure of an aggregation-to-core link affects two traffic types:
  - Traffic type 1: outgoing inter-pod traffic at the aggregation
switch: this switch removes the link from its local routing table and
locally chooses another core switch.
  - Traffic type 2: incoming inter-pod traffic at the core switch: the
core switch broadcasts a tag to the aggregation switches in other pods
so that they avoid using the broken link.
These measures are reversed when failed links and switches come back
up and re-establish their BFD sessions.
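A minimal sketch (Python; the tag format and data structures are assumptions of mine) of how a switch might react to a BFD state change: mark the link unusable for new flows and broadcast a tag so that the relevant peers route around it:

```python
INFINITY = float("inf")

class SwitchFaultHandler:
    """Track per-link liveness via BFD-style callbacks and notify neighbors."""

    def __init__(self, name, neighbors, broadcast):
        self.name = name
        self.link_cost = {n: 1 for n in neighbors}   # usable links have finite cost
        self.broadcast = broadcast                   # callback: send a tag to peers

    def on_bfd_down(self, neighbor):
        # Never assign new flows to the broken link...
        self.link_cost[neighbor] = INFINITY
        # ...and tell the relevant peers (same-pod edge switches, or core
        # switches, depending on the failure case) to avoid it too.
        self.broadcast({"type": "link-down", "switch": self.name, "link": neighbor})

    def on_bfd_up(self, neighbor):
        # Reverse the measures once the BFD session is re-established.
        self.link_cost[neighbor] = 1
        self.broadcast({"type": "link-up", "switch": self.name, "link": neighbor})

# Example wiring with a trivial broadcast function (k = 4 addressing).
agg = SwitchFaultHandler("10.1.2.1", neighbors=["10.1.0.1", "10.1.1.1"], broadcast=print)
agg.on_bfd_down("10.1.0.1")   # prints the link-down tag; new flows avoid the link
```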
Power and Heat Issues
- 10 GigE switches consume roughly double the Watts/Gbps and dissipate
roughly 3x the heat of commodity GigE switches (2008 figures). These are
two more reasons to use commodity switches.
- For an interconnect supporting ~27,000 hosts, the fat-tree
architecture uses more switches (2,880 Netgear GSM 7252S) but with 56.6%
less power consumption and 56.5% less heat dissipation than typical
hierarchical designs (576 ProCurve 2900 edge switches and 54 BigIron
RX-32 switches, 36 in aggregation and 18 in core).
Implementation, Evaluation & Results
- They implemented a prototype using NetFPGAs. The NetFPGA contains an
IPv4 router implementation that leverages TCAMs. The routing table
lookup routine modifications totaled < 100 LOC and introduced no
measurable lookup latency.
- They built a prototype using Click, a modular software router
architecture that supports implementation of experimental router
designs.
- To measure the total bisection bandwidth of the fat-tree design, they
generated a benchmark suite of communication mappings and evaluated a
4-port fat-tree using two-level routing, flow classification, and flow
scheduling, compared against a standard hierarchical tree with 3.6:1
oversubscription.
- With dynamic flow assignment and re-allocation, the flow classifier
outperforms both the traditional tree and the two-level table, with a
worst-case bisection bandwidth of ~75% of ideal. It is imperfect because
it avoids only local congestion.
- The flow scheduler acts on global knowledge and tries to assign
large flows to disjoint paths, achieving 93% of ideal bisection
bandwidth for random communication mappings.
Packaging
Increased wiring overhead is inherent to the fat-tree topology. The
authors present an approach in the context of a 27,648-node cluster
built from 48-port GigE switches, where each pod consists of 576
machines and 48 switches. Suppose each host takes 1 RU (rack unit) and
one rack can accommodate 48 machines. There are 576 core switches in
total.
- Each pod has 12 racks of 48 machines each.
- All the 48 switches in the pod are packed in a single monolithic
unit with 1,152 user-facing ports.
- 576 of these 1152 ports connect directly to machines.
- 576 of these 1152 ports fan out to one port on each of the 576 core
switches.
- 1152 internal ports are wired to other switches in the same pod
(edge-aggregation interconnects).
- Each of the 48 pods houses 12 core switches.
- Of the 576 cables from a packed pod switch rack, 12 connect
directly to the core switches placed near the pod.
- The rest fan out to other pods in sets of 12, to core switches
housed in remote pods.
Cables thus move in sets of 12 from pod to pod and in sets of 48
from racks to pod switches, which is an opportunity for cable packaging
to reduce wiring complexity.
The 12 racks can be placed around the pod switch rack in two dimensions
to reduce cable length. The 48 pods can be laid out in a 7x7 grid to
reduce inter-pod cabling distance and to support standard cable lengths
and packaging.
Strengths & Weaknesses
Strengths
- The paper presents a novel and important contribution to data center
cluster interconnects: it solves, in a cost-effective manner, the
scalable-bandwidth problem that affected data centers. The authors bring
the cluster-computing idea of displacing expensive, high-end machines
with clusters of commodity PCs to data center interconnects, leveraging
a large number of commodity GigE switches to displace high-end switches.
Even though the fat-tree topology is not novel (it was used for
telephone switching), the authors solve the architectural and
implementation challenges of standard single-path IP routing with a
novel IP addressing scheme and a two-level routing mechanism that
require only minimal modifications to existing switches, without
breaking backward compatibility with the TCP/IP/Ethernet stack prevalent
in data centers.
- They also alleviate the ECMP problem of static load splitting of
flows, which disregards flow bandwidth, by proposing a simple flow
classification scheme with dynamic port reassignment; it is stateful but
can fall back to two-level routing if state is lost. In addition, they
attempt to maximize achievable bisection bandwidth by introducing flow
scheduling to direct large flows onto uncontended paths, and they
introduce simple fault tolerance schemes to tolerate link and switch
failures.
- They acknowledge power and heat issues, which are an important
practical consideration when deploying new technologies in the data
center, and they provide sufficient empirical evidence for their case of
using a larger number of commodity switches instead of expensive,
high-end switches.
Weaknesses
- Flow scheduling as described is impractical in real-world deployments:
a centralized scheduler with knowledge of all active flows and the
status of all links is infeasible for large, arbitrary networks. The
regularity of the fat-tree topology could simplify the search for
disjoint, uncontended paths, but this study is not undertaken.
- They attempt to solve the cabling problem with clever packaging, but
they make quite a few assumptions: each host takes up 1 RU, individual
racks accommodate 48 machines, and 48 switches are packed into a single
rack. This forces the data center operator to trade away other placement
requirements to accommodate the fat-tree interconnect.
- The power and heat comparisons are not made on real deployments of the
design. It would be interesting to see whether real-world deployments
conform to their findings; there may also be other practical overheads
associated with operating a large number of commodity switches.
Follow-on
- As mentioned in the weaknesses, one important follow-on is to
investigate more practical designs for the flow scheduling problem at
hyperscale, since maintaining the large amount of state required by the
described mechanism is infeasible at that scale.
- Study the practical issues (power, heat, cooling) of a realistic
deployment of the fat-tree design; it would be interesting to find
issues not mentioned in the paper.
- Investigate performance with fault tolerance mechanisms such as
Bidirectional Forwarding Detection (BFD) active in a real data center
deployment, where the achievable bisection bandwidth would reflect more
realistic conditions.