Inside the Social Network’s (Datacenter) Network

Paper: Arjun Roy, Hongyi Zeng, Jasmeet Bagga, George Porter, and Alex C. Snoeren. 2015. Inside the Social Network’s (Datacenter) Network. In Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication (SIGCOMM ’15). Association for Computing Machinery, New York, NY, USA, 123–137. DOI: https://doi.org/10.1145/2785956.2787472

Summary

This paper reports on properties of network traffic in Facebook’s datacenters, such as locality, stability, and predictability, and discusses the implications for network architecture, traffic engineering, and switch design. The paper finds that while traditional datacenter services like Hadoop follow behaviors reported in earlier studies, the core Web service and supporting cache infrastructure exhibit a number of behaviors that contrast with those reported in the literature. This study is the first to report on production traffic in a datacenter network connecting hundreds of thousands of 10-Gbps nodes.

Using both Facebook-wide monitoring systems and per-host packet-header traces, the study examines the services that generate the majority of the traffic in Facebook’s network. Its major findings are:

Datacenter Topology

Every machine typically has one role:

Data Collection

The study considers two distinct data sources:
- Fbflow is a production monitoring system that samples packet headers from Facebook’s entire machine fleet. In production use, it aggregates statistics at a per-minute granularity.
- To collect high-fidelity data, the authors turn on port mirroring on the RSW (rack switch) and mirror the full, bi-directional traffic of a single server to a collection server. Because tcpdump cannot handle more than approximately 1.5 Gbps of traffic, they employ a custom kernel module that effectively pins all free RAM on the server and uses it to buffer incoming packets, enabling line-rate traces.
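
To make the first data source concrete, the sketch below shows one way sampled packet headers could be rolled up into per-minute, per-flow statistics. This is a minimal illustration of the per-minute aggregation described above, not Fbflow’s actual implementation; the record format and field names are assumptions.

```python
# Minimal sketch (assumed record format, not Fbflow's actual code): aggregate
# sampled packet headers into per-minute, per-flow packet and byte counts.
from collections import defaultdict

def aggregate_samples(samples):
    """samples: iterable of (timestamp_sec, src, dst, proto, length_bytes)
    drawn from a packet-header sampler.
    Returns {(minute, src, dst, proto): (packets, bytes)}."""
    stats = defaultdict(lambda: [0, 0])
    for ts, src, dst, proto, length in samples:
        minute = int(ts) // 60          # per-minute granularity, as in the summary
        key = (minute, src, dst, proto)
        stats[key][0] += 1              # sampled packets seen for this flow
        stats[key][1] += length         # sampled bytes seen for this flow
    return {k: tuple(v) for k, v in stats.items()}

# Example: two sampled headers from the same flow within one minute.
print(aggregate_samples([
    (1001, "10.0.0.1", "10.0.1.2", "tcp", 1500),
    (1030, "10.0.0.1", "10.0.1.2", "tcp", 900),
]))
```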

Limitations:

Provisioning

The study quantifies the traffic intensity, locality, and stability across three different types of clusters: Hadoop, Frontend machines serving Web requests, and Cache.

Utilization

Locality and stability

The authors give a concrete argument for this locality pattern. Loading the Facebook news feed draws on a vast array of objects in the social graph: different people, relationships, and events form a large graph interconnected in a complicated fashion. Because of this connectedness, the working set of objects is unlikely to shrink even if users are partitioned across servers; the net result is a low cache hit rate within the rack, so requests fan out beyond the rack and traffic exhibits cluster-level rather than rack-level locality.
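
The toy simulation below (my own illustration, not from the paper’s data; all parameters are made up) shows why partitioning users does not shrink the working set: when every user’s feed references objects spread across the whole graph, each partition still needs to cache a large fraction of all objects.

```python
# Toy illustration (assumed parameters): partition users across racks and
# measure each rack's working set of referenced graph objects.
import random

random.seed(0)
NUM_USERS, NUM_OBJECTS, OBJECTS_PER_FEED, NUM_RACKS = 1000, 5000, 50, 10

# Each user's feed references objects drawn from across the entire graph.
feeds = {u: random.sample(range(NUM_OBJECTS), OBJECTS_PER_FEED)
         for u in range(NUM_USERS)}

for rack in range(NUM_RACKS):
    users = [u for u in range(NUM_USERS) if u % NUM_RACKS == rack]
    working_set = {obj for u in users for obj in feeds[u]}
    # Despite holding only 10% of the users, each rack still needs a large
    # share of all objects to achieve local cache hits.
    print(f"rack {rack}: {len(users)} users need "
          f"{len(working_set)}/{NUM_OBJECTS} objects cached")
```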

Implications for connection fabrics:

Traffic Engineering

Heavy hitters:

Heavy hitters represent the minimum set of flows (or hosts, or racks in the aggregated case) responsible for 50% of the observed traffic volume (in bytes) over a fixed time period. Heavy hitters signify an imbalance that can be acted upon. The study finds it challenging to identify heavy hitters that persist with any frequency. This results from a combination of effective load balancing (there is little difference in size between a heavy hitter and the median flow) and the relatively low long-term throughput of most flows (even heavy flows can be quite bursty internally).
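
The sketch below spells out the heavy-hitter definition used above: the smallest set of keys (flows, hosts, or racks) whose bytes cover at least half of the window’s traffic. The input format and threshold parameter are assumptions for illustration.

```python
# Minimal sketch of the heavy-hitter definition: the smallest set of flow keys
# whose combined bytes reach `fraction` of the total observed in a time window.
def heavy_hitters(byte_counts, fraction=0.5):
    """byte_counts: {flow_key: bytes_in_window}."""
    total = sum(byte_counts.values())
    hitters, covered = [], 0
    # Taking keys in decreasing size order yields the minimum-cardinality set.
    for key, size in sorted(byte_counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= fraction * total:
            break
        hitters.append(key)
        covered += size
    return hitters

# Example: with a well-balanced load, many flows are needed to cover 50%,
# so no single flow stands out as a persistent heavy hitter.
print(heavy_hitters({"a": 100, "b": 95, "c": 90, "d": 85, "e": 80}))
```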

Switching

Strengths & Weaknesses

Strengths

Weaknesses

Follow-On

There are multiple immediate follow-on ideas from this work: