Large-scale cluster management at Google with Borg

Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric
Tune, and John Wilkes. 2015. Large-scale cluster management at Google with
Borg. In Proceedings of the Tenth European Conference on Computer Systems
(EuroSys ’15). Association for Computing Machinery, New York, NY, USA,
Article 18, 1–17. DOI: https://doi.org/10.1145/2741948.2741964
Google’s Borg system is a cluster manager that:
- runs 100s of 1000s of jobs
- from many 1000s of apps
- across a number of clusters
- each cluster having 10s of 1000s of machines
Achieves high utilization by combining admission control, efficient
task packing, over-commitment & machine sharing with process-level
performance isolation.
- Users submit their work to Borg in the form of jobs. A job consists
of one or more tasks that all run the same program (binary). Each job
runs in one Borg cell. A Borg cell is a set of machines belonging to a
single cluster. A cluster lives in a single DC building, and a
collection of buildings makes up a site.
- Workload: heterogeneous, with two parts:
- long-running, highly available, latency-sensitive, end-user facing
services (e.g. Gmail, Google Docs, web search, BigTable), serving requests
that take a few μs to a few 100s of ms, with diurnal usage patterns.
Classified as “prod”.
- batch jobs taking a few secs to a few days to complete, much less
sensitive to short-term performance fluctuations. Classified as
“non-prod”.
- Heterogeneous machines in a cell, in many dimensions: CPU, RAM, disk
and network sizes, processor type, performance, capabilities such as
external IP address or flash storage.
- A job’s properties include its name, owner & number of tasks, plus
constraints to force tasks to run on machines with particular attributes
such as processor architecture, OS version, or an external IP address. A
task maps to a set of Linux processes running on a single machine.
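To make the job/task model above concrete, here is a minimal sketch in Go of how a job description could be represented. The field names and values are illustrative assumptions, not Borg's actual BCL schema or API.

```go
package main

import "fmt"

// Illustrative types only: they approximate the job model described in the
// paper (name, owner, task count, resource request, scheduling constraints),
// but are not Borg's actual BCL schema.
type Requirements struct {
	CPUMilliCores int   // CPU is requested in milli-cores
	MemoryBytes   int64 // memory & disk are requested in bytes
	DiskBytes     int64
}

type Job struct {
	Name        string
	Owner       string
	Cell        string            // a job runs in exactly one cell
	Priority    int               // small positive integer
	NumTasks    int               // all tasks run the same binary
	Requires    Requirements      // per-task resource request
	Constraints map[string]string // machine attributes the tasks require
}

func main() {
	job := Job{
		Name:     "webserver",
		Owner:    "ubar",
		Cell:     "cc",
		Priority: 200,
		NumTasks: 3,
		Requires: Requirements{CPUMilliCores: 500, MemoryBytes: 1 << 30, DiskBytes: 10 << 30},
		Constraints: map[string]string{
			"arch":        "x86_64",
			"external_ip": "true",
		},
	}
	fmt.Printf("%+v\n", job)
}
```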
- Users operate on jobs by issuing RPCs to Borg from a cmdline tool,
other jobs, or monitoring systems. Job descriptions are written in a
declarative configuration lang, BCL. Jobs are submitted and accepted into
the Pending state, get scheduled into the Running state, and then finish
into the Dead state. fail, kill, or lost also lead to Dead; evict from
Running leads back to Pending (pre-emption).
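A small Go sketch of this task lifecycle follows. The state and event names come from the paper; the transition table itself is an illustrative reconstruction, not Borg code.

```go
package main

import "fmt"

// Hedged sketch of the task lifecycle: submit/accept -> Pending,
// schedule -> Running, finish/fail/kill/lost -> Dead, evict -> Pending.
// Which events apply in which state is an illustrative reconstruction.
type State string
type Event string

const (
	Pending State = "Pending"
	Running State = "Running"
	Dead    State = "Dead"
)

var transitions = map[State]map[Event]State{
	Pending: {"schedule": Running, "fail": Dead, "kill": Dead},
	Running: {"finish": Dead, "fail": Dead, "kill": Dead, "lost": Dead,
		"evict": Pending}, // eviction (pre-emption) sends the task back to Pending
}

func step(s State, e Event) (State, error) {
	if next, ok := transitions[s][e]; ok {
		return next, nil
	}
	return s, fmt.Errorf("illegal transition from %s on %s", s, e)
}

func main() {
	s := Pending
	for _, e := range []Event{"schedule", "evict", "schedule", "finish"} {
		next, err := step(s, e)
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s --%s--> %s\n", s, e, next)
		s = next
	}
}
```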
- A Borg alloc is a reserved set of resources on a machine in which one
or more tasks can be run; the analogous concept in Kubernetes is the pod.
An alloc set is a group of allocs that reserve resources on multiple
machines. Allocs are used:
- To set resources aside for future tasks
- To retain resources b/w stopping a task & starting it again
- Gather tasks from different jobs onto same machine - webserver &
logsaver task (copy server’s URL logs from local disk to distributed
FS)
- When more work shows up than can be accommodated, they use
priorities & quota.
- Every job has a priority, a small positive integer, expressing its
relative importance among the jobs running or waiting to run in a cell. A
high-priority task can obtain resources at the expense of a lower-priority
one, even if that involves preempting the latter. To prevent preemption
cascades, tasks in the production priority band are not allowed to preempt
one another.
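A hedged Go sketch of this preemption rule; the production-band boundary value is made up for illustration.

```go
package main

import "fmt"

// Sketch of the preemption rule: a higher-priority task may preempt a
// lower-priority one, except that tasks within the production priority band
// may not preempt each other (to prevent preemption cascades). The band
// boundary value here is hypothetical.
const prodBandMin = 100 // hypothetical lower bound of the "production" band

func inProdBand(priority int) bool { return priority >= prodBandMin }

func canPreempt(newPrio, victimPrio int) bool {
	if newPrio <= victimPrio {
		return false // only a strictly higher priority can preempt
	}
	if inProdBand(newPrio) && inProdBand(victimPrio) {
		return false // no preemption inside the production band
	}
	return true
}

func main() {
	fmt.Println(canPreempt(120, 50))  // prod task preempts batch task: true
	fmt.Println(canPreempt(150, 120)) // both in prod band: false
	fmt.Println(canPreempt(50, 120))  // lower priority never preempts: false
}
```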
- Quota decides which jobs to admit for scheduling. It is a vector of
resource quantities (CPU, RAM, disk, etc.) at a given priority for a
period of time. Quota-checking is part of admission control. Quota at
lower-priority levels is over-sold to discourage over-buying of quota.
Quota allocation happens outside Borg, and is tied to Google’s physical
capacity planning. User jobs are admitted only if they have sufficient
quota at the required priority.
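A minimal Go sketch of such a quota admission check, treating quota as a vector of resource quantities at one priority. The types and the admit() helper are assumptions for illustration, not Borg's actual admission-control interface.

```go
package main

import "fmt"

// Quota is modeled as a vector of resource quantities at a given priority; a
// job is admitted only if its aggregate request fits inside the remaining
// quota at that priority. Field names and admit() are illustrative.
type Resources struct {
	CPUMilliCores int
	MemoryBytes   int64
	DiskBytes     int64
}

func fits(request, remaining Resources) bool {
	return request.CPUMilliCores <= remaining.CPUMilliCores &&
		request.MemoryBytes <= remaining.MemoryBytes &&
		request.DiskBytes <= remaining.DiskBytes
}

func admit(perTask Resources, numTasks int, quotaAtPriority Resources) bool {
	total := Resources{
		CPUMilliCores: perTask.CPUMilliCores * numTasks,
		MemoryBytes:   perTask.MemoryBytes * int64(numTasks),
		DiskBytes:     perTask.DiskBytes * int64(numTasks),
	}
	return fits(total, quotaAtPriority)
}

func main() {
	perTask := Resources{CPUMilliCores: 500, MemoryBytes: 1 << 30, DiskBytes: 10 << 30}
	quota := Resources{CPUMilliCores: 4000, MemoryBytes: 16 << 30, DiskBytes: 200 << 30}
	fmt.Println(admit(perTask, 3, quota))  // true: fits within quota
	fmt.Println(admit(perTask, 20, quota)) // false: rejected at admission
}
```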
- A service’s clients use the BNS (Borg Name Service) name to find a
task. Forms the basis for the task’s DNS name:
taskID.jobName.userName.cellName.borg.google.com. Every task contains a
built-in HTTP server to publish information about task health &
performance metrics (e.g. RPC latencies). A UI service (Sigma) provides a
dashboard.
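Deriving a task's DNS name from the BNS scheme above is simple string composition; the helper below is a hypothetical illustration, not a Borg API.

```go
package main

import "fmt"

// Builds a task's DNS name following the BNS-derived pattern quoted above
// (taskID.jobName.userName.cellName.borg.google.com). Purely illustrative.
func bnsName(taskID int, job, user, cell string) string {
	return fmt.Sprintf("%d.%s.%s.%s.borg.google.com", taskID, job, user, cell)
}

func main() {
	// e.g. task 50 of user ubar's job jfoo in cell cc
	fmt.Println(bnsName(50, "jfoo", "ubar", "cc"))
}
```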
- Borg architecture: Logically centralized controller, Borgmaster, and
agent process called Borglet running on each machine in the cell.
- Borgmaster: main process + separate scheduler.
- main process handles client RPCs that mutate state or provide read-only
access to data. Manages state machines for all objects (machines, tasks,
allocs). Communicates with Borglets. Offers a web UI as a backup to Sigma.
Replicated 5 times for fault tolerance; the elected Paxos leader is the
state mutator. Periodic checkpointing + change logs support fault
tolerance, debugging & capacity planning.
- A job’s tasks are added to a pending queue on submission by the
Borgmaster. The scheduler scans the pending queue from high to low
priority, round-robin within a priority to ensure fairness across users &
avoid head-of-line blocking. Scheduling has two parts: feasibility
checking & scoring (sketched after the scoring bullet below).
- feasibility checking: find machines that meet the task’s constraints
& have enough “available” resources - including resources assigned
to lower-priority tasks that can be evicted
- scoring: score each feasible machine based on user-specified
preferences & built-in criteria (minimizing the number & priority of
preempted tasks, picking machines that already have a copy of the task’s
packages to reduce task startup time, spreading tasks across power &
failure domains, and packing quality). The scoring model tries to reduce
the amount of stranded resources that cannot be used because another
resource on the machine is fully allocated.
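The Go sketch below puts the two phases together: a feasibility filter followed by scoring and picking the best candidate. The scoring weights, field names, and the place() function are invented for illustration and are not Borg's actual algorithm.

```go
package main

import (
	"fmt"
	"sort"
)

// Minimal two-phase placement sketch: feasibility checking (enough free
// resources, counting evictable lower-priority usage) followed by scoring
// (prefer machines that already cache the task's packages, minimize
// preemptions, crude packing term). Weights are illustrative assumptions.
type Machine struct {
	Name         string
	FreeCPU      int   // milli-cores free, including evictable lower-priority usage
	FreeMemBytes int64
	HasPackages  bool // a local copy of the task's packages speeds startup
	WouldPreempt int  // number of lower-priority tasks that would be evicted
}

type Task struct {
	CPU int
	Mem int64
}

func feasible(m Machine, t Task) bool {
	return m.FreeCPU >= t.CPU && m.FreeMemBytes >= t.Mem
}

func score(m Machine, t Task) float64 {
	s := 0.0
	if m.HasPackages {
		s += 10 // reduce startup latency
	}
	s -= 5 * float64(m.WouldPreempt)      // minimize preemptions
	s -= float64(m.FreeCPU-t.CPU) * 0.001 // crude packing term: prefer tighter fits
	return s
}

func place(t Task, machines []Machine) (Machine, bool) {
	var candidates []Machine
	for _, m := range machines {
		if feasible(m, t) {
			candidates = append(candidates, m)
		}
	}
	if len(candidates) == 0 {
		return Machine{}, false
	}
	sort.Slice(candidates, func(i, j int) bool {
		return score(candidates[i], t) > score(candidates[j], t)
	})
	return candidates[0], true
}

func main() {
	t := Task{CPU: 500, Mem: 1 << 30}
	machines := []Machine{
		{Name: "m1", FreeCPU: 600, FreeMemBytes: 2 << 30, HasPackages: false, WouldPreempt: 1},
		{Name: "m2", FreeCPU: 800, FreeMemBytes: 4 << 30, HasPackages: true, WouldPreempt: 0},
		{Name: "m3", FreeCPU: 100, FreeMemBytes: 8 << 30}, // infeasible: not enough CPU
	}
	if best, ok := place(t, machines); ok {
		fmt.Println("placed on", best.Name)
	}
}
```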
- The Borglet starts & stops tasks, restarts them if they fail, manages
local resources by manipulating OS kernel settings, rolls over debug
logs, and reports the state of the machine to the Borgmaster & other
monitoring systems. The Borgmaster polls each Borglet every few secs to
retrieve the machine’s current state & send any outstanding requests.
- To scale, they split the scheduler into a separate process so it
could operate in parallel with other replicated Borgmaster functions.
Separate threads to talk to the Borglets & respond to read-only RPCs
improve response times. More scalability optimizations:
- score caching: ignoring small changes in resource quantities reduces
cache invalidations.
- equivalence classes: feasibility checking & scoring are done once per
equivalence class (tasks with identical requirements & constraints).
- relaxed randomization: examine machines in a random order until enough
feasible machines have been found to score, then select the best.
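A small Go sketch of the relaxed-randomization idea; the candidate count and the feasibility predicate are placeholder assumptions.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// Sketch of relaxed randomization: walk the cell's machines in a random order
// and stop as soon as enough feasible candidates have been found to score,
// instead of checking feasibility on every machine in the cell. "enough" and
// the feasibility predicate below are illustrative placeholders.
func findCandidates(machines []string, feasible func(string) bool, enough int) []string {
	var found []string
	for _, i := range rand.Perm(len(machines)) { // random examination order
		if feasible(machines[i]) {
			found = append(found, machines[i])
			if len(found) >= enough {
				break // stop early: the rest of the cell is never examined
			}
		}
	}
	return found
}

func main() {
	machines := make([]string, 10000)
	for i := range machines {
		machines[i] = fmt.Sprintf("machine-%d", i)
	}
	// Placeholder feasibility predicate: pretend every 10th machine has room.
	feasible := func(name string) bool { return strings.HasSuffix(name, "0") }
	fmt.Println(findCandidates(machines, feasible, 10))
}
```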
- A key design feature for availability is that already-running tasks
continue to run even if the Borgmaster or a task’s Borglet goes down.
- Increasing utilization by a few percentage points can save millions
of dollars. “Cell compaction” is chosen as the utilization metric: given a
workload, how small a cell could it be fitted into? Better scheduling
policies require fewer machines to run the same workload.
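A toy Go sketch of evaluating this metric: given a predicate that answers whether the workload fits in a cell of N machines, find the smallest such N. The binary search and the CPU-only fits() predicate are simplifying assumptions; the paper's evaluation replays real workloads.

```go
package main

import "fmt"

// Cell compaction as a search problem: find the smallest cell size (number of
// machines) in which a fixed workload still fits under a given scheduling
// policy. fits() here is a toy CPU-only stand-in for a real scheduler run.
func smallestCell(fits func(machines int) bool, upper int) int {
	lo, hi := 1, upper
	for lo < hi { // binary search over cell size, assuming fits() is monotone
		mid := (lo + hi) / 2
		if fits(mid) {
			hi = mid
		} else {
			lo = mid + 1
		}
	}
	return lo
}

func main() {
	totalDemandMilliCores := 1_200_000
	perMachineMilliCores := 16_000
	fits := func(machines int) bool {
		return machines*perMachineMilliCores >= totalDemandMilliCores
	}
	fmt.Println("smallest cell:", smallestCell(fits, 10000), "machines")
}
```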
- Cell sharing: segregating prod & non-prod work would need more
machines, because prod jobs usually reserve resources to handle rare
workload spikes but don’t use those resources most of the time. Borg’s
resource reclamation runs non-prod work on the reclaimed resources. The
CPU slowdown (increase in CPI) caused by packing unrelated users & job
types onto the same machine is outweighed by the decrease in machines
required.
- Large cells: Allow large computations & decrease resource
fragmentation.
- Fine-grained resource requests: Request CPU in milli-cores, and
memory & disk space in bytes (a core is a processor hyperthread).
Offering only fixed-size CPU & memory units would require more machines.
- Resource reclamation: A job specifies a resource limit, used to
determine whether the user has enough quota for admission. Every few
seconds the Borgmaster estimates how many resources a task will actually
use (its reservation), based on usage information captured by the Borglet,
& reclaims the rest for batch work. This reduces the number of machines
used. Reservation − Usage = Safety Margin; too small a safety margin leads
to OOMs.
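A minimal Go sketch of the limit / reservation / reclaimable relationship described above; the margin value and helper names are illustrative assumptions.

```go
package main

import "fmt"

// The user-supplied limit is what admission control charges against quota;
// the reservation is the periodically updated estimate of actual usage plus a
// safety margin; the difference (limit - reservation) can be reclaimed for
// batch work. Margin and helper names are illustrative; too small a margin
// risks OOMs.
type MemBytes int64

func reservation(observedUsage, safetyMargin MemBytes) MemBytes {
	return observedUsage + safetyMargin
}

func reclaimable(limit, reserved MemBytes) MemBytes {
	if reserved >= limit {
		return 0 // nothing to reclaim; usage is near the limit
	}
	return limit - reserved
}

func main() {
	limit := MemBytes(8 << 30)  // what the job asked for (and is charged quota for)
	usage := MemBytes(3 << 30)  // what the Borglet observes the task using
	margin := MemBytes(1 << 30) // buffer against usage spikes
	reserved := reservation(usage, margin)
	fmt.Printf("reserved %d GiB, reclaimable for batch: %d GiB\n",
		reserved>>30, reclaimable(limit, reserved)>>30)
}
```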
- They use a Linux chroot jail as the primary security isolation
mechanism between multiple tasks on the same machine; VMs & security
sandboxing techniques are used to run external software submitted via
Google’s AppEngine & Google Compute Engine.
- Borg tasks run inside a Linux cgroup-based resource container, &
the Borglet manipulates the container settings, giving improved control.
- Borg tasks have an appclass: latency-sensitive (LS) or batch.
High-priority LS tasks receive the best treatment.
- Compressible (CPU cycles, disk I/O bandwidth) & non-compressible
resources (memory, disk space): if a machine runs out of non-compressible
resources, Borglet immediately terminates tasks from lowest to highest
priority; if a machine runs out of compressible resources, the Borglet
throttles usage (favoring LS tasks) so that short load spikes can be
handled without killing any tasks.
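The Go sketch below illustrates the two enforcement policies: kill lowest-priority-first under non-compressible pressure, throttle batch before LS under compressible pressure. Task names, priorities, and the ordering helpers are illustrative only.

```go
package main

import (
	"fmt"
	"sort"
)

// When a machine runs out of a non-compressible resource (e.g. memory), tasks
// are killed from lowest to highest priority; when it runs out of a
// compressible resource (e.g. CPU), usage is throttled instead, favoring
// latency-sensitive (LS) tasks. Names and numbers are illustrative.
type Task struct {
	Name     string
	Priority int
	LS       bool // latency-sensitive appclass
}

func killOrder(tasks []Task) []Task {
	out := append([]Task(nil), tasks...)
	sort.Slice(out, func(i, j int) bool { return out[i].Priority < out[j].Priority })
	return out // victims chosen lowest-priority first
}

func throttleOrder(tasks []Task) []Task {
	out := append([]Task(nil), tasks...)
	sort.Slice(out, func(i, j int) bool {
		if out[i].LS != out[j].LS {
			return !out[i].LS // throttle batch tasks before LS tasks
		}
		return out[i].Priority < out[j].Priority
	})
	return out
}

func main() {
	tasks := []Task{
		{"websearch", 360, true},
		{"logsaver", 200, false},
		{"mapreduce", 100, false},
	}
	fmt.Println("memory pressure, kill first:", killOrder(tasks)[0].Name)
	fmt.Println("CPU pressure, throttle first:", throttleOrder(tasks)[0].Name)
}
```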
- To reduce scheduling delays, their version of CFS uses extended
per-cgroup load history, allows preemption of batch tasks by LS tasks,
reduces scheduling quantum when multiple LS tasks are runnable.
Strengths
- Throughout the design they gave importance to debugging
(introspection) by surfacing debugging information to all users rather
than hiding it. To handle the enormous volume of data, they provide
several levels of UI & debugging tools, so users can quickly identify
anomalous events related to their jobs & then drill down to detailed
event & error logs from their applications & the infrastructure
itself.
- Another important part of managing large scale clusters is
automation & operator scaleout: Borg’s design allows them to support
10s of 1000s of machines per operator (SRE).
- The qualitative lessons they learned from operating Borg in
production for more than a decade were incorporated in Kubernetes which
has become extremely popular now:
- allocs are useful: they allow helper services to be developed by
separate teams, & this abstraction spawned the widely-used logsaver
pattern & the data-loader + webserver pattern.
- cluster management is more than task management: apps on Borg
benefit from other cluster services such as naming & load balancing,
apart from lifecycle management of tasks & machines.
- master is the kernel of a distributed system: although originally
built as a monolithic system, Borgmaster became a kernel sitting at the
heart of an ecosystem of services that cooperate to manage user jobs:
scheduler, UI (Sigma), admission control, vertical & horizontal
auto-scaling, re-packing tasks, periodic job submission, workflow
management, archiving system actions for offline querying. Affording
this scaling of workload & feature set without sacrificing
performance, scalability & maintainability is the hallmark of a good
design.
Weaknesses
- The only form of data locality supported by the Borg scheduler is
sharing & caching immutable packages used by different tasks. The
scheduler doesn’t take into account low-level resource interference:
such as memory bandwidth contention or L3 cache pollution. Maybe the
scheduler can be further improved.
- As discussed in the paper:
- Jobs are restrictive as the only grouping mechanism for tasks: there is
no way to manage a multi-job service as a single entity, or to refer to
related instances of a service (e.g. canary & production tracks). K8s
solves this by using labels - arbitrary key-value pairs that users can
attach to any object in the system (see the sketch after this list).
- One IP address per machine complicates things: Borg needed to
schedule ports as a resource, assign ports to tasks, enforce port
isolation & RPC systems needed to handle ports as well as IP
addresses. K8s uses Linux namespaces, VMs, IPv6 & SDN to give an IP
address to every pod & service.
- Optimizing for power users at the expense of casual ones: the BCL
specification lists about 230 parameters & makes it harder for the
casual user to use Borg.
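Referring back to the labels point above, here is a tiny Go sketch of how label selectors pick out related instances (e.g. the canary track of a service), which the job-as-only-grouping model cannot express. The label keys and values are illustrative.

```go
package main

import "fmt"

// Labels are arbitrary key/value pairs attached to objects; a selector picks
// out the matching subset. Keys and values below are made up for illustration.
type Labels map[string]string

func matches(obj Labels, selector Labels) bool {
	for k, v := range selector {
		if obj[k] != v {
			return false
		}
	}
	return true
}

func main() {
	pods := []Labels{
		{"app": "frontend", "track": "production"},
		{"app": "frontend", "track": "canary"},
		{"app": "backend", "track": "production"},
	}
	selector := Labels{"app": "frontend", "track": "canary"}
	for _, p := range pods {
		if matches(p, selector) {
			fmt.Println("selected:", p)
		}
	}
}
```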
- Borg, as described in this paper, is only suitable for Google’s
workloads and Google’s machines. The design choices that affected their
system are extremely Google-specific. To cite an instance: the vast
majority of the Borg workload does not run inside VMs - one reason being
that the system was designed at a time when Google had made a considerable
investment in processors with no hardware support for virtualization. But
of course, the paper presents many ideas to build such a cluster
management system.
- Not enough quantitative motivation is provided for choosing “cell
compaction” as the utilization metric for Borg’s design. They just say
that they picked this metric after much experimentation, and the details
are omitted. They mention that cell compaction gives a fair, consistent
way to compare scheduling policies.
Future Work
- As discussed above, work continues on improving the scheduler: adding
thread placement & CPU management that is NUMA-, hyperthreading-, and
power-aware, & improving the control fidelity of the Borglet.
- Even though K8s is being actively developed, rewriting the BCL to
provide a simpler API for casual users might be worthwhile future
work.