LLVM Premerge infra - GCP runners

This document describes how the GCP-based presubmit infrastructure works and explains common maintenance actions.

Overview

Presubmit tests use GitHub workflows. GitHub workflows can be executed in two ways:

  • using GitHub provided runners.
  • using self-hosted runners.

GitHub-provided runners are not very powerful and have limitations, but they are free. Self-hosted runners are machines we manage ourselves: they can be large, very powerful virtual machines running on GCP, but they are expensive.

To balance cost and performance, we keep both types:

  • simple jobs like clang-format run on GitHub-provided runners.
  • building & testing LLVM is done on self-hosted runners.
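
For example, a single workflow can mix both runner types. This is a minimal sketch; the job names and steps are illustrative and not actual LLVM workflow content:

jobs:
  # Cheap check: runs on a free GitHub-provided runner.
  check_format:
    runs-on: ubuntu-latest
    steps:
      - run: echo "clang-format style checks go here"
  # Expensive build & test: runs on a self-hosted GCP runner.
  build_and_test:
    runs-on: llvm-premerge-linux-runners
    steps:
      - run: echo "LLVM build and tests go here"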

LLVM has several flavors of self-hosted runners:

  • macOS runners for HLSL, managed by Microsoft.
  • GCP Windows/Linux runners, managed by Google.
  • GCP Linux runners set up for libcxx, managed by Google.

This document focuses only on Google's GCP-hosted runners.

The runner a workflow uses is selected in the workflow definition:

jobs:
  my_job_name:
    # Runs on expensive GCP VMs.
    runs-on: llvm-premerge-linux-runners

Our self-hosted runners come in two flavors:

  • Linux
  • Windows

GCP runners - Architecture overview

We have two clusters that compose a high-availability setup. The description below covers an individual cluster; the clusters are largely identical, and any relevant differences are called out explicitly.

Our runners are hosted on GCP Kubernetes clusters and use the Actions Runner Controller (ARC). Each cluster has 4 main node pools:

  • llvm-premerge-linux
  • llvm-premerge-linux-service
  • llvm-premerge-windows-2022
  • llvm-premerge-libcxx

llvm-premerge-linux-service is a fixed-size pool, used only to host the services required to manage the premerge infra (controller, listeners, monitoring). Today, this pool has three e2-highcpu-4 machines.

llvm-premerge-linux is an auto-scaling pool of large n2-standard-64 VMs. This pool runs the Linux workflows. In the US West cluster, the machines are n2d-standard-64 due to quota limitations.

llvm-premerge-windows-2022 is an auto-scaling pool of large n2-standard-32 VMs. It is similar to the Linux pool, but runs the Windows workflows. In the US West cluster, the machines are n2d-standard-32 due to quota limitations.

llvm-premerge-libcxx is an auto-scaling pool of large n2-standard-32 VMs. It is similar to the Linux pool, but with smaller machines tailored to the libcxx testing workflows. In the US West cluster, the machines are n2d-standard-32 due to quota limitations.
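
As an illustration of how workloads end up on the right pool, the runner pod template can pin pods to a node pool via the node label GKE attaches to every node. This is a minimal sketch assuming the built-in cloud.google.com/gke-nodepool label; the actual pinning mechanism is described in the worker pools section below:

# Sketch of a runner pod template fragment: schedule runner pods only onto
# nodes from the llvm-premerge-linux worker pool.
template:
  spec:
    nodeSelector:
      cloud.google.com/gke-nodepool: llvm-premerge-linux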

Service pool: llvm-premerge-linux-service

This pool runs all the services managing the presubmit infra.

  • Actions Runner Controller.
  • 1 listener for the Linux runners.
  • 1 listener for the Windows runners.
  • Grafana Alloy to gather metrics.
  • metrics container.

The Actions Runner Controller listens to the LLVM repository's job queue. Individual jobs are then handled by the listeners.
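
As a rough sketch, each listener corresponds to an ARC runner scale set pointed at the LLVM repository. Assuming the upstream gha-runner-scale-set Helm chart, its values look roughly like this; the secret name is hypothetical and the concrete values differ from ours:

# Sketch of gha-runner-scale-set Helm values. Field names come from the
# upstream ARC chart; the values are illustrative.
githubConfigUrl: https://github.com/llvm/llvm-project
githubConfigSecret: github-app-credentials        # hypothetical secret name
runnerScaleSetName: llvm-premerge-linux-runners   # the label used in runs-on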

How a job is run:

  • The controller informs GitHub that the self-hosted runner set is live.
  • A PR is uploaded to GitHub.
  • The listener finds a Linux job to run.
  • The listener creates a new runner pod for Kubernetes to schedule.
  • Kubernetes adds one instance to the Linux pool to schedule the new pod.
  • The runner starts executing on the new node.
  • Once the job finishes, the runner terminates, and so does the pod.
  • If the instance is not reused within the next 10 minutes, the autoscaler shuts it down, freeing resources.
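
The scale-to-zero behaviour in the last steps is driven by two separate mechanisms: the ARC scale set bounds and the GKE node-pool autoscaler. A minimal sketch, assuming the upstream chart's minRunners/maxRunners fields (the maximum shown is illustrative):

# Sketch of the scale-set bounds (illustrative values).
minRunners: 0     # no idle runner pods when the job queue is empty
maxRunners: 16    # cap on concurrent runner pods, hence on worker VMs
# Removing the idle VM itself (after ~10 minutes without a pod) is done by
# the GKE cluster autoscaler on the node pool, not by ARC.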

Worker pools: llvm-premerge-linux, llvm-premerge-windows-2022, llvm-premerge-libcxx

To make sure each runner pod is scheduled on the correct pool (Linux or Windows, and never the service pool), we use labels and taints.
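
For instance, the worker nodes can carry a dedicated taint that only runner pods tolerate, keeping other workloads off the expensive machines. This is a sketch to complement the label-based pinning sketched earlier; the premerge-platform key and value are hypothetical, not our exact configuration:

# Sketch: the worker node pool is created with a taint such as
#   premerge-platform=linux:NoSchedule   (hypothetical key and value)
# so only pods that tolerate it, i.e. the runner pods, can land there:
template:
  spec:
    tolerations:
      - key: premerge-platform          # hypothetical taint key
        operator: Equal
        value: linux
        effect: NoSchedule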

The other constraints we define are resource requirements. Without this information, Kubernetes is allowed to schedule multiple pods on the same instance, so if we did not enforce requests and limits, the controller could schedule two runners on the same instance, forcing the containers to share resources.
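
For example, sizing the runner container's resource requests close to a full node guarantees a one-runner-per-VM layout. A sketch; the figures are illustrative, not the values in our configuration:

# Sketch: requests sized so that only one runner fits on an n2-standard-64
# node (64 vCPUs, 256 GB RAM); exact numbers are illustrative.
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "56"
            memory: 200Gi
          limits:
            cpu: "60"
            memory: 220Gi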

Those bits are configured in the Linux runner configuration, Windows runner configuration, and libcxx runner configuration.