This document describes how the GCP-based presubmit infrastructure works and explains common maintenance actions.
Presubmit tests use GitHub Actions workflows, which can execute on two kinds of runners: GitHub-provided runners, which are not very powerful and have limitations but are free; and self-hosted runners, which can be large virtual machines running on GCP, very powerful but expensive. To balance cost and performance, we keep both types.
For example, clang-format should run on GitHub-provided runners. LLVM has several flavors of self-hosted runners; this document focuses only on Google's GCP-hosted runners.
Choosing which runner a workflow runs on is done in the workflow definition:
```yaml
jobs:
  my_job_name:
    # Runs on expensive GCP VMs.
    runs-on: llvm-premerge-linux-runners
```
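For comparison, a job that should stay on the free GitHub-provided runners names one of GitHub's standard labels instead (the job name below is illustrative):

```yaml
jobs:
  check_format:
    # Runs on a free GitHub-provided runner.
    runs-on: ubuntu-latest
```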
Our self-hosted runners come in two flavors: Linux and Windows.
We run two clusters that together form a high-availability setup. The description below covers an individual cluster; the two are largely identical, and any relevant differences are explicitly called out.
Our runners are hosted on GCP Kubernetes clusters and use the Actions Runner Controller (ARC). Each cluster has four main pools:
- llvm-premerge-linux-service is a fixed pool, used only to host the services required to manage the premerge infra (controller, listeners, monitoring). Today, this pool has three e2-highcpu-4 machines.
- llvm-premerge-linux is an auto-scaling pool of large n2-standard-64 VMs. This pool runs the Linux workflows. In the US West cluster, the machines are n2d-standard-64 due to quota limitations.
- llvm-premerge-windows-2022 is an auto-scaling pool of large n2-standard-32 VMs. It is similar to the Linux pool, but runs the Windows workflows. In the US West cluster, the machines are n2d-standard-32 due to quota limitations.
- llvm-premerge-libcxx is an auto-scaling pool of large n2-standard-32 VMs. It is similar to the Linux pool but with smaller machines, tailored to the libcxx testing workflows. In the US West cluster, the machines are n2d-standard-32 due to quota limitations.
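Each auto-scaling pool is backed by an ARC runner scale set. As a minimal sketch, assuming the upstream gha-runner-scale-set Helm chart (the values below are illustrative; the real values live in the runner configurations linked at the end of this document), the Linux scale set might be declared like this:

```yaml
# values.yaml for the gha-runner-scale-set Helm chart (illustrative values).
githubConfigUrl: https://github.com/llvm/llvm-project
githubConfigSecret: github-token    # Kubernetes secret holding the GitHub credentials.
runnerScaleSetName: llvm-premerge-linux-runners  # The label jobs reference in runs-on.
minRunners: 0                       # Scale to zero when the queue is empty.
maxRunners: 16                      # Hypothetical cap; the real limit differs.
```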
The llvm-premerge-linux-service pool runs all the services managing the presubmit infra.
The Actions Runner Controller listens to the LLVM repository's job queue; individual jobs are then handled by the listeners.
How a job is run (roughly, following the standard ARC flow):
1. GitHub marks a job as queued, and the listener for the matching runner scale set picks it up.
2. The listener asks the controller to create a new ephemeral runner pod.
3. Kubernetes schedules the pod on the matching node pool, scaling the pool up if needed.
4. The runner registers with GitHub, executes the job, and the pod is removed once the job finishes.
To make sure each runner pod is scheduled on the correct pool (Linux or Windows, and never the service pool), we use Kubernetes node labels and taints.
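As a minimal sketch, assuming hypothetical label and taint names (not necessarily the exact ones from our configuration), the runner pod template pins itself to its pool like this:

```yaml
# Runner pod template (illustrative label/taint names).
template:
  spec:
    nodeSelector:
      premerge-platform: linux    # Only schedule on nodes labeled for the Linux pool.
    tolerations:
      - key: premerge-platform    # Tolerate the taint that keeps other pods off this pool.
        operator: Equal
        value: linux
        effect: NoSchedule
    containers:
      - name: runner
        image: ghcr.io/actions/actions-runner:latest
        command: ["/home/runner/run.sh"]
```

The taint keeps unrelated pods off the runner pools, while the nodeSelector keeps runner pods off every other pool.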
The other constraint we define is the resource requirements. Without this information, Kubernetes is allowed to schedule multiple pods on the same instance; if we did not enforce resource requests and limits, the controller could schedule two runners on the same instance, forcing the containers to share resources.
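One way to guarantee a single runner per instance is to request close to a node's full capacity, as in this illustrative sketch for an n2-standard-64 machine (the exact numbers in our configuration differ):

```yaml
# Container resources sized so only one runner fits per 64-vCPU node
# (illustrative values, not the exact ones from our configuration).
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            cpu: "60"       # Just under 64 vCPUs, leaving headroom for system pods.
            memory: 240Gi
          limits:
            cpu: "60"
            memory: 240Gi
```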
These bits are configured in the Linux runner configuration, the Windows runner configuration, and the libcxx runner configuration.