# LLVM Premerge infra - GCP runners
This document describes how the GCP-based presubmit infrastructure works and
explains common maintenance actions.
## Overview
Presubmit tests run as GitHub workflows. A GitHub workflow can be executed in
two ways:
- using GitHub-provided runners.
- using self-hosted runners.
GitHub-provided runners are not very powerful and have limitations, but they
are **FREE**.
Self-hosted runners can be large, very powerful virtual machines running on
GCP, but they are **expensive**.
To balance cost and performance, we keep both types:
- simple jobs like `clang-format` shall run on GitHub runners.
- building & testing LLVM shall be done on self-hosted runners.
LLVM has several flavors of self-hosted runners:
- macOS runners for HLSL, managed by Microsoft.
- GCP Windows/Linux runners, managed by Google.
- GCP Linux runners set up for libcxx, managed by Google.
This document only focuses on Google's GCP hosted runners.
Choosing on which runner a workflow runs is done in the workflow definition:
```
jobs:
  my_job_name:
    # Runs on expensive GCP VMs.
    runs-on: llvm-premerge-linux-runners
```
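As a slightly fuller sketch of the cost/performance split described above, a workflow could keep a cheap job on a GitHub-provided runner and send the heavy job to the self-hosted set. The workflow name, job names, and step contents below are hypothetical placeholders; only the `runs-on` labels matter here.

```yaml
name: example-premerge  # hypothetical workflow name
on: pull_request
jobs:
  code_format:  # hypothetical job: cheap, runs on a free GitHub-provided runner
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: echo "run clang-format here"
  build_and_test:  # hypothetical job: expensive, runs on the GCP VMs
    runs-on: llvm-premerge-linux-runners
    steps:
      - uses: actions/checkout@v4
      - run: echo "build and test LLVM here"
```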
Our self-hosted runners come in two flavors:
- Linux
- Windows
## GCP runners - Architecture overview
We run two clusters to form a high-availability setup. The description below
covers a single cluster; the two are largely identical, and any relevant
differences are called out explicitly.
Our runners are hosted on GCP Kubernetes clusters, and use the
[Action Runner Controller (ARC)](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners-with-actions-runner-controller/about-actions-runner-controller).
The clusters have 4 main pools:
- llvm-premerge-linux
- llvm-premerge-linux-service
- llvm-premerge-windows-2022
- llvm-premerge-libcxx
**llvm-premerge-linux-service** is a fixed pool, only used to host the
services required to manage the premerge infra (controller, listeners,
monitoring). Today, this pool has three `e2-highcpu-4` machines.
**llvm-premerge-linux** is an auto-scaling pool with large `n2-standard-64`
VMs. This pool runs the Linux workflows. In the US West cluster, the machines
are `n2d-standard-64` due to quota limitations.
**llvm-premerge-windows-2022** is an auto-scaling pool with large `n2-standard-32`
VMs. It is similar to the Linux pool, but runs the Windows workflows. In the
US West cluster, the machines are `n2d-standard-32` due to quota limitations.
**llvm-premerge-libcxx** is an auto-scaling pool with large `n2-standard-32`
VMs. It is similar to the Linux pool, but with smaller machines tailored
to the libcxx testing workflows. In the US West cluster, the machines are
`n2d-standard-32` due to quota limitations.
### Service pool: llvm-premerge-linux-service
This pool runs all the services managing the presubmit infra.
- Action Runner Controller
- 1 listener for the Linux runners.
- 1 listener for the Windows runners.
- Grafana Alloy to gather metrics.
- metrics container.
The Action Runner Controller listens on the LLVM repository job queue.
Individual jobs are then handled by the listeners.
How a job is run:
- The controller informs GitHub that the self-hosted runner set is live.
- A PR is uploaded on GitHub.
- The listener finds a Linux job to run.
- The listener creates a new runner pod to be scheduled by Kubernetes.
- Kubernetes adds one instance to the Linux pool to schedule the new pod.
- The runner starts executing on the new node.
- Once finished, the runner dies, meaning the pod dies.
- If the instance is not reused within the next 10 minutes, the autoscaler
  shuts it down, freeing resources.
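The runner set that the controller announces in the first step could be declared roughly as follows. This is a minimal sketch that assumes ARC's `gha-runner-scale-set` Helm chart; the repository URL, the secret name, and the runner counts are illustrative assumptions, not the values deployed on these clusters.

```yaml
# Assumed: values for ARC's gha-runner-scale-set Helm chart.
githubConfigUrl: "https://github.com/llvm/llvm-project"  # assumed repository URL
githubConfigSecret: example-github-secret  # hypothetical name of a pre-created secret
# Name that workflows reference in `runs-on`.
runnerScaleSetName: llvm-premerge-linux-runners
# With no idle runners, pods (and therefore nodes) only exist while jobs run.
minRunners: 0
# Illustrative upper bound on concurrently running runner pods.
maxRunners: 16
```

Keeping `minRunners` at 0 matches the flow above: a runner pod is only created once the listener picks up a job, and the node can be reclaimed after the pod exits.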
### Worker pools: llvm-premerge-linux, llvm-premerge-windows-2022, llvm-premerge-libcxx
To make sure each runner pod is scheduled on the correct pool (Linux or
Windows, and never the service pool), we use labels and taints.
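As an illustration of the labels-and-taints approach, the fragment below sketches how a runner pod template could be pinned to a Linux pool. It assumes the `template` section of ARC's `gha-runner-scale-set` Helm values, and that the pool's nodes carry both a label and a `NoSchedule` taint. The key `example-platform` is hypothetical and is not taken from the configuration files in this repository.

```yaml
template:
  spec:
    # Only consider nodes that carry this label (hypothetical key/value).
    nodeSelector:
      example-platform: linux
    # Tolerate the matching taint on the pool's nodes. Pods without this
    # toleration (for example, service pods) are kept off the pool.
    tolerations:
      - key: "example-platform"
        operator: "Equal"
        value: "linux"
        effect: "NoSchedule"
```

The node selector keeps runner pods off the other pools, while the taint keeps other pods off the runner pool.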
The other constraints we define are the resource requirements. Without this
information, Kubernetes is free to schedule multiple pods on the same instance.
So if we do not enforce limits, two runners could be scheduled on the same
instance, forcing their containers to share resources.
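To make the resource constraint concrete, here is a minimal sketch of a request that prevents two runners from sharing a node. The numbers are illustrative assumptions, not the deployed values: an `n2-standard-64` machine has 64 vCPUs, so any CPU request above half of that leaves no room for a second runner pod.

```yaml
template:
  spec:
    containers:
      - name: runner
        resources:
          requests:
            # Illustrative: 2 x 40 = 80 CPUs > 64 available, so a second
            # runner pod cannot be placed on the same node.
            cpu: "40"
            memory: "100Gi"
```

The Kubernetes scheduler places pods based on their requests, so a request sized this way is enough to force one runner per node and let the autoscaler add a new instance for each additional job.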
These settings are configured in the
[linux runner configuration](linux_runners_values.yaml),
[windows runner configuration](windows_runner_values.yaml), and
[libcxx runner configuration](libcxx_runners_values.yaml).