# Exposing metrics

Date: 2023-05-08

**Status**: Proposed

## Context

Prometheus metrics are a common way to monitor the cluster. Providing metrics
can be a helpful way to monitor scale sets and the health of the ephemeral runners.

## Proposal

Two main components drive the behavior of the scale set:

1. ARC controllers responsible for managing Kubernetes resources.
2. The `AutoscalingListener`, the driver of the autoscaling solution,
   responsible for describing the desired state.

We can approach publishing those metrics in three different ways.

### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener

To expose metrics, we would need to create three additional resources:

1. `ServiceMonitor` - a resource used by Prometheus to match namespaces and
   services from which it needs to gather metrics.
2. `Service` for the `gha-runner-scale-set-controller` - a service that will
   target the ARC controller `Deployment`.
3. `Service` for each `gha-runner-scale-set` listener - a service that will
   target a single listener pod for each `AutoscalingRunnerSet` (sketched
   below).
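
As an illustration, the per-listener `Service` could be constructed with
client-go types. This is a minimal sketch: the `newListenerService` helper, the
naming scheme, and the label keys are assumptions for illustration, not the
labels ARC actually applies to listener pods.

```go
// Package metricsexample sketches the per-listener Service from option 1.
package metricsexample

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// newListenerService builds a Service targeting the single listener pod of
// one AutoscalingRunnerSet. Label keys and the name suffix are hypothetical.
func newListenerService(namespace, scaleSetName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      scaleSetName + "-listener-metrics", // assumed naming scheme
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Selects the listener pod of this AutoscalingRunnerSet.
			Selector: map[string]string{
				"app.kubernetes.io/component": "runner-scale-set-listener", // assumed label
				"app.kubernetes.io/instance":  scaleSetName,                // assumed label
			},
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}
```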

#### Pros

- Easy to control which scale set exposes metrics and which does not.
- Easy to implement using helm charts if metrics are enabled per chart
  installation.

#### Cons

- With a cluster running many scale sets, we are going to create a lot of
  resources.
- If metrics are enabled at the controller manager level and are meant to
  apply across all `AutoscalingRunnerSets`, it is difficult to inherit this
  configuration through helm charts.

### Option 2: Create a single metrics aggregator service

To create an aggregator service, we can build a simple web application
responsible for gathering and publishing metrics. All listeners would be
responsible for communicating their metrics on each message, and controllers
would be responsible for communicating their metrics on each reconciliation.

The application can run as a standalone pod, or as a sidecar container next to
the controller manager.

#### Running the aggregator as a container in the controller-manager pod

**Pros**

- It exists side by side with the controller manager and follows its life
  cycle.
- We don't need to introduce another controller managing the state of the pod.

**Cons**

- Crashes of the aggregator can affect the controller manager's execution.
- The controller manager pod needs more resources to run.

#### Running the aggregator in a separate pod

**Pros**

- Does not influence the controller manager pod.
- The life cycle of the aggregator can be controlled by the controller manager
  (by implementing another controller).

**Cons**

- We need to implement a controller that can spin the aggregator back up in
  case of a crash.
- If we choose not to implement the controller, a resource like `Deployment`
  can be used to manage the aggregator, but we lose control over its life
  cycle.

#### Metrics webserver requirements

1. A web server with a single `/metrics` endpoint. The endpoint will have
   `POST` and `GET` methods registered. The `GET` is used by Prometheus to
   fetch the metrics, while the `POST` is going to be used by controllers and
   listeners to publish their metrics. A minimal sketch follows this list.
2. A `ServiceMonitor` to target the metrics aggregator service.
3. A `Service` sitting in front of the web server.
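
A minimal sketch of such an aggregator, assuming publishers `POST` their
metrics in the Prometheus text exposition format; the `source` query parameter
and the port are illustrative, not a settled design:

```go
// Minimal metrics aggregator sketch: listeners and controllers POST their
// metrics; Prometheus GETs the aggregate of the latest payloads.
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

type aggregator struct {
	mu       sync.RWMutex
	bySource map[string][]byte // latest payload per publisher
}

func (a *aggregator) metrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost: // controllers and listeners publish here
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.bySource[r.URL.Query().Get("source")] = body // "source" is an assumed parameter
		a.mu.Unlock()
	case http.MethodGet: // Prometheus scrapes here
		a.mu.RLock()
		for _, payload := range a.bySource {
			w.Write(payload)
		}
		a.mu.RUnlock()
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	agg := &aggregator{bySource: map[string][]byte{}}
	http.HandleFunc("/metrics", agg.metrics)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```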

**Pros**

- This implementation requires only a few additional resources to be created
  in the cluster.
- The web server is easy to implement and easy to document: all metrics are
  aggregated in a single package, and the web server only needs to apply them
  to its state on `POST`. The `GET` handler is simple.
- We can avoid running a Prometheus Pushgateway.

**Cons**

- Another image that we need to publish on each release.
- A change in metric configuration (on a manager update) would require
  re-creating all listeners. This is not a big problem, but it is something to
  point out.
- Managing requests/limits can be tricky.

### Option 3: Use a Prometheus Pushgateway

#### Pros

- Uses a supported way of pushing the metrics.
- Easy to implement using the official client library (see the sketch below).
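
For illustration, pushing a listener gauge with `client_golang`'s `push`
package might look like this; the gateway address and job name are
placeholders:

```go
// Sketch of pushing a metric to a Pushgateway with the official Go client.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	assigned := prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "assigned_jobs",
		Help:      "Number of jobs assigned to this scale set.",
	})
	assigned.Set(3)

	// Push replaces all metrics for this job name on the gateway.
	// The gateway URL and job name below are placeholders.
	if err := push.New("http://pushgateway:9091", "arc_listener").
		Collector(assigned).
		Push(); err != nil {
		log.Fatal(err)
	}
}
```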

#### Cons

- The Prometheus docs specify that "usually, the only valid use case for the
  Pushgateway is for capturing the outcome of a service-level batch job". The
  listener does not really fit this criterion.
- The Pushgateway is a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up`
  metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to
  Prometheus forever unless those series are manually deleted via the
  Pushgateway's API.

## Decision

Since there are many ways in which you can collect metrics, we have decided
not to apply `prometheus-operator` resources or a `Service`.

The responsibility of the controller and the autoscaling listener is only to
expose metrics. It is up to the user to decide how to collect them.

When installing ARC, the configuration for the metrics servers of both the
controller manager and the autoscaling listeners is established.

### Controller metrics

By default, the metrics server listens on `0.0.0.0:8080`.
You can control the address of the metrics server using the `--metrics-addr`
flag.

Metrics can be collected from the `/metrics` endpoint.

If the value of `--metrics-addr` is an empty string, the metrics server won't
be started.
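
As a sketch of how this flag could be wired into a controller-runtime manager;
the flag handling here is illustrative, and the snippet assumes a
controller-runtime version that still exposes `MetricsBindAddress` (which
treats `"0"` as "do not serve metrics"):

```go
// Illustrative wiring of --metrics-addr into a controller-runtime manager.
package main

import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	var metricsAddr string
	flag.StringVar(&metricsAddr, "metrics-addr", ":8080",
		"Address the metrics endpoint binds to. Empty string disables the server.")
	flag.Parse()

	if metricsAddr == "" {
		metricsAddr = "0" // controller-runtime's sentinel for "do not serve metrics"
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: metricsAddr,
	})
	if err != nil {
		panic(err)
	}
	_ = mgr // set up controllers and call mgr.Start(...) here
}
```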

### Autoscaling listeners

By default, the metrics server listens on `0.0.0.0:8080`.
The endpoint used to expose metrics is `/metrics`.

You can control both the address and the endpoint using the
`--listener-metrics-addr` and `--listener-metrics-endpoint` flags.

If the value of `--listener-metrics-addr` is an empty string, the metrics
server won't be started.

### Metrics exposed by the controller

To get a better understanding of the health and workings of the cluster
resources, we need to expose the following metrics (a registration sketch
follows the list):

- `pending_ephemeral_runners` - Number of ephemeral runners in a pending state.
  This information can show the latency between creating an `EphemeralRunner`
  resource, and having an ephemeral runner pod started and ready to receive a
  job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running.
  This information is helpful to see how many ephemeral runner pods are running
  at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed` state.
  This information is helpful to catch a faulty image or some underlying
  problem. When the ephemeral runner controller is not able to start the
  ephemeral runner pod after multiple retries, it will set the state of the
  `EphemeralRunner` to `Failed`. Since the controller cannot recover from this
  state, it can be useful to set Prometheus alerts to catch this issue quickly.
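
A minimal registration sketch with `client_golang`, assuming plain gauges
without labels (the actual implementation may choose different label sets):

```go
// Sketch of the controller gauges. The Subsystem field yields the
// github_runner_scale_set_controller_ name prefix described under
// "Metric names" below.
package metrics

import "github.com/prometheus/client_golang/prometheus"

const controllerSubsystem = "github_runner_scale_set_controller"

var (
	pendingEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: controllerSubsystem,
		Name:      "pending_ephemeral_runners",
		Help:      "Number of ephemeral runners in a pending state.",
	})
	runningEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: controllerSubsystem,
		Name:      "running_ephemeral_runners",
		Help:      "Number of ephemeral runners currently running.",
	})
	failedEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: controllerSubsystem,
		Name:      "failed_ephemeral_runners",
		Help:      "Number of ephemeral runners in a Failed state.",
	})
)

func init() {
	// Registering with the default registry is purely illustrative;
	// controller-runtime exposes its own registry.
	prometheus.MustRegister(
		pendingEphemeralRunners,
		runningEphemeralRunners,
		failedEphemeralRunners,
	)
}
```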

### Metrics exposed by the `AutoscalingListener`

Since the listener is responsible for communicating the state with the actions
service, it can expose actions-service-related data through metrics. In
particular (a sketch of the duration metrics follows the list):

- `available_jobs` - Number of jobs with `runs-on` matching the runner scale
  set name. Jobs are not yet assigned but are acquired by the runner scale set.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners running a job.
- `min_runners` - Minimum number of runners configured for the scale set.
- `max_runners` - Maximum number of runners configured for the scale set.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set
  (`runs-on` matches and the scale set passes all the runner group permission
  checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get
  assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to get
  started on the runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the
  scale set (in seconds).
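
A sketch of how the duration metrics could be declared, assuming histograms;
the bucket boundaries and the `observeJobAssigned` hook are illustrative
assumptions, not a settled design:

```go
// Sketch of listener duration metrics as Prometheus histograms.
package metrics

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

const listenerSubsystem = "github_runner_scale_set"

var jobQueueDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Subsystem: listenerSubsystem,
	Name:      "job_queue_duration_seconds",
	Help:      "Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).",
	Buckets:   []float64{1, 5, 15, 30, 60, 120, 300, 600}, // assumed buckets
})

var completedJobs = prometheus.NewCounter(prometheus.CounterOpts{
	Subsystem: listenerSubsystem,
	Name:      "completed_jobs_total",
	Help:      "Total number of jobs completed.",
})

func init() {
	prometheus.MustRegister(jobQueueDuration, completedJobs)
}

// observeJobAssigned is a hypothetical hook the listener could call when the
// actions service reports a job assignment.
func observeJobAssigned(queuedAt, assignedAt time.Time) {
	jobQueueDuration.Observe(assignedAt.Sub(queuedAt).Seconds())
}
```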

### Metric names

Listener metrics belong to the `github_runner_scale_set` subsystem, so their
names are going to have the `github_runner_scale_set_` prefix.

Controller metrics belong to the `github_runner_scale_set_controller`
subsystem, so their names are going to have the
`github_runner_scale_set_controller_` prefix.

## Consequences

Users can define alerts and monitor the behavior of both the
actions-service-based metrics (gathered from the listener) and the Kubernetes
resource-based metrics (gathered from the controller manager).