Scale Set Metrics ADR (#2568)

Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
# Exposing metrics
**Date**: 2023-05-08
**Status**: Proposed
## Context
Prometheus metrics are a common way to monitor a cluster. Exposing metrics
can be a helpful way to observe scale sets and the health of the ephemeral runners.
## Proposal
Two main components drive the behavior of the scale set:
1. ARC controllers, responsible for managing Kubernetes resources.
2. The `AutoscalingListener`, the driver of the autoscaling solution, responsible for
describing the desired state.
We can approach publishing these metrics in 3 different ways:
### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener
To expose metrics, we would need to create 3 additional resources:
1. `ServiceMonitor` - a resource used by Prometheus to match the namespaces and
services from which it needs to gather metrics
2. `Service` for the `gha-runner-scale-set-controller` - a service that targets
the ARC controller `Deployment`
3. `Service` for each `gha-runner-scale-set` listener - a service that targets
the single listener pod of each `AutoscalingRunnerSet` (see the sketch below)
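
For illustration, a minimal sketch of the per-listener `Service` from item 3, built with the Kubernetes Go API types; the name, label selector, and port are assumptions and would have to match what ARC actually sets on the listener pod.
```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// listenerMetricsService builds the per-listener Service from item 3.
// The name, label selector, and port are illustrative assumptions.
func listenerMetricsService(scaleSetName, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      scaleSetName + "-listener-metrics",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Hypothetical label; not necessarily the label ARC uses on the listener pod.
			Selector: map[string]string{"auto-scaling-listener-name": scaleSetName},
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}
```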
#### Pros
- Easy to control which scale set exposes metrics and which does not.
- Easy to implement with helm charts if metrics are enabled per chart
installation.
#### Cons
- In a cluster running many scale sets, this creates a large number of
resources.
- If metrics are enabled at the controller-manager level and should apply
across all `AutoscalingRunnerSets`, it is difficult for the helm charts to
inherit that configuration.
### Option 2: Create a single metrics aggregator service
To create an aggregator service, we can build a simple web application
responsible for gathering and publishing metrics. All listeners would be
responsible for communicating their metrics on each message, and controllers
for communicating theirs on each reconciliation.
The application can run as a standalone pod, or as a sidecar container next to
the manager.
#### Running the aggregator as a container in the controller-manager pod
**Pros**
- It runs side by side with the controller manager and follows its life cycle.
- We don't need to introduce another controller managing the state of the pod.
**Cons**
- Crashes of the aggregator can affect the execution of the controller manager.
- The controller manager pod needs more resources to run.
#### Running the aggregator in a separate pod
**Pros**
- Does not affect the controller manager pod.
- The life cycle of the aggregator can be controlled by the controller manager
(by implementing another controller).
**Cons**
- We need to implement a controller that can spin the aggregator back up if it
crashes.
- If we choose not to implement such a controller, a resource like `Deployment`
can be used to manage the aggregator, but we lose control over its life cycle.
#### Metrics webserver requirements
1. Create a web server with a single `/metrics` endpoint that registers both
`POST` and `GET` handlers. `GET` is used by Prometheus to fetch the metrics,
while `POST` is used by controllers and listeners to publish their metrics
(a sketch follows this list).
2. `ServiceMonitor` - to target the metrics aggregator service
3. `Service` sitting in front of the web server.
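
A minimal sketch of such an aggregator, assuming each producer `POST`s its metrics in the Prometheus text exposition format together with an identifying `X-Producer` header; the header name and storage scheme are assumptions of this sketch, not a defined protocol.
```go
package example

import (
	"io"
	"net/http"
	"sync"
)

// aggregator keeps the latest metrics payload pushed by each producer
// (controller or listener) and serves the concatenation to Prometheus.
type aggregator struct {
	mu      sync.RWMutex
	byOwner map[string][]byte
}

func (a *aggregator) metrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.byOwner[r.Header.Get("X-Producer")] = body // hypothetical header identifying the producer
		a.mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	case http.MethodGet:
		a.mu.RLock()
		defer a.mu.RUnlock()
		for _, payload := range a.byOwner {
			_, _ = w.Write(payload)
		}
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func runAggregator(addr string) error {
	a := &aggregator{byOwner: map[string][]byte{}}
	mux := http.NewServeMux()
	mux.HandleFunc("/metrics", a.metrics)
	return http.ListenAndServe(addr, mux)
}
```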
**Pros**
- This implementation requires only a few additional resources to be created
in the cluster.
- The web server is easy to implement and easy to document: all metrics are
aggregated in a single package, and the web server only needs to apply them to
its state on `POST`; the `GET` handler is simple.
- We avoid the need for the Prometheus Pushgateway.
**Cons**
- Another image that we need to publish on release.
- A change in metric configuration (on manager update) would require re-creating
all listeners. This is not a big problem, but it is worth pointing out.
- Managing requests/limits can be tricky.
### Option 3: Use a Prometheus Pushgateway
#### Pros
- It is a supported way of pushing metrics.
- Easy to implement using the official client library (see the sketch after the cons below).
#### Cons
- The Prometheus docs state that "usually, the only valid use case
for Pushgateway is for capturing the outcome of a service-level batch job".
The listener does not really fit this criterion.
- Pushgateway is a single point of failure and potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up` metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
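
For reference on the "easy to implement" point above, pushing from a listener with the `client_golang` push package would look roughly like this; the Pushgateway URL, job name, grouping label, and metric are illustrative assumptions.
```go
package example

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

// pushAssignedJobs pushes a single gauge to a Pushgateway. The URL, job name,
// grouping label, and metric are assumptions made for this sketch.
func pushAssignedJobs(assigned float64, scaleSet string) error {
	gauge := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "assigned_jobs",
		Help: "Number of jobs assigned to this scale set.",
	})
	gauge.Set(assigned)

	return push.New("http://pushgateway:9091", "github_runner_scale_set").
		Collector(gauge).
		Grouping("scale_set", scaleSet).
		Push()
}
```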
## Decision
Since there are many ways to collect metrics, we have decided not
to apply `prometheus-operator` resources or a `Service`.
The responsibility of the controller and the autoscaling listener is
only to expose metrics. It is up to the user to decide how to collect them.
When installing ARC, the configuration for both the controller manager's
and the autoscaling listeners' metrics servers is established.
### Controller metrics
By default, the metrics server listens on `0.0.0.0:8080`.
You can control the address of the metrics server using the `--metrics-addr` flag.
Metrics can be collected from the `/metrics` endpoint.
If the value of `--metrics-addr` is an empty string, the metrics server won't be
started.
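
A sketch of how the controller side can honor this flag, using `promhttp` from `client_golang`; the flag wiring below is illustrative, not the exact ARC code.
```go
package example

import (
	"flag"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// startControllerMetrics starts the controller metrics server, or skips it
// entirely when --metrics-addr is set to an empty string.
func startControllerMetrics() {
	metricsAddr := flag.String("metrics-addr", "0.0.0.0:8080", "address of the metrics server; empty string disables it")
	flag.Parse()

	if *metricsAddr == "" {
		return // metrics server disabled
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	go func() {
		// Errors are ignored in this sketch; real code should log them.
		_ = http.ListenAndServe(*metricsAddr, mux)
	}()
}
```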
### Autoscaling listeners
By default, the metrics server listens on `0.0.0.0:8080`.
The endpoint used to expose metrics is `/metrics`.
You can control both the address and the endpoint using the `--listener-metrics-addr` and `--listener-metrics-endpoint` flags.
If the value of `--listener-metrics-addr` is an empty string, the metrics server won't be
started.
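
The listener side follows the same pattern, with the endpoint path also configurable; again an illustrative sketch rather than the actual listener code.
```go
package example

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// serveListenerMetrics serves metrics at the configured address and endpoint,
// or does nothing when the address is empty.
func serveListenerMetrics(addr, endpoint string) error {
	if addr == "" {
		return nil // --listener-metrics-addr="" disables the server
	}
	mux := http.NewServeMux()
	mux.Handle(endpoint, promhttp.Handler())
	return http.ListenAndServe(addr, mux)
}
```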
### Metrics exposed by the controller
To get a better understanding of the health and behavior of the cluster
resources, we expose the following metrics (a sketch of their definitions follows the list):
- `pending_ephemeral_runners` - Number of ephemeral runners in a pending state.
This information can show the latency between creating an `EphemeralRunner`
resource, and having an ephemeral runner pod started and ready to receive a
job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running.
This information is helpful to see how many ephemeral runner pods are running
at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed` state.
  This information is helpful for catching a faulty image or some underlying problem. When the ephemeral runner controller is not able to start the ephemeral runner pod after multiple retries, it sets the state of the `EphemeralRunner` to `Failed`. Since the controller cannot recover from this state, it can be useful to set up Prometheus alerts to catch this issue quickly.
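
A sketch of how these three gauges could be defined with `client_golang`; registering them on the default registry is a simplification, and the controller may use its own registry and add labels.
```go
package example

import "github.com/prometheus/client_golang/prometheus"

var (
	pendingEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "pending_ephemeral_runners",
		Help: "Number of ephemeral runners in a pending state.",
	})
	runningEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "running_ephemeral_runners",
		Help: "Number of ephemeral runners currently running.",
	})
	failedEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "failed_ephemeral_runners",
		Help: "Number of ephemeral runners in a Failed state.",
	})
)

func init() {
	// Registration on the default registry is a simplification;
	// the controller may use its own registry and add labels.
	prometheus.MustRegister(pendingEphemeralRunners, runningEphemeralRunners, failedEphemeralRunners)
}
```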
### Metrics exposed by the `AutoscalingListener`
Since the listener is responsible for communicating state with the actions
service, it can expose actions-service-related data through metrics
(a sketch of representative definitions follows the list). In particular:
- `available_jobs` - Number of jobs with `runs-on` matching the runner scale set name. Jobs are not yet assigned but are acquired by the runner scale set.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners running a job.
- `min_runners` - Minimum number of runners the scale set can scale down to.
- `max_runners` - Maximum number of runners the scale set can scale up to.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set (runs-on matches and scale set passes all the runner group permission checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to get started on the runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the scale set (in seconds).
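
A representative subset of these, sketched with `client_golang` (one gauge, one counter, one histogram); the bucket layout is an assumption, and registration is omitted as in the controller example above.
```go
package example

import "github.com/prometheus/client_golang/prometheus"

// Registration is omitted here; see the controller example above.
var (
	assignedJobs = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "assigned_jobs",
		Help: "Number of jobs assigned to this scale set.",
	})
	completedJobsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "completed_jobs_total",
		Help: "Total number of jobs completed.",
	})
	jobQueueDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "job_queue_duration_seconds",
		Help:    "Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).",
		Buckets: prometheus.ExponentialBuckets(1, 2, 12), // illustrative bucket layout
	})
)
```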
### Metric names
Listener metrics belong to the `github_runner_scale_set` subsystem, so the names
are going to have the `github_runner_scale_set_` prefix.
Controller metrics belong to the `github_runner_scale_set_controller` subsystem,
so the names are going to have the `github_runner_scale_set_controller_` prefix.
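
With `client_golang`, such a prefix can come from the `Subsystem` field of the metric options; whether ARC sets `Subsystem` or builds the full name directly is an implementation detail.
```go
package example

import "github.com/prometheus/client_golang/prometheus"

// Exposed as github_runner_scale_set_controller_pending_ephemeral_runners.
var pendingEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
	Subsystem: "github_runner_scale_set_controller",
	Name:      "pending_ephemeral_runners",
	Help:      "Number of ephemeral runners in a pending state.",
})
```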
## Consequences
Users can define alerts and monitor both the actions-service-based metrics
(gathered from the listener) and the Kubernetes-resource-based metrics
(gathered from the controller manager).