From 91c8991835016f8c6568f101d4a28185baec3dcc Mon Sep 17 00:00:00 2001
From: Nikola Jokic
Date: Thu, 18 May 2023 15:37:41 +0200
Subject: [PATCH] Scale Set Metrics ADR (#2568)

Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
---
 docs/adrs/2023-05-08-exposing-metrics.md | 213 +++++++++++++++++++++++
 1 file changed, 213 insertions(+)
 create mode 100644 docs/adrs/2023-05-08-exposing-metrics.md

diff --git a/docs/adrs/2023-05-08-exposing-metrics.md b/docs/adrs/2023-05-08-exposing-metrics.md
new file mode 100644
index 00000000..6dc2fd7e
--- /dev/null
+++ b/docs/adrs/2023-05-08-exposing-metrics.md
@@ -0,0 +1,213 @@

# Exposing metrics

Date: 2023-05-08

**Status**: Proposed

## Context

Prometheus metrics are a common way to monitor a cluster. Providing metrics
is a helpful way to monitor scale sets and the health of the ephemeral runners.

## Proposal

Two main components drive the behavior of a scale set:

1. The ARC controllers, responsible for managing Kubernetes resources.
2. The `AutoscalingListener`, the driver of the autoscaling solution,
   responsible for describing the desired state.

We can approach publishing their metrics in three different ways.

### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener

To expose metrics, we would need to create three additional resources:

1. `ServiceMonitor` - a resource used by Prometheus to match the namespaces and
   services from which it needs to gather metrics.
2. `Service` for the `gha-runner-scale-set-controller` - a service targeting
   the ARC controller `Deployment`.
3. `Service` for each `gha-runner-scale-set` listener - a service targeting the
   single listener pod of each `AutoscalingRunnerSet`.

#### Pros

- Easy to control which scale set exposes metrics and which does not.
- Easy to implement with Helm charts if metrics are enabled per chart
  installation.

#### Cons

- In a cluster running many scale sets, we would create a lot of resources.
- If metrics are enabled at the controller manager level and should apply
  across all `AutoscalingRunnerSets`, it is difficult to inherit this
  configuration through the Helm charts.

### Option 2: Create a single metrics aggregator service

To create an aggregator service, we can build a simple web application
responsible for gathering and publishing metrics. All listeners would be
responsible for reporting their metrics on each message, and controllers for
reporting theirs on each reconciliation.

The application can run as a standalone pod, or as a sidecar container next to
the manager.

#### Running the aggregator as a container in the controller-manager pod

**Pros**

- It runs side by side with the controller manager and follows its life cycle.
- We don't need to introduce another controller managing the state of the pod.

**Cons**

- Crashes of the aggregator can affect the controller manager's execution.
- The controller manager pod needs more resources to run.

#### Running the aggregator in a separate pod

**Pros**

- It does not affect the controller manager pod.
- The life cycle of the aggregator can be controlled by the controller manager
  (by implementing another controller).

**Cons**

- We need to implement a controller that can spin the aggregator back up after
  a crash.
- If we choose not to implement such a controller, a resource like `Deployment`
  can manage the aggregator instead, but we lose control over its life cycle.

#### Metrics webserver requirements

1. A web server with a single `/metrics` endpoint that registers both the
   `POST` and `GET` methods. `GET` is used by Prometheus to fetch the metrics,
   while `POST` is used by controllers and listeners to publish theirs (a
   minimal sketch follows this list).
2. A `ServiceMonitor` targeting the metrics aggregator service.
3. A `Service` sitting in front of the web server.
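Purely as an illustration of requirement 1, here is a minimal Go sketch of such
an aggregator. It is a sketch only, not an existing ARC component: the
`X-Sender` header used to key pushed payloads and the pass-through of the
Prometheus text exposition format are assumptions.

```go
// Hypothetical aggregator sketch for Option 2 (not the chosen design, see
// the Decision section below). Listeners and controllers POST their metrics
// in Prometheus text exposition format; Prometheus scrapes them with GET.
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// aggregator keeps the latest payload pushed by each sender so that a new
// push replaces the previous one instead of accumulating stale series.
type aggregator struct {
	mu       sync.RWMutex
	payloads map[string][]byte // keyed by an assumed "X-Sender" header
}

func (a *aggregator) metrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.payloads[r.Header.Get("X-Sender")] = body
		a.mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	case http.MethodGet:
		a.mu.RLock()
		defer a.mu.RUnlock()
		for _, p := range a.payloads {
			w.Write(p)
		}
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	agg := &aggregator{payloads: map[string][]byte{}}
	http.HandleFunc("/metrics", agg.metrics)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Keying payloads by sender keeps the `GET` handler trivial: it only concatenates
the most recent report from every listener and controller.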
**Pros**

- This implementation requires only a few additional resources to be created
  in the cluster.
- The web server is easy to implement and easy to document - all metrics are
  aggregated in a single package, and the web server only needs to apply them
  to its state on `POST`. The `GET` handler is simple.
- We can avoid the Prometheus Pushgateway.

**Cons**

- It is another image that we need to publish on release.
- A change in metric configuration (on a manager update) would require
  re-creation of all listeners. This is not a big problem, but it is something
  to point out.
- Managing requests/limits for the aggregator can be tricky.

### Option 3: Use a Prometheus Pushgateway

#### Pros

- A supported way of pushing metrics.
- Easy to implement using the Prometheus client library.

#### Cons

- The Prometheus docs specify that: "Usually, the only valid use case for
  Pushgateway is for capturing the outcome of a service-level batch job".
  The listener does not really fit this criterion.
- The Pushgateway is a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up`
  metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to
  Prometheus forever unless those series are manually deleted via the
  Pushgateway's API.

## Decision

Since there are many ways in which you can collect metrics, we have decided
not to apply `prometheus-operator` resources or a `Service`.

The responsibility of the controller and the autoscaling listener is only to
expose metrics. It is up to the user to decide how to collect them.

The configuration of the metrics servers for both the controller manager and
the autoscaling listeners is established when ARC is installed.

### Controller metrics

By default, the metrics server listens on `0.0.0.0:8080`.
You can control the address of the metrics server using the `--metrics-addr`
flag.

Metrics can be collected from the `/metrics` endpoint.

If the value of `--metrics-addr` is an empty string, the metrics server won't
be started.

### Autoscaling listeners

By default, the metrics server listens on `0.0.0.0:8080`.
The endpoint used to expose metrics is `/metrics`.

You can control both the address and the endpoint using the
`--listener-metrics-addr` and `--listener-metrics-endpoint` flags.

If the value of `--listener-metrics-addr` is an empty string, the metrics
server won't be started.
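The following Go sketch illustrates the flag semantics described above for the
listener: serve Prometheus metrics on a configurable address and endpoint, and
skip the server entirely when the address is empty. It is a minimal sketch
assuming the standard `promhttp` handler, not the actual listener code.

```go
// Minimal sketch of the documented listener behavior: a configurable
// metrics address and endpoint, with an empty address disabling the server.
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	addr := flag.String("listener-metrics-addr", "0.0.0.0:8080",
		"address of the metrics server; an empty string disables it")
	endpoint := flag.String("listener-metrics-endpoint", "/metrics",
		"path on which metrics are exposed")
	flag.Parse()

	if *addr == "" {
		log.Println("metrics server disabled")
		return
	}

	mux := http.NewServeMux()
	mux.Handle(*endpoint, promhttp.Handler())
	log.Printf("serving metrics on %s%s", *addr, *endpoint)
	log.Fatal(http.ListenAndServe(*addr, mux))
}
```

The controller's `--metrics-addr` flag would follow the same pattern, with the
endpoint fixed to `/metrics`.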
### Metrics exposed by the controller

To get a better understanding of the health and workings of the cluster
resources, we need to expose the following metrics:

- `pending_ephemeral_runners` - Number of ephemeral runners in a pending state.
  This information shows the latency between creating an `EphemeralRunner`
  resource and having an ephemeral runner pod started and ready to receive a
  job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running.
  This information is helpful to see how many ephemeral runner pods are running
  at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed` state.
  This information is helpful to catch a faulty image or some other underlying
  problem. When the ephemeral runner controller is not able to start the
  ephemeral runner pod after multiple retries, it sets the state of the
  `EphemeralRunner` to `Failed`. Since the controller cannot recover from this
  state, it can be useful to set Prometheus alerts to catch this issue quickly.

### Metrics exposed by the `AutoscalingListener`

Since the listener is responsible for communicating the state with the Actions
service, it can expose data related to the Actions service through metrics. In
particular:

- `available_jobs` - Number of jobs with `runs-on` matching the runner scale
  set name. These jobs are acquired by the runner scale set but not yet
  assigned.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners running a job.
- `min_runners` - Minimum number of runners configured for the scale set.
- `max_runners` - Maximum number of runners configured for the scale set.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set
  (`runs-on` matches and the scale set passes all the runner group permission
  checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get
  assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to get
  started on the runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the
  scale set (in seconds).

### Metric names

Listener metrics belong to the `github_runner_scale_set` subsystem, so their
names have the `github_runner_scale_set_` prefix.

Controller metrics belong to the `github_runner_scale_set_controller`
subsystem, so their names have the `github_runner_scale_set_controller_`
prefix.

## Consequences

Users can define alerts and monitor the behavior of both the Actions-based
metrics (gathered from the listener) and the Kubernetes resource-based metrics
(gathered from the controller manager).
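To make the naming convention concrete, here is a hedged sketch of how the
subsystem prefixes could be produced with `prometheus/client_golang`. The two
example metrics are taken from this document for illustration; their label
sets and the actual registration code in ARC are assumptions that may differ.

```go
// Sketch of the naming convention: the Subsystem field of the metric options
// yields the github_runner_scale_set_ and github_runner_scale_set_controller_
// prefixes described above. Labels are illustrative assumptions.
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	// Exposed by the listener as github_runner_scale_set_assigned_jobs.
	assignedJobs = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Subsystem: "github_runner_scale_set",
			Name:      "assigned_jobs",
			Help:      "Number of jobs assigned to this scale set.",
		},
		[]string{"name", "namespace"},
	)

	// Exposed by the controller as
	// github_runner_scale_set_controller_pending_ephemeral_runners.
	pendingEphemeralRunners = prometheus.NewGaugeVec(
		prometheus.GaugeOpts{
			Subsystem: "github_runner_scale_set_controller",
			Name:      "pending_ephemeral_runners",
			Help:      "Number of ephemeral runners in a pending state.",
		},
		[]string{"namespace"},
	)
)

func init() {
	prometheus.MustRegister(assignedJobs, pendingEphemeralRunners)
}
```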