# Exposing metrics

Date: 2023-05-08

**Status**: Proposed

## Context

Prometheus metrics are a common way to monitor a Kubernetes cluster. Exposing
them is a helpful way to monitor scale sets and the health of their ephemeral
runners.

## Proposal

Two main components drive the behavior of the scale set:

1. The ARC controllers, responsible for managing Kubernetes resources.
2. The `AutoscalingListener`, the driver of the autoscaling solution,
   responsible for describing the desired state.

We can approach publishing those metrics in three different ways.

### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener

To expose metrics, we would need to create three additional resources:

1. `ServiceMonitor` - a resource used by Prometheus to match the namespaces and
   services from which it needs to gather metrics.
2. `Service` for the `gha-runner-scale-set-controller` - a service targeting
   the ARC controller `Deployment`.
3. `Service` for each `gha-runner-scale-set` listener - a service targeting the
   single listener pod of each `AutoscalingRunnerSet` (a sketch of such a
   service follows the list).

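The per-listener `Service` is the piece that multiplies with the number of
scale sets. Below is a minimal Go sketch of what the controller would have to
create; the `runner-scale-set-listener` pod label and the metrics port 8080 are
assumptions made for illustration, not the labels the controller actually
stamps on the listener pod.

```go
// Sketch of the per-listener Service from point 3 above.
package option1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// listenerMetricsService builds the Service that would sit in front of the
// single listener pod of one AutoscalingRunnerSet.
func listenerMetricsService(namespace, listenerName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      listenerName + "-metrics",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Assumed selector label; the controller would have to put a
			// matching label on the listener pod it creates.
			Selector: map[string]string{"runner-scale-set-listener": listenerName},
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}
```

The `ServiceMonitor` would then select these services by label, which is where
the resource count in the cons below comes from.
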
#### Pros

- Easy to control which scale set exposes metrics and which does not.
- Easy to implement using Helm charts, in case metrics are enabled per chart
  installation.

#### Cons

- With a cluster running many scale sets, we are going to create a lot of
  resources.
- In case metrics are enabled at the controller-manager level and should be
  applied across all `AutoscalingRunnerSets`, it is difficult to inherit this
  configuration when applying Helm charts.

### Option 2: Create a single metrics aggregator service

To create an aggregator service, we can build a simple web application
responsible for gathering and publishing metrics. All listeners would be
responsible for communicating their metrics on each message, and controllers
for communicating theirs on each reconciliation.

The application can run as a single pod, or as a sidecar container next to the
manager.

#### Running the aggregator as a container in the controller-manager pod

**Pros**

- It sits side by side with the controller manager and follows its life cycle.
- We don't need to introduce another controller managing the state of the pod.

**Cons**

- Crashes of the aggregator can influence the execution of the controller
  manager.
- The controller-manager pod needs more resources to run.

#### Running the aggregator in a separate pod

**Pros**

- Does not influence the controller-manager pod.
- The life cycle of the aggregator can be controlled by the controller manager
  (by implementing another controller).

**Cons**

- We need to implement a controller that can spin the aggregator back up in
  case of a crash.
- If we choose not to implement such a controller, a resource like `Deployment`
  can be used to manage the aggregator, but we lose control over its life
  cycle.

#### Metrics webserver requirements

1. Create a web server with a single `/metrics` endpoint that registers both
   `POST` and `GET` handlers. `GET` is used by Prometheus to fetch the metrics,
   while `POST` is used by controllers and listeners to publish theirs (a
   sketch follows this list).
2. `ServiceMonitor` - to target the metrics aggregator service.
3. `Service` sitting in front of the web server.

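A minimal Go sketch of such an aggregator, assuming the publishers push a
pre-rendered Prometheus text payload and identify themselves with a
hypothetical `publisher` query parameter (both are illustrative choices, not a
settled design):

```go
// Minimal sketch of the metrics aggregator web server described above.
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

type aggregator struct {
	mu      sync.Mutex
	metrics map[string][]byte // latest payload per publisher
}

func (a *aggregator) handleMetrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		// Controllers and listeners POST their current metrics; the publisher
		// name is taken from an assumed query parameter.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.metrics[r.URL.Query().Get("publisher")] = body
		a.mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	case http.MethodGet:
		// Prometheus scrapes the aggregated state.
		a.mu.Lock()
		defer a.mu.Unlock()
		for _, payload := range a.metrics {
			w.Write(payload)
		}
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	a := &aggregator{metrics: map[string][]byte{}}
	http.HandleFunc("/metrics", a.handleMetrics)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```
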
**Pros**

- This implementation requires only a few additional resources to be created
  in the cluster.
- The web server is easy to implement and easy to document - all metrics are
  aggregated in a single package, and the web server only needs to apply them
  to its state on `POST`. The `GET` handler is simple.
- We can avoid using Prometheus's Pushgateway.

**Cons**

- Another image that we need to build and publish on each release.
- A change in metrics configuration (on a manager update) would require
  re-creating all listeners. This is not a big problem, but it is something to
  point out.
- Managing resource requests/limits for the aggregator can be tricky.

### Option 3: Use a Prometheus Pushgateway

#### Pros

- Uses a supported way of pushing metrics.
- Easy to implement using their client library (see the sketch below).

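For illustration, a Go sketch of a listener-side push using the official
`client_golang` push package; the Pushgateway URL, job name, and grouping label
are placeholders, not a proposed configuration:

```go
// Sketch of pushing a single listener metric to a Prometheus Pushgateway.
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	assignedJobs := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "github_runner_scale_set_assigned_jobs",
		Help: "Number of jobs assigned to this scale set.",
	})
	assignedJobs.Set(3)

	// Push replaces all metrics for this job/grouping on the gateway.
	if err := push.New("http://pushgateway:9091", "arc-listener").
		Collector(assignedJobs).
		Grouping("scale_set", "my-scale-set").
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```
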
#### Cons

- The Prometheus docs state that "usually, the only valid use case for the
  Pushgateway is for capturing the outcome of a service-level batch job". The
  listener does not really fit this criterion.
- The Pushgateway is a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up`
  metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to
  Prometheus forever unless those series are manually deleted via the
  Pushgateway's API.

## Decision

Since there are many ways in which metrics can be collected, we have decided
not to apply `prometheus-operator` resources or a `Service`.

The responsibility of the controller and the autoscaling listener is only to
expose metrics. It is up to the user to decide how to collect them.

When installing ARC, the configuration of the metrics servers for both the
controller manager and the autoscaling listeners is established.

### Controller metrics

By default, the metrics server listens on `0.0.0.0:8080`, and metrics can be
collected from the `/metrics` endpoint.

You can control the address (and port) of the metrics server using the
`--metrics-addr` flag.

If the value of `--metrics-addr` is an empty string, the metrics server won't
be started.

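A sketch of how the flag could map onto the manager setup, assuming a pre-v0.16
`controller-runtime` (where `Options.MetricsBindAddress` configures the metrics
server and `"0"` means disabled); this is illustrative wiring, not the
controller's actual `main.go`:

```go
// Illustrative wiring of the --metrics-addr flag.
package main

import (
	"flag"

	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	var metricsAddr string
	flag.StringVar(&metricsAddr, "metrics-addr", "0.0.0.0:8080",
		"Address the metrics server binds to; an empty string disables it.")
	flag.Parse()

	if metricsAddr == "" {
		// controller-runtime's convention for "do not start the metrics server".
		metricsAddr = "0"
	}

	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		MetricsBindAddress: metricsAddr,
	})
	if err != nil {
		panic(err)
	}

	// Controllers would be registered here before calling mgr.Start(...).
	_ = mgr
}
```
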
### Autoscaling listeners

By default, the metrics server listens on `0.0.0.0:8080`, and the endpoint used
to expose metrics is `/metrics`.

You can control both the address and the endpoint using the
`--listener-metrics-addr` and `--listener-metrics-endpoint` flags.

If the value of `--listener-metrics-addr` is an empty string, the metrics
server won't be started.

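A Go sketch of the listener-side wiring under these flags, using the standard
`promhttp` handler; the flag names and defaults follow this ADR, the rest is
illustrative:

```go
// Sketch of the listener's metrics server.
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	addr := flag.String("listener-metrics-addr", "0.0.0.0:8080",
		"Address the metrics server listens on; empty disables it.")
	endpoint := flag.String("listener-metrics-endpoint", "/metrics",
		"HTTP path the metrics are served on.")
	flag.Parse()

	if *addr == "" {
		log.Println("metrics server disabled")
		return
	}

	mux := http.NewServeMux()
	mux.Handle(*endpoint, promhttp.Handler())
	log.Fatal(http.ListenAndServe(*addr, mux))
}
```
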
### Metrics exposed by the controller

To get a better understanding of the health and workings of the cluster
resources, we need to expose the following metrics (a registration sketch
follows the list):

- `pending_ephemeral_runners` - Number of ephemeral runners in a pending state.
  This information can show the latency between creating an `EphemeralRunner`
  resource and having an ephemeral runner pod started and ready to receive a
  job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running.
  This information is helpful to see how many ephemeral runner pods are running
  at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed` state.
  This information is helpful to catch a faulty image or some underlying
  problem. When the ephemeral runner controller is not able to start the
  ephemeral runner pod after multiple retries, it sets the state of the
  `EphemeralRunner` to `Failed`. Since the controller cannot recover from this
  state, it can be useful to set up Prometheus alerts to catch this issue
  quickly.

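A sketch of how these gauges could be defined with `prometheus/client_golang`;
the label names are assumptions for illustration, not the final label set:

```go
// Illustrative registration of the controller gauges listed above.
package controllermetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	pendingEphemeralRunners = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "pending_ephemeral_runners",
		Help:      "Number of ephemeral runners in a pending state.",
	}, []string{"name", "namespace"}) // assumed labels

	runningEphemeralRunners = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "running_ephemeral_runners",
		Help:      "Number of ephemeral runners currently running.",
	}, []string{"name", "namespace"})

	failedEphemeralRunners = prometheus.NewGaugeVec(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "failed_ephemeral_runners",
		Help:      "Number of ephemeral runners in a Failed state.",
	}, []string{"name", "namespace"})
)

func init() {
	// With controller-runtime, these would typically be registered against
	// sigs.k8s.io/controller-runtime/pkg/metrics.Registry instead.
	prometheus.MustRegister(pendingEphemeralRunners, runningEphemeralRunners, failedEphemeralRunners)
}
```
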
### Metrics exposed by the `AutoscalingListener`

Since the listener is responsible for communicating state with the actions
service, it can expose actions-service-related data through metrics. In
particular (a definition sketch follows the list):

- `available_jobs` - Number of jobs with `runs-on` matching the runner scale
  set name. Jobs are not yet assigned but are acquired by the runner scale set.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners currently running a job.
- `min_runners` - Minimum number of runners configured for the scale set.
- `max_runners` - Maximum number of runners configured for the scale set.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set
  (`runs-on` matches and the scale set passes all the runner group permission
  checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get
  assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to get
  started on the runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the
  scale set (in seconds).

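A sketch showing one metric of each type from the list above (gauge, counter,
histogram); the histogram buckets are an assumption made for illustration:

```go
// Illustrative definitions for a few of the listener metrics.
package listenermetrics

import "github.com/prometheus/client_golang/prometheus"

var (
	assignedJobs = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "assigned_jobs",
		Help:      "Number of jobs assigned to this scale set.",
	})

	completedJobsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "completed_jobs_total",
		Help:      "Total number of jobs completed.",
	})

	jobStartupDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "job_startup_duration_seconds",
		Help:      "Time spent waiting for a workflow job to get started on the runner owned by the scale set (in seconds).",
		Buckets:   prometheus.ExponentialBuckets(1, 2, 12), // 1s up to ~2048s, assumed
	})
)

func init() {
	prometheus.MustRegister(assignedJobs, completedJobsTotal, jobStartupDurationSeconds)
}
```
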
### Metric names

Listener metrics belong to the `github_runner_scale_set` subsystem, so their
names are going to have the `github_runner_scale_set_` prefix.

Controller metrics belong to the `github_runner_scale_set_controller`
subsystem, so their names are going to have the
`github_runner_scale_set_controller_` prefix.

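With `prometheus/client_golang`, the prefix falls out of the `Subsystem` field;
`prometheus.BuildFQName` shows how the parts are joined. A small sketch, not
ARC's actual registration code:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// Prints "github_runner_scale_set_assigned_jobs".
	fmt.Println(prometheus.BuildFQName("", "github_runner_scale_set", "assigned_jobs"))
	// Prints "github_runner_scale_set_controller_pending_ephemeral_runners".
	fmt.Println(prometheus.BuildFQName("", "github_runner_scale_set_controller", "pending_ephemeral_runners"))
}
```
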
## Consequences

Users can define alerts and monitor the behavior of both the actions-based
metrics (gathered from the listener) and the Kubernetes resource-based metrics
(gathered from the controller manager).