# Exposing metrics

**Date**: 2023-05-08

**Status**: Proposed
## Context
Prometheus metrics are a common way to monitor a cluster. Exposing metrics from ARC helps users monitor scale sets and the health of their ephemeral runners.
## Proposal
Two main components drive the behavior of the scale set:
- ARC controllers, responsible for managing Kubernetes resources.
- The `AutoscalingListener`, the driver of the autoscaling solution, responsible for describing the desired state.
We can approach publishing these metrics in three different ways.
### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener
To expose metrics, we would need to create three additional resources:
- `ServiceMonitor` - a resource used by Prometheus to match the namespaces and services from which it needs to gather metrics.
- `Service` for the `gha-runner-scale-set-controller` - a service targeting the ARC controller `Deployment`.
- `Service` for each `gha-runner-scale-set` listener - a service targeting the single listener pod of each `AutoscalingRunnerSet` (see the sketch below).
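As an illustration of what the per-listener resources imply, here is a minimal, hypothetical sketch of how such a `Service` could be built with the Kubernetes Go API types. It is not the actual chart template, and the label key used in the selector is an assumption; the real listener pods would need a stable label to target.

```go
package option1

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// listenerMetricsService sketches the per-listener Service described above.
// One of these would be needed for every AutoscalingRunnerSet in the cluster.
func listenerMetricsService(scaleSet, namespace string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      scaleSet + "-listener-metrics",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Hypothetical label: the listener pod must carry a selectable label.
			Selector: map[string]string{"actions.github.com/scale-set-name": scaleSet},
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}
```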
**Pros**
- Easy to control which scale set exposes metrics and which does not.
- Easy to implement using Helm charts if metrics are enabled per chart installation.
**Cons**
- With a cluster running many scale sets, we are going to create a lot of resources.
- If metrics are enabled at the controller-manager level and should apply across all `AutoscalingRunnerSets`, it is difficult to inherit that configuration through the Helm charts.
### Option 2: Create a single metrics aggregator service
The aggregator service would be a simple web application responsible for gathering and publishing metrics. All listeners would be responsible for communicating their metrics on each message, and controllers for communicating theirs on each reconciliation.
The application can be executed as a single pod, or as a sidecar container next to the manager.
#### Running the aggregator as a container in the controller-manager pod
**Pros**
- It lives side by side with the controller manager and follows its life cycle.
- We don't need to introduce another controller managing the state of the pod.
**Cons**
- Crashes of the aggregator can affect the controller manager's execution.
- The controller-manager pod needs more resources to run.
#### Running the aggregator in a separate pod
**Pros**
- Does not influence the controller-manager pod.
- The life cycle of the metrics aggregator can be controlled by the controller manager (by implementing another controller).
**Cons**
- We need to implement a controller that can spin the aggregator back up in case of a crash.
- If we choose not to implement the controller, a resource like `Deployment` can be used to manage the aggregator, but we lose control over its life cycle.
#### Metrics web server requirements
- Create a web server with a single `/metrics` endpoint. The endpoint will have both `POST` and `GET` methods registered. `GET` is used by Prometheus to fetch the metrics, while `POST` is used by controllers and listeners to publish their metrics (see the sketch below).
- `ServiceMonitor` - to target the metrics aggregator service.
- `Service` sitting in front of the web server.
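A minimal sketch of such an aggregator, assuming a deliberately simple, hypothetical wire format in which each reporter POSTs its metrics in the Prometheus text format and identifies itself with an `X-Reporter` header. A real implementation would need to define its own payload and authentication.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"sync"
)

// aggregator keeps the latest text-format metrics payload pushed by each reporter.
type aggregator struct {
	mu    sync.Mutex
	state map[string]string
}

func (a *aggregator) metrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		// Controllers and listeners publish their current metrics here.
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.state[r.Header.Get("X-Reporter")] = string(body) // hypothetical reporter identity
		a.mu.Unlock()
		w.WriteHeader(http.StatusAccepted)
	case http.MethodGet:
		// Prometheus scrapes the aggregated state here.
		a.mu.Lock()
		defer a.mu.Unlock()
		for _, payload := range a.state {
			io.WriteString(w, payload)
		}
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	a := &aggregator{state: map[string]string{}}
	http.HandleFunc("/metrics", a.metrics)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```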
**Pros**
- This implementation requires only a few additional resources to be created in the cluster.
- The web server is easy to implement and easy to document - all metrics are aggregated in a single package, and the web server only needs to apply them to its state on `POST`. The `GET` handler is simple.
- We can avoid using Prometheus's Pushgateway.
**Cons**
- Another image that we need to publish on release.
- A change in metric configuration (on manager update) would require re-creating all listeners. This is not a big problem, but it is something to point out.
- Managing requests/limits can be tricky.
### Option 3: Use a Prometheus Pushgateway
**Pros**
- Uses a supported way of pushing metrics.
- Easy to implement using the Prometheus client library (see the sketch below).
**Cons**
- The Prometheus docs specify that "usually, the only valid use case for Pushgateway is for capturing the outcome of a service-level batch job". The listener does not really fit this criterion.
- Pushgateway is a single point of failure and potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up` metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to Prometheus forever unless those series are manually deleted via the Pushgateway's API.
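For reference, pushing from Go with the official client library (`github.com/prometheus/client_golang/prometheus/push`) is indeed straightforward. A short, hypothetical sketch; the gateway URL, job name, and grouping label are placeholders:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	assignedJobs := prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "github_runner_scale_set_assigned_jobs",
		Help: "Number of jobs assigned to this scale set.",
	})
	assignedJobs.Set(3)

	// Push replaces all metrics for this job/grouping on the gateway.
	if err := push.New("http://pushgateway:9091", "arc_listener").
		Collector(assignedJobs).
		Grouping("scale_set", "my-scale-set").
		Push(); err != nil {
		log.Fatalf("could not push to Pushgateway: %v", err)
	}
}
```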
## Decision
Since there are many ways to collect metrics, we have decided not to apply prometheus-operator resources or a `Service`.
The responsibility of the controller and the autoscaling listener is only to expose metrics. It is up to the user to decide how to collect them.
When installing ARC, you configure the metrics servers of both the controller manager and the autoscaling listeners.
### Controller metrics
By default, the metrics server listens on `0.0.0.0:8080`.
You can control the address the metrics server binds to using the `--metrics-addr` flag.
Metrics can be collected from the `/metrics` endpoint.
If the value of `--metrics-addr` is an empty string, the metrics server won't be started.
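A minimal sketch of what this behavior implies, assuming the flag is wired roughly like this in the manager's entry point. This illustrates the flag semantics described above, not ARC's actual main function:

```go
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	var metricsAddr string
	flag.StringVar(&metricsAddr, "metrics-addr", "0.0.0.0:8080",
		"address the metrics server binds to; an empty string disables the server")
	flag.Parse()

	if metricsAddr == "" {
		return // metrics server won't be started
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler()) // serve the default registry on /metrics
	log.Fatal(http.ListenAndServe(metricsAddr, mux))
}
```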
### Autoscaling listeners
By default, the metrics server listens on `0.0.0.0:8080`.
The endpoint used to expose metrics is `/metrics`.
You can control both the address and the endpoint using the `--listener-metrics-addr` and `--listener-metrics-endpoint` flags.
If the value of `--listener-metrics-addr` is an empty string, the metrics server won't be started.
### Metrics exposed by the controller
To get a better understanding of the health and behavior of the cluster resources, we need to expose the following metrics (see the definition sketch below):
- `pending_ephemeral_runners` - Number of ephemeral runners in a pending state. This information can show the latency between creating an `EphemeralRunner` resource and having an ephemeral runner pod started and ready to receive a job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running. This information is helpful to see how many ephemeral runner pods are running at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed` state. This information is helpful to catch a faulty image or some underlying problem. When the ephemeral runner controller is not able to start the ephemeral runner pod after multiple retries, it will set the state of the `EphemeralRunner` to failed. Since the controller cannot recover from this state, it can be useful to set Prometheus alerts to catch this issue quickly.
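A hedged sketch of how these gauges could be defined with the Prometheus Go client; the variable names and Help strings are illustrative, not taken from the ARC codebase, and the subsystem follows the naming described in the "Metric names" section below:

```go
package metrics

import "github.com/prometheus/client_golang/prometheus"

var (
	pendingEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "pending_ephemeral_runners",
		Help:      "Number of ephemeral runners in a pending state.",
	})
	runningEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "running_ephemeral_runners",
		Help:      "Number of ephemeral runners currently running.",
	})
	failedEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "failed_ephemeral_runners",
		Help:      "Number of ephemeral runners in a Failed state.",
	})
)

// init registers the gauges with the default registry; reconcilers would then
// Set() them on each reconciliation.
func init() {
	prometheus.MustRegister(pendingEphemeralRunners, runningEphemeralRunners, failedEphemeralRunners)
}
```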
### Metrics exposed by the `AutoscalingListener`
Since the listener is responsible for communicating state with the actions service, it can expose data related to the actions service through metrics. In particular (see the recording sketch below):
- `available_jobs` - Number of jobs with `runs-on` matching the runner scale set name. These jobs are not yet assigned, but are acquired by the runner scale set.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners currently running a job.
- `min_runners` - Minimum number of runners configured for the scale set.
- `max_runners` - Maximum number of runners configured for the scale set.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set (`runs-on` matches and the scale set passes all the runner group permission checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to get started on a runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the scale set (in seconds).
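As an example of how the duration metrics could be recorded, a hedged sketch; the function and its arguments are hypothetical, not the listener's real API:

```go
package listener

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var jobQueueDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Subsystem: "github_runner_scale_set",
	Name:      "job_queue_duration_seconds",
	Help:      "Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).",
})

func init() { prometheus.MustRegister(jobQueueDuration) }

// onJobAssigned would be called when the actions service reports a job as assigned.
func onJobAssigned(queuedAt, assignedAt time.Time) {
	jobQueueDuration.Observe(assignedAt.Sub(queuedAt).Seconds())
}
```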
### Metric names
Listener metrics belong to the `github_runner_scale_set` subsystem, so their names have the `github_runner_scale_set_` prefix.
Controller metrics belong to the `github_runner_scale_set_controller` subsystem, so their names have the `github_runner_scale_set_controller_` prefix.
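These prefixes follow directly from how the Prometheus Go client builds fully qualified metric names from the subsystem; a quick illustration:

```go
package main

import (
	"fmt"

	"github.com/prometheus/client_golang/prometheus"
)

func main() {
	// BuildFQName joins namespace, subsystem and name with underscores, skipping empty parts.
	fmt.Println(prometheus.BuildFQName("", "github_runner_scale_set", "assigned_jobs"))
	// github_runner_scale_set_assigned_jobs
	fmt.Println(prometheus.BuildFQName("", "github_runner_scale_set_controller", "pending_ephemeral_runners"))
	// github_runner_scale_set_controller_pending_ephemeral_runners
}
```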
## Consequences
Users can define alerts and monitor both the actions-based metrics (gathered from the listener) and the Kubernetes resource-based metrics (gathered from the controller manager).