# Exposing metrics

Date: 2023-05-08

**Status**: Proposed

## Context

Prometheus metrics are a common way to monitor the cluster. Providing metrics
can be a helpful way to monitor scale sets and the health of the ephemeral
runners.

## Proposal

Two main components drive the behavior of the scale set:

1. ARC controllers responsible for managing Kubernetes resources.
2. The `AutoscalingListener`, the driver of the autoscaling solution,
   responsible for describing the desired state.

We can approach publishing these metrics in three different ways:

### Option 1: Expose a metrics endpoint for the controller-manager and every instance of the listener

To expose metrics, we would need to create three additional resources (a
sketch of the per-listener `Service` follows the list):

1. `ServiceMonitor` - a resource used by Prometheus to match the namespaces
   and services from which it needs to gather metrics.
2. `Service` for the `gha-runner-scale-set-controller` - a service that
   targets the ARC controller `Deployment`.
3. `Service` for each `gha-runner-scale-set` listener - a service that targets
   the single listener pod of each `AutoscalingRunnerSet`.
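
For illustration, a minimal client-go sketch of the per-listener `Service`
from item 3. The selector label, service name, and port are hypothetical
placeholders; the actual values depend on the labels ARC puts on listener pods.

```go
package main

import (
	"encoding/json"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
)

// listenerMetricsService builds a Service that targets the single listener
// pod of one AutoscalingRunnerSet, exposing its metrics port to Prometheus.
func listenerMetricsService(namespace, scaleSetName string) *corev1.Service {
	return &corev1.Service{
		ObjectMeta: metav1.ObjectMeta{
			Name:      scaleSetName + "-listener-metrics",
			Namespace: namespace,
		},
		Spec: corev1.ServiceSpec{
			// Hypothetical selector; it must match the listener pod's labels.
			Selector: map[string]string{"auto-scaling-runner-set-name": scaleSetName},
			Ports: []corev1.ServicePort{{
				Name:       "metrics",
				Port:       8080,
				TargetPort: intstr.FromInt(8080),
			}},
		},
	}
}

func main() {
	// Print the object; in a controller this would be created via client-go.
	out, _ := json.MarshalIndent(listenerMetricsService("arc-runners", "my-scale-set"), "", "  ")
	fmt.Println(string(out))
}
```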

#### Pros

- Easy to control which scale sets expose metrics and which do not.
- Easy to implement using Helm charts when metrics are enabled per chart
  installation.

#### Cons

- With a cluster running many scale sets, we are going to create a lot of
  resources.
- If metrics are enabled at the controller-manager level and should apply
  across all `AutoscalingRunnerSets`, it is difficult to inherit this
  configuration through Helm charts.

### Option 2: Create a single metrics aggregator service

To create an aggregator service, we can build a simple web application
responsible for gathering and publishing metrics. All listeners would be
responsible for reporting their metrics on each message, and controllers for
reporting theirs on each reconciliation.

The application can run as a standalone pod, or as a sidecar container next to
the manager.

#### Running the aggregator as a container in the controller-manager pod

**Pros**

- It lives side by side with the controller manager and follows its life
  cycle.
- We don't need to introduce another controller to manage the state of the
  pod.

**Cons**

- Crashes of the aggregator can affect the execution of the controller
  manager.
- The controller manager pod needs more resources to run.

#### Running the aggregator in a separate pod

**Pros**

- Does not influence the controller manager pod.
- The life cycle of the aggregator can be controlled by the controller manager
  (by implementing another controller).

**Cons**

- We need to implement a controller that can spin the aggregator back up in
  case of a crash.
- If we choose not to implement such a controller, a resource like a
  `Deployment` can be used to manage the aggregator, but we lose control over
  its life cycle.

#### Metrics webserver requirements

1. A web server with a single `/metrics` endpoint, with both `POST` and `GET`
   methods registered. `GET` is used by Prometheus to fetch the metrics, while
   `POST` is used by controllers and listeners to publish theirs (see the
   sketch after this list).
2. A `ServiceMonitor` targeting the metrics aggregator service.
3. A `Service` sitting in front of the web server.
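
A minimal sketch of such an aggregator, assuming pushers identify themselves
with a hypothetical `source` query parameter; the real payload format and
routing would need to be designed:

```go
package main

import (
	"io"
	"net/http"
	"sync"
)

// aggregator stores the latest metrics payload pushed by each source
// (listener or controller) and serves the concatenation to Prometheus.
type aggregator struct {
	mu      sync.Mutex
	sources map[string][]byte
}

func (a *aggregator) metrics(w http.ResponseWriter, r *http.Request) {
	switch r.Method {
	case http.MethodPost:
		// "source" is a hypothetical identifier for the pushing component.
		src := r.URL.Query().Get("source")
		body, err := io.ReadAll(r.Body)
		if err != nil || src == "" {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		a.mu.Lock()
		a.sources[src] = body
		a.mu.Unlock()
	case http.MethodGet:
		a.mu.Lock()
		for _, payload := range a.sources {
			w.Write(payload)
		}
		a.mu.Unlock()
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

func main() {
	agg := &aggregator{sources: map[string][]byte{}}
	http.HandleFunc("/metrics", agg.metrics)
	http.ListenAndServe(":8080", nil) // fronted by a Service + ServiceMonitor
}
```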

**Pros**

- This implementation requires only a few additional resources to be created
  in the cluster.
- The web server is easy to implement and easy to document - all metrics are
  aggregated in a single package, and the web server only needs to apply them
  to its state on `POST`. The `GET` handler is simple.
- We can avoid using the Prometheus Pushgateway.

**Cons**

- Another image that we need to publish on each release.
- A change in metric configuration (on a manager update) would require
  re-creating all listeners. This is not a big problem, but it is something to
  point out.
- Managing requests/limits for the aggregator can be tricky.

### Option 3: Use a Prometheus Pushgateway

#### Pros

- Uses a supported way of pushing metrics.
- Easy to implement using the official client library (see the sketch below).
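
For example, pushing a listener gauge via the `push` package of
`prometheus/client_golang` might look like the following sketch; the
Pushgateway URL, job name, and grouping label are hypothetical:

```go
package main

import (
	"log"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/push"
)

func main() {
	assigned := prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "assigned_jobs",
		Help:      "Number of jobs assigned to this scale set.",
	})
	assigned.Set(3)

	// Push the current value to a Pushgateway; the URL, job name, and
	// grouping label are placeholders for this sketch.
	err := push.New("http://pushgateway:9091", "arc_listener").
		Collector(assigned).
		Grouping("scale_set", "my-scale-set").
		Push()
	if err != nil {
		log.Fatalf("push failed: %v", err)
	}
}
```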

#### Cons

- The Prometheus docs specify that: "Usually, the only valid use case for
  Pushgateway is for capturing the outcome of a service-level batch job". The
  listener does not really fit this criterion.
- The Pushgateway is a single point of failure and a potential bottleneck.
- You lose Prometheus's automatic instance health monitoring via the `up`
  metric (generated on every scrape).
- The Pushgateway never forgets series pushed to it and will expose them to
  Prometheus forever unless those series are manually deleted via the
  Pushgateway's API.

## Decision

Since there are many ways in which metrics can be collected, we have decided
not to ship `prometheus-operator` resources nor a `Service`.

The responsibility of the controller and the autoscaling listener is only to
expose metrics. It is up to the user to decide how to collect them.

When installing ARC, the configuration for both the controller manager's and
the autoscaling listeners' metrics servers is established.

### Controller metrics

By default, the metrics server listens on `0.0.0.0:8080`.
You can control the bind address (and port) of the metrics server using the
`--metrics-addr` flag.

Metrics can be collected from the `/metrics` endpoint.

If the value of `--metrics-addr` is an empty string, the metrics server won't
be started (a sketch of this flag handling follows).
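
A minimal sketch of the described flag handling, assuming the standard `flag`
package and the `promhttp` handler from `prometheus/client_golang`; the
variable names are illustrative:

```go
package main

import (
	"flag"
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	metricsAddr := flag.String("metrics-addr", "0.0.0.0:8080",
		"Address the metrics server binds to. Empty string disables the server.")
	flag.Parse()

	// An empty --metrics-addr means the metrics server is not started.
	if *metricsAddr == "" {
		log.Println("metrics server disabled")
		return
	}

	mux := http.NewServeMux()
	mux.Handle("/metrics", promhttp.Handler())
	log.Printf("serving metrics on %s/metrics", *metricsAddr)
	log.Fatal(http.ListenAndServe(*metricsAddr, mux))
}
```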

### Autoscaling listeners

By default, the metrics server listens on `0.0.0.0:8080`.
The endpoint used to expose metrics is `/metrics`.

You can control both the address and the endpoint using the
`--listener-metrics-addr` and `--listener-metrics-endpoint` flags.

If the value of `--listener-metrics-addr` is an empty string, the metrics
server won't be started.

### Metrics exposed by the controller

To get a better understanding of the health and workings of the cluster
resources, we need to expose the following metrics (a registration sketch
follows the list):

- `pending_ephemeral_runners` - Number of ephemeral runners in a pending
  state. This information can show the latency between creating an
  `EphemeralRunner` resource, and having an ephemeral runner pod started and
  ready to receive a job.
- `running_ephemeral_runners` - Number of ephemeral runners currently running.
  This information is helpful to see how many ephemeral runner pods are
  running at any given time.
- `failed_ephemeral_runners` - Number of ephemeral runners in a `Failed`
  state. This information is helpful to catch a faulty image or some
  underlying problem. When the ephemeral runner controller is not able to
  start the ephemeral runner pod after multiple retries, it sets the state of
  the `EphemeralRunner` to failed. Since the controller cannot recover from
  this state, it can be useful to set Prometheus alerts to catch this issue
  quickly.
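
A minimal sketch of how these gauges could be registered with
`prometheus/client_golang`, using the subsystem named in the "Metric names"
section below; the help strings are paraphrased:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	pendingEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "pending_ephemeral_runners",
		Help:      "Number of ephemeral runners in a pending state.",
	})
	runningEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "running_ephemeral_runners",
		Help:      "Number of ephemeral runners currently running.",
	})
	failedEphemeralRunners = prometheus.NewGauge(prometheus.GaugeOpts{
		Subsystem: "github_runner_scale_set_controller",
		Name:      "failed_ephemeral_runners",
		Help:      "Number of ephemeral runners in a Failed state.",
	})
)

func main() {
	// Exposed names get the subsystem prefix, e.g.
	// github_runner_scale_set_controller_pending_ephemeral_runners.
	prometheus.MustRegister(
		pendingEphemeralRunners,
		runningEphemeralRunners,
		failedEphemeralRunners,
	)
	// The reconciler would update these on each pass, e.g.:
	pendingEphemeralRunners.Set(2)
}
```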

### Metrics exposed by the `AutoscalingListener`

Since the listener is responsible for communicating the state with the actions
service, it can expose actions-service-related data through metrics (a sketch
of the counter and duration metrics follows the list). In particular:

- `available_jobs` - Number of jobs with `runs-on` matching the runner scale
  set name. The jobs are not yet assigned but are acquired by the runner scale
  set.
- `acquired_jobs` - Number of jobs acquired by the scale set.
- `assigned_jobs` - Number of jobs assigned to this scale set.
- `running_jobs` - Number of jobs running (or about to be run).
- `registered_runners` - Number of registered runners.
- `busy_runners` - Number of registered runners running a job.
- `min_runners` - Minimum number of runners configured for the scale set.
- `max_runners` - Maximum number of runners configured for the scale set.
- `desired_runners` - Number of runners desired by the scale set.
- `idle_runners` - Number of registered runners not running a job.
- `available_jobs_total` - Total number of jobs available for the scale set
  (`runs-on` matches and the scale set passes all the runner group permission
  checks).
- `acquired_jobs_total` - Total number of jobs acquired by the scale set.
- `assigned_jobs_total` - Total number of jobs assigned to the scale set.
- `started_jobs_total` - Total number of jobs started.
- `completed_jobs_total` - Total number of jobs completed.
- `job_queue_duration_seconds` - Time spent waiting for workflow jobs to get
  assigned to the scale set after queueing (in seconds).
- `job_startup_duration_seconds` - Time spent waiting for a workflow job to
  get started on the runner owned by the scale set (in seconds).
- `job_execution_duration_seconds` - Time spent executing workflow jobs by the
  scale set (in seconds).
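
The `_total` names suggest Prometheus counters and the `_seconds` names
histograms. A minimal sketch of one of each, under the listener subsystem
named below; the bucket boundaries are illustrative:

```go
package main

import "github.com/prometheus/client_golang/prometheus"

var (
	completedJobsTotal = prometheus.NewCounter(prometheus.CounterOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "completed_jobs_total",
		Help:      "Total number of jobs completed.",
	})
	jobQueueDurationSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
		Subsystem: "github_runner_scale_set",
		Name:      "job_queue_duration_seconds",
		Help:      "Time spent waiting for workflow jobs to get assigned to the scale set after queueing (in seconds).",
		// Illustrative buckets; real boundaries would be tuned to job latencies.
		Buckets: []float64{1, 5, 15, 30, 60, 120, 300, 600},
	})
)

func main() {
	prometheus.MustRegister(completedJobsTotal, jobQueueDurationSeconds)

	// On each completed-job message from the actions service:
	completedJobsTotal.Inc()
	jobQueueDurationSeconds.Observe(12.7)
}
```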

### Metric names

Listener metrics belong to the `github_runner_scale_set` subsystem, so their
names have the `github_runner_scale_set_` prefix.

Controller metrics belong to the `github_runner_scale_set_controller`
subsystem, so their names have the `github_runner_scale_set_controller_`
prefix.

## Consequences

Users can define alerts and monitor the behavior of both the actions-based
metrics (gathered from the listener) and the Kubernetes resource-based metrics
(gathered from the controller manager).