actions-runner-controller/docs/adrs/2022-10-27-runnerscaleset-l...

57 lines
3.7 KiB
Markdown

# ADR 2022-10-27: Lifetime of RunnerScaleSet on Service
**Date**: 2022-10-27
**Status**: Done
## Context
We have created the RunnerScaleSet object and APIs around it on the GitHub Actions service for better support of any self-hosted runner auto-scale solution, like [actions-runner-controller](https://github.com/actions-runner-controller/actions-runner-controller).
The `RunnerScaleSet` object will represent a set of homogeneous self-hosted runners to the Actions service job routing system.
A `RunnerScaleSet` client (ARC) needs to communicate with the Actions service via HTTP long-poll in a certain protocol to get a workflow job successfully landed on one of its homogeneous self-hosted runners.
In this ADR, we discuss the following within the context of actions-runner-controller's new scaling mode:
- Who and how to create a RunnerScaleSet on the service?
- Who and how to delete a RunnerScaleSet on the service?
- What will happen to all the runners and jobs when the deletion happens?
## RunnerScaleSet creation
- `AutoScalingRunnerSet` custom resource controller will create the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deployment.
- The creation is via REST API on Actions service `POST _apis/runtime/runnerscalesets`
- The creation needs to use the runner registration token (admin).
- `RunnerScaleSet.Name` == `AutoScalingRunnerSet.metadata.Name`
- The created `RunnerScaleSet` will only have 1 label and it's the `RunnerScaleSet`'s name
- `AutoScalingRunnerSet` controller will store the `RunnerScaleSet.Id` as an annotation on the k8s resource for future lookup.
## RunnerScaleSet modification
- When the user patch existing `AutoScalingRunnerSet`'s RunnerScaleSet related properly, ex: `runnerGroupName`, `runnerWorkDir`, the controller needs to make an HTTP PATCH call to the `_apis/runtime/runnerscalesets/2` endpoint in order to update the object on the service.
- We will put the deployed `AutoScalingRunnerSet` resource in an error state when the user tries to patch the resource with a different `githubConfigUrl`
> Basically, you can't move a deployed `AutoScalingRunnerSet` across GitHub entity, repoA->repoB, repoA->OrgC, etc.
> We evaluated blocking the change before instead of erroring at runtime and that we decided not to go down this route because it forces us to re-introduce admission webhooks (require cert-manager).
## RunnerScaleSet deletion
- `AutoScalingRunnerSet` custom resource controller will delete the `RunnerScaleSet` object in the Actions service on any `AutoScalingRunnerSet` resource deletion.
> `AutoScalingRunnerSet` deletion will contain several steps:
>
> - Stop the listener app so no more new jobs coming and no more scaling up/down.
> - Request scale down to 0
> - Force stop all runners
> - Wait for the scale down to 0
> - Delete the `RunnerScaleSet` object from service via REST API
- The deletion is via REST API on Actions service `DELETE _apis/runtime/runnerscalesets/1`
- The deletion needs to use the runner registration token (admin).
The user's `RunnerScaleSet` will be deleted from the service by `DormantRunnerScaleSetCleanupJob` if the particular `AutoScalingRunnerSet` has not connected to the service for the past 7 days. We have a similar rule for self-hosted runners.
## Jobs and Runners on deletion
- `RunnerScaleSet` deletion will be blocked if there is any job assigned to a runner within the `RunnerScaleSet`, which has to scale down to 0 before deletion.
- Any job that has been assigned to the `RunnerScaleSet` but hasn't been assigned to a runner within the `RunnerScaleSet` will get thrown back to the queue and wait for assignment again.
- Any offline runners within the `RunnerScaleSet` will be deleted from the service side.