ADR 2022-10-27: Lifetime of RunnerScaleSet on Service

Date: 2022-10-27

Status: Done

Context

We have created the RunnerScaleSet object and its surrounding APIs on the GitHub Actions service to better support self-hosted runner autoscaling solutions such as actions-runner-controller (ARC).

The RunnerScaleSet object will represent a set of homogeneous self-hosted runners to the Actions service job routing system.

A RunnerScaleSet client (ARC) must communicate with the Actions service over HTTP long polling, following a specific protocol, for a workflow job to land successfully on one of its homogeneous self-hosted runners.

In this ADR, we discuss the following within the context of actions-runner-controller's new scaling mode:

  • Who creates a RunnerScaleSet on the service, and how?
  • Who deletes a RunnerScaleSet on the service, and how?
  • What will happen to all the runners and jobs when the deletion happens?

RunnerScaleSet creation

  • The AutoScalingRunnerSet custom resource controller will create the RunnerScaleSet object in the Actions service whenever an AutoScalingRunnerSet resource is deployed.
  • The creation is done through the Actions service REST API: POST _apis/runtime/runnerscalesets
  • The creation needs to use the runner registration token (admin).
  • RunnerScaleSet.Name == AutoScalingRunnerSet.metadata.Name
  • The created RunnerScaleSet will have exactly one label: the RunnerScaleSet's name.
  • AutoScalingRunnerSet controller will store the RunnerScaleSet.Id as an annotation on the k8s resource for future lookup.
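
The creation rules above can be sketched as a small request builder. This is a minimal illustration, not ARC's implementation: the function names, the JSON field names (`name`, `labels`), the service URL, and the annotation key are all assumptions. Only the HTTP method, the `_apis/runtime/runnerscalesets` path, and the name and label rules come from this ADR.

```python
import json

CREATE_PATH = "_apis/runtime/runnerscalesets"  # path taken from this ADR


def build_create_request(service_url: str, resource_name: str) -> dict:
    """Describe the HTTP call that would create a RunnerScaleSet.

    The body field names are hypothetical. The ADR only fixes that the
    RunnerScaleSet name equals the resource name and that its single
    label is that same name.
    """
    body = {"name": resource_name, "labels": [resource_name]}
    return {
        "method": "POST",
        "url": f"{service_url.rstrip('/')}/{CREATE_PATH}",
        "body": json.dumps(body),
    }


def annotate_with_id(annotations: dict, scale_set_id: int) -> dict:
    """Return a copy of the annotations with the RunnerScaleSet.Id recorded.

    The annotation key below is a hypothetical placeholder.
    """
    updated = dict(annotations)
    updated["example.invalid/runner-scale-set-id"] = str(scale_set_id)
    return updated
```

Storing the ID as a string matches how Kubernetes annotations work (values are always strings), so the controller would parse it back to an integer on lookup.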

RunnerScaleSet modification

  • When the user patches the RunnerScaleSet-related properties of an existing AutoScalingRunnerSet (for example, runnerGroupName or runnerWorkDir), the controller needs to make an HTTP PATCH call to the _apis/runtime/runnerscalesets/2 endpoint in order to update the object on the service.
  • We will put the deployed AutoScalingRunnerSet resource in an error state when the user tries to patch the resource with a different githubConfigUrl.

    Basically, you can't move a deployed AutoScalingRunnerSet across GitHub entities (repoA->repoB, repoA->OrgC, etc.). We evaluated blocking the change up front instead of erroring at runtime, but decided against it because it would force us to reintroduce admission webhooks (which require cert-manager).
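
The modification rules can be sketched as a pure decision function. This is an illustration under stated assumptions: the `Spec` type, its field names, and the returned dictionary shape are hypothetical, and treating the trailing path segment as the RunnerScaleSet ID is an inference from the example paths in this ADR.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical subset of an AutoScalingRunnerSet spec."""
    github_config_url: str
    runner_group_name: str
    runner_work_dir: str


def plan_update(deployed: Spec, desired: Spec, scale_set_id: int) -> dict:
    """Decide what the controller should do when the resource is patched."""
    # A deployed resource cannot move to another GitHub entity, so a
    # changed githubConfigUrl puts the resource in an error state.
    if desired.github_config_url != deployed.github_config_url:
        return {
            "action": "error",
            "reason": "githubConfigUrl cannot change on a deployed AutoScalingRunnerSet",
        }
    # Nothing relevant changed, so no call to the service is needed.
    if desired == deployed:
        return {"action": "none"}
    # Any other RunnerScaleSet-related change is pushed with HTTP PATCH.
    # Assumption: the last path segment is the RunnerScaleSet ID.
    return {
        "action": "patch",
        "method": "PATCH",
        "path": f"_apis/runtime/runnerscalesets/{scale_set_id}",
    }
```

Because the check runs at reconcile time rather than at admission time, the invalid change is still accepted by the Kubernetes API server and only surfaces afterwards as an error state, which is the trade-off described above.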

RunnerScaleSet deletion

  • The AutoScalingRunnerSet custom resource controller will delete the RunnerScaleSet object in the Actions service whenever an AutoScalingRunnerSet resource is deleted.

    AutoScalingRunnerSet deletion will contain several steps:

    • Stop the listener app so that no new jobs arrive and no further scaling up or down occurs.
    • Request scale down to 0
    • Force stop all runners
    • Wait for the scale down to 0
    • Delete the RunnerScaleSet object from service via REST API
  • The deletion is done through the Actions service REST API: DELETE _apis/runtime/runnerscalesets/1
  • The deletion needs to use the runner registration token (admin).
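
The ordering of the deletion steps can be sketched as follows. This is a minimal illustration, not the controller's code: the `InMemoryOps` class, its method names, and the polling parameters are hypothetical stand-ins for the real cluster and service calls.

```python
import time


class InMemoryOps:
    """Hypothetical stand-in for the cluster and the Actions service."""

    def __init__(self, runners: int):
        self.runners = runners
        self.calls = []

    def stop_listener(self):
        self.calls.append("stop_listener")

    def request_scale_down_to_zero(self):
        self.calls.append("request_scale_down_to_zero")

    def force_stop_runners(self):
        self.calls.append("force_stop_runners")
        self.runners = 0  # in this fake, stopping takes effect immediately

    def runner_count(self) -> int:
        return self.runners

    def delete_runner_scale_set(self):
        self.calls.append("delete_runner_scale_set")


def delete_autoscaling_runner_set(ops, poll_interval: float = 0.0, max_polls: int = 100) -> None:
    """Run the deletion steps in the order listed in this ADR."""
    ops.stop_listener()               # no new jobs, no more scaling
    ops.request_scale_down_to_zero()
    ops.force_stop_runners()
    polls = 0
    while ops.runner_count() > 0:     # wait for the scale down to 0
        polls += 1
        if polls > max_polls:
            raise TimeoutError("runners did not scale down to 0")
        time.sleep(poll_interval)
    ops.delete_runner_scale_set()     # only after no runners remain
```

Stopping the listener first matters: if it kept running, it could request a scale-up while the controller is trying to drain the runners.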

The user's RunnerScaleSet will be deleted from the service by DormantRunnerScaleSetCleanupJob if the particular AutoScalingRunnerSet has not connected to the service for the past 7 days. We have a similar rule for self-hosted runners.
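
The dormancy rule amounts to a simple time comparison. The sketch below assumes the service tracks a last-connected timestamp per RunnerScaleSet and that exactly 7 days of inactivity already counts as dormant; the function and constant names are hypothetical, and the actual DormantRunnerScaleSetCleanupJob logic is not described in this ADR.

```python
from datetime import datetime, timedelta, timezone

DORMANT_AFTER = timedelta(days=7)  # threshold stated in this ADR


def is_dormant(last_connected: datetime, now: datetime) -> bool:
    """Return True if the scale set has not connected within the threshold.

    Assumption: the boundary is inclusive (exactly 7 days is dormant).
    """
    return now - last_connected >= DORMANT_AFTER


now = datetime(2022, 11, 3, tzinfo=timezone.utc)
print(is_dormant(datetime(2022, 10, 27, tzinfo=timezone.utc), now))  # True
print(is_dormant(datetime(2022, 10, 28, tzinfo=timezone.utc), now))  # False
```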

Jobs and Runners on deletion

  • RunnerScaleSet deletion will be blocked while any job is assigned to a runner within the RunnerScaleSet; the RunnerScaleSet has to scale down to 0 before it can be deleted.
  • Any job that has been assigned to the RunnerScaleSet but not yet to a runner within it will be returned to the queue to wait for assignment again.
  • Any offline runners within the RunnerScaleSet will be deleted from the service side.
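
The three rules above can be modeled as a small function over in-memory data. This is an illustrative sketch only: the input shapes (jobs as dictionaries with an optional `runner`, runners as a name-to-online mapping) and the result keys are hypothetical and do not reflect the service's data model.

```python
def resolve_deletion(jobs: list, runners: dict) -> dict:
    """Model what happens when a RunnerScaleSet deletion is requested.

    jobs:    list of {"id": int, "runner": str or None}
    runners: mapping of runner name -> True if online, False if offline
    """
    # Rule 1: a job already running on a runner blocks the deletion.
    busy = [job["id"] for job in jobs if job["runner"] is not None]
    if busy:
        return {"deleted": False, "blocking_jobs": busy}
    # Rule 2: jobs assigned to the scale set but not to a runner are requeued.
    requeued = [job["id"] for job in jobs]
    # Rule 3: offline runners are removed on the service side.
    removed = sorted(name for name, online in runners.items() if not online)
    return {"deleted": True, "requeued_jobs": requeued, "removed_runners": removed}
```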