# ADR 2022-12-27: Pick the right runner to scale down

Date: 2022-12-27

Status: Done
## Context

- A custom resource `EphemeralRunnerSet` manages a set of custom resource `EphemeralRunner`s (a simplified type sketch follows this list).
- The `EphemeralRunnerSet` has `Replicas` in its `Spec`, and the responsibility of the `EphemeralRunnerSet_controller` is to reconcile a given `EphemeralRunnerSet` to have the same amount of `EphemeralRunner`s as `Spec.Replicas` defines.
- This means the `EphemeralRunnerSet_controller` will scale up the `EphemeralRunnerSet` by creating more `EphemeralRunner`s when `Spec.Replicas` is higher than the current amount of `EphemeralRunner`s.
- This also means the `EphemeralRunnerSet_controller` will scale down the `EphemeralRunnerSet` by finding existing `EphemeralRunner`s to delete when `Spec.Replicas` is less than the current amount of `EphemeralRunner`s.
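To make the resource relationship concrete, here is a heavily simplified Go sketch of the two custom resources. The field names are only illustrative; the real type definitions in the ARC codebase carry more fields and kubebuilder markers.

```go
// Heavily simplified, illustrative sketch of the custom resources described
// above; the real ARC type definitions carry more fields and markers.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// EphemeralRunnerSet owns a set of EphemeralRunners and declares how many it
// wants via Spec.Replicas.
type EphemeralRunnerSet struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec EphemeralRunnerSetSpec `json:"spec,omitempty"`
}

type EphemeralRunnerSetSpec struct {
	// Replicas is the desired number of EphemeralRunners; the
	// EphemeralRunnerSet_controller reconciles the actual count toward it.
	Replicas int `json:"replicas"`
}

// EphemeralRunner represents a single runner. Status.RunnerId is filled in
// after the runner registers with the Actions service.
type EphemeralRunner struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Status EphemeralRunnerStatus `json:"status,omitempty"`
}

type EphemeralRunnerStatus struct {
	Phase    string `json:"phase,omitempty"`    // e.g. Pending, Running
	RunnerId int    `json:"runnerId,omitempty"` // assigned by the Actions service
}
```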
This ADR is about how we can find the right existing `EphemeralRunner` to delete when we need to scale down.
## Current approach

- `EphemeralRunnerSet_controller` figures out how many `EphemeralRunner`s it needs to delete, e.g. scaling down from 10 to 2 means we need to delete 8 `EphemeralRunner`s.
- `EphemeralRunnerSet_controller` finds all `EphemeralRunner`s that are in the `Running` or `Pending` phase. `Pending` means the `EphemeralRunner` is probably still being created and a runner has not yet been configured with the Actions service. `Running` means the `EphemeralRunner` is created and a runner has probably been configured with the Actions service; the runner may be sitting there idle, or may be actively running a workflow job. We don't have a clear answer for that on the ARC side. (The Actions service knows it for sure.)
- `EphemeralRunnerSet_controller` makes an HTTP DELETE request to the Actions service for each `EphemeralRunner` from the previous step and asks the Actions service to delete the runner via `RunnerId`. (The `RunnerId` is generated after the runner registers with the Actions service, and stored on `EphemeralRunner.Status.RunnerId`.)
  - The HTTP DELETE request looks like the following: `DELETE https://pipelines.actions.githubusercontent.com/WoxlUxJHrKEzIp4Nz3YmrmLlZBonrmj9xCJ1lrzcJ9ZsD1Tnw7/_apis/distributedtask/pools/0/agents/1024`
  - The Actions service will return 2 types of responses:
    - 204 (No Content): the runner with Id 1024 has been successfully removed from the service, or the runner with Id 1024 doesn't exist.
    - 400 (Bad Request) with a JSON body that contains an error message like `JobStillRunningException`: the service can't remove this runner at this point since it has been assigned to a job request; the client won't be able to remove the runner until the runner finishes its currently assigned job request.
- `EphemeralRunnerSet_controller` will ignore any deletion error from runners that are still running a job, and keep retrying deletion until the number of `204` responses equals the number of `EphemeralRunner`s that need to be deleted (see the sketch after this list).
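The retry behaviour above roughly amounts to the following loop. This is a minimal sketch; `listRunningOrPendingRunners` and `deleteRunnerFromService` are hypothetical stand-ins for the controller's list call and the Actions-service client, not real ARC functions.

```go
package sketch

import (
	"context"
	"net/http"
)

// Minimal stand-in for the EphemeralRunner resource; illustrative only.
type runner struct {
	RunnerId int
}

// Hypothetical helpers standing in for the controller's cached list call and
// the Actions-service HTTP client.
func listRunningOrPendingRunners(ctx context.Context) ([]runner, error) {
	return nil, nil // placeholder
}

func deleteRunnerFromService(ctx context.Context, runnerId int) (int, error) {
	return http.StatusNoContent, nil // placeholder
}

// scaleDownOnce sketches one reconciliation pass of the current behaviour:
// ask the service to delete each Running/Pending runner and count the 204s.
func scaleDownOnce(ctx context.Context, toDelete int) (deleted int, err error) {
	runners, err := listRunningOrPendingRunners(ctx)
	if err != nil {
		return 0, err
	}
	for _, r := range runners {
		if deleted >= toDelete {
			break
		}
		// DELETE .../_apis/distributedtask/pools/0/agents/{RunnerId}
		status, err := deleteRunnerFromService(ctx, r.RunnerId)
		if err != nil {
			return deleted, err
		}
		switch status {
		case http.StatusNoContent:
			// 204: the runner was removed (or no longer exists).
			deleted++
		case http.StatusBadRequest:
			// 400 / JobStillRunningException: the runner is still running a
			// job. The error is ignored and the same runner is tried again
			// on the next reconciliation, which is where the wasted
			// requests come from.
		}
	}
	return deleted, nil
}
```

Because the controller does not know which runners are busy, the 400 branch can fire again and again for the same runner, which is exactly the repeated-request pattern described below.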
## The problem with the current approach
In a busy `AutoScalingRunnerSet`, scale up and down may happen all the time as jobs are queued and finished.
We will make way too many HTTP requests to the Actions service asking it to try to delete a certain runner, and rely on the exception from the service to figure out what to do next.
The runner deletion request is not cheap for the service: for synchronization reasons, the `JobStillRunningException` is raised from the database call that handles the request.
So we are wasting resources on both the Actions service (extra load on the database) and the actions-runner-controller (useless outgoing HTTP requests).
In the test ARC that I deployed to Azure, the ARC controller tried to delete `RunnerId` 12408 for `bbq-beets/ting-test` a total of 35 times within 10 minutes.
## Root cause
The `EphemeralRunnerSet_controller` doesn't know whether a given `EphemeralRunner` is actually running a workflow job or not
(it only knows the runner is configured at the service), so it can't filter out the busy `EphemeralRunner`s when picking which ones to delete.
## Additional context
The legacy ARC's custom resource allows the runner image to leverage the `RunnerJobHook` feature to update the status of the runner custom resource in K8s (mark the runner as running workflow run Id XXX).
This brings good value to users, as it provides some insight into which runner is running which job across all the runners in the cluster, and it looks pretty close to what we want in order to fix the root cause.
However, the legacy ARC approach means the service account used to run the runner pod needs elevated permission to update the custom resource.
This would be a big NO from a security point of view, since we may not trust the code running inside the runner pod.
## Possible Solution

The nature of the k8s controller-runtime means we might reconcile the resource based on stale cache data.
I think our goal for the solution should be:
- Reduce wasteful HTTP requests on a scale-down as much as we can.
- We can accept that we might make 1 or 2 wasteful requests to the Actions service, but we can't accept making 5-10+ of them.
- See if we can reach feature parity with what the `RunnerJobHook` supports without compromising on any security concerns.
Since the root cause of why the reconciliation can't skip an `EphemeralRunner` is that we don't know whether an `EphemeralRunner` is running a job,
a simple thought is: how about we somehow attach some info to the `EphemeralRunner` to indicate it's currently running a job?
How about we send this info from the service to the auto-scaling-listener via the existing HTTP long-poll
and let the listener patch the `EphemeralRunner.Status` to indicate it's running a job?
The listener normally runs in a separate namespace with elevated permissions, and it's something we can trust.
Changes:
- Introduce a new message type `JobStarted` (in addition to the existing `JobAvailable`/`JobAssigned`/`JobCompleted`) on the service side. The message is sent when a runner of the `RunnerScaleSet` gets assigned to a job; `RequestId`, `RunnerId`, and `RunnerName` will be included in the message.
- Add `RequestId (int)` to `EphemeralRunner.Status`; this will indicate which job the runner is running.
- The `AutoScalingListener` will, based on the payload of this new message, patch `EphemeralRunners/RunnerName/Status` with the `RequestId`.
- When the `EphemeralRunnerSet_controller` tries to find `EphemeralRunner`s to delete on a scale down, it will skip any `EphemeralRunner` that has `EphemeralRunner.Status.RequestId` set (see the sketch after this list).
- In the future, we can expose more info in this `JobStarted` message and introduce more properties under `EphemeralRunner.Status` to reach feature parity with legacy ARC's `RunnerJobHook`.
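As a rough illustration of the last two bullets, the scale-down selection could look something like the sketch below. The field and helper names here are assumptions made for this sketch, not a final API.

```go
package sketch

// JobStartedMessage is an illustrative shape for the new message the
// AutoScalingListener would receive over the existing HTTP long-poll.
type JobStartedMessage struct {
	RequestId  int
	RunnerId   int
	RunnerName string
}

// EphemeralRunnerStatus gains a RequestId that the listener patches in when
// it handles a JobStarted message; 0 means "no job assigned that we know of".
type EphemeralRunnerStatus struct {
	RunnerId  int `json:"runnerId,omitempty"`
	RequestId int `json:"requestId,omitempty"`
}

type EphemeralRunner struct {
	Name   string
	Status EphemeralRunnerStatus
}

// pickRunnersToDelete selects scale-down candidates, skipping any runner
// whose Status.RequestId is set, i.e. any runner the listener has marked as
// running a job. Only the remaining (presumed idle) runners would get the
// HTTP DELETE call to the Actions service.
func pickRunnersToDelete(runners []EphemeralRunner, toDelete int) []EphemeralRunner {
	candidates := make([]EphemeralRunner, 0, toDelete)
	for _, r := range runners {
		if len(candidates) >= toDelete {
			break
		}
		if r.Status.RequestId != 0 {
			// Known to be running a job request; skip it instead of asking
			// the service and getting a JobStillRunningException back.
			continue
		}
		candidates = append(candidates, r)
	}
	return candidates
}
```

Because the controller may still reconcile on stale cache data, it can occasionally pick a runner that was just assigned a job, so the existing 400/`JobStillRunningException` handling stays as a backstop; the common case, however, no longer round-trips to the service for busy runners, which matches the goal of tolerating only 1 or 2 wasteful requests.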