
actions-runner-controller v0.22.0

This version of ARC focuses on the scalability and reliability of runners.

GitHub API Cache

In terms of scalability, ARC now caches GitHub API responses according to GitHub's recommendation (the Cache-Control header¹). As long as GitHub keeps its current behavior, ARC will cache the responses of various List Runners and List Workflow Jobs API calls for 60 seconds.

The cache for the List Runners API is especially important, as its responses can be shared between all runners under the same scope (repository, organization, or enterprise).

In previous versions of ARC, the number of List Runners API calls scaled in proportion to the number of runners managed by ARC. Thanks to the cache, since v0.22.0 it may instead scale in proportion to the number of runner scopes (the number of repositories for your repository runners + the number of organizations for your organizational runners + the number of enterprises for your enterprise runners). You might be able to scale to hundreds of runners depending on your environment.
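To make the scaling argument concrete, here is a minimal Python model of a per-scope cache with a 60-second lifetime. It is only a sketch of the effect described above, not ARC's implementation (ARC is written in Go and caches at the HTTP level based on the Cache-Control header). The class name `ScopedRunnerCache`, the `fetch` callback, and the injectable `clock` are all hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Dict, List, Tuple


class ScopedRunnerCache:
    """Hypothetical model: cache List Runners results per scope for a fixed TTL."""

    def __init__(
        self,
        fetch: Callable[[str], List[str]],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # stands in for the real List Runners API call
        self._ttl = ttl_seconds      # 60 seconds, matching the behavior described above
        self._clock = clock          # injectable so the example can be tested without sleeping
        self._entries: Dict[str, Tuple[float, List[str]]] = {}
        self.upstream_calls = 0      # how many times the "API" was actually hit

    def list_runners(self, scope: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(scope)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh entry: shared by every runner in this scope
        self.upstream_calls += 1
        runners = self._fetch(scope)
        self._entries[scope] = (now, runners)
        return runners
```

In this model, 100 runners spread over two organizations cause two upstream calls within a 60-second window rather than 100, because every runner in a scope reuses the same cached response.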

Please share your experience if you successfully scaled to a level that wasn't possible with previous versions!

Improved Runner Scale Down Process

In terms of reliability, the first thing to note is that ARC has a new scale down process for both RunnerDeployment and RunnerSet.

Previously, every runner pod could restart immediately after completion, while at the same time ARC might mark the same runner pod for deletion due to scale down. That resulted in various race conditions that terminated the runner prematurely while it was running a workflow job².

This is now fixed. The new scale down process ensures that the runner has been registered successfully and then de-registered from GitHub Actions before starting the runner pod deletion process. A runner pod can no longer be terminated while it is restarting or running a job, which makes it impossible for a runner to be in the middle of a workflow job when its pod is terminated. No more race conditions.
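The ordering described above can be pictured as a small decision function. This is a hedged sketch in Python, not ARC's actual controller code: the `RunnerState` fields and the step names returned by `next_scale_down_step` are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunnerState:
    registered: bool    # assumption: runner has registered with GitHub Actions at least once
    busy: bool          # assumption: runner is currently running a workflow job
    deregistered: bool  # assumption: runner has since been removed from GitHub Actions


def next_scale_down_step(state: RunnerState) -> str:
    """Return the next action for a runner pod that is marked for scale down."""
    if state.deregistered:
        # Only a de-registered runner can no longer pick up a job, so deletion is safe.
        return "delete_pod"
    if not state.registered:
        # Still starting or restarting: wait instead of racing with registration.
        return "wait"
    if state.busy:
        # Never interrupt a running workflow job.
        return "wait"
    return "deregister"
```

The point of the ordering is that pod deletion is reachable only through the de-registered state, so a pod that is restarting or running a job never reaches it.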

Optimized Ephemeral Runner Termination Makes Fewer "Remove Runner" API Calls

It is also worth mentioning that the new scale down process makes fewer GitHub Actions RemoveRunner API calls, which contributes to better scalability.

Two enhancements have been made to achieve that.

First, every runner managed by ARC now uses --ephemeral by default.

Second, we removed unnecessary RemoveRunner API calls for ephemeral runners that have already completed running.

GitHub designed ephemeral runners to be automatically unregistered from GitHub Actions after running their first workflow job. It is therefore unnecessary to call the RemoveRunner API when the ephemeral runner pod has already completed successfully. These two enhancements align with that fact and result in ARC making fewer API calls.
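As a rough illustration of the rule above, the following Python sketch decides whether a RemoveRunner call is still needed for a given runner pod. The function name and its two boolean parameters are hypothetical. ARC's real logic is written in Go and may consider more conditions.

```python
def needs_remove_runner_call(ephemeral: bool, completed_successfully: bool) -> bool:
    """Return True when ARC would still have to call RemoveRunner for this pod."""
    # An ephemeral runner that completed successfully has already been
    # unregistered by GitHub Actions, so the extra API call is redundant.
    return not (ephemeral and completed_successfully)


def count_remove_runner_calls(pods: list) -> int:
    """Count the calls needed for a list of (ephemeral, completed_successfully) pairs."""
    return sum(1 for ephemeral, done in pods if needs_remove_runner_call(ephemeral, done))
```

With every runner ephemeral by default, only the pods that are scaled down before completing still need the explicit call.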

Prevention of Unnecessary Runner Pod Recreations

Another reliability enhancement is based on the addition of a new field, EffectiveTime, to our RunnerDeployment and RunnerSet specifications.

The field comes into play only for ephemeral runners. ARC uses it as an indicator of when to add more runner pods so that the current number of runner pods matches the desired number.

How does that improve reliability?

Previously, ARC continuously recreated runner pods as they completed, with no delay. That sometimes resulted in a runner pod being recreated and then immediately terminated without being used at all. Not only was this a waste of cluster resources, it also resulted in the race conditions explained in the previous section, "Improved Runner Scale Down Process". We fixed those race conditions as explained there, but the waste of cluster resources was still problematic.

With EffectiveTime, ARC defers the addition of missing runner pods (and their recreation, as ARC doesn't distinguish between addition and recreation) until EffectiveTime is updated. EffectiveTime is updated only when ARC's github-webhook-server updates the desired replicas number. Because ARC adds or recreates runner pods only after the webhook server updates it, the issue is resolved.

This may be an unnecessary detail, but the "defer" mechanism times out after the DefaultRunnerPodRecreationDelayAfterWebhookScale duration, which is currently hard-coded to 10 minutes. So if ARC misses a webhook event needed for proper scaling, it still converges to the desired replicas after 10 minutes, and the current state eventually syncs up with the desired state.
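The deferral and its timeout can be modeled in a few lines of Python. This is a sketch under stated assumptions, not ARC's Go implementation: `last_sync_time` (the moment the controller last acted on an EffectiveTime update) and the function name are hypothetical, and measuring the timeout from `effective_time` is a guess. Only the 10-minute value comes from the text above.

```python
from datetime import datetime, timedelta
from typing import Optional

# Mirrors the hard-coded 10 minutes of DefaultRunnerPodRecreationDelayAfterWebhookScale.
RECREATION_DELAY_AFTER_WEBHOOK_SCALE = timedelta(minutes=10)


def should_add_missing_pods(
    effective_time: Optional[datetime],
    last_sync_time: Optional[datetime],
    now: datetime,
) -> bool:
    """Decide whether missing runner pods may be added (or recreated) right now."""
    if effective_time is None:
        # Assumption: no EffectiveTime set (e.g. static runners) means no deferral.
        return True
    if last_sync_time is None or effective_time > last_sync_time:
        # EffectiveTime moved forward since the last sync: a scale request arrived.
        return True
    # Otherwise defer, but give up deferring once the delay has elapsed so that
    # the current state still converges if a webhook event was missed.
    return now - effective_time >= RECREATION_DELAY_AFTER_WEBHOOK_SCALE
```

For example, with an unchanged EffectiveTime the function returns False five minutes later and True once ten minutes have passed, which is the eventual-convergence behavior described above.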

Note that EffectiveTime fields are set by the HRA controller for any RunnerDeployment and RunnerSet that manages ephemeral runners. That means it is enabled regardless of the type of autoscaler you're using, whether webhook-based or API-polling-based. It isn't enabled for static (persistent) runners.

There's currently no way to opt out of EffectiveTime because the author of the feature (@mumoshu) thought it was unneeded. Please open a GitHub issue with details on your use case if you do need to opt out.

Generalized Runner Pod Management Logic

This might not be a user-visible change, but I'm explaining it for anyone who may wonder.

Since this version, ARC uses the same logic for RunnerDeployment and RunnerSet. RunnerDeployment is Pod-based and RunnerSet is StatefulSet-based. That remains unchanged, but most of the logic for managing runner pods is now shared between the two.

The only difference is which adapters those variants pass to the generalized logic. RunnerDeployment uses RunnerReplicaSet (another of our Kubernetes custom resources, which powers RunnerDeployment) as the owner of a runner pod, and RunnerSet uses StatefulSet (the vanilla Kubernetes StatefulSet) as the owner of a runner pod.
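The adapter idea can be sketched as one shared function that works against a small interface, with one adapter per owner type. This Python sketch only illustrates the shape of such a design. The `RunnerPodOwner` protocol, the two adapter classes, and `replicas_to_add` are hypothetical names and do not correspond to identifiers in ARC's Go code.

```python
from dataclasses import dataclass
from typing import List, Protocol, Sequence


class RunnerPodOwner(Protocol):
    """What the shared logic needs to know about any owner of runner pods."""

    kind: str

    def pod_names(self) -> List[str]: ...


@dataclass
class RunnerReplicaSetOwner:
    """Adapter for the RunnerDeployment side (owner: RunnerReplicaSet)."""

    name: str
    pods: List[str]
    kind: str = "RunnerReplicaSet"

    def pod_names(self) -> List[str]:
        return list(self.pods)


@dataclass
class StatefulSetOwner:
    """Adapter for the RunnerSet side (owner: StatefulSet)."""

    name: str
    pods: List[str]
    kind: str = "StatefulSet"

    def pod_names(self) -> List[str]:
        return list(self.pods)


def replicas_to_add(owners: Sequence[RunnerPodOwner], desired: int) -> int:
    """Shared logic: positive means add pods, negative means scale down."""
    current = sum(len(owner.pod_names()) for owner in owners)
    return desired - current
```

Because `replicas_to_add` sees only the interface, a fix made in the shared logic applies to both owner types at once.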

This refactoring turned out to enable us to make RunnerSet as reliable as RunnerDeployment. RunnerSet has been considered an experimental feature even though it is more customizable than RunnerDeployment and supports Persistent Volume Claims (PVCs). But since it now uses the same logic under the hood, RunnerSet can be considered more production-ready than before.

If you stayed away from RunnerSet for that reason, please try it and report anything you experience!

¹ https://docs.github.com/en/rest/overview/resources-in-the-rest-api#conditional-requests
² See this issue for more context.
