Autoscaling Runner Scale Sets mode
This new autoscaling mode brings numerous enhancements (described in the following sections) that will make your experience more reliable and secure.
How it works
- ARC is installed using the supplied Helm charts, and the controller manager pod is deployed in the specified namespace. A new AutoScalingRunnerSet resource is deployed via the supplied Helm charts or a customized manifest file. The AutoScalingRunnerSet controller calls GitHub's APIs to fetch the runner group ID that the runner scale set will belong to.
- The AutoScalingRunnerSet controller calls the APIs one more time to either fetch or create a runner scale set in the Actions Service before creating the Runner ScaleSet Listener resource.
- A Runner ScaleSet Listener pod is deployed by the AutoScaling Listener Controller. In this pod, the listener application connects to the Actions Service to authenticate and establish a long-poll HTTPS connection. The listener stays idle until it receives a Job Available message from the Actions Service.
- When a workflow run is triggered from a repository, the Actions Service dispatches individual job runs to the runners or runner scale sets where the runs-on property matches the name of the runner scale set or the labels of self-hosted runners.
- When the Runner ScaleSet Listener receives the Job Available message, it checks whether it can scale up to the desired count. If it can, the Runner ScaleSet Listener acknowledges the message.
- The Runner ScaleSet Listener uses a Service Account and a Role bound to that account to make an HTTPS call through the Kubernetes APIs to patch the EphemeralRunner Set resource with the desired replica count.
- The EphemeralRunner Set attempts to create new runners, and the EphemeralRunner Controller requests a JIT configuration token to register these runners. The controller attempts to create runner pods. If the pod's status is failed, the controller retries up to 5 times. After 24 hours, the Actions Service unassigns the job if no runner accepts it.
- Once the runner pod is created, the runner application in the pod uses the JIT configuration token to register itself with the Actions Service. It then establishes another HTTPS long-poll connection to receive the job details it needs to execute.
- The Actions Service acknowledges the runner registration and dispatches the job run details.
- Throughout the job run execution, the runner continuously communicates the logs and job run status back to the Actions Service.
- When the runner completes its job successfully, the EphemeralRunner Controller checks with the Actions Service to see if the runner can be deleted. If it can, the Ephemeral RunnerSet deletes the runner.
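The scale-up check and the pod creation retry described in the steps above can be sketched roughly as follows. This is a minimal illustration in Python, not ARC's actual implementation. The function names `desired_replicas` and `create_pod_with_retries`, the `min_runners` and `max_runners` parameters, and the rule combining them are assumptions made for the example. Only the limit of 5 comes from the description above, and it is modelled here as 5 attempts in total.

```python
from typing import Callable

# Limit taken from the description above; treating it as the total number
# of attempts is an assumption of this sketch.
MAX_POD_CREATION_ATTEMPTS = 5


def desired_replicas(assigned_jobs: int, min_runners: int, max_runners: int) -> int:
    """Hypothetical scale-up rule: keep `min_runners` idle runners on top of
    the jobs waiting for a runner, but never exceed `max_runners`."""
    if assigned_jobs < 0 or min_runners < 0 or max_runners < min_runners:
        raise ValueError("invalid scaling parameters")
    return min(max_runners, min_runners + assigned_jobs)


def create_pod_with_retries(create_pod: Callable[[], str]) -> int:
    """Call `create_pod` until it reports a status other than "failed".

    Returns the number of attempts used and raises RuntimeError once
    MAX_POD_CREATION_ATTEMPTS attempts have all failed.
    """
    for attempt in range(1, MAX_POD_CREATION_ATTEMPTS + 1):
        if create_pod() != "failed":
            return attempt
    raise RuntimeError("runner pod failed on every attempt")


if __name__ == "__main__":
    # Three queued jobs, one idle runner kept warm, capped at ten runners.
    print(desired_replicas(assigned_jobs=3, min_runners=1, max_runners=10))

    # A pod that fails twice before it starts running.
    statuses = iter(["failed", "failed", "running"])
    print(create_pod_with_retries(lambda: next(statuses)))
```

In this sketch the cap is applied last, so a burst of jobs larger than `max_runners` leaves the remaining jobs queued until a runner frees up.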
In addition to the increased reliability of the automatic scaling, we have worked on these improvements:
- No longer require cert-manager as a prerequisite for installing actions-runner-controller
- Reliable scale-up based on job demands and scale-down to zero runner pods
- Fewer API requests to api.github.com, which avoids API rate-limiting problems
- The GitHub Personal Access Token (PAT) or the GitHub App installation token is no longer passed to the runner pod for runner registration
- Maximum flexibility for customizing your runner pod template
Demo
A short walkthrough of the Autoscaling Runner Scale Sets mode is available on YouTube.
Setup
You can follow this quickstart guide for installation steps.
Troubleshooting
You can follow this troubleshooting guide for troubleshooting steps.
Changelog
0.13.0
- Remove workflow actions version comments since upgrades are done via dependabot #4161
- Fix image pull secrets list arguments in the chart #4164
- Update example GitHub URLs in values.yaml to include an example for enterprise account-level runners #4181
- docs: fix repo path typo #4229
- Remove deprecated preserveUnknownFields from CRDs #4135
- Add workflow name and target labels #4240
- docs: fix broken Grafana dashboard JSON path #4270
- Ensure ephemeral runner is deleted from the service on exit != 0 #4260
- Remove JIT config from ephemeral runner status field #4191
- Remove ephemeral runner when exit code != 0 and is patched with the job #4239
- Bump the gomod group across 1 directory with 4 updates #4277
- Bump all dependencies #4266
0.12.1
- Fix indentation of startupProbe attributes in dind sidecar #4126
- Remove duplicate float64 call #4139
- Fix dind sidecar template #4128
- Remove check if runner exists after exit code 0 #4142
- Explicitly requeue during backoff ephemeral runner #4152
0.12.0
- Allow use of client id as an app id #4057
- Relax version requirements to allow patch version mismatch #4080
- Refactor resource naming removing unnecessary calculations #4076
- Fix busy runners metric #4016
- Include more context to errors raised by github/actions client #4032
- Revised dashboard #4022
- feat(helm): move dind to sidecar #3842
- Pin third party actions #3981
- Fix docker lint warnings #4074
- Bump the gomod group across 1 directory with 7 updates #4008
- Bump go version #4075
- Add job_workflow_ref label to listener metrics #4054
- Bump github.com/cloudflare/circl from 1.6.0 to 1.6.1 #4118
- Avoid nil point when config.Metrics is nil and expose all metrics if none are configured #4101
- Bump github.com/golang-jwt/jwt/v5 from 5.2.1 to 5.2.2 #4120
- Add startup probe to dind side-car #4117
- Delete config secret when listener pod gets deleted #4033
- Add response body to error when fetching access token #4005
- Azure Key Vault integration to resolve secrets #4090
- Create backoff mechanism for failed runners and allow re-creation of failed ephemeral runners #4059
0.11.0
- Add events role permission to leader_election_role #3988
- Bump github.com/golang-jwt/jwt/v4 from 4.5.1 to 4.5.2 #3984
- Create configurable metrics #3975
- Wrap errors in controller helper methods and swap logic in cleanups #3960
- Rename log from target/actual to build/autoscalingRunnerSet version #3957
- Update all dependencies, conforming to the new controller-runtime API #3949
- Clean up as much as possible in a single pass for the EphemeralRunner reconciler #3941
- Remove old githubrunnerscalesetlistener, remove warning and fix config bug #3937
- Include custom annotations and labels to all resources created by gha-runner-scale-set chart #3934
- Use Ready from the pod conditions when setting it to the EphemeralRunner #3891
- Fix template tests and add go test on gha-validate-chart #3886
- Update dependabot config to group packages (& include actions eco) #3880
- cmd/ghalistener/config: export Validate #3870
- AutoscalingRunnerSet env: not Rendering correctly #3826
- Clarify syntax for githubConfigSecret #3812
- Trim volume and container helpers in gha-runner-scale-set #3807
- Drop verbose flag from runner scale set init-dind-externals copy #3805
- Use gha-runner-scale-set-controller.chart instead of .Chart.Version #3729
- metrics cardinality for ghalistener #3671
- Sanitize labels ending in hyphen, underscore, and dot #3664
- chore: Added OwnerReferences during resource creation for EphemeralRunnerSet, EphemeralRunner, and EphemeralRunnerPod #3575
0.10.1
- Fix helm chart bug related to runnerMaxConcurrentReconciles #3858
0.10.0
This release includes major improvements to the runner provisioning duration. In short, you should see less latency between queueing a workflow run and having a runner available to execute the job.
Make sure to check #3832 and #3848 for details on how to fine-tune that behavior.
Major changes
- Add exponential backoff when generating runner reg tokens #3724
- Make EphemeralRunnerController MaxConcurrentReconciles configurable #3832
- Make EphemeralRunnerReconciler create runner pods earlier #3831
- Make k8s client rate limiter parameters configurable #3848
Minor changes
- Bump github.com/bradleyfalzon/ghinstallation/v2 from 2.8.0 to 2.12.0 #3837
- Bump golang.org/x/crypto from 0.22.0 to 0.31.0 #3844
- Update docs with details for the dashboard visualizations #3696
v0.9.3
- AutoscalingListener controller: Inspect listener container state instead of pod phase #3548
- Exclude label prefix propagation #3607
- Check status code of fetch access token for github app #3568
- Remove .Named() from the ephemeral runner controller #3596
- Customize work directory #3477
- Fix problem with ephemeralRunner Succeeded state before build executed #3528
- Remove finalizers in one pass to speed up cleanups AutoscalingRunnerSet #3536
v0.9.2
- Refresh session if token expires during delete message #3529
- Re-use the last desired patch on empty batch #3453
- Extract single place to set up indexers #3454
- Include controller version in logs #3473
- Propagate arbitrary labels from runnersets to all created resources #3157
v0.9.1
Major changes
- Shutdown metrics server when listener exits #3445
- Propagate max capacity information to the actions back-end #3431
- Refactor actions client error to include request id #3430
- Include self correction on empty batch and avoid removing pending runners when cluster is busy #3426
- Add topologySpreadConstraint to gha-runner-scale-set-controller chart #3405
v0.9.0
⚠️ Warning
- This release contains CRD changes. During the upgrade, please remove the old CRDs before re-installing the new version. For more information, please read Upgrading ARC.
- This release changes the default docker socket path for container mode dind.
- The older version of the listener (githubrunnerscalesetlistener) is deprecated and will be removed in the future 0.10.0 release.
Please evaluate these changes carefully before upgrading.
Major changes
- Change docker socket path to /var/run/docker.sock #3337
- Update metrics to include repository on job-based label #3310
- Bump Go version to 1.22.1 #3290
- Propagate runner scale set name annotation to EphemeralRunner #3098
- Add annotation with values hash to re-create listener #3195
- Fix overscaling when the controller is much faster than the listener #3371
- Add retry on 401 and 403 for runner-registration #3377
v0.8.3
- Expose volumeMounts and volumes in gha-runner-scale-set-controller #3260
- Refer to the correct variable in discovery error message #3296
- Fix acquire jobs after session refresh ghalistener #3307
v0.8.2
- Add listener graceful termination period and background context after the message is received #3187
- Publish metrics in the new ghalistener #3193
- Delete message session when listener.Listen returns #3240
v0.8.1
- Fix proxy issue in new listener client #3181
v0.8.0
- Change listener container name #3167
- Fix empty env and volumeMounts object on default setup #3166
- Fix override listener pod spec #3161
- Change minRunners behavior and fix the new listener min runners #3139
- Update user agent for new ghalistener #3138
- Bump golang.org/x/oauth2 from 0.14.0 to 0.15.0 #3127
- Bump golang.org/x/net from 0.18.0 to 0.19.0 #3126
- Bump k8s.io/client-go from 0.28.3 to 0.28.4 #3125
- Modify user agent format with subsystem and is proxy configured information #3116
- Record the error when the creation pod fails #3112
- Fix typo in helm chart comment #3104
- Set actions client timeout to 5 minutes, add logging to client #3103
- Refactor listener app with configurable fallback #3096
- Bump github.com/onsi/gomega from 1.29.0 to 1.30.0 #3094
- Bump k8s.io/api from 0.28.3 to 0.28.4 #3093
- Bump k8s.io/apimachinery from 0.28.3 to 0.28.4 #3092
- Bump github.com/gruntwork-io/terratest from 0.41.24 to 0.46.7 #3091
- Record a reason for pod failure in EphemeralRunner #3074
- ADR: Changing semantics of min runners to be min idle runners #3040
v0.7.0
- Add ResizePolicy and RestartPolicy on mergeListenerContainer #3075
- feat: GHA controller Helm Chart quoted labels #3061
- Update authorization for PAT to be Bearer as documented #3039
- Metrics: set max and min runners during startup time #3032
- Update Chart.yaml home URLs #3013
- Remove inheritance of imagePullPolicy from manager to listeners #3009
- Trim down metrics cardinality #3003
- Fix role and rolebinding cleanup for the listener controller #2970
- Configure listener pod with the secret instead of env #2965
- Allow custom labels to be specified for controller pods #2952
- Bump go version and all direct dependencies to newest for k8s compatibility #2947
- chore: Service accounts in Kubernetes mode can now be annotated. #2566
v0.6.1
- Replace TLS dockerd connection with unix socket #2833
- Fix name override labels when runnerScaleSetName value is set #2915
- Fix nil map when annotations are applied #2916
- Updates: container-hooks to v0.4.0 #2928
v0.6.0
- Fix parsing AcquireJob MessageQueueTokenExpiredError #2837
- Set restart policy on the runner pod to Never if restartPolicy is not set in template #2787
- Set the AutoscalingRunnerSet name to runnerScaleSetName #2803
- Extend and generate crds allowing listener pod spec change #2758
- Extend the user agent and fix the build version for the listener app #2892
v0.5.0
- Provide scale-set listener metrics #2559
- Add DrainJobsMode #2569
- Trim gha-runner-scale-set to gha-rs in names and remove role type suffixes #2706
- Adapt role name to prevent namespace collision #2617
- Add status check before deserializing runner-registration response #2699
- Add configurable log format to values.yaml and propagate it to listener #2686
- Extend manager roles to accept ephemeralrunnerset/finalizers #2493
- Trim repo/org/enterprise to 63 characters in label values #2657
- Revert back chart renaming #2824
- Discard logs on helm chart tests #2607
- Use build.Version to check if resource version is a mismatch #2521
- Reordering methods and constants so it is easier to look it up #2501
- chore: Set build version on make-runscaleset #2713
- Fix scaling back to 0 after min runners were set to number > 0 #2742
- Document customization for containerModes #2777
- Bump github.com/cloudflare/circl from 1.1.0 to 1.3.3 #2628
- chore(deps): bump github.com/stretchr/testify from 1.8.2 to 1.8.4 #2716
- Move gha-* docs out of preview #2779
- Prepare 0.5.0 release #2783
- Security fix #2676
v0.4.0
⚠️ Warning
This release contains a major change related to the way permissions are applied to the manager (#2276 and #2363).
Please evaluate these changes carefully before upgrading.
Major changes
- Surface EphemeralRunnerSet stats to AutoscalingRunnerSet #2382
- Improved security posture by removing list/watch secrets permission from manager cluster role #2276
- Improved security posture by delaying role/rolebinding creation to gha-runner-scale-set during installation #2363
- Improved security posture by supporting watching a single namespace from the controller #2374
- Added labels to AutoscalingRunnerSet subresources to allow easier inspection #2391
- Fixed bug preventing env variables from being specified #2450
- Enhance quickstart troubleshooting guides #2435
- Fixed ignore extra dind container when container mode type is "dind" #2418
- Added additional cleanup finalizers #2433
- gha-runner-scale-set listener pod inherits the ImagePullPolicy from the manager pod #2477
- Treat .ghe.com domain as hosted environment #2480
v0.3.0
Major changes
- Runner pods are more similar to hosted runners #2348
- Add support for self-signed CA certificates #2268
- Fixed trailing slashes in config URLs breaking installations #2381
- Fixed a bug where the listener pod would ignore proxy settings from env #2366
- Added runner set name field making it optionally configurable #2279
- Name and namespace labels of listener pod have been split #2341
- Added chart name constraints validation on AutoscalingRunnerSet install #2347
v0.2.0
Major changes
- Added proxy support for the controller and the runner pods, see the new helm chart fields #2286
- Added the ability to provide a pre-defined kubernetes secret for the autoscaling runner set helm chart #2234
- Enhanced security posture by removing un-required permissions for the manager-role #2260
- Enhanced our logging by returning an error when a runner group is defined in the values file but it's not created in GitHub #2215
- Fixed helm charts issues that were preventing the use of DinD #2291
- Fixed a bug that was preventing runner scale sets from being removed from the backend when they were deleted from the cluster #2255 #2223
- Fixed bugs with the helm chart definitions preventing certain values from being set #2222
- Fixed a bug that prevented the configuration of a runner group for a runner scale set #2216


