We had some dead code left over from the removal of registration runners. Registration runners were removed in #859#1207
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs accross restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass.
Ref #1286
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
This is intended to fix#1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".
Ref #1369
This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that.
Ref #1196
In #1373 we made two mistakes:
- We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite!
- We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly.
The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that.
Ref #1056
Ref #1373
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282
This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway.
Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions.
Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions.
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307
Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273
With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource
#1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.
It turned out that #1179 broke static runners in a way it is no longer able to scale up at all when the desired replicas is updated.
This fixes that by correcting a certain short-circuit that is intended only for ephemeral runners to not mistakenly triggered for static runners.
The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.
I found that #1179 was unable to finish rollout of an RunnerDeployment update(like runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.
This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.
The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
* Remove legacy GitHub API cache of HRA.Status.CachedEntries
We migrated to the transport-level cache introduced in #1127 so not only this is useless, it is harder to deduce which cache resulted in the desired replicas number calculated by HRA.
Just remove the legacy cache to keep it simple and easy to understand.
* Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well
* Fix integration test
While testing #1179, I discovered that ARC sometimes stop resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods.
This seems to happen when ARC missed receiving a workflow_job completion event but it has no way to decide if it is either (1) something went wrong on ARC or (2) a loadbalancer in the middle or GitHub or anything not ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it.
In this change, I added a hard-coded 10 minutes timeout(can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet to restart runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrived after the timeout, it will decrease the desired replicas number that results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.
This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
There is a race condition between ARC and GitHub service about deleting runner pod.
- The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod.
- A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod.
- The ARC delete the runner pod which cause the in-progress job to end up canceled.
To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod.
- `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod.
- `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is.
TODO: I need to do some E2E tests to force the race condition to happen.
Ref #911
Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated.
This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens.
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365
Some of logs like `HRA keys indexed for HRA` were so excessive that it made testing and debugging the githubwebhookserver harder. This tries to fix that.
This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required.
More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud.
Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012
If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories
`getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow:
1. Search for **repository** HRAs - if so it ends here
2. Get available HRAs in k8s
3. Compute visible runner groups
a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls.
b. If GitHub auth is not provided - assume all runner groups are visible to all repositories
4. Search for **default organization** runners (a.k.a runners from organization's visible default runner group) with matching labels
5. Search for **default enterprise** runners (a.k.a runners from enterprise's visible default runner group) with matching labels
6. Search for **custom organization runner groups** with matching labels
7. Search for **custom enterprise runner groups** with matching labels
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This allows providing a different `work` Volume.
This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage.
This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
name: runner-2
spec:
template:
spec:
dockerdContainerResources: {}
env:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
# this is to mount the docker in docker onto NVME disk
dockerVolumeMounts:
- mountPath: /var/lib/docker
name: scratch
subPathExpr: $(POD_NAME)-docker
- mountPath: /runner/_work
name: work
subPathExpr: $(POD_NAME)-work
volumeMounts:
- mountPath: /runner/_work
name: work
subPathExpr: $(POD_NAME)-work
dockerEnv:
- name: POD_NAME
valueFrom:
fieldRef:
fieldPath: metadata.name
volumes:
- hostPath:
path: /mnt/disks/ssd0
name: scratch
- hostPath:
path: /mnt/disks/ssd0
name: work
nodeSelector:
cloud.google.com/gke-nodepool: runner-16-with-nvme
ephemeral: false
image: ""
imagePullPolicy: Always
labels:
- runner-2
- self-hosted
organization: yourorganization
```
Adding handling of paginated results when calling `ListWorkflowJobs`. By default the `per_page` is 30, which potentially would return 30 queued and 30 in_progress jobs.
This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed.
Problem: I did not find any support for pagination in the Github fake client, and have not been able to test this (as I have not been able to push an image to an environment where I can verify this).
If anyone is able to help out verifying this PR, i would really appreciate it.
Resolves#990
The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected).
This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups.
This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before
The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling.
This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture).
The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels.
code tested, work with `["self-hosted"]` and `["self-hosted","anotherLabel"]`
Fixes#951
* fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0
* Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade
* Regenerate manifests with the latest K8s dependencies
Co-authored-by: Renovate Bot <bot@renovateapp.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Fix bug related to label matching.
Add start of test framework for Workflow Job Events
Signed-off-by: Aidan Jensen <aidan@artificial.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This add support for two upcoming enhancements on the GitHub side of self-hosted runners, ephemeral runners, and `workflow_jow` events. You can't use these yet.
**These features are not yet generally available to all GitHub users**. Please take this pull request as a preparation to make it available to actions-runner-controller users as soon as possible after GitHub released the necessary features on their end.
**Ephemeral runners**:
The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller).
`--once` has been suffering from a race issue #466. `--ephemeral` fixes that.
To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`.
Please read the section `Ephemeral Runners` in the updated version of our README for more information.
Note that ephemeral runners is not released on GitHub yet. And `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub!
**`workflow_job` events**:
`workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale.
Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events has included `labels`, you had to configure an HRA to only match relevant `check_run` events. It wasn't trivial.
In contrast, a `workflow_job` event payload contains `labels` of runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by `labels` included in the webhook payload. So all you need to use webhook-based autoscale will be to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet.
Note that the current implementation of `workflow_job` support works in two ways, increment, and decrement. An increment happens when the webhook server receives` workflow_job` of `queued` status. A decrement happens when it receives `workflow_job` of `completed` status. The latter is used to make scaling-down faster so that you waste money less than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut `.
Please read the section `Example 3: Scale on each `workflow_job` event` in the updated version of our README for more information on its usage.
* Adding a default docker registry mirror
This change allows the controller to start with a specified default
docker registry mirror and avoid having to specify it in all the runner*
objects.
The change is backward compatible, if a runner has a docker registry
mirror specified, it will supersede the default one.
* Add POC of GitHub Webhook Delivery Forwarder
* multi-forwarder and ctrl-c existing and fix for non-woring http post
* Rename source files
* Extract signal handling into a dedicated source file
* Faster ctrl-c handling
* Enable automatic creation of repo hook on startup
* Add support for forwarding org hook deliveries
* Set hook secret on hook creation via envvar (HOOK_SECRET)
* Fix org hook support
* Fix HOOK_SECRET for consistency
* Refactor to prepare for custom log position provider
* Refactor to extract inmemory log position provider
* Add configmap-based log position provider
* Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder
* Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg
* Refactor to extract logger initialization
* Add hookdeliveryforwarder README and bump go-github to unreleased ver
`HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet.
It defaults to `RunnerDeployment` for backward compatibility.
```
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
name: myhra
spec:
scaleTargetRef:
kind: RunnerSet
name: myrunnerset
```
Ref #629
Ref #613
Ref #612
This option expose internally some `KUBERNETES_*` environment variables
that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will
try to connect to the Kubernetes cluster where the runner it's running.
This option it's set by default to `true` in any Kubernetes deployment.
Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>
* feat: RunnerSet backed by StatefulSet
Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens.
Resolves#613
Ref #612
* Upgrade controller-runtime to 0.9.0
* Bump Go to 1.16.x following controller-runtime 0.9.0
* Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup
* Fix startup failure due to missing LeaderElectionID
* Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered
* Allow force-updating statefulset
* Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false
* Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec
* Enable running acceptance tests against arbitrary kind cluster
* RunnerSet supports non-ephemeral runners only today
* fix: docker-build from root Makefile on intel mac
* fix: arch check fixes for mac and ARM
* ci: aligning test data format and patching checks
* fix: removing namespace in test data
* chore: adding more ignores
* chore: removing leading space in shebang
* Re-add metrics to org hra testdata
* Bump cert-manager to v1.1.1 and fix deploy.sh
Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com>
Co-authored-by: Callum James Tait <callum.tait@photobox.com>
Sets the privileged flag to false if SELinuxOptions are present/defined. This is needed because containerd treats SELinux and Privileged controls as mutually exclusive. Also see https://github.com/containerd/cri/blob/aa2d5a97c/pkg/server/container_create.go#L164.
This allows users who use SELinux for managing privileged processes to use GH Actions - otherwise, based on the SELinux policy, the Docker in Docker container might not be privileged enough.
Signed-off-by: Jonah Back <jonah@jonahback.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This allows using the `runtimeClassName` directive in the runner's spec.
One of the use-cases for this is Kata Containers, which use `runtimeClassName` in a pod spec as an indicator that the pod should run inside a Kata container. This allows us a greater degree of pod isolation.
Adds a column to help the operator see if they configured HRA.Spec.ScheduledOverrides correctly, in a form of "next override schedule recognized by the controller":
```
$ k get horizontalrunnerautoscaler
NAME MIN MAX DESIRED SCHEDULE
actions-runner-aos-autoscaler 0 5 0
org 0 5 0 min=0 time=2021-05-21 15:00:00 +0000 UTC
```
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/484
This fixes human-readable output of `kubectl get` on `runnerdeployment`, `runnerreplicaset`, and `runner`.
Most notably, CURRENT and READY of runner replicasets are now computed and printed correctly. Runner deployments now have UP-TO-DATE and AVAILABLE instead of READY so that it is consistent with columns of K8s deployments.
A few fixes has been also made to runner deployment and runner replicaset controllers so that those numbers stored in Status objects are reliably updated and in-sync with actual values.
Finally, `AGE` columns are added to runnerdeployment, runnerreplicaset, runnner to make that more visible to users.
`kubectl get` outputs should now look like the below examples:
```
# Immediately after runnerdeployment updated/created
$ k get runnerdeployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
example-runnerdeploy 0 0 0 0 8d
org-runnerdeploy 5 5 5 0 8d
# A few dozens of seconds after update/create all the runners are registered that "available" numbers increase
$ k get runnerdeployment
NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE
example-runnerdeploy 0 0 0 0 8d
org-runnerdeploy 5 5 5 5 8d
```
```
$ k get runnerreplicaset
NAME DESIRED CURRENT READY AGE
example-runnerdeploy-wnpf6 0 0 0 61m
org-runnerdeploy-fsnmr 2 2 0 8m41s
```
```
$ k get runner
NAME ENTERPRISE ORGANIZATION REPOSITORY LABELS STATUS AGE
example-runnerdeploy-wnpf6-registration-only actions-runner-controller/mumoshu-actions-test Running 61m
org-runnerdeploy-fsnmr-n8kkx actions-runner-controller ["mylabel 1","mylabel 2"] 21s
org-runnerdeploy-fsnmr-sq6m8 actions-runner-controller ["mylabel 1","mylabel 2"] 21s
```
Fixes#490
`PercentageRunnersBusy`, in combination with a secondary `TotalInProgressAndQueuedWorkflowRuns` metric, enables scale-from-zero for PercentageRunnersBusy.
Please see the new `Autoscaling to/from 0` section in the updated documentation about how it works.
Resolves#522
This adds the initial version of ScheduledOverrides to HorizontalRunnerAutoscaler.
`MinReplicas` overriding should just work.
When there are two or more ScheduledOverrides, the earliest one that matched is activated. Each ScheduledOverride can be recurring or one-time. If you have two or more ScheduledOverrides, only one of them should be one-time. And the one-time override should be the earliest item in the list to make sense.
Tests will be added in another commit. Logging improvements and additional observability in HRA.Status will also be added in yet another commits.
Ref #484
Adds two types `RecurrenceRule` and `Period` and one function `MatchSchedule` as the foundation for building the upcoming ScheduledOverrides feature.
Ref #484
- Adds `ephemeral` option to `runner.spec`
```
....
template:
spec:
ephemeral: false
repository: mumoshu/actions-runner-controller-ci
....
```
- `ephemeral` defaults to `true`
- `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag
- Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`.
Resolves#457
This is an attempt to support scaling from/to zero.
The basic idea is that we create a one-off "registration-only" runner pod on RunnerReplicaSet being scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding those.
GitHub Actions seems to immediately throw away the new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline` would prevent GitHub actions from failing jobs. But retaining `busy` or `idle` runners means that we need to keep runner pods running, which conflicts with our desired to scale to/from zero, hence we retain `offline` runners.
In this change, I enhanced the runnerreplicaset controller to create a registration-only runner on very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline`, until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.
Related to #447
Changes:
- Switched to use `jq` in startup.sh
- Enable docker registry mirror configuration which is useful when e.g. avoiding the Docker Hub rate-limiting
Check #478 for how this feature is tested and supposed to be used.
Enable the user to set a limit size on the volume of the runner to avoid some runner pod affecting other resources of the same cluster
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This makes logging more concise by changing logger names to something like `controllers.Runner` to `actions-runner-controller.runner` after the standard `controller-rutime.controller` and reducing redundant logs by removing unnecessary requeues. I have also tweaked log messages so that their style is more consistent, which will also help readability. Also, runnerreplicaset-controller lacked useful logs so I have enhanced it.
As part of #282, I have introduced some caching mechanism to avoid excessive GitHub API calls due to the autoscaling calculation involving GitHub API calls is executed on each Webhook event.
Apparently, it was saving the wrong value in the cache- The value was one after applying `HRA.Spec.{Max,Min}Replicas` so manual changes to {Max,Min}Replicas doesn't affect RunnerDeployment.Spec.Replicas until the cache expires. This isn't what I had wanted.
This patch fixes that, by changing the value being cached to one before applying {Min,Max}Replicas.
Additionally, I've also updated logging so that you observe which number was fetched from cache, and what number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and what was the final number used as the desired-replicas(after applying {Min,Max}Replicas).
Follow-up for #282
Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded).
More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.
We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered.
A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.
This allows you to trigger autoscaling depending on check_run names(i.e. actions job names). If you are willing to differentiate scale amount only for a specific job, or want to scale only on a specific job, try this.
PercentageRunnerBusy seems to have regressed since #355 due to that RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with empty selector.
Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-795200205
We sometimes see that integration test fails due to runner replicas not meeting the expected number in a timely manner. After investigating a bit, this turned out to be due to that HRA updates on webhook-based autoscaler and HRA controller are conflicting. This changes the controllers to use Patch instead of Update to make conflicts less likely to happen.
I have also updated the hra controller to use Patch when updating RunnerDeployment, too.
Overall, these changes should make the webhook-based autoscaling more reliable due to less conflicts.
Similar to #348 for #346, but for HRA.Spec.CapacityReservations usually modified by the webhook-based autoscaler controller.
This patch tries to fix that by improving the webhook-based autoscaler controller to omit expired reservations on updating HRA spec.
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However the value had been not taking capacityReservations into account, which resulted in showing incorrect count when you used webhook-based autoscaler, or capacityReservations API directly. This fixes that.
The controller had been writing confusing messages like the below on missing scale target:
```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```
This fixes that, while improving many kinds of messages written while reconcilation, so that the error message is more actionable.
We occasionally see logs like the below:
```
2021-02-24T02:48:26.769ZERRORFailed to update runner status{"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769ZERRORcontroller-runtime.controllerReconciler error{"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```
which can be compacted into one-liner, without the useless stack trace, without double-logging the same error from the logger and the controller.
* if a runner pod starts up with an invalid token, it will go in an
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding
runner objects exists in GitHub and the pod will get removed after
registration timeout
* if the GitHub runner object already existed before - e.g. because a
finalizer was not properly run as part of a partial Kubernetes crash,
the runner will always stay in a running mode, even updating the
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart,
only do additional restart checks if no restart was already decided,
making code a bit cleaner and saving GitHub API calls after each job
completion
I have heard from some user that they have hundred thousands of `status=completed` workflow runs in their repository which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because of GitHub API rate limit due to excessive paginated requests.
This fixes that by separating list-workflow-runs calls to two - one for `queued` and one for `in_progress`, which can make the minimum API call from 1 to 2, but allows it to work regardless of number of remaining `completed` workflow runs.
* the reconciliation loop is often much faster than the runner startup,
so changing runner not found related messages to debug and also add the
possibility that the runner just needs more time
* errors.Is compares all members of a struct to return true which never
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which lead to
unregistering runners after the registration timeout
* if a k8s node becomes unresponsive, the kube controller will soft
delete all pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave the
last status and hence the runner controller will assume that everything
is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks
that none of its runners are currently busy and will not schedule any
further runners / pods, resulting in a broken runner deployment until the
runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner
controller will try to reboot the runner and create a pod on a working
node
* use forceful deletion and requeue for pods that have been stuck for
more than one minute in terminating state
* gracefully handling race conditions if pod gets finally forcefully deleted within
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes(currently hard-coded).
It should alleviate #288 in case the root cause is some kind of transient failures(network unreliability, GitHub down, temporarly compute resource shortage, etc).
Formerly you had to manually detect and delete such pods or even force-delete corresponding runners to unblock the controller.
Since this enhancement, the controller does the pod deletion automatically after 10 minutes after pod creation, which result in the controller create another pod that might work.
Ref #288
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it only fetched and considered only the first 30 workflow runs at the reconcilation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.
This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what
* before this change, if the number of runners was much larger than the max number, the applied scale down factor might still result in a desired value > maxReplicas
* if for resource constraints in the cluster, runners would be permanently restarted, the number of runners could go up more than the reverse scale down factor until the next reconciliation round, resulting in a situation where the number of runners climbs up even though it should actually go down
* by checking whether the desiredReplicas is always <= maxReplicas, infinite scaling up loops can be prevented
One of the pod recreation conditions has been modified to use hash of runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS that requires its admission webhook to mutate the runner pod to have additional, IRSA-related volumes, volume mounts and env.
Resolves#200
* runner/controller: Add externals directory mount point
* Runner: Create hack for moving content of /runner/externals/ dir
* Externals dir Mount: mount examples for '__e/node12/bin/node' not found error
Add dockerEnabled option for users who does not need docker and want not to run privileged container.
if `dockerEnabled == false`, dind container not run, and there are no privileged container.
Do the same as closed#96
* Adds RUNNER_GROUP argument to the runner registration
Adds the ability to register a runner to a predefined runner_group
Resolves#137
* Update README with runner group example
- Updates the README with instructions of how to add the runner to a
group
- Fix code fencing for shell and yaml blocks in the README
- Use consistent bullet points (dash not asterisk)
Currently, after refreshing the token, the controller re-creates the runner with the new token. This results in jobs being interrupted. This PR makes sure the pod is not restarted if it is busy.
Closes#74