`PercentageRunnersBusy`, in combination with a secondary `TotalInProgressAndQueuedWorkflowRuns` metric, now enables scale-from-zero.
Please see the new `Autoscaling to/from 0` section in the updated documentation for how it works.
Resolves #522
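For illustration, a minimal sketch of a HorizontalRunnerAutoscaler combining the two metrics. The resource names, thresholds, and exact field spellings below are assumptions on my part; the `Autoscaling to/from 0` section of the documentation is authoritative.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-hra
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  minReplicas: 0
  maxReplicas: 5
  metrics:
  # Primary metric: scale based on the percentage of busy runners.
  - type: PercentageRunnersBusy
    scaleUpThreshold: '0.75'
    scaleDownThreshold: '0.25'
  # Secondary metric: used to scale from zero when there are no runners to measure.
  - type: TotalInProgressAndQueuedWorkflowRuns
    repositoryNames:
    - mumoshu/actions-runner-controller-ci
```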
- Adds `ephemeral` option to `runner.spec`
```
....
template:
  spec:
    ephemeral: false
    repository: mumoshu/actions-runner-controller-ci
....
```
- `ephemeral` defaults to `true`
- `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag
- Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`.
Resolves #457
- You can now use `make acceptance/run` to run only a specific acceptance test case
- Add note about Ubuntu 20.04 users / snap-provided docker
- Add instruction to run Ginkgo tests
- Extract acceptance/load from acceptance/kind
- Make `acceptance/pull` not depend on `docker-build`, so that you can do `make docker-build acceptance/load` for faster image reload
This is an attempt to support scaling from/to zero.
The basic idea is that we create a one-off "registration-only" runner pod when a RunnerReplicaSet is scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding them.
GitHub Actions seems to throw away new jobs immediately when there are no runners at all. Generally, having runners in any status, `busy`, `idle`, or `offline`, prevents GitHub Actions from failing jobs. But retaining `busy` or `idle` runners means we need to keep runner pods running, which conflicts with our desire to scale to/from zero, hence we retain `offline` runners.
In this change, I enhanced the runnerreplicaset controller to create a registration-only runner at the very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline` until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler.
Related to #447
* Fix acceptance helm test not using newly built controller image
* Locally build runner image instead of pulling it
* Revert runner controller image pull policy to always
and add a line to the test deployment to use IfNotPresent
* Change runner repository from summerwind/action-runner to the owner of actions-runner-controller.
Also fix some Makefile formatting.
* Undo renaming acceptance/pull to docker-pull
* Some env var cleanup
Rename USERNAME to DOCKER_USER (it is still used for GitHub too, though)
Add RUNNER_NAME var (defaults to $DOCKER_USER/actions-runner)
Add TEST_REPO (defaults to $DOCKER_USER/actions-runner-controller)
Changes:
- Switched to `jq` in startup.sh
- Enable docker registry mirror configuration, which is useful e.g. for avoiding Docker Hub rate limiting
Check #478 for how this feature is tested and supposed to be used.
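As a sketch of what the configuration might look like, assuming the mirror is exposed via a `dockerRegistryMirror` field on the runner spec; that field name and the mirror URL are my assumptions, not confirmed by this change.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: mumoshu/actions-runner-controller-ci
      # Assumed field name; points dockerd inside the runner pod at a pull-through mirror.
      dockerRegistryMirror: https://mirror.example.com
```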
* chore: adding Helm app version back
* chore: removing redundant values entry
* chore: bumping to newer version
* chore: bumping app version to latest
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Cache docker images locally
Cache the dind, runner, and kube-rbac-proxy docker images on the host and copy them onto the kind node instead of downloading them to the node directly.
* Also cache certmanager docker images
Images for the `actions-runner:v${VERSION}` and `actions-runner:latest` tags are upgraded to Ubuntu 20.04.
If you prefer not to have Ubuntu upgraded in the runner image in the future, migrate to the new tags suffixed with `-ubuntu-20.04`, like `actions-runner:v${VERSION}-ubuntu-20.04`.
We also keep publishing the existing Ubuntu 18.04 images under new `actions-runner:v${VERSION}-ubuntu-18.04` tags. Please use those if it turns out you have workflows that depend on Ubuntu 18.04.
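For reference, a pinned tag can be set via the `image` field of the runner spec; the snippet below is a sketch, and both the repository and the exact image name/tag are illustrative.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      repository: mumoshu/actions-runner-controller-ci
      # Replace with the concrete v${VERSION}-ubuntu-18.04 (or -ubuntu-20.04) tag you want to pin.
      image: summerwind/actions-runner:v${VERSION}-ubuntu-18.04
```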
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once #307 gets merged
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a Webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period to add the missing runners.
This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. It tries to match `HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest]` against the event, and if exactly one HRA matches, that becomes the scale target. If no target, or two or more targets, are found among repository-wide runners, it repeats the search over organizational runners.
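As a sketch of how a scale-up trigger might be declared, assuming the YAML field names (`scaleUpTriggers`, `githubEvent`, `checkRun`) mirror the Go fields mentioned above; see the updated README for the authoritative schema.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-hra
spec:
  scaleTargetRef:
    name: example-runnerdeploy
  scaleUpTriggers:
  # Scale the target RunnerDeployment up by 1 replica whenever a matching
  # check_run event is delivered to the webhook server.
  - githubEvent:
      checkRun:
        types: ["created"]
        status: "queued"
```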
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* when setting a GitHub Enterprise server URL without a namespace, an error occurs: `error: the server doesn't have a resource type "controller-manager"`
* setting default namespace "actions-runner-system" makes the example work out of the box
One of the pod recreation conditions has been modified to use a hash of the runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS, which requires its admission webhook to mutate the runner pod with additional, IRSA-related volumes, volume mounts, and env vars.
Resolves #200
* Adds RUNNER_GROUP argument to the runner registration
Adds the ability to register a runner to a predefined runner_group
Resolves #137
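A sketch of what registering to a group might look like, assuming the runner spec exposes it via a `group` field; the group name is a placeholder, and the README example referenced below is the authoritative form.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  template:
    spec:
      # Placeholder: the runner group must already exist on GitHub.
      group: my-runner-group
      repository: mumoshu/actions-runner-controller-ci
```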
* Update README with runner group example
- Updates the README with instructions on how to add the runner to a group
- Fix code fencing for shell and yaml blocks in the README
- Use consistent bullet points (dash not asterisk)
* feat: Repository-wide RunnerDeployment Autoscaling
This adds `maxReplicas` and `minReplicas` to the RunnerDeploymentSpec. If and only if both fields are set, the controller computes and sets desired `replicas` automatically depending on the demand.
The desired number of runner replicas is computed as `queued workflow runs + in_progress workflow runs` for the repository. Support for organizational runners is not included.
Ref https://github.com/summerwind/actions-runner-controller/issues/10
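A minimal sketch of how the two new fields might be used on a RunnerDeployment; the resource names are illustrative, and the actual `replicas` value is computed and set by the controller within these bounds.
```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy
spec:
  # Both fields must be set for autoscaling to kick in.
  minReplicas: 1
  maxReplicas: 5
  template:
    spec:
      repository: mumoshu/actions-runner-controller-ci
```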
I had observed the exact issue described for the 4th option in https://github.com/elastic/cloud-on-k8s/issues/1822, which resulted in actions-runner-controller being unable to create or update runners. This fixes that.
I've also updated the README to introduce RunnerDeployment and manually tested that it works after the fix.
---
`actions-runner-controller` has been failing while creating and updating runners:
```
2020-03-05T11:05:16.610+0900 ERROR controllers.Runner Failed to update runner {"runner": "default/example-runner", "error": "Runner.actions.summerwind.dev \"example-runner\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"actions.summerwind.dev/v1alpha1\", \"kind\":\"Runner\", \"metadata\":map[string]interface {}{\"creationTimestamp\":\"2020-03-05T02:05:16Z\", \"finalizers\":[]interface {}{\"runner.actions.summerwind.dev\"}, \"generation\":2, \"name\":\"example-runner\", \"namespace\":\"default\", \"resourceVersion\":\"911496\", \"selfLink\":\"/apis/actions.summerwind.dev/v1alpha1/namespaces/default/runners/example-runner\", \"uid\":\"48b62d07-ff2c-42d6-878c-d3f951202209\"}, \"spec\":map[string]interface {}{\"env\":interface {}(nil), \"image\":\"\", \"repository\":\"mumoshu/actions-runner-controller-ci\"}}: validation failure list:\nspec.env in body must be of type array: \"null\""}
github.com/go-logr/zapr.(*zapLogger).Error
/Users/c-ykuoka/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReconciler).Reconcile
/Users/c-ykuoka/p/actions-runner-controller/controllers/runner_controller.go:88
```
This seems like the exact issue seen in the 4th option in https://github.com/elastic/cloud-on-k8s/issues/1822
I also observed the same failure while the runnerset controller was trying to create/update runners:
```
Also while creating runner in the runnerset controller:
2020-03-05T11:15:01.223+0900 ERROR controller-runtime.controller Reconciler error {"controller": "runnerset", "request": "default/example-runnerset", "error": "Runner.actions.summerwind.dev \"example-runnersetgp56m\" is invalid: []: Invalid value: map[string]interface {}{\"apiVersion\":\"actions.summerwind.dev/v1alpha1\", \"kind\":\"Runner\", \"metadata\":map[string]interface {}{\"creationTimestamp\":\"2020-03-05T02:15:01Z\", \"generateName\":\"example-runnerset\", \"generation\":1, \"name\":\"example-runnersetgp56m\", \"namespace\":\"default\", \"ownerReferences\":[]interface {}{map[string]interface {}{\"apiVersion\":\"actions.summerwind.dev/v1alpha1\", \"blockOwnerDeletion\":true, \"controller\":true, \"kind\":\"RunnerSet\", \"name\":\"example-runnerset\", \"uid\":\"e26f7d01-3168-496d-931b-8e6f97b776ea\"}}, \"uid\":\"4ee490f5-9a8c-4f30-86f9-61dea799b972\"}, \"spec\":map[string]interface {}{\"env\":interface {}(nil), \"image\":\"\", \"repository\":\"mumoshu/actions-runner-controller-ci\"}}: validation failure list:\nspec.env in body must be of type array: \"null\""}
github.com/go-logr/zapr.(*zapLogger).Error
```
and while the runnerdeployment controller was trying to create/update runners.
I've fixed it so that the new `RunnerDeployment` example added to the README just works.