This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself with GitHub within 10 minutes (currently hard-coded).
It should alleviate #288 when the root cause is a transient failure (network unreliability, a GitHub outage, a temporary compute resource shortage, etc.).
Formerly you had to manually detect and delete such pods, or even force-delete the corresponding runners, to unblock the controller.
With this enhancement, the controller deletes the pod automatically 10 minutes after pod creation, which results in the controller creating another pod that might work.
Ref #288
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it fetched and considered only the first 30 workflow runs at reconciliation time. This may have resulted in unreliable scaling behaviour, such as scale-in/out not happening when expected.
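One way to avoid looking at only the first page is to follow the pagination until it is exhausted. The sketch below assumes that approach; the `page` type, `listAllRuns`, and `fakeFetcher` are hypothetical stand-ins and not the GitHub client the controller uses.

```go
package main

import "fmt"

// page is a hypothetical stand-in for one page of a paginated listing.
type page struct {
	runIDs   []int64
	nextPage int // 0 means there are no more pages
}

// listAllRuns keeps requesting pages until nextPage is 0, so that every
// workflow run is considered rather than only the first page.
func listAllRuns(fetch func(pageNum int) (page, error)) ([]int64, error) {
	var all []int64
	for p := 1; p != 0; {
		res, err := fetch(p)
		if err != nil {
			return nil, err
		}
		all = append(all, res.runIDs...)
		p = res.nextPage
	}
	return all, nil
}

// fakeFetcher simulates an API holding `total` runs served `perPage` at a time.
func fakeFetcher(total, perPage int) func(int) (page, error) {
	return func(pageNum int) (page, error) {
		start := (pageNum - 1) * perPage
		if start >= total {
			return page{}, nil
		}
		end := start + perPage
		if end > total {
			end = total
		}
		var res page
		for i := start; i < end; i++ {
			res.runIDs = append(res.runIDs, int64(i+1))
		}
		if end < total {
			res.nextPage = pageNum + 1
		}
		return res, nil
	}
}

func main() {
	runs, err := listAllRuns(fakeFetcher(65, 30))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("runs considered:", len(runs))
}
```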
* feat: HorizontalRunnerAutoscaler Webhook server
This introduces a webhook server that responds to GitHub `check_run`, `pull_request`, and `push` events by scaling up the matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting for the next sync period to make up for the shortage of runners.
This feature is heavily inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, whereas actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.
On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event, and if it finds exactly one HRA, that HRA is the scale target. If no target or two or more targets are found for repository-wide runners, it does the same for organizational runners.
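The target selection described above can be sketched as follows. The `autoscaler` type and both functions are simplified, hypothetical stand-ins; the real matching works on HorizontalRunnerAutoscaler resources and their `ScaleUpTriggers`, not on plain strings.

```go
package main

import "fmt"

// autoscaler is a hypothetical, simplified view of an HRA: a name plus the
// GitHub event names its scale-up triggers accept.
type autoscaler struct {
	name   string
	events []string
}

// matching returns the autoscalers that have a trigger for the event.
func matching(hras []autoscaler, event string) []autoscaler {
	var out []autoscaler
	for _, h := range hras {
		for _, e := range h.events {
			if e == event {
				out = append(out, h)
				break
			}
		}
	}
	return out
}

// findScaleTarget looks for exactly one match among repository-wide HRAs and,
// if there is none or more than one, repeats the search on organizational HRAs.
func findScaleTarget(repoHRAs, orgHRAs []autoscaler, event string) (autoscaler, bool) {
	for _, candidates := range [][]autoscaler{repoHRAs, orgHRAs} {
		if m := matching(candidates, event); len(m) == 1 {
			return m[0], true
		}
	}
	return autoscaler{}, false
}

func main() {
	repo := []autoscaler{
		{name: "repo-a", events: []string{"push"}},
		{name: "repo-b", events: []string{"push"}},
	}
	org := []autoscaler{{name: "org-a", events: []string{"push", "check_run"}}}

	target, ok := findScaleTarget(repo, org, "push")
	fmt.Println(target.name, ok)
}
```

In this example two repository-wide autoscalers match `push`, so the search falls through to the organizational level, where a single match is found.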
Changes:
* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
* bug-fix: patched dir owned by runner
* always build with latest runner version
* Revert "always build with latest runner version"
This reverts commit e719724ae9fe92a12d4a087185cf2a2ff543a0dd.
* Also patch dindrunner.Dockerfile
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Added GITHUB.RUN_NUMBER to DockerHub push
* switch run_number to sha on docker tag
* re-add mutable tags for backwards compatibility
* truncate to short SHA (7 chars)
* behaviour workaround
* use ENV to define sha_short
* use ::set-output to define sha_short
* bump action
* feat/helm: Bump appVersion to 0.6.1 release
* Also bump chart version to trigger a new chart release
Co-authored-by: Yusuke Kuoka <c-ykuoka@zlab.co.jp>
* Add chart workflows (#1)
* Add chart workflows
* Fix publishing step in CI
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Update CI on push-to-master (#3)
* Put helm installation step in the correct CI job
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Put helm installation step in the correct CI job (#4)
* Update on-push-master-publish-chart.yml
* Remove references to certmanager dependency
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Add ability to customize kube-rbac-proxy image
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* Only install cert-manager if we're going to spin up KinD
Signed-off-by: David Young <davidy@funkypenguin.co.nz>
* when setting a GitHub Enterprise server URL without a namespace, an error occurs: "error: the server doesn't have a resource type "controller-manager""
* setting default namespace "actions-runner-system" makes the example work out of the box
* ensure that minReplicas <= desiredReplicas <= maxReplicas no matter what
* before this change, if the number of runners was much larger than the max number, the applied scale-down factor might still result in a desired value > maxReplicas
* if runners were permanently restarted because of resource constraints in the cluster, the number of runners could go up by more than the scale-down factor brought it down before the next reconciliation round, resulting in a situation where the number of runners climbs even though it should actually go down
* by ensuring that desiredReplicas is always <= maxReplicas, infinite scale-up loops can be prevented
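The invariant minReplicas <= desiredReplicas <= maxReplicas amounts to a clamp applied after the scaling calculation. A minimal sketch, using a hypothetical helper name rather than the controller's own function:

```go
package main

import "fmt"

// clampReplicas forces desired into [minReplicas, maxReplicas].
// The helper name is hypothetical; it assumes minReplicas <= maxReplicas.
func clampReplicas(desired, minReplicas, maxReplicas int) int {
	if desired < minReplicas {
		return minReplicas
	}
	if desired > maxReplicas {
		return maxReplicas
	}
	return desired
}

func main() {
	// A scale-down factor applied to an oversized runner count can still
	// exceed the maximum, so the result is clamped afterwards.
	current := 40
	scaled := int(float64(current) * 0.7) // scaled down, but still above the maximum of 10
	fmt.Println(clampReplicas(scaled, 1, 10))
}
```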
* feat: adding manager secret to Helm
* fix: correcting secret data format
* feat: adding in common labels
* fix: updating default values to have config
The auth config needs to be commented out by default, as we don't want to deploy both configs empty. This may break stuff, so we want the user to actively uncomment the auth method they want instead.
* chore: updating default format of cert
* chore: wording
One of the pod recreation conditions has been modified to use a hash of the runner spec, so that the controller does not keep restarting pods mutated by admission webhooks. This naturally allows us, for example, to use IRSA for EKS, which requires its admission webhook to mutate the runner pod to add IRSA-related volumes, volume mounts and env.
Resolves #200
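The idea can be sketched as follows: the hash is computed from the runner spec only, so anything an admission webhook adds to the pod afterwards does not change it. The `runnerSpec` fields, the function names, and the choice of SHA-256 over JSON are assumptions made for illustration; the controller may hash differently.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// runnerSpec is a hypothetical, trimmed-down runner spec.
type runnerSpec struct {
	Repository string   `json:"repository"`
	Image      string   `json:"image"`
	Labels     []string `json:"labels,omitempty"`
}

// specHash returns a stable digest of the spec (SHA-256 over its JSON form).
func specHash(spec runnerSpec) (string, error) {
	b, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// needsRecreation compares the hash recorded when the pod was created with
// the hash of the current spec. Pod-level mutations are not part of the
// input, so they never trigger a recreation.
func needsRecreation(recordedHash string, current runnerSpec) (bool, error) {
	h, err := specHash(current)
	if err != nil {
		return false, err
	}
	return h != recordedHash, nil
}

func main() {
	spec := runnerSpec{Repository: "example/repo", Image: "runner:v1"}
	recorded, _ := specHash(spec)

	unchanged, _ := needsRecreation(recorded, spec)
	fmt.Println("recreate with unchanged spec:", unchanged)

	spec.Image = "runner:v2"
	changed, _ := needsRecreation(recorded, spec)
	fmt.Println("recreate after spec change:", changed)
}
```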
It turned out that previous versions of the runner images were unable to run actions that require `AGENT_TOOLSDIRECTORY` or `libyaml` to exist in the runner environment. One notable example of such actions is [`ruby/setup-ruby`](https://github.com/ruby/setup-ruby).
This change adds support for those actions by setting up `AGENT_TOOLSDIRECTORY` and installing `libyaml-dev` within the runner images.
* runner/controller: Add externals directory mount point
* Runner: Create hack for moving content of /runner/externals/ dir
* Externals dir Mount: mount examples for '__e/node12/bin/node' not found error
Add a dockerEnabled option for users who do not need docker and do not want to run a privileged container.
If `dockerEnabled == false`, the dind container is not run, and there is no privileged container.
This does the same as the closed #96
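The effect of the option on the pod can be sketched as follows. The `container` type and `podContainers` are hypothetical simplifications and the container names are illustrative; the actual controller builds Kubernetes container specs.

```go
package main

import "fmt"

// container is a hypothetical, simplified container description.
type container struct {
	name       string
	privileged bool
}

// podContainers returns the containers of a runner pod. The privileged dind
// sidecar is only added when dockerEnabled is true.
func podContainers(dockerEnabled bool) []container {
	containers := []container{{name: "runner"}}
	if dockerEnabled {
		containers = append(containers, container{name: "docker", privileged: true})
	}
	return containers
}

func main() {
	for _, enabled := range []bool{true, false} {
		fmt.Printf("dockerEnabled=%v -> %+v\n", enabled, podContainers(enabled))
	}
}
```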
The docker:dind container creates `/var/run/docker.sock` owned by the root user and root group.
So the docker command in the runner container needs root privileges to use docker.sock, and docker actions fail because of the lack of permission.
Use a TCP connection between the runner and docker containers, so that the runner container doesn't need root privileges to run docker and can run docker actions.
Fixes #174
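As an illustration of the switch from the socket to TCP: the docker CLI reads its endpoint from the `DOCKER_HOST` environment variable, and containers in the same pod share a network namespace, so the runner can reach the dind container via localhost. The helper below is hypothetical, and port 2375 (the conventional unencrypted Docker port) is an assumption; the port and any TLS settings used by this change are not stated here.

```go
package main

import "fmt"

// dockerHost returns a value the runner container could export as DOCKER_HOST.
// The helper is hypothetical, and the TCP port is an assumption.
func dockerHost(useTCP bool) string {
	if useTCP {
		// Reachable without access to the root-owned socket file.
		return "tcp://localhost:2375"
	}
	// Default endpoint; requires permission to open the socket file.
	return "unix:///var/run/docker.sock"
}

func main() {
	fmt.Println("DOCKER_HOST=" + dockerHost(true))
}
```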
Acceptance tests are passing with the chart. In addition to standard chart values, syncPeriod is supported.
Please use it as a foundation for further collaboration.
Ref #184
Inspired by #91
Related #61