Commit Graph

335 Commits

Author SHA1 Message Date
Johannes Nicolai 2d7fbbfb68
Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it will go in an 
infinite retry loop, appearing as RUNNING from the outside
* normally, this error situation is detected because no corresponding 
runner objects exists in GitHub and the pod will get removed after 
registration timeout
* if the GitHub runner object already existed before - e.g. because a 
finalizer was not properly run as part of a partial Kubernetes crash, 
the runner will always stay in a running mode, even updating the 
registration token will not kill the problematic pod
* introducing RunnerOffline exception that can be handled in runner 
controller and replicaset controller
* as runners are offline when a pod is completed and marked for restart, 
only do additional restart checks if no restart was already decided, 
making code a bit cleaner and saving GitHub API calls after each job 
completion
2021-02-22 10:08:04 +09:00
Yusuke Kuoka dd0b9f3e95
Merge pull request #340 from int128/integration-test-check-run
Fix index key to find HRA in GitHub webhook handler
2021-02-22 09:49:54 +09:00
Yusuke Kuoka 7cb2bc84c8
Merge pull request #334 from summerwind/integration-test-check-run
Add integration test for autoscaling on check_run webhook event
2021-02-22 09:38:07 +09:00
Hidetake Iwata b0e74bebab Fix index key to find HRA in GitHub webhook handler 2021-02-20 21:25:23 +09:00
Hidetake Iwata dfbe53dcca Fix webhook payload in integration test 2021-02-20 21:08:23 +09:00
Yusuke Kuoka ebc3970b84 Add integration test for autoscaling on check_run webhook event 2021-02-19 10:33:04 +09:00
Hidetake Iwata 1ddcf6946a
Fix nil pointer error on received check_run event (#331)
* Reproduce nil pointer error on received check_run event

* Fix nil pointer error on received check_run event
2021-02-18 20:22:36 +09:00
Yusuke Kuoka cfbaad38c8
Merge pull request #328 from int128/fix-port-name-length
Changes:

1. Fix length of github-webhook-server port name
2. Add a cluster role binding for github-webhook-server
3. Remove --enable-leader-election from github-webhook-server
2021-02-18 20:20:39 +09:00
Yusuke Kuoka 67f6de010b
feat: Common runner labels configurable per controller (#327)
* feat: Common runner labels configurable per controller

Ref #321
2021-02-18 20:19:08 +09:00
Hidetake Iwata 2db608879a Remove --enable-leader-election from github-webhook-server 2021-02-18 16:51:47 +09:00
Hidetake Iwata 2c4a6ca90b Add cluster role binding for github-webhook-server 2021-02-18 16:49:24 +09:00
Hidetake Iwata 829bf20449 Fix length of github-webhook-server port name 2021-02-18 16:42:15 +09:00
Reinier Timmer be13322816
Update runner to 2.277.1 (#322)
* Update runner to 2.277.1

* Update build-and-release-runners.yml

* integration test condition

Don't run integration tests when only updating the runner image

* fixup! integration test condition

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-02-18 09:29:53 +09:00
Johannes Nicolai 7f4a76a39b
Also log into DockerHub for release event (#326)
* so far, only push events would trigger the DockerHub login step
* hence, attempts to release would fail because of a permission problem (tested locally)
* adding OR condition to also login in case a release got published
2021-02-18 08:54:44 +09:00
callum-tait-pbx 0fce761686
fix: add trunate to ensure service kinds have valid names (#325)
* fix: adding truncate for service kinds

* chore : bumping chart version
2021-02-18 08:43:48 +09:00
Yusuke Kuoka c88ff44518
Fix wip.yml workflow for building controller canary tags (#323)
In #306 we seem to have accidentally updated a wrong workflow, which was for runner builds. This updates the one for the controller.

Resolves #302
2021-02-18 08:42:24 +09:00
Yusuke Kuoka 2fdf35ac9d
Refactor integration test to use helpers (#320)
This should make the test code a bit more DRY and readable.
2021-02-17 10:23:35 +09:00
Johannes Nicolai 6cce3fefc5
Add project to awesome-runners list (#319) 2021-02-17 09:14:42 +09:00
Yusuke Kuoka eb2eaf8130
Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining `completed` jobs (#316)
I have heard from some user that they have hundred thousands of `status=completed` workflow runs in their repository which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because of GitHub API rate limit due to excessive paginated requests.

This fixes that by separating list-workflow-runs calls to two - one for `queued` and one for `in_progress`, which can make the minimum API call from 1 to 2, but allows it to work regardless of number of remaining `completed` workflow runs.
2021-02-16 18:55:55 +09:00
callum-tait-pbx 7bf712d0d4
fix: duplicate name attribute (#318) 2021-02-16 18:52:08 +09:00
Yusuke Kuoka 7d024a6c05
Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration test. It turned out due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is being fixed with this change.
2021-02-16 18:51:33 +09:00
Yusuke Kuoka 434823bcb3
`scale{Up,Down}Adjustment` to add/remove constant number of replicas on scaling (#315)
* `scale{Up,Down}Adjustment` to add/remove constant number of replicas on scaling

Ref #305

* Bump chart version
2021-02-16 17:16:26 +09:00
Yusuke Kuoka 35d047db01
Fix enterprise runners misusing cached token (#314)
Follow-up for #290
2021-02-16 12:56:52 +09:00
Yusuke Kuoka f1db6af1c5
Add repository runners support for PercentageRunnersBusy-based autoscaling (#313)
Resolves #258
2021-02-16 12:44:51 +09:00
Hidetake Iwata 4f3f2fb60d
Add metrics for GitHub API rate limit (#312) 2021-02-16 09:58:09 +09:00
Johannes Nicolai 2623140c9a
Make log message less scary (#311)
* the reconciliation loop is often much faster than the runner startup, 
so changing runner not found related messages to debug and also add the 
possibility that the runner just needs more time
2021-02-16 09:55:55 +09:00
Johannes Nicolai 1db9d9d574
Use ARM64 compatible kube-rbac-proxy from upstream (#310)
* as pointed out in #281 the currently used image for the 
kube-rbac-proxy - gcr.io/kubebuilder/kube-rbac-proxy:v0.4.1" - does not 
have an ARM64 image
* hence, trying to use the standard deployment manifest / helm char will 
fail on ARM64 systems
* replaced image with quay.io/brancz/kube-rbac-proxy:v0.8.0 which is the 
latest version from the upstream maintainer 
(https://github.com/brancz/kube-rbac-proxy/blob/master/Makefile#L13)
* successfully tested on both AMD64 and ARM64 clusters
* fixes #281
2021-02-16 09:55:03 +09:00
callum-tait-pbx d046350240
chore: bumping helm chart sematically (#296)
* chore: bumping helm chart sematically

* chore: removing the app version config
2021-02-16 09:45:56 +09:00
callum-tait-pbx cca4d249e9
feat: create workflow for runner releases (#306) 2021-02-16 09:42:28 +09:00
Johannes Nicolai bc8bc70f69
Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct to return true which never 
happened
* switched to type check instead of exact value check
* notRegistered was using double negation in if statement which lead to 
unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
Johannes Nicolai 34c6c3d9cd
Pod eviction policy examples (crashed nodes) (#308)
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once #307 gets merged
2021-02-15 09:33:01 +09:00
Johannes Nicolai 9c8d7305f1
Introduce pod deletion timeout and forcefully delete stuck pods (#307)
* if a k8s node becomes unresponsive, the kube controller will soft
delete all pods after the eviction time (default 5 mins)
* as long as the node stays unresponsive, the pod will never leave the
last status and hence the runner controller will assume that everything
is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks
that none of its runners are currently busy and will not schedule any
further runners / pods, resulting in a broken  runner deployment until the
runnerreplicaset is deleted or the node comes back online
* introducing a pod deletion timeout (1 minute) after which the runner
controller will try to reboot the runner and create a pod on a working
node
* use forceful deletion and requeue for pods that have been stuck for
more than one minute in terminating state
* gracefully handling race conditions if pod gets finally forcefully deleted within
2021-02-15 09:32:28 +09:00
Yusuke Kuoka addcbfa7ee
Fix runner registration timeout (#301)
Fixes #300
2021-02-12 10:00:20 +09:00
Yusuke Kuoka bbb036e732
feat: Prevent blocking on transient runner registration failure (#297)
This enhances the controller to recreate the runner pod if the corresponding runner has failed to register itself to GitHub within 10 minutes(currently hard-coded).

It should alleviate #288 in case the root cause is some kind of transient failures(network unreliability, GitHub down, temporarly compute resource shortage, etc).

Formerly you had to manually detect and delete such pods or even force-delete corresponding runners to unblock the controller.

Since this enhancement, the controller does the pod deletion automatically after 10 minutes after pod creation, which result in the controller create another pod that might work.

Ref #288
2021-02-09 10:17:52 +09:00
Yusuke Kuoka 9301409aec
fix: Paginate ListRepositoryWorkflowRuns (#295)
When we used `QueuedAndInProgressWorkflowRuns`-based autoscaling, it only fetched and considered only the first 30 workflow runs at the reconcilation time. This may have resulted in unreliable scaling behaviour, like scale-in/out not happening when it was expected.
2021-02-09 10:13:53 +09:00
Yusuke Kuoka ab1c39de57
feat: HorizontalRunnerAutoscaler Webhook server (#282)
* feat: HorizontalRunnerAutoscaler Webhook server

This introduces a Webhook server that responds GitHub `check_run`, `pull_request`, and `push` events by scaling up matched HorizontalRunnerAutoscaler by 1 replica. This allows you to immediately add "resource slack" for future GitHub Actions job runs, without waiting next sync period to add insufficient runners.

This feature is highly inspired by https://github.com/philips-labs/terraform-aws-github-runner. terraform-aws-github-runner can manage one set of runners per deployment, where actions-runner-controller with this feature can manage as many sets of runners as you declare with HorizontalRunnerAutoscaler and RunnerDeployment pairs.

On each GitHub event received, the webhook server queries repository-wide and organizational runners from the cluster and searches for the single target to scale up. The webhook server tries to match HorizontalRunnerAutoscaler.Spec.ScaleUpTriggers[].GitHubEvent.[CheckRun|Push|PullRequest] against the event and if it finds only one HRA, it is the scale target. If none or two or more targets are found for repository-wide runners, it does the same on organizational runners.

Changes:

* Fix integration test
* Update manifests
* chart: Add support for github webhook server
* dockerfile: Include github-webhook-server binary
* Do not import unversioned go-github
* Update README
2021-02-07 17:37:27 +09:00
alex-mozejko a4350d0fc2
bug-fix: patched dir owned by runner (#284)
* bug-fix: patched dir owned by runner

* always build with latest runner version

* Revert "always build with latest runner version"

This reverts commit e719724ae9fe92a12d4a087185cf2a2ff543a0dd.

* Also patch dindrunner.Dockerfile

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-02-07 17:21:10 +09:00
callum-tait-pbx 2146c62c9e
chore: bumping Helm chart version (#294)
* chore: adding mac rubbish to gitignore

* chore: bumping chart version
2021-02-07 16:46:19 +09:00
Jesse Haka 28e80a2d28
Add support for enterprise runners (#290)
* Add support for enterprise runners

* update docs
2021-02-05 09:31:06 +09:00
Tom Bamford 831db9ee2a
Added github.sha to DockerHub push (#286)
* Added GITHUB.RUN_NUMBER to DockerHub push

* switch run_number to sha on docker tag

* re-add mutable tags for backwards compatability

* truncate to short SHA (7 chars)

* behaviour workaround

* use ENV to define sha_short

* use ::set-output to define sha_short

* bump action
2021-02-04 09:29:32 +09:00
Donovan Muller 4d69e0806e
Update GitHub runner version (#280) 2021-02-02 14:06:08 +09:00
Donovan Muller d37cd69e9b
feat/helm: Bump appVersion to 0.6.1 release (#272)
* feat/helm: Bump appVersion to 0.6.1 release

* Also bump chart version to trigger a new chart release

Co-authored-by: Yusuke Kuoka <c-ykuoka@zlab.co.jp>
2021-01-29 09:29:43 +09:00
Yusuke Kuoka a2690aa5cb Update README.md
Follow-up for #275
2021-01-29 09:29:26 +09:00
Clément da020df0fd
docs: fix install installation method (#275) 2021-01-29 09:28:34 +09:00
Jonas Lergell 6c64ae6a01
Actually use 'dockerdContainerResources' to set resources on the dind container (#273) 2021-01-29 09:18:28 +09:00
Yusuke Kuoka 42c7d0489d chart: Bump to 0.2.0 2021-01-25 09:14:49 +09:00
Donovan Muller b3bef6404c
Add support for additional environment variables (#271) 2021-01-25 09:00:03 +09:00
David Young 1127c447c4
Add GitHub Actions to publish helm chart (#257)
* Add chart workflows (#1)

* Add chart workflows

* Fix publishing step in CI

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Update CI on push-to-master (#3)

* Put helm installation step in the correct CI job

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Put helm installation step in the correct CI job (#4)

* Update on-push-master-publish-chart.yml

* Remove references to certmanager dependency

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Add ability to customize kube-rbac-proxy image

Signed-off-by: David Young <davidy@funkypenguin.co.nz>

* Only install cert-manager if we're going to spin up KinD

Signed-off-by: David Young <davidy@funkypenguin.co.nz>
2021-01-24 15:37:01 +09:00
Yusuke Kuoka ace95d72ab
Fix self-update failuers due to /runner/externals mount (#253)
* Fix self-update failuers due to /runner/externals mount

Fixes #252

* Tested Self-update Fixes (#269)

Adding fixes to #253 as confirmed and tested in https://github.com/summerwind/actions-runner-controller/issues/264#issuecomment-764549833 by @jolestar, @achedeuzot and @hfuss 🙇 🍻

Co-authored-by: Hayden Fuss <wifu1234@gmail.com>
2021-01-24 10:58:35 +09:00
Johannes Nicolai 42493d5e01
Adding --name-space parameter in example (#259)
* when setting a GitHub Enterprise server URL without a namespace, an error occurs: "error: the server doesn't have a resource type "controller-manager"
* setting default namespace "actions-runner-system" makes the example work out of the box
2021-01-22 10:12:04 +09:00