Commit Graph

603 Commits

Author SHA1 Message Date
Yusuke Kuoka 9ae3551744
Remove unnecessary GitHub API calls (#363)
The controller was making two extra, redundant calls to the List Workflow Runs API.

Ref #362
2021-03-02 10:55:30 +09:00
Rolf Ahrenberg 05ad3f5469
Set default python (#361) 2021-03-01 09:45:13 +09:00
callum-tait-pbx 9c7372a8e0
docs: styling fixes (#359)
* docs: styling fixes

* docs: grammar fixes
2021-03-01 09:44:35 +09:00
Yusuke Kuoka 584590e97c
Use patch instead of update to alleviate HRA conflict on webhook (#358)
We sometimes see the integration test fail because runner replicas do not reach the expected number in a timely manner. After investigating a bit, this turned out to be caused by conflicting HRA updates from the webhook-based autoscaler and the HRA controller. This changes the controllers to use Patch instead of Update so conflicts are less likely to happen.

I have also updated the HRA controller to use Patch when updating RunnerDeployment.

Overall, these changes should make webhook-based autoscaling more reliable thanks to fewer conflicts.
2021-02-26 10:17:09 +09:00
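As a minimal sketch of the change described above (the function, variable, and object names are illustrative, not the project's actual code), this is how an Update call is typically replaced by a MergeFrom patch in a controller-runtime reconciler:

```
// Minimal sketch (illustrative names): replacing client.Update with a
// MergeFrom patch so that concurrent writers touching other fields no longer
// cause "the object has been modified" conflicts.
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func patchInsteadOfUpdate(ctx context.Context, c client.Client, pod *corev1.Pod) (ctrl.Result, error) {
	// Keep a copy of the object as it was read; it becomes the patch base.
	base := pod.DeepCopy()

	// Mutate only the fields this controller owns.
	if pod.Labels == nil {
		pod.Labels = map[string]string{}
	}
	pod.Labels["example"] = "patched"

	// client.Update would fail with a conflict if anyone else wrote the
	// object in the meantime; a merge patch submits only the diff against
	// `base`, so unrelated concurrent updates no longer conflict.
	if err := c.Patch(ctx, pod, client.MergeFrom(base)); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil
}
```

The same pattern applies to the HRA and RunnerDeployment updates mentioned in the commit.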
Yusuke Kuoka d18884a0b9
Fix HRA expired cache entries not cleaned up (#357)
Fixes #356
2021-02-26 09:54:24 +09:00
callum-tait-pbx f987571b64
Improve docs (#303) 2021-02-26 09:32:18 +09:00
Taehyun Kim 450e384c4c
Update helm chart (#343)
* add replicaCount

* Add authSecret.existingSecret

* set image.tag null by default

* implement ingress for githubwebhook server

* fix deprecated and secretName template

* backward compat .authSecret.enabled

* existingSecret for github webhook secret

* use secretName template

* set default secret names

* do not use app version based image tag

* create and name variable for secrets
2021-02-26 09:26:51 +09:00
Yusuke Kuoka e9eef04993
Fix old HRA capacity reservations not cleaned up (#354)
Similar to #348 for #346, but for HRA.Spec.CapacityReservations, which is usually modified by the webhook-based autoscaler controller.
This patch fixes that by making the webhook-based autoscaler controller omit expired reservations when updating the HRA spec.
2021-02-25 11:08:00 +09:00
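A rough sketch of the idea, with assumed field names (the project's actual CapacityReservation type may differ): before the HRA spec is written back, reservations whose expiration time has passed are dropped so they do not accumulate.

```
// Sketch with assumed field names: drop capacity reservations that have
// already expired before updating the HRA spec.
package controllers

import "time"

type capacityReservation struct {
	Replicas       int
	ExpirationTime time.Time
}

// omitExpired returns only the reservations that are still valid at `now`.
func omitExpired(reservations []capacityReservation, now time.Time) []capacityReservation {
	var kept []capacityReservation
	for _, r := range reservations {
		if r.ExpirationTime.After(now) {
			kept = append(kept, r)
		}
	}
	return kept
}
```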
Yusuke Kuoka 598dd1d9fe
Fix incorrect DESIRED on `kubectl get hra` (#353)
`kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However, the value did not take capacityReservations into account, which resulted in an incorrect count when you used the webhook-based autoscaler or the capacityReservations API directly. This fixes that.
2021-02-25 10:32:09 +09:00
Yusuke Kuoka 9890a90e69
Improve webhook-based autoscaler log (#352)
The controller had been writing confusing messages like the one below when the scale target was missing:

```
Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""}
```

This fixes that, and also improves many of the messages written during reconciliation so that errors are more actionable.
2021-02-25 10:07:41 +09:00
Yusuke Kuoka 9da123ae5e
Fix integration test flakiness (#351)
Ref https://github.com/summerwind/actions-runner-controller/pull/345#issuecomment-785015406
2021-02-25 09:30:32 +09:00
Johannes Nicolai 4d4137aa28
Avoid zombie runners that missed token expiration by a bit (#345)
* if a new runner pod was scheduled to start up right before its registration token expired, it would not get a new registration token and would go into an infinite update loop (until #341 kicks in)
* if registration tokens are refreshed a little before they actually expire, pods that are just starting up are much more likely to get a working token (see the sketch below)
2021-02-25 09:07:49 +09:00
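A minimal sketch of the behaviour described above (the helper name and the buffer value are assumptions, not the project's actual code): the registration token is renewed slightly before its expiration so starting pods still receive a usable token.

```
// Minimal sketch (assumed names and buffer): refresh the registration token a
// little before it actually expires, so pods that are just starting up are
// much more likely to receive a still-valid token.
package controllers

import "time"

// shouldRefreshToken reports whether a token expiring at `expiresAt` should be
// renewed now, given a safety buffer (e.g. a few minutes).
func shouldRefreshToken(expiresAt, now time.Time, buffer time.Duration) bool {
	return now.Add(buffer).After(expiresAt)
}
```

For example, `shouldRefreshToken(exp, time.Now(), 5*time.Minute)` would trigger a refresh once the token has less than five minutes of validity left; the concrete buffer used by the controller is not stated here.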
Yusuke Kuoka 022007078e
Compact excessive error message on runnerreplicaset status update conflict (#350)
We occasionally see logs like the below:

```
2021-02-24T02:48:26.769Z ERROR Failed to update runner status {"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
github.com/summerwind/actions-runner-controller/controllers.(*RunnerReplicaSetReconciler).Reconcile
/home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
2021-02-24T02:48:26.769Z ERROR controller-runtime.controller Reconciler error {"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"}
github.com/go-logr/zapr.(*zapLogger).Error
/home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker
/home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211
k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152
k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153
k8s.io/apimachinery/pkg/util/wait.Until
/home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88
```

which can be compacted into a one-liner, without the unhelpful stack trace and without double-logging the same error from both the logger and the controller.
2021-02-25 09:01:02 +09:00
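A hedged sketch of how such a conflict can be reported as a single concise line instead of being logged and then returned again (names are assumptions; the controller's actual handling may differ):

```
// Illustrative sketch: report a status-update conflict as one short line and
// requeue, instead of logging it with a stack trace and also returning the
// error (which makes controller-runtime log it a second time).
package controllers

import (
	"github.com/go-logr/logr"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
)

func handleStatusUpdateError(log logr.Logger, err error) (ctrl.Result, error) {
	if apierrors.IsConflict(err) {
		// One line, no stack trace, no double-logging; just retry.
		log.V(1).Info("Retrying status update due to conflict", "reason", err.Error())
		return ctrl.Result{Requeue: true}, nil
	}
	// Any other error is returned as usual and logged by the controller.
	return ctrl.Result{}, err
}
```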
Johannes Nicolai 31e5e61155
Log correct runner that was deleted (#349) 2021-02-25 08:38:55 +09:00
Aditya Purandare 1d1453c5f2
Fix user used for dind runner group permissions (#337) 2021-02-24 19:06:52 +09:00
Yusuke Kuoka e44e53b88e
Fix failure while saving HRA status after running controller for a while (#348)
Fixes #346
2021-02-24 11:20:21 +09:00
Yusuke Kuoka 398791241e
Fix runner release workflow to do docker-push (#347)
Apparently I mistakenly removed the `push` option from the workflow in #323, which resulted in the new runner build (#323) not being pushed. This fixes that.
2021-02-24 11:08:33 +09:00
Yusuke Kuoka 991535e567
Fix panic on webhook for user-owned repository (#344)
* Fix panic on webhook for user-owned repository

Follow-up for #282 and #334
2021-02-23 08:05:25 +09:00
Johannes Nicolai 2d7fbbfb68
Handle offline runners gracefully (#341)
* if a runner pod starts up with an invalid token, it goes into an infinite retry loop while appearing as RUNNING from the outside
* normally this error situation is detected because no corresponding runner object exists in GitHub, and the pod gets removed after the registration timeout
* if the GitHub runner object already existed before (e.g. because a finalizer was not properly run during a partial Kubernetes crash), the runner stays in running mode forever, and even updating the registration token will not kill the problematic pod
* introduce a RunnerOffline error that can be handled in the runner controller and the replicaset controller (see the sketch below)
* as runners are offline when a pod is completed and marked for restart, only do the additional restart checks if no restart was already decided, making the code a bit cleaner and saving GitHub API calls after each job completion
2021-02-22 10:08:04 +09:00
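A sketch of what such a dedicated error type can look like (the names are assumed; the project's actual type may differ), so that callers can distinguish "the runner exists on GitHub but is offline" from other failures:

```
// Sketch with assumed names: a dedicated error type that the runner and
// runnerreplicaset controllers can match on, instead of treating an offline
// runner like any other failure.
package controllers

import (
	"errors"
	"fmt"
)

type runnerOfflineError struct {
	runnerName string
}

func (e *runnerOfflineError) Error() string {
	return fmt.Sprintf("runner %q exists but is offline", e.runnerName)
}

// isRunnerOffline matches on the error type rather than on an exact value.
func isRunnerOffline(err error) bool {
	var offline *runnerOfflineError
	return errors.As(err, &offline)
}
```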
Yusuke Kuoka dd0b9f3e95
Merge pull request #340 from int128/integration-test-check-run
Fix index key to find HRA in GitHub webhook handler
2021-02-22 09:49:54 +09:00
Yusuke Kuoka 7cb2bc84c8
Merge pull request #334 from summerwind/integration-test-check-run
Add integration test for autoscaling on check_run webhook event
2021-02-22 09:38:07 +09:00
Hidetake Iwata b0e74bebab Fix index key to find HRA in GitHub webhook handler 2021-02-20 21:25:23 +09:00
Hidetake Iwata dfbe53dcca Fix webhook payload in integration test 2021-02-20 21:08:23 +09:00
Yusuke Kuoka ebc3970b84 Add integration test for autoscaling on check_run webhook event 2021-02-19 10:33:04 +09:00
Hidetake Iwata 1ddcf6946a
Fix nil pointer error on received check_run event (#331)
* Reproduce nil pointer error on received check_run event

* Fix nil pointer error on received check_run event
2021-02-18 20:22:36 +09:00
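For context, a hedged illustration of this failure mode with go-github webhook payloads (the handler fragment and import path are hypothetical, shown for a recent go-github version): optional fields are pointers, so dereferencing them directly panics when they are absent, while the generated Get* accessors are nil-safe.

```
// Hypothetical handler fragment: go-github payload fields are pointers, so a
// direct dereference panics on missing data; the Get* accessors return zero
// values instead.
package main

import "github.com/google/go-github/v33/github"

func ownerLogin(e *github.CheckRunEvent) string {
	// Panics if Repo or Owner is nil:
	//   return *e.Repo.Owner.Login
	// Nil-safe: returns "" when any link in the chain is missing.
	return e.GetRepo().GetOwner().GetLogin()
}
```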
Yusuke Kuoka cfbaad38c8
Merge pull request #328 from int128/fix-port-name-length
Changes:

1. Fix length of github-webhook-server port name
2. Add a cluster role binding for github-webhook-server
3. Remove --enable-leader-election from github-webhook-server
2021-02-18 20:20:39 +09:00
Yusuke Kuoka 67f6de010b
feat: Common runner labels configurable per controller (#327)
* feat: Common runner labels configurable per controller

Ref #321
2021-02-18 20:19:08 +09:00
Hidetake Iwata 2db608879a Remove --enable-leader-election from github-webhook-server 2021-02-18 16:51:47 +09:00
Hidetake Iwata 2c4a6ca90b Add cluster role binding for github-webhook-server 2021-02-18 16:49:24 +09:00
Hidetake Iwata 829bf20449 Fix length of github-webhook-server port name 2021-02-18 16:42:15 +09:00
Reinier Timmer be13322816
Update runner to 2.277.1 (#322)
* Update runner to 2.277.1

* Update build-and-release-runners.yml

* integration test condition

Don't run integration tests when only updating the runner image

* fixup! integration test condition

Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2021-02-18 09:29:53 +09:00
Johannes Nicolai 7f4a76a39b
Also log into DockerHub for release event (#326)
* so far, only push events would trigger the DockerHub login step
* hence, attempts to release would fail because of a permission problem (tested locally)
* adding an OR condition to also log in when a release is published
2021-02-18 08:54:44 +09:00
callum-tait-pbx 0fce761686
fix: add truncate to ensure service kinds have valid names (#325)
* fix: adding truncate for service kinds

* chore: bumping chart version
2021-02-18 08:43:48 +09:00
Yusuke Kuoka c88ff44518
Fix wip.yml workflow for building controller canary tags (#323)
In #306 we seem to have accidentally updated the wrong workflow, which was the one for runner builds. This updates the one for the controller.

Resolves #302
2021-02-18 08:42:24 +09:00
Yusuke Kuoka 2fdf35ac9d
Refactor integration test to use helpers (#320)
This should make the test code a bit more DRY and readable.
2021-02-17 10:23:35 +09:00
Johannes Nicolai 6cce3fefc5
Add project to awesome-runners list (#319) 2021-02-17 09:14:42 +09:00
Yusuke Kuoka eb2eaf8130
Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining `completed` jobs (#316)
I have heard from a user that they have hundreds of thousands of `status=completed` workflow runs in their repository, which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because the excessive paginated requests hit the GitHub API rate limit.

This fixes that by splitting the list-workflow-runs call into two - one for `queued` and one for `in_progress` - which raises the minimum number of API calls from 1 to 2, but allows it to work regardless of the number of remaining `completed` workflow runs.
2021-02-16 18:55:55 +09:00
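A sketch of the approach using the go-github client (treat the exact package version and signatures as assumptions): issuing one request per interesting status means the result no longer depends on how many completed runs the repository has accumulated.

```
// Illustrative sketch: count queued and in-progress workflow runs with one
// request per status, relying on the TotalCount reported by the API instead
// of paginating through every run.
package main

import (
	"context"

	"github.com/google/go-github/v33/github"
)

func countQueuedAndInProgress(ctx context.Context, client *github.Client, owner, repo string) (int, error) {
	total := 0
	for _, status := range []string{"queued", "in_progress"} {
		runs, _, err := client.Actions.ListRepositoryWorkflowRuns(ctx, owner, repo,
			&github.ListWorkflowRunsOptions{Status: status})
		if err != nil {
			return 0, err
		}
		// TotalCount comes from the API response, so no extra pagination
		// is needed just to count the runs.
		total += runs.GetTotalCount()
	}
	return total, nil
}
```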
callum-tait-pbx 7bf712d0d4
fix: duplicate name attribute (#318) 2021-02-16 18:52:08 +09:00
Yusuke Kuoka 7d024a6c05
Fix "duplicate metrics collector registration attempted" errors at startup (#317)
I have seen this error a lot in our integration test. It turned out to be due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is fixed by this change.
2021-02-16 18:51:33 +09:00
Yusuke Kuoka 434823bcb3
`scale{Up,Down}Adjustment` to add/remove constant number of replicas on scaling (#315)
* `scale{Up,Down}Adjustment` to add/remove constant number of replicas on scaling

Ref #305

* Bump chart version
2021-02-16 17:16:26 +09:00
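A minimal sketch of the arithmetic, with assumed names: the adjustment settings add or remove a fixed number of replicas on each scaling decision, and the result is clamped to the configured minimum and maximum.

```
// Minimal sketch (assumed names): scaleUpAdjustment / scaleDownAdjustment add
// or remove a constant number of replicas, clamped to the configured bounds.
package controllers

func nextReplicas(current, minReplicas, maxReplicas, scaleUpAdjustment, scaleDownAdjustment int, scaleUp bool) int {
	next := current
	if scaleUp {
		next += scaleUpAdjustment
	} else {
		next -= scaleDownAdjustment
	}
	if next < minReplicas {
		next = minReplicas
	}
	if next > maxReplicas {
		next = maxReplicas
	}
	return next
}
```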
Yusuke Kuoka 35d047db01
Fix enterprise runners misusing cached token (#314)
Follow-up for #290
2021-02-16 12:56:52 +09:00
Yusuke Kuoka f1db6af1c5
Add repository runners support for PercentageRunnersBusy-based autoscaling (#313)
Resolves #258
2021-02-16 12:44:51 +09:00
Hidetake Iwata 4f3f2fb60d
Add metrics for GitHub API rate limit (#312) 2021-02-16 09:58:09 +09:00
Johannes Nicolai 2623140c9a
Make log message less scary (#311)
* the reconciliation loop is often much faster than runner startup, so runner-not-found messages are changed to debug level, and the message now mentions that the runner may just need more time to start
2021-02-16 09:55:55 +09:00
Johannes Nicolai 1db9d9d574
Use ARM64 compatible kube-rbac-proxy from upstream (#310)
* as pointed out in #281, the currently used kube-rbac-proxy image - gcr.io/kubebuilder/kube-rbac-proxy:v0.4.1 - does not have an ARM64 variant
* hence, trying to use the standard deployment manifest / helm chart will fail on ARM64 systems
* replaced the image with quay.io/brancz/kube-rbac-proxy:v0.8.0, which is the latest version from the upstream maintainer (https://github.com/brancz/kube-rbac-proxy/blob/master/Makefile#L13)
* successfully tested on both AMD64 and ARM64 clusters
* fixes #281
2021-02-16 09:55:03 +09:00
callum-tait-pbx d046350240
chore: bumping helm chart semantically (#296)
* chore: bumping helm chart semantically

* chore: removing the app version config
2021-02-16 09:45:56 +09:00
callum-tait-pbx cca4d249e9
feat: create workflow for runner releases (#306) 2021-02-16 09:42:28 +09:00
Johannes Nicolai bc8bc70f69
Fix rate limit and runner registration logic (#309)
* errors.Is compares all members of a struct and only returns true when they all match, which never happened
* switched to a type check instead of an exact value check (see the sketch below)
* notRegistered was used with a double negation in an if statement, which led to unregistering runners after the registration timeout
2021-02-15 09:36:49 +09:00
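An illustration of this bug class with a hypothetical error type: errors.Is on a comparable struct value falls back to ==, comparing every field, which never matches once the error carries request-specific data, whereas a type check via errors.As succeeds regardless of the field values.

```
// Hypothetical error type to illustrate the difference between an exact value
// check (errors.Is) and a type check (errors.As).
package main

import (
	"errors"
	"fmt"
)

type rateLimitError struct {
	retryAfter string
}

func (e rateLimitError) Error() string { return "rate limited, retry after " + e.retryAfter }

func main() {
	err := fmt.Errorf("listing runners: %w", rateLimitError{retryAfter: "30s"})

	// false: errors.Is falls back to ==, and the retryAfter fields differ.
	fmt.Println(errors.Is(err, rateLimitError{}))

	// true: errors.As matches on the concrete type and ignores field values.
	var rle rateLimitError
	fmt.Println(errors.As(err, &rle))
}
```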
Johannes Nicolai 34c6c3d9cd
Pod eviction policy examples (crashed nodes) (#308)
* ... otherwise it will take 40 seconds (until a node is detected as unreachable) + 5 minutes (until pods are evicted from unreachable/crashed nodes)
* pods stuck in "Terminating" status on unreachable nodes will only be freed once #307 gets merged
2021-02-15 09:33:01 +09:00
Johannes Nicolai 9c8d7305f1
Introduce pod deletion timeout and forcefully delete stuck pods (#307)
* if a k8s node becomes unresponsive, the kube controller will soft-delete all of its pods after the eviction time (default 5 minutes)
* as long as the node stays unresponsive, the pod never leaves its last status, so the runner controller assumes that everything is fine with the pod and will not try to create new pods
* this can result in a situation where a horizontal autoscaler thinks that none of its runners are currently busy and will not schedule any further runners / pods, resulting in a broken runner deployment until the runnerreplicaset is deleted or the node comes back online
* introduce a pod deletion timeout (1 minute) after which the runner controller will try to reboot the runner and create a pod on a working node
* use forceful deletion, and requeue, for pods that have been stuck in the terminating state for more than one minute (see the sketch below)
* gracefully handle race conditions when the pod does finally get forcefully deleted
2021-02-15 09:32:28 +09:00
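A sketch of the forceful-deletion step under the assumptions above (the one-minute timeout comes from the commit message; the function name and surrounding structure are illustrative, not the project's actual code):

```
// Illustrative sketch: pods that have been Terminating for longer than the
// deletion timeout are deleted again with a zero grace period, so they are
// removed even when their node is unreachable; NotFound is tolerated because
// the pod may have been removed concurrently.
package controllers

import (
	"context"
	"time"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

const podDeletionTimeout = time.Minute

func forceDeleteIfStuck(ctx context.Context, c client.Client, pod *corev1.Pod) (ctrl.Result, error) {
	if pod.DeletionTimestamp == nil {
		// Not being deleted at all; nothing to do here.
		return ctrl.Result{}, nil
	}
	stuckFor := time.Since(pod.DeletionTimestamp.Time)
	if stuckFor < podDeletionTimeout {
		// Not stuck long enough yet; check again once the timeout elapses.
		return ctrl.Result{RequeueAfter: podDeletionTimeout - stuckFor}, nil
	}
	// Forceful deletion: a zero grace period removes the pod object even when
	// the kubelet on its unreachable node cannot confirm the shutdown.
	if err := c.Delete(ctx, pod, client.GracePeriodSeconds(0)); err != nil && !apierrors.IsNotFound(err) {
		return ctrl.Result{}, err
	}
	return ctrl.Result{Requeue: true}, nil
}
```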