actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Yusuke Kuoka	e7020c7c0f	Fix scale-from-zero to retain the reg-only runner until other pods come up (#523 ) Fixes #516	2021-05-05 12:13:51 +09:00
Yusuke Kuoka	0e0f385f72	Experimental support for ScheduledOverrides (#515 ) This adds the initial version of ScheduledOverrides to HorizontalRunnerAutoscaler. `MinReplicas` overriding should just work. When there are two or more ScheduledOverrides, the earliest one that matches is activated. Each ScheduledOverride can be recurring or one-time. If you have two or more ScheduledOverrides, only one of them should be one-time, and the one-time override should be the earliest item in the list to make sense. Tests will be added in another commit. Logging improvements and additional observability in HRA.Status will also be added in further commits. Ref #484	2021-05-03 23:31:17 +09:00
Yusuke Kuoka	469b117a09	Foundation for ScheduledOverrides (#513 ) Adds two types `RecurrenceRule` and `Period` and one function `MatchSchedule` as the foundation for building the upcoming ScheduledOverrides feature. Ref #484	2021-05-03 22:03:49 +09:00
Thejas N	588872a316	feat: allow ephemeral runner to be optional (#498 ) - Adds `ephemeral` option to `runner.spec` ``` .... template: spec: ephemeral: false repository: mumoshu/actions-runner-controller-ci .... ``` - `ephemeral` defaults to `true` - `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag - Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`. Resolves #457	2021-05-02 19:04:14 +09:00
Christoph Brand	a18ac330bb	feature(controller): allow autoscaler to scale down to 0 (#447 )	2021-05-02 16:46:51 +09:00
Yusuke Kuoka	dbd7b486d2	feat: Support for scaling from/to zero (#465 ) This is an attempt to support scaling from/to zero. The basic idea is that we create a one-off "registration-only" runner pod when a RunnerReplicaSet is scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding them. GitHub Actions seems to immediately throw away a new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline`, would prevent GitHub Actions from failing jobs. But retaining `busy` or `idle` runners means that we need to keep runner pods running, which conflicts with our desire to scale to/from zero, hence we retain `offline` runners. In this change, I enhanced the runnerreplicaset controller to create a registration-only runner at the very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline` until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler. Related to #447	2021-05-02 16:11:36 +09:00
Rolf Ahrenberg	6b77a2a5a8	feat: Docker registry mirror (#478 ) Changes: - Switched to use `jq` in startup.sh - Enable docker registry mirror configuration which is useful when e.g. avoiding the Docker Hub rate-limiting Check #478 for how this feature is tested and supposed to be used.	2021-04-25 14:04:01 +09:00
Manuel Jurado	37c2a62fa8	Allow to configure runner volume size limit (#436 ) Enable the user to set a limit size on the volume of the runner to avoid some runner pod affecting other resources of the same cluster Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-04-18 13:56:59 +09:00
Agoney Garcia-Deniz	2e551c9d0a	Add hostAliases to the runner spec (#456 )	2021-04-17 17:04:52 +09:00
asoldino	b42b8406a2	Add dockerVolumeMounts (#439 ) Resolves #435	2021-04-06 10:10:10 +09:00
Christoph Brand	9ed245c85e	feature(controller): remove dockerd executable (#432 )	2021-04-01 08:50:48 +09:00
Yusuke Kuoka	156e2c1987	Fix MTU configuration for dockerd (#421 ) Resolves #393	2021-03-31 09:29:21 +09:00
Yusuke Kuoka	374105c1f3	Fix dindWithinRunnerContainer not to crash-loop runner pods (#419 ) Apparently #253 broke dindWithinRunnerContainer completely due to the difference in how /runner volume is set up.	2021-03-25 10:23:36 +09:00
Yusuke Kuoka	bc6e499e4f	Make logging more concise (#410 ) This makes logging more concise by changing logger names from something like `controllers.Runner` to `actions-runner-controller.runner`, following the standard `controller-runtime.controller`, and by reducing redundant logs by removing unnecessary requeues. I have also tweaked log messages so that their style is more consistent, which will also help readability. Also, runnerreplicaset-controller lacked useful logs, so I have enhanced it.	2021-03-20 07:34:25 +09:00
Yusuke Kuoka	07f822bb08	Do include Runner controller in integration test (#409 ) So that we could catch bugs in runner controller like seen in #398, #404, and #407. Ref #400	2021-03-19 16:14:15 +09:00
Hidetake Iwata	3a0332dfdc	Add metrics of RunnerDeployment and HRA (#408 ) * Add metrics of RunnerDeployment and HRA * Use kube-state-metrics-style label names	2021-03-19 16:14:02 +09:00
Yusuke Kuoka	f6ab66c55b	Do not delay min/maxReplicas propagation from HRA to RD due to caching (#406 ) As part of #282, I introduced a caching mechanism to avoid excessive GitHub API calls, because the autoscaling calculation, which involves GitHub API calls, is executed on each webhook event. Apparently, it was saving the wrong value in the cache. The cached value was the one after applying `HRA.Spec.{Max,Min}Replicas`, so manual changes to {Max,Min}Replicas did not affect RunnerDeployment.Spec.Replicas until the cache expired. This isn't what I had wanted. This patch fixes that by changing the value being cached to the one before applying {Min,Max}Replicas. Additionally, I've updated logging so that you can observe which number was fetched from the cache, what number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and what the final number used as the desired replicas was (after applying {Min,Max}Replicas). Follow-up for #282	2021-03-19 12:58:02 +09:00
Yusuke Kuoka	c424215044	Do recheck runner registration timely (#405 ) Since #392, the runner controller could have taken an unexpectedly long time to notice that the runner had been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute (hard-coded). More concretely, if you had configured a long sync-period such as 10m, the runner controller could have taken approximately 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding of how requeueing works.	2021-03-19 11:02:47 +09:00
Yusuke Kuoka	3cccca8d09	Do patch runner status instead of update to reduce conflicts and avoid future bugs Ref https://github.com/summerwind/actions-runner-controller/pull/398#issuecomment-801548375	2021-03-18 10:31:17 +09:00
Yusuke Kuoka	7a7086e7aa	Make error logs more helpful	2021-03-18 10:26:21 +09:00
Yusuke Kuoka	3f23501b8e	Reduce "No runner matching the specified labels was found" errors while runner replacement (#392 ) We occasionally encountered those errors while the underlying RunnerReplicaSet was being recreated/replaced on a RunnerDeployment.Spec.Template update. It turned out that the RunnerDeployment controller was waiting for the runner pod to become `Running`, instead of waiting for the new replacement runner to have registered to GitHub. This fixes that by setting Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered. A side-effect of this change is that the runner controller calls the "ListRunners" GitHub Actions API more often. I've reviewed and improved the runner controller code and Runner CRD to keep the number of calls to a minimum. In most cases, ListRunners should be called only twice for each runner creation.	2021-03-16 10:52:30 +09:00
Yusuke Kuoka	5530030c67	Disable metrics-based autoscaling by default when scaleUpTriggers are enabled (#391 ) Relates to https://github.com/summerwind/actions-runner-controller/pull/379#discussion_r592813661 Relates to https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793266609 When you have defined HRA.Spec.ScaleUpTriggers[] but not HRA.Spec.Metrics[], the HRA controller will now enable ScaleUpTriggers alone instead of automatically enabling TotalNumberOfQueuedAndInProgressWorkflowRuns. This allows you to use ScaleUpTriggers alone, so that the autoscaling is done without calling the GitHub API at all, which should greatly decrease the chance of GitHub API calls getting rate-limited.	2021-03-14 11:03:00 +09:00
Yusuke Kuoka	8d3a83b07a	Add CheckRun.Names scale-up trigger configuration (#390 ) This allows you to trigger autoscaling depending on check_run names (i.e. Actions job names). If you want to use a different scale amount only for a specific job, or want to scale only on a specific job, try this.	2021-03-14 10:21:42 +09:00
Brandon Kimbrough	2273b198a1	Add ability to set the MTU size of the docker in docker container (#385 ) * adding ability to set docker in docker MTU size * safeguards to only set MTU env var if it is set	2021-03-12 08:44:49 +09:00
Yusuke Kuoka	3d62e73f8c	Fix PercentageRunnersBusy scaling not working (#386 ) PercentageRunnersBusy seems to have regressed since #355 because RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with an empty selector. Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-795200205	2021-03-11 20:16:36 +09:00
Yusuke Kuoka	f5c639ae28	Make webhook-based autoscaler github event logs more operator-friendly (#384 ) Adds fields like `pullRequest.base.ref` and `checkRun.status` that are useful for verifying the autoscaling behaviour without browsing GitHub. Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-794175312	2021-03-10 09:40:44 +09:00
Yusuke Kuoka	728829be7b	Fix panic on scaling organizational runners (#381 ) Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793287133	2021-03-09 15:03:47 +09:00
Yusuke Kuoka	1b8a656051	Use --watch-namespace flag to restrict the namespace to watch Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793172995	2021-03-09 09:46:21 +09:00
Rob Whitby	1753fa3530	handle GET requests in webhook hra (#378 )	2021-03-09 08:46:27 +09:00
Yusuke Kuoka	4fa5315311	Fix possible flapping autoscale on runner update (#371 ) Addresses https://github.com/summerwind/actions-runner-controller/pull/355#discussion_r587199428	2021-03-05 10:21:20 +09:00
Hiroshi Muraoka	11e58fcc41	Manage runner with label (#355 ) * Update RunnerDeploymentSpec to have Selector field Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Update RunnerReplicaSetSpec to have Selector field Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Add CloneSelectorAndAddLabel to add Selector field Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Fix tests Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Use label to find RunnerReplicaSet/Runner Signed-off-by: binoue <banji-inoue@cybozu.co.jp> * Update controller-gen versions in CRD Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Update autoscaler to list Pods with labels Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Add debug log Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Modify RunnerDeployment tests Signed-off-by: binoue <banji-inoue@cybozu.co.jp> * Modify RunnerReplicaset test Signed-off-by: binoue <banji-inoue@cybozu.co.jp> * Modify integration test Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Use RunnerDeployment Template Labels as the default selector for backward compatibility * Fix labeling Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Update func in Eventually to return (int, error) Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Update RunnerDeployment controller not to use label selector Signed-off-by: Hiroshi Muraoka <h.muraoka714@gmail.com> * Fix potential replicaset controller breakage on replicaset created before v0.17.0 * Fix errors on existing runner replica sets * Ensure RunnerReplicaSet Spec Selector addition does not break controller * Ensure RunnerDeployment Template.Spec.Labels change does result in template hash change * Fix comment Co-authored-by: binoue <banji-inoue@cybozu.co.jp> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-03-05 10:15:39 +09:00
Yusuke Kuoka	584590e97c	Use patch instead of update to alleviate HRA conflict on webhook (#358 ) We sometimes see the integration test fail because runner replicas do not reach the expected number in a timely manner. After investigating a bit, this turned out to be because HRA updates from the webhook-based autoscaler and the HRA controller were conflicting. This changes the controllers to use Patch instead of Update to make conflicts less likely to happen. I have also updated the HRA controller to use Patch when updating RunnerDeployment. Overall, these changes should make the webhook-based autoscaling more reliable due to fewer conflicts.	2021-02-26 10:17:09 +09:00
Yusuke Kuoka	d18884a0b9	Fix HRA expired cache entries not cleaned up (#357 ) Fixes #356	2021-02-26 09:54:24 +09:00
Yusuke Kuoka	e9eef04993	Fix old HRA capacity reservations not cleaned up (#354 ) Similar to #348 for #346, but for HRA.Spec.CapacityReservations usually modified by the webhook-based autoscaler controller. This patch tries to fix that by improving the webhook-based autoscaler controller to omit expired reservations on updating HRA spec.	2021-02-25 11:08:00 +09:00
Yusuke Kuoka	598dd1d9fe	Fix incorrect DESIRED on `kubectl get hra` (#353 ) `kubectl get horizontalrunnerautoscalers.actions.summerwind.dev` shows HRA.status.desiredReplicas as the DESIRED count. However, the value had not been taking capacityReservations into account, which resulted in an incorrect count when you used the webhook-based autoscaler or the capacityReservations API directly. This fixes that.	2021-02-25 10:32:09 +09:00
Yusuke Kuoka	9890a90e69	Improve webhook-based autoscaler log (#352 ) The controller had been writing confusing messages like the below on missing scale target: ``` Found too many scale targets: It must be exactly one to avoid ambiguity. Either set WatchNamespace for the webhook-based autoscaler to let it only find HRAs in the namespace, or update Repository or Organization fields in your RunnerDeployment resources to fix the ambiguity.{"scaleTargets": ""} ``` This fixes that, while improving many kinds of messages written during reconciliation, so that the error messages are more actionable.	2021-02-25 10:07:41 +09:00
Yusuke Kuoka	9da123ae5e	Fix integration test flakiness (#351 ) Ref https://github.com/summerwind/actions-runner-controller/pull/345#issuecomment-785015406	2021-02-25 09:30:32 +09:00
Yusuke Kuoka	022007078e	Compact excessive error message on runnerreplicaset status update conflict (#350 ) We occasionally see logs like the below: ``` 2021-02-24T02:48:26.769ZERRORFailed to update runner status{"runnerreplicaset": "testns-244ol/example-runnerdeploy-j5wzf", "error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"} github.com/go-logr/zapr.(zapLogger).Error /home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128 github.com/summerwind/actions-runner-controller/controllers.(RunnerReplicaSetReconciler).Reconcile /home/runner/work/actions-runner-controller/actions-runner-controller/controllers/runnerreplicaset_controller.go:207 sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:256 sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232 sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).worker /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1 /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152 k8s.io/apimachinery/pkg/util/wait.JitterUntil /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153 k8s.io/apimachinery/pkg/util/wait.Until /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88 2021-02-24T02:48:26.769ZERRORcontroller-runtime.controllerReconciler error{"controller": "testns-244olrunnerreplicaset", "request": "testns-244ol/example-runnerdeploy-j5wzf", 
"error": "Operation cannot be fulfilled on runnerreplicasets.actions.summerwind.dev \"example-runnerdeploy-j5wzf\": the object has been modified; please apply your changes to the latest version and try again"} github.com/go-logr/zapr.(zapLogger).Error /home/runner/go/pkg/mod/github.com/go-logr/zapr@v0.1.0/zapr.go:128 sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).reconcileHandler /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:258 sigs.k8s.io/controller-runtime/pkg/internal/controller.(Controller).processNextWorkItem /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:232 sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker /home/runner/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.4.0/pkg/internal/controller/controller.go:211 k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1 /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:152 k8s.io/apimachinery/pkg/util/wait.JitterUntil /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:153 k8s.io/apimachinery/pkg/util/wait.Until /home/runner/go/pkg/mod/k8s.io/apimachinery@v0.0.0-20190913080033-27d36303b655/pkg/util/wait/wait.go:88 ``` which can be compacted into one-liner, without the useless stack trace, without double-logging the same error from the logger and the controller.	2021-02-25 09:01:02 +09:00
Johannes Nicolai	31e5e61155	Log correct runner that was deleted (#349 )	2021-02-25 08:38:55 +09:00
Yusuke Kuoka	e44e53b88e	Fix failure while saving HRA status after running controller for a while (#348 ) Fixes #346	2021-02-24 11:20:21 +09:00
Yusuke Kuoka	991535e567	Fix panic on webhook for user-owned repository (#344 ) * Fix panic on webhook for user-owned repository Follow-up for #282 and #334	2021-02-23 08:05:25 +09:00
Johannes Nicolai	2d7fbbfb68	Handle offline runners gracefully (#341 ) * if a runner pod starts up with an invalid token, it will go in an infinite retry loop, appearing as RUNNING from the outside * normally, this error situation is detected because no corresponding runner objects exists in GitHub and the pod will get removed after registration timeout * if the GitHub runner object already existed before - e.g. because a finalizer was not properly run as part of a partial Kubernetes crash, the runner will always stay in a running mode, even updating the registration token will not kill the problematic pod * introducing RunnerOffline exception that can be handled in runner controller and replicaset controller * as runners are offline when a pod is completed and marked for restart, only do additional restart checks if no restart was already decided, making code a bit cleaner and saving GitHub API calls after each job completion	2021-02-22 10:08:04 +09:00
Hidetake Iwata	b0e74bebab	Fix index key to find HRA in GitHub webhook handler	2021-02-20 21:25:23 +09:00
Hidetake Iwata	dfbe53dcca	Fix webhook payload in integration test	2021-02-20 21:08:23 +09:00
Yusuke Kuoka	ebc3970b84	Add integration test for autoscaling on check_run webhook event	2021-02-19 10:33:04 +09:00
Hidetake Iwata	1ddcf6946a	Fix nil pointer error on received check_run event (#331 ) * Reproduce nil pointer error on received check_run event * Fix nil pointer error on received check_run event	2021-02-18 20:22:36 +09:00
Yusuke Kuoka	67f6de010b	feat: Common runner labels configurable per controller (#327 ) * feat: Common runner labels configurable per controller Ref #321	2021-02-18 20:19:08 +09:00
Yusuke Kuoka	2fdf35ac9d	Refactor integration test to use helpers (#320 ) This should make the test code a bit more DRY and readable.	2021-02-17 10:23:35 +09:00
Yusuke Kuoka	eb2eaf8130	Fix TotalNumberOfQueuedAndInProgressWorkflowRuns to work with a lot of remaining `completed` jobs (#316 ) I have heard from a user that they have hundreds of thousands of `status=completed` workflow runs in their repository, which effectively blocked TotalNumberOfQueuedAndInProgressWorkflowRuns from working because the excessive paginated requests hit the GitHub API rate limit. This fixes that by splitting the list-workflow-runs call into two, one for `queued` and one for `in_progress`. This raises the minimum number of API calls from 1 to 2, but allows it to work regardless of the number of remaining `completed` workflow runs.	2021-02-16 18:55:55 +09:00
Yusuke Kuoka	7d024a6c05	Fix "duplicate metrics collector registration attempted" errors at startup (#317 ) I have seen this error a lot in our integration test. It turned out to be due to https://github.com/kubernetes-sigs/controller-runtime/issues/484 and is fixed with this change.	2021-02-16 18:51:33 +09:00
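
The caching fix in f6ab66c55b (#406) comes down to an ordering rule: cache the suggested replica count before `{Min,Max}Replicas` is applied, and apply the bounds on every read. The following is a minimal Python sketch of that ordering only. The function names, the dict-based cache, and the `suggest` callable are hypothetical and do not mirror the controller's actual Go code; cache expiry is left out for brevity.

```python
def clamp(value, min_replicas, max_replicas):
    """Bound value to the inclusive range [min_replicas, max_replicas]."""
    return max(min_replicas, min(value, max_replicas))


def desired_replicas(cache, key, suggest, min_replicas, max_replicas):
    """Return the desired replica count for key.

    cache:   dict holding the raw (unclamped) suggestion per key
    suggest: hypothetical callable standing in for the calculation
             that would call the GitHub API
    """
    if key not in cache:
        # Store the raw suggestion, before min/max is applied.
        cache[key] = suggest()
    # Bounds are applied on every call, so editing them is never
    # hidden behind a cached value.
    return clamp(cache[key], min_replicas, max_replicas)


cache = {}
calls = []


def suggest():
    # Record each invocation so the number of "API calls" is visible.
    calls.append(1)
    return 7


# First call computes and caches the raw suggestion, then clamps to [1, 3].
first = desired_replicas(cache, "example-hra", suggest, 1, 3)
# The upper bound is raised to 10; the cached raw value is reused.
second = desired_replicas(cache, "example-hra", suggest, 1, 10)
```

Had the clamped value been cached instead, the second call would keep returning the stale bound until the entry expired, which is the behavior the commit message describes as the bug.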


130 Commits