actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Yusuke Kuoka	0ef9a22cd4	Fix confusing PV controller log (#1526 ) Ref #1511	2022-06-14 08:35:04 +09:00
Yusuke Kuoka	9dd26168d6	Fix confusing logs from pv and pvc controllers (#1487 ) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1321#issuecomment-1137431212	2022-05-26 18:21:23 +09:00
Yusuke Kuoka	c7eea169ad	test: Add test for runner with generic ephemeral volume as "work" (#1472 ) This adds the test to verify the runner pod generation logic for the case that you use a generic ephemeral volume as "work". It is almost an adaptation of the test cases writetn for RunnerSet in #1471, to RunnerDeployment and Runner.	2022-05-25 10:37:23 +09:00
Yusuke Kuoka	63be0223ad	fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" (#1471 ) * fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error. This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears. Ref #1464 * e2e: Ensure work generic ephemeral volume to work as expected	2022-05-22 10:25:50 +09:00
Yusuke Kuoka	3e988afc09	test: add fuzzing to the test suite (#1463 ) As a part of #1298, we add fuzzing based on Go test's fuzzing support to the test suite	2022-05-19 13:34:23 +01:00
Felipe Galindo Sanchez	78c01fd31d	cleanup some unused code and minor refactors (#1274 ) Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-05-16 18:38:32 +09:00
Callum Tait	65f7ee92a6	refactor: remove registration runner dead code (#1260 ) We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 #1207 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-05-16 11:23:39 +09:00
Yusuke Kuoka	b5194fd75a	Enhance RunnerSet to optionally retain PVs accross restarts (#1340 ) * Enhance RunnerSet to optionally retain PVs accross restarts This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet. The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`. It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet. Once the controller terminated statefulsets, their corresponding PVCs are clean up by `runnerpersistentvolumeclaim-controller`, then PVs are unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs createf for future StatefulSets that shares the same same StorageClass. Ref #1286 * Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs Ref #1286	2022-05-16 09:26:48 +09:00
Yusuke Kuoka	759349de11	fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395 ) Ref #1369	2022-05-14 09:07:17 +01:00
Yusuke Kuoka	e46b90f758	fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420 ) This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never". Ref #1369	2022-05-12 09:34:27 +01:00
Yusuke Kuoka	3a7e8c844b	feat: Support arbitrarily setting `privileged: true` for runner container (#1383 ) Resolves #1282	2022-05-12 09:25:51 +01:00
Yusuke Kuoka	dabbc99c78	refactor(controller): stop auto-setting RUNNER_FEATURE_FLAG_EPHEMERAL (#1385 ) This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that. Ref #1196	2022-05-11 11:42:55 +01:00
Yusuke Kuoka	4053ab3e11	Fix label support for TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1390 ) In #1373 we made two mistakes: - We mistakenly checked if all the runner labels are included in the job labels and only after that it marked the target as eligible for scale. It should definitely be the opposite! - We mistakenly checked for the existence of `self-hosted` labe l in the job. [Although it should be a good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement so we should code accordingly. The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling runner with `self-hosted` + `custom` + `custom2`. This should fix that. Ref #1056 Ref #1373	2022-04-27 16:24:21 +01:00
Yusuke Kuoka	1c726ae20c	chore: Add unit test for RunnerReconciler.newPod (#1382 ) Adds some unit tests for the runner pod generation logic that is used internally by runner controller as preparation for #1282	2022-04-25 09:59:21 +09:00
Yusuke Kuoka	d6cdd5964c	chore: Add unit test for newRunnerPod (#1381 ) Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282	2022-04-25 08:52:58 +09:00
Yusuke Kuoka	a622968ff2	feat: Add label support to TotalNumberOfQueuedAndInProgressWorkflowRuns metric (#1373 ) This is an implementation for my intepretation of the "bronze" case proposed in #1056 Ref #1056	2022-04-24 14:41:34 +09:00
Yusuke Kuoka	1551f3b5fc	Remove the default `githubEvent: {}` requiring a event to be defined (#1379 ) Ref #1358	2022-04-24 13:37:26 +09:00
Yusuke Kuoka	3ba7179995	Do not enable TotalNumberOfQueuedAndInProgressWorkflowRuns by default (#1372 ) Previously, omitting hra.spec.metrics at all resulted in ARC enabling the TotalNumberOfQueuedAndInProgressWorkflowRuns. That turned out not a good idea so since this change it is not enabled by default. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/728	2022-04-24 13:36:42 +09:00
Callum Tait	24aae58dbc	feat: default scale down flag (#963 ) Resolves #899 Co-authored-by: Callum <callum@domain.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-04-20 11:09:09 +09:00
Jeff Billimek	13bfa2da4e	Fix runner pod dnsConfig (#1227 ) Fixes #1226 Fixes #1224 Signed-off-by: Jeff Billimek <jeff@billimek.com>	2022-04-20 10:55:20 +09:00
Patrick Ellis	7a5a6381c3	Add WorkflowJob to GitHubEventScaleUpTriggerSpec types (#922 )	2022-04-20 09:59:08 +09:00
Siyuan Zhang	a37b4dfbe3	Fix scale down condition to exclude skipped (#1330 ) * Fix scale down condition to exclude skipped * Use fallthrough and break to let default handle the skipped case Fixes #1326	2022-04-13 08:53:07 +09:00
Yusuke Kuoka	b09c54045a	Prevent runners from stuck in Terminating when pod disappeared without standard termination process (#1318 ) This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway. Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions. Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307 Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273	2022-04-08 10:17:33 +09:00
Felipe Galindo Sanchez	e7e48a77e4	Merge remote-tracking branch 'upstream/master' into patch-4	2022-04-04 09:54:08 -07:00
Yusuke Kuoka	631a70a35f	Fix runner pod to be cleaned up earlier regardless of the sync period (#1299 ) Ref #1291	2022-04-03 11:12:44 +09:00
Rolf Ahrenberg	1b327a0721	refactor: use const envvars (#1251 )	2022-03-27 12:14:56 +01:00
Endre Karlson	ee7484ac91	Use container name to detect runner container in Pod	2022-03-23 12:39:58 +01:00
Felipe Galindo Sanchez	9657d3e5b3	Fix deleting a runner when pod was deleted With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource	2022-03-22 14:44:50 -07:00
Yusuke Kuoka	4551309e30	Fix runners to not terminate before unregistration when scaling down #1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.	2022-03-13 13:09:46 +00:00
Yusuke Kuoka	7123b18a47	chore: Log more variables when log level is -2	2022-03-13 13:04:28 +00:00
Yusuke Kuoka	cc55d0bd7d	Let runnerdeployment controller log runnerreplicaset creation	2022-03-13 12:25:53 +00:00
Yusuke Kuoka	c612e87d85	fix: Let RunnerDeployment scale RunnerReplicaSet to zero before terminating it so that hopefully RunnerDeployment can gracefully termiante older RunnerReplicaSet on update.	2022-03-13 12:18:22 +00:00
Yusuke Kuoka	326d6a1fe8	Fix the timing of `Marking owner for unregistration completion` log	2022-03-13 12:16:55 +00:00
Yusuke Kuoka	fa8ff70aa2	Add log when deletion timestamp is being set on owner object	2022-03-13 12:16:29 +00:00
Yusuke Kuoka	efb7fca308	Fix externally deleted runner pod to not block unregistration process	2022-03-13 12:15:49 +00:00
Yusuke Kuoka	e4280dcb0d	Fix patch MergeFrom target	2022-03-13 12:14:14 +00:00
Yusuke Kuoka	f153870f5f	fix: Do not block indefinitely on runner that cannot be deleted due to 403	2022-03-13 12:12:01 +00:00
Yusuke Kuoka	8ca39caff5	Fix log message on runner deletion	2022-03-13 12:11:11 +00:00
Yusuke Kuoka	791634fb12	Fix static runners not scaling up It turned out that #1179 broke static runners in a way it is no longer able to scale up at all when the desired replicas is updated. This fixes that by correcting a certain short-circuit that is intended only for ephemeral runners to not mistakenly triggered for static runners.	2022-03-13 07:26:43 +00:00
Yusuke Kuoka	c4b24f8366	Prevent static runners from terminating due to unregister timeout The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job. We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.	2022-03-13 07:26:36 +00:00
Yusuke Kuoka	adc889ce8a	Fix RunnerDeployment to be able to finish rollout (#1213 ) I found that #1179 was unable to finish rollout of an RunnerDeployment update(like runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.	2022-03-13 10:10:24 +09:00
Yusuke Kuoka	fa287c4395	Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice Since #1179, runner pods managed by RunnerDeployment had two duplicate environment variables for RUNNER_NAME and RUNNER_TOKEN. This fixes that.	2022-03-12 13:49:50 +00:00
Yusuke Kuoka	051089733b	Use --ephemeral by default Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189	2022-03-12 13:20:07 +00:00
Yusuke Kuoka	9628bb2937	Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204 ) Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy. There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job. This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter. The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271	2022-03-11 19:03:17 +09:00
Yusuke Kuoka	55ff4de79a	Remove legacy GitHub API cache of HRA.Status.CachedEntries (#1192 ) * Remove legacy GitHub API cache of HRA.Status.CachedEntries We migrated to the transport-level cache introduced in #1127 so not only this is useless, it is harder to deduce which cache resulted in the desired replicas number calculated by HRA. Just remove the legacy cache to keep it simple and easy to understand. * Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well * Fix integration test	2022-03-08 19:05:43 +09:00
Yusuke Kuoka	15ee6d6360	chore: Reorganize "Calculated desired replicas log fields (#1190 ) So that `max` is emitted immediately after `min`, which is the counterpart of it.	2022-03-08 10:29:53 +09:00
Yusuke Kuoka	cbbc383a80	Auto-correct replicas number on missing webhook_job completion event (#1180 ) While testing #1179, I discovered that ARC sometimes stop resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods. This seems to happen when ARC missed receiving a workflow_job completion event but it has no way to decide if it is either (1) something went wrong on ARC or (2) a loadbalancer in the middle or GitHub or anything not ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it. In this change, I added a hard-coded 10 minutes timeout(can be made customizable later) to prevent runner pod recreation. Now, a RunnerReplicaSet/RunnerSet to restart runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrived after the timeout, it will decrease the desired replicas number that results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.	2022-03-07 09:35:13 +09:00
Yusuke Kuoka	14a878bfae	refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet	2022-03-06 05:53:26 +00:00
Yusuke Kuoka	c95e84a528	refactor: Extract runner pod owner management out of runnerset controller so that it can potentially be reusable from runnerreplicaset controller	2022-03-05 12:18:02 +00:00
Yusuke Kuoka	95a5770d55	Fix regression that registration-timeout check was not working for runnerset (#1178 ) Follow-up for #1167	2022-03-05 19:31:05 +09:00
Yusuke Kuoka	9cc9f8c182	chore: Add a few comments to runnerset and runnerpod controllers to help potential contributors	2022-03-05 05:41:56 +00:00
Yusuke Kuoka	138e326705	chore: Add comment on lastSyncTime in runnerset controller	2022-03-05 05:41:56 +00:00
Yusuke Kuoka	5f2b5327f7	integration: Reduce error logs to ease debugging	2022-03-03 18:47:54 +09:00
Felipe Galindo Sanchez	d20ad71071	Fix minor log in runner controller (#1175 ) Log is mentioning registration only but this is about the standard runner pod	2022-03-03 09:51:30 +09:00
Felipe Galindo Sanchez	27563c4378	Remove unused function (#1173 )	2022-03-03 09:02:47 +09:00
Felipe Galindo Sanchez	4a0f68bfe3	Cleanup extra block in runner controller (#1174 )	2022-03-03 09:01:34 +09:00
Yusuke Kuoka	1917cf90c4	chore: Tweak runner-id annotation name and the annotation prefix to be more consistent	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	0ba3cad6c2	fix: Prefix runner pod related annotation keys with `actions/` to make them distinguishable from other annotations	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	7f0e65cb73	refactor: Extract definitions of various annotation keys and other defaults to their own source	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	12a04b7f38	Fix typo in comment	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	a3072c110d	Prevent runnerset pod unregistration until it gets runner ID This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	15b402bb32	Make RunnerSet much more reliable with or without webhook	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	11be6c1fb6	Prevent runner pod deletion delay when pod disappeared before unregistration	2022-03-02 19:03:20 +09:00
Felipe Galindo Sanchez	eff0c7364f	Merge branch 'master' into improve-logs	2022-02-28 09:25:30 -08:00
Felipe Galindo Sanchez	3abecd0f19	logging: improve logs for scaling	2022-02-23 08:29:13 -08:00
Yusuke Kuoka	5bc16f2619	Enhance HRA capacity reservation update log	2022-02-21 00:06:26 +00:00
Yusuke Kuoka	b8e65aa857	Prevent unnecessary ephemeral runner recreations	2022-02-20 13:45:42 +00:00
Yusuke Kuoka	a6f0e0008f	Make unregistration timeout and retry delay configurable in integration tests	2022-02-20 12:05:34 +00:00
Yusuke Kuoka	79a31328a5	Stop recreating ephemeral runner pod Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384	2022-02-20 04:42:19 +00:00
Yusuke Kuoka	3c16188371	Introduce consistent timeouts for runner unregistration and runner pod deletion Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion, so that we are very much unlikely to terminate pods that are running any jobs.	2022-02-20 04:36:35 +00:00
Tingluo Huang	0b9bef2c08	Try to unconfig runner before deleting the pod to recreate (#1125 ) There is a race condition between ARC and GitHub service about deleting runner pod. - The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod. - A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod. - The ARC delete the runner pod which cause the in-progress job to end up canceled. To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod. - `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod. - `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is. TODO: I need to do some E2E tests to force the race condition to happen. Ref #911	2022-02-19 21:22:31 +09:00
Yusuke Kuoka	a5ed6bd263	Fix RunerSet managed runner pods to terminate more gracefully (#1126 ) Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460	2022-02-19 21:19:37 +09:00
Yusuke Kuoka	921f547200	fix: Do recreate runner pod on registration token update (#1087 ) Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated. This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365	2022-02-19 21:18:00 +09:00
Yusuke Kuoka	fcf4778bac	Fix regression that prevented default organizational runner group from being scale target Fixes #1131	2022-02-19 14:43:41 +09:00
Yusuke Kuoka	e22d981d58	githubwebhookserver: Tweak log levels of various messages (#1123 ) Some of logs like `HRA keys indexed for HRA` were so excessive that it made testing and debugging the githubwebhookserver harder. This tries to fix that.	2022-02-17 09:15:26 +09:00
Felipe Galindo Sanchez	d0d316252e	Option to consider runner group visibility on scale based on webhook (#1062 ) This will work on GHES but GitHub Enterprise Cloud due to excessive GitHub API calls required. More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud. Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012 If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories `getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow: 1. Search for repository HRAs - if so it ends here 2. Get available HRAs in k8s 3. Compute visible runner groups a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls. b. If GitHub auth is not provided - assume all runner groups are visible to all repositories 4. Search for default organization runners (a.k.a runners from organization's visible default runner group) with matching labels 5. Search for default enterprise runners (a.k.a runners from enterprise's visible default runner group) with matching labels 6. Search for custom organization runner groups with matching labels 7. Search for custom enterprise runner groups with matching labels Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-02-16 19:08:56 +09:00
Daniel	8a73560dbc	if a Volume is defined by the operator don't add another "work" volume. (#1015 ) This allows providing a different `work` Volume. This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage. This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME. ``` apiVersion: actions.summerwind.dev/v1alpha1 kind: RunnerDeployment metadata: name: runner-2 spec: template: spec: dockerdContainerResources: {} env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name # this is to mount the docker in docker onto NVME disk dockerVolumeMounts: - mountPath: /var/lib/docker name: scratch subPathExpr: $(POD_NAME)-docker - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work volumeMounts: - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work dockerEnv: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name volumes: - hostPath: path: /mnt/disks/ssd0 name: scratch - hostPath: path: /mnt/disks/ssd0 name: work nodeSelector: cloud.google.com/gke-nodepool: runner-16-with-nvme ephemeral: false image: "" imagePullPolicy: Always labels: - runner-2 - self-hosted organization: yourorganization ```	2022-01-07 10:01:40 +09:00
Yusuke Kuoka	01301d3ce8	Stop creating registration-only runners on scale-to-zero (#1028 ) Resolves #859	2022-01-07 09:56:21 +09:00
Hyeonmin Park	1a6e5719c3	test: Add tests with self-hosted label for #953 (#1030 )	2022-01-07 08:50:26 +09:00
Callum Tait	ad48851dc9	feat: expose if docker is enabled and wait for docker to be ready (#962 ) Resolves #897 Resolves #915	2021-12-29 10:23:35 +09:00
Lars Haugan	c5950d75fa	fix: pagination for ListWorkflowJobs in autoscaler (#990 ) (#992 ) Adding handling of paginated results when calling `ListWorkflowJobs`. By default the `per_page` is 30, which potentially would return 30 queued and 30 in_progress jobs. This change should enable the autoscaler to scale workflows with more than 60 jobs to the exact number of runners needed. Problem: I did not find any support for pagination in the Github fake client, and have not been able to test this (as I have not been able to push an image to an environment where I can verify this). If anyone is able to help out verifying this PR, i would really appreciate it. Resolves #990	2021-12-24 09:12:36 +09:00
Felipe Galindo Sanchez	608c56936e	Remove duplicate self-hosted condition (#1016 ) Duplicate condition caused after merge of #953 and #1012	2021-12-21 09:08:21 +09:00
Felipe Galindo Sanchez	4ebec38208	Support runner groups with selected visibility in webhooks autoscaler (#1012 ) The current implementation doesn't support yet runner groups with custom visibility (e.g selected repositories only). If there are multiple runner groups with selected visibility - not all runner groups may be a potential target to be scaled up. Thus this PR introduces support to allow having runner groups with selected visibility. This requires to query GitHub API to find what are the potential runner groups that are linked to a specific repository (whether using visibility all or selected). This also improves resolving the `scaleTargetKey` that are used to match an HRA based on the inputs of the `RunnerSet`/`RunnerDeployment` spec to better support for runner groups. This requires to configure github auth in the webhook server, to keep backwards compatibility if github auth is not provided to the webhook server, this will assume all runner groups have no selected visibility and it will target any available runner group as before	2021-12-19 18:29:44 +09:00
clement-loiselet-talend	0c34196d87	fix(#951 ): add exception for self-hosted label in webhook search (#953 ) The webhook "workflowJob" pass the labels the job needs to the controller, who in turns search for them in its RunnerDeployment / RunnerSet. The current implementation ignore the search for `self-hosted` if this is the only label, however if multiple labels are found the `self-hosted` label must be declared explicitely or the RD / RS will not be selected for the autoscaling. This PR fixes the behavior by ignoring this label, and add documentation on this webhook for the other labels that will still require an explicit declaration (OS and architecture). The exception should be temporary, ideally the labels implicitely created (self-hosted, OS, architecture) should be searchable alongside the explicitly declared labels. code tested, work with `["self-hosted"]` and `["self-hosted","anotherLabel"]` Fixes #951	2021-12-19 10:55:23 +09:00
renovate[bot]	c64000e11c	fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 (#740 ) * fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 * Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade * Regenerate manifests with the latest K8s dependencies Co-authored-by: Renovate Bot <bot@renovateapp.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-17 09:06:55 +09:00
Felipe Galindo Sanchez	9bb21aef1f	Add support for default image pull secret name (#921 ) Resolves #896 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-15 09:29:31 +09:00
Pavel Smalenski	91102c8088	Add dockerEnv variable for RunnerDeployment (#912 ) Resolves #878 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-14 17:13:24 +09:00
Felipe Galindo Sanchez	f0fccc020b	refactor: split Reconciler from Reconcile in a few methods (#926 ) Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-12 14:22:55 +09:00
Patrick Ellis	ea2dbc2807	Update go-github from v37 -> v39 (#925 )	2021-12-11 21:43:40 +09:00
Yusuke Kuoka	898ad3c355	Work-around for offline+busy runners (#993 ) Ref #911	2021-12-09 09:31:06 +09:00
Max N. Boyarov	88b8871830	Reduce number of http superfluous messages (#894 ) write to http.ResponseWriter create HTTP OK response, so set ok to disable error code in defered function	2021-11-09 09:07:07 +09:00
Yusuke Kuoka	2191617eb5	Remove unnecessary scale-target-not-found error on in_progress workflow_job event (#927 ) Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/877#issuecomment-955614456	2021-11-09 09:05:50 +09:00
Yusuke Kuoka	b305e38b17	Add webhook-based autoscale for Enterprise runners (#906 ) Fixes #892	2021-11-09 09:04:19 +09:00
apr-1985	0d3de9ee2a	chore: correct logging typo (#904 )	2021-10-22 09:03:23 +09:00
Maxim Pogozhiy	fce7d6d2a7	Add topologySpreadConstraints (#814 )	2021-10-17 21:49:44 +01:00
Callum Tait	5805e39e1f	Revert "feat: adding workflow_dispatch webhook event" (#879 ) This reverts commit `d36d47fe66`.	2021-10-09 18:36:02 +01:00
Callum	d36d47fe66	feat: adding workflow_dispatch webhook event	2021-10-09 10:07:07 +01:00
Aidan	fccf29970b	Fix bug related to label matching. (#852 ) * Fix bug related to label matching. Add start of test framework for Workflow Job Events Signed-off-by: Aidan Jensen <aidan@artificial.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-09-30 11:02:59 +09:00
Alex Kulikovskikh	ea06001819	fix: scaling issue based on `workflow_job` event (#850 ) This PR fix scaling issue based on `workflow_job` event discussed in #819	2021-09-30 10:36:59 +09:00
Rob Bos	3f331e9a39	Fixing capitalization and a typo (#838 ) * Fixing capitalization and a typo * typo * Typo * Update controllers/autoscaling.go * Update controllers/autoscaling.go Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-09-26 14:34:55 +09:00
Tristan Keen	5e3f89bdc5	Correct test to append docker container (#837 ) Fixes #835	2021-09-24 09:18:20 +09:00
Tristan Keen	1eb135cace	Correct default image logic	2021-09-14 17:00:57 +09:00
Tarasovych	7008b0c257	feat: Organization RunnerDeployment with webhook-based autoscaling only for certain repositories (#766 ) Resolves #765 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-08-31 09:46:36 +09:00
Yusuke Kuoka	fabead8c8e	feat: Workflow job based ephemeral runner scaling (#721 ) This add support for two upcoming enhancements on the GitHub side of self-hosted runners, ephemeral runners, and `workflow_jow` events. You can't use these yet. These features are not yet generally available to all GitHub users. Please take this pull request as a preparation to make it available to actions-runner-controller users as soon as possible after GitHub released the necessary features on their end. Ephemeral runners: The former, ephemeral runners, is basically the reliable alternative to `--once`, which we've been using when you enabled `ephemeral: true` (default in actions-runner-controller). `--once` has been suffering from a race issue #466. `--ephemeral` fixes that. To enable ephemeral runners with `actions/runner`, you give `--ephemeral` to `config.sh`. This updated version of `actions-runner-controller` does it for you, by using `--ephemeral` instead of `--once` when you set `RUNNER_FEATURE_FLAG_EPHEMERAL=true`. Please read the section `Ephemeral Runners` in the updated version of our README for more information. Note that ephemeral runners is not released on GitHub yet. And `RUNNER_FEATURE_FLAG_EPHEMERAL=true` won't work at all until the feature gets released on GitHub. Stay tuned for an announcement from GitHub! `workflow_job` events: `workflow_job` is the additional webhook event that corresponds to each GitHub Actions workflow job run. It provides `actions-runner-controller` a solid foundation to improve our webhook-based autoscale. Formerly, we've been exploiting webhook events like `check_run` for autoscaling. However, as none of our supported events has included `labels`, you had to configure an HRA to only match relevant `check_run` events. It wasn't trivial. In contrast, a `workflow_job` event payload contains `labels` of runners requested. `actions-runner-controller` is able to automatically decide which HRA to scale by filtering the corresponding RunnerDeployment by `labels` included in the webhook payload. So all you need to use webhook-based autoscale will be to enable `workflow_job` on GitHub and expose actions-runner-controller's webhook server to the internet. Note that the current implementation of `workflow_job` support works in two ways, increment, and decrement. An increment happens when the webhook server receives` workflow_job` of `queued` status. A decrement happens when it receives `workflow_job` of `completed` status. The latter is used to make scaling-down faster so that you waste money less than before. You still don't suffer from flapping, as a scale-down is still subject to `scaleDownDelaySecondsAfterScaleOut `. Please read the section `Example 3: Scale on each `workflow_job` event` in the updated version of our README for more information on its usage.	2021-08-11 09:52:04 +09:00
Rolf Ahrenberg	14564c7b8e	Allow disabling /runner emptydir mounts and setting storage volume (#674 ) * Allow disabling /runner emptydir mounts * Support defining storage medium for emptydirs * Fix typos	2021-07-15 06:29:58 +09:00
Sebastien Le Digabel	7f2795b5d6	Adding a default docker registry mirror (#689 ) * Adding a default docker registry mirror This change allows the controller to start with a specified default docker registry mirror and avoid having to specify it in all the runner* objects. The change is backward compatible, if a runner has a docker registry mirror specified, it will supersede the default one.	2021-07-15 06:20:08 +09:00
Yusuke Kuoka	f858e2e432	Add POC of GitHub Webhook Delivery Forwarder (#682 ) * Add POC of GitHub Webhook Delivery Forwarder * multi-forwarder and ctrl-c existing and fix for non-woring http post * Rename source files * Extract signal handling into a dedicated source file * Faster ctrl-c handling * Enable automatic creation of repo hook on startup * Add support for forwarding org hook deliveries * Set hook secret on hook creation via envvar (HOOK_SECRET) * Fix org hook support * Fix HOOK_SECRET for consistency * Refactor to prepare for custom log position provider * Refactor to extract inmemory log position provider * Add configmap-based log position provider * Rename githubwebhookdeliveryforwarder to hookdeliveryforwarder * Refactor to rename LogPositionProvider to Checkpointer and extract ConfigMap checkpointer into a dedicated pkg * Refactor to extract logger initialization * Add hookdeliveryforwarder README and bump go-github to unreleased ver	2021-07-14 10:18:55 +09:00
Yusuke Kuoka	6f130c2db5	Fix dockerdWithinRunnerContainer for Runner(Deployment) not working in the main branch (#696 ) Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/674#issuecomment-878600993	2021-07-13 18:14:15 +09:00
Yusuke Kuoka	f19e7ea8a8	chore: Upgrade go-github to v36 (#681 )	2021-07-04 17:43:52 +09:00
Yusuke Kuoka	acb906164b	RunnerSet: Automatic-recovery from registration timeout and deregistration on pod termination (#652 ) Ref #629 Ref #613 Ref #612	2021-06-24 20:39:37 +09:00
Yusuke Kuoka	98da4c2adb	Add HRA support for RunnerSet (#647 ) `HRA.Spec.ScaleTargetRef.Kind` is added to denote that the scale-target is a RunnerSet. It defaults to `RunnerDeployment` for backward compatibility. ``` apiVersion: actions.summerwind.dev/v1alpha1 kind: HorizontalRunnerAutoscaler metadata: name: myhra spec: scaleTargetRef: kind: RunnerSet name: myrunnerset ``` Ref #629 Ref #613 Ref #612	2021-06-23 20:25:03 +09:00
Yusuke Kuoka	8b90b0f0e3	Clean up import list (#645 ) Resolves #644	2021-06-22 17:55:06 +09:00
Jonathan Gonzalez V	a277489003	Added support to enable and disable enableServiceLinks. (#628 ) This option expose internally some `KUBERNETES_*` environment variables that doesn't allow the runner to use KinD (Kubernetes in Docker) since it will try to connect to the Kubernetes cluster where the runner it's running. This option it's set by default to `true` in any Kubernetes deployment. Signed-off-by: Jonathan Gonzalez V <jonathan.gonzalez@enterprisedb.com>	2021-06-22 17:27:26 +09:00
Yusuke Kuoka	9e4dbf497c	feat: RunnerSet backed by StatefulSet (#629 ) * feat: RunnerSet backed by StatefulSet Unlike a runner deployment, a runner set can manage a set of stateful runners by combining a statefulset and an admission webhook that mutates statefulset-managed pods with required envvars and registration tokens. Resolves #613 Ref #612 * Upgrade controller-runtime to 0.9.0 * Bump Go to 1.16.x following controller-runtime 0.9.0 * Upgrade kubebuilder to 2.3.2 for updated etcd and apiserver following local setup * Fix startup failure due to missing LeaderElectionID * Fix the issue that any pods become unable to start once actions-runner-controller got failed after the mutating webhook has been registered * Allow force-updating statefulset * Fix runner container missing work and certs-client volume mounts and DOCKER_HOST and DOCKER_TLS_VERIFY envvars when dockerdWithinRunner=false * Fix runnerset-controller not applying statefulset.spec.template.spec changes when there were no changes in runnerset spec * Enable running acceptance tests against arbitrary kind cluster * RunnerSet supports non-ephemeral runners only today * fix: docker-build from root Makefile on intel mac * fix: arch check fixes for mac and ARM * ci: aligning test data format and patching checks * fix: removing namespace in test data * chore: adding more ignores * chore: removing leading space in shebang * Re-add metrics to org hra testdata * Bump cert-manager to v1.1.1 and fix deploy.sh Co-authored-by: toast-gear <15716903+toast-gear@users.noreply.github.com> Co-authored-by: Callum James Tait <callum.tait@photobox.com>	2021-06-22 17:10:09 +09:00
Jonah Back	8c42f99d0b	feat: avoid setting privileged flag if seLinuxOptions is not null (#599 ) Sets the privileged flag to false if SELinuxOptions are present/defined. This is needed because containerd treats SELinux and Privileged controls as mutually exclusive. Also see https://github.com/containerd/cri/blob/aa2d5a97c/pkg/server/container_create.go#L164. This allows users who use SELinux for managing privileged processes to use GH Actions - otherwise, based on the SELinux policy, the Docker in Docker container might not be privileged enough. Signed-off-by: Jonah Back <jonah@jonahback.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-06-04 08:59:11 +09:00
Ameer Ghani	7523ea44f1	feat: allow specifying runtime class in runner spec (#580 ) This allows using the `runtimeClassName` directive in the runner's spec. One of the use-cases for this is Kata Containers, which use `runtimeClassName` in a pod spec as an indicator that the pod should run inside a Kata container. This allows us a greater degree of pod isolation.	2021-06-04 08:56:43 +09:00
Yusuke Kuoka	cb14d7530b	Add HRA printer column "SCHEDULE" (#561 ) Adds a column to help the operator see if they configured HRA.Spec.ScheduledOverrides correctly, in a form of "next override schedule recognized by the controller": ``` $ k get horizontalrunnerautoscaler NAME MIN MAX DESIRED SCHEDULE actions-runner-aos-autoscaler 0 5 0 org 0 5 0 min=0 time=2021-05-21 15:00:00 +0000 UTC ``` Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/484	2021-05-22 08:29:53 +09:00
Yusuke Kuoka	0b88b246d3	Fix additionalPrinterColumns (#556 ) This fixes human-readable output of `kubectl get` on `runnerdeployment`, `runnerreplicaset`, and `runner`. Most notably, CURRENT and READY of runner replicasets are now computed and printed correctly. Runner deployments now have UP-TO-DATE and AVAILABLE instead of READY so that it is consistent with columns of K8s deployments. A few fixes has been also made to runner deployment and runner replicaset controllers so that those numbers stored in Status objects are reliably updated and in-sync with actual values. Finally, `AGE` columns are added to runnerdeployment, runnerreplicaset, runnner to make that more visible to users. `kubectl get` outputs should now look like the below examples: ``` # Immediately after runnerdeployment updated/created $ k get runnerdeployment NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE example-runnerdeploy 0 0 0 0 8d org-runnerdeploy 5 5 5 0 8d # A few dozens of seconds after update/create all the runners are registered that "available" numbers increase $ k get runnerdeployment NAME DESIRED CURRENT UP-TO-DATE AVAILABLE AGE example-runnerdeploy 0 0 0 0 8d org-runnerdeploy 5 5 5 5 8d ``` ``` $ k get runnerreplicaset NAME DESIRED CURRENT READY AGE example-runnerdeploy-wnpf6 0 0 0 61m org-runnerdeploy-fsnmr 2 2 0 8m41s ``` ``` $ k get runner NAME ENTERPRISE ORGANIZATION REPOSITORY LABELS STATUS AGE example-runnerdeploy-wnpf6-registration-only actions-runner-controller/mumoshu-actions-test Running 61m org-runnerdeploy-fsnmr-n8kkx actions-runner-controller ["mylabel 1","mylabel 2"] 21s org-runnerdeploy-fsnmr-sq6m8 actions-runner-controller ["mylabel 1","mylabel 2"] 21s ``` Fixes #490	2021-05-21 09:10:47 +09:00
Yusuke Kuoka	3cd124dce3	chore: Add debug logs for scheduledOverrides (#540 ) Follow-up for #515 Ref #484	2021-05-11 17:30:22 +09:00
Yusuke Kuoka	25f5817a5e	Improve debug log in webhook-based autoscaling Adds some helpful debug log messages I have used while verifying #534	2021-05-11 15:49:03 +09:00
Yusuke Kuoka	4e7b8b57c0	edge: Enable scaling from zero with PercentageRunnersBusy (#524 ) `PercentageRunnersBusy`, in combination with a secondary `TotalInProgressAndQueuedWorkflowRuns` metric, enables scale-from-zero for PercentageRunnersBusy. Please see the new `Autoscaling to/from 0` section in the updated documentation about how it works. Resolves #522	2021-05-05 14:27:17 +09:00
Yusuke Kuoka	e7020c7c0f	Fix scale-from-zero to retain the reg-only runner until other pods come up (#523 ) Fixes #516	2021-05-05 12:13:51 +09:00
Yusuke Kuoka	0e0f385f72	Experimental support for ScheduledOverrides (#515 ) This adds the initial version of ScheduledOverrides to HorizontalRunnerAutoscaler. `MinReplicas` overriding should just work. When there are two or more ScheduledOverrides, the earliest one that matched is activated. Each ScheduledOverride can be recurring or one-time. If you have two or more ScheduledOverrides, only one of them should be one-time. And the one-time override should be the earliest item in the list to make sense. Tests will be added in another commit. Logging improvements and additional observability in HRA.Status will also be added in yet another commits. Ref #484	2021-05-03 23:31:17 +09:00
Yusuke Kuoka	469b117a09	Foundation for ScheduledOverrides (#513 ) Adds two types `RecurrenceRule` and `Period` and one function `MatchSchedule` as the foundation for building the upcoming ScheduledOverrides feature. Ref #484	2021-05-03 22:03:49 +09:00
Thejas N	588872a316	feat: allow ephemeral runner to be optional (#498 ) - Adds `ephemeral` option to `runner.spec` ``` .... template: spec: ephemeral: false repository: mumoshu/actions-runner-controller-ci .... ``` - `ephemeral` defaults to `true` - `entrypoint.sh` in runner/Dockerfile modified to read `RUNNER_EPHEMERAL` flag - Runner images are backward-compatible. `--once` is omitted only when the new envvar `RUNNER_EPHEMERAL` is explicitly set to `false`. Resolves #457	2021-05-02 19:04:14 +09:00
Christoph Brand	a18ac330bb	feature(controller): allow autoscaler to scale down to 0 (#447 )	2021-05-02 16:46:51 +09:00
Yusuke Kuoka	dbd7b486d2	feat: Support for scaling from/to zero (#465 ) This is an attempt to support scaling from/to zero. The basic idea is that we create a one-off "registration-only" runner pod on RunnerReplicaSet being scaled to zero, so that there is one "offline" runner, which enables GitHub Actions to queue jobs instead of discarding those. GitHub Actions seems to immediately throw away the new job when there are no runners at all. Generally, having runners of any status, `busy`, `idle`, or `offline` would prevent GitHub actions from failing jobs. But retaining `busy` or `idle` runners means that we need to keep runner pods running, which conflicts with our desired to scale to/from zero, hence we retain `offline` runners. In this change, I enhanced the runnerreplicaset controller to create a registration-only runner on very beginning of its reconciliation logic, only when a runnerreplicaset is scaled to zero. The runner controller creates the registration-only runner pod, waits for it to become "offline", and then removes the runner pod. The runner on GitHub stays `offline`, until the runner resource on K8s is deleted. As we remove the registration-only runner pod as soon as it registers, this doesn't block cluster-autoscaler. Related to #447	2021-05-02 16:11:36 +09:00
Rolf Ahrenberg	6b77a2a5a8	feat: Docker registry mirror (#478 ) Changes: - Switched to use `jq` in startup.sh - Enable docker registry mirror configuration which is useful when e.g. avoiding the Docker Hub rate-limiting Check #478 for how this feature is tested and supposed to be used.	2021-04-25 14:04:01 +09:00
Manuel Jurado	37c2a62fa8	Allow to configure runner volume size limit (#436 ) Enable the user to set a limit size on the volume of the runner to avoid some runner pod affecting other resources of the same cluster Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-04-18 13:56:59 +09:00
Agoney Garcia-Deniz	2e551c9d0a	Add hostAliases to the runner spec (#456 )	2021-04-17 17:04:52 +09:00
asoldino	b42b8406a2	Add dockerVolumeMounts (#439 ) Resolves #435	2021-04-06 10:10:10 +09:00
Christoph Brand	9ed245c85e	feature(controller): remove dockerd executable (#432 )	2021-04-01 08:50:48 +09:00
Yusuke Kuoka	156e2c1987	Fix MTU configuration for dockerd (#421 ) Resolves #393	2021-03-31 09:29:21 +09:00
Yusuke Kuoka	374105c1f3	Fix dindWithinRunnerContainer not to crash-loop runner pods (#419 ) Apparently #253 broke dindWithinRunnerContainer completely due to the difference in how /runner volume is set up.	2021-03-25 10:23:36 +09:00
Yusuke Kuoka	bc6e499e4f	Make logging more concise (#410 ) This makes logging more concise by changing logger names to something like `controllers.Runner` to `actions-runner-controller.runner` after the standard `controller-rutime.controller` and reducing redundant logs by removing unnecessary requeues. I have also tweaked log messages so that their style is more consistent, which will also help readability. Also, runnerreplicaset-controller lacked useful logs so I have enhanced it.	2021-03-20 07:34:25 +09:00
Yusuke Kuoka	07f822bb08	Do include Runner controller in integration test (#409 ) So that we could catch bugs in runner controller like seen in #398, #404, and #407. Ref #400	2021-03-19 16:14:15 +09:00
Hidetake Iwata	3a0332dfdc	Add metrics of RunnerDeployment and HRA (#408 ) * Add metrics of RunnerDeployment and HRA * Use kube-state-metrics-style label names	2021-03-19 16:14:02 +09:00
Yusuke Kuoka	f6ab66c55b	Do not delay min/maxReplicas propagation from HRA to RD due to caching (#406 ) As part of #282, I have introduced some caching mechanism to avoid excessive GitHub API calls due to the autoscaling calculation involving GitHub API calls is executed on each Webhook event. Apparently, it was saving the wrong value in the cache- The value was one after applying `HRA.Spec.{Max,Min}Replicas` so manual changes to {Max,Min}Replicas doesn't affect RunnerDeployment.Spec.Replicas until the cache expires. This isn't what I had wanted. This patch fixes that, by changing the value being cached to one before applying {Min,Max}Replicas. Additionally, I've also updated logging so that you observe which number was fetched from cache, and what number was suggested by either TotalNumberOfQueuedAndInProgressWorkflowRuns or PercentageRunnersBusy, and what was the final number used as the desired-replicas(after applying {Min,Max}Replicas). Follow-up for #282	2021-03-19 12:58:02 +09:00
Yusuke Kuoka	c424215044	Do recheck runner registration timely (#405 ) Since #392, the runner controller could have taken unexpectedly long time until it finally notices that the runner has been registered to GitHub. This patch fixes the issue, so that the controller will notice the successful registration in approximately 1 minute(hard-coded). More concretely, let's say you had configured a long sync-period of like 10m, the runner controller could have taken approx 10m to notice the successful registration. The original expectation was 1m, because it was intended to recheck every 1m as implemented in #392. It wasn't working as such due to my misunderstanding in how requeueing work.	2021-03-19 11:02:47 +09:00
Yusuke Kuoka	3cccca8d09	Do patch runner status instead of update to reduce conflicts and avoid future bugs Ref https://github.com/summerwind/actions-runner-controller/pull/398#issuecomment-801548375	2021-03-18 10:31:17 +09:00
Yusuke Kuoka	7a7086e7aa	Make error logs more helpful	2021-03-18 10:26:21 +09:00
Yusuke Kuoka	3f23501b8e	Reduce "No runner matching the specified labels was found" errors while runner replacement (#392 ) We occasionally encountered those errors while the underlying RunnerReplicaSet is being recreated/replaced on RunnerDeployment.Spec.Template update. It turned out to be due to that the RunnerDeployment controller was waiting for the runner pod becomes `Running`, intead of the new replacement runner to have registered to GitHub. This fixes that, by trying to Runner.Status.Phase to `Running` only after the runner in the runner pod appears to be registered. A side-effect of this change is that runner controller would call more "ListRunners" GitHub Actions API. I've reviewed and improved the runner controller code and Runner CRD to make make the number of calls minimum. In most cases, ListRunners should be called only twice for each runner creation.	2021-03-16 10:52:30 +09:00
Yusuke Kuoka	5530030c67	Disable metrics-based autoscaling by default when scaleUpTriggers are enabled (#391 ) Relates to https://github.com/summerwind/actions-runner-controller/pull/379#discussion_r592813661 Relates to https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793266609 When you defined HRA.Spec.ScaleUpTriggers[] but HRA.Spec.Metrics[], the HRA controller will now enable ScaleUpTriggers alone and insteaed of automatically enabling TotalNumberOfQueuedAndInProgressWorkflowRuns. This allows you to use ScaleUpTriggers alone, so that the autoscaling is done without calling GitHub API at all, which should grealy decrease the change of GitHub API calls get rate-limited.	2021-03-14 11:03:00 +09:00
Yusuke Kuoka	8d3a83b07a	Add CheckRun.Names scale-up trigger configuration (#390 ) This allows you to trigger autoscaling depending on check_run names(i.e. actions job names). If you are willing to differentiate scale amount only for a specific job, or want to scale only on a specific job, try this.	2021-03-14 10:21:42 +09:00
Brandon Kimbrough	2273b198a1	Add ability to set the MTU size of the docker in docker container (#385 ) * adding abilitiy to set docker in docker MTU size * safeguards to only set MTU env var if it is set	2021-03-12 08:44:49 +09:00
Yusuke Kuoka	3d62e73f8c	Fix PercentageRunnersBusy scaling not working (#386 ) PercentageRunnerBusy seems to have regressed since #355 due to that RunnerDeployment.Spec.Selector is empty by default and the HRA controller was using that empty selector to query runners, which somehow returned 0 runners. This fixes that by using the newly added automatic `runner-deployment-name` label for the default runner label and the selector, which avoids querying with empty selector. Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-795200205	2021-03-11 20:16:36 +09:00
Yusuke Kuoka	f5c639ae28	Make webhook-based autoscaler github event logs more operator-friendly (#384 ) Adds fields like `pullRequest.base.ref` and `checkRun.status` that are useful for verifying the autoscaling behaviour without browsing GitHub. Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-794175312	2021-03-10 09:40:44 +09:00
Yusuke Kuoka	728829be7b	Fix panic on scaling organizational runners (#381 ) Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793287133	2021-03-09 15:03:47 +09:00
Yusuke Kuoka	1b8a656051	Use --watch-namespace flag to restrict the namespace to watch Ref https://github.com/summerwind/actions-runner-controller/issues/377#issuecomment-793172995	2021-03-09 09:46:21 +09:00
Rob Whitby	1753fa3530	handle GET requests in webhook hra (#378 )	2021-03-09 08:46:27 +09:00

1 2 3 4 5 ...

351 Commits