Co-authored-by: Cory Miller <cory-miller@github.com>
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Ava Stancu <AvaStancu@github.com>
Co-authored-by: Ferenc Hammerl <fhammerl@github.com>
Co-authored-by: Francesco Renzi <rentziass@github.com>
Co-authored-by: Bassem Dghaidi <Link-@github.com>
* Update controller package names to match the owning API group name
* feedback.
Co-authored-by: Bassem Dghaidi <568794+Link-@users.noreply.github.com>
* Changed folder structure to allow multi group registration
* included actions.github.com directory for resources and controllers
* updated go module to actions/actions-runner-controller
* publish arc packages under actions-runner-controller
* Update charts/actions-runner-controller/docs/UPGRADING.md
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
This fixes the said issue, which I found while running a series of E2E tests for other features and pull requests I have recently contributed.
Setting the SecurityContext.Privileged bit to false, which is the default value, prevents GKE from admitting Windows pods, because the Privileged bit is not supported on Windows. The field therefore has to be left unset for Windows runner pods, as sketched below.
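Roughly, the idea looks like this (a minimal sketch, assuming `k8s.io/api/core/v1` types; the helper name is hypothetical):

```go
package controllers

import corev1 "k8s.io/api/core/v1"

// runnerSecurityContext leaves Privileged unset unless the user explicitly
// requested it. A *bool pointing at false is serialized as "privileged: false",
// which GKE rejects for Windows pods, whereas a nil pointer omits the field
// entirely.
func runnerSecurityContext(privileged *bool) *corev1.SecurityContext {
	sc := &corev1.SecurityContext{}
	if privileged != nil && *privileged {
		sc.Privileged = privileged
	}
	return sc
}
```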
* 1770: update log format and add runID and ID to workflow logs (sketch below)
update tests, change log format for controllers.HorizontalRunnerAutoscalerGitHubWebhook
use logging package
remove unused modules
add setup name to setuplog
add flag to change log format
change flag name to enableProdLogConfig
move log opts to logger package
remove empty else and reset timeEncoder
update flag description
use get function to handle nil
rename flag and update logger function
Update main.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update controllers/horizontal_runner_autoscaler_webhook.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
Update logging/logger.go
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
copy log opt per each NewLogger call
revert to use autoscaler.log
update flag descript and remove unused imports
add logFormat to readme
rename setupLog to logger
make fmt
* Fix E2E along the way
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
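A rough sketch of the log-format selection added by the commits above, using zap directly; the function name and flag handling are illustrative, not necessarily what the logging package ended up with:

```go
package logging

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
)

// NewLogger builds a zap logger whose encoder depends on the requested
// format. "json" yields production-style structured logs; anything else
// falls back to the console encoder used for development.
func NewLogger(logFormat string) (*zap.Logger, error) {
	var cfg zap.Config
	if logFormat == "json" {
		cfg = zap.NewProductionConfig()
	} else {
		cfg = zap.NewDevelopmentConfig()
		// Keep timestamps human readable in the text format.
		cfg.EncoderConfig.EncodeTime = zapcore.ISO8601TimeEncoder
	}
	// Each call builds from a fresh Config, so per-logger tweaks (like the
	// time encoder above) don't leak between callers.
	return cfg.Build()
}
```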
so that ARC respects the registration timeout, terminationGracePeriodSeconds, and RUNNER_GRACEFUL_STOP_TIMEOUT (#1759) when the runner pod is terminated externally too early after its creation
While I was running E2E tests for #1759, I discovered a potential issue where ARC can terminate runner pods without waiting for the registration timeout of 10 minutes.
You won't be affected by this in normal circumstances, as this failure scenario can be triggered only when you or another K8s controller like cluster-autoscaler deletes the runner or the runner pod immediately after it has been created. But it is probably worth fixing anyway, because it is still possible to trigger.
This introduces a linter to PRs to help with code reviews and code hygiene. I've also gone ahead and fixed (or ignored) the existing lints.
I've only set up the default linters right now. There are many more options documented at https://golangci-lint.run/.
The GitHub Action should add appropriate annotations to the lint job for the PR. Contributors can also lint locally using `make lint`.
* Add prometheus metrics for autoscaling
* Add desc for prometheus-metrics
* FIX: Typo
* Remove replicas_desired_before in metrics
* Remove Num prefix in metrics (see the sketch below)
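A hedged sketch of how an autoscaling gauge like the ones above can be exposed via controller-runtime's metrics registry; the metric and label names are illustrative:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	ctrlmetrics "sigs.k8s.io/controller-runtime/pkg/metrics"
)

var hraDesiredReplicas = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "horizontalrunnerautoscaler_desired_replicas",
		Help: "Desired number of runner replicas computed by the autoscaler.",
	},
	[]string{"name", "namespace"},
)

func init() {
	// Everything registered here is served on the manager's /metrics endpoint.
	ctrlmetrics.Registry.MustRegister(hraDesiredReplicas)
}

// SetDesiredReplicas records the latest computed value for a given HRA.
func SetDesiredReplicas(name, namespace string, replicas int) {
	hraDesiredReplicas.WithLabelValues(name, namespace).Set(float64(replicas))
}
```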
This removes the flag and code for the legacy GitHub API cache. We already migrated to fully use the new HTTP-cache-based API cache functionality, which was added via #1127 and has been available since ARC 0.22.0. Since then, the legacy cache has been a no-op, so removing it is safe.
Ref #1412
* feat: allow to discover runner statuses
* fix manifests
* Bump runner version to 2.289.1 which includes the hooks support
* Add feedback from review
* Update reference to newRunnerPod
* Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific
* Fix additional TestNewRunnerPod test
* Cover additional feedback from review
* fix rbac manager role
* Add permissions to service account for container mode if not provided
* Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
The regression resulted in the webhook-based autoscaler being unable to find visible runner groups, and therefore unable to scale the target RunnerDeployment/RunnerSet up or down at all, when the autoscaler was provided GitHub API credentials to enable the runner groups support. This fixes that.
The regression was introduced via #1578, which has not been released yet. Users of existing ARC releases are therefore not affected.
* added containerMode=kubernetes env variables to the runner
* removed unused logging
* restored configs and charts
* restored makefile cert version and acceptance/run
* added workVolumeClaimTemplate in pod definition, including logic
* added claim template name based on the runner
* Apply suggestions from code review
update errors
* added concurrent cleanup before runner pod is deleted
* update manifests
* added retry after 30s if pod cleanup contains err
* added admission webhook check, made workVolumeClaimTemplate mandatory for k8s
* style changes and added comments
* added IsZero timestamp check for deleting runner-linked pods
* changed order of local variable to avoid copy if p is deleted
* removed docker from container mode k8s
* restored charts, config, makefile
* restored forked files back and not the ARC ones
* created PersistentVolume on containerMode k8s
* create pv only if storage class name is local-storage
* removed actions if storage class name is local-storage
* added service account validation if container mode kubernetes
* changed the coding style to match rest of the ARC
* added validation to the runnerdeployment webhook
* specified fields more precisely, added webhook validation to the replicaset as well
* remake manifests
* wrapped delete of runner-linked pods in kube mode
* fixed empty line
* fixed import
* makefile changes for hooks
* added cleanup secrets
* create manifests
* docs
* update access modes
* update dockerfile
* nit changes
* fixed dockerfile
* rewrite allowing reuse for runners and runnersets
* deepcopy forgot to stage
* changed privileged
* make manifests
* partly moved to finalizer, still need to apply finalizer first
* finalizer added if env variable used in container mode exists (see the finalizer sketch after this commit)
* bump runner version
* error message moved from Error to Info on cleanup pods/secrets
* removed useless dereferencing, added transformation tests of workVolumeClaimTemplate
* Apply suggestions from code review
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* Update controllers/utils_test.go
Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com>
* add hook version to cli, update to 0.1.2
* Apply suggestions from code review
* Update controllers/utils_test.go
* Update runner/Makefile
* Fix missing secret permission and the error handling
* Fix a runnerpod reconciler finalizer to not trigger unnecessary retry
Co-authored-by: Nikola Jokic <nikola-jokic@github.com>
Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com>
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
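A simplified sketch of the finalizer-based cleanup flow from the commits above, using controller-runtime; the finalizer name and the cleanup helper are placeholders, not the exact identifiers used in ARC:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
)

// Placeholder finalizer name for illustration only.
const runnerLinkedResourcesFinalizer = "example.com/runner-linked-resources"

func reconcileRunnerPodCleanup(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	if pod.DeletionTimestamp.IsZero() {
		// Pod is alive: make sure the finalizer is present so we get a chance
		// to clean up runner-linked pods and secrets before it goes away.
		if !controllerutil.ContainsFinalizer(pod, runnerLinkedResourcesFinalizer) {
			controllerutil.AddFinalizer(pod, runnerLinkedResourcesFinalizer)
			return c.Update(ctx, pod)
		}
		return nil
	}

	// Pod is being deleted: clean up first, then release the finalizer.
	// On error the reconciliation is retried (the commits above retry after 30s).
	if err := cleanupRunnerLinkedResources(ctx, c, pod); err != nil {
		return err
	}
	controllerutil.RemoveFinalizer(pod, runnerLinkedResourcesFinalizer)
	return c.Update(ctx, pod)
}

// cleanupRunnerLinkedResources stands in for deleting the pods and secrets the
// Kubernetes container hook created for this runner; details omitted here.
func cleanupRunnerLinkedResources(ctx context.Context, c client.Client, pod *corev1.Pod) error {
	return nil
}
```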
* Make webhook-based scale operation asynchronous
This prevents a race condition in the webhook-based autoscaler where it receives a webhook event while still processing another one, and both end up scaling up the same horizontal runner autoscaler.
Ref #1321
* Fix typos
* Update rather than Patch HRA to avoid race among webhook-based autoscaler servers
* Batch capacity reservation updates for efficient use of apiserver
* Fix potential never-ending HRA update conflicts in batch update
* Extract batchScaler out of webhook-based autoscaler for testability (see the sketch below)
* Fix log levels and batch scaler hang on start
* Correlate webhook event with scale trigger amount in logs
* Fix log message
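An illustrative sketch of the batching idea described above; the types and names are made up for the example rather than being the actual batchScaler API:

```go
package controllers

import (
	"context"
	"time"
)

type scaleOp struct {
	hraName string // target HorizontalRunnerAutoscaler
	amount  int    // capacity reservation delta from one webhook event
}

type batchScaler struct {
	ops chan scaleOp
}

func newBatchScaler(ctx context.Context, apply func(hraName string, total int)) *batchScaler {
	bs := &batchScaler{ops: make(chan scaleOp, 1024)}
	go bs.run(ctx, apply)
	return bs
}

// Add enqueues a scale operation and returns immediately, so the webhook
// handler never blocks on (or races over) the Kubernetes API.
func (bs *batchScaler) Add(op scaleOp) { bs.ops <- op }

// run periodically drains the queue and applies one aggregated Update per
// HRA, instead of many concurrent Patch calls racing over the same object.
func (bs *batchScaler) run(ctx context.Context, apply func(hraName string, total int)) {
	ticker := time.NewTicker(3 * time.Second)
	defer ticker.Stop()

	pending := map[string]int{}
	for {
		select {
		case <-ctx.Done():
			return
		case op := <-bs.ops:
			pending[op.hraName] += op.amount
		case <-ticker.C:
			for name, total := range pending {
				apply(name, total)
			}
			pending = map[string]int{}
		}
	}
}
```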
This adds a test to verify the runner pod generation logic for the case where you use a generic ephemeral volume as "work".
It is largely an adaptation of the test cases written for RunnerSet in #1471 to RunnerDeployment and Runner.
* fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work"
While manually testing configurations being documented in #1464, I discovered that the use of a generic ephemeral volume for the "work" directory was not working correctly due to a validation error.
This fixes the runner pod generation logic to not add the default volume and volume mount for the "work" dir in that case, so that the error disappears (a minimal sketch of the check follows below).
Ref #1464
* e2e: Ensure the "work" generic ephemeral volume works as expected
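A minimal sketch of the duplicate-avoidance check described above, assuming the work directory volume is named "work"; the helper name is illustrative:

```go
package controllers

import corev1 "k8s.io/api/core/v1"

// workVolumePresent reports whether the pod template already declares a
// volume named "work" (for example, a generic ephemeral volume). If it does,
// the default "work" volume and its mount must not be added again, or the pod
// fails validation with a duplicate volume/mount name error.
func workVolumePresent(volumes []corev1.Volume) bool {
	for _, v := range volumes {
		if v.Name == "work" {
			return true
		}
	}
	return false
}
```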
We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 and #1207.
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
* Enhance RunnerSet to optionally retain PVs across restarts
This is our initial attempt to bring back the ability to retain PVs across runner pod restarts when using RunnerSet.
The implementation is composed of two new controllers, `runnerpersistentvolumeclaim-controller` and `runnerpersistentvolume-controller`.
It all starts from our existing `runnerset-controller`. The controller now tries to mark any PVCs created by StatefulSets created for the RunnerSet.
Once the controller terminates StatefulSets, their corresponding PVCs are cleaned up by `runnerpersistentvolumeclaim-controller`, and PVs are then unbound from their corresponding PVCs by `runnerpersistentvolume-controller` so that they can be reused by future PVCs created for future StatefulSets that share the same StorageClass (see the sketch below).
Ref #1286
* Update E2E test suite to cover runner, docker, and go caching with RunnerSet + PVs
Ref #1286
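A rough sketch of what "unbinding" a retained PV can look like, assuming the PV's reclaim policy is Retain; the helper name is illustrative:

```go
package controllers

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// unbindReleasedPV clears the claimRef of a Released PV so that a future PVC
// created for a new StatefulSet (with the same StorageClass) can bind to it
// and reuse the runner, docker, and go caches kept on the volume.
func unbindReleasedPV(ctx context.Context, c client.Client, pv *corev1.PersistentVolume) error {
	if pv.Status.Phase != corev1.VolumeReleased {
		return nil // still bound or already available; nothing to do
	}
	pv.Spec.ClaimRef = nil
	return c.Update(ctx, pv)
}
```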
This is intended to fix #1369, mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might also work for RunnerDeployment in cases where #1395 does not, like when the user explicitly sets the runner pod restart policy to anything other than "Never".
Ref #1369
This feature flag was provided automatically by ARC to the runner container to let it use `--ephemeral` instead of `--once` by default. As support for `--once` is being dropped from the runner image via #1384, we no longer need it.
Ref #1196
In #1373 we made two mistakes:
- We mistakenly checked whether all the runner labels are included in the job labels, and only then marked the target as eligible for scaling. It should definitely be the opposite: all the job labels must be included in the runner labels!
- We mistakenly checked for the existence of the `self-hosted` label in the job. [Although it should be good practice to explicitly say `runs-on: ["self-hosted", "custom-label"]`](https://docs.github.com/en/actions/using-workflows/workflow-syntax-for-github-actions#example-using-labels-for-runner-selection), that's not a requirement, so we should code accordingly.
The consequence of those two mistakes was that, for example, jobs with `self-hosted` + `custom` labels didn't result in scaling a runner with `self-hosted` + `custom` + `custom2`. This should fix that (a minimal sketch of the corrected check follows the refs below).
Ref #1056
Ref #1373
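A minimal sketch of the corrected check, assuming label comparison is case-insensitive and that `self-hosted` is treated as implicitly present on every self-hosted runner (the function name is illustrative):

```go
package controllers

import "strings"

// runnerMatchesJobLabels reports whether a runner can serve a job: every
// label requested by the job must be carried by the runner, not the other
// way around. "self-hosted" is assumed implicit, so jobs may omit it.
func runnerMatchesJobLabels(runnerLabels, jobLabels []string) bool {
	have := map[string]bool{"self-hosted": true}
	for _, l := range runnerLabels {
		have[strings.ToLower(l)] = true
	}
	for _, l := range jobLabels {
		if !have[strings.ToLower(l)] {
			return false
		}
	}
	return true
}
```

With this, a job labeled `self-hosted` + `custom` matches a runner labeled `self-hosted` + `custom` + `custom2`, as described above.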
Adds some unit tests for the runner pod generation logic that is used internally by runner deployment and runner set controllers as preparation for #1282
This fixes the said issue by additionally treating any runner pod whose phase is Failed, or whose runner container exited with a non-zero code, as "complete", so that ARC gives up unregistering the runner from Actions and deletes the runner pod anyway (a minimal sketch of the check follows below).
Note that there are plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptible instances and a job assigned to a runner takes more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod is terminated prematurely without letting actions/runner unregister itself from Actions. If your VM or hypervisor fails, runner pods that were running on the node become Failed without their runners being unregistered from Actions.
Please beware that it is currently the user's responsibility to clean up any dangling runner resources on GitHub Actions.
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307
Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273
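A hedged sketch of the "complete" check described above, assuming the runner container is named `runner`; the function name is illustrative:

```go
package controllers

import corev1 "k8s.io/api/core/v1"

// runnerPodIsComplete reports whether ARC should stop trying to unregister
// the runner gracefully and just delete the pod: either the whole pod is in
// the Failed phase, or the runner container already exited with a non-zero
// code (for example, because the node was preempted mid-job).
func runnerPodIsComplete(pod *corev1.Pod) bool {
	if pod.Status.Phase == corev1.PodFailed {
		return true
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != "runner" {
			continue
		}
		if t := cs.State.Terminated; t != nil && t.ExitCode != 0 {
			return true
		}
	}
	return false
}
```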
With the current implementation, if a pod is deleted, the controller fails to delete the runner because it tries to annotate a pod that doesn't exist: we're passing a new pod object that is not an existing resource.
#1179 was not working properly, particularly for scale-down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods being terminated before the requested unregistration processes completed, causing some in-progress workflow jobs to hang forever. This fixes an edge case in which a decrease in the desired replicas triggered the failure, so that every runner is unregistered and then terminated, as originally designed.
It turned out that #1179 broke static runners in such a way that they were no longer able to scale up at all when the desired replicas was updated.
This fixes that by correcting a short-circuit intended only for ephemeral runners so that it is not mistakenly triggered for static runners.
The unregister timeout of 1 minute (no matter how long it is) can negatively impact the availability of a static runner constantly running workflow jobs, or an ephemeral runner running a long-running job.
We deal with that by completely removing the unregistration timeout, so that regardless of the type of runner (static or ephemeral), it waits forever until it is successfully unregistered before being terminated.
I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, but the loop runs quite often due to how the controller works) when it failed once because the runner was in the middle of running a workflow job.
This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response with status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter (see the sketch below).
The end result is that 422 errors are observed at most once per graceful termination of an ephemeral runner pod, and RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget, and the logs are much cleaner than before.
Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
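An illustrative sketch of the 422 short-circuit for ephemeral runners, using go-github; the major-version import path and the organization-level call are assumptions, and the real code may target repository or enterprise runners as well:

```go
package controllers

import (
	"context"
	"net/http"

	"github.com/google/go-github/v47/github"
)

// tryRemoveEphemeralRunner attempts RemoveRunner exactly once. A 422 response
// means the runner is busy with a job, so instead of retrying the call on
// every reconciliation loop, we report "not removed" and let the caller wait
// for the ephemeral runner to unregister itself when the job completes.
func tryRemoveEphemeralRunner(ctx context.Context, gh *github.Client, org string, runnerID int64) (removed bool, err error) {
	resp, err := gh.Actions.RemoveOrganizationRunner(ctx, org, runnerID)
	if resp != nil && resp.StatusCode == http.StatusUnprocessableEntity {
		return false, nil // runner is running a job; start the stop waiter
	}
	if err != nil {
		return false, err
	}
	return true, nil
}
```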
* Remove legacy GitHub API cache of HRA.Status.CachedEntries
We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache produced the desired replicas number calculated by HRA.
Just remove the legacy cache to keep it simple and easy to understand.
* Deprecate the githubAPICacheDuration helm chart value and the --github-api-cache-duration flag as well
* Fix integration test
While testing #1179, I discovered that ARC sometimes stops resyncing a RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods.
This seems to happen when ARC misses a workflow_job completion event, but it has no way to decide whether (1) something went wrong in ARC itself or (2) a load balancer in the middle, GitHub, or anything other than ARC went wrong. It needs a criterion to decide, or, failing that, a way to deal with it.
In this change, I added a hard-coded 10-minute timeout (which can be made customizable later) that prevents runner pod recreation until it elapses.
Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up (a simplified sketch follows). If the workflow completion event arrives after the timeout, it decreases the desired replicas, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.
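A simplified sketch of the hard-coded recreation timeout; the constant and function names are illustrative:

```go
package controllers

import "time"

// runnerPodRecreationDelay is the hard-coded timeout described above; it can
// be made customizable later.
const runnerPodRecreationDelay = 10 * time.Minute

// shouldRecreateMissingRunnerPods reports whether the RunnerReplicaSet/
// RunnerSet should stop waiting for a possibly-lost workflow_job completion
// event and resync, i.e. recreate runner pods to match the desired replicas.
func shouldRecreateMissingRunnerPods(lastScaleUp, now time.Time) bool {
	return now.Sub(lastScaleUp) >= runnerPodRecreationDelay
}
```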