actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Yusuke Kuoka	c74ad6195f	Fix runners to do their best to gracefully stop on pod eviction (#1759 ) Ref #1535 Ref #1581 Signed-off-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-11-01 20:30:10 +09:00
Nicholas Farley	a389292478	Allow `RunnerDeployment`s to configure `dnsPolicy` for runners (#1892 ) * Add DnsPolicy field to RunnerPodSpec struct * Ensure the runnerSpec's DNSPolicy is mirrored to the pod.Spec * Run `make manifests`	2022-10-05 08:16:11 +09:00
Yusuke Kuoka	e4879e7ae4	Tweak E2E and documentation about MTU configuration	2022-09-25 07:50:12 +09:00
Cory Miller	c91e76f169	Add golangci-lilnt to CI (#1794 ) This introduces a linter to PRs to help with code reviews and code hygiene. I've also gone ahead and fixed (or ignored) the existing lints. I've only setup the default linters right now. There are many more options that are documented at https://golangci-lint.run/. The GitHub Action should add appropriate annotations to the lint job for the PR. Contributors can also lint locally using `make lint`.	2022-09-21 09:08:22 +09:00
Felipe Galindo Sanchez	584745b67d	Minor improvements for runner groups - Add group in runners columns - Add constant for runner group and labels	2022-07-15 09:47:25 +09:00
Yusuke Kuoka	618276e3d3	Enhance support for multi-tenancy (#1371 ) This enhances every ARC controller and the various K8s custom resources so that the user can now configure a custom GitHub API credentials (that is different from the default one configured per the ARC instance). Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1067#issuecomment-1043716646	2022-07-12 09:45:00 +09:00
Yusuke Kuoka	1cfe1974c4	Add missing job-related permissions to runner pods with k8s container mode	2022-07-10 16:16:32 +09:00
Felipe Galindo Sanchez	11cb9b7882	feat: allow to discover runner statuses (#1268 ) * feat: allow to discover runner statuses * fix manifests * Bump runner version to 2.289.1 which includes the hooks support * Add feedback from review * Update reference to newRunnerPod * Fix TestNewRunnerPodFromRunnerController and make hooks file names job specific * Fix additional TestNewRunnerPod test * Cover additional feedback from review * fix rbac manager role * Add permissions to service account for container mode if not provided * Rename flag to runner.statusUpdateHook.enabled and fix needsServiceAccount Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-07-10 15:11:29 +09:00
Yusuke Kuoka	8161136cbd	Fix PercentageRunnersBusy scaling delay (#1579 ) * Use a dedicated pod label to say it is a runner pod Follow-up for #1546 * Fix PercentageRunnersBusy scaling delay Ref #1374	2022-06-29 20:49:21 +09:00
Thomas Boop	0386c0734c	`containerMode` option to allow running jobs in k8's instead of docker (#1546 ) * added containerMode=kubernetes env variables to the runner * removed unused logging * restored configs and charts * restored makefile cert version and acceptance/run * added workVolumeClaimTemplate in pod definition, including logic * added claim template name based on the runner * Apply suggestions from code review update errors * added concurrent cleanup before runner pod is deleted * update manifests * added retry after 30s if pod cleanup contains err * added admission webhook check, made workVolumeClaimTemplate mandatory for k8s * style changes and added comments * added izZero timestamp check for deleting runner-linked pods * changed order of local variable to avoid copy if p is deleted * removed docker from container mode k8s * restored charts, config, makefile * restored forked files back and not the ARC ones * created PersistentVolume on containerMode k8s * create pv only if storage class name is local-storage * removed actions if storage class name is local-storage * added service account validation if container mode kubernetes * changed the coding style to match rest of the ARC * added validation to the runnerdeployment webhook * specified fields more precisely, added webhook validation to the replicaset as well * remake manifests * wraped delete runner-linked-pods in kube mode * fixed empty line * fixed import * makefile changes for hooks * added cleanup secrets * create manifests * docs * update access modes * update dockerfile * nit changes * fixed dockerfile * rewrite allowing reuse for runners and runnersets * deepcopy forgot to stage * changed privileged * make manifests * partly moved to finalizer, still need to apply finalizer first * finalizer added if env variable used in container mode exists * bump runner version * error message moved from Error to Info on cleanup pods/secrets * removed useless dereferencing, added transformation tests of workVolumeClaimTemplate * Apply suggestions from code review * Update controllers/utils_test.go Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com> * Update controllers/utils_test.go Co-authored-by: Thomas Boop <52323235+thboop@users.noreply.github.com> * add hook version to cli, update to 0.1.2 * Apply suggestions from code review * Update controllers/utils_test.go * Update runner/Makefile * Fix missing secret permission and the error handling * Fix a runnerpod reconciler finalizer to not trigger unnecessary retry Co-authored-by: Nikola Jokic <nikola-jokic@github.com> Co-authored-by: Nikola Jokic <97525037+nikola-jokic@users.noreply.github.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-06-28 14:12:40 +09:00
Sam Weston	bc7a3cab1b	Add priorityClassName to CRDs (#1513 ) * Add pod priorityClassName to controller and crds * Add missing bits in bases directory * Regenerate crds	2022-06-28 08:45:19 +09:00
Yusuke Kuoka	63be0223ad	fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" (#1471 ) * fix: Avoid duplicate volume and mount name error for generic ephemeral volume as "work" While manually testing configurations being documented in #1464, I discovered that the use of dynamic ephemeral volume for "work" directory was not working correctly due to the valiadation error. This fixes the runner pod generation logic to not add the default volume and volume mount for "work" dir, so that the error disappears. Ref #1464 * e2e: Ensure work generic ephemeral volume to work as expected	2022-05-22 10:25:50 +09:00
Callum Tait	65f7ee92a6	refactor: remove registration runner dead code (#1260 ) We had some dead code left over from the removal of registration runners. Registration runners were removed in #859 #1207 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-05-16 11:23:39 +09:00
Yusuke Kuoka	759349de11	fix: force restartPolicy "Never" to prevent runner pods from stucking in Terminating when the container disappeared (#1395 ) Ref #1369	2022-05-14 09:07:17 +01:00
Yusuke Kuoka	e46b90f758	fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420 ) This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never". Ref #1369	2022-05-12 09:34:27 +01:00
Yusuke Kuoka	3a7e8c844b	feat: Support arbitrarily setting `privileged: true` for runner container (#1383 ) Resolves #1282	2022-05-12 09:25:51 +01:00
Yusuke Kuoka	dabbc99c78	refactor(controller): stop auto-setting RUNNER_FEATURE_FLAG_EPHEMERAL (#1385 ) This feature flag was provided from ARC to runner container automatically to let it use `--ephemeral` instead of `--once` by default. As the support for `--once` is being dropped from the runner image via #1384, we no longer need that. Ref #1196	2022-05-11 11:42:55 +01:00
Jeff Billimek	13bfa2da4e	Fix runner pod dnsConfig (#1227 ) Fixes #1226 Fixes #1224 Signed-off-by: Jeff Billimek <jeff@billimek.com>	2022-04-20 10:55:20 +09:00
Yusuke Kuoka	b09c54045a	Prevent runners from stuck in Terminating when pod disappeared without standard termination process (#1318 ) This fixes the said issue by additionally treating any runner pod whose phase is Failed or the runner container exited with non-zero code as "complete" so that ARC gives up unregistering the runner from Actions, deletes the runner pod anyway. Note that there are a plenty of causes for that. If you are deploying runner pods on AWS spot instances or GCE preemptive instances and a job assigned to a runner took more time than the shutdown grace period provided by your cloud provider (2 minutes for AWS spot instances), the runner pod would be terminated prematurely without letting actions/runner unregisters itself from Actions. If your VM or hypervisor failed then runner pods that were running on the node will become PodFailed without unregistering runners from Actions. Please beware that it is currently users responsibility to clean up any dangling runner resources on GitHub Actions. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1307 Might also relate to https://github.com/actions-runner-controller/actions-runner-controller/issues/1273	2022-04-08 10:17:33 +09:00
Felipe Galindo Sanchez	e7e48a77e4	Merge remote-tracking branch 'upstream/master' into patch-4	2022-04-04 09:54:08 -07:00
Yusuke Kuoka	631a70a35f	Fix runner pod to be cleaned up earlier regardless of the sync period (#1299 ) Ref #1291	2022-04-03 11:12:44 +09:00
Felipe Galindo Sanchez	9657d3e5b3	Fix deleting a runner when pod was deleted With the current implementation if a pod is deleted, controller is failing to delete the runner as it's trying to annotate a pod that doesn't exist as we're passing a new pod object that is not an existing resource	2022-03-22 14:44:50 -07:00
Yusuke Kuoka	8ca39caff5	Fix log message on runner deletion	2022-03-13 12:11:11 +00:00
Yusuke Kuoka	c4b24f8366	Prevent static runners from terminating due to unregister timeout The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job. We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.	2022-03-13 07:26:36 +00:00
Yusuke Kuoka	fa287c4395	Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice Since #1179, runner pods managed by RunnerDeployment had two duplicate environment variables for RUNNER_NAME and RUNNER_TOKEN. This fixes that.	2022-03-12 13:49:50 +00:00
Yusuke Kuoka	051089733b	Use --ephemeral by default Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189	2022-03-12 13:20:07 +00:00
Yusuke Kuoka	9628bb2937	Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204 ) Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy. There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job. This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter. The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271	2022-03-11 19:03:17 +09:00
Yusuke Kuoka	14a878bfae	refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet	2022-03-06 05:53:26 +00:00
Felipe Galindo Sanchez	d20ad71071	Fix minor log in runner controller (#1175 ) Log is mentioning registration only but this is about the standard runner pod	2022-03-03 09:51:30 +09:00
Felipe Galindo Sanchez	27563c4378	Remove unused function (#1173 )	2022-03-03 09:02:47 +09:00
Felipe Galindo Sanchez	4a0f68bfe3	Cleanup extra block in runner controller (#1174 )	2022-03-03 09:01:34 +09:00
Yusuke Kuoka	a3072c110d	Prevent runnerset pod unregistration until it gets runner ID This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	11be6c1fb6	Prevent runner pod deletion delay when pod disappeared before unregistration	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	a6f0e0008f	Make unregistration timeout and retry delay configurable in integration tests	2022-02-20 12:05:34 +00:00
Yusuke Kuoka	79a31328a5	Stop recreating ephemeral runner pod Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384	2022-02-20 04:42:19 +00:00
Yusuke Kuoka	3c16188371	Introduce consistent timeouts for runner unregistration and runner pod deletion Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion, so that we are very much unlikely to terminate pods that are running any jobs.	2022-02-20 04:36:35 +00:00
Tingluo Huang	0b9bef2c08	Try to unconfig runner before deleting the pod to recreate (#1125 ) There is a race condition between ARC and GitHub service about deleting runner pod. - The ARC use REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod. - A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod. - The ARC delete the runner pod which cause the in-progress job to end up canceled. To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod. - `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, we should be safe to delete the pod. - `r.unregisterRunner` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is. TODO: I need to do some E2E tests to force the race condition to happen. Ref #911	2022-02-19 21:22:31 +09:00
Yusuke Kuoka	a5ed6bd263	Fix RunerSet managed runner pods to terminate more gracefully (#1126 ) Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460	2022-02-19 21:19:37 +09:00
Yusuke Kuoka	921f547200	fix: Do recreate runner pod on registration token update (#1087 ) Apparently, we've been missed taking an updated registration token into account when generating the pod template hash which is used to detect if the runner pod needs to be recreated. This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365	2022-02-19 21:18:00 +09:00
Daniel	8a73560dbc	if a Volume is defined by the operator don't add another "work" volume. (#1015 ) This allows providing a different `work` Volume. This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage. This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME. ``` apiVersion: actions.summerwind.dev/v1alpha1 kind: RunnerDeployment metadata: name: runner-2 spec: template: spec: dockerdContainerResources: {} env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name # this is to mount the docker in docker onto NVME disk dockerVolumeMounts: - mountPath: /var/lib/docker name: scratch subPathExpr: $(POD_NAME)-docker - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work volumeMounts: - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work dockerEnv: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name volumes: - hostPath: path: /mnt/disks/ssd0 name: scratch - hostPath: path: /mnt/disks/ssd0 name: work nodeSelector: cloud.google.com/gke-nodepool: runner-16-with-nvme ephemeral: false image: "" imagePullPolicy: Always labels: - runner-2 - self-hosted organization: yourorganization ```	2022-01-07 10:01:40 +09:00
Callum Tait	ad48851dc9	feat: expose if docker is enabled and wait for docker to be ready (#962 ) Resolves #897 Resolves #915	2021-12-29 10:23:35 +09:00
renovate[bot]	c64000e11c	fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 (#740 ) * fix(deps): update module sigs.k8s.io/controller-runtime to v0.11.0 * Fix dependencies and bump Go to 1.17 so that it builds after controller-runtime 0.11.0 upgrade * Regenerate manifests with the latest K8s dependencies Co-authored-by: Renovate Bot <bot@renovateapp.com> Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-17 09:06:55 +09:00
Felipe Galindo Sanchez	9bb21aef1f	Add support for default image pull secret name (#921 ) Resolves #896 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-15 09:29:31 +09:00
Pavel Smalenski	91102c8088	Add dockerEnv variable for RunnerDeployment (#912 ) Resolves #878 Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-14 17:13:24 +09:00
Felipe Galindo Sanchez	f0fccc020b	refactor: split Reconciler from Reconcile in a few methods (#926 ) Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2021-12-12 14:22:55 +09:00
Patrick Ellis	ea2dbc2807	Update go-github from v37 -> v39 (#925 )	2021-12-11 21:43:40 +09:00
Yusuke Kuoka	898ad3c355	Work-around for offline+busy runners (#993 ) Ref #911	2021-12-09 09:31:06 +09:00
apr-1985	0d3de9ee2a	chore: correct logging typo (#904 )	2021-10-22 09:03:23 +09:00
Maxim Pogozhiy	fce7d6d2a7	Add topologySpreadConstraints (#814 )	2021-10-17 21:49:44 +01:00
Tristan Keen	5e3f89bdc5	Correct test to append docker container (#837 ) Fixes #835	2021-09-24 09:18:20 +09:00

1 2 3

133 Commits