actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Yusuke Kuoka	4551309e30	Fix runners to not terminate before unregistration when scaling down #1179 was not working, particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods being terminated before the requested unregistration processes completed, causing some in-progress workflow jobs to hang forever. This fixes an edge case in which a decreased desired replicas triggered the failure, so that every runner is unregistered and then terminated, as originally designed.	2022-03-13 13:09:46 +00:00
Yusuke Kuoka	7123b18a47	chore: Log more variables when log level is -2	2022-03-13 13:04:28 +00:00
Yusuke Kuoka	cc55d0bd7d	Let runnerdeployment controller log runnerreplicaset creation	2022-03-13 12:25:53 +00:00
Yusuke Kuoka	c612e87d85	fix: Let RunnerDeployment scale RunnerReplicaSet to zero before terminating it so that hopefully RunnerDeployment can gracefully terminate older RunnerReplicaSet on update.	2022-03-13 12:18:22 +00:00
Yusuke Kuoka	326d6a1fe8	Fix the timing of `Marking owner for unregistration completion` log	2022-03-13 12:16:55 +00:00
Yusuke Kuoka	fa8ff70aa2	Add log when deletion timestamp is being set on owner object	2022-03-13 12:16:29 +00:00
Yusuke Kuoka	efb7fca308	Fix externally deleted runner pod to not block unregistration process	2022-03-13 12:15:49 +00:00
Yusuke Kuoka	e4280dcb0d	Fix patch MergeFrom target	2022-03-13 12:14:14 +00:00
Yusuke Kuoka	f153870f5f	fix: Do not block indefinitely on runner that cannot be deleted due to 403	2022-03-13 12:12:01 +00:00
Yusuke Kuoka	8ca39caff5	Fix log message on runner deletion	2022-03-13 12:11:11 +00:00
Yusuke Kuoka	791634fb12	Fix static runners not scaling up It turned out that #1179 broke static runners so that they can no longer scale up at all when the desired replicas is updated. This fixes that by correcting a short-circuit intended only for ephemeral runners so that it is not mistakenly triggered for static runners.	2022-03-13 07:26:43 +00:00
Yusuke Kuoka	c4b24f8366	Prevent static runners from terminating due to unregister timeout The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of a static runner constantly running workflow jobs, and of an ephemeral runner that runs a long-running job. We deal with that by completely removing the unregistration timeout, so that regardless of the type of runner (static or ephemeral) it waits forever until it is successfully unregistered before being terminated.	2022-03-13 07:26:36 +00:00
Yusuke Kuoka	adc889ce8a	Fix RunnerDeployment to be able to finish rollout (#1213 ) I found that #1179 was unable to finish rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.	2022-03-13 10:10:24 +09:00
Yusuke Kuoka	fa287c4395	Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice Since #1179, runner pods managed by RunnerDeployment had two duplicate environment variables for RUNNER_NAME and RUNNER_TOKEN. This fixes that.	2022-03-12 13:49:50 +00:00
Yusuke Kuoka	051089733b	Use --ephemeral by default Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189	2022-03-12 13:20:07 +00:00
Yusuke Kuoka	9628bb2937	Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204 ) Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy. There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, but the loop runs quite often due to how the controller works) when it failed once because the runner was in the middle of running a workflow job. This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter. The end result is that 422 errors will be observed at most once per graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271	2022-03-11 19:03:17 +09:00
Yusuke Kuoka	55ff4de79a	Remove legacy GitHub API cache of HRA.Status.CachedEntries (#1192 ) * Remove legacy GitHub API cache of HRA.Status.CachedEntries We migrated to the transport-level cache introduced in #1127 so not only this is useless, it is harder to deduce which cache resulted in the desired replicas number calculated by HRA. Just remove the legacy cache to keep it simple and easy to understand. * Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well * Fix integration test	2022-03-08 19:05:43 +09:00
Yusuke Kuoka	15ee6d6360	chore: Reorganize "Calculated desired replicas" log fields (#1190 ) So that `max` is emitted immediately after `min`, which is the counterpart of it.	2022-03-08 10:29:53 +09:00
Yusuke Kuoka	cbbc383a80	Auto-correct replicas number on missing workflow_job completion event (#1180 ) While testing #1179, I discovered that ARC sometimes stops resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods. This seems to happen when ARC missed receiving a workflow_job completion event, but it has no way to decide whether (1) something went wrong on ARC or (2) a load balancer in the middle, GitHub, or anything other than ARC went wrong. It needs a standard to decide it, or, if that is not possible, a way to deal with it. In this change, I added a hard-coded 10-minute timeout (can be made customizable later) to prevent runner pod recreation. Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it will decrease the desired replicas number, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.	2022-03-07 09:35:13 +09:00
Yusuke Kuoka	14a878bfae	refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet	2022-03-06 05:53:26 +00:00
Yusuke Kuoka	c95e84a528	refactor: Extract runner pod owner management out of runnerset controller so that it can potentially be reusable from runnerreplicaset controller	2022-03-05 12:18:02 +00:00
Yusuke Kuoka	95a5770d55	Fix regression that registration-timeout check was not working for runnerset (#1178 ) Follow-up for #1167	2022-03-05 19:31:05 +09:00
Yusuke Kuoka	9cc9f8c182	chore: Add a few comments to runnerset and runnerpod controllers to help potential contributors	2022-03-05 05:41:56 +00:00
Yusuke Kuoka	138e326705	chore: Add comment on lastSyncTime in runnerset controller	2022-03-05 05:41:56 +00:00
Yusuke Kuoka	5f2b5327f7	integration: Reduce error logs to ease debugging	2022-03-03 18:47:54 +09:00
Felipe Galindo Sanchez	d20ad71071	Fix minor log in runner controller (#1175 ) Log is mentioning registration only but this is about the standard runner pod	2022-03-03 09:51:30 +09:00
Felipe Galindo Sanchez	27563c4378	Remove unused function (#1173 )	2022-03-03 09:02:47 +09:00
Felipe Galindo Sanchez	4a0f68bfe3	Cleanup extra block in runner controller (#1174 )	2022-03-03 09:01:34 +09:00
Yusuke Kuoka	1917cf90c4	chore: Tweak runner-id annotation name and the annotation prefix to be more consistent	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	0ba3cad6c2	fix: Prefix runner pod related annotation keys with `actions/` to make them distinguishable from other annotations	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	7f0e65cb73	refactor: Extract definitions of various annotation keys and other defaults to their own source	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	12a04b7f38	Fix typo in comment	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	a3072c110d	Prevent runnerset pod unregistration until it gets runner ID This eliminates the race condition that results in the runner being terminated prematurely when RunnerSet triggers unregistration of a StatefulSet that was added just a few seconds ago.	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	15b402bb32	Make RunnerSet much more reliable with or without webhook	2022-03-02 19:03:20 +09:00
Yusuke Kuoka	11be6c1fb6	Prevent runner pod deletion delay when pod disappeared before unregistration	2022-03-02 19:03:20 +09:00
Felipe Galindo Sanchez	eff0c7364f	Merge branch 'master' into improve-logs	2022-02-28 09:25:30 -08:00
Felipe Galindo Sanchez	3abecd0f19	logging: improve logs for scaling	2022-02-23 08:29:13 -08:00
Yusuke Kuoka	5bc16f2619	Enhance HRA capacity reservation update log	2022-02-21 00:06:26 +00:00
Yusuke Kuoka	b8e65aa857	Prevent unnecessary ephemeral runner recreations	2022-02-20 13:45:42 +00:00
Yusuke Kuoka	a6f0e0008f	Make unregistration timeout and retry delay configurable in integration tests	2022-02-20 12:05:34 +00:00
Yusuke Kuoka	79a31328a5	Stop recreating ephemeral runner pod Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1046161384	2022-02-20 04:42:19 +00:00
Yusuke Kuoka	3c16188371	Introduce consistent timeouts for runner unregistration and runner pod deletion Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion, so that we are very unlikely to terminate pods that are running any jobs.	2022-02-20 04:36:35 +00:00
Tingluo Huang	0b9bef2c08	Try to unconfig runner before deleting the pod to recreate (#1125 ) There is a race condition between ARC and the GitHub service when deleting a runner pod. - ARC uses the REST API to find a particular runner in a pod that is not running any jobs, so it decides to delete the pod. - A job is queued on the GitHub service side, and it sends the job to this idle runner right before ARC deletes the pod. - ARC deletes the runner pod, which causes the in-progress job to end up canceled. To avoid this race condition, I am calling `r.unregisterRunner()` before deleting the pod. - `r.unregisterRunner()` will return 204 to indicate the runner is deleted from the GitHub service, so we should be safe to delete the pod. - `r.unregisterRunner()` will return 400 to indicate the runner is still running a job, so we will leave this runner pod as it is. TODO: I need to do some E2E tests to force the race condition to happen. Ref #911	2022-02-19 21:22:31 +09:00
Yusuke Kuoka	a5ed6bd263	Fix RunnerSet managed runner pods to terminate more gracefully (#1126 ) Make RunnerSet-managed runners as reliable as RunnerDeployment-managed runners. Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/911#issuecomment-1042404460	2022-02-19 21:19:37 +09:00
Yusuke Kuoka	921f547200	fix: Do recreate runner pod on registration token update (#1087 ) Apparently, we've missed taking an updated registration token into account when generating the pod template hash, which is used to detect if the runner pod needs to be recreated. This shouldn't have been the end of the world since the runner pod is recreated on the next reconciliation loop anyway, but this change will make the pod recreation happen one reconciliation loop earlier so that you're less likely to get runner pods with outdated refresh tokens. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1085#issuecomment-1027433365	2022-02-19 21:18:00 +09:00
Yusuke Kuoka	fcf4778bac	Fix regression that prevented default organizational runner group from being scale target Fixes #1131	2022-02-19 14:43:41 +09:00
Yusuke Kuoka	e22d981d58	githubwebhookserver: Tweak log levels of various messages (#1123 ) Some logs like `HRA keys indexed for HRA` were so excessive that they made testing and debugging the githubwebhookserver harder. This tries to fix that.	2022-02-17 09:15:26 +09:00
Felipe Galindo Sanchez	d0d316252e	Option to consider runner group visibility on scale based on webhook (#1062 ) This will work on GHES but not GitHub Enterprise Cloud due to the excessive GitHub API calls required. More work is needed, like adding a cache layer to the GitHub client, to make it usable on GitHub Enterprise Cloud. Fixes additional cases from https://github.com/actions-runner-controller/actions-runner-controller/pull/1012 If GitHub auth is provided in the webhooks controller then runner groups with custom visibility are supported. Otherwise, all runner groups will be assumed to be visible to all repositories. `getScaleUpTargetWithFunction()` will check if there is an HRA available with the following flow: 1. Search for repository HRAs - if so it ends here 2. Get available HRAs in k8s 3. Compute visible runner groups a. If GitHub auth is provided - get all the runner groups that are visible to the repository of the incoming webhook using GitHub API calls. b. If GitHub auth is not provided - assume all runner groups are visible to all repositories 4. Search for default organization runners (a.k.a runners from organization's visible default runner group) with matching labels 5. Search for default enterprise runners (a.k.a runners from enterprise's visible default runner group) with matching labels 6. Search for custom organization runner groups with matching labels 7. Search for custom enterprise runner groups with matching labels Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-02-16 19:08:56 +09:00
Daniel	8a73560dbc	if a Volume is defined by the operator don't add another "work" volume. (#1015 ) This allows providing a different `work` Volume. This should be a cloud agnostic way of allowing the operator to use (for example) NVME backed storage. This is a working example where the workDir will use the provided volume, additionally here docker is placed on the same NVME. ``` apiVersion: actions.summerwind.dev/v1alpha1 kind: RunnerDeployment metadata: name: runner-2 spec: template: spec: dockerdContainerResources: {} env: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name # this is to mount the docker in docker onto NVME disk dockerVolumeMounts: - mountPath: /var/lib/docker name: scratch subPathExpr: $(POD_NAME)-docker - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work volumeMounts: - mountPath: /runner/_work name: work subPathExpr: $(POD_NAME)-work dockerEnv: - name: POD_NAME valueFrom: fieldRef: fieldPath: metadata.name volumes: - hostPath: path: /mnt/disks/ssd0 name: scratch - hostPath: path: /mnt/disks/ssd0 name: work nodeSelector: cloud.google.com/gke-nodepool: runner-16-with-nvme ephemeral: false image: "" imagePullPolicy: Always labels: - runner-2 - self-hosted organization: yourorganization ```	2022-01-07 10:01:40 +09:00
Yusuke Kuoka	01301d3ce8	Stop creating registration-only runners on scale-to-zero (#1028 ) Resolves #859	2022-01-07 09:56:21 +09:00

223 Commits
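
The scale-down handling described in commits 0b9bef2c08 and 9628bb2937 can be read as a small decision table keyed on the unregistration response. The following is a minimal Python sketch of that reading, not ARC's actual Go code: the `Action` enum, the `next_action` function, and the "retry later" fallback for every other case are hypothetical names and assumptions introduced for illustration. Only two facts come from the commit messages: a 204 response means the runner has been removed from GitHub, and a 422 response on an ephemeral runner means it is busy, so ARC waits for it to stop instead of retrying.

```python
from enum import Enum


class Action(Enum):
    """Hypothetical outcomes of one unregistration attempt."""
    DELETE_POD = "delete_pod"                       # runner is gone from GitHub
    WAIT_FOR_RUNNER_STOP = "wait_for_runner_stop"   # busy ephemeral runner
    RETRY_LATER = "retry_later"                     # assumed fallback


def next_action(status_code: int, ephemeral: bool) -> Action:
    """Pick the next step from a RemoveRunner-style HTTP status code."""
    if status_code == 204:
        # Per 0b9bef2c08: 204 means the runner was deleted from the
        # GitHub service, so deleting the pod should be safe.
        return Action.DELETE_POD
    if status_code == 422 and ephemeral:
        # Per 9628bb2937: 422 implies the runner is running a job. An
        # ephemeral runner unregisters itself on completion, so wait for
        # it to stop and do not call RemoveRunner again.
        return Action.WAIT_FOR_RUNNER_STOP
    # Assumption: anything else is retried on a later reconciliation.
    return Action.RETRY_LATER


if __name__ == "__main__":
    for code, ephemeral in [(204, True), (422, True), (422, False)]:
        print(code, ephemeral, next_action(code, ephemeral).value)
```

Keeping the decision in a pure function like this makes the "at most one 422 per graceful termination" property easy to check in isolation, since the caller only has to remember that the waiter has already been started.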