actions-runner-controller

Commit Graph

Author	SHA1	Message	Date
Callum Tait	a40793bb60	chore: bump chart to app 0.22.0 (#1232 ) * chore: bump chart to app 0.22.0	2022-03-16 07:57:30 +00:00
Callum Tait	48a7b78bf3	docs: remove runnerset limitation (#1225 ) This works great from testing now, this is no longer a limitation due to ARC now creating a statefulset per runner	2022-03-16 09:08:41 +09:00
renovate[bot]	6ff93eae95	chore(deps): update helm/chart-testing-action action to v2.2.1 (#1216 ) Co-authored-by: Renovate Bot <bot@renovateapp.com>	2022-03-15 18:51:54 +00:00
Yusuke Kuoka	b25a0fd606	Merge pull request #1217 from actions-runner-controller/docs/re-order docs: various changes in preparation for 0.22.0 release - Move RunnerSets to a more prominent location in the docs - Clean up a few bits - Highlight the deprecation and removal timeline for the `--once` flag - Rename the ephemeral runners section to something more logical (persistent runners). Static runners were an option; however, the word static is awkward as it's sort of tied up with autoscaling and the `Runner` kind, so Persistent was chosen instead. - Update upgrade docs to use `replace` instead of `apply`	2022-03-15 09:01:32 +09:00
toast-gear	3beef84f30	docs: better sentences	2022-03-14 12:43:07 +00:00
toast-gear	76cc758d12	docs: minor consistency change	2022-03-14 12:37:57 +00:00
toast-gear	c4c6e833a7	chore: add deprecation warning	2022-03-14 12:35:07 +00:00
toast-gear	ecf74e615e	docs: bump versions and upgrade instructions	2022-03-14 10:23:36 +00:00
toast-gear	bb19e85037	docs: various cleanups and re-orderings	2022-03-14 09:52:22 +00:00
Yusuke Kuoka	e7200f274d	Merge pull request #1214 from actions-runner-controller/fix-static-runners Fix runner{set,deployment} rollouts and static runner scaling I was testing static runners as a preparation to cut the next release of ARC, v0.22.0, and found several problems that I thought were worth fixing. In particular, this PR fixes static runner reliability issues in two ways. `c4b24f8366` fixes the issue that ARC gives up retrying RemoveRunner calls too early, especially on static runners, which resulted in static runners often getting terminated prematurely while running jobs. `791634fb12` fixes the issue that ARC was unable to scale up any static runners when the corresponding desired replicas number in e.g. RunnerDeployment gets updated. It was caused by a bug in the mechanism that is intended to prevent ephemeral runners from being recreated in unwanted circumstances, which was mistakenly triggered for static runners as well. Since #1179, RunnerDeployment was not able to gracefully terminate old RunnerReplicaSets on update. `c612e87` fixes that by changing RunnerDeployment to first scale old RunnerReplicaSet(s) down to zero, wait for sync, and set the deletion timestamp only after that. That way, RunnerDeployment can ensure that all the old RunnerReplicaSets that are being deleted are already scaled to zero, passing through the standard unregister-and-then-delete runner termination process. It revealed a hidden bug in #1179 that sometimes the scale-to-zero-before-runnerreplicaset-termination does not work as intended. `4551309` fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.	2022-03-13 22:19:26 +09:00
Yusuke Kuoka	1cc06e7408	e2e: Make enterprise runners optional for testing GitHub App As GitHub App does not allow ARC to access enterprise runner related API endpoints, like the create-registration-token API.	2022-03-13 13:11:26 +00:00
Yusuke Kuoka	4551309e30	Fix runners to not terminate before unregistration when scaling down #1179 was not working, particularly for scale-down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods being terminated before the requested unregistration processes completed, causing some in-progress workflow jobs to hang forever. This fixes an edge case in which a decreased desired replicas number triggered the failure, so that every runner is unregistered and then terminated, as originally designed.	2022-03-13 13:09:46 +00:00
Yusuke Kuoka	7123b18a47	chore: Log more variables when log level is -2	2022-03-13 13:04:28 +00:00
Yusuke Kuoka	cc55d0bd7d	Let runnerdeployment controller log runnerreplicaset creation	2022-03-13 12:25:53 +00:00
Yusuke Kuoka	c612e87d85	fix: Let RunnerDeployment scale RunnerReplicaSet to zero before terminating it so that hopefully RunnerDeployment can gracefully terminate older RunnerReplicaSet on update.	2022-03-13 12:18:22 +00:00
Yusuke Kuoka	326d6a1fe8	Fix the timing of `Marking owner for unregistration completion` log	2022-03-13 12:16:55 +00:00
Yusuke Kuoka	fa8ff70aa2	Add log when deletion timestamp is being set on owner object	2022-03-13 12:16:29 +00:00
Yusuke Kuoka	efb7fca308	Fix externally deleted runner pod to not block unregistration process	2022-03-13 12:15:49 +00:00
Yusuke Kuoka	e4280dcb0d	Fix patch MergeFrom target	2022-03-13 12:14:14 +00:00
Yusuke Kuoka	f153870f5f	fix: Do not block indefinitely on runner that cannot be deleted due to 403	2022-03-13 12:12:01 +00:00
Yusuke Kuoka	8ca39caff5	Fix log message on runner deletion	2022-03-13 12:11:11 +00:00
Yusuke Kuoka	791634fb12	Fix static runners not scaling up It turned out that #1179 broke static runners in a way that they were no longer able to scale up at all when the desired replicas number is updated. This fixes that by correcting a short-circuit that is intended only for ephemeral runners so that it is not mistakenly triggered for static runners.	2022-03-13 07:26:43 +00:00
Yusuke Kuoka	c4b24f8366	Prevent static runners from terminating due to unregister timeout The unregister timeout of 1 minute (no matter how long it is) can negatively impact the availability of a static runner that is constantly running workflow jobs, and of an ephemeral runner that runs a long-running job. We deal with that by completely removing the unregistration timeout, so that regardless of the type of runner (static or ephemeral), it waits forever until it is successfully unregistered before being terminated.	2022-03-13 07:26:36 +00:00
Yusuke Kuoka	a1c6d1d11a	doc: Add release note for 0.22.0 (#1199 ) As it turned out to be the biggest release ever, I was afraid I might not be able to write a summary of changes that communicates well. Here is my attempt. Please review and leave any comments so that we can be more confident in this release. Thank you!	2022-03-13 16:25:24 +09:00
Yusuke Kuoka	adc889ce8a	Fix RunnerDeployment to be able to finish rollout (#1213) I found that #1179 was unable to finish the rollout of a RunnerDeployment update (like a runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.	2022-03-13 10:10:24 +09:00
Yusuke Kuoka	b83db7be8f	Merge pull request #1212 from actions-runner-controller/fix-runnerdeploy-duplicate-envvars Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice	2022-03-12 23:27:45 +09:00
Yusuke Kuoka	da2adc0cc5	e2e: Omit RUNNER_FEATURE_FLAG_EPHEMERAL when TEST_FEATURE_FLAG_EPHEMERAL is not set	2022-03-12 14:08:23 +00:00
Yusuke Kuoka	fa287c4395	Fix RunnerDeployment-managed runner pods to not get RUNNER_NAME and RUNNER_TOKEN injected twice Since #1179, runner pods managed by RunnerDeployment had two duplicate environment variables for RUNNER_NAME and RUNNER_TOKEN. This fixes that.	2022-03-12 13:49:50 +00:00
Yusuke Kuoka	7c0340dea0	Merge pull request #1211 from actions-runner-controller/use-ephemeral-by-default Use --ephemeral by default Every runner is now --ephemeral by default. Note that this works by ARC setting the RUNNER_FEATURE_FLAG_EPHEMERAL envvar to true by default. Previously you had to explicitly set it to true, otherwise the runner was passed --once, which is known to be prone to various race conditions. It's worth noting that the very confusing and related configuration, ephemeral: true, which creates --once runners instead of static (or persistent) runners, had been the default for many months. So, this should be the only change needed to make every runner ephemeral without any explicit configuration. You can still fall back to static (persistent) runners by setting ephemeral: false, and to --once runners by setting RUNNER_FEATURE_FLAG_EPHEMERAL to "false". But I don't think there are many reasons to do so. Ref #1189	2022-03-12 22:47:38 +09:00
Yusuke Kuoka	c3dd1c5c05	e2e: Make TEST_FEATURE_FLAG_EPHEMERAL optional	2022-03-12 13:32:42 +00:00
Yusuke Kuoka	051089733b	Use --ephemeral by default Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189	2022-03-12 13:20:07 +00:00
Yusuke Kuoka	757e0a82a2	Merge pull request #1210 from actions-runner-controller/fix-github-api-cache-for-github-app-mode Fix GitHub API cache to work with GitHub App authentication	2022-03-12 21:17:25 +09:00
Yusuke Kuoka	83e550cde5	Experimental log level "-4" for logging every HTTP round-trip for GitHub API calls	2022-03-12 12:11:16 +00:00
Yusuke Kuoka	22ef7b3a71	acceptance,e2e: Fix deploy.sh and e2e_test.go for testing with GitHub App	2022-03-12 12:10:04 +00:00
Yusuke Kuoka	28fccbcecd	Fix GitHub API cache to work with GitHub App authentication The version of `bradleyfalzon/ghinstallation` which is used to enable GitHub App authentication turned out to add an extra header `application/vnd.github.machine-man-preview+json` to every HTTP request. That revealed an edge case in our HTTP cache layer `gregjones/httpcache` that causes it to not serve responses from cache when it should. There were two problems. One was that it does not support multi-valued headers and only looked at the first value of each header, and the other is that it does not support any http.RoundTripper implementation that modifies HTTP request headers in a RoundTrip function call. I fixed it in my fork of httpcache, which is hosted at https://github.com/actions-runner-controller/httpcache. The relevant commits are: - `70d975e77d` - `197a8a3546` This can be considered a follow-up for #1127, which turned out to have enabled the cache only for the case that ARC uses PAT for authentication. Since this fix, the cache is also enabled when ARC authenticates as a GitHub App.	2022-03-12 11:14:16 +00:00
Yusuke Kuoka	9628bb2937	Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204) Since #1127 and #1167, we had been retrying the `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy. There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls (one call per reconciliation loop, but the loop runs quite often due to how the controller works) once a call failed because the runner was in the middle of running a workflow job. This fixes that by adding a few short-circuit conditions that work for ephemeral runners. An ephemeral runner can unregister itself on completion, so in most cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter. The end result is that 422 errors will be observed at most once per graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before. Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271	2022-03-11 19:03:17 +09:00
Renovate Bot	736a53fed6	fix(deps): update golang.org/x/oauth2 commit hash to 6242fa9	2022-03-10 08:39:51 +09:00
yourmoonlight	132faa13a1	docs: fix the helm command for webhook installation (#1188 ) * fix doc for install the webhook server * modify cmd with single set && add double quote for zsh users	2022-03-08 17:59:01 +00:00
Callum Tait	66e070f798	docs: remove githubAPICacheDuration from docs (#1194 )	2022-03-08 13:27:30 +00:00
Yusuke Kuoka	55ff4de79a	Remove legacy GitHub API cache of HRA.Status.CachedEntries (#1192) * Remove legacy GitHub API cache of HRA.Status.CachedEntries We migrated to the transport-level cache introduced in #1127, so not only is this useless, it also makes it harder to deduce which cache resulted in the desired replicas number calculated by HRA. Just remove the legacy cache to keep it simple and easy to understand. * Deprecate githubAPICacheDuration helm chart value and the --github-api-cache-duration as well * Fix integration test	2022-03-08 19:05:43 +09:00
Yusuke Kuoka	301439b06a	chore: Change log ts format to RFC3339 (#1191) The TimeEncoder for zap seems to have been set to EpochTimeEncoder, which is the default, and it was not very readable. Changing it to a TimeEncoderOfLayout(time.RFC3339) for readability. Another benefit of doing this is that the ts format is now consistent with the various timestamps ARC puts into pod and other custom resource annotations.	2022-03-08 10:34:52 +09:00
Yusuke Kuoka	15ee6d6360	chore: Reorganize "Calculated desired replicas" log fields (#1190) So that `max` is emitted immediately after `min`, which is the counterpart of it.	2022-03-08 10:29:53 +09:00
Felipe Galindo Sanchez	5b899f578b	fix(chart): allow to use basic auth when authSecret.create is false (#1149) * fix(chart): allow to use basic auth when authSecret.create is false When the secret is created outside of the ARC chart using authSecret.create=false and basicAuth, the controller fails because we're not including the basic auth password as an environment variable, as the password value won't be inside the helm values. This PR includes both environment variables for consistency, regardless of whether those are set or not, similar to the rest of the auth options (e.g. app_id, private key, etc.) * chart: Add back the conditional block for .Values.authSecret.github_basicauth_username Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>	2022-03-07 10:07:24 +09:00
Yusuke Kuoka	d8c9eb7ba7	Fix arm64 image (#1185 ) Fixes #1184	2022-03-07 10:00:20 +09:00
Yusuke Kuoka	cbbc383a80	Auto-correct replicas number on missing workflow_job completion event (#1180) While testing #1179, I discovered that ARC sometimes stops resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods. This seems to happen when ARC missed receiving a workflow_job completion event, but it has no way to decide whether (1) something went wrong on ARC or (2) a loadbalancer in the middle, GitHub, or anything other than ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it. In this change, I added a hard-coded 10-minute timeout (can be made customizable later) to prevent runner pod recreation. Now, a RunnerReplicaSet/RunnerSet restarts runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrives after the timeout, it will decrease the desired replicas number, which results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.	2022-03-07 09:35:13 +09:00
seplak	b57e885a73	Fix service account typo in Helm README (#1183 ) Just fixing a typo I discovered while reading through the README.	2022-03-07 08:39:01 +09:00
Yusuke Kuoka	bed927052d	Merge pull request #1179 from actions-runner-controller/refactor-runner-and-runnerset Refactor Runner and RunnerSet so that they use the same library code that powers RunnerSet. RunnerSet is StatefulSet-based and RunnerReplicaSet/Runner is Pod-based, so it had been hard to unify the implementation although they look very similar in many aspects. This change finally resolves that issue, by first introducing a library that implements the generic logic that is used to reconcile RunnerSet, then adding an adapter that can be used to let the generic logic manage runner pods via Runner, instead of via StatefulSet. Follow-up for #1127, #1167, and #1178	2022-03-06 15:56:51 +09:00
Yusuke Kuoka	14a878bfae	refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet	2022-03-06 05:53:26 +00:00
Yusuke Kuoka	c95e84a528	refactor: Extract runner pod owner management out of runnerset controller so that it can potentially be reusable from runnerreplicaset controller	2022-03-05 12:18:02 +00:00
Yusuke Kuoka	95a5770d55	Fix regression that registration-timeout check was not working for runnerset (#1178 ) Follow-up for #1167	2022-03-05 19:31:05 +09:00
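
Commit 28fccbcecd above notes that the cache layer only looked at the first value of each header. The following is a minimal sketch of that distinction using only Go's standard net/http package. It is not the code from the httpcache fork; the helper names firstValue, allValues, and demo are hypothetical and exist only for illustration, and the first Accept value is an assumed example.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// firstValue returns only the first value stored for a header key,
// which is what http.Header.Get does.
func firstValue(h http.Header, key string) string {
	return h.Get(key)
}

// allValues joins every value stored for a header key, so that a
// comparison based on it takes multi-valued headers into account.
func allValues(h http.Header, key string) string {
	return strings.Join(h.Values(key), ", ")
}

// demo builds a request header with two Accept values (the second one is
// the value mentioned in the commit message) and returns both views of it.
func demo() (string, string) {
	h := http.Header{}
	h.Add("Accept", "application/vnd.github.v3+json") // assumed example value
	h.Add("Accept", "application/vnd.github.machine-man-preview+json")
	return firstValue(h, "Accept"), allValues(h, "Accept")
}

func main() {
	first, all := demo()
	fmt.Println("first value only:", first)
	fmt.Println("all values:      ", all)
}
```

A cache key or Vary comparison built on the first view cannot see the second Accept value at all, which is the kind of blind spot the commit message describes.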

763 Commits