Commit Graph

19 Commits

Author SHA1 Message Date
Yusuke Kuoka bdcde44642
chore: Bump go-github and minimum GHES version to 3.6 (#1747)
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1574
2022-08-24 13:08:40 +09:00
everpcpc fea1457f12
fix: annotate pod instead of label on UnregistrationFailureMessage (#1615)
The error message should go to annotation instead of pod label, since we only check annotations on autoscale:

https://github.com/actions-runner-controller/actions-runner-controller/blob/master/controllers/autoscaling.go#L338-L342

Then there is no need to truncate or normalize the error message.
2022-07-09 11:45:05 +09:00
Yusuke Kuoka 8161136cbd
Fix PercentageRunnersBusy scaling delay (#1579)
* Use a dedicated pod label to say it is a runner pod

Follow-up for #1546

* Fix PercentageRunnersBusy scaling delay

Ref #1374
2022-06-29 20:49:21 +09:00
Yusuke Kuoka ddd417f756
Bump go-github to v45 (#1573)
* Bump go-github to v45

Ref #1402

* fixup! Bump go-github to v45
2022-06-29 06:34:58 +09:00
Felipe Galindo Sanchez 78c01fd31d
cleanup some unused code and minor refactors (#1274)
Co-authored-by: Yusuke Kuoka <ykuoka@gmail.com>
2022-05-16 18:38:32 +09:00
Yusuke Kuoka e46b90f758
fix: runner pods managed by RunnerSet to not stuck in Terminating (#1420)
This is intended to fix #1369 mostly for RunnerSet-managed runner pods. It is "mostly" because this fix might work well for RunnerDeployment in cases that #1395 does not work, like in a case that the user explicitly set the runner pod restart policy to anything other than "Never".

Ref #1369
2022-05-12 09:34:27 +01:00
Yusuke Kuoka efb7fca308 Fix externally deleted runner pod to not block unregistration process 2022-03-13 12:15:49 +00:00
Yusuke Kuoka f153870f5f fix: Do not block indefinitely on runner that cannot be deleted due to 403 2022-03-13 12:12:01 +00:00
Yusuke Kuoka c4b24f8366 Prevent static runners from terminating due to unregister timeout
The unregister timeout of 1 minute (no matter how long it is) can negatively impact availability of static runner constantly running workflow jobs, and ephemeral runner that runs a long-running job.
We deal with that by completely removing the unregistaration timeout, so that regarldess of the type of runner(static or ephemeral) it waits forever until it successfully to get unregistered before being terminated.
2022-03-13 07:26:36 +00:00
Yusuke Kuoka 051089733b Use --ephemeral by default
Ref https://github.com/actions-runner-controller/actions-runner-controller/issues/1189
2022-03-12 13:20:07 +00:00
Yusuke Kuoka 9628bb2937
Prevent RemoveRunner spam on busy ephemeral runner scale down (#1204)
Since #1127 and #1167, we had been retrying `RemoveRunner` API call on each graceful runner stop attempt when the runner was still busy.
There was no reliable way to throttle the retry attempts. The combination of these resulted in ARC spamming RemoveRunner calls(one call per reconciliation loop but the loop runs quite often due to how the controller works) when it failed once due to that the runner is in the middle of running a workflow job.

This fixes that, by adding a few short-circuit conditions that would work for ephemeral runners. An ephemeral runner can unregister itself on completion so in most of cases ARC can just wait for the runner to stop if it's already running a job. As a RemoveRunner response of status 422 implies that the runner is running a job, we can use that as a trigger to start the runner stop waiter.

The end result is that 422 errors will be observed at most once per the whole graceful termination process of an ephemeral runner pod. RemoveRunner API calls are never retried for ephemeral runners. ARC consumes less GitHub API rate limit budget and logs are much cleaner than before.

Ref https://github.com/actions-runner-controller/actions-runner-controller/pull/1167#issuecomment-1064213271
2022-03-11 19:03:17 +09:00
Yusuke Kuoka c95e84a528 refactor: Extract runner pod owner management out of runnerset controller
so that it can potentially be reusable from runnerreplicaset controller
2022-03-05 12:18:02 +00:00
Yusuke Kuoka 95a5770d55
Fix regression that registration-timeout check was not working for runnerset (#1178)
Follow-up for #1167
2022-03-05 19:31:05 +09:00
Yusuke Kuoka 7f0e65cb73 refactor: Extract definitions of various annotation keys and other defaults to their own source 2022-03-02 19:03:20 +09:00
Yusuke Kuoka 12a04b7f38 Fix typo in comment 2022-03-02 19:03:20 +09:00
Yusuke Kuoka a3072c110d Prevent runnerset pod unregistration until it gets runner ID
This eliminates the race condition that results in the runner terminated prematurely when RunnerSet triggered unregistration of StatefulSet that added just a few seconds ago.
2022-03-02 19:03:20 +09:00
Yusuke Kuoka 11be6c1fb6 Prevent runner pod deletion delay when pod disappeared before unregistration 2022-03-02 19:03:20 +09:00
Yusuke Kuoka a6f0e0008f Make unregistration timeout and retry delay configurable in integration tests 2022-02-20 12:05:34 +00:00
Yusuke Kuoka 3c16188371 Introduce consistent timeouts for runner unregistration and runner pod deletion
Enhances runner controller and runner pod controller to have consistent timeouts for runner unregistration and runner pod deletion,
so that we are very much unlikely to terminate pods that are running any jobs.
2022-02-20 04:36:35 +00:00