Commit Graph

10 Commits

Author SHA1 Message Date
Yusuke Kuoka 4551309e30 Fix runners to not terminate before unregistration when scaling down
#1179 was not working particularly for scale down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods are terminated before the requested unregistration processes complete, that triggered some in-progress workflow jobs to hang forever. This fixes an edge-case that resulted in a decreased desired replicas to trigger the failure, so that every runner is unregistered then terminated, as originally designed.
2022-03-13 13:09:46 +00:00
Yusuke Kuoka 7123b18a47 chore: Log more variables when log level is -2 2022-03-13 13:04:28 +00:00
Yusuke Kuoka 326d6a1fe8 Fix the timing of `Marking owner for unregistration completion` log 2022-03-13 12:16:55 +00:00
Yusuke Kuoka fa8ff70aa2 Add log when deletion timestamp is being set on owner object 2022-03-13 12:16:29 +00:00
Yusuke Kuoka e4280dcb0d Fix patch MergeFrom target 2022-03-13 12:14:14 +00:00
Yusuke Kuoka 791634fb12 Fix static runners not scaling up
It turned out that #1179 broke static runners in a way it is no longer able to scale up at all when the desired replicas is updated.
This fixes that by correcting a certain short-circuit that is intended only for ephemeral runners to not mistakenly triggered for static runners.
2022-03-13 07:26:43 +00:00
Yusuke Kuoka adc889ce8a
Fix RunnerDeployment to be able to finish rollout (#1213)
I found that #1179 was unable to finish rollout of an RunnerDeployment update(like runner env update). It was able to create a new RunnerReplicaSet with the desired spec, but unable to tear down the older ones. This fixes that.
2022-03-13 10:10:24 +09:00
Yusuke Kuoka cbbc383a80
Auto-correct replicas number on missing webhook_job completion event (#1180)
While testing #1179, I discovered that ARC sometimes stop resyncing RunnerReplicaSet when the desired replicas is greater than the actual number of runner pods.
This seems to happen when ARC missed receiving a workflow_job completion event but it has no way to decide if it is either (1) something went wrong on ARC or (2) a loadbalancer in the middle or GitHub or anything not ARC went wrong. It needs a standard to decide it, or if it's not impossible, how to deal with it.

In this change, I added a hard-coded 10 minutes timeout(can be made customizable later) to prevent runner pod recreation.
Now, a RunnerReplicaSet/RunnerSet to restart runner pod recreation 10 minutes after the last scale-up. If the workflow completion event arrived after the timeout, it will decrease the desired replicas number that results in the removal of a runner pod. The removed runner pod might be deleted without ever being used, but I think that's better than leaving the desired replicas and the actual number of replicas diverged forever.
2022-03-07 09:35:13 +09:00
Yusuke Kuoka 14a878bfae refactor: Make RunnerReplicaSet and Runner backed by the same logic that backs RunnerSet 2022-03-06 05:53:26 +00:00
Yusuke Kuoka c95e84a528 refactor: Extract runner pod owner management out of runnerset controller
so that it can potentially be reusable from runnerreplicaset controller
2022-03-05 12:18:02 +00:00