With the current implementation, if a pod is deleted, the controller fails to delete the runner: it tries to annotate a pod that no longer exists, because we pass a new pod object that is not an existing resource on the API server.
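For illustration, here is a minimal Go sketch of the fetch-then-annotate pattern the fix implies, assuming a controller-runtime client; the function name and annotation key are hypothetical, not ARC's actual identifiers.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// annotateRunnerPod fetches the live pod before annotating it. Updating a
// freshly constructed Pod object fails because it is not an existing resource.
func annotateRunnerPod(ctx context.Context, c client.Client, name, namespace string) error {
	var pod corev1.Pod
	if err := c.Get(ctx, types.NamespacedName{Name: name, Namespace: namespace}, &pod); err != nil {
		if apierrors.IsNotFound(err) {
			// The pod is already gone; nothing to annotate, so proceed
			// with runner deletion instead of failing the reconcile.
			return nil
		}
		return err
	}

	if pod.Annotations == nil {
		pod.Annotations = map[string]string{}
	}
	// Hypothetical annotation key, for illustration only.
	pod.Annotations["example.com/unregistration-requested"] = "true"

	// Update the object we fetched, which carries a valid resourceVersion.
	return c.Update(ctx, &pod)
}
```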
* feat: copy dotfiles from asset to service dir
* Fixed `UNITTEST` Condition
* Load `/etc/environment`
See https://github.com/actions/runner/issues/1703 for context on this change.
docs: various changes in preparation for 0.22.0 release
- Move RunnerSets to a more prominent location in the docs
- Clean up a few bits
- Highlight the deprecation and removal timeline for the `--once` flag
- Renamed the ephemeral runners section to something more logical (persistent runners). "Static runners" was an option, but the word "static" is awkward because it is somewhat tied up with autoscaling and the `Runner` kind, so "persistent" was chosen instead.
- Update upgrade docs to use `replace` instead of `apply`
Fix runner{set,deployment} rollouts and static runner scaling
I was testing static runners in preparation for cutting the next ARC release, v0.22.0, and found several problems that I thought were worth fixing.
In particular, this PR fixes static runner reliability issues in two ways.
c4b24f8366 fixes the issue that ARC gave up retrying RemoveRunner calls too early, especially for static runners, which resulted in static runners often being terminated prematurely while running jobs.
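As an illustration of the intended retry behavior, here is a hedged Go sketch; `removeRunner`, `errRunnerBusy`, and the 30-second requeue interval are assumptions made for the example, not ARC's actual API or tuning.

```go
package sketch

import (
	"context"
	"errors"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// errRunnerBusy stands in for the "runner is still running a job" error the
// GitHub API reports; the real error shape in ARC differs.
var errRunnerBusy = errors.New("runner is busy")

// removeRunner is a hypothetical stand-in for ARC's RemoveRunner API call.
func removeRunner(ctx context.Context, runnerID int64) error {
	return errRunnerBusy // placeholder
}

// ensureRunnerRemoved keeps requeuing while the runner is busy instead of
// giving up early, so a static runner mid-job is never deleted prematurely.
func ensureRunnerRemoved(ctx context.Context, runnerID int64) (ctrl.Result, error) {
	switch err := removeRunner(ctx, runnerID); {
	case err == nil:
		return ctrl.Result{}, nil
	case errors.Is(err, errRunnerBusy):
		// Expected for static runners with an in-progress job: retry later
		// rather than proceeding to pod deletion.
		return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
	default:
		return ctrl.Result{}, err
	}
}
```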
791634fb12 fixes the issue that ARC was unable to scale up static runners when the desired replicas number in e.g. a RunnerDeployment was updated. It was caused by a bug in the mechanism intended to prevent ephemeral runners from being recreated in unwanted circumstances, which was mistakenly triggered for static runners as well.
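A tiny sketch of the intended gating, with hypothetical names (the real guard in ARC is more involved):

```go
package sketch

// shouldSkipRecreation applies the recreation guard only to ephemeral
// runners. The guard exists to stop a completed ephemeral runner from being
// recreated unintentionally; applying it to static runners blocked
// legitimate scale-ups.
func shouldSkipRecreation(ephemeral, alreadyCompleted bool) bool {
	return ephemeral && alreadyCompleted
}
```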
Since #1179, RunnerDeployment has not been able to gracefully terminate old RunnerReplicaSets on update. c612e87 fixes that by changing RunnerDeployment to first scale the old RunnerReplicaSet(s) down to zero, wait for the sync, and only then set the deletion timestamp. That way, RunnerDeployment can ensure that every old RunnerReplicaSet being deleted has already been scaled to zero, passing through the standard unregister-then-delete runner termination process.
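A minimal sketch of that ordering, using made-up types and a fixed requeue interval; ARC's actual RunnerDeployment reconciler is more involved.

```go
package sketch

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

type runnerReplicaSet struct {
	desiredReplicas int
	syncedReplicas  int // replicas observed after the last sync
}

// reconcileOldSets scales each outdated RunnerReplicaSet to zero, waits for
// the scale-down to be observed, and only then marks it for deletion, so
// every runner goes through the unregister-then-delete path.
func reconcileOldSets(old []*runnerReplicaSet, deleteSet func(*runnerReplicaSet) error) (ctrl.Result, error) {
	for _, rs := range old {
		if rs.desiredReplicas != 0 {
			rs.desiredReplicas = 0 // step 1: request scale to zero
			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
		}
		if rs.syncedReplicas != 0 {
			// step 2: scale-down requested but not yet synced; wait
			return ctrl.Result{RequeueAfter: 10 * time.Second}, nil
		}
		// step 3: fully scaled down, safe to set the deletion timestamp
		if err := deleteSet(rs); err != nil {
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{}, nil
}
```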
That fix revealed a hidden bug in #1179: the scale-to-zero-before-RunnerReplicaSet-termination step sometimes did not work as intended. 4551309 fixes that, so that RunnerDeployment can actually terminate old RunnerReplicaSets gracefully.
#1179 was not working particularly for scale-down of static (and perhaps long-running ephemeral) runners, which resulted in some runner pods being terminated before the requested unregistration process completed, causing some in-progress workflow jobs to hang forever. This fixes an edge case where a decrease in desired replicas triggered the failure, so that every runner is unregistered and then terminated, as originally designed.
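To illustrate the unregister-then-terminate ordering, here is a short sketch assuming a hypothetical completion annotation; ARC's real marker and flow differ.

```go
package sketch

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Hypothetical annotation key, for illustration only.
const unregistrationCompleteAnnotation = "example.com/unregistration-complete"

// terminateRunnerPod deletes the pod only after the runner has been
// unregistered from GitHub, so a decrease in desired replicas never orphans
// an in-progress workflow job.
func terminateRunnerPod(ctx context.Context, c client.Client, pod *corev1.Pod) (done bool, err error) {
	if pod.Annotations[unregistrationCompleteAnnotation] != "true" {
		// Unregistration still in flight; the caller should requeue
		// instead of deleting the pod now.
		return false, nil
	}
	return true, c.Delete(ctx, pod)
}
```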