Add release note for ARC 0.27.0 (#2068)

This commit is contained in:
Yusuke Kuoka 2023-01-15 19:56:13 +09:00 committed by GitHub
parent 044c8ad4d5
commit 2e406e3aef
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
1 changed files with 132 additions and 0 deletions

132
docs/releasenotes/0.27.md Normal file
View File

@ -0,0 +1,132 @@
# actions-runner-controller v0.27.0
All planned changes in this release can be found in the milestone https://github.com/actions-runner-controller/actions-runner-controller/milestone/10.
Also see https://github.com/actions-runner-controller/actions-runner-controller/compare/v0.26.0...v0.27.0 for full changelog.
This log documents breaking changes and major enhancements
## Upgrading
In case you're using our Helm chart to deploy ARC, use the chart 0.21.0 or greater. Don't miss upgrading CRDs as usual! Helm doesn't upgrade CRDs.
## BREAKING CHANGE : `workflow_job` became ARC's only supported webhook event as the scale trigger.
In this release, we've removed support for legacy `check_run`, `push`, and `pull_request` webhook events, in favor of `workflow_job` that has been released a year ago. Since then, it served all the use-cases formely and partially supported by the legacy events, and we should be ready to fully migrate to `workflow_job`.
Anyone who's still using legacy webook events should see `HorizontalRunnerAutoscaler` specs that look similar to the following examples:
```yaml
kind: HorizontalRunnerAutoscaler
spec:
scaleUpTriggers:
- githubEvent:
push: {}
```
```yaml
kind: HorizontalRunnerAutoscaler
spec:
scaleUpTriggers:
- githubEvent:
checkRun: {}
```
```yaml
kind: HorizontalRunnerAutoscaler
spec:
scaleUpTriggers:
- githubEvent:
pullRequest: {}
```
You need to update the spec to look like the below, along with enabling the `Workflow Job` events(and disabling unneeded `Push`, `Check Run`, and `Pull Request` evenst) on your webhook setting page on GitHub.
```yaml
kind: HorizontalRunnerAutoscaler
spec:
scaleUpTriggers:
- githubEvent:
workflowJob: {}
```
Relevant PR(s): #2001
## Fix : Runner pods should work more reliably with cluster-autoscaler
We've fixed many edge-cases in the runner pod termination process which seem to have resulted in various issues, like pods stuck in Terminating, workflow jobs being stuck for 10 minutes or so when an external controller like cluster-autoscaler tried to terminate the runner pod that is still running a workflow job, a workflow job fails due to a job container step being unable access the docker daemon, and so on.
Do note that you need to set appropariate `RUNNER_GRACEFUL_STOP_TIMEOUT` for both the `docker` sidecar container and the `runner` container specs to let it wait for long and sufficient time for your use-case.
`RUNNER_GRACEFUL_STOP_TIMEOUT` is basically the longest time the runner stop process to wait until the runner agent to gracefully stop.
It's set to `RUNNER_GRACEFUL_STOP_TIMEOUT=15` by default, which might be too short for any use-cases.
For example, in case you're using AWS Spot Instances to power nodes for runner pods, it gives you 2 minutes at the longest. You'd want to set the graceful stop timeout slightly shorter than the 2 minutes, like `110` or `100` seconds depending how much cpu, memory and storage your runner pod is provided.
With rich cpu/memory/storage/network resources, the runner agent could stop gracefully well within 10 seconds, making `110` the right setting. With fewer resources, the runner agent could take more than 10 seconds to stop gracefully. If you think it would take 20 seconds for your environment, `100` would be the right setting.
`RUNNER_GRACEFUL_STOP_TIMEOUT` is designed to be used to let the runner stop process as long as possible to avoid cancelling the workflow job in the middle of processing, yet avoiding the workflow job to stuck for 10 minutes due to the node disappear before the runner agent cancelling the job.
Under the hood, `RUNNER_GRACEFUL_STOP_TIMEOUT` works by instructing [runner's signal handler](https://github.com/actions-runner-controller/actions-runner-controller/blob/master/runner/graceful-stop.sh#L7) to delay forwarding `SIGTERM` sent by Kubernetes on pod terminatino down to the runner agent. The runner agent is supposed to cancel the workflow job only on `SIGTERM` so making this delay longer allows you to delay cancelling the workfow job, which results in a more graceful period to stop the runner. Practically, the runner pod stops gracefully only when the workflow job running within the runner pod has completed before the runner graceful stop timeout elapses. The timeout can't be forever in practice, although it might theoretically possible depending on your cluster environment. AWS Spot Instances, again for example, gives you 2 minutes to gracefully stop the whole node, and therefore `RUNNER_GRACEFUL_STOP_TIMEOUT` can't be longer than that.
If you have success stories with the new `RUNNER_GRACEFUL_STOP_TIMEOUT`, please don't hesitate to create a `Show and Tell` discussion in our GitHub Discussions to share what configuration worked on which environment, including the name of your cloud provider, the name of managed Kubernetes service, the graceful stop timeout for nodes(defined and provided by the provider or the service) and the runner pods (`RUNNER_GRACEFUL_STOP_TIMEOUT`).
Relevant PR(s): #1759, #1851, #1855
## ENHANCEMENT : More reliable and customizable "wait-for-docker" feature
You can now add a `WAIT_FOR_DOCKER_SECONDS` envvar to the `runner` container of the runner pod spec to customize how long you want the runner startup script to wait until the docker daemon gets up and running. Previously this has been hard-coded to 120 seconds and it wasn't sufficient in some environments.
Along with the enhancement, we also fixed a bug in the runner startup script that it didn't exit immediately on the docker startup timeout.
The bug resulted in that you see a job container step failing due to missing docker socket. Ideally it should have kept auto-restarting the whole runner pod until you get a fully working runner pod with the working runner agent plus the docker daemon (that started within the timeout), and therefore you should have never seen the job step failing due to docker issue.
We fixed it so it should work as intended now.
Relvant PR(s): #1999
## ENHANCEMENT : New webhook and metrics server for monitoring workflow jobs
**This feature is 99% authored and contributed by @ColinHeathman. Big kudos to Colin for his awesome work! **
You can now use the new `actions-metrics-server` to expose additional GitHub webhook endpoint for receiving `workflow_job` events and calculating and collecting various metrics related to the jobs. Please see the updated chart documentation for how to enable it.
We made it a separate component instead of adding the new metrics collector to our existing `github-webhook-server` to retain the ability to scale the `github-webhook-server` to two or more replicas for availability and scalability.
Also note that `actions-metrics-server` cannot be scaled to 2 or more replicas today.
That's because it needs to store it's state somewhere to retain the `workflow_job` webhook event until it receives the corresponding webhook event to finally calculate the metric value, and the only supported state store is in-memory as of today.
For exmaple, it needs to save `workflow_job` of `status=queued` until it receives the corresponding `workflow_job` of `status=in_progress` to finally calculate the queue duration metric value.
We may add another state store that is backed by e.g. Memcached or Redis if there's enough demand. But we opted to not complicate ARC for now. You can follow the relevant discussion in [this thread](https://github.com/actions-runner-controller/actions-runner-controller/pull/1814#discussion_r974758924).
Relvant PR(s): #1814, #2057
## New runner images based on Ubuntu 22.04
We started publishing new runner images based on Ubuntu 22.04 with the following tags:
```
summerwind/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04
summerwind/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04-$COMMIT_ID
summerwind/actions-runner-dind-rootless:ubuntu-22.04-latest
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04-$COMMIT_ID
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:ubuntu-22.04-latest
```
The `latest` tags for the runner images will stick with Ubuntu 20.04 for a while. We'll try to submit an issue or a discussion for notice before switching the latest to 22.04. See [this thread](https://github.com/actions/actions-runner-controller/pull/2036#discussion_r1032856803) for more context.
Note that we took this chance to slim down the runner images for more security, maintainability, and extensibility. That said, some packages that are present by default in hosted runners but can easily be installed using `setup-` actions (like `python` using the `setup-python` action) and other convenient but not strictly necessary packages like `ftp`, `telnet`, `upx` and so on are no longer installed onto our 22.04 based runners. Consult below Dockerfile parts and add some `setup-` actions to your workflows or build your own custom runner image(s) based on our new 22.04 images, in case you relied on some packages present in our 20.04 images but not in our 22.04 images:
- [20.04 runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner.ubuntu-20.04.dockerfile#L17-L51)
- [22.04 runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner.ubuntu-22.04.dockerfile#L15-L28)
- [20.04 dind-runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner-dind.ubuntu-20.04.dockerfile#L17-L51)
- [22.04 dind-runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner-dind.ubuntu-22.04.dockerfile#L15-L30)
- [20.04 rootless-dind-runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner-dind-rootless.ubuntu-20.04.dockerfile#L19-L54)
- [22.04 rootless-dind-runner](https://github.com/actions/actions-runner-controller/blob/master/runner/actions-runner-dind-rootless.ubuntu-22.04.dockerfile#L18-L33)
These images are not strictly tied to the v0.27.0 release. You can freely try the new images with ARC v0.26.0, or use both 20.04 and 22.04 based images with ARC v0.27.0.
Relevant PR(s): #1924, #2030, #2033, #2036, #2050, #2078, #2079, #2080, #2098