Fix minor typos in 0.27.md (#2171)

James Bradshaw 2023-01-17 15:38:42 -07:00 committed by GitHub
parent 211bacaf1e
commit 23fdca4786
1 changed file with 9 additions and 9 deletions


@@ -8,13 +8,13 @@ This log documents breaking changes and major enhancements
## Upgrading
In case you're using our Helm chart to deploy ARC, use the chart 0.22.0 or greater ([current](https://github.com/actions/actions-runner-controller/blob/master/charts/actions-runner-controller/Chart.yaml#L18)). Don't miss upgrading CRDs as usual! Helm doesn't upgrade CRDs.
## BREAKING CHANGE : `workflow_job` became ARC's only supported webhook event as the scale trigger.
In this release, we've removed support for the legacy `check_run`, `push`, and `pull_request` webhook events in favor of `workflow_job`, which was released a year ago. Since then, it has served all the use-cases that were formerly (and only partially) supported by the legacy events, so we should be ready to fully migrate to `workflow_job`.
Anyone who's still using legacy webhook events should see `HorizontalRunnerAutoscaler` specs that look similar to the following examples:
```yaml
kind: HorizontalRunnerAutoscaler
# ... (the rest of the legacy examples is trimmed in this hunk)
```
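The hunk above is truncated here, but for comparison, here is a minimal sketch of the `workflow_job`-based trigger such specs migrate to. It is illustrative only: the `example-autoscaler` and `example-runner-deployment` names and the replica/duration numbers are not from this diff, so adapt them to your setup.

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: HorizontalRunnerAutoscaler
metadata:
  name: example-autoscaler          # hypothetical name
spec:
  scaleTargetRef:
    name: example-runner-deployment # hypothetical RunnerDeployment to scale
  minReplicas: 0
  maxReplicas: 10
  scaleUpTriggers:
  # Scale up on each workflow_job webhook event, then let the added
  # capacity expire after the duration elapses.
  - githubEvent:
      workflowJob: {}
    amount: 1
    duration: "30m"
```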
@@ -54,21 +54,21 @@ Relevant PR(s): #2001
## Fix : Runner pods should work more reliably with cluster-autoscaler
We've fixed many edge-cases in the runner pod termination process that seem to have caused various issues: pods stuck in Terminating, workflow jobs hanging for 10 minutes or so when an external controller like cluster-autoscaler tried to terminate a runner pod that was still running a workflow job, workflow jobs failing because a job container step couldn't access the docker daemon, and so on.
Do note that you need to set an appropriate `RUNNER_GRACEFUL_STOP_TIMEOUT` on both the `docker` sidecar container and the `runner` container specs so that the runner waits long enough for your use-case.
`RUNNER_GRACEFUL_STOP_TIMEOUT` is basically the longest time the runner stop process waits for the runner agent to stop gracefully.
It's set to `RUNNER_GRACEFUL_STOP_TIMEOUT=15` by default, which might be too short for many use-cases.
For example, if you're using AWS Spot Instances to power the nodes for runner pods, they give you 2 minutes at the longest. You'd want to set the graceful stop timeout slightly shorter than those 2 minutes, like `110` or `100` seconds, depending on how much cpu, memory, and storage your runner pod is given.
With rich cpu/memory/storage/network resources, the runner agent could stop gracefully well within 10 seconds, making `110` the right setting. With fewer resources, the runner agent could take more than 10 seconds to stop gracefully. If you think it would take 20 seconds for your environment, `100` would be the right setting.
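For illustration only, here is a minimal sketch of what that could look like on a `RunnerDeployment`. The names, repository, and numbers are hypothetical, and the `dockerEnv` and `terminationGracePeriodSeconds` fields are assumptions about what your chart/CRD version exposes, so check your CRD before copying this:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy        # hypothetical name
spec:
  template:
    spec:
      repository: example/myrepo    # hypothetical repository
      # How long Kubernetes waits before force-killing the pod; keep it a bit
      # longer than RUNNER_GRACEFUL_STOP_TIMEOUT so the stop hooks can finish.
      terminationGracePeriodSeconds: 110
      env:
      # Give the runner agent up to 100 seconds to finish the running job.
      - name: RUNNER_GRACEFUL_STOP_TIMEOUT
        value: "100"
      dockerEnv:
      # The docker sidecar should wait at least as long as the runner container.
      - name: RUNNER_GRACEFUL_STOP_TIMEOUT
        value: "100"
```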
`RUNNER_GRACEFUL_STOP_TIMEOUT` is designed to let the runner stop process take as long as possible, to avoid cancelling the workflow job in the middle of processing, while still preventing the workflow job from getting stuck for 10 minutes because the node disappeared before the runner agent could cancel the job.
Under the hood, `RUNNER_GRACEFUL_STOP_TIMEOUT` works by instructing the [runner's signal handler](https://github.com/actions-runner-controller/actions-runner-controller/blob/master/runner/graceful-stop.sh#L7) to delay forwarding the `SIGTERM` sent by Kubernetes on pod termination down to the runner agent. The runner agent is supposed to cancel the workflow job only on `SIGTERM`, so making this delay longer lets you postpone cancelling the workflow job, which gives the runner a longer grace period to stop. Practically, the runner pod stops gracefully only when the workflow job running within the runner pod completes before the runner graceful stop timeout elapses. The timeout can't be forever in practice, although that might theoretically be possible depending on your cluster environment. AWS Spot Instances, again for example, give you 2 minutes to gracefully stop the whole node, and therefore `RUNNER_GRACEFUL_STOP_TIMEOUT` can't be longer than that.
If you have success stories with the new `RUNNER_GRACEFUL_STOP_TIMEOUT`, please don't hesitate to create a `Show and Tell` discussion in our GitHub Discussions to share what configuration worked in which environment, including the name of your cloud provider, the name of the managed Kubernetes service, and the graceful stop timeouts for the nodes (defined and provided by the provider or the service) and the runner pods (`RUNNER_GRACEFUL_STOP_TIMEOUT`).
@@ -76,11 +76,11 @@ Relevant PR(s): #1759, #1851, #1855
## ENHANCEMENT : More reliable and customizable "wait-for-docker" feature
You can now add a `WAIT_FOR_DOCKER_SECONDS` envvar to the `runner` container of the runner pod spec to customize how long you want the runner startup script to wait until the docker daemon gets up and running. Previously this was hard-coded to 120 seconds, which wasn't sufficient in some environments.
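As an illustration, a minimal sketch of setting it on a `RunnerDeployment` (the name, repository, and value below are hypothetical, not from this diff):

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy        # hypothetical name
spec:
  template:
    spec:
      repository: example/myrepo    # hypothetical repository
      env:
      # Wait up to 3 minutes for the docker sidecar before giving up.
      - name: WAIT_FOR_DOCKER_SECONDS
        value: "180"
```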
Along with the enhancement, we also fixed a bug in the runner startup script where it didn't exit immediately on the docker startup timeout.
The bug resulted in job container steps failing due to a missing docker socket. Ideally, the whole runner pod should have kept auto-restarting until you got a fully working runner pod with a working runner agent plus a docker daemon that started within the timeout, so you should never have seen a job step fail due to a docker issue.
We fixed it so that it should work as intended now.
Relevant PR(s): #1999