actions-runner-controller v0.27.0
All planned changes in this release can be found in the milestone https://github.com/actions-runner-controller/actions-runner-controller/milestone/10.
Also see https://github.com/actions-runner-controller/actions-runner-controller/compare/v0.26.0...v0.27.0 for full changelog.
This log documents breaking changes and major enhancements.
Upgrading
If you're using our Helm chart to deploy ARC, use chart version 0.22.0 or greater (the current version). As usual, don't forget to upgrade the CRDs yourself, since Helm doesn't upgrade CRDs.
BREAKING CHANGE: workflow_job is now ARC's only supported webhook event for scale triggers.
In this release, we've removed support for the legacy check_run, push, and pull_request webhook events in favor of workflow_job, which was released a year ago. Since then, it has served all the use-cases that were formerly and only partially supported by the legacy events, so we should be ready to fully migrate to workflow_job.
Anyone still using the legacy webhook events will have HorizontalRunnerAutoscaler specs that look similar to the following examples:
kind: HorizontalRunnerAutoscaler
spec:
  scaleUpTriggers:
  - githubEvent:
      push: {}

kind: HorizontalRunnerAutoscaler
spec:
  scaleUpTriggers:
  - githubEvent:
      checkRun: {}

kind: HorizontalRunnerAutoscaler
spec:
  scaleUpTriggers:
  - githubEvent:
      pullRequest: {}
You need to update the spec to look like the one below, and enable the Workflow Job event (and disable the unneeded Push, Check Run, and Pull Request events) on your webhook settings page on GitHub.
kind: HorizontalRunnerAutoscaler
spec:
  scaleUpTriggers:
  - githubEvent:
      workflowJob: {}
Relevant PR(s): #2001
Fix: Runner pods should now work more reliably with cluster-autoscaler
We've fixed many edge cases in the runner pod termination process that seem to have resulted in various issues: pods stuck in Terminating, workflow jobs stuck for 10 minutes or so when an external controller like cluster-autoscaler tried to terminate a runner pod that was still running a workflow job, workflow jobs failing because a job container step was unable to access the docker daemon, and so on.
Do note that you need to set an appropriate RUNNER_GRACEFUL_STOP_TIMEOUT on both the docker sidecar container and the runner container specs so that the runner waits long enough for your use-case.
RUNNER_GRACEFUL_STOP_TIMEOUT is basically the longest time the runner stop process waits for the runner agent to stop gracefully.
It's set to RUNNER_GRACEFUL_STOP_TIMEOUT=15 by default, which might be too short for some use-cases.
For example, if you're using AWS Spot Instances to power the nodes for your runner pods, they give you 2 minutes at the longest. You'd want to set the graceful stop timeout slightly shorter than those 2 minutes, like 110 or 100 seconds, depending on how much CPU, memory, and storage your runner pod is given.
With rich CPU/memory/storage/network resources, the runner agent can stop gracefully well within 10 seconds, making 110 the right setting. With fewer resources, it could take more than 10 seconds; if you think it would take 20 seconds in your environment, 100 would be the right setting.
RUNNER_GRACEFUL_STOP_TIMEOUT is designed to let the runner stop process take as long as possible so that the workflow job isn't cancelled in the middle of processing, while still avoiding the workflow job getting stuck for 10 minutes because the node disappeared before the runner agent could cancel the job.
Under the hood, RUNNER_GRACEFUL_STOP_TIMEOUT works by instructing the runner's signal handler to delay forwarding the SIGTERM sent by Kubernetes on pod termination down to the runner agent. The runner agent is supposed to cancel the workflow job only on SIGTERM, so making this delay longer lets you delay cancelling the workflow job, which gives the runner a longer grace period to stop. Practically, the runner pod stops gracefully only when the workflow job running within it completes before the graceful stop timeout elapses. The timeout can't be forever in practice, although that might theoretically be possible depending on your cluster environment. AWS Spot Instances, again for example, give you 2 minutes to gracefully stop the whole node, and therefore RUNNER_GRACEFUL_STOP_TIMEOUT can't be longer than that.
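As an illustration, here is a minimal sketch of a RunnerDeployment that sets the timeout on both containers for the Spot Instance scenario above. The deployment name and repository are hypothetical placeholders, the 100-second value is only an example, and it assumes your ARC version supports the dockerEnv field for passing environment variables to the docker sidecar:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy          # hypothetical name
spec:
  template:
    spec:
      repository: example/repo        # hypothetical repository
      env:
        # Let the runner agent wait up to 100 seconds for the running job
        # to complete before SIGTERM is forwarded and the job is cancelled.
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "100"
      # The docker sidecar should be allowed to wait just as long,
      # so it doesn't stop before the runner agent does.
      dockerEnv:
        - name: RUNNER_GRACEFUL_STOP_TIMEOUT
          value: "100"
```

You will likely also want the pod's terminationGracePeriodSeconds to be at least as long as this timeout, so Kubernetes doesn't force-kill the pod before the graceful stop completes.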
If you have success stories with the new RUNNER_GRACEFUL_STOP_TIMEOUT, please don't hesitate to create a Show and Tell discussion in our GitHub Discussions to share what configuration worked in which environment, including the name of your cloud provider, the name of the managed Kubernetes service, and the graceful stop timeouts for the nodes (defined and provided by the provider or the service) and the runner pods (RUNNER_GRACEFUL_STOP_TIMEOUT).
Relevant PR(s): #1759, #1851, #1855
ENHANCEMENT: More reliable and customizable "wait-for-docker" feature
You can now add a WAIT_FOR_DOCKER_SECONDS envvar to the runner container of the runner pod spec to customize how long you want the runner startup script to wait for the docker daemon to get up and running. Previously this was hard-coded to 120 seconds, which wasn't sufficient in some environments.
Along with the enhancement, we also fixed a bug in the runner startup script where it didn't exit immediately on the docker startup timeout. The bug resulted in job container steps failing due to a missing docker socket. Ideally, the whole runner pod should have kept auto-restarting until you got a fully working runner pod with a working runner agent plus a docker daemon that started within the timeout, so you should never have seen a job step failing due to a docker issue. We fixed it so that it now works as intended.
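For example, a minimal sketch of raising the wait time on a RunnerDeployment might look like the following; the deployment name, repository, and the 300-second value are hypothetical placeholders:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: example-runnerdeploy        # hypothetical name
spec:
  template:
    spec:
      repository: example/repo      # hypothetical repository
      env:
        # Wait up to 300 seconds (instead of the previous hard-coded 120)
        # for the docker daemon to become available on runner startup.
        - name: WAIT_FOR_DOCKER_SECONDS
          value: "300"
```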
Relevant PR(s): #1999
ENHANCEMENT: New webhook and metrics server for monitoring workflow jobs
**This feature is 99% authored and contributed by @ColinHeathman. Big kudos to Colin for his awesome work!**
You can now use the new actions-metrics-server to expose an additional GitHub webhook endpoint that receives workflow_job events and calculates and collects various metrics related to the jobs. Please see the updated chart documentation for how to enable it.
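As a rough idea of what that might look like, here is a hedged sketch of Helm values; the actionsMetricsServer block and its keys are assumptions rather than confirmed chart values, so treat the chart documentation as the source of truth:

```yaml
# values.yaml -- a sketch only; the key names below are assumed,
# check the chart documentation for the exact values your chart version expects
actionsMetricsServer:
  enabled: true
  secret:
    create: true
    github_webhook_secret_token: "your-webhook-secret"   # hypothetical key and value
```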
We made it a separate component instead of adding the new metrics collector to our existing github-webhook-server to retain the ability to scale the github-webhook-server to two or more replicas for availability and scalability.
Also note that actions-metrics-server cannot be scaled to 2 or more replicas today.
That's because it needs to store its state somewhere to retain a workflow_job webhook event until it receives the corresponding follow-up event to finally calculate the metric value, and the only supported state store as of today is in-memory.
For example, it needs to save a workflow_job event with status=queued until it receives the corresponding workflow_job event with status=in_progress to finally calculate the queue duration metric value.
We may add another state store backed by e.g. Memcached or Redis if there's enough demand, but we opted not to complicate ARC for now. You can follow the relevant discussion in this thread.
Relevant PR(s): #1814, #2057
New runner images based on Ubuntu 22.04
We started publishing new runner images based on Ubuntu 22.04 with the following tags:
summerwind/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04
summerwind/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04-$COMMIT_ID
summerwind/actions-runner-dind-rootless:ubuntu-22.04-latest
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04-$COMMIT_ID
ghcr.io/actions-runner-controller/actions-runner-controller/actions-runner-dind-rootless:ubuntu-22.04-latest
The latest tags for the runner images will stick with Ubuntu 20.04 for a while. We'll try to open an issue or a discussion as a notice before switching the latest tags to 22.04. See this thread for more context.
Note that we took this chance to slim down the runner images for more security, maintainability, and extensibility. That said, some packages that are present by default in hosted runners but can easily be installed using setup-* actions (like Python via the setup-python action), as well as other convenient but not strictly necessary packages like ftp, telnet, upx, and so on, are no longer installed in our 22.04-based runner images. Consult the Dockerfile parts below and either add some setup-* actions to your workflows or build your own custom runner image(s) based on our new 22.04 images, in case you relied on packages present in our 20.04 images but not in our 22.04 images:
These images are not strictly tied to the v0.27.0 release. You can freely try the new images with ARC v0.26.0, or use both 20.04 and 22.04 based images with ARC v0.27.0.
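If you'd like to try one of them, a minimal sketch of pointing a RunnerDeployment at a 22.04 tag listed above could look like this; the deployment name and repository are hypothetical placeholders, and the dockerdWithinRunnerContainer setting is our assumption for the dind-rootless variant:

```yaml
apiVersion: actions.summerwind.dev/v1alpha1
kind: RunnerDeployment
metadata:
  name: ubuntu-2204-runnerdeploy       # hypothetical name
spec:
  replicas: 1
  template:
    spec:
      repository: example/repo         # hypothetical repository
      # One of the Ubuntu 22.04 tags listed above
      image: summerwind/actions-runner-dind-rootless:v2.299.1-ubuntu-22.04
      # Assumed to be required for the dind-rootless image, which runs
      # dockerd inside the runner container rather than in a sidecar
      dockerdWithinRunnerContainer: true
```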
Relevant PR(s): #1924, #2030, #2033, #2036, #2050, #2078, #2079, #2080, #2098