Rob Howie ff7b81dc8f fix: add health check to detect and recover stale EphemeralRunner registrations

## Summary

Adds a GitHub-side registration health check to the `EphemeralRunnerReconciler`. During reconciliation of a running EphemeralRunner that has a valid `RunnerId`, the controller now calls `GetRunner()` against the GitHub Actions API to verify the registration still exists. If the API returns 404 (registration gone), the runner is marked as failed so the `EphemeralRunnerSet` can provision a replacement.
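The decision rule described above can be sketched in isolation. This is a minimal illustration, not the controller's code: `apiError`, `runnerGetter`, `registrationHealthy`, and `stubClient` are hypothetical names, and the real `GetRunner()` signature and error type in the Actions client may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status returned by
// the GitHub Actions API. The real client's error type may differ.
type apiError struct {
	StatusCode int
}

func (e *apiError) Error() string {
	return fmt.Sprintf("actions API returned status %d", e.StatusCode)
}

// runnerGetter is a hypothetical one-method view of the Actions client.
type runnerGetter interface {
	GetRunner(runnerID int64) error
}

// registrationHealthy returns healthy=false only when the API confirms with a
// 404 that the registration is gone. Every other error is treated as transient:
// the runner stays healthy and the error is handed back for logging.
func registrationHealthy(c runnerGetter, runnerID int64) (healthy bool, transient error) {
	err := c.GetRunner(runnerID)
	if err == nil {
		return true, nil
	}
	var apiErr *apiError
	if errors.As(err, &apiErr) && apiErr.StatusCode == http.StatusNotFound {
		return false, nil
	}
	return true, err
}

// stubClient answers every GetRunner call with a fixed error.
type stubClient struct{ err error }

func (s stubClient) GetRunner(int64) error { return s.err }

func main() {
	cases := []struct {
		name string
		err  error
	}{
		{"ok", nil},
		{"gone", &apiError{StatusCode: http.StatusNotFound}},
		{"outage", &apiError{StatusCode: http.StatusInternalServerError}},
		{"timeout", errors.New("context deadline exceeded")},
	}
	for _, tc := range cases {
		healthy, transient := registrationHealthy(stubClient{err: tc.err}, 42)
		fmt.Printf("%-8s healthy=%t transient=%v\n", tc.name, healthy, transient)
	}
}
```

Returning the transient error alongside `healthy=true` lets the caller log it without acting on it, which is the asymmetry the summary relies on: only a confirmed 404 may fail a runner.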

This also fixes `deleteRunnerFromService` to tolerate 404 from `RemoveRunner`, since a runner whose registration was already invalidated will also 404 on removal. Previously this caused `markAsFailed` to error and trigger a requeue loop, preventing cleanup of stale runners.
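A sketch of the idempotent-removal idea, under the same caveats: `apiError`, `runnerRemover`, `removeRunnerIdempotent`, and `stubRemover` are illustrative names, not the identifiers used in `deleteRunnerFromService`, and the real `RemoveRunner` signature may differ.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// apiError is a hypothetical error type carrying the HTTP status returned by
// the GitHub Actions API.
type apiError struct{ StatusCode int }

func (e *apiError) Error() string {
	return fmt.Sprintf("actions API returned status %d", e.StatusCode)
}

// runnerRemover is a hypothetical one-method view of the Actions client.
type runnerRemover interface {
	RemoveRunner(runnerID int64) error
}

// removeRunnerIdempotent treats "already gone" as success. Any other failure
// is wrapped and returned so the caller can requeue.
func removeRunnerIdempotent(c runnerRemover, runnerID int64) error {
	err := c.RemoveRunner(runnerID)
	var apiErr *apiError
	switch {
	case err == nil:
		return nil
	case errors.As(err, &apiErr) && apiErr.StatusCode == http.StatusNotFound:
		// The registration no longer exists, so the desired end state already holds.
		return nil
	default:
		return fmt.Errorf("remove runner %d: %w", runnerID, err)
	}
}

// stubRemover answers every RemoveRunner call with a fixed error.
type stubRemover struct{ err error }

func (s stubRemover) RemoveRunner(int64) error { return s.err }

func main() {
	gone := stubRemover{err: &apiError{StatusCode: http.StatusNotFound}}
	fmt.Println("404 on removal:", removeRunnerIdempotent(gone, 42))

	down := stubRemover{err: &apiError{StatusCode: http.StatusBadGateway}}
	fmt.Println("502 on removal:", removeRunnerIdempotent(down, 42))
}
```

Because a 404 maps to `nil`, a caller that marks the runner as failed after removal would no longer see an error for an already-invalidated registration, so it has no reason to requeue.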

A redundant pod phase check was removed from the health check path. `checkRunnerRegistration` is only called from within the `cs.State.Terminated == nil` branch, where the container is already confirmed running, so the phase check was unnecessary. It could also cause false negatives, because the local `EphemeralRunner` object may not yet reflect the updated phase.
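The gating condition can be illustrated with local stand-in types. The structs below only mirror the relevant shape of a Kubernetes container status and are not the `corev1` types themselves; `shouldCheckRegistration` is a hypothetical helper, and the runner ID type is an assumption.

```go
package main

import "fmt"

// Local stand-ins that mirror only the fields needed for this sketch.
type terminatedState struct{ ExitCode int32 }

type containerState struct {
	Terminated *terminatedState // nil while the container has not terminated
}

type containerStatus struct {
	Name  string
	State containerState
}

// shouldCheckRegistration gates the health check on the container status and
// the runner ID alone. It deliberately does not consult a cached pod phase,
// which may lag behind the container status.
func shouldCheckRegistration(cs containerStatus, runnerID int) bool {
	return cs.State.Terminated == nil && runnerID != 0
}

func main() {
	running := containerStatus{Name: "runner"}
	exited := containerStatus{
		Name:  "runner",
		State: containerState{Terminated: &terminatedState{ExitCode: 1}},
	}

	fmt.Println(shouldCheckRegistration(running, 42)) // running and registered
	fmt.Println(shouldCheckRegistration(running, 0))  // running, not yet registered
	fmt.Println(shouldCheckRegistration(exited, 42))  // container already terminated
}
```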

## Changes

- **`ephemeralrunner_controller.go`**: New `checkRunnerRegistration()` method that calls `GetRunner()` and returns unhealthy only on confirmed 404. Transient API errors (500, timeouts, etc.) are logged and ignored to avoid false positives. Called during reconciliation when a running pod has a non-zero `RunnerId`.
- **`ephemeralrunner_controller.go`**: `deleteRunnerFromService()` now treats 404 from `RemoveRunner` as success, since the runner is already gone.
- **`ephemeralrunner_controller_test.go`**: Two new test cases — one verifying a 404 from `GetRunner` marks the runner as failed, another verifying a 500 does not. Both simulate a running pod container status before the health check triggers.
- **`fake/client.go`**: New `WithRemoveRunnerError` option for the fake client to support testing the 404 removal path.

## Issues Fixed

### Directly fixes

- **#4396** — *ARC runners did not recover automatically after GitHub Outage*: Runners that lost their GitHub-side registration during the [2026-03-05 GitHub Actions incident](https://www.githubstatus.com/incidents/g9j4tmfqdd09) remained stuck indefinitely because the `EphemeralRunnerReconciler` never verified registration validity. This change adds exactly that verification — on each reconciliation of a running runner, `GetRunner()` confirms the registration exists. If it returns 404, the runner is marked failed and replaced.

- **#4395** — *GHA Self-hosted runner pods are running but the runner status is showing offline*: Runner pods were running for 5-7+ hours showing "Listening for Jobs" but marked offline in GitHub. The pods' registrations were invalidated server-side (same incident), but the controller had no mechanism to detect this. The new health check detects the 404 and triggers cleanup, preventing these zombie runners from accumulating.

### Partially addresses

- **#4397** — *Stale TotalAssignedJobs causes permanent over-provisioning after platform incidents*: This issue has two components: (1) zombie runner pods occupying slots, and (2) the listener's `TotalAssignedJobs` remaining inflated. This PR fixes component (1) — by detecting and cleaning up runners with invalidated registrations, the controller stops accumulating zombie pods. However, the listener-side `TotalAssignedJobs` inflation (a separate code path in `listener.go` → `worker.go`) is not addressed by this change and still requires either a session reconnect mechanism or CR deletion to clear.

- **#4307** — *Ephemeral Runners seem to get stuck when job is canceled or interrupted*: Reports runners getting stuck with `Registration <uuid> was not found` errors in BrokerServer backoff loops after job cancellation. The health check would detect that the registration is gone (404 from `GetRunner`) and mark these runners as failed rather than leaving them in an infinite backoff loop.

- **#4155** — *EphemeralRunner and its pods left stuck Running after runner OOMKill*: Runners that are OOMKilled can end up in a state where the pod is technically running but the runner process is non-functional, and the registration may become stale. The health check provides a secondary detection mechanism — if the GitHub-side registration is invalidated for a non-functional runner, it will be caught and cleaned up.

- **#3821** — *AutoscalingRunnerSet gets stuck thinking runners exist when they do not*: Reports the AutoscalingRunnerSet believing runners exist when no pods are present, requiring CR deletion to recover. While the root cause may vary, stale registrations that the controller cannot clean up (due to `RemoveRunner` 404 errors causing requeue loops) are one contributing factor. The 404 tolerance in `deleteRunnerFromService` directly addresses this cleanup path.

## Test Plan

- [x] Unit test: `GetRunner` returning 404 → runner marked as failed
- [x] Unit test: `GetRunner` returning 500 → runner NOT marked as failed (transient error tolerance)
- [x] Unit test: `RemoveRunner` returning 404 → deletion succeeds (runner already gone)
- [ ] Integration: Deploy to a test cluster, manually delete a runner's registration via the GitHub API, verify the controller detects and replaces it
- [ ] Integration: Simulate incident conditions by blocking `broker.actions.githubusercontent.com` for running runners, then unblocking — verify stale runners are detected and replaced
2026-03-09 09:31:17 +00:00

| Name | Last commit | Date |
|---|---|---|
| .github | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| acceptance | Pin third party actions (#3981) | 2025-04-17 12:19:15 +02:00 |
| apis | Remove ephemeral runner when exit code != 0 and is patched with the job (#4239) | 2025-09-17 21:40:37 +02:00 |
| build | Extend the user agent and fix the build version for the listener app (#2892) | 2023-09-14 20:10:49 +02:00 |
| charts | Prepare 0.13.1 release (#4341) | 2025-12-23 14:57:05 +01:00 |
| cmd | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| config | Bump all dependencies (#4266) | 2025-10-14 13:24:25 +02:00 |
| contrib | Small readme updates for readability (#3860) | 2025-03-10 22:43:02 +01:00 |
| controllers | fix: add health check to detect and recover stale EphemeralRunner registrations | 2026-03-09 09:31:17 +00:00 |
| docs | Prepare 0.13.1 release (#4341) | 2025-12-23 14:57:05 +01:00 |
| github | fix: add health check to detect and recover stale EphemeralRunner registrations | 2026-03-09 09:31:17 +00:00 |
| hack | Add ephemeral runner finalizer during creation and check finalizer without requeue (#4320) | 2025-11-20 23:06:27 +01:00 |
| hash | Introduce new preview auto-scaling mode for ARC. (#2153) | 2023-01-17 12:06:20 -05:00 |
| logging | Introduce new preview auto-scaling mode for ARC. (#2153) | 2023-01-17 12:06:20 -05:00 |
| pkg | Small readme updates for readability (#3860) | 2025-03-10 22:43:02 +01:00 |
| runner | Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388) | 2026-03-03 01:02:40 +01:00 |
| simulator | Use head_branch metric (#2549) | 2023-05-28 16:36:55 +09:00 |
| test | Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388) | 2026-03-03 01:02:40 +01:00 |
| test_e2e_arc | Azure Key Vault integration to resolve secrets (#4090) | 2025-06-11 15:53:33 +02:00 |
| testing | Bump node actions (#3569) | 2024-06-21 12:11:29 +02:00 |
| vault | Azure Key Vault integration to resolve secrets (#4090) | 2025-06-11 15:53:33 +02:00 |
| .dockerignore | dockerfile,e2e: Use buildx and cache mounts for faster rebuilds in E2E | 2022-03-02 19:03:20 +09:00 |
| .gitattributes | Update CONTRIBUTING.md with new contribution guidelines and release process documentation (#2596) | 2023-05-17 07:42:35 -04:00 |
| .gitignore | Fix tests and generate mocks (#4384) | 2026-02-24 13:36:01 +01:00 |
| .golangci.yaml | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| .mockery.yaml | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| CODEOWNERS | Update CODEOWNERS to include new maintainer (#4253) | 2025-09-17 21:33:38 +02:00 |
| CODE_OF_CONDUCT.md | Add code of conduct | 2022-12-13 11:38:01 +00:00 |
| CONTRIBUTING.md | Fix tests and generate mocks (#4384) | 2026-02-24 13:36:01 +01:00 |
| Dockerfile | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| LICENSE | Add LICENSE | 2020-01-30 20:12:12 +09:00 |
| Makefile | Updates: runner to v2.332.0 container-hooks to v0.8.1 (#4388) | 2026-03-03 01:02:40 +01:00 |
| PROJECT | Introduce new preview auto-scaling mode for ARC. (#2153) | 2023-01-17 12:06:20 -05:00 |
| README.md | Small readme updates for readability (#3860) | 2025-03-10 22:43:02 +01:00 |
| SECURITY.md | Add security guidelines and policy | 2022-12-13 11:39:39 +00:00 |
| TROUBLESHOOTING.md | feat: allow for modifying `var-run` mount maximum size limit (#2624) | 2023-05-27 11:47:23 +09:00 |
| go.mod | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| go.sum | Switch client to scaleset library for the listener and update mocks (#4383) | 2026-02-24 14:17:31 +01:00 |
| main.go | feat(runner): add ubuntu 24.04 support (#3598) | 2025-07-01 18:34:52 +09:00 |

README.md

Actions Runner Controller (ARC)


About

Actions Runner Controller (ARC) is a Kubernetes operator that orchestrates and scales self-hosted runners for GitHub Actions.

With ARC, you can create runner scale sets that automatically scale based on the number of workflows running in your repository, organization, or enterprise. Because controlled runners can be ephemeral and based on containers, new runner instances can scale up or down rapidly and cleanly. For more information about autoscaling, see "Autoscaling with self-hosted runners."

You can set up ARC on Kubernetes using Helm, then create and run a workflow that uses runner scale sets. For more information about runner scale sets, see "Deploying runner scale sets with Actions Runner Controller."

People

Actions Runner Controller (ARC) is an open-source project currently developed and maintained in collaboration with the GitHub Actions team, external maintainers @mumoshu and @toast-gear, various contributors, and the awesome community.

If you think the project is awesome and is adding value to your business, please consider directly sponsoring community maintainers and individual contributors via GitHub Sponsors.

If you are already the employer of one of the contributors, sponsoring via GitHub Sponsors might not be an option. Just support them by other means!

See the sponsorship dashboard for former and current sponsors.

Getting Started

To give ARC a try with just a handful of commands, please refer to the Quickstart guide.

For an overview of ARC, please refer to About ARC.

With the introduction of autoscaling runner scale sets, the existing autoscaling modes are now legacy. The legacy modes have certain use cases and will continue to be maintained by the community only.

For further information on what is supported by GitHub and what's managed by the community, please refer to this announcement discussion.

Documentation

ARC documentation is available on docs.github.com.

Legacy documentation

The following documentation is for the legacy autoscaling modes that continue to be maintained by the community:

Contributing

We welcome contributions from the community. For more details on contributing to the project (including requirements), please refer to "Getting Started with Contributing."

Troubleshooting

We are very happy to help you with any issues you have. Please refer to the "Troubleshooting" section for common issues.